Logs Classification using ML (1/2)


Intro

In this blog so far, there has been rather little “Machine Learning”. That wasn’t really intentional, but incidentally, it does show something about the whole “Data Science Process”: much more time goes into Data Engineering (i.e. getting & cleaning data) and Exploratory Data Analysis (a.k.a. EDA, during which the objective is actually to LOOK at the data and understand it) than into the actual implementation and use of “Machine Learning” algorithms and modelling.

But that doesn’t mean “ML” isn’t often the end goal. So let’s dive into this subject today. Because this is a rather big topic, this will be a two-part thing.

For now, let’s discuss some relevant concepts, specifically about “Supervised Machine Learning”. Next week, we’ll try to show some of it in practice.

What makes ML different from traditional programming?

To simplify, we will be talking about “Supervised Machine Learning” today, with a “classification goal”.

In a traditional approach to programming (at least for data analysis), one receives data, creates a program to analyse it, and produces an output, say a classification for instance. Here, the human is in charge of defining how to go from the data samples they want to classify to the correct class. This is often referred to as an “Expert System” (which, by the way, is often mostly a set of if-then-else conditions…).

What happens then is that the human (programmer) must manually program the conditions used by the system. For each new classifier, a new program must be created, with a clear upfront understanding of all the parameters to be considered for a correct classification.

Machine Learning (“ML” for short) is essentially a different approach. In ML, the goal is to feed data to the computer and have it “program itself” (simplifying a bit, but that’s about the gist of it), so that the machine infers some conditions (maybe not the same ones a human programmer would have chosen) that it then uses to choose one category/class or the other. In a supervised setup, the machine is fed with as much representative, pre-classified data as possible. Then we ask the COMPUTER to try and learn from that pre-classified data and somehow infer the classification rules.

There are several ways to go about it: Partition Trees, Support Vector Machines & Neural Networks among them (Deep Learning often falling into the category of sub-types of Neural Networks, but I won’t go there today).

Let’s just say for now that you could use one algorithm, say a “partition tree”, to distinguish two classes of logs (e.g. logs from a Linux box and logs from a Windows server). But let me get ahead of myself: what is cool is that you could use the “same thing”, without programming much (if at all), to distinguish logs of a Web server from logs of a Database server. Or logs from a firewall from logs from a DLP product. Each one of these could be a potential classifier, and all you need is pre-tagged data (albeit usually lots of it).
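
To make that concrete, here is a minimal sketch (my own choice of tools and made-up data, assuming Python and scikit-learn): a decision (“partition”) tree trained on a handful of hypothetical, pre-tagged log lines.

```python
# A minimal sketch: a decision ("partition") tree learning to tell
# Linux logs from Windows logs, from a few made-up, pre-tagged lines.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

linux_logs = [
    "sshd[1023]: Accepted password for root from 10.0.0.5 port 51422",
    "kernel: [12.042] eth0: link up",
]
windows_logs = [
    "EventID=4624 An account was successfully logged on",
    "EventID=7036 The Windows Update service entered the running state",
]

texts = linux_logs + windows_logs
labels = [0] * len(linux_logs) + [1] * len(windows_logs)  # 0 = Linux, 1 = Windows

# Turn each line into word counts, then let the tree infer its own conditions.
clf = make_pipeline(CountVectorizer(), DecisionTreeClassifier())
clf.fit(texts, labels)

print(clf.predict(["sshd[2201]: Failed password for invalid user admin"]))  # -> [0], i.e. Linux
```

To build a different classifier (Web server vs. Database server, firewall vs. DLP…), you would keep exactly the same code and only swap in a different pre-tagged sample.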

OK, let’s step back a little.

So why is (supervised) Machine Learning so important?

Well, to SIMPLIFY (and let me stress that), the general idea is to remove the dependency on the human’s understanding of each dataset, and instead use a set of algorithms and lots of data to create a classifier.

Now that might seem a bit less than “revolutionary”. But stop for a minute and think about it:

  • In today’s society, do we have plenty of data? YES.
  • In many instances, such data is probably already classified and/or can somehow be classified by humans without the need for coding (a skill that is far less common than, say, being able to distinguish cats from dogs in a picture).
  • Computers today are capable of storing lots of data, at a reasonable price.
  • For the more complex algorithms, maybe a lot of computing power is needed (think “Deep Learning”), but with Cloud computing (for instance) we can probably get there at a reasonable cost. And there are specialised components (e.g. GPUs) that can now be used to work fast enough even from a (good) laptop, where the same approach would have taken maybe years to run only a few years back.

Those “dots” above weren’t there in the 1950s, while much of the math and many of the algorithms behind traditional Machine Learning come from that period. And that’s why ML today is practical, where it wasn’t a few years ago.

ML Process Overview

I’m going through the concepts very quickly, because I want to get to an actual example. But let’s review a simplified supervised ML classification process:

  • Get as much data as possible to feed your PC, with a tag distinguishing say between two “classes”.
  • The ML algorithm needs to be trained before it is able to correctly classify new data samples. Usually you’ll have access to (potentially) a lot of historical data, and your goal is for the PC to learn from it so that it will correctly classify NEW data.
  • Let’s say (for today) that you want to be able to distinguish between two classes (the typical example being “cat” or “dog”; usually you will be able to simplify and say “cat, or not cat”, which allows you to create a binary classifier that outputs 0 (not cat) or 1 (cat), for example).
  • Because you don’t yet have the “future” data, but you want to get an overall idea of the “fitness” of the classification capacity of your trained algorithm (your “model”), what you’ll do is separate the historical dataset into two subsets: one for training, and one for testing the trained model. (There is much more to discuss here, like k-fold cross-validation and the like, but that’s beyond this general process overview.) Let’s say you train your algorithm on a random sample of 70% of your historical data, and you keep the rest to test it (a minimal sketch of this split-and-evaluate flow follows this list).
  • You’ll use that to “train” an ML algorithm (or more than one, but that’s a different story altogether, more related to accuracy and speed), thereby creating your classifier.
  • Then you’ll test the above classifier against the test data subset, and check whether or not it correctly classifies “new” data, which it hasn’t seen before. This step is critical, as it will tell you whether or not the chosen algorithm, with the chosen parameters, has done a good job learning “features” associated with your data, while being able to “generalise” its classification power to new, unseen-before, data. This will give you a sense of how the model will perform when it is exposed to new data, telling you how many false positives, false negatives and correctly classified data points you are to expect. (I wish I had had the following picture earlier to explain it to me)

 

[Image] As seen on Twitter (reference by: @mszll #mw)
  • Note: Depending on your objectives, you might accept more or fewer false positives, or false negatives. This has more to do with understanding the context of the “business”, another key component of Data Science.
  • At this point, if all went according to plan, your computer, WITHOUT YOU HAVING TO CODE any “RULE” per se, should be able to correctly classify new data.
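
Putting the steps above together, here is a minimal sketch of the split / train / test flow, assuming Python and scikit-learn and a tiny, made-up pre-tagged dataset (a real run would of course use far more data):

```python
# A minimal sketch of the split / train / test flow, on made-up data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X = [  # hypothetical, simplified log lines
    "GET /index.html HTTP/1.1 200", "POST /login HTTP/1.1 302",
    "GET /about HTTP/1.1 200", "GET /api/v1/items HTTP/1.1 500",
    "EventID=4624 An account was successfully logged on",
    "EventID=4672 Special privileges assigned to new logon",
    "EventID=7036 The service entered the running state",
    "EventID=1102 The audit log was cleared",
]
y = [0, 0, 0, 0, 1, 1, 1, 1]  # 0 = web server log, 1 = Windows log

# Train on a random 70% of the historical data, keep 30% to test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=42)

model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(X_train, y_train)              # "training" the classifier

y_pred = model.predict(X_test)           # classify data the model has never seen
print(confusion_matrix(y_test, y_pred))  # correct vs. false positive/negative counts
print(model.score(X_test, y_test))       # overall accuracy on the held-out 30%
```

The confusion matrix is what gives you the counts of correctly classified samples, false positives and false negatives mentioned above.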

Again, why is the above process so cool?

Well, you don’t need to tell the computer whether or not to look for whiskers, for instance. More than that: how would you TELL the computer how a cat is different from a dog in a photo? Think about it, you’ll see: it’s not that easy to do with a programming language.

A note about Training, Testing and Generalisation

OK, so we have mentioned it before: you should test your model before it is fed with new data. Ideally, you want to do that with data that is DIFFERENT from the training dataset.

Why? To go back to the cat-versus-dog example: let’s suppose that you gather photos of cats from cat owners, and of dogs from dog owners.

We can suppose that generally speaking, cat owners will have their pet indoors. This is probably not as generally true for dog owners (this is just a theoretical example, bear with me here).

So it is possible that dog photographs will more often show grass or cars in the background.

Without telling the computer about this “rule”, it could well learn the “feature” that a green background is probably associated with a dog, and therefore assume, for future data samples, that if there is grass, the photograph is of a dog.

Now, you and I understand that grass doesn’t define “dog” at all, but that could be a feature our machine learns. (Sorry if this is a silly example, but I consider it a valid theory.) A photo of a cat in a garden would then be classified as showing a dog, which obviously is not what we want.

BIAS IS AN IMPORTANT concept, although we won’t get there here. But if there IS bias IN THE TRAINING dataset, chances are your classifier will be biased as well. This has to do with balanced samples, and it is why I insist on using the words “representative dataset” instead of just “big dataset”. Just know that this has had actual legal implications and affected real lives in the past, so this concept is not to be taken lightly.

That’s why you need to:

  • Have a big enough AND representative-enough sample of pre-classified data.
  • Separate the test set from the training set, randomly, so that you validate that the inferred classifier can actually distinguish a cat from a dog, rather than learning irrelevant features about the photographs that (somehow) made their way into the training dataset (see the sketch right after this list).
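
As a small sketch of that second point (again assuming Python and scikit-learn, on made-up data), a shuffled, stratified split is one simple way to keep both the training and the test subsets representative of the original class proportions:

```python
# A minimal sketch: shuffle and stratify the split so both classes keep
# the same proportions in the train and test sets (made-up, imbalanced data).
from collections import Counter
from sklearn.model_selection import train_test_split

X = [f"sample log line {i}" for i in range(100)]  # hypothetical samples
y = [0] * 80 + [1] * 20                           # imbalanced 0/1 tags

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, shuffle=True, stratify=y, random_state=0)

print(Counter(y_train))  # Counter({0: 56, 1: 14}) -> same 80/20 proportion
print(Counter(y_test))   # Counter({0: 24, 1: 6})
```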

Generalisation is an important concept derived from the above.

Presenting our example

OK so I don’t really care much about my PC being able to distinguish between cats’ and dogs’ photos.

But wouldn’t it be nice if I were able to feed logs to a theoretical SIEM and have it select the correct parser for a given log source (e.g. a server), without having to teach it how to do so manually?

In other words, if I receive logs from a proxy, and logs from a web server, and I have to choose between two parsers, can I have my machine recognise which parser it should use for each, without having to expressly tell the SIEM which is which?

You might say that the SIEM could test both parsers and see which one matches, for each incoming log. That is correct.

Unfortunately, you might soon find that, for many UNIX server logs for example, there will be some overlap. A RHEL log might look very similar to an AIX log. Depending on the quality of your parsers, both might match. This happens.

Indeed, your SIEM could be “intelligent enough” to keep track of a few past logs and see how often an AIX parser matched versus a RHEL parser. If both work for many logs, but only one works for all the logs of one source, then you can probably safely tag that particular source as RHEL (for example). The “machine learning” approach we’ll see next week is actually very similar to that approach, in terms of the results it should provide.
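
To make that heuristic concrete, here is a minimal sketch in which the “parsers” are reduced to two hypothetical regexes (real SIEM parsers are of course far richer):

```python
# A minimal sketch of the "track which parser matches" heuristic,
# with hypothetical parsers reduced to simple regexes.
import re

parsers = {
    "rhel": re.compile(r"^\w{3} +\d+ [\d:]{8} \S+ \w+(\[\d+\])?: "),
    "aix":  re.compile(r"^\w{3} +\d+ [\d:]{8} \S+ \w+\[\d+\]: "),
}

recent_lines = [  # made-up syslog lines from one source
    "Oct  3 10:21:01 host01 sshd[812]: Accepted password for root",
    "Oct  3 10:21:05 host01 systemd: Started Session 12 of user root.",
]

# Count, per parser, how many of the recent lines it matches.
scores = {name: sum(bool(rx.match(line)) for line in recent_lines)
          for name, rx in parsers.items()}
print(scores)  # {'rhel': 2, 'aix': 1}

# Tag the source with the parser that matches every line (if exactly one does).
full_matches = [name for name, score in scores.items() if score == len(recent_lines)]
print(full_matches)  # ['rhel']
```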

What’s different is that we can do it without having parsers at all. That is, we could take samples of logs (e.g. extracted from a SyslogNG source, which doesn’t provide parsers, supposing we don’t have access to an actual SIEM) and use our ML-based classifier to tell us whether that set of logs comes from an Apache server or a BlueCoat proxy. Because we don’t have parsers, we cannot test upfront to see which one matches more often. Also, there could be a product that doesn’t yet have a parser in our SIEM, in which case any match would be misleading. However, we could use our model to classify servers into two categories of “web-related log generators”, without knowing up front, or telling the computer, how to recognise each of the two.
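
As a hedged sketch of the kind of classifier described (this is NOT the actual exercise of part 2; the simplified log lines and the algorithm choice below are purely illustrative):

```python
# A minimal sketch: classify made-up Apache-style vs. BlueCoat-style log lines.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

apache_logs = [
    '10.0.0.5 - - [03/Oct/2020:10:21:01 +0000] "GET /index.html HTTP/1.1" 200 5120',
    '10.0.0.9 - - [03/Oct/2020:10:21:04 +0000] "POST /login HTTP/1.1" 302 310',
]
proxy_logs = [
    '2020-10-03 10:21:01 120 10.0.0.5 200 TCP_HIT GET http www.example.com 80 /index.html',
    '2020-10-03 10:21:04 340 10.0.0.9 403 TCP_DENIED GET http ads.example.net 80 /banner.js',
]

texts = apache_logs + proxy_logs
labels = ["apache"] * len(apache_logs) + ["bluecoat"] * len(proxy_logs)

clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(texts, labels)

new_line = '10.0.0.7 - - [03/Oct/2020:11:00:00 +0000] "GET /about HTTP/1.1" 200 900'
print(clf.predict([new_line]))  # hopefully ['apache']
```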

Generalising (and this will close the loop a bit on our theoretical review), one could train SEVERAL ML models, one for each binary classification:

  • Is it a Web Server log or something else?
  • Is it a Windows log or something else?
  • Is it a Database log or something else?

This is called a “one vs rest” approach. Running samples against the different models, each trained to recognise only one category, we would:

  • Receive a sample of logs
  • Feed that sample to the different classifiers
  • Use the category that is “NOT REST”: hopefully, only one of our classifiers will have decided that the sample belongs to its “class 1” (e.g. outputting “1” while all the others output 0). A minimal sketch follows.
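
A minimal sketch of that “one vs rest” loop, with hypothetical class names and tiny made-up samples (again Python and scikit-learn), could look like this:

```python
# A minimal "one vs rest" sketch: one binary classifier per class,
# trained on made-up, pre-tagged samples.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

samples = {  # hypothetical pre-tagged samples, one small list per class
    "web": ["GET /index.html HTTP/1.1 200", "POST /login HTTP/1.1 302"],
    "windows": ["EventID=4624 An account was successfully logged on",
                "EventID=4672 Special privileges assigned to new logon"],
    "database": ["ORA-00942: table or view does not exist",
                 "LOG: duration: 23.5 ms statement: SELECT 1"],
}

# Train one binary ("this class vs. rest") classifier per class.
classifiers = {}
for cls in samples:
    texts, labels = [], []
    for other, lines in samples.items():
        texts += lines
        labels += [1 if other == cls else 0] * len(lines)
    clf = make_pipeline(CountVectorizer(), LogisticRegression())
    clf.fit(texts, labels)
    classifiers[cls] = clf

# Classify a new, unseen log line: keep the classifiers that answer "1".
new_line = "GET /images/logo.png HTTP/1.1 404"
votes = [cls for cls, clf in classifiers.items() if clf.predict([new_line])[0] == 1]
print(votes)  # hopefully exactly one class, e.g. ['web']
```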

Let me insist one last time: This is cool, because I don’t need to program regexes for each log type I want to be able to recognise. I only need a big enough and representative enough sample of each type, along with the “tag” I want to give it.

Why “Proxy” and “Web server” logs as an example? Just because: the exercise I’ll present next week is one I actually created back in May 2019, to demo ML classifiers and explain supervised ML concepts just like I will be doing here.

But that’s enough for today.

Conclusions

Next week, we’ll see all the above in practice, along with some more concepts (“feature reduction”, some common “NLP” algorithms, and some visualisations that might help).

I hope today’s long theoretical discussion will make it easier to follow along next week with actual code and data.

References

Photo of the dogs: https://twitter.com/towards_AI/status/1332567246011555840
