No full code exercise today, only a few quick sketches along the way. I am busy with other personal things, and this blog is only my “weekend hobby” after all. But I make a point of trying to write something every week, and if you’ll allow me, this week will not be about “R Programming” as such, but about concepts, about applications of anomaly detection, and some personal opinions about the “Data Science” process at large.
One approach can apply to many datasets
Last week (well, on Wednesday), I published the second part about anomaly detection applied to detecting one potential behaviour of ransomware. It was very simple (simplistic, really, leaving aside things like which processes were accessing which files, and many other potentially interesting data points). But that was only one exercise, applied to one specific case.
Now what if we could apply that “approach” to other detections?
Anomaly detection is one relevant application of statistics and Machine Learning to the IT Security (or Cybersecurity) field. It is relevant in part because it does not depend too much on knowing up front what one is looking for, beyond abnormality. That is, it is not a supervised approach. Supervised machine learning has a caveat for security data: you need lots of pre-classified data to train your algorithms, and that is not necessarily easy to get when you work in our field.
What I explained then could be applied to many other things. (And yes, there are other algorithms for anomaly detection out there, like Isolation Forests, for instance; another classic example would be K-Means. Anyhow.)
So what about it?
Well, the exact same approach we used last week could also be applied to any number of datasets:
- A typical example is detecting anomalies in CPU and RAM (and maybe I/O, etc.) usage on a machine (although this doesn’t necessarily apply only to Security per se, it could definitely be useful for IT monitoring at large)
- Error response codes from your web servers
- Number of logs collected per machine (I have used the tsoutliers package to do just that in the past, detecting unusually low or high log volumes; a quick sketch follows this list)
- Number of logins per user-hour (or per minute, or whatever), or failed logins (useful for brute-force attack detection), applied to your AD data or extranet
- Number of drops on your firewalls (maybe per destination, or destination ports)
- Number of accesses to shared folders or databases (with possible applications to DLP efforts)
- …
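To make the log-volume example above a bit more concrete (without redoing last week’s exercise), here is a minimal sketch using the tsoutliers package on simulated hourly log counts for one machine; the data, its daily cycle and the injected spikes are all made up for illustration:

```r
library(tsoutliers)

set.seed(42)
# Simulate two weeks of hourly log counts: a daily cycle plus noise, with two
# injected spikes standing in for collection problems or unusual activity
hours  <- 1:(24 * 14)
counts <- round(500 + 200 * sin(2 * pi * hours / 24) + rnorm(length(hours), sd = 30))
counts[c(100, 250)] <- counts[c(100, 250)] + 1500

y   <- ts(counts, frequency = 24)   # hourly series, daily seasonality
fit <- tso(y)                       # flags additive outliers, level shifts, etc.
fit$outliers                        # which observations were flagged, and their type
plot(fit)                           # quick visual check of the detected anomalies
```

The same skeleton (aggregate into a regular time series, then flag the points the model cannot explain) carries over to most of the bullet points above.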
But you could go even further, with more work on the dataset, like recognizing “normal” query string lengths from your web server logs (you’d do it per site/virtual host/URI), and any number of other applications.
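For that one, even a per-URI quantile rule gives a first cut at “abnormally long” query strings. The sketch below uses toy data (the uri and query columns would come from your parsed web logs), and the quantile-plus-IQR threshold is just one possible definition of “too long”:

```r
library(dplyr)

set.seed(1)
rand_q <- function(n) paste(sample(letters, n, replace = TRUE), collapse = "")

# Toy stand-in for parsed web server logs: one row per request
web_logs <- data.frame(
  uri   = rep(c("/search", "/login"), each = 50),
  query = c(replicate(50, rand_q(sample(5:20, 1))),
            replicate(49, rand_q(sample(3:10, 1))),
            strrep("A", 400)),                 # one suspiciously long query string
  stringsAsFactors = FALSE
)

flagged <- web_logs %>%
  mutate(qs_len = nchar(query)) %>%
  group_by(uri) %>%                            # "normal" is defined per URI
  filter(qs_len > quantile(qs_len, 0.75) + 3 * IQR(qs_len)) %>%
  ungroup()

flagged   # should surface the oversized /login request
```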
Other applications of anomaly detection
This could feed a few future posts, but more generally, anomaly detection has value for analysing security data.
If you’re able to group your users into sub-groups (e.g. by OU: marketing, IT, etc.), and you observe their browsing habits (which websites they connect to, and at what hours), you could in theory “profile” each group and detect “new” websites that are unusual destinations for a given group.
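A crude version of that idea does not even need a model; comparing recent destinations against a per-group baseline period is already a start. Everything below (the proxy_logs toy data, the column names, the cut-off date) is invented for the sketch:

```r
library(dplyr)

# Toy stand-in for proxy/browsing logs: group, destination domain, date
proxy_logs <- data.frame(
  group  = c("marketing", "marketing", "it", "it", "it", "marketing"),
  domain = c("cdn.ads.example", "social.example", "repo.example",
             "repo.example", "paste-site.example", "social.example"),
  day    = as.Date(c("2020-05-10", "2020-05-12", "2020-05-11",
                     "2020-06-02", "2020-06-03", "2020-06-04")),
  stringsAsFactors = FALSE
)

baseline <- filter(proxy_logs, day <  as.Date("2020-06-01"))
recent   <- filter(proxy_logs, day >= as.Date("2020-06-01"))

# Destinations seen recently that the group never visited during the baseline
new_destinations <- anti_join(recent,
                              distinct(baseline, group, domain),
                              by = c("group", "domain"))
new_destinations   # here: the IT group visiting paste-site.example
```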
You could profile normal connections for a given machine (and here you could come back to your Netflow data, but I said I would leave that aside for a while, so I’ll get into it later on), and learn “normal” connections per hour of the day, on certain ports, with certain volumes of connections or bytes exchanged. I have seen such an approach from one vendor (I can’t seem to remember which, sorry). You could then “draw” a matrix of normal activity over a week, for each machine and a subset of relevant ports, built from a vector of data points per destination port and hour (using averages): number of source IPs, volume of bytes, number of connections, etc. There would definitely be some feature engineering involved (you could add means, standard deviations, medians, etc. to your dataset). And there is an obvious visualization for such data (e.g. a heatmap of ports against hours of the week, with different colours for higher or lower numbers).
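Just to show the shape such a matrix could take, here is a rough sketch of that port/hour heatmap for one machine, with simulated connection counts (the ports, the numbers and the single injected oddity are all made up):

```r
library(ggplot2)

set.seed(1)
# Simulated counts for one machine: 5 destination ports x 168 hours of the week
activity <- expand.grid(port = c(22, 53, 80, 443, 445), hour_of_week = 0:167)
activity$connections <- rpois(nrow(activity), lambda = 20)
activity$connections[activity$port == 445 & activity$hour_of_week == 75] <- 400  # injected oddity

ggplot(activity, aes(x = hour_of_week, y = factor(port), fill = connections)) +
  geom_tile() +
  labs(x = "hour of the week", y = "destination port",
       title = "Connections per port and hour (one machine, simulated)")
```

In a real setting you would build one such matrix per machine from averages over several weeks, and add the extra features (bytes, source IPs, standard deviations, etc.) as further columns or layers.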
You could “profile” the usual distribution of words (a rather sparse matrix, probably) from the parameters of the queries sent to your web applications (that should work per site/URI) or your databases (not so much). Then you could use PCA to reduce that matrix to the handful of components that matter most.
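On a toy scale, that is just a document-term matrix and prcomp(); the four example queries below are invented, and a real matrix built from query parameters would be far larger and sparser:

```r
# Four toy "queries"; the last one is the odd one out
queries <- c("select name from users",
             "select email from users",
             "select name from orders",
             "select * from users; drop table users")

words <- strsplit(queries, "\\s+")
vocab <- sort(unique(unlist(words)))

# Term-frequency matrix: one row per query, one column per word in the vocabulary
dtm <- t(sapply(words, function(w) table(factor(w, levels = vocab))))

pca <- prcomp(dtm)       # principal components of the word counts
round(pca$x[, 1:2], 2)   # the unusual query should stand apart on the first components
```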
You could, depending on the environment, recognize usernames that do not follow a company-wide standard pattern (you’d need to factor in things like service accounts vs. personal accounts, privileged accounts, etc.), thereby not needing an actual list of users, while still (potentially) detecting logins by unusual usernames (say, from an attacker who hasn’t done her homework).
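In the simplest case that is little more than a couple of regular expressions; the naming convention and the example usernames below are entirely hypothetical:

```r
usernames <- c("jdoe01", "asmith02", "svc_backup", "mwilliams03", "hax0r")

standard_user <- grepl("^[a-z]+[0-9]{2}$", usernames)   # assumed convention: letters + two digits
service_acct  <- grepl("^svc_[a-z]+$", usernames)       # assumed service-account prefix

usernames[!standard_user & !service_acct]               # "hax0r" would be flagged for review
```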
The list would go on.
And clearly, NTA, UEBA and the like sort of do just that, really.
A note about Deep Learning (personal opinion)
By the way: Yes, sure, you could go “Deep Learning” about it all. I’m not (as of today at least) a big fan of deep learning as a one-size-fits-all approach to the things I discuss here. Mainly for two or three reasons:
- This easily becomes a black-box approach, and debugging that is hard. So you end up with something that, most probably (if done well), works, but you don’t know why.
- Processing power: training a deep neural network of any sort usually implies TPUs, GPUs and the like, and up to weeks of training. One wrong step, and the training (possibly) needs to start all over.
- Also, by the way, I’m talking about training, thereby assuming we have plenty of data, ideally pre-classified data. Beyond the fact, as already noted, that this is not necessarily the case, I prefer “small data” over “big data”. Small datasets (thousands of rows, tens of variables) are easier to understand, which comes back to avoiding the black-box approach mentioned above.
I’m not saying I would never use DL (I’d definitely need to brush up on the subject though, as I have a rather basic understanding of these algorithms). I am saying, however, that I prefer the more traditional approaches, the ones I understand (and there again, there is still plenty to be learnt: the more I learn, the more I understand how little I know).
But let’s move on.
Data Engineering Matters
For any of the above, the detection itself is nothing “magical”: it’s often rather easy, and several approaches exist.
But then gathering the data becomes the important part. If you have a SIEM that offers an API, for example (and the logs relevant to your objective are being collected), the data should be accessible.
Collecting the necessary data is a matter in and of itself. Format issues (raw logs, parsed logs, simple text, CSV, JSON, etc.) quickly become cumbersome, and you need to be prepared to deal with them.
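A trivial illustration of that formatting overhead: even pulling the same events out of a CSV export and a JSON-lines export takes a bit of glue before you can work with them as one dataset (the file contents and field names below are invented):

```r
library(jsonlite)

# Pretend exports, written locally just for the example
json_path <- tempfile(fileext = ".jsonl")
csv_path  <- tempfile(fileext = ".csv")
writeLines('{"ts":"2020-06-01T10:00:00Z","host":"srv01","count":120}', json_path)
write.csv(data.frame(ts = "2020-06-01T11:00:00Z", host = "srv02", count = 95),
          csv_path, row.names = FALSE)

from_json <- stream_in(file(json_path), verbose = FALSE)   # one JSON object per line
from_csv  <- read.csv(csv_path, stringsAsFactors = FALSE)

events <- rbind(from_json[, c("ts", "host", "count")],
                from_csv[,  c("ts", "host", "count")])
events$ts <- as.POSIXct(events$ts, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
events
```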
Connecting to different data sources (e.g. not all of them will be in your SIEM, IF you even have a SIEM to begin with) also easily “adds to the pile” (firewall rules, authentication, etc.).
Stabilizing all that for a productive SOC operation IS hard work, and should not be dismissed. I’d say it’s actually easily the hardest part of the work.
It’s not only about gathering, though. Once you have the data, you’ll have to format it in a suitable way for your software/scripts to work with it.
You’ll also need to keep it clean (deleting old data regularly, etc.), which you’ll want to automate too. There again, more work.
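Even that housekeeping can be a few lines of R on a schedule, although it is one more thing to maintain. A hypothetical example (the staging path and the 90-day retention are made up):

```r
# Drop exported files older than 90 days from a local staging directory
staging_dir <- "data/staging"
files       <- list.files(staging_dir, full.names = TRUE)
old_files   <- files[file.mtime(files) < Sys.time() - 90 * 24 * 3600]
if (length(old_files) > 0) file.remove(old_files)
```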
Data Exploration
And then, you need to UNDERSTAND the data you will be using. Exploratory Data Analysis (e.g. graphing your data in different ways, summarising it, understanding the features, their units, etc.) is, as I have learnt over the past few years, an integral part of the data science process.
Anomaly Detection is but one application of data science here, so it should be no exception.
Using simple statistics or more advanced machine learning algorithms (often available through pre-existing R Packages) is just fine, but it’s a rather small part of the process.
I’d definitely agree with the idea of understanding your data first.
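In R, that first pass does not need to be fancy; a summary and a couple of plots already tell you a lot. A small sketch on simulated failed-login counts (swap in whatever data you actually collected):

```r
set.seed(7)
# One week of hourly failed-login counts, with a change of regime halfway through
failed_logins <- data.frame(
  hour  = 0:167,
  count = rpois(168, lambda = rep(c(2, 15), times = c(84, 84)))
)

summary(failed_logins$count)                     # orders of magnitude, quartiles
hist(failed_logins$count, main = "Failed logins per hour", xlab = "count")
plot(failed_logins$hour, failed_logins$count, type = "l",
     xlab = "hour of the week", ylab = "failed logins")   # any structure over time?
```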
Also, during exploration, maybe you conclude that all you need is a report, a visualization of some sort, or a summary table. That is, maybe you start a process with a “Machine Learning” idea in mind, but finish it with a simple “dashboard”. Maybe you conclude that actual “anomaly detection” is not needed at all, or would be useless for whatever reason. You’d then have saved yourself the time of trying to fit a model to your data. (Worst case scenario: maybe you conclude the exercise was useless altogether. I’d argue you’d still have learnt something about some data… But sometimes the effort ends with no dashboard, no algorithm, no useful conclusion… That’s just part of what “Exploratory” means ;))
Conclusions
The above (data engineering and data exploration) could easily make up 70-90% of your efforts. In data science, DO NOT underestimate the effort needed to gather, prepare (“tidy”) and understand your data. Don’t jump straight into fitting a “model” or applying an ML algorithm (or many ;)).
Once you get there though, Anomaly Detection is an important application of data science for Cybersecurity data.
I hope these “general thoughts” make sense.