Last post of 2020 (most probably).
Not specific to R per se, true.
I personally believe that any IT security analyst should be able to work with regular expressions. Not only for logs, mind you, but when it comes to logs, I think it is a basic skill.
Using regular expressions to filter logs
OK so we have gathered logs from a dnsmasq program on our Home Lab server. Today we’ll focus on extracting some (just a demo) relevant info from them. A DNS log will probably register, among other things, Fully Qualified Domain Name (a.k.a. FQDN) resolution queries. Such queries will include:
- An FQDN for which one wants to know an IP address
- A requesting IP address (I guess other DNS servers can include the client’s names)
- A time stamp at which the request is received
That’s the gist of it for a subset of the logs a DNS server can produce. In our case, the dnsmasq configuration option log-queries sends its logs to the /var/log/syslog file by default.
That file is filled with data from many other programs on a Linux box by default, so first we’ll want to keep only the lines that contain dnsmasq resolution queries:
library(lubridate) # provides year()

dns_logs_dir <- "/mnt/R/Demos/server_data/syslog_data/"
dns_logs <- lapply(list.files(dns_logs_dir), function(x) {
  tempfile <- paste0(dns_logs_dir, x)
  # First off, I have an issue with the logs: in the format I have them, the year is not recorded.
  # As we are already in December, this will soon be a problem...
  # So I'll add the year of the file (taken from its modification time) to each log entry like so:
  paste(year(file.info(tempfile)$mtime), readLines(tempfile, warn = FALSE))
})
# We have a list of entries per file read:
dns_logs <- unlist(dns_logs)
# Now from all the logs, we keep only those that include a query:
dns_logs <- dns_logs[grep("dnsmasq.*query\\[", dns_logs)]
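If you want to try the filtering step without a syslog file at hand, here is a minimal sketch in base R. The sample lines are made up for illustration (the host name, process id, domains and addresses are all hypothetical); they only imitate the general shape of syslog lines with the year prepended.

```r
# Made-up sample lines (hypothetical host, pid, domains and IPs),
# imitating syslog entries with the year prepended.
sample_logs <- c(
  "2020 Dec  1 10:15:01 homelab dnsmasq[512]: query[A] example.com from 192.168.1.10",
  "2020 Dec  1 10:15:01 homelab dnsmasq[512]: forwarded example.com to 1.1.1.1",
  "2020 Dec  1 10:15:02 homelab CRON[901]: (root) CMD (run-parts /etc/cron.hourly)",
  "2020 Dec 12 08:00:45 homelab dnsmasq[512]: query[AAAA] www.r-project.org from 192.168.1.23"
)

# TRUE for lines written by dnsmasq that contain "query[":
is_query <- grepl("dnsmasq.*query\\[", sample_logs)

# Keep only the query lines:
query_logs <- sample_logs[is_query]
print(query_logs)
```

Only the first and last lines survive here: the second one comes from dnsmasq but is not a query, and the third one comes from another program altogether.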
OK so far so good. But a simple vector of strings is not very usable. Let’s put it into a data.frame:
# Finally we prepare a data.frame to work with later on:
dns_logs <- data.frame(log = dns_logs, req.ip = "", domain = "", stringsAsFactors = FALSE)
Let’s see what the dnsmasq logs really look like:
OK so we will focus on the IP address. There are a few ways we could do it. Here we use str_extract from the (VERY HANDY) stringr package. (If you work with logs, chances are you will need stringr, lubridate & dplyr). The requesting IP address appears at the very end of the string, as observed in the logs shown at the beginning of this post. One way to go about it is to look for “everything not a space, until the end”:
library(stringr) # provides str_extract() and str_split()
dns_logs$req.ip <- with(dns_logs, str_extract(log, "[^ ]+$"))
Note: Why use “with” in the command above? Well, in other cases it saves a bit of typing: you can refer to the variables of the data.frame without naming the data.frame each time. The same result can be achieved like so:
dns_logs$req.ip <- str_extract(dns_logs$log, "[^ ]+$")
Obviously in this case it brought no benefit, but it starts to pay off as soon as I need to access two or more variables in one command… Anyhow, just know you can use it.
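To make the point about “with” a bit more concrete, here is a small sketch using base R only (regmatches and regexpr instead of str_extract, so it runs without extra packages). The log lines are made up for illustration. The column is needed twice in the same call, which is where “with” saves some typing.

```r
# Made-up sample lines (hypothetical host, pid, domains and IPs).
demo <- data.frame(
  log = c(
    "2020 Dec  1 10:15:01 homelab dnsmasq[512]: query[A] example.com from 192.168.1.10",
    "2020 Dec 12 08:00:45 homelab dnsmasq[512]: query[AAAA] www.r-project.org from 192.168.1.23"
  ),
  stringsAsFactors = FALSE
)

# With with(): "log" is looked up inside demo, even though it is used twice.
demo$req.ip <- with(demo, regmatches(log, regexpr("[^ ]+$", log)))

# Without with(): same call, but the data.frame is spelled out each time.
req_ip_explicit <- regmatches(demo$log, regexpr("[^ ]+$", demo$log))

print(demo$req.ip)
```

One caveat with this base R variant: regmatches only returns the lines that matched, so it lines up with the data.frame only when every line has a match, as is the case here.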
Looking for IP addresses in Regex
Just as an exercise, let’s see if we can test for valid IP addresses. (We’ve done that in the past already)
# Just for the fun of it: let's do some regex exercises:
ip_tests <- c("a.b.c.d", "1.2.3.4", "256.255.255.255", "0.0.0.0", "-1.2.3.4")
# Too generic. Double escape for backslash is necessary:
grep("[0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+", ip_tests)
# grepl returns a logical vector instead of indices. Same pattern as above, but shorter:
grepl("([0-9]+\\.){3}[0-9]+", ip_tests)
# A somewhat better expression, more precise: each octet must be between 0 and 255.
# 25[0-5] covers 250-255, 2[0-4][0-9] covers 200-249, the last alternative covers 0-199:
grep("\\b((25[0-5]|2[0-4][0-9]|[0-1]?[0-9]?[0-9])\\.){3}(25[0-5]|2[0-4][0-9]|[0-1]?[0-9]?[0-9])\\b", ip_tests)
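If the goal is to validate that a whole string is an IPv4 address (rather than to find one somewhere inside a longer line), anchoring the pattern with ^ and $ is one way to go. The sketch below is only that, a sketch: is_ipv4 is a hypothetical helper name, and it deliberately rejects octets with leading zeros such as "01", which may or may not be what you want.

```r
# One octet: 250-255, 200-249, 100-199, or 0-99 without leading zeros.
octet <- "(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])"

# Four octets separated by dots, anchored to the whole string.
ipv4_pattern <- paste0("^(", octet, "\\.){3}", octet, "$")

# Hypothetical helper: TRUE when the whole string is a dotted-quad IPv4 address.
is_ipv4 <- function(x) grepl(ipv4_pattern, x)

ip_tests <- c("a.b.c.d", "1.2.3.4", "256.255.255.255", "0.0.0.0", "-1.2.3.4", "209.85.128.1")
ip_check <- is_ipv4(ip_tests)
print(setNames(ip_check, ip_tests))
```

With the anchors in place, "-1.2.3.4" is rejected because the string does not start with a digit, and "256.255.255.255" is rejected because 256 is out of range.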
Negative Regex
test_more_regex <- head(dns_logs$log)
Well sometimes you will want to filter OUT some of the lines with a specific structure from a log file. You can use a negative lookahead:
# To keep only lines that do not contain the word "type", using negative lookahead.
# The trick is the (?!<expression>)
# But this is a PERL compatible option, not a default one:
grepl("^(?:(?!type).)+$", test_more_regex, perl = TRUE)
But just as you get the modifier “-v” for a bash grep call, you can “invert” the grep search in R:
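# An easier option: use an inverted grep, which returns the indices of the non-matching lines:
grep("type", test_more_regex, invert = TRUE)
# grepl() has no invert argument; to get a logical vector, negate its result instead:
!grepl("type", test_more_regex)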
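To check that these approaches agree, here is a self-contained sketch on made-up lines (hypothetical host, pid, domains and addresses). It compares the negative lookahead, the inverted grep, and a negated grepl.

```r
# Made-up sample lines; the second one contains the word "type".
sample_logs <- c(
  "2020 Dec  1 10:15:01 homelab dnsmasq[512]: query[A] example.com from 192.168.1.10",
  "2020 Dec  1 10:15:03 homelab dnsmasq[512]: query[type=65] example.com from 192.168.1.10",
  "2020 Dec 12 08:00:45 homelab dnsmasq[512]: query[AAAA] www.r-project.org from 192.168.1.23"
)

# 1. Negative lookahead (needs perl = TRUE): logical vector.
keep_lookahead <- grepl("^(?:(?!type).)+$", sample_logs, perl = TRUE)

# 2. Negated grepl: logical vector, no lookahead needed.
keep_negated <- !grepl("type", sample_logs)

# 3. Inverted grep: indices of the lines that do NOT match.
keep_index <- grep("type", sample_logs, invert = TRUE)

print(sample_logs[keep_negated])
```

One difference worth knowing: the lookahead pattern uses +, so it does not match an empty string, while the negated grepl keeps empty lines. On non-empty log lines like these, the three approaches select the same entries.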
It’s not all Regex though
One can use other tricks to play with strings. The great thing about logs is that they usually have some structure (i.e. they are, so to say, semi-structured data).
One thing to be done on the dnsmasq query logs is to extract the FQDN for which an IP is requested.
This can be done using regular expressions, indeed (I’ll do it in two steps here):
# First, keep everything after the last "]" of the line:
dns_logs$domain <- with(dns_logs, str_extract(log, "[^\\]]+$"))
# and then everything not a space character after the first " ":
dns_logs$domain <- str_extract(dns_logs$domain, "[^ ]+")
# Now domain contains the FQDN
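For comparison, the same extraction can be sketched in one step with capture groups in base R (regexec plus regmatches), which also picks up the requesting IP along the way. The log lines are made up for illustration, and the pattern assumes the query lines end with "<FQDN> from <IP>".

```r
# Made-up sample lines (hypothetical host, pid, domains and IPs).
sample_logs <- c(
  "2020 Dec  1 10:15:01 homelab dnsmasq[512]: query[A] example.com from 192.168.1.10",
  "2020 Dec 12 08:00:45 homelab dnsmasq[512]: query[AAAA] www.r-project.org from 192.168.1.23"
)

# Group 1 is the FQDN, group 2 the requesting IP.
# Inside a character class, "]" is placed first ([^]]) so it is taken literally.
pattern <- "query\\[[^]]*\\] ([^ ]+) from ([^ ]+)$"
matches <- regmatches(sample_logs, regexec(pattern, sample_logs))

# Each element holds: full match, group 1, group 2 (NA if the line did not match).
parsed <- data.frame(
  domain = vapply(matches, function(m) m[2], character(1)),
  req.ip = vapply(matches, function(m) m[3], character(1)),
  stringsAsFactors = FALSE
)
print(parsed)
```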
But one thing to note is that regular expressions can be expensive, in terms of processing that is. So instead of a computationally expensive solution, one might consider alternatives that are, at times, simpler. Here the alternative is not much simpler, but it does fit in a readable one-liner:
# Another option, say to get the FQDN, is to split by spaces here.
# FQDN comes in 9th position then.
# This requires no "regex" beyond the splitting character
# (str_split treats the pattern as a regex by default; fixed(" ") would avoid even that):
sapply(str_split(test_more_regex, " "), function(x) x[9])
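One thing to watch with positional splitting: a fixed position only holds if every line has the same number of spaces before the FQDN. The sketch below uses made-up lines and assumes the traditional syslog timestamp, where single-digit days are padded with an extra space ("Dec  1" versus "Dec 12"). Under that assumption, counting from the end of the line is more stable than counting from the start.

```r
# Made-up sample lines: note the two spaces in "Dec  1" versus one in "Dec 12".
sample_logs <- c(
  "2020 Dec  1 10:15:01 homelab dnsmasq[512]: query[A] example.com from 192.168.1.10",
  "2020 Dec 12 08:00:45 homelab dnsmasq[512]: query[AAAA] www.r-project.org from 192.168.1.23"
)

# fixed = TRUE splits on the literal space, with no regex engine involved.
tokens <- strsplit(sample_logs, " ", fixed = TRUE)

# Counting from the start: the 9th token differs between the two lines,
# because the padded day produces an extra empty token.
by_position <- vapply(tokens, function(x) x[9], character(1))

# Counting from the end: "<FQDN> from <IP>" always closes the line,
# so the FQDN is the third token from the end.
from_end <- vapply(tokens, function(x) x[length(x) - 2], character(1))

print(by_position)
print(from_end)
```

Whether this matters for your own files depends on the timestamp format your syslog daemon writes, so it is worth checking a few lines from different days before relying on a fixed position.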
Conclusions
Manipulating strings, filtering and extracting data from them, is a VERY important skill for an IT Security Analyst.
Note that R, once again, is just an option. Many great commands in Linux Shell make it reasonably easy to put together a repeatable program in a bash script (and grep, cut & sed are just a few very useful ones you might want to look into).
Some analysts are big fans of AWK and do marvellous things with just a few lines.
I don’t know all the possible regular expression tricks by heart, I’ll admit that much. Which is why I keep a reference handy (see below).
Anyway… I wish you, reader (and your family and friends), a happy holiday.