As is usual, data in the security field is “domain specific”. Among other things, certain numbers are to be treated like factors and not ordinals (e.g. TCP Port Numbers…).
IP addresses are an example of data that “looks” somewhat numeric, and actually can be transformed into numerics, but are not exactly such.
An IP address (IP v4) is a set of 4 bytes (that is, 4 x 8 bits), ordered and presented with a “.” between bytes. Up front, there is one conclusion about it: 32 bits is what it takes to encode all possible IP v4 addresses. In R, an integer is also a 32-bit number, but a signed one: its maximum is 2,147,483,647, so addresses from 128.0.0.0 upwards do not fit in it. R's default numeric type (a double) does represent every whole number up to 2^32 - 1 exactly, and that is what the conversions below actually return.
So first clue here: we can use one number to “store” one IP (v4) address, as long as we transform the string of characters into a number, and turn that number back into a readable IP v4 address with another transformation.
OK, so we do precisely that. Instead of programming it myself, I used code from the book “Data Driven Security” (have a look here). (Only two functions were used as inspiration, and legally as licensed CC4.0)
library("bitops")

# Transform an IP string into a number:
ip2long <- function(ip) {
  ips <- unlist(strsplit(ip, ".", fixed = TRUE))
  octet <- function(x, y) bitOr(bitShiftL(x, 8), y)
  Reduce(octet, as.integer(ips))
}
Here the author uses bit shifts to multiply by powers of 2 (shifting left by 8 bits is the same as multiplying by 256). Fair enough. Here is a plain arithmetic version of the same code for less geeky personnel out there:
ip2long_vSimple <- function(ip = NA) {
  # Extract each byte, separated by ".", and keep it as integers:
  ip_bytes <- as.integer(unlist(strsplit(ip, ".", fixed = TRUE)))
  # Transform into a numeric of up to 32 bits:
  2 ^ 24 * ip_bytes[1] + 2 ^ 16 * ip_bytes[2] + 2 ^ 8 * ip_bytes[3] + ip_bytes[4]
}
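The reverse transformation (number back to a readable string) is mentioned above but not shown. A minimal sketch of it, using integer division and modulo instead of bit operations, could look like this. The function name long2ip_vSimple is my own and does not come from the book.

```r
# Hypothetical helper (the name is an assumption, not from the book):
# turn a numeric IP v4 value back into its dotted string.
long2ip_vSimple <- function(ip_num) {
  # Peel off each byte with integer division and modulo.
  ip_bytes <- c(ip_num %/% 2^24,
                (ip_num %/% 2^16) %% 256,
                (ip_num %/% 2^8) %% 256,
                ip_num %% 256)
  paste(ip_bytes, collapse = ".")
}

long2ip_vSimple(3232235777)  # "192.168.1.1"
```

It handles one address at a time, like the functions above.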
Next, with the concepts of “robust code” in mind, we will make sure that before we transform an IP v4 string into a number, the string conforms to the correct format. So I created this quick check:
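is_valid_ip_string_format <- function(ip = NA) {
  # Three octets (0-255) each followed by ".", then a fourth octet and the
  # end of the string, so "1.2.3.4.5" and "1.2.3.4." are both rejected.
  is.character(ip) &&
    grepl("^((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$", ip)
} ## Implicit returns and implicit TRUE / FALSE eval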
This “quick check” impacts speed of execution if used, of course. That tends to happen with regular expressions. Checking thousands of IP addresses might not be exactly “fast”, but that’s an issue for another day.
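One way to soften that cost is to remember that grepl() is vectorised, so a whole character vector can be checked in a single call instead of one address at a time. The sketch below is only an illustration: the function name is hypothetical, and the pattern is an anchored variant that requires exactly four octets.

```r
# Hypothetical vectorised check (name and pattern layout are assumptions).
is_valid_ip_string_format_vec <- function(ips) {
  octet <- "(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"
  # Exactly three "octet." groups followed by a final octet.
  pattern <- paste0("^(", octet, "\\.){3}", octet, "$")
  # Element-wise "&" so that the result has one value per input address.
  is.character(ips) & grepl(pattern, ips)
}

is_valid_ip_string_format_vec(c("10.0.0.1", "256.1.1.1", "1.2.3"))
# TRUE FALSE FALSE
```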
Clearly, I’m not reinventing the wheel here. So let me add a new programming concept now: Object Oriented Programming in R. (“new” in the sense: not discussed before in this Blog…)
Creating Objects in R
Now an IP address is more than a set of numbers. It can be considered a “thing” that has “properties”. For example, a person can be tall or slim, and have a phone number or live at a specific address. So a person could be treated as an object, an instance of a class “Person”, for example.
Well, much in the same way, an IP address can be “public” or “private” for example. (Most IP addresses are public. Some are private, which everyone will recognise. For example, all IP addresses between 10.0.0.0 and 10.255.255.255 are private. This goes beyond the scope of this entry though…)
Now once an IP address is in numeric format, it becomes relatively easy to check whether it is private or public:
calculate_private <- function(ip_addr) {
  if ((ip_addr >= 2886729728 && ip_addr <= 2887778303) ||  # 172.16.0.0/12
      (ip_addr >= 167772160  && ip_addr <= 184549375)  ||  # 10.0.0.0/8
      (ip_addr >= 3232235520 && ip_addr <= 3232301055) ||  # 192.168.0.0/16
      (ip_addr >= 2851995648 && ip_addr <= 2852061183))    # 169.254.0.0/16 (link-local)
    return(TRUE)
  FALSE
}
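Hard-coded limits like these are easy to get wrong, so it can help to derive them from the CIDR notation. A /n block contains 2^(32 - n) addresses, starting at the numeric value of its base address. The helper below is a sketch with a name of my own choosing, and it assumes the string passed in really is the base address of the block.

```r
# Hypothetical helper: first and last numeric address of a CIDR block.
# Assumes the address part is the block's base address (e.g. 172.16.0.0).
cidr_range <- function(cidr) {
  parts <- strsplit(cidr, "/", fixed = TRUE)[[1]]
  bytes <- as.numeric(strsplit(parts[1], ".", fixed = TRUE)[[1]])
  first <- sum(bytes * 2^c(24, 16, 8, 0))
  size  <- 2^(32 - as.integer(parts[2]))  # number of addresses in the block
  c(first = first, last = first + size - 1)
}

cidr_range("172.16.0.0/12")  # first = 2886729728, last = 2887778303
```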
Even further, a public IP address can be tied to a geographic location (albeit temporarily, as these IP addresses can be bought… but that’s not of importance here). We will explore this in a later post, where we will use the MaxMind GeoIP database to try to assign a country to each public IP address.
Anyhow, an IP address can now be considered as an object with a numeric value, a property of public or private, and a country (and many other potential things… for example, owners, and/or bad reputation…)
For now, and only as an exercise, let’s create an IP address Class, and then an IP address object. We will stick to IP v4, and consider only the property of private or public. We’ll create the object of the type “Reference Class”. There are other object systems in R (S3 and S4, for example); this one is apparently one of the most recent.
(As usual, the code for this post was uploaded to my GitHub here.)
Such an object has “fields” and “methods”.
We will use an “initialize” method to create an object with certain default values when provided with an IP v4 address as string input. The instantiation of the object will automatically set up the numerical value of the IP address as a field. Still in the initialize method, we will then calculate whether the IP is private or not (a boolean in a different field).
The object provides two more methods, to retrieve the character string of the IP object and to check whether or not the IP address is private. This is because it is good practice not to expose the fields (object variables) directly to the outside world; rather, one should provide methods (an interface) to manipulate the object. Now this version is simplistic, and will need review (one can for now edit fields directly, for example).
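To make the description concrete, here is a minimal sketch of what such a Reference Class could look like. It is not the code from my GitHub; the class, field and method names are chosen for illustration only, and the conversions are inlined with plain arithmetic so the block runs on its own.

```r
library(methods)

# Illustrative Reference Class; all names here are assumptions.
IPv4Address <- setRefClass(
  "IPv4Address",
  fields = list(ip_num = "numeric", is_private = "logical"),
  methods = list(
    initialize = function(ip = "0.0.0.0", ...) {
      # String to number, same arithmetic as the simple conversion.
      bytes <- as.numeric(strsplit(ip, ".", fixed = TRUE)[[1]])
      num <- sum(bytes * 2^c(24, 16, 8, 0))
      ip_num <<- num
      # Same four ranges as the private check, with the full 172.16.0.0/12.
      is_private <<- (num >= 2886729728 && num <= 2887778303) ||
        (num >= 167772160  && num <= 184549375) ||
        (num >= 3232235520 && num <= 3232301055) ||
        (num >= 2851995648 && num <= 2852061183)
      callSuper(...)
    },
    get_ip_string = function() {
      # Number back to a dotted string.
      paste(c(ip_num %/% 2^24, (ip_num %/% 2^16) %% 256,
              (ip_num %/% 2^8) %% 256, ip_num %% 256), collapse = ".")
    },
    is_private_ip = function() is_private
  )
)

addr <- IPv4Address$new("192.168.1.1")
addr$get_ip_string()  # "192.168.1.1"
addr$is_private_ip()  # TRUE
```

Note that nothing here stops someone from writing addr$ip_num <- 1 directly, which is exactly the limitation mentioned above.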
Conclusions
This is a simple example of object oriented programming, applied to a relevant (albeit very basic) concept in IT Security: the IP address. I personally do not use objects too often (they seem to be incompatible with data.frames, which to me is an important limitation), so this post is more about demonstrating how one COULD do things.
All in all, Objects are an option in R.