Intro
(It’s been a few weeks, taking some time off to recharge batteries before I get myself into a master… So I thought it would be a good idea.)
Anyhow, to the point: Many a time, when working with data, the first thing will be to “look at it”. Then maybe you’ll need to clean up some stuff (it’s often the case, in my experience).
So I’ve done that quite a few times by now. Some of the past Blog entries were around that very subject (here and here). But it’s tedious, and I started working on some functions of my own to do these things… Until I came across an entry on LinkedIn, where someone was praising skimr, which seemed very promising to begin with. Looking further into the topic of helping myself “prepare” a data analysis, I found the package DataExplorer.
But as we know, it’s not enough to read about these things, one needs to test them 🙂
Before we start
DataExplorer somehow requires igraph. igraph is nice, but as it turns out, it requires a package on the box running R that wasn’t there (I still use my tuned Docker containers, using rocker as the base image).
So first of, to work with DataExplorer on rocker/rstudio, you’ll need to install that package:
# apt-get install libglpk-dev
Then only will you be able to do:
library(DataExplorer)
Let’s keep going now
Good. So why all the fuss about that DataExplorer package?
Let me re-use some old “demo CMDB” generator to prepare some dummy data with a few common issues. (Over time, I’ve seen MANY more issues, but that will have to do for this test).
Using that, it should be easy to get to something like so:
> head(cmdb) machine_name ip_address 1 WINDEMO01.KAIZEN-R.COM 192.168.1.17 2 WinDemo02 192.168.1.142 3 LinDemo03 192.168.1.252 4 WINDEMO04.KAIZEN-R.COM 192.168.1.72 5 LINDEMO05.KAIZEN-R.COM 192.168.1.33 6 LINDEMO06.KAIZEN-R.COM 192.168.1.187 > head(cmdb_apps) machine_name app type 1 WinDemo01 Data Platform Production 2 WinDemo02 Data Platform Production 3 LinDemo03 Data Platform Pre-Production 4 WinDemo04 Devel 5 LinDemo05 Devel 6 LinDemo06 Data Platform Production
Which will have to do.
Then it all is rather simple: Let’s use the help (a.k.a. “RTFM”). So let’s try a few things:
> DataExplorer::plot_bar(cmdb_apps)
1 columns ignored with more than 50 categories.
machine_name: 100 categories
Gives the following:
Now it might not look like much, but believe me: It’s easier here (1 simple line of code) than it is doing it manually (and actually, I think it’s easier than doing the same thing in Excel, so there: Hurra for R!)
With the same data, if you have a factor column (which I here just forced, it made sense in this case), you can then see things by groups of such factors in one go:
> cmdb_apps[] <- lapply(cmdb_apps, as.factor) > DataExplorer::plot_bar(cmdb_apps, by = "type") 1 columns ignored with more than 50 categories. machine_name: 100 categories
Once again, the key of the matter is that those are one-liners!
Not everything can be automated
Unfortunately it’s not ALL going to be that easy. So in order to look a bit further into the powers of DataExplorer, I require some data with more… Columns, to begin with. Thankfully, we already have more columns, only in two different data frames. Let’s merge them… But wait! They should merge on the same column name easily if it weren’t for the difference in naming the machines!
So right there, something to be done…
It requires a bit of magic, but it’s feasible:
library(stringr) library(magrittr) cmdb_new <- cmdb cmdb_new$machine_name %<>% tolower() cmdb_new$hostname <- sapply(cmdb_new$machine_name, function(x) { found_pos <- str_locate(x, "\\.")[[1]] if(is.na(found_pos)) return(x) str_sub(x, end = found_pos - 1) }) cmdb_new$domain <- sapply(cmdb_new$machine_name, function(x) { found_pos <- str_locate(x, "\\.")[[1]] if(is.na(found_pos)) return("") str_sub(x, start = found_pos + 1) }) cmdb_new$fqdn <- sapply(1:nrow(cmdb_new), function(x) { if(cmdb_new$domain[x] == "") return("") paste0(cmdb_new$hostname[x], ".", cmdb_new$domain[x]) }) cmdb_new$machine_name <- NULL cmdb_apps$machine_name %<>% tolower()
Which gives us something a bit more compatible (see the “hostname” in cmdb_new, and “machine_name” in cmdb_apps…)
> head(cmdb_new, n = 2) ip_address hostname domain fqdn 1 192.168.1.17 windemo01 kaizen-r.com windemo01.kaizen-r.com 2 192.168.1.142 windemo02 > head(cmdb_apps, n = 2) machine_name app type 1 windemo01 Data Platform Production 2 windemo02 Data Platform Production > cmdb_new <- merge(cmdb_new, cmdb_apps, by.x = "hostname", by.y = "machine_name", all = TRUE) > head(cmdb_new) hostname ip_address domain fqdn app type 1 lindemo010 192.168.1.158 Data Platform Production 2 lindemo010 192.168.1.155 Data Platform Production 3 lindemo0100 192.168.1.47 Data Platform Devel 4 lindemo011 192.168.1.75 kaizen-r.com lindemo011.kaizen-r.com Data Platform Production 5 lindemo012 192.168.1.94 Data Platform Production 6 lindemo012 192.168.1.236 Data Platform Production
Now that’s better, let’s have a look shall we?
> DataExplorer::introduce(cmdb_new)
rows columns discrete_columns continuous_columns all_missing_columns total_missing_values
1 130 6 6 0 0 0
complete_rows total_observations memory_usage
1 130 780 27464
Now we might want to have some numerical value in there somewhere, just for testing. In the context of a CMDB of Systems, maybe the amount of RAM per system could make sense, and CPUs speeds. Let’s throw that in:
rams_df <- data.frame(hostname = unique(cmdb_new$hostname), ram_amount = sample(c(2,4,8), size = length(unique(cmdb_new$hostname)), replace = TRUE, prob = c(1,1,3))) cmdb_new <- merge(cmdb_new, rams_df, all.x = TRUE) cpus_df <- data.frame(hostname = unique(cmdb_new$hostname), cpu_ghz = sample(c(1.6,2.4,0.8,3.2), size = length(unique(cmdb_new$hostname)), replace = TRUE, prob = c(2,3,1,3))) cmdb_new <- merge(cmdb_new, cpus_df, all.x = TRUE)
And now, DataExplorer will find on its own the numeric variables and help us have a quick glance at them:
DataExplorer::plot_histogram(cmdb_new) DataExplorer::plot_boxplot(cmdb_new, by = "app")
Even better than all of the above, DataExplorer provides a function (1 liner) to do all of the above and more and generate one report, from one data frame… I mean, NICE STUFF!
But I’ll let you explore that last trick on your own.
A simple (but great) alternative: skimr
So just because I’ve seen it, and it is as simple as it gets, BUT it is very useful to get a sense of a data.frame in seconds:
Hell it even decided on its own: Empty strings “” count as empty cells to it! That’s just the kind of things one needs to speed up data sets reviews 🙂
Conclusions
Some things will still need to be manual. EDA, data cleaning, data imputation, etc. all require us to LOOK AT THE DATA.
That doesn’t necessarily mean however that one needs to program from scratch all the nifty details about understanding and fixing some dataset each time.
One of the great things of R, is its community of developers that create a plethora of great packages for the rest of us to be able to do our job faster, and easier.
So Kudos to the R Community, thanks all! (I don’t say that enough…)