They say a big part of “Data Science” is about communicating information.
This entry is about one visualization.
Motivation
I have been thinking for a couple of days now about how to best “transmit” the status of a set of projects. I have 40+ projects to report on at work, but progress on 40+ projects is too much information to squeeze into one slide.
I came across “Parking Lot Diagrams” recently in a PMI video, and it was a nice idea, as the color codes make it easy to see at a glance where a set of things stands. But I am not a fan of the date part of that visualization, so I felt I should go back to the good-ol’ Gantt chart. Then again, Gantt charts seem a bit overloaded to me; I prefer a milestones-only version (no horizontal bars).
Also, I want to report on the status of the 40 projects every week, so I want to “touch” as little as possible and still get an updated visualization each week (i.e. I need to automate that a bit).
And so I thought, maybe I’d mix both visualizations for my personal objectives, and program that (in R, as anyone would expect of me).
Note: This is NOT Security related data by the way. I actually created fake data to demo the visualization, as we will see later on. And this was done in my spare time (this morning), but I do intend to use some version of it for reporting at work 🙂
The Data
So how I “codified” the information is fairly easy:
I have a set of projects. Each project can be in any of 4 phases (that can be changed, it just depends on the data). I chose four phases like so: “Acquisition”, “Deployment”, “Migration”, “Testing”. This is completely arbitrary, mind you.
For each phase of each project, there is a deadline, and a progress as of “today”.
The deadline is, well, self-explanatory.
For the progress or status, I wanted to set ONE number. So I chose the following:
- -1: does not apply (some projects might not need an “Acquisition” phase or a “Migration” phase at all).
- 0: Applies, but not started.
- 0 < x < 1: Ongoing, with varying progress (this will be simplified, but it made sense for drill-down).
- 1: Finished
This is pretty much it. Once I have an initial version, each week I just need to go through and update the “progress” number of each project-phase pair. Optionally, I can also edit the deadline, but that’s not a best practice.
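To make the encoding concrete, here is a minimal sketch of what such a table could look like. The project names, dates and column names (`project`, `phase`, `duedate`, `status`) are made up for illustration and are not necessarily the ones used in the actual script.

```r
# Hypothetical fake data: one row per project-phase pair.
phases <- c("Acquisition", "Deployment", "Migration", "Testing")
projects <- paste("Project", 1:3)

demo <- expand.grid(project = projects, phase = phases,
                    stringsAsFactors = FALSE)

# One deadline per row (arbitrary dates, one per phase in this sketch).
phase_deadline <- as.Date(c("2021-01-31", "2021-02-28", "2021-03-31", "2021-04-30"))
demo$duedate <- phase_deadline[match(demo$phase, phases)]

# One progress number per row:
# -1 (does not apply), 0 (not started), 0 < x < 1 (ongoing), 1 (finished).
demo$status <- c(1, 1, -1,   1, 0.5, 0,   0.2, -1, 0,   0, 0, 0)

# Sanity check: every status is either -1 or within [0, 1].
valid_status <- demo$status == -1 | (demo$status >= 0 & demo$status <= 1)
stopifnot(all(valid_status))
```

With a layout like this, the weekly update boils down to editing the `status` column.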
The Visualization
Now after some playing around, I settled for the following:
- Vertically group per phase.
- For each phase, each project is one row (this will quickly get crowded, but that’s not an issue as we’ll see later on).
- Horizontal axis is time, a conservative approach.
Now for each project-phase-date, there is a “progress” number. I colour coded these, so that:
- White: Does not apply (it is the default colour and can be used for progress/status “-1”)
- Grey: Not started yet.
- Light Green: Ongoing (status is between 0 and 1, not inclusive)
- Green: Finished
- Red: If status is not “finished” and the date at which the script is executed is beyond the due date, I overwrite the color so that we clearly mark “behind schedule” here.
And that’s it!
Why an entry here in this Blog?
Because there is a twist to this: This is the first time I used data.table instead of a data.frame.
Ever since I heard about the data.table as an alternative to the data.frame, I thought it would help keep the code more readable. I’ve known about data.tables for a couple of years now, but I never actually took the time to use this type of object. Until now, that is.
For instance, as explained above, I need to “calculate” the colour to associate to a given status:
# We colour-code the status
set_color <- function(status) {
  if(status == 0) return("grey")
  if(status > 0 && status < 1) return("lightgreen")
  if(status == 1) return("green")
  return("white")
}
But it took me a minute to understand the updating of columns “by reference” and the (unusual to me) “:=” operator:
demo_dt[, status_color := sapply(status, set_color)]
This here is how I set (overwrite) to “red” the status colour using a data.table:
# Anything not finished and past due date is red to us:
demo_dt[(status < 1) & (duedate < Sys.Date()), status_color := "red"]
Now that’s definitely cleaner (to my eyes) than the equivalent code with a data.frame (not tested):
demo_df[(demo_df$status < 1) & (demo_df$duedate < Sys.Date()), "status_color"] <- "red"
At least that’s my opinion. I’m not saying I am ready to go “all in” with data.tables and discard the data.frame that has served me so well over the past few years. But indeed, I will keep the data.table in mind if I ever prioritize readability of my code over… Well, my current experience.
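Since the data.frame version above is marked as not tested, here is a small self-contained sketch that runs both variants on the same fake data and checks that they agree. The four rows and their dates are invented for this example; only `set_color`, `status`, `duedate` and `status_color` come from the snippets above.

```r
library(data.table)

# Same helper as above: map one status value to a colour.
set_color <- function(status) {
  if (status == 0) return("grey")
  if (status > 0 && status < 1) return("lightgreen")
  if (status == 1) return("green")
  "white"
}

# Hypothetical fake data; dates far in the past and future so the
# result does not depend on the day the script is run.
demo_dt <- data.table(
  project = c("A", "B", "C", "D"),
  status  = c(1, 0.5, 0, -1),
  duedate = as.Date(c("2000-01-01", "2000-01-01", "2999-01-01", "2999-01-01"))
)
demo_df <- as.data.frame(demo_dt)

# data.table: add the column, then overwrite part of it, by reference.
demo_dt[, status_color := sapply(status, set_color)]
demo_dt[(status < 1) & (duedate < Sys.Date()), status_color := "red"]

# data.frame: the same two steps in base R.
demo_df$status_color <- sapply(demo_df$status, set_color)
demo_df[(demo_df$status < 1) & (demo_df$duedate < Sys.Date()), "status_color"] <- "red"

# Both approaches should produce the same colours.
stopifnot(identical(demo_dt$status_color, demo_df$status_color))
```

One thing worth noting: a status of -1 also satisfies `status < 1`, so a “does not apply” phase with a past due date would turn red with this filter. If that is not intended, adding `status >= 0` to the condition would be one way to avoid it.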
Creating a visualization in R
I am really keen on “choosing one thing that works and sticking to it”. Now this might go a bit against the “continuous improvement” concepts from one perspective, but on the other hand getting better at one thing instead of using different things… Well, it’s a balancing exercise I guess.
So the same way I chose R over Python back in the day (not many regrets yet as I have managed to do what I needed in R, but I understand the whole discussion out there…), I have chosen (for some time now) to stick to ggplot2 and its family members.
I like the “layered” approach to drawing graphics with ggplot2. And I quite like the concept of a “grammar of graphics”.
This is probably the first visualization entry in this Blog, so I needed to clarify why ggplot2 here. There is much more to it, and as mentioned at the beginning: Visualization is an important part of data science.
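To illustrate the layered approach, here is a rough sketch of how a milestones-only, colour-coded chart could be put together. This is not the actual code from the repository: the data, the tile width and the choice of `geom_tile()` with `facet_grid()` are assumptions made for the example.

```r
library(ggplot2)

# Hypothetical fake data, with the colour already computed per row.
demo <- data.frame(
  project = rep(c("Project A", "Project B"), times = 2),
  phase   = rep(c("Acquisition", "Deployment"), each = 2),
  duedate = as.Date(c("2021-01-31", "2021-02-15", "2021-03-31", "2021-04-15")),
  status_color = c("green", "red", "lightgreen", "grey")
)

# Layer by layer: data and mapping, then tiles, then one panel per phase.
p <- ggplot(demo, aes(x = duedate, y = project, fill = status_color)) +
  geom_tile(width = 7, colour = "black") +  # one tile per milestone, 7 days wide
  scale_fill_identity() +                   # use the colour names as they are
  facet_grid(phase ~ .) +                   # group vertically per phase
  labs(x = "Due date", y = NULL)

print(p)
```

Each `+` adds one layer or setting, which is what makes it easy to start simple and refine the chart later.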
Result
And so the first version looks like so:
Now the whole idea is to eventually “squeeze” many more projects in there, with different milestones, thereby making it more “visual” in the sense that if the graph is mostly “green”, then things are on track, while if a lot of red appears, there are clearly issues (delays).
This first version of visualizing the status of multiple projects is of course a bit “rough”. There is much room for improvement:
If we make it interactive, then we could drill down by clicking on a “tile” in the graph. That would allow presenting further information, in a different graph, maybe in a Shiny Dashboard (I’ll get to that probably later on, maybe for this very exercise).
We can have the graph pop up context information while hovering with the mouse, for example the actual progress value (as a %) and the project name (thereby removing the need to show it several times on the left). That way, we could see only colours and time by default, giving a “sense” of the status of the different projects. Well… I then went and did just that:
Conclusions
I am not yet convinced this is a “good” visualization; I’ll need to test it with “many” projects and actual real data (dates, progress, etc.).
But it does meet the goal of showing in one page a lot of information about actual progress, planning and eventual delays.
You can check out the code here on my GitHub account.