Intro
The other day, someone asked in a Telegram group how to recode a factor with four levels into a new factor with two levels, in R.
Say the original factor has levels A, B, C, and D; the new factor would be X for the first three levels and Y for the last one.
My first reaction
Because I use dplyr quite a bit, as it makes things cleaner (and since R 4.1 you can use the native |> pipe instead of magrittr's %>%), I immediately thought of combining mutate and ifelse for this scenario.
But then…
Well, I started thinking afterwards:
- First, that I didn’t know whether mutate would be faster than a simpler (albeit maybe less readable) base R version.
- Second, that we were talking about a factor. Factors have levels, so maybe there was something to be done on the levels, instead of on the values?
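As a minimal sketch of that second idea: base R lets you assign a named list to levels(), where each name is a new level and each element lists the old levels it absorbs. The vector `f` below is a made-up example using the four levels from the original question, not the data used in the benchmark.

```r
# Hypothetical factor with the four levels from the question
f <- factor(c("A", "B", "C", "D", "A", "D"))

# Each list name is a new level; each element holds the old levels merged into it
levels(f) <- list(X = c("A", "B", "C"), Y = "D")

levels(f)        # the factor now has only the levels "X" and "Y"
as.character(f)  # entries that were A, B or C are now "X"; D is now "Y"
```

Only the level labels are edited here; no per-entry ifelse is needed.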
So I went ahead and created a basic test for that. I built a dummy example dataframe (with a three-level factor: A, B, C), and then wrote several alternatives to compute the new factor.
Here is the code for comparing the results, if you want to reproduce it.
As I wanted to compare processing speeds easily, I put the different alternatives in separate functions that would have to return a dataframe with the new factor.
library(dplyr)          # mutate(), select()
library(testthat)       # test_that(), expect_equal()
library(microbenchmark) # microbenchmark()
# plyr is called as plyr::mapvalues() below so that it does not mask dplyr

test1 <- data.frame(num = 1:99, cat = factor(rep(c("A", "B", "C"), 33)))

f1 <- function(df) { # base R basic way
  # use the df argument, not the global test1
  df$new_cat <- ifelse(df$cat == "C", "cat2", "cat1")
  df$new_cat <- as.factor(df$new_cat)
  df[, c(1, 3)]
}

f2 <- function(df) { # dplyr way
  df <- df |> mutate(new_cat = ifelse(cat == "C", "cat2", "cat1"))
  df$new_cat <- as.factor(df$new_cat)
  df |> select(num, new_cat)
}

f3 <- function(df) { # plyr way
  df$new_cat <- plyr::mapvalues(df$cat, from = c("A", "B", "C"), to = c("cat1", "cat1", "cat2"))
  df[, c(1, 3)]
}

f4 <- function(df) { # base R working on factor levels
  # index the levels by the levels themselves, not by the 99 values
  levels(df$cat)[levels(df$cat) %in% c("A", "B")] <- "cat1"
  levels(df$cat)[levels(df$cat) == "C"] <- "cat2"
  names(df)[2] <- "new_cat"
  df
}
And here are the results:
> test_that("Validate equal functionality", {
+ expect_equal(f1(test1), f2(test1))
+ expect_equal(f1(test1), f3(test1))
+ expect_equal(f1(test1), f4(test1))
+ })
Test passed 🌈
>
> microbenchmark(
+ f1(test1), f2(test1), f3(test1),
+ f4(test1),
+ times = 100L)
Unit: microseconds
      expr      min        lq      mean    median       uq       max neval
 f1(test1)  164.984  200.1815  301.1756  254.4185  303.167  3379.964   100
 f2(test1) 2532.623 2723.3530 3430.4129 2860.8335 3263.307 16781.194   100
 f3(test1)   88.998  105.5680  203.0238  130.5015  158.680  5710.515   100
 f4(test1)   86.071   96.6425  225.2438  110.8970  136.925  9588.608   100
Conclusions
This was a quick post.
Indeed, the dplyr way is much slower on this small 99-row dataframe: its median time (about 2,861 µs) is roughly 11 times that of the basic base R version (about 254 µs) and more than 20 times that of the two fastest functions. In general, I prefer dplyr and pipes because they make the code more readable, but in some cases less readable code is the price of (much) faster execution.
Note that the last two functions work directly on the factor levels instead of on the value of each entry, so I’m guessing that’s what makes them so much faster.
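One way to probe that guess, sketched here with base R only, is to recode a much larger factor both ways and time each approach. The size `n` and the helper names `by_values` and `by_levels` are my own assumptions for illustration. Timings vary by machine, so none are shown.

```r
set.seed(1)
n <- 1e5  # assumed size, far larger than the 99 rows used above
big <- factor(sample(c("A", "B", "C"), n, replace = TRUE))

# Value-based: builds a new string for each of the n entries, then re-encodes them
by_values <- function(x) {
  factor(ifelse(x == "C", "cat2", "cat1"), levels = c("cat1", "cat2"))
}

# Level-based: relabels the three levels and lets R remap the integer codes
by_levels <- function(x) {
  levels(x) <- list(cat1 = c("A", "B"), cat2 = "C")
  x
}

# Both approaches should produce the same values and the same levels
stopifnot(identical(as.character(by_values(big)), as.character(by_levels(big))))
stopifnot(identical(levels(by_values(big)), levels(by_levels(big))))

system.time(by_values(big))
system.time(by_levels(big))
```

If the gap between the two widens as `n` grows, that would support the idea that the per-entry work is what makes the value-based versions slower.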