dplyr is not faster


Intro

The other day, someone asked in a Telegram group how to go about changing a factor with 4 levels into a factor with two levels, in R.

So originally you had the levels A, B, C, and D, and the new factor would be X for the first three levels, and Y for the last one.

My first reaction

Since I use dplyr quite a bit, as it makes things cleaner (and since R 4.1 you can use the native |> pipe instead of %>% from magrittr), I immediately thought of a mix of mutate and ifelse for this scenario.
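
Something along these lines was my first instinct (a sketch; df and the column names are placeholders, since I hadn't seen the actual data):

df |> mutate(new_cat = ifelse(cat == "D", "Y", "X"))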

But then…

Well, I started thinking afterwards:

  • First, that I didn’t know whether mutate would be faster than a simpler (albeit maybe less readable) base R version.
  • Second, that we were talking about a factor. Factors have levels, so maybe there was something to be done on the levels, instead of on the values (see the sketch right after this list)?
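
Here is a quick sketch of that second idea, using the four levels from the original question (the A/B/C/D to X/Y mapping is my reading of what was asked):

f <- factor(c("A", "B", "C", "D", "A"))
levels(f) <- c("X", "X", "X", "Y") # duplicated level names get merged
f
# [1] X X X Y X
# Levels: X Y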

So I went ahead and put together a basic test: a dummy example dataframe, plus different alternatives to calculate the new factor.

Here is the code for comparing the results, if you want to reproduce it.

As I wanted to compare processing speeds easily, I put the different alternatives in separate functions that would have to return a dataframe with the new factor.

library(plyr) # provides mapvalues(); loaded before dplyr so it doesn't mask dplyr's functions
library(dplyr)
library(testthat)
library(microbenchmark)

test1 <- data.frame(num = c(1:99),
    cat = factor(rep(c("A", "B", "C"), 33)))

f1 <- function(df) { # base R basic way
  df$new_cat <- ifelse(df$cat == "C", "cat2", "cat1")
  df$new_cat <- as.factor(df$new_cat)
  df[, c(1, 3)] # keep num and new_cat
}
f2 <- function(df) { # dplyr way
  df <- df |> mutate(new_cat = ifelse(cat == "C", "cat2", "cat1"))
  df$new_cat <- as.factor(df$new_cat)
  df |> select(num, new_cat)
}
f3 <- function(df) { # plyr way
  df$new_cat <- mapvalues(df$cat, from = c("A", "B", "C"), to = c("cat1", "cat1", "cat2"))
  df[, c(1,3)]
}
f4 <- function(df) { # base R working on factor levels
  levels(df$cat)[levels(df$cat) %in% c("A", "B")] <- "cat1"
  levels(df$cat)[levels(df$cat) == "C"] <- "cat2"
  names(df)[2] <- "new_cat"
  df
}
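
As a side note, the forcats package also covers this kind of recoding with fct_collapse(); here is a sketch of what an f5 could have looked like (I didn't include it in the benchmark below):

library(forcats)
f5 <- function(df) { # forcats way, not benchmarked
  df$new_cat <- fct_collapse(df$cat, cat1 = c("A", "B"), cat2 = "C")
  df[, c(1, 3)]
}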

And here are the results:

> test_that("Validate equal functionality", {
+   expect_equal(f1(test1), f2(test1))
+   expect_equal(f1(test1), f3(test1))
+   expect_equal(f1(test1), f4(test1))
+ })
Test passed 🌈
> 
> microbenchmark(
+   f1(test1), f2(test1), f3(test1),
+   f4(test1),
+   times = 100L)
Unit: microseconds
      expr      min        lq      mean    median       uq       max neval
 f1(test1)  164.984  200.1815  301.1756  254.4185  303.167  3379.964   100
 f2(test1) 2532.623 2723.3530 3430.4129 2860.8335 3263.307 16781.194   100
 f3(test1)   88.998  105.5680  203.0238  130.5015  158.680  5710.515   100
 f4(test1)   86.071   96.6425  225.2438  110.8970  136.925  9588.608   100

Conclusions

This was a quick post.

Indeed, the dplyr way seems to be much slower (roughly an order of magnitude in this micro-benchmark). In general I prefer dplyr and the pipes because they make the code more readable, but in some cases less readable code is a worthwhile tradeoff for (much) faster execution.

Note that the last two functions work directly on the factor levels instead of on the value of each entry, so I'm guessing that's what makes them that much faster: they only touch the three levels rather than all 99 rows.
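
If you want to check that intuition, you can rerun the same benchmark on a much bigger dummy dataframe (a sketch; test2 is just a scaled-up test1, and I haven't included its timings here): the levels-based functions still manipulate a three-element levels vector, while ifelse() has to evaluate every single row.

test2 <- data.frame(num = c(1:999999),
    cat = factor(rep(c("A", "B", "C"), 333333)))

microbenchmark(
  f1(test2), f2(test2), f3(test2),
  f4(test2),
  times = 100L)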