
R data.table for 2 weeks

Billy Fung / 2018-01-28


Everything is so much quicker

Often this is the main reason people choose the data.table package. I find that I tend to reach for it when I'm loading .csv files, because they really do load so much quicker. But for the past two months I've been mentoring a statistics intern, one with no real-world experience using R or applying statistics with it. Performance is an easy scapegoat when you're looking for just another small thing to procrastinate about.
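
If you want to check the loading claim for yourself, a minimal sketch is to time base R's read.csv against data.table's fread on the same file (the file name below is just a placeholder, not a real dataset from this post):

library(data.table)

# time both readers on the same file; "my_data.csv" is only a placeholder path
system.time(df_base  <- read.csv("my_data.csv"))
system.time(df_fread <- fread("my_data.csv"))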

With analysis-type work we are applying the scientific method, and first and foremost there should be a reason for everything you attempt to do. I still fall into the trap where, when something doesn't work as I expect or I don't understand the process of my experiment, I'll go find some different package to try.

Building blocks

One of the main reasons I prefer to teach the tidyverse to somebody new is that the majority of people enjoy separating their code into parts they can see. This is similar to why Excel is so popular and powerful: being able to physically see the rows and columns of data is how most people think of it in their heads, so it only makes sense to see it all. dplyr helps with that, letting you easily piece out your code before applying it. To me, this is crucial in building the habit of layering your code in pieces. The analogy of building a brick house applies here: first you must set the bottom layer of blocks as a strong foundation.

Different styles

Where the argument between dplyr and data.table usually comes in is how those blocks are put together. There will always be multiple ways to do the same thing, but who cares how fast you do it if you don't end up with a solid house? It's easy to be impressed by how fast a Ferrari goes, but when you get into one, can you drive it? I personally have been using data.table more often than not at work, but this is because I am very comfortable writing SQL-like queries, and at a larger scale, having learned matrix algebra and MATLAB, I find that I don't always need to physically see the data. (I'm still way faster at certain stuff in Excel, like pivoting and visualising.)
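
To make the SQL comparison concrete, data.table's DT[i, j, by] reads roughly as WHERE, SELECT and GROUP BY. A small sketch on mtcars (the grouping by cyl here is just my own illustrative choice):

library(data.table)

mt <- as.data.table(mtcars)

# roughly: SELECT cyl, AVG(mpg) AS avg_mpg FROM mt WHERE hp >= 200 GROUP BY cyl
mt[hp >= 200, .(avg_mpg = mean(mpg)), by = cyl]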

I think with either tool the biggest question is whether you'll have to constantly explain your work, and how easily you will be able to do that. dplyr is way easier to show in a presentation to non-technical people, whereas data.table would probably only be shown to people who have used it before or are comfortable with relational grammar like SQL.

dplyr sample

library(dplyr)

mtcars %>%
  filter(hp >= 200) %>%             # keep cars with at least 200 hp
  mutate(mpg_cyl = mpg / cyl) %>%   # mpg per cylinder
  select(mpg, cyl, wt, mpg_cyl)

The key operator in dplyr is the pipe %>%, which lets you layer out your code. This code takes the mtcars dataset, filters for cars with hp >= 200, then creates a new column mpg_cyl = mpg/cyl, an arbitrary metric that shows the mpg per cylinder of the car. The select statement then returns only the specified columns. This is simple and easy to read, and showing dplyr code shouldn't have too steep a learning curve.

rdatatable sample

library(data.table)

mt <- as.data.table(mtcars)
mt[hp >= 200, mpg_cyl := mpg / cyl]                    # add mpg per cylinder, only for rows with hp >= 200
mt[!is.na(mpg_cyl), c('mpg', 'cyl', 'wt', 'mpg_cyl')]  # drop the untouched (NA) rows and select columns

This is probably not the best way of doing it, but more steps lets you see what is being done. Because := only fills in mpg_cyl for the rows where hp >= 200, every other row is left with NA, which is why the second step filters on !is.na(mpg_cyl) before selecting the columns.
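
For reference, the same result can be produced in a single call by filtering, computing and selecting all in j, though it hides the intermediate step; something like:

library(data.table)

# filter, compute the new column, and select in one call
as.data.table(mtcars)[hp >= 200, .(mpg, cyl, wt, mpg_cyl = mpg / cyl)]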