Hadleyverse vs data.table
Billy Fung / 2017-08-23
Is the Hadleyverse the only option?
One of the major downsides to learning R is figuring out which packages to use. R comes with many standard packages, but sometimes they aren't the most intuitive way to handle your data, or they can be straight up bad. Many people direct new learners to the packages written by Hadley Wickham, which I will say are great packages.
dplyr
Most of the packages deal with handling data in an intuitive way that makes your work easier to follow and easier to read. One such package is dplyr, which is used to manipulate data, mainly data frames.
Here's an example of filtering a data frame for rows where a column has a specific value:
starwars %>%
  filter(species == "Droid")
The beauty of dplyr lies in being able to chain multiple operations together, so you don't need to create multiple data frames. In base R, I found myself creating many tmp data frames just to hold intermediate results while doing multiple operations on the data.
starwars %>%
  group_by(species) %>%
  summarise(
    n = n(),
    mass = mean(mass, na.rm = TRUE)
  ) %>%
  filter(n > 1)
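For contrast, here is roughly the same summary in base R (a sketch, assuming the starwars data frame that ships with dplyr): every step lands in its own temporary data frame.
# count rows per species
tmp_counts <- as.data.frame(table(starwars$species))
names(tmp_counts) <- c("species", "n")
# mean mass per species (aggregate drops rows with missing mass by default)
tmp_mass <- aggregate(mass ~ species, data = starwars, FUN = mean)
# combine the temporaries and keep species that appear more than once
tmp <- merge(tmp_counts, tmp_mass, by = "species")
result <- tmp[tmp$n > 1, ]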
But is there something faster?
One of the downsides of manipulating data frames is that they aren't the most nimble things to be moving around. I've been using dplyr and other Hadleyverse packages for quite a while now, but lately I've been dealing with much larger datasets, on the order of 5 million rows. The manipulation of data becomes much slower, and grouping operations start to lag.
data.table
This package provides the data.table class, which inherits from data.frame. Essentially somebody created this package to improve on memory usage, along with speed. There isn't any piping or elegant style of writing code that makes a data.table stand out, because that is a very subjective topic.
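As a quick sketch (assuming data.table is installed and the starwars data frame from dplyr is loaded), conversion is a one-liner and queries use the DT[i, j, by] form:
library(data.table)
# a data.table is still a data.frame, so existing code keeps working
DT <- as.data.table(starwars)
class(DT)  # "data.table" "data.frame"
# rows where species is "Droid", same result as the dplyr filter above
DT[species == "Droid"]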
What it does do better: I have found fread/fwrite to be way faster at reading and writing files. Much, much faster compared to read_csv/write_csv.
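A minimal sketch of the usage (the table and file name here are made up for illustration):
library(data.table)
# a made-up table just for illustration
big <- data.table(x = rnorm(1e6),
                  y = sample(letters, 1e6, replace = TRUE))
# fwrite and fread are multi-threaded; on files this size and larger
# they tend to be dramatically faster than write_csv/read_csv
fwrite(big, "example.csv")
big2 <- fread("example.csv")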
With operations that modify the data, data.table can update by reference, which saves the computation and memory cost of assigning the result back to a variable.
DT[x > 0, y := 'positive']
That simple line updates the data.table's y column in place. As the table grows, data.table does this a lot quicker than dplyr on a data.frame.
The dplyr and data.frame equivalent:
new_DF <- DF %>%
  mutate(y = replace(y, which(x > 0), 'positive'))
So far I am still quite new to data.table, but I have found that I enjoy using it because of the speed and memory gains, and I don't find myself wishing it had dplyr-like syntax. Aggregating and joining is so much faster with large tables that I wish I had found out about data.table earlier, so I could have built up the intuition for writing consistent code.
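For a rough comparison, here is the earlier dplyr summarise written in data.table syntax (a sketch, again assuming the starwars data from dplyr):
SW <- as.data.table(starwars)
# the data.table version of the earlier group_by/summarise chain:
# count and mean mass per species, then keep species with more than one row
SW[, .(n = .N, mass = mean(mass, na.rm = TRUE)), by = species][n > 1]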
Every bit of optimisation counts when your datasets start growing, so learning the fastest way first will save you time in the long run.