The journey of data.table-ian exploring the universe of Hadley and Co created.
I’ve been doing data transformation with R for about 4 years now. At the very beginning, I was introduced with data.table package by my supervisor. My first task was to use data.table::fread which is blazingly fast compared to base::read.csv. That function left me a good impression such that I am willing to climb the steep learning curve of the package. Fast forward, I now don’t even know how to work with data.frame.
“You’ve been doing data science for 4 years, but never heard of Tidyverse?!” Don’t get me wrong, I’ve used ggplot2, lubridate and stringr extensively on my scripts.
Besides, data.table complement nicely with functions from those packages. However, I’ve never used dplyr, tidyr, readr, purr, tibble, etc. I have always thought of them as a substitute of data.table. As I’m comfortable with my current workflow, I have never tried to learn them.
There’s a number of discourse between tidyverse and data.table. I don’t think this blog post will be more comprehensive than them. Frankly, I’m always working with a large dataset and I feel that speed is very important. Hence data.table is my go-to (Hadley said so!).
That’s a lot of background information. This will be my first attempt on understanding the tidyverse ecosystem. As it progress, I will try to make an objective comparison between them. I’m going to closely follow data transformation chapter of R for Data Science. Let’s dive in!
Loading the relevant packages and dataset:
library(nycflights13)
library(tidyverse)
library(data.table)
tib <- flights
dt <- data.table(flights)
I will use dplyr::all_equal extensively in this comparison. To demonstrate, when the 2 objects have different entry:
all_equal(
tibble(a=c(1,2),b=c(3,4)),
data.table(a=c(1,2),b=c(3,5))
)
[1] "- Rows in x but not in y: 2\n- Rows in y but not in x: 2\n"
When, all of the entries are the same:
all_equal(
tibble(a=c(1,2),b=c(3,4)),
data.table(a=c(1,2),b=c(3,4))
)
[1] TRUE
When the order of row matters, we have to use dplyr::all_equal:
all_equal(
tibble(a=c(1,2),b=c(3,4)),
data.table(a=c(2,1),b=c(4,3)),
ignore_row_order = F,
)
[1] "Same row values, but different order"
Hence, we check whether the tibble and data.table have equal values:
all_equal(tib,dt)
[1] TRUE
Unless your dataset is robustly documented (which is never the case), the chance is (always) you have to stare at it to understand. Hence, the method to display the dataset is very crucial. Let’s take a look on how tibble and data.table differ!
tib
# A tibble: 336,776 x 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 1 1 517 515 2 830
2 2013 1 1 533 529 4 850
3 2013 1 1 542 540 2 923
4 2013 1 1 544 545 -1 1004
5 2013 1 1 554 600 -6 812
6 2013 1 1 554 558 -4 740
7 2013 1 1 555 600 -5 913
8 2013 1 1 557 600 -3 709
9 2013 1 1 557 600 -3 838
10 2013 1 1 558 600 -2 753
# … with 336,766 more rows, and 12 more variables:
# sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
# flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>
dt
year month day dep_time sched_dep_time dep_delay arr_time
1: 2013 1 1 517 515 2 830
2: 2013 1 1 533 529 4 850
3: 2013 1 1 542 540 2 923
4: 2013 1 1 544 545 -1 1004
5: 2013 1 1 554 600 -6 812
---
336772: 2013 9 30 NA 1455 NA NA
336773: 2013 9 30 NA 2200 NA NA
336774: 2013 9 30 NA 1210 NA NA
336775: 2013 9 30 NA 1159 NA NA
336776: 2013 9 30 NA 840 NA NA
sched_arr_time arr_delay carrier flight tailnum origin dest
1: 819 11 UA 1545 N14228 EWR IAH
2: 830 20 UA 1714 N24211 LGA IAH
3: 850 33 AA 1141 N619AA JFK MIA
4: 1022 -18 B6 725 N804JB JFK BQN
5: 837 -25 DL 461 N668DN LGA ATL
---
336772: 1634 NA 9E 3393 <NA> JFK DCA
336773: 2312 NA 9E 3525 <NA> LGA SYR
336774: 1330 NA MQ 3461 N535MQ LGA BNA
336775: 1344 NA MQ 3572 N511MQ LGA CLE
336776: 1020 NA MQ 3531 N839MQ LGA RDU
air_time distance hour minute time_hour
1: 227 1400 5 15 2013-01-01 05:00:00
2: 227 1416 5 29 2013-01-01 05:00:00
3: 160 1089 5 40 2013-01-01 05:00:00
4: 183 1576 5 45 2013-01-01 05:00:00
5: 116 762 6 0 2013-01-01 06:00:00
---
336772: NA 213 14 55 2013-09-30 14:00:00
336773: NA 198 22 0 2013-09-30 22:00:00
336774: NA 764 12 10 2013-09-30 12:00:00
336775: NA 419 11 59 2013-09-30 11:00:00
336776: NA 431 8 40 2013-09-30 08:00:00
Tibble starts with number of rows and columns. That’s neat! I can never figure out how many columns are there in data.table. It also print out the classes of each variable. Wow! Why double and datetime though, instead of numeric and POSIXct?
If you use RStudio Console, you will notice that both tibble and data.table are responsive. Re-size console then re-print, it will display differently. Tibble is neat! Instead of the stacking in data.table, tibble hides the further columns. Tibble is colour coded too. I am impressed!
Fortunately, the dataset is pretty straight forward. Let’s move on!
dplyr::filter is equivalent with DT[i,]:
all_equal(
filter(tib,month == 1, day == 1),
dt[month==1&day==1,]
)
[1] TRUE
all_equal(
filter(tib, !(arr_delay > 120 | dep_delay > 120)),
dt[!(arr_delay > 120 | dep_delay > 120),]
)
[1] TRUE
dplyr::arrange is equivalent with DT[order()]:
all_equal(
arrange(tib,year,month,day),
dt[order(year,month,day)],
ignore_row_order = F,
)
[1] TRUE
Descending order can be achieved by adding minus operator:
all_equal(
arrange(tib,desc(dep_delay)),
dt[order(-dep_delay)],
ignore_row_order = F,
)
[1] TRUE
dplyr::select is equivalent with DT[,.()]:
all_equal(
select(tib,year,month,day),
dt[,.(year,month,day)]
)
[1] TRUE
dplyr::select is supported with several helper functions.
tibSml <- select(tib,
year:day,
ends_with("delay"),
distance,
air_time)
dtSml <- dt[,.(year,
month,
day,
dep_delay,
arr_delay,
distance,
air_time)]
all_equal(tibSml,dtSml)
[1] TRUE
In the above case, it doesn’t bother me to write it out in data.table. However, you can use DT[,..(character)] for selecting columns after evaluating the regular expressions.
dplyr::mutate is similar with DT[,“:=”()]. However, data.table solution does not allow you to make a copy. As mutate will make the object larger, the non-copious behaviour in data.table will save me from “cannot allocate object of size X GB” doom.
all_equal(
mutate(tibSml,
gain = dep_delay - arr_delay,
hours = air_time / 60
),
dtSml[,":="(
gain = dep_delay - arr_delay,
hours = air_time / 60
)]
)
[1] TRUE
Whereas, its synonym dplyr::transmute is implemented nicely in DT[,.()].
all_equal(
transmute(tib,
dep_time,
hour = dep_time %/% 100,
minute = dep_time %% 100
),
dt[,.(dep_time,
hour = dep_time %/% 100,
minute = dep_time %% 100)]
)
[1] TRUE
So, DT[,.()] can act as dplyr::transmute and dplyr::select.
dplyr::summarise reduce data.frame to a single row. In data.table, there is no such concept. However, some functions behave exactly like summarise (e.g., mean, max, median, hist).
As mentioned in previous section, the following 2 outputs should have the same value.
transmute(tib,delay = mean(dep_delay,na.rm=T))
# A tibble: 336,776 x 1
delay
<dbl>
1 12.6
2 12.6
3 12.6
4 12.6
5 12.6
6 12.6
7 12.6
8 12.6
9 12.6
10 12.6
# … with 336,766 more rows
dt[,.(delay = mean(dep_delay,na.rm=T))]
delay
1: 12.63907
However, the data.table reduced it to a single row. Surprisingly, the following code results in equal value.
all_equal(transmute(tib,delay = mean(dep_delay,na.rm=T)/dep_delay),
dt[,.(delay = mean(dep_delay,na.rm=T)/dep_delay)])
[1] TRUE
The above scenario highlights the ambiguity in data.table. These are some perks that I learned from experience. As a result, summarised mean can be done in data.table by:
all_equal(
summarise(tib, delay = mean(dep_delay,na.rm = T)),
dt[,.(delay = mean(dep_delay,na.rm = T))]
)
[1] TRUE
There’s no explicit dplyr::group_by implementation in data.table, but you can summarise by group using DT[,.(),keyby]
tibDay <- group_by(flights, year, month, day)
all_equal(
summarise(tibDay, delay = mean(dep_delay, na.rm = TRUE)),
dt[,.(delay = mean(dep_delay, na.rm = TRUE)),keyby=.(year,month,day)]
)
[1] "- Rows in x but not in y: 15, 274, 302, 303\n- Rows in y but not in x: 15, 274, 302, 303\n"
Chaining can be done with the pipe operator, %>% in dplyr and DT[…][…] in data.table. It is mentioned that the purpose of the pipe is to remove intermediary objects, so that there is no need to name them. Naming is hard, I couldn’t agree more! Here is an example of how chaining can be executed in both package:
all_equal(
tib %>%
group_by(dest) %>%
summarise(
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
) %>%
filter(count > 20, dest != "HNL"),
dt[,.(count = .N,
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)),
keyby=.(dest)][count>20&dest!="HNL"],
tolerance=0.1
)
[1] "- Rows in x but not in y: 4, 23, 30, 45\n- Rows in y but not in x: 4, 23, 30, 45\n"
This will be the end of the first part of exposing myself with the tidyverse. As of now, I notice that dplyr and data.table can solve the same set of problems. In the future, I will explore the dplyr’s method in handling table joins which is essential in analysing relational tables.
Thank you for reading!
For attribution, please cite this work as
Haetami (2019, March 13). artidata.io: Testing The Water of Tidyverse (Part 1). Retrieved from blog.artidata.io/posts/2019-03-13-testing-the-water-of-tidyverse-part-1/
BibTeX citation
@misc{haetami2019testing, author = {Haetami, Imaduddin}, title = {artidata.io: Testing The Water of Tidyverse (Part 1)}, url = {blog.artidata.io/posts/2019-03-13-testing-the-water-of-tidyverse-part-1/}, year = {2019} }