Testing The Water of Tidyverse (Part 1)

The journey of data.table-ian exploring the universe of Hadley and Co created.

Imaduddin Haetami
03-13-2019

Table of Contents


Background Information

I’ve been doing data transformation with R for about 4 years now. At the very beginning, I was introduced with data.table package by my supervisor. My first task was to use data.table::fread which is blazingly fast compared to base::read.csv. That function left me a good impression such that I am willing to climb the steep learning curve of the package. Fast forward, I now don’t even know how to work with data.frame.

“You’ve been doing data science for 4 years, but never heard of Tidyverse?!” Don’t get me wrong, I’ve used ggplot2, lubridate and stringr extensively on my scripts.
Besides, data.table complement nicely with functions from those packages. However, I’ve never used dplyr, tidyr, readr, purr, tibble, etc. I have always thought of them as a substitute of data.table. As I’m comfortable with my current workflow, I have never tried to learn them.

There’s a number of discourse between tidyverse and data.table. I don’t think this blog post will be more comprehensive than them. Frankly, I’m always working with a large dataset and I feel that speed is very important. Hence data.table is my go-to (Hadley said so!).

That’s a lot of background information. This will be my first attempt on understanding the tidyverse ecosystem. As it progress, I will try to make an objective comparison between them. I’m going to closely follow data transformation chapter of R for Data Science. Let’s dive in!

Method of Comparison

Loading the relevant packages and dataset:


library(nycflights13)
library(tidyverse)
library(data.table)

tib <- flights
dt <- data.table(flights)

I will use dplyr::all_equal extensively in this comparison. To demonstrate, when the 2 objects have different entry:


all_equal(
  tibble(a=c(1,2),b=c(3,4)),
  data.table(a=c(1,2),b=c(3,5))
)

[1] "- Rows in x but not in y: 2\n- Rows in y but not in x: 2\n"

When, all of the entries are the same:


all_equal(
  tibble(a=c(1,2),b=c(3,4)),
  data.table(a=c(1,2),b=c(3,4))
)

[1] TRUE

When the order of row matters, we have to use dplyr::all_equal:


all_equal(
  tibble(a=c(1,2),b=c(3,4)),
  data.table(a=c(2,1),b=c(4,3)),
  ignore_row_order = F,
)

[1] "Same row values, but different order"

Hence, we check whether the tibble and data.table have equal values:


all_equal(tib,dt)

[1] TRUE

Dataset Printing

Unless your dataset is robustly documented (which is never the case), the chance is (always) you have to stare at it to understand. Hence, the method to display the dataset is very crucial. Let’s take a look on how tibble and data.table differ!


tib

# A tibble: 336,776 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013     1     1      517            515         2      830
 2  2013     1     1      533            529         4      850
 3  2013     1     1      542            540         2      923
 4  2013     1     1      544            545        -1     1004
 5  2013     1     1      554            600        -6      812
 6  2013     1     1      554            558        -4      740
 7  2013     1     1      555            600        -5      913
 8  2013     1     1      557            600        -3      709
 9  2013     1     1      557            600        -3      838
10  2013     1     1      558            600        -2      753
# … with 336,766 more rows, and 12 more variables:
#   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#   time_hour <dttm>

dt

        year month day dep_time sched_dep_time dep_delay arr_time
     1: 2013     1   1      517            515         2      830
     2: 2013     1   1      533            529         4      850
     3: 2013     1   1      542            540         2      923
     4: 2013     1   1      544            545        -1     1004
     5: 2013     1   1      554            600        -6      812
    ---                                                          
336772: 2013     9  30       NA           1455        NA       NA
336773: 2013     9  30       NA           2200        NA       NA
336774: 2013     9  30       NA           1210        NA       NA
336775: 2013     9  30       NA           1159        NA       NA
336776: 2013     9  30       NA            840        NA       NA
        sched_arr_time arr_delay carrier flight tailnum origin dest
     1:            819        11      UA   1545  N14228    EWR  IAH
     2:            830        20      UA   1714  N24211    LGA  IAH
     3:            850        33      AA   1141  N619AA    JFK  MIA
     4:           1022       -18      B6    725  N804JB    JFK  BQN
     5:            837       -25      DL    461  N668DN    LGA  ATL
    ---                                                            
336772:           1634        NA      9E   3393    <NA>    JFK  DCA
336773:           2312        NA      9E   3525    <NA>    LGA  SYR
336774:           1330        NA      MQ   3461  N535MQ    LGA  BNA
336775:           1344        NA      MQ   3572  N511MQ    LGA  CLE
336776:           1020        NA      MQ   3531  N839MQ    LGA  RDU
        air_time distance hour minute           time_hour
     1:      227     1400    5     15 2013-01-01 05:00:00
     2:      227     1416    5     29 2013-01-01 05:00:00
     3:      160     1089    5     40 2013-01-01 05:00:00
     4:      183     1576    5     45 2013-01-01 05:00:00
     5:      116      762    6      0 2013-01-01 06:00:00
    ---                                                  
336772:       NA      213   14     55 2013-09-30 14:00:00
336773:       NA      198   22      0 2013-09-30 22:00:00
336774:       NA      764   12     10 2013-09-30 12:00:00
336775:       NA      419   11     59 2013-09-30 11:00:00
336776:       NA      431    8     40 2013-09-30 08:00:00

Tibble starts with number of rows and columns. That’s neat! I can never figure out how many columns are there in data.table. It also print out the classes of each variable. Wow! Why double and datetime though, instead of numeric and POSIXct?

If you use RStudio Console, you will notice that both tibble and data.table are responsive. Re-size console then re-print, it will display differently. Tibble is neat! Instead of the stacking in data.table, tibble hides the further columns. Tibble is colour coded too. I am impressed!

Fortunately, the dataset is pretty straight forward. Let’s move on!

Single Transformation

dplyr::filter

dplyr::filter is equivalent with DT[i,]:


all_equal(
  filter(tib,month == 1, day == 1),
  dt[month==1&day==1,]
)

[1] TRUE

all_equal(
  filter(tib, !(arr_delay > 120 | dep_delay > 120)),
  dt[!(arr_delay > 120 | dep_delay > 120),]
)

[1] TRUE

dplyr::arrange

dplyr::arrange is equivalent with DT[order()]:


all_equal(
  arrange(tib,year,month,day),
  dt[order(year,month,day)],
  ignore_row_order = F,
)

[1] TRUE

Descending order can be achieved by adding minus operator:


all_equal(
  arrange(tib,desc(dep_delay)),
  dt[order(-dep_delay)],
  ignore_row_order = F,
)

[1] TRUE

dplyr::select

dplyr::select is equivalent with DT[,.()]:


all_equal(
  select(tib,year,month,day),
  dt[,.(year,month,day)]
)

[1] TRUE

dplyr::select is supported with several helper functions.


tibSml <- select(tib, 
                 year:day, 
                 ends_with("delay"), 
                 distance, 
                 air_time)
dtSml <- dt[,.(year,
               month,
               day,
               dep_delay,
               arr_delay,
               distance,
               air_time)]
all_equal(tibSml,dtSml)

[1] TRUE

In the above case, it doesn’t bother me to write it out in data.table. However, you can use DT[,..(character)] for selecting columns after evaluating the regular expressions.

dplyr::mutate

dplyr::mutate is similar with DT[,“:=”()]. However, data.table solution does not allow you to make a copy. As mutate will make the object larger, the non-copious behaviour in data.table will save me from “cannot allocate object of size X GB” doom.


all_equal(
  mutate(tibSml,
         gain = dep_delay - arr_delay,
         hours = air_time / 60
  ),
  dtSml[,":="(
    gain = dep_delay - arr_delay,
    hours = air_time / 60
  )]
)

[1] TRUE

Whereas, its synonym dplyr::transmute is implemented nicely in DT[,.()].


all_equal(
  transmute(tib,
            dep_time,
            hour = dep_time %/% 100,
            minute = dep_time %% 100
  ),
  dt[,.(dep_time,
        hour = dep_time %/% 100,
        minute = dep_time %% 100)]
)

[1] TRUE

So, DT[,.()] can act as dplyr::transmute and dplyr::select.

dplyr::summarise

dplyr::summarise reduce data.frame to a single row. In data.table, there is no such concept. However, some functions behave exactly like summarise (e.g., mean, max, median, hist).

As mentioned in previous section, the following 2 outputs should have the same value.


transmute(tib,delay = mean(dep_delay,na.rm=T))

# A tibble: 336,776 x 1
   delay
   <dbl>
 1  12.6
 2  12.6
 3  12.6
 4  12.6
 5  12.6
 6  12.6
 7  12.6
 8  12.6
 9  12.6
10  12.6
# … with 336,766 more rows

dt[,.(delay = mean(dep_delay,na.rm=T))]

      delay
1: 12.63907

However, the data.table reduced it to a single row. Surprisingly, the following code results in equal value.


all_equal(transmute(tib,delay = mean(dep_delay,na.rm=T)/dep_delay),
          dt[,.(delay = mean(dep_delay,na.rm=T)/dep_delay)])

[1] TRUE

The above scenario highlights the ambiguity in data.table. These are some perks that I learned from experience. As a result, summarised mean can be done in data.table by:


all_equal(
  summarise(tib, delay = mean(dep_delay,na.rm = T)),
  dt[,.(delay = mean(dep_delay,na.rm = T))]
)

[1] TRUE

There’s no explicit dplyr::group_by implementation in data.table, but you can summarise by group using DT[,.(),keyby]


tibDay <- group_by(flights, year, month, day)
all_equal(
  summarise(tibDay, delay = mean(dep_delay, na.rm = TRUE)),
  dt[,.(delay = mean(dep_delay, na.rm = TRUE)),keyby=.(year,month,day)]
)

[1] "- Rows in x but not in y: 15, 274, 302, 303\n- Rows in y but not in x: 15, 274, 302, 303\n"

Chaining

Chaining can be done with the pipe operator, %>% in dplyr and DT[…][…] in data.table. It is mentioned that the purpose of the pipe is to remove intermediary objects, so that there is no need to name them. Naming is hard, I couldn’t agree more! Here is an example of how chaining can be executed in both package:


all_equal(
  tib %>% 
    group_by(dest) %>% 
    summarise(
      count = n(),
      dist = mean(distance, na.rm = TRUE),
      delay = mean(arr_delay, na.rm = TRUE)
    ) %>% 
    filter(count > 20, dest != "HNL"),
  dt[,.(count = .N, 
               dist = mean(distance, na.rm = TRUE),
               delay = mean(arr_delay, na.rm = TRUE)),
            keyby=.(dest)][count>20&dest!="HNL"],
  tolerance=0.1
)

[1] "- Rows in x but not in y: 4, 23, 30, 45\n- Rows in y but not in x: 4, 23, 30, 45\n"

This will be the end of the first part of exposing myself with the tidyverse. As of now, I notice that dplyr and data.table can solve the same set of problems. In the future, I will explore the dplyr’s method in handling table joins which is essential in analysing relational tables.

Thank you for reading!

Citation

For attribution, please cite this work as

Haetami (2019, March 13). artidata.io: Testing The Water of Tidyverse (Part 1). Retrieved from blog.artidata.io/posts/2019-03-13-testing-the-water-of-tidyverse-part-1/

BibTeX citation

@misc{haetami2019testing,
  author = {Haetami, Imaduddin},
  title = {artidata.io: Testing The Water of Tidyverse (Part 1)},
  url = {blog.artidata.io/posts/2019-03-13-testing-the-water-of-tidyverse-part-1/},
  year = {2019}
}