Need FoR Speed EP1

Replace readLines with data.table::fread(sep=NULL) for reading any plain text file.

Imaduddin Haetami http://artidata.io/blog/about.html
03-26-2019

Preamble

I recently encountered a supposedly 16GB csv file. However, the file does not have the same number of commas “,” on each line. The reason is that whoever exported the data did not account for the commas that randomly appear in the address columns. The usual methods of reading csv files in R, such as utils::read.csv, readr::read_csv, and data.table::fread with default settings, will either throw an error or produce an incomplete read. Hence, I had no choice but to use readLines and fix the errors myself.
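
To make this concrete, here is a minimal sketch of the kind of repair I mean, assuming a hypothetical three-column file whose last column is the address, so that any surplus commas on a row belong to that column (the file name and column count are made up for illustration):


lines <- readLines("malformed.csv")  # raw lines, one string per row
fixed <- vapply(strsplit(lines, ",", fixed = TRUE), function(parts) {
  # keep the first two fields, glue the remainder back together as the address
  paste(c(parts[1:2], paste(parts[-(1:2)], collapse = ",")), collapse = "|")
}, character(1))
dt <- data.table::fread(text = fixed, sep = "|", header = FALSE)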

Unfortunately, I had to run readLines several times: there were several files with such errors, or I overwrote an object I had already read. Although I work on my office HPC with a 16-core Intel Xeon Gold 6144 processor, one readLines call still takes me around 10 minutes. So slow~

Problem Replication

Let me replicate the problem and objectively compare the performance. Our data looks very similar to the one below:


# generate 2,000,000 strings, each 100 random characters drawn from letters and ","
len <- 2000000
set.seed(240193)
char <- sapply(1:len,
               function(x)
                 paste(sample(c(LETTERS, ","), 100, replace = TRUE),
                       collapse = ""))

The first 6 samples are:


head(char)

[1] "WAG,UUXAPZOYOBQHFSHBNHKLYYZMOQFOTPJNS,WIDYDTFDXBGZLHWQWFHFHM,BSYENLVK,QIDFSAPMHXRVWWKYOUCELQOJGEEQYY"
[2] "ZDUIBXQCTSZBNIX,THOQMYNHTXOOHNBRSQOYLXWTWDKCNMOI,YSGUWGQBEQOZB,KQFFVPMNRE,XFFYHURKMYVGDLKLMGHRLDBOBU"
[3] "MOXICB,YXCUGPGDUWCZVCXWVHAUGKOLPRALVNQXBAYENZWNNRT,LVEHXVZ,ZRLOATCDPOODTO,ENWJWEXECCQOXGVHNQBUQ,VJSP"
[4] "DZOQIFWDWSKOOW,CASGYCAJEKYFGSDFRGZRZJCOHNMMOQERLESUDB,IHUAAACYWGGVH,,AXBHKMJJULXZNSHXFZAUXGZF,CZHCJI"
[5] "PMEHVZNYCOOIEJUVTDDIOIYLUBVTO,QLORLXYWIUTUMJNZBFYZV,JIOGLLMDLGJYNLFXYRNLNYQU,HTVXWGTPJATESUWGV,YMMTQ"
[6] "AXCTBUCKDLBOFOVPAAOVLYKEEOXRI,FRNRPTDYBLTMVMPJSXNNIFBCZAZPKWRANSMVDHITXOB,QPKGYZNTWRPJX,EKIKHVCLDOXV"

This is a character vector with 2 × 10^6 elements, each a string of 100 random letters and commas. Then, we write it to the hard drive as a text file:


writeLines(char,"char.txt")

The size of the written file in bytes:


file.size("char.txt")

[1] 2.02e+08
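
This matches expectations: 2,000,000 lines × (100 characters + 1 newline character) = 202,000,000 bytes, i.e. about 202 MB.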

Objective Comparison

Now, we compare the 2 techniques:


(t1 <- system.time(ReadLines <- readLines("char.txt")))

   user  system elapsed 
  3.040   0.098   3.138 

library(data.table)
(t2 <- system.time(Fread <- fread("char.txt",sep=NULL,header=F)[,V1]))

   user  system elapsed 
  1.276   0.020   1.295 

Checking equality of the 2 objects:


identical(ReadLines,Fread)

[1] TRUE

Hence, the second method is about 2.42 times as fast as the first. In other words, you can save roughly 58.7% of your time by adopting the second method.
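
For reference, these figures come straight from the elapsed times stored in t1 and t2:


t1[["elapsed"]] / t2[["elapsed"]]               # speed-up factor, about 2.42
(1 - t2[["elapsed"]] / t1[["elapsed"]]) * 100   # percentage of time saved, about 58.7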

I believe the reason for such speed is the built-in parallelization of data.table::fread. It can utilize multiple cores of my CPU:


getDTthreads()

[1] 1

Here is further reading on the parallelization of data.table::fread. In my experience, adding cores generally increases the reading speed of fread. However, your hard disk drive's read speed may become the bottleneck and slow the process down.
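
If you want to experiment with the thread count yourself, data.table lets you inspect and change it; a small sketch (the value 4 below is just an illustration):


library(data.table)
getDTthreads()   # how many threads fread will currently use
setDTthreads(4)  # let fread use up to 4 threads (capped at the number available)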

Thank you for reading!

Citation

For attribution, please cite this work as

Haetami (2019, March 26). artidata.io: Need FoR Speed EP1. Retrieved from blog.artidata.io/posts/2019-03-26-need-for-speed-ep1/

BibTeX citation

@misc{haetami2019need,
  author = {Haetami, Imaduddin},
  title = {artidata.io: Need FoR Speed EP1},
  url = {blog.artidata.io/posts/2019-03-26-need-for-speed-ep1/},
  year = {2019}
}