Benchmarking text file parsers
Another notable alternative for handling and loading reasonably sized data from flat files into R is the data.table package. Although it has a unique syntax differing from the traditional S-based R syntax, the package comes with great documentation, vignettes, and case studies on the indeed impressive speedup it can offer for various database actions. Such use cases and examples will be discussed in Chapter 3, Filtering and Summarizing Data and Chapter 4, Restructuring Data.
The package ships with a custom R function to read text files with improved performance:
> library(data.table)
> system.time(dt <- fread('hflights.csv'))
   user  system elapsed 
  0.153   0.003   0.158
Loading the data was extremely quick compared to the preceding examples, although it resulted in an R object with a custom data.table class, which can be easily transformed to the traditional data.frame if needed:
> df <- as.data.frame(dt)
Alternatively, the setDF function provides a very fast, in-place method of object conversion without actually copying the data in memory.
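A minimal sketch of that in-place conversion follows; the dt2 object and the copy() call are ours, used here so that dt itself stays a data.table for the examples below:
> dt2 <- copy(dt)  # deep copy via data.table::copy, so dt is left untouched
> setDF(dt2)       # converts dt2 to a data.frame in place, without copying the data
> class(dt2)
[1] "data.frame"
Similarly, please note: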
> is.data.frame(dt)
[1] TRUE
This means that a data.table object can fall back to act as a data.frame for traditional usage. Whether to leave the imported data as is or to transform it to a data.frame depends on how it will be used later: aggregating, merging, and restructuring data with data.table is faster than with the standard data frame format in R. On the other hand, the user has to learn the custom syntax of data.table; for example, DT[i, j, by] stands for "from DT, subset by i, then do j grouped by by". We will discuss it later in Chapter 3, Filtering and Summarizing Data.
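As a quick taste of that syntax, here is a sketch run against the dt object loaded above; the Cancelled, ArrDelay, and UniqueCarrier column names are taken from the hflights dataset:
> # average arrival delay per carrier for flights that were not cancelled:
> # i filters the rows, j computes the aggregate, by defines the groups
> dt[Cancelled == 0, mean(ArrDelay, na.rm = TRUE), by = UniqueCarrier]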
Now, let's compare all the aforementioned data import methods: how fast are they? The final winner seems to be fread from data.table. First, we define some methods to be benchmarked by declaring the test functions:
> .read.csv.orig <- function() read.csv('hflights.csv')
> .read.csv.opt <- function() read.csv('hflights.csv',
+     colClasses = colClasses, nrows = 227496, comment.char = '',
+     stringsAsFactors = FALSE)
> .read.csv.sql <- function() read.csv.sql('hflights.csv')
> .read.csv.ffdf <- function() read.csv.ffdf(file = 'hflights.csv')
> .read.big.matrix <- function() read.big.matrix('hflights.csv',
+     header = TRUE)
> .fread <- function() fread('hflights.csv')
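Note that these helper functions rely on packages and objects introduced in the earlier examples of this chapter; as a reminder, here is a sketch of the prerequisites (the colClasses definition below is our reconstruction of that earlier setup, not necessarily the book's exact code):
> library(sqldf)           # provides read.csv.sql
> library(ff)              # provides read.csv.ffdf
> library(bigmemory)       # provides read.big.matrix
> library(microbenchmark)  # provides microbenchmark, used below
> # guess the column classes from the first few rows (our sketch):
> colClasses <- sapply(read.csv('hflights.csv', nrows = 100), class)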
Now, let's run each of these functions 10 times, instead of the several hundred iterations used previously, simply to save some time:
> res <- microbenchmark(.read.csv.orig(), .read.csv.opt(),
+     .read.csv.sql(), .read.csv.ffdf(), .read.big.matrix(), .fread(),
+     times = 10)
And print the results of the benchmark with a predefined number of digits:
> print(res, digits = 6)
Unit: milliseconds
               expr      min      lq   median       uq      max neval
   .read.csv.orig() 2109.643 2149.32 2186.433 2241.054 2421.392    10
    .read.csv.opt() 1525.997 1565.23 1618.294 1660.432 1703.049    10
    .read.csv.sql() 2234.375 2265.25 2283.736 2365.420 2599.062    10
   .read.csv.ffdf() 1878.964 1901.63 1947.959 2015.794 2078.970    10
 .read.big.matrix() 1579.845 1603.33 1647.621 1690.067 1937.661    10
           .fread()  153.289  154.84  164.994  197.034  207.279    10
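If a visual comparison is preferred, the microbenchmark package also ships plotting methods for the returned object (a sketch; autoplot additionally requires the ggplot2 package):
> boxplot(res)  # base graphics boxplot of the timing distributions
> # or, with ggplot2 loaded:
> library(ggplot2)
> autoplot(res)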
Please note that here we were dealing with a dataset that fits into actual physical memory, while some of the benchmarked packages are designed and optimized for far larger databases. So it seems that optimizing the read.table function gives a great performance boost over the default settings, although if we are after really fast importing of reasonably sized data, the data.table package is the optimal solution.