dplyr versus data.table

You might now be wondering, "which package should we use?"

The dplyr and data.table packages provide a strikingly different syntax and a much less decisive difference in performance. Although data.table seems to be slightly faster on larger datasets, there is no clear winner, except when doing aggregations on a high number of groups. And to be honest, the pipe syntax commonly used with dplyr, which is provided by the magrittr package, can also be used with data.table objects if needed.
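To make the last point concrete, here is a minimal sketch of the magrittr pipe applied to a data.table object. The table `dt` and its columns `group` and `value` are made up for illustration, and the example assumes that the data.table and magrittr packages are installed:

```r
library(data.table)
library(magrittr)  # provides the %>% operator

# Small made-up table, for illustration only
dt <- data.table(
  group = c("a", "a", "b", "b", "b"),
  value = c(1, 2, 3, 4, 5)
)

# Plain data.table syntax: filter in i, aggregate in j, group with by
res_dt <- dt[value > 1, .(total = sum(value)), by = group]

# The same two steps written as a pipe; the dot stands for the
# object arriving from the left-hand side
res_pipe <- dt %>%
  .[value > 1] %>%
  .[, .(total = sum(value)), by = group]

# Both approaches should return the same result
stopifnot(isTRUE(all.equal(res_dt, res_pipe)))
```

Whether the piped form is easier to read than the single bracket call is largely a matter of taste; the sketch only shows that the two styles can be combined when needed.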

There is also another R package that provides pipes, called pipeR, which claims to be a lot more efficient on larger datasets than magrittr. This performance gain is due to the fact that the pipeR operators do not try to be smart, unlike the magrittr operator, which is compatible with the |> operator of the F# language. The overhead of the magrittr pipe is sometimes estimated to make a call 5-15 times slower than the same call written without any pipes at all.

One should take into account the community and support behind an R package before spending a considerable amount of time learning how to use it. In a nutshell, the data.table package is without doubt mature enough for production usage, as its development was started around 6 years ago by Matt Dowle, who was working for a large hedge fund at that time. The development has been continuous since then. Matt and Arun (co-developer of the package) release new features and performance tweaks from time to time, and they both seem to be keen on providing support on the public R forums and channels, such as mailing lists and StackOverflow.

On the other hand, dplyr is shipped by Hadley Wickham and RStudio, one of the best-known people and one of the most popular companies in the R community, which translates to an even larger user base and community, and to almost instant support on StackOverflow and GitHub.

In short, I suggest using the packages that fit your needs best, after dedicating some time to discovering the power and features they make available. If you are coming from an SQL background, you'll probably find data.table a lot more convenient, while others may rather opt for the Hadleyverse (take a look at the R package with this name; it installs a bunch of useful R packages developed by Hadley). You should not mix the two approaches in a single project: for both readability and performance reasons, it's better to stick to only one syntax at a time.

To get a deeper understanding of the pros and cons of the different approaches, I will continue to provide multiple implementations of the same problem in the following few pages as well.
