官术网_书友最值得收藏!

Chapter 1. Hello, Data!

Most projects in R start with loading at least some data into the running R session. As R supports a variety of file formats and database backend, there are several ways to do so. In this chapter, we will not deal with basic data structures, which are already familiar to you, but will concentrate on the performance issue of loading larger datasets and dealing with special file formats.

Note

For a quick overview on the standard tools and to refresh your knowledge on importing general data, please see Chapter 7 of the official An Introduction to R manual of CRAN at http://cran.r-project.org/doc/manuals/R-intro.html#Reading-data-from-files or Rob Kabacoff's Quick-R site, which offers keywords and cheat-sheets for most general tasks in R at http://www.statmethods.net/input/importingdata.html. For further materials, please see the References section in the Appendix.

Although R has its own (serialized) binary RData and rds file formats, which are extremely convenient to use for all R users as these also store R object meta-information in an efficient way, most of the time we have to deal with other input formats—provided by our employer or client.

One of the most popular data file formats is flat files, which are simple text files in which the values are separated by white-space, the pipe character, commas, or more often by semi-colon in Europe. This chapter will discuss several options R has to offer to load these kinds of documents, and we will benchmark which of these is the most efficient approach to import larger files.

Sometimes we are only interested in a subset of a dataset; thus, there is no need to load all the data from the sources. In such cases, database backend can provide the best performance, where the data is stored in a structured way preloaded on our system, so we can query any subset of that with simple and efficient commands. The second section of this chapter will focus on the three most popular databases (MySQL, PostgreSQL, and Oracle Database), and how to interact with those in R.

Besides some other helper tools and a quick overview on other database backend, we will also discuss how to load Excel spreadsheets into R—without the need to previously convert those to text files in Excel or Open/LibreOffice.

Of course this chapter is not just about data file formats, database connections, and such boring internals. But please bear in mind that data analytics always starts with loading data. This is unavoidable, so that our computer and statistical environment know the structure of the data before doing some real analytics.

主站蜘蛛池模板: 顺义区| 平安县| 平昌县| 满城县| 安阳市| 阳泉市| 镇原县| 高台县| 措勤县| 深水埗区| 马山县| 琼中| 富平县| 高台县| 大同县| 长垣县| 弋阳县| 玉溪市| 阿瓦提县| 拉孜县| 山东省| 宝清县| 古交市| 徐州市| 靖宇县| 湟中县| 鞍山市| 隆回县| 阿拉善右旗| 康乐县| 孙吴县| 临湘市| 东丽区| 泰兴市| 漳浦县| 德兴市| 伽师县| 明光市| 丹东市| 乌鲁木齐市| 邯郸市|