R can read data from files stored outside the R environment, and it can write data to files that the operating system can store and access. R supports many file formats, such as CSV, Excel, and TXT. In this section, we discuss how to read and write files in several of these formats.
A file referred to by name alone is looked up in the current working directory. If the file is stored elsewhere, either supply its full path or change the working directory to the folder that contains it.
The first step in reading or writing files is to know the working directory. You can find the path of the working directory by running the following code:
>print(getwd())
This prints the path of the current working directory. If it is not the directory you want, set the working directory by passing its path to setwd():
>setwd("")
For instance, the following code makes the folder C:/Users the working directory:
>setwd("C:/Users")
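The two calls can be combined into a short, reversible sketch. It assumes that tempdir() stands in for a project folder and that Sample.csv is only an example file name; neither is a required location:

```r
# Remember the current working directory so it can be restored afterwards.
old_dir <- getwd()

# tempdir() stands in for a project folder here (an assumption for illustration).
setwd(tempdir())
print(getwd())

# A full path built with file.path() works regardless of the working directory.
sample_path <- file.path(old_dir, "Sample.csv")
print(sample_path)

# Restore the original working directory.
setwd(old_dir)
```

Saving the old path before calling setwd() keeps later code from depending on a directory change made earlier in the session.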
How to read and write a CSV format file
A CSV file is a text file in which the values are separated by commas. Let us consider a CSV file with the following content from stock-market data:
To read the preceding file in R, first save this file in the working directory, and then read it (the name of the file is Sample.csv) using the following code:
>data<-read.csv("Sample.csv")
>print(data)
When the preceding code gets executed, it will give the following output:
By default, read.csv() returns the contents of the file as a data frame; this can be checked by running the following code:
>print(is.data.frame(data))
You can now perform your analysis by applying R functions to the data frame, and once the analysis is done, you can write the desired output to a file using the following code:
When the preceding code gets executed, it writes the output file in the working directory folder in CSV format.
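The read, check, and write steps can be sketched end to end as follows. This is a minimal sketch under stated assumptions: the rows are made-up values rather than the stock-market sample above, the files live in tempdir() instead of the working directory, and Output.csv and the Change column are hypothetical names:

```r
# Made-up rows for illustration only; not the stock-market sample from the text.
sample_file <- file.path(tempdir(), "Sample.csv")
writeLines(c(
  "Date,Open,Close,Volume",
  "2016-01-04,100.5,101.2,15000",
  "2016-01-05,101.2,99.8,18000",
  "2016-01-06,99.8,102.4,21000"
), sample_file)

# read.csv() returns a data frame.
data <- read.csv(sample_file)
print(is.data.frame(data))

# Example analysis step: the change between open and close for each row.
data$Change <- data$Close - data$Open

# Write the result; row.names = FALSE leaves out the row-number column.
output_file <- file.path(tempdir(), "Output.csv")
write.csv(data, output_file, row.names = FALSE)

# Read the output back to confirm what was written.
check <- read.csv(output_file)
print(names(check))
```

Without row.names = FALSE, write.csv() adds a first column of row names, which then reappears as an extra column when the file is read back.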
XLSX
Excel is one of the most common formats for storing data; Excel files end with the extension .xls or .xlsx.
The xlsx package is used to read and write .xlsx files in the R environment.
The xlsx package depends on Java, so Java needs to be installed on the system first. The package itself can be installed using the following command:
>install.packages("xlsx")
When the previous command is executed, it may ask you to select a CRAN mirror from which to install the package; choose the one nearest to you. We can verify whether the package has been installed by executing the following command:
>any(grepl("xlsx",installed.packages()))
If the package has been installed successfully, the command returns TRUE.
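As a hedged sketch of how the package might then be used, the following code checks for the exact package name and, only if xlsx can be loaded, writes and reads a small workbook. It assumes the read.xlsx() and write.xlsx() functions of the xlsx package; the data values, the sheet name Prices, and the file name Sample.xlsx are made up for illustration:

```r
# Exact-name check: the pattern match above would also match other packages
# whose names merely contain "xlsx".
xlsx_installed <- "xlsx" %in% rownames(installed.packages())
print(xlsx_installed)

# requireNamespace() also fails gracefully when Java is not available.
if (requireNamespace("xlsx", quietly = TRUE)) {
  # Made-up values for illustration only.
  prices <- data.frame(Symbol = c("AAA", "BBB"), Close = c(10.5, 20.25))
  xlsx_file <- file.path(tempdir(), "Sample.xlsx")

  # Write the data frame to a sheet, then read the first sheet back.
  xlsx::write.xlsx(prices, xlsx_file, sheetName = "Prices", row.names = FALSE)
  from_excel <- xlsx::read.xlsx(xlsx_file, sheetIndex = 1)
  print(from_excel)
}
```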
The Web is one of the main sources of data these days, and we often want to bring data directly from the Web into the R environment. R supports this:
>URL <- "http://ichart.finance.yahoo.com/table.csv?s=^GSPC"
>snp <- read.csv(URL)
>head(snp)
When the preceding code is executed, it brings the data for the S&P 500 index directly into R as a data frame. A portion of the data is displayed here by using the head() function:
Please note that the snp and dji indexes are used for most of the example illustrations in the rest of the book, where they are referred to as snp and dji.
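Remote sources can change or become unavailable, so it can help to wrap the download in error handling. The following is a minimal sketch, not the book's method: read_csv_from_url is a hypothetical helper, and a local file:// URL with made-up values stands in for a web address so that the example runs without network access (the URL construction assumes a Unix-style path):

```r
# Hypothetical helper: return a data frame, or NULL if the URL cannot be read.
read_csv_from_url <- function(url) {
  tryCatch(
    read.csv(url),
    error = function(e) {
      message("Could not read ", url, ": ", conditionMessage(e))
      NULL
    }
  )
}

# Made-up values in a temporary file, addressed through a file:// URL.
local_file <- tempfile(fileext = ".csv")
writeLines(c("Date,Close", "2016-01-04,1.5", "2016-01-05,2.5"), local_file)
local_url <- paste0("file://", normalizePath(local_file, winslash = "/"))

demo <- read_csv_from_url(local_url)
print(head(demo))
```

The same helper can be called with an http or https address; a NULL result signals that the source could not be read.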
Databases
A relational database stores data in a normalized format, so performing statistical analysis directly in the database requires complex and advanced queries. R, however, can easily connect to various relational databases, such as MySQL, Oracle, and SQL Server, and convert their tables into data frames. Once the data is in a data frame, statistical analysis is easy to perform using all the available functions and packages.
In this section, we will take the example of MySQL as reference.
The RMySQL package, which is available from CRAN rather than built into R, provides connectivity with the database; it can be installed using the following command:
>install.packages("RMySQL")
Once the package is installed, we can create a connection object to connect to the database. It takes the username, password, database name, and host name as input. We can supply our own values and use the following command to connect to the required database:
When the database is connected, we can list the tables present in the database by executing the following command:
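A minimal sketch of such a connection, assuming the MySQL() driver from RMySQL and dbConnect() from DBI (a dependency of RMySQL). The helper name connect_mysql and every credential value in the usage comment are hypothetical placeholders:

```r
# Hypothetical helper that opens a MySQL connection through RMySQL.
connect_mysql <- function(user, password, dbname, host = "localhost") {
  if (!requireNamespace("RMySQL", quietly = TRUE)) {
    stop("Package 'RMySQL' is required; install it with install.packages(\"RMySQL\")")
  }
  DBI::dbConnect(
    RMySQL::MySQL(),
    user = user,
    password = password,
    dbname = dbname,
    host = host
  )
}

# Usage (needs a running MySQL server, so it is left commented out;
# the user, password, and database names are placeholders):
# mysqlconnection <- connect_mysql("analyst", "secret", "stockdb")
# DBI::dbDisconnect(mysqlconnection)
```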
>dbListTables(mysqlconnection)
We can query the database using the dbSendQuery() function, and the result is returned to R by the fetch() function. The output is then stored as a data frame:
When the previous code gets executed, it returns the required output.
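The query-and-fetch pattern can be sketched as a small helper. This assumes the dbSendQuery(), fetch(), and dbClearResult() functions from DBI, which RMySQL builds on; fetch_query is a hypothetical name, and the table name in the usage comment is a placeholder:

```r
# Hypothetical helper: send a query and return all rows as a data frame.
fetch_query <- function(conn, sql, n = -1) {
  result <- DBI::dbSendQuery(conn, sql)
  # Release the result set even if fetching fails.
  on.exit(DBI::dbClearResult(result), add = TRUE)
  # n = -1 asks for all remaining rows.
  DBI::fetch(result, n = n)
}

# Usage (requires an open connection such as mysqlconnection;
# "stock_prices" is a placeholder table name):
# prices <- fetch_query(mysqlconnection, "SELECT * FROM stock_prices")
# print(is.data.frame(prices))
```

Clearing the result set after fetching frees the resources the database holds for the query.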
We can query with a filter clause, update rows in database tables, insert data into a database table, create tables, drop tables, and so on by sending queries through dbSendQuery().
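For illustration, the kinds of statements listed above might look as follows. The table names stock_prices and watchlist, the column names, and the values are all hypothetical, and run_statement is an assumed helper that sends a statement through dbSendQuery() and then releases the result:

```r
# Hypothetical SQL statements; table and column names are placeholders.
statements <- c(
  filter = "SELECT * FROM stock_prices WHERE close > 100",
  update = "UPDATE stock_prices SET volume = 0 WHERE volume IS NULL",
  insert = "INSERT INTO stock_prices (symbol, close) VALUES ('ABC', 101.5)",
  create = "CREATE TABLE watchlist (symbol VARCHAR(10))",
  drop   = "DROP TABLE watchlist"
)

# Hypothetical helper for statements that do not return rows.
run_statement <- function(conn, sql) {
  result <- DBI::dbSendQuery(conn, sql)
  DBI::dbClearResult(result)
  invisible(TRUE)
}

# Usage (requires an open connection such as mysqlconnection):
# run_statement(mysqlconnection, statements[["create"]])
# run_statement(mysqlconnection, statements[["drop"]])
print(names(statements))
```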