We covered reading the data in R, understanding the data types, and performing various operations on the data. Now, we will see a few concepts that will be used just before an analysis or building a model. While performing an analysis, we might not need to study the entire dataset and we can just focus a subset of it, or, on the other hand, we might have to combine data from multiple data sources. These are the various concepts that will be covered in this chapter.
The most commonly used functionality will be to select the desired column from the dataset. While building the model, we will not be using all the columns in the dataset but just some of them that are more relevant. In order to select the column, we can either specify the column name or number, or simply delete the columns that are not required.
In the preceding code, we first selected the column by its position. The first line of the code will select the first column and then the 5th to 10th column from the dataset, whereas, in the last line, the specified two columns are removed from the dataset. Both the preceding commands will yield the same result.
We can also arrive at a situation where we need to filter the data based on a condition. While building the model, we cannot create a single model for the whole of the population but we should create multiple models based on the behavior present in the population. This can be achieved by subsetting the dataset. In the following code, we will get the data of cars that have an mpg more than 25 alone:
We might also need to consider a sample of the dataset. For example, while building a regression or logistic model, we need to have two datasets—one for the training and the other for the testing. In these cases, we need to choose a random sample. This can be done using the following code:
We considered a random sample of 10 rows from the dataset. Along with these, we might have to merge two different datasets. Let's see how this can be achieved. We can combine the data both row-wise as well as column-wise as follows:
The preceding code is used to combine two datasets that share the same column format. Then we can combine them using the rbind function. Alternatively, if the two datasets have the same length of data but different columns, then we can combine them using the cind or merge functions:
When we have two different datasets with a common column, then we can use the merge function to combine them. On using merge, the dataset will be merged based on the common columns.
These are the essential concepts necessary to prepare the dataset for the analysis, which will be discussed in the next few chapters.