- Learning Spark SQL
- Aurobindo Sarkar
Combining data using a JOIN operation
In this section, we introduce the JOIN operation by combining the daily household electric power consumption data with the weather data. We assume that the locations where the household electric power readings and the weather readings were taken are close enough for the two to be relevant to each other.
Next, we use the join operation to combine the daily household electric power consumption Dataset with the weather Dataset.
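Since the original listing is not included in this excerpt, the following is a minimal sketch of such a join. The Dataset names finalDayDF and weatherDF, and the join key date, are illustrative assumptions, not the book's actual identifiers:

```scala
// Sketch: inner join of daily power readings with daily weather readings
// on a shared "date" column. Names finalDayDF and weatherDF are assumed.
val joinedDF = finalDayDF.join(weatherDF, Seq("date"))

joinedDF.cache()   // the joined Dataset is reused in the analyses below
joinedDF.show(5)
```

Passing the key as Seq("date") keeps a single date column in the output instead of duplicating it from both sides.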

Next, verify that the number of rows in the resulting DataFrame matches the number of rows expected after the join operation, as follows:
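One way to sketch this check, under the same assumed names as above: an inner join keeps only dates present in both inputs, so the joined count should not exceed either input's count.

```scala
// Compare input and output row counts; with one weather row per date,
// the joined count should equal the number of matching dates.
println(finalDayDF.count())
println(weatherDF.count())
println(joinedDF.count())
```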

You can compute a series of correlations between columns of the newly joined Dataset, which now contains columns from both of the original Datasets, to get a feel for the strength and direction of the relationships between them, as follows:
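A sketch using the DataFrame stat functions; the column names here (globalActivePower, meanTemperature, and so on) are assumptions standing in for whichever numeric columns the joined Dataset actually contains:

```scala
// stat.corr computes the Pearson correlation coefficient between two
// numeric columns, returning a Double in [-1, 1].
println(joinedDF.stat.corr("globalActivePower", "meanTemperature"))
println(joinedDF.stat.corr("globalReactivePower", "meanTemperature"))
println(joinedDF.stat.corr("globalActivePower", "totalPrecipitation"))
```

Values near -1 or 1 indicate a strong negative or positive linear relationship; values near 0 indicate little linear relationship.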

Similarly, you can join the Datasets grouped by year and month to get a higher-level summary of the data.
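A sketch of the monthly roll-up and join; the name joinedMonthlyDF matches the text that follows, while the input names and aggregated columns are illustrative assumptions:

```scala
import org.apache.spark.sql.functions._

// Aggregate each input by (year, month), then join on those keys.
val monthlyPowerDF = finalDayDF
  .groupBy(year(col("date")).as("year"), month(col("date")).as("month"))
  .agg(avg("globalActivePower").as("avgActivePower"),
       avg("globalReactivePower").as("avgReactivePower"))

val monthlyWeatherDF = weatherDF
  .groupBy(year(col("date")).as("year"), month(col("date")).as("month"))
  .agg(avg("meanTemperature").as("avgTemperature"))

val joinedMonthlyDF = monthlyPowerDF.join(monthlyWeatherDF, Seq("year", "month"))
```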

In order to visualize the summarized data, we can execute the preceding statements in an Apache Zeppelin notebook. For instance, we can plot the monthly Global Reactive Power (GRP) values by transforming joinedMonthlyDF into a table and then selecting the appropriate columns from it, as follows:
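A sketch of the Zeppelin workflow; the view name joinedMonthly and the selected column names are assumptions:

```scala
// In a Zeppelin notebook paragraph: expose the DataFrame as a temporary
// view so it can be queried with the %sql interpreter.
joinedMonthlyDF.createOrReplaceTempView("joinedMonthly")

// A subsequent %sql paragraph such as:
//   select year, month, avgReactivePower from joinedMonthly order by year, month
// renders the result with Zeppelin's built-in charting, where you can
// select the columns to plot.
```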


Similarly, if you want to analyze readings by the day of the week, then follow the steps shown:
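One way to sketch this, again with assumed column names:

```scala
import org.apache.spark.sql.functions._

// Derive a day-of-week column from the date ("EEEE" gives the full day
// name, for example "Monday"), then aggregate by it.
val joinedDayOfWeekDF = joinedDF
  .withColumn("dayOfWeek", date_format(col("date"), "EEEE"))

joinedDayOfWeekDF
  .groupBy("dayOfWeek")
  .agg(avg("globalActivePower").as("avgActivePower"))
  .show()
```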

Finally, we print the schema of the joined Dataset (augmented with the day of the week column) so you can further explore the relationships between various fields of this DataFrame:
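Continuing the sketch above, printing the schema of the augmented Dataset is a single call:

```scala
// Lists each column with its type and nullability, including the
// added day-of-week column.
joinedDayOfWeekDF.printSchema()
```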

In the next section, we shift our focus to munging textual data.