Book title: Learning Spark SQL
Author: Aurobindo Sarkar
Chapter word count: 259
Last updated: 2021-07-02 18:23:49

Exploring data munging techniques

In this section, we will introduce several data munging techniques using household electric consumption and weather Datasets. The best way to learn these techniques is to practice manipulating the data in various publicly available Datasets (in addition to the ones used here). The more you practice, the better you will get at it. In the process, you will probably evolve your own style and develop several toolsets and techniques to achieve your munging objectives. At a minimum, you should become very comfortable working with and moving between RDDs, DataFrames, and Datasets, and computing counts, distinct counts, and various aggregations to cross-check your results and confirm that they match your intuitive understanding of the Datasets. It is also important to develop the ability to weigh the pros and cons of executing any given munging step before deciding on it.
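As a minimal sketch of what moving between the three abstractions and cross-checking counts can look like, the following can be pasted into the Spark shell. It assumes the shell's predefined `spark` session; the `Reading` case class and the sample rows are made up for illustration and are not part of the Datasets used in this chapter.

```scala
// Assumes the Spark shell, where `spark` (a SparkSession) is predefined.
import spark.implicits._
import org.apache.spark.sql.functions.{avg, countDistinct}

// Hypothetical record type, used only for this illustration.
case class Reading(day: String, power: Double)

// Made-up sample rows standing in for a real Dataset.
val rdd = spark.sparkContext.parallelize(Seq(
  Reading("2007-01-01", 1.5),
  Reading("2007-01-01", 2.5),
  Reading("2007-01-02", 3.0)
))

val ds = rdd.toDS()               // RDD -> Dataset[Reading]
val df = ds.toDF()                // Dataset -> DataFrame
val back = df.as[Reading].rdd     // DataFrame -> Dataset -> RDD

// The record count should be identical in every representation.
val counts = Seq(rdd.count(), ds.count(), df.count(), back.count())

// The same distinct count computed two ways, as a cross-check.
val distinctDaysRdd = rdd.map(_.day).distinct().count()
val distinctDaysDf = df.agg(countDistinct($"day")).first().getLong(0)

// A simple aggregation: average power per day.
df.groupBy($"day").agg(avg($"power").as("avg_power")).orderBy($"day").show()
```

If the values in `counts` differ, or the two distinct counts disagree, a conversion or filter step has probably changed the data in a way you did not intend.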

We will attempt to accomplish the following objectives in this section:

1. Pre-process the household electric consumption Dataset: read the input Dataset, define a case class for the rows, count the number of records, remove the header and rows with missing data values, and create a DataFrame.
2. Compute basic statistics and aggregations.
3. Augment the Dataset with new information relevant to the analysis.
4. Execute other miscellaneous processing steps, if required.
5. Pre-process the weather Dataset, following the same steps as in step 1.
6. Analyze missing data.
7. Combine the Datasets using JOIN and analyze the results.
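The pre-processing objective for the household Dataset can be sketched roughly as follows in the Spark shell. This is only an illustration under stated assumptions: the `;` delimiter, the `?` missing-value marker, the column names, the `HouseholdRow` case class, and the in-memory sample lines are all hypothetical stand-ins, not the actual file layout or the code used later in the chapter.

```scala
// Assumes the Spark shell, where `spark` (a SparkSession) is predefined.
import spark.implicits._

// Hypothetical, trimmed-down row type; a real Dataset would have more columns.
case class HouseholdRow(date: String, time: String, globalActivePower: Double, voltage: Double)

// Made-up sample lines standing in for a text file read with sparkContext.textFile.
val raw = spark.sparkContext.parallelize(Seq(
  "Date;Time;Global_active_power;Voltage",
  "16/12/2006;17:24:00;4.216;234.840",
  "16/12/2006;17:25:00;?;?",
  "16/12/2006;17:26:00;5.374;233.290"
))

// Count the records as read, header included.
val totalLines = raw.count()

// Remove the header line.
val header = raw.first()
val noHeader = raw.filter(_ != header)

// Remove rows in which any field holds the assumed missing-value marker.
val complete = noHeader.filter(line => !line.split(";").contains("?"))

// Parse the remaining lines into the case class and create a DataFrame.
val householdDF = complete
  .map(_.split(";"))
  .map(f => HouseholdRow(f(0), f(1), f(2).toDouble, f(3).toDouble))
  .toDF()

val cleanCount = householdDF.count()
```

Comparing `totalLines`, `noHeader.count()`, and `cleanCount` after each step shows how many records each filter removed, which is a cheap way to confirm that the cleaning did what you expected.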

Start the Spark shell now, and follow along as you read through this and the subsequent sections.

Import all required classes used in this section:
