
Exploring data munging techniques

In this section, we will introduce several data munging techniques using household electric consumption and weather Datasets. The best way to learn these techniques is to practice manipulating the data in various publicly available Datasets (in addition to the ones used here). The more you practice, the better you will get at it. In the process, you will probably evolve your own style and develop several toolsets and techniques to achieve your munging objectives. At a minimum, you should get very comfortable working with and moving between RDDs, DataFrames, and Datasets, and with computing counts, distinct counts, and various aggregations to cross-check your results against your intuitive understanding of the Datasets. It is also important to develop the ability to make decisions based on the pros and cons of executing any given munging step.
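As a rough illustration of moving between the three abstractions and cross-checking counts, the following sketch can be pasted into the Spark shell. It assumes the shell's predefined `spark` session; the `SampleReading` case class and the three in-memory rows are hypothetical stand-ins, not part of the Datasets used in this section.

```scala
// Assumes the Spark shell, where `spark` (a SparkSession) is already defined.
import spark.implicits._

// Hypothetical record type, used only for this illustration.
case class SampleReading(day: String, kw: Double)

// Start from an RDD of case class instances (made-up values).
val rdd = spark.sparkContext.parallelize(Seq(
  SampleReading("d1", 1.2),
  SampleReading("d1", 0.8),
  SampleReading("d2", 2.0)))

// RDD -> DataFrame -> Dataset -> RDD
val df = rdd.toDF()
val ds = df.as[SampleReading]
val backToRdd = ds.rdd

// Cross-check: the record count should be the same in every representation.
println(s"rdd=${rdd.count()} df=${df.count()} ds=${ds.count()} rdd2=${backToRdd.count()}")

// A distinct count and a simple aggregation to sanity-check the contents.
val distinctDays = df.select("day").distinct().count()
println(s"distinct days: $distinctDays")
df.groupBy("day").sum("kw").show()
```

If the counts disagree after a conversion or a filtering step, that is usually the first sign that a munging step dropped or duplicated rows.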

We will attempt to accomplish the following objectives in this section:

  1. Pre-process the household electric consumption Dataset: read the input Dataset, define a case class for the rows, count the number of records, remove the header and rows with missing data values, and create a DataFrame
  2. Compute basic statistics and aggregations
  3. Augment the Dataset with new information relevant to the analysis
  4. Execute other miscellaneous processing steps, if required
  5. Pre-process the weather Dataset (similar to step 1)
  6. Analyze missing data
  7. Combine the Datasets using JOIN and analyze the results
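The pre-processing in step 1 can be sketched as follows for the Spark shell. This is a minimal illustration under stated assumptions, not the code used later in this section: the input lines are made up and held in memory rather than read from a file, the `;` separator and the `?` marker for missing values are assumptions about the input format, and the `PowerReading` case class with its three fields is hypothetical.

```scala
// Assumes the Spark shell, where `spark` (a SparkSession) is already defined.
import spark.implicits._

// Hypothetical row type; the actual Dataset may have different columns.
case class PowerReading(date: String, time: String, activePower: Double)

// Made-up sample lines standing in for the input file.
// Assumption: ';' separates fields and '?' marks a missing value.
val lines = spark.sparkContext.parallelize(Seq(
  "Date;Time;Global_active_power",
  "16/12/2006;17:24:00;4.216",
  "16/12/2006;17:25:00;?",
  "16/12/2006;17:26:00;5.374"))

// Count the raw records, header included.
val rawCount = lines.count()

// Remove the header, then the rows containing the missing-value marker.
val header = lines.first()
val data = lines.filter(line => line != header)
val complete = data.filter(line => !line.contains("?"))

// Parse each remaining line into the case class and create a DataFrame.
val readings = complete
  .map(line => line.split(";"))
  .map(f => PowerReading(f(0), f(1), f(2).toDouble))
  .toDF()

println(s"raw=$rawCount, after cleaning=${readings.count()}")
readings.printSchema()
```

Filtering out incomplete rows before calling `toDouble` keeps the parsing step from failing on the missing-value marker; whether to drop such rows or impute them is one of the pros-and-cons decisions mentioned earlier.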

Start the Spark shell now and follow along as you read through this and the subsequent sections.

Import all required classes used in this section:
