- Learning Spark SQL
- Aurobindo Sarkar
- 2021-07-02 18:23:49
Exploring data munging techniques
In this section, we introduce several data munging techniques using household electric consumption and weather Datasets. The best way to learn these techniques is to practice manipulating the data in various publicly available Datasets (in addition to the ones used here). The more you practice, the better you will get at it. In the process, you will probably evolve your own style and develop toolsets and techniques to achieve your munging objectives. At a minimum, you should become comfortable working with and moving between RDDs, DataFrames, and Datasets, and computing counts, distinct counts, and various aggregations to cross-check your results against your intuitive understanding of the Datasets. It is also important to develop the ability to weigh the pros and cons of executing any given munging step.
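As a minimal sketch of what moving between the three abstractions and cross-checking counts can look like, the snippet below uses a handful of made-up readings. It assumes it is pasted into a running Spark shell (Spark 2.x), where `spark` is already defined; the `Reading` case class and its fields are hypothetical and exist only for this illustration.

```scala
// In spark-shell, `spark` (a SparkSession) is predefined; this import is
// already in effect there and is repeated only for clarity.
import spark.implicits._

// Hypothetical record type, used only for this illustration.
case class Reading(house: String, kw: Double)

// Start from an RDD of tuples (made-up values, one exact duplicate).
val rdd = spark.sparkContext.parallelize(
  Seq(("A", 1.5), ("A", 2.0), ("B", 0.7), ("B", 0.7)))

// RDD -> DataFrame -> typed Dataset -> back to an RDD.
val df = rdd.toDF("house", "kw")
val ds = df.as[Reading]
val backToRdd = ds.rdd

// Cross-check: the row count should agree across all representations.
println(rdd.count() == df.count() && df.count() == ds.count()
  && ds.count() == backToRdd.count())

// Distinct count and a simple per-group aggregation.
println(ds.distinct().count())
df.groupBy("house").avg("kw").show()
```

Comparing counts before and after each conversion or filter is a cheap way to catch rows that were silently dropped or duplicated.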
We will attempt to accomplish the following objectives in this section:
- Pre-process the household electric consumption Dataset: read the input Dataset, define a case class for the rows, count the number of records, remove the header and rows with missing data values, and create a DataFrame
- Compute basic statistics and aggregations
- Augment the Dataset with new information relevant to the analysis
- Execute other miscellaneous processing steps, if required
- Pre-process the weather Dataset, following steps similar to those used for the household electric consumption Dataset
- Analyze missing data
- Combine the Datasets using JOIN and analyze the results
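The first two objectives can be sketched roughly as follows. This is an illustration under stated assumptions, not the listing used in the rest of the section: the sample lines are made up, the semicolon delimiter and the `?` marker for missing values are assumptions about the input format, and the `PowerRow` case class with its field names is hypothetical. It assumes a running Spark shell, where `spark` is already defined.

```scala
import spark.implicits._

// Hypothetical row type; the field names are illustrative only.
case class PowerRow(date: String, time: String, activePower: Double, voltage: Double)

// Made-up sample standing in for the input file. With a real file, the
// RDD would come from spark.sparkContext.textFile(...) instead.
val lines = spark.sparkContext.parallelize(Seq(
  "Date;Time;Global_active_power;Voltage",
  "16/12/2006;17:24:00;4.216;234.840",
  "16/12/2006;17:25:00;?;?",
  "16/12/2006;17:26:00;5.374;233.290"
))

// Count the raw records, header included.
println(lines.count())

// Remove the header and any row with a missing value (assumed to be "?"),
// then parse the remaining rows into the case class.
val header = lines.first()
val clean = lines
  .filter(_ != header)
  .filter(!_.contains("?"))
  .map(_.split(";"))
  .map(f => PowerRow(f(0), f(1), f(2).toDouble, f(3).toDouble))

// Create a DataFrame and compute basic statistics on the numeric columns.
val powerDF = clean.toDF()
println(powerDF.count())
powerDF.describe("activePower", "voltage").show()
```

Printing the count before and after filtering shows how many rows the cleaning step discarded, which is worth checking before any aggregation.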
Start the Spark shell now and follow along as you read through this and the subsequent sections.
Import all required classes used in this section:
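The exact set of imports depends on the steps that follow. As a sketch, assuming Spark 2.x and a running Spark shell (where `spark` is predefined), imports along these lines cover typical munging work such as schema types, built-in column functions, and the implicit conversions used by `toDF` and `as[...]`:

```scala
// Schema types such as StructType, StringType, DoubleType.
import org.apache.spark.sql.types._
// Built-in column functions such as col, avg, count, to_date.
import org.apache.spark.sql.functions._
// Core abstractions referred to throughout the section.
import org.apache.spark.sql.{DataFrame, Dataset, Row}
// Implicit encoders and conversions (already imported by spark-shell;
// repeated here so the list is explicit).
import spark.implicits._
```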
