Title: Learning Spark SQL
Author: Aurobindo Sarkar
Chapter word count: 115
Last updated: 2021-07-02 18:23:49

Using Spark SQL for Data Munging

In this code-intensive chapter, we present key data munging techniques for transforming raw data into a format usable for analysis. We start with general data munging steps that apply in a wide variety of scenarios. Then, we shift our focus to specific types of data, including time-series data and text, and to data preprocessing steps for Spark MLlib-based machine learning pipelines. We will use several datasets to illustrate these techniques.

This chapter covers the following topics:

What is data munging?
Explore data munging techniques
Combine data using joins
Munging on textual data
Munging on time-series data
Dealing with variable length records
Data preparation for machine learning pipelines
