In this chapter, we will examine Apache Spark SQL, DataFrames, and Datasets, which are built on top of Resilient Distributed Datasets (RDDs). DataFrames were introduced in Spark 1.3, essentially replacing SchemaRDDs; they are columnar data structures roughly equivalent to relational database tables. Datasets were introduced as an experimental API in Spark 1.6 and became a first-class part of the API in Spark 2.0.
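To make the distinction concrete, the following minimal Scala sketch contrasts an untyped DataFrame with a typed Dataset; the sample data, application name, and local master setting are illustrative assumptions rather than anything prescribed in this chapter:

```scala
import org.apache.spark.sql.SparkSession

// Entry point for DataFrame and Dataset functionality (discussed shortly).
val spark = SparkSession.builder()
  .appName("dataframes-vs-datasets") // arbitrary name for this sketch
  .master("local[*]")                // local mode, suitable for experimenting
  .getOrCreate()

import spark.implicits._

// A DataFrame is an untyped, table-like collection of Row objects
// with named columns, much like a relational database table.
val df = Seq(("Alice", 34), ("Bob", 45)).toDF("name", "age")
df.printSchema()

// A Dataset wraps the same data in a compile-time type via an encoder,
// so field access and filters are checked by the compiler.
case class Person(name: String, age: Int)
val ds = df.as[Person]
ds.filter(_.age > 40).show()

spark.stop()
```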
We have tried to reduce the dependencies between individual chapters as much as possible so that you can work through them in any order you like. However, we do recommend that you read this chapter first, because the others rely on a working knowledge of DataFrames and Datasets.
This chapter will cover the following topics:
SparkSession
Importing and saving data
Processing text files
Processing JSON files
Processing Parquet files
DataSource API
DataFrames
Datasets
Using SQL
User-defined functions
RDDs versus DataFrames versus Datasets
Before moving on to SQL, DataFrames, and Datasets, we will start with an overview of SparkSession.
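As a quick preview of that overview, the following sketch shows how a SparkSession is typically obtained in Spark 2.x; the application name, the local master setting, and the input path are illustrative choices for a development environment:

```scala
import org.apache.spark.sql.SparkSession

// SparkSession unifies the older SQLContext and HiveContext and is the
// single entry point for SQL, DataFrames, and Datasets in Spark 2.x.
val spark = SparkSession.builder()
  .appName("spark-sql-chapter")   // arbitrary application name
  .master("local[*]")             // omit when submitting to a cluster
  .getOrCreate()

// Readers for the data sources discussed later (text, JSON, Parquet)
// are exposed through spark.read.
val people = spark.read.json("people.json") // hypothetical input path
people.show()

spark.stop()
```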