Apache Spark SQL

In this chapter, we will examine Apache Spark SQL, DataFrames, and Datasets, all of which are built on top of Resilient Distributed Datasets (RDDs). DataFrames were introduced in Spark 1.3, where they replaced SchemaRDDs. They are columnar data storage structures roughly equivalent to relational database tables. Datasets were introduced as an experimental feature in Spark 1.6 and became an additional component in Spark 2.0.

We have tried to reduce the dependencies between individual chapters as much as possible so that you can work through them in any order you like. However, we do recommend that you read this chapter, because the other chapters depend on a knowledge of DataFrames and Datasets.

This chapter will cover the following topics:

  • SparkSession
  • Importing and saving data
  • Processing the text files
  • Processing the JSON files
  • Processing the Parquet files
  • DataSource API
  • DataFrames
  • Datasets
  • Using SQL
  • User-defined functions
  • RDDs versus DataFrames versus Datasets

Before moving on to SQL, DataFrames, and Datasets, we will give an overview of SparkSession.