
  • Learning Spark SQL
  • Aurobindo Sarkar

What is Spark SQL?

Spark SQL is one of the most advanced components of Apache Spark. It has been a part of the core distribution since Spark 1.0 and supports Python, Scala, Java, and R programming APIs. As illustrated in the figure below, Spark SQL components provide the foundation for Spark machine learning applications, streaming applications, graph applications, and many other types of application architectures.

Such applications typically use Spark ML pipelines, Structured Streaming, and GraphFrames, all of which are based on the Spark SQL interfaces (the DataFrame/Dataset API). These applications, along with constructs such as SQL, DataFrames, and the Datasets API, automatically receive the benefits of the Catalyst optimizer. This optimizer is also responsible for generating executable query plans over the lower-level RDD interfaces.
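For example, a Structured Streaming query is written with the same DataFrame operations as a batch query. The following sketch (not from the book) illustrates this using the built-in rate source, which simply emits timestamped rows; it assumes the spark-shell, where a SparkSession named spark is predefined:

// The rate source (illustrative only) generates rows of (timestamp, value)
val stream = spark.readStream
  .format("rate")
  .option("rowsPerSecond", 1)
  .load()

// The same DataFrame operations used in batch queries apply here
val counts = stream.groupBy().count()

// Print the running count to the console
val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()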

We will explore ML pipelines in more detail in Chapter 6, Using Spark SQL in Machine Learning Applications. GraphFrames will be covered in Chapter 7, Using Spark SQL in Graph Applications. While we introduce the key concepts regarding Structured Streaming and the Catalyst optimizer in this chapter, they are covered in more detail in Chapter 5, Using Spark SQL in Streaming Applications, and Chapter 11, Tuning Spark SQL Components for Performance.

In Spark 2.0, the DataFrame API was merged with the Dataset API, thereby unifying data processing capabilities across Spark libraries. This unification also enables developers to work with a single high-level and type-safe API. However, the Spark software stack does not prevent developers from using the low-level RDD interface directly in their applications. Though the low-level RDD API will continue to be available, the vast majority of developers are expected (and recommended) to use the high-level APIs, namely the Dataset and DataFrame APIs.
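As a quick illustration of this unified API, the following spark-shell sketch (the Person case class and the sample data are hypothetical) shows a DataFrame, which is simply a type alias for Dataset[Row], being converted into a typed Dataset:

// Assumes the spark-shell, where a SparkSession named spark is predefined
import spark.implicits._

// A hypothetical case class supplies the schema for the typed Dataset
case class Person(name: String, age: Long)

// Untyped API: toDF produces a DataFrame, that is, a Dataset[Row]
val df = Seq(("Alice", 30L), ("Bob", 19L)).toDF("name", "age")

// Typed API: as[Person] yields a Dataset[Person] checked at compile time
val ds = df.as[Person]
ds.filter(_.age > 21).show()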

Additionally, Spark 2.0 extends Spark SQL capabilities by including a new ANSI SQL parser with support for subqueries and the SQL:2003 standard. More specifically, the subquery support now includes correlated/uncorrelated subqueries, and IN / NOT IN and EXISTS / NOT EXISTS predicates in WHERE / HAVING clauses.
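The following sketch shows the shape of such subqueries; the customers and orders tables are hypothetical and are assumed to be registered as temporary views:

// Correlated EXISTS subquery in a WHERE clause
spark.sql("""
  SELECT c.name
  FROM customers c
  WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id)
""").show()

// Uncorrelated IN subquery in a WHERE clause
spark.sql("""
  SELECT name FROM customers
  WHERE id IN (SELECT customer_id FROM orders WHERE amount > 1000)
""").show()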

At the core of Spark SQL is the Catalyst optimizer, which leverages advanced Scala features, such as pattern matching, to provide an extensible query optimizer. DataFrames, Datasets, and SQL queries all share the same execution and optimization pipeline; hence, there is no performance penalty for choosing one of these constructs over another (or for using any of the supported programming APIs). The high-level DataFrame-based code written by the developer is converted to Catalyst expressions and then to low-level Java bytecode as it passes through this pipeline.
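One simple way to observe this shared pipeline is the explain(true) method, which prints the parsed, analyzed, and optimized logical plans as well as the final physical plan. In the sketch below (the sample data is illustrative, and the spark-shell is assumed), the DataFrame and SQL forms of the same aggregation go through the same optimization steps:

import spark.implicits._

val sales = Seq(("books", 12.0), ("music", 7.5)).toDF("category", "amount")
sales.createOrReplaceTempView("sales")

// DataFrame form: prints parsed, analyzed, optimized, and physical plans
sales.groupBy("category").sum("amount").explain(true)

// Equivalent SQL form: yields the same optimized and physical plans
spark.sql("SELECT category, SUM(amount) FROM sales GROUP BY category").explain(true)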

SparkSession is the entry point into Spark SQL-related functionality, and we describe it in more detail in the next section.
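For reference, a minimal sketch of creating a SparkSession in a standalone application looks as follows (the application name and the local master setting are placeholders; in the spark-shell, a SparkSession named spark is already provided):

import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("LearningSparkSQL")   // placeholder application name
  .master("local[*]")            // run locally on all cores, for illustration
  .getOrCreate()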
