
  • Learning Spark SQL
  • Aurobindo Sarkar

Understanding the Catalyst optimizer

The Catalyst optimizer is at the core of Spark SQL and is implemented in Scala. It enables several key features, such as schema inference (from JSON data), that are very useful in data analysis work.
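For example, schema inference kicks in automatically when reading JSON. Here is a minimal sketch; the file path and the JSON contents are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("CatalystExamples")
  .master("local[*]")
  .getOrCreate()

// Spark SQL samples the JSON records and infers the schema
// (column names and types); the path is hypothetical.
val df = spark.read.json("data/people.json")
df.printSchema()
```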

The following figure shows the high-level transformation process from a developer's program containing DataFrames/Datasets to the final execution plan:

The internal representation of the program is a query plan. The query plan describes data operations such as aggregate, join, and filter, which match what is defined in your query. These operations generate a new Dataset from the input Dataset. Once the initial version of the query plan is ready, the Catalyst optimizer applies a series of transformations to convert it into an optimized query plan. Finally, Spark SQL's code generation mechanism translates the optimized query plan into a DAG of RDDs that is ready for execution.

The query plans and the optimized query plans are internally represented as trees. So, at its core, the Catalyst optimizer contains a general library for representing trees and applying rules to manipulate them. On top of this library are several other libraries that are more specific to relational query processing.
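You can inspect the plan produced at each stage of this pipeline from your own code. As a sketch, assuming the df read above contains an age column, Dataset.explain(extended = true) prints all of the plan trees:

```scala
// A simple aggregation query; nothing executes until an action runs.
val query = df.filter("age > 21").groupBy("age").count()

// Prints the parsed logical plan, analyzed logical plan,
// optimized logical plan, and physical plan as trees.
query.explain(extended = true)
```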

Catalyst has two types of query plans: Logical Plans and Physical Plans. The Logical Plan describes the computations on the Datasets without defining how to carry out the specific computations. Typically, the Logical Plan specifies the list of attributes or columns produced as output, along with a set of constraints on the generated rows. The Physical Plan describes the computations on the Datasets with specific definitions of how to execute them (it is executable).
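The queryExecution field on a Dataset exposes both kinds of plan directly. A sketch, reusing the query defined above:

```scala
// What to compute: the optimized Logical Plan.
println(query.queryExecution.optimizedPlan)

// How to compute it: the executable Physical Plan.
println(query.queryExecution.executedPlan)
```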

Let's explore the transformation steps in more detail. The initial query plan is essentially an unresolved Logical Plan; that is, at this stage we do not yet know the source of the Datasets, the columns they contain, or the types of those columns. The first step in this pipeline is the analysis step. During analysis, the catalog information is used to convert the unresolved Logical Plan into a resolved Logical Plan.
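Because analysis resolves column references against the catalog, an unresolvable reference fails at this step, before any job runs. A sketch reusing df from above; the column name is deliberately bogus:

```scala
import org.apache.spark.sql.AnalysisException

// Datasets are analyzed eagerly, so the failure surfaces here,
// not later when an action such as count() or show() is called.
try {
  df.select("no_such_column")
} catch {
  case e: AnalysisException => println(s"Analysis failed: ${e.getMessage}")
}
```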

In the next step, a set of logical optimization rules is applied to the resolved Logical Plan, resulting in an optimized Logical Plan. The optimizer may then generate multiple Physical Plans and compare their costs to pick the best one. The first version of the Cost-based Optimizer (CBO), built on top of Spark SQL, was released in Spark 2.2. More details on cost-based optimization are presented in Chapter 11, Tuning Spark SQL Components for Performance.
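The CBO is disabled by default and relies on statistics gathered ahead of time. A sketch, assuming a table named people is registered in the catalog (the table and column names are hypothetical):

```scala
// Turn on cost-based optimization (off by default in Spark 2.2).
spark.conf.set("spark.sql.cbo.enabled", "true")

// Collect the table- and column-level statistics the CBO needs.
spark.sql("ANALYZE TABLE people COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE people COMPUTE STATISTICS FOR COLUMNS age")
```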

All three--DataFrame, Dataset, and SQL--share the same optimization pipeline, as illustrated in the following figure:
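To see this in practice, compare the optimized plans of the same query written with the DataFrame API and as SQL. Assuming df has name and age columns, the two plans come out identical:

```scala
df.createOrReplaceTempView("people")

val viaDataFrame = df.filter("age > 21").select("name")
val viaSql       = spark.sql("SELECT name FROM people WHERE age > 21")

// Both queries pass through the same Catalyst pipeline,
// so the optimized logical plans match.
println(viaDataFrame.queryExecution.optimizedPlan)
println(viaSql.queryExecution.optimizedPlan)
```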
