SchemaRDD

A SchemaRDD combines an RDD with schema information. It also offers a rich, easy-to-use set of APIs (that is, the DataSet API). SchemaRDD is no longer used directly as of Spark 2.0: it was renamed DataFrame in Spark 1.3, and its role is now filled by the DataFrame and Dataset APIs.

A schema describes how structured data is logically organized. Once it has the schema information, the SQL engine can provide structured query capability over the corresponding data. The DataSet API stands in for the Spark SQL parser: instead of parsing a SQL string, the program builds the logical plan tree directly through API calls. All subsequent processing steps reuse Spark SQL's core logic, so the DataSet API's processing functions can safely be considered equivalent to SQL queries.
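To make the role of the schema concrete, here is a minimal sketch in plain Python. It is a toy model, not Spark code: `Field`, `column_index`, and `select_where` are hypothetical names invented for this illustration, and the sample rows are made up. It only shows why an engine needs schema information before it can answer a query that refers to columns by name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """One column of a toy schema: a name and a Python type (hypothetical)."""
    name: str
    dtype: type


# Hypothetical schema: it names and types each position in a row.
schema = [Field("name", str), Field("age", int)]

# Raw rows are plain tuples; on their own they carry no column names.
rows = [("Alice", 34), ("Bob", 19), ("Carol", 52)]


def column_index(schema, name):
    """Resolve a column name to its position by consulting the schema."""
    for i, field in enumerate(schema):
        if field.name == name:
            return i
    raise KeyError(f"unknown column: {name}")


def select_where(rows, schema, select, where_col, predicate):
    """Toy analogue of: SELECT <select> FROM rows WHERE predicate(<where_col>)."""
    s = column_index(schema, select)
    w = column_index(schema, where_col)
    return [row[s] for row in rows if predicate(row[w])]


# Roughly "SELECT name FROM rows WHERE age > 30" in this toy model.
older_than_30 = select_where(rows, schema, "name", "age", lambda age: age > 30)
print(older_than_30)
```

Without the schema, the query could only refer to tuple positions; with it, the same request can be phrased in terms of column names, which is what a SQL-style interface requires.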

SchemaRDD is a subclass of RDD. When a program calls the DataSet API, a new SchemaRDD object is created, and its logical plan is built by adding a new logical operation node on top of the original logical plan tree. Like RDD operations, DataSet API operations are of two types: Transformation and Action.

APIs for relational operations are Transformations.

Operations that output data are Actions. As with RDDs, a Spark job is triggered and submitted to the cluster for execution only when an Action is called.
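The plan-building and lazy-execution behaviour described above can be sketched with a small toy model in plain Python. This is not Spark's implementation: `PlanNode`, `ToySchemaRDD`, `describe`, and the `jobs_run` counter are hypothetical names made up for illustration, and execution here is an in-memory loop rather than a cluster job.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PlanNode:
    """One node of a toy logical plan tree (hypothetical, not a Spark class)."""
    op: str                             # "scan", "filter", or "project"
    arg: object = None                  # data, predicate, or column index
    child: Optional["PlanNode"] = None  # the plan this node is stacked on


class ToySchemaRDD:
    jobs_run = 0  # how many times a plan has actually been executed

    def __init__(self, plan):
        self.plan = plan

    # Transformations: return a NEW object whose plan is the old plan
    # with one more node on top. Nothing is executed here.
    def where(self, predicate):
        return ToySchemaRDD(PlanNode("filter", predicate, self.plan))

    def select(self, index):
        return ToySchemaRDD(PlanNode("project", index, self.plan))

    # Action: the only place where the plan is evaluated.
    def collect(self):
        ToySchemaRDD.jobs_run += 1
        return _execute(self.plan)


def _execute(node):
    """Evaluate the plan bottom-up, starting from the scan node."""
    if node.op == "scan":
        return list(node.arg)
    rows = _execute(node.child)
    if node.op == "filter":
        return [r for r in rows if node.arg(r)]
    if node.op == "project":
        return [r[node.arg] for r in rows]
    raise ValueError(f"unknown operation: {node.op}")


def describe(node):
    """List the operations from the top of the plan down to the scan."""
    ops = []
    while node is not None:
        ops.append(node.op)
        node = node.child
    return " <- ".join(ops)


base = ToySchemaRDD(PlanNode("scan", [("Alice", 34), ("Bob", 19)]))
query = base.where(lambda r: r[1] > 30).select(0)

print(describe(base.plan))    # the original plan is left untouched
print(describe(query.plan))   # two nodes were stacked on top of the scan
print(ToySchemaRDD.jobs_run)  # still 0: transformations ran no job
result = query.collect()      # the action triggers execution
print(result, ToySchemaRDD.jobs_run)
```

In this sketch, `where` and `select` play the part of Transformations and `collect` plays the part of an Action. The counter stays at zero until `collect` is called, which mirrors the rule that only an Action triggers a job.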
