- Machine Learning with Spark (Second Edition)
- Rajdeep Dua, Manpreet Singh Ghotra, Nick Pentreath
- 2021-07-09 21:07:41
SchemaRDD
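SchemaRDD combines an RDD with schema information and exposes a rich, easy-to-use API (what is now the Dataset API). SchemaRDD itself is not part of the Spark 2.0 API: it was renamed DataFrame in Spark 1.3, and since Spark 2.0 DataFrame is a type alias for Dataset[Row]. The concepts described here carry over to the DataFrame and Dataset APIs.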
A schema describes how structured data is logically organized. Once the schema is known, the SQL engine can run structured queries against the corresponding data. The Dataset API stands in for the Spark SQL parser: instead of parsing a SQL string, the API calls build the program's logical plan tree directly. All subsequent processing steps reuse Spark SQL's core logic, so the processing functions of the Dataset API can safely be treated as equivalent to SQL queries.
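To make the role of the schema concrete, here is a minimal sketch in plain Python (not Spark code). It assumes a hypothetical two-column schema and a hand-written `select_where` helper; both are invented for illustration and only show how schema information lets an engine resolve column names against otherwise positional rows.

```python
from typing import NamedTuple

class Field(NamedTuple):
    name: str
    dtype: type

# Hypothetical schema and rows, for illustration only.
schema = [Field("name", str), Field("age", int)]
rows = [("alice", 34), ("bob", 19), ("carol", 52)]

def column_index(schema, name):
    """Resolve a column name to its position using the schema."""
    for i, field in enumerate(schema):
        if field.name == name:
            return i
    raise KeyError(f"unknown column: {name}")

def select_where(rows, schema, column, predicate_column, predicate):
    """Rough analogue of: SELECT column FROM rows WHERE predicate(predicate_column)."""
    out = column_index(schema, column)
    cond = column_index(schema, predicate_column)
    return [row[out] for row in rows if predicate(row[cond])]

adults = select_where(rows, schema, "name", "age", lambda age: age >= 21)
print(adults)
```

Without the schema, the rows are just tuples and a query could only refer to positions; with it, the same data can be queried by column name, which is the capability the SQL engine builds on.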
SchemaRDD is a subclass of RDD. When a program calls the Dataset API, a new SchemaRDD object is created, and its logical plan is formed by adding a new logical operation node to the original object's logical plan tree. Like RDD operations, Dataset API operations are of two types: Transformation and Action.
APIs for relational operations belong to the Transformation type.
Operations associated with data output are of the Action type. As with RDDs, a Spark job is triggered and submitted to the cluster for execution only when an Action operation is called.
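The two ideas above, a new object with one more plan node per call and execution deferred until an Action, can be sketched with a toy model in plain Python. This is not Spark's implementation: `PlanNode`, `ToyDataset`, and `_execute` are hypothetical names invented here, and the "execution" is an ordinary in-memory loop rather than a cluster job.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass(frozen=True)
class PlanNode:
    """One logical operation; `child` is the plan it was added on top of."""
    op: str                              # "Scan", "Filter" or "Project"
    arg: object                          # source rows, predicate, or column name
    child: Optional["PlanNode"] = None

class ToyDataset:
    """Toy stand-in for a SchemaRDD/Dataset: it only wraps a logical plan."""

    def __init__(self, plan: PlanNode):
        self.plan = plan

    # Transformations: return a NEW object with one more plan node. Nothing runs.
    def where(self, predicate: Callable[[dict], bool]) -> "ToyDataset":
        return ToyDataset(PlanNode("Filter", predicate, self.plan))

    def select(self, column: str) -> "ToyDataset":
        return ToyDataset(PlanNode("Project", column, self.plan))

    # Action: walks the plan and actually produces rows.
    def collect(self) -> list:
        return _execute(self.plan)

def _execute(node: PlanNode) -> list:
    if node.op == "Scan":
        return list(node.arg)
    rows = _execute(node.child)
    if node.op == "Filter":
        return [r for r in rows if node.arg(r)]
    if node.op == "Project":
        return [r[node.arg] for r in rows]
    raise ValueError(f"unknown operation: {node.op}")

evaluated = []                           # records every row the predicate sees

def is_adult(row: dict) -> bool:
    evaluated.append(row["name"])
    return row["age"] >= 21

people = ToyDataset(PlanNode("Scan", [
    {"name": "alice", "age": 34},
    {"name": "bob", "age": 19},
]))
names = people.where(is_adult).select("name")

ops = []                                 # plan from newest node down to the scan
node = names.plan
while node is not None:
    ops.append(node.op)
    node = node.child

calls_before_action = len(evaluated)     # transformations alone ran nothing
result = names.collect()                 # the action triggers execution
print(ops, calls_before_action, result)
```

Each `where` or `select` call leaves `people` untouched and stacks a node on the existing plan, while the predicate is only evaluated once `collect` is called. In Spark the analogous step would also pass the plan through analysis and optimization before running it, which this sketch leaves out.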