官术网_书友最值得收藏!

Introduction

Resilient Distributed Datasets (RDDs) are collections of immutable JVM objects that are distributed across an Apache Spark cluster. Please note that if you are new to Apache Spark, you may want to initially skip this chapter as Spark DataFrames/Datasets are both significantly easier to develop and typically have faster performance. More information on Spark DataFrames can be found in the next chapter.

An RDD is the most fundamental dataset type of Apache Spark; any action on a Spark DataFrame eventually gets translated into a highly optimized execution of transformations and actions on RDDs (see the paragraph on catalyst optimizer in Chapter 3, Abstracting Data with DataFrames, in the Introduction section). 

Data in an RDD is split into chunks based on a key and then dispersed across all the executor nodes. RDDs are highly resilient, that is, there are able to recover quickly from any issues as the same data chunks are replicated across multiple executor nodes. Thus, even if one executor fails, another will still process the data. This allows you to perform your functional calculations against your dataset very quickly by harnessing the power of multiple nodes. RDDs keep a log of all the execution steps applied to each chunk. This, on top of the data replication, speeds up the computations and, if anything goes wrong, RDDs can still recover the portion of the data lost due to an executor error.

While it is common to lose a node in distributed environments (for example, due to connectivity issues, hardware problems), distribution and replication of the data defends against data loss, while data lineage allows the system to recover quickly.

主站蜘蛛池模板: 顺平县| 屏山县| 湟源县| 浪卡子县| 林西县| 三门峡市| 巍山| 湛江市| 鲁甸县| 沙湾县| 皋兰县| 万盛区| 盱眙县| 崇礼县| 苍梧县| 丹巴县| 临泉县| 翁牛特旗| 德清县| 建阳市| 许昌市| 孝昌县| 河南省| 新河县| 新野县| 即墨市| 米易县| 临漳县| 临颍县| 栾川县| 宁强县| 枞阳县| 蓬安县| 赤城县| 哈尔滨市| 克什克腾旗| 股票| 阳山县| 北京市| 共和县| 安阳市|