RDD - the first citizen of Spark

The first paper on RDDs, Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, describes them as follows:

"Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner."

Spark is written in a functional programming style, and immutable objects are one of the key concepts of functional programming. An RDD is likewise an immutable dataset.

Formally, we can define an RDD as an immutable distributed collection of objects. It is the primary data type of Spark. It leverages cluster memory and is partitioned across the cluster.

The following is the logical representation of RDD:

RDDs can consist of (key, value) pairs as well. The following is the logical representation of a pair RDD:

Also, as mentioned, an RDD can be partitioned across the cluster. The following is the logical representation of a partitioned RDD in a cluster:
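To make the idea of a partitioned collection of (key, value) pairs concrete, here is a minimal plain-Python sketch. It does not use the Spark API: the function names `partition_for` and `partition_pairs` are hypothetical, and `zlib.crc32` is only a stand-in for whatever key hash a real partitioner would use. It assumes a simple hash-modulo scheme, which is similar in spirit to hash partitioning in Spark, and it uses tuples to mimic immutability.

```python
from zlib import crc32


def partition_for(key: str, num_partitions: int) -> int:
    """Pick a partition index from a stable hash of the key (hypothetical helper)."""
    # crc32 returns a non-negative integer, so the modulo is a valid index.
    return crc32(key.encode("utf-8")) % num_partitions


def partition_pairs(pairs, num_partitions):
    """Split a sequence of (key, value) pairs into one tuple per partition."""
    buckets = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        buckets[partition_for(key, num_partitions)].append((key, value))
    # Freeze each bucket: the result is a new immutable structure,
    # and the input sequence is left untouched.
    return tuple(tuple(bucket) for bucket in buckets)


# A small immutable dataset of (key, value) pairs.
pairs = (("apple", 1), ("banana", 2), ("apple", 3), ("cherry", 4))

# Pretend the cluster has three nodes, each holding one partition.
partitions = partition_pairs(pairs, 3)

for index, part in enumerate(partitions):
    print(index, part)
```

Under this scheme, every pair with the same key lands in the same partition, which is what lets per-key work proceed on one node without consulting the others. Which partition a given key maps to depends on the hash function, so the printed layout is illustrative only.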