
Spark context parallelize method

Under the covers, quite a few actions happen when you create your RDD. Let's start with the RDD creation and break down this code snippet:

myRDD = sc.parallelize(
    [('Mike', 19), ('June', 18), ('Rachel', 16), ('Rob', 18), ('Scott', 17)]
)

Focusing first on the statement inside the sc.parallelize() call, we first created a Python list (that is, [A, B, ..., E]) composed of tuples (that is, ('Mike', 19), ('June', 18), ..., ('Scott', 17)). The sc.parallelize() method is the SparkContext's method for creating a parallelized collection. This allows Spark to distribute the data across multiple nodes, instead of depending on a single node to process the data.
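To build intuition for what "parallelized collection" means, here is a simplified, pure-Python sketch (not Spark's actual implementation) of how a collection might be sliced into partitions before being handed to worker nodes; the function name `slice_collection` is hypothetical:

```python
# Simplified illustration (NOT Spark's real code): slice a collection
# into num_partitions roughly equal chunks, the way a parallelized
# collection is split before its pieces are sent to worker nodes.
def slice_collection(data, num_partitions):
    """Split data into num_partitions contiguous, roughly equal chunks."""
    n = len(data)
    return [
        data[(i * n) // num_partitions:((i + 1) * n) // num_partitions]
        for i in range(num_partitions)
    ]

pairs = [('Mike', 19), ('June', 18), ('Rachel', 16), ('Rob', 18), ('Scott', 17)]
partitions = slice_collection(pairs, 2)
# partitions[0] -> [('Mike', 19), ('June', 18)]
# partitions[1] -> [('Rachel', 16), ('Rob', 18), ('Scott', 17)]
```

Each chunk stands in for one partition; in real Spark, each partition is processed by a separate task, potentially on a different node.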

Now that we have created myRDD as a parallelized collection, Spark can operate against this data in parallel. For example, we can call myRDD.reduceByKey(add) (with add imported from Python's operator module) to sum the values grouped by key; we cover recipes for RDD operations in subsequent sections of this chapter.
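To make the reduceByKey(add) semantics concrete without a running Spark cluster, here is a minimal single-machine sketch that computes the same result; the helper name `reduce_by_key` is hypothetical, and Spark of course performs this combination in parallel across partitions rather than in one loop:

```python
from operator import add

# Single-machine sketch of what reduceByKey(add) computes: combine all
# values sharing a key with the given function (here, addition).
def reduce_by_key(pairs, func):
    """Mimic RDD.reduceByKey on a plain list of (key, value) tuples."""
    result = {}
    for key, value in pairs:
        # First value for a key is kept as-is; later values are folded in.
        result[key] = func(result[key], value) if key in result else value
    return sorted(result.items())

pairs = [('Mike', 19), ('June', 18), ('Rob', 18), ('Rob', 1)]
# reduce_by_key(pairs, add) -> [('June', 18), ('Mike', 19), ('Rob', 19)]
```

Note that in our myRDD example every key is unique, so reduceByKey would simply return the pairs unchanged; the operation becomes interesting when keys repeat, as with 'Rob' above.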
