
Spark context parallelize method

Under the covers, quite a few actions happen when you create your RDD. Let's start with the RDD creation and break down this code snippet:

myRDD = sc.parallelize(
    [('Mike', 19), ('June', 18), ('Rachel', 16), ('Rob', 18), ('Scott', 17)]
)

Focusing first on the statement inside the sc.parallelize() call, we first created a Python list (that is, [A, B, ..., E]) composed of tuples (that is, ('Mike', 19), ('June', 18), ..., ('Scott', 17)). The sc.parallelize() method is the SparkContext's method for creating a parallelized collection. This allows Spark to distribute the data across multiple nodes, instead of depending on a single node to process the data.
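To build intuition for what "parallelized collection" means, here is a simplified, pure-Python sketch (not Spark's actual implementation) of how a collection might be sliced into partitions before being handed to worker nodes; the function name `slice_collection` is hypothetical:

```python
# Simplified illustration (NOT Spark's real code): slice a collection
# into num_partitions roughly equal chunks, the way a parallelized
# collection is split before its pieces are sent to worker nodes.
def slice_collection(data, num_partitions):
    """Split data into num_partitions contiguous, roughly equal chunks."""
    n = len(data)
    return [
        data[(i * n) // num_partitions:((i + 1) * n) // num_partitions]
        for i in range(num_partitions)
    ]

pairs = [('Mike', 19), ('June', 18), ('Rachel', 16), ('Rob', 18), ('Scott', 17)]
partitions = slice_collection(pairs, 2)
# partitions[0] -> [('Mike', 19), ('June', 18)]
# partitions[1] -> [('Rachel', 16), ('Rob', 18), ('Scott', 17)]
```

Each chunk stands in for one partition; in real Spark, each partition is processed by a separate task, potentially on a different node.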

Now that we have created myRDD as a parallelized collection, Spark can operate against this data in parallel. For example, we can call myRDD.reduceByKey(add) (with add imported from Python's operator module) to sum the values grouped by key; we cover recipes for RDD operations in subsequent sections of this chapter.
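To make the reduceByKey(add) semantics concrete without a running Spark cluster, here is a minimal single-machine sketch that computes the same result; the helper name `reduce_by_key` is hypothetical, and Spark of course performs this combination in parallel across partitions rather than in one loop:

```python
from operator import add

# Single-machine sketch of what reduceByKey(add) computes: combine all
# values sharing a key with the given function (here, addition).
def reduce_by_key(pairs, func):
    """Mimic RDD.reduceByKey on a plain list of (key, value) tuples."""
    result = {}
    for key, value in pairs:
        # First value for a key is kept as-is; later values are folded in.
        result[key] = func(result[key], value) if key in result else value
    return sorted(result.items())

pairs = [('Mike', 19), ('June', 18), ('Rob', 18), ('Rob', 1)]
# reduce_by_key(pairs, add) -> [('June', 18), ('Mike', 19), ('Rob', 19)]
```

Note that in our myRDD example every key is unique, so reduceByKey would simply return the pairs unchanged; the operation becomes interesting when keys repeat, as with 'Rob' above.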
