- PySpark Cookbook
- Denny Lee Tomasz Drabas
Spark context parallelize method
Under the covers, quite a few actions happen when you create an RDD. Let's start with the RDD creation and break down this code snippet:
myRDD = sc.parallelize(
    [('Mike', 19), ('June', 18), ('Rachel', 16), ('Rob', 18), ('Scott', 17)]
)
Focusing first on the argument passed to sc.parallelize(), we created a Python list (that is, [A, B, ..., E]) composed of tuples (that is, ('Mike', 19), ('June', 18), ..., ('Scott', 17)). The sc.parallelize() method is the SparkContext's method for creating a parallelized collection. It allows Spark to distribute the data across multiple nodes, instead of depending on a single node to process the data.
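To build intuition for what parallelize does to the driver's list, here is a minimal plain-Python sketch that slices the list into a requested number of partitions. This is only an illustration of the idea of splitting data into chunks; it is not Spark's actual partitioner, and the function name partition is our own.

```python
def partition(data, num_partitions):
    """Split a list into num_partitions contiguous chunks (illustrative only)."""
    size = len(data)
    return [
        data[(i * size) // num_partitions:((i + 1) * size) // num_partitions]
        for i in range(num_partitions)
    ]

data = [('Mike', 19), ('June', 18), ('Rachel', 16), ('Rob', 18), ('Scott', 17)]

# Splitting five records across two hypothetical workers:
parts = partition(data, 2)
```

Each chunk would live on a different worker node, so subsequent operations on myRDD can run on all chunks at once.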

Now that we have created myRDD as a parallelized collection, Spark can operate against this data in parallel. For example, we can call myRDD.reduceByKey(add) to sum the values grouped by key; we cover recipes for RDD operations in subsequent sections of this chapter.
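To see what reduceByKey(add) computes over this data, here is a plain-Python emulation of its semantics: group the values by key, then fold each group with the supplied function. This sketch runs on one machine, whereas Spark performs the same aggregation in a distributed fashion; the helper name reduce_by_key is our own.

```python
from collections import defaultdict
from functools import reduce
from operator import add

def reduce_by_key(pairs, func):
    """Emulate RDD.reduceByKey locally: fold the values of each key with func."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return [(key, reduce(func, values)) for key, values in grouped.items()]

data = [('Mike', 19), ('June', 18), ('Rachel', 16), ('Rob', 18), ('Scott', 17)]

# The keys in this sample are unique, so each sum is just the original value.
result = reduce_by_key(data, add)
```

With a duplicate key, for example a second ('Mike', 1) record, the values for 'Mike' would be added together, which is exactly the aggregation reduceByKey performs per key across partitions.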