- PySpark Cookbook
- Denny Lee, Tomasz Drabas
- 2021-06-18 19:06:35
Introduction
Resilient Distributed Datasets (RDDs) are collections of immutable JVM objects that are distributed across an Apache Spark cluster. If you are new to Apache Spark, you may want to skip this chapter at first: Spark DataFrames/Datasets are both significantly easier to develop with and typically faster. More information on Spark DataFrames can be found in the next chapter.
An RDD is the most fundamental dataset type in Apache Spark; any action on a Spark DataFrame is eventually translated into a highly optimized execution of transformations and actions on RDDs (see the paragraph on the Catalyst optimizer in the Introduction section of Chapter 3, Abstracting Data with DataFrames).
Data in an RDD is split into chunks based on a key and then dispersed across all the executor nodes. RDDs are highly resilient, that is, they are able to recover quickly from any issues because the same data chunks are replicated across multiple executor nodes. Thus, even if one executor fails, another can still process the data. This allows you to perform functional calculations against your dataset very quickly by harnessing the power of multiple nodes. RDDs also keep a log of all the execution steps applied to each chunk (the data lineage). This, on top of the data replication, speeds up the computations and, if anything goes wrong, lets an RDD recompute the portion of the data lost due to an executor error.
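To make the idea of key-based chunks and a log of execution steps more concrete, here is a minimal sketch in plain Python. It is a toy model and not Spark's API or implementation: the names ToyRDD, partition_by_key, and NUM_PARTITIONS are hypothetical, and assigning chunks by taking an integer key modulo the number of chunks is an assumption made for illustration only. The sketch shows that a lost chunk can be rebuilt by replaying the logged steps on that chunk's source data.

```python
NUM_PARTITIONS = 3  # hypothetical number of chunks (one per executor)


def partition_by_key(records, num_partitions):
    """Assign each (key, value) record to a chunk based on its integer key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[key % num_partitions].append((key, value))
    return partitions


class ToyRDD:
    """Toy stand-in for an RDD: source chunks plus a log (lineage) of steps."""

    def __init__(self, source_partitions, lineage=()):
        self._source = source_partitions
        self._lineage = tuple(lineage)

    def map(self, fn):
        # A transformation returns a new ToyRDD; the original is left untouched.
        return ToyRDD(self._source, self._lineage + (fn,))

    def compute_partition(self, index):
        """Build (or rebuild) one chunk by replaying the logged steps."""
        rows = self._source[index]
        for fn in self._lineage:
            rows = [fn(row) for row in rows]
        return rows

    def collect(self):
        """Gather the rows of every chunk, in chunk order."""
        return [
            row
            for index in range(len(self._source))
            for row in self.compute_partition(index)
        ]


records = [(k, k * 10) for k in range(6)]
base = ToyRDD(partition_by_key(records, NUM_PARTITIONS))
doubled = base.map(lambda kv: (kv[0], kv[1] * 2))

# Each "executor" holds the computed result for one chunk.
cached = {i: doubled.compute_partition(i) for i in range(NUM_PARTITIONS)}

# Simulate losing the executor that held chunk 1, then recover that chunk
# by replaying the lineage instead of recomputing the whole dataset.
del cached[1]
cached[1] = doubled.compute_partition(1)
```

Only the lost chunk is recomputed here; the other chunks are left as they were, which is the property that the log of execution steps is meant to provide.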
Losing a node is common in distributed environments (for example, due to connectivity issues or hardware problems). Distribution and replication of the data defend against data loss, and data lineage allows the system to recover quickly.