- Learning Apache Spark 2
- Muhammad Asif Abbasi
Shared variables
Spark, being an MPP environment, generally does not provide shared state, because the code is executed in parallel on remote cluster nodes. Separate copies of data and variables are used during the map() or reduce() phases, and providing a read-write shared variable across multiple executing tasks would be grossly inefficient. Spark, however, provides two types of shared variables:
- Broadcast variables - Read-only variables cached on each machine
- Accumulators - Variables that can only be added to, using associative and commutative operations
Broadcast variables
Large-scale data movement is often a major factor in degrading performance in MPP environments, and hence every care is taken to reduce data movement when working in a clustered environment. One of the ways to reduce data movement is to cache frequently accessed data objects on the machines, which is essentially what Spark's broadcast variables are about - keeping read-only variables cached on each machine rather than shipping a copy with each task. This is often required when you need the same copy of a small dataset (typically a dimension table) accessible to every node in the cluster. Spark distributes the data to the worker nodes using a very efficient broadcast algorithm:
- Broadcast variables are set by the calling program/driver program and will be retrieved by the workers across the cluster
- Since the objective is to share the data across the cluster, they are read-only after they have been set
- The value of a broadcast variable is retrieved and stored only on the first read
A very common example is processing weblogs, where the weblogs contain only the pageId, whereas the page titles are stored in a lookup table. During the analysis of the weblogs you might want to join the pageId from the weblog to the one in the lookup table to identify which particular page was being browsed, which page gets the most hits, which page loses the most customers, and so on. This can be done by broadcasting the web page lookup table across the cluster, as sketched below. For a fuller example of broadcast variables, please visit Appendix, There's More with Spark.
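As a minimal sketch of this pattern (in PySpark, with a hypothetical page-title dictionary and made-up weblog records), the driver broadcasts the small lookup table once and each task reads the cached copy through .value:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "BroadcastLookupExample")

# Hypothetical lookup table: pageId -> page title (a small dimension table)
page_titles = {1: "Home", 2: "Products", 3: "Checkout"}

# The driver sets the broadcast variable; workers retrieve the cached copy on first read
bc_titles = sc.broadcast(page_titles)

# Hypothetical weblog records: (pageId, userId)
weblogs = sc.parallelize([(1, "u42"), (3, "u42"), (2, "u17"), (1, "u99")])

# Each task reads bc_titles.value locally instead of shipping the lookup table with every task
hits_by_title = (weblogs
                 .map(lambda rec: (bc_titles.value.get(rec[0], "Unknown"), 1))
                 .reduceByKey(lambda a, b: a + b))

print(hits_by_title.collect())   # e.g. [('Home', 2), ('Checkout', 1), ('Products', 1)]
sc.stop()
```

The lookup table is serialized to each executor once and cached there, rather than being embedded in the closure of every task.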
Accumulators
Accumulators are variables that support associative and commutative properties, which are essential for parallel computations. They are often required to implement counters and are natively supported by Spark for numeric types. Accumulators are different from broadcast variables because:
- They are not read-only
- Executors across the cluster can add to the value of the accumulator variables
- The driver program can access the value of the accumulator variables
For an example on Accumulators, please visit Appendix, There's More with Spark. A brief sketch of the idea follows.
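Here is a minimal sketch under similar assumptions (PySpark, hypothetical weblog lines, counting records with a malformed pageId): executors only add to the accumulator from within an action, and the driver reads the final value afterwards.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "AccumulatorExample")

# Numeric accumulator created on the driver; executors may only add to it
bad_records = sc.accumulator(0)

# Hypothetical weblog lines in "pageId,userId" format; a non-numeric pageId counts as bad
lines = sc.parallelize(["1,u42", "x,u17", "3,u99", ",u07"])

def check(line):
    page_id, _ = line.split(",", 1)
    if not page_id.isdigit():
        bad_records.add(1)   # addition is associative and commutative, so task order does not matter

lines.foreach(check)          # foreach is an action, which triggers the accumulator updates
print(bad_records.value)      # the driver reads the final value: 2

sc.stop()
```

Updating the accumulator inside an action such as foreach() keeps the count exact, because Spark applies each task's accumulator updates only once for actions; inside transformations the updates may be re-applied if a task is retried.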