
Apache Spark fundamentals

This section covers the Apache Spark fundamentals. It is important to become very familiar with the concepts that are presented here before moving on to the next chapters, where we'll be exploring the available APIs.

As mentioned in the introduction to this chapter, the Spark engine processes data in distributed memory across the nodes of a cluster. The following diagram shows the logical structure of how a typical Spark job processes information:

Figure 1.1

Spark executes a job in the following way:

Figure 1.2

The Master controls how data is partitioned and takes advantage of data locality while keeping track of all the distributed data computation on the Slave machines. If a Slave machine becomes unavailable, the data on that machine is reconstructed on one or more of the remaining available machines. In standalone mode, the Master is a single point of failure. The Cluster mode using different managers section of this chapter covers the possible running modes and explains fault tolerance in Spark.
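
To make the partitioning and reconstruction behavior described above more concrete, here is a minimal pure-Python sketch. It does not use Spark at all: the `ToyMaster` class, the worker names, and the round-robin placement are assumptions made purely for illustration, and real Spark schedules work and tracks data in far more detail. The sketch only shows the general idea that a coordinator which remembers the source data and the transformation can rebuild lost partitions on the surviving machines.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ToyMaster:
    """Hypothetical stand-in for the Master; not part of any Spark API."""
    source: List[int]
    transform: Callable[[int], int]
    workers: List[str]
    num_partitions: int
    # partition id -> worker currently holding the computed partition
    placement: Dict[int, str] = field(default_factory=dict)
    # worker -> {partition id -> computed values}
    storage: Dict[str, Dict[int, List[int]]] = field(default_factory=dict)

    def _slice(self, pid: int) -> List[int]:
        # Round-robin split of the source data into partitions.
        return self.source[pid::self.num_partitions]

    def _compute_on(self, pid: int, worker: str) -> None:
        # Build a partition from the source data and the recorded transformation.
        values = [self.transform(x) for x in self._slice(pid)]
        self.storage.setdefault(worker, {})[pid] = values
        self.placement[pid] = worker

    def run(self) -> None:
        # Assign partitions to workers in round-robin order.
        for pid in range(self.num_partitions):
            self._compute_on(pid, self.workers[pid % len(self.workers)])

    def fail(self, worker: str) -> List[int]:
        """Drop a worker and rebuild its partitions on the survivors."""
        lost = sorted(self.storage.pop(worker, {}))
        self.workers.remove(worker)
        if not self.workers:
            raise RuntimeError("no workers left to rebuild partitions on")
        for i, pid in enumerate(lost):
            self._compute_on(pid, self.workers[i % len(self.workers)])
        return lost

    def collect(self) -> List[int]:
        # Gather every partition from whichever worker holds it.
        out: List[int] = []
        for pid in range(self.num_partitions):
            out.extend(self.storage[self.placement[pid]][pid])
        return sorted(out)


master = ToyMaster(
    source=list(range(10)),
    transform=lambda x: x * x,
    workers=["worker-a", "worker-b", "worker-c"],
    num_partitions=6,
)
master.run()
before = master.collect()
lost = master.fail("worker-b")  # simulate one machine becoming unavailable
after = master.collect()
print(lost, before == after)
```

In this toy model the result collected after the failure matches the result collected before it, because the partitions that lived on the failed worker are rebuilt from the source data rather than copied from the lost machine.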

Spark comes with five major components:

Figure 1.3

These components are as follows:

  • The core engine.
  • Spark SQL: A module for structured data processing.
  • Spark Streaming: An extension of the core Spark API for processing live data streams. Its strengths include scalability, high throughput, and fault tolerance.
  • MLlib: The Spark machine learning library.
  • GraphX: A library for graphs and graph-parallel computation.

Spark can access data stored in different systems, such as HDFS, Cassandra, MongoDB, and relational databases, as well as cloud storage services such as Amazon S3 and Azure Data Lake Storage.
