
Spark architecture

Let's have a look at the Apache Spark architecture, including a high-level overview and a brief description of some of the key software components.

High-level overview

At a high level, the Apache Spark application architecture consists of the following key software components, and it is important to understand each of them to get to grips with the intricacies of the framework:

  • Driver program
  • Master node
  • Worker node
  • Executor
  • Tasks
  • SparkContext
  • SQL context
  • Spark session

Here's an overview of how some of these software components fit together within the overall architecture:

Figure 1.15: Apache Spark application architecture - Standalone mode

Driver program

The driver program is the main program of your Spark application. The machine where the Spark application process (the one that creates the SparkContext and SparkSession) runs is called the Driver node, and the process itself is called the Driver process. The driver program communicates with the Cluster Manager to distribute tasks to the executors.
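As a minimal sketch of what such a driver program looks like in Scala (the application name, master URL, and input path here are hypothetical placeholders, not taken from the text):

import org.apache.spark.{SparkConf, SparkContext}

object MyDriverApp {
  def main(args: Array[String]): Unit = {
    // The driver process starts here and creates the SparkContext,
    // which connects to the cluster manager.
    val conf = new SparkConf()
      .setAppName("MyDriverApp")   // hypothetical application name
      .setMaster("local[*]")       // placeholder; on a cluster this points at the cluster manager
    val sc = new SparkContext(conf)

    // Work submitted through sc is split into tasks that run on executors.
    val counts = sc.textFile("data/input.txt")   // hypothetical input path
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.take(10).foreach(println)

    sc.stop()   // release the driver's connection to the cluster
  }
}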

Cluster Manager

A cluster manager, as the name indicates, manages a cluster. As discussed earlier, Spark can work with a multitude of cluster managers, including YARN, Mesos, and the standalone cluster manager. The standalone cluster manager consists of two long-running daemons: one on the master node and one on each of the worker nodes. We'll talk more about cluster managers and deployment models in Chapter 8, Operating in Clustered Mode.
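As a rough illustration, the master URL that an application is configured with determines which cluster manager it connects to; the host names and ports below are placeholders:

import org.apache.spark.SparkConf

// The master URL tells Spark which cluster manager to use.
// All host names and ports here are illustrative placeholders.
val standaloneConf = new SparkConf().setMaster("spark://master-host:7077")  // standalone cluster manager
val mesosConf      = new SparkConf().setMaster("mesos://mesos-host:5050")   // Mesos
val yarnConf       = new SparkConf().setMaster("yarn")                      // YARN
val localConf      = new SparkConf().setMaster("local[*]")                  // local mode, no cluster manager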

Worker

If you are familiar with Hadoop, a worker node is similar to a slave node. Worker machines are the machines where the actual work happens, in the form of execution inside Spark executors. The worker process reports the available resources on its node to the master node. Typically, each node in a Spark cluster except the master runs a worker process. We normally start one Spark worker daemon per worker node, which then starts and monitors the executors for the applications.

Executors

The master allocates the resources and uses the workers running across the cluster to create executors for the driver. The driver can then use these executors to run its tasks. Executors are launched only when a job execution starts on a worker node. Each application has its own executor processes, which can stay up for the duration of the application and run tasks in multiple threads. A side effect of this design is application isolation: data is not shared between applications. Executors are responsible for running tasks and for keeping the data in memory or disk storage across them.
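As an illustrative sketch, executor resources are typically controlled through configuration properties such as the following; the values shown are arbitrary examples, not recommendations from the text:

import org.apache.spark.SparkConf

// Illustrative executor sizing; the numbers below are arbitrary examples.
val conf = new SparkConf()
  .setAppName("ExecutorSizingExample")            // hypothetical application name
  .set("spark.executor.memory", "4g")             // memory allocated to each executor process
  .set("spark.executor.cores", "2")               // cores (task threads) per executor
  .set("spark.executor.instances", "10")          // number of executors (used on YARN)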

Tasks

A task is a unit of work that is sent to one executor. More specifically, it is a command sent from the driver program to an executor by serializing your function object. The executor deserializes the command (which it can do because the code is part of your JAR, which has already been loaded) and executes it on a partition.

A partition is a logical chunk of data distributed across a Spark cluster. In most cases, Spark reads data from distributed storage and partitions it in order to parallelize processing across the cluster. For example, if you are reading data from HDFS, a partition is created for each HDFS block. Partitions are important because Spark runs one task for each partition, so the number of partitions determines the degree of parallelism. Spark therefore attempts to set the number of partitions automatically unless you specify it manually, for example sc.parallelize(data, numPartitions).
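A minimal sketch of this relationship, assuming the shell's sc variable is available as in the earlier examples:

// Request an explicit number of partitions; Spark will run one task per partition.
val data = 1 to 1000000
val rdd = sc.parallelize(data, 8)          // explicitly request 8 partitions
println(rdd.partitions.length)             // prints 8 -- Spark runs 8 tasks for this stage
val doubledSum = rdd.map(_ * 2).reduce(_ + _)  // each task processes one partition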

SparkContext

SparkContext is the main entry point to Spark functionality. It is your connection to the Spark cluster and can be used to create RDDs, accumulators, and broadcast variables on that cluster. Only one SparkContext should be active per JVM, and hence you should call stop() on the active SparkContext before you create a new one. You might have noticed previously that in local mode, whenever we start a Python or Scala shell, a SparkContext object is created automatically and the variable sc refers to it. We did not need to create the SparkContext, but instead started using it to create RDDs from text files.
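Outside the shell, a minimal sketch of creating, using, and stopping a SparkContext might look like this (the application name and file path are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

// Create the context explicitly; in the shell this is done for us as sc.
val conf = new SparkConf().setAppName("SparkContextExample").setMaster("local[*]")
val sc = new SparkContext(conf)

val lines = sc.textFile("data/sample.txt")   // hypothetical path; creates an RDD from a text file
println(lines.count())

sc.stop()   // stop the active SparkContext before creating a new one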

Spark Session

SparkSession is the entry point to programming Spark with the Dataset and DataFrame API.
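A minimal sketch of creating a SparkSession (the application name and input path are placeholders):

import org.apache.spark.sql.SparkSession

// Build or reuse a session; the Dataset and DataFrame APIs are accessed through it.
val spark = SparkSession.builder()
  .appName("SparkSessionExample")   // hypothetical application name
  .master("local[*]")
  .getOrCreate()

val df = spark.read.json("data/people.json")   // hypothetical input path
df.show()

// The underlying SparkContext remains available if the RDD API is needed.
val sc = spark.sparkContext

spark.stop()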
