
Apache Spark

Apache Spark (https://spark.apache.org/) is a unified analytics engine for large-scale data processing. Spark provides APIs for batch as well as stream data processing in a distributed computing environment. Spark's API can be broadly divided into the following five categories:

  • Core: RDD
  • SQL (structured data): DataFrames and Datasets
  • Streaming: Structured Streaming and DStreams
  • MLlib: Machine learning
  • GraphX: Graph processing
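
To make the first two categories concrete, here is a minimal sketch that runs the same small computation through the Core (RDD) API and the SQL (DataFrame) API. The object name, method names, application name, and the local[*] master are illustrative assumptions rather than anything prescribed by Spark or by this book, and the sketch assumes the spark-sql module is on the classpath.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical example object; the names here are for illustration only.
object ApiCategoriesSketch {

  // Core API: build an RDD from a local collection, square each element,
  // and reduce the results to a single sum.
  def sumOfSquares(spark: SparkSession): Int = {
    val numbers = spark.sparkContext.parallelize(Seq(1, 2, 3, 4, 5))
    numbers.map(n => n * n).reduce(_ + _)
  }

  // SQL API: put the same values in a DataFrame with a named column
  // and count the rows whose value is even.
  def evenCount(spark: SparkSession): Long = {
    import spark.implicits._
    val df = Seq(1, 2, 3, 4, 5).toDF("n")
    df.filter($"n" % 2 === 0).count()
  }

  def main(args: Array[String]): Unit = {
    // A local session for experimentation; master and app name are assumptions.
    val spark = SparkSession.builder()
      .appName("api-categories-sketch")
      .master("local[*]")
      .getOrCreate()

    println(s"RDD sum of squares: ${sumOfSquares(spark)}")
    println(s"DataFrame even count: ${evenCount(spark)}")

    spark.stop()
  }
}
```

The RDD version works with plain Scala functions on the elements, while the DataFrame version expresses the filter as a column expression that Spark SQL can optimize before execution.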

Apache Spark is a very active open source project. New features are added and performance improvements made on a regular basis. Typically, there is a new minor release of Apache Spark every three months with significant performance and feature improvements. At the time of writing, 2.4.0 is the most recent version of Spark.

The following SBT settings add the Spark SQL module, which pulls in Spark core as a transitive dependency:

scalaVersion := "2.11.12"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.1"
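
For context, here is how those two settings might sit in a complete build.sbt. This is only a sketch: the project name and version are placeholders, not values taken from this book. The %% operator asks SBT to append the Scala binary version to the artifact name, so with scalaVersion set to 2.11.12 the dependency resolves to the spark-sql_2.11 artifact.

```scala
// build.sbt (sketch; the name and version values are placeholders)
name := "spark-examples"
version := "0.1.0"

scalaVersion := "2.11.12"

// %% appends the Scala binary version (2.11 here) to the artifact name.
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.1"

// Equivalent explicit form, using a single % and the full artifact name:
// libraryDependencies += "org.apache.spark" % "spark-sql_2.11" % "2.4.1"
```

Because the artifact name follows scalaVersion, the Scala version and the Spark dependency stay in step without editing the dependency line.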

Spark 2.4.0 introduced support for Scala 2.12; however, we will use Scala 2.11 to explore Spark's features. Spark is covered in more detail in subsequent chapters.
