
Exploring the Spark ecosystem

Apache Spark is considered a general-purpose system in the big data world. It ships with a set of libraries that help you perform various kinds of analytics on your data: batch analytics, real-time analytics, machine learning, and much more.

The following built-in libraries are available in Spark:

  • Spark Core: As its name suggests, the Spark Core library contains the core modules of Spark. It covers the basics of the Spark programming model, including the RDD and the various transformations and actions that can be performed on it. Essentially, all batch analytics that can be expressed with the Spark programming model using the MapReduce paradigm are part of this library, and it can be used to analyze data of many different varieties (see the word-count sketch after this list).
  • Spark Streaming: The Spark Streaming library consists of modules that let users run near-real-time stream processing on incoming data, addressing the velocity dimension of big data. It includes connectors for listening to various streaming sources and performing analytics in near real time on the data received from those sources (see the streaming sketch after this list).
  • Spark SQL: The Spark SQL library helps to analyze structured data using the widely known SQL query language. It provides the Dataset API, which lets you view structured data in the form of a table and run SQL on top of it. The library includes many of the functions available in the SQL dialects of relational databases, and it also lets you write your own functions, called user-defined functions (UDFs); see the SQL sketch after this list.
  • MLlib: Spark MLlib helps to apply various machine learning techniques to your data, leveraging the distributed and scalable capabilities of Spark. It bundles learning algorithms and utilities for classification, regression, clustering, decomposition, collaborative filtering, and so on (see the k-means sketch after this list).
  • GraphX: The Spark GraphX library provides APIs for graph-based computations. With the help of this library, users can perform parallel computations on graph-structured data; GraphX is one of the fastest ways of performing graph-based computations (see the graph sketch after this list).
  • SparkR: The SparkR library is used to run R scripts or commands on a Spark cluster, providing a distributed execution environment for R. Spark ships with a shell called sparkR that can be used to run R scripts on the cluster, and users who are more familiar with R can use tools such as RStudio or the R shell to execute R scripts that will run on the Spark cluster.
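To make the Spark Core RDD model concrete, here is a minimal word-count sketch in Scala: lazy transformations build the lineage, and an action triggers the distributed computation. The input path and the local master setting are assumptions for illustration, not part of the original text.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Transformations are lazy: flatMap, map, and reduceByKey only record the lineage
    val counts = sc.textFile("input.txt") // hypothetical input path
      .flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // collect is an action: it triggers the actual distributed computation
    counts.collect().foreach(println)
    sc.stop()
  }
}
```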
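A minimal Spark Streaming sketch follows, assuming a text stream arriving on localhost:9999 (the host, port, and five-second batch interval are placeholders): each micro-batch of incoming lines is word-counted in near real time.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SocketWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SocketWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5)) // 5-second micro-batches

    // Listen to a streaming source; host and port are placeholders
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .print() // emit each micro-batch's counts to stdout

    ssc.start()
    ssc.awaitTermination()
  }
}
```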
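The SQL sketch below views a small in-memory dataset as a table, queries it with SQL, and registers a user-defined function. The table contents and the shout UDF are invented for illustration.

```scala
import org.apache.spark.sql.SparkSession

object SqlExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SqlExample").master("local[*]").getOrCreate()
    import spark.implicits._

    // View structured data in the form of a table
    val people = Seq(("Alice", 34), ("Bob", 28)).toDF("name", "age")
    people.createOrReplaceTempView("people")

    // Register a user-defined function (UDF) and call it from SQL
    spark.udf.register("shout", (s: String) => s.toUpperCase)
    spark.sql("SELECT shout(name) AS name, age FROM people WHERE age > 30").show()

    spark.stop()
  }
}
```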
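As an MLlib illustration, here is a k-means clustering sketch over a tiny in-memory dataset. The data points and the choice of k = 2 are made up; real input would come from a distributed source.

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object KMeansExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("KMeansExample").master("local[*]").getOrCreate()
    import spark.implicits._

    // Tiny made-up dataset with two obvious clusters
    val data = Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
      Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)
    ).map(Tuple1.apply).toDF("features")

    // Fit a k-means model with two clusters and print the centers
    val model = new KMeans().setK(2).setSeed(1L).fit(data)
    model.clusterCenters.foreach(println)

    spark.stop()
  }
}
```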
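Finally, the graph sketch: a toy GraphX graph built from vertex and edge RDDs, with connected components computed in parallel. The vertex IDs and attributes are arbitrary examples.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object GraphExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("GraphExample").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Vertices are (id, attribute) pairs; edges carry a source, destination, and attribute
    val vertices = sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c"), (4L, "d")))
    val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1)))
    val graph = Graph(vertices, edges)

    // Connected components: vertices 1-3 share a component; vertex 4 is isolated
    graph.connectedComponents().vertices.collect().foreach(println)
    sc.stop()
  }
}
```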