  • Machine Learning in Java
  • AshishSingh Bhatia, Bostjan Kaluza
  • 2021-06-10 19:30:07

Apache Spark

Apache Spark, or simply Spark, is a platform for large-scale data processing that is built atop Hadoop but, in contrast to Mahout, is not tied to the MapReduce paradigm. Instead, it uses in-memory caches to extract a working set of data, process it, and repeat the query. This is reported to be up to ten times as fast as a Mahout implementation that works directly with data stored on disk. It can be downloaded from https://spark.apache.org.

There are many modules built atop Spark, for instance, GraphX for graph processing, Spark Streaming for processing real-time data streams, and MLlib, a machine learning library featuring classification, regression, collaborative filtering, clustering, dimensionality reduction, and optimization.

Spark's MLlib can use a Hadoop-based data source, for example, Hadoop Distributed File System (HDFS) or HBase, as well as local files. The supported data types include the following:

  • Local vectors are stored on a single machine. A dense vector is represented as an array of double-typed values, for example, (2.0, 0.0, 1.0, 0.0), while a sparse vector is represented by the size of the vector, an array of indices, and an array of values, for example, [4, (0, 2), (2.0, 1.0)].
  • A labelled point is used for supervised learning algorithms and consists of a local vector labelled with a double-typed value. The label is a single double: 0.0 or 1.0 for binary classification, a class index starting from zero for multiclass classification, or a real value for regression. For example, a labelled dense vector is represented as [1.0, (2.0, 0.0, 1.0, 0.0)].
  • Local matrices are stored on a single machine. A dense matrix is defined by its dimensions and a single double array arranged in column-major order.
  • Distributed matrices operate on data stored in Spark's Resilient Distributed Dataset (RDD), which represents a collection of elements that can be operated on in parallel. There are three representations: the row matrix, where each row is a local vector that can be stored on a single machine and the row indices are meaningless; the indexed row matrix, which is similar to the row matrix, but the row indices are meaningful, that is, rows can be identified and joins can be executed; and the coordinate matrix, which is used when a row cannot be stored on a single machine and the matrix is very sparse.
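
To make the local layouts above concrete, here is a minimal plain-Java sketch. It is not MLlib code: the class LocalDataSketch and its helpers toDense and get are hypothetical names introduced only for illustration. It expands the sparse example [4, (0, 2), (2.0, 1.0)] into its dense form and reads one entry from a column-major array.

```java
import java.util.Arrays;

// Plain-Java illustration only; these helpers are hypothetical and not part of MLlib.
final class LocalDataSketch {

    // Expand a sparse vector (size, indices, values) into its dense form.
    static double[] toDense(int size, int[] indices, double[] values) {
        double[] dense = new double[size];           // every entry starts at 0.0
        for (int k = 0; k < indices.length; k++) {
            dense[indices[k]] = values[k];           // place each stored value at its index
        }
        return dense;
    }

    // Read entry (row, col) from a column-major array that has numRows rows.
    static double get(double[] columnMajor, int numRows, int row, int col) {
        return columnMajor[col * numRows + row];     // columns are laid out one after another
    }

    public static void main(String[] args) {
        double[] dense = toDense(4, new int[] {0, 2}, new double[] {2.0, 1.0});
        System.out.println(Arrays.toString(dense));  // [2.0, 0.0, 1.0, 0.0]

        // A 3x2 matrix whose columns are (1, 3, 5) and (2, 4, 6).
        double[] matrix = {1.0, 3.0, 5.0, 2.0, 4.0, 6.0};
        System.out.println(get(matrix, 3, 2, 1));    // 6.0
    }
}
```

In MLlib itself, the corresponding objects are created with factory methods in org.apache.spark.mllib.linalg, such as Vectors.dense, Vectors.sparse, and Matrices.dense; the sketch only mirrors the layout they describe.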

Spark's MLlib API provides interfaces for various learning algorithms and utilities, as outlined in the following list:

  • org.apache.spark.mllib.classification: These are binary and multiclass classification algorithms, including linear SVMs, logistic regression, decision trees, and Naive Bayes
  • org.apache.spark.mllib.clustering: These are k-means clustering algorithms
  • org.apache.spark.mllib.linalg: These are data representations, including dense vectors, sparse vectors, and matrices
  • org.apache.spark.mllib.optimization: These are the various optimization algorithms that are used as low-level primitives in MLlib, including gradient descent, stochastic gradient descent (SGD), update schemes for distributed SGD, and the limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) algorithm
  • org.apache.spark.mllib.recommendation: These are model-based collaborative filtering techniques implemented with alternating least squares matrix factorization
  • org.apache.spark.mllib.regression: These are regression learning algorithms, such as linear least squares, decision trees, Lasso, and Ridge regression
  • org.apache.spark.mllib.stat: These are statistical functions for samples in sparse or dense vector format to compute the mean, variance, minimum, maximum, counts, and nonzero counts
  • org.apache.spark.mllib.tree: This implements classification and regression decision tree-learning algorithms
  • org.apache.spark.mllib.util: This is a collection of methods used for loading, saving, preprocessing, generating, and validating the data
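
As an illustration of the kind of per-column summaries listed for org.apache.spark.mllib.stat, the following plain-Java sketch computes the mean, variance, and nonzero count of each column over a small in-memory sample. It is not the MLlib implementation: the class ColumnStatsSketch is hypothetical, it runs on a single machine rather than on an RDD, and it assumes the unbiased sample variance (dividing by n - 1) and at least two rows.

```java
import java.util.Arrays;

// Plain-Java illustration only; this class is hypothetical and is not MLlib code.
final class ColumnStatsSketch {

    // Mean of each column.
    static double[] mean(double[][] rows) {
        int d = rows[0].length;
        double[] mean = new double[d];
        for (double[] row : rows) {
            for (int j = 0; j < d; j++) mean[j] += row[j];
        }
        for (int j = 0; j < d; j++) mean[j] /= rows.length;
        return mean;
    }

    // Unbiased sample variance of each column (assumption: divide by n - 1, n >= 2).
    static double[] variance(double[][] rows) {
        double[] mean = mean(rows);
        double[] var = new double[mean.length];
        for (double[] row : rows) {
            for (int j = 0; j < mean.length; j++) {
                double diff = row[j] - mean[j];
                var[j] += diff * diff;
            }
        }
        for (int j = 0; j < var.length; j++) var[j] /= (rows.length - 1);
        return var;
    }

    // Number of nonzero entries in each column.
    static long[] numNonzeros(double[][] rows) {
        long[] counts = new long[rows[0].length];
        for (double[] row : rows) {
            for (int j = 0; j < counts.length; j++) {
                if (row[j] != 0.0) counts[j]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        double[][] sample = {{1.0, 10.0}, {2.0, 0.0}, {3.0, 20.0}};
        System.out.println(Arrays.toString(mean(sample)));        // [2.0, 10.0]
        System.out.println(Arrays.toString(variance(sample)));    // [1.0, 100.0]
        System.out.println(Arrays.toString(numNonzeros(sample))); // [3, 2]
    }
}
```

In MLlib, comparable summaries over an RDD of vectors are available through Statistics.colStats in org.apache.spark.mllib.stat; the sketch above only shows what those quantities mean on a tiny local sample.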