Data Pipeline in Apache Spark

As we saw in the MovieLens use case, it is common to run a sequence of machine learning algorithms to process and learn from data. Another example is a simple text document processing workflow, which can include several stages:

  • Split the document's text into words
  • Convert the document's words into a numerical feature vector
  • Learn a prediction model from feature vectors and labels

Spark MLlib represents such a workflow as a Pipeline, which consists of a sequence of PipelineStages (Transformers and Estimators) that run in a specific order.

A Pipeline is specified as a sequence of stages, and each stage is either a Transformer or an Estimator. A Transformer converts one DataFrame into another, for example by appending a column of features or predictions. An Estimator, on the other hand, is a learning algorithm that is fit on a DataFrame to produce a Transformer. Pipeline stages run in order, and the input DataFrame is transformed as it passes through each stage.

For Transformer stages, the transform() method is called on the DataFrame. For Estimator stages, the fit() method is called to produce a Transformer (which becomes part of the PipelineModel, or fitted Pipeline), and that Transformer's transform() method is then called on the DataFrame.
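The fit-then-transform mechanics described above can be illustrated without Spark. The following is a simplified pure-Python sketch, not Spark's actual implementation: the classes Tokenize, VocabIndexer, and VocabIndexerModel and the helpers fit_pipeline and transform_pipeline are hypothetical stand-ins, and a plain list of strings stands in for a DataFrame.

```python
class Tokenize:
    """Hypothetical Transformer stand-in: splits each text into lowercase words."""

    def transform(self, rows):
        return [text.lower().split() for text in rows]


class VocabIndexerModel:
    """Hypothetical Transformer produced by VocabIndexer.fit()."""

    def __init__(self, index):
        self.index = index

    def transform(self, rows):
        # Words not seen during fit() are dropped.
        return [[self.index[w] for w in words if w in self.index] for words in rows]


class VocabIndexer:
    """Hypothetical Estimator stand-in: fit() learns a vocabulary."""

    def fit(self, rows):
        vocab = sorted({word for words in rows for word in words})
        return VocabIndexerModel({word: i for i, word in enumerate(vocab)})


def fit_pipeline(stages, rows):
    """Run stages in order; Estimators are fit and replaced by their Transformers."""
    fitted = []
    for stage in stages:
        if hasattr(stage, "fit"):
            stage = stage.fit(rows)  # Estimator -> Transformer
        rows = stage.transform(rows)  # data passed on to the next stage
        fitted.append(stage)
    return fitted  # plays the role of the fitted Pipeline


def transform_pipeline(fitted, rows):
    """Apply every fitted Transformer in order to new data."""
    for transformer in fitted:
        rows = transformer.transform(rows)
    return rows


fitted = fit_pipeline([Tokenize(), VocabIndexer()], ["Spark is fast", "Spark pipelines"])
result = transform_pipeline(fitted, ["fast spark unknown"])
print(result)  # [[0, 3]]
```

The fitted list contains only objects with a transform() method, which mirrors how a fitted Pipeline holds Transformers in place of the original Estimators. One simplification to note: this sketch transforms the data after every stage during fitting, including the last one.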
