- Machine Learning with Spark (Second Edition)
- Rajdeep Dua, Manpreet Singh Ghotra, Nick Pentreath
- 2021-07-09 21:07:55
Data Pipeline in Apache Spark
As we saw in the MovieLens use case, it is quite common to run a sequence of machine learning algorithms to process and learn from data. Another example is a simple text document processing workflow, which can include several stages:
- Split the document's text into words
- Convert the document's words into a numerical feature vector
- Learn a prediction model from feature vectors and labels
Spark MLlib represents such a workflow as a Pipeline, which consists of a sequence of PipelineStages (Transformers and Estimators) that are run in a specific order.
A Pipeline is specified as a sequence of stages, and each stage is either a Transformer or an Estimator. A Transformer converts one DataFrame into another. An Estimator, on the other hand, is a learning algorithm that is fit on a DataFrame to produce a Transformer. Pipeline stages are run in order, and the input DataFrame is transformed as it passes through each stage.
For Transformer stages, the transform() method is called on the DataFrame. For Estimator stages, the fit() method is called to produce a Transformer (which becomes part of the PipelineModel, or fitted Pipeline), and that Transformer's transform() method is then called on the DataFrame.
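To make the fit/transform mechanics concrete, here is a toy re-implementation in plain Python. It is a simplified sketch and not Spark's actual code: the class names (`ToyPipeline`, `Lowercase`, `MajorityLabelEstimator`, and so on) are hypothetical, and a list of dicts stands in for a DataFrame.

```python
from collections import Counter


class Lowercase:
    """Toy Transformer: it only knows how to transform()."""

    def transform(self, rows):
        return [{**r, "text": r["text"].lower()} for r in rows]


class ConstantPredictor:
    """Toy fitted model: a Transformer that adds a 'prediction' field."""

    def __init__(self, label):
        self.label = label

    def transform(self, rows):
        return [{**r, "prediction": self.label} for r in rows]


class MajorityLabelEstimator:
    """Toy Estimator: fit() learns from the data and returns a Transformer."""

    def fit(self, rows):
        majority = Counter(r["label"] for r in rows).most_common(1)[0][0]
        return ConstantPredictor(majority)


class ToyPipelineModel:
    """Fitted pipeline: every stage is now a Transformer."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class ToyPipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                # Estimator stage: fit() produces a Transformer.
                stage = stage.fit(rows)
            fitted.append(stage)
            # The data is transformed before it reaches the next stage.
            rows = stage.transform(rows)
        return ToyPipelineModel(fitted)


train = [
    {"text": "Spark ML", "label": 1},
    {"text": "Spark SQL", "label": 1},
    {"text": "Hadoop", "label": 0},
]
model = ToyPipeline([Lowercase(), MajorityLabelEstimator()]).fit(train)
out = model.transform([{"text": "New DOC"}])
print(out)
```

Note that the original Estimator does not appear in the fitted model; only the Transformer it produced does, which mirrors how a PipelineModel is described above.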