- Machine Learning with Scala Quick Start Guide
- Md. Rezaul Karim
- 2021-06-24 14:32:03
Preprocessing and feature engineering
According to the dataset description in the UCI Machine Learning Repository, the data contains no null values. Spark ML classifiers also expect numeric inputs, and, as the schema shows, all the required fields are already numeric (either integers or floating-point values). In addition, Spark ML algorithms by default look for a column named label, which in our case is Result_of_Treatment. Let's rename it to label using Spark's withColumnRenamed() method:
// Spark ML algorithms expect a 'label' column, which in our case is 'Result_of_Treatment'. Let's rename it to 'label'.
CryotherapyDF = CryotherapyDF.withColumnRenamed("Result_of_Treatment", "label")
CryotherapyDF.printSchema()
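The two assumptions stated above (no null values, all-numeric columns) can be checked directly rather than taken from the dataset description. The following is a minimal sketch, not the book's own code: the object name `CheckCryotherapyInput`, the helpers `nullCounts` and `nonNumericColumns`, and the two toy rows are hypothetical, and a local SparkSession is assumed.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, count, when}
import org.apache.spark.sql.types.NumericType

object CheckCryotherapyInput {
  // Returns the number of null cells in each column of df.
  def nullCounts(df: DataFrame): Map[String, Long] = {
    // count() skips nulls, so count(when(isNull, 1)) counts only the null cells.
    val exprs = df.columns.map(c => count(when(col(c).isNull, 1)).alias(c))
    val row = df.select(exprs: _*).head()
    df.columns.map(c => c -> row.getAs[Long](c)).toMap
  }

  // Returns the names of the columns whose Spark SQL type is not numeric.
  def nonNumericColumns(df: DataFrame): Seq[String] =
    df.schema.fields
      .filterNot(_.dataType.isInstanceOf[NumericType])
      .map(_.name)
      .toSeq

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("check-input")
      .getOrCreate()
    import spark.implicits._

    // Made-up rows for illustration only; they are NOT taken from the real dataset.
    val toyDF = Seq(
      (1, 35, 12.0, 5, 1, 100, 0),
      (2, 29, 7.25, 5, 1, 96, 1)
    ).toDF("sex", "age", "Time", "Number_of_Warts", "Type", "Area", "label")

    println(nullCounts(toyDF))        // every count should be 0 for clean data
    println(nonNumericColumns(toyDF)) // should be empty if all columns are numeric

    spark.stop()
  }
}
```

With the real data, the same two helpers could be called on CryotherapyDF; a non-zero null count or a non-empty list of non-numeric columns would mean extra cleaning or encoding is needed before assembling features.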
All Spark ML classifiers expect training data with two columns: label (which we now have) and features. We have six features, but they must first be assembled into a single feature vector. This is done with VectorAssembler, a transformer from the Spark ML library. First, we list all the columns except the label column:
val selectedCols = Array("sex", "age", "Time", "Number_of_Warts", "Type", "Area")
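Instead of typing the names by hand, the same list can be derived by dropping the label from the full column list. This is a small sketch in plain Scala under stated assumptions: the helper `featureColumns` is hypothetical, and `allCols` stands in for what `CryotherapyDF.columns` would return after the rename.

```scala
object FeatureColumns {
  // Returns every column name except the label, preserving the original order.
  def featureColumns(allColumns: Array[String], labelCol: String): Array[String] =
    allColumns.filterNot(_ == labelCol)

  def main(args: Array[String]): Unit = {
    // Stand-in for CryotherapyDF.columns after renaming the target to "label".
    val allCols = Array("sex", "age", "Time", "Number_of_Warts", "Type", "Area", "label")

    val derivedCols = featureColumns(allCols, "label")
    println(derivedCols.mkString(", "))
    // prints: sex, age, Time, Number_of_Warts, Type, Area
  }
}
```

Deriving the list this way keeps it in sync if columns are added or removed, at the cost of silently including any column that should not be a feature (an ID column, for example), so the explicit array is the safer choice when in doubt.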
Then we instantiate a VectorAssembler transformer and apply it as follows:
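import org.apache.spark.ml.feature.VectorAssembler
// Combine the six input columns into a single vector column named "features"
val vectorAssembler = new VectorAssembler()
  .setInputCols(selectedCols)
  .setOutputCol("features")
// Keep only the two columns the classifiers need
val numericDF = vectorAssembler.transform(CryotherapyDF)
  .select("label", "features")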
numericDF.show()
As expected, the last line of the preceding code segment shows the assembled DataFrame with the label and features columns, which are what we need to train an ML algorithm.
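To try the rename-and-assemble steps without the full dataset, here is a minimal, self-contained sketch that runs them on a couple of made-up rows. It is an illustration under assumptions, not the book's code: the object name `AssembleFeaturesSketch`, the helper `prepare`, the local SparkSession settings, and the toy values are all hypothetical.

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.{DataFrame, SparkSession}

object AssembleFeaturesSketch {
  val featureCols: Array[String] =
    Array("sex", "age", "Time", "Number_of_Warts", "Type", "Area")

  // Renames the target column to "label" and packs the six inputs into "features".
  def prepare(raw: DataFrame): DataFrame = {
    val renamed = raw.withColumnRenamed("Result_of_Treatment", "label")
    new VectorAssembler()
      .setInputCols(featureCols)
      .setOutputCol("features")
      .transform(renamed)
      .select("label", "features")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("assemble-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up rows for illustration only; they are NOT taken from the real dataset.
    val toyDF = Seq(
      (1, 35, 12.0, 5, 1, 100, 0),
      (2, 29, 7.25, 5, 1, 96, 1)
    ).toDF("sex", "age", "Time", "Number_of_Warts", "Type", "Area", "Result_of_Treatment")

    val prepared = prepare(toyDF)
    prepared.printSchema()
    prepared.show(false) // false = do not truncate the vector column

    // Each "features" value is an ML Vector with one entry per input column.
    val first = prepared.head().getAs[Vector]("features")
    println(first.size) // 6

    spark.stop()
  }
}
```

The sketch binds each step to a new immutable value instead of reassigning the DataFrame. The reassignment style used in the text works only if CryotherapyDF was declared with var; either approach yields the same two-column result.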
