
Preprocessing and feature engineering

According to the dataset description in the UCI Machine Learning Repository, the data contains no null values. Spark ML classifiers also expect numeric input, and, as the schema shows, all the required fields are already numeric (either integers or floating-point values). In addition, Spark ML algorithms expect a label column, which in our case is Result_of_Treatment. Let's rename it to label using Spark's withColumnRenamed() method:

// Spark ML algorithms expect a 'label' column, which in our case is 'Result_of_Treatment'. Let's rename it to 'label'.
// Note: this reassignment only compiles if CryotherapyDF was declared as a var.
CryotherapyDF = CryotherapyDF.withColumnRenamed("Result_of_Treatment", "label")
CryotherapyDF.printSchema()
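
The statement above that the data has no null values and only numeric fields comes from the dataset description, but it can also be checked directly. The following is a minimal sketch, assuming CryotherapyDF is the DataFrame loaded earlier; the name nullCounts is introduced here only for illustration:

```scala
import org.apache.spark.sql.functions.{col, count, when}

// Assumption: CryotherapyDF is the DataFrame loaded earlier in the chapter.
// For each column, count the rows in which that column is null.
// when(...) without otherwise(...) yields null for non-matching rows,
// and count(...) skips nulls, so only the null rows are counted.
val nullCounts = CryotherapyDF.select(
  CryotherapyDF.columns.map(c => count(when(col(c).isNull, 1)).alias(c)): _*
)

// Every count should be 0 if the dataset description is accurate.
nullCounts.show()

// List each column with its type to confirm that all fields are numeric.
CryotherapyDF.dtypes.foreach { case (name, dataType) =>
  println(s"$name: $dataType")
}
```

If any count is non-zero, the affected rows would need to be dropped or imputed before assembling the features.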

All Spark ML classifiers expect training data that contains two columns, called label (which we already have) and features. We have seen that the dataset has six features. However, these have to be assembled into a single feature vector. This can be done with VectorAssembler, which is one of the transformers in the Spark ML library. But first, we need to select all the columns except the label column:

val selectedCols = Array("sex", "age", "Time", "Number_of_Warts", "Type", "Area")

Then we instantiate a VectorAssembler transformer and use it to transform the DataFrame as follows:

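import org.apache.spark.ml.feature.VectorAssembler

// Combine the selected input columns into a single vector column named "features"
val vectorAssembler = new VectorAssembler()
  .setInputCols(selectedCols)
  .setOutputCol("features")

// Keep only the two columns that the classifiers need
val numericDF = vectorAssembler.transform(CryotherapyDF)
  .select("label", "features")
numericDF.show()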

As expected, the last line of the preceding code segment shows the assembled DataFrame with the label and features columns, which are what we need to train an ML algorithm.
