Labeled point vector

Before running any supervised machine learning algorithm in Spark MLlib, we must convert our dataset into a labeled point vector, which maps features to a given label/response. Labels are stored as doubles, which allows the same structure to be used for both classification and regression tasks. For binary classification problems, labels should be stored as either 0 or 1, which the preceding summary statistics confirmed holds true for our example.

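import org.apache.spark.mllib.regression.LabeledPoint

// Pair each label with its feature vector (the two RDDs must line up element by element)
val higgs = response.zip(features).map {
  case (response, features) => LabeledPoint(response, features)
}

// Name the RDD and cache it for reuse
higgs.setName("higgs").cache()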

An example of a labeled point vector follows:

(1.0, [0.123, 0.456, 0.567, 0.678, ..., 0.789])

In the preceding example, all doubles inside the brackets are the features, and the single number outside the brackets is our label. Note that we have not yet told Spark that we are performing a classification task rather than a regression task; that will happen later.
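To make the structure concrete, the following minimal sketch builds a single labeled point by hand and reads its two parts back. It assumes the Spark MLlib library is on the classpath (as it is in the Spark shell); the numeric values are hypothetical and are not taken from the dataset used in this chapter.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Hypothetical values chosen for illustration only
val point = LabeledPoint(1.0, Vectors.dense(0.123, 0.456, 0.567))

// The label is the single number outside the brackets
val label: Double = point.label

// The features are the doubles inside the brackets
val featureValues: Array[Double] = point.features.toArray
val numFeatures: Int = point.features.size
```

Because the label is just a double, nothing in the point itself marks it as a class rather than a continuous response; that distinction comes from the algorithm we later choose to train.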

In this example, all input features contain only numeric values, but in many situations the data contains categorical values or strings. All such non-numeric data needs to be converted into numbers, which we will show later in this book.
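As a rough preview of what such a conversion can look like, the sketch below maps a categorical column to integer indices and then to one-hot encoded doubles using plain Scala collections. The `colors` column and the `oneHot` helper are hypothetical names introduced here for illustration, and this is only one possible encoding; the approach shown later in the book may differ.

```scala
// Hypothetical categorical column; not part of the dataset used in this chapter
val colors = Seq("red", "green", "blue", "green")

// Give each distinct category an integer index (sorted so the mapping is deterministic)
val index: Map[String, Int] = colors.distinct.sorted.zipWithIndex.toMap

// One-hot encode a category: 1.0 at the category's index, 0.0 elsewhere
def oneHot(value: String): Array[Double] = {
  val encoded = Array.fill(index.size)(0.0)
  encoded(index(value)) = 1.0
  encoded
}

// Every string is now a numeric array that could be placed in a feature vector
val encodedColors: Seq[Array[Double]] = colors.map(oneHot)
```

One-hot encoding avoids implying an order among categories, which a bare integer index would do; whether that matters depends on the algorithm being trained.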


Mastering Machine Learning with Spark 2.x
