
Labeled point vector

Before running any supervised machine learning algorithm in Spark MLlib, we must convert our dataset into a labeled point vector, which maps a set of features to a given label (response). Labels are stored as doubles, which allows them to be used for both classification and regression tasks. For binary classification problems, labels should be stored as either 0 or 1, which the preceding summary statistics confirmed holds true for our example.

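import org.apache.spark.mllib.regression.LabeledPoint

// zip pairs the two RDDs element by element; each pair becomes a LabeledPoint
val higgs = response.zip(features).map {
  case (response, features) =>
    LabeledPoint(response, features)
}

higgs.setName("higgs").cache()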

An example of a labeled point vector follows:

(1.0, [0.123, 0.456, 0.567, 0.678, ..., 0.789]) 

In the preceding example, the doubles inside the brackets are the features and the single number outside the brackets is the label. Note that we have not yet told Spark that we are performing a classification task rather than a regression task; that happens later.
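To make the structure of a labeled point concrete, the following is a minimal sketch that builds one by hand and reads its two parts back. The feature values are made up for illustration, and the snippet assumes it is run in spark-shell or with spark-mllib on the classpath; no running cluster is needed for this part.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Illustrative values only: label 1.0 with three made-up features
val point = LabeledPoint(1.0, Vectors.dense(0.123, 0.456, 0.567))

// The label is a plain Double, whatever the task turns out to be
val label: Double = point.label

// The features are an MLlib Vector; individual values are read by position
val numFeatures: Int = point.features.size
val secondFeature: Double = point.features(1)
```

Nothing in the point itself marks the label as a class rather than a continuous value, which is why the choice between classification and regression is made when the algorithm is configured.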

In this example, all input features contain only numeric values, but in many situations the data contains categorical values or string data. All such non-numeric values need to be converted into numbers, which we will show later in this book.
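As a rough illustration of what such a conversion can look like, the sketch below assigns each distinct string a numeric index in plain Scala. The `CategoryIndexer` object and the sample values are hypothetical and are not part of Spark MLlib or of the Higgs dataset; this is one simple approach under those assumptions, not necessarily the one used later in the book.

```scala
// Hypothetical helper for illustration; not a Spark MLlib API
object CategoryIndexer {
  // Assign each distinct category an index, in sorted order so the
  // mapping is deterministic
  def buildIndex(values: Seq[String]): Map[String, Double] =
    values.distinct.sorted.zipWithIndex.map {
      case (value, i) => value -> i.toDouble
    }.toMap

  // Replace every category with its numeric index
  def encode(values: Seq[String], index: Map[String, Double]): Seq[Double] =
    values.map(index)
}

// Made-up categorical column
val colors = Seq("red", "green", "blue", "green")
val index = CategoryIndexer.buildIndex(colors)
val encoded = CategoryIndexer.encode(colors, index)
```

Note that a plain index implies an ordering between categories that a model may treat as meaningful, so alternatives such as one-hot encoding are often preferred for algorithms that interpret features as continuous values.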