Creating a training and testing set

As with most supervised learning tasks, we split our dataset so that we train a model on one subset and then test its ability to generalize to new data on the holdout set. For this example, we split the data 80/20, but there is no hard rule on what the ratio should be or, for that matter, how many splits there should be in the first place:

// Create Train & Test Splits 
val trainTestSplits = higgs.randomSplit(Array(0.8, 0.2)) 
val (trainingData, testData) = (trainTestSplits(0), trainTestSplits(1)) 

With an 80/20 split of the 11 million examples, we take a random sample of roughly 8.8 million examples as our training set and keep the remaining 2.2 million or so as our testing set. Because randomSplit samples each row independently, these sizes are approximate rather than exact. We could just as easily take another random 80/20 split and obtain a new training set of about the same size but with different data. Splitting the original dataset once in this hard way, which is often referred to as the holdout method for model validation, introduces a sampling bias: the model learns to fit the training data, but that particular sample may not be representative of "reality". Given that we are working with 11 million examples, this bias is far less prominent than it would be if our original dataset had, say, only 100 rows.
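The following is a minimal sketch of the holdout split described above, not the book's own code. It assumes a local Spark session and uses a synthetic RDD of 100,000 integers as a hypothetical stand-in for higgs. Passing a fixed seed (42 here, an arbitrary choice) makes the split reproducible, and counting both parts shows that the sizes land close to, but not exactly on, the requested 80/20 ratio:

```scala
import org.apache.spark.sql.SparkSession

// Local session and synthetic data are assumptions made for illustration only.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("holdout-sketch")
  .getOrCreate()
val sc = spark.sparkContext

// Hypothetical stand-in for the `higgs` RDD: 100,000 integer "rows".
val data = sc.parallelize(1 to 100000).cache()

// A fixed seed makes the random split reproducible across runs.
val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42L)

// Each row lands in exactly one split, so the counts add up to the total,
// but the individual sizes are only approximately 80% and 20%.
val trainCount = train.count()
val testCount = test.count()
println(s"train: $trainCount, test: $testCount, total: ${trainCount + testCount}")
```

Rerunning the split with a different seed yields a different training set of about the same size, which is the resampling effect the paragraph above describes.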

You can also use H2O Flow to split the data:

  1. Publish the Higgs data as an H2OFrame:
val higgsHF = h2oContext.asH2OFrame(higgs.toDF, "higgsHF") 
  2. Split the data in the Flow UI using the splitFrame command (see Figure 7).
  3. Publish the results back to an RDD.
Figure 7 - Splitting Higgs dataset into two H2O frames representing 80 and 20 percent of data.

In contrast to Spark's lazy evaluation, the H2O computation model is eager. This means that the splitFrame invocation processes the data right away and creates two new frames, which can be accessed directly.
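To make the Spark side of this contrast concrete, here is a small sketch that uses only Spark and makes no H2O calls. It assumes a local session and a synthetic RDD, and it uses an accumulator (named processed, a hypothetical name) to observe when the map function actually runs. The transformation alone does no work; the job runs only when an action such as count is invoked:

```scala
import org.apache.spark.sql.SparkSession

// Local session and synthetic data are assumptions made for illustration only.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("lazy-sketch")
  .getOrCreate()
val sc = spark.sparkContext

// Counts how many elements the map function has actually processed.
val processed = sc.longAccumulator("processed")

// Transformation only: Spark records the lineage but runs nothing yet.
val doubled = sc.parallelize(1 to 1000).map { x => processed.add(1); x * 2 }
val beforeAction = processed.value.longValue // no job has run at this point

// Action: triggers the job, so the map function executes now.
val total = doubled.count()
// Accumulator updates made inside transformations can repeat if tasks are
// retried, so treat this value as a lower bound in general.
val afterAction = processed.value.longValue

println(s"before: $beforeAction, after: $afterAction, rows: $total")
```

An eager model, as the paragraph above describes for H2O, would do the equivalent work at the point where the operation is invoked rather than waiting for a later action.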