
Linear regression

Now that that's all done, let's do some linear regression! But first, let's clean up our code. We'll move our exploratory work so far into a function called exploration(). Then we will reread the file, split the dataset into training and testing sets, and perform all the transformations before finally running the regression. For the regression itself, we will use the github.com/sajari/regression package.

The first part looks like this:

func main() {
	// exploratory() // commented out because we're done with exploratory work.

	f, err := os.Open("train.csv")
	mHandleErr(err)
	defer f.Close()

	hdr, data, indices, err := ingest(f)
	mHandleErr(err)
	rows, cols, XsBack, YsBack, newHdr, newHints := clean(hdr, data, indices, datahints, ignored)
	Xs := tensor.New(tensor.WithShape(rows, cols), tensor.WithBacking(XsBack))
	it, err := native.MatrixF64(Xs)
	mHandleErr(err)

	// transform the Ys
	for i := range YsBack {
		YsBack[i] = math.Log1p(YsBack[i])
	}
	// transform the Xs
	transform(it, newHdr, newHints)

	// partition the data
	shuffle(it, YsBack)
	testingRows := int(float64(rows) * 0.2)
	trainingRows := rows - testingRows
	testingSet := it[trainingRows:]
	testingYs := YsBack[trainingRows:]
	it = it[:trainingRows]
	YsBack = YsBack[:trainingRows]
	log.Printf("len(it): %d || %d", len(it), len(YsBack))
	...

We first ingest and clean the data, then we create an iterator for the matrix of Xs for easier access. We then transform both the Xs and the Ys. Finally, we shuffle the Xs, and partition them into a training dataset and a testing dataset.

Recall the discussion in the first chapter of what makes a model good: a good model must be able to generalize to previously unseen combinations of values. To prevent overfitting, we must cross-validate our model.

To achieve that, we train on only a limited subset of the data, then use the model to predict on the held-out test set. We can then score how well the model performs on data it has never seen.

Ideally, this split should happen before the data is parsed into the Xs and Ys. But since we'd like to reuse the functions we wrote earlier, we won't do that here. The separation of ingest and clean into distinct functions, however, allows you to do so; if you visit the repository on GitHub, you will find that all the functions needed for this are readily available.

For now, we simply take out 20% of the dataset, and set it aside. A shuffle is used to resample the rows so that we don't train on the same 80% every time.

Also, note that the clean function now takes ignored, whereas in exploratory mode it took nil. This, along with the shuffle, is important for cross-validation later on.
