
Holdout sample

When working with a training dataset, a small portion of the data is set aside to test the performance of the models. Because this portion is unseen data (not used in training), the measurements obtained on it can be relied on. These measurements can be used to tune the parameters of the model, or simply to report the model's performance and set expectations about the level of performance the model can deliver. Note that once a holdout sample has been used for tuning, measurements on that same sample tend to be optimistic.
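As a rough illustration of the idea, the following minimal sketch sets aside a random holdout sample using only the Python standard library. The function name `holdout_split`, its parameters, and the toy data are assumptions made for this example and do not come from the text.

```python
import random


def holdout_split(rows, holdout_fraction=0.3, seed=42):
    """Shuffle a copy of rows and return (train, holdout).

    holdout_split is a hypothetical helper written for this example.
    """
    if not 0.0 < holdout_fraction < 1.0:
        raise ValueError("holdout_fraction must be strictly between 0 and 1")
    shuffled = list(rows)                  # copy, so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    n_holdout = round(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]


# Toy data: 100 records, identified here only by their index.
data = list(range(100))
train, holdout = holdout_split(data, holdout_fraction=0.3)
print(len(train), len(holdout))
```

The model would be fitted on `train` only, and `holdout` would be used solely for measuring performance.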

Note that a performance measurement based on a single holdout sample is a less robust estimate than one from k-fold cross-validation. The estimate depends on one random split, and that split may introduce unknown biases. There is also no guarantee that the holdout dataset represents every class present in the training dataset. If every class must be represented in the holdout dataset, a technique called a stratified holdout sample is applied: the split is made within each class, so that the holdout dataset keeps roughly the same class proportions as the original dataset. A performance measurement obtained from a stratified holdout sample is generally a better estimate than one obtained from a nonstratified holdout sample, particularly when the classes are imbalanced.
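One way a stratified holdout sample might be drawn is sketched below: the records are grouped by class and each group is split separately. This is a minimal standard-library sketch, not a definitive implementation; the function name `stratified_holdout_split`, its parameters, and the toy labels are assumptions made for this example.

```python
import random
from collections import Counter, defaultdict


def stratified_holdout_split(labels, holdout_fraction=0.2, seed=0):
    """Return (train_indices, holdout_indices), split class by class.

    stratified_holdout_split is a hypothetical helper written for this example.
    """
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train_idx, holdout_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Aim for at least one holdout record per class, but always leave
        # at least one record for training. A class with a single record
        # therefore stays entirely in the training set.
        n_holdout = max(1, round(len(indices) * holdout_fraction))
        n_holdout = min(n_holdout, len(indices) - 1)
        holdout_idx.extend(indices[:n_holdout])
        train_idx.extend(indices[n_holdout:])
    return train_idx, holdout_idx


# Toy imbalanced labels: 90 records of class "a", 10 of class "b".
labels = ["a"] * 90 + ["b"] * 10
train_idx, holdout_idx = stratified_holdout_split(labels, holdout_fraction=0.2)
holdout_counts = Counter(labels[i] for i in holdout_idx)
print(holdout_counts["a"], holdout_counts["b"])
```

With these toy labels, the holdout sample keeps the 9:1 class ratio of the full dataset, which a purely random split does not guarantee.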

The training data-holdout data splits generally observed in ML projects are 70%-30%, 80%-20%, and 90%-10%.
