
Cross-validation

Cross-validation (which you may hear some data scientists refer to as rotation estimation, or describe simply as a general technique for assessing models) is another method for assessing a model's performance (or its accuracy).

Mainly used in predictive modeling to estimate how accurately a model might perform in practice, cross-validation checks how well a model will potentially generalize; in other words, how the model will apply what it infers from samples to an entire population (or dataset).

With cross-validation, you identify a (known) dataset on which training is run (your training dataset), along with a dataset of unknown data (or first-seen data) against which the model will be tested (this is known as your testing dataset). The objective is to ensure that problems such as overfitting (allowing noise or peculiarities of the training data to influence results) are controlled, as well as to provide insight into how the model will generalize to a real problem or a real data file.

This process consists of partitioning the data into similar-sized subsets, performing the analysis on one subset (called the training set), and validating the analysis on the other subset (called the validation set or testing set):

Separation → Analysis → Validation
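The separation step can be sketched in plain Python. This is a minimal, hypothetical illustration (the function names are my own, not from any particular library): the data is shuffled, partitioned into k similar-sized subsets, and each round holds one subset out for validation while the rest form the training set.

```python
import random

def k_fold_split(data, k=5, seed=0):
    """Shuffle the data, then partition it into k similar-sized subsets (folds)."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

def cross_validation_rounds(data, k=5):
    """Yield (training_set, validation_set) pairs, one pair per round.

    Each round holds out a different fold for validation and trains on the rest.
    """
    folds = k_fold_split(data, k)
    for i in range(k):
        validation = folds[i]
        training = [item for j, fold in enumerate(folds) if j != i for item in fold]
        yield training, validation
```

Note that every record appears in the validation set exactly once across the k rounds, which is what lets the averaged results speak for the whole dataset.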

To reduce variability, multiple rounds (also called iterations) of cross-validation are performed using different partitions (the folds), and the validation results are averaged over the rounds. Typically, a data scientist will use the model's stability to determine the actual number of rounds of cross-validation that should be performed.

Again, the cross-validation method can perhaps be better understood by thinking about selecting a subset of data and manually calculating the results. Once you know the correct results, they can be compared to the model-produced results (using a separate subset of data). This is one round. Multiple rounds would be performed and the compared results averaged and reviewed, eventually providing a fair estimate of a model's prediction performance.
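The compare-and-average procedure described above can be written down directly. The sketch below is a hypothetical helper (the names `predict` and `folds` are illustrative): each fold supplies the known correct answers for one round, the model's predictions are scored against them, and the per-round accuracies are averaged.

```python
def cross_val_accuracy(predict, folds):
    """Average the per-round accuracy over all folds.

    predict: a function mapping a feature value to a predicted label.
    folds:   a list of subsets, each a list of (value, known_label) pairs.
    """
    scores = []
    for validation in folds:
        # One round: the held-out fold supplies the known (correct) answers.
        correct = sum(1 for x, label in validation if predict(x) == label)
        scores.append(correct / len(validation))
    return sum(scores) / len(scores)
```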

Suppose a university provides data on its student body over time. The students are described by various characteristics, such as whether their high school GPA was greater or less than 3.0, whether a family member graduated from the school, whether the student was active in non-program activities, was a resident (lived on campus), was a student athlete, and so on. Our predictive model aims to predict which characteristics students who graduate early have in common.

The following table is a representation of the results of using a five-round cross-validation process to predict our model's expected accuracy:

[Table: Cross-validation results]

Given the preceding figures, I'd say our predictive model is expected to be very accurate!
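The table's actual figures are not reproduced here, but the five-round process for the university example can be simulated end to end. Everything below is a made-up stand-in, not the author's data: the records are synthesized so that a GPA above 3.0 usually coincides with early graduation, and the "model" is just that GPA rule.

```python
import random

rng = random.Random(1)

# Hypothetical student records: (gpa, graduated_early). Synthesized so that
# a GPA above 3.0 agrees with early graduation about 90% of the time.
records = []
for _ in range(200):
    gpa = rng.uniform(2.0, 4.0)
    graduated_early = (gpa > 3.0) if rng.random() < 0.9 else (gpa <= 3.0)
    records.append((gpa, graduated_early))

def predict(gpa):
    """A stand-in 'model': predict early graduation when GPA exceeds 3.0."""
    return gpa > 3.0

# Five rounds: hold out each fold in turn, score against the known answers.
rng.shuffle(records)
folds = [records[i::5] for i in range(5)]
scores = [sum(predict(g) == label for g, label in fold) / len(fold)
          for fold in folds]

print("Per-round accuracy:", ["%.2f" % s for s in scores])
print("Averaged estimate:  %.2f" % (sum(scores) / len(scores)))
```

Because the synthetic labels agree with the GPA rule about 90% of the time, the averaged estimate lands near 0.9, mirroring the kind of per-round table the chapter describes.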

In summary, cross-validation combines (averages) measures of fit (prediction error) to derive a more accurate estimate of model prediction performance. This method is typically used in cases where there is not enough data available to test without losing significant modeling or testing quality.
