Train and test data

In statistical modeling, data is usually split randomly into train and test datasets, typically 70-30 or 80-20. The training data is used to build the model, and the model's effectiveness is then checked on the test data.

In the following code, we split the original data into train and test data in a 70 percent - 30 percent ratio. An important point here is that we set the seed value for the random number generator so that the random sampling can be repeated: every run then places the same observations in the training and testing data. Repeatability is needed in order to reproduce the results:

# Train & Test split 
>>> import pandas as pd       
>>> from sklearn.model_selection import train_test_split 
 
>>> original_data = pd.read_csv("mtcars.csv")      

In the following code, train_size is 0.7, which means 70 percent of the data goes into the training dataset and the remaining 30 percent into the testing dataset. random_state is the seed for the pseudo-random number generator; fixing it makes the results reproducible, because the exact same observations are selected on every run:

>>> train_data, test_data = train_test_split(original_data, train_size=0.7, random_state=42)
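
The reproducibility claim can be checked directly. The following is a minimal sketch, assuming a small synthetic DataFrame in place of mtcars.csv (the name demo_data and its column values are made up for illustration). It splits the same data twice with the same random_state and compares which rows were selected:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for mtcars.csv: 32 rows of made-up values
demo_data = pd.DataFrame({"x": range(32), "y": [2 * v for v in range(32)]})

# Two independent splits that share the same seed
train_a, test_a = train_test_split(demo_data, train_size=0.7, random_state=42)
train_b, test_b = train_test_split(demo_data, train_size=0.7, random_state=42)

# True only if both calls picked the same rows in the same order
same_rows = train_a.index.equals(train_b.index) and test_a.index.equals(test_b.index)
print(len(train_a), len(test_a), same_rows)
```

With 32 rows, a train_size of 0.7 should give 22 training rows and 10 testing rows, since the training count is rounded down. Changing random_state in one of the two calls would normally select different rows.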

The R code for the train and test split for statistical modeling is as follows:

full_data = read.csv("mtcars.csv",header=TRUE) 
set.seed(123) 
numrow = nrow(full_data) 
trnind = sample(1:numrow,size = as.integer(0.7*numrow)) 
train_data = full_data[trnind,] 
test_data = full_data[-trnind,] 
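
The index-sampling approach in the R code can also be sketched in plain Python, which shows what a seeded split does without relying on a library helper. This is an illustrative sketch only: split_indices is a hypothetical helper name, and a seed of 123 in Python will not select the same rows as set.seed(123) in R, because the two languages use different random number generators.

```python
import random

def split_indices(n_rows, train_fraction=0.7, seed=123):
    """Return (train_idx, test_idx) row positions for a seeded random split."""
    rng = random.Random(seed)               # local generator; global state is untouched
    n_train = int(train_fraction * n_rows)  # truncates, like as.integer() in the R code
    train_idx = sorted(rng.sample(range(n_rows), n_train))
    train_set = set(train_idx)
    # Every row not sampled for training goes to the test set
    test_idx = [i for i in range(n_rows) if i not in train_set]
    return train_idx, test_idx

train_idx, test_idx = split_indices(32)
print(len(train_idx), len(test_idx))
```

Calling split_indices(32) again with the same seed returns the same two lists, which is the repeatability that set.seed provides in the R version.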