Splitting the data

Finally, we want to split our data into training and test sets. We will train our classifier only on the training set, so it never sees the test set until we evaluate its performance. This is a very important step because, as we will see later, the quality of predictions on the test set can differ dramatically from the quality measured on the training set. Data splitting is an operation specific to machine learning tasks, so we will import the train_test_split function from scikit-learn (a machine learning package) and use it:

In []: 
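from sklearn.model_selection import train_test_split
# Hold out 30% of the samples for testing; random_state fixes the shuffle,
# so the same split is produced on every run.
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.3, random_state=42)
# Check the sizes of the four resulting arrays.
X_train.shape, y_train.shape, X_test.shape, y_test.shape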
Out[]: 
 ((700, 6), (700,), (300, 6), (300,)) 

Now we have 700 training samples with 6 features each, and 300 test samples with the same number of features.
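
To make the idea behind the split more concrete, here is a minimal standard-library sketch of what a shuffled split does conceptually: shuffle the row indices with a fixed seed, then cut them into two groups. The helper name simple_train_test_split and the toy data are hypothetical and exist only for illustration. This is not scikit-learn's implementation, and its rounding rule and shuffle order may differ from those of train_test_split.

```python
import random


def simple_train_test_split(features, labels, test_size=0.3, seed=42):
    """Hypothetical helper: shuffle row indices, then cut them into two groups."""
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")

    n_samples = len(features)
    # Assumption: round to the nearest whole sample (scikit-learn may round differently).
    n_test = int(round(n_samples * test_size))

    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed gives the same split every run

    test_idx, train_idx = indices[:n_test], indices[n_test:]

    # Index features and labels with the same positions so rows stay aligned.
    X_train = [features[i] for i in train_idx]
    X_test = [features[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


# Made-up toy data: 1000 samples with 6 features each.
features = [[float(i)] * 6 for i in range(1000)]
labels = [i % 2 for i in range(1000)]

X_train, X_test, y_train, y_test = simple_train_test_split(features, labels)
print(len(X_train), len(X_test))  # 700 300
```

The key design point is that features and labels are selected with the same shuffled indices, so every sample keeps its own label, and no sample ends up in both sets.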