
Evaluating the model

To evaluate an algorithm, we need to judge its performance on data that was not used to train the model. For this reason, it's common to split the data into a training set and a test set. The training set is used to train the model, which means that it's used to find the parameters of our algorithm. For example, training a decision tree determines the variables and threshold values that define the split at each branch of the tree. The test set must remain completely hidden during training. That means that all operations, such as feature engineering or feature scaling, must be fitted on the training set only and then applied to the test set, as in the following example.

Usually, the training set will be 70-80% of the dataset, while the test set will be the rest:

from sklearn import datasets
from sklearn import preprocessing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Load an example dataset
iris = datasets.load_iris()

# Hold out 30% of the samples as the test set
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

# Fit the scaler on the training set only, then apply it to both sets
scaler = preprocessing.StandardScaler().fit(X_train)
X_train_transformed = scaler.transform(X_train)
X_test_transformed = scaler.transform(X_test)

# The integer class labels are treated as a numeric target here
clf = LinearRegression().fit(X_train_transformed, y_train)

predictions = clf.predict(X_test_transformed)

print('Predictions: ', predictions)
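
Printing the predictions does not yet tell us how good they are. A minimal sketch of the missing step, assuming mean squared error and R^2 as the metrics (the text does not prescribe any), compares the error on the training set with the error on the held-out test set:

```python
from sklearn import datasets, preprocessing
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

# Fit the scaler on the training set only, then apply it to both sets.
scaler = preprocessing.StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LinearRegression().fit(X_train_scaled, y_train)

# Compare the error on data the model has seen with the error on unseen data.
train_mse = mean_squared_error(y_train, model.predict(X_train_scaled))
test_mse = mean_squared_error(y_test, model.predict(X_test_scaled))
test_r2 = r2_score(y_test, model.predict(X_test_scaled))

print('Train MSE:', train_mse)
print('Test MSE: ', test_mse)
print('Test R^2: ', test_r2)
```

A test error much larger than the training error is the usual sign of overfitting.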

The most common way to evaluate a supervised learning algorithm offline is cross-validation. This technique consists of splitting the dataset into training and test parts multiple times, each time using one part for training and the other for testing.

This allows us not only to check for overfitting but also to estimate the variance of our loss across the different splits.
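
As a rough sketch of cross-validation in scikit-learn, the following uses cross_val_score with five shuffled folds. The choice of five folds, of mean squared error as the loss, and of a pipeline that wraps the scaler are assumptions made for illustration, not requirements:

```python
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()

# The pipeline refits the scaler inside every fold, so the held-out
# part of each split never influences the scaling parameters.
pipeline = make_pipeline(StandardScaler(), LinearRegression())

# Shuffle before splitting: the iris samples are ordered by class.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, iris.data, iris.target, cv=cv,
                         scoring='neg_mean_squared_error')

# scikit-learn reports the negated MSE so that higher is always better.
mse_per_fold = -scores
print('MSE per fold:', mse_per_fold)
print('Mean MSE:', mse_per_fold.mean())
print('Std of MSE:', mse_per_fold.std())
```

The spread of the per-fold values gives an idea of how much the loss depends on the particular split.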

For problems where it's not possible to randomly divide the data, such as in a time series, scikit-learn has other splitting methods, such as the TimeSeriesSplit class.
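
A small sketch of how TimeSeriesSplit orders its splits, using twelve made-up observations (the data and the choice of three splits are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve observations in chronological order (hypothetical data).
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(X))

for train_idx, test_idx in splits:
    # Each training window ends before its test window starts.
    print('train:', train_idx, 'test:', test_idx)
```

In every split, all training indices come before all test indices, so the model is never trained on observations from the future.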

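In Keras, it's possible to hold out part of the training data for validation directly during fit. With validation_split=0.2, the last 20% of the samples, taken before any shuffling, are set aside and evaluated at the end of each epoch: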

hist = model.fit(x, y, validation_split=0.2)

If the data does not fit in memory, it's also possible to use the train_on_batch and test_on_batch methods, which update or evaluate the model on a single batch at a time.

For image data, in Keras, it is also possible to use the folder structure to define the training and validation sets and to specify the labels. To accomplish this, use the flow_from_directory method of ImageDataGenerator, which loads the images and takes each image's label from the name of the subfolder that contains it. We will need the following directory structure:

data/
    train/
        category1/
            001.jpg
            002.jpg
            ...
        category2/
            003.jpg
            004.jpg
            ...
    validation/
        category1/
            0011.jpg
            0022.jpg
            ...
        category2/
            0033.jpg
            0044.jpg
            ...

Then call flow_from_directory once per top-level folder (data/train and data/validation), for example with the following arguments:

flow_from_directory(
    directory,
    target_size=(96, 96),
    color_mode='rgb',
    classes=None,
    class_mode='categorical',
    batch_size=128,
    shuffle=True,
    seed=11,
    save_to_dir=None,
    save_prefix='output',
    save_format='jpg',
    follow_links=False,
    subset=None,
    interpolation='nearest')
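
To make the link between folders and labels concrete, here is a plain-Python illustration of the idea. It is not Keras code: infer_class_indices is a hypothetical helper that mimics the documented behavior when classes=None, where each subfolder is treated as one class and label indices follow the alphanumeric order of the subfolder names. The temporary files are empty placeholders, not real images.

```python
import os
import tempfile


def infer_class_indices(directory):
    """Map each subfolder name to an integer label, in sorted order."""
    classes = sorted(
        name for name in os.listdir(directory)
        if os.path.isdir(os.path.join(directory, name)))
    return {name: index for index, name in enumerate(classes)}


# Build a throwaway copy of the layout with empty placeholder files.
root = tempfile.mkdtemp()
for split in ('train', 'validation'):
    for category in ('category1', 'category2'):
        folder = os.path.join(root, 'data', split, category)
        os.makedirs(folder)
        open(os.path.join(folder, '001.jpg'), 'w').close()

train_labels = infer_class_indices(os.path.join(root, 'data', 'train'))
validation_labels = infer_class_indices(
    os.path.join(root, 'data', 'validation'))

print(train_labels)
print(validation_labels)
```

Because both folders contain the same category names, the training and validation sets end up with the same label mapping, which is what keeps the labels consistent across the two generators.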