
Setting parameters

Almost all data mining algorithms have parameters that the user can set. These let a generic algorithm be tuned to the specific dataset, rather than being applicable only across a small and specific range of problems. Setting these parameters can be quite difficult, as choosing good parameter values often relies heavily on features of the dataset.

The nearest neighbor algorithm has several parameters, but the most important one is the number of nearest neighbors to use when predicting the class of an unseen sample. In scikit-learn, this parameter is called n_neighbors. In the following figure, we show that when this number is too low, a single noisy sample can cause an error. In contrast, when it is too high, the actual nearest neighbors have less effect on the result:

In figure (a), on the left-hand side, we would usually expect to classify the test sample (the triangle) as a circle. However, if n_neighbors is 1, the single red diamond in this area (likely a noisy sample) causes the sample to be predicted as a diamond. In figure (b), on the right-hand side, we would usually expect to classify the test sample as a diamond. However, if n_neighbors is 7, the three nearest neighbors (which are all diamonds) are outvoted by a larger number of circle samples. Choosing a good value of n_neighbors can seem a difficult problem, as the parameter can make a huge difference. Luckily, most of the time the specific parameter value does not greatly affect the end result, and the standard values (usually 5 or 10) are often close enough.
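We can reproduce this effect in a few lines of code. The toy dataset below is a hypothetical stand-in for the figure: a cluster of class-0 "circles" with one lone class-1 "diamond" (a noisy sample) sitting right next to the test point. With n_neighbors=1, the noisy sample decides the prediction; with n_neighbors=5, it is outvoted:

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: five class-0 samples and one noisy class-1 sample.
X = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2], [0.2, 0.4],  # class 0
     [0.55, 0.5]]                                                  # noisy class 1
y = [0, 0, 0, 0, 0, 1]

test_sample = [[0.5, 0.5]]  # the single nearest point is the noisy class-1 sample

for k in (1, 5):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    print(k, clf.predict(test_sample)[0])  # k=1 follows the noise; k=5 outvotes it
```

This is only a sketch of the idea, but it shows why very small values of n_neighbors make the classifier sensitive to individual mislabeled or noisy samples.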

With that in mind, we can test out a range of values, and investigate the impact that this parameter has on performance. If we want to test a number of values for the n_neighbors parameter, for example, each of the values from 1 to 20, we can rerun the experiment many times by setting n_neighbors and observing the result. The code below does this, storing the values in the avg_scores and all_scores variables.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

avg_scores = []
all_scores = []
parameter_values = list(range(1, 21))  # Include 20
for n_neighbors in parameter_values:
    estimator = KNeighborsClassifier(n_neighbors=n_neighbors)
    scores = cross_val_score(estimator, X, y, scoring='accuracy')
    avg_scores.append(np.mean(scores))
    all_scores.append(scores)

We can then plot the relationship between the value of n_neighbors and the accuracy. First, we tell the Jupyter Notebook that we want to show plots inline in the notebook itself:

%matplotlib inline

We then import pyplot from the matplotlib library and plot the parameter values alongside average scores:

from matplotlib import pyplot as plt
plt.plot(parameter_values, avg_scores, '-o')

While there is a lot of variance, the plot shows a decreasing trend as the number of neighbors increases. With regard to the variance, you can expect large amounts of variance whenever you do evaluations of this nature. To compensate, update the code to run 100 tests per value of n_neighbors.
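One way to get 100 tests per parameter value is to pass a ShuffleSplit cross-validation object to cross_val_score, which draws 100 random train/test splits. The sketch below uses the iris dataset as a stand-in so it is self-contained; in practice you would use the chapter's own X and y, and the test_size and random_state values here are just illustrative choices:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)  # stand-in dataset for this sketch

avg_scores = []
parameter_values = list(range(1, 21))
# 100 random 80/20 splits per parameter value to average out the variance
cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=14)
for n_neighbors in parameter_values:
    estimator = KNeighborsClassifier(n_neighbors=n_neighbors)
    scores = cross_val_score(estimator, X, y, scoring='accuracy', cv=cv)
    avg_scores.append(np.mean(scores))
```

Averaging over many random splits smooths the curve considerably, making the underlying trend easier to see when you re-plot avg_scores.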
