
Toward more complex models

Let's now fit a more complex model, a polynomial of degree 2, to see whether it captures our data better:

>>> f2p = np.polyfit(x, y, 2)
>>> print(f2p)
[ 1.05605675e-02 -5.29774287e+00 1.98466917e+03]
>>> f2 = np.poly1d(f2p)
>>> print(error(f2, x, y))
181347660.764

With plot_web_traffic(x, y, [f1, f2]) we can see how a function of degree 2 manages to model our web traffic data:

The error is 181,347,660.764, a little more than half the error of the straight-line model. This is good, but it comes at a price: we now have a more complex function, meaning that polyfit() has one more parameter to fit. The fitted polynomial is as follows:

f(x) = 0.0105605675 * x**2 - 5.29774287 * x + 1984.66917
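As a quick sanity check, the following minimal sketch rebuilds the model from the coefficients printed above and compares it with the written-out formula. The sample points in xs and the helper name f2_by_hand are illustrative assumptions, not part of the original example.

```python
import numpy as np

# Coefficients as printed above, highest power first (the order np.polyfit returns).
f2p = [1.05605675e-02, -5.29774287e+00, 1.98466917e+03]
f2 = np.poly1d(f2p)


def f2_by_hand(x):
    # The same polynomial, written out term by term (hypothetical helper).
    return 0.0105605675 * x**2 - 5.29774287 * x + 1984.66917


xs = np.array([0.0, 100.0, 500.0])  # arbitrary sample points, chosen for illustration

print(f2.order)        # degree of the polynomial
print(f2(xs))          # poly1d objects are callable
print(f2_by_hand(xs))  # should agree with the line above
```

Because np.poly1d stores the coefficients from the highest power down, the constant term 1984.66917 is what the model predicts at x = 0.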

So, if more complexity gives better results, why not increase the complexity even more? Let's try it for degrees 3, 10, and 100:

Interestingly, we do not see d = 100 for the polynomial that was fitted with degree 100, but d = 53 instead. This has to do with the warning we get when fitting a polynomial of degree 100:

RankWarning: Polyfit may be poorly conditioned

This means that, because of numerical errors, polyfit cannot determine a good fit of degree 100. Instead, it settled on a polynomial of degree 53.
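The warning can be observed programmatically. The sketch below is an illustration under stated assumptions: x and y are synthetic stand-ins for the web traffic data (a hypothetical hourly index and a noisy quadratic), and the warning is identified by its class name so that the code does not depend on where a given NumPy version exposes RankWarning.

```python
import warnings

import numpy as np

# Synthetic stand-in for the web traffic data (assumption, not the book's data set).
rng = np.random.default_rng(0)
x = np.arange(1, 744, dtype=float)  # hypothetical hourly index
y = 1500 + 0.01 * x**2 + rng.normal(scale=200, size=x.size)

# Record every warning raised while fitting instead of printing it.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    f100 = np.poly1d(np.polyfit(x, y, 100))

# Compare by class name to stay independent of the NumPy version.
warning_names = {w.category.__name__ for w in caught}
print(sorted(warning_names))
print("requested degree: 100, resulting order:", f100.order)
```

np.poly1d drops leading coefficients that are exactly zero, which is why the reported order can be lower than the degree that was requested.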

It seems like the curves capture the fitted data better the more complex they get. The errors seem to tell the same story:

>>> print("Errors for the complete data set:")
>>> for f in [f1, f2, f3, f10, f100]:
...     print("\td=%i: %f" % (f.order, error(f, x, y)))
...

The errors for the complete dataset are as follows:

  • d=1: 319,531,507.008126
  • d=2: 181,347,660.764236
  • d=3: 140,576,460.879141
  • d=10: 123,426,935.754101
  • d=53: 110,768,263.808878
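The pattern of shrinking errors can be reproduced on any data set, because a higher-degree polynomial can always fit the training points at least as well as a lower-degree one. The following sketch assumes that error() is the sum of squared residuals and uses synthetic data, so its numbers will not match the ones listed above.

```python
import numpy as np


def error(f, x, y):
    # Assumed definition: sum of squared differences between model and data.
    return np.sum((f(x) - y) ** 2)


# Synthetic data (assumption): a noisy quadratic over a hypothetical hourly index.
rng = np.random.default_rng(42)
x = np.arange(1, 744, dtype=float)
y = 1500 + 0.005 * x**2 + rng.normal(scale=300, size=x.size)

errors = {}
for degree in (1, 2, 3):
    f = np.poly1d(np.polyfit(x, y, degree))
    errors[degree] = error(f, x, y)
    print("d=%i: %f" % (f.order, errors[degree]))
```

Note that this error is measured on the same points that were used for fitting, so a lower value says nothing yet about how well the model generalizes.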

However, taking a closer look at the fitted curves, we start to wonder whether they also capture the true process that generated the data. Framed differently, do our models correctly represent the underlying mass behavior of customers visiting our website? Looking at the polynomials of degree 10 and 53, we see wildly oscillating behavior. The models seem to be fitted so closely to the data that the curves capture not only the underlying process, but also the noise. This is called overfitting.

At this point, we have the following choices:

  • Choose one of the fitted polynomial models
  • Switch to another more complex model class
  • Think differently about the data and start again

Out of the five fitted models, the first-order model is clearly too simple, and the models of order 10 and 53 are clearly overfitting. Only the second- and third-order models seem to somehow match the data. However, if we extrapolate them at both borders, we see them going berserk.
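To get a feeling for the extrapolation problem, the sketch below evaluates the second-order model from above well outside the fitted range. The value of x_max is an assumption about where the observed data ends; it is not stated in this section.

```python
import numpy as np

# Second-order model, rebuilt from the coefficients printed earlier.
f2 = np.poly1d([1.05605675e-02, -5.29774287e+00, 1.98466917e+03])

x_max = 743  # assumed last observed x value (hypothetical)

# Evaluate at the edge of the data and increasingly far beyond it.
for x_val in (x_max, 2 * x_max, 10 * x_max):
    print("x=%i: f2(x)=%.1f" % (x_val, f2(x_val)))
```

The quadratic term dominates quickly, so the predictions grow much faster than anything the fitted range alone would justify.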

Switching to a more complex class also doesn't seem to be the right way to go. Which arguments would back which class? At this point, we realize that we have probably not fully understood our data.
