
Accepting non-linear patterns

A linear regression model assumes that the outcome can be estimated by a linear combination of the predictors. This, of course, is not always the case, as features often exhibit non-linear patterns.

Consider the following graph, where Y depends on X but the relationship displays an obvious quadratic pattern. Fitting a line (y = aX + b) as a prediction model of Y as a function of X does not work:

Some models and algorithms are able to handle non-linearities naturally, for example, tree-based models or support vector machines with non-linear kernels. Linear models, including linear regression trained with stochastic gradient descent (SGD), are not.

Transformations: One way to deal with these non-linear patterns in the context of linear regression is to transform the predictors. In the preceding simple example, adding the square of the predictor X to the model would give a much better result. The model would now be of the following form:

y = aX^2 + bX + c

And as shown in the following diagram, the new quadratic model fits the data much better:
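To make the comparison concrete, here is a minimal sketch that fits both the straight line and the quadratic form to the same data by ordinary least squares. The data set (y = 2X^2 - 3X + 1, without noise) and the helper names `solve_linear_system`, `fit_least_squares`, and `mean_squared_error` are assumptions made for illustration; they are not taken from the text or from any particular library.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], so row operations apply to both sides.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_least_squares(features, targets):
    """Return the weights minimising squared error (normal equations)."""
    k = len(features[0])
    xtx = [[sum(row[i] * row[j] for row in features) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * t for row, t in zip(features, targets))
           for i in range(k)]
    return solve_linear_system(xtx, xty)


def mean_squared_error(features, targets, weights):
    preds = [sum(w * f for w, f in zip(weights, row)) for row in features]
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)


# Hypothetical noise-free data following y = 2*X^2 - 3*X + 1.
xs = [x / 2 for x in range(-10, 11)]
ys = [2 * x ** 2 - 3 * x + 1 for x in xs]

linear_features = [[x, 1.0] for x in xs]             # y = aX + b
quadratic_features = [[x ** 2, x, 1.0] for x in xs]  # y = aX^2 + bX + c

linear_weights = fit_least_squares(linear_features, ys)
quadratic_weights = fit_least_squares(quadratic_features, ys)

linear_mse = mean_squared_error(linear_features, ys, linear_weights)
quadratic_mse = mean_squared_error(quadratic_features, ys, quadratic_weights)
print("linear MSE:", linear_mse)
print("quadratic MSE:", quadratic_mse)
```

The model stays linear in its weights; only the feature list changes. That is why the same least-squares routine handles both cases, and why adding X^2 as an extra predictor is enough to capture the curvature.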

We are not restricted to the quadratic case, and a power function of higher order can be used to transform existing attributes and create new predictors. Other useful transformations include taking the logarithm, exponential, sine, cosine, and so on. The Box-Cox transformation (http://onlinestatbook.com/2/transformations/box-cox.html) is worth citing at this point. It is a power transformation that reduces the skewness and kurtosis of a variable's distribution, reshaping it into one closer to a Gaussian distribution.
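As a rough illustration of the Box-Cox idea, the sketch below applies the transform for a fixed lambda and compares skewness before and after. The sample data, the choice of lambda = 0 (which reduces to the natural logarithm), and the helper names `box_cox` and `skewness` are assumptions for this example only.

```python
import math


def box_cox(values, lam):
    """Box-Cox transform of strictly positive values for a fixed lambda."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        # The limit of (v**lam - 1) / lam as lam -> 0 is log(v).
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]


def skewness(values):
    """Population skewness: third central moment over std. deviation cubed."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed sample: exponentials of evenly spaced points.
skewed = [math.exp(i / 10) for i in range(1, 51)]
transformed = box_cox(skewed, lam=0)

print("skewness before:", skewness(skewed))
print("skewness after:", skewness(transformed))
```

Here lambda is chosen by hand because the data was built to be log-uniform. In practice lambda is usually estimated from the data, for instance by maximum likelihood, which is what `scipy.stats.boxcox` does when no lambda is supplied.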

Splines are an excellent and more powerful alternative to polynomial interpolation. Splines are piecewise polynomials that join smoothly. At their simplest, splines consist of straight lines connected at a set of points called knots. Splines are not available in Amazon ML.
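The simplest case mentioned above, straight segments joined at knots, can be sketched in a few lines. The function name `linear_spline` and the knot values (sampled from y = x^2) are hypothetical and chosen only for illustration; higher-order splines additionally match derivatives at the knots, which this sketch does not attempt.

```python
from bisect import bisect_right


def linear_spline(knots_x, knots_y):
    """Return a function joining the knots (x_i, y_i) with straight segments."""
    if len(knots_x) != len(knots_y) or len(knots_x) < 2:
        raise ValueError("need at least two knots with matching x and y")
    if any(b <= a for a, b in zip(knots_x, knots_x[1:])):
        raise ValueError("knot x values must be strictly increasing")

    def evaluate(x):
        # Find the segment containing x; values outside the knot range
        # are extrapolated along the first or last segment.
        i = bisect_right(knots_x, x) - 1
        i = max(0, min(i, len(knots_x) - 2))
        x0, x1 = knots_x[i], knots_x[i + 1]
        y0, y1 = knots_y[i], knots_y[i + 1]
        return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

    return evaluate


# Hypothetical knots sampled from the quadratic y = x^2.
knots_x = [-2.0, -1.0, 0.0, 1.0, 2.0]
knots_y = [x ** 2 for x in knots_x]
spline = linear_spline(knots_x, knots_y)

for x in (-1.5, 0.5, 2.0):
    print(x, spline(x), x ** 2)
```

The spline passes exactly through every knot and approximates the curve between knots; adding more knots tightens the approximation without raising the polynomial degree.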

Quantile binning is the Amazon ML solution to non-linearities. By splitting the data into N bins, you remove any non-linearities within each bin's interval. Although binning has several drawbacks (http://biostat.mc.vanderbilt.edu/wiki/Main/CatContinuous), the main one being that information is discarded in the process, it has been shown to generate excellent prediction performance on the Amazon ML platform.
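The following sketch shows the general idea of quantile binning on a quadratic relationship. It is not Amazon ML's implementation: the data, the number of bins, and the helper names `quantile_bin_edges` and `assign_bin` are assumptions. It also assumes that each bin is turned into its own indicator feature, in which case a least-squares fit on the indicators alone predicts the mean outcome of each bin.

```python
from bisect import bisect_right
from statistics import quantiles


def quantile_bin_edges(values, n_bins):
    """Cut points that split values into n_bins groups of similar size."""
    return quantiles(values, n=n_bins)  # returns n_bins - 1 cut points


def assign_bin(value, edges):
    """Index of the bin (0 .. len(edges)) that value falls into."""
    return bisect_right(edges, value)


# Hypothetical data with a quadratic relationship y = x^2.
xs = [x / 10 for x in range(-50, 51)]
ys = [x ** 2 for x in xs]

n_bins = 10
edges = quantile_bin_edges(xs, n_bins)
bins = [assign_bin(x, edges) for x in xs]

# One weight per bin: with indicator features only, the least-squares
# weight of each bin is the mean outcome of the points in that bin.
sums = [0.0] * n_bins
counts = [0] * n_bins
for b, y in zip(bins, ys):
    sums[b] += y
    counts[b] += 1
bin_means = [s / c for s, c in zip(sums, counts)]

predictions = [bin_means[b] for b in bins]
binned_mse = sum((p - y) ** 2 for p, y in zip(predictions, ys)) / len(ys)

# Baseline: a single constant prediction (the global mean of y).
global_mean = sum(ys) / len(ys)
baseline_mse = sum((global_mean - y) ** 2 for y in ys) / len(ys)

print("binned MSE:", binned_mse)
print("baseline MSE:", baseline_mse)
```

The prediction becomes a step function of X, so the curve is followed bin by bin. The variation of Y inside each bin is what gets discarded, which is the information loss noted above.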
