
Addressing multicollinearity

The following is the definition of multicollinearity according to Wikipedia (https://en.wikipedia.org/wiki/Multicollinearity):

Multicollinearity is a phenomenon in which two or more predictor variables in a multiple regression model are highly correlated, meaning that one can be linearly predicted from the others with a substantial degree of accuracy.

Basically, let's say you have a model with three predictors:

$y = a_1 x_1 + a_2 x_2 + a_3 x_3$

And one of the predictors is a linear combination (perfect multicollinearity) or is approximated by a linear combination (near multicollinearity) of the two other predictors.

For instance:

$x_3 = \beta_1 x_1 + \beta_2 x_2 + \epsilon$

Here, $\epsilon$ is some noise variable.

In that case, changes in $x_1$ and $x_2$ will drive changes in $x_3$, and as a consequence, the coefficient $a_3$ will be tied to $a_1$ and $a_2$. The information already contained in $x_1$ and $x_2$ will be shared with the third predictor $x_3$, which will cause high uncertainty and instability in the model. Small changes in the predictors' values will bring large variations in the coefficients, and the regression may no longer be reliable. In more technical terms, the standard errors of the coefficients increase, which lowers the significance of otherwise important predictors.
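To make this instability concrete, here is a minimal simulation sketch using NumPy and scikit-learn (tools outside the Amazon ML platform; the coefficients, sample size, and noise scales are illustrative assumptions). Two nearly collinear predictors are generated, and refitting on slightly perturbed copies of the data shows how much the coefficients swing:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
n = 200

# x2 is nearly a linear combination of x1: near multicollinearity
x1 = rng.normal(size=n)
x2 = 2.0 * x1 + rng.normal(scale=0.01, size=n)
y = 3.0 * x1 + x2 + rng.normal(size=n)

# Refit on slightly perturbed copies of the data: the individual
# coefficients swing widely even though the fit itself barely changes
for trial in range(3):
    X = np.column_stack([x1, x2]) + rng.normal(scale=0.01, size=(n, 2))
    model = LinearRegression().fit(X, y)
    print(trial, model.coef_)
```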

There are several ways to detect multicollinearity. Calculating the correlation matrix of the predictors is a first step, but that would only detect collinearity between pairs of predictors.
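As a quick sketch of that first step, the pairwise check is a one-liner with pandas (the toy DataFrame and the 0.9 threshold below are illustrative assumptions):

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
x1 = rng.normal(size=200)
df = pd.DataFrame({
    "x1": x1,
    "x2": 2.0 * x1 + rng.normal(scale=0.1, size=200),  # nearly collinear with x1
    "x3": rng.normal(size=200),                         # independent predictor
})

corr = df.corr()
# Keep only off-diagonal entries above the threshold
mask = (corr.abs() > 0.9) & (corr.abs() < 1.0)
print(corr.where(mask).stack())  # only the (x1, x2) pair shows up
```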

A widely used detection method for multicollinearity is to calculate the variance inflation factor or VIF for each of the predictors:

$\mathrm{VIF}_i = \dfrac{1}{1 - R_i^2}$

Here, $R_i^2$ is the coefficient of determination of the auxiliary regression with the predictor $x_i$ on the left-hand side and all the other predictor variables on the right-hand side.

A large value for one of the VIFs is an indication that the variance (the square of the standard error) of that predictor's coefficient is larger than it would be if the predictor were completely uncorrelated with all the other predictors. For instance, a VIF of 1.8 indicates that the variance of the coefficient is 80% larger than what it would be in the uncorrelated case.
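A sketch of this calculation with statsmodels (again outside the Amazon ML platform; the data is the same toy setup as above), where each call to variance_inflation_factor runs the auxiliary regression for one predictor:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

rng = np.random.RandomState(0)
x1 = rng.normal(size=200)
df = pd.DataFrame({
    "x1": x1,
    "x2": 2.0 * x1 + rng.normal(scale=0.1, size=200),  # nearly collinear with x1
    "x3": rng.normal(size=200),
})

# add_constant so each auxiliary regression includes an intercept
X = add_constant(df)
for i, name in enumerate(X.columns):
    if name == "const":
        continue
    print(name, variance_inflation_factor(X.values, i))
# x1 and x2 get very large VIFs; x3 stays close to 1
```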

Once the attributes with high collinearity have been identified, the following options will reduce collinearity:

  • Removing the highly collinear predictors from the dataset
  • Using Partial Least Squares (PLS) regression or Principal Component Analysis (PCA) (https://en.wikipedia.org/wiki/Principal_component_analysis); both reduce the predictors to a smaller set of uncorrelated variables, as sketched after this list
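
Here is a minimal sketch of the PCA option with scikit-learn (keeping 2 components is an illustrative modeling choice, not a rule): the three correlated predictors are projected onto uncorrelated components, and the regression is fit on those components instead of the raw predictors.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(0)
n = 200
x1 = rng.normal(size=n)
X = np.column_stack([x1,
                     2.0 * x1 + rng.normal(scale=0.1, size=n),  # collinear with x1
                     rng.normal(size=n)])
y = 3.0 * X[:, 0] + X[:, 1] + rng.normal(size=n)

# Project onto 2 uncorrelated principal components, then regress on them
model = make_pipeline(PCA(n_components=2), LinearRegression()).fit(X, y)
print(model.score(X, y))  # R^2 of the PCA-based regression
```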

Unfortunately, detection and removal of multicollinear variables are not available on the Amazon ML platform.
