官术网_书友最值得收藏!

Validating the dataset

Not all datasets lend themselves to linear modeling. There are several conditions that the samples must verify for your linear model to make sense. Some conditions are strict, others can be relaxed.

In general, linear modeling assumes the following conditions (http://www.statisticssolutions.com/assumptions-of-multiple-linear-regression/):

  • Normalization/standardization: Linear regression can be sensitive to predictors that exhibit very different scales. This is true for all loss functions that rely on a measure of the distance between samples or on the standard deviations of samples. Predictors with higher means and standard deviations have more impact on the model and may potentially overshadow predictors with better predictive power but more constrained range of values. Standardization of predictors puts all the predictors on the same level.
  • Independent and identically distributed (i.i.d.): The samples are assumed to be independent from each other and to follow a similar distribution. This property is often assumed even when the samples are not that independent from each other. In the case of time series where samples depend on previous values, using the sample to sample difference as the data is often enough to satisfy the independence assumption. As we will see in Chapter 2, Machine Learning Definitions and Concepts, confounders and noise will also negatively impact linear regression.
  • No multicollinearity: Linear regression assumes that there is little or no multicollinearity in the data, meaning that one predictor is not a linear composition of other predictors. Predictors that can be approximated by linear combinations of other predictors will confuse the model.
  • Heteroskedasticity: The standard deviation of a predictor is constant across the whole range of its values.
  • Gaussian distribution of the residuals: This is more than a posteriori validation that the linear regression is valid. The residuals are the differences between the true values and their linear estimation. The linear regression is considered relevant if these residuals follow a Gaussian distribution.

These assumptions are rarely perfectly met in real-life datasets. As we will see in Chapter 2, Machine Learning Definitions and Concepts, there are techniques to detect when the linear modeling assumptions are not respected, and subsequently to transform the data to get closer to the ideal linear regression context.

主站蜘蛛池模板: 神木县| 黄陵县| 壤塘县| 乌海市| 新巴尔虎左旗| 舞阳县| 萍乡市| 阜南县| 关岭| 谢通门县| 锦州市| 读书| 红原县| 玉屏| 商丘市| 连山| 常宁市| 杭锦旗| 沂源县| 巴塘县| 故城县| 沅陵县| 铜川市| 图木舒克市| 曲阜市| 偏关县| 邢台市| 定远县| 尤溪县| 昌都县| 安康市| 马龙县| 云南省| 陇川县| 聂拉木县| 康定县| 含山县| 西贡区| 平凉市| 平乐县| 松阳县|