
L1 penalty

The basic concept of the L1 penalty, also known as the least absolute shrinkage and selection operator (Lasso; Hastie, T., Tibshirani, R., and Friedman, J. (2009)), is that a penalty is used to shrink weights toward zero. The penalty term uses the sum of the absolute weights, so some weights may get shrunk to zero. This means that Lasso can also be used as a type of variable selection. The strength of the penalty is controlled by a hyper-parameter, lambda (λ), which multiplies the sum of the absolute weights, and it can be a fixed value or, as with other hyper-parameters, optimized using cross-validation or some similar approach.

It is easier to describe Lasso if we use an ordinary least squares (OLS) regression model. In regression, a set of coefficients or model weights is estimated using the least-squared error criterion: the weight/coefficient vector, Θ, is estimated such that it minimizes ∑ᵢ(yᵢ - ŷᵢ)², where ŷᵢ = b + Θxᵢ, yᵢ is the target value we want to predict, and ŷᵢ is the predicted value. Lasso regression adds a penalty term, so that it now tries to minimize ∑ᵢ(yᵢ - ŷᵢ)² + λ‖Θ‖, where ‖Θ‖ is the sum of the absolute values of the weights in Θ. Typically, the intercept or offset term, b, is excluded from this constraint.
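
To make this concrete, the following is a minimal sketch of Lasso regression in R using the glmnet package (the package choice, the simulated data, and the use of cross-validation to pick λ are illustrative assumptions, not taken from the text). In glmnet, alpha = 1 selects the pure L1 penalty:

library(glmnet)

set.seed(42)
n <- 100; p <- 10
x <- matrix(rnorm(n * p), nrow = n, ncol = p)
# Only the first three predictors are truly related to the outcome
y <- 2 * x[, 1] - 1.5 * x[, 2] + 0.5 * x[, 3] + rnorm(n)

# Choose lambda by cross-validation rather than fixing it by hand
cv_fit <- cv.glmnet(x, y, alpha = 1)
# Coefficients of the noise predictors are typically shrunk to exactly zero
coef(cv_fit, s = "lambda.min")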

There are a number of practical implications of Lasso regression. First, the effect of the penalty depends on the size of the weights, and the size of the weights depends on the scale of the data. Therefore, data is typically standardized to have unit variance first (or at least to make the variance of each variable equal). The L1 penalty has a tendency to shrink small weights to zero (for explanations as to why this happens, see Hastie, T., Tibshirani, R., and Friedman, J. (2009)). If you only keep the variables for which the L1 penalty leaves non-zero weights, it essentially functions as feature selection. The tendency of the L1 penalty to shrink small coefficients to zero is also convenient for simplifying the interpretation of the model results.
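
Continuing the hypothetical glmnet sketch above, the snippet below standardizes the predictors explicitly (glmnet also standardizes internally by default via standardize = TRUE) and then treats the predictors left with non-zero weights as the selected features:

x_scaled <- scale(x)                        # zero mean, unit variance
fit <- glmnet(x_scaled, y, alpha = 1, lambda = cv_fit$lambda.min,
              standardize = FALSE)          # data already standardized
w <- coef(fit)[-1, 1]                       # drop the intercept
names(w)[w != 0]                            # variables "selected" by the L1 penalty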

Applying the L1 penalty to neural networks works in exactly the same way as it does for regression. Suppose that X represents the input, Y is the outcome or dependent variable, B is the set of parameters, and F is the objective function that will be optimized to obtain B; that is, we want to minimize F(B; X, Y). The L1 penalty modifies the objective function to be F(B; X, Y) + λ‖Θ‖, where Θ represents the weights (typically, offsets are ignored) and ‖Θ‖ is the sum of their absolute values. The L1 penalty tends to result in a sparse solution (that is, more zero weights) because small and large weights incur the same marginal penalty, so at each update of the gradient the weights are moved toward zero by a constant amount.
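
As a sketch of the same idea in a neural network, the code below uses the keras package for R and its regularizer_l1() function to attach an L1 penalty to a layer's weights (the layer sizes, penalty strength, and loss/optimizer choices here are illustrative assumptions, not values from the text):

library(keras)

model <- keras_model_sequential() %>%
  layer_dense(units = 32, activation = "relu", input_shape = c(10),
              # lambda * sum(|weights|) is added to the loss for this layer
              kernel_regularizer = regularizer_l1(l = 0.001)) %>%
  layer_dense(units = 1)

model %>% compile(loss = "mse", optimizer = "adam")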

We have only considered the case where λ is a constant that controls the degree of penalty or regularization across the whole model. However, in deep neural networks it is possible to set a different value for each layer, so that varying degrees of regularization are applied to different layers. One reason for considering such differential regularization is that it is sometimes desirable to allow a greater number of parameters (say, by including more neurons in a particular layer) but then counteract this somewhat through stronger regularization. However, this approach can be quite computationally demanding if we allow the L1 penalty to vary for every layer of a deep neural network and use cross-validation to optimize all the possible combinations of penalty values. Therefore, a single value is usually used across the entire model.
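
For completeness, here is a hypothetical keras sketch of layer-specific regularization, where a wider layer is given a stronger L1 penalty to counteract its larger number of parameters (again, the layer sizes and penalty values are illustrative assumptions):

model <- keras_model_sequential() %>%
  layer_dense(units = 128, activation = "relu", input_shape = c(10),
              kernel_regularizer = regularizer_l1(l = 0.01)) %>%   # stronger penalty
  layer_dense(units = 32, activation = "relu",
              kernel_regularizer = regularizer_l1(l = 0.001)) %>%  # weaker penalty
  layer_dense(units = 1)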
