- Deep Learning with R for Beginners
- Mark Hodnett Joshua F. Wiley Yuxi (Hayden) Liu Pablo Maldonado
L1 penalty
The basic concept of the L1 penalty, also known as the least absolute shrinkage and selection operator (Lasso; Hastie, Tibshirani, and Friedman, 2009), is that a penalty is used to shrink weights toward zero. The penalty term uses the sum of the absolute weights, so some weights may be shrunk exactly to zero. This means that Lasso can also be used as a type of variable selection. The strength of the penalty is controlled by a hyper-parameter, lambda (λ), which multiplies the sum of the absolute weights; it can be a fixed value or, as with other hyper-parameters, optimized using cross-validation or some similar approach.
It is easier to describe Lasso if we use an ordinary least squares (OLS) regression model. In regression, a set of coefficients or model weights is estimated using the least-squares criterion: the weight/coefficient vector, Θ, is estimated so that it minimizes ∑(yi − ŷi)², where ŷi = b + Θxi, yi is the target value we want to predict, and ŷi is the predicted value. Lasso regression adds a penalty term, so that we now try to minimize ∑(yi − ŷi)² + λ∑|Θ|, where ∑|Θ| is the sum of the absolute values of the weights. Typically, the intercept or offset term, b, is excluded from this constraint.
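To make the objective concrete, here is a minimal R sketch that computes this penalized loss for a given coefficient vector; the function name `lasso_loss` and the toy data are purely illustrative.

```r
# Penalized least-squares loss: squared error plus lambda times the sum of
# the absolute weights (the intercept b is excluded from the penalty)
lasso_loss <- function(theta, b, X, y, lambda) {
  y_hat <- b + X %*% theta                       # predicted values
  sum((y - y_hat)^2) + lambda * sum(abs(theta))  # squared error + L1 penalty
}

set.seed(42)
X <- matrix(rnorm(100 * 3), ncol = 3)            # toy data with 3 predictors
y <- 2 * X[, 1] + rnorm(100)
lasso_loss(theta = c(2, 0, 0), b = 0, X = X, y = y, lambda = 0.5)
```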
There are a number of practical implications of Lasso regression. First, the effect of the penalty depends on the size of the weights, and the size of the weights depends on the scale of the data. Therefore, data is typically standardized to have unit variance first (or at least to make the variance of each variable equal). The L1 penalty has a tendency to shrink small weights to zero (for explanations as to why this happens, see Hastie, Tibshirani, and Friedman, 2009). If you only consider variables for which the L1 penalty leaves non-zero weights, it can essentially function as feature selection. The tendency of the L1 penalty to shrink small coefficients to zero can also be convenient for simplifying the interpretation of model results.
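As an illustration of this feature-selection behaviour, the following sketch fits a Lasso model with the glmnet package (alpha = 1 selects the L1 penalty; glmnet standardizes the predictors by default) and chooses λ by cross-validation; the simulated data, in which only two predictors are informative, is an assumption made for the example.

```r
library(glmnet)

set.seed(42)
X <- matrix(rnorm(200 * 10), ncol = 10)   # 10 predictors, only 2 informative
y <- 1.5 * X[, 1] - 2 * X[, 2] + rnorm(200)

# alpha = 1 gives the Lasso (L1) penalty; lambda is chosen by cross-validation
cv_fit <- cv.glmnet(X, y, alpha = 1)

# Coefficients at the cross-validated lambda: most of the uninformative
# predictors are shrunk exactly to zero
coef(cv_fit, s = "lambda.min")
```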
Applying the L1 penalty to neural networks works in exactly the same way as it does for regression. If X represents the input, Y the outcome or dependent variable, B the parameters, and F the objective function that is optimized to obtain B, then we want to minimize F(B; X, Y). The L1 penalty modifies the objective function to F(B; X, Y) + λ∑|Θ|, where Θ represents the weights (the offsets are typically ignored). The L1 penalty tends to result in a sparse solution (that is, more zero weights) because its gradient is the same constant for small and large weights, so at each update the weights are moved toward zero by the same amount and small weights are driven all the way to zero.
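In the keras package for R, for example, an L1 penalty can be attached to a layer's weights through regularizer_l1(); the layer sizes, input shape, and penalty strength below are illustrative assumptions, not values from the text.

```r
library(keras)

# A small feed-forward network with an L1 penalty on the hidden layer's
# weights (the kernel); the bias/offset terms are not penalized
model <- keras_model_sequential() %>%
  layer_dense(units = 32, activation = "relu", input_shape = c(10),
              kernel_regularizer = regularizer_l1(l = 0.001)) %>%
  layer_dense(units = 1)

model %>% compile(optimizer = "adam", loss = "mse")
```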
So far, we have only considered the case where λ is a single constant controlling the degree of penalty or regularization. However, with deep neural networks it is possible to set different values for different layers, so that varying degrees of regularization are applied across the model. One reason for considering such differential regularization is that it is sometimes desirable to allow a greater number of parameters (say, by including more neurons in a particular layer) and then counteract this somewhat through stronger regularization. However, this approach can be quite computationally demanding if we allow the L1 penalty to vary for every layer of a deep neural network and use cross-validation to optimize all possible combinations of penalty values. Therefore, a single value is usually used across the entire model.
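Continuing the keras sketch above, differential regularization amounts to passing a different value of l to each layer's regularizer; the layer sizes and penalty values shown are arbitrary choices for illustration.

```r
# A wider first layer with a stronger L1 penalty, and a smaller second layer
# with a weaker penalty (values chosen purely for illustration)
model <- keras_model_sequential() %>%
  layer_dense(units = 128, activation = "relu", input_shape = c(10),
              kernel_regularizer = regularizer_l1(l = 0.01)) %>%
  layer_dense(units = 16, activation = "relu",
              kernel_regularizer = regularizer_l1(l = 0.001)) %>%
  layer_dense(units = 1)
```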