Parameter Norm Penalties

The regularized objective $\tilde{J}$ adds a norm penalty $\Omega(\theta)$, weighted by a hyperparameter $\alpha \geq 0$, to the original objective $J$:

$$ \tilde{J}(\theta; X, y) = J(\theta; X, y) + \alpha\Omega(\theta) $$

$L^2$ Parameter Regularization (Ridge Regression)

$$ \Omega(\theta) = \frac{1}{2}\|w\|_2^2 $$

Update Rule:

$$ w \leftarrow (1 - \epsilon\alpha)\,w - \epsilon\,\nabla_w J(w; X, y) $$

Optimal Solution:

$$ w = (X^\top X + \alpha I)^{-1} X^\top y $$

We can see that the addition of the weight decay term has modified the learning rule to multiplicatively shrink the weight vector by a constant factor on each step, just before performing the usual gradient update.
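As a sanity check on the two formulas above, here is a minimal NumPy sketch (the data, $\alpha$, and learning rate are made-up illustrations, not from the source) that runs the shrink-then-step update on a linear least-squares objective and compares the result with the closed-form ridge solution:

```python
import numpy as np

# Minimal sketch: apply the weight-decay update above to a linear least-squares
# objective J(w) = 1/2 * ||Xw - y||^2 and compare against the closed-form
# ridge solution. The data, alpha, and learning rate are illustrative choices.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=100)

alpha = 0.1    # weight decay coefficient
eps = 1e-3     # learning rate

w = np.zeros(5)
for _ in range(5000):
    grad = X.T @ (X @ w - y)                  # gradient of the unregularized J
    w = (1 - eps * alpha) * w - eps * grad    # shrink, then usual gradient step

w_closed_form = np.linalg.solve(X.T @ X + alpha * np.eye(5), X.T @ y)
print(np.max(np.abs(w - w_closed_form)))      # close to zero after convergence
```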


$L^1$ Regularization (Lasso)

$$ \Omega(\theta) = \|w\|_1 = \sum_i |w_i| $$

Gradient of the regularized objective:

$$ \nabla_w \tilde{J}(w; X, y) = \alpha\,\mathrm{sign}(w) + \nabla_w J(X, y; w) $$

No analytical solution.

We can see immediately that the effect of $L^1$ regularization is quite different from that of $L^2$ regularization. Specifically, the regularization contribution to the gradient no longer scales linearly with each $w_i$; instead, it is a constant factor with a sign equal to $\mathrm{sign}(w_i)$. One consequence of this form of the gradient is that we will not necessarily see clean algebraic solutions to quadratic approximations of $J(X, y; w)$ as we did for $L^2$ regularization.
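As a rough illustration, here is a minimal sketch of that gradient for a linear least-squares $J$, with the subgradient of $|w_i|$ at $w_i = 0$ taken to be zero; the function and variable names are placeholders, not from the source:

```python
import numpy as np

# Minimal sketch of the L1-regularized gradient above for a linear
# least-squares J. The subgradient of |w_i| at w_i = 0 is taken to be 0.
def l1_regularized_grad(w, X, y, alpha):
    grad_J = X.T @ (X @ w - y)          # gradient of the unregularized objective
    return alpha * np.sign(w) + grad_J  # constant-magnitude penalty term per weight

def sgd_step(w, X, y, alpha, eps):
    # One (sub)gradient descent step on the L1-regularized objective.
    return w - eps * l1_regularized_grad(w, X, y, alpha)
```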

In comparison to $L^2$ regularization, $L^1$ regularization results in a solution that is more sparse. Sparsity in this context refers to the fact that some parameters have an optimal value of zero. The sparsity of $L^1$ regularization is a qualitatively different behavior than arises with $L^2$ regularization. If we revisit the quadratic approximation of the objective around the unregularized optimum $w^*$, using the assumption of a diagonal and positive definite Hessian $H$ introduced for the analysis of $L^2$ regularization, we find that $w_i = \frac{H_{i,i}}{H_{i,i}+\alpha} w_i^*$. If $w_i^*$ was nonzero, then $w_i$ remains nonzero. This demonstrates that $L^2$ regularization does not cause the parameters to become sparse, while $L^1$ regularization may do so for large enough $\alpha$.

The sparsity property induced by $L^1$ regularization has been used extensively as a feature-selection mechanism. Feature selection simplifies a machine learning problem by choosing which subset of the available features should be used.
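To make the contrast concrete, the short sketch below evaluates the $L^2$ formula above alongside the standard $L^1$ soft-thresholding result for the same diagonal quadratic approximation, $w_i = \operatorname{sign}(w_i^*)\max\{|w_i^*| - \frac{\alpha}{H_{i,i}}, 0\}$; the Hessian diagonal and $w^*$ values are made up for illustration:

```python
import numpy as np

# Made-up numbers comparing the two diagonal-Hessian closed forms:
# L2 rescales every coordinate of w*, L1 soft-thresholds it.
H_diag = np.array([2.0, 2.0, 2.0, 2.0])        # diagonal of a positive definite Hessian
w_star = np.array([3.0, 0.4, -0.05, -1.5])     # unregularized optimum w*
alpha = 0.5

w_l2 = H_diag / (H_diag + alpha) * w_star
w_l1 = np.sign(w_star) * np.maximum(np.abs(w_star) - alpha / H_diag, 0.0)

print(w_l2)  # every coordinate shrunk toward zero, none exactly zero
print(w_l1)  # coordinates with |w*_i| <= alpha / H_ii are driven exactly to zero
```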

Norm Penalties as Constrained Optimization (KKT)

Problem setup: to constrain $\Omega(\theta)$ to be less than some constant $k$, construct a generalized Lagrange function: