Because subset selection is a discrete process, the resulting model can have high variance. Shrinkage methods are more continuous and do not suffer as much from high variability.
Ridge regression minimizes the RSS plus an $L_2$ penalty $\lambda \beta^T \beta$; $\lambda \geq 0$ is a complexity parameter controlling the amount of shrinkage.
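Written out (with the intercept left unpenalized), the criterion is:

$$ \hat{\beta}^{ridge} = \arg\min_{\beta} \left\{ \sum_{i=1}^N \Big(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^p \beta_j^2 \right\} $$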
With OLS, coefficients of correlated features can have high variance: a large positive coefficient on one feature can be canceled by a similarly large negative coefficient on its correlated counterpart. The size penalty alleviates this.
Need feature standardization: the ridge solution is not equivariant under scaling of the inputs, so the inputs are normally standardized before solving.
Estimate is $ \hat{\beta}_{RR} = (X^T X + \lambda I)^{-1} X^T y $, where $X^T X + \lambda I$ is positive definite, hence invertible, for any $\lambda > 0$, even when $X^T X$ is singular.
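A minimal numpy sketch of the closed-form estimate (illustrative only; assumes $X$ is already centered and standardized and $y$ is centered, so no intercept column is needed):

```python
import numpy as np

def ridge_coefficients(X, y, lam):
    """Closed-form ridge estimate; X centered/standardized, y centered."""
    p = X.shape[1]
    # X^T X + lambda*I is positive definite for lambda > 0, hence invertible.
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Made-up example data.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 0.0]) + rng.standard_normal(100)
yc = y - y.mean()
beta_rr = ridge_coefficients(X, yc, lam=10.0)
```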
Using the SVD $X = UDV^T$: $X \hat{\beta}_{LS} = UU^Ty $ and $X\hat{\beta}_{RR} = \sum_{i=1}^p u_i \frac{d_i^2}{d_i^2 + \lambda} u_i^T y$. The last coordinate $u_p^T y$ of $y$ with respect to the orthonormal basis $U$ is shrunk the most, since it corresponds to the smallest singular value $d_p$.
Also, the first principal component $z_1 = Xv_1$ has maximal variance among all directions in $C(X)$, and subsequent components (with smaller singular values) have smaller variance. Hence, ridge regression shrinks the coefficients of the low-variance directions the most.
Effective degrees of freedom is $df(\lambda) = \sum_{i=1}^p \frac{d_i^2}{d_i^2 + \lambda}$.
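Continuing the hypothetical `X`, `yc`, and `ridge_coefficients` from the sketch above, this checks the SVD view: the fit shrinks the coordinate along each $u_i$ by $d_i^2/(d_i^2 + \lambda)$, the principal-component variances decrease with the singular values, and summing the shrinkage factors gives $df(\lambda)$:

```python
lam = 10.0
U, d, Vt = np.linalg.svd(X, full_matrices=False)

shrink = d**2 / (d**2 + lam)                 # per-direction shrinkage factors
fit_svd = U @ (shrink * (U.T @ yc))          # X @ beta_RR computed via the SVD
fit_closed = X @ ridge_coefficients(X, yc, lam)
assert np.allclose(fit_svd, fit_closed)

# Principal components z_i = X v_i: sample variances decrease with d_i,
# so the heavily shrunk directions are the low-variance ones.
pc_variance = (X @ Vt.T).var(axis=0)         # proportional to d_i^2 / N

df_lambda = shrink.sum()                     # effective degrees of freedom
```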
Lasso: $L_1$ constraint $\sum_{j=1}^p \lvert \beta_j \rvert \leq t$.
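Equivalently, in Lagrangian form the Lasso solves:

$$ \hat{\beta}^{lasso} = \arg\min_{\beta} \left\{ \frac{1}{2}\sum_{i=1}^N \Big(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^p \lvert \beta_j \rvert \right\} $$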
The solution is nonlinear in the $y_i$, and there is no closed-form expression in general. The LAR algorithm (Section 3.4.4) computes the entire Lasso path efficiently.
Continuous subset selection: a sufficiently small $t$ forces some coefficients to be exactly zero.
The standardized parameter is $ s = t / \sum_{j=1}^p \lvert \hat{\beta}^{LS}_j \rvert$.
If the features are orthonormal, the Lasso also has a closed-form estimate. Ridge then performs proportional shrinkage, the Lasso performs soft-thresholding (translate toward zero, truncating at zero), and subset selection performs hard-thresholding (keep or drop); see the sketch below.
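A minimal numpy sketch of the three rules applied to the least-squares coefficients when the columns of $X$ are orthonormal (function names are illustrative, not from the text):

```python
import numpy as np

def ridge_orthonormal(beta_ls, lam):
    # Proportional shrinkage: beta_j / (1 + lambda)
    return beta_ls / (1.0 + lam)

def lasso_orthonormal(beta_ls, lam):
    # Soft-thresholding: sign(beta_j) * (|beta_j| - lambda)_+
    return np.sign(beta_ls) * np.maximum(np.abs(beta_ls) - lam, 0.0)

def best_subset_orthonormal(beta_ls, M):
    # Hard-thresholding: keep the M largest coefficients, zero out the rest.
    keep = np.argsort(np.abs(beta_ls))[-M:]
    out = np.zeros_like(beta_ls)
    out[keep] = beta_ls[keep]
    return out

beta_ls = np.array([3.0, -1.2, 0.4, -0.1])
print(ridge_orthonormal(beta_ls, lam=1.0))     # every coefficient halved
print(lasso_orthonormal(beta_ls, lam=0.5))     # [2.5, -0.7, 0.0, 0.0]
print(best_subset_orthonormal(beta_ls, M=2))   # [3.0, -1.2, 0.0, 0.0]
```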
Estimation picture (Figure 3.11): equi-RSS elliptical contours centered at $\hat{\beta}_{LS}$ first touch the $L_q$ constraint region. The Lasso's diamond-shaped region has corners on the axes, so the first contact can occur at a corner, giving exactly zero coefficients; the ridge disk has no corners, so zeros rarely occur.
TODO: Elastic-net penalty
TODO: Bayesian interpretation
LAR starts with all coefficients at zero and puts into the active set the feature most correlated with the residual.
Then, at each iteration $k$, move the coefficients in the active set $A_k$ towards their joint OLS coefficient of the residual $r_k$ on $X_{A_k}$, until another feature has as much correlation with the residual; that feature then joins the active set.
For the Lasso path (LAR modification): if a non-zero coefficient reaches zero, drop it from $A_k$ and recompute the joint direction.
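A sketch of computing the paths with scikit-learn's `lars_path` (the data here is made up; `method='lar'` gives plain least angle regression, while `method='lasso'` adds the drop-at-zero modification):

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 8))
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.standard_normal(100)

# coefs has one column per step of the path; active lists the features
# in the order they enter the active set.
alphas, active, coefs = lars_path(X, y, method="lasso")
```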
How many parameters were used? Define the effective degrees of freedom of an adaptively fitted $\hat{y}$ as df($\hat{y}$) $= \frac{1}{\sigma^2} \sum_{i=1}^N \mathrm{Cov}(\hat{y}_i, y_i)$ (see the simulation sketch after the table).
Estimation Method | df($\hat{y}$) | Notes |
---|---|---|
Least Squares ($k$ features) | $k$ | - |
Ridge | $\sum_{i=1}^p \frac{d_i^2}{d_i^2 + \lambda}$ | = df($\lambda$) above. |
Subset Selection (size $k$) | $> k$ | No closed form; estimate by simulation. |
Least Angle (after $k$ steps) | $= k$ | Exactly $k$. |
Lasso | $\approx k$ | Approximately the size of the active set. |
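A Monte Carlo sketch (hypothetical helper, fixed design) of the covariance definition of df: simulate many responses from a fixed mean, refit each time, and average. For OLS on $k$ features it should recover df $\approx k$; plugging a best-subset fitter into `fit_fn` exhibits df $> k$:

```python
import numpy as np

def simulate_df(X, beta_true, sigma, fit_fn, n_sims=2000, seed=0):
    """Estimate df = (1/sigma^2) * sum_i Cov(yhat_i, y_i) by simulation."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    mu = X @ beta_true
    ys = np.empty((n_sims, n))
    yhats = np.empty((n_sims, n))
    for s in range(n_sims):
        y = mu + sigma * rng.standard_normal(n)
        ys[s] = y
        yhats[s] = fit_fn(X, y)
    # Sample covariance of (y_i, yhat_i) across simulations, summed over i.
    cov = ((ys - ys.mean(0)) * (yhats - yhats.mean(0))).sum(0) / (n_sims - 1)
    return cov.sum() / sigma**2

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 5))
ols_fit = lambda X, y: X @ np.linalg.lstsq(X, y, rcond=None)[0]
print(simulate_df(X, beta_true=np.zeros(5), sigma=1.0, fit_fn=ols_fit))  # close to 5
```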