Summary

Shrinkage Methods

Because subset selection is a discrete process, the resulting model can have high variance. Shrinkage methods are more continuous and so suffer less from high variability.

3.4.1 Ridge Regression

Minimizes the RSS plus a penalty $\lambda \beta^T \beta$. $\lambda$ is the complexity parameter that controls the amount of shrinkage.

With OLS, coefficients of correlated features can have high variance: a large positive coefficient on one feature can be offset by a large negative coefficient on a correlated feature.

Need feature normalization:

  1. Standardize features: otherwise, coefficients of smaller-scaled features will be penalized more.
  2. Center features: intercept $\beta_0$ is not included in penalty term. Estimate $\beta_0 = \overline{y}$ and estimate other coefficients with centered features.

Estimate is $ \hat{\beta}_{RR} = (X^T X + \lambda I)^{-1} X^T y $, where $X^T X + \lambda I$ is invertible if $\lambda > 0$.
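A minimal NumPy sketch (not from the book; the function name and data are mine) of the closed-form estimate on standardized, centered features, with $\hat{\beta}_0 = \overline{y}$:

```python
import numpy as np

def ridge_estimate(X, y, lam):
    """Closed-form ridge fit on standardized features, intercept = mean(y)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)      # center and standardize each feature
    beta0_hat = y.mean()                           # intercept is not penalized
    p = Xs.shape[1]
    beta_hat = np.linalg.solve(Xs.T @ Xs + lam * np.eye(p), Xs.T @ y)
    return beta0_hat, beta_hat

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 0.5, 0.0, -0.5, 2.0]) + rng.normal(size=100)
print(ridge_estimate(X, y, lam=10.0))
```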

Using the SVD, $X \hat{\beta}_{LS} = UU^Ty $ and $X\hat{\beta}_{RR} = \sum_{j=1}^p u_j \frac{d_j^2}{d_j^2 + \lambda} u_j^T y$. The $p$-th coordinate $u_p^T y$ with respect to the basis $U$ is shrunk the most because it has the smallest singular value $d_p$.

Also, first principal component $z_1 = Xv_1$ has maximal variance among all directions in $C(X)$ and subsequent components (with smaller singular values) have smaller variance. Hence, ridge regression shrinks directions with smaller variance.

Effective degrees of freedom are $df(\lambda) = \sum_{j=1}^p \frac{d_j^2}{d_j^2 + \lambda}$.
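A small sketch (synthetic data, my own setup) of the per-direction shrinkage factors $d_j^2/(d_j^2 + \lambda)$ and of $df(\lambda)$ computed from the singular values of the centered inputs; $df$ decreases from $p$ at $\lambda = 0$ towards $0$ as $\lambda$ grows:

```python
import numpy as np

def ridge_df(X, lam):
    """df(lambda) = sum_j d_j^2 / (d_j^2 + lambda), from the singular values of centered X."""
    d = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
    shrinkage = d**2 / (d**2 + lam)   # factor applied to each coordinate u_j^T y
    return shrinkage.sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
for lam in [0.0, 1.0, 10.0, 100.0]:
    print(lam, ridge_df(X, lam))
```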

3.4.2 Lasso

$L_1$ penalty $\sum_1^p \lvert \beta_j \rvert \leq t$.

The solution is nonlinear in the $y_i$ and has no closed form. Use LAR (3.4.4) to compute the entire lasso path.

Continuous subset selection: sufficiently small $t$ will cause some coefficients to be zero.

Standardized parameter is $ s = t / \sum_1^p \lvert \hat{\beta}^{LS}_i \rvert$.
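A sketch of the lasso path and the standardized parameter $s$, assuming scikit-learn's `lasso_path` (note its objective scales the RSS by $1/(2N)$, so its `alphas` are on a different scale than $t$; the data here is synthetic):

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 0.5, 0.0, -0.5, 2.0]) + rng.normal(size=100)
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
yc = y - y.mean()

# Least-squares fit, used to normalize the L1 norm of each lasso solution.
beta_ls, *_ = np.linalg.lstsq(Xs, yc, rcond=None)

alphas, coefs, _ = lasso_path(Xs, yc)            # coefs has shape (p, n_alphas)
s = np.abs(coefs).sum(axis=0) / np.abs(beta_ls).sum()

# Strong penalty (small s) zeroes out coefficients; s approaches 1 as the penalty vanishes.
for a, si, nz in zip(alphas, s, (coefs != 0).sum(axis=0)):
    print(f"alpha={a:.4f}  s={si:.2f}  nonzero={nz}")
```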

3.4.3 Subset selection, ridge regression, and Lasso

If the features are orthonormal, the lasso also has a closed-form estimate. Ridge does proportional shrinkage, the lasso does soft-thresholding (shrink towards zero and truncate at zero), and subset selection does hard-thresholding (keep or drop).
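The three transformations of the least-squares coefficients in the orthonormal case, as a small sketch (function names are mine):

```python
import numpy as np

def ridge_shrink(beta_ls, lam):
    """Proportional shrinkage: beta / (1 + lambda)."""
    return beta_ls / (1.0 + lam)

def lasso_soft_threshold(beta_ls, lam):
    """Soft-thresholding: sign(beta) * (|beta| - lambda)_+."""
    return np.sign(beta_ls) * np.maximum(np.abs(beta_ls) - lam, 0.0)

def subset_hard_threshold(beta_ls, threshold):
    """Hard-thresholding: keep beta unchanged if |beta| >= threshold, else drop to 0."""
    return np.where(np.abs(beta_ls) >= threshold, beta_ls, 0.0)

beta_ls = np.array([-3.0, -0.5, 0.2, 1.0, 4.0])
print(ridge_shrink(beta_ls, lam=1.0))
print(lasso_soft_threshold(beta_ls, lam=1.0))
print(subset_hard_threshold(beta_ls, threshold=1.0))
```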

Estimation picture (Figure 3.11). Equi-RSS elliptical contours centered at $\hat{\beta}_{LS}$ meet $L_q$ constraint region.

  • When $p > 2$, Lasso constraint region has many corners, setting parameters to $0$.
  • Lasso uses the smallest $q$ ($q = 1$) for which the constraint region is still convex; for $q < 1$ the problem is non-convex and harder to optimize.
  • If $q > 1$, $\lvert \beta_j \rvert^q$ is differentiable at 0, so the penalty no longer sets coefficients exactly to 0.

TODO: Elastic-net penalty

TODO: Bayesian interpretation

3.4.4 Least Angle Regression

At each iteration $k$, move the coefficients of the active set $A_k$ towards their joint OLS coefficients for the current residual $r_k$ regressed on $X_{A_k}$, until some other feature has as much correlation with the residual as the active features.

At each such iteration:

  1. OLS direction keeps the correlations tied and decreasing.
  2. $X_{A_k}\hat{\beta}_k$ makes smallest angle with the features in $X_{A_k}$.
  3. Exact step length can be calculated.

For the lasso path, add one modification: if a non-zero coefficient hits zero, drop that variable from $A_k$ and recompute the joint direction.
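A sketch assuming scikit-learn's `lars_path`, which implements plain LAR (`method='lar'`) as well as the lasso modification (`method='lasso'`):

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 0.5, 0.0, -0.5, 2.0]) + rng.normal(size=100)

# method='lasso' adds the drop-from-the-active-set modification, giving the lasso path.
alphas, active, coefs = lars_path(X, y, method="lasso")

# `active` holds the indices of the active variables at the end of the path;
# `coefs[:, k]` is the coefficient vector at the k-th breakpoint of the path.
print("active set:", active)
print("coefficients at the end of the path:", coefs[:, -1])
```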

Degrees of freedom

How many parameters were used? Define the effective degrees of freedom of an adaptively fitted $\hat{y}$ as df($\hat{y}$) $= \frac{1}{\sigma^2} \sum_{i=1}^N Cov(\hat{y}_i, y_i)$.

| Estimation Method | df($\hat{y}$) | Notes |
| --- | --- | --- |
| Least squares | $k$ | - |
| Ridge | $\sum_i \frac{d_i^2}{d_i^2 + \lambda}$ | - |
| Subset selection | $> k$ | No closed form; estimate by simulation. |
| Least angle regression | $= k$ | - |
| Lasso | $\approx k$ | Approximately the size of the active set. |
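A simulation sketch (settings are mine) of the covariance definition of df for ridge regression, compared against the closed-form $\sum_j d_j^2/(d_j^2+\lambda)$; for a linear smoother $\hat{y} = Sy$ the two agree:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, lam, sigma, n_sim = 100, 5, 10.0, 1.0, 20000

X = rng.normal(size=(N, p))
X -= X.mean(axis=0)                                       # centered features, no intercept
f = X @ rng.normal(size=p)                                # fixed "true" mean of y
S = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)   # ridge smoother matrix

Y = f + sigma * rng.normal(size=(n_sim, N))               # many draws of y
Yhat = Y @ S.T                                            # fitted values for each draw

# df = (1/sigma^2) * sum_i Cov(yhat_i, y_i), estimated over the simulations.
cov = ((Y - Y.mean(axis=0)) * (Yhat - Yhat.mean(axis=0))).mean(axis=0)
df_mc = cov.sum() / sigma**2

d = np.linalg.svd(X, compute_uv=False)
df_formula = (d**2 / (d**2 + lam)).sum()
print(df_mc, df_formula)                                  # the two should be close
```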

Derivations

(3.44) Ridge regression estimate

Follows from OLS estimate derivation, because $\frac{\partial}{\partial \beta} \lambda \beta^T \beta = 2 \lambda \beta$.

$X^T X + \lambda I$ is invertible

$X^T X$ is PSD, so its eigenvalues are non-negative: $u^T X^T X u = \lVert Xu \rVert^2 \geq 0$ for any $u$, so for any eigenvalue $e$ with eigenvector $v$, $0 \leq v^T (X^T X v) = v^T (e v) = e \lVert v \rVert^2$, and therefore $e \geq 0$.

Adding $\lambda I$ shifts each eigenvalue of $X^T X$ by $\lambda$: $(X^T X + \lambda I)v = X^TXv + \lambda v = (e + \lambda)v$. So, eigenvalues of $X^T X + \lambda I$ are positive if $\lambda > 0$ (i.e. is positive definite).

Because all eigenvalues of $X^T X + \lambda I$ are positive, it is invertible: if it were singular, there would exist $x \neq 0$ with $(X^T X + \lambda I)x = 0$, making $0$ an eigenvalue, a contradiction.
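A quick numerical check of the non-negative eigenvalues and the $\lambda$ shift (the example matrix is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))
lam = 3.0

eig_xtx = np.linalg.eigvalsh(X.T @ X)                        # ascending order
eig_shifted = np.linalg.eigvalsh(X.T @ X + lam * np.eye(4))

print(eig_xtx)                    # all >= 0, since X^T X is PSD
print(eig_shifted - eig_xtx)      # each eigenvalue is shifted by exactly lam
```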

(3.47) Analyze ridge regression estimate using SVD of $X$

$$ \begin{equation*} \begin{split} X\hat{\beta}_{RR} &= X(X^TX + \lambda I)^{-1}X^Ty \\ &= UDV^T (VD^TU^T UDV^T + \lambda I)^{-1}VD^TU^Ty \\ &= UDV^T (VD^2 V^T + \lambda VV^T)^{-1}VD^TU^Ty \\ &= UDV^T (V (D^2 + \lambda I)V^T)^{-1}VD^TU^Ty \\ &= UDV^T V^{-T} (D^2 + \lambda I)^{-1} V^{-1} VD^TU^Ty \\ &= UD(D^2 + \lambda I)^{-1} DU^Ty \end{split} \end{equation*} $$

$D(D^2 + \lambda I)^{-1} D$ is a diagonal matrix with $j$-th diagonal entry as $d_j^2 / (d_j^2 + \lambda)$. So, $j$-th column of $UD(D^2 + \lambda I)^{-1} D$ is $u_j \frac{d_j^2}{d_j^2 + \lambda}$. By right-multiplying $U^T$, we obtain (3.47), a sum of outer products.

Hence, directions with smaller singular values $d_j$ are shrunk more.
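A numerical check of (3.47), computing $X\hat{\beta}_{RR}$ from both the closed form and the thin SVD:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
X -= X.mean(axis=0)
y = rng.normal(size=50)
lam = 5.0

fit_closed_form = X @ np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)

U, d, Vt = np.linalg.svd(X, full_matrices=False)
fit_svd = U @ ((d**2 / (d**2 + lam)) * (U.T @ y))   # sum_j u_j [d_j^2/(d_j^2+lam)] u_j^T y

print(np.allclose(fit_closed_form, fit_svd))        # True
```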

$z_1 = Xv_1$ has largest variance among all normalized linear combinations of columns of $X$

Let $z = Xa$ be a normalized linear combination of columns of $X$. That is, $\lVert a \rVert^2 = 1$. $v_1$ is such $a$.

Then, following the derivation of (3.49) below, we obtain $Var(z) = Var(Xa) = a^TVD^2V^Ta$.

Let $b = V^Ta$. Then, $Var(z) = b^TD^2b = \sum_{i,j} b_i d_{i,j}^2 b_j$, which is equal to $\sum_i b_i^2 d_{i,i}^2$ because $D$ is diagonal.

Note that $\sum_i b_i^2 = b^T b = a^TVV^Ta = a^Ta = \lVert a \rVert^2 = 1$. Hence, $Var(z) = \sum_i b_i^2 d_{i,i}^2$ is a weighted sum of the squared singular values with weights that sum to 1.

Which weights maximize $Var(z)$? Since the first singular value is the largest, the best weights are $e_1 = [1, 0, …, 0]^T$, which is achieved by setting $a = v_1$.

TODO: Derive from optimization formulation of PCA. See Mathematics for Machine Learning, equation (10.10).

(3.49) $Var(Xv_1) = d_1^2 / N$

Note $\overline{Xv_1} = \frac{1}{N} \sum_i^N (Xv_1)_i = \frac{1}{N} [\sum_i^N {x_i}^T ] v_1 = 0$ because features in $X$ were centered. Hence,

$$ \begin{align*} N \cdot Var(Xv_1) &= \sum_i^N ((Xv_1)_i - \overline{Xv_1})^2 \\ &= (Xv_1)^T Xv_1 \\ &= v_1^TVD^2V^Tv_1 && \text{by SVD from (3.48)} \\ &= {e_1}^TD^2e_1 && \text{where $e_1 = [1, 0, …, 0]^T$} \\ &= d_1^2 \end{align*} $$
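A numerical check of (3.49) on centered synthetic data (`np.var` divides by $N$ by default, matching the derivation):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
X -= X.mean(axis=0)                        # centered features
N = X.shape[0]

U, d, Vt = np.linalg.svd(X, full_matrices=False)
z1 = X @ Vt[0]                             # first principal component z_1 = X v_1

print(np.var(z1), d[0]**2 / N)             # the two values agree
```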


(3.50) Effective degrees of freedom of ridge regression

Follows from (3.47) and the cyclic property of the trace: $df(\lambda) = \mathrm{tr}[X(X^TX + \lambda I)^{-1}X^T] = \mathrm{tr}[UD(D^2 + \lambda I)^{-1}DU^T] = \mathrm{tr}[D(D^2 + \lambda I)^{-1}D \, U^TU] = \sum_j \frac{d_j^2}{d_j^2 + \lambda}$.

Lasso estimate has no closed form expression

TODO

Exercises

3.5

See Weatherwax and Epstein’s solution; below only adds a bit of explanation. $\beta$ below excludes the intercept $\beta_0$.

1. Show the correspondence between $\beta^c$ and $\beta$ and between $\beta^c_0$ and $\beta_0$.

Rewrite the criterion (3.85) as $ \sum_i [y_i - (\beta^c_0 - \sum_j \overline{x}_{j} \beta^c_j) - \sum_j x_{ij} \beta^c_j]^2 + \lambda \sum_j {\beta^c_j}^2 $.

The above is equivalent to the original ridge regression criterion (3.41) if $\beta_0 = \beta^c_0 - \sum_j \overline{x}_{j} \beta^c_j$ and $\forall j > 0, \beta_j = \beta^c_j$.

2. Characterize (solve for) $\hat{\beta^c_0}$ and $\hat{\beta^c}$

Since we’ve shown the correspondence between $\beta^c$ and $\beta$, let’s solve (3.41) for $\beta_0$ then replace. In vector form:

$$ \frac{\partial}{\partial\beta_0} \sum_i \big[ \beta_0^2 - 2 y_i \beta_0 + 2 x_i^T \beta \, \beta_0 \big] = \sum_i \big[ 2\beta_0 - 2 y_i + 2 x_i^T \beta \big] = 0 $$

where the terms of (3.41) that do not involve $\beta_0$ have been dropped. Since $\beta_0 = \beta^c_0 - \overline{x}^T \beta^c$ and $\beta = \beta^c$, we obtain:

$$ N \cdot \hat{\beta^c_0} = \sum_i y_i - \sum_i x_i^T \beta^c + N \cdot \overline{x}^T \beta^c$$

The latter two terms cancel by the definition of the sample mean, so $\hat{\beta^c_0} = \frac{1}{N}\sum_i y_i = \overline{y}$.

For $\hat{\beta^c}$, since $\beta = \beta^c$ and the solution $(X^TX + \lambda I)^{-1}X^Ty$ does not involve $\beta_0$, we have the same solution.

A similar result holds for the lasso intercept because the penalty term was not involved in characterizing $\hat{\beta^c_0}$.
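A numerical sanity check of this characterization, assuming scikit-learn's `Ridge` (which fits an unpenalized intercept by internally centering the data); agreement is up to numerical tolerance:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4)) + 5.0                  # deliberately non-centered features
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(size=100)
lam = 7.0

# Manual solution from the exercise: beta^c from centered features, beta^c_0 = mean(y).
Xc = X - X.mean(axis=0)
beta_c = np.linalg.solve(Xc.T @ Xc + lam * np.eye(4), Xc.T @ y)
beta0_c = y.mean()

model = Ridge(alpha=lam, fit_intercept=True).fit(X, y)
print(np.allclose(model.coef_, beta_c))                                  # True
print(np.allclose(model.intercept_, beta0_c - X.mean(axis=0) @ beta_c))  # True: beta_0 = beta^c_0 - xbar^T beta^c
```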

References:

  • Weatherwax and Epstein. A Solution Manual and Notes for: The Elements of Statistical Learning.

3.6

Assume likelihood $y \lvert \beta, X \sim N(X\beta, \sigma^2 I)$ and prior $\beta \sim N(0, \gamma I)$. Show that the mode and mean of the posterior is equivalent to the ridge regression estimate. Show relationship between $\lambda$, $\gamma$ and $\sigma^2$.

The mode maximizes the posterior (maximum a posteriori estimation):

$$ \begin{align*} \hat{\beta}_{MAP} &= \arg\max_{\beta} p(\beta \lvert y, X) \\ &= \arg\max_{\beta} \log p(\beta \lvert y, X) \\ &= \arg\max_{\beta} \log p(y \lvert \beta, X) + \log p(\beta) \\ &= \arg\max_{\beta} \log \frac{\exp(- (y - X\beta)^T (y - X\beta) / 2\sigma^2)}{(2\pi\sigma^2)^{N/2}} + \log \frac{\exp(- \beta^T \beta / 2\gamma)}{(2\pi\gamma)^{p/2}} \\ &= \arg\max_{\beta} -\frac{1}{2\sigma^2} (y - X\beta)^T (y - X\beta) - \frac{1}{2\gamma} \beta^T\beta \\ &= \arg\max_{\beta} \frac{1}{\sigma^2} \beta^TX^Ty -\frac{1}{2\sigma^2} \beta^TX^TX\beta - \frac{1}{2\gamma} \beta^T\beta \end{align*} $$

By taking derivative, we obtain

$$\nabla_{\beta} = \frac{1}{\sigma^2} X^Ty - \frac{1}{\sigma^2} X^TX\beta - \frac{1}{\gamma} \beta = 0$$

Solving for $\beta$ gives $\hat{\beta} = (X^TX + \frac{\sigma^2}{\gamma} I)^{-1}X^Ty$. Hence, if $\gamma = \sigma^2 / \lambda$ (equivalently $\lambda = \sigma^2 / \gamma$), the mode is equal to the ridge regression solution.

For the mean, note that the posterior is Gaussian because $p(\beta \lvert y, X) \propto p(y \lvert \beta, X) p(\beta)$ and both likelihood and prior are Gaussian (see Paisley Lecture 5 for a more formal explanation). For a Gaussian distribution, mean is equal to the mode.


3.7

Assume $y_i \sim N(\beta_0 + x_i^T \beta, \sigma^2)$ and $\forall j > 0, \beta_j \overset{\text{i.i.d}}{\sim} N(0, \gamma^2)$. Show that the log-posterior is proportional to $\sum_i [(y_i - \beta_0 - x_i^T \beta)^2] + \lambda \beta^T\beta$ from (3.41).

Assuming that $y_i$ are conditionally independent:

$$ \begin{align*} \log p(\beta \lvert y, X) &\propto \log p(y \lvert \beta, X) + \log p(\beta) \\ &\propto \log \prod_i [p(y_i \lvert \beta, x_i)] + \log p(\beta) \\ &\propto \sum_i [\log p(y_i \lvert \beta, x_i)] + \log p(\beta) \\ &= \sum_i \Big[\log \frac{\exp(- (y_i - \beta_0 - x_i^T \beta)^2 / 2 \sigma^2)}{\sigma \sqrt{2 \pi}}\Big] + \log \frac{\exp(- \beta^T \beta / 2\gamma^2)}{(2\pi\gamma^2)^{p/2}} \\ &\propto \sum_i \Big[- \frac{1}{2\sigma^2} (y_i - \beta_0 - x_i^T \beta)^2\Big] -\frac{1}{2\gamma^2} \beta^T \beta \\ &= -\frac{1}{2\sigma^2} \Big( \sum_i [(y_i - \beta_0 - x_i^T \beta)^2] + \lambda \beta^T \beta \Big) && \text{with } \lambda = \sigma^2 / \gamma^2 \end{align*} $$

which is proportional to the given sum.

3.8

Let $X$ be an $N \times (p + 1)$ matrix whose first column is all ones. Let $Q_2$ be the $N \times p$ matrix obtained by removing the first column from the $Q$ of the QR decomposition of $X$. Let $\tilde{X}$ be the $N \times p$ matrix obtained by dropping the column of ones from $X$ and centering the remaining columns. Let $UDV^T$ be the SVD of $\tilde{X}$.

1. Show that $Q_2$ and $U$ span the same subspace.

$U$ trivially spans $C(\tilde{X})$ because $DV^T$ gives the linear combinations. So, let’s show that $Q_2$ also spans $C(\tilde{X})$, which is equivalent to linearly combining columns of $Q_2$ to obtain each $\tilde{x}_j$.

$$ \begin{align*} \tilde{x}_j &= x_j - \overline{x}_j \mathbf{1} \\ &= x_j - \Big(\frac{\mathbf{1}^T x_j}{N}\Big) \mathbf{1} \\ &= x_j - \Big(\frac{\sqrt{N} q_0^T x_j}{N}\Big) \sqrt{N} q_0 && \text{since $q_0 = \mathbf{1} / \sqrt{N}$} \\ &= x_j - (q_0^T x_j) q_0 \\ &= \sum_{k = 0}^{j} (q_k^T x_j) q_k - (q_0^T x_j) q_0 && \text{from Gram-Schmidt (see below)} \\ &= \sum_{k = 1}^{j} (q_k^T x_j) q_k \end{align*} $$

which is a linear combination of the columns of $Q_2$.

From Gram-Schmidt and the QR decomposition, $r_{kj} = q_k^T x_j$. So $x_j = Qr_j = \sum_{k = 0}^{j} (q_k^T x_j) q_k$, a linear combination of ($x_j$'s projections onto) the $q_k$'s for $k = 0$ to $j$ (see Strang, Section 4.4).

By subtracting $\overline{x}_j \mathbf{1}$ or $(q_0^T x_j) q_0$ from $x_j$, we have removed the $q_0$’s contribution to that linear combination. Hence, $\tilde{x}_j$ can be linearly combined without the first column of $Q$ thereby only using the columns of $Q_2$.
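A quick numerical check that $Q_2$ and $U$ span the same subspace, by comparing the projection matrices $Q_2Q_2^T$ and $UU^T$ (two orthonormal bases span the same subspace iff their projections agree):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 4
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])   # first column of ones

Q, R = np.linalg.qr(X)                                       # reduced QR, Q is N x (p + 1)
Q2 = Q[:, 1:]

Xtilde = X[:, 1:] - X[:, 1:].mean(axis=0)                    # drop the ones column, then center
U, d, Vt = np.linalg.svd(Xtilde, full_matrices=False)

print(np.allclose(Q2 @ Q2.T, U @ U.T))                       # True
```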

3.12

Augment $X$ with $\sqrt{\lambda}I_{p}$ and $y$ with $\mathbf{0}_{p}$ additional rows. Show that OLS estimate on the augmented data yields ridge regression estimates for $X$ and $y$.

Follows from $X_{new}^T X_{new} = X^T X + \lambda I_p$ and $X_{new}^T y_{new} = X^T y$ using block multiplication (see Strang Section 2.4).
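A quick numerical check of the augmented-data trick (synthetic data):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, lam = 60, 4, 3.0
X = rng.normal(size=(N, p))
y = rng.normal(size=N)

X_aug = np.vstack([X, np.sqrt(lam) * np.eye(p)])   # append sqrt(lam) * I_p as extra rows
y_aug = np.concatenate([y, np.zeros(p)])           # append p zeros to y

beta_ols_aug, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

print(np.allclose(beta_ols_aug, beta_ridge))       # True
```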

3.16

If the columns of $X$ are orthonormal, find the $j$-th coefficient estimate for each method.

1. Subset Selection estimate is $\hat{\beta}_j \cdot I[\lvert \hat{\beta_j} \rvert \geq \lvert \hat{\beta}_{(M)} \rvert]$

$$ \begin{align*} RSS_{subset} &= \lVert y - X_{subset}\hat{\beta}_{subset} \rVert^2 \\ &= \lVert y - X\hat{\beta} + \sum_{j \in D} \hat{\beta}_j x_j \rVert^2 && \text{$D$ is the set of dropped indices; kept coefficients are unchanged by orthonormality} \\ &= \lVert y - X\hat{\beta} \rVert^2 + \lVert \sum_{j \in D} \hat{\beta}_j x_j \rVert^2 && \text{residual and linear combinations of $x_j$'s are orthogonal} \\ &= \lVert y - X\hat{\beta} \rVert^2 + \sum_{j \in D} \hat{\beta}_j^2 \lVert x_j \rVert^2 && \text{$x_j$'s are mutually orthogonal} \\ &= RSS + \sum_{j \in D} \hat{\beta}_j^2 && \text{$x_j$'s have unit norm} \end{align*} $$

Hence to minimize $RSS_{subset}$, subset selection drops coefficients with smallest $\lvert \hat{\beta}_j \rvert$ or keeps coefficients $\lvert \hat{\beta_j} \rvert \geq \lvert \hat{\beta}_{(M)} \rvert$.

2. Ridge estimate is $\hat{\beta}_j / (1 + \lambda)$

Because $(X^TX + \lambda I)^{-1}X^Ty = (I + \lambda I)^{-1}X^Ty = X^Ty / (1 + \lambda)$ and OLS estimate is $X^Ty$.

3. Lasso estimate is $sign(\hat{\beta}_{j})(\lvert \hat{\beta}_{j} \rvert - \lambda)_{+}$

See this StackExchange answer by user cardinal.

3.17

3.23

3.24

3.25

3.26

3.27

3.28

3.29

3.30