Summary
Subset selection
By retaining a subset of features, we can improve:
- Prediction accuracy, by trading higher bias for lower variance.
- Interpretability, by narrowing down to the features with the strongest effects.
3.3.1 Best-subset selection
For each subset size $k$ from $0$ to $p$, find the subset of size $k$ that gives the smallest RSS.
- Best subsets are not necessarily nested.
- Produces a sequence of models of increasing complexity: the RSS of the best subset decreases as $k$ increases, so RSS alone cannot be used to select $k$.
- Computationally infeasible for $p$ much larger than about 40.
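A minimal numpy sketch of the procedure (the function names `rss` and `best_subset` are mine, not from the book):

```python
import itertools
import numpy as np

def rss(X_sub, y):
    """RSS of the least-squares fit of y on the given columns (plus an intercept)."""
    Xi = np.column_stack([np.ones(len(y)), X_sub])
    beta, *_ = np.linalg.lstsq(Xi, y, rcond=None)
    return np.sum((y - Xi @ beta) ** 2)

def best_subset(X, y):
    """For each size k = 0..p, return (best subset of size k, its RSS)."""
    n, p = X.shape
    results = {0: ((), np.sum((y - y.mean()) ** 2))}   # k = 0: intercept only
    for k in range(1, p + 1):
        best = min(itertools.combinations(range(p), k),
                   key=lambda s: rss(X[:, list(s)], y))
        results[k] = (best, rss(X[:, list(best)], y))
    return results
```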
3.3.2 Forward and backward stepwise selection
Sequentially add to (or remove from) the active set the single feature that yields the smallest RSS after refitting; both variants are sketched in code after the list below.
1. Forward: Begin with the intercept. At each step, add the feature that maximizes $\lvert q^T y \rvert$, where $q$ is the unit-norm residual of the candidate feature on the current active set (Exercise 3.9).
Greedy. Higher RSS than best-subset selection, but:
- Computationally feasible
- Has lower variance (at the cost of higher bias) because the search is more constrained. TODO: formalize
2. Backward: Begin with the full model. At each step, remove the feature with the smallest absolute Z-score (Exercise 3.10).
Unlike forward stepwise, backward stepwise cannot be used when $N \le p$, because $X^T X$ in the Z-score is not invertible.
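A rough sketch of both procedures under the same conventions as the block above (helper names are mine; forward adds the feature that most reduces RSS, backward drops the feature with the smallest absolute Z-score):

```python
import numpy as np

def _fit(X, y, cols):
    """Least-squares fit of y on the selected columns plus an intercept."""
    Xi = np.column_stack([np.ones(len(y)), X[:, list(cols)]])
    beta, *_ = np.linalg.lstsq(Xi, y, rcond=None)
    return Xi, beta, y - Xi @ beta

def forward_stepwise(X, y):
    """Return the features in the order they are added."""
    p, active = X.shape[1], []
    while len(active) < p:
        candidates = [j for j in range(p) if j not in active]
        j_best = min(candidates,
                     key=lambda j: np.sum(_fit(X, y, active + [j])[2] ** 2))
        active.append(j_best)
    return active

def backward_stepwise(X, y):
    """Return the features in the order they are dropped (needs N > p + 1)."""
    active, dropped = list(range(X.shape[1])), []
    while active:
        Xi, beta, resid = _fit(X, y, active)
        sigma2 = np.sum(resid ** 2) / (len(y) - Xi.shape[1])
        se = np.sqrt(sigma2 * np.diag(np.linalg.inv(Xi.T @ Xi)))
        z = beta[1:] / se[1:]                       # skip the intercept
        j_drop = active[int(np.argmin(np.abs(z)))]
        active.remove(j_drop)
        dropped.append(j_drop)
    return dropped
```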
3.3.3 Forward-Stagewise regression
At each step, select the feature most correlated with the current residual, and add to that feature's current coefficient the coefficient of the univariate linear regression of the residual on that feature. The other coefficients are not adjusted, so it can take many steps to reach the least-squares fit.
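A minimal sketch, assuming standardized features and a centered response as in the book (the function name and step count are mine):

```python
import numpy as np

def forward_stagewise(X, y, n_steps=1000):
    """Forward stagewise: only the selected coefficient moves at each step."""
    n, p = X.shape
    beta = np.zeros(p)
    r = y - y.mean()                             # current residual
    for _ in range(n_steps):
        scores = X.T @ r                         # proportional to correlations for standardized X
        j = int(np.argmax(np.abs(scores)))
        delta = scores[j] / (X[:, j] @ X[:, j])  # univariate LS coefficient of r on x_j
        beta[j] += delta
        r -= delta * X[:, j]                     # other coefficients stay untouched
    return beta
```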
Derivations
(p57) Best-subset curve is necessarily decreasing, so we cannot use RSS to select the subset size.
Since RSS is the squared norm of the residual after projecting $y$ onto the column space $C(X)$, RSS can only decrease as we expand $C(X)$. In particular, the best subset of size $k+1$ does at least as well as the best size-$k$ subset augmented with any extra feature, so the best-subset curve is non-increasing even though best subsets are not nested.
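A quick numerical check of the projection argument (the setup is mine): projecting onto a larger column space can only shrink the residual.

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(50, 6)), rng.normal(size=50)

def rss_first_k(k):
    """RSS after projecting y onto the span of the intercept and the first k features."""
    Q, _ = np.linalg.qr(np.column_stack([np.ones(50), X[:, :k]]))
    return np.sum((y - Q @ (Q.T @ y)) ** 2)

rss_path = [rss_first_k(k) for k in range(7)]
assert all(a >= b for a, b in zip(rss_path, rss_path[1:]))   # non-increasing in k
```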
(p59) Forward stepwise will have lower variance than best subset
(Hand-wavy) Forward stepwise is a more constrained, greedy search: at each subset size it has far fewer candidate models to choose from than best subset, so its fit varies less across training sets. A rough simulation follows below.
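A rough simulation of this claim, reusing `best_subset` and `forward_stepwise` from the sketches above; the data-generating setup (weak coefficients, sizes, seed) is entirely my own, and the gap it shows is typically small and noisy, so this is illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, k, reps = 40, 8, 3, 200
beta_true = np.concatenate([0.3 * np.ones(3), np.zeros(p - 3)])   # weak signal
X_test = rng.normal(size=(100, p))

def predict(X, y, cols, X_new):
    """Refit on the chosen columns (plus intercept) and predict at X_new."""
    Xi = np.column_stack([np.ones(len(y)), X[:, list(cols)]])
    beta, *_ = np.linalg.lstsq(Xi, y, rcond=None)
    return np.column_stack([np.ones(len(X_new)), X_new[:, list(cols)]]) @ beta

preds = {"best subset": [], "forward stepwise": []}
for _ in range(reps):
    X = rng.normal(size=(n, p))
    y = X @ beta_true + rng.normal(size=n)
    preds["best subset"].append(predict(X, y, best_subset(X, y)[k][0], X_test))
    preds["forward stepwise"].append(predict(X, y, forward_stepwise(X, y)[:k], X_test))

for name, ps in preds.items():
    print(name, np.mean(np.var(np.array(ps), axis=0)))   # average prediction variance
```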
(p60) Selecting features by F-statistics does not account for multiple-testing issues.
(p60) Standard errors reported after a model search are not valid, because they do not account for the search process. Both points are illustrated by the small simulation below.
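A small pure-noise illustration (setup entirely mine): select the single best-looking feature, then compute its naive Z-score as if it had been chosen in advance; it exceeds the usual 5% cutoff far more often than 5% of the time.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, reps = 50, 20, 500
hits = 0
for _ in range(reps):
    X = rng.normal(size=(n, p))
    y = rng.normal(size=n)                 # y is independent of every feature
    j = int(np.argmax(np.abs((X - X.mean(0)).T @ (y - y.mean()))))  # best-looking feature
    Xi = np.column_stack([np.ones(n), X[:, j]])
    beta, *_ = np.linalg.lstsq(Xi, y, rcond=None)
    resid = y - Xi @ beta
    se = np.sqrt(np.sum(resid ** 2) / (n - 2) * np.linalg.inv(Xi.T @ Xi)[1, 1])
    hits += abs(beta[1]) / se > 1.96
print(hits / reps)   # well above the nominal 0.05
```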
Exercises
3.9
Suppose we have an $N \times q$ matrix $X_1 = Q_1 R_1$ and residual $r_1 = y - \hat{y}_1$. Among the remaining $p - q$ features, which one should we select as $x_2$ so that $RSS_2$ is smallest?
$$
\begin{align*}
RSS_2 &= \lVert y - \hat{y}_2 \rVert^2 \\
&= \lVert y - Q_2 Q_2^T y \rVert^2 && \text{from (3.33)} \\
&= y^T y - 2\, y^T Q_2 Q_2^T y + y^T Q_2 Q_2^T Q_2 Q_2^T y \\
&= y^T y - y^T Q_2 Q_2^T y && Q_2^T Q_2 = I \\
&= y^T y - y^T (Q_1 Q_1^T + q_2 q_2^T)\, y && Q_2 Q_2^T = Q_1 Q_1^T + q_2 q_2^T \\
&= RSS_1 - y^T q_2 q_2^T y && RSS_1 = y^T y - y^T Q_1 Q_1^T y
\end{align*}
$$
Hence, $\arg\min_{q_2} RSS_2 = \arg\max_{q_2} y^T q_2 {q_2}^T y = \arg\max_{q_2} \lvert y^T q_2 \rvert$.
Moreover, $y = r_1 + \hat{y}_1$ with $\hat{y}_1 = Q_1 Q_1^T y$, and $q_2 = \tilde{x}_2 / \lVert \tilde{x}_2 \rVert$, where $\tilde{x}_2 = x_2 - Q_1 Q_1^T x_2$ is the residual after projecting $x_2$ onto $C(X_1)$ (the QR step normalizes it to unit length). So:
$$
\begin{align*}
y^T \tilde{x}_2 &= (r_1 + Q_1 Q_1^T y)^T (x_2 - Q_1 Q_1^T x_2) \\
&= r_1^T x_2 - r_1^T Q_1 Q_1^T x_2 + y^T Q_1 Q_1^T x_2 - y^T Q_1 Q_1^T Q_1 Q_1^T x_2 \\
&= r_1^T (I - Q_1 Q_1^T)\, x_2 \\
&= r_1^T x_2 && r_1 \perp C(X_1)
\end{align*}
$$
Hence, $\lvert y^T q_2 \rvert = \lvert r_1^T x_2 \rvert / \lVert \tilde{x}_2 \rVert$. The residual $r_1$ is fixed across candidates, so we select the $x_2$ that maximizes this ratio; only the normalizing factor $\lVert \tilde{x}_2 \rVert$ depends on the candidate.
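A numerical sanity check of this derivation (the setup is mine): the candidate maximizing $\lvert q^T y \rvert$, with $q$ the unit-norm residual of the candidate on $C(X_1)$, is exactly the one whose addition reduces RSS the most.

```python
import numpy as np

rng = np.random.default_rng(3)
n, q_cols, p = 60, 3, 10
X, y = rng.normal(size=(n, p)), rng.normal(size=n)
X1 = X[:, :q_cols]                        # current active features
Q1, _ = np.linalg.qr(X1)
r1 = y - Q1 @ (Q1.T @ y)                  # current residual

def rss_after_adding(j):
    Q, _ = np.linalg.qr(np.column_stack([X1, X[:, j]]))
    return np.sum((y - Q @ (Q.T @ y)) ** 2)

def score(j):
    x_tilde = X[:, j] - Q1 @ (Q1.T @ X[:, j])             # residual of x_j on C(X_1)
    return abs(r1 @ x_tilde) / np.linalg.norm(x_tilde)    # |q^T y|

candidates = range(q_cols, p)
assert max(candidates, key=score) == min(candidates, key=rss_after_adding)
```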
3.10
Show that dropping the feature with the smallest absolute Z-score increases RSS the least.
The feature that increases the RSS the least when dropped has the smallest F statistic:
$$ F = \frac{RSS_{\text{dropped}} - RSS_{\text{full}}}{RSS_{\text{full}} / (N - p - 1)} $$
From Exercise 3.1, this F statistic is equal to the square of the corresponding Z-score.
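A quick numerical check of this identity (the setup is mine): for each non-intercept feature, the F statistic for dropping it equals its squared Z-score in the full fit.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 80, 5
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # intercept + p features
y = rng.normal(size=n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
rss_full = np.sum((y - X @ beta) ** 2)
sigma2 = rss_full / (n - p - 1)
z = beta / np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))  # Z-scores in the full fit

for j in range(1, p + 1):                                     # skip the intercept
    Xd = np.delete(X, j, axis=1)
    bd, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    rss_dropped = np.sum((y - Xd @ bd) ** 2)
    F = (rss_dropped - rss_full) / (rss_full / (n - p - 1))
    assert np.isclose(F, z[j] ** 2)
```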
Reference:
- Weatherwax and Epstein. A Solution Manual and Notes for: The Elements of Statistical Learning.