By retaining a subset of features, we can improve:
Best-subset selection: for each subset size $k$ from $0$ to $p$, find the subset of $k$ features that gives the smallest RSS.
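A minimal sketch of this exhaustive search, assuming features are stored as plain Python lists (one list per column) and that an intercept is always included. The names `rss` and `best_subsets` are illustrative, not from the source. RSS is computed by Gram-Schmidt orthogonalization to keep the example dependency-free.

```python
from itertools import combinations


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def rss(columns, y):
    """RSS of the least-squares fit of y on an intercept plus `columns`."""
    n = len(y)
    basis = []              # orthonormal basis built so far
    residual = list(y)      # part of y not yet explained
    for col in [[1.0] * n] + columns:
        v = list(col)
        for q in basis:     # remove components along earlier directions
            c = dot(q, v)
            v = [vi - c * qi for vi, qi in zip(v, q)]
        norm = dot(v, v) ** 0.5
        if norm < 1e-12:
            continue        # linearly dependent column adds nothing
        q = [vi / norm for vi in v]
        c = dot(q, residual)
        residual = [ri - c * qi for ri, qi in zip(residual, q)]
        basis.append(q)
    return dot(residual, residual)


def best_subsets(X_cols, y):
    """Map each size k in 0..p to (smallest RSS, subset of column indices)."""
    p = len(X_cols)
    best = {}
    for k in range(p + 1):
        candidates = (
            (rss([X_cols[j] for j in subset], y), subset)
            for subset in combinations(range(p), k)
        )
        best[k] = min(candidates, key=lambda t: t[0])
    return best
```

The search fits all $2^p$ subsets, so this is only practical for small $p$.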
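Sequentially add to or remove from the active set the feature that gives the smallest resulting RSS.
1. Forward: Begin with the intercept. At each step, add the feature that maximizes $\lvert q^Ty \rvert$, where $q$ is the candidate feature orthogonalized against the current active set and normalized to unit length (Exercise 3.9).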
Greedy. RSS is at least as high as that of best-subset selection, but:
2. Backward: Begin with the full model. At each step, remove the feature with the smallest absolute Z-score.
Unlike Forward, Backward cannot be used if $N \le p$, because $X^TX$, whose inverse is needed for the Z-scores, is then not invertible.
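The forward step in item 1 can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: features are plain Python lists (one per column), the function name `forward_stepwise` is hypothetical, and each candidate is orthogonalized against the intercept and the already-selected features before scoring by $\lvert q^Ty \rvert$.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthogonalize(v, basis):
    """Remove from v its components along each orthonormal vector in basis."""
    v = list(v)
    for q in basis:
        c = dot(q, v)
        v = [vi - c * qi for vi, qi in zip(v, q)]
    return v


def forward_stepwise(X_cols, y, n_steps):
    """Return the indices of the selected features, in order of entry."""
    n = len(y)
    basis = [[1.0 / n ** 0.5] * n]      # start with the (normalized) intercept
    active = []
    remaining = list(range(len(X_cols)))
    for _ in range(min(n_steps, len(remaining))):
        best_j, best_q, best_score = None, None, 0.0
        for j in remaining:
            v = orthogonalize(X_cols[j], basis)
            norm = dot(v, v) ** 0.5
            if norm < 1e-12:
                continue                # candidate lies in the current span
            q = [vi / norm for vi in v]
            score = abs(dot(q, y))      # the |q^T y| criterion
            if score > best_score:
                best_j, best_q, best_score = j, q, score
        if best_j is None:
            break                       # no candidate improves the fit
        active.append(best_j)
        remaining.remove(best_j)
        basis.append(best_q)
    return active
```

Adding the feature with the largest $\lvert q^Ty \rvert$ lowers the RSS by $(q^Ty)^2$, which is why this criterion matches "smallest resulting RSS".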
At each step, forward-stagewise regression selects the feature most correlated with the current residual. It then adds to that feature's current coefficient the univariate linear regression coefficient of the residual on that feature.
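A minimal sketch of this update, assuming centered predictors, coefficients initialized to zero, and an intercept equal to the mean of $y$. The function name `forward_stagewise` and the list-of-columns layout are illustrative choices, not from the source.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def forward_stagewise(X_cols, y, n_steps):
    """Return (intercept, coefficients, residual) after at most n_steps updates.

    The intercept applies to the centered predictors.
    """
    n = len(y)
    cols = []
    for col in X_cols:                  # center each predictor
        mean = sum(col) / n
        cols.append([c - mean for c in col])
    intercept = sum(y) / n
    residual = [yi - intercept for yi in y]
    beta = [0.0] * len(cols)
    for _ in range(n_steps):
        # The residual norm is common to all candidates, so ranking by
        # |x_j^T r| / ||x_j|| is the same as ranking by absolute correlation.
        best_j, best_score = None, 0.0
        for j, x in enumerate(cols):
            norm = dot(x, x) ** 0.5
            if norm < 1e-12:
                continue                # constant column carries no signal
            score = abs(dot(x, residual)) / norm
            if score > best_score:
                best_j, best_score = j, score
        if best_j is None or best_score < 1e-12:
            break                       # residual is uncorrelated with every feature
        x = cols[best_j]
        delta = dot(x, residual) / dot(x, x)   # univariate regression coefficient
        beta[best_j] += delta
        residual = [ri - delta * xi for ri, xi in zip(residual, x)]
    return intercept, beta, residual
```

Only one coefficient changes per step and the others are not refit, so reaching the least-squares fit can take many more than $p$ steps.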