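Suppose we have the $N \times q$ matrix $X_1$ with QR decomposition $X_1 = Q_1 R_1$ and the current residual $r_1 = y - \hat{y}_1$. Among the remaining $p - q$ features, which one should we select as $x_2$ so that $RSS_2$ is reduced the most? Here $Q_2 = [Q_1, q_2]$ denotes the orthonormal basis obtained after appending $x_2$ to $X_1$.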
\[\begin{align*}
RSS_2 &= \lVert y - \hat{y}_2 \rVert^2\\
&= \lVert y - Q_2 Q_2^T y \rVert^2 && \text{from (3.33)}\\
&= y^Ty - 2 y^T Q_2 Q_2^T y + y^T Q_2 Q_2^T Q_2 Q_2^T y\\
&= y^Ty - y^T Q_2 Q_2^T y && \text{(*), using } Q_2^T Q_2 = I\\
&= y^Ty - y^T (Q_1 Q_1^T + q_2 q_2^T) y && \text{since } UV^T = u_1 v_1^T + u_2 v_2^T + \cdots\\
&= RSS_1 - y^T q_2 q_2^T y && \text{from (*) applied to } Q_1\\
&= RSS_1 - (y^T q_2)^2
\end{align*}\]
Hence, $\arg\min_{q_2} RSS_2 = \arg\max_{q_2} (y^T q_2)^2 = \arg\max_{q_2} \lvert y^T q_2 \rvert$.
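Moreover, $y = r_1 + \hat{y}_1 = r_1 + Q_1 Q_1^T y$, and $q_2$ is the normalized residual left after projecting $x_2$ onto $C(X_1)$: $q_2 = \tilde{x}_2 / \lVert \tilde{x}_2 \rVert$ with $\tilde{x}_2 = x_2 - Q_1 Q_1^T x_2$. So:
\[\begin{align*}
y^T \tilde{x}_2 &= (r_1 + Q_1 Q_1^T y)^T (x_2 - Q_1 Q_1^T x_2)\\
&= r_1^T x_2 - r_1^T Q_1 Q_1^T x_2 + y^T Q_1 Q_1^T x_2 - y^T Q_1 Q_1^T Q_1 Q_1^T x_2\\
&= r_1^T x_2 - r_1^T Q_1 Q_1^T x_2 && \text{using } Q_1^T Q_1 = I\\
&= r_1^T (I - Q_1 Q_1^T) x_2\\
&= r_1^T x_2 && \text{since } Q_1^T r_1 = 0
\end{align*}\]
Hence $\lvert y^T q_2 \rvert = \lvert r_1^T x_2 \rvert / \lVert (I - Q_1 Q_1^T) x_2 \rVert$, and we select the $x_2$ that maximizes this ratio: the absolute dot product with the current residual, divided by the norm of the part of $x_2$ orthogonal to $C(X_1)$. The residual $r_1$ and the basis $Q_1$ are fixed across candidates, so each candidate costs only one projection and one dot product.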
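As a numerical sanity check, here is a minimal NumPy sketch of one selection step. It scores each candidate by $\lvert r_1^T x_j \rvert$ divided by the norm of $x_j$'s residual after projection onto $C(X_1)$ (the normalization that makes $q_2$ a unit vector), then compares the winner with a brute-force refit. The data, the sizes `N`, `p`, `q`, and the helper `rss` are assumptions made for illustration, and no intercept is included, mirroring the setup above.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, q = 50, 6, 2                       # illustrative sizes (assumed)
X = rng.normal(size=(N, p))
X[:, 3] *= 10.0                          # rescale one candidate on purpose
y = X @ rng.normal(size=p) + rng.normal(size=N)

X1 = X[:, :q]                            # features already in the model
Q1, _ = np.linalg.qr(X1)                 # reduced QR, Q1 is N x q
r1 = y - Q1 @ (Q1.T @ y)                 # current residual
rss1 = float(r1 @ r1)

def rss(A, y):
    """Residual sum of squares of the least-squares fit of y on A."""
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    res = y - A @ coef
    return float(res @ res)

candidates = list(range(q, p))
Xc = X[:, candidates]
Xc_resid = Xc - Q1 @ (Q1.T @ Xc)         # part of each candidate orthogonal to C(X1)
scores = np.abs(r1 @ Xc) / np.linalg.norm(Xc_resid, axis=0)   # |y^T q_2| per candidate

best_by_score = candidates[int(np.argmax(scores))]

# Brute force: refit with each candidate appended and compare RSS.
rss_after = np.array([rss(X[:, list(range(q)) + [j]], y) for j in candidates])
best_by_rss = candidates[int(np.argmin(rss_after))]

print(best_by_score == best_by_rss)
print(np.allclose(rss1 - scores**2, rss_after))
```

The second check confirms $RSS_2 = RSS_1 - (y^T q_2)^2$ for every candidate. Because the score is normalized, rescaling a column (as done for column 3) does not change its ranking.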
Show that dropping the feature with the smallest absolute Z-score increases RSS the least.
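The feature that increases the RSS the least when dropped has the smallest F statistic:
\[F_j = \frac{RSS_{\text{dropped}} - RSS}{RSS / (N - p - 1)}\]
From Exercise 3.1, this F statistic is equal to the square of the corresponding Z-score, $F_j = z_j^2$. The denominator $RSS / (N - p - 1) = \hat{\sigma}^2$ comes from the full model and is the same for every candidate feature, so $RSS_{\text{dropped}} - RSS = \hat{\sigma}^2 z_j^2$. The increase in RSS is therefore smallest for the feature with the smallest $\lvert z_j \rvert$.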
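The relation between the F statistic and the Z-score can be checked numerically. The sketch below fits a full model with an intercept on simulated data, computes each Z-score as $\hat{\beta}_j / (\hat{\sigma} \sqrt{v_j})$ with $v_j$ the $j$-th diagonal entry of $(X^T X)^{-1}$, and compares its square with the F statistic from refitting without feature $j$. The data, sizes, and the helper `fit_rss` are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 60, 4                             # illustrative sizes (assumed)
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])   # intercept + p features
y = X @ rng.normal(size=p + 1) + rng.normal(size=N)

def fit_rss(A, y):
    """Least-squares coefficients and residual sum of squares."""
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    res = y - A @ coef
    return coef, float(res @ res)

beta, rss_full = fit_rss(X, y)
sigma2 = rss_full / (N - p - 1)          # unbiased variance estimate from the full model
v = np.diag(np.linalg.inv(X.T @ X))
z = beta / np.sqrt(sigma2 * v)           # Z-scores, index 0 is the intercept

# RSS increase from dropping each non-intercept feature in turn.
increase = np.array([
    fit_rss(np.delete(X, j, axis=1), y)[1] - rss_full
    for j in range(1, p + 1)
])
F = increase / sigma2

print(np.allclose(F, z[1:] ** 2))
print(int(np.argmin(increase)) == int(np.argmin(np.abs(z[1:]))))
```

Both checks should print `True`: the per-feature F statistic matches the squared Z-score, and the feature with the smallest absolute Z-score is the one whose removal increases RSS the least.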
Reference: