Follows from the OLS estimate derivation, since $\frac{\partial}{\partial \beta} \lambda \beta^T \beta = 2 \lambda \beta$.
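Spelled out (a minimal sketch, assuming the penalized criterion $RSS(\lambda) = (y - X\beta)^T(y - X\beta) + \lambda \beta^T \beta$), setting the gradient to zero gives the ridge estimate:

\[\begin{aligned}
\frac{\partial RSS(\lambda)}{\partial \beta} &= -2X^T(y - X\beta) + 2\lambda\beta = 0\\
\Rightarrow \quad (X^TX + \lambda I)\,\beta &= X^Ty\\
\Rightarrow \quad \hat\beta^{ridge} &= (X^TX + \lambda I)^{-1}X^Ty
\end{aligned}\]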
$X^T X$ is PSD, so its eigenvalues are non-negative. That is, $(u^T X^T) X u = \lVert Xu \rVert^2 \geq 0$ for any $u$, so for any eigenvalue $e$ and eigenvector $v$, $0 \leq v^T (X^T X v) = v^T e v = e \lVert v \rVert^2$, and therefore $e \geq 0$.
Adding $\lambda I$ shifts each eigenvalue of $X^T X$ by $\lambda$: $(X^T X + \lambda I)v = X^TXv + \lambda v = (e + \lambda)v$. So, the eigenvalues of $X^T X + \lambda I$ are positive when $\lambda > 0$, i.e. the matrix is positive definite.
Because all eigenvalues of $X^T X + \lambda I$ are positive, it is invertible. That is, by contradiction, if $X^T X + \lambda I$ were singular, then $\exists x \neq 0$ s.t. $(X^T X + \lambda I)x = 0$, so $0$ would be an eigenvalue, contradicting the previous point.
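A quick numerical sanity check of the shift-and-invert argument (a minimal sketch with numpy; `X` here is just an arbitrary stand-in design matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))      # arbitrary stand-in design matrix
lam = 0.1

gram = X.T @ X
eig_gram = np.linalg.eigvalsh(gram)                      # eigenvalues of X^T X, all >= 0
eig_ridge = np.linalg.eigvalsh(gram + lam * np.eye(5))   # eigenvalues of X^T X + lam I

# Each eigenvalue is shifted by exactly lambda, hence strictly positive.
assert np.allclose(eig_ridge, eig_gram + lam)
assert np.all(eig_ridge > 0)

# Positive eigenvalues imply invertibility.
inv = np.linalg.inv(gram + lam * np.eye(5))
assert np.allclose(inv @ (gram + lam * np.eye(5)), np.eye(5))
```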
$D(D^2 + \lambda I)^{-1} D$ is a diagonal matrix with $j$-th diagonal entry $d_j^2 / (d_j^2 + \lambda)$. So, the $j$-th column of $UD(D^2 + \lambda I)^{-1} D$ is $u_j \frac{d_j^2}{d_j^2 + \lambda}$. Right-multiplying by $U^T$ gives the sum of outer products $\sum_j u_j \frac{d_j^2}{d_j^2 + \lambda} u_j^T$, which applied to $y$ is (3.47).
Hence, coordinates along directions $u_j$ with smaller singular values $d_j$ are shrunk more.
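Concretely, a small sketch with numpy (`X` and `y` are arbitrary stand-ins) showing that the ridge fit equals the least-squares coordinates $u_j^T y$ shrunk by the factors $d_j^2/(d_j^2+\lambda)$, and that this matches the direct formula:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, lam = 100, 4, 5.0
X = rng.normal(size=(N, p))
X -= X.mean(axis=0)                       # center the columns, as assumed in the text
y = rng.normal(size=N)

U, d, Vt = np.linalg.svd(X, full_matrices=False)   # X = U D V^T

shrinkage = d**2 / (d**2 + lam)           # factor d_j^2 / (d_j^2 + lam) per direction u_j
fit_ls    = U @ (U.T @ y)                 # least-squares fit: sum_j u_j (u_j^T y)
fit_ridge = U @ (shrinkage * (U.T @ y))   # ridge fit: each coordinate u_j^T y is shrunk

# Agrees with X (X^T X + lam I)^{-1} X^T y.
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
assert np.allclose(fit_ridge, X @ beta_ridge)
print(shrinkage)                          # smaller d_j -> factor closer to 0 -> more shrinkage
```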
Let $z = Xa$ be a normalized linear combination of the columns of $X$, i.e. $\lVert a \rVert^2 = 1$; $v_1$ is one such $a$.
Then, following the derivation of (3.49) below, we obtain $Var(z) = Var(Xa) = \frac{1}{N} a^TVD^2V^Ta$.
Let $b = V^Ta$. Then, $N \cdot Var(z) = b^TD^2b = \sum_{i,j} b_i d_{i,j}^2 b_j$, which equals $\sum_i b_i^2 d_{i,i}^2$ because $D$ is diagonal.
Note that $\sum_i b_i^2 = b^T b = a^TVV^Ta = a^Ta = \lVert a \rVert^2 = 1$. Hence, $N \cdot Var(z) = \sum_i b_i^2 d_{i,i}^2$ is a weighted sum of the squared singular values, with weights summing to 1.
What weights maximize $Var(z)$? Since the first singular value $d_1$ is the largest, the best weights are $e_1 = [1, 0, \dots, 0]^T$, which is achieved by setting $a = v_1$ (so that $b = V^Tv_1 = e_1$).
TODO: Derive from optimization formulation of PCA. See Mathematics for Machine Learning, equation (10.10).
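As a numerical check that $a = v_1$ maximizes $Var(Xa)$ over unit vectors (a sketch with numpy; `X` is an arbitrary stand-in with centered columns):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 200, 6
X = rng.normal(size=(N, p))
X -= X.mean(axis=0)                  # centered columns

U, d, Vt = np.linalg.svd(X, full_matrices=False)
v1 = Vt[0]                           # first right singular vector

var_v1 = np.var(X @ v1)              # sample variance of z_1 = X v_1
for _ in range(1000):
    a = rng.normal(size=p)
    a /= np.linalg.norm(a)           # normalized linear combination of the columns
    assert np.var(X @ a) <= var_v1 + 1e-12
```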
Note $\overline{Xv_1} = \frac{1}{N} \sum_{i=1}^N (Xv_1)_i = \frac{1}{N} \left[\sum_{i=1}^N {x_i}^T\right] v_1 = 0$ because the features in $X$ are centered. Hence,
\[\begin{aligned}
N \cdot Var(Xv_1) &= \sum_{i=1}^N \left((Xv_1)_i - \overline{Xv_1}\right)^2\\
&= (Xv_1)^T Xv_1\\
&= v_1^TVD^2V^Tv_1 && \text{by the SVD from (3.48)}\\
&= {e_1}^TD^2e_1 && \text{where $e_1 = [1, 0, \dots, 0]^T$}\\
&= d_1^2
\end{aligned}\]
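A numerical confirmation of this identity (a sketch with numpy; `X` is an arbitrary stand-in with centered columns, and `np.var` divides by $N$, matching the convention above):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 300
X = rng.normal(size=(N, 5))
X -= X.mean(axis=0)                 # centered columns, so the mean of X v_1 is 0

U, d, Vt = np.linalg.svd(X, full_matrices=False)
v1 = Vt[0]

# N * Var(X v_1) = d_1^2, i.e. Var(X v_1) = d_1^2 / N as in (3.49).
assert np.isclose(np.var(X @ v1), d[0]**2 / N)
```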
Follows from (3.47) and the cyclic property of the trace.
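Spelled out (a sketch, assuming this refers to the trace expression for the effective degrees of freedom, $\mathrm{tr}[X(X^TX + \lambda I)^{-1}X^T]$, and using the SVD $X = UDV^T$ with $V$ orthogonal and $U^TU = I$):

\[\begin{aligned}
\mathrm{tr}\left[X(X^TX + \lambda I)^{-1}X^T\right]
&= \mathrm{tr}\left[UDV^T \, V(D^2 + \lambda I)^{-1}V^T \, VDU^T\right]\\
&= \mathrm{tr}\left[UD(D^2 + \lambda I)^{-1}DU^T\right]\\
&= \mathrm{tr}\left[D(D^2 + \lambda I)^{-1}D \, U^TU\right] && \text{by the cyclic property}\\
&= \sum_{j=1}^{p} \frac{d_j^2}{d_j^2 + \lambda}
\end{aligned}\]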
TODO