Summary

Assume $Y = f(X) + \varepsilon$ with $E[\varepsilon] = 0$ and $Var[\varepsilon] = \sigma^2$. The expected prediction error (EPE) of a fit $\hat{f}$ at $X = x_0$ under squared-error loss is:

$$Err(x_0) = \sigma^2 + (\underbrace{f(x_0) - E[\hat{f}(x_0)]}_{\text{Bias}})^2 + \underbrace{E[(\hat{f}(x_0) - E[\hat{f}(x_0)])^2]}_{\text{Variance}}$$

As complexity $\uparrow$, variance $\uparrow$ and bias $\downarrow$. We can derive bias and variance terms for k-nearest neighbor and linear models.

  • kNN: as $k \downarrow$, complexity $\uparrow$.
  • linear model: as $p \uparrow$, complexity $\uparrow$.
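
As a sanity check of the complexity story above, here is a minimal Monte Carlo sketch for kNN at a single test point. Everything in it is a made-up illustration choice (the target $f(x) = \sin 4x$, noise SD $0.5$, $N = 100$, $x_0 = 0.5$); note that it resamples the training inputs too, unlike the fixed-neighbor setting assumed in (7.10) below.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up setup: target f, noise SD, training size, test point, and number of
# simulated training sets. All of these are arbitrary illustration choices.
def f(x):
    return np.sin(4 * x)

sigma = 0.5
n, x0, n_sims = 100, 0.5, 2000

def knn_at_x0(x_train, y_train, x0, k):
    """Average the responses of the k training points nearest to x0."""
    idx = np.argsort(np.abs(x_train - x0))[:k]
    return y_train[idx].mean()

for k in (1, 5, 20, 50):
    preds = np.empty(n_sims)
    for s in range(n_sims):
        x_train = rng.uniform(0, 1, n)                 # fresh training set T
        y_train = f(x_train) + rng.normal(0, sigma, n)
        preds[s] = knn_at_x0(x_train, y_train, x0, k)
    bias2 = (f(x0) - preds.mean()) ** 2
    var = preds.var()
    # Direct EPE estimate: independent test response y0 = f(x0) + eps per simulation.
    epe = np.mean((f(x0) + rng.normal(0, sigma, n_sims) - preds) ** 2)
    print(f"k={k:2d}  bias^2={bias2:.4f}  var={var:.4f}  "
          f"sigma^2+bias^2+var={sigma**2 + bias2 + var:.4f}  EPE={epe:.4f}")
```

Smaller $k$ (higher complexity) should show lower bias and higher variance, and the three terms should add up to the directly estimated EPE.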

For the linear model family, the bias decomposes further:

$$E_{x_0}[\text{Bias}^2] = E_{x_0}[\underbrace{(f(x_0) - x^T_0 \beta_{\ast})^2}_{\text{Model bias}}] + E_{x_0}[\underbrace{(x^T_0 \beta_{\ast} - E[x^T_0 \hat{\beta}_{\alpha}])^2}_{\text{Estimation bias}}]$$

  • Model bias: can only be reduced by enlarging the class of models (e.g., via basis expansions).
  • Estimation bias: zero for least squares, positive for restricted fits (ridge, lasso), which trade this bias for lower variance.
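
A small closed-form sketch of this tradeoff for ridge regression, using hypothetical sizes and a randomly drawn (then fixed) design; the true model is taken to be exactly linear, so the model bias is zero and only the estimation bias and variance move with the penalty $\lambda$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sizes and a fixed random design; the true model is exactly
# linear, so beta_* = beta and the model bias is zero.
N, p, sigma2 = 50, 10, 1.0
X = rng.normal(size=(N, p))
beta = rng.normal(size=p)       # "true" coefficients
x0 = rng.normal(size=p)         # test input

XtX = X.T @ X
for lam in (0.0, 1.0, 10.0, 100.0):
    S = np.linalg.inv(XtX + lam * np.eye(p))      # ridge "shrinker"; OLS when lam = 0
    # E[x0^T beta_hat_lam] = x0^T S X^T X beta, so the estimation bias is:
    est_bias2 = (x0 @ beta - x0 @ S @ XtX @ beta) ** 2
    # Var(x0^T beta_hat_lam) = sigma^2 * x0^T S X^T X S x0
    var = sigma2 * (x0 @ S @ XtX @ S @ x0)
    print(f"lambda={lam:6.1f}  est_bias^2={est_bias2:.4f}  var={var:.4f}")
```

At $\lambda = 0$ the estimation bias is zero (OLS); as $\lambda$ grows it increases while the variance shrinks.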

7.3.1 Example

Under classification (0-1) loss, the prediction error is no longer a sum of squared bias and variance: the bias can increase while the prediction error decreases or stays the same. TODO: “Estimation errors that leave us on the right side of the decision boundary don’t hurt.”

The bias-variance tradeoff therefore behaves differently under squared-error and 0-1 loss, so the best choice of tuning parameter can differ between the two.
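A tiny numeric illustration of the quoted point, with made-up numbers and the Gaussian approximation for $\hat{f}(x_0)$ used in Exercise 7.2: pushing $E[\hat{f}(x_0)]$ further onto the correct side of $\frac{1}{2}$ adds squared bias but lowers the 0-1 prediction error (7.62).

```python
from scipy.stats import norm

# Made-up numbers. True conditional probability f(x0) = 0.9, so the Bayes rule
# predicts class 1 and Err_B(x0) = 0.1. f_hat(x0) is treated as Gaussian (the
# assumption used in Exercise 7.2); the second estimator is biased *away* from 1/2.
f = 0.9
for mean, sd in [(0.90, 0.3), (0.98, 0.3)]:
    bias2 = (f - mean) ** 2
    p_wrong_side = norm.cdf((0.5 - mean) / sd)        # Pr(f_hat <= 1/2) = Pr(G_hat != G)
    err01 = (1 - f) + abs(2 * f - 1) * p_wrong_side   # (7.62): Err_B + |2f - 1| Pr(G_hat != G)
    print(f"E[f_hat]={mean:.2f}  bias^2={bias2:.4f}  0-1 Err(x0)={err01:.4f}")
```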

Derivations

$\require{cancel}$

(7.9) Bias-variance decomposition of expected prediction error (EPE) using squared loss

In (7.3), the EPE averages over the training set $\mathcal{T}$ and the test sample $X, Y$ from the population.

Since $X = x_0$ and $Y = f(x_0) + \varepsilon$, only the $\varepsilon$ and $\mathcal{T}$ are random. In other words, $f(x_0)$ is not random, but $\varepsilon$ and $\hat{f}(x_0)$ (which is a function of the training set $\mathcal{T}$) are random. Abbreviating $f(x_0)$ as $f$ and $\hat{f}(x_0)$ as $\hat{f}$, EPE becomes

$$ \begin{align*} E[((f - \hat{f}) + \varepsilon)^2] &= E[(f - \hat{f})^2 + 2 f \varepsilon - 2 \hat{f} \varepsilon + \varepsilon^2] \\ &= E[(f - \hat{f})^2] + 2 f \cancelto{0}{E[\varepsilon]} - 2 E[\hat{f} \varepsilon] + E[\varepsilon^2] \\ &= E[(f - \hat{f})^2] - 2 E[\hat{f}] \cancelto{0}{E[\varepsilon]} + (Var(\varepsilon) + \cancelto{0}{E[\varepsilon]^2}) \\ &= E[(f - \hat{f})^2] + \sigma^2_\varepsilon \end{align*}$$

In the third line, $E[\hat{f} \varepsilon] = E[\hat{f}] E[\varepsilon]$ if we assume that the test noise $\varepsilon$ and $\mathcal{T}$ are (conditionally) independent. The result agrees with the second line of (7.9), which expands as

$$ \begin{align*} \text{second line in (7.9)} &= \sigma^2_\varepsilon + E[\hat{f}]^2 - 2E[\hat{f}]f + f^2 + E[\hat{f}^2 - 2 \hat{f} E[\hat{f}] + E[\hat{f}]^2] \\ &= \sigma^2_\varepsilon + \cancel{E[\hat{f}]^2} - 2E[\hat{f}]f + f^2 + E[\hat{f}^2] - \cancel{2 E[\hat{f}]^2} + \cancel{E[\hat{f}]^2} \\ &= \sigma^2_\varepsilon - 2E[\hat{f}]f + f^2 + E[\hat{f}^2] \end{align*} $$
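For completeness, expanding $E[(f - \hat{f})^2]$ in the result of the first derivation gives the same expression:

$$E[(f - \hat{f})^2] + \sigma^2_\varepsilon = f^2 - 2 f E[\hat{f}] + E[\hat{f}^2] + \sigma^2_\varepsilon,$$

so the EPE indeed equals the second line of (7.9).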

TODO: Show that $\varepsilon$ and $\mathcal{T}$ are independent if the $\varepsilon_i$ are i.i.d. and $\varepsilon$ is independent of the training features $X$.

(7.10) Bias-variance decomposition of K-nearest neighbors EPE

Note that $$\hat{f}(x_0) = \frac{1}{k} \sum^k_{l=1} y_l = \frac{1}{k} \sum^k_{l=1} (f(x_l) + \varepsilon_l)$$, where $x_l$ is the $l$-th nearest neighbor of $x_0$, treated as fixed. Then $$E[\hat{f}(x_0)] = \frac{1}{k} \sum^k_{l=1} (f(x_l) + \cancelto{0}{E[\varepsilon_l]}) = \frac{1}{k} \sum^k_{l=1} f(x_l)$$.

The bias term then follows from the definition. Assuming the $\varepsilon_l$ are pairwise independent (so the cross terms $E[\varepsilon_l \varepsilon_m]$, $l \neq m$, vanish), the variance is

$$E[(\hat{f} - E[\hat{f}])^2] = E[(\frac{1}{k} \sum^k_{l=1} \varepsilon_l)^2] = \frac{1}{k^2} E[\sum^k_{l=1} \sum^k_{m=1} \varepsilon_l \varepsilon_m] = \frac{1}{k^2} \sum^k_{l=1} E[\varepsilon_l^2] = \frac{1}{k^2} k \cdot (Var(\varepsilon) + \cancelto{0}{E[\varepsilon_l]^2}) = \frac{\sigma^2_\varepsilon}{k} $$
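A quick numeric check of $Var(\hat{f}(x_0)) = \sigma^2_\varepsilon / k$ in the setting assumed here: the training inputs (and hence the $k$ nearest neighbors of $x_0$) are held fixed and only the noise is resampled. The target function, noise level, and sizes are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)

# Training inputs drawn once and then fixed; only eps varies across simulated T.
def f(x):
    return np.sin(4 * x)

sigma, n, k, x0, n_sims = 0.5, 100, 15, 0.5, 20000

x_train = rng.uniform(0, 1, n)                    # drawn once, then fixed
nn_idx = np.argsort(np.abs(x_train - x0))[:k]     # indices of the k nearest neighbors
eps = rng.normal(0, sigma, size=(n_sims, n))      # fresh noise for each simulated T
preds = (f(x_train)[nn_idx] + eps[:, nn_idx]).mean(axis=1)

print(preds.var(), sigma**2 / k)                  # the two should be close
```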

(7.11) Variance in linear regression EPE

Write $\hat{f}(x_0) = h(x_0)^T y$, where $h(x_0) = X(X^TX)^{-1} x_0$ is the vector of linear weights; abbreviate $h(x_0)$ as $h$. Then

$$ Var(\hat{f}) = E[(\hat{f} - E[\hat{f}])^2] = E[\hat{f}^2] - E[\hat{f}]^2 = E[h^Ty y^Th] - E[h^Ty]^2 $$

Since only $y$ is random ($h$ is fixed given the training inputs), the above equals

$$ h^T E[yy^T] h - h^T E[y]E[y]^T h = h^T (E[yy^T] - E[y]E[y]^T) h = h^T Var(y) h = \sigma^2_\varepsilon h^T h $$

where the last equality assumes $Var(y) = \sigma^2_\varepsilon I$, i.e. $Cov(y_i, y_j) = 0$ for $i \neq j$ and $Var(y_i) = \sigma^2_\varepsilon$.
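The same identity can be checked numerically: with a fixed design $X$ and only the noise resampled, the simulated variance of $\hat{f}(x_0) = h(x_0)^T y$ matches $\sigma^2_\varepsilon \lVert h(x_0) \rVert^2$. All sizes and coefficients below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)

# Fixed design X, fresh noise per simulated training set.
N, p, sigma = 60, 5, 1.0
X = rng.normal(size=(N, p))
beta = rng.normal(size=p)
x0 = rng.normal(size=p)

h = X @ np.linalg.solve(X.T @ X, x0)              # h(x0) = X (X^T X)^{-1} x0
n_sims = 20000
y = X @ beta + rng.normal(0, sigma, size=(n_sims, N))   # one row of y per simulated T
preds = y @ h                                     # f_hat(x0) = h^T y for each T

print(preds.var(), sigma**2 * (h @ h))            # simulated vs. closed-form variance
```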

(7.12) Variance in linear regression in-sample EPE

$$\sum_i h(x_i)^T h(x_i) = \sum_i x^T_i (X^TX)^{-1} x_i = tr[X(X^TX)^{-1}X^T] = tr[(X^TX)^{-1}X^TX] = tr[I_p] = p$$

The second-to-last equality uses the cyclic property of the trace.

So, $$\frac{1}{N} \sum_i \lVert h(x_i) \rVert^2 \sigma^2_\varepsilon = \frac{p}{N} \sigma^2_\varepsilon$$.
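The trace identity is also easy to verify numerically (hypothetical $N$ and $p$): the leverages $x_i^T (X^TX)^{-1} x_i$ are the diagonal entries of the hat matrix and sum to $p$.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical design; trace of the hat matrix equals p.
N, p = 60, 5
X = rng.normal(size=(N, p))
H = X @ np.linalg.solve(X.T @ X, X.T)             # hat matrix X (X^T X)^{-1} X^T
print(np.trace(H), p)                             # equal up to floating-point error
```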

(7.14) Decompose expected squared bias of linear model into model bias and estimation bias

See this Cross Validated answer by Lei.

Exercises

7.2

We will abbreviate by removing $x_0$ from $f(x_0), \hat{f}(x_0), G(x_0)$, and $\hat{G}(x_0)$.

Show (7.62)

Let $\hat{g} = Pr(\hat{f} > \frac{1}{2}) = Pr(\hat{G} = 1)$, where the probability is taken over training sets $\mathcal{T}$. Then

$$Err(x_0) = E[I[Y \neq \hat{G}] \lvert X = x_0] = Pr(Y \neq \hat{G} \lvert X = x_0) = f (1 - \hat{g}) + (1-f) \hat{g} = f - 2 f \hat{g} + \hat{g}$$

$$ Err_B(x_0) + \lvert 2f - 1 \rvert Pr(\hat{G} \neq G \lvert X = x_0) = \begin{cases} f + (1 - 2f) \hat{g} & \text{if } f \leq \frac{1}{2} \\ (1 - f) + (2f - 1) (1 - \hat{g}) & \text{else} \end{cases} = f - 2 f \hat{g} + \hat{g} $$
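Expanding each case confirms the final equality:

$$ f + (1 - 2f)\hat{g} = f - 2 f \hat{g} + \hat{g}, \qquad (1 - f) + (2f - 1)(1 - \hat{g}) = 1 - f + 2f - 1 - 2 f \hat{g} + \hat{g} = f - 2 f \hat{g} + \hat{g}, $$

so both branches match $Err(x_0)$ computed above, which proves (7.62).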

Show (7.63)

$$ Pr(\hat{G} \neq G \lvert X = x_0) = Pr(\hat{f} > \tfrac{1}{2})\, I(f \leq \tfrac{1}{2}) + Pr(\hat{f} \leq \tfrac{1}{2})\, I(f > \tfrac{1}{2})$$

since $f$ is fixed given $x_0$; the only randomness is in $\hat{f}$ through $\mathcal{T}$.

If $f > \frac{1}{2}$, the above reduces to $Pr(\hat{f} \leq \frac{1}{2})$. Since $\hat{f}$ is assumed normally distributed, $\frac{\hat{f} - E[\hat{f}]}{\sqrt{Var[\hat{f}]}} \sim N(0,1)$, and we obtain $Pr(\hat{f} \leq \frac{1}{2}) = \Phi(\frac{\frac{1}{2} - E[\hat{f}]}{\sqrt{Var[\hat{f}]}})$. Then (7.63) holds because $\operatorname{sign}(\frac{1}{2} - f)$ is negative. A similar (sign-reversed) argument covers the $f \leq \frac{1}{2}$ case.
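Putting the two cases together, the statement just verified is, in the abbreviated notation used here,

$$ Pr(\hat{G} \neq G \lvert X = x_0) = \Phi\left(\frac{\operatorname{sign}(\frac{1}{2} - f)\,(E[\hat{f}] - \frac{1}{2})}{\sqrt{Var[\hat{f}]}}\right), $$

which is (7.63).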