ESL 4.4: Logistic Regression

Summary Model specifies $K - 1$ linear log-odds/logits: $$ \log\frac{P(G = k \lvert X = x)}{P(G = K \lvert X = x)} = \beta_{k0} + \beta_k^T x, \quad k = 1, \ldots, K - 1$$. Then $$ P(G = k \lvert X = x) = \frac{\exp(\beta_{k0} + \beta_k^T x)}{1 + \sum_{l=1}^{K-1} \exp(\beta_{l0} + \beta_l^T x)} $$ and $$ P(G = K \lvert X = x) = \frac{1}{1 + \sum_{l=1}^{K-1} \exp(\beta_{l0} + \beta_l^T x)} $$, so the posteriors are in $[0,1]$ and sum to 1. ...
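A minimal NumPy sketch (not from the text; the coefficient values are made up) of how the $K - 1$ logits map to posterior probabilities, with class $K$ as the reference class:

```python
import numpy as np

def posteriors(x, b0, B):
    """Posteriors from the K-1 logits; class K is the reference class.

    b0: (K-1,) intercepts, B: (K-1, p) coefficients (illustrative values only).
    """
    logits = b0 + B @ x                 # beta_{k0} + beta_k^T x, k = 1..K-1
    num = np.exp(logits)
    denom = 1.0 + num.sum()
    return np.append(num, 1.0) / denom  # last entry is P(G = K | X = x)

# Example with K = 3 classes and p = 2 inputs; coefficients are arbitrary.
p = posteriors(np.array([0.5, -1.0]),
               b0=np.array([0.2, -0.3]),
               B=np.array([[1.0, 0.5], [-0.4, 0.8]]))
print(p, p.sum())  # entries lie in [0, 1] and sum to 1
```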

December 31, 2020

ESL 4.5: Separating Hyperplanes

Derivations (4.40) ${\beta^\ast}^T(x - x_0)$ is the signed distance of $x$ to $L$. Pick $x_0$ as the projection of $x$ onto $L$. Since $\beta^\ast$ is a unit vector parallel to $x - x_0$, ${\beta^\ast}^T(x - x_0) = \lVert \beta^\ast \rVert \lVert x - x_0 \rVert \cos\theta = \pm \lVert x - x_0 \rVert$. (4.52) Wolfe dual $-\sum_i \alpha_i [y_i(x_i^T \beta + \beta_0) - 1] = -\sum_i \alpha_i y_i x_i^T \beta - \beta_0 \sum_i \alpha_i y_i + \sum_i \alpha_i$, which reduces to $-\sum_i \alpha_i y_i x_i^T \beta + \sum_i \alpha_i$ since $\sum_i \alpha_i y_i = 0$ by (4.51). ...
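As a numerical check on the (4.40) claim, a small NumPy sketch (arbitrary $\beta$, $\beta_0$, and $x$, not from the text) that projects $x$ onto $L = \{x : \beta_0 + \beta^T x = 0\}$ and confirms ${\beta^\ast}^T(x - x_0) = (\beta_0 + \beta^T x)/\lVert \beta \rVert$:

```python
import numpy as np

# Hyperplane L = {x : beta0 + beta^T x = 0}; all values here are arbitrary.
beta = np.array([3.0, 4.0])
beta0 = -2.0
x = np.array([1.0, 2.0])

beta_star = beta / np.linalg.norm(beta)              # unit normal to L
x0 = x - (beta0 + beta @ x) / (beta @ beta) * beta   # projection of x onto L

signed_dist = beta_star @ (x - x0)                   # {beta*}^T (x - x0)
print(np.isclose(beta0 + beta @ x0, 0.0))            # x0 lies on L
print(np.isclose(signed_dist,
                 (beta0 + beta @ x) / np.linalg.norm(beta)))  # matches (4.40)
```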

December 31, 2020

ESL 7.2: Bias, Variance, and Model Complexity

Summary Generalization performance relates to prediction capability on independent test data. How do we assess generalization performance and use it to select models? See the Section 7.4 summary for definitions of test error, expected test error, and training error. We estimate (expected) test error for two reasons: Model selection: as complexity $\uparrow$, variance $\uparrow$ and bias $\downarrow$; select the complexity-tuning parameter $\alpha$ that minimizes EPE. Model assessment: estimate test error for the selected model. Use the validation set for model selection and the test set only for assessment. ...
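A sketch of the selection-versus-assessment workflow on a toy ridge-regression problem (the data, split sizes, and $\alpha$ grid below are all illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (purely illustrative): y = X w + noise.
N, p = 300, 10
X = rng.normal(size=(N, p))
y = X @ rng.normal(size=p) + rng.normal(scale=2.0, size=N)

# Three-way split: train for fitting, validation for choosing alpha,
# test touched only once for the final assessment.
Xtr, ytr = X[:150], y[:150]
Xva, yva = X[150:225], y[150:225]
Xte, yte = X[225:], y[225:]

def ridge_fit(X, y, alpha):
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

def mse(X, y, beta):
    return np.mean((y - X @ beta) ** 2)

# Model selection: pick the alpha with the smallest validation error.
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
best = min(alphas, key=lambda a: mse(Xva, yva, ridge_fit(Xtr, ytr, a)))

# Model assessment: report the test error of the selected model.
print(best, mse(Xte, yte, ridge_fit(Xtr, ytr, best)))
```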

December 31, 2020

ESL 7.3: The Bias-Variance Decomposition

Summary Assume $Y = f(X) + \varepsilon$, $E[\varepsilon] = 0$, $Var[\varepsilon] = \sigma^2$. EPE of regression model $\hat{f}$ at $X = x_0$ using squared-error loss is: $$Err(x_0) = \sigma^2 + (\underbrace{f(x_0) - E[\hat{f}(x_0)]}_{\text{Bias}})^2 + \underbrace{E[(\hat{f}(x_0) - E[\hat{f}(x_0)])^2]}_{\text{Variance}}$$ As complexity $\uparrow$, variance $\uparrow$ and bias $\downarrow$. We can derive bias and variance terms for k-nearest-neighbor and linear models. kNN: as $k \downarrow$, complexity $\uparrow$. Linear model: as $p \uparrow$, complexity $\uparrow$. For the linear model family, the bias decomposes further: ...
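A Monte Carlo sketch of the decomposition at a single $x_0$ for a kNN fit on a 1-d toy problem (all settings are made up); with the training inputs held fixed, it also checks the kNN variance term $\sigma^2/k$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (all values illustrative): Y = f(X) + eps, kNN estimate at x0,
# with the training inputs held fixed as in the conditional decomposition.
f = lambda x: np.sin(2 * x)
sigma, N, k, x0, reps = 0.5, 50, 5, 0.3, 2000

x = rng.uniform(-1, 1, N)                # training inputs, held fixed
nn = np.argsort(np.abs(x - x0))[:k]      # indices of the k nearest points to x0

fhat = np.empty(reps)
for r in range(reps):
    y = f(x) + rng.normal(scale=sigma, size=N)
    fhat[r] = y[nn].mean()               # kNN regression estimate at x0

bias2 = (f(x0) - fhat.mean()) ** 2
var = fhat.var()
# Err(x0) = sigma^2 + bias^2 + variance; for kNN with fixed inputs,
# the variance term is sigma^2 / k.
print(bias2, var, sigma**2 / k)
```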

December 31, 2020

ESL 7.4: Optimism of the Training Error Rate

Summary Definitions (the terms test, generalization, and prediction error are used interchangeably): Test error (extra-sample error): $$Err_{\mathcal{T}} = E_{X^0,Y^0}[L(Y^0, \hat{f}(X^0)) \lvert \mathcal{T}]$$ Expected prediction error (EPE): $$Err = E_{\mathcal{T}}[Err_{\mathcal{T}}] = E_\mathcal{T} E_{X^0,Y^0}[L(Y^0, \hat{f}(X^0)) \lvert \mathcal{T}]$$ Training error: $$\overline{err} = \frac{1}{N} \sum^N_{i=1} L(y_i, \hat{f}(x_i))$$ In-sample error: $$Err_{in} = \frac{1}{N} \sum^N_{i=1} E_{Y^0}[L(Y^0_i, \hat{f}(x_i)) \lvert \mathcal{T}]$$ (observe a new response $Y_i^0$ at each training input $x_i$) Optimism: $\text{op} = Err_{in} - \overline{err}$ Expected optimism: $\omega = E_\mathbf{y}(\text{op})$ (predictors are fixed; the expectation is over the training-set outcomes $\mathbf{y}$) $\overline{err}$ is not a good estimate of test error, since it generally decreases with model complexity and typically drops to zero if we increase the complexity enough. ...
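A simulation sketch (toy data; predictors held fixed, as in the definition of $\omega$) that estimates the expected optimism for OLS with squared-error loss and compares it to the linear-fit result $\omega = \frac{2}{N}\sum_i Cov(\hat{y}_i, y_i) = 2\frac{d}{N}\sigma^2_\varepsilon$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy OLS setup; the predictors are held fixed across simulated training sets.
N, d, sigma = 100, 5, 1.0
X = rng.normal(size=(N, d))
f0 = X @ rng.normal(size=d)               # true mean at the training inputs
H = X @ np.linalg.inv(X.T @ X) @ X.T      # hat matrix, tr(H) = d

reps = 5000
op = np.empty(reps)
for r in range(reps):
    y = f0 + rng.normal(scale=sigma, size=N)    # training responses
    y0 = f0 + rng.normal(scale=sigma, size=N)   # new responses at the same x_i
    yhat = H @ y
    err_bar = np.mean((y - yhat) ** 2)          # training error
    err_in = np.mean((y0 - yhat) ** 2)          # one draw of the in-sample error
    op[r] = err_in - err_bar                    # optimism, op

# Expected optimism: for a linear fit with squared-error loss,
# omega = (2/N) sum_i Cov(yhat_i, y_i) = 2 d sigma^2 / N.
print(op.mean(), 2 * d * sigma**2 / N)
```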

December 31, 2020

ESL 7.5: Estimates of In-Sample Prediction Error

Summary $$\hat{Err}_{in} = \overline{err} + \hat{\omega}$$. If squared-error loss, we obtain $$C_p = \overline{err} + 2 \frac{d}{N} \hat{\sigma}^2_\varepsilon$$. Use the MSE of a low-bias model for $$\hat{\sigma}^2_\varepsilon$$. If log-likelihood loss, another estimate of $Err_{in}$ is $$ AIC = \frac{2}{N} (-\text{loglik} + d)$$. For the Gaussian model, AIC is equivalent to $C_p$. If $d$ basis functions are chosen adaptively from $p$ inputs, the effective number of parameters is more than $d$. For nonlinear models, replace $d$ with some measure of model complexity. ...
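A toy $C_p$ computation for nested OLS models (the data and dimensions are made up), estimating $\hat{\sigma}^2_\varepsilon$ from the largest (low-bias) model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: only the first 3 of p = 10 inputs matter.
N, p = 200, 10
X = rng.normal(size=(N, p))
y = X[:, :3] @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=N)

def train_err(Xd, y):
    beta = np.linalg.lstsq(Xd, y, rcond=None)[0]
    return np.mean((y - Xd @ beta) ** 2)

# Estimate sigma^2_eps from the lowest-bias (largest) model: RSS / (N - p).
sigma2 = train_err(X, y) * N / (N - p)

# C_p = err_bar + 2 (d / N) * sigma2_hat for nested models of size d.
for d in range(1, p + 1):
    err_bar = train_err(X[:, :d], y)
    print(d, round(err_bar, 3), round(err_bar + 2 * d / N * sigma2, 3))
```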

December 31, 2020

ESL 7.6: Effective Number of Parameters

Summary How do we generalize “the number of parameters” for regularized or non-linear models? For a linear fitting method $\hat{y} = Sy$, the effective degrees of freedom is defined as $\text{df}(S) = \text{tr}(S)$. If $S$ is an orthogonal projection matrix onto a $d$-dimensional subspace, $\text{tr}(S) = d$. Also, define $\text{df}(\hat{y}) = \frac{\sum_i Cov(\hat{y}_i, y_i)}{\sigma^2_\varepsilon}$. For an additive-error model with a linear fit, this equals $\text{tr}(S)$. TODO: df for neural networks. Exercises $\require{cancel}$ 7.5 Analogous to Exercise 7.1. $$\sum_{i=1}^N Cov(\hat{y}_i, y_i) = tr[Cov(\hat{y}, y)] = tr[E[(\hat{y} - E[\hat{y}])(y - E[y])^T]] = tr[S \cancelto{Var(y)}{E[(y - E[y])(y - E[y])^T]}] = tr[S] \sigma^2$$ ...
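Returning to the definition above, a quick NumPy sketch (illustrative data and $\lambda$ values) showing $\text{df}(S) = \text{tr}(S)$ for the ridge-regression smoother, which shrinks from $p$ toward 0 as the penalty grows:

```python
import numpy as np

rng = np.random.default_rng(0)

# Ridge regression is a linear smoother yhat = S y with
# S = X (X^T X + lambda I)^{-1} X^T; df(S) = tr(S).  Data are illustrative.
N, p = 100, 8
X = rng.normal(size=(N, p))

for lam in [0.0, 1.0, 10.0, 100.0]:
    S = X @ np.linalg.inv(X.T @ X + lam * np.eye(p)) @ X.T
    print(lam, np.trace(S))   # lambda = 0 is the projection case: tr(S) = p
```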

December 31, 2020