Summary
LDA and QDA
Recall Bayes classifier: $\hat{G}(x)= \arg\max_{k} P(G = k \lvert X = x) = \arg\max_{k} P(X = x \lvert G = k) P(G = k)$. Both LDA and QDA model each class-conditional density as Gaussian.
LDA arises when the class-conditional densities share a common covariance matrix.
- Each pairwise decision boundary is linear (a hyperplane) because the log-ratio of posteriors is linear in $x$; see (4.9) below.
- The linear discriminant functions (4.10) yield the same decision boundaries.
- Estimate each class prior and mean, and the common (pooled) covariance.
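For reference, under a common covariance $\Sigma$ the log-ratio of posteriors (4.9) is linear in $x$, and maximizing the posterior is equivalent to maximizing the linear discriminant function (4.10):
$$
\log \frac{P(G = k \lvert X = x)}{P(G = \ell \lvert X = x)} = \log \frac{\pi_k}{\pi_\ell} - \frac{1}{2} (\mu_k + \mu_\ell)^T \Sigma^{-1} (\mu_k - \mu_\ell) + x^T \Sigma^{-1} (\mu_k - \mu_\ell)
$$
$$
\delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \log \pi_k
$$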
LDA and linear regression:
- If two-class, LDA direction is proportional to the coefficient vector from linear regression with +1 and -1 responses.
- If multi-class, LDA avoids the masking problem unlike linear regression.
QDA arises with differing covariances.
- Log-ratio, pairwise decision boundaries, and discriminant functions are all quadratic; see (4.12) below.
- Estimate a separate covariance matrix for each class.
- Parameter counts: LDA requires $(K - 1)(p + 1)$ parameters for the discriminant differences, whereas QDA requires $(K - 1)\{p(p + 3)/2 + 1\}$, i.e., $O(K \cdot p)$ versus $O(K \cdot p^2)$.
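For reference, the quadratic discriminant function (4.12):
$$
\delta_k(x) = -\frac{1}{2} \log \lvert \Sigma_k \rvert - \frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) + \log \pi_k
$$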
LDA and QDA may perform well due to the bias-variance tradeoff: we accept the bias of a linear (or quadratic) decision boundary because it can be estimated with much lower variance.
4.3.1 Regularized Discriminant Analysis
Let each class covariance be a weighted mix of the pooled (common) covariance and that class's own covariance; choose the weight by cross-validation.
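Concretely, (4.13) shrinks each class covariance toward the pooled estimate:
$$
\hat{\Sigma}_k(\alpha) = \alpha \hat{\Sigma}_k + (1 - \alpha) \hat{\Sigma}, \qquad \alpha \in [0, 1]
$$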
4.3.2 Computations for LDA
If we sphere the data using SVD of $\hat{\Sigma}$, we can implement LDA by nearest centroid classification.
4.3.3 Reduced-Rank LDA
LDA is nearest centroid classification with sphered data. Since the centroids span a subspace $H_{K - 1}$ of dimension at most $K - 1$, we can project the p-dimensional data to $H_{K - 1}$ for classification without loss of information.
We can further reduce the dimension to $L < K - 1$ by projecting onto the principal component subspace of the (sphered) class centroids within $H_{K - 1}$; see the sketch below.
TODO: Prediction with reduced dimension.
TODO: Fisher’s problem. More analysis.
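A minimal NumPy sketch of this projection (my own illustration; the function name `discriminant_coordinates` and its interface are assumptions, not from the book). It spheres with the pooled within-class covariance, then takes principal components of the sphered class centroids, ignoring class proportions for simplicity:

```python
import numpy as np

def discriminant_coordinates(X, y, L):
    """Project p-dimensional data onto L <= K - 1 discriminant coordinates."""
    classes = np.unique(y)
    N, p = X.shape
    means = np.array([X[y == k].mean(axis=0) for k in classes])  # K x p centroids

    # Pooled within-class covariance W and its inverse square root (sphering).
    # Assumes W is nonsingular.
    centered = X - means[np.searchsorted(classes, y)]
    W = centered.T @ centered / (N - len(classes))
    d, U = np.linalg.eigh(W)
    W_inv_sqrt = U @ np.diag(d ** -0.5) @ U.T

    # Principal components of the sphered, centered class centroids.
    M_star = means @ W_inv_sqrt
    M_star -= M_star.mean(axis=0)
    _, _, Vt = np.linalg.svd(M_star, full_matrices=False)

    A = W_inv_sqrt @ Vt[:L].T   # p x L matrix of discriminant directions
    return X @ A                # N x L discriminant coordinates
```

Classification in these coordinates can then proceed by nearest centroid (adjusting for the log priors), as in the full-rank case.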
Derivations
(p113) LDA by data sphering and nearest centroid classification
See Spring ‘13 CMU Data Mining: Lecture 21 by Ryan Tibshirani, slides 4 and 5.
Write the eigendecomposition $\hat{\Sigma} = U D U^T$ and sphere the data: $x^* = D^{-\frac{1}{2}} U^T x$ and $\hat{\mu}_k^* = D^{-\frac{1}{2}} U^T \hat{\mu}_k$. The LDA discriminant function (written here as a quadratic form, which differs from (4.10) only by a term common to all classes) reduces to the negative half squared distance to the class centroid, plus the log prior:
$$
\begin{align*}
\hat{\delta}_1(x) &= -\frac{1}{2} (x - \hat{\mu}_1)^T \hat{\Sigma}^{-1} (x - \hat{\mu}_1) + \log \hat{\pi}_1 \\
&= -\frac{1}{2} (x - \hat{\mu}_1)^T U D^{-1} U^T (x - \hat{\mu}_1) + \log \hat{\pi}_1 \\
&= -\frac{1}{2} \lVert D^{-\frac{1}{2}} U^T (x - \hat{\mu}_1) \rVert^2 + \log \hat{\pi}_1 \\
&= -\frac{1}{2} \lVert x^* - \hat{\mu}_1^* \rVert^2 + \log \hat{\pi}_1
\end{align*}
$$
Hence, to maximize the discriminant function, we minimize the distance to the centroid, modulo the log prior.
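A minimal NumPy sketch of this procedure (my own; `lda_predict` and its interface are illustrative, and it assumes a nonsingular $\hat{\Sigma}$):

```python
import numpy as np

def lda_predict(X_train, y_train, X_new):
    """Classify X_new by LDA, implemented as sphering + nearest centroid."""
    classes = np.unique(y_train)
    N, p = X_train.shape
    priors = np.array([np.mean(y_train == k) for k in classes])
    means = np.array([X_train[y_train == k].mean(axis=0) for k in classes])

    # Pooled covariance estimate and its eigendecomposition Sigma = U D U^T.
    centered = X_train - means[np.searchsorted(classes, y_train)]
    Sigma = centered.T @ centered / (N - len(classes))
    d, U = np.linalg.eigh(Sigma)

    # Sphere: x* = D^{-1/2} U^T x (applied row-wise).
    X_star = (X_new @ U) * d ** -0.5
    means_star = (means @ U) * d ** -0.5

    # delta_k(x) = -1/2 ||x* - mu_k*||^2 + log(pi_k); pick the argmax.
    dist2 = ((X_star[:, None, :] - means_star[None, :, :]) ** 2).sum(axis=2)
    return classes[np.argmax(-0.5 * dist2 + np.log(priors), axis=1)]
```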
Exercises
4.1
See this Khan Academy post on using Lagrange multipliers. To find $a$ that maximizes $a^T B a$ subject to $a^T W a = 1$, we can find the critical points of the Lagrange function, $L(a, \lambda) = a^T B a - \lambda (a^T W a - 1)$.
Critical points of $L$ satisfy $\frac{\partial L}{\partial a} = 2 B a - 2 \lambda W a = \mathbf{0}$ (and $\frac{\partial L}{\partial \lambda} = a^T W a - 1 = 0$, the original constraint). Rearranging terms, we obtain a generalized eigenvalue problem, $Ba = \lambda Wa$. If $W$ is invertible, this can be transformed into a standard eigenvalue problem, $W^{-1}Ba = \lambda a$.
Among the eigenvectors of $W^{-1}B$, which one maximizes $a^T B a$? Left-multiplying the generalized eigenvalue problem by $a^T$ gives $a^T B a = \lambda a^T W a = \lambda$. So, pick the eigenvector with the largest eigenvalue.
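As a concrete illustration (assuming SciPy, a symmetric $B$, and a positive-definite $W$; the function name is mine), the generalized eigenproblem can be solved directly:

```python
import numpy as np
from scipy.linalg import eigh

def max_generalized_rayleigh(B, W):
    """Return a maximizing a^T B a subject to a^T W a = 1."""
    # eigh(B, W) solves B a = lambda W a; eigenvalues come back in
    # ascending order, so the last eigenvector attains the maximum.
    eigvals, eigvecs = eigh(B, W)
    a = eigvecs[:, -1]
    a = a / np.sqrt(a @ W @ a)   # enforce the constraint a^T W a = 1
    return a, eigvals[-1]
```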
4.2
Please see the problem statement in the book.
(a) Follows from rearranging terms in $\hat{\delta}_2(x) > \hat{\delta}_1(x)$ using LDA discriminant functions from (4.10).
(b) Let’s obtain the normal equations for (4.55) with respect to $\beta_0$ and $\beta$ and show that they are equivalent to (4.56).
We can rewrite (4.55) as $RSS = \lVert y - \mathbf{1} \beta_0 - X \beta \rVert^2$.
Then $\frac{\partial}{\partial \beta_0} RSS = \frac{\partial}{\partial \beta_0} \left( - 2 \beta_0 \mathbf{1}^T y + 2 \beta_0 \mathbf{1}^T X\beta + \beta_0^2 \mathbf{1}^T \mathbf{1} \right) = - 2 \mathbf{1}^T y + 2 \mathbf{1}^T X \beta + 2N \beta_0$. Setting this to zero gives $N \beta_0 = \mathbf{1}^T y - \mathbf{1}^T X \beta$.
Since setting $\frac{\partial}{\partial \beta} RSS = 0$ gives $X^TX \beta = X^T(y - \mathbf{1} \beta_0)$, we substitute $\beta_0$ from above to obtain
$$ X^TX \beta = X^Ty - \frac{1}{N} X^T \mathbf{1}\mathbf{1}^T y + \frac{1}{N} X^T \mathbf{1}\mathbf{1}^T X\beta $$
By letting $C$ be the centering matrix $I - \frac{1}{N} \mathbf{1}\mathbf{1}^T$, we obtain:
$$ X^T C X \beta = X^T C y $$
Let’s show that the above is equivalent to (4.56).
RHS: Note that $Cy = y$ because the mean $\overline{y}$ that the centering matrix subtracts from each $y_i$ is $\frac{1}{N}\left(N_1 \cdot \left(-\frac{N}{N_1}\right) + N_2 \cdot \frac{N}{N_2}\right) = 0$. Then, $X^TCy$ is equal to $N (\hat{\mu}_2 - \hat{\mu}_1)$ because:
$$ N (\hat{\mu}_2 - \hat{\mu}_1) = - \frac{N}{N_1} \sum_{g_i = 1} x_i + \frac{N}{N_2} \sum_{g_i = 2} x_i = \sum_{g_i = 1} y_i x_i + \sum_{g_i = 2} y_i x_i = X^T y = X^T C y $$
LHS: Similarly, $CX$ centers each column/feature of $X$. So, the $i$-th row of $CX$ is $x_i - \frac{N_1}{N} \hat{\mu}_1 - \frac{N_2}{N} \hat{\mu}_2$.
Then, writing $X^T CX = X^T (CX)$ as a sum of outer products between the rows of $X$ and the rows of $CX$, we obtain
$$
\begin{align*}
X^T CX &= \sum_i x_i \left(x_i - \frac{N_1}{N} \hat{\mu}_1 - \frac{N_2}{N} \hat{\mu}_2\right)^T \\
&= \sum_i x_i x_i^T - \sum_{g_i = 1} \frac{N_1}{N} x_i \hat{\mu}_1^T - \sum_{g_i = 2} \frac{N_1}{N} x_i \hat{\mu}_1^T - \sum_{g_i = 1} \frac{N_2}{N} x_i \hat{\mu}_2^T - \sum_{g_i = 2} \frac{N_2}{N} x_i \hat{\mu}_2^T \\
&= \sum_i x_i x_i^T - \frac{N_1^2}{N} \hat{\mu}_1 \hat{\mu}_1^T - \frac{N_2^2}{N} \hat{\mu}_2 \hat{\mu}_2^T - \frac{N_1 N_2}{N} \left( \hat{\mu}_1 \hat{\mu}_2^T + \hat{\mu}_2 \hat{\mu}_1^T \right)
\end{align*}
$$
due to $[\sum_{g_i = 1} x_i] \hat{\mu}_1^T = N_1 \hat{\mu}_1 \hat{\mu}_1^T$ and its variations. The LHS of (4.56) also reduces to:
$$
\begin{align*}
(N - 2) \hat{\Sigma} + N \hat{\Sigma}_B &= \sum_{g_i = 1} (x_i - \hat{\mu}_1) (x_i - \hat{\mu}_1)^T + \sum_{g_i = 2} (x_i - \hat{\mu}_2) (x_i - \hat{\mu}_2)^T + \frac{N_1 N_2}{N} (\hat{\mu}_2 - \hat{\mu}_1) (\hat{\mu}_2 - \hat{\mu}_1)^T \\
&= \sum_{i} x_i x_i^T - N_1 \hat{\mu}_1 \hat{\mu}_1^T - N_2 \hat{\mu}_2 \hat{\mu}_2^T + \frac{N_1 N_2}{N} \left( \hat{\mu}_1 \hat{\mu}_1^T + \hat{\mu}_2 \hat{\mu}_2^T - \hat{\mu}_1 \hat{\mu}_2^T - \hat{\mu}_2 \hat{\mu}_1^T \right)
\end{align*}
$$
which is equivalent to $X^TCX$ above because $\frac{N_1 N_2}{N} - N_1 = \frac{N_1 (N_2 - N)}{N} = - \frac{N_1^2}{N}$ (and similarly for the $\hat{\mu}_2 \hat{\mu}_2^T$ coefficient).
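As a sanity check (my own simulation, not part of the solution), a short NumPy experiment with the $-N/N_1$, $N/N_2$ coding confirms that the least squares $\beta$ satisfies (4.56):

```python
import numpy as np

rng = np.random.default_rng(0)
N1, N2, p = 30, 50, 4
N = N1 + N2
X = np.vstack([rng.normal(0.0, 1.0, (N1, p)), rng.normal(1.0, 1.0, (N2, p))])
y = np.concatenate([np.full(N1, -N / N1), np.full(N2, N / N2)])

# Least squares fit of (4.55) with an intercept.
coef = np.linalg.lstsq(np.column_stack([np.ones(N), X]), y, rcond=None)[0]
beta = coef[1:]

mu1, mu2 = X[:N1].mean(axis=0), X[N1:].mean(axis=0)
S1 = (X[:N1] - mu1).T @ (X[:N1] - mu1)
S2 = (X[N1:] - mu2).T @ (X[N1:] - mu2)
Sigma = (S1 + S2) / (N - 2)                                # pooled covariance
Sigma_B = N1 * N2 / N**2 * np.outer(mu2 - mu1, mu2 - mu1)  # between covariance

lhs = ((N - 2) * Sigma + N * Sigma_B) @ beta
rhs = N * (mu2 - mu1)
print(np.allclose(lhs, rhs))   # expected: True
```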
(c) $\hat{\Sigma}_B \beta$ is in the direction of $(\hat{\mu}_2 - \hat{\mu}_1)$ because $\hat{\Sigma}_B \beta = \frac{N_1 N_2}{N^2} (\hat{\mu}_2 - \hat{\mu}_1) \left[ (\hat{\mu}_2 - \hat{\mu}_1)^T \beta \right]$ and $(\hat{\mu}_2 - \hat{\mu}_1)^T \beta$ is a scalar. (4.57) follows from moving $N \hat{\Sigma}_B \beta$ to the RHS and rearranging terms.
(d)
(e)