Summary
Model specifies $K - 1$ linear log-odds/logits: $$ \log \frac{P(G = k \lvert X = x)}{P(G = K \lvert X = x)} = \beta_{k0} + \beta_k^T x, \quad k = 1, \dots, K - 1$$.
Then $$ P(G = k \lvert X = x) = \frac{\exp(\beta_{k0} + \beta_k^T x)}{1 + \sum_{l=1}^{K-1} \exp(\beta_{l0} + \beta_l^T x)} $$ and $$ P(G = K \lvert X = x) = \frac{1}{1 + \sum_{l=1}^{K-1} \exp(\beta_{l0} + \beta_l^T x)} $$,
so the posteriors are in $[0,1]$ and sum to 1.
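A minimal numpy sketch of these posteriors (toy parameters of my own, not from the book); class $K$ acts as the reference class with an implicit logit of 0:

```python
import numpy as np

def posteriors(x, B):
    """P(G = k | X = x) for k = 1..K, given B of shape (K-1, p+1)
    whose l-th row stacks (beta_l0, beta_l)."""
    x1 = np.append(1.0, x)            # prepend 1 for the intercept
    z = np.append(B @ x1, 0.0)        # logits; reference class K gets logit 0
    z -= z.max()                      # stabilize the exponentials
    p = np.exp(z)
    return p / p.sum()                # in [0, 1] and summing to 1

# toy example: K = 3 classes, p = 2 inputs
B = np.array([[0.5, 1.0, -2.0],
              [-1.0, 0.3, 0.7]])
print(posteriors(np.array([0.2, -0.4]), B))
```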
4.4.1 Fitting Logistic Regression Models
Maximum (conditional) likelihood estimation: $$L(\theta) = \log P(g_1,…,g_N \lvert x_1,…,x_N; \theta) = \sum_i \log P(G = g_i \lvert X = x_i; \theta) $$
Binary case: coding the two-class response as $y_i \in \{0, 1\}$ (with $y_i = 1$ when $g_i = 1$), setting the derivative of the log-likelihood to zero gives the score equations: $$ \frac{\partial L(\beta)}{\partial \beta} = \sum_i x_i (y_i - P(G = 1 \lvert X = x_i; \beta)) = 0$$
Score equations can be solved by Newton-Raphson or coordinate-descent algorithms.
TODO: Newton-Raphson algorithm
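In the meantime, a minimal sketch of the Newton-Raphson updates for the binary case, in the IRLS form $\beta \leftarrow \beta + (X^T W X)^{-1} X^T (y - p)$ with $W = \mathrm{diag}\{p_i(1 - p_i)\}$; the function name and stopping rule below are my own:

```python
import numpy as np

def fit_logistic_newton(X, y, n_iter=25, tol=1e-8):
    """Binary logistic regression by Newton-Raphson (IRLS).
    X: (N, p) inputs, y: (N,) 0/1 responses. Returns the (p+1,)
    coefficient vector with the intercept first."""
    Xb = np.column_stack([np.ones(len(X)), X])        # add the intercept column
    beta = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xb @ beta))          # P(G = 1 | x_i; beta)
        grad = Xb.T @ (y - p)                         # the score equations above
        hess = Xb.T @ (Xb * (p * (1 - p))[:, None])   # X^T W X (negative Hessian)
        step = np.linalg.solve(hess, grad)
        beta += step
        if np.max(np.abs(step)) < tol:                # stop when the update stalls
            break
    return beta
```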
4.4.3 Quadratic Approximations and Inference
4.4.4 $L_1$ Regularized Logistic Regression
Add $L_1$ penalty to log-likelihood for variable selection and shrinkage.
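Concretely, for the binary case the penalized criterion (ESL's (4.31), modulo notation; the intercept is left unpenalized) is

$$ \max_{\beta_0, \beta} \left\{ \sum_{i=1}^N \left[ y_i (\beta_0 + \beta^T x_i) - \log\big(1 + e^{\beta_0 + \beta^T x_i}\big) \right] - \lambda \sum_{j=1}^p \lvert \beta_j \rvert \right\} $$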
TODO: Optimization methods:
- Nonlinear programming of concave function
- Weighted Lasso using quadratic approximations from Newton algorithm
- Path algorithms (LAR)
- Coordinate descent is efficient (see the usage sketch after this list)
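A small usage sketch (not from the book): scikit-learn's `LogisticRegression` fits this $L_1$-penalized criterion; the `saga` solver supports the $L_1$ penalty, and `C` is the inverse of the regularization strength. The data and settings below are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# synthetic data: 20 features, only 5 of them informative
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           random_state=0)
clf = LogisticRegression(penalty="l1", solver="saga", C=0.5, max_iter=5000)
clf.fit(X, y)
# the L1 penalty zeroes out many coefficients (variable selection)
print((clf.coef_ != 0).sum(), "of", X.shape[1], "coefficients are nonzero")
```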
4.4.5 Logistic Regression or LDA?
For both models, the log-posterior odds between any class and the reference class $K$ are linear in $x$. However,
- Logistic regression fits $P(G \lvert X)$ by maximizing (multinomial) conditional likelihood
- LDA fits $P(X, G = k) = P(X \lvert G = k) P(G = k)$ by maximizing full likelihood
The full likelihood therefore also involves the marginal mixture density $$P(X) = \sum_k \pi_k \phi(X; \mu_k, \Sigma),$$ which logistic regression leaves unspecified.
Derivations
(4.18) Express posteriors using Sigmoid (binary) and Softmax (multi-class)
(Not stated in book.)
(4.21) Maximizing log-likelihood is equivalent to minimizing cross-entropy
(Not stated in book.)
(4.23) Why does the Newton-Raphson algorithm work?
(p121) Log-likelihood is concave
(p121) Coordinate-descent methods can be used to maximize the (multi-class) log-likelihood efficiently
(Table 4.2) Std Error and Z-score for each coefficient
(Table 4.3) Subset selection using analysis of deviance
(Section 4.4.3) Implications of least squares connections
(4.31) $L_1$-regularized log-likelihood is concave
(Section 4.4.4) Optimization methods for $L_1$-regularized logistic regression
(Section 4.4.5) LDA is generative and logistic regression is discriminative
(Not stated in book.)
(4.37) Maximizing full log-likelihood gives LDA parameter estimates
(p128) LDA is not robust to gross outliers
(p128) Marginal likelihood as a regularizer
Exercises
4.4
See Weatherwax solution and Bishop Section 4.3.4.
With multi-class logistic regression, the log-likelihood of one sample is:
$$ \begin{align*} \log P(G = g_i \lvert x_i; \theta) &= \log \prod_{l=1}^K P(G = l \lvert x_i; \theta)^{I[g_i = l]} \\ &= \sum_{l=1}^{K-1} I[g_i = l] \log p_l + I[g_i = K] \log p_K \\ &= \sum_{l=1}^{K-1} I[g_i = l] \left( \beta_l^T x_i - \log p_K^{-1} \right) + I[g_i = K] \cdot \left( -\log p_K^{-1} \right) \\ &= \sum_{l=1}^{K-1} y_{il} \beta_l^T x_i - \sum_{l=1}^{K} I[g_i = l] \log p_K^{-1} \\ &= \sum_{l=1}^{K-1} y_{il} \beta_l^T x_i - \log p_K^{-1} \\ &= \sum_{l=1}^{K-1} y_{il} \beta_l^T x_i - \log \Big( 1 + \sum_{m=1}^{K-1} e^{\beta_m^T x_i} \Big) \end{align*} $$
where $p_l = P(G = l \lvert x_i; \theta)$, the substitution $\log p_l = \beta_l^T x_i - \log p_K^{-1}$ follows from the log-odds model, and $y_i$ is the 0/1 indicator vector of length $K - 1$ with $y_{il} = I[g_i = l]$ (class $K$ is encoded as all zeros). So, the log-likelihood of all samples is
$$L(\theta) = \sum_{i=1}^N \left[ \sum_{l=1}^{K-1} y_{il} \beta_l^T x_i - \log \Big( 1 + \sum_{m=1}^{K-1} e^{\beta_m^T x_i} \Big) \right]$$
For Newton-Raphson, we need the gradient and Hessian of $L$ w.r.t. $\theta$, i.e. the $K - 1$ vectors $\beta_l$.
The gradient can be arranged as a $(K - 1) \times (p + 1)$ matrix whose rows are $\partial L / \partial \beta_l$; the first row is
$$ \frac{\partial L}{\partial \beta_1} = \sum_{i=1}^N \left( y_{i1} x_i - \frac{e^{\beta_1^T x_i}}{1 + \sum_{m=1}^{K-1} e^{\beta_m^T x_i}} \, x_i \right) = \sum_{i=1}^N (y_{i1} - p_{i1}) x_i $$
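As a sanity check (my own, not part of the exercise), a finite-difference comparison against this analytic gradient on made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, K = 50, 3, 4                                   # arbitrary sizes
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])   # with intercept column
g = rng.integers(1, K + 1, size=N)                   # labels in 1..K
Y = (g[:, None] == np.arange(1, K)).astype(float)    # N x (K-1) indicators y_il
B = rng.normal(size=(K - 1, p + 1))                  # the K-1 beta vectors, stacked as rows

def loglik(B):
    Z = X @ B.T                                      # z_il = beta_l^T x_i
    return np.sum(Y * Z) - np.sum(np.log1p(np.exp(Z).sum(axis=1)))

E = np.exp(X @ B.T)
P = E / (1.0 + E.sum(axis=1, keepdims=True))         # p_il
grad = (Y - P).T @ X                                 # rows are dL / dbeta_l

eps = 1e-6
fd = np.zeros(p + 1)
for j in range(p + 1):                               # central differences w.r.t. beta_1
    Bp, Bm = B.copy(), B.copy()
    Bp[0, j] += eps
    Bm[0, j] -= eps
    fd[j] = (loglik(Bp) - loglik(Bm)) / (2 * eps)
print(np.allclose(fd, grad[0], atol=1e-6))           # expect True
```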
TODO: find Hessian by blocks.
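A sketch of the expected block structure (the standard multinomial-logistic result; I have not re-derived it in the book's notation): for $k, m = 1, \dots, K - 1$,
$$ \frac{\partial^2 L}{\partial \beta_k \, \partial \beta_m^T} = -\sum_{i=1}^N p_{ik} \left( I[k = m] - p_{im} \right) x_i x_i^T $$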
4.5
Implementation
See Colab.