Summary

Model specifies $K - 1$ linear log-odds/logits: $$ \log\frac{P(G = k \lvert X = x)}{P(G = K \lvert X = x)} = \beta_{k0} + \beta_k^T x, \quad k = 1, \dots, K - 1.$$

Then $$ P(G = k \lvert X = x) = \frac{\exp(\beta_{k0} + \beta_k^T x)}{1 + \sum_{l=1}^{K-1} \exp(\beta_{l0} + \beta_l^T x)} $$ and $$ P(G = K \lvert X = x) = \frac{1}{1 + \sum_{l=1}^{K-1} \exp(\beta_{l0} + \beta_l^T x)} $$,
so the posteriors are in $[0,1]$ and sum to 1.
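
A quick numerical check of these formulas (a throwaway sketch; the function name `posteriors` and the example numbers are made up):

```python
import numpy as np

def posteriors(betas, x):
    """Class posteriors from K-1 linear logits (class K is the reference).

    betas: (K-1, p+1) array of rows [beta_k0, beta_k]; x: (p,) feature vector.
    Returns a length-K probability vector (non-negative, sums to 1).
    """
    logits = betas[:, 0] + betas[:, 1:] @ x    # beta_k0 + beta_k^T x
    expu = np.exp(np.append(logits, 0.0))      # reference class K gets logit 0
    return expu / expu.sum()

# Example: K = 3 classes, p = 2 features
betas = np.array([[0.5, 1.0, -2.0],
                  [-1.0, 0.3, 0.7]])
p = posteriors(betas, np.array([1.2, -0.4]))
print(p, p.sum())                              # probabilities sum to 1
```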

4.4.1 Fitting Logistic Regression Models

Maximum (conditional) likelihood estimation: $$L(\theta) = \log P(g_1,…,g_N \lvert x_1,…,x_N; \theta) = \sum_i \log P(G = g_i \lvert X = x_i; \theta) $$

Binary case: with 0-1 responses, MLE gives score equations: $$ \frac{\partial l(\beta)}{\partial \beta} = \sum_i x_i (y_i - P(G = 1 \lvert X = x_i; \beta)) = 0$$

Score equations can be solved by Newton-Raphson or coordinate-descent algorithms.

TODO: Newton-Raphson algorithm
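
A minimal numerical sketch of the Newton-Raphson (IRLS) update for the binary case; the function name `fit_logistic_irls`, the full-batch linear solve, and the synthetic-data check are my own choices for illustration, not from the book:

```python
import numpy as np

def fit_logistic_irls(X, y, n_iter=25, tol=1e-8):
    """Binary logistic regression by Newton-Raphson / IRLS.

    X: (N, p) feature matrix (no intercept column); y: (N,) 0/1 labels.
    Returns the length-(p+1) coefficient vector, intercept first.
    """
    N = X.shape[0]
    Xb = np.hstack([np.ones((N, 1)), X])          # prepend intercept column
    beta = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xb @ beta))      # fitted probabilities
        W = p * (1.0 - p)                         # diagonal of the weight matrix
        grad = Xb.T @ (y - p)                     # score vector
        hess = Xb.T @ (Xb * W[:, None])           # X^T W X
        step = np.linalg.solve(hess, grad)        # Newton step
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Quick check on synthetic data (illustrative only): should roughly
# recover the true coefficients (0.5, 1.0, -2.0).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
true_eta = 0.5 + X @ np.array([1.0, -2.0])
y = (rng.random(500) < 1.0 / (1.0 + np.exp(-true_eta))).astype(float)
print(fit_logistic_irls(X, y))
```

Each Newton step is a weighted least-squares solve, which is why the procedure is also called iteratively reweighted least squares.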

4.4.3 Quadratic Approximations and Inference

4.4.4 $L_1$ Regularized Logistic Regression

Add an $L_1$ penalty to the log-likelihood for variable selection and shrinkage.
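
Concretely, in the binary case the criterion being maximized is (cf. (4.31); note the intercept is not penalized):

$$ \max_{\beta_0, \beta} \; \sum_{i=1}^N \Big[ y_i (\beta_0 + \beta^T x_i) - \log \big(1 + e^{\beta_0 + \beta^T x_i}\big) \Big] - \lambda \sum_{j=1}^p \lvert \beta_j \rvert $$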

TODO: Optimization methods:

  • Nonlinear programming, exploiting concavity of the penalized log-likelihood
  • Repeated weighted lasso fits using the quadratic approximations from the Newton algorithm
  • Path algorithms (LAR-style), though the coefficient profiles are piecewise smooth rather than piecewise linear
  • Coordinate descent, which is efficient for large problems (see the sketch below)
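
A toy illustration of the last two bullets: coordinate descent applied to the weighted-lasso quadratic approximation. This is a glmnet-style sketch under my own conventions (unpenalized intercept, unnormalized $\lambda$, function names `l1_logistic_cd` / `soft_threshold`), not the book's or glmnet's actual implementation:

```python
import numpy as np

def soft_threshold(u, lam):
    """S(u, lam) = sign(u) * max(|u| - lam, 0)."""
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

def l1_logistic_cd(X, y, lam, n_outer=30, n_inner=100, tol=1e-6):
    """L1-penalized binary logistic regression: coordinate descent
    on the IRLS quadratic approximation (intercept unpenalized)."""
    N, p = X.shape
    beta0, beta = 0.0, np.zeros(p)
    for _ in range(n_outer):
        eta = beta0 + X @ beta
        prob = 1.0 / (1.0 + np.exp(-eta))
        w = np.clip(prob * (1.0 - prob), 1e-5, None)   # IRLS weights
        z = eta + (y - prob) / w                       # working response
        # Inner loop: weighted lasso
        #   min 0.5 * sum_i w_i (z_i - beta0 - x_i^T beta)^2 + lam * ||beta||_1
        for _ in range(n_inner):
            beta_prev = beta.copy()
            beta0 = np.sum(w * (z - X @ beta)) / np.sum(w)
            for j in range(p):
                r = z - beta0 - X @ beta + X[:, j] * beta[j]   # partial residual
                beta[j] = (soft_threshold(np.sum(w * X[:, j] * r), lam)
                           / np.sum(w * X[:, j] ** 2))
            if np.max(np.abs(beta - beta_prev)) < tol:
                break
    return beta0, beta
```

With $\lambda = 0$ the soft-thresholding does nothing and the procedure reduces to plain IRLS; larger $\lambda$ drives more coefficients exactly to zero.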

4.4.5 Logistic Regression or LDA?

Log-posterior odds for both models are linear. However,

  • Logistic regression fits $P(G \lvert X)$ by maximizing (multinomial) conditional likelihood
  • LDA fits $P(X, G = k) = P(X \lvert G = k) P(G = k)$ by maximizing full likelihood

In maximizing the full likelihood, LDA also relies on the marginal mixture density $$P(X) = \sum_k \pi_k \phi(X; \mu_k, \Sigma).$$
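
To make the contrast concrete (my restatement of the argument): the full log-likelihood that LDA maximizes splits as

$$ \sum_{i=1}^N \log P(x_i, g_i) = \sum_{i=1}^N \log \big[ \phi(x_i; \mu_{g_i}, \Sigma)\, \pi_{g_i} \big] = \sum_{i=1}^N \log P(G = g_i \lvert X = x_i) + \sum_{i=1}^N \log P(x_i). $$

Logistic regression maximizes only the first (conditional) term and leaves $P(X)$ completely unspecified; LDA also fits the second term through the mixture density above.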

Derivations

(4.18) Express posteriors using Sigmoid (binary) and Softmax (multi-class)

(Not stated in book.)
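
Sketch (not from the book): in the binary case ($K = 2$), write $\eta = \beta_{10} + \beta_1^T x$; then the summary formulas give

$$ P(G = 1 \lvert X = x) = \frac{e^{\eta}}{1 + e^{\eta}} = \frac{1}{1 + e^{-\eta}} = \sigma(\eta), $$

the sigmoid of the logit. For general $K$, appending a fixed logit of $0$ for the reference class $K$ turns the two posterior formulas above into a single softmax over the $K$ logits $(\beta_{10} + \beta_1^T x, \dots, \beta_{K-1,0} + \beta_{K-1}^T x, 0)$.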

(4.21) Maximizing log-likelihood is equivalent to minimizing cross-entropy

(Not stated in book.)
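
Sketch (not from the book): in the binary case with $p_i = P(G = 1 \lvert x_i; \beta)$,

$$ l(\beta) = \sum_{i=1}^N \big[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \big], $$

which is the negative of the cross-entropy between the labels $y_i$ and the fitted probabilities $p_i$, so maximizing the log-likelihood and minimizing cross-entropy are the same problem.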

(4.23) Why does the Newton-Raphson algorithm work?

(p121) Log-likelihood is concave

(p121) Coordinate-descent methods can be used to maximize the (multi-class) log-likelihood efficiently

(Table 4.2) Std Error and Z-score for each coefficient

(Table 4.3) Subset selection using analysis of deviance

(Section 4.4.3) Implications of least squares connections

(4.31) $L_1$ regularized log-likelihood is concave

(Section 4.4.4) Optimization methods for $L_1$ regularized logistic regression

(Section 4.4.5) LDA is generative and logistic regression is discriminative

(Not stated in book.)

(4.37) Maximizing full log-likelihood gives LDA parameter estimates

(p128) LDA is not robust to gross outliers

(p128) Marginal likelihood as a regularizer

Exercises

4.4

See Weatherwax solution and Bishop Section 4.3.4.

With multi-class logistic regression, the log-likelihood of one sample is:

$$
\begin{align*}
\log P(G = g \lvert x; \theta) &= \log \prod_{l=1}^K P(G = l \lvert x; \theta)^{I[g = l]} \\
&= \sum_{l=1}^{K-1} I[g = l] \log p_l + I[g = K] \log p_K \\
&= \sum_{l=1}^{K-1} I[g = l] \beta_l^T x - \sum_{l=1}^{K-1} I[g = l] \log p_K^{-1} - I[g = K] \log p_K^{-1} \\
&= \sum_{l=1}^{K-1} y_l \beta_l^T x - \sum_{l=1}^{K} I[g = l] \log p_K^{-1} \\
&= \sum_{l=1}^{K-1} y_l \beta_l^T x - \log p_K^{-1} \\
&= \sum_{l=1}^{K-1} y_l \beta_l^T x - \log \Big(1 + \sum_{m=1}^{K-1} e^{\beta_m^T x}\Big)
\end{align*}
$$

where $p_l = P(G = l \lvert x; \theta)$ and $y$ is a 0/1 encoded vector of length $K-1$. So, the log-likelihood of all samples is

$$L(\theta) = \sum_{i=1}^N \Big[ \sum_{l=1}^{K-1} y_{il} \beta_l^T x_i - \log \Big(1 + \sum_{m=1}^{K-1} e^{\beta_m^T x_i}\Big) \Big]$$

For Newton-Raphson, we need the gradient and Hessian of $L$ with respect to $\theta$, i.e. the $K - 1$ coefficient vectors $\beta_l$.

The gradient is a $K - 1$ by $p + 1$ matrix, where the first row is

$$ \frac{\partial L}{\partial \beta_1} = \sum_{i=1}^N \Big( y_{i1} x_i - \frac{e^{\beta_1^T x_i}}{1 + \sum_{m=1}^{K-1} e^{\beta_m^T x_i}}\, x_i \Big) = \sum_{i=1}^N (y_{i1} - p_{i1})\, x_i $$

TODO: find Hessian by blocks.
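
A sketch of the blocks (my derivation; worth double-checking): using $\partial p_{ik} / \partial \beta_l = p_{ik} (\delta_{kl} - p_{il})\, x_i$ and differentiating the gradient above,

$$ \frac{\partial^2 L}{\partial \beta_k\, \partial \beta_l^T} = -\sum_{i=1}^N p_{ik} (\delta_{kl} - p_{il})\, x_i x_i^T, \qquad k, l = 1, \dots, K - 1, $$

where $\delta_{kl}$ is the Kronecker delta; the full Hessian is the $(K-1)(p+1) \times (K-1)(p+1)$ matrix made of these blocks.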

4.5

Implementation

See Colab.