Model specifies $K - 1$ linear log-odds/logits: \(\log\frac{P(G = k \lvert X = x)}{P(G = K \lvert X = x)} = \beta_{k0} + \beta_k^T x,\; k = 1, \dots, K - 1\), with class $K$ as the (arbitrary) reference class in the denominator.
Then \(P(G = k \lvert X = x) = \frac{\exp(\beta_{k0} + \beta_k^T x)}{1 + \sum_{l=1}^{K-1} \exp(\beta_{l0} + \beta_l^T x)}\) and \(P(G = K \lvert X = x) = \frac{1}{1 + \sum_{l=1}^{K-1} \exp(\beta_{l0} + \beta_l^T x)}\),
so the posteriors are in $[0,1]$ and sum to 1.
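As a quick illustration, here is a minimal sketch (assuming NumPy; the coefficient values are made up) of turning the $K - 1$ linear logits into posterior probabilities:

```python
import numpy as np

def posteriors(x, intercepts, coefs):
    """Class posteriors for the multinomial logit model.

    intercepts: shape (K-1,), coefs: shape (K-1, p); class K is the reference.
    Returns probabilities for classes 1..K that lie in [0, 1] and sum to 1.
    """
    logits = intercepts + coefs @ x                # beta_{k0} + beta_k^T x, k = 1..K-1
    expl = np.exp(logits)
    denom = 1.0 + expl.sum()                       # 1 + sum_l exp(beta_{l0} + beta_l^T x)
    return np.append(expl / denom, 1.0 / denom)    # last entry is P(G = K | X = x)

# Hypothetical example with K = 3 classes and p = 2 predictors.
p = posteriors(np.array([0.5, -1.0]),
               np.array([0.2, -0.3]),
               np.array([[1.0, 0.5], [-0.4, 2.0]]))
print(p, p.sum())                                  # probabilities sum to 1
```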
Maximum (conditional) likelihood estimation: the conditional log-likelihood is \(\ell(\theta) = \log P(g_1, \dots, g_N \lvert x_1, \dots, x_N; \theta) = \sum_i \log P(G = g_i \lvert X = x_i; \theta)\); the marginal $P(X)$ is left unspecified.
Binary case ($K = 2$): coding the responses as $y_i \in \{0, 1\}$ and writing $p(x; \beta) = P(G = 1 \lvert X = x; \beta)$, setting the derivative of the log-likelihood to zero gives the score equations \(\frac{\partial \ell(\beta)}{\partial \beta} = \sum_i x_i (y_i - p(x_i; \beta)) = 0\), where $x_i$ includes the constant 1 for the intercept.
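A minimal sketch of the binary-case log-likelihood and score (gradient), assuming $X$ already carries a leading column of ones; the function name is illustrative:

```python
import numpy as np

def log_likelihood_and_score(beta, X, y):
    """Binary logistic regression: conditional log-likelihood and its gradient.

    X: (N, p+1) with a leading column of ones; y: (N,) of 0/1 responses.
    The score is sum_i x_i * (y_i - p(x_i; beta)), which is zero at the MLE.
    """
    p = 1.0 / (1.0 + np.exp(-X @ beta))    # P(G = 1 | X = x_i; beta)
    ll = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    score = X.T @ (y - p)
    return ll, score
```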
Score equations can be solved by Newton-Raphson or coordinate-descent algorithms.
Newton-Raphson: starting from $\beta^{\text{old}}$, a single step is \(\beta^{\text{new}} = \beta^{\text{old}} - \left(\frac{\partial^2 \ell(\beta)}{\partial \beta \partial \beta^T}\right)^{-1} \frac{\partial \ell(\beta)}{\partial \beta}\), with Hessian \(\frac{\partial^2 \ell(\beta)}{\partial \beta \partial \beta^T} = -\sum_i x_i x_i^T\, p(x_i; \beta)(1 - p(x_i; \beta))\). In matrix form each step is a weighted least-squares fit, \(\beta^{\text{new}} = (\mathbf{X}^T \mathbf{W} \mathbf{X})^{-1} \mathbf{X}^T \mathbf{W} \mathbf{z}\) with adjusted response \(\mathbf{z} = \mathbf{X}\beta^{\text{old}} + \mathbf{W}^{-1}(\mathbf{y} - \mathbf{p})\) and $\mathbf{W}$ diagonal with entries $p(x_i; \beta^{\text{old}})(1 - p(x_i; \beta^{\text{old}}))$; hence the name iteratively reweighted least squares (IRLS).
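A minimal sketch of the Newton-Raphson / IRLS iteration above, assuming NumPy and a well-conditioned design matrix; a linear solve replaces the explicit matrix inverse:

```python
import numpy as np

def fit_logistic_newton(X, y, n_iter=25, tol=1e-8):
    """Fit binary logistic regression by Newton-Raphson / IRLS.

    X: (N, p+1) with a leading column of ones; y: (N,) of 0/1 responses.
    Each step solves the weighted least-squares problem
        beta_new = (X^T W X)^{-1} X^T W z,   z = X beta_old + W^{-1} (y - p).
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = p * (1 - p)                                   # diagonal weights
        z = X @ beta + (y - p) / W                        # adjusted response
        beta_new = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
```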
Add an $L_1$ (lasso) penalty to the log-likelihood for variable selection and shrinkage: maximize \(\sum_{i=1}^N \left[ y_i(\beta_0 + \beta^T x_i) - \log(1 + e^{\beta_0 + \beta^T x_i}) \right] - \lambda \sum_{j=1}^p \lvert \beta_j \rvert\); the intercept is typically not penalized.
Optimization methods: the penalized log-likelihood is concave and can be maximized by general nonlinear programming or, more commonly, by coordinate-descent methods (as in glmnet), which are efficient over a grid of $\lambda$ values; the coefficient profiles are piecewise smooth (not piecewise linear) in $\lambda$.
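A minimal usage sketch of $L_1$-penalized logistic regression via scikit-learn (the liblinear and saga solvers support the non-smooth penalty); the simulated data and the choice of C are purely illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))              # only the first 3 features matter
y = (X[:, 0] - 2 * X[:, 1] + X[:, 2] + rng.normal(size=200) > 0).astype(int)

# C is the inverse of the penalty strength lambda: smaller C -> sparser model.
lasso_logit = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
lasso_logit.fit(X, y)
print(lasso_logit.coef_)                    # many coefficients shrunk exactly to zero
```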
Log-posterior odds for both models (logistic regression and LDA) are linear in $x$. However, the coefficients are estimated differently: logistic regression maximizes the conditional likelihood of $G$ given $X$ and leaves the marginal density $P(X)$ unspecified, whereas LDA maximizes the full (joint) likelihood based on \(P(X, G = k) = \phi(X; \mu_k, \Sigma)\pi_k\).
Equivalently, LDA also fits the marginal mixture density \(P(X) = \sum_k \pi_k \phi(X; \mu_k, \Sigma)\), so it uses information about $P(X)$ that logistic regression ignores: more efficient when the Gaussian assumptions hold, less robust when they do not.
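To see the two fits side by side, a small simulation sketch (assuming scikit-learn; the class means, shared covariance, and sample size are arbitrary) comparing the linear coefficients from LDA and logistic regression when the Gaussian assumption holds:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 500
# Two Gaussian classes with a shared covariance, so the LDA model is correct.
X = np.vstack([rng.multivariate_normal([0.0, 0.0], [[1, 0.3], [0.3, 1]], n),
               rng.multivariate_normal([1.5, 1.0], [[1, 0.3], [0.3, 1]], n)])
y = np.repeat([0, 1], n)

lda = LinearDiscriminantAnalysis().fit(X, y)   # estimates mu_k, Sigma, pi_k (full likelihood)
logit = LogisticRegression().fit(X, y)         # uses only the conditional likelihood P(G | X)
print(lda.coef_, lda.intercept_)
print(logit.coef_, logit.intercept_)           # similar linear boundaries, different estimators
```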