4.4 Logistic Regression - Derivations

(4.18) Express posteriors using Sigmoid (binary) and Softmax (multi-class)

(Not stated in book.)
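A sketch of the identities in question (the first form follows the book's eq. (4.18); the symmetric softmax rewriting is my own):

```latex
% Book's form (4.18): K-1 log-odds against reference class K.
\Pr(G=k \mid X=x) = \frac{\exp(\beta_{k0} + \beta_k^T x)}
                         {1 + \sum_{\ell=1}^{K-1} \exp(\beta_{\ell 0} + \beta_\ell^T x)},
\quad k = 1,\dots,K-1,
\qquad
\Pr(G=K \mid X=x) = \frac{1}{1 + \sum_{\ell=1}^{K-1} \exp(\beta_{\ell 0} + \beta_\ell^T x)}.

% Binary case (K=2): a single sigmoid, with \sigma(t) = 1/(1+e^{-t}):
\Pr(G=1 \mid X=x) = \sigma(\beta_{10} + \beta_1^T x).

% Equivalent symmetric (softmax) form: give every class its own parameters;
% the representation is unchanged, since subtracting (\beta_{K0}, \beta_K)
% from every class cancels in the ratio:
\Pr(G=k \mid X=x) = \frac{\exp(\beta_{k0} + \beta_k^T x)}
                         {\sum_{\ell=1}^{K} \exp(\beta_{\ell 0} + \beta_\ell^T x)}.
```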

(4.21) Maximizing log-likelihood is equivalent to minimizing cross-entropy

(Not stated in book.)
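A sketch of the equivalence for the two-class case, with the book's 0/1 coding $y_i$ and $p(x_i;\beta) = \Pr(G=1 \mid x_i)$ as in (4.20):

```latex
\ell(\beta) = \sum_{i=1}^{N} \Big[ y_i \log p(x_i;\beta)
            + (1-y_i)\log\big(1 - p(x_i;\beta)\big) \Big].

% The cross-entropy between the empirical label distribution (y_i, 1-y_i)
% and the model distribution (p_i, 1-p_i), summed over observations, is
% exactly the negated log-likelihood:
H(y, p) = -\sum_{i=1}^{N} \Big[ y_i \log p_i + (1-y_i)\log(1-p_i) \Big]
        = -\ell(\beta),

% so \arg\max_\beta \ell(\beta) = \arg\min_\beta H(y, p).
```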

(4.23) Why does the Newton-Raphson algorithm work?
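A minimal NumPy sketch of the Newton-Raphson/IRLS iteration of (4.23)-(4.26); function names and the toy data are mine, not the book's:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def irls_logistic(X, y, n_iter=25, tol=1e-8):
    """Newton-Raphson (IRLS) for two-class logistic regression.

    Each step solves the weighted least-squares problem
        beta <- (X^T W X)^{-1} X^T W z,   z = X beta + W^{-1}(y - p),
    which is the same update as beta <- beta + (X^T W X)^{-1} X^T (y - p).
    """
    Xb = np.column_stack([np.ones(len(y)), X])   # prepend intercept column
    beta = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = sigmoid(Xb @ beta)
        W = p * (1 - p)                          # diagonal of the weight matrix
        grad = Xb.T @ (y - p)                    # score vector
        H = Xb.T @ (Xb * W[:, None])             # information matrix X^T W X
        step = np.linalg.solve(H, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta
```

At the maximum the score equations $\mathbf{X}^T(\mathbf{y}-\mathbf{p}) = 0$ hold, which is a quick way to check convergence.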

(p121) Log-likelihood is concave
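The standard argument, in the book's notation:

```latex
% Hessian of the two-class log-likelihood, from (4.25):
\frac{\partial^2 \ell(\beta)}{\partial\beta\,\partial\beta^T}
  = -\sum_{i=1}^{N} x_i x_i^T \, p(x_i;\beta)\big(1 - p(x_i;\beta)\big)
  = -\mathbf{X}^T \mathbf{W} \mathbf{X}.

% W is diagonal with entries p_i(1-p_i) \ge 0, so for any vector v,
v^T\big(-\mathbf{X}^T \mathbf{W} \mathbf{X}\big)v
  = -\sum_{i=1}^{N} p_i(1-p_i)\,(x_i^T v)^2 \;\le\; 0,

% i.e. the Hessian is negative semidefinite everywhere, hence \ell is concave.
```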

(p121) Coordinate-descent methods can be used to maximize the (multi-class) log-likelihood efficiently
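A generic illustration of the idea (not the specific multinomial algorithms the book alludes to): cyclic coordinate ascent, here with a damped 1-D Newton step per coordinate and a backtracking safeguard; all names are mine:

```python
import numpy as np

def log_lik(Xb, y, beta):
    """Two-class logistic log-likelihood (numerically naive; fine for a demo)."""
    eta = Xb @ beta
    return float(np.sum(y * eta - np.log1p(np.exp(eta))))

def coord_ascent_logistic(X, y, n_cycles=200):
    """Cyclic coordinate ascent: one safeguarded 1-D Newton step per coordinate."""
    Xb = np.column_stack([np.ones(len(y)), X])   # intercept column
    beta = np.zeros(Xb.shape[1])
    for _ in range(n_cycles):
        for j in range(len(beta)):
            p = 1 / (1 + np.exp(-(Xb @ beta)))
            g = Xb[:, j] @ (y - p)                  # 1-D score
            h = np.sum(Xb[:, j] ** 2 * p * (1 - p)) # 1-D information
            step = g / h
            # backtracking: halve the step until the log-likelihood improves
            ll0 = log_lik(Xb, y, beta)
            while True:
                trial = beta.copy()
                trial[j] += step
                if log_lik(Xb, y, trial) >= ll0 or abs(step) < 1e-12:
                    break
                step /= 2
            beta = trial
    return beta
```

Because the log-likelihood is smooth and concave, each coordinate subproblem is a concave 1-D maximization and the cyclic scheme converges to the global maximum; the appeal in practice is that each coordinate update is cheap.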

(Table 4.2) Standard error and Z-score for each coefficient
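Where Table 4.2's numbers come from, sketched in standard Wald-test notation:

```latex
% Asymptotic covariance of the MLE is the inverse information at \hat\beta:
\widehat{\mathrm{Var}}(\hat\beta) = (\mathbf{X}^T \mathbf{W} \mathbf{X})^{-1},
\qquad \mathbf{W} = \mathrm{diag}\big(\hat p_i (1 - \hat p_i)\big),

\mathrm{se}(\hat\beta_j)
  = \sqrt{\big[(\mathbf{X}^T \mathbf{W} \mathbf{X})^{-1}\big]_{jj}},
\qquad
z_j = \frac{\hat\beta_j}{\mathrm{se}(\hat\beta_j)}.

% z_j is a Wald statistic for H_0: \beta_j = 0, asymptotically N(0,1) under
% H_0; |z_j| > 2 is the usual rough cutoff for significance at the 5% level.
```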

(Table 4.3) Subset selection using analysis of deviance
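The machinery behind an analysis-of-deviance table, sketched in standard notation (not verbatim from the book):

```latex
% Deviance of a fitted model, relative to the saturated model:
D = -2\big(\ell(\hat\beta) - \ell_{\mathrm{sat}}\big).

% For nested models (reduced \subset full, differing by d parameters), the
% saturated term cancels in the difference:
D_{\mathrm{reduced}} - D_{\mathrm{full}}
  = 2\big(\ell(\hat\beta_{\mathrm{full}}) - \ell(\hat\beta_{\mathrm{reduced}})\big)
  \;\overset{\text{asympt.}}{\sim}\; \chi^2_{d}
  \quad \text{under the reduced model,}

% so dropping a subset of terms is rejected when the deviance increase is
% large relative to a \chi^2_d quantile.
```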

(Section 4.4.3) Implications of the connection to least squares
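The connection in question, restated from the book's eqs. (4.26)-(4.28):

```latex
% Each Newton step is a weighted least-squares fit to the adjusted response z:
z = \mathbf{X}\beta^{\mathrm{old}} + \mathbf{W}^{-1}(\mathbf{y} - \mathbf{p}),
\qquad
\beta^{\mathrm{new}}
  = \arg\min_{\beta}\,(z - \mathbf{X}\beta)^T \mathbf{W} (z - \mathbf{X}\beta)
  = (\mathbf{X}^T \mathbf{W} \mathbf{X})^{-1}\mathbf{X}^T \mathbf{W} z.

% At convergence the score equations give the self-consistency relation
\mathbf{X}^T(\mathbf{y} - \hat{\mathbf{p}}) = 0,

% e.g. if the model contains an intercept, the fitted probabilities sum to
% the observed number of ones.
```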

(4.31) L1-regularized log-likelihood is concave
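A sketch of the argument, for the criterion of the book's eq. (4.31) (intercept unpenalized):

```latex
\max_{\beta_0,\,\beta}\;
  \sum_{i=1}^{N}\Big[\, y_i(\beta_0 + \beta^T x_i)
    - \log\big(1 + e^{\beta_0 + \beta^T x_i}\big)\Big]
  \;-\; \lambda \sum_{j=1}^{p} |\beta_j|.

% The first term is the log-likelihood, which is concave (its Hessian
% -X^T W X is negative semidefinite). Each |\beta_j| is convex, so each
% -\lambda|\beta_j| is concave for \lambda \ge 0, and a sum of concave
% functions is concave. The penalty is merely non-smooth at \beta_j = 0.
```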

(Section 4.4.4) Optimization methods for L1-regularized logistic regression
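One simple method from outside the book, as a minimal sketch: proximal gradient descent (ISTA), whose proximal step for the L1 penalty is soft-thresholding. All names and the step-size choice are mine:

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * |.| applied elementwise."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def l1_logistic(X, y, lam, lr=0.1, n_iter=2000):
    """Proximal gradient (ISTA) for L1-penalized logistic regression.

    The intercept is left unpenalized, as in the book's eq. (4.31).
    """
    n, p = X.shape
    Xb = np.column_stack([np.ones(n), X])
    beta = np.zeros(p + 1)
    for _ in range(n_iter):
        prob = 1 / (1 + np.exp(-(Xb @ beta)))
        grad = Xb.T @ (prob - y) / n               # gradient of average NLL
        beta = beta - lr * grad                    # gradient step
        beta[1:] = soft_threshold(beta[1:], lr * lam)  # prox step (skip intercept)
    return beta
```

The soft-threshold step is what produces exact zeros in the solution, which is the point of the L1 penalty; coordinate-descent solvers (e.g. the glmnet approach the book cites) exploit the same thresholding coordinate-wise.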

(Section 4.4.5) LDA is generative and logistic regression is discriminative

(Not stated in book.)
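One way to make the contrast concrete (a sketch; notation roughly follows Section 4.4.5):

```latex
% Both models yield linear log-odds:
\log\frac{\Pr(G=k \mid X=x)}{\Pr(G=K \mid X=x)} = \alpha_{k0} + \alpha_k^T x.

% They differ in which likelihood is maximized. The joint density factors as
\Pr(X, G=k) = \Pr(G=k \mid X)\,\Pr(X).

% LDA is generative: it models the joint via Gaussian class densities with a
% common covariance, \Pr(X \mid G=k) = \phi(X;\mu_k,\Sigma), and priors \pi_k,
% and maximizes the full log-likelihood \sum_i \log \Pr(x_i, g_i).
% Logistic regression is discriminative: it maximizes only the conditional
% factor,
\ell(\theta) = \sum_{i=1}^{N} \log \Pr(G = g_i \mid X = x_i),
% leaving the marginal \Pr(X) completely unspecified.
```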

(4.37) Maximizing full log-likelihood gives LDA parameter estimates
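A sketch of the maximizers, where (4.37) gives the marginal as the mixture $\Pr(X) = \sum_k \pi_k\,\phi(X;\mu_k,\Sigma)$:

```latex
% Full log-likelihood with observed labels g_i:
\sum_{i=1}^{N} \log\big[\phi(x_i;\mu_{g_i},\Sigma)\,\pi_{g_i}\big].

% Setting derivatives to zero (with \sum_k \pi_k = 1 enforced) recovers the
% usual LDA estimates:
\hat\pi_k = \frac{N_k}{N},
\qquad
\hat\mu_k = \frac{1}{N_k}\sum_{g_i = k} x_i,
\qquad
\hat\Sigma = \frac{1}{N}\sum_{k=1}^{K}\sum_{g_i = k}
             (x_i - \hat\mu_k)(x_i - \hat\mu_k)^T.

% (The MLE divides by N; the plug-in estimate in Section 4.3 divides by N-K.)
```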

(p128) LDA is not robust to gross outliers

(p128) Marginal likelihood as a regularizer