Hi, this is Brian (Seungmin) Lee. I'm documenting my learning notes in this blog.
The blog format was inspired by Lil'Log.
Derivations
- (2.5) Normal equations: see the derivation of (3.4).

Exercises
- 2.1 (I will write the prediction $\hat{y}$ as $y$.) $\arg\min_k \lVert t_k - y \rVert = \arg\min_k (t_k - y)^T (t_k - y)$ (minimizing the norm is the same as minimizing its square) $= \arg\min_k [{t_k}^T t_k - 2 y^T t_k + y^T y]$. ${t_k}^T t_k$ is 1 for every $k$ and $y^T y$ does not depend on $k$, so the problem is equivalent to $\arg\max_k [y^T t_k]$, which picks the position in $y$ with the largest element. ...
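A quick numerical check of this claim (a toy sketch of my own, using numpy):

```python
import numpy as np

# Exercise 2.1 check: with one-hot targets t_k, the target closest to a
# prediction vector y is the one indexed by y's largest element.
K = 4
rng = np.random.default_rng(0)
y = rng.random(K)            # a prediction vector (values need not sum to 1)
targets = np.eye(K)          # t_k = k-th standard basis vector
closest = np.argmin(np.linalg.norm(targets - y, axis=1))
assert closest == np.argmax(y)
```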
Derivations

Curse of dimensionality
- Cube example: consider a $p$-dimensional unit hypercube. The edge length $e$ of a sub-cube that captures a fraction $r$ of the volume is $e = r^{1/p}$, because $r / 1 = e^p$. For example, in 10 dimensions an edge length of 0.5 captures only about 0.001 of the volume.

Bias-variance decomposition of MSE
- See (2.25). I will write the true value $f(x_0)$ as $f$ and the prediction $\hat{y}$ as $y$. All expectations are over the training set $T$. ...
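A few lines to reproduce these numbers (my own toy script, not from the book):

```python
# Edge length needed to capture a fraction r of a unit hypercube's volume
# in p dimensions: e = r ** (1 / p).
for p in (1, 2, 10, 100):
    for r in (0.01, 0.1, 0.5):
        print(f"p={p:3d}  r={r:4.2f}  edge length={r ** (1 / p):.3f}")

# Conversely, in 10 dimensions an edge length of 0.5 covers
# 0.5**10 ≈ 0.001 of the volume, as in the example above.
print(0.5 ** 10)
```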
Summary

Linear regression model
- Imagine a regression problem with feature vector $X \in \mathbb{R}^{p+1}$ and response $Y \in \mathbb{R}$.
- The linear regression model:
  - Assumes that the regression function is a linear function of the inputs: $E(Y \lvert X) = X^T \beta$.
  - Predicts the output as a linear function of the parameters: $\hat{Y} = X^T \hat{\beta}$.

Least squares estimation method
- We estimate the model parameter $\beta$ from training data: $X$ is $N \times (p + 1)$ and $y$ is $N \times 1$.
- The least squares (OLS) estimation method finds the $\hat{\beta}$ that minimizes the residual sum of squares (RSS): $\lVert y - X\hat{\beta} \rVert^2$. ...
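A minimal numpy sketch of the OLS fit, assuming the design matrix already contains an intercept column (the function name `ols_fit` is mine):

```python
import numpy as np

def ols_fit(X, y):
    """Least-squares estimate: the beta_hat minimizing ||y - X beta||^2.

    X: (N, p+1) design matrix with an intercept column already included.
    np.linalg.lstsq is used instead of an explicit inverse for stability.
    """
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta_hat

# Toy usage: simulate y = 1 + 2*x + noise and recover the coefficients.
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 1))
X = np.hstack([np.ones((100, 1)), x])
y = X @ np.array([1.0, 2.0]) + 0.1 * rng.normal(size=100)
print(ols_fit(X, y))   # approximately [1.0, 2.0]
```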
Summary

Subset selection
- By retaining only a subset of the features, we can improve:
  - Prediction accuracy, by trading higher bias for lower variance.
  - Interpretability, by narrowing down the features with the strongest effects.

3.3.1 Best-subset selection
- For each subset size $k$ from $0$ to $p$, find the subset that gives the smallest RSS.
- Best subsets are not necessarily nested.
- Produces a sequence of models increasing in complexity: the RSS of the best subset decreases as $k$ increases, so RSS cannot be used to select $k$.
- Computationally infeasible if $p > 40$.

3.3.2 Forward and backward stepwise selection
- Sequentially add to (or remove from) the active set the single feature that yields the smallest RSS (a small sketch follows below). ...
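A compact sketch of forward stepwise selection on top of numpy's least-squares solver (the function names `rss` and `forward_stepwise` are mine; best-subset selection would instead enumerate every subset of each size, e.g. with `itertools.combinations`):

```python
import numpy as np

def rss(X, y, active):
    """RSS of the OLS fit using the columns in `active` (plus an intercept)."""
    Xa = np.hstack([np.ones((X.shape[0], 1)), X[:, sorted(active)]])
    beta, *_ = np.linalg.lstsq(Xa, y, rcond=None)
    r = y - Xa @ beta
    return r @ r

def forward_stepwise(X, y):
    """Greedily add the feature that most reduces the RSS; return the order added."""
    p = X.shape[1]
    active, order = set(), []
    while len(active) < p:
        best = min(set(range(p)) - active, key=lambda j: rss(X, y, active | {j}))
        active.add(best)
        order.append(best)
    return order
```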
Summary

Shrinkage Methods
- Because subset selection is a discrete process, the resulting model can have high variance. Shrinkage methods are more continuous.

3.4.1 Ridge Regression
- Minimizes the RSS plus a penalty $\lambda \beta^T \beta$, where $\lambda$ is the complexity parameter.
- With OLS, the coefficients of correlated features can have high variance: a large positive coefficient on one feature can be canceled by a large negative coefficient on a correlated feature.
- Feature normalization is needed:
  - Standardize features: otherwise, coefficients of smaller-scaled features are penalized more.
  - Center features: the intercept $\beta_0$ is not included in the penalty term. Estimate $\beta_0 = \overline{y}$ and estimate the other coefficients with centered features.
- The estimate is $\hat{\beta}_{RR} = (X^T X + \lambda I)^{-1} X^T y$, where $X^T X + \lambda I$ is invertible if $\lambda > 0$. ...
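A minimal numpy sketch of the ridge estimate with the centering and standardization described above (the function name `ridge_fit` is mine; the returned coefficients are on the standardized scale):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge estimates after centering/standardizing the features.

    Returns (beta0_hat, beta_hat), where beta0_hat = mean(y) and
    beta_hat = (Xc^T Xc + lam*I)^{-1} Xc^T yc on the standardized features.
    """
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)   # center and standardize features
    beta0 = y.mean()                            # intercept is not penalized
    yc = y - beta0
    p = X.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    return beta0, beta
```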
Summary
- Given many highly correlated features, use a few linear combinations of them.

3.5.1 Principal Components Regression
- Use $M \leq p$ principal components (see the sketch below).
- Since the principal components are orthogonal, $\hat{y}$ is a sum of univariate regressions.
- Normalize the features: the principal components depend on the scale of the features.
- Similar to ridge: instead of shrinking coefficients along directions with smaller singular values, drop them.

3.5.2 Partial Least Squares
- Each derived feature $z$ is a weighted sum of the inputs (orthogonalized with respect to the previous derived feature), with weights given by the strength of each input's univariate effect on $y$.
- For each feature: ...
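A minimal sketch of principal components regression via the SVD (the function name `pcr_fit` is mine; it returns fitted values on the training data):

```python
import numpy as np

def pcr_fit(X, y, M):
    """Principal components regression with the first M components.

    Features are centered and scaled first, since the principal components
    depend on the scale of the features.
    """
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    yc = y - y.mean()
    U, s, Vt = np.linalg.svd(Xs, full_matrices=False)
    Z = Xs @ Vt[:M].T                  # first M principal components (scores)
    # Components are orthogonal, so the fit is a sum of univariate regressions.
    theta = (Z.T @ yc) / (s[:M] ** 2)  # z_m^T z_m = s_m^2
    return y.mean() + Z @ theta
```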
Exercises
- 3.20
- 3.21
- 3.22
Summary
- A decision boundary divides the feature space into labeled regions. We focus on linear decision boundaries.
- Discriminant analysis models a discriminant function $\delta_k(x)$ for each class, then classifies to $\hat{G}(x) = \arg\max_k \hat{\delta}_k(x)$. The decision boundary between two classes is $\delta_1(x) = \delta_2(x)$.
- Example: fit $K$ linear regression models to class-indicator responses. This gives hyperplane decision boundaries.
- Modeling the posterior $P(G \lvert X = x)$ is a form of discriminant analysis.
- If a monotone transformation of the discriminant function or of the posterior is linear, then the decision boundary is linear (see the example below). ...
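To make the last point concrete, here is a two-class illustration (my own example, using the logit as the monotone transformation): if

$$ \log \frac{P(G = 1 \lvert X = x)}{P(G = 2 \lvert X = x)} = \beta_0 + \beta^T x, $$

then the two posteriors are equal exactly where the log-odds is zero, so the decision boundary is the hyperplane $\{x : \beta_0 + \beta^T x = 0\}$.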
Summary
- Fit $K$ linear regression models as discriminant functions using the indicator response matrix.
- Interpret each regression as an estimate of the conditional expectation $E(Y_k \lvert X = x)$, which equals $P(G = k \lvert X = x)$.
- Is linear regression a good estimate of the posterior? The estimates sum to 1, but individual estimates can be negative or greater than 1.
- Masking problem with $K \geq 3$: even though the classes are separable, the model can mask a class because its fitted regression value is never dominant. This occurs when $K$ is large but $p$ is small. ...
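A minimal numpy sketch of classification by indicator-matrix regression (the function names are mine):

```python
import numpy as np

def indicator_regression_fit(X, g, K):
    """Fit one linear regression per class to the indicator response matrix.

    X: (N, p) feature matrix; g: (N,) integer class labels in {0, ..., K-1}.
    Returns the (p+1, K) coefficient matrix B (intercept in the first row).
    """
    N = X.shape[0]
    Y = np.zeros((N, K))
    Y[np.arange(N), g] = 1.0              # indicator response matrix
    X1 = np.hstack([np.ones((N, 1)), X])  # prepend intercept column
    B, *_ = np.linalg.lstsq(X1, Y, rcond=None)  # one least-squares fit per class
    return B

def indicator_regression_predict(B, X):
    """Classify each row of X to the class with the largest fitted value."""
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.argmax(X1 @ B, axis=1)
```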
Summary

LDA and QDA
- Recall the Bayes classifier: $\hat{G}(x) = \arg\max_{k} P(G = k \lvert X = x) = \arg\max_{k} P(X = x \lvert G = k) P(G = k)$.
- Both LDA and QDA model each class-conditional density as a Gaussian.
- LDA arises when the class-conditional Gaussians share a common covariance matrix.
  - The pairwise decision boundary is linear (a hyperplane) because the log-ratio of the posteriors is linear in $x$.
  - The linear discriminant functions yield the same decision boundary (a sketch follows below).
  - Estimate the prior, the class means, and the common covariance from the training data.
- LDA and linear regression: ...
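A minimal numpy sketch of plug-in LDA, using the pooled covariance and the linear discriminant functions $\delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \log \pi_k$ (the function names are mine):

```python
import numpy as np

def lda_fit(X, g, K):
    """Plug-in LDA estimates: class priors, class means, pooled covariance."""
    N, p = X.shape
    priors = np.array([(g == k).mean() for k in range(K)])
    means = np.array([X[g == k].mean(axis=0) for k in range(K)])
    # Pooled within-class covariance, divided by N - K.
    Sigma = sum((X[g == k] - means[k]).T @ (X[g == k] - means[k])
                for k in range(K)) / (N - K)
    return priors, means, Sigma

def lda_predict(X, priors, means, Sigma):
    """Classify to the argmax of the linear discriminant functions."""
    Sinv = np.linalg.inv(Sigma)
    # deltas has shape (N, K): one discriminant value per sample and class.
    deltas = (X @ Sinv @ means.T
              - 0.5 * np.einsum('kp,pq,kq->k', means, Sinv, means)
              + np.log(priors))
    return np.argmax(deltas, axis=1)
```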