Welcome to Brian’s Blog

Hi, this is Brian (Seungmin) Lee. I'm documenting my learning notes in this blog.

The blog format is inspired by Lil'Log.

ESL 2.3: Least squares and Nearest Neighbors

Derivations. (2.5) Normal equations: see the derivation of (3.4). Exercises. 2.1 (I will write the prediction $\hat{y}$ as $y$.) Since the norm is nonnegative, $\arg\min_k \lVert t_k - y \rVert = \arg\min_k \lVert t_k - y \rVert^2 = \arg\min_k (t_k - y)^T (t_k - y) = \arg\min_k [{t_k}^T t_k - 2 y^T t_k + y^T y]$. ${t_k}^T t_k$ is 1 for every $k$ and $y^T y$ does not depend on $k$, so the problem is equivalent to $\arg\max_k [y^T t_k]$, which picks the position of the largest element of $y$. ...
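As a quick sanity check, here is a minimal NumPy sketch (not from the notes; the random example and variable names are illustrative) verifying that the one-hot target closest to $y$ sits at the position of the largest element of $y$:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 5
y = rng.normal(size=K)              # a prediction vector over K classes
targets = np.eye(K)                 # t_k: one-hot class-indicator targets

# Closest target by Euclidean distance ...
k_by_distance = np.argmin([np.linalg.norm(t - y) for t in targets])
# ... equals the position of the largest element of y.
k_by_argmax = np.argmax(y)

assert k_by_distance == k_by_argmax
```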

December 31, 2020

ESL 2.5: Local Methods in High Dimensions

Derivations. Curse of dimensionality, cube example: consider a $p$-dimensional unit hypercube. The edge length $e$ of a sub-cube that captures a fraction $r$ of the volume is $e = r^{1/p}$, because $r / 1 = e^p$. For example, in 10 dimensions an edge length of 0.5 captures only about 0.001 of the volume. Bias-variance decomposition of MSE: see (2.25). I will write the true value $f(x_0)$ as $f$ and the prediction $\hat{y}$ as $y$. All expectations are over the training set $T$. ...
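A tiny Python sketch of the cube example (illustrative only; the chosen values of $p$ and $r$ are my own) showing how quickly the required edge length approaches 1:

```python
# Edge length of a sub-cube capturing fraction r of a unit p-cube's volume: e = r ** (1 / p).
for p in (1, 2, 3, 10):
    for r in (0.01, 0.1, 0.5):
        print(f"p={p:2d}, r={r:4.2f}: edge length e = {r ** (1 / p):.3f}")

# Conversely, in p = 10 dimensions an edge of 0.5 covers only 0.5**10 ~ 0.001 of the volume.
print(0.5 ** 10)
```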

December 31, 2020

ESL 3.2: Linear Regression Models and Least Squares

Summary. Linear regression model: consider a regression problem with feature vector $X \in \mathbb{R}^{p+1}$ and response $Y \in \mathbb{R}$. The linear regression model assumes that the regression function is linear in the inputs, $E(Y \lvert X) = X^T \beta$, and predicts the output as a linear function of the parameters, $\hat{Y} = X^T \hat{\beta}$. Least squares estimation: we estimate the model parameter $\beta$ from training data, where $X$ is $N \times (p + 1)$ and $y$ is $N \times 1$. The least squares (OLS) method finds the $\hat{\beta}$ that minimizes the residual sum of squares (RSS), $\lVert y - X\hat{\beta} \rVert^2$. ...
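A minimal NumPy sketch of OLS via the normal equations (not from the post; the simulated data and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, p))])   # N x (p+1), first column is the intercept
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=N)

# OLS estimate from the normal equations: beta_hat = (X^T X)^{-1} X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
rss = np.sum((y - X @ beta_hat) ** 2)                        # residual sum of squares
print(beta_hat, rss)
```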

December 31, 2020

ESL 3.3: Subset Selection

Summary. Subset selection: by retaining only a subset of the features, we can improve prediction accuracy (trading higher bias for lower variance) and interpretability (narrowing down to the features with the strongest effects). 3.3.1 Best-subset selection: for each subset size $k$ from $0$ to $p$, find the subset that gives the smallest RSS. Best subsets are not necessarily nested. This produces a sequence of models increasing in complexity; the RSS of the best subset decreases as $k$ increases, so RSS cannot be used to select $k$. Computationally infeasible if $p > 40$. 3.3.2 Forward and backward stepwise selection: sequentially add to (or remove from) the active set the feature whose addition (or removal) gives the smallest resulting RSS. ...
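A rough sketch of the forward-stepwise idea in NumPy (my own simplified version, greedy on RSS only; the simulated data and function name are illustrative):

```python
import numpy as np

def forward_stepwise(X, y, k):
    """Greedy forward selection: at each step, add the feature that most reduces RSS."""
    active, remaining = [], list(range(X.shape[1]))
    for _ in range(k):
        rss_if_added = {}
        for j in remaining:
            cols = X[:, active + [j]]
            beta, *_ = np.linalg.lstsq(cols, y, rcond=None)
            rss_if_added[j] = np.sum((y - cols @ beta) ** 2)
        best = min(rss_if_added, key=rss_if_added.get)
        active.append(best)
        remaining.remove(best)
    return active

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = 3 * X[:, 2] - 2 * X[:, 5] + 0.1 * rng.normal(size=100)
print(forward_stepwise(X, y, k=2))   # expected to recover features 2 and 5
```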

December 31, 2020

ESL 3.4: Shrinkage Methods

Summary. Shrinkage methods: because subset selection is a discrete process, the resulting model can have high variance; shrinkage methods are more continuous. 3.4.1 Ridge Regression: minimizes the RSS plus a penalty $\lambda \beta^T \beta$, where $\lambda$ is a complexity parameter. With OLS, coefficients of correlated features can have high variance: a large positive coefficient on one feature can be offset by a large negative coefficient on a correlated one. Feature normalization is needed. Standardize features: otherwise, coefficients of smaller-scaled features are penalized more. Center features: the intercept $\beta_0$ is not included in the penalty term; estimate $\beta_0 = \overline{y}$ and estimate the other coefficients with centered features. The estimate is $\hat{\beta}_{RR} = (X^T X + \lambda I)^{-1} X^T y$, where $X^T X + \lambda I$ is invertible if $\lambda > 0$. ...
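A minimal NumPy sketch of the ridge estimate with standardized, centered features (not from the post; the simulated data and names are illustrative):

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge estimate on standardized, centered features; intercept estimated separately as mean(y)."""
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)       # standardize: equal penalty per feature
    beta0 = y.mean()                                # intercept is not penalized
    p = Xc.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ (y - beta0))
    return beta0, beta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([2.0, 0.0, -1.0, 0.5, 0.0]) + 0.1 * rng.normal(size=100)
print(ridge(X, y, lam=10.0))
```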

December 31, 2020

ESL 3.5: Methods using Derived Input Directions

Summary. Given many highly correlated features, use a few linear combinations of them as inputs. 3.5.1 Principal Components Regression: use $M \leq p$ principal components. Since the principal components are orthogonal, $\hat{y}$ is a sum of univariate regressions. Normalize features: the principal components depend on the scale of the features. Similar to ridge regression, except that instead of shrinking the directions with smaller singular values, PCR drops them. 3.5.2 Partial Least Squares: each derived feature $z$ is a weighted sum of the inputs (orthogonalized w.r.t. the previous derived features), weighted by the strength of their univariate effect on $y$. For each feature: ...
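A rough NumPy sketch of principal components regression (my own simplified version via the SVD; the simulated data and names are illustrative):

```python
import numpy as np

def pcr(X, y, M):
    """Principal components regression: regress y on the first M principal components of standardized X."""
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt[:M].T                               # first M principal-component scores (orthogonal)
    # Orthogonality makes the fit a sum of univariate regressions: theta_m = <z_m, y> / <z_m, z_m>.
    theta = (Z.T @ y) / np.sum(Z ** 2, axis=0)
    beta = Vt[:M].T @ theta                         # coefficients on the standardized features
    return y.mean() + Z @ theta, beta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = X @ np.array([1.0, 1.0, 0.0, 0.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
y_hat, beta = pcr(X, y, M=3)
print(beta)
```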

December 31, 2020

ESL 3.7: Multiple Outcome Shrinkage and Selection

Exercises 3.20, 3.21, 3.22

December 31, 2020

ESL 4.1: Introduction

Summary. A decision boundary divides the feature space into labeled regions; we focus on linear decision boundaries. Discriminant analysis models a discriminant function $\delta_k(x)$ for each class, then classifies to $\hat{G}(x) = \arg\max_k \hat{\delta}_k(x)$. The decision boundary between classes 1 and 2 is the set where $\delta_1(x) = \delta_2(x)$. Example: fit $K$ linear regression models to class-indicator responses, which gives hyperplane decision boundaries. Modeling the posterior $P(G \lvert X = x)$ is a form of discriminant analysis. If some monotone transformation of the discriminant function or posterior is linear, then the decision boundary is linear. ...
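A tiny NumPy sketch (illustrative only; the discriminant coefficients are made up) of classifying by $\arg\max_k \delta_k(x)$ with linear discriminant functions, whose pairwise boundary is a hyperplane:

```python
import numpy as np

# Two hypothetical linear discriminant functions delta_k(x) = b_k + w_k . x (illustrative values).
w1, b1 = np.array([1.0, 2.0]), 0.5
w2, b2 = np.array([-1.0, 1.0]), 1.0

def classify(x):
    """Classify to argmax_k delta_k(x)."""
    scores = np.array([b1 + w1 @ x, b2 + w2 @ x])
    return int(np.argmax(scores))

# The boundary {x : delta_1(x) = delta_2(x)} is the hyperplane (w1 - w2) . x + (b1 - b2) = 0.
x = np.array([0.3, -0.4])
print(classify(x), (w1 - w2) @ x + (b1 - b2))
```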

December 31, 2020

ESL 4.2: Linear Regression of an Indicator Matrix

Summary. Fit $K$ linear regression models as discriminant functions using an indicator response matrix. Interpret each regression as an estimate of the conditional expectation $E(Y_k \lvert X = x)$, which equals $P(G = k \lvert X = x)$. Is linear regression a good estimate of the posterior? The estimates sum to 1, but can be negative or greater than 1. Masking problem with $K \geq 3$: even when the classes are separable, the model can mask a class because its fitted regression value is never the largest. This occurs when $K$ is large but $p$ is small. ...
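A rough NumPy sketch of indicator-matrix regression (my own simplified version; the simulated three-class data and function names are illustrative). With three roughly collinear classes the middle class tends to be masked, so accuracy stays well below 1:

```python
import numpy as np

def indicator_regression_fit(X, g, K):
    """Fit a linear regression to each column of the indicator response matrix Y (N x K)."""
    N = X.shape[0]
    Y = np.zeros((N, K))
    Y[np.arange(N), g] = 1.0                              # indicator responses
    X1 = np.hstack([np.ones((N, 1)), X])                  # add intercept
    B, *_ = np.linalg.lstsq(X1, Y, rcond=None)            # (p+1) x K coefficient matrix
    return B

def indicator_regression_predict(B, X):
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])
    Yhat = X1 @ B                                         # rows sum to 1 but may leave [0, 1]
    return np.argmax(Yhat, axis=1)                        # classify to the largest fitted value

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=m, size=(50, 2)) for m in (-2.0, 0.0, 2.0)])
g = np.repeat([0, 1, 2], 50)
B = indicator_regression_fit(X, g, K=3)
print((indicator_regression_predict(B, X) == g).mean())   # the middle class may be masked
```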

December 31, 2020

ESL 4.3: Linear Discriminant Analysis

Summary. LDA and QDA. Recall the Bayes classifier: $\hat{G}(x) = \arg\max_{k} P(G = k \lvert X = x) = \arg\max_{k} P(X = x \lvert G = k) P(G = k)$. Both LDA and QDA model each class-conditional density as Gaussian. LDA arises when the classes share a common covariance matrix. Each pairwise decision boundary is linear (a hyperplane) because the log posterior ratio is linear in $x$. The linear discriminant functions yield the same decision boundary. Estimate the class priors, class means, and common covariance from the training data. LDA and linear regression: ...
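A rough NumPy sketch of LDA with plug-in estimates of the priors, means, and pooled covariance (my own simplified version; the simulated data and names are illustrative):

```python
import numpy as np

def lda_fit(X, g, K):
    """Estimate class priors, class means, and the pooled (common) covariance."""
    N, p = X.shape
    priors = np.array([(g == k).mean() for k in range(K)])
    means = np.array([X[g == k].mean(axis=0) for k in range(K)])
    cov = sum((X[g == k] - means[k]).T @ (X[g == k] - means[k]) for k in range(K)) / (N - K)
    return priors, means, cov

def lda_predict(X, priors, means, cov):
    """Classify to argmax_k of delta_k(x) = x^T S^{-1} mu_k - (1/2) mu_k^T S^{-1} mu_k + log pi_k."""
    Sinv = np.linalg.inv(cov)
    scores = X @ Sinv @ means.T - 0.5 * np.sum(means @ Sinv * means, axis=1) + np.log(priors)
    return np.argmax(scores, axis=1)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=m, size=(50, 2)) for m in ([0, 0], [2, 0], [0, 2])])
g = np.repeat([0, 1, 2], 50)
priors, means, cov = lda_fit(X, g, K=3)
print((lda_predict(X, priors, means, cov) == g).mean())
```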

December 31, 2020