Summary

Linear regression model

Imagine a regression problem with feature vector $ X \in \mathbb{R}^{p+1} $ and response $ Y \in \mathbb{R} $.

The linear regression model:

  1. Assumes that the regression function is a linear function of inputs: $E(Y \lvert X) = X^T \beta$.
  2. Predicts output as a linear function of the parameters: $ \hat{Y} = X^T \hat{\beta}$.

Least squares estimation method

We estimate the model parameter $\beta$ from training data: $X$ is $N \times (p + 1)$ and $y$ is $N \times 1$. The least squares (OLS) estimation method finds the $\hat{\beta}$ that minimizes the residual sum of squares (RSS) $\lVert y - X\beta \rVert^2$.

The OLS estimate has the unique closed-form solution $\hat{\beta} = (X^T X)^{-1} X^T y$ if $X$ has full column rank. Otherwise, we can drop redundant features until it does.

Since $X^T (y - X\hat{\beta}) = 0$, the residual vector is orthogonal to the column space of $X$. Hence $\hat{y} = X \hat{\beta} = Hy$ is the orthogonal projection of $y$ onto the column space of $X$, where $H = X(X^TX)^{-1}X^T$ is the projection (hat) matrix.
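
As a sanity check, here is a minimal numpy sketch (with made-up data; all names are illustrative) of the normal-equation solution, the hat matrix $H = X(X^TX)^{-1}X^T$, and the orthogonality of the residual to the column space of $X$:

```python
# Minimal sketch of OLS via the normal equations and the projection view:
# beta_hat = (X^T X)^{-1} X^T y, y_hat = H y, residual orthogonal to col(X).
import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])  # N x (p+1), with intercept
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=N)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)       # normal equations
H = X @ np.linalg.solve(X.T @ X, X.T)              # hat / projection matrix
y_hat = X @ beta_hat

assert np.allclose(y_hat, H @ y)                   # y_hat is the projection of y
assert np.allclose(X.T @ (y - y_hat), 0)           # residual orthogonal to col(X)
assert np.allclose(H @ H, H)                       # H is idempotent
```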

Statistical properties of the OLS estimate

Assume the following about the true distribution of the data:

  1. Features $x_i$ are fixed, not random variables.

  2. Responses $y_i$ are uncorrelated and have constant variance $\sigma^2$ : $Var(y) = \sigma^2 I$.

    Then, $Var(\hat{\beta}) = \sigma^2 (X^T X)^{-1}$. Additionally, assume:

  3. Again, regression function is a linear function of inputs: $E(Y \lvert X) = X^T \beta$.

  4. $ Y = E(Y \lvert X) + \varepsilon $, where $ \varepsilon \sim N(0, \sigma^2) $.

    Then, $ \hat{\beta} \sim N(\beta, \sigma^2 (X^T X)^{-1}) $. $\hat{\beta}$ is unbiased.
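
A Monte Carlo sketch of these properties under the assumed model $y = X\beta + \varepsilon$, with a fixed, made-up design: the sampling mean of $\hat{\beta}$ should approach $\beta$ (unbiasedness) and its sampling covariance should approach $\sigma^2 (X^TX)^{-1}$.

```python
# Simulate many datasets with the same fixed X and compare the empirical
# mean/covariance of beta_hat against beta and sigma^2 (X^T X)^{-1}.
import numpy as np

rng = np.random.default_rng(1)
N, p, sigma = 40, 2, 0.5
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])  # fixed design
beta = np.array([1.0, -2.0, 3.0])

betas = []
for _ in range(20000):
    y = X @ beta + rng.normal(scale=sigma, size=N)
    betas.append(np.linalg.solve(X.T @ X, X.T @ y))
betas = np.array(betas)

print(betas.mean(axis=0))                          # ~ beta
print(np.cov(betas, rowvar=False))                 # ~ sigma^2 (X^T X)^{-1}
print(sigma**2 * np.linalg.inv(X.T @ X))
```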

Estimate variance of $y_i$

We further estimate $\sigma^2$ by $ \hat{\sigma}^2 = \frac{\lVert y - \hat{y} \rVert^2}{N - (p + 1)}$, so that $ \hat{\sigma}^2 $ is unbiased.

Then, $(N - p - 1) \hat{\sigma}^2 \sim \sigma^2 \chi^2_{N - p - 1}$.
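
A small simulation sketch (hypothetical design) of both claims: with the divisor $N - p - 1$, $\hat{\sigma}^2$ is unbiased for $\sigma^2$, and $(N - p - 1)\hat{\sigma}^2 / \sigma^2$ has mean $N - p - 1$ and variance $2(N - p - 1)$, matching a $\chi^2_{N-p-1}$ distribution.

```python
# Check E[sigma_hat^2] ~ sigma^2 and the first two moments of the scaled
# residual sum of squares against chi-squared with N - p - 1 dof.
import numpy as np

rng = np.random.default_rng(2)
N, p, sigma = 30, 4, 1.5
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
beta = rng.normal(size=p + 1)

stats = []
for _ in range(20000):
    y = X @ beta + rng.normal(scale=sigma, size=N)
    y_hat = X @ np.linalg.solve(X.T @ X, X.T @ y)
    stats.append(np.sum((y - y_hat) ** 2) / (N - p - 1))
stats = np.array(stats)

print(stats.mean(), sigma**2)                      # unbiased: ~ sigma^2
scaled = (N - p - 1) * stats / sigma**2
print(scaled.mean(), scaled.var(), N - p - 1, 2 * (N - p - 1))
```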

Hypothesis testing and confidence intervals with OLS estimate

3.2.2 Gauss-Markov Theorem

Given assumptions 1 to 3 above, the OLS estimate has the smallest variance (and hence the smallest mean squared error and expected prediction error) among all linear unbiased estimates.

3.2.3 From univariate to multiple regression

If input features are orthogonal, each multiple linear regression OLS estimate $\hat{\beta}_i$ is equal to the univariate estimate of regressing $y$ on $x_i$.

However, input features are usually not orthogonal, so we orthogonalize them with the Gram-Schmidt procedure. Then $\hat{\beta}_p$ equals the univariate estimate from regressing $y$ on $z_p$, the residual after regressing $x_p$ on all the other features. Also, $Var(\hat{\beta}_p) = \sigma^2 / \lVert z_p \rVert^2$, so $\hat{\beta}_p$ is unstable if $\lVert z_p \rVert$ is close to 0.
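
The claim about $\hat{\beta}_p$ can be checked numerically. The sketch below (made-up data) regresses $x_p$ on the other columns, forms the residual $z_p$, and compares the univariate coefficient with the last entry of the full OLS fit:

```python
# The last multiple-regression coefficient equals the univariate regression
# of y on z_p, the residual of x_p regressed on all other columns.
import numpy as np

rng = np.random.default_rng(3)
N, p = 60, 4
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
y = rng.normal(size=N)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]

X_other, x_p = X[:, :-1], X[:, -1]
x_p_hat = X_other @ np.linalg.lstsq(X_other, x_p, rcond=None)[0]
z_p = x_p - x_p_hat                                 # residual of x_p on the rest

beta_p_univariate = (z_p @ y) / (z_p @ z_p)
assert np.allclose(beta_hat[-1], beta_p_univariate)
```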

In matrix form, we obtain the QR decomposition $X = Z D^{-1} D \Gamma = QR$, where $Q$ is an $N \times (p+1)$ matrix with orthonormal columns and $R$ is an upper triangular (invertible) matrix. Then the entire OLS estimate is $\hat{\beta} = R^{-1}Q^T y$ and $\hat{y} = QQ^T y$.

3.2.4 Multiple outputs

If the errors have the same variance across each output’s observations and are uncorrelated across the multiple outputs, the multiple outputs do not affect each other’s OLS estimates.

If the errors are correlated across the multiple outputs, we use the multivariate weighted RSS criterion (3.40), which yields the same estimate. If the error covariances $\Sigma_i$ vary across observations, the multiple outputs no longer decouple.
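
A quick numpy sketch (synthetic data) of the decoupling claim: the multivariate estimate $\hat{B} = (X^TX)^{-1}X^TY$ is just the per-output OLS estimates stacked column by column.

```python
# Fitting all outputs at once with the plain RSS criterion matches fitting
# each output separately.
import numpy as np

rng = np.random.default_rng(4)
N, p, K = 50, 3, 2                                  # K outputs
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
Y = rng.normal(size=(N, K))

B_joint = np.linalg.solve(X.T @ X, X.T @ Y)         # all outputs at once
B_separate = np.column_stack(
    [np.linalg.solve(X.T @ X, X.T @ Y[:, k]) for k in range(K)]
)
assert np.allclose(B_joint, B_separate)
```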

Derivations

(3.2) RSS criterion is “statistically reasonable” if $y_i$’s are conditionally independent given X

(3.4) Differentiate RSS$(\beta)$

RSS$(\beta)$ $ = y^T y - 2 \beta^T X^T y + \beta^T X^TX \beta$. Note $\frac{\partial}{\partial \beta} \beta^T X^T y = X^Ty$. Show $ \frac{\partial}{\partial \beta} \beta^T X^T X \beta = 2 X^T X\beta$.

Let $ X^T X = A$. Then $\beta^T A \beta = \sum_i \beta_i (A\beta)_{i} = \sum_i \beta_i \sum_j A_{i,j} \beta_j$. Reorder to $\sum_i \sum_j A_{i,j} \beta_i \beta_j$. Then:

$$ \begin{align*} \frac{d}{d \beta_k} \sum_i \sum_j A_{i,j} \beta_i \beta_j &= \frac{d}{d \beta_k} \Big[ A_{k,k}{\beta_k}^2 + \sum_{i \neq k} A_{i, k} \beta_i \beta_k + \sum_{j \neq k} A_{k, j} \beta_k \beta_j \Big] \\ &= 2 A_{k,k} {\beta_k} + \sum_{i \neq k} A_{i, k} \beta_i + \sum_{j \neq k} A_{k, j} \beta_j \\ &= \sum_{i} A_{i, k} \beta_i + \sum_{j} A_{k, j} \beta_j \\ &= (A^T \beta)_{k} + (A \beta)_{k} \\ &= ((A + A^T)\beta)_{k} \end{align*} $$

Therefore, $\frac{\partial}{\partial \beta} \beta^T A \beta = (A^T + A)\beta$, equivalent to $2 X^T X\beta$ since $X^T X$ is symmetric.

From the two derivatives, we obtain $ \frac{\partial}{\partial \beta} RSS = - 2 X^T (y - X \beta)$, which we set to zero to obtain the normal equations.
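
A numerical sketch of the gradient formula, comparing $-2X^T(y - X\beta)$ against a central finite-difference approximation of $\partial RSS / \partial \beta$ on made-up data:

```python
# Compare the closed-form RSS gradient with a finite-difference estimate.
import numpy as np

rng = np.random.default_rng(5)
N, p = 30, 3
X = rng.normal(size=(N, p + 1))
y = rng.normal(size=N)
beta = rng.normal(size=p + 1)

rss = lambda b: np.sum((y - X @ b) ** 2)
grad_formula = -2 * X.T @ (y - X @ beta)

eps = 1e-6
grad_numeric = np.array([
    (rss(beta + eps * e) - rss(beta - eps * e)) / (2 * eps)
    for e in np.eye(p + 1)
])
assert np.allclose(grad_formula, grad_numeric, atol=1e-4)
```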


(3.8) Covariance matrix of OLS estimate

Note $Var(\hat{\beta}) = E[(\hat{\beta} - E[\hat{\beta}])(\hat{\beta} - E[\hat{\beta}])^T] = E[\hat{\beta}\hat{\beta}^T] - E[\hat{\beta}]E[\hat{\beta}]^T$. Similarly, $E[yy^T] = Var(y) + E[y]E[y]^T$.

$$ \begin{align*} Var(\hat{\beta}) &= E[\hat{\beta}\hat{\beta}^T] - E[\hat{\beta}]E[\hat{\beta}]^T \\ &= E[(X^TX)^{-1}X^Tyy^TX(X^TX)^{-1}] - E[(X^TX)^{-1}X^Ty]E[(X^TX)^{-1}X^Ty]^T \\ &= (X^TX)^{-1}X^TE[yy^T]X(X^TX)^{-1} - (X^TX)^{-1}X^TE[y]E[y]^TX(X^TX)^{-1} \\ &= (X^TX)^{-1}X^T(\sigma^2I + E[y]E[y]^T)X(X^TX)^{-1} - (X^TX)^{-1}X^TE[y]E[y]^TX(X^TX)^{-1} \\ &= (\sigma^2(X^TX)^{-1} + (X^TX)^{-1}X^TE[y]E[y]^TX(X^TX)^{-1}) - (X^TX)^{-1}X^TE[y]E[y]^TX(X^TX)^{-1} \\ &= \sigma^2(X^TX)^{-1} \end{align*} $$


(3.8) Estimate $\hat{\sigma}^2$ of $y_i$’s variance using denominator $N-p-1$ is unbiased

See this StackExchange answer. Note that this answer assumes that the variance $\sigma^2$ of $y_i$ comes from $\varepsilon \sim N(0, \sigma^2)$.

TODO: Is it possible to derive without the assumption above?

(3.10) OLS estimate has normal distribution under (3.9)

Let $e$ be the N x 1 vector of errors $\varepsilon$ from (3.9). Assuming errors are i.i.d., $e \sim N(0, \sigma^2 I)$.

Note $\hat{\beta} = (X^TX)^{-1}X^Ty = (X^TX)^{-1}X^T(X\beta + e) = \beta + (X^TX)^{-1}X^Te$.

If $x \sim N(\mu, \Sigma)$ and $A$ is a fixed matrix, then $c + Ax \sim N(c + A\mu, A\Sigma A^T)$. So, the covariance matrix of $\beta + (X^TX)^{-1}X^Te$ is $(X^TX)^{-1}X^T \sigma^2 I X (X^TX)^{-1} = \sigma^2 (X^TX)^{-1}$, which confirms our derivation from (3.8).

Hence, $\hat{\beta} \sim N(\beta, \sigma^2 (X^TX)^{-1})$.

(3.11) $(N-p-1)\hat{\sigma}^2$ has chi-squared distribution

See this StackExchange answer. The derivation below slightly modifies that answer by using (imo) a bit easier linear algebra.

From (3.8), $(N - p - 1)\hat{\sigma}^2 = \sum_i^N (y_i - \hat{y_i})^2 = \lVert y - \hat{y} \rVert^2$.

$y - \hat{y} = (X\beta + \varepsilon) - X(X^TX)^{-1}X^T(X\beta + \varepsilon) = \varepsilon - X(X^TX)^{-1}X^T\varepsilon = (I - H)\varepsilon$, where H is projection/hat matrix.

Let $A = I - H$, which satisfies:

  1. $A^T = A$. A real symmetric matrix has $N$ real eigenvalues and orthonormal eigenvectors.
  2. $A^2 = A$. The $N$ eigenvalues are $0$ or $1$ since $\lambda v = Av = A^2v = \lambda^2 v$.
  3. $tr(I - H) = N - tr(X(X^TX)^{-1}X^T) = N - tr(X^TX(X^TX)^{-1}) = N - (p + 1)$. Since the trace equals the sum of eigenvalues, $N - p - 1$ eigenvalues are 1.

So, we can eigen-decompose $A = Q \Lambda Q^{-1}$ with $Q$ orthogonal ($Q^{-1} = Q^T$), where $\Lambda = diag(\underbrace{1,…,1}_{N-p-1}, \underbrace{0,…,0}_{p+1})$.

From $\varepsilon \sim N(0, \sigma^2 I)$ we obtain $A\varepsilon \sim N(0, \sigma^2A)$ because $A\sigma^2 I A^T = \sigma^2A$. Moreover, $Q^TA\varepsilon \sim N(0, \sigma^2 \Lambda)$ because $Q^TAQ = Q^T(Q \Lambda Q^{-1})Q = \Lambda$.

That is, among the $N$ independent random variables in $Q^TA\varepsilon$, the first $N - p - 1$ have variance $\sigma^2$ and the others are fixed at 0 (both mean and variance are 0). Then, $(Q^TA\varepsilon)_i / \sigma$ are independent standard normals for all $i$ up to $N-p-1$. So, $\chi_{N-p-1}^2 = \sum_{i=1}^{N-p-1} (Q^TA\varepsilon)_i^2 / \sigma^2$, but we can extend the sum up to $N$ since the remaining $p+1$ entries are fixed at 0.

Therefore, $\sigma^2 \chi_{N-p-1}^2 = (Q^TA\varepsilon)^T (Q^TA\varepsilon) = \varepsilon^TA^TA\varepsilon = \lVert (I - H)\varepsilon \rVert^2 = \lVert y - \hat{y} \rVert^2 = (N - p - 1)\hat{\sigma}^2$.
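
The three properties of $A = I - H$ used above can also be checked numerically on a random (hypothetical) design matrix:

```python
# A = I - H is symmetric, idempotent, has trace N - p - 1, and therefore
# N - p - 1 eigenvalues equal to 1 and p + 1 eigenvalues equal to 0.
import numpy as np

rng = np.random.default_rng(6)
N, p = 25, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
H = X @ np.linalg.solve(X.T @ X, X.T)
A = np.eye(N) - H

assert np.allclose(A, A.T)                           # symmetric
assert np.allclose(A @ A, A)                         # idempotent
assert np.isclose(np.trace(A), N - p - 1)            # trace = N - p - 1

eigvals = np.linalg.eigvalsh(A)                      # real eigenvalues, ascending
assert np.allclose(eigvals, np.concatenate([np.zeros(p + 1), np.ones(N - p - 1)]))
```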

Alternatively, we can use the following theorems:

  • If $A$ is a projection matrix with rank $r$ and $z \sim N(0, I)$, then the quadratic form $z^TAz$ is distributed as $\chi_{r}^2$. See Davidson and Mackinnon, Section 4.3.
  • For a projection matrix, rank is equal to trace.

References:

  • StackExchange answer by ocram.
  • Strang. Introduction to Linear Algebra. Section 6.4.
  • Davidson and Mackinnon. Econometric Theory and Methods. Section 4.3.

(3.11) $\hat{\beta}$ and $\hat{\sigma}^2$ are independent

(3.12) Hypothesis testing of one parameter coefficient

$z_j \sim t_{N-p-1}$ if $z_j$ can be expressed as $\frac{a}{\sqrt{b / (N-p-1)}}$ where a and b are independent, $a \sim N(0,1)$, and $b \sim \chi_{N-p-1}^2$.

Note that under the null hypothesis $\beta_j = 0$, $\frac{\hat{\beta}_j}{\sigma \sqrt{v_j}} \sim N(0,1)$ because $\hat{\beta}_j \sim N(0, \sigma^2 v_j)$ from (3.10), where $v_j$ is the $j$-th diagonal element of $(X^TX)^{-1}$. Also, $(N-p-1)\hat{\sigma}^2 / \sigma^2 \sim \chi_{N-p-1}^2$ from (3.11).

Therefore, $z_j = \frac{\hat{\beta}_j}{\hat{\sigma}\sqrt{v_j}} = \frac{\hat{\beta}_j / (\sigma \sqrt{v_j})}{\hat{\sigma} / \sigma} \sim t_{N-p-1}$, since the denominator equals $\sqrt{b / (N-p-1)}$ with $b = (N-p-1)\hat{\sigma}^2 / \sigma^2 \sim \chi_{N-p-1}^2$, and the numerator and denominator are independent by (3.11).
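
A sketch of the resulting test on made-up data, using scipy.stats for the $t_{N-p-1}$ tail probability (the design, the tested index $j$, and the chosen coefficients are arbitrary):

```python
# z_j = beta_hat_j / (sigma_hat * sqrt(v_j)), with v_j the j-th diagonal of
# (X^T X)^{-1}, compared against t with N - p - 1 degrees of freedom.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
N, p, sigma = 40, 3, 1.0
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
beta = np.array([1.0, 0.0, 2.0, -1.0])              # beta_1 = 0: null is true for j = 1
y = X @ beta + rng.normal(scale=sigma, size=N)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
y_hat = X @ beta_hat
sigma_hat = np.sqrt(np.sum((y - y_hat) ** 2) / (N - p - 1))

j = 1
z_j = beta_hat[j] / (sigma_hat * np.sqrt(XtX_inv[j, j]))
p_value = 2 * stats.t.sf(abs(z_j), df=N - p - 1)     # two-sided p-value
print(z_j, p_value)
```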

Alternatively, Section 4.4 in Davidson and Mackinnon derives:

  1. The $z_j$ in (3.12), by using the FWL theorem to extract the estimate and variance of a subset of the coefficients in $\hat{\beta}$.
  2. If we know the true $\sigma$, then $z_j \sim N(0,1)$ under $\beta_j = 0$.
  3. If we estimate $\sigma$ using $\hat{\sigma}$, then $z_j \sim t_{N-p-1}$ under $\beta_j = 0$.

References:

  • Davidson and Mackinnon. Econometric Theory and Methods.

(3.13) Hypothesis testing of a group of parameters

Also see Section 4.4 in Davidson and Mackinnon.

(3.13) F statistic for a single coefficient is equivalent to square of the Z-score

See Exercise 3.1

(3.19) Gauss-Markov Theorem

See Exercise 3.3 (a)

(3.40) Multivariate weighted RSS arises from Gaussian theory

(3.40) Multivariate weighted RSS has same solution as (3.39)

See Exercise 3.11

(3.40) If the $\Sigma_i$ vary among observations, errors no longer decouple

Exercises

3.1

F statistic for a single coefficient is equivalent to square of the Z-score.

See Section 4.3 from Davidson and Mackinnon. The square of an R.V. distributed as $t_{N-p-1}$ is distributed as $F_{1, N-p-1}$:

  • The F distribution is a ratio of two independent chi-squared RVs, each divided by its degrees of freedom: $\frac{a_1 / n_1}{a_2 / n_2}$, where $a_1 \sim \chi_{n_1}^2$ and $a_2 \sim \chi_{n_2}^2$.
  • The t distribution is a ratio of a standard normal to the square root of an independent chi-squared RV divided by its degrees of freedom: $t = \frac{b}{(a / n)^{1/2}}$, where $b \sim N(0,1)$ and $a \sim \chi^2_{n}$.
  • The chi-squared distribution is a sum of squares of independent standard normals: $\sum^n_{i=1} x_i^2 \sim \chi_n^2$ if $x_i \sim N(0,1)$ i.i.d.

Then, $t^2 = \frac{b^2}{a / n} \sim F_{1, n}$ because $b^2 \sim \chi^2_{1}$.
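
A quick sampling sketch of this equivalence: squares of $t_n$ draws should match $F_{1,n}$ draws in distribution (compared below via a few quantiles; the degrees of freedom $n$ is an arbitrary choice):

```python
# Empirical quantiles of t_n^2 samples vs F_{1,n} samples.
import numpy as np

rng = np.random.default_rng(8)
n = 20
t_squared = rng.standard_t(df=n, size=200000) ** 2
f_draws = rng.f(dfnum=1, dfden=n, size=200000)

qs = [0.5, 0.9, 0.95, 0.99]
print(np.quantile(t_squared, qs))
print(np.quantile(f_draws, qs))                      # close to the line above
```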

3.2

3.3

(a) Assume $X$ is fixed, $E(y) = X\beta$, and $Var(y) = \sigma^2 I$. Then $\hat{\theta} = a^T \hat{\beta} = a^T(X^TX)^{-1}X^Ty$ is an unbiased estimate of $a^T\beta$; $E(\hat{\theta}) = a^T\beta$ from (3.18). Suppose another linear estimator $\widetilde{\theta} = c^Ty$ is also unbiased. Show $Var(\widetilde{\theta}) \geq Var(\hat{\theta})$.

First, $a^T = c^T X$ because $a^T \beta = E(\widetilde{\theta}) = E(c^T y) = c^T E(y) = c^T X \beta$ must hold for all $\beta$. Then

$$ \begin{align*} Var(\widetilde{\theta}) - Var(\hat{\theta}) &= c^T Var(y) c - a^T Var(\hat{\beta})a \\ &= \sigma^2 c^Tc - \sigma^2 a^T (X^TX)^{-1}a && Var(\hat{\beta}) = \sigma^2(X^TX)^{-1} \text{ from (3.10)} \\ &= \sigma^2 (c^Tc - c^T X (X^TX)^{-1} X^T c) && a^T = c^T X \\ &= \sigma^2 c^T (I - X(X^TX)^{-1}X^T)c \\ &= \sigma^2 c^T (I - H)c && \text{$H$ is the projection matrix (3.7)} \\ &= \sigma^2 c^T (I - H)^T(I - H)c && (I - H)^2 = (I - H) = (I - H)^T \\ &= \sigma^2 \lVert (I - H) c \rVert^2 && \text{i.e. $I - H$ is PSD} \\ &\geq 0 \end{align*} $$
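
The same comparison can be illustrated numerically. In the sketch below (made-up design), the OLS weights are $c_{OLS} = X(X^TX)^{-1}a$, and any competing weights $c = c_{OLS} + d$ with $d$ orthogonal to the column space of $X$ remain unbiased ($X^Tc = a$) but satisfy $c^Tc \geq c_{OLS}^Tc_{OLS}$, i.e. the variance is at least as large:

```python
# Compare the OLS estimator of theta = a^T beta with another linear unbiased
# estimator c^T y; variances are sigma^2 c^T c, so compare c^T c directly.
import numpy as np

rng = np.random.default_rng(9)
N, p = 30, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
a = rng.normal(size=p + 1)

c_ols = X @ np.linalg.solve(X.T @ X, a)              # OLS weights: theta_hat = c_ols^T y
H = X @ np.linalg.solve(X.T @ X, X.T)
d = (np.eye(N) - H) @ rng.normal(size=N)             # any vector orthogonal to col(X)
c_other = c_ols + d                                  # still unbiased: X^T c_other = a

assert np.allclose(X.T @ c_other, a)
var_ols, var_other = c_ols @ c_ols, c_other @ c_other  # variances up to sigma^2
assert var_other >= var_ols
print(var_ols, var_other)
```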


(b) $\hat{\beta}$ is an unbiased estimate of $\beta$. Suppose $\widetilde{\beta}$ is another unbiased estimate. Show $Var(\widetilde{\beta}) - Var(\hat{\beta})$ is positive semidefinite.

$Var(\widetilde{\beta}) - Var(\hat{\beta})$ is positive semidefinite iff $\forall b \neq 0, b^T (Var(\widetilde{\beta}) - Var(\hat{\beta})) b \geq 0$. This is equivalent to $Var(b^T \widetilde{\beta}) - Var(b^T \hat{\beta}) \geq 0$.

The latter inequality follows from (a) if we can write $b^T \widetilde{\beta}$ in the form $\widetilde{\theta} = c^T y$. Since $\widetilde{\beta}$ is a linear estimator, $\widetilde{\beta} = Cy$ for some fixed matrix $C$, so $b^T \widetilde{\beta} = (C^T b)^T y = c^T y$ with $c = C^T b$, an unbiased linear estimate of $b^T \beta$. The inequality then follows from (a) with $a = b$.

3.4

Obtain the OLS coefficients from a single pass of the Gram-Schmidt procedure. Express the solution in terms of the QR decomposition of $X$.

$X$ is decomposed into $Q$ and $R$ from a single run of Gram-Schmidt procedure; $X = (ZD^{-1})(D\Gamma) = QR$ (3.31).

By QR decomposition, $\hat{\beta} = (R^T Q^TQR)^{-1}R^TQ^Ty = R^{-1}R^{-T}R^TQ^Ty = R^{-1}Q^Ty$ (3.32).

Since $R \hat{\beta} = Q^T y$ and $R$ is upper triangular, we can solve for $\hat{\beta}$ by back-substitution, from the $p$-th coefficient down to the first.
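
A sketch of this back-substitution on made-up data, using numpy.linalg.qr for the decomposition and comparing against a library least-squares solve:

```python
# Solve R beta_hat = Q^T y by explicit back-substitution.
import numpy as np

rng = np.random.default_rng(10)
N, p = 40, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
y = rng.normal(size=N)

Q, R = np.linalg.qr(X)                               # reduced QR: Q is N x (p+1)
rhs = Q.T @ y

beta_hat = np.zeros(p + 1)
for k in range(p, -1, -1):                           # from the last coefficient to the first
    beta_hat[k] = (rhs[k] - R[k, k + 1:] @ beta_hat[k + 1:]) / R[k, k]

assert np.allclose(beta_hat, np.linalg.lstsq(X, y, rcond=None)[0])
```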

(It is a bit unclear whether this exercise is asking for a modification of step 2 of the Gram-Schmidt procedure such that $\hat{\beta}$ can be obtained without the final back-substitution. However, since the computational complexity of the Gram-Schmidt procedure, $O(n^3)$, dominates that of back-substitution, $O(n^2)$, the additional back-substitution is probably fine.)

Reference:

  • Strang. Introduction to Linear Algebra. See chapters 2.6 and 11.1 for computational complexity of back-substitution (LU) and QR.

3.11

Consider a multiple-output linear regression problem in which the errors are correlated between outputs. Show that the solution to the weighted RSS criterion (3.40) is still given by $(X^TX)^{-1}X^TY$, i.e. the correlations can be ignored.

Let $A = Y - XB$. Then, $\sum_{i=1}^N a_i^T \Sigma^{-1} a_i = tr[\Sigma^{-1} A^TA]$, where $a_i$ is the i-th row of $A$ as a column vector. This is because:

$$ \begin{equation*} \begin{split} tr[\Sigma^{-1} A^TA] &= \sum_{j=1}^K \sum_{k=1}^K \Sigma^{-1}_{j,k} (A^TA)_{j,k} \\ &= \sum_{j=1}^K \sum_{k=1}^K \Sigma^{-1}_{j,k} \sum_{i=1}^N A^T_{j,i} A_{i,k} \\ &= \sum_{i=1}^N \sum_{j=1}^K \sum_{k=1}^K A_{i,j} \Sigma^{-1}_{j,k} A_{i,k} \\ &= \sum_{i=1}^N a_i^T \Sigma^{-1} a_i \end{split} \end{equation*} $$

So, RSS (3.40) is equivalent to $tr[\Sigma^{-1} (Y - XB)^T (Y - XB)]$. We find its derivative w.r.t. $B$:

$$ \begin{align*} \frac{\partial}{\partial B} tr[\Sigma^{-1} (Y - XB)^T (Y - XB)] &= \frac{\partial}{\partial B} \big( - tr[\Sigma^{-1} B^TX^TY] - tr[\Sigma^{-1} Y^TXB] + tr[\Sigma^{-1}B^TX^TXB] \big) \\ &= \frac{\partial}{\partial B} \big( - 2\, tr[B^TX^TY \Sigma^{-1}] + tr[B^TX^TXB \Sigma^{-1}] \big) \\ &= -2X^TY\Sigma^{-1} + (X^TXB\Sigma^{-1} + (X^TX)^T B \Sigma^{-T}) \\ &= -2X^TY\Sigma^{-1} + 2X^TXB\Sigma^{-1} \end{align*} $$

where the term $tr[\Sigma^{-1} Y^TY]$ is dropped because it does not depend on $B$, and the trace derivatives are from Matrix Cookbook (103) and (117). Setting the derivative to zero gives $X^TXB\Sigma^{-1} = X^TY\Sigma^{-1}$, which yields (3.39) because $\Sigma^{-1}$ is invertible.
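
As a numerical sanity check (synthetic data, arbitrary positive definite $\Sigma^{-1}$), the unweighted OLS solution $\hat{B}$ should also minimize the weighted criterion, so random perturbations of $\hat{B}$ never decrease it:

```python
# The unweighted OLS solution minimizes tr[Sigma^{-1} (Y - XB)^T (Y - XB)].
import numpy as np

rng = np.random.default_rng(11)
N, p, K = 50, 3, 2
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
Y = rng.normal(size=(N, K))
L = rng.normal(size=(K, K))
Sigma_inv = np.linalg.inv(L @ L.T + np.eye(K))       # an arbitrary positive definite matrix

def weighted_rss(B):
    A = Y - X @ B
    return np.trace(Sigma_inv @ A.T @ A)

B_hat = np.linalg.solve(X.T @ X, X.T @ Y)            # unweighted OLS, per (3.39)
base = weighted_rss(B_hat)
for _ in range(1000):
    assert weighted_rss(B_hat + 0.1 * rng.normal(size=B_hat.shape)) >= base
```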

Reference:

  • Weatherwax and Epstein. A Solution Manual and Notes for: The Elements of Statistical Learning. See the solution for an easier explanation of the derivative.
  • Petersen and Pedersen. Matrix Cookbook.