Standard Errors vs. Standard Deviation
Derive SE
For method="lm" the CI ribbon is based on the standard error of the estimated mean \(\hat y(x_0)=\widehat{E[Y\mid X=x_0]}\). The SE formula comes directly from the variance of a linear combination of the OLS estimator.
- Write the linear model in matrix form
\[ y = X\beta + \varepsilon,\qquad \mathbb E\varepsilon=0,\quad \mathrm{Var}(\varepsilon)=\sigma^2 I_n.\]
- OLS estimator and its variance
OLS gives \[ \hat\beta=(X^\top X)^{-1}X^\top y.\] Because \(\hat\beta\) is a linear function of \(y\), \[ \mathrm{Var}(\hat\beta) = (X^\top X)^{-1}X^\top \mathrm{Var}(y) X (X^\top X)^{-1}.\] But \(y=X\beta+\varepsilon\) and \(\mathrm{Var}(y)=\mathrm{Var}(\varepsilon)=\sigma^2 I\), so \[ \mathrm{Var}(\hat\beta)=\sigma^2 (X^\top X)^{-1}.\]
- Fitted mean at a point is a linear combo of \(\hat\beta\)
At \(x_0\), let the (row) design vector be \[ x_0^* = (1 \;\;x_0)\] (for simple linear regression; more generally it’s whatever basis terms your formula uses). Then the fitted mean is
\[ \hat y(x_0)=x_0^* \hat\beta.\]
- Variance of a linear combination
Use the rule: if \(a\) is a constant vector and \(Z\) is random, \[ \mathrm{Var}(a^\top Z)=a^\top \mathrm{Var}(Z)a.\] Here \(a = (x_0^*)^\top\) and \(Z=\hat\beta\). Therefore,
\[\mathrm{Var}(\hat y(x_0)) = \mathrm{Var}(x_0^* \hat\beta) = x_0^*\,\mathrm{Var}(\hat\beta)\,(x_0^*)^\top = x_0^* \big(\sigma^2 (X^\top X)^{-1}\big) (x_0^*)^\top.\] So the standard error is \[ \mathrm{SE}\big(\hat y(x_0)\big)=\sqrt{\sigma^2\, x_0^*(X^\top X)^{-1}(x_0^*)^\top }.\]
- Replace \(\sigma ^2\) by its sample estimate
Since \(\sigma^2\) is unknown, we plug in
\[ \hat\sigma^2=\frac{\sum_{i=1}^n \hat\varepsilon_i^2}{n-p},\] giving the practical SE:
\[\mathrm{SE}\big(\hat y(x_0)\big)=\sqrt{\hat\sigma^2\, x_0^*(X^\top X)^{-1}(x_0^*)^\top }.\]
- Same formula in “simple regression” scalar form
For \(y=\beta_0+\beta_1 x+\varepsilon\),
\[\mathrm{SE}\big(\hat y(x_0)\big)=\hat\sigma\sqrt{\frac{1}{n}+\frac{(x_0-\bar x)^2}{\sum_{i=1}^n (x_i-\bar x)^2}}.\] This is algebraically equivalent to the matrix expression above.
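The equivalence of the matrix and scalar forms is easy to check numerically. Here is a small sketch in Python with numpy (the data and the point \(x_0=4\) are made up for illustration): fit OLS by least squares, compute the SE of the fitted mean both ways, and confirm they match.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
x = rng.uniform(0, 10, n)
y = 2 + 3 * x + rng.normal(0, 1.5, n)

# Design matrix with an intercept column
X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
p = X.shape[1]
sigma2_hat = resid @ resid / (n - p)   # hat sigma^2 = RSS / (n - p)

XtX_inv = np.linalg.inv(X.T @ X)

x0 = 4.0
x0_star = np.array([1.0, x0])          # design vector (1, x0)

# Matrix form: sqrt( hat sigma^2 * x0* (X'X)^{-1} (x0*)' )
se_matrix = np.sqrt(sigma2_hat * x0_star @ XtX_inv @ x0_star)

# Scalar form for simple regression
xbar = x.mean()
se_scalar = np.sqrt(sigma2_hat * (1 / n + (x0 - xbar) ** 2 / ((x - xbar) ** 2).sum()))

assert np.isclose(se_matrix, se_scalar)
```

The assertion passes because the identity is exact algebra, not an approximation, so the two values agree to floating-point precision.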
SD vs SE
SD (standard deviation) describes how spread out the data are.
- It’s a property of the observations.
- Example: “Students’ heights have SD = 7 cm” means individual heights vary a lot.
SE (standard error) describes how uncertain an estimate is (because it’s based on a sample).
- It’s a property of an estimator (sample mean, regression slope, fitted mean, etc.).
- Example: “The sample mean height has SE = 1 cm” means the mean is estimated pretty precisely.
A good one-liner:
- SD = variability of points
- SE = variability of a statistic across repeated samples
The classic relationship: SE of the sample mean
If \(X_1,\dots,X_n\) are i.i.d. with \(\mathrm{Var}(X_i)=\sigma ^2\), then \[ \bar X=\frac{1}{n}\sum_{i=1}^n X_i,\qquad \mathrm{Var}(\bar X)=\frac{\sigma^2}{n}.\] So \[ SE(\bar X)=\sqrt{\mathrm{Var}(\bar X)}=\frac{\sigma}{\sqrt n}.\] In practice we don’t know \(\sigma\), so we plug in the sample SD \(s\): \[ \widehat{SE}(\bar X)=\frac{s}{\sqrt n}.\] This is why SE shrinks as \(n\) grows, while SD typically doesn’t.
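A quick simulation makes the contrast concrete. This Python/numpy sketch (the heights-with-SD-7-cm setup is borrowed from the example above; the sample sizes and repetition count are arbitrary) draws many samples of size \(n\) and compares the typical sample SD against the empirical SD of the sample mean:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 7.0  # true SD of individual heights (cm)

def simulate(n, reps=4000):
    samples = rng.normal(170, sigma, size=(reps, n))
    sd_of_data = samples.std(axis=1, ddof=1).mean()  # typical sample SD
    se_of_mean = samples.mean(axis=1).std(ddof=1)    # empirical SE of the mean
    return sd_of_data, se_of_mean

sd25, se25 = simulate(25)
sd400, se400 = simulate(400)
# sd25 and sd400 both stay near sigma = 7;
# se25 is near 7/sqrt(25) = 1.4 and se400 near 7/sqrt(400) = 0.35
```

Growing \(n\) by a factor of 16 leaves the SD essentially unchanged but cuts the SE by a factor of 4, exactly the \(\sigma/\sqrt n\) scaling.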
In regression (geom_smooth(method="lm", se=TRUE))
- The SD-like quantity is the residual spread around the line (noise level).
- The SE band is uncertainty in the estimated mean line \(\hat y(x)\). It’s smaller near \(\bar x\) and larger near the ends because the line is less well anchored there.
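The narrow-in-the-middle, wide-at-the-ends shape follows directly from the SE formula above (the \((x_0-\bar x)^2\) term). A Python/numpy sketch of the same computation ggplot2 does internally (the data here are simulated, not anything geom_smooth produces):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
x = rng.uniform(0, 10, n)
y = 1 + 0.5 * x + rng.normal(0, 1, n)

X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - 2)
XtX_inv = np.linalg.inv(X.T @ X)

def se_mean(x0):
    """SE of the fitted mean at x0 via the matrix formula."""
    v = np.array([1.0, x0])
    return np.sqrt(sigma2_hat * v @ XtX_inv @ v)

# Narrowest at x-bar, widest at the ends of the x range
print(se_mean(x.min()), se_mean(x.mean()), se_mean(x.max()))
```

Evaluating `se_mean` on a grid and plotting \(\hat y(x)\pm t^*\cdot \mathrm{SE}\) would reproduce the familiar bow-tie-shaped ribbon.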
Why is \(\hat{\sigma}^2 = \dfrac{\sum \hat\varepsilon_i^2}{n-p}\) ?
Consider the linear model \[ y = X\beta + \varepsilon,\quad \mathbb E(\varepsilon)=0,\quad \mathrm{Var}(\varepsilon)=\sigma^2 I_n,\] with \(p\) parameters (in simple regression, \(p=2\): intercept + slope).
Let \(\hat y = X\hat\beta\) be the fitted values and \(e = y-\hat y\) the residuals. The residual sum of squares is \[ RSS = e^\top e = \sum_{i=1}^n e_i^2.\]
Key fact (degrees of freedom)
The fitted values live in the column space of \(X\), which is \(p\)-dimensional. Residuals live in the orthogonal complement, which has dimension \(n-p\). So residuals have only \(n-p\) independent pieces of variation.
More formally, define the “residual-maker” matrix \[ M = I - H,\quad H = X(X^\top X)^{-1}X^\top.\] Then \(e = My\) and \(M\) is symmetric idempotent with \[ \mathrm{rank}(M)=n-p,\quad \mathrm{tr}(M)=n-p.\] Since \(y = X\beta + \varepsilon\) and \(MX\beta=0\), \[ e = M\varepsilon.\] Now compute the expected RSS: \[\mathbb E(RSS) = \mathbb E(e^\top e) = \mathbb E(\varepsilon^\top M^\top M \varepsilon) = \mathbb E(\varepsilon^\top M \varepsilon). \] Using \(\mathbb E(\varepsilon^\top A \varepsilon)=\sigma^2\mathrm{tr}(A)\) when \(\mathrm{Var}(\varepsilon)=\sigma^2 I\),
\[ \mathbb E(RSS)=\sigma^2 \mathrm{tr}(M)=\sigma^2 (n-p).\] So \[ \mathbb E\left(\frac{RSS}{n-p}\right)=\sigma^2,\] meaning \[ \hat\sigma^2=\frac{RSS}{n-p}\] is an unbiased estimator of \(\sigma^2\).
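The stated properties of \(M\) (symmetric, idempotent, \(\mathrm{tr}(M)=n-p\), and \(MX=0\)) can all be verified numerically. A short Python/numpy sketch with an arbitrary simple-regression design:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 40, 2
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])

H = X @ np.linalg.inv(X.T @ X) @ X.T  # hat matrix
M = np.eye(n) - H                     # residual-maker matrix

assert np.allclose(M, M.T)            # symmetric
assert np.allclose(M @ M, M)          # idempotent
assert np.isclose(np.trace(M), n - p) # tr(M) = n - p
assert np.allclose(M @ X, 0)          # MX = 0, hence e = My = M(epsilon)
```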
Intuition
If you divided by \(n\), you’d systematically underestimate \(\sigma^2\) because the regression line was chosen to make residuals small (you “spent” \(p\) degrees of freedom estimating parameters). Dividing by \(n-p\) corrects that.
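The bias from dividing by \(n\) shows up clearly in a Monte Carlo check. This Python/numpy sketch (true \(\sigma^2=4\), an arbitrary design, and 20,000 replications) repeatedly simulates \(y\), computes RSS, and compares the two divisors:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, sigma2 = 20, 2, 4.0
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])
H = X @ np.linalg.inv(X.T @ X) @ X.T  # hat matrix: y_hat = Hy

rss = []
for _ in range(20000):
    y = X @ np.array([1.0, 2.0]) + rng.normal(0, np.sqrt(sigma2), n)
    e = y - H @ y                     # residuals e = My
    rss.append(e @ e)
rss = np.array(rss)

print((rss / n).mean())        # biased low: approx sigma2 * (n - p) / n = 3.6
print((rss / (n - p)).mean())  # approx sigma2 = 4.0
```

With \(n=20\) and \(p=2\), dividing by \(n\) lands near \(3.6\) rather than \(4\): a 10% underestimate that matches the \((n-p)/n\) factor exactly.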