Standard Errors vs. Standard Deviation
Derive SE
For method="lm" the CI ribbon is based on the standard error of the estimated mean \(\hat y(x_0)=\widehat{E[Y\mid X=x_0]}\). The SE formula comes directly from the variance of a linear combination of the OLS estimator.
- Write the linear model in matrix form
\[ y = X\beta + \varepsilon,\qquad \mathbb E\varepsilon=0,\quad \mathrm{Var}(\varepsilon)=\sigma^2 I_n.\]
- OLS estimator and its variance
OLS gives \[ \hat\beta=(X^\top X)^{-1}X^\top y.\] Because \(\hat\beta\) is a linear function of \(y\), \[ \mathrm{Var}(\hat\beta) = (X^\top X)^{-1}X^\top \mathrm{Var}(y) X (X^\top X)^{-1}.\] But \(y=X\beta+\varepsilon\) and \(\mathrm{Var}(y)=\mathrm{Var}(\varepsilon)=\sigma^2 I\), so \[ \mathrm{Var}(\hat\beta)=\sigma^2 (X^\top X)^{-1}.\]
- Fitted mean at a point is a linear combo of \(\hat\beta\)
At \(x_0\), let the (row) design vector be \[ x_0^* = (1 \;\;x_0)\] (for simple linear regression; more generally it’s whatever basis terms your formula uses). Then the fitted mean is
\[ \hat y(x_0)=x_0^* \hat\beta.\]
- Variance of a linear combination
Use the rule: if \(a\) is a constant vector and \(Z\) is random, \[ \mathrm{Var}(a^\top Z)=a^\top \mathrm{Var}(Z)a.\] Here \(a = (x_0^*)^\top\) and \(Z=\hat\beta\). Therefore,
\[\mathrm{Var}(\hat y(x_0)) = \mathrm{Var}(x_0^* \hat\beta) = x_0^*\,\mathrm{Var}(\hat\beta)\,(x_0^*)^\top = x_0^* \big(\sigma^2 (X^\top X)^{-1}\big) (x_0^*)^\top.\] So the standard error is \[ \mathrm{SE}\big(\hat y(x_0)\big)=\sqrt{\sigma^2\, x_0^*(X^\top X)^{-1}(x_0^*)^\top }.\]
- Replace \(\sigma ^2\) by its sample estimate
Since \(\sigma^2\) is unknown, we plug in
\[ \hat\sigma^2=\frac{\sum_{i=1}^n \hat\varepsilon_i^2}{n-p},\] giving the practical SE:
\[\mathrm{SE}\big(\hat y(x_0)\big)=\sqrt{\hat\sigma^2\, x_0^*(X^\top X)^{-1}(x_0^*)^\top }.\]
- Same formula in “simple regression” scalar form
For \(y=\beta_0+\beta_1 x+\varepsilon\),
\[\mathrm{SE}\big(\hat y(x_0)\big)=\hat\sigma\sqrt{\frac{1}{n}+\frac{(x_0-\bar x)^2}{\sum_{i=1}^n (x_i-\bar x)^2}}.\] This is algebraically equivalent to the matrix expression above.
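The equivalence of the matrix and scalar forms is easy to check numerically. Here is a small sketch in Python with numpy (the data and the point \(x_0=4\) are made up for illustration): fit OLS by least squares, compute the SE of the fitted mean both ways, and confirm they match.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
x = rng.uniform(0, 10, n)
y = 2 + 3 * x + rng.normal(0, 1.5, n)

# Design matrix with an intercept column
X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
p = X.shape[1]
sigma2_hat = resid @ resid / (n - p)   # hat sigma^2 = RSS / (n - p)

XtX_inv = np.linalg.inv(X.T @ X)

x0 = 4.0
x0_star = np.array([1.0, x0])          # design vector (1, x0)

# Matrix form: sqrt( hat sigma^2 * x0* (X'X)^{-1} (x0*)' )
se_matrix = np.sqrt(sigma2_hat * x0_star @ XtX_inv @ x0_star)

# Scalar form for simple regression
xbar = x.mean()
se_scalar = np.sqrt(sigma2_hat * (1 / n + (x0 - xbar) ** 2 / ((x - xbar) ** 2).sum()))

assert np.isclose(se_matrix, se_scalar)
```

The assertion passes because the identity is exact algebra, not an approximation, so the two values agree to floating-point precision.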
SD vs SE
SD (standard deviation) describes how spread out the data are.
- It’s a property of the observations.
- Example: “Students’ heights have SD = 7 cm” means individual heights vary a lot.
SE (standard error) describes how uncertain an estimate is (because it’s based on a sample).
- It’s a property of an estimator (sample mean, regression slope, fitted mean, etc.).
- Example: “The sample mean height has SE = 1 cm” means the mean is estimated pretty precisely.
A good one-liner:
- SD = variability of points
- SE = variability of a statistic across repeated samples
The classic relationship: SE of the sample mean
If \(X_1,\dots,X_n\) are i.i.d. with \(\mathrm{Var}(X_i)=\sigma ^2\), then \[ \bar X=\frac{1}{n}\sum_{i=1}^n X_i,\qquad \mathrm{Var}(\bar X)=\frac{\sigma^2}{n}.\] So \[ SE(\bar X)=\sqrt{\mathrm{Var}(\bar X)}=\frac{\sigma}{\sqrt n}.\] In practice we don’t know \(\sigma\), so we plug in the sample SD \(s\): \[ \widehat{SE}(\bar X)=\frac{s}{\sqrt n}.\] This is why SE shrinks as \(n\) grows, while SD typically doesn’t.
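A quick simulation makes the contrast concrete. This Python/numpy sketch (the heights-with-SD-7-cm setup is borrowed from the example above; the sample sizes and repetition count are arbitrary) draws many samples of size \(n\) and compares the typical sample SD against the empirical SD of the sample mean:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 7.0  # true SD of individual heights (cm)

def simulate(n, reps=4000):
    samples = rng.normal(170, sigma, size=(reps, n))
    sd_of_data = samples.std(axis=1, ddof=1).mean()  # typical sample SD
    se_of_mean = samples.mean(axis=1).std(ddof=1)    # empirical SE of the mean
    return sd_of_data, se_of_mean

sd25, se25 = simulate(25)
sd400, se400 = simulate(400)
# sd25 and sd400 both stay near sigma = 7;
# se25 is near 7/sqrt(25) = 1.4 and se400 near 7/sqrt(400) = 0.35
```

Growing \(n\) by a factor of 16 leaves the SD essentially unchanged but cuts the SE by a factor of 4, exactly the \(\sigma/\sqrt n\) scaling.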
In regression (geom_smooth(method="lm", se=TRUE))
- The SD-like quantity is the residual spread around the line (noise level).
- The SE band is uncertainty in the estimated mean line \(\hat y(x)\). It’s smaller near \(\bar x\) and larger near the ends because the line is less well anchored there.
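The narrow-in-the-middle, wide-at-the-ends shape follows directly from the SE formula above (the \((x_0-\bar x)^2\) term). A Python/numpy sketch of the same computation ggplot2 does internally (the data here are simulated, not anything geom_smooth produces):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
x = rng.uniform(0, 10, n)
y = 1 + 0.5 * x + rng.normal(0, 1, n)

X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - 2)
XtX_inv = np.linalg.inv(X.T @ X)

def se_mean(x0):
    """SE of the fitted mean at x0 via the matrix formula."""
    v = np.array([1.0, x0])
    return np.sqrt(sigma2_hat * v @ XtX_inv @ v)

# Narrowest at x-bar, widest at the ends of the x range
print(se_mean(x.min()), se_mean(x.mean()), se_mean(x.max()))
```

Evaluating `se_mean` on a grid and plotting \(\hat y(x)\pm t^*\cdot \mathrm{SE}\) would reproduce the familiar bow-tie-shaped ribbon.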
Why is \(\hat{\sigma}^2 = \dfrac{\sum \hat\varepsilon_i^2}{n-p}\) ?
Consider the linear model \[ y = X\beta + \varepsilon,\quad \mathbb E(\varepsilon)=0,\quad \mathrm{Var}(\varepsilon)=\sigma^2 I_n,\] with \(p\) parameters (in simple regression, \(p=2\): intercept + slope).
Let \(\hat y = X\hat\beta\) be the fitted values and \(e = y-\hat y\) the residuals. The residual sum of squares is \[ RSS = e^\top e = \sum_{i=1}^n e_i^2.\]
Key fact (degrees of freedom)
The fitted values live in the column space of \(X\), which is \(p\)-dimensional. Residuals live in the orthogonal complement, which has dimension \(n-p\). So residuals have only \(n-p\) independent pieces of variation.
More formally, define the “residual-maker” matrix \[ M = I - H,\quad H = X(X^\top X)^{-1}X^\top.\] Then \(e = My\) and \(M\) is symmetric idempotent with \[ \mathrm{rank}(M)=n-p,\quad \mathrm{tr}(M)=n-p.\] Since \(y = X\beta + \varepsilon\) and \(MX\beta=0\), \[ e = M\varepsilon.\] Now compute the expected RSS: \[\mathbb E(RSS) = \mathbb E(e^\top e) = \mathbb E(\varepsilon^\top M^\top M \varepsilon) = \mathbb E(\varepsilon^\top M \varepsilon). \] Using \(\mathbb E(\varepsilon^\top A \varepsilon)=\sigma^2\mathrm{tr}(A)\) when \(\mathrm{Var}(\varepsilon)=\sigma^2 I\),
\[ \mathbb E(RSS)=\sigma^2 \mathrm{tr}(M)=\sigma^2 (n-p).\] So \[ \mathbb E\left(\frac{RSS}{n-p}\right)=\sigma^2,\] meaning \[ \hat\sigma^2=\frac{RSS}{n-p}\] is an unbiased estimator of \(\sigma^2\).
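The stated properties of \(M\) (symmetric, idempotent, \(\mathrm{tr}(M)=n-p\), and \(MX=0\)) can all be verified numerically. A short Python/numpy sketch with an arbitrary simple-regression design:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 40, 2
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])

H = X @ np.linalg.inv(X.T @ X) @ X.T  # hat matrix
M = np.eye(n) - H                     # residual-maker matrix

assert np.allclose(M, M.T)            # symmetric
assert np.allclose(M @ M, M)          # idempotent
assert np.isclose(np.trace(M), n - p) # tr(M) = n - p
assert np.allclose(M @ X, 0)          # MX = 0, hence e = My = M(epsilon)
```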
Intuition
If you divided by \(n\), you’d systematically underestimate \(\sigma^2\) because the regression line was chosen to make residuals small (you “spent” \(p\) degrees of freedom estimating parameters). Dividing by \(n-p\) corrects that.
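The bias from dividing by \(n\) shows up clearly in a Monte Carlo check. This Python/numpy sketch (true \(\sigma^2=4\), an arbitrary design, and 20,000 replications) repeatedly simulates \(y\), computes RSS, and compares the two divisors:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, sigma2 = 20, 2, 4.0
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])
H = X @ np.linalg.inv(X.T @ X) @ X.T  # hat matrix: y_hat = Hy

rss = []
for _ in range(20000):
    y = X @ np.array([1.0, 2.0]) + rng.normal(0, np.sqrt(sigma2), n)
    e = y - H @ y                     # residuals e = My
    rss.append(e @ e)
rss = np.array(rss)

print((rss / n).mean())        # biased low: approx sigma2 * (n - p) / n = 3.6
print((rss / (n - p)).mean())  # approx sigma2 = 4.0
```

With \(n=20\) and \(p=2\), dividing by \(n\) lands near \(3.6\) rather than \(4\): a 10% underestimate that matches the \((n-p)/n\) factor exactly.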