Confidence Intervals

https://rafalab.dfci.harvard.edu/dsbook-part-2/inference/estimates-confidence-intervals.html#confidence-intervals

Confidence intervals

Confidence intervals are a very useful concept widely employed by data analysts.
A version of these that are commonly seen come from the ggplot geometry geom_smooth.
Below is an example using a temperature dataset available in R:

library(tidyverse) #nhtemp is built-in R ts
data.frame(year = as.numeric(time(nhtemp)), temperature = as.numeric(nhtemp)) |> 
  ggplot(aes(year, temperature)) +  
  geom_point() +  # 
  geom_smooth() +  # plot smooth trend-curve (LOESS(small set)/lm (large set)) with a (default) 95% CI
  ggtitle("Average Yearly Temperatures in New Haven")

Confidence intervals

\([0,1]\) is guaranteed to include \(p\), but not useful
the spread between -100% and 100%, will be ridiculed for stating the obvious.
Even a smaller interval, such as saying the spread between -10 and 10%, will not be considered serious.
a very small intervals but misses the mark most of the time will not be considered good
We can use the statistical theory to compute the probability of any given interval including \(p\).

Confidence intervals: MC simulation Parameter

To illustrate this we run the Monte Carlo simulation.
We use the same parameters as above:

p <- 0.45 
N <- 1000

Confidence intervals: Simulation

And notice that the interval here:

x <- sample(c(0, 1), size = N, replace = TRUE, prob = c(1 - p, p)) 
x_hat <- mean(x) 
se_hat <- sqrt(x_hat*(1 - x_hat)/N) 
c(x_hat - 1.96*se_hat, x_hat + 1.96*se_hat)

[1] 0.4102262 0.4717738

is different from this one:

x <- sample(c(0,1), size = N, replace = TRUE, prob = c(1 - p, p)) 
x_hat <- mean(x) 
se_hat <- sqrt(x_hat*(1 - x_hat)/N) 
c(x_hat - 1.96*se_hat, x_hat + 1.96*se_hat)

[1] 0.429109 0.490891

Keep sampling and creating intervals, and you will see the random variation.

Probability of a Confidence intervals including \(p\)

To determine the probability that the interval includes \(p\), we need to compute the following:

\[ \mbox{Pr}\left(\bar{X} - 1.96\hat{\mbox{SE}}(\bar{X}) \leq p \leq \bar{X} + 1.96\hat{\mbox{SE}}(\bar{X})\right) \] Equivalently

\[ \mbox{Pr}\left(-1.96 \leq \frac{\bar{X}- p}{\hat{\mbox{SE}}(\bar{X})} \leq 1.96\right) \]

Confidence intervals with 95% confidence level

The term in the middle is an approximately normal random variable with expected value 0 and standard error 1, which we have been denoting with \(Z\), so we have:

\[ \mbox{Pr}\left(-1.96 \leq Z \leq 1.96\right) \]

which we can quickly compute using :

pnorm(1.96) - pnorm(-1.96)

[1] 0.9500042

proving that we have a 95% probability.

Confidence intervals of 99% confidence level

If we want to have a larger probability, say 99%, we need to multiply by whatever z satisfies the following:

\[ \mbox{Pr}\left(-z \leq Z \leq z\right) = 0.99 \]

We use:

z <- qnorm(0.995) 
z

[1] 2.575829

pnorm(z) - pnorm(-z)

[1] 0.99

Confidence intervals: significance level \(\alpha\)

Confidence interval formulas are given for arbitrary probabilities written as \(1-\alpha\).
We can obtain the \(z\) for the equation above using z = qnorm(1 - alpha / 2)
for \(\alpha=0.05\), \(1 - \alpha/2 = 0.975\) and we get the \(z=1.96\):

qnorm(0.975)

[1] 1.959964

A Monte Carlo simulation

We can run a Monte Carlo simulation to confirm that, in fact, a 95% confidence interval includes \(p\) 95% of the time.

N <- 1000 
B <- 10000 
inside <- replicate(B, { 
  x <- sample(c(0,1), size = N, replace = TRUE, prob = c(1 - p, p)) 
  x_hat <- mean(x) 
  se_hat <- sqrt(x_hat*(1 - x_hat)/N) 
  between(p, x_hat - 1.96*se_hat, x_hat + 1.96*se_hat) 
}) 
mean(inside)

[1] 0.9477

A Monte Carlo simulation

The Graph
The code

The following plot shows the first 100 confidence intervals.

set.seed(1) 
tab <- replicate(100, { 
  x <- sample(c(0,1), size = N, replace = TRUE, prob = c(1 - p, p)) 
  x_hat <- mean(x) 
  se_hat <- sqrt(x_hat*(1 - x_hat)/N) 
  hit <- between(p, x_hat - 1.96*se_hat, x_hat + 1.96*se_hat) 
  c(x_hat, x_hat - 1.96*se_hat, x_hat + 1.96*se_hat, hit) 
}) #tab: 4 by 100 matrix
tab <- data.frame(poll = 1:ncol(tab), t(tab)) # add a "poll" index col, t(tab): transpose `tab`
names(tab) <- c("poll", "estimate", "low", "high", "hit") #rename columns
tab <- mutate(tab, p_inside = ifelse(hit, "Yes", "No")) 
ggplot(tab, aes(poll, estimate, ymin = low, ymax = high, col = p_inside)) +  
  geom_point() + 
  geom_errorbar() +  # draw CI from y_min to y_max for each plotted point (poll, estimate)
  # geom_errorbar() does not compute the interval by itself
  coord_flip() +  
  geom_hline(yintercept = p) # draw the true value of p=0.45