
Confidence intervals are a very useful concept widely employed by data analysts.
A commonly seen version is the shaded area drawn by the ggplot2 geometry geom_smooth.
Below is an example using a temperature dataset available in R:

library(tidyverse) # nhtemp is a built-in R time series (ts)
data.frame(year = as.numeric(time(nhtemp)), temperature = as.numeric(nhtemp)) |>
  ggplot(aes(year, temperature)) +
  geom_point() + # plot the yearly observations
  geom_smooth() + # smooth trend curve (loess for fewer than 1,000 points, otherwise a GAM) with a 95% confidence band by default
  ggtitle("Average Yearly Temperatures in New Haven")

The interval \([0,1]\) is guaranteed to include \(p\), but it is too wide to be useful.
Similarly, a forecaster who predicts that the spread will be between -100% and 100% will be ridiculed for stating the obvious.
Even a smaller interval, such as a spread between -10% and 10%, will not be considered serious.
At the other extreme, a pollster who reports very small intervals but misses the mark most of the time will not be considered good either.
We can use statistical theory to compute the probability that any given interval includes \(p\).
To illustrate this, we run a Monte Carlo simulation.
We use the same parameters as above:
x <- sample(c(0, 1), size = N, replace = TRUE, prob = c(1 - p, p))
x_hat <- mean(x) # sample proportion, the estimate of p
se_hat <- sqrt(x_hat * (1 - x_hat) / N) # estimated standard error
c(x_hat - 1.96 * se_hat, x_hat + 1.96 * se_hat)
#> [1] 0.4102262 0.4717738

Running the same code again gives a different interval, because the sample is random:

x <- sample(c(0, 1), size = N, replace = TRUE, prob = c(1 - p, p))
x_hat <- mean(x)
se_hat <- sqrt(x_hat * (1 - x_hat) / N)
c(x_hat - 1.96 * se_hat, x_hat + 1.96 * se_hat)
#> [1] 0.429109 0.490891
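
The repeated computation can be wrapped in a small helper to make the random variation easier to see. This is a minimal sketch rather than code from the text: the name `sample_ci` is hypothetical, and `N = 1000` is an assumed sample size (only `p = 0.45` is stated in the plotting code further down).

```r
# Hypothetical helper: draw one poll of size N and return its interval.
sample_ci <- function(N, p, z = 1.96) {
  x <- sample(c(0, 1), size = N, replace = TRUE, prob = c(1 - p, p))
  x_hat <- mean(x)                          # sample proportion
  se_hat <- sqrt(x_hat * (1 - x_hat) / N)   # estimated standard error
  c(low = x_hat - z * se_hat, high = x_hat + z * se_hat)
}

p <- 0.45  # true proportion, as annotated in the plot code
N <- 1000  # assumed sample size, not stated in this section

set.seed(2)  # arbitrary seed, only for reproducibility
intervals <- t(replicate(5, sample_ci(N, p)))
intervals  # one row per simulated poll, with columns "low" and "high"
```

Each row is a separate simulated poll, so the endpoints move from row to row even though \(p\) stays fixed.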
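To determine the probability that the interval includes \(p\), we need to compute:
\[ \mbox{Pr}\left(\bar{X} - 1.96\hat{\mbox{SE}}(\bar{X}) \leq p \leq \bar{X} + 1.96\hat{\mbox{SE}}(\bar{X})\right) \]
Subtracting \(\bar{X}\) from all parts, dividing by \(\hat{\mbox{SE}}(\bar{X})\), and multiplying by \(-1\) (which reverses the inequalities) gives the equivalent expression:
\[ \mbox{Pr}\left(-1.96 \leq \frac{\bar{X}- p}{\hat{\mbox{SE}}(\bar{X})} \leq 1.96\right) \]
The term in the middle is approximately a standard normal random variable, which we denote by \(Z\), so the probability is approximately:
\[ \mbox{Pr}\left(-1.96 \leq Z \leq 1.96\right) \]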
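
This last probability can be checked numerically with the normal distribution function `pnorm` from base R. A quick sketch (the variable name `coverage` is used only for illustration):

```r
# Probability that a standard normal Z falls between -1.96 and 1.96
coverage <- pnorm(1.96) - pnorm(-1.96)
coverage
#> [1] 0.9500042
```

So an interval built with 1.96 estimated standard errors should include \(p\) about 95% of the time, to the extent that the normal approximation holds.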
If we want a larger probability, say 99%, we replace 1.96 with the \(z\) that satisfies the following:
\[ \mbox{Pr}\left(-z \leq Z \leq z\right) = 0.99 \]
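Confidence interval formulas are usually given for an arbitrary probability written as \(1-\alpha\).
We can obtain the \(z\) that satisfies \(\mbox{Pr}\left(-z \leq Z \leq z\right) = 1-\alpha\) using z = qnorm(1 - alpha / 2), because this leaves probability \(\alpha/2\) in each tail.
For example, for \(\alpha=0.05\), \(1 - \alpha/2 = 0.975\) and we get \(z=1.96\).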
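
As a quick check of this formula, the sketch below evaluates `qnorm` for two values of \(\alpha\). The wrapper name `z_for` is hypothetical and only used here for illustration.

```r
# z for a (1 - alpha) confidence interval, from the standard normal quantile function
z_for <- function(alpha) qnorm(1 - alpha / 2)  # hypothetical helper name

z_for(0.05)  # 95% interval
#> [1] 1.959964
z_for(0.01)  # 99% interval
#> [1] 2.575829
```

The 99% interval uses a larger multiplier, so it is wider than the 95% interval computed from the same sample.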

The following Monte Carlo simulation computes 100 intervals and plots them, marking which ones include \(p\):

set.seed(1)
# Simulate 100 polls; for each, record the estimate, the interval, and whether it covers p
tab <- replicate(100, {
  x <- sample(c(0, 1), size = N, replace = TRUE, prob = c(1 - p, p))
  x_hat <- mean(x)
  se_hat <- sqrt(x_hat * (1 - x_hat) / N)
  hit <- between(p, x_hat - 1.96 * se_hat, x_hat + 1.96 * se_hat) # dplyr::between
  c(x_hat, x_hat - 1.96 * se_hat, x_hat + 1.96 * se_hat, hit)
}) # tab is a 4 x 100 matrix
tab <- data.frame(poll = seq_len(ncol(tab)), t(tab)) # transpose tab and add a "poll" index column
names(tab) <- c("poll", "estimate", "low", "high", "hit") # rename columns
tab <- mutate(tab, p_inside = ifelse(hit, "Yes", "No"))
ggplot(tab, aes(poll, estimate, ymin = low, ymax = high, col = p_inside)) +
  geom_point() +
  # geom_errorbar() does not compute the interval itself; it draws from ymin to ymax
  geom_errorbar() +
  coord_flip() +
  geom_hline(yintercept = p) # line at the true value, p = 0.45
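
The proportion of intervals that include \(p\) can also be estimated directly, without a plot. The sketch below is not from the text: it swaps the `sample` call for `rbinom` so that many polls can be drawn at once, uses an arbitrary `B = 10000` replicates, and assumes `N = 1000` (only `p = 0.45` is stated above).

```r
p <- 0.45   # true proportion, as annotated in the plot code
N <- 1000   # assumed sample size, not stated in this section
B <- 10000  # number of simulated polls, an arbitrary choice

set.seed(1)
x_hat <- rbinom(B, size = N, prob = p) / N   # B sample proportions, one per poll
se_hat <- sqrt(x_hat * (1 - x_hat) / N)      # estimated standard error of each
hit <- (x_hat - 1.96 * se_hat <= p) & (p <= x_hat + 1.96 * se_hat)

mean(hit)  # fraction of intervals that include p; theory predicts roughly 0.95
```

With only 100 polls, as in the plot, the observed fraction can drift a few points away from 95%; a larger `B` gives a steadier estimate.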