QQ plot

How a Q–Q plot is constructed

A Q–Q plot compares:

sample quantiles from your data
theoretical quantiles from some reference distribution, often a normal distribution

The idea is: if the data really come from that reference distribution, the points should lie roughly on a straight line.

Suppose your data are

\[ x_1, x_2, \dots, x_n. \]

Step 1: sort the data

Order them from smallest to largest:

\[ x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}. \]

These are the sample quantiles.

Step 2: choose plotting probabilities

For each rank \(i\), assign a plotting probability, commonly

\[ p_i = \frac{i - a}{n+1-2a}, \qquad i=1,\dots,n, \]

where the default offset (the one used by R's ppoints()) is

\[ a=\begin{cases} 3/8, & n\le 10 \\ 1/2, & n >10. \end{cases} \]

These probabilities mark where each ordered observation sits in the distribution. If \(n >10\), then \(a=1/2\) and the formula reduces to \(p_i = \frac{i-1/2}{n}\).
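As a quick check of this formula, the sketch below computes the plotting probabilities by hand and compares them with base R's ppoints(). The helper name plotting_probs is hypothetical and only used for illustration; it assumes the offset rule \(a = 3/8\) for \(n \le 10\) and \(a = 1/2\) otherwise.

```r
# Hypothetical helper (not part of base R): plotting probabilities
# p_i = (i - a) / (n + 1 - 2a), with a = 3/8 for n <= 10 and a = 1/2 otherwise.
plotting_probs <- function(n) {
  a <- if (n <= 10) 3 / 8 else 1 / 2
  (seq_len(n) - a) / (n + 1 - 2 * a)
}

plotting_probs(5)    # small sample: offset a = 3/8
plotting_probs(20)   # larger sample: offset a = 1/2, i.e. (i - 1/2) / n

# Both should agree with base R's ppoints()
all.equal(plotting_probs(5), ppoints(5))
all.equal(plotting_probs(20), ppoints(20))
```

The probabilities are strictly between 0 and 1, which matters because the quantile function of a normal distribution is infinite at 0 and 1.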

Step 3: compute theoretical quantiles

If the reference distribution is \(N(\mu, \sigma^2)\), compute

\[ q_i = F^{-1}(p_i), \]

where \(F^{-1}\) is the quantile function of that normal distribution.

So \(q_i\) is the theoretical value such that

\[ P(X \le q_i) = p_i. \]

For a normal Q–Q plot, these are the normal quantiles.

Step 4: plot the pairs

Plot the points

\[ (q_i,\; x_{(i)}). \]

So:

x-axis = theoretical quantiles
y-axis = sample quantiles

If the sample distribution matches the theoretical one well, then

\[ x_{(i)} \approx q_i \]

up to location and scale, so the points fall near a straight line.
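The four steps can be sketched in base R as follows. The simulated data, the seed, and the variable names are illustrative assumptions; the reference distribution here is a normal with the sample's own mean and sd.

```r
# Illustrative data (an assumption for this sketch)
set.seed(42)
x <- rnorm(50, mean = 10, sd = 2)

x_sorted <- sort(x)                          # Step 1: sample quantiles
p <- ppoints(length(x))                      # Step 2: plotting probabilities
q <- qnorm(p, mean = mean(x), sd = sd(x))    # Step 3: theoretical quantiles

# Step 4: plot the pairs (q_i, x_(i))
plot(q, x_sorted,
     xlab = "Theoretical quantiles",
     ylab = "Sample quantiles",
     main = "Q-Q plot built by hand")
abline(0, 1, col = "red")                    # identity line y = x
```

Because the reference normal uses the sample mean and sd, the identity line is a reasonable guide in this particular setup.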

Why the line \(y=x\) appears

If the theoretical distribution and the sample have exactly the same center and spread, then the points should lie near

\[ y=x. \]

But there is an important subtlety:

if you compare your data to the standard normal \(N(0,1)\), the ideal line is only \(y=x\) when your data also have mean 0 and sd 1
if you compare to a normal with the same mean and sd as your data, then \(y=x\) is more sensible

Even then, stat_qq_line() is usually better because it computes a fitted reference line based on quartiles.
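To see the location and scale effect, the sketch below compares simulated data with standard normal quantiles and fits a straight line through the Q–Q points. The data, the seed, and the use of least squares are assumptions made for illustration; the fitted intercept should land close to the sample mean and the slope close to the sample sd, rather than 0 and 1.

```r
# Illustrative data with mean 5 and sd 3 (an assumption for this sketch)
set.seed(1)
x <- rnorm(200, mean = 5, sd = 3)

# Theoretical quantiles from the STANDARD normal N(0, 1)
q_std <- qnorm(ppoints(length(x)))

# Least-squares line through the Q-Q points (q_std, sorted x)
fit <- unname(coef(lm(sort(x) ~ q_std)))
fit                       # intercept and slope of the fitted line
c(mean(x), sd(x))         # compare with the sample mean and sd
```

The points are still close to a straight line, just not the line \(y=x\).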

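# Simulate the sampling distribution of a sample proportion
p <- 0.45   # true proportion
N <- 1000   # size of each sample
x <- sample(c(0, 1), size = N, replace = TRUE, prob = c(1 - p, p))
x_hat <- mean(x)   # proportion in a single sample

# Repeat the experiment B times; x_hat now holds B sample proportions
B <- 10000
x_hat <- replicate(B, {
  x <- sample(c(0, 1), size = N, replace = TRUE, prob = c(1 - p, p))
  mean(x)
})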

library(tidyverse)

data.frame(x_hat = x_hat) |>
  ggplot(aes(sample = x_hat)) +
  stat_qq(dparams = list(mean = mean(x_hat), sd = sd(x_hat))) +
  geom_abline()   # default: intercept 0, slope 1

This works as follows:

stat_qq(...) sorts the values of x_hat
it computes theoretical quantiles from a normal distribution with mean mean(x_hat) and sd sd(x_hat)
it plots sample quantiles against theoretical quantiles
geom_abline() adds the line \(y=x\)

So the plot checks whether the ordered values of x_hat look like quantiles from that normal distribution.

How to interpret the shape

If points are close to a straight line:

the data are consistent with the reference distribution, up to location and scale

If the plot bends upward or downward:

the data may be skewed

If the ends depart strongly:

the tails may be heavier or lighter than the reference distribution

For example:

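right tail heavier than the reference (largest values too large): upper-right points rise above the line
left tail heavier than the reference (smallest values too small): lower-left points fall below the line
both at once (an S-shape): suggests heavier tails than normal; the mirror-image pattern suggests lighter tails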
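These patterns can be reproduced with simulated data. The sketch below is only an illustration: the choice of a t distribution with 3 degrees of freedom for heavy tails, an exponential distribution for right skew, the sample size, and the seed are all assumptions.

```r
set.seed(123)
n <- 2000
heavy  <- rt(n, df = 3)   # heavier tails than normal
skewed <- rexp(n)         # right-skewed

# Side-by-side normal Q-Q plots with quartile-based reference lines
op <- par(mfrow = c(1, 2))
qqnorm(heavy, main = "Heavy tails (t, df = 3)")
qqline(heavy, col = "red")
qqnorm(skewed, main = "Right skew (exponential)")
qqline(skewed, col = "red")
par(op)
```

In the first panel the ends should peel away from the line in opposite directions; in the second the points should curve away from the line instead of following it.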

Small concrete example

Suppose the sorted sample is

\[ x_{(1)}, x_{(2)}, x_{(3)}, x_{(4)}, x_{(5)}. \]

For \(n=5\), a simple common choice (offset \(a=1/2\); R's ppoints() would use \(a=3/8\) for a sample this small) is

\[ p_i = \frac{i-0.5}{5} = 0.1, 0.3, 0.5, 0.7, 0.9. \]

If comparing to standard normal, the theoretical quantiles are about

\[ -1.28,\ -0.52,\ 0,\ 0.52,\ 1.28. \]

Then the Q–Q plot points are

\[ (-1.28, x_{(1)}),\ (-0.52, x_{(2)}),\ (0, x_{(3)}),\ (0.52, x_{(4)}),\ (1.28, x_{(5)}). \]

If the data are close to normal, these points line up.
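The numbers in this example can be reproduced in R. This is a minimal sketch; the variable name p is just for illustration.

```r
# Plotting probabilities with offset a = 1/2 for n = 5
p <- (1:5 - 0.5) / 5
p

# Standard normal quantiles at those probabilities, rounded to two decimals
round(qnorm(p), 2)

# For comparison: R's default plotting probabilities for n = 5 (offset a = 3/8)
ppoints(5)
```

The two sets of probabilities differ slightly at the extremes, so the outermost theoretical quantiles shift a little depending on the offset.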

Better reference line

Usually this is preferred:

ggplot(data.frame(x_hat = x_hat), aes(sample = x_hat)) +
  stat_qq(dparams = list(mean = mean(x_hat), sd = sd(x_hat))) +
  stat_qq_line(dparams = list(mean = mean(x_hat), sd = sd(x_hat)), color="red")

because stat_qq_line() gives a more meaningful Q–Q reference line than simply forcing \(y=x\).

`qqnorm()` and `qqplot()`

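qqnorm() is for comparing one sample to a normal distribution and produces a normal Q–Q plot, with theoretical quantiles computed from the plotting probabilities given by ppoints(). qqplot() is for comparing the quantiles of two datasets: it sorts both samples and, if their sizes differ, interpolates the larger one down to the size of the smaller.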

1. Example: `qqnorm()`

Use this when you want to check whether data look approximately normal.

set.seed(1)
x <- rnorm(100)

qqnorm(x)
qqline(x, col = "red", lwd = 2)

qqnorm(x) plots the sample quantiles of x against the theoretical normal quantiles.
qqline(x) adds a reference line. By default, that line is based on the first and third quartiles, not necessarily the identity line \(y=x\).
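The quartile-based line can be reconstructed by hand. The sketch below assumes the default settings of qqline() (probabilities 0.25 and 0.75 and the default quantile type); the variable names are illustrative.

```r
set.seed(1)
x <- rnorm(100)

# First and third quartiles of the sample and of the standard normal
y_q <- quantile(x, c(0.25, 0.75), names = FALSE)
x_q <- qnorm(c(0.25, 0.75))

# Line through the two quartile points
slope     <- diff(y_q) / diff(x_q)
intercept <- y_q[1] - slope * x_q[1]

qqnorm(x)
qqline(x, col = "red", lwd = 2)                     # built-in reference line
abline(intercept, slope, col = "blue", lty = 2)     # hand-built line
```

If the assumption about the defaults holds, the dashed blue line should lie on top of the red one.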

A second example with non-normal data:

set.seed(1)
y <- rexp(100)

qqnorm(y)
qqline(y, col = "red", lwd = 2)

This plot will usually bend away from the line because exponential data are not normal.

2. Example: `qqplot()`

Use this when you want to compare the distributions of two samples.

set.seed(1)
x <- rnorm(100, mean = 0, sd = 1)
y <- rnorm(100, mean = 1, sd = 1.5)

qqplot(x, y,
       xlab = "Quantiles of x",
       ylab = "Quantiles of y",
       main = "QQ-plot of y against x")
abline(0, 1, col = "red", lwd = 2)

qqplot(x, y) compares the quantiles of x and y.
If the points fall near a straight line, the two distributions have a similar shape.
If that line has a slope different from 1, one sample is more spread out than the other; if its intercept differs from 0, one sample is shifted relative to the other.
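To make the slope and intercept reading concrete, the sketch below extracts the Q–Q points without plotting and fits a line through them. The sample size, the seed, and the use of least squares are assumptions for illustration; with y drawn from a normal with mean 1 and sd 1.5, the slope should come out near 1.5 and the intercept near 1.

```r
set.seed(2024)
x <- rnorm(500, mean = 0, sd = 1)
y <- rnorm(500, mean = 1, sd = 1.5)

# plot.it = FALSE returns the sorted quantile pairs without drawing
qq <- qqplot(x, y, plot.it = FALSE)

# Least-squares line through the Q-Q points
line_fit <- unname(coef(lm(qq$y ~ qq$x)))
line_fit   # intercept, then slope
```

A slope above 1 reflects the larger spread of y, and a nonzero intercept reflects its shift.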

3. `qqplot()` against a theoretical distribution

You can also use qqplot() to compare data to a non-normal theoretical distribution by supplying theoretical quantiles yourself.

For example, compare data to a chi-square distribution:

set.seed(1)
y <- rchisq(100, df = 3)

theoretical <- qchisq(ppoints(length(y)), df = 3)

qqplot(theoretical, sort(y),
       xlab = "Theoretical chi-square quantiles",
       ylab = "Sample quantiles",
       main = "Chi-square QQ-plot")
abline(0, 1, col = "red", lwd = 2)

This works because ppoints() generates the plotting probabilities used to evaluate inverse distribution functions such as qnorm() or qchisq().