Introduction to Probability

https://rafalab.dfci.harvard.edu/dsbook-part-2/prob/discrete-probability.html

Probability

The term probability is used in everyday language.
Yet answering questions about probability is often hard, if not impossible.
Probability has a very intuitive definition in games of chance.
Probability theory was born because certain mathematical computations can give an advantage in games of chance.

Events

Events are fundamental concepts that help us understand and quantify uncertainty in various situations.
An event is defined as a specific outcome or a collection of outcomes from a random experiment.

Example of events

Suppose an urn contains 2 red beads and 3 blue beads, and we perform the random experiment of picking 1 bead. There are two outcomes: the bead is red or the bead is blue.

beads <- rep( c("red", "blue"), times = c(2,3)) 
# rep(x, times = ...): repeat each element of x the number of times given in times
beads

[1] "red"  "red"  "blue" "blue" "blue"

There are four possible events: the bead is red, the bead is blue, the bead is red or blue, and the empty event with no outcomes.

More examples of events

For example, if the random experiment is picking 2 beads, we can define events such as first bead is red, second bead is blue, both beads are red, and so on.
In a random experiment such as a political poll, where we phone 100 likely voters at random, we can form many millions of events, for example calling 48 Democrats and 52 Republicans.

Definitions and Notation

We usually use capital letters \(A\), \(B\), \(C\), … to denote events.
If we denote an event as \(A\) then we use the notation \(\mbox{Pr}(A)\) to denote the probability of event \(A\) occurring.

Combine events: Union and Intersection

We can combine events in different ways to form new events. For example, if event
\(A\)=first bead is red and second bead is blue, and
\(B\)=first bead is red and second bead is red
then \(A \cup B\) (\(A\) or \(B\)) is the event first bead is red,
while \(A \cap B\) (\(A\) and \(B\)) is the empty event since both can’t happen.
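As a rough check of the union and intersection above, the sketch below enumerates every ordered draw of two beads from the urn and counts outcomes. It assumes the two beads are drawn without replacement and that all ordered draws are equally likely; the variable names are illustrative only.

```r
# Urn from the example above: 2 red beads and 3 blue beads.
beads <- rep(c("red", "blue"), times = c(2, 3))

# All ordered draws of two distinct beads (positions 1..5, no replacement).
draws <- expand.grid(first = 1:5, second = 1:5)
draws <- draws[draws$first != draws$second, ]   # 5 * 4 = 20 equally likely outcomes
first_color  <- beads[draws$first]
second_color <- beads[draws$second]

A <- first_color == "red" & second_color == "blue"
B <- first_color == "red" & second_color == "red"

pr_A       <- mean(A)        # 6/20
pr_B       <- mean(B)        # 2/20
pr_A_or_B  <- mean(A | B)    # same as "first bead is red": 8/20
pr_A_and_B <- mean(A & B)    # empty event: 0
```

Because the events are stored as logical vectors over the same list of outcomes, `|` plays the role of union and `&` the role of intersection.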

Independence of Events

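Two events are independent if the outcome of one does not affect the probability of the other. Dealing cards without replacement gives an example of events that are not independent.
When we deal the first card, the probability of getting a King is 1/13 since there are thirteen possibilities: Ace, Deuce, Three, \(\dots\), Ten, Jack, Queen, and King.
If we deal a King for the first card, the probability of the second card being a King decreases because only three Kings are left among the 51 remaining cards: the probability is 3 out of 51.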

Examples

set.seed(1996)
beads <- rep( c("red", "blue"), times = c(2,3))
x <- sample(beads, 5) #by default, replace=FALSE

If you have to guess the color of the first bead, you will predict blue since blue has a 60% chance.
However, if I show you the result of the last four outcomes:

x[2:5]

[1] "blue" "blue" "red"  "blue"

would you still guess blue? Of course not.

Conditional probabilities on dependence

When events are not independent, conditional probabilities are useful.
We use the symbol \(\mid\) as shorthand for “conditional on”. For example:

\[ \mbox{Pr}(\mbox{Card 2 is a king} \mid \mbox{Card 1 is a king}) = 3/51 \]
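The value 3/51 can be checked by brute force. The sketch below is one possible way to do it in base R, assuming a standard 52-card deck represented by rank only and assuming every ordered pair of distinct cards is equally likely.

```r
# Represent a deck by rank only: 13 ranks, 4 cards each (suits do not matter here).
ranks <- rep(c("Ace", 2:10, "Jack", "Queen", "King"), times = 4)

# All ordered pairs of distinct card positions: 52 * 51 equally likely deals.
deals <- expand.grid(first = 1:52, second = 1:52)
deals <- deals[deals$first != deals$second, ]

first_king  <- ranks[deals$first] == "King"
second_king <- ranks[deals$second] == "King"

pr_first_king         <- mean(first_king)               # 4/52 = 1/13
pr_second_given_first <- mean(second_king[first_king])  # 3/51
```

Conditioning is done by subsetting: `second_king[first_king]` keeps only the deals in which the first card is a King, and the proportion within that subset is the conditional probability.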

Conditional probabilities for Independent events

When two events, say \(A\) and \(B\), are independent, we have:

\[ \mbox{Pr}(A \mid B) = \mbox{Pr}(A) \]

In fact, this can be considered the mathematical definition of independence.

Multiplication rule for intersection event

If we want to determine the probability of two events, say \(A\) and \(B\), occurring, we can use the multiplication rule:

\[ \mbox{Pr}(A \cap B) = \mbox{Pr}(A)\mbox{Pr}(B \mid A) \]

Multiplication rule: example

For example, counting the tens together with the face cards (16 cards in all):

\[ \begin{aligned} \mbox{Pr}(\mbox{Blackjack in first hand}) &= \mbox{Pr}(\mbox{Ace first})\mbox{Pr}(\mbox{Face card second}\mid \mbox{Ace first}) \\ &\quad + \mbox{Pr}(\mbox{Face card first})\mbox{Pr}(\mbox{Ace}\mid \mbox{Face card first}) \\ &= \frac{1}{13}\frac{16}{51} + \frac{4}{13}\frac{4}{51} \approx 0.0483 \end{aligned} \]

Multiplication rule for multiple events

We can use induction to expand for more events:

\[ \mbox{Pr}(A \cap B \cap C) = \mbox{Pr}(A)\mbox{Pr}(B \mid A)\mbox{Pr}(C \mid A \cap B) \]

When dealing with independent events, the multiplication rule becomes simpler:

\[ \mbox{Pr}(A \cap B \cap C) = \mbox{Pr}(A)\mbox{Pr}(B)\mbox{Pr}(C) \]

Multiplication rule example

Imagine a court case in which the suspect was described as having a mustache and a beard.
The defendant has both, and an “expert” testifies that 1/10 of men have beards and 1/5 have mustaches.
Using the multiplication rule, he concludes that only \(1/10 \times 1/5\), or 0.02, have both.
But this assumes independence!
If the conditional probability of a man having a mustache, conditional on him having a beard, is .95, then the correct probability is much higher: \(1/10 \times 95/100 = 0.095\).

computing conditional probability

The multiplication rule also gives us a general formula for computing conditional probabilities:

\[ \mbox{Pr}(B \mid A) = \frac{\mbox{Pr}(A \cap B)}{ \mbox{Pr}(A)} \]

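Addition rule for union of events

The addition rule gives the probability that at least one of two events occurs:

\[ \mbox{Pr}(A \cup B) = \mbox{Pr}(A) + \mbox{Pr}(B) - \mbox{Pr}(A \cap B) \]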

library(VennDiagram) 
rafalib::mypar() # sets some base plotting parameters, relevant to base R plots
grid.newpage() # clears the current grid graphics page
tmp <- draw.pairwise.venn(22, 20, 11, category = c("A", "B"),  # (area1 = 22, area2 = 20, cross.area = 11)
                   lty = rep("blank", 2), # remove the circle borders 
                   fill = c("light blue", "pink"),  
                   alpha = rep(0.5, 2),   
                   cat.dist = rep(0.025, 2), # distance of category labels from the circles
                   cex = 0, # text size for region counts; 0 hides the numbers
                   cat.cex = rep(2.5,2)) # text size for the category labels
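The conditional probability formula and the addition rule can both be verified on a small example. The sketch below uses a fair six-sided die, which is an assumption chosen for illustration and not part of the bead or card examples; the events A and B are likewise hypothetical.

```r
outcomes <- 1:6                 # a fair die: six equally likely outcomes
A <- outcomes %% 2 == 0         # event A: the roll is even   -> {2, 4, 6}
B <- outcomes > 3               # event B: the roll exceeds 3 -> {4, 5, 6}

pr_A       <- mean(A)           # 3/6
pr_B       <- mean(B)           # 3/6
pr_A_and_B <- mean(A & B)       # {4, 6} -> 2/6

# Addition rule: Pr(A or B) = Pr(A) + Pr(B) - Pr(A and B)
pr_union_rule   <- pr_A + pr_B - pr_A_and_B
pr_union_direct <- mean(A | B)  # {2, 4, 5, 6} -> 4/6

# Conditional probability: Pr(B | A) = Pr(A and B) / Pr(A)
pr_B_given_A <- pr_A_and_B / pr_A   # 2/3
```

Subtracting the intersection matters because the outcomes 4 and 6 belong to both events and would otherwise be counted twice.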

Random Variables

Random variables are numeric outcomes resulting from random processes. Formally, a random variable is a mapping from outcomes to real numbers:
\(X: \text{outcome}\mapsto \mathbb{R}\)
For example, define \(X\) to be 1 if a bead is blue and 0 if it is red:

beads <- rep( c("red", "blue"), times = c(2,3))
x <- ifelse(sample(beads, 1) == "blue", 1, 0)

ifelse(sample(beads, 1) == "blue", 1, 0)

[1] 0

ifelse(sample(beads, 1) == "blue", 1, 0)

[1] 1

ifelse(sample(beads, 1) == "blue", 1, 0)

[1] 0
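Since each call produces a new realization of the random variable, it can help to generate many of them at once. The following is a minimal sketch, assuming independent draws of one bead each; the seed and the number of repetitions are arbitrary choices made here.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
beads <- rep(c("red", "blue"), times = c(2, 3))

# Draw X many times; each draw is an independent pick of one bead.
B <- 10000
X <- replicate(B, ifelse(sample(beads, 1) == "blue", 1, 0))

table(X)   # counts of 0s and 1s
mean(X)    # proportion of 1s; should be close to Pr(blue) = 0.6
```

Because X only takes the values 0 and 1, its average is simply the proportion of draws that came up blue.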

More Random Variables examples

More interesting random variables are:
- the number of times we win in a game of chance,
- the number of Democrats in a random sample of 1,000 voters, and
- the proportion of patients randomly assigned to a control group in a drug trial.

Discrete probability: frequentist way

A more tangible way to think about the probability of an event is as the proportion of times the event occurs when we repeat the experiment an infinite number of times, independently, and under the same conditions.
This is the frequentist way of thinking about probability.

Discrete probability: Example

If I have 2 red beads and 3 blue beads inside an urn and I pick one at random, what is the probability of picking a red one? Our intuition tells us that the answer is 2/5 or 40%.

A precise definition can be given by noting that there are five possible outcomes, of which two satisfy the condition necessary for the event pick a red bead.
Since each of the five outcomes has an equal chance of occurring, we conclude that the probability is .4 for red and .6 for blue.

Monte Carlo

Monte Carlo simulations use computers to perform these experiments.
Random number generators permit us to mimic the process of picking at random.
The sample function in R uses a random number generator:

beads <- rep(c("red", "blue"), times = c(2,3))
sample(beads, 1)

[1] "blue"

Monte Carlo

If we repeat the experiment over and over, we can approximate the probability using the frequentist definition:

n <- 10^7
x <- sample(beads, n, replace = TRUE)
table(x)/n

x
     blue       red 
0.6002232 0.3997768

Note the definition is for \(n=\infty\). In practice we use very large numbers to get very close.

Probability distributions

An example of a probability distribution is:

Pr(picking a Republican)	=	0.44
Pr(picking a Democrat)	=	0.44
Pr(picking an undecided)	=	0.10
Pr(picking a Green)	=	0.02

The probability density for discrete r.v.

For categorical distributions, we can define the probability of a category.
For example, a roll of a die, let’s call it \(X\), can be 1, 2, 3, 4, 5 or 6.
The probability of 4 is defined as:

\[ \mbox{Pr}(X=4) = 1/6 \]

CDF for discrete RVs

The CDF can then easily be defined:

\[ \begin{aligned} F(4) &= & \mbox{Pr}(X\leq 4) \\ &= & \mbox{Pr}(X = 4) + \mbox{Pr}(X = 3) \\ & + & \mbox{Pr}(X = 2) + \mbox{Pr}(X = 1) \end{aligned} \]
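For a fair die the CDF is just a running sum of the individual probabilities. A short sketch, assuming each face has probability 1/6 as above:

```r
outcomes <- 1:6
pmf <- rep(1/6, 6)          # Pr(X = k) for k = 1, ..., 6
cdf <- cumsum(pmf)          # F(k) = Pr(X <= k)

F_4 <- cdf[4]               # Pr(X <= 4) = 4/6
```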

Setting the random seed

set.seed(2026-02-14) # evaluated as arithmetic: 2026 - 2 - 14 = 2010

When using random number generators you get a different answer each time.
This is fine, but if you want to ensure that results are consistent across runs, you can set R’s random number generation seed to a specific number, as above.
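The effect of the seed can be seen by resetting it and repeating a draw. This is a small sketch using the bead urn from earlier; the variable names are illustrative.

```r
seed <- 2026-02-14          # plain arithmetic: 2026 - 2 - 14 = 2010
beads <- rep(c("red", "blue"), times = c(2, 3))

set.seed(seed)
first_run <- sample(beads, 5)

set.seed(seed)              # resetting the same seed replays the same random draws
second_run <- sample(beads, 5)

identical(first_run, second_run)   # TRUE
```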

Combinations and permutations

Being able to count combinations and permutations is an important part of performing discrete probability computations.
Two tools are useful here: the base R function expand.grid, and the gtools functions permutations and combinations.
We start by generating a deck of cards:

suits <- c("Diamonds", "Clubs", "Hearts", "Spades") 
numbers <- c("Ace", "Deuce", "Three", "Four", "Five", "Six", "Seven",  
             "Eight", "Nine", "Ten", "Jack", "Queen", "King") 
# expand.grid makes a data frame of all (number, suit) combinations, varying number first:
# Ace Diamonds, Deuce Diamonds, ...
deck <- expand.grid(number = numbers, suit = suits)
deck <- paste(deck$number, deck$suit) # combine two columns with a space
head(deck,3)

[1] "Ace Diamonds"   "Deuce Diamonds" "Three Diamonds"

Permutations

All the ways to choose two numbers, in order, from a list consisting of 1, 2, 3:

library(gtools) 
permutations(3, 2)

     [,1] [,2]
[1,]    1    2
[2,]    1    3
[3,]    2    1
[4,]    2    3
[5,]    3    1
[6,]    3    2

The order matters here: 3,1 is different from 1,3.
(1,1), (2,2), and (3,3) do not appear because once we pick a number, it can’t appear again.
To compute all possible ways we can choose two cards when the order matters, we pass the deck through the v argument:

hands <- permutations(52, 2, v = deck)

Combinations

What about if the order does not matter? For example, in Blackjack, if you obtain an Ace and a face card in the first draw, it is called a Natural 21, and you win automatically.
If we wanted to compute the probability of this happening, we would enumerate the combinations, not the permutations, since the order does not matter.

combinations(3,2)

     [,1] [,2]
[1,]    1    2
[2,]    1    3
[3,]    2    3
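The Natural 21 probability can be computed by enumerating combinations. The sketch below uses base R's combn instead of gtools so that it runs without extra packages, which is a substitution made here. It assumes, as the 16/51 term in the multiplication rule example implies, that tens count together with Jacks, Queens, and Kings.

```r
# Deck by rank only; suits do not affect the value of the hand.
ranks <- rep(c("Ace", 2:10, "Jack", "Queen", "King"), times = 4)
ten_valued <- c("10", "Jack", "Queen", "King")   # the 16 cards treated as face cards

# combn enumerates every unordered pair of card positions: choose(52, 2) = 1326 hands.
hands <- combn(52, 2)
card1 <- ranks[hands[1, ]]
card2 <- ranks[hands[2, ]]

# The Ace can sit in either position because order does not matter.
natural_21 <- (card1 == "Ace" & card2 %in% ten_valued) |
              (card2 == "Ace" & card1 %in% ten_valued)

mean(natural_21)   # 64 / 1326, about 0.0483
```

The result agrees with the value obtained from the multiplication rule, which is a useful consistency check between the two approaches.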

Continuous probability

When summarizing a list of numeric values, such as heights, it is not useful to construct a distribution that assigns a proportion to each possible outcome.
Similarly, for a random variable that can take any value in a continuous set, it is impossible to assign a positive probability to each of the infinitely many possible values.

eCDF

We use the heights of adult male students as an example:

library(tidyverse) 
library(dslabs) 
x <- heights %>% filter(sex == "Male") %>% pull(height)

We define the empirical cumulative distribution function (eCDF) as:

F <- function(a) mean(x <= a)

which, for any value a, gives the proportion of values in the list x that are less than or equal to a.
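Base R also ships an ecdf function that returns the same quantity as a function. The sketch below compares the two on a small hypothetical vector of heights, used here only so the example runs without the dslabs data.

```r
# Hypothetical heights in inches, used only to keep the sketch self-contained.
x <- c(66, 68, 69, 70, 70, 71, 72, 74)

F <- function(a) mean(x <= a)   # eCDF as defined above
F_builtin <- ecdf(x)            # base R (stats) returns the eCDF as a function

F(70)          # 5 of the 8 values are <= 70, so 0.625
F_builtin(70)  # same value
```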

CDF

If I randomly pick one of the male students, what is the chance that he is taller than 70.5 inches?
Since every student has the same chance of being picked, the answer is the proportion of students that are taller than 70.5 inches.
Using the eCDF we obtain an answer by typing:

1 - F(70.5)

[1] 0.3633005

The CDF is a version of the eCDF that assigns theoretical probabilities for each \(a\) rather than proportions computed from data.

CDF

Specifically, the CDF for a random outcome \(X\) defines, for any number \(a\), the probability of observing a value less than or equal to \(a\).

\[ F(a) = \mbox{Pr}(X \leq a) \]

Once a CDF is defined, we can use it to compute the probability of any subset of values.

CDF

For instance, the probability of a student being between height a and height b is:

\[ \mbox{Pr}(a < X \leq b) = F(b)-F(a) \]

Since we can compute the probability for any possible event using this approach, the CDF defines the probability distribution.

pdf for continuous RVs

Although for continuous distributions the probability of a single value, \(\mbox{Pr}(X=x)\), is 0 and therefore uninformative, there is a theoretical quantity with a similar interpretation.
The probability density function \(f\) is defined as the function that satisfies

\[ F(a) = \mbox{Pr}(X\leq a) = \int_{-\infty}^a f(x)\, dx \]

\[ F(b) - F(a) = \int_a^b f(x)\,dx \]

Probability density function

The intuition is that even for continuous outcomes we can define tiny intervals, that are almost as small as points, that have positive probabilities.
If we think of the size of these intervals as the base of a rectangle, the probability density function \(f\) determines the height of the rectangle, so that the sum of the areas of these rectangles approximates the probability \(F(b) - F(a)\).
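The rectangle intuition can be tried numerically. The sketch below sums rectangle areas under a Gamma(shape = 2, rate = 2) density and compares the total with the difference of CDF values; the choice of distribution, the interval endpoints, and the number of rectangles are all assumptions made for illustration.

```r
a <- 0.5
b <- 2
n <- 1500                                   # number of rectangles
width <- (b - a) / n                        # base of each rectangle
mids <- a + (seq_len(n) - 0.5) * width      # midpoint of each small interval

# Height of each rectangle is the density f evaluated at the midpoint.
riemann <- sum(dgamma(mids, shape = 2, rate = 2) * width)
exact   <- pgamma(b, shape = 2, rate = 2) - pgamma(a, shape = 2, rate = 2)  # F(b) - F(a)

c(riemann = riemann, exact = exact)
```

Making the rectangles narrower (a larger n) brings the sum closer to the exact difference, which is the sense in which the integral is a limit of such sums.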

Probability density function

# Gamma(shape = 2, rate = 2) density: f(x) = b^a / Gamma(a) * x^(a-1) * e^(-b*x), with a = 2 and b = 2
cont <- data.frame(x = seq(0, 5, len = 300), y = dgamma(seq(0, 5, len = 300), 2, 2))
disc <- data.frame(x = seq(0, 5, 0.075), y = dgamma(seq(0, 5, 0.075), 2, 2)) 
ggplot(mapping = aes(x, y)) + 
  geom_col(data =  disc) + 
  geom_line(data = cont) + 
  ylab("f(x)")

geom_bar() defaults to stat = "count" and tallies raw data; geom_col() uses stat = "identity", so it needs both x and y and plots the y values as given.

Normal distribution

For a normal distribution with average \(m\) and standard deviation \(s\), the probability density function is given by:

\[ f(x) = \frac{1}{s\sqrt{2\pi}} e^{-\frac{1}{2}\left( \frac{x-m}{s} \right)^2} \]

The cumulative distribution for the normal distribution in R can be obtained with the function pnorm.
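As a sanity check, the normal density with its normalizing constant \(1/(s\sqrt{2\pi})\) can be coded by hand and compared with R's dnorm and pnorm. The function name normal_pdf and the test values below are hypothetical choices for this sketch.

```r
# Normal density written out from the formula, with mean m and standard deviation s.
normal_pdf <- function(x, m = 0, s = 1) {
  exp(-0.5 * ((x - m) / s)^2) / (s * sqrt(2 * pi))
}

x_grid <- seq(-3, 3, by = 0.5)
all.equal(normal_pdf(x_grid, m = 1, s = 2), dnorm(x_grid, mean = 1, sd = 2))  # TRUE

# pnorm(a) is the integral of the density from -Inf up to a.
area <- integrate(normal_pdf, lower = -Inf, upper = 1)$value
c(integral = area, pnorm = pnorm(1))   # the two values should nearly coincide
```

Without the normalizing constant the curve would have the same shape but would not integrate to 1, so it would not be a valid density.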

pdf of normal distribution

m <- mean(x) # average of the male heights defined earlier
s <- sd(x)   # standard deviation of the male heights
dat <- tibble(x = seq(-4, 4, length = 100)*s + m, # x\in [m-4s, m+4s]
              y = dnorm(x, m, s)) 
dat_ribbon <- filter(dat, x >= 2*s + m) # the right tail at least 2 sd from the mean
ggplot(dat, aes(x, y)) + 
  geom_line() + 
  geom_ribbon(aes(ymin = 0, ymax = y), data = dat_ribbon)

geom_ribbon(): shades the region between ymin and ymax, here the area under the curve in the right tail.

cdf of a normal distribution

Suppose a random quantity is normally distributed with average m and standard deviation s. Its CDF is:

F(a) = pnorm(a, m, s)

If we use the normal approximation, what is the probability that a randomly selected student is taller than 70.5 inches?

m <- mean(x) 
s <- sd(x) 
1 - pnorm(70.5, m, s)

[1] 0.371369

Distributions as approximations

Below is a plot of the (discrete) probability distribution of the reported heights:

rafalib::mypar() 
plot(prop.table(table(x)), xlab = "a = Height in inches", ylab = "Pr(X = a)")

table(): frequency counts. By default it drops NA values; use table(x, useNA = "ifany") to keep them.
prop.table(): divides those counts by the total count, giving relative frequencies.
For a two-way table, the margin argument controls the normalization:

prop.table(table(a, b), margin = 1)  # rows sum to 1
prop.table(table(a, b), margin = 2)  # cols sum to 1

Does it make sense?

While most students rounded their heights to the nearest inch, others reported values with more precision.
One student reported his height to be 69.6850393700787, which is 177 centimeters.
The probability assigned to this height is 0.0012315 or 1 in 812.
The probability for 70 inches is much higher at 0.1059113.
Does it really make sense to think of the probability of being exactly 70 inches as being different from that of being 69.6850393700787?

Use a continuous variable

Clearly it is much more useful to treat this outcome as a continuous numeric variable, keeping in mind that very few people, or perhaps none, are exactly 70 inches, and that the reason we get more values at 70 is because people round to the nearest inch.
With continuous distributions, the probability of a single value is 0.
We therefore could ask, what is the probability that someone is between 69.5 and 70.5?

Proportions from data

The normal distribution is useful for approximating the proportion of students reporting values in intervals like the following three:

mean(x <= 68.5) - mean(x <= 67.5)

[1] 0.114532

mean(x <= 69.5) - mean(x <= 68.5)

[1] 0.1194581

mean(x <= 70.5) - mean(x <= 69.5)

[1] 0.1219212

Proportions from normal distribution

Note how close we get with the normal approximation:

pnorm(68.5, m, s) - pnorm(67.5, m, s)

[1] 0.1031077

pnorm(69.5, m, s) - pnorm(68.5, m, s)

[1] 0.1097121

pnorm(70.5, m, s) - pnorm(69.5, m, s)

[1] 0.1081743

However, the approximation is not as useful for other intervals.

Problems caused by discretization

Notice how the approximation breaks down for an interval that does not contain an integer:

mean(x <= 70.9) - mean(x <= 70.1)

[1] 0.02216749

pnorm(70.9, m, s) - pnorm(70.1, m, s)

[1] 0.08359562

Although the true height distribution is continuous, the reported heights tend to be more common at discrete values, in this case, due to rounding. We call this situation discretization.
As long as we are aware of how to deal with this reality, the normal approximation can still be a very useful tool.

Normal distribution example

For example, to use the normal approximation to estimate the probability of someone being taller than 76 inches, we use:

1 - pnorm(76, m, s)

[1] 0.03206008

`rnorm`

R provides functions to generate normally distributed outcomes.
Specifically, the rnorm function takes three arguments: size, average (defaults to 0), and standard deviation (defaults to 1), and produces random numbers.
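A brief illustration of the three arguments follows. The average and standard deviation used in the second call are hypothetical values, and the seed is arbitrary.

```r
set.seed(1)                             # arbitrary seed for reproducibility
z <- rnorm(5)                           # five draws with the defaults: mean 0, sd 1
y <- rnorm(10000, mean = 69, sd = 3)    # hypothetical average and standard deviation

length(z)                               # 5
c(mean(y), sd(y))                       # close to 69 and 3, but not exactly equal
```

The sample average and standard deviation differ slightly from the requested values because the draws are random; the gap shrinks as the sample size grows.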

Monte Carlo example

Here is an example of how we could generate data that looks like our reported heights:

n <- length(x) 
m <- mean(x) 
s <- sd(x) 
simulated_heights <- rnorm(n, m, s)

Monte Carlo simulation

Generating normally distributed outcomes is useful because it permits us to create data that mimics natural events and to answer questions about what could happen by chance by running Monte Carlo simulations.
If, for example, we pick 800 males at random, what is the distribution of the tallest person? How rare is a seven-footer in a group of 800 males? The following Monte Carlo simulation helps us answer that question:

Monte Carlo

B <- 10000 
tallest <- replicate(B, { 
  simulated_data <- rnorm(800, m, s) 
  max(simulated_data) 
})

replicate(B, {...}): run the block inside {...} B times.
Having a seven-footer is quite rare:

mean(tallest >= 7*12)

[1] 0.0192
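For this particular question the simulation can be cross-checked against a closed form: if the 800 heights are independent draws from the same normal distribution, the tallest is below 84 inches only when all 800 are, so the probability is \(1 - F(84)^{800}\). The sketch below uses rounded stand-in values for m and s, which are assumptions and not the exact mean and standard deviation of the heights data.

```r
# Assumed, rounded values standing in for mean(x) and sd(x) from the heights data.
m <- 69.3
s <- 3.6
n_people <- 800

# Pr(tallest >= 84) = 1 - Pr(all 800 are below 84) = 1 - F(84)^800
pr_seven_footer <- 1 - pnorm(7 * 12, m, s)^n_people
pr_seven_footer
```

With these inputs the result should fall in the same neighborhood as the simulated proportion; an exact match is not expected because of the rounding of m and s and the Monte Carlo error in the simulation.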

Here is the resulting distribution:

Note that it does not look normal.

Other Continuous distributions

The normal distribution is not the only useful theoretical distribution.
Other continuous distributions that we may encounter include the Student's t, chi-square, exponential, gamma, and beta.
R provides functions to compute the density, the quantiles, the cumulative distribution functions and to generate Monte Carlo simulations.

Other continuous distributions

R uses the letters d, q, p, and r in front of a shorthand for the distribution.
We have already seen the functions dnorm, pnorm, and rnorm for the normal distribution.
The function qnorm gives us the quantiles.

library(ggplot2)
x <- seq(-4, 4, length.out = 100) 
df <- data.frame(x = x, f = dnorm(x)) # standard normal density on a grid
# qplot() is deprecated as of ggplot2 3.4.0; older code wrote:
# qplot(x, f, geom = "line", data = df)
ggplot(df, aes(x, f)) + geom_line()
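The p and q functions are inverses of each other, and the same prefix scheme carries over to other distributions. A short sketch, using the exponential distribution as an arbitrary second example:

```r
p <- c(0.025, 0.5, 0.975)
q <- qnorm(p)            # quantiles of the standard normal: about -1.96, 0, 1.96
pnorm(q)                 # pnorm inverts qnorm, returning 0.025, 0.5, 0.975

# The same prefixes work for other distributions, e.g. the exponential ("exp").
qexp(0.5, rate = 1)      # median of an Exponential(1): log(2)
pexp(log(2), rate = 1)   # 0.5
```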
