ggplot2

http://rafalab.dfci.harvard.edu/dsbook-part-1/dataviz/ggplot2.html

We first load four libraries:

library(dplyr)
library(ggplot2)
library(dslabs)
library(lattice)

The graph we will create

The components of a graph

  • gg stands for grammar of graphics.

  • Analogy: we learn verbs and nouns to construct sentences.

  • The first step in learning ggplot2 is breaking a graph apart into components.

Three main components of a graph

  • Data: usually a data frame, such as murders, which was used to generate the previous plot.
  • Geometry: the plot type, such as the scatterplot above.
  • Aesthetic mapping: how we map visual cues (x, y, color, labels, etc.) to the data.
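
As a preview of how the three components fit together in code, here is a minimal sketch. The data frame toy and its values are made up for illustration (they are not the murders data); the functions used are introduced in the slides that follow.

```r
library(ggplot2)

# Made-up toy data (hypothetical values, not the murders dataset)
toy <- data.frame(
  abb        = c("A", "B", "C", "D"),
  population = c(1e6, 5e6, 1e7, 2e7),
  total      = c(20, 150, 400, 900)
)

p_toy <- ggplot(data = toy) +                # 1. data
  geom_point(                                # 2. geometry: a scatterplot
    aes(x = population / 10^6, y = total)    # 3. aesthetic mapping: columns -> x and y
  )
print(p_toy)
```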

ggplot objects

  • Start by defining the dataset:
ggplot(data = murders)
  • We can also use the pipe:
murders |> ggplot()
  • We assign the output to a variable
p <- ggplot(data = murders)
class(p)
[1] "ggplot2::ggplot" "ggplot"          "ggplot2::gg"     "S7_object"      
[5] "gg"             

Graph layers

  • We create graphs by adding layers.

  • Layers define geometries, compute summary statistics, define what scales to use, or even change styles.

  • To add layers, we use the symbol +.

  • In general, a line of code will look like this:

DATA |> ggplot() + LAYER 1 + LAYER 2 + ... + LAYER N
  • Usually, the first added layer defines the geometry.
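
The DATA + LAYER pattern above can be sketched on a small example. This assumes a made-up data frame toy (hypothetical values, not the murders data); layer_data() is used only to confirm that each added layer is built from the same rows.

```r
library(ggplot2)

# Made-up toy data (hypothetical values, not the murders dataset)
toy <- data.frame(
  abb        = c("A", "B", "C", "D"),
  population = c(1e6, 5e6, 1e7, 2e7),
  total      = c(20, 150, 400, 900)
)

base <- toy |> ggplot()                                          # DATA, no layers yet
with_points <- base + geom_point(aes(population / 10^6, total))  # LAYER 1: geometry
with_labels <- with_points +
  geom_text(aes(population / 10^6, total, label = abb))          # LAYER 2: text labels

# Each layer is built from the same four rows of toy
nrow(layer_data(with_labels, 1))
nrow(layer_data(with_labels, 2))
```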

Geometries

Scatter plot: geom_point

murders |> ggplot() + geom_point(aes(x = population/10^6, y = total))

Add a layer

  • Since we defined p earlier, we can add a layer like this:
p + geom_point(aes(population/10^6, total))
  • Note that x = and y = can be omitted because they are the first two arguments of aes.

Add text with geom_text

p + geom_point(aes(population/10^6, total)) +
  geom_text(aes(population/10^6, total, label = abb))

Scope of aesthetic mappings

  • This call is fine:
p_test <- p + geom_text(aes(population/10^6, total, label = abb))
  • This call is not:
p_test <- p + geom_text(aes(population/10^6, total), label = abb) 
  • It gives an error because abb is not found: it is outside of the aes function.

  • Outside of aes, geom_text does not know where to find abb: it’s a column name and not a global variable.
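
The difference can be checked without stopping the session by catching the error. This is a minimal sketch on made-up data; toy and the column toy_label are hypothetical names chosen so that no global object shares them.

```r
library(ggplot2)

# Made-up toy data (hypothetical values, not the murders dataset)
toy <- data.frame(
  population = c(1e6, 5e6, 1e7),
  total      = c(20, 150, 400),
  toy_label  = c("A", "B", "C")   # hypothetical column; no global object has this name
)
p0 <- ggplot(data = toy)

# Inside aes(): toy_label is looked up as a column of toy when the plot is built
ok <- p0 + geom_text(aes(population / 10^6, total, label = toy_label))

# Outside aes(): R evaluates toy_label right away as an ordinary variable and fails
res <- tryCatch(
  p0 + geom_text(aes(population / 10^6, total), label = toy_label),
  error = function(e) e
)
inherits(res, "error")   # TRUE: the call raised an error
conditionMessage(res)    # the message reports that toy_label could not be found
```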

Multipanel plots - lattice

pl <- xyplot(total ~ population/10^6 | region, data = murders, # one panel per region
             type = c("p", "g"),                               # points + grid
             xlab = "Population/10^6", ylab = "Total",
             strip = strip.custom(strip.names = TRUE, var.name = "region"),
             layout = c(2, 2))
# strip: title bar in each panel; strip.names = TRUE shows the variable name "region"
print(pl)

Multipanel plots - ggplot2

pg <- ggplot(murders, aes(population/10^6, total)) + geom_point(shape = 1) +
    facet_wrap(~region) + theme(aspect.ratio = 1) #square panels
print(pg)
# shape =1: open circles; shape=16: filled circles

Point size

p + geom_point(aes(population/10^6, total), size = 3) +
  geom_text(aes(population/10^6, total, label = abb))
  • size can be an aesthetic mapping, but here it is not, so all points get bigger.
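
To see the contrast, the sketch below sets size once as a constant and once as an aesthetic mapping. It assumes a made-up data frame toy (hypothetical values); mapping size to total is only an illustration, not something the slides do.

```r
library(ggplot2)

# Made-up toy data (hypothetical values, not the murders dataset)
toy <- data.frame(
  population = c(1e6, 5e6, 1e7, 2e7),
  total      = c(20, 150, 400, 900)
)

# size outside aes(): one fixed value applied to every point
p_fixed <- ggplot(toy) +
  geom_point(aes(population / 10^6, total), size = 3)

# size inside aes(): point size varies with the data, and a legend is added
p_mapped <- ggplot(toy) +
  geom_point(aes(population / 10^6, total, size = total))

unique(layer_data(p_fixed, 1)$size)            # a single value shared by all points
length(unique(layer_data(p_mapped, 1)$size))   # one size per distinct value of total
```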

Tinkering with arguments: nudge_x

p + geom_point(aes(population/10^6, total), size = 3) +
  geom_text(aes(population/10^6, total, label = abb), nudge_x = 1.5) #shift x right
  • nudge_x is not an aesthetic mapping.

Global mappings

  • Global mappings are defined with aes in the mapping argument of the ggplot function:
args(ggplot) # show the arguments of ggplot()
function (data = NULL, mapping = aes(), ..., environment = parent.frame()) 
NULL

Global versus local mappings

  • All the layers will assume the global mapping unless we explicitly define another one.
p <- murders |> ggplot(aes(population/10^6, total, label = abb))
p + geom_point(size = 3) + geom_text(nudge_x = 1.5)
  • The two layers use the global mapping.

Global versus local mappings

  • We can override the global aes by defining one in the geometry functions:
p + geom_point(size = 3) +  # global mapping
  geom_text(aes(x = 10, y = 800, label = "Hello there!")) #local mapping
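
One consequence of a local aes made of constants is that it is still evaluated once per row of the data, so the label is drawn repeatedly on top of itself. The sketch below illustrates this on made-up data (toy is hypothetical) and compares it with annotate(), a ggplot2 function not used in these slides, which adds a single label.

```r
library(ggplot2)

# Made-up toy data (hypothetical values, not the murders dataset)
toy <- data.frame(
  abb        = c("A", "B", "C", "D"),
  population = c(1e6, 5e6, 1e7, 2e7),
  total      = c(20, 150, 400, 900)
)
p_global <- toy |> ggplot(aes(population / 10^6, total, label = abb))

# Local aes() with constants: overrides the global mapping, one label per row of toy
p_local <- p_global + geom_point() +
  geom_text(aes(x = 10, y = 800, label = "Hello there!"))

# annotate() adds a single label that is not tied to the data
p_annot <- p_global + geom_point() +
  annotate("text", x = 10, y = 800, label = "Hello there!")

nrow(layer_data(p_local, 2))   # 4: one copy of the label per row
nrow(layer_data(p_annot, 2))   # 1
```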

Scales

  • Layers can define transformations:
p + geom_point(size = 3) +  
  geom_text(nudge_x = 0.05) + 
  scale_x_continuous(trans = "log10") +
  scale_y_continuous(trans = "log10") 

Scales

  • This particular transformation is so common that ggplot2 provides the specialized functions:
p + geom_point(size = 3) +  
  geom_text(nudge_x = 0.05) + 
  scale_x_log10() +
  scale_y_log10() 
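
A quick way to see what the scale layers do is to inspect the built plot: the point positions are stored on the log10 scale, while the axis labels keep the original units. This is a minimal sketch on made-up data (toy is hypothetical).

```r
library(ggplot2)

# Made-up toy data (hypothetical values, not the murders dataset)
toy <- data.frame(
  population = c(1e6, 5e6, 1e7, 2e7),
  total      = c(20, 150, 400, 900)
)

p_log <- ggplot(toy, aes(population / 10^6, total)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10()

built <- layer_data(p_log, 1)

# Compare the stored positions with log10 of the original values
all.equal(as.numeric(built$x), log10(toy$population / 10^6))
all.equal(as.numeric(built$y), log10(toy$total))
```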

Labels and titles

p + geom_point(size = 3) +  
  geom_text(nudge_x = 0.05) + 
  scale_x_log10() +
  scale_y_log10() +
  xlab("Populations in millions (log scale)") + 
  ylab("Total number of murders (log scale)") +
  ggtitle("US Gun Murders in 2010")

Labels and titles with labs

p + geom_point(size = 3) +  
  geom_text(nudge_x = 0.05) + 
  scale_x_log10() +
  scale_y_log10() +
  labs(x = "Populations in millions (log scale)", 
       y = "Total number of murders (log scale)", 
       title = "US Gun Murders in 2010")
  • This produces the same graph as in the previous slide.

Almost there

p + geom_point(size = 3) +  
  geom_text(nudge_x = 0.05) + 
  scale_x_log10() +
  scale_y_log10() +
  labs(x = "Populations in millions (log scale)", 
       y = "Total number of murders (log scale)", 
       title = "US Gun Murders in 2010")

Adding color

murders |> ggplot(aes(population/10^6, total, label = abb)) +   
  geom_text(nudge_x = 0.05) + 
  scale_x_log10() +
  scale_y_log10() +
  labs(x = "Populations in millions (log scale)", 
       y = "Total number of murders (log scale)", 
       title = "US Gun Murders in 2010") +
  geom_point(size = 3, color = "blue")

A mapped color

murders |> ggplot(aes(population/10^6, total, label = abb)) +   
  geom_text(nudge_x = 0.05) + 
  scale_x_log10() +
  scale_y_log10() +
  labs(x = "Populations in millions (log scale)", 
       y = "Total number of murders (log scale)", 
       title = "US Gun Murders in 2010") +
  geom_point(aes(col = region), size = 3)

A legend is added automatically!

Change legend name

murders |> ggplot(aes(population/10^6, total, label = abb)) +   
  geom_text(nudge_x = 0.05) + 
  scale_x_log10() +
  scale_y_log10() +
  labs(x = "Populations in millions (log scale)", 
       y = "Total number of murders (log scale)", 
       title = "US Gun Murders in 2010",
       color = "Region") + # Legend name comes from inside labs()
  geom_point(aes(col = region), size = 3)

Other adjustments

  • Add a line representing the average murder rate for the entire US. We first compute this rate per million people:
r <- murders |> 
  summarize(rate = sum(total) /  sum(population) * 10^6) |> 
  pull(rate)

Add a line

murders |> ggplot(aes(population/10^6, total, label = abb)) +   
  geom_text(nudge_x = 0.05) + 
  scale_x_log10() +
  scale_y_log10() +
  labs(x = "Populations in millions (log scale)", 
       y = "Total number of murders (log scale)", 
       title = "US Gun Murders in 2010",
       color = "Region") +
  geom_point(aes(col = region), size = 3) +
  geom_abline(intercept = log10(r), lty = 2, color = "darkgrey") 
# geom_abline() uses slope = 1 by default; lty = 2 draws a dashed line

Line types (lty): 0 = blank, 1 = solid, 2 = dashed, 3 = dotted, 4 = dotdash, 5 = longdash, 6 = twodash
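
Why the intercept is log10(r): a state exactly at the average rate has total = r × (population in millions), and taking log10 of both sides gives log10(total) = log10(r) + log10(population in millions), a line with slope 1 and intercept log10(r) on the log-log axes. The sketch below checks this numerically with made-up values (toy is hypothetical, not the murders data).

```r
# Made-up toy values (hypothetical, not the murders dataset)
toy <- data.frame(
  population = c(1e6, 5e6, 1e7, 2e7),
  total      = c(20, 150, 400, 900)
)

# Average rate per million people, computed the same way as r in the slides
r <- sum(toy$total) / sum(toy$population) * 10^6

# A state exactly at the average rate has total = r * (population in millions)
x_millions <- c(1, 10, 100)
y_total    <- r * x_millions

# On log10 axes these points lie on a line with slope 1 and intercept log10(r)
all.equal(log10(y_total), log10(r) + 1 * log10(x_millions))
```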

We are close!

Assign the graph to a variable

p <- murders |> ggplot(aes(population/10^6, total, label = abb)) +   
  geom_text(nudge_x = 0.05) + 
  scale_x_log10() +
  scale_y_log10() +
  labs(x = "Populations in millions (log scale)", 
       y = "Total number of murders (log scale)", 
       title = "US Gun Murders in 2010",
       color = "Region") +
  geom_point(aes(col = region), size = 3) +
  geom_abline(intercept = log10(r), lty = 2, color = "darkgrey")

Add-on packages

  • The dslabs package can define the look used in the textbook:
ds_theme_set()
  • Many other themes are added by the package ggthemes.

Add-on packages

ggthemes provides pre-designed themes.

library(ggthemes)
p + theme_economist()

theme_fivethirtyeight()

Here is the FiveThirtyEight theme:

p + theme_fivethirtyeight()

theme_excel()

Maybe not a good one!

p + theme_excel()

theme_starwars()

ThemePark provides fun themes:

library(ThemePark)
p + theme_starwars()

theme_barbie()

This is a fan favorite:

p + theme_barbie()

geom_text_repel()

  • To avoid the state abbreviations being on top of each other we can use the ggrepel package.

  • We change the layer geom_text(nudge_x = 0.05) to geom_text_repel()

Putting it all together

library(ggthemes)
library(ggrepel)

r <- murders |> 
  summarize(rate = sum(total) /  sum(population) * 10^6) |>
  pull(rate)

murders |> ggplot(aes(population/10^6, total, label = abb)) +   
  geom_abline(intercept = log10(r), lty = 2, color = "darkgrey") +
  geom_point(aes(col = region), size = 3) +
  geom_text_repel() + 
  scale_x_log10() +
  scale_y_log10() +
  labs(x = "Populations in millions (log scale)", 
       y = "Total number of murders (log scale)", 
       title = "US Gun Murders in 2010",
       color = "Region") +
  theme_economist()

Grids of plots

  • We often want to put plots next to each other.

  • The gridExtra package permits us to do that:

library(gridExtra)
p1 <- murders |> 
  ggplot(aes(log10(population))) + 
  geom_histogram()
p2 <- murders |> 
  ggplot(aes(log10(population), log10(total))) + 
  geom_point()
grid.arrange(p1, p2, ncol = 2)

Other packages for grids of plots

  • cowplot: A versatile package designed for publication-quality plots, offering seamless integration with ggplot2.

  • ggpubr: Provides user-friendly functions for combining and annotating ggplot2 plots with minimal effort.
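
As a rough sketch of how these two packages could be used for the same side-by-side layout, assuming they are installed: cowplot::plot_grid() and ggpubr::ggarrange() both accept a list of plots plus ncol and labels. The data frame toy is made up for illustration, and each call is skipped if its package is missing.

```r
library(ggplot2)

# Made-up toy data (hypothetical values, not the murders dataset)
toy <- data.frame(
  population = c(1e6, 5e6, 1e7, 2e7),
  total      = c(20, 150, 400, 900)
)
p1 <- ggplot(toy, aes(log10(population))) + geom_histogram(bins = 5)
p2 <- ggplot(toy, aes(log10(population), log10(total))) + geom_point()

# cowplot: plot_grid() arranges plots and can label the panels
if (requireNamespace("cowplot", quietly = TRUE)) {
  print(cowplot::plot_grid(p1, p2, ncol = 2, labels = c("A", "B")))
}

# ggpubr: ggarrange() offers a similar interface
if (requireNamespace("ggpubr", quietly = TRUE)) {
  print(ggpubr::ggarrange(p1, p2, ncol = 2, labels = c("A", "B")))
}
```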