Tidyverse

http://rafalab.dfci.harvard.edu/dsbook-part-1/R/tidyverse.html

Tidyverse

library(tidyverse)
  • The tidyverse is not a package but a group of packages.

  • It is a popular tool for data analysis that simplifies code at the cost of some flexibility.

  • One way code is simplified is by ensuring that all functions take and return tidy data.

Tidy data (long data)

  • Stored in a data frame.

  • Each observation is exactly one row.

  • Variables are stored in columns.

  • Not all data can be represented this way.

  • Assuming data is tidy simplifies coding.

Tidy data

  • This is an example of a tidy dataset:
library(dslabs)
tidy_data <- gapminder |> 
  filter(country %in% c("South Korea", "Germany") & !is.na(fertility)) |>
  select(country, year, fertility)
head(tidy_data, 6)
      country year fertility
1     Germany 1960      2.41
2 South Korea 1960      6.16
3     Germany 1961      2.44
4 South Korea 1961      5.99
5     Germany 1962      2.47
6 South Korea 1962      5.79

Tidy data

  • Originally, the data was in the following format:
path <- system.file("extdata", package = "dslabs") # full path of the "extdata" directory in the dslabs package
filename <- file.path(path, "fertility-two-countries-example.csv")
# file.path joins the components with the platform file separator, which R sets to "/" on Windows as well as Mac/Linux
wide_data <- read_csv(filename)
select(wide_data, country, `1960`:`1970`) |> as.data.frame()
      country 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970
1     Germany 2.41 2.44 2.47 2.49 2.49 2.48 2.44 2.37 2.28 2.17 2.04
2 South Korea 6.16 5.99 5.79 5.57 5.36 5.16 4.99 4.85 4.73 4.62 4.53
  • This is not tidy: each row holds several observations (one per year), and the year variable is stored in the column names.
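
As a rough sketch of how the wide table relates to the tidy one, the year columns can be stacked into a year variable and a fertility variable. The example below assumes the tidyr function pivot_longer and uses a hypothetical two-year miniature of the table above; the object names wide_small and tidy_small are illustrative and not part of the original material.

```r
library(tidyr)

# Hypothetical miniature of the wide table: two countries, two year columns.
# check.names = FALSE keeps the column names "1960" and "1961" unchanged.
wide_small <- data.frame(
  country = c("Germany", "South Korea"),
  `1960` = c(2.41, 6.16),
  `1961` = c(2.44, 5.99),
  check.names = FALSE
)

# Stack every column except country: the old column names go to `year`,
# the cell values go to `fertility`.
tidy_small <- pivot_longer(wide_small, cols = -country,
                           names_to = "year", values_to = "fertility")

# Column names are strings, so convert year to integer.
tidy_small$year <- as.integer(tidy_small$year)
tidy_small
```

Each row of tidy_small is now one country-year observation, matching the layout of tidy_data.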

Tidyverse packages

  • tibble - modifies the data frame class.

  • readr - import data.

  • dplyr - used to modify data frames.

  • ggplot2 - simplifies plotting.

  • tidyr - helps convert data into tidy format.

  • stringr - string processing.

  • forcats - utilities for categorical data.

  • purrr - tidy version of apply functions.

dplyr

  • We focus on the following functions:

    • mutate

    • select

    • across

    • filter

    • group_by

    • summarize

Adding a column with mutate

colnames(murders)
[1] "state"      "abb"        "region"     "population" "total"     
murders <- mutate(murders, rate = total/population*100000)
colnames(murders)
[1] "state"      "abb"        "region"     "population" "total"     
[6] "rate"      
filter(murders, rate <= 0.71)
          state abb        region population total      rate
1        Hawaii  HI          West    1360301     7 0.5145920
2          Iowa  IA North Central    3046355    21 0.6893484
3 New Hampshire  NH     Northeast    1316470     5 0.3798036
4  North Dakota  ND North Central     672591     4 0.5947151
5       Vermont  VT     Northeast     625741     2 0.3196211
  • Notice that here we used total and population inside the function, which are objects that are not defined in our workspace.

  • This is known as non-standard evaluation: the context (here, the data frame) is used to determine what the variable names mean.

  • Tidyverse extensively uses non-standard evaluation.
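
To make the contrast concrete, here is a minimal sketch comparing base R, where columns must be reached through the data frame, with mutate, where bare column names are looked up inside the data frame. The two-row table df_small is a hypothetical stand-in for murders, with figures copied from the output above.

```r
library(dplyr)

# Hypothetical two-row stand-in for the murders table.
df_small <- data.frame(
  state = c("Hawaii", "Vermont"),
  population = c(1360301, 625741),
  total = c(7, 2)
)

# Base R (standard evaluation): columns are accessed explicitly with $.
base_rate <- df_small$total / df_small$population * 100000

# dplyr (non-standard evaluation): total and population are resolved
# as columns of df_small, not as objects in the workspace.
df_small <- mutate(df_small, rate = total / population * 100000)

# Both approaches give the same numbers.
all.equal(df_small$rate, base_rate)
```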

Subsetting with filter

filter(murders, rate <= 0.71)
          state abb        region population total      rate
1        Hawaii  HI          West    1360301     7 0.5145920
2          Iowa  IA North Central    3046355    21 0.6893484
3 New Hampshire  NH     Northeast    1316470     5 0.3798036
4  North Dakota  ND North Central     672591     4 0.5947151
5       Vermont  VT     Northeast     625741     2 0.3196211

Selecting columns with select

new_table <- select(murders, state, region, rate)
filter(new_table, rate <= 0.71)
          state        region      rate
1        Hawaii          West 0.5145920
2          Iowa North Central 0.6893484
3 New Hampshire     Northeast 0.3798036
4  North Dakota North Central 0.5947151
5       Vermont     Northeast 0.3196211

Transforming variables

  • The function mutate can also be used to transform variables.

  • For example, the following code takes the log transformation of the population variable:

mutate(murders, population = log10(population)) |> head()
       state abb region population total     rate
1    Alabama  AL  South   6.679404   135 2.824424
2     Alaska  AK   West   5.851400    19 2.675186
3    Arizona  AZ   West   6.805638   232 3.629527
4   Arkansas  AR  South   6.464775    93 3.189390
5 California  CA   West   7.571172  1257 3.374138
6   Colorado  CO   West   6.701499    65 1.292453

Transforming several variables

  • Often, we need to apply the same transformation to several variables.

  • The function across facilitates the operation.

  • For example, if we want to log-transform both population and total murders, we can use:

mutate(murders, across(c(population, total), log10)) |> head()
       state abb region population    total     rate
1    Alabama  AL  South   6.679404 2.130334 2.824424
2     Alaska  AK   West   5.851400 1.278754 2.675186
3    Arizona  AZ   West   6.805638 2.365488 3.629527
4   Arkansas  AR  South   6.464775 1.968483 3.189390
5 California  CA   West   7.571172 3.099335 3.374138
6   Colorado  CO   West   6.701499 1.812913 1.292453

More examples with across

  • Apply the same transformation to all numeric variables:
mutate(murders, across(where(is.numeric), log10)) |> head(2)
    state abb region population    total      rate
1 Alabama  AL  South   6.679404 2.130334 0.4509299
2  Alaska  AK   West   5.851400 1.278754 0.4273540
  • Or to all character variables:
mutate(murders, across(where(is.character), tolower)) |> head(2)
    state abb region population total     rate
1 alabama  al  South    4779736   135 2.824424
2  alaska  ak   West     710231    19 2.675186
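
In the examples above, across overwrites the selected columns. If the originals should be kept, one option is the .names argument of across, which builds new column names from a template. The sketch below assumes a hypothetical two-row table df_small and an illustrative log10_ prefix.

```r
library(dplyr)

# Hypothetical two-row stand-in for the murders table.
df_small <- data.frame(
  state = c("Alabama", "Alaska"),
  population = c(4779736, 710231),
  total = c(135, 19)
)

# "{.col}" is replaced by each selected column's name, so the
# transformed values land in new columns next to the originals.
out <- mutate(df_small,
              across(c(population, total), log10, .names = "log10_{.col}"))
names(out)
```

The result has the three original columns plus log10_population and log10_total.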

The pipe: |>

  • We use the pipe to chain a series of operations.

  • For example, if we want to select columns and then filter rows: \[ \mbox{original data } \rightarrow \mbox{ select } \rightarrow \mbox{ filter } \]

murders |> select(state, region, rate) |> filter(rate <= 0.71)
          state        region      rate
1        Hawaii          West 0.5145920
2          Iowa North Central 0.6893484
3 New Hampshire     Northeast 0.3798036
4  North Dakota North Central 0.5947151
5       Vermont     Northeast 0.3196211
  • The output of the left side of a pipe is used as the input (as first argument) for the function on the right.

  • Here is a simple example:

16 |> sqrt() |> log(base = 2)
[1] 2
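
One way to read a pipeline is to rewrite it as nested calls. The sketch below, which needs only base R 4.1 or later for the native pipe, shows that the piped expression above is the same computation as the nested one; the names nested and piped are illustrative.

```r
# Nested form: read inside out. sqrt(16) is 4, and log base 2 of 4 is 2.
nested <- log(sqrt(16), base = 2)

# Piped form: read left to right. Each result becomes the first
# argument of the next call; base = 2 is passed as an extra argument.
piped <- 16 |> sqrt() |> log(base = 2)

all.equal(nested, piped)
```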

Summarizing data

  • We use the dplyr summarize function, not to be confused with summary from base R.
murders |> summarize(avg = mean(rate))
       avg
1 2.779125
  • Is this calculation correct?

Summarizing data: Correct calculation

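  • No. The mean of the state rates gives every state the same weight, regardless of its population.

  • The correct calculation divides total murders by total population: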

murders |> summarize(rate = sum(total)/sum(population)*100000)
      rate
1 3.034555
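
To see why the two numbers differ, consider a hypothetical table with one large and one small state (the figures below are made up for illustration). The unweighted mean of the rates treats both states equally, while the pooled rate is equivalent to a mean weighted by population.

```r
# Hypothetical data: one large state and one small state.
df_small <- data.frame(
  population = c(1000000, 100000),
  total = c(50, 1)
)
df_small$rate <- df_small$total / df_small$population * 100000

# Unweighted mean of the two state rates (about 5 and about 1).
simple_mean <- mean(df_small$rate)

# Pooled rate: all murders divided by all residents.
pooled <- sum(df_small$total) / sum(df_small$population) * 100000

# The pooled rate equals the population-weighted mean of the state rates.
weighted <- weighted.mean(df_small$rate, df_small$population)

c(simple_mean = simple_mean, pooled = pooled, weighted = weighted)
```

The pooled and weighted values agree, and both sit closer to the large state's rate than the simple mean does.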

Multiple summaries

  • Suppose we want the median, minimum and max population size:
murders |> summarize(median = median(population), min = min(population), max = max(population))
   median    min      max
1 4339367 563626 37253956
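  • Can we use quantile to compute all three at once? It returns three values, and summarize expects one row per group: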
murders |> summarize(quantiles = quantile(population, c(0.5, 0, 1)))
Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
dplyr 1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
  always returns an ungrouped data frame and adjust accordingly.
  quantiles
1   4339367
2    563626
3  37253956

Multiple summaries with reframe

murders |> reframe(quantiles = quantile(population, c(0.5, 0, 1)))
  quantiles
1   4339367
2    563626
3  37253956

Multiple summaries with a data frame

  • To have a column per summary, as when we called min, median, and max separately, we have to define a function that returns a data frame like this:
median_min_max <- function(x){
  qs <- quantile(x, c(0.5, 0, 1))
  data.frame(median = qs[1], min = qs[2], max = qs[3])
}
  • Then we can call summarize:
murders |> summarize(median_min_max(population))
   median    min      max
1 4339367 563626 37253956

Compare

 median_min_max(murders$population) 
     median    min      max
50% 4339367 563626 37253956
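# The row name 50% is the name that quantile gives to qs[1]; data.frame takes its row names from the names of the first column's values. summarize drops it, so the row in the previous output is labeled 1.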

Group using group_by

  • Let’s compute murder rate by region.
murders |> group_by(region) |> head(4)
# A tibble: 4 × 6
# Groups:   region [2]
  state    abb   region population total  rate
  <chr>    <chr> <fct>       <dbl> <dbl> <dbl>
1 Alabama  AL    South     4779736   135  2.82
2 Alaska   AK    West       710231    19  2.68
3 Arizona  AZ    West      6392017   232  3.63
4 Arkansas AR    South     2915918    93  3.19
  • Note the Groups: region [2] line at the top. The data is grouped by region, and 2 is the number of distinct groups in the first 4 rows.

  • This is a special data frame called a grouped data frame.

group_by then summarize

murders |> 
  group_by(region) |> 
  summarize(rate = sum(total) / sum(population) * 100000)
# A tibble: 4 × 2
  region         rate
  <fct>         <dbl>
1 Northeast      2.66
2 South          3.63
3 North Central  2.73
4 West           2.66
  • The summarize function applies the summarization to each group separately.

group_by then summarize

  • Compute the median, minimum, and maximum population in the four regions:
murders |> group_by(region) |> summarize(median_min_max(population))
# A tibble: 4 × 4
  region          median    min      max
  <fct>            <dbl>  <dbl>    <dbl>
1 Northeast     3574097  625741 19378102
2 South         4625364  601723 25145561
3 North Central 5495456. 672591 12830632
4 West          2700551  563626 37253956
  • Each expression inside summarize can return either a single value, which becomes one column, or a data frame, whose columns are unpacked into the result.

group_by then mutate

  • To summarize a variable but not collapse the dataset, use group_by then mutate instead of summarize.

  • The next example adds one column with the total population of each state's region and another with the number of states in that region, repeated for each state.

murders |> group_by(region) |> 
  mutate(region_pop = sum(population), n = n()) |> head(2)
# A tibble: 2 × 8
# Groups:   region [2]
  state   abb   region population total  rate region_pop     n
  <chr>   <chr> <fct>       <dbl> <dbl> <dbl>      <dbl> <int>
1 Alabama AL    South     4779736   135  2.82  115674434    17
2 Alaska  AK    West       710231    19  2.68   71945553    13

ungroup

murders |> group_by(region) |> 
  mutate(region_pop = sum(population), n = n()) |>
  ungroup() |> head(2)
# A tibble: 2 × 8
  state   abb   region population total  rate region_pop     n
  <chr>   <chr> <fct>       <dbl> <dbl> <dbl>      <dbl> <int>
1 Alabama AL    South     4779736   135  2.82  115674434    17
2 Alaska  AK    West       710231    19  2.68   71945553    13
  • This avoids having a grouped data frame that we don’t need.
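
A leftover grouping matters because later verbs keep operating per group. The sketch below, on a hypothetical three-row table, uses the dplyr helper is_grouped_df to check the grouping and shows how the same summarize call behaves before and after ungroup.

```r
library(dplyr)

# Hypothetical three-row stand-in for the murders table.
df_small <- data.frame(
  region = c("South", "West", "West"),
  population = c(4779736, 710231, 6392017)
)

grouped <- df_small |>
  group_by(region) |>
  mutate(region_pop = sum(population))

flat <- ungroup(grouped)

# mutate preserved the grouping; ungroup removed it.
is_grouped_df(grouped)
is_grouped_df(flat)

# The same call returns one row per region on the grouped data
# and a single row on the ungrouped data.
rows_grouped <- nrow(summarize(grouped, n = n()))
rows_flat <- nrow(summarize(flat, n = n()))
c(rows_grouped, rows_flat)
```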

Tidyverse returns a data frame

  • Most tidyverse functions that operate on data frames, such as summarize, return a data frame, even if the result is just one number.
murders |> 
  summarize(rate = sum(total)/sum(population)*100000) |>
  class()
[1] "data.frame"

pull

  • To get a numeric vector instead, use pull:
murders |> 
  summarize(rate = sum(total)/sum(population)*100000) |>
  pull(rate) 
[1] 3.034555
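
For comparison, pull does the same job as the base R accessors $ and [[ ]], but it fits at the end of a pipeline. This is a small sketch on a hypothetical two-row table; the object names are illustrative.

```r
library(dplyr)

# Hypothetical two-row stand-in for the murders table.
df_small <- data.frame(
  population = c(1360301, 625741),
  total = c(7, 2)
)

res <- df_small |>
  summarize(rate = sum(total) / sum(population) * 100000)

# res is still a data frame with one row and one column.
is.data.frame(res)

# Three equivalent ways to extract the number as a plain numeric vector.
rate_pull <- pull(res, rate)
rate_dollar <- res$rate
rate_bracket <- res[["rate"]]
```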

Sorting data frames with arrange

  • States ordered by rate:
murders |> arrange(rate) |> head()
          state abb        region population total      rate
1       Vermont  VT     Northeast     625741     2 0.3196211
2 New Hampshire  NH     Northeast    1316470     5 0.3798036
3        Hawaii  HI          West    1360301     7 0.5145920
4  North Dakota  ND North Central     672591     4 0.5947151
5          Iowa  IA North Central    3046355    21 0.6893484
6         Idaho  ID          West    1567582    12 0.7655102

Sorting in descending order

  • To sort in descending order, we can either negate the variable or use desc:
murders |> arrange(desc(rate)) |> head()
                 state abb        region population total      rate
1 District of Columbia  DC         South     601723    99 16.452753
2            Louisiana  LA         South    4533372   351  7.742581
3             Missouri  MO North Central    5988927   321  5.359892
4             Maryland  MD         South    5773552   293  5.074866
5       South Carolina  SC         South    4625364   207  4.475323
6             Delaware  DE         South     897934    38  4.231937
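
The negation alternative is not shown above, so here is a short sketch on a hypothetical three-row table. For a numeric column both forms give the same order; desc also works on character columns, where a minus sign would fail.

```r
library(dplyr)

# Hypothetical three-row stand-in for the murders table.
df_small <- data.frame(
  state = c("Vermont", "Hawaii", "Iowa"),
  rate = c(0.3196211, 0.5145920, 0.6893484)
)

# Two ways to sort a numeric column from largest to smallest.
by_desc <- arrange(df_small, desc(rate))
by_negative <- arrange(df_small, -rate)
all.equal(by_desc, by_negative)

# desc also handles character columns (reverse alphabetical order).
by_state <- arrange(df_small, desc(state))
by_state$state
```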

Sorting by multiple variables

murders |> arrange(region, desc(rate)) |> head(11)
                  state abb    region population total       rate
1          Pennsylvania  PA Northeast   12702379   457  3.5977513
2            New Jersey  NJ Northeast    8791894   246  2.7980319
3           Connecticut  CT Northeast    3574097    97  2.7139722
4              New York  NY Northeast   19378102   517  2.6679599
5         Massachusetts  MA Northeast    6547629   118  1.8021791
6          Rhode Island  RI Northeast    1052567    16  1.5200933
7                 Maine  ME Northeast    1328361    11  0.8280881
8         New Hampshire  NH Northeast    1316470     5  0.3798036
9               Vermont  VT Northeast     625741     2  0.3196211
10 District of Columbia  DC     South     601723    99 16.4527532
11            Louisiana  LA     South    4533372   351  7.7425810