Tidyverse

http://rafalab.dfci.harvard.edu/dsbook-part-1/R/tidyverse.html

library(tidyverse)

The tidyverse is not a single package but a collection of packages.
It is a popular set of tools for data analysis that simplifies code at the cost of some flexibility.
One way it simplifies code is by ensuring that all functions take and return tidy data.

Tidy data (long data)

Stored in a data frame.
Each observation is exactly one row.
Variables are stored in columns.
Not all data can be represented this way.
Assuming data is tidy simplifies coding.

Tidy data

This is an example of a tidy dataset:

library(dslabs)
tidy_data <- gapminder |> 
  filter(country %in% c("South Korea", "Germany") & !is.na(fertility)) |>
  select(country, year, fertility)
head(tidy_data, 6)

      country year fertility
1     Germany 1960      2.41
2 South Korea 1960      6.16
3     Germany 1961      2.44
4 South Korea 1961      5.99
5     Germany 1962      2.47
6 South Korea 1962      5.79

Tidy data

Originally, the data was in the following format:

path <- system.file("extdata", package = "dslabs") # full path of the "extdata" directory in the installed dslabs package
filename <- file.path(path, "fertility-two-countries-example.csv")
# file.path joins the directory and file name using the platform's file separator
wide_data <- read_csv(filename)
select(wide_data, country, `1960`:`1970`) |> as.data.frame()

      country 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970
1     Germany 2.41 2.44 2.47 2.49 2.49 2.48 2.44 2.37 2.28 2.17 2.04
2 South Korea 6.16 5.99 5.79 5.57 5.36 5.16 4.99 4.85 4.73 4.62 4.53

This is not tidy. Each row holds several observations, one per year, and the year variable is stored in the column names instead of in a column of its own.
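As a minimal sketch of how a wide table like this one can be reshaped into the tidy layout, the code below uses pivot_longer from tidyr. The small table wide is rebuilt by hand from the first three years printed above, and the object names wide and tidy are illustrative, not part of the original example.

```r
library(tidyr)

# Small wide table rebuilt by hand from the first three years shown above
wide <- data.frame(
  country = c("Germany", "South Korea"),
  `1960` = c(2.41, 6.16),
  `1961` = c(2.44, 5.99),
  `1962` = c(2.47, 5.79),
  check.names = FALSE  # keep the year column names as they are
)

# Move every column except country into year/fertility pairs, one row per observation
tidy <- pivot_longer(wide, cols = -country, names_to = "year", values_to = "fertility")

# Column names are character strings, so convert year to a number
tidy$year <- as.integer(tidy$year)
tidy
```

The result has one row per country and year, which matches the layout of tidy_data shown earlier.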

Tidyverse packages

tibble - modifies the data frame class.
readr - import data.
dplyr - used to modify data frames.
ggplot2 - simplifies plotting.
tidyr - helps convert data into tidy format.
stringr - string processing.
forcats - utilities for categorical data.
purrr - tidy version of apply functions.

dplyr

We focus on the following functions:
- mutate
- select
- across
- filter
- group_by
- summarize

Adding a column with `mutate`

colnames(murders)

[1] "state"      "abb"        "region"     "population" "total"

murders <- mutate(murders, rate = total/population*100000)
colnames(murders)

[1] "state"      "abb"        "region"     "population" "total"     
[6] "rate"

Notice that we used total and population inside the call to mutate, even though these objects are not defined in our workspace.
This is known as non-standard evaluation: the function uses the data frame as context to determine what the variable names refer to.
The tidyverse uses non-standard evaluation extensively.
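To make the contrast concrete, here is a small sketch comparing the base R way of reaching columns with the mutate call. The data frame toy and its numbers are made up for illustration.

```r
library(dplyr)

# Hypothetical two-row table; the numbers are made up for illustration
toy <- data.frame(
  state = c("A", "B"),
  population = c(1000000, 2000000),
  total = c(10, 50)
)

# Base R: the columns have to be reached through the data frame
base_rate <- toy$total / toy$population * 100000

# dplyr: total and population are looked up inside toy
toy <- mutate(toy, rate = total / population * 100000)

all.equal(toy$rate, base_rate)
```

Both approaches compute the same values; the dplyr version simply lets us write the column names without the toy$ prefix.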

Subsetting with `filter`

filter(murders, rate <= 0.71)

          state abb        region population total      rate
1        Hawaii  HI          West    1360301     7 0.5145920
2          Iowa  IA North Central    3046355    21 0.6893484
3 New Hampshire  NH     Northeast    1316470     5 0.3798036
4  North Dakota  ND North Central     672591     4 0.5947151
5       Vermont  VT     Northeast     625741     2 0.3196211

Selecting columns with `select`

new_table <- select(murders, state, region, rate)
filter(new_table, rate <= 0.71)

          state        region      rate
1        Hawaii          West 0.5145920
2          Iowa North Central 0.6893484
3 New Hampshire     Northeast 0.3798036
4  North Dakota North Central 0.5947151
5       Vermont     Northeast 0.3196211

Transforming variables

The function mutate can also be used to transform variables.
For example, the following code applies a base-10 log transformation to the population variable:

mutate(murders, population = log10(population)) |> head()

       state abb region population total     rate
1    Alabama  AL  South   6.679404   135 2.824424
2     Alaska  AK   West   5.851400    19 2.675186
3    Arizona  AZ   West   6.805638   232 3.629527
4   Arkansas  AR  South   6.464775    93 3.189390
5 California  CA   West   7.571172  1257 3.374138
6   Colorado  CO   West   6.701499    65 1.292453

Transforming several variables

Often, we need to apply the same transformation to several variables.
The function across facilitates this operation.
For example, if we want to log transform both population and total murders, we can use:

mutate(murders, across(c(population, total), log10)) |> head()

       state abb region population    total     rate
1    Alabama  AL  South   6.679404 2.130334 2.824424
2     Alaska  AK   West   5.851400 1.278754 2.675186
3    Arizona  AZ   West   6.805638 2.365488 3.629527
4   Arkansas  AR  South   6.464775 1.968483 3.189390
5 California  CA   West   7.571172 3.099335 3.374138
6   Colorado  CO   West   6.701499 1.812913 1.292453

More examples with across

Apply the same transformation to all numeric variables:

mutate(murders, across(where(is.numeric), log10)) |> head(2)

    state abb region population    total      rate
1 Alabama  AL  South   6.679404 2.130334 0.4509299
2  Alaska  AK   West   5.851400 1.278754 0.4273540

Or to all character variables:

mutate(murders, across(where(is.character), tolower)) |> head(2)

    state abb region population total     rate
1 alabama  al  South    4779736   135 2.824424
2  alaska  ak   West     710231    19 2.675186
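The examples above overwrite the original columns. As a sketch of one alternative, the .names argument of across can write the transformed values to new columns instead. The data frame toy and the prefix log10_ are illustrative choices.

```r
library(dplyr)

# Hypothetical table; the numbers are made up for illustration
toy <- data.frame(
  state = c("A", "B"),
  population = c(1000000, 100000),
  total = c(100, 10)
)

# {.col} is replaced by each selected column name, so the originals are kept
out <- mutate(toy, across(c(population, total), log10, .names = "log10_{.col}"))
names(out)
```

Here out keeps population and total and gains log10_population and log10_total.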

The pipe: `|>`

We use the pipe to chain a series of operations.
For example, if we want to select columns and then filter rows:

\[ \mbox{original data } \rightarrow \mbox{ select } \rightarrow \mbox{ filter } \]

murders |> select(state, region, rate) |> filter(rate <= 0.71)

          state        region      rate
1        Hawaii          West 0.5145920
2          Iowa North Central 0.6893484
3 New Hampshire     Northeast 0.3798036
4  North Dakota North Central 0.5947151
5       Vermont     Northeast 0.3196211

The output of the left side of a pipe is used as the input (as first argument) for the function on the right.
Here is a simple example:

16 |> sqrt() |> log(base = 2)

[1] 2
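To see what the pipe is doing, the following sketch writes the same computation as nested function calls and checks that the two forms agree. The names piped and nested are illustrative, and the native pipe assumes R 4.1 or later.

```r
# Piped form: each result becomes the first argument of the next call
piped <- 16 |> sqrt() |> log(base = 2)

# Equivalent nested form, read from the inside out
nested <- log(sqrt(16), base = 2)

identical(piped, nested)
```

The piped version reads left to right in the order the operations are applied, which is why it is preferred for longer chains.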

Summarizing data

We use the dplyr summarize function, not to be confused with summary from base R.

murders |> summarize(avg = mean(rate))

       avg
1 2.779125

Is this calculation correct?

Summarizing data: Correct calculation

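No.
Averaging the state rates weights every state equally, regardless of its population.
The correct calculation divides the total number of murders by the total population: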

murders |> summarize(rate = sum(total)/sum(population)*100000)

      rate
1 3.034555
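The difference between the two calculations can be illustrated with a made-up example of one small state with a high rate and one large state with a low rate. The sketch also shows that the pooled rate equals the population-weighted mean of the state rates; toy and the summary names are hypothetical.

```r
library(dplyr)

# Hypothetical states: A is small with a high rate, B is large with a low rate
toy <- data.frame(
  state = c("A", "B"),
  population = c(1000000, 9000000),
  total = c(100, 90)
) |>
  mutate(rate = total / population * 100000)

res <- toy |>
  summarize(
    unweighted = mean(rate),                          # treats both states equally
    pooled = sum(total) / sum(population) * 100000,   # rate for the combined population
    weighted = weighted.mean(rate, population)        # same value as pooled
  )
res
```

In this toy example the unweighted average is 5.5, while the pooled and weighted versions both give 1.9, because most of the population lives in the low-rate state.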

Multiple summaries

Suppose we want the median, minimum, and maximum population size:

murders |> summarize(median = median(population), min = min(population), max = max(population))

   median    min      max
1 4339367 563626 37253956

Can we use quantile to compute all three at once?

murders |> summarize(quantiles = quantile(population, c(0.5, 0, 1)))

Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
dplyr 1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
  always returns an ungrouped data frame and adjust accordingly.

  quantiles
1   4339367
2    563626
3  37253956

Multiple summaries with `reframe`

murders |> reframe(quantiles = quantile(population, c(0.5, 0, 1)))

  quantiles
1   4339367
2    563626
3  37253956
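The output above does not say which row is which quantile. One possible way to label the rows, sketched here under the assumption of dplyr 1.1.0 or later, is to return a second column of labels from the same reframe call. The data frame toy and the column names stat and value are illustrative.

```r
library(dplyr)

# Hypothetical populations, made up for illustration
toy <- data.frame(population = c(1, 2, 3, 4, 5) * 1e6)

res <- toy |>
  reframe(
    stat = c("median", "min", "max"),                          # one label per quantile
    value = quantile(population, c(0.5, 0, 1), names = FALSE)  # drop the "50%" style names
  )
res
```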

Multiple summaries with a data frame

To have a column per summary, as when we called min, median, and max separately, we have to define a function that returns a data frame like this:

median_min_max <- function(x){
  qs <- quantile(x, c(0.5, 0, 1))
  data.frame(median = qs[1], min = qs[2], max = qs[3])
}

Then we can call summarize:

murders |> summarize(median_min_max(population))

   median    min      max
1 4339367 563626 37253956

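Compare this with calling the function directly:

median_min_max(murders$population)

     median    min      max
50% 4339367 563626 37253956

Here data.frame takes the row name `50%` from the name that quantile attaches to qs[1], the first column supplied. summarize discards row names, so its output shows row 1 instead.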

Group using `group_by`

Let’s compute murder rate by region.

murders |> group_by(region) |> head(4)

# A tibble: 4 × 6
# Groups:   region [2]
  state    abb   region population total  rate
  <chr>    <chr> <fct>       <dbl> <dbl> <dbl>
1 Alabama  AL    South     4779736   135  2.82
2 Alaska   AK    West       710231    19  2.68
3 Arizona  AZ    West      6392017   232  3.63
4 Arkansas AR    South     2915918    93  3.19

Note the line Groups: region [2] at the top: 2 is the number of groups present in the 4 rows shown.
The result is a special kind of data frame called a grouped data frame.

group_by then summarize

murders |> 
  group_by(region) |> 
  summarize(rate = sum(total) / sum(population) * 100000)

# A tibble: 4 × 2
  region         rate
  <fct>         <dbl>
1 Northeast      2.66
2 South          3.63
3 North Central  2.73
4 West           2.66

The summarize function applies the summarization to each group separately.

group_by then summarize

Compute the median, minimum, and maximum population in the four regions:

murders |> group_by(region) |> summarize(median_min_max(population))

# A tibble: 4 × 4
  region          median    min      max
  <fct>            <dbl>  <dbl>    <dbl>
1 Northeast     3574097  625741 19378102
2 South         4625364  601723 25145561
3 North Central 5495456. 672591 12830632
4 West          2700551  563626 37253956

Inside summarize, each summary can be a single value or, as with median_min_max, a data frame whose columns become columns of the result.

group_by then mutate

To summarize a variable but not collapse the dataset, use group_by then mutate instead of summarize.
The next example adds two columns, the total population of each state's region and the number of states in that region, repeated for every state.

murders |> group_by(region) |> 
  mutate(region_pop = sum(population), n = n()) |> head(2)

# A tibble: 2 × 8
# Groups:   region [2]
  state   abb   region population total  rate region_pop     n
  <chr>   <chr> <fct>       <dbl> <dbl> <dbl>      <dbl> <int>
1 Alabama AL    South     4779736   135  2.82  115674434    17
2 Alaska  AK    West       710231    19  2.68   71945553    13

ungroup

murders |> group_by(region) |> 
  mutate(region_pop = sum(population), n = n()) |>
  ungroup() |> head(2)

# A tibble: 2 × 8
  state   abb   region population total  rate region_pop     n
  <chr>   <chr> <fct>       <dbl> <dbl> <dbl>      <dbl> <int>
1 Alabama AL    South     4779736   135  2.82  115674434    17
2 Alaska  AK    West       710231    19  2.68   71945553    13

This avoids having a grouped data frame that we don’t need.
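As a sketch of an alternative, dplyr 1.1.0 and later accept a .by argument in mutate and summarize that groups for a single operation only, so no ungroup step is needed. The data frame toy is made up for illustration.

```r
library(dplyr)

# Hypothetical table with two regions
toy <- data.frame(
  state = c("A", "B", "C"),
  region = c("East", "East", "West"),
  population = c(100, 200, 300)
)

# group_by, mutate, then ungroup
grouped <- toy |>
  group_by(region) |>
  mutate(region_pop = sum(population), n = n()) |>
  ungroup()

# Per-operation grouping: the result is never a grouped data frame
by_arg <- toy |>
  mutate(region_pop = sum(population), n = n(), .by = region)

is_grouped_df(by_arg)
```

Both versions add the same region_pop and n columns; the choice between them is mostly a matter of style.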

Tidyverse returns a data frame

The dplyr functions shown so far always return a data frame, even when the result is just one number.

murders |> 
  summarize(rate = sum(total)/sum(population)*100000) |>
  class()

[1] "data.frame"

`pull`

To extract the column as a numeric vector, use pull:

murders |> 
  summarize(rate = sum(total)/sum(population)*100000) |>
  pull(rate)

[1] 3.034555
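The following sketch checks the classes involved on a made-up table and shows that pull gives the same vector as the base R $ operator. The names toy, res, and rate are illustrative.

```r
library(dplyr)

# Hypothetical table; the numbers are made up for illustration
toy <- data.frame(population = c(1000000, 9000000), total = c(100, 90))

res <- toy |> summarize(rate = sum(total) / sum(population) * 100000)
class(res)   # a data frame with one row and one column

rate <- pull(res, rate)
class(rate)  # a plain numeric vector

identical(rate, res$rate)
```

The advantage of pull over $ is that it fits at the end of a pipe.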

Sorting data frames with `arrange`

States ordered by rate:

murders |> arrange(rate) |> head()

          state abb        region population total      rate
1       Vermont  VT     Northeast     625741     2 0.3196211
2 New Hampshire  NH     Northeast    1316470     5 0.3798036
3        Hawaii  HI          West    1360301     7 0.5145920
4  North Dakota  ND North Central     672591     4 0.5947151
5          Iowa  IA North Central    3046355    21 0.6893484
6         Idaho  ID          West    1567582    12 0.7655102

Sorting in descending order

We can either negate the variable or use desc:

murders |> arrange(desc(rate)) |> head()

                 state abb        region population total      rate
1 District of Columbia  DC         South     601723    99 16.452753
2            Louisiana  LA         South    4533372   351  7.742581
3             Missouri  MO North Central    5988927   321  5.359892
4             Maryland  MD         South    5773552   293  5.074866
5       South Carolina  SC         South    4625364   207  4.475323
6             Delaware  DE         South     897934    38  4.231937
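The negation approach mentioned above is not shown, so here is a small sketch on a made-up table confirming that both forms give the same ordering for a numeric column. The data frame toy is hypothetical.

```r
library(dplyr)

# Hypothetical table; the numbers are made up for illustration
toy <- data.frame(state = c("A", "B", "C"), rate = c(2.5, 0.3, 7.1))

by_desc <- toy |> arrange(desc(rate))  # works for any sortable column
by_neg <- toy |> arrange(-rate)        # negation only works for numeric columns

identical(by_desc, by_neg)
```

Because negation is undefined for character columns, desc is the more general choice.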

Sorting by multiple variables

murders |> arrange(region, desc(rate)) |> head(11)

                  state abb    region population total       rate
1          Pennsylvania  PA Northeast   12702379   457  3.5977513
2            New Jersey  NJ Northeast    8791894   246  2.7980319
3           Connecticut  CT Northeast    3574097    97  2.7139722
4              New York  NY Northeast   19378102   517  2.6679599
5         Massachusetts  MA Northeast    6547629   118  1.8021791
6          Rhode Island  RI Northeast    1052567    16  1.5200933
7                 Maine  ME Northeast    1328361    11  0.8280881
8         New Hampshire  NH Northeast    1316470     5  0.3798036
9               Vermont  VT Northeast     625741     2  0.3196211
10 District of Columbia  DC     South     601723    99 16.4527532
11            Louisiana  LA     South    4533372   351  7.7425810
