http://rafalab.dfci.harvard.edu/dsbook-part-1/R/tidyverse.html
The tidyverse is not a single package but a group of packages designed to work together.
It has become a popular tool for data analysis, simplifying code at the cost of some flexibility.
One way it simplifies code is by ensuring that all functions take and return tidy data.
Tidy data is stored in a data frame.
Each observation occupies exactly one row.
Variables are stored in columns.
Not all data can be represented this way.
Assuming data is tidy simplifies coding.
library(tidyverse)  # provides filter, select, and read_csv
library(dslabs)
tidy_data <- gapminder |>
  filter(country %in% c("South Korea", "Germany") & !is.na(fertility)) |>
  select(country, year, fertility)
head(tidy_data, 6)
      country year fertility
1     Germany 1960      2.41
2 South Korea 1960      6.16
3     Germany 1961      2.44
4 South Korea 1961      5.99
5     Germany 1962      2.47
6 South Korea 1962      5.79
# Full path of the "extdata" directory installed with dslabs
path <- system.file("extdata", package = "dslabs")
# file.path joins the components with "/", which R accepts on all platforms
filename <- file.path(path, "fertility-two-countries-example.csv")
wide_data <- read_csv(filename)
select(wide_data, country, `1960`:`1970`) |> as.data.frame()
      country 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970
1     Germany 2.41 2.44 2.47 2.49 2.49 2.48 2.44 2.37 2.28 2.17 2.04
2 South Korea 6.16 5.99 5.79 5.57 5.36 5.16 4.99 4.85 4.73 4.62 4.53
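The two tables above hold the same values in tidy and in wide form. As a minimal sketch of moving from one to the other, the block below rebuilds the first three years of the wide table by hand and reshapes it with tidyr's pivot_longer. The object names wide and tidy are placeholders chosen for this illustration, and pivot_longer is not used elsewhere in these notes.

```r
library(tidyr)

# A hand-built copy of the first three year columns of the wide table above.
wide <- data.frame(
  country = c("Germany", "South Korea"),
  `1960` = c(2.41, 6.16),
  `1961` = c(2.44, 5.99),
  `1962` = c(2.47, 5.79),
  check.names = FALSE  # keep the year column names as they are
)

# One row per country-year: the year columns become values of a year variable.
tidy <- pivot_longer(
  wide,
  cols = -country,
  names_to = "year",
  values_to = "fertility",
  names_transform = list(year = as.integer)
)
tidy
```

Each row of the result is one observation (a country in a year), which is what the tidy definition above asks for.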
tibble - modifies the data frame class.
readr - import data.
dplyr - used to modify data frames.
ggplot2 - simplifies plotting.
tidyr - helps convert data into tidy format.
stringr - string processing.
forcats - utilities for categorical data.
purrr - tidy version of apply functions.
We focus on the following functions:
mutate
select
across
filter
group_by
summarize
mutate
[1] "state" "abb" "region" "population" "total"
[1] "state" "abb" "region" "population" "total"
[6] "rate"
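The two listings above show the column names of murders before and after a rate column is added. A minimal sketch of a mutate call that would produce this, assuming the rate is murders per 100,000 people (this matches the rate values printed in the tables that follow):

```r
library(dplyr)
library(dslabs)

names(murders)  # five columns before the call

# total and population are looked up inside the murders data frame,
# not in the workspace.
murders <- mutate(murders, rate = total / population * 100000)

names(murders)  # a sixth column, rate, has been added
```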
state abb region population total rate
1 Hawaii HI West 1360301 7 0.5145920
2 Iowa IA North Central 3046355 21 0.6893484
3 New Hampshire NH Northeast 1316470 5 0.3798036
4 North Dakota ND North Central 672591 4 0.5947151
5 Vermont VT Northeast 625741 2 0.3196211
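The table above contains only states with a rate below roughly 0.7. As a sketch, filter keeps the rows that satisfy a condition and select keeps the named columns. The threshold 0.7 and the object names low_rate and rates_only are assumptions made for this illustration.

```r
library(dplyr)
library(dslabs)

murders <- mutate(murders, rate = total / population * 100000)

# Keep the rows (states) whose rate is at most 0.7.
low_rate <- filter(murders, rate <= 0.7)
low_rate

# Keep only three of the columns.
rates_only <- select(murders, state, region, rate)
head(rates_only)
```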
Notice that mutate used total and population inside the function call, even though these objects are not defined in our workspace.
This is known as non-standard evaluation: the context, here the data frame, determines what the variable names mean.
Tidyverse extensively uses non-standard evaluation.
filter
select
The function mutate can also be used to transform variables.
For example, the following code takes the log transformation of the population variable:
state abb region population total rate
1 Alabama AL South 6.679404 135 2.824424
2 Alaska AK West 5.851400 19 2.675186
3 Arizona AZ West 6.805638 232 3.629527
4 Arkansas AR South 6.464775 93 3.189390
5 California CA West 7.571172 1257 3.374138
6 Colorado CO West 6.701499 65 1.292453
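The population values in the output above are consistent with a base-10 logarithm (for Alabama, log10(4779736) is about 6.679). A minimal sketch of the transformation under that assumption:

```r
library(dplyr)
library(dslabs)

murders <- mutate(murders, rate = total / population * 100000)

# Overwrite population with its base-10 logarithm.
log_pop <- mutate(murders, population = log10(population))
head(log_pop)
```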
Often, we need to apply the same transformation to several variables.
The function across facilitates this.
For example, if we want to log transform both population and total murders, we can use:
state abb region population total rate
1 Alabama AL South 6.679404 2.130334 2.824424
2 Alaska AK West 5.851400 1.278754 2.675186
3 Arizona AZ West 6.805638 2.365488 3.629527
4 Arkansas AR South 6.464775 1.968483 3.189390
5 California CA West 7.571172 3.099335 3.374138
6 Colorado CO West 6.701499 1.812913 1.292453
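A sketch of an across call that would give the output above, again assuming log10 is the transformation. The first argument of across selects the columns and the second is the function applied to each of them.

```r
library(dplyr)
library(dslabs)

murders <- mutate(murders, rate = total / population * 100000)

# Apply log10 to both population and total in one call.
log_both <- mutate(murders, across(c(population, total), log10))
head(log_both)
```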
state abb region population total rate
1 Alabama AL South 6.679404 2.130334 0.4509299
2 Alaska AK West 5.851400 1.278754 0.4273540
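In the last output above the rate column is transformed as well, which suggests that every numeric column was selected. One way to do that, offered here as an assumption about how the output was produced, is the where(is.numeric) selection helper:

```r
library(dplyr)
library(dslabs)

murders <- mutate(murders, rate = total / population * 100000)

# Select every numeric column and apply log10 to each.
log_all <- mutate(murders, across(where(is.numeric), log10))
head(log_all, 2)
```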
|>
We use the pipe to chain a series of operations.
For example, if we want to select columns and then filter rows: \[ \mbox{original data } \rightarrow \mbox{ select } \rightarrow \mbox{ filter } \]
state region rate
1 Hawaii West 0.5145920
2 Iowa North Central 0.6893484
3 New Hampshire Northeast 0.3798036
4 North Dakota North Central 0.5947151
5 Vermont Northeast 0.3196211
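The select-then-filter chain that yields the table above can be sketched as follows. The threshold 0.7 is an assumption, and piped is a placeholder name for the result.

```r
library(dplyr)
library(dslabs)

murders <- mutate(murders, rate = total / population * 100000)

# original data -> select -> filter
piped <- murders |>
  select(state, region, rate) |>
  filter(rate <= 0.7)
piped
```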
The output of the left side of a pipe is used as the input (the first argument) of the function on the right.
Here is a simple example:
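A minimal sketch with base R functions only (the native pipe needs R 4.1 or later). The names a, b, and c_ are placeholders.

```r
# x |> f() is evaluated as f(x)
a <- 16 |> sqrt()            # same as sqrt(16)
b <- 16 |> sqrt() |> log2()  # same as log2(sqrt(16))
c_ <- log2(sqrt(16))         # the nested call, for comparison

a
b
```

Reading left to right, the piped version lists the operations in the order they are applied, while the nested version has to be read inside out.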
summarize
The summarize function is not to be confused with summary from base R.
No.
The correct calculation is
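Assuming the "No." above answers whether the average of the state rates equals the national murder rate (an interpretation, since the question itself is not shown here), the contrast can be sketched as follows. The names avg_of_rates and us_rate are placeholders.

```r
library(dplyr)
library(dslabs)

murders <- mutate(murders, rate = total / population * 100000)

# Average of the 51 state rates: every state gets the same weight.
avg_of_rates <- summarize(murders, avg = mean(rate))

# National rate: total murders divided by total population.
us_rate <- summarize(murders, rate = sum(total) / sum(population) * 100000)

avg_of_rates
us_rate
```

The two numbers differ because the simple average gives small and large states equal weight.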
median min max
1 4339367 563626 37253956
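The output above has one column per summary of the state populations. A sketch of a summarize call that computes several summaries at once (pop_summary is a placeholder name):

```r
library(dplyr)
library(dslabs)

# Each named argument becomes one column of the one-row result.
pop_summary <- summarize(
  murders,
  median = median(population),
  min = min(population),
  max = max(population)
)
pop_summary
```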
quantiles?
Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
dplyr 1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
always returns an ungrouped data frame and adjust accordingly.
quantiles
1 4339367
2 563626
3 37253956
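The warning above appears when a summary returns more than one value per group, as quantile does when given several probabilities. A sketch of the replacement the warning suggests, assuming dplyr 1.1.0 or later and assuming the probabilities were given in the order 0.5, 0, 1 (which matches the order of the printed values):

```r
library(dplyr)
library(dslabs)

# reframe allows any number of rows per group, so no warning is raised.
qs <- reframe(murders, quantiles = quantile(population, c(0.5, 0, 1)))
qs
```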
reframe
To get min, median, and max as separate columns, we have to define a function that returns a data frame and then call it inside summarize.
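A sketch of such a helper. The name median_min_max is chosen for this illustration. Because the function returns a one-row data frame, summarize unpacks it into one column per summary.

```r
library(dplyr)
library(dslabs)

# Returns a one-row data frame with one column per summary.
median_min_max <- function(x) {
  qs <- unname(quantile(x, c(0.5, 0, 1)))
  data.frame(median = qs[1], min = qs[2], max = qs[3])
}

res <- summarize(murders, median_min_max(population))
res
```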
group_by
# A tibble: 4 × 6
# Groups: region [2]
state abb region population total rate
<chr> <chr> <fct> <dbl> <dbl> <dbl>
1 Alabama AL South 4779736 135 2.82
2 Alaska AK West 710231 19 2.68
3 Arizona AZ West 6392017 232 3.63
4 Arkansas AR South 2915918 93 3.19
Note the "Groups: region [2]" line at the top. The [2] is the number of groups present in the first 4 rows (South and West).
This is a special data frame called a grouped data frame.
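A sketch of a call that produces a grouped data frame like the one shown above. The data look the same as before; only the grouping information is added.

```r
library(dplyr)
library(dslabs)

murders <- mutate(murders, rate = total / population * 100000)

grouped <- group_by(murders, region)
class(grouped)    # includes "grouped_df"
head(grouped, 4)  # prints with a "Groups:" line
```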
# A tibble: 4 × 2
region rate
<fct> <dbl>
1 Northeast 2.66
2 South 3.63
3 North Central 2.73
4 West 2.66
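The four-row table above has one rate per region. One calculation that gives such a table, assumed here to be total murders over total population within each region, is:

```r
library(dplyr)
library(dslabs)

# summarize is applied within each region, so the result has one row per region.
region_rates <- murders |>
  group_by(region) |>
  summarize(rate = sum(total) / sum(population) * 100000)
region_rates
```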
The summarize function applies the summarization to each group separately.
# A tibble: 4 × 4
region median min max
<fct> <dbl> <dbl> <dbl>
1 Northeast 3574097 625741 19378102
2 South 4625364 601723 25145561
3 North Central 5495456. 672591 12830632
4 West 2700551 563626 37253956
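The per-region median, min, and max above can be sketched by combining group_by with a helper that returns a one-row data frame. The helper name median_min_max is a placeholder chosen for this illustration.

```r
library(dplyr)
library(dslabs)

# Returns a one-row data frame with one column per summary.
median_min_max <- function(x) {
  qs <- unname(quantile(x, c(0.5, 0, 1)))
  data.frame(median = qs[1], min = qs[2], max = qs[3])
}

# The helper is called once per region.
by_region <- murders |>
  group_by(region) |>
  summarize(median_min_max(population))
by_region
```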
To summarize a variable without collapsing the dataset, use group_by followed by mutate instead of summarize.
The next example adds two columns, the total population of each region and the number of states in it, repeated for each state in the region.
# A tibble: 2 × 8
# Groups: region [2]
state abb region population total rate region_pop n
<chr> <chr> <fct> <dbl> <dbl> <dbl> <dbl> <int>
1 Alabama AL South 4779736 135 2.82 115674434 17
2 Alaska AK West 710231 19 2.68 71945553 13
# ungroup() removes the grouping, so the "Groups:" line no longer appears
murders |>
  group_by(region) |>
  mutate(region_pop = sum(population), n = n()) |>
  ungroup() |>
  head(2)
# A tibble: 2 × 8
state abb region population total rate region_pop n
<chr> <chr> <fct> <dbl> <dbl> <dbl> <dbl> <int>
1 Alabama AL South 4779736 135 2.82 115674434 17
2 Alaska AK West 710231 19 2.68 71945553 13
pull
arrange
state abb region population total rate
1 Vermont VT Northeast 625741 2 0.3196211
2 New Hampshire NH Northeast 1316470 5 0.3798036
3 Hawaii HI West 1360301 7 0.5145920
4 North Dakota ND North Central 672591 4 0.5947151
5 Iowa IA North Central 3046355 21 0.6893484
6 Idaho ID West 1567582 12 0.7655102
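A minimal sketch of the two functions named above: pull extracts a single column as a vector rather than a data frame, and arrange sorts the rows by a column, in ascending order by default. The names rates and sorted are placeholders.

```r
library(dplyr)
library(dslabs)

murders <- mutate(murders, rate = total / population * 100000)

# pull returns the column itself, here a numeric vector.
rates <- pull(murders, rate)
class(rates)

# arrange sorts rows from the lowest to the highest rate.
sorted <- arrange(murders, rate)
head(sorted)
```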
desc:
state abb region population total rate
1 District of Columbia DC South 601723 99 16.452753
2 Louisiana LA South 4533372 351 7.742581
3 Missouri MO North Central 5988927 321 5.359892
4 Maryland MD South 5773552 293 5.074866
5 South Carolina SC South 4625364 207 4.475323
6 Delaware DE South 897934 38 4.231937
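Wrapping the column in desc reverses the order, which gives the highest rates first as in the table above. A sketch (sorted_desc is a placeholder name):

```r
library(dplyr)
library(dslabs)

murders <- mutate(murders, rate = total / population * 100000)

# Sort from the highest to the lowest rate.
sorted_desc <- arrange(murders, desc(rate))
head(sorted_desc)
```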
state abb region population total rate
1 Pennsylvania PA Northeast 12702379 457 3.5977513
2 New Jersey NJ Northeast 8791894 246 2.7980319
3 Connecticut CT Northeast 3574097 97 2.7139722
4 New York NY Northeast 19378102 517 2.6679599
5 Massachusetts MA Northeast 6547629 118 1.8021791
6 Rhode Island RI Northeast 1052567 16 1.5200933
7 Maine ME Northeast 1328361 11 0.8280881
8 New Hampshire NH Northeast 1316470 5 0.3798036
9 Vermont VT Northeast 625741 2 0.3196211
10 District of Columbia DC South 601723 99 16.4527532
11 Louisiana LA South 4533372 351 7.7425810
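The last table is ordered by region first and, within each region, by descending rate. A sketch of nested sorting that would produce it, where additional arguments to arrange break ties in the earlier ones (nested is a placeholder name):

```r
library(dplyr)
library(dslabs)

murders <- mutate(murders, rate = total / population * 100000)

# Sort by region (in factor level order), then by descending rate within region.
nested <- arrange(murders, region, desc(rate))
head(nested, 11)
```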