Problem set 3

Download source

In these exercises, we will explore a subset of the NHANES dataset to investigate potential differences in systolic blood pressure across groups defined by self reported race. You can find more information about the NHANES survey data in the reference manual of the NHANES CRAN package.

Grading Information

Problems 1-15 are mandatory. Problems 16-18 are optional.

Instructions

For each exercise, write a single piped expression using the pipe (|>) to chain operations together. You may break it across multiple physical lines, but the work for the exercise should be one continuous pipeline (no intermediate assignments), except when the problem explicitly asks you to save an object (like dat) or define a function.
Generate an HTML document that shows the code for each exercise.
For tables: use knitr::kable() (do not print raw data frames).
Use only two significant digits for numbers displayed in tables.
Submit a PDF in Canvas using the following code

quarto render hw/hw3.qmd --to pdf

You will need the following libraries:

library(dplyr)
library(tidyr)
library(forcats)
library(ggplot2)
library(knitr)
library(NHANES)
options(digits = 2)

Exercises

1

Filter the NHANES data to only include survey year 2011-2012. Save the resulting data.frame (in tbl or tbl_df format) in dat. This data.frame should have 5,000 rows and 76 columns.

# Scaffold (uncomment and complete):
# dat <- NHANES |>
#   filter(SurveyYr == "2011_12")

# After you create dat, check the dimensions:
# dim(dat)

2

Compute the average and standard deviation (SD) for the combined systolic blood pressure reading (BPSysAve) for males and females separately. Show a data frame with two rows (female and male) and two columns (average and SD).

# Scaffold (single pipeline, then kable):
# dat |>
#   filter(!is.na(BPSysAve)) |>
#   group_by(Gender) |>
#   summarize(
#     avg = mean(BPSysAve),
#     sd  = sd(BPSysAve),
#     .groups = "drop"
#   ) |>
#   select(Gender, avg, sd) |>
#   kable()

3

Compute the average and SD for SBP for each race variable in column Race3 for females and males separately. Table: 4 columns (sex, race, average, SD) and 12 rows. Arrange from highest to lowest average.

# Scaffold:
# dat |>
#   filter(!is.na(BPSysAve), !is.na(Race3), !is.na(Gender)) |>
#   group_by(Gender, Race3) |>
#   summarize(
#     avg = mean(BPSysAve),
#     sd  = sd(BPSysAve),
#     .groups = "drop"
#   ) |>
#   arrange(desc(avg)) |>
#   select(Gender, Race3, avg, sd) |>
#   kable()

4

Repeat Exercise 3 but add two columns for a 95% confidence interval (lower, upper):

\[ \bar{X} \pm 1.96 \, s / \sqrt{n}. \]

# Scaffold:
# dat |>
#   filter(!is.na(BPSysAve), !is.na(Race3), !is.na(Gender)) |>
#   group_by(Gender, Race3) |>
#   summarize(
#     n   = n(),
#     avg = mean(BPSysAve),
#     sd  = sd(BPSysAve),
#     .groups = "drop"
#   ) |>
#   mutate(
#     lower = avg - 1.96 * sd / sqrt(n),
#     upper = avg + 1.96 * sd / sqrt(n)
#   ) |>
#   arrange(desc(avg)) |>
#   select(Gender, Race3, avg, sd, lower, upper) |>
#   kable()

5

Make a graph showing the results from Exercise 4. Plot averages as points and confidence intervals as error bars (geom_errorbar). Order the groups from lowest to highest average (based on the average of the male and female averages). Use facet_wrap to separate females and males. Axes: Race and Average. Title: Comparing systolic blood pressure across groups. Caption: Bars represent 95% confidence intervals.

# Scaffold idea:
# 1) Make the summary table (Gender, Race3, avg, lower, upper) like Exercise 4
# 2) Create an ordering key across genders:
#    group_by(Race3) |> summarize(overall_avg = mean(avg))
# 3) Use fct_reorder(Race3, overall_avg) to order categories
# 4) Plot with ggplot + geom_point + geom_errorbar + facet_wrap(~Gender)

# Example structure (students should write the full pipeline):
# dat |>
#   ...summary steps... |>
#   mutate(Race3 = fct_reorder(Race3, overall_avg)) |>
#   ggplot(aes(x = Race3, y = avg)) +
#   geom_point() +
#   geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.15) +
#   facet_wrap(~ Gender) +
#   labs(
#     x = "Race",
#     y = "Average",
#     title = "Comparing systolic blood pressure across groups",
#     caption = "Bars represent 95% confidence intervals"
#   )

6

Create a table like Exercise 4 but show average SBP by sex and age group (AgeDecade). Order groups chronologically. Make a separate plot for males and females. Filter out observations with no AgeDecade.

# Scaffold:
# dat |>
#   filter(!is.na(BPSysAve), !is.na(AgeDecade), !is.na(Gender)) |>
#   group_by(Gender, AgeDecade) |>
#   summarize(
#     n   = n(),
#     avg = mean(BPSysAve),
#     sd  = sd(BPSysAve),
#     .groups = "drop"
#   ) |>
#   mutate(
#     lower = avg - 1.96 * sd / sqrt(n),
#     upper = avg + 1.96 * sd / sqrt(n),
#     AgeDecade = fct_inorder(AgeDecade)   # keeps factor order as it appears
#   ) |>
#   ggplot(aes(x = AgeDecade, y = avg)) +
#   geom_point() +
#   geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.15) +
#   facet_wrap(~ Gender) +
#   labs(x = "Age group", y = "Average SBP")

7

Explore age distributions of each Race3 group. Make a histogram of Age for each Race3 group and stack them vertically. Generate two columns (males and females). Use bins of width 5 years up to 80.

Below the graph, comment on what you notice and how this can explain the difference between the White and Mexican groups.

# Scaffold:
# dat |>
#   filter(!is.na(Age), !is.na(Race3), !is.na(Gender), Age <= 80) |>
#   ggplot(aes(x = Age)) +
#   geom_histogram(binwidth = 5, boundary = 0) +
#   facet_grid(Race3 ~ Gender) +
#   labs(x = "Age", y = "Count")

Comment (write 2–4 sentences): (Your text here.)

8

Compute the median age for each Race3 group and the percent of individuals younger than 18. Order rows by median age. Table: 6 rows and 3 columns (group, median age, children).

# Scaffold:
# dat |>
#   filter(!is.na(Age), !is.na(Race3)) |>
#   group_by(Race3) |>
#   summarize(
#     median_age = median(Age),
#     children   = mean(Age < 18) * 100,
#     .groups = "drop"
#   ) |>
#   arrange(median_age) |>
#   select(Race3, median_age, children) |>
#   kable()

Given these results, provide an explanation for why systolic pressure is lower when comparing the White and Mexican groups.

Explanation (write 2–4 sentences): (Your text here.)

9

Write a function that computes the number of observations in each gender, age group, and race combination. Show groups with fewer than five observations.

Remove rows with no BPSysAve before counting.
Include combinations with 0 individuals (use complete).
Table columns: gender, age strata, group, count.

# Scaffold: define the function (you will complete the pipeline inside)
count_cells <- function(df) {
  # df |>
  #   filter(!is.na(BPSysAve)) |>
  #   count(Gender, AgeDecade, Race3) |>
  #   complete(Gender, AgeDecade, Race3, fill = list(n = 0))
}

# Scaffold: use the function and then filter small cells
# count_cells(dat) |>
#   filter(n < 5) |>
#   select(Gender, AgeDecade, Race3, n) |>
#   kable()

10

Redefine dat with:

Include only survey year 2011-2012.
Remove observations with no age group reported.
Remove the 0-9 age group.
Combine the 60-69 and 70+ age groups into a 60+ group.
Remove Other in Race3.
Rename Race3 to Race.

Hints:

Levels in AgeDecade start with a space.
Use fct_collapse to combine factors.

# Scaffold:
# dat <- NHANES |>
#   filter(SurveyYr == "2011_12") |>
#   filter(!is.na(AgeDecade)) |>
#   filter(AgeDecade != " 0-9") |>
#   mutate(AgeDecade = fct_collapse(AgeDecade, ` 60+` = c(" 60-69", " 70+"))) |>
#   filter(!is.na(Race3), Race3 != "Other") |>
#   rename(Race = Race3)

11

Create a plot showing average SBP for each age decade. Show race groups with color and connect points with lines. Make separate plots for males and females.

# Scaffold:
# dat |>
#   filter(!is.na(BPSysAve), !is.na(AgeDecade), !is.na(Race), !is.na(Gender)) |>
#   group_by(Gender, Race, AgeDecade) |>
#   summarize(avg = mean(BPSysAve), .groups = "drop") |>
#   ggplot(aes(x = AgeDecade, y = avg, color = Race, group = Race)) +
#   geom_point() +
#   geom_line() +
#   facet_wrap(~ Gender) +
#   labs(x = "Age decade", y = "Average SBP")

12

Pick two race groups that you think are consistently different. Redo Exercise 11 for only these two groups, add confidence intervals, remove lines, and put confidence intervals next to each other (use position_dodge). Comment on your finding.

# Choose your two groups:
group1 <- "White"
group2 <- "Mexican"

# Scaffold:
# dat |>
#   filter(!is.na(BPSysAve), !is.na(AgeDecade), Race %in% c(group1, group2)) |>
#   group_by(Gender, Race, AgeDecade) |>
#   summarize(
#     n   = n(),
#     avg = mean(BPSysAve),
#     sd  = sd(BPSysAve),
#     .groups = "drop"
#   ) |>
#   mutate(
#     lower = avg - 1.96 * sd / sqrt(n),
#     upper = avg + 1.96 * sd / sqrt(n)
#   ) |>
#   ggplot(aes(x = AgeDecade, y = avg, color = Race)) +
#   geom_point(position = position_dodge(width = 0.4)) +
#   geom_errorbar(aes(ymin = lower, ymax = upper),
#                 width = 0.15,
#                 position = position_dodge(width = 0.4)) +
#   facet_wrap(~ Gender) +
#   labs(x = "Age decade", y = "Average SBP")

Comment (write 2–4 sentences): (Your text here.)

13

For the two groups selected above, compute the difference in average SBP between the two groups for each age strata. Table columns: age strata, difference for females, difference for males.

# Scaffold plan:
# 1) Summarize avg by Gender, AgeDecade, Race for the two groups
# 2) pivot_wider(names_from = Race, values_from = avg)
# 3) compute diff = avg(group1) - avg(group2)
# 4) keep Gender, AgeDecade, diff
# 5) pivot_wider(names_from = Gender, values_from = diff)
# 6) kable()

# dat |>
#   filter(!is.na(BPSysAve), !is.na(AgeDecade), Race %in% c(group1, group2)) |>
#   group_by(Gender, AgeDecade, Race) |>
#   summarize(avg = mean(BPSysAve), .groups = "drop") |>
#   pivot_wider(names_from = Race, values_from = avg) |>
#   mutate(diff = .data[[group1]] - .data[[group2]]) |>
#   select(Gender, AgeDecade, diff) |>
#   pivot_wider(names_from = Gender, values_from = diff) |>
#   kable()

14

Create a table showing average BMI (BMI) for each Race and Gender combination using the redefined dat from Exercise 10.

Include: number of observations, average BMI, SD of BMI. Remove missing BMI values. Order by average BMI from lowest to highest.

# Scaffold:
# dat |>
#   filter(!is.na(BMI), !is.na(Race), !is.na(Gender)) |>
#   group_by(Race, Gender) |>
#   summarize(
#     n = n(),
#     avg_bmi = mean(BMI),
#     sd_bmi  = sd(BMI),
#     .groups = "drop"
#   ) |>
#   arrange(avg_bmi) |>
#   kable()

15

Create a summary table showing average SBP by smoking status (Smoke100) for each Race and Gender.

Columns: race, gender, smoking status (Yes/No), n, average SBP, SD of SBP. Filter out missing Smoke100 or BPSysAve. Arrange by race, then gender, then smoking status.

# Scaffold:
# dat |>
#   filter(!is.na(Smoke100), !is.na(BPSysAve)) |>
#   group_by(Race, Gender, Smoke100) |>
#   summarize(
#     n = n(),
#     avg_sbp = mean(BPSysAve),
#     sd_sbp  = sd(BPSysAve),
#     .groups = "drop"
#   ) |>
#   arrange(Race, Gender, Smoke100) |>
#   kable()

Optional Exercises

16

Create box plots comparing SBP distributions across race groups. Make separate plots for smokers and non-smokers and use facets for males/females. Filter out missing BPSysAve and Smoke100. Title: “Systolic Blood Pressure by Race and Smoking Status”.

# Scaffold:
# dat |>
#   filter(!is.na(BPSysAve), !is.na(Smoke100), !is.na(Race), !is.na(Gender)) |>
#   ggplot(aes(x = Race, y = BPSysAve)) +
#   geom_boxplot() +
#   facet_grid(Gender ~ Smoke100) +
#   labs(
#     x = "Race",
#     y = "Systolic blood pressure",
#     title = "Systolic Blood Pressure by Race and Smoking Status"
#   )

17

Create a table showing counts for each combination of AgeDecade, Race, and Gender, but only for combinations with fewer than 10 observations. Include only non-missing BPSysAve.

# Scaffold:
# dat |>
#   filter(!is.na(BPSysAve), !is.na(AgeDecade), !is.na(Race), !is.na(Gender)) |>
#   count(AgeDecade, Race, Gender) |>
#   filter(n < 10) |>
#   kable()

18

For each Race and Gender, compute: n, avg SBP, avg age, % smokers, avg BMI. Filter out missing BPSysAve, Age, Smoke100, BMI. Round numeric values to 1 decimal. Arrange by race then gender.

# Scaffold:
# dat |>
#   filter(!is.na(BPSysAve), !is.na(Age), !is.na(Smoke100), !is.na(BMI)) |>
#   group_by(Race, Gender) |>
#   summarize(
#     n = n(),
#     avg_sbp = mean(BPSysAve),
#     avg_age = mean(Age),
#     pct_smoke = mean(Smoke100 == "Yes") * 100,
#     avg_bmi = mean(BMI),
#     .groups = "drop"
#   ) |>
#   mutate(
#     avg_sbp = round(avg_sbp, 1),
#     avg_age = round(avg_age, 1),
#     pct_smoke = round(pct_smoke, 1),
#     avg_bmi = round(avg_bmi, 1)
#   ) |>
#   arrange(Race, Gender) |>
#   kable()