Problem Set 4

Introduction

In this problem set, you will work with US Census and CDC COVID-19 data. The goal is not to produce a formal causal analysis, but rather to build a clean weekly state-level dataset and use graphics to investigate patterns in COVID-19 burden and vaccination uptake.

We will focus on the following questions:

How did COVID-19 case, hospitalization, and death rates evolve during 2020 and 2021?
Were there regional differences in COVID-19 burden and vaccination progress?
During large national waves, did states with higher vaccination or booster coverage tend to have lower hospitalization or death rates?

Objective

You will create a single weekly data frame for each jurisdiction in 2020 and 2021. Your final table should contain, at a minimum, the following variables:

date
state
state_name
region
region_name
population
cases
hosp
deaths
vax
booster

You will then use this table to create several figures.

Instructions

1. Obtain a Census API key

Go to https://api.census.gov/data/key_signup.html and request a Census API key.

Do not hard-code your key directly into your homework file. Instead, create a file in your working directory called _census-key.R containing exactly one line of code:

census_key <- "YOUR_REAL_CENSUS_KEY"

Note that the API key must be enclosed inside the quotes. Your homework code should then load the key by sourcing that file.

2. Create a project structure

In your course Git repository, create a folder hw4 that includes the following directories:

data
code
figs

Inside code, include:

funcs.R
wrangle.R
analysis.qmd

Your final wrangling script should save the completed dataset as an RDS file in data, and your figures should be saved as PNG files in figs.

3. General guidance

The CDC datasets do not all use the same variable names, so inspect the column names carefully.
The CDC APIs can change over time, so part of the assignment is to verify the current schema before wrangling.
Include comments in your code so a grader can follow your logic.
In each code chunk, change #| eval: false to #| eval: true.
Remove NULL in each code chunk after you add your own code.

Part I: Download the data

For this part, put your code in code/wrangle.R unless otherwise specified.

1. Load the required packages and source the Census key

Add code that loads the packages you will need and defines census_key by sourcing census-key.R.

# Load packages you plan to use
library(tidyverse)
library(httr2)
library(janitor)
library(jsonlite)
library(purrr)
library(lubridate)
library(knitr)

# Define census_key by sourcing the local file
# Your code here

2. Download Census population data

Use the Census API to request 2020 and 2021 state population estimates. The code below constructs and performs the request. Add at least two lines of your own code to do the following:

check that the response was successful,
determine the content type,
and extract the body into an object called population_raw.

census_url <- "https://api.census.gov/data/2021/pep/population"

population_request <- request(census_url) |>
  req_url_query(
    get = "POP_2020,POP_2021,NAME",
    "for" = "state:*",
    key = census_key
  )

population_response <- population_request |>
  req_perform()

# Add your code below
# 1. Check the response status
# 2. Determine the content type
# 3. Extract the body into population_raw

3. Tidy the population data

Convert population_raw into a tidy tibble called population with one row per state-year. Your final table should include:

state_name
state
year
population

Also add state abbreviations, making sure to handle DC and Puerto Rico correctly.

## Create tidy population table
population <- population_raw |>
  # remove the header row problem
  # convert to tibble
  # pivot to long format
  # parse year and population as numeric
  # add state abbreviations
  # keep a tidy set of variables
  NULL

4. Download the regions file

The following JSON file lists the Public Health Service regions:

regions_url <- "https://github.com/datasciencelabs/2025/raw/refs/heads/main/data/regions.json"

Write code to create a data frame called regions with the variables:

state_name
region
region_name

One of the region names is long. Replace it with a shorter version.

regions_json <- fromJSON(regions_url, simplifyVector = FALSE)

regions <- 
  # convert the parsed list into a tidy data frame
  # shorten the longest region name
  NULL

5. Join population and region information

Join regions onto population.

population <- 
  # join population and regions
  NULL

6. Write a reusable CDC downloader

Save the following function in code/funcs.R. This function should download a CDC Socrata endpoint, remove the default row limit by setting a very large $limit, parse the JSON, and return a tibble with clean column names.

get_cdc_data <- function(endpoint) {
  request(endpoint) |>
    req_url_query("$limit" = 10000000) |>
    req_perform() |>
    resp_check_status() |>
    resp_body_json(simplifyVector = TRUE) |>
    as_tibble() |>
    janitor::clean_names()
}

In wrangle.R, source funcs.R.

source("code/funcs.R")

7. Download the CDC datasets

Use the function above to download the four datasets below. Save them as cases_raw, hosp_raw, deaths_raw, and vax_raw.

cases: https://data.cdc.gov/resource/pwn4-m3yp.json
hospitalizations: https://data.cdc.gov/resource/39z2-9zu6.json
deaths: https://data.cdc.gov/resource/r8kw-7aab.json
vaccinations: https://data.cdc.gov/resource/rh2h-3yt2.json

The download code is provided. Add one or two lines of your own to inspect the result, for example by checking dimensions or variable names.

cases_raw  <- get_cdc_data("https://data.cdc.gov/resource/pwn4-m3yp.json")
hosp_raw   <- get_cdc_data("https://data.cdc.gov/resource/39z2-9zu6.json")
deaths_raw <- get_cdc_data("https://data.cdc.gov/resource/r8kw-7aab.json")
vax_raw    <- get_cdc_data("https://data.cdc.gov/resource/rh2h-3yt2.json")

# Add one or two quick inspection lines here

Part II: Wrangle the weekly state-level dataset

In this section, you will create a weekly panel for 2020 and 2021.

8. Compare the raw schemas

Create a small summary table that reports, for each raw CDC dataset:

the column containing the jurisdiction identifier,
whether the data are daily or weekly,
and the columns that contain time information.

You may render the table with knitr::kable().

raw_summary <- tibble(
  outcome = c("cases", "hospitalizations", "deaths", "vaccinations")
  # add the remaining columns
)

# render the table

9. Wrangle the cases data

Create a table called cases that contains one row per state-week with:

state
mmwr_year
mmwr_week
date (the week-ending date)
cases (weekly total cases)

Keep only states that appear in the population table.

Hints:

The date information may need to be parsed.
Use epiweek() and epiyear() from lubridate.
The cases dataset is already weekly, but you should still verify the time variable you use.

## Wrangle weekly cases
cases <- cases_raw |>
  # choose the relevant state column
  # choose the relevant date column
  # choose the relevant cases column
  # parse dates
  # restrict to jurisdictions in population
  # compute MMWR year and week
  # aggregate to one row per state-week if needed
  NULL

10. Wrangle the hospitalization data

Create a table called hosp with the same weekly structure as cases, but containing weekly hospitalizations in a variable named hosp.

Because hospitalization data are daily, collapse them to state-week totals. Remove weeks with fewer than 7 reporting days.

## Wrangle weekly hospitalizations
hosp <- hosp_raw |>
  # identify the state, date, and hospitalization columns
  # parse the dates
  # compute MMWR year and week
  # aggregate to weekly totals
  # count number of reporting days per week
  # remove incomplete weeks
  NULL

11. Wrangle the deaths data

Create a table called deaths with weekly COVID-19 deaths by state. Keep the variables:

state
mmwr_year
mmwr_week
date
deaths

Be careful: this dataset may contain national rows and grouping variables.

## Wrangle weekly deaths
# Possible tasks:
# - keep only state-level observations
# - filter to a weekly grouping if needed
# - choose the COVID death count column
# - parse the week-ending date
# - keep the needed variables

deaths <- deaths_raw |>
  NULL

12. Wrangle the vaccination data

Create a table called vax containing the most recent vaccination record within each state-week. Keep:

state
mmwr_year
mmwr_week
date
vax (primary series cumulative count)
booster (booster cumulative count)

## Wrangle weekly vaccination data
vax <- vax_raw |>
  # identify state and date columns
  # choose cumulative primary-series and booster variables
  # parse dates
  # compute MMWR year and week
  # keep the last record within each state-week
  NULL

13. Create the full state-week panel

We will include all state-week combinations from 2020 and 2021, even if one of the datasets is missing a value for a particular week.

Use the following scaffold to create all possible weeks and join population information:

all_dates <- data.frame(
  date = seq(make_date(2020, 1, 25), make_date(2021, 12, 31), by = "week")
) |>
  mutate(date = ceiling_date(date, unit = "week", week_start = 7) - days(1)) |>
  mutate(mmwr_year = epiyear(date), mmwr_week = epiweek(date))

dates_and_pop <- cross_join(
  all_dates,
  data.frame(state = unique(population$state))
) |>
  left_join(population, by = c("state", "mmwr_year" = "year"))

Now join cases, hosp, deaths, and vax to create a final table called dat.

Requirements:

order observations by state and date,
fill vaccination and booster counts forward within state,
replace missing counts for cases, hosp, and deaths with 0,
save the final table as data/dat.rds.

## Join the weekly tables

dat <- dates_and_pop |>
  # join weekly cases
  # join weekly hospitalizations
  # join weekly deaths
  # join weekly vaccinations
  # arrange within state by date
  # fill cumulative vaccination variables forward
  # replace missing weekly counts with 0
  NULL

## Save the dataset
# Your code here

Part III: Create figures in `analysis.qmd`

In your analysis.qmd file, load the weekly dataset from data/dat.rds. Each figure should have:

a short descriptive paragraph,
a code chunk that is shown in the rendered document,
and appropriate titles and axis labels.

# Setup chunk for analysis.qmd
library(tidyverse)
library(lubridate)
library(scales)

dat <- readRDS("data/dat.rds")

14. Figure 1: National weekly burden

Create a trend plot showing national weekly cases, hospitalizations, and deaths per 100,000 people. Place the three panels on top of one another.

Hint: first aggregate to the national level by week, then use pivot_longer() and facet_wrap().

## Figure 1: National weekly burden
fig1_dat <- dat |>
  # aggregate to national weekly rates per 100,000
  NULL

# make the plot

15. Figure 2: Regional case rates over time

Create a line plot of weekly case rates per 100,000 over time for each region_name. Use one line per region.

## Figure 2: Regional case rates
fig2_dat <- dat |>
  # aggregate to regional weekly rates
  NULL

# make the plot

16. Figure 3: Vaccination and booster progress by region

For each region, compute the percent of the regional population that completed the primary series and the percent that received a booster dose. Plot both over time.

One reasonable approach is to create a long data frame with one row per region-date-series.

## Figure 3: Vaccination and booster progress
fig3_dat <- dat |>
  # compute regional percentages
  NULL

# make the plot

17. Figure 4: Top 10 states by cumulative death rate

By the end of 2021, which states had the highest cumulative COVID-19 death rates per 100,000? Create a bar plot of the top 10 states.

## Figure 4: Top 10 cumulative death rates
fig4_dat <- dat |>
  # compute cumulative deaths by state
  # convert to a per-100,000 rate
  # keep the top 10
  NULL

# make the plot

18. Figure 5: Vaccination and hospitalization during a major wave

Use your earlier figures to identify one major national COVID-19 wave in 2021 during which a meaningful share of the population had already completed the primary series.

For your chosen period:

compute hospitalizations per day per 100,000 for each state,
compute the primary-series vaccination rate in each state at the end of the period,
create a scatter plot with vaccination rate on the x-axis and hospitalization rate on the y-axis.

Briefly explain how you chose the time window.

## Figure 5: Vaccination vs hospitalization
start_date <- as_date("YYYY-MM-DD")
end_date   <- as_date("YYYY-MM-DD")

fig5_dat <- dat |>
  # filter to the selected period
  # summarize hospitalization burden
  # join vaccination coverage at the end date
  # compute rates
  NULL

# make the plot

19. Figure 6: Booster coverage and death rates

Repeat the previous exercise, but now examine the relationship between:

booster coverage at the end of the period, and
COVID-19 deaths per day per 100,000 during the period.

Use a late-2021 window for which booster coverage is nontrivial.

## Figure 6: Booster coverage vs death rate
start_date <- as_date("YYYY-MM-DD")
end_date   <- as_date("YYYY-MM-DD")

fig6_dat <- dat |>
  # filter to the selected period
  # summarize deaths during the period
  # join booster coverage at the end date
  # compute rates
  NULL

# make the plot

Deliverables

Push your hw4 folder to your Git repo that contains:

code/funcs.R
code/wrangle.R
code/analysis.qmd
data/dat.rds
at least six PNG files in figs

Submit a link to your GitHub repo.

A rendered PDF of this QMD file that should display both code and output:

quarto render hw4.qmd --to pdf