# Load packages you plan to use
library(tidyverse)
library(httr2)
library(janitor)
library(jsonlite)
library(purrr)
library(lubridate)
library(knitr)
# Define census_key by sourcing the local file
# Your code hereProblem Set 4
Introduction
In this problem set, you will work with US Census and CDC COVID-19 data. The goal is not to produce a formal causal analysis, but rather to build a clean weekly state-level dataset and use graphics to investigate patterns in COVID-19 burden and vaccination uptake.
We will focus on the following questions:
- How did COVID-19 case, hospitalization, and death rates evolve during 2020 and 2021?
- Were there regional differences in COVID-19 burden and vaccination progress?
- During large national waves, did states with higher vaccination or booster coverage tend to have lower hospitalization or death rates?
Objective
You will create a single weekly data frame for each jurisdiction in 2020 and 2021. Your final table should contain, at a minimum, the following variables:
datestatestate_nameregionregion_namepopulationcaseshospdeathsvaxbooster
You will then use this table to create several figures.
Instructions
1. Obtain a Census API key
Go to https://api.census.gov/data/key_signup.html and request a Census API key.
Do not hard-code your key directly into your homework file. Instead, create a file in your working directory called _census-key.R containing exactly one line of code:
census_key <- "YOUR_REAL_CENSUS_KEY"Note that the API key must be enclosed inside the quotes. Your homework code should then load the key by sourcing that file.
2. Create a project structure
In your course Git repository, create a folder hw4 that includes the following directories:
datacodefigs
Inside code, include:
funcs.Rwrangle.Ranalysis.qmd
Your final wrangling script should save the completed dataset as an RDS file in data, and your figures should be saved as PNG files in figs.
3. General guidance
- The CDC datasets do not all use the same variable names, so inspect the column names carefully.
- The CDC APIs can change over time, so part of the assignment is to verify the current schema before wrangling.
- Include comments in your code so a grader can follow your logic.
- In each code chunk, change
#| eval: falseto#| eval: true. - Remove
NULLin each code chunk after you add your own code.
Part I: Download the data
For this part, put your code in code/wrangle.R unless otherwise specified.
1. Load the required packages and source the Census key
Add code that loads the packages you will need and defines census_key by sourcing census-key.R.
2. Download Census population data
Use the Census API to request 2020 and 2021 state population estimates. The code below constructs and performs the request. Add at least two lines of your own code to do the following:
- check that the response was successful,
- determine the content type,
- and extract the body into an object called
population_raw.
census_url <- "https://api.census.gov/data/2021/pep/population"
population_request <- request(census_url) |>
req_url_query(
get = "POP_2020,POP_2021,NAME",
"for" = "state:*",
key = census_key
)
population_response <- population_request |>
req_perform()
# Add your code below
# 1. Check the response status
# 2. Determine the content type
# 3. Extract the body into population_raw3. Tidy the population data
Convert population_raw into a tidy tibble called population with one row per state-year. Your final table should include:
state_namestateyearpopulation
Also add state abbreviations, making sure to handle DC and Puerto Rico correctly.
## Create tidy population table
population <- population_raw |>
# remove the header row problem
# convert to tibble
# pivot to long format
# parse year and population as numeric
# add state abbreviations
# keep a tidy set of variables
NULL4. Download the regions file
The following JSON file lists the Public Health Service regions:
regions_url <- "https://github.com/datasciencelabs/2025/raw/refs/heads/main/data/regions.json"Write code to create a data frame called regions with the variables:
state_nameregionregion_name
One of the region names is long. Replace it with a shorter version.
regions_json <- fromJSON(regions_url, simplifyVector = FALSE)
regions <-
# convert the parsed list into a tidy data frame
# shorten the longest region name
NULL5. Join population and region information
Join regions onto population.
population <-
# join population and regions
NULL6. Write a reusable CDC downloader
Save the following function in code/funcs.R. This function should download a CDC Socrata endpoint, remove the default row limit by setting a very large $limit, parse the JSON, and return a tibble with clean column names.
get_cdc_data <- function(endpoint) {
request(endpoint) |>
req_url_query("$limit" = 10000000) |>
req_perform() |>
resp_check_status() |>
resp_body_json(simplifyVector = TRUE) |>
as_tibble() |>
janitor::clean_names()
}In wrangle.R, source funcs.R.
source("code/funcs.R")7. Download the CDC datasets
Use the function above to download the four datasets below. Save them as cases_raw, hosp_raw, deaths_raw, and vax_raw.
- cases:
https://data.cdc.gov/resource/pwn4-m3yp.json - hospitalizations:
https://data.cdc.gov/resource/39z2-9zu6.json - deaths:
https://data.cdc.gov/resource/r8kw-7aab.json - vaccinations:
https://data.cdc.gov/resource/rh2h-3yt2.json
The download code is provided. Add one or two lines of your own to inspect the result, for example by checking dimensions or variable names.
cases_raw <- get_cdc_data("https://data.cdc.gov/resource/pwn4-m3yp.json")
hosp_raw <- get_cdc_data("https://data.cdc.gov/resource/39z2-9zu6.json")
deaths_raw <- get_cdc_data("https://data.cdc.gov/resource/r8kw-7aab.json")
vax_raw <- get_cdc_data("https://data.cdc.gov/resource/rh2h-3yt2.json")
# Add one or two quick inspection lines herePart II: Wrangle the weekly state-level dataset
In this section, you will create a weekly panel for 2020 and 2021.
8. Compare the raw schemas
Create a small summary table that reports, for each raw CDC dataset:
- the column containing the jurisdiction identifier,
- whether the data are daily or weekly,
- and the columns that contain time information.
You may render the table with knitr::kable().
raw_summary <- tibble(
outcome = c("cases", "hospitalizations", "deaths", "vaccinations")
# add the remaining columns
)
# render the table9. Wrangle the cases data
Create a table called cases that contains one row per state-week with:
statemmwr_yearmmwr_weekdate(the week-ending date)cases(weekly total cases)
Keep only states that appear in the population table.
Hints:
- The date information may need to be parsed.
- Use
epiweek()andepiyear()from lubridate. - The cases dataset is already weekly, but you should still verify the time variable you use.
## Wrangle weekly cases
cases <- cases_raw |>
# choose the relevant state column
# choose the relevant date column
# choose the relevant cases column
# parse dates
# restrict to jurisdictions in population
# compute MMWR year and week
# aggregate to one row per state-week if needed
NULL10. Wrangle the hospitalization data
Create a table called hosp with the same weekly structure as cases, but containing weekly hospitalizations in a variable named hosp.
Because hospitalization data are daily, collapse them to state-week totals. Remove weeks with fewer than 7 reporting days.
## Wrangle weekly hospitalizations
hosp <- hosp_raw |>
# identify the state, date, and hospitalization columns
# parse the dates
# compute MMWR year and week
# aggregate to weekly totals
# count number of reporting days per week
# remove incomplete weeks
NULL11. Wrangle the deaths data
Create a table called deaths with weekly COVID-19 deaths by state. Keep the variables:
statemmwr_yearmmwr_weekdatedeaths
Be careful: this dataset may contain national rows and grouping variables.
## Wrangle weekly deaths
# Possible tasks:
# - keep only state-level observations
# - filter to a weekly grouping if needed
# - choose the COVID death count column
# - parse the week-ending date
# - keep the needed variables
deaths <- deaths_raw |>
NULL12. Wrangle the vaccination data
Create a table called vax containing the most recent vaccination record within each state-week. Keep:
statemmwr_yearmmwr_weekdatevax(primary series cumulative count)booster(booster cumulative count)
## Wrangle weekly vaccination data
vax <- vax_raw |>
# identify state and date columns
# choose cumulative primary-series and booster variables
# parse dates
# compute MMWR year and week
# keep the last record within each state-week
NULL13. Create the full state-week panel
We will include all state-week combinations from 2020 and 2021, even if one of the datasets is missing a value for a particular week.
Use the following scaffold to create all possible weeks and join population information:
all_dates <- data.frame(
date = seq(make_date(2020, 1, 25), make_date(2021, 12, 31), by = "week")
) |>
mutate(date = ceiling_date(date, unit = "week", week_start = 7) - days(1)) |>
mutate(mmwr_year = epiyear(date), mmwr_week = epiweek(date))
dates_and_pop <- cross_join(
all_dates,
data.frame(state = unique(population$state))
) |>
left_join(population, by = c("state", "mmwr_year" = "year"))Now join cases, hosp, deaths, and vax to create a final table called dat.
Requirements:
- order observations by
stateanddate, - fill vaccination and booster counts forward within state,
- replace missing counts for
cases,hosp, anddeathswith 0, - save the final table as
data/dat.rds.
## Join the weekly tables
dat <- dates_and_pop |>
# join weekly cases
# join weekly hospitalizations
# join weekly deaths
# join weekly vaccinations
# arrange within state by date
# fill cumulative vaccination variables forward
# replace missing weekly counts with 0
NULL
## Save the dataset
# Your code herePart III: Create figures in analysis.qmd
In your analysis.qmd file, load the weekly dataset from data/dat.rds. Each figure should have:
- a short descriptive paragraph,
- a code chunk that is shown in the rendered document,
- and appropriate titles and axis labels.
# Setup chunk for analysis.qmd
library(tidyverse)
library(lubridate)
library(scales)
dat <- readRDS("data/dat.rds")14. Figure 1: National weekly burden
Create a trend plot showing national weekly cases, hospitalizations, and deaths per 100,000 people. Place the three panels on top of one another.
Hint: first aggregate to the national level by week, then use pivot_longer() and facet_wrap().
## Figure 1: National weekly burden
fig1_dat <- dat |>
# aggregate to national weekly rates per 100,000
NULL
# make the plot15. Figure 2: Regional case rates over time
Create a line plot of weekly case rates per 100,000 over time for each region_name. Use one line per region.
## Figure 2: Regional case rates
fig2_dat <- dat |>
# aggregate to regional weekly rates
NULL
# make the plot16. Figure 3: Vaccination and booster progress by region
For each region, compute the percent of the regional population that completed the primary series and the percent that received a booster dose. Plot both over time.
One reasonable approach is to create a long data frame with one row per region-date-series.
## Figure 3: Vaccination and booster progress
fig3_dat <- dat |>
# compute regional percentages
NULL
# make the plot17. Figure 4: Top 10 states by cumulative death rate
By the end of 2021, which states had the highest cumulative COVID-19 death rates per 100,000? Create a bar plot of the top 10 states.
## Figure 4: Top 10 cumulative death rates
fig4_dat <- dat |>
# compute cumulative deaths by state
# convert to a per-100,000 rate
# keep the top 10
NULL
# make the plot18. Figure 5: Vaccination and hospitalization during a major wave
Use your earlier figures to identify one major national COVID-19 wave in 2021 during which a meaningful share of the population had already completed the primary series.
For your chosen period:
- compute hospitalizations per day per 100,000 for each state,
- compute the primary-series vaccination rate in each state at the end of the period,
- create a scatter plot with vaccination rate on the x-axis and hospitalization rate on the y-axis.
Briefly explain how you chose the time window.
## Figure 5: Vaccination vs hospitalization
start_date <- as_date("YYYY-MM-DD")
end_date <- as_date("YYYY-MM-DD")
fig5_dat <- dat |>
# filter to the selected period
# summarize hospitalization burden
# join vaccination coverage at the end date
# compute rates
NULL
# make the plot19. Figure 6: Booster coverage and death rates
Repeat the previous exercise, but now examine the relationship between:
- booster coverage at the end of the period, and
- COVID-19 deaths per day per 100,000 during the period.
Use a late-2021 window for which booster coverage is nontrivial.
## Figure 6: Booster coverage vs death rate
start_date <- as_date("YYYY-MM-DD")
end_date <- as_date("YYYY-MM-DD")
fig6_dat <- dat |>
# filter to the selected period
# summarize deaths during the period
# join booster coverage at the end date
# compute rates
NULL
# make the plotDeliverables
- Push your
hw4folder to your Git repo that contains:
code/funcs.Rcode/wrangle.Rcode/analysis.qmddata/dat.rds- at least six PNG files in
figs
Submit a link to your GitHub repo.
- A rendered PDF of this QMD file that should display both code and output:
quarto render hw4.qmd --to pdf