Shigella Data From CDC

This was part of a project I was working on about an outbreak of Shigellosis cases among HIV positive men during the summer of 2019. The full code, along with some of the public data is available on my Github. The project didn’t really go anywhere beyond this, but it was interesting developing a script that pulled all the data automatically.

Call the CDC’s API for the data

Data last updated on 2020-06-13

Our first step will be to download the weekly NNDSS data for Shigellosis across the past few years. The WONDER tables provide data by the week, but we’ll be using the CDC’s API to access the annual NNDSS tables. The data.gov API uses Socrata, so you’ll need the RSocrata package to download the data yourself.

Each year has it’s own table, which I found by searching Shigella. Unfortunately, this means you have to recode some of the variables on a year to year basis. For example, the District of Colombia is coded as DISTRICT OF COLUMBIA in 2019, but is DIST. OF COL. in prior years. The API also returns data for non-states (e.g. Guam) and the sums for regions (e.g. Middle Atlantic). For simplicity, I’m only going to pull the data from:

  • Each of the 50 US states
  • The total cases
  • The District of Colombia
  • New York City (The state of New York excludes cases in NYC in it’s count)
### ---- 2019 Data ---- ###
data2019 <- "https://data.cdc.gov/resource/ew98-twec.csv" %>% read.socrata() %>%
  # Select only shigellosis data (plus other relevant variables)
  select(reporting_area, mmwr_year, mmwr_week, contains("shigellosis")) %>%
  
  # recode variables to make them consistent across the years
  mutate(reporting_area = recode(reporting_area, "TOTAL"="UNITED STATES")) %>%
  
  # Select only US states + total (ignore regional sums)
  filter(reporting_area %in% c(state_names, "UNITED STATES", 
                               "NEW YORK CITY", "DISTRICT OF COLUMBIA")) %>%
  
  # Make data tidy by adding variable for shigella
  mutate(pathogen = "Shigellosis") %>%
  
  # Rename variables and put them in order
  select(reporting_area:mmwr_week, pathogen, weekly_cases=shigellosis_current_week,
         cum_thisYear=shigellosis_cum_2019, cum_lastYear=shigellosis_cum_2018,
         max_past52weeks=shigellosis_previous_52_weeks_max) 

Because this is a lot of code, I’m not including code for the rest of the years here, but the full code for this can be found on my Github.

Here’s what the 2019 data frame looks like:

glimpse(data2019)
## Rows: 2,756
## Columns: 8
## $ reporting_area  <chr> "IOWA", "ALASKA", "NEW YORK", "IDAHO", "DELAWARE", "F…
## $ mmwr_year       <int> 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019,…
## $ mmwr_week       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ pathogen        <chr> "Shigellosis", "Shigellosis", "Shigellosis", "Shigell…
## $ weekly_cases    <int> NA, NA, 2, NA, NA, NA, 18, NA, 1, NA, NA, 3, 1, NA, N…
## $ cum_thisYear    <int> NA, NA, 2, NA, NA, NA, 18, NA, 1, NA, NA, 3, 1, NA, N…
## $ cum_lastYear    <int> 2, 2, 1, NA, NA, 7, 9, 1, 2, 3, NA, 6, NA, 5, 1, 11, …
## $ max_past52weeks <int> 7, 2, 25, 6, 1, 50, 43, 1, 11, 3, 2, 21, 6, 17, 14, 1…

Now we’ll join all of these together. We’ll also use the MMWRweek package to translate the MMWR date format (MMWR week, MMWR year) to the actual calendar day, month, and year:

### ---- Join the data frames --- ###
shigella <- bind_rows(data2019, data2018, data2017, data2016, data2015) %>%
  as.tibble() %>%
  unique() %>%
  rename(year=mmwr_year) %>%
  mutate(date = MMWRweek::MMWRweek2Date(MMWRyear = year,   
                                        MMWRweek = mmwr_week,
                                        MMWRday  = 7)) %>%
  select(reporting_area, date, everything()) # reorder vars

# Remove old df
rm(data2019, data2018, data2017, data2016, data2015)

Interactive graphs

Below are interactive figures that you can use to look at the weekly cases by state. The first graph shows the absolute number of cases, and the second shows the incidence rate (in weekly cases per million population in the state).

The graphs can be filtered by individual state or division

widgetframe::frameWidget(ggplotly(gg, dynamicTicks = TRUE))

Same as above, but cases per million (i.e. incidence rate)

S. flexneri more common in males

Another interesting point comes from the CDC’s National Enteric Disease Surveillance report. In their most recent report in 2016, they detailed the distribution of Shigella cases by species across age/sex. On page 12 they have the table for S. flexneri, which looks very different than the tables for other species

S. flexneri pyramid

Figure 8: S. flexneri pyramid

Compare this with S. sonnei:

S. sonnei pyramid

Figure 9: S. sonnei pyramid

Hunter Ratliff, MD, MPH
Hunter Ratliff, MD, MPH
Infectious Diseases Fellow

My research interests include epidemiology, social determinants of health, and reproducible research.

Related