8 Working with Census Data in R

If you have to work with Census data, you can either explore Census data via data.census.gov, tap other Census Bureau resources to find and download the data you need, or rely upon some R packages (the more efficient option). Let us see some of these packages in action, starting with censusapi, tidycensus, and ipumsr.

8.1 Using {censusapi}

Authored by Hannah Recht, this package allows you to access pretty much all Census products via the API. Of course, we will have to get and save our Census API key, and then install the package and the key as shown below. The Census API key is easily obtained from here.

install.packages("devtools")
devtools::install_github("hrecht/censusapi")
Sys.setenv(CENSUS_KEY = YOURKEYHERE)
readRenviron("~/.Renviron")
Sys.getenv("CENSUS_KEY")

Now we can get to work. Let us start by loading the library and then seeing what Census APIs are available.

library(censusapi)
listCensusApis() -> apis

Each API could have slightly varying parameters so Hannah recommends checking the API documentation available here but points out some common parameters – name (the API’s name), vintage (the year of the dataset), vars (the variables you want to access), and region (state, county, etc).

It is easy to find variable names via the built-in function listCensusMetadata. The bureau’s small area health insurance estimates (SAHIE) are shown below, as well as the small area income and poverty estimates (SAIPE).

listCensusMetadata(
  name = "timeseries/healthins/sahie", 
  type = "variables"
  ) -> sahie_vars
listCensusMetadata(
  name = "timeseries/poverty/saipe",
  type = "variables"
  ) -> saipe_vars 

Here are the SAHIE variables

and here are the SAIPE variables.

If you want to see what geographies are available, switch type = to geography:

listCensusMetadata(
  name = "timeseries/healthins/sahie",
  type = "geography"
  ) -> sahie_geos 
listCensusMetadata(
  name = "timeseries/poverty/saipe",
  type = "geography"
  ) -> saipe_geos

Similarly, here are the SAHIE geographies

and here are the SAIPE geographies.

Let us get the most recent county-level data, noting that the latest SAHIE are for 2015 but the latest SAIPE are for 2016.

getCensus(
  name = "timeseries/healthins/sahie",
  vars = c("NAME", "IPRCAT", "IPR_DESC", "PCTUI_PT"),
  region = "county:*", 
  regionin = "state:39", 
  time = 2019, 
  key = CENSUS_KEY
  ) -> sahie_counties
getCensus(
  name = "timeseries/poverty/saipe",
  vars = c("NAME", "SAEPOVRTALL_PT", "SAEMHI_PT"), 
  region = "county:*", 
  regionin = "state:39", 
  YEAR = 2020, 
  key = CENSUS_KEY
  ) -> saipe_counties

Almost everything is read-in as a character (see the chr flag in the data frame) so we’ll have to convert the variables we want to numeric.

library(tidyverse)
saipe_counties %>%
  mutate(
    prate = as.numeric(SAEPOVRTALL_PT),
    mdhinc = as.numeric(SAEMHI_PT)
    ) -> saipe_counties

If you want data for a lot of geographies, the package will let you do it in seconds. For example, if you want tract-level data for all tracts, you can get it as shown below:

fips
tracts <- NULL
for (f in fips) {
    stateget <- paste("state:", f, sep = "")
    temp <- getCensus(name = "sf3", vintage  = 1990,
    vars = c("P0070001", "P0070002", "P114A001"), 
    region = "tract:*",
    regionin = stateget, key = CENSUS_KEY)
    tracts <- rbind(tracts, temp)
    }
head(tracts)

Notice that you had to specify the region, and since tracts are nested within states, you had to add regionin = stateget.

The 2010 Decennial Census summary file 1 requires you to specify a state and county to retrieve block-level data. Use region to request block level data, and regionin to specify the desired state and county.

Here we run it for Athens County, Ohio.

getCensus(
  name = "dec/sf1",
  vintage = 2010,
  vars = c("P001001", "P003001"), 
  region = "tract:*",
  regionin = "state:39",
  key = CENSUS_KEY
  ) -> oh2010sf1a

For the 2000 Decennial Census summary file 1, tract is also required to retrieve block-level data. This example requests data for all blocks within Census tract 010000 in county 027 of state 36.

getCensus(
  name = "sf1",
  vintage = 2000,
  vars = c("P001001", "P003001"), 
  region = "block:*",
  regionin = "state:39 + county:009 + tract:972600",
  key = CENSUS_KEY
  ) -> oh2000sf1a

There is a lot more that you can do with the package, and to get a sense of the possible, look at the voluminous example list.

8.2 Using {tidycensus}

The package’s author, Kyle Walker, describes it thus:

tidycensus is an R package that allows users to interface with the US Census Bureau’s decennial Census and five-year American Community APIs and return tidyverse-ready data frames, optionally with simple feature geometry included.

Install the Census API key as shown below.

library(tidyverse)
install.packages("tidycensus")
library(tidycensus)
census_api_key(
  "YOUR API KEY GOES HERE", 
  install = TRUE
  )

The install = TRUE switch will add the key to your .Renviron. The key should then be automatically read in future sessions with {tidycensus} so be sure to not use that switch for subsequent runs. Otherwise you will be repeatedly asked if you want to overwrite the previous setting in .Renviron.

The two Census products of most value to most of us will either be the decennial censuses or then the American Community Surveys (ACS). One can choose the product via get_decennial() or get_acs() as shown below. We start with a simple look at the variables available to us in the 2015-2019 ACS (the default ACS product for the current version of the package).

load_variables(
  year = 2017, 
  dataset = "acs5", 
  cache = TRUE
  ) -> my.vars.acs5

Let us look at the variables in the database.

DT::datatable(my.vars.acs5)

What if I am more interested in the Summary File 1 from 2010 or Summary File 3from 2000?8

load_variables(
  year = 2010, 
  dataset = "sf1", 
  cache = TRUE
  ) -> my.vars.d.sf1
head(my.vars.d.sf1)
#> # A tibble: 6 × 3
#>   name    label                                concept        
#>   <chr>   <chr>                                <chr>          
#> 1 H001001 Total                                HOUSING UNITS  
#> 2 H002001 Total                                URBAN AND RURAL
#> 3 H002002 Total!!Urban                         URBAN AND RURAL
#> 4 H002003 Total!!Urban!!Inside urbanized areas URBAN AND RURAL
#> 5 H002004 Total!!Urban!!Inside urban clusters  URBAN AND RURAL
#> 6 H002005 Total!!Rural                         URBAN AND RURAL
load_variables(
  year = 2000, 
  dataset = "sf3", 
  cache = TRUE
  ) -> my.vars.d.sf3
head(my.vars.d.sf3)
#> # A tibble: 6 × 3
#>   name    label                                       concept                       
#>   <chr>   <chr>                                       <chr>                         
#> 1 H001001 Total                                       HOUSING UNITS [1]             
#> 2 H002001 Total                                       UNWEIGHTED SAMPLE HOUSING UNI…
#> 3 H002002 Total!!Occupied                             UNWEIGHTED SAMPLE HOUSING UNI…
#> 4 H002003 Total!!Vacant                               UNWEIGHTED SAMPLE HOUSING UNI…
#> 5 H003001 Total                                       100-PERCENT COUNT OF HOUSING …
#> 6 H004001 Percent of occupied housing units in sample PERCENT OF HOUSING UNITS IN S…

Look at each of these and you will see all the variables in the respective databases. Say I want total population from each. This has to be P0010001 from each. Let us download these from the decennial census, first for states, then for all counties in Ohio, and then for all Census Tracts in Athens, Ohio. Let us do this for the decennial census and then for the ACS data.

get_decennial(
  geography = "state", 
  year = 2010, 
  variables = c(totpopn = "P001001")
  ) -> state.popn.d
get_decennial(
  geography = "county", 
  year = 2010, 
  variables = c(totpopn = "P001001"), 
  state = "OH"
  ) -> county.popn.d
get_decennial(
  geography = "tract", 
  year = 2010, 
  variables = c(totpopn = "P001001"), 
  state = "OH", 
  county = "Athens"
  ) -> tract.popn.d 

And now for the ACS:

get_acs(
  geography = "state", 
  year = 2017, 
  survey = "acs5",
  variables = c(totpopn = "B01003_001")
  ) -> state.popn.acs
get_acs(
  geography = "county", 
  year = 2017, 
  survey = "acs5",
  variables = c(totpopn = "B01003_001"),
  state = "OH"
  ) -> county.popn.acs
get_acs(
  geography = "tract", 
  year = 2017, 
  survey = "acs5",
  variables = c(totpopn = "B01003_001"),
  state = "OH", 
  county = "Athens"
  ) -> tract.popn.acs 

Here are the data for tracts, just as an example.

You can also get an entire table instead of having to list variables, such as shown below, with the default decennial census for the package’s current version (2010).

get_decennial(
  geography = "county", 
  table = "P012", 
  summary_var = "P001001"
  ) -> county.table.d
get_decennial(
  geography = "state", 
  table = "P012", 
  summary_var = "P001001"
  ) -> state.table.d

Here is a snippet of the state table.

If you want the geometry as well because you want to do some plotting, that is easily done. For example, let us look at median gross rent in Ohio’s Athens County’s census tracts.

library(tigris)
library(sf)
options(tigris_class = "sf")
options(tigris_use_cache = TRUE)
get_acs(
  geography = "tract", 
  variables = "DP04_0134", 
  year = 2017, 
  survey = "acs5", 
  state = "OH", 
  county = "Athens", 
  geometry = TRUE
  ) -> rent 

How can I plot it? Here is a quick and dirty map.

rent %>%
  ggplot() +
  geom_sf(
    aes(fill = estimate)
    )

If we want to focus on a single metro area, then we can do that as well. Here I am focusing on Columbus (Ohio), whose GEOID code is 18140.

core_based_statistical_areas(
  cb = TRUE
  ) %>%
  filter(
    GEOID %in% c("18140")
    ) %>% 
  select(
    metro_name = NAME
    ) -> metros 
st_join(
  rent, metros, 
  join = st_within, 
  left = FALSE
  ) -> wc_rent 

And now we plot the data.

library(viridis)
ggplot(
  wc_rent, 
  aes(
    fill = estimate, 
    color = estimate
    )
  ) + 
  geom_sf() + 
  coord_sf(
    crs = 26910
    ) + 
  theme_minimal() + 
  theme(
    aspect.ratio = 1
    ) + 
  scale_fill_viridis() + 
  scale_color_viridis()

What if we wanted to do this for all metro areas in Ohio?

core_based_statistical_areas(
  cb = TRUE
  ) %>%
  filter(
    GEOID %in% c(
      "10420", "15940", "17140", "17460", 
      "18140", "19380", "26580", "30620",
      "31900", "44220", "45780", "48260",
      "48540", "49660"
      )
    ) %>%
  select(
    metro_name = NAME
    ) -> metros2
st_join(
  rent, metros2, 
  join = st_within, 
  left = FALSE
  ) -> wc_rent2  
library(ggthemes)
ggplot(
  wc_rent2, 
  aes(
    fill = estimate, 
    color = estimate
    )
  ) + 
  geom_sf() + 
  coord_sf(
    crs = 26910
    ) + 
  facet_wrap(
    ~ metro_name
    ) + 
  theme_map() + 
  theme(
    aspect.ratio = 1
    ) + 
  scale_fill_viridis() + 
  scale_color_viridis()

Population data for all tracts in the country?

library(tidycensus)
library(purrr)
unique(fips_codes$state)[1:51] -> us
map_df(us, function(x) {
  get_acs(
    geography = "tract", 
    variables = "B01003_001", 
    state = x, 
    geometry = TRUE
    )
  }
  ) -> totalpop

One can also build maps with {tidycensus} since you can ask that geometry be returned. Let us build a map for the Columbus metro area in Ohio. I’ll stick with the Median Household Income example in the vignette. Note that we have to trim the get_acs() result to just the Columbus metro area. We’ll see the default map but then use {mapview} to get a rough sense of these tracts.

As a reminder, {mapview} is an R package created to help researchers during their spatial data analysis workflow. It provides functions to very quickly and conveniently create interactive visualisations of spatial data. It was created to fill the gap of quick (not presentation grade) interactive plotting to examine and visually investigate both aspects of spatial data, the gometries and their attributes.

options(tigris_use_cache = TRUE)
cmh <- get_acs(state = "OH", county = c("Delaware", "Fairfield", "Franklin", "Hocking", "Licking", "Madison", "Morrow", "Perry", "Pickaway", "Union"), geography = "tract", variables = "B19013_001", geometry = TRUE)

cbsas <- core_based_statistical_areas(cb = TRUE) %>% 
  filter(GEOID %in% c("18140")) %>%
  select(metro_name = NAME)

cmhm = st_join(cmh, cbsas, join = st_within, left = TRUE)
load("data/cmhm.RData")
cmhm %>%
  ggplot(
    aes(
      fill = estimate, 
      color = estimate
      )
    ) + 
  geom_sf() + 
  coord_sf(
    crs = 26911
    ) + 
  viridis::scale_fill_viridis(
    option = "mako"
    ) + 
  viridis::scale_color_viridis(
    option = "magma"
    )

library(mapview)
mapview(
  cmhm, 
  zcol = "estimate", 
  legend = TRUE
  )

There is a wonderful feature of the tidycensus package that allows you to use a reference variable in the denominator to calculate ratios, proportions, and percentages – the summary_var switch. In the example that follows we look at the distribution of the population by race.

load("data/metros.RData")
c(
  White = "P0050003", 
  Black = "P0050004", 
  Asian = "P0050006", 
  Hispanic = "P0040003"
  ) -> racevars
get_decennial(
  geography = "tract", 
  variables = racevars, 
  state = "OH", 
  county = c(
    "Delaware", "Fairfield", "Franklin", 
    "Hocking", "Licking", "Madison", 
    "Morrow", "Perry", "Pickaway", "Union"
    ), 
  geometry = TRUE,  
  summary_var = "P0010001"
  ) -> cmh2  
st_join(
  cmh2, metros, 
  join = st_within, 
  left = FALSE
  ) -> cmhm2 
cmhm2 %>%
  mutate(
    Percent = 100 * (value / summary_value)
    ) -> cmhm3
cmhm3 %>% 
  ggplot(
    aes(
      fill = Percent, 
      color = Percent
      )
    ) +
  facet_wrap(
    ~variable
    ) +
  geom_sf() +
  coord_sf(
    crs = 26915
    ) + 
  scale_fill_viridis(
    option = "turbo"
    ) +
  scale_color_viridis()

If you want to save your data file, st_write() comes to your help.

library(sf)
st_write(cmh, "cmh.shp")

8.3 Using {acs}

Authored by Ezra Klein, the {acs} package “Provides a general toolkit for downloading, managing, analyzing, and presenting data from the U.S. Census, including SF1 (Decennial”short-form”), SF3 (Decennial “long-form”), and the American Community Survey (ACS). Confidence intervals provided with ACS data are converted to standard errors to be bundled with estimates in complex acs objects. Package provides new methods to conduct standard operations on acs objects and present/plot data in statistically appropriate ways.” We’ll look at some of its key features here, relying on examples developed by Zev Ross (no surprises there) and other material from Ezra. Just a note of warning based on my personal experience: acs has a lot that it can and does do but it took me a long time to get it to do the things I wanted to see. It also will not pull recent ACS data so bear that in mind.

As usual, if you know what variable you want, life is easy but if you don’t know the table number, acs.lookup(...) is your friend.

library(acs)
acs.lookup(
  endyear = 2015, 
  span = 1, 
  keyword = c("Poverty"), 
  case.sensitive = FALSE
  ) -> myacs01
acs.lookup(
  endyear = 2000, 
  dataset = "sf3", 
  table.number = "P56"
  ) -> myacs02

I’ll pick table B14006 from the 5-year ACS ending in 2012.

geo.make(
  state = c("OH"),
  county = c(9), 
  tract = "*"
  ) -> geo 
acs.fetch(
  endyear = 2019, 
  span = 5, 
  geography = geo,  
  table.number = "B14006", 
  col.names = "pretty"
  ) -> atacs

The data will be returned as a list so will have to be converted into a data frame. We are padding state, county and tract since we’ll merge the data for mapping. Note also that you have to pick the columns you want to work with (if there are several columns downloaded, as is the case here; look at atacs). We then clean up the rowname, give concise names to the columns, and then calculate the percent of the population 3 years and over living below the poverty level.

data.frame(
  paste0(
    str_pad(
      atacs@geography$state, 2, "left", pad = "0"
      ), 
      str_pad(atacs@geography$county, 3, "left", pad = "0"), 
      str_pad(atacs@geography$tract, 6, "left", pad = "0")), 
      atacs@estimate[, c(
        "POVERTY STATUS IN THE PAST 12 MONTHS BY SCHOOL ENROLLMENT BY LEVEL OF SCHOOL FOR THE POPULATION 3 YEARS AND OVER: Total:",
"POVERTY STATUS IN THE PAST 12 MONTHS BY SCHOOL ENROLLMENT BY LEVEL OF SCHOOL FOR THE POPULATION 3 YEARS AND OVER: Income in the past 12 months below the poverty level:")], 
stringsAsFactors = FALSE
) -> atacs_df
rownames(atacs_df) <- 1:nrow(atacs_df)
names(atacs_df) <- c("GEOID", "total", "belowpov")
atacs_df$percent <- 100*(atacs_df$belowpov / atacs_df$total)

Now we merge

library(tigris)
tracts(
  state = 'OH', 
  county = c(9), 
  cb = TRUE
  ) -> tracts 
geo_join(
  tracts, atacs_df, 
  "GEOID", "GEOID"
  ) -> atacs_merged
# there are some tracts with no land that we should exclude
atacs_merged[atacs_merged$ALAND > 0, ] -> atacs_merged
load("data/atacs_merged.RData")

and then map

library(leaflet)
paste0(
  "GEOID: ", 
  atacs_merged$GEOID, 
  "<br>", "Percent below poverty level: ",
  round(atacs_merged$percent, 2)
  ) -> popup
colorNumeric(
  palette = "YlGnBu",
  domain = atacs_merged$percent
  ) -> pal
leaflet() %>%
  addProviderTiles(
    "CartoDB.Positron"
    ) %>%
  addPolygons(
    data = atacs_merged, 
    fillColor = ~pal(percent), 
    color = "#b2aeae", # you need to use hex colors
    fillOpacity = 0.4, 
    weight = 1, 
    smoothFactor = 0.2,
    popup = popup
    ) %>%
  addLegend(
    pal = pal, 
    values = atacs_merged$percent, 
    position = "bottomright", 
    title = "Percent <br>below poverty level",
    labFormat = labelFormat(suffix = "%")
    ) 

8.4 More on {tigris}

Again, a wonderful package authored by Kyle Walker, tigris allows you to work with both the Census Bureau’s TIGER/Line shapefiles as well as the Cartographic Boundary Files. Click on the links to read about each file type. You can access a lot of rich detail, ranging from administrative boundaries of states, counties, etc. to primary and secondary roads, railway line, water bodies, school districts, landmarks, military, and so on. Some examples follow but not all have been run to save time; you should run these on your own.

library(tigris)
roads("OH", "Athens") -> atroads
plot(atroads)

zctas(
  cb = TRUE, 
  starts_with = c("457")
  ) -> atzips
plot(atzips)

You can use functions to loop through geographies, grab what you are looking for and then bind this together. For an example, see page 237 of Kyle’s paper.

For the most part, you are likely to have data gathered by other means that you then want to map, pulling shapefiles or cartographic boundary files from tigris. This means you will have to merge these things. Here is a simple example where we have county-level diabetes rates in a csv file. Of course, you may not end up using this package for plotting given ggmap, leaflet, etc. but there it is.

diatigris = read.csv("data/diatigris.csv")
diatigris$STATEFP = ifelse(diatigris$state == "OH",  "39", "54")
counties(c("OH", "WV"),  cb = TRUE) -> counties 
geo_join(
  counties, diatigris, 
  c("NAME", "STATEFP"), 
  c("NAME", "STATEFP")
  ) -> my.df
subset(
  my.df[, c(6, 11, 13)], 
  !is.na(rate)
  ) -> my.dfds
plot(my.dfds, col = my.dfds$rate)