8 Working with Census Data in R
If you have to work with Census data, you can either explore Census data via data.census.gov, tap other Census Bureau resources to find and download the data you need, or rely upon some R packages (the more efficient option). Let us see some of these packages in action, starting with censusapi, tidycensus, and ipumsr.
8.1 Using {censusapi}
Authored by Hannah Recht, this package allows you to access pretty much all Census products via the API. Of course, we will have to get and save our Census API key, and then install the package and the key as shown below. The Census API key is easily obtained from here.
install.packages("devtools")
::install_github("hrecht/censusapi")
devtoolsSys.setenv(CENSUS_KEY = YOURKEYHERE)
readRenviron("~/.Renviron")
Sys.getenv("CENSUS_KEY")
Now we can get to work. Let us start by loading the library and then seeing what Census APIs are available.
library(censusapi)
listCensusApis() -> apis
Each API could have slightly varying parameters so Hannah recommends checking the API documentation available here but points out some common parameters – name (the API’s name), vintage (the year of the dataset), vars (the variables you want to access), and region (state, county, etc).
It is easy to find variable names via the built-in function listCensusMetadata
. The bureau’s small area health insurance estimates (SAHIE) are shown below, as well as the small area income and poverty estimates (SAIPE).
listCensusMetadata(
name = "timeseries/healthins/sahie",
type = "variables"
-> sahie_vars
) listCensusMetadata(
name = "timeseries/poverty/saipe",
type = "variables"
-> saipe_vars )
Here are the SAHIE variables
and here are the SAIPE variables.
If you want to see what geographies are available, switch type =
to geography:
listCensusMetadata(
name = "timeseries/healthins/sahie",
type = "geography"
-> sahie_geos
) listCensusMetadata(
name = "timeseries/poverty/saipe",
type = "geography"
-> saipe_geos )
Similarly, here are the SAHIE geographies
and here are the SAIPE geographies.
Let us get the most recent county-level data, noting that the latest SAHIE are for 2015 but the latest SAIPE are for 2016.
getCensus(
name = "timeseries/healthins/sahie",
vars = c("NAME", "IPRCAT", "IPR_DESC", "PCTUI_PT"),
region = "county:*",
regionin = "state:39",
time = 2019,
key = CENSUS_KEY
-> sahie_counties )
getCensus(
name = "timeseries/poverty/saipe",
vars = c("NAME", "SAEPOVRTALL_PT", "SAEMHI_PT"),
region = "county:*",
regionin = "state:39",
YEAR = 2020,
key = CENSUS_KEY
-> saipe_counties )
Almost everything is read-in as a character (see the chr
flag in the data frame) so we’ll have to convert the variables we want to numeric.
library(tidyverse)
%>%
saipe_counties mutate(
prate = as.numeric(SAEPOVRTALL_PT),
mdhinc = as.numeric(SAEMHI_PT)
-> saipe_counties )
If you want data for a lot of geographies, the package will let you do it in seconds. For example, if you want tract-level data for all tracts, you can get it as shown below:
fips<- NULL
tracts for (f in fips) {
<- paste("state:", f, sep = "")
stateget <- getCensus(name = "sf3", vintage = 1990,
temp vars = c("P0070001", "P0070002", "P114A001"),
region = "tract:*",
regionin = stateget, key = CENSUS_KEY)
<- rbind(tracts, temp)
tracts
}head(tracts)
Notice that you had to specify the region
, and since tracts are nested within states, you had to add regionin = stateget
.
The 2010 Decennial Census summary file 1 requires you to specify a state and county to retrieve block-level data. Use region to request block level data, and regionin to specify the desired state and county.
Here we run it for Athens County, Ohio.
getCensus(
name = "dec/sf1",
vintage = 2010,
vars = c("P001001", "P003001"),
region = "tract:*",
regionin = "state:39",
key = CENSUS_KEY
-> oh2010sf1a )
For the 2000 Decennial Census summary file 1, tract is also required to retrieve block-level data. This example requests data for all blocks within Census tract 010000 in county 027 of state 36.
getCensus(
name = "sf1",
vintage = 2000,
vars = c("P001001", "P003001"),
region = "block:*",
regionin = "state:39 + county:009 + tract:972600",
key = CENSUS_KEY
-> oh2000sf1a )
There is a lot more that you can do with the package, and to get a sense of the possible, look at the voluminous example list.
8.2 Using {tidycensus}
The package’s author, Kyle Walker, describes it thus:
tidycensus is an R package that allows users to interface with the US Census Bureau’s decennial Census and five-year American Community APIs and return tidyverse-ready data frames, optionally with simple feature geometry included.
Install the Census API key as shown below.
library(tidyverse)
install.packages("tidycensus")
library(tidycensus)
census_api_key(
"YOUR API KEY GOES HERE",
install = TRUE
)
The install = TRUE
switch will add the key to your .Renviron
. The key should then be automatically read in future sessions with {tidycensus}
so be sure to not use that switch for subsequent runs. Otherwise you will be repeatedly asked if you want to overwrite the previous setting in .Renviron
.
The two Census products of most value to most of us will either be the decennial censuses or then the American Community Surveys (ACS). One can choose the product via get_decennial()
or get_acs()
as shown below. We start with a simple look at the variables available to us in the 2015-2019 ACS (the default ACS product for the current version of the package).
load_variables(
year = 2017,
dataset = "acs5",
cache = TRUE
-> my.vars.acs5 )
Let us look at the variables in the database.
::datatable(my.vars.acs5) DT
What if I am more interested in the Summary File 1 from 2010 or Summary File 3from 2000?8
load_variables(
year = 2010,
dataset = "sf1",
cache = TRUE
-> my.vars.d.sf1
) head(my.vars.d.sf1)
#> # A tibble: 6 × 3
#> name label concept
#> <chr> <chr> <chr>
#> 1 H001001 Total HOUSING UNITS
#> 2 H002001 Total URBAN AND RURAL
#> 3 H002002 Total!!Urban URBAN AND RURAL
#> 4 H002003 Total!!Urban!!Inside urbanized areas URBAN AND RURAL
#> 5 H002004 Total!!Urban!!Inside urban clusters URBAN AND RURAL
#> 6 H002005 Total!!Rural URBAN AND RURAL
load_variables(
year = 2000,
dataset = "sf3",
cache = TRUE
-> my.vars.d.sf3
) head(my.vars.d.sf3)
#> # A tibble: 6 × 3
#> name label concept
#> <chr> <chr> <chr>
#> 1 H001001 Total HOUSING UNITS [1]
#> 2 H002001 Total UNWEIGHTED SAMPLE HOUSING UNI…
#> 3 H002002 Total!!Occupied UNWEIGHTED SAMPLE HOUSING UNI…
#> 4 H002003 Total!!Vacant UNWEIGHTED SAMPLE HOUSING UNI…
#> 5 H003001 Total 100-PERCENT COUNT OF HOUSING …
#> 6 H004001 Percent of occupied housing units in sample PERCENT OF HOUSING UNITS IN S…
Look at each of these and you will see all the variables in the respective databases. Say I want total population from each. This has to be P0010001
from each. Let us download these from the decennial census, first for states, then for all counties in Ohio, and then for all Census Tracts in Athens, Ohio. Let us do this for the decennial census and then for the ACS data.
get_decennial(
geography = "state",
year = 2010,
variables = c(totpopn = "P001001")
-> state.popn.d
) get_decennial(
geography = "county",
year = 2010,
variables = c(totpopn = "P001001"),
state = "OH"
-> county.popn.d
) get_decennial(
geography = "tract",
year = 2010,
variables = c(totpopn = "P001001"),
state = "OH",
county = "Athens"
-> tract.popn.d )
And now for the ACS:
get_acs(
geography = "state",
year = 2017,
survey = "acs5",
variables = c(totpopn = "B01003_001")
-> state.popn.acs
) get_acs(
geography = "county",
year = 2017,
survey = "acs5",
variables = c(totpopn = "B01003_001"),
state = "OH"
-> county.popn.acs
) get_acs(
geography = "tract",
year = 2017,
survey = "acs5",
variables = c(totpopn = "B01003_001"),
state = "OH",
county = "Athens"
-> tract.popn.acs )
Here are the data for tracts, just as an example.
You can also get an entire table
instead of having to list variables, such as shown below, with the default decennial census for the package’s current version (2010).
get_decennial(
geography = "county",
table = "P012",
summary_var = "P001001"
-> county.table.d
) get_decennial(
geography = "state",
table = "P012",
summary_var = "P001001"
-> state.table.d )
Here is a snippet of the state table.
If you want the geometry as well because you want to do some plotting, that is easily done. For example, let us look at median gross rent in Ohio’s Athens County’s census tracts.
library(tigris)
library(sf)
options(tigris_class = "sf")
options(tigris_use_cache = TRUE)
get_acs(
geography = "tract",
variables = "DP04_0134",
year = 2017,
survey = "acs5",
state = "OH",
county = "Athens",
geometry = TRUE
-> rent )
How can I plot it? Here is a quick and dirty map.
%>%
rent ggplot() +
geom_sf(
aes(fill = estimate)
)
If we want to focus on a single metro area, then we can do that as well. Here I am focusing on Columbus (Ohio), whose GEOID code is 18140.
core_based_statistical_areas(
cb = TRUE
%>%
) filter(
%in% c("18140")
GEOID %>%
) select(
metro_name = NAME
-> metros
) st_join(
rent, metros, join = st_within,
left = FALSE
-> wc_rent )
And now we plot the data.
library(viridis)
ggplot(
wc_rent, aes(
fill = estimate,
color = estimate
)+
) geom_sf() +
coord_sf(
crs = 26910
+
) theme_minimal() +
theme(
aspect.ratio = 1
+
) scale_fill_viridis() +
scale_color_viridis()
What if we wanted to do this for all metro areas in Ohio?
core_based_statistical_areas(
cb = TRUE
%>%
) filter(
%in% c(
GEOID "10420", "15940", "17140", "17460",
"18140", "19380", "26580", "30620",
"31900", "44220", "45780", "48260",
"48540", "49660"
)%>%
) select(
metro_name = NAME
-> metros2
) st_join(
rent, metros2, join = st_within,
left = FALSE
-> wc_rent2 )
library(ggthemes)
ggplot(
wc_rent2, aes(
fill = estimate,
color = estimate
)+
) geom_sf() +
coord_sf(
crs = 26910
+
) facet_wrap(
~ metro_name
+
) theme_map() +
theme(
aspect.ratio = 1
+
) scale_fill_viridis() +
scale_color_viridis()
Population data for all tracts in the country?
library(tidycensus)
library(purrr)
unique(fips_codes$state)[1:51] -> us
map_df(us, function(x) {
get_acs(
geography = "tract",
variables = "B01003_001",
state = x,
geometry = TRUE
)
}-> totalpop )
One can also build maps with {tidycensus}
since you can ask that geometry
be returned. Let us build a map for the Columbus metro area in Ohio. I’ll stick with the Median Household Income example in the vignette. Note that we have to trim the get_acs()
result to just the Columbus metro area. We’ll see the default map but then use {mapview}
to get a rough sense of these tracts.
As a reminder,
{mapview}
is an R package created to help researchers during their spatial data analysis workflow. It provides functions to very quickly and conveniently create interactive visualisations of spatial data. It was created to fill the gap of quick (not presentation grade) interactive plotting to examine and visually investigate both aspects of spatial data, the gometries and their attributes.
options(tigris_use_cache = TRUE)
<- get_acs(state = "OH", county = c("Delaware", "Fairfield", "Franklin", "Hocking", "Licking", "Madison", "Morrow", "Perry", "Pickaway", "Union"), geography = "tract", variables = "B19013_001", geometry = TRUE)
cmh
<- core_based_statistical_areas(cb = TRUE) %>%
cbsas filter(GEOID %in% c("18140")) %>%
select(metro_name = NAME)
= st_join(cmh, cbsas, join = st_within, left = TRUE) cmhm
load("data/cmhm.RData")
%>%
cmhm ggplot(
aes(
fill = estimate,
color = estimate
)+
) geom_sf() +
coord_sf(
crs = 26911
+
) ::scale_fill_viridis(
viridisoption = "mako"
+
) ::scale_color_viridis(
viridisoption = "magma"
)
library(mapview)
mapview(
cmhm, zcol = "estimate",
legend = TRUE
)
There is a wonderful feature of the tidycensus
package that allows you to use a reference variable in the denominator to calculate ratios, proportions, and percentages – the summary_var
switch. In the example that follows we look at the distribution of the population by race.
load("data/metros.RData")
c(
White = "P0050003",
Black = "P0050004",
Asian = "P0050006",
Hispanic = "P0040003"
-> racevars
) get_decennial(
geography = "tract",
variables = racevars,
state = "OH",
county = c(
"Delaware", "Fairfield", "Franklin",
"Hocking", "Licking", "Madison",
"Morrow", "Perry", "Pickaway", "Union"
), geometry = TRUE,
summary_var = "P0010001"
-> cmh2
) st_join(
cmh2, metros, join = st_within,
left = FALSE
-> cmhm2 )
%>%
cmhm2 mutate(
Percent = 100 * (value / summary_value)
-> cmhm3
) %>%
cmhm3 ggplot(
aes(
fill = Percent,
color = Percent
)+
) facet_wrap(
~variable
+
) geom_sf() +
coord_sf(
crs = 26915
+
) scale_fill_viridis(
option = "turbo"
+
) scale_color_viridis()
If you want to save your data file, st_write()
comes to your help.
library(sf)
st_write(cmh, "cmh.shp")
8.3 Using {acs}
Authored by Ezra Klein, the {acs}
package “Provides a general toolkit for downloading, managing, analyzing, and presenting data from the U.S. Census, including SF1 (Decennial”short-form”), SF3 (Decennial “long-form”), and the American Community Survey (ACS). Confidence intervals provided with ACS data are converted to standard errors to be bundled with estimates in complex acs objects. Package provides new methods to conduct standard operations on acs objects and present/plot data in statistically appropriate ways.” We’ll look at some of its key features here, relying on examples developed by Zev Ross (no surprises there) and other material from Ezra. Just a note of warning based on my personal experience: acs
has a lot that it can and does do but it took me a long time to get it to do the things I wanted to see. It also will not pull recent ACS data so bear that in mind.
As usual, if you know what variable you want, life is easy but if you don’t know the table number, acs.lookup(...)
is your friend.
library(acs)
acs.lookup(
endyear = 2015,
span = 1,
keyword = c("Poverty"),
case.sensitive = FALSE
-> myacs01
) acs.lookup(
endyear = 2000,
dataset = "sf3",
table.number = "P56"
-> myacs02 )
I’ll pick table B14006
from the 5-year ACS ending in 2012.
geo.make(
state = c("OH"),
county = c(9),
tract = "*"
-> geo
) acs.fetch(
endyear = 2019,
span = 5,
geography = geo,
table.number = "B14006",
col.names = "pretty"
-> atacs )
The data will be returned as a list
so will have to be converted into a data frame. We are padding state, county and tract since we’ll merge the data for mapping. Note also that you have to pick the columns you want to work with (if there are several columns downloaded, as is the case here; look at atacs
). We then clean up the rowname, give concise names to the columns, and then calculate the percent of the population 3 years and over living below the poverty level.
data.frame(
paste0(
str_pad(
@geography$state, 2, "left", pad = "0"
atacs
), str_pad(atacs@geography$county, 3, "left", pad = "0"),
str_pad(atacs@geography$tract, 6, "left", pad = "0")),
@estimate[, c(
atacs"POVERTY STATUS IN THE PAST 12 MONTHS BY SCHOOL ENROLLMENT BY LEVEL OF SCHOOL FOR THE POPULATION 3 YEARS AND OVER: Total:",
"POVERTY STATUS IN THE PAST 12 MONTHS BY SCHOOL ENROLLMENT BY LEVEL OF SCHOOL FOR THE POPULATION 3 YEARS AND OVER: Income in the past 12 months below the poverty level:")],
stringsAsFactors = FALSE
-> atacs_df
) rownames(atacs_df) <- 1:nrow(atacs_df)
names(atacs_df) <- c("GEOID", "total", "belowpov")
$percent <- 100*(atacs_df$belowpov / atacs_df$total) atacs_df
Now we merge
library(tigris)
tracts(
state = 'OH',
county = c(9),
cb = TRUE
-> tracts
) geo_join(
tracts, atacs_df, "GEOID", "GEOID"
-> atacs_merged
) # there are some tracts with no land that we should exclude
$ALAND > 0, ] -> atacs_merged atacs_merged[atacs_merged
load("data/atacs_merged.RData")
and then map
library(leaflet)
paste0(
"GEOID: ",
$GEOID,
atacs_merged"<br>", "Percent below poverty level: ",
round(atacs_merged$percent, 2)
-> popup
) colorNumeric(
palette = "YlGnBu",
domain = atacs_merged$percent
-> pal
) leaflet() %>%
addProviderTiles(
"CartoDB.Positron"
%>%
) addPolygons(
data = atacs_merged,
fillColor = ~pal(percent),
color = "#b2aeae", # you need to use hex colors
fillOpacity = 0.4,
weight = 1,
smoothFactor = 0.2,
popup = popup
%>%
) addLegend(
pal = pal,
values = atacs_merged$percent,
position = "bottomright",
title = "Percent <br>below poverty level",
labFormat = labelFormat(suffix = "%")
)
8.4 More on {tigris}
Again, a wonderful package authored by Kyle Walker, tigris
allows you to work with both the Census Bureau’s TIGER/Line shapefiles as well as the Cartographic Boundary Files. Click on the links to read about each file type. You can access a lot of rich detail, ranging from administrative boundaries of states, counties, etc. to primary and secondary roads, railway line, water bodies, school districts, landmarks, military, and so on. Some examples follow but not all have been run to save time; you should run these on your own.
library(tigris)
roads("OH", "Athens") -> atroads
plot(atroads)
zctas(
cb = TRUE,
starts_with = c("457")
-> atzips
) plot(atzips)
You can use functions to loop through geographies, grab what you are looking for and then bind this together. For an example, see page 237 of Kyle’s paper.
For the most part, you are likely to have data gathered by other means that you then want to map, pulling shapefiles or cartographic boundary files from tigris. This means you will have to merge these things. Here is a simple example where we have county-level diabetes rates in a csv file. Of course, you may not end up using this package for plotting given ggmap, leaflet, etc. but there it is.
= read.csv("data/diatigris.csv")
diatigris $STATEFP = ifelse(diatigris$state == "OH", "39", "54")
diatigriscounties(c("OH", "WV"), cb = TRUE) -> counties
geo_join(
counties, diatigris, c("NAME", "STATEFP"),
c("NAME", "STATEFP")
-> my.df
) subset(
c(6, 11, 13)],
my.df[, !is.na(rate)
-> my.dfds )
plot(my.dfds, col = my.dfds$rate)