6 Data Acquisition with APIs and Other Means

One of the first things we saw in this text was the many ways in which we can read locally sourced data (from our own computer, in particular) as well as data files hosted on the web. But the real fun begins when one has access to application programming interfaces (APIs) – pieces of software that allow two applications to communicate with each other. The number of APIs available for free has grown exponentially in the last decade. Today, you can use an API to tap data from the United States Bureau of the Census, the Bureau of Labor Statistics, the Bureau of Economic Analysis, the United Nations, the World Bank, the United States Geological Survey, the Centers for Disease Control and Prevention, the Federal Reserve, Yahoo Finance, Spotify, Twitter, NOAA Climate Data, and so many more sources. Why is this a good thing? Because you no longer need to manually download and upload the files that your code relies upon. Indeed, as and when old data are updated or new data become available, your code has access to them. This makes our lives easy because we only need to focus on keeping our code current and functional – the data acquisition task has now become automated! But, of course, we cannot leverage this development in APIs until we understand how to use them. Consequently, in this chapter we work our way towards exploring APIs, starting with a look at some large files hosted on the web.

6.0.1 Census Data

Let us download a specific file – the Voting Age Population by Citizenship and Race (CVAP) data compiled from the American Community Survey (ACS) 5-year (2015-2019) collection available here. The SAS version of the data file can be downloaded from here. Documentation for these data is available here.

Let us start by downloading and reading in the data.

tempfile() -> temp
download.file(
  "https://www2.census.gov/programs-surveys/decennial/rdo/datasets/2019/2019-cvap/CVAP_2015-2019_ACS_sas_files.zip",
  destfile = temp
  ) # download the zipped SAS files to a temporary location
haven::read_sas(
  unz(
    temp, 
    "place.sas7bdat"
    )
  ) -> placedata # read the place-level SAS file from inside the zip
unlink(temp) # delete the temporary file
rm(temp)
names(placedata)
#>  [1] "geoname"  "lntitle"  "geoid"    "lnnumber" "tot_est"  "tot_moe"  "adu_est" 
#>  [8] "adu_moe"  "cit_est"  "cit_moe"  "cvap_est" "cvap_moe"

Well, what are the contents? Each row is what the Census Bureau calls a “place” (see geoname and geoid), and for each place, the Bureau gives us the following estimates:

Variable Description
tot_est The rounded estimate of the total number of people for that geographic area and group. (Not available for tracts or block groups.)
tot_moe The margin of error for the total number of people for that geographic area and group. (Not available for tracts or block groups.)
adu_est The rounded estimate of the total number of people 18 years of age or older for that geographic area and group. (Not available for tracts or block groups.)
adu_moe The margin of error for the total number of people 18 years of age or older for that geographic area and group. (Not available for tracts or block groups.)
cit_est The rounded estimate of the total number of United States citizens for that geographic area and group.
cit_moe The margin of error for the total number of United States citizens for that geographic area and group.
cvap_est The rounded estimate of the total number of United States citizens 18 years of age or older for that geographic area and group.
cvap_moe The margin of error for the total number of United States citizens 18 years of age or older for that geographic area and group.

The rows represent the following groups, marked by lnnumber and lntitle, respectively:

lnnumber lntitle
1 Total
2 Not Hispanic or Latino
3 American Indian or Alaska Native Alone
4 Asian Alone
5 Black or African American Alone
6 Native Hawaiian or Other Pacific Islander Alone
7 White Alone
8 American Indian or Alaska Native and White
9 Asian and White
10 Black or African American and White
11 American Indian or Alaska Native and Black or African American
12 Remainder of Two or More Race Responses
13 Hispanic or Latino
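
As a quick sanity check – a minimal sketch using {dplyr} – we can tabulate the group labels; each of the 13 groups should appear the same number of times, once per place.

library(dplyr)
# Thirteen rows, one per group; n should equal the number of places
count(placedata, lnnumber, lntitle)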

Well, this is all fine and dandy, but what if you want to tidy these data? You certainly can, and without too much trouble either. How? See below, but before you walk through the code, understand what we need to do.

First, each of the lntitle values should become a column such that each place really has only ONE row of data in the tidy version of the file. The lnnumber column is of little use since we have lntitle, and hence we can drop lnnumber without losing anything of value.

library(tidyverse)
placedata %>%
  select(
    -lnnumber
    ) %>%
  group_by(
    geoid, geoname
    ) %>%
  pivot_wider(
    names_from = lntitle,
    values_from = c(
      tot_est, tot_moe, adu_est, adu_moe, 
      cit_est, cit_moe, cvap_est, cvap_moe
      )
  ) %>% 
  janitor::clean_names() -> placedata_tidy

You should end up with a data file that has 29,573 rows and 106 columns. Now, this is too many columns, and surely we are not interested in all of them. Rather, we may only be interested in the total population, the total number of citizens of voting age, and then of course the percentage of voting-age citizens who are non-Hispanic White, Hispanic, non-Hispanic Asian, non-Hispanic Black/African-American, and so on. Let us trim the tidied file in line with this ultimate goal.

placedata_tidy %>%
  group_by(geoid, geoname, tot_est_total, cvap_est_total) %>%
  summarize(
    pct_cvap = (cvap_est_total / tot_est_total),
    pct_cvap_nhwhite = (cit_est_white_alone / cit_est_total),
    pct_cvap_nhblack = (cvap_est_black_or_african_american_alone / cit_est_total),
    pct_cvap_nhaian = (cvap_est_american_indian_or_alaska_native_alone / cit_est_total),
    pct_cvap_nhasian = (cvap_est_asian_alone / cit_est_total),
    pct_cvap_nhnhopi = (cvap_est_native_hawaiian_or_other_pacific_islander_alone / cit_est_total),
    pct_hispanic = (cvap_est_hispanic_or_latino / cvap_est_total)
  ) -> place.cvap.a

Of course the variables I have called pct_* are proportions rather than percentages, but that is by design. You will see why when we start mapping or otherwise visualizing these data. Now, you will see various cells with NaN – what R calls a value that is ‘not a number’ because it could not be calculated. Why not? If the denominator is 0, you will encounter this issue. So one fix would be to retain only those rows where tot_est_total > 0 and cvap_est_total > 0 before doing anything else.

placedata_tidy %>%
  filter(tot_est_total > 0, cvap_est_total > 0) %>%
  group_by(geoid, geoname, tot_est_total, cvap_est_total) %>%
  summarize(
    pct_cvap = (cvap_est_total / tot_est_total),
    pct_cvap_nhwhite = (cit_est_white_alone / cit_est_total),
    pct_cvap_nhblack = (cvap_est_black_or_african_american_alone / cit_est_total),
    pct_cvap_nhaian = (cvap_est_american_indian_or_alaska_native_alone / cit_est_total),
    pct_cvap_nhasian = (cvap_est_asian_alone / cit_est_total),
    pct_cvap_nhnhopi = (cvap_est_native_hawaiian_or_other_pacific_islander_alone / cit_est_total),
    pct_hispanic = (cvap_est_hispanic_or_latino / cvap_est_total)
  ) -> place.cvap.b

Now, you might be wondering why the focus on redistricting data at the expense of all the other data products the Census Bureau compiles. Good question, but easy answer: Later on we will be looking at specific packages designed to pull a wide variety of Census products. For now we look at a few important but funky data sources.

6.0.2 CDC’s BRFSS Data

The Behavioral Risk Factor Surveillance System (BRFSS) is the nation’s premier system of health-related telephone surveys that collect state data about U.S. residents regarding their health-related risk behaviors, chronic health conditions, and use of preventive services. Established in 1984 with 15 states, BRFSS now collects data in all 50 states as well as the District of Columbia and three U.S. territories. BRFSS completes more than 400,000 adult interviews each year, making it the largest continuously conducted health survey system in the world.

The 2020 BRFSS survey data can be downloaded in ASCII format from here while the codebook is here and the variable layout is here.

Note: ASCII (pronounced ask-ee) stands for the American Standard Code for Information Interchange and is a code for representing English characters as numbers, with each letter assigned a number from 0 to 127. “For example, the ASCII code for uppercase M is 77. Most computers use ASCII codes to represent text, which makes it possible to transfer data from one computer to another.”
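
You can see this mapping directly in base R: utf8ToInt() returns the numeric code for a character, and intToUtf8() reverses it.

utf8ToInt("M") # 77, the code for uppercase M
intToUtf8(77)  # "M"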

If you download the data and look at the variable layout here, you will notice that every column has specific starting and ending positions. If we choose to read the downloaded ASCII data, we will have to pass the variable layout information to R. I’ll show you how to do this, but let us first take a shortcut via the other formats the CDC so kindly makes available for download, namely the SAS Transport Format found here.

While 2020 is the most recent data available as I write, we will work with the 2016 data: download and extract two 2016 files – the SAS Transport version and the ASCII version – and put both in your working directory. Then execute the following code, making sure to adjust the path to the directory where you stored the extracted files.

library(foreign)
read.xport(
  "data/LLCP2016.XPT" # the extracted SAS Transport file
  ) -> brfss2016a

When the code executes you should see brfss2016a in the Global Environment window of RStudio. This was fairly easy, but what if we were given the raw ASCII data? Now we will need the variable layout information, and we will have to choose a library to work with. Two I use are {readr} (my preferred library) and {LaF} (a close second). Both are shown in action below.

Note: The original column layout specifications had to be modified because the third variable listed online (IDATE) subsumes the three that follow. As such IDATE was deleted from the specifications below.

The {readr} code specifies the start and end of the columns, as well as the column names.

fwf_positions_begin = c(1, 16, 18, 20, 22, 31, 35, 45, 62, 63, 64, 65, 67, 68, 70, 72, 80, 81, 83, 85, 87, 88, 89, 90, 91, 92, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 108, 109, 146, 147, 148, 150, 151, 152, 154, 158, 170, 171, 172, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 192, 193, 196, 198, 200, 202, 203, 209, 210, 211, 213, 215, 216, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 243, 255, 256, 257, 258, 261, 264, 266, 268, 270, 271, 272, 273, 275, 277, 279, 281, 282, 284, 285, 310, 311, 312, 314, 315, 316, 317, 318, 319, 320, 321, 322, 323, 326, 327, 329, 330, 332, 334, 336, 339, 340, 341, 342, 343, 345, 346, 347, 348, 349, 351, 352, 354, 356, 357, 359, 360, 361, 362, 363, 364, 365, 366, 367, 368, 369, 370, 371, 372, 573, 574, 575, 576, 578, 579, 581, 582, 583, 584, 590, 591, 595, 623, 625, 626, 627, 628, 629, 645, 647, 1356, 1357, 1363, 1393, 1403, 1417, 1419, 1421, 1425, 1426, 1427, 1430, 1473, 1612, 1669, 1671, 1673, 1674, 1675, 1711, 1907, 1910, 1913, 1916, 2155, 2156, 2157, 2158, 2159, 2160, 2161, 2162, 2163, 2164, 2221, 2223, 2227, 2228, 2229, 2230, 2231, 2232, 2234, 2235, 2236, 2239, 2242, 2247, 2251, 2252, 2253, 2254, 2255, 2256, 2257, 2258, 2259, 2262, 2263, 2267, 2271, 2272, 2273, 2274, 2275, 2276, 2277, 2278, 2279, 2281, 2283, 2284, 2286, 2292)
width =  c(2, 2, 2, 2, 4, 4, 10, 10, 1, 1, 1, 1, 1, 2, 2, 2, 1, 2, 2, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 2, 4, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 3, 2, 2, 2, 1, 6, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 6, 2, 1, 1, 1, 3, 3, 2, 2, 2, 1, 1, 1, 2, 2, 2, 2, 1, 2, 1, 25, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 2, 1, 2, 2, 2, 3, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 6, 1, 4, 28, 2, 1, 1, 1, 1, 1, 2, 2, 1, 6, 10, 10, 10, 2, 2, 1, 1, 1, 1, 10, 10, 1, 2, 2, 1, 1, 1, 10, 3, 3, 3, 10, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 2, 1, 1, 3, 3, 5, 4, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 4, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)

fwf_positions_end = fwf_positions_begin + (width - 1)

col_names = c("STATE", "FMONTH", "IMONTH", "IDAY", "IYEAR", "DISPCODE", "SEQNO", "_PSU", "CTELENUM", "PVTRESD1", "COLGHOUS", "STATERES", "LADULT", "NUMADULT", "NUMMEN", "NUMWOMEN", "GENHLTH", "PHYSHLTH", "MENTHLTH", "POORHLTH", "HLTHPLN1", "PERSDOC2", "MEDCOST", "CHECKUP1", "EXERANY2", "SLEPTIM1", "CVDINFR4", "CVDCRHD4", "CVDSTRK3", "ASTHMA3", "ASTHNOW", "CHCSCNCR", "CHCOCNCR", "CHCCOPD1", "HAVARTH3", "ADDEPEV2", "CHCKIDNY", "DIABETE3", "DIABAGE2", "LASTDEN3", "RMVTETH3", "VETERAN3", "MARITAL", "CHILDREN", "EDUCA", "EMPLOY1", "INCOME2", "WEIGHT2", "HEIGHT3", "NUMHHOL2", "NUMPHON2", "CPDEMO1", "INTERNET", "RENTHOM1", "SEX", "PREGNANT", "QLACTLM2", "USEEQUIP", "BLIND", "DECIDE", "DIFFWALK", "DIFFDRES", "DIFFALON", "SMOKE100", "SMOKDAY2", "STOPSMK2", "LASTSMK2", "USENOW3", "ALCDAY5", "AVEDRNK2", "DRNK3GE5", "MAXDRNKS", "FLUSHOT6", "FLSHTMY2", "PNEUVAC3", "SHINGLE2", "FALL12MN", "FALLINJ2", "SEATBELT", "DRNKDRI2", "HADMAM", "HOWLONG", "PROFEXAM", "LENGEXAM", "HADPAP2", "LASTPAP2", "HADHYST2", "PCPSAAD2", "PCPSADI1", "PCPSARE1", "PSATEST1", "PSATIME", "PCPSARS1", "BLDSTOOL", "LSTBLDS3", "HADSIGM3", "HADSGCO1", "LASTSIG3", "HIVTST6", "HIVTSTD3", "WHRTST10", "PDIABTST", "PREDIAB1", "INSULIN", "BLDSUGAR", "FEETCHK2", "DOCTDIAB", "CHKHEMO3", "FEETCHK", "EYEEXAM", "DIABEYE", "DIABEDU", "PAINACT2", "QLMENTL2", "QLSTRES2", "QLHLTH2", "MEDICARE", "HLTHCVR1", "DELAYMED", "DLYOTHER", "NOCOV121", "LSTCOVRG", "DRVISITS", "MEDSCOST", "CARERCVD", "MEDBILL1", "ASBIALCH", "ASBIDRNK", "ASBIBING", "ASBIADVC", "ASBIRDUC", "WTCHSALT", "LONGWTCH", "DRADVISE", "ASTHMAGE", "ASATTACK", "ASERVIST", "ASDRVIST", "ASRCHKUP", "ASACTLIM", "ASYMPTOM", "ASNOSLEP", "ASTHMED3", "ASINHALR", "IMFVPLAC", "TETANUS", "HPVTEST", "HPLSTTST", "HPVADVC2", "HPVADSHT", "CNCRDIFF", "CNCRAGE", "CNCRTYP1", "CSRVTRT1", "CSRVDOC1", "CSRVSUM", "CSRVRTRN", "CSRVINST", "CSRVINSR", "CSRVDEIN", "CSRVCLIN", "CSRVPAIN", "CSRVCTL1", "RRCLASS2", "RRCOGNT2", "RRATWRK2", "RRHCARE3", "RRPHYSM2", "RREMTSM2", "SCNTMNY1", "SCNTMEL1", "SCNTPAID", "SCNTWRK1", "SCNTLPAD", "SCNTLWK1", "SCNTVOT1", "SXORIENT", "TRNSGNDR", "RCSBIRTH", "RCSGENDR", "RCHISLA1", "RCSRACE1", "RCSBRAC1", "RCSRLTN2", "CASTHDX2", "CASTHNO2", "EMTSUPRT", "LSATISFY", "QSTVER", "QSTLANG", "MSCODE", "_STSTR", "_STRWT", "_RAWRAKE", "_WT2RAKE", "_AGE80", "_IMPRACE", "_IMPNPH", "_IMPEDUC", "_IMPMRTL", "_IMPHOME", "_LANDWT2", "_LANDWT", "_CHISPNC", "_CPRACE", "_CRACE1", "_IMPCAGE", "_IMPCRAC", "_IMPCSEX", "_CLANDWT", "NPHH", "NAHH", "NCHH", "_HHOLDWT", "_RFHLTH", "_HCVU651", "_TOTINDA", "_LTASTH1", "_CASTHM1", "_ASTHMS1", "_DRDXAR1", "_EXTETH2", "_ALTETH2", "_DENVST2", "_PRACE1", "_MRACE1", "_HISPANC", "_RACE", "_RACEG21", "_RACEGR3", "_RACE_G1", "_AGEG5YR", "_AGE65YR", "_AGE_G", "HTIN4", "HTM4", "WTKG3", "_BMI5", "_BMI5CAT", "_RFBMI5", "_CHLDCNT", "_EDUCAG", "_INCOMG", "_SMOKER3", "_RFSMOK3", "DRNKANY5", "DROCDY3_", "_RFBING5", "_DRNKDY4", "_DRNKMO4", "_RFDRHV4", "_RFDRMN4", "_RFDRWM4", "_FLSHOT6", "_PNEUMO2", "_RFSEAT2", "_RFSEAT3", "_RFMAM2Y", "_MAM502Y", "_RFPAP32", "_RFPSA21", "_RFBLDS2", "_RFSIGM2", "_AIDTST3")
library(readr)
read_fwf(
  "data/LLCP2016.ASC ", 
  fwf_positions(
    fwf_positions_begin, 
    fwf_positions_end, 
    col_names
    )
  ) -> brfss2016b
brfss2016b %>%
  filter(
    STATE == "39"
    ) -> brfss2016b # Subsetting Ohio data 
save(
  brfss2016b, 
  file = "data/brfss2016b.RData"
  )

Note that the file name includes an extra white-space after the file extension .ASC; the file ships with a trailing space in its name, so we must include it when reading.
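
If you are ever unsure whether your copy of the file carries that trailing space, a quick directory listing (adjust the path as needed) will show the exact name, space and all.

# Any trailing space will show up inside the quoted file name
list.files("data", pattern = "LLCP2016")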

Next you see {LaF} in action.

library(LaF)
laf_open_fwf(
  "data/LLCP2016.ASC ", 
  column_widths = width, 
  column_types = rep(
    "character", 
    length(width)
    ), 
  column_names = col_names
  ) -> laf 
data.frame(
  laf[, ] 
  ) -> brfss2016c
save(
  brfss2016c, 
  file = "data/brfss2016c.RData"
  )

Notice that with {LaF} we had to specify the column widths via column_widths = width and the column headings via column_names = col_names. In addition, note once again the extra white-space after the file extension .ASC.

6.0.2.1 Changing Variable Formats

When these files are read, some of the variables will come in as characters when in fact they are numeric (see STATE, for example) or factors (see PHYSHLTH, for example). Others will be read as integers (see GENHLTH), and so on. Consequently, we often have to reset the read formats. Further, we know from the codebook that missing values and other responses (such as refused) carry numeric codes in the survey. For instance, GENHLTH is supposed to have the following responses mapped to the numeric values stored in the variable:

Code Label
1 Excellent
2 Very Good
3 Good
4 Fair
5 Poor
7 Don’t Know/Not Sure
9 Refused
BLANK Not asked or Missing

We can take this mapping and fix GENHLTH by creating a new variable, say general.health, as a factor.

load("data/brfss2016b.RData")
factor(
  brfss2016b$GENHLTH, 
  levels = c(1, 2, 3, 4, 5), 
  labels = c(
    "Excellent", "Very Good", "Good", "Fair", "Poor"
    )
  ) -> brfss2016b$general.health 
table(
  brfss2016b$general.health
  )
#> 
#> Excellent Very Good      Good      Fair      Poor 
#>         0         0         0         0         0
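
Notice that because we listed only levels 1 through 5, the codes 7 (Don’t Know/Not Sure) and 9 (Refused) are silently converted to NA by factor(). If you prefer to make that recoding explicit before creating the factor, a minimal sketch would be:

# Explicitly recode "Don't Know/Not Sure" (7) and "Refused" (9) to NA
brfss2016b$GENHLTH[brfss2016b$GENHLTH %in% c(7, 9)] <- NA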

Let us also flip STATE into a numeric variable, and combine IMONTH, IDAY, and IYEAR into a proper date. For dates, the {lubridate} package comes in very handy.

as.numeric(
  brfss2016b$STATE
  ) -> brfss2016b$STATE

We’ll combine IMONTH, IDAY and IYEAR to create a date variable.

with(
  brfss2016b, 
  sprintf(
    '%s-%s-%s', 
    IMONTH, IDAY, IYEAR
    )
  ) -> brfss2016b$date
head(brfss2016b$date)
#> character(0)
library(lubridate)
brfss2016b$Date = mdy(brfss2016b$date)
is.Date(brfss2016b$Date)
#> [1] TRUE
str(brfss2016b$Date)
#>  'Date' num(0)

Note how we built date: we first concatenated the three columns and then used {lubridate} to format date via the mdy() specification. I then check to make sure the format is correct via the is.Date() command; this could also have been done via the str() command. Now that Date has been created we could drop the original three columns, but we will let that be for now.
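
Should you eventually decide to drop them, a one-liner with dplyr::select() would do it; the sketch below also drops the intermediate date string we built along the way.

library(dplyr)
brfss2016b %>%
  select(
    -IMONTH, -IDAY, -IYEAR, -date
    ) -> brfss2016b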

6.0.3 BEA and BLS Data

The Bureau of Labor Statistics (BLS) compiles a lot of data, including some under the Local Area Unemployment Statistics (LAUS) program – a Federal-State cooperative effort in which monthly estimates of total employment and unemployment are prepared for approximately 7,300 areas:

  • Census regions and divisions
  • States
  • Metropolitan Statistical Areas and Metropolitan NECTAS (New England City and Town Areas)
  • Metropolitan Divisions and NECTA Divisions
  • Micropolitan Statistical Areas and Micropolitan NECTAs
  • Combined Metropolitan Statistical Areas and Combined NECTAs
  • Small Labor Market Areas
  • Counties and county equivalents
  • Cities of 25,000 population or more
  • Cities and towns in New England regardless of population

These estimates are key indicators of local economic conditions. The Bureau of Labor Statistics (BLS) of the U.S. Department of Labor is responsible for the concepts, definitions, technical procedures, validation, and publication of the estimates that State employment security agencies prepare under agreement with BLS. The data available from the BLS can be found here.

Similarly, the Bureau of Economic Analysis (BEA) has a regional economic accounts program that details the geographic distribution of U.S. economic activity and growth. “The estimates of gross domestic product by state and state and local area personal income, and the accompanying detail, provide a consistent framework for analyzing and comparing individual state and local area economies.”

BEA has a nice site here that allows you to interactively select the data you want, or you can grab the data from their download section.

The BEA data read below were obtained from here, while the BLS data read below were obtained from here and here, respectively.

read.csv(
  here::here(
    "data", 
    "CA1_1969_2016_OH.csv"
    )
  ) -> df.bea  # BEA: the downloaded CA1 personal income file for Ohio
read.csv(
  "http://www.bls.gov/lau/laucnty16.txt",
  header = FALSE, 
  skip = 5,    # skip the title rows at the top of the txt file
  sep = ""
  ) -> df.bls1 # BLS: 2016 county labor force data (txt version)
colnames(df.bls1) = c(
  "LAUS Code", "StateFIPSCode", "CountyFIPSCode", "CountyName",
  "StateAbbreviation", "Year", "TotalLaborForce", "TotalEmployed",
  "TotalUnemployed", "UnemploymentRate"
  )
library(readxl)
library(httr)
url = "https://www.bls.gov/lau/laucnty16.xlsx"
GET(
  url, 
  write_disk(tf <- tempfile(fileext = ".xlsx"))
  )
#> Response [https://www.bls.gov/lau/laucnty16.xlsx]
#>   Date: 2022-01-14 00:54
#>   Status: 200
#>   Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
#>   Size: 246 kB
#> <ON DISK>  /var/folders/qh/6q39v0755_54rxmbl8m5ttnwy0twd7/T//RtmpPMvrrc/file921e4f4e7dd.xlsx
read_excel(
  tf, 
  skip = 5, 
  na = "", 
  col_names = c(
    "LAUS Code", "StateFIPSCode", "CountyFIPSCode", 
    "CountyName", "Year", "Blank_Column", "TotalLaborForce",
    "TotalEmployed", "TotalUnemployed", "UnemploymentRate"
    )
  ) -> df.bls2 
read.csv(
  "http://www.bls.gov/cew/data/api/2015/2/industry/10.csv",
  sep = ",", 
  header = TRUE
  ) -> df.bls3 

Note that df.bls1 uses the .txt data while df.bls2 uses the .xlsx data. Note too that the Excel file has a blank column (read in above as Blank_Column) that we will have to delete. Also, look at df.bls1 and note that the data have some issues because of the way the txt file is prepared. Consequently, we want to rely on the Excel data for now. If we wanted to cleanly read the txt file we would have to do some more legwork up front.
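
As a first pass at that cleanup – a sketch that assumes the column names we assigned above – we can drop Blank_Column and any empty rows read in from the spreadsheet.

library(dplyr)
df.bls2 %>%
  select(-Blank_Column) %>%        # drop the empty spreadsheet column
  filter(!is.na(Year)) -> df.bls2  # drop blank rows, if any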

6.1 Using Application Programming Interfaces (APIs)

Many organizations that gather data have started offering free APIs – essentially a setup whereby, if you register for their API key, you can interact with their servers to query and download data. This is especially useful because it saves time and improves efficiency all around. Moreover, an increasing number of R packages use APIs to communicate with data servers. Before you can use an API, you have to register for an API key by setting up an account, usually free.
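
One good habit once you have a key: keep it out of your scripts. A common approach – sketched here with a hypothetical BEA_API_KEY entry – is to store the key in your .Renviron file and read it with Sys.getenv().

# In ~/.Renviron (one line; restart R afterwards):
# BEA_API_KEY=xyz1s3d4g6hblahblahblahblah

# Then, in your script:
Sys.getenv("BEA_API_KEY") -> mybeaKey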

6.1.1 The Bureau of Economic Analysis (BEA) API

Let us start with the Bureau of Economic Analysis’s API. What does the BEA do? Per the agency, “The Bureau of Economic Analysis (BEA) promotes a better understanding of the U.S. economy by providing the most timely, relevant, and accurate economic accounts data in an objective and cost-effective manner.”

The three BEA product streams of interest to most people tend to be:

  • the National series – BEA’s national economic statistics provide a comprehensive view of U.S. production, consumption, investment, exports and imports, and income and saving. These statistics are best known by summary measures such as gross domestic product (GDP), corporate profits, personal income and spending, and personal saving.
  • the Regional series – The regional economic accounts tell us about the geographic distribution of U.S. economic activity and growth. The estimates of gross domestic product by state and state and local area personal income, and the accompanying detail, provide a consistent framework for analyzing and comparing individual state and local area economies.
  • the Industry Economic Accounts – The industry economic accounts, presented both in an input-output framework and as annual output by each industry, provide a detailed view of the interrelationships between U.S. producers and users and the contribution to production across industries. These accounts are used extensively by policymakers and businesses to understand industry interactions, productivity trends, and the changing structure of the U.S. economy.

To start, sign up for a BEA key here and, once you get it (check your email), set it via mybeaKey = "xyz1s3d4g6hblahblahblahblah". Now install and load the library.

library(bea.R)

There are a few primary functions currently offered: you can either search for data via a keyword, or you can get specific data once you have identified the table ID and other details. Let us start by seeing what data sets are available.

beaSets(mybeaKey)
#> $Dataset
#>                DatasetName                   DatasetDescription
#> 1                     NIPA                 Standard NIPA tables
#> 2       NIUnderlyingDetail Standard NI underlying detail tables
#> 3                      MNE            Multinational Enterprises
#> 4              FixedAssets         Standard Fixed Assets tables
#> 5                      ITA  International Transactions Accounts
#> 6                      IIP    International Investment Position
#> 7              InputOutput                    Input-Output Data
#> 8            IntlServTrade         International Services Trade
#> 9            GDPbyIndustry                      GDP by Industry
#> 10                Regional                   Regional data sets
#> 11 UnderlyingGDPbyIndustry           Underlying GDP by Industry
#> 12      APIDatasetMetaData    Metadata about other API datasets
#> 
#> attr(,"params")
#>   ParameterName                       ParameterValue
#> 1  RESULTFORMAT                                 JSON
#> 2        METHOD                       GETDATASETLIST
#> 3        USERID A8CF5A83-A7A6-42A5-8D45-B1E94079E37C

The beaSearch() function is limited to returning details of national and regional data so please bear that in mind. We start with a simple search for all data on wages and salaries.

beaSearch(
  "Wages and Salaries", 
  beaKey = mybeaKey, 
  asHtml = FALSE
  ) -> bea.search.01

Look through the resulting table (bea.search.01) to get a sense of the contents. We will then go in and pick a series or two to download.

The beaGet() function will grab the data you specify, as shown below.

  list(
    'UserID' = mybeaKey, 
    'Method' = 'GetData', 
    'DatasetName' = 'NIPA', 
    'TableName' = 'T11000',
    'LineNumber' = "3",
    'Frequency' = 'Q',
    'Year' = '2010,2012,2013,2014,2015,2016,2017,2018',
    'ResultFormat' = 'json'
    ) -> mybea_call

beaGet(
  mybea_call,
  asWide = FALSE
  ) -> bea01

You can control whether data are returned in a long format with the asWide = FALSE switch. If you set it to TRUE you will get the data in wide format, necessitating additional work to whip it into shape for analysis.
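
If you do end up with the wide version, something like the following sketch will lengthen it. It assumes the value columns share a DataValue_ prefix, so check names() on your own output first.

library(tidyverse)
bea.wide %>% # a hypothetical result from beaGet(mybea_call, asWide = TRUE)
  pivot_longer(
    cols = starts_with("DataValue_"),
    names_to = "TimePeriod",
    names_prefix = "DataValue_",
    values_to = "DataValue"
    ) -> bea.long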

To return in a format in which each column represents a series, set iTableStyle = FALSE. This returns columns named with a concatenation of the descriptive column values, whereas rows are populated with numeric data values for each time period, and has the last column named “TimePeriod” filled with dates (see below).

beaGet(
  mybea_call, 
  iTableStyle = FALSE
  ) -> bea02

Note that asWide = TRUE and iTableStyle = TRUE are set as the defaults. My preference would be to work with bea01.

What about some data from the Regional series? Let us see what is available for personal income, and then pick both a series and sub-national level (state, MSA, county, etc.).

beaSearch(
  "Personal Income", 
  beaKey = mybeaKey, 
  asHtml = FALSE
  ) -> bea.search.02

Pay attention to the warning message our call threw: “In beaSearch('Personal Income', beaKey = mybeaKey, asHtml = FALSE): Regional metadata is missing from /Library/Frameworks/R.framework/Versions/4.1/Resources/library/beaR/data and may be locked for updating on the BEA API; searching national metadata only.”

This is not a surprise if you pay attention to the technical documentation the BEA releases. Specifically, the BEA has announced that “The RegionalData, RegionalIncome, and RegionalProduct data sets are obsolete and API calls using these data sets will no longer function. Users should instead access the dataset Regional, which provides comprehensive detail in statistics, industries, geography, and years. See Appendix N for instructions on how to use the Regional data set. In addition, BEA continues to add new datasets as appropriate, and while any new datasets will be announced on the BEA’s website Developers page (https://apps.bea.gov/developers), we also encourage users to periodically call the above GETDATASETLIST to discover new datasets.”

Well, so what now? Let us hack the call a bit to figure out what to ask for. The easy way to do this is to use the BEA’s interactive data interface. I used the Regional data link to pick per capita personal income by county, and this turns out to be the CAINC1 series. At the time of writing, the latest data available are for 2020.
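
With the table identified, the API call mirrors the earlier NIPA example. The sketch below assumes the Regional data set’s parameters (TableName, LineCode, GeoFips, Year) and that line 3 of CAINC1 is per capita personal income; verify both with the helper-function lookups that follow.

  list(
    'UserID' = mybeaKey,
    'Method' = 'GetData',
    'DatasetName' = 'Regional',
    'TableName' = 'CAINC1',
    'LineCode' = 3,        # assumed: per capita personal income
    'GeoFips' = 'COUNTY',  # all counties
    'Year' = '2020',
    'ResultFormat' = 'json'
    ) -> my_regional_call

beaGet(
  my_regional_call,
  asWide = FALSE
  ) -> bea03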

Now, you could also rely on some helper functions in {bea.R} to get the series you are curious about. These helper functions are used below with this particular example – chasing CAINC1 and extracting the key pieces we will need in our API call. First I will look for the TableName (although the web interface told us this was CAINC1, why not see what else we could get, eh?).

library(tidyverse)
beaParamVals(
  mybeaKey, 
  'Regional', 
  "TableName"
  ) %>%
  as.data.frame(.) -> see01

This gives us 99 rows of data with one TableName and its corresponding description per row.
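
Since see01 is just a data frame, you can search it directly. For example, the sketch below filters for tables whose descriptions mention personal income; the column name ParamValue.Desc is a guess at what as.data.frame() produces, so inspect names(see01) first.

names(see01) # check the actual column names

see01 %>%
  filter(
    str_detect(ParamValue.Desc, "Personal Income") # assumed column name
    ) -> see01_income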

Now that we have seen the various tables available to us, let us look at the various LineCode values in each and what these linecodes represent.

beaParamVals(
  mybeaKey, 
  'Regional', 
  "LineCode"
  ) %>%
  as.data.frame() -> see02