6 Data Acquisition with APIs and Other Means

One of the first things we saw in this text was the many ways in which we can read locally sourced data (from our own computer, in particular) as well as data files hosted on the web. But the real fun begins when one has access to application programming interfaces (APIs) – a piece of software that allows two applications to communicate. The number of APIs available for free has grown exponentially in the last decade. Today, you can use an API to tap data from the United States Bureau of the Census, the Bureau of Labor Statistics, the Bureau of Economic Analysis, the United Nations, the World Bank, the United States Geological Survey, the Centers for Disease Control and Prevention, the Federal Reserve, Yahoo Finance, Spotify, Twitter, NOAA Climate Data, and so many more sources. Why is this a good thing? Because you no longer need to manually download and upload the files that your code relies upon. Indeed, as and when old data are updated or new data become available, your code has access to them. This makes our lives easy because we only need to focus on keeping our code current and functional – the data acquisition task has now become automated! But, of course, we cannot leverage this development in APIs until we understand how to use them. Consequently, in this chapter we work our way towards exploring APIs, starting with a look at some large files hosted on the web.

6.0.1 Census Data

Let us download a specific file – the Voting Age Population by Citizenship and Race (CVAP) data compiled from the American Community Survey (ACS) 5-year (2015-2019) collection available here. The SAS version of the data file can be downloaded from here. Documentation for these data is available here.

Let us start by downloading and reading in the data.

tempfile() -> temp 
download.file(
  "https://www2.census.gov/programs-surveys/decennial/rdo/datasets/2019/2019-cvap/CVAP_2015-2019_ACS_sas_files.zip",
  destfile = temp
  )
haven::read_sas(
  unz(
    temp, 
    "place.sas7bdat"
    )
  ) -> placedata 
unlink(temp)
rm(temp)
names(placedata)
#>  [1] "geoname"  "lntitle"  "geoid"    "lnnumber" "tot_est"  "tot_moe"  "adu_est" 
#>  [8] "adu_moe"  "cit_est"  "cit_moe"  "cvap_est" "cvap_moe"

Well, what are the contents? Each row is what the Census Bureau calls a “place” (see geoname and geoid), and for each place, the Bureau gives us the following estimates:

Variable Description
tot_est The rounded estimate of the total number of people for that geographic area and group. (Not available for tracts or block groups.)
tot_moe The margin of error for the total number of people for that geographic area and group. (Not available for tracts or block groups.)
adu_est The rounded estimate of the total number of people 18 years of age or older for that geographic area and group. (Not available for tracts or block groups.)
adu_moe The margin of error for the total number of people 18 years of age or older for that geographic area and group. (Not available for tracts or block groups.)
cit_est The rounded estimate of the total number of United States citizens for that geographic area and group
cit_moe The margin of error for the total number of United States citizens for that geographic area and group.
cvap_est The rounded estimate of the total number of United States citizens 18 years of age or older for that geographic area and group.
cvap_moe The margin of error for the total number of United States citizens 18 years of age or older for that geographic area and group.

The rows represent the following groups, marked by lntitle and lnnumber, respectively:

lntitle lnnumber
Total 1
Not Hispanic or Latino 2
American Indian or Alaska Native Alone 3
Asian Alone 4
Black or African American Alone 5
Native Hawaiian or Other Pacific Islander Alone 6
White Alone 7
American Indian or Alaska Native and White 8
Asian and White 9
Black or African American and White 10
American Indian or Alaska Native and Black or African American 11
Remainder of Two or More Race Responses 12
Hispanic or Latino 13

Well, this is all fine and dandy, but what if you want to tidy these data? You certainly can, and without too much trouble either. How? See below, but before you walk through the code, understand what we need to do.

First, each of the lntitles should be a column such that each place really has only ONE row of data in the tidy version of the file. The lnnumber column is of little use since we have lntitle, and hence we can drop lnnumber without losing anything of value.

library(tidyverse)
placedata %>%
  select(
    -lnnumber
    ) %>%
  group_by(
    geoid, geoname
    ) %>%
  pivot_wider(
    names_from = lntitle,
    values_from = c(
      tot_est, tot_moe, adu_est, adu_moe, 
      cit_est, cit_moe, cvap_est, cvap_moe
      )
  ) %>% 
  janitor::clean_names() -> placedata_tidy

You should end up with a data file that has 29,573 rows and 106 columns. Now, this is too many columns, and surely we are not interested in all of them. Rather, we may only be interested in the total population, the total number of citizens of voting-age, and then of course the percentage of the total number of citizens of voting-age who are non-Hispanic White, Hispanic, non-Hispanic Asian, non-Hispanic Black/African-American, and so on. Let us trim the tidied file in line with this ultimate goal.

placedata_tidy %>%
  group_by(geoid, geoname, tot_est_total, cvap_est_total) %>%
  summarize(
    pct_cvap = (cvap_est_total / tot_est_total),
    pct_cvap_nhwhite = (cvap_est_white_alone / cvap_est_total),
    pct_cvap_nhblack = (cvap_est_black_or_african_american_alone / cvap_est_total),
    pct_cvap_nhaian = (cvap_est_american_indian_or_alaska_native_alone / cvap_est_total),
    pct_cvap_nhasian = (cvap_est_asian_alone / cvap_est_total),
    pct_cvap_nhnhopi = (cvap_est_native_hawaiian_or_other_pacific_islander_alone / cvap_est_total),
    pct_hispanic = (cvap_est_hispanic_or_latino / cvap_est_total)
  ) -> place.cvap.a

Of course the variables I have called pct_* are proportions rather than percentages but that is by design. You will see why when we start mapping or otherwise visualizing these data. Now, you will see various cells with NaN – what R calls a value that is ‘not a number’ because it could not be calculated. Why not? If the denominator is 0, you will encounter this issue. So one fix would be to retain only those rows where tot_est_total > 0 and cvap_est_total > 0 before doing anything else.

placedata_tidy %>%
  filter(tot_est_total > 0, cvap_est_total > 0) %>%
  group_by(geoid, geoname, tot_est_total, cvap_est_total) %>%
  summarize(
    pct_cvap = (cvap_est_total / tot_est_total),
    pct_cvap_nhwhite = (cvap_est_white_alone / cvap_est_total),
    pct_cvap_nhblack = (cvap_est_black_or_african_american_alone / cvap_est_total),
    pct_cvap_nhaian = (cvap_est_american_indian_or_alaska_native_alone / cvap_est_total),
    pct_cvap_nhasian = (cvap_est_asian_alone / cvap_est_total),
    pct_cvap_nhnhopi = (cvap_est_native_hawaiian_or_other_pacific_islander_alone / cvap_est_total),
    pct_hispanic = (cvap_est_hispanic_or_latino / cvap_est_total)
  ) -> place.cvap.b

Now, you might be wondering why the focus on redistricting data at the expense of all the other data products the Census Bureau compiles. Good question, but easy answer: Later on we will be looking at specific packages designed to pull a wide variety of Census products. For now we look at a few important but funky data sources.

6.0.2 CDC’s BRFSS Data

The Behavioral Risk Factor Surveillance System (BRFSS) is the nation’s premier system of health-related telephone surveys that collect state data about U.S. residents regarding their health-related risk behaviors, chronic health conditions, and use of preventive services. Established in 1984 with 15 states, BRFSS now collects data in all 50 states as well as the District of Columbia and three U.S. territories. BRFSS completes more than 400,000 adult interviews each year, making it the largest continuously conducted health survey system in the world.

The 2020 BRFSS survey data can be downloaded in ASCII format from here while the codebook is here and the variable layout is here.

Note: ASCII (pronounced ask-ee) stands for the American Standard Code for Information Interchange and is a code for representing English characters as numbers, with each letter assigned a number from 0 to 127. “For example, the ASCII code for uppercase M is 77. Most computers use ASCII codes to represent text, which makes it possible to transfer data from one computer to another.” Source:
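
A quick way to verify this mapping in base R (nothing to install):

utf8ToInt("M") # 77, the code point for uppercase M
intToUtf8(77)  # "M"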

If you download the data and look at the variable layout here, you will notice that every column has specific starting and ending positions. If we choose to read the downloaded ASCII data, we will have to pass the variable layout information to R. I’ll show you how to do this, but let us take a shortcut via the other formats CDC so kindly makes available for download, namely the SAS Transport Format found here.

While 2020 is the most recent year available as I write, we will work with 2016: download and extract the two 2016 files (the ASCII version and the SAS Transport version) and put both in your working directory.

Extract the SAS version of the data-set and put it in your working directory. Then execute the following code, making sure to add the path to the directory where you stored the extracted file.

library(foreign)
read.xport(
  "data/LLCP2016.XPT"
  ) -> brfss2016a
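
As an aside, the {haven} package we used for the SAS file earlier can read SAS Transport files as well via read_xpt(); a sketch, assuming the same path (the object name brfss2016a_alt is just a placeholder):

library(haven)
read_xpt(
  "data/LLCP2016.XPT"
  ) -> brfss2016a_alt # hypothetical name; should mirror brfss2016a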

When the code executes you should see brfss2016a in the Global Environment window of RStudio. This was fairly easy, but, what if we were given the raw ASCII data? Now we will need the variable layout information, and we will have to choose a library to work with. Two I use are {readr} (my preferred library) and {LaF} (a close second). Both are shown in action below.

Note: The original column layout specifications had to be modified because the third variable listed online (IDATE) subsumes the three that follow. As such IDATE was deleted from the specifications below.

The {readr} code specifies the start and end of the columns, as well as the column names.

fwf_positions_begin = c(1, 16, 18, 20, 22, 31, 35, 45, 62, 63, 64, 65, 67, 68, 70, 72, 80, 81, 83, 85, 87, 88, 89, 90, 91, 92, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 108, 109, 146, 147, 148, 150, 151, 152, 154, 158, 170, 171, 172, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 192, 193, 196, 198, 200, 202, 203, 209, 210, 211, 213, 215, 216, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 243, 255, 256, 257, 258, 261, 264, 266, 268, 270, 271, 272, 273, 275, 277, 279, 281, 282, 284, 285, 310, 311, 312, 314, 315, 316, 317, 318, 319, 320, 321, 322, 323, 326, 327, 329, 330, 332, 334, 336, 339, 340, 341, 342, 343, 345, 346, 347, 348, 349, 351, 352, 354, 356, 357, 359, 360, 361, 362, 363, 364, 365, 366, 367, 368, 369, 370, 371, 372, 573, 574, 575, 576, 578, 579, 581, 582, 583, 584, 590, 591, 595, 623, 625, 626, 627, 628, 629, 645, 647, 1356, 1357, 1363, 1393, 1403, 1417, 1419, 1421, 1425, 1426, 1427, 1430, 1473, 1612, 1669, 1671, 1673, 1674, 1675, 1711, 1907, 1910, 1913, 1916, 2155, 2156, 2157, 2158, 2159, 2160, 2161, 2162, 2163, 2164, 2221, 2223, 2227, 2228, 2229, 2230, 2231, 2232, 2234, 2235, 2236, 2239, 2242, 2247, 2251, 2252, 2253, 2254, 2255, 2256, 2257, 2258, 2259, 2262, 2263, 2267, 2271, 2272, 2273, 2274, 2275, 2276, 2277, 2278, 2279, 2281, 2283, 2284, 2286, 2292)
width =  c(2, 2, 2, 2, 4, 4, 10, 10, 1, 1, 1, 1, 1, 2, 2, 2, 1, 2, 2, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 2, 4, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 3, 2, 2, 2, 1, 6, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 6, 2, 1, 1, 1, 3, 3, 2, 2, 2, 1, 1, 1, 2, 2, 2, 2, 1, 2, 1, 25, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 2, 1, 2, 2, 2, 3, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 6, 1, 4, 28, 2, 1, 1, 1, 1, 1, 2, 2, 1, 6, 10, 10, 10, 2, 2, 1, 1, 1, 1, 10, 10, 1, 2, 2, 1, 1, 1, 10, 3, 3, 3, 10, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 2, 1, 1, 3, 3, 5, 4, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 4, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)

fwf_positions_end = fwf_positions_begin + (width - 1)

col_names = c("STATE", "FMONTH", "IMONTH", "IDAY", "IYEAR", "DISPCODE", "SEQNO", "_PSU", "CTELENUM", "PVTRESD1", "COLGHOUS", "STATERES", "LADULT", "NUMADULT", "NUMMEN", "NUMWOMEN", "GENHLTH", "PHYSHLTH", "MENTHLTH", "POORHLTH", "HLTHPLN1", "PERSDOC2", "MEDCOST", "CHECKUP1", "EXERANY2", "SLEPTIM1", "CVDINFR4", "CVDCRHD4", "CVDSTRK3", "ASTHMA3", "ASTHNOW", "CHCSCNCR", "CHCOCNCR", "CHCCOPD1", "HAVARTH3", "ADDEPEV2", "CHCKIDNY", "DIABETE3", "DIABAGE2", "LASTDEN3", "RMVTETH3", "VETERAN3", "MARITAL", "CHILDREN", "EDUCA", "EMPLOY1", "INCOME2", "WEIGHT2", "HEIGHT3", "NUMHHOL2", "NUMPHON2", "CPDEMO1", "INTERNET", "RENTHOM1", "SEX", "PREGNANT", "QLACTLM2", "USEEQUIP", "BLIND", "DECIDE", "DIFFWALK", "DIFFDRES", "DIFFALON", "SMOKE100", "SMOKDAY2", "STOPSMK2", "LASTSMK2", "USENOW3", "ALCDAY5", "AVEDRNK2", "DRNK3GE5", "MAXDRNKS", "FLUSHOT6", "FLSHTMY2", "PNEUVAC3", "SHINGLE2", "FALL12MN", "FALLINJ2", "SEATBELT", "DRNKDRI2", "HADMAM", "HOWLONG", "PROFEXAM", "LENGEXAM", "HADPAP2", "LASTPAP2", "HADHYST2", "PCPSAAD2", "PCPSADI1", "PCPSARE1", "PSATEST1", "PSATIME", "PCPSARS1", "BLDSTOOL", "LSTBLDS3", "HADSIGM3", "HADSGCO1", "LASTSIG3", "HIVTST6", "HIVTSTD3", "WHRTST10", "PDIABTST", "PREDIAB1", "INSULIN", "BLDSUGAR", "FEETCHK2", "DOCTDIAB", "CHKHEMO3", "FEETCHK", "EYEEXAM", "DIABEYE", "DIABEDU", "PAINACT2", "QLMENTL2", "QLSTRES2", "QLHLTH2", "MEDICARE", "HLTHCVR1", "DELAYMED", "DLYOTHER", "NOCOV121", "LSTCOVRG", "DRVISITS", "MEDSCOST", "CARERCVD", "MEDBILL1", "ASBIALCH", "ASBIDRNK", "ASBIBING", "ASBIADVC", "ASBIRDUC", "WTCHSALT", "LONGWTCH", "DRADVISE", "ASTHMAGE", "ASATTACK", "ASERVIST", "ASDRVIST", "ASRCHKUP", "ASACTLIM", "ASYMPTOM", "ASNOSLEP", "ASTHMED3", "ASINHALR", "IMFVPLAC", "TETANUS", "HPVTEST", "HPLSTTST", "HPVADVC2", "HPVADSHT", "CNCRDIFF", "CNCRAGE", "CNCRTYP1", "CSRVTRT1", "CSRVDOC1", "CSRVSUM", "CSRVRTRN", "CSRVINST", "CSRVINSR", "CSRVDEIN", "CSRVCLIN", "CSRVPAIN", "CSRVCTL1", "RRCLASS2", "RRCOGNT2", "RRATWRK2", "RRHCARE3", "RRPHYSM2", "RREMTSM2", "SCNTMNY1", "SCNTMEL1", "SCNTPAID", "SCNTWRK1", "SCNTLPAD", "SCNTLWK1", "SCNTVOT1", "SXORIENT", "TRNSGNDR", "RCSBIRTH", "RCSGENDR", "RCHISLA1", "RCSRACE1", "RCSBRAC1", "RCSRLTN2", "CASTHDX2", "CASTHNO2", "EMTSUPRT", "LSATISFY", "QSTVER", "QSTLANG", "MSCODE", "_STSTR", "_STRWT", "_RAWRAKE", "_WT2RAKE", "_AGE80", "_IMPRACE", "_IMPNPH", "_IMPEDUC", "_IMPMRTL", "_IMPHOME", "_LANDWT2", "_LANDWT", "_CHISPNC", "_CPRACE", "_CRACE1", "_IMPCAGE", "_IMPCRAC", "_IMPCSEX", "_CLANDWT", "NPHH", "NAHH", "NCHH", "_HHOLDWT", "_RFHLTH", "_HCVU651", "_TOTINDA", "_LTASTH1", "_CASTHM1", "_ASTHMS1", "_DRDXAR1", "_EXTETH2", "_ALTETH2", "_DENVST2", "_PRACE1", "_MRACE1", "_HISPANC", "_RACE", "_RACEG21", "_RACEGR3", "_RACE_G1", "_AGEG5YR", "_AGE65YR", "_AGE_G", "HTIN4", "HTM4", "WTKG3", "_BMI5", "_BMI5CAT", "_RFBMI5", "_CHLDCNT", "_EDUCAG", "_INCOMG", "_SMOKER3", "_RFSMOK3", "DRNKANY5", "DROCDY3_", "_RFBING5", "_DRNKDY4", "_DRNKMO4", "_RFDRHV4", "_RFDRMN4", "_RFDRWM4", "_FLSHOT6", "_PNEUMO2", "_RFSEAT2", "_RFSEAT3", "_RFMAM2Y", "_MAM502Y", "_RFPAP32", "_RFPSA21", "_RFBLDS2", "_RFSIGM2", "_AIDTST3")
library(readr)
read_fwf(
  "data/LLCP2016.ASC ", 
  fwf_positions(
    fwf_positions_begin, 
    fwf_positions_end, 
    col_names
    )
  ) -> brfss2016b
brfss2016b %>%
  filter(
    STATE == "39"
    ) -> brfss2016b # Subsetting Ohio data
save(
  brfss2016b, 
  file = "data/brfss2016b.RData"
  )

Note that the extracted file name includes an extra white-space after the file extension .ASC, which is why the path we pass to read_fwf() ends with a space.

Next you see the {LaF} approach.

library(LaF)
laf_open_fwf(
  "data/LLCP2016.ASC ", 
  column_widths = width, 
  column_types = rep(
    "character", 
    length(width)
    ), 
  column_names = col_names
  ) -> laf 
data.frame(
  laf[, ] 
  ) -> brfss2016c
save(
  brfss2016c, 
  file = "data/brfss2016c.RData"
  )

Notice that with {LaF} we had to specify the column widths via column_widths = width and the column headings via column_names = col_names. In addition, note that the path again includes the extra white-space after the file extension .ASC.

6.0.2.1 Changing Variable Formats

When these files are read, some of the variables will come in as characters when in fact they are numeric (see STATE, for example) or factors (see PHYSHLTH, for example). Others will be read as integers (see GENHLTH) and so on. Consequently, we often have to reset the read formats. Further, we do know from the codebook that missing values and other responses (such as refused) carry numeric codes in the survey. For instance, GENHLTH is supposed to have the following responses mapped to the numeric values stored in the variable:

Code Label
1 Excellent
2 Very Good
3 Good
4 Fair
5 Poor
7 Don’t Know/Not Sure
9 Refused
BLANK Not asked or Missing

We can take this mapping and fix GENHLTH by creating a new variable, say general.health as a factor.

load("data/brfss2016b.RData")
factor(
  brfss2016b$GENHLTH, 
  levels = c(1, 2, 3, 4, 5), 
  labels = c(
    "Excellent", "Very Good", "Good", "Fair", "Poor"
    )
  ) -> brfss2016b$general.health 
table(
  brfss2016b$general.health
  )
#> 
#> Excellent Very Good      Good      Fair      Poor 
#>         0         0         0         0         0

Let us also flip STATE into a numeric variable, and combine the interview date components (IMONTH, IDAY, IYEAR) into a proper date. For dates the lubridate package comes in very handy.

as.numeric(
  brfss2016b$STATE
  ) -> brfss2016b$STATE

We’ll combine IMONTH, IDAY and IYEAR to create a date variable.

with(
  brfss2016b, 
  sprintf(
    '%s-%s-%s', 
    IMONTH, IDAY, IYEAR
    )
  ) -> brfss2016b$date
head(brfss2016b$date)
#> character(0)
library(lubridate)
brfss2016b$Date = mdy(brfss2016b$date)
is.Date(brfss2016b$Date)
#> [1] TRUE
str(brfss2016b$Date)
#>  'Date' num(0)

Note how we built date … we first concatenated the three columns and then used lubridate to format date via the mdy() specification. I then check to make sure the format is correct via the is.Date() command. This could also have been done via the str() command. Now that Date has been created we could drop the original three columns but we will let that be for now.
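
If you did decide to drop them, it would be a one-liner (a minimal sketch):

brfss2016b %>%
  dplyr::select(
    -c(IMONTH, IDAY, IYEAR)
    ) -> brfss2016b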

6.0.3 BEA and BLS Data

The Bureau of Labor Statistics (BLS) compiles a lot of data, including some under the Local Area Unemployment Statistics (LAUS) program – a Federal-State cooperative effort in which monthly estimates of total employment and unemployment are prepared for approximately 7,300 areas:

  • Census regions and divisions
  • States
  • Metropolitan Statistical Areas and Metropolitan NECTAS (New England City and Town Areas)
  • Metropolitan Divisions and NECTA Divisions
  • Micropolitan Statistical Areas and Micropolitan NECTAs
  • Combined Metropolitan Statistical Areas and Combined NECTAs
  • Small Labor Market Areas
  • Counties and county equivalents
  • Cities of 25,000 population or more
  • Cities and towns in New England regardless of population

These estimates are key indicators of local economic conditions. The Bureau of Labor Statistics (BLS) of the U.S. Department of Labor is responsible for the concepts, definitions, technical procedures, validation, and publication of the estimates that State employment security agencies prepare under agreement with BLS. Data available at BLS can be found here.

Similarly, the Bureau of Economic Analysis (BEA) has a regional economic accounts program that details the geographic distribution of U.S. economic activity and growth. “The estimates of gross domestic product by state and state and local area personal income, and the accompanying detail, provide a consistent framework for analyzing and comparing individual state and local area economies.”

BEA has a nice site here that allows you to interactively select the data you want, or you can grab the data from their download section.

The BEA data read below were obtained from here, while the BLS data read below were obtained from here and here, respectively.

read.csv(
  here::here(
    "data", 
    "CA1_1969_2016_OH.csv"
    )
  ) -> df.bea 
read.csv(
  "http://www.bls.gov/lau/laucnty16.txt",
  header = FALSE, 
  skip = 5, 
  sep = ""
  ) -> df.bls1 
colnames(df.bls1) = c(
  "LAUS Code", "StateFIPSCode", "CountyFIPSCode", "CountyName",
  "StateAbbreviation", "Year", "TotalLaborForce", "TotalEmployed",
  "TotalUnemployed", "UnemploymentRate"
  )
library(readxl)
library(httr)
url = "https://www.bls.gov/lau/laucnty16.xlsx"
GET(
  url, 
  write_disk(tf <- tempfile(fileext = ".xlsx"))
  )
#> Response [https://www.bls.gov/lau/laucnty16.xlsx]
#>   Date: 2022-01-14 00:54
#>   Status: 200
#>   Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
#>   Size: 246 kB
#> <ON DISK>  /var/folders/qh/6q39v0755_54rxmbl8m5ttnwy0twd7/T//RtmpPMvrrc/file921e4f4e7dd.xlsx
read_excel(
  tf, 
  skip = 5, 
  na = "", 
  col_names = c(
    "LAUS Code", "StateFIPSCode", "CountyFIPSCode", 
    "CountyName", "Year", "Blank_Column", "TotalLaborForce",
    "TotalEmployed", "TotalUnemployed", "UnemploymentRate"
    )
  ) -> df.bls2 
read.csv(
  "http://www.bls.gov/cew/data/api/2015/2/industry/10.csv",
  sep = ",", 
  header = TRUE
  ) -> df.bls3 

Note that df.bls1 uses the .txt data while df.bls2 uses the .xlsx data. Note too that the Excel file has a blank column (named Blank_Column above) that we will have to delete. Also, look at df.bls1 and note that the data have some issues because of the way the txt file is prepared. Consequently, we want to rely on the Excel data for now. If we wanted to cleanly read the txt file we would have to do some more legwork up front.
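
Dropping that blank column, for example, is a one-liner (a sketch using the Blank_Column name we assigned above):

df.bls2 %>%
  dplyr::select(-Blank_Column) -> df.bls2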

6.1 Using Application Programming Interfaces (APIs)

Many organizations that gather data have started offering free APIs – essentially a setup whereby if you register for their API key, you can interact with their servers to query and download data. This is especially useful because it saves time and improves efficiency all around. Moreover, an increasing number of R packages are using APIs to communicate with data servers. Before you can use their API, you have to register for an API key by setting up an account, usually free.

6.1.1 The Bureau of Economic Analysis (BEA) API

Let us start with the Bureau of Economic Analysis’s API. What does the BEA do? Per the agency, “The Bureau of Economic Analysis (BEA) promotes a better understanding of the U.S. economy by providing the most timely, relevant, and accurate economic accounts data in an objective and cost-effective manner.”

The three BEA product streams of interest to most people tend to be the (i) National series (BEA’s national economic statistics provide a comprehensive view of U.S. production, consumption, investment, exports and imports, and income and saving. These statistics are best known by summary measures such as gross domestic product (GDP), corporate profits, personal income and spending, and personal saving.), (ii) the Regional series (The regional economic accounts tell us about the geographic distribution of U.S. economic activity and growth. The estimates of gross domestic product by state and state and local area personal income, and the accompanying detail, provide a consistent framework for analyzing and comparing individual state and local area economies.), and (iii) the Industry Economic Accounts (The industry economic accounts, presented both in an input-output framework and as annual output by each industry, provide a detailed view of the interrelationships between U.S. producers and users and the contribution to production across industries. These accounts are used extensively by policymakers and businesses to understand industry interactions, productivity trends, and the changing structure of the U.S. economy.)

To start, sign up for a BEA key here and once you get it (check your email), you should set it via mybeaKey = "xyz1s3d4g6hblahblahblahblah" (the calls below assume the key is stored in an object named mybeaKey). Now install and load the library.
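
To avoid hard-coding the key in scripts you share, one option is to keep it in your ~/.Renviron file and read it at runtime. A sketch – BEA_KEY is just a hypothetical name for the entry you would add to that file:

# Assumes a line like BEA_KEY="xyz1s3d4g6hblahblahblahblah" exists in ~/.Renviron
Sys.getenv("BEA_KEY") -> mybeaKey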

library(bea.R)

There are a few primary functions currently offered: you can either search for data via a keyword, or you can get specific data once you have identified the table ID and other details. Let us start by seeing what data sets are available.7

beaSets(mybeaKey)
#> $Dataset
#>                DatasetName                   DatasetDescription
#> 1                     NIPA                 Standard NIPA tables
#> 2       NIUnderlyingDetail Standard NI underlying detail tables
#> 3                      MNE            Multinational Enterprises
#> 4              FixedAssets         Standard Fixed Assets tables
#> 5                      ITA  International Transactions Accounts
#> 6                      IIP    International Investment Position
#> 7              InputOutput                    Input-Output Data
#> 8            IntlServTrade         International Services Trade
#> 9            GDPbyIndustry                      GDP by Industry
#> 10                Regional                   Regional data sets
#> 11 UnderlyingGDPbyIndustry           Underlying GDP by Industry
#> 12      APIDatasetMetaData    Metadata about other API datasets
#> 
#> attr(,"params")
#>   ParameterName                       ParameterValue
#> 1  RESULTFORMAT                                 JSON
#> 2        METHOD                       GETDATASETLIST
#> 3        USERID A8CF5A83-A7A6-42A5-8D45-B1E94079E37C

The beaSearch() function is limited to returning details of national and regional data so please bear that in mind. We start with a simple search for all data on wages and salaries.

beaSearch(
  "Wages and Salaries", 
  beaKey = mybeaKey, 
  asHtml = FALSE
  ) -> bea.search.01

The resulting table is stored in bea.search.01; you should look through the series to get a sense of the contents. We will then go in and pick a series or two to download.
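
A quick peek at the first few rows:

head(bea.search.01)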

The beaGet function will grab the data you specify, as shown below.

  list(
    'UserID' = mybeaKey, 
    'Method' = 'GetData', 
    'DatasetName' = 'NIPA', 
    'TableName' = 'T11000',
    'LineNumber' = "3",
    'Frequency' = 'Q',
    'Year' = '2010,2012,2013,2014,2015,2016,2017,2018',
    'ResultFormat' = 'json'
    ) -> mybea_call

beaGet(
  mybea_call,
  asWide = FALSE
  ) -> bea01

You can control whether data are returned in a long format with the asWide = FALSE switch. If you set it to TRUE you will get the data in wide format, necessitating additional work to whip it into shape for analysis.

To return in a format in which each column represents a series, set iTableStyle = FALSE. This returns columns named with a concatenation of the descriptive column values, whereas rows are populated with numeric data values for each time period, and has the last column named “TimePeriod” filled with dates (see below).

beaGet(
  mybea_call, 
  iTableStyle = FALSE
  ) -> bea02

Note that asWide = TRUE and iTableStyle = TRUE are set as the defaults. My preference would be to work with bea01.

What about some data from the Regional series? Let us see what is available for personal income, and then pick both a series and sub-national level (state, MSA, county, etc.).

beaSearch(
  "Personal Income", 
  beaKey = mybeaKey, 
  asHtml = FALSE
  ) -> bea.search.02

Pay attention to the warning message our call threw: "In beaSearch('Personal Income', beaKey = mybeaKey, asHtml = FALSE): Regional metadata is missing from /Library/Frameworks/R.framework/Versions/4.1/Resources/library/beaR/data and may be locked for updating on the BEA API; searching national metadata only."

This is not a surprise if you pay attention to the technical documentation the BEA releases. Specifically, the BEA has announced that “The RegionalData, RegionalIncome, and RegionalProduct data sets are obsolete and API calls using these data sets will no longer function. Users should instead access the dataset Regional, which provides comprehensive detail in statistics, industries, geography, and years. See Appendix N for instructions on how to use the Regional data set. In addition, BEA continues to add new datasets as appropriate, and while any new datasets will be announced on the BEA’s website Developers page (https://apps.bea.gov/developers), we also encourage users to periodically call the above GETDATASETLIST discover new datasets.”

Well, so what now? Let us hack the call a bit to figure out what to ask for. The easy way to do this is to use the BEA’s interactive data interface. I used the Regional data link to pick per capita personal income by county, and this turns out to be the CAINC1 series. At the time of writing this, the latest data available are for 2020.

Now, you could also rely on some helper functions in {bea.R} to get the series you are curious about. These helper functions are used below with this particular example – chasing CAINC1 and extracting the key pieces we will need in our API call. First I will look for the TableName (the web interface already told us this was CAINC1, but why not see what else we could get, eh?).

library(tidyverse)
beaParamVals(
  mybeaKey, 
  'Regional', 
  "TableName"
  ) %>%
  as.data.frame(.) -> see01

This gives us 99 rows of data with one TableName and its corresponding description per row.

Now that we have seen the various tables available to us, let us look at the various LineCode values in each and what these linecodes represent.

beaParamVals(
  mybeaKey, 
  'Regional', 
  "LineCode"
  ) %>%
  as.data.frame() -> see02

If you filter the result down to CAINC1 you will see two rows with LineCode = 3. What does each refer to? No clue, and the documentation does not help answer it either. Regardless, let us roll the dice and issue a call for the data.

  list(
    'UserID' = mybeaKey, 
    'Method' = 'GetData', 
    'DatasetName' = 'Regional', 
    'TableName' = 'CAINC1',
    'LineCode' = 3,
    'Year' = 'LAST10',
    'Frequency' = "A",
    'GeoFIPS' = "COUNTY",
    'ResultFormat' = 'json'
    ) -> mybea_call

beaGet(
  mybea_call,
  asWide = FALSE
  ) -> see03

I asked for the last ten years of data via LAST10 and could have also done this in long-hand by spelling out each year. I believe the other option available might be LAST5 to get the data for the last 5 years. Note also that I added the Frequency = "A" parameter in case the series is available with Annual (A), Quarterly (Q), or Monthly (M) frequency, and you know what frequency you want.

Since the older regional data sets have been retired and the regional metadata can be flaky (as the warning above suggests), I suggest you also consider pulling the raw data from the BEA and then processing it in R.

To go the manual route, start here and pick the account you want data for. Then use the dropdown menu options to select the series and Download … you will end up with a zip file that you will then have to unpack and read into R. For example, if I go for CAINC1 I get a 3.4MB file CAINC.zip that, when unpacked, gives me state-specific *.csv files and one all areas file, each set for the 1969-2020 period.

readr::read_csv(
  here::here(
    "data/CAINC1",
    "CAINC1__ALL_AREAS_1969_2020.csv"
    )
  ) -> cainc1

Note the extra underscore in the filename; wonders never cease!

The data are in wide format, with each year a column. But this is a negligible penalty to pay for getting everything in the CAINC1 series in one swoop! Of course you would have to pick the series you want and then pivot the data into the long format. Let me do this for per-capita personal income.

cainc1 %>%
  filter(LineCode == 3) %>%
  select(
    1, 2, 9:60
    ) %>%
  group_by(
    GeoFIPS, GeoName
    ) %>%
  pivot_longer(
    cols = 3:54,
    names_to = "year",
    values_to = "pcpi",
    names_transform = list(
      year = as.integer
      ),
    values_transform = list(
      pcpi = as.integer
      )
    ) -> cainc1_long

I made sure to convert both year and pcpi to integers. Note also that the resulting file will include national and state values as well but that is not a negative result at all. Why? Because we can use these to compare counties to state/national values, as a sort of a comparative benchmark.
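
For instance, a quick benchmark plot might look like the sketch below; the GeoName labels are assumptions about how BEA spells them in CAINC1, so check unique(cainc1_long$GeoName) first.

cainc1_long %>%
  filter(
    GeoName %in% c("United States", "Ohio", "Athens, OH") # assumed spellings
    ) %>%
  ggplot(
    aes(x = year, y = pcpi, color = GeoName)
    ) +
  geom_line() +
  labs(
    x = "Year",
    y = "Per capita personal income (current $)",
    title = "County versus state and national benchmarks"
    )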

We dropped LineCode = 1 and 2 but they will be needed should you want to create a weighted per-capita measure for some higher geography, for example, the Appalachian counties versus the non-Appalachian counties in state X or in the nation. The number of times I have seen small business development and related folks think they can generate an Appalachian estimate by calculating the mean per-capita income for all relevant counties via a simple unweighted mean calculation is shocking.
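
A small sketch with made-up numbers shows why the unweighted mean misleads; with the real data the weights would come from population (LineCode 2, if I recall the CAINC1 layout correctly), or, equivalently, you would divide summed total personal income (LineCode 1) by summed population.

library(dplyr)

# Purely illustrative numbers: one row per county, with population and
# per-capita personal income, plus a region flag.
tibble::tribble(
  ~county, ~region,           ~pop,    ~pcpi,
  "A",     "Appalachian",      10000,  38000,
  "B",     "Appalachian",     250000,  52000,
  "C",     "Non-Appalachian",  80000,  47000
  ) -> county_df

county_df %>%
  group_by(region) %>%
  summarize(
    naive_mean_pcpi    = mean(pcpi),                 # the unweighted (wrong) way
    weighted_mean_pcpi = sum(pcpi * pop) / sum(pop)  # population-weighted
    )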

6.1.2 The Bureau of Labor Statistics (BLS) API

Make sure you sign up for a BLS key here and save the key somewhere that you’ll remember. Then install three packages – {blsAPI}, {blsR}, and {blscrapeR} – since although the BLS documents {blsAPI}, the other two could be useful as well.
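
The calls that follow read the key via Sys.getenv("BLS_KEY"), so they assume you have added a line like BLS_KEY="your-key-here" to your ~/.Renviron file (and restarted R afterwards). A quick check that the key is visible to your session:

# TRUE if the key was picked up; "" (hence FALSE) means the .Renviron entry is missing
Sys.getenv("BLS_KEY") != ""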

One of the challenges in working with the BLS is that you need to root around before you can find what you are looking for. A good starting point tends to be this page. Say you are here and want to look at unemployment. If you click Unemployment you see a few options, one of which is Local Area Unemployment Statistics (LAUS). Click this and then click on the Tables icon, and you will see the various data files available to you.

But this is not the API-way. For that, we need to identify the series ID that maps on to local area unemployment. For that you will have to access the BLS page here. Click Local Area Unemployment Statistics and you will find the Series ID.

library(blsR)
library(rjson)
library(blscrapeR)
library(blsAPI)
bls_api(
  c("LAUCN040010000000003"),
  startyear = 2008, 
  endyear = 2021, 
  Sys.getenv("BLS_KEY")
  ) %>%
  dateCast() -> bls01

A much better way to go about this would be to get the master list of series the BLS API points to.

series_ids -> seriesdf

You will end up with a data-frame that has 88,815 rows and 4 columns, each row a unique series! Now, let us assume I want to see the unemployment rate for Athens County, Ohio. The Series ID is LAUCN390090000000003 and we should be able to pull the data.

bls_api(
  c("LAUCN390090000000003"),
  startyear = 2000, 
  endyear = 2021, 
  Sys.getenv("BLS_KEY")
  ) %>%
  dateCast() -> bls02

Fair enough, I can get data for individual counties or multiple counties. At some point the API may reject our calls because we may end up exceeding the allowable limits imposed by the API, but let us see how far we can push it. Specifically, I want to grab the unemployment rate data for all 88 Ohio counties. Can I do that? Step 1 will be to find and concatenate the Series ID for all 88 counties.

seriesdf %>%
  filter(
    grepl(" County, Oh", series_title),
    grepl("Unemployment Rate", series_title)
    ) -> series88
series88[1:44, ] -> seria_01
series88[45:88, ] -> seria_02

I know the API will return data for fewer than 88 counties if we ask for too much at once, so what we can do is split our counties into two groups, the first 44 (seria_01) and the second 44 (seria_02). Having done this, we can issue bls_api(...) calls for each half, and then bind them together via bind_rows(...).

bls_api(
  seria_01$series_id,
  startyear = 2011, 
  endyear = 2021, 
  Sys.getenv("BLS_KEY")
  ) %>%
  dateCast() -> bls03_01
bls_api(
  seria_02$series_id,
  startyear = 2011, 
  endyear = 2021, 
  Sys.getenv("BLS_KEY")
  ) %>%
  dateCast() -> bls03_02
bls03_01 %>%
  bind_rows(bls03_02) -> bls03

Not bad, looks like we got some 11,440 rows of data. But the County names are missing, so what we could do is append the county names from series88 as follows:

bls03 %>%
  left_join(
    series88[, 1:2], by = c("seriesID" = "series_id") 
  ) -> bls04_raw

Of course, as is visible in the data-frame, the county names are all embedded in the series_title column, so let us extract the county names and store them in a new column. We can also drop columns that are no longer needed, and convert value (read in as character) to a numeric variable.

bls04_raw %>%
  mutate(
    county = stringr::str_remove_all(
      series_title,
      "Unemployment Rate: |Oh \\(U\\)| County, "
      )
    ) %>%
  mutate(
    unemprate = as.numeric(value)
  ) %>% 
  select(
    seriesID, latest, footnotes,
    date, county, unemprate
    ) -> bls04

As you can see, getting sub-national data via the BLS API is not an easy thing. I often end up having to use state and county-level unemployment rates, and then at times the labor-force participation rate. What I do is to grab the raw data directly from BLS, read, process, and save the file as *.RData, and then update it as needed. Compared to some of the other APIs out there the BLS API is a tough nut to crack. At least for me.

6.1.3 Initiative and Referendum Data (1904-2010)

The Initiative and Referendum Institute has long gathered data on direct democracy in the states. You can read more here. Three data sets are readily available for download here:

6.1.3.0.1 IRI initiatives by state and year for the 1904-2020 period

This database lists all statewide initiatives to appear on the ballot since the first statewide initiative in Oregon in 1904. This list does not contain information on issues placed on the ballot by state legislatures, commonly referred to as “legislative measures.”

"http://www.iandrinstitute.org/docs/Number%20initiatives%20by%20state-year%20(1904-2019).xls" -> myurl 
"Number_20initiatives_20by_20state_year_20_1904_2019_.xls" -> mydestfile
curl::curl_download(myurl, mydestfile)
readxl::read_excel(
  mydestfile
  ) -> iir_df

The Legal Landscape Database describes direct democracy provisions in American cities in 2005. The database includes the 1,000 largest cities in the country as well as the ten largest cities in every state. Data items include city name, FIPS code, population, and initiative status.

Finally, the California Local Option Sales Tax Measures 1976-2020 database provides detailed information on 86 local sales tax referendums for transportation purposes in California, including ballot information, election results, tax amount and duration, estimated revenue, project description, supporters and opponents. A second worksheet provides links to the text of the measures. Collected by UCLA Institute of Transportation Studies.

6.1.4 Education Data (CCD)

“The Common Core of Data (CCD) is a program of the U.S. Department of Education’s National Center for Education Statistics that annually collects fiscal and non-fiscal data about all public schools, public school districts and state education agencies in the United States. The data are supplied by state education agency officials and include information that describes schools and school districts, including name, address, and phone number; descriptive information about students and staff, including demographics; and fiscal data, including revenues and current expenditures.” You can learn more about the program here and access the data and documentation here. You can access the specific file used below from here, and note that these are schools in California.

readxl::read_excel(
  here::here(
    "data", "ncesdata.xlsx"
    )
  ) -> df.nces 
NCES District ID State District ID District Name County Name* Street Address City State ZIP ZIP 4-digit Phone Students* Teachers* Schools* Locale Code* Locale* Student Teacher Ratio* Type
0601620 CA-1964212 ABC Unified LOS ANGELES COUNTY 16700 Norwalk Blvd. Cerritos CA 90703 1838 (562)926-5566 20998.00000 833.11000 30.00000 21 Suburb: Large 25.20000000 Regular School District
0601650 CA-0761630 Acalanes Union High CONTRA COSTA COUNTY 1212 Pleasant Hill Rd. Lafayette CA 94549 2623 (925)280-3900 5402.00000 267.60000 5.00000 21 Suburb: Large 20.20000000 Regular School District
0601680 CA-3166761 Ackerman Charter PLACER COUNTY 13777 Bowman Rd. Auburn CA 95603 3147 (530)885-1974 556.00000 28.24000 1.00000 31 Town: Fringe 19.70000000 Other Education Agency
0600001 CA-1975309 Acton-Agua Dulce Unified LOS ANGELES COUNTY 32248 N. Crown Valley Rd. Acton CA 93510 2620 (661)269-5999 4043.00000 219.57000 22.00000 11 City: Large 18.40000000 Regular School District
0601710 CA-3667587 Adelanto Elementary SAN BERNARDINO COUNTY 11824 Air Expressway Adelanto CA 92301 0070 (760)246-8691 10378.00000 396.93000 16.00000 21 Suburb: Large 26.10000000 Regular School District
0691051 CA-0110017 Alameda County Office of Education ALAMEDA COUNTY 313 W. Winton Ave. Hayward CA 94544 1136 (510)887-0152 3931.00000 187.66000 13.00000 11 City: Large 20.90000000 Regular School District

These are data for a single year. If we wanted to build a panel (i.e., data for multiple schools over multiple years) we would have to download all the files and then bind them into a single data frame.
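
A sketch of what that might look like, assuming the yearly downloads are saved with hypothetical names such as data/ncesdata_2018.xlsx, data/ncesdata_2019.xlsx, and so on:

library(tidyverse)

# Read every yearly file, stack them, and tag each row with the year taken
# from the file name. Column types may need harmonizing across years.
list.files("data", pattern = "^ncesdata_\\d{4}\\.xlsx$", full.names = TRUE) %>%
  purrr::set_names() %>%
  purrr::map_dfr(readxl::read_excel, .id = "file") %>%
  mutate(year = readr::parse_number(basename(file))) -> df.nces.panel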

6.1.5 The {EdSurvey} Package for Educational Data

The {EdSurvey} package is authored by AIR and designed to “read in and analyze functions for education survey and assessment data from the National Center for Education Statistics (NCES) nces.ed.gov, including National Assessment of Educational Progress (NAEP) data nces.ed.gov/nationsreportcard and data from the International Assessment Database: Organisation for Economic Co-operation and Development (OECD) oecd.org, including Programme for International Student Assessment (PISA), Teaching and Learning International Survey (TALIS), Programme for the International Assessment of Adult Competencies (PIAAC), and International Association for the Evaluation of Educational Achievement (IEA) iea.nl, including Trends in International Mathematics and Science Study (TIMSS), TIMSS Advanced, Progress in International Reading Literacy Study (PIRLS), International Civic and Citizenship Study (ICCS), International Computer and Information Literacy Study (ICILS), and Civic Education Study (CivEd).”

The authors have provided extensive documentation of how to use the package, obviating the need to go into any depth here.

6.1.6 World Bank data with {data360r}

In January 2017 the World Bank launched TCdata360, “a new open data platform that features more than 2,000 trade and competitiveness indicators from 40+ data sources inside and outside the World Bank Group. Users of the website can compare countries, download raw data, create and share data visualizations on social media, get country snapshots and thematic reports, read data stories, connect through an application programming interface (API), and more.” This led to them developing data360r, an R package that allows users to interact with the TCdata360 API and query TCdata360 data, metadata, and a lot more. Below we walk through some of the core functions.

The database has a large number of indicators so most of us would (or should) walk in knowing which indicators we wish to work with. Assume for now that you are interested in data-set 51.

#install.packages("devtools")
#devtools::install_github("mrpsonglao/data360r")
library(data360r)
get_data360(
  dataset_id = 51
  ) -> df51 

But what if you have no clue about the dataset you want and would rather start by browsing the list of available indicators? You could do that for sure, and even browse available countries and data-sets with ease.

Say you want to get all indicators’ metadata in Govdata360.

get_metadata360(
  site = "gov", 
  metadata_type = "indicators"
  ) -> df_indicators 

What if you want all country metadata in TCdata360?

get_metadata360(
  metadata_type = 'countries'
  ) -> df_countries

Perhaps all dataset metadata in TCdata360?

get_metadata360(
  metadata_type = 'datasets'
  ) -> df_datasets 

You can also search for data on an individual country code

search_360(
  'Bhutan', 
  search_type = 'country'
  )
#>    id   name slug    type score redirect
#> 1: NA Bhutan  BTN country     1    FALSE

or top n indicators for a measure (but note that you will likely see more than n rows in the data-set since some indicators have two IDs – one for TCdata360 and the other for Govdata360).

search_360(
  'GDP', 
  search_type = 'indicator', 
  limit_results = 5
  )
#>       id                               name              slug      type  score
#> 1: 28107                  GDP (current US$)          d6cf2c13 indicator 0.3333
#> 2: 28609                  GDP - current US$         h3c4bf3c9 indicator 0.3333
#> 3:   940              GDP growth (annual %) NY.GDP.MKTP.KD.ZG indicator 0.2500
#> 4:   944       GDP per capita (current US$)    NY.GDP.PCAP.CD indicator 0.2500
#> 5:   949 GDP, PPP (current international $) NY.GDP.MKTP.PP.CD indicator 0.2500
#>    redirect                         dataset
#> 1:    FALSE    World Development Indicators
#> 2:    FALSE World Integrated Trade Solution
#> 3:    FALSE    World Development Indicators
#> 4:    FALSE    World Development Indicators
#> 5:    FALSE    World Development Indicators

or for a particular database

search_360(
  'World Development Indicators', 
  search_type = 'indicator', 
  limit_results = 5
  )
#>      id                                                 name              slug
#> 1: 1804             Commercial service exports (current US$) TX.VAL.SERV.CD.WT
#> 2: 2010  High-technology exports (% of manufactured exports) TX.VAL.TECH.MF.ZS
#> 3: 2009                High-technology exports (current US$)    TX.VAL.TECH.CD
#> 4: 1931 Transport services (% of commercial service exports) TX.VAL.TRAN.ZS.WT
#> 5: 1933    Travel services (% of commercial service exports) TX.VAL.TRVL.ZS.WT
#>         type score                      dataset redirect
#> 1: indicator     1 World Development Indicators    FALSE
#> 2: indicator     1 World Development Indicators    FALSE
#> 3: indicator     1 World Development Indicators    FALSE
#> 4: indicator     1 World Development Indicators    FALSE
#> 5: indicator     1 World Development Indicators    FALSE

or for a specific data provider

search_360(
  'WBG', 
  search_type = 'indicator', 
  limit_results = 5
  )
#>       id
#> 1: 28134
#> 2: 28135
#> 3: 28137
#> 4: 28136
#> 5: 28140
#>                                                                                                name
#> 1:                             Adult literacy rate, population 15+ years, gender parity index (GPI)
#> 2:     Barro-Lee: Percentage of female population age 15+ with primary schooling. Completed Primary
#> 3: Barro-Lee: Percentage of female population age 15+ with secondary schooling. Completed Secondary
#> 4:            Barro-Lee: Percentage of population age 15+ with primary schooling. Completed Primary
#> 5:          Barro-Lee: Percentage of population age 15+ with tertiary schooling. Completed Tertiary
#>        slug      type score              dataset redirect
#> 1: f22e4196 indicator     1 Education Statistics    FALSE
#> 2: 8ef474ba indicator     1 Education Statistics    FALSE
#> 3: cde9ae8d indicator     1 Education Statistics    FALSE
#> 4: c00a96fc indicator     1 Education Statistics    FALSE
#> 5: f7163136 indicator     1 Education Statistics    FALSE

or for a particular category

search_360(
  'Governance', 
  site = 'gov', 
  search_type = 'category', 
  limit_results = 5
  )
#>       id                       name       slug     type  score dataset redirect
#> 1: 40976                 Governance        gov category 1.0000    <NA>     TRUE
#> 2:    NA       Corporate Governance  h758609d3 category 0.5000    <NA>    FALSE
#> 3:    NA         General Governance governance category 0.5000    <NA>    FALSE
#> 4: 44140         Overall Governance  h81b1dd1f category 0.5000    <NA>    FALSE
#> 5:    NA Other (Overall Governance)  h0521fa1f category 0.3333    <NA>    FALSE

If you want to download data for a particular indicator, you can do so as follows. First, find the indicator ID.

search_360(
  "woman business", 
  search_type = "indicator", 
  limit_results = 5
  ) -> df_ind1 

Say I want indicator 90: "Can a married woman register a business in the same way as a married man?"

get_data360(
  indicator_id = c(90)
  ) -> df_ind90 
get_data360(
  indicator_id = c(28130, 28131)
  ) -> df_indtwo

If you want all the data for a specific country, or just the indicators we pulled in df_ind1 for that country, you could do:

get_data360(
  country_iso3 = "IND"
  ) -> df_allone
get_data360(
  indicator_id = df_ind1$id, 
  country_iso3 = "IND"
  ) -> df_allind1

Now an example with two measures – the legal age of marriage for boys and girls. Note that the package allows you to specify the long format (preferred) rather than the default wide format in which the earlier results were returned. Note also the use of timeframes = c(), which allows you to specify the specific time period you want the indicators for.

search_360(
  "marriage", 
  search_type = "indicator"
  )
#>       id
#> 1:   204
#> 2:   205
#> 3:   206
#> 4:   207
#> 5: 43472
#> 6: 40083
#> 7: 40080
#> 8: 40081
#> 9:    NA
#>                                                                                                       name
#> 1:                                                             What is the legal age of marriage for boys?
#> 2:                                                            What is the legal age of marriage for girls?
#> 3:                                                  Are there any exceptions to the legal age of marriage?
#> 4:                                            Does the law prohibit or invalidate child or early marriage?
#> 5:                  Does the law grant spouses equal administrative authority over assets during marriage?
#> 6:                              What is the minimum age of marriage with judicial authorization for girls?
#> 7:                                     What is the minimum age of marriage with parental consent for boys?
#> 8:                                    What is the minimum age of marriage with parental consent for girls?
#> 9: Are there penalties in the law for authorizing or knowingly entering into the child or early marriage?'
#>                slug      type   score                     dataset redirect
#> 1:    age.marr.male indicator 0.11111 Women, Business and the Law    FALSE
#> 2:     age.marr.fem indicator 0.11111 Women, Business and the Law    FALSE
#> 3: marry.age.except indicator 0.10000 Women, Business and the Law    FALSE
#> 4:        chld.marr indicator 0.10000 Women, Business and the Law    FALSE
#> 5:        hf8cada17 indicator 0.08333 Women, Business and the Law    FALSE
#> 6:        h5e7b11fe indicator 0.08333 Women, Business and the Law    FALSE
#> 7:        haab6520b indicator 0.08333 Women, Business and the Law    FALSE
#> 8:        hdecd0041 indicator 0.08333 Women, Business and the Law    FALSE
#> 9:        hb9d0a754 indicator 0.05882 Women, Business and the Law    FALSE
get_data360(
  indicator_id = c(204, 205), 
  timeframes = c(2016), 
  output_type = 'long'
  ) -> df_marriage
ggplot(
  df_marriage, 
  aes(
    x = Observation, 
    group = Indicator, 
    fill = Indicator
    )
  ) +
    geom_bar() + 
  theme(
    legend.position = "none"
    ) + 
  facet_wrap(
    ~ Indicator, 
    ncol = 1
    ) + 
  labs(
    x = "Legal Age of Marriage", 
    y = "Frequency", 
    title = "Legal Age of Marriage for Boys vs. Girls", 
    subtitle = "(2016)", 
    caption = "Source: World Bank Data"
    )

6.1.7 World Bank data with {wbstats}

Another library, {wbstats}, does pretty much the same thing: “[It] allows researchers to quickly search and download the data of their particular interest in a programmatic and reproducible fashion; this facilitates a seamless integration into their workflow and allows analysis to be quickly rerun on different areas of interest and with real-time access to the latest available data.” Let us see the core functionality by loading the library and then seeing what is available in terms of indicators, topics, and so on. We can then fetch the most current version of the information cached in wb_cachelist by calling wb_cache() and storing the result in new_cache.

#install.packages("wbstats")
#remotes::install_github("nset-ornl/wbstats")
library(wbstats)
str(
  wb_cachelist, 
  max.level = 1
  )
#> List of 8
#>  $ countries    : tibble [304 × 18] (S3: tbl_df/tbl/data.frame)
#>  $ indicators   : tibble [16,649 × 8] (S3: tbl_df/tbl/data.frame)
#>  $ sources      : tibble [63 × 9] (S3: tbl_df/tbl/data.frame)
#>  $ topics       : tibble [21 × 3] (S3: tbl_df/tbl/data.frame)
#>  $ regions      : tibble [48 × 4] (S3: tbl_df/tbl/data.frame)
#>  $ income_levels: tibble [7 × 3] (S3: tbl_df/tbl/data.frame)
#>  $ lending_types: tibble [4 × 3] (S3: tbl_df/tbl/data.frame)
#>  $ languages    : tibble [23 × 3] (S3: tbl_df/tbl/data.frame)
wb_cache() -> new_cache

What indicators are available?

wb_search(
  pattern = "corruption"
  ) -> corruption_vars
head(corruption_vars)
#> # A tibble: 6 × 3
#>   indicator_id     indicator                      indicator_desc                    
#>   <chr>            <chr>                          <chr>                             
#> 1 CC.EST           Control of Corruption: Estima… "Control of Corruption captures p…
#> 2 CC.NO.SRC        Control of Corruption: Number… "Control of Corruption captures p…
#> 3 CC.PER.RNK       Control of Corruption: Percen… "Control of Corruption captures p…
#> 4 CC.PER.RNK.LOWER Control of Corruption: Percen… "Control of Corruption captures p…
#> 5 CC.PER.RNK.UPPER Control of Corruption: Percen… "Control of Corruption captures p…
#> 6 CC.STD.ERR       Control of Corruption: Standa… "Control of Corruption captures p…

If I want information from a particular source, say Bloomberg,

wb_search(
  pattern = "Bloomberg"
  ) -> blmbrg_vars
head(blmbrg_vars)
#> # A tibble: 1 × 3
#>   indicator_id indicator                             indicator_desc                 
#>   <chr>        <chr>                                 <chr>                          
#> 1 GFDD.OM.02   Stock market return (%, year-on-year) Stock market return is the gro…

Searching for indicators tied to multiple subjects is easy as well:

wb_search(
  pattern = "poverty | unemployment | employment"
  ) -> povemply_vars

Downloading data, once we identify what we want, is easy as well, needing us to specify just the indicator(s) and then the start and end dates. If you want the data for specific countries, that is easily done as well.

wb_data(
  indicator = "SP.POP.TOTL", 
  start_date = 1960, 
  end_date = 2019
  ) -> pop_data1 

wb_data(
  country = c(
    "ABW","AF", "SSF", "ECA", "IND", "CHN"
    ),
  indicator = "SP.POP.TOTL", 
  start_date = 1960, 
  end_date = 2019
  ) -> pop_data2 
wb_data(
  country = c(
    "ABW","AF", "SSF", "ECA", "IND", "CHN"
    ),
  indicator = c(
    "SP.POP.TOTL", "NY.GDP.MKTP.CD"
    ), 
  start_date = 1960, 
  end_date = 2019
  ) -> pop_data3 

By default the data will be returned in long format, but not necessarily in a tidy format. If you want the data returned in a wide format, specify return_wide = TRUE and you will have tidy data.

wb_data(
  country = c(
    "ABW","AF", "SSF", "ECA", "IND", "CHN"
    ),
    indicator = c(
      "SP.POP.TOTL", "NY.GDP.MKTP.CD"
      ), 
  start_date = 1960, 
  end_date = 2019, 
  return_wide = TRUE
  ) -> pop_data4

If you do not know the start/end dates for an indicator, use mrv = most recent value. Be careful though: If you use mrv and want the data for more than one country, it will find the most recent date of any country and then return data using that date as the most recent value. In other words, when grabbing data for more than one country, mrv does not search the most recent value for each country on a country-by-country basis.

wb_data(
  country = c("IND"), 
  indicator = 'EG.ELC.ACCS.ZS', 
  mrv = 20
  ) -> mrv_data 
head(mrv_data)
#> # A tibble: 6 × 9
#>   iso2c iso3c country  date EG.ELC.ACCS.ZS unit  obs_status footnote last_updated
#>   <chr> <chr> <chr>   <dbl>          <dbl> <chr> <chr>      <chr>    <date>      
#> 1 IN    IND   India    2000           59.3 <NA>  <NA>       <NA>     2021-12-16  
#> 2 IN    IND   India    2001           55.8 <NA>  <NA>       <NA>     2021-12-16  
#> 3 IN    IND   India    2002           62.3 <NA>  <NA>       <NA>     2021-12-16  
#> 4 IN    IND   India    2003           64.0 <NA>  <NA>       <NA>     2021-12-16  
#> 5 IN    IND   India    2004           64.4 <NA>  <NA>       <NA>     2021-12-16  
#> 6 IN    IND   India    2005           67.1 <NA>  <NA>       <NA>     2021-12-16

As you have no doubt noticed, data may be missing for some years. If you want these missing values to be replaced by the preceding value, use gapfill = TRUE.

wb_data(
  country = c("IND"), 
  indicator = 'EG.ELC.ACCS.ZS',
  mrv = 20, 
  gapfill = TRUE
  ) -> gapfill_data 
head(gapfill_data)
#> # A tibble: 6 × 9
#>   iso2c iso3c country  date EG.ELC.ACCS.ZS unit  obs_status footnote last_updated
#>   <chr> <chr> <chr>   <dbl>          <dbl> <chr> <chr>      <chr>    <date>      
#> 1 IN    IND   India    2001           55.8 <NA>  <NA>       <NA>     2021-12-16  
#> 2 IN    IND   India    2002           62.3 <NA>  <NA>       <NA>     2021-12-16  
#> 3 IN    IND   India    2003           64.0 <NA>  <NA>       <NA>     2021-12-16  
#> 4 IN    IND   India    2004           64.4 <NA>  <NA>       <NA>     2021-12-16  
#> 5 IN    IND   India    2005           67.1 <NA>  <NA>       <NA>     2021-12-16  
#> 6 IN    IND   India    2006           67.9 <NA>  <NA>       <NA>     2021-12-16

Note that now you have 2014 values being used to fill what would otherwise be missing data for 2015, 2016, and 2017. If you will be working with dates, whether for plotting or otherwise, activate the POSIXct = TRUE switch; otherwise you will have to convert the dates manually (a sketch of the manual conversion follows the plot below).

wb_data(
  indicator = c(
    "CRUDE_DUBAI", "CRUDE_BRENT", "CRUDE_WTI", "CRUDE_PETRO"
    ),
  mrv = 60, 
  freq = "M", 
  POSIXct = TRUE
  ) -> oil_data 
head(oil_data)
ggplot(
  oil_data, 
  aes(
    x = date_ct, 
    y = value, 
    group = indicatorID, 
    color = indicatorID
    )
  ) + 
  geom_line() + 
  scale_x_date(
    date_breaks = "9 months"
    ) + 
  labs(
    x = "Date", 
    y = "Price", 
    title = "Crude Oil Prices", 
    subtitle = "(in nominal US$ per bbl)", 
    caption = "Source: World Bank Data"
    ) + 
  theme(
    legend.position = "bottom"
    ) 
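If you skip POSIXct = TRUE, the conversion has to be done by hand. A minimal sketch for an annual series, using the mrv_data object from earlier (its date column holds the four-digit year, and the name date_ct simply mirrors the column the package itself creates):

# anchor each year to January 1 and convert to a Date
as.Date(
  paste0(mrv_data$date, "-01-01")
  ) -> mrv_data$date_ct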

6.2 USGS Data {dataRetrieval}

The {dataRetrieval} package gives you easy access to water data gathered and warehoused by the USGS, USDA, EPA, and other entities. The package has an excellent tutorial available here, so I will not go into too many details and nuances. Start by installing the package and then loading it.

# install.packages("dataRetrieval")
library(dataRetrieval)

You will usually start by identifying the site you want data for (for example, the Hocking River), the parameter (gage height, temperature, pH, and so on), and the statistic (mean, median, minimum, etc.). The full list of available parameters can be found here and the complete list of statistics is available here. Sites can be located from this inventory, or queried from within R as shown a little later. In fact, the package comes with a built-in data-set that shows you the available parameters, making things a bit easier.

parameterCdFile -> myparmfile
names(myparmfile)
#> [1] "parameter_cd"       "parameter_group_nm" "parameter_nm"      
#> [4] "casrn"              "srsname"            "parameter_units"

If you are curious about a specific parameter, you can see everything that is available for it. I’ll look for anything related to the keyword storm.

parameterCdFile[
  grep("storm",
       parameterCdFile$parameter_nm,
       ignore.case =  TRUE),
  ] -> stormq 
unique(
  stormq$parameter_units
  )
#> [1] "nu"      "hours"   "minutes" "Mgal"    "ft3/s"   "mgd"     "in"

The most basic data pull involves a few specifications, as shown below. Note that I am pulling data for the Hocking River at Athens, Ohio.

"03159500" -> siteNo
"00065" -> pCode
"2014-10-01" -> start.date
"2018-02-26" -> end.date
readNWISuv(
  siteNumbers = siteNo,
  parameterCd = pCode,
  startDate = start.date,
  endDate = end.date
  ) -> hocking 

Since the column names are cryptic (they are built from parameter codes), you can clean them up, and also inspect the other attributes embedded in the data-set.

names(hocking)
#> [1] "agency_cd"        "site_no"          "dateTime"         "X_00065_00000"   
#> [5] "X_00065_00000_cd" "tz_cd"
renameNWISColumns(
  hocking
  ) -> hocking 
names(hocking)
#> [1] "agency_cd"  "site_no"    "dateTime"   "GH_Inst"    "GH_Inst_cd" "tz_cd"
names(
  attributes(
    hocking
    )
  )
#> [1] "names"         "row.names"     "class"         "url"           "siteInfo"     
#> [6] "variableInfo"  "disclaimer"    "statisticInfo" "queryTime"

If I wanted data for a few sites rather than one, I could run:

c(
  "03158200", 
  "03159246"
  ) -> sites 
"00065" -> pcode 
"2014-10-01" -> start.date 
"2018-02-26" -> end.date 
readNWISuv(
  siteNumbers = sites,
  parameterCd = pcode,
  startDate = start.date,
  endDate = end.date
  ) -> hocking2 
renameNWISColumns(
  hocking2
  ) -> hocking2 

Now a simple time-series plot of gage height for both sites.

attr(
  hocking2, 
  "variableInfo"
  ) -> parameterInfo 
hocking2$station = ifelse(
  hocking2$site_no == "03158200", 
  "Monday Creek at Doanville", 
  "Sunday Creek Below Millfield"
  )
as.Date(
  as.character(
    hocking2$dateTime
    ), 
  format  = "%Y-%m-%d"
  ) -> hocking2$mydates
ggplot(
  data = hocking2, 
  aes(
    x = mydates, 
    y = GH_Inst, 
    color = station
    )
  ) + 
  geom_line() + 
  labs(
    x = "",
    y = parameterInfo$variableDescription
    ) + 
  theme(
    legend.position = "bottom"
    )  + 
  scale_x_date(
    date_breaks = "2 weeks"
    ) 

Be sure to check out Laura DeCicco’s excellent tutorial as well as the package vignette, and the other articles that go well beyond this light introduction to the package.

6.3 Working with googlesheets via {googlesheets4}

Google Sheets is a web-based application that allows teams to work collaboratively with a common data table, using it to carry out analyses and visualizations, and of course to update the sheet itself. It works on tablets and other mobile platforms as well, making it very versatile. Given its growing popularity, it was only a matter of time before someone built an R package to work with Google Sheets, and that someone turned out to be Jenny Bryan. In this section I’ll outline the basic features of her {googlesheets4} package.

library(googledrive)
library(googlesheets4)

Now we authenticate with a Google account via the browser window/tab that opens the first time you execute a command that touches your Drive. If you don’t have a Gmail account, get one before you go any further.
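If you prefer to trigger authentication explicitly rather than waiting for the first read/write call, a minimal sketch would be the following; the email address is, of course, a placeholder.

# open a browser window/tab for the OAuth flow; the email is a placeholder
gs4_auth(
  email = "you@gmail.com"
  )
# authenticate {googledrive} as well, since gs4_find() relies on it
drive_auth(
  email = "you@gmail.com"
  )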

What sheets do we have in our google drive?

gs4_find()
#> # A dribble: 15 × 3
#>   name                                    id                         drive_resource 
#>   <chr>                                   <drv_id>                   <list>         
#> 1 EBITE Coaching Cohort 1                 1km5f7Q8MBC0WIzX6H9Z0s__A… <named list [3…
#> 2 tinted-silverfox                        1m9jeXouE5MYr_KWqpQbSaRnr… <named list [3…
#> 3 moonish-takin                           1k9YudiVm2XQA8mzfONonzRLi… <named list [3…
#> 4 Coaching Intake Form (Responses)        1_s8mnyjUZWnuDq2Lznq0usCa… <named list [3…
#> 5 Competition_Results_Exports_UIPM_2021_… 1zqv-B6uE-xUsA7XDks8_v5h4… <named list [3…
#> 6 Detailed Biography DHS Ethiopia         1NGjAmMrVWPUqikSIiqadTVHu… <named list [3…
#> # … with 9 more rows

I want to check the Gapminder sheet, and I can do this by passing the full URL of the sheet (as shown first below) or just the sheet key. You could also locate it by name via {googledrive}; a sketch follows the code.

read_sheet(
  "https://docs.google.com/spreadsheets/d/1U6Cf_qEOhiR9AZqTqS3mbMF3zt2db48ZP5v3rkrAEJY/edit#gid=780868077"
  ) -> my.gs1
read_sheet(
  "1U6Cf_qEOhiR9AZqTqS3mbMF3zt2db48ZP5v3rkrAEJY"
  ) -> my.gs2

Now we can inspect the contents of the sheets we just read and start working with them.

str(my.gs1)
str(my.gs2)

Now some basic calculations and a plot, just to see that it works as it is supposed to.

my.gs1 %>% 
  group_by(
    country
    ) %>% 
  summarise(
    mean.gdp = mean(
      gdpPercap, 
      na.rm = TRUE
      )
    ) %>% 
  arrange(
    -mean.gdp
    ) -> tab.01

ggplot(
  my.gs1, 
  aes(
    x = lifeExp, 
    y = gdpPercap
    )
  ) + 
  geom_hex() +
  labs(
    x = "Life Expectancy (in years)",
    y = "Gross Domstic Product (per-capita)"
  )

What if we wanted to push something, say a dataframe we modified in R or the tab.01 we just created, to our Google Drive as a googlesheet?

sheet_write(
  tab.01,
  sheet = "sheet-1"
  )

Log in to your Google Drive folder and check whether the sheet is there, then double-click it to open it as a googlesheet. Voila, there it is!
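And if you later need to add rows to that sheet without overwriting what is already there, sheet_append() will do it. A minimal sketch, assuming more.rows is a hypothetical data frame with the same columns as tab.01, and that we capture the spreadsheet's identifier when we first write it:

# sheet_write() returns the spreadsheet id (invisibly), so capture it
sheet_write(
  tab.01,
  sheet = "sheet-1"
  ) -> my.ss
# append additional rows; more.rows is a hypothetical data frame
# with the same columns as tab.01
sheet_append(
  my.ss,
  more.rows,
  sheet = "sheet-1"
  )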

So what is the value-add of this package? Since many teams work with Google Sheets but not everyone is responsible for data analysis, this package allows you to work with sheets created by others, edit them, and even generate a new sheet that the other team members can then use. To me, that is valuable in itself. Jenny Bryan points out some other uses, such as gathering data securely via Google Forms, gathering data on your smartphone while in the field, and pretty much any other use of Google Sheets that you can conjure.

Of course, this is not all that the package does; it does a lot more! If you are interested in these other functions, be sure to check out this vignette: googlesheets4 Basic Usage.