6 Data Acquisition with APIs and Other Means
One of the first things we saw in this text was the many ways in which we can read locally sourced data (from our own computer, in particular) as well as data files hosted on the web. But the real fun begins when one has access to application programming interfaces (APIs) – a piece of software that allows two applications to communicate. The number of APIs available for free has grown exponentially in the last decade. Today, you can use an API to tap data from the United States Bureau of the Census, the Bureau of Labor Statistics, the Bureau of Economic Analysis, the United Nations, the World Bank, the United States Geological Survey, the Centers for Disease Control and Prevention, the Federal Reserve, Yahoo Finance, Spotify, Twitter, NOAA Climate Data, and many more sources. Why is this a good thing? Because you do not need to manually download and upload the files that your code relies upon. Indeed, as and when old data are updated or new data become available, your code has access to them. This makes our lives easier because we only need to focus on keeping our code current and functional – the data acquisition task has been automated! But, of course, we cannot leverage this development in APIs until we understand how to use them. Consequently, in this chapter we work our way towards exploring APIs, starting with a look at some large files hosted on the web.
6.0.1 Census Data
Let us download a specific file – the Voting Age Population by Citizenship and Race (CVAP) data compiled from the American Community Survey (ACS) 5-year (2015-2019) collection available here. The SAS version of the data file can be downloaded from here. Documentation for these data is available here.
Let us start by downloading and reading in the data.
tempfile() -> temp

download.file(
  "https://www2.census.gov/programs-surveys/decennial/rdo/datasets/2019/2019-cvap/CVAP_2015-2019_ACS_sas_files.zip",
  destfile = temp
)

haven::read_sas(
  unz(
    temp, "place.sas7bdat"
  )
) -> placedata

unlink(temp)
rm(temp)
names(placedata)
#> [1] "geoname" "lntitle" "geoid" "lnnumber" "tot_est" "tot_moe" "adu_est"
#> [8] "adu_moe" "cit_est" "cit_moe" "cvap_est" "cvap_moe"
Well, what are the contents? Each row is what the Census Bureau calls a “place” (see geoname and geoid), and for each place, the Bureau gives us the following estimates:
Variable | Description |
---|---|
tot_est | The rounded estimate of the total number of people for that geographic area and group. (Not available for tracts or block groups.) |
tot_moe | The margin of error for the total number of people for that geographic area and group. (Not available for tracts or block groups.) |
adu_est | The rounded estimate of the total number of people 18 years of age or older for that geographic area and group. (Not available for tracts or block groups.) |
adu_moe | The margin of error for the total number of people 18 years of age or older for that geographic area and group. (Not available for tracts or block groups.) |
cit_est | The rounded estimate of the total number of United States citizens for that geographic area and group. |
cit_moe | The margin of error for the total number of United States citizens for that geographic area and group. |
cvap_est | The rounded estimate of the total number of United States citizens 18 years of age or older for that geographic area and group. |
cvap_moe | The margin of error for the total number of United States citizens 18 years of age or older for that geographic area and group. |
The rows represent the following groups, marked by lntitle and lnnumber, respectively:
lntitle | lnnumber |
---|---|
Total | 1 |
Not Hispanic or Latino | 2 |
American Indian or Alaska Native Alone | 3 |
Asian Alone | 4 |
Black or African American Alone | 5 |
Native Hawaiian or Other Pacific Islander Alone | 6 |
White Alone | 7 |
American Indian or Alaska Native and White | 8 |
Asian and White | 9 |
Black or African American and White | 10 |
American Indian or Alaska Native and Black or African American | 11 |
Remainder of Two or More Race Responses | 12 |
Hispanic or Latino | 13 |
Well, this is all fine and dandy but what if I want to tidy these data? You certainly can, and without too much trouble either. How? See below, but before you walk through the code, understand what we need to do.
First, each of the lntitles should be a column such that each place really has only ONE row of data in the tidy version of the file. The lnnumber column is of little use since we have lntitle, and hence we can drop lnnumber without losing anything of value.
library(tidyverse)

placedata %>%
  select(
    -lnnumber
  ) %>%
  group_by(
    geoid, geoname
  ) %>%
  pivot_wider(
    names_from = lntitle,
    values_from = c(
      tot_est, tot_moe, adu_est, adu_moe,
      cit_est, cit_moe, cvap_est, cvap_moe
    )
  ) %>%
  janitor::clean_names() -> placedata_tidy
You should end up with a data file that has 29,573 rows and 106 columns. Now, this is too many columns, and surely we are not interested in all of them. Rather, we may only be interested in the total population, the total number of citizens of voting age, and then of course the percentage of the total number of citizens of voting age who are non-Hispanic White, Hispanic, non-Hispanic Asian, non-Hispanic Black/African-American, and so on. Let us trim the tidied file in line with this ultimate goal.
placedata_tidy %>%
  group_by(geoid, geoname, tot_est_total, cvap_est_total) %>%
  summarize(
    pct_cvap = (cvap_est_total / tot_est_total),
    pct_cvap_nhwhite = (cit_est_white_alone / cit_est_total),
    pct_cvap_nhblack = (cvap_est_black_or_african_american_alone / cit_est_total),
    pct_cvap_nhaian = (cvap_est_american_indian_or_alaska_native_alone / cit_est_total),
    pct_cvap_nhasian = (cvap_est_asian_alone / cit_est_total),
    pct_cvap_nhnhopi = (cvap_est_native_hawaiian_or_other_pacific_islander_alone / cit_est_total),
    pct_hispanic = (cvap_est_hispanic_or_latino / cvap_est_total)
  ) -> place.cvap.a
Of course, the variables I have called pct_* are proportions rather than percentages, but that is by design. You will see why when we start mapping or otherwise visualizing these data. Now, you will see various cells with NaN – what R calls a value that is ‘not a number’ because it could not be calculated. Why not? If the denominator is 0, you will encounter this issue. So one fix would be to keep only those rows where tot_est_total > 0 and cvap_est_total > 0 before doing anything else.
placedata_tidy %>%
  filter(tot_est_total > 0, cvap_est_total > 0) %>%
  group_by(geoid, geoname, tot_est_total, cvap_est_total) %>%
  summarize(
    pct_cvap = (cvap_est_total / tot_est_total),
    pct_cvap_nhwhite = (cit_est_white_alone / cit_est_total),
    pct_cvap_nhblack = (cvap_est_black_or_african_american_alone / cit_est_total),
    pct_cvap_nhaian = (cvap_est_american_indian_or_alaska_native_alone / cit_est_total),
    pct_cvap_nhasian = (cvap_est_asian_alone / cit_est_total),
    pct_cvap_nhnhopi = (cvap_est_native_hawaiian_or_other_pacific_islander_alone / cit_est_total),
    pct_hispanic = (cvap_est_hispanic_or_latino / cvap_est_total)
  ) -> place.cvap.b
Now, you might be wondering why the focus on redistricting data at the expense of all the other data products the Census Bureau compiles. Good question, but easy answer: Later on we will be looking at specific packages designed to pull a wide variety of Census products. For now we look at a few important but funky data sources.
6.0.2 CDC’s BRFSS Data
The Behavioral Risk Factor Surveillance System (BRFSS) is the nation’s premier system of health-related telephone surveys that collect state data about U.S. residents regarding their health-related risk behaviors, chronic health conditions, and use of preventive services. Established in 1984 with 15 states, BRFSS now collects data in all 50 states as well as the District of Columbia and three U.S. territories. BRFSS completes more than 400,000 adult interviews each year, making it the largest continuously conducted health survey system in the world.
The 2020 BRFSS survey data can be downloaded in ASCII format from here while the codebook is here and the variable layout is here.
Note: ASCII (pronounced ask-ee) stands for the American Standard Code for Information Interchange and is a code for representing English characters as numbers, with each letter assigned a number from 0 to 127. “For example, the ASCII code for uppercase M is 77. Most computers use ASCII codes to represent text, which makes it possible to transfer data from one computer to another.” Source:
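As a quick aside, you can verify this mapping in R itself with the built-in character-conversion functions; nothing below is needed for the BRFSS work that follows.

utf8ToInt("M")         # 77, the ASCII code for uppercase M
intToUtf8(c(72, 105))  # "Hi" -- converting numeric codes back into characters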
If you download the data and look at the variable layout here you notice that every column has specific starting and ending positions. If we choose to read the downloaded ASCII data we will have to pass the variable layout information to R. I’ll show you how to do this, but let us take a shortcut via the other formats CDC so kindly makes available for download, namely the SAS Transport Format found here.
While 2020 is the most recent data available as I write, download and extract two 2016 files and put both in your working directory.
Extract the SAS version of the data-set and put it in your working directory. Then execute the following code, making sure to add the path to the directory where you stored the extracted file.
library(foreign)

read.xport(
  "data/LLCP2016.XPT"
) -> brfss2016a
When the code executes you should see brfss2016a in the Global Environment window of RStudio. This was fairly easy, but what if we were given the raw ASCII data? Now we will need the variable layout information, and we will have to choose a library to work with. Two I use are {readr} (my preferred library) and {LaF} (a close second). Both are shown in action below.
Note: The original column layout specifications had to be modified because the third variable listed online (IDATE) subsumes the three that follow. As such IDATE was deleted from the specifications below.
The {readr} code specifies the start and end of the columns, as well as the column names.
fwf_positions_begin = c(1, 16, 18, 20, 22, 31, 35, 45, 62, 63, 64, 65, 67, 68, 70, 72, 80, 81, 83, 85, 87, 88, 89, 90, 91, 92, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 108, 109, 146, 147, 148, 150, 151, 152, 154, 158, 170, 171, 172, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 192, 193, 196, 198, 200, 202, 203, 209, 210, 211, 213, 215, 216, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 243, 255, 256, 257, 258, 261, 264, 266, 268, 270, 271, 272, 273, 275, 277, 279, 281, 282, 284, 285, 310, 311, 312, 314, 315, 316, 317, 318, 319, 320, 321, 322, 323, 326, 327, 329, 330, 332, 334, 336, 339, 340, 341, 342, 343, 345, 346, 347, 348, 349, 351, 352, 354, 356, 357, 359, 360, 361, 362, 363, 364, 365, 366, 367, 368, 369, 370, 371, 372, 573, 574, 575, 576, 578, 579, 581, 582, 583, 584, 590, 591, 595, 623, 625, 626, 627, 628, 629, 645, 647, 1356, 1357, 1363, 1393, 1403, 1417, 1419, 1421, 1425, 1426, 1427, 1430, 1473, 1612, 1669, 1671, 1673, 1674, 1675, 1711, 1907, 1910, 1913, 1916, 2155, 2156, 2157, 2158, 2159, 2160, 2161, 2162, 2163, 2164, 2221, 2223, 2227, 2228, 2229, 2230, 2231, 2232, 2234, 2235, 2236, 2239, 2242, 2247, 2251, 2252, 2253, 2254, 2255, 2256, 2257, 2258, 2259, 2262, 2263, 2267, 2271, 2272, 2273, 2274, 2275, 2276, 2277, 2278, 2279, 2281, 2283, 2284, 2286, 2292)

width = c(2, 2, 2, 2, 4, 4, 10, 10, 1, 1, 1, 1, 1, 2, 2, 2, 1, 2, 2, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 2, 4, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 3, 2, 2, 2, 1, 6, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 6, 2, 1, 1, 1, 3, 3, 2, 2, 2, 1, 1, 1, 2, 2, 2, 2, 1, 2, 1, 25, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 2, 1, 2, 2, 2, 3, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 6, 1, 4, 28, 2, 1, 1, 1, 1, 1, 2, 2, 1, 6, 10, 10, 10, 2, 2, 1, 1, 1, 1, 10, 10, 1, 2, 2, 1, 1, 1, 10, 3, 3, 3, 10, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 2, 1, 1, 3, 3, 5, 4, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 4, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)

fwf_positions_end = fwf_positions_begin + (width - 1)

col_names = c("STATE", "FMONTH", "IMONTH", "IDAY", "IYEAR", "DISPCODE", "SEQNO", "_PSU", "CTELENUM", "PVTRESD1", "COLGHOUS", "STATERES", "LADULT", "NUMADULT", "NUMMEN", "NUMWOMEN", "GENHLTH", "PHYSHLTH", "MENTHLTH", "POORHLTH", "HLTHPLN1", "PERSDOC2", "MEDCOST", "CHECKUP1", "EXERANY2", "SLEPTIM1", "CVDINFR4", "CVDCRHD4", "CVDSTRK3", "ASTHMA3", "ASTHNOW", "CHCSCNCR", "CHCOCNCR", "CHCCOPD1", "HAVARTH3", "ADDEPEV2", "CHCKIDNY", "DIABETE3", "DIABAGE2", "LASTDEN3", "RMVTETH3", "VETERAN3", "MARITAL", "CHILDREN", "EDUCA", "EMPLOY1", "INCOME2", "WEIGHT2", "HEIGHT3", "NUMHHOL2", "NUMPHON2", "CPDEMO1", "INTERNET", "RENTHOM1", "SEX", "PREGNANT", "QLACTLM2", "USEEQUIP", "BLIND", "DECIDE", "DIFFWALK", "DIFFDRES", "DIFFALON", "SMOKE100", "SMOKDAY2", "STOPSMK2", "LASTSMK2", "USENOW3", "ALCDAY5", "AVEDRNK2", "DRNK3GE5", "MAXDRNKS", "FLUSHOT6", "FLSHTMY2", "PNEUVAC3", "SHINGLE2", "FALL12MN", "FALLINJ2", "SEATBELT", "DRNKDRI2", "HADMAM", "HOWLONG", "PROFEXAM", "LENGEXAM", "HADPAP2", "LASTPAP2", "HADHYST2", "PCPSAAD2", "PCPSADI1", "PCPSARE1", "PSATEST1", "PSATIME", "PCPSARS1", "BLDSTOOL", "LSTBLDS3", "HADSIGM3", "HADSGCO1", "LASTSIG3", "HIVTST6", "HIVTSTD3", "WHRTST10", "PDIABTST", "PREDIAB1", "INSULIN", "BLDSUGAR", "FEETCHK2", "DOCTDIAB", "CHKHEMO3", "FEETCHK", "EYEEXAM", "DIABEYE", "DIABEDU", "PAINACT2", "QLMENTL2", "QLSTRES2", "QLHLTH2", "MEDICARE", "HLTHCVR1", "DELAYMED", "DLYOTHER", "NOCOV121", "LSTCOVRG", "DRVISITS", "MEDSCOST", "CARERCVD", "MEDBILL1", "ASBIALCH", "ASBIDRNK", "ASBIBING", "ASBIADVC", "ASBIRDUC", "WTCHSALT", "LONGWTCH", "DRADVISE", "ASTHMAGE", "ASATTACK", "ASERVIST", "ASDRVIST", "ASRCHKUP", "ASACTLIM", "ASYMPTOM", "ASNOSLEP", "ASTHMED3", "ASINHALR", "IMFVPLAC", "TETANUS", "HPVTEST", "HPLSTTST", "HPVADVC2", "HPVADSHT", "CNCRDIFF", "CNCRAGE", "CNCRTYP1", "CSRVTRT1", "CSRVDOC1", "CSRVSUM", "CSRVRTRN", "CSRVINST", "CSRVINSR", "CSRVDEIN", "CSRVCLIN", "CSRVPAIN", "CSRVCTL1", "RRCLASS2", "RRCOGNT2", "RRATWRK2", "RRHCARE3", "RRPHYSM2", "RREMTSM2", "SCNTMNY1", "SCNTMEL1", "SCNTPAID", "SCNTWRK1", "SCNTLPAD", "SCNTLWK1", "SCNTVOT1", "SXORIENT", "TRNSGNDR", "RCSBIRTH", "RCSGENDR", "RCHISLA1", "RCSRACE1", "RCSBRAC1", "RCSRLTN2", "CASTHDX2", "CASTHNO2", "EMTSUPRT", "LSATISFY", "QSTVER", "QSTLANG", "MSCODE", "_STSTR", "_STRWT", "_RAWRAKE", "_WT2RAKE", "_AGE80", "_IMPRACE", "_IMPNPH", "_IMPEDUC", "_IMPMRTL", "_IMPHOME", "_LANDWT2", "_LANDWT", "_CHISPNC", "_CPRACE", "_CRACE1", "_IMPCAGE", "_IMPCRAC", "_IMPCSEX", "_CLANDWT", "NPHH", "NAHH", "NCHH", "_HHOLDWT", "_RFHLTH", "_HCVU651", "_TOTINDA", "_LTASTH1", "_CASTHM1", "_ASTHMS1", "_DRDXAR1", "_EXTETH2", "_ALTETH2", "_DENVST2", "_PRACE1", "_MRACE1", "_HISPANC", "_RACE", "_RACEG21", "_RACEGR3", "_RACE_G1", "_AGEG5YR", "_AGE65YR", "_AGE_G", "HTIN4", "HTM4", "WTKG3", "_BMI5", "_BMI5CAT", "_RFBMI5", "_CHLDCNT", "_EDUCAG", "_INCOMG", "_SMOKER3", "_RFSMOK3", "DRNKANY5", "DROCDY3_", "_RFBING5", "_DRNKDY4", "_DRNKMO4", "_RFDRHV4", "_RFDRMN4", "_RFDRWM4", "_FLSHOT6", "_PNEUMO2", "_RFSEAT2", "_RFSEAT3", "_RFMAM2Y", "_MAM502Y", "_RFPAP32", "_RFPSA21", "_RFBLDS2", "_RFSIGM2", "_AIDTST3")

library(readr)
read_fwf(
  "data/LLCP2016.ASC ",
  fwf_positions(
    fwf_positions_begin,
    fwf_positions_end,
    col_names
  )
) -> brfss2016b

brfss2016b %>%
  filter(
    STATE == "39"
  ) -> brfss2016b # Subsetting Ohio data

save(
  brfss2016b, file = "data/brfss2016b.RData"
)
Note that we had to point out that there is an extra white-space after the file extension .ASC.
Next you see the same data being read with {LaF}.
library(LaF)

laf_open_fwf(
  "data/LLCP2016.ASC ",
  column_widths = width,
  column_types = rep(
    "character",
    length(width)
  ),
  column_names = col_names
) -> laf

data.frame(
  laf[, ]
) -> brfss2016c

save(
  brfss2016c, file = "data/brfss2016c.RData"
)
Notice that with {LaF} we had to specify the column widths via column_widths = width and also the column headings via column_names = col_names. In addition, note that we again had to point out that there is an extra white-space after the file extension .ASC.
6.0.2.1 Changing Variable Formats
When these files are read, some of the variables will come in as characters when in fact they are numeric (see STATE, for example) or factors (see PHYSHLTH, for example). Others will be read as integers (see GENHLTH) and so on. Consequently, we often have to reset the read formats. Further, we do know from the codebook that missing values and other responses (such as refused) carry numeric codes in the survey. For instance, GENHLTH is supposed to have the following responses mapped to the numeric values stored in the variable:
Code | Label |
---|---|
1 | Excellent |
2 | Very Good |
3 | Good |
4 | Fair |
5 | Poor |
7 | Don’t Know/Not Sure |
9 | Refused |
BLANK | Not asked or Missing |
We can take this mapping and fix GENHLTH by creating a new variable, say general.health as a factor.
load("data/brfss2016b.RData")

factor(
  brfss2016b$GENHLTH,
  levels = c(1, 2, 3, 4, 5),
  labels = c(
    "Excellent", "Very Good", "Good", "Fair", "Poor"
  )
) -> brfss2016b$general.health

table(
  brfss2016b$general.health
)
#>
#> Excellent Very Good Good Fair Poor
#> 0 0 0 0 0
Let us also flip STATE into a numeric variable, and format FMONTH, IMONTH, IDAY, IYEAR as dates. For dates the lubridate package comes in very handy.
as.numeric(
  brfss2016b$STATE
) -> brfss2016b$STATE
We’ll combine IMONTH, IDAY and IYEAR to create a date variable.
with(
  brfss2016b,
  sprintf(
    '%s-%s-%s',
    IMONTH, IDAY, IYEAR
  )
) -> brfss2016b$date

head(brfss2016b$date)
#> character(0)
library(lubridate)
brfss2016b$Date = mdy(brfss2016b$date)
is.Date(brfss2016b$Date)
#> [1] TRUE
str(brfss2016b$Date)
#> 'Date' num(0)
Note how we built date … we first concatenated the three columns and then used lubridate to format date via the mdy() specification. I then check to make sure the format is correct via the is.Date() command. This could also have been done via the str() command. Now that Date has been created we could drop the original three columns, but we will let that be for now.
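If you did want to drop those columns, a minimal sketch using dplyr might look like the following (brfss2016b_trimmed is just an illustrative name):

library(dplyr)
brfss2016b %>%
  select(-c(IMONTH, IDAY, IYEAR)) -> brfss2016b_trimmed  # you could also drop the intermediate date column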
6.0.3 BEA and BLS Data
The Bureau of Labor Statistics (BLS) compiles a lot of data, including some under the Local Area Unemployment Statistics (LAUS) program – a Federal-State cooperative effort in which monthly estimates of total employment and unemployment are prepared for approximately 7,300 areas:
- Census regions and divisions
- States
- Metropolitan Statistical Areas and Metropolitan NECTAS (New England City and Town Areas)
- Metropolitan Divisions and NECTA Divisions
- Micropolitan Statistical Areas and Micropolitan NECTAs
- Combined Metropolitan Statistical Areas and Combined NECTAs
- Small Labor Market Areas
- Counties and county equivalents
- Cities of 25,000 population or more
- Cities and towns in New England regardless of population
These estimates are key indicators of local economic conditions. The Bureau of Labor Statistics (BLS) of the U.S. Department of Labor is responsible for the concepts, definitions, technical procedures, validation, and publication of the estimates that State employment security agencies prepare under agreement with BLS. Data available at BLS can be found here.
Similarly, the Bureau of Economic Analysis (BEA) has a regional economic accounts program that details the geographic distribution of U.S. economic activity and growth. “The estimates of gross domestic product by state and state and local area personal income, and the accompanying detail, provide a consistent framework for analyzing and comparing individual state and local area economies.”
BEA has a nice site here that allows you to interactively select the data you want, or you can grab the data from their download section.
The BEA data read below were obtained from here, while the BLS data read below were obtained from here and here, respectively.
read.csv(
  here::here(
    "data",
    "CA1_1969_2016_OH.csv"
  )
) -> df.bea

read.csv(
  "http://www.bls.gov/lau/laucnty16.txt",
  header = FALSE,
  skip = 5,
  sep = ""
) -> df.bls1

colnames(df.bls1) = c(
  "LAUS Code", "StateFIPSCode", "CountyFIPSCode", "CountyName",
  "StateAbbreviation", "Year", "TotalLaborForce", "TotalEmployed",
  "TotalUnemployed", "UnemploymentRate"
)

library(readxl)
library(httr)

url = "https://www.bls.gov/lau/laucnty16.xlsx"

GET(
  url, write_disk(tf <- tempfile(fileext = ".xlsx"))
)
#> Response [https://www.bls.gov/lau/laucnty16.xlsx]
#> Date: 2022-01-14 00:54
#> Status: 200
#> Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
#> Size: 246 kB
#> <ON DISK> /var/folders/qh/6q39v0755_54rxmbl8m5ttnwy0twd7/T//RtmpPMvrrc/file921e4f4e7dd.xlsx
read_excel(
  tf, skip = 5,
  na = "",
  col_names = c(
    "LAUS Code", "StateFIPSCode", "CountyFIPSCode",
    "CountyName", "Year", "Blank_Column", "TotalLaborForce",
    "TotalEmployed", "TotalUnemployed", "UnemploymentRate"
  )
) -> df.bls2

read.csv(
  "http://www.bls.gov/cew/data/api/2015/2/industry/10.csv",
  sep = ",",
  header = TRUE
) -> df.bls3
Note that df.bls1 uses the .txt data and df.bls2 uses the .xlsx data. Note too that the Excel file has a blank row that we will have to delete. Also, look at df.bls1 and note that the data have some issues because of the way the txt file is prepared. Consequently, we want to rely on the Excel data for now. If we wanted to cleanly read the txt file we would have to do some more legwork up front.
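As for the Excel version, a minimal cleanup sketch might drop the placeholder Blank_Column defined in the layout above and any rows that came in empty (df.bls2_clean is an illustrative name; check the result against the BLS file before relying on it):

library(dplyr)
df.bls2 %>%
  select(-Blank_Column) %>%                      # drop the empty placeholder column from the layout
  filter(!is.na(CountyName)) -> df.bls2_clean    # drop any rows that read in blank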
6.1 Using Application Programming Interfaces (APIs)
Many organizations that gather data have started offering free APIs – essentially a setup whereby if you register for their API key, you can interact with their servers to query and download data. This is especially useful because it saves time and improves efficiency all around. Moreover, an increasing number of R packages are using APIs to communicate with data servers. Before you can use their API, you have to register for an API key by setting up an account, usually free.
- U.S. Census Bureau
- U.S. Bureau of Economic Analysis
- U.S. Bureau of Labor Statistics
- The World Bank
- The National Oceanic and Atmospheric Administration
- google maps
- Quandl
- Data.Gov
6.1.1 The Bureau of Economic Analysis (BEA) API
Let us start with the Bureau of Economic Analysis’s API. What does the BEA do? Per the agency, “The Bureau of Economic Analysis (BEA) promotes a better understanding of the U.S. economy by providing the most timely, relevant, and accurate economic accounts data in an objective and cost-effective manner.”
The three BEA product streams of interest to most people tend to be the (i) National series (BEA’s national economic statistics provide a comprehensive view of U.S. production, consumption, investment, exports and imports, and income and saving. These statistics are best known by summary measures such as gross domestic product (GDP), corporate profits, personal income and spending, and personal saving.), (ii) the Regional series (The regional economic accounts tell us about the geographic distribution of U.S. economic activity and growth. The estimates of gross domestic product by state and state and local area personal income, and the accompanying detail, provide a consistent framework for analyzing and comparing individual state and local area economies.), and (iii) the Industry Economic Accounts (The industry economic accounts, presented both in an input-output framework and as annual output by each industry, provide a detailed view of the interrelationships between U.S. producers and users and the contribution to production across industries. These accounts are used extensively by policymakers and businesses to understand industry interactions, productivity trends, and the changing structure of the U.S. economy.)
To start, sign up for a BEA key here and once you get it (check your email), you should set it via mybeaKey = "xyz1s3d4g6hblahblahblahblah". Now install and load the library.
library(bea.R)
There are a few primary functions currently offered: you can either search for data via a keyword, or you can get specific data once you have identified the table ID and other details. Let us start by seeing what data sets are available.
beaSets(mybeaKey)
#> $Dataset
#> DatasetName DatasetDescription
#> 1 NIPA Standard NIPA tables
#> 2 NIUnderlyingDetail Standard NI underlying detail tables
#> 3 MNE Multinational Enterprises
#> 4 FixedAssets Standard Fixed Assets tables
#> 5 ITA International Transactions Accounts
#> 6 IIP International Investment Position
#> 7 InputOutput Input-Output Data
#> 8 IntlServTrade International Services Trade
#> 9 GDPbyIndustry GDP by Industry
#> 10 Regional Regional data sets
#> 11 UnderlyingGDPbyIndustry Underlying GDP by Industry
#> 12 APIDatasetMetaData Metadata about other API datasets
#>
#> attr(,"params")
#> ParameterName ParameterValue
#> 1 RESULTFORMAT JSON
#> 2 METHOD GETDATASETLIST
#> 3 USERID A8CF5A83-A7A6-42A5-8D45-B1E94079E37C
The beaSearch() function is limited to returning details of national and regional data, so please bear that in mind. We start with a simple search for all data on wages and salaries.
beaSearch(
  "Wages and Salaries",
  beaKey = mybeaKey,
  asHtml = FALSE
) -> bea.search.01
Here is the resulting table, and you should look through the series to get a sense of the contents. We will then go in and pick a series or two to download.
The beaGet() function will grab the data you specify, as shown below.
list(
  'UserID' = mybeaKey,
  'Method' = 'GetData',
  'DatasetName' = 'NIPA',
  'TableName' = 'T11000',
  'LineNumber' = "3",
  'Frequency' = 'Q',
  'Year' = '2010,2012,2013,2014,2015,2016,2017,2018',
  'ResultFormat' = 'json'
) -> mybea_call

beaGet(
  mybea_call,
  asWide = FALSE
) -> bea01
You can control whether data are returned in a long format with the asWide = FALSE switch. If you set it to TRUE you will get the data in wide format, necessitating additional work to whip it into shape for analysis.
To return a format in which each column represents a series, set iTableStyle = FALSE. This returns columns named with a concatenation of the descriptive column values, whereas rows are populated with numeric data values for each time period, and has the last column, named “TimePeriod”, filled with dates (see below).
beaGet(
  mybea_call,
  iTableStyle = FALSE
) -> bea02
Note that asWide = TRUE and iTableStyle = TRUE are set as the defaults. My preference would be to work with bea01.
What about some data from the Regional series? Let us see what is available for personal income, and then pick both a series and sub-national level (state, MSA, county, etc.).
beaSearch(
  "Personal Income",
  beaKey = mybeaKey,
  asHtml = FALSE
) -> bea.search.02
Pay attention to the warning message our call threw: “In beaSearch("Personal Income", beaKey = mybeaKey, asHtml = FALSE): Regional metadata is missing from /Library/Frameworks/R.framework/Versions/4.1/Resources/library/beaR/data and may be locked for updating on the BEA API; searching national metadata only.”
This is not a surprise if you pay attention to the technical documentation the BEA releases. Specifically, the BEA has announced that “The RegionalData, RegionalIncome, and RegionalProduct data sets are obsolete and API calls using these data sets will no longer function. Users should instead access the dataset Regional, which provides comprehensive detail in statistics, industries, geography, and years. See Appendix N for instructions on how to use the Regional data set. In addition, BEA continues to add new datasets as appropriate, and while any new datasets will be announced on the BEA’s website Developers page (https://apps.bea.gov/developers), we also encourage users to periodically call the above GETDATASETLIST discover new datasets.”
Well, so what now? Let us hack the call a bit to figure out what to ask for. The easy way to do this is to use the BEA’s interactive data interface. I used the Regional data link to pick per capita personal income by county, and this turns out to be the CAINC1 series. At the time of writing this, the latest data available are for 2020.
Now, you could also rely on some helper functions in {bea.R} to get the series you are curious about. These helper functions are used below with this particular example – chasing CAINC1 and extracting the key pieces we will need in our API call. First I will look for the TableName (although the web interface told us this was CAINC1, why not see what else we could get, eh?).
library(tidyverse)

beaParamVals(
  mybeaKey, 'Regional',
  "TableName"
) %>%
  as.data.frame(.) -> see01
This gives us 99 rows of data with one TableName and its corresponding description per row.
Now that we have seen the various tables available to us, let us look at the various LineCode values in each and what these linecodes represent.
beaParamVals(
  mybeaKey, 'Regional',
  "LineCode"
) %>%
  as.data.frame() -> see02
If you filter the result down to CAINC1 you will see two rows with LineCode = 3. What does each refer to? No clue, and the documentation does not help answer it either. Regardless, let us roll the dice and issue a call for the data.
list(
  'UserID' = mybeaKey,
  'Method' = 'GetData',
  'DatasetName' = 'Regional',
  'TableName' = 'CAINC1',
  'LineCode' = 3,
  'Year' = 'LAST10',
  'Frequency' = "A",
  'GeoFIPS' = "COUNTY",
  'ResultFormat' = 'json'
) -> mybea_call

beaGet(
  mybea_call,
  asWide = FALSE
) -> see03
I asked for the last ten years of data via LAST10 and could have also done this in long-hand by spelling out each year. I believe the other option available might be LAST5, to get the data for the last 5 years. Note also that I added the Frequency = "A" parameter in case the series is available with Annual (A), Quarterly (Q), or Monthly (M) frequency and you know what frequency you want.
Since the Regional API is deprecated and will not be live for much longer, I suggest you also consider pulling the raw data from the BEA and then processing it in R.
To go the manual route, start here and pick the account you want data for. Then use the dropdown menu options to select the series and Download … you will end up with a zip file that you will then have to unpack and read into R. For example, if I go for CAINC1 I get a 3.4MB file CAINC.zip that, when unpacked, gives me state-specific *.csv files and one all-areas file, each set for the 1969-2020 period.
readr::read_csv(
  here::here(
    "data/CAINC1",
    "CAINC1__ALL_AREAS_1969_2020.csv"
  )
) -> cainc1
Note the extra underscore in the filename; wonders never cease!
The data are in wide format, with each year a column. But this is a negligible penalty to pay for getting everything in the CAINC1 series in one swoop! Of course you would have to pick the series you want and then pivot the data into the long format. Let me do this for per-capita personal income.
cainc1 %>%
  filter(LineCode == 3) %>%
  select(
    1, 2, 9:60
  ) %>%
  group_by(
    GeoFIPS, GeoName
  ) %>%
  pivot_longer(
    names_to = "year",
    values_to = "pcpi",
    3:54,
    values_transform = list(
      year = as.integer,
      pcpi = as.integer
    )
  ) -> cainc1_long
I made sure to convert both year and pcpi to integers. Note also that the resulting file will include national and state values as well, but that is not a negative result at all. Why? Because we can use these to compare counties to state/national values, as a sort of a comparative benchmark.
We dropped LineCode = 1 and 2 but they will be needed should you want to create a weighted per-capita measure for some higher geography, for example, the Appalachian counties versus the non-Appalachian counties in state X or in the nation. The number of times I have seen small business development and related folks think they can generate an Appalachian estimate by calculating the mean per-capita income for all relevant counties via a simple unweighted mean calculation is shocking.
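To make the point concrete, here is a sketch of a population-weighted calculation, assuming a hypothetical data frame county_df with one row per county and columns pcpi (per-capita personal income), population, and region (all illustrative names, not columns produced by the code above):

library(dplyr)
county_df %>%
  group_by(region) %>%
  summarize(
    pcpi_weighted = weighted.mean(pcpi, w = population, na.rm = TRUE),  # population-weighted average
    pcpi_naive = mean(pcpi, na.rm = TRUE)                               # the (misleading) unweighted mean
  ) -> regional_pcpi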
6.1.2 The Bureau of Labor Statistics (BLS) API
Make sure you sign up for a BLS key here and save the key somewhere that you’ll remember. Then install three packages, {blsAPI}, {blsR} and {blscrapeR}, since although the BLS documents blsAPI, the other two could be useful.
One of the challenges in working with the BLS is that you need to root around before you can find what you are looking for. A good starting point tends to be this page. Say you are here and want to look at unemployment. If you click Unemployment you see a few options, one of which is Local Area Unemployment Statistics (LAUS). Click this and then click on the Tables icon, and you will see the various data files available to you.
But this is not the API-way. For that, we need to identify the series ID that maps on to local area unemployment. For that you will have to access the BLS page here. Click Local Area Unemployment Statistics and you will find the Series ID.
library(blsR)
library(rjson)
library(blscrapeR)
library(blsAPI)
bls_api(
  c("LAUCN040010000000003"),
  startyear = 2008,
  endyear = 2021,
  Sys.getenv("BLS_KEY")
) %>%
  dateCast() -> bls01
A much better way to go about this would be to get the master list of series the BLS API points to.
series_ids -> seriesdf
You will end up with a data-frame that has 88,815 rows and 4 columns, each row a unique series! Now, let us assume I want to see the unemployment rate for Athens County, Ohio. The Series ID is LAUCN390090000000003 and we should be able to pull the data.
bls_api(
  c("LAUCN390090000000003"),
  startyear = 2000,
  endyear = 2021,
  Sys.getenv("BLS_KEY")
) %>%
  dateCast() -> bls02
Fair enough, I can get data for individual counties or multiple counties. At some point the API may reject our calls because we may end up exceeding the allowable limits imposed by the API, but let us see how far we can push it. Specifically, I want to grab the unemployment rate data for all 88 Ohio counties. Can I do that? Step 1 will be to find and concatenate the Series ID for all 88 counties.
seriesdf %>%
  filter(
    grepl(" County, Oh", series_title),
    grepl("Unemployment Rate", series_title)
  ) -> series88

series88[1:44, ] -> seria_01
series88[45:88, ] -> seria_02
I know the API will return fewer than 88 counties’ data if we ask for too much, so what we can do is split our counties into two groups, the first 44 (seria_01) and the second 44 (seria_02). Having done this, we can issue bls_api(...) calls for each half, and then bind them together via bind_rows(...).
bls_api(
  seria_01$series_id,
  startyear = 2011,
  endyear = 2021,
  Sys.getenv("BLS_KEY")
) %>%
  dateCast() -> bls03_01

bls_api(
  seria_02$series_id,
  startyear = 2011,
  endyear = 2021,
  Sys.getenv("BLS_KEY")
) %>%
  dateCast() -> bls03_02

bls03_01 %>%
  bind_rows(bls03_02) -> bls03
Not bad, looks like we got some 11,440 rows of data. But the county names are missing, so what we could do is append the county names from series88 as follows:
bls03 %>%
  left_join(
    series88[, 1:2], by = c("seriesID" = "series_id")
  ) -> bls04_raw
Of course, as is visible in the data-frame, the county names are all embedded in the series_title column, so let us extract county names and store them in a new column. We can also drop columns that are no longer needed, and convert value to an integer.
bls04_raw %>%
  mutate(
    county = stringr::str_remove_all(
      series_title,
      "Unemployment Rate: |Oh \\(U\\)| County, "
    )
  ) %>%
  mutate(
    unemprate = as.integer(value)
  ) %>%
  select(
    seriesID, latest, footnotes,
    date, county, unemprate
  ) -> bls04
As you can see, getting sub-national data via the BLS API is not an easy thing. I often end up having to use state and county-level unemployment rates, and then at times the labor-force participation rate. What I do is to grab the raw data directly from BLS, read, process, and save the file as *.RData, and then update it as needed. Compared to some of the other APIs out there the BLS API is a tough nut to crack. At least for me.
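As a rough sketch of that manual workflow, you could reuse the county unemployment spreadsheet URL from earlier, save the processed result, and refresh it whenever a new vintage is posted (the object and file names below are mine, not anything BLS prescribes):

library(httr)
library(readxl)
GET(
  "https://www.bls.gov/lau/laucnty16.xlsx",
  write_disk(tf <- tempfile(fileext = ".xlsx"))
)
read_excel(tf, skip = 5, na = "") -> laus_raw
# ... any cleaning/processing steps would go here ...
save(laus_raw, file = "data/laus_raw.RData")  # reload later with load("data/laus_raw.RData")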
6.1.3 Initiative and Referendum Data (1904-2010)
The Initiative and Referendum Institute has long gathered data on direct democracy in the states. You can read more here. Three data sets are readily available for download here:
- State Data 1904-2019
- Legal Landscape with the codebook here
- California Local Option Sales Tax Measures 1976-2020 with the codebook here
6.1.3.0.1 IRI initiatives by state and year for the 1904 through 2020 period
This database lists all statewide initiatives to appear on the ballot since the first statewide initiative in Oregon in 1904. This list does not contain information on issues placed on the ballot by state legislatures, commonly referred to as “legislative measures.”
"http://www.iandrinstitute.org/docs/Number%20initiatives%20by%20state-year%20(1904-2019).xls" -> myurl
"Number_20initiatives_20by_20state_year_20_1904_2019_.xls" -> mydestfile
::curl_download(myurl, mydestfile)
curl::read_excel(
readxl
mydestfile-> iir_df )
The Legal Landscape Database describes direct democracy provisions in American cities in 2005. The database includes the 1,000 largest cities in the country as well as the ten largest cities in every state. Data items include city name, FIPS code, population, and initiative status.
Finally, the California Local Option Sales Tax Measures 1976-2020 database provides detailed information on 86 local sales tax referendums for transportation purposes in California, including ballot information, election results, tax amount and duration, estimated revenue, project description, supporters and opponents. A second worksheet provides links to the text of the measures. Collected by UCLA Institute of Transportation Studies.
6.1.4 Education Data (CCD)
“The Common Core of Data (CCD) is a program of the U.S. Department of Education’s National Center for Education Statistics that annually collects fiscal and non-fiscal data about all public schools, public school districts and state education agencies in the United States. The data are supplied by state education agency officials and include information that describes schools and school districts, including name, address, and phone number; descriptive information about students and staff, including demographics; and fiscal data, including revenues and current expenditures.” You can learn more about the program here and access the data and documentation here. You can access the specific file used below from here, and note that these are schools in California.
readxl::read_excel(
  here::here(
    "data", "ncesdata.xlsx"
  )
) -> df.nces
NCES District ID | State District ID | District Name | County Name* | Street Address | City | State | ZIP | ZIP 4-digit | Phone | Students* | Teachers* | Schools* | Locale Code* | Locale* | Student Teacher Ratio* | Type |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0601620 | CA-1964212 | ABC Unified | LOS ANGELES COUNTY | 16700 Norwalk Blvd. | Cerritos | CA | 90703 | 1838 | (562)926-5566 | 20998.00000 | 833.11000 | 30.00000 | 21 | Suburb: Large | 25.20000000 | Regular School District |
0601650 | CA-0761630 | Acalanes Union High | CONTRA COSTA COUNTY | 1212 Pleasant Hill Rd. | Lafayette | CA | 94549 | 2623 | (925)280-3900 | 5402.00000 | 267.60000 | 5.00000 | 21 | Suburb: Large | 20.20000000 | Regular School District |
0601680 | CA-3166761 | Ackerman Charter | PLACER COUNTY | 13777 Bowman Rd. | Auburn | CA | 95603 | 3147 | (530)885-1974 | 556.00000 | 28.24000 | 1.00000 | 31 | Town: Fringe | 19.70000000 | Other Education Agency |
0600001 | CA-1975309 | Acton-Agua Dulce Unified | LOS ANGELES COUNTY | 32248 N. Crown Valley Rd. | Acton | CA | 93510 | 2620 | (661)269-5999 | 4043.00000 | 219.57000 | 22.00000 | 11 | City: Large | 18.40000000 | Regular School District |
0601710 | CA-3667587 | Adelanto Elementary | SAN BERNARDINO COUNTY | 11824 Air Expressway | Adelanto | CA | 92301 | 0070 | (760)246-8691 | 10378.00000 | 396.93000 | 16.00000 | 21 | Suburb: Large | 26.10000000 | Regular School District |
0691051 | CA-0110017 | Alameda County Office of Education | ALAMEDA COUNTY | 313 W. Winton Ave. | Hayward | CA | 94544 | 1136 | (510)887-0152 | 3931.00000 | 187.66000 | 13.00000 | 11 | City: Large | 20.90000000 | Regular School District |
These are data for a single year. If we wanted to build a panel (i.e., data for multiple schools over multiple years) we would have to download all the files and then bind them into a single data frame.
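A minimal sketch of that bind step, assuming you have downloaded one spreadsheet per year into a folder such as data/ccd/ and that each file shares the same layout (the folder and object names here are hypothetical):

library(dplyr)
library(purrr)
library(readxl)

list.files("data/ccd", pattern = "\\.xlsx$", full.names = TRUE) -> ccd_files

map_dfr(
  ccd_files,
  ~ read_excel(.x) %>%
      mutate(source_file = basename(.x))  # tag each row with the file (year) it came from
) -> ccd_panel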
6.1.5 The {EdSurvey} Package for Educational Data
The {EdSurvey} package is authored by AIR and designed to “read in and analyze functions for education survey and assessment data from the National Center for Education Statistics (NCES) nces.ed.gov, including National Assessment of Educational Progress (NAEP) data nces.ed.gov/nationsreportcard and data from the International Assessment Database: Organisation for Economic Co-operation and Development (OECD) oecd.org, including Programme for International Student Assessment (PISA), Teaching and Learning International Survey (TALIS), Programme for the International Assessment of Adult Competencies (PIAAC), and International Association for the Evaluation of Educational Achievement (IEA) iea.nl, including Trends in International Mathematics and Science Study (TIMSS), TIMSS Advanced, Progress in International Reading Literacy Study (PIRLS), International Civic and Citizenship Study (ICCS), International Computer and Information Literacy Study (ICILS), and Civic Education Study (CivEd).”
The authors have provided extensive documentation of how to use the package, obviating the need to go into any depth here.
6.1.6 World Bank data with data360r
In January 2017 the World Bank launched TCdata360, “a new open data platform that features more than 2,000 trade and competitiveness indicators from 40+ data sources inside and outside the World Bank Group. Users of the website can compare countries, download raw data, create and share data visualizations on social media, get country snapshots and thematic reports, read data stories, connect through an application programming interface (API), and more.” This led to them developing data360r, an R package that allows users to interact with the TCdata360 API and query TCdata360 data, metadata, and a lot more. Below we walk through some of the core functions.
The database has a large number of indicators so most of us would (or should) walk in knowing which indicators we wish to work with. Assume for now that you are interested in data-set 51.
#install.packages("devtools")
#devtools::install_github("mrpsonglao/data360r")
library(data360r)
get_data360(
  dataset_id = 51
) -> df51
But what if you have no clue about the dataset you want and would rather start by browsing the list of available indicators? You could do that for sure, and even browse available countries and data-sets with ease.
Say you want to get all indicators’ metadata in Govdata360.
get_metadata360(
  site = "gov",
  metadata_type = "indicators"
) -> df_indicators
What if you want all country metadata in TCdata360?
get_metadata360(
  metadata_type = 'countries'
) -> df_countries
Perhaps all dataset metadata in TCdata360?
get_metadata360(
  metadata_type = 'datasets'
) -> df_datasets
You can also search for data on an individual country code
search_360(
'Bhutan',
search_type = 'country'
)
#> id name slug type score redirect
#> 1: NA Bhutan BTN country 1 FALSE
or top n indicators for a measure (but note that you will likely see more than n rows in the data-set since some indicators have two IDs – one for TCdata360 and the other for Govdata360).
search_360(
'GDP',
search_type = 'indicator',
limit_results = 5
)
#> id name slug type score
#> 1: 28107 GDP (current US$) d6cf2c13 indicator 0.3333
#> 2: 28609 GDP - current US$ h3c4bf3c9 indicator 0.3333
#> 3: 940 GDP growth (annual %) NY.GDP.MKTP.KD.ZG indicator 0.2500
#> 4: 944 GDP per capita (current US$) NY.GDP.PCAP.CD indicator 0.2500
#> 5: 949 GDP, PPP (current international $) NY.GDP.MKTP.PP.CD indicator 0.2500
#> redirect dataset
#> 1: FALSE World Development Indicators
#> 2: FALSE World Integrated Trade Solution
#> 3: FALSE World Development Indicators
#> 4: FALSE World Development Indicators
#> 5: FALSE World Development Indicators
or for a particular database
search_360(
'World Development Indicators',
search_type = 'indicator',
limit_results = 5
)
#> id name slug
#> 1: 1804 Commercial service exports (current US$) TX.VAL.SERV.CD.WT
#> 2: 2010 High-technology exports (% of manufactured exports) TX.VAL.TECH.MF.ZS
#> 3: 2009 High-technology exports (current US$) TX.VAL.TECH.CD
#> 4: 1931 Transport services (% of commercial service exports) TX.VAL.TRAN.ZS.WT
#> 5: 1933 Travel services (% of commercial service exports) TX.VAL.TRVL.ZS.WT
#> type score dataset redirect
#> 1: indicator 1 World Development Indicators FALSE
#> 2: indicator 1 World Development Indicators FALSE
#> 3: indicator 1 World Development Indicators FALSE
#> 4: indicator 1 World Development Indicators FALSE
#> 5: indicator 1 World Development Indicators FALSE
or for a specific data provider
search_360(
'WBG',
search_type = 'indicator',
limit_results = 5
)
#> id
#> 1: 28134
#> 2: 28135
#> 3: 28137
#> 4: 28136
#> 5: 28140
#> name
#> 1: Adult literacy rate, population 15+ years, gender parity index (GPI)
#> 2: Barro-Lee: Percentage of female population age 15+ with primary schooling. Completed Primary
#> 3: Barro-Lee: Percentage of female population age 15+ with secondary schooling. Completed Secondary
#> 4: Barro-Lee: Percentage of population age 15+ with primary schooling. Completed Primary
#> 5: Barro-Lee: Percentage of population age 15+ with tertiary schooling. Completed Tertiary
#> slug type score dataset redirect
#> 1: f22e4196 indicator 1 Education Statistics FALSE
#> 2: 8ef474ba indicator 1 Education Statistics FALSE
#> 3: cde9ae8d indicator 1 Education Statistics FALSE
#> 4: c00a96fc indicator 1 Education Statistics FALSE
#> 5: f7163136 indicator 1 Education Statistics FALSE
or for a particular category
search_360(
'Governance',
site = 'gov',
search_type = 'category',
limit_results = 5
)
#> id name slug type score dataset redirect
#> 1: 40976 Governance gov category 1.0000 <NA> TRUE
#> 2: NA Corporate Governance h758609d3 category 0.5000 <NA> FALSE
#> 3: NA General Governance governance category 0.5000 <NA> FALSE
#> 4: 44140 Overall Governance h81b1dd1f category 0.5000 <NA> FALSE
#> 5: NA Other (Overall Governance) h0521fa1f category 0.3333 <NA> FALSE
If you want to download data for a particular indicator, you can do so as follows. First, find the indicator ID.
search_360(
  "woman business",
  search_type = "indicator",
  limit_results = 5
) -> df_ind1
Say I want indicator 90 = Can a married woman register a business in the same way as a married man?
get_data360(
  indicator_id = c(90)
) -> df_ind90

get_data360(
  indicator_id = c(28130, 28131)
) -> df_indtwo
If you only want all data for a specific country, or just the indicators we pulled in df_ind1 for a specific country, you could do:
get_data360(
  country_iso3 = "IND"
) -> df_allone

get_data360(
  indicator_id = df_ind1$id,
  country_iso3 = "IND"
) -> df_allind1
Now an example with two measures – legal age of marriage for boys and girls. Note that the package allows you to specify the long format (preferred) rather than the default wide format you see earlier results being returned in. Note also the use of timeframes = c(), which allows you to specify the specific time period you want the indicators for.
search_360(
"marriage",
search_type = "indicator"
)
#> id
#> 1: 204
#> 2: 205
#> 3: 206
#> 4: 207
#> 5: 43472
#> 6: 40083
#> 7: 40080
#> 8: 40081
#> 9: NA
#> name
#> 1: What is the legal age of marriage for boys?
#> 2: What is the legal age of marriage for girls?
#> 3: Are there any exceptions to the legal age of marriage?
#> 4: Does the law prohibit or invalidate child or early marriage?
#> 5: Does the law grant spouses equal administrative authority over assets during marriage?
#> 6: What is the minimum age of marriage with judicial authorization for girls?
#> 7: What is the minimum age of marriage with parental consent for boys?
#> 8: What is the minimum age of marriage with parental consent for girls?
#> 9: Are there penalties in the law for authorizing or knowingly entering into the child or early marriage?'
#> slug type score dataset redirect
#> 1: age.marr.male indicator 0.11111 Women, Business and the Law FALSE
#> 2: age.marr.fem indicator 0.11111 Women, Business and the Law FALSE
#> 3: marry.age.except indicator 0.10000 Women, Business and the Law FALSE
#> 4: chld.marr indicator 0.10000 Women, Business and the Law FALSE
#> 5: hf8cada17 indicator 0.08333 Women, Business and the Law FALSE
#> 6: h5e7b11fe indicator 0.08333 Women, Business and the Law FALSE
#> 7: haab6520b indicator 0.08333 Women, Business and the Law FALSE
#> 8: hdecd0041 indicator 0.08333 Women, Business and the Law FALSE
#> 9: hb9d0a754 indicator 0.05882 Women, Business and the Law FALSE
get_data360(
indicator_id = c(204, 205),
timeframes = c(2016),
output_type = 'long'
) -> df_marriage
ggplot(
  df_marriage,
  aes(
    x = Observation,
    group = Indicator,
    fill = Indicator
  )
) +
  geom_bar() +
  theme(
    legend.position = "none"
  ) +
  facet_wrap(
    ~ Indicator,
    ncol = 1
  ) +
  labs(
    x = "Legal Age of Marriage",
    y = "Frequency",
    title = "Legal Age of Marriage for Boys vs. Girls",
    subtitle = "(2016)",
    caption = "Source: World Bank Data"
  )
6.1.7 World Bank data with {wbstats}
Another library, wbstats, does pretty much the same thing: “[It] allows researchers to quickly search and download the data of their particular interest in a programmatic and reproducible fashion; this facilitates a seamless integration into their workflow and allows analysis to be quickly rerun on different areas of interest and with real-time access to the latest available data.” Let us see the core functionality by loading the library and then seeing what is available in terms of indicators, topics, and so on. We can then set the most current list of information in wb_cachelist to be used via new_cache.
#install.packages("wbstats")
#remotes::install_github("nset-ornl/wbstats")
library(wbstats)
str(
wb_cachelist, max.level = 1
)
#> List of 8
#> $ countries : tibble [304 × 18] (S3: tbl_df/tbl/data.frame)
#> $ indicators : tibble [16,649 × 8] (S3: tbl_df/tbl/data.frame)
#> $ sources : tibble [63 × 9] (S3: tbl_df/tbl/data.frame)
#> $ topics : tibble [21 × 3] (S3: tbl_df/tbl/data.frame)
#> $ regions : tibble [48 × 4] (S3: tbl_df/tbl/data.frame)
#> $ income_levels: tibble [7 × 3] (S3: tbl_df/tbl/data.frame)
#> $ lending_types: tibble [4 × 3] (S3: tbl_df/tbl/data.frame)
#> $ languages : tibble [23 × 3] (S3: tbl_df/tbl/data.frame)
wb_cache() -> new_cache
What indicators are available?
wb_search(
  pattern = "corruption"
) -> corruption_vars

head(corruption_vars)
#> # A tibble: 6 × 3
#> indicator_id indicator indicator_desc
#> <chr> <chr> <chr>
#> 1 CC.EST Control of Corruption: Estima… "Control of Corruption captures p…
#> 2 CC.NO.SRC Control of Corruption: Number… "Control of Corruption captures p…
#> 3 CC.PER.RNK Control of Corruption: Percen… "Control of Corruption captures p…
#> 4 CC.PER.RNK.LOWER Control of Corruption: Percen… "Control of Corruption captures p…
#> 5 CC.PER.RNK.UPPER Control of Corruption: Percen… "Control of Corruption captures p…
#> 6 CC.STD.ERR Control of Corruption: Standa… "Control of Corruption captures p…
If I want information from a particular source, say Bloomberg,
wb_search(
  pattern = "Bloomberg"
) -> blmbrg_vars

head(blmbrg_vars)
#> # A tibble: 1 × 3
#> indicator_id indicator indicator_desc
#> <chr> <chr> <chr>
#> 1 GFDD.OM.02 Stock market return (%, year-on-year) Stock market return is the gro…
Searching for indicators tied to multiple subjects is easy as well:
wb_search(
  pattern = "poverty | unemployment | employment"
) -> povemply_vars
Downloading data, once we identify what we want, is easy as well, needing us to specify just the indicator(s) and then the start and end dates. If you want the data for specific countries, that is easily done as well.
wb_data(
  indicator = "SP.POP.TOTL",
  start_date = 1960,
  end_date = 2019
) -> pop_data1

wb_data(
  country = c(
    "ABW", "AF", "SSF", "ECA", "IND", "CHN"
  ),
  indicator = "SP.POP.TOTL",
  start_date = 1960,
  end_date = 2019
) -> pop_data2

wb_data(
  country = c(
    "ABW", "AF", "SSF", "ECA", "IND", "CHN"
  ),
  indicator = c(
    "SP.POP.TOTL", "NY.GDP.MKTP.CD"
  ),
  start_date = 1960,
  end_date = 2019
) -> pop_data3
By default wb() will return the data in long format but not necessarily in a tidy format. If you want the data returned on call in a wide format, specify return_wide = TRUE and you will have tidy data.
wb_data(
  country = c(
    "ABW", "AF", "SSF", "ECA", "IND", "CHN"
  ),
  indicator = c(
    "SP.POP.TOTL", "NY.GDP.MKTP.CD"
  ),
  start_date = 1960,
  end_date = 2019,
  return_wide = TRUE
) -> pop_data4
If you do not know the start/end dates for an indicator, use mrv = most recent value. Be careful though: If you use mrv and want the data for more than one country, it will find the most recent date of any country and then return data using that date as the most recent value. In other words, when grabbing data for more than one country, mrv does not search the most recent value for each country on a country-by-country basis.
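If you do want each country's own most recent value, one workaround (a sketch, not an official wbstats recommendation) is to query the countries one at a time and then bind the results:

library(purrr)
map_dfr(
  c("ABW", "IND", "CHN"),        # loop over the countries one at a time
  ~ wb_data(
      country = .x,
      indicator = "EG.ELC.ACCS.ZS",
      mrv = 1                    # most recent value for this country alone
    )
) -> mrv_by_country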
wb_data(
  country = c("IND"),
  indicator = 'EG.ELC.ACCS.ZS',
  mrv = 20
) -> mrv_data

head(mrv_data)
#> # A tibble: 6 × 9
#> iso2c iso3c country date EG.ELC.ACCS.ZS unit obs_status footnote last_updated
#> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr> <date>
#> 1 IN IND India 2000 59.3 <NA> <NA> <NA> 2021-12-16
#> 2 IN IND India 2001 55.8 <NA> <NA> <NA> 2021-12-16
#> 3 IN IND India 2002 62.3 <NA> <NA> <NA> 2021-12-16
#> 4 IN IND India 2003 64.0 <NA> <NA> <NA> 2021-12-16
#> 5 IN IND India 2004 64.4 <NA> <NA> <NA> 2021-12-16
#> 6 IN IND India 2005 67.1 <NA> <NA> <NA> 2021-12-16
As you have no doubt noticed, data may be missing for some years. If you want these missing values to be replaced by the preceding value, use gapfill.
wb_data(
  country = c("IND"),
  indicator = 'EG.ELC.ACCS.ZS',
  mrv = 20,
  gapfill = TRUE
) -> gapfill_data

head(gapfill_data)
#> # A tibble: 6 × 9
#> iso2c iso3c country date EG.ELC.ACCS.ZS unit obs_status footnote last_updated
#> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr> <date>
#> 1 IN IND India 2001 55.8 <NA> <NA> <NA> 2021-12-16
#> 2 IN IND India 2002 62.3 <NA> <NA> <NA> 2021-12-16
#> 3 IN IND India 2003 64.0 <NA> <NA> <NA> 2021-12-16
#> 4 IN IND India 2004 64.4 <NA> <NA> <NA> 2021-12-16
#> 5 IN IND India 2005 67.1 <NA> <NA> <NA> 2021-12-16
#> 6 IN IND India 2006 67.9 <NA> <NA> <NA> 2021-12-16
Note that now you have 2014 values being used to fill what would be missing data for 2015, 2016 and 2017. If you will be working with dates, whether for plotting purposes or otherwise, then activate the POSIXct = TRUE switch. Otherwise you will have to do this manually.
wb_data(
  indicator = c(
    "CRUDE_DUBAI", "CRUDE_BRENT", "CRUDE_WTI", "CRUDE_PETRO"
  ),
  mrv = 60,
  freq = "M",
  POSIXct = TRUE
) -> oil_data

head(oil_data)

ggplot(
  oil_data,
  aes(
    x = date_ct,
    y = value,
    group = indicatorID,
    color = indicatorID
  )
) +
  geom_line() +
  scale_x_date(
    date_breaks = "9 months"
  ) +
  labs(
    x = "Date",
    y = "Price",
    title = "Crude Oil Prices",
    subtitle = "(in nominal US$ per bbl)",
    caption = "Source: World Bank Data"
  ) +
  theme(
    legend.position = "bottom"
  )
6.2 USGS Data {dataRetrieval}
The {dataRetrieval} package gives you easy access to water data gathered and warehoused by the USGS, USDA, EPA, and other entities. The package has an excellent tutorial available here so I will not go into too many details and nuances here. Start by installing the package and then loading it.
# install.packages("dataRetrieval")
library(dataRetrieval)
You will usually start by identifying the site (for example, the Hocking River) you want data for, the parameter (gage height, temperature, pH, and so on), and the statistic (mean, median, minimum, etc.). The full list of available parameters can be found here and the complete list of statistics is available here. Sites can be located from this inventory. In fact, the package comes with a built-in data-set that shows you available parameters, making things a bit easier.
parameterCdFile -> myparmfile

names(myparmfile)
#> [1] "parameter_cd" "parameter_group_nm" "parameter_nm"
#> [4] "casrn" "srsname" "parameter_units"
If you are curious about a specific parameter, you can see what all is available for it. I’ll look for anything related to the keyword storm.
parameterCdFile[
  grep(
    "storm",
    parameterCdFile$parameter_nm,
    ignore.case = TRUE
  ),
] -> stormq

unique(
  stormq$parameter_units
)
#> [1] "nu" "hours" "minutes" "Mgal" "ft3/s" "mgd" "in"
The most basic data pull involves a few specifications, as shown below. Note that I am pulling data for the Hocking River at Athens, Ohio.
"03159500" -> siteNo
"00065" -> pCode
"2014-10-01" -> start.date
"2018-02-26" -> end.date
readNWISuv(
  siteNumbers = siteNo,
  parameterCd = pCode,
  startDate = start.date,
  endDate = end.date
) -> hocking
Since the column names will be crude, based on parameter codes, you can clean them up, and also see other attributes embedded in the data-set.
names(hocking)
#> [1] "agency_cd" "site_no" "dateTime" "X_00065_00000"
#> [5] "X_00065_00000_cd" "tz_cd"
renameNWISColumns(
  hocking
) -> hocking

names(hocking)
#> [1] "agency_cd" "site_no" "dateTime" "GH_Inst" "GH_Inst_cd" "tz_cd"

names(
  attributes(
    hocking
  )
)
#> [1] "names" "row.names" "class" "url" "siteInfo"
#> [6] "variableInfo" "disclaimer" "statisticInfo" "queryTime"
If I wanted data for a few sites rather than one I could run
c(
  "03158200",
  "03159246"
) -> sites

"00065" -> pcode
"2014-10-01" -> start.date
"2018-02-26" -> end.date

readNWISuv(
  siteNumbers = sites,
  parameterCd = pcode,
  startDate = start.date,
  endDate = end.date
) -> hocking2

renameNWISColumns(
  hocking2
) -> hocking2
Now a simple time-series plot of gage height for both sites.
attr(
  hocking2, "variableInfo"
) -> parameterInfo

hocking2$station = ifelse(
  hocking2$site_no == "03158200",
  "Monday Creek at Doanville",
  "Sunday Creek Below Millfield"
)

as.Date(
  as.character(
    hocking2$dateTime
  ),
  format = "%Y-%m-%d"
) -> hocking2$mydates

ggplot(
  data = hocking2,
  aes(
    x = mydates,
    y = GH_Inst,
    color = station
  )
) +
  geom_line() +
  labs(
    x = "",
    y = parameterInfo$variableDescription
  ) +
  theme(
    legend.position = "bottom"
  ) +
  scale_x_date(
    date_breaks = "2 weeks"
  )
Be sure to check out Laura DeCicco’s excellent tutorial as well as the package vignette, and the other articles that go well beyond this light introduction to the package.
6.3 Working with googlesheets via {googlesheets4}
Googlesheets is a web-based application that allows teams to collaboratively work with a common data table, using it to carry out analyses, visualizations, and of course to update the sheet as well. It is built to function with tablets and other mobile platforms, making it very versatile. Given its growing popularity, it was a matter of time before someone built an R package that uses google sheets, and lo and behold, that individual turned out to be Jenny Bryan. In this section I’ll outline the basic features of her googlesheets4 package.
library(googledrive)
library(googlesheets4)
Now we authenticate with a google account via the browser window/tab that should open when you execute the command below. If you don’t have a gmail account, get one now before you go any further.
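A minimal sketch of that authentication step follows; gs4_auth() and drive_auth() are the packages' own helpers, and the first call opens the browser prompt while later calls reuse the cached token.

gs4_auth()    # authorize {googlesheets4} for your google account
drive_auth()  # authorize {googledrive} too, if you plan to push files to Drive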
What sheets do we have in our google drive?
gs4_find()
#> # A dribble: 15 × 3
#> name id drive_resource
#> <chr> <drv_id> <list>
#> 1 EBITE Coaching Cohort 1 1km5f7Q8MBC0WIzX6H9Z0s__A… <named list [3…
#> 2 tinted-silverfox 1m9jeXouE5MYr_KWqpQbSaRnr… <named list [3…
#> 3 moonish-takin 1k9YudiVm2XQA8mzfONonzRLi… <named list [3…
#> 4 Coaching Intake Form (Responses) 1_s8mnyjUZWnuDq2Lznq0usCa… <named list [3…
#> 5 Competition_Results_Exports_UIPM_2021_… 1zqv-B6uE-xUsA7XDks8_v5h4… <named list [3…
#> 6 Detailed Biography DHS Ethiopia 1NGjAmMrVWPUqikSIiqadTVHu… <named list [3…
#> # … with 9 more rows
I want to check the Gapminder sheet, and I can do this by the sheet’s URL (as shown below) or by the sheet key.
read_sheet(
  "https://docs.google.com/spreadsheets/d/1U6Cf_qEOhiR9AZqTqS3mbMF3zt2db48ZP5v3rkrAEJY/edit#gid=780868077"
) -> my.gs1

read_sheet(
  "1U6Cf_qEOhiR9AZqTqS3mbMF3zt2db48ZP5v3rkrAEJY"
) -> my.gs2
Now we want to read the contents of this sheet and start working on them.
str(my.gs1)
str(my.gs2)
Now some basic calculations and a plot, just to see that it works as it is supposed to.
my.gs1 %>%
  group_by(
    country
  ) %>%
  summarise(
    mean.gdp = mean(
      gdpPercap, na.rm = TRUE
    )
  ) %>%
  arrange(
    -mean.gdp
  ) -> tab.01
ggplot(
  my.gs1,
  aes(
    x = lifeExp,
    y = gdpPercap
  )
) +
  geom_hex() +
  labs(
    x = "Life Expectancy (in years)",
    y = "Gross Domestic Product (per-capita)"
  )
What if we wanted to push something, a dataframe we modified in R or even the tab.01 we created, as a googlesheet to our google drive?
sheet_write(
  tab.01,
  sheet = "sheet-1"
)
Login to your google drive folder and check if the sheet is there, then double-click it to open it as a googlesheet. Voila, there it is!
So what is the value-add of this package? Since many teams work with googlesheets but not everyone is responsible for data analysis, this package allows you to work with sheets created by others, edit them, and even generate a new sheet that the other team members can then use. To me, that is valuable in itself. Dr. Jenny Bryan points out some other uses, such as gathering data securely via googleforms, gathering data on your smartphone while in the field, and pretty much any other use of googlesheets that you can conjure.
Of course, this is not all that the package does; it does a lot more! If you are interested in these other functions, be sure to check out this vignette: googlesheets Basic Usage.