3 Aggregating Data and Other Operations
Often when analyzing data we end up having to collapse them into summary statistics: calculating means, standard deviations, and so on for different groups, counting how many times we see a particular value or a specific pattern in the data, maybe even plotting the results. These are all aggregations because we are collapsing many individual rows into a smaller set of summary values.
I’ll demonstrate some operations with the {dplyr}, {tidyr}, {data.table}, and {reshape2} packages. These are not the only packages capable of performing the operations shown below, but they are among the most popular packages out there today and certainly ones I try to master as best I can, if one can really master anything as fluid as R packages! I will not load any of these packages on their own but instead load the {tidyverse}.
As per the authors: The {tidyverse}
is an opinionated collection of R packages – packages that you’re likely to use in everyday data analyses – that share an underlying design philosophy, grammar, and data structures. The packages included in the tidyverse are: {ggplot2}, {tidyr}, {dplyr}, {forcats}, {stringr}, {readr}, {tibble}, {purrr}.
We will use a particular data set to understand what {dplyr}
can do – the cmhflights data frame that contains all flights departing from or arriving at the John Glenn Columbus International Airport starting in January and running through September of 2017 (source: The US Bureau of Transportation Statistics). I downloaded the data for all airports from here and then trimmed it to CMH.
library(tidyverse)
#> ── Attaching packages ──────────────────────────────────────────── tidyverse 1.3.1 ──
#> ✓ tibble 3.1.6 ✓ dplyr 1.0.7
#> ✓ tidyr 1.1.4 ✓ stringr 1.4.0
#> ✓ readr 2.1.1 ✓ forcats 0.5.1
#> ✓ purrr 0.3.4
#> ── Conflicts ─────────────────────────────────────────────── tidyverse_conflicts() ──
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag() masks stats::lag()
load("data/cmhflights_01092017.RData")
The code chunks below have not been run, to save space, but you should execute them to see the results of each.
names(cmhflights)
print(cmhflights) # Show me a snapshot of the data
glimpse(cmhflights) # This is a dplyr command that lets us take a peek at our data-frame, and is an alternative to print(...)
3.1 Using {dplyr}
In the commands that follow, you will see %>%
– the “pipe operator.” What this operator does is allow you to write code that chains multiple commands together. Chain? What does that mean? Well, simply put, it means you pass the result of one command into the next command and on and on until you stop the chain. The easiest way to remember how the chain works is to remember the chain starts (usually with a data-frame) on the left and flows right. Here is a rough example.
some-data %>%
some-function(…) %>%
another-function(…) %>%
yet-another-function(…) -> resulting-object
In several of the examples that follow, perhaps even most, you will see the results of the code on the page. If you copy the code and execute it, you will see the same result, but note that the result is not being saved. If you need to save the result, be sure to end the pipe with -> somename, as shown below:

somedata %>%
  somefunction(
    ...
  ) -> someobject
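To make the chain concrete, here is a minimal sketch on a small made-up data frame (the data and names here are purely for illustration, assuming only {dplyr} is loaded):

```r
library(dplyr)

# A tiny made-up data frame of flight delays
toy <- tibble(
  carrier = c("AA", "AA", "DL"),
  delay = c(10, -5, 20)
)

# Filter, then summarise, then save the result with ->
toy %>%
  filter(delay > 0) %>%
  summarise(mean_delay = mean(delay)) -> toy.result

toy.result
#> # A tibble: 1 × 1
#>   mean_delay
#>        <dbl>
#> 1         15
```

Each step hands its result to the next, and the final `->` stores the one-row summary in toy.result.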
3.1.1 {janitor}
Before we go any further, note that the data-set has clunky variable names, with some letters in uppercase and some in lowercase. This is terrible practice. Preferably, at least in my data world, all column names will be in lowercase and, if some or all of the column names are long, broken up by appropriate use of an underscore (_) or a period (.). The {janitor} package is designed to clean up messy column names (but is not a part of the {tidyverse}). My habit is to use it whenever I load a messy data-set. How does it work? See below.
cmhflights %>%
  janitor::clean_names() -> myflights
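As a quick illustration of what clean_names() does, here is a toy tibble with deliberately messy names (a sketch; assumes {janitor} is installed):

```r
library(dplyr)
library(janitor)

# Made-up column names mixing case and spaces
messy <- tibble(
  `Flight Date` = c("2017-01-01"),
  `ARR_DELAY` = c(-14)
)

messy %>%
  clean_names() %>%
  names()
#> [1] "flight_date" "arr_delay"
```

Everything is converted to lowercase snake_case, which is exactly the convention argued for above.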
3.1.2 select()
Note that we have 110 variables in the cmhflights
data-frame. In order for you to see what is going on, I will trim these variables so that the ones we end up with can be reviewed much like you would in a spreadsheet. Trimming variables is easily done via the select()
command, as shown below.
myflights %>%
  select(
    year:flight_date, carrier:flight_num, origin:origin_city_name,
    dest:dest_city_name, dep_time:dep_delay, taxi_out, wheels_off,
    wheels_on, taxi_in, arr_time, arr_delay
  ) -> my.df
We can specify contiguous columns in shorthand via column.x:column.y, or go the extra mile by listing each column individually, separated by commas.
Having trimmed the data-set we may also want to move some columns around so that they appear in our desired order. We could do this via select() by specifying the column names in our desired order. Here are some examples:
names(my.df)
#> [1] "year" "quarter" "month" "dayof_month"
#> [5] "day_of_week" "flight_date" "carrier" "tail_num"
#> [9] "flight_num" "origin" "origin_city_name" "dest"
#> [13] "dest_city_name" "dep_time" "dep_delay" "taxi_out"
#> [17] "wheels_off" "wheels_on" "taxi_in" "arr_time"
#> [21] "arr_delay"
my.df %>%
  select(
    c(
      flight_date:dest, arr_delay, dep_delay
    )
  )
#> # A tibble: 35,993 × 9
#> flight_date carrier tail_num flight_num origin origin_city_name dest arr_delay
#> <date> <chr> <chr> <int> <chr> <chr> <chr> <dbl>
#> 1 2017-01-01 AA N802AA 508 LAX Los Angeles, CA CMH -14
#> 2 2017-01-02 AA N766AA 508 LAX Los Angeles, CA CMH 29
#> 3 2017-01-03 AA N760AA 508 LAX Los Angeles, CA CMH -37
#> 4 2017-01-04 AA N753AA 508 LAX Los Angeles, CA CMH -25
#> 5 2017-01-05 AA N744AA 508 LAX Los Angeles, CA CMH -22
#> 6 2017-01-06 AA N815AA 508 LAX Los Angeles, CA CMH 13
#> # … with 35,987 more rows, and 1 more variable: dep_delay <dbl>
my.df %>%
  select(
    c(
      arr_delay, dep_delay, flight_date:dep_time,
      year:day_of_week, taxi_out:arr_time
    )
  )
#> # A tibble: 35,993 × 21
#> arr_delay dep_delay flight_date carrier tail_num flight_num origin
#> <dbl> <dbl> <date> <chr> <chr> <int> <chr>
#> 1 -14 -9 2017-01-01 AA N802AA 508 LAX
#> 2 29 24 2017-01-02 AA N766AA 508 LAX
#> 3 -37 -6 2017-01-03 AA N760AA 508 LAX
#> 4 -25 -5 2017-01-04 AA N753AA 508 LAX
#> 5 -22 -7 2017-01-05 AA N744AA 508 LAX
#> 6 13 22 2017-01-06 AA N815AA 508 LAX
#> # … with 35,987 more rows, and 14 more variables: origin_city_name <chr>,
#> # dest <chr>, dest_city_name <chr>, dep_time <chr>, year <int>, quarter <int>,
#> # month <int>, dayof_month <int>, day_of_week <int>, taxi_out <dbl>,
#> # wheels_off <chr>, wheels_on <chr>, taxi_in <dbl>, arr_time <chr>
Notice that both results include the columns we listed, the latter containing more columns than the former!
select()
is versatile enough such that we could ask that only columns with a certain prefix or suffix be selected, or even columns that contain a specific string:
my.df %>%
  dplyr::select(
    starts_with("arr")
  )
#> # A tibble: 35,993 × 2
#> arr_time arr_delay
#> <chr> <dbl>
#> 1 1653 -14
#> 2 1736 29
#> 3 1630 -37
#> 4 1642 -25
#> 5 1645 -22
#> 6 1720 13
#> # … with 35,987 more rows
my.df %>%
  dplyr::select(
    ends_with("delay")
  )
#> # A tibble: 35,993 × 2
#> dep_delay arr_delay
#> <dbl> <dbl>
#> 1 -9 -14
#> 2 24 29
#> 3 -6 -37
#> 4 -5 -25
#> 5 -7 -22
#> 6 22 13
#> # … with 35,987 more rows
my.df %>%
  dplyr::select(
    contains("num")
  )
#> # A tibble: 35,993 × 2
#> tail_num flight_num
#> <chr> <int>
#> 1 N802AA 508
#> 2 N766AA 508
#> 3 N760AA 508
#> 4 N753AA 508
#> 5 N744AA 508
#> 6 N815AA 508
#> # … with 35,987 more rows
To use any function in {dplyr}, I could have first run library(tidyverse) or library(dplyr). However, library(tidyverse) would load multiple packages into working memory, while library(dplyr) would load only the {dplyr} package. A more efficient solution is to call a function from a package only when needed, without keeping the package loaded in working memory. This is done with the package::function(…) syntax, as in dplyr::select(…).
Now, you won’t see me use this more efficient code in the rest of this chapter because we will keep using various {dplyr}
functions and hence we might as well just lean on {dplyr}
via the {tidyverse}
library we loaded earlier.
We can also specify the column position instead of names when selecting columns:
my.df %>%
  select(
    1:4
  )
#> # A tibble: 35,993 × 4
#> year quarter month dayof_month
#> <int> <int> <int> <int>
#> 1 2017 1 1 1
#> 2 2017 1 1 2
#> 3 2017 1 1 3
#> 4 2017 1 1 4
#> 5 2017 1 1 5
#> 6 2017 1 1 6
#> # … with 35,987 more rows
my.df %>%
  select(
    1:4, 21, 15, 7:13
  )
#> # A tibble: 35,993 × 13
#> year quarter month dayof_month arr_delay dep_delay carrier tail_num flight_num
#> <int> <int> <int> <int> <dbl> <dbl> <chr> <chr> <int>
#> 1 2017 1 1 1 -14 -9 AA N802AA 508
#> 2 2017 1 1 2 29 24 AA N766AA 508
#> 3 2017 1 1 3 -37 -6 AA N760AA 508
#> 4 2017 1 1 4 -25 -5 AA N753AA 508
#> 5 2017 1 1 5 -22 -7 AA N744AA 508
#> 6 2017 1 1 6 13 22 AA N815AA 508
#> # … with 35,987 more rows, and 4 more variables: origin <chr>,
#> # origin_city_name <chr>, dest <chr>, dest_city_name <chr>
I prefer working with column numbers when reordering or pruning a data-set; it makes for less typing and is faster, so long as I don’t mess up. This is why I keep using the names() command to make sure I get the column positions right.
In closing, let me also point out that we can drop columns with the -
symbol, as shown below:
my.df %>%
  select(
    -4, -5
  )
#> # A tibble: 35,993 × 19
#> year quarter month flight_date carrier tail_num flight_num origin
#> <int> <int> <int> <date> <chr> <chr> <int> <chr>
#> 1 2017 1 1 2017-01-01 AA N802AA 508 LAX
#> 2 2017 1 1 2017-01-02 AA N766AA 508 LAX
#> 3 2017 1 1 2017-01-03 AA N760AA 508 LAX
#> 4 2017 1 1 2017-01-04 AA N753AA 508 LAX
#> 5 2017 1 1 2017-01-05 AA N744AA 508 LAX
#> 6 2017 1 1 2017-01-06 AA N815AA 508 LAX
#> # … with 35,987 more rows, and 11 more variables: origin_city_name <chr>,
#> # dest <chr>, dest_city_name <chr>, dep_time <chr>, dep_delay <dbl>,
#> # taxi_out <dbl>, wheels_off <chr>, wheels_on <chr>, taxi_in <dbl>,
#> # arr_time <chr>, arr_delay <dbl>
which drops dayof_month and day_of_week (columns 4 and 5). You could achieve a similar effect by explicitly selecting only the columns you want to keep, for example
my.df %>%
  select(
    1:3, 6
  )
#> # A tibble: 35,993 × 4
#> year quarter month flight_date
#> <int> <int> <int> <date>
#> 1 2017 1 1 2017-01-01
#> 2 2017 1 1 2017-01-02
#> 3 2017 1 1 2017-01-03
#> 4 2017 1 1 2017-01-04
#> 5 2017 1 1 2017-01-05
#> 6 2017 1 1 2017-01-06
#> # … with 35,987 more rows
3.1.3 filter()
filter() allows you to subset the data to observations that meet a given criterion or a set of criteria. In the code below, for example, we are asking for all flights on January 1 (i.e., with month == 1, dayof_month == 1). The base R approach to sub-setting the data is shown first, followed by the {dplyr} code.
my.df[my.df$month == 1 & my.df$dayof_month == 1, ] # base R
#> # A tibble: 105 × 21
#> year quarter month dayof_month day_of_week flight_date carrier tail_num
#> <int> <int> <int> <int> <int> <date> <chr> <chr>
#> 1 2017 1 1 1 7 2017-01-01 AA N802AA
#> 2 2017 1 1 1 7 2017-01-01 AA N820AA
#> 3 2017 1 1 1 7 2017-01-01 AA N828AA
#> 4 2017 1 1 1 7 2017-01-01 AA N828AA
#> 5 2017 1 1 1 7 2017-01-01 AA N426AA
#> 6 2017 1 1 1 7 2017-01-01 AA N426AA
#> # … with 99 more rows, and 13 more variables: flight_num <int>, origin <chr>,
#> # origin_city_name <chr>, dest <chr>, dest_city_name <chr>, dep_time <chr>,
#> # dep_delay <dbl>, taxi_out <dbl>, wheels_off <chr>, wheels_on <chr>,
#> # taxi_in <dbl>, arr_time <chr>, arr_delay <dbl>
my.df %>%
  subset(
    month == 1 & dayof_month == 1 # base R
  )
#> # A tibble: 105 × 21
#> year quarter month dayof_month day_of_week flight_date carrier tail_num
#> <int> <int> <int> <int> <int> <date> <chr> <chr>
#> 1 2017 1 1 1 7 2017-01-01 AA N802AA
#> 2 2017 1 1 1 7 2017-01-01 AA N820AA
#> 3 2017 1 1 1 7 2017-01-01 AA N828AA
#> 4 2017 1 1 1 7 2017-01-01 AA N828AA
#> 5 2017 1 1 1 7 2017-01-01 AA N426AA
#> 6 2017 1 1 1 7 2017-01-01 AA N426AA
#> # … with 99 more rows, and 13 more variables: flight_num <int>, origin <chr>,
#> # origin_city_name <chr>, dest <chr>, dest_city_name <chr>, dep_time <chr>,
#> # dep_delay <dbl>, taxi_out <dbl>, wheels_off <chr>, wheels_on <chr>,
#> # taxi_in <dbl>, arr_time <chr>, arr_delay <dbl>
my.df %>%
  filter(
    month == 1, dayof_month == 1 # with dplyr
  )
#> # A tibble: 105 × 21
#> year quarter month dayof_month day_of_week flight_date carrier tail_num
#> <int> <int> <int> <int> <int> <date> <chr> <chr>
#> 1 2017 1 1 1 7 2017-01-01 AA N802AA
#> 2 2017 1 1 1 7 2017-01-01 AA N820AA
#> 3 2017 1 1 1 7 2017-01-01 AA N828AA
#> 4 2017 1 1 1 7 2017-01-01 AA N828AA
#> 5 2017 1 1 1 7 2017-01-01 AA N426AA
#> 6 2017 1 1 1 7 2017-01-01 AA N426AA
#> # … with 99 more rows, and 13 more variables: flight_num <int>, origin <chr>,
#> # origin_city_name <chr>, dest <chr>, dest_city_name <chr>, dep_time <chr>,
#> # dep_delay <dbl>, taxi_out <dbl>, wheels_off <chr>, wheels_on <chr>,
#> # taxi_in <dbl>, arr_time <chr>, arr_delay <dbl>
filter() is versatile; what this means in practice is that you can filter in a number of simple or seemingly complicated ways. For example, say I want all flights in January OR in February. I could do that as follows.
my.df %>%
  filter(
    month == 1 | month == 2
  )
#> # A tibble: 7,170 × 21
#> year quarter month dayof_month day_of_week flight_date carrier tail_num
#> <int> <int> <int> <int> <int> <date> <chr> <chr>
#> 1 2017 1 1 1 7 2017-01-01 AA N802AA
#> 2 2017 1 1 2 1 2017-01-02 AA N766AA
#> 3 2017 1 1 3 2 2017-01-03 AA N760AA
#> 4 2017 1 1 4 3 2017-01-04 AA N753AA
#> 5 2017 1 1 5 4 2017-01-05 AA N744AA
#> 6 2017 1 1 6 5 2017-01-06 AA N815AA
#> # … with 7,164 more rows, and 13 more variables: flight_num <int>, origin <chr>,
#> # origin_city_name <chr>, dest <chr>, dest_city_name <chr>, dep_time <chr>,
#> # dep_delay <dbl>, taxi_out <dbl>, wheels_off <chr>, wheels_on <chr>,
#> # taxi_in <dbl>, arr_time <chr>, arr_delay <dbl>
This is the same as running
my.df %>%
  filter(
    month %in% c(1, 2)
  )
#> # A tibble: 7,170 × 21
#> year quarter month dayof_month day_of_week flight_date carrier tail_num
#> <int> <int> <int> <int> <int> <date> <chr> <chr>
#> 1 2017 1 1 1 7 2017-01-01 AA N802AA
#> 2 2017 1 1 2 1 2017-01-02 AA N766AA
#> 3 2017 1 1 3 2 2017-01-03 AA N760AA
#> 4 2017 1 1 4 3 2017-01-04 AA N753AA
#> 5 2017 1 1 5 4 2017-01-05 AA N744AA
#> 6 2017 1 1 6 5 2017-01-06 AA N815AA
#> # … with 7,164 more rows, and 13 more variables: flight_num <int>, origin <chr>,
#> # origin_city_name <chr>, dest <chr>, dest_city_name <chr>, dep_time <chr>,
#> # dep_delay <dbl>, taxi_out <dbl>, wheels_off <chr>, wheels_on <chr>,
#> # taxi_in <dbl>, arr_time <chr>, arr_delay <dbl>
The variable month is a number and hence we can use mathematical operations on it. For example, the preceding two commands would be equivalent to the following:
my.df %>%
  filter(
    month < 3
  )
#> # A tibble: 7,170 × 21
#> year quarter month dayof_month day_of_week flight_date carrier tail_num
#> <int> <int> <int> <int> <int> <date> <chr> <chr>
#> 1 2017 1 1 1 7 2017-01-01 AA N802AA
#> 2 2017 1 1 2 1 2017-01-02 AA N766AA
#> 3 2017 1 1 3 2 2017-01-03 AA N760AA
#> 4 2017 1 1 4 3 2017-01-04 AA N753AA
#> 5 2017 1 1 5 4 2017-01-05 AA N744AA
#> 6 2017 1 1 6 5 2017-01-06 AA N815AA
#> # … with 7,164 more rows, and 13 more variables: flight_num <int>, origin <chr>,
#> # origin_city_name <chr>, dest <chr>, dest_city_name <chr>, dep_time <chr>,
#> # dep_delay <dbl>, taxi_out <dbl>, wheels_off <chr>, wheels_on <chr>,
#> # taxi_in <dbl>, arr_time <chr>, arr_delay <dbl>
We could also have asked for all flights in January and February EXCEPT for flights originating from Los Angeles (i.e., “LAX”).
my.df %>%
  filter(
    month < 3 & origin != "LAX"
  )
#> # A tibble: 7,074 × 21
#> year quarter month dayof_month day_of_week flight_date carrier tail_num
#> <int> <int> <int> <int> <int> <date> <chr> <chr>
#> 1 2017 1 1 1 7 2017-01-01 AA N820AA
#> 2 2017 1 1 2 1 2017-01-02 AA N817AA
#> 3 2017 1 1 3 2 2017-01-03 AA N742AA
#> 4 2017 1 1 9 1 2017-01-09 AA N742AA
#> 5 2017 1 1 11 3 2017-01-11 AA N742AA
#> 6 2017 1 1 12 4 2017-01-12 AA N762AA
#> # … with 7,068 more rows, and 13 more variables: flight_num <int>, origin <chr>,
#> # origin_city_name <chr>, dest <chr>, dest_city_name <chr>, dep_time <chr>,
#> # dep_delay <dbl>, taxi_out <dbl>, wheels_off <chr>, wheels_on <chr>,
#> # taxi_in <dbl>, arr_time <chr>, arr_delay <dbl>
Yes, but I changed my mind. I also want to exclude flights from Phoenix (i.e., “PHX”).
my.df %>%
  filter(
    month < 3 & !origin %in% c("LAX", "PHX")
  )
#> # A tibble: 6,895 × 21
#> year quarter month dayof_month day_of_week flight_date carrier tail_num
#> <int> <int> <int> <int> <int> <date> <chr> <chr>
#> 1 2017 1 1 9 1 2017-01-09 AA N742AA
#> 2 2017 1 1 10 2 2017-01-10 AA N742AA
#> 3 2017 1 1 11 3 2017-01-11 AA N823AA
#> 4 2017 1 1 12 4 2017-01-12 AA N742AA
#> 5 2017 1 1 13 5 2017-01-13 AA N762AA
#> 6 2017 1 1 16 1 2017-01-16 AA N817AA
#> # … with 6,889 more rows, and 13 more variables: flight_num <int>, origin <chr>,
#> # origin_city_name <chr>, dest <chr>, dest_city_name <chr>, dep_time <chr>,
#> # dep_delay <dbl>, taxi_out <dbl>, wheels_off <chr>, wheels_on <chr>,
#> # taxi_in <dbl>, arr_time <chr>, arr_delay <dbl>
Note the different use of ! when I ran origin != "LAX" versus when I ran !origin %in% c("LAX", "PHX"). This difference is easily forgotten, but it matters. Specifically, if you are trying to exclude rows with more than one unique value, you have to put the ! before the variable name. If you are looking to exclude a single value, then != must be placed before the excluded value.
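The distinction is easier to see on a small toy vector (made up for illustration):

```r
origin <- c("LAX", "PHX", "CMH", "ORD")

# != excludes a single value
origin[origin != "LAX"]
#> [1] "PHX" "CMH" "ORD"

# ! negates the whole %in% test, excluding several values at once
origin[!origin %in% c("LAX", "PHX")]
#> [1] "CMH" "ORD"

# A common mistake: origin != c("LAX", "PHX") recycles the comparison
# element by element, and does NOT mean "not in this set"
```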
You could also have asked for all flights with arrival delays that exceed the Median or some other value (less than 60, for example).
my.df %>%
  filter(
    arr_delay > median(arr_delay, na.rm = TRUE)
  )
#> # A tibble: 17,122 × 21
#> year quarter month dayof_month day_of_week flight_date carrier tail_num
#> <int> <int> <int> <int> <int> <date> <chr> <chr>
#> 1 2017 1 1 2 1 2017-01-02 AA N766AA
#> 2 2017 1 1 6 5 2017-01-06 AA N815AA
#> 3 2017 1 1 9 1 2017-01-09 AA N835AA
#> 4 2017 1 1 15 7 2017-01-15 AA N752AA
#> 5 2017 1 1 19 4 2017-01-19 AA N760AA
#> 6 2017 1 1 22 7 2017-01-22 AA N766AA
#> # … with 17,116 more rows, and 13 more variables: flight_num <int>, origin <chr>,
#> # origin_city_name <chr>, dest <chr>, dest_city_name <chr>, dep_time <chr>,
#> # dep_delay <dbl>, taxi_out <dbl>, wheels_off <chr>, wheels_on <chr>,
#> # taxi_in <dbl>, arr_time <chr>, arr_delay <dbl>
my.df %>%
  filter(
    arr_delay < 60
  )
#> # A tibble: 33,564 × 21
#> year quarter month dayof_month day_of_week flight_date carrier tail_num
#> <int> <int> <int> <int> <int> <date> <chr> <chr>
#> 1 2017 1 1 1 7 2017-01-01 AA N802AA
#> 2 2017 1 1 2 1 2017-01-02 AA N766AA
#> 3 2017 1 1 3 2 2017-01-03 AA N760AA
#> 4 2017 1 1 4 3 2017-01-04 AA N753AA
#> 5 2017 1 1 5 4 2017-01-05 AA N744AA
#> 6 2017 1 1 6 5 2017-01-06 AA N815AA
#> # … with 33,558 more rows, and 13 more variables: flight_num <int>, origin <chr>,
#> # origin_city_name <chr>, dest <chr>, dest_city_name <chr>, dep_time <chr>,
#> # dep_delay <dbl>, taxi_out <dbl>, wheels_off <chr>, wheels_on <chr>,
#> # taxi_in <dbl>, arr_time <chr>, arr_delay <dbl>
Or arrival delays between 15 and 30 minutes, both values included?
my.df %>%
  filter(
    arr_delay >= 15 & arr_delay <= 30
  )
#> # A tibble: 2,490 × 21
#> year quarter month dayof_month day_of_week flight_date carrier tail_num
#> <int> <int> <int> <int> <int> <date> <chr> <chr>
#> 1 2017 1 1 2 1 2017-01-02 AA N766AA
#> 2 2017 1 1 9 1 2017-01-09 AA N835AA
#> 3 2017 1 1 29 7 2017-01-29 AA N823AA
#> 4 2017 1 1 11 3 2017-01-11 AA N823AA
#> 5 2017 1 1 12 4 2017-01-12 AA N752AA
#> 6 2017 1 1 23 1 2017-01-23 AA N825AA
#> # … with 2,484 more rows, and 13 more variables: flight_num <int>, origin <chr>,
#> # origin_city_name <chr>, dest <chr>, dest_city_name <chr>, dep_time <chr>,
#> # dep_delay <dbl>, taxi_out <dbl>, wheels_off <chr>, wheels_on <chr>,
#> # taxi_in <dbl>, arr_time <chr>, arr_delay <dbl>
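As an aside, {dplyr} also provides between(), an inclusive range test that reads a little more cleanly; a sketch of how the filter above could be rewritten, demonstrated on a small made-up vector:

```r
library(dplyr)

# between(x, left, right) is TRUE when left <= x <= right (both ends included)
delays <- c(10, 15, 22, 30, 31)
between(delays, 15, 30)
#> [1] FALSE  TRUE  TRUE  TRUE FALSE

# So the preceding filter could be written as:
# my.df %>% filter(between(arr_delay, 15, 30))
```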
Here is a summary table of common operators and what they represent.
Operator | Meaning
---|---
!= | Not equal to
& | And
, | And
< | Less than
<= | Less than or equal to
> | Greater than
>= | Greater than or equal to
== | Equal to
%in% | Value is in
3.1.4 arrange()
Let us now sort the data frame, first by year, month, and dayof_month, and then by arrival delay (arr_delay).
my.df %>%
  select(
    21, 13, 7, 1, 3, 4
  ) -> tab1

tab1 %>%
  arrange(
    year, month, dayof_month
  )
#> # A tibble: 35,993 × 6
#> arr_delay dest_city_name carrier year month dayof_month
#> <dbl> <chr> <chr> <int> <int> <int>
#> 1 -14 Columbus, OH AA 2017 1 1
#> 2 -8 Columbus, OH AA 2017 1 1
#> 3 -19 Los Angeles, CA AA 2017 1 1
#> 4 -18 Columbus, OH AA 2017 1 1
#> 5 23 Dallas/Fort Worth, TX AA 2017 1 1
#> 6 34 Columbus, OH AA 2017 1 1
#> # … with 35,987 more rows
When we ask for the data to be arranged by the values of one or more variables, unless we indicate otherwise {dplyr} assumes we want the default – which happens to be ascending order of values. This is evident from the example below.
my.df %>%
  select(
    flight_date, carrier, flight_num, arr_delay
  ) %>%
  arrange(
    arr_delay
  )
#> # A tibble: 35,993 × 4
#> flight_date carrier flight_num arr_delay
#> <date> <chr> <int> <dbl>
#> 1 2017-02-18 EV 4338 -62
#> 2 2017-01-23 EV 4148 -59
#> 3 2017-02-15 AA 659 -57
#> 4 2017-02-21 EV 4246 -57
#> 5 2017-02-16 F9 1087 -57
#> 6 2017-03-29 AA 659 -57
#> # … with 35,987 more rows
If we would rather have the sort be in descending order of the variable’s values, then we will need to make an explicit call to that effect via desc(some-variable)
, as shown below.
my.df %>%
  select(
    flight_date, carrier, flight_num, arr_delay
  ) %>%
  arrange(
    desc(arr_delay)
  )
#> # A tibble: 35,993 × 4
#> flight_date carrier flight_num arr_delay
#> <date> <chr> <int> <dbl>
#> 1 2017-06-23 UA 349 1327
#> 2 2017-07-17 OO 4798 1234
#> 3 2017-04-25 EV 3908 1118
#> 4 2017-09-07 EV 4335 1099
#> 5 2017-09-11 DL 1276 1063
#> 6 2017-04-02 EV 4246 1053
#> # … with 35,987 more rows
Could we have used the - symbol instead of desc()? Yes, we can.
my.df %>%
  select(
    flight_date, carrier, flight_num, arr_delay
  ) %>%
  arrange(
    -arr_delay
  )
#> # A tibble: 35,993 × 4
#> flight_date carrier flight_num arr_delay
#> <date> <chr> <int> <dbl>
#> 1 2017-06-23 UA 349 1327
#> 2 2017-07-17 OO 4798 1234
#> 3 2017-04-25 EV 3908 1118
#> 4 2017-09-07 EV 4335 1099
#> 5 2017-09-11 DL 1276 1063
#> 6 2017-04-02 EV 4246 1053
#> # … with 35,987 more rows
Note the difference between arr_delay
and desc(arr_delay)
– the former sorts the data-frame in ascending order of arrival delay times but the latter does it in descending order of arrival delay times. In addition, -arr_delay
works similarly to desc(arr_delay)
.
We could also do more complicated sorting orders if we needed to, using multiple columns, and not necessarily all in the same ascending/descending sort sequence either.
Below, for example, we ask for the data-frame to be sorted in ascending order of arrival delays first and then in ascending order of the name of the destination city.
my.df %>%
  select(
    flight_date, carrier, flight_num, arr_delay, dest_city_name
  ) %>%
  arrange(
    arr_delay, dest_city_name
  )
#> # A tibble: 35,993 × 5
#> flight_date carrier flight_num arr_delay dest_city_name
#> <date> <chr> <int> <dbl> <chr>
#> 1 2017-02-18 EV 4338 -62 Houston, TX
#> 2 2017-01-23 EV 4148 -59 Houston, TX
#> 3 2017-02-21 EV 4246 -57 Houston, TX
#> 4 2017-05-06 EV 4338 -57 Houston, TX
#> 5 2017-02-16 F9 1087 -57 Las Vegas, NV
#> 6 2017-02-15 AA 659 -57 Los Angeles, CA
#> # … with 35,987 more rows
We could further complicate it by adding carrier as yet another ascending sort level.
my.df %>%
  select(
    flight_date, carrier, flight_num, arr_delay, dest_city_name
  ) %>%
  arrange(
    arr_delay, dest_city_name, carrier
  )
#> # A tibble: 35,993 × 5
#> flight_date carrier flight_num arr_delay dest_city_name
#> <date> <chr> <int> <dbl> <chr>
#> 1 2017-02-18 EV 4338 -62 Houston, TX
#> 2 2017-01-23 EV 4148 -59 Houston, TX
#> 3 2017-02-21 EV 4246 -57 Houston, TX
#> 4 2017-05-06 EV 4338 -57 Houston, TX
#> 5 2017-02-16 F9 1087 -57 Las Vegas, NV
#> 6 2017-02-15 AA 659 -57 Los Angeles, CA
#> # … with 35,987 more rows
Of course, as you may have guessed, we could sort in ascending order with one variable and in descending order with another:
my.df %>%
  select(
    flight_date, carrier, flight_num, arr_delay, dest_city_name
  ) %>%
  arrange(
    arr_delay, desc(dest_city_name)
  )
#> # A tibble: 35,993 × 5
#> flight_date carrier flight_num arr_delay dest_city_name
#> <date> <chr> <int> <dbl> <chr>
#> 1 2017-02-18 EV 4338 -62 Houston, TX
#> 2 2017-01-23 EV 4148 -59 Houston, TX
#> 3 2017-02-15 AA 659 -57 Los Angeles, CA
#> 4 2017-03-29 AA 659 -57 Los Angeles, CA
#> 5 2017-02-16 F9 1087 -57 Las Vegas, NV
#> 6 2017-02-21 EV 4246 -57 Houston, TX
#> # … with 35,987 more rows
3.1.5 rename()
If we want to rename some columns we can do so via rename()
. Note that here I am saving the result as a data-frame called flights
purely for demonstration purposes.
myflights %>%
  rename(
    arrival.delay = arr_delay,
    departure.delay = dep_delay,
    arrival.time = arr_time,
    departure.time = dep_time
  ) -> flights
Remember the sequence here … rename(new_name = old_name)
. Note also that you can use rename_with(…)
to simultaneously rename multiple columns, as shown below.
myflights %>%
  rename_with(
    toupper
  ) %>%
  head()
#> # A tibble: 6 × 110
#> YEAR QUARTER MONTH DAYOF_MONTH DAY_OF_WEEK FLIGHT_DATE UNIQUE_CARRIER AIRLINE_ID
#> <int> <int> <int> <int> <int> <date> <chr> <int>
#> 1 2017 1 1 1 7 2017-01-01 AA 19805
#> 2 2017 1 1 2 1 2017-01-02 AA 19805
#> 3 2017 1 1 3 2 2017-01-03 AA 19805
#> 4 2017 1 1 4 3 2017-01-04 AA 19805
#> 5 2017 1 1 5 4 2017-01-05 AA 19805
#> 6 2017 1 1 6 5 2017-01-06 AA 19805
#> # … with 102 more variables: CARRIER <chr>, TAIL_NUM <chr>, FLIGHT_NUM <int>,
#> # ORIGIN_AIRPORT_ID <int>, ORIGIN_AIRPORT_SEQ_ID <int>,
#> # ORIGIN_CITY_MARKET_ID <int>, ORIGIN <chr>, ORIGIN_CITY_NAME <chr>,
#> # ORIGIN_STATE <chr>, ORIGIN_STATE_FIPS <chr>, ORIGIN_STATE_NAME <chr>,
#> # ORIGIN_WAC <int>, DEST_AIRPORT_ID <int>, DEST_AIRPORT_SEQ_ID <int>,
#> # DEST_CITY_MARKET_ID <int>, DEST <chr>, DEST_CITY_NAME <chr>, DEST_STATE <chr>,
#> # DEST_STATE_FIPS <chr>, DEST_STATE_NAME <chr>, DEST_WAC <int>, …
Here, all columns have been changed to be uppercase via toupper
but other transformations are possible as well.
Maybe this is a mistake and we only want city names flipped into uppercase. How could I do that?
myflights %>%
  rename_with(
    toupper, contains("city")
  ) %>%
  select(contains("CITY"))
#> # A tibble: 35,993 × 4
#> ORIGIN_CITY_MARKET_ID ORIGIN_CITY_NAME DEST_CITY_MARKET_ID DEST_CITY_NAME
#> <int> <chr> <int> <chr>
#> 1 32575 Los Angeles, CA 31066 Columbus, OH
#> 2 32575 Los Angeles, CA 31066 Columbus, OH
#> 3 32575 Los Angeles, CA 31066 Columbus, OH
#> 4 32575 Los Angeles, CA 31066 Columbus, OH
#> 5 32575 Los Angeles, CA 31066 Columbus, OH
#> 6 32575 Los Angeles, CA 31066 Columbus, OH
#> # … with 35,987 more rows
There are trickier rename_with(...)
operations possible that have not been outlined here.
3.1.6 distinct()
Rows with unique combinations of column values can be retained with ease. This is essentially making sure no row is duplicated in the data-frame. For example, what are the unique aircraft, flagged by the tail number (tail_num
), that fly through Columbus? This is extracted by distinct(select(...))
, and can be done for a single variable or for more than one variable. The first example says please keep just one row per aircraft (based on tail_num
).
myflights %>%
  distinct(
    select(
      ., tail_num
    )
  )
Next we ask for rows with unique combinations of origin
and dest
values.
myflights %>%
  distinct(
    select(
      ., origin, dest
    )
  )
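You can also name the columns in distinct() directly, without the inner select(), which is the more common style today. A toy sketch with made-up origin/destination pairs:

```r
library(dplyr)

toy <- tibble(
  origin = c("CMH", "CMH", "LAX"),
  dest   = c("LAX", "LAX", "CMH")
)

# Keep one row per unique (origin, dest) combination
toy %>%
  distinct(origin, dest)
#> # A tibble: 2 × 2
#>   origin dest
#>   <chr>  <chr>
#> 1 CMH    LAX
#> 2 LAX    CMH
```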
3.1.7 mutate()
If we want to calculate something based on existing columns and then add these new calculated values to the data frame, mutate(...) comes in handy. For example, if a flight has a longer dep_delay than arr_delay, then we know the flight made up some of the delayed time. Let us call this gain and let us also measure it in our data-set. When we create gain with mutate(...) it will be added to the existing data-frame. You see this operation below.
myflights %>%
  mutate(
    gain = (arr_delay - dep_delay)
  ) -> myflights
summary(myflights$gain)
#> Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
#> -63.0 -13.0 -7.0 -5.9 -1.0 210.0 565
One of the amazing things about the mutate(...) command is that it allows you to refer to a previous calculation in the same command (see below):
myflights %>%
  mutate(
    gain = (arr_delay - dep_delay),
    gain_per_flight_hour = (gain / 60)
  )
#> # A tibble: 35,993 × 112
#> year quarter month dayof_month day_of_week flight_date unique_carrier airline_id
#> <int> <int> <int> <int> <int> <date> <chr> <int>
#> 1 2017 1 1 1 7 2017-01-01 AA 19805
#> 2 2017 1 1 2 1 2017-01-02 AA 19805
#> 3 2017 1 1 3 2 2017-01-03 AA 19805
#> 4 2017 1 1 4 3 2017-01-04 AA 19805
#> 5 2017 1 1 5 4 2017-01-05 AA 19805
#> 6 2017 1 1 6 5 2017-01-06 AA 19805
#> # … with 35,987 more rows, and 104 more variables: carrier <chr>, tail_num <chr>,
#> # flight_num <int>, origin_airport_id <int>, origin_airport_seq_id <int>,
#> # origin_city_market_id <int>, origin <chr>, origin_city_name <chr>,
#> # origin_state <chr>, origin_state_fips <chr>, origin_state_name <chr>,
#> # origin_wac <int>, dest_airport_id <int>, dest_airport_seq_id <int>,
#> # dest_city_market_id <int>, dest <chr>, dest_city_name <chr>, dest_state <chr>,
#> # dest_state_fips <chr>, dest_state_name <chr>, dest_wac <int>, …
Note that gain_per_flight_hour was calculated here by dividing gain (measured in minutes) by 60. To express the gain per hour of flying time you would instead divide gain by (air_time / 60), since longer flights can gain (i.e., make up) more lost time than can shorter flights.
#> # A tibble: 35,993 × 5
#> air_time arr_delay dep_delay gain gain_per_flight_hour
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 223 -14 -9 -5 -0.0833
#> 2 213 29 24 5 0.0833
#> 3 212 -37 -6 -31 -0.517
#> 4 215 -25 -5 -20 -0.333
#> 5 210 -22 -7 -15 -0.25
#> 6 222 13 22 -9 -0.15
#> # … with 35,987 more rows
We will return to mutate()
later with more nuanced operations, but for now let us look at summarise(...)
– also known as summarize(...)
.
3.1.8 summarise()
summarise() is another useful function: whereas mutate() adds the calculated variable(s) to the existing data-set, summarise() creates a summary table that also doubles as a data-frame. For example, say I want to calculate means and standard deviations of arrival delays and departure delays:
myflights %>%
  summarise(
    mean_arr_delay = mean(arr_delay, na.rm = TRUE),
    sd_arr_delay = sd(arr_delay, na.rm = TRUE),
    mean_dep_delay = mean(dep_delay, na.rm = TRUE),
    sd_dep_delay = sd(dep_delay, na.rm = TRUE)
  ) -> tab.01
mean_arr_delay | sd_arr_delay | mean_dep_delay | sd_dep_delay
---|---|---|---
3.258 | 41.79 | 9.202 | 39.77
The key difference between mutate(...) and summarise(...) is that mutate(...) will generate the variable(s) you ask for and add them to the data-set. On the other hand, summarise(...) will reduce things to a summary table, so you no longer have the entire data-set to work with. This should be clear from the lone example shown above.
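The contrast is easy to see side by side on a toy data-frame (made up for illustration):

```r
library(dplyr)

toy <- tibble(delay = c(10, -5, 20))

# mutate() keeps all rows and adds a column
toy %>%
  mutate(mean_delay = mean(delay))
#> # A tibble: 3 × 2
#>   delay mean_delay
#>   <dbl>      <dbl>
#> 1    10       8.33
#> 2    -5       8.33
#> 3    20       8.33

# summarise() collapses the rows to a single summary row
toy %>%
  summarise(mean_delay = mean(delay))
#> # A tibble: 1 × 1
#>   mean_delay
#>        <dbl>
#> 1       8.33
```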
3.1.9 Sampling
You can draw random samples from your data frame as well. The code below shows a random sub-sample of 15 observations and then a random sub-sample of 2% of the full data, respectively.
myflights %>%
  sample_n(15) # 15 randomly sampled observations
#> # A tibble: 15 × 111
#> year quarter month dayof_month day_of_week flight_date unique_carrier airline_id
#> <int> <int> <int> <int> <int> <date> <chr> <int>
#> 1 2017 1 1 18 3 2017-01-18 WN 19393
#> 2 2017 1 1 22 7 2017-01-22 UA 19977
#> 3 2017 2 6 29 4 2017-06-29 WN 19393
#> 4 2017 3 9 24 7 2017-09-24 WN 19393
#> 5 2017 2 4 16 7 2017-04-16 AA 19805
#> 6 2017 3 7 14 5 2017-07-14 WN 19393
#> # … with 9 more rows, and 103 more variables: carrier <chr>, tail_num <chr>,
#> # flight_num <int>, origin_airport_id <int>, origin_airport_seq_id <int>,
#> # origin_city_market_id <int>, origin <chr>, origin_city_name <chr>,
#> # origin_state <chr>, origin_state_fips <chr>, origin_state_name <chr>,
#> # origin_wac <int>, dest_airport_id <int>, dest_airport_seq_id <int>,
#> # dest_city_market_id <int>, dest <chr>, dest_city_name <chr>, dest_state <chr>,
#> # dest_state_fips <chr>, dest_state_name <chr>, dest_wac <int>, …
myflights %>%
  sample_frac(.02) # a 2% random sample
#> # A tibble: 720 × 111
#> year quarter month dayof_month day_of_week flight_date unique_carrier airline_id
#> <int> <int> <int> <int> <int> <date> <chr> <int>
#> 1 2017 2 6 25 7 2017-06-25 WN 19393
#> 2 2017 3 7 7 5 2017-07-07 WN 19393
#> 3 2017 1 1 14 6 2017-01-14 EV 20366
#> 4 2017 2 5 16 2 2017-05-16 WN 19393
#> 5 2017 2 5 6 6 2017-05-06 DL 19790
#> 6 2017 1 3 29 3 2017-03-29 WN 19393
#> # … with 714 more rows, and 103 more variables: carrier <chr>, tail_num <chr>,
#> # flight_num <int>, origin_airport_id <int>, origin_airport_seq_id <int>,
#> # origin_city_market_id <int>, origin <chr>, origin_city_name <chr>,
#> # origin_state <chr>, origin_state_fips <chr>, origin_state_name <chr>,
#> # origin_wac <int>, dest_airport_id <int>, dest_airport_seq_id <int>,
#> # dest_city_market_id <int>, dest <chr>, dest_city_name <chr>, dest_state <chr>,
#> # dest_state_fips <chr>, dest_state_name <chr>, dest_wac <int>, …
These commands have been superseded by other commands that we will see later on in this text.
3.2 group_by()
So far we have been running these commands on all observations rather than by some group indicator. For example, what if we wanted to run some calculations for each unique aircraft (identified by tail_num)? The first thing we will do is create the grouping structure by specifying that we want each unique value of tail_num to be treated as a unique group.
myflights %>%
  group_by(tail_num) %>%
  summarize(
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    mean.arrival.delay = mean(arr_delay, na.rm = TRUE),
    mean.departure.delay = mean(dep_delay, na.rm = TRUE)
  )
#> # A tibble: 2,248 × 5
#> tail_num count dist mean.arrival.delay mean.departure.delay
#> <chr> <int> <dbl> <dbl> <dbl>
#> 1 N008AA 2 926 -14 -3
#> 2 N013AA 2 926 -9 -1.5
#> 3 N015AA 4 926 53.8 63.5
#> 4 N020AA 2 926 150 162
#> 5 N024AA 2 926 20 15
#> 6 N028AA 2 926 308. 298.
#> # … with 2,242 more rows
If you look at the result you will see 2,248 rows (one per aircraft) and five columns – tail_num identifies the aircraft, count
tells you how many times each aircraft shows up in the database, dist
tells you the average distance flown, mean.arrival.delay
tells you the mean arrival delay, and mean.departure.delay
tells you the mean departure delay.
Now, is the mean arrival delay related to the average distance flown? We will ask and answer this question via a simple scatter-plot on a subset of the preceding result. The subset will be defined such that the aircraft must have more than 20 arrivals or departures, and must have flown an average distance of less than 2,000 miles.
myflights %>%
  group_by(tail_num) %>%
  summarize(
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    mean.arrival.delay = mean(arr_delay, na.rm = TRUE),
    mean.departure.delay = mean(dep_delay, na.rm = TRUE)
  ) -> tab.02

tab.02 %>%
  filter(
    count > 20, dist < 2000
  ) -> tab.03

plot(tab.03$dist, tab.03$mean.arrival.delay)
What if we wanted to know the number of flights to each destination listed in the data frame?
myflights %>%
  group_by(dest) %>%
  summarise(
    nflights = n()
  )
Note that n()
counts the number of rows, and hence in the preceding code nflights = n()
records the number of times each unique value of dest
is seen in the data-frame.
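As a cross-check on the idea, base R's table() performs the same per-group counting; here is a minimal sketch on a made-up dest vector (the airport codes below are illustrative, not drawn from cmhflights):

```r
# Toy vector standing in for the dest column (values are illustrative)
dest <- c("ATL", "ATL", "ORD", "DFW", "ORD", "ATL")

# table() counts how many times each unique value appears, which is
# what group_by(dest) %>% summarise(nflights = n()) computes per group
nflights <- table(dest)
nflights[["ATL"]]  # 3 flights to ATL in this toy vector
```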
What if we wanted to count the number of flights to each destination by each month?
myflights %>%
  group_by(
    month, dest
  ) %>%
  summarise(
    nflights = n()
  )
Note that now the result has an extra column (month
) because you asked for the number of flights to each destination for each month.
Now for some fun. Say we want to know, for each airline, the number of flights per day to each destination.
myflights %>%
  group_by(carrier, dayof_month, dest) %>%
  summarise(
    nflights = n()
  )
The function group_by(...)
identifies groups composed of unique values of the variables listed in (...)
. The preceding example is asking for unique combinations of carrier
, dayof_month
, and dest
.
What if we wanted to know the number of flights by month rather than by day?
myflights %>%
  group_by(carrier, month, dest) %>%
  summarise(
    nflights = n()
  )
And in the year (January through September)?
myflights %>%
  group_by(carrier, dest) %>%
  summarise(
    nflights = n()
  )
Perhaps I am curious about the mean departure delay from Columbus, by Airline, regardless of destination.
myflights %>%
  filter(origin == "CMH") %>%
  group_by(carrier) %>%
  summarise(
    Mean.Dep.Delay = mean(dep_delay, na.rm = TRUE),
    SD.Dep.Delay = sd(dep_delay, na.rm = TRUE)
  )
#> # A tibble: 7 × 3
#> carrier Mean.Dep.Delay SD.Dep.Delay
#> <chr> <dbl> <dbl>
#> 1 AA 6.92 37.2
#> 2 DL 5.25 41.4
#> 3 EV 22.2 93.4
#> 4 F9 11.0 40.7
#> 5 OO 5.98 58.4
#> 6 UA 7.52 62.1
#> # … with 1 more row
AA is American Airlines, DL is Delta Airlines, EV is ExpressJet, F9 is Frontier Airlines, OO is SkyWest Airlines, UA is United Airlines, and WN is Southwest Airlines. Based on the preceding result, Delta appears to have the shortest mean departure delay. Would the result be any different if we relied on the median instead?
myflights %>%
  filter(origin == "CMH") %>%
  group_by(carrier) %>%
  summarise(
    Median.Dep.Delay = median(dep_delay, na.rm = TRUE),
    SD.Dep.Delay = sd(dep_delay, na.rm = TRUE)
  )
#> # A tibble: 7 × 3
#> carrier Median.Dep.Delay SD.Dep.Delay
#> <chr> <dbl> <dbl>
#> 1 AA -4 37.2
#> 2 DL -3 41.4
#> 3 EV -3 93.4
#> 4 F9 -2 40.7
#> 5 OO -5 58.4
#> 6 UA -4 62.1
#> # … with 1 more row
Aha! Judging by the rows shown, it is SkyWest (OO) with the shortest median departure delay, not Delta, so yes, the picture changes.
Are flight departure delays worse for some destinations than for others? And assuming the answer is yes, which two destinations are the worst?
myflights %>%
  filter(origin == "CMH") %>%
  group_by(dest) %>%
  summarise(
    Mean.Dep.Delay = mean(dep_delay, na.rm = TRUE)
  ) %>%
  arrange(-Mean.Dep.Delay)
#> # A tibble: 26 × 2
#> dest Mean.Dep.Delay
#> <chr> <dbl>
#> 1 LGA 34.3
#> 2 EWR 31.0
#> 3 MSY 21.9
#> 4 OAK 14.9
#> 5 BOS 14.4
#> 6 DTW 12.8
#> # … with 20 more rows
Ah, as expected, flying to La Guardia, NY (LGA) is the worst, followed by Newark, NJ (EWR).
What if we wanted to generate a relative frequency column that showed the percent of flights to each destination, by airline, from Columbus?
myflights %>%
  filter(origin == "CMH") %>%
  group_by(carrier, dest) %>%
  summarise(
    Number.of.flights = n()
  ) %>%
  group_by(carrier) %>%
  mutate(
    Percent.of.flights = (
      Number.of.flights / sum(Number.of.flights)
    ) * 100
  )
Clearly, the majority of Delta flights were to Atlanta, followed by Minneapolis.
If we wanted to know what percent of flights to a given destination are from a given airline, we could modify the code just a bit.
myflights %>%
  filter(origin == "CMH") %>%
  group_by(carrier, dest) %>%
  summarise(
    Number.of.flights = n()
  ) %>%
  group_by(dest) %>%
  mutate(
    Percent.of.flights = (
      Number.of.flights / sum(Number.of.flights)
    ) * 100
  )
All flights to Philadelphia were American Airlines, and similarly to Dallas Fort Worth. Some 74.2% of the flights to Atlanta were by Delta, and so on.
What is noteworthy here, at least for now, is the fact that we have two group_by(...)
commands, and we are also following up summarise(...)
with a mutate(...)
command to arrive at our desired result.
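The summarise-then-mutate pattern boils down to computing group totals and dividing each count by its group's total. The same within-group calculation can be sketched in base R with ave(); the counts below are made up purely for illustration:

```r
# Made-up flight counts per (carrier, dest) pair, for illustration only
counts <- data.frame(
  carrier = c("DL", "DL", "WN", "WN"),
  dest    = c("ATL", "MSP", "ATL", "MDW"),
  n       = c(30, 10, 20, 40)
)

# ave() applies a function within each group and returns the results in
# the original row order, mirroring group_by(carrier) %>% mutate(pct = ...)
counts$pct <- ave(counts$n, counts$carrier,
                  FUN = function(x) x / sum(x) * 100)
counts$pct  # 75, 25, 33.33..., 66.66...
```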
3.3 unnest()
The more complicated the data you are pulling down, or perhaps I should say the more raw the data are that you are pulling down, the more likely it is that you will run into list-columns
– columns whose cells contain lists, possibly with several values per row, rather than single atomic values. The unnest
function from the tidyverse
unpacks these list-columns into a regular tibble, one value per row. The following example is from the unnest page available here.
Here you will see a small data-set called df that we have created. Note the odd structure: three values of y
are packed into a single cell for x = 2
, and similarly two values of y
for x = 3
. This does not happen very often but you do encounter data in this convoluted structure.
tibble(
  x = 1:3,
  y = c("a", "d,e,f", "g,h")
) -> df

df
#> # A tibble: 3 × 2
#> x y
#> <int> <chr>
#> 1 1 a
#> 2 2 d,e,f
#> 3 3 g,h
We ask for it to be cleaned up by splitting the column y
wherever a comma is seen, and then organizing the result such that we have what looks like a ‘proper’ data-set.
df %>%
  transform(
    y = strsplit(y, ",")
  ) %>%
  unnest(y)
#> # A tibble: 6 × 2
#> x y
#> <int> <chr>
#> 1 1 a
#> 2 2 d
#> 3 2 e
#> 4 2 f
#> 5 3 g
#> 6 3 h
Here is a trickier example.
tibble(
  x = 1:2,
  y = list(a = 1, b = 3:4)
) -> df

df
#> # A tibble: 2 × 2
#> x y
#> <int> <named list>
#> 1 1 <dbl [1]>
#> 2 2 <int [2]>
With a named list-column you can specify .id
to capture the names in a new column, and you also have to specify cols = c(column-name)
:
df %>%
  unnest(
    cols = c(y),
    .id = "name"
  )
#> Warning: The `.id` argument of `unnest()` is deprecated as of tidyr 1.0.0.
#> Manually create column of names instead.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
#> # A tibble: 3 × 3
#> x y name
#> <int> <dbl> <chr>
#> 1 1 1 a
#> 2 2 3 b
#> 3 2 4 b
3.4 Using {tidyr}
The {tidyr}
package is another useful tool in the data analysts’ toolkit. In fact very often you will see {tidyr}
being used in tandem with {dplyr}
because they complement each other well. One thing that we will do here is to use the pipe operator %>%
because it is very efficient to chain commands.
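To see why chaining is efficient, compare a nested call with its piped equivalent. This toy example uses base R's native |> pipe (available in R >= 4.1), which behaves like %>% for simple first-argument chaining:

```r
# Nested form: must be read inside-out
sum(sqrt(c(4, 9, 16)))

# Piped form: reads left to right; each result feeds the
# first argument of the next call
c(4, 9, 16) |> sqrt() |> sum()  # both give 9
```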
3.4.1 tidyr’s Key Functions
There are four {tidyr}
functions you are likely to use.
Function | What it does … |
---|---|
separate(...) |
splits a single column into multiple columns |
unite(...) |
combines multiple columns into a single column |
pivot_wider(...) |
takes two columns (key & value) and spreads in to multiple columns, it makes “long” data wider |
pivot_longer(...) |
takes multiple columns, and gathers them into key-value pairs: it makes “wide” data longer |
Before we do anything else, let us read in the data-set we will work with. These data reflect the Census Bureau’s population estimates for Ohio counties, spanning 2010 through 2016. I’ll start by reading in the raw data.
library(readr)
library(here)
#> here() starts at /Users/ruhil/Documents/github/RBook_new
read_csv(
  here(
    "data",
    "PEP_2016_PEPANNRES_with_ann.csv"
  ),
  skip = 1
) -> popest
#> Rows: 88 Columns: 12
#> ── Column specification ─────────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (2): Id, Geography
#> dbl (10): Id2, April 1, 2010 - Census, April 1, 2010 - Estimates Base, Populatio...
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
colnames(popest)[4] = "DecennialCens2010"
colnames(popest)[5] = "PopestBase2010"
colnames(popest)[6] = "Popest2010"
colnames(popest)[7] = "Popest2011"
colnames(popest)[8] = "Popest2012"
colnames(popest)[9] = "Popest2013"
colnames(popest)[10] = "Popest2014"
colnames(popest)[11] = "Popest2015"
colnames(popest)[12] = "Popest2016"
save(
popest, file = "data/popest.RData"
)
Id | Id2 | Geography | DecennialCens2010 | PopestBase2010 | Popest2010 |
---|---|---|---|---|---|
0500000US39001 | 39001 | Adams County, Ohio | 28550 | 28554 | 28562 |
0500000US39003 | 39003 | Allen County, Ohio | 106331 | 106326 | 106395 |
0500000US39005 | 39005 | Ashland County, Ohio | 53139 | 53139 | 53319 |
0500000US39007 | 39007 | Ashtabula County, Ohio | 101497 | 101488 | 101394 |
0500000US39009 | 39009 | Athens County, Ohio | 64757 | 64772 | 65221 |
0500000US39011 | 39011 | Auglaize County, Ohio | 45949 | 45949 | 45932 |
If you open the data frame you will see there are 88 counties, with population estimates for multiple years (2010 through 2016). Notice that Geography
lists the county name and the state. Say we want to split this into two columns, one with just the county name and another with the state name. How could we do this? With separate(...).
3.4.2 separate()
library(tidyr)
popest %>%
  separate(
    col = Geography,
    into = c("County", "State"),
    sep = ",",
    remove = FALSE
  )
#> # A tibble: 88 × 14
#> Id Id2 Geography County State DecennialCens20… PopestBase2010 Popest2010
#> <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 050000… 39001 Adams Coun… Adams … " Oh… 28550 28554 28562
#> 2 050000… 39003 Allen Coun… Allen … " Oh… 106331 106326 106395
#> 3 050000… 39005 Ashland Co… Ashlan… " Oh… 53139 53139 53319
#> 4 050000… 39007 Ashtabula … Ashtab… " Oh… 101497 101488 101394
#> 5 050000… 39009 Athens Cou… Athens… " Oh… 64757 64772 65221
#> 6 050000… 39011 Auglaize C… Auglai… " Oh… 45949 45949 45932
#> # … with 82 more rows, and 6 more variables: Popest2011 <dbl>, Popest2012 <dbl>,
#> # Popest2013 <dbl>, Popest2014 <dbl>, Popest2015 <dbl>, Popest2016 <dbl>
Note the sequence: I asked that the column called Geography
be split into two, c("County", "State")
, based on a comma separator, i.e., sep = ","
.
I could further clean up the County column since I don’t want the word “County” showing up needlessly; we know these are counties after all!
popest %>%
  separate(
    col = Geography,
    into = c("County", "State"),
    sep = ",",
    remove = FALSE
  ) %>%
  mutate(
    County = gsub(" County", "", County)
  )
#> # A tibble: 88 × 14
#> Id Id2 Geography County State DecennialCens20… PopestBase2010 Popest2010
#> <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 050000… 39001 Adams Count… Adams " Oh… 28550 28554 28562
#> 2 050000… 39003 Allen Count… Allen " Oh… 106331 106326 106395
#> 3 050000… 39005 Ashland Cou… Ashla… " Oh… 53139 53139 53319
#> 4 050000… 39007 Ashtabula C… Ashta… " Oh… 101497 101488 101394
#> 5 050000… 39009 Athens Coun… Athens " Oh… 64757 64772 65221
#> 6 050000… 39011 Auglaize Co… Augla… " Oh… 45949 45949 45932
#> # … with 82 more rows, and 6 more variables: Popest2011 <dbl>, Popest2012 <dbl>,
#> # Popest2013 <dbl>, Popest2014 <dbl>, Popest2015 <dbl>, Popest2016 <dbl>
Here I asked that every instance of " County" be replaced with nothing. Note I also specified " County"
and not "County"
since the latter would leave a blank space after every county’s name. We could also split the five-digit Id2 variable into two parts – the state FIPS code made up of the first two digits (here 39
) and the county code made up of the next three digits. Since we know this break-point I’ll set sep = 2
, indicating that it should break Id2
after the second position.
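Both details in this paragraph – the leading space in the gsub() pattern and the numeric break-point – are easy to verify in base R on a single value:

```r
geo <- "Adams County"
gsub(" County", "", geo)  # "Adams"  -- pattern includes the space
gsub("County", "", geo)   # "Adams " -- leaves a trailing blank behind

# separate() with sep = 2 breaks a string after the second character;
# substr() reproduces that split for a single five-digit FIPS code
id2 <- "39001"
substr(id2, 1, 2)  # "39"  (state FIPS)
substr(id2, 3, 5)  # "001" (county code)
```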
popest %>%
  separate(
    col = Geography,
    into = c("County", "State"),
    sep = ", ",
    remove = FALSE
  ) %>%
  mutate(
    County = gsub(" County", "", County)
  ) %>%
  separate(
    col = Id2,
    into = c("StateFIPS", "CountyFIPS"),
    sep = 2,
    remove = FALSE
  )
#> # A tibble: 88 × 16
#> Id Id2 StateFIPS CountyFIPS Geography County State DecennialCens20…
#> <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl>
#> 1 0500000US39001 39001 39 001 Adams Cou… Adams Ohio 28550
#> 2 0500000US39003 39003 39 003 Allen Cou… Allen Ohio 106331
#> 3 0500000US39005 39005 39 005 Ashland C… Ashla… Ohio 53139
#> 4 0500000US39007 39007 39 007 Ashtabula… Ashta… Ohio 101497
#> 5 0500000US39009 39009 39 009 Athens Co… Athens Ohio 64757
#> 6 0500000US39011 39011 39 011 Auglaize … Augla… Ohio 45949
#> # … with 82 more rows, and 8 more variables: PopestBase2010 <dbl>,
#> # Popest2010 <dbl>, Popest2011 <dbl>, Popest2012 <dbl>, Popest2013 <dbl>,
#> # Popest2014 <dbl>, Popest2015 <dbl>, Popest2016 <dbl>
3.4.3 unite()
Of course, we could also work backwards, creating a single column from two or more columns via unite(...)
and I will demonstrate this with the StateFIPS and CountyFIPS columns. Let me create these columns and save them as a data-frame.
popest %>%
  separate(
    col = Id2,
    into = c("StateFIPS", "CountyFIPS"),
    sep = 2,
    remove = FALSE
  ) %>%
  separate(
    col = Geography,
    into = c("County", "State"),
    sep = ", ",
    remove = FALSE
  ) %>%
  mutate(
    County = gsub(" County", "", County)
  ) %>%
  select(
    StateFIPS, CountyFIPS, County, State
  ) -> scdf

scdf
#> # A tibble: 88 × 4
#> StateFIPS CountyFIPS County State
#> <chr> <chr> <chr> <chr>
#> 1 39 001 Adams Ohio
#> 2 39 003 Allen Ohio
#> 3 39 005 Ashland Ohio
#> 4 39 007 Ashtabula Ohio
#> 5 39 009 Athens Ohio
#> 6 39 011 Auglaize Ohio
#> # … with 82 more rows
scdf %>%
  unite(
    col = CountyStateFIPS,
    StateFIPS, CountyFIPS, sep = "",
    remove = FALSE
  ) %>%
  unite(
    col = Geography,
    County, State, sep = ", ",
    remove = FALSE
  )
#> # A tibble: 88 × 6
#> CountyStateFIPS StateFIPS CountyFIPS Geography County State
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 39001 39 001 Adams, Ohio Adams Ohio
#> 2 39003 39 003 Allen, Ohio Allen Ohio
#> 3 39005 39 005 Ashland, Ohio Ashland Ohio
#> 4 39007 39 007 Ashtabula, Ohio Ashtabula Ohio
#> 5 39009 39 009 Athens, Ohio Athens Ohio
#> 6 39011 39 011 Auglaize, Ohio Auglaize Ohio
#> # … with 82 more rows
Note the components of the command. Start with the data-frame, then the name of the new column you want to create, then the columns you want to unite, and finally whether unite(...)
should insert any symbol or space between the columns when uniting them.
When we ran unite(...)
to join County and State, notice the space I specified after the comma in sep = ", "
. If you didn’t put this space in the command, the names would show up as Adams,Ohio instead of as Adams, Ohio.
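The effect of that extra space is easy to see with base R's paste(), which joins strings the same way unite() joins columns:

```r
# With the space after the comma: readable "County, State" labels
paste("Adams", "Ohio", sep = ", ")  # "Adams, Ohio"

# Without the space: the two names run together after the comma
paste("Adams", "Ohio", sep = ",")   # "Adams,Ohio"
```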
3.4.4 Reshaping data: Long to Wide and Wide to Long
Often the data you access are in a format ill-suited for your purposes. As Hadley Wickham puts it:
Real datasets can, and often do, violate the three precepts of tidy data in almost every way imaginable. While occasionally you do get a dataset that you can start analyzing immediately, this is the exception, not the rule. This section describes the five most common problems with messy datasets, along with their remedies:
- Column headers are values, not variable names.
- Multiple variables are stored in one column.
- Variables are stored in both rows and columns.
- Multiple types of observational units are stored in the same table.
- A single observational unit is stored in multiple tables.
What is tidy data
then? Tidy data reflect three properties:
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
So the following data are not tidy data:
#>
#> Attaching package: 'DSR'
#> The following objects are masked from 'package:tidyr':
#>
#> table1, table2, table3, table5, who
country | year | key | value |
---|---|---|---|
Afghanistan | 1999 | cases | 7.450e+02 |
Afghanistan | 1999 | population | 1.999e+07 |
Afghanistan | 2000 | cases | 2.666e+03 |
Afghanistan | 2000 | population | 2.060e+07 |
Brazil | 1999 | cases | 3.774e+04 |
Brazil | 1999 | population | 1.720e+08 |
Brazil | 2000 | cases | 8.049e+04 |
Brazil | 2000 | population | 1.745e+08 |
China | 1999 | cases | 2.123e+05 |
China | 1999 | population | 1.273e+09 |
China | 2000 | cases | 2.138e+05 |
China | 2000 | population | 1.280e+09 |
Note the problem here. Each country shows up twice for the same year, once with a value
for the number of tuberculosis cases
and a second time with a value
for the total population size. Instead, cases
should be a column all its own and so should population
– that would be a tidy data-set, one that would look as follows:
country | year | cases | population |
---|---|---|---|
Afghanistan | 1999 | 745 | 1.999e+07 |
Afghanistan | 2000 | 2666 | 2.060e+07 |
Brazil | 1999 | 37737 | 1.720e+08 |
Brazil | 2000 | 80488 | 1.745e+08 |
China | 1999 | 212258 | 1.273e+09 |
China | 2000 | 213766 | 1.280e+09 |
In fact, R is built assuming you have tidy data and hence in order to get various R packages to do what needs to be done data have to be tidy. {tidyr}
is built to tidy messy data-sets. There are two primary functions used in {tidyr}
– pivot_wider(...)
and pivot_longer(...)
. Let us see what each does.
3.4.4.1 pivot_wider()
pivot_wider()
flips data from the long format to the wide format. What is the long
format and what is the wide
format? The wide
format has one variable per column, as shown below:
ID | x | y |
---|---|---|
1 | x1 | y1 |
2 | x2 | y2 |
3 | x3 | y3 |
4 | x4 | y4 |
5 | x5 | y5 |
while the long format will have the variables as rows accompanied by a values column, as shown below:
ID | Variable | Values |
---|---|---|
1 | x | x1 |
1 | y | y1 |
2 | x | x2 |
2 | y | y2 |
3 | x | x3 |
3 | y | y3 |
4 | x | x4 |
4 | y | y4 |
5 | x | x5 |
5 | y | y5 |
The long format is not tidy and so if you end up gathering data that come in a long format you have to know how to flip it into a wide format. Sometimes you may want or need to flip a wide format data-set into the long format so knowing how to go in that direction is also very useful. {tidyr}
comes in very handy for both.
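The examples that follow operate on a data frame named df.long matching the long-format table above; its construction is not shown in the text, so here is a minimal base-R version you can build to follow along (a plain data.frame is used here so the sketch stands alone):

```r
# Rebuild the stylized long-format table: each ID has an x row and a y row
df.long <- data.frame(
  ID       = rep(1:5, each = 2),
  Variable = rep(c("x", "y"), times = 5),
  Values   = paste0(rep(c("x", "y"), times = 5), rep(1:5, each = 2))
)
head(df.long, 2)  # first rows: (1, x, x1) and (1, y, y1)
```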
I can take df.long
and flip it into the wide format as follows:
df.long %>%
  group_by(ID) %>%
  pivot_wider(
    names_from = Variable,
    values_from = Values
  )
#> # A tibble: 5 × 3
#> # Groups: ID [5]
#> ID x y
#> <dbl> <chr> <chr>
#> 1 1 x1 y1
#> 2 2 x2 y2
#> 3 3 x3 y3
#> 4 4 x4 y4
#> 5 5 x5 y5
Take the tuberculosis data as another example. I can flip the long format data into a wide format via a similar operation:
table2 %>%
  group_by(country, year) %>%
  pivot_wider(
    names_from = key,
    values_from = value
  )
#> # A tibble: 6 × 4
#> # Groups: country, year [6]
#> country year cases population
#> <fct> <int> <int> <int>
#> 1 Afghanistan 1999 745 19987071
#> 2 Afghanistan 2000 2666 20595360
#> 3 Brazil 1999 37737 172006362
#> 4 Brazil 2000 80488 174504898
#> 5 China 1999 212258 1272915272
#> 6 China 2000 213766 1280428583
3.4.4.2 pivot_longer()
pivot_longer(...)
reshapes data in the other direction, from the wide format to the long format.
df.wide %>%
  group_by(ID) %>%
  pivot_longer(
    names_to = "Variable",
    values_to = "Value",
    cols = 2:3
  )
In this command, names_to = "Variable"
specifies that the names of the columns being gathered should be put as rows in a new column called Variable
. We have also asked for the values to be retained in a new column called Value
. And finally, cols = 2:3
specifies the columns to be moved into rows.
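For comparison, base R's reshape() can perform the same wide-to-long flip. The sketch below builds a stand-alone df.wide matching the stylized wide table from earlier (the column names mirror that example):

```r
# Build the stylized wide table: one x column and one y column per ID
df.wide <- data.frame(
  ID = 1:5,
  x  = paste0("x", 1:5),
  y  = paste0("y", 1:5)
)

# Base R analogue of pivot_longer(cols = 2:3, names_to = "Variable",
# values_to = "Value"): x and y become levels of a "Variable" column
long <- reshape(df.wide, direction = "long",
                varying = c("x", "y"), v.names = "Value",
                timevar = "Variable", times = c("x", "y"), idvar = "ID")
nrow(long)  # 10 rows: one per (ID, Variable) pair
```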
3.4.5 Reshaping slightly more complicated data
Of course, the preceding are stylized examples and hence easy to flip from long to wide or wide to long. But what if we had more complicated data, perhaps like the one shown below?
tibble(
  read.table(
    header = TRUE,
    text = "
ID Test Year Fall Spring Winter
1 1 2008 15 16 19
1 1 2009 12 13 27
1 2 2008 22 22 24
1 2 2009 10 14 20
2 1 2008 12 13 25
2 1 2009 16 14 21
2 2 2008 13 11 29
2 2 2009 23 20 26
3 1 2008 11 12 22
3 1 2009 13 11 27
3 2 2008 17 12 23
3 2 2009 14 9 31
"
  )
) -> grades
ID | Test | Year | Fall | Spring | Winter |
---|---|---|---|---|---|
1 | 1 | 2008 | 15 | 16 | 19 |
1 | 1 | 2009 | 12 | 13 | 27 |
1 | 2 | 2008 | 22 | 22 | 24 |
1 | 2 | 2009 | 10 | 14 | 20 |
2 | 1 | 2008 | 12 | 13 | 25 |
2 | 1 | 2009 | 16 | 14 | 21 |
2 | 2 | 2008 | 13 | 11 | 29 |
2 | 2 | 2009 | 23 | 20 | 26 |
3 | 1 | 2008 | 11 | 12 | 22 |
3 | 1 | 2009 | 13 | 11 | 27 |
3 | 2 | 2008 | 17 | 12 | 23 |
3 | 2 | 2009 | 14 | 9 | 31 |
These are the raw data. Each student (flagged by ID
) takes two tests, 1
and 2
, thrice a year (Fall
, Spring
, and Winter
) in 2008
, and then again in 2009
. If we wanted to flip these data into the long format we could do the following:
grades.long <- grades %>%
  gather(Semester, Score, Fall:Winter) %>%
  mutate(Test = paste0("TestType", Test)) %>%
  spread(Test, Score) %>%
  arrange(ID, Year, Semester)
ID | Year | Semester | TestType1 | TestType2 |
---|---|---|---|---|
1 | 2008 | Fall | 15 | 22 |
1 | 2008 | Spring | 16 | 22 |
1 | 2008 | Winter | 19 | 24 |
1 | 2009 | Fall | 12 | 10 |
1 | 2009 | Spring | 13 | 14 |
1 | 2009 | Winter | 27 | 20 |
2 | 2008 | Fall | 12 | 13 |
2 | 2008 | Spring | 13 | 11 |
2 | 2008 | Winter | 25 | 29 |
2 | 2009 | Fall | 16 | 23 |
2 | 2009 | Spring | 14 | 20 |
2 | 2009 | Winter | 21 | 26 |
3 | 2008 | Fall | 11 | 17 |
3 | 2008 | Spring | 12 | 12 |
3 | 2008 | Winter | 22 | 23 |
3 | 2009 | Fall | 13 | 14 |
3 | 2009 | Spring | 11 | 9 |
3 | 2009 | Winter | 27 | 31 |
This looks quirky but it isn’t; it shows you how you can blend tidyr
and dplyr
to end up with the desired result. Let us see how it works.
- We know that
gather()
will take multiple columns and put them into a single column (what we’ve referred to as going from wide to long). Here the gather command is taking the three semesters (Fall, Spring, Winter) and collapsing these into a single column called Semester, and putting the test scores from Fall, Spring and Winter into a new column called Score.
step1 <- grades %>% gather(Semester, Score, Fall:Winter)
ID | Test | Year | Semester | Score |
---|---|---|---|---|
1 | 1 | 2008 | Fall | 15 |
1 | 1 | 2009 | Fall | 12 |
1 | 2 | 2008 | Fall | 22 |
1 | 2 | 2009 | Fall | 10 |
2 | 1 | 2008 | Fall | 12 |
2 | 1 | 2009 | Fall | 16 |
2 | 2 | 2008 | Fall | 13 |
2 | 2 | 2009 | Fall | 23 |
3 | 1 | 2008 | Fall | 11 |
3 | 1 | 2009 | Fall | 13 |
3 | 2 | 2008 | Fall | 17 |
3 | 2 | 2009 | Fall | 14 |
1 | 1 | 2008 | Spring | 16 |
1 | 1 | 2009 | Spring | 13 |
1 | 2 | 2008 | Spring | 22 |
1 | 2 | 2009 | Spring | 14 |
2 | 1 | 2008 | Spring | 13 |
2 | 1 | 2009 | Spring | 14 |
2 | 2 | 2008 | Spring | 11 |
2 | 2 | 2009 | Spring | 20 |
3 | 1 | 2008 | Spring | 12 |
3 | 1 | 2009 | Spring | 11 |
3 | 2 | 2008 | Spring | 12 |
3 | 2 | 2009 | Spring | 9 |
1 | 1 | 2008 | Winter | 19 |
1 | 1 | 2009 | Winter | 27 |
1 | 2 | 2008 | Winter | 24 |
1 | 2 | 2009 | Winter | 20 |
2 | 1 | 2008 | Winter | 25 |
2 | 1 | 2009 | Winter | 21 |
2 | 2 | 2008 | Winter | 29 |
2 | 2 | 2009 | Winter | 26 |
3 | 1 | 2008 | Winter | 22 |
3 | 1 | 2009 | Winter | 27 |
3 | 2 | 2008 | Winter | 23 |
3 | 2 | 2009 | Winter | 31 |
- We know that
mutate()
creates a new variable. Here the new variable is called Test
and is being created by pasting the string "TestType" onto the Test values 1 and 2.
step12 <- grades %>%
  gather(Semester, Score, Fall:Winter) %>%
  mutate(Test = paste0("TestType", Test))
ID | Test | Year | Semester | Score |
---|---|---|---|---|
1 | TestType1 | 2008 | Fall | 15 |
1 | TestType1 | 2009 | Fall | 12 |
1 | TestType2 | 2008 | Fall | 22 |
1 | TestType2 | 2009 | Fall | 10 |
2 | TestType1 | 2008 | Fall | 12 |
2 | TestType1 | 2009 | Fall | 16 |
2 | TestType2 | 2008 | Fall | 13 |
2 | TestType2 | 2009 | Fall | 23 |
3 | TestType1 | 2008 | Fall | 11 |
3 | TestType1 | 2009 | Fall | 13 |
3 | TestType2 | 2008 | Fall | 17 |
3 | TestType2 | 2009 | Fall | 14 |
1 | TestType1 | 2008 | Spring | 16 |
1 | TestType1 | 2009 | Spring | 13 |
1 | TestType2 | 2008 | Spring | 22 |
1 | TestType2 | 2009 | Spring | 14 |
2 | TestType1 | 2008 | Spring | 13 |
2 | TestType1 | 2009 | Spring | 14 |
2 | TestType2 | 2008 | Spring | 11 |
2 | TestType2 | 2009 | Spring | 20 |
3 | TestType1 | 2008 | Spring | 12 |
3 | TestType1 | 2009 | Spring | 11 |
3 | TestType2 | 2008 | Spring | 12 |
3 | TestType2 | 2009 | Spring | 9 |
1 | TestType1 | 2008 | Winter | 19 |
1 | TestType1 | 2009 | Winter | 27 |
1 | TestType2 | 2008 | Winter | 24 |
1 | TestType2 | 2009 | Winter | 20 |
2 | TestType1 | 2008 | Winter | 25 |
2 | TestType1 | 2009 | Winter | 21 |
2 | TestType2 | 2008 | Winter | 29 |
2 | TestType2 | 2009 | Winter | 26 |
3 | TestType1 | 2008 | Winter | 22 |
3 | TestType1 | 2009 | Winter | 27 |
3 | TestType2 | 2008 | Winter | 23 |
3 | TestType2 | 2009 | Winter | 31 |
- We know that
spread()
shifts data from the long format to the wide format. Here, the spread command is then taking the newly created “Test” column and creating a column for each unique TestType. In these new columns the corresponding values from the Score column are being inserted.
step123 <- grades %>%
  gather(Semester, Score, Fall:Winter) %>%
  mutate(Test = paste0("TestType", Test)) %>%
  spread(Test, Score)
ID | Year | Semester | TestType1 | TestType2 |
---|---|---|---|---|
1 | 2008 | Fall | 15 | 22 |
1 | 2008 | Spring | 16 | 22 |
1 | 2008 | Winter | 19 | 24 |
1 | 2009 | Fall | 12 | 10 |
1 | 2009 | Spring | 13 | 14 |
1 | 2009 | Winter | 27 | 20 |
2 | 2008 | Fall | 12 | 13 |
2 | 2008 | Spring | 13 | 11 |
2 | 2008 | Winter | 25 | 29 |
2 | 2009 | Fall | 16 | 23 |
2 | 2009 | Spring | 14 | 20 |
2 | 2009 | Winter | 21 | 26 |
3 | 2008 | Fall | 11 | 17 |
3 | 2008 | Spring | 12 | 12 |
3 | 2008 | Winter | 22 | 23 |
3 | 2009 | Fall | 13 | 14 |
3 | 2009 | Spring | 11 | 9 |
3 | 2009 | Winter | 27 | 31 |
- Finally,
arrange()
organizes the data into a particular order. Here we are asking the final result to be arranged by ID, Year and Semester.
grades.long <- grades %>%
  gather(Semester, Score, Fall:Winter) %>%
  mutate(Test = paste0("TestType", Test)) %>%
  spread(Test, Score) %>%
  arrange(ID, Year, Semester)
ID | Year | Semester | TestType1 | TestType2 |
---|---|---|---|---|
1 | 2008 | Fall | 15 | 22 |
1 | 2008 | Spring | 16 | 22 |
1 | 2008 | Winter | 19 | 24 |
1 | 2009 | Fall | 12 | 10 |
1 | 2009 | Spring | 13 | 14 |
1 | 2009 | Winter | 27 | 20 |
2 | 2008 | Fall | 12 | 13 |
2 | 2008 | Spring | 13 | 11 |
2 | 2008 | Winter | 25 | 29 |
2 | 2009 | Fall | 16 | 23 |
2 | 2009 | Spring | 14 | 20 |
2 | 2009 | Winter | 21 | 26 |
3 | 2008 | Fall | 11 | 17 |
3 | 2008 | Spring | 12 | 12 |
3 | 2008 | Winter | 22 | 23 |
3 | 2009 | Fall | 13 | 14 |
3 | 2009 | Spring | 11 | 9 |
3 | 2009 | Winter | 27 | 31 |
This example shows how you can chain commands via the pipe operator %>%, collapsing all the data manipulation tasks into a single piped command rather than breaking them up. The trickier the data you are working with, the more difficult this is to do well. One trick is to subset your complicated data down to a small portion that has maybe two or three unique units (individuals, geographies, etc.). Then see if you can get the code to do what you need it to do. If it works on this subset, it should work on the full data set as well.
3.5 other dplyr functions
dplyr
has several functions that can be used with summarise and mutate, as well as in other ways. Below we walk through a few you are likely to need.
3.5.1 count and tally
Say you want a frequency table or a cross-tabulation. This can be done quite easily. The example I’ll use is from the starwars
data-set. How many unique species are there in the series?
data(starwars)
starwars %>%
  count(
    species, sort = TRUE
  )
What if we want species per homeworld?
starwars %>%
  count(
    species, homeworld, sort = TRUE
  )
Notice that you get some NA values so if we didn’t want this we could use filter:
starwars %>%
  filter(
    !is.na(species) & !is.na(homeworld)
  ) %>%
  count(
    species, homeworld, sort = TRUE
  )
If I want the species count to be added as a column to the data frame, as mutate
would do, I could run:
starwars %>%
  add_count(
    species
  )
#> # A tibble: 87 × 15
#> name height mass hair_color skin_color eye_color birth_year sex gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
#> 1 Luke Skyw… 172 77 blond fair blue 19 male mascul…
#> 2 C-3PO 167 75 <NA> gold yellow 112 none mascul…
#> 3 R2-D2 96 32 <NA> white, blue red 33 none mascul…
#> 4 Darth Vad… 202 136 none white yellow 41.9 male mascul…
#> 5 Leia Orga… 150 49 brown light brown 19 fema… femini…
#> 6 Owen Lars 178 120 brown, grey light blue 52 male mascul…
#> # … with 81 more rows, and 6 more variables: homeworld <chr>, species <chr>,
#> # films <list>, vehicles <list>, starships <list>, n <int>
If I want just the total sample size added, I could do:
starwars %>%
  add_tally()
#> # A tibble: 87 × 15
#> name height mass hair_color skin_color eye_color birth_year sex gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
#> 1 Luke Skyw… 172 77 blond fair blue 19 male mascul…
#> 2 C-3PO 167 75 <NA> gold yellow 112 none mascul…
#> 3 R2-D2 96 32 <NA> white, blue red 33 none mascul…
#> 4 Darth Vad… 202 136 none white yellow 41.9 male mascul…
#> 5 Leia Orga… 150 49 brown light brown 19 fema… femini…
#> 6 Owen Lars 178 120 brown, grey light blue 52 male mascul…
#> # … with 81 more rows, and 6 more variables: homeworld <chr>, species <chr>,
#> # films <list>, vehicles <list>, starships <list>, n <int>
If I do not want to add the sample size but instead just calculate it, I could do:
starwars %>%
  tally()
#> # A tibble: 1 × 1
#> n
#> <int>
#> 1 87
starwars %>%
  group_by(
    species
  ) %>%
  filter(
    !is.na(species)
  ) %>%
  tally(
    sort = TRUE
  )
#> # A tibble: 37 × 2
#> species n
#> <chr> <int>
#> 1 Human 35
#> 2 Droid 6
#> 3 Gungan 3
#> 4 Kaminoan 2
#> 5 Mirialan 2
#> 6 Twi'lek 2
#> # … with 31 more rows
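As an aside, count() is just shorthand for group_by() followed by summarise(n = n()). Here is a self-contained sketch using the built-in mtcars data (so it runs without the starwars set); the object names are made up for the example:

```r
library(dplyr)

# count(cyl, sort = TRUE) is shorthand for
# group_by(cyl) %>% summarise(n = n()) %>% arrange(-n)
mtcars %>%
  count(cyl, sort = TRUE) -> via.count

mtcars %>%
  group_by(cyl) %>%
  summarise(n = n()) %>%
  arrange(-n) -> via.summarise

via.count
#> 8-cylinder cars are most common (14), then 4-cylinder (11), then 6-cylinder (7)
```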
3.5.2 summary functions
Say you have the nycflights13 data, similar to our Columbus flights data but for NYC. I want to perform several calculations. First, how many unique aircraft are there?
library(nycflights13)
library(nycflights13)
flights %>%
  summarise(
    unique.planes = n_distinct(tailnum)
  )
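To see n_distinct() in isolation, here is a minimal sketch on the built-in mtcars data (standing in for flights, which needs the {nycflights13} package):

```r
library(dplyr)

# n_distinct() counts unique values; mtcars has gears of 3, 4, and 5
mtcars %>%
  summarise(unique.gears = n_distinct(gear)) -> gear.count
gear.count$unique.gears
#> [1] 3
```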
How about unique planes by destination?
flights %>%
  group_by(
    dest
  ) %>%
  summarise(
    unique.planes = n_distinct(tailnum)
  ) %>%
  arrange(
    -unique.planes
  )
So there are 1308 unique aircraft that fly from NYC to Boston.
Which destination would give you the most choice of carriers from NYC?
flights %>%
  group_by(
    dest
  ) %>%
  summarise(
    unique.carriers = n_distinct(carrier)
  ) %>%
  arrange(
    -unique.carriers
  )
I want to flag the first flight for each destination on January 1.
flights %>%
  filter(
    month == 1 & day == 1
  ) %>%
  group_by(
    dest
  ) %>%
  summarise(
    flt.order = first(time_hour)
  )
The last flight?
flights %>%
  filter(
    month == 1 & day == 1
  ) %>%
  group_by(
    dest
  ) %>%
  summarise(
    flt.order = last(time_hour)
  )
How many flights to each destination on January 1?
flights %>%
  filter(
    month == 1 & day == 1
  ) %>%
  group_by(
    dest
  ) %>%
  summarise(
    n.flts = n()
  ) %>%
  arrange(
    -n.flts
  )
Wait a minute, what if I want this by airline?
flights %>%
  filter(
    month == 1 & day == 1
  ) %>%
  group_by(
    dest, carrier
  ) %>%
  summarise(
    n.flts = n()
  ) %>%
  arrange(
    -n.flts
  )
3.5.3 window functions
What if I want to flag each flight of a unique aircraft and rank it as the first flight of the day, second flight of the day, and so on?
flights %>%
  filter(
    month == 1 & day == 1
  ) %>%
  group_by(
    tailnum
  ) %>%
  arrange(
    time_hour
  ) %>%
  mutate(
    flt.order = row_number()
  ) %>%
  arrange(
    tailnum, time_hour
  ) %>%
  select(
    carrier, dest, tailnum,
    time_hour, month, day, flt.order
  )
How many times does the same aircraft fly on January 1?
flights %>%
  filter(
    month == 1 & day == 1
  ) %>%
  group_by(
    tailnum
  ) %>%
  summarise(
    n.flts = n()
  ) %>%
  arrange(
    -n.flts
  )
What if I want only the first flight of each aircraft on January 1? min_rank() will flag the first observation of each aircraft arranged by time_hour.
flights %>%
  filter(
    month == 1 & day == 1
  ) %>%
  group_by(
    tailnum
  ) %>%
  mutate(
    flt.rank = min_rank(time_hour)
  ) %>%
  filter(
    flt.rank == 1
  )
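A note on ranking functions: row_number() and min_rank() differ only in how they treat ties, which a toy vector makes clear:

```r
library(dplyr)

x <- c(10, 20, 20, 30)
rn <- row_number(x)  # ties broken by position
mr <- min_rank(x)    # ties share the lowest rank
rn
#> [1] 1 2 3 4
mr
#> [1] 1 2 2 4
```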
This is just a snippet of what dplyr can do. In many ways, what dplyr can do is, without exaggeration, only limited by our own understanding of how to construct a logical calculation. Take a look at the cheatsheets here.
3.6 Using data.table
The {data.table}
package was designed for faster data manipulations when dealing with large data-sets, large being defined here as anything more than 10 gigabytes (10 GB). By some counts there are more than 200 R packages that rely on {data.table}
, making it very popular, and it is said to be faster than {dplyr}
. Let us run through some basic operations with this package, using a different version of the flights data-set, one that is restricted to January through October 2014 flights departing New York area airports.
3.6.1 fread()
The package reads data with the fread() function, which has several options you can specify if you wish to. The default settings usually work well, so no options are listed in the command below.
library(data.table)
#>
#> Attaching package: 'data.table'
#> The following objects are masked from 'package:dplyr':
#>
#> between, first, last
#> The following object is masked from 'package:purrr':
#>
#> transpose
fread(
  "https://github.com/arunsrinivasan/satrdays-workshop/raw/master/flights_2014.csv"
) -> my.df
fread() is a lot faster than read.csv(), and fwrite() is likewise faster than write.csv() when exporting data-frames as csv files.
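A tiny round-trip sketch of fwrite() and fread() together, using a made-up toy table and a throwaway temporary file:

```r
library(data.table)

# write a small data.table to a temporary csv, then read it back
dt <- data.table(x = 1:3, y = c("a", "b", "c"))
tmp <- tempfile(fileext = ".csv")
fwrite(dt, tmp)
dt2 <- fread(tmp)
all.equal(dt, dt2)
#> [1] TRUE
```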
If I want to extract one or more variables, perhaps even sub-setting them into another data.table, I can do that quite easily, either by specifying the variable names or by position:
my.df[
  , .(origin)
] -> df1
my.df[
  , .(
    year, month, day, dep_delay,
    arr_delay, carrier, origin, dest
  )
] -> df2
my.df[
  , c(1:3, 5)
] -> df3
I can also filter the data.table and then extract. Say I want all flights departing La Guardia; no, wait, I also want flights departing La Guardia in July:
my.df[
  origin %in% c("LGA")
] -> df4
my.df[
  origin %in% c("LGA") & month %in% c(7)
] -> df5
3.6.2 The key
Since it is all about speed, one way to execute data operations faster is by specifying a key variable that {data.table} then uses to sift the rows and remember the sifting. Once the key has been specified, you can filter/subset etc. without having to name the key variable.
setkey(my.df, origin, month)
my.df[
  .("LGA", 7)
] -> df6 # flights from LGA in July
setkey(my.df, origin, month, dest)
my.df[
  .("LGA", 7, "ORD")
] -> df7 # flights from LGA in July to ORD
To see what key is being used you can run key(my.df).
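A minimal sketch of keys on a toy table (the columns here are invented for illustration); haskey() reports whether a key is set:

```r
library(data.table)

dt <- data.table(origin = c("JFK", "LGA", "LGA"),
                 month  = c(1, 1, 7),
                 delay  = c(5, 10, 15))
setkey(dt, origin, month)
haskey(dt)                   # TRUE once a key is set
key(dt)                      # "origin" "month"
lga.july <- dt[.("LGA", 7)]  # keyed subset, no column names needed
nrow(lga.july)
#> [1] 1
```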
3.6.3 Sort
Sorting is done with order(), and by default sorting will be done in ascending order; if you want to sort in descending order, prepend - to the variable(s).
df7[order(dep_delay)] -> df8 # ascending order of departure delay
df7[order(-dep_delay)] -> df9 # descending order of departure delay
df7[order(dep_delay, dep_time)] -> df10 # ascending order of departure delay and ascending order of departure time
df7[order(dep_delay, -dep_time)] -> df11 # ascending order of departure delay and descending order of departure time
3.6.4 Calculating Aggregate Values
Say I want to calculate mean departure delay times by airline, for all flights. Maybe I also want to know how many flights each carrier operated.
my.df[
  ,
  .(average_departure_delay = mean(dep_delay),
    stddev_dep_delay = sd(dep_delay)
  ),
  by = carrier
] -> df12
my.df[, .N, by = carrier] -> df13
my.df[, .N, keyby = carrier] -> df14 # results are ordered in ascending order of carrier
my.df[, .N, keyby = carrier][order(N)] -> df15 # results in ascending order of N
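The same pattern on a toy table (values made up for the sketch) makes the grouping mechanics easy to check by hand:

```r
library(data.table)

dt <- data.table(carrier   = c("AA", "AA", "UA"),
                 dep_delay = c(5, 15, 10))
# mean delay and row count per carrier
agg <- dt[, .(avg_delay = mean(dep_delay), n = .N), by = carrier]
agg
#> AA: avg_delay 10, n 2; UA: avg_delay 10, n 1
```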
3.6.5 Chaining operations
Aha! We just learned data.table's way of dealing with chained operations (similar to the pipe operator of {dplyr}): follow one operation encased in [] with another [] to create a chain of operations [][]! For example, say we want the number of flights by carrier and destination, sorted by carrier.
my.df[, .N, by = .(carrier, dest)][order(carrier)] -> df16 # results in ascending order of carrier
my.df[, .N, by = .(carrier, dest)][order(carrier, -N)] -> df17 # results in ascending order of carrier and then within carrier by descending order of N
3.6.6 .SD
data.table has a special symbol, .SD, that represents the subset of data for each group and lets you calculate what you want on a per-group basis. For example, I want mean departure and arrival delays by carrier.
my.df[
  ,
  .SD[
    , mean(dep_delay),
    by = carrier
  ],
  by = dest
] -> df18
my.df[
  , lapply(.SD, mean),
  by = carrier,
  .SDcols = c("dep_delay", "arr_delay")
] -> df19 # calculate the mean for the variables specified, by carrier
There is a lot more that data.table, dplyr, and base R can do, so treat these examples as mere glimpses into some of the operations you are likely to perform. It is also useful to keep an eye on the cheat-sheets for the packages: dplyr and tidyr are available here and several others are available here.
3.7 Using {forcats}
There is a package that some find useful when working with factors. Aptly named as an anagram of what it works with, it is the forcats package. We will see some of the core functions here, but if you are curious about other possibilities, check the package vignette or the web for other examples. I’ll show you the three things I use it for. First, generating a plot that orders the categories on the basis of ascending or descending frequencies. Here is an example, with data from the General Social Survey.
library(ggplot2)
library(dplyr)
library(forcats)
gss_cat %>%
  group_by(
    relig
  ) %>%
  summarise(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()
  ) -> relig_summary
ggplot(
  relig_summary, aes(tvhours, relig)
) +
  geom_point()
Notice how this plot is unordered. I want it to be with the lowest number of television hours first, so in ascending order of TV hours. That can be achieved via fct_reorder().
ggplot(
  relig_summary, aes(tvhours, fct_reorder(relig, tvhours))
) +
  geom_point()
Note the logic: it says to reorder relig on the basis of tvhours. If I wanted it in terms of descending TV hours I would call -tvhours.
ggplot(
  relig_summary, aes(tvhours, fct_reorder(relig, -tvhours))
) +
  geom_point()
If using bar-charts, switch to fct_infreq().
gss_cat %>%
  mutate(
    marital = marital %>%
      fct_infreq() %>%
      fct_rev()
  ) %>%
  ggplot(aes(marital)) +
  geom_bar()
And if you want the bars drawn in descending order, just leave out fct_rev().
gss_cat %>%
  mutate(
    marital = marital %>%
      fct_infreq()
  ) %>%
  ggplot(aes(marital)) +
  geom_bar()
forcats also comes in handy for modifying factor levels. To see this functionality we will rely on the usual seven-point party identification variable in the GSS and lean on fct_recode().
gss_cat %>%
  count(partyid)
#> # A tibble: 10 × 2
#> partyid n
#> <fct> <int>
#> 1 No answer 154
#> 2 Don't know 1
#> 3 Other party 393
#> 4 Strong republican 2314
#> 5 Not str republican 3032
#> 6 Ind,near rep 1791
#> # … with 4 more rows
Now we tweak these labels.
gss_cat %>%
  mutate(
    partyid = fct_recode(
      partyid,
      "Republican, strong" = "Strong republican",
      "Republican, weak" = "Not str republican",
      "Independent, near rep" = "Ind,near rep",
      "Independent, near dem" = "Ind,near dem",
      "Democrat, weak" = "Not str democrat",
      "Democrat, strong" = "Strong democrat"
    )
  ) %>%
  count(partyid)
#> # A tibble: 10 × 2
#> partyid n
#> <fct> <int>
#> 1 No answer 154
#> 2 Don't know 1
#> 3 Other party 393
#> 4 Republican, strong 2314
#> 5 Republican, weak 3032
#> 6 Independent, near rep 1791
#> # … with 4 more rows
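The new = "old" convention of fct_recode() is easy to verify on a toy factor:

```r
library(forcats)

# recode the levels of a small made-up factor; new names on the left,
# old names (quoted) on the right
x <- factor(c("rep", "dem", "rep"))
y <- fct_recode(x, Republican = "rep", Democrat = "dem")
levels(y)
#> [1] "Democrat"   "Republican"
```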
We can also collapse categories, as shown below in creating Other.
gss_cat %>%
  mutate(
    partyid = fct_recode(
      partyid,
      "Republican, strong" = "Strong republican",
      "Republican, weak" = "Not str republican",
      "Independent, near rep" = "Ind,near rep",
      "Independent, near dem" = "Ind,near dem",
      "Democrat, weak" = "Not str democrat",
      "Democrat, strong" = "Strong democrat",
      "Other" = "No answer",
      "Other" = "Don't know",
      "Other" = "Other party"
    )
  ) %>%
  count(partyid)
#> # A tibble: 8 × 2
#> partyid n
#> <fct> <int>
#> 1 Other 548
#> 2 Republican, strong 2314
#> 3 Republican, weak 3032
#> 4 Independent, near rep 1791
#> 5 Independent 4119
#> 6 Independent, near dem 2499
#> # … with 2 more rows
Categories can also be collapsed via fct_collapse():
gss_cat %>%
  mutate(
    partyid = fct_collapse(
      partyid,
      other = c("No answer", "Don't know", "Other party"),
      rep = c("Strong republican", "Not str republican"),
      ind = c("Ind,near rep", "Independent", "Ind,near dem"),
      dem = c("Not str democrat", "Strong democrat")
    )
  ) %>%
  count(partyid)
#> # A tibble: 4 × 2
#> partyid n
#> <fct> <int>
#> 1 other 548
#> 2 rep 5346
#> 3 ind 8409
#> 4 dem 7180
and then fct_lump() (which lumps together the least frequent categories). If you don’t specify how many groups to retain via n =, you may over-collapse the categories, so be careful.
gss_cat %>%
  mutate(
    relig = fct_lump(relig, n = 10)
  ) %>%
  count(
    relig, sort = TRUE
  )
#> # A tibble: 10 × 2
#> relig n
#> <fct> <int>
#> 1 Protestant 10846
#> 2 Catholic 5124
#> 3 None 3523
#> 4 Christian 689
#> 5 Other 458
#> 6 Jewish 388
#> # … with 4 more rows
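fct_lump() in miniature: with n = 1 only the single most frequent level of a made-up toy factor survives, and everything else becomes "Other".

```r
library(forcats)

# "a" appears 3 times and is kept; "b" (2) and "c" (1) are lumped
x <- factor(c("a", "a", "a", "b", "b", "c"))
tab <- table(fct_lump(x, n = 1))
tab
#> a: 3, Other: 3
```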