3 Aggregating Data and Other Operations

Often when analyzing data we end up collapsing them into summary statistics: calculating means, standard deviations, and so on for different groups; counting how many times we see a particular value or a specific pattern in the data; and perhaps plotting the results. These are all aggregations because we are reducing many rows of raw data down to a smaller set of summary values.

I’ll demonstrate some operations with the {dplyr}, {tidyr}, {data.table}, and {reshape2} packages. These are not the only packages capable of performing the operations shown below, but they are among the most popular packages out there today and certainly ones I try to master as best I can, if one can really master anything as fluid as R packages! I will not load any of these packages on their own but will instead load the {tidyverse}.

As per the authors: The {tidyverse} is an opinionated collection of R packages – packages that you’re likely to use in everyday data analyses – that share an underlying design philosophy, grammar, and data structures. The packages included in the tidyverse are: {ggplot2}, {tidyr}, {dplyr}, {forcats}, {stringr}, {readr}, {tibble}, {purrr}.

We will use a particular data set to understand what {dplyr} can do – the cmhflights data frame that contains all flights departing from or arriving at the John Glenn Columbus International Airport starting in January and running through September of 2017 (source: The US Bureau of Transportation Statistics). I downloaded the data for all airports from here and then trimmed it to CMH.

library(tidyverse)
#> ── Attaching packages ──────────────────────────────────────────── tidyverse 1.3.1 ──
#> ✓ tibble  3.1.6     ✓ dplyr   1.0.7
#> ✓ tidyr   1.1.4     ✓ stringr 1.4.0
#> ✓ readr   2.1.1     ✓ forcats 0.5.1
#> ✓ purrr   0.3.4
#> ── Conflicts ─────────────────────────────────────────────── tidyverse_conflicts() ──
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag()    masks stats::lag()
load("data/cmhflights_01092017.RData")

The code chunks below have not been run, to save space, but you should execute them to see what each returns.

names(cmhflights)
print(cmhflights) # Show me a snapshot of the data
glimpse(cmhflights) # This is a dplyr command that lets us take a peek at our data-frame, and is an alternative to print(...)

3.1 Using {dplyr}

In the commands that follow, you will see %>% – the “pipe operator.” This operator allows you to write code that chains multiple commands together. Chain? What does that mean? Well, simply put, it means you pass the result of one command into the next command, and so on until you stop the chain. The easiest way to keep this straight is to remember that the chain starts (usually with a data-frame) on the left and flows right. Here is a rough example.

some-data %>%
some-function(…) %>%
another-function(…) %>%
yet-another-function(…) -> resulting-object
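As a concrete, runnable illustration (using R’s built-in mtcars data set rather than our flight data), the chain below pipes the data through filter() and then summarise(), and is equivalent to nesting the function calls inside one another:

```r
library(dplyr)

# A pipe chain: start with the data, filter it, then summarise it
mtcars %>%
  filter(cyl == 4) %>%
  summarise(mean_mpg = mean(mpg)) -> result

# The same computation written as nested calls, read inside-out
result2 <- summarise(filter(mtcars, cyl == 4), mean_mpg = mean(mpg))

identical(result, result2)
#> [1] TRUE
```

The pipe simply lets you read the steps left-to-right in the order they happen, rather than inside-out.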

In several of the examples that follow, perhaps even most, you will see the results of the code on the page. If you copy the code and execute it, you will see the same result, but note that the result is not being saved. If you need to save the result, be sure to end the pipe with an assignment such as -> somename, as shown below:

somedata %>%
  somefunction(
    ...
    ) -> someobject  

3.1.1 {janitor}

Before we go any further, note that the data-set has clunky variable names, with some letters in uppercase and some in lowercase. This is terrible practice. Preferably, at least in my data world, all column names will be in lowercase, and if some or all the column names are long, broken up by appropriate use of an underscore (_) or a period (.). The {janitor} package is designed to clean up messy column names (but is not a part of the {tidyverse}). My habit is to use it whenever I load a messy data-set. How does it work? See below.

cmhflights %>%
  janitor::clean_names() -> myflights
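To see what clean_names() actually does without the flight data, here is a small sketch on a toy data-frame (the messy names are invented for illustration):

```r
library(dplyr)

# A toy data-frame with messy, mixed-case column names
messy <- data.frame(
  FlightDate = as.Date("2017-01-01"),
  `ORIGIN City Name` = "Columbus, OH",
  check.names = FALSE
)

# clean_names() converts everything to lowercase snake_case
messy %>%
  janitor::clean_names() %>%
  names()
#> [1] "flight_date"      "origin_city_name"
```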

3.1.2 select()

Note that we have 110 variables in the cmhflights data-frame. In order for you to see what is going on, I will trim these variables so that the ones we end up with can be reviewed much like you would in a spreadsheet. Trimming variables is easily done via the select() command, as shown below.

myflights %>%
  select(
    year:flight_date, carrier:flight_num, origin:origin_city_name,
    dest:dest_city_name, dep_time:dep_delay, taxi_out, wheels_off,
    wheels_on, taxi_in, arr_time, arr_delay
    ) -> my.df

We can specify contiguous columns in shorthand via column.x:column.y, or go the extra mile by listing each column individually, separated by commas.

Having trimmed the data-set we may also want to move some columns around so that they appear in our desired order. We could do this via select() by specifying the column names in our desired order. Here are some examples:

names(my.df)
#>  [1] "year"             "quarter"          "month"            "dayof_month"     
#>  [5] "day_of_week"      "flight_date"      "carrier"          "tail_num"        
#>  [9] "flight_num"       "origin"           "origin_city_name" "dest"            
#> [13] "dest_city_name"   "dep_time"         "dep_delay"        "taxi_out"        
#> [17] "wheels_off"       "wheels_on"        "taxi_in"          "arr_time"        
#> [21] "arr_delay"
my.df %>%
  select(
  c(
    flight_date:dest, arr_delay, dep_delay
    )
  ) 
#> # A tibble: 35,993 × 9
#>   flight_date carrier tail_num flight_num origin origin_city_name dest  arr_delay
#>   <date>      <chr>   <chr>         <int> <chr>  <chr>            <chr>     <dbl>
#> 1 2017-01-01  AA      N802AA          508 LAX    Los Angeles, CA  CMH         -14
#> 2 2017-01-02  AA      N766AA          508 LAX    Los Angeles, CA  CMH          29
#> 3 2017-01-03  AA      N760AA          508 LAX    Los Angeles, CA  CMH         -37
#> 4 2017-01-04  AA      N753AA          508 LAX    Los Angeles, CA  CMH         -25
#> 5 2017-01-05  AA      N744AA          508 LAX    Los Angeles, CA  CMH         -22
#> 6 2017-01-06  AA      N815AA          508 LAX    Los Angeles, CA  CMH          13
#> # … with 35,987 more rows, and 1 more variable: dep_delay <dbl>
my.df %>%
  select(
  c(
    arr_delay, dep_delay, flight_date:dep_time, 
    year:day_of_week, taxi_out:arr_time
    )
  ) 
#> # A tibble: 35,993 × 21
#>   arr_delay dep_delay flight_date carrier tail_num flight_num origin
#>       <dbl>     <dbl> <date>      <chr>   <chr>         <int> <chr> 
#> 1       -14        -9 2017-01-01  AA      N802AA          508 LAX   
#> 2        29        24 2017-01-02  AA      N766AA          508 LAX   
#> 3       -37        -6 2017-01-03  AA      N760AA          508 LAX   
#> 4       -25        -5 2017-01-04  AA      N753AA          508 LAX   
#> 5       -22        -7 2017-01-05  AA      N744AA          508 LAX   
#> 6        13        22 2017-01-06  AA      N815AA          508 LAX   
#> # … with 35,987 more rows, and 14 more variables: origin_city_name <chr>,
#> #   dest <chr>, dest_city_name <chr>, dep_time <chr>, year <int>, quarter <int>,
#> #   month <int>, dayof_month <int>, day_of_week <int>, taxi_out <dbl>,
#> #   wheels_off <chr>, wheels_on <chr>, taxi_in <dbl>, arr_time <chr>

Notice that both selections return only the columns we listed, in the order we listed them, with the second containing more columns than the first!

select() is versatile enough such that we could ask that only columns with a certain prefix or suffix be selected, or even columns that contain a specific string:

my.df %>% 
  dplyr::select(
    starts_with("arr")
  ) 
#> # A tibble: 35,993 × 2
#>   arr_time arr_delay
#>   <chr>        <dbl>
#> 1 1653           -14
#> 2 1736            29
#> 3 1630           -37
#> 4 1642           -25
#> 5 1645           -22
#> 6 1720            13
#> # … with 35,987 more rows
my.df %>%
  dplyr::select(
    ends_with("delay")
  ) 
#> # A tibble: 35,993 × 2
#>   dep_delay arr_delay
#>       <dbl>     <dbl>
#> 1        -9       -14
#> 2        24        29
#> 3        -6       -37
#> 4        -5       -25
#> 5        -7       -22
#> 6        22        13
#> # … with 35,987 more rows
my.df %>%
  dplyr::select(
    contains("num")
  ) 
#> # A tibble: 35,993 × 2
#>   tail_num flight_num
#>   <chr>         <int>
#> 1 N802AA          508
#> 2 N766AA          508
#> 3 N760AA          508
#> 4 N753AA          508
#> 5 N744AA          508
#> 6 N815AA          508
#> # … with 35,987 more rows

To use any function in {dplyr}, I could have first run library(tidyverse) or library(dplyr); either call loads one or more packages into working memory. A more memory-efficient alternative is to call a function from a package without attaching the package at all, via the package::function(…) syntax, as in dplyr::select(…). Now, you won’t see me use this more efficient style in the rest of this chapter because we will keep using various {dplyr} functions, and hence we might as well just lean on {dplyr} via the {tidyverse} library we loaded earlier.

We can also specify the column position instead of names when selecting columns:

my.df %>%
  select(
    1:4
  ) 
#> # A tibble: 35,993 × 4
#>    year quarter month dayof_month
#>   <int>   <int> <int>       <int>
#> 1  2017       1     1           1
#> 2  2017       1     1           2
#> 3  2017       1     1           3
#> 4  2017       1     1           4
#> 5  2017       1     1           5
#> 6  2017       1     1           6
#> # … with 35,987 more rows
my.df %>%
  select(
    1:4, 21, 15, 7:13
  ) 
#> # A tibble: 35,993 × 13
#>    year quarter month dayof_month arr_delay dep_delay carrier tail_num flight_num
#>   <int>   <int> <int>       <int>     <dbl>     <dbl> <chr>   <chr>         <int>
#> 1  2017       1     1           1       -14        -9 AA      N802AA          508
#> 2  2017       1     1           2        29        24 AA      N766AA          508
#> 3  2017       1     1           3       -37        -6 AA      N760AA          508
#> 4  2017       1     1           4       -25        -5 AA      N753AA          508
#> 5  2017       1     1           5       -22        -7 AA      N744AA          508
#> 6  2017       1     1           6        13        22 AA      N815AA          508
#> # … with 35,987 more rows, and 4 more variables: origin <chr>,
#> #   origin_city_name <chr>, dest <chr>, dest_city_name <chr>

I prefer working with column numbers when reordering or pruning a data-set; makes for less typing and is faster so long as I don’t mess up. This is why I keep using the names() command to make sure I get the column positions right.

In closing, let me also point out that we can drop columns with the - symbol, as shown below:

my.df %>%
  select(
  -4, -5
  )
#> # A tibble: 35,993 × 19
#>    year quarter month flight_date carrier tail_num flight_num origin
#>   <int>   <int> <int> <date>      <chr>   <chr>         <int> <chr> 
#> 1  2017       1     1 2017-01-01  AA      N802AA          508 LAX   
#> 2  2017       1     1 2017-01-02  AA      N766AA          508 LAX   
#> 3  2017       1     1 2017-01-03  AA      N760AA          508 LAX   
#> 4  2017       1     1 2017-01-04  AA      N753AA          508 LAX   
#> 5  2017       1     1 2017-01-05  AA      N744AA          508 LAX   
#> 6  2017       1     1 2017-01-06  AA      N815AA          508 LAX   
#> # … with 35,987 more rows, and 11 more variables: origin_city_name <chr>,
#> #   dest <chr>, dest_city_name <chr>, dep_time <chr>, dep_delay <dbl>,
#> #   taxi_out <dbl>, wheels_off <chr>, wheels_on <chr>, taxi_in <dbl>,
#> #   arr_time <chr>, arr_delay <dbl>

which is the same as running

my.df %>% 
  select(
    1:3, 6
  ) 
#> # A tibble: 35,993 × 4
#>    year quarter month flight_date
#>   <int>   <int> <int> <date>     
#> 1  2017       1     1 2017-01-01 
#> 2  2017       1     1 2017-01-02 
#> 3  2017       1     1 2017-01-03 
#> 4  2017       1     1 2017-01-04 
#> 5  2017       1     1 2017-01-05 
#> 6  2017       1     1 2017-01-06 
#> # … with 35,987 more rows

3.1.3 filter()

filter() allows you to subset the data to observations that meet a given criterion or a set of criteria. In the code below, for example, we are asking for all flights on January 1 (i.e., with month == 1 and dayof_month == 1). Two base R approaches to subsetting the data are shown first, followed by the {dplyr} code.

my.df[
  my.df$month == 1 & my.df$dayof_month == 1, 
  ]  # base R
#> # A tibble: 105 × 21
#>    year quarter month dayof_month day_of_week flight_date carrier tail_num
#>   <int>   <int> <int>       <int>       <int> <date>      <chr>   <chr>   
#> 1  2017       1     1           1           7 2017-01-01  AA      N802AA  
#> 2  2017       1     1           1           7 2017-01-01  AA      N820AA  
#> 3  2017       1     1           1           7 2017-01-01  AA      N828AA  
#> 4  2017       1     1           1           7 2017-01-01  AA      N828AA  
#> 5  2017       1     1           1           7 2017-01-01  AA      N426AA  
#> 6  2017       1     1           1           7 2017-01-01  AA      N426AA  
#> # … with 99 more rows, and 13 more variables: flight_num <int>, origin <chr>,
#> #   origin_city_name <chr>, dest <chr>, dest_city_name <chr>, dep_time <chr>,
#> #   dep_delay <dbl>, taxi_out <dbl>, wheels_off <chr>, wheels_on <chr>,
#> #   taxi_in <dbl>, arr_time <chr>, arr_delay <dbl>
my.df %>%
  subset(
    month == 1 & dayof_month == 1
  ) # base R
#> # A tibble: 105 × 21
#>    year quarter month dayof_month day_of_week flight_date carrier tail_num
#>   <int>   <int> <int>       <int>       <int> <date>      <chr>   <chr>   
#> 1  2017       1     1           1           7 2017-01-01  AA      N802AA  
#> 2  2017       1     1           1           7 2017-01-01  AA      N820AA  
#> 3  2017       1     1           1           7 2017-01-01  AA      N828AA  
#> 4  2017       1     1           1           7 2017-01-01  AA      N828AA  
#> 5  2017       1     1           1           7 2017-01-01  AA      N426AA  
#> 6  2017       1     1           1           7 2017-01-01  AA      N426AA  
#> # … with 99 more rows, and 13 more variables: flight_num <int>, origin <chr>,
#> #   origin_city_name <chr>, dest <chr>, dest_city_name <chr>, dep_time <chr>,
#> #   dep_delay <dbl>, taxi_out <dbl>, wheels_off <chr>, wheels_on <chr>,
#> #   taxi_in <dbl>, arr_time <chr>, arr_delay <dbl>
my.df %>%
  filter(
    month == 1, dayof_month == 1
  ) #  with dplyr
#> # A tibble: 105 × 21
#>    year quarter month dayof_month day_of_week flight_date carrier tail_num
#>   <int>   <int> <int>       <int>       <int> <date>      <chr>   <chr>   
#> 1  2017       1     1           1           7 2017-01-01  AA      N802AA  
#> 2  2017       1     1           1           7 2017-01-01  AA      N820AA  
#> 3  2017       1     1           1           7 2017-01-01  AA      N828AA  
#> 4  2017       1     1           1           7 2017-01-01  AA      N828AA  
#> 5  2017       1     1           1           7 2017-01-01  AA      N426AA  
#> 6  2017       1     1           1           7 2017-01-01  AA      N426AA  
#> # … with 99 more rows, and 13 more variables: flight_num <int>, origin <chr>,
#> #   origin_city_name <chr>, dest <chr>, dest_city_name <chr>, dep_time <chr>,
#> #   dep_delay <dbl>, taxi_out <dbl>, wheels_off <chr>, wheels_on <chr>,
#> #   taxi_in <dbl>, arr_time <chr>, arr_delay <dbl>

filter() is versatile; in practice this means you can filter in a number of simple or seemingly complicated ways. For example, say I want all flights in January OR in February. I could do it as follows.

my.df %>%
  filter(
    month == 1 | month == 2
  )
#> # A tibble: 7,170 × 21
#>    year quarter month dayof_month day_of_week flight_date carrier tail_num
#>   <int>   <int> <int>       <int>       <int> <date>      <chr>   <chr>   
#> 1  2017       1     1           1           7 2017-01-01  AA      N802AA  
#> 2  2017       1     1           2           1 2017-01-02  AA      N766AA  
#> 3  2017       1     1           3           2 2017-01-03  AA      N760AA  
#> 4  2017       1     1           4           3 2017-01-04  AA      N753AA  
#> 5  2017       1     1           5           4 2017-01-05  AA      N744AA  
#> 6  2017       1     1           6           5 2017-01-06  AA      N815AA  
#> # … with 7,164 more rows, and 13 more variables: flight_num <int>, origin <chr>,
#> #   origin_city_name <chr>, dest <chr>, dest_city_name <chr>, dep_time <chr>,
#> #   dep_delay <dbl>, taxi_out <dbl>, wheels_off <chr>, wheels_on <chr>,
#> #   taxi_in <dbl>, arr_time <chr>, arr_delay <dbl>

This is the same as running

my.df %>%
  filter(
    month %in% c(1, 2)
  )
#> # A tibble: 7,170 × 21
#>    year quarter month dayof_month day_of_week flight_date carrier tail_num
#>   <int>   <int> <int>       <int>       <int> <date>      <chr>   <chr>   
#> 1  2017       1     1           1           7 2017-01-01  AA      N802AA  
#> 2  2017       1     1           2           1 2017-01-02  AA      N766AA  
#> 3  2017       1     1           3           2 2017-01-03  AA      N760AA  
#> 4  2017       1     1           4           3 2017-01-04  AA      N753AA  
#> 5  2017       1     1           5           4 2017-01-05  AA      N744AA  
#> 6  2017       1     1           6           5 2017-01-06  AA      N815AA  
#> # … with 7,164 more rows, and 13 more variables: flight_num <int>, origin <chr>,
#> #   origin_city_name <chr>, dest <chr>, dest_city_name <chr>, dep_time <chr>,
#> #   dep_delay <dbl>, taxi_out <dbl>, wheels_off <chr>, wheels_on <chr>,
#> #   taxi_in <dbl>, arr_time <chr>, arr_delay <dbl>

The variable month is a number and hence we can use mathematical comparisons. For example, the preceding two commands are equivalent to the following:

my.df %>%
  filter(
    month < 3
  )
#> # A tibble: 7,170 × 21
#>    year quarter month dayof_month day_of_week flight_date carrier tail_num
#>   <int>   <int> <int>       <int>       <int> <date>      <chr>   <chr>   
#> 1  2017       1     1           1           7 2017-01-01  AA      N802AA  
#> 2  2017       1     1           2           1 2017-01-02  AA      N766AA  
#> 3  2017       1     1           3           2 2017-01-03  AA      N760AA  
#> 4  2017       1     1           4           3 2017-01-04  AA      N753AA  
#> 5  2017       1     1           5           4 2017-01-05  AA      N744AA  
#> 6  2017       1     1           6           5 2017-01-06  AA      N815AA  
#> # … with 7,164 more rows, and 13 more variables: flight_num <int>, origin <chr>,
#> #   origin_city_name <chr>, dest <chr>, dest_city_name <chr>, dep_time <chr>,
#> #   dep_delay <dbl>, taxi_out <dbl>, wheels_off <chr>, wheels_on <chr>,
#> #   taxi_in <dbl>, arr_time <chr>, arr_delay <dbl>

We could have also asked for all flights in January and February EXCEPT for flights originating from Los Angeles (i.e., “LAX”).

my.df %>%
  filter(
    month < 3 & origin != "LAX"
  )
#> # A tibble: 7,074 × 21
#>    year quarter month dayof_month day_of_week flight_date carrier tail_num
#>   <int>   <int> <int>       <int>       <int> <date>      <chr>   <chr>   
#> 1  2017       1     1           1           7 2017-01-01  AA      N820AA  
#> 2  2017       1     1           2           1 2017-01-02  AA      N817AA  
#> 3  2017       1     1           3           2 2017-01-03  AA      N742AA  
#> 4  2017       1     1           9           1 2017-01-09  AA      N742AA  
#> 5  2017       1     1          11           3 2017-01-11  AA      N742AA  
#> 6  2017       1     1          12           4 2017-01-12  AA      N762AA  
#> # … with 7,068 more rows, and 13 more variables: flight_num <int>, origin <chr>,
#> #   origin_city_name <chr>, dest <chr>, dest_city_name <chr>, dep_time <chr>,
#> #   dep_delay <dbl>, taxi_out <dbl>, wheels_off <chr>, wheels_on <chr>,
#> #   taxi_in <dbl>, arr_time <chr>, arr_delay <dbl>

Yes, but I changed my mind. I also want to exclude flights from Phoenix (i.e., “PHX”).

my.df %>%
  filter(
    month < 3 & !origin %in% c("LAX", "PHX")
  )
#> # A tibble: 6,895 × 21
#>    year quarter month dayof_month day_of_week flight_date carrier tail_num
#>   <int>   <int> <int>       <int>       <int> <date>      <chr>   <chr>   
#> 1  2017       1     1           9           1 2017-01-09  AA      N742AA  
#> 2  2017       1     1          10           2 2017-01-10  AA      N742AA  
#> 3  2017       1     1          11           3 2017-01-11  AA      N823AA  
#> 4  2017       1     1          12           4 2017-01-12  AA      N742AA  
#> 5  2017       1     1          13           5 2017-01-13  AA      N762AA  
#> 6  2017       1     1          16           1 2017-01-16  AA      N817AA  
#> # … with 6,889 more rows, and 13 more variables: flight_num <int>, origin <chr>,
#> #   origin_city_name <chr>, dest <chr>, dest_city_name <chr>, dep_time <chr>,
#> #   dep_delay <dbl>, taxi_out <dbl>, wheels_off <chr>, wheels_on <chr>,
#> #   taxi_in <dbl>, arr_time <chr>, arr_delay <dbl>

Note the different use of ! when I ran origin != “LAX” versus when I ran !origin %in% c(“LAX”, “PHX”). This difference is easily forgotten but it matters. Specifically, if you are trying to exclude rows with more than one value, you have to put the ! before the variable name, negating the whole %in% test. If you are looking to exclude a single value, then != goes between the variable name and the exclusionary value.
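A minimal sketch of the two forms on a toy character vector (base R only) may help fix the distinction:

```r
origin <- c("LAX", "PHX", "CMH", "ORD")

# Exclude a single value: != between the variable and the value
origin[origin != "LAX"]
#> [1] "PHX" "CMH" "ORD"

# Exclude multiple values: ! in front of the whole %in% test
origin[!origin %in% c("LAX", "PHX")]
#> [1] "CMH" "ORD"
```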

You could also ask for all flights with arrival delays that exceed the median, or that meet some other cutoff (less than 60 minutes, for example).

my.df %>%
  filter(
    arr_delay > median(arr_delay, na.rm = TRUE)
  )
#> # A tibble: 17,122 × 21
#>    year quarter month dayof_month day_of_week flight_date carrier tail_num
#>   <int>   <int> <int>       <int>       <int> <date>      <chr>   <chr>   
#> 1  2017       1     1           2           1 2017-01-02  AA      N766AA  
#> 2  2017       1     1           6           5 2017-01-06  AA      N815AA  
#> 3  2017       1     1           9           1 2017-01-09  AA      N835AA  
#> 4  2017       1     1          15           7 2017-01-15  AA      N752AA  
#> 5  2017       1     1          19           4 2017-01-19  AA      N760AA  
#> 6  2017       1     1          22           7 2017-01-22  AA      N766AA  
#> # … with 17,116 more rows, and 13 more variables: flight_num <int>, origin <chr>,
#> #   origin_city_name <chr>, dest <chr>, dest_city_name <chr>, dep_time <chr>,
#> #   dep_delay <dbl>, taxi_out <dbl>, wheels_off <chr>, wheels_on <chr>,
#> #   taxi_in <dbl>, arr_time <chr>, arr_delay <dbl>
my.df %>%
  filter(
    arr_delay < 60
  )
#> # A tibble: 33,564 × 21
#>    year quarter month dayof_month day_of_week flight_date carrier tail_num
#>   <int>   <int> <int>       <int>       <int> <date>      <chr>   <chr>   
#> 1  2017       1     1           1           7 2017-01-01  AA      N802AA  
#> 2  2017       1     1           2           1 2017-01-02  AA      N766AA  
#> 3  2017       1     1           3           2 2017-01-03  AA      N760AA  
#> 4  2017       1     1           4           3 2017-01-04  AA      N753AA  
#> 5  2017       1     1           5           4 2017-01-05  AA      N744AA  
#> 6  2017       1     1           6           5 2017-01-06  AA      N815AA  
#> # … with 33,558 more rows, and 13 more variables: flight_num <int>, origin <chr>,
#> #   origin_city_name <chr>, dest <chr>, dest_city_name <chr>, dep_time <chr>,
#> #   dep_delay <dbl>, taxi_out <dbl>, wheels_off <chr>, wheels_on <chr>,
#> #   taxi_in <dbl>, arr_time <chr>, arr_delay <dbl>

Or arrival delays between 15 and 30 minutes, both values included?

my.df %>%
  filter(
    arr_delay >= 15 & arr_delay <= 30
  )
#> # A tibble: 2,490 × 21
#>    year quarter month dayof_month day_of_week flight_date carrier tail_num
#>   <int>   <int> <int>       <int>       <int> <date>      <chr>   <chr>   
#> 1  2017       1     1           2           1 2017-01-02  AA      N766AA  
#> 2  2017       1     1           9           1 2017-01-09  AA      N835AA  
#> 3  2017       1     1          29           7 2017-01-29  AA      N823AA  
#> 4  2017       1     1          11           3 2017-01-11  AA      N823AA  
#> 5  2017       1     1          12           4 2017-01-12  AA      N752AA  
#> 6  2017       1     1          23           1 2017-01-23  AA      N825AA  
#> # … with 2,484 more rows, and 13 more variables: flight_num <int>, origin <chr>,
#> #   origin_city_name <chr>, dest <chr>, dest_city_name <chr>, dep_time <chr>,
#> #   dep_delay <dbl>, taxi_out <dbl>, wheels_off <chr>, wheels_on <chr>,
#> #   taxi_in <dbl>, arr_time <chr>, arr_delay <dbl>
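As an aside, {dplyr} also provides between(), an inclusive range shorthand that reads a bit more cleanly than the two comparisons above. A minimal sketch on a toy vector of delays:

```r
library(dplyr)

delays <- c(-5, 15, 22, 30, 31)

# between(x, left, right) is equivalent to x >= left & x <= right
between(delays, 15, 30)
#> [1] FALSE  TRUE  TRUE  TRUE FALSE
```

On the flight data, the filter above could thus be written as filter(between(arr_delay, 15, 30)).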

Here is a summary table of common operators and what they represent.

Operator  Meaning
!         Logical NOT (negates a condition, e.g., !origin %in% c("LAX", "PHX"))
!=        Not equal to
&         And
,         And (when separating conditions inside filter())
<         Less than
<=        Less than or equal to
>         Greater than
>=        Greater than or equal to
==        Equal to
%in%      Value is in a set of values

3.1.4 arrange()

Let us now sort the data frame, first by year, month, and dayof_month, and then by arrival delay (arr_delay).

my.df %>%
  select(
    21, 13, 7, 1, 3, 4
  ) -> tab1
tab1 %>%
  arrange(
    year, month, dayof_month
  ) 
#> # A tibble: 35,993 × 6
#>   arr_delay dest_city_name        carrier  year month dayof_month
#>       <dbl> <chr>                 <chr>   <int> <int>       <int>
#> 1       -14 Columbus, OH          AA       2017     1           1
#> 2        -8 Columbus, OH          AA       2017     1           1
#> 3       -19 Los Angeles, CA       AA       2017     1           1
#> 4       -18 Columbus, OH          AA       2017     1           1
#> 5        23 Dallas/Fort Worth, TX AA       2017     1           1
#> 6        34 Columbus, OH          AA       2017     1           1
#> # … with 35,987 more rows

When we ask for the data to be arranged by the values of one or more variables, unless we indicate otherwise {dplyr} assumes we want the default – ascending order. This is evident from the example below.

my.df %>%
  select(
    flight_date, carrier, flight_num, arr_delay
    ) %>%
  arrange(
    arr_delay
  ) 
#> # A tibble: 35,993 × 4
#>   flight_date carrier flight_num arr_delay
#>   <date>      <chr>        <int>     <dbl>
#> 1 2017-02-18  EV            4338       -62
#> 2 2017-01-23  EV            4148       -59
#> 3 2017-02-15  AA             659       -57
#> 4 2017-02-21  EV            4246       -57
#> 5 2017-02-16  F9            1087       -57
#> 6 2017-03-29  AA             659       -57
#> # … with 35,987 more rows

If we would rather have the sort be in descending order of the variable’s values, then we will need to make an explicit call to that effect via desc(some-variable), as shown below.

my.df %>%
  select(
    flight_date, carrier, flight_num, arr_delay
    ) %>%
  arrange(
    desc(arr_delay)
  ) 
#> # A tibble: 35,993 × 4
#>   flight_date carrier flight_num arr_delay
#>   <date>      <chr>        <int>     <dbl>
#> 1 2017-06-23  UA             349      1327
#> 2 2017-07-17  OO            4798      1234
#> 3 2017-04-25  EV            3908      1118
#> 4 2017-09-07  EV            4335      1099
#> 5 2017-09-11  DL            1276      1063
#> 6 2017-04-02  EV            4246      1053
#> # … with 35,987 more rows

Could we have used the - symbol instead of desc()? Yes, we can.

my.df %>%
  select(
    flight_date, carrier, flight_num, arr_delay
    ) %>%
  arrange(
    -arr_delay
  ) 
#> # A tibble: 35,993 × 4
#>   flight_date carrier flight_num arr_delay
#>   <date>      <chr>        <int>     <dbl>
#> 1 2017-06-23  UA             349      1327
#> 2 2017-07-17  OO            4798      1234
#> 3 2017-04-25  EV            3908      1118
#> 4 2017-09-07  EV            4335      1099
#> 5 2017-09-11  DL            1276      1063
#> 6 2017-04-02  EV            4246      1053
#> # … with 35,987 more rows

Note the difference between arr_delay and desc(arr_delay) – the former sorts the data-frame in ascending order of arrival delay times but the latter does it in descending order of arrival delay times. In addition, -arr_delay works similarly to desc(arr_delay).
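One caveat worth noting, sketched below on a toy data-frame: the - shorthand only works where unary minus is defined (numbers, dates), whereas desc() also handles character columns.

```r
library(dplyr)

df <- data.frame(city = c("Akron", "Columbus", "Boston"))

# desc() works on character columns...
df %>% arrange(desc(city))
#>       city
#> 1 Columbus
#> 2   Boston
#> 3    Akron

# ...but the unary minus does not:
# df %>% arrange(-city)  # Error: invalid argument to unary operator
```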

We could also do more complicated sorting orders if we needed to, using multiple columns, and not necessarily all in the same ascending/descending sort sequence either.

Below, for example, we ask for the data-frame to be sorted in ascending order of arrival delays first and then in ascending order of the name of the destination city.

my.df %>%
  select(
    flight_date, carrier, flight_num, arr_delay, dest_city_name
    ) %>%
  arrange(
    arr_delay, dest_city_name
  ) 
#> # A tibble: 35,993 × 5
#>   flight_date carrier flight_num arr_delay dest_city_name 
#>   <date>      <chr>        <int>     <dbl> <chr>          
#> 1 2017-02-18  EV            4338       -62 Houston, TX    
#> 2 2017-01-23  EV            4148       -59 Houston, TX    
#> 3 2017-02-21  EV            4246       -57 Houston, TX    
#> 4 2017-05-06  EV            4338       -57 Houston, TX    
#> 5 2017-02-16  F9            1087       -57 Las Vegas, NV  
#> 6 2017-02-15  AA             659       -57 Los Angeles, CA
#> # … with 35,987 more rows

We could further complicate it by adding the carrier as yet another ascending sort level.

my.df %>%
  select(
    flight_date, carrier, flight_num, arr_delay, dest_city_name
    ) %>%
  arrange(
    arr_delay, dest_city_name, carrier
  ) 
#> # A tibble: 35,993 × 5
#>   flight_date carrier flight_num arr_delay dest_city_name 
#>   <date>      <chr>        <int>     <dbl> <chr>          
#> 1 2017-02-18  EV            4338       -62 Houston, TX    
#> 2 2017-01-23  EV            4148       -59 Houston, TX    
#> 3 2017-02-21  EV            4246       -57 Houston, TX    
#> 4 2017-05-06  EV            4338       -57 Houston, TX    
#> 5 2017-02-16  F9            1087       -57 Las Vegas, NV  
#> 6 2017-02-15  AA             659       -57 Los Angeles, CA
#> # … with 35,987 more rows

Of course, as you may have guessed, we could sort in ascending order with one variable and in descending order with another:

my.df %>%
  select(
    flight_date, carrier, flight_num, arr_delay, dest_city_name
    ) %>%
  arrange(
    arr_delay, desc(dest_city_name)
  ) 
#> # A tibble: 35,993 × 5
#>   flight_date carrier flight_num arr_delay dest_city_name 
#>   <date>      <chr>        <int>     <dbl> <chr>          
#> 1 2017-02-18  EV            4338       -62 Houston, TX    
#> 2 2017-01-23  EV            4148       -59 Houston, TX    
#> 3 2017-02-15  AA             659       -57 Los Angeles, CA
#> 4 2017-03-29  AA             659       -57 Los Angeles, CA
#> 5 2017-02-16  F9            1087       -57 Las Vegas, NV  
#> 6 2017-02-21  EV            4246       -57 Houston, TX    
#> # … with 35,987 more rows

3.1.5 rename()

If we want to rename some columns we can do so via rename(). Note that here I am saving the result as a data-frame called flights purely for demonstration purposes.

myflights %>%
  rename(
    arrival.delay = arr_delay, 
    departure.delay = dep_delay,
    arrival.time = arr_time, 
    departure.time = dep_time
    ) -> flights

Remember the sequence here … rename(new_name = old_name). Note also that you can use rename_with(…) to simultaneously rename multiple columns, as shown below.

myflights %>%
  rename_with(
    toupper
    ) %>%
  head()
#> # A tibble: 6 × 110
#>    YEAR QUARTER MONTH DAYOF_MONTH DAY_OF_WEEK FLIGHT_DATE UNIQUE_CARRIER AIRLINE_ID
#>   <int>   <int> <int>       <int>       <int> <date>      <chr>               <int>
#> 1  2017       1     1           1           7 2017-01-01  AA                  19805
#> 2  2017       1     1           2           1 2017-01-02  AA                  19805
#> 3  2017       1     1           3           2 2017-01-03  AA                  19805
#> 4  2017       1     1           4           3 2017-01-04  AA                  19805
#> 5  2017       1     1           5           4 2017-01-05  AA                  19805
#> 6  2017       1     1           6           5 2017-01-06  AA                  19805
#> # … with 102 more variables: CARRIER <chr>, TAIL_NUM <chr>, FLIGHT_NUM <int>,
#> #   ORIGIN_AIRPORT_ID <int>, ORIGIN_AIRPORT_SEQ_ID <int>,
#> #   ORIGIN_CITY_MARKET_ID <int>, ORIGIN <chr>, ORIGIN_CITY_NAME <chr>,
#> #   ORIGIN_STATE <chr>, ORIGIN_STATE_FIPS <chr>, ORIGIN_STATE_NAME <chr>,
#> #   ORIGIN_WAC <int>, DEST_AIRPORT_ID <int>, DEST_AIRPORT_SEQ_ID <int>,
#> #   DEST_CITY_MARKET_ID <int>, DEST <chr>, DEST_CITY_NAME <chr>, DEST_STATE <chr>,
#> #   DEST_STATE_FIPS <chr>, DEST_STATE_NAME <chr>, DEST_WAC <int>, …

Here, all column names have been converted to uppercase via toupper, but other transformations are possible as well.

Maybe this is a mistake and we only want the city-related column names flipped into uppercase. How could we do that?

myflights %>%
  rename_with(
    toupper,
    contains("city")
    ) %>%
  select(contains("CITY"))
#> # A tibble: 35,993 × 4
#>   ORIGIN_CITY_MARKET_ID ORIGIN_CITY_NAME DEST_CITY_MARKET_ID DEST_CITY_NAME
#>                   <int> <chr>                          <int> <chr>         
#> 1                 32575 Los Angeles, CA                31066 Columbus, OH  
#> 2                 32575 Los Angeles, CA                31066 Columbus, OH  
#> 3                 32575 Los Angeles, CA                31066 Columbus, OH  
#> 4                 32575 Los Angeles, CA                31066 Columbus, OH  
#> 5                 32575 Los Angeles, CA                31066 Columbus, OH  
#> 6                 32575 Los Angeles, CA                31066 Columbus, OH  
#> # … with 35,987 more rows

There are trickier rename_with(...) operations possible that have not been outlined here.

3.1.6 distinct()

Rows with unique combinations of column values can be retained with ease; this is essentially a way of making sure no row is duplicated in the data frame. For example, what are the unique aircraft, flagged by the tail number (tail_num), that fly through Columbus? We can extract these with distinct() wrapped around select(...), for a single variable or for more than one. The first example says: keep just one row per aircraft (based on tail_num).

myflights %>%
  distinct(
    select(
      ., 
      tail_num
      )
    ) 

Next we ask for rows with unique combinations of origin and dest values.

myflights %>%
  distinct(
    select(
      ., 
      origin, dest
      )
    ) 
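In current versions of {dplyr} you can also pass the column names to distinct() directly, without wrapping select(). A sketch on a toy tibble (the values below are hypothetical stand-ins for myflights):

```r
library(dplyr)

# hypothetical stand-in for myflights
toy <- tibble(
  tail_num = c("N100AA", "N100AA", "N200DL"),
  origin   = c("CMH", "CMH", "LAX"),
  dest     = c("LAX", "LAX", "CMH")
)

toy %>% distinct(tail_num)      # one row per aircraft
toy %>% distinct(origin, dest)  # unique origin-dest combinations
```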

3.1.7 mutate()

If we want to calculate something based on existing columns and then add these new calculated values to the data frame, mutate(...) comes in handy. For example, if a flight has a longer dep_delay than arr_delay then we know the flight made up some of the delayed time. Let us call this quantity gain and calculate it for our data set. When we create gain with mutate(...) it is added to the existing data frame. You see this operation below.

myflights %>% 
  mutate(
    gain = (arr_delay - dep_delay)
    ) -> myflights

summary(myflights$gain)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
#>   -63.0   -13.0    -7.0    -5.9    -1.0   210.0     565

One of the amazing things about the mutate(...) command is that it allows you to refer to a previous calculation within the same command (see below):

myflights %>%
  mutate(
    gain = (arr_delay - dep_delay), 
    gain_per_flight_hour = (gain / 60)
    ) 
#> # A tibble: 35,993 × 112
#>    year quarter month dayof_month day_of_week flight_date unique_carrier airline_id
#>   <int>   <int> <int>       <int>       <int> <date>      <chr>               <int>
#> 1  2017       1     1           1           7 2017-01-01  AA                  19805
#> 2  2017       1     1           2           1 2017-01-02  AA                  19805
#> 3  2017       1     1           3           2 2017-01-03  AA                  19805
#> 4  2017       1     1           4           3 2017-01-04  AA                  19805
#> 5  2017       1     1           5           4 2017-01-05  AA                  19805
#> 6  2017       1     1           6           5 2017-01-06  AA                  19805
#> # … with 35,987 more rows, and 104 more variables: carrier <chr>, tail_num <chr>,
#> #   flight_num <int>, origin_airport_id <int>, origin_airport_seq_id <int>,
#> #   origin_city_market_id <int>, origin <chr>, origin_city_name <chr>,
#> #   origin_state <chr>, origin_state_fips <chr>, origin_state_name <chr>,
#> #   origin_wac <int>, dest_airport_id <int>, dest_airport_seq_id <int>,
#> #   dest_city_market_id <int>, dest <chr>, dest_city_name <chr>, dest_state <chr>,
#> #   dest_state_fips <chr>, dest_state_name <chr>, dest_wac <int>, …

Note that gain_per_flight_hour as computed above simply divides gain by 60. To truly measure the gain per hour of flight you would divide gain by (air_time / 60) instead, since longer flights can gain (i.e., make up) more lost time than can shorter flights.

#> # A tibble: 35,993 × 5
#>   air_time arr_delay dep_delay  gain gain_per_flight_hour
#>      <dbl>     <dbl>     <dbl> <dbl>                <dbl>
#> 1      223       -14        -9    -5              -0.0833
#> 2      213        29        24     5               0.0833
#> 3      212       -37        -6   -31              -0.517 
#> 4      215       -25        -5   -20              -0.333 
#> 5      210       -22        -7   -15              -0.25  
#> 6      222        13        22    -9              -0.15  
#> # … with 35,987 more rows
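To compute a true per-flight-hour gain – gain divided by (air_time / 60) – the mutate() would read as follows, sketched on hypothetical toy rows since the full data are not reproduced here:

```r
library(dplyr)

# two hypothetical rows mirroring the structure of myflights
toy <- tibble(
  air_time  = c(223, 213),
  arr_delay = c(-14, 29),
  dep_delay = c(-9, 24)
)

toy %>%
  mutate(
    gain = arr_delay - dep_delay,
    gain_per_hour = gain / (air_time / 60)  # gain per hour of air time
  )
```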

We will return to mutate() later with more nuanced operations, but for now let us look at summarise(...) – also known as summarize(...).

3.1.8 summarise()

summarise() is another useful function (see below): whereas mutate() adds the calculated variable(s) to the existing data set, summarise() collapses the data into a summary table that also doubles as a data frame. For example, say I want to calculate the means and standard deviations of arrival delays and departure delays:

myflights %>%
  summarise(
    mean_arr_delay = mean(arr_delay, na.rm = TRUE),
    sd_arr_delay = sd(arr_delay, na.rm = TRUE),
    mean_dep_delay = mean(dep_delay, na.rm = TRUE),
    sd_dep_delay = sd(dep_delay, na.rm = TRUE)
    ) -> tab.01
Table 3.1: Mean and Standard Deviation of Arrival and Departure Delays

mean_arr_delay sd_arr_delay mean_dep_delay sd_dep_delay
         3.258        41.79          9.202        39.77

The key difference between mutate(...) and summarise(...) is that mutate(...) generates the values you ask for and appends them to the data frame as new variables. summarise(...), on the other hand, reduces the data to a summary table, so you no longer have the entire data set to work with. This should be clear from the lone example shown above.

3.1.9 Sampling

You can draw random samples from your data frame as well. The code below shows a random sub-sample of 15 observations and then a random sub-sample of 2% of the full data, respectively.

myflights %>%
  sample_n(15) # 15 randomly sampled observations 
#> # A tibble: 15 × 111
#>    year quarter month dayof_month day_of_week flight_date unique_carrier airline_id
#>   <int>   <int> <int>       <int>       <int> <date>      <chr>               <int>
#> 1  2017       1     1          18           3 2017-01-18  WN                  19393
#> 2  2017       1     1          22           7 2017-01-22  UA                  19977
#> 3  2017       2     6          29           4 2017-06-29  WN                  19393
#> 4  2017       3     9          24           7 2017-09-24  WN                  19393
#> 5  2017       2     4          16           7 2017-04-16  AA                  19805
#> 6  2017       3     7          14           5 2017-07-14  WN                  19393
#> # … with 9 more rows, and 103 more variables: carrier <chr>, tail_num <chr>,
#> #   flight_num <int>, origin_airport_id <int>, origin_airport_seq_id <int>,
#> #   origin_city_market_id <int>, origin <chr>, origin_city_name <chr>,
#> #   origin_state <chr>, origin_state_fips <chr>, origin_state_name <chr>,
#> #   origin_wac <int>, dest_airport_id <int>, dest_airport_seq_id <int>,
#> #   dest_city_market_id <int>, dest <chr>, dest_city_name <chr>, dest_state <chr>,
#> #   dest_state_fips <chr>, dest_state_name <chr>, dest_wac <int>, …
myflights %>%
  sample_frac(.02) # a 2% random sample
#> # A tibble: 720 × 111
#>    year quarter month dayof_month day_of_week flight_date unique_carrier airline_id
#>   <int>   <int> <int>       <int>       <int> <date>      <chr>               <int>
#> 1  2017       2     6          25           7 2017-06-25  WN                  19393
#> 2  2017       3     7           7           5 2017-07-07  WN                  19393
#> 3  2017       1     1          14           6 2017-01-14  EV                  20366
#> 4  2017       2     5          16           2 2017-05-16  WN                  19393
#> 5  2017       2     5           6           6 2017-05-06  DL                  19790
#> 6  2017       1     3          29           3 2017-03-29  WN                  19393
#> # … with 714 more rows, and 103 more variables: carrier <chr>, tail_num <chr>,
#> #   flight_num <int>, origin_airport_id <int>, origin_airport_seq_id <int>,
#> #   origin_city_market_id <int>, origin <chr>, origin_city_name <chr>,
#> #   origin_state <chr>, origin_state_fips <chr>, origin_state_name <chr>,
#> #   origin_wac <int>, dest_airport_id <int>, dest_airport_seq_id <int>,
#> #   dest_city_market_id <int>, dest <chr>, dest_city_name <chr>, dest_state <chr>,
#> #   dest_state_fips <chr>, dest_state_name <chr>, dest_wac <int>, …

These commands have been superseded by other commands that we will see later on in this text.
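In current {dplyr} the successor verb is slice_sample(); a sketch, using a toy tibble as a stand-in for myflights:

```r
library(dplyr)

toy <- tibble(id = 1:100)

toy %>% slice_sample(n = 15)      # 15 randomly sampled rows
toy %>% slice_sample(prop = 0.02) # a 2% random sample
```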

3.2 group_by()

So far we have been running these commands on all observations rather than by some group indicator. For example, what if we wanted to run some calculations for each unique aircraft (identified by tail_num)? The first thing we will do is create the grouping structure by specifying that we want each unique value of tail_num to be treated as a unique group.

myflights %>%
  group_by(tail_num) %>%
  summarize(
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    mean.arrival.delay = mean(arr_delay, na.rm = TRUE),
    mean.departure.delay = mean(dep_delay, na.rm = TRUE)
    )
#> # A tibble: 2,248 × 5
#>   tail_num count  dist mean.arrival.delay mean.departure.delay
#>   <chr>    <int> <dbl>              <dbl>                <dbl>
#> 1 N008AA       2   926              -14                   -3  
#> 2 N013AA       2   926               -9                   -1.5
#> 3 N015AA       4   926               53.8                 63.5
#> 4 N020AA       2   926              150                  162  
#> 5 N024AA       2   926               20                   15  
#> 6 N028AA       2   926              308.                 298. 
#> # … with 2,242 more rows

If you look at the result you will see 2,248 rows and 5 columns – tail_num identifies each aircraft, count tells you how many times that aircraft shows up in the database, dist tells you the average distance it flew, mean.arrival.delay its mean arrival delay, and mean.departure.delay its mean departure delay.

Now, is the mean arrival delay related to the average distance flown? We will ask and answer this question via a simple scatter-plot on a subset of the preceding result. The subset will be defined such that the aircraft must have more than 20 arrivals or departures, and must have flown an average distance of less than 2,000 miles.

myflights %>%
  group_by(tail_num) %>%
  summarize(
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    mean.arrival.delay = mean(arr_delay, na.rm = TRUE),
    mean.departure.delay = mean(dep_delay, na.rm = TRUE)
    ) -> tab.02

tab.02 %>%
  filter(
    count > 20, dist < 2000
    ) -> tab.03

plot(tab.03$dist, tab.03$mean.arrival.delay)

What if we wanted to know the number of flights to each destination listed in the data frame?

myflights %>%
  group_by(dest) %>%
  summarise(
    nflights = n()
    )

Note that n() counts the number of rows, and hence in the preceding code nflights = n() records the number of times each unique value of dest is seen in the data-frame.
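The group_by() + summarise(n()) pattern is common enough that {dplyr} offers count() as a shorthand; a sketch on toy data (values are hypothetical):

```r
library(dplyr)

toy <- tibble(dest = c("ATL", "ORD", "ATL", "DFW", "ATL"))

# equivalent to: toy %>% group_by(dest) %>% summarise(nflights = n())
toy %>% count(dest, name = "nflights")
```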

What if we wanted to count the number of flights to each destination by each month?

myflights %>%
  group_by(
    month, dest
    ) %>%
  summarise(
    nflights = n()
    )

Note that now the result has an extra column (month) because you asked for the number of flights to each destination for each month.

Now for some fun. Say we want to know, for each airline, the number of flights per day to each destination.

myflights %>%
  group_by(carrier, dayof_month, dest) %>%
  summarise(
    nflights = n()
    ) 

The function group_by(...) forms groups from unique combinations of the values of the variables listed in (...). The preceding example is asking for unique combinations of carrier, dayof_month, and dest.

What if we wanted to know the number of flights by month rather than by day?

myflights %>%
  group_by(carrier, month, dest) %>%
  summarise(
    nflights = n()
    ) 

And across the entire period covered by the data (January through September)?

myflights %>%
  group_by(carrier, dest) %>%
  summarise(
    nflights = n()
    ) 

Perhaps I am curious about the mean departure delay from Columbus, by Airline, regardless of destination.

myflights %>%
  filter(origin == "CMH") %>%
  group_by(carrier) %>%
  summarise(
    Mean.Dep.Delay = mean(dep_delay, na.rm = TRUE), 
    SD.Dep.Delay = sd(dep_delay, na.rm = TRUE)
    )
#> # A tibble: 7 × 3
#>   carrier Mean.Dep.Delay SD.Dep.Delay
#>   <chr>            <dbl>        <dbl>
#> 1 AA                6.92         37.2
#> 2 DL                5.25         41.4
#> 3 EV               22.2          93.4
#> 4 F9               11.0          40.7
#> 5 OO                5.98         58.4
#> 6 UA                7.52         62.1
#> # … with 1 more row

AA is American Airlines, DL is Delta Air Lines, EV is ExpressJet, F9 is Frontier Airlines, OO is SkyWest Airlines, UA is United Airlines, and WN is Southwest Airlines. Based on the preceding result, Delta appears to have the smallest mean departure delay. Would the result be any different if we relied on the median instead?

myflights %>%
  filter(origin == "CMH") %>%
  group_by(carrier) %>%
  summarise(
    Median.Dep.Delay = median(dep_delay, na.rm = TRUE), 
    SD.Dep.Delay = sd(dep_delay, na.rm = TRUE)
    )
#> # A tibble: 7 × 3
#>   carrier Median.Dep.Delay SD.Dep.Delay
#>   <chr>              <dbl>        <dbl>
#> 1 AA                    -4         37.2
#> 2 DL                    -3         41.4
#> 3 EV                    -3         93.4
#> 4 F9                    -2         40.7
#> 5 OO                    -5         58.4
#> 6 UA                    -4         62.1
#> # … with 1 more row

Aha! Turns out it is SkyWest (OO), with a median departure delay of -5 minutes, that has the smallest median among the carriers shown, and not Delta, so yes, the picture changes.

Are flight departure delays worse for some destinations than for others? And assuming the answer is yes, which two destinations are the worst?

myflights %>%
  filter(origin == "CMH") %>%
  group_by(dest) %>%
  summarise(
    Mean.Dep.Delay = mean(dep_delay, na.rm = TRUE)
    ) %>%
  arrange(-Mean.Dep.Delay)
#> # A tibble: 26 × 2
#>   dest  Mean.Dep.Delay
#>   <chr>          <dbl>
#> 1 LGA             34.3
#> 2 EWR             31.0
#> 3 MSY             21.9
#> 4 OAK             14.9
#> 5 BOS             14.4
#> 6 DTW             12.8
#> # … with 20 more rows

Ah, as expected, flying to La Guardia, NY (LGA) is the worst, followed by Newark, NJ (EWR).

What if we wanted to generate a relative frequency column that showed the percent of flights to each destination, by airline, from Columbus?

myflights %>%
  filter(origin == "CMH") %>% 
  group_by(carrier, dest) %>% 
  summarise(
    Number.of.flights = n()
    ) %>% 
  group_by(carrier) %>% 
  mutate(
    Percent.of.flights = (
      Number.of.flights / sum(Number.of.flights)
      ) * 100
    )

Run this code and you will see that most Delta flights were to Atlanta, followed by Minneapolis.

If we wanted to know what percent of flights to a given destination are from a given airline, we could modify the code just a bit.

myflights %>%
  filter(origin == "CMH") %>% 
  group_by(carrier, dest) %>% 
  summarise(
    Number.of.flights = n()
    ) %>% 
  group_by(dest) %>% 
  mutate(
    Percent.of.flights = (
      Number.of.flights / sum(Number.of.flights)
      ) * 100
    ) 

All flights to Philadelphia were American Airlines flights, as were all flights to Dallas-Fort Worth. Some 74.2% of the flights to Atlanta were by Delta, and so on.

What is noteworthy here, at least for now, is that we have two group_by(...) commands – summarise() peels off the last grouping variable, so we make the intended grouping explicit before calculating percentages – and that we follow summarise(...) with a mutate(...) command to arrive at our desired result.
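This double-grouping pattern can be verified on a toy tibble (values are hypothetical): within each carrier, the percentages should sum to 100.

```r
library(dplyr)

toy <- tibble(
  carrier = c("DL", "DL", "DL", "AA", "AA"),
  dest    = c("ATL", "MSP", "ATL", "DFW", "PHL")
)

toy %>%
  group_by(carrier, dest) %>%
  summarise(Number.of.flights = n(), .groups = "drop") %>%
  group_by(carrier) %>%
  mutate(
    Percent.of.flights = Number.of.flights / sum(Number.of.flights) * 100
  ) %>%
  summarise(check = sum(Percent.of.flights))  # 100 for every carrier
```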

3.3 unnest()

The more complicated – or perhaps I should say the more raw – the data you are pulling down, the more likely it is that you will run into list-columns: columns whose entries are themselves lists rather than single values. The unnest() function from {tidyr} unpacks these list-columns into a regular tibble. The following example is from the unnest documentation available here.

Here you will see a small data set called df that we have created. Note the odd structure: there are three comma-separated values of y for x = 2 and two for x = 3. You do not encounter data in this convoluted structure very often, but you will encounter it.

tibble(
  x = 1:3,
  y = c("a", "d,e,f", "g,h")
  ) -> df

df
#> # A tibble: 3 × 2
#>       x y    
#>   <int> <chr>
#> 1     1 a    
#> 2     2 d,e,f
#> 3     3 g,h

We ask for it to be cleaned up by splitting the column y wherever a comma is seen, and then organizing the result such that we have what looks like a ‘proper’ data-set.

df %>%
  transform(
    y = strsplit(y, ",")
    ) %>%
  unnest(y)
#> # A tibble: 6 × 2
#>       x y    
#>   <int> <chr>
#> 1     1 a    
#> 2     2 d    
#> 3     2 e    
#> 4     2 f    
#> 5     3 g    
#> 6     3 h
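{tidyr} also offers separate_rows(), which performs the split-and-unnest in a single step; assuming the same df:

```r
library(dplyr)
library(tidyr)

df <- tibble(
  x = 1:3,
  y = c("a", "d,e,f", "g,h")
)

# split y on commas and put each piece in its own row
df %>% separate_rows(y, sep = ",")
```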

Here is a trickier example.

tibble(
  x = 1:2,
  y = list(a = 1, b = 3:4)
  ) -> df

df
#> # A tibble: 2 × 2
#>       x y           
#>   <int> <named list>
#> 1     1 <dbl [1]>   
#> 2     2 <int [2]>

With a named list-column you can capture the names by specifying .id, and note that you now have to specify the column(s) to unnest via cols = c(y):

df %>%
  unnest(
    cols = c(y),
    .id = "name"
    )
#> Warning: The `.id` argument of `unnest()` is deprecated as of tidyr 1.0.0.
#> Manually create column of names instead.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
#> # A tibble: 3 × 3
#>       x     y name 
#>   <int> <dbl> <chr>
#> 1     1     1 a    
#> 2     2     3 b    
#> 3     2     4 b
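The replacement the warning suggests – manually creating the column of names – can be sketched like this:

```r
library(dplyr)
library(tidyr)

df <- tibble(
  x = 1:2,
  y = list(a = 1, b = 3:4)
)

df %>%
  mutate(name = names(y)) %>%  # capture the list names before unnesting
  unnest(cols = c(y))
```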

3.4 Using {tidyr}

The {tidyr} package is another useful tool in the data analyst’s toolkit. In fact, very often you will see {tidyr} being used in tandem with {dplyr} because they complement each other well. One thing we will continue to do here is use the pipe operator %>%, because it makes chaining commands efficient.

3.4.1 tidyr’s Key Functions

There are four {tidyr} functions you are likely to use.

Function            What it does …
separate(...)       splits a single column into multiple columns
unite(...)          combines multiple columns into a single column
pivot_wider(...)    takes two columns (key & value) and spreads them into multiple columns: it makes “long” data wider
pivot_longer(...)   takes multiple columns and gathers them into key-value pairs: it makes “wide” data longer

Before we do anything else, let us read in the data-set we will work with. These data reflect the Census Bureau’s population estimates for Ohio counties and span the 2010 through 2016 fiscal-years. I’ll start by reading in the raw data.

library(readr)
library(here)
#> here() starts at /Users/ruhil/Documents/github/RBook_new
read_csv(
  here(
    "data", 
    "PEP_2016_PEPANNRES_with_ann.csv"
    ),
  skip = 1
  ) -> popest 
#> Rows: 88 Columns: 12
#> ── Column specification ─────────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr  (2): Id, Geography
#> dbl (10): Id2, April 1, 2010 - Census, April 1, 2010 - Estimates Base, Populatio...
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
colnames(popest)[4:12] = c(
  "DecennialCens2010", "PopestBase2010", "Popest2010",
  "Popest2011", "Popest2012", "Popest2013",
  "Popest2014", "Popest2015", "Popest2016"
  )

save(
  popest, 
  file = "data/popest.RData"
  )
Id             Id2   Geography              DecennialCens2010 PopestBase2010 Popest2010
0500000US39001 39001 Adams County, Ohio                 28550          28554      28562
0500000US39003 39003 Allen County, Ohio                106331         106326     106395
0500000US39005 39005 Ashland County, Ohio               53139          53139      53319
0500000US39007 39007 Ashtabula County, Ohio            101497         101488     101394
0500000US39009 39009 Athens County, Ohio                64757          64772      65221
0500000US39011 39011 Auglaize County, Ohio              45949          45949      45932

If you open the data frame you will see there are 88 counties, with population estimates for multiple years (2010 through 2016). Notice that Geography lists the county name and the state. Say we want to split this into two columns, one with just the county name and another with the state name. How could we do this? With separate(...)

3.4.2 separate()

library(tidyr)
popest %>% 
  separate(
    col = Geography, 
    into = c("County", "State"), 
    sep = ",",
    remove = FALSE
    )
#> # A tibble: 88 × 14
#>   Id        Id2 Geography   County  State DecennialCens20… PopestBase2010 Popest2010
#>   <chr>   <dbl> <chr>       <chr>   <chr>            <dbl>          <dbl>      <dbl>
#> 1 050000… 39001 Adams Coun… Adams … " Oh…            28550          28554      28562
#> 2 050000… 39003 Allen Coun… Allen … " Oh…           106331         106326     106395
#> 3 050000… 39005 Ashland Co… Ashlan… " Oh…            53139          53139      53319
#> 4 050000… 39007 Ashtabula … Ashtab… " Oh…           101497         101488     101394
#> 5 050000… 39009 Athens Cou… Athens… " Oh…            64757          64772      65221
#> 6 050000… 39011 Auglaize C… Auglai… " Oh…            45949          45949      45932
#> # … with 82 more rows, and 6 more variables: Popest2011 <dbl>, Popest2012 <dbl>,
#> #   Popest2013 <dbl>, Popest2014 <dbl>, Popest2015 <dbl>, Popest2016 <dbl>

Note the sequence: I asked that the column called Geography be split into two, c(“County”, “State”) based on a comma separator, i.e., sep = “,”.

I could further clean up the County column since I don’t want the word “County” showing up needlessly; we know these are counties after all!

popest %>% 
  separate(
    col = Geography, 
    into = c("County", "State"), 
    sep = ",",
    remove = FALSE
    ) %>%
  mutate(
    County = gsub(" County", "", County)
  )
#> # A tibble: 88 × 14
#>   Id        Id2 Geography    County State DecennialCens20… PopestBase2010 Popest2010
#>   <chr>   <dbl> <chr>        <chr>  <chr>            <dbl>          <dbl>      <dbl>
#> 1 050000… 39001 Adams Count… Adams  " Oh…            28550          28554      28562
#> 2 050000… 39003 Allen Count… Allen  " Oh…           106331         106326     106395
#> 3 050000… 39005 Ashland Cou… Ashla… " Oh…            53139          53139      53319
#> 4 050000… 39007 Ashtabula C… Ashta… " Oh…           101497         101488     101394
#> 5 050000… 39009 Athens Coun… Athens " Oh…            64757          64772      65221
#> 6 050000… 39011 Auglaize Co… Augla… " Oh…            45949          45949      45932
#> # … with 82 more rows, and 6 more variables: Popest2011 <dbl>, Popest2012 <dbl>,
#> #   Popest2013 <dbl>, Popest2014 <dbl>, Popest2015 <dbl>, Popest2016 <dbl>

Here I asked that every instance of " County" be replaced with nothing. Note I specified " County" (with a leading space) and not "County", since the latter would leave a blank space after every county’s name. We could also split the five-digit Id2 variable into two parts – the state FIPS code made up of the first two digits (here, 39) and the county code made up of the next three digits. Since we know this break-point, I’ll set sep = 2, indicating that Id2 should be broken after the second position.

popest %>% 
  separate(
    col = Geography, 
    into = c("County", "State"), 
    sep = ", ",
    remove = FALSE
    ) %>%
  mutate(
    County = gsub(" County", "", County)
  ) %>%
  separate(
    col = Id2, 
    into = c("StateFIPS", "CountyFIPS"), 
    sep = 2,
    remove = FALSE
    )
#> # A tibble: 88 × 16
#>   Id               Id2 StateFIPS CountyFIPS Geography  County State DecennialCens20…
#>   <chr>          <dbl> <chr>     <chr>      <chr>      <chr>  <chr>            <dbl>
#> 1 0500000US39001 39001 39        001        Adams Cou… Adams  Ohio             28550
#> 2 0500000US39003 39003 39        003        Allen Cou… Allen  Ohio            106331
#> 3 0500000US39005 39005 39        005        Ashland C… Ashla… Ohio             53139
#> 4 0500000US39007 39007 39        007        Ashtabula… Ashta… Ohio            101497
#> 5 0500000US39009 39009 39        009        Athens Co… Athens Ohio             64757
#> 6 0500000US39011 39011 39        011        Auglaize … Augla… Ohio             45949
#> # … with 82 more rows, and 8 more variables: PopestBase2010 <dbl>,
#> #   Popest2010 <dbl>, Popest2011 <dbl>, Popest2012 <dbl>, Popest2013 <dbl>,
#> #   Popest2014 <dbl>, Popest2015 <dbl>, Popest2016 <dbl>

3.4.3 unite()

Of course, we could also work backwards, creating a single column from two or more columns via unite(...) and I will demonstrate this with the StateFIPS and CountyFIPS columns. Let me create these columns and save them as a data-frame.

popest %>% 
  separate(
    col = Id2, 
    into = c("StateFIPS", "CountyFIPS"), 
    sep = 2,
    remove = FALSE
    ) %>%
  separate(
    col = Geography, 
    into = c("County", "State"), 
    sep = ", ",
    remove = FALSE
    ) %>%
  mutate(
    County = gsub(" County", "", County)
    ) %>%
  select(
    StateFIPS, CountyFIPS, County, State
    ) -> scdf

scdf
#> # A tibble: 88 × 4
#>   StateFIPS CountyFIPS County    State
#>   <chr>     <chr>      <chr>     <chr>
#> 1 39        001        Adams     Ohio 
#> 2 39        003        Allen     Ohio 
#> 3 39        005        Ashland   Ohio 
#> 4 39        007        Ashtabula Ohio 
#> 5 39        009        Athens    Ohio 
#> 6 39        011        Auglaize  Ohio 
#> # … with 82 more rows
scdf %>%
  unite(
    col = CountyStateFIPS, 
    StateFIPS, CountyFIPS, 
    sep = "",
    remove = FALSE
    ) %>%
  unite(
    col = Geography, 
    County, State, 
    sep = ", ",
    remove = FALSE
    ) 
#> # A tibble: 88 × 6
#>   CountyStateFIPS StateFIPS CountyFIPS Geography       County    State
#>   <chr>           <chr>     <chr>      <chr>           <chr>     <chr>
#> 1 39001           39        001        Adams, Ohio     Adams     Ohio 
#> 2 39003           39        003        Allen, Ohio     Allen     Ohio 
#> 3 39005           39        005        Ashland, Ohio   Ashland   Ohio 
#> 4 39007           39        007        Ashtabula, Ohio Ashtabula Ohio 
#> 5 39009           39        009        Athens, Ohio    Athens    Ohio 
#> 6 39011           39        011        Auglaize, Ohio  Auglaize  Ohio 
#> # … with 82 more rows

Note the components of the command: start with the data frame, then the name of the new column you want to create, then the columns you want to unite, and finally whether unite(...) should insert any symbol or space between the uniting columns when uniting them.

When we ran unite(...) to join County and State, notice the space I specified after the comma in sep = ", ". Without that space, the names would show up as Adams,Ohio instead of as Adams, Ohio.

3.4.4 Reshaping data: Long to Wide and Wide to Long

Often the data you access are in a format ill-suited for your purposes. As Hadley Wickham puts it:

Real datasets can, and often do, violate the three precepts of tidy data in almost every way imaginable. While occasionally you do get a dataset that you can start analyzing immediately, this is the exception, not the rule. This section describes the five most common problems with messy datasets, along with their remedies:

  • Column headers are values, not variable names.
  • Multiple variables are stored in one column.
  • Variables are stored in both rows and columns.
  • Multiple types of observational units are stored in the same table.
  • A single observational unit is stored in multiple tables.

What is tidy data then? Tidy data reflect three properties:

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.

So the following data – table2, which ships with both the {DSR} and {tidyr} packages – are not tidy data:
country     year key        value
Afghanistan 1999 cases      7.450e+02
Afghanistan 1999 population 1.999e+07
Afghanistan 2000 cases      2.666e+03
Afghanistan 2000 population 2.060e+07
Brazil      1999 cases      3.774e+04
Brazil      1999 population 1.720e+08
Brazil      2000 cases      8.049e+04
Brazil      2000 population 1.745e+08
China       1999 cases      2.123e+05
China       1999 population 1.273e+09
China       2000 cases      2.138e+05
China       2000 population 1.280e+09

Note the problem here: each country shows up twice for the same year – once with a value for the number of tuberculosis cases and a second time with a value for the total population size. Instead, cases should be a column all its own, and so should population. That would be a tidy data set, one that would look as follows:

country     year  cases population
Afghanistan 1999    745  1.999e+07
Afghanistan 2000   2666  2.060e+07
Brazil      1999  37737  1.720e+08
Brazil      2000  80488  1.745e+08
China       1999 212258  1.273e+09
China       2000 213766  1.280e+09

In fact, much of R – the tidyverse in particular – is built assuming you have tidy data, and hence in order to get various R packages to do what needs to be done, data have to be tidy. {tidyr} is built to tidy messy data sets. The two primary functions used in {tidyr} for reshaping are pivot_wider(...) and pivot_longer(...). Let us see what each does.

3.4.4.1 pivot_wider()

pivot_wider() flips data from the long format to the wide format. What is the long format and what is the wide format? The wide format has one variable per column, as shown below:

ID x  y
 1 x1 y1
 2 x2 y2
 3 x3 y3
 4 x4 y4
 5 x5 y5

while the long format will have the variables as rows accompanied by a values column, as shown below:

ID Variable Values
 1 x        x1
 1 y        y1
 2 x        x2
 2 y        y2
 3 x        x3
 3 y        y3
 4 x        x4
 4 y        y4
 5 x        x5
 5 y        y5

The long format is not tidy and so if you end up gathering data that come in a long format you have to know how to flip it into a wide format. Sometimes you may want or need to flip a wide format data-set into the long format so knowing how to go in that direction is also very useful. {tidyr} comes in very handy for both.
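The stylized long table above is not constructed anywhere in the text; a minimal construction (assuming the name df.long, which the next example uses) might be:

```r
library(dplyr)

df.long <- tibble(
  ID       = rep(1:5, each = 2),
  Variable = rep(c("x", "y"), times = 5),
  Values   = paste0(rep(c("x", "y"), times = 5), rep(1:5, each = 2))
)
df.long
```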

I can take df.long and flip it into the wide format as follows:

df.long %>% 
  group_by(ID) %>%
  pivot_wider(
    names_from = Variable, 
    values_from = Values
    ) 
#> # A tibble: 5 × 3
#> # Groups:   ID [5]
#>      ID x     y    
#>   <dbl> <chr> <chr>
#> 1     1 x1    y1   
#> 2     2 x2    y2   
#> 3     3 x3    y3   
#> 4     4 x4    y4   
#> 5     5 x5    y5

Take the tuberculosis data as another example. I can flip the long format data into a wide format via a similar operation:

table2 %>% 
  group_by(country, year) %>%
  pivot_wider(
    names_from = key,
    values_from = value
    )
#> # A tibble: 6 × 4
#> # Groups:   country, year [6]
#>   country      year  cases population
#>   <fct>       <int>  <int>      <int>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3 Brazil       1999  37737  172006362
#> 4 Brazil       2000  80488  174504898
#> 5 China        1999 212258 1272915272
#> 6 China        2000 213766 1280428583

3.4.4.2 pivot_longer()

pivot_longer(...) reshapes data in the other direction, from the wide format to the long format.

df.wide %>% 
  group_by(ID) %>%
  pivot_longer(
    names_to = "Variable",
    values_to = "Value",
    cols = 2:3
    )

In this command, the names_to = "Variable" argument specifies that the names of the columns being gathered should be stored as rows in a new column called Variable.

We have also asked, via values_to = "Value", for the values to be retained in a new column called Value.

And finally, cols = 2:3 specifies the columns (here the second and third) to be moved into rows.
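
The columns to be lengthened can also be selected by name rather than by position, which is more robust if the column order ever changes. The same reshape, assuming the df.wide used above:

```r
df.wide %>%
  pivot_longer(
    cols = c(x, y),        # select columns by name instead of 2:3
    names_to = "Variable",
    values_to = "Value"
    )
```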

3.4.5 Reshaping slightly more complicated data

Of course, the preceding are stylized examples and hence easy to flip from long to wide or wide to long. But what if we had more complicated data, perhaps like the one shown below?

tibble(
  read.table(
    header = TRUE, 
    text = "
   ID   Test Year   Fall Spring Winter
    1   1   2008    15      16      19
    1   1   2009    12      13      27
    1   2   2008    22      22      24
    1   2   2009    10      14      20
    2   1   2008    12      13      25
    2   1   2009    16      14      21
    2   2   2008    13      11      29
    2   2   2009    23      20      26
    3   1   2008    11      12      22
    3   1   2009    13      11      27
    3   2   2008    17      12      23
    3   2   2009    14      9       31
   "
   )
  ) -> grades
ID Test Year Fall Spring Winter
1 1 2008 15 16 19
1 1 2009 12 13 27
1 2 2008 22 22 24
1 2 2009 10 14 20
2 1 2008 12 13 25
2 1 2009 16 14 21
2 2 2008 13 11 29
2 2 2009 23 20 26
3 1 2008 11 12 22
3 1 2009 13 11 27
3 2 2008 17 12 23
3 2 2009 14 9 31

These are the raw data. Each student (flagged by ID) takes two tests, 1 and 2, thrice a year (Fall, Spring, and Winter) in 2008, and then again in 2009. If we wanted to reshape these data so that each row is a student-semester-year combination, with one column per test, we could do the following:

grades.long <- grades %>%
  gather(Semester, Score, Fall:Winter) %>%
  mutate(Test = paste0("TestType", Test)) %>%
  spread(Test, Score) %>%
  arrange(ID, Year, Semester)
ID Year Semester TestType1 TestType2
1 2008 Fall 15 22
1 2008 Spring 16 22
1 2008 Winter 19 24
1 2009 Fall 12 10
1 2009 Spring 13 14
1 2009 Winter 27 20
2 2008 Fall 12 13
2 2008 Spring 13 11
2 2008 Winter 25 29
2 2009 Fall 16 23
2 2009 Spring 14 20
2 2009 Winter 21 26
3 2008 Fall 11 17
3 2008 Spring 12 12
3 2008 Winter 22 23
3 2009 Fall 13 14
3 2009 Spring 11 9
3 2009 Winter 27 31

This looks quirky but it isn’t; it shows you how you can blend tidyr and dplyr to end up with the desired result. Let us see how it works.

  1. We know that gather() will take multiple columns and put them into a single column (what we’ve referred to as going from wide to long). Here gather() takes the three semester columns (Fall, Spring, Winter) and collapses them into a single column called Semester, putting the test scores from Fall, Spring and Winter into a new column called Score.
step1 <- grades %>% gather(Semester, Score, Fall:Winter) 
ID Test Year Semester Score
1 1 2008 Fall 15
1 1 2009 Fall 12
1 2 2008 Fall 22
1 2 2009 Fall 10
2 1 2008 Fall 12
2 1 2009 Fall 16
2 2 2008 Fall 13
2 2 2009 Fall 23
3 1 2008 Fall 11
3 1 2009 Fall 13
3 2 2008 Fall 17
3 2 2009 Fall 14
1 1 2008 Spring 16
1 1 2009 Spring 13
1 2 2008 Spring 22
1 2 2009 Spring 14
2 1 2008 Spring 13
2 1 2009 Spring 14
2 2 2008 Spring 11
2 2 2009 Spring 20
3 1 2008 Spring 12
3 1 2009 Spring 11
3 2 2008 Spring 12
3 2 2009 Spring 9
1 1 2008 Winter 19
1 1 2009 Winter 27
1 2 2008 Winter 24
1 2 2009 Winter 20
2 1 2008 Winter 25
2 1 2009 Winter 21
2 2 2008 Winter 29
2 2 2009 Winter 26
3 1 2008 Winter 22
3 1 2009 Winter 27
3 2 2008 Winter 23
3 2 2009 Winter 31
  2. We know that mutate() creates a new variable. Here the variable Test is being overwritten by pasting the string “TestType” onto its existing values (1 and 2).
step12 <- grades %>%
  gather(Semester, Score, Fall:Winter) %>%
  mutate(Test = paste0("TestType", Test))
ID Test Year Semester Score
1 TestType1 2008 Fall 15
1 TestType1 2009 Fall 12
1 TestType2 2008 Fall 22
1 TestType2 2009 Fall 10
2 TestType1 2008 Fall 12
2 TestType1 2009 Fall 16
2 TestType2 2008 Fall 13
2 TestType2 2009 Fall 23
3 TestType1 2008 Fall 11
3 TestType1 2009 Fall 13
3 TestType2 2008 Fall 17
3 TestType2 2009 Fall 14
1 TestType1 2008 Spring 16
1 TestType1 2009 Spring 13
1 TestType2 2008 Spring 22
1 TestType2 2009 Spring 14
2 TestType1 2008 Spring 13
2 TestType1 2009 Spring 14
2 TestType2 2008 Spring 11
2 TestType2 2009 Spring 20
3 TestType1 2008 Spring 12
3 TestType1 2009 Spring 11
3 TestType2 2008 Spring 12
3 TestType2 2009 Spring 9
1 TestType1 2008 Winter 19
1 TestType1 2009 Winter 27
1 TestType2 2008 Winter 24
1 TestType2 2009 Winter 20
2 TestType1 2008 Winter 25
2 TestType1 2009 Winter 21
2 TestType2 2008 Winter 29
2 TestType2 2009 Winter 26
3 TestType1 2008 Winter 22
3 TestType1 2009 Winter 27
3 TestType2 2008 Winter 23
3 TestType2 2009 Winter 31
  3. We know that spread() shifts data from the long format to the wide format. Here spread() takes the newly created Test column and creates a column for each unique TestType. Into these new columns the corresponding values from the Score column are inserted.
step123 <- grades %>%
  gather(Semester, Score, Fall:Winter) %>%
  mutate(Test = paste0("TestType", Test)) %>%
  spread(Test, Score)
ID Year Semester TestType1 TestType2
1 2008 Fall 15 22
1 2008 Spring 16 22
1 2008 Winter 19 24
1 2009 Fall 12 10
1 2009 Spring 13 14
1 2009 Winter 27 20
2 2008 Fall 12 13
2 2008 Spring 13 11
2 2008 Winter 25 29
2 2009 Fall 16 23
2 2009 Spring 14 20
2 2009 Winter 21 26
3 2008 Fall 11 17
3 2008 Spring 12 12
3 2008 Winter 22 23
3 2009 Fall 13 14
3 2009 Spring 11 9
3 2009 Winter 27 31
  4. Finally, arrange() organizes the data into a particular order. Here we are asking for the final result to be arranged by ID, Year and Semester.
grades.long <- grades %>%
  gather(Semester, Score, Fall:Winter) %>%
  mutate(Test = paste0("TestType", Test)) %>%
  spread(Test, Score) %>%
  arrange(ID, Year, Semester)
ID Year Semester TestType1 TestType2
1 2008 Fall 15 22
1 2008 Spring 16 22
1 2008 Winter 19 24
1 2009 Fall 12 10
1 2009 Spring 13 14
1 2009 Winter 27 20
2 2008 Fall 12 13
2 2008 Spring 13 11
2 2008 Winter 25 29
2 2009 Fall 16 23
2 2009 Spring 14 20
2 2009 Winter 21 26
3 2008 Fall 11 17
3 2008 Spring 12 12
3 2008 Winter 22 23
3 2009 Fall 13 14
3 2009 Spring 11 9
3 2009 Winter 27 31
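
Incidentally, gather() and spread() have since been superseded by pivot_longer() and pivot_wider(). The chain above can be written with the newer verbs; a sketch that should produce the same result:

```r
grades %>%
  pivot_longer(
    cols = Fall:Winter,        # the three semester columns
    names_to = "Semester",
    values_to = "Score"
    ) %>%
  mutate(Test = paste0("TestType", Test)) %>%
  pivot_wider(
    names_from = Test,
    values_from = Score
    ) %>%
  arrange(ID, Year, Semester)
```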

This example shows how you can chain commands via the pipe operator %>%, collapsing all the data manipulation tasks into a single piped command rather than breaking them up. The trickier the data you are working with, the more difficult this is to do well. One trick is to subset your complicated data down to a small portion with maybe two or three unique units (individuals, geographies, etc.), and then see if you can get the code to do what you need it to do. If it works on this subset it will work on the full data set as well.
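
For example, to trial the chain above on a smaller piece of grades, you could keep just two students first (the choice of IDs here is arbitrary):

```r
grades %>%
  filter(ID %in% c(1, 2)) %>%        # keep just two students
  gather(Semester, Score, Fall:Winter) %>%
  mutate(Test = paste0("TestType", Test)) %>%
  spread(Test, Score) %>%
  arrange(ID, Year, Semester)
```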

3.5 Other dplyr functions

dplyr has several functions that can be used with summarise and mutate, as well as in other ways. Below we walk through a few you are likely to need.

3.5.1 count and tally

Say you want a frequency table or a cross-tabulation. This can be done quite easily. The example I’ll use is from the starwars data-set. How many unique species are there in the series?

data(starwars)
starwars %>% 
  count(
    species, 
    sort =  TRUE
    ) 

What if we want species per homeworld?

starwars %>% 
  count(
    species, homeworld, 
    sort = TRUE
    ) 

Notice that you get some NA values; if we didn’t want these we could filter them out first:

starwars %>% 
  filter(
    !is.na(species)  & !is.na(homeworld)
    ) %>% 
  count(
    species, homeworld, 
    sort = TRUE
    ) 

If I want the species count to be added as a column to the data frame, as mutate() would do, I could run:

starwars %>% 
  add_count(
    species
    ) 
#> # A tibble: 87 × 15
#>   name       height  mass hair_color  skin_color  eye_color birth_year sex   gender 
#>   <chr>       <int> <dbl> <chr>       <chr>       <chr>          <dbl> <chr> <chr>  
#> 1 Luke Skyw…    172    77 blond       fair        blue            19   male  mascul…
#> 2 C-3PO         167    75 <NA>        gold        yellow         112   none  mascul…
#> 3 R2-D2          96    32 <NA>        white, blue red             33   none  mascul…
#> 4 Darth Vad…    202   136 none        white       yellow          41.9 male  mascul…
#> 5 Leia Orga…    150    49 brown       light       brown           19   fema… femini…
#> 6 Owen Lars     178   120 brown, grey light       blue            52   male  mascul…
#> # … with 81 more rows, and 6 more variables: homeworld <chr>, species <chr>,
#> #   films <list>, vehicles <list>, starships <list>, n <int>

If I want just the total sample size added, I could do:

starwars %>% 
  add_tally() 
#> # A tibble: 87 × 15
#>   name       height  mass hair_color  skin_color  eye_color birth_year sex   gender 
#>   <chr>       <int> <dbl> <chr>       <chr>       <chr>          <dbl> <chr> <chr>  
#> 1 Luke Skyw…    172    77 blond       fair        blue            19   male  mascul…
#> 2 C-3PO         167    75 <NA>        gold        yellow         112   none  mascul…
#> 3 R2-D2          96    32 <NA>        white, blue red             33   none  mascul…
#> 4 Darth Vad…    202   136 none        white       yellow          41.9 male  mascul…
#> 5 Leia Orga…    150    49 brown       light       brown           19   fema… femini…
#> 6 Owen Lars     178   120 brown, grey light       blue            52   male  mascul…
#> # … with 81 more rows, and 6 more variables: homeworld <chr>, species <chr>,
#> #   films <list>, vehicles <list>, starships <list>, n <int>

If I do not want to add the sample size but instead just calculate it, I could do:

starwars %>% 
  tally()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    87
starwars %>% 
  group_by(
    species
    ) %>% 
  filter(
    !is.na(species)
    ) %>% 
  tally(
    sort = TRUE
    )
#> # A tibble: 37 × 2
#>   species      n
#>   <chr>    <int>
#> 1 Human       35
#> 2 Droid        6
#> 3 Gungan       3
#> 4 Kaminoan     2
#> 5 Mirialan     2
#> 6 Twi'lek      2
#> # … with 31 more rows

3.5.2 summary functions

Say you have the nycflights13 data, similar to our Columbus flights data but for NYC. I want to perform several calculations. First, I want to know how many unique aircraft there are.

library(nycflights13)

flights %>% 
  summarise(
    unique.planes = n_distinct(tailnum)
    ) 

How about unique planes by destination?

flights %>% 
  group_by(
    dest
    ) %>% 
  summarise(
    unique.planes = n_distinct(tailnum)
    ) %>% 
  arrange(
    -unique.planes
    )

So there are 1308 unique aircraft that fly from NYC to Boston.

Which destination would give you the most choice of carriers from NYC?

flights %>% 
  group_by(
    dest
    ) %>% 
  summarise(
    unique.carriers = n_distinct(carrier)
    ) %>% 
  arrange(
    -unique.carriers
    )

I want to find the first flight for each destination on January 1.

flights %>% 
  filter(
    month == 1 & day == 1
    ) %>% 
  group_by(
    dest
    ) %>% 
  summarise(
    flt.order = first(time_hour)
    ) 

The last flight?

flights %>% 
  filter(
    month == 1 & day == 1
    ) %>% 
  group_by(
    dest
    ) %>% 
  summarise(
    flt.order = last(time_hour)
    ) 

How many flights to each destination on January 1?

flights %>% 
  filter(
    month == 1 & day == 1
    ) %>% 
  group_by(
    dest
    ) %>% 
  summarise(
    n.flts = n()
    ) %>% 
  arrange(
    -n.flts
    ) 

Wait a minute, what if I want this by airline?

flights %>% 
  filter(
    month == 1 & day == 1
    ) %>% 
  group_by(
    dest, carrier
    ) %>% 
  summarise(
    n.flts = n()
    ) %>% 
  arrange(
    -n.flts
    )

3.5.3 window functions

What if I want to flag each flight of a unique aircraft and rank it as the first flight of the day, second flight of the day, and so on?

flights %>% 
  filter(
    month == 1 & day == 1
    ) %>% 
  group_by(
    tailnum
    ) %>% 
  arrange(
    time_hour
    ) %>% 
  mutate(
    flt.order = row_number()
    ) %>% 
  arrange(
    tailnum, time_hour
    ) %>% 
  select(
    carrier, dest, tailnum, 
    time_hour, month, day, flt.order
    )

How many times does the same aircraft fly on January 1?

flights %>% 
  filter(
    month == 1 & day == 1
    ) %>% 
  group_by(
    tailnum
    ) %>% 
  summarise(
    n.flts = n()
    ) %>% 
  arrange(
    -n.flts
    )

What if I want only the first flight of each aircraft on January 1? min_rank() will rank each aircraft’s flights by time_hour; we can then keep only the rows ranked first.

flights %>% 
  filter(
    month == 1 & day == 1
    ) %>% 
  group_by(
    tailnum
    ) %>% 
  mutate(
    flt.order = min_rank(time_hour)
    ) %>% 
  filter(
    flt.order == 1
    )

This is just a snippet of what dplyr can do. In many ways, what dplyr can do is, without exaggeration, limited only by our own understanding of how to construct a logical calculation. Take a look at the cheatsheets here.

3.6 Using data.table

The {data.table} package was designed for faster data manipulations when dealing with large data-sets, large being defined here as anything more than 10 gigabytes (10 GB). By some counts more than 200 R packages rely on {data.table}, making it very popular, and it is said to be faster than {dplyr}. Let us run through some basic operations with this package, using a different version of the flights data-set, one restricted to January through October 2014 flights departing New York area airports.

3.6.1 fread()

The package provides the fread() function for reading in data; it has several options one can specify, but the default settings usually work well, which is why you see no options listed in the command below.

library(data.table)
#> 
#> Attaching package: 'data.table'
#> The following objects are masked from 'package:dplyr':
#> 
#>     between, first, last
#> The following object is masked from 'package:purrr':
#> 
#>     transpose
fread(
  "https://github.com/arunsrinivasan/satrdays-workshop/raw/master/flights_2014.csv"
  ) -> my.df 

fread() is a lot faster than read.csv(), and fwrite() is likewise faster than write.csv() when exporting data-frames as csv files.
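
fread() also has options worth knowing, such as select = to read only certain columns and nrows = to cap the number of rows read; fwrite() writes a data.table back out. A sketch, assuming you have saved the csv locally (the file names here are illustrative):

```r
library(data.table)

# read only four columns and the first 1,000 rows
fread(
  "flights_2014.csv",
  select = c("carrier", "origin", "dest", "dep_delay"),
  nrows = 1000
  ) -> small.df

# write the result back out as a csv
fwrite(small.df, "flights_small.csv")
```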

If I want to extract one or more variables, perhaps even sub-setting them into another data.table, I can do that quite easily, either by specifying the variable names or by position:

my.df[
  , 
  .(origin)
  ] -> df1 
my.df[
  , 
  .(
    year, month, day, dep_delay, 
    arr_delay, carrier, origin, dest
    )
  ] -> df2 
my.df[
  , 
  c(1:3, 5)
  ] -> df3 

I can also filter the data.table before extracting; say I want all flights departing La Guardia, or, narrowing further, all flights departing La Guardia in July:

my.df[
  origin %in% c("LGA")
  ] -> df4 
my.df[
  origin %in% c("LGA") & month %in% c(7)
  ] -> df5 

3.6.2 The key

Since it is all about speed, one way to execute data operations faster is by specifying a key variable that is then used by {data.table} to sift the rows and remember the sifting. Once the key has been specified, you can filter/subset etc. without having to specify the key variable.

setkey(my.df, origin, month)
my.df[
  .("LGA", 7)
  ] -> df6 # flights from LGA in July
setkey(my.df, origin, month, dest)
my.df[
  .("LGA", 7, "ORD")
  ] -> df7 # flights from LGA in July to ORD

To see what key is being used you can run key(my.df)

3.6.3 Sort

Sorting is done with order(), and by default sorting will be done in ascending order, so if you want to sort in descending order, prefix the variable(s) with -.

df7[order(dep_delay)] -> df8 # ascending order of departure delay 
df7[order(-dep_delay)] -> df9 # descending order of departure delay 
df7[order(dep_delay, dep_time)] -> df10 # ascending order of departure delay and ascending order of departure time
df7[order(dep_delay, -dep_time)] -> df11 # ascending order of departure delay and descending order of departure time

3.6.4 Calculating Aggregate Values

Say I want to calculate the mean and standard deviation of departure delay times by airline, for all flights. Maybe I also want to know how many flights each carrier operated.

my.df[
  , 
  .(
    average_departure_delay = mean(dep_delay), 
    stddev_dep_delay = sd(dep_delay)
    ), 
  by = carrier
  ] -> df12 
my.df[, .N, by = carrier] -> df13 
my.df[, .N, keyby = carrier] -> df14 # results are ordered in ascending order of carrier
my.df[, .N, by = carrier][order(N)] -> df15 # results in ascending order of N 

3.6.5 Chaining operations

Aha! We just leaned on data.table's way of dealing with chained operations (similar to the pipe operator of dplyr): follow one operation encased in [] with another [] to create a chain of operations [][]! For example, say we want the number of flights by carrier and destination, sorted by carrier.

my.df[, .N, by = .(carrier, dest)][order(carrier)] -> df16 # results in ascending order of carrier 
my.df[, .N, by = .(carrier, dest)][order(carrier, -N)] -> df17 # results in ascending order of carrier and then, within carrier, in descending order of N

3.6.6 .SD

data.table has a special symbol, .SD (the Subset of Data), that lets you compute on a subset of the data.table. For example, say I want the mean departure delay by carrier within each destination, or the mean departure and arrival delays by carrier.

my.df[
  ,  
  .SD[
    , 
    mean(dep_delay), 
    by = carrier
    ], 
  by = dest
  ] -> df18 
my.df[
  , 
  lapply(.SD, mean), 
  by = carrier, 
  .SDcols = c("dep_delay", "arr_delay")
  ] -> df19 # calculate the mean for the variables specified, by carrier

There is a lot more that data.table, dplyr and base R can do, so treat these examples as mere glimpses into some of the operations you are likely to perform. It is also useful to keep an eye on the cheat-sheets for the packages: dplyr and tidyr are available here and several others are available here.

3.7 Using {forcats}

There is a package that some find useful when working with factors. Aptly named as an anagram of what it works with, it is the forcats package. We will see some of the core functions here, but if you are curious about other possibilities, check the package vignette or the web for other examples. I’ll show you the three things I use it for. First, generating a plot that orders the categories meaningfully, say in ascending or descending order of some value. Here is an example, with data from the General Social Survey.

library(ggplot2)
library(dplyr)
library(forcats)

gss_cat %>%
  group_by(
    relig
    ) %>%
  summarise(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()
  ) -> relig_summary 

ggplot(
  relig_summary, 
  aes(tvhours, relig)
  ) + 
  geom_point()

Notice how this plot is unordered. I want the religion with the lowest number of television hours first, that is, in ascending order of TV hours. That can be achieved via fct_reorder().

ggplot(
  relig_summary, 
  aes(tvhours, fct_reorder(relig, tvhours))
  ) +
  geom_point()

Note the logic: it says reorder relig on the basis of tvhours. If I wanted it in terms of descending TV hours I would use -tvhours.

ggplot(
  relig_summary, 
  aes(tvhours, fct_reorder(relig, -tvhours))
  ) +
  geom_point()

If using bar-charts, switch to fct_infreq().

gss_cat %>%
  mutate(
    marital = marital %>% 
      fct_infreq() %>% 
      fct_rev()
    ) %>%
  ggplot(aes(marital)) +
    geom_bar()

And if you want the bars drawn in descending order just leave out fct_rev().

gss_cat %>%
  mutate(
    marital = marital %>% 
      fct_infreq()
    ) %>% 
  ggplot(aes(marital)) +
    geom_bar()

forcats comes in handy for modifying factor levels. To see this functionality we will rely on the usual seven-point party identification variable in the GSS and lean on fct_recode().

gss_cat %>% 
  count(partyid)
#> # A tibble: 10 × 2
#>   partyid                n
#>   <fct>              <int>
#> 1 No answer            154
#> 2 Don't know             1
#> 3 Other party          393
#> 4 Strong republican   2314
#> 5 Not str republican  3032
#> 6 Ind,near rep        1791
#> # … with 4 more rows

Now we tweak these labels.

gss_cat %>%
  mutate(
    partyid = fct_recode(
      partyid, 
      "Republican, strong"    = "Strong republican",
      "Republican, weak"      = "Not str republican",
      "Independent, near rep" = "Ind,near rep",
      "Independent, near dem" = "Ind,near dem",
      "Democrat, weak"        = "Not str democrat",
      "Democrat, strong"      = "Strong democrat"
      )
    ) %>%
  count(partyid)
#> # A tibble: 10 × 2
#>   partyid                   n
#>   <fct>                 <int>
#> 1 No answer               154
#> 2 Don't know                1
#> 3 Other party             393
#> 4 Republican, strong     2314
#> 5 Republican, weak       3032
#> 6 Independent, near rep  1791
#> # … with 4 more rows

We can also collapse categories, as shown below in creating Other.

gss_cat %>%
  mutate(
    partyid = fct_recode(
      partyid,
      "Republican, strong"    = "Strong republican",
      "Republican, weak"      = "Not str republican",
      "Independent, near rep" = "Ind,near rep",
      "Independent, near dem" = "Ind,near dem",
      "Democrat, weak"        = "Not str democrat",
      "Democrat, strong"      = "Strong democrat",
      "Other"                 = "No answer",
      "Other"                 = "Don't know",
      "Other"                 = "Other party"
      )
    ) %>%
  count(partyid)
#> # A tibble: 8 × 2
#>   partyid                   n
#>   <fct>                 <int>
#> 1 Other                   548
#> 2 Republican, strong     2314
#> 3 Republican, weak       3032
#> 4 Independent, near rep  1791
#> 5 Independent            4119
#> 6 Independent, near dem  2499
#> # … with 2 more rows

Categories can also be collapsed via fct_collapse().

gss_cat %>%
  mutate(
    partyid = fct_collapse(
      partyid,
      other = c("No answer", "Don't know", "Other party"),
      rep = c("Strong republican", "Not str republican"),
      ind = c("Ind,near rep", "Independent", "Ind,near dem"),
      dem = c("Not str democrat", "Strong democrat")
      )
    ) %>%
  count(partyid)
#> # A tibble: 4 × 2
#>   partyid     n
#>   <fct>   <int>
#> 1 other     548
#> 2 rep      5346
#> 3 ind      8409
#> 4 dem      7180

and then there is fct_lump(), which will lump together the categories with small frequencies. If you don’t specify how many groups you want to retain via n =, you may over-collapse the categories, so be careful.

gss_cat %>%
  mutate(
    relig = fct_lump(relig, n = 10)
    ) %>%
  count(
    relig, sort = TRUE
    ) 
#> # A tibble: 10 × 2
#>   relig          n
#>   <fct>      <int>
#> 1 Protestant 10846
#> 2 Catholic    5124
#> 3 None        3523
#> 4 Christian    689
#> 5 Other        458
#> 6 Jewish       388
#> # … with 4 more rows