1 Introduction to R & RStudio

In this opening chapter we will walk through the installation process for getting R & RStudio up and running on your machine, see how the RStudio IDE (the integrated development environment) is set up, learn how to install and update packages, and then learn how to read and save data in various formats, with or without some basic data cleaning.

1.1 Installing R & RStudio

R is a “free” software environment for statistical computing and graphics. It is powerful, elegant, and incredibly flexible. RStudio is perhaps the most commonly used integrated development environment (IDE) for R; it is also free, and yet it is as powerful as, if not more powerful than, any commercial IDE in existence today. Both R & RStudio need to be set up before you can start working with data.1 Let us get cracking then!

First download R for Windows from here and R for Mac from here. Double-click on the downloaded file and accept the default settings as you go through the installation.

Once R is installed you can install RStudio for your operating system from here. Accept the default prompts through the installation process.

Once RStudio has installed, double-click the RStudio shortcut/icon and RStudio will launch. If all goes well you should see R starting up inside RStudio and you should be good to start working with data.

Both R and RStudio go through very frequent updates, some minor, some major. As needed, repeat the steps you took above to re-install and update your version of R and RStudio every few months.

1.2 A Brief RStudio Walk-through

RStudio has more features than needed for this class but at minimum you need to recognize the few listed below.

Feature       What it does …
Console       This is where commands are issued to R, either by typing and hitting enter or by running commands from a script (essentially a file with the commands you want to execute)
Environment   stores and shows you all the objects (data, lists, scalars, etc) created during a working session
History       shows you a running list of all commands issued to R
Connections   shows you any databases/servers you are connected to and also allows you to initiate a new connection
Files         shows you files and folders in your current working directory, and you can move up/down in the folder hierarchy
Plots         shows you all graphics that have been generated
Packages      shows you installed packages
Help          allows you to access the help pages for a package by typing in keywords
Viewer        shows you any “live” documents running on the server
Knit          allows you to generate html/pdf/word documents that combine R commands with plain text, graphics, tables, etc
Insert        allows you to insert a vanilla R chunk. You can (and should) give a unique name to each code chunk so that you can easily diagnose which chunk is not working, but more on R chunks later.
Run           allows you to run specific lines and/or R chunks

The gear button allows you to tweak various options; click on Output Options and explore the options therein. The one thing you should do now is to click the gear icon and ensure the following has a check-mark against it: “Chunk Output in Console”. This will make sure that the output shows up in the console or the plot window rather than in your working document.

You can customize the panes via Tools -> Global Options... Panes can be detached, which is very helpful when you want another application next to the pane or behind it, or if you are using multiple monitors, since you can then execute commands on one monitor and watch the output on another. You also have a spell-check; use it to catch typos.

1.3 Installing R Packages

While base R comes with expansive and flexible functionality, its abilities are extended by packages built by individuals, teams, and even some organizations. Thus, for example, if I want to generate maps, I will need to lean on specific packages. Ditto if I want to generate interactive graphics, manipulate some complicated data-sets, pull some data from the US Census Bureau, and so on. Each package typically contains some R functions, some data-sets used to show how these functions work and what they do, and some helpful documentation.

Packages can most easily be installed via Tools -> Install Packages.... Once you install a package on a computer, you do not need to re-install it unless something gets corrupted on your computer. Let us try and install a few packages. These are listed below.

devtools, ggplot2, dplyr, tidyr, reshape2, lubridate, car, Hmisc, gapminder, leaflet, prettydoc, DT, data.table, htmltools, scales, ggridges, ggrepel, highcharter, plotly, maps, here, skimr, janitor, gganimate, radix

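Packages can also be installed via an explicit command as shown below. Note that the package names must be combined into a single character vector with c(); otherwise the second name would be read as the library location.

install.packages(c("ggplot2", "ggmap", "dplyr", "maps", "tidyr", "reshape2"))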
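Since a package only needs to be installed once, one common pattern is to install only what is missing. The sketch below is one way to do that with base R; the helper name missing_packages() and the short list of packages are illustrative assumptions, and the install step is guarded with interactive() so that nothing is downloaded when the code is run non-interactively.

```r
# Hypothetical helper: which of the wanted packages are not yet installed?
missing_packages <- function(wanted) {
  setdiff(wanted, rownames(installed.packages()))
}

# A few packages from the list above (illustrative subset)
to_install <- missing_packages(c("ggplot2", "dplyr", "tidyr"))
to_install

# Install only when something is missing and we are in an interactive session
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```

Running this a second time should report nothing left to install.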

Packages are frequently updated so you should make it a habit to check for updated packages by running “Check for Package Updates…”, also under Tools.

Some packages also have to be built from source and when that is the case, the author(s) will tell you how to do so, the most common route being one of the following:

devtools::install_github("somename/packagename")
remotes::install_github("somename/packagename")

1.4 RStudio Projects

After packages have been installed, go ahead and create a New Project from the File menu in RStudio. You should create it in a New Directory (a folder on your computer) and name this folder mpa. RStudio will restart and launch in the mpa folder. You will see the folder now has a file called mpa.Rproj. The next time you need to launch RStudio for work, double-click this file and RStudio will start.

Go inside the mpa folder and create a sub-folder called data. All data you need for class should be put inside the data folder. All code files that you create or that I give you, each with an .Rmd extension, should be downloaded and placed inside the mpa folder. The resulting directory tree will look like the following sketch:

mpa/
  mpa.Rproj
  data/     (all data files for class)
  *.Rmd     (code files)

Please follow these instructions for setting up the project folder, the code folder and the data sub-folder, and for launching RStudio in subsequent sessions. If you take this advice you will spend less time trying to find code or data files and diagnosing errors and more time understanding what a particular piece of code is doing.

When you are done working in RStudio, be sure to close it, and make sure you choose Don't save when asked if you wish to save the session.

1.5 RMarkdown

RMarkdown files enable you to generate html, pdf and word documents that contain both your code (which you can hide or show), appropriate text (describing, for example, the data, the analysis, conclusions, references, and so on), and tables and graphs. The beauty of RMarkdown is that it provides us the ability to work with a single document to create a fancy report, a dashboard, a slide-deck, a book, a website, a blog, etc. From a data science perspective, this is very useful because it allows for reproducibility and collaboration; I can share data and code with you and unless something is malfunctioning on your computer you should be able to knit the document or just run the code line by line and end up with the same results as I did.

Give every code chunk a unique name, which can be an alphanumeric string with no whitespace. If you forget, use the namer package to assign names to every code chunk that lacks one. This can be done, after the package is loaded, via name_chunks("myfilename.Rmd").

In RMarkdown files you will see that the code chunks have several options that could be invoked. Here are some of the more common ones we will use.

Option                  What it does …
eval                    If FALSE, knitr will not run the code in the code chunk.
include                 If FALSE, knitr will run the chunk but not include the chunk in the final document.
echo                    If FALSE, knitr will not display the code in the code chunk but will show the resulting output from the code in the final document.
error                   If FALSE, knitr will stop knitting when the code generates an error; if TRUE, the error message is shown in the document and knitting continues.
message                 If FALSE, knitr will not display any messages generated by the code.
warning                 If FALSE, knitr will not display any warning messages generated by the code.
cache                   If TRUE, knitr will cache the results to reuse in future knits. Knitr will reuse the results until the code chunk is altered.2
dev                     The R function name that will be used as a graphical device to record plots, e.g. dev = "png"
dpi                     A number for knitr to use as the dots per inch (dpi) in graphics (when applicable).
fig.align               ‘center’, ‘left’, ‘right’ alignment in the knit document
fig.height              height of the figure (in inches, for example)
fig.width               width of the figure (in inches, for example)
out.height, out.width   The height and width to which plots are scaled in the final output.
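Options that should apply to every chunk can also be set once, typically in a setup chunk at the top of the .Rmd file, through knitr's opts_chunk$set(). The following is a minimal sketch; the particular values chosen are assumptions for illustration, not recommended settings.

```r
library(knitr)

# Set document-wide defaults; individual chunks can still override them
opts_chunk$set(
  echo = FALSE,         # hide the code, show the output
  message = FALSE,      # suppress messages
  warning = FALSE,      # suppress warnings
  fig.align = "center", # center all figures
  fig.width = 6,        # figure width in inches
  fig.height = 4        # figure height in inches
  )

# Check what the current default is for a given option
opts_chunk$get("fig.align")
```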

Other options can be found in the cheatsheet available here. There also happen to be excellent video tutorials on RMarkdown (and other things) by the RStudio team on vimeo. You may need to sign up (for free) with an email id.

1.6 Reading Data into R/RStudio

R can read data created in various formats (SPSS, SAS, Stata, Excel, CSV, TXT, etc). The most common data formats you will encounter are likely to be CSV or Excel files and hence we focus on these first. Let us first download and save the data available here (as a zip archive). Once this file downloads, double-click it and extract all files to the data sub-folder; this is where all data must be saved.

Go ahead and make sure you installed the here library. From now on, the first time you need to load or save a data-set in a session, first load this library via library(here). You then only need to specify the subpath, such as data/ImportDataCSV.csv (see below), and you will be on your way. Otherwise it gets cumbersome to specify the full filepath which, in my case, would be ~/Documents/mpa/code/data/ImportDataCSV.csv. Note that in any R session you only need to load the library once, not every time you need to use one of the library’s functions.

You should, at some point, read Jenny Bryan’s excellent pieces on project workflow and the famous ode to the here package.
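To see what here() actually does, it can help to print the path it builds. The sketch below assumes the here package is installed and that the code is run from inside a project; the file need not exist for the path to be constructed.

```r
library(here)

# here() with no arguments reports the folder it treats as the project root
here()

# Pieces of the subpath are given as separate arguments and joined for you
here("data", "ImportDataCSV.csv") -> csv_path
csv_path

# A quick check that the file is really where we think it is
file.exists(csv_path)
```

If file.exists() returns FALSE, the usual culprits are a misspelled file name or RStudio having been launched outside the project.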

1.6.1 CSV & Tab-delimited Formats

With the CSV format a comma separates each variable (note each variable is a column), and each row in the original file represents an observation.

library(here)
read.csv(
  here("data", "ImportDataCSV.csv"),
  sep = ",",
  header = TRUE
  ) -> df.csv 

df.csv is the name I have chosen to give to the data being read. In the read.csv() command I am telling R where the data can be found and the name of the data-set (“data/ImportDataCSV.csv”), the fact that variables (one in each column) are separated by a comma (,), and the fact that the original data have column-headings (header = TRUE).

Note that when you create anything in R, you will most often see examples doing so via the = symbol, via the <- symbol, or via the -> symbol (my default). Thus df.csv = read.csv(…) is the same as df.csv <- read.csv(…) and the same as read.csv(…) -> df.csv. I suggest you pick one assignment operator and stick with it.

There are a total of five assignment operators we could use in R. Google <- versus = in R and see the differing opinions on why some prefer one over the other, and also when and where each can be used.
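For reference, the five operators are <-, ->, =, <<- and ->>. The short sketch below shows each one at work; the object names (a, b, c1, counter, bump) are made up for illustration. The double-headed versions assign to an object in an enclosing environment, which is why they appear inside a function here.

```r
a <- 1    # leftward assignment
2 -> b    # rightward assignment
c1 = 3    # the equals sign, at the top level

0 -> counter
bump <- function() {
  counter <<- counter + 1   # leftward, modifies counter outside the function
  counter + 1 ->> counter   # rightward, same effect
  invisible(counter)
}
bump()

a; b; c1
counter   # changed by the two assignments inside bump()
```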

When you execute the command you will see df.csv showing up under Data in the upper-right pane of RStudio. Click on df.csv and you can see the data-set in spreadsheet form. If you only click the blue play button you will see the contents listed in cascading style.

A similar process works for reading in tab-delimited files where the columns are separated by a tab rather than by a comma.

read.csv(
  here("data", "ImportDataTAB.txt"),
  sep = "\t",
  header = TRUE
  ) -> df.tab 

Note the one difference here: I told R it is a tab-delimited file by changing the sep = "" argument to a backslash followed by t, that is, sep = "\t".
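If you want to convince yourself that the separator is the only thing that differs, here is a small self-contained sketch that does not depend on the class data files. It writes a made-up data frame (toy) to temporary CSV and tab-delimited files and reads both back.

```r
# A tiny made-up data-set
data.frame(id = 1:3, score = c(90.5, 72, 88)) -> toy

# Temporary files so nothing is left behind in your project
tempfile(fileext = ".csv") -> csv_file
tempfile(fileext = ".txt") -> tab_file

# Write the same data with a comma and with a tab as the separator
write.csv(toy, csv_file, row.names = FALSE)
write.table(toy, tab_file, sep = "\t", row.names = FALSE, quote = FALSE)

# Read both back; only the sep argument changes
read.csv(csv_file, sep = ",", header = TRUE) -> toy.csv
read.csv(tab_file, sep = "\t", header = TRUE) -> toy.tab

# Do the two routes give the same data frame?
identical(toy.csv, toy.tab)
```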

1.6.2 Excel Formats (.xls & .xlsx)

There are several packages that will allow you to read files in various Excel formats but the one I prefer is readxl. Remember: Whenever we need to use a package we will have to first load it and then execute whatever commands call upon the loaded package’s features as shown below.

library(readxl)
read_excel(
  here("data", "ImportDataXLS.xls")
  ) -> df.xls 
read_excel(
  here("data", "ImportDataXLSX.xlsx")
  ) -> df.xlsx 

Note the one minor difference in the commands; the xlsx format file has the .xlsx file extension.

1.6.3 SPSS, Stata, and SAS formats

At times, and especially from some major federal agencies, the data you will need to access may be shipped in a particular format. Some disciplines/fields are also accustomed to working with a specific file format. For example, economists and those who work in public health have historically used Stata and SAS, respectively. Consequently, whether the data are from the CDC’s BRFSS or some other survey series, you will often see the data being made available for download in these formats. I therefore show you how to read data that come to us in these formats.

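library(haven) # assuming the haven package, which reads all three formats
read_stata(
  here("data", "ImportDataStata.dta")
  ) -> df.stata 
read_sas(
  here("data", "ImportDataSAS.sas7bdat")
  ) -> df.sas 
read_sav(
  here("data", "ImportDataSPSS.sav")
  ) -> df.spss 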

1.6.4 Fixed-width files

It is also common to encounter fixed-width files, also called flat-files. These are files where the raw data are stored without any gaps between successive variables. Yes, no commas, tabs, or other delimiters. However, these files will come with documentation that will tell you where each variable starts and ends, along with other details about each variable. Let us see how with a very small example.

read.fwf(
  here("data", "fwfdata.txt"),
  widths = c(4, 9, 2, 4),
  header = FALSE,
  col.names = c("Name", "Month", "Day", "Year")
  ) -> df.fw 

Notice that we have to specify the width of each variable (since there is no separator like a comma or a tab that indicates where a variable begins and where it ends) and assign column names. The widths indicate how many slots each variable spans.
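To make the idea of widths concrete, the sketch below builds a two-line fixed-width file from scratch and reads it back. The names and dates are made up; each line is exactly 4 + 9 + 2 + 4 = 19 characters, with the month padded by spaces to fill its 9 slots. The strip.white = TRUE argument is passed on to read.table() so that the padding is dropped.

```r
# Two made-up records: Name (4), Month (9), Day (2), Year (4)
tempfile(fileext = ".txt") -> fw_file
writeLines(
  c("JohnJanuary  151990",
    "MarySeptember031985"),
  fw_file
  )

read.fwf(
  fw_file,
  widths = c(4, 9, 2, 4),
  header = FALSE,
  col.names = c("Name", "Month", "Day", "Year"),
  strip.white = TRUE   # remove the padding spaces around each value
  ) -> toy.fw

toy.fw
```

Changing one of the widths by a single slot is a quick way to see how easily the columns get scrambled when the documentation is misread.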

The readr package has a similar command that executes slightly faster.

library(readr)
read_fwf(
  here("data", "fwfdata.txt"), 
  fwf_widths(
    c(4, 9, 2, 4),
    col_names = c("Name", "Month", "Day", "Year")
    )
  ) -> df.fw2 

1.7 Reading Files directly from the Web

It is also possible to specify the full web-path for a file and read it in, rather than storing a local copy. This is often useful when we have to go in and pull data that might be periodically updated. Indeed, the Census Bureau, Bureau of Labor, Bureau of Economic Analysis and other entities often update older data that we may need. So having this functionality is very useful.

  ) -> fpe 
  header = TRUE
  ) -> test 
  header = TRUE
  ) -> test.csv 
  ) -> hsb2.spss 
as.data.frame(hsb2.spss) -> df.hsb2.spss 
rm("hsb2.spss") # Deleting the intermediate file 

R is able to read data from Twitter feeds, buoys sitting in the Atlantic ocean, and so much more! Now, one problem you could run into is that the website has moved and hence the URL is different or the site is down. There is, unfortunately, no way around these hurdles other than running into such an error while attempting to read the data off the web and then using a browser to search for the data by name or via keywords.3
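One way to soften the blow of a moved or unavailable file is to wrap the read in tryCatch() so that a failure produces a readable message instead of halting the rest of your code. This is only a sketch; safe_read_csv() is a hypothetical helper, not a function from any package, and it is demonstrated here on a file that deliberately does not exist rather than on a real URL.

```r
# Hypothetical helper: return the data if it can be read, NULL otherwise
safe_read_csv <- function(path_or_url) {
  tryCatch(
    suppressWarnings(read.csv(path_or_url, header = TRUE)),
    error = function(e) {
      message("Could not read ", path_or_url, ": ", conditionMessage(e))
      NULL
    }
  )
}

# A path that does not exist, standing in for a URL that has moved
safe_read_csv(file.path(tempdir(), "no-such-file.csv")) -> missing.df
is.null(missing.df)
```

The same call works with a web address in place of the file path, since read.csv() accepts either.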

Reading Excel files is a bit complicated, depending on how you go about it. For example, the long road taken would be something like this:

download.file("https://stats.idre.ucla.edu/stat/data/hsb2.xls", "hsb2.xls",
  mode = "wb"
  )
read_excel("hsb2.xls") -> hsb2

The easier way to do this would be to opt for the {rio} package, but remember to install the package (if missing) and to run rio::install_formats() before executing the code below.

library(rio)
"https://stats.idre.ucla.edu/stat/data/hsb2.xls" -> url
import(url) -> hsb2

1.8 Reading compressed files

If you have compressed files (*.zip, *.gzip, etc.) you can use simple R code to download and unzip the file prior to reading it in, all in one block of commands. You will, however, need to know the file-name and extension of the file inside the compressed archive. The compressed file just needs to be in a format that R can read. If that format happens to be ascii and the delimiters are clunky then of course more work would be needed to specify start/end column positions, maybe add column names, etc. If it is in an easier to read, canned format, such as SAS/Stata/SPSS, then you are home free!

tempfile() -> temp 
  ) -> ourdata 

You should see ourdata in the Global Environment. A word of caution: Be careful with rm() commands because they remove named objects from the Global Environment. If you forget you removed some object, you could end up running into errors when you try and call on that object for some analysis or visualization.
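A quick way to see what rm() does, and to check whether an object is still around before you call on it, is sketched below with a throwaway object (the name scratch is made up).

```r
c(1, 2, 3) -> scratch
exists("scratch")   # is the object in the environment?

rm(scratch)         # remove it
exists("scratch")   # check again

# Calling on a removed object raises an error; here we just capture the message
tryCatch(mean(scratch), error = function(e) conditionMessage(e))
```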

1.9 Data in R packages

Almost all R packages come bundled with data-sets, too many of them to walk you through here.

To load data from a package, if you know the data-set’s name, run

library(HistData)
data("Galton")
names(Galton)
#> [1] "parent" "child"

or you can run

data(
  "GaltonFamilies",
  package = "HistData"
  )
names(GaltonFamilies)
#> [1] "family"          "father"          "mother"          "midparentHeight"
#> [5] "children"        "childNum"        "gender"          "childHeight"
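If you do not know the data-set’s name, data() can list what a package offers. The sketch below uses the datasets package, which ships with R, so it should run without installing anything; the same calls should work for any installed package.

```r
# List the data-sets bundled with a package (first few shown)
data(package = "datasets")$results -> available
head(available[, c("Item", "Title")])

# Load one of them by name and look at its columns
data("mtcars", package = "datasets")
names(mtcars)
```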

1.10 Saving Data in the R Format

We can save a data-set we have created quite easily with the save() command, as shown below.

save(hsb2,
  file = here("data", "hsb2.RData"))

Note the sequence. We first specify the data set we want to save, here hsb2, and then the location and file-name of the saved data: file = here("data", "hsb2.RData"). If you look at the data folder you will now see a file called hsb2.RData.

You can always check what your current working directory is by running getwd() and you can set your working directory by running setwd("~/Users/ruhil/Documents/myRbook") or using the Set Working Directory sub-menu found in the Session menu in RStudio. The better method, however, is to rely on the project, since all files are expected to be in the project folder and if you end up changing directory paths in the RMarkdown file when reading data or images, etc., you could run into trouble. Remember to use the here library when working in a project.

save(mtcars,
  file = here("data", "mtcars.RData")
  )
rm(list = ls()) # To clear the Environment but not to be used liberally or wantonly!! 
load(here("data", "mtcars.RData"))

You can also save multiple data files as follows:

library(ggplot2) # the diamonds data-set ships with ggplot2
save(
  mtcars, diamonds,
  file = here("data", "mydata.RData")
  )
rm(list = ls()) # To clear the Environment
load(here("data", "mydata.RData"))

If you want to save just a single object from the environment and then load it in a later session, maybe with a different name, then you should use saveRDS() and readRDS().

saveRDS(mtcars,
  file = here("data", "mydata.RDS")
  )
rm(list = ls()) # To clear the Environment
readRDS(
  here("data", "mydata.RDS")
  ) -> ourdata 

If instead you did the following, the file will be read with the name it had when it was saved

save(mtcars,
  file = here("data", "mtcars.RData")
  )
rm(list = ls())  # To clear the Environment
load(
  here("data", "mtcars.RData")
  ) -> ourdata # Note ourdata is listed as "mtcars" 
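The difference between the two routes can be seen side by side in the following self-contained sketch, which uses temporary files and a made-up data frame (toy) rather than the project’s data folder. load() restores the object under its original name and returns only that name, while readRDS() returns the object itself so you can call it whatever you like.

```r
data.frame(x = 1:3) -> toy

tempfile(fileext = ".RData") -> rdata_file
tempfile(fileext = ".RDS") -> rds_file

save(toy, file = rdata_file)      # stores the object together with its name
saveRDS(toy, file = rds_file)     # stores the object only

rm(toy)

load(rdata_file) -> restored_names  # toy is back; restored_names holds its name
restored_names

readRDS(rds_file) -> renamed        # same data under a name of our choosing
identical(toy, renamed)
```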

If you want to save everything you have done in the work session you can do so via save.image().

save.image(
  file = here("data", "mywork_jan182018.RData"))

You can restore this image in a later session with load(); only an image saved under the default name .RData in the folder where RStudio starts is loaded automatically. This is useful if you have a lot of R code you have written and various objects generated and do not want to start from scratch the next time around.

If you are not in a project and you try to close RStudio after some code has been run, you will be prompted to save (or not) the workspace and you should say “No” by default unless you really want to save the workspace. In the decade+ I have been using R I have never saved the workspace. Maybe I should have but since I am none the worse for wear I will plod on not saving the workspace since it has done me no harm thus far!

save(my.df, file = "~/Documents/Testing/hsb2.RData") 
save(my.df, file = "C:\\Documents\\Testing\\hsb2.RData") # Note the two backslashes for Windows 

1.11 Loading and Modifying Data

Let us read some data, perhaps the hsb2 data. These data reflect a random subset of records from the High School & Beyond study conducted by the National Center for Education Statistics. The primary focus of the study was/is to understand students’ trajectories after leaving high school into post-secondary education, the workforce, and beyond, and to understand what factors influence the students’ educational and career outcomes after passing through the American educational system.

  header = TRUE,
  sep = ","
  ) -> hsb2 

This is a data frame with 200 observations (i.e., n = 200) and the following 11 variables:

Variable   Description
id         Student ID.
female     Student’s gender, with levels female (1) and male (0).
race       Student’s race, with levels african american (3), asian (2), hispanic (1), and white (4).
ses        Socio economic status of student’s family, with levels low (1), middle (2), and high (3).
schtyp     Type of school, with levels public (1) and private (2).
prog       Type of program, with levels general (1), academic (2), and vocational (3).
read       Standardized reading score.
write      Standardized writing score.
math       Standardized math score.
science    Standardized science score.
socst      Standardized social studies score.

What are the variable names (i.e., the column headings) in this file? The names() command will tell you that.

names(hsb2)
#>  [1] "id"      "female"  "race"    "ses"     "schtyp"  "prog"    "read"    "write"  
#>  [9] "math"    "science" "socst"

Similarly, the str() command will show you the details of each variable

str(hsb2)
#> 'data.frame':    200 obs. of  11 variables:
#>  $ id     : int  70 121 86 141 172 113 50 11 84 48 ...
#>  $ female : int  0 1 0 0 0 0 0 0 0 0 ...
#>  $ race   : int  4 4 4 4 4 4 3 1 4 3 ...
#>  $ ses    : int  1 2 3 3 2 2 2 2 2 2 ...
#>  $ schtyp : int  1 1 1 1 1 1 1 1 1 1 ...
#>  $ prog   : int  1 3 1 3 2 2 1 2 1 2 ...
#>  $ read   : int  57 68 44 63 47 44 50 34 63 57 ...
#>  $ write  : int  52 59 33 44 52 52 59 46 57 55 ...
#>  $ math   : int  41 53 54 47 57 51 42 45 54 52 ...
#>  $ science: int  47 63 58 53 53 63 53 39 58 50 ...
#>  $ socst  : int  57 61 31 56 61 61 61 36 51 51 ...

and the summary() command will give you some summary information about each variable. Below you see the command used to look at the first five variables.

summary(hsb2[, c(1:5)])
       id            female           race           ses           schtyp    
 Min.   :  1.0   Min.   :0.000   Min.   :1.00   Min.   :1.00   Min.   :1.00  
 1st Qu.: 50.8   1st Qu.:0.000   1st Qu.:3.00   1st Qu.:2.00   1st Qu.:1.00  
 Median :100.5   Median :1.000   Median :4.00   Median :2.00   Median :1.00  
 Mean   :100.5   Mean   :0.545   Mean   :3.43   Mean   :2.06   Mean   :1.16  
 3rd Qu.:150.2   3rd Qu.:1.000   3rd Qu.:4.00   3rd Qu.:3.00   3rd Qu.:1.00  
 Max.   :200.0   Max.   :1.000   Max.   :4.00   Max.   :3.00   Max.   :2.00  

Note that there are no labels for the various qualitative (aka categorical) variables (female, race, ses, schtyp, and prog) so we’ll have to create these. This is just a quick run through on creating labels for these variables; we will cover this in greater detail in a later chapter.

factor(hsb2$female,
  levels = c(0, 1),
  labels = c("Male", "Female")
  ) -> hsb2$female 
factor(hsb2$race,
  levels = c(1:4),
  labels = c("Hispanic", "Asian", "African American", "White")
  ) -> hsb2$race 
factor(hsb2$ses,
  levels = c(1:3),
  labels = c("Low", "Middle", "High")
  ) -> hsb2$ses 
factor(hsb2$schtyp,
  levels = c(1:2),
  labels = c("Public", "Private")
  ) -> hsb2$schtyp 
factor(hsb2$prog,
  levels = c(1:3),
  labels = c("General", "Academic", "Vocational")
  ) -> hsb2$prog 
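The mechanics of levels and labels are easier to see on a handful of values than on 200 rows. In the sketch below the vector sex_codes is made up; levels lists the codes as they appear in the raw data and labels gives the text each code should display, in the same order.

```r
c(0, 1, 1, 0, 1) -> sex_codes

factor(
  sex_codes,
  levels = c(0, 1),               # the raw codes ...
  labels = c("Male", "Female")    # ... and the label for each, in matching order
  ) -> sex

sex
table(sex)

# Internally the factor stores positions in the levels, starting at 1,
# not the original 0/1 codes
as.integer(sex)
```

Listing the labels in the wrong order would silently mislabel every observation, so it is worth checking a table() of the result against the codebook.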

Having added labels to the factors in hsb2 we can now save the data for later use, and look at the summary statistics of the first five variables. Note the difference between these results and those obtained previously.

save(hsb2,
  file = here("data", "hsb2.RData"))
summary(hsb2[, c(1:5)])
       id           female                  race         ses          schtyp   
 Min.   :  1.0   Male  : 91   Hispanic        : 24   Low   :47   Public :168  
 1st Qu.: 50.8   Female:109   Asian           : 11   Middle:95   Private: 32  
 Median :100.5                African American: 20   High  :58                
 Mean   :100.5                White           :145                            
 3rd Qu.:150.2                                                                
 Max.   :200.0                                                                

1.12 Some Important Miscellany

Before we move on, let us go over some important elements of working with R since you will run into these and wonder what it means or why something is showing up this way versus that way.

1.12.1 Variable Types in R

R has several basic variable types; the four you will run into most often are chr (representing a character such as “a”, “platinum”, “John Doe”, etc), num (representing a real number or a decimal such as 2 or 15.5173, etc), int for an integer (a subset of numeric and often used for numbers that are unlikely to enter any mathematical calculations, such as social security numbers and pseudo-identifiers, etc), and factors (where each unique value of a character vector is assigned a number and a label, such as 1 = Male; 2 = Female, and so on). Each column in a data-frame is, in R-speak, an atomic vector, meaning a column that only contains values of one type and one type alone. That is, you cannot combine characters and numbers in the same column because if you do, the column will default to the character format. Here is a snippet of these types with fake data. Run this code and you will see how the format is listed in R.

c("A", "B", "C") -> w 
c(1, 2, 3) -> x 
as.integer(x) -> y 
as.factor(w) -> z 
str(w)
#>  chr [1:3] "A" "B" "C"
str(x)
#>  num [1:3] 1 2 3
str(y)
#>  int [1:3] 1 2 3
str(z)
#>  Factor w/ 3 levels "A","B","C": 1 2 3
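The claim that a mixed column defaults to the character format is easy to check for yourself. The following sketch combines values of different types in one vector and inspects the result; the object name mixed is made up.

```r
# A number, a character, and another number in the same vector
c(1, "a", 3) -> mixed
str(mixed)              # everything has become a character

# The same rule applies between other types: the more general type wins
class(c(1L, 2.5))       # integer combined with decimal
class(c(TRUE, 2L))      # logical combined with integer

# Going back is only possible for values that look like numbers;
# anything else turns into NA (with a warning, suppressed here)
suppressWarnings(as.numeric(mixed))
```

This is why a single stray text entry in a spreadsheet column, such as "n/a", can turn an entire numeric variable into characters when the file is read.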

1.12.2 Workspaces

When you go to close RStudio you will be asked if you want to save your work-space. If you say yes, then all commands you executed in the active session plus any objects/data you created will be saved to your machine. The next time you start-up RStudio you will be back where you had last stopped. This is a good idea for specific projects where you have large data and/or cumbersome tasks you have to perform in stages because it allows you to bypass repeating these tasks. In all other instances my default tends to be to never save the work-space.

If you need to save the work-space, go to the Session menu in RStudio and then click on Save Workspace As... and give it an appropriate name. I generally don’t do this unless I am pretty confident that the work completed thus far is error-free OR if the work takes a long time to complete because I am dealing with large files and/or a lot of data manipulation tasks.

R is very greedy when it comes to memory usage so that is another crucial reason why I prefer to clear my work-space, using the “broom” to delete everything in the Global Environment and anything in the Plots tab.

1.12.3 library() versus require()

Some folks load a package with require() while others use library() as in, for instance, running require(tidyr) versus library(tidyr). What is the difference? Should you default to one or the other? I will not go into the details because Yihui has it covered but instead draw your attention to the bottom-line:

library() loads a package, and require() tries to load a package. So when you want to load a package, do you load a package or try to load a package? It should be crystal clear.

My suggestion: Listen to Yihui and use library() … he knows what he is talking about.
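The practical consequence shows up when a package is missing. In the sketch below the package name is deliberately fictitious; require() merely returns FALSE (with a warning, suppressed here) and lets the script carry on, whereas library() stops with an error right where the problem is.

```r
"aPackageThatDoesNotExist" -> pkg   # hypothetical, assumed not to be installed

# require() tries, fails quietly, and reports FALSE
suppressWarnings(
  require(pkg, character.only = TRUE, quietly = TRUE)
  ) -> loaded
loaded

# library() refuses to continue; we catch the error only to display the outcome
tryCatch(
  {
    library(pkg, character.only = TRUE)
    "loaded"
  },
  error = function(e) "error"
  ) -> outcome
outcome
```

With require() the failure tends to surface only later, as a confusing "could not find function" error far from its cause.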

1.12.4 Assignment operators: <- versus =

You will also see some folks use x <- c(1, 2, 3, 4) versus others opting for x = c(1, 2, 3, 4). There is a suggestion that one should use <- and not =, and of late I tend to use <- although it requires one more keystroke than =.

Careful though because there are specific instances where I am creating something and I have to use <- and if instead I use = I will run into an error. In addition, it is easy to mistakenly type x < - 2 when you meant to type x <- 2. I try to remember that one of the instances where I will have to use <- is inside a function. For example, see below.

data.frame(x = c(1:5)) -> my.df # Creating some data for variable x
within(my.df, xsq = x^(2)) -> my.df # adding x-squared but "=" will not work  
within(my.df, xsq <- x^(2)) -> my.df # adding x-squared with "<-" and it works! 

Unfortunately, I do slip up every now and then so don’t be surprised if you see me using <- and = interchangeably. Old habits die hard.
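Two of the pitfalls mentioned above can be seen in a few lines. This is a sketch with made-up values: a stray space turns an assignment into a comparison, and inside a function call = names an argument while <- creates an object.

```r
5 -> x

x < - 2    # with the space this asks "is x less than -2?" and changes nothing
x          # x is untouched

# Inside a function call, <- assigns and then passes the value along
mean(y <- c(1, 2, 3))
y          # y now exists

# Inside a function call, = only says which argument the value belongs to
median(x = c(10, 20, 30))
x          # x is still the value assigned at the top
```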

1.12.5 tibbles

R’s default is to store a data frame, as shown below with a small example, by changing column names (and, before R 4.0.0, by converting characters into factors), etc.

data.frame(
  `Some Letters` = c("A", "B", "C"), 
  `Some Numbers` = c(1, 2, 3)
  ) -> adf 
str(adf)
adf
#> 'data.frame':    3 obs. of  2 variables:
#>  $ Some.Letters: chr  "A" "B" "C"
#>  $ Some.Numbers: num  1 2 3
#>   Some.Letters Some.Numbers
#> 1            A            1
#> 2            B            2
#> 3            C            3

The tibble is the brainchild of the team behind RStudio and a bundle of packages called the tidyverse, and it drops R’s bad habits.

library(tidyverse)
tibble(
  `Some Letters` = c("A", "B", "C"), 
  `Some Numbers` = c(1, 2, 3)
  ) -> atib 
glimpse(atib)
atib
#> Rows: 3
#> Columns: 2
#> $ `Some Letters` <chr> "A", "B", "C"
#> $ `Some Numbers` <dbl> 1, 2, 3
#> # A tibble: 3 × 2
#>   `Some Letters` `Some Numbers`
#>   <chr>                   <dbl>
#> 1 A                           1
#> 2 B                           2
#> 3 C                           3

There are other advantages to tibbles that you can explore here. Before we move on, though, note the use of glimpse() in lieu of str().
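As a point of comparison, base R’s data.frame() can be asked to leave column names alone through its check.names argument. The sketch below repeats the small example with that one change; the object name adf2 is made up.

```r
data.frame(
  `Some Letters` = c("A", "B", "C"),
  `Some Numbers` = c(1, 2, 3),
  check.names = FALSE    # keep the names exactly as typed
  ) -> adf2

names(adf2)
str(adf2)
```

Names containing spaces then have to be wrapped in backticks whenever you refer to them, for example adf2$`Some Letters`, which is the same convention tibbles use.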