1 Introduction to R & RStudio
In this opening chapter we walk through installing R & RStudio on your machine, see how the RStudio IDE (integrated development environment) is set up, learn how to install and update packages, and then read and save data in various formats, with and without some basic data cleaning.
1.1 Installing R & RStudio
R is a “free” software environment for statistical computing and graphics. It is powerful, elegant, and incredibly flexible. RStudio is perhaps the most commonly used integrated development environment (IDE) for R; it is also free, and yet as powerful as, if not more powerful than, any commercial IDE in existence today. Both R & RStudio need to be set up before you can start working with data.1 Let us get cracking then!
First download R for Windows from here and R for Mac from here. Double-click on the downloaded file and accept the default settings as you go through the installation.
Once R is installed you can install RStudio for your operating system from here. Accept the default prompts through the installation process.
Once RStudio has installed, double-click the RStudio shortcut/icon and RStudio will launch. If all goes well you should see R starting up inside RStudio and you should be good to start working with data.
Both R and RStudio go through very frequent updates, some minor, some major. As needed, repeat the steps you took above to re-install and update your version of R and RStudio every few months.
1.2 A Brief RStudio Walk-through
RStudio has more features than needed for this class but at minimum you need to recognize the few listed below.
Feature | What it does … |
---|---|
Console | This is where commands are issued to R, either by typing and hitting enter or by running commands from a script (essentially a file with the commands you want to execute) |
Environment | stores and shows you all the objects (data, lists, scalars, etc) created during a working session |
History | shows you a running list of all commands issued to R |
Connections | shows you any databases/servers you are connected to and also allows you to initiate a new connection |
Files | shows you files and folders in your current working directory, and you can move up/down in the folder hierarchy |
Plots | shows you all graphics that have been generated |
Packages | shows you installed packages |
Help | allows you to access the help pages for a package by typing in keywords |
Viewer | shows you any “live” documents running on the server |
Knit | allows you to generate html/pdf/word documents that combine R commands with plain text, graphics, tables, etc |
Insert | allows you to insert a vanilla R chunk. You can (and should) give a unique name to each code chunk so that you can easily diagnose which chunk is not working, but more on R chunks later. |
Run | allows you to run specific lines and/or R chunks |
The gear button allows you to tweak various options; click on Output Options and explore the options therein. The one thing you should do now is click the gear icon and ensure the following has a check-mark against it: “Chunk Output in Console”. This will make sure that the output shows up in the console or the plot window rather than in your working document.
You can customize the panes via Tools -> Global Options...
Panes can be detached, and this is very helpful when you want another application next to the pane or behind it, or if you are using multiple monitors since then you can execute commands in one monitor and watch the output in another monitor.
You also have a spell-check; use it to catch typos.
1.3 Installing R Packages
Base R comes with expansive and flexible functionality, but its abilities are extended by packages built by individuals, teams, and even some organizations. Thus, for example, if I want to generate maps, I will need to lean on specific packages. Ditto if I want to generate interactive graphics, manipulate some complicated data-sets, pull some data from the US Census Bureau, and so on. Each package typically contains some R functions, some data-sets used to show how these functions work and what they do, and some helpful documentation.
Packages can most easily be installed via Tools -> Install Packages... Once you install a package on a computer, you do not need to re-install it unless something gets corrupted on your computer. Let us try and install a few packages. These are listed below.
devtools, ggplot2, dplyr, tidyr, reshape2, lubridate, car, Hmisc, gapminder, leaflet, prettydoc, DT, data.table, htmltools, scales, ggridges, ggrepel, highcharter, plotly, maps, here, skimr, janitor, gganimate, radix
Packages can also be installed via an explicit command, as shown below. Note that several package names must be combined into a single vector with c().
install.packages(c("ggplot2", "ggmap", "dplyr", "maps", "tidyr", "reshape2"))
Packages are frequently updated so you should make it a habit to check for updated packages by running “Check for Package Updates…”, also under Tools.
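A common convenience is to install only the packages that are not yet on your machine. The sketch below is one way to do that; missing_packages() is a hypothetical helper name (it is not part of any package), and the package list is just a subset of the one above.

```r
# Hypothetical helper: which of the wanted packages are not yet installed?
missing_packages <- function(wanted, installed = rownames(installed.packages())) {
  setdiff(wanted, installed)
}

wanted <- c("ggplot2", "dplyr", "tidyr", "here")
to_install <- missing_packages(wanted)

# Install only when run interactively and something is actually missing
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```

Because the install step is guarded, re-running the block on a machine that already has everything does nothing.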
Some packages also have to be built from source and when that is the case, the author(s) will tell you how to do so, the most common route being one of the following:
devtools::install_github("somename/packagename")
remotes::install_github("somename/packagename")
1.4 RStudio Projects
After packages have been installed, go ahead and create a New Project from the File menu in RStudio. You should create it in a New Directory (a folder on your computer) and name this folder mpa. RStudio will restart and launch in the mpa folder. You will see the folder now has a file called mpa.Rproj. The next time you need to launch RStudio for work, double-click this file and RStudio will start.
Go inside the mpa folder and create a sub-folder called data. All data you need for class should be put inside the data folder. All code files that you create or that I give you, each with an .Rmd extension, should be downloaded and placed inside the mpa folder. The resulting directory tree has mpa at the top, holding mpa.Rproj, your .Rmd files, and the data sub-folder.
Please follow these instructions for setting up the project folder, the code folder and the data sub-folder, and for launching RStudio in subsequent sessions. If you take this advice you will spend less time trying to find code or data files and diagnosing errors and more time understanding what a particular piece of code is doing.
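To make the layout concrete, here is a minimal sketch that builds the same folder structure in a temporary directory and lists it. The temporary location and the file name chapter01.Rmd are assumptions for illustration only; in practice RStudio creates mpa and mpa.Rproj for you.

```r
# Build the layout in a temporary folder (a stand-in for wherever you keep mpa)
project_dir <- file.path(tempdir(), "mpa")
dir.create(file.path(project_dir, "data"), recursive = TRUE, showWarnings = FALSE)

# Empty placeholder files so the listing shows where each kind of file lives
invisible(file.create(
  file.path(project_dir, "mpa.Rproj"),
  file.path(project_dir, "chapter01.Rmd"),            # hypothetical code file
  file.path(project_dir, "data", "ImportDataCSV.csv")
))

# Show every file relative to the project folder
list.files(project_dir, recursive = TRUE)
```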
When you are done working in RStudio, be sure to close it, and make sure you choose Don't save when asked if you wish to save the session.
1.5 RMarkdown
RMarkdown files enable you to generate html, pdf and word documents that contain both your code (which you can hide or show), appropriate text (describing, for example, the data, the analysis, conclusions, references, and so on), and tables and graphs. The beauty of RMarkdown is that it provides us the ability to work with a single document to create a fancy report, a dashboard, a slide-deck, a book, a website, a blog, etc. From a data science perspective, this is very useful because it allows for reproducibility and collaboration; I can share data and code with you and unless something is malfunctioning on your computer you should be able to knit the document or just run the code line by line and end up with the same results as I did.
Give every code chunk a unique name, which can be an alphanumeric string with no whitespace. If you forget, use the namer package to assign names to every code chunk that lacks one. This can be done, after the package is loaded, via name_chunks("myfilename.Rmd").
In RMarkdown files you will see that the code chunks have several options that could be invoked. Here are some of the more common ones we will use.
Option | What it does … |
---|---|
eval | If FALSE, knitr will not run the code in the code chunk. |
include | If FALSE, knitr will run the chunk but not include the chunk in the final document. |
echo | If FALSE, knitr will not display the code in the code chunk but will show the resulting output from the code in the final document. |
error | If FALSE, knitting stops at the first error; if TRUE, the error message is shown in the final document and knitting continues. |
message | If FALSE, knitr will not display any messages generated by the code. |
warning | If FALSE, knitr will not display any warning messages generated by the code. |
cache | If TRUE, knitr will cache the results to reuse in future knits. Knitr will reuse the results until the code chunk is altered.2 |
dev | The R function name that will be used as a graphical device to record plots, e.g. dev = "png". |
dpi | A number for knitr to use as the dots per inch (dpi) in graphics (when applicable). |
fig.align | ‘center’, ‘left’, ‘right’ alignment in the knit document |
fig.height | height of the figure (in inches, for example) |
fig.width | width of the figure (in inches, for example) |
out.height, out.width | The height and width to which plots are scaled in the final output. |
Other options can be found in the cheatsheet available here. There also happen to be excellent video tutorials on RMarkdown (and other things) by the RStudio team on vimeo. You may need to sign up (for free) with an email id.
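Rather than repeating the same options in every chunk, they can be set once near the top of a document. A minimal sketch, assuming the knitr package is installed; the particular values are only illustrative, not recommendations:

```r
# Set document-wide defaults for the chunk options described in the table
if (requireNamespace("knitr", quietly = TRUE)) {
  knitr::opts_chunk$set(
    echo = FALSE,        # hide code, keep output
    message = FALSE,     # drop messages
    warning = FALSE,     # drop warnings
    fig.align = "center",
    fig.width = 6,
    fig.height = 4
  )
  # Confirm one of the defaults that is now in force
  print(knitr::opts_chunk$get("fig.align"))
}
```

Any option given in an individual chunk header still overrides these defaults for that chunk.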
1.6 Reading Data into R/RStudio
R can read data created in various formats (SPSS, SAS, Stata, Excel, CSV, TXT, etc). The most common data formats you will encounter are likely to be CSV or Excel files and hence we focus on these first. Let us first download and save the data available here (as a zip archive). Once this file downloads, double-click it and extract all files to the data sub-folder; this is where all data must be saved.
Go ahead and make sure you installed the here package. From now on, the first time you need to load or save a data-set in a session, first load this package via library(here). You then only need to specify the subpath, such as data/ImportDataCSV.csv (see below), and you will be on your way. Otherwise it gets cumbersome to specify the full filepath which, in my case, would be ~/Documents/mpa/code/data/ImportDataCSV.csv. Note that in any R session you only need to load the package once, not every time you need to use one of its functions.
You should, at some point, read Jenny Bryan’s excellent pieces on project workflow and the famous ode to the here package.
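As a quick illustration of what here does, the sketch below builds a path from its pieces. It assumes the here package is installed; the base R line at the end shows the relative path the same pieces describe.

```r
# here() glues the pieces onto the project root it detects
if (requireNamespace("here", quietly = TRUE)) {
  csv_path <- here::here("data", "ImportDataCSV.csv")
  print(csv_path)
}

# Base R builds the same relative path; it is resolved against the working directory
relative_path <- file.path("data", "ImportDataCSV.csv")
relative_path
```

The difference is that the here() version keeps working even if the working directory changes, as long as you stay inside the project.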
1.6.1 CSV & Tab-delimited Formats
With the CSV format a comma separates each variable (note each variable is a column), and each row in the original file represents an observation.
library(here)
read.csv(
  here("data", "ImportDataCSV.csv"),
  sep = ",",
  header = TRUE
) -> df.csv
df.csv is the name I have chosen to give to the data being read. In the read.csv() command I am telling R where the data can be found and the name of the data-set (“data/ImportDataCSV.csv”), the fact that variables (one in each column) are separated by a comma (,), and the fact that the original data have column-headings (header = TRUE).
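If you do not have the course data at hand, the same call can be tried on a tiny file created on the fly. This is a minimal sketch; the file contents and the object name toy.csv are made up for illustration.

```r
# Write a three-line CSV to a temporary file
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,score", "Ann,90", "Bob,85"), csv_path)

# Read it back exactly as described: comma separator, first row holds headings
toy.csv <- read.csv(csv_path, sep = ",", header = TRUE)
str(toy.csv)
```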
Note that when you create anything in R, you will most often see examples doing so either via the = symbol, via the <- symbol, or via the -> symbol (my default). Thus df.csv = read.csv(…) is the same as df.csv <- read.csv(…) and the same as read.csv(…) -> df.csv. I suggest you pick one assignment operator and stick with it.
There are a total of five assignment operators we could use in R (<-, <<-, ->, ->>, and =). Google “<- versus = in R” and see the differing opinions on why some prefer one over the other, and also when and where each can be used.
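A short sketch of the operators in action. The three everyday forms produce identical objects, while <<- is different in kind: it assigns to a variable in an enclosing environment, which is why it mostly appears inside functions. The names here are arbitrary.

```r
# Three ways to create the same vector
a = c(1, 2, 3)
b <- c(1, 2, 3)
c(1, 2, 3) -> d
identical(a, b)
identical(b, d)

# <<- modifies a variable that lives outside the function
counter <- 0
bump <- function() counter <<- counter + 1
bump()
counter
```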
When you execute the command you will see df.csv showing up under Data in the upper-right pane of RStudio. Click on df.csv and you can see the data-set in spreadsheet form. If you only click the blue play button you will see the contents listed in cascading style.
A similar process works for reading in tab-delimited files where the columns are separated by a tab rather than by a comma.
read.csv(
  here("data", "ImportDataTAB.txt"),
  sep = "\t",
  header = TRUE
) -> df.tab
Note the one difference here: I told R it is a tab-delimited file by setting the sep argument to "\t", that is, a backslash followed by t.
1.6.2 Excel Formats (.xls & .xlsx)
There are several packages that will allow you to read files in various Excel formats but the one I prefer is readxl. Remember: Whenever we need to use a package we will have to first load it and then execute whatever commands call upon the loaded package’s features as shown below.
library(readxl)
read_excel(
  here("data", "ImportDataXLS.xls")
) -> df.xls
read_excel(
  here("data", "ImportDataXLSX.xlsx")
) -> df.xlsx
Note the one minor difference in the commands; the xlsx format file has the .xlsx file extension.
1.6.3 SPSS, Stata, and SAS formats
At times, and especially with some major federal agencies, the data you need may be shipped in a particular format. Some disciplines/fields are also accustomed to working with a specific file format. For example, economists and those who work in public health have historically used Stata and SAS, respectively. Consequently, whether the data are from the CDC’s BRFSS or some other survey series, you will often see the data being made available for download in these formats, so I show you how to read data that come to us in these formats.
library(haven)
read_stata(
  here("data", "ImportDataStata.dta")
) -> df.stata
read_sas(
  here("data", "ImportDataSAS.sas7bdat")
) -> df.sas
read_sav(
  here("data", "ImportDataSPSS.sav")
) -> df.spss
1.6.4 Fixed-width files
It is also common to encounter fixed-width files, also called flat-files. These are files where the raw data are stored without any gaps between successive variables. Yes, no commas, tabs, or other delimiters. However, these files will come with documentation that will tell you where each variable starts and ends, along with other details about each variable. Let us see how with a very small example.
read.fwf(
  here("data", "fwfdata.txt"),
  widths = c(4, 9, 2, 4),
  header = FALSE,
  col.names = c("Name", "Month", "Day", "Year")
) -> df.fw
Notice that we have to specify the width of each variable (since there is no separator like a comma or a tab that indicates where a variable begins and where it ends) and assign column names. The widths indicate how many slots each variable spans.
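To see how the widths carve up each line, here is a minimal sketch on two made-up records written to a temporary file. Each line is 19 characters long: 4 for the name, 9 for the month (padded with spaces), 2 for the day, and 4 for the year. The strip.white = TRUE argument is an addition that read.fwf() passes on to read.table() to drop the padding.

```r
# Two invented records, laid out as 4 + 9 + 2 + 4 characters
fwf_path <- tempfile(fileext = ".txt")
writeLines(c("AnnaJanuary  051990", "OmarSeptember171985"), fwf_path)

toy.fw <- read.fwf(
  fwf_path,
  widths = c(4, 9, 2, 4),
  header = FALSE,
  col.names = c("Name", "Month", "Day", "Year"),
  strip.white = TRUE  # remove the spaces used as padding
)
str(toy.fw)
```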
The readr package has a similar command that executes slightly faster.
library(readr)
read_fwf(
  here("data", "fwfdata.txt"),
  fwf_widths(
    c(4, 9, 2, 4),
    col_names = c("Name", "Month", "Day", "Year")
  )
) -> df.fw2
1.7 Reading Files directly from the Web
It is also possible to specify the full web-path for a file and read it in, rather than storing a local copy. This is often useful when we have to go in and pull data that might be periodically updated. Indeed, the Census Bureau, Bureau of Labor Statistics, Bureau of Economic Analysis and other entities often update older data that we may need. So having this functionality is very useful.
read.table(
  "http://data.princeton.edu/wws509/datasets/effort.dat"
) -> fpe
read.table(
  "https://stats.idre.ucla.edu/stat/data/test.txt",
  header = TRUE
) -> test
read.csv(
  "https://stats.idre.ucla.edu/stat/data/test.csv",
  header = TRUE
) -> test.csv
library(foreign)
read.spss(
  "https://stats.idre.ucla.edu/stat/data/hsb2.sav"
) -> hsb2.spss
as.data.frame(hsb2.spss) -> df.hsb2.spss
rm("hsb2.spss") # Deleting the intermediate object
R is able to read data from Twitter feeds, buoys sitting in the Atlantic ocean, and so much more! Now, one problem you could run into is that the website has moved and hence the URL is different or the site is down. There is, unfortunately, no way around these hurdles other than running into such an error while attempting to read the data off the web and then using a browser to search for the data by name or via keywords.3
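When a script depends on a remote file, it can help to fail gracefully rather than stop with an error. The sketch below wraps read.csv() in tryCatch(); read_or_null() is a hypothetical helper name, and a non-existent local file stands in for an unreachable URL so the example runs without a network connection.

```r
# Hypothetical helper: return the data if it can be read, otherwise NULL
read_or_null <- function(path, ...) {
  tryCatch(
    read.csv(path, ...),
    error = function(e) {
      message("Could not read ", path, ": ", conditionMessage(e))
      NULL
    }
  )
}

# A path that does not exist plays the role of a moved or offline URL
missing_file <- file.path(tempdir(), "no_such_file.csv")
result <- suppressWarnings(read_or_null(missing_file))
is.null(result)
```

The calling code can then check is.null(result) and decide whether to fall back on a local copy.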
Reading Excel files from the web is a bit more complicated, depending on how you go about it. For example, the long road would be something like this:
library(readxl)
download.file(
  "https://stats.idre.ucla.edu/stat/data/hsb2.xls",
  "hsb2.xls",
  mode = "wb"
)
read_excel("hsb2.xls") -> hsb2
The easier way to do this would be to opt for the {rio} package, but remember to install the package (if missing) and to run rio::install_formats() before executing the code below.
"https://stats.idre.ucla.edu/stat/data/hsb2.xls" -> url
library(rio)
import(url) -> hsb2
1.8 Reading compressed files
If you have compressed files (*.zip, *.gzip, etc.), you can use simple R code to download and unzip the file prior to reading it in, all in one block of commands. You will, however, need to know the file-name and extension of the file inside the compressed archive. The file inside the archive just needs to be in a format that R can read. If that format happens to be ascii and the delimiters are clunky then of course more work would be needed to specify start/end column positions, maybe add column names, etc. If it is in an easier to read, canned format, such as SAS/Stata/SPSS, then you are home free!
tempfile() -> temp
download.file(
  "ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NVSS/bridgepop/2016/pcen_v2016_y1016.sas7bdat.zip",
  temp
)
haven::read_sas(
  unz(
    temp, "pcen_v2016_y1016.sas7bdat"
  )
) -> ourdata
unlink(temp)
rm(temp)
You should see ourdata in the Global Environment. A word of caution: Be careful with rm() commands because they remove named objects from the Global Environment. If you forget you removed some object, you could end up running into errors when you try and call on that object for some analysis or visualization.
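The same idea can be tried locally without downloading anything. This minimal sketch uses a gzip-compressed CSV rather than a zip archive (so gzfile() takes the place of unz()); the file contents are made up.

```r
# Write a small CSV through a gzip connection
gz_path <- tempfile(fileext = ".csv.gz")
con <- gzfile(gz_path, open = "w")
writeLines(c("x,y", "1,2", "3,4"), con)
close(con)

# Read it back through a gzip connection; no manual decompression step needed
toy.gz <- read.csv(gzfile(gz_path))
toy.gz

# Clean up the temporary file
unlink(gz_path)
```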
1.9 Data in R packages
Almost all R packages come bundled with data-sets, too many to walk you through here. To load data from a package, if you know the data-set’s name, run
library(HistData)
data("Galton")
names(Galton)
#> [1] "parent" "child"
or you can run
data(
  "GaltonFamilies",
  package = "HistData"
)
names(GaltonFamilies)
#> [1] "family" "father" "mother" "midparentHeight"
#> [5] "children" "childNum" "gender" "childHeight"
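If you do not know a data-set's name, data() can also list what a package ships with. A minimal sketch using the datasets package, which comes with every R installation:

```r
# data(package = ...) returns a listing; its results element is a character matrix
available <- data(package = "datasets")$results[, c("Item", "Title")]
head(available)

# Load one of the listed data-sets by name
data("mtcars", package = "datasets")
dim(mtcars)
```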
1.10 Saving Data in the R Format
We can save a data-set we have created quite easily with the save() command, as shown below.
save(
  df.hsb2.spss,
  file = here("data", "hsb2.RData")
)
Note the sequence. We first specify the data-set we want to save, here df.hsb2.spss, and then the location and file-name of the saved data: file = here("data", "hsb2.RData"). If you look at the data folder you will now see a file called hsb2.RData.
You can always check what your current working directory is by running getwd(), and you can set your working directory by running setwd("~/Users/ruhil/Documents/myRbook") or by using the Set Working Directory sub-menu found in the Session menu in RStudio. But the better method is to rely on the project, since all files are expected to be in the project folder; if you end up changing directory paths in the RMarkdown file when reading data or images, etc., you could run into trouble. Remember to use the here library when working in a project.
data(mtcars)
library(here)
save(
  mtcars,
  file = here("data", "mtcars.RData")
)
rm(list = ls()) # To clear the Environment but not to be used liberally or wantonly!!
load(
  here("data", "mtcars.RData")
)
You can also save multiple data files as follows:
data(mtcars)
library(ggplot2)
library(here)
data(diamonds)
save(
  mtcars, diamonds,
  file = here("data", "mydata.RData")
)
rm(list = ls()) # To clear the Environment
load(
  here("data", "mydata.RData")
)
If you want to save just a single object from the environment and then load it in a later session, maybe with a different name, then you should use saveRDS() and readRDS().
data(mtcars)
library(here)
saveRDS(
  mtcars,
  file = here("data", "mydata.RDS")
)
rm(list = ls()) # To clear the Environment
readRDS(
  here("data", "mydata.RDS")
) -> ourdata
If instead you did the following, the data will be restored under the name it had when it was saved
data(mtcars)
library(here)
save(
  mtcars,
  file = here("data", "mtcars.RData")
)
rm(list = ls()) # To clear the Environment
load(
  here("data", "mtcars.RData")
) -> ourdata # Note: the data come back as mtcars; ourdata only holds the name "mtcars"
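Because load() restores objects under their saved names, it can silently overwrite something in your Environment. One way to look before you leap is to load into a separate environment first. A minimal sketch, using a temporary file and the arbitrary name sandbox:

```r
# Save a data-set to a temporary .RData file
data(mtcars)
rdata_path <- tempfile(fileext = ".RData")
save(mtcars, file = rdata_path)

# Load into a fresh environment instead of the Global Environment
sandbox <- new.env()
loaded_names <- load(rdata_path, envir = sandbox)

# load() returns the names of the objects it restored
loaded_names

# Pull the object out under any name you like
ourdata <- get("mtcars", envir = sandbox)
dim(ourdata)
```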
If you want to save everything you have done in the work session you can do so via save.image().
library(here)
save.image(
  file = here("data", "mywork_jan182018.RData")
)
An image saved under a custom file name like this one is not loaded automatically; only a file named .RData in the folder where R starts is. To pick up where you left off, load the image in a later session with load(here("data", "mywork_jan182018.RData")). This is useful if you have a lot of R code you have written and various objects generated and do not want to start from scratch the next time around.
If you are not in a project and then try to close RStudio after some code has been run, you will be prompted to save (or not) the workspace, and you should say “No” by default unless you really want to save the workspace. In the decade+ I have been using R I have never saved the workspace. Maybe I should have but since I am none the worse for wear I will plod on not saving the workspace since it has done me no harm thus far!
save(my.df, file = "~/Documents/Testing/hsb2.RData")
save(my.df, file = "C:\\Documents\\Testing\\hsb2.RData") # Note the two backslashes for Windows
1.11 Loading and Modifying Data
Let us read some data, perhaps the hsb2 data. These data reflect a random subset of records from the High School & Beyond study conducted by the National Center for Education Statistics. The primary focus of the study was/is to understand students’ trajectories after leaving high school into post-secondary education, the workforce, and beyond, and to understand what factors influence the students’ educational and career outcomes after passing through the American educational system.
read.table(
  'https://stats.idre.ucla.edu/stat/data/hsb2.csv',
  header = TRUE,
  sep = ","
) -> hsb2
This is a data frame with 200 observations (i.e., $n = 200$) and the following $11$ variables:
Variable | Description |
---|---|
id | Student ID. |
female | Student’s gender, with levels female (1) and male (0). |
race | Student’s race, with levels african american (3), asian (2), hispanic (1), and white (4). |
ses | Socio economic status of student’s family, with levels low (1), middle (2), and high (3). |
schtyp | Type of school, with levels public (1) and private (2). |
prog | Type of program, with levels general (1), academic (2), and vocational (3). |
read | Standardized reading score. |
write | Standardized writing score. |
math | Standardized math score. |
science | Standardized science score. |
socst | Standardized social studies score. |
What are the variable names (i.e., the column headings) in this file? The names() command will tell you that.
names(hsb2)
#> [1] "id" "female" "race" "ses" "schtyp" "prog" "read" "write"
#> [9] "math" "science" "socst"
Similarly, the str() command will show you the details of each variable
str(hsb2)
#> 'data.frame': 200 obs. of 11 variables:
#> $ id : int 70 121 86 141 172 113 50 11 84 48 ...
#> $ female : int 0 1 0 0 0 0 0 0 0 0 ...
#> $ race : int 4 4 4 4 4 4 3 1 4 3 ...
#> $ ses : int 1 2 3 3 2 2 2 2 2 2 ...
#> $ schtyp : int 1 1 1 1 1 1 1 1 1 1 ...
#> $ prog : int 1 3 1 3 2 2 1 2 1 2 ...
#> $ read : int 57 68 44 63 47 44 50 34 63 57 ...
#> $ write : int 52 59 33 44 52 52 59 46 57 55 ...
#> $ math : int 41 53 54 47 57 51 42 45 54 52 ...
#> $ science: int 47 63 58 53 53 63 53 39 58 50 ...
#> $ socst : int 57 61 31 56 61 61 61 36 51 51 ...
and the summary() command will give you some summary information about each variable. Below you see the command used to look at the first five variables.
summary(hsb2[, c(1:5)])
id | female | race | ses | schtyp |
---|---|---|---|---|
Min. : 1.0 | Min. :0.000 | Min. :1.00 | Min. :1.00 | Min. :1.00 |
1st Qu.: 50.8 | 1st Qu.:0.000 | 1st Qu.:3.00 | 1st Qu.:2.00 | 1st Qu.:1.00 |
Median :100.5 | Median :1.000 | Median :4.00 | Median :2.00 | Median :1.00 |
Mean :100.5 | Mean :0.545 | Mean :3.43 | Mean :2.06 | Mean :1.16 |
3rd Qu.:150.2 | 3rd Qu.:1.000 | 3rd Qu.:4.00 | 3rd Qu.:3.00 | 3rd Qu.:1.00 |
Max. :200.0 | Max. :1.000 | Max. :4.00 | Max. :3.00 | Max. :2.00 |
Note that there are no labels for the various qualitative (aka categorical) variables (female, race, ses, schtyp, and prog) so we’ll have to create these. This is just a quick run through on creating labels for these variables; we will cover this in greater detail in a later chapter.
factor(
  hsb2$female,
  levels = c(0, 1),
  labels = c("Male", "Female")
) -> hsb2$female
factor(
  hsb2$race,
  levels = c(1:4),
  labels = c("Hispanic", "Asian", "African American", "White")
) -> hsb2$race
factor(
  hsb2$ses,
  levels = c(1:3),
  labels = c("Low", "Middle", "High")
) -> hsb2$ses
factor(
  hsb2$schtyp,
  levels = c(1:2),
  labels = c("Public", "Private")
) -> hsb2$schtyp
factor(
  hsb2$prog,
  levels = c(1:3),
  labels = c("General", "Academic", "Vocational")
) -> hsb2$prog
Having added labels to the factors in hsb2 we can now save the data for later use, and look at the summary statistics of the first five variables. Note the difference between these results and those obtained previously.
save(
  hsb2,
  file = here("data", "hsb2.RData")
)
summary(hsb2[, c(1:5)])
id | female | race | ses | schtyp |
---|---|---|---|---|
Min. : 1.0 | Male : 91 | Hispanic : 24 | Low :47 | Public :168 |
1st Qu.: 50.8 | Female:109 | Asian : 11 | Middle:95 | Private: 32 |
Median :100.5 | NA | African American: 20 | High :58 | NA |
Mean :100.5 | NA | White :145 | NA | NA |
3rd Qu.:150.2 | NA | NA | NA | NA |
Max. :200.0 | NA | NA | NA | NA |
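The mechanics of factor() are easiest to see on a handful of values. A minimal sketch with made-up codes, where levels lists the values found in the data and labels supplies the text to show for each, in the same order:

```r
# Five invented 0/1 codes
female.codes <- c(0, 1, 1, 0, 1)

# Attach a label to each code
female.f <- factor(
  female.codes,
  levels = c(0, 1),
  labels = c("Male", "Female")
)

# Counts are now reported by label rather than by code
table(female.f)
```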
1.12 Some Important Miscellany
Before we move on, let us go over some important elements of working with R since you will run into these and wonder what they mean or why something is showing up this way versus that way.
1.12.1 Variable Types in R
R has six basic (atomic) vector types; the ones you will see most often are chr (representing a character string such as “a”, “platinum”, “John Doe”, etc), num (representing a real number or a decimal such as 2 or 15.5173, etc), and int for an integer (a subset of numeric and representing numbers that are unlikely to be used for any mathematical calculations, such as social security numbers and pseudo-identifiers, etc). You will also work with factors (where each unique value of a character vector is assigned a number and a label, such as 1 = Male; 2 = Female, and so on). Each column in a data-frame is, in R-speak, an atomic vector, meaning a column that only contains values of one type and one type alone. That is, you cannot combine characters and numbers in the same column because if you do, the column will default to the character format. Here is a snippet of these types with fake data. Run this code and you will see how the format is listed in R.
c("A", "B", "C") -> w
c(1, 2, 3) -> x
as.integer(x) -> y
as.factor(w) -> z
str(w)
#> chr [1:3] "A" "B" "C"
str(x)
#> num [1:3] 1 2 3
str(y)
#> int [1:3] 1 2 3
str(z)
#> Factor w/ 3 levels "A","B","C": 1 2 3
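The "one type per column" rule is easy to check for yourself. In this minimal sketch, mixing numbers with a letter turns everything into character, and converting back leaves a missing value where the letter was:

```r
# Numbers and a letter in the same vector
mixed <- c(1, "a", 2.5)
class(mixed)

# Converting back to numeric cannot recover the letter, so it becomes NA
numbers.back <- suppressWarnings(as.numeric(mixed))
numbers.back
```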
1.12.2 Workspaces
When you go to close RStudio you will be asked if you want to save your work-space. If you say yes, then all commands you executed in the active session plus any objects/data you created will be saved to your machine. The next time you start-up RStudio you will be back where you had last stopped. This is a good idea for specific projects where you have large data and/or cumbersome tasks you have to perform in stages because it allows you to bypass repeating these tasks. In all other instances my default tends to be to never save the work-space.
If you need to save the work-space, go to the Session menu in RStudio and then click on Save Workspace As... and give it an appropriate name. I generally don’t do this unless I am pretty confident that the work completed thus far is error-free OR if the work takes a long time to complete because I am dealing with large files and/or a lot of data manipulation tasks.
R is very greedy when it comes to memory usage so that is another crucial reason why I prefer to clear my work-space, using the “broom” to delete everything in the Global Environment and anything in the Plots tab.
1.12.3 library() versus require()
Some folks load a package with require() while others use library() as in, for instance, running require(tidyr) versus library(tidyr). What is the difference? Should you default to one or the other? I will not go into the details because Yihui has it covered but instead draw your attention to the bottom-line:
library() loads a package, and require() tries to load a package. So when you want to load a package, do you load a package or try to load a package? It should be crystal clear.
My suggestion: Listen to Yihui and use library() … he knows what he is talking about.
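The practical difference shows up when a package is missing. In the sketch below the package name is deliberately fake: require() quietly returns FALSE (with a warning), so a script keeps running and fails later, whereas library() stops immediately with an error.

```r
# A package name that should not exist (hypothetical, for illustration)
fake.pkg <- "notARealPackage12345"

# require() returns FALSE instead of stopping
loaded.ok <- suppressWarnings(
  require(fake.pkg, character.only = TRUE, quietly = TRUE)
)
loaded.ok

# library() raises an error, caught here so the example keeps running
library.result <- tryCatch(
  {
    library(fake.pkg, character.only = TRUE)
    "loaded"
  },
  error = function(e) "error"
)
library.result
```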
1.12.4 Assignment operators: <- versus =
You will also see some folks use x <- c(1, 2, 3, 4) versus others opting for x = c(1, 2, 3, 4). There is a suggestion that one should use <- and not =, and of late I tend to use <- although it requires one more keystroke than =. Careful though, because there are specific instances where I am creating something and I have to use <-, and if instead I use = I will run into an error. In addition, it is easy to mistakenly type x < - 2 (which compares x with -2) when you meant to type x <- 2. I try to remember that one of the instances where I will have to use <- is inside a function call. For example, see below.
data.frame(x = c(1:5)) -> my.df # Creating some data for variable x
within(my.df, xsq = x^(2)) -> my.df # adding x-squared but "=" will not work
within(my.df, xsq <- x^(2)) -> my.df # adding x-squared with "<-" and it works!
Unfortunately, I do slip up every now and then so don’t be surprised if you see me using <- and = interchangeably. Old habits die hard.
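The underlying reason is that, inside a function call, = names an argument while <- performs an assignment. A minimal sketch with two throwaway functions (their names are made up) that check whether an object called x exists afterwards:

```r
# "=" only tells mean() which argument the values belong to
uses_equals <- function() {
  mean(x = c(2, 4, 6))
  exists("x", inherits = FALSE)  # was an object x created in this function?
}

# "<-" creates x as a side effect and then passes it to mean()
uses_arrow <- function() {
  mean(x <- c(2, 4, 6))
  exists("x", inherits = FALSE)
}

uses_equals()
uses_arrow()
```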
1.12.5 tibbles
R’s default is to store data as a data frame which, as shown below with a small example, changes column names that are not syntactically valid (and, before R 4.0.0, also converted characters into factors by default).
data.frame(
  `Some Letters` = c("A", "B", "C"),
  `Some Numbers` = c(1, 2, 3)
) -> adf
str(adf)
#> 'data.frame': 3 obs. of 2 variables:
#> $ Some.Letters: chr "A" "B" "C"
#> $ Some.Numbers: num 1 2 3
print(adf)
#> Some.Letters Some.Numbers
#> 1 A 1
#> 2 B 2
#> 3 C 3
tibbles are the brainchild of the team behind a bundle of packages called the tidyverse (and behind RStudio), and they drop R’s bad habits.
library(tibble)
tibble(
  `Some Letters` = c("A", "B", "C"),
  `Some Numbers` = c(1, 2, 3)
) -> atib
glimpse(atib)
#> Rows: 3
#> Columns: 2
#> $ `Some Letters` <chr> "A", "B", "C"
#> $ `Some Numbers` <dbl> 1, 2, 3
print(atib)
#> # A tibble: 3 × 2
#> `Some Letters` `Some Numbers`
#> <chr> <dbl>
#> 1 A 1
#> 2 B 2
#> 3 C 3
There are other advantages to tibbles that you can explore here. Before we move on, though, note the use of glimpse() in lieu of str().
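If you stay with base R data frames, the renaming of columns can be switched off with the check.names argument of data.frame(). A minimal sketch comparing the two behaviours:

```r
# Default behaviour: the space in the name is replaced by a dot
default.names <- names(data.frame(`Some Letters` = c("A", "B")))
default.names

# check.names = FALSE keeps the name exactly as typed
kept.names <- names(data.frame(`Some Letters` = c("A", "B"), check.names = FALSE))
kept.names
```

Columns with spaces in their names must then be referred to with backticks, as in adf$`Some Letters`.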