2 Basic Data Operations with base R
You can generate your own data, manipulate data by adding, subtracting, dividing, or multiplying a numeric column by a scalar or by the value in another column, convert numeric data to factors (qualitative variables; indicators that measure a non-quantifiable attribute of an observation), and so on. Below we will see a few basic data operations at work. But first, we need to create some data to work with.
2.1 Creating A Small Data-Set
Let us create two variables, x and y, and combine them into a data-frame.
c(
  100, 101, 102, 103, 104, 105, 106
) -> x

c(
  7, 8, 9, 10, 11, 12, 13
) -> y

cbind.data.frame(
  x, y
) -> df.cbind
Note that as.data.frame(cbind(x, y)) -> df.cbind will give you the same result as cbind.data.frame(x, y) -> df.cbind.
x | y |
---|---|
100 | 7 |
101 | 8 |
102 | 9 |
103 | 10 |
104 | 11 |
105 | 12 |
106 | 13 |
The commands above generate two columns, x and y, and then cbind() binds them as columns into a data-set called df.cbind. If we used rbind() instead, it would bind x and y as rows instead of columns.
c(
  100, 101, 102, 103, 104, 105, 106
) -> x

c(
  7, 8, 9, 10, 11, 12, 13
) -> y

rbind.data.frame(
  x, y
) -> df.rbind
c.100..7. | c.101..8. | c.102..9. | c.103..10. | c.104..11. | c.105..12. | c.106..13. |
---|---|---|---|---|---|---|
100 | 101 | 102 | 103 | 104 | 105 | 106 |
7 | 8 | 9 | 10 | 11 | 12 | 13 |
Note the weird column names this gives you versus what you get if you do the following:
as.data.frame(
  rbind(
    x, y
  )
) -> df.rbind
V1 | V2 | V3 | V4 | V5 | V6 | V7 | |
---|---|---|---|---|---|---|---|
x | 100 | 101 | 102 | 103 | 104 | 105 | 106 |
y | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
When we use rbind() this way, R names the columns V1, V2, and so on. You will likely want to label the columns differently from how they were automatically named. This is easily accomplished as shown below:
c(
  "Variable 1", "Variable 2"
) -> names(df.cbind)

c(
  "Variable 1", "Variable 2", "Variable 3", "Variable 4",
  "Variable 5", "Variable 6", "Variable 7"
) -> names(df.rbind)
Check df.cbind and df.rbind, respectively, to verify the column names match what we specified in the preceding code chunk. Remember: rbind() binds rows while cbind() binds columns.
You can also generate data-sets that combine quantitative and qualitative variables. This is demonstrated below:
c(
  100, 101, 102, 103, 104, 105, 106
) -> x

c(
  "Male", "Female", "Male", "Female", "Female", "Male", "Female"
) -> y

cbind.data.frame(
  x, y
) -> df.1

c(
  100, 101, 102, 103, 104, 105, 106
) -> x

c(
  0, 1, 0, 1, 1, 0, 1
) -> y

cbind.data.frame(
  x, y
) -> df.2
Note that in df.1, y is a string variable with values of Male/Female. In contrast, df.2 has y specified as a 0/1 variable, with 0 = Male and 1 = Female. We could label the 0/1 values in df.2 as follows:
factor(
  df.2$y,
  levels = c(0, 1),
  labels = c("Male", "Female")
) -> df.2$y
x | y |
---|---|
100 | Male |
101 | Female |
102 | Male |
103 | Female |
104 | Female |
105 | Male |
106 | Female |
We just created a factor, one of several variable formats available in R, discussed in Chapter 1 as well. However, this is a good place to look at these formats a little more closely. Numbers will be read in as numeric unless you have some alphanumeric values or strings. This is apparent from the examples that follow.
c(
  1, 2, 3, 4, 5
) -> x1

c(
  1, 2, 3, 4, "a5"
) -> x2

c(
  1, 2, 3, 4, "hello"
) -> x3

c(
  1, 2, 3, 4, NA
) -> x4
See how x1 and x4 are read in as numeric but x2 and x3 show up as chr, short for character. You could convert x2 and x3 into numeric as follows:
as.numeric(x2) -> x2.num
as.numeric(x3) -> x3.num
R’s as.numeric() command will automatically set strings (for example, “hello”) and alphanumeric values (for example, “a5”) to missing, denoted by NA, during the conversion process. You can also flip a numeric variable into a character or a factor, but that will leave the NA as NA.
as.character(x2.num) -> x2.chr
At times the variable may in fact be numeric but, because of some anomalies in the variable, R will read it in as a factor. When this happens, converting the variable into numeric has to be done with care. Just running as.numeric() will not work. See the example below where age is flagged as a factor:
c(
  100, 101, 102, 103, 104, 105, 106
) -> score

c(
  "Male", "Female", "Male", "Female",
  "Female", "Male", "Female"
) -> sex

c(
  "18", "18", "19", "19", "21", "21", "NA"
) -> age
cbind.data.frame(score, sex, age) -> my.df
score | sex | age |
---|---|---|
100 | Male | 18 |
101 | Female | 18 |
102 | Male | 19 |
103 | Female | 19 |
104 | Female | 21 |
105 | Male | 21 |
106 | Female | NA |
as.numeric(my.df$age) -> my.df$age.num1
If you try as.numeric() on a factor, the conversion uses the factor’s underlying integer codes instead of the actual numeric values. In these instances it is recommended that you run as.numeric(levels(data$variable))[data$variable] instead. An example is shown below where age.factor is flipped into age.num2.
as.factor(my.df$age) -> my.df$age.factor
glimpse(my.df$age.factor)
#> Factor w/ 4 levels "18","19","21",..: 1 1 2 2 3 3 4
as.numeric(levels(my.df$age.factor))[my.df$age.factor] -> my.df$age.num2
glimpse(my.df$age.num2)
#> num [1:7] 18 18 19 19 21 21 NA
2.2 Some Basic Calculations
In R, we can easily calculate some basic statistics. Here are a few that you should be very familiar with:
mean(my.df$age.num1, na.rm = TRUE) # Mean age
#> [1] 19.33
median(my.df$age.num1, na.rm = TRUE) # Median age
#> [1] 19
min(my.df$age.num1, na.rm = TRUE) # Minimum age
#> [1] 18
max(my.df$age.num1, na.rm = TRUE) # Maximum age
#> [1] 21
range(my.df$age.num1, na.rm = TRUE) # Minimum and Maximum age
#> [1] 18 21
var(my.df$age.num1, na.rm = TRUE) # Variance of age
#> [1] 1.867
sd(my.df$age.num1, na.rm = TRUE) # Standard Deviation of age
#> [1] 1.366
fivenum(my.df$age.num1, na.rm = TRUE) # Five-number summary of age
#> [1] 18 18 19 21 21
IQR(my.df$age.num1, na.rm = TRUE, type = 5) # Interquartile Range of age
#> [1] 3
length(my.df$age.num1) # Number of observations of age.num1
#> [1] 7
length(unique(my.df$age.num1)) # Number of unique values of age.num1
#> [1] 4
These can also be calculated with the dplyr package, as shown below.

library(dplyr)

my.df %>%
  summarise(
    Mean.age = mean(age.num1, na.rm = TRUE),
    Median.age = median(age.num1, na.rm = TRUE),
    Min.age = min(age.num1, na.rm = TRUE),
    Max.age = max(age.num1, na.rm = TRUE),
    Variance.age = var(age.num1, na.rm = TRUE),
    SD.age = sd(age.num1, na.rm = TRUE),
    IQR.age = IQR(age.num1, na.rm = TRUE)
  )
Mean.age | Median.age | Min.age | Max.age | Variance.age | SD.age | IQR.age |
---|---|---|---|---|---|---|
19.33 | 19 | 18 | 21 | 1.867 | 1.366 | 2.25 |
dplyr is going to be one of the packages we turn to a lot down the road. It is a workhorse for many of us.
2.3 Transformations
We can carry out various operations on numeric variables, as shown below with respect to the variable score:

my.df$score * 10 -> my.df$x1
my.df$score * 100 -> my.df$x2
my.df$score / 10 -> my.df$x3
sqrt(my.df$score) -> my.df$x4
my.df$score^(2) -> my.df$x5
my.df$score * 1.31 -> my.df$x6
log(my.df$score) -> my.df$x7
scale(my.df$score) -> my.df$x8
Note the various operators: we multiply via *, divide via /, take the square-root via sqrt(), raise the value to some power via ^(), create the z-score via scale(), take the logarithm via log(), and so on.
We can also delete variables, change variable names, and so on. Let us see this with a small data-set that we create.
data.frame(
  sex = c("M", "F", "M", "F"),
  NAMES = c("Andy", "Jill", "Jack", "Madison"),
  age = c(24, 48, 72, 96)
) -> my.df
sex | NAMES | age |
---|---|---|
M | Andy | 24 |
F | Jill | 48 |
M | Jack | 72 |
F | Madison | 96 |
2.4 Binning: Creating a grouped version of a numeric variable
If I have a true numeric variable, say individuals’ ages, and I want to create a grouped version of this variable, R has a very useful cut() function that comes in handy. For example, say I have the hsb2 data and want to group math scores. After we read in the data and check the range of math scores we notice the minimum is 33 and the maximum is 75. Doing some quick calculations, the range is \(75 - 33 = 42\), and so if I want five groups I’ll have the groups be 33-42, 42-51, 51-60, 60-69, 69-78. Notice the width of each group is 9, since \(\dfrac{42}{5} = 8.4\), which has been rounded up to 9.
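As an aside, evenly spaced break points like these can be generated rather than typed by hand; a minimal sketch using base R’s seq() (the object name breaks5 is mine):

```r
# five groups of width 9, starting at the minimum score of 33
seq(from = 33, to = 78, by = 9) -> breaks5

breaks5
#> [1] 33 42 51 60 69 78
```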
read.table(
  "https://stats.idre.ucla.edu/stat/data/hsb2.csv",
  header = TRUE,
  sep = ","
) -> hsb

summary(hsb$math) # to see the range of math scores
Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
---|---|---|---|---|---|
33 | 45 | 52 | 52.65 | 59 | 75 |
cut(
  hsb$math,
  breaks = c(33, 42, 51, 60, 69, 78)
) -> hsb$math_groups

table(hsb$math_groups)
(33,42] | (42,51] | (51,60] | (60,69] | (69,78] |
---|---|---|---|---|
34 | 60 | 61 | 33 | 11 |
Uh, oh, this only gave us 199 grouped values but our sample size is 200, so one student was lost in the process. This is the student with a score of 33. What happened? Well, you have to tell R what to do if it runs into a number that matches the limits of a group, i.e., if I see 42, should I include it in the 33-42 group or in the 42-51 group? We can tell R what to do by adding right = FALSE or right = TRUE (the default). So let me create a new grouped variable with right = FALSE.
cut(
  hsb$math,
  breaks = c(33, 42, 51, 60, 69, 78),
  right = FALSE
) -> hsb$math_groups2

table(hsb$math_groups2)
[33,42) | [42,51) | [51,60) | [60,69) | [69,78) |
---|---|---|---|---|
28 | 59 | 64 | 36 | 13 |
Technically, right = FALSE only includes a value \(x\) in a group \([a, b)\) if \(a \leq x < b\), while right = TRUE includes \(x\) in a group \((a, b]\) if \(a < x \leq b\), where \(a\) and \(b\) are the lower and upper limits of the group, respectively.
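As an aside, cut() also has an include.lowest argument; keeping the default right = TRUE but adding include.lowest = TRUE closes the first interval at 33, so the student with the minimum score is no longer dropped. A sketch, assuming the same hsb data (math_groups3 is my name):

```r
cut(
  hsb$math,
  breaks = c(33, 42, 51, 60, 69, 78),
  include.lowest = TRUE # the first group becomes [33,42] instead of (33,42]
) -> hsb$math_groups3

table(hsb$math_groups3) # all 200 students are now grouped
```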
2.5 Changing the case (lower/upper)
Say I want to change the variable NAMES to be lowercase. I can do this via
colnames(my.df)[2] = "names"
sex | names | age |
---|---|---|
M | Andy | 24 |
F | Jill | 48 |
M | Jack | 72 |
F | Madison | 96 |
If I wanted all variable names to be lowercase I would do
data.frame(
  sex = c("M", "F", "M", "F"),
  NAMES = c("Andy", "Jill", "Jack", "Madison"),
  age = c(24, 48, 72, 96)
) -> my.df

tolower(colnames(my.df)) -> colnames(my.df)
sex | names | age |
---|---|---|
M | Andy | 24 |
F | Jill | 48 |
M | Jack | 72 |
F | Madison | 96 |
What if I wanted to convert the individual values in names to uppercase? To lowercase?
toupper(my.df$names) -> my.df$name1
tolower(my.df$names) -> my.df$name2
2.6 An aside on Factors
I can create a new variable, female, and store it as a factor as shown below:

my.df$female[my.df$sex == "M"] = 0
my.df$female[my.df$sex == "F"] = 1

factor(
  my.df$female,
  levels = c(0, 1),
  labels = c("Male", "Female")
) -> my.df$female
Notice how R stored the factor with values 1 = Male and 2 = Female even though the original coding had 0 and 1, and that the order was Male then Female. This is not what would result if you had done:
my.df$female2[my.df$sex == "M"] = "Male"
my.df$female2[my.df$sex == "F"] = "Female"

factor(my.df$female2) -> my.df$female2

Here, you end up with 1 = Female and 2 = Male.
Of course, I could have gone in and reordered the factor such that Female comes before Male. This is often important when we have categorical variables that are ordinal, because R defaults to ordering a factor’s levels alphabetically. For example, say we have data on 10 individuals, asked to respond to a question with Disagree/Neutral/Agree.
data.frame(
  x = c(rep("Disagree", 3), rep("Neutral", 2), rep("Agree", 5))
) -> fdf

factor(fdf$x) -> fdf$responses

levels(fdf$responses)
#> [1] "Agree" "Disagree" "Neutral"

table(fdf$responses) -> tab1
x | Frequency |
---|---|
Agree | 5 |
Disagree | 3 |
Neutral | 2 |
Note that rep("Disagree", 3) is essentially saying create three entries of “Disagree”, rep("Neutral", 2) is generating two values of “Neutral”, and so on. It is a quicker way of generating a number of identical values than typing out x = c("Disagree", "Disagree", "Disagree", "Neutral", "Neutral", "Agree", "Agree", "Agree", "Agree", "Agree") in full.
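rep() can also take a vector of counts via its times argument, which collapses the three calls above into one; a quick sketch (x.alt is my name):

```r
rep(
  c("Disagree", "Neutral", "Agree"),
  times = c(3, 2, 5) # three Disagrees, two Neutrals, five Agrees
) -> x.alt
```

The result is identical to the x column built above.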
I don’t want this ordering since this is an ordinal variable: either Disagree should come first, followed by Neutral, with Agree last, or Agree first, followed by Neutral, with Disagree last.
ordered(
  fdf$responses,
  levels = c("Disagree", "Neutral", "Agree")
) -> fdf$newresponses

levels(fdf$newresponses)
#> [1] "Disagree" "Neutral" "Agree"

min(fdf$newresponses)
#> [1] Disagree
#> Levels: Disagree < Neutral < Agree

table(fdf$newresponses) -> tab2
x | Frequency |
---|---|
Disagree | 3 |
Neutral | 2 |
Agree | 5 |
I could have also done this when creating the factor via
factor(
  fdf$x,
  ordered = TRUE,
  levels = c("Disagree", "Neutral", "Agree")
) -> fdf$xordered

levels(fdf$xordered)
#> [1] "Disagree" "Neutral" "Agree"

min(fdf$xordered)
#> [1] Disagree
#> Levels: Disagree < Neutral < Agree

table(fdf$xordered) -> tab3
x | Frequency |
---|---|
Disagree | 3 |
Neutral | 2 |
Agree | 5 |
Notice the min() command … which asks for the minimum level of a factor. This works with ordered factors but not with unordered factors; try it with x and responses, respectively, to see what results.
You can drop the original x variable as follows:

fdf$x = NULL
Before we move on, note that when it comes to modeling and other statistical techniques in R, an ordered factor has a different intrinsic meaning (ordinal) than an unordered factor (nominal), so be careful. When it comes to modeling we can specify the reference category explicitly before modeling. For example, say I want Neutral to be the excluded (i.e., the reference) category. I would then do the following:
relevel(fdf$x, ref = "Neutral") -> fdf$refx
levels(fdf$refx)
This command will not work with an ordered factor so bear that in mind as well.
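If you do need a custom reference category for a variable stored as an ordered factor, one workaround (a sketch, not the only approach; unordered and refx2 are my names) is to strip the ordering first and then relevel:

```r
# rebuild as an unordered factor, keeping the same levels
factor(
  fdf$newresponses,
  levels = levels(fdf$newresponses),
  ordered = FALSE
) -> fdf$unordered

relevel(fdf$unordered, ref = "Neutral") -> fdf$refx2

levels(fdf$refx2)
#> [1] "Neutral" "Disagree" "Agree"
```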
We could also fix typos, assuming they exist, and then drop unused factor levels as shown below. Say we have the following data:
data.frame(
  mf = as.factor(
    c(rep("Male", 3), rep("male", 2),
      rep("Female", 3), rep("femalE", 2)
    )
  )
) -> sexdf

levels(sexdf$mf)
#> [1] "femalE" "Female" "male" "Male"
Obviously not what we’d like so we can go in and clean it up a bit. How? As follows:
sexdf$mf[sexdf$mf == "male"] = "Male"
sexdf$mf[sexdf$mf == "femalE"] = "Female"

levels(sexdf$mf)
#> [1] "femalE" "Female" "male" "Male"

table(sexdf$mf) -> tab4
mf | Frequency |
---|---|
femalE | 0 |
Female | 5 |
male | 0 |
Male | 5 |
Wait a second, we still see femalE and male but with 0 frequency counts; how should we get rid of these ghosts? With droplevels(), as shown below:
as.factor(sexdf$mf) -> sexdf$newmf
levels(sexdf$newmf)
#> [1] "femalE" "Female" "male" "Male"
droplevels(sexdf$mf) -> sexdf$newmf
levels(sexdf$newmf)
#> [1] "Female" "Male"
table(sexdf$newmf) -> tab5
newmf | Frequency |
---|---|
Female | 5 |
Male | 5 |
There is the forcats package that you should check out if you want to see how its functions work with factors. I will touch upon this package a bit later on in this text.
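For instance, the typo clean-up and level-dropping above can be done in a single step with forcats::fct_recode(); a sketch, assuming the forcats package is installed (mf2 is my name):

```r
library(forcats)

fct_recode(
  sexdf$mf,
  Male = "male",    # fold the old "male" level into "Male"
  Female = "femalE" # fold the old "femalE" level into "Female"
) -> sexdf$mf2

levels(sexdf$mf2) # only the two clean levels remain
```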
2.7 Subsetting data and factor levels
I may also need to drop a factor level. For example, say I am working with the diamonds data-set and need to subset my analysis to diamonds that are, at a minimum, of a “Very Good” cut. Before we tackle the data, here is a brief description of this data frame, which has 53,940 rows and 10 variables:
Variables | Description |
---|---|
price | price in US dollars ($326–$18,823) |
carat | weight of the diamond (0.2–5.01) |
cut | quality of the cut (Fair, Good, Very Good, Premium, Ideal) |
color | diamond colour, from J (worst) to D (best) |
clarity | a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best)) |
x | length in mm (0–10.74) |
y | width in mm (0–58.9) |
z | depth in mm (0–31.8) |
depth | total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43–79) |
table | width of top of diamond relative to widest point (43–95) |
library(ggplot2)
data(diamonds)
str(diamonds$cut)
#> Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
table(diamonds$cut) -> tab6
cut | Frequency |
---|---|
Fair | 1610 |
Good | 4906 |
Very Good | 12082 |
Premium | 13791 |
Ideal | 21551 |
So a minimum cut of “Very Good” implies diamonds that are either Very Good, Premium, or Ideal. I can subset with this criterion as follows:
subset(
  diamonds,
  cut != "Fair" & cut != "Good"
) -> dia.sub1

str(dia.sub1$cut)
#> Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 4 3 3 3 3 5 4 5 ...

table(dia.sub1$cut) -> tab7
cut | Frequency |
---|---|
Fair | 0 |
Good | 0 |
Very Good | 12082 |
Premium | 13791 |
Ideal | 21551 |
You should try it for yourself: subset(diamonds, cut != "Fair" & cut != "Good") is the same as running subset(diamonds, cut == "Very Good" | cut == "Premium" | cut == "Ideal").
Aha! I see no diamonds in the excluded cut ratings, but str() still shows R remembering cut as having 5 ratings, and hence the frequency table shows up with zero counts for Fair and Good, respectively. Once again, we will have to drop these two levels explicitly.
droplevels(dia.sub1$cut) -> dia.sub1$cutnew
str(dia.sub1$cutnew)
#> Ord.factor w/ 3 levels "Very Good"<"Premium"<..: 3 2 2 1 1 1 1 3 2 3 ...
table(dia.sub1$cutnew) -> tab8
cutnew | Frequency |
---|---|
Very Good | 12082 |
Premium | 13791 |
Ideal | 21551 |
Now a bit more on sub-setting data. You can subset with as simple or as complicated a condition as you need. For example, maybe you only want Ideal diamonds with a certain minimum clarity and price. Sure, you can use this sub-setting criterion as shown below:
subset(
  diamonds,
  cut == "Ideal" & clarity == "VVS1" & price > 3933
) -> dia.sub2
Here the sub-setting is generating a data-frame that will only include observations that meet all three requirements (since you used & in the command). If you had run the following you would have had very different results.
subset(
  diamonds,
  cut == "Ideal" | clarity == "VVS1" | price > 3933
) -> dia.sub3

subset(
  diamonds,
  cut == "Ideal" & clarity == "VVS1" | price > 3933
) -> dia.sub4

subset(
  diamonds,
  cut == "Ideal" | clarity == "VVS1" & price > 3933
) -> dia.sub5
That is because & is saying “this criterion and that criterion both have to be met” while | is saying “either this criterion is met or that criterion is met”.
You could also subset the data in different ways. For example,
diamonds[diamonds$cut == "Ideal" | diamonds$clarity == "VVS1" | diamonds$price > 3933, ] -> dia.sub6 # same as dia.sub3

diamonds[!(diamonds$cut %in% c("Fair", "Good")), ] -> dia.sub7 # same as dia.sub1

diamonds[diamonds$cut %in% c("Ideal", "Premium", "Very Good"), ] -> dia.sub8 # same as dia.sub1 and dia.sub7
What is the opposite of %in%? Well, that would be the following:
`%!in%` = Negate(`%in%`) # define your own command for "not in"
diamonds[(diamonds$cut %!in% c("Fair", "Good")), ] -> dia.sub7

library(Hmisc) # use the %nin% command from the Hmisc package
diamonds[(diamonds$cut %nin% c("Fair", "Good")), ] -> dia.sub7
2.8 Exporting data from R
One can also export data created or manipulated via R into various formats. Why do we need to do this? Because colleagues or clients may not be using R, in which case we need to be able to share the data with them in their preferred format.
Take the file (out.df) we create below, for example. It is a small data-frame with two variables, one numeric and one categorical. Let us see how to export it to specific formats.
data.frame(
  Person = c("John", "Timothy", "Olivia", "Sebastian", "Serena"),
  Age = c(22, 24, 18, 24, 35)
) -> out.df

write.csv(out.df, file = "data/out.csv", row.names = FALSE)
library(haven)
write_dta(out.df, "data/out.df.dta") # Export a Stata format file
write_sav(out.df, "data/out.df.sav") # Export an SPSS format file
write_sas(out.df, "data/out.df.sas") # Export a SAS format file
So, using the haven package, we exported out.df to Stata, SPSS, and SAS formats. You can also export to MS Excel if you need to:
library(writexl)
write_xlsx(
out.df, "data/out.df.xlsx"
)
In the write.csv() command, specifying row.names = FALSE excludes the unique row numbers (or names, if the rows have names) that R uses for internal operations from being exported to the target file format.
There are other packages as well that import/export data so be sure to check them out, in particular rio.
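With rio, for example, the target format is inferred from the file extension, so a single export() function covers all of the formats above; a sketch, assuming rio (and its format-specific backends) is installed:

```r
library(rio)

export(out.df, "data/out.csv")     # CSV, inferred from the extension
export(out.df, "data/out.df.dta")  # Stata
export(out.df, "data/out.df.xlsx") # Excel
```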
2.9 Merging data
Often you may need to combine data-sets. For example, say I have data on some students, with test scores in one file and their gender in another file. Each file has a student identifier, called ID. There are two students in each file not seen in the other file.
data.frame(
  Score = c(10, 21, 33, 12, 35, 67, 43, 99),
  ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "C31", "Q22")
) -> data1

data.frame(
  Sex = c("Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female"),
  ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "E52", "H71")
) -> data2
To merge data-frames we need to specify the merge key – the variable that identifies unique observations in each file – and then whether we only want to merge observations that show up in both files (natural join), merge everything even if some cases show up in one file but not the other (full outer join), merge such that all cases in file x are retained but only those seen in file y are joined (left outer join), or keep all cases in y and merge from x only those cases that show up in y (right outer join).
merge(x = data1, y = data2, by = c("ID"), all = FALSE) -> natural # only merge if ID seen in both files
merge(x = data1, y = data2, by = c("ID"), all = TRUE) -> full # merge everything
merge(x = data1, y = data2, by = c("ID"), all.x = TRUE) -> left # make sure all cases from file x are retained
merge(x = data1, y = data2, by = c("ID"), all.y = TRUE) -> right # make sure all cases from file y are retained
What if the ID variables had different names? You could rename them to have a common name, or you could use by.x = and by.y = as shown below.
data.frame(
  Score = c(10, 21, 33, 12, 35, 67, 43, 99),
  ID1 = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "C31", "Q22")
) -> data3

data.frame(
  Sex = c("Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female"),
  ID2 = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "E52", "H71")
) -> data4

merge(
  data3, data4,
  by.x = c("ID1"),
  by.y = c("ID2"),
  all = FALSE
) -> data34
ID1 | Score | Sex |
---|---|---|
A12 | 10 | Male |
A23 | 21 | Female |
D66 | 67 | Female |
WX1 | 12 | Female |
Y31 | 35 | Male |
Z14 | 33 | Male |
You can also have more than one merge key. For example, if I am merging data for Ohio schools, each district has a district ID number (dirn), each school has a building ID number (birn), and then there is the district’s name (district), the building’s name (building), and so on. If I am merging these data then my by = statement will be by = c("dirn", "birn", "district", "building").
Having multiple merge keys is a blessing when you are not sure about the quality of the records. For example, I may have data for patients but I am not sure the folks who gathered the data created the unique identifier for each patient correctly. But I do have information on their date of birth, sex, and city and state of residence. I can then use these four pieces of information when merging the data, to ensure a better quality merged data-set than would result if I trusted the known error-prone ID variables.
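A sketch of what that multi-key merge might look like; the data-frames and variable names here (visits.df, labs.df, dob, sex, city, state) are hypothetical:

```r
# hypothetical patient files that share no reliable ID, only demographics
data.frame(
  dob = c("2001-03-02", "1998-11-15"),
  sex = c("F", "M"),
  city = c("Athens", "Dayton"),
  state = c("OH", "OH"),
  visits = c(3, 1)
) -> visits.df

data.frame(
  dob = c("2001-03-02", "1990-05-20"),
  sex = c("F", "F"),
  city = c("Athens", "Toledo"),
  state = c("OH", "OH"),
  lab.result = c(1.2, 0.8)
) -> labs.df

merge(
  visits.df, labs.df,
  by = c("dob", "sex", "city", "state"), # match on all four keys jointly
  all = FALSE # natural join: keep only patients matched in both files
) -> patients.df
```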
If we have multiple data-frames to merge we can do so in several ways, but my default approach is to rely on the Reduce() command, as shown below.
data.frame(
  Score = c(10, 21, 33, 12, 35, 67, 43, 99),
  ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "C31", "Q22")
) -> df1

data.frame(
  Sex = c("Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female"),
  ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "E52", "H71")
) -> df2

data.frame(
  Age = c(6, 7, 6, 8, 8, 9, 10, 5),
  ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "E52", "H71")
) -> df3

list(df1, df2, df3) -> my.list

Reduce(
  function(...) merge(..., by = c("ID"), all = TRUE), my.list
) -> df.123
ID | Score | Sex | Age |
---|---|---|---|
A12 | 10 | Male | 6 |
A23 | 21 | Female | 7 |
C31 | 43 | NA | NA |
D66 | 67 | Female | 9 |
E52 | NA | Male | 10 |
H71 | NA | Female | 5 |
Q22 | 99 | NA | NA |
WX1 | 12 | Female | 8 |
Y31 | 35 | Male | 8 |
Z14 | 33 | Male | 6 |
2.10 Identifying and handling duplicates
Every now and then you run into duplicate rows because of data entry errors. Here is an example:
data.frame(
  Sex = c(
    "Male", "Female", "Male", "Female",
    "Male", "Female", "Male", "Female"
  ),
  ID = c(
    "A12", "A23", "Z14", "WX1",
    "Y31", "D66", "A12", "WX1"
  )
) -> dups.df
Note that in dups.df students A12 and WX1 show up twice! There are two basic commands that we can use to find duplicates – unique() and duplicated().
duplicated(dups.df) # flag duplicated rows with TRUE
!duplicated(dups.df) # flag not-duplicated rows with TRUE
dups.df[duplicated(dups.df), ] # reveal the duplicated rows
dups.df[!duplicated(dups.df), ] # reveal the not-duplicated rows
dups.df[!duplicated(dups.df), ] -> nodups.df # save a data-frame without the duplicated rows
dups.df[duplicated(dups.df), ] -> onlydups.df # save a data-frame with ONLY the duplicated rows
You will invariably want to memorize (or look up) the no-duplicates command: dataframe[!duplicated(dataframe), ].
2.11 Rearranging a data-frame
We may, at times, want to move the variables around, perhaps because there are too many of them. We may also want to arrange the data-frame in ascending or descending order of some variable(s). These tasks are easily done.
mtcars[order(mtcars$mpg), ] -> df1 # arrange the data-frame in ascending order of mpg
mtcars[order(-mtcars$mpg), ] -> df2 # arrange the data-frame in descending order of mpg
mtcars[order(mtcars$am, mtcars$mpg), ] -> df3 # ascending order of automatic/manual, then ascending order of mpg
Maybe I’d like the columns ordered such that particular columns show up first, side by side. For instance, what if I want to order the columns such that the first column is am and then comes mpg, followed by qsec and then the rest of the columns?
names(mtcars) # the original structure of the mtcars data-frame
#> [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear" "carb"
mtcars[, c(9, 1, 7, 2:6, 8, 10:11)] -> df4 # specifying the position of the columns

names(df4)
#> [1] "am" "mpg" "qsec" "cyl" "disp" "hp" "drat" "wt" "vs" "gear" "carb"
If I only wanted certain columns I could just not list the other columns and I would have a trimmed data-frame.
mtcars[, c(9, 1, 7)] -> df5 # specifying the position of the columns
Notice the comma in mtcars[, … this tells R to keep all rows in mtcars. If, instead, I did this:
mtcars[1:4, c(9, 1, 7)] -> df6 # keep the first four rows and the specified columns
I would end up with only the first four rows of mtcars, not every row. So try to remember the order: dataframe[i, j]
where i
references the rows and j
references the columns.
2.12 Random Sampling
Maybe I just want to sample 5 observations from mtcars. How could I do this?
set.seed(123)
sample(1:nrow(mtcars), 5) -> index

index
#> [1] 31 15 19 14 3

mtcars[index, ] -> df7
Note the set.seed()
function – it is used to make our results reproducible so that here, for example, if someone else runs our code they should end up with the same 5 rows randomly sampled from mtcars. Without set.seed
they would end up with different data. We use it every time we are relying on randomization – drawing a random sample, generating a random distribution of numbers, etc.
2.13 Strings
Every now and then you will have variables with some characters that need to be replaced or deleted outright. R has a gsub()
function that comes in very handy for such tasks. I end up using it a lot when trying to pull out extraneous words such as “County” appended to every county’s name in a data-set or then, even worse, ” County, Ohio” as well.
library(here)

readr::read_csv(
  here(
    "data",
    "strings.csv"
  )
) -> c.data
The first thing I want to do is to remove the phrase “ County, Ohio” from GEO.display-label. This can be done swiftly with
gsub(
  " County, Ohio",
  "",
  c.data$`GEO.display-label`
) -> c.data$county
Note the sequence gsub("look for this pattern", "replace with this", mydata$myvariable). Note also that there is whitespace before “County” since that is how the data show up in the file.
If you look at GEO.id and GEO.id2, they represent Census fields whose values mean something. The Bureau says: “The ‘GEO.id’ field contains 14-digit codes that identify the summary level of data, the geographic component of the data and FIPS codes that uniquely identify the data. For example, the 14-digit ‘GEO.id’ for Harris County, TX is ‘0500000US48201’ where ‘050’ represents the summary level of the data, ‘0000’ represents the geographic component, ‘US’ represents the United States, ‘48’ represents the state of Texas and ‘201’ represents Harris County. The ‘GEO.id2’ field contains FIPS codes exactly as they appear in the ‘GEOID’ field featured in our TIGER Products. The ‘GEO.id2’ for Harris County, TX is ‘48201’ where ‘48’ represents the state of Texas and ‘201’ represents Harris County. Since the ‘GEO.id2’ field in AFF data matches the ‘GEOID’ field in our TIGER data, the two data-sets can easily be joined together to analyze data and create thematic maps.”
So we can split GEO.id2 into its state FIPS code and county FIPS code since we know where the split has to occur – the first two numbers are the state code and the last three numbers are the county code. I’ll create new variables first, names and geoid2.
as.character(c.data$`GEO.display-label`) -> c.data$names
as.character(c.data$GEO.id2) -> c.data$geoid2
substring(c.data$geoid2, 1, 2) -> c.data$statefips # state FIPS
substring(c.data$geoid2, 3, 5) -> c.data$countyfips # county FIPS
I could have also split GEO.id2 in one shot, as follows:

t(
  sapply(
    c.data$GEO.id2,
    function(x) substring(x, first = c(1, 3), last = c(2, 5))
  )
) -> splits

cbind.data.frame(c.data, splits) -> c.data2
colnames(c.data2)[10] <- "state_fips"
colnames(c.data2)[11] <- "county_fips"
Similar splits could be applied to GEO.id
to carve out independent state and county FIPS codes, as well as a five-digit state-county FIPS code.
2.14 With and Within
When creating new variables or modifying existing variables, two functions are helpful: with() and within(). For example, assume I have the following data-frame.
data.frame(
  x = seq(0, 5, by = 1),
  y = seq(10, 15, by = 1)
) -> dfw

dfw$x * dfw$y -> dfw$xy1 # basic creation of xy
with(dfw, x * y) -> dfw$xy2 # using with
within(dfw, xy3 <- x * y) -> dfw # using within
We can create a new variable, xy, in several ways, making the operation explicit versus using with() or within().
Pay attention to the last statement, within(dfw, xy3 <- x * y) -> dfw, because here the <- operator is being used inside the within(…) function to create xy3.
2.15 Rounding decimals
You rarely want to display more than two decimal places and more often than not just 1 (for non-technical audiences). R has a function round()
that will allow you to control the display. For example, assume we have the following data-frame.
data.frame(
  x = seq(0, 10, by = 0.12343)
) -> dfr
Say I want to round to two-digits or maybe one digit or better yet, no digits.
within(
  dfr, x.2 <- round(x, digits = 2)
) -> dfr

within(
  dfr, x.1 <- round(x, digits = 1)
) -> dfr

within(
  dfr, x.0 <- round(x, digits = 0)
) -> dfr
You could have collapsed all three into a single command as well:
within(dfr, {
  x.2b <- round(x, digits = 2);
  x.1b <- round(x, digits = 1);
  x.0b <- round(x, digits = 0)
}) -> dfr
2.16 Sums and Means
There are several functions we can lean on to calculate the totals or the means for columns in our data-frames. These are demonstrated for you below.
data.frame(
  x = c(0, 1, 2, 3, 4, NA),
  y = c(10, 11, 12, NA, 14, 15)
) -> dfs

within(
  dfs, {
    xtot = sum(x, na.rm = TRUE);
    ytot = sum(y, na.rm = TRUE)
  }
) -> dfs # sum x and sum y

rowSums(dfs[, c(1:2)], na.rm = TRUE) # sum columns 1 and 2 for each row
#> [1] 10 12 14 3 18 15
colSums(dfs[, c(1:2)], na.rm = TRUE) # sum columns 1 and 2, respectively
#> x y
#> 10 62
rowMeans(dfs[, c(1:2)], na.rm = TRUE) # calculate the mean for each row using columns 1 and 2
#> [1] 5 6 7 3 9 15
colMeans(dfs[, c(1:2)], na.rm = TRUE) # calculate the mean for columns 1 and 2
#> x y
#> 2.0 12.4
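The rowSums()/colMeans() family are fast special cases of the more general apply() and sapply(). If you ever need a row-wise or column-wise statistic that has no dedicated function (say, a median), the same pattern works; a sketch using the same dfs:

```r
dfs <- data.frame(
  x = c(0, 1, 2, 3, 4, NA),
  y = c(10, 11, 12, NA, 14, 15)
)

# apply() over rows (MARGIN = 1) or columns (MARGIN = 2)
apply(dfs, 1, sum, na.rm = TRUE)   # same result as rowSums(dfs, na.rm = TRUE)
apply(dfs, 2, mean, na.rm = TRUE)  # same result as colMeans(dfs, na.rm = TRUE)

# sapply() iterates over a data-frame's columns directly
sapply(dfs, sum, na.rm = TRUE)     # x = 10, y = 62
```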
2.17 Padding a vector
Every now and then you need to add leading zeroes, lagging zeroes, or some other element to a vector. For example, many (but not all) of the Ohio Department of Education’s data-sets will come with building and district identifiers as characters such as “000125”, “000138”, and so on, but other data will have the same variable showing up as the numbers 125, 138, and so on. Since district and school identifiers are all six-digit numbers, I end up padding the latter into six-digit characters of “000125”, “000138”, and so on. I do this with the stringr package.
library(stringr)
data.frame(
  dirn = c(
    125, 138, 141, 41256, 51513, 131256
  )
) -> odedf
str_pad(
  odedf$dirn,
  width = 6,
  side = c("left"),
  pad = "0"
) -> odedf$dirnl
Note that width = 6 refers to the total number of characters the resulting vector should have, side = c("left") means add whatever I ask you to add to the left of any existing value, and pad = "0" says I want you to add zeroes. If you had set side to "right" or "both" the padding would be different, as shown below, so be careful.
str_pad(
  odedf$dirn,
  width = 6,
  side = c("right"),
  pad = "0"
) -> odedf$dirnr
str_pad(
  odedf$dirn,
  width = 6,
  side = c("both"),
  pad = "0"
) -> odedf$dirnb
Note also that the resulting vector is a character, not a numeric! But that is not a problem since padding has, in my experience, only come in use whenever I needed to merge data-frames but the ID variables were padded in one or more data-frame(s) but not in the other(s).
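If you would rather not load stringr, base R can do the same left-padding with formatC() or sprintf(); a sketch (same illustrative identifiers as above):

```r
dirn <- c(125, 138, 141, 41256, 51513, 131256)

# formatC() with flag = "0" left-pads integers with zeroes
formatC(dirn, width = 6, format = "d", flag = "0")
#> [1] "000125" "000138" "000141" "041256" "051513" "131256"

# sprintf() does the same with a C-style format string
sprintf("%06d", dirn)
```

As with str_pad(), the result is a character vector, not a numeric.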
2.18 The famous ifelse and grepl
There is a very useful function in R, ifelse(condition, A, B)
, that evaluates a logical expression/condition and if this expression/condition is TRUE then it generates the value A and if the expression/condition is FALSE then it generates the value B. Here is a simple example, with the mtcars
data-set. My goal is to generate a new variable whose value will be “Automatic” if am == 1
and “Manual” if am == 0
.
data(mtcars)
ifelse(
  mtcars$am == 1,
  "Automatic",
  "Manual"
) -> mtcars$automatic
Voila! Why is this useful? Because otherwise I would need two lines of code to generate the same variable, as shown below:
mtcars$automatic2[mtcars$am == 1] <- "Automatic"
mtcars$automatic2[mtcars$am == 0] <- "Manual"
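A third option worth knowing: when a variable takes a small set of known values, factor() can do the recode in one step, and it yields a proper factor rather than a character vector. A sketch, keeping the same mapping the text uses (am == 1 labeled "Automatic"); the column name automatic3 is illustrative:

```r
data(mtcars)
factor(
  mtcars$am,
  levels = c(0, 1),
  labels = c("Manual", "Automatic")  # 0 -> "Manual", 1 -> "Automatic"
) -> mtcars$automatic3
```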
Another benefit of ifelse() is that you can link calls in a chain if you have very complicated operations to carry out. You will see one such tricky operation when we deal with some messy education data in the next chapter. For now, focus on the other function of value here, grepl().
This is a very useful function, especially when bundled with ifelse
because it will look for a specified string and allow you to then generate values depending upon whether the string is seen (TRUE) or not (FALSE). See this instance, for example, where we have information on some movies and would like to generate new variables based on what information we have.
load("data/movies/movies.RData")
ifelse(
  grepl("Comedy", movies$genres), "Yes", "No"
) -> movies$comedy
ifelse(
  grepl("(1995)", movies$title), "Yes", "No"
) -> movies$in1995
ifelse(
  grepl("Night", movies$title), "Yes", "No"
) -> movies$night
In short, the function looks for the specified string (“Comedy”) in genres, and if it is found adds a flag of “Yes”; if it is not found, it adds a flag of “No”.
But what if I wanted to run more complicated expressions, for example if the words “Night” or “Dark” or “Cool” are used in the title?
ifelse(
  grepl("Night", movies$title), "Yes",
  ifelse(grepl("Dark", movies$title), "Yes",
    ifelse(grepl("Cool", movies$title), "Yes", "No")
  )
) -> movies$tricky
Beautiful, just beautiful! The trick is to nest the ifelse(grepl()) statements so that the entire chain of possible “Yes” flags is attached before everything that remains is flagged with “No”.
There is a related function, grep(). Both functions find patterns, but they return different output types: grep() returns the numeric indices of the elements where the pattern was found, while grepl() returns a logical vector in which TRUE marks a pattern match and FALSE does not. Run the two commands that follow and focus on the output that scrolls in the console.
grep("Comedy", movies$genres)
grepl("Comedy", movies$genres)
You could use both for sub-setting data as well. See below for some examples, and note that the minus sign in -grep() drops the matched indices; it is the equivalent of saying “keep those where the pattern is not found”.
movies[grep("Comedy", movies$genres), ] -> mov1
movies[-grep("Comedy", movies$genres), ] -> mov2
Say I only want columns that have the letter “t” in their names. No problem:
movies[, grep("t", colnames(movies))] -> mov3   # keep if t is in column name
movies[, -grep("t", colnames(movies))] -> mov4  # keep if t is not in column name
Try using grepl() in lieu of grep() above. Also, before we move on to other functions, recognize that these functions look for the string exactly as specified, no matter where it shows up! So it does not care whether “t” is the first, last, or some other letter in the column name; it picks up all of them.