2 Basic Data Operations with base R
You can generate your own data, manipulate data by adding, subtracting, dividing, or multiplying a numeric column by a scalar or by the value in another column, convert numeric data to factors (qualitative variables; indicators that measure a non-quantifiable attribute of an observation), and so on. Below we will see a few basic data operations at work. But first, we need to create some data to work with.
2.1 Creating A Small Data-Set
Let us create two variables, x and y, and combine them into a data-frame.
c(
  100, 101, 102, 103, 104, 105, 106
) -> x

c(
  7, 8, 9, 10, 11, 12, 13
) -> y

cbind.data.frame(
  x, y
) -> df.cbind
Note that as.data.frame(cbind(x, y)) -> df.cbind will give you the same result as cbind.data.frame(x, y) -> df.cbind.
x | y |
---|---|
100 | 7 |
101 | 8 |
102 | 9 |
103 | 10 |
104 | 11 |
105 | 12 |
106 | 13 |
The commands above generate two columns, x and y, and then cbind() binds them as columns into a data-set called df.cbind. If we used rbind() instead, it would bind x and y as rows instead of columns.
c(
  100, 101, 102, 103, 104, 105, 106
) -> x

c(
  7, 8, 9, 10, 11, 12, 13
) -> y

rbind.data.frame(
  x, y
) -> df.rbind
c.100..7. | c.101..8. | c.102..9. | c.103..10. | c.104..11. | c.105..12. | c.106..13. |
---|---|---|---|---|---|---|
100 | 101 | 102 | 103 | 104 | 105 | 106 |
7 | 8 | 9 | 10 | 11 | 12 | 13 |
Note the weird column names this gives you versus what you get if you do the following:
as.data.frame(
  rbind(
    x, y
  )
) -> df.rbind
V1 | V2 | V3 | V4 | V5 | V6 | V7 | |
---|---|---|---|---|---|---|---|
x | 100 | 101 | 102 | 103 | 104 | 105 | 106 |
y | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
When we use rbind() this way, R names the columns V1, V2, and so on. You will likely want to label the columns differently from how they were automatically named. This is easily accomplished as shown below:
c(
  "Variable 1", "Variable 2"
) -> names(df.cbind)

c(
  "Variable 1", "Variable 2", "Variable 3", "Variable 4",
  "Variable 5", "Variable 6", "Variable 7"
) -> names(df.rbind)
Check df.cbind and df.rbind, respectively, to verify the column names match what we specified in the preceding code chunk. Remember: rbind() binds rows while cbind() binds columns.
You can also generate data-sets that combine quantitative and qualitative variables. This is demonstrated below:
c(
  100, 101, 102, 103, 104, 105, 106
) -> x

c(
  "Male", "Female", "Male", "Female", "Female", "Male", "Female"
) -> y

cbind.data.frame(
  x, y
) -> df.1

c(
  100, 101, 102, 103, 104, 105, 106
) -> x

c(
  0, 1, 0, 1, 1, 0, 1
) -> y

cbind.data.frame(
  x, y
) -> df.2
Note that in df.1, y is a string variable with values of Male/Female. In contrast, df.2 has y specified as a 0/1 variable, with 0 = Male and 1 = Female. We could label the 0/1 values in df.2 as follows:
factor(
  df.2$y,
  levels = c(0, 1),
  labels = c("Male", "Female")
) -> df.2$y
x | y |
---|---|
100 | Male |
101 | Female |
102 | Male |
103 | Female |
104 | Female |
105 | Male |
106 | Female |
We just created a factor, one of several variable formats available in R, discussed in Chapter 1 as well. However, this is a good place to look at these formats a little more closely. Numbers will be read in as numeric unless you have some alphanumeric values or strings. This is apparent from the examples that follow.
c(
  1, 2, 3, 4, 5
) -> x1

c(
  1, 2, 3, 4, "a5"
) -> x2

c(
  1, 2, 3, 4, "hello"
) -> x3

c(
  1, 2, 3, 4, NA
) -> x4
See how x1 and x4 are read in as numeric but x2 and x3 show up as chr, short for character. You could convert x2 and x3 into numeric as follows:
as.numeric(x2) -> x2.num
as.numeric(x3) -> x3.num
R’s as.numeric() command will automatically set strings (for example, “hello”) and alphanumeric values (for example, “a5”) to missing, denoted by NA, during the conversion process. You can also flip a numeric variable into a character or a factor, but that will leave the NA as NA.
as.character(x2.num) -> x2.chr
At times the variable may in fact be numeric but, because of some anomalies in the variable, R will read it in as a factor. When this happens, converting the variable into numeric has to be done with care. Just running as.numeric() will not work. See the example below where age is flagged as a factor:
c(
  100, 101, 102, 103, 104, 105, 106
) -> score

c(
  "Male", "Female", "Male", "Female",
  "Female", "Male", "Female"
) -> sex

c(
  "18", "18", "19", "19", "21", "21", "NA"
) -> age
cbind.data.frame(score, sex, age) -> my.df
score | sex | age |
---|---|---|
100 | Male | 18 |
101 | Female | 18 |
102 | Male | 19 |
103 | Female | 19 |
104 | Female | 21 |
105 | Male | 21 |
106 | Female | NA |
as.numeric(my.df$age) -> my.df$age.num1
If you try as.numeric() on a factor, the conversion uses the factor’s underlying integer codes instead of the actual numeric values. In these instances it is recommended that you run as.numeric(levels(data$variable))[data$variable] instead. An example is shown below where age.factor is flipped into age.num2.
as.factor(my.df$age) -> my.df$age.factor
glimpse(my.df$age.factor)
#> Factor w/ 4 levels "18","19","21",..: 1 1 2 2 3 3 4
as.numeric(levels(my.df$age.factor))[my.df$age.factor] -> my.df$age.num2
glimpse(my.df$age.num2)
#> num [1:7] 18 18 19 19 21 21 NA
2.2 Some Basic Calculations
In R, we can easily calculate some basic statistics. Here are a few that you should be very familiar with:
mean(my.df$age.num1, na.rm = TRUE) # Mean age
#> [1] 19.33
median(my.df$age.num1, na.rm = TRUE) # Median age
#> [1] 19
min(my.df$age.num1, na.rm = TRUE) # Minimum age
#> [1] 18
max(my.df$age.num1, na.rm = TRUE) # Maximum age
#> [1] 21
range(my.df$age.num1, na.rm = TRUE) # Minimum and Maximum age
#> [1] 18 21
var(my.df$age.num1, na.rm = TRUE) # Variance of age
#> [1] 1.867
sd(my.df$age.num1, na.rm = TRUE) # Standard Deviation of age
#> [1] 1.366
fivenum(my.df$age.num1, na.rm = TRUE) # Five-number summary of age
#> [1] 18 18 19 21 21
IQR(my.df$age.num1, na.rm = TRUE, type = 5) # Interquartile Range of age
#> [1] 3
length(my.df$age.num1) # Number of observations of age.num1
#> [1] 7
length(unique(my.df$age.num1)) # Number of unique values of age.num1
#> [1] 4
These can also be calculated with the dplyr package, as shown below.

library(dplyr)

my.df %>%
  summarise(
    Mean.age = mean(age.num1, na.rm = TRUE),
    Median.age = median(age.num1, na.rm = TRUE),
    Min.age = min(age.num1, na.rm = TRUE),
    Max.age = max(age.num1, na.rm = TRUE),
    Variance.age = var(age.num1, na.rm = TRUE),
    SD.age = sd(age.num1, na.rm = TRUE),
    IQR.age = IQR(age.num1, na.rm = TRUE)
  )
Mean.age | Median.age | Min.age | Max.age | Variance.age | SD.age | IQR.age |
---|---|---|---|---|---|---|
19.33 | 19 | 18 | 21 | 1.867 | 1.366 | 2.25 |
dplyr is going to be one of the packages we turn to a lot down the road. It is a workhorse for many of us.
2.3 Transformations
We can carry out various operations on numeric variables, as shown below with respect to the variable score:

my.df$score * 10 -> my.df$x1
my.df$score * 100 -> my.df$x2
my.df$score / 10 -> my.df$x3
sqrt(my.df$score) -> my.df$x4
my.df$score^(2) -> my.df$x5
my.df$score * 1.31 -> my.df$x6
log(my.df$score) -> my.df$x7
scale(my.df$score) -> my.df$x8
Note the various operators: we multiply via *, divide via /, take the square-root via sqrt(), raise the value to some power via ^(), create the z-score via scale(), take the logarithm via log(), and so on.
We can also delete variables, change variable names, and so on. Let us see this with a small data-set that we create.
data.frame(
  sex = c("M", "F", "M", "F"),
  NAMES = c("Andy", "Jill", "Jack", "Madison"),
  age = c(24, 48, 72, 96)
) -> my.df
sex | NAMES | age |
---|---|---|
M | Andy | 24 |
F | Jill | 48 |
M | Jack | 72 |
F | Madison | 96 |
2.4 Binning: Creating a grouped version of a numeric variable
If I have a true numeric variable, say individuals’ ages, and I want to create a grouped version of this variable, R has a very useful cut() function that comes in handy. For example, say I have the hsb2 data and want to group math scores. After we read in the data and check the range of math scores we notice the minimum is 33 and the maximum is 75. Doing some quick calculations, the range is \(75 - 33 = 42\), and so if I want five groups I’ll have the groups be 33-42, 42-51, 51-60, 60-69, 69-78. Notice the width of each group is 9, since \(\dfrac{42}{5} = 8.4\), which has been rounded up to 9.
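As an aside, evenly spaced break points like these can be generated rather than typed by hand; a minimal sketch using base R’s seq() (the object name breaks5 is mine):

```r
# five groups of width 9, starting at the minimum score of 33
seq(from = 33, to = 78, by = 9) -> breaks5

breaks5
#> [1] 33 42 51 60 69 78
```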
read.table(
  "https://stats.idre.ucla.edu/stat/data/hsb2.csv",
  header = TRUE,
  sep = ","
) -> hsb

summary(hsb$math) # to see the range of math scores
Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
---|---|---|---|---|---|
33 | 45 | 52 | 52.65 | 59 | 75 |
cut(
  hsb$math,
  breaks = c(33, 42, 51, 60, 69, 78)
) -> hsb$math_groups

table(hsb$math_groups)
(33,42] | (42,51] | (51,60] | (60,69] | (69,78] |
---|---|---|---|---|
34 | 60 | 61 | 33 | 11 |
Uh, oh, this only gave us 199 grouped values but our sample size is 200, so one student was lost in the process. This is the student with a score of 33. What happened? Well, you have to tell R what to do if it runs into a number that matches the limits of a group, i.e., if I see 42, should I include it in the 33-42 group or in the 42-51 group? We can tell R what to do by adding right = FALSE or right = TRUE (the default). So let me create a new grouped variable with right = FALSE.
cut(
  hsb$math,
  breaks = c(33, 42, 51, 60, 69, 78),
  right = FALSE
) -> hsb$math_groups2

table(hsb$math_groups2)
[33,42) | [42,51) | [51,60) | [60,69) | [69,78) |
---|---|---|---|---|
28 | 59 | 64 | 36 | 13 |
Technically, right = FALSE only includes a value \(x\) in a group \([a, b)\) if \(a \leq x < b\), while right = TRUE includes \(x\) in a group \((a, b]\) if \(a < x \leq b\), where \(a\) and \(b\) are the lower and upper limits of the group, respectively.
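As an aside, cut() also has an include.lowest argument; keeping the default right = TRUE but adding include.lowest = TRUE closes the first interval at 33, so the student with the minimum score is no longer dropped. A sketch, assuming the same hsb data (math_groups3 is my name):

```r
cut(
  hsb$math,
  breaks = c(33, 42, 51, 60, 69, 78),
  include.lowest = TRUE # the first group becomes [33,42] instead of (33,42]
) -> hsb$math_groups3

table(hsb$math_groups3) # all 200 students are now grouped
```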
2.5 Changing the case (lower/upper)
Say I want to change the variable NAMES to be lowercase. I can do this via
colnames(my.df)[2] = "names"
sex | names | age |
---|---|---|
M | Andy | 24 |
F | Jill | 48 |
M | Jack | 72 |
F | Madison | 96 |
If I wanted all variable names to be lowercase I would do
data.frame(
  sex = c("M", "F", "M", "F"),
  NAMES = c("Andy", "Jill", "Jack", "Madison"),
  age = c(24, 48, 72, 96)
) -> my.df

tolower(colnames(my.df)) -> colnames(my.df)
sex | names | age |
---|---|---|
M | Andy | 24 |
F | Jill | 48 |
M | Jack | 72 |
F | Madison | 96 |
What if I wanted to convert the individual values in names to uppercase? To lowercase?
toupper(my.df$names) -> my.df$name1
tolower(my.df$names) -> my.df$name2
2.6 An aside on Factors
I can create a new variable, female, and store it as a factor as shown below:

my.df$female[my.df$sex == "M"] = 0
my.df$female[my.df$sex == "F"] = 1

factor(
  my.df$female,
  levels = c(0, 1),
  labels = c("Male", "Female")
) -> my.df$female
Notice how R stored the factor with values 1 = Male and 2 = Female even though the original coding had 0 and 1, and that the order was Male then Female. This is not what would result if you had done:
my.df$female2[my.df$sex == "M"] = "Male"
my.df$female2[my.df$sex == "F"] = "Female"

factor(my.df$female2) -> my.df$female2

Here, you end up with 1 = Female and 2 = Male.
Of course, I could have gone in and reordered the factor such that Female comes before Male. This is often important when we have categorical variables that are ordinal, because R defaults to ordering a factor’s levels alphabetically. For example, say we have data on 10 individuals, asked to respond to a question with Disagree/Neutral/Agree.
data.frame(
  x = c(rep("Disagree", 3), rep("Neutral", 2), rep("Agree", 5))
) -> fdf

factor(fdf$x) -> fdf$responses

levels(fdf$responses)
#> [1] "Agree" "Disagree" "Neutral"

table(fdf$responses) -> tab1
x | Frequency |
---|---|
Agree | 5 |
Disagree | 3 |
Neutral | 2 |
Note that rep("Disagree", 3) is essentially saying create three entries of “Disagree”, rep("Neutral", 2) is generating two values of “Neutral”, and so on. It is a quicker way of generating a number of identical values than typing out x = c("Disagree", "Disagree", "Disagree", "Neutral", "Neutral", "Agree", "Agree", "Agree", "Agree", "Agree") in full.
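rep() can also take a vector of counts via its times argument, which collapses the three calls above into one; a quick sketch (x.alt is my name):

```r
rep(
  c("Disagree", "Neutral", "Agree"),
  times = c(3, 2, 5) # three Disagrees, two Neutrals, five Agrees
) -> x.alt
```

The result is identical to the x column built above.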
I don’t want this ordering since this is an ordinal variable: either Disagree should come first, followed by Neutral, with Agree last, or Agree first, followed by Neutral, with Disagree last.
ordered(
  fdf$responses,
  levels = c("Disagree", "Neutral", "Agree")
) -> fdf$newresponses

levels(fdf$newresponses)
#> [1] "Disagree" "Neutral" "Agree"

min(fdf$newresponses)
#> [1] Disagree
#> Levels: Disagree < Neutral < Agree

table(fdf$newresponses) -> tab2
x | Frequency |
---|---|
Disagree | 3 |
Neutral | 2 |
Agree | 5 |
I could have also done this when creating the factor via
factor(
  fdf$x,
  ordered = TRUE,
  levels = c("Disagree", "Neutral", "Agree")
) -> fdf$xordered

levels(fdf$xordered)
#> [1] "Disagree" "Neutral" "Agree"

min(fdf$xordered)
#> [1] Disagree
#> Levels: Disagree < Neutral < Agree

table(fdf$xordered) -> tab3
x | Frequency |
---|---|
Disagree | 3 |
Neutral | 2 |
Agree | 5 |
Notice the min() command … which asks for the minimum level of a factor. This works with ordered factors but not with unordered factors; try it with x and responses, respectively, to see what results.
You can drop the original x variable as follows:

fdf$x = NULL
Before we move on, note that when it comes to modeling and other statistical techniques in R, an ordered factor has a different intrinsic meaning (ordinal) than an unordered factor (nominal), so be careful. When it comes to modeling we can specify the reference category explicitly before modeling. For example, say I want Neutral to be the excluded (i.e., the reference) category. I would then do the following:
relevel(fdf$x, ref = "Neutral") -> fdf$refx
levels(fdf$refx)
This command will not work with an ordered factor so bear that in mind as well.
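If you do need a custom reference category for a variable stored as an ordered factor, one workaround (a sketch, not the only approach; unordered and refx2 are my names) is to strip the ordering first and then relevel:

```r
# rebuild as an unordered factor, keeping the same levels
factor(
  fdf$newresponses,
  levels = levels(fdf$newresponses),
  ordered = FALSE
) -> fdf$unordered

relevel(fdf$unordered, ref = "Neutral") -> fdf$refx2

levels(fdf$refx2)
#> [1] "Neutral" "Disagree" "Agree"
```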
We could also fix typos, assuming they exist, and then drop unused factor levels as shown below. Say we have the following data:
data.frame(
  mf = as.factor(
    c(rep("Male", 3), rep("male", 2),
      rep("Female", 3), rep("femalE", 2)
    )
  )
) -> sexdf

levels(sexdf$mf)
#> [1] "femalE" "Female" "male" "Male"
Obviously not what we’d like so we can go in and clean it up a bit. How? As follows:
sexdf$mf[sexdf$mf == "male"] = "Male"
sexdf$mf[sexdf$mf == "femalE"] = "Female"

levels(sexdf$mf)
#> [1] "femalE" "Female" "male" "Male"

table(sexdf$mf) -> tab4
mf | Frequency |
---|---|
femalE | 0 |
Female | 5 |
male | 0 |
Male | 5 |
Wait a second, we still see femalE and male but with 0 frequency counts; how should we get rid of these ghosts? With droplevels(), as shown below:
as.factor(sexdf$mf) -> sexdf$newmf
levels(sexdf$newmf)
#> [1] "femalE" "Female" "male" "Male"
droplevels(sexdf$mf) -> sexdf$newmf
levels(sexdf$newmf)
#> [1] "Female" "Male"
table(sexdf$newmf) -> tab5
newmf | Frequency |
---|---|
Female | 5 |
Male | 5 |
There is the forcats package that you should check out if you want to see how its functions work with factors. I will touch upon this package a bit later on in this text.
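For instance, the typo clean-up and level-dropping above can be done in a single step with forcats::fct_recode(); a sketch, assuming the forcats package is installed (mf2 is my name):

```r
library(forcats)

fct_recode(
  sexdf$mf,
  Male = "male",    # fold the old "male" level into "Male"
  Female = "femalE" # fold the old "femalE" level into "Female"
) -> sexdf$mf2

levels(sexdf$mf2) # only the two clean levels remain
```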
2.7 Subsetting data and factor levels
I may also need to drop a factor level. For example, say I am working with the diamonds data-set and need to subset my analysis to diamonds that are, at a minimum, of a “Very Good” cut. Before we tackle the data, here is a brief description of this data frame, which has 53,940 rows and 10 variables:
Variables | Description |
---|---|
price | price in US dollars ($326–$18,823) |
carat | weight of the diamond (0.2–5.01) |
cut | quality of the cut (Fair, Good, Very Good, Premium, Ideal) |
color | diamond colour, from J (worst) to D (best) |
clarity | a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best)) |
x | length in mm (0–10.74) |
y | width in mm (0–58.9) |
z | depth in mm (0–31.8) |
depth | total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43–79) |
table | width of top of diamond relative to widest point (43–95) |
library(ggplot2)
data(diamonds)
str(diamonds$cut)
#> Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
table(diamonds$cut) -> tab6
cut | Frequency |
---|---|
Fair | 1610 |
Good | 4906 |
Very Good | 12082 |
Premium | 13791 |
Ideal | 21551 |
So a minimum cut of “Very Good” implies diamonds that are either Very Good, Premium, or Ideal. I can subset with this criterion as follows:
subset(
  diamonds,
  cut != "Fair" & cut != "Good"
) -> dia.sub1

str(dia.sub1$cut)
#> Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 4 3 3 3 3 5 4 5 ...

table(dia.sub1$cut) -> tab7
cut | Frequency |
---|---|
Fair | 0 |
Good | 0 |
Very Good | 12082 |
Premium | 13791 |
Ideal | 21551 |
You should try it for yourself: subset(diamonds, cut != "Fair" & cut != "Good") is the same as running subset(diamonds, cut == "Very Good" | cut == "Premium" | cut == "Ideal").
Aha! I see no diamonds in the excluded cut ratings, but str() still shows R remembering cut as having 5 ratings, and hence the frequency table shows up with zero counts for Fair and Good, respectively. Once again, we will have to drop these two levels explicitly.
droplevels(dia.sub1$cut) -> dia.sub1$cutnew
str(dia.sub1$cutnew)
#> Ord.factor w/ 3 levels "Very Good"<"Premium"<..: 3 2 2 1 1 1 1 3 2 3 ...
table(dia.sub1$cutnew) -> tab8
cutnew | Frequency |
---|---|
Very Good | 12082 |
Premium | 13791 |
Ideal | 21551 |
Now a bit more on sub-setting data. You can subset with as simple or as complicated a condition as you need. For example, maybe you only want Ideal diamonds with a certain minimum clarity and price. Sure, you can use this sub-setting criterion as shown below:
subset(
  diamonds,
  cut == "Ideal" & clarity == "VVS1" & price > 3933
) -> dia.sub2
Here the sub-setting is generating a data-frame that will only include observations that meet all three requirements (since you used & in the command). If you had run the following you would have had very different results.
subset(
  diamonds,
  cut == "Ideal" | clarity == "VVS1" | price > 3933
) -> dia.sub3

subset(
  diamonds,
  cut == "Ideal" & clarity == "VVS1" | price > 3933
) -> dia.sub4

subset(
  diamonds,
  cut == "Ideal" | clarity == "VVS1" & price > 3933
) -> dia.sub5
That is because & is saying “this criterion and that criterion both have to be met” while | is saying “either this criterion is met or that criterion is met”.
You could also subset the data in different ways. For example,
diamonds[diamonds$cut == "Ideal" | diamonds$clarity == "VVS1" | diamonds$price > 3933, ] -> dia.sub6 # same as dia.sub3

diamonds[!(diamonds$cut %in% c("Fair", "Good")), ] -> dia.sub7 # same as dia.sub1

diamonds[diamonds$cut %in% c("Ideal", "Premium", "Very Good"), ] -> dia.sub8 # same as dia.sub1 and dia.sub7
What is the opposite of %in%? Well, that would be the following:
`%!in%` = Negate(`%in%`) # define your own command for "not in"
diamonds[(diamonds$cut %!in% c("Fair", "Good")), ] -> dia.sub7

library(Hmisc) # use the %nin% command from the Hmisc package
diamonds[(diamonds$cut %nin% c("Fair", "Good")), ] -> dia.sub7
2.8 Exporting data from R
One can also export data created or manipulated via R into various formats. Why do we need to do this? Because colleagues or clients may not be using R, in which case we need to be able to share the data with them in their preferred format.
Take the file (out.df) we create below, for example. It is a small data-frame with two variables, one numeric and one categorical. Let us see how to export it to specific formats.
data.frame(
  Person = c("John", "Timothy", "Olivia", "Sebastian", "Serena"),
  Age = c(22, 24, 18, 24, 35)
) -> out.df

write.csv(out.df, file = "data/out.csv", row.names = FALSE)
library(haven)
write_dta(out.df, "data/out.df.dta") # Export a Stata format file
write_sav(out.df, "data/out.df.sav") # Export an SPSS format file
write_sas(out.df, "data/out.df.sas") # Export a SAS format file
So, using the haven package, we exported out.df to Stata, SPSS, and SAS formats. You can also export to MS Excel if you need to:
library(writexl)
write_xlsx(
out.df, "data/out.df.xlsx"
)
In the write.csv() command, specifying row.names = FALSE excludes the unique row numbers (or names, if the rows have names) that R uses for internal operations from being exported to the target file format.
There are other packages as well that import/export data so be sure to check them out, in particular rio.
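With rio, for example, the target format is inferred from the file extension, so a single export() function covers all of the formats above; a sketch, assuming rio (and its format-specific backends) is installed:

```r
library(rio)

export(out.df, "data/out.csv")     # CSV, inferred from the extension
export(out.df, "data/out.df.dta")  # Stata
export(out.df, "data/out.df.xlsx") # Excel
```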
2.9 Merging data
Often you may need to combine data-sets. For example, say I have data on some students, with test scores in one file and their gender in another file. Each file has a student identifier, called ID. There are two students in each file not seen in the other file.
data.frame(
  Score = c(10, 21, 33, 12, 35, 67, 43, 99),
  ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "C31", "Q22")
) -> data1

data.frame(
  Sex = c("Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female"),
  ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "E52", "H71")
) -> data2
To merge data-frames we need to specify the merge key – the variable that identifies unique observations in each file – and then whether we only want to merge observations that show up in both files (natural join), merge everything even if some cases show up in one file but not the other (full outer join), merge such that all cases in file x are retained but only those seen in file y are joined (left outer join), or keep all cases in y and merge from x only those cases that show up in y (right outer join).
merge(x = data1, y = data2, by = c("ID"), all = FALSE) -> natural # only merge if ID seen in both files
merge(x = data1, y = data2, by = c("ID"), all = TRUE) -> full # merge everything
merge(x = data1, y = data2, by = c("ID"), all.x = TRUE) -> left # make sure all cases from file x are retained
merge(x = data1, y = data2, by = c("ID"), all.y = TRUE) -> right # make sure all cases from file y are retained
What if the ID variables had different names? You could rename them to have a common name, or you could use by.x = and by.y = as shown below.
data.frame(
  Score = c(10, 21, 33, 12, 35, 67, 43, 99),
  ID1 = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "C31", "Q22")
) -> data3

data.frame(
  Sex = c("Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female"),
  ID2 = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "E52", "H71")
) -> data4

merge(
  data3, data4,
  by.x = c("ID1"),
  by.y = c("ID2"),
  all = FALSE
) -> data34
ID1 | Score | Sex |
---|---|---|
A12 | 10 | Male |
A23 | 21 | Female |
D66 | 67 | Female |
WX1 | 12 | Female |
Y31 | 35 | Male |
Z14 | 33 | Male |
You can also have more than one merge key. For example, if I am merging data for Ohio schools, each district has a district ID number (dirn), each school has a building ID number (birn), and then there is the district’s name (district), the building’s name (building), and so on. If I am merging these data then my by = statement will be by = c("dirn", "birn", "district", "building").
Having multiple merge keys is a blessing when you are not sure about the quality of the records. For example, I may have data for patients but I am not sure the folks who gathered the data created the unique identifier for each patient correctly. But I do have information on their date of birth, sex, and city and state of residence. I can then use these four pieces of information when merging the data, to ensure a better quality merged data-set than would result if I trusted the known error-prone ID variables.
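A sketch of what that multi-key merge might look like; the data-frames and variable names here (visits.df, labs.df, dob, sex, city, state) are hypothetical:

```r
# hypothetical patient files that share no reliable ID, only demographics
data.frame(
  dob = c("2001-03-02", "1998-11-15"),
  sex = c("F", "M"),
  city = c("Athens", "Dayton"),
  state = c("OH", "OH"),
  visits = c(3, 1)
) -> visits.df

data.frame(
  dob = c("2001-03-02", "1990-05-20"),
  sex = c("F", "F"),
  city = c("Athens", "Toledo"),
  state = c("OH", "OH"),
  lab.result = c(1.2, 0.8)
) -> labs.df

merge(
  visits.df, labs.df,
  by = c("dob", "sex", "city", "state"), # match on all four keys jointly
  all = FALSE # natural join: keep only patients matched in both files
) -> patients.df
```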
If we have multiple data-frames to merge we can do so in several ways, but my default approach is to rely on the Reduce() command, as shown below.
data.frame(
  Score = c(10, 21, 33, 12, 35, 67, 43, 99),
  ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "C31", "Q22")
) -> df1

data.frame(
  Sex = c("Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female"),
  ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "E52", "H71")
) -> df2

data.frame(
  Age = c(6, 7, 6, 8, 8, 9, 10, 5),
  ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "E52", "H71")
) -> df3

list(df1, df2, df3) -> my.list

Reduce(
  function(...) merge(..., by = c("ID"), all = TRUE), my.list
) -> df.123
ID | Score | Sex | Age |
---|---|---|---|
A12 | 10 | Male | 6 |
A23 | 21 | Female | 7 |
C31 | 43 | NA | NA |
D66 | 67 | Female | 9 |
E52 | NA | Male | 10 |
H71 | NA | Female | 5 |
Q22 | 99 | NA | NA |
WX1 | 12 | Female | 8 |
Y31 | 35 | Male | 8 |
Z14 | 33 | Male | 6 |
2.10 Identifying and handling duplicates
Every now and then you run into duplicate rows because of data entry errors. Here is an example:
data.frame(
  Sex = c(
    "Male", "Female", "Male", "Female",
    "Male", "Female", "Male", "Female"
  ),
  ID = c(
    "A12", "A23", "Z14", "WX1",
    "Y31", "D66", "A12", "WX1"
  )
) -> dups.df
Note that in dups.df students A12 and WX1 show up twice! There are two basic commands that we can use to find duplicates – unique() and duplicated().
duplicated(dups.df) # flag duplicated rows with TRUE
!duplicated(dups.df) # flag not-duplicated rows with TRUE
dups.df[duplicated(dups.df), ] # reveal the duplicated rows
dups.df[!duplicated(dups.df), ] # reveal the not-duplicated rows
dups.df[!duplicated(dups.df), ] -> nodups.df # save a data-frame without the duplicated rows
dups.df[duplicated(dups.df), ] -> onlydups.df # save a data-frame with ONLY the duplicated rows
You will invariably want to memorize (or look up) the no-duplicates command: dataframe[!duplicated(dataframe), ].
2.11 Rearranging a data-frame
We may, at times, want to move the variables around, perhaps because there are too many of them. We may also want to arrange the data-frame in ascending or descending order of some variable(s). These tasks are easily done.
mtcars[order(mtcars$mpg), ] -> df1 # arrange the data-frame in ascending order of mpg
mtcars[order(-mtcars$mpg), ] -> df2 # arrange the data-frame in descending order of mpg
mtcars[order(mtcars$am, mtcars$mpg), ] -> df3 # ascending order of automatic/manual, then ascending order of mpg
Maybe I’d like the columns ordered such that particular columns show up first, side by side. For instance, what if I want to order the columns such that the first column is am and then comes mpg, followed by qsec and then the rest of the columns?
names(mtcars) # the original structure of the mtcars data-frame
#> [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear" "carb"
mtcars[, c(9, 1, 7, 2:6, 8, 10:11)] -> df4 # specifying the position of the columns

names(df4)
#> [1] "am" "mpg" "qsec" "cyl" "disp" "hp" "drat" "wt" "vs" "gear" "carb"
If I only wanted certain columns I could just not list the other columns and I would have a trimmed data-frame.
mtcars[, c(9, 1, 7)] -> df5 # specifying the position of the columns
Notice the comma in mtcars[, … this tells R to keep all rows in mtcars. If, instead, I did this:
mtcars[1:4, c(9, 1, 7)] -> df6 # keep the first four rows and the specified columns
I would end up with only the first four rows of mtcars, not every row. So try to remember the order: dataframe[i, j]
where i
references the rows and j
references the columns.
2.12 Random Sampling
Maybe I just want to sample 5 observations from mtcars. How could I do this?
set.seed(123)
sample(1:nrow(mtcars), 5) -> index

index
#> [1] 31 15 19 14 3

mtcars[index, ] -> df7
Note the set.seed()
function – it is used to make our results reproducible so that here, for example, if someone else runs our code they should end up with the same 5 rows randomly sampled from mtcars. Without set.seed
they would end up with different data. We use it every time we are relying on randomization – drawing a random sample, generating a random distribution of numbers, etc.
2.13 Strings
Every now and then you will have variables with some characters that need to be replaced or deleted outright. R has a gsub()
function that comes in very handy for such tasks. I end up using it a lot when trying to pull out extraneous words such as “County” appended to every county’s name in a data-set or then, even worse, ” County, Ohio” as well.
library(here)

readr::read_csv(
  here(
    "data",
    "strings.csv"
  )
) -> c.data
The first thing I want to do is to remove the phrase “ County, Ohio” from GEO.display-label. This can be done swiftly with
gsub(
  " County, Ohio",
  "",
  c.data$`GEO.display-label`
) -> c.data$county
Note the sequence gsub("look for this pattern", "replace with this", mydata$myvariable). Note also that there is whitespace before “County” since that is how the data show up in the file.
If you look at GEO.id and GEO.id2, they represent Census fields whose values mean something. The Bureau says: “The ‘GEO.id’ field contains 14-digit codes that identify the summary level of data, the geographic component of the data and FIPS codes that uniquely identify the data. For example, the 14-digit ‘GEO.id’ for Harris County, TX is ‘0500000US48201’ where ‘050’ represents the summary level of the data, ‘0000’ represents the geographic component, ‘US’ represents the United States, ‘48’ represents the state of Texas and ‘201’ represents Harris County. The ‘GEO.id2’ field contains FIPS codes exactly as they appear in the ‘GEOID’ field featured in our TIGER Products. The ‘GEO.id2’ for Harris County, TX is ‘48201’ where ‘48’ represents the state of Texas and ‘201’ represents Harris County. Since the ‘GEO.id2’ field in AFF data matches the ‘GEOID’ field in our TIGER data, the two data-sets can easily be joined together to analyze data and create thematic maps.”
So we can split GEO.id2 into its state FIPS code and county FIPS code since we know where the split has to occur – the first two numbers are the state code and the last three numbers are the county code. I’ll create new variables first, names and geoid2.
as.character(c.data$`GEO.display-label`) -> c.data$names
as.character(c.data$GEO.id2) -> c.data$geoid2
substring(c.data$geoid2, 1, 2) -> c.data$statefips # state FIPS
substring(c.data$geoid2, 3, 5) -> c.data$countyfips # county FIPS
I could have also split GEO.id2 in one shot, as follows:

t(
  sapply(
    c.data$GEO.id2,
    function(x) substring(x, first = c(1, 3), last = c(2, 5))
  )
) -> splits

cbind.data.frame(c.data, splits) -> c.data2
colnames(c.data2)[10] <- "state_fips"
colnames(c.data2)[11] <- "county_fips"
Similar splits could be applied to GEO.id
to carve out independent state and county FIPS codes, as well as a five-digit state-county FIPS code.
2.14 With and Within
When creating new variables or modifying existing variables, two functions are helpful: with() and within(). For example, assume I have the following data-frame.
data.frame(
  x = seq(0, 5, by = 1),
  y = seq(10, 15, by = 1)
) -> dfw

dfw$x * dfw$y -> dfw$xy1 # basic creation of xy
with(dfw, x * y) -> dfw$xy2 # using with
within(dfw, xy3 <- x * y) -> dfw # using within
We can create a new variable, xy, in several ways, making the operation explicit versus using with() or within().
Pay attention to the last statement, within(dfw, xy3 <- x * y) -> dfw, because here the <- operator is being used inside the within(…) function to create xy3.
2.15 Rounding decimals
You rarely want to display more than two decimal places and more often than not just 1 (for non-technical audiences). R has a function round()
that will allow you to control the display. For example, assume we have the following data-frame.
data.frame(
  x = seq(0, 10, by = 0.12343)
) -> dfr
Say I want to round to two-digits or maybe one digit or better yet, no digits.
within(
  dfr, x.2 <- round(x, digits = 2)
) -> dfr

within(
  dfr, x.1 <- round(x, digits = 1)
) -> dfr

within(
  dfr, x.0 <- round(x, digits = 0)
) -> dfr
You could have collapsed all three into a single command as well:
within(dfr, {
  x.2b <- round(x, digits = 2);
  x.1b <- round(x, digits = 1);
  x.0b <- round(x, digits = 0)
}) -> dfr
2.16 Sums and Means
There are several functions we can lean on to calculate the totals or the means for columns in our data-frames. These are demonstrated for you below.
data.frame(
  x = c(0, 1, 2, 3, 4, NA),
  y = c(10, 11, 12, NA, 14, 15)
) -> dfs

within(
  dfs, {
    xtot = sum(x, na.rm = TRUE);
    ytot = sum(y, na.rm = TRUE)
  }
) -> dfs # sum x and sum y

rowSums(dfs[, c(1:2)], na.rm = TRUE) # sum columns 1 and 2 for each row
#> [1] 10 12 14 3 18 15
colSums(dfs[, c(1:2)], na.rm = TRUE) # sum columns 1 and 2, respectively
#> x y
#> 10 62
rowMeans(dfs[, c(1:2)], na.rm = TRUE) # calculate the mean for each row using columns 1 and 2
#> [1] 5 6 7 3 9 15
colMeans(dfs[, c(1:2)], na.rm = TRUE) # calculate the mean for columns 1 and 2
#> x y
#> 2.0 12.4
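The rowSums()/colMeans() family are fast special cases of the more general apply() and sapply(). If you ever need a row-wise or column-wise statistic that has no dedicated function (say, a median), the same pattern works; a sketch using the same dfs:

```r
dfs <- data.frame(
  x = c(0, 1, 2, 3, 4, NA),
  y = c(10, 11, 12, NA, 14, 15)
)

# apply() over rows (MARGIN = 1) or columns (MARGIN = 2)
apply(dfs, 1, sum, na.rm = TRUE)   # same result as rowSums(dfs, na.rm = TRUE)
apply(dfs, 2, mean, na.rm = TRUE)  # same result as colMeans(dfs, na.rm = TRUE)

# sapply() iterates over a data-frame's columns directly
sapply(dfs, sum, na.rm = TRUE)     # x = 10, y = 62
```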
2.17 Padding a vector
Every now and then you need to add leading zeroes, lagging zeroes, or some other element to a vector. For example, many (but not all) of the Ohio Department of Education’s data-sets will come with building and district identifiers as characters such as “000125”, “000138”, and so on, but other data will have the same variable showing up as the numbers 125, 138, and so on. Since district and school identifiers are all six-digit numbers, I end up padding the latter into six-digit characters of “000125”, “000138”, and so on. I do this with the stringr package.
library(stringr)
data.frame(
  dirn = c(
    125, 138, 141, 41256, 51513, 131256
  )
) -> odedf
str_pad(
  odedf$dirn,
  width = 6,
  side = c("left"),
  pad = "0"
) -> odedf$dirnl
Note that width = 6 refers to the total number of characters the resulting vector should have, side = c("left") means add whatever I ask you to add to the left of any existing value, and pad = "0" says I want you to add zeroes. If you had set side to "right" or "both" the padding would be different, as shown below, so be careful.
str_pad(
  odedf$dirn,
  width = 6,
  side = c("right"),
  pad = "0"
) -> odedf$dirnr
str_pad(
  odedf$dirn,
  width = 6,
  side = c("both"),
  pad = "0"
) -> odedf$dirnb
Note also that the resulting vector is a character, not a numeric! But that is not a problem since padding has, in my experience, only come in use whenever I needed to merge data-frames but the ID variables were padded in one or more data-frame(s) but not in the other(s).
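If you would rather not load stringr, base R can do the same left-padding with formatC() or sprintf(); a sketch (same illustrative identifiers as above):

```r
dirn <- c(125, 138, 141, 41256, 51513, 131256)

# formatC() with flag = "0" left-pads integers with zeroes
formatC(dirn, width = 6, format = "d", flag = "0")
#> [1] "000125" "000138" "000141" "041256" "051513" "131256"

# sprintf() does the same with a C-style format string
sprintf("%06d", dirn)
```

As with str_pad(), the result is a character vector, not a numeric.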
2.18 The famous ifelse and grepl
There is a very useful function in R, ifelse(condition, A, B)
, that evaluates a logical expression/condition and if this expression/condition is TRUE then it generates the value A and if the expression/condition is FALSE then it generates the value B. Here is a simple example, with the mtcars
data-set. My goal is to generate a new variable whose value will be “Automatic” if am == 1
and “Manual” if am == 0
.
data(mtcars)
ifelse(
  mtcars$am == 1,
  "Automatic",
  "Manual"
) -> mtcars$automatic
Voila! Why is this useful? Because otherwise I would need two lines of code to generate the same variable, as shown below:
mtcars$automatic2[mtcars$am == 1] <- "Automatic"
mtcars$automatic2[mtcars$am == 0] <- "Manual"
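A third option worth knowing: when a variable takes a small set of known values, factor() can do the recode in one step, and it yields a proper factor rather than a character vector. A sketch, keeping the same mapping the text uses (am == 1 labeled "Automatic"); the column name automatic3 is illustrative:

```r
data(mtcars)
factor(
  mtcars$am,
  levels = c(0, 1),
  labels = c("Manual", "Automatic")  # 0 -> "Manual", 1 -> "Automatic"
) -> mtcars$automatic3
```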
Another benefit of ifelse() is that you can link calls in a chain if you have very complicated operations to carry out. You will see one such tricky operation when we deal with some messy education data in the next chapter. For now, focus on the other function of value here, grepl().
This is a very useful function, especially when bundled with ifelse
because it will look for a specified string and allow you to then generate values depending upon whether the string is seen (TRUE) or not (FALSE). See this instance, for example, where we have information on some movies and would like to generate new variables based on what information we have.
load("data/movies/movies.RData")
ifelse(
  grepl("Comedy", movies$genres), "Yes", "No"
) -> movies$comedy
ifelse(
  grepl("(1995)", movies$title), "Yes", "No"
) -> movies$in1995
ifelse(
  grepl("Night", movies$title), "Yes", "No"
) -> movies$night
In short, the function looks for the specified string (“Comedy”) in genres, and if it is found adds a flag of “Yes”; if it is not found, it adds a flag of “No”.
But what if I wanted to run more complicated expressions, for example if the words “Night” or “Dark” or “Cool” are used in the title?
ifelse(
  grepl("Night", movies$title), "Yes",
  ifelse(grepl("Dark", movies$title), "Yes",
    ifelse(grepl("Cool", movies$title), "Yes", "No")
  )
) -> movies$tricky
Beautiful, just beautiful! The trick is to nest the ifelse(grepl()) statements so that the entire chain of possible “Yes” flags is attached before everything that remains is flagged with “No”.
There is a related function, grep(). Both functions find patterns, but they return different output types: grep() returns the numeric indices of the elements where the pattern was found, while grepl() returns a logical vector in which TRUE marks a pattern match and FALSE does not. Run the two commands that follow and focus on the output that scrolls in the console.
grep("Comedy", movies$genres)
grepl("Comedy", movies$genres)
You could use both for sub-setting data as well. See below for some examples, and note that the minus sign in -grep() drops the matched indices; it is the equivalent of saying “keep those where the pattern is not found”.
movies[grep("Comedy", movies$genres), ] -> mov1
movies[-grep("Comedy", movies$genres), ] -> mov2
Say I only want columns that have the letter “t” in their names. No problem:
movies[, grep("t", colnames(movies))] -> mov3   # keep if t is in column name
movies[, -grep("t", colnames(movies))] -> mov4  # keep if t is not in column name
Try using grepl() in lieu of grep() above. Also, before we move on to other functions, recognize that these functions look for the string exactly as specified, no matter where it shows up! So it does not care whether “t” is the first, last, or some other letter in the column name; it picks up all of them.