Chapter 4 Central Tendency and Dispersion
While graphical displays of our data are useful descriptors of how a variable is distributed, numerical summaries
of our data are far more concise and descriptive. But what would be good descriptors? A simple rule tends to work well in most situations. In particular, think about this: I tell you that the typical homeless person has been on the streets for 14 months. I know this because I have access to all homeless persons and have documented how long they have been on the streets. I then give you a random sample of homeless persons, blindfold you, and ask you to pick one homeless person from your sample. When you do so I ask you to guess how long this individual has been on the streets. What would be your best guess? Well, ideally it should be 14 months because that is the average duration in the population. Of course, you could say 14 months and that might be exactly how long this person has been on the street. I don;t tell you whether your guess was correct or not. Instead, I ask you to tell me how many months \(\pm\) from the 14 months that is true of the population might you be if your guess was wrong? You would be unable to answer this without a blind guess unless you knew how much the duration of homelessness varies in in the population. This is where knowing something about what is most likely, most typical, the average, and how much variation there is around this typical, average, most likely feature is essential to statistics. In this chapter we will focus on mastering two concepts – how to calculate the “average”, and how to calculate “variability” around this average. The former is central tendency
and the latter is dispersion
.
4.1 Central Tendency
Central tendency is a statistical measure that defines the center of a distribution and is most typical, most representative, of the scores that comprise the distribution of the variable of interest. There are three measures of central tendency – the mean, the median, and the mode. Each of these has strengths and weaknesses and is ideally suited to a specific situation with respect to a given variable. Let us see what these measures are and when they are most useful. Before we do so, however, a brief tutorial on some symbols and mathematical operators you will start seeing. Chief among these is the summation operator \(\sum\).
This Greek symbol calls us to add everything that follows. For example, if I did \(\sum(x_i)\) where \(i = 1, 2, 3, 4\) then we are being asked to sum each value of \(x\). The subscript \(i\) distinguishes between \(x_i = 1, x_i = 2, x_i = 3, \text{ and } x_i = 4\) so we know which x-value we are using in the calculation. In a nutshell, then, \(\sum(x_i) = \sum(x_1 + x_2 + x_3 + x_4) = \sum(1 + 2 + 3 + 4) = 10\). You will also see the same thing expressed alternatively thus:
\[\sum^{n}_{i=1}{x_i}, i=1,2,3,\ldots,n\]
which basically means the same thing as above: Add each observation \(i\) of the total \(n\) observations of \(x\). In other words,
\[\sum^{n}_{i=1}{x_i} = x_1 + x_2 + x_3 + \ldots + x_n\]
Note also that
\[\sum^n_{i=1}(x_i - 1) = \sum((x_1 - 1) + (x_2 - 1) + (x_3 - 1) + (x_4 - 1)) = \sum(0 + 1 + 2 + 3) = 6\]
\[\sum^n_{i=1}(x_i^2) = \sum((x_1^2 + x_2^2 + x_3^2 + x_4^2)) = \sum((1 + 4 + 9 + 16)) = 30\]
\[(\sum^n_{i=1}x_i)^2 = (\sum(x_1 + x_2 + x_3 + x_4))^2 = (\sum(1 + 2 + 3 + 4))^2 = (10)^2 = 100\]
\[\sum^n_{i-1}(x_i + y_i) = \sum^n_{i-1}x_i + \sum^n_{i-1}y_i\]
\[\sum^n_{i=1}kx_i = k\sum^n_{i=1}x_i\]
\[\sum^n_{i=1}k = kn\]
We will start seeing the summation sign with increasing frequency as we progress through the course. You may be wondering: Do I need to know how to do this? The short answer is no, you don’t, because it is not completely necessary to get a basic grasp of the concepts we must wrestle with. However, if you are curious, math-oriented, or just want a deeper understanding because you think you may want to take a few more statistics courses, then yes, try to understand how formulas work. If you crack this nut you will be surprised at the whole new world that opens up before you.
4.1.1 Mean
The arithmetic
mean (also known as the arithmetic average) is computed by adding up the scores in the distribution and dividing this sum by the sample size. In plain words, and using the summation operator, the arithmetic mean of \(x_i\) is calculated as
\[\dfrac{\sum^n_{i=1}x_i}{n} = \dfrac{x_1 + x_2 + x_3 + \ldots + x_n}{n}\]
While the underlying calculations are the same, we do draw a distinction between the population mean
and the sample mean
, using different symbols when calculating each. The formulas are given below in (4.1) and (4.2), respectively:
\[\begin{equation} \tag{4.1} \text{Population Mean } = \mu = \dfrac{\sum^N_{i=1}x_i}{N} \end{equation}\]
\[\begin{equation} \tag{4.2} \text{Sample Mean } = \bar{x} = \dfrac{\sum^n_{i=1}x_i}{n} \end{equation}\]
Notice the important difference – the population mean is symbolized by \(\mu\) (pronounced mu or myoo), and the total number of observations (aka the population size) is symbolized by uppercase N
. In contrast, the sample mean is symbolized by \(\bar{x}\) (pronounced \(x-bar\)) and the total number of observations (aka the sample size) is symbolized by lowercase n
.
Here is a trivial example, that shows you the starting bi-weekly salary of 12, randomly selected graduates of a university’s public affairs school.
salary.e = data.frame(salary = c(2850, 2950, 3050, 2880, 2755,
2710, 2890, 3130, 2940, 3325, 2880, 2920))
library(dplyr)
salary.e2 = arrange(salary.e, salary)
salary.o = data.frame(salary = c(2850, 2950, 3050, 2880, 2755,
2710, 2890, 3130, 2940, 3325, 2920))
salary.o2 = arrange(salary.o, salary)
Bi-weekly Salary | |
---|---|
1 | 2710 |
2 | 2755 |
3 | 2850 |
4 | 2880 |
5 | 2880 |
6 | 2890 |
7 | 2920 |
8 | 2940 |
9 | 2950 |
10 | 3050 |
11 | 3130 |
12 | 3325 |
Since this is a sample, the sample mean could be calculated as:
\[\bar{x} = \dfrac{\sum^n_{i=1}x_i}{n}\] \[= \dfrac{x_{1} + x_{2} + \cdots + x_{12}}{n}\] \[= \dfrac{2,850 + 2,950 + \cdots + 2,920}{12}\] \[= \dfrac{35,280}{12}\] \[\therefore \bar{x} = 2,940\]
The interpretation of this sample mean is rather simple. Think of it as follows: If I ask you to make a random pick of a graduate from this school and ask you what his/her bi-weekly salary is likely to be, your best guess should be $2,940. Why? Because your sample told you that this is the average salary, the typical salary. Of course, this does not mean everybody will have this salary or that the next student you randomly pick has this salary. The table and common sense should tell you that salaries vary, the person you select may be making more or less than your sample mean. However, guessing any other salary would be a wild guess. Instead, knowing the sample mean gives you the basis to make a more informed guess that is more likely to be correct than any other number you pick out of thin air. Even if you are wrong, you will not be very far from the truth. This accuracy is driven by statistical theory that we will cover shortly. Before we do that though, we need to understand two other measures of central tendency.
mean.df = data.frame(id = c(1:6), x = c(6, 3, 5, 3, 4, 5))
mean.df$x2 = with(mean.df, x - 2)
mean.df$x3 = with(mean.df, 2 * x)
mean.df$x4 = with(mean.df, x/2)
mean.df[7, (2:5)] <- colSums(mean.df[, 2:5], na.rm = TRUE)
mean.df[7, 1] = "Total"
Observation | x | (x - 2) | (2 * x) | (x / 2) |
---|---|---|---|---|
1 | 6 | 4 | 12 | 3.0 |
2 | 3 | 1 | 6 | 1.5 |
3 | 5 | 3 | 10 | 2.5 |
4 | 3 | 1 | 6 | 1.5 |
5 | 4 | 2 | 8 | 2.0 |
6 | 5 | 3 | 10 | 2.5 |
Total | 26 | 14 | 52 | 13.0 |
4.1.2 Median
The median is the middle-value that occurs when the data are arranged in an ascending or descending order, and is commonly denoted by the symbol \(Md\). Since central tendency is all about finding the “center” of the values of a variable \(x\), the median tends to be intuitive for most folks and calculating it is equally simple.
- Arrange the values of \(x\) either in ascending order or in descending order.
- If the number of data points in the population or sample is an
odd
number, the median observation can be identified as \(Md = \dfrac{N + 1}{2}\) for a population and \(Md = \dfrac{n + 1}{2}\) for a sample. See Table 4.2 for an example. - If the number of data points in the population or sample is an
even
number, the median observation can be identified as the average of the middle two values of \(x\). See Table 4.3 for an example.
Bi-weekly Salary | |
---|---|
1 | 2710 |
2 | 2755 |
3 | 2850 |
4 | 2880 |
5 | 2890 |
6 | 2920 |
7 | 2940 |
8 | 2950 |
9 | 3050 |
10 | 3130 |
11 | 3325 |
The median observation (sometimes also called the median position
) is the \(6^{th}\) observation and the median salary is thus $2,920.
Bi-weekly Salary | |
---|---|
1 | 2710 |
2 | 2755 |
3 | 2850 |
4 | 2880 |
5 | 2880 |
6 | 2890 |
7 | 2920 |
8 | 2940 |
9 | 2950 |
10 | 3050 |
11 | 3130 |
12 | 3325 |
Now the median observation lies between the \(6^{th}\) and \(7^{th}\) observations and the hence median salary is \(Md = \dfrac{2,890 + 2,920}{2} = \dfrac{5,810}{2} = \$2,905\). With even-numbered samples/populations you could end up with a median value that is actually not seen in the data-set since we average the middle values. Don’t be surprised if this happens.
4.1.3 Mode
The mode or modal value
refers to the value of the variable \(x\) that occurs most frequently in the data-set. The mode makes little sense for numerical variables (interval or ratio) and instead works best with categorical variables (nominal or ordinal). The table below shows you an example where 50 randomly sampled residents of Manhattan (New York City, NY) were asked about how they commute to work on a typical working day of the week.
trans = data.frame(`Transportation Choice` = c("Walking", "Bus/Train",
"Taxi", "Bicycle", "Drove alone/Car-pooled", "Total"), Frequency = c(19,
8, 5, 13, 5, 50))
Transportation.Choice | Frequency |
---|---|
Walking | 19 |
Bus/Train | 8 |
Taxi | 5 |
Bicycle | 13 |
Drove alone/Car-pooled | 5 |
Total | 50 |
Most of the respondents \((19)\) out of \((50)\) said they walk to work and hence the modal transportation choice is walking to work. Just as in the case of the arithmetic mean, the implication should be clear: If you picked a randomly chosen resident of Manhattan and I ask you how do you think they commute to work your best guess ought to be that they walk. You could be wrong, but based on the sample of 50 residents you have seen, walking would be the most educated guess you could make.
4.1.3.1 Calculating Central Tendency in R
In R you will usually have a dataframe to work with when calculating summary statistics. For example, say we have data on some attributes of a sample of 200 students’, including their test scores. Let me read-in the data and then label the variables that need labeling. The code shows you how to take what is read in as an integer (female, race, ses, schtyp, prog)
and if it is a categorical variable, to convert it into a factor
where the values of the categories coded 0, 1, etc. are labeled.
hsb2 = read.csv("./data/hsb2-2.csv")
hsb2$female = factor(hsb2$female, labels = c("Male", "Female"))
hsb2$race = factor(hsb2$race, labels = c("Hispanic", "Asian",
"African American", "White"))
hsb2$ses = factor(hsb2$ses, labels = c("Low", "Middle", "High"))
hsb2$schtyp = factor(hsb2$schtyp, labels = c("Public", "Private"))
hsb2$prog = factor(hsb2$prog, labels = c("General", "Academic",
"Vocational"))
If I run a summary()
command on a dataframe I’ll get some basic estimates, including the minimum and maximum values along with the mean and the median. If it happens to be a categorical variable I get the frequencies instead.
summary(hsb2)
## id female race ses
## Min. : 1.00 Male : 91 Hispanic : 24 Low :47
## 1st Qu.: 50.75 Female:109 Asian : 11 Middle:95
## Median :100.50 African American: 20 High :58
## Mean :100.50 White :145
## 3rd Qu.:150.25
## Max. :200.00
## schtyp prog read write
## Public :168 General : 45 Min. :28.00 Min. :31.00
## Private: 32 Academic :105 1st Qu.:44.00 1st Qu.:45.75
## Vocational: 50 Median :50.00 Median :54.00
## Mean :52.23 Mean :52.77
## 3rd Qu.:60.00 3rd Qu.:60.00
## Max. :76.00 Max. :67.00
## math science socst
## Min. :33.00 Min. :26.00 Min. :26.00
## 1st Qu.:45.00 1st Qu.:44.00 1st Qu.:46.00
## Median :52.00 Median :53.00 Median :52.00
## Mean :52.65 Mean :51.85 Mean :52.41
## 3rd Qu.:59.00 3rd Qu.:58.00 3rd Qu.:61.00
## Max. :75.00 Max. :74.00 Max. :71.00
I could have asked for the mean and the median more pointedly, by running, for example,
mean(hsb2$read)
## [1] 52.23
mean(hsb2$socst)
## [1] 52.405
median(hsb2$read)
## [1] 50
median(hsb2$socst)
## [1] 52
Now, here we have no missing data in the dataframe, but what if we did? Say, for example, the dataframe includes three variables, two numeric and the fourth a measure of the respondent’s sex:
dff = data.frame(sex = c("Male", "Male", "Female", "Female",
"Male", "Female"), age = c(18, 18, NA, 21, 26, 18), gpa = c(NA,
2.9, 4, 4, 4, 3.4))
summary(dff)
## sex age gpa
## Female:3 Min. :18.0 Min. :2.90
## Male :3 1st Qu.:18.0 1st Qu.:3.40
## Median :18.0 Median :4.00
## Mean :20.2 Mean :3.66
## 3rd Qu.:21.0 3rd Qu.:4.00
## Max. :26.0 Max. :4.00
## NA's :1 NA's :1
sex | age | gpa |
---|---|---|
Male | 18 | NA |
Male | 18 | 2.9 |
Female | NA | 4.0 |
Female | 21 | 4.0 |
Male | 26 | 4.0 |
Female | 18 | 3.4 |
Notice the NA
values in the dataframe. If I now try to calculate the mean of age with the same command used before, I get a strange result.
mean(dff$age)
## [1] NA
median(dff$gpa)
## [1] NA
This is because you have a missing value and R expects you to be very explicit about what you want to do when missing values are present. The standard approach is to ignore them by removing them before proceeding with the calculations, and this is done by adding the na.rm = TRUE
switch to the command.
mean(dff$age, na.rm = TRUE)
## [1] 20.2
median(dff$gpa, na.rm = TRUE)
## [1] 4
What if I needed to calculate the mean and median by sex? There would be several ways to do this but the easiest might be to use the pscyh
package’s describeBy()
command:
library(psych)
describeBy(dff$age, df$sex)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 5 20.2 3.49 18 20.2 0 18 26 8 0.77 -1.38 1.56
describeBy(dff$gpa, df$sex)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 5 3.66 0.5 4 3.66 0 2.9 4 1.1 -0.55 -1.77 0.22
describeBy(hsb2$read, hsb2$ses)
##
## Descriptive statistics by group
## group: Low
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 47 48.28 9.34 47 48.18 7.41 28 68 40 0.23 -0.66 1.36
## --------------------------------------------------------
## group: Middle
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 95 51.58 9.43 50 51.32 8.9 31 73 42 0.24 -0.42 0.97
## --------------------------------------------------------
## group: High
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 58 56.5 10.86 57.5 56.6 11.12 34 76 42 -0.13 -1.03
## se
## X1 1.43
4.1.4 Some Features of Measures of Central Tendency
If we have categorical data then now we know that the mode would be the ideal measure to describe the average value. However, what if we have numerical data? Should we choose the mean or the median? We choose between the mean and median depending upon whether we have extremely unusual observations or not.
For example, imagine I give a 10-point pop-quiz in a seminar class of five students and the scores are \(x = 1, 3, 5, 7, 9\) leading to an average score of \(5\). We see that \(5\) is also the median. Intuitively, \(5\) makes perfect sense as the average score given the data. However, what if the scores were \(x = 1, 1, 1, 2, 10\), resulting in a mean of 3. The median score is clearly \(1\). In this latter situation, the mean has been inflated
by that one student who got a perfect score but most people did very poorly. Intuitively then, saying the average score was a 3 would be misleading given that four out of five students scored below the average! So what would be a good measure of the average score here? The answer is, the median. If you did this and said the average student scored \(1\) you would be describing the scores more accurately than if you opted for the mean of \(3\).
This reflects an important principle we should follow when describing the data in terms of means and medians. With unusually large or small scores – compared to the rest of the data – the median better represents the average than does the mean. Note that
- the mean will be inflated (i.e., pulled upwards) if the unusual data point occurs on the high side of the variable’s values, and
- the mean will be deflated (i.e., pulled downwards) if the unusual data point occurs on the low side of the variable’s values.
The median is also useful when you have open-ended data or incomplete data. For instance, say you give kindergarten kids some task to complete and are timing them. Some of them do not finish the task so you cannot record a time value for these kids. Maybe the times (in minutes) look as follows: \(x = 7, 8, 6, 9, 1, 3, 12, 18, NA, NA\). If you had to calculate the average, what could you do? You may be tempted to discard the data for all kids who did not finish the task and calculate the mean time based on all other kids’ data this would lead to a mean time of \(8\) minutes. But this would lead to a flawed conclusion. After all, it would be important to recognize the truth that some kids couldn’t finish the task. How then could you calculate the average? You could arrange the data in increasing order of time taken to complete the task, putting all the kids who did not finish at the very high end of the times, and then calculate the median time! If you did this you would have: \(x = 1, 3, 6, 7, 8, 9, 12, 18, NA, NA\) and the median time would be \(8.5\) minutes, a more accurate portrayal of average time on task.
The mean has two things going for it though; it uses all data points (assuming no incomplete or missing data) and is used in most statistical calculations. This is why we often fall back on the mean as the best measure of the average value of a variable.5
4.2 Dispersion (aka Variability)
Knowing something about the average is useful but by itself the average is of limited value. Rather, what you want to also know is how much do the data values vary? Why does this matter? Think about this in the context of your exam scores. Let us say that when I give you your exams back I also tell everyone that the mean exam score was \(90\) (out of a \(100\)) on Exam I. Your score was a \(96\) and you are relieved because you did better than the average student in the class. But you shouldn’t rejoice just yet. Instead you should ask: How much did the exam scores vary for the class? I tell you average variability was \(2\) points, which means most scores fell in the \(88 \text{ and } 92\) range. Well then, you did much better than most people, maybe you were the top performer. However, if I tell you the average variability was \(6\) then you know most scores fell between \(84 \text{ and } 96\) and now you don’t feel so good any more.
Knowing variability tells you something useful about the variable you are studying. If the variable shows little variability then any conclusions you make will have good accuracy when you extrapolate from your sample to the population (unless you made a mess of gathering your sample data to start with). However, if you see a lot of variability in your data then your conclusions are likely to have a lot of uncertainty when applied to the population from which you drew your sample. I would thus go as far as to say that variability is the linchpin for all data analysis, and the reason you will see a specific measure of variability figuring prominently in most statistical calculations. Let us see what that measure is by working our way to it, starting with a crude measure of variability.
4.2.1 Range
One of the crudest measures of variability is the range
, computed as \(Range = x_{max} - x_{min}\) where \(x_{max}\) is the maximum value of variable \(x\) and \(x_{min}\) is the minimum value of variable \(x\). Intuitively, if the smallest and largest values are very close together, there cannot be much variability in \(x\) but if there is a huge gap between the lowest and highest values then there must be quite a bit of variability. Let us take two examples to demonstrate each scenario, respectively.
- \(x = 1, 2, 3, 4, 10\) and so range is \(10-1 = 9 \ldots\) high variability
- \(y = 1, 2, 3, 4, 5\) and so range is \(5-1 = 4 \ldots\) low variability
Simple enough, and yet fallible. See the following two variables, \(x\) and \(y\).
- \(x = 1, 3, 5, 7, 10\) and so range is \(10-1 = 9\)
- \(y = 1, 1, 1, 1, 10\) and so range is \(10-1 = 9\)
The range is identical for \(x\) and \(y\) but quite obviously the scores hardly vary in \(y\)! This is an example of the limitation of the range
– it can mislead because it only relies on two data points, even though your sample size may be 10 or 100 or even 10,000. In fact, the range is being driven by the smallest and largest values and that would be as much an issue here as in the case of the mean. Consequently, the range is of very limited use in practice.
4.2.2 Quartiles and Interquartile Range
If focusing only on the ends of the data (i.e., \(x_{min} \text{ and } x_{max}\)) is not a good way to measure variability, then one alternative may be to ignore the low and high values and focus just on the middle of the distribution, say the middle \(50\) percent of the distribution. This approach, called the interquartile range (IQR)
is a standard way of measuring variability and used quite often. To calculate the IQR, though, we need first to identify the values that would give us the middle \(50\) percent of our data.
4.2.2.1 Calculating the Quartiles
Just as the median halved the data, the quartiles
quarter the data into four sections of \(25\) percent. We refer to these technically as
- the
first quartile
(\(Q_1\)): \(25\) percent of the data fall below \(Q_1\), - the
second quartile
(\(Q_2\)): also the median, \(50\) percent of the data fall below \(Q_2\), and - the
third quartile
(\(Q_3\)): \(75\) percent of the data fall below \(Q_3\)
Note that \(Q_1\) is also called the \(25^{th}\) percentile, \(Q_2\) is called the \(50^{th}\) percentile, and \(Q_3\) is called the \(75^{th}\) percentile. Note too that we will be using software to perform the calculations, quartiles can be calculated by hand as shown below:
\[\begin{equation} \tag{4.3} i = \left(\dfrac{p}{100}\right) \times n \end{equation}\]
where \(i\) is the observation number and \(p\) is the percentile we want. If, for example, I want the \(25^{th}\) percentile, then this is given by \(i = \left(\dfrac{25}{100}\right) \times n\). Similarly, the \(50^{th}\) percentile (our median) is calculated as \(i = \left(\dfrac{50}{100}\right) \times n\) and the \(75^{th}\) percentile as \(i = \left(\dfrac{75}{100}\right) \times n\).6
Let us calculate \(Q_1\) and \(Q_3\) for our salary data when we had \(n=11\), an odd numbered sample size. In this case, \(Q_1 = i = \left(\frac{p}{100}\right) \times n = \left(\frac{25}{100}\right) \times 11 \approx 3\) or the \(3^{rd}\) salary. Likewise, \(Q_3 = i = \left(\frac{p}{100}\right) \times n = \left(\frac{75}{100}\right) \times 11 \approx 9\) or the \(9^{th}\) salary. The actual salary values of these two observations turn out to be \(Q_1 =\) $2,850 and \(Q_3\) = $3,050; \(25\) percent of our sample had salaries less than $2,850 and \(75\) percent of our sample had salaries below $3,050.
What if we have an even-numbered data-set? Recall that the median was between the \(6^{th}\) and the \(7^{th}\) values, which effectively gave us six salaries below the median and six salaries above the median. Consequently, the first quartile would be \(Q_1 = \dfrac{3^{rd} \text{ salary } + 4^{th}\text{ salary }}{2} = \dfrac{2850 + 2880}{2} = 2865\). Similarly, the third quartile would be \(Q_3 = \dfrac{9^{th} \text{ salary } + 10^{th}\text{ salary }}{2} = \dfrac{2950 + 3050}{2} = 3000\). If we had to depict the fences marking the quartiles and the Median, it would look as follows:
\[2710, 2755, 2850, \vert|Q1\vert|, 2880, 2880, 2890, \vert|Median\vert|, 2920, 2940, 2950, \vert|Q3\vert|, 3050, 3130, 3325\]
Again, do not be thrown off by the fact that the quartiles seem to be illusory (i.e., values that are not seen in the sample). Also note that different statistical software will calculate the quartiles marginally differently depending upon how the software has been programmed.
4.2.2.2 Calculating the Interquartile Range
The interquartile range (IQR) is given by \(Q_3 - Q_1 \cdots\) the spread in the middle \(50\) percent of our data values. In the salary example we just worked with, \(IQR = Q_3 - Q_1 = 3050 - 2850 = 200\). So the spread in the middle \(50\) percent of our data values is $200. For the salary data with \(n = 12\), the \(IQR = Q_3 - Q_1 = 3000 - 2865 = 135\).
The IQR is used in visualizing our data with a more powerful graphic than what we have seen in Chapter 3. We will cover this graphic in Chapter 5, once we have covered some more necessary ground. Before we move on, note that whereas the range was weak in its reliance upon just two data points, the IQR is only marginally better because it relies upon just the middle \(50\) percent of our data to give us a sense of variability. After all, the IQR is discarding half of our data! This is a waste and will lead to an incomplete sense of variability in our data. At this pint, unless you have dozed off, you should be expecting me to say that a better measure of variability would be one that takes all data
into account. You bet. That is precisely what we will do next.
4.2.3 Variance and Standard Deviation
As I had mentioned at the very outset of this chapter, variability is not calculated in a vacuum but with reference to the average value of the variable. Say we are using the mean as the average. How could we calculate average variability, average distance of a data value, around the mean? One way to do this might be to calculate how far each data value is from the mean. Then take these distances, add them up, and divide the resulting total by the number of data points we have (whether \(N \text{ or } n\)). Let us try this simple approach with the following sample \(x = 1, 3, 4, 6, 22, 24\). Note that the mean is \(10\).
df = data.frame(x = c(1, 3, 4, 6, 22, 24))
df$mean = 10
df$diff = with(df, x - mean)
df$diffsq = with(df, (diff)^2)
df[7, (1:4)] <- colSums(df[, 1:4], na.rm = TRUE)
df[7, 1] = "Total"
x | Mean of x | (Mean of x - x) |
---|---|---|
1 | 10 | -9 |
3 | 10 | -7 |
4 | 10 | -6 |
6 | 10 | -4 |
22 | 10 | 12 |
24 | 10 | 14 |
Total | 60 | 0 |
Ah! The total of the distances is \(0\), which makes it impossible to calculate the average distance from the mean. Can we work around it? Sure, we could if we squared each distance so that all negative values were transformed into positive values. The result is shown below.
df = data.frame(x = c(1, 3, 4, 6, 22, 24))
df$mean = 10
df$diff = with(df, x - mean)
df$diffsq = with(df, (diff)^2)
df[7, (3:4)] <- colSums(df[, 3:4], na.rm = TRUE)
df[7, 1] = "Total"
x | Mean of x | (Mean of x - x) | (Mean of x - x) Squared |
---|---|---|---|
1 | 10 | -9 | 81 |
3 | 10 | -7 | 49 |
4 | 10 | -6 | 36 |
6 | 10 | -4 | 16 |
22 | 10 | 12 | 144 |
24 | 10 | 14 | 196 |
Total | NA | 0 | 522 |
Now the average distance would be \(\dfrac{522}{6} = 87\), but interpreting this would be difficult since this is average distance in squared units
. This isn’t an intractable problem though since just as we squared the distances, we could also take the square root
to get average distance in meaningful units of the original metric. That is, we could calculate \(\sqrt{\dfrac{522}{6}} = \sqrt{87} = 9.32379\)! That is, average variability is 9.32379. Congratulations ladies and gentlemen, you have just calculated the variance
and the standard deviation
. Formally, for a variable \(x\),
\[\begin{equation} \tag{4.4} \text{Population Variance } = \sigma^2 = \dfrac{\sum(x_i - \mu)^2}{N} \end{equation}\]
\[\begin{equation} \tag{4.5} \text{Population Standard Deviation } = \sigma = \sqrt{\dfrac{\sum(x_i - \mu)^2}{N}} \end{equation}\]
While the above formulas work for population data, we have to make a slight modification to these formulas when we are working with sample data. In brief, the formulas become:
\[\begin{equation} \tag{4.6} \text{Sample Variance } = s^2 = \dfrac{\sum(x_i - \bar{x})^2}{n-1} \end{equation}\]
\[\begin{equation} \tag{4.7} \text{Sample Standard Deviation } = s = \sqrt{\dfrac{\sum(x_i - \bar{x})^2}{n-1}} \end{equation}\]
The key part of the modified formula is not the symbols \(s^2 \text{ and } s\) replacing \(\sigma^2 \text{ and } \sigma\); it is the fact that we divide by a smaller number \(\cdots (n-1) \text{ instead of } (N)\). Why this change?
4.2.4 Why \(n-1\)?
Consider these data: \(x = 1, 5, 4, 10\). What is the mean? \(\bar{x} = 5\). What is the sample size? \(n = 4\). Fair enough. Now let me give you the following data: \(x = 1, 5, 4, ?\) along with the information that although the fourth data point is not provided to you, \(\bar{x} = 5\). Can you tell me what the missing data value must be? Yes you can. How?
\[\bar{x} = \dfrac{\sum^n_{i=1}}{n} = \dfrac{1 + 5 + 4 + ?}{4} = 5\] \[1 + 5 + 4 + ? = 5 \times 4 \] \[10 + ? = 20\] \[? = 20 - 10 = 10\]
So the missing number must be \(10\). You could have figured out whatever was the missing number in a similar fashion even if I had provided you with \(x = 1, ?, 4, 10\) or \(x = ?, 5, 4, 10\) or \(x = 1, 5, ?, 10\). Why? Because the mean is calculated by summing all values of \(x\) and then dividing by the sample size. When we do this, we know that one number in our data-set will always be locked down to a specific value, it cannot just be any random value. Since we calculate the variance by using the mean, we are forcibly limiting our estimate of variability. Consequently, we make a specific adjustment, that of dividing the squared distances by a smaller number in order to make a correction and obtain a slightly larger estimate of variability.
Another way of thinking about this is as follows. I give you three values in a sample: \(x = 4, 6, 8\). I also tell you to calculate the variance in two situations: (a) You are told that the population mean is \(\mu = 3\) and (b) You are not given this information and so must calculate the sample mean in order to calculate variance without adjusting the denominator
. The calculations are shown below.
my.df = data.frame(x = c(8, 4, 6))
my.df$x2 = with(my.df, x - 3)
my.df$x3 = with(my.df, (x2)^2)
my.df$x4 = with(my.df, x - 6)
my.df$x5 = with(my.df, (x4)^2)
my.df[4, c(3, 5)] <- colSums(my.df[, c(3, 5)], na.rm = TRUE)
my.df[4, 1] = "Total"
x | (x - Popn Mean) | (x - Popn Mean) Squared | (x - Sample Mean) | (x - Sample Mean) Squared |
---|---|---|---|---|
8 | 5 | 25 | 2 | 4 |
4 | 1 | 1 | -2 | 4 |
6 | 3 | 9 | 0 | 0 |
Total | NA | 35 | NA | 8 |
If I calculate the variance when I am given \(\mu=3\) I get \(\dfrac{35}{3} = 11.66667\) but when I calculate variance by using the sample mean \(\bar{x} = 6\) without adjusting the denominator I get an estimated variance of \(\dfrac{8}{3} = 2.666667\), quite a bit smaller than what I got earlier. However, if I make the adjustment we are asked to make: \(\dfrac{8}{3-1} = \dfrac{8}{2}=4\) I get a much bigger estimate of variance. Thus, it matters whether we know the population mean or we do not. If we know the population mean then our estimated variance is closer to the “true variance” but if we don’t know the population mean and must use the sample for calculating both the mean and the variance, then we need to adjust the formula to come closer to the “true variance”. Note that the adjustment of course doesn’t get us completely to \(11.66\), we are still a big distance away but nevertheless \(4\) is closer than \(2.66\).
If none of the two explanations help, think of the adjustment as necessary because no sample will ever fully capture all the variability in the population. Hence we need to make an adjustment to avoid ending up with an estimate of the variance that is guaranteed to be smaller than the true variance in the population.
Variance and Standard Deviation for the Salary Data
Before we move on, here is the calculated variance and standard deviation for our salary data.
my.df = salary.e2
my.df$x2 = with(my.df, salary - mean(salary, na.rm = TRUE))
my.df$x3 = with(my.df, (x2)^2)
my.df[13, (2:3)] <- colSums(my.df[, 2:3], na.rm = TRUE)
my.df[13, 1] = "Total"
Salary | (Salary - Mean) | (Salary - Mean) Squared |
---|---|---|
2710 | -230 | 52900 |
2755 | -185 | 34225 |
2850 | -90 | 8100 |
2880 | -60 | 3600 |
2880 | -60 | 3600 |
2890 | -50 | 2500 |
2920 | -20 | 400 |
2940 | 0 | 0 |
2950 | 10 | 100 |
3050 | 110 | 12100 |
3130 | 190 | 36100 |
3325 | 385 | 148225 |
Total | 0 | 301850 |
\[\text{Sample mean: }\bar{x}=2940 \ldots\] \[\Sigma(x_{i} - \bar{x}) = 0\] \[\Sigma(x_{i} - \bar{x})^{2} = 301850\] \[\text{Sample Variance: }s^{2} = \dfrac{301850}{(12 - 1)} = \$27440.91 \ldots\] \[\text{Sample Standard Deviation: }s = \sqrt{27440.91} = \$165.63 \ldots\]
4.2.4.1 Calculating Measures of Dispersion in R
Just like the mean and the median, calculating the standard deviation, variance, IQR, etc. is pretty straightforward in R. I’ll demonstrate the commands with the hsb2
and the small dataframe we created called df
.
The summary()
command gives us the first and third quartiles for a numerical variable.
summary(hsb2$read)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 28.00 44.00 50.00 52.23 60.00 76.00
summary(hsb2$socst)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 26.00 46.00 52.00 52.41 61.00 71.00
summary(dff$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 18.0 18.0 18.0 20.2 21.0 26.0 1
summary(dff$gpa)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2.90 3.40 4.00 3.66 4.00 4.00 1
You could ask for the quartiles more pointedly as well:
quantile(hsb2$read, probs = c(0.25, 0.75))
## 25% 75%
## 44 60
quantile(hsb2$socst, probs = c(0.25, 0.75))
## 25% 75%
## 46 61
quantile(dff$age, probs = c(0.25, 0.75), na.rm = TRUE)
## 25% 75%
## 18 21
quantile(dff$gpa, probs = c(0.25, 0.75), na.rm = TRUE)
## 25% 75%
## 3.4 4.0
The probs = c(0.25, 0.75)
switch says you want specific quantiles, the \(25^{th}\) and the \(75^{th}\) to be precise, which are your first and third quartiles respectively. If you wanted the median, you could have added 0.50
to the mix, as shown below:
quantile(hsb2$read, probs = c(0.25, 0.5, 0.75))
## 25% 50% 75%
## 44 50 60
quantile(hsb2$socst, probs = c(0.25, 0.5, 0.75))
## 25% 50% 75%
## 46 52 61
quantile(dff$age, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)
## 25% 50% 75%
## 18 18 21
quantile(dff$gpa, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)
## 25% 50% 75%
## 3.4 4.0 4.0
Calculating quartiles for groups is perhaps best achieved via the following command:
library(dplyr)
hsb2 %>% dplyr::group_by(female) %>% summarize(Q1 = quantile(read,
probs = c(0.25), na.rm = TRUE), Q3 = quantile(read, probs = c(0.75),
na.rm = TRUE))
Here the group_by(female)
command and is asking R to remember that groups are defined by unique values of female
. The summarize()
command asks R to then proceed with calculating Q1 and Q3 for each group.
The interquartile range (IQR) is easier to calculate:
IQR(hsb2$read)
## [1] 16
IQR(hsb2$socst)
## [1] 15
IQR(dff$age, na.rm = TRUE)
## [1] 3
IQR(dff$gpa, na.rm = TRUE)
## [1] 0.6
hsb2 %>% group_by(female) %>% summarize(IQR = IQR(read, na.rm = TRUE))
hsb2 %>% group_by(ses) %>% summarize(IQR = IQR(math, na.rm = TRUE))
Before we move on, however, note that in R there are several ways of calculating the quartiles and hence the IQR. In particular, R has nine ways of calculating quantiles. This should not be anything more than a curiosity to you at this stage but remember this well because should you compare the quartiles from R versus those of another software package, they may not coincide unless the other software package is using the exact same quantile algorithm.
Note also that the range()
command gives you the max and min values, not the \(range = max - min\) that you might be expecting. Consequently, you may just as well calculate it explicitly as shown below.
range(hsb2$read)
## [1] 28 76
range(dff$age, na.rm = TRUE)
## [1] 18 26
hsb2 %>% dplyr::summarize(range = max(read, na.rm = TRUE) - min(read,
na.rm = TRUE))
## range
## 1 48
hsb2 %>% group_by(race) %>% dplyr::summarize(range = max(read,
na.rm = TRUE) - min(read, na.rm = TRUE))
## # A tibble: 4 x 2
## race range
## <fct> <int>
## 1 Hispanic 45
## 2 Asian 26
## 3 African American 32
## 4 White 45
The variance and standard deviation can be calculated via the var()
and sd()
commands respectively, as shown below:
var(hsb2$read)
## [1] 105.1227
sd(hsb2$read)
## [1] 10.25294
var(dff$age, na.rm = TRUE)
## [1] 12.2
sd(dff$age, na.rm = TRUE)
## [1] 3.49285
Variance and standard deviations for groups can be calculated via the describeBy()
command shown earlier, or then via the following:
hsb2 %>% dplyr::group_by(female) %>% dplyr::summarize(Variance = var(math,
na.rm = TRUE), Std.Dev = sd(math, na.rm = TRUE))
## # A tibble: 2 x 3
## female Variance Std.Dev
## <fct> <dbl> <dbl>
## 1 Male 93.4 9.66
## 2 Female 83.7 9.15
dff %>% dplyr::group_by(sex) %>% dplyr::summarize(Variance = var(gpa,
na.rm = TRUE), Std.Dev = sd(gpa, na.rm = TRUE))
## # A tibble: 2 x 3
## sex Variance Std.Dev
## <fct> <dbl> <dbl>
## 1 Female 0.12 0.346
## 2 Male 0.605 0.778
4.3 Properties of the Mean and Variance/Standard Deviation
The mean has some useful properties that come in handy when we have to manipulate data. For instance, say we have the following scores \(x = 2, 4, 6, 8, 10\). Let us calculate the mean and standard deviation of these scores. These turn out to be \(\bar{x} = 6\) and \(s_{x} = 3.162278\). Now let us divide each of these numbers by \(2\) to create a new variable, \(x^{*} = \frac{x}{2}\). Turns out \(x^{*} = 1, 2, 3, 4, 5\). If you calculate the mean and standard deviation for \(x^{*}\) you will find these to be \(\bar{x^{*}} = 3\) and \(s_{x^{*}} = 1.581139\), respectively. Compare these to \(\bar{x}\) and \(s_{x}\) and note the difference. What happened here? Well, if you divide each \(x_i\) by a constant \(k\), this is the same as dividing the original mean by \(k\) and the original standard deviation by \(k\). Likewise, if you multiply \(x_i\) by \(k\) you will see the new mean and standard deviation equal \(\bar{x} \times k\) and \(s \times k\), respectively.
What if you add a constant to each \(x_i\)? Say we add 2 to each of the original \(x\) values to create a new variable \(x^{**} = x + 2\). This will yield \(x^{**} = 4, 6, 8, 10, 12\) and the mean and standard deviation will be \(8\) and \(3.162278\), respectively. Note that the mean has increased by \(2\), the same value added to each \(x_i\) but the standard deviation is unchanged and is still \(3.162278\).
What if you subtracted a constant \(k\) from each \(x_i\)? Say we subtract 1 from each of the original \(x_i\) to create \(x^{***} = x - 1\). This will yield \(x^{***} = 1, 3, 5, 7, 9\), and the new mean will be \(5\) but the standard deviation will still be \(3.162278\).
x | x / 2 | 2x | x + 2 | x - 1 | |
---|---|---|---|---|---|
1 | 2.00 | 1.00 | 4.00 | 4.00 | 1.00 |
2 | 4.00 | 2.00 | 8.00 | 6.00 | 3.00 |
3 | 6.00 | 3.00 | 12.00 | 8.00 | 5.00 |
4 | 8.00 | 4.00 | 16.00 | 10.00 | 7.00 |
5 | 10.00 | 5.00 | 20.00 | 12.00 | 9.00 |
Mean | 6.00 | 3.00 | 12.00 | 8.00 | 6.00 |
Standard Deviation | 3.16 | 1.58 | 6.32 | 3.16 | 3.16 |
In brief, then,
- Changing the value of any observation changes the mean
- Adding or subtracting a
constant
\(k\) from all observations is equivalent to adding or subtracting the constant \(k\) from the original mean and leaves the standard deviation and variance unchanged - Multiplying or dividing a constant \(k\) from all observations is equivalent to multiplying or dividing the original mean by the constant \(k\) and the original variance and standard deviation will also be multiplied or divided by \(k\)
4.4 The Empirical Rule
There is a rule, the empirical rule
that says for a bell-shaped distribution,
- approximately 68% of the data lie within one standard deviation of the mean, that is, in the interval with endpoints \(\bar{x} \pm s\) for samples and \(\mu \pm \sigma\) for populations;
- approximately 95% of the data lie within two standard deviations of the mean, that is, in the interval with endpoints \(\bar{x} \pm 2s\) for samples and \(\mu \pm 2\sigma\) for populations; and
- approximately 99.7% of the data lie within three standard deviations of the mean, that is, in the interval with endpoints \(\bar{x} \pm 3s\) for samples and \(\mu \pm 3\sigma\) for populations.
The empirical rule does not apply to data that are asymmetric and in fact even with bell-shaped distributions the percentages 68%, 95% and 99.7% are only approximate.
There is also a more general theorem, called Chebyshev's Theorem
that says at least \((1 - \dfrac{1}{z^{2}})\) of the data values must lie within \(z\) standard deviations of the mean, where \(z\) is any value greater than 1. In particular, approximately 75% of the data points will fall within \(\bar{x} \pm 2s\), approximately 89% of the data points will fall within \(\bar{x} \pm 3s\), and approximately 94% of the data points will fall within \(\bar{x} \pm 4s\) (and of course similar spans around the population mean \(\mu\) and population standard deviation \(\sigma\).
Why are these rules important? They are useful because we know the approximate range within which a specific amount of the data values must fall if the population from which the sample is being drawn is bell-shaped (in which case the empirical rule comes into force) or if we don’t know the shape of the population (in which case Chebyshev’s Theorem comes into force).
4.5 Z-scores
Combining means and standard deviations helps us in many ways, one of these being the ability to compare what seem to be apples and oranges. For instance, let us say that in December 2016, total snowfall in Boston (MA) was 39 inches, and in Caribou (ME) it was 129 inches. The typical mean and standard deviation for each city are given in the table below. Question is: Which city had more unusual snowfall?7
z.df = data.frame(Place = c("Caribou (ME)", "Boston (MA)"), Mean = c(110,
24), Std.Dev. = c(30, 5), `December 2016` = c(125, 39))
Place | Mean | Std.Dev. | December.2016 |
---|---|---|---|
Caribou (ME) | 110 | 30 | 125 |
Boston (MA) | 24 | 5 | 39 |
We can answer the question posed to us if we standardize
the snowfall in each city relative to its average and the typical deviation from its average. That is, we calculate what is called a z-score
where \(z=\dfrac{x-\mu}{\sigma}\). The resulting z-score allows us to identify the relative location of an observation in a data set by telling us how many standard deviation units above or below the Mean
a particular value \(x_{i}\) falls. The father away it is, whether above or below, the more unusual the snowfall would have been.
As it turns out, \(z_{Boston} = \dfrac{39 - 24}{5} = \dfrac{15}{5} = +3.0\) and \(z_{Caribou} = \dfrac{125 - 110}{30} = \dfrac{15}{30} = +0.5\).
z.df = data.frame(Place = c("Caribou (ME)", "Boston (MA)"), Mean = c(110,
24), Std.Dev. = c(30, 5), `December 2016` = c(125, 39), `z score` = c("+0.5",
"+3.0"))
Place | Mean | Std.Dev. | December.2016 | z.score |
---|---|---|---|---|
Caribou (ME) | 110 | 30 | 125 | +0.5 |
Boston (MA) | 24 | 5 | 39 | +3.0 |
That is, Boston had the more unusual snowfall relative to Caribou. Note that if you only focused on the relative means you would have though both places got 15 inches more than they usually get so they were both hit equally hard. But that approach ignores the fact that their averages and variances differ. Once you account for the typical pattern in each city you realize Boston had the worse month of December.
4.5.1 Calculating z-scores in R
The scale()
command will calculate z-scores for you, provided you specific the variable or variables of interest. For example, below I create a z-score of age
and gpa
, one at a time, and add the z scores to the existing dataframe.
dff$zage = scale(dff$age)
dff$zgpa = scale(dff$gpa)
hsb2$zread = scale(hsb2$read)
hsb2$zscience = scale(hsb2$science)
What if I wanted the z-scores to be calculated by groups?
dff <- dff %>% group_by(sex) %>% mutate(z.gpa = scale(gpa), z.age = scale(age))
hsb2 <- hsb2 %>% group_by(ses) %>% mutate(z.socst = scale(socst),
z.write = scale(write))
Note the difference between zread
and zscience
on the one hand and z.write
and z.socst
on the other; the latter two z-scores were calculated for each ses group while the former two z scores were calculated for the full sample.
4.6 The Coefficient of Variation
Combining the mean and standard deviation also comes in handy when we are trying to figure out which of two variables has more variability, but the variables are not measured on the same metric. In order to compare variability across differently measured variables we rely on the coefficient of variation
, a measure of how large the standard deviation is relative to the mean.
\[\begin{equation} \tag{4.8} \text{Coefficient of Variation } = CV = \left(\dfrac{\text{Standard Deviation }}{\text{Mean }}\right) \times 100 \end{equation}\]
For example, say the average amount of spending a family does in a typical weekend is $200, and this spending has a standard deviation of $100. On average, the typical family has \(3\) members but family size tends to vary as well, with a standard deviation of \(2\). Is there more variability in family size or in family spending? The coefficient of variation will help us out here.
\(CV_{spending} = \left( \dfrac{100}{200}\right) \times 100\) = 50% \(CV_{size} = \left( \dfrac{2}{3}\right) \times 100\) = 66.67%
Evidently family size tends to vary far more than does family spending in a typical weekend.
R calculates the coefficient of variation via
library(raster)
cv(dff$age, na.rm = TRUE)
cv(dff$gpa, na.rm = TRUE)
You could do this calculation manually as well:
(sd(dff$gpa, na.rm = TRUE)/mean(dff$gpa, na.rm = TRUE)) * 100
## [1] 13.60645
(sd(dff$age, na.rm = TRUE)/mean(dff$age, na.rm = TRUE)) * 100
## [1] 17.29134
4.7 Symmetric, Skewed and Bi-modal Distributions
All data follow a shape
and you will often see references to the shape of a distribution
, how peaked
and spreadout
the values of a variable are. A distribution is said to be symmetric if you could carve it in the middle and fold each half perfectly onto the other half. The bell-shaped distribution (see below) is one such distribution but not the only one that is symmetric.
set.seed(3000)
xseq <- seq(-4, 4, 0.01)
densities <- dnorm(xseq, 0, 1)
cumulative <- pnorm(xseq, 0, 1)
randomdeviates <- rnorm(1000, 0, 1)
plot(xseq, densities, xlab = "", ylab = "", type = "l", lwd = 2,
cex = 2, cex.axis = 0.8, xaxt = "no", yaxt = "no", col = "cornflowerblue",
bty = "n")
u <- runif(2e+05)
hist(u)
In contrast, we may have skewed
distributions that have tails that extend more to the right (positively-skewed
) or to the left (negatively-skewed
), as shown below.
library(sn)
par(mfrow = c(1, 2))
plot(density((rsn(n = 1000, alpha = -5))), main = "Negatively-skewed",
xlab = "", ylab = "", bty = "n", xaxt = "no", yaxt = "no",
col = "cornflowerblue", lwd = 2)
plot(density((rsn(n = 1000, alpha = 5))), , main = "Positively-skewed",
xlab = "", ylab = "", bty = "n", xaxt = "no", yaxt = "no",
col = "cornflowerblue", lwd = 2)
par(mfrow = c(1, 1))
A skew means you have unusual data points on the right (positively-skewed) or on the left (negatively-skewed). These patterns are evident from histograms as well, though the direction of the skew isn’t always clear.
library(sn)
par(mfrow = c(1, 2))
hist(rsn(n = 1000, alpha = -5), main = "Negatively-skewed", xlab = "",
ylab = "", bty = "n", xaxt = "no", yaxt = "no", col = "cornflowerblue",
lwd = 2)
hist(rsn(n = 1000, alpha = 5), , main = "Positively-skewed",
xlab = "", ylab = "", bty = "n", xaxt = "no", yaxt = "no",
col = "cornflowerblue", lwd = 2)
par(mfrow = c(1, 1))
The peak in each of the preceding graphs reflects the modal value
of the variable – the value of \(x\) that occurs most frequently. Most distributions may have a single modal value but at times there may be two or more modes. A bimodal distribution
is shown below:
par(mfrow = c(1, 1))
bimodal <- function(x) {
3 * dnorm(x, mean = 0, sd = 1) + dnorm(x, mean = 3, sd = 0.3)/4
}
plot(bimodal, -3, 5, lwd = 3, col = "cornflowerblue", yaxt = "n",
xaxt = "n", bty = "n", xlab = "", ylab = "")
We could, and do calculate skewness
as
\[\begin{equation} \tag{4.9} skewness = \dfrac{1}{n}\sum\left( \dfrac{x_i - \bar{x}}{s} \right)^3 \end{equation}\]
where \(\bar{x}\) is the mean, \(s\) is the standard deviation, and \(n\) is the sample size. In normal and other symmetric distributions skewness \(=0\). Negative values of skewness indicate negatively-skewed data while positive values of skewness indicate positively-skewed data. Remember: By skewed left, we mean that the left tail is long relative to the right tail. Similarly, skewed right means that the right tail is long relative to the left tail. Be careful though; with multimodal data the sign of the skewness can be affected. For the preceding plots of skewed distributions, skewness is calculated to be -0.8344821 and 0.9395554, respectively.
Often you will see estimates of skewness being accompanied by a measure of kurtosis
(the Greek word for bulging) defined as
\[\begin{equation} \tag{4.10} \dfrac{n(n+1)(n-1)}{(n-2)(n-3)} \dfrac{\sum \left(x_i - \bar{x}\right)^4}{\left(\sum(x_i - \bar{x})^2 \right)^2} \end{equation}\]
The normal distribution has kurtosis \(=0\) and is said to be mesokurtic
. A negative kurtosis would be indicative of a thin-tailed distribution, also called a platykurtic
distribution while a positive kurtosis would be indicative of a fat-tailed distribution, also called a leptokurtic
distribution. Again, for the same data for which we calculated skewness, measures of kurtosis would be 0.6978577 and 0.8412673, respectively.
4.8 The Five-number Summary and the Box-plot
The five-number summary is composed of the following values:
\[\text{Minimum value }, Q_1, Md, Q_3, \text{ Maximum value}\]
Looking at just these five values tells us a lot about the shape
of our data. In fact, this five-number summary is used in a very powerful graphic called the box-plot
.
The box-plot relies on the five-number summary and also identifies extremely unusual values. An example is given below. Here we are looking at the distribution of reading scores of male/female students as reflected in the hsb2
data we saw earlier.
hsb2 = read.table("./data/hsb2-2.csv", header = TRUE, sep = ",")
hsb2$female = factor(hsb2$female, labels = c("Male", "Female"))
hsb2$race = factor(hsb2$race, labels = c("Hispanic", "Asian",
"African American", "White"))
hsb2$ses = factor(hsb2$ses, labels = c("Low", "Middle", "High"))
hsb2$schtyp = factor(hsb2$schtyp, labels = c("Public", "Private"))
hsb2$prog = factor(hsb2$prog, labels = c("General", "Academic",
"Vocational"))
ggplot(hsb2, aes(x = female, y = read)) + geom_boxplot(fill = "cornflowerblue") +
coord_flip() + xlab("Student's Sex") + ylab("Reading Scores") +
theme_ipsum_rc()
The darker line in the middle of the box is the median, while the edges of the box are the first quartile \((Q_1)\) and the third quartile \((Q_3)\), respectively. The lines (called the whiskers
) extend leftwards to the minimum value and rightwards to the maximum value. What do these plots tell us? That reading scores are higher for Male students as compared to Female students. Notice that the line in the middle of the box is shifted to the right for Male students. The portion of the box to the right of the medians is larger than the portion of the box to the left. If we look at the five-number summary we will be able to more easily figure out the skew so let us do that.
For Males: \(Min = 31.0; Q_1 = 44.5, Md = 52.0, Q_3 = 63.0, Max = 76.0\) For Females: \(Min = 28.0; Q_1 = 44.0, Md = 50.0, Q_3 = 57.0, Max = 76.0\)
Quite clearly, median reading scores are higher for Male students. Now on to understanding how to decipher the skew.
Look at the distance between \(Min\) and \(Md\) for each sex. For Male students, the distance is \(21\) and for Female students it is \(22\). If you then look at the distance between the \(Md\) and \(Max\) for each group, you see that for Males it is \(24\) and for Females it is \(26\). For both groups, the distance between \(Md\) and \(Max\) is greater than the distance between \(Min\) and \(Md\), indicating the the distribution is more stretched
to the right; hence both Male and Female students’ reading scores are positively-skewed.
These box-plots show no outliers
(unusual data values) but the next set do (see below). Here we have broken out science scores by the student’s socioeconomic status. Notice the two isolated dots for the High SES group.
ggplot(hsb2, aes(x = ses, y = science)) + geom_boxplot(fill = "cornflowerblue") +
coord_flip() + xlab("Socioeconomic Status") + ylab("Science Scores") +
theme_ipsum_rc()
In technical terms we define an outlier
as any data point that is more than \(1.5 \times IQR\) below \(Q_1\) or more than \(1.5 \times IQR\) above \(Q_3\). For the High SES group we have the following values of the five number summary:
\[Min = 26.00, Q_1 = 50.00, Md = 58.00, Q_3 = 62.50, Max = 69.00\]
Given these values we know \(IQR = Q_3 - Q_1 = 62.50 - 50.00 = 12.50\) and thus \(1.5 \times IQR = 1.5 \times 12.50 = 18.75\). So anything that is more than 18.75 points below/above \(Q_1\)/\(Q_3\) will be an outlier. The boundary values then are \(50.00 - 18.75 = 31.25\) so any science score smaller than \(31.25\) will be an outlier. As it turns out we do have two scores below this – \(26, \text{ and } 31\); hence the two outliers shown. There are no other outliers for the High SES group, and certainly none for the Low and Middle SES groups. Outliers may be present on both sides of a distribution as well as shown in the graph below.
library(readxl)
educ = read_excel("~/Documents/Data Hub/Data for Teaching/Education.xlsx")
ggplot(subset(educ, Typology != "Island Corner"), aes(x = Typology,
y = PI_score)) + geom_boxplot(fill = "cornflowerblue") +
ylab("Performance Index Score") + xlab("District Type") +
coord_flip() + theme_ipsum_rc()
4.9 The Power of Summary Statistics Married to Graphical Explorations
One of the golden lessons I want you to take away from Chapters 3 and 4 is the maxim that relying only on summary statistics (mean, median, standard deviation, etc.) without graphically exploring the data can lead to terrible mistakes. A classic example comes to us from a stylized example presented by Francis Anscombe in 1973. Anscombe wanted to make the point that if you don’t visualize your data and look for (and deal with) outliers as well, instead rely simply on summary statistics, you could be easily misled. Here is the example in all its glory. The data-set has 11 observations for four pairs of variables, \(x_i, y_i\), shown below. The last row in the table is each variable’s mean.
library(gridExtra)
data(anscombe)
tab.1 = apply(anscombe, 2, FUN = mean, na.rm = TRUE)
tab.2 = round(tab.1, digits = 2)
tab.3 = rbind.data.frame(anscombe, tab.2)
x1 | x2 | x3 | x4 | y1 | y2 | y3 | y4 |
---|---|---|---|---|---|---|---|
10 | 10 | 10 | 8 | 8.04 | 9.14 | 7.46 | 6.58 |
8 | 8 | 8 | 8 | 6.95 | 8.14 | 6.77 | 5.76 |
13 | 13 | 13 | 8 | 7.58 | 8.74 | 12.74 | 7.71 |
9 | 9 | 9 | 8 | 8.81 | 8.77 | 7.11 | 8.84 |
11 | 11 | 11 | 8 | 8.33 | 9.26 | 7.81 | 8.47 |
14 | 14 | 14 | 8 | 9.96 | 8.10 | 8.84 | 7.04 |
6 | 6 | 6 | 8 | 7.24 | 6.13 | 6.08 | 5.25 |
4 | 4 | 4 | 19 | 4.26 | 3.10 | 5.39 | 12.50 |
12 | 12 | 12 | 8 | 10.84 | 9.13 | 8.15 | 5.56 |
7 | 7 | 7 | 8 | 4.82 | 7.26 | 6.42 | 7.91 |
5 | 5 | 5 | 8 | 5.68 | 4.74 | 5.73 | 6.89 |
9 | 9 | 9 | 9 | 7.50 | 7.50 | 7.50 | 7.50 |
Note the similarity in the means for the \(x\) and the \(y\) variables, respectively. If you calculate the correlation between each pair, you will find this to be \(0.8164205\) for each pair of \(x\) and \(y\). So looking just at these numerical summaries it seems the pairs are similar if not downright identical. But if we plot each pair of \(x\) and \(y\) variables, do the distributions still look similar?
p1 <- ggplot(anscombe) + geom_point(aes(x1, y1), color = "cornflowerblue",
size = 1) + scale_x_continuous(breaks = seq(0, 20, 2)) +
scale_y_continuous(breaks = seq(0, 12, 2)) + geom_abline(intercept = 3,
slope = 0.5, color = "black") + expand_limits(x = 0, y = 0) +
labs(title = "Pair 1") + theme_ipsum_rc()
p2 <- ggplot(anscombe) + geom_point(aes(x2, y2), color = "cornflowerblue",
size = 1) + scale_x_continuous(breaks = seq(0, 20, 2)) +
scale_y_continuous(breaks = seq(0, 12, 2)) + geom_abline(intercept = 3,
slope = 0.5, color = "black") + expand_limits(x = 0, y = 0) +
labs(title = "Pair 2") + theme_ipsum_rc()
p3 <- ggplot(anscombe) + geom_point(aes(x3, y3), color = "cornflowerblue",
size = 1) + scale_x_continuous(breaks = seq(0, 20, 2)) +
scale_y_continuous(breaks = seq(0, 12, 2)) + geom_abline(intercept = 3,
slope = 0.5, color = "black") + expand_limits(x = 0, y = 0) +
labs(title = "Pair 3") + theme_ipsum_rc()
p4 <- ggplot(anscombe) + geom_point(aes(x4, y4), color = "cornflowerblue",
size = 1) + scale_x_continuous(breaks = seq(0, 20, 2)) +
scale_y_continuous(breaks = seq(0, 12, 2)) + geom_abline(intercept = 3,
slope = 0.5, color = "black") + expand_limits(x = 0, y = 0) +
labs(title = "Pair 4") + theme_ipsum_rc()
grid.arrange(p1, p2, p3, p4, ncol = 2)
This is precisely why you should always start by graphing your data, all of it, even if you only intend to analyze specific variables, before doing any sort of numerical calculations of means, correlations, variances, etc., leave alone proceeding to do any statistical test. If you fail to follow this mantra, be prepared to be embarrassed.
4.10 Outliers, Outliers, What do I do with them?
If you encounter outliers, the overriding question has to be the next step: Do you delete them or do you keep them? Deleting data points is never a good idea unless you have good reason to do so. The only good reason might be one of measurement error, in that the data were entered incorrectly or you have good reason to believe whoever recorded the measurement did so with error. If you see a skewed distribution with outliers, and there is no good reason to believe this is because of measurement error, then try a transformation.
- Try \(x^* = x^3\) to reduce extreme negative skewness
- Try \(x^* = x^2\) to reduce negative skewness
- Try \(x^{\frac{1}{2}}\) to reduce positive skewness
- Try \(x^* = -\dfrac{1}{\sqrt{x}}\) to reduce extreme positive skewness
- Try \(x^* = -\dfrac{1}{x}\) to reduce even more extreme positive skewness
- If \(x\) can assume a value of \(0\), add a small constant \(c\), such that \(x^* = x + c\). Folks often use \(c = 0.01\) or even \(c = 1\). Then try options (1) through (5) depending upon the direction and strength of the skewness
You will also see the natural logarithm
used to transform data. It only works if \(x\) is strictly \(\geq 1\) so if \(x is \geq 0\), be sure to add a constant to \(x\) before trying the natural logarithm. The reason many shy away from it is because it is difficult to interpret the results. An easy fix to this problem is to use the antilogarithm
, i.e., \(e^{x^*}\). This same rule, of flipping the final result back into the original metric also works for the other transformation discussed here. Specifically,
- If you used \(x^* = x^3\) to reduce extreme negative skewness, convert the final answer via \((x^*)^\frac{1}{3}\)
- If you used \(x^* = x^2\) to reduce negative skewness, convert the final answer via \((x^*)^\frac{1}{2}\)
- If you used \(x^* = x^{\frac{1}{2}}\) to reduce positive skewness, convert the final answer via \((x^*)^2\)
- If you used \(x^* = -\dfrac{1}{\sqrt{x}}\) to reduce extreme positive skewness, convert the final answer via \(\dfrac{1}{(x^*)^2}\)
- If you used \(x^* = -\dfrac{1}{x}\) to reduce even more extreme positive skewness, convert the final answer via \(\dfrac{1}{-x^*}\)
4.11 Beware the flawed rule
Many textbooks tend to indicate that the mean is always pulled away from the median (some also extend this to the mode) in skewed distributions with a left-skewed distribution with \(mean < median\) and a right-skewed distribution with \(mean > median\). This is not always true for discrete variables, and the rule can fail with continuous variables that are distributed with one long tail but the other tail is heavy. Bimodal and multimodal distributions of continuous variables are equally susceptible. In brief: Do not rely on this rule of thumb.
4.12 Practice Problems
Problem 1
Working with the Boston Marathon (2016) data used in the previous chapter, calculate the following:
- Mean finish time
- Median finish time
- Variance of finish times
- Standard Deviation of finish times
- Range of finish times
- The first and the third quartiles of finish times
- The interquartile range of finish times
- The skewness of finish times
Problem 2
Construct gender-specific box-plots of runners’ times. Are the distributions skewed or symmetric for each gender? If skewed, in what direction? Are there outliers in each gender’s distribution? On which side(s) of the distribution?
Problem 3
What is the five-number summary of finish times?
Problem 4
Calculate the statistics listed in Problem 1 for each gender. Do the estimates of skewness you calculated here support your conclusions about skewness from your work in Problem 2?
Problem 5
What is the coefficient of variation for male and female runners’ finish times, respectively? What do these estimates tell you about finish times for male and female runners, respectively?
Problem 6
Ignoring gender differences, if you see skewed distributions in finish times with outliers, try to transform the data to reduce the skewness. What transformations work? Explain why you chose the transformation(s) you did and show (via a box-plot and estimates of skewness) that the transformation(s) worked.
Problem 7
Using the data-set employed in Problem 5 of Chapter 2, calculate and report the average number of hours worked by low/medium/high salary employees. Explain why you chose the measure of average that you did. Also display the distribution of hours worked for each salary group via an appropriate graphic. What patterns do you see in this graphic?
Problem 8
Load the EPA data used for Practice Problem 3 in Chapter 2 and answer the following questions.
- Calculate the average annual fuel cost in 2017. Note that the variable is
fuelCost08
and is based on 15,000 miles, 55% city driving, and the price of fuel used by the vehicle. - Of all the vehicles in the data-set, for the year 2017, what is the most common drive axle type? Note that the variable is
drive
. - Of all the vehicles in the data-set, for the year 2017, what is the most common vehicle size class? Note that the variable is
VClass
. - For the year 2017, construct a grouped frequency table & histogram of
youSaveSpend
. Note that this variable measures the vehicle-related savings/spending over 5 years compared to an average car (in US$). Savings are positive; a greater amount spent yields a negative number.
It would be remiss of us not to recognize the existence of a very passionate and vigorous Null Hypothesis Significance Testing (NHST) controversy. While the details of the controversy are too advanced for our present purposes, here are some articles you should read: The Truth Wears Off: Is there something wrong with the scientific method? and Do We Really Need the S-word?↩
You will also see these formulas listed as \(\dfrac{\left(n + 1\right)}{4}\) for the position of \(Q_1\) and \(\dfrac{3\left(n + 1\right)}{4}\) for the position of \(Q_3\).↩
Unusual compared to what seems to be typical for the city, that is.↩