Chapter 3 Visualizing Data
One of the first tasks of data analysis should be to look at our data. Certainly, you should open up the dataset and eyeball it to make sure everything looks okay, that the file isn’t mysteriously corrupted, etc., but that is not what I mean when I say “look”. Nor am I talking about data visualization
in the sense of the masterpieces of stalwarts such as Alberto Cairo, Mona Chalabi, Andy Kirk, Robert Kosara, Giorgia Lupi, David McCandless, Cole Nussbaumer Knaflic, Randy Olson, Lisa Charlotte Rost, Jon Schwabish, Nathan Yau, Stephanie Evergreen, Martin Wattenberg, Fernanda Viégas, and many others. We are not going to create visuals that rival these artists’ stunning works. Rather, our goal will be simpler: to use appropriate tables and graphics when analyzing and describing our data, both so that we understand what the data are telling us and so that we can communicate the resulting story to our audience. To accomplish these tasks we will obey simple rules that help us make wise decisions.
3.1 Visualizing Nominal/Ordinal Data
3.1.1 Bar-charts and Frequency Tables
If a variable is measured in nominal or ordinal terms, either (i) a bar-chart
or (ii) a frequency table
is a very effective display of how the variable is distributed: What value seems to be most common? Are high values as likely as low values? You can see both these visuals below, drawn from data on people killed by tigers, classified by what each victim was doing at the time of the attack. First the bar-chart, and then the frequency table.
library(ggplot2)     # for ggplot() and friends
library(hrbrthemes)  # for theme_ipsum_rc()
tigers = read.csv("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter02/chap02e2aDeathsFromTigers.csv")
ggplot(tigers, aes(activity)) + geom_bar(width = 0.2) + coord_flip() +
    ylab("Frequency") + xlab("Human Activity") + theme_ipsum_rc()
Total = sum  # a copy of sum() so the margin is labelled "Total" rather than "Sum"
tab.1 = table(tigers$activity)                      # frequencies
tab.2 = addmargins(tab.1, FUN = Total)              # frequencies plus a Total margin
tab.4 = addmargins(prop.table(tab.1), FUN = Total)  # proportions plus a Total margin
tab.4 = round(tab.4, digits = 2)
tab.5 = tab.4 * 100                                 # proportions converted to percentages
tiger.table = cbind.data.frame(tab.2, tab.4, tab.5)
tiger.table = tiger.table[c(1:2, 4, 6)]             # keep one label column plus the three measures
colnames(tiger.table) = c("Activity", "Frequency", "Proportion",
    "Percentage")
Activity | Frequency | Proportion | Percentage |
---|---|---|---|
Disturbing tiger kill | 5 | 0.06 | 6 |
Fishing | 8 | 0.09 | 9 |
Forest products | 11 | 0.12 | 12 |
Fuelwood/timber | 5 | 0.06 | 6 |
Grass/fodder | 44 | 0.50 | 50 |
Herding | 7 | 0.08 | 8 |
Sleeping in house | 3 | 0.03 | 3 |
Toilet | 2 | 0.02 | 2 |
Walking | 3 | 0.03 | 3 |
Total | 88 | 1.00 | 100 |
Notice how the frequency table is constructed. Each human activity is listed, followed by the frequency (the number of times someone was killed by a tiger while engaging in that activity), then a column with the frequency converted into a proportion, and finally the proportion converted into a percentage. None of this should be a mystery to you:
\[proportion = \dfrac{frequency}{Total}\] \[percentage = \left(\dfrac{frequency}{Total}\right) \times 100 = proportion \times 100\]
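As a quick check of these formulas in R, take the Grass/fodder row of the table above:

frequency = 44           # deaths while cutting grass/fodder
total = 88               # all recorded deaths
frequency/total          # proportion: 0.5
(frequency/total) * 100  # percentage: 50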
The last row in the table shows you the total frequency (a total of 88 humans were killed), the total proportion (which must sum to \(1\)), and the total percentage (which must sum to \(100\)). I love frequency tables such as these because they show all the data and the story: most kills occurred while the victims were cutting grass for cattle fodder, followed by kills that occurred while gathering other forest products, and the fewest while going to the toilet.4 We could improve upon this table and bar-chart by organizing them so that the most dangerous activity is listed/plotted first. This would have the added benefit of quickly drawing the reader’s/viewer’s eyes to the most dangerous activity, as shown in the sketch below.
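One way to do that reordering is sketched below; it leans on the forcats package, which the code above does not load, though any releveling approach would do. fct_infreq() sorts the activities by decreasing frequency and fct_rev() flips that order so the most common activity ends up on top once the coordinates are flipped:

library(forcats)  # not loaded above; used only for this sketch
ggplot(tigers, aes(x = fct_rev(fct_infreq(activity)))) + geom_bar(width = 0.2) +
    coord_flip() + ylab("Frequency") + xlab("Human Activity") + theme_ipsum_rc()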
What if the variable was an ordinal level variable, say something like an individual’s frequency of praying? We would be facing the same options, a bar-chart or a frequency table. However, we would have to be cautious here in making sure both the table and the bar-chart categories are logically arranged. That is, notice the Response Category column below; we start with the option “Never” and each category that follows reflects praying more often than the preceding category. This order must be maintained, which means we cannot rearrange the table or bar-chart in ascending/descending order of the “Frequency” column. If we made that mistake we would be destroying the natural order that exists in the ordinal variable we have before us.
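If such a variable arrives as plain text, R sorts its categories alphabetically unless told otherwise. A minimal, self-contained sketch of locking in the logical order (the labels are the response categories used below):

pray = c("Once a day", "Never", "A few times a week", "Never")
pray = factor(pray, levels = c("Never", "Once a week or less",
    "A few times a week", "Once a day", "Several times a day"),
    ordered = TRUE)
table(pray)  # categories now print in logical, not alphabetical, order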
site = "http://faculty.gvsu.edu/kilburnw/nes2008.RData"
load(file = url(site))
tab.1 = table(nes08$pray2)
tab.2 = addmargins(tab.1, FUN = Total)
tab.2 = as.data.frame(tab.2)
colnames(tab.2) = c("Response Category", "Frequency")
Response Category | Frequency |
---|---|
Never | 196 |
Once a week or less | 226 |
A few times a week | 264 |
Once a day | 346 |
Several times a day | 456 |
Total | 1488 |
ggplot(tab.2[1:5, ], aes(x = `Response Category`, y = Frequency)) +
geom_bar(stat = "identity", width = 0.2) + coord_flip() +
ylab("Frequency") + xlab("Response Category") + theme_ipsum_rc()
You might be wondering, what about pie charts? Why not use those with nominal/ordinal data? There are two camps: those who hate them and those who think they may be useful. If you are interested, read What do you mean I’m not supposed to use Pie Charts?! but I, for one, do not use them since I find them less useful than bar-charts.
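Should you want one anyway, a pie chart in ggplot2 is simply a stacked bar-chart drawn in polar coordinates; a minimal sketch using the tigers data from above:

ggplot(tigers, aes(x = "", fill = activity)) + geom_bar(width = 1) +
    coord_polar(theta = "y") + xlab("") + ylab("")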
3.1.2 Contingency Tables and Bar-charts
Frequency tables and bar-charts are also useful when you have two nominal/ordinal variables to work with and are interested in exploring differences between the two or more groups reflected in one variable on whatever is being measured by the second variable. Let us use a specific example, one where we ask if religiosity differs between Liberals, Moderates, and Conservatives. First the table.
tab.1 = table(nes08$pray2, nes08$libcon2)
tab.2 = addmargins(tab.1, FUN = Total, quiet = TRUE)
Religiosity | Liberal | Moderate | Conservative | Total |
---|---|---|---|---|
Never | 62 | 46 | 36 | 144 |
Once a week or less | 53 | 63 | 52 | 168 |
A few times a week | 53 | 69 | 76 | 198 |
Once a day | 76 | 89 | 107 | 272 |
Several times a day | 69 | 99 | 176 | 344 |
Total | 313 | 366 | 447 | 1126 |
If you look at Table 3.3, you see the frequencies reported in what we call a contingency table or a crosstabulation – a table where the distributions of two nominal/ordinal variables are jointly displayed. Reading such a table should be simple. In brief, you see how many Liberals said they prayed “Never”, “Once a week or less”, “A few times a week”, “Once a day”, or “Several times a day”. What pattern is evident from this table? Of the 447 Conservatives, most say they pray several times a day, followed by once a day, with the fewest saying they never pray. The pattern is similar for Liberals and Moderates, although the difference between the numbers responding “Never” and “Several times a day” is smaller for Liberals than it is for Conservatives.
tab.3 = prop.table(tab.1, margin = 1) * 100
tab.4 = prop.table(tab.1, margin = 2) * 100
tab.3 = round(tab.3, digits = 2)
tab.4 = round(tab.4, digits = 2)
The story could be helped a great deal if we calculated percentages for these frequencies. We have two choices when calculating these percentages. We could calculate row percentages, where we ask “what percent of those who said Never were Liberal, Moderate, and Conservative, respectively?” We could then repeat this for the other categories of religiosity. The result is shown in Table 3.4. This table shows quite clearly that 51.16% of those who say they pray several times a day are Conservative. Likewise, a plurality (43.06%) of those who say they never pray are Liberal.
Religiosity | Liberal | Moderate | Conservative |
---|---|---|---|
Never | 43.06 | 31.94 | 25.00 |
Once a week or less | 31.55 | 37.50 | 30.95 |
A few times a week | 26.77 | 34.85 | 38.38 |
Once a day | 27.94 | 32.72 | 39.34 |
Several times a day | 20.06 | 28.78 | 51.16 |
If we calculated column percentages we would be able to answer such questions as: “What percent of Liberals said they never pray, pray once a week or less, a few times a week, once a day, or several times a day?” The same could then be asked of Moderates and Conservatives, respectively. If we used the column percentages shown in Table 3.5 it would be obvious that Moderates and Conservatives tend to be more religious than Liberals. The essential takeaway here is that how you calculate the percentages (row versus column) depends upon what story you want to highlight, upon the question you want to ask and answer.
Religiosity | Liberal | Moderate | Conservative |
---|---|---|---|
Never | 19.81 | 12.57 | 8.05 |
Once a week or less | 16.93 | 17.21 | 11.63 |
A few times a week | 16.93 | 18.85 | 17.00 |
Once a day | 24.28 | 24.32 | 23.94 |
Several times a day | 22.04 | 27.05 | 39.37 |
Graphing these tables is easily done as well, and very effective; see Figure 3.3.
ggplot(subset(nes08, !is.na(libcon2) & !is.na(pray2)), aes(pray2,
..count..)) + geom_bar(aes(fill = libcon2), position = "dodge",
width = 0.4) + xlab("Religiosity") + ylab("Frequency") +
scale_fill_manual(values = c("#999999", "#E69F00", "#56B4E9"),
name = "Political Ideology", breaks = c("Liberal", "Moderate",
"Conservative"), labels = c("Liberal", "Moderate",
"Conservative")) + theme(legend.position = "top",
axis.text.x = element_text(angle = 90, size = 10, vjust = -2)) +
coord_flip() + theme_ipsum_rc()
3.2 Visualizing Interval/Ratio Data
We have several visuals we could draw with numeric data measured at either the interval or ratio level. Let us see these first before we look at the one frequency table that could be used with interval/ratio data.
3.2.1 The Histogram
hsb2 = read.table("./data/hsb2-2.csv", header = TRUE, sep = ",")
hsb2$female = factor(hsb2$female, labels = c("Male", "Female"))
hsb2$race = factor(hsb2$race, labels = c("Hispanic", "Asian",
"African American", "White"))
hsb2$ses = factor(hsb2$ses, labels = c("Low", "Middle", "High"))
hsb2$schtyp = factor(hsb2$schtyp, labels = c("Public", "Private"))
hsb2$prog = factor(hsb2$prog, labels = c("General", "Academic",
"Vocational"))
A histogram is used with a single numeric variable and looks like a bar-chart, except that there are no gaps between consecutive bars unless a bar happens to be empty. The example that follows uses a popular dataset known as hsb2, which contains information about 200 randomly selected students from a national survey of high school seniors called the High School and Beyond survey. The variables in this dataset include:
- id = student id
- female = sex (0 = male, 1 = female)
- race = ethnicity (1 = hispanic, 2 = asian, 3 = african-amer, 4 = white)
- ses = socioeconomic status (1 = low, 2 = middle, 3 = high)
- schtyp = type of school (1 = public, 2 = private)
- prog = type of program (1 = general, 2 = academic, 3 = vocational)
- read = standardized reading score
- write = standardized writing score
- math = standardized math score
- science = standardized science score
- socst = standardized social studies score
I’ll use read
(Reading scores on a standardized test) to plot a histogram. Notice a few things about the histogram in Figure 3.4. The height of the bars, representing how often a particular score occurs, varies a great deal. A few students have done poorly while a few have done very well, but the rest are distributed over the middle range of test scores.
ggplot(data = hsb2, aes(x = read)) + geom_histogram(fill = "CornflowerBlue",
bins = 30) + ylab("Frequency") + xlab("Reading Scores") +
theme_ipsum_rc()
We could further break this histogram apart by asking whether male and female students (female), private versus public school students (schtyp), or students of different races/ethnicities (race) perform differently on the reading test.
ggplot(data = hsb2, aes(x = read)) + geom_histogram(fill = "CornflowerBlue",
bins = 30) + ylab("Frequency") + xlab("Reading Scores") +
facet_wrap(~female) + theme_ipsum_rc()
ggplot(data = hsb2, aes(x = read)) + geom_histogram(fill = "CornflowerBlue",
bins = 30) + ylab("Frequency") + xlab("Reading Scores") +
facet_wrap(~schtyp) + theme_ipsum_rc()
ggplot(data = hsb2, aes(x = read)) + geom_histogram(fill = "CornflowerBlue",
bins = 30) + ylab("Frequency") + xlab("Reading Scores") +
facet_wrap(~race) + theme_ipsum_rc()
These plots are hard to read for a number of reasons. First, with just \(n = 200\) students in all, some groups have far more students than others, and as a result the histograms look very thin for the groups with fewer data points, making it difficult to tease out any patterns. Second, within any group there are too many distinct test scores, so we don’t see a clear pattern at all. To fix these problems we construct groups of scores, turning what is a numeric variable into an ordinal variable. Let us build this by first creating a grouped frequency table and then plotting this table as a histogram.
3.2.2 Grouped Frequency Tables
Because numeric variables take on too many values it is often easier to see their distribution by grouping the numeric values. For example, we could start by finding the lowest and highest reading scores. These turn out to be 28 and 76, respectively. We can build the groups as follows:
- Calculate the difference between the maximum and minimum values, which turns out to be \(76 - 28 = 48\)
- Decide how many groups we want. Good practice suggests no fewer than four or five groups and no more than six or seven. Say we go with 5 groups.
- Divide the gap between the maximum and minimum values by the desired number of groups: \(= \dfrac{48}{5} = 9.6\) and round up to the nearest whole number \(= 10\). This tells us how wide each group should be.
- The groups could thus be \(28-38, 38-48, 48-58, 58-68, 68-78\).
Notice that these groups span, start to finish, all the values of reading scores. But we’ll have to decide which group to include 38, 48, 58, and 68 in. Should 38 go in 28-38 or in 38-48? The choice doesn’t matter so long as we are consistent. Let us choose a rule that says include 38 in 38-48, 48 in 48-58, 58 in 58-68, and 68 in 68-78. Using this rule we now find each reading score and drop it into its group. Then we count how many scores fall in each group, creating our Frequency column. The bin arithmetic is scripted below, followed by the code that builds the grouped frequency table.
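Here is the arithmetic just described, as a short sketch:

rng = max(hsb2$read) - min(hsb2$read)  # 76 - 28 = 48
width = ceiling(rng/5)                 # 48/5 = 9.6, rounded up to 10
seq(28, 78, by = width)                # break points: 28 38 48 58 68 78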
hsb2$read_cut = cut(hsb2$read, breaks = c(28, 38, 48, 58, 68,
78), right = FALSE)
tab.1 = table(hsb2$read_cut)
tab.2 = addmargins(tab.1, FUN = Total)
tab.2 = as.data.frame(tab.2)
tab.2$Var1 = c("28-38", "38-48", "48-58", "58-68", "68-78", "Total")
colnames(tab.2) = c("Grouped Scores", "Frequency")
Grouped Scores | Frequency |
---|---|
28-38 | 14 |
38-48 | 68 |
48-58 | 62 |
58-68 | 36 |
68-78 | 20 |
Total | 200 |
Now it is easy to see that the largest number of students (68) falls in the 38-48 group of reading scores while the smallest frequency (14) occurs in the 28-38 group. Let us add to this table a percentage column.
tab.3 = prop.table(tab.1) * 100
tab.4 = addmargins(tab.3, FUN = Total)
tab.4 = as.data.frame(round(tab.4, digits = 2))
tab.5 = cbind.data.frame(tab.2, tab.4)
tab.6 = tab.5[c(1, 2, 4)]
colnames(tab.6) = c("Grouped Scores", "Frequency", "Percentage")
Grouped Scores | Frequency | Percentage |
---|---|---|
28-38 | 14 | 7 |
38-48 | 68 | 34 |
48-58 | 62 | 31 |
58-68 | 36 | 18 |
68-78 | 20 | 10 |
Total | 200 | 100 |
The percentages make it even easier to see how the distribution breaks down; 34 percent of the students have scores in the 38-48 range while only 10 percent scored in the highest bracket (68-78). We can also break this down by the three groups we used earlier.
ggplot(data = hsb2, aes(x = read)) + geom_histogram(fill = "CornflowerBlue",
breaks = seq(28, 78, by = 10), alpha = 0.4, color = "black") +
ylab("Frequency") + xlab("Reading Scores") + facet_wrap(~female) +
theme_ipsum_rc()
ggplot(data = hsb2, aes(x = read)) + geom_histogram(fill = "CornflowerBlue",
breaks = seq(28, 78, by = 10), alpha = 0.4, color = "black") +
ylab("Frequency") + xlab("Reading Scores") + facet_wrap(~schtyp) +
theme_ipsum_rc()
ggplot(data = hsb2, aes(x = read)) + geom_histogram(fill = "CornflowerBlue",
breaks = seq(28, 78, by = 10), alpha = 0.4, color = "black") +
ylab("Frequency") + xlab("Reading Scores") + facet_wrap(~race) +
theme_ipsum_rc()
Histograms were once quite popular but there is a better visual for looking at our numeric variables, one called the box-plot
that we’ll see in the next chapter. I am not a huge fan of histograms because grouping decisions influence the story being told. My advice would be to use grouped frequency tables instead of histograms to present a summary overview of your numeric data.
3.2.3 Scatterplots
With two numeric variables, a scatterplot comes in very handy if we want to explore how one variable might be related to another. For example, we may want to ask whether students who score high on the reading test also tend to score high on the mathematics test. This could be visually explored via a scatterplot, as shown below. The goal should be to look for a pattern: Does one variable increase as the other increases, or does one variable decrease as the other increases? Or does there seem to be no relationship at all?
ggplot(data = hsb2, aes(x = read, y = math)) + geom_point(color = "CornflowerBlue") +
ylab("Mathematics Scores") + xlab("Reading Scores") + theme_ipsum_rc()
Quite clearly, students who do well in Reading also tend to do well in Mathematics. You can see this by virtue of the upward, left-to-right tilt of the cloud of points.
What about Science scores and Mathematics scores? Is there any relationship between doing well in Science and doing well in Mathematics?
ggplot(data = hsb2, aes(x = science, y = math)) + geom_point(color = "CornflowerBlue") +
ylab("Mathematics Scores") + xlab("Science Scores") + theme_ipsum_rc()
Of course, just like everything else, we could break this down by any nominal/ordinal variable. The visuals below show you the breakouts by the student’s sex, school type, and race/ethnicity.
ggplot(data = hsb2, aes(x = science, y = math)) + geom_point(color = "CornflowerBlue") +
ylab("Mathematics Scores") + xlab("Science Scores") + facet_wrap(~female) +
theme_ipsum_rc()
ggplot(data = hsb2, aes(x = science, y = math)) + geom_point(color = "CornflowerBlue") +
ylab("Mathematics Scores") + xlab("Science Scores") + facet_wrap(~schtyp) +
theme_ipsum_rc()
ggplot(data = hsb2, aes(x = science, y = math)) + geom_point(color = "CornflowerBlue") +
ylab("Mathematics Scores") + xlab("Science Scores") + facet_wrap(~race) +
theme_ipsum_rc()
3.2.4 Line Graphs
If we have time-series data, such as the Presidential Approval data we saw earlier, then a line graph
works well because it shows you how the outcome/phenomenon varies over time. Let us take another example: real median household income in Ohio, Pennsylvania, and West Virginia, pulled from the FRED database. How have household incomes in these three states fared?
library(quantmod)
library(xts)
library(dygraphs)
# Get real median household income for Ohio, Pennsylvania, and West Virginia from FRED
oh = na.omit(getSymbols("MEHOINUSOHA672N", src = "FRED", from = "1949-12-31",
auto.assign = FALSE))
pa = na.omit(getSymbols("MEHOINUSPAA672N", src = "FRED", from = "1949-12-31",
auto.assign = FALSE))
wv = na.omit(getSymbols("MEHOINUSWVA672N", src = "FRED", from = "1949-12-31",
auto.assign = FALSE))
oh2 = as.data.frame(oh)
oh2$date = row.names(oh2)
pa2 = as.data.frame(pa)
pa2$date = row.names(pa2)
wv2 = as.data.frame(wv)
wv2$date = row.names(wv2)
ggplot() + geom_line(data = oh2, aes(x = as.Date(date), y = MEHOINUSOHA672N),
color = "blue") + geom_line(data = wv2, aes(x = as.Date(date),
y = MEHOINUSWVA672N), color = "red") + geom_line(data = pa2,
aes(x = as.Date(date), y = MEHOINUSPAA672N), color = "green") +
xlab("Year") + ylab("Real Median Household Income") + theme(plot.title = element_text(lineheight = 0.7,
face = "bold")) + annotate("text", x = as.Date("1997-01-01"),
y = 35000, label = "West Virginia") + annotate("text", x = as.Date("1988-01-01"),
y = 45000, label = "Pennsylvania") + annotate("text", x = as.Date("1990-01-01"),
y = 55000, label = "Ohio") + theme_ipsum_rc()
These plots aren’t useful just for financial or election data; they are ideally suited for any phenomenon that is measured over time: for example, the size of the immigrant population over the years, or even the number of lynx pelts reported in Canada per year from 1752 to 1819.
library(readxl)  # for read_excel()
immig = read_excel("~/Documents/Data Hub/MPI-Data-Hub_USImmigFlow_since1820_2014.xlsx",
    skip = 7)
immig = immig[c(1:2)]
immig = subset(immig, !is.na(`Number of Legal Permanent Residents`))
immig$Year = as.numeric(immig$Year)
ggplot(immig, aes(x = Year, y = `Number of Legal Permanent Residents`)) +
geom_path(color = "darkred") + xlab("Year") + ylab("No. of Legal Permanent Residents") +
theme_ipsum_rc()
library(abd)
data(Lynx)
ggplot(Lynx, aes(x = year, y = pelts)) + geom_path(color = "darkred") +
xlab("Year") + ylab("Number of Lynx Pelts") + theme_ipsum_rc()
3.2.5 Polar Charts
These charts are helpful for visualizing data that might have otherwise been explored via a bar-chart. For example, say you want to look at the miles per gallon given by a number of different automobiles. We could use a bar-chart, as shown below:
mtcars$car = row.names(mtcars)
p = ggplot(mtcars, aes(x = car, y = mpg, fill = mpg)) + geom_bar(stat = "identity") +
scale_fill_gradient(low = "red", high = "white", limits = c(5,
40)) + theme(axis.title.y = element_text(angle = 0),
legend.position = "top", axis.text.x = element_text(angle = 45,
vjust = 1, hjust = 1, size = 6)) + ylab("") + xlab("Automobile") +
theme_ipsum_rc()
Quite obviously, the Toyota Corolla has the best fuel economy, followed by the Fiat 128, while the Cadillac Fleetwood is tied with the Lincoln Continental for the worst fuel economy. A polar chart would present the same information as seen in the bar-chart, albeit in a more aesthetically pleasing manner.
p + coord_polar() + aes(x = reorder(car, mpg)) + theme(axis.text.x = element_text(angle = -20,
size = 6), legend.position = "top") + xlab("") + ylab("") +
theme_ipsum_rc()
3.3 Some Essential Rules for Good Visualizations
There are several rules, some favored by this expert or that, but regardless of the source of a rule its merit is not in doubt. So here are the ones I try to follow (although I do break them at times, sometimes by design and other times unintentionally). It all starts, of course, with Edward Tufte’s maxims: show the data, tell the truth, help the viewer think about the information rather than the design, encourage the eye to compare the data, and make large data sets coherent.
A more specific listing of the rules I try to keep in mind follows:
- Do not include anything in your visualization that is not informative. Give titles and subtitles where needed and make these stand on their own. At the same time, do not clutter your charts and tables with too much information, otherwise the reader gets lost.
- Use colors wisely. Is this visual to be printed on a color printer? Is it part of a presentation in a poorly lit room? Choose bright colors that stand out from each other and yet are visible on a printed page or on the projection screen in the room. Remember, a fair share of the men seeing the visual could be color-blind, so use color palettes designed for color-blind individuals. If this is news to you, read How a dog sees a rainbow, and 12 other images that explain how we see color. Pay attention to what is being plotted: use sequential colors for ordered data values (that go from low to high or vice-versa); qualitative colors for nominal data; and diverging colors if the goal is to put equal emphasis on mid-range critical values and extremes at both ends of the data range, emphasizing the middle with light colors and the low/high extremes with dark colors that have contrasting hues.
- The visualization is not your personal Mona Lisa; it must be effective for the target audience and help you tell the story. Do not get carried away in creating it.
- Use a table if a table would be more effective than a graph, but remember: while tables are useful for audiences that will want to see more (not less) of the actual data you have, graphs are favored by those who want a quick take-away.
- Always have your y-axis (or x-axis if relevant) start at zero. If you don’t do this you can misrepresent the data. If you are forced to truncate an axis, point this out to the reader so that they can interpret dips and swells in line charts, or differences between the heights of frequency bars, with care. Go back and look at the scatterplots and line charts and try to visualize what these might look like if the y-axis had been forced to start at \(0\); they would certainly look different. For illustration purposes I have let the plotting software pick the starting coordinates. (A sketch of forcing a zero baseline appears after this list.)
- Combine multiple graphs/tables into a single figure if you can, so long as it does not lead to information overload.
- For non-technical audiences, round all percentages/proportions to no more than one decimal place, and ideally to no decimal place, unless doing so distorts the picture. For technical audiences, stay with two or four decimal places.
- Above all, start with pencil and paper and consider all alternate visualizations possible by drawing a rough sketch of what the finished product should look like.
- Consider using a choropleth map if geographic variation is one of the key narratives.
- Be prepared to alter your visual if it does not work. We are all hesitant to delete a page or a graphic and start from scratch, but refusing to do so locks you into something that isn’t working to begin with, and you will be stuck in a rut.
- Avoid jargon like the plague. We often think we sound intelligent when we use jargon but this distracts from the oral/written presentation. Think of your audience and write and present in a manner that will resonate with them.
- Show as much data as you can; it vastly improves the effectiveness of the visual narration.
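As promised in the axis rule above, here is a minimal sketch of forcing a zero baseline in ggplot2, using the hsb2 data from earlier in the chapter:

ggplot(hsb2, aes(x = read, y = math)) + geom_point(color = "CornflowerBlue") +
    expand_limits(y = 0) +  # stretch the y-axis so it includes 0
    ylab("Mathematics Scores") + xlab("Reading Scores") + theme_ipsum_rc()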
3.4 Tables and Graphics with R
One could build the graphics we have seen in the preceding sections in R using several different approaches. However, the primary graphics package we will use is ggplot2. If necessary, we will switch to lattice, and perhaps even to base R graphics. The examples you will see in this section are not the only way to generate a given table or a particular graph. In fact, there are many paths in R that lead to the same shore, which is one of the things that makes it such popular software for data analysis and visualization! Given this abundance of approaches I have chosen to share with you the way I go about a particular task.
3.4.1 Building Tables of Frequencies and Descriptive Statistics
We start with the simple stuff – frequency tables! Recall that frequency tables and bar-charts are often the best means of showing the distribution of a categorical variable. To see how we do it, let us load the hsb2
data. Once it is loaded, I’d like to work with the ses
, female
, schtyp
, and prog
variables.
load("~/Documents/Teaching/MPA 6020/Data/hsb2.RData")
Say we want a frequency table of ses
. This can be generated via
tab1 = table(hsb2$ses)
tab1
##
## Low Middle High
## 47 95 58
The table()
command generates a frequency table but no totals, so if we want them we utilize the addmargins()
command as shown below.
tab2 = addmargins(tab1)
tab2
##
## Low Middle High Sum
## 47 95 58 200
Now we see the total, Sum
. But folks are used to seeing the word “Total”, and to replace “Sum” with “Total” we modify the command as follows:
Total = sum
tab2 = addmargins(tab1, FUN = Total)
tab2
##
## Low Middle High Total
## 47 95 58 200
We basically created a new function called Total
that duplicates what sum
does. We then explicitly asked addmargins()
to use the function called Total by specifying FUN = Total
.
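Incidentally, the same trick works with any summary function, since addmargins() labels the margin with the name of the function it is given; a quick sketch:

Median = median
addmargins(tab1, FUN = Median)  # should add a margin labelled "Median" (here, 58)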
Tables such as the ones we have created can be easily flipped into a stand-alone dataframe by specifying
tab3 = as.data.frame(tab2)
tab3
## Var1 Freq
## 1 Low 47
## 2 Middle 95
## 3 High 58
## 4 Total 200
and we can then replace the column name Var1
with an appropriate name and also expand Freq
to “Frequency”.
colnames(tab3) = c("SES Category", "Frequency")
tab3
## SES Category Frequency
## 1 Low 47
## 2 Middle 95
## 3 High 58
## 4 Total 200
To generate relative frequencies we employ the prop.table() command and then run addmargins() to generate the totals. You have to run this on the first frequency table you created (tab1) and recognize that the default product of prop.table() is a table of proportions, which can be converted into percentages by multiplying by 100, as shown below in tab4 and tab5, respectively.
tab4 = prop.table(tab1)
tab4
##
## Low Middle High
## 0.235 0.475 0.290
tab5 = tab4 * 100
tab5
##
## Low Middle High
## 23.5 47.5 29.0
What if we wanted to squeeze these two into a single table so that the reader could see both the frequencies and the relative frequencies? This takes a bit more work but is doable, as shown below.
tab00 = cbind(tab1, tab5)
tab01 = colSums(tab00)
tab02 = rbind.data.frame(tab00, tab01)
rownames(tab02)[4] = "Total"
colnames(tab02) = c("Frequency", "Relative Frequency (%)")
tab02
## Frequency Relative Frequency (%)
## Low 47 23.5
## Middle 95 47.5
## High 58 29.0
## Total 200 100.0
If I wanted to generate a contingency table, perhaps of the program a student is in and the student’s SES, I could generate that as follows:
tab6 = table(hsb2$prog, hsb2$ses)
tab6
##
## Low Middle High
## General 16 20 9
## Academic 19 44 42
## Vocational 12 31 7
tab7 = addmargins(tab6, FUN = Total, quiet = TRUE)
tab7
##
## Low Middle High Total
## General 16 20 9 45
## Academic 19 44 42 105
## Vocational 12 31 7 50
## Total 47 95 58 200
The quiet = TRUE
piece of the command suppresses messages from R that tell you how the totals were calculated, which will be first by row and then by column.
Relative frequencies (as percentages) can be generated via
tab8 = prop.table(tab6) * 100
tab8
##
## Low Middle High
## General 8.0 10.0 4.5
## Academic 9.5 22.0 21.0
## Vocational 6.0 15.5 3.5
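Incidentally, these cell percentages should sum to 100 across the whole table; a quick check:

sum(tab8)
## [1] 100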
tab8 shows you the relative frequency of each cell, relative, that is, to the sample size of 200. However, if you want row or column percentages, the command has to be modified thus:
tab9 = prop.table(tab6, 1) * 100
tab9
##
## Low Middle High
## General 35.55556 44.44444 20.00000
## Academic 18.09524 41.90476 40.00000
## Vocational 24.00000 62.00000 14.00000
tab10 = addmargins(tab9, FUN = Total, quiet = TRUE, margin = 2)
tab10
##
## Low Middle High Total
## General 35.55556 44.44444 20.00000 100.00000
## Academic 18.09524 41.90476 40.00000 100.00000
## Vocational 24.00000 62.00000 14.00000 100.00000
tab11 = prop.table(tab6, 2) * 100
tab11
##
## Low Middle High
## General 34.04255 21.05263 15.51724
## Academic 40.42553 46.31579 72.41379
## Vocational 25.53191 32.63158 12.06897
tab12 = addmargins(tab11, FUN = Total, quiet = TRUE, margin = 1)
tab12
##
## Low Middle High
## General 34.04255 21.05263 15.51724
## Academic 40.42553 46.31579 72.41379
## Vocational 25.53191 32.63158 12.06897
## Total 100.00000 100.00000 100.00000
Notice what the 1 and 2 are doing: in prop.table(), margin = 1 calculates row proportions (each row sums to 1) while margin = 2 calculates column proportions; in addmargins(), margin = 1 appends a Total row while margin = 2 appends a Total column.
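A quick way to convince yourself which margin is which:

rowSums(prop.table(tab6, 1) * 100)  # every row sums to 100
colSums(prop.table(tab6, 2) * 100)  # every column sums to 100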
One can coerce the frequencies and relative frequencies into a single table as well but it requires a bit of work if using base R commands and is slightly easier with other packages such as dplyr
and data.table
. I show you a simple dplyr
example below:
library(dplyr)
tab13 = group_by(hsb2, ses, prog) %>% summarise(Frequency = n()) %>%
    mutate(Percent = (Frequency/sum(Frequency)) * 100)
library(tidyr)
tab14 = gather(tab13, key = "measure", value = "value", Frequency:Percent)
library(reshape2)
tab15 = dcast(tab14, ses ~ measure + prog, value.var = "value")
colnames(tab15) = c("SES", "General (n)", "Academic (n)", "Vocational (n)",
    "General (%)", "Academic (%)", "Vocational (%)") # Since the names are unwieldy after dcast
tab15
##      SES General (n) Academic (n) Vocational (n) General (%)
## 1    Low          16           19             12    34.04255
## 2 Middle          20           44             31    21.05263
## 3   High           9           42              7    15.51724
##   Academic (%) Vocational (%)
## 1     40.42553       25.53191
## 2     46.31579       32.63158
## 3     72.41379       12.06897
3.4.2 Grouped Frequency Tables
Often, with numeric variables that take on a large number of values it becomes necessary to group the values into meaningful categories prior to generating a frequency table. For example, we may be interested in grouping reading scores to ease interpretation. Below I show you how to do so with the cut()
function.
summary(hsb2$read)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 28.00 44.00 50.00 52.23 60.00 76.00
my.breaks = c(28, 38, 48, 58, 68, 78)
hsb2$grouped.read1 = cut(hsb2$read, breaks = my.breaks, right = FALSE)
tab.g1 = table(hsb2$grouped.read1)
tab.g1
##
## [28,38) [38,48) [48,58) [58,68) [68,78)
## 14 68 62 36 20
hsb2$grouped.read2 = cut(hsb2$read, breaks = my.breaks, right = TRUE)
tab.g2 = table(hsb2$grouped.read2)
tab.g2
##
## (28,38] (38,48] (48,58] (58,68] (68,78]
## 13 69 61 47 9
I first ran the summary() command to see the minimum and maximum values of reading scores. I then calculated the range, and calculated what the gap would be between successive groups if I wanted five groups. That gap came out to be 9.6 so I rounded it up to 10. I then specified how I wanted reading scores to be broken – the first group would start at 28, the next at 38, and so on. This was done via the my.breaks vector that is passed to cut(). I also used right = FALSE and then right = TRUE to show you the difference between “right-open” and “right-closed” groups. This means that a reading score of exactly 48 would be included in the 48-58 group with right = FALSE but would fall in the 38-48 group if we used right = TRUE. So right = FALSE says that a value equal to the upper limit (48) of a group interval (38-48) is not counted in that group but in the next one (48-58). You should see this by running the commands and opening hsb2. Look for read = 48 and see where it falls in grouped.read1 versus grouped.read2. Then do the same for read = 68. Also notice that read = 28 gets dropped in grouped.read2, where we had right = TRUE, which is unfortunate. So for all practical purposes, right = FALSE would be preferable here. I have not dressed up the frequency table or calculated relative frequencies since you now know how to do that, given the earlier examples.
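To see the boundary rule in isolation, here is a minimal sketch using the breaks defined above:

cut(48, breaks = my.breaks, right = FALSE)  # [48,58): counted in the 48-58 group
cut(48, breaks = my.breaks, right = TRUE)   # (38,48]: counted in the 38-48 group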
3.4.3 Bar-charts
We could take our categorical variables and generate bar-charts in base R
, or then with some of the other packages, namely ggplot2
and lattice
. I will show you a bit of base R
and lattice
but we will end up using ggplot2
the most. Let us start with a simple bar-chart of ses frequencies.
tab.a = table(hsb2$ses)
barplot(tab.a, ylim = c(0, 110), ylab = "Frequency", xlab = "Socioeconomic Status",
col = "cornflowerblue")
It would be more useful to show the relative frequencies, and that is easily done.
tab.b = prop.table(tab.a) * 100
barplot(tab.b, ylim = c(0, 60), ylab = "Relative Frequency (%)",
xlab = "Socieconomic Status", col = "cornflowerblue")
In ggplot2, the same graph is generated via
library(ggplot2)
ggplot(hsb2, aes(ses, fill = ses)) + geom_bar(width = 0.5) +
theme(legend.position = "none") + xlab("Socioeconomic Status") +
ylab("Frequency") + scale_y_continuous(limits = c(0, 100)) +
theme_ipsum_rc()
Similarly, we can generate a bar-chart of ses by prog as follows:
ggplot(hsb2, aes(x = ses, fill = prog)) + geom_bar(width = 0.5,
position = "dodge") + theme(legend.position = "bottom") +
xlab("Socioeconomic Status") + ylab("Frequency") + scale_y_continuous(limits = c(0,
50)) + theme_ipsum_rc()
If we wanted relative frequencies as proportions, then
ggplot(hsb2, aes(ses, group = prog)) + geom_bar(aes(y = ..prop..,
fill = factor(..x..)), stat = "count") + scale_y_continuous(labels = scales::percent,
limits = c(0, 0.65)) + xlab("Socioeconomic Status") + ylab("Percent") +
facet_wrap(~prog) + theme(legend.position = "none") + geom_text(aes(label = scales::percent(..prop..),
y = ..prop..), stat = "count", vjust = -0.5, size = 3.5) +
theme_ipsum_rc()
3.4.4 Box-plots
Ah, box-plots, my favorite graphic. Here we need a numeric variable so let us use science scores. I’ll build two plots, one with science scores of all students and the next breaking this out by ses and then by schtyp.
ggplot(hsb2, aes(y = science, x = "")) + geom_boxplot(fill = "cornflowerblue") +
xlab("") + coord_flip() + ylab("Science Scores") + theme_ipsum_rc()
ggplot(hsb2, aes(y = science, x = ses)) + geom_boxplot(aes(fill = ses)) +
xlab("Socioeconomic Status") + coord_flip() + ylab("Science Scores") +
theme(legend.position = "none") + theme_ipsum_rc()
If we wanted to further break these out by schtyp, we could run
ggplot(hsb2, aes(y = science, x = ses)) + geom_boxplot(aes(fill = ses)) +
xlab("Socioeconomic Status") + coord_flip() + ylab("Science Scores") +
theme(legend.position = "none") + facet_wrap(~schtyp) + theme_ipsum_rc()
And further yet, if we wanted to disaggregate these by the student’s sex,
ggplot(hsb2, aes(y = science, x = ses)) + geom_boxplot(aes(fill = ses)) +
xlab("Socioeconomic Status") + coord_flip() + ylab("Science Scores") +
theme(legend.position = "none") + facet_wrap(~schtyp + female) +
theme_ipsum_rc()
3.4.5 Histograms
Histograms are also useful for plotting numerical data. I’ll stay with reading scores to draw a link between our histograms and the grouped frequency table we drew earlier.
ggplot(hsb2, aes(read)) + geom_histogram(fill = "cornflowerblue",
color = "white") + theme_ipsum_rc()
Note the warning: stat_bin()
using bins = 30
. Pick better value with binwidth
. By default, ggplot2
will pick 30 groups, which isn’t very helpful since that spreads the data very thin. We could try to set the breaks to match what we had in the grouped frequency table.
ggplot(hsb2, aes(read)) + geom_histogram(fill = "cornflowerblue",
color = "white", breaks = seq(28, 78, by = 10)) + theme_ipsum_rc()
We could also break this down by the student’s sex or any other categorical variable, as shown below.
ggplot(hsb2, aes(read)) + geom_histogram(fill = "cornflowerblue",
color = "white", breaks = seq(28, 78, by = 10)) + facet_wrap(~female) +
theme_ipsum_rc()
3.4.6 Scatterplots
Sticking with the science scores, I’ll draw several scatterplots by adding writing scores into the mix. I will then break these out for specific groups.
ggplot(hsb2, aes(x = write, y = science)) + geom_point() + xlab("Writing Scores") +
ylab("Science Scores") + theme_ipsum_rc()
ggplot(hsb2, aes(x = write, y = science)) + geom_point(aes(color = ses)) +
xlab("Writing Scores") + ylab("Science Scores") + theme(legend.position = "bottom") +
theme_ipsum_rc()
Note that this isn’t very helpful since it is hard to distinguish any patterns by ses so we can keep it simple by just breaking out the scatterplot by ses.
ggplot(hsb2, aes(x = write, y = science)) + geom_point() + xlab("Writing Scores") +
ylab("Science Scores") + facet_wrap(~ses) + theme_ipsum_rc()
3.4.7 Line charts
Line charts are ideal for displaying trends in a numerical variable. Most often you will see them used with aggregate estimates of say, income, population size, immigration numbers, stock prices, money supply, inflation, unemployment and the like. I’ll pull a particular data-set that is bundled with the plotly
package.
library(plotly)
data(economics)
names(economics)
## [1] "date" "pce" "pop" "psavert" "uempmed" "unemploy"
ggplot(economics, aes(x = date, y = uempmed)) + geom_line() +
    xlab("Date") + ylab("Median Duration of Unemployment (weeks)") + theme_ipsum_rc()
If I needed to add multiple time-series to a single plot we could run the following code. Note I have switched to a different data-set for this example.
library(gapminder)
data(gapminder)
names(gapminder)
## [1] "country" "continent" "year" "lifeExp" "pop" "gdpPercap"
library(dplyr)
df = gapminder %>% group_by(year, continent) %>% summarise(LifeExp = median(lifeExp))
ggplot(df, aes(x = year, y = LifeExp, group = continent)) + geom_line(aes(color = continent)) +
geom_point() + xlab("Year") + ylab("Median Life Expectancy (in years)") +
theme(legend.position = "bottom") + theme_ipsum_rc()
Notice what we had to do for the last plot. Since the gapminder
data-set has country-level data at five-year intervals, we had to first calculate a single value per continent per year, and the variable I chose was lifeExp
. Thereafter, the plotting is straightforward, with geom_line()
drawing the lines and geom_point()
drawing the points (to aid in readability of the plot).
3.4.8 Joy Plots
These plots have a fascinating story and are a recent entry into the R toolkit. For more on their roots see here. I love them as much for their aesthetics as for their ability to show similarities and differences between distributions of the same phenomenon over time or space.
library(viridis)
library(ggridges)
library(ggthemes)
ggplot(lincoln_weather, aes(x = `Mean Temperature [F]`, y = Month)) +
geom_density_ridges(scale = 3, alpha = 0.3, aes(fill = Month)) +
labs(title = "Temperatures in Lincoln NE", subtitle = "Mean temperatures (Fahrenheit) by month for 2016\nData: Original CSV from the Weather Underground") +
theme_ridges() + theme(axis.title.y = element_blank(), legend.position = "none")
Just as a means of comparing alternative graphics, here is a joyplot of reading scores by ses.
ggplot(hsb2, aes(x = read, y = ses)) + geom_density_ridges(scale = 3,
alpha = 0.3, aes(fill = ses)) + labs(title = "Reading Scores in the hsb2 Data",
subtitle = "(by SES)") + theme_ridges() + theme(axis.title.y = element_blank(),
legend.position = "none")
3.5 Practice Problems
Problem 1
Download the monthly Great Lakes water level dataset (SPSS format from here; Excel format from here). Construct an appropriate chart to display the data for Lake Superior. Be sure to label the x- and y-axes, and to title the chart. Note that water level is in meters.
Problem 2
Download the number of births per 10,000 23-year-old women, U.S., 1917-1975 (SPSS format from here; Excel format from here). Construct an appropriate chart to display the trend in the data. Be sure to label the x- and y-axes, and to title the chart.
Problem 3
Download the winning speeds (in kilometers per hour) for several men’s track and field distances at world meets over the 1900-2012 period (SPSS format from here; Excel format from here). Construct an appropriate chart to display the speeds for the 100 meter dash. Be sure to label the x- and y-axes, and to title the chart. Note that the data are monthly and replicate the speed from the preceding month if the fastest speed was not eclipsed.
Problem 4
Use this data-set, also used in Practice Problem 4 in the preceding Chapter, noting these details of each variable. Construct a frequency table for belief in life after death, showing both the frequencies and the relative frequencies (as percentages). Based on the table, what do most people seem to believe? Report the percentage for your answer. Construct an appropriate chart for these data, making sure to label all axes and to provide a title.
Problem 5
Construct an appropriate chart that shows the relationship between high school GPA and college GPA. Label both axes and title the chart. What does this plot show? Are the two positively/negatively related? How strong would you guess the relationship is?
Problem 6
Construct a contingency table of vegetarianism against belief in life after death. What percent of vegetarians believe in life after death? What percent of those who believe in life after death are vegetarians? Use an appropriate chart to show the relationship between vegetarianism and belief in life after death. Label everything.
Problem 7
Construct a grouped frequency table with five groups
of the variable age
. Report both the frequencies and the relative frequencies (as percentages). Plot the relative frequencies using an appropriate chart. Label everything as usual. What is the modal age group?
The next set of questions revolve around the 2016 Boston Marathon race results available here. The dataset contains the following variables:
- Bib = the runner’s bib number
- Name = the runner’s name
- Age = the runner’s age (in years)
- M/F = the runner’s gender (M = Male; F = Female)
- City = the runner’s home city
- State = the runner’s home state
- Country = the runner’s home country
- finishtime = the runner’s time (in seconds) to the finish line
Problem 8
What countries had the largest and second-largest number of runners in the race?
Problem 9
What was the distribution of the runners’ gender? Use a suitable chart to reflect the distribution.
Problem 10
Construct a grouped frequency table of the runners’ age, using the following groupings – 18-25; 25-32; 32-39; 39-46; 46-53; 53-60; 60-67; 67-74; 74-81; 81-88. Also construct a grouped histogram. What is the modal age group?
Problem 11
Draw a scatter-plot of runners’ age and finish times. Does this show any relationship? If it does, what sort of a relationship?
Problem 12
Using a reasonable grouping structure, construct country-specific histograms of finish times for runners from each of the following countries – AUS, BRA, CAN, CHN, FRA, GBR, GER, ITA, JPN, MEX, and USA. Are finish times skewed for each country? Is the direction of the skew similar? Which country seems to have the least skew? Which country seems to have the most skew?
Use this data-set, also used in Practice Problem 5
in the preceding Chapter, to answer the following questions.
Problem 13
Were employees who had a workplace accident more likely to leave than employees who did not have an accident? Briefly explain your conclusion with appropriate charts/tables.
Problem 14
Are low salary employees more likely to leave than medium/high salary employees? Why do you conclude as you do? Briefly explain your reasoning with appropriate charts/tables.
Problem 15
Construct an appropriate chart that shows the distribution of the number of years employees spent in the company. What patterns do you see in this chart?
Download the 2017 County Health Rankings data SPSS format from here, Excel format from here and the accompanying codebook.
Problem 16
You should see the following measures (measurename
)
- Adult obesity
- Children in poverty
- High school graduation
- Preventable hospital stays
- Unemployment rate
Looking at pairs of variables in turn, briefly (20 words or less) state whether you would expect a positive/negative/no relationship between the variables in each pair, and why.
Problem 17
Construct scatterplots of all pairs of the five variables.
Problem 18
How do the scatterplots you constructed in the preceding problem stack up against the expectations you listed in Problem 16
? Which plots surprised you and how? Be brief.
Indoor plumbing is lacking in many still-developing countries, particularly in rural areas. In some of these countries, particularly India, tigers aren’t the most dangerous animals.↩