Chapter 3 Visualizing Data
One of the first tasks of data analysis should be to look at our data. Certainly, you should open up the dataset and eyeball it to make sure everything looks okay, that the file isn’t mysteriously corrupted, etc., but that is not what I mean when I say “look”. Nor am I talking about data visualization
in the sense of the masterpieces of stalwarts such as Alberto Cairo, Mona Chalabi, Andy Kirk, Robert Kosara, Giorgia Lupi, David McCandless, Cole Nussbaumer Knaflic, Randy Olson, Lisa Charlotte Rost, Jon Schwabish, Nathan Yau, Stephanie Evergreen, Martin Wattenberg, Fernanda Viégas, and many others. We are not going to create visuals that rival these artists’ stunning works. Rather, our goal will be simpler: to use appropriate tables and graphics when analyzing and describing our data, both so that we understand what the data are telling us and so that we can communicate the resulting story to our audience. To accomplish these tasks we will obey simple rules that help us make wise decisions.
3.1 Visualizing Nominal/Ordinal Data
3.1.1 Bar-charts and Frequency Tables
If a variable is measured in nominal or ordinal terms, either (i) a bar-chart
or (ii) a frequency table
is a very effective display of how the variable is distributed: What value seems to be most common? Are high values as likely as low values? You can see both these visuals below, drawn from data on people killed by tigers, classified by what each victim was doing at the time of the attack. First the bar-chart, and then the frequency table.
library(ggplot2)     # for ggplot() and friends
library(hrbrthemes)  # for theme_ipsum_rc()
tigers = read.csv("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter02/chap02e2aDeathsFromTigers.csv")
ggplot(tigers, aes(activity)) + geom_bar(width = 0.2) + coord_flip() +
    ylab("Frequency") + xlab("Human Activity") + theme_ipsum_rc()
Total = sum  # a copy of sum() so the margin is labelled "Total" rather than "Sum"
tab.1 = table(tigers$activity)                      # frequencies
tab.2 = addmargins(tab.1, FUN = Total)              # frequencies plus a Total margin
tab.4 = addmargins(prop.table(tab.1), FUN = Total)  # proportions plus a Total margin
tab.4 = round(tab.4, digits = 2)
tab.5 = tab.4 * 100                                 # proportions converted to percentages
tiger.table = cbind.data.frame(tab.2, tab.4, tab.5)
tiger.table = tiger.table[c(1:2, 4, 6)]             # keep one label column plus the three measures
colnames(tiger.table) = c("Activity", "Frequency", "Proportion",
    "Percentage")
Activity | Frequency | Proportion | Percentage |
---|---|---|---|
Disturbing tiger kill | 5 | 0.06 | 6 |
Fishing | 8 | 0.09 | 9 |
Forest products | 11 | 0.12 | 12 |
Fuelwood/timber | 5 | 0.06 | 6 |
Grass/fodder | 44 | 0.50 | 50 |
Herding | 7 | 0.08 | 8 |
Sleeping in house | 3 | 0.03 | 3 |
Toilet | 2 | 0.02 | 2 |
Walking | 3 | 0.03 | 3 |
Total | 88 | 1.00 | 100 |
Notice how the frequency table is constructed. Each human activity is listed, followed by the frequency (the number of times someone was killed by a tiger while engaging in that activity), then a column with the frequency converted into a proportion, and finally the proportion converted into a percentage. None of this should be a mystery to you:
\[proportion = \dfrac{frequency}{Total}\] \[percentage = \left(\dfrac{frequency}{Total}\right) \times 100 = proportion \times 100\]
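As a quick check of these formulas in R, take the Grass/fodder row of the table above:

frequency = 44           # deaths while cutting grass/fodder
total = 88               # all recorded deaths
frequency/total          # proportion: 0.5
(frequency/total) * 100  # percentage: 50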
The last row in the table shows you the total frequency (a total of 88 humans were killed), the total proportion (which must sum to \(1\)), and the total percentage (which must sum to \(100\)). I love frequency tables such as these because they show all the data and the story: most kills occurred while the victims were cutting grass for cattle fodder, followed by kills that occurred while gathering other forest products, and the fewest while going to the toilet.4 We could improve upon this table and bar-chart by organizing them so that the most dangerous activity is listed/plotted first. This would have the added benefit of quickly drawing the reader’s/viewer’s eyes to the most dangerous activity, as shown in the sketch below.
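One way to do that reordering is sketched below; it leans on the forcats package, which the code above does not load, though any releveling approach would do. fct_infreq() sorts the activities by decreasing frequency and fct_rev() flips that order so the most common activity ends up on top once the coordinates are flipped:

library(forcats)  # not loaded above; used only for this sketch
ggplot(tigers, aes(x = fct_rev(fct_infreq(activity)))) + geom_bar(width = 0.2) +
    coord_flip() + ylab("Frequency") + xlab("Human Activity") + theme_ipsum_rc()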
What if the variable was an ordinal level variable, say something like an individual’s frequency of praying? We would be facing the same options, a bar-chart or a frequency table. However, we would have to be cautious here in making sure both the table and the bar-chart categories are logically arranged. That is, notice the Response Category column below; we start with the option “Never” and each category that follows reflects praying more often than the preceding category. This order must be maintained, which means we cannot rearrange the table or bar-chart in ascending/descending order of the “Frequency” column. If we made that mistake we would be destroying the natural order that exists in the ordinal variable we have before us.
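If such a variable arrives as plain text, R sorts its categories alphabetically unless told otherwise. A minimal, self-contained sketch of locking in the logical order (the labels are the response categories used below):

pray = c("Once a day", "Never", "A few times a week", "Never")
pray = factor(pray, levels = c("Never", "Once a week or less",
    "A few times a week", "Once a day", "Several times a day"),
    ordered = TRUE)
table(pray)  # categories now print in logical, not alphabetical, order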
site = "http://faculty.gvsu.edu/kilburnw/nes2008.RData"
load(file = url(site))
tab.1 = table(nes08$pray2)
tab.2 = addmargins(tab.1, FUN = Total)
tab.2 = as.data.frame(tab.2)
colnames(tab.2) = c("Response Category", "Frequency")
Response Category | Frequency |
---|---|
Never | 196 |
Once a week or less | 226 |
A few times a week | 264 |
Once a day | 346 |
Several times a day | 456 |
Total | 1488 |
ggplot(tab.2[1:5, ], aes(x = `Response Category`, y = Frequency)) +
geom_bar(stat = "identity", width = 0.2) + coord_flip() +
ylab("Frequency") + xlab("Response Category") + theme_ipsum_rc()
You might be wondering, what about pie charts? Why not use those with nominal/ordinal data? There are two camps: those who hate them and those who think they may be useful. If you are interested, read What do you mean I’m not supposed to use Pie Charts?! but I, for one, do not use them since I find them less useful than bar-charts.
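Should you want one anyway, a pie chart in ggplot2 is simply a stacked bar-chart drawn in polar coordinates; a minimal sketch using the tigers data from above:

ggplot(tigers, aes(x = "", fill = activity)) + geom_bar(width = 1) +
    coord_polar(theta = "y") + xlab("") + ylab("")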
3.1.2 Contingency Tables and Bar-charts
Frequency tables and bar-charts are also useful when you have two nominal/ordinal variables to work with and are interested in exploring differences between the two or more groups reflected in one variable on whatever is being measured by the second variable. Let us use a specific example, one where we ask if religiosity differs between Liberals, Moderates, and Conservatives. First the table.
tab.1 = table(nes08$pray2, nes08$libcon2)
tab.2 = addmargins(tab.1, FUN = Total, quiet = TRUE)
Religiosity | Liberal | Moderate | Conservative | Total |
---|---|---|---|---|
Never | 62 | 46 | 36 | 144 |
Once a week or less | 53 | 63 | 52 | 168 |
A few times a week | 53 | 69 | 76 | 198 |
Once a day | 76 | 89 | 107 | 272 |
Several times a day | 69 | 99 | 176 | 344 |
Total | 313 | 366 | 447 | 1126 |
If you look at Table 3.3, you see the frequencies reported in what we call a contingency table or a crosstabulation – a table where the distributions of two nominal/ordinal variables are jointly displayed. Reading such a table should be simple. In brief, you see how many Liberals said they prayed “Never”, “Once a week or less”, “A few times a week”, “Once a day”, or “Several times a day”. What pattern is evident from this table? Of the 447 Conservatives, most say they pray several times a day, followed by once a day, with the fewest saying they never pray. The pattern is similar for Liberals and Moderates, although the difference between the numbers responding “Never” and “Several times a day” is smaller for Liberals than it is for Conservatives.
tab.3 = prop.table(tab.1, margin = 1) * 100
tab.4 = prop.table(tab.1, margin = 2) * 100
tab.3 = round(tab.3, digits = 2)
tab.4 = round(tab.4, digits = 2)
The story could be helped a great deal if we calculated percentages for these frequencies. We have two choices when calculating these percentages. We could calculate row percentages, where we ask “what percent of those who said Never were Liberal, Moderate, and Conservative, respectively?” We could then repeat this for the other categories of religiosity. The result is shown in Table 3.4. This table shows quite clearly that 51.16% of those who say they pray several times a day are Conservative. Likewise, a plurality (43.06%) of those who say they never pray are Liberal.
Religiosity | Liberal | Moderate | Conservative |
---|---|---|---|
Never | 43.06 | 31.94 | 25.00 |
Once a week or less | 31.55 | 37.50 | 30.95 |
A few times a week | 26.77 | 34.85 | 38.38 |
Once a day | 27.94 | 32.72 | 39.34 |
Several times a day | 20.06 | 28.78 | 51.16 |
If we calculated column percentages we would be able to answer such questions as: “What percent of Liberals said they never pray, pray once a week or less, a few times a week, once a day, or several times a day?” The same could then be asked of Moderates and Conservatives, respectively. If we used the column percentages shown in Table 3.5 it would be obvious that Moderates and Conservatives tend to be more religious than Liberals. The essential takeaway here is that how you calculate the percentages (row versus column) depends upon what story you want to highlight, upon the question you want to ask and answer.
Religiosity | Liberal | Moderate | Conservative |
---|---|---|---|
Never | 19.81 | 12.57 | 8.05 |
Once a week or less | 16.93 | 17.21 | 11.63 |
A few times a week | 16.93 | 18.85 | 17.00 |
Once a day | 24.28 | 24.32 | 23.94 |
Several times a day | 22.04 | 27.05 | 39.37 |
Graphing these tables is easily done as well, and very effective; see Figure 3.3.
ggplot(subset(nes08, !is.na(libcon2) & !is.na(pray2)), aes(pray2,
..count..)) + geom_bar(aes(fill = libcon2), position = "dodge",
width = 0.4) + xlab("Religiosity") + ylab("Frequency") +
scale_fill_manual(values = c("#999999", "#E69F00", "#56B4E9"),
name = "Political Ideology", breaks = c("Liberal", "Moderate",
"Conservative"), labels = c("Liberal", "Moderate",
"Conservative")) + theme(legend.position = "top",
axis.text.x = element_text(angle = 90, size = 10, vjust = -2)) +
coord_flip() + theme_ipsum_rc()
3.2 Visualizing Interval/Ratio Data
We have several visuals we could draw with numeric data measured at either the interval or ratio level. Let us see these first before we look at the one frequency table that could be used with interval/ratio data.
3.2.1 The Histogram
hsb2 = read.table("./data/hsb2-2.csv", header = TRUE, sep = ",")
hsb2$female = factor(hsb2$female, labels = c("Male", "Female"))
hsb2$race = factor(hsb2$race, labels = c("Hispanic", "Asian",
"African American", "White"))
hsb2$ses = factor(hsb2$ses, labels = c("Low", "Middle", "High"))
hsb2$schtyp = factor(hsb2$schtyp, labels = c("Public", "Private"))
hsb2$prog = factor(hsb2$prog, labels = c("General", "Academic",
"Vocational"))
A histogram is used with a single numeric variable and looks like a bar-chart, except that there are no gaps between consecutive bars unless a bar happens to be empty. The example that follows uses a popular dataset known as hsb2, which contains information about 200 randomly selected students from a national survey of high school seniors called the High School and Beyond survey. The variables in this dataset include:
- id = student id
- female = sex (0 = male, 1 = female)
- race = ethnicity (1 = hispanic, 2 = asian, 3 = african-amer, 4 = white)
- ses = socioeconomic status (1 = low, 2 = middle, 3 = high)
- schtyp = type of school (1 = public, 2 = private)
- prog = type of program (1 = general, 2 = academic, 3 = vocational)
- read = standardized reading score
- write = standardized writing score
- math = standardized math score
- science = standardized science score
- socst = standardized social studies score
I’ll use read
(Reading scores on a standardized test) to plot a histogram. Notice a few things about the histogram in Figure 3.4. The height of the bars, representing how often a particular score occurs, varies a great deal. A few students have done poorly while a few have done very well, but the rest are distributed over the middle range of test scores.
ggplot(data = hsb2, aes(x = read)) + geom_histogram(fill = "CornflowerBlue",
bins = 30) + ylab("Frequency") + xlab("Reading Scores") +
theme_ipsum_rc()
We could further break this histogram apart by asking whether male and female students (female), private versus public school students (schtyp), or students of different races/ethnicities (race) perform differently on the reading test.
ggplot(data = hsb2, aes(x = read)) + geom_histogram(fill = "CornflowerBlue",
bins = 30) + ylab("Frequency") + xlab("Reading Scores") +
facet_wrap(~female) + theme_ipsum_rc()
ggplot(data = hsb2, aes(x = read)) + geom_histogram(fill = "CornflowerBlue",
bins = 30) + ylab("Frequency") + xlab("Reading Scores") +
facet_wrap(~schtyp) + theme_ipsum_rc()
ggplot(data = hsb2, aes(x = read)) + geom_histogram(fill = "CornflowerBlue",
bins = 30) + ylab("Frequency") + xlab("Reading Scores") +
facet_wrap(~race) + theme_ipsum_rc()
These plots are hard to read for a number of reasons. First, with just \(n = 200\) students in all, some groups have far more students than others, and as a result the histograms look very thin for the groups with fewer data points, making it difficult to tease out any patterns. Second, within any group there are too many distinct test scores, so we don’t see a clear pattern at all. To fix these problems we construct groups of scores, turning what is a numeric variable into an ordinal variable. Let us build this by first creating a grouped frequency table and then plotting this table as a histogram.
3.2.2 Grouped Frequency Tables
Because numeric variables take on too many values it is often easier to see their distribution by grouping the numeric values. For example, we could start by finding the lowest and highest reading scores. These turn out to be 28 and 76, respectively. We can build the groups as follows:
- Calculate the difference between the maximum and minimum values, which turns out to be \(76 - 28 = 48\)
- Decide how many groups we want. Good practice suggests no fewer than four or five groups and no more than six or seven. Say we go with 5 groups.
- Divide the gap between the maximum and minimum values by the desired number of groups: \(= \dfrac{48}{5} = 9.6\) and round up to the nearest whole number \(= 10\). This tells us how wide each group should be.
- The groups could thus be \(28-38, 38-48, 48-58, 58-68, 68-78\).
Notice that these groups span, start to finish, all the values of reading scores. But we’ll have to decide which group to include 38, 48, 58, and 68 in. Should 38 go in 28-38 or in 38-48? The choice doesn’t matter so long as we are consistent. Let us choose a rule that says include 38 in 38-48, 48 in 48-58, 58 in 58-68, and 68 in 68-78. Using this rule we now find each reading score and drop it into its group. Then we count how many scores fall in each group, creating our Frequency column. The bin arithmetic is scripted below, followed by the code that builds the grouped frequency table.
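Here is the arithmetic just described, as a short sketch:

rng = max(hsb2$read) - min(hsb2$read)  # 76 - 28 = 48
width = ceiling(rng/5)                 # 48/5 = 9.6, rounded up to 10
seq(28, 78, by = width)                # break points: 28 38 48 58 68 78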
hsb2$read_cut = cut(hsb2$read, breaks = c(28, 38, 48, 58, 68,
78), right = FALSE)
tab.1 = table(hsb2$read_cut)
tab.2 = addmargins(tab.1, FUN = Total)
tab.2 = as.data.frame(tab.2)
tab.2$Var1 = c("28-38", "38-48", "48-58", "58-68", "68-78", "Total")
colnames(tab.2) = c("Grouped Scores", "Frequency")
Grouped Scores | Frequency |
---|---|
28-38 | 14 |
38-48 | 68 |
48-58 | 62 |
58-68 | 36 |
68-78 | 20 |
Total | 200 |
Now it is easy to see that the largest number of students (68) falls in the 38-48 group of reading scores while the smallest frequency (14) occurs in the 28-38 group. Let us add to this table a percentage column.
tab.3 = prop.table(tab.1) * 100
tab.4 = addmargins(tab.3, FUN = Total)
tab.4 = as.data.frame(round(tab.4, digits = 2))
tab.5 = cbind.data.frame(tab.2, tab.4)
tab.6 = tab.5[c(1, 2, 4)]
colnames(tab.6) = c("Grouped Scores", "Frequency", "Percentage")
Grouped Scores | Frequency | Percentage |
---|---|---|
28-38 | 14 | 7 |
38-48 | 68 | 34 |
48-58 | 62 | 31 |
58-68 | 36 | 18 |
68-78 | 20 | 10 |
Total | 200 | 100 |
The percentages make it even easier to see how the distribution breaks down; 34 percent of the students have scores in the 38-48 range while only 10 percent scored in the highest bracket (68-78). We can also break this down by the three groups we used earlier.
ggplot(data = hsb2, aes(x = read)) + geom_histogram(fill = "CornflowerBlue",
breaks = seq(28, 78, by = 10), alpha = 0.4, color = "black") +
ylab("Frequency") + xlab("Reading Scores") + facet_wrap(~female) +
theme_ipsum_rc()
ggplot(data = hsb2, aes(x = read)) + geom_histogram(fill = "CornflowerBlue",
breaks = seq(28, 78, by = 10), alpha = 0.4, color = "black") +
ylab("Frequency") + xlab("Reading Scores") + facet_wrap(~schtyp) +
theme_ipsum_rc()
ggplot(data = hsb2, aes(x = read)) + geom_histogram(fill = "CornflowerBlue",
breaks = seq(28, 78, by = 10), alpha = 0.4, color = "black") +
ylab("Frequency") + xlab("Reading Scores") + facet_wrap(~race) +
theme_ipsum_rc()
Histograms were once quite popular but there is a better visual for looking at our numeric variables, one called the box-plot
that we’ll see in the next chapter. I am not a huge fan of histograms because grouping decisions influence the story being told. My advice would be to use grouped frequency tables instead of histograms to present a summary overview of your numeric data.
3.2.3 Scatterplots
With two numeric variables, a scatterplot comes in very handy if we want to explore how one variable might be related to another. For example, we may want to ask whether students who score high on the reading test also tend to score high on the mathematics test. This could be visually explored via a scatterplot, as shown below. The goal should be to look for a pattern: Does one variable increase as the other increases, or does one variable decrease as the other increases? Or does there seem to be no relationship at all?
ggplot(data = hsb2, aes(x = read, y = math)) + geom_point(color = "CornflowerBlue") +
ylab("Mathematics Scores") + xlab("Reading Scores") + theme_ipsum_rc()
Quite clearly, students who do well in Reading also tend to do well in Mathematics. You can see this by virtue of the upward, left-to-right tilt of the cloud of points.
What about Science scores and Mathematics scores? Is there any relationship between doing well in Science and doing well in Mathematics?
ggplot(data = hsb2, aes(x = science, y = math)) + geom_point(color = "CornflowerBlue") +
ylab("Mathematics Scores") + xlab("Science Scores") + theme_ipsum_rc()
Of course, just like everything else, we could break this down by any nominal/ordinal variable. The visuals below show you the breakouts by the student’s sex, school type, and race/ethnicity.
ggplot(data = hsb2, aes(x = science, y = math)) + geom_point(color = "CornflowerBlue") +
ylab("Mathematics Scores") + xlab("Science Scores") + facet_wrap(~female) +
theme_ipsum_rc()
ggplot(data = hsb2, aes(x = science, y = math)) + geom_point(color = "CornflowerBlue") +
ylab("Mathematics Scores") + xlab("Science Scores") + facet_wrap(~schtyp) +
theme_ipsum_rc()
ggplot(data = hsb2, aes(x = science, y = math)) + geom_point(color = "CornflowerBlue") +
ylab("Mathematics Scores") + xlab("Science Scores") + facet_wrap(~race) +
theme_ipsum_rc()
3.2.4 Line Graphs
If we have time-series data, such as the Presidential Approval data we saw earlier, then a line graph
works well because it shows you how the outcome/phenomenon varies over time. Let us take another example: real median household income in Ohio, Pennsylvania, and West Virginia, pulled from the FRED database. How have household incomes in these three states fared?
library(quantmod)
library(xts)
library(dygraphs)
# Get real median household income for Ohio, Pennsylvania, and West Virginia from FRED
oh = na.omit(getSymbols("MEHOINUSOHA672N", src = "FRED", from = "1949-12-31",
auto.assign = FALSE))
pa = na.omit(getSymbols("MEHOINUSPAA672N", src = "FRED", from = "1949-12-31",
auto.assign = FALSE))
wv = na.omit(getSymbols("MEHOINUSWVA672N", src = "FRED", from = "1949-12-31",
auto.assign = FALSE))
oh2 = as.data.frame(oh)
oh2$date = row.names(oh2)
pa2 = as.data.frame(pa)
pa2$date = row.names(pa2)
wv2 = as.data.frame(wv)
wv2$date = row.names(wv2)
ggplot() + geom_line(data = oh2, aes(x = as.Date(date), y = MEHOINUSOHA672N),
color = "blue") + geom_line(data = wv2, aes(x = as.Date(date),
y = MEHOINUSWVA672N), color = "red") + geom_line(data = pa2,
aes(x = as.Date(date), y = MEHOINUSPAA672N), color = "green") +
xlab("Year") + ylab("Real Median Household Income") + theme(plot.title = element_text(lineheight = 0.7,
face = "bold")) + annotate("text", x = as.Date("1997-01-01"),
y = 35000, label = "West Virginia") + annotate("text", x = as.Date("1988-01-01"),
y = 45000, label = "Pennsylvania") + annotate("text", x = as.Date("1990-01-01"),
y = 55000, label = "Ohio") + theme_ipsum_rc()
These plots aren’t useful just for financial or election data; they are ideally suited for any phenomenon that is measured over time: for example, the size of the immigrant population over the years, or even the number of lynx pelts reported in Canada per year from 1752 to 1819.
library(readxl)  # for read_excel()
immig = read_excel("~/Documents/Data Hub/MPI-Data-Hub_USImmigFlow_since1820_2014.xlsx",
    skip = 7)
immig = immig[c(1:2)]
immig = subset(immig, !is.na(`Number of Legal Permanent Residents`))
immig$Year = as.numeric(immig$Year)
ggplot(immig, aes(x = Year, y = `Number of Legal Permanent Residents`)) +
geom_path(color = "darkred") + xlab("Year") + ylab("No. of Legal Permanent Residents") +
theme_ipsum_rc()
library(abd)
data(Lynx)
ggplot(Lynx, aes(x = year, y = pelts)) + geom_path(color = "darkred") +
xlab("Year") + ylab("Number of Lynx Pelts") + theme_ipsum_rc()
3.2.5 Polar Charts
These charts are helpful for visualizing data that might have otherwise been explored via a bar-chart. For example, say you want to look at the miles per gallon given by a number of different automobiles. We could use a bar-chart, as shown below:
mtcars$car = row.names(mtcars)
p = ggplot(mtcars, aes(x = car, y = mpg, fill = mpg)) + geom_bar(stat = "identity") +
scale_fill_gradient(low = "red", high = "white", limits = c(5,
40)) + theme(axis.title.y = element_text(angle = 0),
legend.position = "top", axis.text.x = element_text(angle = 45,
vjust = 1, hjust = 1, size = 6)) + ylab("") + xlab("Automobile") +
theme_ipsum_rc()
Quite obviously, the Toyota Corolla has the best fuel economy, followed by the Fiat 128, while the Cadillac Fleetwood is tied with the Lincoln Continental for the worst fuel economy. A polar chart would present the same information as seen in the bar-chart, albeit in a more aesthetically pleasing manner.
p + coord_polar() + aes(x = reorder(car, mpg)) + theme(axis.text.x = element_text(angle = -20,
size = 6), legend.position = "top") + xlab("") + ylab("") +
theme_ipsum_rc()
3.3 Some Essential Rules for Good Visualizations
There are several rules, some favored by this expert or that, but regardless of the source of a rule its merit is not in doubt. So here are the ones I try to follow (although I do break them at times, sometimes by design and other times unintentionally). It all starts, of course, with Edward Tufte’s maxims: show the data, tell the truth, help the viewer think about the information rather than the design, encourage the eye to compare the data, and make large data sets coherent.
A more specific listing of the rules I try to keep in mind follows:
- Do not include anything in your visualization that is not informative. Give titles and subtitles where needed and make these stand on their own. At the same time, do not clutter your charts and tables with too much information, otherwise the reader gets lost.
- Use colors wisely. Is this visual to be printed on a color printer? Is it part of a presentation in a poorly lit room? Choose bright colors that stand out from each other and yet are visible on a printed page or on the projection screen in the room. Remember, a fair share of the men seeing the visual could be color-blind, so use color palettes designed for color-blind individuals. If this is news to you, read How a dog sees a rainbow, and 12 other images that explain how we see color. Pay attention to what is being plotted: use sequential colors for ordered data values (that go from low to high or vice-versa); qualitative colors for nominal data; and diverging colors if the goal is to put equal emphasis on mid-range critical values and extremes at both ends of the data range, emphasizing the middle with light colors and the low/high extremes with dark colors that have contrasting hues.
- The visualization is not your personal Mona Lisa; it must be effective for the target audience and help you tell the story. Do not get carried away in creating it.
- Use a table if a table would be more effective than a graph, but remember: while tables are useful for audiences that will want to see more (not less) of the actual data you have, graphs are favored by those who want a quick take-away.
- Always have your y-axis (or x-axis if relevant) start at zero. If you don’t do this you can misrepresent the data. If you are forced to truncate an axis, point this out to the reader so that they can interpret dips and swells in line charts, or differences between the heights of frequency bars, with care. Go back and look at the scatterplots and line charts and try to visualize what these might look like if the y-axis had been forced to start at \(0\); they would certainly look different. For illustration purposes I have let the plotting software pick the starting coordinates. (A sketch of forcing a zero baseline appears after this list.)
- Combine multiple graphs/tables into a single figure if you can, so long as it does not lead to information overload.
- For non-technical audiences, round all percentages/proportions to no more than one decimal place, and ideally to no decimal place, unless doing so distorts the picture. For technical audiences, stay with two or four decimal places.
- Above all, start with pencil and paper and consider all alternate visualizations possible by drawing a rough sketch of what the finished product should look like.
- Consider using a choropleth map if geographic variation is one of the key narratives.
- Be prepared to alter your visual if it does not work. We are all hesitant to delete a page or a graphic and start from scratch, but refusing to do so locks you into something that isn’t working to begin with, and you will be stuck in a rut.
- Avoid jargon like the plague. We often think we sound intelligent when we use jargon but this distracts from the oral/written presentation. Think of your audience and write and present in a manner that will resonate with them.
- Show as much data as you can; it vastly improves the effectiveness of the visual narration.
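As promised in the axis rule above, here is a minimal sketch of forcing a zero baseline in ggplot2, using the hsb2 data from earlier in the chapter:

ggplot(hsb2, aes(x = read, y = math)) + geom_point(color = "CornflowerBlue") +
    expand_limits(y = 0) +  # stretch the y-axis so it includes 0
    ylab("Mathematics Scores") + xlab("Reading Scores") + theme_ipsum_rc()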
3.4 Tables and Graphics with R
One could build the graphics we have seen in the preceding sections in R using several different approaches. However, the primary graphics package we will use is ggplot2. If necessary, we will switch to lattice, and perhaps even to base R graphics. The examples you will see in this section are not the only way to generate a given table or a particular graph. In fact, there are many paths in R that lead to the same shore, which is one of the things that makes it such popular software for data analysis and visualization! Given this abundance of approaches I have chosen to share with you the way I go about a particular task.
3.4.1 Building Tables of Frequencies and Descriptive Statistics
We start with the simple stuff – frequency tables! Recall that frequency tables and bar-charts are often the best means of showing the distribution of a categorical variable. To see how we do it, let us load the hsb2
data. Once it is loaded, I’d like to work with the ses
, female
, schtyp
, and prog
variables.
load("~/Documents/Teaching/MPA 6020/Data/hsb2.RData")
Say we want a frequency table of ses
. This can be generated via
tab1 = table(hsb2$ses)
tab1
##
## Low Middle High
## 47 95 58
The table()
command generates a frequency table but no totals, so if we want them we utilize the addmargins()
command as shown below.
tab2 = addmargins(tab1)
tab2
##
## Low Middle High Sum
## 47 95 58 200
Now we see the total, Sum
. But folks are used to seeing the word “Total”, and to replace “Sum” with “Total” we modify the command as follows:
Total = sum
tab2 = addmargins(tab1, FUN = Total)
tab2
##
## Low Middle High Total
## 47 95 58 200
We basically created a new function called Total
that duplicates what sum
does. We then explicitly asked addmargins()
to use the function called Total by specifying FUN = Total
.
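Incidentally, the same trick works with any summary function, since addmargins() labels the margin with the name of the function it is given; a quick sketch:

Median = median
addmargins(tab1, FUN = Median)  # should add a margin labelled "Median" (here, 58)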
Tables such as the ones we have created can be easily flipped into a stand-alone dataframe by specifying
tab3 = as.data.frame(tab2)
tab3
## Var1 Freq
## 1 Low 47
## 2 Middle 95
## 3 High 58
## 4 Total 200
and we can then replace the column name Var1
with an appropriate name and also expand Freq
to “Frequency”.
colnames(tab3) = c("SES Category", "Frequency")
tab3
## SES Category Frequency
## 1 Low 47
## 2 Middle 95
## 3 High 58
## 4 Total 200
To generate relative frequencies we employ the prop.table() command and then run addmargins() to generate the totals. You have to run this on the first frequency table you created (tab1) and recognize that the default product of prop.table() is a table of proportions, which can be converted into percentages by multiplying by 100, as shown below in tab4 and tab5, respectively.
tab4 = prop.table(tab1)
tab4
##
## Low Middle High
## 0.235 0.475 0.290
tab5 = tab4 * 100
tab5
##
## Low Middle High
## 23.5 47.5 29.0
What if we wanted to squeeze these two into a single table so that the reader could see both the frequencies and the relative frequencies? This takes a bit more work but is doable, as shown below.
tab00 = cbind(tab1, tab5)
tab01 = colSums(tab00)
tab02 = rbind.data.frame(tab00, tab01)
rownames(tab02)[4] = "Total"
colnames(tab02) = c("Frequency", "Relative Frequency (%)")
tab02
## Frequency Relative Frequency (%)
## Low 47 23.5
## Middle 95 47.5
## High 58 29.0
## Total 200 100.0
If I wanted to generate a contingency table, perhaps of the program a student is in and the student’s SES, I could generate that as follows:
tab6 = table(hsb2$prog, hsb2$ses)
tab6
##
## Low Middle High
## General 16 20 9
## Academic 19 44 42
## Vocational 12 31 7
tab7 = addmargins(tab6, FUN = Total, quiet = TRUE)
tab7
##
## Low Middle High Total
## General 16 20 9 45
## Academic 19 44 42 105
## Vocational 12 31 7 50
## Total 47 95 58 200
The quiet = TRUE
piece of the command suppresses messages from R that tell you how the totals were calculated, which will be first by row and then by column.
Relative frequencies (as percentages) can be generated via
tab8 = prop.table(tab6) * 100
tab8
##
## Low Middle High
## General 8.0 10.0 4.5
## Academic 9.5 22.0 21.0
## Vocational 6.0 15.5 3.5
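Incidentally, these cell percentages should sum to 100 across the whole table; a quick check:

sum(tab8)
## [1] 100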
tab8 shows you the relative frequency of each cell, relative, that is, to the sample size of 200. However, if you want row or column percentages, the command has to be modified thus:
tab9 = prop.table(tab6, 1) * 100
tab9
##
## Low Middle High
## General 35.55556 44.44444 20.00000
## Academic 18.09524 41.90476 40.00000
## Vocational 24.00000 62.00000 14.00000
tab10 = addmargins(tab9, FUN = Total, quiet = TRUE, margin = 2)
tab10
##
## Low Middle High Total
## General 35.55556 44.44444 20.00000 100.00000
## Academic 18.09524 41.90476 40.00000 100.00000
## Vocational 24.00000 62.00000 14.00000 100.00000
tab11 = prop.table(tab6, 2) * 100
tab11
##
## Low Middle High
## General 34.04255 21.05263 15.51724
## Academic 40.42553 46.31579 72.41379
## Vocational 25.53191 32.63158 12.06897
tab12 = addmargins(tab11, FUN = Total, quiet = TRUE, margin = 1)
tab12
##
## Low Middle High
## General 34.04255 21.05263 15.51724
## Academic 40.42553 46.31579 72.41379
## Vocational 25.53191 32.63158 12.06897
## Total 100.00000 100.00000 100.00000
Notice what the 1 and 2 are doing: in prop.table(), margin = 1 calculates row proportions (each row sums to 1) while margin = 2 calculates column proportions; in addmargins(), margin = 1 appends a Total row while margin = 2 appends a Total column.
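A quick way to convince yourself which margin is which:

rowSums(prop.table(tab6, 1) * 100)  # every row sums to 100
colSums(prop.table(tab6, 2) * 100)  # every column sums to 100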
One can coerce the frequencies and relative frequencies into a single table as well but it requires a bit of work if using base R commands and is slightly easier with other packages such as dplyr
and data.table
. I show you a simple dplyr
example below:
library(dplyr)
tab13 = group_by(hsb2, ses, prog) %>% summarise(Frequency = n()) %>%
    mutate(Percent = (Frequency/sum(Frequency)) * 100)
library(tidyr)
tab14 = gather(tab13, key = "measure", value = "value", Frequency:Percent)
library(reshape2)
tab15 = dcast(tab14, ses ~ measure + prog, value.var = "value")
colnames(tab15) = c("SES", "General (n)", "Academic (n)", "Vocational (n)",
    "General (%)", "Academic (%)", "Vocational (%)") # Since the names are unwieldy after dcast
tab15
##      SES General (n) Academic (n) Vocational (n) General (%)
## 1    Low          16           19             12    34.04255
## 2 Middle          20           44             31    21.05263
## 3   High           9           42              7    15.51724
##   Academic (%) Vocational (%)
## 1     40.42553       25.53191
## 2     46.31579       32.63158
## 3     72.41379       12.06897
3.4.2 Grouped Frequency Tables
Often, with numeric variables that take on a large number of values it becomes necessary to group the values into meaningful categories prior to generating a frequency table. For example, we may be interested in grouping reading scores to ease interpretation. Below I show you how to do so with the cut()
function.
summary(hsb2$read)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 28.00 44.00 50.00 52.23 60.00 76.00
my.breaks = c(28, 38, 48, 58, 68, 78)
hsb2$grouped.read1 = cut(hsb2$read, breaks = my.breaks, right = FALSE)
tab.g1 = table(hsb2$grouped.read1)
tab.g1
##
## [28,38) [38,48) [48,58) [58,68) [68,78)
## 14 68 62 36 20
hsb2$grouped.read2 = cut(hsb2$read, breaks = my.breaks, right = TRUE)
tab.g2 = table(hsb2$grouped.read2)
tab.g2
##
## (28,38] (38,48] (48,58] (58,68] (68,78]
## 13 69 61 47 9
I first ran the summary() command to see the minimum and maximum values of reading scores. I then calculated the range, and calculated what the gap would be between successive groups if I wanted five groups. That gap came out to be 9.6 so I rounded it up to 10. I then specified how I wanted reading scores to be broken – the first group would start at 28, the next at 38, and so on. This was done via the my.breaks vector that is passed to cut(). I also used right = FALSE and then right = TRUE to show you the difference between “right-open” and “right-closed” groups. This means that a reading score of exactly 48 would be included in the 48-58 group with right = FALSE but would fall in the 38-48 group if we used right = TRUE. So right = FALSE says that a value equal to the upper limit (48) of a group interval (38-48) is not counted in that group but in the next one (48-58). You should see this by running the commands and opening hsb2. Look for read = 48 and see where it falls in grouped.read1 versus grouped.read2. Then do the same for read = 68. Also notice that read = 28 gets dropped in grouped.read2, where we had right = TRUE, which is unfortunate. So for all practical purposes, right = FALSE would be preferable here. I have not dressed up the frequency table or calculated relative frequencies since you now know how to do that, given the earlier examples.
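To see the boundary rule in isolation, here is a minimal sketch using the breaks defined above:

cut(48, breaks = my.breaks, right = FALSE)  # [48,58): counted in the 48-58 group
cut(48, breaks = my.breaks, right = TRUE)   # (38,48]: counted in the 38-48 group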
3.4.3 Bar-charts
We could take our categorical variables and generate bar-charts in base R
, or then with some of the other packages, namely ggplot2
and lattice
. I will show you a bit of base R
and lattice
but we will end up using ggplot2
the most. Let us start with a simple bar-chart of ses frequencies.
tab.a = table(hsb2$ses)
barplot(tab.a, ylim = c(0, 110), ylab = "Frequency", xlab = "Socioeconomic Status",
col = "cornflowerblue")
It would be more useful to show the relative frequencies, and that is easily done.
tab.b = prop.table(tab.a) * 100
barplot(tab.b, ylim = c(0, 60), ylab = "Relative Frequency (%)",
xlab = "Socieconomic Status", col = "cornflowerblue")
In ggplot2, the same graph is generated via
library(ggplot2)
ggplot(hsb2, aes(ses, fill = ses)) + geom_bar(width = 0.5) +
theme(legend.position = "none") + xlab("Socioeconomic Status") +
ylab("Frequency") + scale_y_continuous(limits = c(0, 100)) +
theme_ipsum_rc()
Similarly, we can generate a bar-chart of ses by prog as follows:
ggplot(hsb2, aes(x = ses, fill = prog)) + geom_bar(width = 0.5,
position = "dodge") + theme(legend.position = "bottom") +
xlab("Socioeconomic Status") + ylab("Frequency") + scale_y_continuous(limits = c(0,
50)) + theme_ipsum_rc()
If we wanted relative frequencies as proportions, then
ggplot(hsb2, aes(ses, group = prog)) + geom_bar(aes(y = ..prop..,
fill = factor(..x..)), stat = "count") + scale_y_continuous(labels = scales::percent,
limits = c(0, 0.65)) + xlab("Socioeconomic Status") + ylab("Percent") +
facet_wrap(~prog) + theme(legend.position = "none") + geom_text(aes(label = scales::percent(..prop..),
y = ..prop..), stat = "count", vjust = -0.5, size = 3.5) +
theme_ipsum_rc()
3.4.4 Box-plots
Ah, box-plots, my favorite graphic. Here we need a numeric variable so let us use science scores. I’ll build two plots, one with science scores of all students and the next breaking this out by ses and then by schtyp.
ggplot(hsb2, aes(y = science, x = "")) + geom_boxplot(fill = "cornflowerblue") +
xlab("") + coord_flip() + ylab("Science Scores") + theme_ipsum_rc()
ggplot(hsb2, aes(y = science, x = ses)) + geom_boxplot(aes(fill = ses)) +
xlab("Socioeconomic Status") + coord_flip() + ylab("Science Scores") +
theme(legend.position = "none") + theme_ipsum_rc()
If we wanted to further break these out by schtyp, we could run
ggplot(hsb2, aes(y = science, x = ses)) + geom_boxplot(aes(fill = ses)) +
xlab("Socioeconomic Status") + coord_flip() + ylab("Science Scores") +
theme(legend.position = "none") + facet_wrap(~schtyp) + theme_ipsum_rc()
And further yet, if we wanted to disaggregate these by the student’s sex,
ggplot(hsb2, aes(y = science, x = ses)) + geom_boxplot(aes(fill = ses)) +
xlab("Socioeconomic Status") + coord_flip() + ylab("Science Scores") +
theme(legend.position = "none") + facet_wrap(~schtyp + female) +
theme_ipsum_rc()
3.4.5 Histograms
Histograms are also useful for plotting numerical data. I’ll stay with reading scores to draw a link between our histograms and the grouped frequency table we drew earlier.
ggplot(hsb2, aes(read)) + geom_histogram(fill = "cornflowerblue",
color = "white") + theme_ipsum_rc()
Note the warning: stat_bin()
using bins = 30
. Pick better value with binwidth
. By default, ggplot2
will pick 30 groups, which isn’t very helpful since that spreads the data very thin. We could try to set the breaks to match what we had in the grouped frequency table.
ggplot(hsb2, aes(read)) + geom_histogram(fill = "cornflowerblue",
color = "white", breaks = seq(28, 78, by = 10)) + theme_ipsum_rc()
We could also break this down by the student’s sex or any other categorical variable, as shown below.
ggplot(hsb2, aes(read)) + geom_histogram(fill = "cornflowerblue",
color = "white", breaks = seq(28, 78, by = 10)) + facet_wrap(~female) +
theme_ipsum_rc()
3.4.6 Scatterplots
Sticking with the science scores, I’ll draw several scatterplots by adding writing scores into the mix. I will then break these out for specific groups.
ggplot(hsb2, aes(x = write, y = science)) + geom_point() + xlab("Writing Scores") +
ylab("Science Scores") + theme_ipsum_rc()
ggplot(hsb2, aes(x = write, y = science)) + geom_point(aes(color = ses)) +
xlab("Writing Scores") + ylab("Science Scores") + theme(legend.position = "bottom") +
theme_ipsum_rc()
Note that this isn’t very helpful since it is hard to distinguish any patterns by ses so we can keep it simple by just breaking out the scatterplot by ses.
ggplot(hsb2, aes(x = write, y = science)) + geom_point() + xlab("Writing Scores") +
ylab("Science Scores") + facet_wrap(~ses) + theme_ipsum_rc()
3.4.7 Line charts
Line charts are ideal for displaying trends in a numerical variable. Most often you will see them used with aggregate estimates of say, income, population size, immigration numbers, stock prices, money supply, inflation, unemployment and the like. I’ll pull a particular data-set that is bundled with the plotly
package.
library(plotly)
data(economics)
names(economics)
## [1] "date" "pce" "pop" "psavert" "uempmed" "unemploy"
ggplot(economics, aes(x = date, y = uempmed)) + geom_line() +
    xlab("Date") + ylab("Median Duration of Unemployment (weeks)") + theme_ipsum_rc()
If I needed to add multiple time-series to a single plot we could run the following code. Note I have switched to a different data-set for this example.
library(gapminder)
data(gapminder)
names(gapminder)
## [1] "country" "continent" "year" "lifeExp" "pop" "gdpPercap"
library(dplyr)
df = gapminder %>% group_by(year, continent) %>% summarise(LifeExp = median(lifeExp))
ggplot(df, aes(x = year, y = LifeExp, group = continent)) + geom_line(aes(color = continent)) +
geom_point() + xlab("Year") + ylab("Median Life Expectancy (in years)") +
theme(legend.position = "bottom") + theme_ipsum_rc()
Notice what we had to do for the last plot. Since the gapminder
data-set has country-level data at five-year intervals, we had to first calculate a single value per continent per year, and the variable I chose was lifeExp
. Thereafter, the plotting is straightforward, with geom_line()
drawing the lines and geom_point()
drawing the points (to aid in readability of the plot).
3.4.8 Joy Plots
These plots have a fascinating story and are a recent entry into the R toolkit. For more on their roots see here. I love them as much for their aesthetics as for their ability to show similarities and differences between distributions of the same phenomenon over time or space.
library(viridis)
library(ggridges)
library(ggthemes)
ggplot(lincoln_weather, aes(x = `Mean Temperature [F]`, y = Month)) +
geom_density_ridges(scale = 3, alpha = 0.3, aes(fill = Month)) +
labs(title = "Temperatures in Lincoln NE", subtitle = "Mean temperatures (Fahrenheit) by month for 2016\nData: Original CSV from the Weather Underground") +
theme_ridges() + theme(axis.title.y = element_blank(), legend.position = "none")
Just as a means of comparing alternative graphics, here is a joyplot of reading scores by ses.
ggplot(hsb2, aes(x = read, y = ses)) + geom_density_ridges(scale = 3,
alpha = 0.3, aes(fill = ses)) + labs(title = "Reading Scores in the hsb2 Data",
subtitle = "(by SES)") + theme_ridges() + theme(axis.title.y = element_blank(),
legend.position = "none")
3.5 Practice Problems
Problem 1
Download the monthly Great Lakes water level dataset (SPSS format from here; Excel format from here). Construct an appropriate chart to display the data for Lake Superior. Be sure to label the x- and y-axes, and to title the chart. Note that water level is in meters.
Problem 2
Download the number of births per 10,000 23-year-old women, U.S., 1917-1975 (SPSS format from here; Excel format from here). Construct an appropriate chart to display the trend in the data. Be sure to label the x- and y-axes, and to title the chart.
Problem 3
Download the winning speeds (in kilometers per hour) for several men’s track and field distances at world meets over the 1900-2012 period (SPSS format from here; Excel format from here). Construct an appropriate chart to display the speeds for the 100 meter dash. Be sure to label the x- and y-axes, and to title the chart. Note that the data are monthly and replicate the speed from the preceding month if the fastest speed was not eclipsed.
Problem 4
Use this data-set, also used in Practice Problem 4 in the preceding Chapter, noting these details of each variable. Construct a frequency table for belief in life after death, showing both the frequencies and the relative frequencies (as percentages). Based on the table, what do most people seem to believe? Report the percentage for your answer. Construct an appropriate chart for these data, making sure to label all axes and to provide a title.
Problem 5
Construct an appropriate chart that shows the relationship between high school GPA and college GPA. Label both axes and title the chart. What does this plot show? Are the two positively/negatively related? How strong would you guess the relationship is?
Problem 6
Construct a contingency table of vegetarianism against belief in life after death. What percent of vegetarians believe in life after death? What percent of those who believe in life after death are vegetarians? Use an appropriate chart to show the relationship between vegetarianism and belief in life after death. Label everything.
Problem 7
Construct a grouped frequency table with five groups
of the variable age
. Report both the frequencies and the relative frequencies (as percentages). Plot the relative frequencies using an appropriate chart. Label everything as usual. What is the modal age group?
The next set of questions revolve around the 2016 Boston Marathon race results available here. The dataset contains the following variables:
- Bib = the runner’s bib number
- Name = the runner’s name
- Age = the runner’s age (in years)
- M/F = the runner’s gender (M = Male; F = Female)
- City = the runner’s home city
- State = the runner’s home state
- Country = the runner’s home country
- finishtime = the runner’s time (in seconds) to the finish line
Problem 8
What countries had the largest and second-largest number of runners in the race?
Problem 9
What was the distribution of the runners’ gender? Use a suitable chart to reflect the distribution.
Problem 10
Construct a grouped frequency table of the runners’ age, using the following groupings – 18-25; 25-32; 32-39; 39-46; 46-53; 53-60; 60-67; 67-74; 74-81; 81-88. Also construct a grouped histogram. What is the modal age group?
Problem 11
Draw a scatter-plot of runners’ age and finish times. Does this show any relationship? If it does, what sort of a relationship?
Problem 12
Using a reasonable grouping structure, construct country-specific histograms of finish times for runners from each of the following countries – AUS, BRA, CAN, CHN, FRA, GBR, GER, ITA, JPN, MEX, and USA. Are finish times skewed for each country? Is the direction of the skew similar? Which country seems to have the least skew? Which country seems to have the most skew?
Use this data-set, also used in Practice Problem 5
in the preceding Chapter, to answer the following questions.
Problem 13
Were employees who had a workplace accident more likely to leave than employees who did not have an accident? Briefly explain your conclusion with appropriate charts/tables.
Problem 14
Are low salary employees more likely to leave than medium/high salary employees? Why do you conclude as you do? Briefly explain your reasoning with appropriate charts/tables.
Problem 15
Construct an appropriate chart that shows the distribution of the number of years employees spent in the company. What patterns do you see in this chart?
Download the 2017 County Health Rankings data SPSS format from here, Excel format from here and the accompanying codebook.
Problem 16
You should see the following measures (measurename
)
- Adult obesity
- Children in poverty
- High school graduation
- Preventable hospital stays
- Unemployment rate
Looking at pairs of variables in turn, briefly (20 words or less) state whether you would expect a positive/negative/no relationship between the variables in each pair, and why.
Problem 17
Construct scatterplots of all pairs of the five variables.
Problem 18
How do the scatterplots you constructed in the preceding problem stack up against the expectations you listed in Problem 16
? Which plots surprised you and how? Be brief.
Indoor plumbing is lacking in many still-developing countries, particularly in rural areas. In some of these countries, particularly India, tigers aren’t the most dangerous animals.↩