5 Visualizing Data

Most of my graduate work initially revolved around SAS on the mainframe, followed by SST, SPSS, BMDP2T, Stata, and Limdep. In fact, I do not think I ever drew a visualization with most of these software suites; the focus was always on estimating one thing or another. Stata graphics were “meh” unless you knew the syntax well enough to circumvent the defaults and conjure a masterpiece. I had seen the work coming out of S and S-PLUS but only as a bystander, never a user. Then, one winter in Chicago, a graduate student taking a class taught by John Brehm at the University of Chicago asked me if I could help straighten out their code for visualizing a maximum likelihood function. That was my first encounter with R, and my world changed. R was a bare install then, with no RStudio or any other IDE in existence, and everything ran off a vanilla script or the R terminal. Yet, watching that function fill the plot window was a treat.

5.1 Graphics in base R

I was intrigued and, a year or so later, started my journey with R, initially just to create better visualizations than I ever had to that point, and just in base R, mind you! For example, say you needed to visualize the hsb2 data you worked with in Chapter 2, and you want a simple bar-chart of race in base R. All you need to do is execute plot(...) and you will get the basic plot. This is how you might draw it.

load("data/hsb2.RData")
plot(hsb2$race)

If we need to append a title, fill with some color, etc, we could certainly do that.

plot(
  hsb2$race, 
  main = "Distribution of Race", 
  sub = "Source: hsb2 Data", 
  ylim = c(0, 160),
  col = "cornflowerblue",
  xlab = "Student's Race (self-reported)",
  ylab = "Frequency"
  )

Beautiful, isn’t it? Notice the clean lines and spare layout. How about a bar-chart with two categorical variables?

table(
  hsb2$race, hsb2$female
  ) -> tab.01
barplot(
  tab.01, 
  ylim = c(0, 120),
  beside = TRUE,
  legend.text = TRUE,
  xlab = "Gender and Race (self-reported)",
  ylab = "Frequency",
  col = c("cornflowerblue", "salmon", "ForestGreen", "purple"),
  main = "Distribution of Race by Gender",
  sub = "Source: hsb2 Data"
  )

5.2 Using `{lattice}`

Deepayan Sarkar authored the {lattice} package to extend base R graphics to multivariate data, with the goal of allowing for “the creation of complex displays using relatively little code.” Paying respect to the second data visualization package I learned in R, here are a few {lattice} plots.

library(palmerpenguins)
data("penguins")
names(penguins)
#> [1] "species"           "island"            "bill_length_mm"    "bill_depth_mm"    
#> [5] "flipper_length_mm" "body_mass_g"       "sex"               "year"
library(lattice)
histogram(
  ~ body_mass_g | sex + species,
  data = penguins,
  xlab = "Body Mass (in grams)",
  main = "Distribution of Body Mass by Sex and Species"
  )

That is the familiar histogram built with the Palmer Penguins data from the {palmerpenguins} package. Again, watch the spare lines, both here and in the scatter-plot that follows.

xyplot(
  bill_length_mm ~ bill_depth_mm | species,
  groups = sex,
  data = penguins,
  xlab = "Bill Depth (in mm)",
  ylab = "Bill Length (in mm)",
  main = "Scatterplot of Bill Length and Depth",
  sub = "by Sex and Species",
  auto.key = TRUE
  )

There is a lot more we could do but I don’t want to spend too much time on base R and {lattice} plots since I hardly use them any longer. Instead, I do almost all of my visualization with {ggplot2} – one of the most popular graphics packages in R. Before we dive in though, a few minutes to pay homage to the man whose path-breaking work inspired Hadley Wickham to author {ggplot2} – Leland Wilkinson. I never had the good fortune to meet Wilkinson or to take a class with him, and hence will not attempt to summarize his contributions. Rather, I will leave it to those better placed to do so.

“During the late 1970s and early 1980s, Leland wrote SYSTAT, the first comprehensive, statistical software package designed expressly for microcomputers. It represented an end-run around the punch cards, queues and mainframes required for statistical analysis at that time. The program was the first of its kind to include comprehensive graphics driven by a command structure of universally applicable options, foreshadowing the graphical structure that Leland would more fully develop and articulate during the 1990s. SYSTAT also was the first software implementation of the now widely used heatmap display. He founded SYSTAT, a company of the same name, headquartered in Evanston, Ill., and later sold SYSTAT to SPSS in 1995. He went on to build a team of graphics programmers there who developed the nViZn platform that produces the visualizations in SPSS, Clementine, and other analytics services.

Leland wrote the seminal book on statistical graphics, his magnum opus, The Grammar of Graphics, in 1999. The Grammar of Graphics provided a new way of creating and describing data visualizations, a language — or grammar — for specifying visual elements on a plot, which was a completely novel idea that has fundamentally shaped modern data visualization. The book served as the foundation for the R package {ggplot2}, the Python Bokeh package, the R package {ggbio} and helped shape the Polaris project at Stanford University.”

library(tweetrmd)
tweet_embed("https://twitter.com/hadleywickham/status/1470419734487347200")

Lee Wilkinson is the reason that ggplot2 exists; not just because he wrote the Grammar of Graphics, but also because he was so kind and supportive to me when I was a young grad student thinking of trying to implement it. He will be missed. https://t.co/Zzzkk3yUmJ
— Hadley Wickham (@hadleywickham) December 13, 2021

5.3 Graphics with `{ggplot2}`

There is a vast ecosystem for {ggplot2} on the web. You can start with the Cookbook for R or the ggplot2 documentation. You can also search on Stack Overflow. The definitive guide is Kieran Healy’s Data Visualization: A Practical Introduction. Follow the right people on Twitter or subscribe to their blog feeds and you can learn a lot. In fact, just following TidyTuesday will teach you plenty.

5.3.1 The Mechanics of `{ggplot2}`

We have already mentioned that {ggplot2} is built on the grammar of graphics. Simply and, perhaps, even crudely put, this philosophy builds graphs by breaking up each graph into some essential components – data, aesthetics, and geometry. You specify the data with the data argument, then you map variables to the x and y coordinates with aes(), and finally you specify the geometry (i.e., that you want a bar-chart, a histogram, etc.) via one of the geom_ functions. Along the way you also have to make a choice – or use the default settings – about the scales to be used for plotting. Of course you have your choice of colors, legend placement, titles, subtitles, and so on to finish the graphic.⁴

In many ways the grammar of graphics is best understood with a hands-on example, and that is precisely what we are going to do. I will use a particular visualization to get us started, a scatter-plot of the total bill paid by a patron at a restaurant and the tip amount left for the server, with a linear regression line and 95% confidence bands drawn as well, and information on whether the bill-payer was male or female.

data(tips, package = 'reshape2')
library(ggplot2)
ggplot(
  data = tips, 
  aes(
    x = total_bill, 
    y = tip
    )
  ) + 
  geom_point(
    aes(
      color = sex
      )
    ) + 
  geom_smooth(
    method = 'lm', se = TRUE
    ) + 
  labs(
    x = "Total Bill",
    y = "Tip left for the Server"
    )

Figure 5.1: Tipping and Billing (1)

What should be very obvious from the preceding code is that ggplot2 builds a visualization piece by piece. You start with the data you want to use. In our case these are the tips data from the reshape2 package.

Next you decide the variables to be plotted and on what axis. Is it just a single variable? Two variables? More than two variables? Are one or more of these variables categorical? Which of these do you want on the x-axis and which one on the y-axis? Do you want to distinguish between groups represented by another variable? In our case, we have the total bill on the x-axis and the tip amount on the y-axis. The third variable, the bill-payer’s sex, is also shown via the differently colored points. Of course, these are all aesthetics, and we could have decided to have each point assume a size based on the bill (as shown below)

ggplot(
  data = tips, 
  aes(
    x = total_bill, 
    y = tip
    )
  ) + 
  geom_point(
    aes(
      color = sex, 
      size = total_bill
      )
    ) + 
  geom_smooth(
    method = 'lm', 
    se = TRUE
    ) + 
  labs(
    x = "Total Bill",
    y = "Tip left for the Server"
    )

Figure 5.2: Tipping and Billing (2)

or then on the basis of the tip left for the server (as shown below).

ggplot(
  tips, 
  aes(
    x = total_bill, 
    y = tip
    )
  ) + 
  geom_point(
    aes(
      color = sex, 
      size = tip
      )
    ) + 
  geom_smooth(
    method = 'lm',
    se = TRUE
    ) + 
  labs(
    x = "Total Bill",
    y = "Tip left for the Server"
    )

Figure 5.3: Tipping and Billing (3)

We could have also switched out the points for some other shape.

ggplot(
  tips, 
  aes(
    x = total_bill, 
    y = tip
    )
  ) + 
  geom_point(
    aes(
      color = sex
      ), 
    shape = 23
    ) + 
  geom_smooth(
    method = 'lm',
    se = TRUE
    ) + 
  labs(
    x = "Total Bill",
    y = "Tip left for the Server"
    )

Figure 5.4: Tipping and Billing (4)

Three geometries are visible in these plots: the points, a line, and a ribbon (the band of gray). There are other geometries that we can and will use in due course – bars, text, maps, densities, box-plots, histograms, paths, and so on.

Each geometry is also a layer, since the plot is like a blank canvas and we are adding elements to it, the point layer with the line layer and then the ribbon layer.
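To make the layering idea concrete, here is a minimal sketch that builds a plot one piece at a time and then inspects what has been stacked on the canvas. It uses the built-in mtcars data rather than tips (my substitution, so that nothing beyond {ggplot2} is needed), and the object names p0, p1, p2, and n_layers are purely illustrative.

```r
library(ggplot2)

# Start with the data and the aesthetics only: a blank canvas, no layers yet
p0 <- ggplot(data = mtcars, aes(x = wt, y = mpg))

# Add the points as the first layer
p1 <- p0 + geom_point(aes(color = factor(cyl)))

# Add the fitted line together with its confidence ribbon
p2 <- p1 + geom_smooth(method = "lm", se = TRUE)

# Count the layers stored in each plot object
n_layers <- c(length(p0$layers), length(p1$layers), length(p2$layers))
n_layers
```

In this sketch geom_smooth() is stored as a single layer even though it draws both the line and the ribbon, so the count grows by one with each geom_ call rather than with each visible shape.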

The preceding is a reduced form of all the various elements that {ggplot2} brings to the table. It is, in my opinion, one of the biggest developments to occur in the R world in the last decade, along with RStudio of course. Now it is time to get to the nuts and bolts of building graphics.

Before we move on, recall that we can rely on box-plots and histograms to explore the distribution of a numeric (scale) variable. Perhaps we are interested in reading scores and want to start with a histogram.

5.3.2 Histograms `geom_histogram(...)`

ggplot(
  data = hsb2, 
  aes(
    x = read
    )
  ) + 
  geom_histogram()

Figure 5.5: Old friends: The hsb2 data

You see a message displayed with the output; R is telling you that “stat_bin() using bins = 30. Pick better value with binwidth.” That is, for a histogram you need to lump the values of the variable into bins/groups, and unless you tell R how you want these bins constructed, R will automatically group the variable into 30 groups (unless there are fewer values). Maybe we want fewer groups, maybe 10. This can be done as follows:

ggplot(
  data = hsb2, 
  aes(
    x = read
    )
  ) + 
  geom_histogram(
    bins = 10
    )

Figure 5.6: Binning the histogram

We can customize this histogram further, changing the colors, the labels for the x-axis, the y-axis, adding a title, and so on.

ggplot(
  data = hsb2, 
  aes(
    x = read
    )
  ) + 
  geom_histogram(
    fill = "cornflowerblue"
    ) + 
  labs(
    title = "Histogram of Reading Scores",
    x = "Reading Score",
    y = "Frequency"
    )

Figure 5.7: Histograms of Reading Scores (1)

Note: A small snippet of the wide expanse of colors available in R can be seen here and you can always brew your own color palette (ask me and I’ll give you the code). See also this post by drsimonj, or this post.

What if we wanted to construct these histograms for male versus female students, or perhaps for each of the SES groups?

ggplot(
  data = hsb2, 
  aes(
    x = read
    )
  ) + 
  geom_histogram(
    fill = "tomato"
    ) + 
  labs(
    title = "Histogram of Reading Scores",
    x = "Reading Score",
    y = "Frequency"
    ) + 
  facet_wrap(
    ~ female
    )

Figure 5.8: Histograms of Reading Scores (2)

ggplot(
  data = hsb2, 
  aes(
    x = read
    )
  ) + 
  geom_histogram(
    fill = "steelblue"
    ) + 
  labs(
    title = "Histogram of Reading Scores",
    x = "Reading Score",
    y = "Frequency"
    ) + 
  facet_wrap(
    ~ ses
    )

Figure 5.9: Histograms of Reading Scores (3)

What if we wanted to break out the histogram by female/male students in public versus private schools?

ggplot(
  data = hsb2, 
  aes(
    x = read
    )
  ) + 
  geom_histogram(
    fill = "tomato"
    ) + 
  labs(
    title = "Histogram of Reading Scores",
    x = "Reading Score",
    y = "Frequency"
    ) + 
  facet_wrap(
    female ~ schtyp
    )

Figure 5.10: Histogram of Reading Scores (4)

So far we have used the default number of bins (i.e., groups) in generating these histograms. However, default settings may be a good exploratory start but are rarely optimal for the finished product. What might be more helpful here is to reduce the number of groups to a meaningful amount. Say I want to bin math scores. The first thing I could do is measure the range of math scores and then divide this range by the number of groups I want to end up with, to get an estimate of how wide each group should be. The range turns out to be $75 - 33 = 42$. If I divide this by 5 I get 8.4, so I’ll round this up to 9. Now, the groups could be 30-39, 39-48, 48-57, 57-66, 66-75, and will span all the data values.
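The arithmetic above can be sketched in a few lines of R. The minimum and maximum below are simply the values quoted in the text, typed in by hand as an assumption; with the data loaded, range(hsb2$math) would supply them. The names bin_width and bin_breaks are illustrative.

```r
# Assumed values: the minimum and maximum math scores quoted in the text
math_min <- 33
math_max <- 75
n_groups <- 5

# Width of each group: the range divided by the number of groups, rounded up
bin_width <- ceiling((math_max - math_min) / n_groups)

# Start at a round number at or below the minimum and step by the width
bin_breaks <- seq(from = 30, to = 75, by = bin_width)
bin_breaks
```

Starting the sequence at 30 rather than at 33 is a cosmetic choice; any starting point works as long as the first break is at or below the minimum and the last break is at or above the maximum.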

ggplot(
  hsb2, 
  aes(
    math
    )
  ) + 
  geom_histogram(
    breaks = seq(30, 75, by = 9),
    fill = "magenta", 
    color = "white"
    ) + 
  labs(
    x = "Math Scores",
    y = "Frequency"
    ) + 
  scale_x_continuous(
    breaks = seq(30, 75, by = 9)
    )

Figure 5.11: Histogram of Mathematics Scores (1)

Pay attention to the scale_x_continuous(...) command that might seem redundant but is helpful to label the bins on the x-axis. If I do not specify this scale then I end up with the following labels that do not match the breaks I specified:

ggplot(
  hsb2, 
  aes(
    math
    )
  ) + 
  geom_histogram(
    breaks = seq(30, 75, by = 9),
    fill = "magenta", 
    color = "white"
    ) + 
  labs(
    x = "Math Scores",
    y = "Frequency"
    )

Figure 5.12: Histogram of Mathematics Scores (with Mismatched Break Labels)

As in the preceding examples, we could break out this histogram by ses, sex, race, etc.

ggplot(
  hsb2, 
  aes(
    math
    )
  ) + 
  geom_histogram(
    breaks = seq(30, 75, by = 9), 
    fill = "magenta", 
    color = "white"
    ) + 
  labs(
    x = "Math Scores",
    y = "Frequency"
    ) + 
  facet_wrap(
    ~ schtyp
    ) + 
  scale_x_continuous(
    breaks = seq(30, 75, by = 9)
    )

Figure 5.13: Histogram of Mathematics Scores (2)

One could also be less specific and instead just specify the number of groups we want via the bins argument as shown below.

ggplot(
  hsb2, 
  aes(
    math
    )
  ) + 
  geom_histogram(
    bins = 5, 
    fill = "midnightblue", 
    color = "white"
    ) + 
  labs(
    x = "Math Scores",
    y = "Frequency"
    ) + 
  facet_wrap(
    ~ schtyp
    ) + 
  scale_x_continuous(
    breaks = seq(30, 75, by = 9)
    )

Figure 5.14: Histogram of Mathematics Scores (3)

5.3.2.1 Improving Comparability Across Groups

What would be a better way to build these plots so that one can compare the distribution of the same variable across groups? Well, one easy solution would be to stack them atop each other so that the viewer can quickly grasp the spread, skew, center, and any other patterns that might be present.

ggplot(
  hsb2, 
  aes(
    math
    )
  ) + 
  geom_histogram(
    bins = 5, 
    fill = "midnightblue", 
    color = "white"
    ) + 
  labs(
    x = "Math Scores",
    y = "Frequency"
    ) + 
  facet_wrap(
    ~ schtyp, ncol = 1
    ) + 
  scale_x_continuous(
    breaks = seq(30, 75, by = 9)
    )

Figure 5.15: Histogram of Mathematics Scores (4)

Since the grouping compresses patterns, I could just let the default bin-width be chosen here and see how that looks.

ggplot(
  hsb2, 
  aes(
    math
    )
  ) + 
  geom_histogram(
    fill = "midnightblue", 
    color = "white"
    ) + 
  labs(
    x = "Math Scores",
    y = "Frequency"
    ) + 
  facet_wrap(
    ~ schtyp, ncol = 1
    ) + 
  scale_x_continuous(
    breaks = seq(30, 75, by = 9)
    )

Figure 5.16: Histogram of Mathematics Scores (5)

Later on we will see another excellent option for comparability but for now we move on to kernel densities.

5.3.3 Kernel Density Plots `geom_density(...)`

When we construct a histogram, we choose the bin-widths (i.e., how many groups do we want and how wide should each group be?). As a result, histograms are not smooth, and depend on both the width of the bins and the end points of the bins. In addition, we end up putting into the same bin some data points whose values may in fact be closer to the adjacent bin. As such, the story histograms tell is often a choppy one because we have collapsed a continuous variable into discrete groups, creating artificial breaks. Kernel density plots get around these problems; they are smooth and do not depend on the end points of the bins.

A kernel density is a method of estimating the probability density function (PDF) of a continuous random variable without assuming any underlying distribution for the variable. The way it works is by moving a window of fixed width across the data, calculating a locally weighted average of the number of observations $(x_i)$ falling in the window. The smoothed plot is scaled so that it encompasses an area that sums to one.
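To see the mechanics, here is a minimal hand-rolled sketch of a Gaussian kernel density for the Old Faithful waiting times that ship with base R. The bandwidth comes from bw.nrd0(), the rule of thumb that density() uses by default; the evaluation grid and the names h, grid, kde, and area are my own choices for illustration.

```r
# Old Faithful waiting times ship with base R
x <- faithful$waiting

# Bandwidth: the rule of thumb that density() uses by default
h <- bw.nrd0(x)

# Points at which the density is evaluated
grid <- seq(min(x) - 3 * h, max(x) + 3 * h, length.out = 512)

# At each grid point, average the Gaussian weights of all observations
kde <- sapply(grid, function(g) mean(dnorm((g - x) / h)) / h)

# The area under the curve should be very close to one
area <- sum(kde) * (grid[2] - grid[1])
area

# Overlay the hand-rolled curve on R's own estimate
plot(density(x, bw = h), main = "Hand-rolled vs. built-in kernel density")
lines(grid, kde, col = "tomato", lty = 2)
```

With a Gaussian kernel the "window" has no hard edges: every observation contributes to every grid point, but observations far from the grid point receive weights close to zero.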

Choosing how wide this sliding window should be is, like the bin-width of a histogram, a matter of trial and error since we don’t want a bad choice influencing the data display. In our case, {ggplot2} will use the defaults for the kernel estimator, essentially the Gaussian smoothing kernel with band-width given by the standard deviation of the chosen smoothing kernel. Note that I am using base R here.
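One way such a comparison of kernels might be drawn in base R is sketched below. The three kernels picked here (Gaussian, Epanechnikov, and rectangular) are an assumption on my part, chosen because they are among the options that density() accepts; the colors and line types are arbitrary.

```r
waiting <- faithful$waiting

# Estimate the density once per kernel, leaving the bandwidth at its default
kernels <- c("gaussian", "epanechnikov", "rectangular")
dens <- lapply(kernels, function(k) density(waiting, kernel = k))
names(dens) <- kernels

# Draw the Gaussian estimate first, then overlay the other two
plot(
  dens[["gaussian"]],
  main = "Kernel densities of waiting time",
  xlab = "Waiting time (in minutes)"
  )
lines(dens[["epanechnikov"]], col = "tomato", lty = 2)
lines(dens[["rectangular"]], col = "steelblue", lty = 3)
legend(
  "topleft",
  legend = kernels,
  col = c("black", "tomato", "steelblue"),
  lty = 1:3,
  bty = "n"
  )
```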

Figure 5.17: Three smoothing kernels with the Old Faithful eruption waiting time data

Focus on the Gaussian kernel since that is the default, and then see the two examples drawn with the Palmer Penguins data-set.

ggplot(
  data = penguins, 
  aes(
    x = body_mass_g, 
    fill = species
    )
  ) + 
  geom_density(
    alpha = 0.3, 
    trim = TRUE
    )

Figure 5.18: Density plots for the Palmer Penguins Data

ggplot(
  data = penguins, 
  aes(
    x = body_mass_g
    )
  )  + 
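  geom_histogram(
    aes(
      # after_stat() replaces the deprecated ..density.. notation
      y = after_stat(density)
      ), 
    # body mass is in grams, so a bin-width of 0.2 would be far too narrow
    binwidth = 200, 
    fill = "cornflowerblue"
    ) + 
  labs(
    title = "Histogram & Kernel Density Plot of Body Mass",
    x = "Body Mass (in grams)", 
    y = "Density"
    ) + 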
  geom_density(
    alpha = 0.75, 
    color = "tomato4", 
    trim = TRUE
    ) + 
  facet_wrap(
    ~ species
    )

Figure 5.19: Histograms for the Palmer Penguins Data

5.3.4 Ridge Plots with `{ggridges}`

These plots have a fascinating story and are a somewhat recent addition to the {ggplot2} toolkit.⁵ I love them as much for their aesthetics as for their ability to show similarities and differences between distributions of the same phenomenon over time or space.

library(viridis)
library(ggridges)
library(ggthemes)
ggplot(
  lincoln_weather, 
  aes(
    x = `Mean Temperature [F]`, 
    y = `Month`
    )
  ) + 
  geom_density_ridges(
    scale = 3, 
    alpha = 0.3, 
    aes(
      fill = Month
      )
    ) + 
  labs(
    title = 'Temperatures in Lincoln NE', 
    subtitle = 'Mean temperatures (Fahrenheit) by month for 2016\nData: Original CSV from the Weather Underground'
    ) + 
  theme_ridges() +
  theme(
    axis.title.y = element_blank(), 
    legend.position = "none"
    )

Figure 5.20: Ridge Plots

Pay attention to the data here because they dictate the effectiveness of the plot. You have mean temperature, by day, for each of 12 months. This allows you to create one ridge per month and stack them in calendar-order on the y-axis. The x-axis allows the daily mean temperature to shift location. Each month has been given a unique fill color.

Could we do this with the penguin data? Let us see.

ggplot(
  penguins, 
  aes(
    x = body_mass_g, 
    y = species,
    fill = stat(x)
    )
  ) + 
  geom_density_ridges_gradient(
    scale = 3
    ) + 
  labs(
    title = 'Distribution of Body Mass (in grams), by Species', 
    caption = 'Data: Palmer Penguins',
    x = 'Body Mass (in grams)'
    ) + 
  scale_fill_viridis(
    option = "magma", 
    alpha = 0.75,
    name = "Body Mass (gms)") +
  theme_ridges() +
  theme(
    axis.title.y = element_blank()
    )

Aha! Note a few things here. First, we are using a fill color that varies with body_mass_g, and this makes it easier to see that many Gentoo penguins are much heavier than Chinstrap and Adelie penguins. Second, do not worry about the alpha =, scale =, fill = stat(x), and scale_fill_viridis(...) options; we will cover these in detail later on in this text.

5.3.5 Box-plots `geom_boxplot(...)`

Now we can revisit our old friends, the box-plots. Just a reminder that the hinges (edges of the box) mark the first $(Q_1)$ and third $(Q_3)$ quartiles, respectively, with the thick line inside the box flagging the median. The whiskers extend outward from each hinge (i.e., each quartile) by at most $1.5 \times IQR$, such that the left whisker runs from $Q_1$ toward $Q_1 - (1.5 \times IQR)$ and the right whisker runs from $Q_3$ toward $Q_3 + (1.5 \times IQR)$, each stopping at the most extreme observation that falls within that distance.⁶ Any observation with a value that goes beyond the whiskers will be flagged as an extreme value, what in common parlance we call an “outlier”. Below are a few box-plots drawn to show you the commands.
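The fences and whisker ends can be worked out by hand, which is a useful check on what the plot is showing. The sketch below uses a small made-up vector (hypothetical values, not the hsb2 scores) and the default quartile definition of quantile(); other functions define the hinges slightly differently, so treat the exact numbers as illustrative.

```r
# Hypothetical scores, with one value placed far from the rest
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

# Quartiles and the inter-quartile range
q1 <- unname(quantile(x, 0.25))
q3 <- unname(quantile(x, 0.75))
iqr <- q3 - q1

# The fences sit 1.5 x IQR beyond each quartile
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Whiskers stop at the most extreme observations inside the fences
whisker_ends <- range(x[x >= lower_fence & x <= upper_fence])

# Anything beyond the fences is flagged as an extreme value
flagged <- x[x < lower_fence | x > upper_fence]

c(q1 = q1, q3 = q3, lower_fence = lower_fence, upper_fence = upper_fence)
whisker_ends
flagged
```

Here the quartiles are 3.25 and 7.75, the fences are -3.5 and 14.5, the whiskers end at 1 and 9, and only the value 30 is flagged.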

ggplot(
  data = hsb2, 
  aes(
    x = female, 
    y = read
    )
  ) + 
  geom_boxplot(
    fill = "seagreen2"
    ) + 
  labs(
    title = "Box-Plot of Reading Scores",
    x = "Gender",
    y = "Reading Score"
    ) + 
  coord_flip()

Figure 5.21: Box-plots (1)

ggplot(
  data = hsb2, 
  aes(
    x = female, 
    y = read
    )
  ) + 
  geom_boxplot(
    fill = "peachpuff"
    ) + 
  labs(
    title = "Box-Plot of Reading Scores",
    subtitle = "(by Gender & School Type)",
    x = "Gender",
    y = "Reading Score"
    ) + 
  coord_flip() +
  facet_wrap(~ schtyp)

Figure 5.22: Box-plots (2)

coord_flip() transposes (i.e., switches) the x-axis and y-axis, which makes the box-plots horizontal and the skew easier to recognize.

5.3.6 Violin Plots `geom_violin(...)`

While box-plots are very useful for looking at the general shape of the distribution, violin plots tend to be more informative since they combine box-plots and kernel density plots. But not everyone likes these (or is used to them). Personally, I find them aesthetically pleasing but still prefer kernel density plots or box-plots.

ggplot(
  data = hsb2, 
  aes(
    x = female, 
    y = read
    )
  ) + 
  geom_violin(
    fill = "seagreen2", 
    trim = FALSE, 
    adjust = 0.5
    ) + 
  labs(
    title = "Violin Plots of Reading Scores",
    x = "Gender",
    y = "Reading Score" 
    ) + 
  geom_boxplot(
    width = .1
    ) + 
  coord_flip()

Figure 5.23: Violin Plots (1)

And here is one with breakouts by school-type as well.

ggplot(
  data = hsb2, 
  aes(
    x = female, 
    y = read
    )
  ) + 
  geom_violin(
    fill = "seagreen2", 
    trim = FALSE, 
    adjust = 0.5
    ) + 
  labs(
    title = "Violin Plots of Reading Scores",
    subtitle = "(by Gender and School Type)",
    x = "Gender",
    y = "Reading Score" 
    ) + 
  geom_boxplot(
    width = .1
    ) + 
  coord_flip() + 
  facet_wrap(~ schtyp)

Figure 5.24: Violin Plots (2)

5.3.7 Bar-Charts `geom_bar(...)`

We could take our categorical variables and generate bar-charts in base R, or then with some of the other packages, namely {ggplot2} and {lattice}. I will show you a bit of base R and then we can switch to {ggplot2} as before. Let us start with a simple bar-chart of ses frequencies.

table(hsb2$ses) -> tab.a 
barplot(
  tab.a, 
  ylim = c(0, 110), 
  ylab = "Frequency", 
  xlab = "Socioeconomic Status", 
  col = "cornflowerblue"
  )

Figure 5.25: Bar-chart with Frequencies

It would be more useful to show the relative frequencies, and that is easily done.

prop.table(tab.a) * 100 -> tab.b 
barplot(
  tab.b, 
  ylim = c(0, 60), 
  ylab = "Relative Frequency (%)", 
  xlab = "Socioeconomic Status", 
  col = "cornflowerblue"
  )

Figure 5.26: Bar-chart with Relative Frequencies

In {ggplot2}, the frequency bar-chart is generated via

ggplot(
  hsb2, 
  aes(
    x = ses, 
    fill = ses
    )
  ) + 
  geom_bar(
    width = 0.5
    ) + 
  theme(
    legend.position = "none"
    ) + 
  labs(
    x = "Socioeconomic Status",
    y = "Frequency"
    ) + 
  scale_y_continuous(
    limits = c(0, 100)
    )

Figure 5.27: Bar-chart with ggplot2

Note the use of scale_y_continuous(limits = ...) to control the minimum and maximum values of the y-axis, and of width = ... to make sure the bars are not too wide (which often makes the plot look unappealing).

Similarly, we can generate a bar-chart of ses by prog as follows:

ggplot(
  hsb2, 
  aes(
    x = ses, 
    fill = prog
    )
  ) + 
  geom_bar(
    width = 0.5, 
    position = "dodge"
    ) + 
  theme(
    legend.position = "bottom"
    ) + 
  labs(
    x = "Socioeconomic Status",
    y = "Frequency"
    ) + 
  scale_y_continuous(
    limits = c(0, 50)
    )

Figure 5.28: Bar-chart of ses and prog

If we wanted relative frequencies, we could do this as shown below, making sure to also reflect the percentages above each bar.

ggplot(
  hsb2, 
  aes(
    x = ses, 
    group = prog
    )
  ) + 
  geom_bar(
    aes(
      y = ..prop.., 
      fill = factor(..x..)
      ), 
    stat = "count"
    ) + 
  scale_y_continuous(
    labels = scales::percent, 
    limits = c(0, 0.65)
    ) + 
  labs(
    x = "Socioeconomic Status",
    y = "Percent"
    ) + 
  facet_wrap(
    ~ prog
    ) + 
  theme(
    legend.position = "none"
    ) + 
  geom_text(
    aes(
      label = scales::percent(..prop..), 
      y = ..prop.. 
      ), 
    stat = "count", 
    vjust = -.5, 
    size = 3.5
    )

Figure 5.29: Bar-chart of ses and prog (%)
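If you want to check the percentages printed above the bars, the same within-group proportions can be computed directly from a two-way table. The sketch below uses the built-in mtcars data as a stand-in (an assumption made so the code runs without the hsb2 file); with the chapter's data the analogous call would be built from table(hsb2$prog, hsb2$ses).

```r
# Two-way table of counts: rows are cylinders, columns are gears
counts <- table(cyl = mtcars$cyl, gear = mtcars$gear)

# margin = 1 converts each row into proportions that sum to one
row_props <- prop.table(counts, margin = 1)

# Express as percentages, rounded for display
round(row_props * 100, 1)
```

Setting margin = 1 mirrors what grouping and faceting by the same variable does in the plot: each panel's bars add up to 100 percent.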

Let us generate a few more for gender, schtyp, prog, ses, and race.

ggplot(
  data = hsb2, 
  aes(
    x = female
    )
  ) + 
  geom_bar(
    fill = "seagreen2", 
    width = 0.25
    ) + 
  labs(
    title = "Bar-Chart of Gender",
    x = "Gender",
    y = "Frequency"
    ) + 
  coord_flip()

Figure 5.30: Bar-charts: Gender

And now faceting by a few variables …

ggplot(
  data = hsb2, 
  aes(
    x = race,
    fill = race
    )
  ) + 
  geom_bar() + 
  labs(
    title = "Bar-Chart of Race (by SES & School Type)",
    x = "Race",
    y = "Frequency"
    ) + 
  facet_wrap(
    ses ~ schtyp, 
    ncol = 2
    ) +
  theme(legend.position = "none")

Figure 5.31: Bar-charts: Race, SES and School-Type

These layouts can be helpful but only in the right circumstances. Here, for example, there is hardly any data for private schools, making it difficult to justify the right column that is mostly empty.

5.3.8 Line Charts `geom_line(...)`

Line charts are ideal for displaying trends in a numerical variable over time. Most often you will see them used with aggregate estimates of, say, income, population size, immigration numbers, stock prices, money supply, inflation, unemployment and the like. I’ll pull the economics data-set, which is bundled with {ggplot2} and becomes available once {plotly} is loaded because {plotly} attaches {ggplot2}.

library(plotly)
data(economics)
names(economics)
#> [1] "date"     "pce"      "pop"      "psavert"  "uempmed"  "unemploy"
ggplot(
  data = economics, 
  aes(
    x = date, 
    y = uempmed
    )
  ) + 
  geom_line() + 
  labs(
    x = "Date",
    y = "Median Duration of Unemployment (in weeks)"
    )

Figure 5.32: Line chart of Median Duration of Unemployment over time

If we need to add multiple time-series to a single plot we could run the following code. The data we are using here comes from the {gapminder} package.

load("data/gap.df.RData")
ggplot(
  gap.df, 
  aes(
    x = year, 
    y = LifeExp, 
    group = continent, 
    color = continent
    )
  ) + 
  geom_line() + 
  geom_point() + 
  labs(
    x = "Year",
    y = "Median Life Expectancy (in years)"
    ) + 
  theme(
    legend.position = "bottom"
    )

Figure 5.33: Line chart of Median Life Expectancy (by Year and Continent)

Notice what we had to do for the last plot. Since the gapminder data-set has country-level data at five-year intervals, we had to first calculate a single value per continent per year, and the variable I chose was lifeExp. Thereafter, the plotting is straightforward, with geom_line() drawing the lines and geom_point() drawing the points (to aid in readability of the plot).
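The aggregation step is not shown above, so here is one way it might be done with base R. The toy data frame is entirely made up (hypothetical countries and values shaped like the gapminder data), and gap_sketch is an illustrative name rather than the gap.df object loaded earlier.

```r
# Hypothetical country-level rows shaped like the gapminder data
toy <- data.frame(
  country   = rep(c("A", "B", "C", "D", "E", "F"), times = 2),
  continent = rep(c("Asia", "Asia", "Asia", "Europe", "Europe", "Europe"), times = 2),
  year      = rep(c(1952, 1957), each = 6),
  lifeExp   = c(40, 45, 50, 60, 65, 70,
                42, 47, 55, 62, 66, 72)
  )

# One median per continent per year
gap_sketch <- aggregate(lifeExp ~ continent + year, data = toy, FUN = median)

# Rename to match the column used in the plotting code
names(gap_sketch)[names(gap_sketch) == "lifeExp"] <- "LifeExp"
gap_sketch
```

The result has one row per continent-year combination, which is exactly the shape geom_line() needs when the lines are grouped by continent.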

Line charts are fine in and of themselves but I often find their interactive cousins to be more interesting. Here, for example, is a {plotly} result.

library(zoo)
library(plotly)
plot_ly(
  economics, 
  x = ~date, 
  color = I("black")
  ) %>% 
  add_trace(
    y = ~uempmed, 
    name = 'Median Unemployment Duration (weeks)', 
    line = list(color = 'black'), 
    mode = "lines"
    ) %>% 
  add_trace(
    y = ~psavert, 
    name = 'Personal Saving Rate', 
    line = list(color = 'red'), 
    mode = "lines"
    ) -> myplot

library(shiny)
div(myplot, align = "right")

{plotly} is a special graphics package for interactive graphics so don’t think this is how the typical line chart might look. For example, the same plot rendered via {ggplot2} would look as follows:

ggplot() + 
  geom_line(
    data = economics, 
    aes(
      x = date, 
      y = uempmed
      )
  ) +
  geom_line(
    data = economics, 
    aes(
      x = date, 
      y = psavert
      ),
    color = "red"
  ) +
  labs(
    x = "Date",
    y = "Median Unemployment Duration (weeks) / Personal Savings Rate (%)"
  )

Figure 5.34: Demonstrating vanilla ggplot2 plot of the same data

A little touch of magic via ggplot and the plotly package, and voila!!

ggplot() + 
  geom_line(
    data = economics, 
    aes(
      x = date, 
      y = uempmed
      )
  ) +
  geom_line(
    data = economics, 
    aes(
      x = date, 
      y = psavert
      ),
    color = "red"
  ) +
  labs(
    x = "Date",
    y = "Median Unemployment Duration (weeks) / Personal Savings Rate (%)"
  ) -> p2
ggplotly(
  p2
  ) -> p2

library(shiny)
div(p2)

Regardless of the package-specific rendering, the basic point should be obvious: You can see how the median duration of unemployment and the personal savings rate vary over time. If you are interested, check out plotly’s capabilities here; we will spend some time with it in a later chapter.

5.3.9 Scatter-plots `geom_point(...)`

If we have TWO numeric (scale) variables then a scatter-plot is a great way to explore if and how these two variables are related. Returning to the hsb2 data, I’ll draw several scatter-plots of science scores against writing scores. I will then break these out for specific groups.

ggplot(
  hsb2, 
  aes(
    x = write, 
    y = science
    )
  ) + 
  geom_point() + 
  labs(
    x = "Writing Scores",
    y = "Science Scores"
    )

Figure 5.35: Scatter-plot of Science and Writing Scores

ggplot(
  hsb2, 
  aes(
    x = write, 
    y = science
    )
  ) + 
  geom_point(
    aes(
      color = ses
      )
    ) + 
  labs(
    x = "Writing Scores",
    y = "Science Scores"
    ) + 
  theme(
    legend.position = "bottom"
    )

Figure 5.36: Scatter-plot of Science and Writing Scores (by ses)

Note that this isn’t very helpful since it is hard to distinguish any patterns by ses, so we can keep it simple by breaking out the scatter-plot by ses.

ggplot(
  hsb2, 
  aes(
    x = write, 
    y = science,
    color = ses
    )
  ) + 
  geom_point() + 
  labs(
    x = "Writing Scores",
    y = "Science Scores"
    ) + 
  facet_wrap(
    ~ ses
    ) +
  theme(
    legend.position = 'none'
    )

Figure 5.37: Another scatter-plot of Science and Writing Scores (by ses)

Here is the same idea with the Palmer Penguins data-set, with one panel for each combination of island and sex.

ggplot(
  data = penguins, 
  aes(
    x = bill_length_mm, 
    y = flipper_length_mm, 
    color = species
    )
  ) + 
  geom_point() +
  facet_wrap(
    island ~ sex
    ) +
  theme(
    legend.position = 'none'
  ) +
  labs(
    x = "Bill Length (in mm)",
    y = "Flipper Length (in mm)"
  )

Figure 5.38: The Penguins data scatterplots

And here is one with the mtcars data, plotting mileage against quarter-mile time and coloring the points by the number of cylinders.

ggplot(
  data = mtcars, 
  aes(
    x = qsec, 
    y = mpg, 
    color = factor(cyl)
    )
  ) + 
  geom_point() +
  labs(
    color = "Cylinders",
    y = "Miles per gallon",
    x = "Fastest time to travel 1/4 mile from standstill (in seconds)"
    ) +
  theme(
    legend.position = 'bottom'
    )

Figure 5.39: The mtcars data scatterplots
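If you want the scatter-plot to hint at the direction of the relationship, one option is to overlay a fitted line. What follows is only a sketch: `geom_smooth(method = "lm")` is my addition, not something the figures above use, and a straight line per cylinder group is an assumption about the data rather than a finding. The object names `p_fit` and `r_qsec_mpg` are likewise just for illustration.

```r
library(ggplot2)

ggplot(
  data = mtcars,
  aes(
    x = qsec,
    y = mpg,
    color = factor(cyl)
    )
  ) +
  geom_point() +
  # One least-squares line per cylinder group; se = FALSE drops the confidence band
  geom_smooth(
    method = "lm",
    formula = y ~ x,
    se = FALSE
    ) +
  labs(
    color = "Cylinders",
    y = "Miles per gallon",
    x = "Quarter-mile time (in seconds)"
    ) +
  theme(
    legend.position = "bottom"
    ) -> p_fit

p_fit

# Pearson correlation between the two variables, ignoring cylinders
cor(mtcars$qsec, mtcars$mpg) -> r_qsec_mpg
```

The correlation summarizes the overall linear association in a single number, while the per-group lines let you check whether that association looks similar within each cylinder count.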

5.3.10 Count Charts `geom_count(...)`

These plots show how often each pair of values occurs by varying the size of the points: the more frequent a pair, the larger its point.

data(mpg, package = "ggplot2")
ggplot(
  mpg, 
  aes(
    x = cty, 
    y = hwy
    )
  ) + 
  geom_count(
    col = "firebrick", 
    show.legend = FALSE
    ) +
  labs(
    subtitle = "City vs Highway mileage", 
    y = "Highway mileage", 
    x = "City mileage"
    )

Figure 5.40: Count plots of Mileage
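To see what `geom_count(...)` is doing, it can help to tally the pairs by hand. The sketch below is my own reconstruction in base R, not the internals of {ggplot2}: it counts how many cars share each (cty, hwy) pair and then maps that count to point size. The names `pair_counts`, `n`, and `p_manual` are hypothetical.

```r
library(ggplot2)

# Count the rows for every (cty, hwy) pair that actually occurs in the data
aggregate(
  n ~ cty + hwy,
  data = data.frame(cty = mpg$cty, hwy = mpg$hwy, n = 1L),
  FUN = sum
  ) -> pair_counts

# Drawing one point per pair, sized by its count, mimics a count chart
ggplot(
  pair_counts,
  aes(
    x = cty,
    y = hwy,
    size = n
    )
  ) +
  geom_point(
    color = "firebrick",
    show.legend = FALSE
    ) +
  labs(
    y = "Highway mileage",
    x = "City mileage"
    ) -> p_manual

p_manual
```

In practice `geom_count(...)` saves you this step, but having the counts in a data frame is handy if you also want to tabulate or filter them.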

And now a count plot with data from the Boston Marathon.

read.csv(
  here::here(
    "data", 
    "BostonMarathon.csv"
    )
  ) -> boston 

ggplot(
  boston, 
  aes(
    x = Age, 
    y = finishtime, 
    group = M.F
    )
  ) + 
  geom_count(
    aes(
      color = M.F
      )
    ) + 
  labs(
    subtitle = "", 
    y = "Finishing Times (in seconds)", 
    x = "Age (in years)") + 
  facet_wrap(
    ~ M.F, 
    ncol = 1
    ) +
  theme(legend.position = "none")

Figure 5.41: Count plots of Boston Marathoners’ Age and Finishing Time (by Sex)

5.3.11 Hexbins

With two continuous variables, scatter-plots are often useful, but not when many data points overlap: with a lot of overlapping $x,y$ pairs it becomes hard to discern any pattern. In these situations, and for some people even as an outright replacement for ordinary scatter-plots, the hexbin comes in handy. The idea is simple: carve up the plotting canvas (the $x,y$ grid) into hexagons of equal size, then count how many $x,y$ pairs fall inside each hexagon. Hexagons that contain at least one data point are then shaded with a color scale (as in a heat-map) to show where the data are dense versus sparse.

ggplot(
  data = diamonds, 
  aes(
    x = carat,
    y = price
    )
  ) + 
  geom_hex() + 
  labs(
    x = "Weight in Carats",
    y = "Price"
    )

Figure 5.42: A hexbin of Diamond weights and prices

ggplot(
  data = diamonds, 
  aes(
    x = carat,
    y = price
    )
  ) + 
  geom_hex() + 
  labs(
    x = "Weight in Carats",
    y = "Price"
    ) + 
  facet_wrap(
    ~ color,
    ncol = 3
    )

Figure 5.43: A hexbin of Diamond weights and prices (by color)
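How coarse or fine the hexagons are is up to you. As a small sketch, and not a recommendation for these particular data, the `bins` argument of `geom_hex(...)` controls the number of hexagons along each axis (30 by default), and `scale_fill_viridis_c()` swaps in a different color scale for the counts. The value 50 and the name `p_hex` are arbitrary choices of mine. Note that rendering any hexbin requires the {hexbin} package to be installed.

```r
library(ggplot2)

ggplot(
  data = diamonds,
  aes(
    x = carat,
    y = price
    )
  ) +
  # More bins means smaller hexagons and a finer-grained picture
  geom_hex(
    bins = 50
    ) +
  # A continuous viridis palette for the per-hexagon counts
  scale_fill_viridis_c() +
  labs(
    x = "Weight in Carats",
    y = "Price",
    fill = "Count"
    ) -> p_hex

p_hex
```

Too few bins will blur real structure and too many will bring back the clutter of a scatter-plot, so it is worth trying a couple of values.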

These are some of the basic geoms that {ggplot2} provides, but there are plenty more that could be used as well. For now we will set these basic visualizations aside and go back to gathering data. Again? Yes, again, but this time we will work with APIs made available by some national/international governmental and non-governmental organizations.
