Data Analysis for Leadership & Public Affairs with R
2018-08-17
Chapter 1 Preface
As a typical graduate student who rarely had enough money, I struggled to find ways to get my hands on the textbooks I needed for my courses without paying full cost. Old editions, used copies, photocopies, Interlibrary Loan services: you name it, I tried it. Once I became an instructor, and I suspect this is true for most of us who teach anything, I found myself flooded with free desk copies, year in, year out. Now a different problem surfaced, however: the annual hunt for a text that covered the material I needed to teach in a clear and thorough manner, and at a reasonable price. Unfortunately, statistics textbooks written for public administration/public affairs students are not only few and far between but also consistently expensive. This has been an exasperating situation for over two decades with no signs of a correction in the offing. It was thus less serendipity and more growing frustration with the state of affairs that led to the creation of this text. Given that in this age of crushing student debt the world is replete with open-source software for data analysis and desktop publishing, why not curate a textbook tailored not only to my instructional needs but also to my students’ pocketbooks? Leaning on the wonderful resources created by the teams at RStudio, R, Jared Lander (his wonderful “R for Everyone: Advanced Analytics and Graphics” showed me what a quality, aesthetically pleasing product might resemble), and the thousands of R users across the globe ever willing to share their coding knowledge, I have assembled my course material into a work that I hope is useful beyond just the pocketbook. My goal is to deliver a quality product. That cannot be achieved without your suggestions for substantive, technical, and stylistic improvements, so feel free to share these with me.
1.1 Data Analysis and Public Affairs
In the early decades of the twentieth century the field of public administration was a leader, as much in the area of intellectual thinking about governance as in granular research on governmental processes and outcomes. Since the 1960s, however, public administration has been seen as a poorer cousin of political science, once its twin. Its cachet in the social and behavioral sciences has dwindled, largely because of the feeling that the scholarship it produces is less rigorous than that of other disciplines. There may be some truth to this notion, but engaging in intellectual debates and recovering scholarly bragging rights are not our goals. Rather, the task before us is to fill a more crucial gap in the public affairs world – preparing data-savvy graduates.
Not an easy task, you say? Truer words were never spoken! Why is that? Most of you, like thousands of your peers across the nation, no doubt groaned when you saw that one of your required courses was a research methods class. Maybe you remembered the mandatory statistics course you took as an undergraduate and that familiar dread swept over you in a flash. A false fear of mathematics certainly gets in the way of understanding statistics. Beyond that, however, is the matter of understanding statistical concepts by seeing these concepts in action, something you probably bypassed as an undergraduate. And then, of course, there is the all-important learning exercise of applying these concepts to real-world data. Well, in this text I try to get you beyond your fear of numbers, help you see statistics in action, and force you to work with real-world data married to real-world questions.
In other words, this is a very applied course in data analysis, one in which you will learn how to use data in the way best suited to answering a specific question. Maybe the question is about weighing the evidence in a lawsuit your city is facing over racial bias in hiring. Or your agency is curious to know if the public information campaign it has been broadcasting to promote healthy living is having any impact on the health of the citizens. Maybe you work for an economic development agency and need to track trends in unemployment rates. Whatever the question before us, invariably there are data that can be used to find reasonable answers. To get at these answers, however, you need to know three things:
- How do I gather the data?
- How should I analyze the data I gathered?
- What are the strengths and limitations of my analysis?
These will be our guiding questions throughout this book. Answering them seems fairly easy, but it requires hard work and a lot of hands-on practice. After all, data analysis requires us to understand the syntax of statistics, its vocabulary, and, in all cases, a crowd of Greek symbols squeezed into what seems to be a mysterious mathematical formula.
1.2 The Chapters That Follow
The chapters that follow are sequenced in a way designed to promote learning. We start in Chapter 2 with the fundamentals: samples and how they differ from populations; how the way we measure some attribute or phenomenon has implications for the analysis that can be done (for example, how measuring a survey respondent’s sex differs from measuring the hours of physical activity he or she gets per week); the tricky business of establishing cause-and-effect; and other such fundamental principles.
In Chapter 3 we move on to understanding the many interesting ways in which we can explore patterns in our data, both graphically and with simple tables. In the process we will also learn best practices in data visualization – how should you build an effective graph? Issues of human cognition and visual perception are more important than we tend to recognize.
Chapter 4 revolves around measures of central tendency and variability. That is, we learn about the three ways we can measure and discuss the average, the typical – the mean, the median, and the mode – and variability around these averages – the range, the interquartile range, the variance, and the standard deviation. We also fold into our data visualization toolbox a very powerful graphic called the box plot that relies on the five-number summary to tell us a lot about the shape of our distribution.
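As a small preview of these measures, the sketch below computes them in base R. The `hours` vector is made-up data used purely for illustration, and the variable names (including `most_common`) are hypothetical labels rather than built-in functions.

```r
# Hypothetical data: weekly hours of physical activity for nine respondents
hours <- c(2, 3, 3, 4, 5, 5, 5, 7, 11)

# Measures of central tendency
avg <- mean(hours)
mid <- median(hours)
# Base R's mode() reports the storage type, not the statistical mode,
# so tabulate the values and pick the most frequent one instead
most_common <- as.numeric(names(which.max(table(hours))))

# Measures of variability
spread <- diff(range(hours))   # range = maximum minus minimum
iqr    <- IQR(hours)           # interquartile range
v      <- var(hours)           # sample variance (n - 1 denominator)
s      <- sd(hours)            # sample standard deviation

# Five-number summary: minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(hours)

print(c(mean = avg, median = mid, mode = most_common))
print(c(range = spread, IQR = iqr, variance = v, sd = s))
print(five)

# The box plot is built on the five-number summary; points far from
# the box are drawn individually as potential outliers
boxplot(hours, horizontal = TRUE, xlab = "Hours of physical activity per week")
```

With a different data set only the first line needs to change; the remaining calls work on any numeric vector.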
Probability theory, one of the most notorious subjects in statistics, is our focus in Chapter 5. It is a tricky subject, but unless we wrestle with the nuances of probability theory everything else that follows will make little to no sense. It is as powerful as it is difficult, which is why game shows rely on it and Jeopardy champions and poker stars need to master it. It also explains why the winning pick-3 number in NYC on the first anniversary of the 9/11 terrorist attacks being 9-1-1 was not a rare event. If you understand probability you can be the life of the party by predicting, correctly more often than not, that at least two people in a gathering of 30 share the same birthday (the chance is roughly 70 percent).
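The birthday claim can be checked with a few lines of R. This is a minimal sketch that assumes 365 equally likely birthdays and ignores leap years; the pick-3 line likewise assumes each of the three digits is drawn independently and uniformly from 0 to 9. The variable names are illustrative.

```r
n <- 30  # number of people at the gathering

# Probability that all n birthdays differ:
# (365/365) * (364/365) * ... * ((365 - n + 1)/365)
p_all_different <- prod((365 - 0:(n - 1)) / 365)

# Probability that at least two people share a birthday
p_shared <- 1 - p_all_different
print(round(p_shared, 3))  # about 0.706

# The stats package ships a helper that performs the same calculation
print(pbirthday(n))

# Pick-3: any specific three-digit draw, such as 9-1-1, has a 1 in 1000 chance
p_911 <- (1 / 10)^3
print(p_911)
```

A 1 in 1000 chance per drawing is small, but with drawings held day after day some memorable coincidence is bound to turn up sooner or later.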
Chapter 6 extends probability theory to probability distributions, both discrete and continuous, and Chapter 7 spans the theory of sampling distributions. Stalwarts like the standard error, the Central Limit Theorem, confidence intervals, and the Student’s \(t\) distribution grace the stage.
In Chapter 8 we discuss the logic of hypothesis testing, a method of formalizing and testing our suspicions about whether a program has had an impact, whether there is a gender bias in hiring, etc. This is a tightly specified method, with little room for doing things differently. One of the biggest surprises many students encounter occurs in this chapter: no matter the strength of the evidence, there is always a possibility that we could be drawing the wrong conclusion from our data analysis. This chapter also introduces you to some hypothesis tests commonly referred to as t tests or difference of means tests.
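To give a flavor of what such a test looks like in R, here is a minimal sketch using `t.test()` from base R. The two groups are simulated, hypothetical data (not results from any real program), and the group names and seed are arbitrary choices for illustration.

```r
set.seed(2018)  # arbitrary seed so the simulated data are reproducible

# Hypothetical outcome scores for people who did and did not take part in a program
participants     <- rnorm(40, mean = 52, sd = 10)
non_participants <- rnorm(40, mean = 48, sd = 10)

# Difference of means test; by default t.test() runs Welch's version,
# which does not assume equal variances in the two groups
result <- t.test(participants, non_participants)

print(result$statistic)  # the t statistic
print(result$p.value)    # chance of a t statistic at least this extreme if the true means were equal
print(result$conf.int)   # 95% confidence interval for the difference in means
```

Even a small p-value does not rule out a wrong conclusion; it only says that a difference this large would be unusual if the program had no effect.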
Chapters 9 and 10 lead us further through the world of inferential statistics, the process of analyzing the sample data in hand and extrapolating our conclusions to the population the sample represents. We start quite simply, looking at how to determine if the differences between two groups are statistically significant. Since the world often cannot be broken up into two groups (men and women, for example) but instead must be studied as it exists in reality (for example, the Ohio Department of Education classifies public school districts into eight mutually exclusive categories), we also learn how to analyze categorical outcomes for multiple groups in a coherent manner.
In Chapter 11 we move on to the mother of most statistical analyses today – regression analysis. This is the most exciting and useful portion of the course because we are finally able to accommodate real-world complexity in our calculations. For example, if you wanted to predict the number of highway fatalities on a particular stretch of I-70, you would have to account for many things (visibility, traffic density, traffic speed, road conditions, driver intoxication, time of day, and so on). Indeed, much of the noise about predictive analytics, health analytics, data mining, etc. revolves around one form or another of regression analysis.
Each chapter concludes with a large section devoted to the R code used to carry out an operation. The operation could be something as elementary as reading in a data set available to us in some format or another; calculating totals, averages, variances, and standard deviations; generating graphics; and then, of course, carrying out the particular form of statistical analysis that is the subject of that chapter. The answer key (available separately from me) shows you the R code used to answer each question. Each chapter also highlights concepts and techniques that veer into “advanced” territory, implying a need for an understanding of more complicated theoretical concepts and the corresponding R code. I want you to be aware, however, of the fact that there are many ways to accomplish the same end result via R. Hence each of us R users has to make a choice, settling upon a particular approach that has worked for us. I would not want you to walk away from this text without recognizing this underlying current in my use of specific R code. Indeed, what would be ideal is for you to start with what I am showing you and then use the abundant resources on R available on the web to expand your toolkit. If all goes well you will be able to improve upon what I do, and nothing could make me happier.
1.3 Keys to Learning Data Analysis
I have already emphasized that statistics has its own language, and in that sense learning statistics is no different from learning a foreign language. You cannot master a foreign language simply by cracking open a book for an hour a week or going to the weekly class. If it were that simple we would all be linguists, but we are not. The ones who master a foreign language (or at least learn enough to impress the servers at your neighborhood Spanish/Italian/French restaurant) are those who practice as much as they can. That is the approach I recommend to you if you want to learn data analysis.
To encourage practice, each chapter concludes with a number of practice problems. These are designed to reinforce learning and I expect you to try to solve them. Answer keys are provided with fully worked-out calculations so that if you do make a mistake you can see where and how you went wrong. You should do a few problems before you tackle the assignment for the week; this puts you in the best position to complete the assignment not only correctly but also with minimal frustration and time. Of course, each chapter also has several worked examples per key concept/calculation so there is no shortage of learning-by-doing opportunities.
This book may not work for everybody. Some people need to approach the same material from multiple vantage points before things click. That is perfectly fine, and if you are one such individual, I encourage you to also look at the several thousand videos and blogs and free books and papers on the internet. Some of these materials have been curated and hot-linked in this book while others have been listed in the Bibliography for this text; use them.
1.4 Acknowledgements
I want to acknowledge the following students for providing critical feedback: Grace Kroeger, Eleonora Mocanu, Sharif Wahab, George Mance, Bethany Blinsky.