Chapter 2 Introduction to Core Concepts

2.1 Statistics versus Statistic

Statistics is a broad and lively field of study that involves methods to (a) accurately describe patterns observed in data and (b) draw inferences about the population the data were gathered from. For example, the pollsters at Gallup who track public opinion ask at regular intervals: “Do you approve or disapprove of the way [insert incumbent President’s name here] is handling his job as president?” Obviously it would be infeasible to ask every adult in the United States of America this question. Instead, based on statistical theory and meticulous calculations that you will come to learn, Gallup works as follows:

Gallup interviews [approximately 500] U.S. adults aged 18 and older living in all 50 states and the District of Columbia using a dual-frame design, which includes both landline and cellphone numbers. Gallup samples landline and cellphone numbers using random-digit-dial methods. Gallup chooses landline respondents at random within each household based on which member had the next birthday. Each sample of national adults includes a minimum quota of 60% cellphone respondents and 40% landline respondents, with additional minimum quotas by time zone within region. Gallup conducts interviews in Spanish for respondents who are primarily Spanish-speaking. Source

What Gallup is working with is a subset of all U.S. adults aged 18 and older living in the 50 states plus Washington, DC. As they gather the data, they may be doing so to study the trend in presidential approval (see Figure 2.1).

# Packages for reading the spreadsheet, sorting rows, and plotting
library(readxl)
library(dplyr)
library(ggplot2)
library(scales)

# Read the approval data (this path is specific to the author's machine)
obama_approval <- read_excel("~/Documents/Data Hub/obama_approval.xlsx")

# Number the rows, then build a reversed index (id2) to use as the time axis
obama_approval$id <- as.numeric(row.names(obama_approval))
obama <- arrange(obama_approval, desc(id))
obama$id2 <- 419 - obama$id

# Positions of the year labels along the x-axis
year_labels <- data.frame(
  x = c(30, 70, 120, 175, 230, 280, 330, 390),
  label = as.character(2009:2016)
)

# Line plot of approval, with vertical rules separating the years
ggplot(obama, aes(x = id2, y = Approve)) +
  geom_line(color = "cornflowerblue") +
  geom_vline(xintercept = c(49, 102, 154, 206, 259, 311, 362, 415), color = "black") +
  annotate("text", x = year_labels$x, y = 35, label = year_labels$label) +
  xlab("Time") +
  ylab("Presidential Approval (%)") +
  ylim(c(0, 70)) +
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())

FIGURE 2.1: Presidential Approval over Time

Why does Gallup do these surveys of a few hundred or a thousand individuals? Not because they are idly curious about what a sample of 500 U.S. adults thinks. Gallup wants to be able to say that this is what the U.S. adult population thinks about the incumbent president’s job performance! In other words, Gallup is using a sample to say something useful about the population.

  • Sample: the subset of cases drawn for analysis from the population
  • Population: the universe (or set) of all elements of interest in a particular study

Drawing conclusions about the population from analyzing the sample is the process of making an inference, and the processes and rules involved in doing so are called inferential statistics.

It will rarely be the case that you will have access to the entire population. The rare instances when that might happen could include cases where you have access to all patients receiving care in the hospital group you work for, all individuals insured with your insurance agency, all individuals receiving services from your agency, all drivers registered with the department of motor vehicles, all homeowners in a specific location, and so on. Even in these instances, however, you can only access the population if the law and agency protocols permit you to do so and you have legitimate, authorized reasons for accessing the population data.

Some population data are in the public domain and easily accessed: For example, data about all public school districts in Ohio, data on all voters registered to vote in a particular location, data for the nation’s cities, counties, states, territories, congressional districts, census tracts, and so on. These data are publicly available because there are no legal restrictions on releasing these data. Most of us, however, end up relying on samples.

If statistics is a field of study, a statistic is the result of applying a computational algorithm to a set of data. For example, when the U.S. Census Bureau calculates median household income, they are generating a statistic. Likewise, when Gallup polls and tells us what percent of likely voters will vote for Candidate A, they too are generating a statistic. If you are working with a pilot group of your agency’s service recipients and the pilot is testing whether a particular policy or service program is having the intended effect (for example, improving the service recipients’ financial literacy), and you conclude that 10% of those who receive this training will improve their financial literacy, you have generated a “statistic”.

2.1.1 Parameters versus Estimates

Statistics revolves around drawing estimates, as accurate as possible, of some unknown attribute of the population from the sample at hand. These unknown population attributes/features are called parameters while their sample counterparts are called estimates. Thus, for example, if I had access to the household income of every household in Franklin County, Ohio, I could calculate average household income, and this would be a parameter. However, if I could only get my hands on the incomes of, say, 100 households in Franklin County, Ohio, randomly selected for analysis, and then calculated average household income for these 100 households, I would have an estimate. In general, we hope (and ensure as best as we can via statistical theory) that our sample estimates equal the population parameters they represent.
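To make the parameter/estimate distinction concrete, here is a minimal sketch in R. The "population" is simulated, so the incomes are made-up numbers rather than actual Franklin County data, and the names population_income and sample_income are purely illustrative.

```r
# Hypothetical population: simulated incomes for 50,000 households
# (made-up numbers, not actual Franklin County data)
set.seed(2024)
population_income <- rlnorm(50000, meanlog = 11, sdlog = 0.6)

# The parameter: average income computed from every household in the population
parameter <- mean(population_income)

# The estimate: average income computed from a random sample of 100 households
sample_income <- sample(population_income, size = 100)
estimate <- mean(sample_income)

# Compare the two; the gap between them will change if you change the seed
round(c(parameter = parameter, estimate = estimate,
        difference = estimate - parameter))
```

Rerunning the last few lines with a different seed gives a different estimate each time while the parameter stays put, which is the whole point: the parameter is a fixed feature of the population, while the estimate depends on which 100 households happened to be drawn.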

This equality may not always hold, so every time we generate an estimate we also generate a measure of the uncertainty around the estimate, a measure that tells us how much our estimates may have drifted from their corresponding parameters. We will make these measures of uncertainty more precise as we go along but for now you can cast your mind to a phrase you may have encountered – the “margin of error”. For example, on October 20, 2017 a Gallup poll found that Americans’ approval of the way Congress was doing its job had hit 13% (i.e., only 13% thought Congress was doing a decent job while 87% thought otherwise). At the very bottom of this story Gallup noted that “[f]or results based on the total sample of national adults, the margin of sampling error is \(\pm 4\) percentage points at the 95% confidence level”. We will learn all about this quantity called the margin of error but for now understand the essence of what that \(\pm 4\) is telling us: If we could have asked every adult American whether they approved of Congress, the percent saying “yes” would plausibly have fallen somewhere between \(13 - 4 = 9\%\) and \(13 + 4 = 17\%\). So this margin of error is telling us something about the uncertainty around the estimate of 13%. If no such uncertainty estimate were provided to us, we would have no way of knowing how much faith to place in anything the sample is telling us.
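The arithmetic behind that interval is easy to reproduce. The sketch below first rebuilds the 9% to 17% range from the two reported numbers, and then shows the standard textbook margin of error for a proportion under simple random sampling. The function moe_srs and the sample sizes passed to it are illustrative assumptions; Gallup's published margin also reflects its weighting and sample design, so this is not a reconstruction of Gallup's own calculation.

```r
# Figures reported in the Gallup story quoted above
estimate <- 13          # percent approving of Congress
margin_of_error <- 4    # percentage points, at the 95% confidence level

# Interval implied by the estimate and its margin of error: 9 and 17
interval <- c(lower = estimate - margin_of_error,
              upper = estimate + margin_of_error)
interval

# Textbook margin of error (in percentage points) for a proportion p
# estimated from a simple random sample of size n (an idealized assumption)
moe_srs <- function(p, n, conf_level = 0.95) {
  z <- qnorm(1 - (1 - conf_level) / 2)
  100 * z * sqrt(p * (1 - p) / n)
}

# Hypothetical sample sizes: the margin of error shrinks as n grows
moe_by_n <- sapply(c(500, 1000, 2000), function(n) moe_srs(p = 0.5, n = n))
round(moe_by_n, 1)
```

Notice that quadrupling the sample size only halves the margin of error, which is one reason pollsters rarely interview many thousands of people.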

2.1.2 Sampling Error and Bias

Samples can mislead of course, and by sheer chance at that. When this happens our (sample) estimates will not equal (\(\neq\)) their corresponding (population) parameters, and the gap between the two is sampling error. Sampling error is difficult to eradicate completely but can be minimized up to a point. What we should worry about far more than sampling error is bias. Bias can arise in many ways, with some of the more common sources being measurement bias (you have a systematic error in how you are measuring something but don’t realize it); sampling bias (survey non-response and/or the way you are drawing your sample is generating bias); and model misspecification (you do not fully understand the phenomenon you are studying and thus end up with a flawed statistical model that will yield biased estimates because of variables you forgot to include, mis-characterized relationships between variables, and so on).

Measurement error is a particularly interesting feature of working with samples. For example, say I am working with a “healthy living” group that is working to help folks live healthier lives by eating better and exercising more. I am tasked with studying the program and then releasing a final report on how well (or poorly) the program worked. I could measure participants’ attendance at organized events, their online studying of food and exercise modules we are providing to them via the web and smartphones/tablets, and of course we have their height, weight, and maybe some other data recorded when they enrolled in the program. Unfortunately, however, we can only record their eating and exercise patterns based on what they tell us! People don’t keep very accurate logs of food intake and exercise, so often when they are completing surveys they are doing their best to recall these behaviors over the last week. In recalling the recent past, there will be some mistakes; that is to be expected. Some may overestimate how much they ate and/or exercised while others will underestimate how much they ate and/or exercised. If these overestimates and underestimates were purely random, then they would tend to cancel each other out on average and the measurement error would not bias our estimates. However, we do know that the errors in self-reported dietary intake measures are not random and hence could generate bias.
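A small simulation can show why random reporting errors are far less worrying than systematic ones. Everything below is hypothetical: the calorie figures, the size of the random error, and the assumed 15% average under-reporting are made up for illustration and are not drawn from any real dietary study.

```r
set.seed(42)
n <- 5000

# Hypothetical true daily calorie intake for n participants
true_intake <- rnorm(n, mean = 2200, sd = 300)

# Random reporting error: over- and under-reports are equally likely
random_report <- true_intake + rnorm(n, mean = 0, sd = 200)

# Systematic reporting error: participants under-report by about 15%
# on average (an assumed figure, purely for illustration)
biased_report <- true_intake * rnorm(n, mean = 0.85, sd = 0.05)

# Average intake under each scenario
round(c(truth = mean(true_intake),
        random_error = mean(random_report),
        systematic_error = mean(biased_report)))
```

With purely random error the average of the reports lands very close to the true average, whereas with systematic under-reporting the average is pulled well below the truth no matter how many participants we survey.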

2.1.3 Key Elements of Random Sampling

At minimum two things must hold for a random sample …

  1. Every unit of analysis (i.e., the entity that is to be analyzed) in the population must have the same chance of being selected into the sample, and
  2. Every unit of analysis in the population must be sampled independently of all other units of analysis

If you violate (1) you end up with biased estimates and if you violate (2) you will have imprecise estimates. Examples of these violations are aplenty. For instance, imagine I carry out a landline survey of health behaviors and outcomes. How many of you have a landline? Not many, I would suspect. So what is the consequence of my using a landline survey? All cellphone-only households have a zero chance of being selected into my sample. If the cellphone-only population were no different from the population that has kept a landline there would be no problem. But we know that isn’t the case. If you are curious, see “New Considerations for Survey Researchers When Planning and Conducting RDD Telephone Surveys in the U.S. With Respondents Reached via Cell Phone Numbers” and “Growing Cell-Phone Population and Noncoverage Bias in Traditional Random Digit Dial Telephone Health Surveys”. I could also violate (2) if I am conducting a survey of freshmen’s opinions about healthy food options in a university’s dining halls and end up talking only to people seated together, even if I pick a few tables to talk to. Why is this a problem? Because it is likely that these students are friends/roommates who maybe also share a common major, etc., and most likely come from similar socioeconomic and demographic backgrounds, have shared tastes and preferences, and so on.
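The landline example can be sketched with simulated data. The population below is entirely hypothetical: the 60% cellphone-only share and the two outcome rates are assumed values chosen only to make the two groups differ, and the outcome is a generic yes/no health behavior rather than any real survey measure.

```r
set.seed(7)
N <- 100000

# Hypothetical population: 60% of adults live in cellphone-only households
cell_only <- rbinom(N, 1, 0.6) == 1

# Assumed for illustration: the health behavior is more common among
# cellphone-only adults (25%) than among adults with a landline (12%)
behavior <- rbinom(N, 1, ifelse(cell_only, 0.25, 0.12))
true_rate <- mean(behavior)

# Condition (1) holds: every adult has the same chance of selection
random_sample <- sample(N, 1000)

# Condition (1) violated: cellphone-only adults have zero chance of selection
landline_sample <- sample(which(!cell_only), 1000)

round(c(population = true_rate,
        random_sample = mean(behavior[random_sample]),
        landline_only = mean(behavior[landline_sample])), 3)
```

The equal-probability sample lands near the population rate, while the landline-only sample misses it badly, and drawing a bigger landline sample would not help because the problem is who can be selected, not how many.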

2.2 Useful Research Designs

All research starts with a research design, a well-crafted and clearly articulated framework that outlines what data will be gathered in order to address the research questions motivating the research. There are several research designs that could be used, each with its strengths and weaknesses, but the key quantitative designs you are likely to encounter are listed below.

2.2.1 Experimental

Experiments are the gold standard of quantitative research designs because, much like an experiment in a physics or chemistry lab, they allow the researcher to control for a host of factors and then introduce a stimulus to test the impact of this stimulus on the outcome of interest. For example, political psychologists who study the impacts of campaign advertising on voters’ choices often randomly assign subjects to view either a positive campaign advertisement or a negative campaign advertisement about two candidates running for the same elected office. The study’s purpose might be to figure out whether negative advertising reinforces stereotypes, whether it makes people less likely to vote at all (even for their own party’s candidate), and so on. Because subjects were randomly assigned to one of the two advertisement types, the two groups are similar on average in every other respect, so any impacts we see of the tone of the campaign advertisement can be reliably attributed to the advertisement itself and not to someone’s sex or educational attainment or party preferences, etc.
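Here is a rough simulation of why random assignment is so powerful. The subjects, their years of education, the outcome scale, and the assumed effect of the negative advertisement (a 5-point drop) are all invented for illustration; none of these numbers come from an actual study.

```r
set.seed(11)
n <- 2000

# Hypothetical subject characteristic that also influences the outcome
education_years <- round(rnorm(n, mean = 14, sd = 2))

# Random assignment: each subject has a 50/50 chance of seeing the negative ad
negative_ad <- rbinom(n, 1, 0.5)

# Simulated outcome: intention to vote on a 0-100 scale, with an assumed
# treatment effect of -5 points for the negative ad
vote_intent <- 50 + 2 * (education_years - 14) - 5 * negative_ad +
  rnorm(n, sd = 10)

# Balance check: average education should be nearly identical across groups
balance <- tapply(education_years, negative_ad, mean)
balance

# The simple difference in group means estimates the treatment effect
effect <- mean(vote_intent[negative_ad == 1]) -
  mean(vote_intent[negative_ad == 0])
effect
```

Because education is balanced across the two groups by design, the simple difference in means recovers something close to the assumed effect without my having to adjust for education at all.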

Take another example. Say I am interested in figuring out whether private schools are really better at educating students than traditional public schools. Ideally, I would have a large random sample of genetically identical twins and the freedom to assign one twin in each pair to a public school and the other to a private school. I could then follow their learning for a year and see which group – private or public – has better academic performance on average. Why would this be a powerful research design? Because they are twins, their home environments are the same (same parental income, interest in child’s education, social circle, access to computers, internet, extra-curricular activities, and the like). An informed reader would at this point object to this example and point out that not all twins have identical genes, so couldn’t genetic differences be responsible for differential learning? Yes, they might, but that is why we restricted our sample to genetically identical twins!

Experiments draw their power from their ability to rule out all other likely influences on the outcome and in doing so yield a clean estimate of the treatment’s (say, attending a private versus a public school, being given Drug A versus Drug B, etc.) impact on an outcome (academic learning, reducing joint inflammation, and so on). However, they are difficult to carry out in most social and behavioral science settings in the real world because they cost a lot, logistics get in the way (you probably won’t get enough parents to agree to separate the twins) and because of legal/ethical requirements that circumscribe all research with human subjects. Central to these requirements is the notion of informed consent.

Informed consent is the process by which researchers working with human participants describe their research project and obtain the subjects’ consent to participate in the research based on the subjects’ understanding of the project’s methods and goals. The subject must have understood his/her rights, the purpose of the study, the procedures to be undergone, and the potential risks and benefits of participation. Participation is strictly voluntary and consent must be secured for all human subjects research including diagnostic, therapeutic, interventional, social and behavioral studies, and for research conducted domestically or abroad. Vulnerable populations (e.g., prisoners, children, pregnant women) receive extra protections. The legal rights of subjects are not waived, nor are the researchers, the study’s sponsor, the researchers’ employer, etc. exempt from any liability for negligence. There is a fascinatingly unfortunate history that has led to this sacrosanct principle of informed consent, even though some would argue it holds back what we could learn; for the curious among you, see Seven Creepy Experiments That Could Teach Us So Much (If They Weren’t So Wrong). Indeed, informed consent is such a crucial aspect of our work that your first assignment will require you to complete the Group 2: Social and Behavioral Investigators and Key Personnel training via Collaborative Institutional Training Initiative (CITI Program).

2.2.2 Natural Experiments

The fundamental reason that experimental research sets the bar for all other research designs is the researchers’ ability to completely randomize what treatment a subject encounters. This ensures that there is very likely no explanation for differences in outcomes other than the differences in the treatments themselves. Natural experiments come close to mimicking this complete randomization attribute of experimental designs although the randomization is not being carried out by the researchers. Here is a classic example that comes to us from Thad Dunning:

An interesting social-scientific example comes from a study of how land titles influence the socio-economic development of poor communities. In 1981, urban squatters organized by the Catholic Church in Argentina occupied open land in the province of Buenos Aires, dividing the land into parcels that were allocated to individual families. A 1984 law, adopted after the return to democracy in 1983, expropriated this land with the intention of transferring titles to the squatters. However, some of the original landowners challenged the expropriation in court, leading to long delays in the transfer of titles to some of the squatters. By contrast, for other squatters, titles were granted immediately.

The legal action therefore created a (treatment) group of squatters to whom titles were granted promptly and a (control) group to whom titles were not granted. The authors of the study find subsequent differences across the two groups in standard social development indicators: average housing investment, household structure, and educational attainment of children. On the other hand, the authors do not find a difference in access to credit markets, which contradicts a well-known theory that the poor will use titled property to collateralize debt. They also find a positive effect of property rights on self-perceptions of individual efficacy. For instance, squatters who were granted land titles—for reasons over which they apparently had no control— disproportionately agreed with statements that people get ahead in life due to hard work.

Is this a valid natural experiment? The key claim is that land titles were assigned to the squatters as-if at random, and the authors present various kinds of evidence to support this assertion. In 1981, for example, the eventual expropriation of land by the state and the transfer of titles to squatters could not have been predicted. Moreover, there was little basis for successful prediction by squatters or the Catholic Church organizers of which particular parcels would eventually have their titles transferred in 1984. Titled and untitled parcels sat side-by-side in the occupied area, and the parcels had similar characteristics, such as distance from polluted creeks. The authors also show that the squatters’ characteristics, such as age and sex, were statistically unrelated to whether they received titles—as should be the case if titles were assigned at random. Finally, the government offered equivalent compensation—based on the size of the lot—to the original owners in both groups, suggesting that the value of the parcels does not explain which owners challenged expropriation and which did not. On the basis of extensive interviews and other qualitative fieldwork, the authors argue convincingly that idiosyncratic factors explain some owners’ decisions to challenge expropriation, and that these factors were unrelated to the characteristics of squatters or their parcels.

The authors thus present compelling evidence for the equivalence of treated and untreated units. Along with qualitative evidence on the process by which the squatting took place, this evidence helps bolster the assertion that assignment is as-if random. Of course, assignment was not randomized, so the possibility of unobserved confounders cannot be entirely ruled out. Yet the argument for as-good-as-random assignment appears compelling.

Two more examples follow, both from a blog that also lists several other examples:

The Dutch Famine Study The Dutch “hunger winter” took place towards the end of World War 2 in the German-occupied Netherlands from late 1944 until the liberation of the area by the Allies in May 1945. During this time official food rations dropped to as low as 500 calories per day and around 20,000 people died of starvation. This experience of a modern advanced country experiencing a short, sharp famine is quite unusual, and in this case the West of the country was strongly affected but not the North or South. This has allowed many researchers to examine the effects of the famine on population health by comparing the outcomes of people from different regions. Of particular interest are the outcomes of the children who were in utero at the time of the famine, as they may have suffered developmental impairments with potentially lifelong repercussions.

The Impact of the Mariel Boatlift on the Miami Labor Market While a majority of economists agree that the average American citizen would be better off if more low-skilled and high-skilled immigrants were allowed to enter the US each year, there is considerably more opposition to immigration among the general public in Britain, America and continental Europe. One frequently raised concern, aside from issues of social cohesion, is that the arrival of immigrants might negatively disrupt the economy and make it more difficult for local workers to secure good jobs. David Card examined this issue by looking at the labor market effects of the “Mariel boatlift”, the name given to the arrival of 125,000 Cubans into Florida during April - September 1980 (this is the impetus for Tony Montana’s move to America at the start of Scarface). This sudden arrival of relatively unskilled young men increased the size of the Miami labor force by 7%. By comparing Miami with other cities which were not affected by the Mariel boatlift, Card showed that the arrival of the Mariel workers did not notably affect the wages and employment rates of the existing unskilled workers living in Miami.

In a nutshell, it is as if some naturally occurring phenomenon or event led to an “as-if” randomization of units into different groups, and the resulting groupings can then be studied for differences in the outcomes of interest. Some interesting and recent examples of research utilizing natural experiments include work around Hurricane Katrina and the 9/11 terror attacks.

2.2.3 Quasi-Experiments

More often than not researchers can neither carry out an experiment nor access a natural experiment. For instance, consider schooling in America. Some parents choose to send their children to a private school, others send their children to a parochial school, some may place their children in a charter school, but most will send their children to a public school in their district. Suppose I am interested in figuring out whether a child’s academic performance tends to be better in one school type than in another. I have not randomized children across school types, and most likely parental and other attributes (wealth, opinions, school alternatives available, child’s current performance, the public school’s performance, and so on) have led parents to select a school for their child. This sort of deterministic sorting, which constructs the different school-type groups, is what we call non-random assignment. If I really want to pursue my study, I would have to control for as many of these differences as possible to have a reasonable chance of learning how school type influences learning. I might draw as large a random sample as I can and then use appropriate statistical techniques to see how learning differs by school type; I would then have a quasi-experimental research design.
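To see what "controlling for" buys us, consider the following simulated sketch. Linear regression via lm() stands in here for the "appropriate statistical techniques"; it is only one of several options. The data are invented, and I have deliberately built the simulation so that school type has no effect on test scores while family income drives both school choice and scores.

```r
set.seed(3)
n <- 5000

# Hypothetical standardized family income
parent_income <- rnorm(n)

# Non-random assignment: wealthier families are more likely to choose
# a private school (assumed relationship, for illustration only)
private <- rbinom(n, 1, plogis(-1 + 1.5 * parent_income))

# Assumed truth: school type has NO effect on scores; income raises scores
score <- 70 + 5 * parent_income + rnorm(n, sd = 5)

# Naive comparison: private vs. public, ignoring how families sorted
naive <- unname(coef(lm(score ~ private))["private"])

# Adjusted comparison: same contrast, holding family income constant
adjusted <- unname(coef(lm(score ~ private + parent_income))["private"])

round(c(naive = naive, adjusted = adjusted), 2)
```

The naive comparison suggests a sizable private-school advantage that is entirely an artifact of who chooses private schools; once income is held constant the apparent advantage all but disappears. In real data, of course, we can only adjust for the attributes we have actually measured.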

Take another example. Say a city decides to replace some traffic lights at major intersections with traffic circles (aka roundabouts) because they believe traffic circles will reduce accidents that otherwise occur at these intersections. Researchers could judge whether traffic circles have had any impact by gathering and analyzing data from the pre-traffic circle period to the post-traffic circle period. If traffic circles have had an impact, on average we should see fewer accidents in the post-traffic circle period.

Note that in both these examples the researcher has had no influence; she/he could neither determine where or when the traffic circle is constructed nor ensure that the drivers using these roads are the same. After all, it is quite likely that some drivers who otherwise took these routes are now avoiding these intersections because they see them as more dangerous or are uncomfortable navigating traffic circles. Consequently, anybody analyzing these data would have to control for as many factors as possible – traffic density, traffic composition (e.g., are commercial trucks still using these intersections or not), average speed approaching the intersection, and so on. Unless one could do so, it would be impossible to say with any degree of certainty that any reduction in accidents is due to the traffic circle.

2.3 Elements of a Data-set

While data are the facts and figures collected, analyzed and summarized, a data-set comprises all data collected in the course of a particular study. Take, for instance, one of the many data-sets compiled by the Appalachian Regional Commission1. A snippet of this data-set is shown in Table 2.1.

TABLE 2.1: A Typical Dataset
County           State    Unemployment %  Per Capita Income  Poverty %
Blount County    Alabama  6.5             23440              17.3
Calhoun County   Alabama  8.4             23420              21.7
Chambers County  Alabama  8.2             21494              23.9
Cherokee County  Alabama  6.8             22310              21.0

In the language of data analysis you might hear folks talk about observations and variables. By observations they mean the units that make up each specific row of the data-set. Here we have one county per row so the counties are the observations, the units. If we were surveying adults then each adult would be a unit. A variable is an attribute that differs across the observations. Here we have four variables – the state a county is in, unemployment rate, per capita income, and poverty rate. If some attribute does not differ across the observations, then it is really not a variable, is it? Since it does not vary it provides no useful information for studying any outcomes. For example, if in a sample gathered to study adult women’s reproductive health decisions we included a column called sex but the entry for each woman in the sample read “Female”, this would be wasted space. You know all the units are women so you cannot ask if some behavior or choice varies by the individual’s sex; sex is a constant in this study, not a variable.
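As a quick illustration, the four rows of Table 2.1 can be typed into R as a data frame, where rows are observations and columns are variables. The object name arc and the column names are my own choices, not names used by the Appalachian Regional Commission.

```r
# The four rows shown in Table 2.1, entered by hand
arc <- data.frame(
  county = c("Blount County", "Calhoun County",
             "Chambers County", "Cherokee County"),
  state = c("Alabama", "Alabama", "Alabama", "Alabama"),
  unemployment_pct = c(6.5, 8.4, 8.2, 6.8),
  percapita_income = c(23440, 23420, 21494, 22310),
  poverty_pct = c(17.3, 21.7, 23.9, 21.0)
)

# Observations are rows; here there are four counties
nrow(arc)

# Which columns take more than one distinct value in this snippet?
varies <- sapply(arc, function(x) length(unique(x)) > 1)
varies
```

Within this four-row snippet state comes back FALSE because every county shown is in Alabama, so in the snippet alone it behaves like a constant; in the full data-set, which spans several states, it would vary.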

2.4 Variable Types

When you are staring at a data-set you are likely to see different types of variables. Let us see this variety in the context of a specific example, that displayed in Table 2.2 below.

TABLE 2.2: Proficiency Data for Ohio’s School Districts
dirn   district                        grade      subject      Advanced Plus  Advanced
45187  Ada Exempted Village            3rd Grade  Reading      NA             30.5
45187  Ada Exempted Village            3rd Grade  Mathematics  NA             37.3
61903  Adams County Ohio Valley Local  3rd Grade  Reading      NA             18.0
61903  Adams County Ohio Valley Local  3rd Grade  Mathematics  NA             24.8
49494  Adena Local                     3rd Grade  Reading      NA             15.3
49494  Adena Local                     3rd Grade  Mathematics  NA             25.0
43489  Akron City                      3rd Grade  Reading      NA             11.6
43489  Akron City                      3rd Grade  Mathematics  0.1            14.6
45906  Alexander Local                 3rd Grade  Reading      NA             18.9
45906  Alexander Local                 3rd Grade  Mathematics  NA             14.3

You see a variable called dirn – a unique identifier for each school district. This is a numeric variable but the actual numbers don’t mean anything other than identifying a particular district. Your social security number is another such numeric variable. In contrast, the columns labeled “Advanced Plus” and “Advanced” contain true numeric values that represent the percent of students deemed Advanced Plus and the percent of students deemed Advanced, respectively. These numbers mean something. In that sense we can refer to these two proficiency levels as quantitative variables. Other examples might be median household income, number of highway fatalities on Interstate 95, number of arrests, your height, weight, age, and so on.

The variable district is a string (also called a character) variable in that it contains only text, the name of each school district. Of the other two variables, grade contains a combination of numbers and text (for example, “3rd Grade”) while subject contains just text (for example, “Reading”). These two variables and dirn are usually referred to as factors or categorical variables because they identify categories. If we wanted to recognize broad data types, we could slot every variable either as a true numeric/quantitative variable or as a qualitative/categorical variable.

Note too that Table 2.2 shows some cells with the string “NA”, an indicator for missing values; for some reason we do not have information on the percent of students Advanced Plus in a particular subject for some districts but this information is available for other districts (for instance, Akron City, 3rd Grade, Mathematics). It is not uncommon to have missing data and when you do, the data-set will show missing values with “NA”, “.”, “-9999” or some such entry2.
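The sketch below enters four rows of Table 2.2 by hand to show how R distinguishes these variable types and how it treats missing values. The column names advanced_plus and advanced, and the short vector raw with a -9999 code, are illustrative choices of mine rather than anything from the original data file.

```r
# Four rows of Table 2.2, entered by hand
districts <- data.frame(
  dirn = c(45187, 45187, 43489, 43489),
  district = c("Ada Exempted Village", "Ada Exempted Village",
               "Akron City", "Akron City"),
  grade = "3rd Grade",
  subject = c("Reading", "Mathematics", "Reading", "Mathematics"),
  advanced_plus = c(NA, NA, NA, 0.1),
  advanced = c(30.5, 37.3, 11.6, 14.6)
)

# dirn is an identifier, so treat it (and subject) as categorical
districts$dirn <- factor(districts$dirn)
districts$subject <- factor(districts$subject)

# How R stores each column
sapply(districts, class)

# Number of missing values per column
colSums(is.na(districts))

# Most summaries need to be told to skip missing values
mean(districts$advanced_plus, na.rm = TRUE)

# If a file codes missing values as -9999, recode them before analysis
raw <- c(12.5, -9999, 8.1)   # hypothetical values
raw[raw == -9999] <- NA
mean(raw, na.rm = TRUE)
```

Forgetting that last recoding step is a classic mistake: left alone, a -9999 would be averaged in as if it were a real measurement.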

Data analysts often speak in terms of dependent and independent variables, or in terms of response and explanatory variables. When they speak in these terms what they are referring to is the fact that the dependent/response variable is the outcome of interest, and that they believe this outcome can be explained/predicted by the explanatory/independent variable. For example, I want to study literacy in early childhood so I gather a random sample of children who attend a nearby child development center, measure their literacy, then provide different amounts of materials designed to improve literacy to the sampled children, and at the end of three months I measure their literacy again. In this study, literacy level is the “dependent” variable while the amount of literacy material received is the “independent” variable; how literate a child is can be explained/predicted by how much literacy development material was provided to a child.

2.5 Cross-sectional, Time-Series, and Panel Data

These are three very general types of data structures but by no means the only ones data analysts run into. However, these are the structures we are most likely to see and so a few words about each (with an example) are called for.

2.5.1 Cross-sectional Data

Cross-sectional data refer to a structure where we have measurements taken at a single point in time. For example, the U.S. Census Bureau conducts a decennial census, the Census of Population and Housing every 10 years. The result is a snapshot of socioeconomic, demographic and other conditions at a single point in time. Other examples might be a one-time public opinion survey, a financial audit of all state agencies conducted in 2017, a study of all first-time enrollees in two/four year degree granting institutions in Fall 2016, and so on. The data shown in Table 2.1 and Table 2.2 are examples of cross-sectional data.

2.5.2 Time-Series Data

Time-series data are measurements taken for a single unit over multiple time points. The Presidential Approval data we saw earlier (Figure 2.1) are a classic example of a time series; we are taking the pulse of a single unit (the nation) over multiple time periods. Other common examples would be the interest rate in the U.S., the value of the S&P 500 measured daily for a given year, and your organization’s clients’ aggregate satisfaction tracked by month/quarter/year.

2.5.3 Panel Data

If you take multiple measurements over time for the same cross-sections you have what we call panel data. For example, I might survey the same individuals or organizations every year. Similarly, the Census Bureau’s American Community Survey measures a wealth of information for states, counties, places, etc. annually, releasing both the ACS 1-year data and the ACS 5-year data. There used to be a 3-year ACS data series as well but this was stopped because of budget cuts effective 2015. Read more about it here.

Table 2.3 is an example of panel data gathered for 10 firms followed from 1935 to 1954.

TABLE 2.3: An Example of Panel Data
invest value capital firm year
317.6 3078.5 2.8 General Motors 1935
391.8 4661.7 52.6 General Motors 1936
410.6 5387.1 156.9 General Motors 1937
257.7 2792.2 209.2 General Motors 1938
330.8 4313.2 203.4 General Motors 1939
461.2 4643.9 207.2 General Motors 1940
512.0 4551.2 255.2 General Motors 1941
448.0 3244.1 303.7 General Motors 1942
499.6 4053.7 264.1 General Motors 1943
547.5 4379.3 201.6 General Motors 1944
561.2 4840.9 265.0 General Motors 1945
688.1 4900.9 402.2 General Motors 1946
568.9 3526.5 761.5 General Motors 1947
529.2 3254.7 922.4 General Motors 1948
555.1 3700.2 1020.1 General Motors 1949
642.9 3755.6 1099.0 General Motors 1950
755.9 4833.0 1207.7 General Motors 1951
891.2 4924.9 1430.5 General Motors 1952
1304.4 6241.7 1777.3 General Motors 1953
1486.7 5593.6 2226.3 General Motors 1954
209.9 1362.4 53.8 US Steel 1935
355.3 1807.1 50.5 US Steel 1936
469.9 2676.3 118.1 US Steel 1937
262.3 1801.9 260.2 US Steel 1938
230.4 1957.3 312.7 US Steel 1939
361.6 2202.9 254.2 US Steel 1940
472.8 2380.5 261.4 US Steel 1941
445.6 2168.6 298.7 US Steel 1942
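
To make the three structures concrete, here is a minimal sketch in R that types in a handful of rows from Table 2.3 and then slices them. The object names (panel, cross.section, time.series) are my own choices for illustration and not part of any data file used in this course.

```r
# Six rows copied from Table 2.3: two firms, three years each
panel = data.frame(
    invest = c(317.6, 391.8, 410.6, 209.9, 355.3, 469.9),
    value = c(3078.5, 4661.7, 5387.1, 1362.4, 1807.1, 2676.3),
    capital = c(2.8, 52.6, 156.9, 53.8, 50.5, 118.1),
    firm = rep(c("General Motors", "US Steel"), each = 3),
    year = rep(1935:1937, times = 2)
)

# Cross-section: every firm, but only one point in time
cross.section = subset(panel, year == 1936)

# Time series: one firm followed over every available year
time.series = subset(panel, firm == "US Steel")

cross.section
time.series
```

The full panel contains both views at once, which is why a panel can always be cut down to a cross-section or a time series but not the other way around.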

Often people ask if one type of data is best and the answer is yes. Think about it this way. If I want to understand public opinion on some issue, I could just run a survey once. All this would tell me is what people think at that point in time. This tells me nothing about what I would have discovered about public opinion if I had run the survey the year before or waited for a year and then run the survey. If I aggregate these data and run the same survey every year, I would learn how the average person’s opinion varies over time. However, aggregation would lead to my ignoring subtle differences in the opinion of specific types of individuals (for example, minority groups, men versus women, those with less education versus the more educated, and so on). If, on the other hand, I survey the same individuals every year I can not only study how public opinion changes over time but also how different groups felt about an issue at a specific point in time, if and how people change their opinions, and so on. This is the ultimate strength of panel data – the ability both to study something at a given point in time and to study change over time. Many people think about cross-sectional data as a static photograph while panel data allow you to draw a moving picture.

What data you can gather will be determined by your needs, the amount of money available for data collection, and time. Otherwise everybody would be gathering panel data and then tossing away portions of the data they don’t need. However, my experience has taught me to always gather more data than I need because it is easier to set aside data you do not need than to go back and gather data you bypassed because you didn’t anticipate needing it. As with knowledge, so with data: More is always better than less!

2.6 Levels of Measurement

When we set out to measure some aspect, some attribute or quality of an observation, we run into four commonly encountered mutually exclusive levels of measurement.

2.6.1 Nominal

With the nominal level of measurement the best we can do is say this is school district x, that is district y, this student is male, that student is female, this person is White, that person is Asian, and so on. This is the simplest level of measurement that distinguishes between observations in terms of some attribute but these differences have no hierarchy (i.e., we are unable to say Male is better than Female, Mathematics is above Reading, etc). We have three nominal variables in Table 2.2: dirn, district and subject. Other examples of nominal variables include an individual’s sex, the color of your eyes, whether you are left-handed or right-handed, etc.

2.6.2 Ordinal

With the ordinal level of measurement we are able to assign some hierarchy to the data as, for example, in the meaning that 5th Grade is above 4th Grade which is, in turn, above 3rd Grade (i.e., \(5^{th} \text{ Grade} > 4^{th} \text{ Grade} > 3^{rd} \text{ Grade}\)). In Table 2.2 grade is an ordinal variable. Other examples would be a college student’s standing (Freshman/Sophomore/Junior/Senior), your opinion about gluten-free bread (Dislike/Like a Little/Like a Lot), income categories (Poor/Middle/Rich), and so on.
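
In R, the nominal versus ordinal distinction can be sketched with factors. The example below is only an illustration with made-up values; the object names subject and standing are mine.

```r
# Nominal: categories with no ordering among them
subject = factor(c("Reading", "Mathematics", "Reading"))

# Ordinal: the same kind of object, but with an explicit ordering of the levels
standing = factor(c("Junior", "Freshman", "Senior", "Sophomore"),
    levels = c("Freshman", "Sophomore", "Junior", "Senior"),
    ordered = TRUE)

is.ordered(subject)   # FALSE
is.ordered(standing)  # TRUE

# Comparisons are meaningful only for the ordered factor
standing[1] > standing[2]  # is Junior above Freshman?
max(standing)
```

Note that the order has to be supplied through levels; left to itself R would sort the labels alphabetically.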

2.6.3 Interval/Ratio

The interval/ratio level of measurement applies to numeric variables where the numerical values mean something. For instance, the percent of students at a specific proficiency level is a numeric variable. If I see Advanced proficiency values of \(10.2\) for District A and \(20.4\) for District B then the share of students at Advanced proficiency in District B is twice that in District A; this is a ratio. There is also a \(10.2\) percentage-point gap in Advanced proficiency between District A and B; this is a statement of numerical difference. Both statements are true and possible, depending upon what we want to say. Notice too that it is possible for a District to have no student at Advanced proficiency; this would be the smallest value possible, a value of \(0\).

However, what if we were measuring something like the maximum temperature in your hometown today versus what it was yesterday? You would be measuring this in degrees Fahrenheit, and maybe it was \(80^{\circ}F\) today and \(60^{\circ}F\) yesterday. You couldn’t say today is \(1.33\) times hotter than yesterday. All you could say is that it is \(20\) degrees warmer today than it was yesterday, period. Why can’t we speak in terms of ratios with temperature in Fahrenheit when we could with proficiency levels? We can’t because the Fahrenheit temperature scale is a numeric scale but an arbitrary one where \(0^{\circ}F\) does not mean there is no temperature. However, if Advanced proficiency is \(0\) it means nobody is at this level. So we arrive at a distinction: ratio levels of measurement for numeric variables that have a true zero, where zero means a complete absence of whatever is being measured, versus an interval level of measurement for numeric variables that do not have a true zero. Other examples of ratio levels of measurement would be your income, distance of your college town from your hometown, years of formal schooling, number of highway fatalities, number of black swans you have seen, number of children in a family, and so on. Some examples of interval levels of measurement would be a student’s score on a standardized test like the GRE or SAT or ACT, or a feeling thermometer that asks you to rate your feeling towards your Senator with a \(0\) indicating you are very cold towards this individual and \(100\) indicating you feel very warmly towards this individual.
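
A quick way to convince yourself that ratios are not meaningful on an interval scale is to express the same two temperatures on other scales and recompute the ratio. The sketch below does this in R; the helper functions f.to.celsius and f.to.kelvin are hypothetical names I made up for the illustration.

```r
today.f = 80
yesterday.f = 60

# Hypothetical helpers: convert Fahrenheit to Celsius and to Kelvin
f.to.celsius = function(f) (f - 32) * 5/9
f.to.kelvin = function(f) f.to.celsius(f) + 273.15

# The "ratio" changes with the scale because each scale puts zero somewhere else
ratio.f = today.f/yesterday.f                              # about 1.33
ratio.c = f.to.celsius(today.f)/f.to.celsius(yesterday.f)  # about 1.71
ratio.k = f.to.kelvin(today.f)/f.to.kelvin(yesterday.f)    # about 1.04

c(ratio.f, ratio.c, ratio.k)
```

Three different answers for the same two days tell us the ratio is an artifact of where zero sits. Only the Kelvin scale, where zero is a true zero, yields a ratio that can be interpreted.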

Both ratio and interval levels of measurement may yield discrete or continuous variables. Discrete variables would be numerical variables with values that are finite and hence countable. For example, I can count the number of children in a family, the number of cars owned by a family, the number of traffic fatalities occurring in a particular week, and so on. Continuous variables are numerical variables with values that are infinite and hence uncountable. For example, time can be measured in a number of ways, each measuring the same concept (your age, for instance) in a finer and finer way. Technically, we can define these two types of numerical variables most simply as “discrete” if there is no intermediate value possible between two successive values versus “continuous” if any number of values is possible between two successive values. Applying this logic to the number of children we recognize that a family cannot have 1.5 children, it is either 1 or 2, nothing in between. However, the youngest and the next youngest student in a classroom may have values of 18 years and one month and 19 years and two months but we do know there exist many people with ages between these two students.

Before we close our discussion of measurement levels, note that there is a hierarchy to levels of measurement: \(ratio > interval > ordinal > nominal\). Ideally you would measure everything at the ratio level but this is not always possible. The best we can do with measuring sex, for example, is record an individual’s sex, that is it; there is no way to come up with granular numerical values of sex. The reason why the ratio level trumps the other three is because you can convert ratio levels into any of the other three by suitable alterations as, for example, by asking survey respondents to report their age (in years) and then collapsing age into categories of 18-24, 25-34, 35-44, 45-54, 55-64, and 65 or older (you just created an ordinal variable). But, if you had asked for age categories to begin with you would never be able to extract each individual’s actual age. Note also that in some fields they refer to numerical variables as “continuous” while others will refer to them as “measurement” variables, and categorical variables might be called “factors”.
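
The collapsing of age into categories described above can be sketched in R with cut(). The ages below are made up for illustration.

```r
# Made-up ages in years (ratio level)
age = c(18, 24, 25, 37, 44, 58, 64, 65, 80)

# Collapse into ordered categories; right = FALSE makes each interval
# include its lower bound and exclude its upper bound, e.g. [18, 25)
age.group = cut(age, breaks = c(18, 25, 35, 45, 55, 65, Inf), right = FALSE,
    labels = c("18-24", "25-34", "35-44", "45-54", "55-64", "65 or older"),
    ordered_result = TRUE)

table(age.group)
```

Going from age to age.group is easy; there is no function that can take age.group and give you back age.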

2.7 The Tricky Business of Cause-and-Effect

Among the many things I want to caution you against is thinking in terms of cause-and-effect. People often see correlations and think \(x\) must cause \(y\), or they see a simple table that shows the driver’s race (black versus white) in the rows and whether the driver was subject to a traffic stop (stopped/not stopped) in the columns and think race causes a traffic stop. Indeed, you do this in your own life, many times a day, thinking something happened because of \(x\) or that if you do such-and-such some outcome will occur. At a base level, causality is around us and we have become programmed to understand our world in terms of causality and make predictions based on causality. The problem is, cause-and-effect is a difficult thing to demonstrate in general and especially so in the social and behavioral sciences.

Although we cannot here cover the granular details and epidemiological debates that surround the study of causality, we can agree upon three commonly used rules to demonstrate cause-and-effect.

  1. The presumed cause must precede, in time, the claimed effect. This is a question of the temporal order of the cause (\(x\)) and the effect (\(y\)). Take the example of depression and drinking; those who are depressed are often found to have high levels of alcohol consumption but does that necessarily mean depression causes alcoholism? Couldn’t it be that alcoholism causes depression since alcohol is a depressant? We need to prove that depressed people drank a little or not at all before depression set in and then drank a lot during the episodes of depression in order to satisfy this rule.
  2. We also need to rule out all possible rival explanations for the effect. By this we mean there must be no other logical explanation for the effect we see. Unless we do so we don’t really know if the presumed cause is the sole driver of the effect. Ruling out all possible rival explanations is difficult because the real world is made up of interconnected complex processes and phenomena, and because our knowledge of how these processes and phenomena interact tends to be less than perfect.
  3. The cause and the effect must co-vary. That is, we must be able to demonstrate a relationship between \(x\) and \(y\) as, for example, that if depressed then alcoholic and if not depressed then not alcoholic.

When all three rules are satisfied, more often than not we would have done a reasonable job of demonstrating cause-and-effect. You might wonder why I am emphasizing the tricky nature of demonstrating causality; I do not want you to make the mistake that millions commit daily, of thinking correlation equals causation, of failing to look for plausible rival explanations, of failing to check to see if there is indeed a measurable relationship between the cause and the effect.


2.8 Basic R and RStudio Operations

2.8.1 Installing R and RStudio

R is a free software environment for statistical computing and graphics. It is powerful, elegant, and incredibly flexible, and the best part is you don’t need to be a programmer to use it. RStudio is a graphical user interface (GUI) for R that is also free and yet more powerful than any commercial software solution in existence today.

The first thing you need to do to get going is to download R. You can download R for Windows from here and R for Mac from here. Double-click on the downloaded file and accept the default settings as you go through the installation. Once R is installed you can install RStudio for Windows from here and RStudio for Mac from here. There are daily builds of RStudio, and the latest builds have some features not in the latest stable release. I would suggest that for this semester you download and install RStudio from here for Windows and here for Mac.

Accept the default prompts through the installation process. Once installation finishes double-click the RStudio shortcut/icon and RStudio will launch. If all goes well you should see R starting up inside RStudio and the interface looking as shown below:

2.8.2 Updating R and RStudio

Both R and RStudio go through very frequent updates, some minor, some major. As needed, repeat the steps you took above to re-install and update your version of R and RStudio.

2.8.3 Installing Packages

R has packages dedicated to performing specific tasks. Want to analyze gene sequencing data? There is a package for it. Want to analyze financial data? There is a package for that. How about elegant graphics? You bet; there is a package for that. Mapping anyone? You betcha; there is a package for that too. We will use specific packages in this course and I will point out what packages you need and when. You will have to use the “Install Packages…” option under Tools to install these packages. Packages are frequently updated so once a month you should see if there are updates available by running “Check for Package Updates…”, also under Tools.

Let us go ahead and install some packages we will most likely end up using at some point or another.

install.packages(c("ggplot2", "ggmap", "lattice", "plotly", "ggvis", 
    "visreg", "devtools", "psych", "foreign", "haven", "readxl", 
    "readr", "Hmisc", "car", "pscl", "maps", "Rcmdr", "arm", 
    "choroplethr", "choroplethrMaps", "ggthemes", "cowplot", 
    "wesanderson", "ggrepel", "ggExtra", "gganimate", "ggTimeSeries", 
    "ggsci", "ggridges", "GGally", "ggiraph", "scales", "googleVis", 
    "dplyr", "plyr", "tidyr", "reshape2"))

Note the need to put double or single quotes around each package’s name, and to wrap the names in c() so that they are passed to install.packages() as a single vector. Without c(), R would read the second name as the library location rather than as another package to install.
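
If you rerun your scripts often, you may not want to reinstall packages that are already on your machine. The following is a minimal sketch of one way to install only what is missing; the short package list and the object names (needed, already, to.install) are mine, and the interactive() guard is an assumption I added so that nothing is downloaded when the script is run unattended.

```r
# A short, illustrative list of packages we might need
needed = c("ggplot2", "dplyr", "readxl")

# Names of packages already installed in the current library paths
already = rownames(installed.packages())

# Keep only those that are not yet installed
to.install = setdiff(needed, already)

# Install the missing ones, but only in an interactive session
if (length(to.install) > 0 && interactive()) {
    install.packages(to.install)
}
```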

2.8.4 Reading Data

R can read data created in various formats (SPSS, SAS, Stata, Excel, CSV, TXT, etc). The most common data formats you will encounter are likely to be CSV or Excel files. Let us see how to read data in these formats by first downloading and saving the data available here (as a zip archive). Once this file downloads, double-click it and extract all files to a new folder (title it Data) you create in your OU Box folder for the course.

2.8.4.1 CSV & Tab-delimited Formats

With the CSV format a comma separates each variable (a column), and each row in the original file represents an observation.

The first thing R will need to know is where your data reside. This can be accomplished either by setting the working directory or by explicitly specifying the path to your data. We will employ the second option for now.

df.csv = read.csv("~/Documents/Teaching/MPA 6020/Data/ImportDataCSV.csv", 
    sep = ",", header = TRUE)

df.csv is the name I have chosen to give to the data being read. I am telling R that it is in CSV format, where the file resides, the file-name, the fact that variables (one in each column) are separated by a comma (,), and the fact that the original data have column-headings (header=TRUE).

Note that when you create anything in R, you do so either via the = symbol or via the <- symbol. Thus df.csv = read.csv(...) is the same as df.csv <- read.csv(...) but my suggestion would be to stick with =.

When you execute the command you will see df.csv showing up under Data in the upper-right pane of RStudio. Click on df.csv and you can see the data.

A similar process works for reading in tab-delimited files where the columns are separated by a tab rather than by a comma.

df.tab = read.csv("~/Documents/Teaching/MPA 6020/Data/ImportDataTAB.txt", 
    sep = "\t", header = TRUE)

Note the one difference here: I have told R it is a tab-delimited file by specifying sep = "\t".
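
If you do not have the course data handy, the same two commands can be tried on small files that you create yourself. The sketch below writes two tiny files to a temporary location (so nothing is left behind on your machine) and reads them back; the file contents and object names are made up for illustration.

```r
# Create two small files in a temporary location
csv.file = tempfile(fileext = ".csv")
tab.file = tempfile(fileext = ".txt")
writeLines(c("id,score", "1,90", "2,85"), csv.file)
writeLines(c("id\tscore", "1\t90", "2\t85"), tab.file)

# Same function, different separator
toy.csv = read.csv(csv.file, sep = ",", header = TRUE)
toy.tab = read.csv(tab.file, sep = "\t", header = TRUE)

str(toy.csv)                 # two variables, two observations
identical(toy.csv, toy.tab)  # both files yield the same data
```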

2.8.4.2 Excel Format (.xls & .xlsx)

There are several packages that will allow you to read files in various Excel formats but the one I prefer is readxl. Whenever we need to use a package we will have to first load it and then execute whatever commands call upon the loaded package’s features as shown below.

library(readxl)
df.xls = read_excel("~/Documents/Teaching/MPA 6020/Data/ImportDataXLS.xls")
df.xlsx = read_excel("~/Documents/Teaching/MPA 6020/Data/ImportDataXLSX.xlsx")

Note the one minor difference in the commands; the xlsx file is called ImportDataXLSX.xlsx.

2.8.4.3 SPSS, Stata, and SAS formats

At times, and especially from some major federal agencies, the data you will need to access may be shipped in a particular format. Some disciplines/fields are also accustomed to working with a specific file format. For example, economists and those who work in public health have historically used Stata and SAS, respectively. As a result, whether the data are from the CDC’s BRFSS or some other survey series, you will often see the data being made available for download in these formats. Below I show you how to read data that come to us in these formats.

library(haven)
df.stata = read_stata("~/Documents/Teaching/MPA 6020/Data/ImportDataStata.dta")
df.sas = read_sas("~/Documents/Teaching/MPA 6020/Data/ImportDataSAS.sas7bdat")
df.spss = read_sav("~/Documents/Teaching/MPA 6020/Data/ImportDataSPSS.sav")

2.8.4.4 Fixed-width files

It is also common to encounter fixed-width files. These are files where the raw data are stored without any gaps between successive variables. Yes, no commas, tabs, or other delimiters. However, these files will come with documentation that will tell you where each variable starts and ends, along with other details about each variable. Let us see how with a very small example. Very shortly we will have to visit the fixed-width format in greater detail and learn more nuanced techniques vis-a-vis this format.

df.fw = read.fwf("~/Documents/Teaching/MPA 6020/Data/fwfdata.txt", 
    widths = c(4, 9, 2, 4), header = FALSE, col.names = c("Name", 
        "Month", "Day", "Year"))

Notice that we have to specify the width of each variable and then assign column names.
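
To see how the widths carve up each line, here is a minimal self-contained sketch. I do not know what the course file fwfdata.txt contains, so the two lines below are made up to fit the same widths (4, 9, 2, 4); strip.white = TRUE is an extra argument, passed along to read.table(), that I added to trim the padding spaces.

```r
# Two made-up records: Name (4 chars), Month (9), Day (2), Year (4)
fw.file = tempfile(fileext = ".txt")
writeLines(c("AnnaJanuary  051990", "JoshSeptember171985"), fw.file)

toy.fw = read.fwf(fw.file, widths = c(4, 9, 2, 4), header = FALSE,
    col.names = c("Name", "Month", "Day", "Year"), strip.white = TRUE)

toy.fw
```

The first four characters of each line become Name, the next nine Month (padded with spaces when the month is short), the next two Day, and the last four Year.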

2.8.4.5 Reading Files from the Web

It is also possible to specify the full URL (web-path) for a file and read in files of virtually any format. Below you see the code for reading files in various formats off the web. This eliminates the need to keep a physical copy of the input data file on your computer. This way if the input file is updated, your R script will always be pulling the current version of the file. R is incredibly versatile; it can read data from Twitter feeds, buoys sitting in the Atlantic Ocean, and so much more!

fpe = read.table("http://data.princeton.edu/wws509/datasets/effort.dat")
test.txt1 = read.table("https://stats.idre.ucla.edu/stat/data/test.txt", 
    header = TRUE)
test.csv1 = read.csv("https://stats.idre.ucla.edu/stat/data/test.csv", 
    header = TRUE)

library(readr)
test.txt2 = read_table2("https://stats.idre.ucla.edu/stat/data/test.txt", 
    col_names = TRUE)
test.csv2 = read_csv("https://stats.idre.ucla.edu/stat/data/test.csv", 
    col_names = TRUE)

library(foreign)
hsb2.spss = read.spss("https://stats.idre.ucla.edu/stat/spss/webbooks/reg/hsb2.sav")
df.hsb2.spss1 = as.data.frame(hsb2.spss)

hsb2.stata1 = read.dta("http://www.philender.com/courses/data/hsb2.dta")

library(haven)
hsb2.spss2 = read_sav("https://stats.idre.ucla.edu/stat/spss/webbooks/reg/hsb2.sav")
hsb2.stata2 = read_dta("http://www.philender.com/courses/data/hsb2.dta")

Since we often end up with several intermediate files or “objects” as we work, it pays to remove some of these from our working memory (see the Environment tab) and this is easily done via:

rm("hsb2.spss")  # Deleting hsb2.spss from our working memory  
rm("hsb2.spss", "hsb2.stata1", "df.xls")  # Deleting multiple files from working memory 
rm(list = ls())  # Delete everything in the Global Environment. Careful with this one!!!!

Notice that I used two packages, foreign and haven, to read in the SPSS and Stata files. The foreign package has, at least in my workflow, been superseded by haven because I feel it works better. Similarly, you can start by relying on the readr package to read txt and csv data files. Note that test.txt2 has an extra row, full of missing values for every variable. We’ll learn how to clean this up but for now, let it be as is. Finally, note the use of rm(); you can use it to delete one or more objects or everything in the Global Environment. But use it wisely since you don’t want to erase everything, just objects that are not going to be used.

2.8.5 Basic Data Operations in R

You can generate your own data, manipulate data by adding, subtracting, dividing, or multiplying, and convert numeric data to factors (qualitative variables), etc. We will see a few basic data operations at work below. Let us start by creating some data.

2.8.5.1 Creating A Small Data-Set

Let us create two variables, x and y.

x = c(100, 101, 102, 103, 104, 105, 106)
y = c(7, 8, 9, 10, 11, 12, 13)
df.columns = as.data.frame(cbind(x, y))
df = cbind.data.frame(x, y)

The commands above generate two columns, x and y, and then bind them as columns into a data-set called df.columns. If we used rbind() instead it would bind x and y as rows instead of columns.

x = c(100, 101, 102, 103, 104, 105, 106)
y = c(7, 8, 9, 10, 11, 12, 13)
df.rows = as.data.frame(rbind(x, y))
df = rbind.data.frame(x, y)

Note that when we use rbind() it names the columns V1, V2, and so on. Often we will want to label the columns differently from how they were read-in. This is easily accomplished:

names(df.columns) = c("Variable 1", "Variable 2")
names(df.rows) = c("Variable 1", "Variable 2", "Variable 3", 
    "Variable 4", "Variable 5", "Variable 6", "Variable 7")

You can also generate data-sets that combine quantitative and qualitative variables. This is demonstrated below:

x = c(100, 101, 102, 103, 104, 105, 106)
y = c("Male", "Female", "Male", "Female", "Female", "Male", "Female")
df.1 = as.data.frame(cbind(x, y))
df.1 = cbind.data.frame(x, y)

x = c(100, 101, 102, 103, 104, 105, 106)
y = c(0, 1, 0, 1, 1, 0, 1)
df.2 = as.data.frame(cbind(x, y))
df.2 = cbind.data.frame(x, y)

Note that in df.1, y is a string variable with values of Male/Female. In contrast, df.2 has y specified as a 0/1 variable, with 0=Male and 1=Female. We could label the 0/1 values in df.2 as follows:

df.2$y = factor(df.2$y, levels = c(0, 1), labels = c("Male", 
    "Female"))

If you click the “play” button before df.2 you will see the contents of the data-set. Note that x is shown as num (numeric) while y is shown as Factor with two levels “Male”, “Female”.
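
One caution about the two ways of building df.1 shown earlier: cbind() first builds a matrix, and a matrix can hold only one type of data, so mixing numbers and text through as.data.frame(cbind(x, y)) turns the numbers into text. The sketch below illustrates this with the same x and y; the object names via.cbind and via.df are mine.

```r
x = c(100, 101, 102, 103, 104, 105, 106)
y = c("Male", "Female", "Male", "Female", "Female", "Male", "Female")

# Route 1: cbind() makes a character matrix, so x stops being numeric
via.cbind = as.data.frame(cbind(x, y))
is.numeric(via.cbind$x)   # FALSE

# Route 2: cbind.data.frame() keeps each column's own type
via.df = cbind.data.frame(x, y)
is.numeric(via.df$x)      # TRUE
```

This is why, when a data-set mixes quantitative and qualitative variables, the second route is the safer of the two.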

We can operate on any column, for example column x, as follows:

df.2$x1 = df.2$x * 10
df.2$x2 = df.2$x * 100
df.2$x3 = df.2$x/10
df.2$x4 = sqrt(df.2$x)
df.2$x5 = df.2$x^(2)
df.2$x6 = df.2$x * 1.31
df.2$x7 = sqrt(df.2$x)

The same operations can be achieved by doing the following:

df.2$x11 = with(df.2, x * 10)
df.2$x22 = with(df.2, x * 100)
df.2$x33 = with(df.2, x/10)
df.2$x44 = with(df.2, sqrt(x))
df.2$x55 = with(df.2, x^(2))
df.2$x66 = with(df.2, x * 1.31)
df.2$x77 = with(df.2, sqrt(x))

Note the various operators; we multiply via *, divide via /, take the square-root via sqrt(), square via ^ and so on.

2.8.5.2 Saving R Data

We can save a data-set we have created quite easily (see below):

save(df.2, file = "./data/df2.RData")

Note the sequence. We specify the data set we want to save, here df.2, and then the location and filename of the saved data: file = "./data/df2.RData". If you look at the folder specified in the command you will see a file called df2.RData.
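
Saved data can be brought back with load(). Here is a minimal round-trip sketch that uses a temporary file rather than the course folder; the small data-set toy is made up for illustration.

```r
toy = data.frame(x = 1:3)

# Save to a temporary file, then remove the object from memory
rdata.file = tempfile(fileext = ".RData")
save(toy, file = rdata.file)
rm(toy)

# load() restores the object under the name it had when it was saved
load(rdata.file)
exists("toy")
toy
```

Notice that load() is not assigned to anything; the object comes back with its original name.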

2.8.5.3 Loading and Modifying R Data

Let us load some larger data-sets, perhaps the hsb2 data we used last semester.

hsb2 = read.table("https://stats.idre.ucla.edu/stat/data/hsb2.csv", 
    header = TRUE, sep = ",")

Note that there are no labels for the various qualitative variables (female, race, ses, schtyp, and prog) so we’ll have to create these.

hsb2$female = factor(hsb2$female, levels = c(0, 1), labels = c("Male", 
    "Female"))
hsb2$race = factor(hsb2$race, levels = c(1:4), labels = c("Hispanic", 
    "Asian", "African American", "White"))
hsb2$ses = factor(hsb2$ses, levels = c(1:3), labels = c("Low", 
    "Middle", "High"))
hsb2$schtyp = factor(hsb2$schtyp, levels = c(1:2), labels = c("Public", 
    "Private"))
hsb2$prog = factor(hsb2$prog, levels = c(1:3), labels = c("General", 
    "Academic", "Vocational"))

Having added labels to the factors in hsb2 we can now save the data for later use.

save(hsb2, file = "./data/hsb2.RData")

We can also delete variables, create new variables, change the variable names, and so on. Let us see this with a small data-set that we create.

my.df = data.frame(sex = c("M", "F", "M", "F"), NAMES = c("Andy", 
    "Jill", "Jack", "Madison"), age = c(24, 48, 72, 96))

Say I want to change the variable NAMES to be lowercase. I can do this via

colnames(my.df)[2] = "names"

If I wanted all variable names to be lowercase I would do

my.df = data.frame(sex = c("M", "F", "M", "F"), NAMES = c("Andy", 
    "Jill", "Jack", "Madison"), age = c(24, 48, 72, 96))
colnames(my.df) = tolower(colnames(my.df))

I can create a new variable, female as follows:

my.df$female[my.df$sex == "M"] = 0
my.df$female[my.df$sex == "F"] = 1
my.df$female = factor(my.df$female, levels = c(0, 1), labels = c("Male", 
    "Female"))

Notice how R stored the factor with values 1 and 2 even though the original coding had 0 and 1.
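
You can inspect this internal coding directly. The following small sketch rebuilds a 0/1 variable as a factor and then looks at what R stores underneath; the values are made up.

```r
female = factor(c(0, 1, 0, 1), levels = c(0, 1), labels = c("Male", "Female"))

levels(female)           # the labels, in order
as.integer(female)       # internal codes: 1 2 1 2
as.integer(female) - 1   # one way to get back to the original 0/1 coding
```

The internal codes are simply the positions of the levels, which is why they start at 1 regardless of how the original variable was coded.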

If I wanted to drop the original sex variable I could do so as follows

my.df$sex = NULL

What if I wanted to convert age, currently stored in months, to years?

my.df$ageinyrs = my.df$age/12

What if I wanted to convert the names into uppercase? Into lowercase?

my.df$name1 = toupper(my.df$names)
my.df$name2 = tolower(my.df$names)

2.8.5.4 Exporting data from R

One can also export data created or manipulated via R into various formats. Take the data-frame below, for example; it is a small data-frame with two variables, one numeric and one categorical. Let us see how to export it to specific formats.

out.df = data.frame(Person = c("John", "Timothy", "Olivia", "Sebastian", 
    "Serena"), Age = c(22, 24, 18, 24, 35))

write.csv(out.df, file = "./data/out.csv", row.names = FALSE)

library(haven)
write_dta(out.df, "./data/out.df.dta")
write_sav(out.df, "./data/out.df.sav")
write_sas(out.df, "./data/out.df.sas7bdat")  # recent haven releases deprecate write_sas() in favor of write_xpt()

So using the haven package we exported out.df to Stata, SPSS, and SAS formats. You can also export to Excel if you need to as follows:

library(writexl)
write_xlsx(out.df, "./data/out.df.xlsx")

In the write.csv() command, specifying row.names = FALSE prevents the unique row numbers or names (if the rows have names) that R uses for its operations from being exported to the target file.
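
The effect of row.names is easiest to see by exporting the same data both ways and reading the files back. This is a small sketch with made-up data and temporary files.

```r
toy = data.frame(Person = c("John", "Olivia"), Age = c(22, 18))

with.rows = tempfile(fileext = ".csv")
without.rows = tempfile(fileext = ".csv")

write.csv(toy, file = with.rows)                        # default: row names are written
write.csv(toy, file = without.rows, row.names = FALSE)  # row names are left out

ncol(read.csv(with.rows))     # 3: an extra leading column holding the row labels
ncol(read.csv(without.rows))  # 2: just Person and Age
```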

2.8.6 Workspaces

When you go to close RStudio you will be asked if you want to save your workspace. If you say yes, then all commands you executed in the active session plus any objects/data you created will be saved to your machine. The next time you start up RStudio you will be back where you had stopped. This is a good idea for specific projects where you have large data and/or cumbersome tasks you have to perform in stages, because saving the workspace lets you avoid repeating those tasks the next time. Otherwise my default tends to be to never save the workspace.

If you need to save the workspace, go to the Session menu in RStudio and then click on Save Workspace As... and give it an appropriate name.

2.8.7 Working Directory

You can see what your current working directory is by executing getwd(), and if you need to change it to some other directory, you can execute setwd("write_path_to_directory_here"). Of course, in RStudio you can also use the Set Working Directory option found in the Session menu.
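
Here is a minimal sketch of these two commands at work. It switches to R's temporary folder, purely as a stand-in for whatever folder you would actually use, and then switches back.

```r
old.wd = getwd()   # remember where we started
old.wd

setwd(tempdir())   # switch to another folder (here, R's temporary folder)
getwd()

setwd(old.wd)      # switch back to where we started
getwd()
```

Storing the original location before changing it is a good habit since it lets you return without retyping the path.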


2.9 Practice Problems

Problem 1

In a browser, open up the Collaborative Institutional Training Initiative (CITI Program) website. Create an account by clicking on the Register button. Enter all information asked for: your institution = Ohio University, first-name and last-name, your OU email address and then a second email address, a username and a password. Skip the continuing education (CE) credits portion unless you are looking to pick up some credits, but if you do this you may have to pay. Make sure you select Human subjects training in Step 7 and then Group 2: Social and Behavioral Investigators and Key Personnel. Make sure you Finalize registration on the next screen. The very next screen will show the Group 2: Social and Behavioral Investigators and Key Personnel course and current status. You are now ready to take this course. Once you complete the course and are deemed to have passed you will need to save your certificate, which you will submit as a part of your first assignment. Hold on to this in case you need it in the near future.

Problem 2

Identify whether the following variables are numeric or categorical, and the level of measurement.

  1. The color of an individual’s eyes
  2. An adult individual’s age
  3. Number of children in a household
  4. An undergraduate student’s standing in the college (Freshman/Sophomore/Junior/Senior)
  5. School type (public/private/parochial/charter)
  6. A student’s SAT or ACT score
  7. Household income
  8. Your monthly cellphone bill
  9. An attitudinal scale where individuals are asked to rate how they feel about the Supreme Court of the United States (0 = Very Cold, 50 = Neutral, 100 = Very Warm)
  10. An attitudinal scale where individuals are asked to rate how they feel about the United States Congress (-100 = Very Cold, 0 = Neutral, 100 = Very Warm)
  11. Number of highway fatalities on a specific one mile stretch of US 33E between Athens and Nelsonville, Ohio
  12. Whether an individual voted or not in the last general election
  13. An individual’s race/ethnicity
  14. Your nationality

Problem 3

Open up EPA’s Fuel Economy data. These data are the result of vehicle testing done at the Environmental Protection Agency’s National Vehicle and Fuel Emissions Laboratory in Ann Arbor, Michigan, and by vehicle manufacturers with oversight by EPA. You will need to carefully read the accompanying document that explains what and how some attribute is being measured; the document is available here. Identify whether the following variables are numeric or categorical, and their level of measurement.

  1. fuelType1
  2. charge120
  3. cylinders
  4. drive
  5. year
  6. ghgScore
  7. highway08
  8. model
  9. trany
  10. youSaveSpend

The next set of questions deals with this data-set, which consists of responses from graduate students in the social sciences enrolled in STA 6126 in a recent term at the University of Florida. The variables are:

  • GE gender
  • AG age in years
  • HI high school GPA (on a four-point scale)
  • CO college GPA
  • DH distance (in miles) of the campus from your home town
  • DR distance (in miles) of the classroom from your current residence
  • TV average number of hours per week that you watch TV
  • SP average number of hours per week that you participate in sports or have other physical exercise
  • NE number of times a week you read a newspaper
  • AH number of people you know who have died from AIDS or who are HIV+
  • VE whether you are a vegetarian
  • PA political affiliation (d = Democrat, r = Republican, i = independent)
  • PI political ideology
  • RE how often you attend religious services
  • AB opinion about whether abortion should be legal in the first three months of pregnancy
  • AA support affirmative action
  • LD belief in life after death

Download these data and read them into your software package (SPSS, Excel, etc.). Make sure you label all the variables and, for categorical variables, their values. For example, the variable HI must be labeled high school GPA (on a four-point scale), the values of political affiliation must be labeled Democrat/Independent/Republican rather than d/i/r, and so on.
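If you work in R, as the code earlier in this chapter does, the labeling step might look like the minimal sketch below. The three-row data frame `students` is a hypothetical stand-in for the downloaded file, and storing variable descriptions in a `"label"` attribute is one convention among several, not something the assignment prescribes.

```r
# Hypothetical three-row extract; the real file has more rows and columns.
students <- data.frame(
  HI = c(3.2, 3.8, 2.9),
  PA = c("d", "i", "r"),
  stringsAsFactors = FALSE
)

# Value labels: recode d/i/r into descriptive factor levels.
students$PA <- factor(
  students$PA,
  levels = c("d", "i", "r"),
  labels = c("Democrat", "Independent", "Republican")
)

# Variable labels: base R has no dedicated variable-label field, so this
# sketch stores each description in a "label" attribute on the column.
attr(students$HI, "label") <- "high school GPA (on a four-point scale)"
attr(students$PA, "label") <- "political affiliation"

# Inspect the result.
levels(students$PA)
attr(students$HI, "label")
```

Listing `levels` and `labels` explicitly, in the same order, keeps the mapping from code to label visible and avoids relying on alphabetical ordering.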

Problem 4

For each variable in the data-set, identify whether it is numeric or categorical. Also identify whether it is a ratio or interval level of measurement (if numeric), and nominal or ordinal (if categorical).


Why are our best and most experienced employees leaving prematurely? The data available here include information on several current and former employees of an anonymous organization. Fields in the data-set include:

  • satisfaction_level = Level of satisfaction (0-1)
  • last_evaluation = Evaluation of employee performance (0-1)
  • number_project = Number of projects completed while at work
  • average_monthly_hours = Average monthly hours at workplace
  • time_spend_company = Number of years spent in the company
  • Work_accident = Whether the employee had a workplace accident
  • left = Whether the employee left the workplace or not (1 or 0)
  • promotion_last_5years = Whether the employee was promoted in the last five years
  • sales = Department in which the employee works
  • salary = Relative level of salary (low, medium, high)

Problem 5

Identify whether each variable is numeric or categorical, and whether the level of measurement is nominal/ordinal/interval/ratio.

Problem 6

Say you are interested in studying if and how a tutoring program for first-generation college students (i.e., the first in their family to attend college) helps these students complete their program of study. What type of research design would you have to work with – experimental? Quasi-experimental? A natural experiment? Why? Would you prefer to use cross-sectional data, time-series data, or panel data and why? What variables would you want to include in your data-set and why? What would be the unit of analysis?

Problem 7

What conditions lead to a biased estimate versus an imprecise estimate? Explain with reference to an original example.

Problem 8

What are the three commonly accepted conditions that must be met to demonstrate causality?


  1. The Appalachian Regional Commission (ARC) is a regional economic development agency that represents a partnership of federal, state, and local government. Established by an act of Congress in 1965, ARC is composed of the governors of the 13 Appalachian states and a federal co-chair, who is appointed by the president. Local participation is provided through multi-county local development districts.

  2. These data show the percent of students with a specific proficiency level, by grade and subject, for Ohio’s public school districts. There are additional proficiency levels not shown in the table: Accelerated, Proficient, Basic, and Limited. These and other data can be found here.

  3. The Celsius temperature scale is also an arbitrary scale. This is why for scientific purposes the Kelvin scale is used. Wikipedia describes it thus: “The Kelvin scale is an absolute, thermodynamic temperature scale using as its null point absolute zero, the temperature at which all thermal motion ceases in the classical description of thermodynamics.”
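A short R sketch can make the point about arbitrary zero points concrete. The two temperatures below are illustrative values only. Differences between them are the same on both scales, but the ratio changes once the zero point moves, which is why ratio statements are meaningful in Kelvin and not in Celsius.

```r
# Two illustrative temperatures in degrees Celsius.
celsius <- c(cool = 10, warm = 20)

# Kelvin shifts the zero point to absolute zero.
kelvin <- celsius + 273.15

# Differences are preserved across the two scales.
diff_celsius <- celsius[["warm"]] - celsius[["cool"]]
diff_kelvin <- kelvin[["warm"]] - kelvin[["cool"]]

# Ratios are not: "twice as warm" holds only on the Celsius numbers.
ratio_celsius <- celsius[["warm"]] / celsius[["cool"]]
ratio_kelvin <- kelvin[["warm"]] / kelvin[["cool"]]

c(ratio_celsius = ratio_celsius, ratio_kelvin = ratio_kelvin)
```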