Introduction

Although this chapter was originally written to introduce the topic of Action Research, most of the ideas are relevant no matter what type of research you are doing, because I prefer to take a more general quantitative research design approach to Action Research. Rather than simply describing the various research designs possible, the chapter also presents ways to think about the issues a researcher must face when designing a research project. This chapter provides only the first two sections of the original document (the remaining sections did not deal with quantitative research and therefore have not been included here).

RESEARCH QUESTIONS IN ACTION RESEARCH

You probably are reading this because you’ve decided to start an action research project and need to know how to proceed. Before we get too deeply into how to do action research, let’s reach some common understanding about action research—and research in general. Most of what we will discuss here is the kind of general knowledge that social science researchers come to know as they perform more and more research. We want to acknowledge that the general structure of this chapter was adapted from a useful text entitled “How to conduct collaborative action research” by Richard Sagor (1992), which will also be cited as appropriate.

Research is an intentional, systematic process that involves asking questions, collecting data, analyzing data, and interpreting and reporting results. Sagor (1992) defined research as “any effort toward disciplined inquiry that employs systematic processes to acquire valid and reliable data concerning phenomena that researchers want to understand better” (p. 9). Each of these primary areas included within these definitions of research will be considered in turn throughout this chapter.

The goal of research is to try to find an answer to some specific set of research questions using well-defined and convincing methodologies. By answering useful research questions, we reach a further goal of refining theories that guide our thinking in a given field. Sometimes these theories are well-defined, sometimes they are still under development, and sometimes they are very informal and untested. Upon reaching conclusions about the data that they collect, researchers attempt to convince someone that their interpretation of the results is a reasonable, even the right, one—that they have a legitimate answer to the research question. In some research we try to convince others about what we’ve found, but in action research, we are most concerned with convincing ourselves.

The goal of this chapter is to provide you an introduction to research design and analysis, enough to feel comfortable beginning an action research project. However, this resource does not necessarily provide everything you need to learn about the research process. It deals only with some of the more common research design issues, data collection techniques, and data analysis techniques. There may be resources that will be more helpful to you, depending on the specific purpose of your research and the particular techniques you will use in your research. There are relatively few free and stable web sites dedicated to research design, although some offer useful documents, links, or downloads; resources such as ERIC and Google Scholar are introduced later in this chapter.

INTRODUCTION TO ACTION RESEARCH

As you probably know, there are several kinds of research. Basic research, applied research, evaluation research, and action research are some of the most common. Basic research generally has as its goal the development of theory. For example, scholars might want to determine how students learn by examining neural activity in the brain. Or they might ask a theoretical question about whether the skills necessary to play video games are related to the skills used in math. Scholars hope to make sense of the theoretical connections between things with basic research, not whether such connections actually work in the “real world.” For example, just because drivers must use the same skills in driving simulators as they do when driving on the road, it doesn’t mean that using driving simulators necessarily makes people better drivers. Basic researchers are most concerned with how other scholars will view their work. That is, when reporting their results, basic researchers hope to convince their peers that they have learned something valuable.

In order to answer this last question, scholars would need to perform applied research. Applied research takes ideas for practice that have a theoretical rationale and tests whether they really do work in practice. For example, there may be theoretical reasons to believe that students learn better in groups. Applied researchers might actually set up conditions where they can test whether improved learning does seem to occur for most students who work in groups. Applied researchers are very interested in their results generalizing to a broader population than they can actually study in their sample. They desire results that will be applicable to many people across many situations. Their results usually have both theoretical and practical implications. For example, if their application of theory works it helps provide evidence to support the theory; if the application doesn’t work, however, perhaps they have evidence that a theory—or intervention—needs revision. As such, they are concerned with how other scholars and practitioners will view their work and want to convince both audiences that they have learned something useful.

Action research means different things to different people. Even scholars use the term differently. We take the position that action research refers to research that practitioners use to answer a specific question related to their practice and perhaps to guide their decision-making. Most frequently in basic and applied research, the researchers are not the people who will make decisions about practice or implement interventions. Action research is defined here as research done by practitioners to answer questions at a local level; indeed, some call it practitioner research (in education, teachers who do action research are often called, unsurprisingly, teacher researchers). Academic researchers have a much more general and theoretical interest than practitioners. For example, applied researchers frequently want to know whether a particular intervention will work generally for most of the population, whereas action researchers are most interested in whether the intervention will work for them locally and specifically.

Our Perspective

While there are many good resources and textbooks that address action research, Sagor’s (1992) approach is most similar to the perspective we endorse and to the focus of this chapter. Action research is often viewed as a process of qualitative inquiry used primarily for the purpose of understanding local phenomena and a particular population. Our perspective, however, is that action researchers often have a much more experimental purpose. More precisely, practitioners frequently want to know whether a particular intervention worked or whether to change a policy. These are inherently quantitative, even experimental, questions. A key difference, however, is that we mostly cannot use the same inferential techniques of analysis that are used in most quantitative research—we must rely on small-sample and qualitative techniques. As such, we believe it is critical for action researchers to have a strong understanding of the issues involved in both qualitative and quantitative methods.

Additionally, because action research typically seeks answers that will help solve a local problem or improve a local situation, the phenomena being studied must be within a practitioner’s scope of influence. That is, those who perform action research must be in a position to make decisions, to make changes, to make policy, or to make recommendations with some credibility and influence. For example, it would not be truly action-oriented research for an individual teacher to study whether small class sizes improve student learning if there is no chance that the Superintendent or School Board will approve such a change. Action research implies that action can be taken based on the results obtained.

PURPOSES OF RESEARCH

All forms of research can be used for a variety of purposes. Most commonly, research is used to describe, evaluate, or infer. Descriptive research is used to describe a situation or a population. Most often we want to gain some understanding about the people and the places of interest. Sometimes we don’t know much about a particular group, or we have limited knowledge that we would like to increase. From a practical perspective, we may feel that we need this information before we consider implementing some sort of intervention. Perhaps we may want the stakeholders to help make a particular decision, and we need to collect data to describe the general feelings of the group. On other occasions we may need to understand the culture of a group before we begin to recommend changes.

Evaluation occurs when practitioners attempt to determine the value of a particular intervention or program. Practitioners use both formative and summative evaluation. Formative evaluation has the primary goal of improving the intervention, while summative evaluation usually occurs to make judgments about whether an intervention met its goals and whether it should be repeated. Because evaluation research is primarily designed to answer questions at a local level, evaluation researchers are most interested in how the information can be used to improve their own programs or interventions, not in how others will view their work.

While both description and evaluation are purposes for action research, action researchers are usually less interested in inference. Both basic researchers and applied researchers are most interested in how well their results will hold up over a broad population, and sometimes over broad conditions. Their goal is to make inferences about a population and/or the phenomena of interest from the knowledge they have gained from the few participants and limited circumstances of their studies. Inference is usually less critical to action researchers, whose goals are much more limited in scope. Because action researchers are often already working with the population of interest, there is little need to infer too broadly. What they learn, even if from just a subset of the population, is much more likely to be true for their entire population. Sagor (1992, p. 8) clearly identified description and evaluation among the three primary reasons for action research: (a) seeking information to help us understand and solve a problem, (b) monitoring our work to improve our performance, and (c) evaluating work that has been concluded.

In the end, most research that is not designed to describe people or phenomena is designed to develop, confirm, or disconfirm theory. For example, if a certain learning theory predicts that students who work in groups will learn better than those who work individually, a study could be designed to compare the two approaches. The results will help researchers decide about the usefulness of that learning theory. While this is most obvious in the case of basic research and perhaps applied research, it is also true for action research. For example, an administrator may have an informal theory that predicts that a certain intervention or program will help increase employee satisfaction; therefore, an evaluation study is essentially testing whether this administrator’s informal theory is reasonable.

Of course, one study never proves anything, so researchers are careful to use results only as incremental knowledge for tentative decisions. Often, in social science research, we don’t replicate studies enough to verify that the results will hold up over a broader population or broader conditions. For example, if a study shows that an intervention did not work well, we can’t be sure whether it was the theory or the intervention, or both, that was flawed. On the other hand, even if the intervention does work, certain results may imply that it was for reasons not anticipated by the given theory, suggesting a need to develop the theory or the intervention further. If one study shows that the intervention worked well, but other studies don’t produce similar results, we can’t be sure of anything (this is unfortunately more common than we’d like). We mention this idea of replication because the idea of revising our research questions, our informal theories, and our interventions is a key component of action research. And the important thing to understand in any research context (including as a reader of research or as a consumer) is that we almost never get “the” answer to our research questions, but rather incremental evidence that we can use to help make certain decisions.

Finally, you’ve probably heard terms like “data-driven decision making” and “evidence-based practice” and “research-based policy.” Action research can certainly be used to help in these areas, but other sorts of information can also be used. That is, data-driven decisions often can be made using existing data and information, collected either by someone else or at a different time. Indeed, sometimes such decisions are not based on the results of research at all, but on other sorts of information that have been collected (not all data collection is considered research). However, sometimes the data needed for a particular decision are not available. In such cases, the practitioner may become an action researcher and collect the particular data needed to answer a specific question. As an action researcher, the practitioner takes a series of well-defined steps to help answer their questions.

THE ACTION RESEARCH PROCESS

The action research process can be divided into four complementary sets of tasks:

  • Research Problem: Identifying a problem that needs to be solved and asking an appropriate research question that will guide the research
  • Research Design and Data Collection: Identifying potential solutions and setting up a study that will investigate whether a solution works
  • Data Analysis: Analyzing the data that were collected
  • Reporting Results and Taking Action: Summarizing the results for appropriate stakeholders, evaluating whether the intervention worked, taking action, and revising the research for the next round of action research

Others provide more or less detail in such a list, but in the end the process is essentially the same. Note, also, that it is generally considered to be a cyclical process. After we’ve interpreted our results and reached a decision about what action to take, the process begins again. We recognize that practice is constantly changing, and that what works today may not work tomorrow, so the action research process incorporates this cycle to reinforce that idea.

One thing you will notice as you proceed through this chapter is that there are many decisions to make along the way. Every decision has a potential impact on the study and therefore every decision requires a rationale; that is, there are many ways to study anything, but the key is to justify our decisions. These justifications will be used to help make sense of our results and the legitimacy of our conclusions.

RESEARCH PROBLEM

Action Questions

Before we discuss more specific research questions that an action research project might investigate, we need to take a slightly broader look at the purposes of research. Research questions that guide research must be quite specific; otherwise, we cannot design a study that will answer the specific question. That is, we must know what specific question we want to answer in order to be sure that we collect the appropriate data. However, much research, especially action research, begins with a very broad question that requires an answer. Action research often helps us answer only a small part of that more general, action-oriented question. We will call these questions “Action Questions.”

Action questions are often questions that will result in action, or perhaps changes to policy. They will very often reflect the decision that must be made by some administrator or decision-making body. Examples of some action questions include:

  • How much homework should students do?
  • How long should class periods be?
  • Should soccer players wear helmets?
  • How long should students travel on a school bus?
  • Should we purchase TinkerPlots software for our middle school math teachers to use?
  • Should group leaders try to use humor to enhance the quality of problem-solving?
  • What color should we paint the walls of our classrooms?

Note that these “should” questions are more policy-oriented than research-oriented. That is, while these are usually the questions that must be answered before action can occur, they do not usually make good research questions. They are indeed usually the questions we want to answer, but no study can be designed to answer them. This is because questions of policy and procedure require additional information beyond what research can provide. Very frequently, policy questions require information about cost, time, effort, ethics, tradition, and so forth, in addition to the information provided from specific research or action research. On many occasions, the answers to such questions must even take into account research results that apparently recommend different answers. For example, certain wall colors may be more conducive to learning, while certain other colors may be better for providing more light in the room.

By the way, we are including teachers, administrators, and other educational leaders as policy makers in a very broad sense. While it is easy to see how administrators are involved in making policy, teachers also can be said to have action-oriented and policy-making responsibilities. For example, teachers may help shape policy in an academic department, or set “policies and procedures” for their own classrooms. That is, there are many times when teachers are interested in knowing whether they “should” take some particular action.

Good research questions, on the other hand, are very specific and can only provide some of the information required to answer the broader action questions. For example, research can examine whether more homework results in more student achievement. However, even if it does, policy makers would need to decide whether more homework should be given after considering such things as whether students become less motivated when they have more homework. Or whether less time for play and extracurricular activities is too high a price for better achievement. Or whether the additional time required for teachers to grade and review the homework takes away from other instructional activities. Or, or, or…

Whereas policy makers may want a great deal of such information before they answer these “should” questions and create policy, researchers must be satisfied with answering only very specific questions. Fortunately, an individual study can investigate several of these more specific research questions. When all the information is put together, then, the policy makers have much more data-based information to use in the decision-making process. No one study will provide all the information a policy maker needs.

Research Questions

Research questions must be more specific than action questions. For example, rather than asking “How can I solve this problem?” or “Should I do this to solve the problem?”, a researcher might ask “Can I improve the situation by using this particular intervention?” or “Which approach works better?” or “Why do I have this problem?”, which might lead to answers for how to solve the problem.

Before starting an Action Research project, we should ask ourselves several key questions that will help focus the research question(s) that will guide our research:

  • WHAT do we want to know and WHY do we need the answer?
  • WHAT will we do with the answer and WHY will it make sense to do it?
  • WHAT will serve as convincing evidence and WHY can we trust that evidence?
  • WHAT will we do to collect our evidence and WHY will we do it that way?

Sagor (1992, pp. 23-24) identified some other similar questions that may help focus the research question. For example, (a) who is affected?, (b) who or what is suspected of causing the problem?, (c) what kind of problem is it (for example, a problem with goals, skills, resources, time, etc.)?, (d) what is the goal for improvement?, and (e) what do we propose to do about it?

In particular, researchers must clearly identify the outcome that most interests them. Very often in education, for example, some measure of learning or achievement acts as the primary outcome of interest. But there are many, many other potential outcomes that teacher researchers might find interesting (for example, behavioral outcomes, cognitive outcomes, developmental outcomes, social outcomes, affective outcomes). In addition, however, researchers need to identify and understand the relevant factors, variables, phenomena, contexts, and issues that must be considered and/or studied. For example, what are the theoretical causes of change in the outcome of interest?

Good research questions contain critical pieces of information, either explicitly or implicitly, that help us move forward in the research process. That is, a nicely focused research question can help us make most of our critical decisions as we design our research project. In particular, a good research question contains information about:

  • the variables of interest
  • the relationship we expect between the variables
  • the population of interest
  • any specific or special context of interest (unrelated to the variables or the population)

Variables

When we get right to the heart of it, most quantitative or experimental research is really designed to identify or verify causes that impact important outcomes of interest. Indeed, much qualitative research is designed to find those suspected causes and incorporate them into theories that predict or explain the phenomenon of the outcome. This notion of theory implies that we can explain why different people have different outcomes and can predict how the outcome would change with different inputs (that is, changes to the causal agents). That said, however, no one research project can ever confirm this causal relationship. The research process is really designed to provide more and more pieces of evidence that may eventually convince us that there is indeed a causal relationship among the variables. We’ll talk more about cause and effect later; for now, let us look a little more at what variables are.

In research, these suspected causes and outcomes of interest are called variables. They are called variables because each case under study (for example, student, teacher, book, classroom) may have different amounts of each. For example, each student may have a different level of achievement, each teacher may have a different level of satisfaction with some program, or each textbook may have a different number of inaccuracies. Note that there is no requirement that every case actually have different values, but rather that it is possible. If we are studying only second graders, then there is no variation in grade—so grade would be a constant, not a variable. But if we’re studying elementary school students, grade becomes a variable because each student may belong to a different grade. If we are studying only what happens to achievement after using the TinkerPlots computer program (a program designed to help middle school students learn data analysis topics), then achievement is a variable because each student may have a different score on a test, but TinkerPlots is not a variable because it is the only intervention variable under investigation. If we add a comparison group that watches a video about data analysis, now we have a variable that represents the instructional method used to teach data analysis (that is, the students will vary on whether they received the TinkerPlots intervention or the video intervention).

Some variables are directly observable. For example, how long it takes a student to run 100 meters in PE class, or how tall a student is. Such variables can be measured directly. Most of the variables we study in education, however, are psychological or educational constructs. These constructs are less obvious and require indirect measurements. For example, none of us is born with a reading comprehension score tattooed to our forehead! Scholars, researchers, and practitioners must figure out ways to measure these constructs. Because these are not directly measurable, scholars develop other mechanisms by which to measure them—very frequently educational assessments and psychological scales.

Variable Definitions

We’ll talk more specifically about these measurements later, but for now we need to be aware that all variables are defined in two ways: conceptually and operationally. Conceptual definitions are more closely related to theory. Indeed, some concepts are defined differently depending on which theoretical tradition we read. For example, gender may mean a psychological continuum between masculinity and femininity, a linguistic construct, or biological sex, depending on our theoretical framework. As another example, “intelligence” brings to mind Stanford-Binet and Wechsler IQ tests to some, Goleman and emotional intelligence to others, and Gardner and multiple intelligences to still others. In such cases, we really need to decide which conceptual variable definition we will use, and therefore which theories support that definition, before we begin the research. We want to be as specific as we can with our conceptual definitions, because subtle differences might make a difference. For example, rather than studying “achievement” as a variable, it might be important to identify “math achievement” or “reading achievement” as the variables. While they both share some of the same theoretical traditions, they also have different theoretical backgrounds.

Operational definitions follow directly from the conceptual definitions. Operationally, we will define how our quantitative variables will be measured in our study. That is, where will the numbers actually come from—what test, what scale? For example, if we are studying class size then we will need to decide how we are defining class size conceptually. Theoretically, some scholars consider the physical dimensions of a classroom, either absolutely or per student, as the most important aspects of class size; other scholars, though, focus on the number of students in a classroom, again either absolutely or per teacher. These differing conceptual definitions obviously lead to different appropriate measurements to be taken for the class size variable. Note that we could study both types of definitions in the same study, but each would require a separate question and a separate analysis.

Types of Variables

In much research, researchers must designate certain variables to be considered outcome variables and certain variables to be considered independent variables. Basically, the independent variables are said, theoretically, to be responsible for changes in the outcome variables. Therefore, the outcomes are somewhat “dependent” on the values of the independent variables. For example, if it is true that doing homework helps cause higher achievement, then achievement is dependent, to some extent, upon how much homework a student completes.

The dependent variables are the outcome variables of primary interest; that is, they are the variables the researcher ultimately hopes to learn more about. In some studies, in fact, there are no independent variables, or there is a single independent variable that is believed to influence several outcomes. In many studies, though, the researcher hopes to discover the factors that are responsible for change in the dependent variables. If we know which independent variables help to cause change in the outcomes, we consequently know that if we are able to change the values people have for the independent variables, then we can change their outcomes as well. For example, if we know that nutrition is strongly related to achievement such that if nutrition increases, achievement also increases (a causal relationship), then by providing students with breakfast at school we may help increase achievement among those students.

Most commonly in quantitative research, the dependent variables are measured on some meaningful scale. However, in experimental research and in other quantitative research, the independent variables are often categorical variables. That is, study participants belong to a group with certain characteristics that are different from other groups being studied. For example, one group might receive an intervention (often called a treatment in research) that differs from what another group received. We often call the group that receives the intervention the experimental group, while we call the other group the control or comparison group. Generally, a control group will receive no treatment of any kind, whereas a comparison group receives something worth comparing to the new intervention being studied (for example, if one group is exposed to some new instructional method, the comparison group would probably receive the typical or “old” method). We often call this kind of independent variable a manipulated variable. Other types of categorical variables include demographic characteristics such as gender and whether English is a student’s first language.

Identifying the Important Variables

How do we find all these variables? The best place to start is by talking to or brainstorming with colleagues and mentors. Another useful approach is to explore the literature in the field, through textbooks or articles written by scholars in the discipline. There are a few good ways to find such literature. The ERIC database (http://www.eric.ed.gov/) run by the US Department of Education is one resource. We can search for articles about specific topics, by certain authors, or in particular journals. ERIC also catalogs curriculum guides and papers presented at professional conferences. It is an incredible resource. Another very good way to find such literature is through Google Scholar (http://scholar.google.com/). We can find many papers and articles available across disciplines through Google Scholar.

Relationships Between the Independent and Dependent Variables

Now that we’ve defined the variables we will study, we need to consider the relationships we expect among them. We need to identify which variables are the suspected causal agents and which variables interest us as outcomes. We call the suspected causes independent variables or predictors, depending on what type of analysis we will do. The outcomes of interest are often called either dependent variables or outcomes.

The relationship between variables suggests that a change in one variable implies or predicts a change in another. Almost all quantitative research questions can be stated in terms of a relationship among variables. Relationships are often measured using correlational statistical methods (for example, correlation and regression). However, comparison studies typically use ANOVA and t test methods. A question of difference among groups implies a relationship among variables; that is, if there is a difference among groups on a certain DV, then group membership is related to that DV.

Almost all quantitative research questions can be worded as a relationship question. For example:

  • Is there a relationship between the amount of homework completed and achievement?
  • Is there a relationship between gender and attitude toward math among middle school students?
  • Is there a difference between middle school boys and girls in their attitude toward math?
  • Is there a difference between using the TinkerPlots computer program and using textbooks in student math achievement?

Even the last question, about differences between two treatments, is really a relationship question at its heart: If there is a difference between the two instructional methods, then there is a relationship between the independent variable instructional method and the dependent variable math achievement. We can probably see this more clearly in the two ways to word the same question about gender and attitude toward math.
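To see this equivalence concretely, here is a minimal sketch in Python (assuming the scipy library is available; the attitude scores below are entirely made up for illustration). The “difference” wording corresponds to a two-group t test, while the “relationship” wording corresponds to a point-biserial correlation, and the two lead to the same conclusion:

    # Hypothetical attitude-toward-math scores (higher = more positive)
    from scipy import stats

    boys  = [12, 15, 14, 10, 13, 16, 11, 14]
    girls = [15, 17, 14, 18, 16, 19, 15, 17]

    # Difference wording: is there a difference between boys and girls?
    t, p_t = stats.ttest_ind(boys, girls)

    # Relationship wording: is group membership related to attitude?
    group  = [0] * len(boys) + [1] * len(girls)   # 0 = boy, 1 = girl
    scores = boys + girls
    r, p_r = stats.pearsonr(group, scores)

    print(f"t test:      t = {t:.2f}, p = {p_t:.4f}")
    print(f"correlation: r = {r:.2f}, p = {p_r:.4f}")   # same p value as the t test

Either way, the same data are collected and the same conclusion is reached; only the wording of the question differs.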

Often though, we can be more specific about the relationships we are studying. Some research predicts relationships between variables, like the relationship between amount of homework completed and achievement test score. In such circumstances, the variable believed to predict the other would be considered the independent variable (but is often called a predictor in this context). It would probably not make much sense to suggest that scores earned on end-of-the-year achievement tests somehow predicted how much homework students completed; therefore, theoretically, the amount of homework completed would be considered the predictor. In such contexts, the independent variable (or predictor) is not required to be categorical. That is, variables that have a measurement scale of some kind can be considered independent variables. Note that we generally try to avoid specific references to cause in our research questions, knowing that our one study will not likely be able to answer that question of cause very well.

Relationships with Other Variables

There are three other common types of variables that we must acknowledge here, because of their special relationship with the independent and dependent variables. One type of variable, called a mediating variable, intervenes between independent variables and dependent variables. In such situations, the independent variable actually causes change in this mediating variable, which in turn causes change in the dependent variable. Therefore, the independent variable does not have a direct causal relationship with the dependent variable in these situations. Rather, the independent variable may be a direct cause of the mediating variable, which is a direct cause of change in the dependent variable. For example, the TinkerPlots computer program may cause students to become more interested in the topic, which causes them to explore the concepts more, which then causes them to perform better on an achievement test. In this example, the TinkerPlots program doesn’t directly cause the change in achievement, but it may cause change in the mediating variables that finally cause change in achievement. We’d like to include these mediating variables in our research, if at all possible. If we don’t include them, our results will be weaker because we can’t be sure how, or why, our independent variable is related to the dependent variable.

Some variables change the impact of the independent variable on the dependent variable; these are called moderating variables because they moderate the effects of the independent variables. In such a case, the independent variable identified by the researcher may have differential impact on the dependent variable, depending on what values the cases have on the moderating variable(s). For example, assume that computers help increase achievement for students in general (for example, if we ignore gender). However, if boys achieve better by working on computers but girls achieve better by working without computers, then the effect of computers is moderated by gender: gender matters when we talk about the effectiveness of computers on learning. If these results are accurate, we cannot necessarily recommend that we introduce computers into every classroom, because girls would achieve better without them. Identifying moderating variables, and including them in the analysis or removing their effects through design choices, can minimize and/or account for their effects. For example, in this computer example, the study could be completed with only boys to nullify the impact of gender on the results—of course, though, this adds a limitation to the possible conclusions that can be made: the results will not apply to girls.
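As a rough sketch of how a moderating variable can be examined statistically (the scores below are invented for illustration, and we assume Python with the pandas and statsmodels libraries is available), the interaction term in a regression or two-way ANOVA tests whether the effect of the intervention depends on the moderator:

    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical achievement scores for a 2 x 2 design
    df = pd.DataFrame({
        "computer": ["yes"] * 4 + ["no"] * 4 + ["yes"] * 4 + ["no"] * 4,
        "gender":   ["boy"] * 8 + ["girl"] * 8,
        "score":    [78, 82, 80, 84,  70, 72, 69, 71,    # boys: better with computers
                     68, 70, 67, 69,  79, 81, 83, 80],   # girls: better without
    })

    # The computer:gender interaction term tests whether gender moderates
    # the effect of the computer condition on achievement
    model = smf.ols("score ~ computer * gender", data=df).fit()
    print(model.summary())

    # The cell means make the crossing pattern visible
    print(df.groupby(["gender", "computer"])["score"].mean())

A sizable interaction term is the statistical signal that the moderator matters and that an overall “computers help” conclusion would be misleading.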

Finally, there are other potential causes of outcomes that are not at all related to the independent variables we have identified for our study. The problem with these extraneous variables is that we cannot be sure whether changes in the dependent variable are due to our chosen independent variable—or due to some other unmeasured causal variable. For example, if we were studying some intervention program designed to increase math achievement, there are many other variables besides our intervention that may be causing change in the students’ achievement. We need to try to identify and measure, or control, as many of these factors as possible.

How to Include Mediating and Moderating Variables in Our Action Research

When we know that other variables have important relationships with the variables we are studying, we want to try to control their effects and would therefore include them in our question. For example, there may be a relationship between achievement in science and interest in science. It may be that females achieve better in science because they are more interested in science. So the research question may be refined to ask “Is there a difference between males and females in science achievement after controlling for interest in science?” Here, interest in science acts as a mediating variable in the relationship between gender and achievement in science, and controlling for it lets us ask whether a gender difference remains beyond differences in interest. The key is that our research question now guides us to collect information for three variables instead of two—the only way to control for interest in science is to include it in the study as a variable.
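One common way to control for such a variable statistically is simply to include it in the model alongside the variable of interest. The sketch below (hypothetical data, not from any real study, again assuming pandas and statsmodels) compares the gender difference in science achievement before and after interest in science is added as a predictor:

    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical data: gender, interest in science (1-10), achievement (0-100)
    df = pd.DataFrame({
        "gender":      ["F", "F", "F", "F", "F", "M", "M", "M", "M", "M"],
        "interest":    [8, 9, 7, 8, 9, 5, 6, 4, 5, 6],
        "achievement": [88, 92, 84, 87, 91, 75, 80, 70, 74, 79],
    })

    # Without the control: the simple gender difference
    simple = smf.ols("achievement ~ gender", data=df).fit()

    # With the control: the gender difference after accounting for interest
    controlled = smf.ols("achievement ~ gender + interest", data=df).fit()

    print(simple.params)        # gender gap ignoring interest
    print(controlled.params)    # gender gap after controlling for interest

If the gender coefficient shrinks substantially once interest is in the model, that is consistent with interest accounting for much of the original gender difference.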

One way to include mediating variables is actually to change our research question. For example, perhaps current knowledge in the field indicates that students who spend more time exploring concepts independently also achieve better in math. Our ultimate goal may be to determine whether the TinkerPlots program increases student achievement results on a statewide exam. We can actually break such a study into various pieces. First, we must recognize that there will be too many potential explanations for the results students obtain on statewide exams; therefore, we may decide that it is not the best outcome variable for us to use in our study (especially an action research study). Second, we may not know whether our local test scores are good predictors of the statewide exam scores; therefore, we may choose not to use our classroom test as the dependent variable. Knowing that other researchers have shown that the independent exploration of concepts does have a strong predictive (perhaps causal) relationship with math achievement, we may finally decide that we will use time spent exploring concepts as the outcome variable for our study, while comparing TinkerPlots with what has been done traditionally to teach data analysis.

Causal Relationships

Cause is very difficult to prove in social science research. Even when we discuss cause, we usually mean probabilistic cause, because our decisions are based on statistical procedures that require the use of probability for decisions. Cause requires several arguments:

  • Relationship: changes in one variable must correspond to changes in the other
  • Temporal Ordering: a cause must precede an effect
  • Removal of alternative explanations: we rule out other possible causes of any change in the dependent variable
  • Replication: the same results must occur in more than a single study
  • Explanation: we must be able to explain why the changes occur (theory)

In social science research we can often develop the first two arguments. That is, our research designs usually allow us to comment confidently on the relationships between variables and we can usually make a case for the temporal precedence of one variable over another. But even here we sometimes have difficulty. For example, does interest in a topic cause students to achieve better in that topic? Or does achievement in a topic cause students to become more interested in that topic?

Ruling out alternative explanations for changes in the dependent variable is very difficult in social science research because it usually requires a true experimental design. Part of the difficulty is that just because an independent variable predicts an outcome does not mean that the independent variable causes the outcome; it may be only one cause among many, or not a cause at all. For example, the amount of sleep students get each night (or whether they get adequate sleep regularly) might predict achievement, but that doesn’t mean that sleep causes achievement. Sometimes a third variable causes both of the variables in our study (that is, both the independent and dependent variables). For example, we may believe that homework quality (amount correct) causes achievement as measured by test scores. However, it may be that knowledge is the root cause of both, such that the student would have done well on the test even without completing the homework (we could say here that homework quality predicts test scores, but not that it causes test scores). In order to control these various types of extraneous variables, we usually require a true experimental design that uses both random selection and random assignment.
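To make the third-variable problem concrete, here is a small simulation (all numbers are invented, and we assume Python with the numpy library) in which knowledge causes both homework quality and test scores. Homework then predicts test scores quite strongly even though, in this simulated world, it has no direct effect on them:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 500

    knowledge = rng.normal(size=n)                         # unmeasured common cause
    homework  = knowledge + rng.normal(scale=0.5, size=n)  # caused by knowledge
    test      = knowledge + rng.normal(scale=0.5, size=n)  # also caused by knowledge

    # Homework and test scores are strongly correlated (roughly .8 here)
    print(np.corrcoef(homework, test)[0, 1])

    # Among students with similar knowledge, the correlation largely disappears,
    # which is what "controlling for" the third variable would reveal
    similar = np.abs(knowledge) < 0.25
    print(np.corrcoef(homework[similar], test[similar])[0, 1])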

In the social sciences, we typically recognize that there are multiple causes of change in the dependent variables; therefore, we search for potential causes in the form of independent variables that predict change in the dependent variable. Theoretically, these predictors may actually serve as causal factors, but our research designs rarely allow us to reach a conclusion beyond prediction. In this context, causal conclusions usually require multiple studies, or replications, that all find the same results that support the particular causal relationship being studied. In addition, we often talk about a fifth argument required for cause in the social sciences: we must be able to explain theoretically the relationship between the variables (and perhaps even include the roles of other variables in the explanation).

Always remember, a relationship between variables is required in order to infer causality; however, a relationship does not imply causality. One of the most common flaws in research reports occurs when researchers attempt to make causal conclusions from research designs that should allow them to reach conclusions only about the relationships between the variables. This occurs, perhaps, because people often perform research to try to develop support for a particular theory—and the purpose of theory is to explain the relationship between variables, usually in terms of cause and effect.

Population (Target Population)

As researchers, we must define clearly whom we want to study. Even some action research requires careful description of the population being studied. For example, if we are trying to determine whether TinkerPlots works for our own classroom, then our classroom is our population. In this case, we have defined our population as our classroom and we are finished. However, anything more complicated requires careful thought and requires decisions to be made.

If we want to see whether TinkerPlots will work for all math classes in the school, we have defined our population more broadly. In this case, our classroom may not adequately represent the entire 7th grade population of math students in the school. If we do the same study, using only our single classroom, then we cannot be certain that the results will hold up for other math classes in the 7th grade. If we did pursue this as an action research project, we would call the entire 7th grade our population and we would call our own classroom the sample.

As we might have surmised, it is often not possible to study an entire population. If we extend our previous example, we may actually want to know whether we should make a complete change to the TinkerPlots program for teaching data analysis in 7th grade math. This would require buying the appropriate license for the software, installing it on all the computers we use, buying workbooks to use with the program, creating new lessons, and perhaps convincing hesitant colleagues in our department to make the change. In such an example, the population is no longer just the current 7th grade, but rather all current and future 7th graders at our school. We can even extend this further, perhaps district-wide to include all middle schools or perhaps even to other middle school grades. All of these could be examples of action research.

The population of interest has an important role in our research. We must account for the population in some fashion. Ideally we wouldn’t draw a sample, but rather collect data from everyone in the population; if this were possible, we could know the answer to our research question without doubt. Unfortunately, it is often not possible to collect data from everyone due to the time, effort, and/or cost involved. In applied research, we rarely know every individual case in a population, so we must take samples; in action research, however, we usually do know the entire population, which allows us some flexibility in our research design. For example, we may decide to try TinkerPlots with all the 7th grade math classes, or we may decide just to take a sample of classes. We may decide to design a one-year study, feeling confident that this year’s 7th graders are really no different than other years—or a two-year project because we are not sure that future 7th graders are the same (maybe a whole new set of 5th grade teachers was hired last year).

Note that the population is not always people. The population is the entire set of cases of interest to the researcher. We will be observing or collecting data from the cases (sometimes called units of analysis) we identify. These cases represent the smallest participant level from which we will collect data. Most frequently, these are individual people, but they could very easily be small groups of students, entire classrooms or schools, clinics, textbooks, or newspaper articles. For example, as part of a textbook selection process, we may be interested in studying the number of mistakes in several history textbooks. We might consider each “fact” to be a case and then study the variable of accuracy (whether the fact is correct or not). Or we could call each chapter a case, count the number of mistakes per chapter as our outcome variable, and then compare the textbooks using the average number of mistakes per chapter.
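For instance, a quick sketch of that textbook comparison (with made-up mistake counts, written in Python) shows how the chapter, rather than the student, becomes the case:

    # Hypothetical mistake counts, one number per chapter (each chapter is a case)
    mistakes_per_chapter = {
        "Textbook A": [3, 1, 4, 2, 5],
        "Textbook B": [0, 2, 1, 1, 2],
    }

    for book, chapters in mistakes_per_chapter.items():
        mean = sum(chapters) / len(chapters)
        print(f"{book}: {mean:.1f} mistakes per chapter on average")

Here the outcome variable (mistakes per chapter) is measured once per case, and the comparison between textbooks is made on the average across cases.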

It sometimes helps to consider inclusion and exclusion criteria when defining our target and/or accessible population. That is, we may want to identify what cases would be acceptable to include in our population and which cases would not be appropriate for inclusion. For example, if we wished to study how a reading program worked with 6th-graders, we would exclude 7th-graders from our population. We have great flexibility in deciding who is in our population, but we must always remember that our results will only apply to those in our population. For example, if we delimit our population further by including only 6th-graders within a particular age range who achieved scores in the “proficient” range on the 5th grade statewide exam, then our results will not necessarily apply to those students who scored in ranges lower or higher than proficient. This last example is important because research has shown that there are aptitude-treatment interactions for many interventions. That is, sometimes a student’s given ability or achievement level has a moderating effect on how much the intervention influences the outcome variable (e.g., TinkerPlots may work well for advanced math students, but not so well for those who have more difficulty learning math).

Context

While not always a concern in research, sometimes researchers set a particular context or limitations to the research question. Sometimes this is related to the population (for example, 4th-graders only in public schools using a particular textbook series) and sometimes this is related to the variables (for example, adding a time limitation to a particular skill to be measured). Sometimes researchers are only interested in the relationship between variables in particular situations (for example, how a behavior occurs in public rather than private, or how people act with others around versus by themselves). Context is not critical, but if we recognize that we are interested only in a particular context, we want to pay attention to it. For example, we may want to know about ways to keep students’ attention focused on learning after lunch. The population isn’t really students who have class after lunch. The period following lunch is not really part of the intervention. Rather, it is really more about whether we need to change our instructional methods in this particular context: the period following lunch.

Significance (not Statistical)

Significance of the research is a given in most action research. That is, we have a question that must be answered through action research methods. However, in some circumstances, we must convince someone else that our study is worth allowing, funding, or reading. In such cases, we should provide a strong rationale that justifies the importance of the study and explain how the knowledge gained from the study will improve our work. Essentially, we are arguing for why the study is worth doing—or more strongly, why it must be done. Part of this argument is to provide theoretical and/or practical reasons for the need to know the answer to the research question. Another part is to put the study in context within the literature of the field (this is often done through the literature review); in action research, this is done by showing that the specific answer we need does not already exist. For example, we probably don’t need to do action research to show that a white roof on a school bus lowers temperature inside the bus—applied research exists in the literature. However, if we live in Ohio, we may need to do some action research if all these existing studies about school bus temperature were done in Florida. In this case, we cannot be certain the results will be the same or will have as much impact, especially during the months between October and April.

Action research will rarely have the development of theory as a reason it is significant, but sometimes may have significance for both theoretical and practical reasons. That is, a series of well-done action research studies can sometimes influence how scholars think about an issue and therefore make changes to theory. However, it is the practical reasons that drive action research, not theoretical reasons. Therefore, if we need to build a case for the action research, we need to focus on the answers it will provide and the implications of those answers. We need to argue why these answers cannot be found in any other way and we need to argue for the critical importance of these answers. We need to argue that the implications of knowing these answers are important to our practice. For example, if TinkerPlots does seem to cause changes in our students that will eventually cause them to score much higher on statewide exams (through that series of mediating variable relationships we discussed above), then the implication is that we will want to change. On the other hand, if TinkerPlots doesn’t really help increase achievement, then the implication is that we shouldn’t spend all that time and money on TinkerPlots.

RESEARCH DESIGN

Data, as in data-driven decision making. Evidence, as in evidence-based practice. The real key to research is being able to convince someone that we indeed have reached reasonable and useful conclusions. In basic research, we need to convince other scholars who will decide whether we have something useful to help us further develop theories. In applied research, we need to convince both scholars and practitioners that we have results worth paying attention to. With action research, however, it is most often colleagues close to us, immediate supervisors, or ourselves, that we need to convince about our findings. For example, sometimes we simply want evidence that the path we’ve chosen is a good one. Other times we may need to convince someone else who has decision-making authority that a change is needed (for example, a supervisor, a principal, or a school board).

The decisions we make at many points in the research design process impact how strongly we can make these convincing arguments. We don’t want to set up a study just to find results that confirm our predetermined beliefs. Rather, we want to design a study that collects valid data and reaches valid conclusions. This issue of validity is critically important to the research process, and we will discuss it in a variety of ways throughout this chapter. The first discussion of validity will pertain to claims of knowledge that already exist in the discipline. In particular, we will discuss the sources and the role of existing knowledge in an action research project.

This section of the chapter is intended to introduce you to ways that researchers think about evidence. It is often true in action research that we don’t need the strongest possible research design. Indeed, for practical, logistical, financial, and perhaps even ethical reasons, we sometimes will not want to use the strongest possible design. The purpose of this section of the chapter is to help you learn how to think about research and evidence, so that you can make the best possible decisions for your action research project. The critical question, as Sagor (1992) put it, is: Will we be able to convince a skeptic (or a supervisor or ourselves) by the weight and credibility of the data we amass? Because designing research and collecting data are difficult and time-consuming activities, Sagor proposed that we use a “critical friend” to help us identify additional questions and issues that must be addressed (pp. 46-47). This colleague, who is not part of the action research project, would help analyze and critique—in a friendly way—our data collection plans.

Existing Knowledge

Whenever we begin to consider an issue, we find that there are things we know about the topic and things we don’t know. Sometimes there are things that experts in the field thought they knew 5-10 years ago, when we learned them, but that have changed in recent years. In fact, sometimes the answers we are looking for already exist and are convincing enough that we don’t need to do any research ourselves (for example, whether a white roof keeps the temperature lower inside a school bus).

Existing knowledge constitutes what most scholars or experts (that is, theoreticians, researchers, and practitioners) have come to believe to be true in a discipline. In social sciences, such knowledge is sometimes less certain than in the physical, biological, and health sciences because it is much more difficult for us to design true experiments to help “prove” causal relationships. Yet, we rely on the general agreement among scholars in the field to know what we know. And, unfortunately, scholars don’t always agree and competing theories exist. These competing theories still represent an accumulation of knowledge that has been used in their development; that is, well-respected theories often overlap on the knowledge upon which they are based, but differ in how that knowledge is interpreted.

There are several places for us to turn to obtain the most recent knowledge in the field. One resource is colleagues, including fellow teachers, curriculum specialists, and administrators or other educational leaders. In particular, some of these colleagues may have taken college courses more recently than us or, perhaps, have participated in recent, relevant professional development. Knowledge in most fields continues to grow and change, so it is helpful to know what scholars and practitioners currently think about relevant issues. Another resource may be college faculty, who by the very nature of their jobs must keep up-to-date on the issues they study. We may still have connections with our own college faculty or may have developed new relationships. Perhaps some of our colleagues have connections to faculty that we can use.

Sometimes, unfortunately, we don’t have direct connections to people with current specialized knowledge in a particular field. That brings us to the existing literature in the field. Reviewing the literature often brings to mind ugly, nasty, and long papers we had to write in college or in graduate school. But “the literature” is really nothing more than a name for the vast repository of knowledge that exists in any given discipline. We can, in a sense, contact all the experts in a field simultaneously by searching the literature for existing knowledge on our topic. In action research, we will not really need to write a literature review, but we do need to understand the existing knowledge in our field. So while action researchers should review the literature, they do not always need to write a literature review. It is true, however, that a written literature review may be required for some purposes, even in action research (for example, to gain approval for the project from a supervisor or to report your results to an appropriate audience).

“Extant knowledge” and “current knowledge” also include things that we’ve known for a very long time; that is, sometimes current knowledge reflects that something very old is still true. However, because we constantly learn new things, it is important to understand what scholars consider existing knowledge. We cannot really begin our research until we know a little about what others have found. Sometimes others have identified a new way of thinking about the issue with which we were not yet familiar.

Also note that sometimes, unfortunately, there is very little to help us in the most relevant literature. In these cases, we may need to expand our thinking, and our search. For example, even though we are interested in math achievement, perhaps someone has done research in science or reading that will be informative. Perhaps someone with a more psychological or physiological focus has done work in this area. Practitioners and researchers don’t always communicate as well as they should; in these cases, action researchers may need to look for less academic literature and search more for information produced by practitioners, which may require different search tactics. Or maybe there are relevant theoretical perspectives that haven’t been applied in research or by practitioners yet. Indeed, it may be that because no research has been done, the only current thinking available is theoretical (but being theoretical doesn’t necessarily make it useless, it just doesn’t make it knowledge).

The Literature Review

ERIC (http://www.eric.ed.gov/) and Google Scholar (http://scholar.google.com/) were introduced earlier as good starting places for literature searches. Other places to search include recent editions of textbooks in the discipline. Most textbooks are updated with each revision to reflect the current knowledge in the field. These textbooks also often contain references to recent and important papers that discuss important changes in the existing knowledge.

Before beginning a journey into the research literature, we want to understand our research question(s) very well. We will attempt to identify the issues that must be answered before the action question or policy question can be addressed (see Chapter 2 Appendix A). Remember, through our efforts, we are really interested in answering, eventually, the action-oriented “should we” questions.

It is important to recognize that, although we often see the literature review as a separate section, or even chapter, in formal research, this does not imply that the literature should be kept separate from other aspects of the research process. In fact, even when the literature review is organized into a separate section or chapter, the literature is critical to every step of the research process. The separation is merely a convenient way to organize the research report that is produced.

For example, we cannot develop a good research question without knowing what the literature says about our variables, the relationships between our variables, our population, and perhaps the context of our study. For example, is there a strong theoretical rationale for believing that a certain intervention will work or that certain variables are related? Are there other ways to think about the problem we are trying to solve, perhaps different interventions that we hadn’t considered? Existing literature also helps us find what are considered accepted methods for research in a particular area. For example, (a) how should we design our study?, (b) how should we collect our data?, (c) what tests or measurements should we use?, and (d) what ethical dilemmas have others faced when doing similar research? We need to understand the current literature so that we can make better sense of our results. For example, have we obtained results similar to what others have found, and if not, why not?

The Unending Conversation

We like the following quotation by the notable American literary theorist Kenneth Burke (1941/1973) from his book “The Philosophy of Literary Form” as a metaphor for the purpose of the literature review: “Imagine that you enter a parlor. You come late. When you arrive, others have long preceded you, and they are engaged in a heated discussion, a discussion too heated for them to pause and tell you exactly what it is about. In fact, the discussion had already begun long before any of them got there, so that no one present is qualified to retrace for you all the steps that had gone before. You listen for a while, until you decide that you have caught the tenor of the argument; then you put in your oar. Someone answers; you answer him; another comes to your defense; another aligns himself against you, to either the embarrassment or gratification of your opponent, depending upon the quality of your ally’s assistance. However, the discussion is interminable. The hour grows late, you must depart. And you do depart, with the discussion still vigorously in progress.”

Think of reading the literature as “listening for a while” to what others have said. Your “literature review” provides a summary of the most important points of that previous conversation (“the steps that had gone before”) that you try to retrace for the next new participant. You cannot possibly include all of what you heard (or read) previously, even though all of it was important in your understanding of the conversation. So you try to include the most important parts of the previous conversation as you bring new participants in the conversation (i.e., your readers) up to speed, including the multiple perspectives that are relevant and credible (i.e., all sides of the argument). Your research hypothesis and the evidence you have collected represent your “putting in your oar.” Conference reviewers, journal editors, and those who cite you (whether favorably or critically) are part of the conversation that continues…

Evidence

Research Design and Data Collection are the parts of the research process where researchers deal with issues that address the question “What counts as evidence?” Because they are not trying to convince fellow scholars or peer practitioners that their results are worthy, action researchers have more flexibility with what they will consider to be useful evidence of the effect of an intervention or the results of a survey. We can’t be fooled, though, into thinking this necessarily makes action research easier—sometimes we are our own toughest critics. Before we take action, or feel comfortable recommending a course of action to a supervisor, we may want to be even more convinced than a fellow practitioner would need to be.

Because of the flexibility we have in action research, we have the capability to use different types of evidence than are typically accepted in more applied research. That is, we may be very comfortable using our experience and knowledge of a particular situation or population to fill in some of the gaps in data. We may be more willing to use qualitative data as evidence in an essentially quantitative study than most journal editors would prefer. However, if we are undertaking an action research project to provide evidence that persuades others, or perhaps more importantly ourselves, that action should be taken, we want to follow certain guidelines to help us gather more credible evidence. Credibility comes in degrees: Nothing is completely believable and nothing is completely worthless. Some decisions result in greater credibility of our evidence in one area and less credibility in other areas.

As researchers, we will be faced with decisions at every step of the research design process. Our decisions should always include consideration of impact on the credibility and validity of our evidence. As we report the results of our study, we will need to justify all the decisions we made. The credibility of our study, in large part, depends on our explanations of the decisions we made.

As consumers of the research literature, we should carefully consider the apparent validity of the results of a study. We want to be skeptical, but not necessarily cynical. That is, we don’t want to presume that the results of a study are not valid, but we do want to be convinced that the evidence is worthy. We can assume nothing—if evidence is not provided to support a claim, we cannot assume it was just left out (persuasion requires evidence). Additionally, we cannot just read the abstracts and the conclusions of the research reports, because they rarely provide the evidence we need to be convinced.

As we critique research reports we must be able both (a) to understand the study well enough to describe it and also (b) to analyze the study critically by evaluating the credibility of the evidence used to make claims of new knowledge. In either case, authors of research reports must provide the information and evidence necessary for us to accomplish these objectives. Chapter 2 Appendix B provides examples of the kinds of questions you might ask as you read a research report.

Research Design Considerations

The research design is essentially the strategy to be used to collect data and answer the research question. This strategy includes such matters as how participants will be assigned to groups, how variables are controlled, and data collection procedures. There are many traditional research designs that have been used and studied by researchers over the years; we will typically choose among them as we design our study.

Our design may need to become more complex depending on (a) how many independent and dependent variables we have, (b) what additional mediating and moderating variables we have identified and how we want to control them, and (c) how many times each case will be measured on dependent variables. We must choose a design that will meet all our needs. That is, in a very real way, the research question determines the research design we must use—there are typically a few options, but generally we must decide among these few best options in order to adequately answer our research question. Similarly, once our research design has been chosen, there are typically very few methods that we can use to analyze the data appropriately.

Unfortunately, no study can be perfect—there are always difficulties and compromises. The researcher must identify potential problems that might cause conclusions to be considered invalid. Sometimes we are able to anticipate these problems and design the study so that they are managed; sometimes, however, the problems are beyond our ability to manage and we must present them as limitations to the study. Too many limitations make a study not worth doing.

We must consider what information we need to collect and what will be difficult to collect or what would happen to our results if we are not able to collect it. For example, if we are trying to collect data from families and if one of our variables is socioeconomic status, we might consider how our results would be impacted if we only collected data from folks who are currently employed; if there would be an important effect, we can use a sampling strategy that will be more likely to provide both employed and unemployed people. In action research, we sometimes can overcome limitations by collecting additional information about our population from other sources outside our research. By identifying potential problems ourselves, we can (a) design the study to manage and account for these matters, (b) anticipate the arguments against our conclusions and be appropriately cautious with our claims, or (c) figure out what additional information we may need to collect from other sources.

We should also check our assumptions and biases about the topic, the variables, the methods, the instruments, and so forth. Even experimental researchers have assumptions—they are just not usually made explicit. These assumptions may lead us to make inappropriate decisions about certain aspects of the study. We’re not considering these matters for others—good research design will minimize the impact of any assumptions and biases we may have—but rather we are considering them so that we won’t make bad or careless decisions.

Problems that we cannot control will cause limitations to our study that will weaken the conclusions we can make about our results. Again, we want to control anything we can with our design choices, but sometimes there are factors that are beyond what we can reasonably control. For example, we are sometimes forced to collect self-report data that cannot be confirmed—a potential source of faking, lying, or socially desirable responding. If we cannot randomly assign participants to treatments, and we often cannot in social science research for ethical or logistical reasons, then we cannot reach causal conclusions no matter how well designed the rest of the experiment is.

Limitations, as long as they are not too severe and as long as eventual conclusions acknowledge them, do not necessarily invalidate a study. Rather, limitations provide the context for the usefulness of the results and conclusions. However, limitations that are too severe and cannot be overcome through either design choices or additional information might make the study not worth doing.

Validity

Earlier we said that, in the end, experimental research is really about providing evidence that confirms or disconfirms theories, that researchers are really trying to find support for cause and effect arguments. But remember that we also said that no single study can prove anything; in fact, most social science studies cannot. There are a number of reasons for this, but near the top of the list in social science research is our general inability to design true experimental studies.

In research, there are many factors that determine how strongly we can reach conclusions about cause, prediction, and even relationship. The primary factors that impact our confidence are measurement reliability and validity, internal experimental validity, external validity, and qualitative validity. We will discuss these factors in much more detail as we proceed, but let’s briefly consider them now. Measurement reliability and validity refer to how accurate and useful our measurements (for example, test scores or attitude scale scores) are. Internal experimental validity refers specifically to how strongly we can make conclusions about causal relationships. External validity refers to how well our results will be true for the whole population of interest. Qualitative validity refers generally to issues that matter in qualitative research, such as the accuracy and credibility of the data we collect.

Fortunately, we can enhance the validity of our results by making good decisions during our research design. If we fail to consider these issues of validity, these factors may threaten our ability to make any substantial claims about our results at all—and might even keep us from convincing ourselves about the findings. There are important experimental research design principles that we should consider so that we can know what is rightly considered experimental evidence. We will discuss these matters in the next section.

Internal Validity

Essentially, internal experimental validity pertains to how strong our conclusions can be. With a true experiment (if done well, supported by theory, and replicated), we can be relatively comfortable making causal conclusions. This is because when multiple groups that have been randomly assigned to treatments are used, the various threats to validity and potentially confounding variables are most likely balanced across groups.

We can think of internal validity as research “design validity.” That is, how well does our research design allow us to reach the conclusions that answer our research questions? It is not only research that hopes to reach causal conclusions that must pay attention to internal validity; however, because internal validity is so closely associated with experimental designs, the term design validity may be a bit more useful (and could potentially even apply to qualitative research). Even correlational and descriptive studies must be designed in such a way as to allow us to reach the level of conclusion about our relationships that we desire (causal, correlational, or descriptive).

As our research design deviates further from a true randomized experiment, our conclusions must become weaker—and less causal. Even though we are not interested in making causal conclusions in such cases, we still must be concerned about the validity of our conclusions. Just remember, there is nothing wrong with a descriptive or a correlational study; we just cannot make conclusions beyond what the designs allow: description or relationships, respectively.

Typically we talk about threats to internal experimental validity. Because very few studies in education and the social sciences are truly experimental, what we really mean by internal experimental validity is how strongly we can make conclusions about our results (especially conclusions about causal relationships). If an experiment has too many limitations or threats to validity, causal claims are untenable. In all studies (descriptive, correlational, and experimental), we want to choose research methods that will give us accurate data. For example, self-report data can be manipulated by respondents in a variety of ways—where high-quality records of the data exist, we should try to access them, either as the primary source or to check the self-report data provided to us by respondents. Correlational and experimental designs must be designed in such a way that we can actually reach conclusions about relationships. For example, as obvious as it seems, we must actually collect data from each case that will allow us to talk about relationships among variables (that is, we cannot simply collect information about groups and reach conclusions about relationships between variables for the individual cases we are studying).

Internal experimental validity deals with how well our design actually allows us to reach the conclusions we make (especially about cause). That is, based on the control exercised in the research design, how sure can we be that our independent variable actually caused the change in the dependent variable? Evidence of internal validity is obtained by controlling potential alternative explanations of the dependent variable, including controlling other independent variables that may impact the dependent variable and controlling flaws in the design. This control can be exercised by including potentially confounding variables in the study and choosing a research design that controls threats to internal validity.

Let’s use an example of a basic study as a foundation for this section. This year, one 8th grade Math teacher in our school started using a new computer program (called TinkerPlots) for the current unit on Data Analysis. We want to know whether the new computer program intervention helped students learn Data Analysis better than what we’ve been doing using our traditional instructional methods. So our intervention (note that we cannot call it an independent variable because it is a constant) is the TinkerPlots program and the outcome of interest is learning, as measured by some achievement test.

External Validity

Essentially, external validity refers to how broadly our conclusions can be generalized. With no external validity, we cannot expect the descriptions or relationships we have observed in our sample to be true anywhere but for our sample. But in quantitative research, we really want to learn about some population, not just our sample. Or, said better, we hope to learn about our population by learning about our sample. This requires that our sample represents the population of interest in our research. Therefore, we might call external validity “representative validity.” We need our samples to represent some broader population for our conclusions to generalize anywhere beyond our sample. However, sometimes we need particular types of participants in our sample (e.g., Twitter users), so we might use different types of sampling designs in order to be able to represent that particular population.

Such representativeness can even be present in qualitative research, for example, when we want to find people to interview who have high levels of job satisfaction for a particular purpose. Or when we want to find schools for a case study where student achievement appears to be much higher than would be predicted from well-known predictor variables. If we obtain participants who have never used Twitter, or choose schools that really haven’t succeeded the way we expected because some of the data we used to pick schools was wrong, we do not have representative validity even in qualitative research.

We will talk about sampling methods later in the book. But for now, remember that the key is that our sample, or our participants, must be representative of the intended population or purpose required for our study. Essentially, we must make a multi-stage argument: our participants are representative of the sample, which is representative of our accessible population or sampling frame, which is representative of the target population. Our participants are those who have agreed to participate in our study. In survey research, we might call them respondents. Our sample is actually all those whom we have invited to participate in the study—not those who chose to participate (even though many researchers call their participants their “sample”). Because of ethical considerations, we cannot force everyone to participate. Therefore, the sample represents the lowest level of control the researcher has in the process. Ideally, we would make an argument that those who chose to participate do indeed represent the sample we have invited—or that our participants are not different from the non-participants (or that respondents are not different from non-respondents). Of course, this argument is not easy, because we usually have no information about non-participants, or about the sample as a whole. Sometimes, however, we have general information about our accessible population or sampling frame that we can compare to information we have about our participants.

From there, we need to be able to argue that our sample represents the sampling frame, or accessible population, which is essentially a list of everyone who could be sampled. This is most easily accomplished using random sampling, but there are other types of sampling strategies that may result in representative samples. We can almost never know the entire target population (those we want to learn about), so we must define a sampling frame, which represents the part of the population we can access (hence, accessible population). The easiest way to think about the sampling frame is as a list of everyone in a well-defined, as-large-as-possible subset of the target population. We need such a list to take a truly random sample. That is, everyone must have an equal probability of being chosen for our sample. If they are not on the list, then they have no probability of being chosen. Therefore, the only way to obtain a simple random sample is by having a list of the accessible population. Ultimately, then, the entire sampling process can boil down to how well you have represented your target population (e.g., all third graders) with your accessible population (e.g., third graders at your own school). Your sample can only truly represent your accessible population—you must argue why your sampling frame represents your target population.
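As a concrete illustration, here is a minimal sketch in Python (hypothetical names and frame, not from the original text) of drawing a simple random sample from a sampling frame so that everyone on the list has an equal chance of being invited:

```python
import random

# Hypothetical sampling frame: a list of everyone in the accessible population
# (for example, every third grader at our school).
sampling_frame = [
    "Avery", "Blake", "Carmen", "Dana", "Eli", "Farah",
    "Grace", "Hugo", "Imani", "Jonah", "Kira", "Luis",
]

random.seed(42)  # fixing the seed makes the draw reproducible and documentable
sample_size = 5

# Simple random sample: each person on the list has an equal probability of selection.
invited = random.sample(sampling_frame, k=sample_size)
print("Invited to participate:", invited)
```

Anyone not on the list has no chance of selection, which is exactly why the sample can only claim to represent the sampling frame, not the full target population.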

In quantitative research, we have learned that the best possible approach for obtaining such a sample is through simple random sampling (or perhaps stratified random sampling). The randomness is our best chance of obtaining a representative sample. But know that it is representativeness that’s the key—not randomness. But note that randomness is also critically important from a statistical perspective. That is, the probabilities we interpret from our statistical analyses assume that we have taken a random sample from the population (you will learn more about this when reading about the Central Limit Theorem).
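As a rough, informal statement of the idea referenced here (a sketch, not a formal treatment): if we draw a random sample of size \(n\) from a population with mean \(\mu\) and standard deviation \(\sigma\), the Central Limit Theorem tells us that the sample mean is approximately normally distributed,

\[ \bar{X} \sim N\!\left(\mu, \frac{\sigma^{2}}{n}\right) \quad \text{(approximately, for sufficiently large } n\text{)}, \qquad SE(\bar{X}) = \frac{\sigma}{\sqrt{n}} \]

The probabilities reported by our statistical tests rest on results like this one, which in turn assume the sample was drawn randomly.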

Common Designs

Intervention—Observation

We can discuss the research design in terms of interventions and observations. In research design terminology, observations are any measurements we make of the variables we are studying (e.g., tests, surveys, frequency counts). In the TinkerPlots example, we have the simplest example of a research design: an intervention (TinkerPlots) and an outcome measurement (learning). Some call this a one-shot case study. Using notation common to experimental studies, it would be shown this way:

\[ X - O \]

where X represents the experimental treatment, or educational intervention, and O represents the observation, or measurement. In everyday language, we might say that we implemented a program and then tested to see if it worked.

Let’s think a little about this one-shot case study design. What are we going to count as evidence? That is, what will we base our decision for action on? It appears that our decision about whether the TinkerPlots intervention worked will be based on the results we obtain on a test following the intervention. Let’s assume these 8th-grade students did well on the test. Can we immediately believe that the intervention was the cause of the high scores?

For example, given such a design, how can we be sure that the students wouldn’t have scored as they did even if they didn’t use TinkerPlots? How can we be sure that something else didn’t have the most important impact on the achievement outcome variable? The examples below are not the only threats to validity that exist, but are common ones that give us a sense of the kinds of things we need to think about when designing a study.

This research design, which is often used in evaluation studies, is considered a weak design. That is, the lack of experimental control in this research design itself may lead to results that can be attributed to a variety of legitimate alternative explanations. Why? Unfortunately, alternative explanations for the results may arise simply because of the research design we choose. In research design, these potential alternative explanations that result from the research design itself are called “threats to validity.” That is, something else besides the intervention may be responsible for the results. In this section we focus on threats to validity. For example:

* Because we don’t really know how well these students had mastered the topic before we started the TinkerPlots intervention, we can’t be sure that the intervention had any impact on the results; their scores may be based entirely on what they already knew.
* Perhaps something the 8th grade teacher did earlier in the year, while covering a different topic, actually carried over into Data Analysis and caused these scores, in large part.
* There may be something specific about this particular group of students, where they had a particular 7th grade teacher who had influenced their results, or maybe they all belong to a Math club.

When we do such a one-shot case study, we have these problems of selection of participants (indeed, the threat is often called selection bias). There may be something different about the students (or classes) that have been selected for the study. Note that this unknown influence may either increase or decrease the scores, or may hide the influence of an intervention either positively or negatively. That is, an intervention may work well, but students started so low that by the end their scores were still not very high, making the intervention appear to have not worked. Or perhaps the students understood the material before but the intervention actually confused them, so that they didn’t understand the material quite as well afterwards; if the scores are still reasonably high, in this design the intervention may look like it worked reasonably well.

There are a number of other types of reasons the one-shot case study is a weak research design. Perhaps many of the students got so confused by what the 8th grade teacher was doing with TinkerPlots that they sought out help from their 7th grade teachers or older siblings after school. Perhaps there is a popular TV show that, for some reason, has a couple episodes that deal with statistics during the week before the test (and perhaps the students actually learned something helpful about Data Analysis as a result of watching the shows). These are examples of a threat to validity called the history effect: the idea that something of importance (i.e., “historical”) occurs simultaneously with the implementation of the intervention and, therefore, we can’t be sure which one really had a more causal role. Sometimes the history effect actually impacts negatively; for example, an intervention may be working, but because of an excessive number of snow days during the Data Analysis unit, it doesn’t seem to work.

Sometimes things happen during the intervention related to the participants themselves; that is, the participants change during the intervention for reasons unrelated to the intervention itself. For example, in a year-long study, middle school students mature in some important ways from August to May. Indeed, this threat to validity is usually called maturation. For example, dog owners once went to a vet with a 6-month-old puppy who was biting all their wooden furniture. The vet prescribed medicine and told the owners that if they gave the pills to the puppy for six months, the furniture-chewing would stop. Because teething generally stops when a puppy reaches about a year old, the vet probably wouldn’t have been wrong even if the pill did no good at all. Other things fall into this threat, such as participants getting wiser, gaining experience, or becoming fatigued (remember, these threats don’t necessarily make the results look better; they can also work in the other direction). Our TinkerPlots example may have been short enough, however, that maturation really wasn’t an issue.

What else could be a problem in our TinkerPlots example? How could the test itself be an issue? If the test were too easy, the students could appear to have learned well when they might actually have had difficulty had the test had more difficult items. Or a test that is too hard may make it appear that the intervention didn’t work. When measurement instruments have problems (we’ll discuss measurement validity and reliability later), the results really have no meaning at all. That is, if we can’t be sure what the test is measuring, then how can we know what the test scores really mean? This threat is usually called instrumentation.

Another threat to experimental validity is related to attrition. In the TinkerPlots example, what would happen to the results if, right before the test, the five students who learned the most were suddenly moved into a different classroom? In such a case, the results would probably not look as strong as they would have, had those students been included all the way until the end. The most important issue is when participants drop out of a study for a particular reason (this threat is often called “mortality” in reference to patients in medical studies who die), especially when those who drop out are similar with respect to some characteristic that is relevant to the study.

There are some threats to validity that are related to the length of time the intervention takes to implement. For example, suppose the TinkerPlots program is fun for students and they pay more attention in class because of the novelty of a new instructional technique being used. Is it possible that the novelty itself, causing students to pay closer attention, results in higher test scores—but that in the long-term, once the novelty wears off, it wouldn’t work as well anymore? If the research could have occurred over a longer time span, we may have been able to see such a result occur. Conversely, sometimes our research is too short to see the long-term benefit of an intervention.

Adding a Pretest

So what can we do to improve our design? Well, you’ve probably guessed by now that knowing where the participants are before the intervention will help with some of these issues. That is, adding a pretest before we introduce an intervention can help deal with some of the threats to internal validity. While some of the issues below are relevant only when we use a true pretest–posttest design, some are in fact problems even when we use other information to determine where students are in regard to the outcome variable before the intervention.

The pretest–posttest design is usually drawn using the following symbolic representation:

\[ O_1 - X - O_2 \] where X still represents the treatment (intervention), O1 represents the pretest, and O2 represents the posttest taken after the intervention. For results to make sense in this design, we must have the same measurements both before and after the intervention.
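A minimal analysis sketch for this design (hypothetical scores, and assuming the scipy package is available; this is an illustration, not a prescribed analysis) compares each student’s pretest and posttest scores with a paired t test:

```python
from scipy import stats

# Hypothetical pretest (O1) and posttest (O2) scores for the same eight students.
pretest  = [52, 61, 48, 70, 65, 55, 59, 62]
posttest = [68, 72, 60, 78, 70, 66, 71, 69]

# Paired t test: is the average gain larger than we would expect by chance?
result = stats.ttest_rel(posttest, pretest)

gains = [post - pre for pre, post in zip(pretest, posttest)]
print("Mean gain:", round(sum(gains) / len(gains), 2))
print("t =", round(result.statistic, 2), " p =", round(result.pvalue, 4))
```

Even a convincing gain does not rule out the threats discussed below; the statistics describe the change, not its cause.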

Knowing where the participants are before the intervention can help reduce the threat due to the selection of participants. Unfortunately, most of the other threats are not minimized by simply adding a pretest. For example, maturation effects following the pretest may still be responsible for the scores students receive on the posttest instead of the intervention. Similarly, history effects, attrition effects, novelty effects, and instrument effects can still impact the results.

Worse, adding a pretest adds another potential problem. Sometimes when participants take a pretest they actually learn some things, or receive some other sorts of clues that cause them to do better on the posttest. In the case of certain attitude or rating scales, participants become sensitized to certain issues and therefore respond differently the next time they complete the attitude scale. This threat is called the testing effect. Changing some items on the test from the pretest to the posttest can cause instrumentation effect problems because we can no longer be certain that the tests are measuring the same variable.

Adding a Comparison Group

Another option to improve the design of a study is to use a group for comparative purposes. Often such groups are called control groups, but in most educational research they are more appropriately called comparison groups. The term control is usually used to imply that absolutely nothing happened with the group; obviously, we have difficulty with that in education because we must do something with the students in the comparison group. This nonequivalent group design (often called non-experimental) would look like this:

\[ \begin{aligned} X &- O \\ C &- O \end{aligned} \] where X represents the experimental treatment (intervention), C represents a control or comparison group, and O represents the outcome measurement.
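A minimal sketch of how the posttest comparison might be analyzed in this design (hypothetical scores, assuming scipy) is an independent-samples t test between the two groups:

```python
from scipy import stats

# Hypothetical posttest scores for two intact classes (no random assignment).
tinkerplots_group = [78, 82, 74, 90, 85, 79, 88]
comparison_group  = [72, 75, 70, 80, 77, 74, 73]

# Welch's t test (does not assume the two groups have equal variances).
result = stats.ttest_ind(tinkerplots_group, comparison_group, equal_var=False)
print("t =", round(result.statistic, 2), " p =", round(result.pvalue, 4))
```

A difference here is only interpretable to the extent that the groups were equivalent before the intervention, which is exactly the selection-bias problem discussed below.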

Having a comparison group helps with several of the threats to internal validity. For example, most history and maturation effects will impact both groups equally. Attrition can be an issue if participants leave the groups for different reasons, especially if those reasons are related to the intervention, but if participants leave randomly for reasons not related to the study, it may not be a problem. Instrumentation might still be an issue, particularly if there is a relationship between one of the treatments and the test.

Selection bias can occur and have an important impact on the results. For example, even if we believe that the groups we are comparing are relatively equivalent, they may not be. That is, in order to make fair comparisons between the groups, we need to be confident that there are no real differences between the groups prior to the intervention, on average (we’re not saying that each individual needs to be the same or that each individual must have a twin in the other group—just that the groups should be equivalent on average). That is, we want the only important difference between the groups to be that they each received a different treatment. Scientists often call this the “law of the single variable.” That is, the independent variable should be the only thing that varies across the groups.

For example, suppose we now have two groups: one that received the TinkerPlots treatment and one that received our traditional Data Analysis instruction (and note that now we have a true independent variable that differs across groups, unlike the one-shot case study and the pretest-posttest designs). We may believe that the two groups are equivalent, but if most of the students in the TinkerPlots group had the same 7th grade teacher who had taught Data Analysis particularly well, that group may have come into the study with an average already higher than the comparison group. In such a case, the intervention may have no impact, yet the results may make it appear that it did. Remember, there was no pretest here, so when the posttest results differ between the groups we will likely attribute the difference in scores to the intervention the group received.

Another potential threat to validity occurs when we have more than one group. Experimenter bias results when the researcher intentionally or unintentionally does something to influence one of the groups. Perhaps the teacher researcher tells the students that they are part of an experiment to test how well the TinkerPlots program works. Participants will sometimes give more effort when they know they are part of a study. When we don’t tell participants whether they are in the experimental condition or the control (or comparison or placebo) condition, we call that a blind study.

Sometimes experimenter bias occurs just because teacher researchers are more excited about the TinkerPlots intervention than they are about the traditional instructional approach used with the comparison group. This can be minimized by having someone else actually do the implementation, so that researchers cannot inadvertently impact the results. Even more ideally, the assistant wouldn’t know which treatment was the new intervention (we would call this a double blind study, if neither the participants nor the research assistant knows which is the experimental treatment). Because this is difficult in schools, however, action researchers might have someone else observe them as they do both the TinkerPlots intervention and the traditional instruction. This assistant would collect data about what occurred in both conditions to help provide evidence about how much this experimenter bias occurred.

Sometimes elements of the intervention used for research purposes have more impact than the intervention itself. For example, we might interview students after each day of using TinkerPlots. The students might feel important because they are getting this individual attention, even though it was intended only to be a data collection technique. Remember that in experimental design, we want the groups to be as similar as possible—except for the interventions. In this case, there are two things different between the groups: the instructional approach used (TinkerPlots or traditional) and the interview data collection method that only the TinkerPlots group received. This threat can be minimized by adding interviews to the traditional group.

True Experimental Designs

Because we rarely have a chance to randomly assign students to treatments, we won’t talk much about true experimental designs. The key benefit to true experimental designs is that the random assignment of participants provides our best chance of creating equivalent groups. All participants bring their own knowledge, experiences, demographic characteristics, and so forth to the study with them, as part of their being. Unfortunately, if the groups are on average different on these characteristics, then these differences may actually be responsible for the results rather than the intervention. When we are able to randomly assign participants to groups, we have our best chance to ensure that the groups end up roughly equivalent on all the “baggage” they bring to the experiment with them. The “R” signifies Random Assignment to groups.

\[ \begin{aligned} R - X &- O \\ R - C &- O \end{aligned} \]

Quasi-Experimental Design

Fortunately, we can achieve a relatively strong research design (called quasi-experimental) without needing to randomly assign participants to groups. When we use pretest information (perhaps not just a pretest, but also other information we may have about participants before the intervention, such as general achievement, other exams, or even attitudinal variables), we can verify that groups are relatively equivalent in important ways. This quasi-experimental group design would look like this:

\[ \begin{aligned} O_1 - X &- O_2 \\ O_1 - C &- O_2 \end{aligned} \] where X represents the experimental treatment (intervention), C represents a control or comparison group, O1 represents some pre-intervention information (commonly a pretest, but it could be other data that gives us information about the participants prior to the intervention), and O2 is the final outcome measurement.

For example, if we have two intact classrooms we want to study, we can compare them on a variety of pre-intervention information. If they are relatively similar on these variables, then we have some confidence that they are equivalent as groups in general. We could also use techniques to match participants that allow us to pair students with similar values on the pre-treatment information. For example, we could match the boy with the highest pretest score and best attitude toward Math in the TinkerPlots group with the boy in the comparison group with the highest pretest score and best attitude (this gets more difficult the more variables we use for matching). If the groups are not equivalent, we can even sometimes use that information to make adjustments to our analyses or interpretations.
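As an illustration of the matching idea (hypothetical data, matching on a single pretest score only), we might pair each student in the TinkerPlots group with the comparison-group student whose pretest score is closest:

```python
# Hypothetical pretest scores, keyed by student ID.
tinkerplots = {"T1": 62, "T2": 55, "T3": 71, "T4": 48}
comparison  = {"C1": 60, "C2": 49, "C3": 70, "C4": 57, "C5": 64}

pairs = []
available = dict(comparison)
# Greedy one-to-one matching on pretest score (this gets harder with more matching variables).
for t_id, t_score in sorted(tinkerplots.items(), key=lambda kv: kv[1], reverse=True):
    closest = min(available, key=lambda c_id: abs(available[c_id] - t_score))
    pairs.append((t_id, t_score, closest, available.pop(closest)))

for t_id, t_score, c_id, c_score in pairs:
    print(f"{t_id} (pretest {t_score})  <->  {c_id} (pretest {c_score})")
```

Matching only equates the groups on the variables we match on; everything else the participants bring with them may still differ.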

As mentioned earlier, we might also be able to match participants across groups, so that we create roughly equivalent groups. However, as always, caution must be exercised because we can only be sure the groups are roughly equivalent on the variables we used to match them—all their other “baggage” may be very different and therefore still result in nonequivalent groups being formed.

The main idea with this quasi-experimental design is that we know where the groups started, on average, and if they weren’t equivalent we can make some adjustments when we analyze or interpret the data. But if they are equivalent (perhaps through matching), then we have more confidence that they were equivalent—at least on the variables we measured before the intervention. Other things may still be nonequivalent, which is why the true experimental design with randomly assigned groups is better (unfortunately, true experiments are very, very difficult in most action research).

Pseudo-Experimental Design

Unfortunately, we often use only relatively weak experimental research designs (we will call them pseudo-experimental), without randomly assigning participants to groups and without knowing any baseline information to verify equivalence. We use the term “pseudo-experimental” design to distinguish it from what many call “non-experimental” design, which uses “trait” (often called demographic or experiential) variables for comparisons as opposed to “treatment” groups (where some condition has been manipulated). In the non-experimental case, the researcher has absolutely no control over the traits or experiences of the participants, and therefore there is no manipulation whatsoever; the data are non-experimental, observed (and correlational) data.

In what we call pseudo-experimental design, there is manipulation of the treatment variables, just not controlled by the researcher. For example, in research that studies the educational effectiveness of Virtual Reality (VR) videos, the researcher asks participants to watch a VR video, but is not able to control how the participant watches the video. Later, the researchers may be able to compare the impact of different VR technology (e.g., VR glasses, computer, or smartphone/tablet) on the intention to watch educational videos.

If the researcher does not randomly assign participants to treatments or have baseline information for the dependent variable to verify equivalence of groups, it cannot be an experimental or quasi-experimental design. However, how the participant watched the VR video is not a personal trait and only exists because of the research, so it is not non-experimental. We (Adjanin & Brooks, 2024) call it pseudo-experimental in recognition of the fact that the VR technology was a condition participants manipulated on their own, without any control from the researcher. Graphically, it is similar to the non-experimental design:

\[ \begin{aligned} X &- O \\ C &- O \end{aligned} \]

Single-Subject and Small-Sample Designs

Finally, there are single-subject and small-sample designs that are often useful in action research. Most commonly, these small-sample designs look like this:

\[ O_B - O_X - O_B - O_X \] where \(O_B\) represents baseline measurements and \(O_X\) represents measurements made during the experimental treatment or intervention phase. This most basic single-subject, or time series, design simply repeats the same treatment two or more times. In many textbooks, this design is designated as an A B A B design, or even more simply, an A B A design (but most scholars recommend ending with a treatment phase, assuming that the treatment seems to be working).
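A minimal sketch of how the repeated measurements in an A B A B study might be summarized (hypothetical daily counts of a target behavior; a simple phase-mean comparison, not the only way to analyze such data):

```python
# Hypothetical measurements, in order: baseline (A), treatment (B), baseline (A), treatment (B).
phases = {
    "A1 (baseline)":  [9, 8, 10, 9],
    "B1 (treatment)": [5, 4, 4, 3],
    "A2 (baseline)":  [8, 9, 8, 9],
    "B2 (treatment)": [4, 3, 3, 2],
}

# The evidence comes from the pattern repeating each time the treatment is introduced.
for phase, scores in phases.items():
    print(phase, "mean =", round(sum(scores) / len(scores), 2))
```

The credibility of the design rests on the pattern reversing and reappearing as the treatment is withdrawn and reintroduced, not on any single phase.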

We can improve the design a little by having the same group do both the experimental treatment and some comparison condition:

\[ O_B - O_C - O_B - O_X \] where \(O_C\) represents measurements made during a comparison condition (which may be a different intervention or may be more of a control or placebo). There are actually many variations on this theme, but the main idea is that the same group gets both the treatment and the control.

The treatments used in these single-subject, small-sample, and time-series studies are often treatments that have immediate impact and, once withdrawn, immediate loss of impact. However, these designs can be adapted for other types of treatments. While this design controls some of the threats related to not having a comparison group, there are obviously some threats to validity with this design. For example, the history effect may occur differently during the control and the experimental phases, thereby impacting the results for \(O_C\) and \(O_X\) differently. Similarly, maturation, experimenter bias, testing, instrumentation, and attrition effects may impact the results in this design.

There are also some additional threats that occur with this design, called the carryover effect and the order effect. Sometimes, the effect of the first intervention continues for a while and may impact the results obtained for the second intervention; this is called a carryover effect. Sometimes, the effect of the second intervention is increased or decreased simply because it follows the first intervention—an order effect.

There are ways to manage these carryover and order effects and thereby make the design more quasi-experimental. For example, if we have access to only one group, we can run the interventions multiple times and change the order. The design would look like this:

\[ O_B - O_X - O_B - O_C - O_B - O_C - O_B - O_X \] This would allow us to account for potential order and carryover effects. Another way to manage these issues, and several of the others that may threaten validity, is to use a second group who would receive the treatments in different orders:

\[ \begin{aligned} O_B - O_X - O_B &- O_C \\ O_B - O_C - O_B &- O_X \end{aligned} \] Single-subject, small-sample, and time series designs can be quite useful when comparison groups cannot be obtained. There are a number of additional design adaptations and extensions that can be used (for example, multiple baseline designs, where multiple individuals or groups start the treatment at different times to help minimize the history threat).

Other Designs

In descriptive and correlational (or observational) designs, we simply collect data for the variables and phenomena that interest us. If we desire to know about relationships between variables, a correlational design will suffice. If our goal is to understand or describe a situation, then we can simply collect the data necessary. We would need to determine what information we need and how to collect it. Sometimes this data will be quantitative and sometimes it will be qualitative—the important issue is to collect the data we need to answer our research question(s) and to inform our decision(s) concerning our action question(s).
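A minimal sketch of a purely correlational analysis (hypothetical data, assuming scipy) that simply describes the relationship between two measured variables:

```python
from scipy import stats

# Hypothetical data: minutes of weekly practice and unit test scores for ten students.
practice_minutes = [30, 45, 60, 20, 90, 75, 50, 10, 65, 40]
test_scores      = [70, 74, 80, 65, 88, 85, 78, 60, 82, 72]

r, p = stats.pearsonr(practice_minutes, test_scores)
print("Pearson r =", round(r, 2), " p =", round(p, 4))
```

Whatever the correlation turns out to be, this design supports conclusions about a relationship only, not about cause.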

Sometimes, the weaker designs described above may in fact be correlational. The key difference is the nature of the independent variable. In our example above, the researchers manipulated the treatment conditions. That is, we assigned each group a particular intervention. However, with some independent variables (for example, trait variables like gender), researchers cannot assign a value to a group. If we want to compare boys and girls, we must simply collect that data and then do the appropriate analyses.

Sometimes the designs may look similar to those described above, but must be interpreted differently. For example, instead of X and C in the nonequivalent-groups design, we might have M and F (for male and female, respectively), and not have an experimental treatment or intervention of any kind. But such analyses could still provide us some useful information. The most important issue here is that we don’t overreach in our conclusions and decide that gender is actually the cause of the results. Unfortunately, because we have nonequivalent groups, there are many other possible explanations, both threats to validity and also other potential explanatory variables, for which we have not accounted. Some designs may help us reach slightly stronger conclusions, but none will ever allow us to reach causal conclusions (because we cannot assign, or manipulate, gender in a true experimental design).

But recognize that all designs are legitimate for the purposes they serve. The issue with validity is to be sure that our conclusions do not reach beyond the data and designs we have.

DATA COLLECTION

There are some inherent differences in the questions we might ask that lead to action research. For example, asking why something is happening implies that we must understand the people we are working with or the situations they face. This most often leads to talking to them and trying to figure out what they are going through. Other questions, like trying to learn which method works better, imply a more experimental approach to research. We generally call the former qualitative research and the latter quantitative research. We will discuss the types of data that we need to collect for each type of research below.

Note: The qualitative sections (both Data Collection and Analysis) have been provided in the Appendices for the textbook (i.e., not among the appendices for this chapter) because so many researchers are doing mixed methods now. Further, it is quite common for researchers to quantify qualitative data and perform statistical analyses on that transformed data (usually, but not exclusively, descriptive statistics). For example, some researchers will count how many times a particular idea was expressed during interviews or how many times a particular behavior occurred during observations. These counts can then become quantitative data for analysis.
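A minimal sketch (hypothetical coded interview data) of quantifying qualitative data by counting how often each idea, or code, appears:

```python
from collections import Counter

# Hypothetical codes assigned to interview excerpts during qualitative coding.
coded_excerpts = [
    "confidence", "frustration", "confidence", "peer help",
    "frustration", "confidence", "peer help", "confidence",
]

code_counts = Counter(coded_excerpts)
for code, count in code_counts.most_common():
    print(code, count)
```

The resulting counts are descriptive quantitative data; the meaning of each code still comes from the qualitative analysis that produced it.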

There is an important difference between our approach in this chapter and many other resources you might read about action research. That is, we have focused a great deal on an experimental approach to evidence. Remember, however, that this section of the chapter was introduced with the recommendation that we don’t always need the strongest research design to answer our research question(s) or to inform our action question(s). That is, sometimes we will collect data for purely descriptive purposes, whereas other times we will want to collect data that helps us decide between two courses of action.

A further difference between our approach and many others is that we believe that qualitative data may be useful even in this experimental approach to research design. That is, typically we only consider quantitative data as appropriate for experimental-type research. However, our perspective is that action researchers want to collect whatever data they can to help them make decisions. The critical importance of the research design is to help us determine what can be considered evidence and what other issues must be acknowledged as potential alternative explanations of our results. But in action research, it is not always possible to collect the quantitative data we desire, nor is it usually appropriate to analyze these data using the typical statistical methods that are applied to experimental research.

Quantitative Data Collection

Very often the questions teachers ask are not qualitative in nature at all. As we discussed earlier, quantitative questions often revolve around whether something worked. For example, an educator may want to know whether a particular intervention or instructional method made a difference on some specific outcome of interest—our TinkerPlots example above.

Quantitative data are measurements we make for our variables. They come from the operational definitions for our variables. Recall that operational definitions were defined as the methods we use to obtain numbers that represent participants’ values on the variables of interest. These operational definitions should match the theoretical or conceptual definitions of the variables. For example, it would not make much sense to use a scale that measures teacher quality from a behaviorist perspective if we have defined teaching from a more constructivist perspective. Consequently, it is critically important that we have a clear understanding and definition of our variables prior to deciding how we will measure them.

It is important for us to identify every variable for which we must collect data, not to assume it will be easy to collect. For example, even though our immediate question may not have included gender or whether students own computers, we need to consider our question carefully and try to determine what information we will need before we collect it. Sometimes, especially in action research, we will be able to go back and get the additional information if we need it. However, sometimes if we don’t collect the data we need during the study, it will be all but impossible to collect it later. And even in action research, if we forget to ask for gender when we collect anonymous information from students, we will never be able to match up the data we receive with the gender of the student who provided it. A useful way to think about this issue is to determine how we will want to be able to aggregate and disaggregate the data we collect in the reports we will make to various audiences (Sagor, 1992).

We have two important criteria for the quantitative data we collect: reliability and validity. In reality, in order for measurements to be valid they must also be reliable, so we don’t necessarily need to differentiate the concepts (that is, validity includes reliability)—but for a variety of reasons it is useful to consider them separately.

Construct Validity

Validity is the degree to which the scores obtained from a measurement actually measure what the measurement purports to measure. The most important issue with construct validity, also called measurement validity, is the same as it is with the other forms of validity we have discussed (internal experimental validity, external validity, and qualitative validity). That is, validity is about the inferences we make from the data we have. We infer that people have “so much” of a given variable based on the scores they obtain from some measurement. Evidence of measurement validity gives us confidence that these inferences are useful and meaningful. Several types of evidence can be provided to support a claim that an instrument produces valid scores: construct validity, criterion validity, and content validity.

Construct validity is the most comprehensive term for measurement validity: that an instrument measures the construct it is purported (i.e., intended) to measure. Evidence for construct validity can be obtained in a variety of ways. For example, if our new test of math skills is highly correlated with another test of similar math skills, as it should be, we may have evidence of convergent validity. The test of math skills should be uncorrelated with tests that measure variables theoretically unrelated to math skills, perhaps social studies knowledge. If we have two groups who are known to be different in math skills, then the two groups should have important differences on the scores they obtain from the math test. If we have evidence that certain items are more correlated with each other than they are with other items on the same test (for example, addition items may be more correlated with other addition items than they are with division items), we may have factor validity. Finally, for example, if the math scores improve after an intervention designed to improve the scores, then we may have evidence of construct validity.
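To make the convergent and discriminant logic concrete, here is a minimal sketch in R (the statistical language behind the jamovi syntax shown later in this chapter); all the scores and variable names are hypothetical, invented only to illustrate the two correlations we would examine.

# Hypothetical scores for the same 8 students on three measures
new_math    <- c(55, 62, 47, 70, 58, 65, 50, 61)   # our new math skills test
old_math    <- c(52, 65, 45, 72, 55, 68, 48, 60)   # an established math test
soc_studies <- c(60, 58, 49, 57, 66, 50, 62, 53)   # a theoretically unrelated test

# Convergent evidence: we hope this correlation is strong and positive
cor(new_math, old_math)

# Discriminant evidence: we hope this correlation is near zero
cor(new_math, soc_studies)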

Another form of validity often considered is called criterion-related validity. Criterion-related validity specifically refers to situations where we expect our test to predict scores on some criterion variable of interest. For example, if we are trying to determine whether students will score well on a statewide science exam, we would want to have a test that provides scores that correlate well with this statewide exam. By correlating scores on the two tests, we may obtain evidence of criterion-related validity. Once we have convincing evidence of criterion-related validity, we can use that knowledge to predict scores on the statewide exam using our local test. Without this evidence, we cannot diagnose well whether students need additional instruction or intervention prior to taking the statewide exam. That is, without criterion-related validity, students with low (or high) scores on our local test may do well (or poorly) on the statewide exam—we just cannot make any reasonable predictions.
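Continuing in the same spirit, a simple regression sketch shows how criterion-related validity evidence can be used for prediction. The local test scores, the statewide exam scale, and the values below are all made up for illustration.

# Hypothetical scores on our local science test and the later statewide exam
local_test <- c(18, 25, 22, 30, 15, 27, 20, 24)
state_exam <- c(310, 355, 340, 380, 295, 360, 325, 350)

cor(local_test, state_exam)                 # criterion-related validity evidence

fit <- lm(state_exam ~ local_test)          # simple prediction model
predict(fit, data.frame(local_test = 23))   # predicted statewide score for a student scoring 23 locally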

Content validity refers to how well the items of a test or scale actually measure the variable they were intended to measure. There are two ways to think about this. First, if a test is designed to measure math skills in addition, subtraction, multiplication, and division with integers, but includes items with fractions, we have a problem with content validity. That is, items are on the exam that shouldn’t be, so we are not measuring what we thought we were measuring. Second, some aspects of a variable may not have any questions. On the same test, if there are no items that measure multiplication, we are not measuring the full scope of what we intended to measure. Content validity is critically important for tests, but it is also very important for attitude or rating scales.

The best way to ensure content validity is through the use of a table of specifications. A table of specifications lists all the areas that should be covered on a test and then records how many items actually do measure those concepts (ideally, we would even indicate which items measure which concept). After we have created a test or an attitude scale or some other type of rating scale, we can have experts or knowledgeable colleagues review the instrument to determine whether we have attended appropriately to the two key issues mentioned above (that is, whether all aspects of the variable are represented with items and whether all items actually measure an aspect of the variable as defined). They can use the table of specifications to help with this process. Chapter 2 Appendix C includes some examples of tables of specifications.

Finally, there are some other issues of validity that must be addressed that have less to do with the instrument itself and more to do with how people respond to it. These behaviors make it difficult for us to assess the validity of the scores provided by the instruments. Further, there are sometimes scale construction methods we can use to help minimize the potential for these behaviors (they are more problematic with scales than with tests). For example, some people respond to attitude or rating scales by providing answers that make them look more favorable. Measurement scholars call this a social desirability response set (that is, respondents provide a set of responses that help them look socially desirable), but it is very similar to the popular culture idea of “political correctness.” There are methods for dealing with this, including using scales that measure it. In action research, there may be additional ways to identify these behaviors, perhaps through triangulation of some kind.

In general, researchers have found that people try to be helpful; consequently, when they respond to surveys they often try to provide answers they think the researcher wants, and in particular, often try to agree. In particular, children often try to please the adult authority figures who ask them questions. Scholars call this an acquiescence response set. Some people want to help, but want to provide the minimum effort possible, and so answer questions quickly and without appropriate thought. This satisficing behavior may lead them to circle all the middle options on a scale, or perhaps all the extreme options. In both these cases, using items that change direction may help identify the behaviors. For example, if we have two items “I love science” and “I hate science” then we would expect that most respondents cannot honestly agree with both statements. Having a relatively equal number of positive and negative items also encourages respondents to read the items more carefully. Also, asking questions (either in interviews or in scales) in multiple ways can help get more honest answers, rather than the more superficial acquiescent answers.
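As a small illustration of how reverse-worded items are handled at scoring time, the sketch below reverse-scores a hypothetical negatively worded item on a 1-5 scale before combining it with a positively worded item; the item names and ratings are invented.

# Hypothetical 1-5 ratings from 7 respondents
love_science <- c(5, 4, 2, 5, 3, 1, 4)   # "I love science" (positively worded)
hate_science <- c(1, 2, 4, 1, 3, 5, 2)   # "I hate science" (negatively worded)

# Reverse-score the negatively worded item so that higher always means
# a more positive attitude (for a 1-5 scale, reversed = 6 - original)
hate_science_rev <- 6 - hate_science

# Now the two items can be summed into a single attitude score;
# a respondent who agreed with both original items would stand out here
love_science + hate_science_rev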

Measurement Reliability

Reliability describes the degree to which an instrument measures consistently whatever it measures. Several types of evidence can be provided to support a claim that an instrument produces reliable scores: test-retest reliability (stability of scores over time), alternate form reliability (stability of scores of parallel tests), and internal consistency (stability of items within the instrument). One subtle matter that concerns measurement scholars but doesn’t seem to concern many others is that tests and scales really have no reliability or validity themselves—it is the scores that are reliable or valid. From this perspective, we would say that reliability means that the scores produced by a given test over time remain relatively consistent, not that the test remains consistent.

The easiest way to think about reliability is to consider a circumstance where we can give a test or some other scale to respondents over and over again without them remembering that they took it (so testing effects do not play a role in changing the scores). Imagine, for example, that a researcher could give participants a test, wave a magic wand to make them forget that they took it, and then give it to them over and over again; if the test produces reliable scores, each administration should yield nearly the same score for each participant.

If you’ve seen the movie “Groundhog Day,” think of Bill Murray as a teacher researcher rather than a weatherman. In the movie, he plays a weatherman who is covering the Groundhog Day festivities (that is, watching to see whether the groundhog sees its shadow). Murray wakes up every day and it is still Groundhog Day, but no one else seems to be reliving the day and each day unfolds differently. He might go to school each day he wakes up and give the same test to his students. Each day is different, meaning that students face different distractions and issues each day. Consequently, they don’t score exactly the same on each test, but if the test produces reliable scores, they will obtain close to the same score every day. If that happens, by the end of the movie, Murray would have a very nice set of data to support an argument for evidence of test-retest reliability of the scores on that given test.

Test-retest reliability is particularly important in research designs that use pretests and posttests. What does it mean to say that scores are unreliable? It means that the scores are not consistent at all, that scores will change from one time to the next for no apparent reason. If scores change for no important reason due to a lack of reliability, then we cannot know what really happened between the pretest and the posttest. That is, we hope that scores changing from the pretest to the posttest would indicate that the intervention was at least partly responsible for the change. However, if the scores are unreliable, that means that the scores changed essentially randomly, and not necessarily due to anything related to the experimental treatment.

In practice, evidence of test-retest reliability is not easy to obtain. It requires giving the same people a test on at least two occasions, far enough apart that the testing effects from the pretest are minimized, but not so far apart that something may have happened to cause a change in the scores. The respondents or test-takers should be essentially the same when they take the test both times. Generally, 2-3 weeks are recommended between tests.

Another form of reliability that has some importance is called equivalent-forms or alternate-forms reliability. This type of reliability deals with the notion that we may have two separate tests that are intended to measure the same variable. Suppose a teacher wants to minimize the possibility of cheating on a test. The teacher may create two tests intended to be equivalent in every way, just with different items or a different ordering of the same items (sometimes order makes a difference in test scores). Students would want to have some confidence that their scores would not have been much different no matter which form of the test they took. This requires evidence of alternate-forms reliability. The teacher would give both tests to the same group of students and calculate a reliability statistic. Another example, which actually involves both test-retest and alternate-forms reliability, is related to the SAT, ACT, and GRE exams. Both students and those who require these exams rely on the fact that it doesn’t matter when a student takes the test, nor which version of the test they take. That is, students who take these college admissions tests are expected to get roughly the same score (within a small margin of error) no matter when they take whichever version of the test.

Because evidence for both test-retest reliability and alternate-forms reliability is difficult to obtain (they both require the same students to take a test twice), there is a third form of reliability used more commonly than either. We call this internal consistency. Internal consistency refers to how well the items all work together to measure the same thing. It is conceptually related to split-half reliability. Think of giving students a test that has 40 items. We can think of this test as two halves (alternate forms), each with 20 items. If we correlate the total scores on the first half with the total scores on the second half, we have something essentially equivalent to alternate-forms reliability. For a variety of reasons, it is not best to just use the first and second halves of the test as the alternate forms (for example, fatigue may make students perform less well on the second half or the test items may be ordered by difficulty). There are a large number of ways to divide a test in half (for example, odd and even items is another common method). Scholars have developed measures of internal consistency that essentially estimate the average of all these possible split-half reliability values. The most common are Cronbach’s coefficient alpha (used for attitudinal scales, for example) and KR-20 (used for tests graded as right and wrong).
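If you would like to try this on your own data, here is a minimal sketch using the alpha() function from the psych package in R; the item names and the 1-5 responses are hypothetical. For a test scored right (1) and wrong (0), the same calculation is equivalent to KR-20.

# install.packages("psych")  # if the package is not already installed
library(psych)

# Hypothetical responses to a 4-item attitude scale (1-5 ratings, one row per respondent)
items <- data.frame(
  q1 = c(4, 5, 3, 2, 4, 5, 3, 4),
  q2 = c(4, 4, 3, 2, 5, 5, 2, 4),
  q3 = c(5, 4, 2, 3, 4, 5, 3, 3),
  q4 = c(3, 5, 3, 2, 4, 4, 3, 4)
)

# Cronbach's alpha (internal consistency); the output also reports
# item-total correlations and alpha-if-item-deleted for each item
alpha(items)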

There is an estimate for KR20 that is relatively easy to calculate, called KR-21. It uses just the mean and standard deviation of the test, but does not provide as good an estimate of reliability as KR-20. However, because KR-21 is a lower bound for the true reliability, if it is relatively high, then we can have some confidence that the reliability will be reasonably strong. We generally look for reliability over .80, or preferably .90 (the maximum is 1.0). The formula for KR-21 is:

KR-21 = [K / (K − 1)] × [1 − M(K − M) / (K · s²)]

where K is the number of items, M is the mean (or average) test score, and s is the standard deviation of the test scores (so s² is the variance).
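As a quick illustration, here is a minimal sketch of the KR-21 calculation in R (the same language used for the jamovi syntax shown later in this chapter). The scores, the number of items, and the resulting value are entirely made up for demonstration.

# Hypothetical total scores on a 40-item test (one total score per student)
scores <- c(20, 35, 26, 38, 31, 22, 36, 28, 33, 25)

K  <- 40            # number of items on the test
M  <- mean(scores)  # mean total score
s2 <- var(scores)   # variance of the total scores (s squared)

# KR-21: a quick, lower-bound estimate of reliability for right/wrong tests
KR21 <- (K / (K - 1)) * (1 - (M * (K - M)) / (K * s2))
KR21  # roughly .81 for these made-up scores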

Finally, there are additional forms of reliability that are required in certain circumstances. For example, if you have multiple scorers grading exams, you will probably want to address inter-rater reliability. On some occasions, you may need to know about the accuracy of your decisions. For example, in the case of a pass-fail exam, you may need to know how frequently a test accurately placed students in the right category (pass or fail).
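If two raters score the same set of exams, one common way to quantify their agreement is Cohen’s kappa. The sketch below, with made-up pass (1) and fail (0) decisions, uses the cohen.kappa function from the psych package; treat it as an illustration rather than a complete treatment of inter-rater reliability.

# install.packages("psych")  # if the package is not already installed
library(psych)

# Hypothetical pass (1) / fail (0) decisions from two raters on the same 10 exams
rater1 <- c(1, 1, 0, 1, 0, 1, 0, 1, 1, 0)
rater2 <- c(1, 1, 0, 0, 0, 1, 0, 1, 1, 1)

mean(rater1 == rater2)               # simple percent agreement
cohen.kappa(cbind(rater1, rater2))   # Cohen's kappa corrects agreement for chance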

Test and Scale Construction

We would never ask a single question to determine whether students have achieved the desired math skills they have been learning. Similarly, we would not want to ask a single question about whether a student likes math, or whether parents are satisfied with their child’s experiences at school. Such difficult and complicated variables require multiple items to measure them. Just as we do with tests, we can compute a total score on such attitude or rating scales. The key to test and scale construction is to create a coherent set of items that measure the construct of interest.

Ideally, this begins with a table of specifications where we identify all the important aspects of the variable that need to be measured. Next we create a set of items that measures those aspects. We need to have someone check our content validity. We need to pilot test these instruments to determine whether they are being understood by respondents in the way we intended (for example, items and instructions). Finally, we would revise these items and then create a semifinal draft of the test (or scale). After using the test or scale, we would actually want to repeat several phases of this process as we continue to make the instrument better.

We’ve already discussed the table of specifications. Variables are generally complicated enough that we need to identify concepts that help define them. The concepts that go into the table for scales are found in theory or in the literature; the concepts that go into the table for tests are usually found in the content standards or content objectives for the material being taught. Often we can find an existing instrument that we can borrow or adapt (it is considered a courtesy to contact the original author for permission, and copyright laws should always be observed). But sometimes we must create an instrument on our own. Developing the items that measure each concept is often a brainstorming process. It is not easy to create items for an instrument. Recommendations for writing and wording items are included in Chapter 2 Appendixes D and E.

Once the instrument is created, we want to get feedback from colleagues, both formal feedback related to content validity and also informal feedback about wording and presentation. We’ll make revisions based on what we hear. Then, ideally, we can try the instrument with some people from the population we want to study. Sometimes this is just not possible, especially in action research where the populations are rather small. However, some creativity can be used to help review the test. For example, if you are doing the study in just one math class, you might use students from another math class to help in this process. They would take the test or respond to the scale and provide you feedback about items they didn’t understand. You would use the actual data from their answers or responses to perform an item analysis on the instrument.

Item Analysis

The item analysis lets us see beyond the feedback to how the items are actually working. We can diagnose problem items and work on improving them before we actually use the instrument to collect data in our study. A relatively straightforward approach to item analysis is to look for items that are not performing appropriately. We use two primary statistics for this purpose: item-total correlation (usually called point-biserial correlation in testing with binary outcomes, and sometimes calculated as “corrected” item-total correlation) and alpha-if-item-deleted. These statistics are most commonly reported by statistical programs. For now, though, we can think through an example of what these statistics tell us.

For example, we would expect the students who scored highest on a test to have answered most of the items correctly. Therefore, for any item, we would expect that the highest-scoring students would get the item right and the lowest-scoring students to get the item wrong. If we have some items that the highest-scoring students got wrong, but the lowest-scoring students got right, there may be something wrong with the item. We may also be concerned with items that only a very small number of students got right—or even items that everyone got right. If we are trying to determine whether an intervention worked, we don’t want everyone to get high scores on the exam because we have a test that is too easy (nor do we want the test too hard). That is, we would expect that the students who didn’t learn as well due to an inferior intervention might be more likely to miss that item. Similarly, we would not want both groups to get the item wrong because it was too hard. We tend to prefer items in the middle to help us distinguish those who know the material versus those who don’t.

Chapter 2 Appendixes F and G have an example form that can be recreated to help perform these item analyses. Essentially, you identify those students with the highest and lowest scores. You will place their scores for each item in the table (an example is provided for both tests and surveys—the only real difference is that you will use the total scores for people in each group for the survey analysis, while you’ll use just the number who got the item correct in the test analysis). Looking at these appendix examples reveals that Khaleel had the highest score on both the test and the scale (tied with Holly on the test), while Qian had the lowest score on both. The items in both examples are ordered from best to worst in item discrimination. Item discrimination is how we measure what we discussed above (whether the highest-scoring students got an item wrong while the lowest-scoring students got the item right). Any item discrimination below .20 is a concern, and any item with a discrimination below zero (that is, negative) usually must be revised. In both cases, (a) items 1-5 show reasonably good item discrimination, (b) items 6-7 are just good enough by most standards, but (c) items 8-10 probably need revision. Item difficulty can also be reviewed on these forms.

The number reported as item difficulty for the test analysis actually represents the proportion of students who got the item RIGHT. Any item with more than 80% or fewer than 20% getting it correct could be a problem (for example, items 5 and 6, which both were very easy). The information reported for item average in the survey analysis represents the average rating for each item; since this was a 5-point scale, item 7 had the strongest agreement, while item 9 had the lowest average. When these get too close to the extremes (that is, 1 and 5), we may have an indication that most respondents are answering the same way—at the same end of the scale (the 4.5 for item 7 is probably close enough to be a concern). Although rewording individual items may fix this problem, when the problem is more pervasive, changing the scale of the items to allow more choices at the agreement or positive end of the scale may help provide more variation among the scores.
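For those who prefer to compute these item statistics directly rather than using the appendix forms, here is a minimal sketch in R for a single item; all responses and total scores are hypothetical, and a real analysis would repeat this for every item (or use a statistical program that reports item-total correlations and alpha-if-item-deleted automatically).

# Hypothetical wrong (0) / right (1) responses to one test item,
# with each student's total score on the whole test
item_correct <- c(1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1)
total_score  <- c(38, 35, 22, 40, 25, 33, 36, 20, 39, 27, 31, 30)

# Item-total (point-biserial) correlation: we want this to be positive
cor(item_correct, total_score)

# Discrimination index: proportion correct in the highest-scoring group
# minus proportion correct in the lowest-scoring group
ord     <- order(total_score)
n_group <- 4                         # bottom 4 and top 4 of these 12 students
low     <- item_correct[ord[1:n_group]]
high    <- item_correct[ord[(length(ord) - n_group + 1):length(ord)]]
mean(high) - mean(low)

# Item difficulty: proportion of all students who got the item right
mean(item_correct)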

PROCEDURES

We want to set up a clear plan for the procedures we will use to collect our data, the more detailed the better. Most importantly, we need to determine how we will specifically implement our interventions and/or collect the data from participants. For example, when exactly will students receive the new intervention, and when exactly will they take the pretest, posttest, rating scale, or other measurement instrument? If we need to gain access to certain people or resources, we need to plan how we will do so.

Ideally, if we give our procedures to someone else, they should be able to do the study for us. For example, in an experimental design, we will want to describe how we will manipulate our independent variables and control other variables and potential problems. We need to determine who will implement the experimental intervention (and the control). We will want to indicate how we will check to make sure the manipulation worked (i.e., to confirm that the right treatments were given to the right groups and that the intervention was implemented correctly). We’ll describe how we will train assistants. In a survey design, we’ll describe how we will ensure a reasonable return rate (for example, the number of times we will request surveys to be returned) and how we will obtain responses (for example, internet, postal mail, or interviews).

The procedures should include a time frame, including reasonable estimates for how long each step will take. We’ll need to identify or create our measurement instruments and interview protocols. A pilot study can help work out the problems in our data collection procedures, whether caused by flaws in our measuring instruments or in the procedures themselves.

After we create a timeline, we may need to modify it when unforeseen events cause us to change our schedule. In action research, we will often keep a journal that records our informal thoughts each day about the research process and data collection procedures. Such a journal might be a place to record how these unforeseen circumstances may have impacted the study.

Ethical Considerations

As we begin the process of designing our study, we must keep in mind ethical considerations related to our research. Although there are others, some of the more important ethical considerations to remember have to do with the participants’ rights to:

* protection from physical and mental discomfort or harm, including research techniques that might have negative social consequences
* privacy, either through anonymity or confidentiality
* informed consent, which includes full and honest disclosure of the purpose of the research, along with the right to withdraw (unless otherwise constrained by their official roles)
* respect for cultural, religious, gender, and other significant individual differences

Action research has some important differences from basic and applied research, however. Generally action research becomes part of our normal classroom activities; that is, action research, done well, is often an integral part of practice. As long as we are careful to honor those ethical standards that have been developed over time, we will probably be okay. The Belmont Report (http://www.hhs.gov/ohrp/humansubjects/guidance/belmont.htm) is an important document in the history of research ethics. While some adaptation must be made from the medical language used in the document, it speaks in particular to the issue of research and practice in “Part A: Boundaries Between Practice & Research”:

…The distinction between research and practice is blurred partly because both often occur together (as in research designed to evaluate a therapy) and partly because notable departures from standard practice are often called “experimental” when the terms “experimental” and “research” are not carefully defined.

For the most part, the term “practice” refers to interventions that are designed solely to enhance the well-being of an individual patient or client and that have a reasonable expectation of success. The purpose of medical or behavioral practice is to provide diagnosis, preventive treatment or therapy to particular individuals. By contrast, the term ‘research’ designates an activity designed to test an hypothesis, permit conclusions to be drawn, and thereby to develop or contribute to generalizable knowledge (expressed, for example, in theories, principles, and statements of relationships). Research is usually described in a formal protocol that sets forth an objective and a set of procedures designed to reach that objective.

When a clinician departs in a significant way from standard or accepted practice, the innovation does not, in and of itself, constitute research. The fact that a procedure is ‘experimental,’ in the sense of new, untested or different, does not automatically place it in the category of research. Radically new procedures of this description should, however, be made the object of formal research at an early stage in order to determine whether they are safe and effective. Thus, it is the responsibility of medical practice committees, for example, to insist that a major innovation be incorporated into a formal research project.

Research and practice may be carried on together when research is designed to evaluate the safety and efficacy of a therapy. This need not cause any confusion regarding whether or not the activity requires review; the general rule is that if there is any element of research in an activity, that activity should undergo review for the protection of human subjects.

These guidelines are helpful, but not completely definitive. While there are general recommendations about ethical dilemmas researchers face, there are often very specific rules at a local level (for example, university or school district) that must be heeded. Action research is a complicated proposition ethically, as evidenced by the Belmont Report. Before undertaking an action research project, you should check with an administrator at your school or in your district to learn what the rules are. If there are no rules, for your own protection—especially since you are very likely working with children—you might recommend that the district develop a set of guidelines, and even a research review committee. At the very least, if your project has any complex ethical issues, or receives any funding from a state or federal program, you may want to seek permission from your Principal or Superintendent. For example, no faculty member at a university can perform research without approval from the Institutional Review Board. Several other useful resources are available online at the following web sites:

* American Educational Research Association http://www.aera.net/AboutAERA/Default.aspx?menu_id=90&id=717
* American Psychological Association http://www.apa.org/science/research.html
* United States Department of Health and Human Services http://www.hhs.gov/ohrp/irb/irb_guidebook.htm

Protection from Harm

We want to make sure that whatever we do as treatment interventions does not harm participants in any way. Most obviously, we don’t want to make them do something dangerous that could result in physical harm. However, we must also consider the emotional and psychological ramifications of our study. For example, we don’t want to expose young children to inappropriate images or situations, and we don’t want to scare participants. We also don’t want to withhold “education” from a control group. We must be thoughtful about our decisions. We don’t want to make children disclose information they shouldn’t. We need to make sure that students are not deprived of important parts of the standard curriculum.

In education, protection from harm includes the idea that we don’t want to withhold a treatment that we believe to be useful from participants. It is often the case that we don’t know whether a new instructional method will be better than what we’ve been doing. Unfortunately, however, the perception from others (for example, students and parents) is that something new is something better. Indeed, we may not be sure that a new intervention is better, but we must have our suspicions—or we probably wouldn’t be doing the research.

There are ways to protect against denying certain participants access to a potentially helpful intervention. For example, using a small-sample time-series design where both groups get both the experimental treatment and the control ensures that all students will be exposed to the new treatment. In fact, this can even be done outside of the study. For example, perhaps we use a quasi-experimental design to compare an experimental and control group for our study. Once we’ve obtained the posttest data, we may consider the “study” to be complete. However, for the sake of equivalent educational opportunities, we may go ahead and reverse the interventions so that everyone receives the new treatment.

Confidentiality

We may learn things we otherwise wouldn’t have through our research. Obviously, all FERPA rules will still apply to action research. However, there are other ethical issues that may arise. For example, will you be allowed to audiotape or videotape your classroom? What should you do with information you learn as part of your research that you otherwise might not have learned? For example, if you find out through an interview that a student is having some problems at home, should you report these findings to the school counselor (especially if the student has asked you not to)? If you observe inappropriate behavior on videotape that you did not witness during the class, should you confront, punish, or report this behavior? Discussions at your school prior to conducting the research can help you manage some of the dilemmas you will face.

DATA ANALYSIS

Sagor (1992, p. 48) describes data analysis in action research as the process by which we answer two key questions:

* What are the important themes in this data?
* How much data supports each of these themes?

In most research, we collect so much data that it becomes practically impossible to make sense of the information by examining it directly. Data analysis is essentially the process by which we summarize our data. We use these summaries to describe the data and to look for relationships in the data. We sometimes even use these summaries to infer some broader theoretical or practical knowledge. We are generally trying to find themes and relationships that exist across all the data, not just information we have about individual cases. Note, however, that these individual cases can also be exceptionally valuable to our research results.

As part of the data analysis phase of our action research, we also need to determine how much data supports the themes we identify. We need to make sense of the relative weight and credibility of the evidence related to each theme. We can create tables and graphs to help us organize the data. We’ll start with processes that can be used to analyze our qualitative data and then discuss processes for analyzing quantitative data. Finally, we’ll discuss how to use both as we interpret our data and the results of our analyses. A superficial introduction to Qualitative research methods and analysis has been provided in the Appendices.

Quantitative Analysis

When we have numerical data, it is most appropriate to make sense of it quantitatively. We will not use the same statistical methods that are used by quantitative researchers in basic or applied research. Indeed, in the end, we will combine our quantitative results in an essentially qualitative way to interpret our analyses and make sense of our action research questions. However, there will be statistics that are useful for us to consider using in our action research data analyses.

The presentation that follows is not intended to serve as a statistical reference. In fact, we will only be discussing the useful statistical and quantitative analyses that you might consider. If you are not familiar with these statistical methods, you will need to consult other resources in order to learn to use them. Some good online resources exist that might serve as tutorials or refreshers:

* David Lane’s HyperStat Online (http://davidmlane.com/hyperstat/index.html)
* Rice University’s Virtual Lab in Statistics (http://onlinestatbook.com/rvls/)
* David Stockburger’s Introductory Statistics (http://www.psychstat.missouristate.edu/introbook/sbk00.htm)
* Gerard Dallal’s The Little Handbook of Statistical Practice (http://www.tufts.edu/~gdallal/LHSP.HTM)
* Burt Gerstman’s StatPrimer (http://www.sjsu.edu/faculty/gerstman/StatPrimer/)
* Richard Lowry’s Concepts and Applications of Inferential Statistics (http://faculty.vassar.edu/lowry/webtext.html)
* StatSoft’s Electronic Textbook (http://www.statsoft.com/textbook/stathome.html)
* UCLA Statistics Textbook Wiki (http://www.stat.ucla.edu/textbook/)
* Claremont Graduate University’s Web Interface for Statistics Education (http://wise.cgu.edu/)

In action research, we are not usually attempting to use a sample to infer about a population. Rather, we usually have access to the population of interest, or at least a significant proportion of it. Inferential statistics are used to make estimates of the population values (called parameters) on the variables or the relationships among the variables we are studying. Without the need to infer about a population, therefore, we generally have no need for inferential statistics.

However, we still may need to do the things that statistics allow us to do, such as compare groups. One option is to go ahead and perform the statistical analyses, using the perspective that our participants constitute a sample “from time.” That is, we may be interested in knowing how our results might generalize to a broader population of students, including those students whom we will have in class in the future. This is a common approach because inferential statistics is a common and comfortable tool for many who have experience doing research. There is another approach, though, that may be more appropriate for many action research contexts: we treat the data we have as information about the population. As long as we have data for most or all of the members of our population, then we can feel confident with this approach. Some of the most common and most useful techniques are described below.

Interestingly, when we are analyzing the variables using our statistical methods, we do not worry much about the research design that was used to collect the data. That is, many research designs require the same statistical analyses. We need some basic information about the design, such as (a) how many variables there are; (b) which variables are considered independent, which are dependent, and which are mediating or moderating; (c) whether the data are for separate groups; or (d) whether it is pretest-posttest data. We need to know the characteristics of the variables, such as the measurement scale. But otherwise, the threats to validity that we considered previously don’t become an issue until we are ready to interpret and report our results.

Statistical Conclusion Validity

When we are doing quantitative (i.e., statistical) research, we must pay attention to a particular form of validity called statistical conclusion validity. We might call this just “conclusion validity” to allow the idea a broader reach – perhaps even to qualitative research. That is, can we reach the conclusions we have made based on the evidence we have collected and the analyses we have performed?

In quantitative research that uses statistical analyses, we must pay attention to particular mathematical requirements in order to use the statistics we calculate without worry. These mathematical requirements are typically called assumptions. As mathematical statisticians develop statistics and statistical methods, they often must make certain mathematical assumptions for the statistics to work properly. There are generally four assumptions that we must pay attention to, but the truth is that there are important, more commonly unspoken assumptions that we must also pay attention to. The four key assumptions are usually called: (a) random and independent sampling, (b) linearity, (c) normality, and (d) homoscedasticity or equality of variances. Most of these assumptions will be discussed at appropriate places in the textbook, but here is a quick overview.

Random and Independent Sampling

We must have chosen our sample randomly (for our best chance at representativeness – see external validity above). For many analyses, however, we must also assure that the data we receive for each independently sampled case are not dependent on any other case. The easiest way to think about this is to consider students cheating on an exam. Students who copy answers from others do not have independent data – their data (i.e., their scores) are dependent to some extent on those from whom they copied.

Type of Relationship

In most of the analyses we will discuss, we assume that the variables have a linear relationship (hence, linearity). However, if you are running an analysis that expects a curvilinear relationship (e.g., think quadratic, like a parabola or part of a circle, or cubic or logistic, like a curve shaped like an “S”), then your data should show that type of relationship.

Shape of the data distribution

In most of the analyses we will discuss, we assume normality of the data. That is, we assume that the variables we are measuring, or the errors we make, are shaped according to a normal distribution in the population. Note, again, it’s not the sample we care about so much, it’s the shape of the population data distribution that matters – but the sample matters greatly because we presume that it represents the population. Therefore the quality of the sample is of utmost importance (see external validity above).

Homoscedasticity

For analyses where there are multiple groups, we often need to assume that their population data distributions have equal variances (i.e., that the data are spread pretty much equally in both populations). We call this by a variety of names: homoscedasticity, equality of variances, or homogeneity of variances.

Measurement without error

One of the more important assumptions for our statistical conclusions goes back to construct validity and, in particular, reliability. That is, we assume that our numbers have meaning and that our variables are measured without error. Although not always stated clearly in statistics textbooks, we must be able to assume that the data we collect for cases (e.g., scores for variables) are measured very well. Ideally, there is no error at all in how we have measured a variable in our research. However, measurement without error is quite unlikely, even for simple yes-no or categorical items—because people will often accidentally or purposefully choose the wrong category. For example, if asked how much they weigh from the options (a) under 100, (b) 100-149, (c) 150-199, and (d) over 199, someone may have an inaccurate scale and choose (b) even though their actual weight is 151. Or they may decide that they would simply rather put themselves in the (b) range for ego reasons. Or they may carelessly put themselves in (b) by checking or clicking the wrong box without thinking carefully enough. Therefore, with numeric variables we use reliability statistics (e.g., Cronbach’s alpha) to give a sense of how strong our reliability is. Sometimes we might choose statistical methods that actually build this unreliability of measurement into the analysis (but those are much more advanced methods than we will be discussing).

No outliers or extreme values

Outliers and extreme values are often considered synonymous terms that refer to cases with scores that are very different from the other data you have collected for a particular variable or combination of variables. The reality here is that it’s not the outliers and extreme values themselves we are concerned with. Our concern is including cases that don’t really belong to our population. The problem is that when a case has an unusual score, we cannot usually know whether the case doesn’t belong to our population or whether it does belong but is simply extreme. Either way, these cases will have an impact on our statistical calculations, so we cannot ignore them. Unless you can be certain a case does not belong to your stated population, the best recommendation is to try the analyses with and without these outliers. With luck, the outlier(s) will not impact the conclusions reached or impact the statistics in any substantial way. If they do, the researcher will need to determine which analysis (with or without the outliers) is most appropriate and concentrate on the conclusions from those results. However, there should be recognition of the results that would have happened otherwise.

Type VI error

Some scholars give a name to the situation where the statistical test you run does not answer the research question you have asked. They call this a Type VI error. This will also impact your conclusion validity.

Descriptive Analysis

Just as is true from a qualitative perspective, we are trying to tell a story about our participants and our variables in quantitative analysis. The difference is that our story here is based more on numbers than it is on words. It is important for us to understand the quantitative information that describes the participants. The descriptive analyses are important for several reasons. First, we really want to understand, quantitatively, both the participants and the variables we were studying. It is important for us to have this context as we analyze and then interpret our data.

A variety of statistics will help us understand the data we’ve collected. For example, we will probably want to determine frequency counts for the categorical variables (that is, how many people were in each category). We will probably want to find the mean and the median for our quantitative variables (for example, test scores, rating scale scores, age, time, attitude). In addition to knowing the average values for our scores, we probably will also want to know how spread out they are; that is, did most people have similar scores or were the scores very different? We can use the range or the standard deviation of the scores to provide information about this variation among the scores.

If we are looking for relationships, there are a few appropriate statistics we may want to calculate. For example, the relationships among our quantitative variables are probably best calculated using a correlation statistic such as the Pearson correlation. Sometimes, we only have ranks for our data. For example, we may only know who scored highest and second highest and so forth, and not know what their actual scores were. In these cases we can use a different descriptive statistic, called the Spearman rank-correlation. If we have categorical data instead of scale data, we might use cross-tabulations to analyze the data.

No matter how sophisticated we are statistically, it is almost always useful to look at our data through graphs. For example, bar charts and histograms are useful to help describe our data. For example, do we have scores that seem to follow a normal distribution, or are they skewed in some fashion? We can use a scatterplot to visually examine the relationship between our variables.
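For readers working in R rather than jamovi, here is a brief, self-contained sketch of these descriptive statistics, relationship measures, and graphs; the variables (test scores, hours of study, gender, computer ownership) and their values are entirely hypothetical.

# Hypothetical data for 8 students
test_score    <- c(55, 62, 47, 70, 58, 65, 50, 61)
hours_study   <- c(4, 6, 2, 8, 5, 7, 3, 6)
gender        <- c("F", "M", "F", "M", "F", "M", "M", "F")
owns_computer <- c("yes", "yes", "no", "yes", "no", "yes", "no", "yes")

mean(test_score); median(test_score)               # center of the scores
range(test_score); sd(test_score)                  # spread of the scores
table(gender)                                      # frequency counts for a categorical variable

cor(test_score, hours_study)                       # Pearson correlation
cor(test_score, hours_study, method = "spearman")  # Spearman rank correlation
table(gender, owns_computer)                       # cross-tabulation of two categorical variables

hist(test_score, main = "Distribution of test scores")   # roughly normal or skewed?
plot(hours_study, test_score,                            # scatterplot of the relationship
     xlab = "Hours of study", ylab = "Test score")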

Finally, remember, at this point we are simply describing the results. We will not try to make sense of the results just yet, nor of the group differences (or similarities). These activities will wait until we’re ready to report our results.

Computer Software

We can do many of the analyses we will need using a spreadsheet program, such as Microsoft Excel. However, for more sophisticated analyses, we typically require a dedicated statistical program. There are many programs that calculate statistics; this textbook will use jamovi, a free and powerful open-source program built on the R statistical language. Microsoft Excel does provide an “Analysis ToolPak” add-in, but this can be difficult to use and still does not perform most of the analyses researchers use.

Most statistical programs use a spreadsheet-like interface for entering data—or allow you to import Excel data (or data in a comma-separated values (“CSV”) file). Therefore, researchers often use Excel for data entry even if they don’t perform the analyses in Excel.

When entering data, it is generally most convenient to enter them as shown in Chapter 2 Table 2.1. That is, enter a case number or a name in the leftmost column, then enter the variable data in the next column(s). It is helpful to use the first row to label the columns, either with descriptive names (like “Case Number”) or with variable or group names (something as simple as “Group 1” and “Group 2” will work, but it is often more convenient to use the name of their treatment, such as “TinkerPlots” and “Control”).
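If the data were entered in Excel and saved as a CSV file, reading them into R (or importing them through jamovi’s Open menu) is straightforward. The file name below is hypothetical.

# Read a hypothetical CSV file laid out like Table 2.1
data <- read.csv("tinkerplots_scores.csv")

head(data)   # check the first few rows and the column labels
str(data)    # confirm that each variable was read in with the expected type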

Example Data from One Variable in Two Groups

Let’s look at some sample data and some example descriptive statistics and graphs. The data are presented in Chapter 2 Table 2.1. For now, let’s consider the two columns of data to be the same variable, just for two different groups. The column labeled “Traditional” represents scores from the comparison group, which received what we’ll call “Traditional” instruction, and the second column contains the scores from the TinkerPlots group. For example, we may consider these values to be scores from a math skills test after the TinkerPlots intervention for teaching the Data Analysis unit in our 8th grade class. We’ll say that this was a test score that represents the number of items correct out of a 70-item test.

Figure 2.1, below, is the output obtained from running the Descriptives analysis in jamovi (the syntax at the top of the figure is the equivalent R code generated by jamovi’s jmv package). Note first that this output is unedited, except for size. All of the statistics requested in the analysis have been output here, not just the basic summary statistics. We can edit the table if we’d like, but it isn’t necessary—we can just use the first couple of decimal places in our reports.

Table 2.1 Data for Figure 2.1

Case_Number Traditional Tinkerplots
1 40 56
2 43 48
3 42 44
4 36 42
5 49 40
6 53 60
7 41 51
8 63 53
9 41 51
10 19 43
11 47 59
12 54 66
13 48 55
14 56 60
15 42 46
16 53 46
17 37 44
18 48 49
19 53 46
20 60 60
21 52 60
22 38 51
23 45 53
24 41 61
25 41 52
26 40 52

Figure 2.1 Descriptive Statistics using jamovi

jmv::descriptives(
    data = data,
    vars = vars(Traditional, Tinkerplots),
    mode = TRUE,
    sum = TRUE,
    variance = TRUE,
    range = TRUE,
    se = TRUE,
    ci = TRUE,
    iqr = TRUE,
    skew = TRUE,
    kurt = TRUE,
    pcEqGr = TRUE,
    pc = TRUE)

 DESCRIPTIVES

 Descriptives                                              
 ───────────────────────────────────────────────────────── 
                              Traditional    Tinkerplots   
 ───────────────────────────────────────────────────────── 
   N                                   26             26   
   Missing                              0              0   
   Mean                            45.462         51.846   
   Std. error mean                 1.7674         1.3573   
   95% CI mean lower bound         41.821         49.051   
   95% CI mean upper bound         49.102         54.641   
   Median                          44.000         51.500   
   Mode                            41.000         60.000   
   Sum                             1182.0         1348.0   
   Standard deviation              9.0121         6.9206   
   Variance                        81.218         47.895   
   IQR                             11.750         12.250   
   Range                           44.000         26.000   
   Minimum                         19.000         40.000   
   Maximum                         63.000         66.000   
   Skewness                      -0.54880        0.16822   
   Std. error skewness            0.45556        0.45556   
   Kurtosis                        1.8389       -0.85900   
   Std. error kurtosis            0.88651        0.88651   
   25th percentile                 41.000         46.000   
   50th percentile                 44.000         51.500   
   75th percentile                 52.750         58.250   
 ───────────────────────────────────────────────────────── 
   Note. The CI of the mean assumes sample means
   follow a t-distribution with N - 1 degrees of
   freedom

As we review the information provided, some of it is more useful to us than the rest. For example, the means give us a sense of where the data are located on the scale. Most students in the Traditional group scored around 45 items correct, whereas most students in the TinkerPlots group scored around 52 items correct. The median for each group tells us that half of the scores in the Traditional group were below 44 items correct, while in the TinkerPlots group half the scores are below 51.5. The mode, which tells us the most common score in each group, doesn’t really help us compare the groups (but may be useful in some circumstances, especially with categorical variables).

Using the standard deviations, we also notice that the scores in the Traditional group are just a little more spread out than the scores in the TinkerPlots group (that is, about 9 and 7, respectively). The range also gives us a sense of the variation in the scores. The range for the Traditional group is 44, which shows that the students who achieved the minimum (19) and maximum (63) scores differed by 44 points. The range for the TinkerPlots group is smaller (26), representing the difference between the minimum of 40 and the maximum of 66.

The minimum and maximum values, themselves, can provide useful information about our most extreme cases. For example, if these scores are far away from most other scores, we can verify that we entered those scores correctly. If we did enter the data correctly, we might want to think about why these scores are so low, and whether they should be included in our analysis. For example, perhaps the student who scored 19 in the Traditional group missed a week of school, and therefore most of the Data Analysis unit. We might believe that this score does not truly represent that student’s ability and therefore choose to exclude that student from the analysis of the Traditional group. On the other extreme, we might discover that one student who achieved a 60 in the TinkerPlots group had cheated on the test. This, too, could be justification for the case to be removed.

As you might imagine, this is not a decision to be taken lightly. We need strong justification before we can legitimately remove a case from an analysis. If this student missed no school, yet we believe that the student should have scored higher (based on our belief of the student’s previous achievement), we really do NOT have strong justification to remove the case. Action research provides us some opportunities not usually found in other types of research, however. When we find such low or high scores, we have an opportunity to follow-up with our participants, or otherwise try to collect additional data, to try to determine why the scores were so extreme. Then, if we believe we have strong justification, we may choose to remove them from the analysis.

Beyond whether to remove them, however, we will probably want to try to learn other things about these extreme cases. And it’s not just the most extreme highest and lowest that we may be concerned with, it’s the highest and lowest several scores that we may want to investigate further. For example, we may want to interview those students in the TinkerPlots group with the highest and lowest scores to try to understand why they scored the way they did. This additional information may help us figure out how best to implement the TinkerPlots lessons more broadly or completely, if we eventually decide to do so. We can also determine things about our comparison group through such an analysis, too. We will look at some graphical methods to identify several extreme scores later.

Finally, we should recognize that data entry errors are probably the most common reason for bad data; therefore, data should be checked carefully for accuracy. When we have few enough cases and variables, we should probably consider reviewing every case for accuracy (perhaps enlisting assistance—a fresh pair of eyes—to help verify the accuracy of the data). Sometimes these descriptive analyses can also help us identify potentially erroneous data.

The other information provided by descriptive statistics may be useful in some circumstances, but will not always be necessary. The standard error can be used to calculate a 95% confidence interval, which may be particularly useful if we hope to generalize our results to a broader population—these intervals tell us the margin of error in our estimate of the population mean. The count (N) may be most useful to ensure that we have the right number of cases being analyzed in each group. Finally, skewness and kurtosis are potentially useful in trying to describe the shape of our data (for example, rectangular or skewed or relatively normal), but our graphs will provide better information.

Example Pretest-Posttest Data in One Group

If we have a design with a pretest and a posttest in just our TinkerPlots group, we can calculate additional descriptive statistics. Note that we still want to calculate the individual descriptive statistics for both our pretest scores and posttest scores as we did above. However, we will probably also be interested in an additional statistic: the difference between the two for each case. Chapter 2 Table 2.2 shows our original sample data with an additional column. Note that we probably want to remove any cases that we feel justified in removing before we calculate difference scores, otherwise we will create differences for scores that we considered inappropriate—at the very least we need to remember to exclude these removed cases from all future analyses.
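A quick sketch of the difference-score calculation in R, assuming a data frame with hypothetical Pretest and Posttest columns (in the example that follows, the Traditional and Tinkerplots columns of Table 2.2 play those roles):

# Difference (change/gain) score for each case; positive values mean improvement
data$Difference <- data$Posttest - data$Pretest
summary(data$Difference)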

We can run the same descriptive statistics on our difference score variable (sometimes called a change score or even a gain score, even though sometimes the “gain” is negative). The results are shown in Figure 2.2, again produced by jamovi. We can see that the average change from the pretest to the posttest was about six items correct, and we know this means that scores improved by six items because we calculated the difference score as (posttest – pretest). We can review the other information in a manner similar to what we did before.

Table 2.2 Data for Figure 2.2

Case_Number Traditional Tinkerplots Difference
1 40 56 16
2 43 48 5
3 42 44 2
4 36 42 6
5 49 40 -9
6 53 60 7
7 41 51 10
8 63 53 -10
9 41 51 10
10 19 43 24
11 47 59 12
12 54 66 12
13 48 55 7
14 56 60 4
15 42 46 4
16 53 46 -7
17 37 44 7
18 48 49 1
19 53 46 -7
20 60 60 0
21 52 60 8
22 38 51 13
23 45 53 8
24 41 61 20
25 41 52 11
26 40 52 12

Figure 2.2 Descriptive Statistics for Difference Scores in One Group

jmv::descriptives(
    data = data,
    vars = Difference,
    mode = TRUE,
    sum = TRUE,
    variance = TRUE,
    range = TRUE,
    se = TRUE,
    ci = TRUE,
    iqr = TRUE,
    skew = TRUE,
    kurt = TRUE,
    pcEqGr = TRUE,
    pc = TRUE)

 DESCRIPTIVES

 Descriptives                              
 ───────────────────────────────────────── 
                              Difference   
 ───────────────────────────────────────── 
   N                                  26   
   Missing                             0   
   Mean                           6.3846   
   Std. error mean                1.6390   
   95% CI mean lower bound        3.0090   
   95% CI mean upper bound        9.7602   
   Median                         7.0000   
   Mode                           7.0000   
   Sum                            166.00   
   Standard deviation             8.3574   
   Variance                       69.846   
   IQR                            9.2500   
   Range                          34.000   
   Minimum                       -10.000   
   Maximum                        24.000   
   Skewness                     -0.25717   
   Std. error skewness           0.45556   
   Kurtosis                      0.14986   
   Std. error kurtosis           0.88651   
   25th percentile                2.5000   
   50th percentile                7.0000   
   75th percentile                11.750   
 ───────────────────────────────────────── 
   Note. The CI of the mean assumes
   sample means follow a
   t-distribution with N - 1 degrees
   of freedom

In particular, we may notice that the minimum difference score was –10, meaning that at least one student’s score actually decreased by 10 points from the pretest to the posttest. We also notice that one student increased by 24 points. In looking more closely at the data, we find that this student who gained by 24 points was the same student who scored only 19 on the pretest. Again, we have information that might make us question whether we should include this student in the analyses. If we just cannot determine a strong justification for removing the case from the analysis, another approach is to run the exact same descriptive analyses both with and without this case. If the output does not differ much between the two analyses, then we don’t really need to worry about the case being included. However, if our results change dramatically without the case, we may need to consider the case further. These extreme cases have the capability both to make an intervention appear to have worked better, and to make an intervention appear not to work well.

Further, a case-by-case analysis of Chapter 2 Table 2.2 shows that four students actually decreased from their pretest to posttest (Cases 5, 8, 16, and 19). Just as we did with our basic descriptive statistics above, we may want to look further at these most extreme scores. We may want to collect more data about them or from them, either quantitative or qualitative, in an effort to determine why they differed so much from the other students in regard to the intervention. And again, we will probably want to do this both for the four negative changes and for the largest positive changes, or increases. In addition to Case 10, whose scores we have already questioned, Case 24 increased by 20 points, and Case 1 also improved by more than most other students (16 points). We might even go a step further and group the students into categories (negative or no change, small positive change, and large positive change). We may be able to collect additional data (or examine other data we have already collected) that will help us identify qualitative differences between these groups, which, in turn, will help us make sense of the effects of the intervention.
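
If we want to group the students into change categories as just described, the cut() function in base R is one way to do it; the break points shown here (0 and 10) are purely illustrative and would need to be chosen to fit your own data.

# group students by how much they changed; the break points are illustrative
data$Change_Group <- cut(data$Difference,
                         breaks = c(-Inf, 0, 10, Inf),
                         labels = c("Negative or no change",
                                    "Small positive change",
                                    "Large positive change"))

# how many students fall into each category
table(data$Change_Group)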

Example Pretest-Posttest Data in Two Groups

If we have a design with a pretest and a posttest in two or more groups, we can still do very similar descriptive analyses. In such a case, however, we start with four descriptive analyses for the original two tests in each group and then add two more, one difference score analysis for each group. We still want to look at the original variables for the reasons we saw above (e.g., Case 10’s low score of 19 on the pretest). We will also want to examine the difference scores for each group descriptively, so that they can be used for comparisons during our interpretations.

Chapter 2 Table 2.3 shows the data we will finally use for our comparative analyses, assuming we haven’t chosen to remove any cases. The results from the Split By option in jamovi are presented in Figure 2.3. We can see that the improvement in the TinkerPlots group was, on average, slightly larger than the improvement in the Traditional group. This brings up another issue that must be discussed: effect size. When we interpret our results and reach conclusions later, we will want to consider how large the apparent impact of the intervention was, so we need to make sure we calculate the necessary information while we’re running our analyses. Above, we noticed roughly a six-point difference between the two groups (51.8 vs. 45.5 in Figure 2.1) and a six-point average increase from pretest to posttest (6.4 in Figure 2.2). Here, we notice only about a two-point difference between how much each group improved on average (6.4 vs. 4.3). We call these differences “effect size.”

Another thing we can do while examining these results is to try to determine whether we believe the groups were roughly equivalent prior to intervention. For example, we could examine the pretest scores for relative equivalence. We could also compare other quantitative data we have, even though it may not necessarily be immediately relevant to this analysis, to see how equivalent the groups were on those other variables.
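
A base-R version of the split-by-group summaries in Figure 2.3 can be produced with aggregate() or tapply(). This minimal sketch assumes a data frame laid out like Chapter 2 Table 2.3 (a Group column and a Difference column); the same idea works for any other variable we want to compare across groups, such as a pretest.

# mean, standard deviation, and group size for the difference scores, by group
aggregate(Difference ~ Group, data = data, FUN = mean)
aggregate(Difference ~ Group, data = data, FUN = sd)
table(data$Group)

# a fuller set of summaries for each group at once
tapply(data$Difference, data$Group, summary)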

Table 2.3 Data for Figure 2.3

Case_Number Group Test Difference
1 Traditional 40 7
2 Traditional 43 1
3 Traditional 42 -2
4 Traditional 36 -3
5 Traditional 49 -5
6 Traditional 53 11
7 Traditional 41 3
8 Traditional 63 5
9 Traditional 41 4
10 Traditional 19 -2
11 Traditional 47 10
12 Traditional 54 15
13 Traditional 48 7
14 Traditional 56 11
15 Traditional 42 -1
16 Traditional 53 0
17 Traditional 37 -2
18 Traditional 48 2
19 Traditional 53 0
20 Traditional 60 11
21 Traditional 52 11
22 Traditional 38 4
23 Traditional 45 5
24 Traditional 41 12
25 Traditional 41 4
26 Traditional 40 4
27 Tinkerplots 56 16
28 Tinkerplots 48 5
29 Tinkerplots 44 2
30 Tinkerplots 42 6
31 Tinkerplots 40 -9
32 Tinkerplots 60 7
33 Tinkerplots 51 10
34 Tinkerplots 53 -10
35 Tinkerplots 51 10
36 Tinkerplots 43 24
37 Tinkerplots 59 12
38 Tinkerplots 66 12
39 Tinkerplots 55 7
40 Tinkerplots 60 4
41 Tinkerplots 46 4
42 Tinkerplots 46 -7
43 Tinkerplots 44 7
44 Tinkerplots 49 1
45 Tinkerplots 46 -7
46 Tinkerplots 60 0
47 Tinkerplots 60 8
48 Tinkerplots 51 13
49 Tinkerplots 53 8
50 Tinkerplots 61 20
51 Tinkerplots 52 11
52 Tinkerplots 52 12

Figure 2.3 Descriptive Statistics for Difference Scores in Two Groups

jmv::descriptives(
    formula = Difference ~ Group,
    data = data,
    missing = FALSE,
    se = TRUE,
    iqr = TRUE)

 DESCRIPTIVES

 Descriptives                                        
 ─────────────────────────────────────────────────── 
                         Group          Difference   
 ─────────────────────────────────────────────────── 
   N                     Traditional            26   
                         Tinkerplots            26   
   Mean                  Traditional        4.3077   
                         Tinkerplots        6.3846   
   Std. error mean       Traditional        1.0695   
                         Tinkerplots        1.6390   
   Median                Traditional        4.0000   
                         Tinkerplots        7.0000   
   Standard deviation    Traditional        5.4536   
                         Tinkerplots        8.3574   
   IQR                   Traditional        9.2500   
                         Tinkerplots        9.2500   
   Minimum               Traditional       -5.0000   
                         Tinkerplots       -10.000   
   Maximum               Traditional        15.000   
                         Tinkerplots        24.000   
 ─────────────────────────────────────────────────── 
jmv::descriptives(
    formula = Difference ~ Group,
    data = data,
    missing = FALSE,
    se = TRUE,
    iqr = TRUE,
    desc = "rows")

 DESCRIPTIVES

 Descriptives                                                                                                 
 ──────────────────────────────────────────────────────────────────────────────────────────────────────────── 
                 Group          N     Mean      SE        Median    SD        IQR       Minimum     Maximum   
 ──────────────────────────────────────────────────────────────────────────────────────────────────────────── 
   Difference    Traditional    26    4.3077    1.0695    4.0000    5.4536    9.2500     -5.0000     15.000   
                 Tinkerplots    26    6.3846    1.6390    7.0000    8.3574    9.2500    -10.0000     24.000   
 ──────────────────────────────────────────────────────────────────────────────────────────────────────────── 

Example Data from Two Variables in One Group

When we are interested in the relationship between two variables, we will perform descriptive correlation analyses. Correlation is a statistic that provides quantitative information about how strongly two variables are related. Correlations range from –1 to +1, where 0 represents no relationship, –1 represents a perfect negative (or inverse) relationship, and +1 represents a perfect positive (or direct) relationship. For example, there is probably a strong positive relationship between how many hours a student studies for an exam and the score they receive on that exam. This would mean that as the number of hours studying increases, scores on the exam would also increase. If the number of hours decreases, we would expect a related decrease in the test score (remember that this relationship does not imply cause). Conversely, if there is a strong negative relationship between how many hours a student spends playing video games and cumulative GPA, then we would expect students who play many hours of video games to have lower GPAs and students with higher GPAs to spend fewer hours playing video games. In some cases, there is no relationship between two variables, for example (perhaps) math skills and attitude toward flowers.

For this example, we’ll use the data in Chapter 2 Table 2.4 (you may notice that it is the same data as Chapter 2 Table 2.1—but this really doesn’t matter). We are now treating the two columns of data as two separate variables: monthly computer use (in hours) and test scores following the TinkerPlots intervention.
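
Before turning to the jamovi output in Figure 2.4, here is a minimal base-R sketch of how the same correlations could be computed, assuming the Table 2.4 columns have been read into a data frame called data.

# Pearson correlation between monthly computer use and test scores
cor(data$Computer_Use, data$Test_Scores)

# the same correlation with a p-value and a 95% confidence interval
cor.test(data$Computer_Use, data$Test_Scores)

# Spearman's rho, which uses ranks rather than the raw scores
cor.test(data$Computer_Use, data$Test_Scores, method = "spearman")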

Figure 2.4 shows the output provided by the psych and jmv packages in R, which reports a Pearson correlation of 0.475 between the two variables. There are a variety of recommendations that scholars use to describe correlations. For example, any correlation with an absolute value between 0.0 and 0.3 is often considered relatively weak, correlations with absolute values between 0.3 and 0.7 are considered moderate, and correlations with absolute values between 0.7 and 1.0 are often considered strong. Such scales are somewhat arbitrary but do provide researchers a sense of the strength of the relationships they obtain.

Table 2.4 Data for Figure 2.4

Case_Number Computer_Use Test_Scores
1 40 56
2 43 48
3 42 44
4 36 42
5 49 40
6 53 60
7 41 51
8 63 53
9 41 51
10 19 43
11 47 59
12 54 66
13 48 55
14 56 60
15 42 46
16 53 46
17 37 44
18 48 49
19 53 46
20 60 60
21 52 60
22 38 51
23 45 53
24 41 61
25 41 52
26 40 52

Figure 2.4 Correlation for Two Variables in One Group

jmv::corrMatrix(
    data = data,
    vars = vars(Test_Scores, Computer_Use),
    spearman = TRUE,
    flag = TRUE,
    n = TRUE,
    ci = TRUE,
    plotDens = TRUE,
    plotStats = TRUE)

 CORRELATION MATRIX

 Correlation Matrix                                                
 ───────────────────────────────────────────────────────────────── 
                                     Test_Scores    Computer_Use   
 ───────────────────────────────────────────────────────────────── 
   Test_Scores     Pearson's r                 —                   
                   df                          —                   
                   p-value                     —                   
                   95% CI Upper                —                   
                   95% CI Lower                —                   
                   Spearman's rho              —                   
                   df                          —                   
                   p-value                     —                   
                   N                           —                   
                                                                   
   Computer_Use    Pearson's r           0.47513               —   
                   df                         24               —   
                   p-value               0.01417               —   
                   95% CI Upper          0.72842               —   
                   95% CI Lower          0.10758               —   
                   Spearman's rho        0.43892               —   
                   df                         24               —   
                   p-value               0.02488               —   
                   N                          26               —   
 ───────────────────────────────────────────────────────────────── 
   Note. * p < .05, ** p < .01, *** p < .001

If the relationship were perfect, all cases would follow the same pattern. For example, with a perfect positive relationship (+1), every case in the group with more hours of computer use would have a higher test score, and every case with less computer use would have a lower test score. Here, the moderate correlation between monthly hours of computer use and the achievement test scores (0.475) indicates that, for the most part, students who used computers more also had higher test scores, although some of them had test scores that were lower relative to the group. Similarly, many students who had lower test scores also used computers less, but some of the students with low test scores used computers more than average.

If we have categorical variables, we would probably use cross-tabulation tables for analyzing and reporting our results. Essentially, a “cross-tab” shows how many cases fall into each combination of categories. So if our two categorical variables are gender and whether students liked TinkerPlots, we actually have four groups: boys who liked TinkerPlots, girls who liked TinkerPlots, boys who disliked TinkerPlots, and girls who disliked TinkerPlots.
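
For readers working in base R rather than jamovi, a cross-tab like the one in Figure 2.5 can be built with table() and tested with chisq.test() or fisher.test(). This is only a sketch: it assumes one row per student and columns named Gender and Tinkerplots (the weighted-counts setup used for Figure 2.5 is handled differently).

# cross-tabulation of two categorical variables (one row per student assumed)
crosstab <- table(data$Gender, data$Tinkerplots)
crosstab

# expected counts under independence, then the chi-square test
chisq.test(crosstab)$expected
chisq.test(crosstab)

# Fisher's exact test, often preferred when expected counts are small
fisher.test(crosstab)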

Figure 2.5 Cross-tabulation table example

jmv::contTables(
    formula = Weight ~ Gender:Tinkerplots,
    data = data,
    chiSqCorr = TRUE,
    fisher = TRUE,
    odds = TRUE,
    exp = TRUE,
    pcRow = TRUE,
    pcCol = TRUE,
    pcTot = TRUE)

 CONTINGENCY TABLES

 The data is weighted by the variable "Weight".

 Contingency Tables                                                                    
 ───────────────────────────────────────────────────────────────────────────────────── 
   Gender                       Disliked Tinkerplots    Likes Tinkerplots    Total     
 ───────────────────────────────────────────────────────────────────────────────────── 
   Male      Observed                         6.0000               20.000     26.000   
             Expected                         11.000               15.000     26.000   
             % within row                     23.077               76.923    100.000   
             % within column                  27.273               66.667     50.000   
             % of total                       11.538               38.462     50.000   
                                                                                       
   Female    Observed                        16.0000               10.000     26.000   
             Expected                         11.000               15.000     26.000   
             % within row                     61.538               38.462    100.000   
             % within column                  72.727               33.333     50.000   
             % of total                       30.769               19.231     50.000   
                                                                                       
   Total     Observed                        22.0000               30.000     52.000   
             Expected                         22.000               30.000     52.000   
             % within row                     42.308               57.692    100.000   
             % within column                 100.000              100.000    100.000   
             % of total                       42.308               57.692    100.000   
 ───────────────────────────────────────────────────────────────────────────────────── 


 χ² Tests                                                
 ─────────────────────────────────────────────────────── 
                               Value     df    p         
 ─────────────────────────────────────────────────────── 
   χ²                          7.8788     1    0.00500   
   χ² continuity correction    6.3818     1    0.01153   
   Fisher's exact test                         0.01075   
   N                               52                    
 ─────────────────────────────────────────────────────── 


 Comparative Measures                             
 ──────────────────────────────────────────────── 
                 Value      Lower       Upper     
 ──────────────────────────────────────────────── 
   Odds ratio    0.18750    0.056087    0.62682   
 ──────────────────────────────────────────────── 

When we review a cross-tab, we want to pay specific attention to the cells on the diagonal running from the bottom-left to the top-right. If the observed counts in those cells are larger than the expected counts (as they are here), there is evidence of a relationship between the variables. For example, here the boys were more likely to “Like” TinkerPlots, while the girls were more likely to “Dislike” it.
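
The odds ratio reported in Figure 2.5 can be recovered by hand from the observed counts in the table; the short R sketch below simply repeats that arithmetic.

# observed counts from the contingency table in Figure 2.5
#            Disliked   Liked
# Male              6      20
# Female           16      10

odds_male   <- 6 / 20     # odds that a boy disliked TinkerPlots
odds_female <- 16 / 10    # odds that a girl disliked TinkerPlots

odds_male / odds_female   # odds ratio, about 0.19, matching the jamovi output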

Graphical Analysis

When analyzing a single variable, there are certain useful graphical techniques we can use to help describe our data. Realize that our single variable could be a test score, or it could be a computed score of some kind. For example, we might add two test scores together to obtain a final grade, or we might calculate a difference score between a pretest and a posttest.

Histograms

The first graph that we often find useful is called a histogram. A histogram shows the number of cases that fall within several ranges of scores. Figure 2.6 shows the histograms for both the Traditional and TinkerPlots groups from our first example above (one variable in two groups). This graph was created by using the “Histogram” option and setting the graph to be paneled by rows using the Math Group variable. Paneling such as this allows us to visualize both groups using the same scales for the X and Y axes. Here you can see that the X-axis goes from 10 to 70 for both graphs (i.e., both Math Groups).

We notice that most of the scores for the Traditional group are around the mean (recall that the mean was about 45.5). Note that the bar just to the right of the label “40” represents the 9 cases that are between 40 and 45 (because the interval widths here are 5.0). In looking at the data in Chapter 2 Table 2.1, we can see that there are several scores in the range between 40 and 45 (and the mode was 41).

As it appears here, there may be a negative skew in the data for the Traditional Group. That is, the mean is being pulled to the left (i.e., the negative end of the number line that is the x-axis) because of the extreme score at the far left. Without that score, the mean would be higher. Consequently, we call this skewed left or negatively skewed. The shape of the graph would probably be somewhat different if we were to remove this case (as we have discussed previously).

Figure 2.6 Paneled Histogram for Traditional and Tinkerplots Groups (see Chapter 2 Table 2.1)

jmv::descriptives(
    formula = Test ~ Group,
    data = data,
    hist = TRUE,
    n = FALSE,
    missing = FALSE,
    mean = FALSE,
    median = FALSE,
    sd = FALSE,
    min = FALSE,
    max = FALSE)

 DESCRIPTIVES
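
For readers working in plain R rather than jamovi, the paneled histograms in Figure 2.6 can be approximated with base graphics; this minimal sketch assumes a data frame with Test and Group columns, as in the jamovi call above.

par(mfrow = c(2, 1))                          # two panels, one above the other
for (g in unique(as.character(data$Group))) {
    hist(data$Test[data$Group == g],
         breaks = seq(10, 70, by = 5),        # identical bins for both panels
         xlim = c(10, 70),
         main = g,
         xlab = "Test score")
}
par(mfrow = c(1, 1))                          # restore the single-panel layout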

Figure 2.7 shows the paneled histograms for the Traditional and TinkerPlots groups after case 10 (with a Test Score of 19) is removed from the analysis. This clearly shows the impact outliers can have on our interpretations of the data. Indeed, we might describe the Traditional group as more positively skewed now.

A word of warning about graphs: graphs are easy to manipulate, intentionally or not. Different programs create charts differently, and some researchers will intentionally change the scales of the variables to enhance certain features of the graph. This is a famous way to “lie” with statistics. For example, the two graphs in Figure 2.6 and Figure 2.7 appear to have the same range, and the highest bars appear to be the same height. However, closer inspection shows that the leftmost value for Figure 2.7 is close to 40, whereas it is closer to 19 in Figure 2.6. Similarly, the highest bar in Figure 2.6 represents 12 cases, while the highest bar in Figure 2.7 represents only eight cases. We need to pay careful attention to this, both when we are creating our graphs and when we are looking at graphs created by others. One way to deal with it is to use the same scales for related graphs in our own results. That is, we would use a vertical axis with a maximum of 14 for both graphs and a horizontal axis with the same minimum and maximum values for both graphs.
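
If we are drawing the graphs ourselves, one safeguard is to fix both axes explicitly. A minimal variation on the base-R sketch above, using the limits suggested in this paragraph, might look like this (and would be repeated for the other group):

# same idea as the sketch above, but also fixing the vertical axis
# so both panels top out at 14 cases
hist(data$Test[data$Group == "Traditional"],
     breaks = seq(10, 70, by = 5),
     xlim = c(10, 70),
     ylim = c(0, 14),
     main = "Traditional", xlab = "Test score")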

Figure 2.7 Paneled Histogram for Traditional and Tinkerplots Groups (with outlier removed)

jmv::descriptives(
    formula = Test ~ Group,
    data = data,
    hist = TRUE,
    n = FALSE,
    missing = FALSE,
    mean = FALSE,
    median = FALSE,
    sd = FALSE,
    min = FALSE,
    max = FALSE)

 DESCRIPTIVES

Scatterplot

The most common graph used to represent correlational relationships is called a scatterplot. The scatterplot plots two variables together using (X, Y) coordinate pairs. We can use the scatterplot to make sense of the strength of the relationship between two variables. That is, the closer to perfect the relationship between the two variables, the closer the coordinate pairs will come to forming a line (one that slopes upward to the right for a positive correlation and downward to the right for a negative correlation). The weaker the relationship, the more the points spread out into a circle or a flat blob with no slope at all.
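
A base-R scatterplot equivalent to the one in Figure 2.8 below can be drawn with plot(), adding a fitted line so the direction of the relationship is easy to see; the sketch assumes the Table 2.4 data frame.

# scatterplot of the two variables with a simple fitted line
plot(data$Computer_Use, data$Test_Scores,
     xlab = "Monthly computer use (hours)",
     ylab = "Test score")
abline(lm(Test_Scores ~ Computer_Use, data = data))   # linear trend line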

For our data in Chapter 2 Table 2.4, we have created the following scatterplot (Figure 2.8). We can see the clear upward-to-the-right slope in the data that represents the moderate correlation (0.475). We also notice a few other things, mostly related to particular cases. We have looked at individual scores for cases previously in order to decide whether we might have cases that are problematic or interesting (that is, the cases we considered for removal and follow-up). So far, though, we have only looked at one variable at a time for individual cases. The scatterplot allows us to look at both variables for a case at once and to spot unusual combinations of scores. These unusual combinations may also end up representing cases that we need to investigate further, either for potential removal from the analyses or, more likely, for additional follow-up to determine why they are so different from everyone else.

Figure 2.8 Scatterplot from correlation data from two variables in one group (Chapter 2 Table 2.4)

scatr::scat(
    data = data,
    x = 'Computer_Use',
    y = 'Test_Scores',
    marg = "box",
    line = "linear")

In Figure 2.8, we see Case 10 again, knowing that it is Case 10 based on the scores listed in Chapter 2 Table 2.4. That is, Case 10 had a computer use value (x-axis) of 19 hours and a test score (y-axis) of 43. This value jumps out as different from most. It is at the far left of the graph and at the lower end of the vertical axis. The scores actually seem to “fit” the results we have. That is, since we had a positive correlation, we expect low scores on one variable to “go with” low scores on the other variable. However, the scores are so low that they are separated from the other data and may represent a case that is unique in some important ways—and therefore, we would probably do some further investigation of this case.

Other cases also jump out, though perhaps not quite as obviously. For example, the case near 65 on the x-axis (from Chapter 2 Table 2.4 we find this to be Case 8) has the largest number of hours using a computer in a month, but its test score is right in the middle of the achievement test scores. With a positive correlation, we might have expected this person to score higher on the test. One case has a computer use value near 50 and a test score of 40 (this turns out to be Case 5). Again, we would not have expected a case in the middle of the computer use range to score so low on the test. Both Case 5 and Case 8 may be cases we want to follow up on further, to see if we can identify anything special about them that may have affected their results as related to the TinkerPlots intervention. Note that while Case 12 has the highest test score, this case does not have the highest computer use; however, its computer use is near the highest, so this case actually “fits” reasonably well with what we expected.

Other Graphs

There are many other types of graphs that may prove useful to help summarize our data visually. For example, a bar chart or pie chart may be particularly useful in describing categorical data that we’ve collected. Be careful to remember that simple is better, though. We don’t want to use a graph that is so sophisticated that it actually makes our analysis more difficult.

REPORTING RESULTS AND TAKING (OR RECOMMENDING) ACTION

When reporting our results, we need to pay attention to the audience. For this section of the workshop, we will assume that we need to produce a report to submit to someone else. We probably wouldn’t produce a formal report just for ourselves, but even when we don’t need to report to someone else, we will still want to go through these steps anyway, and perhaps even produce an informal report of our results, conclusions, and action steps.

Action research differs from basic and applied research because we usually need to describe our next steps following an action research project. In basic and applied research, researchers will often recommend future research for others to do or recommend ways for practitioners to apply the new knowledge to practice. In action research, however, we are the ones who will do the next research ourselves, and we are the ones who will apply the knowledge to practice. Therefore, an important piece of the action research process is deciding what we will do next with what we’ve learned from our research project.

It’s also important to remember that the results of the action research project were probably not the only information we needed to make our action-oriented or policy-oriented decision. We probably already had some other information to use in our decision and we probably had to collect other information using some non-research process. As we produce our report and discuss our next steps, we will need to consolidate all this information into a formal recommendation or decision.

Format of the Research Report

There is a generally accepted format for most research reports that works well for action research, too. The structure of the report essentially follows the same outline of steps we followed as we did our research. The following outline describes briefly the sections that are typically included in a research report, including particular issues that should be reported in action research. Although this structure is presented as an outline, we generally report research using a narrative. However, the focus of each outline point might serve as a reasonable heading for each section of the report.

Research Question

  • What was the action-oriented problem or question (action question, include a description of the reasons for the question, who needs the answer, who the answer will impact)
  • What was the specific research question being studied
  • What were the conceptual definitions of the variables (including which were dependent and independent, if any)
  • What were the operational definitions of the variables, including treatments used for independent variables (e.g., what was the intervention, comparison treatment, control)
  • What relationship(s) did we expect among our variables (e.g., causal, predictive, correlational)
  • What was the population of interest (describe the population as completely as you can, including where they are located and/or how they are connected)
  • Was there any special context for the research question
  • What was the significance of the research question being studied (e.g., why did we need to know the answer)
  • How did we expect the information or knowledge gained from the action research project to help us make our action-oriented or policy-oriented decision

Research Design and Data Collection

  • What knowledge already existed that helped us understand our research question or perhaps even provided some answers (a brief review of existing knowledge and literature)
  • What research design did we use (e.g., experimental, descriptive, observational, correlational, qualitative)
  • If we used experimental research, how did we assign treatments to groups (if experimental)
  • How did we manage the known limitations of the research design we chose (e.g., what additional data or information did we obtain to help minimize the impact of the limitations)
  • How did we obtain participants for our study (how many participated)
  • Were our participants our entire population or were they a sample of our population (if they were a sample, how well do we believe they represented the population—do we have any evidence of their representativeness)
  • What types of qualitative data did we collect and what specific procedures were used to collect it (what interview questions were asked and what observations were made—of whom)
  • How do we determine that the qualitative data we collected was accurate and credible (validity)
  • What types of quantitative data did we collect and what specific procedures were used to collect it (what instruments or measurements were used to collect data, did you create your own instruments or use existing instruments, did you use tests or rating scales, did you use questionnaires or surveys)
  • How do we know that the measurements we made produced valid and reliable data
  • Did we use a pilot study of any kind (who participated, what did they do)
  • What ethical considerations did we address in our research (how did we protect the rights of our participants)

Data Analysis

  • How did we analyze any qualitative data we collected (what methods did we use—and what themes emerged, how can we describe what the themes mean to the participants)
  • How much and what kinds of support did we have for the themes (e.g., verbatim quotes, observations, documents, journals)
  • How did we analyze any quantitative data we collected (what descriptive, graphical, or inferential statistical methods did we use—and what were the results, what computer programs did we use)
  • Did we find any interesting or extreme cases, or did we find any bad or impossible data (what did we decide to do with these cases—did we collect additional information to make sense of them, why did we remove them if we did)
  • How can we use the quantitative variables we measured to describe the cases and the groups we studied
  • How do we describe the relationships among variables, or the differences between groups, that we observed (how large was the relationship or difference, i.e., the effect size)

Reporting and Interpreting Results and Taking Action

  • What conclusions can we legitimately reach based on the results of the study (what was the answer to our research question)
  • How do we interpret the results in the context of our research question (were we surprised by any of our results, or did the results fit with our expectations based on theory, practice, or research done by others)
  • How does the answer to our research questions help guide our decision about our action question (what action should we take based on the results)
  • How else can we apply the knowledge gained from the results to our original action-oriented or policy-oriented question (what else did we learn that we didn’t anticipate)
  • What limitations due to threats to validity cause us to be less confident in our conclusions (can we take action based on our level of confidence concerning what we learned, do we have other information that can help us become more confident)
  • What are some of the other possible alternative explanations or possible explanatory variables for the results we obtained

Implications and Taking Action

  • What are the implications if we make changes based on our action research—but our results were wrong and we shouldn’t make changes (or what are the implications if we do NOT make changes—but our results were wrong and we should have made changes)
  • Should we make changes to our intervention, to our research design, or to our data collection methods and try another action research project

The first three sections of this outline discuss issues that have been discussed in the previous topics of this workshop. Indeed, you’ll find many of these same issues listed among the recommendations for reading and evaluating research reports (presented in an earlier appendix). The focus of this section of the workshop is reporting the results and taking action. In order to report the results, we must also interpret the results. That is, we must make sense of the results in terms of the research question(s) we asked that guided our research.

Interpretation of our Results

The primary purpose of interpreting our results is to answer our research question. That is, we want to describe how the results we’ve obtained provide important evidence that we can use to answer the research question. Note that we will not attempt to answer the action question during the interpretation of our research results. In action research, the action question is addressed as part of the implications of our results, when we describe the action we plan to take based on what we’ve learned. In basic and applied research, such answers to action-oriented questions are addressed in the discussion section while talking about implications of the results and recommendations for practice.

Answering the Research Question

First, we need to answer our research question. We will answer the research question based on all the results we have obtained, both qualitative and quantitative. We can think about our quantitative results in terms of themes, just as we did for the qualitative data. That is, we will look across all our analyses and try to make sense of what the results are telling us. We need to remember that it is probably true that none of our results can be used to reach strong conclusions, especially not causal conclusions. Rather, what we have is quantitative evidence—some strong and some not-so-strong. This evidence can be thought of qualitatively: What themes emerge from the quantitative evidence we have collected? We need to discuss what the results mean, and in particular, what they mean to our understanding of the research question.

As we discuss the meaning of our results, we need to be careful to support every claim with data or results from our study—that is, evidence. As we make sense of our results through our interpretations, we can use an approach similar to the one we used with our qualitative themes. That is, we can create a matrix of our interpretations and then list all the evidence we believe supports each conclusion. Using the themes we identify in the quantitative results, along with the themes from our qualitative analyses, we can begin to reach conclusions, using both quantitative and qualitative data and analyses as our supporting evidence. We would not necessarily present the matrix as part of our report, but it helps us organize our results and conclusions and develop our arguments for action.

Effect Size

We’ll need to consider several things in particular as we try to make sense of our results. For example, was the difference between groups large or small? Was the difference between pretest and posttest large or small? Was the relationship between variables large or small? The size of the relationship or difference is referred to as the effect size. We need to be thoughtful about how large an impact the intervention had on our participants. Sometimes we may obtain clear and convincing results that our experimental treatment (for example, TinkerPlots) produced achievement scores that were one point higher, on average, than our traditional methods. Perhaps TinkerPlots has large costs associated with it in terms of the time it would take to train users, the effort it would require of teachers to implement it, and the money it would require to purchase both the software and the new computers needed to use it the way teachers want to use it. So the question becomes, given that TinkerPlots improved scores by one point, is there strong enough reason to change to that intervention permanently? Indeed, one point (even two, three, or more) may mean that no real change happened at all if the groups were not truly equivalent prior to the intervention. We’d have to think about how large an effect we need to see before we make a change.
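
The chapter describes effect size in raw-score terms (points on the test), which is often the most meaningful metric for an action research audience. Researchers sometimes also report a standardized effect size so results can be compared across different measures; one common choice is Cohen's d, sketched below for the two groups' difference scores. This goes slightly beyond the analyses shown in the figures above, and the variable names assume the Table 2.3 layout.

# a standardized effect size (Cohen's d) for the difference scores in two groups
tinker <- data$Difference[data$Group == "Tinkerplots"]
trad   <- data$Difference[data$Group == "Traditional"]

mean_diff <- mean(tinker) - mean(trad)            # raw effect, in test points
pooled_sd <- sqrt((var(tinker) + var(trad)) / 2)  # pooled SD (equal group sizes)
mean_diff / pooled_sd                             # Cohen's d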

At other times we may just be interested in seeing that a new intervention does not decrease the desired outcomes. Perhaps we’ve already decided to switch to TinkerPlots as long as it doesn’t seem to decrease achievement scores, because we are excited about a number of other things it will bring to our curriculum. In such a case, a small effect size may be perfectly acceptable. It’s not easy to decide how large the effect size must be for us to take action, and because we usually have limited amounts of quantitative data in action research, we probably don’t want to rely on the quantitative results alone to make such a decision. That is, the effect size is just one piece of information that we should consider as we decide whether to take action.

Special Cases

As we consider our results, we probably want to consider further those cases we identified as interesting or problematic. What have we learned about these cases that might influence our conclusions? How many cases seem to be different? Can we identify and describe groups of participants who were different in such a way that we might be able to define different action steps for different groups?

We want to be thoughtful about the cases for whom we obtained results different from the majority. We’d like to know why they are different if we can figure it out. It may impact our decision in important ways. Perhaps their different results had nothing to do with the treatment at all. For example, maybe they were sick on the day of the exam. Or perhaps they were students who had transferred into our school from another school at the beginning of the year, which resulted in them having had a different preparation in previous grades. However, perhaps their extreme (or negative) results in the TinkerPlots example are because they don’t have computers at home like most of the other students do, and therefore have had a harder time using and learning from the program. Before we make decisions or take action, we should try to determine why these special cases were different.

Context

Sometimes we get the results we expected to get. Other times we are completely surprised by the results we get. Either way, it is useful to put our results into context by using the experiences of other researchers (as reported in the literature) or of other practitioners (through communication with colleagues, mentors, and other professionals). It is also helpful to revisit the theoretical perspectives that led us to believe we would obtain certain results from an intervention or about relationships between variables. How can we make sense of the results, whether they fit with theoretical and empirical expectations or not? If theory and previous research support our conclusions, we can also include them in our matrix of conclusions described above.

Validity of our Results

We must decide how strong our results are, from several perspectives. Most importantly, we need to consider alternative explanations for the results we obtained. For example, are there other variables that might be responsible for the scores, or the change in scores, we observed in our study? Sometimes we are not able to identify alternative potential causal variables before the study, even though we tried. But sometimes our results may point toward other variables that we hadn’t considered. One action step might be to perform another study where we include these variables, or control the extraneous variables better, in the research design.

Other possible alternative explanations may be a result of the research design we chose. Perhaps we want stronger results of an effect or want to be able to generalize more broadly. Still other alternative explanations for our results may be related to the measurement instruments we used, the interview questions we asked, the people we chose to interview, or the situations we chose to observe. Appropriate action steps might be to perform another study where we use a stronger research design, use better or different data collection techniques, or use a better sampling strategy to obtain participants.

Fortunately, we are often able to take or recommend action despite the limitations we identify in our study. Remember, no study can be perfect. We must determine those limitations that jeopardize our conclusions the most because we cannot minimize their impact in any way (hopefully we fixed our research design earlier if we identified any fatal flaws that would have made our results meaningless). That is, we need to consider each limitation and determine how serious it is. We want to determine whether we have any information that can help us feel more confident in our conclusions, even though a particular limitation exists. We might be uncertain about the quality of our measurements (in terms of reliability and validity), but because we have obtained other information (either quantitative or qualitative) that seems to corroborate our findings, we may feel more confident in the data collected using those instruments. We have some flexibility in action research to use additional data for such purposes. However, we want to be truly confident in our accounting for these limitations—we don’t want to use these additional resources simply to rationalize about why the limitations don’t matter.

We may be concerned about generalizing our results to other 8th grade classes because we only studied one or two classrooms. While technically we would have been better served by using a broader sampling strategy, we may have information at our disposal that allows us to determine how well the results will generalize. This works especially well in action research. For example, we might compare the classes we studied with other classes using data we obtain from other sources. Perhaps we have standardized achievement data that we can examine, or demographic data. Through such a process, we may decide that our results might also be representative of some classes but not others. For example, maybe because all the classes we studied were comprised of students who generally achieved in the “proficient” category on statewide exams, we might decide that these results will hold true for other similar classes, but not for the higher math classes that have more “advanced” students.

Implications of our Results

Considering the implications is much like asking ourselves: “What if I’m wrong?” We may take action based on our results, so if our interpretations and conclusions are right, then we have little to worry about. However, if we’ve not accounted for something we should have, or if our limitations cause us greater problems than we believed, we may take inappropriate action. Fortunately, there is not always a large cost or penalty associated with incorrect decisions. To help manage such potential errors, a good strategy is to continue the cycle of action research: making small changes rather than large ones and testing these changes using additional action research.

Unfortunately, sometimes there are indeed large costs and impacts of wrong decisions. We always have to remember that one study never proves anything. The implication of this fact is that our largest changes and most critical action-oriented or policy-oriented decisions should never be based on the results of a single study. Indeed, such important action typically requires additional information beyond what can be provided by answering a relatively specific research question.

As part of our conclusions, we should contemplate the potential ramifications and consequences of taking certain actions as a result of our research results. Then, as we combine our results with other information we have obtained related to our action-oriented or policy-oriented question, we will be able to anticipate relevant problems better.

Taking Action

Finally, our reporting process should identify specific action steps that should be taken based on our results. Often these steps will include additional action research projects, but they will almost always involve changes or recommendations that we are willing to make based on our results. Sometimes these steps will include how to deal with the difficulties we anticipate in trying to implement change.

If we are working with a group of collaborators (for example, the other members of the math or science department), we may want to assign tasks for each team member. We may want to set up a timeline of activities that should be followed in order to implement the changes by some deadline we may have. Perhaps the answer to the research question was only a part of the information we need and we still need to collect additional information; team members can be assigned tasks.

We may need to find ways to convince others that changes should be made. This may require setting up additional action research projects or establishing a more formal pilot program. We may want to enlist additional colleagues in such a project and perhaps to broaden our future study participants to better represent the population.

In the end, we need to remember that our research results are just one piece of the puzzle. While having such empirical evidence is often convincing and comforting, it is not the only evidence or information that exists—nor should it be in most cases. Policy decisions are difficult, and unfortunately, research does not always clarify the decision. And in fact, sometimes research complicates the decision. Remember the example of an intervention that seemed to help boys more than it helped girls. Some policy makers and practitioners are hesitant, even unwilling, to make changes that might have such disparate impact (and therefore the changes might even be considered unethical). Professional judgment is always an important part of the decision-making process and should not be overruled by the results of a single action research project, no matter how convincing those results might be.

SUMMARY

Almost all research proceeds from some question or problem of theoretical or practical interest to the researcher. The research problem is usually best stated as a research question that asks about the relationship(s) among one or more dependent variables and one or more independent variables. The population and context of the study are often also included in the research question. Essentially, the research question is a very focused statement of what we really want to learn.

Research questions must be stated so they can be answered empirically (that is, through observation and the collection of data). Note that while action questions will often ask the “should” questions, research questions must not. A common form of quantitative research question asks whether there is a relationship among two (or more) variables; remember that a research question must not ask whether something should be done, and that quantitative research will often not answer how or why things happen. The purpose of the study must be clear.

As we develop the research problem, we should provide a description (introduction) that describes the general research problem (or topic area) and the current state of knowledge in the field concerning the issues involved in the research. This introduction will usually include any necessary background to the study, the theoretical framework or perspective that guides our efforts, and the empirical evidence that exists. The introduction typically includes the most important information we found in our literature review. We also typically will provide definitions of important or unusual concepts and terms that readers must understand in order to make sense of the study.

Chapter 2 Appendix B: Reading and evaluating research articles

These are questions that may help you determine the credibility of the conclusions reached by authors of research reports. The normal-text questions are designed to help you understand the study better. The italicized questions are designed to help you evaluate the study.

Research question

  • What is the general purpose of the research? Does the research appear to have primarily theoretical, applied, or local implications (e.g., action research or evaluation)?
  • Was the purpose of the study clear and obvious?
  • What is the specific primary research question investigated (which may be stated as a research problem)? What does the researcher specifically hope to find out?
  • Why does the researcher believe this is an important study (i.e., significance of the study)?
  • Was the argument for the importance of the study made successfully?
  • Does a coherent, well-structured literature review place the study in context within the field?
  • What important information is missing from the introduction and/or lit review?
  • Was the literature recent enough?
  • What was the theoretical basis for the study?
  • What was the empirical basis for the study (i.e., what was the key previous research in the field)?
  • What were the variables or phenomena of interest? Which, if any, were dependent, independent, control, mediating, or moderating variables?
  • Were the variables or phenomena described well?
  • Do the theoretical/conceptual and operational definitions make sense?
  • What was the target group being studied (i.e., population of interest)?
  • Was the target population or were the participants defined well?
  • What limitations and/or delimitations are cited by the author?
  • Were any key assumptions or potential researcher biases implied (or explicitly stated)?

Research design and data collection

  • Who were the participants (i.e., sample)? How were they found, contacted, or recruited? Are there any important demographic breakdowns? What was the sample size (include group sizes, if applicable)?
  • Were appropriate participants selected for inclusion in the study?
  • Is the sample appropriately representative of the population?
  • What procedures were used to collect the data? What were participants asked to do?
  • Were appropriate procedures used to collect data?
  • Were procedures described well enough for replication?
  • Does it seem that the data collected were relevant to the research problem?
  • What was observed by whom?
  • How and where were the data collected?
  • For quantitative studies, what research design was used (e.g., experimental, correlational, descriptive)?
  • What threats to internal and external validity exist?
  • For quantitative studies, what measurements were made, and how?
  • Does it appear that measurements (i.e., operational definitions) were reasonable?
  • Was adequate measurement validity and reliability information provided for instruments?
  • For experimental studies, were equivalent groups created? What was the experimental treatment or intervention? Was there a comparison group—if so, what did it do?
  • Was there any reason to believe that the treatment was not correctly implemented?
  • Was a manipulation check done?
  • Were equivalent groups created and/or assigned?
  • For qualitative studies, what observations and interview questions were used?
  • What threats to qualitative validity exist? How were the data verified?
  • Was there any potential researcher bias that may have influenced results?

Data analysis and interpretation

  • Describe the analytical methods used and results reported.
  • Does it appear that appropriate methods of analysis were used to answer the research questions?
  • For quantitative studies, how were decisions/inferences made?
  • Were appropriate statistical assumptions met?
  • For qualitative studies, how were themes identified?
  • What descriptive, graphical, tabular, and inferential methods were used?
  • Were the results reported in an appropriate manner?
  • Were the results reported clearly, fairly, and accurately?
  • Was the meaning of the results interpreted — and interpreted correctly?
  • What were the most important results? Were these primary or secondary results? What other information or data would have been helpful in the study?
  • For quantitative studies, what was the magnitude (i.e., effect size) of the relationships among variables or differences between groups?
  • For qualitative studies, how were the results justified and/or verified (e.g., triangulation)?

Reporting Results and Taking Action

  • What was the answer to the research question?
  • Are conclusions based on only the data collected?
  • Are conclusions inconsistent with, or exaggerated from, the results reported?
  • How do the results and conclusions fit with knowledge in the field? What major conclusions were reported?
  • Are alternative explanations for results explored by the author?
  • What other alternative explanations or variables may have affected the results?
  • What implications for the field, either practical or theoretical, were reported by the author?
  • What are implications for the field if the major conclusions are correct?
  • What are implications for the field if the major conclusions are incorrect?
  • What do the results really mean, according to the author?
  • Were inappropriate causal conclusions made or implied?
  • Are the arguments for the author’s conclusions convincing?
  • What do the results mean to you or to the field?
  • What questions for further research are raised by this article?
  • Would this article be good enough to support an argument in a lit review?
  • What limitations or problems cause you to doubt the author’s conclusions?
  • What issues of internal experimental, external, measurement, or qualitative validity concern you?
  • Were there any apparent ethical dilemmas faced by the researchers? How well were they handled?

Chapter 2 Appendix C: Table of Specifications

The table of specifications outlines the format and content of an assessment. The specifications should provide the following information: (1) total number of assessment criteria (items), (2) what proportion of assessment criteria correspond to each content area, and (3) what proportion of assessment criteria correspond to each cognitive level. The assessment specifications should be clearly connected to learning objectives.

Example 1

Example 2

Chapter 2 Appendix D: Some Guidelines for writing Test Items (many of which also apply to surveys and attitude scales)

Provide clear directions

  • be clear what the choices represent (e.g., best answer, correct answer) and how to indicate a response (e.g., circle, write answer on a line)
  • in matching, be clear whether responses can/should be used more than once or only once

Be sure each item is appropriate for the students in terms of reading skills and vocabulary

  • use relatively simple syntax and language
  • reading level should be slightly below that of the students being assessed (otherwise reading skills may be confounded with other skills)
  • avoid complicated sentence structures
  • use vocabulary appropriate for the students
  • avoid negatives — and especially double-negatives

Use clear, unambiguous statements

  • state the stem and options as simply as possible
  • use a question where possible in multiple choice items rather than an incomplete sentence
  • avoid pronouns without referents
  • avoid adjectives/adverbs of indefinite degree (e.g., words such as often and sometimes, which may have multiple interpretations)
  • avoid double-content/double-barreled stems and options (e.g., binary choice items must represent a single concept or proposition)
  • the stem of a multiple choice item must be self-contained (i.e., the task must be clear)
  • do not include extraneous information in the stem (i.e., focus the task)
  • emphasize adjectives/adverbs when they reverse or significantly alter the meaning of the stem or option

Do not provide unintentional clues to the correct answer

  • correct answers should be evenly, and randomly, distributed across all alternative options for the test—no pattern to the correct options
  • do not make correct answers longer than incorrect options
  • do not make True/Correct binary-choice items longer than False/Incorrect items
  • avoid absolute qualifiers such as always and never (especially in T/F items)
  • avoid grammatical clues such as an before a correct answer beginning with a vowel or subject-verb agreement from stem to correct option
  • all options should match the stem appropriately
  • use a logical order for response options (e.g., use alphabetical, numeric, or chronological order unless another order is more logical)
  • in matching items, provide more responses than premises/items so that the last choice isn’t a given

Be sure all options have similar/parallel content and structure

  • if any items have well-defined scales, be sure all do
  • put any words repeated in every option in the stem instead
  • all options should follow logically and grammatically from the stem (similarly, all multiple binary-choice items should follow logically from the stimulus material)
  • in matching items, use homogeneous lists of similar responses (they should all be the same type of response, when possible)

Make all options/distractors plausible

  • students with the relevant knowledge and skill should be able to answer the item easily, while those without the skill should answer incorrectly
  • but keep the number of options reasonable (too many options are difficult to process)
  • especially with binary-choice items, superficial analysis by students leads to incorrect choices (e.g., do not make items too obviously true or false) — students should have to think about the answer
  • distractors can represent common misconceptions
  • distractors can be made to sound correct/believable to an untrained reader (e.g., using 100 for a perfect score, no matter what the scale)
  • distractors can represent common mistakes (e.g., in addition problems, forgetting to carry a value)
  • distractors can be reasonably close to the correct answer (but not too close or attractive)

Do not use options such as all of the above and none of the above to increase item difficulty

  • some argue that for selected-response items, the correct answer should be there to be selected
  • for all of the above, if any two options can be identified as correct, the rest become irrelevant
  • for none of the above, any option identified as correct makes others irrelevant
  • sometimes one option is clearly better, but others are not technically wrong, posing a dilemma for an all of the above option
  • sometimes one option is close to, but not quite, correct and creates a dilemma for “none of the above” (i.e., the student can’t tell if it was an intentional or accidental small error)
  • none of the above is only useful when it is the correct option, but using it only when it is correct provides an unintended clue (it may actually lower reliability when it is used as an incorrect option)

Chapter 2 Appendix E: Some Guidelines for survey and attitude scale items

Purpose of the Survey

  • Does the item fit the purpose of the questionnaire?
  • Does the question measure some aspect of a research question?
  • Does the question provide information needed for use in conjunction with some other item or variable?
  • Can you justify and explain why you are asking the question?
  • For scales, does each item measure some aspect of the attitude or a component?
  • Will it be obvious to respondents that the question is necessary? Do the questions appear relevant (i.e., face validity)?

Understanding the Task (Clarity of Items)

  • Will most respondents understand the question – and understand it in the same way?
  • all respondents should have the same frame of reference as they respond to the questionnaire
  • use common language; the conventions of spoken language are usually acceptable (but not slang or jargon)
  • use technical, specialized language, and abbreviations only when respondents are certain to understand them in the intended way
  • use language and vocabulary appropriate for population of respondents, when possible use simpler words (count syllables)
  • keep questions and sentences as short and as simple as possible (longer items sometimes okay)
  • avoid compound items (double-barreled) that combine more than one thought in a single item (including hidden double-meanings)
  • avoid double negatives (including single negatives in Agree/Disagree items)
  • facts are not always facts (i.e., facts are often defined by the respondent)
  • avoid words that may be confused with similar words (e.g., outlaw and allow)
  • include definition in question if necessary (but don’t get too specific or narrow)
  • use appropriate adjectives or restrictions so all respondents understand the question in the same way (e.g., “violent crime” or “in your school”)
  • be specific with item, avoid vague wording so that you can accurately interpret responses (but not too specific)
  • avoid extreme adjectives and absolutes (e.g., always, never), especially when used with Agree-Disagree scale (e.g., “it was a great course” – someone can disagree and still think it was a good course)
  • be sure that all categories are mutually exclusive and exhaustive
  • all respondents should interpret the time frame in the same way (e.g., don’t ask about last week if some respondents will reply next week and some will reply next month – unless item is being used for a general estimation)
  • consider forced choice items (e.g., “which of these 2 would you choose”) rather than Agree-Disagree items
  • examples sometimes carry over beyond intended boundary (e.g., an example or question early in survey may affect responses later – or respondent may think responses should continue to be in context of a given scenario)

Difficulty of Task

  • Will most respondents have the information necessary to answer the question?
  • Is the task too difficult (e.g., math)?
  • Is the burden on the respondent too much (e.g., must look through old tax records for information)?
  • avoid overly long lists of options (e.g., in a ranking scale)
  • put options and possible responses last (after item)
  • use common concepts and avoid distractions (e.g., don’t ask respondents to calculate things)
  • memory questions are difficult
  • use bounded recall, such as a specific recent reference period (e.g., “in June…” or “within the last month…”) or use salient events or cues (e.g., “since New Years…”) — but be careful that some respondents don’t respond during atypical time frames (e.g., if “last week” was a holiday)
  • if a general perspective is desired, be clear by using appropriate qualifiers (e.g., “in general…” or “overall…”), but avoid over-generalizations (e.g., “usually” is tough, use bounded recall)
  • avoid questions that assume too much knowledge
  • avoid questions about topics that respondents cannot know much about
  • avoid questions that ask respondents what other people think
  • some behaviors and information are difficult to remember (too old, too specific)
  • avoid too much precision (e.g., “exactly how many times have you … in the last year”)
  • hypothetical questions are difficult but can be useful (e.g., vignettes, factorial analyses, standardized stimuli)

Willingness or Need to Respond

  • Will most respondents be willing to answer the question?
  • Do all respondents need to answer the question?
  • will respondents have a chance to respond in their own way, rather than being led by the question?
  • avoid direct sensitive questions, perhaps use categories for responses (e.g., income)
  • consider open-ended questions for more sensitive information (some research shows that sometimes more information is volunteered by respondents in this way)
  • do you only need information from a subset of respondents? consider filter questions (contingency, conditional, branching, skip-logic) – but avoid “quiz questions”
  • carefully consider “Don’t Know” or “Not Applicable” or filter questions – if they are available they will be chosen by relatively large proportion of respondents

Analysis Considerations

  • how much information is needed for the variables being measured (e.g., what level of measurement is needed)?
  • is other information needed to analyze this question? If so, how will you obtain the information (or do you already have it)?
  • have you used meaningful and/or balanced groupings or clusters for categories?
  • how do you justify the categories you’ve created?
  • are your categories too broad to be meaningful? too narrow for summary interpretation?
  • have you matched your categories to other literature?
  • closed items help provide more consistent results from respondents, enhancing reliability and validity

Impartiality

  • avoid wording that connotes positive or negative positions, let the respondents provide their own positions (avoid “leading” questions, e.g., “do you agree that…”)
  • avoid emotionally or politically charged words or terms (“loaded” terms) that might evoke reaction external to the item itself (e.g., “public assistance” vs. “welfare” or “public servant” vs. “politician” or “for the sake of our children, shouldn’t we…”)
  • consider including a scenario or context for more emotional or political topics
  • avoid slanted introductions to items (e.g., “most people… do you?”)
  • avoid unequal comparisons (e.g., don’t provide scapegoats or socially acceptable things vs. socially unacceptable)
  • carefully consider the issue of balance for response choice scales (i.e., an equal number of positive/negative or high/low or good/bad response choices — but sometimes more of the favorable choices help provide more variation in responses)
  • use impartial tone – let respondents provide the positive or negative views
  • carefully consider balance of questions (e.g., “do you favor…” versus “do you favor or oppose…”; some research suggests there is little impact due to which is used, though)
  • be sure to indicate either anonymity or confidentiality

Structure of Questionnaire

  • demographics usually work better last (easy items to do at end without thinking when fatigue may be a factor) – only ask relevant demographics
  • begin with interesting items (pique the respondent’s interest)
  • put sensitive items in categories and in the middle of questionnaire (don’t get too personal too soon, build rapport)
  • do item numbers make sense? people follow numbers
  • do sections make sense? consider organization based on type of scale/question or by topic (are there good transitions)
  • are there good instructions for every section or scale type, especially less common types of items (e.g., for forced ranking scale, which is highest value)
  • does the survey look professional? (aesthetics)
  • objective questions should usually come before subjective (closed before open)
  • order effects appear to be less problematic when general comes before specific (e.g., if marital happiness comes first, general happiness is usually rated based on marital happiness)

Response Bias

Things that may affect responses:

  • social desirability: respondent’s desire to look good or to conform to social norms (similar to what is now called “political correctness”)
  • acquiescence: respondent’s desire to be agreeable, to be nice, to be cooperative
  • yea-saying or nay-saying: some respondents generally respond more optimistically, some more pessimistically
  • prestige: respondent’s desire to exaggerate or to look good
  • threat: respondent’s concern over negative consequences of responses
  • hostility: respondent’s anger about certain items carried over to other items
  • sponsor bias: respondent’s answers are influenced by an attitude toward the sponsor of the survey
  • frame of reference/mental set: respondent has a particular issue in mind – maybe from earlier questions
  • extremity: respondent’s choice of only extreme scale values
  • order: routine, consistency, fatigue, picking all the same choices
  • order/consistency: effort to remain consistent with responses to earlier questions rather than answering each question separately, but sometimes order matters (e.g., if attitude toward labor strikes is asked first, respondents are more likely to favor owner lockouts also)
  • order/contrast: purposefully changing the second response to contrast with the first (e.g., when adolescents are asked about themselves first, they are less likely to say they belong to the same political party as their parents)
  • order/salience: questions spark memory and may change responses (e.g., when asked about attitudes toward victimization first, more incidents of victimization were reported; or when asked factual questions first, memory may spark a change in attitude)
  • primacy and recency: tendency to respond using the first or last response category, respectively — particularly in long lists or if lists have equally appealing response choices (e.g., in forced-choice or ranking questions; research is not as clear for rating scales)

Miscellaneous Hints

  • a middle point that represents “neutral” or “average” helps to produce more variation in responses
  • omitting the middle point seems to have much less effect on those with stronger feelings or attitudes
  • consider carefully whether “neutral” is a logical option or not — it is chosen more often if available, has greater effect on low-intensity feelings, and decreases “don’t know” responses
  • measure intensity – ask the question, then ask “how strongly do you feel about that?”
  • in rating scales (as opposed to Likert Scales) no reverse wording of items is necessary
  • start survey development process using open-ended approach (e.g., focus groups) to help determine what questions and categories should be included
  • in order to get reasons or explanations from respondents, use open-ended questions
  • don’t ask them to “please be honest”
  • use a social desirability scale (e.g., Marlowe-Crowne) or create a “fake” question (e.g., “which books have you read?” (check all that apply) – then include a made-up title)
  • be careful with certain attempts to catch social desirability — for example, many respondents seem to make educated guesses at items about things with which they are unfamiliar (e.g., attitude toward some nonexistent state law)
  • some people without a strong opinion will often answer “don’t know” if the option is available, but WILL respond to the item if a “don’t know” option is NOT available — sometimes this is actually a useful dimension to learn about
  • people do not seem to respond just to avoid appearing ignorant (because “don’t know” goes up when available) — so they are likely trying to provide “educated guess” responses to items they don’t know (e.g., they “like the sound of that” — but meaningfulness of attitude based on both knowledge and predispositions)
  • some recommend letting the respondent provide the “don’t know” or “not applicable” – either by saying so in an interview or not responding to a survey item (they are less likely to skip the item if “don’t know” is not an option)

Chapter 2 Appendix F: Sample Item Analysis (test)

This is an example of item analysis performed using only a very small subset of cases (the top and the bottom five scorers on the test from a class of about 20 students). While the Discrimination Index only uses these two groups, most other item analysis statistics would typically be performed using all data. The primary purpose here is to help illustrate the ideas. In the table, 0 = incorrect, 1 = correct. Similar analysis is performed for attitude scales (but some of the decision values differ). See elsewhere in this textbook for an example Item Analysis in R.

Sometimes items with D < 0 occur because the answer key is incorrect (in testing situations) or because the item was reverse-worded but not recoded in the data (in scale situations).
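
To make these calculations concrete, here is a minimal R sketch of the same ideas using a small, invented set of 0/1 item responses (the data, group size, and item names are hypothetical, not the values from the example above):

    # Hypothetical 0/1 responses: rows = students, columns = items
    resp <- data.frame(
      item1 = c(1, 1, 1, 1, 0, 1, 0, 0, 0, 0),
      item2 = c(1, 1, 0, 1, 1, 0, 1, 0, 1, 0),
      item3 = c(0, 1, 1, 1, 1, 1, 1, 1, 1, 0)
    )
    total <- rowSums(resp)                 # total test score per student

    # Item difficulty: proportion answering each item correctly (all cases)
    difficulty <- colMeans(resp)

    # Discrimination Index D: proportion correct in the high-scoring group
    # minus proportion correct in the low-scoring group
    n_group <- 5                           # hypothetical group size
    high <- order(total, decreasing = TRUE)[1:n_group]
    low  <- order(total, decreasing = FALSE)[1:n_group]
    D <- colMeans(resp[high, ]) - colMeans(resp[low, ])

    round(cbind(difficulty, D), 2)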

Chapter 2 Appendix G: Sample Item Analysis (scale)

While the Discrimination Index only uses the high and low groups, most other item analysis statistics would typically be performed using all data. The primary purpose here is to help illustrate the ideas. In the table, 1-5 indicate the score on the scale item. See elsewhere in this textbook for an example Item Analysis in R.

Sometimes items with D < 0 and/or negative item-total correlations occur because the item was reverse-worded but not recoded in the data (in scale situations). But sometimes the item is just badly written. Note that with scales (and usually even for tests), we usually prefer the item-total correlation. One major advantage is that we don’t lose the middle group. Additionally, most researchers like to consider Cronbach’s alpha-if-item-deleted in addition to the item-total correlation (we recommend both, but with a focus on the item-total correlation).
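
A similar minimal sketch for a scale, again with invented 1–5 responses (the item names and data are hypothetical); the corrected item-total correlation and Cronbach’s alpha-if-item-deleted are computed directly in base R, though packages such as psych report the same statistics:

    # Hypothetical 1-5 responses: rows = respondents, columns = scale items
    items <- data.frame(
      q1 = c(4, 5, 3, 4, 2, 5, 4, 3, 2, 4),
      q2 = c(5, 4, 4, 4, 1, 5, 3, 3, 2, 5),
      q3 = c(2, 1, 3, 2, 4, 1, 2, 3, 4, 2)   # e.g., a reverse-worded item left unrecoded
    )

    # Cronbach's alpha from item variances and the variance of the total score
    cronbach_alpha <- function(x) {
      x <- as.matrix(x)
      k <- ncol(x)
      (k / (k - 1)) * (1 - sum(apply(x, 2, var)) / var(rowSums(x)))
    }

    total <- rowSums(items)

    # Corrected item-total correlation: item vs. total of the remaining items
    item_total <- sapply(names(items), function(i) cor(items[[i]], total - items[[i]]))

    # Cronbach's alpha if each item were deleted
    alpha_if_deleted <- sapply(names(items),
                               function(i) cronbach_alpha(items[, names(items) != i]))

    round(cbind(item_total, alpha_if_deleted), 2)   # q3 should show a negative item-total correlation
    cronbach_alpha(items)                           # alpha for the full scale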

Chapter 2 Appendix H: Checklist/Recommendations for Reporting Quantitative Research

(but can be used for Reviewing Research Reports as well – see end of Appendix for note)

Answer Yes, No, or NA (not applicable) for each point in the checklist below.

INTRODUCTION: THEORETICAL VALIDITY

  • Know well and describe relevant literature (not just literature in this field, all relevant literature) for ALL sections of the research report: Research Problem, Methods, Results, and Conclusions
  • “Introduction” provides literature-based theoretical, conceptual, practical framework/background with empirical support for argument of Theoretical Validity (“best of” or “greatest hits”)
  • Proposal should provide “who, what, where, when, why, & how” about the research (some of which may require multiple perspectives, e.g., who will do the research, who will participate in it)
  • Statement of the problem that led to or drives the research
  • Develop, refine, clarify research question (RQ) about the population to answer empirically
  • Define conceptually/theoretically the variables, for example, dependent, independent, categorical (levels), control, mediating, moderating (if any)
  • Indicate what relationships among variables will be investigated and why the RQ must be answered by providing theoretical and empirical rationale (“so what” or “why bother” question)
  • Delimit and describe target population of interest and perhaps accessible population
  • Provide rationale for why the target population needs to be studied and for any resulting limitations to generalizability (include rationale for appropriate inclusion and exclusion criteria)
  • Consider context, non-population/non-variable, or ecological limitations (e.g., time of year, location, laboratory, equipment, facilities, infrastructure)—these also limit generalizability
  • Argue for/justify theoretical, empirical, or practical significance of RQ (with literature support)
  • Provide rationale for including variables and studying populations included (or excluded)
  • If investigating confirmatory RQs, state and support research hypotheses and justify based on theoretical and empirical literature (i.e., provide expected answers to RQs)—if exploratory, say so
  • Provide specific, key definitions of constructs, terms with unique meaning in the study
  • Consider and describe potential foreseeable limitations or problems that may be faced, how they will be handled (or why they cannot be), and what impact they might have if they cannot be controlled (e.g., self-report data, potential lying or errors, volunteer bias, extant data, cannot randomize)
  • Literature Review that more completely describes all these issues (as well as methodological issues—such as instrument development and validity/reliability)

INTERNAL VALIDITY

  • Be aware of and discuss the protection of the rights of human subjects and ethical matters
  • Describe research design to answer Research Question & minimize threats to INTERNAL VALIDITY
  • Design can be both flawed and useful – provide rationale for the choices
  • Verify treatment manipulations if needed (e.g., implementation fidelity, compliance, adherence)
  • Use random selection, random assignment, blinding where possible (to avoid confirmation bias)
  • Address strength of research design for (1) relationships, (2) temporal order, (3) ruling out alternatives
  • Address internal validity threats (e.g., maturation, history, attrition, testing, contamination, rivalry)
  • Include mechanisms to verify integrity/fidelity of interventions (e.g., manipulation checks)
  • Consider potential for triangulation or corroborative information (e.g., what data can be collected for evidence of accuracy/trustworthy/credibility, control of confounds/alternative explanations)
  • Describe pilot study & what will be learned (e.g., instruments, procedures, response, sample sizes)

CONSTRUCT VALIDITY

  • Define operationally all variables (including categorical and covariates) in ways that match the conceptual/theoretical definitions, including what numbers mean (e.g., what high scores mean, cut values used for groups, how any categorical or scale variables will be computed/manipulated/coded)
  • Consider including multiple measures/instruments for most important variables (e.g., outcomes)
  • Consider including variables to serve as covariates/moderators/mediators in case the primary analyses produce unexpected results, or outliers/influential cases were found (e.g., variables that may allow subsetting of data in useful ways or help explain unexpected result or why cases were outliers)
  • Report extant CONSTRUCT (measurement, psychometric) VALIDITY and measurement reliability evidence for instruments (perhaps in literature review)
  • Discuss evidence that will be provided of actual measurement reliability and validity of the final data (i.e., that the numerical data means what it is purported to mean for the study)

EXTERNAL VALIDITY

  • Identify sampling frame (population that can be accessed) & sampling strategies for EXTERNAL VALIDITY
  • Describe how sampling frame will be accessed (e.g., how contact info, participation will be obtained)
  • Describe how participation/response rates will be maximized and how the sample will be kept random/representative (or why not)
  • Describe what data are needed for argument of generalizability (or maybe transferability)
  • Describe data needed to adequately describe the sample and/or compare it to population
  • Describe methods used to screen data for reasonableness (e.g., belong to population, errors, outliers)
  • Describe how missing data will be handled (e.g., incomplete cases, attrition, and perhaps which variables cannot be missing for integrity of the analyses)

PROCEDURES & ANALYSES

  • Provide a time line and budget for the research
  • Detail data collection procedures sufficiently for replication (e.g., how data will be obtained from cases, how cases will be assigned to treatments, and what cases will do to participate (e.g., responses, activities))
  • State the statistical null hypotheses (and maybe alternatives) for statistical tests to be performed
  • Set error rates and decision criteria (e.g., critical values, level of significance, family-wise error)
  • Determine sample size (for units of analysis) based on expected effect sizes & desired statistical power and/or desired estimate precision (use minimum sample size sufficient for all purposes; see the R sketch after this list)
  • Identify descriptive statistics, tables/graphs, confidence intervals, effect sizes needed for appropriate interpretations and comparison to extant literature
  • Choose analyses/statistics needed to answer RQ (e.g. descriptive, inferential, graphical, post hoc)
  • Describe how assumptions of statistical methods will be tested and consider impact of violations
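
As one concrete example of the sample-size step noted above, base R’s power.t.test() performs an a priori power analysis for a two-group comparison; the effect size, alpha, and power values below are placeholders that would be justified from your own literature:

    # A priori power analysis for an independent-samples t test (stats package)
    # With sd = 1, delta is a standardized mean difference (Cohen's d)
    power.t.test(
      delta = 0.5,            # expected (standardized) difference (placeholder)
      sd = 1,
      sig.level = 0.05,       # Type I error rate
      power = 0.80,           # desired statistical power
      type = "two.sample",
      alternative = "two.sided"
    )
    # The n in the output is the required sample size per group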

DESCRIPTIVE STATISTICS (DATA VALIDITY)

  • Screen data for accuracy/reasonableness (e.g., entry errors, univariate/bivariate outliers) and describe what was done with unusual cases (e.g., what issues arose during data collection/entry, what decisions made to code or correct data due to errors or small group sizes)
  • Discuss reasons for missing data (e.g., non-response, attrition) & what was done to handle it (e.g., listwise deletion, multiple imputation, maximum likelihood)
  • Describe the actual sample/participants
  • response rates (e.g., % of total invited, % of total good addresses, complete responses, incomplete but useful responses, etc.), final usable data, total number of participants
  • frequencies for relevant demographic groupings and cross-tabulations for non-scale variables
  • descriptive statistics and correlations for scale (and maybe ordinal) variables – perhaps with appropriate demographic breakdowns (with tables and graphs as appropriate)
  • how well data represent population (if possible compare to known population values)
  • Examine the actual construct/psychometric validity/reliability of scale variables and non-factual variables that might be prone to unreliable responses.
  • Investigate outliers, assumptions (e.g., normality), and response biases for scale variables; perform item analysis if using a newer scale (see the brief R sketch after this list)
  • Explain what numbers mean (e.g., is “1” high or low)
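
As a concrete illustration of some of the screening and descriptive steps above, the following base-R sketch works through an invented data frame; the variable names, the 999 missing-value code, and the choice of listwise deletion are all hypothetical choices made for the example:

    # Hypothetical data: a grouping variable and a scale score, with missing
    # values and an implausible entry to illustrate screening
    dat <- data.frame(
      group = c("control", "control", "treatment", "treatment", "treatment", NA),
      score = c(3.2, 4.1, 999, 2.8, NA, 3.9)    # 999 used as a missing-value code
    )

    dat$score[which(dat$score == 999)] <- NA    # recode the missing-value code to NA

    colSums(is.na(dat))                         # amount of missing data, by variable
    table(dat$group, useNA = "ifany")           # frequencies for the categorical variable
    summary(dat$score)                          # descriptive statistics for the scale variable
    boxplot.stats(dat$score)$out                # potential univariate outliers

    complete <- dat[complete.cases(dat), ]      # listwise deletion (one simple option)
    nrow(complete)                              # usable cases after deletion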

STATISTICAL CONCLUSION VALIDITY

  • Use “really useful” tables and graphs for both descriptive and inferential statistical results
  • Test assumptions of specific statistical methods and consider impact of violations (e.g., STATISTICAL CONCLUSION VALIDITY), which should include relevant bivariate or multivariate outlier screening
  • Provide higher level evidence of assumptions: (a) lowest level is no mention, (b) saying they were tested, (c) saying they were okay, (d) highest level is evidence they were met (or, if not, how violations were handled)
  • Analyze the data using appropriate methods needed to answer the research questions (report relevant actual statistics & p values, decide about null hypotheses) – use robust tests if necessary
  • Report appropriate actual results (e.g., significance tests, descriptives, actual effect sizes & confidence intervals, tables, graphs) – report all results, whether statistically significant or not
  • Report appropriate follow-up/post hoc methods (consider multiple hypothesis testing adjustments, like Bonferroni or Holm; see the R sketch after this list) or sensitivity analyses (e.g., for outliers, assumptions)
  • Report supplemental analyses performed but not directly from RQs: that were identified while analyzing (i.e., exploring) data further (e.g., interesting results–must be confirmed with future research)
  • Provide statistical evidence needed to answer the Research Questions (not interpretation)
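
To show what such a multiple-testing adjustment looks like in practice, the short base-R sketch below applies Bonferroni and Holm corrections to a hypothetical set of p values from a family of follow-up tests:

    # Hypothetical unadjusted p values from a family of follow-up tests
    p_raw <- c(0.004, 0.021, 0.049, 0.120)

    p.adjust(p_raw, method = "bonferroni")   # Bonferroni-adjusted p values
    p.adjust(p_raw, method = "holm")         # Holm-adjusted p values

    # Decisions at a family-wise alpha of .05 using the Holm adjustment
    p.adjust(p_raw, method = "holm") < 0.05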

CONCLUSIONS & DISCUSSION: VALIDITY

  • Answer the research questions with attention to theoretical and practical significance
  • Interpret results and answers to the RQ (whether statistically significant or not) within the context of the literature and from a practical perspective (i.e., practical significance of effect sizes)
  • Reach conclusions based on the results of the study, with connections to the evidence that supports them and recognizing limitations
  • Discuss implications/recommendations for scholars/theory/research and for practice/policy
  • Discuss actual limitations that arose during study or were unanticipated, but may impact validity:
  • limits to usefulness of conclusions (e.g., threats to Internal Experimental Validity)
  • statistical analysis issues/failures, model specification that limit Statistical Conclusion Validity
  • Construct/Measurement Validity issues that may limit the quality of the data
  • lack of triangulation/corroborating evidence, other accuracy/trustworthy/credibility issues
  • actual External Validity of results due to actual participation (generalizability/transferability)
  • Discuss interpretations/conclusions based on final design/data in terms of Theoretical Validity
  • Consider potential alternative explanations or confounding variables not accounted for that might lead to recommendations for future research

Common Headings in Research Reports (e.g., Dissertations/Theses)

Common/possible section headings used in the Introduction (not necessarily this order):

  • Background of the Study & Theoretical Framework
  • Statement of the Problem
  • Purpose of the Study
  • Research Questions (potentially Research Hypotheses if not exploratory research – or at end of literature review)
  • Significance of the Study
  • Delimitations & Limitations
  • Definition of Key Terms

Common/possible sections used in the Methods (not necessarily this order):

  • Research Design (including interventions or treatments, if any)
  • Identification of the Population, Sampling Frame, and Sampling Plan (plus location and context, if necessary)
  • Operational Definitions of Variables (selection/development of instruments, previous reliability & validity evidence)
  • Data Collection Procedures (including access and ethical issues that must be addressed)
  • Planned Data Analysis Procedures (including psychometric, descriptive, inferential, assumptions, outliers, post hoc)
  • Pilot Study (always useful, if possible, to test data collection methods)

Common/possible sections used in the Results (not necessarily this order):

  • Instrumentation (actual reliability & validity information)
  • Description of the sample, including descriptive statistics (with appropriate demographic breakdowns, and with attention to generalizability/external validity, if possible)
  • Statistical Results (often presented by RQ, including assumptions, outliers, post hoc, tables, graphs)
  • Supplemental Analyses (answers to questions that were not asked initially, before data collection – exploratory)

Common/possible sections used in the Discussion (not necessarily this order):

  • Answers to Research Questions with Discussion (no new results)
  • Summary of major results in context with theory and previous research
  • Theoretical and practical implications of the results (what do the results mean in context)
  • Recommendations (for theory, future research, policy/practice)
  • Limitations (that arose during study and how they may have impacted results)
  • Summary & Conclusions (about significance of the study)

Notes

  • See checklist/recommendations above for what to include in each of these Research Report Sections
  • All chapters usually include brief Introduction (as reminder of purpose of the research) and a Summary
  • To use this guide for REVIEWING RESEARCH, simply ask yourself whether the authors did the things listed

Chapter 2 Appendix I: Guidelines for Reviewing Quantitative Research Reports

The major research questions and/or research problems investigated.

  • Did the research question include/imply variables, expected relationships, population, and context?
  • Was the research question asked so it can be answered empirically through data collection (e.g., not “should”)?
  • Did they clearly explain what they wanted to learn empirically and why they wanted to know it?
  • What type of research was it (e.g., basic, applied, action, evaluation; efficacy or effectiveness research)?

The author’s reasons for conducting the study (significance of the study or “so what?”).

  • Did the author provide sufficient theoretical, practical, or research-based (empirical) rationale?
  • Did the author present a coherent argument for the study’s importance, with evidence/rationale?
  • Did they make the case that additional empirical research was necessary?
  • Did the researcher share any personal motivation for the research? Who paid for the research?

The theoretical framework/background and empirical foundation provided.

  • Did they describe what people in the discipline already “know”? Did they miss any important literature?
  • Is the literature review organized thematically, where appropriate, rather than chronologically or by author?
  • Does the review provide sufficient theoretical/empirical support and explanation of the research problem?
  • Did the literature support inclusion of primary and controlled variables, or exclusion of variables not studied?
  • Does the review provide support for expected relationships among variables, or the need to study them?
  • Were research hypotheses developed and supported with appropriate rationale?

What theoretical/empirical/practical EVIDENCE/SUPPORT do they provide for an argument of “EXTERNAL VALIDITY” (will results generalize to others elsewhere? are they representative of the population?)

Delimitations for the study, including the target population and/or any special context.

  • Was the target population explained (or implied) clearly, with appropriate delimitations?
  • Were detailed inclusion and exclusion criteria provided? How were cases screened for inclusion?
  • Was any relevant context for the study explained (e.g., laboratory, location, timing, and special circumstances)?
  • What impact might the context have had on ecological validity? What about treatment/outcome variation?
  • What impact might the timing of the study have had on temporal validity? Will sample results hold up over time?

How the sample represented the delimited target population.

  • Was the accessible population (sampling frame) sufficiently representative of the target population?
  • Was the sample randomly selected and/or verified to be representative of the accessible population?
  • Was the sample sufficiently large for coverage, sampling error, and statistical power purposes?
  • Did they explain how they recruited or obtained participants? Was there aggregation bias (ecological fallacy)?
  • How was the sample selected from the accessible population? What sampling method or process was used?

How the actual study participants represented the original sample.

  • What was the response rate? Was there any possible impact due to non-response or volunteer bias?
  • What analyses were performed to ensure that participants belonged to the stated population?
  • Were any cases excluded from analysis because they had outlier or extreme/influential values?
  • Were any cases excluded from analysis because they had missing data? How were missing data handled?
  • What was the possible impact of the cases with extreme values or missing data on the results?

The statistics used to describe the participants, sample, and/or population, including all descriptive, graphical, and/or qualitative methods relevant to the major research questions.

  • What evidence was given and how strong was the representativeness of the data across sampling levels: from (a) cases analyzed to (b) participants to (c) sample to (d) accessible population to (e) target population?
  • Were sufficient descriptive results reported clearly, accurately, appropriately? Sample size? Unit of analysis?
  • Were appropriate demographic breakdowns (i.e., subgroup descriptions) provided for the sample?
  • Based on sampling processes, to what population are the results generalizable (e.g., local generalizability)?
  • Are the results useful if not generalizable (e.g., transferable, quantitative case study)?

What theoretical/empirical/practical EVIDENCE/SUPPORT do they provide for an argument of “CONSTRUCT VALIDITY” (a.k.a., Measurement/Psychometric Validity: are the data useful and meaningful?)

Data collection procedures used and how the authors verified the accuracy of the data.

  • Were the data collection procedures clear and replicable? Were they consistent for all cases?
  • Were data collected directly by the researcher or through intermediaries/assistants? Did they screen the data?
  • Were ethical principles followed (e.g., informed consent, confidentiality/anonymity)?

Variables of interest, both conceptually and operationally – including dependent, independent (with levels/subgroups), & other variables in the design (e.g., control, mediating, moderating).

  • Were all variables in the analyses defined, both conceptually and operationally?
  • Did operational definitions match conceptual definitions? Did they use multiple measures for each variable?
  • Had the instruments been used “successfully” by other scholars? Did they avoid single-item measures?
  • Did they provide data collection materials (e.g., interview protocol, questionnaire items, directions, invitation)?
  • What evidence of validity and reliability was provided from the development/previous use of measures?

Psychometric validity and reliability information provided as evidence to support the quality of the data and/or measurements in this study (i.e., not how instruments worked in previous studies).

  • Did the authors perform psychometric validity and reliability analyses on their own data?
  • Were appropriate inter-coder/inter-rater reliability and agreement techniques used, if necessary?
  • Were issues of trustworthiness and credibility of the data addressed (e.g., social desirability)?

The appropriateness of the data used to answer the research questions.

  • Did respondents/participants have the knowledge, information, and/or traits needed to answer the questions?
  • Was self-report data collection used when another method would have been better or possible?

What theoretical/empirical/practical EVIDENCE/SUPPORT do they provide for an argument of “INTERNAL VALIDITY” (design validity for the relationships/causal relationships among variables?)

The research design (e.g., descriptive, correlation, experimental) and its appropriateness to answer the research questions. All designs can be both flawed and useful.

  • Was the design justified in regard to the research question?
  • Were ethical issues managed appropriately (e.g., potential harm/risk, discomfort)?
  • Were strengths and potential limitations of the design addressed?
  • Was a pilot study used to test the design and the data collection procedures?
  • Did they justify design decisions or adaptations used to handle special circumstances or conditions?

How the research design allowed for strong conclusions about relationships.

  • Was there restriction of range in the data? Were data consistent across subgroups?
  • Were there potential extraneous (“third”) variable concerns? Other issues that could bias correlational results?

How the research design allowed for strong conclusions about temporal ordering of variables.

  • Were data collected simultaneously or longitudinally?
  • If the design did not establish temporal ordering, did the theoretical arguments justify temporal relationships?
  • Did they describe all manipulated variables and justify the treatment levels?
  • If not manipulated (e.g., trait or characteristics), how were independent variables determined or collected?
  • Were manipulations verified (e.g., treatment integrity, implementation fidelity, compliance, adherence)?

How the research design controlled potential alternative explanations for the results.

  • How well were threats to internal validity handled (e.g., maturation, history, selection, attrition, regression, testing, instrumentation, contamination, rivalry, demoralization, researcher impact)?
  • Was random selection, random assignment, blinding, double-blinding used wherever possible?
  • How equivalent were the groups compared prior to experimental manipulation?
  • How well were potentially confounding variables controlled?
  • Was their design strong enough to overcome “confirmation bias” (i.e., finding what they expected to find)?
  • Were replication or cross-validation results provided (e.g., cross-validation, data-splitting/hold-out samples)?

What theoretical/empirical/practical EVIDENCE/SUPPORT do they provide for an argument of “STATISTICAL CONCLUSION VALIDITY” (are the results/conclusions reasonable and supported?)

Answers/conclusions they reached about major research questions/problems investigated.

  • Did they answer their research questions? Did they support those answers with their own data or results?
  • Did they avoid inappropriately strong causal conclusions or inappropriate interpretations of nonsignificant results?
  • Did they emphasize the “right” results? Did they miss important results (did they “bury the headline”)?

Methods (e.g., statistical tests) used to analyze the data, including all descriptive and inferential methods relevant to the major research questions, including post hoc tests (if appropriate).

  • Were all procedures and methods of analysis clear, and appropriate to answer the research question?
  • Were all relevant results presented clearly and appropriately (e.g., statistic, df, p, multiple hypothesis testing)?
  • Were null hypotheses and associated statistical significance tests reported clearly and completely?
  • Were standard errors calculated correctly (e.g., assumption violations, design effects, bootstrapping)?
  • Were the correct analyses used for the chosen unit of analysis and/or sampling techniques?
  • Were all the relevant and important variables included (e.g., was there model specification error)?
  • Were analyses and interpretations appropriately exploratory or confirmatory?

Statistical assumptions tested and outlier diagnostics used.

  • Were data randomly and independently collected? If not, were appropriate analyses performed?
  • Was evidence presented that confirmed whether statistical assumptions were met? Did they not test any?
  • Was evidence presented that confirmed whether outliers were diagnosed? What impact could they have had?
  • Did they do sensitivity (what-if) analyses for robustness (e.g., bias, statistical method, confounding, outliers)?

Effect sizes and/or confidence intervals (i.e., practical significance) reported.

  • Were all appropriate effect sizes and/or confidence intervals reported and interpreted appropriately?
  • How large were group differences? How strong were relationships? Were effect sizes large enough to be useful?
  • Were tables and figures used to illustrate appropriate points? Were they presented and described clearly?

Major conclusions and implications made based on the results provided for this study.

  • What did the authors learn? Did conclusions make sense given the theory being examined or developed?
  • What threats to external validity existed that were not addressed?
  • Were the results synthesized/integrated appropriately (e.g., meta inference)?
  • Were theoretically appropriate recommendations made both for practitioners and for future researchers?
  • Were conclusions appropriate to the design, argued convincingly, and based on their reported results?

Important and appropriate validity limitations recognized and reported (or not).

  • Were limitations or unsupported assumptions so egregious as to nullify or jeopardize the conclusions?
  • Were important limitations missed or ignored?
  • Were any interesting results ignored (even if they are counter to the hypotheses of the study)?
  • Were potential alternative explanations or alternative variables for the results considered or ignored?

Connections made from the results to the literature and how the results fit with theory.

  • Were results put into context by connecting them to extant literature in the field? Were any results ignored?
  • In addition to outcomes, were relationships among predictors attended to sufficiently well?
  • What are the implications for the field if the author’s conclusions are correct, or incorrect?

FINAL EVALUATION: Indicate whether you would feel comfortable using this study as evidence to support an argument in your own literature review.

  • Research is a rhetorical process. Did they write well enough to convince (e.g., headings, evidence, references)?
  • Were corroborating results (e.g., triangulation, cross validation) provided?
  • What important evidence was missing from their reporting of the results? Did the authors “over reach” their conclusions? Does anything written or absent cause you to doubt the quality of their research?
  • What would they have needed to do differently for their results to be more convincing?
  • Did they address all validity issues (internal, external, construct, statistical conclusion, theoretical)?

Chapter 2 Appendix J: Data Manipulation Skills you will probably need for quantitative research:

  1. Give variables “really useful” names and labels (short names are best, but never sacrifice CLARITY for BREVITY)
  2. Give categorical variables “really useful” value labels to indicate what the name of each category is (e.g., 0 = control, 1 = experimental or 1 = low, 2 = medium, 3 = high)
  3. Enter new data
  4. Move rows or columns around for easier access
  5. Add or delete rows (cases) or columns (variables)
  6. Recode groups to combine groups into fewer groups
  7. Recode agree/disagree ordinal scale items into agree and disagree as categories
  8. Change string data into numeric data
  9. Set missing values so cases with those values are not wrongly included in the statistical calculations (e.g., set -1 or 999 as missing values)
  10. Recode scale items so that all items have the same direction (e.g., recode negatively-worded items so the scores represent a positive attitude)
  11. Rank scores
  12. Create standardized scores
  13. Create groups (i.e., quantiles or “ntiles”) out of ordinal or scale variable data
  14. Compute total scores (using calculations and formulas like sum or sum.4 or sum.11)
  15. Compute average scores (using calculations and formulas like mean or mean.5 or mean.9)
  16. Compute new variables using formulas (e.g., difference scores, averages, rescaled scores)
  17. Split a file into subsets for analyses
  18. Select a subset of cases for analyses (several of these skills are illustrated in the R sketch below)
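
A minimal base-R sketch illustrating several of these skills on an invented data frame; all variable names, codes, and cut points are hypothetical (the comment numbers refer to the list above):

    # Hypothetical raw data
    dat <- data.frame(
      id     = 1:6,
      group  = c(0, 0, 1, 1, 1, 0),       # 0 = control, 1 = experimental
      q1     = c(5, 4, 2, 999, 3, 4),     # 1-5 agreement item; 999 = missing code
      q2_rev = c(1, 2, 4, 5, 3, 2)        # negatively-worded 1-5 item
    )

    # 9. Set missing values
    dat$q1[which(dat$q1 == 999)] <- NA

    # 2. Value labels for a categorical variable
    dat$group <- factor(dat$group, levels = c(0, 1), labels = c("control", "experimental"))

    # 10. Reverse-score the negatively-worded 1-5 item so high = positive attitude
    dat$q2 <- 6 - dat$q2_rev

    # 14/15. Total and average scale scores (the average ignores missing values here)
    dat$total <- rowSums(dat[, c("q1", "q2")])
    dat$avg   <- rowMeans(dat[, c("q1", "q2")], na.rm = TRUE)

    # 12. Standardized (z) scores
    dat$z_avg <- as.numeric(scale(dat$avg))

    # 13. Create groups (tertiles) from a scale variable
    dat$avg_group <- cut(dat$avg,
                         breaks = quantile(dat$avg, probs = c(0, 1/3, 2/3, 1), na.rm = TRUE),
                         include.lowest = TRUE, labels = c("low", "medium", "high"))

    # 18. Select a subset of cases for analyses
    experimental_only <- subset(dat, group == "experimental")
    dat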