Chapter 11:The Elements of Hypothesis Testing

Back to Chapter Index Page

INTRODUCTION

In this chapter, we will continue our discussion of hypothesis testing. As you read the chapter, keep in mind the key supporting role that the sampling distribution plays in hypothesis testing. The material in this chapter is very important. While it takes some time to assimilate, you will come to understand it as you use the steps presented here in each of the following chapters. This is a foundation chapter and you may want to reread it after each of the following chapters.

We will begin by discussing the procedures data analysts use when they test a single hypothesis. We will define the key terms involved in such a test. We will then consider a study where hypothesis testing was used to see how the test procedures can be implemented and to gain further understanding of the defined terms. Finally, we will further examine both the terms used in the outline and the study and other important terms.

HYPOTHESIS TESTING: AN OUTLINE

The following outline of the steps used in hypothesis testing will be followed with each of the hypothesis testing situations considered in the following chapters. You should do steps 1 through 12 before you collect any data and steps 13 through 20 while you are collecting data and calculating the statistical results.

You should do steps 1 through 12 before you collect any data and steps 13 through 20 while you are collecting data and calculating the statistical results.

1. State the research problem

RESEARCH PROBLEM: A statement in question form concerning a perplexing situation.

2. State the research hypothesis

RESEARCH HYPOTHESIS: A response in declarative form to the question posed in the research problem.

3. State the statistical hypotheses

STATISTICAL HYPOTHESES: Statements in symbolic form concerning the parameter(s) under consideration in the problem and research hypothesis

There are two statistical hypotheses, the null hypothesis, $H_0$ , and the alternative hypothesis, $H_A$.

NULL HYPOTHESIS, $H_0$: A statistical hypothesis that states what is assumed, or is known, to be true for the mean of a sampling distribution
ALTERNATIVE HYPOTHESIS, $H_A$: A statistical hypothesis that reflects the prediction made in the research hypothesis concerning the mean of a sampling distribution

4. Determine valid and reliable measures of the dependent and independent variables

VALIDITY: The degree to which a measurement measures what it purports to measure
RELIABILITY: The degree to which a test or other instrument of evaluation measures consistently whatever it does in fact measure
INDEPENDENT VARIABLE: In an experiment, a variable that can be manipulated. In a nonexperiment, it is referred to as the “explanatory variable” or as the “predictor variable” in regression analysis.
DEPENDENT VARIABLE: A variable whose measurement may be partially dependent on the value of an independent variable.

5. Choose a test statistic

TEST STATISTIC: A function of a sample of observations that provides a basis for testing a statistical hypothesis

6. Consider the assumptions necessary for the statistical test to be valid

VALID STATISTICAL TEST: The use of a test statistic under conditions where the level of significance and power are as specified

7. Consider the risks (potential errors) involved when the assumptions of a statistical test are violated

TYPE I ERROR: The probability of rejecting a true null hypothesis
TYPE II ERROR: The probability of failing to reject a false null hypothesis

8. Decide on the risk of rejecting a true null hypothesis that you are willing to take; that is, select a level of significance

LEVEL OF SIGNIFICANCE: The probability of making a Type I error. It is used as a criterion to decide if a sample statistic could reasonably be assumed to have come from a given population

9. Decide on the probability you will have of rejecting the null hypothesis if the null hypothesis is false; that is, decide on the power of your statistical test

POWER: The probability of rejecting the null hypothesis when the null hypothesis is false

10. Determine what would be an important result to detect; that is, determine an a priori effect size

A PRIORI EFFECT SIZE: A measure of the degree to which the null hypothesis is false

Note that the a priori effect size is what the researcher expects based on theory, previous empirical research, literature review, pilot study, and/or some matter of importance. This a priori effect size differs from the effect size the researcher actually obtains in the data. This actual effect size should be reported in the results of the study as the results are interpreted.

11. Based on the direction of the alternate hypothesis, the test statistic, the level of significance, the power, and the a priori effect size, determine an appropriate sample size

If you are unable to attain a given sample size, for example, because of the availability of units, then the level of significance, power, and/or the a priori effect size may be altered

12. Determine the critical values. (This step may be omitted if a p-value is compared to the level of significance to determine a significant result; see step 19.)

CRITICAL REGION: The values of a statistic that lead to the rejection of the null hypothesis
CRITICAL VALUE: A value of a statistic that bounds a critical region

13. Select a random sample of units, and if possible, randomly assign the units to treatments.

TREATMENT: In experimentation, a stimulus that is applied in order to observe the effect on the experimental situation or to compare its effect with those of other treatments. In practice, “treatment” can refer to a physical substance, a procedure, or anything that is capable of controlled application according to the requirements of the experiment

14. Measure the units and record the measurements as input (e.g., for jamovi)

Here were are collecting and recording the data.

15. Check the recorded measurements for accuracy and identify potential outliers

OUTLIER: A score that cannot reasonably be assumed to belong to a given data set

16. Check on the assumptions necessary for a valid test statistic

Assumptions vary by statistical test, depending on the mathematical derivation of the specific statistic. Most of the statistics we cover in this textbook are part of a larger modeling approach, called the General Linear Model (GLM). GLM statistics generally require (a) reliable instruments, random and independent sampling, (c) normally distributed variables or errors, (d) linearity, and where there are multiple groups, multiple factor levels, or multiple predictor values (e) equality of variance across those levels (this assumption is also called homoscedasticity or homogeneity of variance).

17. Compute descriptive statistics and a point estimate of the population parameter

POINT ESTIMATE: A statistic used to infer the value of a population parameter

18. Calculate a test statistic

Here we are usually using a statistics computer program to calculate our results.

19. Use critical values or p-values to decide on the acceptability of the null hypothesis

p-VALUE: The probability of finding another sample statistic, in the direction predicted in the alternate hypothesis, that is more deviant from the population parameter than is the sample statistic

20. Calculate a confidence interval and actual effect sizes

CONFIDENCE INTERVAL: An interval that consists of those values of a statistic that have a given a priori probability of including the population value
ACTUAL EFFECT SIZE: The same statistic used for the a priori effect size, however, now it is calculated using actual statistical results from our study

You may note that testing an hypothesis is much like placing a bet or playing a game. In steps 1 through 12 you establish the rules of your bet, and in steps 13 through 20 the results of the bet or game are decided. This analogy will be discussed in more detail later in this chapter, and you will be reminded of it when we consider the hypothesis testing steps in each of the following chapters.

HYPOTHESIS TESTING: A SIMPLE STUDY

In this section, we will consider a hypothetical study conducted by a scientist in order to see how the preceding steps might be carried out in an actual study. In the following study the Wechsler Adult Intelligence Scale (WAIS) is given to delegates to the United Nations. In the study we consider the WAIS to be a culture free test which is appropriate across nations. We hope you appreciate the humor in the example.

Stating the Research Problem, Research Hypothesis, and Statistical Hypotheses: Steps 1 through 3

Mary Barth, a political scientist, was studying the delegates to the United Nations. Her research problem statement was: Is the mean on the Wechsler Adult Intelligence Scale (WAIS) of the delegates to the United Nations equal to 100? Based on her past research, e.g. the decisions made at the United Nations, her research hypothesis was: The mean on the Wechsler Adult Intelligence Scale (WAIS) of the delegates to the United Nations is less than 100.

Her statistical hypotheses were written as follows:

\[ \begin{align} H_0 &: μ = 100 \\ H_A &: μ < 100 \end{align} \] ### Determining Validity and Reliability: Step 4

Mary checked in Buros’ Mental Measurements Yearbook and found that her dependent variable, the full scale IQs on the WAIS, yielded reliability coefficients of .97 across three age samples. Mary found that the scores in her own data showed a reliability of Cronbach’s alpha = .94. Mary also found that the concurrent and construct validity of the WAIS had been studied. She found that the WAIS correlated around .80 with the Stanford-Binet and that a factor analysis of the instrument supported “the use of the Full Scale IQs, because of the large general factor in the test scores” (Anastasi, 1969, p.282). Mary’s independent variable was a measure of whether a subject was a member of the United Nations or not. She attained a list of current members from the United Nations and through of assurances from U.N. officers and random checks of the listed members satisfied herself that the delegate list was accurate and up-to-date.

Considering the Test Statistic, the Assumptions, and the Violations of Assumptions: Steps 5 through 7

Mary chose the z statistic as her test statistic:

\[ \begin{equation} z = \frac{M-\mu} {\sigma_M} \\ \tag{11-1} \end{equation} \] Mary knew that this z statistic required that her sample be selected at random from a normal population. She also knew that this statistic would be valid even if the assumption of a normal population distribution was violated but that the statistic would not be valid if the scores of the delegates could not be considered to be independent.

Deciding on the Level of Significance, Power, a priori Effect Size, and Sample Size: Steps 8 through 11

Level of Significance. Mary decided that she would reject the null hypothesis ($H_0$) in favor of the alternate hypothesis ($H_A$) if she found a sample mean that would be likely to occur by chance less than 5 times in 100, that is, p(M < 100) < .05. Otherwise, she would consider the mean IQ at the U.N. to be 100, that is, she would fail to reject the null hypothesis. In other words, Mary established her level of significance to be .05.

Power, A Priori Effect Size, and Sample Size

Mary wanted to have a high probability of rejecting the null hypothesis if the null hypothesis was false; that is, Mary wanted her statistical test of the null hypothesis to have high power. She felt that if the average IQ at the U. N. was 92 or lower, then this would be an important result to detect. In other words, Mary felt that it was important to detect a U.N. delegate population mean that differed from the population mean of adults by 8 points or more. The standardized effect size is this difference written in standard deviation units, i.e., standardized effect size = 8/15 = .53. If a difference of 8 points actually existed, she would be able to detect a mean this deviant with a probability of approximately .85 (power), if she selected a sample of 25 delegates.

Determining the Critical Value: Step 12

In table A.1 of appendix A, Mary found that she would obtain a z statistic less than 1.65 only 5% of the time if the population mean IQ was 100. Therefore, she decided to reject the null hypothesis if she obtained a sample mean that yielded a z statistic that was less than 1.65. Here, 1.65 is her critical value, and the scores less than or equal to 1.65 constitute the critical region.

Selecting a Random Sample, Measuring, Checking Records and Assumptions, and Computing Descriptive Statistics: Steps 13 through 17

After establishing the preceding conditions, Mary selected a random sample of 25 United Nations delegates and hired a psychologist (who was unaware of the purpose of the study) to administer the Wechsler Adult Intelligence Scale. After recording the measurements in an jamovi data file, she printed the scores and checked them to be sure they were accurately recorded. Mary then used jamovi’s transformation capabilities to transform the scores to z scores, and checked the z scores for outliers. She then constructed a range bar plot, a histogram, and a normality plot to check on the assumption that the scores came from a normal distribution. When no violations of the test statistic’s assumptions were found, Mary used jamovi’s descriptive statistics programs to obtain point estimates of the measures of central tendency, variability, skewness, and kurtosis. She found the mean IQ of the U.N. delegates in her sample to be 94, that is, her point estimate of the population mean was 94.

Calculating a Test Statistic: Step 18

Mary transformed the sample mean into a standard score using equation (11-1) as follows:

\[ z = \frac{94-100}{3} = -2.00 \] The population mean on the Wechsler Adult Intelligence Scale is 100, and its population standard deviation is 15. Therefore, Mary knows that the mean of the sampling distribution of the means of sample U.N. delegates is 100 and that the standard deviation of the sampling distribution of the means is σM = σ / √n = 15 / √25 = 3.

Deciding on the Acceptability of $H_0$: Step 19

Approach 1: Critical Value (Reject $H_0$ if sample statistic is greater than the critical value).

Because Mary found the value of the z statistic based on the sample mean (that is, -2.00) to be less than the critical value (that is, -1.65) she rejected her null hypothesis and concluded that the mean IQ at the United Nations is less than 100. Note that her sample mean is 94 and that she wanted to detect a population mean of 92 or less. Here, 94 is Mary’s best estimate of what the population mean actually is, but it is more likely that this sample mean came from a population whose mean is 92 than whose mean is 100.

Approach 2: p-level (Reject $H_0$ if p value is less than the level of significance).

Since the probability of obtaining a z statistic less than -2.00 by chance is .0228, which is less than the predetermined level of .05, Mary decided to reject the null hypothesis in favor of the alternate hypothesis. That is, she decided that the mean IQ at the United Nations is less than 100 because the p-level of .028 was less than .05, her level of significance.

Approach 3: Confidence Interval (Reject $H_0$ if Null Hypothesis value is NOT in the confidence interval).

See the next section for more details.

Calculating a Confidence Interval and Actual Effect Sizes: Step 20

Since the best point estimate of the population mean is the sample mean, Mary decided to construct a 95% confidence interval based on the sample mean of 94. She did this by calculating the mean below which 95% of the sample means would fall if the population mean was actually 94. By algebraically manipulating the z formula (equation 11-1), we can calculate this upper limit mean from the z statistic critical value:

\[ \begin{equation} M = \mu + (z)(\sigma_M) \tag{11-1a} \end{equation} \] Using this formula, she found the upper limit as:

\[ M = 94 + (1.645)(3) = 94 + 4.95 = 98.935 \] Since the sample mean of 94 is probably in error by some unknown amount, Mary felt confident that the population mean is less than 98.935.

Note that the mean of 100 from the null hypothesis is not in the interval from 98.935 to ∞, but the population mean that she thinks is important to detect (that is, 92) is in the interval.

The actual effect size is the same as was used a priori, but calculated now based on the actual results obtained in the study. The standardized mean difference effect size (which will later be called d) is typically calculated using the absolute value of the actual mean difference (here, the size of the difference is what we are most interested in, because we already know which mean was larger). Also, for the z test, we use the population standard deviation of the variable in order to put the mean difference onto the z score scale:

\[ d = \frac{|M-\mu|}{\sigma_X} = \frac{94-100}{15} = \frac{6}{15} = 0.40 \] This is interpreted to mean that the mean difference is 0.40 standard deviation units, or that the means are 0.40 standard deviations apart. That is, a 6-point difference on a scale with a standard deviation of 15 is a 0.40 standard deviation unit difference. Note that this effect size is interpreted in terms of the original z score scale, not in terms of the sampling distribution of the mean that was used for the z statistic. What the effect size tells us is “how different are the means on the original scale.”

HYPOTHESIS TESTING: A GENERAL DISCUSSION OF TERMS

Problem, Research Hypothesis, Statistical Hypotheses

Problem and Research Hypothesis

In chapter4, you were introduced to the problem and research hypothesis in research design. You found that the problem is stated in question form and that the research hypothesis is a response to that question based on past research, theory, and the genius of the researcher.

In this book, the problem and research hypothesis are kept at a simple level so that we may focus on statistics and statistical hypotheses. [For further discussion of problems and research hypotheses consult books on research design such as Barcikowski (1983), Cook and Campbell (1979), or Kerlinger (1973).] The relationships between the problem, research hypothesis, and the statistical hypotheses, however, form a solid foundation for a good data analysis. We will now consider the relationship between the research hypothesis and the statistical hypotheses.

Statistical Hypotheses

The statistical hypotheses have the following characteristics:

For a given question, there are two statistical hypotheses, which are called the null hypothesis and the alternative or alternate hypothesis.

The null hypothesis is labeled $H_0$, and the alternative hypothesis is labeled $H_A$ (some writers use $H_0$ and $H_1$).

The null and alternative hypotheses are written in symbolic form using the symbol μ for the parameter under consideration.

The null and alternative hypotheses may be written so that the parameters of interest are on the left-hand side of an equal sign and a constant is on the right hand side of the equal sign. For example,

\[ \begin{align} H_0 &: \mu = 100 \\ H_A &: \mu \ne 100 \\ &or \\ H_0 &: \mu_1 – \mu_2 = 0 \\ H_A &: \mu_1 – \mu_2 \ne 0 \\ &or \\ H_0 &: \mu = 100 \\ H_A &: \mu < 100 \\ &or \\ H_0 &: \mu = 100 \\ H_A &: \mu > 100 \end{align} \] Some writers, such as Kirk (1982), state directional statistical hypotheses as:

\[ \begin{align} H_0 &: \mu \ge 100 \\ H_A &: \mu \lt 100 \\ &or \\ H_0 &: \mu \le 100 \\ H_A &: \mu \gt 100 \end{align} \] When = is used in the null hypothesis (as will be done in this book), the null hypothesis is referred to as an exact null hypothesis. When ≤ or ≥ is used in the null hypothesis, the null hypothesis is referred to as an inexact null hypothesis. (The use of either form is a matter of some debate but is generally considered a matter of preference.)

The constant in the null hypothesis is the mean of the sampling distribution that will be used in the test of the null hypothesis.

The alternative hypothesis reflects the prediction made in the research hypothesis.

Null Hypothesis

In the null hypothesis, you state what is assumed, or known, to be true for the mean of the sampling distribution. For example, in the preceding story, Mary Barth knew from published test norm information that the population mean of the Wechsler Adult Intelligence Scale was 100.

In experiments involving two or more groups, you generally assume that the groups do not differ.

That is, that the mean of the sampling distribution of mean differences is equal to zero. For example, in the cholesterol experiment described in chapter 4, if taking an herbal supplement pill did not affect a woman’s cholesterol level, we could assume that the difference between the two population means of the Yes-Pill and No-Pill treatments would be zero.

Alternative Hypothesis

As was indicated previously, the alternative hypothesis reflects the prediction made in the research hypothesis. There are four types of predictions that can be made in the research hypothesis, and there are three types of alternative hypotheses. To see how the research and alternative hypotheses are matched, consider the U.N. delegate study. Below are the four possible research hypotheses and their corresponding alternative hypothesis.

\[ \begin{align} \text{Research hypothesis} &: \text{The mean IQ of the delegates to the U.N. is greater than 100.} \\ \text{Alternative hypothesis} &: H_A: μ > 100 \end{align} \]

\[ \begin{align} \text{Research hypothesis} &: \text{The mean IQ of the delegates to the U.N. is less than 100.} \\ \text{Alternative hypothesis} &: H_A: μ < 100 \end{align} \]

\[ \begin{align} \text{Research hypothesis} &: \text{The mean IQ of the delegates to the U.N. differs from 100.} \\ \text{Alternative hypothesis} &: H_A: μ ≠ 100 \end{align} \]

\[ \begin{align} \text{Research hypothesis} &: \text{The mean IQ of the delegates to the U.N. is equal to 100.} \\ \text{Alternative hypothesis} &: H_A: μ ≠ 100 \end{align} \] Note that the fourth possible research question has the same alternative hypothesis as the third. The difference comes in interpretation of the results, not in the creation of the hypotheses. There are range null hypotheses, and two-one-sided-tests, available, but still relatively uncommon, that allow a researcher to more directly test the research hypothesis in possibility 4.

Directional Alternative Hypotheses: Researcher Believes $H_0$ Is False.

Mary Barth’s selection of one of the preceding research and alternative hypothesis pairs is dependent on the information (for example, past research, theory, and so on) that is available. For each of the first two research hypotheses, Mary would be able to specify what is known as a directional or one-tailed alternative hypothesis. In these two situations, Mary could predict whether she expects the true population mean to be on the right (that is, in the right tail of the sampling distribution, where μ > 100) or on the left (that is, in the left tail of the sampling distribution, where μ< 100) of the hypothesized population mean.

Given the first alternative hypothesis, Mary could reject the null hypothesis only if a sample mean were found that is far to the right of the population mean. This possibility is shown in figure 11a(i).

Given the second research hypothesis, Mary could reject the null hypothesis only if a sample mean were found that is far to the left of the population mean. This possibility is shown in figure 11a(ii).

Nondirectional Alternative Hypothesis: Researcher Believes $H_0$ Is False.

If Mary selected the third research hypothesis, she would have believed that the population mean was not equal to 100, but she would not have been sure (perhaps the literature produced conflicting results) if the true population mean was above or below 100. Therefore, for the third research hypothesis, Mary would select a nondirectional or two-tailed alternative hypothesis; that is, an alternative hypothesis that simply states that the population mean does not equal 100 but does not indicate the direction. Given a two-tailed alternative hypothesis, the null hypothesis is rejected when a sample mean is observed that is far to the left or far to the right of the population mean. This possibility is shown in figure 11a(iii).

Nondirectional Alternative Hypothesis: Researcher Believes $H_0$ Is True.

The situation for the fourth research hypothesis is one where Mary believes that the average intelligence of the U.N. delegates does not differ from that found in the population. In this case, a nondirectional alternative hypothesis is the only logical one, since it would make no sense to believe that the population mean was equal to 100 and then to state an alternative hypothesis that contradicted this belief by indicating a direction away from 100 where the mean was expected to be. As in the preceding two-tailed test, the null hypothesis is rejected when a sample mean is found that is far to the left or far to the right of the population mean, see figure 11a(iv).

It should be noted that in this case, the researcher is hoping to fail to reject the null hypothesis. Recall that failing to reject the null hypothesis does not provide much support that the null hypothesis is true, only that we cannot reject it based on our current data. Whenever the researcher believes that the null hypothesis is true, she wants to be sure to have a large sample size and large power. We cannot have confidence in the null hypothesis being true if we have insufficient power to reject a false null hypothesis.

Reliability and Validity

A general discussion of the issues involved in measuring the reliability and validity of the independent and dependent variables in a study is beyond the scope of this book. Interested readers can consult Ebel (1979), Marshall and Hales ( 1971), Nunnally (1978), O’Grady (1982), Thorndike (1971), or Thorndike and Hagen (1977).

You should know, however, that both reliability and validity are primarily measured using the Pearson product-moment correlation coefficient. The following discussion of reliability and validity will serve our purposes in this book.

Reliability is defined as the degree to which a test or other instrument of evaluation measures consistently whatever it does in fact measure (Good, 1973, p. 488). A measure of this “consistency of measurement” can be arrived at using a variety of methods. For example, an instrument (or two different forms of the instrument) can be used to measure a unit at two different times. Here, the correlation between the resulting scores is used as a measure of the reliability of the instrument. Also, what is referred to as an internal consistency measure of reliability is found from the correlations among the items of an instrument that has been administered only once.

Validity is the degree to which a measurement measures what it purports to measure. Like reliability, the validity of a measure can be arrived at in a variety of ways. For example, the correlation between the scores on an instrument and the scores on an instrument used to measure the same thing is known as concurrent validity. The correlations among the items on an instrument are examined using a technique called factor analysis to arrive at what is known as construct validity. The predictive validity of an instrument is found as the correlation between the scores on the instrument and the scores on an outside criterion. For example, you might consider the predictive validity of the Scholastic Aptitude Test (measured prior to college entrance) as its correlation with college grade-point average (measured at graduation).

Figure 11a Four combinations of research and statistical hypotheses with resulting critical regions

The Test Statistic, Assumptions, and Violations of Assumptions

A test statistic is a function of a sample of observations that provides a basis for testing a statistical hypothesis (Kendall and Buckland, 1976, p. 152). For most problems considered by a data analyst, there are a variety of test statistics that could be chosen. For example, in considering the question of whether taking an herbal supplement pill increases the cholesterol levels of women, the two groups of data in chapter 4 could be analyzed using a chi-square statistic (median test), the U-statistic (Mann-Whitney or Wilcoxon Rank-Sum test), or a t statistic (Student’s t test), to name a few. The choice between these test statistics usually depends on the questions asked, the test assumptions, and/or the power of the test. In the following chapters, we will consider tests that are most powerful for the questions asked, that is, most likely to detect a false null hypothesis.

For each of the test statistics that we will consider, there will be a set of assumptions upon which the test statistic is based. It is important for you as a data analyst to be aware of the assumptions that a test statistic is based on. You must also be aware of the risks you are taking when these assumptions are violated. (The “risks” here are Type I and Type II errors, which are discussed in the following sections.) If you are aware of the risks you are taking when an assumption is violated, you may be able to take actions that will reduce the risks. For example, if you know that violating the assumption of normality increases your probability of making a Type I error, you may consider the use of a transformation on your data to make it more nearly normal or you may start with a lower level of significance.

Level of Significance

The level of significance, usually denoted by α (alpha), is a probability that is used as a criterion to decide if a sample value could reasonably be assumed to have come from a given population. The level of significance, or alpha level, is established before data is collected. In the study of U.N. delegates, Mary Barth decided to select the probability of .05 as her level of significance. Mary then selected a random sample of delegates and calculated their mean IQ.

Statisticians have named the error of rejecting a true null hypothesis as a Type I error. The probability of making a Type I error is the probability expressed as the level of significance, α. When the null hypothesis is true, the probability of selecting a sample that would lead to rejection of the null hypothesis is α. Therefore, most researchers attempt to keep the level of significance small. The conventional levels of significance are .10, .05, and .01.

Statistical Power

Power is the probability of rejecting the null hypothesis when the null hypothesis is false. Power is established before the data is collected. The actual calculation of power is usually difficult for the beginning statistics student to grasp (see appendix G for an elementary presentation of the calculation of power). However, what a researcher is attempting to accomplish by setting a power level (that is, control Type II error) is not difficult to understand. Therefore, in this book we will select a given value for power, just as we will select a given value for the level of significance. Power is typically chosen to be greater than or equal to .80, so that typical power values are .80, .85, .90, .95, and .99.

Type II error

If the null hypothesis is false, then we want a high probability of rejecting the null hypothesis, that is, high power. One minus power, usually denoted by β (beta), is the probability of failing to reject a false null hypothesis. Failing to reject the null hypothesis when it is false is called making a Type II error. Therefore, β is the probability of making a Type II error, and 1- β is power.

Choosing the Level of Significance and Power

In the previous two sections, conventional values were recommended for the selection of your level of significance and your power. Prior to selecting these values, you should give some thought to the errors that they control. That is, which type of error, Type I, controlled by your level of significance, or Type II, controlled by your power, is more serious; or do you consider both types of errors to be serious. The following two examples illustrate this point.

Example: Type I Error

Educators frequently conduct “method studies.” In such studies, different methods of instruction are compared with the traditional method of teaching a subject. If one of the methods is found to be significantly better than the traditional method it, probably would be considered for adoption by a school system. Such an adoption might require extensive expenses by the school system for the retraining of teachers and the purchase of new materials. In such a case, a Type I error, where a difference was found and none actually existed, would prove disastrous. Therefore, in this type of study, the researcher would want to keep the level of significance very low, say at .001, .005, or .01.

Example: Type II Error.

Consider a methods study where the new method to be introduced is inexpensive to implement, but the potential benefits to students are expected to be great. In this type of study, a Type I error would not be that serious, since implementing the new method would not change the benefits to students. A Type II error, however, could deprive students of tremendous learning benefits. Here, the level of significance might be set at .10, if necessary, so that power would be as high as possible.

A PRIORI EFFECT SIZE

Effect Size is a measure of the degree to which the null hypothesis is false. The effect size is calculated differently for different statistical tests. Therefore, we will discuss this term for each statistical test that we encounter. Generally, the effect size will be selected based on a rule of thumb or calculated directly from a given equation. For example, given a single group of units, standardized effect size (denoted by d) is defined in this book (and usually attributed to Cohen, 1988, and called Cohen’s d) as:

\[ \begin{equation} d = \frac{|\mu_A-\mu_0|}{\sigma} \tag{11-2} \end{equation} \] Here, $\mu_A$ is the mean that you feel it is important to detect, $\mu_0$ is the value of the population mean found in the null hypothesis, and σ is the population standard deviation. Therefore, for a single sample study, the a priori effect size d is a measure of the standardized distance an important mean is from a hypothesized mean. For example, Mary felt that it was important to detect a mean IQ as low as 92; therefore, $d = |92 – 100| /15 = 0.53$ was her a priori effect size.

Note that the mean difference itself is an effect size. However, because of the scale used for the original variables, the difference is not always meaningful or interpretable. Therefore, standard practice is to use standardized mean difference effect sizes, which have standard deviation units as the scale. Standardized scores and this standardized mean difference becomes more and more familiar to applied researchers the more research they do and read. Further, it is applicable and interpretable no matter what the scale of the variables.

In the example, Mary wanted power to be .85, if the population mean of the IQs at the U.N. was 92. That is, if the mean intelligence of the U.N. delegates was actually 92, the probability was .85 (85 times out of 100) that Mary would draw a sample mean that would cause her to reject her null hypothesis.

The most difficult task that researchers face in a study is to select the specific alternative parameter that, if true, is important to detect. The selection of a value for the specific alternative hypothesis is made on the basis of past research, theory, and the genius of the investigator. That is, the specific alternative hypothesis is made on an informed subjective basis, and the data analyst should be able to state (and defend) his or her reasons for the selection.

In the example, Mary felt that if the population mean of the U.N. delegates was 92 or less, she wanted to be fairly certain to detect this fact. If the U.N. population mean was not at least as low as 92, however, the mean might as well be considered to be 100, because she was not interested in means larger than 92. (Note that here a mean of 130 would possibly cause Mary to reevaluate her theory and past research and to consider conducting another study.)

A priori Effect Size versus Actual Effect Size

Finally, it is important to note that the a priori effect size, described in this section, is different than the actual effect size obtained in the results. That is, we should report an actual effect size after we have calculated our statistics in order to describe our relationship or difference and in order to interpret our results. If there is a difference between means, we want to describe how large that difference is. That standardized mean difference is the actual effect size in a mean comparison study. We use the same Cohen’s d as we did for the a priori effect size, however, now we use the actual statistical results we obtained in the study.

SAMPLE SIZE DETERMINANTS

For a given experiment, sample size is dependent on the:

Test statistic
Power
A priori Effect size
Error variance of the statistic of interest (which includes reliability)
Level of significance
Form of the alternative hypothesis (directional or nondirectional)

The preceding values are referred to in this book as the sample size determinants.

For most of the test statistics in this book, sample size will be found in a table in appendix C, given the level of significance and the a priori effect size. For a single group study, however, sample size is found using the following equation:

\[ \begin{equation} n = \left[ \frac {\sigma [z_{(1-α_P)} + z_{(1-β)}]} {|\mu_A-\mu_0|} \right ]^2 \tag{11-3} \end{equation} \] Here: * $\sigma$ equals the known standard deviation of the population. In Mary’s study, this was the standard deviation on the Wechsler Adult Intelligence Scale, which was 15.

$\alpha_P = \alpha_1 = \alpha$, the level of significance for a one-tailed test, or $\alpha_P = \alpha_2 = \alpha/2$ for a two-tailed test. Mary had $\alpha_P = \alpha_1 = \alpha = .05$ for a one-tailed test.
$z_{(1 - \alpha_P)}$ equals the z score at the $(1 – \alpha_P)$ percentile. Mary had $1–\alpha_P = 1 – \alpha = 1 – .05 = .95$, and $z_{(.95)} = 1.645$ for the one-tailed test.
$\mu_0$ equals the value of the population mean in the null hypothesis. Here, $\mu_0$ = 100.
$1 – β = power = .85$, so that $β = .15$.
$z(1 – β)$ = the z score at the $(1 – β)$ percentile = $z_{(.85)} \approx 1.04$.
$\mu_A$ equals the population mean that you would consider as being significantly different from $\mu_0$. Here, based on past experience, Mary selected $\mu_A$ = 92.

Equation (11-3) will yield accurate values of n for power values greater than .50 and levels of significance less than .50.

When Mary substituted her values into equation (11-3) she found:

\[ n = \left[ \frac {15 [1.645 + 1.04]} {|100-92|} \right ]^2 = 25.34 \approx 25 \] Most data analysts would round this number up to 26, but 25 is easier to work with for demonstration purposes. That is, given a one-tailed test, a z-test statistic, and a level of significance of .05, Mary would need 25 students to have a probability of .85 of detecting an effect size of 8 points.

SELECTING A SAMPLE SIZE

If a particular sample size is too difficult to obtain (for example, because of availability of units) then some alteration of one of the sample size determinants is usually considered. For a given test statistic, you can decrease the sample size needed for a study by increasing the a priori effect size that you wish to detect, decreasing your power, reducing the variance error of the statistic of interest, increasing the level of significance, and/or choosing a one-tailed test over a two-tailed test.

The process that you will usually follow is to select a test statistic and an a priori effect size that you would like to detect and then consider the effects of alterations of the other sample size determinants on sample size. A good analogy is that this process is like tuning a car. By making small adjustments to the car’s timing, carburetor, and so on, you arrive at a car that runs smoothly. Be careful, though. The a priori effect size is what we expect, not what we hope for. We can always calculate smaller sample sizes by increasing the a priori effect size we use. However, if we become too unrealistic in our estimated effect size, our sample size will be much too small. Remember, the effect size is what it is in the population. We are trying to estimate it as accurately as we can in order to choose the most appropriate sample size. We cannot simply say it is larger than it is and then choose a smaller sample size.

We can explore the effects of the sample size determinants on sample size by considering the U.N. delegate example with the level of significance reset at .01 and a two-tailed alternative hypothesis. Here, αP = α2 = .01/2 = .005 and z(1 – .005) = z(.995) = 2.576. The other terms for equation (11-3) remain as they were. Then solving equation (11-3) for n we have

\[ n = \left[ \frac {15 [2.576 + 1.04]} {|100-92|} \right ]^2 = 45.97 \approx 46 \] For a two-tailed test, Mary would like to detect a deviation of 8 points either above or below the hypothesized population mean. In this case, both |100 – 92| and |100 – 108| yield an effect size of 8 points, which is used in the denominator of equation (11-3).

Now, suppose that Mary could only afford a sample size of 25 cases. Since Mary cannot afford to test 46 delegates, she must alter one of the sample size determinants. In the following paragraphs, we will see how modifying each of the sample size determinants can decrease the size of the sample needed for Mary’s study.

Increase the a priori effect size

We can decrease the sample size by increasing the size of the effect that we would like to detect. For example, we can increase the a priori effect size by setting it at 10 points. Equation (11-3) becomes:

\[ n = \left[ \frac {15 [2.576 + 1.04]} {|100-90|} \right ]^2 = 29.42 \approx 30 \] Here, by increasing the a priori effect size from 8 to 10, sample size decreased from 45 to 30. You should be careful in changing the a priori effect size after one has been selected, since the a priori effect size that you arrived at initially is what is important to you.

There are some legitimate changes the researcher can make that would help increase a priori effect size. For example, maintaining strict integrity of the treatment through tighter experiment control can help increase the actual effect size. That is, if the treatment works as well as possible, the difference in means may be larger.

Decrease the Power

If we decrease the power from .85 to .80, z(.80) is .84, and sample size becomes:

\[ n = \left[ \frac {15 [2.576 + 0.84]} {|100-92|} \right ]^2 = 41.02 \approx 42 \] This decrease in power results in a decrease in sample size from 46 to 42.

Decrease the Variance Error of the Statistic of Interest

The variance in the population of interest can be decreased by increasing the reliability of the measuring instrument. For example, if the reliability of the Wechsler Adult Intelligence Scale were increased, the population variance might decrease from 15 to 14. If this were to happen, then the a priori effect size would be:

\[ n = \left[ \frac {14 [2.576 + 1.04]} {|100-92|} \right ]^2 = 40.04 \approx 41 \] Here, sample size would decrease from 46 to 41. (In your further study of research design [for example, see Kirk (1982) or Keppel (1982)], you will find other ways that the variance of a population can be reduced, for example, by sampling from specific subpopulations instead of an entire population.)

Increase the Level of Significance

If the level of significance were to be increased to .05, then αP = α2 = .05/2 = .025, z(1 – .025) = z(.975) = 1.96, and equation (11-3) would yield:

\[ n = \left[ \frac {15 [1.96 + 1.04]} {|100-92|} \right ]^2 = 31.64 \approx 32 \] Here, sample size would decrease from 46 to 32.

Perform a One-Tailed Test

Given a one-tailed test, αP = α1 = .01, z(1 – .01) = z(.99) = 2.326. Then equation (11-3) yields:

\[ n = \left[ \frac {15 [2.326 + 1.04]} {|100-92|} \right ]^2 = 39.83 \approx 40 \] Here the sample size decreased from 46 to 40.

Final Sample Size.

Given that Mary started with sample size determinants that required a sample size of 46 units, she arrived at her sample size of 25 units by modifying some, but not all, of the latter sample size determinants. She kept the z statistic, the a priori effect size at 8 points, her power level at .85, and the population standard deviation at 15. She arrived at her sample size by changing her level of significance from .01 to .05 and conducting a one-tailed test. These changes allowed her to reduce her sample size from 46 to 25 units. She decided that these changes were acceptable after she had been able to study more literature on U.N. delegates and on the Wechsler Adult Intelligence Scale.

Can a Sample Size Be Too Large?

Can a sample size be too large? The answer to this question is an emphatic no. Your sample size can never be too large, but given a large sample of units, your statistical test may lose its value. As sample size increases, the power of your statistical test increases. Given very large samples, your statistical power can become 1.00 for practically any effect size. One way to think about this is that you can “buy” statistical significance if you have enough time, effort, and money to acquire huge samples. Because power is so large for samples so large (i.e., power would be approximately 1.00), this implies that you will reject the null hypothesis for every sample of that size – thereby essentially guaranteeing yourself statistical significance.

If this is the case, however, your statistical test may be significant for effect sizes that you consider to be negligible and unimportant. When this happens, however, your point estimate will be a very accurate estimate of its population parameter, so that you can make a decision on the value of your results based on your a priori informed judgement of what is an important result. Here you could have a significant result that would lead you to retain the null hypothesis. For this reason, no matter what sample size we use, we must always report effect sizes and/or confidence intervals whenever we report a p value.

The ultimate sample occurs when you use your whole population. In this case, the use of statistical tests is meaningless since the differences that occur are the ones that exist. The only way to judge such differences is through the use of an effect size determined on an a priori basis.

For example, consider how Mary would interpret her results if they were based on a sample of 1,000,000 U.N. delegates (if this many existed). Here the power of Mary’s test would be 1.00, and her z statistic would be found as:

\[ z= \frac{94-100} {15/1000} = 400 \] Based on a sample of 1,000,000 cases, Mary would reject any effect size larger than .025. Therefore, Mary’s statistical test is not of value, but her point estimate provides a very accurate estimate of her population mean. Essentially, based on a sample of 1,000,000 cases, Mary would treat the mean IQ of94 as the population mean of the U.N. delegates. Since Mary had decided a priori that she was only interested in the mean IQ if it was less than or equal to 92 (that is, an effect size of 8), however, she would retain her null hypothesis.

Can a Sample Size Be Too Small?

As you can see from the previous discussion, sample size and power are strongly related. If you follow the steps outlined in this chapter, you will always have adequate power and therefore an appropriate sample size. It is not uncommon, however, to read research studies where the researcher has not established a power level for a given statistical test. In such studies, sample size is generally arbitrarily selected. If this is the case, the researcher may find insignificant results because the arbitrarily chosen sample size was too small, and therefore, power was too low. In such studies, the probability of making a Type II error is high.

We return to the U.N. delegate example with the level of significance at .01 and a two-tailed alternative hypothesis. If there is in reality a smaller mean difference effect size than we think, then the sample size we chose above (n = 46) will not be sufficiently large. Solving equation (11-3) for n if we have a mean difference of only 4 (instead of 8, as we had above) gives us:

\[ n = \left[ \frac {15 [2.576 + 1.04]} {4} \right ]^2 = 183.87 \approx 184 \] Note that the sample size we chose above (n = 46) will be far too small for the power we desire. Instead, we need roughly 184 cases when the a priori effect size is smaller. Also, note that there is a nonlinear relationship between changes in a priori effect size and changes in sample size. That is, an a priori effect size half as large requires much more than twice the sample size of the original scenario.

Some authors (e.g., Cohen, 1988) also provide power tables (and going back farther in history, power curves) that can provide the actual power for given level of significance, sample size, and a priori effect size. Thinking from this perspective makes clear that a smaller real effect size for a given sample size results in lower power. Alternatively, a smaller sample size for a given effect size also reduces power.

The Critical Region and Critical Value(s)

The values of a statistic that lead to the rejection of the null hypothesis are called the critical region. A critical value is a value of a statistic that bounds a critical region. The critical region and the critical values are dependent on the level of significance and the alternative hypothesis. If the alternative hypothesis is directional, there is one critical value; but if the alternative hypothesis is nondirectional, there are two critical values. In the U. N. delegate example, the level of significance was .05, and the alternative hypothesis was directional. Therefore, Mary found that all values of M that were less than a mean of 95.05 (or a corresponding z score of -1.65) represented her critical region and that 95.05 (or -1.65) was her critical value. If Mary’s alternative hypothesis had been nondirectional, however, and the level of significance had been .05, then her critical region would have been made up of all values of M that were less than the critical value of 94.12 (z = -1.96) or greater than the critical value of 105.88 (z =1.96).

DATA COLLECTION AND PRELIMINARY ANALYSES

Up to this point, we have been discussing the elements of hypothesis testing that are established prior to the collection and analysis of the data. In the following paragraphs, we will consider the elements of hypothesis testing that begin with the collection of the data.

The collection and preliminary analyses of the data involves the measurement of the units, the recording of the measurements for analysis, the checking of the data for accuracy and for outliers, and the calculation of descriptive statistics. The preceding chapters of this book prepared you to carry out each of these elements of hypothesis testing using the jamovi statistical package.

Test Statistic and the Critical Values

The test statistic based on the sample statistic is calculated and compared to the critical values and a decision is made to reject or fail to reject the null hypothesis. In the previous example, Mary Barth calculated a z statistic based on her sample mean using equation (11-1). The critical values for the four possible research and alternative hypotheses available to Mary are shown in figure 11b.

The research and alternative hypotheses that Mary tested in the example are shown in figure 11b(i). Here, the alternative hypothesis indicated that the population parameter would fall into the right tail of the sampling distribution. In this case, the test statistic must be greater than the critical value to reject the null hypothesis. If Mary’s alternative hypothesis indicated that the population parameter would fall into the left tail of the sampling distribution, then the test statistic would have to be smaller than the critical value for her to reject the null hypothesis [see figure 11b(ii)]. Given that the population parameter could fall into either tail of the sampling distribution, the absolute value of Mary’s test statistic must be greater than the absolute value of the critical values in order to reject the null hypothesis [see figures 11b(iii) and (iv)].

The p-Level

The p-Level: Directional Tests.

Following the calculation of sample values, a probability, usually denoted by p and referred to as the p-level, is calculated. The p-level is based on the sample statistic, and for a one-tailed test, it is the probability of finding another sample statistic that is more deviant from the population parameter, in the direction predicted in the alternative hypothesis, than the calculated sample statistic.

The decision to reject or fail to reject (that is, to retain) the null hypothesis depends on the relationship between the p-level and the level of significance, α. If the p-level is found to be less than or equal to α, the null hypothesis is rejected.

If the p-level is found to be greater than α, the null hypothesis is retained. In the example, Mary Barth found the sample mean to be 94 and the probability of obtaining a mean less than 94 to be .0228 (that is, p(M < 94) = p = .0228). Therefore, since p was less than α = .05, Mary rejected the null hypothesis. This p-level is shown in figure 11c(i).

The p-Level: Nondirectional Tests.

If the research hypothesis implies a two-tailed alternative hypothesis (as was described in research hypotheses 3 and 4 in the previous section), then p is found as the probability of obtaining a mean as deviant or more deviant than the one found in the sample. Therefore, for a nondirectional test, the p-level is found as the sum of the probabilities associated with sample means as deviant as the calculated value, that is, means that are both above and below the population value.

For example, if Mary Barth had specified a research hypothesis that implied a two-tailed alternative hypothesis ($H_0: \mu \ne 100)$ then since her sample mean of 94 deviated by 6 points from the population value, she would have found the p-level as the probability of obtaining a mean less than 94 plus the probability of obtaining a sample mean greater than 106.

Since the sampling distribution of the mean is symmetric, the probability of obtaining a sample mean that is greater than 106 is found to be .0228, the same as the probability of obtaining a sample mean less than 94. Therefore, the value of p is .0228 + .0228 = .0456. Mary would still reject the null hypothesis since p is less than .05. Figure 11c(ii) shows this p-level. [Note the difference between the p-values for the one-tailed (p < .0228) and the two-tailed (p < .0456) tests. The one-tailed test is more powerful.]

The Confidence Interval, Confidence Level, and Confidence Limits

The confidence interval consists of those values of a statistic that have a given a priori probability of including the population value. The a priori probability written as a percentage is called the confidence level, and the bounds on the confidence interval are called the confidence limits. The confidence limits define the interval estimate.

One-Tailed Confidence Interval When $H_A: μ < 100$

Given a one-tailed alternative hypothesis where the population mean is predicted to be less than the population parameter, the 100(1 – α)% one-tailed confidence interval is written as:

\[ \begin{equation} \mu < M + z_{(1-\alpha)} \left[\frac{\sigma}{\sqrt{n}}\right] \tag{11-5} \end{equation} \]

In our example, M = 94, α = .05, z(.95) = 1.645, σ = 15, and n = 25. Substituting the values into equation (11-5), the one-tailed 100(1 – .05)% = 95% confidence interval is found as:

\[ \begin{align} \mu &< 94 + 1.645\left[\frac{15}{\sqrt{25}}\right] \\ \mu &< 98.935 \end{align} \] Here, the values of M that lie below 98.935 make up the confidence interval; -∞ to 98.935 are the confidence limits; and the percentage, 95, is called the confidence level. When the null hypothesis is rejected, as it was in our example, the population mean is not in the confidence interval, for example, 100 > 98.935.

One-Tailed Confidence Interval When $H_A: \mu > 100$

Given a one-tailed alternative hypothesis where the population mean is predicted to be greater than the population parameter, the 100(1 – α)% one-tailed confidence interval is written as:

\[ \begin{equation} \mu > M - z_{(1-α)} \left[\frac{\sigma}{\sqrt{n}} \right] \tag{11-6} \end{equation} \] In our example, M = 94, α = .05, z(.95) = 1.645, σ = 15, and n = 25. Substituting the values into equation (11-6), the one-tailed 100(1 – .05)% = 95% confidence interval is found as:

\[ \begin{align} \mu &> 94 - 1.645\left[\frac{15}{\sqrt{25}}\right] \\ \mu &> 89.065 \end{align} \] Here, the values of M that lie above 89.065 make up the confidence interval, and 89.065 to +∞ are the confidence limits. Note that in this case, the null hypothesis would not be rejected since the test statistic of -2.00 would not be greater than the critical value of 1.645. When the null hypothesis is not rejected, the population mean is inside the confidence interval, for example, 100 < 89.065.

Two-Tailed Confidence Intervals when $H_A: \mu \ne 100$

Given a single sample of subjects, such as we had in the U.N delegate example, a two-tailed confidence interval for the population mean can be written as:

\[ \begin{equation} M - z_{(1-α⁄2)} \left[\frac{\sigma} {\sqrt{n}}\right] < \mu < M + z_{(1-α⁄2)} \left[\frac{\sigma} {\sqrt{n}}\right] \tag{11-7} \end{equation} \] Alternatively, we can express the confidence interval as:

\[ \begin{equation} 100(1-\alpha) \text{% CI} = M \pm z_{(1-α⁄2)} \left[\frac{\sigma} {\sqrt{n}}\right] \end{equation} \] which can also be written as:

\[ \begin{equation} 100(1-\alpha) \text{% CI} = (M - z_{(1-α⁄2)} \left[\frac{\sigma} {\sqrt{n}}\right], M - z_{(1-α⁄2)} \left[\frac{\sigma} {\sqrt{n}}\right]) \end{equation} \] which can written more simply as:

\[ \begin{equation} 100(1-\alpha) \text{% CI} = (M - [se*z_{CV}], M + [se*z_{CV}]) \end{equation} \] where CV means critical value of z at some given $\alpha$ level and $se$ means standard error.

In our example, Mary Barth would establish a 100(1 – α)% = 100(1 – .05)% = 95% confidence interval by substituting into equation (11-7) her sample M of 94; the z score below which (1 – α/2) = (1 – .05/2) = .975 of the sample z scores should fall, that is, z(.975) = 1.96; the standard deviation of the population, 15; and her sample size of 25 cases. The resultant 95% confidence interval would have been:

\[ \begin{align} 94 - 1.96 \left[\frac{15} {\sqrt{25}}\right] < &\mu < 94 + 1.96 \left[\frac{15} {\sqrt{25}}\right] \\ 94 - (1.96 * 3) < &\mu < 94 + (1.96 * 3) \\ 94 - 5.88 < &\mu < 94 + 5.88 \\ 88.12 < &\mu < 99.88 \\ &or\\ (88.12 &, 99.88) \end{align} \] We can also express the confidence interval as (88.12, 99.88).

Here, the values of samples means (M) that lie between 88.12 and 99.88 make up the confidence interval; 88.12 and 99.88 are the confidence limits; and the percentage level, 95, is called the confidence level. Since the absolute value of Mary’s test statistic is larger than the absolute value of her critical value (2.00 > 1.96) Mary would have rejected the null hypothesis given a two-tailed alternative. This is also indicated by the fact that the two-tailed confidence interval does not contain the hypothesized parameter, that is, 100, is not in the confidence interval.

Other Percent Confidence Intervals.

Mary could have constructed a one-tailed or a two-tailed 99% confidence interval by simply changing the z score values to those between which 99% of the sample z statistics should fall. For example, for a two-tailed 99% confidence interval, z(1 – .01/2) = z(.995) = 2.58. Then, Mary’s 99% confidence interval would be found as:

\[ \begin{align} 94 - 2.58 \left[\frac{15} {\sqrt{25}}\right] < &\mu < 94 + 2.58 \left[\frac{15} {\sqrt{25}}\right] \\ 86.26 < &\mu < 101.74 \\ \end{align} \] or 86.26 < μ < 101.74. We can also express the confidence interval as (86.26, 101.74). Note that the 99% confidence interval is wider than the 95% confidence interval. This is because it takes a wider interval of possible means to have more confidence that the true population mean is in that interval.

Similarly, other percent confidence intervals may be found by substituting the z scores between which the given percent of the z scores should fall, into equation (11-5), (11-6), or (11-7).

Interpretation of the Meaning of a Confidence Interval.

Statisticians are very sensitive to the way researchers interpret confidence intervals. For example, they are constantly pointing out that it is incorrect to state that: The population mean has a 95% chance of being included in a given confidence interval. Instead, they correctly point out that prior to selecting a sample mean, the probability is .95 that the sample mean will fall between the confidence limits. This is equivalent to saying that for 95% of the samples the population mean, μ, will fall between the confidence limits. Therefore, prior to selecting a sample, you have a 95% probability of obtaining a sample mean with which you can establish a confidence interval that contains the population mean.

After the sample mean is selected, however, you can no longer speak in terms of probabilities because the population mean is either in the confidence interval or it is not. This is why the term confidence is used instead of the term probability. It turns out that when we calculate 95% confidence intervals in the ways described above, the confidence intervals from 95% of all possible samples will actually capture the true population mean. Because it is much more likely that we have one of these 95% of all possible samples than it is that our sample is one of the 5% that does not accurately contain the true population mean, we say we have 95% confidence that the true population falls within the limits of our sample’s confidence interval.

THE LOGIC OF HYPOTHESIS TESTING

Hypothesis Testing Versus Placing a Bet

To help you follow and understand the steps taken in hypothesis testing, consider them to be the rules you have established on a bet. Here, for simplicity, we will continue to consider a bet that focuses on a single sample of subjects with a known population variance. You bet that you will draw a random sample that has a mean reflecting the prediction you have made in your research hypothesis.

As with most bets, the rules are established before hand, that is, a priori. These rules are usually prepared in a document called a research proposal. Most research proposals contain:

a general discussion of the problem and its significance
a review of the related research and theory
the research hypotheses (unless the research is exploratory)
a discussion of the population and the sample
a presentation of the reliability and validity of the instruments or tests used
the statistical hypotheses
a description of the statistical methodology (for example, the level of significance, power, and the statistical tests used)
and limitations and delimitations.

Once the bet is placed, that is, the proposal is accepted, a random sample is drawn, and statistics of interest are calculated. (The description of the data and the calculation of the statistics of interest involve the work we have considered in chapters 1 through 8.) Based on the sample mean and its corresponding p-level, or critical value, a decision is made to reject or fail to reject the null hypothesis.

In the latter hypothesis testing process, the researcher places a bet that, on the basis of past research and theory, he or she feels has a good probability of success, that is, agreeing with his or her research hypothesis. Once the sample has been drawn, however, the bet is over because the sample mean will either be in the prespecified critical region or it will not. Therefore, once the sample has been drawn, probabilities are no longer discussed and you are not permitted to return to your proposal to modify the rules (for example, to change your prediction or the level of significance). You are permitted to run another study, at your risk of time and effort, if you find something that you feel needs further study.

You can see that this process is similar to placing any bet. For example, if you bet someone $100 that it is not going to rain on August 25, 2000, in Buffalo, New York, and then if it rains (sprinkles or thunder showers) you owe that person $100. Once the day has passed, it has either rained or not and the bet is settled.

In journal articles you will frequently see researchers using terms like strongly significant or highly significant to describe a result that is deeply into the critical region (for example, a thunder shower). In terms of the all-or-nothing result after a bet is decided, you can see that this usage is improper. A null hypothesis is either rejected or it is not; you either win or lose your bet.

Figure 11b Four combinations of research and statistical hypotheses with resulting critical values and critical regions when alpha = .05

Figure 11c Mary Barth’s p-levels for a one-tailed and a two-tailed statistical significance test with alpha = .05 (both tests are statistically significant)

Exploratory Versus Confirmatory Research

As we continue our study of different statistics and their sampling distributions, you will find it helpful to be able to distinguish between what is called an exploratory study and what is called a confirmatory study. These two types of studies are pictured in figure 11d as being on opposite ends of an information continuum.

An exploratory study is one in which you have little or no past research or theory to rely on for guidance. In this phase of analysis, you have ideas and data, but you can make few, if any, predictions.

In contrast, in a confirmatory study, you have available a good deal of past research and theory. In this phase of analysis, you can make predictions that are based on a good deal of information. In actuality, you will frequently find yourself between these two extremes, but the contrast of the poles of this continuum is of value in focusing your position.

Omnibus Versus A Priori Statistical Tests

Given the paucity of information available in an exploratory study, you will be faced with performing what are known as omnibus or overall statistical tests. These tests are called omnibus because they cover all possibilities for rejection of the null hypothesis. In our U.N. delegate example, Mary would have conducted this type of test if her null hypothesis had been nondirectional. She would have rejected her null hypothesis if her sample mean fell into the rejection region of either tail of her sampling distribution.

Given the abundance of information that is available in a confirmatory study, however, you can use what are known as a priori statistical tests. Mary conducted this type of test when she tested her null hypothesis against a directional alternative hypothesis. She was able to reject her null hypothesis only if her sample mean fell into the rejection region that was below the population mean.

The value of being able to use past information to make predictions is that the a priori statistical tests are more sensitive, that is, more powerful, than the omnibus tests. You can see this in the U.N delegate example, where the p-level for the one-tailed a priori test was .0228, while the p-level for the two-tailed omnibus test was .0456. (Remember, you reject the null hypothesis when the p-level is less than the level of significance.)

In the single group studies of the population mean, the p-level for the omnibus test will always be twice that for the a priori test. Therefore, you can see that with an a priori test you will have a better chance of rejecting the null hypothesis when it is false.

Figure 11d Continuum between exploratory and confirmatory studies

Exploratory Study (very little information)
Confirmatory Study (a good deal of information) Information

Note: In an exploratory study there is little or no past research and/or theory to rely on for guidance and therefore, few, if any, predictions. This is contrasted here with a confirmatory study where a good deal of past research and theory are available to make predictions.

SUMMARY

This chapter discussed 20 steps that a data analyst can follow for a study that involves hypothesis testing. The discussion took place in three phases: an outline of hypothesis testing, an sample study, and a full discussion of the terms used in the outline and the sample study. Each phase was presented in two parts, one part was devoted to the elements of hypothesis testing that are established prior to the collection of data, and the other part was devoted to data analysis.

The elements of hypothesis testing were compared to the elements of a bet. In both situations, rules are established prior to an event, and after the event takes place, a decision is made. The following chapters will follow each of the 20 steps discussed in this chapter for problems that require different test statistics.

Chapter 11 Appendix A Study Guide for Z Statistics

Basic Statistical and Hypothesis Testing Concepts and Calculations

Explain the Research Design and Statistical Analysis terms below BRIEFLY but SUFFICIENTLY and IN YOUR OWN WORDS (don’t just give another name for them). Some may require finding additional readings. If you use resources, paraphrase in your own words AND provide a citation of the resource you used (including page numbers).

Central Limit Theorem (3 parts about the parameters of the distribution)
Sampling Distribution
Sampling Error
Standard Error
Confidence Interval
Type I error
Type II error
Statistical Power
How you can improve statistical power in a research project
Effect Size
Steps in null hypothesis significance testing
Describe a specific research project (not discussed in class or in your textbook) where Type I error is a more important concern that Type II error. Provide details and explain clearly why this is your belief.
Describe a specific research project (not discussed in class or in your textbook) where Type II error is a more important concern that Type I error Provide details and explain clearly why this is your belief.

Citation

Please cite as:
Barcikowski, R. S., & Brooks, G. P. (2025). The Stat-Pro book:
A guide for data analysts (revised edition) [Unpublished manuscript].
Department of Educational Studies, Ohio University.
https://people.ohio.edu/brooksg/Rmarkdown/

This is a revision of an unpublished textbook by Barcikowski (1987).
This revision updates some text and uses R and JAMOVI as the primary
tools for examples. The textbook has been used as the primary textbook
in Ohio University EDRE 7200: Educational Statistics courses for 
most semesters 1987-1991 and again 2018-2025.