In chapter 14 you learned why and how an analysis of variance (ANOVA) is performed. In this chapter we will consider in more detail the elements of hypothesis testing in an exploratory ANOVA. We will do this by analyzing a data set in the step-by-step fashion that you would use for an exploratory ANOVA. That is, in this chapter we will consider such key elements as: the state of affairs, a given problem, its research hypothesis, the statistical hypotheses, assumptions, violations of assumptions, sample size selection, etc. We will conclude this chapter by considering four different statistical tests and the conditions under which you would use one of them after you have found that the treatments differ. These tests are referred to as post hoc or a posteriori tests because they are used following an overall F test in order to determine which treatment groups differ.
In this section we will consider the elements of hypothesis testing from the perspective of a researcher who is in a situation where it is difficult or impossible for him to know in advance what mean differences to expect among the treatments under study. That is, the researcher is in a position where past research and theory are either nonexistent or yield conflicting results. The example used here will be that of the three group herbal supplement pill consumption experiment that we discussed in chapter 14. Here, however, we will be able to consider an appropriate sample size for this experiment. To personalize the discussion, the researcher conducting this study is given the name Leon Waters. We will begin by considering the state of affairs, the ANOVA assumptions, violations of assumptions, and statistical equations used in any exploratory fixed-effects one-way ANOVA. We will then consider a step-by-step example of the elements of hypothesis testing for a one-way fixed-effects design.
The state of affairs that must exist before you can consider the following test statistic is:
The test statistics that we will consider for the exploratory and confirmatory analyses will be valid when:

1. the observations (units) are independent of one another;
2. the scores in each treatment population are normally distributed; and
3. the treatment population variances are equal (homogeneity of variance).
The importance of the preceding assumptions is as follows:
The assumption that the units are independent of one another is extremely important because, if it is violated, the level of significance (i.e., the probability of rejecting a true null hypothesis) can increase dramatically (e.g., from .05 to .40). In figure 15a the actual levels of significance are illustrated when the independence assumption is violated, given two treatments. Figure 15a shows that the probability of rejecting a true null hypothesis (i.e., of making a Type I error) increases dramatically as the relationship among the subjects in a group increases and as the number of subjects per group increases.
The effects of violations of the assumptions of normality and homogeneity of variance on level of significance and power in a fixed-effects ANOVA were considered in a review of literature by Glass, Peckham, and Sanders (1972). From their review, we may conclude that when the assumptions of normality and homogeneity of variance are violated the F test will be robust to these violations given large and equal samples per treatment. However, if one or more of the preceding conditions is not met, particularly when a one-tailed test is to be used, serious consideration should be given to transformations and/or to adjustments of the nominal level of significance and/or of the nominal power.
For example, given heterogeneous variances and unequal n’s per treatment with smaller samples drawn from more variable populations, Glass, Peckham, and Sanders (1972) indicated that your level of significance will be inflated. In this case you may decide to choose a smaller level of significance than you would if homogeneous variances were present (e.g., .01 or .05/2 instead of .05). Or, if you have two treatments, you may decide to use the separate variance t test discussed in chapter 14.
When heterogeneous variances are found with unequal numbers of units in the treatments, some researchers will randomly discard units until an equal n situation is found. These researchers know that with equal n’s the nominal and actual levels of significance will be close. However, many researchers find this approach unacceptable because of the lost information and the loss of power when units are discarded.
In chapter 14 the equations for calculating the F statistic for an ANOVA were given along with the steps necessary to obtain an ANOVA from R. For convenience, the equations for the F statistic are repeated here.
The F statistic with \(v_1\) and \(v_2\) degrees of freedom used to test the null hypothesis of equal treatment population means in a one-way ANOVA is found as the ratio of the mean square between groups (MSB) to the mean square within groups (MSW), that is:
\[ \begin{equation} F = \frac{MSB}{MSW} \tag{15-1} \end{equation} \]
Here,
\[ \begin{equation} MSB = \frac{SSB}{J-1} = \text{Mean Square Between Groups} \tag{15-2} \end{equation} \]
\[ \begin{equation} MSW = \frac{SSW}{\sum (n_j-1)} = \text{Mean Square Within Groups} \tag{15-3} \end{equation} \]
\[ \begin{equation} SSB = \sum n_j (M_j-M)^2 = \text{Sum of Squares Between} \tag{15-4} \end{equation} \]
\[ \begin{equation} SSW = \sum_{j=1}^J \sum_{i=1}^{n_j} (X_{ij}-M_j)^2 = \text{Sum of Squares Within} \tag{15-5} \end{equation} \]
\[ \begin{equation} v_1 = df_B = J-1 = \text{degrees of freedom Between} \tag{15-6} \end{equation} \]
\[ \begin{equation} v_2 = df_W = \sum (n_j - 1) = \text{degrees of freedom Within} \tag{15-7} \end{equation} \]
Where \(J\) is the number of treatments; \(n_j\) is the number of units (subjects) in treatment j; \(X_{ij}\) is the score of the i-th unit in treatment j; \(M_j\) is the mean of treatment j; and \(M\) is the grand mean of all the scores.
The calculation of SSB and SSW can be checked by finding the sum of squares total (SST) and then checking to see if \(SST = SSB + SSW\). Note that some scholars use SSY instead of SST because SST refers to the total sum of squares in the dependent Y variable (recall that the sum of squares is the numerator of the variance formula we learned earlier). Here, SST is found as:
\[ \begin{equation} SST = \sum_{j=1}^J \sum_{i=1}^{n_j} (X_{ij}-M)^2 = \text{Sum of Squares Total} \tag{15-8} \end{equation} \]
Note that the only difference between Equations 15-5 and 15-8 is that the Group Mean is subtracted from each unit’s score in 15-5 but the Grand Mean is subtracted from each unit’s score in 15-8.
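As a quick check of these formulas (not part of Leon’s analysis), the base R sketch below computes the sums of squares directly and verifies that \(SST = SSB + SSW\); it assumes a long-format data frame named `data` with a numeric `Cholesterol` column and a factor `Treatment` column, as used in the jmv calls later in this chapter.

```r
# Minimal sketch: hand-computed one-way ANOVA pieces (Equations 15-1 to 15-8).
grand_mean  <- mean(data$Cholesterol)
group_means <- tapply(data$Cholesterol, data$Treatment, mean)
group_ns    <- tapply(data$Cholesterol, data$Treatment, length)

SSB <- sum(group_ns * (group_means - grand_mean)^2)                           # Equation 15-4
SSW <- sum((data$Cholesterol - group_means[as.character(data$Treatment)])^2)  # Equation 15-5
SST <- sum((data$Cholesterol - grand_mean)^2)                                 # Equation 15-8
all.equal(SST, SSB + SSW)   # should be TRUE

J   <- nlevels(data$Treatment)
MSB <- SSB / (J - 1)               # Equation 15-2
MSW <- SSW / sum(group_ns - 1)     # Equation 15-3
c(F = MSB / MSW, df1 = J - 1, df2 = sum(group_ns - 1))   # Equation 15-1
```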
The following is a step-by-step example of the elements of hypothesis testing in an exploratory one-way analysis of variance.
Are there differences among the mean cholesterol levels of groups of women who have taken herbal supplement pills made by the Gamma Drug Company, the Delta Drug Company, and women who have taken no herbal supplement pills?
There is a mean difference among the cholesterol levels of groups of women who have taken herbal supplement pills made by the Gamma Drug Company, the Delta Drug Company, and women who have taken no herbal supplement pills.
The statistical hypotheses are written in terms of the population means of the three treatments as:
\[ \begin{align} H_0&: \mu_1 = \mu_2 = \mu_3 \\ H_A&: \mu_j \ne \mu_k &&\text{for at least one pair } j \ne k; \ j, k = 1, 2, 3 \end{align} \]
Here, \(\mu_1\) is the population cholesterol mean of the women taking the Gamma Drug Company’s herbal supplement pill; \(\mu_2\) is the population cholesterol mean of the women taking the Delta Drug Company’s herbal supplement pill; and \(\mu_3\) is the population cholesterol mean of the women taking no herbal supplement pill. The null hypothesis indicates that there are no differences among the population cholesterol means of the three treatments and the alternate hypothesis indicates that at least one of the treatment population means differs from the others. In general, the null hypothesis would be written that there is no difference among J population means and the alternate hypothesis would be written as it is above.
A review of the cholesterol literature indicated that the measures of cholesterol used in this experiment were valid and reliable. Also, so as to assure the validity of the independent variable, care was taken to assure that the women received pills from the treatment to which they were randomly assigned.
The probability of rejecting a true null hypothesis (i.e. the probability of making a Type I error) was set at .10.
In social science research, changing the level of significance from the very common value of .05 usually requires some sort of justification. One of the most common, and defensible, reasons for changing the level of significance is to increase power when a Type II error is particularly undesirable (and a Type I error is not particularly worrisome). A Type II error would be undesirable at early stages of research, when researchers would not want to risk stopping a promising line of research due to non-significant results. Power would be a concern when the researcher has limited ability to obtain larger samples.
A Type I error would not be terribly worrisome when there is a low cost to the treatment (or intervention) and no side effects (whether we are talking about medical treatments or any other kind of treatment). If someone were to change to a low cost treatment based on the results of a study, they would not lose too much money paying for a truly ineffective treatment (whereas they would be wasting much money if the treatment was expensive and did not really work). Similarly, if there are no side effects to the treatment, then using it even though it is not effective would not cause harm.
Note that even when these conditions are defensible, either a Type I or a Type II error can lead to an opportunity cost. That is, we would lose the opportunity to change to a better treatment when a Type II error occurred and resulted in the new treatment not being statistically significant in a sample. Or we would lose the opportunity to continue to use a known effective treatment because we changed to a new treatment due to the promising – but wrong – results caused by a Type I error in a study.
Leon Waters decided that he would like his power to be at least .80, i.e., he decided that he would like to be able to reject the null hypothesis at least 80 times out of 100 when the null hypothesis was false.
The effect size “\(d_A\)” (A for ANOVA) used in this book is based on the work of Cohen (1977), and is related to the two group effect size “d” that was discussed for the single group t test (Equation 12-4) in Chapter 12; the relationship is:
\[ \begin{equation} d_A = \text{Cohen's } f = \frac{1}{2} d \tag{15-9} \end{equation} \]
As in the case of the single group mean, we may use Cohen’s three different effect sizes, “small” \(d_A = 0.10\), “medium” \(d_A = 0.25\), and “large” \(d_A = 0.40\), as guides in exploratory studies where information is lacking. Note that Cohen called this effect size \(f\) rather than \(d_A\), but we prefer calling it \(d_A\) to help with the rationale expressed in Equation 15-9 (e.g., the medium \(d = 0.50\), so the medium \(d_A = 0.25\)).
In light of Cohen’s suggestions for effect size, Leon Waters decided to choose a large a priori effect size (i.e., \(d_A = 0.40\)) as one that he felt would be present among the pill consumption treatments.
Sample size may be found using Table C.6 where sample sizes per treatment are tabled for given levels of significance (\(\alpha\)), powers (\(1 – β\)), degrees of freedom (\(df\)), and a priori effect sizes (\(d_A\)).
When an a priori effect size is selected that is not in Table C.6 of Appendix C, the following equation may be used to estimate the sample size:
\[ \begin{equation} n = \frac{n_{.05}}{400(d_A^2)} + 1 \tag{15-10} \end{equation} \]
where \(n\) is the number of units in a single treatment, \(n_{.05}\) is the value of \(n\) in Table C.5 when \(d_A = .05\) for a given power and degrees of freedom (\(df = K – 1\)).
In Table C.6 Leon found that he needed 17 subjects per treatment when \(\alpha = .10\), \(df = 2\), and \(d_A = .40\) to have the power of his statistical test set at .80. However, since Leon could conveniently collect information on 20 subjects per treatment, he decided to proceed with this number. In Table C.6 we see that by using 20 subjects per treatment Leon’s power is between .80 and .90; through linear interpolation he estimated it to be .85.
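For readers without Table C.6 at hand, the pwr package gives comparable estimates (this is only a cross-check, not part of Leon’s procedure; pwr calls the effect size Cohen’s f, which is the same quantity we are calling \(d_A\)).

```r
# Minimal sketch using the pwr package (install.packages("pwr") if needed).
library(pwr)

# Sample size per treatment for alpha = .10, power = .80, f = d_A = .40, 3 groups
pwr.anova.test(k = 3, f = 0.40, sig.level = 0.10, power = 0.80)

# Power actually attained with the 20 subjects per treatment Leon plans to use
pwr.anova.test(k = 3, f = 0.40, sig.level = 0.10, n = 20)
```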
If we are using a table with limited degrees of freedom, in order to find the critical value of the F statistic with \(\alpha= .10\), \(v_1 = 2\), and \(v_2 = 57\), we might need to choose between \(F_{(.90, 2, 40)} = 2.44\) and \(F_{(.90, 2, 60)} = 2.39\). In such a case, we would choose to use \(F_{(.90, 2, 40)}\) as our critical value because it is the more conservative (i.e., smaller) degrees of freedom, and use linear interpolation only if our calculated F value falls between the latter two tabled values.
At this point the a priori parameters of hypothesis testing have been established (i.e., your “bet” is made).
The cholesterol measurements of the three groups of twenty women each are shown in Columns 1, 2, and 3 of figure 15c(i). Note that in order to perform the ANOVA in R, we must restructure the data so that there is just one column for Treatment, one column for Cholesterol score, and one column for z score. Figure 15c(ii) shows a partial Case Summaries report for the restructured data. The Treatment variable must also be a factor in R.
Also note that z scores can be obtained in two ways: (a) for the whole sample and (b) for each factor level separately. When doing group comparison analyses, we usually want to examine z scores by group (i.e., each level separately). The z scores provided here were calculated by group in JAMOVI using the Z(Cholesterol, group_by=Treatment) compute function.
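The restructuring and the group-wise z scores can also be produced directly in R. This is only a sketch: it assumes the wide-format data of Figure 15c(i) are already in a data frame called `wide` with columns Person_ID, Gamma, Delta, and None (the object name is hypothetical, not part of Leon’s script).

```r
# Minimal sketch: wide-to-long restructuring plus group-wise z scores in base R.
long <- reshape(wide,
                direction = "long",
                varying   = c("Gamma", "Delta", "None"),
                v.names   = "Cholesterol",
                timevar   = "Treatment",
                times     = c("Gamma", "Delta", "None"),
                idvar     = "Person_ID")

long$Treatment <- factor(long$Treatment)   # Treatment must be a factor for the ANOVA

# z scores computed separately within each treatment group
long$zCholesterol <- ave(long$Cholesterol, long$Treatment,
                         FUN = function(x) (x - mean(x)) / sd(x))
```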
Person_ID | Gamma | Delta | None | z_Gamma | z_Delta | z_None |
---|---|---|---|---|---|---|
1 | 235 | 260 | 210 | -0.3975 | 0.8854 | -0.3975 |
2 | 240 | 160 | 215 | -0.2865 | -1.2954 | -0.2865 |
3 | 288 | 255 | 263 | 0.7795 | 0.7763 | 0.7795 |
4 | 245 | 175 | 220 | -0.1754 | -0.9682 | -0.1754 |
5 | 355 | 250 | 330 | 2.2675 | 0.6673 | 2.2675 |
6 | 223 | 260 | 198 | -0.6640 | 0.8854 | -0.6640 |
7 | 285 | 185 | 260 | 0.7129 | -0.7502 | 0.7129 |
8 | 223 | 156 | 198 | -0.6640 | -1.3826 | -0.6640 |
9 | 275 | 273 | 250 | 0.4908 | 1.1689 | 0.4908 |
10 | 204 | 218 | 179 | -1.0860 | -0.0305 | -1.0860 |
11 | 268 | 180 | 243 | 0.3353 | -0.8592 | 0.3353 |
12 | 272 | 245 | 247 | 0.4242 | 0.5583 | 0.4242 |
13 | 235 | 200 | 210 | -0.3975 | -0.4231 | -0.3975 |
14 | 297 | 235 | 272 | 0.9794 | 0.3402 | 0.9794 |
15 | 235 | 255 | 210 | -0.3975 | 0.7763 | -0.3975 |
16 | 278 | 200 | 253 | 0.5574 | -0.4231 | 0.5574 |
17 | 322 | 320 | 297 | 1.5346 | 2.1938 | 1.5346 |
18 | 220 | 225 | 195 | -0.7306 | 0.1221 | -0.7306 |
19 | 183 | 190 | 158 | -1.5523 | -0.6411 | -1.5523 |
20 | 175 | 146 | 150 | -1.7300 | -1.6007 | -1.7300 |
The z score columns of Figure 15c(i) were then created. Examination of these columns indicates that none of the scores were considered to be outliers, since no z scores greater than 3 or less than -3 were observed.
jmv::descriptives(
formula = zCholesterol ~ Treatment,
data = data,
dotType = "stack",
n = FALSE,
missing = FALSE,
mean = FALSE,
median = FALSE,
sd = FALSE,
min = FALSE,
max = FALSE,
extreme = TRUE)
DESCRIPTIVES
EXTREME VALUES
Extreme values of zCholesterol
────────────────────────────────────────
Row number Value
────────────────────────────────────────
Highest 1 13 2.268
2 15 2.268
3 50 2.194
4 49 1.535
5 51 1.535
Lowest 1 58 -1.730
2 60 -1.730
3 59 -1.601
4 55 -1.552
5 57 -1.552
────────────────────────────────────────
However, because each sample is so small (\(n = 20\) within each treatment), we might consider using \(|z| > 2\) as our rule for outliers (i.e., any score beyond the range \(-2 < z < 2\) would be considered an outlier). In this case, we would want to check Case_ID 13 (Person_ID 5 in the Gamma group), Case_ID 50 (Person_ID 17 in the Delta group), and Case_ID 15 (Person_ID 5 in the None group) more carefully. Upon checking the data, no good rationale for removing the cases was identified, so Leon kept those cases in the analysis. Further, the boxplots for each treatment were observed to have no unusual characteristics (i.e., no outliers).
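A one-line check of the \(|z| > 2\) rule in R (assuming the long-format data frame `data` with the group-wise z scores stored in `zCholesterol`) might look like this:

```r
# Minimal sketch: list cases whose within-group z score exceeds 2 in absolute value.
data[abs(data$zCholesterol) > 2, c("Treatment", "Cholesterol", "zCholesterol")]
```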
jmv::descriptives(
formula = Cholesterol ~ Treatment,
data = data,
box = TRUE,
dot = TRUE,
dotType = "stack",
n = FALSE,
missing = FALSE,
mean = FALSE,
median = FALSE,
sd = FALSE,
min = FALSE,
max = FALSE)
DESCRIPTIVES
Note that in group analyses (such as ANOVA and the independent t test), we are concerned not only with cases that are outliers for the entire sample (i.e., at the univariate level), but also with outliers within each sample. Recall that each sample represents a different population, so we want to verify that a case is not extreme or unusual for its own population. With fewer cases in each group, an outlier can have a larger impact than it would have across all of the data in the study. Because of this potential for larger impact, the group statistics we report may be severely distorted by that outlier (e.g., the group mean being pulled toward a skewed tail).
The initial selection of a random sample of women, followed by their random assignment to the treatments established the initial conditions for independence of the observations. Also, Leon Waters was satisfied that nothing happened during the experiment to violate the independence assumption.
The normality assumption was considered by constructing histograms of the data in each treatment and by considering the normality plots for each treatment. Both the histograms and the normality plots indicated that the data in each treatment could be considered to have been sampled from a normal distribution.
The normality assumption is frequently tested by using Tests of Normality. We can obtain a test of the overall normality of a variable, but for group analyses like ANOVA and independent t tests, we wish to test normality within each sample, to reach a decision about the normality of each population separately. This analysis is obtained by using the Treatment variable as a SPLIT BY in Descriptives. Clicking to select the Histogram is also helpful for this purpose, but using a paneled histogram from GRAPHS may be more useful (because it uses the same scale for the axes).
jmv::descriptives(
formula = Cholesterol ~ Treatment,
data = data,
hist = TRUE,
qq = TRUE,
n = FALSE,
missing = FALSE,
# desc = "rows",
mean = FALSE,
median = FALSE,
sd = FALSE,
min = FALSE,
max = FALSE,
skew = TRUE,
kurt = TRUE,
sw = TRUE)
DESCRIPTIVES
Descriptives
───────────────────────────────────────────────────
Treatment Cholesterol
───────────────────────────────────────────────────
Skewness Delta 0.2161
Gamma 0.3544
None 0.3544
Std. error skewness Delta 0.5121
Gamma 0.5121
None 0.5121
Kurtosis Delta -0.4833
Gamma 0.1189
None 0.1189
Std. error kurtosis Delta 0.9924
Gamma 0.9924
None 0.9924
Shapiro-Wilk W Delta 0.9631
Gamma 0.9753
None 0.9753
Shapiro-Wilk p Delta 0.6069
Gamma 0.8603
None 0.8603
───────────────────────────────────────────────────
With small samples in particular, but also generally, we recommend using the Shapiro-Wilk test. The null hypothesis tested by this statistical test is
\[ H_0: \text{The data are normally distributed} \] We could amend this Null Hypothesis to handle the conditional normality situation (i.e., normality within each group).
\[ H_0: \text{The data are normally distributed within each group} \]
Therefore, if we reject the null hypothesis for the Shapiro-Wilk test, we will be rejecting normality. We want to fail to reject normality for the assumption to be satisfied (failing to reject doesn’t mean that the population is truly normal, but it at least provides no evidence that it is not). Therefore, for the sake of the assumption test, we are hoping to fail to reject the null hypothesis and obtain a p value greater than our level of significance (i.e., \(p > \alpha\)). In this case, we see that we cannot reject normality for any of the treatment groups. Therefore, we consider the assumption to be tenable (i.e., believable, defensible, met). Fortunately, one-way ANOVA and t tests are relatively robust to a violation of normality, but when the assumptions are met, we have that much more confidence in our statistical results.
If we are concerned about normality, recall that non-normality could be the result of outliers (especially for skewed distributions). We recommend that the first option to address a violation of the normality assumption, then, be to look for the impact of outliers. However, if outliers do not appear to be responsible, then as a last resort, if you strongly believe something must be done to address the non-normality, a transformation may be appropriate.
Finally, because ANOVA is part of the General Linear Model (i.e., regression), we can test the normality assumption using residuals rather than the conditional normality approach. However, it is important to recognize that these two approaches can produce different results. Our recommendation is to use the conditional normality approach with an alpha adjusted for the number of levels (see the Bonferroni and Holm sections below). For example, if your factor has 4 levels, you would test the Shapiro-Wilk statistic using \(\alpha = .0125\) for each group (see Brooks et al., 2024).
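A sketch of this conditional (within-group) Shapiro-Wilk approach with an adjusted alpha, again assuming the long-format data frame `data`:

```r
# Minimal sketch: Shapiro-Wilk test within each treatment group, judged against
# an alpha adjusted for the number of groups (.10 / 3 here; .05 / 4 = .0125 for four groups).
by(data$Cholesterol, data$Treatment, shapiro.test)

alpha_adj <- 0.10 / nlevels(data$Treatment)
alpha_adj
```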
The assumption of homogeneity of variance was considered by observing boxplots of relatively equal length from each treatment. Note that because we have equal n’s in each treatment our F test would be robust to heterogeneity of variance.
The homogeneity of variance (often called homoscedasticity) assumption is frequently tested by comparing the equality of variances with an F statistic. Currently, the preferred test is Levene’s test of equality of variances. The details of Levene’s test are beyond the scope of this book, but it is useful to know that there are multiple “flavors” of Levene’s test. The most commonly calculated and reported flavor uses deviations from the group means in its calculation. However, some scholars prefer the version of Levene’s test that measures variation as the squared distance from the median instead of the mean. Programs differ in which flavor they report: some provide the mean-based version in their one-way ANOVA procedures and the other flavors in separate procedures.
The statistical null hypothesis for Levene’s test is as follows:
\[ \begin{align} H_0&: \sigma_1^2=\sigma_2^2 &&\text{(for 2 groups)} \\ H_0&: \sigma_1^2=\sigma_2^2=\sigma_3^2 &&\text{(for 3 groups)} \\ H_0&: \sigma_j^2=\sigma_k^2 &&\text{(for all groups j & k, where } j \ne k) \end{align} \]
Note that we would not typically write the null hypothesis for Levene’s test as a difference in variances (e.g., \(H_0: \sigma_1^2 - \sigma_2^2 = 0\)) because we use an F ratio to test this hypothesis. If we wrote it as anything besides simple equality, we would probably choose \(H_0: \sigma_1^2 / \sigma_2^2 = 1\). Just as a difference of zero indicates that two values are the same, a ratio of 1 means that the numerator and denominator are exactly the same.
We obtain the Levene’s test as part of the Independent t test or ANOVA in most programs (including JAMOVI).
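If you want the median-based flavor of Levene’s test described above, the car package provides it; this is an aside to the jamovi workflow, and it assumes the same long-format data frame `data`.

```r
# Minimal sketch using the car package (install.packages("car") if needed).
library(car)

leveneTest(Cholesterol ~ Treatment, data = data, center = mean)    # mean-based flavor
leveneTest(Cholesterol ~ Treatment, data = data, center = median)  # median-based flavor
```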
jmv::anovaOneW(
formula = Cholesterol ~ Treatment,
data = data,
welchs = FALSE,
eqv = TRUE)
ONE-WAY ANOVA
ASSUMPTION CHECKS
Homogeneity of Variances Test (Levene's)
──────────────────────────────────────────────────
F df1 df2 p
──────────────────────────────────────────────────
Cholesterol 0.04445 2 57 0.9566
──────────────────────────────────────────────────
The typical process used by researchers is the following (but see paragraph following these steps for our perspective):
1. Test the equality of variances using Levene’s test.
2. If Levene’s test is not statistically significant, assume equal variances and use the classical test (Fisher’s F or the pooled-variance Student’s t).
3. If Levene’s test is statistically significant, do not assume equal variances and use a robust test (Welch’s F or Welch’s t).
Although this is still the most common process used by researchers, we agree with Zimmerman (2004) and others (including Wilcox) who have written that we should consider always using the robust tests. Zimmerman showed that we may inflate our Type I error rate by using the process described above (what he called conditional testing, that is, choosing the test conditionally based on the significance of Levene’s test). The tests that assume equality of variances are generally preferred because they are more powerful, but research has shown that the power lost by using the robust tests is generally not substantial enough to justify avoiding them even when variances are equal. R also uses these robust tests as its defaults.
We will provide an example later of a situation where the researcher might conclude that variances across treatments are unequal. We will also examine the post hoc tests to use with this situation.
The non-parametric descriptive statistics for each of the treatments are shown in figure 15d. Leon noted that the median (242.5) of the cholesterol levels of women taking the Gamma herbal supplement pill was higher than the medians of the women taking either the Delta pill (221.5) or no pill (217.5). Also, the medians of the Delta group and the “no pill” group were relatively close. The means shown in figure 15e reflected the same pattern as the medians. A check of the variances from each treatment in figure 15e indicates that they may be considered equal. The measures of skewness and kurtosis indicate that the scores in each treatment are slightly positively skewed, with kurtosis values near zero, not enough to cause concern.
Note that choosing a very small and a very large percentile for this output (here, 0 and 100) produces the minimum and maximum in order alongside the 1st, 2nd, and 3rd quartiles. This can make it a bit easier to examine those values in order. You would need to choose the very small percentile based on the amount of data you have (note that zero does not work in some programs, so choose maybe 0.1), or simply rely on the Minimum and Maximum reported without worrying about having them in order with the other percentiles.
jmv::descriptives(
formula = Cholesterol ~ Treatment,
data = data,
mode = TRUE,
sd = FALSE,
range = TRUE,
iqr = TRUE,
pc = TRUE,
pcValues = "0,25,50,75,100")
DESCRIPTIVES
Descriptives
────────────────────────────────────────────────
Treatment Cholesterol
────────────────────────────────────────────────
N Delta 20
Gamma 20
None 20
Missing Delta 0
Gamma 0
None 0
Mean Delta 219.4
Gamma 252.9
None 227.9
Median Delta 221.5
Gamma 242.5
None 217.5
Mode Delta 200.0
Gamma 235.0
None 210.0
IQR Delta 71.25
Gamma 56.75
None 56.75
Range Delta 174.0
Gamma 180.0
None 180.0
Minimum Delta 146.0
Gamma 175.0
None 150.0
Maximum Delta 320.0
Gamma 355.0
None 330.0
0th percentile Delta 146.0
Gamma 175.0
None 150.0
25th percentile Delta 183.8
Gamma 223.0
None 198.0
50th percentile Delta 221.5
Gamma 242.5
None 217.5
75th percentile Delta 255.0
Gamma 279.8
None 254.8
100th percentile Delta 320.0
Gamma 355.0
None 330.0
────────────────────────────────────────────────
jmv::descriptives(
formula = Cholesterol ~ Treatment,
data = data,
hist = TRUE,
dens = TRUE,
bar = TRUE,
box = TRUE,
violin = TRUE,
dot = TRUE,
boxMean = TRUE,
qq = TRUE,
missing = FALSE,
variance = TRUE,
se = TRUE,
ci = TRUE,
skew = TRUE,
kurt = TRUE,
sw = TRUE)
DESCRIPTIVES
Descriptives
───────────────────────────────────────────────────────
Treatment Cholesterol
───────────────────────────────────────────────────────
N Delta 20
Gamma 20
None 20
Mean Delta 219.4
Gamma 252.9
None 227.9
Std. error mean Delta 10.25
Gamma 10.07
None 10.07
95% CI mean lower bound Delta 197.9
Gamma 231.8
None 206.8
95% CI mean upper bound Delta 240.9
Gamma 274.0
None 249.0
Median Delta 221.5
Gamma 242.5
None 217.5
Standard deviation Delta 45.86
Gamma 45.03
None 45.03
Variance Delta 2103
Gamma 2028
None 2028
Minimum Delta 146.0
Gamma 175.0
None 150.0
Maximum Delta 320.0
Gamma 355.0
None 330.0
Skewness Delta 0.2161
Gamma 0.3544
None 0.3544
Std. error skewness Delta 0.5121
Gamma 0.5121
None 0.5121
Kurtosis Delta -0.4833
Gamma 0.1189
None 0.1189
Std. error kurtosis Delta 0.9924
Gamma 0.9924
None 0.9924
Shapiro-Wilk W Delta 0.9631
Gamma 0.9753
None 0.9753
Shapiro-Wilk p Delta 0.6069
Gamma 0.8603
None 0.8603
───────────────────────────────────────────────────────
Note. The CI of the mean assumes sample means
follow a t-distribution with N - 1 degrees of
freedom
In figure 15f the test F statistic is found to be 2.95 using the One-way ANOVA output. The step-by-step description of how to execute One-way ANOVA is given in chapter 14.
The decision was made to reject the null hypothesis because the value of the test statistic was greater than the critical value, that is, \(2.95 > F_{(.90,2,40)} = 2.44\); since the critical value for the actual 57 degrees of freedom within is even smaller than 2.44, 2.95 is also greater than \(F_{(.90,2,57)}\) and is statistically significant. The p level was found to be .0601. Since the p level was less than our chosen \(\alpha = .10\), this result corroborated the decision to reject the null hypothesis that was based on the critical value.
When you fail to reject the null hypothesis in an exploratory ANOVA your statistical analysis is finished. You have found that there are no significant differences among your treatment means and so you stop. However, when you do find that there are significant differences among your treatment means, as we did in this example (because we were using \(\alpha = .10\)), the next question to ask is: Which treatments differ? The answer to this question is the subject of the next section.
The effect size most frequently reported for the omnibus ANOVA is either \(R^2\) or \(\eta^2\) (eta-squared). The calculation for \(R^2\) (and \(\eta^2\) since they are equal in One-way ANOVA) is straightforward.
\[ R^2 = \frac{SS_{Between}}{SS_{Total}} = \frac{12130.0}{129130.4} = .094 \]
This is interpreted as the treatment variable explaining just over 9% of the variation in Cholesterol scores. That is, which group the participants belong to explained about 9.4% of why they have different scores.
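As a quick check of this arithmetic in R, using the sums of squares from the preceding equation (also reported in the ANOVA table later in this section):

```r
# Eta-squared (equivalently, R-squared in a one-way ANOVA) from the sums of squares.
SSB <- 12130
SSW <- 117000
SSB / (SSB + SSW)   # approximately .094
```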
It may also be useful to report effect sizes like Cohen’s d for the post hoc comparisons that are performed (see the next section). That is, it is useful to report which groups are different based on these post hoc multiple comparisons, but it is more informative to describe with standardized effect sizes how different they are. A Cohen’s d effect size can be calculated for these comparisons as we did in Chapter 14, using the descriptive statistics provided in the outputs.
Note that, like many programs, JAMOVI provides two procedures that can be used for one-way ANOVA: (a) one called “One-Way ANOVA” and (b) one called simply “ANOVA”. If you have met the assumption of homogeneity of variances, then either is useful (in fact, both are useful because of the different information they provide). However, if you cannot assume equal variances, only the One-Way ANOVA procedure provides the robust Welch’s F test (and the associated robust Games-Howell post hoc comparisons).
jmv::anovaOneW(
formula = Cholesterol ~ Treatment,
data = data,
fishers = TRUE,
desc = TRUE,
descPlot = TRUE,
norm = TRUE,
qq = TRUE,
eqv = TRUE)
ONE-WAY ANOVA
One-Way ANOVA
──────────────────────────────────────────────────────────────
F df1 df2 p
──────────────────────────────────────────────────────────────
Cholesterol Welch's 2.905 2 38.00 0.0670
Fisher's 2.955 2 57 0.0601
──────────────────────────────────────────────────────────────
Group Descriptives
─────────────────────────────────────────────────────────────
Treatment N Mean SD SE
─────────────────────────────────────────────────────────────
Cholesterol Delta 20 219.4 45.86 10.25
Gamma 20 252.9 45.03 10.07
None 20 227.9 45.03 10.07
─────────────────────────────────────────────────────────────
ASSUMPTION CHECKS
Normality Test (Shapiro-Wilk)
───────────────────────────────────
W p
───────────────────────────────────
Cholesterol 0.9681 0.1175
───────────────────────────────────
Note. A low p-value suggests
a violation of the assumption
of normality
Homogeneity of Variances Test (Levene's)
──────────────────────────────────────────────────
F df1 df2 p
──────────────────────────────────────────────────
Cholesterol 0.04445 2 57 0.9566
──────────────────────────────────────────────────
jmv::ANOVA(
formula = Cholesterol ~ Treatment,
data = data,
effectSize = c("eta", "partEta"),
modelTest = TRUE,
homo = TRUE,
norm = TRUE,
qq = TRUE)
ANOVA
ANOVA - Cholesterol
───────────────────────────────────────────────────────────────────────────────────────────────
Sum of Squares df Mean Square F p η² η²p
───────────────────────────────────────────────────────────────────────────────────────────────
Overall model 12130 2 6065 2.955 0.0601
Treatment 12130 2 6065 2.955 0.0601 0.0939 0.0939
Residuals 117000 57 2053
───────────────────────────────────────────────────────────────────────────────────────────────
ASSUMPTION CHECKS
Homogeneity of Variances Test (Levene's)
────────────────────────────────────────
F df1 df2 p
────────────────────────────────────────
0.04445 2 57 0.9566
────────────────────────────────────────
Normality Test (Shapiro-Wilk)
─────────────────────────────
Statistic p
─────────────────────────────
0.9681 0.1175
─────────────────────────────
Following an overall ANOVA where the results indicate that there are significant differences among the population treatment means, a natural question to ask is: Which means differ? For example, in the cholesterol experiment the F value indicated that at the .10 level of significance some or all of the following sample means came from different populations:
\[ \begin{align} &\underline{Gamma} &&\underline{Delta} &&&\underline{No Pill} \\ &252.9 &&219.4 &&&227.9 \end{align} \]
After obtaining the overall significant F value, and observing these means, you would want to ask one or more questions concerning mean differences. Below are some of these questions expressed as post hoc research problems and their corresponding post hoc null and alternative statistical hypotheses. The post hoc research hypotheses are that differences between the treatments being considered are expected. Here, \(\mu_1\) is the Gamma population treatment mean, \(\mu_2\) is the Delta population treatment mean, and \(\mu_3\) is the No Pill population treatment mean.
\[ H_0: \mu_1 - \mu_2 = 0 \\ H_A: \mu_1 - \mu_2 \ne 0 \]
\[ H_0: \mu_1 - \mu_3 = 0 \\ H_A: \mu_1 - \mu_3 \ne 0 \]
\[ H_0: \mu_2 - \mu_3 = 0 \\ H_A: \mu_2 - \mu_3 \ne 0 \]
\[ H_0: \frac{\mu_1+\mu_2}{2} - \mu_3 = 0 \\ H_A: \frac{\mu_1+\mu_2}{2} - \mu_3 \ne 0 \]
\[ H_0: \mu_1 - \frac{\mu_2+\mu_3}{2} = 0 \\ H_A: \mu_1 - \frac{\mu_2+\mu_3}{2} \ne 0 \]
\[ H_0: \mu_2 - \frac{\mu_1+\mu_3}{2} = 0 \\ H_A: \mu_2 - \frac{\mu_1+\mu_3}{2} \ne 0 \]
There is a good deal that we can learn about post hoc hypothesis testing from these example problems and their statistical hypotheses. In the following paragraphs some of the features of post hoc hypothesis testing are discussed in greater detail.
A linear contrast or comparison among treatment means is defined as a linear combination of the treatment means in which the contrast coefficients sum to zero. A contrast coefficient is a number that multiplies a treatment mean, and it is assumed that not all of the contrast coefficients are zero.
For example, the comparison made in the first post hoc null hypothesis, above, contained the contrast coefficients +1, -1, and 0, that is, \(+1\mu_1 - 1\mu_2 + 0\mu_3 = 0\), and the sum of these contrast coefficients is zero (i.e., 1 - 1 + 0 = 0). Also, in the contrast of the fifth null hypothesis, the coefficients were +1, -1/2, and -1/2, that is, \(+1\mu_1 - (1/2)\mu_2 - (1/2)\mu_3\), which also sum to zero.
You should note that in discussing a contrast, the zero coefficients are always written (as was done for the first contrast), but that zero coefficients are omitted in the statistical hypotheses. You should also verify that the contrast coefficients in each of the preceding null hypotheses sum to zero.
\(\psi\), the Symbol For A Contrast (Comparison)
Statisticians usually denote a population contrast on treatment means by the Greek letter \(\psi\) (psi). A subscript is usually added to \(\psi\) to denote a particular contrast. Therefore, we have for the first null hypothesis that \(\psi_1 = \mu_1 - \mu_2\), and for the last null hypothesis that \(\psi_6 = \mu_2 - (\mu_1 + \mu_3)/2\). When a contrast is made on sample means we have an estimate of the population parameter \(\psi\), which is usually denoted by \(\hat{\psi}\). Note that the sample statistic is an estimate of the population parameter and that the “hat” (^) indicates an estimate. Therefore, we could write the sample mean as \(\hat{\mu}\).
The first post hoc statistical hypothesis can be written as:
\[ H_0: \psi_1 = 0 \\ H_A: \psi_1 \ne 0 \]
Here, the sample statistic, that is, our estimate of \(\psi\), would be found as:
\[ \hat{\psi}_1 = M_1 - M_2 = 252.9 - 219.4 = 33.5 \]
It can be shown that when you multiply a contrast by any nonzero constant the test of the resulting contrast does not change. For example, the test of
\[ H_0: \psi_1 = \mu_1 - \mu_2 = 0 \]
is the same as testing
\[ H_0: -\psi_1 = -\mu_1 + \mu_2 = 0 \]
or
\[ H_0: 5\psi_1 = 5\mu_1 - 5\mu_2 = 0 \]
where the numbers -1 and 5 were arbitrarily chosen as the constants. Therefore, in general we have that testing a hypothesis on any contrast, \(\psi\), is the same as testing a null hypothesis on \(c_0\psi\), where \(c_0\) is any nonzero constant. In this sense, \(\psi\) is said to be unique.
One reason that it is important to understand the concept of a unique contrast is because in entering contrast coefficients into a computer program some contrast coefficients cannot be entered exactly. For example, contrast coefficients such as \(1 – 1/3 – 1/3 – 1/3\) cannot be entered exactly in many programs because 1/3 must be entered as a never ending sequence of 3’s (i.e., \(0.33333...\)). However, if we multiply these coefficients by 3, we may enter them as \(3 – 1 – 1 – 1\) to obtain the same test results. Note however, that although the test results do not change, the size of the estimated contrast and other preliminary statistics will be altered.
For example, if we tested the null hypothesis
\[ H_0: 5\psi_1 = 5\mu_1 - 5\mu_2 = 0 \]
we would calculate the contrast in the sample as
\[ \begin{align} 5\hat{\psi}_1 &= 5M_1 - 5M_2 \\ &= 5(252.9) - 5(219.4) \\ &= 167.5 \end{align} \]
which is 5 times larger than the value of the contrast we calculated above, which was \(\hat{\psi}_1 = 33.5\).
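A small base R sketch (using the treatment means, MSW, and n reported elsewhere in this chapter) makes the point concrete: rescaling the contrast coefficients changes \(\hat{\psi}\) but not the test statistic.

```r
# Minimal sketch: a contrast estimate and its t statistic from summary values.
means <- c(Gamma = 252.9, Delta = 219.4, None = 227.9)
n     <- 20          # units per treatment
MSW   <- 2052.64     # mean square within from the overall ANOVA

contrast_t <- function(coefs) {
  psi_hat <- sum(coefs * means)
  se      <- sqrt(MSW * sum(coefs^2 / n))
  c(psi_hat = psi_hat, t = psi_hat / se)
}

contrast_t(c(1, -1, 0))   # psi_hat =  33.5
contrast_t(c(5, -5, 0))   # psi_hat = 167.5, but the same t value
```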
A “pairwise” contrast involves a comparison between only two means. A “non-pairwise” contrast involves a comparison among three or more means. In the preceding cholesterol statistical hypotheses the first three involve pairwise contrasts (e.g., \(\psi_1 = \mu_1 - \mu_2\)), and the second three involve non-pairwise contrasts (e.g., \(\psi_6 = \mu_2 - (\mu_1 + \mu_3)/2\)).
The significance level at which you test each contrast is usually referred to as the per comparison error rate. That is, it is the probability of making a Type I error (i.e., rejecting a true null hypothesis) in testing a given comparison. The notation \(\alpha_{PC}\) will be used to denote the per comparison error rate.
In an experiment involving K treatments there are Q possible unique contrasts (both pairwise and non-pairwise) that can be made among the treatment means, where:
\[ \begin{equation} Q = 1 + \frac{3^K-1}{2} - 2^K \tag{15-11} \end{equation} \]
The notation \(\psi_q\), where q = 1, 2, … , Q, will be used to denote any contrast. In the cholesterol study K = 3; therefore,
\[ \begin{align} Q &= 1 + \frac{3^3-1}{2} - 2^3 \\ &= 1 + \frac{27-1}{2} - 8 \\ &= 1 + 13 - 8 = 6 \end{align} \]
That is, in a one-way experiment with three treatments there are six possible unique comparisons that can be made among the treatment means. These six possible unique comparisons are shown in the preceding six post hoc null hypotheses.
In an experiment involving K treatments there are T possible unique pairwise contrasts, where
\[ \begin{equation} T = \frac{K(K-1)}{2} \tag{15-12} \end{equation} \]
Therefore, in the cholesterol experiment there are
\[ T = \frac{3(3-1)}{2} = 3 \]
pairwise comparisons. These three unique pairwise comparisons are specified above in the first three post hoc hypotheses. Note that K can also be viewed as the number of variables in a bivariate or pairwise correlation scenario (note also that some textbooks use J instead of K). In this scenario, T would be the total number of correlations to be analyzed. This pairwise formula applies to any situation where you want to count the number of possible pairs of things. For example, if you have three variables, then there are three possible pairwise correlations you can analyze (i.e., \(r_{12}\), \(r_{13}\), and \(r_{23}\); \(r_{31}\) is the same correlation as \(r_{13}\)).
The familywise error rate (FWER) is the probability of one or more Type I errors being made among a set or family of comparisons. It can be shown that if each comparison in a family has a per comparison error rate of \(\alpha_{PC}\), then the familywise error rate, denoted by \(\alpha_{FW}\) is:
\[ \begin{equation} \alpha_{FW} \le 1 - (1 - \alpha_{PC})^C \tag{15-13} \end{equation} \]
where C is the number of comparisons in the family. The familywise error rate will equal the probability on the right when the comparisons are independent. Note that C can also be viewed more generically as the number of hypothesis tests to be performed (i.e., it is not only specific to multiple comparison testing).
For example, if we decided to test all of the possible pairwise and non-pairwise comparisons shown above for the cholesterol experiment at the .10 level of significance, the familywise error rate would be found as:
\[ \alpha_{FW} \le 1 - (1 - .10)^6 = 1 - (.90)^6 = .4686 \]
That is, the probability of falsely rejecting one or more contrasts could be as large as .4686. For most researchers the possibility of having this high of a familywise error rate would be intolerable. Fortunately, statistical tests exist which will test all of the preceding hypotheses and still hold the familywise error rate at a reasonable level. A discussion of these post hoc statistical procedures follows.
There are a variety of post hoc comparison test procedures that a researcher can choose from in order to hold the familywise error rate in an exploratory analysis at a reasonable level. Most of these procedures have been named after the person who first described them. For example, to name a few, there is Duncan’s New Multiple Range Test, Fisher’s Least Significant Difference (LSD), the Newman-Keuls Test, Scheffé’s Test, and Tukey’s Honestly Significant Difference (HSD) or Wholly Significant Difference (WSD) test.
In an exploratory analysis, we favor, along with others (see Kirk, 1982, pp. 146-148, and Keppel, 1983, pp. 164-165) the Tukey procedure for pairwise contrasts and the Scheffé procedure when a family of comparisons contains non-pairwise comparisons. Therefore, the Tukey and Scheffé procedures are discussed and illustrated next as the test procedures to be used following a significant F test in an exploratory analysis of variance. A detailed discussion of most post hoc comparison procedures may be found in presentations by Miller (1966, 1977). But first, we would like to introduce a general-purpose adjustment procedure that works well (but conservatively) with few assumptions: the Bonferroni alpha-adjustment procedure.
The Bonferroni alpha-adjustment technique, based on Boole’s inequality, can be used in any situation where we have multiple hypothesis tests and want to control the familywise error rate (FWER). Therefore, the Bonferroni technique, and some modified-Bonferroni techniques, can be used in many multiple hypothesis testing situations, not only for multiple comparisons.
The important decision concerns what is considered a “family” of hypothesis tests. As an example, when we perform post hoc tests after a statistically significant ANOVA, most scholars consider that a family of hypothesis tests (primarily because we are partitioning the same variance to form these contrasts). Some researchers have considered tests of multiple correlations and multiple regression predictors to be from the same family, but there is no consensus in those situations like there is for ANOVA post hoc tests.
The steps for the Bonferroni adjustment are relatively straightforward:

1. Decide on the familywise error rate, \(\alpha_{FW}\), to be held for the family of tests (e.g., .10).
2. Count the number of hypothesis tests, C, in the family.
3. Test each hypothesis using the per comparison level of significance \(\alpha_{PC} = \alpha_{FW}/C\) (equivalently, multiply each p value by C and compare it to \(\alpha_{FW}\)).
Note that the null hypothesis decision rule here can be written as either:
\[ \begin{align} &\text{Reject } H_0 \text{ if } p \le \alpha_{FW}/C \\ &\quad\text{or} \\ &\text{Reject } H_0 \text{ if } (p \times C) \le \alpha_{FW} \end{align} \]
Therefore, we must pay attention to whether our statistical program has already adjusted the p values it reports. This p adjustment (rather than alpha adjustment) is typically what programs do, because they do not know what level of significance we are using. Therefore, we simply compare the already-adjusted p value to our original alpha level. This caveat applies to all multiple hypothesis testing procedures, including Tukey, Scheffé, and Games-Howell (the computer program will usually indicate when p has been adjusted).
While the Bonferroni technique will keep the FWER at or below the desired (nominal) \(\alpha_{FW}\) level, scholars have shown that in some situations the resulting reduction in alpha causes power to decrease substantially. Therefore, the Bonferroni technique is not recommended when power is a particular concern (e.g., small sample sizes). Researchers have shown that the Tukey test generally has more power than the Bonferroni technique, for example.
It has been shown that Holm’s (1979) modified Bonferroni procedure protects against FWER Type I error inflation as well as Bonferroni, but provides more power. Holm’s test is a bit more complicated to implement, but not too bad.
The rationale for the Holm procedure (paraphrased from Howell, 2002, p. 387; italics indicate text we’ve added):
When we reject the null hypothesis for the test with the smallest significance, we are declaring that null hypothesis to be false. If it is false [in reality], that leaves only N - 1 possible true null hypotheses, and so we only need to protect against N - 1 type I errors; this same logic follows for all remaining tests. [If it is not false in reality, then you’ve already made at least one Type I error and additional ones don’t matter.] The logic makes sense, in particular, when we believe strongly that several null hypotheses are almost certain to be false – if they are indeed false, there is no reason to protect against erroneously rejecting them.
There are other, newer, modified Bonferroni procedures that are used by researchers, but they tend to require additional assumptions. The Bonferroni and Holm techniques are always safe (but perhaps a little conservative, especially Bonferroni) as long as the other assumptions of the relevant statistical tests have been met. That is, Bonferroni and Holm adjust p values provided by other statistical tests, so those p values must be meaningful. But they require fewer assumptions than most other techniques (e.g., Hochberg). There is also a newer, similar approach called the False Discovery Rate (FDR), introduced by Benjamini and Hochberg (1995), but it controls the FDR rather than the FWER.
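In R, the base function p.adjust() implements the Bonferroni, Holm, and Benjamini-Hochberg adjustments; the p values below are hypothetical, used only to show the mechanics.

```r
# Minimal sketch: adjust a family of p values so each can be compared
# directly to the unadjusted familywise alpha (e.g., .10).
p_raw <- c(0.004, 0.030, 0.041, 0.220)   # hypothetical unadjusted p values

p.adjust(p_raw, method = "bonferroni")   # multiplies each p by the number of tests (capped at 1)
p.adjust(p_raw, method = "holm")         # Holm's step-down procedure (controls FWER)
p.adjust(p_raw, method = "BH")           # Benjamini-Hochberg (controls FDR, not FWER)
```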
In considering the assumptions that are necessary for the overall F test to be valid, we found that the F test is robust with respect to a violation of the assumption of homogeneity of variance when there are an equal number of units in the treatments. However, when the homogeneity assumption is violated, and there are an unequal number of units in the treatments, the F test is not robust. This same result holds true for the robustness of the Tukey and Scheffé tests. Therefore, in what follows the Tukey and Scheffé tests will be presented under conditions where there are an equal number of units in each treatment and then under conditions where there are an unequal number of units in each treatment.
For example, the data in Figure 15g have been modified from Figure 15c(i) by deleting cases from each group in order to create a dataset with unequal sample sizes and unequal variances. The One-way ANOVA was performed for these data and the output is shown in Figure 15h. In Figure 15h, we can see that Levene’s test is statistically significant, p = .048. Because Levene’s test is statistically significant, we must reject the null hypothesis that variances are equal and conclude that variances in the population are not equal. This is a violation of the homogeneity of variance assumption.
Person_ID | Gamma | Delta | None | z_Gamma | z_Delta | z_None |
---|---|---|---|---|---|---|
1 | 235 | 260 | 210 | -0.3975 | 0.8854 | -0.3975 |
2 | 240 | 160 | 215 | -0.2865 | -1.2954 | -0.2865 |
3 | 288 | 255 | 263 | 0.7795 | 0.7763 | 0.7795 |
4 | 245 | 175 | 220 | -0.1754 | -0.9682 | -0.1754 |
5 | 355 | 250 | 330 | 2.2675 | 0.6673 | 2.2675 |
6 | 223 | 260 | 198 | -0.6640 | 0.8854 | -0.6640 |
7 | 285 | 185 | 260 | 0.7129 | -0.7502 | 0.7129 |
8 | 223 | 156 | 198 | -0.6640 | -1.3826 | -0.6640 |
9 | 275 | 273 | 250 | 0.4908 | 1.1689 | 0.4908 |
10 |  |  | 179 | -1.0860 | -0.0305 | -1.0860 |
11 | 268 | 180 | 243 | 0.3353 | -0.8592 | 0.3353 |
12 | 272 |  | 247 | 0.4242 | 0.5583 | 0.4242 |
13 | 235 |  | 210 | -0.3975 | -0.4231 | -0.3975 |
14 | 297 |  | 272 | 0.9794 | 0.3402 | 0.9794 |
15 | 235 | 255 | 210 | -0.3975 | 0.7763 | -0.3975 |
16 | 278 |  | 253 | 0.5574 | -0.4231 | 0.5574 |
17 | 322 | 320 |  | 1.5346 | 2.1938 | 1.5346 |
18 |  |  | 195 | -0.7306 | 0.1221 | -0.7306 |
19 |  | 190 |  | -1.5523 | -0.6411 | -1.5523 |
20 |  | 146 | 150 | -1.7300 | -1.6007 | -1.7300 |
jmv::anovaOneW(
formula = Cholesterol ~ Treatment,
data = data,
fishers = TRUE,
welchs = TRUE,
desc = TRUE,
descPlot = TRUE,
eqv = TRUE)
ONE-WAY ANOVA
One-Way ANOVA
──────────────────────────────────────────────────────────────
F df1 df2 p
──────────────────────────────────────────────────────────────
Cholesterol Welch's 5.832 2 27.94 0.0076
Fisher's 5.269 2 45 0.0088
──────────────────────────────────────────────────────────────
Group Descriptives
──────────────────────────────────────────────────────────────
Treatment N Mean SD SE
──────────────────────────────────────────────────────────────
Cholesterol Delta 14 218.9 54.25 14.500
Gamma 16 267.2 37.24 9.310
None 18 227.9 41.20 9.712
──────────────────────────────────────────────────────────────
ASSUMPTION CHECKS
Homogeneity of Variances Test (Levene's)
────────────────────────────────────────────────
F df1 df2 p
────────────────────────────────────────────────
Cholesterol 3.245 2 45 0.0483
────────────────────────────────────────────────
We know that the ANOVA F statistic is relatively robust to this violation when sample sizes are equal; however, here we do not have equal sample sizes. Therefore, the more conservative approach is to use a robust test of the equality of means. We requested the Welch F test because research has shown that it maintains the nominal Type I error rate better than the other option, the Brown-Forsythe test. The Welch F statistic is 5.832 with (2, 27.94) degrees of freedom and is statistically significant, p = .0076. Because we have a statistically significant omnibus test, we will follow up with an appropriate post hoc test described below (e.g., Games-Howell).
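For reference, base R’s oneway.test() produces the same Welch F without jamovi; this is a cross-check that assumes the modified unequal-n data are in the long-format data frame `data`.

```r
# Minimal sketch: Welch's F (no equal-variance assumption) versus the classical Fisher F.
oneway.test(Cholesterol ~ Treatment, data = data)                    # Welch (the default)
oneway.test(Cholesterol ~ Treatment, data = data, var.equal = TRUE)  # Fisher
```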
In this section, Tukey’s Honestly Significant Difference (HSD) test will be described as the procedure to test pairwise contrasts following a significant overall F test in an exploratory ANOVA. That is, Tukey’s test is used here to provide us with an answer to the question: Do one or more mean pairs differ? In so doing, it allows us to select a familywise error rate. That is, in using Tukey’s HSD test with \(\alpha_{FW} = .10\), the probability of at least one pairwise test being found significant when it should not be is .10. Note, however, that if you suspect that your variances are extremely heterogeneous, e.g., if the ratio of one variance to another is of the order of 10 to 1, you should use a modification of Tukey’s test which will be described in a following section.
In what follows, Tukey’s HSD test will be described in general using a step-by-step procedure and illustrated at each step using all pairwise contrasts (i.e., the first three contrasts) from the preceding cholesterol example. While it is perhaps instructive to see the process for one of the methods (e.g., Tukey’s method above), statistical computer programs make it unnecessary to calculate these contrasts and comparisons ourselves, so we won’t show the steps or calculations for any other methods.
Step 1. Find the critical value of the studentized range statistic, \(SR_C\), in Table A.6 using the degrees of freedom within (\(v_2\)), the number of treatment means (J), and the chosen familywise error rate (\(\alpha_{FW}\)).

Example. For the cholesterol experiment the preceding values were found to be: \(v_2 = 57\), \(J = 3\), and \(\alpha_{FW} = .10\). Therefore, from Table A.6, after resetting \(v_2\) at the conservative value of 40, we have that: \(SR_C = 2.99\).
Step 2. Calculate the HSD critical value:

\[ HSD = SR_C \sqrt{MSW/n} \tag{15-14} \]
where \(SR_C\) is the critical value of the studentized range statistic found from Table A.6 in step 1; MSW is the mean square within found from the overall ANOVA table; n is the number of units in each treatment.
Example. For the cholesterol example we have that: \(SR_C = 2.99\); \(MSW = 2052.64\) (from figure 15f); \(n = 20\). Therefore, we have that:
\[ HSD = 2.99 \sqrt{2052.64/20} = 30.29 \]
Step 3. Prepare a table of the differences between all pairs of treatment means (each difference is a pairwise contrast, \(\hat{\psi}_q\)).

Example. For the cholesterol experiment we have:

 | \(M_1 = 252.9\) (Gamma) | \(M_2 = 219.4\) (Delta) | \(M_3 = 227.9\) (None) |
---|---|---|---|
\(M_1 = 252.9\) | - | \(\hat{\psi}_1 = 33.5\) | \(\hat{\psi}_2 = 25.0\) |
\(M_2 = 219.4\) | - | - | \(\hat{\psi}_3 = -8.5\) |

Step 4. Consider as significantly different from zero the mean differences in the table prepared in Step 3 whose absolute values are greater than the HSD critical value found in Step 2.
Example. In our example we have that only \(|\hat{\psi}_1| = 33.5 > 30.29\). Therefore, only the null hypothesis
\(H_0: \psi_1 = \mu_1 - \mu_2 = 0\) would be rejected at the .10 level of significance. That is, it is unreasonable to find \(\hat{\psi}_1 = 33.5\) if the sampling distribution of \(\hat{\psi}_1\) had a mean of zero.
Finally, a set of simultaneous confidence intervals may be formed around each pairwise contrast:

\[ \hat{\psi}_q - HSD < \psi_q < \hat{\psi}_q + HSD \tag{15-15} \]
A set of simultaneous confidence intervals is a set of confidence intervals that all will contain the population contrasts \(100(1 – \alpha_{FW})\text{%}\) of the time. That is, in the cholesterol experiment, if we were to repeat the experiment an infinite number of times, and each time we computed the set of confidence intervals, 90% of the sets of confidence intervals would contain all of the population contrasts.
Example. The set of 90% confidence intervals for the cholesterol study is:
\[ \begin{align} 33.5 - 30.29 &< \psi_1 < 33.5 + 30.29 &&\text{(interval does NOT contain 0)} \\ 25.0 - 30.29 &< \psi_2 < 25.0 + 30.29 &&\text{(interval does contain 0)} \\ -8.5 - 30.29 &< \psi_3 < -8.5 + 30.29 &&\text{(interval does contain 0)} \end{align} \]
Note that intervals containing zero are centered around the contrasts that were considered to not differ significantly from zero. Intervals that do not contain zero are considered statistically significant.
The results indicate that there is a significant difference between the average cholesterol levels of the women taking the Gamma Company’s herbal supplement pill and the cholesterol levels of women taking the Delta company’s herbal supplement pill. The means indicate that the cholesterol levels of the women taking the Gamma Company’s herbal supplement pill are higher than the cholesterol levels of the women taking the Delta Company’s herbal supplement pill. No significant differences were found among the other treatments.
We can see from the output in Figure 15f(ii) that only the mean comparison between Gamma and Delta is statistically significant, with p = .059 (recall that we are using alpha = .10). Note that the 90% Confidence Interval provided uses the exact SRc value for the analysis, which is closer to the 2.96 that we would have used for 60 degrees of freedom (from Appendix A.6) since we actually had 57 degrees of freedom within.
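Base R’s TukeyHSD() reproduces these comparisons, with adjusted p values and simultaneous confidence intervals, assuming the original equal-n data are in the long-format data frame `data`:

```r
# Minimal sketch: Tukey HSD comparisons with 90% simultaneous confidence intervals.
fit <- aov(Cholesterol ~ Treatment, data = data)
TukeyHSD(fit, conf.level = 0.90)
```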
The Delta and None groups were not statistically significantly different on the Cholesterol variable (p = .84); therefore they are shown in the same subset (i.e., Subset 1). The same is true for the Gamma and None groups (p = .198), except that they are together in Subset 2. However, because the Gamma and Delta treatments were statistically significantly different at the alpha = .10 level, they are in different subsets: Delta in Subset 1 and Gamma in Subset 2. However, research has shown that this subset approach does not work well when sample sizes are unequal, because of how the calculations are performed. When you obtain this table with unequal sample sizes, this message is provided as a footnote to the table: “The group sizes are unequal. The harmonic mean of the group sizes is used. Type I error levels are not guaranteed.”
Tukey’s test was modified by Kramer to handle situations where there are unequal n’s (but equal variances). This Tukey-Kramer procedure is what most programs use for the Tukey HSD analysis. It is important to note that, theoretically, there is no problem with performing Tukey HSD or the pooled-variance Student’s t test with unequal sample sizes, even vastly different sample sizes (e.g., \(n_1 = 10\) and \(n_2 = 1000\)). However, the larger the difference in sample sizes, the smaller the difference in variances needed to push Type I error rates beyond acceptable levels.
Tukey’s test was modified by Games and Howell (1976) to handle situations where there are both unequal n’s and unequal variances. The Games and Howell modification of Tukey’s test (GHT) is recommended for use when you have an unequal number of subjects in some treatments or when you have equal n’s but extremely heterogeneous variances. This is because GHT is robust to violations of the assumption of homogeneity of variance. The GHT test has been shown to be slightly liberal (e.g., .054 instead of .05; see Tamhane, 1979) when a violation of homogeneity is not present, but this does not appear to be a major problem for most data analysts.
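In its standard form (again sketched here for reference), the GHT statistic for treatments \(j\) and \(k\) replaces the pooled error term with the separate group variances and uses Welch-type degrees of freedom:
\[ q_{jk}=\frac{\bar{X}_j-\bar{X}_k}{\sqrt{\frac{1}{2}\left(\frac{s_j^2}{n_j}+\frac{s_k^2}{n_k}\right)}}, \qquad df'=\frac{\left(\frac{s_j^2}{n_j}+\frac{s_k^2}{n_k}\right)^2}{\frac{(s_j^2/n_j)^2}{n_j-1}+\frac{(s_k^2/n_k)^2}{n_k-1}} \]
and \(q_{jk}\) is compared with the studentized range critical value for \(K\) means and \(df'\) degrees of freedom.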
# Tukey HSD post hoc tests for the cholesterol data via the jmv (jamovi) package;
# `data` must contain the Cholesterol and Treatment variables.
jmv::anovaOneW(
    formula = Cholesterol ~ Treatment,
    data = data,
    welchs = FALSE,
    phMethod = "tukey",
    phTest = TRUE,
    phFlag = TRUE)
ONE-WAY ANOVA
POST HOC TESTS
Tukey Post-Hoc Test – Cholesterol
───────────────────────────────────────────────────────────
Delta Gamma None
───────────────────────────────────────────────────────────
Delta Mean difference — -33.50 -8.500
t-value — -2.338 -0.5933
df — 57.00 57.00
p-value — 0.0586 0.8243
Gamma Mean difference — 25.000
t-value — 1.7450
df — 57.00
p-value — 0.1976
None Mean difference —
t-value —
df —
p-value —
───────────────────────────────────────────────────────────
Note. * p < .05, ** p < .01, *** p < .001
Figure 15h(ii) presents the Games-Howell output from the One-way ANOVA procedure. We can see in this example that two of the mean comparisons are statistically significant: Gamma vs. Delta, with p = .0265, and Gamma vs. None, with p = .0170. Recall that these are different data from those used for the Tukey HSD earlier (cases had been deleted to create the data in Figure 15g), so there are no contradictions between these results.
# Games-Howell post hoc tests, descriptive statistics, and assumption checks via
# the jmv package; this run uses the modified data set described in Figure 15g.
jmv::anovaOneW(
    formula = Cholesterol ~ Treatment,
    data = data,
    fishers = TRUE,
    desc = TRUE,
    descPlot = TRUE,
    norm = TRUE,
    qq = TRUE,
    eqv = TRUE,
    phMethod = "gamesHowell",
    phTest = TRUE,
    phFlag = TRUE)
ONE-WAY ANOVA
One-Way ANOVA
──────────────────────────────────────────────────────────────
F df1 df2 p
──────────────────────────────────────────────────────────────
Cholesterol Welch's 5.832 2 27.94 0.0076
Fisher's 5.269 2 45 0.0088
──────────────────────────────────────────────────────────────
Group Descriptives
──────────────────────────────────────────────────────────────
Treatment N Mean SD SE
──────────────────────────────────────────────────────────────
Cholesterol Delta 14 218.9 54.25 14.500
Gamma 16 267.2 37.24 9.310
None 18 227.9 41.20 9.712
──────────────────────────────────────────────────────────────
ASSUMPTION CHECKS
Normality Test (Shapiro-Wilk)
───────────────────────────────────
W p
───────────────────────────────────
Cholesterol 0.9639 0.1448
───────────────────────────────────
Note. A low p-value suggests
a violation of the assumption
of normality
Homogeneity of Variances Test (Levene's)
────────────────────────────────────────────────
F df1 df2 p
────────────────────────────────────────────────
Cholesterol 3.245 2 45 0.0483
────────────────────────────────────────────────
POST HOC TESTS
Games-Howell Post-Hoc Test – Cholesterol
───────────────────────────────────────────────────────────
Delta Gamma None
───────────────────────────────────────────────────────────
Delta Mean difference — -48.32 -9.016
t-value — -2.804 -0.5166
df — 22.60 23.64
p-value — 0.0265 0.8640
Gamma Mean difference — 39.306
t-value — 2.9216
df — 31.99
p-value — 0.0170
None Mean difference —
t-value —
df —
p-value —
───────────────────────────────────────────────────────────
Note. * p < .05, ** p < .01, *** p < .001
When non-pairwise contrasts are included in the family of contrasts, the recommended procedure is Scheffé’s test. Like the Tukey test, the Scheffé test allows you to select a familywise error rate.
Unfortunately, specialized software is required to easily calculate the Scheffé contrasts. For example, Contrast 1 in the output corresponds to Contrast #4 from the section above, with the following hypotheses:
\[ \begin{align} H_0&: \frac{\mu_1+\mu_2}{2} - \mu_3 = 0 \\ H_A&: \frac{\mu_1+\mu_2}{2} - \mu_3 \ne 0 \end{align} \]
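For reference, the standard Scheffé criterion (not shown in the output) declares a contrast \(\hat{\psi}=\sum_j c_j\bar{X}_j\) statistically significant when
\[ \frac{|\hat{\psi}|}{\sqrt{MS_W\sum_j c_j^2/n_j}} \;\ge\; \sqrt{(K-1)\,F_{\alpha_{FW};\,K-1,\,N-K}} \]
which is what makes the procedure valid for any pairwise or non-pairwise contrast in the family.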
We recommend an R Shiny app that we have created to perform the Scheffé comparisons. The website is:
R Shiny App for Scheffé Comparisons
Here are results from that Shiny app using the data in Figure 15c.
Two important notes about this process. First, Scheffé will work for unequal n’s; however, it cannot adjust for unequal variances. Therefore, if you have a statistically significant Levene’s test, you must use the Brown-Forsythe adaptation of Scheffé (just as Games-Howell is the adaptation of Tukey).
Another option may be to recode the data to create the two groups you wish to compare (note that contrasts really just compare two groups: a group with negative contrast coefficients and a group with positive contrast coefficients). Then you can run a two-group analysis (either a one-way ANOVA or an independent t test) and use an appropriate robust test (e.g., Welch) to address the violation of the homogeneity of variance assumption.
Second, you can perform as many post hoc multiple comparison tests as you like using the Scheffé approach. However, when treating these as a priori contrasts, the contrasts (and ideally the directionality of the differences, since these are confirmatory analyses, that is, which mean or combination of means is expected to be larger) must be specified before you collect the data and run the analyses. In that scenario, most scholars agree that you do not need to adjust alpha as long as you have limited yourself to a small number of a priori contrasts; most would suggest no more than K – 1 such contrasts, where K is the total number of groups (and ideally these K – 1 contrasts would be orthogonal, although that is not required). If instead you use this procedure to perform follow-up (post hoc) tests, you must adjust alpha for the multiple hypothesis tests you are performing (e.g., using the Bonferroni or Holm alpha-adjustment procedures).
Finally, it does not generally fit the logic of research and hypothesis testing to perform both confirmatory and exploratory analyses (i.e., a priori contrasts and post hoc comparisons) with the same data. Be very careful, and justify your analyses well, if you choose to do so.
Scheffé’s test for unequal n’s is based on a modification developed by Brown and Forsythe (1974). The Brown and Forsythe modification of Scheffé’s test (BFS) is recommended for use when you have an unequal number of subjects in each treatment or when you have equal n’s but extremely heterogeneous variances. This is because BFS is robust to violations of the assumption of homogeneity of variance.
The Shiny App above will calculate the Brown-Forsythe adjustment to Scheffé for the most explanatory Scheffé comparison.
Unfortunately, while most computer programs provide Scheffé post hoc tests as an option, the tests provided are typically only the pairwise tests. The strength (and the curse) of the Scheffé post hoc procedure is that it allows researchers to test a theoretically infinite number of possible non-pairwise comparisons. As a result, when Scheffé is used only for pairwise comparisons, it has low power, because the alpha adjustment protects against all possible pairwise and non-pairwise comparisons. This is why researchers rarely use and report Scheffé comparisons in the literature.
In fact, most statistical computer programs do not provide non-pairwise tests without extra effort on the part of the researcher to define those contrasts. In R, we can create non-pairwise comparisons only in the One-way ANOVA procedure after defining the contrast coefficients. We could use the Scheffé method to perform the statistical hypothesis testing, but the more common approach seems to be to use the Bonferroni (or Holm) adjustment with the p values provided by the contrast output.
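A minimal base-R sketch of that approach follows. It assumes the `data` object used earlier in the chapter and tests, for illustration, the non-pairwise contrast that averages the Delta and Gamma groups against the None group; the contrast coefficients and the assumption of two planned contrasts are illustrative choices, not results from the text.

# Fit the one-way ANOVA and extract the pooled error variance and its df
fit  <- aov(Cholesterol ~ Treatment, data = data)
mse  <- deviance(fit) / df.residual(fit)      # MS within (pooled error variance)
df_w <- df.residual(fit)

# Group means and sizes
means <- tapply(data$Cholesterol, data$Treatment, mean)
ns    <- tapply(data$Cholesterol, data$Treatment, length)

# Contrast coefficients: (Delta + Gamma)/2 versus None
w   <- c(Delta = 0.5, Gamma = 0.5, None = -1)
psi <- sum(w * means[names(w)])               # estimated contrast
se  <- sqrt(mse * sum(w^2 / ns[names(w)]))    # standard error of the contrast

t_stat <- psi / se
p_raw  <- 2 * pt(-abs(t_stat), df_w)          # two-tailed p value for the contrast
p_adj  <- p.adjust(p_raw, method = "bonferroni", n = 2)  # adjust for two contrasts

If the p values for all of the contrasts are collected in one vector, `p.adjust(p_values, method = "holm")` applies the Holm step-down adjustment across the whole set.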
In this chapter we considered the elements of hypothesis testing for an exploratory analysis of variance (ANOVA). That is, we considered an ANOVA situation where the researcher had very little basis to predict which treatments, if any, would differ. We found that a researcher in this situation first uses an overall F test to determine if there are any differences among the treatments. If no treatment differences are found, the analysis stops, but if treatment differences are found, that is, if the F test is significant, a post hoc test is conducted.
We found that there are many different post hoc tests, but that four were recommended for use as part of an exploratory ANOVA. The four post hoc tests recommended here were: Tukey’s Honestly Significant Difference (HSD) test; a modification of Tukey’s test by Games and Howell (1976), referred to here as GHT; Scheffé’s test, referred to as ST; and a modification of Scheffé’s test by Brown and Forsythe (1974), referred to as BFS. A flow chart illustrating the steps leading to these post hoc tests in a one-way fixed-effects exploratory analysis of variance is shown in figure 15i.
In considering the assumptions for the analysis of variance we found that the F test is robust to violations of its homogeneity of variance assumption, provided that there are an equal number of units in each treatment. That is, given heterogeneous variances and equal n’s, the nominal and actual levels of significance may be considered to be equal. In this situation, that is given equal n’s and only pairwise comparisons, the Tukey post hoc procedure was recommended, and given equal n’s with non-pairwise comparisons, Scheffé’s test was recommended. Both of these post hoc tests are used because they control the familywise error rate in testing a set of mean contrasts.
If one has both unequal n’s and heterogeneous variances or equal n’s and extremely heterogeneous variances, then the actual and nominal levels of significance can differ by an unacceptable amount. In these cases post hoc tests that are based on modifications of the Tukey and Scheffé procedures were recommended because these modified tests were found to be robust to violations of the homogeneity of variance assumption.
Here, the Games and Howell modification of Tukey’s test was recommended for use with pairwise comparisons and the Brown and Forsythe modification of Scheffé’s test was recommended for use with non-pairwise comparisons. These tests were also recommended here for general use whenever you are confronted with post hoc tests in an unequal n ANOVA. This was because they have been shown to be relatively powerful even when the treatment variances are homogeneous (Keselman, Games, & Rogan, 1979). Therefore, in an unequal n ANOVA, the Games and Howell and Brown and Forsythe post hoc tests provide you with both Type I and Type II error protection, which you cannot be certain of from the Tukey and Scheffé tests.
Figure 15i A flow chart of the steps leading to post hoc testing in an exploratory ANOVA
SECTION 1: One-way ANOVA
Analyses to Run
Using the ANOVA output, respond to the following items
Provide the most appropriate research question for the analysis
Provide the statistical null hypothesis using both words and appropriate symbols.
Name the independent and dependent variables in these ANOVAs.
How many levels are there for the independent/factor variable?
What are the sizes for each group? What is the total sample size regardless of group?
Report and interpret the GRAND MEAN for all cases on Y and its 95% Confidence Interval.
Report and interpret the GROUP MEANS for Y for all W levels/groups and the 95% Confidence Interval for the GROUP 1 mean.
Is the assumption of normality of each level/group violated for this analysis? Provide evidence.
Is the assumption of homogeneity (equality) of variances violated for this analysis? Provide evidence.
Do you reject the Null Hypothesis that all W level/group population means are equal on the Y variable?
Is there a statistically significant difference between ANY of the group means on the Y variable? That is, is the OMNIBUS/OVERALL ANOVA statistically significant?
Is there statistically significant variation in the group means on the Y variable?
Do you conclude that the level/group means are all equal in the population?
Show or explain how the F statistic for the ANOVA is calculated.
Show or explain how to calculate the amount of variation in Y that is explained by the W variable (that is, \(R^2\) or eta-squared, \(\eta^2\))
Calculate Cohen’s f effect size for this analysis. By the way, \(f^2 = R^2/(1 - R^2)\) and \(R^2 = f^2/(1 + f^2)\). (A brief R sketch of these effect-size calculations appears at the end of this section.)
Report whether either OMNIBUS/OVERALL ANOVA is statistically significant. Provide evidence for both analyses.
Note that in real life you would only do the next two items if there is a statistically significant OMNIBUS ANOVA. However, you can do these as a priori comparisons and ignore the OMNIBUS ANOVA test.
18. Report all level/group mean differences and p values for the associated multiple comparison post hoc tests (Tukey HSD is usually a good choice if you have equal variances, Games-Howell if not).
19. Report and interpret which Tukey pairwise level/group mean differences are statistically significant and their associated p values. Instead of Tukey, use Games-Howell if the assumption of homogeneity of variances is violated.
Show how to calculate Cohen’s d for the largest pairwise level/group mean difference
Interpret the results for the one-way ANOVA in an APA-style report to answer the research question and to describe in detail the relationship between W and Y. Whether statistically significant or not, use descriptive statistics (e.g., means, standard deviations, mean differences, effect sizes, and/or confidence intervals), inferential statistics, degrees of freedom, and statistical significance to describe the differences between the groups, including which group is considered to have the larger mean, if appropriate (an APA table is often the best way to do this for multiple group comparisons). Be sure to discuss assumptions and outliers and their potential impact.
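For the effect-size items above (eta-squared/R-squared, Cohen’s f, and Cohen’s d), here is a brief, hedged R sketch; the cholesterol variables are used only as stand-ins for your own Y and W:

fit <- aov(Cholesterol ~ Treatment, data = data)   # substitute your own Y ~ W and data
tab <- anova(fit)
r2  <- tab[["Sum Sq"]][1] / sum(tab[["Sum Sq"]])   # eta-squared (R-squared): SS between / SS total
f   <- sqrt(r2 / (1 - r2))                         # Cohen's f from R-squared
# For Cohen's d on the largest pairwise difference, divide the mean difference
# by a pooled standard deviation; sqrt(MS within) is one common pooled value:
# d <- (mean_larger - mean_smaller) / sqrt(deviance(fit) / df.residual(fit))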
SECTION 2: Power and Sample Size Analysis
What TOTAL sample size would be required for an independent t test with some given characteristics, for example: alpha=.05, power=.90, Cohen’s d=0.6 (provide evidence from tables or G*Power)
What TOTAL sample size would be required for a test of correlation with some given characteristics, for example: alpha=.05, power=.80, expected Population r (rho)=0.4 (provide evidence from tables or G*Power)
What TOTAL sample size would be required for a 3-group One-way ANOVA with some given characteristics, for example: alpha=.05, power=.80, Cohen’s f=0.4 (provide evidence from tables or G*Power)
What TOTAL sample size would be required for a dependent t test with some given characteristics, for example: alpha=.05, power=.70, Cohen’s d=0.3 (provide evidence from tables or G*Power)
What estimated total sample size would be required for a 95% confidence interval for the mean with some given characteristics, for example: alpha = .05, a desired 95% confidence interval width of 3, and the standard deviation for Group 1 in the Independent-Samples t test section above. You may round to the nearest tenth for the tables. (provide evidence from tables or show your calculations) An R-based sketch for the items in this section appears after the last item.
Someone I know often says, “you can buy statistical significance.” Explain what is meant by that statement. Why does using effect sizes help with this? That is, what do researchers need to do in order to compensate for the fact that “You can buy statistical significance”?
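For the sample-size items in this section, tables and G*Power are the expected sources of evidence, but the values can also be checked in R. The following is a hedged sketch that assumes the `pwr` package is installed; note which calls report n per group (multiply by the number of groups for the TOTAL sample size) and which report the total directly.

library(pwr)

pwr.t.test(d = 0.6, sig.level = .05, power = .90, type = "two.sample")  # independent t: n per group
pwr.r.test(r = 0.4, sig.level = .05, power = .80)                       # correlation: total n
pwr.anova.test(k = 3, f = 0.4, sig.level = .05, power = .80)            # one-way ANOVA: n per group
pwr.t.test(d = 0.3, sig.level = .05, power = .70, type = "paired")      # dependent t: number of pairs

# Confidence-interval width item (normal approximation): n ≈ (z * s / (width/2))^2,
# where s is the Group 1 standard deviation referred to in the item.
# n_ci <- ceiling((qnorm(.975) * s / (3 / 2))^2)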