In chapter 14 you learned why and how an analysis of variance (ANOVA) is performed. In this chapter we will consider in more detail the elements of hypothesis testing in an exploratory ANOVA. We will do this by analyzing a data set in the step-by-step fashion that you would use for an exploratory ANOVA. That is, in this chapter we will consider such key elements as: the state of affairs, a given problem, its research hypothesis, the statistical hypotheses, assumptions, violations of assumptions, sample size selection, etc. We will conclude this chapter by considering four different statistical tests and the conditions under which you would use one of them after you have found that the treatments differ. These tests are referred to as post hoc or a posteriori tests because they are used following an overall F test in order to determine which treatment groups differ.
In this section we will consider the elements of hypothesis testing from the perspective of a researcher who is in a situation where it is difficult or impossible for him to know in advance what mean differences to expect among the treatments under study. That is, the researcher is in a position where past research and theory are either nonexistent or yield conflicting results. The example used here will be that of the three group herbal supplement pill consumption experiment that we discussed in chapter 14. Here, however, we will be able to consider an appropriate sample size for this experiment. To personalize the discussion, the researcher conducting this study is given the name Leon Waters. We will begin by considering the state of affairs, the ANOVA assumptions, violations of assumptions, and statistical equations used in any exploratory fixed-effects one-way ANOVA. We will then consider a step-by-step example of the elements of hypothesis testing for a one-way fixed-effects design.
The state of affairs that must exist before you can consider the following test statistic is:
The test statistics that we will consider for the exploratory and confirmatory analyses will be valid when:

1. the observations (units) are independent of one another;
2. the scores in each treatment population are normally distributed; and
3. the treatment population variances are equal (homogeneity of variance).
The importance of the preceding assumptions is as follows:
The assumption that the units are independent of one another is extremely important because, if it is violated, the level of significance (i.e., the probability of rejecting a true null hypothesis) can increase dramatically (e.g., from .05 to .40). In figure 15a the actual levels of significance are illustrated when the independence assumption is violated, given two treatments. Figure 15a shows that the probability of rejecting a true null hypothesis (i.e., of making a Type I error) increases dramatically as the relationship among the subjects in a group increases and as the number of subjects per group increases.
The effects of violations of the assumptions of normality and homogeneity of variance on level of significance and power in a fixed-effects ANOVA were considered in a review of literature by Glass, Peckham, and Sanders (1972). From their review, we may conclude that when the assumptions of normality and homogeneity of variance are violated the F test will be robust to these violations given large and equal samples per treatment. However, if one or more of the preceding conditions is not met, particularly when a one-tailed test is to be used, serious consideration should be given to transformations and/or to adjustments of the nominal level of significance and/or of the nominal power.
For example, given heterogeneous variances and unequal n’s per treatment with smaller samples drawn from more variable populations, Glass, Peckham, and Sanders (1972) indicated that your level of significance will be inflated. In this case you may decide to choose a smaller level of significance than you would if homogeneous variances were present (e.g., .01 or .05/2 instead of .05). Or, if you have two treatments, you may decide to use the separate variance t test discussed in chapter 14.
When heterogeneous variances are found with unequal numbers of units in the treatments, some researchers will randomly discard units until an equal n situation is found. These researchers know that with equal n’s the nominal and actual levels of significance will be close. However, many researchers find this approach unacceptable because of the lost information and the loss of power when units are discarded.
In chapter 14 the equations for calculating the F statistic for an ANOVA were given along with the steps necessary to obtain an ANOVA from R. For convenience, the equations for the F statistic are repeated here.
The F statistic with \(v_1\) and \(v_2\) degrees of freedom used to test the null hypothesis of equal treatment population means in a one-way ANOVA is found as the ratio of the mean square between groups (MSB) to the mean square within groups (MSW), that is:
\[ \begin{equation} F = \frac{MSB}{MSW} \tag{15-1} \end{equation} \]
Here,
\[ \begin{equation} MSB = \frac{SSB}{J-1} = \text{Mean Square Between Groups} \tag{15-2} \end{equation} \]
\[ \begin{equation} MSW = \frac{SSW}{\sum (n_j-1)} = \text{Mean Square Within Groups} \tag{15-3} \end{equation} \]
\[ \begin{equation} SSB = \sum n_j (M_j-M)^2 = \text{Sum of Squares Between} \tag{15-4} \end{equation} \]
\[ \begin{equation} SSW = \sum_{j=1}^J \sum_{i=1}^{n_j} (X_{ij}-M_j)^2 = \text{Sum of Squares Within} \tag{15-5} \end{equation} \]
\[ \begin{equation} v_1 = df_B = J-1 = \text{degrees of freedom Between} \tag{15-6} \end{equation} \]
\[ \begin{equation} v_2 = df_W = \sum (n_j - 1) = \text{degrees of freedom Within} \tag{15-7} \end{equation} \]
Where \(J\) is the number of treatments; \(n_j\) is the number of units (subjects) in treatment j; \(X_{ij}\) is the score of the i-th unit in treatment j; \(M_j\) is the mean of treatment j; and \(M\) is the grand mean of all the scores.
The calculation of SSB and SSW can be checked by finding the sum of squares total (SST) and then checking to see if \(SST = SSB + SSW\). Note that some scholars use SSY instead of SST because SST refers to the total sum of squares in the dependent Y variable (recall that the sum of squares is the numerator of the variance formula we learned earlier). Here, SST is found as:
\[ \begin{equation} SST = \sum_{j=1}^J \sum_{i=1}^{n_j} (X_{ij}-M)^2 = \text{Sum of Squares Total} \tag{15-8} \end{equation} \]
Note that the only difference between Equations 15-5 and 15-8 is that the Group Mean is subtracted from each unit’s score in 15-5 but the Grand Mean is subtracted from each unit’s score in 15-8.
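As a quick check of these formulas (not part of Leon’s analysis), the base R sketch below computes the sums of squares directly and verifies that \(SST = SSB + SSW\); it assumes a long-format data frame named `data` with a numeric `Cholesterol` column and a factor `Treatment` column, as used in the jmv calls later in this chapter.

```r
# Minimal sketch: hand-computed one-way ANOVA pieces (Equations 15-1 to 15-8).
grand_mean  <- mean(data$Cholesterol)
group_means <- tapply(data$Cholesterol, data$Treatment, mean)
group_ns    <- tapply(data$Cholesterol, data$Treatment, length)

SSB <- sum(group_ns * (group_means - grand_mean)^2)                           # Equation 15-4
SSW <- sum((data$Cholesterol - group_means[as.character(data$Treatment)])^2)  # Equation 15-5
SST <- sum((data$Cholesterol - grand_mean)^2)                                 # Equation 15-8
all.equal(SST, SSB + SSW)   # should be TRUE

J   <- nlevels(data$Treatment)
MSB <- SSB / (J - 1)               # Equation 15-2
MSW <- SSW / sum(group_ns - 1)     # Equation 15-3
c(F = MSB / MSW, df1 = J - 1, df2 = sum(group_ns - 1))   # Equation 15-1
```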
The following is a step-by-step example of the elements of hypothesis testing in an exploratory one-way analysis of variance.
Are there differences among the mean cholesterol levels of groups of women who have taken herbal supplement pills made by the Gamma Drug Company, the Delta Drug Company, and women who have taken no herbal supplement pills?
There is a mean difference among the cholesterol levels of groups of women who have taken herbal supplement pills made by the Gamma Drug Company, the Delta Drug Company, and women who have taken no herbal supplement pills.
The statistical hypotheses are written in terms of the population means of the three treatments as:
\[ \begin{align} H_0&: \mu_1 = \mu_2 = \mu_3 \\ H_A&: \mu_j \ne \mu_k &&\text{for at least one pair } j \ne k; \ j, k = 1, 2, 3 \end{align} \]
Here, \(\mu_1\) is the population cholesterol mean of the women taking the Gamma Drug Company’s herbal supplement pill; \(\mu_2\) is the population cholesterol mean of the women taking the Delta Drug Company’s herbal supplement pill; and \(\mu_3\) is the population cholesterol mean of the women taking no herbal supplement pill. The null hypothesis indicates that there are no differences among the population cholesterol means of the three treatments and the alternate hypothesis indicates that at least one of the treatment population means differs from the others. In general, the null hypothesis would be written that there is no difference among J population means and the alternate hypothesis would be written as it is above.
A review of the cholesterol literature indicated that the measures of cholesterol used in this experiment were valid and reliable. Also, so as to assure the validity of the independent variable, care was taken to assure that the women received pills from the treatment to which they were randomly assigned.
The probability of rejecting a true null hypothesis (i.e. the probability of making a Type I error) was set at .10.
In social science research, changing the level of significance from the very common value of .05 usually requires some sort of justification. One of the most common, and defensible, reasons for changing the level of significance is to increase power when a Type II error is particularly undesirable (and a Type I error is not particularly worrisome). A Type II error would be undesirable at early stages of research, when researchers would not want to risk stopping a promising line of research due to non-significant results. Power would be a concern when the researcher has limited ability to obtain larger samples.
A Type I error would not be terribly worrisome when there is a low cost to the treatment (or intervention) and no side effects (whether we are talking about medical treatments or any other kind of treatment). If someone were to change to a low cost treatment based on the results of a study, they would not lose too much money paying for a truly ineffective treatment (whereas they would be wasting much money if the treatment was expensive and did not really work). Similarly, if there are no side effects to the treatment, then using it even though it is not effective would not cause harm.
Note that even when these conditions are defensible, either a Type I or a Type II error can lead to an opportunity cost. That is, we would lose the opportunity to change to a better treatment when a Type II error occurred and resulted in the new treatment not being statistically significant in a sample. Or we would lose the opportunity to continue to use a known effective treatment because we changed to a new treatment due to the promising – but wrong – results caused by a Type I error in a study.
Leon Waters decided that he would like his power to be at least .80, i.e., he decided that he would like to be able to reject the null hypothesis at least 80 times out of 100 when the null hypothesis was false.
The effect size “\(d_A\)” (A for ANOVA) used in this book is based on the work of Cohen (1977), and is related to the two group effect size “d” that was discussed for the single group t test (Equation 12-4) in Chapter 12; the relationship is:
\[ \begin{equation} d_A = \text{Cohen's } f = \frac{1}{2} d \tag{15-9} \end{equation} \]
As in the case of the single group mean, we may use Cohen’s three different effect sizes, “small” \(d_A = 0.10\), “medium” \(d_A = 0.25\), and “large” \(d_A = 0.40\), as guides in exploratory studies where information is lacking. Note that Cohen called this effect size \(f\) rather than \(d_A\), but we prefer calling it \(d_A\) to help with the rationale expressed in Equation 15-9 (e.g., the medium \(d = 0.50\), so the medium \(d_A = 0.25\)).
In light of Cohen’s suggestions for effect size, Leon Waters decided to choose a large a priori effect size (i.e., \(d_A = 0.40\)) as one that he felt would be present among the pill consumption treatments.
Sample size may be found using Table C.6 where sample sizes per treatment are tabled for given levels of significance (\(\alpha\)), powers (\(1 – β\)), degrees of freedom (\(df\)), and a priori effect sizes (\(d_A\)).
When an a priori effect size is selected that is not in Table C.6 of Appendix C, the following equation may be used to estimate the sample size:
\[ \begin{equation} n = \frac{n_{.05}}{400(d_A^2)} + 1 \tag{15-10} \end{equation} \]
where \(n\) is the number of units in a single treatment, \(n_{.05}\) is the value of \(n\) in Table C.5 when \(d_A = .05\) for a given power and degrees of freedom (\(df = K – 1\)).
In Table C.6 Leon found that he needed 17 subjects per treatment when \(\alpha = .10\), \(df = 2\), and \(d_A = .40\) to have the power of his statistical test set at .80. However, since Leon could conveniently collect information on 20 subjects per treatment, he decided to proceed with this number. In Table C.6 we see that by using 20 subjects per treatment Leon’s power is between .80 and .90; through linear interpolation he estimated it to be .85.
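For readers without Table C.6 at hand, the pwr package gives comparable estimates (this is only a cross-check, not part of Leon’s procedure; pwr calls the effect size Cohen’s f, which is the same quantity we are calling \(d_A\)).

```r
# Minimal sketch using the pwr package (install.packages("pwr") if needed).
library(pwr)

# Sample size per treatment for alpha = .10, power = .80, f = d_A = .40, 3 groups
pwr.anova.test(k = 3, f = 0.40, sig.level = 0.10, power = 0.80)

# Power actually attained with the 20 subjects per treatment Leon plans to use
pwr.anova.test(k = 3, f = 0.40, sig.level = 0.10, n = 20)
```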
If we are using a table with limited degrees of freedom, in order to find the critical value of the F statistic with \(\alpha= .10\), \(v_1 = 2\), and \(v_2 = 57\), we might need to choose between \(F_{(.90, 2, 40)} = 2.44\) and \(F_{(.90, 2, 60)} = 2.39\). In such a case, we would choose to use \(F_{(.90, 2, 40)}\) as our critical value because it is the more conservative (i.e., smaller) degrees of freedom, and use linear interpolation only if our calculated F value falls between the latter two tabled values.
At this point the a priori parameters of hypothesis testing have been established (i.e., your “bet” is made).
The cholesterol measurements of the three groups of twenty women each are shown in Columns 1, 2, and 3 of figure 15c(i). Note that in order to perform the ANOVA in R, we must restructure the data so that there is just one column for Treatment, one column for Cholesterol score, and one column for z score. Figure 15c(ii) shows a partial Case Summaries report for the restructured data. The Treatment variable must also be a factor in R.
Also note that z scores can be obtained in two ways: (a) for the whole sample and (b) for each factor level separately. When doing group comparison analyses, we usually want to examine z scores by group (i.e., each level separately). The z scores provided here were calculated by group in JAMOVI using the Z(Cholesterol, group_by=Treatment) compute function.
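The restructuring and the group-wise z scores can also be produced directly in R. This is only a sketch: it assumes the wide-format data of Figure 15c(i) are already in a data frame called `wide` with columns Person_ID, Gamma, Delta, and None (the object name is hypothetical, not part of Leon’s script).

```r
# Minimal sketch: wide-to-long restructuring plus group-wise z scores in base R.
long <- reshape(wide,
                direction = "long",
                varying   = c("Gamma", "Delta", "None"),
                v.names   = "Cholesterol",
                timevar   = "Treatment",
                times     = c("Gamma", "Delta", "None"),
                idvar     = "Person_ID")

long$Treatment <- factor(long$Treatment)   # Treatment must be a factor for the ANOVA

# z scores computed separately within each treatment group
long$zCholesterol <- ave(long$Cholesterol, long$Treatment,
                         FUN = function(x) (x - mean(x)) / sd(x))
```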
Person_ID | Gamma | Delta | None | z_Gamma | z_Delta | z_None |
---|---|---|---|---|---|---|
1 | 235 | 260 | 210 | -0.3975 | 0.8854 | -0.3975 |
2 | 240 | 160 | 215 | -0.2865 | -1.2954 | -0.2865 |
3 | 288 | 255 | 263 | 0.7795 | 0.7763 | 0.7795 |
4 | 245 | 175 | 220 | -0.1754 | -0.9682 | -0.1754 |
5 | 355 | 250 | 330 | 2.2675 | 0.6673 | 2.2675 |
6 | 223 | 260 | 198 | -0.6640 | 0.8854 | -0.6640 |
7 | 285 | 185 | 260 | 0.7129 | -0.7502 | 0.7129 |
8 | 223 | 156 | 198 | -0.6640 | -1.3826 | -0.6640 |
9 | 275 | 273 | 250 | 0.4908 | 1.1689 | 0.4908 |
10 | 204 | 218 | 179 | -1.0860 | -0.0305 | -1.0860 |
11 | 268 | 180 | 243 | 0.3353 | -0.8592 | 0.3353 |
12 | 272 | 245 | 247 | 0.4242 | 0.5583 | 0.4242 |
13 | 235 | 200 | 210 | -0.3975 | -0.4231 | -0.3975 |
14 | 297 | 235 | 272 | 0.9794 | 0.3402 | 0.9794 |
15 | 235 | 255 | 210 | -0.3975 | 0.7763 | -0.3975 |
16 | 278 | 200 | 253 | 0.5574 | -0.4231 | 0.5574 |
17 | 322 | 320 | 297 | 1.5346 | 2.1938 | 1.5346 |
18 | 220 | 225 | 195 | -0.7306 | 0.1221 | -0.7306 |
19 | 183 | 190 | 158 | -1.5523 | -0.6411 | -1.5523 |
20 | 175 | 146 | 150 | -1.7300 | -1.6007 | -1.7300 |
The z score columns of Figure 15c(i) were then created. Examination of these columns indicates that none of the scores were considered to be outliers, since no z scores greater than 3 or less than -3 were observed.
jmv::descriptives(
formula = zCholesterol ~ Treatment,
data = data,
dotType = "stack",
n = FALSE,
missing = FALSE,
mean = FALSE,
median = FALSE,
sd = FALSE,
min = FALSE,
max = FALSE,
extreme = TRUE)
DESCRIPTIVES
EXTREME VALUES
Extreme values of zCholesterol
────────────────────────────────────────
Row number Value
────────────────────────────────────────
Highest 1 13 2.268
2 15 2.268
3 50 2.194
4 49 1.535
5 51 1.535
Lowest 1 58 -1.730
2 60 -1.730
3 59 -1.601
4 55 -1.552
5 57 -1.552
────────────────────────────────────────
However, because each sample is so small (\(n = 20\) within each treatment), we might consider using \(|z| > 2\) as our rule for outliers (i.e., any score beyond the range \(-2 < z < 2\) would be considered an outlier). In this case, we would want to check Case_ID 13 (Person_ID 5 in the Gamma group), Case_ID 50 (Person_ID 17 in the Delta group), and Case_ID 15 (Person_ID 5 in the None group) more carefully. Upon checking the data, no good rationale for removing the cases was identified, so Leon kept those cases in the analysis. Further, the boxplots for each treatment were observed to have no unusual characteristics (i.e., no outliers).
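A one-line check of the \(|z| > 2\) rule in R (assuming the long-format data frame `data` with the group-wise z scores stored in `zCholesterol`) might look like this:

```r
# Minimal sketch: list cases whose within-group z score exceeds 2 in absolute value.
data[abs(data$zCholesterol) > 2, c("Treatment", "Cholesterol", "zCholesterol")]
```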
jmv::descriptives(
formula = Cholesterol ~ Treatment,
data = data,
box = TRUE,
dot = TRUE,
dotType = "stack",
n = FALSE,
missing = FALSE,
mean = FALSE,
median = FALSE,
sd = FALSE,
min = FALSE,
max = FALSE)
DESCRIPTIVES
Note that in group analyses (such as ANOVA and the independent t test), we are concerned not only with cases that are outliers for the entire sample (i.e., at the univariate level), but also with outliers within each sample. Recall that each sample represents a different population, so we want to verify that a case is not extreme or unusual for its own population. With fewer cases in each group, an outlier can have a larger impact than it would have across all of the data in the study. Because of this potential for larger impact, the group statistics we report may be severely distorted by that outlier (e.g., the group mean being pulled toward a skewed tail).
The initial selection of a random sample of women, followed by their random assignment to the treatments established the initial conditions for independence of the observations. Also, Leon Waters was satisfied that nothing happened during the experiment to violate the independence assumption.
The normality assumption was considered by constructing histograms of the data in each treatment and by considering the normality plots for each treatment. Both the histograms and the normality plots indicated that the data in each treatment could be considered to have been sampled from a normal distribution.
The normality assumption is frequently tested by using Tests of Normality. We can obtain a test of the overall normality of a variable, but for group analyses like ANOVA and independent t tests, we wish to test normality within each sample, to reach a decision about the normality of each population separately. This analysis is obtained by using the Treatment variable as a SPLIT BY in Descriptives. Clicking to select the Histogram is also helpful for this purpose, but using a paneled histogram from GRAPHS may be more useful (because it uses the same scale for the axes).
jmv::descriptives(
formula = Cholesterol ~ Treatment,
data = data,
hist = TRUE,
qq = TRUE,
n = FALSE,
missing = FALSE,
# desc = "rows",
mean = FALSE,
median = FALSE,
sd = FALSE,
min = FALSE,
max = FALSE,
skew = TRUE,
kurt = TRUE,
sw = TRUE)
DESCRIPTIVES
Descriptives
───────────────────────────────────────────────────
Treatment Cholesterol
───────────────────────────────────────────────────
Skewness Delta 0.2161
Gamma 0.3544
None 0.3544
Std. error skewness Delta 0.5121
Gamma 0.5121
None 0.5121
Kurtosis Delta -0.4833
Gamma 0.1189
None 0.1189
Std. error kurtosis Delta 0.9924
Gamma 0.9924
None 0.9924
Shapiro-Wilk W Delta 0.9631
Gamma 0.9753
None 0.9753
Shapiro-Wilk p Delta 0.6069
Gamma 0.8603
None 0.8603
───────────────────────────────────────────────────
With small samples in particular, but also generally, we recommend using the Shapiro-Wilk test. The null hypothesis tested by this statistical test is
\[ H_0: \text{The data are normally distributed} \] We could amend this Null Hypothesis to handle the conditional normality situation (i.e., normality within each group).
\[ H_0: \text{The data are normally distributed within each group} \]
Therefore, if we reject the null hypothesis for the Shapiro-Wilk test, we will be rejecting normality. We want to fail to reject normality for the assumption to be satisfied (failing to reject doesn’t mean that the population is truly normal, but it at least provides no evidence that it is not). Therefore, for the sake of the assumption test, we are hoping to fail to reject the null hypothesis and obtain a p value greater than our level of significance (i.e., \(p > \alpha\)). In this case, we see that we cannot reject normality for any of the treatment groups. Therefore, we consider the assumption to be tenable (i.e., believable, defensible, met). Fortunately, one-way ANOVA and t tests are relatively robust to a violation of normality, but when the assumptions are met, we have that much more confidence in our statistical results.
If we are concerned about normality, recall that non-normality could be the result of outliers (especially for skewed distributions). We recommend that the first option to address a violation of the normality assumption, then, be to look for the impact of outliers. However, if outliers do not appear to be responsible, then as a last resort, if you strongly believe something must be done to address the non-normality, a transformation may be appropriate.
Finally, because ANOVA is part of the General Linear Model (i.e., regression), we can test the normality assumption using residuals rather than the conditional normality approach. However, it is important to recognize that these two approaches can produce different results. Our recommendation is to use the conditional normality approach with an alpha adjusted for the number of levels (see the Bonferroni and Holm sections below). For example, if your factor has 4 levels, you would test the Shapiro-Wilk statistic using \(\alpha = .0125\) for each group (see Brooks et al., 2024).
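A sketch of this conditional (within-group) Shapiro-Wilk approach with an adjusted alpha, again assuming the long-format data frame `data`:

```r
# Minimal sketch: Shapiro-Wilk test within each treatment group, judged against
# an alpha adjusted for the number of groups (.10 / 3 here; .05 / 4 = .0125 for four groups).
by(data$Cholesterol, data$Treatment, shapiro.test)

alpha_adj <- 0.10 / nlevels(data$Treatment)
alpha_adj
```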
The assumption of homogeneity of variance was considered by observing boxplots of relatively equal length from each treatment. Note that because we have equal n’s in each treatment our F test would be robust to heterogeneity of variance.
The homogeneity of variance (often called homoscedasticity) assumption is frequently tested by comparing the equality of variances with an F statistic. Currently, the preferred test is Levene’s test of equality of variances. The details of Levene’s test are beyond the scope of this book, but it is useful to know that there are multiple “flavors” of Levene’s test. The most commonly calculated and reported flavor uses deviations from the group means in its calculation. However, some scholars prefer the version of Levene’s test that measures variation as the squared distance from the median instead of the mean. Programs differ in which flavor they report: some provide the mean-based version in their one-way ANOVA procedures and the other flavors in separate procedures.
The statistical null hypothesis for Levene’s test is as follows:
\[ \begin{align} H_0&: \sigma_1^2=\sigma_2^2 &&\text{(for 2 groups)} \\ H_0&: \sigma_1^2=\sigma_2^2=\sigma_3^2 &&\text{(for 3 groups)} \\ H_0&: \sigma_j^2=\sigma_k^2 &&\text{(for all groups j & k, where } j \ne k) \end{align} \]
Note that we would not typically write the null hypothesis for Levene’s test as a difference in variances (e.g., \(H_0: \sigma_1^2 - \sigma_2^2 = 0\)) because we use an F ratio to test this hypothesis. If we wrote it as anything besides simple equality, we would probably choose \(H_0: \sigma_1^2 / \sigma_2^2 = 1\). Just as a difference of zero indicates that two values are the same, a ratio of 1 means that the numerator and denominator are exactly the same.
We obtain the Levene’s test as part of the Independent t test or ANOVA in most programs (including JAMOVI).
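If you want the median-based flavor of Levene’s test described above, the car package provides it; this is an aside to the jamovi workflow, and it assumes the same long-format data frame `data`.

```r
# Minimal sketch using the car package (install.packages("car") if needed).
library(car)

leveneTest(Cholesterol ~ Treatment, data = data, center = mean)    # mean-based flavor
leveneTest(Cholesterol ~ Treatment, data = data, center = median)  # median-based flavor
```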
jmv::anovaOneW(
formula = Cholesterol ~ Treatment,
data = data,
welchs = FALSE,
eqv = TRUE)
ONE-WAY ANOVA
ASSUMPTION CHECKS
Homogeneity of Variances Test (Levene's)
──────────────────────────────────────────────────
F df1 df2 p
──────────────────────────────────────────────────
Cholesterol 0.04445 2 57 0.9566
──────────────────────────────────────────────────
The typical process used by researchers is the following (but see paragraph following these steps for our perspective):
1. Test the equality of variances using Levene’s test.
2. If Levene’s test is not statistically significant, assume equal variances and use the classical test (Fisher’s F or the pooled-variance Student’s t).
3. If Levene’s test is statistically significant, do not assume equal variances and use a robust test (Welch’s F or Welch’s t).
Although this is still the most common process used by researchers, we agree with Zimmerman (2004) and others (including Wilcox) who have written that we should consider always using the robust tests. Zimmerman showed that we may inflate our Type I error rate by using the process described above (what he called conditional testing, that is, choosing the test conditionally based on the significance of Levene’s test). The tests that assume equality of variances are generally preferred because they are more powerful, but research has shown that the power lost by using the robust tests is generally not substantial enough to justify avoiding them even when variances are equal. R also uses these robust tests as its defaults.
We will provide an example later of a situation where the researcher might conclude that variances across treatments are unequal. We will also examine the post hoc tests to use with this situation.
The non-parametric descriptive statistics for each of the treatments are shown in figure 15d. Leon noted that the median (242.5) of the cholesterol levels of women taking the Gamma herbal supplement pill was higher than the medians of the women taking either the Delta pill (221.5) or no pill (217.5). Also, the medians of the Delta group and the “no pill” group were relatively close. The means shown in figure 15e reflected the same pattern as the medians. A check of the variances from each treatment in figure 15e indicates that they may be considered equal. The measures of skewness and kurtosis indicate that the scores in each treatment are slightly positively skewed, with kurtosis values near zero, not enough to cause concern.
Note that choosing a very small and a very large percentile for this output (here, 0 and 100) produces the minimum and maximum in order alongside the 1st, 2nd, and 3rd quartiles. This can make it a bit easier to examine those values in order. You would need to choose the very small percentile based on the amount of data you have (note that zero does not work in some programs, so choose maybe 0.1), or simply rely on the Minimum and Maximum reported without worrying about having them in order with the other percentiles.
jmv::descriptives(
formula = Cholesterol ~ Treatment,
data = data,
mode = TRUE,
sd = FALSE,
range = TRUE,
iqr = TRUE,
pc = TRUE,
pcValues = "0,25,50,75,100")
DESCRIPTIVES
Descriptives
────────────────────────────────────────────────
Treatment Cholesterol
────────────────────────────────────────────────
N Delta 20
Gamma 20
None 20
Missing Delta 0
Gamma 0
None 0
Mean Delta 219.4
Gamma 252.9
None 227.9
Median Delta 221.5
Gamma 242.5
None 217.5
Mode Delta 200.0
Gamma 235.0
None 210.0
IQR Delta 71.25
Gamma 56.75
None 56.75
Range Delta 174.0
Gamma 180.0
None 180.0
Minimum Delta 146.0
Gamma 175.0
None 150.0
Maximum Delta 320.0
Gamma 355.0
None 330.0
0th percentile Delta 146.0
Gamma 175.0
None 150.0
25th percentile Delta 183.8
Gamma 223.0
None 198.0
50th percentile Delta 221.5
Gamma 242.5
None 217.5
75th percentile Delta 255.0
Gamma 279.8
None 254.8
100th percentile Delta 320.0
Gamma 355.0
None 330.0
────────────────────────────────────────────────
jmv::descriptives(
formula = Cholesterol ~ Treatment,
data = data,
hist = TRUE,
dens = TRUE,
bar = TRUE,
box = TRUE,
violin = TRUE,
dot = TRUE,
boxMean = TRUE,
qq = TRUE,
missing = FALSE,
variance = TRUE,
se = TRUE,
ci = TRUE,
skew = TRUE,
kurt = TRUE,
sw = TRUE)
DESCRIPTIVES
Descriptives
───────────────────────────────────────────────────────
Treatment Cholesterol
───────────────────────────────────────────────────────
N Delta 20
Gamma 20
None 20
Mean Delta 219.4
Gamma 252.9
None 227.9
Std. error mean Delta 10.25
Gamma 10.07
None 10.07
95% CI mean lower bound Delta 197.9
Gamma 231.8
None 206.8
95% CI mean upper bound Delta 240.9
Gamma 274.0
None 249.0
Median Delta 221.5
Gamma 242.5
None 217.5
Standard deviation Delta 45.86
Gamma 45.03
None 45.03
Variance Delta 2103
Gamma 2028
None 2028
Minimum Delta 146.0
Gamma 175.0
None 150.0
Maximum Delta 320.0
Gamma 355.0
None 330.0
Skewness Delta 0.2161
Gamma 0.3544
None 0.3544
Std. error skewness Delta 0.5121
Gamma 0.5121
None 0.5121
Kurtosis Delta -0.4833
Gamma 0.1189
None 0.1189
Std. error kurtosis Delta 0.9924
Gamma 0.9924
None 0.9924
Shapiro-Wilk W Delta 0.9631
Gamma 0.9753
None 0.9753
Shapiro-Wilk p Delta 0.6069
Gamma 0.8603
None 0.8603
───────────────────────────────────────────────────────
Note. The CI of the mean assumes sample means
follow a t-distribution with N - 1 degrees of
freedom
In figure 15f the test F statistic is found to be 2.95 using the One-way ANOVA output. The step-by-step description of how to execute One-way ANOVA is given in chapter 14.
The decision was made to reject the null hypothesis because the value of the test statistic was greater than the critical value, that is, \(2.95 > F_{(.90,2,40)} = 2.44\); since the critical value for the actual 57 degrees of freedom within is even smaller than 2.44, 2.95 is also greater than \(F_{(.90,2,57)}\) and is statistically significant. The p level was found to be .0601. Since the p level was less than our chosen \(\alpha = .10\), this result corroborated the decision to reject the null hypothesis that was based on the critical value.
When you fail to reject the null hypothesis in an exploratory ANOVA your statistical analysis is finished. You have found that there are no significant differences among your treatment means and so you stop. However, when you do find that there are significant differences among your treatment means, as we did in this example (because we were using \(\alpha = .10\)), the next question to ask is: Which treatments differ? The answer to this question is the subject of the next section.
The effect size most frequently reported for the omnibus ANOVA is either \(R^2\) or \(\eta^2\) (eta-squared). The calculation for \(R^2\) (and \(\eta^2\) since they are equal in One-way ANOVA) is straightforward.
\[ R^2 = \frac{SS_{Between}}{SS_{Total}} = \frac{12130.0}{129130.4} = .094 \]
This is interpreted as the treatment variable explaining just over 9% of the variation in Cholesterol scores. That is, which group the participants belong to explained about 9.4% of why they have different scores.
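As a quick check of this arithmetic in R, using the sums of squares from the preceding equation (also reported in the ANOVA table later in this section):

```r
# Eta-squared (equivalently, R-squared in a one-way ANOVA) from the sums of squares.
SSB <- 12130
SSW <- 117000
SSB / (SSB + SSW)   # approximately .094
```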
It may also be useful to report effect sizes like Cohen’s d for the post hoc comparisons that are performed (see the next section). That is, it is useful to report which groups are different based on these post hoc multiple comparisons, but it is more informative to describe with standardized effect sizes how different they are. A Cohen’s d effect size can be calculated for these comparisons as we did in Chapter 14, using the descriptive statistics provided in the outputs.
Note that, like many programs, JAMOVI provides two procedures that can be used for one-way ANOVA: (a) one called “One-Way ANOVA” and (b) one called simply “ANOVA”. If you have met the assumption of homogeneity of variances, then either is useful (in fact, both are useful because of the different information they provide). However, if you cannot assume equal variances, only the One-Way ANOVA procedure provides the robust Welch’s F test (and the associated robust Games-Howell post hoc comparisons).
jmv::anovaOneW(
formula = Cholesterol ~ Treatment,
data = data,
fishers = TRUE,
desc = TRUE,
descPlot = TRUE,
norm = TRUE,
qq = TRUE,
eqv = TRUE)
ONE-WAY ANOVA
One-Way ANOVA
──────────────────────────────────────────────────────────────
F df1 df2 p
──────────────────────────────────────────────────────────────
Cholesterol Welch's 2.905 2 38.00 0.0670
Fisher's 2.955 2 57 0.0601
──────────────────────────────────────────────────────────────
Group Descriptives
─────────────────────────────────────────────────────────────
Treatment N Mean SD SE
─────────────────────────────────────────────────────────────
Cholesterol Delta 20 219.4 45.86 10.25
Gamma 20 252.9 45.03 10.07
None 20 227.9 45.03 10.07
─────────────────────────────────────────────────────────────
ASSUMPTION CHECKS
Normality Test (Shapiro-Wilk)
───────────────────────────────────
W p
───────────────────────────────────
Cholesterol 0.9681 0.1175
───────────────────────────────────
Note. A low p-value suggests
a violation of the assumption
of normality
Homogeneity of Variances Test (Levene's)
──────────────────────────────────────────────────
F df1 df2 p
──────────────────────────────────────────────────
Cholesterol 0.04445 2 57 0.9566
──────────────────────────────────────────────────
jmv::ANOVA(
formula = Cholesterol ~ Treatment,
data = data,
effectSize = c("eta", "partEta"),
modelTest = TRUE,
homo = TRUE,
norm = TRUE,
qq = TRUE)
ANOVA
ANOVA - Cholesterol
───────────────────────────────────────────────────────────────────────────────────────────────
Sum of Squares df Mean Square F p η² η²p
───────────────────────────────────────────────────────────────────────────────────────────────
Overall model 12130 2 6065 2.955 0.0601
Treatment 12130 2 6065 2.955 0.0601 0.0939 0.0939
Residuals 117000 57 2053
───────────────────────────────────────────────────────────────────────────────────────────────
ASSUMPTION CHECKS
Homogeneity of Variances Test (Levene's)
────────────────────────────────────────
F df1 df2 p
────────────────────────────────────────
0.04445 2 57 0.9566
────────────────────────────────────────
Normality Test (Shapiro-Wilk)
─────────────────────────────
Statistic p
─────────────────────────────
0.9681 0.1175
─────────────────────────────
Following an overall ANOVA where the results indicate that there are significant differences among the population treatment means, a natural question to ask is: Which means differ? For example, in the cholesterol experiment the F value indicated that at the .10 level of significance some or all of the following sample means came from different populations:
\[ \begin{align} &\underline{Gamma} &&\underline{Delta} &&&\underline{No Pill} \\ &252.9 &&219.4 &&&227.9 \end{align} \]
After obtaining the overall significant F value, and observing these means, you would want to ask one or more questions concerning mean differences. Below are some of these questions expressed as post hoc research problems and their corresponding post hoc null and alternative statistical hypotheses. The post hoc research hypotheses are that differences between the treatments being considered are expected. Here, \(\mu_1\) is the Gamma population treatment mean, \(\mu_2\) is the Delta population treatment mean, and \(\mu_3\) is the No Pill population treatment mean.
\[ H_0: \mu_1 - \mu_2 = 0 \\ H_A: \mu_1 - \mu_2 \ne 0 \]
\[ H_0: \mu_1 - \mu_3 = 0 \\ H_A: \mu_1 - \mu_3 \ne 0 \]
\[ H_0: \mu_2 - \mu_3 = 0 \\ H_A: \mu_2 - \mu_3 \ne 0 \]
\[ H_0: \frac{\mu_1+\mu_2}{2} - \mu_3 = 0 \\ H_A: \frac{\mu_1+\mu_2}{2} - \mu_3 \ne 0 \]
\[ H_0: \mu_1 - \frac{\mu_2+\mu_3}{2} = 0 \\ H_A: \mu_1 - \frac{\mu_2+\mu_3}{2} \ne 0 \]
\[ H_0: \mu_2 - \frac{\mu_1+\mu_3}{2} = 0 \\ H_A: \mu_2 - \frac{\mu_1+\mu_3}{2} \ne 0 \]
There is a good deal that we can learn about post hoc hypothesis testing from these example problems and their statistical hypotheses. In the following paragraphs some of the features of post hoc hypothesis testing are discussed in greater detail.
A linear contrast or comparison among treatment means is defined as a linear combination of the treatment means in which the contrast coefficients sum to zero. A contrast coefficient is a number that multiplies a treatment mean, and it is assumed that not all of the contrast coefficients are zero.
For example, the comparison made in the first post hoc null hypothesis, above, contained the contrast coefficients +1, -1, and 0, that is, \(+1\mu_1 - 1\mu_2 + 0\mu_3 = 0\), and the sum of these contrast coefficients is zero (i.e., 1 - 1 + 0 = 0). Also, in the contrast of the fifth null hypothesis, the coefficients were +1, -1/2, and -1/2, that is, \(+1\mu_1 - (1/2)\mu_2 - (1/2)\mu_3\), which also sum to zero.
You should note that in discussing a contrast, the zero coefficients are always written (as was done for the first contrast), but that zero coefficients are omitted in the statistical hypotheses. You should also verify that the contrast coefficients in each of the preceding null hypotheses sum to zero.
\(\psi\), the Symbol For A Contrast (Comparison)
Statisticians usually denote a population contrast on treatment means by the Greek letter \(\psi\) (psi). A subscript is usually added to \(\psi\) to denote a particular contrast. Therefore, we have for the first null hypothesis that \(\psi_1 = \mu_1 - \mu_2\), and for the last null hypothesis that \(\psi_6 = \mu_2 - (\mu_1 + \mu_3)/2\). When a contrast is made on sample means we have an estimate of the population parameter \(\psi\), which is usually denoted by \(\hat{\psi}\). Note that the sample statistic is an estimate of the population parameter and that the “hat” (^) indicates an estimate. Therefore, we could write the sample mean as \(\hat{\mu}\).
The first post hoc statistical hypothesis can be written as:
\[ H_0: \psi_1 = 0 \\ H_A: \psi_1 \ne 0 \]
Here, the sample statistic, that is, our estimate of \(\psi\), would be found as:
\[ \hat{\psi}_1 = M_1 - M_2 = 252.9 - 219.4 = 33.5 \]
It can be shown that when you multiply a contrast by any nonzero constant the test of the resulting contrast does not change. For example, the test of
\[ H_0: \psi_1 = \mu_1 - \mu_2 = 0 \]
is the same as testing
\[ H_0: -\psi_1 = -\mu_1 + \mu_2 = 0 \]
or
\[ H_0: 5\psi_1 = 5\mu_1 - 5\mu_2 = 0 \]
where the numbers -1 and 5 were arbitrarily chosen as the constants. Therefore, in general we have that testing a hypothesis on any contrast, \(\psi\), is the same as testing a null hypothesis on \(c_0\psi\), where \(c_0\) is any nonzero constant. In this sense, \(\psi\) is said to be unique.
One reason that it is important to understand the concept of a unique contrast is because in entering contrast coefficients into a computer program some contrast coefficients cannot be entered exactly. For example, contrast coefficients such as \(1 – 1/3 – 1/3 – 1/3\) cannot be entered exactly in many programs because 1/3 must be entered as a never ending sequence of 3’s (i.e., \(0.33333...\)). However, if we multiply these coefficients by 3, we may enter them as \(3 – 1 – 1 – 1\) to obtain the same test results. Note however, that although the test results do not change, the size of the estimated contrast and other preliminary statistics will be altered.
For example, if we tested the null hypothesis
\[ H_0: 5\psi_1 = 5\mu_1 - 5\mu_2 = 0 \]
we would calculate the contrast in the sample as
\[ \begin{align} 5\hat{\psi}_1 &= 5M_1 - 5M_2 \\ &= 5(252.9) - 5(219.4) \\ &= 167.5 \end{align} \]
which is 5 times larger than the value of the contrast we calculated above, which was \(\hat{\psi}_1 = 33.5\).
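A small base R sketch (using the treatment means, MSW, and n reported elsewhere in this chapter) makes the point concrete: rescaling the contrast coefficients changes \(\hat{\psi}\) but not the test statistic.

```r
# Minimal sketch: a contrast estimate and its t statistic from summary values.
means <- c(Gamma = 252.9, Delta = 219.4, None = 227.9)
n     <- 20          # units per treatment
MSW   <- 2052.64     # mean square within from the overall ANOVA

contrast_t <- function(coefs) {
  psi_hat <- sum(coefs * means)
  se      <- sqrt(MSW * sum(coefs^2 / n))
  c(psi_hat = psi_hat, t = psi_hat / se)
}

contrast_t(c(1, -1, 0))   # psi_hat =  33.5
contrast_t(c(5, -5, 0))   # psi_hat = 167.5, but the same t value
```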
A “pairwise” contrast involves a comparison between only two means. A “non-pairwise” contrast involves a comparison among three or more means. In the preceding cholesterol statistical hypotheses the first three involve pairwise contrasts (e.g., \(\psi_1 = \mu_1 - \mu_2\)), and the second three involve non-pairwise contrasts (e.g., \(\psi_6 = \mu_2 - (\mu_1 + \mu_3)/2\)).
The significance level at which you test each contrast is usually referred to as the per comparison error rate. That is, it is the probability of making a Type I error (i.e., rejecting a true null hypothesis) in testing a given comparison. The notation \(\alpha_{PC}\) will be used to denote the per comparison error rate.
In an experiment involving K treatments there are Q possible unique contrasts (both pairwise and non-pairwise) that can be made among the treatment means, where:
\[ \begin{equation} Q = 1 + \frac{3^K-1}{2} - 2^K \tag{15-11} \end{equation} \]
The notation \(\psi_q\), where q = 1, 2, … , Q, will be used to denote any contrast. In the cholesterol study K = 3; therefore,
\[ \begin{align} Q &= 1 + \frac{3^3-1}{2} - 2^3 \\ &= 1 + \frac{27-1}{2} - 8 \\ &= 1 + 13 - 8 = 6 \end{align} \]
That is, in a one-way experiment with three treatments there are six possible unique comparisons that can be made among the treatment means. These six possible unique comparisons are shown in the preceding six post hoc null hypotheses.
In an experiment involving K treatments there are T possible unique pairwise contrasts, where
\[ \begin{equation} T = \frac{K(K-1)}{2} \tag{15-12} \end{equation} \]
Therefore, in the cholesterol experiment there are
\[ T = \frac{3(3-1)}{2} = 3 \]
pairwise comparisons. These three unique pairwise comparisons are specified above in the first three post hoc hypotheses. Note that K can also be viewed as the number of variables in a bivariate or pairwise correlation scenario (note also that some textbooks use J instead of K). In this scenario, T would be the total number of correlations to be analyzed. This pairwise formula applies to any situation where you want to count the number of possible pairs of things. For example, if you have three variables, then there are three possible pairwise correlations you can analyze (i.e., \(r_{12}\), \(r_{13}\), and \(r_{23}\); \(r_{31}\) is the same correlation as \(r_{13}\)).
The familywise error rate (FWER) is the probability of one or more Type I errors being made among a set or family of comparisons. It can be shown that if each comparison in a family has a per comparison error rate of \(\alpha_{PC}\), then the familywise error rate, denoted by \(\alpha_{FW}\) is:
\[ \begin{equation} \alpha_{FW} \le 1 - (1 - \alpha_{PC})^C \tag{15-13} \end{equation} \]
where C is the number of comparisons in the family. The familywise error rate will equal the probability on the right when the comparisons are independent. Note that C can also be viewed more generically as the number of hypothesis tests to be performed (i.e., it is not only specific to multiple comparison testing).
For example, if we decided to test all of the possible pairwise and non-pairwise comparisons shown above for the cholesterol experiment at the .10 level of significance, the familywise error rate would be found as:
\[ \alpha_{FW} \le 1 - (1 - .10)^6 = 1 - (.90)^6 = .4686 \]
That is, the probability of falsely rejecting one or more contrasts could be as large as .4686. For most researchers the possibility of having this high of a familywise error rate would be intolerable. Fortunately, statistical tests exist which will test all of the preceding hypotheses and still hold the familywise error rate at a reasonable level. A discussion of these post hoc statistical procedures follows.
There are a variety of post hoc comparison test procedures that a researcher can choose from in order to hold the familywise error rate in an exploratory analysis at a reasonable level. Most of these procedures have been named after the person who first described them. For example, to name a few, there is Duncan’s New Multiple Range Test, Fisher’s Least Significant Difference (LSD), the Newman-Keuls Test, Scheffé’s Test, and Tukey’s Honestly Significant Difference (HSD) or Wholly Significant Difference (WSD) test.
In an exploratory analysis, we favor, along with others (see Kirk, 1982, pp. 146-148, and Keppel, 1983, pp. 164-165) the Tukey procedure for pairwise contrasts and the Scheffé procedure when a family of comparisons contains non-pairwise comparisons. Therefore, the Tukey and Scheffé procedures are discussed and illustrated next as the test procedures to be used following a significant F test in an exploratory analysis of variance. A detailed discussion of most post hoc comparison procedures may be found in presentations by Miller (1966, 1977). But first, we would like to introduce a general-purpose adjustment procedure that works well (but conservatively) with few assumptions: the Bonferroni alpha-adjustment procedure.
The Bonferroni alpha-adjustment technique, based on Boole’s inequality, can be used in any situation where we have multiple hypothesis tests and want to control the familywise error rate (FWER). Therefore, the Bonferroni technique, and some modified-Bonferroni techniques, can be used in many multiple hypothesis testing situations, not only for multiple comparisons.
The important decision concerns what is considered a “family” of hypothesis tests. As an example, when we perform post hoc tests after a statistically significant ANOVA, most scholars consider that a family of hypothesis tests (primarily because we are partitioning the same variance to form these contrasts). Some researchers have considered tests of multiple correlations and multiple regression predictors to be from the same family, but there is no consensus in those situations like there is for ANOVA post hoc tests.
The steps for the Bonferroni adjustment are relatively straightforward:

1. Decide on the familywise error rate, \(\alpha_{FW}\), to be held for the family of tests (e.g., .10).
2. Count the number of hypothesis tests, C, in the family.
3. Test each hypothesis using the per comparison level of significance \(\alpha_{PC} = \alpha_{FW}/C\) (equivalently, multiply each p value by C and compare it to \(\alpha_{FW}\)).
Note that the null hypothesis decision rule here can be written as either:
\[ \begin{align} &\text{Reject } H_0 \text{ if } p \le \alpha_{FW}/C \\ &\quad\text{or} \\ &\text{Reject } H_0 \text{ if } (p \times C) \le \alpha_{FW} \end{align} \]
Therefore, we must pay attention to whether our statistical program has already adjusted the p values it reports. This p adjustment (rather than alpha adjustment) is typically what programs do, because they do not know what level of significance we are using. Therefore, we simply compare the already-adjusted p value to our original alpha level. This caveat applies to all multiple hypothesis testing procedures, including Tukey, Scheffé, and Games-Howell (the computer program will usually indicate when p has been adjusted).
While the Bonferroni technique will keep the FWER at or below the desired (nominal) \(\alpha_{FW}\) level, scholars have shown that in some situations the resulting reduction in alpha causes power to decrease substantially. Therefore, the Bonferroni technique is not recommended when power is a particular concern (e.g., small sample sizes). Researchers have shown that the Tukey test generally has more power than the Bonferroni technique, for example.
It has been shown that Holm’s (1979) modified Bonferroni procedure protects against FWER Type I error inflation as well as Bonferroni, but provides more power. Holm’s test is a bit more complicated to implement, but not too bad.
The rationale for the Holm procedure (paraphrased from Howell, 2002, p. 387; italics indicate text we’ve added):
When we reject the null hypothesis for the test with the smallest significance, we are declaring that null hypothesis to be false. If it is false [in reality], that leaves only N - 1 possible true null hypotheses, and so we only need to protect against N - 1 type I errors; this same logic follows for all remaining tests. [If it is not false in reality, then you’ve already made at least one Type I error and additional ones don’t matter.] The logic makes sense, in particular, when we believe strongly that several null hypotheses are almost certain to be false – if they are indeed false, there is no reason to protect against erroneously rejecting them.
There are other, newer, modified Bonferroni procedures that are used by researchers, but they tend to require additional assumptions. The Bonferroni and Holm techniques are always safe (but perhaps a little conservative, especially Bonferroni) as long as the other assumptions of the relevant statistical tests have been met. That is, Bonferroni and Holm adjust p values provided by other statistical tests, so those p values must be meaningful. But they require fewer assumptions than most other techniques (e.g., Hochberg). There is also a newer, similar approach called the False Discovery Rate (FDR), introduced by Benjamini and Hochberg (1995), but it controls the FDR rather than the FWER.
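In R, the base function p.adjust() implements the Bonferroni, Holm, and Benjamini-Hochberg adjustments; the p values below are hypothetical, used only to show the mechanics.

```r
# Minimal sketch: adjust a family of p values so each can be compared
# directly to the unadjusted familywise alpha (e.g., .10).
p_raw <- c(0.004, 0.030, 0.041, 0.220)   # hypothetical unadjusted p values

p.adjust(p_raw, method = "bonferroni")   # multiplies each p by the number of tests (capped at 1)
p.adjust(p_raw, method = "holm")         # Holm's step-down procedure (controls FWER)
p.adjust(p_raw, method = "BH")           # Benjamini-Hochberg (controls FDR, not FWER)
```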
In considering the assumptions that are necessary for the overall F test to be valid, we found that the F test is robust with respect to a violation of the assumption of homogeneity of variance when there are an equal number of units in the treatments. However, when the homogeneity assumption is violated, and there are an unequal number of units in the treatments, the F test is not robust. This same result holds true for the robustness of the Tukey and Scheffé tests. Therefore, in what follows the Tukey and Scheffé tests will be presented under conditions where there are an equal number of units in each treatment and then under conditions where there are an unequal number of units in each treatment.
For example, the data in Figure 15g have been modified from Figure 15c(i) by deleting cases from each group in order to create a dataset with unequal sample sizes and unequal variances. The One-way ANOVA was performed for these data and the output is shown in Figure 15h. In Figure 15h, we can see that Levene’s test is statistically significant, p = .048. Because Levene’s test is statistically significant, we must reject the null hypothesis that variances are equal and conclude that variances in the population are not equal. This is a violation of the homogeneity of variance assumption.
Person_ID | Gamma | Delta | None | z_Gamma | z_Delta | z_None |
---|---|---|---|---|---|---|
1 | 235 | 260 | 210 | -0.3975 | 0.8854 | -0.3975 |
2 | 240 | 160 | 215 | -0.2865 | -1.2954 | -0.2865 |
3 | 288 | 255 | 263 | 0.7795 | 0.7763 | 0.7795 |
4 | 245 | 175 | 220 | -0.1754 | -0.9682 | -0.1754 |
5 | 355 | 250 | 330 | 2.2675 | 0.6673 | 2.2675 |
6 | 223 | 260 | 198 | -0.6640 | 0.8854 | -0.6640 |
7 | 285 | 185 | 260 | 0.7129 | -0.7502 | 0.7129 |
8 | 223 | 156 | 198 | -0.6640 | -1.3826 | -0.6640 |
9 | 275 | 273 | 250 | 0.4908 | 1.1689 | 0.4908 |
10 |  |  | 179 | -1.0860 | -0.0305 | -1.0860 |
11 | 268 | 180 | 243 | 0.3353 | -0.8592 | 0.3353 |
12 | 272 |  | 247 | 0.4242 | 0.5583 | 0.4242 |
13 | 235 |  | 210 | -0.3975 | -0.4231 | -0.3975 |
14 | 297 |  | 272 | 0.9794 | 0.3402 | 0.9794 |
15 | 235 | 255 | 210 | -0.3975 | 0.7763 | -0.3975 |
16 | 278 |  | 253 | 0.5574 | -0.4231 | 0.5574 |
17 | 322 | 320 |  | 1.5346 | 2.1938 | 1.5346 |
18 |  |  | 195 | -0.7306 | 0.1221 | -0.7306 |
19 |  | 190 |  | -1.5523 | -0.6411 | -1.5523 |
20 |  | 146 | 150 | -1.7300 | -1.6007 | -1.7300 |
jmv::anovaOneW(
formula = Cholesterol ~ Treatment,
data = data,
fishers = TRUE,
welchs = TRUE,
desc = TRUE,
descPlot = TRUE,
eqv = TRUE)
ONE-WAY ANOVA
One-Way ANOVA
──────────────────────────────────────────────────────────────
F df1 df2 p
──────────────────────────────────────────────────────────────
Cholesterol Welch's 5.832 2 27.94 0.0076
Fisher's 5.269 2 45 0.0088
──────────────────────────────────────────────────────────────
Group Descriptives
──────────────────────────────────────────────────────────────
Treatment N Mean SD SE
──────────────────────────────────────────────────────────────
Cholesterol Delta 14 218.9 54.25 14.500
Gamma 16 267.2 37.24 9.310
None 18 227.9 41.20 9.712
──────────────────────────────────────────────────────────────
ASSUMPTION CHECKS
Homogeneity of Variances Test (Levene's)
────────────────────────────────────────────────
F df1 df2 p
────────────────────────────────────────────────
Cholesterol 3.245 2 45 0.0483
────────────────────────────────────────────────
We know that the ANOVA F statistic is relatively robust to this violation when sample sizes are equal; however, here we do not have equal sample sizes. Therefore, the more conservative approach is to use a robust test of the equality of means. We requested the Welch F test because research has shown that it maintains the nominal Type I error rate better than the other option, the Brown-Forsythe test. The Welch F statistic is 5.832 with (2, 27.94) degrees of freedom and is statistically significant, p = .0076. Because we have a statistically significant omnibus test, we will follow up with an appropriate post hoc test described below (e.g., Games-Howell).
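For reference, base R’s oneway.test() produces the same Welch F without jamovi; this is a cross-check that assumes the modified unequal-n data are in the long-format data frame `data`.

```r
# Minimal sketch: Welch's F (no equal-variance assumption) versus the classical Fisher F.
oneway.test(Cholesterol ~ Treatment, data = data)                    # Welch (the default)
oneway.test(Cholesterol ~ Treatment, data = data, var.equal = TRUE)  # Fisher
```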
In this section, Tukey’s Honestly Significant Difference (HSD) test will be described as the procedure to test pairwise contrasts following a significant overall F test in an exploratory ANOVA. That is, Tukey’s test is used here to provide us with an answer to the question: Do one or more mean pairs differ? In so doing, it allows us to select a familywise error rate. That is, in using Tukey’s HSD test with \(\alpha_{FW} = .10\), the probability of at least one pairwise test being found significant when it should not be is .10. Note, however, that if you suspect that your variances are extremely heterogeneous, e.g., if the ratio of one variance to another is of the order of 10 to 1, you should use a modification of Tukey’s test which will be described in a following section.
In what follows, Tukey’s HSD test will be described in general using a step-by-step procedure and illustrated at each step using all pairwise contrasts (i.e., the first three contrasts) from the preceding cholesterol example. While it is perhaps instructive to see the process for one of the methods (e.g., Tukey’s method above), statistical computer programs make it unnecessary to calculate these contrasts and comparisons ourselves, so we won’t show the steps or calculations for any other methods.
Step 1. Find the critical value of the studentized range statistic, \(SR_C\), in Table A.6 using the degrees of freedom within (\(v_2\)), the number of treatment means (J), and the chosen familywise error rate (\(\alpha_{FW}\)).

Example. For the cholesterol experiment the preceding values were found to be: \(v_2 = 57\), \(J = 3\), and \(\alpha_{FW} = .10\). Therefore, from Table A.6, after resetting \(v_2\) at the conservative value of 40, we have that: \(SR_C = 2.99\).
Step 2. Calculate the HSD critical value:

\[ HSD = SR_C \sqrt{MSW/n} \tag{15-14} \]
where \(SR_C\) is the critical value of the studentized range statistic found from Table A.6 in step 1; MSW is the mean square within found from the overall ANOVA table; n is the number of units in each treatment.
Example. For the cholesterol example we have that: \(SR_C = 2.99\); \(MSW = 2052.64\) (from figure 15f); \(n = 20\). Therefore, we have that:
\[ HSD = 2.99 \sqrt{2052.64/20} = 30.29 \]
Step 3. Prepare a table of the differences between all pairs of treatment means (each difference is a pairwise contrast, \(\hat{\psi}_q\)).

Example. For the cholesterol experiment we have:

 | \(M_1 = 252.9\) (Gamma) | \(M_2 = 219.4\) (Delta) | \(M_3 = 227.9\) (None) |
---|---|---|---|
\(M_1 = 252.9\) | - | \(\hat{\psi}_1 = 33.5\) | \(\hat{\psi}_2 = 25.0\) |
\(M_2 = 219.4\) | - | - | \(\hat{\psi}_3 = -8.5\) |

Step 4. Consider as significantly different from zero the mean differences in the table prepared in Step 3 whose absolute values are greater than the HSD critical value found in Step 2.
Example. In our example we have that only \(|\hat{\psi}_1| = 33.5 > 30.29\). Therefore, only the null hypothesis
\(H_0: \psi_1 = \mu_1 - \mu_2 = 0\) would be rejected at the .10 level of significance. That is, it is unreasonable to find \(\hat{\psi}_1 = 33.5\) if the sampling distribution of \(\hat{\psi}_1\) had a mean of zero.
Finally, a set of simultaneous confidence intervals may be formed around each pairwise contrast:

\[ \hat{\psi}_q - HSD < \psi_q < \hat{\psi}_q + HSD \tag{15-15} \]
A set of simultaneous confidence intervals is a set of confidence intervals that all will contain the population contrasts \(100(1 – \alpha_{FW})\text{%}\) of the time. That is, in the cholesterol experiment, if we were to repeat the experiment an infinite number of times, and each time we computed the set of confidence intervals, 90% of the sets of confidence intervals would contain all of the population contrasts.
Example. The set of 90% confidence intervals for the cholesterol study is:
\[ \begin{align} 33.5 - 30.29 &< \psi_1 < 33.5 + 30.29 &&\text{(interval does NOT contain 0)} \\ 25.0 - 30.29 &< \psi_2 < 25.0 + 30.29 &&\text{(interval does contain 0)} \\ -8.5 - 30.29 &< \psi_3 < -8.5 + 30.29 &&\text{(interval does contain 0)} \end{align} \]
Note that intervals containing zero are centered around the contrasts that were considered to not differ significantly from zero. Intervals that do not contain zero are considered statistically significant.
The results indicate that there is a significant difference between the average cholesterol levels of the women taking the Gamma Company’s herbal supplement pill and the cholesterol levels of women taking the Delta company’s herbal supplement pill. The means indicate that the cholesterol levels of the women taking the Gamma Company’s herbal supplement pill are higher than the cholesterol levels of the women taking the Delta Company’s herbal supplement pill. No significant differences were found among the other treatments.
We can see from the output in Figure 15f(ii) that only the mean comparison between Gamma and Delta is statistically significant, with p = .059 (recall that we are using alpha = .10). Note that the 90% Confidence Interval provided uses the exact SRc value for the analysis, which is closer to the 2.96 that we would have used for 60 degrees of freedom (from Appendix A.6) since we actually had 57 degrees of freedom within.
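Base R’s TukeyHSD() reproduces these comparisons, with adjusted p values and simultaneous confidence intervals, assuming the original equal-n data are in the long-format data frame `data`:

```r
# Minimal sketch: Tukey HSD comparisons with 90% simultaneous confidence intervals.
fit <- aov(Cholesterol ~ Treatment, data = data)
TukeyHSD(fit, conf.level = 0.90)
```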
The Delta and None groups were not statistically significantly different on the Cholesterol variable (p = .84); therefore they are shown in the same subset (i.e., Subset 1). The same is true for the Gamma and None groups (p = .198), except that they are together in Subset 2. However, because the Gamma and Delta treatments were statistically significantly different at the alpha = .10 level, they are in different subsets: Delta in Subset 1 and Gamma in Subset 2. However, research has shown that this subset approach does not work well when sample sizes are unequal, because of how the calculations are performed. When you obtain this table with unequal sample sizes, this message is provided as a footnote to the table: “The group sizes are unequal. The harmonic mean of the group sizes is used. Type I error levels are not guaranteed.”
Tukey’s test was modified by Kramer to handle situations where there are unequal n’s (but equal variances). This Tukey-Kramer procedure is what most programs use for the Tukey HSD analysis. It is important to note that, theoretically, there is no problem with performing Tukey HSD or the pooled-variance Student’s t test with unequal sample sizes, even vastly different sample sizes (e.g., \(n_1 = 10\) and \(n_2 = 1000\)). However, the larger the difference in sample sizes, the smaller the difference in variances needed to push Type I error rates beyond acceptable levels.
Tukey’s test was modified by Games and Howell (1976) to handle situations where there are both unequal n’s and unequal variances. The Games and Howell modification of Tukey’s test (GHT) is recommended for use when you have an unequal number of subjects in some treatments or when you have equal n’s but extremely heterogeneous variances. This is because GHT is robust to violations of the assumption of homogeneity of variance. The GHT test has been shown to be slightly liberal (e.g., .054 instead of .05; see Tamhane, 1979) when a violation of homogeneity is not present, but this does not appear to be a major problem for most data analysts.
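In its standard form (again sketched here for reference), the GHT statistic for treatments \(j\) and \(k\) replaces the pooled error term with the separate group variances and uses Welch-type degrees of freedom:
\[ q_{jk}=\frac{\bar{X}_j-\bar{X}_k}{\sqrt{\frac{1}{2}\left(\frac{s_j^2}{n_j}+\frac{s_k^2}{n_k}\right)}}, \qquad df'=\frac{\left(\frac{s_j^2}{n_j}+\frac{s_k^2}{n_k}\right)^2}{\frac{(s_j^2/n_j)^2}{n_j-1}+\frac{(s_k^2/n_k)^2}{n_k-1}} \]
and \(q_{jk}\) is compared with the studentized range critical value for \(K\) means and \(df'\) degrees of freedom.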
# Tukey HSD post hoc tests for the cholesterol data via the jmv (jamovi) package;
# `data` must contain the Cholesterol and Treatment variables.
jmv::anovaOneW(
    formula = Cholesterol ~ Treatment,
    data = data,
    welchs = FALSE,
    phMethod = "tukey",
    phTest = TRUE,
    phFlag = TRUE)
ONE-WAY ANOVA
POST HOC TESTS
Tukey Post-Hoc Test – Cholesterol
───────────────────────────────────────────────────────────
Delta Gamma None
───────────────────────────────────────────────────────────
Delta Mean difference — -33.50 -8.500
t-value — -2.338 -0.5933
df — 57.00 57.00
p-value — 0.0586 0.8243
Gamma Mean difference — 25.000
t-value — 1.7450
df — 57.00
p-value — 0.1976
None Mean difference —
t-value —
df —
p-value —
───────────────────────────────────────────────────────────
Note. * p < .05, ** p < .01, *** p < .001
Figure 15h(ii) presents the Games-Howell output from the One-way ANOVA procedure. We can see in this example that two of the mean comparisons are statistically significant: Gamma vs. Delta, with p = .0265, and Gamma vs. None, with p = .0170. Recall that these are different data from those used for the Tukey HSD earlier (cases had been deleted to create the data in Figure 15g), so there are no contradictions between these results.
# Games-Howell post hoc tests, descriptive statistics, and assumption checks via
# the jmv package; this run uses the modified data set described in Figure 15g.
jmv::anovaOneW(
    formula = Cholesterol ~ Treatment,
    data = data,
    fishers = TRUE,
    desc = TRUE,
    descPlot = TRUE,
    norm = TRUE,
    qq = TRUE,
    eqv = TRUE,
    phMethod = "gamesHowell",
    phTest = TRUE,
    phFlag = TRUE)
ONE-WAY ANOVA
One-Way ANOVA
──────────────────────────────────────────────────────────────
F df1 df2 p
──────────────────────────────────────────────────────────────
Cholesterol Welch's 5.832 2 27.94 0.0076
Fisher's 5.269 2 45 0.0088
──────────────────────────────────────────────────────────────
Group Descriptives
──────────────────────────────────────────────────────────────
Treatment N Mean SD SE
──────────────────────────────────────────────────────────────
Cholesterol Delta 14 218.9 54.25 14.500
Gamma 16 267.2 37.24 9.310
None 18 227.9 41.20 9.712
──────────────────────────────────────────────────────────────
ASSUMPTION CHECKS
Normality Test (Shapiro-Wilk)
───────────────────────────────────
W p
───────────────────────────────────
Cholesterol 0.9639 0.1448
───────────────────────────────────
Note. A low p-value suggests
a violation of the assumption
of normality
Homogeneity of Variances Test (Levene's)
────────────────────────────────────────────────
F df1 df2 p
────────────────────────────────────────────────
Cholesterol 3.245 2 45 0.0483
────────────────────────────────────────────────
POST HOC TESTS
Games-Howell Post-Hoc Test – Cholesterol
───────────────────────────────────────────────────────────
Delta Gamma None
───────────────────────────────────────────────────────────
Delta Mean difference — -48.32 -9.016
t-value — -2.804 -0.5166
df — 22.60 23.64
p-value — 0.0265 0.8640
Gamma Mean difference — 39.306
t-value — 2.9216
df — 31.99
p-value — 0.0170
None Mean difference —
t-value —
df —
p-value —
───────────────────────────────────────────────────────────
Note. * p < .05, ** p < .01, *** p < .001
When non-pairwise contrasts are included in the family of contrasts, the recommended procedure is Scheffé’s test. Like the Tukey test, the Scheffé test allows you to select a familywise error rate.
Unfortunately, specialized software is required to easily calculate the Scheffé contrasts. For example, Contrast 1 in the output corresponds to Contrast #4 from the section above, with the following hypotheses:
\[ \begin{align} H_0&: \frac{\mu_1+\mu_2}{2} - \mu_3 = 0 \\ H_A&: \frac{\mu_1+\mu_2}{2} - \mu_3 \ne 0 \end{align} \]
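For reference, the standard Scheffé criterion (not shown in the output) declares a contrast \(\hat{\psi}=\sum_j c_j\bar{X}_j\) statistically significant when
\[ \frac{|\hat{\psi}|}{\sqrt{MS_W\sum_j c_j^2/n_j}} \;\ge\; \sqrt{(K-1)\,F_{\alpha_{FW};\,K-1,\,N-K}} \]
which is what makes the procedure valid for any pairwise or non-pairwise contrast in the family.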
We recommend an R Shiny app that we have created to perform the Scheffé comparisons. The website is:
R Shiny App for Scheffé Comparisons
Here are results from that Shiny app using the data in Figure 15c.
Two important notes about this process. First, Scheffé will work for unequal n’s; however, it cannot adjust for unequal variances. Therefore, if you have a statistically significant Levene’s test, you must use the Brown-Forsythe adaptation of Scheffé (just as Games-Howell is the adaptation of Tukey).
Another option may be to recode the data to create the two groups you wish to compare (note that contrasts really just compare two groups: a group with negative contrast coefficients and a group with positive contrast coefficients). Then you can run a two-group analysis (either a one-way ANOVA or an independent t test) and use an appropriate robust test (e.g., Welch) to address the violation of the homogeneity of variance assumption.
Second, you can perform as many post hoc multiple comparison tests as you like using the Scheffé approach. However, when treating these as a priori contrasts, the contrasts (and ideally the directionality of the differences, since these are confirmatory analyses, that is, which mean or combination of means is expected to be larger) must be specified before you collect the data and run the analyses. In that scenario, most scholars agree that you do not need to adjust alpha as long as you have limited yourself to a small number of a priori contrasts; most would suggest no more than K – 1 such contrasts, where K is the total number of groups (and ideally these K – 1 contrasts would be orthogonal, although that is not required). If instead you use this procedure to perform follow-up (post hoc) tests, you must adjust alpha for the multiple hypothesis tests you are performing (e.g., using the Bonferroni or Holm alpha-adjustment procedures).
Finally, it does not generally fit the logic of research and hypothesis testing to perform both confirmatory and exploratory analyses (i.e., a priori contrasts and post hoc comparisons) with the same data. Be very careful, and justify your analyses well, if you choose to do so.
Scheffé’s test for unequal n’s is based on a modification developed by Brown and Forsythe (1974). The Brown and Forsythe modification of Scheffé’s test (BFS) is recommended for use when you have an unequal number of subjects in each treatment or when you have equal n’s but extremely heterogeneous variances. This is because BFS is robust to violations of the assumption of homogeneity of variance.
The Shiny App above will calculate the Brown-Forsythe adjustment to Scheffé for the most explanatory Scheffé comparison.
Unfortunately, while most computer programs provide Scheffé post hoc tests as an option, the tests provided are typically only the pairwise tests. The strength (and the curse) of the Scheffé post hoc procedure is that it allows researchers to test a theoretically infinite number of possible non-pairwise comparisons. As a result, when Scheffé is used only for pairwise comparisons, it has low power, because the alpha adjustment protects against all possible pairwise and non-pairwise comparisons. This is why researchers rarely use and report Scheffé comparisons in the literature.
In fact, most statistical computer programs do not provide non-pairwise tests without extra effort on the part of the researcher to define those contrasts. In R, we can create non-pairwise comparisons only in the One-way ANOVA procedure after defining the contrast coefficients. We could use the Scheffé method to perform the statistical hypothesis testing, but the more common approach seems to be to use the Bonferroni (or Holm) adjustment with the p values provided by the contrast output.
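A minimal base-R sketch of that approach follows. It assumes the `data` object used earlier in the chapter and tests, for illustration, the non-pairwise contrast that averages the Delta and Gamma groups against the None group; the contrast coefficients and the assumption of two planned contrasts are illustrative choices, not results from the text.

# Fit the one-way ANOVA and extract the pooled error variance and its df
fit  <- aov(Cholesterol ~ Treatment, data = data)
mse  <- deviance(fit) / df.residual(fit)      # MS within (pooled error variance)
df_w <- df.residual(fit)

# Group means and sizes
means <- tapply(data$Cholesterol, data$Treatment, mean)
ns    <- tapply(data$Cholesterol, data$Treatment, length)

# Contrast coefficients: (Delta + Gamma)/2 versus None
w   <- c(Delta = 0.5, Gamma = 0.5, None = -1)
psi <- sum(w * means[names(w)])               # estimated contrast
se  <- sqrt(mse * sum(w^2 / ns[names(w)]))    # standard error of the contrast

t_stat <- psi / se
p_raw  <- 2 * pt(-abs(t_stat), df_w)          # two-tailed p value for the contrast
p_adj  <- p.adjust(p_raw, method = "bonferroni", n = 2)  # adjust for two contrasts

If the p values for all of the contrasts are collected in one vector, `p.adjust(p_values, method = "holm")` applies the Holm step-down adjustment across the whole set.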
In this chapter we considered the elements of hypothesis testing for an exploratory analysis of variance (ANOVA). That is, we considered an ANOVA situation where the researcher had very little basis to predict which treatments, if any, would differ. We found that a researcher in this situation first uses an overall F test to determine if there are any differences among the treatments. If no treatment differences are found, the analysis stops, but if treatment differences are found, that is, if the F test is significant, a post hoc test is conducted.
We found that there are many different post hoc tests, but that four were recommended for use as part of an exploratory ANOVA. The four post hoc tests recommended here were: Tukey’s Honestly Significant Difference (HSD) test; a modification of Tukey’s test by Games and Howell (1976), referred to here as GHT; Scheffé’s test, referred to as ST; and a modification of Scheffé’s test by Brown and Forsythe (1974), referred to as BFS. A flow chart illustrating the steps leading to these post hoc tests in a one-way fixed-effects exploratory analysis of variance is shown in figure 15i.
In considering the assumptions for the analysis of variance we found that the F test is robust to violations of its homogeneity of variance assumption, provided that there are an equal number of units in each treatment. That is, given heterogeneous variances and equal n’s, the nominal and actual levels of significance may be considered to be equal. In this situation, that is given equal n’s and only pairwise comparisons, the Tukey post hoc procedure was recommended, and given equal n’s with non-pairwise comparisons, Scheffé’s test was recommended. Both of these post hoc tests are used because they control the familywise error rate in testing a set of mean contrasts.
If one has both unequal n’s and heterogeneous variances or equal n’s and extremely heterogeneous variances, then the actual and nominal levels of significance can differ by an unacceptable amount. In these cases post hoc tests that are based on modifications of the Tukey and Scheffé procedures were recommended because these modified tests were found to be robust to violations of the homogeneity of variance assumption.
Here, the Games and Howell modification of Tukey’s test was recommended for use with pairwise comparisons and the Brown and Forsythe modification of Scheffé’s test was recommended for use with non-pairwise comparisons. These tests were also recommended here for general use whenever you are confronted with post hoc tests in an unequal n ANOVA. This was because they have been shown to be relatively powerful even when the treatment variances are homogeneous (Keselman, Games, & Rogan, 1979). Therefore, in an unequal n ANOVA, the Games and Howell and Brown and Forsythe post hoc tests provide you with both Type I and Type II error protection, which you cannot be certain of from the Tukey and Scheffé tests.
Figure 15i A flow chart of the steps leading to post hoc testing in an exploratory ANOVA
SECTION 1: One-way ANOVA
Analyses to Run
Using the ANOVA output, respond to the following items
Provide the most appropriate research question for the analysis
Provide the statistical null hypothesis using both words and appropriate symbols.
Name the independent and dependent variables in these ANOVAs.
How many levels are there for the independent/factor variable?
What are the sizes for each group? What is the total sample size regardless of group?
Report and interpret the GRAND MEAN for all cases on Y and its 95% Confidence Interval.
Report and interpret the GROUP MEANS for Y for all W levels/groups and the 95% Confidence Interval for the GROUP 1 mean.
Is the assumption of normality of each level/group violated for this analysis? Provide evidence.
Is the assumption of homogeneity (equality) of variances violated for this analysis? Provide evidence.
Do you reject the Null Hypothesis that all W level/group population means are equal on the Y variable?
Is there a statistically significant difference between ANY of the group means on the Y variable? That is, is the OMNIBUS/OVERALL ANOVA statistically significant?
Is there statistically significant variation in the group means on the Y variable?
Do you conclude that the level/group means are all equal in the population?
Show or explain how the F statistic for the ANOVA is calculated.
Show or explain how to calculate the amount of variation in Y that is explained by the W variable (that is, \(R^2\) or eta-squared, \(\eta^2\))
Calculate Cohen’s f effect size for this analysis. By the way, \(f^2 = R^2/(1 - R^2)\) and \(R^2 = f^2/(1 + f^2)\). (A brief R sketch of these effect-size calculations appears at the end of this section.)
Report whether either OMNIBUS/OVERALL ANOVA is statistically significant. Provide evidence for both analyses.
Note that in real life you would only do the next two items if there is a statistically significant OMNIBUS ANOVA. However, you can do these as a priori comparisons and ignore the OMNIBUS ANOVA test.
18. Report all level/group mean differences and p values for the associated multiple comparison post hoc tests (Tukey HSD is usually a good choice if you have equal variances, Games-Howell if not).
19. Report and interpret which Tukey pairwise level/group mean differences are statistically significant and their associated p values. Instead of Tukey, use Games-Howell if the assumption of homogeneity of variances is violated.
Show how to calculate Cohen’s d for the largest pairwise level/group mean difference
Interpret the results for the one-way ANOVA in an APA-style report to answer the research question and to describe in detail the relationship between W and Y. Whether statistically significant or not, use descriptive statistics (e.g., means, standard deviations, mean differences, effect sizes, and/or confidence intervals), inferential statistics, degrees of freedom, and statistical significance to describe the differences between the groups, including which group is considered to have the larger mean, if appropriate (an APA table is often the best way to do this for multiple group comparisons). Be sure to discuss assumptions and outliers and their potential impact.
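For the effect-size items above (eta-squared/R-squared, Cohen’s f, and Cohen’s d), here is a brief, hedged R sketch; the cholesterol variables are used only as stand-ins for your own Y and W:

fit <- aov(Cholesterol ~ Treatment, data = data)   # substitute your own Y ~ W and data
tab <- anova(fit)
r2  <- tab[["Sum Sq"]][1] / sum(tab[["Sum Sq"]])   # eta-squared (R-squared): SS between / SS total
f   <- sqrt(r2 / (1 - r2))                         # Cohen's f from R-squared
# For Cohen's d on the largest pairwise difference, divide the mean difference
# by a pooled standard deviation; sqrt(MS within) is one common pooled value:
# d <- (mean_larger - mean_smaller) / sqrt(deviance(fit) / df.residual(fit))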
SECTION 2: Power and Sample Size Analysis
What TOTAL sample size would be required for an independent t test with some given characteristics, for example: alpha=.05, power=.90, Cohen’s d=0.6 (provide evidence from tables or G*Power)
What TOTAL sample size would be required for a test of correlation with some given characteristics, for example: alpha=.05, power=.80, expected Population r (rho)=0.4 (provide evidence from tables or G*Power)
What TOTAL sample size would be required for a 3-group One-way ANOVA with some given characteristics, for example: alpha=.05, power=.80, Cohen’s f=0.4 (provide evidence from tables or G*Power)
What TOTAL sample size would be required for a dependent t test with some given characteristics, for example: alpha=.05, power=.70, Cohen’s d=0.3 (provide evidence from tables or G*Power)
What estimated total sample size would be required for a 95% confidence interval for the mean with some given characteristics, for example: alpha = .05, a desired 95% confidence interval width of 3, and the standard deviation for Group 1 in the Independent-Samples t test section above. You may round to the nearest tenth for the tables. (provide evidence from tables or show your calculations) An R-based sketch for the items in this section appears after the last item.
Someone I know often says, “you can buy statistical significance.” Explain what is meant by that statement. Why does using effect sizes help with this? That is, what do researchers need to do in order to compensate for the fact that “You can buy statistical significance”?
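For the sample-size items in this section, tables and G*Power are the expected sources of evidence, but the values can also be checked in R. The following is a hedged sketch that assumes the `pwr` package is installed; note which calls report n per group (multiply by the number of groups for the TOTAL sample size) and which report the total directly.

library(pwr)

pwr.t.test(d = 0.6, sig.level = .05, power = .90, type = "two.sample")  # independent t: n per group
pwr.r.test(r = 0.4, sig.level = .05, power = .80)                       # correlation: total n
pwr.anova.test(k = 3, f = 0.4, sig.level = .05, power = .80)            # one-way ANOVA: n per group
pwr.t.test(d = 0.3, sig.level = .05, power = .70, type = "paired")      # dependent t: number of pairs

# Confidence-interval width item (normal approximation): n ≈ (z * s / (width/2))^2,
# where s is the Group 1 standard deviation referred to in the item.
# n_ci <- ceiling((qnorm(.975) * s / (3 / 2))^2)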