In chapter 4, we discussed measures of central tendency (the mean, mode, and median) and considered how they may reflect a treatment effect (which in the Pill Consumption experiment was the effect of taking herbal supplement pills).
In this chapter, we will examine measures of variability which indicate the amount of spread in a data set. We will learn how to calculate four measures of variability: the range, interquartile range, variance, and standard deviation. Then we will consider the effect of treatments on these measures. We will also examine different data distributions, as defined by the measures of skewness and kurtosis.
Your R objective in this chapter is to calculate (a) the range, (b) interquartile range, (c) variance, and (d) standard deviation.
In this discussion, we will use the Pill Consumption data from chapter 4, with one change. We have created a new variable by adding 25 points to each cholesterol score in the Yes-Pill group to simulate what would happen if the effect of taking the herbal supplement pill caused each woman’s score to increase by 25 points. In our discussions of this data, we will examine the effect this addition has on the measures of central tendency and the measures of variability.
Figure 5a shows the herbal supplement pill data set with the new column labeled 25+CHOL1. As you remember from chapter 4, the labels used to identify the data in Figure 5a are: ID, identification number; HT, height; and CHOL, cholesterol level. The number in each label identifies the group (1 = Yes-Pill, 2 = No-Pill). Thus ID1 indicates the identification numbers of the subjects in group 1.
In this section, we will consider the measures of variability, or measures of dispersion. We will define each measure, illustrate how to calculate it by hand, and then consider how each measure reflects the dispersion in our experimental data. Because of the arithmetic operations involved, all of the measures of variability that we will discuss are most meaningful when based on data that is measured on an interval or ratio scale. The most frequently used measures of variability are: (a) the range, (b) the interquartile range, (c) the variance, and (d) the standard deviation.
The simplest measure of how spread out the scores are is the range. The range is found by subtracting the smallest score from the largest score. The range is not frequently used as a basis for making decisions about score variability, however, because the range is unstable from one sample to the next. That is, the sample range cannot be counted on to give a good estimate of the population range. For example, in constructing a histogram for column 3, CHOL 1, in Chapter 4, we found the range to be 180; but in a second sample (not shown) from the same population, the smallest score was 50 and the largest score was 300, yielding a range of 250. We will consider the instability of the sample range in more detail in chapter 10.
Another measure of range that is sometimes used is the inclusive range, which is the difference between the highest and lowest scores plus one. For example, if you read pages 6-11 of a book and then report the (exclusive) range of the pages read (the "scores"), it is 5 (that is, 11 – 6 = 5). The inclusive range, however, would be 6 (that is, 11 – 6 + 1 = 6), which perhaps more accurately represents the spread of the scores in the data (i.e., the pages you read).
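We can check these calculations in R. Here is a minimal sketch using the page-reading example (the vector name pages is ours):

```r
pages <- 6:11               # the pages read: 6, 7, 8, 9, 10, 11
diff(range(pages))          # exclusive range: 11 - 6 = 5
diff(range(pages)) + 1      # inclusive range: 6
```

Note that R's range() returns the minimum and maximum as a pair, so diff() of that pair gives the range as defined above.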
To overcome the lack of stability of the range, statisticians use a measure that does not depend on the highest and lowest scores of a sample. This measure, known as the interquartile range, is based on the difference between the third quartile and the first quartile. The first quartile is the same as the 25th percentile and is usually denoted by Q1. The second quartile, the median, is the 50th percentile and is denoted by Q2. The third quartile is the 75th percentile and is denoted by Q3. The interquartile range is usually denoted by the letter Q or the abbreviation IQR, such that:
\[ \begin{equation} IQR = Q_3 - Q_1 = 75\text{th percentile} - 25\text{th percentile} \tag{5-1} \end{equation} \] The first quartile (Q1) is the median of the scores from the lowest score to the first score below the median of all of the scores. The third quartile (Q3) is the median of the scores from the first score above the median of all of the scores to the largest score. For example, if we rank order the CHOL scores for the Yes-Pill data, we have:
\[ \begin{align} & 150 \\ & 158 \\ & 179 \\ & 195 \\ & 198 \\ & \text {<--- [(198+198)/2] = 198.0 = first quartile = Q1 = 25th percentile} \\ & 198 \\ & 210 \\ & 210 \\ & 210 \\ & 215 \\ & \text {<--- [(215+220)/2] = 217.5 = Median = second quartile = Q2 = 50th percentile} \\ & 220 \\ & 243 \\ & 247 \\ & 250 \\ & 253 \\ & \text {<--- [(253+260)/2] = 256.5 = third quartile = Q3 = 75th percentile} \\ & 260 \\ & 263 \\ & 272 \\ & 297 \\ & 330 \end{align} \]
There are 10 scores from the lowest score, 150, up to 215, the score below the median. Therefore, the first quartile (Q1), which is the median of these 10 scores, is 198. Similarly, the third quartile (Q3) is 256.5, which is the median of the 10 scores from 220, the score just above the median, to the largest score, 330. The interquartile range (IQR) is then found as:
\[ IQR = Q3 – Q1 = 256.5 – 198 = 58.5 \] Thus, there is a spread of 58.5 points within the “heart” of this distribution of scores. Because the interquartile range is based on the difference between quartiles instead of end scores, it is relatively stable from sample to sample. It is also not strongly influenced by extreme points. For these reasons, the interquartile range should be routinely used with data that is skewed (that is, non-symmetric).
Equation (4.1) will not generally yield the 25th and 75th percentile ranks for the scores at Q1 and Q3. For example, using equation (4.1) the percentile rank for Q3 = 256.5 is found to be 77.5%. This is because equation (4.1) assumes you are dealing with a continuous distribution; and in this example, we have a discrete set of scores. In practice, this conflict is only a minor nuisance.
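A related practical point is that statistical programs compute quartiles in slightly different ways, so their results may differ a little from the hand calculation above. For example, R's default quantile algorithm yields Q3 = 254.75 rather than 256.5 for these scores, which matches the jamovi output shown later in this chapter (IQR = 56.75 rather than 58.5). A minimal sketch, using the 20 Yes-Pill CHOL scores listed above (the vector name chol1 is ours):

```r
# The 20 Yes-Pill cholesterol scores, in rank order (from the listing above)
chol1 <- c(150, 158, 179, 195, 198, 198, 210, 210, 210, 215,
           220, 243, 247, 250, 253, 260, 263, 272, 297, 330)

# Tukey's hinges match the hand calculation: Q1 = 198, Q3 = 256.5
fivenum(chol1)               # min, lower hinge, median, upper hinge, max

# R's default quantile algorithm (type = 7) interpolates differently:
# Q1 = 198.00, Q3 = 254.75
quantile(chol1, probs = c(0.25, 0.50, 0.75))
IQR(chol1)                   # 254.75 - 198.00 = 56.75
```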
Variance is a commonly used measure of the spread of scores.
The variance of a sample measures the spread of the scores by: (a) taking the distance that each score lies from its mean (that is, a deviation score) and squaring it, (b) summing all of these squares, and (c) finding the average (almost) of the result. “Almost” is included because the sum of the squared deviation scores is divided by (n-1) instead of by n (where n is the number of scores). The variance of a sample is calculated using (n-1) instead of n because the resultant value yields a “better” estimate of the population variance than does the value found using n. (What is meant by “better” is discussed in chapter 10.) The formula for the sample variance is:
\[ \begin{equation} s_X^2= \frac {\sum (X-M_X)^2} {(n-1)} \tag{5-2} \end{equation} \]
For example, the variance of the scores 1, 2, 3, 4, and 5 is found using the following sums:
\[ \begin{align*} & \text{Scores} && \text {Deviation } (X – M_X) &&& (X – M_X)^2 \\ & 1 && 1 – 3 = -2 &&& 4 \\ & 2 && 2 – 3 = -1 &&& 1 \\ & 3 && 3 – 3 = 0 &&& 0 \\ & 4 && 4 – 3 = 1 &&& 1 \\ & 5 && 5 – 3 = 2 &&& 4 \\ & 15 = \sum X && \sum (X – M_X) = 0 &&& \sum (X – M_X)^2 = 10 \\ & M_X = 15/5 = 3 \\ & s_X^2 = 10/(5 – 1) = 2.5 \end{align*} \] The variance of a sample, like the mean, should not be routinely used as a descriptive statistic because it is strongly affected by extreme scores. For example, consider what happens to the variance of the preceding scores when we include an additional score of 105:
\[ \begin{align*} & \text{Scores} && \text {Deviation } (X – M_X) &&& (X – M_X)^2 \\ & 1 && 1 – 20 = -19 &&& 361 \\ & 2 && 2 – 20 = -18 &&& 324 \\ & 3 && 3 – 20 = -17 &&& 289 \\ & 4 && 4 – 20 = -16 &&& 256 \\ & 5 && 5 – 20 = -15 &&& 225 \\ & 105 && 105 – 20 = 85 &&& 7225 \\ & 120 = \sum X && \sum (X – M_X) = 0 &&& \sum (X – M_X)^2 = 8680 \\ & M_X = 120/6 = 20 \\ & s_X^2 = 8680/(6 – 1) = 1736 \end{align*} \] As with the skewed distributions discussed in chapter 4, the mean was pulled in the direction of the extreme score. Also, the variance has “exploded” when compared to 2.5, since it is strongly affected by the squared mean deviation of the extreme score. For this reason, the variance is routinely used only with data that is symmetric (that is, not skewed).
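These hand calculations are easy to verify in R, whose var() function uses the (n - 1) denominator of equation 5-2:

```r
x <- c(1, 2, 3, 4, 5)
var(x)                            # sample variance: 10 / (5 - 1) = 2.5

x_wild <- c(1, 2, 3, 4, 5, 105)   # add an extreme score
mean(x_wild)                      # the mean is pulled up to 20
var(x_wild)                       # the variance "explodes": 8680 / (6 - 1) = 1736
```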
The population variance is found by: (a) squaring the distance that each score falls from the mean, (b) summing these squares, and (c) finding the average of the result. The formula for the population variance is:
\[ \begin{equation} \sigma_X^2 = \frac {\sum {(X-\mu)^2}} {N} \tag{5-3} \end{equation} \]
The square of the small Greek letter sigma (i.e., \(\sigma^2\)) is the symbol used to denote the population variance. Note that σ is a small sigma and that a capital sigma, Σ, represents the summation operation. Note also that equation 5-3 contains N in the denominator. If we consider the scores 1, 2, 3, 4, and 5 to represent a population of scores, then \(\mu = 3\) and \(N = 5\). We can find the population variance using the sum of squares from above to be:
\[ \sigma^2 = \frac {10} 5 = 2 \]
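Base R has no built-in population-variance function, but equation 5-3 is a one-line calculation:

```r
x <- c(1, 2, 3, 4, 5)
sum((x - mean(x))^2) / length(x)   # population variance: 10 / 5 = 2
```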
Another measure of dispersion commonly used by researchers is the standard deviation (SD). The standard deviation is the square root of the variance. The variance can be thought of as the average squared distance of each score from its mean, and the standard deviation can be thought of as a measure of the average distance from the mean.
The average distance from the mean is actually always zero because the sum of the deviation scores always equals zero; that is, \(\sum(X - M) = 0\). (This fact will be illustrated when we discuss skewness.) When we find the variance, however, we square \((X - M)\) and remove the influence of the sign of the difference.
There is a little-used statistic often called Mean Absolute Deviation (sometimes called Mean Deviation) that is calculated as the average distance from the mean, where distance is always positive regardless of whether scores are above or below the mean. The Mean Absolute Deviation (MAD) is calculated as:
\[ \begin{equation} MAD = \frac {\sum {|X-M|}} n \tag{5-4} \end{equation} \] where |X – M| is the "absolute value" of the difference (i.e., the sign is removed). The MAD is not used frequently because the mathematics of absolute values is more limited than the mathematics of squared values. Therefore, the variance and its square root, the standard deviation, are used most frequently.
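Equation 5-4 is also a one-liner in R. One caution: R's built-in mad() function computes the median absolute deviation (rescaled for normal data), which is a different statistic, so we compute the mean absolute deviation directly:

```r
x <- c(1, 2, 3, 4, 5)
mean(abs(x - mean(x)))      # MAD = (2 + 1 + 0 + 1 + 2) / 5 = 1.2
```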
Mathematicians prefer to use just one letter as variables in formulas for a variety of reasons. Because of this, the most common abbreviation for the sample standard deviation is s and not SD. However, APA prefers research reports to use SD as the abbreviation for standard deviation. The formula for the sample standard deviation is:
\[ \begin{equation} s=\sqrt{ \frac {\sum(X-M)^2} {n-1}} = \sqrt{s^2} \tag{5-5} \end{equation} \] For the set of scores 1, 2, 3, 4, and 5, the sample standard deviation can be found by taking the square root of the sample variance as:
\[ s=\sqrt{ \frac {10} 4} = \sqrt{2.5} = 1.58 \] For the data set containing the extreme score (1, 2, 3, 4, 5, and 105), the sample standard deviation is found as:
\[ s=\sqrt{ \frac {8680} 5} = \sqrt{1736} = 41.67 \] Since the standard deviation is directly related to the variance, it too should be used routinely only with symmetric data. But truthfully, it is used frequently regardless of the shape of the data.
The small Greek letter σ is the symbol used to denote the population standard deviation. The formula for the population standard deviation is:
\[ \begin{equation} \sigma=\sqrt{ \frac {\sum(X-\mu)^2} {N}} = \sqrt{\sigma^2} \tag{5-6} \end{equation} \] If the scores 1, 2, 3, 4, and 5 represent a population, the population standard deviation is found as:
\[ \sigma=\sqrt{ \frac {10} 5} = \sqrt{2} = 1.41 \]
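In R, sd() computes the sample standard deviation (the n - 1 version); the population version can be computed directly from equation 5-6:

```r
x <- c(1, 2, 3, 4, 5)
sd(x)                                    # sample SD: sqrt(2.5) = 1.58
sd(c(x, 105))                            # with the extreme score: 41.67
sqrt(sum((x - mean(x))^2) / length(x))   # population SD: sqrt(2) = 1.41
```

The jmv::descriptives() call that follows produces the output (figure 5b) for our experimental data.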
jmv::descriptives(
data = data,
vars = vars(CHOL1, CHOL2, CHOL1plus25),
variance = TRUE,
iqr = TRUE,
range = TRUE,
pc = TRUE)
DESCRIPTIVES
Descriptives
─────────────────────────────────────────────────────────
CHOL1 CHOL2 CHOL1plus25
─────────────────────────────────────────────────────────
N 20 20 20
Missing 0 0 0
Mean 227.90 219.40 252.90
Median 217.50 221.50 242.50
Standard deviation 45.029 45.856 45.029
Variance 2027.6 2102.8 2027.6
IQR 56.750 71.250 56.750
Range 180.00 174.00 180.00
Minimum 150.00 146.00 175.00
Maximum 330.00 320.00 355.00
25th percentile 198.00 183.75 223.00
50th percentile 217.50 221.50 242.50
75th percentile 254.75 255.00 279.75
─────────────────────────────────────────────────────────
Notes:
* CHOL1 = the Cholesterol scores for the Yes-Pill treatment group
* CHOL2 = the Cholesterol scores for the No-Pill treatment group
* CHOL1plus25 = the Cholesterol scores for the Yes-Pill treatment group plus 25 (CHOL1 + 25)
* Percentile 25 = the 1st Quartile (Q1)
* Percentile 50 = the 2nd Quartile (Q2) = Median
* Percentile 75 = the 3rd Quartile (Q3)
* Interquartile Range = IQR = Q3 – Q1
* IQR for CHOL1 = 254.75 – 198.00 = 56.75
* IQR for CHOL2 = 255.00 – 183.75 = 71.25
* IQR for CHOL1plus25 = 279.75 – 223.00 = 56.75
* Compare the results for Range and IQR for CHOL1 and CHOL1plus25 to see the impact of adding a constant to all scores
Looking at Figure 5c, you see that the variance for both CHOL1 and CHOL1plus25 is 2027.57, and the standard deviation is 45.0285.
The variance for CHOL2 is 2102.78, and the standard deviation is 45.8561. (We will discuss much of the rest of this output as we continue through this chapter.)
The output in figures 5b and 5c is instructive with respect to the measures of variability in an experiment. One result that we find is that the measures of variability did not change when we added 25 points to the cholesterol scores in the Yes-Pill group. That is, no differences were found between the measures of variability for CHOL1 and 25+CHOL1. A second result is that very small differences were found between the measures of variability for the Yes-Pill group and those found for the No-Pill group.
The first result is instructive because it represents a situation in an ideal experiment. In an ideal experiment, the only difference between the scores from the two groups involved is the treatment effect, represented by the constant 25 points. In this ideal experiment, the scores in the no-treatment group (usually referred to as the control group) are represented by the CHOL1 scores, and the scores in the treatment group (usually referred to as the experimental group) are represented by the 25+CHOL1 scores.
Based on the results shown in figures 5b and 5c, the two groups differed on the measures of central tendency but not on the measures of variability. In an experiment, the only reason we expect to find differences between the measures of central tendency of the groups is because of the treatment effect, and the reason we do not expect to find differences in variability is because we have randomly placed the units into the treatments.
The second result is instructive because it represents actual results from data on units (subjects) who were randomly sampled and randomly placed into treatment groups. This data, therefore, indicates how much of a difference you can expect between the measures of variability because of chance.
It also indicates how much of a chance difference you can expect between the measures of central tendency, because the original data (the CHOL1 and CHOL2 scores) was chosen so that there was no treatment effect. Therefore, the differences between the measures of central tendency on these variables reflect only chance differences.
If you add a constant (such as 25) to each of the scores in a data set, the resultant measures of central tendency will each be changed by the magnitude of the constant (remember the constant can be a negative number). We saw this result when the constant 25 was added to each cholesterol score in the Yes-Pill group. In figure 5b, the median increased by 25 points from 217.5 to 242.5; and in figure 5c, the mean increased by 25 points from 227.9 to 252.9. Also, since 210 is the mode (see chapter 4), the most frequently occurring score would change by 25 points, for a mode of 235.
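This effect of adding a constant is easy to demonstrate in R, using the chol1 vector defined in the earlier sketch:

```r
chol1_plus25 <- chol1 + 25            # add the constant "treatment effect"

mean(chol1);   mean(chol1_plus25)     # 227.9 vs 252.9: shifted by 25
median(chol1); median(chol1_plus25)   # 217.5 vs 242.5: shifted by 25
sd(chol1);     sd(chol1_plus25)       # both 45.029: unchanged
var(chol1);    var(chol1_plus25)      # both 2027.6: unchanged
```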
In experiments, researchers use the mean as the measure of central tendency and the variance and standard deviation as the measures of variability. These measures are strongly affected by extreme scores. Given a random sample of units and random assignment of these units to treatments, researchers do not expect to find many extreme scores (also referred to as wild points or outliers) in their treatments. This is one reason that the mean, variance, and standard deviation can be used in experiments. In chapter 10 we will consider other reasons.
In figure 5c, you can see another measure of variation known as the coefficient of variation. The coefficient of variation is found as the ratio of the standard deviation to the mean. This coefficient is useful as a measure of variability in a series of experiments where the standard deviation and the mean tend to change together. The coefficient of variation is used in an experiment as a benchmark. Comparing data to this benchmark can indicate if something unusual has happened in a treatment.
For example, Snedecor and Cochran (1967, p. 63) report that in corn variety trials, "although mean yield and standard deviation vary with location and season, yet the coefficient of variation is often between 5% and 15%." A coefficient of variation outside of known limits might cause a data analyst to look for an outlier or for an unexpected treatment effect.
The coefficient of variation is a measure of variation that could vary across groups in an experiment (because it is a function of the mean), but it is a measure that is expected to remain within a range of values that are known from past research.
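As a sketch, the coefficient of variation for the Yes-Pill cholesterol scores can be computed directly from its definition, again using the chol1 vector defined earlier:

```r
cv <- sd(chol1) / mean(chol1)   # 45.029 / 227.9
cv                              # about 0.198, i.e., roughly 20%
```

The jmv::descriptives() call below requests additional statistics, including skewness and kurtosis, for the experimental data.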
jmv::descriptives(
data = data,
vars = vars(CHOL1, CHOL2, CHOL1plus25),
variance = TRUE,
range = TRUE,
se = TRUE,
ci = TRUE,
iqr = TRUE,
skew = TRUE,
kurt = TRUE)
DESCRIPTIVES
Descriptives
─────────────────────────────────────────────────────────────────
CHOL1 CHOL2 CHOL1plus25
─────────────────────────────────────────────────────────────────
N 20 20 20
Missing 0 0 0
Mean 227.90 219.40 252.90
Std. error mean 10.069 10.254 10.069
95% CI mean lower bound 206.83 197.94 231.83
95% CI mean upper bound 248.97 240.86 273.97
Median 217.50 221.50 242.50
Standard deviation 45.029 45.856 45.029
Variance 2027.6 2102.8 2027.6
IQR 56.750 71.250 56.750
Range 180.00 174.00 180.00
Minimum 150.00 146.00 175.00
Maximum 330.00 320.00 355.00
Skewness 0.35438 0.21608 0.35438
Std. error skewness 0.51210 0.51210 0.51210
Kurtosis 0.11887 -0.48332 0.11887
Std. error kurtosis 0.99238 0.99238 0.99238
─────────────────────────────────────────────────────────────────
Note. The CI of the mean assumes sample means follow a
t-distribution with N - 1 degrees of freedom
Notes:
* Standard Error (Std. Error) and Confidence Intervals are discussed in Chapter 10
* Compare results for CHOL1 and CHOL1plus25 to see the effect of a constant treatment effect on the mean, variance, and standard deviation (that is, see what happens when you add a constant to a set of scores)
* There are a number of other summary statistics, but they are not all reported by all computer programs. For example:
    * Coefficient of Variation (CV), where CV = SD/M; while there is no standard or expectation, it is relatively common to find that the SD is about one-fifth as large as its M (i.e., the mean about 5 times the SD), which works out to a CV of about 0.20
    * Geometric Mean (the mean we report is called the Arithmetic Mean)
    * Harmonic Mean
This section considers measures of a distribution's skewness and kurtosis. As with the measures of variability (because of the arithmetic operations involved), the measures of skewness and kurtosis yield more meaningful values when they are based on interval or ratio scaled data. Along with the measures of central tendency and variability, these measures provide us with a mental picture of a variable's distribution.
The skewness of a distribution of scores is a measure of where on the x-axis the scores bunch together; that is, skewness is a measure of how much a distribution of scores deviates from being symmetrical. In figure 5d, several symmetric distributions are illustrated; and in figure 5e, two skewed distributions are illustrated.
The measure of skewness used here is based on the average of the cubed distances from the mean, that is,
\[ \begin{equation} m_3 = \frac {\sum (X-\mu)^3} N \tag{5-7} \end{equation} \] This average is known as the third moment about the mean and is denoted by \(m_3\) (m for moment, 3 for third). When a score is quite far from the mean (that is, far from the bunch of other scores), its contribution to the third moment about the mean is substantial and is reflected in the sign of \(m_3\).
If a score is far above the mean, then \((X – \mu)\) will be positive and, when cubed, will make a substantial contribution to \(m_3\). The result will be a positive value for the coefficient of skewness. The fact that one value is far above the mean, however, indicates that most of the scores are bunched near the mean, as in figure 5e(i). Thus, a positively skewed distribution is characterized by a few scores far above the mean with the other scores bunched near the mean.
When we have a few scores far below the mean, as in figure 5e(ii), large negative values are contributed to \(m^3\) (since \((X – \mu)^3\) is negative), yielding a negative coefficient of skewness. Therefore, a negatively skewed distribution is characterized by a few scores far below the mean with the majority of scores bunched near the mean.
You can easily identify whether a distribution is positively or negatively skewed by remembering that the tail of a positively skewed distribution points in the direction of the positive x-axis, and the tail of a negatively skewed distribution points in the direction of the negative x-axis.
A measure of skewness is found by calculating:
\[ \begin{equation} m_2 = \frac {\sum {(X-\mu)^2}} N \tag{5-8} \end{equation} \] and
\[ \begin{equation} m_3 = \frac {\sum {(X-\mu)^3}} N \tag{5-9} \end{equation} \] and then finding
\[ \begin{equation} b_1 = \frac {m_3} {m_2 \sqrt{m_2}} \tag{5-10} \end{equation} \] In this equation, \(m_2\) is the second moment about the mean, \(m_3\) is the third moment about the mean, and \(b_1\) is called the coefficient of skewness. When the coefficient of skewness is based on a sample of observations, \(\mu\) is replaced by M and N is replaced by n.
Values of \(b_1\) that are at or near zero indicate a symmetrical distribution; negative values indicate a negatively skewed distribution; and positive values indicate a positively skewed distribution. (Note that the second moment about the mean is the population variance.)
As you may have observed in figure 5c, programs report the coefficient of skewness as part of the descriptive statistics output. Using transformations, however, we can gain further insight into the meaning of the coefficient of skewness while avoiding the tedious computations that are involved.
For this discussion, we will use the new data set found in figure 5f. This data set contains the ratings, X RATING and Y RATING, that two student teachers received from 25 other student teachers for their presentation of a lesson on division to third graders. The ratings are found in columns 1 and 2. On the rating scale, 20 was the highest possible rating and 0 was the lowest possible rating.
The means for X RATING and Y RATING were obtained, and then deviation scores were calculated using a transformation procedure to subtract the mean from the ratings. These results were stored in variables 3 and 4 with the labels \((X - M_X)\) and \((Y - M_Y)\). Note that the subscripts for M indicate which variable's mean it is (e.g., \(M_X\) is the mean for X RATING).
A transformation procedure was used to obtain the second and third powers of these differences. The differences raised to powers are labeled \((X - M_X)^2\), \((Y - M_Y)^2\), \((X - M_X)^3\), and \((Y - M_Y)^3\). Here, for example, \((X - M_X)^2\) represents the transformation of squaring each deviation, where \(M_X\) is the mean of the X RATING scores.
In figure 5f, the X RATING scores have a positively skewed distribution and the Y RATING scores have a negatively skewed distribution. These distributions are shown in figure 5g, X RATING, and figure 5gi, Y RATING, using histogram output. Note the large positive values the scores 16, 17, 18, and 19 contribute to \((X – M_X)^3\) and the large negative values that the scores 14, 12, and 10 contribute to the \((Y – M_Y)^3\). These large values dictate that the coefficient of skewness will be positive for the X RATING scores and negative for the Y RATING scores.
The values of \(m_2\) and \(m_3\) are then found as the means of columns 5, 6, 7, and 8, using the descriptive statistics output. The means of the columns in figure 5f are displayed in figure 5i. Using the means from figure 5i, we can calculate the coefficients of skewness as follows:
\[ b_1(\text{X RATING})=\frac{17.3610}{5.9936\sqrt{5.9936}}=1.18 \] \[ b_1(\text{Y RATING})=\frac{-18.5614}{5.4944\sqrt{5.4944}}=-1.44 \]
In figure 5i, note that, as was indicated earlier, the sums of the deviation scores \((X - M_X)\) and \((Y - M_Y)\) are zero.
Kurtosis is a measure of the “peakedness” of a set of scores. It is measured using the fourth moment about the mean, that is, \(\sum(X – \mu)^4/N\). The concept of kurtosis is used appropriately only when a distribution has a single mode; that is, it is unimodal. Figure 5j illustrates three unimodal distributions that have different degrees of kurtosis.
The normal distribution, shown in Figure 5j(i), is called mesokurtic and is described as having no kurtosis. The value of the coefficient of kurtosis of the normal distribution is 3; this value is used as a benchmark against which the kurtosis of other distributions is compared. The normal distribution is an important distribution that we will discuss further in chapter 9.
Notes:
* X RATING = Positively skewed student teacher ratings
* Y RATING = Negatively skewed student teacher ratings
* XDEV, YDEV = Deviation scores
* XDEV2, YDEV2 = Deviation scores squared (i.e., to the second power)
* XDEV3, YDEV3 = Deviation scores cubed (i.e., to the third power)
A distribution having a coefficient of kurtosis larger than 3, as does the distribution in Figure 5j(ii), is called leptokurtic. A leptokurtic distribution is one that has a high peak. A coefficient of kurtosis that is less than 3 is indicative of a flat distribution, which is called platykurtic; that is, a distribution that has very little peak. A platykurtic distribution is shown in Figure 5j(iii).
It should be noted that most programs now convert the kurtosis statistic (reporting what is often called excess kurtosis, \(b_2 - 3\)) so that normal is 0: a negative value represents negative kurtosis (i.e., platykurtic), and a value larger than zero represents positive kurtosis (i.e., leptokurtic). Most functions in R have options for how to report these statistics.
The coefficient of kurtosis is found by calculating:
\[ \begin{equation} m_2 = \frac {\sum (X-\mu)^2} N \tag{5-11} \end{equation} \] and
\[ \begin{equation} m_4 = \frac {\sum (X-\mu)^4} N \tag{5-12} \end{equation} \] and then computing:
\[ \begin{equation} b_2 = \frac {m_4} {(m_2)^2} \tag{5-13} \end{equation} \] In this equation, \(b_2\) is called the coefficient of kurtosis. As was indicated earlier, the normal distribution is the benchmark against which the kurtosis of other distributions is measured. In the above formulas, \(b_2\) is 3 for the normal distribution.
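The moment-based coefficients are straightforward to compute in R. The helper functions below are our own (base R has no built-in skewness or kurtosis functions); note that jamovi reports sample-adjusted versions of these coefficients, so its values (e.g., skewness = 0.35438 for CHOL1) differ slightly from the raw moment values:

```r
# k-th moment about the mean (equations 5-8, 5-9, 5-11, 5-12, with M for mu)
moment <- function(x, k) mean((x - mean(x))^k)

# Coefficient of skewness, equation 5-10
b1 <- function(x) moment(x, 3) / (moment(x, 2) * sqrt(moment(x, 2)))

# Coefficient of kurtosis, equation 5-13
b2 <- function(x) moment(x, 4) / moment(x, 2)^2

b1(chol1)        # about 0.327: slight positive skew
b2(chol1)        # about 2.805: close to the normal benchmark of 3
b2(chol1) - 3    # the "excess kurtosis" most programs report, near 0
```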
As with the coefficient of skewness, statistics programs report the coefficient of kurtosis when requested. Transformations allow us to examine the meaning of kurtosis more closely.
Figure 5k presents sample data from three different distributions. The data in the column labeled LEPTO X is used to illustrate a leptokurtic distribution.
The data in variable 2 represent a sample from a uniform population distribution, or a rectangular distribution. The mean of this uniform distribution is 25, the minimum score is 18, and the maximum score is 32. The output refers to the minimum score as the lower boundary and the maximum score as the upper boundary. These PLATY Y scores were generated to represent a platykurtic distribution. The data in variable 3 represents a sample from a normal population distribution with a mean of 25 and a standard deviation of 5.
A uniform distribution has no mode, since each score on the x-axis has the same frequency (see figure 5d(ii)); therefore, it would not be appropriate to find its kurtosis. A small sample from a uniform distribution, however, provides a good example of a platykurtic distribution.
The histograms for LEPTO X, PLATY Y, and NORMAL Z are shown in figures 5l, 5m, and 5n, respectively. Remember, this data represents samples from populations, so its distribution will usually look only similar to the parent population distribution from which it was sampled. Figure 5o shows the coefficients of kurtosis for this data. The LEPTO X scores have a coefficient of kurtosis of 5.30375, which is larger than 3 and indicates a leptokurtic distribution. The PLATY Y scores have a coefficient of kurtosis of 1.73412, which is less than 3 and indicates a platykurtic distribution. The NORMAL Z scores have a coefficient of kurtosis of 2.79043, which is approximately equal to 3, the value for a normal distribution.
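Samples like these can be generated in R. The exact method used to create the LEPTO X scores is not shown in the text, so the heavy-tailed t distribution below is only one plausible choice (our assumption); the uniform and normal samples follow the parameters given above, with a sample size of 30 as our choice:

```r
set.seed(42)                                 # reproducibility (our choice)
platy_y  <- runif(30, min = 18, max = 32)    # uniform: platykurtic
normal_z <- rnorm(30, mean = 25, sd = 5)     # normal: mesokurtic
lepto_x  <- 25 + 5 * rt(30, df = 5)          # heavy tails: leptokurtic

# Using the b2() helper defined earlier; values near 3 are mesokurtic
sapply(list(lepto = lepto_x, platy = platy_y, normal = normal_z), b2)
```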
Now consider the skewness and kurtosis coefficients for our experimental data. Figure 5c shows that the effect of adding 25 points to the CHOL1 scores does not affect the measures of skewness and kurtosis. That is, there is no difference between the coefficients of skewness (b1 = 0.32723) and kurtosis (b2 = 2.80545) for the CHOL1 and the 25+CHOL1 scores. (Figure 5c reports sample-adjusted versions of these coefficients, skewness = 0.35438 and excess kurtosis = 0.11887, but the conclusion is the same: the values for CHOL1 and CHOL1plus25 are identical.)
Figure 5c also shows that there is little difference between the coefficients of skewness and kurtosis for the cholesterol scores in the Yes-Pill group (b1 = 0.32723; b2 = 2.80545) and the cholesterol scores in the No-Pill group (b1 = 0.19953; b2 = 2.34362).
The coefficients of skewness are near 0, and the coefficients of kurtosis are near 3. These results indicate that it is reasonable to assume that the cholesterol scores in both the Yes-Pill and No-Pill groups came from a normal distribution, that is, a distribution with no skewness (b1 = 0) and with kurtosis (b2) equal to 3.
These results are similar to what a researcher assumes is true of the scores of a group in an experiment: that the scores have been sampled from a normal distribution; that differences may be found between groups on measures of central tendency; and that differences will not be found between groups on measures of dispersion, skewness, or kurtosis.
jmv::descriptives(
data = data,
vars = X_RATING,
hist = TRUE,
n = FALSE,
missing = FALSE,
mean = FALSE,
median = FALSE,
sd = FALSE,
min = FALSE,
max = FALSE)
DESCRIPTIVES
jmv::descriptives(
data = data,
vars = Y_RATING,
hist = TRUE,
n = FALSE,
missing = FALSE,
mean = FALSE,
median = FALSE,
sd = FALSE,
min = FALSE,
max = FALSE)
DESCRIPTIVES
jmv::descriptives(
data = data,
vars = vars(X_RATING, Y_RATING),
box = TRUE,
mean = FALSE,
sd = FALSE,
range = TRUE,
iqr = TRUE,
skew = TRUE,
kurt = TRUE,
pc = TRUE)
DESCRIPTIVES
Descriptives
───────────────────────────────────────────────
X_RATING Y_RATING
───────────────────────────────────────────────
N 25 25
Missing 0 0
Median 12.000 18.000
IQR 3.0000 3.0000
Range 10.000 10.000
Minimum 10.000 10.000
Maximum 20.000 20.000
Skewness 1.2601 -1.5349
Std. error skewness 0.46368 0.46368
Kurtosis 1.3709 2.5750
Std. error kurtosis 0.90172 0.90172
25th percentile 11.000 16.000
50th percentile 12.000 18.000
75th percentile 14.000 19.000
───────────────────────────────────────────────
jmv::descriptives(
data = data,
vars = vars(X_RATING, XDEV, XDEV2, XDEV3, Y_RATING, YDEV, YDEV2, YDEV3),
desc = "rows",
n = FALSE,
missing = FALSE,
variance = TRUE,
range = TRUE,
min = FALSE,
max = FALSE,
skew = TRUE,
kurt = TRUE)
DESCRIPTIVES
Descriptives
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Mean Median SD Variance Range Skewness SE Kurtosis SE
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
X_RATING 13.0800 12.00000 2.4987 6.2433 10.000 1.2601 0.46368 1.3709 0.90172
XDEV -4.6491e-17 -1.08000 2.4987 6.2433 10.000 1.2601 0.46368 1.3709 0.90172
XDEV2 5.9936 1.16640 10.3828 107.8032 47.880 3.1767 0.46368 11.3631 0.90172
XDEV3 17.3610 -1.25970 71.5540 5119.9696 360.592 3.9222 0.46368 16.6896 0.90172
Y_RATING 17.1600 18.00000 2.3923 5.7233 10.000 -1.5349 0.46368 2.5750 0.90172
YDEV -1.3080e-17 0.84000 2.3923 5.7233 10.000 -1.5349 0.46368 2.5750 0.90172
YDEV2 5.4944 1.34560 11.0135 121.2982 51.240 3.5212 0.46368 13.3069 0.90172
YDEV3 -18.5614 0.59270 78.3649 6141.0638 389.968 -4.1116 0.46368 17.8170 0.90172
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Notes:
* Compare the mean of XDEV3 to the value of \(m_3\) computed from equation 5-7 for the positively skewed X RATING distribution (this value appears in the calculation of equation 5-10)
* Compare the mean of YDEV3 to the value of \(m_3\) computed from equation 5-7 for the negatively skewed Y RATING distribution (this value appears in the calculation of equation 5-10)
* Compare the mean of XDEV2 to the value of \(m_2\) computed from equation 5-8 for the positively skewed X RATING distribution (this value appears in the calculation of equation 5-10)
* Compare the mean of YDEV2 to the value of \(m_2\) computed from equation 5-8 for the negatively skewed Y RATING distribution (this value appears in the calculation of equation 5-10)
jmv::descriptives(
formula = X ~ Shape,
data = data,
hist = TRUE,
n = FALSE,
missing = FALSE,
mean = FALSE,
median = FALSE,
sd = FALSE,
min = FALSE,
max = FALSE)
DESCRIPTIVES
jmv::descriptives(
formula = X ~ Shape,
data = data,
box = TRUE,
dot = TRUE,
boxMean = TRUE,
n = FALSE,
missing = FALSE,
mean = FALSE,
median = FALSE,
sd = FALSE,
min = FALSE,
max = FALSE)
DESCRIPTIVES
jmv::descriptives(
data = data,
vars = vars(LEPTO_X, PLATY_Y, NORMAL_Z),
n = FALSE,
missing = FALSE,
variance = TRUE,
range = TRUE,
se = TRUE,
ci = TRUE,
iqr = TRUE,
skew = TRUE,
kurt = TRUE)
DESCRIPTIVES
Descriptives
───────────────────────────────────────────────────────────────
LEPTO_X PLATY_Y NORMAL_Z
───────────────────────────────────────────────────────────────
Mean 24.833 25.033 24.950
Std. error mean 1.2356 0.82278 0.97634
95% CI mean lower bound 22.306 23.351 22.953
95% CI mean upper bound 27.360 26.716 26.947
Median 25.000 25.000 26.300
Standard deviation 6.7675 4.5066 5.3476
Variance 45.799 20.309 28.597
IQR 1.0000 8.0000 7.8000
Range 35.000 14.000 22.200
Minimum 8.0000 18.000 11.800
Maximum 43.000 32.000 34.000
Skewness 0.18801 0.034757 -0.61889
Std. error skewness 0.42689 0.42689 0.42689
Kurtosis 2.9697 -1.2741 -0.020135
Std. error kurtosis 0.83275 0.83275 0.83275
───────────────────────────────────────────────────────────────
Note. The CI of the mean assumes sample means follow a
t-distribution with N - 1 degrees of freedom
Notes:
* The skewness for each variable is essentially equal to zero because these samples were drawn from symmetric population distributions
* Most programs transform the kurtosis statistic such that leptokurtic data have kurtosis statistics above zero (see LEPTO_X with Kurtosis = 2.970, which is > 0)
* Most programs transform the kurtosis statistic such that platykurtic data have kurtosis statistics below zero (see PLATY_Y with Kurtosis = -1.274, which is < 0)
* Most programs transform the kurtosis statistic such that mesokurtic data have kurtosis statistics relatively near zero (see NORMAL_Z with Kurtosis = -0.020, which is approximately 0; don't be fooled by the fact that it's negative, it's very close to zero)
DESCRIPTIVES
Descriptives
───────────────────────────────────────────────────────────────────────────────────────────────────────
N Median IQR Skewness SE Kurtosis SE W p
───────────────────────────────────────────────────────────────────────────────────────────────────────
x 1000 75.287 13.807 -0.077413 0.077344 0.024702 0.15453 0.99825 0.40537
───────────────────────────────────────────────────────────────────────────────────────────────────────
DESCRIPTIVES
Descriptives
───────────────────────────────────────────────────────────────────────────────────────────────────────
N Median IQR Skewness SE Kurtosis SE W p
───────────────────────────────────────────────────────────────────────────────────────────────────────
x 1000 74.929 17.153 0.048441 0.077344 -1.1617 0.15453 0.95694 < .00001
───────────────────────────────────────────────────────────────────────────────────────────────────────
DESCRIPTIVES
Descriptives
───────────────────────────────────────────────────────────────────────────────────────────────────────
N Median IQR Skewness SE Kurtosis SE W p
───────────────────────────────────────────────────────────────────────────────────────────────────────
x 1000 73.915 14.320 0.59242 0.077344 -0.12905 0.15453 0.96652 < .00001
───────────────────────────────────────────────────────────────────────────────────────────────────────
DESCRIPTIVES
Descriptives
───────────────────────────────────────────────────────────────────────────────────────────────────────
N Median IQR Skewness SE Kurtosis SE W p
───────────────────────────────────────────────────────────────────────────────────────────────────────
x 1000 76.085 14.320 -0.59242 0.077344 -0.12905 0.15453 0.96652 < .00001
───────────────────────────────────────────────────────────────────────────────────────────────────────
DESCRIPTIVES
Descriptives
────────────────────────────────────────────────────────────────────────────────────────────────────────
N Median IQR Skewness SE Kurtosis SE W p
────────────────────────────────────────────────────────────────────────────────────────────────────────
x 1000 75.472 10.661 -0.075346 0.077344 3.2206 0.15453 0.96382 < .00001
────────────────────────────────────────────────────────────────────────────────────────────────────────
DESCRIPTIVES
Descriptives
────────────────────────────────────────────────────────────────────────────────────────────────────────
N Median IQR Skewness SE Kurtosis SE W p
────────────────────────────────────────────────────────────────────────────────────────────────────────
x 1000 75.466 15.432 -0.068537 0.077344 -0.85946 0.15453 0.98224 < .00001
────────────────────────────────────────────────────────────────────────────────────────────────────────
DESCRIPTIVES
Descriptives
───────────────────────────────────────────────────────────────────────────────────────────────────────
N Median IQR Skewness SE Kurtosis SE W p
───────────────────────────────────────────────────────────────────────────────────────────────────────
x 1000 74.422 23.351 0.022198 0.077344 -0.90470 0.15453 0.98000 < .00001
───────────────────────────────────────────────────────────────────────────────────────────────────────
This chapter explained how to calculate the measures of dispersion, skewness, and kurtosis by hand. These measures are generally used with interval or ratio scaled data. The addition of a constant (that is, a treatment effect) to the scores on a variable does not change the measures of the variable’s dispersion (except the coefficient of variation), skewness, or kurtosis.
When considering the measures of skewness, the coefficient of skewness, b1, is negative for negatively skewed distributions, equal to zero for symmetric distributions, and positive for positively skewed distributions. When considering the measures of kurtosis, the coefficient of kurtosis, b2, is equal to 3 for normal distributions, less than 3 for flat distributions (platykurtic), and greater than 3 for peaked distributions (leptokurtic). Recall that most statistics programs convert the kurtosis values such that normal is 0, platykurtic is negative, and leptokurtic is positive.
See Chapter 4 Appendix A
SECTION 13: Making Sense of Descriptive Results
Using ALL output created in previous sections, respond to the following items
For SCALE variable Y report and interpret your descriptive results (from the analyses above) using APA Publication Style 6th edition in a way that describes your data well enough that someone can “visualize” what your data distribution looks like without a graph (i.e., based on your words). Include at least two paragraphs of text as well as “really useful” tables and graphs. Use demographic breakdowns of W for some results about Y. Feel free to run additional analyses on the variable in order to get additional information if you think it is necessary (if you do, include those analyses here). Include a discussion of whether you see any univariate outliers in the variable.
For the pair of SCALE variables Y and X report and interpret your bivariate descriptive and correlational results (from the analyses above) using APA Publication Style 6th edition in a way that describes relationship well enough that someone can “visualize” what your bivariate data distribution and what the relationship between the two variables looks like without a graph (i.e., based on your words). Include at least two paragraphs of text as well as “really useful” tables and graphs. Feel free to run additional analyses in order to get additional information if you think it is necessary (if you do, include those analyses here). Include a discussion of whether you see any bivariate outliers in the variables.
Explain the Research Design and Statistical Analysis terms below BRIEFLY but SUFFICIENTLY and IN YOUR OWN WORDS (don’t just give another name for them). Some may require finding additional readings. If you use resources, paraphrase in your own words AND provide a citation of the resource you used (including page numbers).