In chapter 4, we discussed measures of central tendency (the mean, mode, and median) and considered how they may reflect a treatment effect (which in the Pill Consumption experiment was the effect of taking herbal supplement pills).
In this chapter, we will examine measures of variability which indicate the amount of spread in a data set. We will learn how to calculate four measures of variability: the range, interquartile range, variance, and standard deviation. Then we will consider the effect of treatments on these measures. We will also examine different data distributions, as defined by the measures of skewness and kurtosis.
Your R objective in this chapter is to calculate (a) the range, (b) interquartile range, (c) variance, and (d) standard deviation.
In this discussion, we will use the Pill Consumption data from chapter 4, with one change. We have created a new variable by adding 25 points to each cholesterol score in the Yes-Pill group to simulate what would happen if the effect of taking the herbal supplement pill caused each woman’s score to increase by 25 points. In our discussions of this data, we will examine the effect this addition has on the measures of central tendency and the measures of variability.
Figure 5a shows the herbal supplement pill data set with the new column labeled 25+CHOL1. As you remember from chapter 4, the labels used to identify the data in Figure 5a are: ID, identification number; HT, height; and CHOL, cholesterol level. The number in each label identifies the group (1 = Yes-Pill, 2 = No-Pill). Thus ID1 indicates the identification numbers of the subjects in group 1.
In this section, we will consider the measures of variability, or measures of dispersion. We will define each measure, illustrate how to calculate it by hand, and then consider how each measure reflects the dispersion in our experimental data. Because of the arithmetic operations involved, all of the measures of variability that we will discuss are most meaningful when based on data that is measured on an interval or ratio scale. The most frequently used measures of variability are: (a) the range, (b) the interquartile range, (c) the variance, and (d) the standard deviation.
The simplest measure of how spread out the scores are is the range. The range is found by subtracting the smallest score from the largest score. The range is not frequently used as a basis for making decisions about score variability, however, because the range is unstable from one sample to the next. That is, the sample range cannot be counted on to give a good estimate of the population range. For example, in constructing a histogram for column 3, CHOL 1, in Chapter 4, we found the range to be 180; but in a second sample (not shown) from the same population, the smallest score was 50 and the largest score was 300, yielding a range of 250. We will consider the instability of the sample range in more detail in chapter 10.
Another measure of range that is sometimes used is the inclusive range, which is the difference between the highest and lowest scores plus one. For example, if you read pages 6-11 of a book and then report the (exclusive) range of the pages read (the "scores"), it is 5 (that is, 11 – 6 = 5). The inclusive range, however, would be 6 (that is, 11 – 6 + 1 = 6), which perhaps more accurately represents the spread of the scores in the data (i.e., the pages you read).
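We can check these calculations in R. Here is a minimal sketch using the page-reading example (the vector name pages is ours):

```r
pages <- 6:11               # the pages read: 6, 7, 8, 9, 10, 11
diff(range(pages))          # exclusive range: 11 - 6 = 5
diff(range(pages)) + 1      # inclusive range: 6
```

Note that R's range() returns the minimum and maximum as a pair, so diff() of that pair gives the range as defined above.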
To overcome the lack of stability of the range, statisticians use a measure that does not depend on the highest and lowest scores of a sample. This measure, known as the interquartile range, is based on the difference between the third quartile and the first quartile. The first quartile is the same as the 25th percentile and is usually denoted by Q1. The second quartile, the median, is the 50th percentile and is denoted by Q2. The third quartile is the 75th percentile and is denoted by Q3. The interquartile range is usually denoted by the letter Q or the abbreviation IQR, such that:
\[ \begin{equation} IQR = Q_3 - Q_1 = 75\text{th percentile} - 25\text{th percentile} \tag{5-1} \end{equation} \] The first quartile (Q1) is the median of the scores from the lowest score to the first score below the median of all of the scores. The third quartile (Q3) is the median of the scores from the first score above the median of all of the scores to the largest score. For example, if we rank order the CHOL scores for the Yes-Pill data, we have:
\[ \begin{align} & 150 \\ & 158 \\ & 179 \\ & 195 \\ & 198 \\ & \text {<--- [(198+198)/2] = 198.0 = first quartile = Q1 = 25th percentile} \\ & 198 \\ & 210 \\ & 210 \\ & 210 \\ & 215 \\ & \text {<--- [(215+220)/2] = 217.5 = Median = second quartile = Q2 = 50th percentile} \\ & 220 \\ & 243 \\ & 247 \\ & 250 \\ & 253 \\ & \text {<--- [(253+260)/2] = 256.5 = third quartile = Q3 = 75th percentile} \\ & 260 \\ & 263 \\ & 272 \\ & 297 \\ & 330 \end{align} \]
There are 10 scores from the lowest score, 150, up to 215, the score below the median. Therefore, the first quartile (Q1), which is the median of these 10 scores, is 198. Similarly, the third quartile (Q3) is 256.5, which is the median of the 10 scores from 220, the score just above the median, to the largest score, 330. The interquartile range (IQR) is then found as:
\[ IQR = Q3 – Q1 = 256.5 – 198 = 58.5 \] Thus, there is a spread of 58.5 points within the “heart” of this distribution of scores. Because the interquartile range is based on the difference between quartiles instead of end scores, it is relatively stable from sample to sample. It is also not strongly influenced by extreme points. For these reasons, the interquartile range should be routinely used with data that is skewed (that is, non-symmetric).
Equation (4.1) will not generally yield the 25th and 75th percentile ranks for the scores at Q1 and Q3. For example, using equation (4.1) the percentile rank for Q3 = 256.5 is found to be 77.5%. This is because equation (4.1) assumes you are dealing with a continuous distribution; and in this example, we have a discrete set of scores. In practice, this conflict is only a minor nuisance.
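A related practical point is that statistical programs compute quartiles in slightly different ways, so their results may differ a little from the hand calculation above. For example, R's default quantile algorithm yields Q3 = 254.75 rather than 256.5 for these scores, which matches the jamovi output shown later in this chapter (IQR = 56.75 rather than 58.5). A minimal sketch, using the 20 Yes-Pill CHOL scores listed above (the vector name chol1 is ours):

```r
# The 20 Yes-Pill cholesterol scores, in rank order (from the listing above)
chol1 <- c(150, 158, 179, 195, 198, 198, 210, 210, 210, 215,
           220, 243, 247, 250, 253, 260, 263, 272, 297, 330)

# Tukey's hinges match the hand calculation: Q1 = 198, Q3 = 256.5
fivenum(chol1)               # min, lower hinge, median, upper hinge, max

# R's default quantile algorithm (type = 7) interpolates differently:
# Q1 = 198.00, Q3 = 254.75
quantile(chol1, probs = c(0.25, 0.50, 0.75))
IQR(chol1)                   # 254.75 - 198.00 = 56.75
```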
Variance is a commonly used measure of the spread of scores.
The variance of a sample measures the spread of the scores by: (a) taking the distance that each score lies from its mean (that is, a deviation score) and squaring it, (b) summing all of these squares, and (c) finding the average (almost) of the result. “Almost” is included because the sum of the squared deviation scores is divided by (n-1) instead of by n (where n is the number of scores). The variance of a sample is calculated using (n-1) instead of n because the resultant value yields a “better” estimate of the population variance than does the value found using n. (What is meant by “better” is discussed in chapter 10.) The formula for the sample variance is:
\[ \begin{equation} s_X^2= \frac {\sum (X-M_X)^2} {(n-1)} \tag{5-2} \end{equation} \]
For example, the variance of the scores 1, 2, 3, 4, and 5 is found using the following sums:
\[ \begin{align*} & \text{Scores} && \text {Deviation } (X – M_X) &&& (X – M_X)^2 \\ & 1 && 1 – 3 = -2 &&& 4 \\ & 2 && 2 – 3 = -1 &&& 1 \\ & 3 && 3 – 3 = 0 &&& 0 \\ & 4 && 4 – 3 = 1 &&& 1 \\ & 5 && 5 – 3 = 2 &&& 4 \\ & 15 = \sum X && \sum (X – M_X) = 0 &&& \sum (X – M_X)^2 = 10 \\ & M_X = 15/5 = 3 \\ & s_X^2 = 10/(5 – 1) = 2.5 \end{align*} \] The variance of a sample, like the mean, should not be routinely used as a descriptive statistic because it is strongly affected by extreme scores. For example, consider what happens to the variance of the preceding scores when we include an additional score of 105:
\[ \begin{align*} & \text{Scores} && \text {Deviation } (X – M_X) &&& (X – M_X)^2 \\ & 1 && 1 – 20 = -19 &&& 361 \\ & 2 && 2 – 20 = -18 &&& 324 \\ & 3 && 3 – 20 = -17 &&& 289 \\ & 4 && 4 – 20 = -16 &&& 256 \\ & 5 && 5 – 20 = -15 &&& 225 \\ & 105 && 105 – 20 = 85 &&& 7225 \\ & 120 = \sum X && \sum (X – M_X) = 0 &&& \sum (X – M_X)^2 = 8680 \\ & M_X = 120/6 = 20 \\ & s_X^2 = 8680/(6 – 1) = 1736 \end{align*} \] As with the skewed distributions discussed in chapter 4, the mean was pulled in the direction of the extreme score. Also, the variance has “exploded” when compared to 2.5, since it is strongly affected by the squared mean deviation of the extreme score. For this reason, the variance is routinely used only with data that is symmetric (that is, not skewed).
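These hand calculations are easy to verify in R, whose var() function uses the (n - 1) denominator of equation 5-2:

```r
x <- c(1, 2, 3, 4, 5)
var(x)                            # sample variance: 10 / (5 - 1) = 2.5

x_wild <- c(1, 2, 3, 4, 5, 105)   # add an extreme score
mean(x_wild)                      # the mean is pulled up to 20
var(x_wild)                       # the variance "explodes": 8680 / (6 - 1) = 1736
```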
The population variance is found by: (a) squaring the distance that each score falls from the mean, (b) summing these squares, and (c) finding the average of the result. The formula for the population variance is:
\[ \begin{equation} \sigma_X^2 = \frac {\sum {(X-\mu)^2}} {N} \tag{5-3} \end{equation} \]
The square of the small Greek letter sigma (i.e., \(\sigma^2\)) is the symbol used to denote the population variance. Note that σ is a small sigma and that a capital sigma, Σ, represents the summation operation. Note also that equation 5-3 contains N in the denominator. If we consider the scores 1, 2, 3, 4, and 5 to represent a population of scores, then \(\mu = 3\) and \(N = 5\). We can find the population variance using the sum of squares from above to be:
\[ \sigma^2 = \frac {10} 5 = 2 \]
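Base R has no built-in population-variance function, but equation 5-3 is a one-line calculation:

```r
x <- c(1, 2, 3, 4, 5)
sum((x - mean(x))^2) / length(x)   # population variance: 10 / 5 = 2
```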
Another measure of dispersion commonly used by researchers is the standard deviation (SD). The standard deviation is the square root of the variance. The variance can be thought of as the average squared distance of each score from its mean, and the standard deviation can be thought of as a measure of the average distance from the mean.
The average distance from the mean is actually always zero because the sum of the deviation scores always equals zero; that is, \(\sum(X - M) = 0\). (This fact will be illustrated when we discuss skewness.) When we find the variance, however, we square \((X - M)\) and remove the influence of the sign of the difference.
There is a little-used statistic often called Mean Absolute Deviation (sometimes called Mean Deviation) that is calculated as the average distance from the mean, where distance is always positive regardless of whether scores are above or below the mean. The Mean Absolute Deviation (MAD) is calculated as:
\[ \begin{equation} MAD = \frac {\sum {|X-M|}} n \tag{5-4} \end{equation} \] where |X – M| is the "absolute value" of the difference (i.e., the sign is removed). The MAD is not used frequently because the mathematics of absolute values is more limited than the mathematics of squared values. Therefore, the variance and its square root, the standard deviation, are used most frequently.
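Equation 5-4 is also a one-liner in R. One caution: R's built-in mad() function computes the median absolute deviation (rescaled for normal data), which is a different statistic, so we compute the mean absolute deviation directly:

```r
x <- c(1, 2, 3, 4, 5)
mean(abs(x - mean(x)))      # MAD = (2 + 1 + 0 + 1 + 2) / 5 = 1.2
```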
Mathematicians prefer to use just one letter as variables in formulas for a variety of reasons. Because of this, the most common abbreviation for the sample standard deviation is s and not SD. However, APA prefers research reports to use SD as the abbreviation for standard deviation. The formula for the sample standard deviation is:
\[ \begin{equation} s=\sqrt{ \frac {\sum(X-M)^2} {n-1}} = \sqrt{s^2} \tag{5-5} \end{equation} \] For the set of scores 1, 2, 3, 4, and 5, the sample standard deviation can be found by taking the square root of the sample variance as:
\[ s=\sqrt{ \frac {10} 4} = \sqrt{2.5} = 1.58 \] For the data set containing the extreme score (1, 2, 3, 4, 5, and 105), the sample standard deviation is found as:
\[ s=\sqrt{ \frac {8680} 5} = \sqrt{1736} = 41.67 \] Since the standard deviation is directly related to the variance, it too should be used routinely only with symmetric data. But truthfully, it is used frequently regardless of the shape of the data.
The small Greek letter σ is the symbol used to denote the population standard deviation. The formula for the population standard deviation is:
\[ \begin{equation} \sigma=\sqrt{ \frac {\sum(X-\mu)^2} {N}} = \sqrt{\sigma^2} \tag{5-6} \end{equation} \] If the scores 1, 2, 3, 4, and 5 represent a population, the population standard deviation is found as:
\[ \sigma=\sqrt{ \frac {10} 5} = \sqrt{2} = 1.41 \]
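In R, sd() computes the sample standard deviation (the n - 1 version); the population version can be computed directly from equation 5-6:

```r
x <- c(1, 2, 3, 4, 5)
sd(x)                                    # sample SD: sqrt(2.5) = 1.58
sd(c(x, 105))                            # with the extreme score: 41.67
sqrt(sum((x - mean(x))^2) / length(x))   # population SD: sqrt(2) = 1.41
```

The jmv::descriptives() call that follows produces the output (figure 5b) for our experimental data.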
jmv::descriptives(
data = data,
vars = vars(CHOL1, CHOL2, CHOL1plus25),
variance = TRUE,
iqr = TRUE,
range = TRUE,
pc = TRUE)
DESCRIPTIVES
Descriptives
─────────────────────────────────────────────────────────
CHOL1 CHOL2 CHOL1plus25
─────────────────────────────────────────────────────────
N 20 20 20
Missing 0 0 0
Mean 227.90 219.40 252.90
Median 217.50 221.50 242.50
Standard deviation 45.029 45.856 45.029
Variance 2027.6 2102.8 2027.6
IQR 56.750 71.250 56.750
Range 180.00 174.00 180.00
Minimum 150.00 146.00 175.00
Maximum 330.00 320.00 355.00
25th percentile 198.00 183.75 223.00
50th percentile 217.50 221.50 242.50
75th percentile 254.75 255.00 279.75
─────────────────────────────────────────────────────────
Notes:
* CHOL1 = the Cholesterol scores for the Yes-Pill treatment group
* CHOL2 = the Cholesterol scores for the No-Pill treatment group
* CHOL1plus25 = the Cholesterol scores for the Yes-Pill treatment group plus 25 (CHOL1 + 25)
* Percentile 25 = the 1st Quartile (Q1)
* Percentile 50 = the 2nd Quartile (Q2) = Median
* Percentile 75 = the 3rd Quartile (Q3)
* Interquartile Range = IQR = Q3 – Q1
* IQR for CHOL1 = 254.75 – 198.00 = 56.75
* IQR for CHOL2 = 255.00 – 183.75 = 71.25
* IQR for CHOL1plus25 = 279.75 – 223.00 = 56.75
* Compare the results for Range and IQR for CHOL1 and CHOL1plus25 to see the impact of adding a constant to all scores
Looking at Figure 5c, you see that the variance for both CHOL1 and CHOL1plus25 is 2027.57, and the standard deviation is 45.0285.
The variance for CHOL2 is 2102.78, and the standard deviation is 45.8561. (We will discuss much of the rest of this output as we continue through this chapter.)
The output in figures 5b and 5c is instructive with respect to the measures of variability in an experiment. One result that we find is that the measures of variability did not change when we added 25 points to the cholesterol scores in the Yes-Pill group. That is, no differences were found between the measures of variability for CHOL1 and 25+CHOL1. A second result is that very small differences were found between the measures of variability for the Yes-Pill group and those found for the No-Pill group.
The first result is instructive because it represents a situation in an ideal experiment. In an ideal experiment, the only difference between the scores from the two groups involved is the treatment effect, represented by the constant 25 points. In this ideal experiment, the scores in the no-treatment group (usually referred to as the control group) are represented by the CHOL1 scores, and the scores in the treatment group (usually referred to as the experimental group) are represented by the 25+CHOL1 scores.
Based on the results shown in figures 5b and 5c, the two groups differed on the measures of central tendency but not on the measures of variability. In an experiment, the only reason we expect to find differences between the measures of central tendency of the groups is because of the treatment effect, and the reason we do not expect to find differences in variability is because we have randomly placed the units into the treatments.
The second result is instructive because it represents actual results from data on units (subjects) who were randomly sampled and randomly placed into treatment groups. This data, therefore, indicates how much of a difference you can expect between the measures of variability because of chance.
It also indicates how much of a chance difference you can expect between the measures of central tendency, because the original data (the CHOL1 and CHOL2 scores) was chosen so that there was no treatment effect. Therefore, the differences between the measures of central tendency on these variables reflect only chance differences.
If you add a constant (such as 25) to each of the scores in a data set, the resultant measures of central tendency will each be changed by the magnitude of the constant (remember the constant can be a negative number). We saw this result when the constant 25 was added to each cholesterol score in the Yes-Pill group. In figure 5b, the median increased by 25 points from 217.5 to 242.5; and in figure 5c, the mean increased by 25 points from 227.9 to 252.9. Also, since 210 is the mode (see chapter 4), the most frequently occurring score would change by 25 points, for a mode of 235.
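This effect of adding a constant is easy to demonstrate in R, using the chol1 vector defined in the earlier sketch:

```r
chol1_plus25 <- chol1 + 25            # add the constant "treatment effect"

mean(chol1);   mean(chol1_plus25)     # 227.9 vs 252.9: shifted by 25
median(chol1); median(chol1_plus25)   # 217.5 vs 242.5: shifted by 25
sd(chol1);     sd(chol1_plus25)       # both 45.029: unchanged
var(chol1);    var(chol1_plus25)      # both 2027.6: unchanged
```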
In experiments, researchers use the mean as the measure of central tendency and the variance and standard deviation as the measures of variability. These measures are strongly affected by extreme scores. Given a random sample of units and random assignment of these units to treatments, researchers do not expect to find many extreme scores (also referred to as wild points or outliers) in their treatments. This is one reason that the mean, variance, and standard deviation can be used in experiments. In chapter 10 we will consider other reasons.
In figure 5c, you can see another measure of variation known as the coefficient of variation. The coefficient of variation is found as the ratio of the standard deviation to the mean. This coefficient is useful as a measure of variability in a series of experiments where the standard deviation and the mean tend to change together. The coefficient of variation is used in an experiment as a benchmark. Comparing data to this benchmark can indicate if something unusual has happened in a treatment.
For example, Snedecor and Cochran (1967, p. 63) report that in corn variety trials, "although mean yield and standard deviation vary with location and season, yet the coefficient of variation is often between 5% and 15%." A coefficient of variation outside of known limits might cause a data analyst to look for an outlier or for an unexpected treatment effect.
The coefficient of variation is a measure of variation that could vary across groups in an experiment (because it is a function of the mean), but it is a measure that is expected to remain within a range of values that are known from past research.
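As a sketch, the coefficient of variation for the Yes-Pill cholesterol scores can be computed directly from its definition, again using the chol1 vector defined earlier:

```r
cv <- sd(chol1) / mean(chol1)   # 45.029 / 227.9
cv                              # about 0.198, i.e., roughly 20%
```

The jmv::descriptives() call below requests additional statistics, including skewness and kurtosis, for the experimental data.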
jmv::descriptives(
data = data,
vars = vars(CHOL1, CHOL2, CHOL1plus25),
variance = TRUE,
range = TRUE,
se = TRUE,
ci = TRUE,
iqr = TRUE,
skew = TRUE,
kurt = TRUE)
DESCRIPTIVES
Descriptives
─────────────────────────────────────────────────────────────────
CHOL1 CHOL2 CHOL1plus25
─────────────────────────────────────────────────────────────────
N 20 20 20
Missing 0 0 0
Mean 227.90 219.40 252.90
Std. error mean 10.069 10.254 10.069
95% CI mean lower bound 206.83 197.94 231.83
95% CI mean upper bound 248.97 240.86 273.97
Median 217.50 221.50 242.50
Standard deviation 45.029 45.856 45.029
Variance 2027.6 2102.8 2027.6
IQR 56.750 71.250 56.750
Range 180.00 174.00 180.00
Minimum 150.00 146.00 175.00
Maximum 330.00 320.00 355.00
Skewness 0.35438 0.21608 0.35438
Std. error skewness 0.51210 0.51210 0.51210
Kurtosis 0.11887 -0.48332 0.11887
Std. error kurtosis 0.99238 0.99238 0.99238
─────────────────────────────────────────────────────────────────
Note. The CI of the mean assumes sample means follow a
t-distribution with N - 1 degrees of freedom
Notes:
* Standard Error (Std. Error) and Confidence Intervals are discussed in Chapter 10
* Compare results for CHOL1 and CHOL1plus25 to see the effect of a constant treatment effect on the mean, variance, and standard deviation (that is, see what happens when you add a constant to a set of scores)
* There are a number of other summary statistics, but they are not all reported by all computer programs. For example:
    * Coefficient of Variation (CV), where CV = SD/M; while there is no standard or expectation, it is relatively common to find that the SD is about one-fifth as large as its M (i.e., the mean about 5 times the SD), which works out to a CV of about 0.20
    * Geometric Mean (the mean we report is called the Arithmetic Mean)
    * Harmonic Mean
This section considers measures of a distribution's skewness and kurtosis. As with the measures of variability (because of the arithmetic operations involved), the measures of skewness and kurtosis yield more meaningful values when they are based on interval or ratio scaled data. Along with the measures of central tendency and variability, these measures provide us with a mental picture of a variable's distribution.
The skewness of a distribution of scores is a measure of where on the x-axis the scores bunch together; that is, skewness is a measure of how much a distribution of scores deviates from being symmetrical. In figure 5d, several symmetric distributions are illustrated; and in figure 5e, two skewed distributions are illustrated.
The measure of skewness used here is based on the average of the cubed distances from the mean, that is,
\[ \begin{equation} m_3 = \frac {\sum (X-\mu)^3} N \tag{5-7} \end{equation} \] This average is known as the third moment about the mean and is denoted by \(m_3\) (m for moment, 3 for third). When a score is quite far from the mean (that is, far from the bunch of other scores), its contribution to the third moment about the mean is substantial and is reflected in the sign of \(m_3\).
If a score is far above the mean, then \((X – \mu)\) will be positive and, when cubed, will make a substantial contribution to \(m_3\). The result will be a positive value for the coefficient of skewness. The fact that one value is far above the mean, however, indicates that most of the scores are bunched near the mean, as in figure 5e(i). Thus, a positively skewed distribution is characterized by a few scores far above the mean with the other scores bunched near the mean.
When we have a few scores far below the mean, as in figure 5e(ii), large negative values are contributed to \(m^3\) (since \((X – \mu)^3\) is negative), yielding a negative coefficient of skewness. Therefore, a negatively skewed distribution is characterized by a few scores far below the mean with the majority of scores bunched near the mean.
You can easily identify whether a distribution is positively or negatively skewed by remembering that the tail of a positively skewed distribution points in the direction of the positive x-axis, and the tail of a negatively skewed distribution points in the direction of the negative x-axis.
A measure of skewness is found by calculating:
\[ \begin{equation} m_2 = \frac {\sum {(X-\mu)^2}} N \tag{5-8} \end{equation} \] and
\[ \begin{equation} m_3 = \frac {\sum {(X-\mu)^3}} N \tag{5-9} \end{equation} \] and then finding
\[ \begin{equation} b_1 = \frac {m_3} {m_2 \sqrt{m_2}} \tag{5-10} \end{equation} \] In this equation, \(m_2\) is the second moment about the mean, \(m_3\) is the third moment about the mean, and \(b_1\) is called the coefficient of skewness. When the coefficient of skewness is based on a sample of observations, \(\mu\) is replaced by M and N is replaced by n.
Values of \(b_1\) that are at or near zero indicate a symmetrical distribution; negative values indicate a negatively skewed distribution; and positive values indicate a positively skewed distribution. (Note that the second moment about the mean is the population variance.)
As you may have observed in figure 5c, programs report the coefficient of skewness as part of the descriptive statistics output. Using transformations, however, we can gain further insight into the meaning of the coefficient of skewness while avoiding the tedious computations that are involved.
For this discussion, we will use the new data set found in figure 5f. This data set contains the ratings, X RATING and Y RATING, that two student teachers received from 25 other student teachers for their presentation of a lesson on division to third graders. The ratings are found in columns 1 and 2. On the rating scale, 20 was the highest possible rating and 0 was the lowest possible rating.
The means for X RATING and Y RATING were obtained, and then deviation scores were calculated using a transformation procedure to subtract the mean from the ratings. These results were stored in variables 3 and 4 with the labels \((X - M_X)\) and \((Y - M_Y)\). Note that the subscripts for M indicate which variable's mean it is (e.g., \(M_X\) is the mean for X RATING).
A transformation procedure was used to obtain the second and third powers of these differences. The differences raised to powers are labeled \((X - M_X)^2\), \((Y - M_Y)^2\), \((X - M_X)^3\), and \((Y - M_Y)^3\). Here, for example, \((X - M_X)^2\) represents the transformation of squaring each deviation, where \(M_X\) is the mean of the X RATING scores.
In figure 5f, the X RATING scores have a positively skewed distribution and the Y RATING scores have a negatively skewed distribution. These distributions are shown in figure 5g, X RATING, and figure 5gi, Y RATING, using histogram output. Note the large positive values the scores 16, 17, 18, and 19 contribute to \((X – M_X)^3\) and the large negative values that the scores 14, 12, and 10 contribute to the \((Y – M_Y)^3\). These large values dictate that the coefficient of skewness will be positive for the X RATING scores and negative for the Y RATING scores.
The values of \(m_2\) and \(m_3\) are then found as the means of columns 5, 6, 7, and 8, using the descriptive statistics output. The means of the columns in figure 5f are displayed in figure 5i. Using the means from figure 5i, we can calculate the coefficients of skewness as follows:
\[ b_1(\text{X RATING})=\frac{17.3610}{5.9936\sqrt{5.9936}}=1.18 \] \[ b_1(\text{Y RATING})=\frac{-18.5614}{5.4944\sqrt{5.4944}}=-1.44 \]
In figure 5i, note that, as was indicated earlier, the sums of the deviation scores \((X - M_X)\) and \((Y - M_Y)\) are zero.
Kurtosis is a measure of the “peakedness” of a set of scores. It is measured using the fourth moment about the mean, that is, \(\sum(X – \mu)^4/N\). The concept of kurtosis is used appropriately only when a distribution has a single mode; that is, it is unimodal. Figure 5j illustrates three unimodal distributions that have different degrees of kurtosis.
The normal distribution, shown in Figure 5j(i), is called mesokurtic and is described as having no kurtosis. The value of the coefficient of kurtosis of the normal distribution is 3; this value is used as a benchmark against which the kurtosis of other distributions is compared. The normal distribution is an important distribution that we will discuss further in chapter 9.
Notes:
* X RATING = Positively skewed student teacher ratings
* Y RATING = Negatively skewed student teacher ratings
* XDEV, YDEV = Deviation scores
* XDEV2, YDEV2 = Deviation scores squared (i.e., to the second power)
* XDEV3, YDEV3 = Deviation scores cubed (i.e., to the third power)
A distribution having a coefficient of kurtosis larger than 3, as does the distribution in Figure 5j(ii), is called leptokurtic. A leptokurtic distribution is one that has a high peak. A coefficient of kurtosis that is less than 3 is indicative of a flat distribution, which is called platykurtic; that is, a distribution that has very little peak. A platykurtic distribution is shown in Figure 5j(iii).
It should be noted that most programs now convert the kurtosis statistic (reporting what is often called excess kurtosis, \(b_2 - 3\)) so that normal is 0: a negative value represents negative kurtosis (i.e., platykurtic), and a value larger than zero represents positive kurtosis (i.e., leptokurtic). Most functions in R have options for how to report these statistics.
The coefficient of kurtosis is found by calculating:
\[ \begin{equation} m_2 = \frac {\sum (X-\mu)^2} N \tag{5-11} \end{equation} \] and
\[ \begin{equation} m_4 = \frac {\sum (X-\mu)^4} N \tag{5-12} \end{equation} \] and then computing:
\[ \begin{equation} b_2 = \frac {m_4} {(m_2)^2} \tag{5-13} \end{equation} \] In this equation, \(b_2\) is called the coefficient of kurtosis. As was indicated earlier, the normal distribution is the benchmark against which the kurtosis of other distributions is measured. In the above formulas, \(b_2\) is 3 for the normal distribution.
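The moment-based coefficients are straightforward to compute in R. The helper functions below are our own (base R has no built-in skewness or kurtosis functions); note that jamovi reports sample-adjusted versions of these coefficients, so its values (e.g., skewness = 0.35438 for CHOL1) differ slightly from the raw moment values:

```r
# k-th moment about the mean (equations 5-8, 5-9, 5-11, 5-12, with M for mu)
moment <- function(x, k) mean((x - mean(x))^k)

# Coefficient of skewness, equation 5-10
b1 <- function(x) moment(x, 3) / (moment(x, 2) * sqrt(moment(x, 2)))

# Coefficient of kurtosis, equation 5-13
b2 <- function(x) moment(x, 4) / moment(x, 2)^2

b1(chol1)        # about 0.327: slight positive skew
b2(chol1)        # about 2.805: close to the normal benchmark of 3
b2(chol1) - 3    # the "excess kurtosis" most programs report, near 0
```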
As with the coefficient of skewness, statistics programs report the coefficient of kurtosis when requested. Transformations allow us to examine the meaning of kurtosis more closely.
Figure 5k presents sample data from three different distributions. The data in the column labeled LEPTO X is used to illustrate a leptokurtic distribution.
The data in variable 2 represent a sample from a uniform population distribution, or a rectangular distribution. The mean of this uniform distribution is 25, the minimum score is 18, and the maximum score is 32. The output refers to the minimum score as the lower boundary and the maximum score as the upper boundary. These PLATY Y scores were generated to represent a platykurtic distribution. The data in variable 3 represents a sample from a normal population distribution with a mean of 25 and a standard deviation of 5.
A uniform distribution has no mode, since each score on the x-axis has the same frequency (see figure 5d(ii)); therefore, it would not be appropriate to find its kurtosis. A small sample from a uniform distribution, however, provides a good example of a platykurtic distribution.
The histograms for LEPTO X, PLATY Y, and NORMAL Z are shown in figures 5l, 5m, and 5n, respectively. Remember, this data represents samples from populations, so its distribution will usually look only similar to the parent population distribution from which it was sampled. Figure 5o shows the coefficients of kurtosis for this data. The LEPTO X scores have a coefficient of kurtosis of 5.30375, which is larger than 3 and indicates a leptokurtic distribution. The PLATY Y scores have a coefficient of kurtosis of 1.73412, which is less than 3 and indicates a platykurtic distribution. The NORMAL Z scores have a coefficient of kurtosis of 2.79043, which is approximately equal to 3, the value for a normal distribution.
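Samples like these can be generated in R. The exact method used to create the LEPTO X scores is not shown in the text, so the heavy-tailed t distribution below is only one plausible choice (our assumption); the uniform and normal samples follow the parameters given above, with a sample size of 30 as our choice:

```r
set.seed(42)                                 # reproducibility (our choice)
platy_y  <- runif(30, min = 18, max = 32)    # uniform: platykurtic
normal_z <- rnorm(30, mean = 25, sd = 5)     # normal: mesokurtic
lepto_x  <- 25 + 5 * rt(30, df = 5)          # heavy tails: leptokurtic

# Using the b2() helper defined earlier; values near 3 are mesokurtic
sapply(list(lepto = lepto_x, platy = platy_y, normal = normal_z), b2)
```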
Now consider the skewness and kurtosis coefficients for our experimental data. Figure 5c shows that the effect of adding 25 points to the CHOL1 scores does not affect the measures of skewness and kurtosis. That is, there is no difference between the coefficients of skewness (b1 = 0.32723) and kurtosis (b2 = 2.80545) for the CHOL1 and the 25+CHOL1 scores. (Figure 5c reports sample-adjusted versions of these coefficients, skewness = 0.35438 and excess kurtosis = 0.11887, but the conclusion is the same: the values for CHOL1 and CHOL1plus25 are identical.)
Figure 5c also shows that there is little difference between the coefficients of skewness and kurtosis for the cholesterol scores in the Yes-Pill group (b1 = 0.32723; b2 = 2.80545) and the cholesterol scores in the No-Pill group (b1 = 0.19953; b2 = 2.34362).
The coefficients of skewness are near 0, and the coefficients of kurtosis are near 3. These results indicate that it is reasonable to assume that the cholesterol scores in both the Yes-Pill and No-Pill groups came from a normal distribution, that is, a distribution with no skewness (b1 = 0) and with kurtosis (b2) equal to 3.
These results are similar to what a researcher assumes is true of the scores of a group in an experiment: that the scores have been sampled from a normal distribution; that differences may be found between groups on measures of central tendency; and that differences will not be found between groups on measures of dispersion, skewness, or kurtosis.
jmv::descriptives(
data = data,
vars = X_RATING,
hist = TRUE,
n = FALSE,
missing = FALSE,
mean = FALSE,
median = FALSE,
sd = FALSE,
min = FALSE,
max = FALSE)
DESCRIPTIVES
jmv::descriptives(
data = data,
vars = Y_RATING,
hist = TRUE,
n = FALSE,
missing = FALSE,
mean = FALSE,
median = FALSE,
sd = FALSE,
min = FALSE,
max = FALSE)
DESCRIPTIVES
jmv::descriptives(
data = data,
vars = vars(X_RATING, Y_RATING),
box = TRUE,
mean = FALSE,
sd = FALSE,
range = TRUE,
iqr = TRUE,
skew = TRUE,
kurt = TRUE,
pc = TRUE)
DESCRIPTIVES
Descriptives
───────────────────────────────────────────────
X_RATING Y_RATING
───────────────────────────────────────────────
N 25 25
Missing 0 0
Median 12.000 18.000
IQR 3.0000 3.0000
Range 10.000 10.000
Minimum 10.000 10.000
Maximum 20.000 20.000
Skewness 1.2601 -1.5349
Std. error skewness 0.46368 0.46368
Kurtosis 1.3709 2.5750
Std. error kurtosis 0.90172 0.90172
25th percentile 11.000 16.000
50th percentile 12.000 18.000
75th percentile 14.000 19.000
───────────────────────────────────────────────
jmv::descriptives(
data = data,
vars = vars(X_RATING, XDEV, XDEV2, XDEV3, Y_RATING, YDEV, YDEV2, YDEV3),
desc = "rows",
n = FALSE,
missing = FALSE,
variance = TRUE,
range = TRUE,
min = FALSE,
max = FALSE,
skew = TRUE,
kurt = TRUE)
DESCRIPTIVES
Descriptives
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Mean Median SD Variance Range Skewness SE Kurtosis SE
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
X_RATING 13.0800 12.00000 2.4987 6.2433 10.000 1.2601 0.46368 1.3709 0.90172
XDEV -4.6491e-17 -1.08000 2.4987 6.2433 10.000 1.2601 0.46368 1.3709 0.90172
XDEV2 5.9936 1.16640 10.3828 107.8032 47.880 3.1767 0.46368 11.3631 0.90172
XDEV3 17.3610 -1.25970 71.5540 5119.9696 360.592 3.9222 0.46368 16.6896 0.90172
Y_RATING 17.1600 18.00000 2.3923 5.7233 10.000 -1.5349 0.46368 2.5750 0.90172
YDEV -1.3080e-17 0.84000 2.3923 5.7233 10.000 -1.5349 0.46368 2.5750 0.90172
YDEV2 5.4944 1.34560 11.0135 121.2982 51.240 3.5212 0.46368 13.3069 0.90172
YDEV3 -18.5614 0.59270 78.3649 6141.0638 389.968 -4.1116 0.46368 17.8170 0.90172
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Notes:
* Compare the mean of XDEV3 to the value of \(m_3\) computed from equation 5-7 for the positively skewed X RATING distribution (this value appears in the calculation of equation 5-10)
* Compare the mean of YDEV3 to the value of \(m_3\) computed from equation 5-7 for the negatively skewed Y RATING distribution (this value appears in the calculation of equation 5-10)
* Compare the mean of XDEV2 to the value of \(m_2\) computed from equation 5-8 for the positively skewed X RATING distribution (this value appears in the calculation of equation 5-10)
* Compare the mean of YDEV2 to the value of \(m_2\) computed from equation 5-8 for the negatively skewed Y RATING distribution (this value appears in the calculation of equation 5-10)
jmv::descriptives(
formula = X ~ Shape,
data = data,
hist = TRUE,
n = FALSE,
missing = FALSE,
mean = FALSE,
median = FALSE,
sd = FALSE,
min = FALSE,
max = FALSE)
DESCRIPTIVES
jmv::descriptives(
formula = X ~ Shape,
data = data,
box = TRUE,
dot = TRUE,
boxMean = TRUE,
n = FALSE,
missing = FALSE,
mean = FALSE,
median = FALSE,
sd = FALSE,
min = FALSE,
max = FALSE)
DESCRIPTIVES
jmv::descriptives(
data = data,
vars = vars(LEPTO_X, PLATY_Y, NORMAL_Z),
n = FALSE,
missing = FALSE,
variance = TRUE,
range = TRUE,
se = TRUE,
ci = TRUE,
iqr = TRUE,
skew = TRUE,
kurt = TRUE)
DESCRIPTIVES
Descriptives
───────────────────────────────────────────────────────────────
LEPTO_X PLATY_Y NORMAL_Z
───────────────────────────────────────────────────────────────
Mean 24.833 25.033 24.950
Std. error mean 1.2356 0.82278 0.97634
95% CI mean lower bound 22.306 23.351 22.953
95% CI mean upper bound 27.360 26.716 26.947
Median 25.000 25.000 26.300
Standard deviation 6.7675 4.5066 5.3476
Variance 45.799 20.309 28.597
IQR 1.0000 8.0000 7.8000
Range 35.000 14.000 22.200
Minimum 8.0000 18.000 11.800
Maximum 43.000 32.000 34.000
Skewness 0.18801 0.034757 -0.61889
Std. error skewness 0.42689 0.42689 0.42689
Kurtosis 2.9697 -1.2741 -0.020135
Std. error kurtosis 0.83275 0.83275 0.83275
───────────────────────────────────────────────────────────────
Note. The CI of the mean assumes sample means follow a
t-distribution with N - 1 degrees of freedom
Notes:
* The skewness for each variable is essentially equal to zero because these samples were drawn from symmetric population distributions
* Most programs transform the kurtosis statistic such that leptokurtic data have kurtosis statistics above zero (see LEPTO_X with Kurtosis = 2.970, which is > 0)
* Most programs transform the kurtosis statistic such that platykurtic data have kurtosis statistics below zero (see PLATY_Y with Kurtosis = -1.274, which is < 0)
* Most programs transform the kurtosis statistic such that mesokurtic data have kurtosis statistics relatively near zero (see NORMAL_Z with Kurtosis = -0.020, which is approximately 0; don't be fooled by the fact that it's negative, it's very close to zero)
DESCRIPTIVES
Descriptives
───────────────────────────────────────────────────────────────────────────────────────────────────────
N Median IQR Skewness SE Kurtosis SE W p
───────────────────────────────────────────────────────────────────────────────────────────────────────
x 1000 75.287 13.807 -0.077413 0.077344 0.024702 0.15453 0.99825 0.40537
───────────────────────────────────────────────────────────────────────────────────────────────────────
DESCRIPTIVES
Descriptives
───────────────────────────────────────────────────────────────────────────────────────────────────────
N Median IQR Skewness SE Kurtosis SE W p
───────────────────────────────────────────────────────────────────────────────────────────────────────
x 1000 74.929 17.153 0.048441 0.077344 -1.1617 0.15453 0.95694 < .00001
───────────────────────────────────────────────────────────────────────────────────────────────────────
DESCRIPTIVES
Descriptives
───────────────────────────────────────────────────────────────────────────────────────────────────────
N Median IQR Skewness SE Kurtosis SE W p
───────────────────────────────────────────────────────────────────────────────────────────────────────
x 1000 73.915 14.320 0.59242 0.077344 -0.12905 0.15453 0.96652 < .00001
───────────────────────────────────────────────────────────────────────────────────────────────────────
DESCRIPTIVES
Descriptives
───────────────────────────────────────────────────────────────────────────────────────────────────────
N Median IQR Skewness SE Kurtosis SE W p
───────────────────────────────────────────────────────────────────────────────────────────────────────
x 1000 76.085 14.320 -0.59242 0.077344 -0.12905 0.15453 0.96652 < .00001
───────────────────────────────────────────────────────────────────────────────────────────────────────
DESCRIPTIVES
Descriptives
────────────────────────────────────────────────────────────────────────────────────────────────────────
N Median IQR Skewness SE Kurtosis SE W p
────────────────────────────────────────────────────────────────────────────────────────────────────────
x 1000 75.472 10.661 -0.075346 0.077344 3.2206 0.15453 0.96382 < .00001
────────────────────────────────────────────────────────────────────────────────────────────────────────
DESCRIPTIVES
Descriptives
────────────────────────────────────────────────────────────────────────────────────────────────────────
N Median IQR Skewness SE Kurtosis SE W p
────────────────────────────────────────────────────────────────────────────────────────────────────────
x 1000 75.466 15.432 -0.068537 0.077344 -0.85946 0.15453 0.98224 < .00001
────────────────────────────────────────────────────────────────────────────────────────────────────────
DESCRIPTIVES
Descriptives
───────────────────────────────────────────────────────────────────────────────────────────────────────
N Median IQR Skewness SE Kurtosis SE W p
───────────────────────────────────────────────────────────────────────────────────────────────────────
x 1000 74.422 23.351 0.022198 0.077344 -0.90470 0.15453 0.98000 < .00001
───────────────────────────────────────────────────────────────────────────────────────────────────────
This chapter explained how to calculate the measures of dispersion, skewness, and kurtosis by hand. These measures are generally used with interval or ratio scaled data. The addition of a constant (that is, a treatment effect) to the scores on a variable does not change the measures of the variable’s dispersion (except the coefficient of variation), skewness, or kurtosis.
When considering the measures of skewness, the coefficient of skewness, b1, is negative for negatively skewed distributions, equal to zero for symmetric distributions, and positive for positively skewed distributions. When considering the measures of kurtosis, the coefficient of kurtosis, b2, is equal to 3 for normal distributions, less than 3 for flat distributions (platykurtic), and greater than 3 for peaked distributions (leptokurtic). Recall that most statistics programs convert the kurtosis values such that normal is 0, platykurtic is negative, and leptokurtic is positive.
See Chapter 4 Appendix A
SECTION 13: Making Sense of Descriptive Results
Using ALL output created in previous sections, respond to the following items
For SCALE variable Y report and interpret your descriptive results (from the analyses above) using APA Publication Style 6th edition in a way that describes your data well enough that someone can “visualize” what your data distribution looks like without a graph (i.e., based on your words). Include at least two paragraphs of text as well as “really useful” tables and graphs. Use demographic breakdowns of W for some results about Y. Feel free to run additional analyses on the variable in order to get additional information if you think it is necessary (if you do, include those analyses here). Include a discussion of whether you see any univariate outliers in the variable.
For the pair of SCALE variables Y and X report and interpret your bivariate descriptive and correlational results (from the analyses above) using APA Publication Style 6th edition in a way that describes relationship well enough that someone can “visualize” what your bivariate data distribution and what the relationship between the two variables looks like without a graph (i.e., based on your words). Include at least two paragraphs of text as well as “really useful” tables and graphs. Feel free to run additional analyses in order to get additional information if you think it is necessary (if you do, include those analyses here). Include a discussion of whether you see any bivariate outliers in the variables.
Explain the Research Design and Statistical Analysis terms below BRIEFLY but SUFFICIENTLY and IN YOUR OWN WORDS (don’t just give another name for them). Some may require finding additional readings. If you use resources, paraphrase in your own words AND provide a citation of the resource you used (including page numbers).