INTRODUCTION

In this chapter, we will introduce you to descriptive statistics – the branch of statistics that deals with collecting, tabulating, and presenting numerical data. We will consider how statisticians set up an experiment, and we will define some of the terms they use to describe an experiment and explain its results. Experiments are frequently used in psychology, biology, and engineering. However, in many disciplines, such as economics, sociology, communications, and education, experiments are difficult to perform. Why then are we beginning our discussion of descriptive statistics with an experimental example? Simply because experiments serve as a benchmark against which non-experimental studies can be judged. If you understand the basis for an experiment, you are in a better position to understand and interpret the results of a non-experimental study. Also, many of the same statistical terms are used in both experimental and non-experimental studies.

Following our presentation of an experiment, we discuss how statisticians measure results, and the scales of measurement they use. Then we learn how to make a histogram, which is a means of visually depicting a grouped frequency distribution. A grouped frequency distribution is a way of indicating what number and percentage of the scores in your data set fall into certain categories. We will also consider what are known as cumulative frequency distributions and cumulative histograms. Finally, we learn how to find three measures of central tendency, which are descriptive statistics that attempt to indicate where the center of a distribution of scores is located. The three measures of central tendency that we will learn are called the mean, median, and mode.

Your jamovi objectives in this chapter are to: (a) create a grouped frequency distribution and its corresponding histogram, (b) find the mean of a set of scores, and (c) find the median and mode for a set of scores. Each of the statistical terms used in this chapter is defined in the glossary. You will find it beneficial to look up these terms as you read the following description of an experiment.

DESCRIPTION OF AN EXPERIMENT

The data in Table 4.1 were collected for a study on the effects of an herbal supplement pill on a measure of cholesterol level. The women who participated in this study are referred to as the units or as the units of analysis. Prior to collecting the data, the researcher had stated the research problem as follows:

Box 4.1
Research Problem: Does the consumption of an herbal supplement pill over an extended period of time affect the cholesterol level of women?

Based on previous research on this pill with mice and monkeys, the researcher postulated that women who took the herbal supplement pill would have higher cholesterol levels than women who did not take the herbal supplement pill. This was the prediction:

Box 4.2
Research Hypothesis: Women who take the herbal supplement pill over an extended period of time will have higher cholesterol levels than women who do not take this herbal supplement pill.

Note that the researcher’s prediction is called the research hypothesis. The problem and research hypothesis form the foundation of most good data analyses; we will discuss them further in chapter 11. For now we will discuss how these women were selected and look at some more terms that are used by statisticians with this type of data.

The women who participated in this experiment were selected at random from the population of women, between the ages of 18 and 35, who lived in a large mid-western city in the United States. These women were then randomly placed into either the Yes-Pill or the No-Pill groups and were asked to remain in these groups for one year. The women in the Yes-Pill group took an herbal supplement pill and the women in the No-Pill group did not take the herbal supplement pill. To make this discussion less complex, we will assume that all brands of this herbal supplement pill are basically the same and that the women were able to follow the researcher’s schedule faithfully.

The following statistical terms are used with the data shown in Table 4.1. In this experiment, we are looking at the effect of the independent variable on the dependent variable. Herbal supplement pill consumption is the independent variable; it is the variable that we can manipulate. The conditions (treatments) of the independent variable, Yes-Pill and No-Pill, are called the levels of the independent variable. Here, the independent variable is measured on a nominal scale. The dependent variable is cholesterol level, which is measured on a ratio scale. The other two variables in Table 4.1 are the women’s identification numbers (IDs), measured on a nominal scale, and their heights (HTs), measured on a ratio scale. These variables were not affected by the independent variable because they were measured before the study began.

How can we tell if there is a difference between the women in the two groups? A researcher would answer this question by comparing the average cholesterol levels of the women from the two groups. The average cholesterol level is the sum of the cholesterol scores in a group divided by the number of women in the group. Statisticians refer to the average as the mean. Measures, such as the mean, that are found using a whole population of scores are called parameters. When such measures are estimated using a set of sample scores, they are called statistics. Then, if the sample mean cholesterol level of the women in the Yes-Pill group is larger than the sample mean cholesterol level of the women in the No-Pill group, a researcher might conclude that consumption of the herbal supplement pill raises the cholesterol level of the women who take it.

If the means of the two sample groups differ, why do we use the word “might” in the preceding sentence? To answer this question, consider this experiment from a statistician’s point of view. To a statistician, the experiment is based on samples from two populations: a sample from the population of women who take herbal supplement pills and a sample from the population of women who do not take herbal supplement pills. When a statistician makes the statement, “the consumption of an herbal supplement pill raises the cholesterol level of women,” he or she means that the mean cholesterol level in the population of women who take the herbal supplement pill is greater than the mean cholesterol level in the population of women who do not take the herbal supplement pill.

Even if the population means are equal for a given experiment, however, the sample means will probably differ simply due to chance. Therefore, we must ask: How much of a difference must exist between two sample means before we can conclude that a difference exists between the population means? For instance, if the sample mean cholesterol level for the Yes-Pill group was 241 and the sample mean for the No-Pill group was 240, would this one-point difference be large enough for us to draw conclusions for the population means? If the sample mean for the Yes-Pill group was 270 and the sample mean for the No-Pill group was 240, would the difference be large enough for us to draw conclusions?
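A small simulation can make the role of chance concrete. The sketch below draws two samples of 10 scores from the same hypothetical population and compares their means. The population mean (240), the standard deviation (40), the sample size, and the seed are all assumptions chosen for illustration; none of them come from the cholesterol study.

```r
# Two samples drawn from the SAME hypothetical population.
# The mean (240), SD (40), sample size, and seed are illustrative assumptions.
set.seed(4)
yes_pill <- rnorm(10, mean = 240, sd = 40)
no_pill  <- rnorm(10, mean = 240, sd = 40)

mean_yes <- mean(yes_pill)
mean_no  <- mean(no_pill)

# The population means are equal, yet the sample means differ by chance.
mean_diff <- mean_yes - mean_no
print(c(yes = mean_yes, no = mean_no, difference = mean_diff))
```

Changing the seed gives a different difference each time, which is why a sample difference by itself does not settle the question about the population means.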

Table 4.1 Identification number, height, and cholesterol levels of women taking and not taking an herbal supplement pill

Chapter 14 presents a statistical test that will help us decide when two sample means differ in such a manner that the difference is statistically significant. In the examples here, however, we will try to draw conclusions from the sample means about their corresponding population means. The researcher in the cholesterol study selected a random sample of women from a population of women, so as to be able to make an inference back to the population. In what follows, we will consider descriptive statistics that will help us to better understand our data prior to conducting statistical tests. First, we will consider the scales of measurement available to measure the variables in an experiment.

SCALES OF MEASUREMENT

Measurement is the assignment of a number (measure or score) to a unit according to rules. The well-defined collection (that is, set) of such numbers is referred to as a scale of measurement. Here the term “well-defined” means that the scale of measurement is defined so that it is possible to decide which number should be assigned to a unit. There are four commonly used scales of measurement:

  • Nominal scale
  • Ordinal scale
  • Interval scale
  • Ratio scale

These scales of measurement are listed in order from the least precise, nominal, to the most precise, ratio. You can easily remember the measurement scales and their order of precision by remembering the French word “noir” (for black) – which is composed of the first letters of the names of these scales.

Nominal Scale

Nominal measurement is the process of assigning a number to a unit under the rule that units that receive the same number are thought of as being qualitatively similar and units that receive different numbers are thought of as being qualitatively different. For example, the nominal scale can be used to measure political parties. In this case, Democrats may be assigned the number 1 and Republicans the number 2, or vice versa, Republicans the number 1 and Democrats the number 2. You can see that the magnitude of the number that is assigned is not meaningful in terms of group differences since it serves only to identify the group. Indeed, we could assign Democrats the number 12 and Republicans the number -44. It is apparent that measurement on a nominal scale, which consists of assigning units to groups or categories, is a rather primitive means of measurement.

In experiments, the independent variables are commonly measured on a nominal scale. In this case, the nominal scale consists of the levels of the independent variable. For example, in the pill consumption experiment, the women in the Yes-Pill group were assigned the number 1, for group (or level) 1, and the women in the No-Pill group were assigned the number 2, for group (or level) 2. The choice of numbers, however, and which group was assigned which number, was arbitrarily decided by the experimenter. It is frequently convenient and useful to use 0 and 1 as the group numbers when there are only two groups. In our example, the codes would represent “pill-ness” with 1 meaning “has pill-ness” and 0 meaning “has no pill-ness.”

Other examples of experiments where the independent variable is measured on a nominal scale are: methods of teaching (commonly referred to as “methods studies”) in education, types of fertilizers in agriculture, types of business practices in economics, and types of reinforcement in psychology. In each of these studies, the experimenter arbitrarily assigns numbers to the treatment levels. In many experiments, the two groups represent treatment and control. In such cases, 1 might be used to represent treatment (i.e., “has treatment”) and 0 to represent no treatment (i.e., “does not have treatment”).
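As a rough illustration of how arbitrary nominal codes are, the following sketch codes the same group labels in two ways: 1 and 2, and 1 and 0. The six group labels are hypothetical and are not taken from Table 4.1.

```r
# Hypothetical group labels for six units (not the Table 4.1 data).
group <- c("Yes-Pill", "No-Pill", "Yes-Pill", "No-Pill", "No-Pill", "Yes-Pill")

# Two equally valid nominal codings of the same variable.
code_12 <- ifelse(group == "Yes-Pill", 1, 2)  # levels coded 1 and 2
code_01 <- ifelse(group == "Yes-Pill", 1, 0)  # 1 = "has pill-ness", 0 = "has no pill-ness"

# Either coding identifies exactly the same group memberships.
counts_12 <- table(code_12)
counts_01 <- table(code_01)
print(counts_12)
print(counts_01)
```

The numbers only label the groups, so both codings produce the same group sizes.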

Ordinal Scale

Ordinal measurement is the process of assigning a number to a unit under the rule that the units are assigned numbers that place them in rank order on a variable. Given two units measured on an ordinal scale, you know whether one unit has more of some variable than the other, but you do not know how much more. For example, if you were to classify minerals according to their hardness, you could rank one mineral as harder than another if it could scratch the other. During this process, you would assign larger numbers to the harder minerals. When you were finished, you would have assigned numbers to the minerals, with the highest numbered mineral able to scratch all of those numbered below it, but you would not know how much harder one mineral was compared to the others.

Another example of ordinal measurement occurs during the presentation of athletic awards. For example, the top three finishers of a marathon are given the numbers 1, 2, and 3 for their respective performances in the race. Given only the numbers 1, 2, and 3, we know that runner number 1 was faster than runner number 2, but we do not know if she was faster by a second, a minute, or an hour.
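The loss of information on an ordinal scale can be sketched with two sets of hypothetical finishing times (in minutes; the values are invented for illustration). The ranks are identical even though the gaps between runners are very different.

```r
# Hypothetical finishing times in minutes for two different races.
race_a <- c(150.0, 150.1, 210.0)  # first and second place almost tied
race_b <- c(150.0, 200.0, 210.0)  # large gap between first and second place

# Ranks keep the order of the runners but discard the size of the gaps.
ranks_a <- rank(race_a)
ranks_b <- rank(race_b)
print(ranks_a)
print(ranks_b)
```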

Interval Scale

Interval measurement is the process of assigning a number to a unit under the rule that the resultant number is a linear function of the true magnitude of the variable being measured. That is, if X represents the true magnitude of the variable being measured, then Y (where Y = aX + b, and “a” and “b” are unknown constants) represents the interval measure. This rule implies that equal differences based on interval measures have the same meaning.

For example, since temperature is based on an interval scale, this means that the difference between 95° and 90° (either Fahrenheit or Celsius) has the same meaning as the difference between 15° and 10°, that is, both are differences of 5°. We can see that this is true by considering the temperature differences in terms of their linear rules, as follows:

\[ \begin{align*} 95^\circ & = aX_1+b \\ -90^\circ & = -aX_2-b \\ 5^\circ & = a(X_1 - X_2) && \text{= Result} \end{align*} \]

\[ \begin{align*} 15^\circ & = aX_3+b \\ -10^\circ & = -aX_4-b \\ 5^\circ & = a(X_3 - X_4) && \text{= Result} \end{align*} \]

This implies that the differences between the true magnitudes (the “X” values) are equal since:

\[ a(X_1-X_2)=a(X_3-X_4)=5^\circ \]

and therefore:

\[ (X_1-X_2)=(X_3-X_4) \]

This type of precision is not available on the nominal scale where differences are uninterpretable because the true measures are of different magnitudes.

One limitation of the interval scale is that the selection of what zero means is an arbitrary decision. Since Y = aX + b, when X = 0, Y = b, and the value of b depends on the scale that was chosen. For example, if X represents the Kelvin scale for measuring temperature, at X = 0, b = -273.15° on the Celsius interval scale and b = -459.67° on the Fahrenheit interval scale.

Ratio Scale

Ratio measurement is the process of assigning a number to a unit under the rule that the resultant number is a multiple of the variable being measured. That is, if X represents the true magnitude of the variable being measured, then Y (where Y = aX and “a” is an unknown constant) represents the ratio measure. This rule implies that differences and multiples of ratio measures represent true statements about the amounts of some property an object possesses, and that variables measured on a ratio scale have a true zero. This also implies that ordinary arithmetic can be used meaningfully with numbers found using a ratio scale.

Most physical measurements are examples of numbers that are based on a ratio scale. For example, a building that is 400 feet tall is twice as tall as a building that is 200 feet tall, and a building that is 0 feet tall does not exist. Similarly, an adult who weighs 200 pounds is four times as heavy as a child who weighs 50 pounds.
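The contrast between the interval rule (Y = aX + b) and the ratio rule (Y = aX) can be checked numerically. The sketch below uses the standard Celsius-to-Fahrenheit and feet-to-meters conversions; the helper names `to_fahrenheit` and `to_meters` are hypothetical and were introduced only for this example.

```r
# Hypothetical helper functions for the two kinds of rescaling.
to_fahrenheit <- function(celsius) celsius * 9 / 5 + 32  # interval rule: Y = aX + b
to_meters     <- function(feet) feet * 0.3048            # ratio rule:    Y = aX

# Equal differences stay equal under an interval rescaling.
diff_hot  <- to_fahrenheit(95) - to_fahrenheit(90)
diff_cold <- to_fahrenheit(15) - to_fahrenheit(10)

# Ratios are not preserved when the zero point is arbitrary ...
ratio_celsius    <- 20 / 10
ratio_fahrenheit <- to_fahrenheit(20) / to_fahrenheit(10)

# ... but they are preserved when the scale has a true zero.
ratio_feet   <- 400 / 200
ratio_meters <- to_meters(400) / to_meters(200)

print(c(diff_hot = diff_hot, diff_cold = diff_cold))
print(c(celsius = ratio_celsius, fahrenheit = ratio_fahrenheit))
print(c(feet = ratio_feet, meters = ratio_meters))
```

This is why "twice as tall" is a meaningful statement about buildings, while "twice as hot" is not a meaningful statement about Celsius or Fahrenheit temperatures.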

Parametric Versus Nonparametric Tests

It is important to distinguish between the scales of measurement so you understand what information a given measurement or the result of an arithmetic operation contains. This knowledge will prevent you from making statements such as:

The average IQ in group 1 is 60 and the average IQ of group 2 is 120; therefore, group 2 is twice as bright as group 1.

This statement is unreasonable because IQ is not measured on a ratio scale. Indeed, IQ is probably not even measured on an interval scale. Hopkins and Glass (1978, p. 13) indicate that: “The IQ scale defies categorization as strictly ordinal or interval; perhaps it is better to speak of it and some other widely used scales as ‘quasi-interval’.”

A second reason for distinguishing measures based on different scales is that some statisticians (Blalock, 1979; Hinkle, Wiersma, & Jurs, 1982; Schmidt, 1979; Twaite & Monroe, 1979) routinely apply different statistical procedures to scores based on different measurement scales. Other statisticians, however, disagree with the selection of a statistical test based on the level of measurement of the scores. For example, Heermann and Braskamp, in reviewing the literature on this topic, concluded that: “Most investigators seem to agree that scale type is irrelevant to the choice of a statistical tool” (Heermann & Braskamp, 1970, p. 37). More recently, Gaito (1980) reiterated Heermann and Braskamp’s conclusion. The arguments of these statisticians are beyond the scope of this book. From our perspective, the “numbers know not whence they came” (unknown quote), but in order to make sense of the numbers, the researcher must know. The tests that these statisticians discussed are referred to as either parametric or nonparametric tests. It is more important to understand what these tests do than to choose them based on the scale of your data.

COMPARING TWO GROUPS IN A DATASET

To carry out a statistical test on the cholesterol data in Table 4.1, a researcher must assume that the only systematic, or nonrandom, difference between the two groups is the use of the herbal supplement pill. That is, the women in the two groups should not differ systematically on other variables prior to the start of the experiment. If the two groups did differ systematically on a variable, any differences between the groups might be the result of initial differences on this variable. For example, if the women in the Yes-Pill group were heavier than the women in the No-Pill group, any differences in cholesterol might have been caused by the initial differences in weight – not the herbal supplement pill. To reasonably make the assumption that the women were approximately equal on all possible variables, the researcher randomly assigned the women to the two groups in the study. Therefore, because of the process of random assignment, we would not expect the women in one group to be systematically taller, older, heavier, and so on, nor to have higher cholesterol levels than the women in the other group.
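Random assignment of the kind described above can be sketched in a few lines. The example assumes 20 hypothetical participants split evenly into two groups; the IDs, the group sizes, and the seed are illustrative assumptions, not details of the actual study.

```r
set.seed(41)      # arbitrary seed so the sketch is reproducible
unit_id <- 1:20   # hypothetical IDs for 20 participants

# Shuffle ten "Yes-Pill" and ten "No-Pill" labels across the units.
group <- sample(rep(c("Yes-Pill", "No-Pill"), each = 10))

assignment  <- data.frame(id = unit_id, group = group)
group_sizes <- table(assignment$group)
print(group_sizes)
```

Because chance alone decides which label each unit receives, no characteristic of the women can systematically determine group membership.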

If the only difference between the two groups is that one group is taking the herbal supplement pill and the other group is not, what characteristics might we expect the measures of cholesterol to have? We have already indicated that if the pill does affect the women’s cholesterol levels, the means of the two groups will differ. A statistician would say that, in theory, the difference between the means was caused by a constant value being added to the cholesterol scores of the women in the Yes-Pill group. If this is true, the spread of the scores in both groups should be about the same (we will see why in the next chapter). Indeed, we would expect that the two groups would have many similar descriptive statistics but different measures of central tendency.
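The claim that adding a constant shifts the mean but leaves the spread alone can be checked with a small sketch. The five scores and the effect of 30 points are hypothetical, and the standard deviation and range are used here only as two convenient measures of spread.

```r
# Hypothetical cholesterol scores for a No-Pill group.
no_pill <- c(180, 200, 215, 230, 250)

# Theoretical treatment effect: the same constant added to every score.
effect   <- 30
yes_pill <- no_pill + effect

mean_shift <- mean(yes_pill) - mean(no_pill)  # equals the constant
sd_no      <- sd(no_pill)
sd_yes     <- sd(yes_pill)                    # spread is unchanged
range_no   <- diff(range(no_pill))
range_yes  <- diff(range(yes_pill))           # distance from lowest to highest is unchanged

print(c(mean_shift = mean_shift, sd_no = sd_no, sd_yes = sd_yes))
```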

HISTOGRAMS

A simple way to begin our comparison of the Yes-Pill and No-Pill groups is to present a picture of all of the scores in a histogram. To do this we first must find a grouped frequency distribution for each set of scores and then use the results to create two histograms. A grouped frequency distribution and the resultant histogram of the scores from the Yes-Pill group are shown in Figure 4a(ii) and Figure 4a(i), respectively. A grouped frequency distribution and the resultant histogram of the scores from the No-Pill group are shown in Figure 4b(ii) and Figure 4b(i), respectively. In each case, the histogram is shown first, followed by the grouped frequency distribution.

Notice that the two groups have somewhat similar histograms.

Rounding

In this text (see Appendix E), we round a number up if the digit after the place to be rounded to is 5 or more, and we round a number down if this digit is less than 5. For example, the numbers 3.9, 3.8, 3.7, 3.6, and 3.5 are rounded up to 4; the numbers 3.1, 3.2, 3.3, and 3.4 are rounded down to 3.
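If you reproduce this rule in R, note that base R's `round()` follows the IEC 60559 convention of rounding an exact half to the even digit, so it does not always match the rule stated above. The helper `round_half_up` below is a hypothetical function written for this sketch; it rounds to whole numbers only and assumes non-negative values.

```r
# Hypothetical helper: round half up to the nearest whole number,
# as described in the text. Assumes non-negative inputs.
round_half_up <- function(x) floor(x + 0.5)

scores <- c(3.1, 3.4, 3.5, 3.9, 2.5)

rounded_text <- round_half_up(scores)  # rule used in this text
rounded_base <- round(scores)          # base R sends an exact half to the even digit

print(rbind(scores, rounded_text, rounded_base))
```

The two approaches agree for every score except 2.5, which the rule in this text rounds up to 3 and base R rounds to 2.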

Note that a bar whose frequency is 0 will simply have a height of 0.

Figure 4.4a(i) Frequency histogram of the Cholesterol scores from the Yes-Pill treatment

Figure 4.4a(ii) Grouped frequency distribution of the Cholesterol scores from the Yes-Pill treatment

jmv::descriptives(
    data = data,
    vars = Groups,
    freq = TRUE,
    n = FALSE,
    missing = FALSE,
    mean = FALSE,
    median = FALSE,
    sd = FALSE,
    min = FALSE,
    max = FALSE)

 DESCRIPTIVES

 FREQUENCIES

 Frequencies of Groups                               
 ─────────────────────────────────────────────────── 
   Groups     Counts    % of Total    Cumulative %   
 ─────────────────────────────────────────────────── 
   10              1         5.000           5.000   
   140-160         2        10.000          15.000   
   160-180         1         5.000          20.000   
   180-200         3        15.000          35.000   
   200-220         5        25.000          60.000   
   240-260         5        25.000          85.000   
   260-280         2        10.000          95.000   
   280-300         1         5.000         100.000   
 ─────────────────────────────────────────────────── 

Note that “% of Total” is computed from the total frequency excluding any missing values; that is, Frequency / Total Non-Missing × 100%. This figure is often called the valid percent. Most researchers report the valid percent, which is based on the non-missing data they actually collected.
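The valid percent calculation described in this note can be sketched in base R. The class-interval labels below are hypothetical and include two missing values; the column names in `freq_table` are our own and only approximate the layout of the jamovi output.

```r
# Hypothetical class-interval labels with two missing values (NA).
groups <- c("140-160", "160-180", "160-180", NA,
            "180-200", "180-200", "180-200", NA)

counts        <- table(groups)           # table() drops NA by default
valid_n       <- sum(counts)             # total non-missing
valid_percent <- 100 * counts / valid_n  # "% of Total" (valid percent)
cumulative    <- cumsum(valid_percent)   # "Cumulative %"

freq_table <- data.frame(
  Groups        = names(counts),
  Counts        = as.vector(counts),
  PctOfTotal    = round(as.vector(valid_percent), 3),
  CumulativePct = round(as.vector(cumulative), 3)
)
print(freq_table)
```

The two missing values do not appear in the denominator, so the percentages are based on 6 scores rather than 8.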

Figure 4.4a(iii) Histogram with Density and Boxplot (Violin Plot) of the Cholesterol scores from the Yes-Pill treatment

jmv::descriptives(
    data = data,
    vars = Cholesterol,
    hist = TRUE,
    dens = TRUE,
    box = TRUE,
    violin = TRUE,
    dot = TRUE,
    n = FALSE,
    missing = FALSE,
    mean = FALSE,
    median = FALSE,
    sd = FALSE,
    min = FALSE,
    max = FALSE)

 DESCRIPTIVES

Figure 4.4b(i) Frequency histogram of the Cholesterol scores from the No-Pill treatment

jmv::descriptives(
    data = data,
    vars = Cholesterol,
    hist = TRUE,
    n = FALSE,
    missing = FALSE,
    mean = FALSE,
    median = FALSE,
    sd = FALSE,
    min = FALSE,
    max = FALSE)

 DESCRIPTIVES

Figure 4.4b(ii) Ungrouped frequency distribution of the Cholesterol scores from the No-Pill treatment

jmv::descriptives(
    data = data,
    vars = Cholesterol,
    freq = TRUE,
    n = FALSE,
    missing = FALSE,
    mean = FALSE,
    median = FALSE,
    sd = FALSE,
    min = FALSE,
    max = FALSE)

 DESCRIPTIVES

 FREQUENCIES

 Frequencies of Cholesterol                              
 ─────────────────────────────────────────────────────── 
   Cholesterol    Counts    % of Total    Cumulative %   
 ─────────────────────────────────────────────────────── 
   146                 1         5.000           5.000   
   156                 1         5.000          10.000   
   160                 1         5.000          15.000   
   175                 1         5.000          20.000   
   180                 1         5.000          25.000   
   185                 1         5.000          30.000   
   190                 1         5.000          35.000   
   200                 2        10.000          45.000   
   218                 1         5.000          50.000   
   225                 1         5.000          55.000   
   235                 1         5.000          60.000   
   245                 1         5.000          65.000   
   250                 1         5.000          70.000   
   255                 2        10.000          80.000   
   260                 2        10.000          90.000   
   273                 1         5.000          95.000   
   320                 1         5.000         100.000   
 ─────────────────────────────────────────────────────── 

A CUMULATIVE FREQUENCY DISTRIBUTION

A cumulative frequency distribution is a frequency distribution where each class interval represents the sum of its frequency plus the sum of the frequencies of the class intervals below it. A cumulative frequency histogram is a histogram where each bar represents the sum of its frequency plus the sum of the frequencies of the other bars, or class intervals, below it. A cumulative frequency distribution is found by following the hand calculation steps for a grouped frequency distribution, but at step 9, the tallies found in step 8 are summed so that each class interval has a number next to it that is the frequency for that interval plus the sum of the frequencies for the preceding intervals. For example, the cumulative frequency histogram for the cholesterol data is shown in Figure 4d(i).

On the histograms and cumulative histograms output, the frequency is recorded on the left y-axis and the percent of the total is recorded on the right y-axis. Because each bar represents the cumulative sums of the frequencies at and below it, a cumulative frequency histogram can be used to estimate the percentile rank of a percentile. Note: Compare Figure 4d(i) to the “Cumulative %” column in Figure 4a(ii).
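The running totals behind a cumulative frequency distribution can be sketched as follows. The scores are transcribed from the ungrouped frequency table earlier in this chapter; the class intervals of width 20 starting at 140 are an assumption made for this example and need not match the intervals jamovi used for the figures.

```r
# Scores transcribed from the ungrouped frequency table earlier in this chapter.
cholesterol <- c(146, 156, 160, 175, 180, 185, 190, 200, 200, 218,
                 225, 235, 245, 250, 255, 255, 260, 260, 273, 320)

# Class intervals of width 20 (an assumption for illustration).
breaks <- seq(140, 320, by = 20)
groups <- cut(cholesterol, breaks = breaks, include.lowest = TRUE, right = FALSE)

freq     <- table(groups)
cum_freq <- cumsum(freq)                           # running total of frequencies
cum_pct  <- 100 * cum_freq / length(cholesterol)   # running total as a percent

cum_table <- data.frame(
  Interval  = names(freq),
  Frequency = as.vector(freq),
  CumFreq   = as.vector(cum_freq),
  CumPct    = as.vector(cum_pct)
)
print(cum_table)
```

Each row's cumulative frequency is its own frequency plus the frequencies of all the intervals below it, so the last row always equals the total number of scores.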

Figure 4.4d(i) Cumulative Frequency histogram of the Grouped Cholesterol scores from the Yes-Pill treatment

A percentile is the score below which a given percentage of scores fall. The percentile rank is the percentage of scores below a given score, or percentile. In a set of scores, the percentile rank of a given score is found by: (a) ordering the scores, (b) finding the number of scores below the given score plus 0.5, (c) dividing this number by the number of scores, (d) multiplying the result by 100, and (e) appending a percent sign (%). The equation for this process is as follows:

Equation 4-1

\[ \begin{equation} PR_X = \frac{n_B+0.5}{n} \times 100\% \tag{4-1} \end{equation} \]

\(PR_X\) stands for the percentile rank, X is the score of interest, \(n_B\) is the number of scores below the given score, and n is the total number of scores. The reason 0.5 is added to \(n_B\) in the cumulative histogram and cumulative frequency distribution is that half of a score’s frequency of 1 (that is, 0.5) is considered to fall above the score and half of the score’s frequency is considered to fall below the score.

For example, consider the following set of scores: 97, 102, 105, 110, 135. If we want to find the percentile rank of the score X = 110, we first determine that there are 3 scores below 110 (\(n_B = 3\)), and that the total number of scores is 5 (n = 5). Therefore the equation to find the percentile rank is:

\[ \begin{equation} PR_{110} = \frac{3+0.5}{5} \times 100\% = 70\% \end{equation} \]

The percentile rank of the score 110 is 70%. You can estimate the score (percentile) that has a given percentile rank from a cumulative frequency histogram by drawing a line from the percentile rank mark on the right y-axis until it exits the bars; the score (percentile) is in the last bar that the line passes through.
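Equation 4-1 translates directly into a short function. The name `percentile_rank` is hypothetical, and the sketch assumes, as the equation does, that the score of interest occurs once in the data.

```r
# Percentile rank following Equation 4-1: (n_B + 0.5) / n * 100.
# Hypothetical helper; assumes the score x occurs once in `scores`.
percentile_rank <- function(scores, x) {
  n_below <- sum(scores < x)   # n_B: number of scores below x
  n       <- length(scores)    # n: total number of scores
  (n_below + 0.5) / n * 100
}

scores <- c(97, 102, 105, 110, 135)
pr_110 <- percentile_rank(scores, 110)
print(pr_110)
```

For the five scores in the worked example, the function reproduces the percentile rank of 70% found by hand.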

If the line touches the top of a bar, the percentile is the midpoint of that bar, found as the sum of the bar’s limits divided by 2; if the line pierces the bar, the percentile is the lower limit of the bar. For example, in Figure 4d(i) the 75th percentile (n = 15) was estimated to be 249.018, which is the midpoint of bar number 6; and the 50th percentile (n = 10) was estimated to be 203.928, the lower limit of bar number 4.

Similarly, you can estimate the percentile rank of a given score by locating which bar it is in and drawing a vertical line from the x-axis to the top of the bar, and then drawing a horizontal line to the right y-axis where the percentile rank is located. For example, the score 194.910 (found as the sum of the class limits divided by two), which represents the scores in bar number 3, has a percentile rank of approximately 31% (n ≈ 6).

MEASURES OF CENTRAL TENDENCY

Measures of central tendency are descriptive statistics that attempt to indicate where the center of a distribution of scores is located. A histogram represents a discrete distribution of scores because all of the scores in a class interval are represented by a single bar; this results in gaps between the bars. A continuous distribution is shown in Figure 4f. Note that in a continuous distribution there are no gaps between the scores on the x-axis, and therefore, no gaps in the curve that represents the distribution of these scores.

The commonly used measures of central tendency are the mean, mode, and median.

Mean

The mean is the average of a set of scores and is found by adding all of the scores and dividing by the total number of scores.

Sample Mean

The equation for the sample mean, M, can be written several equivalent ways (due to the properties of the summation operator):

Equation 4-2

\[ \begin{align*} \overline{X} = M & = \sum_{i=1}^{n} \frac{X_i}{n} \\ & = \frac{1}{n} \sum_{i=1}^{n} X_i \\ & = \frac{\sum X}{n} \tag{4-2} \end{align*} \]

Figure 4.4f An example of a continuous symmetric distribution

Figure 4.4g An example of a continuous (positively) skewed distribution

The capital Greek letter sigma, ∑ (pronounced “sig-ma”), is called the summation operator and indicates that you are to take the sum of all of the X’s, and n is the number of units (observations) in the sample. The subscript i used in equation 4-2 indicates that we start with unit 1 and add all the scores through unit n. Adding all scores from 1 to n means adding all scores, which is very common in statistics, and therefore, the summation operator will usually only include the subscript when the formula is used to indicate that we do not sum all the scores. The symbols M and \(\overline{X}\) (pronounced “X bar”) are commonly used to represent the sample mean; however, APA uses the symbol M and so M will be used to represent the mean in this book (in most cases a bar above a variable will indicate “mean”).
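The three forms of Equation 4-2 can be checked against one another numerically. The sketch below reuses the five scores from the percentile rank example; the variable names are our own.

```r
# The five scores from the percentile rank example.
scores <- c(97, 102, 105, 110, 135)
n      <- length(scores)

m_sum_of_ratios <- sum(scores / n)        # sum of (X_i / n)
m_scaled_sum    <- (1 / n) * sum(scores)  # (1 / n) times the sum of X_i
m_ratio_of_sum  <- sum(scores) / n        # (sum of X) / n

print(c(m_sum_of_ratios, m_scaled_sum, m_ratio_of_sum, mean(scores)))
```

All three expressions give the same value as R's built-in `mean()`, apart from possible floating-point rounding in the last decimal places.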

Population Mean

The Greek letter mu, μ (pronounced “myoo”), is used to represent the population mean. The equation for the population mean is:

Equation 4-3

\[ \begin{equation} \mu = \sum \frac X N \tag{4-3} \end{equation} \]

N represents the total number of units (observations) in the population. To remember the relationship between the sample mean M and the population mean μ, remember that μ is the Greek letter for “m”.

The mean is commonly used as the measure of the center of parametric experimental data (that is, data such as that shown in Table 4.1), because it has statistical properties that are meaningful with such data. We will study these properties in chapters 5, 10 and 14. However, the mean is not always the best measure of central tendency to be used when descriptive statistics are used by themselves. This is because the mean is strongly influenced by extreme scores. For example, consider the salaries of the following five people who work for company Z:

Employee Salary

The average salary for company Z is ($1,060,000 / 5) or $212,000. Here, the mean is a very poor measure of the center of the distribution.

Because the mean is strongly affected by extreme scores, it is recommended for use with data where such scores do not occur, that is, with data where the scores have a symmetric distribution. A distribution is said to be symmetric when scores that are equidistant from the mean (that is, above and below) have the same frequencies. The continuous distribution shown in Figure 4f is symmetric.

Median

The median is that score above which 50% of the scores fall and below which 50% of the scores fall; that is, the median is the 50th percentile. The median is the measure of central tendency that is routinely used in nonparametric studies involving ordinal and sometimes interval data. We will use the abbreviation Md to represent the sample median and the letters Mdpop to represent the continuous distribution population median.

The median is an excellent measure of central tendency because it is not strongly affected by extreme scores. For example, in the preceding set of employee salaries the median is $16,000. Therefore, the median is a good measure of central tendency when the distribution of scores is skewed. A skewed distribution is one where a group of scores is bunched on one side of the x-axis with scores on the other side of the x-axis having relatively low frequencies. The salaries represent a discrete skewed distribution. A continuous skewed distribution is shown in Figure 4g. We will discuss skewness and how it may be measured in chapter 5.
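The different sensitivity of the mean and the median to an extreme score can be sketched with a set of salaries. The five values below are hypothetical: they were chosen only so that they reproduce the total ($1,060,000) and the median ($16,000) quoted in the text, and the actual salaries for company Z may differ.

```r
# Hypothetical salaries chosen only to match the total ($1,060,000) and
# median ($16,000) quoted in the text; the actual values may differ.
salary <- c(12000, 14000, 16000, 18000, 1000000)

mean_salary   <- mean(salary)    # pulled upward by the one extreme salary
median_salary <- median(salary)  # depends only on the middle score

print(c(mean = mean_salary, median = median_salary))
```

Four of the five hypothetical employees earn far less than the mean, while the median sits in the middle of the typical salaries.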

In this text, we define the median of a set of ranked scores as the middle score if the number of scores is odd, or the average of the middle two scores if the number of scores is even. For example, consider the following set of five scores: 97, 110, 102, 105, 135. We can calculate the middle rank as (N+1)/2. The score at the middle (or median) rank is therefore the median score. Note the difference between the median rank and the median score. The median score is what we call the median.

We would find the median of these scores by first arranging them in order of magnitude and then finding the middle score (i.e., the score at sorted rank of (5+1)/2, which is rank 3, or the third score):

\[ \begin{align} & 97 \\ & 102 \\ & 105 \text{ <--- Median (Md) = 105} \\ & 110 \\ & 135 \end{align} \]

If we add the score 146 to this list, we will have an even number of scores; and the median is found as the average of the middle two scores (i.e., the score at sorted rank of (6+1)/2, which is rank 3.5, which means we average the third and fourth scores to obtain a score between the ranks of 3 and 4):

\[ \begin{align} & 97 \\ & 102 \\ & 105 \\ & \text{ <--- Median (Md) = (105 + 110) / 2 = 107.5} \\ & 110 \\ & 135 \\ & 146 \end{align} \]
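The rank-based definition of the median can be sketched in R, the language behind the jamovi syntax shown in this chapter. This is a minimal illustration, not a jamovi procedure: the helper name `median_by_rank` is our own invention, and the result is compared with R's built-in `median()` only as a check.

```r
# Hypothetical helper: median as the score at rank (N + 1) / 2 of the sorted scores
median_by_rank <- function(scores) {
  sorted <- sort(scores)
  n <- length(sorted)
  middle_rank <- (n + 1) / 2
  # floor() and ceiling() give the same rank when N is odd
  # and the two middle ranks when N is even
  mean(sorted[c(floor(middle_rank), ceiling(middle_rank))])
}

five_scores <- c(97, 110, 102, 105, 135)
six_scores <- c(five_scores, 146)

md_five <- median_by_rank(five_scores)  # middle rank 3   -> 105
md_six <- median_by_rank(six_scores)    # middle rank 3.5 -> 107.5
```

For both sets of scores, the helper agrees with `median(five_scores)` and `median(six_scores)`.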

What is the median for the subjects in the Yes-Pill group? Yes-Pill median = (215 + 220) / 2 = 217.5. Note that this value differs from the value of 203.928, which was the estimated median (50th percentile) found from the cumulative frequency distribution in figure 4e. The two values differ because 217.5 is the actual sample median and 203.928 is its estimate based on grouped data. The most common definition of the median used by researchers and statistical software is simply the middle score described above.

Mode

The mode is the most frequently occurring score in a distribution of scores. The mode is frequently used as the measure of central tendency with nominal data because it generally makes no sense to average or rank order such data. A distribution of scores with two modes is called bimodal. A distribution that has many modes is called multimodal. The mode is not used as often as the mean or the median as a measure of central tendency because it is not as stable as these other measures. The mode can change dramatically with a shift in a few scores. For example, in the cholesterol data for the Yes-Pill group the mode is 210, but the addition of two more scores of 198 would make 198 the mode.
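The instability of the mode can be checked with a short R sketch. The scores are the Yes-Pill cholesterol values listed in Figure 4.4l; the helper name `sample_modes` is hypothetical, and base R's `table()` is used here as one possible way to tally frequencies.

```r
# Yes-Pill cholesterol scores, as listed in Figure 4.4l
yes_pill <- c(150, 158, 179, 195, 198, 198, 210, 210, 210, 215,
              220, 243, 247, 250, 253, 260, 263, 272, 297, 330)

# Hypothetical helper: returns every score tied for the highest frequency
sample_modes <- function(scores) {
  counts <- table(scores)
  as.numeric(names(counts)[counts == max(counts)])
}

mode_before <- sample_modes(yes_pill)              # 210 occurs three times
mode_after <- sample_modes(c(yes_pill, 198, 198))  # 198 now occurs four times
```

Adding only two scores moves the mode from 210 to 198, while the mean and median of the same data would shift far less.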

When a grouped frequency distribution is being considered, the mode is estimated to be the midpoint of the class interval with the largest tally. The midpoint of any class interval may be found by adding the largest class limit to the smallest class limit and dividing the result by 2.

For example, based on the histogram in figure 4a, the mode would have to be estimated from bar number 4 because this bar has the largest frequency. Bar number 4 is called the modal interval. Bar number 4 is based on the interval 203.928 - 221.964; therefore, its midpoint is found as:

\[ \text{Mode} = (203.928 + 221.964)/2 = 212.946 \]

Note that, based on the actual scores, we know that the mode of the sample is 210. The arbitrary selection of ten intervals led to an estimate of the mode that was different (212.946).

Skewed Distributions: Mean, Median, and Mode

In a skewed distribution where the scores are bunched on the left-hand side of the x-axis, the distribution is referred to as being positively skewed. In a positively skewed distribution, the measures of central tendency are generally found as displayed in Figure 4h(i). The mode is found as the score with the highest frequency; the median is unaffected by extreme scores and is found next; and the mean is pulled in the direction of the extreme scores.

In a skewed distribution where the scores are bunched on the right-hand side of the x-axis, the distribution is referred to as being negatively skewed. In a negatively skewed distribution, the measures of central tendency reverse their order on the x-axis from that for the positively skewed distribution. A negatively skewed distribution is displayed in Figure 4h(ii). The mean is pulled in the direction of the extreme scores; the median is found next; and the mode is found as the score with the highest frequency. We will discuss skewed distributions in more detail in chapter 5.
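As a rough check of this ordering, the following R sketch computes the three measures for the Yes-Pill cholesterol scores listed in Figure 4.4l, a sample with a small positive skewness. It is only an illustration of the general relationship: with real data, and especially with small samples, the order mode < median < mean is a tendency rather than a guarantee.

```r
# Yes-Pill cholesterol scores, as listed in Figure 4.4l
yes_pill <- c(150, 158, 179, 195, 198, 198, 210, 210, 210, 215,
              220, 243, 247, 250, 253, 260, 263, 272, 297, 330)

counts <- table(yes_pill)
sample_mode <- as.numeric(names(counts)[which.max(counts)])  # most frequent score
sample_median <- median(yes_pill)
sample_mean <- mean(yes_pill)

# Ordering generally expected for a positively skewed distribution
positive_order <- sample_mode < sample_median && sample_median < sample_mean
```

Here the mode (210) lies below the median (217.5), which lies below the mean (227.9), so `positive_order` is `TRUE`.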

Figure 4.4h(i) Positively skewed distribution with general relationships between mean, median, and mode

Figure 4.4h(ii) Negatively skewed distribution with general relationships between mean, median, and mode

Figure 4.4i(i) Summary statistics for Cholesterol for Yes-Pill treatment group

jmv::descriptives(
    data = data,
    vars = Cholesterol,
    mode = TRUE,
    variance = TRUE,
    range = TRUE,
    se = TRUE,
    ci = TRUE,
    iqr = TRUE,
    skew = TRUE,
    kurt = TRUE,
    sw = TRUE)

 DESCRIPTIVES

 Descriptives                               
 ────────────────────────────────────────── 
                              Cholesterol   
 ────────────────────────────────────────── 
   N                                   20   
   Missing                              0   
   Mean                            227.90   
   Std. error mean                 10.069   
   95% CI mean lower bound         206.83   
   95% CI mean upper bound         248.97   
   Median                          217.50   
   Mode                            210.00   
   Standard deviation              45.029   
   Variance                        2027.6   
   IQR                             56.750   
   Range                           180.00   
   Minimum                         150.00   
   Maximum                         330.00   
   Skewness                       0.35438   
   Std. error skewness            0.51210   
   Kurtosis                       0.11887   
   Std. error kurtosis            0.99238   
   Shapiro-Wilk W                 0.97530   
   Shapiro-Wilk p                 0.86035   
 ────────────────────────────────────────── 
   Note. The CI of the mean assumes
   sample means follow a t-distribution
   with N - 1 degrees of freedom
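The standard error and confidence interval in this output can be reproduced by hand from the note's description. The base R sketch below uses the Yes-Pill scores listed in Figure 4.4l. It assumes the usual formulas (standard error = SD / √N, and mean ± t × standard error with N - 1 degrees of freedom); it is not a transcription of jamovi's internal code.

```r
# Yes-Pill cholesterol scores, as listed in Figure 4.4l
yes_pill <- c(150, 158, 179, 195, 198, 198, 210, 210, 210, 215,
              220, 243, 247, 250, 253, 260, 263, 272, 297, 330)

n <- length(yes_pill)
se_mean <- sd(yes_pill) / sqrt(n)   # standard error of the mean
t_crit <- qt(0.975, df = n - 1)     # two-sided 95% critical value of t
ci_mean <- mean(yes_pill) + c(-1, 1) * t_crit * se_mean  # lower and upper bounds
```

Rounded as in the table, `se_mean` is 10.069 and `ci_mean` is 206.83 to 248.97.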

Figure 4.4i(ii) Summary Statistics (horizontal) for Cholesterol for Yes-Pill group

jmv::descriptives(
    data = data,
    vars = Cholesterol,
    desc = "rows")

 DESCRIPTIVES

 Descriptives                                                                         
 ──────────────────────────────────────────────────────────────────────────────────── 
                  N     Missing    Mean      Median    SD        Minimum    Maximum   
 ──────────────────────────────────────────────────────────────────────────────────── 
   Cholesterol    20          0    227.90    217.50    45.029     150.00     330.00   
 ──────────────────────────────────────────────────────────────────────────────────── 

Figure 4.4i(iv) Summary Statistics for Cholesterol for No-Pill group

jmv::descriptives(
    data = data,
    vars = Cholesterol,
    mode = TRUE,
    variance = TRUE,
    range = TRUE,
    se = TRUE,
    ci = TRUE,
    iqr = TRUE,
    skew = TRUE,
    kurt = TRUE,
    sw = TRUE)

 DESCRIPTIVES

 Descriptives                               
 ────────────────────────────────────────── 
                              Cholesterol   
 ────────────────────────────────────────── 
   N                                   20   
   Missing                              0   
   Mean                            219.40   
   Std. error mean                 10.254   
   95% CI mean lower bound         197.94   
   95% CI mean upper bound         240.86   
   Median                          221.50   
   Mode                            200.00   
   Standard deviation              45.856   
   Variance                        2102.8   
   IQR                             71.250   
   Range                           174.00   
   Minimum                         146.00   
   Maximum                         320.00   
   Skewness                       0.21608   
   Std. error skewness            0.51210   
   Kurtosis                      -0.48332   
   Std. error kurtosis            0.99238   
   Shapiro-Wilk W                 0.96307   
   Shapiro-Wilk p                 0.60689   
 ────────────────────────────────────────── 
   Note. The CI of the mean assumes
   sample means follow a t-distribution
   with N - 1 degrees of freedom

Figure 4.4j Descriptive summary statistics for Cholesterol by Group (using a Split By approach)

jmv::descriptives(
    formula = Cholesterol ~ Group,
    data = data,
    desc = "rows")

 DESCRIPTIVES

 Descriptives                                                                                       
 ────────────────────────────────────────────────────────────────────────────────────────────────── 
                  Group         N     Missing    Mean      Median    SD        Minimum    Maximum   
 ────────────────────────────────────────────────────────────────────────────────────────────────── 
   Cholesterol    1 Yes-Pill    20          0    227.90    217.50    45.029     150.00     330.00   
                  0 No-Pill     20          0    219.40    221.50    45.856     146.00     320.00   
 ────────────────────────────────────────────────────────────────────────────────────────────────── 

Figure 4.4k Nonparametric (range) statistics for Cholesterol by Group (using a Split File approach)

jmv::descriptives(
    formula = Cholesterol ~ Group,
    data = data,
    desc = "rows",
    mean = FALSE,
    median = FALSE,
    min = FALSE,
    max = FALSE,
    sd = FALSE,
    range = TRUE,
    iqr = TRUE,
    pc = TRUE)

 DESCRIPTIVES

 Descriptives                                                                                     
 ──────────────────────────────────────────────────────────────────────────────────────────────── 
                  Group         N     Missing    IQR       Range     25th      50th      75th     
 ──────────────────────────────────────────────────────────────────────────────────────────────── 
   Cholesterol    1 Yes-Pill    20          0    56.750    180.00    198.00    217.50    254.75   
                  0 No-Pill     20          0    71.250    174.00    183.75    221.50    255.00   
 ──────────────────────────────────────────────────────────────────────────────────────────────── 
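The Yes-Pill row of this table can also be reproduced in base R from the scores listed in Figure 4.4l. This sketch relies on R's default interpolation rule for `quantile()` (type 7); we assume, but have not confirmed, that jamovi uses the same rule in general. For this sample the values agree with the table.

```r
# Yes-Pill cholesterol scores, as listed in Figure 4.4l
yes_pill <- c(150, 158, 179, 195, 198, 198, 210, 210, 210, 215,
              220, 243, 247, 250, 253, 260, 263, 272, 297, 330)

# 25th, 50th, and 75th percentiles using R's default (type 7) interpolation
quartiles <- quantile(yes_pill, probs = c(0.25, 0.50, 0.75))

iqr_yes <- IQR(yes_pill)                     # 75th percentile minus 25th percentile
range_yes <- max(yes_pill) - min(yes_pill)   # maximum minus minimum
```

The quartiles are 198, 217.5, and 254.75, the IQR is 56.75, and the range is 180.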

Figure 4.4l Median found from the ordered scores of Yes-Pill treatment group

     ID Cholesterol
1  1797         150
2  2271         158
3  2522         179
4  1251         195
5  0796         198
6  2871         198
7  0225         210
8  2349         210
9  1326         210
10 0736         215
11 0906         220
12 0828         243
13 0139         247
14 3349         250
15 0059         253
16 1637         260
17 1291         263
18 0544         272
19 0473         297
20 0494         330

\[ \begin{align} & ID && Cholesterol \\ & 1797 && 150 \\ & 2271 && 158 \\ & 2522 && 179 \\ & 1251 && 195 \\ & 0796 && 198 \\ & 2871 && 198 \\ & 0225 && 210 \text{ <--- Three} \\ & 2349 && 210 \text{ <--- Multiple} \\ & 1326 && 210 \text{ <--- Modes} \\ & 0736 && 215 \\ & && \text{ <--- Median (Md) = (215+220)/2 = 217.5 (10 above, 10 below)} \\ & 0906 && 220 \\ & 0828 && 243 \\ & 0139 && 247 \\ & 3349 && 250 \\ & 0059 && 253 \\ & 1637 && 260 \\ & 1291 && 263 \\ & 0544 && 272 \\ & 0473 && 297 \\ & 0494 && 330 \end{align} \]

SUMMARY

This chapter discussed an experiment where units (women) were randomly sampled from a population in a large city. These units were then randomly assigned to two levels of an independent variable (herbal supplement pill consumption), which was measured on a nominal scale. The research problem and research hypothesis concerned the relationship between the independent variable and the dependent variable, cholesterol level, which was measured on a ratio scale. A preliminary descriptive analysis indicated that the two groups in the experiment had similar histograms, but that their means (Yes-Pill = 227.9; No-Pill = 219.4) differed by 8.5 points. In chapter 14, we will use a statistical test to determine if this difference is statistically significant. Interestingly, the medians of the two groups also differed (Yes-Pill = 217.5; No-Pill = 221.5), but in the opposite direction. What do you think this difference between the direction of mean and median differences indicates about the results of the experiment? (The difference in the direction of the measures of central tendency is an indication that there is no significant difference between the two groups. We will find this to be true when we analyze this data in exercise 5 of chapter 14.)

Descriptive statistics play an important role in experimental analyses where they are used to describe the data in each group of an experiment. Groups in an experiment should have histograms that are similar in shape when they are based on variables measured prior to an experiment and for the dependent variable. However, while the measures of central tendency should be approximately equal on variables measured prior to an experiment, they may not be approximately equal for the dependent variable, which is measured after the experiment. The next chapter discusses descriptive measures of the dependent variable that the groups in an experiment should not differ on. In the exercises for this chapter, examples of studies are presented where descriptive statistics are the only tools used.

Many new statistical terms were incorporated in the discussion of the experiment and in the descriptive analysis of the experimental data. These terms are formally defined in the glossary. You should memorize these terms and then reread the preceding presentation. These terms will be used frequently in this text, and it is important that you know and understand them.

Because jamovi is a useful tool for calculating descriptive statistics, you should also make sure you understand how to use jamovi procedures to obtain the statistics discussed in this chapter.

Chapter 4 Appendix A Study Guide for Descriptive Statistics

For the STUDY GUIDES, you are not required to submit anything (and therefore may work with anyone). The items in this STUDY GUIDE serve mostly as a list of possible skills you will need for Quantitative Research. However, please understand that it is not just about being able to do these things in jamovi; you will need to be able to: (a) obtain and/or find the right output needed to answer questions, (b) in some circumstances, calculate results by hand or using some other program (where jamovi does not provide us everything we need), (c) interpret (make sense of) the results in the output you create or calculate, and (d) understand the statistical concepts we have covered related to the output and results (note that this course typically requires thinking at all levels of the revised Bloom’s Taxonomy (2001): remember, understand, apply, analyze, evaluate, and create).

You are encouraged to be confident that you can do everything in the STUDY GUIDE sections (unless you are notified otherwise in class and/or on Blackboard). In particular, you should go through all these items for each topic after you have run the analyses given at appropriate times during the semester.

Each STUDY GUIDE section begins with the analyses that items address in the section. Note that you should be able to respond to all items based on ANY output that includes the necessary information. For example, you should be able to calculate or identify the MEAN from any output that reports that MEAN or the information necessary to calculate that MEAN.

In the STUDY GUIDE I refer to variables generically (e.g., V, W, X and Y) rather than to any specific variables. I have tried to be consistent in using V and W for CATEGORICAL variables, and X and Y for SCALE variables. When you run these analyses and respond to the items, you will need to choose particular variables in your dataset to use as V, W, X, and Y (and whatever other variables may also be included in the section). For example, use whatever variables you have chosen for X and Y, not necessarily variables actually named X and Y.

In all places where you see imperative verbs (e.g., Report, Interpret, Show, Explain, Describe) you should ensure that you are confident that you ARE ABLE TO DO these things (because you are not actually assigned to do them for the STUDY GUIDE).

SECTION 1: Summarizing a Categorical Variable

Analyses to Run

• Use a CATEGORICAL variable V
• Use a CATEGORICAL variable W
• Obtain frequencies for V and W independently

Using the output, respond to the following items

  1. Report and interpret the FREQUENCIES of each category of the W variable
  2. Report and interpret the RELATIVE FREQUENCIES (i.e., percent) of each category of the W variable
  3. Interpret a PIE CHART and/or a BAR CHART for the W variable
  4. Report the most frequent category for both V and W
  5. Report whether there are any missing values for either V or W

SECTION 2: Summarizing TWO Categorical Variables with CROSSTABS

Analyses to Run

• Use a CATEGORICAL variable V
• Use a CATEGORICAL variable W
• Obtain cross-tabulated frequencies for V and W

Using the output, respond to the following items

  1. Report and interpret the FREQUENCY of cases in one of the cells (e.g., the lowest V group and the lowest W group)
  2. Report and interpret the PROPORTION of TOTAL cases in the same cell as the previous item (e.g., the lowest V group and the lowest W group)
  3. Report and interpret the FREQUENCY of cases in a different cell (e.g., the lowest V group and the highest W group)
  4. Report and interpret the PROPORTION of cases WITHIN V for the same cell as the previous item (e.g., the lowest V group and the highest W group)
  5. Report and interpret the FREQUENCY of cases in a different cell (e.g., the highest V group and the lowest W group)
  6. Report and interpret the PROPORTION of cases WITHIN W for the same cell as the previous item (e.g., the highest V group and the lowest W group)

SECTION 3: Summarizing a Scale Variable

Analyses to Run

• Use a SCALE variable Y
• Obtain descriptive statistics for Y

Using the output, respond to the following items

  1. Report all Central Tendency statistics (also called Location) for the variable Y

  2. Report all Dispersion statistics (also called Variability or Spread) for the variable Y

  3. Report all Shape statistics (e.g., Skewness and Kurtosis) for the variable Y

  4. Report and interpret the MODE (i.e., modal value not modal interval) for Y

  5. Report and interpret the FREQUENCY of Y scores for the MODE

  6. Report and interpret the RELATIVE FREQUENCY of Y scores for the MODE

  7. Report and interpret the CUMULATIVE FREQUENCY of Y scores at or below the mode reported by FREQUENCIES

  8. Report and interpret the CUMULATIVE PERCENT of Y scores at or below the mode reported by FREQUENCIES

  9. Report and interpret the Y score at the 40th PERCENTILE

  10. Report and interpret the Y score at the 80th PERCENTILE

  11. Report and interpret the QUARTILES for the variable Y scores

Using the HISTOGRAM output, respond to the following items

  1. Report and interpret the MODAL INTERVAL
  2. Report and interpret the MINIMUM and MAXIMUM limits for the interval that contains the MODE
  3. Report and interpret the MIDPOINT for the interval that contains the MODE
  4. Report and interpret the GROUPED FREQUENCY and PERCENTAGE of scores in the interval that contains the MODE
  5. Describe the CENTRAL TENDENCY, DISPERSION, and SHAPE of the distribution of Y

SECTION 4: Summarizing a Scale Variable

Analyses to Run

• Use a SCALE variable Y.
• COMPUTE the deviation scores for Y (e.g., DEVY = Y – Mean_Y)
• COMPUTE the squared deviation scores for Y (e.g., DEVYSQ = DEVY * DEVY)
• RANK the variable Y by Assigning Rank 1 to SMALLEST VALUE
• CALCULATE standardized values for Y
• Run descriptive analyses for:
  (a) Y
  (b) Y deviation scores
  (c) squared Y deviation scores
  (d) Ranked variable for Y
  (e) Standardized variable for Y (i.e., z score)
• Sort by Y (Ascending)
• Report in sorted order
  o Y
  o the Y deviation scores
  o the squared Y deviation scores
  o the variable for Y RANKED
  o the standardized values for Y

Using the descriptive and case report output, respond to the following items

  1. Report all Central Tendency statistics (also called Location) for the variable Y

  2. Report all Dispersion statistics (also called Variability or Spread) for the variable Y

  3. Report all Shape statistics (e.g., Skewness and Kurtosis) for the variable Y

  4. Report and interpret the MEAN for the variable Y

  5. Show or explain how the MEAN for variable Y is calculated from the SUM for variable Y

  6. Report and interpret the rank for any case on variable Y

  7. Show or explain how the MEDIAN for variable Y is calculated.

  8. Explain why you might choose to report the MEDIAN instead of the mean as the statistic for central tendency.

  9. Explain why you might choose to report the RANGE instead of the Standard Deviation as the statistic for dispersion (variation).

  10. Show or explain how the DEVIATION SCORE for any case is calculated

  11. Show or explain how the STANDARDIZED (ZY) SCORE for any case is calculated, using ZY = (Y – MY)/SY

  12. Show or explain how to calculate the ORIGINAL SCORE for any case from its standardized (ZY) score, using Y = MY + (ZY * SY)

  13. Report and interpret the SUM of the deviation scores for the variable Y

  14. Report and interpret the SUM of the squared deviation scores for the variable Y

  15. Report and interpret the MINIMUM and MAXIMUM for variable Y

  16. Show or explain how the RANGE is calculated for variable Y

  17. Report and interpret the STANDARD DEVIATION for the variable Y

  18. Report and interpret the VARIANCE for the variable Y

  19. Show or explain how the VARIANCE for variable Y is calculated from the SUM of the squared deviations for variable Y (use N-1 in the variance calculation)

  20. Show or explain how the STANDARD DEVIATION for variable Y is calculated from the VARIANCE for variable Y

  21. Explain the Mean and SD for the Y deviation scores (recall that SD = Standard Deviation and is often abbreviated as SY for the Y variable or SX for the X variable)

  22. Explain the Mean and SD for the Y standardized scores (recall that SD = Standard Deviation and is often abbreviated as SY for the Y variable or SX for the X variable)

  23. Report all Central Tendency statistics (also called Location) for the variable Y

  24. Report all Dispersion statistics (also called Variability or Spread) for the variable Y

  25. Report all Shape statistics (e.g., Skewness and Kurtosis) for the variable Y

  26. Report and interpret the INTERQUARTILE RANGE for the variable Y

  27. Show or explain how to calculate the INTERQUARTILE RANGE for the variable Y scores

  28. Explain why you might choose to report the Interquartile Range instead of the Standard Deviation or the Range as the statistic for dispersion (variation).

  29. Report the 5 LARGEST values for the variable Y

  30. Report and interpret the 90th PERCENTILE for the variable Y scores

  31. Interpret the BOXPLOT for the variable Y

  32. Interpret the HISTOGRAM for the variable Y

  33. Interpret the Q-Q PLOT for the variable Y

  34. Report and interpret the SKEWNESS for the variable Y

  35. Report and interpret the KURTOSIS for the variable Y

  36. Report and interpret the NORMALITY for the variable Y (not statistical significance)
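The computations named in the items above (deviation scores, squared deviations, variance with N - 1, standard deviation, ranks, and standardized scores) can be sketched in base R. This is only an illustration outside of jamovi. It uses the Yes-Pill cholesterol scores from Figure 4.4l as a stand-in for your own variable Y, and the object names are our own choices. Note that `rank()` assigns tied scores their average rank by default, which is an assumption about how ties should be handled.

```r
# Stand-in for a SCALE variable Y: Yes-Pill cholesterol scores from Figure 4.4l
y <- c(150, 158, 179, 195, 198, 198, 210, 210, 210, 215,
       220, 243, 247, 250, 253, 260, 263, 272, 297, 330)

dev_y <- y - mean(y)               # deviation scores (DEVY = Y - Mean_Y)
dev_y_sq <- dev_y * dev_y          # squared deviation scores (DEVYSQ)
ss_y <- sum(dev_y_sq)              # sum of squared deviations
var_y <- ss_y / (length(y) - 1)    # variance, using N - 1
sd_y <- sqrt(var_y)                # standard deviation from the variance
z_y <- (y - mean(y)) / sd_y        # standardized scores, ZY = (Y - MY) / SY
rank_y <- rank(y)                  # rank 1 = smallest value; ties get average ranks
y_back <- mean(y) + z_y * sd_y     # original scores recovered, Y = MY + (ZY * SY)
```

The deviation scores sum to zero (apart from rounding error), the standardized scores have a mean of 0 and a standard deviation of 1, and `sd_y` matches the standard deviation of 45.029 reported for the Yes-Pill group.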

Describe what the output (both numerically and graphically) would look like for each of the following. You may use examples from your data to illustrate the shapes requested (if you have variables with data that fit the description relatively closely).

  1. symmetric distribution
  2. positively skewed distribution
  3. negatively skewed distribution
  4. mesokurtic distribution
  5. leptokurtic distribution
  6. platykurtic distribution
  7. bimodal distribution
  8. a distribution with obvious outliers
  9. normal distribution

SECTION 6: Summarizing two Scale Variables together

Analyses to Run

• Use a SCALE variable Y
• Use a SCALE variable X
• Run descriptive statistics with BOTH variables (Y and X) together

Using the output, respond to the following items

  1. Report and interpret the MEAN and STANDARD DEVIATION for both Y and X
  2. Report which variable, Y or X, had the larger median
  3. Report which variable, Y or X, had the smaller minimum
  4. Report whether either variable, Y or X (or both), has OUTLIERS (and how you determine this)
  5. Compare the SHAPES (i.e., symmetry, kurtosis) of the variables Y and X

SECTION 7: Summarizing a Scale Variable across Levels/Groups/Sub-samples

Analyses to Run

• Use a CATEGORICAL variable W
• Use a SCALE variable Y
• Run descriptive statistics with W as a grouping variable and/or by splitting the file into groups

Using the output, respond to the following items

  1. Report and interpret the FREQUENCY (N), MEAN, and STANDARD DEVIATION of variable Y for each W level/group
  2. Report the HIGHEST and LOWEST Y scores in each W level/group
  3. Which W level/group has the highest MEAN for Y?
  4. Report whether either W level/group has OUTLIERS for variable Y (and how you determine this)
  5. Compare the SHAPES of the variable Y in the levels/groups

SECTION 9: Summarizing a Scale Variable in just ONE Level/Group/Sub-sample

• Use ONE GROUP of the CATEGORICAL variable W
• Use a SCALE variable Y

Using the SELECTED CASES DESCRIPTIVES output, respond to the following items

  1. Report and interpret the FREQUENCY, MEAN, and STANDARD DEVIATION of variable Y for the W = 1 level/group

SECTION 10: Graphs, variables, and distributions

Analyses to Run

• Use a CATEGORICAL variable W
• Use a SCALE variable Y
• Use a SCALE variable X
• Create paneled or grouped boxplots, paneled or grouped error bar plots, bar charts with confidence intervals, and paneled histograms

Using this GRAPHS output, respond to the following items

  1. Describe ANY outliers you see in ANY of the graphs (whether univariate or bivariate)
  2. Describe ANY non-normality you see in ANY of the graphs (whether univariate or bivariate)