In this chapter, we will begin our study of inferential statistics by considering its cornerstone, the random sample. We will examine three methods of selecting a random sample, and we will consider a theoretical distribution known as the sampling distribution. We will also consider the role the sampling distribution plays in determining the properties of statistics that are considered to be good estimates of their population parameters. Here, we will define and illustrate criteria by which statistics that are known as point estimates are judged.
Your jamovi objective will be to generate a list of N random numbers from which you will use the first n numbers to select a random sample.
In chapter 4, a random sample was defined as a sample of n units from a population of N units, where each of the possible samples of n units has the same probability of being selected. Random sampling has two purposes in experiments. It allows us to make inferences about population parameters based on sample statistics, and it controls pretreatment systematic differences between the units allocated to different treatments. In the following sections, we describe three methods for selecting a random sample from a finite population. In each case, we assume that the population has been defined, that each unit (for example, person) in the population has been numbered, and that the units selected at random will form a simple random sample. The numbered list of population units is called an accessible population or, more commonly, a sampling frame.
The following steps describe what might be called a “layman’s” method of selecting a random sample.
One problem with this method of selecting a random sample is that the slips of paper tend to stick together. When this happens, the sample that is selected will not be a random sample. This problem occurred during the Vietnam War draft lottery, when the birth dates of young men eligible for the draft were put on plastic balls. These balls were put into a bowl and then shuffled, and the young men were drafted in the order that their birth dates were drawn from the bowl. It was discovered, however, that once a date from one month was selected, an unusually large number of dates from that month followed. This indicated that the sampling method was not yielding a truly random sample of birth dates. A computer-generated method of drawing a random sample was then adopted.
Table 10.1 contains a random collection of the digits 0 through 9 arranged in groups of 9. Each individual digit is randomly generated; the grouping into 9-digit blocks is only for readability. Because each digit is random, you can also form random 2-digit, 3-digit, 4-digit, and larger numbers by combining adjacent digits. The following steps show how to select a random sample of 10 subjects from a population of 100 subjects using Table 10.1.
Row | Column1 | Column2 | Column3 | Column4 | Column5 | Column6 | Column7 |
---|---|---|---|---|---|---|---|
1 | 101933318 | 905711576 | 671522731 | 108278632 | 438427426 | 919306592 | 539186364 |
2 | 674788178 | 438475141 | 119343377 | 626120793 | 282029804 | 279955782 | 807968849 |
3 | 852286888 | 109790763 | 855296230 | 306052532 | 671918934 | 446408478 | 552601949 |
4 | 635172620 | 637847535 | 318490860 | 936689237 | 998810016 | 385507061 | 450243822 |
5 | 819035120 | 383364335 | 311977125 | 197698225 | 395595207 | 268246391 | 530128931 |
6 | 201006283 | 294389988 | 652050753 | 909936874 | 138768344 | 443431671 | 121174750 |
7 | 115682247 | 884502614 | 064866906 | 853708718 | 685719367 | 346331449 | 014700711 |
8 | 886112681 | 953650363 | 605938044 | 790133197 | 636019472 | 345755195 | 266106203 |
9 | 725401822 | 064510586 | 846789078 | 453868392 | 401090001 | 618616979 | 304767833 |
10 | 807565837 | 062639585 | 889965096 | 507219732 | 256882474 | 225256838 | 541426545 |
11 | 327489316 | 182112333 | 461354158 | 930220027 | 368283670 | 310786904 | 773878600 |
12 | 331399171 | 772452873 | 573858998 | 464923955 | 981713218 | 305581352 | 864046964 |
13 | 635551285 | 132472875 | 743362628 | 524340887 | 080944058 | 486383376 | 777353849 |
14 | 771017636 | 512898727 | 879681051 | 458204319 | 572248299 | 508832986 | 265536543 |
15 | 939949716 | 607507358 | 820094882 | 372900166 | 402798895 | 321403201 | 865435153 |
16 | 990727226 | 307341312 | 251346956 | 704626702 | 872273400 | 033053599 | 276143660 |
17 | 000377442 | 176452501 | 872189314 | 279276567 | 693179186 | 589556407 | 462106132 |
18 | 478807807 | 534842576 | 066962177 | 049322335 | 515537151 | 412818445 | 020955777 |
19 | 100486257 | 714813518 | 329671352 | 797625534 | 393871273 | 762891649 | 743720164 |
20 | 801288852 | 387305354 | 433075084 | 095646566 | 106899139 | 375583663 | 098070750 |
21 | 338412564 | 095343641 | 714610386 | 964260504 | 004133151 | 222475642 | 077873557 |
22 | 376037848 | 341173985 | 079986646 | 784481796 | 108582201 | 676311486 | 723148437 |
23 | 469888202 | 970606163 | 506385357 | 070536180 | 781226492 | 549994861 | 080771457 |
24 | 403316982 | 635898235 | 566528112 | 251188380 | 180035513 | 715611739 | 007980888 |
25 | 199779357 | 605976429 | 526513420 | 538564205 | 998723011 | 286539617 | 515564055 |
26 | 014223073 | 441082761 | 549387434 | 178440249 | 752565032 | 843527323 | 419731919 |
27 | 341913548 | 746369269 | 826527253 | 705660700 | 353202136 | 100483117 | 969108222 |
28 | 626884804 | 015887409 | 392746106 | 479768693 | 285104477 | 358732383 | 607682630 |
29 | 269906390 | 061742638 | 101490118 | 320193932 | 293369157 | 692981470 | 945575141 |
30 | 613236639 | 106041713 | 707311287 | 906243223 | 701949868 | 065016309 | 771244403 |
31 | 131523374 | 649063230 | 814408793 | 217861931 | 041006439 | 947854668 | 673607627 |
32 | 527809157 | 995961681 | 142221938 | 205626278 | 350822853 | 771559654 | 105434012 |
33 | 474224383 | 121453978 | 002778731 | 653328151 | 672922026 | 702346771 | 818633028 |
34 | 160931627 | 125490591 | 254660148 | 927046225 | 961276565 | 140828585 | 268392333 |
35 | 842925948 | 125134134 | 204003422 | 823347552 | 327574625 | 573817917 | 817092039 |
36 | 966441348 | 439762091 | 831935981 | 477940167 | 538031844 | 911010397 | 601858259 |
37 | 359568539 | 954020176 | 186342940 | 832259394 | 605614816 | 033000464 | 369846405 |
38 | 473393191 | 807938060 | 912144713 | 198265128 | 082052685 | 906312064 | 942904108 |
39 | 391406612 | 622964900 | 830862122 | 246992036 | 624596058 | 046264815 | 361553151 |
40 | 118065552 | 541246696 | 690344589 | 089960450 | 282576926 | 871278767 | 736304521 |
41 | 724649810 | 772285811 | 903458917 | 800892057 | 801052863 | 424530086 | 734548304 |
42 | 501834881 | 099858580 | 901359898 | 849805082 | 216321002 | 663969277 | 913510852 |
43 | 183208200 | 844938544 | 474828413 | 777514273 | 890882186 | 346320580 | 596376308 |
44 | 663259921 | 215505831 | 854961955 | 604998708 | 860630190 | 044503747 | 614740697 |
45 | 543708555 | 829856999 | 256388623 | 476380286 | 066796475 | 249614066 | 324314161 |
46 | 075973299 | 625185879 | 777269244 | 993391552 | 130887387 | 588444190 | 571200729 |
47 | 044265718 | 525441576 | 440861107 | 075482470 | 637558116 | 111284118 | 766221093 |
48 | 932779799 | 328876565 | 643767732 | 658761197 | 378225086 | 429408280 | 387484595 |
49 | 127264933 | 322367220 | 723023648 | 490085913 | 623134925 | 155691547 | 908608850 |
50 | 728377036 | 112769525 | 064139205 | 774327378 | 902723197 | 568274943 | 233969621 |
Roll a die to determine which column to use, or flip a coin six times and count the number of TAILS you toss. In our example, we rolled a 5, so we will start with column 5 of random numbers.
Close your eyes and touch your computer screen inside the table, and see what 2-digit number you are pointing at. When we did this, our finger was pointing at the number 15 (Row 17, Column 4, the second and third digits). Therefore, we will use row 15 of column 5 as our starting place. Note that there are only 50 rows in this table, so if you point at a two-digit number larger than 50, first try reversing its digits (e.g., 51 becomes 15). If that still does not produce a number of 50 or less, close your eyes and point again until it does.
In our example, the starting point was the number 40, the first 2-digit number in Row 15 and Column 5 (we could equally have read 4, 402, 4027, and so on, depending on how large a number we needed). We need 2-digit numbers because we have 100 people in our population (where 00 indicates person 100). Simply choose the number of digits appropriate for your population size. If a number is unusable, simply discard it and move on (e.g., if we had only 39 people in our population instead of 100, we would discard the 40 and continue as follows).
Approach A. Now we choose our next 9 two-digit numbers by either moving across the row (left or right, as long as you are consistent) or up/down the column. If we move to the right, our next 2-digit numbers are: 27, 98, 89, 53, 21, 40, 32, 01 (i.e., 1), and 86. However, because 40 was already used, it is unusable the second time, so we pick an additional number: 54. We could instead move down the column and choose, after 40: 87, 69, 51, 39, 10, 00 (i.e., 100), 10, 78, and 18. Again, the repeated 10 is unusable, so we choose another: 99. You could toss a coin to decide whether you will move by rows or by columns from the starting point found in step 3.
Approach B. As you can see, Approach A may require that a large number of random numbers be considered before n unique random numbers are found. The number of random numbers considered can be reduced by using modular arithmetic to find numbers that mathematicians describe as being “congruent modulo N.” Using this approach, each random number is divided by the population size, N, and the remainder is taken as the random number, with remainders of zero representing the number N. Our starting point using Approach B is now the number 402, since we need numbers larger than the population size of 100. Each number that we consider is divided by 100, our N, and the remainder represents a selected ID. When we divided 402 by 100, the remainder was 02, so the first number in our sample was 2. The following remainders were found for the three-digit numbers in the row following 402: 98, 95, 21, 03, 01, 65, 35, 53, 90. Therefore, the subjects with those ID numbers were selected for our random sample. Note that when we reached the end of the row, we moved to the beginning of the next row (Row 16, Column 1).
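If you prefer to let software do the modular arithmetic, the following is a minimal R sketch of Approach B. The vector `stream` simply transcribes the 3-digit numbers read from Table 10.1 starting at Row 15, Column 5; the object names are ours, chosen for illustration.

```r
# Approach B in R: remainders modulo N, with 0 standing for unit N.
# `stream` transcribes 3-digit numbers from Table 10.1 (Row 15, Column 5 onward).
N <- 100
stream <- c(402, 798, 895, 321, 403, 201, 865, 435, 153, 990)

ids <- stream %% N   # remainder after dividing each number by N
ids[ids == 0] <- N   # a remainder of 0 represents unit N
unique(ids)          # drop any repeats; these are the sampled IDs
```

Note that `unique()` plays the same role as discarding repeated numbers by hand.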
In the following steps, you are shown how to generate a random sample of 10 cases from a population of 20 cases using jamovi. We will use the data in Table 5a for this example. The process involves creating one or two new columns of numbers. If you do not have an ID number, you will need to create one, simply an ordered list of the numbers from one to your population size (i.e., 1 to N = 20). The second new column contains a list of twenty numbers generated at random from a uniform distribution (the end points of the distribution do not matter, because only the ranks of the random numbers are used; in our example the random numbers ran between zero and twenty). Then the random numbers are ranked, and the ranks place the cases in random order.
For this example we will use the data in Table 5a (hopefully you have it available as a dataset you can open, but if not you can enter it fairly easily). We will use only the 20 cases in group 1 (the Yes-Pill cases); see figure 10a. Even though we already have an ID number (ID1), we will create one for illustrative purposes. We will use jamovi’s UNIF function to create random uniform numbers.
Row | ID1 | HT1 | CHOL1 |
---|---|---|---|
1 | 225 | 64 | 210 |
2 | 736 | 63 | 215 |
3 | 1291 | 64 | 263 |
4 | 906 | 63 | 220 |
5 | 494 | 62 | 330 |
6 | 796 | 65 | 198 |
7 | 1637 | 65 | 260 |
8 | 2871 | 66 | 198 |
9 | 3349 | 65 | 250 |
10 | 2522 | 63 | 179 |
11 | 828 | 67 | 243 |
12 | 1309 | 62 | 247 |
13 | 2349 | 64 | 210 |
14 | 544 | 64 | 272 |
15 | 1326 | 66 | 210 |
16 | 59 | 62 | 253 |
17 | 473 | 67 | 297 |
18 | 1251 | 60 | 195 |
19 | 2271 | 66 | 158 |
20 | 1797 | 68 | 150 |
Steps:
1. Create the column IDNUM containing the numbers 1 to 20.
2. Use COMPUTE to create the column RANDNUM with the UNIF function.
3. Create the column RANKNUM containing the ranks of RANDNUM.
4. The cases whose RANKNUM values are 1 through 10 form the random sample, as shown below.
Row | ID1 | HT1 | CHOL1 | IDNUM | RANDNUM | RANKNUM |
---|---|---|---|---|---|---|
1 | 225 | 64 | 210 | 1 | 2.9367 | 1 |
2 | 736 | 63 | 215 | 2 | 18.2085 | 19 |
3 | 1291 | 64 | 263 | 3 | 13.7589 | 13 |
4 | 906 | 63 | 220 | 4 | 3.0573 | 2 |
5 | 494 | 62 | 330 | 5 | 9.3301 | 8 |
6 | 796 | 65 | 198 | 6 | 18.4668 | 20 |
7 | 1637 | 65 | 260 | 7 | 11.2445 | 11 |
8 | 2871 | 66 | 198 | 8 | 13.8210 | 15 |
9 | 3349 | 65 | 250 | 9 | 9.3310 | 9 |
10 | 2522 | 63 | 179 | 10 | 3.2675 | 4 |
11 | 828 | 67 | 243 | 11 | 12.8963 | 12 |
12 | 1309 | 62 | 247 | 12 | 6.3586 | 6 |
13 | 2349 | 64 | 210 | 13 | 6.3192 | 5 |
14 | 544 | 64 | 272 | 14 | 16.3514 | 16 |
15 | 1326 | 66 | 210 | 15 | 17.1935 | 17 |
16 | 59 | 62 | 253 | 16 | 3.0860 | 3 |
17 | 473 | 67 | 297 | 17 | 17.2506 | 18 |
18 | 1251 | 60 | 195 | 18 | 6.8150 | 7 |
19 | 2271 | 66 | 158 | 19 | 13.7665 | 14 |
20 | 1797 | 68 | 150 | 20 | 9.4818 | 10 |
You can use a similar process to generate random numbers that correspond to row numbers in a list you do not have in jamovi. Let’s say your list has 300 names. You will need to create a column in jamovi with 300 cases (the column can be mostly missing values, but you will need some value in row 300 of the Data Editor), so scroll down and enter a value in row 300. Go to COMPUTE and use UNIF(1,300) as the formula. Then rank the random numbers, and the rows whose ranks are 1 through n give the positions to select from your list.
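The same ranking logic can be written directly in R. This is a sketch of the idea rather than the jamovi point-and-click steps; the variable names mirror the columns created above.

```r
# Random ordering via ranks: the cases whose ranks are 1..n form the sample.
set.seed(123)          # any seed; used here only so the result is reproducible
N <- 20; n <- 10
randnum <- runif(N)    # analogous to jamovi's UNIF()
ranknum <- rank(randnum)
which(ranknum <= n)    # row numbers of the n randomly selected cases
```

For the 300-name list, replace `N <- 20` with `N <- 300`; the rows whose ranks are 1 through n are the positions to pull from your list.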
The sampling methods described above are examples of what is known as sampling without replacement. In sampling without replacement, a number is sampled at random but is not replaced in the population, so it cannot be chosen again. Therefore, sampling without replacement yields no repeats of numbers in the sample. It is commonly used in practical applications and is the method used throughout this book.
In sampling with replacement, a number is randomly selected from a population of numbers and recorded. The number selected is then returned to the population, and a second number is randomly chosen and recorded. This process is repeated until a given sample size is obtained. Sampling with replacement can be done with the “fish bowl shuffle” by replacing each slip of paper in the fish bowl and shaking the bowl before selecting another slip. Using a table of random numbers, repeats of a number are kept in the sample. In practice, almost all sampling is done without replacement. However, there are situations where sampling with replacement is used, in particular a relatively new robust statistical approach called “bootstrapping.”
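A small R sketch makes the with-replacement idea concrete, using the first ten CHOL1 values from figure 10a as the “population.” The resampling shown is the core idea behind the bootstrap; treat it as an illustration, not a full bootstrap analysis.

```r
# Sampling with replacement: the resampling scheme that underlies bootstrapping.
chol <- c(210, 215, 263, 220, 330, 198, 260, 198, 250, 179)  # from figure 10a

set.seed(1)
sample(chol, replace = TRUE)   # one resample; repeats can (and do) occur

# Bootstrap flavor: resample many times and look at the resampled means.
boot_means <- replicate(2000, mean(sample(chol, replace = TRUE)))
sd(boot_means)   # a bootstrap estimate of the standard error of the mean
```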
You used the preceding sampling procedures to randomly select units. You can also use them for random assignment: after you have randomly selected your units, randomly order them and split the ordered list among the treatments, as in the sketch below.
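One hedged way to carry out such a random assignment in R, assuming two equal-sized treatment groups:

```r
# Random assignment: shuffle the selected units, then split the shuffled list.
set.seed(7)
units <- 1:10               # the 10 randomly selected units
shuffled <- sample(units)   # a random permutation of the unit numbers
treatment1 <- shuffled[1:5]
treatment2 <- shuffled[6:10]
```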
A sampling distribution is a probability distribution of a statistic. Remember that a probability distribution is a theoretical distribution. In this sense, a sampling distribution is based on values of a statistic that has been calculated for each of an infinite number of random samples of size n. Statistics that are commonly considered in sampling distributions are the mean, variance, skewness, kurtosis, correlation coefficient, regression coefficient, proportions, and different test statistics. (We will consider the sampling distributions of different test statistics as we introduce them, starting in chapter 12.)
Sampling distributions of statistics are based on an infinite number of samples. We can obtain an idea of what a sampling distribution is, however, by considering an illustration based on a finite number of samples. For example, we can simulate a sampling distribution of the variance using 100 sample variances based on samples of IQ scores from the population of high school students in New York City. We obtain the variances needed for the sampling distribution by calculating the variance in each of 100 samples of students. For this example, each sample contains 30 students. To calculate a given variance, we select a random sample of 30 students and administer an IQ test to them. We then calculate the variance of the resulting IQ scores, and repeat this process for 100 such samples. When we finish, we have 100 variances, which we can put into a frequency distribution.
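The IQ example can be mimicked in R. Because we do not have the New York City population, the sketch below assumes IQ scores are normal with mean 100 and standard deviation 15; only the mechanics of “many samples, one variance each” matter here.

```r
# Simulated sampling distribution of the variance: 100 samples of n = 30.
set.seed(42)
variances <- replicate(100, var(rnorm(30, mean = 100, sd = 15)))
hist(variances, main = "100 sample variances (n = 30)")
```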
In this example, our frequency distribution would meet the definition of a sampling distribution if it contained an infinite number of samples and we scaled the y-axis with probabilities instead of frequencies. We could theoretically obtain an infinite number of samples of students by repeatedly selecting a sample, computing its variance, returning the students to the population, and sampling again, ad infinitum.
The ad infinitum (i.e., continuing the process infinitely) is the reason this distribution is called a theoretical distribution.
The sampling distribution is important because mathematical statisticians can tell what shape the sampling distributions of many statistics will take (for example, normal, positively skewed, and so on). Furthermore, statisticians can tell what the mean and variance of these distributions will be.
If we know the mean, variance, and shape of a sampling distribution of a statistic, we can make inferential statements, that is, statements concerning parameter estimation and significance testing. A good example of this is the sampling distribution of the mean. A theorem, known as the Central Limit Theorem, states that:
If a population has a finite variance \(\sigma^2\) and mean \(\mu\), then the sampling distribution of the mean approaches a normal distribution as n (the sample size of the random samples upon which the sample means are calculated) increases. That is, when n is very large, the sampling distribution of the mean is approximately normal. Furthermore, the mean of the sampling distribution of the mean is \(\mu\), and the variance of the sampling distribution of the mean is \(\sigma^2/n\).
The variance of the sampling distribution of the mean is called the variance of the mean, and the standard deviation of the sampling distribution of the mean is called the standard error of the mean. Note that in this theorem nothing is said about the population distribution; that is, the population distribution can take any shape. If the population is known to have a normal distribution, however, the sampling distribution of the mean will be normal with any size sample.
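A quick empirical check of the theorem, using a deliberately non-normal (exponential) population as an illustrative assumption:

```r
# The sd of many sample means should be close to sigma / sqrt(n).
set.seed(99)
n <- 25
means <- replicate(10000, mean(rexp(n, rate = 1)))  # population sd = 1
sd(means)      # close to 1 / sqrt(25) = 0.2
hist(means)    # roughly normal despite the skewed population
```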
We begin to see some of the practical applications of the Central Limit Theorem when we consider a sample of 25 that we believe was selected at random from a population whose mean is 50 and whose variance is 100. If the sample did come from this population, we know that the mean of the sampling distribution of such sample means is 50, and the variance of this sampling distribution is 100/25 = 4, with a standard deviation of 2.
Given this information, our knowledge of standard scores, and the normal probability distribution, we can ask and answer the following estimation and significance test questions. Use the sampling distribution of the mean shown in figure 10c to assist you in considering these questions.
Between what two sample means would we expect 95% of the sample means to fall?
Given table 9.1(b), in the standard normal probability distribution, we would expect 95% of the scores to fall between z(.025) = -1.960 and z(.975) = 1.960. In the sampling distribution of the mean, sample means are our scores; and so a z statistic (a z score for a group) for a given mean is written as:
\[ \begin{equation} z_X = \frac {(M_X - \mu_X)} {\sigma_{M_X}} \\ \text{or} \\ z_X = \frac {(\overline{X} - \mu)} {\sigma_\overline{X}} \\ \tag{10-1} \end{equation} \] Here, \(\mu\) = 50 and \(\sigma_M\) = 2. We must consider z at z(.025) = -1.96 and z(.975) = 1.96, and solve for the values of M. In so doing, we have:
\[ \begin{align} 1.96 &= (M-50)/2 \\ 3.92 &= M-50 \\ 53.92 &= M \\ \\ -1.96 &= (M-50)/2 \\ -3.92 &= M-50 \\ 46.08 &= M \end{align} \] Therefore, we would expect that 95% of our sample means would fall between 46.08 and 53.92, that is, p(46.08 < M < 53.92) = .95.
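The same limits can be obtained in one line with R’s `qnorm()` function, given the values assumed in this example:

```r
# 95% limits for the sample mean, with mu = 50 and standard error 2.
mu <- 50; se <- 2
mu + qnorm(c(.025, .975)) * se   # 46.08 and 53.92
```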
Between what two sample means would we expect 99% of the sample means to fall?
Given table 9.1(b), in the standard normal distribution, we would expect 99% of the standard scores to fall between z(.005) = -2.576 and z(.995) = 2.576. Therefore, using the procedure described for answer 1, we have:
\[ \begin{align} 2.576 &= (M-50)/2 \\ 5.152 &= M-50 \\ 55.152 &= M \\ \\ -2.576 &= (M-50)/2 \\ -5.152 &= M-50 \\ 44.848 &= M \end{align} \]
Therefore, we would expect that 99% of our sample means would fall between 44.848 and 55.152, that is, p(44.848 < M < 55.152) = .99.
If a sample mean of 52.4 were obtained, what would be a good estimate of the population mean? (In the following section, we will describe some criteria that help decide what good means.)
The sample mean of 52.4 is a good estimate of the population mean.
If a sample mean of 52.4 were obtained, would it be reasonable to assume that the mean of the sampling distribution was 50? In other words, would it be reasonable to assume that the sample came from a population where the mean was 50?
If the population mean was 50, we would expect that 95% of the samples randomly selected from this population have means that fall between 53.92 and 46.08. Therefore, since 52.4 falls within this interval, it would be reasonable to assume that the population mean was 50.
The mean 52.4 differs from the suspected population mean by 2.4 points. We can answer question 4 in another way by asking: What is the probability of obtaining a sample mean that differs by 2.4 points or more from a population value of 50? Here, we are asking for the probability of obtaining a sample mean that is less than 47.6 (that is, 50 - 2.4) or greater than 52.4 (that is, 50 + 2.4). We can easily answer this question by transforming the score of 52.4 into a z score and then finding the area above it, using table 9.1(a):
\[ z = \frac {52.4-50} {2} = 1.2 \]
\[ p(z>1.2)=.5000-.3849=.1151 \] Also, we can find the area below the z score corresponding to the score of 47.6 (which, because of the symmetry of the normal distribution, is the same as the area for z > 1.2) as:
\[ z = \frac{47.6-50} {2} = -1.2 \]
\[ p(z < -1.2) = .1151 \]
Therefore, the probability of obtaining a sample mean that differs by 2.4 points or more from a population mean of 50 is .2302, that is, \(p(z < -1.2) + p(z > 1.2) = .1151 + .1151 = .2302\). Because this probability is high, we can conclude that if the population mean were 50, it would be reasonable to expect a sample whose mean was 52.4. (In the next chapter, we will discuss what is meant by a high probability in this situation.)
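R’s `pnorm()` reproduces this two-tailed probability directly (the tiny discrepancy from .2302 comes from the rounded table areas):

```r
# Two-tailed probability of a sample mean at least 2.4 points from 50.
z <- (52.4 - 50) / 2
2 * pnorm(-abs(z))   # 0.2301
```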
If a sample mean of 56 were obtained, would it be reasonable to assume that the mean of the sampling distribution was 50? In other words, would it be reasonable to assume that the mean of the population was 50?
If the population mean were 50, we would expect that 95% of the sample means would fall between 46.08 and 53.92, and that 99% of the sample means would fall between 44.848 and 55.152. Since a sample mean of 56 is outside both of these ranges, we might conclude that the mean of the sampling distribution (the population mean) is not 50.
As in answer 4b, we can use table 9.1(a) to find the probability of obtaining a sample mean that differs by 6 points or more from a population mean of 50 (that is, a mean of 44 or less, or of 56 or more). Here, the z score for a sample mean of 56 is:
\[ z = \frac{56-50} {2} = 3.00 \]
\[ p(z > 3.00) = .5000 - .4987 = .0013 \]
Similarly, the z score for the sample mean of 44 is:
\[ z = \frac{44-50}{2} = -3.00 \] \[ p(z < -3.00) = .0013 \] Therefore, if the population mean is 50, the probability of obtaining a sample mean that differs from this population mean by 6 or more points is p(z < -3.00) + p(z > 3.00) = .0013 + .0013 = .0026. Since this probability is small, we might conclude that a sample mean of 56 came from a population whose mean is greater than 50.
The preceding questions and answers gave you a sense of the role that a sampling distribution (here, of the mean) plays in inferential statistics. It is an extremely important role although it stays primarily in the background. We do not have to actually find the sampling distribution of a statistic, we only have to know what its shape is and what its parameters are. Based on our knowledge of the sampling distribution we can make a priori probability statements about an unknown sample statistic (as in questions 1 and 2) or an inference about a population parameter (as in questions 3, 4, and 5).
In the next chapter, we will further consider such questions when we consider the importance of the sampling distribution to hypothesis testing. For now, however, we will consider the question that is implied in question 3:
What criteria can be used to help decide if a sample statistic is a good estimate of its population parameter? Said another way: What are the properties of statistics of which we want to consider the sampling distributions? We will consider the formal definitions of these criteria and examine them for commonly used statistics.
A statistic is said to be a point estimate when it is used to infer the value of a population parameter. The equation used to derive the statistic is called the estimator. The following criteria are frequently used to evaluate a statistic:
A statistic is said to be unbiased when the mean of its sampling distribution is its population parameter.
A statistic is said to be consistent when the probability that it is close to its population parameter increases as the sample size increases.
Kendall and Buckland (1976, p. 47) described efficiency as follows:
The concept of efficiency in statistical estimation is due to Fisher (1921) and is an attempt to measure objectively the relative merits of several possible estimators.
The criterion adopted by Fisher was that of variance, an estimator being regarded as more “efficient” than another if it has smaller variance; and if there exists an estimator with minimum variance v the efficiency of another estimator of variance v1 is defined as the ratio of v/v1. It was implicit in this development that the estimator should obey certain criteria such as consistency. For small samples, where other considerations such as bias enter, the concept of efficiency may require extension or modification.
The definition of a sufficient statistic is beyond the scope of this book; suffice it to say that a sufficient statistic contains all of the information in its sample relative to the estimation of its population parameter.
In this section, we will create finite sampling distributions whose properties and estimates we can examine more closely. This exercise will enable us to better conceptualize what a sampling distribution is, and what the properties of its scores (estimates) are. In this regard, we will consider samples from a uniform population distribution.
The uniform distribution was chosen to illustrate that the sampling distributions of most statistics based on samples from this distribution are not uniform. Indeed, considering the Central Limit Theorem, we know that the sampling distribution of the mean for large samples will be close enough to a normal distribution for most practical purposes.
As was the case for the normal distribution, instead of considering a given population with a uniform distribution and a fixed sample size, we will consider a given uniform probability distribution that represents all sample sizes. Consider the discrete uniform probability distribution shown in figure 10d, which has a lower boundary of a = 0 and an upper boundary of b = 1000 and contains the 1000 integers 0 through 999. The probability of sampling a given number from this distribution is therefore 1/1000, since each of the 1000 numbers has an equal chance of being selected. The following population parameters have been derived by mathematical statisticians for any such discrete uniform probability distribution:
\[ \begin{align} Mean = \mu &= (a+(b-1))/2 \\ Median = Md. &= (a+(b-1))/2 \\ Variance = \sigma^2 &= ((b-a)^2-1)/12 \\ Skewness = b_1 &= 0 \end{align} \] Here, \(a\) is the lower boundary and \(b\) is the upper boundary (the largest value in the distribution is \(b-1\)). Therefore, for the discrete uniform probability distribution shown in figure 10d, we have (for \(a = 0\) and \(b = 1000\)) that:
\[ \begin{align} Mean &= \mu = &&(0+(1000-1))/2 &&= 499.5 \\ Median&=Md. = &&(0+(1000-1))/2 &&= 499.5 \\ Variance &= \sigma^2 = &&(1000^2-1)/12 &&= 83333.25 \\ Skewness &= b_1 = && &&= 0 \end{align} \] We will select numbers at random from this theoretical probability distribution.
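These parameter values are easy to verify numerically in R by enumerating the 1000 integers of figure 10d:

```r
# Verifying the discrete uniform parameters for the integers 0..999.
x <- 0:999
mean(x)                       # 499.5
median(x)                     # 499.5
sum((x - mean(x))^2) / 1000   # population variance: 83333.25
```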
### The Raw Data and Its Statistics
To illustrate the properties of sampling distributions and their estimates, 1000 random samples from the uniform probability distribution shown in figure 10d were generated for sample sizes of 5, 10, 25, and 50 units.
The 1000 samples of size 5 are partially shown in figure 10e. Similarly, figure 10f shows the means, variances, standard deviations, ranges, and medians for the 1000 samples of size 5. Although the summary statistics for the samples of sizes 5, 10, 25, and 50 units will be examined, only the observations based on samples of size 5 are shown here to keep the presentation less cluttered.
The mean of the scores in Sample 1 in figure 10e is 549.4, which is the first mean shown in column 2 (labeled Mean) of figure 10f. Also, the variance for the first sample in figure 10e is shown in the first row of figure 10f as 51179; the standard deviation is 226.228; the range is 559; and the median is 481. In this manner, the statistics for a given sample of figure 10e are found in the corresponding row of figure 10f.
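For readers who want to reproduce figures 10e and 10f, the following R sketch generates 1000 samples of size 5 from the distribution of figure 10d and computes the same five statistics per sample. The seed and object names are ours, not part of the original figures.

```r
# 1000 samples of size 5 and their per-sample statistics, as in figure 10f.
set.seed(2025)
stats5 <- t(replicate(1000, {
  s <- sample(0:999, size = 5, replace = TRUE)
  c(mean = mean(s), var = var(s), sd = sd(s),
    range = diff(range(s)), median = median(s))
}))
head(stats5)   # one row of statistics per sample
```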
We sent the statistics from each of the samples, across each of the sample sizes, to the descriptives program in jamovi’s jmv package for R (see chapter 4). jamovi found descriptive statistics for the means derived from samples of size 5, 10, 25, and 50, and likewise for the medians, variances, standard deviations, and ranges. Figure 10g shows the resulting descriptive statistics for the means based on the different sample sizes. The descriptive statistics for the medians, variances, standard deviations, and ranges are shown in figures 10h, 10i, 10j, and 10k, respectively.
To further illustrate the features of the mean, we constructed finite sampling distributions using R’s histogram capabilities (see chapter 4) for the 1000 samples of each sample size. These finite sampling distributions are displayed in the histograms of figure 10l. For example, in figure 10l, histogram A represents the sampling distribution of the mean based on 1000 samples with 5 units in each sample. Histogram B has 10 units per sample mean; histogram C has 25 units per sample mean; and histogram D has 50 units per sample mean.
We also constructed variance bar plots for each of the statistics across each of the sample sizes. Figures 10m, 10n, 10o, and 10p show these variance bar plots for the means, medians, standard deviations, and ranges, respectively.
For example, figure 10m shows from left to right the four variance bar plots of the 1000 means based on samples of size 5, 10, 25, and 50. The small lines extending from each bar plot just above and below the mean of the statistic, referred to as “whiskers,” represent the standard errors of the statistic. For example, in figure 10m, the first whisker above and below the mean represents one standard error (\(\sigma/\sqrt{n}\)) from the mean, and the second whisker above or below the mean represents two standard errors (\(2\sigma/\sqrt{n}\)) from the mean. (Note that variance bar plots were not made for the variances, whose descriptive statistics are shown in figure 10i, because the variances are so large they would require rescaling.)
Statisticians have found that the mean and variance are unbiased estimates of their population parameters. That is, if we could take an infinite number of samples for a given sample size, we would find that the means of the sampling distributions of these statistics would be their population parameters. Since the statistics illustrated here are based on only 1000 samples, we do not find them to be exactly equal to their population values, but they are close (1000 samples sounds like a lot, but for this kind of research we typically use 10,000 or more).
For example, the mean of the population is known to be 499.5, and the sampling distribution means reported in figure 10g are 505.18, 497.35, 498.00, and 498.13 for samples of size 5, 10, 25, and 50, respectively. The population variance is known to be 83333.25, and the sampling distribution means reported in figure 10i are 82894, 84184, 83525, and 84190 for samples of size 5, 10, 25, and 50, respectively.
In chapter 5, we found that the population variance was calculated using equation (5-3) as:
\[ \sigma_X^2 = \frac {\sum (X-\mu)^2} {N} \]
The sample variance was found using the estimator, equation (5-2), as:
\[ s_X^2 = \frac {\sum (X-M_X)^2} {n-1} \] Here, a natural question to ask is: Why not use n instead of (n-1) as the denominator of the sample variance? The reason is that if n is used as the denominator, the mean of the sampling distribution of such variances is not the population variance; that is, the sample variance found with n as the denominator is biased. To have an unbiased estimate of the population variance, the estimator must consist of the sum of squared deviation scores divided by (n-1).
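This bias is easy to see by simulation. The sketch below computes both versions of the sample variance for many samples from the uniform population of figure 10d, whose variance is 83333.25:

```r
# Dividing by n is biased; dividing by (n - 1) is not.
set.seed(11)
n <- 5
both <- replicate(10000, {
  s <- sample(0:999, n, replace = TRUE)
  c(n_denom = sum((s - mean(s))^2) / n,   # biased version
    n1_denom = var(s))                    # R's var() divides by n - 1
})
rowMeans(both)   # n_denom falls well below 83333.25; n1_denom is close
```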
The estimator for the standard deviation (that is, the square root of the unbiased estimator of the variance) yields a biased estimate of its population parameter. Fortunately, the bias of the standard deviation is small and can be considered to be negligible when n is greater than 20. The equation for the unbiased estimate of the population standard deviation is:
\[ \text{unbiased } s = \left[1 + \frac {1} {4(n-1)} \right] s \]
This estimator is rarely used, however, because of the slight difference between its estimates and those found by taking the square root of the sample variance. The population standard deviation of the discrete uniform distribution that we have been considering is 288.67. In figure 10j, the means of the sampling distributions of standard deviations are 277.35, 286.49, 287.63, and 289.54 for samples of size 5, 10, 25, and 50, respectively. These values are all reasonably close to the population value, and the bias shrinks as n grows.
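Applying the correction to the figure 10j value for samples of size 5 shows its effect. Bear in mind that the correction factor is an approximation derived for normal populations, so for this uniform population it is only roughly right.

```r
# Small-sample correction applied to the mean sd for n = 5 (figure 10j).
# The factor is an approximation derived for normal populations.
s <- 277.35; n <- 5
(1 + 1 / (4 * (n - 1))) * s   # about 294.7, versus the population 288.67
```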
The range is a biased estimate of its population value. This is easily seen because the mean of the sampling distribution of the range depends on the sample size. You can observe this relationship in the means of the sampling distributions of the ranges for different sample sizes shown in figure 10k, where the means of the ranges increase from 667.46 to 961.44 as the sample size increases from 5 to 50. This is one reason why the range is not used as an estimate of its population parameter.
ID | Sample | Subject | Score |
---|---|---|---|
1 | 1 | 1 | 646 |
2 | 1 | 2 | 336 |
3 | 1 | 3 | 389 |
4 | 1 | 4 | 481 |
5 | 1 | 5 | 895 |
6 | 2 | 1 | 877 |
7 | 2 | 2 | 727 |
8 | 2 | 3 | 637 |
9 | 2 | 4 | 836 |
10 | 2 | 5 | 438 |
11 | 3 | 1 | 277 |
12 | 3 | 2 | 355 |
13 | 3 | 3 | 852 |
14 | 3 | 4 | 385 |
15 | 3 | 5 | 915 |
… | … | … | … |
… | … | … | … |
… | … | … | … |
4996 | 1000 | 1 | 276 |
4997 | 1000 | 2 | 808 |
4998 | 1000 | 3 | 647 |
4999 | 1000 | 4 | 765 |
5000 | 1000 | 5 | 564 |
Sample | Mean | Variance | SD | Range | Median |
---|---|---|---|---|---|
1 | 549.4 | 51179 | 226.228 | 559 | 481 |
2 | 703 | 30781 | 175.444 | 439 | 727 |
3 | 556.8 | 90994 | 301.652 | 638 | 385 |
4 | 451.6 | 50790 | 225.366 | 593 | 455 |
5 | 624.2 | 53789 | 231.924 | 539 | 555 |
6 | 739 | 47223 | 217.309 | 565 | 704 |
7 | 503 | 111688 | 334.197 | 764 | 374 |
8 | 541.6 | 71011 | 266.479 | 705 | 579 |
9 | 421.2 | 24368 | 156.103 | 396 | 389 |
10 | 253 | 63757 | 252.5 | 610 | 151 |
11 | 427.8 | 121012 | 347.868 | 858 | 366 |
12 | 263.4 | 22750 | 150.832 | 376 | 309 |
13 | 489.4 | 64700 | 254.363 | 625 | 391 |
14 | 219.6 | 48903 | 221.14 | 503 | 177 |
15 | 454 | 156216 | 395.241 | 920 | 312 |
16 | 401.2 | 45586 | 213.508 | 514 | 288 |
17 | 526.4 | 61298 | 247.585 | 665 | 472 |
… | … | … | … | … | … |
… | … | … | … | … | … |
… | … | … | … | … | … |
998 | 744.8 | 44470 | 210.878 | 568 | 801 |
999 | 440 | 82129 | 286.582 | 718 | 408 |
1000 | 612 | 44563 | 211.098 | 532 | 647 |
jmv::descriptives(
data = data,
vars = vars(mean05, mean10, mean25, mean50),
variance = TRUE,
range = TRUE,
se = TRUE,
ci = TRUE,
iqr = TRUE,
skew = TRUE,
kurt = TRUE,
sw = TRUE)
DESCRIPTIVES
Descriptives
─────────────────────────────────────────────────────────────────────────────
mean05 mean10 mean25 mean50
─────────────────────────────────────────────────────────────────────────────
N 1000 1000 1000 1000
Missing 0 0 0 0
Mean 505.18 497.35 498.00 498.13
Std. error mean 4.1113 2.8558 1.7978 1.2824
95% CI mean lower bound 497.11 491.75 494.48 495.61
95% CI mean upper bound 513.25 502.96 501.53 500.64
Median 504.42 494.71 498.95 498.83
Standard deviation 130.01 90.309 56.852 40.552
Variance 16903 8155.8 3232.2 1644.5
IQR 181.82 126.42 75.283 53.512
Range 701.49 490.69 426.49 266.88
Minimum 153.20 245.34 269.65 369.05
Maximum 854.68 736.03 696.15 635.93
Skewness -0.0069409 0.017420 0.012627 0.077119
Std. error skewness 0.077344 0.077344 0.077344 0.077344
Kurtosis -0.41671 -0.37078 0.078074 0.048292
Std. error kurtosis 0.15453 0.15453 0.15453 0.15453
Shapiro-Wilk W 0.99669 0.99656 0.99840 0.99882
Shapiro-Wilk p 0.03408 0.02743 0.48969 0.76603
─────────────────────────────────────────────────────────────────────────────
Note. The CI of the mean assumes sample means follow a t-distribution
with N - 1 degrees of freedom
jmv::descriptives(
data = data,
vars = vars(median05, median10, median25, median50),
variance = TRUE,
range = TRUE,
se = TRUE,
ci = TRUE,
iqr = TRUE,
skew = TRUE,
kurt = TRUE,
sw = TRUE)
DESCRIPTIVES
Descriptives
────────────────────────────────────────────────────────────────────────────
median05 median10 median25 median50
────────────────────────────────────────────────────────────────────────────
N 1000 1000 1000 1000
Missing 0 0 0 0
Mean 507.33 497.82 496.05 497.27
Std. error mean 5.9243 4.3500 3.0085 2.1690
95% CI mean lower bound 495.70 489.28 490.15 493.01
95% CI mean upper bound 518.95 506.36 501.95 501.53
Median 506.98 496.43 491.76 497.03
Standard deviation 187.34 137.56 95.138 68.588
Variance 35097 18923 9051.2 4704.4
IQR 289.02 209.14 133.48 94.076
Range 933.47 793.53 573.63 402.54
Minimum 24.191 98.627 211.33 305.58
Maximum 957.66 892.16 784.96 708.12
Skewness -0.055444 0.040829 0.11089 0.088909
Std. error skewness 0.077344 0.077344 0.077344 0.077344
Kurtosis -0.73310 -0.41119 -0.20510 -0.11654
Std. error kurtosis 0.15453 0.15453 0.15453 0.15453
Shapiro-Wilk W 0.98933 0.99639 0.99757 0.99842
Shapiro-Wilk p < .00001 0.02069 0.14606 0.50094
────────────────────────────────────────────────────────────────────────────
Note. The CI of the mean assumes sample means follow a t-distribution
with N - 1 degrees of freedom
jmv::descriptives(
data = data,
vars = vars(var05, var10, var25, var50),
variance = TRUE,
range = TRUE,
se = TRUE,
ci = TRUE,
iqr = TRUE,
skew = TRUE,
kurt = TRUE,
sw = TRUE)
DESCRIPTIVES
Descriptives
───────────────────────────────────────────────────────────────────────────────
var05 var10 var25 var50
───────────────────────────────────────────────────────────────────────────────
N 1000 1000 1000 1000
Missing 0 0 0 0
Mean 82894 84184 83525 84190
Std. error mean 1318.7 820.09 510.39 344.59
95% CI mean lower bound 80306 82574 82523 83513
95% CI mean upper bound 85481 85793 84526 84866
Median 81496 83094 82875 84627
Standard deviation 41700 25933 16140 10897
Variance 1.7389e+9 6.7254e+8 2.6050e+8 1.1874e+8
IQR 57779 35994 22042 14103
Range 226376 160305 102012 64108
Minimum 1803.9 20291 42198 54434
Maximum 228179 180596 144210 118541
Skewness 0.37698 0.16218 0.15533 0.10568
Std. error skewness 0.077344 0.077344 0.077344 0.077344
Kurtosis -0.18064 -0.28044 0.056525 -0.091433
Std. error kurtosis 0.15453 0.15453 0.15453 0.15453
Shapiro-Wilk W 0.98461 0.99543 0.99710 0.99733
Shapiro-Wilk p < .00001 0.00436 0.06761 0.09904
───────────────────────────────────────────────────────────────────────────────
Note. The CI of the mean assumes sample means follow a t-distribution
with N - 1 degrees of freedom
jmv::descriptives(
data = data,
vars = vars(sd05, sd10, sd25, sd50),
variance = TRUE,
range = TRUE,
se = TRUE,
ci = TRUE,
iqr = TRUE,
skew = TRUE,
kurt = TRUE,
sw = TRUE)
DESCRIPTIVES
Descriptives
────────────────────────────────────────────────────────────────────────────
sd05 sd10 sd25 sd50
────────────────────────────────────────────────────────────────────────────
N 1000 1000 1000 1000
Missing 0 0 0 0
Mean 277.35 286.49 287.63 289.54
Std. error mean 2.4452 1.4522 0.89116 0.59597
95% CI mean lower bound 272.55 283.64 285.88 288.37
95% CI mean upper bound 282.14 289.34 289.38 290.71
Median 285.47 288.26 287.88 290.91
Standard deviation 77.325 45.923 28.181 18.846
Variance 5979.1 2108.9 794.16 355.18
IQR 103.90 62.490 38.251 24.375
Range 435.21 282.52 174.33 110.99
Minimum 42.473 142.45 205.42 233.31
Maximum 477.68 424.97 379.75 344.30
Skewness -0.31898 -0.25021 -0.13913 -0.077364
Std. error skewness 0.077344 0.077344 0.077344 0.077344
Kurtosis -0.24573 -0.22036 0.019787 -0.12256
Std. error kurtosis 0.15453 0.15453 0.15453 0.15453
Shapiro-Wilk W 0.98943 0.99395 0.99724 0.99761
Shapiro-Wilk p < .00001 0.00046 0.08495 0.15525
────────────────────────────────────────────────────────────────────────────
Note. The CI of the mean assumes sample means follow a t-distribution
with N - 1 degrees of freedom
jmv::descriptives(
data = data,
vars = vars(range05, range10, range25, range50),
variance = TRUE,
range = TRUE,
se = TRUE,
ci = TRUE,
iqr = TRUE,
skew = TRUE,
kurt = TRUE,
sw = TRUE)
DESCRIPTIVES
Descriptives
───────────────────────────────────────────────────────────────────────────
range05 range10 range25 range50
───────────────────────────────────────────────────────────────────────────
N 1000 1000 1000 1000
Missing 0 0 0 0
Mean 667.46 824.18 924.56 961.44
Std. error mean 5.6594 3.3525 1.6232 0.84428
95% CI mean lower bound 656.36 817.61 921.37 959.79
95% CI mean upper bound 678.57 830.76 927.74 963.10
Median 688.00 842.57 936.09 966.57
Standard deviation 178.97 106.02 51.330 26.699
Variance 32029 11239 2634.8 712.82
IQR 244.76 145.93 61.926 33.887
Range 884.67 523.29 332.57 172.87
Minimum 100.89 473.00 666.85 826.94
Maximum 985.57 996.29 999.42 999.81
Skewness -0.51438 -0.77897 -1.2853 -1.2381
Std. error skewness 0.077344 0.077344 0.077344 0.077344
Kurtosis -0.30067 0.13049 1.9565 2.0920
Std. error kurtosis 0.15453 0.15453 0.15453 0.15453
Shapiro-Wilk W 0.97194 0.94964 0.90481 0.91424
Shapiro-Wilk p < .00001 < .00001 < .00001 < .00001
───────────────────────────────────────────────────────────────────────────
Note. The CI of the mean assumes sample means follow a t-distribution
with N - 1 degrees of freedom
jmv::descriptives(
data = data,
vars = vars(mad05, mad10, mad25, mad50),
variance = TRUE,
range = TRUE,
se = TRUE,
ci = TRUE,
iqr = TRUE,
skew = TRUE,
kurt = TRUE,
sw = TRUE)
DESCRIPTIVES
Descriptives
────────────────────────────────────────────────────────────────────────────
mad05 mad10 mad25 mad50
────────────────────────────────────────────────────────────────────────────
N 1000 1000 1000 1000
Missing 0 0 0 0
Mean 298.59 336.62 356.70 366.81
Std. error mean 3.9416 3.1057 2.2723 1.5807
95% CI mean lower bound 290.86 330.52 352.24 363.71
95% CI mean upper bound 306.33 342.71 361.16 369.92
Median 291.30 334.92 355.96 368.50
Standard deviation 124.64 98.212 71.856 49.986
Variance 15536 9645.5 5163.3 2498.6
IQR 181.16 145.01 99.431 65.639
Range 617.35 543.67 444.58 285.62
Minimum 20.421 75.288 149.58 218.81
Maximum 637.77 618.96 594.17 504.43
Skewness 0.25259 0.099745 0.10954 -0.080518
Std. error skewness 0.077344 0.077344 0.077344 0.077344
Kurtosis -0.51973 -0.51154 -0.13483 -0.092645
Std. error kurtosis 0.15453 0.15453 0.15453 0.15453
Shapiro-Wilk W 0.98843 0.99418 0.99814 0.99818
Shapiro-Wilk p < .00001 0.00064 0.34452 0.36828
────────────────────────────────────────────────────────────────────────────
Note. The CI of the mean assumes sample means follow a t-distribution
with N - 1 degrees of freedom
The statistics shown in the tables and figures are all based on consistent estimators, and this fact is the most striking feature of these tables and figures. In all cases, as sample size increases, the variability of the sample estimates decreases.
This is vividly shown for all of the statistics in their variance bar plots. For example, the variance bar plots of the sample means in figure 10m shrink dramatically as the sample size upon which a given mean is based increases. These bars reflect the sampling variances in figure 10g of 16903, 8155.8, 3232.2, and 1644.5 for means based on samples of size 5, 10, 25, and 50, respectively (remember that the variance of the sample means is called the “variance of the mean”).
The sampling distributions shown in the histograms of figure 10l illustrate the consistency of the sample mean by having fewer bars with larger frequencies (that is, less spread) as the sample size increases. For example, in figure 10l there are 9 bars when the sample size is 5, but in the sampling distribution based on 50 units per sample there are only 4 bars, two of which dominate the others with frequencies greater than or equal to 12.
In a uniform distribution, both the population mean and the population median are equal. Therefore, you might ask: In a uniform distribution, should one use the estimate of the mean or of the median to measure the center of the distribution? Since both the mean and the median are consistent and unbiased estimates, the answer to this question is found when you consider the relative efficiency of these two statistics.
Statisticians have shown that for symmetric distributions the sampling distribution of the mean has a smaller standard deviation than does the sampling distribution of the median. That is, the standard error of the mean is smaller than the standard error of the median. This fact is vividly displayed using the variance bar plots shown in figure 10q. In figure 10q, the first four variance bars are based on sample means, and the second four variance bars are based on sample medians. You can see that for both statistics the variance bar plots decrease as the sample size increases. For the same sample sizes, however, the variance bar plots of the means are always smaller than the variance bar plots of the medians.
For a symmetric population distribution, the sampling distribution of the mean will always have a smaller standard error than will the sampling distribution of the median. For this reason, the mean should be used when the population distribution is symmetric. In a skewed distribution, however, the mean, even with its smaller standard error, gives a “false” impression of the center of the distribution. In this case the median, because it sits at the center of the ordered scores, may be regarded as providing more useful information. (The terms false and useful require further definition, which is beyond the scope of this book.)
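The relative efficiency is easy to confirm by simulation with the uniform population of figure 10d; the standard errors below should approximate the standard deviations reported in figures 10g and 10h for samples of size 25.

```r
# Empirical standard errors of the mean and the median, n = 25.
set.seed(3)
sim <- replicate(10000, {
  s <- sample(0:999, 25, replace = TRUE)
  c(mean = mean(s), median = median(s))
})
sd(sim["mean", ])     # near 57.7 (cf. figure 10g)
sd(sim["median", ])   # larger, near 95 (cf. figure 10h)
```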
This chapter explained how to acquire a random sample of units. Drawing a random sample using slips of paper and a fishbowl (or some other container) frequently leads to nonrandom samples because it is difficult to mix the slips of paper so they can be considered random. A better method is to use a table of random numbers, although this becomes an arduous task with large samples. jamovi can generate numbers at random.
Two methods of sampling used by statisticians were also discussed. The method used most often in practice was referred to as sampling without replacement. Using this method, once a number is drawn it is not replaced into the population. Therefore, in using sampling without replacement a sample does not contain repeats of a random number. The second method of sampling was referred to as sampling with replacement. Using this method each time a number is chosen it is replaced in the population and therefore could be chosen again. Sampling with replacement usually yields samples with repeats of numbers.
Next, a theoretical probability distribution called the sampling distribution was explained. A sampling distribution is a probability distribution of a given statistic, where the statistic is calculated on samples of a given size. Examples demonstrated the important role the sampling distribution plays as a basis for statistical testing. This role is discussed in detail in the next chapter. This chapter focused on the sampling distribution’s role in helping to illustrate criteria that are used to judge estimates of population parameters. Statistics are frequently evaluated to see if they are unbiased, consistent, efficient, and/or sufficient. These properties were defined and the first three were illustrated.
Z Statistics
No output is needed here, but you will need a standard normal distribution table or calculator.
What is the critical value for the Z statistic if you want to use a two-tailed level of significance of .05?
What is the critical value for the Z statistic if you want to use a one-tailed level of significance of .05?
What is the probability of getting the following Z statistics, or larger, as an absolute value (that is, that far or farther away from ZERO in both directions; that is, below -Z and above +Z) if the null hypothesis is true?
Please cite as:
Barcikowski, R. S., & Brooks, G. P. (2025). The Stat-Pro book:
A guide for data analysts (revised edition) [Unpublished manuscript].
Department of Educational Studies, Ohio University.
https://people.ohio.edu/brooksg/Rmarkdown/
This is a revision of an unpublished textbook by Barcikowski (1987).
This revision updates some text and uses R and jamovi as the primary
tools for examples. The textbook has been used as the primary textbook
in Ohio University EDRE 7200: Educational Statistics courses for
most semesters 1987-1991 and again 2018-2025.