Chapter 10 Comparing Proportions

Thus far we have worked with continuous and multinomial outcomes, but the measures you are most likely to encounter in the social and behavioral sciences tend to be dichotomous: Did the job-training participant get employment \((y=1)\) or not \((y=0)\)? Did the patient improve post-surgery because he/she received pre-operative physical therapy \((y=1)\) or not \((y=0)\)? Does the community support this initiative \((y=1)\) or not \((y=0)\)? Did someone volunteer for the food drive \((y=1)\) or not \((y=0)\)? The dichotomy may also arise in other ways, as, for example, with the questions of whether women \((y=1)\) are more likely to vote than men \((y=0)\), whether women are more likely to vote Democrat than men, and so on. With data such as these, the basic logic of hypothesis testing continues to guide the analysis, but the manner in which we analyze the data differs from the \(t\)-tests used in the preceding chapter. The need for hypothesis testing remains the same: We have to determine whether the proportions we see in our sample are likely to reflect what is true in the population before we state a claim.

Take, for instance, the question of blacks killed by police officers. If you read the story you will see statements such as “30% of black victims were unarmed compared to 21% of white victims in 2015.” The Washington Post releases a continually updated data-set of all individuals killed by a police officer, so we could analyze the statement for ourselves. Similarly, we could shatter the myth of the chivalry of the seas with the specific example of the sinking of the Titanic, asking whether women were more likely to survive than men by studying the details of the passenger manifest. These are but two examples of the ways in which categorical data tied to a pressing issue or to a lingering romantic historical artifact can be used to judge claims for ourselves, and to correct the record if necessary.

To begin to do so, we could recognize that dichotomous (or binary) outcomes are those where only one of two events can occur at a time per person – a passenger either survives or does not survive – and that the binomial distribution characterizes such outcomes. We covered this distribution in Chapter 6, Section 6.2.1.1 to be precise. How could the binomial distribution help us here?

Let us start with the naive assumption that in any shipwreck involving passengers of both sexes, the probability of survival is \(0.50\) for both men and women. Recall that the probability of observing \(x\) successes in \(n\) trials of a binomial process is given by

\[\begin{eqnarray*} P\left[x \text{ successes}\right] = \binom{n}{x}p^{x}\left(1 - p\right)^{n-x} \\ \text{where } \binom{n}{x} = \dfrac{n!}{x!(n-x)!} \text{ and } n! = n \times (n-1) \times (n-2) \times \cdots \times 2 \times 1 \end{eqnarray*}\]

How many female passengers (aka trials) do we have? 466. Let us take this piece of data, together with \(p=0.50\), and generate the distribution of female passengers’ survival via the binomial distribution. The resulting distribution of the number of female passengers surviving the shipwreck is shown below. The distribution peaks at 233, which is \(0.5 \times 466\), i.e., the most likely outcome is that one-half of the female passengers survive.
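A minimal sketch of how this distribution could be generated in R with dbinom(); the object names here are purely illustrative.

n = 466                                # number of female passengers (trials)
p = 0.5                                # assumed probability of survival
survivors = 0:n
probs = dbinom(survivors, size = n, prob = p)
survivors[which.max(probs)]            # most likely count of survivors: 233 = n * p
plot(survivors, probs, type = "h", xlab = "Female passengers surviving", ylab = "Probability")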

FIGURE 10.1: Binomial Distribution of Female Passenger’s Survival with p = 0.5

How many of the female passengers survived the Titanic sinking? Some 339 did, which makes the actual probability of survival approximately \(0.7275\). If this were the true probability of a female passenger surviving a shipwreck, the distribution of survivals should have looked as follows:
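As a quick sketch of the logic we will formalize shortly, we could ask how probable a count of 339 or more survivors out of 466 would be if the true probability of survival were really 0.50; pbinom() gives the tail probability directly.

339 / 466                                                # observed proportion of survivors
pbinom(338, size = 466, prob = 0.5, lower.tail = FALSE)  # P(X >= 339) when p = 0.5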

FIGURE 10.2: Binomial Distribution of Female Passenger’s Survival with p = 0.5 vs. p = 0.7275

Note how the actual distribution of survival (in blue) differs from the expected distribution of survival (in red), and this difference is stark. But this is no surprise given that we had seen a massive gap between the expected probability of \(0.50\) and the actual probability of \(0.7275\). Assuming no complications with the data and factors shaping survival, this simple analysis would lead us to conclude that survival probabilities, at least on the Titanic, were quite a bit higher than \(0.5\) for female passengers. What about for the 843 male passengers? Their survival probability was about \(0.1910\), leading to two very different distributions of survival for the sexes.

FIGURE 10.3: Binomial Distribution of Female vs. Male Passengers

This side-by-side comparison would also lead us to suspect that males were less likely to survive than females. But how could we draw these sorts of conclusions via hypothesis tests? As it turns out, in a fairly straightforward manner, and I demonstrate the approach below.

10.1 Specifying Hypotheses for One-group Tests

Now we no longer refer to the null population mean \(\mu\) but instead to the null population proportion \(p_0\) when constructing hypotheses, with the sample proportion represented by \(\bar{p}\). In particular, the hypotheses will look as follows:

\[\begin{eqnarray*} \text{Lower Tail Test } \ldots H_{0}: p \geq p_{0}; H_{1}: p < p_{0} \\ \text{Upper Tail Test } \ldots H_{0}: p \leq p_{0}; H_{1}: p > p_{0} \\ \text{Two Tailed Test } \ldots H_{0}: p = p_{0}; H_{1}: p \neq p_{0} \end{eqnarray*}\]

For a binomial distribution, the mean is given by \(\mu = n \times p\), where \(n\) is the sample size and \(p\) is the probability of success in any given trial, and the standard deviation is given by \(\sigma = \sqrt{n \times p \times (1 - p)}\). The formulas for the confidence interval are beyond the scope of this course so I will not present them here, but you should note that there are several ways to calculate these intervals, with some methods using the z-score approximation and other approaches eschewing this approximation in favor of an exact confidence interval. Of course, we have seen one of these – the Agresti-Coull interval – in an earlier chapter in this text. Here I will explain (a) how we calculate these intervals in R, and (b) which method to employ in a particular situation.
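Before turning to the examples, a quick check of these two formulas for the 466 female passengers under \(p = 0.5\) (a sketch; the values match the distribution plotted earlier):

n = 466
p = 0.5
n * p                    # mean of the binomial distribution: 233
sqrt(n * p * (1 - p))    # standard deviation: about 10.8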

10.1.0.1 Example 1

The probability of surviving a shipwreck is assumed to be \(0.5\). When the Titanic went down, only some 38.1971% of its 1,309 passengers survived. Did the Titanic have a significantly lower survival rate than the assumed probability of \(0.5\)?

\[\begin{array}{l} H_0: \text{ The probability of surviving the Titanic is at least } 0.50 \; (p \geq 0.50) \\ H_1: \text{ The probability of surviving the Titanic is less than } 0.50 \; (p < 0.50) \end{array}\]

Let us use \(\alpha = 0.01\)

n = 1309
n
## [1] 1309
survived = 0.381971 * n
survived
## [1] 500
binom.test(500, n, p = 0.5, alternative = "two.sided", conf.level = 0.99)
## 
##  Exact binomial test
## 
## data:  500 out of n
## number of successes = 500, number of trials = 1309, p-value <
## 2.2e-16
## alternative hypothesis: true probability of success is not equal to 0.5
## 99 percent confidence interval:
##  0.3474826 0.4173286
## sample estimates:
## probability of success 
##               0.381971

The \(p\)-value is practically \(0\) so we can easily reject the null hypothesis. (We ran a two-sided test here; specifying alternative = "less" would mirror the lower-tail alternative exactly and leads to the same conclusion.) The data suggest that the Titanic had a significantly lower survival rate than \(0.5\).

library(binom)
binom.confint(500, n, conf.level = 0.99, methods = "all")
##           method   x    n      mean     lower     upper
## 1  agresti-coull 500 1309 0.3819710 0.3480520 0.4170804
## 2     asymptotic 500 1309 0.3819710 0.3473797 0.4165623
## 3          bayes 500 1309 0.3820611 0.3477381 0.4168202
## 4        cloglog 500 1309 0.3819710 0.3473688 0.4164534
## 5          exact 500 1309 0.3819710 0.3474826 0.4173286
## 6          logit 500 1309 0.3819710 0.3480274 0.4171071
## 7         probit 500 1309 0.3819710 0.3478930 0.4169894
## 8        profile 500 1309 0.3819710 0.3478206 0.4169179
## 9            lrt 500 1309 0.3819710 0.3478329 0.4169152
## 10     prop.test 500 1309 0.3819710 0.3556560 0.4089866
## 11        wilson 500 1309 0.3819710 0.3480571 0.4170753

The 99% confidence intervals are shown above, and most of them are very similar, the main exception being the one generated by prop.test. So the question becomes one of choosing from the available options, and the usual suggestion is that you settle for the Agresti-Coull or the Wilson method. Both of these indicate that we can be about 99% confident that the true survival rate for shipwrecks lies between approximately \(0.348\) and \(0.417\).

10.1.0.2 Example 2

Your public school district carried out a drug-use survey and found that, out of a student body of 497, the number reporting “occasional recreational use of opioids” was \(55\). The national average is reported to be \(6\%\); is your school district’s rate significantly different from the national average? Use \(\alpha = 0.05\).

\[\begin{array}{l} H_0: \text{ The district's rate is not different from the national average } (p = 0.06) \\ H_1: \text{ The district's rate is different from the national average } (p \neq 0.06) \end{array}\]

n = 497
n
## [1] 497
binom.test(55, n, p = 0.06, alternative = "two.sided", conf.level = 0.95)
## 
##  Exact binomial test
## 
## data:  55 out of n
## number of successes = 55, number of trials = 497, p-value =
## 1.671e-05
## alternative hypothesis: true probability of success is not equal to 0.06
## 95 percent confidence interval:
##  0.08446262 0.14160062
## sample estimates:
## probability of success 
##               0.110664
binom.confint(55, n, conf.level = 0.95, methods = c("ac", "wilson"))
##          method  x   n     mean      lower     upper
## 1 agresti-coull 55 497 0.110664 0.08585398 0.1414464
## 2        wilson 55 497 0.110664 0.08601362 0.1412868

The 95% confidence interval is 0.0859 to 0.1414 if using the Agresti-Coull interval, and 0.0860 to 0.1413 if using the Wilson interval. That is, we can be about 95% confident that the district’s true drug-use rate lies between roughly 8.6% and 14.1%. Since this interval lies entirely above the national average of 6%, and the \(p\)-value is far below 0.05, we reject the null hypothesis: the district’s rate is significantly different from (in fact higher than) the national average.

10.1.1 An Aside on the Normal Approximation Approach to the Binomial Test

The normal approximation we used above works well in “large samples,” though defining “large” is tricky. Some folks are perfectly content to rely on the central limit theorem and use the normal approximation so long as they have 30 or more cases, but others proceed with more caution. The latter group will focus on the known or, if unknown, the suspected population proportion \(p\). If \(p = 0.5\), then one can get away with the normal approximation with a sample size as small as 9. As \(p\) gets closer to \(0\) or \(1\), the sample size needed increases: if \(p=0.8\) or \(p=0.2\), you need \(n=36\), and if \(p=0.9\) or \(p=0.1\), you need \(n=81\).
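To see the difference in practice, here is a small, invented example (12 successes in 20 trials against a null of \(p = 0.5\)) run both ways; the numbers are purely illustrative.

prop.test(12, 20, p = 0.5)    # normal (chi-squared) approximation
binom.test(12, 20, p = 0.5)   # exact binomial test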

Others will eschew the normal approximation in favor of the more precise exact binomial test. I follow this route because, with the availability of computers, there is really no longer any need to use the normal approximation with proportions. In fact, you don’t even need a statistics package; all you need is a calculator and a browser, and you can run the tests via one of the many freely available online applets.

10.2 Two-group Tests

What if we have two groups instead of one as, for example, in the question of survival rates of males versus females on the Titanic, or even drug use among male versus female students in the school district? How can we carry out hypothesis tests for these designs?

10.2.1 Example 1

Assume, for example, that we are studying the Titanic disaster and interested in how survival rates differed by the passenger’s sex. A simple cross-tabulation of sex by survival status would look like the following:

FIGURE 10.4: Titanic Passengers’ Survival by Sex

TABLE 10.1: Survival Status by Sex
Died Survived Total
female 127 339 466
male 682 161 843
Total 809 500 1309

The bar-chart shows a marked difference in the chance of survival for male versus female passengers, with female passengers more likely to survive. Specifically, only 19.1% of males survived versus 72.75% of females. The same pattern is visible in the contingency table as well. On the basis of these visualizations, it is very likely that a test for an association between sex and survival will lead us to reject the null hypothesis of no association between the two variables. Let us now turn to the hypothesis test.

In essence, we have a \(2 \times 2\) contingency table and could test whether the proportion of female deaths differs from the proportion of male deaths.

fisher.test(titanic3$survived, titanic3$sex, alternative = "two.sided", 
    conf.int = TRUE)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  titanic3$survived and titanic3$sex
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.06712075 0.11644002
## sample estimates:
## odds ratio 
## 0.08867006
tab.1 = table(titanic3$sex, titanic3$survived)
survival = tab.1[, 2]   # number of survivors, by sex
cases = rowSums(tab.1)  # total passengers, by sex
prop.test(survival, cases, conf.level = 0.95, alternative = "two.sided")
## 
##  2-sample test for equality of proportions with continuity
##  correction
## 
## data:  survival out of cases
## X-squared = 363.62, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  0.4864599 0.5865065
## sample estimates:
##    prop 1    prop 2 
## 0.7274678 0.1909846

Note that the prop.test() needs two vectors – one with the number of successes per group (survival) and the other with the totals for each group (cases) – and that this test uses the \(\chi^2\) statistic. The prop.test() command can also be used for more than two groups, as shown below, for example, in testing whether survival rates differed by passenger class. Here you have to create the vector of successes and the vector of group totals from the table margins:

tab.2 = table(titanic3$survived, titanic3$pclass)
tab.3 = addmargins(tab.2)    # appends row and column totals
survival = tab.3[2, c(1:3)]  # survivors in each passenger class
cases = tab.3[3, c(1:3)]     # total passengers in each class

prop.test(survival, cases, alternative = "two.sided", conf.level = 0.95)
## 
##  3-sample test for equality of proportions without continuity
##  correction
## 
## data:  survival out of cases
## X-squared = 127.86, df = 2, p-value < 2.2e-16
## alternative hypothesis: two.sided
## sample estimates:
##    prop 1    prop 2    prop 3 
## 0.6191950 0.4296029 0.2552891

Finally, there is the G test, useful for very large samples and studies with complex experimental designs.

library(RVAideMemoire)
G.test(tab.2)
## 
##  G-test
## 
## data:  tab.2
## G = 127.77, df = 2, p-value < 2.2e-16

10.2.2 Example 2

Suicide rates tend to be very high in India, with the causes ranging from indebtedness to a sense of shame for having failed an exam, divorce, and other life-course events. Farmers tend to be one of the more vulnerable groups, with, by some accounts, some 60,000 farmers having committed suicide over the past three decades due to rising temperatures and the resultant stress on India’s agricultural sector. We have access to data on suicides over the 2001-2012 period, and this data-set contains information on 3,193 of the 19,799 suicides that occurred in 2012.

The question of interest is whether men and women are just as likely to commit suicide for a “Fall in Social Reputation”.

\[\begin{array}{l} H_0: \text{ There is no association between the suicide victim's sex and suicide because of a fall in social reputation} \\ H_1: \text{ There is an association between the suicide victim's sex and suicide because of a fall in social reputation} \end{array}\]

The contingency table and bar-chart are shown below:

TABLE 10.2: Suicide Cause by Sex (India, 2012)

         Dowry Dispute   Fall in Social Reputation   Total
Female              47                          50      97
Male                14                          79      93
Total               61                         129     190

FIGURE 10.5: Cause of Suicide by Sex

tab.1b = addmargins(tab.1)      # tab.1 here is the 2 x 2 sex-by-cause table shown in Table 10.2
reputation = tab.1b[c(1:2), 2]  # suicides attributed to a fall in social reputation, by sex
cases = tab.1b[c(1:2), 3]       # total suicides (both causes), by sex
prop.test(reputation, cases, alternative = "two.sided", conf.level = 0.95)
## 
##  2-sample test for equality of proportions with continuity
##  correction
## 
## data:  reputation out of cases
## X-squared = 22.79, df = 1, p-value = 1.807e-06
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.4677092 -0.2002877
## sample estimates:
##    prop 1    prop 2 
## 0.5154639 0.8494624

There is an online calculator for \(2 \times 2\) tables; see here. It does not report the exact \(p\)-value, but it is the one calculator we currently have. As it turns out, the \(p\)-value is indeed very small (almost \(0\)), and so this too allows us to reject the null hypothesis.

10.3 Measuring the Strength of the Association

Merely rejecting or failing to reject a null hypothesis is rarely the ultimate goal of any analysis. Say you reject the null hypothesis of a passenger’s survival being independent of their sex. Wouldn’t you at least want to know how strong the association between these two variables is? You often do, and we can answer questions about the strength of the association between categorical variables (be they nominal or ordinal) by way of some statistics. In this section we will look not only at our options but also at which measure of association should be used and when.

Say our contingency table is made up of two nominal variables, one being support for abortion and the other being the respondent’s educational attainment level. The data are shown in the table that follows:

TABLE 10.3: Educational Attainment and Support for Abortion

                  Less than High School   High School or More   Total
No Support (n)                       NA                    NA   555.0
No Support (%)                       NA                    NA    57.1
Yes Support (n)                      NA                    NA   417.0
Yes Support (%)                      NA                    NA    42.9
Total (n)                            NA                    NA   972.0
Total (%)                            NA                    NA   100.0

Having rejected the null hypothesis, our interest now is in being able to predict support for abortion from educational attainment levels because we suspect that support increases with education. In the language of data analysis, support for abortion would be called the dependent variable while educational attainment would be labeled the independent variable. You only see row totals of 555 and 417; i.e., the totals against and in support of abortion, respectively. You pick one person at random. Is this person more likely to support abortion or more likely not to support abortion? Well, the only thing you can do is look at the modal response, which is the 555 individuals indicating no support for abortion. So your best bet would be to expect the individual you drew at random to echo no support for abortion.

Now, what if you were provided the missing information, the missing cell frequencies?

TABLE 10.4: Educational Attainment and Support for Abortion

                  Less than High School   High School or More   Total
No Support (n)                    399.0                 156.0   555.0
No Support (%)                     64.3                  44.4    57.1
Yes Support (n)                   222.0                 195.0   417.0
Yes Support (%)                    35.7                  55.6    42.9
Total (n)                         621.0                 351.0   972.0
Total (%)                         100.0                 100.0   100.0

Now, a few things would become readily apparent. First, working with only the modal response you would have incorrectly predicted no support for abortion for 417 individuals. The percent correctly predicted would thus have been \(\dfrac{555}{972} \times 100 = 57.1\). Second, if you took educational attainment into account, how many errors would you make? You might have predicted that everyone with High School or more supports abortion, but only \(195\) do so, leading you to make \(156\) errors here. Similarly, you might have predicted that everyone with less than High School opposes abortion, but only \(399\) do so, leading you to make \(222\) errors here. Total errors when taking education (the independent variable) into account would then sum to \(156 + 222 = 378\).

These errors are fewer in number – \(378\) versus \(417\) – than when you lacked any information about the breakdown of support by education. In a way, then, you have reduced prediction error by folding in information on an individual’s educational attainment. This predictive leverage is what lies behind the notion of proportional reduction in error (PRE), with \(0 \leq PRE \leq 1\); the closer PRE is to \(1\), the greater the reduction in prediction error. Technically, PRE is calculated as follows:

  1. Set \(E_1 = n -\) Modal frequency \(= 972 - 555 = 417\). This is the number of prediction errors made by ignoring an independent variable.
  2. \(E_2 = (n_{column_i} - mode_{column_i}) + (n_{column_j} - mode_{column_j})\), \(\therefore E_2 = (621 - 399) + (351 - 195) =156 + 222 = 378\). This is the number of prediction errors made by taking an independent variable into account.

Now calculate \(PRE = \dfrac{E_1 - E_2}{E_1} = \dfrac{417 - 378}{417} = \dfrac{39}{417} = 0.09\). We improved our predictive ability by 9% when using educational attainment as compared to if we ignored or did not have access to information about an individual’s educational attainment.
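The same arithmetic in R, using the cell frequencies from Table 10.4 (the object name tab.abortion is illustrative):

tab.abortion = matrix(c(399, 222, 156, 195), nrow = 2,
    dimnames = list(c("No Support", "Yes Support"), c("Less than HS", "HS or More")))
E1 = sum(tab.abortion) - max(rowSums(tab.abortion))            # 972 - 555 = 417
E2 = sum(colSums(tab.abortion) - apply(tab.abortion, 2, max))  # 222 + 156 = 378
(E1 - E2) / E1                                                 # PRE, about 0.09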

There are several PRE measures that could be used, depending upon the nature of the variables that comprise the contingency table. Let us see each of these in turn.

10.3.1 Goodman-Kruskal Lambda \((\lambda)\)

\(\lambda = \dfrac{E_1 - E_2}{E_1}\) was used in the preceding example that explained the concept of proportional reduction in error. It is an asymmetrical measure of association in that its value will differ depending upon which variable is used as the dependent variable. For example, consider the following two tables, built with the same data except that one uses violence as the dependent variable while the other uses assailant’s status as the dependent variable.

TABLE 10.5: Perpetrator and Assault Type (1)
Stranger Not a Stranger
Rape/Sexual Assault 122090 350670
Robbery 930860 231040
Assault 3992090 4272230
TABLE 10.5: Perpetrator and Assault Type (2)
Rape/Sexual Assault Robbery Assault
Stranger 122090 930860 3992090
Not a Stranger 350670 231040 4272230

In the first table, \[\begin{array}{l} \lambda = \dfrac{E_1 - E_2}{E_1} \\ E_1 = n - \text{ Modal frequency of the dependent variable} = 9,898,980 - 8,264,320 = 1,634,660 \\ E_2 = (5,045,040 - 3,992,090) + (4,853,940 - 4,272,230) = 1,052,950 + 581,710 = 1,634,660 \\ \lambda = \dfrac{E_1 - E_2}{E_1} = \dfrac{1,634,660 - 1,634,660}{1,634,660} = 0 \end{array}\]

In the second table,

\[\begin{array}{l} \lambda = \dfrac{E_1 - E_2}{E_1} \\ E_1 = n - \text{ Modal frequency of the dependent variable} = 9,898,980 - 5,045,040 = 4,853,940 \\ E_2 = (472,760 - 350,670) + (1,161,900 - 930,860) + (8,264,320 - 4,272,230) \\ \therefore E_2 = 122,090 + 231,040 + 3,992,090 = 4,345,220 \\ \lambda = \dfrac{E_1 - E_2}{E_1} = \dfrac{4,853,940 - 4,345,220}{4,853,940} = 0.1048 \end{array}\]

Conventionally, the dependent variable is always the row variable and you should follow this rule.
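Assuming df1 and df2 hold the two versions of the table above, here is a sketch of how they might be constructed; the names simply match the calls that follow.

# Table 10.5 (1): type of violence as the row (dependent) variable
df1 = matrix(c(122090, 930860, 3992090, 350670, 231040, 4272230), nrow = 3,
    dimnames = list(c("Rape/Sexual Assault", "Robbery", "Assault"), c("Stranger", "Not a Stranger")))
# Table 10.5 (2): the same data with assailant status as the row (dependent) variable
df2 = t(df1)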

library(DescTools)
Lambda(df1, direction = c("row"))
## [1] 0
Lambda(df2, direction = c("row"))
## [1] 0.1048056

10.3.2 Phi \((\phi)\) Coefficient

If both variables are nominal and you have a \(2 \times 2\) contingency table, then we use the phi coefficient to measure the strength of the association between the two variables.

\(\phi = \sqrt{\dfrac{\chi^2}{n}}\)

Technically, if the table is as follows,

TABLE 10.6: Calculating phi
Alive Dead
Drug A a b
Drug B c d

then \(\phi = \dfrac{ad - bc}{\sqrt{(a+b)(c+d)(a+c)(b+d)}}\), with \(df=(r-1)(c-1) = 1\) for a \(2 \times 2\) table.

Say I have data on the pass rates of male and female students, as shown below.

TABLE 10.7: Calculating phi
Female Male
Fail 24 64
Pass 44 49
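Assuming tab1 holds this table, a sketch of how it could be constructed, along with a by-hand check of the formula:

tab1 = matrix(c(24, 44, 64, 49), nrow = 2,
    dimnames = list(c("Fail", "Pass"), c("Female", "Male")))
(24 * 49 - 64 * 44) / sqrt((24 + 64) * (44 + 49) * (24 + 44) * (64 + 49))  # about -0.2068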

I can also calculate \(\phi\) via packaged functions as follows:

library(psych)
phi(tab1, digits = 6)  # preferred since minus sign is preserved 
## [1] -0.206808
library(DescTools)
Phi(tab1)  # Note the absence of the minus sign!! 
## [1] 0.206808

10.3.3 Cramer’s \(V\) and Contingency \(C\)

Cramer’s \(V\) is used when both variables are nominal but at least one has more than \(2\) categories. Cramer’s V \(= \sqrt{\dfrac{\chi^2}{n \times m}}\) where \(m\) is the smaller of \((r-1)\) or \((c-1)\).

Contingency Coefficient \(C = \sqrt{ \dfrac{\chi^{2}}{\chi^{2} + n} }\) is recommended for \(5 \times 5\) or larger tables since it appears to underestimate the strength of the association in smaller tables.

Both \(C\) and \(V\) will fall in the \(\left[0,1\right]\) range, i.e., \(0 \leq V \leq 1\) and \(0 \leq C \leq 1\).

Reference tables are available to classify the strength of the association:

  1. For \(df=1\),
    • Small effect if \(0.10 < V \leq 0.30\)
    • Medium effect if \(0.30 < V \leq 0.50\)
    • Large effect if \(V > 0.50\)
  2. For \(df=2\),
    • Small effect if \(0.07 < V \leq 0.21\)
    • Medium effect if \(0.21 < V \leq 0.35\)
    • Large effect if \(V > 0.35\)
  3. For \(df=3\),
    • Small effect if \(0.06 < V \leq 0.17\)
    • Medium effect if \(0.17 < V \leq 0.29\)
    • Large effect if \(V > 0.29\)

TABLE 10.8: Pass/Fail Counts by County

             Pass   Fail
Bloom          21      5
Cobblestone     6     11
Dougal          7      8
Heimlich       27      5

Here we have the pass/fail counts for four counties and hence will use Cramer’s \(V\).
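Before calling the packaged function, a by-hand sketch of both measures for this table; the object name tab1 matches the call below and its construction here is illustrative.

tab1 = matrix(c(21, 6, 7, 27, 5, 11, 8, 5), nrow = 4,
    dimnames = list(c("Bloom", "Cobblestone", "Dougal", "Heimlich"), c("Pass", "Fail")))
chi2 = chisq.test(tab1)$statistic   # may warn about small expected cell counts
n = sum(tab1)
m = min(nrow(tab1), ncol(tab1)) - 1
sqrt(chi2 / (n * m))                # Cramer's V, about 0.44
sqrt(chi2 / (chi2 + n))             # contingency coefficient C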

library(DescTools)
CramerV(tab1, conf.level = 0.95)
##  Cramer V    lwr.ci    upr.ci 
## 0.4386881 0.1944364 0.6239856

10.3.4 Goodman-Kruskal Gamma \((\gamma)\)

What if we have two ordinal variables, as is the case in the contingency table shown below?

TABLE 10.9: Two Ordinal Variables

                              High School or More   Less than High School   Total
High Financial Satisfaction                  1194                     193    1387
Low Financial Satisfaction                    477                     147     624
Total                                        1671                     340    2011

There are four pairs in the table – High education and High financial satisfaction, High education and Low financial satisfaction, Low education and High financial satisfaction, and Low education and Low financial satisfaction. Let us call these pairs High-High, High-Low, Low-High, and Low-Low.

The research question in play is whether education has any impact on financial satisfaction. If it does so perfectly, without any error, then we should see all those high on education to also be high on financial satisfaction, and likewise all those low on education to also be low on financial satisfaction. However, that is not the case; we do see High-Low and Low-High pairs with non-zero frequencies!

One way to evaluate the association between the two variables might be to count how many concordant pairs there are (High-High and Low-Low) versus discordant pairs (High-Low and Low-High). Let us label the concordant pairs \(N_s\) and the discordant pairs \(N_d\). Then, the total number of concordant pairs possible would be given by the number High-High \(\times\) the number Low-Low. Similarly, the total number of discordant pairs would be given by the number High-Low \(\times\) the number Low-High. That is, for the given table, we would calculate

\[\begin{array}{l} N_s = 1194 \times 147 = 175,518 \\ N_d = 193 \times 477 = 92,061 \end{array}\]

To see whether concordant or discordant pairs dominate, we can calculate the ratio of the difference between the number of concordant and discordant pairs to the total number of pairs (concordant + discordant). This is the measure of association called Goodman-Kruskal’s gamma \((\gamma)\), where

\[\gamma = \dfrac{N_s - N_d}{N_s + N_d}\]

such that

\[\begin{array}{l} N_s > N_d, \gamma \to +1 \\ N_s < N_d, \gamma \to -1 \\ N_s = N_d, \gamma = 0 \end{array}\]

In the present example, we have \(\gamma = \dfrac{N_s - N_d}{N_s + N_d} = \dfrac{175,518 - 92,061}{175,518 + 92,061}=0.31\), indicating a moderate impact of education on financial satisfaction.

Note, before we move on, that if \(N_s = 100\) and \(N_d = 0\), then \(\gamma = \dfrac{N_s - N_d}{N_s + N_d} = \dfrac{100 - 0}{100 + 0} = 1\), and if \(N_s = 0\) and \(N_d = 100\), then \(\gamma = \dfrac{0 - 100}{0 + 100} = -1\).
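A sketch of this arithmetic in R, using the cell counts from Table 10.9:

Ns = 1194 * 147         # concordant pairs: High-High x Low-Low
Nd = 193 * 477          # discordant pairs: High-Low x Low-High
(Ns - Nd) / (Ns + Nd)   # gamma, about 0.31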

GoodmanKruskalGamma(tab1, conf.level = 0.95)
##        gamma       lwr.ci       upr.ci 
##  0.007436196 -0.027156738  0.042029130

10.3.5 Kendall’s \((\tau_b)\)

Unfortunately, however, \(\gamma\) ignores what are called tied pairs. Say what?

TABLE 10.10: Tied Dependent and Independent Pairs

                              High School or More   Less than High School   Total
High Financial Satisfaction                  1194                     193    1387
Low Financial Satisfaction                    477                     147     624
Total                                        1671                     340    2011

With tied dependent pairs \((T_y)\), we have individuals with the same value of the dependent variable but different values of the independent variable. Here, 1,194 and 193 are tied on the dependent variable value of High while 477 and 147 are tied on the dependent variable value of Low. With tied independent pairs \((T_x)\), we have individuals with the same value of the independent variable but different values of the dependent variable. Here, 1194 and 477 are tied on the independent variable value of High while 193 and 147 are tied on the independent variable value of Low.

\[\begin{array}{l} T_y = (1,194 \times 193) + (477 \times 147) = 230,442 + 70,119= 300,561 \\ T_x = (1,194 \times 477) + (193 \times 147) = 569,538 + 28,371 = 597,909 \end{array}\]

Then, we can calculate \(\gamma\)'s replacement: \(\tau_b = \dfrac{N_s - N_d}{\sqrt{(N_s + N_d + T_y)(N_s + N_d + T_x)}} = \dfrac{83,457}{701,226} = 0.12\), indicative of a weak positive association. This estimate is much smaller than what we had for \(\gamma\), but that should be no surprise given that \(\tau_b\) will, as a rule, be \(< \gamma\) because \(\tau_b\) takes all tied pairs into account whereas \(\gamma\) does not.

\(\tau_b\) is best used with square tables: \(2 \times 2\), \(3 \times 3\), and so on.
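Continuing the by-hand sketch from the gamma calculation above (Ns and Nd as defined there):

Ty = (1194 * 193) + (477 * 147)   # pairs tied on the dependent variable
Tx = (1194 * 477) + (193 * 147)   # pairs tied on the independent variable
(Ns - Nd) / sqrt((Ns + Nd + Ty) * (Ns + Nd + Tx))   # tau-b, about 0.12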

KendallTauB(tab1, conf.level = 0.95)
##        tau_b       lwr.ci       upr.ci 
##  0.004384523 -0.016014767  0.024783814

10.3.6 Kendall-Stuart’s \((\tau_c)\)

This measure can be used instead of \(\tau_b\) when you have tables that are not square: \(2 \times 3\), \(3 \times 4\), and so on. Specifically, \(\tau_c = \left(N_s - N_d\right) \times \left[2m / \left(n^{2} (m - 1)\right)\right]\), where \(m\) is the number of rows or columns, whichever is smaller, and \(n\) is the sample size. All properties that hold for \(\tau_b\) apply to \(\tau_c\) as well.
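As a sketch, applying the formula to the education and financial satisfaction table (with Ns and Nd as before, \(n = 2011\) and \(m = 2\)):

n = 2011
m = 2
(Ns - Nd) * (2 * m) / (n^2 * (m - 1))   # tau-c, about 0.08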

StuartTauC(tab1, conf.level = 0.95)
##         tauc       lwr.ci       upr.ci 
##  0.003869367 -0.014133187  0.021871920

10.3.7 Somers’ \(D\)

\(\gamma\), \(\tau_b\), and \(\tau_c\) are all symmetric measures in that it does not matter which variable is the independent variable and which the dependent. Indeed, you may have no a priori idea of which is your dependent variable when using these, and the resulting estimates would still be fine. However, when you have a priori hypotheses about a particular variable being the dependent variable of interest, then you should use Somers’ \(D\).

\[D_{yx} = \dfrac{N_s - N_d}{N_s + N_d + T_y}\]

Again, because this measure adjusts for tied dependent pairs its value will be less than \(\gamma\). In the ongoing example of education and financial satisfaction,

\[D_{yx} = \dfrac{N_s - N_d}{N_s + N_d + T_y} = \dfrac{175,518 - 92,061}{175,518 + 92,061 + 300,561} = \dfrac{83,457}{568,140} = 0.1468951\]

Somers’ \(D\) will work with square and non-square tables of any size.
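The same arithmetic as a sketch, with Ns, Nd, and Ty as defined earlier:

(Ns - Nd) / (Ns + Nd + Ty)   # Somers' D, about 0.147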

library(DescTools)
SomersDelta(tab1, conf.level = 0.95, direction = "column")
##       somers       lwr.ci       upr.ci 
##  0.004249677 -0.015522198  0.024021551
SomersDelta(tab1, conf.level = 0.95, direction = "row")
##       somers       lwr.ci       upr.ci 
##  0.004523649 -0.016523097  0.025570395

10.4 Practice Problems

Problem 1 Radioactive waste, John Wayne, and Hollywood. Yup, not making that up, read it for yourself! Turns out that John Wayne starred in a movie (The Conqueror) that was shot in St. George (Utah) in 1954, on a site 137 miles downwind of a nuclear testing site in Yucca Flat (Nevada). As it happens, by the early 1980s some 91 of the 220 cast and crew had been diagnosed with cancer of one form or another. According to epidemiological data available to us, only 14% of the population this group of 220 represented should have been diagnosed with cancer between 1954 and the early 1980s. Was a cast or crew member more likely to get cancer because of exposure to radioactive waste while working on the film?

Problem 2

Suicides and the holiday season, a common myth implies, go hand in hand. Similarly, “articles in the medical literature and lay press have supported a belief that individuals, including those dying of cancer, can temporarily postpone their death to survive a major holiday or other significant event, but results and effects have been variable” (Young and Hade 2004). To study this latter question, whether death does indeed take a holiday or not, the authors looked at 12,028 cancer deaths that occurred in Ohio between 1989 and 2000. Of these deaths, 6,052 occurred in the week before Christmas while the rest occurred in the week after Christmas. Do these data suggest that people are really able to postpone their passing until after the holiday season?

Problem 3

Young and Hade (2004) also looked at the breakdown of these cancer deaths by the deceased’s sex, finding that of the 6,252 men, 3,192 died in the week before Christmas while of the 5,776 women, 2,858 died in the week before Christmas. Do these data suggest a difference in the ability of men and women to control the timing of their passing?

Problem 4

Is a driver of one race more likely to be stopped than a driver of another race? When stopped, what is the “hit rate” for each group – the probability that the stop actually results in some contraband being found on the driver or in the vehicle? If the hit rate is lower for minority drivers, that is typically used as an “outcome test” to argue that minority drivers are pulled over without good cause more often than are non-minority drivers. This isn’t foolproof evidence, though, because of the problem of “infra-marginality”, as has been argued here.

Let us look at Rhode Island’s data, provided to us courtesy of the Stanford Open Policing Project 2017. You can download the full data-set from here.

Once you download the data I want you to create a simple dummy variable that groups the drivers into one of two mutually exclusive groups – Minority \(=1\) if the driver’s race is “Black” or “Hispanic” and Minority \(=0\) if the driver’s race is recorded as “White”. Then calculate the search rates and hit rates for each group. The result should be a table such as the following:

Group           Stops   Searches   Hits   Search Rate    Hit Rate
Minority        69883       3533   1475    0.05055593   0.4174922
Non-Minority   159808       3759   1828    0.02352198   0.4862995

Now answer the following questions:

  1. Are search rates (i.e., number of searches per number of stops) independent of a driver’s minority status?
  2. Are hit rates (i.e., number of hits per number of searches) independent of a driver’s minority status?
  3. When you combine the information about search rates with information on hit rates, what story emerges? Does this suggest a possible pattern of discrimination?

Problem 5

Some 15.9% of all first-time enrollees in a two- or four-year program at a college/university tend to be first-generation students (students who are the first in their family to study beyond high school). A major public university in the Midwest claims that of the 6,000 students who enrolled in the Fall of 2017, some 22.7% were first-generation students. Is this university’s rate significantly different from the national average?

Problem 6

The Youth Risk Behavior Surveillance System (YRBSS) was developed in 1990 to monitor priority health risk behaviors that contribute markedly to the leading causes of death, disability, and social problems among youth and adults in the United States. These behaviors, often established during childhood and early adolescence, include (i) Behaviors that contribute to unintentional injuries and violence; (ii) Sexual behaviors related to unintended pregnancy and sexually transmitted infections, including HIV infection; (iii) Alcohol and other drug use; (iv) Tobacco use; (v) Unhealthy dietary behaviors; and (vi) Inadequate physical activity. In addition, the YRBSS monitors the prevalence of obesity and asthma and other priority health-related behaviors plus sexual identity and sex of sexual contacts. From 1991 through 2015, the YRBSS has collected data from more than 3.8 million high school students in 1,700+ separate surveys.

The problems that follow rely upon the YRBSS 2015 data and the documentation for the data-set can be found here. Read the documentation carefully, in particular, the details of the survey questions. Then answer the following questions:

  1. Is there an association between drinking and driving (Q11) and texting and driving (Q12)?
  2. How strong is the relationship between these two variables?

Problem 7

The General Social Survey (GSS) gathers data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviors, and attributes. Hundreds of trends have been tracked since 1972. In addition, since the GSS adopted questions from earlier surveys, trends can be followed for up to 70 years. The GSS contains a standard core of demographic, behavioral, and attitudinal questions, plus topics of special interest. Among the topics covered are civil liberties, crime and violence, intergroup tolerance, morality, national spending priorities, psychological well-being, social mobility, and stress and traumatic events. Altogether the GSS is the single best source for sociological and attitudinal trend data covering the United States. It allows researchers to examine the structure and functioning of society in general as well as the role played by relevant subgroups and to compare the United States to other nations.

Using these 2016 GSS data, test (a) whether educational attainment (coldeg1) is related to confidence in the scientific community (consci), and (b) the strength of this relationship.

Problem 8

In 1984, the Centers for Disease Control and Prevention (CDC) initiated the state-based Behavioral Risk Factor Surveillance System (BRFSS) –- a cross-sectional telephone survey that state health departments conduct monthly over landline telephones and cellular telephones with a standardized questionnaire and technical and methodologic assistance from CDC. BRFSS is used to collect prevalence data among adult U.S. residents regarding their risk behaviors and preventive health practices that can affect their health status. Respondent data are forwarded to CDC to be aggregated for each state, returned with standard tabulations, and published at year’s end by each state. In 2011, more than 500,000 interviews were conducted in the states, the District of Columbia, and participating U.S. territories and other geographic areas.

The 2016 BRFSS data-set for Ohio is available here and the accompanying codebook is here. Using these data, answer the following questions:

  1. Test for a possible relationship between level of education completed (v_educag) and the four categories of body mass index (v_bmi5ca). How strong is this relationship?
  2. Test for a possible relationship between income (v_incomg) and the four categories of body mass index (v_bmi5ca). How strong is this relationship?
  3. Use an appropriate chart to reflect the relationships you tested in (a) and (b).