The discussion of the Pearson correlation coefficient in the last chapter leads us to an ally of this coefficient: regression analysis. We will see that the Pearson correlation coefficient reflects the linear relationship between two variables and that regression analysis uses this relationship to predict future scores.
In bivariate (sometimes called simple) linear regression analysis there are two variables, X and Y, just as we had for the Pearson correlation coefficient. In the regression context, however, X is thought of as an independent variable and Y is thought of as a dependent variable. The regression process is referred to as regressing Y on X. An important objective in regression analysis is to develop a linear equation that will predict Y using X. In this light, consider Professor Farcle’s data from a regression point of view.
In a regression context, Professor Farcle is interested in developing a linear equation that would enable her to use an employee’s score on the expected job performance (EJP) instrument to predict the score on the actual job performance (AJP) instrument. That is, she would like to regress AJP on EJP. If she were successful, the companies could use Professor Farcle’s EJP instrument with prospective employees to eliminate those who are predicted to have unsuccessful job performance. To accomplish this, Professor Farcle must complete the steps described in the following paragraphs.
The following paragraphs discuss the linear equation that Professor Farcle must develop using the linear relationship measured by the Pearson product-moment correlation coefficient.
In the discussion of the Pearson correlation coefficient, we found that the Pearson correlation coefficient is a measure of linear relationship between two variables. We also found that the only time the dots that represent the X and Y scores in a scatterplot fall on a straight line is when the Pearson correlation is perfect, that is, +1.00 or −1.00. Given a perfect Pearson correlation and a score on X, we can easily find the corresponding score on Y by (a) drawing a straight line, perpendicular to the x-axis, from X to the line that the dots fall on and (b) drawing another straight line, perpendicular to the y-axis, from that point to the y-axis to read the corresponding Y value. For example, in Figure 8a(i), rXY = 1; for an X value of 146 there is a Y value of 78. As you can see, given a perfect Pearson correlation and a value for X, you can find exactly what the value for Y will be.
Most Pearson correlations are not perfect, however. When a Pearson correlation is not equal to +1.00 or −1.00, a given value of X can have several different possible values of Y. Figure 8a(ii) illustrates this situation for the data in Figure 8b. The Pearson correlation is .56; and given an X value of 20, we can see that subjects who have 20 on X have scores ranging from 7 to 24 on Y. In this situation, given a value for X, what would be the value for Y? The answer to this question is a predicted Y value found on a straight line through the complete set of dots. It is also clear, however, that such a line will not accurately predict all (or perhaps any) of the cases with X = 20 (but it is the line that minimizes these errors). Figure 8a(iii) shows this prediction. The equation for this straight line is presented next.
The linear equation that is used to predict Y from X in regression analysis is:
\[ \begin{equation} \hat{Y} = b_0 + b_1X \tag{8-1} \end{equation} \] \(\hat{Y}\) is called “Y-HAT” and is the predicted value of a dependent variable denoted by Y; X denotes the values of an independent variable; and \(b_0\) and \(b_1\) are found using the following equations:
\[
\begin{equation}
b_1 = r_{XY} * \frac {s_Y} {s_X}
\tag{8-2}
\end{equation}
\] \[
\begin{equation}
b_0 = M_Y - b_1 M_X
\tag{8-3}
\end{equation}
\]

### Figure 8a(i) Given a correlation of plus or minus one, if a value of X is known, the corresponding value of Y is a single value
In equation (8-2), rXY is the Pearson correlation between the independent variable X and the dependent variable Y; sY is the standard deviation of Y; and sX is the standard deviation of X. In equation (8-3), MY is the mean of the Y scores; \(b_1\) is found in equation (8-2); and MX is the mean of the X scores. The value \(b_0\) is called the Y-intercept and \(b_1\) is called the slope of the line they form. Together they are called regression coefficients. Remember that the Y-intercept is the point on the y-axis where X = 0, and the slope is a measure of the pitch of the regression line. The slope is found as the ratio of a difference in Y values to a difference in X values (see Equation 7-5). These terms can be reviewed in Appendix E.
We can use the job performance data for company B shown in figure 7a(ii) to illustrate these equations. From figure 7a(ii), we have the following information:
\[ \begin{align} M_X &= 26.7 \\ s_X^2 &= 194.456 \\ s_X &= 13.9447 \\ \\ M_Y &= 25.1 \\ s_Y^2 &= 175.211 \\ s_Y &= 13.2367 \end{align} \] Pearson’s correlation was \(r_{XY} = .7773\) (for some statistics, we keep more decimal places to use in the calculations so as to attain more accurate values, that is, less rounding error, for the final result).
Using equation (8-2), we have: \[ b_1 = .7773 * (13.2367/13.9447) = 0.738 \] Recall that because the correlation cannot be larger than 1.0 (in absolute value), we do not include the zero in the ones place when reporting r. However, the regression coefficients can be larger than 1.0 (in absolute value), and therefore we must include the zero before the decimal point.
Using equation (8-3), we have: \[ b_0 = 25.10 - (0.738 * 26.70) = 5.4 \] The regression equation is: \[ \hat{Y} = 5.4 + (0.738X) \] where \(\hat{Y}\), or Y-HAT, is the predicted Y score for a unit of the population with any score X. Note that the same prediction formula is used for any value of X. This leads to some of the assumptions we will address later. Also note that units will have the same predicted value any time they have the same X value. In this sense, there is no Y in Y-HAT (i.e., we don’t need to know Y in order to predict Y). Note that Y really is required for the analysis because we need the Y values in order to calculate the regression coefficients, but once we have the regression prediction model, we no longer need the Y values in order to predict Y.
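If you want to check these hand calculations, the following is a minimal R sketch. It assumes hypothetical vectors x and y that hold the company B EJP and AJP scores (these vectors, and the lm() check at the end, are not part of the example above).

```r
# Minimal sketch (assumes x and y are numeric vectors holding the
# company B EJP and AJP scores; these vectors are not defined in the text).
r_xy  <- cor(x, y)                 # Pearson correlation, r_XY
b1    <- r_xy * sd(y) / sd(x)      # slope, equation (8-2)
b0    <- mean(y) - b1 * mean(x)    # Y-intercept, equation (8-3)
y_hat <- b0 + b1 * x               # predicted scores, equation (8-1)

# The same coefficients from R's built-in least squares fit
coef(lm(y ~ x))
```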
In developing a regression equation, note that the mean of the original Y scores is equal to the mean of the predicted Y scores. In the previous example, the mean of the Y scores was 25.10, which is also the mean of the YHAT scores. Also note that the variance of the predicted scores is usually less than the variance of the original scores. This will be true unless rXY = ±1, in which case the variance of the predicted scores will be equal to the variance of the original scores. In the last example, the variance of the original Y scores was 175.211, and the variance of the YHAT scores was 105.852.
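Continuing the sketch above, here is a quick check of these two facts (y_hat and r_xy come from that hypothetical sketch, not from the chapter’s figures):

```r
# The predicted scores have the same mean as Y but a smaller variance
mean(y_hat); mean(y)      # both about 25.1 for company B
var(y_hat); var(y)        # about 105.9 versus 175.2
var(y_hat) / var(y)       # equals r_xy^2, about .60
```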
In general, the smaller variance of the predicted scores means that the predicted scores are closer to their mean than the original scores are to their mean. This result was first noticed by Francis Galton (1889) in his study of the regression of sons’ heights on their fathers’ heights. In Galton’s study, the sons of very tall and very short fathers were predicted to have heights that were closer to the mean of all of the fathers. This finding was referred to as the phenomenon of “regression towards the mean.” The term regression in regression analysis originated with Galton’s study.
Figure 8b Data shown in Figure 8a(ii), ordered on X
You can get an indication of how accurately the regression equation predicts Y by considering the difference between each Y and its predicted value, that is,
\[ \begin{equation} \text {Residual} = e = Y - \hat{Y} \tag{8-4} \end{equation} \] or for each case, \[ \begin{equation} e_i = Y_i - \hat{Y}_i \tag{8-4} \end{equation} \] In Equation (8-4), e stands for the error in predicting Y from X (\(e_i\) is the error for case i). These errors are sometimes called residuals because they are that part of Y that is not predicted by X. When the e is small for a case, then the equation does a good job of prediction for that case. When the e is large, the equation does not do a good job of prediction for that case. Mathematical statisticians developed the coefficients (\(b_0\) and \(b_1\)) in the regression equation so that the sum of the errors is always 0. Therefore, a criterion used to judge the magnitude of the errors is the sum of the squared errors, that is,
\[ \begin{equation} \text{Sum of Squared Residuals} = \sum e_i^2 \tag{8-5} \end{equation} \] The regression equation was developed (under what is known as the “least squares criterion”) so that the sum of the squared errors is smaller than what would be found for any other straight line. If we select any other values for \(b_0\) and \(b_1\), we will find a larger sum of squared errors. This fact is illustrated using a transformation in figure 8c.
In Figure 8c, the predicted Y2 scores were put in column 3 (labeled 3:\(b_0\)+\(b_1\)*X2), the errors of prediction were put in column 4 (labeled 4:Y2-YHAT), and the squared errors were put in column 5 (labeled 5:E^2). The sum of the squared errors found with the regression equation is 624.123. The predicted values based on another arbitrarily chosen straight line (Y = 6 + .5X) were put in column 7. In column 8, we have the errors found using this new equation; and in column 9, we have these errors squared. The sum of the squared errors found with the arbitrarily chosen equation is 1053.75. This sum of squared errors is larger than that found for the regression equation. The point is that no other straight line will yield a smaller sum of errors squared. Using the error, e, found for each value of Y, we can write Y as:
\[ \begin{equation} Y = \hat{Y} + e \tag{8-6} \end{equation} \] or as:
\[ \begin{equation} Y = b_0 + b_1X + e \tag{8-7} \end{equation} \] You can use the data in columns Y, Predicted Y, and Residual from Figure 8c to check these equations for yourself.
In Figure 8c, Predicted Y = \(5.4 + 0.738X\) and NEW Predicted Y = \(6 + 0.5X\).
Some things to notice:
\[ \begin{align} \sum e_i &= 0 && \text{(regression line)} \\ \sum e_i^2 &= 624.123 && \text{(regression line)} \\ \sum e_i^2 &= 1053.75 && \text{(arbitrary line, } Y = 6 + 0.5X \text{)} \\ \sum Y &= \sum \hat{Y} \end{align} \]
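The least squares property can also be checked numerically. The sketch below again assumes the hypothetical vectors x and y and the coefficients b0 and b1 from the earlier sketch, and compares the regression line with the arbitrary line Y = 6 + .5X.

```r
# Residuals from the least squares line and from an arbitrary line
e_ls  <- y - (b0 + b1 * x)    # equation (8-4), regression residuals
e_alt <- y - (6 + 0.5 * x)    # residuals from the arbitrary line

sum(e_ls)      # essentially 0 (up to rounding) for the least squares line
sum(e_ls^2)    # about 624.12 for the company B data
sum(e_alt^2)   # about 1053.75, larger, as the least squares criterion guarantees
```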
The variance of the errors found in a regression analysis is called the variance error of estimate; and its square root, the standard deviation of the errors, is called the standard error of estimate. We can calculate the standard error of estimate by finding the error associated with each Y score and then using an equation like (5-5) in chapter 5 to find the standard deviation of the errors.
\[ \begin{equation} SD = s_X =\sqrt{ \frac {\sum(X-M)^2} {n-1}} = \sqrt{s_X^2} \tag{5-5} \end{equation} \] Because the sum of the prediction errors is always 0, the mean of the errors is 0; the deviations from the mean in equation (5-5) are therefore simply the errors themselves, so the numerator becomes the sum of the squared errors. However, when we have two variables (one predictor and one outcome), both in hypothesis testing and when estimating the population standard error of estimate, the denominator of the equation is set to \((n - 2)\). Similarly, the variance of the residuals (later called the Mean Square Error when used as an estimate) is also found using \((n - 2)\) degrees of freedom when there is one predictor. The degrees of freedom will change as additional predictors are added in multiple regression. Generally, in regression, the degrees of freedom for the residuals are calculated as \((n - k - 1)\), where k is the number of predictors. So with just one predictor, as we have in this chapter, the degrees of freedom used for the residuals are \((n - 1 - 1 = n - 2)\). Therefore, equation (5-5) must be rewritten as:
\[ \begin{equation} \text {Standard Error of the Estimate = SEE} = s_e =\sqrt{ \frac {\sum(e_i)^2} {n-2}} \tag{8-8} \end{equation} \] where \(s_e\) is the standard deviation of the errors, or the standard error of estimate; e is the error found in equation (8-4); and n is the number of subjects sampled.
\[ \begin{align} \text {Mean X} = M_X = \sum {X/n} = 267/10 &= 26.7\\ \text {Mean Y} = M_Y = \sum {Y/n} = 251/10 &= 25.1\\ \text {Mean Predicted Y} = \sum {\hat{Y}/n} = 251/10 &= 25.1 \\ \sum {e_i} &= 0 \\ \sum {e_i^2} &= 624.123 \\ \\ \text {RMSE (reported in JAMOVI)} = \sqrt {\frac {\sum e_i^2} {n}} = \sqrt {\frac {624.123} {10}} &= 7.9 \\ \text {"Regular" SD} = \sqrt {\frac {\sum e_i^2} {n-1}} = \sqrt {\frac {624.123} {9}} &= 8.327 \\ \text {Variance Error of the Estimate} = MSE = \frac {\sum e_i^2} {n-2} = \frac {624.123} {8} &= 78.015 \\ \text {Standard Error of the Estimate} = SEE = \sqrt {\frac {\sum e_i^2} {n-2}} = \sqrt{MSE} = \sqrt{78.015} &= 8.833 \end{align} \]
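The scalings shown above can be reproduced in a few lines of R; this sketch assumes the residual vector e_ls from the earlier hypothetical sketch.

```r
# Different scalings of the residuals (company B data, n = 10)
n    <- length(e_ls)
rmse <- sqrt(sum(e_ls^2) / n)        # RMSE as reported by jamovi (divides by n)
sd_e <- sqrt(sum(e_ls^2) / (n - 1))  # "regular" standard deviation of the residuals
mse  <- sum(e_ls^2) / (n - 2)        # variance error of estimate (Mean Square Error)
see  <- sqrt(mse)                    # standard error of estimate, equation (8-8)
c(RMSE = rmse, SD = sd_e, MSE = mse, SEE = see)
```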
The variance error of estimate can also be found using the equation: \[ \begin{equation} s_e^2 = s_Y^2 (1-r_{XY}^2) \tag{8-9} \end{equation} \]
We can rewrite equation (8-9) as: \[ \begin{equation} r_{XY}^2 = 1 - \frac {s_e^2} {s_Y^2} \tag{8-11} \end{equation} \] We can interpret the squared Pearson correlation coefficient as the proportion of the variance in the Y scores that is accounted for by the X scores. Since (\(s_e^2\) / \(s_Y^2\)) represents the proportion of the variance of the Y scores that is NOT accounted for by the X scores, Equation (8-11), \(1 - (s_e^2 / s_Y^2)\), represents the proportion of the variance of the Y scores that IS accounted for by the regression equation (i.e., \(r_{XY}^2\)).
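As a worked check of equations (8-9) and (8-11), using the company B values reported earlier:

\[ \begin{align} s_e^2 &= s_Y^2(1 - r_{XY}^2) = 175.211\,(1 - .7773^2) \approx 69.35 \\ r_{XY}^2 &= 1 - \frac{s_e^2}{s_Y^2} = 1 - \frac{69.35}{175.211} \approx .60 \end{align} \]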
Figure 8e(i) is a Venn diagram, which illustrates the variance of the Y scores in a regression analysis. In the context of Professor Farcle’s study, Figure 8e(i) represents the total variance of the actual job performance (AJP) of the employees at a given company. If the variable is standardized, then this variance = 1.
In Figure 8e(ii), the proportion of the variance of Y that is accounted for (predicted) by X is illustrated as the overlap between the variance of Y and the variance of X. This overlap represents the proportion of variance of AJP that is accounted for by the variance of EJP. In Figure 8e(ii), the proportion of variance of Y that is not accounted for by X (that is, does not overlap with X) is called the error variance. The error variance is the variance in the Y scores that is unrelated to X.
Why do the scores of units vary? Consider the employees under investigation by Professor Farcle, and rephrase the question to: Why do employees vary on a measure of actual job performance (AJP)? The answer to this question is that employees vary on AJP because they differ with respect to training, personality, aptitude, intelligence, and chance events. We could account for the variance of AJP if we had measures of their training, personality, and so on. Then, we could easily predict every employee’s AJP score. Of course this is impossible, but what we can do is use a measure like Professor Farcle’s expected job performance (EJP) to measure some of these variables. In her regression analysis, Professor Farcle accounted for the variation of the dependent variable, AJP, based on the independent variable, EJP, and then examined how well she had done by considering the unexplained variation among the employees.
The size of the variance error of estimate is an indication of how well Professor Farcle has predicted the variation among the employees by knowing their scores on EJP. The ratio of the variance error of estimate to the variance of the AJP scores provides an estimate of the proportion of the AJP scores variance that is not accounted for by the EJP scores; and 1 minus this proportion [equation (8-11)] represents the proportion of the variance of the AJP scores that is accounted for by the EJP scores.
Some books use the notation \(s_{Y \cdot X}\) to denote the standard error of estimate. This notation is read as “the standard deviation of the Y scores given the X scores.” This notation makes sense since the errors are that part of Y that is not accounted for given the X scores.
For the data in figure 7a(ii), \((s_e^2 / s_Y^2) = 69.35/175.211 = .3958\), or roughly 40% of the variance of the Y scores is not accounted for by the X scores. Therefore, \(1 - .3958\), or roughly 60% of the variance of the Y scores, is accounted for by the X scores. We can verify this result using equation (8-11), where we find that \(r_{XY}^2 = .7773^2 = .60\).
If you consider the relationship between se and e in equation (8-8), you can see that when the errors are large, se will be large. Therefore, another criterion for a good regression equation is that the standard error of estimate, se, be small. Also, equation (8-11) indicates that when se is small \(r_{XY}^2\) will be large. Thus, another criterion for a good regression equation is that the Pearson correlation, rXY (and therefore, also, \(r_{XY}^2\)), between the X and Y pairs be large.
FINDING AND PLOTTING THE REGRESSION LINE
Figure 8f is the plot of the regression line for the data from company B in figure 7a(ii). You can easily find this line by hand by locating the following two points: (a) the Y-intercept, the point on the y-axis where X = 0, yielding the coordinate point (0, \(b_0\)), and (b) the point identified by the mean of X and the mean of Y. In this example, these two points are (0, 5.4) for the Y-intercept and (26.7, 25.1) for the two means. Since two points determine a straight line and these two points are always on the regression line, we can draw the regression line shown in Figure 8f. You can verify that the means of the two variables fall on the regression line by substituting the mean of X into the regression equation; the resulting predicted Y will be the mean of Y.
Note that predictions for data outside the range of the observed X and Y scores are extremely hazardous. For example, consider the rainfall in millimeters (X) versus number of bushels of corn per acre (Y) data in figure 7f. What if a researcher only had information on rainfall up to 30 millimeters but wanted to predict values of Y for X = 55? A regression equation based on Xs in a range from 0 to 30 would predict that a large number of bushels of corn would be harvested with 55 millimeters of rain. This prediction would be far from the truth because of the curvilinear relationship that is found when further values of X are considered.
Plot the Regression Equation
R can be used to find the regression equation and to draw the regression line illustrated in Figure 8f. In the following examples, we will use jamovi to find the regression equation and print a scatterplot of the regression line. Syntax will be provided in the SUMMARY.
jmv::linReg(
    data = data,
    dep = Y,
    covs = X,
    blocks = list(
        list("X")),
    refLevels = list(),
    r2Adj = TRUE,
    rmse = TRUE,
    modelTest = TRUE,
    anova = TRUE,
    ci = TRUE,
    stdEst = TRUE,
    norm = TRUE,
    qqPlot = TRUE,
    resPlots = TRUE,
    collin = TRUE,
    cooks = TRUE)
LINEAR REGRESSION
Model Fit Measures
───────────────────────────────────────────────────────────────────────────────────────────
Model R R² Adjusted R² RMSE F df1 df2 p
───────────────────────────────────────────────────────────────────────────────────────────
1 0.77731 0.60421 0.55474 7.9001 12.213 1 8 0.00814
───────────────────────────────────────────────────────────────────────────────────────────
Note. Models estimated using sample size of N=10
MODEL SPECIFIC RESULTS
MODEL 1
Omnibus ANOVA Test
─────────────────────────────────────────────────────────────────────────
Sum of Squares df Mean Square F p
─────────────────────────────────────────────────────────────────────────
X 952.78 1 952.777 12.213 0.00814
Residuals 624.12 8 78.015
─────────────────────────────────────────────────────────────────────────
Note. Type 3 sum of squares
Model Coefficients - Y
────────────────────────────────────────────────────────────────────────────────────────────────────
Predictor Estimate SE Lower Upper t p Stand. Estimate
────────────────────────────────────────────────────────────────────────────────────────────────────
Intercept 5.39958 6.29130 -9.10819 19.9073 0.85826 0.41572
X 0.73784 0.21113 0.25097 1.2247 3.49467 0.00814 0.77731
────────────────────────────────────────────────────────────────────────────────────────────────────
DATA SUMMARY
Cook's Distance
───────────────────────────────────────────────────────────
Mean Median SD Min Max
───────────────────────────────────────────────────────────
0.12060 0.095279 0.096550 0.0051787 0.32838
───────────────────────────────────────────────────────────
ASSUMPTION CHECKS
Collinearity Statistics
────────────────────────────
VIF Tolerance
────────────────────────────
X 1.0000 1.0000
────────────────────────────
Normality Test (Shapiro-Wilk)
─────────────────────────────
Statistic p
─────────────────────────────
0.87989 0.13011
─────────────────────────────
jmv::corrMatrix(
    data = data,
    vars = vars(X, Y),
    flag = TRUE,
    n = TRUE,
    ci = TRUE,
    plots = TRUE,
    plotDens = TRUE,
    plotStats = TRUE)
CORRELATION MATRIX
Correlation Matrix
───────────────────────────────────────────
X Y
───────────────────────────────────────────
X Pearson's r —
df —
p-value —
95% CI Upper —
95% CI Lower —
N —
Y Pearson's r 0.77731 —
df 8 —
p-value 0.00814 —
95% CI Upper 0.94462 —
95% CI Lower 0.28924 —
N 10 —
───────────────────────────────────────────
Note. * p < .05, ** p < .01, *** p < .001
In regression, the predicted values (YHAT) all lie on the regression line. Therefore, the errors of prediction, Y – YHAT, are found as the distance that each dot falls from the regression line. This distance is measured parallel to the y-axis (i.e., vertically, not perpendicular to the regression line) and is illustrated for the employee data for company B in Figure 8h. For example, the second error, e2, was found by taking the actual Y score of 20 on the y-axis and subtracting the predicted Y score, which is also found on the y-axis. Find the actual Y score of 20 by extending a line from the dot, perpendicular to the y-axis. Find the predicted Y score (YHAT) by drawing a line from the dot to the regression line and then from the regression line perpendicular to the y-axis.
The regression equation found when you regress Y on X is not necessarily the same regression equation you would find when X is regressed on Y. In Figure 8i, the X scores from company B are regressed on the Y scores (that is, EJP on AJP). The resulting regression line has the equation
\[ \hat{X} = 6.146 + (0.8189 Y) \]
which is different from that found when we regressed Y on X (that is, AJP on EJP). You may be surprised to find two different regression lines, since the correlation of X with Y is the same as the correlation of Y with X. In equation (8-2), however, you can see that the slope of a regression line depends not only on the correlation but also on the ratio of the standard deviation of the dependent variable to the standard deviation of the independent variable. Therefore, unless sX = sY, the slopes of the two regression lines will not be equal; and from equation (8-3), their intercept values will also probably not be equal.
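If you want to see both lines at once, the following is a minimal base R sketch. It again assumes the hypothetical vectors x (EJP) and y (AJP) from the earlier sketches; it is not part of the jamovi output used elsewhere in this chapter.

```r
# Fit both regressions and overlay the two lines on one scatterplot
fit_yx <- lm(y ~ x)    # regress AJP (Y) on EJP (X)
fit_xy <- lm(x ~ y)    # regress EJP (X) on AJP (Y)

plot(x, y, xlab = "EJP (X)", ylab = "AJP (Y)")
abline(fit_yx)                                   # Y-on-X line: Y-hat = b0 + b1*X
a <- coef(fit_xy)                                # X-hat = a[1] + a[2]*Y
abline(a = -a[1] / a[2], b = 1 / a[2], lty = 2)  # X-on-Y line re-expressed in the X-Y plane
```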
In the introduction to regression analysis, you learned that you could derive a linear equation, called the regression equation, which allows you to predict one variable using another variable. Y is traditionally used to denote the variable being predicted, and X is used to denote the predictor variable. In these roles, Y is called the dependent variable; X is called the independent variable; and the process is described as the regression of Y on X.
A good regression equation is one that yields small errors in predicting Y. The variance and standard deviation of the errors can be used as indicators of how good a regression equation is. The variance of the errors is called the variance error of estimate, and the standard deviation of the errors is called the standard error of estimate. A small value for the variance error of estimate or for the standard error of estimate indicates a good regression equation. The Pearson correlation coefficient is indirectly related to the variance error of estimate, in that the larger the Pearson correlation coefficient, the smaller the variance error of estimate.
The Pearson correlation coefficient squared is used to describe the proportion of variance in the Y variable that is accounted for by the X variable. In this role, a high Pearson correlation coefficient means that a good regression equation can be expected.
You learned how to find and plot the regression line both by hand and using R. The errors of prediction in a regression scatterplot were discussed and illustrated. Also, you found that the regression equation found by regressing Y on X is not necessarily the same as the one found by regressing X on Y.
This chapter has considered only the basics of regression analysis. Our purpose was to provide you with an introduction to this topic so that you could see its relationship to the Pearson correlation coefficient; hence, we have considered only the descriptive relationship between one dependent variable and one independent variable. We have not considered the assumptions that are involved, statistical null hypothesis testing, several independent variables, or nonlinear relationships. For further discussion of this topic, consider Cohen and Cohen (1980), Draper and Smith (1981), or Kleinbaum and Kupper (1980).
SECTION 1: Bivariate (one-predictor) Linear Regression (Descriptive)
Analyses to Run
• Use a SCALE variable Y
• Use a SCALE variable X (an ORDINAL predictor with a large RANGE and large VARIATION may be okay but is NOT really desirable; an ORDINAL predictor would require very careful interpretation)
• Run a linear regression with Y as the Dependent Variable and X as the Independent Variable
• Run a case report and include your ID variable, X, Y, PRE_1, RES_1, ZRE_1 (for the Unstandardized Predicted Values, Unstandardized Residuals, and Standardized Residuals created when running the Linear Regression above).
Using the output, respond to the following DESCRIPTIVE items
From the Regression output, report and interpret the Pearson’s correlation between Y and X
Report and interpret the standardized regression coefficient used to predict Standardized Y (ZY) from Standardized X (ZX)
Report and interpret how much variation (as a proportion or percentage) in the outcome Y is explained or accounted for by the predictor X?
Show or explain how R2 is calculated as a proportion of variation in Total Y explained by X
Give another name for the estimated standard deviation of the residuals in correlation.
What is the estimated standard deviation of the residuals value for this analysis?
Show or explain how the estimated standard deviation of the residuals is calculated
Report and interpret the unstandardized Y-intercept for the regression model where Y is predicted from X
Report and interpret the unstandardized slope for the X predictor when it is used to predict Y
Report how much a case’s unstandardized Y score tends to increase or decrease if that case’s unstandardized X score increases by one unit
Report how much a case’s unstandardized Y score tends to increase or decrease if that case’s unstandardized X score decreases by one unit
Report how much a case’s unstandardized Y score tends to increase or decrease if that case’s standardized X score (ZX) decreases by one unit
Report how much a case’s standardized Y score (ZY) tends to increase or decrease (in standard deviation units) for every 1 standard deviation unit increase in X (i.e., a one unit change in standardized X = ZX)
Report and interpret the unstandardized regression model (a.k.a., function, line, equation) based on the results.
Report and interpret the standardized regression model.
Show or explain how to calculate the unstandardized regression coefficient for slope (\(b_1\)) from the standardized regression coefficient for slope (Beta1) using the formula \(b_1\) = Beta1 * (SY / SX), and recall that Beta1 = r
Show or explain how to calculate the unstandardized regression coefficient for Y-intercept (\(b_0\)) from the unstandardized regression coefficient for slope (\(b_1\)) using the formula \(b_0\) = MeanY – \(b_1\)*MeanX
If a case has an X score equal to the MEAN of X (i.e., ZX = 0), calculate that case’s PREDICTED value on Y
What is the MEAN for Y?
Pick any case and show or explain how to calculate that case’s PREDICTED value on Y (i.e., Y-hat) using the regression model (you need X from the data for that case)
Using the same case as the previous item, show or explain how to calculate that case’s RESIDUAL value on Y (you will need Y from the data for that case)
Using the same case as in the previous items, report the coordinate point for both the ACTUAL (observed) value for the case and for the PREDICTED value for the case
If a case has a standardized predictor (ZX) score of 1.5, what is their Predicted ZY?
Using the result from the previous item, show or explain how to calculate the Predicted Unstandardized Y score by converting the Predicted ZY into the Predicted Y using formula Yhat = MY + (SY * ZYHAT)
Using ALL the output in this section above, respond to the following item
Please cite as:
Barcikowski, R. S., & Brooks, G. P. (2025). The Stat-Pro book:
A guide for data analysts (revised edition) [Unpublished manuscript].
Department of Educational Studies, Ohio University.
https://people.ohio.edu/brooksg/Rmarkdown/
This is a revision of an unpublished textbook by Barcikowski (1987).
This revision updates some text and uses R and JAMOVI as the primary
tools for examples. The textbook has been used as the primary textbook
in Ohio University EDRE 7200: Educational Statistics courses for
most semesters 1987-1991 and again 2018-2025.