Multiple Regression
So far, we have considered linear regression models with a single predictor. We will now turn to cases with two or more predictors. As soon as a regression model has more than one predictor, we speak of multiple regression.
Multiple regression is a straightforward extension of simple linear regression. Generally speaking, the regression model looks like this:
\[Y = b_0 + b_1X_1+b_2X_2+...+b_kX_k +\epsilon\] Here, \(b_0\) is the intercept, \(X_1\) to \(X_k\) are the predictors, and \(b_1\) to \(b_k\) are their respective regression weights. As always, \(\epsilon\) is the residual. We can test each parameter of our model for significance via a \(t\)-test:
\[\frac{b_i - \beta_i}{SE_{b_i}} \sim t_{N-k}\]
Just as in simple linear regression, we compute the difference between the observed regression parameter and its expected value under \(H_0\) and divide this difference by the parameter's standard error \(SE_{b_i}\). The resulting variable follows a \(t\)-distribution with \(N-k\) degrees of freedom, where \(k\) is the number of parameters.
Just as in simple linear regression, \(\hat{Y}\) is the prediction of our model:
\[\hat{Y} = b_0 + b_1X_1+b_2X_2+...+b_kX_k\] Since we can compute \(\hat{Y}\), we can also compute the coefficient of determination \(R^2\) and its adjusted version using the exact same formulae. The only conceptual difference is that - since we now have more than one predictor - \(R^2\) is no longer the square of an ordinary correlation coefficient. For this reason, \(R\) (note the capital R) is also referred to as the multiple correlation coefficient. Importantly, however, this does not change how we interpret \(R^2\). It is still our measure of the proportion of variance of the criterion \(Y\) explained by our regression model as a whole. The statistical test of \(R^2\) remains unchanged as well; that is, we test for significance of \(R^2\) using an \(F\)-distribution with \(k-1\) numerator degrees of freedom and \(N-k\) denominator degrees of freedom.
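If we wanted to carry out this \(F\)-test of \(R^2\) by hand, it could look roughly like the following sketch (the values of \(R^2\), \(k\), and \(N\) below are arbitrary placeholders, not data from this chapter):
# sketch: F-test of R^2 computed by hand (placeholder values)
R2 = 0.33   # hypothetical R^2
k  = 3      # number of parameters (intercept plus two predictors)
N  = 100    # hypothetical sample size

F_value = (R2 / (k - 1)) / ((1 - R2) / (N - k))
p_value = pf(F_value, df1 = k - 1, df2 = N - k, lower.tail = FALSE)
c(F = F_value, p = p_value)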
The function we will use to do multiple regression analysis is the same one we used for simple linear regression, namely lm(). The only difference is that we now specify the formula argument of lm() such that it contains two or more predictors.
Excursus: formula objects with multiple predictors
Before we delve into multiple regression in R, we first need to understand how to define the formula object if we have several predictors. Let's first look at cases in which we want to predict \(Y\) from two predictors \(X_1\) and \(X_2\):
# formula treating y as an additive function of x1 and x2
y ~ x1 + x2
In the example above, we have two additive effects (think of them as main effects) of \(X_1\) and \(X_2\) on \(Y\). But what if we wanted to add an interaction as well? In this case, we can combine the two predictors using the : operator. If we use this operator, R will compute the interaction term as the product of the two variables and use it as a predictor in the model. Here is what the syntax would look like:
# formula treating y as a function of x1 and x2
# as well as their interaction
y ~ x1 + x2 + x1:x2
In the above example, we have the full model, that is, it contains all possible effects that a regression model with two predictors can have. However, it is also possible to feed a reduced model into the lm() function. For example, we might want to estimate a regression model that includes only the main effect of \(X_1\) and the interaction effect, but not the main effect of \(X_2\). In this case, the model would look like this:
# formula treating y as an additive function of x1
# and the interaction of x1 and x2
y ~ x1 + x1:x2
R has a neat way of making regression model formulae more parsimonious, namely the * operator. If we combine two or more variables with this operator, R will include all main effects and interactions of these variables. Here are some examples:
## formula for a full regression model with two predictors
# this:
y ~ x1*x2
# reads as this:
y ~ x1 + x2 + x1:x2

## formula for a full regression model with three predictors
# this:
y ~ x1*x2*x3
# reads as this:
y ~ x1 + x2 + x3 + x1:x2 + x1:x3 + x2:x3 + x1:x2:x3

## formula for a three-predictor model with the
## third predictor as a purely additive effect
# this:
y ~ x1*x2 + x3
# reads as this:
y ~ x1 + x2 + x1:x2 + x3
As we can see, the * operator can make our lives a bit easier whenever models involve multiple predictors.
Multiple Regression in R
Let's now look at the syntax for a multiple regression model. We will use some made-up data to run a regression model with two predictors \(X_1\) and \(X_2\). Here is what the data look like:
ID Y X1 X2
1 1 1.29 0.36 0.92
2 2 0.12 -0.10 -0.73
3 3 -0.84 -0.83 -0.74
4 4 2.37 1.66 0.99
5 5 -0.21 0.17 -0.94
6 6 -0.05 -0.70 -0.48
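(The data frame df1 itself is not reproduced in full here. Purely for illustration, data with a similar structure could be simulated along the following lines; the coefficients below are made up and will not recreate the exact output that follows.)
# hypothetical sketch: simulate data resembling df1
set.seed(1)
N  = 100
X1 = rnorm(N)
X2 = rnorm(N)
Y  = 0.2 + 0.3*X1 + 0.4*X2 - 0.15*X1*X2 + rnorm(N, sd = 0.8)
df1 = data.frame(ID = 1:N, Y = Y, X1 = X1, X2 = X2)
head(df1)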
We now have to make a choice: we can either predict \(Y\) from \(X_1\) and \(X_2\) in a purely additive model, or we can include the interaction term \(X_1 \times X_2\) as a third predictor in the model. Let's first run the simpler model that omits the interaction effect. The formula looks like this:
\[Y = b_0 + b_1X_1 + b_2X_2+\epsilon\] Here is the syntax for this model:
# predict Y from X1 and X2 in a multiple regression
mod1 = lm(formula = Y ~ X1 + X2, data = df1)

# display the results
summary(mod1)
Once we run that code, R will produce the following console output:
Call:
lm(formula = Y ~ X1 + X2, data = df1)
Residuals:
Min 1Q Median 3Q Max
-2.09061 -0.55322 0.06585 0.48445 1.83926
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.17266 0.07949 2.172 0.032285 *
X1 0.30212 0.07707 3.920 0.000165 ***
X2 0.41297 0.07834 5.271 8.2e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.7923 on 97 degrees of freedom
Multiple R-squared: 0.3289, Adjusted R-squared: 0.3151
F-statistic: 23.77 on 2 and 97 DF, p-value: 3.975e-09
As we can see, the section of the output named "Coefficients" contains information on the intercept \(b_0\) and the regression weights for the two predictors, \(b_1\) and \(b_2\). In this example, both predictors are significantly related to the criterion \(Y\). Note that since we have three parameters in the model and the total sample size \(N\) is 100, the \(t\)-tests for each parameter have 97 degrees of freedom, and the \(F\)-test of \(R^2\) against zero has 2 numerator and 97 denominator degrees of freedom.
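As a quick sanity check, the \(t\)-test for \(X_1\) can be recomputed from the rounded estimate and standard error shown in the output (a sketch; because of rounding, the result will only match the printed values approximately):
# sketch: t-test for b1 recomputed from the rounded output values
b1  = 0.30212
se1 = 0.07707
t1  = b1 / se1                                       # about 3.92
p1  = 2 * pt(abs(t1), df = 97, lower.tail = FALSE)   # about 0.000165
c(t = t1, p = p1)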
How do we interpret the regression model? The answer is: pretty much like a simple linear regression. The intercept \(b_0\) indicates the level of \(\hat{Y}\) (that is, the predicted level of \(Y\)) when both \(X_1\) and \(X_2\) are fixed at zero. Since we did not model an interaction of the two predictors, the regression weights \(b_1\) and \(b_2\) are straightforward to interpret: \(b_1\) tells us how much \(\hat{Y}\) increases when we increase \(X_1\) by one unit, irrespective of the current level of \(X_2\); \(b_2\) tells us the same for increases in \(X_2\).
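One way to see this "irrespective of \(X_2\)" property in action is to compare model predictions that differ by one unit in \(X_1\) at two different levels of \(X_2\); a brief sketch (the chosen predictor values are arbitrary):
# sketch: in the additive model mod1, a one-unit increase in X1
# shifts the prediction by b1, regardless of the level of X2
new  = data.frame(X1 = c(0, 1, 0, 1),
                  X2 = c(0, 0, 2, 2))
pred = predict(mod1, newdata = new)

pred[2] - pred[1]   # increase in Y-hat at X2 = 0: equals b1
pred[4] - pred[3]   # increase in Y-hat at X2 = 2: also equals b1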
Let’s now look at the full model that includes the interaction of \(X_1\) and \(X_2\) as a third predictor. The model looks like this:
\[Y = b_0 + b_1X_1 + b_2X_2 + b_3X_1X_2 + \epsilon\] As explained above, there are two ways to define the formula object:
# predict Y from X1, X2, and their interaction
# Version A:
mod2 = lm(formula = Y ~ X1 + X2 + X1:X2, data = df1)

# Version B:
mod2 = lm(formula = Y ~ X1*X2, data = df1)

# display the results
summary(mod2)
Running the code above yields the following output in the console:
Call:
lm(formula = Y ~ X1 * X2, data = df1)
Residuals:
Min 1Q Median 3Q Max
-2.11870 -0.54380 0.03453 0.47744 2.05139
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.18874 0.07795 2.421 0.0173 *
X1 0.31211 0.07541 4.139 7.50e-05 ***
X2 0.39870 0.07677 5.193 1.16e-06 ***
X1:X2 -0.14833 0.06252 -2.373 0.0197 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.774 on 96 degrees of freedom
Multiple R-squared: 0.3661, Adjusted R-squared: 0.3463
F-statistic: 18.48 on 3 and 96 DF, p-value: 1.527e-09
The output now shows four parameters in the "Coefficients" section. Adding a third predictor cost us one degree of freedom for both the individual \(t\)-tests of the model parameters and the \(F\)-test of \(R^2\) when compared with the previous model.
The interpretation of the model parameters is the same as before for the main effects of \(X_1\) and \(X_2\). Since the interaction term is simply another predictor from the vantage point of the model, \(b_3\) indicates how much \(\hat{Y}\) increases if we increase the product of \(X_1\) and \(X_2\) by one unit. While this statement is mathematically true, it is often difficult to grasp what these products represent conceptually (e.g., what is the product of one's experienced stress and available resources?). The important thing to keep in mind about interaction terms in multiple regression is the following: if the interaction is statistically significant, we conclude that the magnitude of the relation between \(X_1\) and \(Y\) depends on the level of \(X_2\) (and that the strength of the relation between \(X_2\) and \(Y\) depends on the level of \(X_1\)).
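A helpful way to make this concrete is to look at conditional (simple) slopes: since \(\hat{Y} = b_0 + b_1X_1 + b_2X_2 + b_3X_1X_2\), the slope of \(X_1\) at a given level of \(X_2\) equals \(b_1 + b_3X_2\). Here is a short sketch that extracts these conditional slopes from mod2 (the chosen levels of \(X_2\) are arbitrary):
# sketch: simple slopes of X1 at different levels of X2 in mod2
b = coef(mod2)
x2_levels = c(-1, 0, 1)
slopes = b["X1"] + b["X1:X2"] * x2_levels
data.frame(X2 = x2_levels, slope_X1 = slopes)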
Hierarchical regression
As soon as we have more than one predictor variable, we can think about comparing nested regression models. We speak of nested models if one model is an extension of another. To be precise, a regression model \(A\) is nested in another model \(B\), when model \(A\) contains only some of the predictors of model \(B\) without adding new ones. In hierarchical linear regression, we compare nested models such that we start with a simple model and add predictors in each step.
As long as the earlier models in a hierarchical regression are nested in the later models, we can test whether adding a predictor (or a set of predictors) improves the overall fit of the model. We already know how to measure the overall predictive power of a regression model, namely via the variance explained by a regression model. The respective measure is the coefficient of determination \(R^2\). Since we expand models in a hierarchical regression analysis, the proportion of the variance of \(Y\) explained by the second model is always equal to or greater than that explained by the first model. The question is whether the increase in \(R^2\) is large enough to justify the addition of predictors.
How can we test whether an increase in \(R^2\) between two nested models is statistically significant? We first need to remember that sums of squares follow \(\chi^2\)-distributions. Next, we need to remember that the difference between two \(\chi^2\)-distributed variables is itself \(\chi^2\)-distributed. The degrees of freedom of the difference variable equal the difference between the degrees of freedom of the two \(\chi^2\)-distributed variables it was computed from. If we, for example, computed the difference between a \(\chi^2\)-distributed variable \(A\) with 10 degrees of freedom and another \(\chi^2\)-distributed variable \(B\) with 8 degrees of freedom, the resulting variable \(A-B\) would follow a \(\chi^2\)-distribution with 2 degrees of freedom.
To test whether the difference between the \(R^2\) values of two nested models is statistically significant, we now need to remember that \(R^2\) is computed from sums of squares, which are \(\chi^2\)-distributed.
\[R^2 = \frac{S_{\hat{Y}}}{S_{Y}} = \frac{SS_{regression}}{SS_{total}} = 1-\frac{SS_{residual}}{SS_{total}}\]
If we compare two nested models, the denominator will be the same because it represents the variance of the observed criterion \(Y\). What will differ is the variance of the model prediction \(S_{\hat{Y}}\). Therefore, the models will also differ regarding their \(SS_{regression}\) and \(SS_{residual}\). In order to compare the two models' respective fit, we first need to compute the difference between their respective \(SS_{regression}\) or their \(SS_{residual}\) to obtain a \(\chi^2\)-distributed variable (since \(SS_{residual} = SS_{total} - SS_{regression}\), the difference will be the same). We will call this variable \(SS_\Delta\). Since the \(SS_{residual}\) of a model with \(k\) parameters has \(N-k\) degrees of freedom, we can easily compute the degrees of freedom of \(SS_\Delta\):
\[df_{SS_\Delta} = N-k_{model1} - (N-k_{model2})= N-k_{model1}-N+k_{model2}=k_{model2}-k_{model1}\]
Now that we have the difference of the sums of squares \(SS_\Delta\) and its degrees of freedom, the last step is to compute an \(F\)-statistic to test whether it represents a statistically significant increase in explained variance. This statistic looks as follows:
\[\frac{ \frac{SS_\Delta}{df_{SS_\Delta}} }{\frac{SS_{residual_{model2}}}{df_{residual_{model2}}}} \sim F_{df_{SS_\Delta}; df_{residual_{model2}}}\]
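Before we turn to the built-in test, here is a sketch of how this statistic could be computed by hand from mod1 and mod2; the result can be checked against the anova() output shown further below.
# sketch: F-test of the R^2 increase from mod1 to mod2, computed by hand
ss_res1  = sum(resid(mod1)^2)      # residual sum of squares, reduced model
ss_res2  = sum(resid(mod2)^2)      # residual sum of squares, full model
ss_delta = ss_res1 - ss_res2

df_delta = mod1$df.residual - mod2$df.residual   # 97 - 96 = 1
df_res2  = mod2$df.residual

F_delta = (ss_delta / df_delta) / (ss_res2 / df_res2)
p_delta = pf(F_delta, df1 = df_delta, df2 = df_res2, lower.tail = FALSE)
c(F = F_delta, p = p_delta)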
Now that we know how to test for a significant \(\Delta R^2\) on a conceptual level, the question is how to run this test in R. We can do so easily using the function anova(). The anova() function is similar to the summary() function in that it can be used on a range of different models and will perform different operations and produce different outputs based on the object's exact nature. If we call the anova() function and feed it two or more nested regression models, it runs a significance test of \(\Delta R^2\) between the first and second model, between the second and third model, and so on.
Let's now have a look at the syntax. We will use the two regression models we looked at above because they are nested (mod1 is nested within mod2 because it lacks the interaction effect). It is important to enter the models in ascending order of complexity (simplest model first). The code will run even if we enter the more complex model first, but we may end up with negative values for the test statistic, which makes no sense mathematically.
# test for a significant increase in variance explained
# between the less complex mod1 and the more complex mod2
anova(mod1, mod2)
Here is the console output:
Analysis of Variance Table
Model 1: Y ~ X1 + X2
Model 2: Y ~ X1 * X2
Res.Df RSS Df Sum of Sq F Pr(>F)
1 97 60.889
2 96 57.517 1 3.3724 5.6288 0.01966 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The output first displays the formulae of the two models we compared. More important is the table following it. As we can see, the anova() function computes \(SS_{\Delta}\) as the difference between the two models' \(SS_{residual}\). It also shows the models' respective degrees of freedom as well as the degrees of freedom for the model difference \(SS_{\Delta}\). Since we only added one parameter in model 2 (the regression weight for the interaction term), this difference has one degree of freedom. Finally, the function computes the \(F\)-statistic for the significance test and reports the associated \(p\)-value. In our case, the test is significant, meaning that the second model explains a significantly greater proportion of the variance of \(Y\).
If you look at the \(p\)-value for \(\Delta R^2\), you will notice that it is identical to the \(p\)-value of the significance test of the regression weight for the interaction term in the summary of mod2. Since we only added one variable, and this variable is a significant predictor of \(Y\), it follows that the model must explain a significantly larger proportion of the variance of \(Y\).
In other words, if we add a single predictor in a hierarchical regression analysis, we can test its significance using either the \(t\)-test on its regression weight or the \(F\)-test on \(\Delta R^2\). The result will be the same.
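For the curious, this equivalence is easy to verify in R: with a single added predictor, the \(F\)-statistic of the model comparison equals the squared \(t\)-statistic of that predictor's regression weight. A small sketch:
# sketch: with one added predictor, F for delta R^2 equals t^2
t_int = summary(mod2)$coefficients["X1:X2", "t value"]
F_cmp = anova(mod1, mod2)$F[2]
c(t_squared = t_int^2, F = F_cmp)   # both approximately 5.63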
Multiple regression with categorical predictors
Just as in a simple linear regression, our predictors can be categorical instead of continuous. Here, we will focus on the case with one continuous predictor \(X_1\) and one categorical predictor \(X_2\). Let’s first assume that \(X_2\) is dichotomous and that we use it as a dummy-coded predictor (coded 0 vs. 1). Here is some made-up data.
ID Y X1 X2
1 1 3.19 7.59 control
2 2 14.97 10.55 treatment
3 3 3.89 12.17 control
4 4 9.30 5.31 treatment
5 5 4.84 10.86 control
6 6 16.57 11.01 treatment
First, we will run a regression model containing only the two main effects. The syntax is the same as for continuous predictors:
# run the regression analysis
mod3a = lm(formula = Y ~ X1 + X2, data = df2)

# show the results
summary(mod3a)
Here is the console output:
Call:
lm(formula = Y ~ X1 + X2, data = df2)
Residuals:
Min 1Q Median 3Q Max
-3.0924 -1.0868 -0.2693 1.3977 3.2981
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.7094 1.1741 0.604 0.548623
X1 0.4705 0.1220 3.858 0.000347 ***
X2treatment 9.1844 0.4274 21.491 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.5 on 47 degrees of freedom
Multiple R-squared: 0.9079, Adjusted R-squared: 0.904
F-statistic: 231.7 on 2 and 47 DF, p-value: < 2.2e-16
As we can see, both predictors are significant. Time to interpret the model parameters! The intercept \(b_0\) tells us the estimated value \(\hat{Y}\) for observations in the reference category (\(X_2\) takes the value "control") when \(X_1\) is zero. The regression weight of \(X_1\), \(b_1\), indicates that for each one-unit increase in \(X_1\) our estimate \(\hat{Y}\) increases by roughly 0.47 units, while the regression weight \(b_2\) states that moving from the reference category to the other category (\(X_2\) takes the value "treatment") is associated with an increase in \(\hat{Y}\) of roughly 9.18 points.
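Which category serves as the reference depends on how R codes the factor; by default, the alphabetically first level (here "control") becomes the reference. A quick way to check this is sketched below (assuming X2 is stored as a character or factor column in df2):
# sketch: inspect how R dummy-codes the categorical predictor
levels(as.factor(df2$X2))       # the first level is the reference category
contrasts(as.factor(df2$X2))    # the 0/1 dummy coding used by lm()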
Let’s now compute a second model by adding the interaction term:
# run the regression analysis
mod3b = lm(formula = Y ~ X1 * X2, data = df2)

# show the results
summary(mod3b)
Let’s have a look at the console output:
Call:
lm(formula = Y ~ X1 * X2, data = df2)
Residuals:
Min 1Q Median 3Q Max
-2.13476 -0.79688 -0.03927 0.57664 2.41743
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.3654 1.1500 5.535 1.43e-06 ***
X1 -0.1371 0.1215 -1.129 0.265
X2treatment -1.7729 1.5842 -1.119 0.269
X1:X2treatment 1.2046 0.1710 7.044 7.83e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.051 on 46 degrees of freedom
Multiple R-squared: 0.9557, Adjusted R-squared: 0.9528
F-statistic: 330.8 on 3 and 46 DF, p-value: < 2.2e-16
We now notice something interesting: not only is there a significant interaction effect, but the two main effects are no longer significant in the extended model. Let's first try to make sense of the model parameters. As in the previous model, the intercept \(b_0\) represents the model prediction \(\hat{Y}\) when \(X_1\) is zero and an observation stems from the reference category (meaning that, due to the dummy coding R uses for factors, \(X_2\) is also zero). The interaction term also takes the value of zero in this case because the interaction is the product of the two predictors, and they are both zero. The regression weight \(b_2\) is the shift of the model intercept when we switch from the reference category "control" to the second category "treatment" (i.e., \(X_2\) takes the value 1 due to dummy coding, but \(X_1\) is still fixed at zero). The interpretation of \(b_1\) changes slightly compared with the previous model. Instead of telling us, in general, how \(\hat{Y}\) changes if we increase \(X_1\) by one unit, it now tells us how \(\hat{Y}\) changes in the reference category if we increase \(X_1\) by one unit. That means that the regression weight for \(X_1\), \(b_1\), is the slope of a regression line conditional on the categorical predictor \(X_2\) being zero. Finally, the regression weight of the interaction term, \(b_3\), tells us how the slope of the regression line represented by the regression weight of \(X_1\) changes if we switch to the second category. In other words, the slope of the regression line for the second category is the sum of \(b_1\) and \(b_3\).
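To make this concrete, the two implied regression lines can be assembled directly from the coefficient vector of mod3b; a short sketch using the coefficient names R assigns under its default dummy coding:
# sketch: intercept and slope of the regression line in each category of X2
b = coef(mod3b)

# control group (reference category)
int_control = b["(Intercept)"]
slp_control = b["X1"]

# treatment group: the dummy main effect shifts the intercept,
# the interaction term shifts the slope
int_treatment = b["(Intercept)"] + b["X2treatment"]
slp_treatment = b["X1"] + b["X1:X2treatment"]

c(intercept_control   = unname(int_control),
  slope_control       = unname(slp_control),
  intercept_treatment = unname(int_treatment),
  slope_treatment     = unname(slp_treatment))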
What we might be interested in now is whether the new model outperforms the old one in terms of variance explained. We already know how to test this, namely using the anova() function and feeding it the models mod3a and mod3b (in that order) as function arguments.
# test for a significant increase in variance explained between mod3a and mod3b
anova(mod3a, mod3b)
Analysis of Variance Table
Model 1: Y ~ X1 + X2
Model 2: Y ~ X1 * X2
Res.Df RSS Df Sum of Sq F Pr(>F)
1 47 105.700
2 46 50.851 1 54.848 49.616 7.832e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As we can see, the full model outperforms the one containing only the main effects.
This is a nice example because it shows us that a model can be wrong but still provide a very good fit to the data. Once we computed mod3b and confirmed that it explains a substantially greater part of the variance of \(Y\), we can - with the usual confidence - discard mod3a as ‘wrong’. However, if we look at the model output for mod3a, we can see that it was able to explain a whopping 90% of the variance of \(Y\). Just because a model is very good does not mean that it is accurate!
On a final note, a convenient aspect of multiple linear regression with one continuous and one categorical predictor is that we can visualise the data quite effectively. Specifically, we can draw one regression line for each level of the categorical predictor. If, for example, we look at the model summary above, we can derive the regression line for the first category by simply looking at the intercept \(b_0\) and the regression weight for the continuous predictor \(b_1\). For the second category, we know that the intercept is \(b_0 + b_2\) (the intercept for the reference category plus its shift if we switch to the second category), and the slope is \(b_1 + b_3\) (the slope for the reference category plus its adjustment when we switch to the second category). Here is what it looks like for our model mod3b.
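One possible way to produce such a plot is sketched below, assuming the ggplot2 package is installed; geom_smooth(method = "lm") fits a separate line per category, which corresponds to the interaction model mod3b.
# sketch: scatterplot with one regression line per category of X2
library(ggplot2)

ggplot(df2, aes(x = X1, y = Y, colour = X2)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)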