Simple Linear Regression
In linear regression, we “predict” one variable \(Y\) (the criterion) from one or more other variables \(X\) (the predictors), given that data on both \(Y\) and \(X\) are available in the same sample (think of regression as a within-subjects approach to inferential statistics). When we say predict in the context of regression, we mean this in a strictly statistical sense: the regression analysis itself does not allow any statements about potential causal patterns in the data.
Prediction simply means that we use information from the \(X\) variables to make the best possible guess about the corresponding values of \(Y\). The best possible guess, in this case, is the one that minimizes the prediction error across the whole range of observations. That is, we may still be quite far off for individual values of \(Y\) when using this best guess, but in the long run (or across the board) it will be our best option.
Linear regression is conceptually close to correlation analysis as it assumes a strictly linear relationship between \(X\) and \(Y\). In fact, in cases with only one predictor variable, correlation and linear regression convey the exact same information, albeit being presented in a slightly different manner.
The basics of linear regression
The core idea of linear regression is that we can express the criterion variable \(Y\) as a linear function of the predictor variables \(X\). Let us first consider the simple case with one predictor. The general equation is:
\[Y = b_0 + b_1\times X + \epsilon\]
Here, \(b_0\) is some constant denoting the intercept of the linear function, and \(b_1\) is its slope. We call \(b_1\) a regression weight. It indicates how well we can predict \(Y\) from \(X\) (think of \(b_1\) as a measure of the covariation of \(X\) and \(Y\)). Finally, \(\epsilon\) is the prediction error, also called the residual. We assume that:
\[\epsilon \sim N(0, \sigma^2)\]
That is, the residual is assumed to have a mean of zero and some nonzero variance. We consider \(\epsilon\) to be unsystematic, which means that it is not part of our actual prediction.
Treating \(\epsilon\) as unsystematic is an oversimplification in most regression models. The reason is that the residual term contains not only the true unsystematic measurement error but also systematic variability of \(Y\) that is not captured in our \(X\) variables. In other words, \(\epsilon\) contains both unsystematic and unexplained variability of \(Y\).
Let’s call the prediction of \(Y\) from our predictor variable \(X\) by a different name, namely \(\hat{Y}\). Since \(\epsilon\) is not part of this prediction, we can state:
\[\hat{Y} = b_0 + b_1\times X\] This also means that:
\[\epsilon = Y-\hat{Y} \]
The trick to linear regression is to choose the parameters of the regression model, \(b_0\) and \(b_1\), so that they minimise the residual \(\epsilon\). We do that using the method of least squares, which is why linear regression is sometimes referred to as ordinary least squares (OLS) regression. To understand the method of least squares, it helps to move from the variable-level formulation of the regression model to the level of observations.
\[y_i = b_0 + b_1 \times x_i + \epsilon_i\]
What this means is that our prediction of the magnitude of \(Y\) for the \(i\)th observation is based on the corresponding value of \(X\), that is, \(x_i\) and our prediction error for that specific observation \(\epsilon_i\). The method of least squares states that the best combination of parameters \(b_0\) and \(b_1\) is the one that minimises the sum of the squared residuals \(\epsilon_i\).
It can be shown that the sum of the squared residuals is minimised for:
\[b_1 = \frac{COV(X,Y)}{\sigma_X^2} = r_{XY} \times \frac{\sigma_Y}{\sigma_X}\] And:
\[b_0 = \bar{Y} - b_1 \times \bar{X}\]
As we can see from the choice of the optimal \(b_1\), it is a linear transformation of the covariance of \(X\) and \(Y\), which means that we can also express it as a linear transformation of their correlation. Since linear transformations only change the units of measurement while leaving the strength of the linear relationship untouched, we can see that the linear regression with a single predictor variable \(X\) contains the exact same information as the correlation, but it is expressed differently.
Specifically, the expression we use in linear regression allows us to predict specific values \(\hat{y}_i\) whereas the correlation tells us how many standard deviations \(Y\) increases if \(X\) is increased by one standard deviation. With that in mind, we can now state that we would expect \(\hat{Y}\) to take the value \(b_0\) if \(X\) were zero. We can also say that \(\hat{Y}\) increases by \(b_1\) units if we increase \(X\) by one unit.
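To make this concrete, here is a small sketch with two made-up vectors x and y: we compute the least-squares solution by hand and compare it with what the lm() function (introduced further below) returns.

# least-squares parameters by hand, for two hypothetical vectors
x  <- c(1, 2, 3, 4, 5)
y  <- c(2.1, 2.9, 4.2, 4.8, 6.1)
b1 <- cov(x, y) / var(x)          # equivalently: cor(x, y) * sd(y) / sd(x)
b0 <- mean(y) - b1 * mean(x)
c(b0, b1)
coef(lm(y ~ x))                   # lm() returns the same two values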
Testing regression coefficients for significance
We can test the fixed parameters of a regression model (i.e., the intercept and the regression weights of the predictor variables \(X\)) for statistical significance using \(t\)-statistics. For each of these parameters, the following holds:
\[\frac{b_i - \beta_i}{SE_{b_i}} \sim t_{N-k}\]
Here, \(b_i\) is the parameter of interest, \(\beta_i\) is the expected value under \(H_0\) (typically zero, but we can technically test against non-zero values if we wanted to), and \(SE_{b_i}\) is the standard error of that parameter. The degrees of freedom of the resulting \(t\)-distribution equal the sample size \(N\) minus the number of estimated parameters \(k\).
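As a quick illustration, this is how such a \(t\)-test could be computed by hand in R (the numbers are made up):

# hand-computed t-test of a regression weight against zero (hypothetical numbers)
b  <- 0.5     # estimated regression weight
se <- 0.2     # its standard error
N  <- 40      # sample size
k  <- 2       # number of estimated parameters (intercept + one weight)
t_val <- (b - 0) / se
2 * pt(-abs(t_val), df = N - k)   # two-tailed p-value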
We are not going to go into the formulas for the standard errors of the regression coefficients here because a) they are somewhat complicated and b) R computes them for us. Instead, we will turn to one more important statistic used in regression models, namely the coefficient of determination \(R^2\):
\[R^2 = \frac{S_{\hat{Y}}^2}{S_{Y}^2}\]
Here, \(S_{\hat{Y}}^2\) is our estimate of the variance of the model predictions \(\hat{Y}\), whereas \(S_{Y}^2\) is our estimate of the variance of the criterion variable \(Y\). In the case with only one predictor, the coefficient of determination is the squared correlation coefficient between predictor \(X\) and criterion \(Y\). In the case with multiple predictor variables, we therefore also refer to \(R\) as the multiple correlation coefficient. In any case, \(R^2\) tells us which proportion of the variance of the criterion \(Y\) we can account for with the predictions of our regression model \(\hat{Y}\).
We already know from the chapters on ANOVA that such ratios of variances can also be expressed as ratios of sums of squares. In the case of linear regression, it looks like this:
\[R^2 = \frac{S_{\hat{Y}}^2}{S_{Y}^2} = \frac{\frac{1}{N-k}\sum_{i=1}^{N}(\hat{y}_i - \bar{\hat{y}})^2}{\frac{1}{N-k}\sum_{i=1}^{N}(y_i - \bar{y})^2}\]
We can simplify this equation by dropping the \(\frac{1}{N-k}\) which results in:
\[R^2 = \frac{\sum_{i=1}^{N}(\hat{y}_i - \bar{\hat{y}})^2}{\sum_{i=1}^{N}(y_i - \bar{y})^2} = \frac{SS_{regression}}{SS_{total}}\]
Now that we know that we are dealing with sums of squares again, we can easily compute the residual sum of squares (or error sum of squares):
\[SS_{residual} = SS_{total} - SS_{regression}\]
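To make this concrete, here is a sketch of how these sums of squares could be computed in R, assuming a fitted regression object called fit (created with the lm() function introduced below) and the criterion vector y it was fitted to:

# sums of squares for a fitted regression model (sketch)
ss_total <- sum((y - mean(y))^2)              # SS_total
ss_reg   <- sum((fitted(fit) - mean(y))^2)    # SS_regression
ss_res   <- sum(residuals(fit)^2)             # SS_residual
ss_reg / ss_total                             # R^2
ss_total - ss_reg                             # equals ss_res for models with an intercept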
:::{.alert .alert-success}
If at this point you are wondering why we are back to sums of squares even though we are not dealing with ANOVAs anymore: regression and ANOVA are the same.

Or, to be precise: they are both special cases of the general linear model. We use ANOVAs when our predictors are categorical variables, but we could easily set up the same model as a regression model. The output of the respective analysis and the specific parameter tests we run may differ, but the underlying information is the same.
:::
Back to \(R^2\): we can test for significance of \(R^2\) using an \(F\)-test based on the mean squares not unlike the ones we used in ANOVAs. It can be shown that:
\[\frac{MS_{regression}}{MS_{error}} = \frac{\frac{SS_{regression}}{k-1}}{\frac{SS_{error}}{N-k}} \sim F_{k-1; N-k}\]
This \(F\)-test tells us whether our regression model as a whole - that is, irrespective of the statistical significance of its individual regression weights - allows for an above-chance prediction of the criterion variable.
Adding more predictor variables to a regression model will necessarily increase the proportion of variance explained by the model, \(R^2\), even if the new predictor is not statistically significant individually. To counter this problem, we can adjust \(R^2\) and essentially “punish” a model for containing too many weak predictors. Here is the formula for the adjusted \(R^2\):
\[R_{adjusted}^2 = 1-\frac{\frac{SS_{error}}{df_{error}}}{\frac{SS_{total}}{df_{total}}} = 1- \frac{MS_{error}}{MS_{total}}\]
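Building on the sketch above, the \(F\)-test and the adjusted \(R^2\) could be computed like this (N and k again denote the sample size and the number of estimated parameters):

# F-test and adjusted R^2 from the sums of squares above (sketch)
ms_reg <- ss_reg / (k - 1)
ms_err <- ss_res / (N - k)
f_val  <- ms_reg / ms_err
pf(f_val, df1 = k - 1, df2 = N - k, lower.tail = FALSE)  # p-value of the F-test
1 - (ss_res / (N - k)) / (ss_total / (N - 1))            # adjusted R^2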
Running a regression analysis in R
We can run regression analyses in R using the lm() function (linear model). The function has several arguments. We will use only the following two:
- formula (required): a formula type object telling the function which variable to predict and which combination of variables to use as predictors
- data (optional): a data frame containing the variables we feed into the formula; if we do not specify this argument, R will assume that the variables we fed into the formula argument exist as objects in our environment
Excursus: formula type objects
Before we delve into examples of linear regressions, we need to have a look at the syntax of formula type R objects. We already know that these objects use the ~ operator (tilde) and have the general form:
# general syntax of a formula
y ~ x
That means we define \(y\) as a function of \(x\). Of course, a formula can be much more complex than that because we can use combinations of multiple variables on the right hand side. In regression models, the formula is generally an additive combination of predictors, each of which then receives its own regression weight. We can combine multiple predictors using the + operator.
The first thing we need to know now is that a regression formula in R typically omits the intercept, although the lm() function models it. This is purely a convenience feature. The full formula looks like this:
# formula explicating that the model contains an intercept
y ~ 1 + x
Here, the \(1\) represents the intercept. If we omit it, R will read the formula as if we had written it out in full, meaning that lm() will estimate the intercept \(b_0\) both when we include the \(1\) in our formula and when we leave it out.
In some situations, we might want to force a regression through the origin. In other words, we want the intercept of the model to be zero. In those cases, we can replace the \(1\) in our regression formula with a \(0\), so R knows that the intercept must be fixed at 0 in the model.
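Such a formula would look like this:

# formula forcing the regression through the origin (no intercept)
y ~ 0 + x

Writing y ~ x - 1 achieves the same thing.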
Back to regressions in R
Now that we have a rough understanding of formula type objects in R, we can go back to running linear regressions in R. Let’s first create some data for the simple case with one predictor variable. Let’s assume we gathered data from 40 people on two variables, love of cats and love of dogs. Both variables are measured on scales ranging from -3 (can’t stand them) to +3 (love them). We will use a linear regression to predict participants’ attitude toward cats from their attitude toward dogs.
This data example is very prototypical for psychology in the sense that it violates two of the assumptions of linear regression, namely that the criterion \(Y\) is continuous and that it is measured on an interval scale (meaning that the increase from, say, 1 to 2 is equal to the increase from, say, 4 to 5).
Rating scales, as they are often used in psychological research, do not satisfy these conditions. Neither is the scale continuous (in fact, it is restricted to a few discrete values), nor can we say for certain whether the steps between scale points mark equal psychological distances.
In practice, we generally dismiss this violation of assumptions and pretend that it is not an issue. For those who find that unsatisfying, it may be worthwhile to use a different type of regression model that is better suited to this kind of data, namely ordinal regression.
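For illustration, a data frame with this structure could be simulated along these lines (a sketch; the actual data shown below, stored in df1, were generated differently):

# hypothetical simulation of ratings with the same structure as df1 (not the actual data)
set.seed(123)
df1_sim <- data.frame(ID        = 1:40,
                      love_dogs = sample(-3:3, size = 40, replace = TRUE),
                      love_cats = sample(-3:3, size = 40, replace = TRUE))
head(df1_sim)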
Let’s first have a brief look at the fictional data (stored in a data frame called df1):
ID love_dogs love_cats
1 1 -2 1
2 2 1 -2
3 3 2 0
4 4 -3 -3
5 5 1 -1
6 6 1 -1
We can now call the lm() function to run an ordinary least squares regression of participants’ love for cats on their love for dogs. Similar to ANOVAs, we will define the regression model as an object. Here is what the syntax might look like:
# regression of participants' love for cats on their love for dogs
model1 = lm(formula = love_cats ~ 1 + love_dogs,
            data = df1)
Once we run that code, an object called “model1” will appear in our environment. R will inform us that this object is a list of 12. We can now call the object’s name and have a look at what R outputs in the console:
Call:
lm(formula = love_cats ~ 1 + love_dogs, data = df1)
Coefficients:
(Intercept) love_dogs
-0.4223 0.1245
As we can see, the output is relatively sparse. We can see the function call we entered as well as estimates of the two regression parameters \(b_0\) (intercept) and \(b_1\) (the regression weight for love of dogs).
While we now know the parameter values of the linear regression, this information is slightly underwhelming. If we want more information, particularly about the statistical significance of the parameters, we need to feed the regression model into another function called summary(). The summary() function takes a fitted model as its sole argument. Here is the syntax:
# display detailed output for the regression model
summary(model1)
Here is what the console output looks like when we call the summary() function on our regression model:
Call:
lm(formula = love_cats ~ 1 + love_dogs, data = df1)
Residuals:
Min 1Q Median 3Q Max
-2.4532 -1.3287 -0.4532 1.4223 3.5468
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.4223 0.3007 -1.404 0.168
love_dogs 0.1245 0.1729 0.720 0.476
Residual standard error: 1.674 on 38 degrees of freedom
Multiple R-squared: 0.01346, Adjusted R-squared: -0.0125
F-statistic: 0.5185 on 1 and 38 DF, p-value: 0.4759
As we can see, R now displays substantially more information. In addition to the function call, we can now also see information on the distribution of the residuals (quartiles). For each of the fixed parameters of the model, we can now see the estimate, its standard error, the resulting \(t\)-value, and the \(p\)-value of a two-tailed \(t\)-test of the parameter against zero. Below the parameter estimates, there is some more information on the model, the most important of which are the \(R^2\) and adjusted \(R^2\) values as well as the results of the \(F\)-test testing whether \(R^2\) is greater than zero.
In our example, love of dogs is not a significant predictor of participants’ love for cats. The model as a whole also does not explain a significant proportion of the variance in participants’ love for cats, as indicated by the non-significant \(F\)-test. This is not surprising because the model only contains one predictor, love for dogs, and this predictor is not significantly related to the criterion. As we can see, the \(p\)-value for the \(t\)-test of the single predictor (love for dogs) is exactly equal to the \(p\)-value of the \(F\)-test of the whole model, since love for dogs is the only variable that can possibly explain variability in the love for cats in our data set.
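We can verify this directly: with a single predictor, the squared \(t\)-value equals the \(F\)-statistic. Here is a quick sketch using the summary object:

# for a single predictor, t^2 equals the F-statistic
s1 <- summary(model1)
s1$coefficients["love_dogs", "t value"]^2  # approx. 0.52
s1$fstatistic["value"]                     # approx. 0.52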
Categorical predictors in simple linear regression
So far, we have considered the case in which both \(X\) and \(Y\) are (sort of) continuous variables. While \(Y\) must be a continuous variable in linear regression, the same is not true for the predictor \(X\). We can also predict \(Y\) from categorical variables with two or more levels. If we do so, we need to slightly change the way in which we interpret the results of the model output.
Dichotomous predictors
In the most simple case, a categorical predictor has two levels (e.g., treatment vs. control). If we want to use a dichotomous predictor in a linear regression, we need to assign values to its two levels. Which values we choose is technically irrelevant, but some are more sensible when it comes to interpreting the model. The two most common ways to code dichotomous predictors are:
- effect coding: the predictor’s levels are centred around 0 (e.g., -1 vs. 1 or -0.5 vs. 0.5)
- dummy coding: the predictor is coded as a binary variable (0 vs. 1)
When running a regression using a categorical predictor in R, we need to consider the type of the predictor variable. If we use a character string or a factor as a binary predictor, R will automatically use dummy coding. If the predictor variable is numeric and different from the desired coding scheme, we need to recode it manually.
Caveat: Unless we define a predictor as a factor with an explicit level order (by setting the levels and labels in the desired order), R will order the levels alphabetically, and the first level will be treated as the reference category.
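Here is a hypothetical sketch of both steps (the object names are made up):

# define the grouping variable as a factor with an explicit level order
# (the first level, control, becomes the reference category)
cond <- factor(c("control", "treatment", "control", "treatment"),
               levels = c("control", "treatment"))
# derive an effect-coded numeric version (-0.5 vs. 0.5)
cond_effect <- ifelse(cond == "treatment", 0.5, -0.5)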
Let’s now briefly look at an example of a simple linear regression with a dichotomous predictor. Here is some made up data (stored in df2), in which the predictor is either dummy-coded (because the variable is a factor with two levels) or effect coded (-0.5 vs. 0.5):
ID cond_dummy cond_effect score
1 1 control -0.5 5.59
2 2 treatment 0.5 12.20
3 3 control -0.5 8.55
4 4 treatment 0.5 9.05
5 5 control -0.5 10.17
6 6 treatment 0.5 8.58
We will first look at the dummy-coded version of the predictor by using the labelled factor as the predictor (the variable named ‘cond_dummy’). Here is what the syntax looks like:
# regression of score on the condition variable
model2 = lm(formula = score ~ cond_dummy, data = df2)
Running the code above yields the following output:
Call:
lm(formula = score ~ cond_dummy, data = df2)
Residuals:
Min 1Q Median 3Q Max
-4.0977 -1.1347 -0.4077 0.7375 5.4223
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.4077 0.3401 21.780 < 2e-16 ***
cond_dummytreatment 1.4893 0.4810 3.096 0.00302 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.863 on 58 degrees of freedom
Multiple R-squared: 0.1419, Adjusted R-squared: 0.1271
F-statistic: 9.588 on 1 and 58 DF, p-value: 0.003016
When we use dummy coding, the intercept of the model represents the mean of the reference category (in our example, the control group). This means we can test whether the mean of the reference group differs from zero using the intercept of the model. The treatment effect tells us how much the mean value changes when we move from the control group to the treatment group, that is, it shows us the mean difference between the two groups and tells us whether this mean difference is significant.
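We can check this against the observed group means (using the data frame from above):

# group means: the control mean equals the intercept,
# the treatment mean equals intercept + regression weight
tapply(df2$score, df2$cond_dummy, mean)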
Testing the significance of mean differences between two groups…doesn’t that sound familiar? That is what we previously used the independent samples \(t\)-test for.
In fact, if we ran an independent samples \(t\)-test on the data and forced it to assume equal variances, it would show the exact same \(t\)-value, degrees of freedom, and \(p\)-value as the test of the slope in our regression model.
We would get slightly different results when using the default Welch \(t\)-test because it adjusts the degrees of freedom downward to correct for unequal variances, thus yielding a somewhat larger \(p\)-value.
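Here is a sketch of that equivalent \(t\)-test on the data from above (note that the sign of \(t\) may flip because t.test() subtracts the treatment mean from the control mean):

# equal-variance independent samples t-test; t (up to sign), df, and p match the slope test above
t.test(score ~ cond_dummy, data = df2, var.equal = TRUE)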
Let’s now see what happens to the output if we use the effect-coded version of the condition variable as a predictor instead of the dummy-coded version.
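The call is analogous to the one above; here is a sketch (the object name model2b is arbitrary):

# regression of score on the effect-coded condition variable
model2b = lm(formula = score ~ cond_effect, data = df2)
summary(model2b)

Here is the resulting output: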
Call:
lm(formula = score ~ cond_effect, data = df2)
Residuals:
Min 1Q Median 3Q Max
-4.0977 -1.1347 -0.4077 0.7375 5.4223
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.1523 0.2405 33.898 < 2e-16 ***
cond_effect 1.4893 0.4810 3.096 0.00302 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.863 on 58 degrees of freedom
Multiple R-squared: 0.1419, Adjusted R-squared: 0.1271
F-statistic: 9.588 on 1 and 58 DF, p-value: 0.003016
We should notice a few things: first, the regression weight \(b_1\) and its associated test statistics (\(t\)-value, \(p\)-value) are identical. The same goes for the coefficient of determination \(R^2\) and its associated \(F\)-test. The only thing that differs is the intercept of the model and its associated test statistics. This makes sense, intuitively, because we simply shifted the predictor to the “left” by half a unit (from 0 vs. 1 to -0.5 vs. 0.5). Unlike in the dummy-coded model, the intercept no longer reflects the mean of the reference group but the overall mean. The underlying information, namely the correlation between \(X\) and \(Y\), is unchanged, however. Accordingly, the strength of the relationship between \(X\) and \(Y\) as well as its statistical significance cannot differ between the two versions of the regression model.
Predictors with more than two levels
While dichotomous predictors are easy to handle (particularly when we dummy-code them), things can get complicated when categorical predictors have three or more levels. What R does in these cases is create \(j-1\) contrasts, where \(j\) is the number of categories of the predictor \(X\). These contrasts are coded in a specific way. Similar to dummy-coded predictors, the first level is treated as the reference category, and each other level is compared to that reference category.
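We can inspect this default coding scheme directly. Here is a sketch for a four-level predictor, showing the three 0/1 columns R would create:

# default treatment contrasts for a four-level factor:
# three columns, each comparing one level against the first (reference) level
contr.treatment(4)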
Let’s look at an example with a four-level predictor. Here, we predict the daily average sun hours in Northern Ireland from the four seasons (data are made up, of course).
obs season sun_hours
1 1 spring 0.90
2 2 summer 2.76
3 3 autumn 1.03
4 4 winter 0.02
5 5 spring 1.64
6 6 summer 2.50
Here is the syntax for the regression model:
# regression model predicting sun hours from seasons
model3 = lm(formula = sun_hours ~ season, data = df3)
Let’s have a look at the console output:
Call:
lm(formula = sun_hours ~ season, data = df3)
Residuals:
Min 1Q Median 3Q Max
-0.9780 -0.1840 -0.0420 0.0975 1.2710
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.3080 0.1207 10.839 6.90e-13 ***
seasonsummer 1.6310 0.1707 9.557 2.05e-11 ***
seasonautumn -0.3860 0.1707 -2.262 0.0298 *
seasonwinter -1.6620 0.1707 -9.739 1.25e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3816 on 36 degrees of freedom
Multiple R-squared: 0.9134, Adjusted R-squared: 0.9062
F-statistic: 126.5 on 3 and 36 DF, p-value: < 2.2e-16
In the data, seasons are coded such that spring is the first level, followed by summer, autumn, and winter. Therefore, the regression model treats spring as the reference category. The intercept tells us that spring in Northern Ireland has an average of 1.3 sun hours per day, which is significantly different from having zero sun. The regression weight for summer indicates that the sun shines for an additional 1.6 hours per day in summer compared with spring. Likewise, the regression weights for autumn and winter tell us that the daily sun hours are lower in autumn than in spring and also lower in winter than in spring. Adding one of the regression weights to the intercept yields the mean hours of sunshine for that season (in the case of winter, the model predicts a daily average of roughly minus 0.35 sun hours in Northern Ireland, which seems about right).
Importantly, the model does not tell us whether average daily sun hours differ between summer and autumn, summer and winter, or autumn and winter. If we are interested in those differences, we need to run the model using a different reference category (for example by changing the factor levels and labels of the predictor variable accordingly).
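One way to do that is to change the reference category with the relevel() function and refit the model. Here is a sketch, assuming the season variable in df3 is stored as a factor:

# make summer the reference category and refit the model
df3$season <- relevel(df3$season, ref = "summer")
model3b = lm(formula = sun_hours ~ season, data = df3)
summary(model3b)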
You may have noticed that despite having only one predictor variable, the model now contains four parameters, the intercept and three regression weights. This is also reflected in the degrees of freedom (we lose one for each parameter).
Technically, this means that we have now entered the realm of multiple linear regression.