# install.packages("afex")
library(afex)
The one-factorial ANOVA
The Analysis of Variance (ANOVA) is a generalisation of the \(t\)-test. It is used to test whether two or more group means differ from one another. When comparing groups using ANOVA, we partition the total variance inherent in the data (collapsed across all groups) into the variance between groups (i.e., the variance of the group means) and the variance within groups.
The foundations of ANOVA
A key concept in ANOVA is that of the sum of squares. To understand the sum of squares, let us consider the formal definition of the variance \(\sigma^2\) of a variable, which looks as follows:
\[\sigma^2_x = \frac{1}{n} \times \underbrace{\Sigma{(x_i - \bar{x})^2}}_\text{sum of squares}\]
From this equation, we can see that the variance of the variable \(x\) is the product of two components. One is the sum of squares, that is, the sum of the squared deviations of each observation \(x_i\) from their mean \(\bar{x}\). The other is the factor \(\frac{1}{n}\), which divides this sum of squares by the number of observations \(n\).
When we estimate the true variance \(\sigma^2_x\) from data we collected, using the formula above produces a slight bias. Specifically, our estimate of the true variance will be systematically too low. To arrive at an unbiased estimate, we need to divide the sum of squares by \(n-1\) instead of \(n\), leading to the following formal definition of the estimate of the variance \(\hat{\sigma}^2_x\):
\[\hat{\sigma}^2_x = \frac{1}{n-1}\Sigma{(x_i - \bar{x})^2}\]
Fun fact: Dividing a sum of squares by its degrees of freedom yields the so-called mean squares. The mean squares follow a \(\chi^2\)-distribution with the respective degrees of freedom.
The logic of the ANOVA
The whole idea of ANOVA rests on the fact that the variance can be partitioned into the sum of its parts. This is also true for the sum of squares. As a very basic rule, we can state that:
\[SS_{total} = SS_{between} + SS_{within}\]
Here, \(SS_{total}\) is a measure of the total variability of a variable, \(SS_{between}\) represents that part of the total variability that is due to differences of the group means, and \(SS_{within}\) represents the part of the total variability that results from heterogeneity within the groups.
Similar to the \(t\)-test, ANOVA is based on a signal-to-noise ratio, where we treat the variability between groups as the signal and the variability within groups as the noise. The test statistic we use to test for significant mean differences between groups is the \(F\)-value, which is formally defined as:
\[F = \frac{VAR_{between}}{VAR_{within}}=\frac{\frac{SS_{between}}{df_{between}}}{\frac{SS_{within}}{df_{within}}} = \frac{MS_{between}}{MS_{within}}\]
In the formula above, the \(MS\) terms are the corresponding mean squares.
Fun fact: Based on the formula above, we can easily derive that \(F\) is the ratio of two \(\chi^2\)-distributed variables. Both the numerator and the denominator constitute a sum of squares divided by its degrees of freedom. That is also the reason why the \(F\)-distribution has two degrees of freedom, one for the numerator and one for the denominator.
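To make this fun fact concrete, here is a minimal simulation sketch (made up for illustration, not part of the example data used later) that checks the claim numerically: the ratio of two independent \(\chi^2\)-variables, each divided by its degrees of freedom, behaves like an \(F\)-distributed variable.
# simulate the ratio of two independent chi-square variables, each divided
# by its degrees of freedom, and compare it to the matching F-distribution
set.seed(1)
num   = rchisq(100000, df = 2) / 2     # numerator: chi-square with 2 df
den   = rchisq(100000, df = 27) / 27   # denominator: chi-square with 27 df
ratio = num / den
# the proportion of simulated ratios beyond the 95% quantile of F(2, 27)
# should be close to .05
mean(ratio > qf(p = 0.95, df1 = 2, df2 = 27))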
How we compute the mean squares depends on the type of ANOVA we run and the number of groups we compare in it.
However, we can state generally how we compute the different sum of squares. First of all, let’s remember how we compute the total sum of squares, \(SS_{total}\):
\[SS_{total} = \sum_{i=1}^{N} (x_i - \bar{x})^2\]
The total sum of squares represents the sum of the squared deviations of all \(N\) observed data points from the grand mean \(\bar{x}\). Now let's look at the formula for \(SS_{between}\).
\[SS_{between} = \sum_{j=1}^{J} n_j \times (\bar{x}_j - \bar{x})^2\]
Here, \(J\) is the number of groups we compare, and \(n_j\) is the sample size of group \(j\). For the \(SS_{between}\) we pretend that there is no variance within the \(J\) groups at all. Each observation is represented by its group’s mean \(\bar{x}_j\), and we compute the variability as the difference of these group means from the grand mean \(\bar{x}\). Therefore, the \(SS_{between}\) isolates the between-group part of the total variability of \(x\). Let’s now turn to the \(SS_{within}\).
\[SS_{within} = \sum_{j=1}^{J} \sum_{i = 1}^{n_j} (x_{ij} - \bar{x}_j)^2\]
Again, \(J\) is the number of groups we compare, and \(n_j\) is the sample size in group \(j\). The \(x_{ij}\) refers to the \(i\)th observation in group \(j\). For the \(SS_{within}\), we pretend that there is no variance between groups at all. We do so by measuring each observation's deviation from its own group mean \(\bar{x}_j\) rather than from the grand mean. Thus, the \(SS_{within}\) isolates the within-group variability of \(x\).
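To see the partitioning in action, here is a minimal sketch (using made-up numbers, not data from this page) that computes the three sums of squares by hand and confirms that they add up:
# made-up observations from three hypothetical groups of three
x     = c(4, 6, 5, 8, 9, 7, 12, 11, 13)
group = rep(c('g1', 'g2', 'g3'), each = 3)

grand_mean  = mean(x)                    # the grand mean
group_means = tapply(x, group, mean)     # the J group means
n_j         = tapply(x, group, length)   # the group sizes

SS_total   = sum((x - grand_mean)^2)
SS_between = sum(n_j * (group_means - grand_mean)^2)
SS_within  = sum((x - group_means[group])^2)

SS_total                 # total variability
SS_between + SS_within   # adds up to SS_total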
The final thing we need to understand before we can delve into the actual ANOVAs is the \(F\)-statistic. As we have seen above, we compute \(F\) as a ratio of the \(MS_{between}\) and the \(MS_{within}\). This ratio is interesting in several ways:
- Since the \(SS_{between}\) usually has far fewer degrees of freedom than the \(SS_{within}\), the variability between groups does not have to be nearly as large as the variability within groups to produce a large \(F\)-value.
- The more groups we compare, the lower the \(F\)-ratio will be, ceteris paribus. However, this does not necessarily mean that it becomes more difficult to detect significant mean differences. The more groups we compare, and the more numerator degrees of freedom our test has, the lower the critical \(F\)-value past which we consider a result statistically significant.
- The larger our total sample, the more denominator degrees of freedom we have, and the smaller the noise becomes in our signal-to-noise ratio, ceteris paribus. This makes intuitive sense: as the sample size increases, our measurement becomes more precise, making it easier to detect differences between group means.
Fun fact: If we use an ANOVA to compare two means, we effectively run a \(t\)-test but discard its ability to indicate the direction of the effect.
In such cases \(F = t^2\), and the \(p\)-value of both tests will be identical if we run a \(t\)-test assuming equal variances.
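We can verify this fun fact with a quick sketch on made-up two-group data (the variable names and values here are purely illustrative):
# made-up data for two groups of ten
set.seed(2)
y   = c(rnorm(10, mean = 10), rnorm(10, mean = 12))
grp = factor(rep(c('A', 'B'), each = 10))

# t-test assuming equal variances vs. a one-way ANOVA on the same data
t_res = t.test(y ~ grp, var.equal = TRUE)
f_res = anova(lm(y ~ grp))

unname(t_res$statistic^2)             # t squared ...
f_res$`F value`[1]                    # ... equals this F-value
c(t_res$p.value, f_res$`Pr(>F)`[1])   # identical p-values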
ANOVAs come in various flavours. In the following, we will look at some of them, namely:
- one-factorial ANOVAs (between-subjects)
- two-factorial ANOVAs (between-subjects)
- repeated-measures ANOVAs (within-subjects)
- mixed ANOVAs (at least one between and one within factor)
One-factorial ANOVA
In a one-factorial ANOVA, we compare two or more group means such that we consider each group to represent one level of the same grouping variable or factor \(A\). The underlying model assumes that the true mean of each of the \(j\) groups \(\mu_j\) is the sum of the true grand mean \(\mu\) and the effect of the \(j\)th level of factor \(A\) on that mean. We call these effects \(\alpha_j\).
\(\mu_j = \mu + \alpha_j\)
This is equivalent to:
\(\alpha_j = \mu_j - \mu\)
In other words, the effect of the \(j\)th level of factor \(A\) is the difference between the true group mean \(\mu_j\) and the true grand mean \(\mu\). Accordingly, if \(\mu_j\) exceeds \(\mu\), the respective \(\alpha_j\) is positive.
The Null hypothesis is that all group means are equal. This is equivalent to stating that all \(\alpha_j\) are zero. The alternative hypothesis is that not all group means are equal or, put differently, that at least one of the \(\alpha_j\) is non-zero.
\(H_0: \alpha_j = 0 \quad \forall j\)
\(H_1: \lnot H_0\)
Running ANOVAs in base R tends to be very clunky because R is more centered around classic regression models. Therefore, we won't be using base R to run ANOVAs, but instead use an R package called afex (Analysis of Factorial Experiments). That means we need to install and load afex first.
Once we have done that, we can start doing ANOVAs with one of several functions:
- aov_car()
- aov_4()
- aov_ez()
The three functions serve the same goal and do the same things, but they differ in terms of syntax. The function aov_car() uses syntax that is most closely related to the (clunky) base R version of ANOVA. In contrast, aov_4() uses syntax based on the popular lme4 package that is widely used for the analysis of mixed-effects models. Thus, this function is ideally suited for users who are already familiar with lme4. Finally, aov_ez() uses a completely string-based format, that is, it does not require a formula-type object as a function argument. The advantage of aov_ez() is that it is very convenient and easy to handle (thus the suffix “ez”). This comes at the cost of flexibility: the lack of a formula means that we are stuck with a full ANOVA model, whereas the other two functions technically allow us to specify models that omit certain main effects or interactions. However, since we will rarely want to run such incomplete models, this drawback usually causes no hassles. In the following, we will focus on the aov_ez() function and leave exploration of the other functions to the discretion of the reader.
Here are the most important function arguments of aov_ez():
- id (required): a character value indicating the name of the variable that contains our subject ID
- dv (required): another character value; the name of the variable containing the data whose group means we want to compare
- data (required): an R object of type data frame containing the data we want to analyse
- between (optional): a character string or character vector indicating the name(s) of the variable(s) constituting the between-subjects factor(s) of our design; default is NULL, meaning that there are no between-subjects factors
- within (optional): a character string or character vector indicating the name(s) of the variable(s) constituting the within-subjects factor(s) of our design; default is NULL, meaning that there are no within-subjects factors
- covariate (optional): a character value or vector indicating the name(s) of the covariate(s) in our analysis
- factorize (optional): a logical value; determines whether the between- and within-subject variables are turned into factors prior to the analysis; default is TRUE; if our design has at least one covariate, we need to set this to FALSE and make sure that all factors are defined as such manually
- anova_table (optional): a list of further arguments passed to the function; the ones we may be interested in are es (effect size; default is ‘ges’, which yields the generalised \(\eta^2\) as an effect size measure, but we can switch it to ‘none’ or to ‘pes’, which yields \(\eta^2_p\)) and correction (non-sphericity correction method; default is ‘none’, but we can switch it to ‘GG’ for the Greenhouse-Geisser or ‘HF’ for the Huynh-Feldt correction)
For the purpose of running a one-factorial between-subjects ANOVA, we can disregard some of the function arguments shown above. The only ones we need are id, dv, between, and possibly anova_table in case we want to obtain the effect size.
Now that we want to do ANOVAs, it is time to talk about factors. In R, factors are a special type of vector that contains both values and labels for those values. The different values of a factor are treated as categories (its levels). Factors are important here because the ANOVA function we use requires its between- and within-subject variables to be factors.
We can create a factor using the factor() function by feeding it the following function arguments:
- x: a vector we want to turn into a factor
- levels: a vector containing all possible values the factor can take
- labels: a character vector assigning a label to each level of the factor
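As a small illustration (with made-up group codes, not part of the data we analyse below), such a call can look like this:
# turn a numeric vector of group codes into a labelled factor
group_codes = c(1, 2, 2, 3, 1, 3)
group = factor(x = group_codes,
               levels = 1:3,
               labels = c('control', 'treatment1', 'treatment2'))
group           # prints the labels
levels(group)   # lists the factor levels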
Let's look at an example, in which we test whether the means of three groups are equal or not. First, we need to create some data.
# create a data frame containing data from 30 subjects in three groups of
# 10 each; "ID" is the subject identifier, "cond" is the between-subjects
# grouping variable, and "dv" contains the outcome variable we want to
# compare between groups
my_df = data.frame(
  # subject ID
  ID = 1:30,
  # between-subjects factor 'cond'
  cond = factor(
    rep(x = 1:3, each = 10),
    levels = 1:3,
    labels = c('control', 'treatment1', 'treatment2')),
  # outcome variable 'dv'
  dv = c(
    c(10, 12,  9, 14, 11, 15, 13, 15, 18, 12), # dv data for control
    c( 9,  7, 15, 14,  8,  7, 16, 13, 11, 16), # dv data for treatment1
    c(12, 11, 11,  9,  8, 13,  8,  6, 14,  7)) # dv data for treatment2
)
We can inspect how the data frame looks using the head() function.
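The call below (using the my_df object we just created) produces the preview that follows:
# show the first six rows of the data frame
head(my_df)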
ID cond dv
1 1 control 10
2 2 control 12
3 3 control 9
4 4 control 14
5 5 control 11
6 6 control 15
Now that we have some data, we can run the ANOVA. The syntax looks as follows:
aov_ez(id = 'ID', between = 'cond', dv = 'dv', data = my_df)
Here is what the output in the console looks like:
Contrasts set to contr.sum for the following variables: cond
Anova Table (Type 3 tests)
Response: dv
Effect df MSE F ges p.value
1 cond 2, 27 9.27 2.44 .153 .106
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '+' 0.1 ' ' 1
As we can see, R displays an ANOVA table in the console along with some additional information. The output is preceded by a message (this is NOT an error message; the code ran properly). This message informs us that the aov_ez() function set the contrast type for our factor ‘cond’ to ‘contr.sum’. What this means is that the contrast underlying our factor was forced into effect coding because that is the format that ANOVAs use. We don’t need to concern ourselves with that.
Next, R will tell us that this output is an ANOVA table based on type-3 sums of squares. Type-3 sums of squares are what most statistics packages use. If your knowledge of statistics is so advanced that you can make an informed decision that you would prefer type-2 sums of squares, you can change it by setting the type function argument to 2.
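As a hedged sketch (reusing the my_df example data defined above), the call below shows how the type and effect-size options can be set; only deviate from the defaults if you have a reason to:
# the same ANOVA with type-2 sums of squares and partial eta squared
# instead of the default type-3 table with generalised eta squared
aov_ez(id = 'ID', dv = 'dv', between = 'cond', data = my_df,
       type = 2, anova_table = list(es = 'pes'))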
Now for the important bits. R shows us what the response variable in our model is, namely ‘dv’. Below that information, it displays the ANOVA table. Because we ran a one-factorial ANOVA, this table contains only one row. Here, we can see the name of the between-subjects factor (Effect), the numerator and denominator degrees of freedom for its \(F\)-value (df), the mean squared error, i.e., the \(MS_{within}\) (MSE), the \(F\)-value, the generalised \(\eta^2\) as a measure of the effect size (ges), and the \(p\)-value.
In our example, the mean difference is not statistically significant, which means that we cannot reject the null hypothesis. In other words, we cannot say whether the true means of the three groups differ.
Similar to the \(\chi^2\)-test, the \(F\)-test we use in an ANOVA is always a one-tailed test because it is based on squared variables. Therefore, the test has no ‘direction’ in the way a \(t\)-test does.
Why is this important? Sometimes, we might encounter a scientific article in which the authors state that they ran a ‘one-tailed’ ANOVA test, when what they actually did was simply divide their \(p\)-value by 2. This practice (often encountered when the regular \(p\)-value lies between .05 and .10) rests on the erroneous belief that all statistical tests are - per default - two-tailed and can, therefore, also be run as a one-tailed test with slightly greater power.
Disentangling significant effects in ANOVAs
In the example above, we had to retain the Null hypothesis because the analysis did not show evidence that the three means were different from each other. In those cases, there is no need for further analyses. However, things look a bit different when the ANOVA yields a significant result. Let’s look at an example.
In this example, we will compare two treatments and one control condition with data from 30 participants in each condition. The data is stored in a data frame called my_df2. Here is what the data looks like (we use the function head() on the data frame to have R show us the first 6 lines).
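The full my_df2 data frame is not printed on this page. If you want to follow along, the sketch below simulates a stand-in with the same structure; the group means and standard deviation are made-up values chosen to roughly match the summary statistics reported further down, so your numbers will be similar but not identical to the output shown here (which comes from the original data).
# simulate a stand-in for my_df2: 90 participants, three alternating
# conditions, and a normally distributed outcome (values are made up)
set.seed(123)
cond_codes = rep(1:3, times = 30)
my_df2 = data.frame(
  ID   = 1:90,
  cond = factor(cond_codes, levels = 1:3,
                labels = c('control', 'treat1', 'treat2')),
  dv   = round(rnorm(90, mean = c(97, 102, 113)[cond_codes], sd = 9.5))
)

# show the first six rows
head(my_df2)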
ID cond dv
1 1 control 88
2 2 treat1 119
3 3 treat2 119
4 4 control 103
5 5 treat1 103
6 6 treat2 137
We now run the ANOVA on the data. Unlike before, we will save the result of the analysis as a new R object. This will make it easier for us to disentangle the effect later on. Here is the syntax:
# run a one-factorial ANOVA and save its results as a new R object
model1 = aov_ez(id = 'ID', dv = 'dv', between = 'cond', data = my_df2)
We will now see a new object called “model1” in the Environment, and R will tell us that this object is a list. Entering the object's name in the console prints the result of the ANOVA just as if we had called the function directly instead of assigning its output to an object. Here is the console output:
Anova Table (Type 3 tests)
Response: dv
Effect df MSE F ges p.value
1 cond 2, 87 89.48 22.91 *** .345 <.001
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '+' 0.1 ' ' 1
As we can see, the effect of the factor “cond” in our fictional data is statistically significant. We can now state - with the usual confidence - that the means of the three groups are not equal. However, we cannot say anything else. Since the \(F\)-statistic is based on squared deviations, we have no information on the direction of the mean differences, nor do we know which means differ. Thus, we need further analyses to get a clear picture of the mean differences the ANOVA detected.
As a useful - if somewhat brutish - metaphor, think of an ANOVA as firing a shotgun into thick fog. A statistically significant effect means that we hit something, but we do not know what we hit. In order to find that out, we need to venture into the fog and have a closer look.
(Credit for this metaphor goes to Prof. Dieter Heyer)
In order to disentangle a significant effect in an ANOVA with three or more groups, we need to run post-hoc analyses. There are many different ways to run such post-hoc analyses. Here, we will focus on post-hoc contrasts.
Post-hoc t-tests
Before we delve into post-hoc contrasts, it is worth mentioning that one easy way to disentangle significant effects in an ANOVA is running \(t\)-tests to compare the means of the groups. We already know how to do this, namely by using the t.test() function.
However, there are three downsides to using simple \(t\)-tests. First, they are less powerful than contrasts because their estimate of the error variance is based on only a part of the total sample. Second, if we want to control the type-I error rate by adjusting for multiple comparisons, we need to do so manually (this may be easy for Bonferroni’s method, but more challenging for less conservative adjustment methods). Third, as the number of groups we compare in an ANOVA increases, so does the number of possible pairwise comparisons. In fact, the number of comparisons is \((k^2-k)/2\), that is, it grows quadratically with the number of groups \(k\). Therefore, the amount of code we need to write also increases accordingly.
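For completeness, here is a hedged sketch of this \(t\)-test route on my_df2, including a manual Bonferroni adjustment via p.adjust(); droplevels() removes the factor level that is not part of each comparison:
# pairwise t-tests (equal variances assumed), one per pair of conditions
t12 = t.test(dv ~ cond, var.equal = TRUE,
             data = droplevels(subset(my_df2, cond %in% c('control', 'treat1'))))
t13 = t.test(dv ~ cond, var.equal = TRUE,
             data = droplevels(subset(my_df2, cond %in% c('control', 'treat2'))))
t23 = t.test(dv ~ cond, var.equal = TRUE,
             data = droplevels(subset(my_df2, cond %in% c('treat1', 'treat2'))))

# manual Bonferroni correction of the three p-values
p.adjust(c(t12$p.value, t13$p.value, t23$p.value), method = 'bonferroni')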
Pairwise comparisons using contrasts
An alternative to running individual \(t\)-tests is to do pairwise comparisons using post-hoc contrasts. A contrast is a weighted average of the group means \(\bar{x}_j\) of the \(J\) groups we compare.
\[\hat{\psi} = \sum_{j=1}^{J} c_j \bar{x}_j\]
It must satisfy the condition that the weights \(c_j\) add to zero:
\[\sum_{j=1}^{J} c_j = 0\]
The standard error of a contrast \(\hat{\psi}\) is computed based on all the data. That is why contrasts are usually more powerful than run-of-the-mill \(t\)-tests. The standard error is defined as:
\[S_{\hat{\psi}} = \sqrt{MS_{within} \sum_{j=1}^{J} \frac{c_j^2}{n_j}}\]
Dividing a contrast \(\hat{\psi}\) by its standard error \(S_{\hat{\psi}}\) yields a variable that follows a \(t\)-distribution with \(N-J\) degrees of freedom, where \(N\) is the total sample size, and \(J\) is the number of groups.
\[\frac{\hat{\psi}}{S_{\hat{\psi}}} \sim t_{N-J}\] Testing a contrast comes down to running a one-sample \(t\)-test to test (for example):
\(H_0: \psi = 0\)
\(H_1: \psi \ne 0\)
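As a worked sketch of these formulas (with made-up summary quantities, not the values from our example), a single contrast test can be computed by hand like this:
# made-up ingredients: group means, group sizes, and the error mean squares
group_means = c(control = 10, treat1 = 12, treat2 = 15)
n_j         = c(10, 10, 10)
MS_within   = 9
weights     = c(-1, 0.5, 0.5)   # control vs. the average of both treatments

psi_hat = sum(weights * group_means)               # contrast estimate
se_psi  = sqrt(MS_within * sum(weights^2 / n_j))   # its standard error
t_value = psi_hat / se_psi                         # t with N - J df
df      = sum(n_j) - length(n_j)
p_value = 2 * pt(abs(t_value), df = df, lower.tail = FALSE)
c(psi_hat = psi_hat, t = t_value, p = p_value)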
If we want to analyse contrasts in R, we need another R package called emmeans. This package contains a function called emmeans() (yes, it has the same name as the package) that allows us to run a contrast analysis. Let’s have a look at the emmeans() function. When we call it to analyse contrasts, we need to specify a few arguments:
- object (required): an R object containing the results of an ANOVA (good thing we saved our ANOVA as an R object)
- specs (required): a character string or vector containing the name(s) of the grouping variable(s) in our data (in our example, the grouping variable is ‘cond’)
- contr (optional): a character value or list indicating the type of post-hoc comparisons; the default is NULL, but we can set it to “pairwise”, or we can define it as a list of vectors representing custom contrasts
- adjust (optional): a character value indicating which method should be used to control for type-I error inflation due to multiple comparisons; default is “tukey” for Tukey’s method, but we can set it to “none” if we don’t want to adjust our \(p\)-values, or we can choose a different adjustment method such as “holm” (for the Holm-Bonferroni method) or “bonferroni” (for the Bonferroni correction)
Let’s dive in and run pairwise comparisons on our ANOVA. We can tell the emmeans() function to run pairwise contrasts by setting the contr argument to “pairwise”. The function will then set up the following contrasts:
|  | control | treatment 1 | treatment 2 |
|---|---|---|---|
| contrast 1 | 1 | -1 | 0 |
| contrast 2 | 1 | 0 | -1 |
| contrast 3 | 0 | 1 | -1 |
In the first example, we will go with the unadjusted \(p\)-values. Let’s have a look at the code using the ANOVA model we saved as an R object above (“model1”). Here is the syntax.
# uncomment the next line to install the package emmeans
# install.packages("emmeans")
# load emmeans
library(emmeans)
# pairwise comparisons without type-I error adjustment
emmeans(object = model1, specs = 'cond',
contr = 'pairwise', adjust = 'none')
Once we run the code above, R will return some information in the console. Here is what it looks like:
$emmeans
cond emmean SE df lower.CL upper.CL
control 97.1 1.73 87 93.7 101
treat1 102.4 1.73 87 98.9 106
treat2 113.3 1.73 87 109.9 117
Confidence level used: 0.95
$contrasts
contrast estimate SE df t.ratio p.value
control - treat1 -5.23 2.44 87 -2.143 0.0349
control - treat2 -16.20 2.44 87 -6.633 <.0001
treat1 - treat2 -10.97 2.44 87 -4.490 <.0001
The information we are interested in is contained in the lower half of the output. There, we can see three contrasts, each comparing two of the three levels of our between-subjects variable ‘cond’. The output contains an estimate of the respective mean difference (this is the contrast value \(\hat{\psi}\)) and its associated standard error (this is \(S_{\hat{\psi}}\)). It also shows us the empirical \(t\)-value for the contrast (dividing \(\hat{\psi}\) by \(S_{\hat{\psi}}\) yields this exact value) and its degrees of freedom. Here, we can see that the contrast has more degrees of freedom than a simple \(t\)-test would have. Finally, the output contains the \(p\)-value for each contrast. Based on this analysis, we would conclude that the effect of condition in our ANOVA was significant because all three means differ from one another.
Now let’s rerun the analysis while correcting for multiple comparisons. As mentioned above, we can choose from several methods that control type-I error inflation. Let’s say we want to go with a classic and opt for Bonferroni’s method. This method adjusts the threshold at which we consider an effect to be statistically significant: we now accept \(H_1\) only if \(p < \frac{\alpha}{k}\), where \(\alpha\) is our tolerated type-I error level (e.g., \(\alpha = .05\)), and \(k\) is the number of post-hoc contrasts we test. What R does instead is multiply the unadjusted \(p\)-values by \(k\). This is somewhat more convenient because we can keep comparing the now corrected \(p\)-values to our usual tolerated type-I error level.
Are you wondering what would happen if an unadjusted \(p\)-value is already so large that multiplying it by the number of tests \(k\) would propel it above 1? The simple answer is: nothing terrible. R caps the adjusted \(p\)-values at 1 in order not to violate the laws of probability. While this is technically ‘wrong’, it is of little concern because the \(p\)-values it affects are so large that we would not accept \(H_1\) in either the adjusted or the unadjusted case.
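We can see both behaviours - the multiplication by \(k\) and the cap at 1 - with base R's p.adjust() function (the first value below is the unadjusted \(p\)-value of the first contrast above; the second is made up):
# Bonferroni adjustment: p-values are multiplied by the number of tests ...
p.adjust(0.0349, method = 'bonferroni', n = 3)
# ... but never exceed 1
p.adjust(0.60, method = 'bonferroni', n = 3)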
Here is the code for the pairwise comparisons using Bonferroni’s method for controlling type-I error inflation (the only thing we need to change is the adjust argument).
# pairwise comparisons with Bonferroni adjustment
emmeans(object = model1, specs = 'cond',
contr = 'pairwise', adjust = 'bonferroni')
Let’s have a look at the output in the console:
$emmeans
cond emmean SE df lower.CL upper.CL
control 97.1 1.73 87 93.7 101
treat1 102.4 1.73 87 98.9 106
treat2 113.3 1.73 87 109.9 117
Confidence level used: 0.95
$contrasts
contrast estimate SE df t.ratio p.value
control - treat1 -5.23 2.44 87 -2.143 0.1048
control - treat2 -16.20 2.44 87 -6.633 <.0001
treat1 - treat2 -10.97 2.44 87 -4.490 0.0001
P value adjustment: bonferroni method for 3 tests
The information looks very similar to what we saw when we ran the uncorrected pairwise comparisons. There are only two differences: first, R states at the bottom of the table that the \(p\)-values were adjusted using Bonferroni’s method for three tests. Second, the \(p\)-values differ. Looking at the first \(p\)-value (the one for the contrast comparing treatment 1 to the control group), we can confirm that R has tripled the original \(p\)-value (the difference at the fourth decimal is due to rounding). If we choose to use a Bonferroni correction for the post-hoc comparisons, our conclusions about the result of the ANOVA change slightly. Now, we would state - with the usual confidence - that the significant effect in the ANOVA is due to the mean of the second treatment group differing from both the control group’s mean and that of treatment 1. However, we cannot say whether scores in treatment 1 differ from those in the control group.
Finally, let’s consider using a less conservative adjustment method, namely Tukey’s method, also known as Tukey’s honestly significant difference (HSD). In technical terms, the HSD computes the critical mean difference for which a pairwise comparison is considered significant, and then judges each comparison by that difference. The test statistic \(q\) follows the Studentised range distribution, the shape of which depends on the number of groups \(k\) and the sample size \(N\) (via the distribution’s degrees of freedom). Because the distribution of the \(q\)-statistic considers the number of groups, the HSD adjusts for multiple comparisons. Again, the only thing we need to change about our syntax is the adjust argument (see below):
# pairwise comparisons using Tukey's method
emmeans(object = model1, specs = 'cond',
contr = 'pairwise', adjust = 'tukey')
Let’s again inspect the console output:
$emmeans
cond emmean SE df lower.CL upper.CL
control 97.1 1.73 87 93.7 101
treat1 102.4 1.73 87 98.9 106
treat2 113.3 1.73 87 109.9 117
Confidence level used: 0.95
$contrasts
contrast estimate SE df t.ratio p.value
control - treat1 -5.23 2.44 87 -2.143 0.0872
control - treat2 -16.20 2.44 87 -6.633 <.0001
treat1 - treat2 -10.97 2.44 87 -4.490 0.0001
P value adjustment: tukey method for comparing a family of 3 estimates
Once more, the output remains largely the same, the only differences being the \(p\)-values and the statement about how they were adjusted. We can see that - similar to the Bonferroni correction - R reports adjusted \(p\)-values (here computed from the Studentised range distribution) rather than lowering the tolerated type-I error level, which makes life a little easier for us because we can still use our conventional \(\alpha\)-level to assess the statistical significance of the tests. Although the Tukey correction is less severe than the Bonferroni correction, the result is qualitatively similar: we can confirm that treatment 2 scores differ from those of the other two conditions, but we cannot say whether scores in treatment 1 differ from those in the control group.
Custom post-hoc contrasts
The final way to run post-hoc tests we will consider here is to run custom contrasts. This means that we define ourselves how we want to compare the group means in case of a significant effect in the ANOVA. When using custom contrasts, we can choose how many comparisons we run, which means we want to compare, and how to compare them. The only thing we need to keep in mind is that the contrast weights must add to zero (R does not check this for us; the code will run even if the weights do not add to zero, but the result may be nonsense).
For example, we could compare two group means (just as we did with the pairwise comparisons), but we could also test whether two groups differ from a third.
Let us go back to our ANOVA example and assume that we want to run two post-hoc tests. The first tests whether having any treatment leads to different scores than being in the control condition. The second tests whether the effectiveness of the two treatments differs. Here is what the two contrasts would look like:
|  | control | treatment 1 | treatment 2 |
|---|---|---|---|
| contrast 1 | -1 | +0.5 | +0.5 |
| contrast 2 | 0 | 1 | -1 |
Let’s dissect that! For contrast 1, we compare the control group with the mean of the two treatment groups. In contrast 2, we compare only the two treatment groups; the 0 weight for the control condition means that its mean does not play any role in this contrast.
How do we tell R that we want to run custom contrasts? We can do so using the contr argument of the emmeans() function. To be specific, we need to define this argument as a list in which each element is one contrast.
When defining the list of contrasts, each contrast must be a numeric vector of length \(k\), where \(k\) is the number of groups in the grouping variable that we fed into the emmeans() function (in our case 3). If we choose the wrong vector length, we will receive an error message. We also need to make sure that the sum of each contrast vector is zero (see above). Finally, we can (but do not have to) create these vectors as named elements of the list. Naming the vectors leads to output that is slightly easier to make sense of.
Here, we need to decide whether we save the list of contrasts as a separate object or whether we feed the emmeans() function a call of the list() function (the result will be the same, so it is simply a matter of preference). Now let’s look at the R code.
## Alternative 1: separate objects
# create a list containing the custom contrasts
my_contrasts = list(
  treat_vs_control = c(-1, 0.5, 0.5),
  treat1_vs_treat2 = c(0, 1, -1)
)
# run the custom post-hoc tests
emmeans(object = model1, specs = 'cond',
contr = my_contrasts, adjust = 'none')
## Alternative 2: don't create a separate list
# run the custom post-hoc tests
emmeans(object = model1, specs = 'cond',
contr = list(
treat_vs_control = c(-1, 0.5, 0.5),
treat1_vs_treat2 = c(0, 1, -1)),
adjust = 'none')
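Before running either version, a quick sanity check along these lines (a sketch using the my_contrasts list from the first alternative) confirms that each set of weights sums to zero, since R will not warn us if it does not:
# each element should return 0
sapply(my_contrasts, sum)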
Running either version of the code will prompt R to test the contrasts without adjusting the \(p\)-values, because we set the adjust argument to ‘none’ (of course, we could select an adjustment method by changing this argument accordingly). We will receive the following console output:
$emmeans
cond emmean SE df lower.CL upper.CL
control 97.1 1.73 87 93.7 101
treat1 102.4 1.73 87 98.9 106
treat2 113.3 1.73 87 109.9 117
Confidence level used: 0.95
$contrasts
contrast estimate SE df t.ratio p.value
treat_vs_control 10.7 2.12 87 5.066 <.0001
treat1_vs_treat2 -11.0 2.44 87 -4.490 <.0001
As we can see, R now returns a table with two contrast tests. Since we named the contrasts properly, it is easy to see what these contrasts tested. For each contrast, R tells us the estimated mean difference, the associated standard error, and the degrees of freedom (\(N-J\), just as for the pairwise comparisons described above). It also shows us the \(t\)-value and \(p\)-value for each contrast.
In our example, both contrasts are significant. This means we can state - with the usual confidence - that the ANOVA was significant because a) getting a treatment leads to different scores than getting no treatment (control condition), and b) the treatments differ in effectiveness.
But what about multiple comparisons? So far, we have not corrected our custom contrasts for multiple comparisons, but we can do so if we want to. The only thing we need to consider is that we cannot use Tukey’s method, as it is specific to pairwise comparisons (oddly enough, Tukey’s method won’t work even if we manually specify pairwise contrasts - if we want to use the HSD method, we need to set the contr argument to “pairwise”).
To adjust for multiple comparisons, we can again specify the adjust argument. If we use custom contrasts, the default for this argument is actually “none”, but we can set it to “bonferroni” (for the classic Bonferroni correction) or “holm” (for the slightly less conservative Holm-Bonferroni method) instead. Let’s go with Bonferroni. Here is the code:
# run the custom post-hoc tests using "contrast" and correct
# for multiple comparisons using Bonferroni's method
emmeans(object = model1, specs = 'cond',
contr = list(
treat_vs_control = c(-1, 0.5, 0.5),
treat1_vs_treat2 = c(0, 1, -1)),
adjust = 'bonferroni' )
Now let’s look at the console output:
$emmeans
cond emmean SE df lower.CL upper.CL
control 97.1 1.73 87 93.7 101
treat1 102.4 1.73 87 98.9 106
treat2 113.3 1.73 87 109.9 117
Confidence level used: 0.95
$contrasts
contrast estimate SE df t.ratio p.value
treat_vs_control 10.7 2.12 87 5.066 <.0001
treat1_vs_treat2 -11.0 2.44 87 -4.490 <.0001
P value adjustment: bonferroni method for 2 tests
As we can see, not much has changed in terms of statistical significance, which makes sense because the unadjusted \(p\)-values were already very low. However, R now tells us at the bottom of the table that \(p\)-values were adjusted using the Bonferroni method.
Custom contrasts are really useful because of their flexibility. They allow us to test very specific hypotheses. In fact, if we can state in advance which means in our design should differ (for example, because we have a strong theory to base our hypotheses on), we can technically skip the ANOVA altogether and instead run only the contrast tests. In those cases, we speak of a-priori contrasts.
A concluding remark on post-hoc comparisons
We started the journey into post-hoc analyses using the metaphor of the ANOVA as firing a shotgun into thick fog which - in case we hit something - necessitates wandering into the fog in order to find out what exactly we hit. As the examples above show, the answer to that crucial question depends on which post-hoc comparisons we run (simple \(t\)-tests, pairwise comparison contrasts, or custom contrasts) and whether and how we control for type-I error inflation due to multiple comparisons.
When trying to disentangle a significant effect in an ANOVA, we may find ourselves faced with multiple options, each of which is valid and can be argued for convincingly. However, even subtle differences between these options may lead to qualitatively different conclusions. The practical advice here is to preregister exactly which tests we will run in case an ANOVA yields a significant result (ideally by providing the R code for the post-hoc analysis).