Logical Indexing

So far, we have learned how to obtain or overwrite parts of an R object using numerical indexing. We also know how to obtain a named element of a data frame or list, namely by using the $ operator. We will now turn to logical indexing, that is obtaining elements of objects that satisfy certain conditions.

Just as with numerical indexing, logical indexing uses brackets. Within these brackets we can specify conditions that be be tested logically such that the result of the test is a TRUE or FALSE statement (i.e., a Boolean variable). R will then only show those elements of the object for which these conditions are true.

Logical indexing using binary operators

The simplest form of logical indexing is to use a single statement involving a logical binary operator. For example, we could ask R to return to us all elements of a numeric vector (including, for example, a column of a data frame) that are greater then 3, or we could ask it to obtain all elements from a Boolean matrix that are TRUE.

Here are a few examples:

# create a numeric vector
v1 = 1:6

# create a Boolean 2x3 matrix
m1 = matrix(
  c(T, T, F, T, F, F),
  nrow = 2
)

# obtain all elements of v1 that are greater than 3
v1[v1 > 3]

# obtain all elements of m1 that are TRUE
m1[m1 == T]

Here is what the output in the console looks like:

[1] 4 5 6

[1] TRUE TRUE TRUE

We can also obtain elements by combining multiple logical statements using R’s AND and OR operators (& and |, respectively). For example, we could obtain all elements of the numeric vector we created above that exceed 2 AND are smaller than 5.

If we combine multiple logical tests in logical indexing, it is important that we tell R in each of the involved tests which variable to test. The following code would not work even though it may seem clear to us what it should mean:

# obtain all elements of v1 that are greater than 2 and smaller than 5
v1[v1 > 2 & < 5]

While it looks as if we asked R to show us all elements that are greater than 2 AND smaller than 5, R will not know which variable the second part of the test refers to and will return an error message (in this message, R will tell us that the < sign following the & operator was unexpected).

Here is how the code needs to look like if we want to obtain elements of an object using a combination of multiple logical statements:

# obtain all elements of v1 that are greater than 2 and smaller than 5
v1[v1 > 2 & v1 < 5]

This code works as intended, which we can verify by looking at the output in the console.

[1] 3 4

One neat thing about logical indexing is that we are not restricted to referring to the variable we want obtain elements from. As long as the outcome of the logical tests is an object of the same size as the object we want to obtain elements from, the code will work.

What this means is that we can obtain all elements of an object that satisfy conditions in other equally sized objects. This is particularly useful when we want to obtain elements of a data frame. For example, we might want R to show us all elements of one variable that satisfy the condition that another variable equal a specific value. Let’s look at a few specific examples.

# create a data frame containing some demographic variables and responses to three 5-point Likert-scale items
my_df = data.frame(
  ID = 1:6,
  age = c(23, 21, 18, 16, 19, 17),
  gender = c('f', 'f', 'm', 'm', 'f', 'f'),
  item1 = c(2, 5, 5, 4, 5, 3),
  item2 = c(1, 2, 1, 3, 2, 1),
  item3 = c(3, 4, 5, 5, 4, 2)
)

# obtain the responses to item 2 of all participants who are adults
my_df$item2[my_df$age >= 18]

# obtain the gender of all participants who scored below 4 on items 1 and 3
my_df$gender[my_df$item1 < 4 & my_df$item3 < 4]

Here is what the output in the console looks like:

[1] 1 2 1 2

[1] "f" "f"

Just as with numerical indexing, we can also use logical indexing to save the obtained elements as a new object of their own, or we can overwrite the obtained elements. Let’s focus on overwriting. Overwriting of values using logical indexing is particularly useful in situations, in which missing data is coded using specific numbers (e.g., -99).

Numeric values for missing data can cause problems if we do not detect it such as distorting some measures of central tendency (e.g., the mean) or variability (e.g., standard deviation, range). Ideally, we want missing data to be represented by NA when working with R. That means that we may end up having to replace certain values that code missing data with NAs manually. Logical indexing makes this a walk in the park as the following example shows.

# add two more items to the data frame created above, both of which contain missing data coded as -99
my_df$item4 = c(3, 5, 3, -99, 4, -99)
my_df$item5 = c(-99, 2, 3, 1, -99, 2)

# turn all elements that are equal to -99 into NAs
# variant A:
my_df$item4[my_df$item4 == -99] = NA
my_df$item5[my_df$item5 == -99] = NA

# variant B:
my_df[my_df == -99] = NA

Both ways of overwriting the -99s with NAs work. The second one is more parsimonious. However, replacing all elements of a certain value in a data frame can become messy in large data sets. If the code for missing data has been chosen poorly (e.g., 99), there may be some variables for which 999 is an actual and valid value (e.g., a reaction time in milliseconds). Therefore, caution is advised when replacing values for a complete data frame.

Logical indexing using functions

Instead of - or in addition to - binary operators, we can also use functions for logical indexing. We can use functions in two ways here. First, we can use the output of a function call as a value of comparison. For example, rather than testing whether an element of an object exceeds a certain fixed value, we could test whether it exceeds the mean or the median of that object.

Second, there a several functions that conduct logical tests. For example, the function is.na() tests whether an object is NA, the function is.numeric() tests whether an object’s type is numeric (either integer or double), and the function is.finite() tests whether a numeric variable has a finite value.

Note: R has a neat way of reversing functions such as is.na(), is.numeric(), and is.finite(). We can make R test the respective opposite by preceding the function call with an exclamation mark (similar to how the operator != mean not equal).

For example, calling !is.na() tests whether an object is not missing, !is.numeric() tests whether the object’s type is different from numeric, and !is.finite() tests whether a numeric value is infinite (or Inf in R terms).

Let’s look at a few examples of logical indexing using functions. We will do so with the data frame we created and modified above. Here is a quick update on how the data frame looks currently.

  ID age gender item1 item2 item3 item4 item5
1  1  23      f     2     1     3     3    NA
2  2  21      f     5     2     4     5     2
3  3  18      m     5     1     5     3     3
4  4  16      m     4     3     5    NA     1
5  5  19      f     5     2     4     4    NA
6  6  17      f     3     1     2    NA     2

We will now obtain and overwrite some of the elements of this data frame using the is.na() function and its inverse, !is.na(). Note that the complete function call including the opening and closing parentheses must be contained within the brackets.

# obtain the age of all participants whose responded to item4
my_df$age[!is.na(my_df$item4)]

# replace missing responses to items 4 with the average response (i.e., the mean of valid responses to that item across participants) 
my_df$item5[is.na(my_df$item5)] = mean(my_df$item5, na.rm = T)

Running the first line of code above, yields the following output in the console:

[1] 23 21 18 19

And here is what the data frame looks like after we replaced the missing responses with the average response.

  ID age gender item1 item2 item3 item4 item5
1  1  23      f     2     1     3  3.00     2
2  2  21      f     5     2     4  5.00     2
3  3  18      m     5     1     5  3.00     3
4  4  16      m     4     3     5  3.75     1
5  5  19      f     5     2     4  4.00     2
6  6  17      f     3     1     2  3.75     2

Note how R now displays two decimals for item4. It does not display decimals for the other items, even though their object type is double - just as that of item4 (we can find out the object types by calling the function typeof() and feeding it the column of interest as the sole function argument).

R generally only uses as many decimals as it has to. If a double type object contains only whole numbers, then R will simply omit the decimals.