Data frames and lists

Here, we will look at two types of R objects that we will frequently encounter when working R, data frames and lists. Both of these object types are a little more complex than the ones we talked about so far, but with greater complexity comes greater utility.

Data frames

Data frames are R objects specifically designed for people who work with data. Just like matrices, they are two-dimensional objects. The main difference, however, is that data frames can contain different types of elements. To be exact, data frames are a combination of different vectors of the same length.

The general idea of a data frame is to view every row as an observation and every column as a variable. Therefore, each column represents a vector, which also means that, column-wise, all elements must be of the same type. The same is not true for the rows of a data frame.

How to define a data frame

We can define data frames by calling the function data.frame() and entering vectors or matrices as function arguments. Here, vectors are treated as column vectors (or nx1 matrices, if you will). The only requirement is that all of the vectors and matrices we want to combine into a data.frame have the same number of rows.

Below is very simple example of how we can define a data frame in R. Imagine, we have collected data from 5 participants, which we will identify by the numbers from 1 to 5, and that we asked them to report their gender (open response) and age in years (numeric value).

my_data_frame = data.frame(
  c(1, 2, 3, 4, 5),                                       # participant ID
  c('male', 'female', 'non-binary', 'female', 'female'),  # participant gender
  c(25, 19, 23, 22, 28)                                   # participant age
)

Note: In this example, the code is distributed across several lines. Technically, we could have written it all in one long line, but doing so makes it very difficult to read the code. Unless we call a function with few or very short function arguments, it is recommended to use one line for each argument.

Also note how the function arguments are shifted a bit to the right. This is called indentation. Indentation has no effects on how the code works in R (it does in other programming languages such as Python!). However, it contributes greatly to the legibility of the code by structuring it.

It is, therefore, a good idea to get used to organizing longer pieces of code by using multiple lines and proper indentation. Your future self, your collaborators, and third parties, who will read your code when you make it publicly available, will thank you for it.

Once we execute the piece of code above, the data frame will appear as an object in RStudio’s Environment (top right). We can immediately see that it is a data frame because RStudio describes data frames as \(x\) observations of \(y\) variables. In our case that is listed as “5 obs. of 3 variables”.

Note: The Environment will show a small light blue button with a white arrowhead to the left of our data frame’s name. If we click this button, RStudio will “unfold” the data frame and show us a list of the vectors it contains as well as the type of the vector.

Just as with matrices, we can have a look at our new object by either entering its name or by clicking on it in the environment. If we do the former, R will return the data frame in the console, whereas the latter will open the data frame in a separate tab next to our scripts. Let’s go with the code version.

  c.1..2..3..4..5. c..male....female....non.binary....female....female..
1                1                                                  male
2                2                                                female
3                3                                            non-binary
4                4                                                female
5                5                                                female
  c.25..19..23..22..28.
1                    25
2                    19
3                    23
4                    22
5                    28

Assigning proper variable names

We may notice two things: first, the data frame contains exactly the information we specified as function arguments for the data.frame() function; second, each column of our data frame has a very weird name. The reason is that data frames require variable names for each column, and if we do not specify these names, R will just use each vector’s content as a name. This is inconvenient for a number of reasons, so it is desirable to assign proper names to the variables.

One way to do so is calling the function names(), which allows us to change the names of a data frame’s variables. The function ‘names’ requires the name of the data frame as a function argument. We then define it as a character vector containing the new variable names. The character vector must contain one element for each column of the data frame. For our data frame, it could look like this:

names(my_data_frame) = c('ID', 'gender', 'age')

If we now ask R to show us the data frame in the console (by entering its name as code), it will look much tidier.

  ID     gender age
1  1       male  25
2  2     female  19
3  3 non-binary  23
4  4     female  22
5  5     female  28

Another way to assign sensible variable names is to specify them when creating the data frame. We can do so by making slight adjustments to the way we enter vectors as function arguments when calling the data.frame() function. In fact, the code will look as if we define vectors as objects. The only difference is that when doing so within the data.frame() function, the vectors will not be created outside the data frame (i.e., they will not appear as separate objects in the Environment). Here is what the code would look like for our data frame.

my_data_frame = data.frame(  
  ID = c(1, 2, 3, 4, 5),                                          # ID
  gender = c('male', 'female', 'non-binary', 'female', 'female'), # gender
  age = c(25, 19, 23, 22, 28)                                     # age
) 

The result is the same as if we had used the names() function.

  ID     gender age
1  1       male  25
2  2     female  19
3  3 non-binary  23
4  4     female  22
5  5     female  28

Entering matrices into data frames

As mentioned above, we can also enter matrices as function arguments when creating a data frame, as long as these matrices have the correct number of rows. Let’s assume that we additionally collected responses from two Likert scale items (levels 1 to 5), which we would like to enter into our data frame. Our code would then look like this:

my_data_frame = data.frame(  
  ID = c(1, 2, 3, 4, 5),                                          # ID
  gender = c('male', 'female', 'non-binary', 'female', 'female'), # gender
  age = c(25, 19, 23, 22, 28),                                    # age
  matrix(c(1, 3, 5, 4, 3, 5, 5, 3, 2, 2), nrow = 5)               # responses 
) 

Running this code creates a data frame with 5 observations of 5 variables in in the environment. One problem is that we cannot name the two new variables contained in the matrix as we did with the vectors. However, when we look at our data frame, we will notice that the variable names do not look as terrible as they did when we entered the unnamed vectors. Instead, the variables will be named “X1” and “X2” (when entering a matrix, R will just assign the names from “X1” to “Xi” for a matrix with \(i\) columns).

Still, we might prefer less obscure names for our two measures. One possibility is to use the names() function as we did above. Another is to create the matrix as a separate object before creating the data frame. This allows us to set proper column names using the function colnames(). This function works very similar to the names() function but is specifically designed for matrices. Here is what the code would look like.

# First, we create our 5x2 matrix as an object
response_matrix = matrix(
  c(1, 3, 5, 4, 3, 5, 5, 3, 2, 2), 
  nrow = 5
)

# Next, we assign column names
colnames(response_matrix) = c('item1', 'item2')

# Finally, we create the data frame
my_data_frame = data.frame(  
  ID = c(1, 2, 3, 4, 5),                                          # ID
  gender = c('male', 'female', 'non-binary', 'female', 'female'), # gender
  age = c(25, 19, 23, 22, 28),                                    # age
  response_matrix                                                 # responses 
) 

The new variables in our data frame now have proper names. We can check this by calling the name of our data frame and inspecting it in the console (or by clicking on its name in the Environment).

  ID     gender age item1 item2
1  1       male  25     1     5
2  2     female  19     3     5
3  3 non-binary  23     5     3
4  4     female  22     4     2
5  5     female  28     3     2

Defining objects outside a data frame and then simply using their name as a function argument when calling the data.frame() function is not limited to matrices but can also be done for vectors. While there are good reasons for defining objects separately before combining them into a data frame, one downside is that our Environment can get a wee bit cluttered.

If we want to tidy up our environment, we can remove objects we no longer need (for example, because we put them into a data frame) using the function rm(). We simply need to use the name of the object we want to remove as the function argument for rm(), and it will be disappear from the Environment.

Lists

The final type of R object we need to know for now is the list. Lists are very flexible containers that we can create using the list() function. Think of lists as multi-purpose storage units. They can contain all other types of R objects (including other lists). For example, we could create a list that contains a single character string, a numeric vector, and a data frame. To do so, we simply call the function list() and enter each object we want to store in the list as a function argument.

my_value = 'hello'          # a single character value
my_vector = c(1,1,2,3,5,8)  # a numeric vector

# The following code creates a list with three elements
my_list = list(
  my_value,
  my_vector,
  my_data_frame
)

rm(my_value, my_vector)   # this removes two of the objects from the environment

As with other objects, the new list will appear in the Environment. RStudio tells us that this object is a list of 3. Just like for a data frame, there is a small light blue button to its left that allows us to unfold the list and have a look at its contents.

We can also click on the list in the environment to have a look at it or have R print the list in the console by calling its name. If we do the latter, the output looks like this:

[[1]]
[1] "hello"

[[2]]
[1] 1 1 2 3 5 8

[[3]]
  ID     gender age item1 item2
1  1       male  25     1     5
2  2     female  19     3     5
3  3 non-binary  23     5     3
4  4     female  22     4     2
5  5     female  28     3     2

As we can see, the output lists the elements of our list one after another. The first element of the list is preceded by “[[1]]”, the second by “[[2]]”, and so on.

Note: Once we start analysing data in R, we will frequently encounter lists. The reason is that many R functions used in inferential statistics use lists as outputs.