Collapsing or summarising data over the levels of a factor in the tidyverse

This is a simple post, and undoubtedly many people know this already, but I thought I would share it for people who are having trouble finding such things.

Let us say I have data that looks conceptually like this:

Conceptual layout of data from an experiment.

And further, assume I have done all the grunt work in with the wonderful readxl package, and dplyr to get it into a sensible format. If anyone wants to know how to do that, let me know. So our data looks like this in R:

## # A tibble: 160 x 5
##    count replicate side  position size 
##    <dbl> <fct>     <fct> <fct>    <fct>
##  1     3 1         left  arm      small
##  2     1 2         left  arm      small
##  3     0 3         left  arm      small
##  4     3 4         left  arm      small
##  5     1 5         left  arm      small
##  6     1 6         left  arm      small
##  7     1 7         left  arm      small
##  8     1 8         left  arm      small
##  9     2 9         left  arm      small
## 10     0 10        left  arm      small
## # … with 150 more rows

In this example I will collapse over levels the variable side. The key functions here are group_by and summarise (or summarize if you hail from a certain country). group_by—as the name suggests—arranges the data into groups. It is important to note that a) this does not change how the data is displayed when you print it or View it, and b) it returns a grouped data frame. The latter is important, as sometimes it can cause problems. Because of that, I recommend you use the ungroup function as well.

The principle is straightforward—use group_by on the variables you wish to keep, and omit the variable(s) you wish to collapse over. In this example, as I said, I will summarise over side. That means I use group_by on replicate, position, and size.

reducedData = myFormattedData %>% 
  group_by(replicate, position, size) %>% 
  summarise(count = sum(count)) %>% 
  ungroup()

This code groups the data into well groups which have the same values of replicate, position, and size. That means that the each group will have, in this case, a pair of observations, one with side equal to left and one with side equal to right. We then summarise each of these pairs of observations by adding their count values. Note that my code might be a little confusing in that I have used the same variable name, count, for the sum of the counts. The net effect is what we want:

## # A tibble: 80 x 4
##    replicate position size  count
##    <fct>     <fct>    <fct> <dbl>
##  1 1         arm      small     5
##  2 1         arm      big       3
##  3 1         hand     small     3
##  4 1         hand     big       3
##  5 2         arm      small     3
##  6 2         arm      big       4
##  7 2         hand     small     2
##  8 2         hand     big       3
##  9 3         arm      small     2
## 10 3         arm      big       3
## # … with 70 more rows

A quick check shows that the original data had a value of 3 for the first replicate that was small and on the arm on the left, and a value of 2 on the right, giving a sum of 5.

Share Button

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.