Category Archives: dplyr

I think I understand gather

gather and spread—are these the most misunderstood functions in all of dplyr? I don’t know but maybe this will help you.

The simple case—gathering all columns

gather, at its very simplest, reduces your dataframe to a set of key-value pairs. This is best demonstrated by an example. I will create a dataframe with three variables x, y and z.

 
my.df  = data.frame(x = 1:3,                     
                    y = letters[1:3],
                    z = c("Alice", "Bob", "Carol"))

And this dataframe looks like this:

##   x y     z
## 1 1 a Alice
## 2 2 b   Bob
## 3 3 c Carol

gather, applied to this data frame with no additional arguments, will make two new variables, key, and value. key will contain each of the variable names repeated as many times as there are rows in the original dataframe. value will contain the corresponding variable value. That is harder to understand. I will explain, then demonstrate. There are three variables in my.df and three rows. gather will change this into a dataframe with two variables, key, and value, and nine (3 × 3) rows. The first row will have a key of x and a value of 1. The next row will have a key of x and a value of 2 and so on. The new dataframe looks like this:

my.df %>% 
  gather() 
##   key value
## 1   x     1
## 2   x     2
## 3   x     3
## 4   y     a
## 5   y     b
## 6   y     c
## 7   z Alice
## 8   z   Bob
## 9   z Carol

Hopefully the effect of gather is clear here, even if the use case is not.

The not so simple case—gathering some columns

How about when you want to gather some of the columns? Why might you want to do this? There are many reasons, but the most common may be that you have several measurements for an observation that you wish to summarise.
Consider the following example:

##    name time1 time2 time3
## 1 Alice  8.43  0.32  3.14
## 2   Bob  5.77  0.56  1.45
## 3 Carol 13.29  3.17 27.26

How would we go about finding either the total time, or the average time for each person in the dataframe?

Firstly we need to gather so that the three time variables are in a single column. We will store the variable names in a column named run, and we will store the actual time values in a variable named time:

set.seed(123)
ex2.df = data.frame(name = c("Alice", "Bob", "Carol"),
                         time1 = round(rexp(3, 0.10), 2), 
                         time2 = round(rexp(3, 0.10), 2),
                         time3 = round(rexp(3, 0.10), 2), 
                         stringsAsFactors = FALSE)

ex2.gathered.df = ex2.df %>% 
  gather(key = "run", value = "time", time1, time2, time3)

ex2.gathered.df

And the new dataframe looks like this:

##    name   run  time
## 1 Alice time1  8.43
## 2   Bob time1  5.77
## 3 Carol time1 13.29
## 4 Alice time2  0.32
## 5   Bob time2  0.56
## 6 Carol time2  3.17
## 7 Alice time3  3.14
## 8   Bob time3  1.45
## 9 Carol time3 27.26

We want the average time for each person, so we group_by name, and then we summarise.

ex2.gathered.df %>% 
  group_by(name) %>% 
  summarise(ave.time = mean(time))

Et voilà.

## # A tibble: 3 x 2
##   name  ave.time
##   <chr>    <dbl>
## 1 Alice     3.96
## 2 Bob       2.59
## 3 Carol    14.6

My own use case is different – I had a bunch of summary statistics in each row for one data set, and a set of observations in another. I wanted to put them all in a database table with a column indicating whether it was a summary statistic or not (by the name of the statistic), or whether it was an observation. So my gather statements look like:

a1 = a1 %>%
    gather(key = "treatments", value = "stat", mean, sd, min, max) 

a2 = a2 %>%
    rename(stat = accuracy) %>%
    mutate(treatments = "accuracy") %>%
    select(colour, method, reduced, baseline, treatments, stat)

a = bind_rows(a1, a2)

Update

Di Cook reminded me on Twitter

that Hadley has been working to improve that Hadley has been working on two replacements for gather and spread, called pivot_longer and pivot_wider respectively. News about these new functions (they are more than renaming of gather and spread), can be found here. Once I understand them, I will add a section to this post.

Update 2—using pivot_longer

Note: If you want to try out these new functions, you need to install the development version of {tidyr}. This involved a lot of updates for me, including having to fix broken headers (thanks Apple) on OS X Mojave (I used the TL;DR solution I found on Stack Overflow).

pivot_longer improves gather, at least IMHO, by making it more explicit where information is coming from and where it is going to. Here is the same example using pivot_longer:

## Note: You need the development version of tidyr
## for this to work
## devtools::install_github("tidyverse/tidyr")
library(tidyr)

ex2.pivotlonger.df = example2.df %>% 
  pivot_longer(cols = matches("time[1-3]"),
             names_to = "run",
             values_to = "time")

ex2.pivotlonger.df %>% 
  group_by(name) %>% 
  summarise(ave.time = mean(time))

The output is identical to above:

## # A tibble: 3 x 2
##   name  ave.time
##   <chr>    <dbl>
## 1 Alice     3.96
## 2 Bob       2.59
## 3 Carol    14.6

I like the way that I can tell pivot_longer which columns to gather using the cols argument, where to store the variable names using names_to, and where to store the values values_to. Will I switch immediately? No. I think I will wait until these functions enter CRAN versions of tidyr. I am old enough to release living on the bleeding edge of technology does not always pay off!

Update 3

Charco Hui, an MSc student of Anna Fergusson’s and Chris Wild’s has a nice package which animates the gather and spread operations, which can be foundhere. Check it out!

Share Button

Collapsing or summarising data over the levels of a factor in the tidyverse

This is a simple post, and undoubtedly many people know this already, but I thought I would share it for people who are having trouble finding such things.

Let us say I have data that looks conceptually like this:

Conceptual layout of data from an experiment.

And further, assume I have done all the grunt work in with the wonderful readxl package, and dplyr to get it into a sensible format. If anyone wants to know how to do that, let me know. So our data looks like this in R:

## # A tibble: 160 x 5
##    count replicate side  position size 
##    <dbl> <fct>     <fct> <fct>    <fct>
##  1     3 1         left  arm      small
##  2     1 2         left  arm      small
##  3     0 3         left  arm      small
##  4     3 4         left  arm      small
##  5     1 5         left  arm      small
##  6     1 6         left  arm      small
##  7     1 7         left  arm      small
##  8     1 8         left  arm      small
##  9     2 9         left  arm      small
## 10     0 10        left  arm      small
## # … with 150 more rows

In this example I will collapse over levels the variable side. The key functions here are group_by and summarise (or summarize if you hail from a certain country). group_by—as the name suggests—arranges the data into groups. It is important to note that a) this does not change how the data is displayed when you print it or View it, and b) it returns a grouped data frame. The latter is important, as sometimes it can cause problems. Because of that, I recommend you use the ungroup function as well.

The principle is straightforward—use group_by on the variables you wish to keep, and omit the variable(s) you wish to collapse over. In this example, as I said, I will summarise over side. That means I use group_by on replicate, position, and size.

reducedData = myFormattedData %>% 
  group_by(replicate, position, size) %>% 
  summarise(count = sum(count)) %>% 
  ungroup()

This code groups the data into well groups which have the same values of replicate, position, and size. That means that the each group will have, in this case, a pair of observations, one with side equal to left and one with side equal to right. We then summarise each of these pairs of observations by adding their count values. Note that my code might be a little confusing in that I have used the same variable name, count, for the sum of the counts. The net effect is what we want:

## # A tibble: 80 x 4
##    replicate position size  count
##    <fct>     <fct>    <fct> <dbl>
##  1 1         arm      small     5
##  2 1         arm      big       3
##  3 1         hand     small     3
##  4 1         hand     big       3
##  5 2         arm      small     3
##  6 2         arm      big       4
##  7 2         hand     small     2
##  8 2         hand     big       3
##  9 3         arm      small     2
## 10 3         arm      big       3
## # … with 70 more rows

A quick check shows that the original data had a value of 3 for the first replicate that was small and on the arm on the left, and a value of 2 on the right, giving a sum of 5.

Share Button