Tag Archives: pivot_longer

I think I understand gather

gather and spread—are these the most misunderstood functions in all of dplyr? I don’t know but maybe this will help you.

The simple case—gathering all columns

gather, at its very simplest, reduces your dataframe to a set of key-value pairs. This is best demonstrated by an example. I will create a dataframe with three variables x, y and z.

 
my.df  = data.frame(x = 1:3,                     
                    y = letters[1:3],
                    z = c("Alice", "Bob", "Carol"))

And this dataframe looks like this:

##   x y     z
## 1 1 a Alice
## 2 2 b   Bob
## 3 3 c Carol

gather, applied to this data frame with no additional arguments, will make two new variables, key, and value. key will contain each of the variable names repeated as many times as there are rows in the original dataframe. value will contain the corresponding variable value. That is harder to understand. I will explain, then demonstrate. There are three variables in my.df and three rows. gather will change this into a dataframe with two variables, key, and value, and nine (3 × 3) rows. The first row will have a key of x and a value of 1. The next row will have a key of x and a value of 2 and so on. The new dataframe looks like this:

my.df %>% 
  gather() 
##   key value
## 1   x     1
## 2   x     2
## 3   x     3
## 4   y     a
## 5   y     b
## 6   y     c
## 7   z Alice
## 8   z   Bob
## 9   z Carol

Hopefully the effect of gather is clear here, even if the use case is not.

The not so simple case—gathering some columns

How about when you want to gather some of the columns? Why might you want to do this? There are many reasons, but the most common may be that you have several measurements for an observation that you wish to summarise.
Consider the following example:

##    name time1 time2 time3
## 1 Alice  8.43  0.32  3.14
## 2   Bob  5.77  0.56  1.45
## 3 Carol 13.29  3.17 27.26

How would we go about finding either the total time, or the average time for each person in the dataframe?

Firstly we need to gather so that the three time variables are in a single column. We will store the variable names in a column named run, and we will store the actual time values in a variable named time:

set.seed(123)
ex2.df = data.frame(name = c("Alice", "Bob", "Carol"),
                         time1 = round(rexp(3, 0.10), 2), 
                         time2 = round(rexp(3, 0.10), 2),
                         time3 = round(rexp(3, 0.10), 2), 
                         stringsAsFactors = FALSE)

ex2.gathered.df = ex2.df %>% 
  gather(key = "run", value = "time", time1, time2, time3)

ex2.gathered.df

And the new dataframe looks like this:

##    name   run  time
## 1 Alice time1  8.43
## 2   Bob time1  5.77
## 3 Carol time1 13.29
## 4 Alice time2  0.32
## 5   Bob time2  0.56
## 6 Carol time2  3.17
## 7 Alice time3  3.14
## 8   Bob time3  1.45
## 9 Carol time3 27.26

We want the average time for each person, so we group_by name, and then we summarise.

ex2.gathered.df %>% 
  group_by(name) %>% 
  summarise(ave.time = mean(time))

Et voilà.

## # A tibble: 3 x 2
##   name  ave.time
##   <chr>    <dbl>
## 1 Alice     3.96
## 2 Bob       2.59
## 3 Carol    14.6

My own use case is different – I had a bunch of summary statistics in each row for one data set, and a set of observations in another. I wanted to put them all in a database table with a column indicating whether it was a summary statistic or not (by the name of the statistic), or whether it was an observation. So my gather statements look like:

a1 = a1 %>%
    gather(key = "treatments", value = "stat", mean, sd, min, max) 

a2 = a2 %>%
    rename(stat = accuracy) %>%
    mutate(treatments = "accuracy") %>%
    select(colour, method, reduced, baseline, treatments, stat)

a = bind_rows(a1, a2)

Update

Di Cook reminded me on Twitter

that Hadley has been working to improve that Hadley has been working on two replacements for gather and spread, called pivot_longer and pivot_wider respectively. News about these new functions (they are more than renaming of gather and spread), can be found here. Once I understand them, I will add a section to this post.

Update 2—using pivot_longer

Note: If you want to try out these new functions, you need to install the development version of {tidyr}. This involved a lot of updates for me, including having to fix broken headers (thanks Apple) on OS X Mojave (I used the TL;DR solution I found on Stack Overflow).

pivot_longer improves gather, at least IMHO, by making it more explicit where information is coming from and where it is going to. Here is the same example using pivot_longer:

## Note: You need the development version of tidyr
## for this to work
## devtools::install_github("tidyverse/tidyr")
library(tidyr)

ex2.pivotlonger.df = example2.df %>% 
  pivot_longer(cols = matches("time[1-3]"),
             names_to = "run",
             values_to = "time")

ex2.pivotlonger.df %>% 
  group_by(name) %>% 
  summarise(ave.time = mean(time))

The output is identical to above:

## # A tibble: 3 x 2
##   name  ave.time
##   <chr>    <dbl>
## 1 Alice     3.96
## 2 Bob       2.59
## 3 Carol    14.6

I like the way that I can tell pivot_longer which columns to gather using the cols argument, where to store the variable names using names_to, and where to store the values values_to. Will I switch immediately? No. I think I will wait until these functions enter CRAN versions of tidyr. I am old enough to release living on the bleeding edge of technology does not always pay off!

Update 3

Charco Hui, an MSc student of Anna Fergusson’s and Chris Wild’s has a nice package which animates the gather and spread operations, which can be foundhere. Check it out!

Share Button