This is a simple post, and undoubtedly many people know this already, but I thought I would share it for people who are having trouble finding such things.
Let us say I have data that looks conceptually like this:

And further, assume I have done all the grunt work in with the wonderful readxl
package, and dplyr
to get it into a sensible format. If anyone wants to know how to do that, let me know. So our data looks like this in R:
## # A tibble: 160 x 5
## count replicate side position size
## <dbl> <fct> <fct> <fct> <fct>
## 1 3 1 left arm small
## 2 1 2 left arm small
## 3 0 3 left arm small
## 4 3 4 left arm small
## 5 1 5 left arm small
## 6 1 6 left arm small
## 7 1 7 left arm small
## 8 1 8 left arm small
## 9 2 9 left arm small
## 10 0 10 left arm small
## # … with 150 more rows
In this example I will collapse over levels the variable side
. The key functions here are group_by
and summarise
(or summarize
if you hail from a certain country). group_by
—as the name suggests—arranges the data into groups. It is important to note that a) this does not change how the data is displayed when you print it or View
it, and b) it returns a grouped data frame
. The latter is important, as sometimes it can cause problems. Because of that, I recommend you use the ungroup
function as well.
The principle is straightforward—use group_by
on the variables you wish to keep, and omit the variable(s) you wish to collapse over. In this example, as I said, I will summarise over side
. That means I use group_by
on replicate
, position
, and size
.
reducedData = myFormattedData %>% group_by(replicate, position, size) %>% summarise(count = sum(count)) %>% ungroup()
This code groups the data into well groups which have the same values of replicate
, position
, and size
. That means that the each group will have, in this case, a pair of observations, one with side
equal to left
and one with side
equal to right
. We then summarise each of these pairs of observations by adding their count
values. Note that my code might be a little confusing in that I have used the same variable name, count
, for the sum of the counts. The net effect is what we want:
## # A tibble: 80 x 4
## replicate position size count
## <fct> <fct> <fct> <dbl>
## 1 1 arm small 5
## 2 1 arm big 3
## 3 1 hand small 3
## 4 1 hand big 3
## 5 2 arm small 3
## 6 2 arm big 4
## 7 2 hand small 2
## 8 2 hand big 3
## 9 3 arm small 2
## 10 3 arm big 3
## # … with 70 more rows
A quick check shows that the original data had a value of 3 for the first replicate that was small
and on the arm
on the left
, and a value of 2 on the right
, giving a sum of 5.