Adding multiple regression lines to a faceted ggplot2 plot

Those of you who know me well know I am not really a ggplot2 type of guy. Not because I do not think that it is good work—I would be an idiot to claim that probably the most downloaded package in the history of CRAN is not good—but because I predate both Hadley and ggplot2 (yes, I admit it, I am a dinosaur now), I can do most of what I want in base R. However, there are times in exploratory data analysis where going the ggplot2 route makes more sense. One situation where that happens for me is when I am fitting linear models with a mixture of continuous explanatory variables and factors. Some people have an old-fashioned name for this situation—but really, repeating that name would be a get-burned-at-a-stake type of offence. It is useful when fitting linear models of this type to be able to see the behaviour of the response with respect to a continuous predictor for each combination of the levels of the factors. ggplot2 makes it fairly easy to produce this type of plot through its faceting mechanism. It also makes it really easy to add a fitted line with a pretty confidence interval to each facet. I should say at this point that this is not restricted to linear models, and in fact works for generalised linear models as well, and for semi-parametric models like smoothers. It is cool stuff.
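If you have never seen this in action, here is a minimal sketch of what I mean (the data frame dat and the variables x, y and f are placeholders, not anything I use later in this post):

library(ggplot2)

## one panel per level of the factor f, each with its own least-squares fit
## and a shaded confidence band courtesy of geom_smooth
ggplot(dat, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm") +
  facet_wrap(~f)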

I want a pony, and I want it NOW Daddy

This is, as I have said, made easy in ggplot2, and half an hour of Googling will get you to the point where you can do it with your data. However, I found myself with the following problem. I had a situation where there was a suggestion that an interaction might be significant, and so I wanted to explore visually how the fitted models differed with and without the interaction. You often find yourself in this situation: tests suggest the interactions are significant, only for you to find that the effect is driven by one combination of the factors or, even worse, by a single outlier. Making things explicit, in this problem I was interested in the difference between this model:

$$ Y \sim X * F_1 * F_2 $$

and this model

$$Y \sim X * F_1 + F_2$$

These models are expressed in standard Wilkinson and Rogers notation, where \(Y\) and \(X\) are continuous, and \(F_1\) and \(F_2\) are factors. However, the same issues arise if we are interested in

$$ Y \sim X * F $$

and this model

$$ Y \sim X + F $$

The only difference is that the different levels of the factor \(F\) can be represented on a single row—if there are not too many levels—or in a wrapped facet plot if there are quite a few levels.
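In ggplot2 terms (again with placeholder names: p0 is some scatter plot and f is the factor), those two layouts correspond to facet_grid and facet_wrap respectively:

## all the levels of f across a single row of panels
p0 + facet_grid(. ~ f)

## panels wrapped over several rows when f has many levels
p0 + facet_wrap(~f)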

We need some data

The data in this post come from Hand, D.J., Daly, F., Lunn, A.D., McConway, K.J. & Ostrowski, E. (1994). A Handbook of Small Data Sets. Boca Raton, Florida: Chapman and Hall/CRC. The data relate mortality in UK towns and cities—measured as deaths per 100,000 head of population—to the hardness of the water—measured by calcium concentration (in parts per million, ppm). The original authors also had the theory that this relationship changed when the town or city was in “The North.” I have always found the English concept of “Up North” rather bizarre. As you leave the outskirts of London you start seeing road signs telling you how far it is to “The North.” In this data set, “The North” is defined as being north of the English city of Derby. The data are available here: water.csv.

water.df = read.csv("water.csv")

It is worth having a quick look at the summary data:

library(skimr)
skim(water.df)
## Skim summary statistics
##  n obs: 61 
##  n variables: 4 
## 
## Variable type: factor 
##  variable missing complete  n n_unique                     top_counts
##      town       0       61 61       61 Bat: 1, Bir: 1, Bir: 1, Bla: 1
##  ordered
##    FALSE
## 
## Variable type: integer 
##   variable missing complete  n    mean     sd   p0  p25  p50  p75 p100
##    calcium       0       61 61   47.18  38.09    5   14   39   75  138
##  mortality       0       61 61 1524.15 187.67 1096 1379 1555 1668 1987
##      north       0       61 61    0.57   0.5     0    0    1    1    1

We can see from this that our factor north is not being treated as such. I will recode it, and since we are using ggplot2 we might as well go into the tidyverse too. I think I am going to violate a Hadley principle by overwriting the existing data, but I can live with this.

library(tidyverse)
water.df = water.df %>% 
  mutate(north = factor(ifelse(north == 1, "north", "south")))

The initial plot

I am going to plot mortality as a function of calcium, and I am going to facet (if that is a verb) by our factor north. There is a good argument that, because both variables are rates (deaths per 100,000 and parts per million), we should work on a log-log scale. However, given this is not a post on regression analysis, I will not make this transformation. (Spoiler: there is no evidence that the factor makes any difference on that scale, which kind of makes it useless for showing what I want to show ☺)
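For what it is worth, if you did want the log-log version, it only needs a change of scales rather than a change to the data. Something like this sketch would do it (I do not use it below):

## log-log version of the plot; ggplot2 is attached as part of the tidyverse above
water.df %>% 
  ggplot(aes(x = calcium, y = mortality)) + 
  geom_point() +
  scale_x_log10() +
  scale_y_log10() +
  facet_wrap(~north)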

library(ggplot2)
p = water.df %>% 
  ggplot(aes(x = calcium, y = mortality)) + geom_point()

p = p + facet_wrap(~north)
p
Figure 1: An initial plot of the data

It looks like there is evidence of a shift, but is there evidence of a difference in slopes?

water.fit = lm(mortality ~ calcium * north, data = water.df)
summary(water.fit)

## 
## Call:
## lm(formula = mortality ~ calcium * north, data = water.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -221.27  -75.45    8.52   87.48  310.14 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1692.3128    32.2005  52.555   <2e-16 ***
## calcium              -1.9313     0.8081  -2.390   0.0202 *  
## northsouth         -169.4978    58.5916  -2.893   0.0054 ** 
## calcium:northsouth   -0.1614     1.0127  -0.159   0.8740    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 123.2 on 57 degrees of freedom
## Multiple R-squared:  0.5909, Adjusted R-squared:  0.5694 
## F-statistic: 27.44 on 3 and 57 DF,  p-value: 4.102e-11

Ignoring all model checking (it ain’t so bad, STATS 201x kids), there does not appear to be much support for the different-slopes model. However, there is support for the different-intercepts model. I would like to explore this visually.
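If you prefer a single test to reading the coefficient table, the same conclusion comes from comparing the interaction model with its additive counterpart directly (a quick aside rather than part of the workflow below; with a single extra coefficient this F-test is equivalent to the t-test on the interaction term above):

## drop the interaction term and compare the two fits with the usual F-test
anova(update(water.fit, . ~ . - calcium:north), water.fit)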

So about those lines…

What I would like to do is place the fitted line from each model on each facet of the plot. In addition, I would like a confidence interval to be drawn around each line. There might be a way to do this with geom_smooth, and I would be happy to hear about it. My solution involves three steps:

  1. Fitting each of the models, and using predict to get the confidence intervals
  2. Adding the fitted line from each model to each facet
  3. Adding the confidence intervals from each model to each facet
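On the geom_smooth point: as far as I can tell, geom_smooth(method = "lm") fits a separate regression within each facet, so adding it to p essentially reproduces the different-slopes fit, but I do not see an obvious way to also overlay the additive model, which is why I go through the three steps above.

## per-facet least-squares fits only; close in spirit to the interaction model,
## but with no obvious way to also show the additive model
p + geom_smooth(method = "lm")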

Step 1—getting the data for the confidence intervals

This step involves, as previously mentioned, using the predict function (in fact predict.lm). This should be fairly straightforward. The key here is to add the interval information and the fitted values either to a new data frame or to our existing data frame. I will opt for the latter.

water.fit1 = water.fit ## interaction model
water.fit2 = lm(mortality ~ calcium + north, data = water.df)

water.pred1 = predict(water.fit1, water.df, interval = "confidence")
water.pred2 = predict(water.fit2, water.df, interval = "confidence")

water.df = water.df %>% 
  mutate(fit1 = water.pred1[,"fit"],
         lwr1 = water.pred1[,"lwr"],
         upr1 = water.pred1[,"upr"],
         fit2 = water.pred2[,"fit"],
         lwr2 = water.pred2[,"lwr"],
         upr2 = water.pred2[,"upr"])

Step 2—Adding the fitted lines

Remember that our plot is stored in the variable p. We will add the fitted lines using the geom_line function. We could have done this with geom_abline and just the coefficients; however, that would have made the method less flexible, because it could not accommodate, for example, a simple quadratic model.

p = p + geom_line(data = water.df, 
                  mapping = aes(x = calcium, y = fit1), 
                  col = "blue", size = 1.2)
p = p + geom_line(data = water.df, 
                  mapping = aes(x = calcium, y = fit2), 
                  col = "red", size = 1.2)
p
Figure 2: Adding the lines

We can see already the lack of support for the different slopes model, however, let’s add the confidence intervals.

Step 3—Adding the confidence intervals

We add the confidence intervals by using the geom_ribbon function. The aesthetic for geom_ribbon requires two sets of y-values, ymin and ymax. These, clearly, are the values we calculated for each of the confidence intervals. We set the colour of the bands with fill, while col controls the border colour of the ribbon. We make the ribbons transparent (so they do not obscure our fitted lines) by setting an alpha value closer to zero than to one (0.2 seems to be a good choice).

p = p + geom_ribbon(data = water.df, 
                    mapping = aes(x = calcium, ymin = lwr1, ymax = upr1),
                    fill = "blue", alpha = 0.2)
p = p + geom_ribbon(data = water.df, 
                    mapping = aes(x = calcium, ymin = lwr2, ymax = upr2),
                    fill = "red", alpha = 0.2)
p
Figure 3: Lines with confidence intervals (as promised)

As expected the intervals completely overlap, thus supporting our decision to choose the simpler model.

Well that sucked

What if there really were a slope effect? I have artificially added one and redrawn the plot, just so we can see what happens when there really is a difference.

## just so we all get the same result
set.seed(123) 

## get the original coefficient
b = coef(water.fit1) 

## and alter the effect for the interaction
b[4] = -5 # make a big difference to the slopes

## get the standard deviation of the residuals
sigma = summary(water.fit1)$sigma

## and use it to make a new set of responses
## b[1] = intercept, b[2] = calcium slope, b[3] = south intercept shift, b[4] = south slope shift
water.df = water.df %>% 
  mutate(mortality = (b[1] + (north == "south") * b[3]) +
           (b[2] + (north == "south") * b[4]) * calcium +
           rnorm(61, 0, sigma))

## Now we repeat all the steps above
water.fit1 = lm(mortality ~ calcium * north, data = water.df) ## interaction model
water.fit2 = lm(mortality ~ calcium + north, data = water.df)

water.pred1 = predict(water.fit1, water.df, interval = "confidence")
water.pred2 = predict(water.fit2, water.df, interval = "confidence")

water.df = water.df %>% 
  mutate(fit1 = water.pred1[,"fit"],
         lwr1 = water.pred1[,"lwr"],
         upr1 = water.pred1[,"upr"],
         fit2 = water.pred2[,"fit"],
         lwr2 = water.pred2[,"lwr"],
         upr2 = water.pred2[,"upr"])

p = water.df %>% 
  ggplot(aes(x = calcium, y = mortality)) + geom_point()

p = p + facet_wrap(~north)
p = p + geom_line(data = water.df, 
                  mapping = aes(x = calcium, y = fit1), 
                  col = "blue", size = 1.2)
p = p + geom_line(data = water.df, 
                  mapping = aes(x = calcium, y = fit2), 
                  col = "red", size = 1.2)
p = p + geom_ribbon(data = water.df, 
                    mapping = aes(x = calcium, ymin = lwr1, ymax = upr1),
                    fill = "blue", alpha = 0.2)
p = p + geom_ribbon(data = water.df, 
                    mapping = aes(x = calcium, ymin = lwr2, ymax = upr2),
                    fill = "red", alpha = 0.2)
p
Figure 4: Data with a real interaction

One final note

I suspect, but have not checked properly, that when geom_smooth is used the fitted lines on each facet are derived from the subset of data in that facet, and therefore the standard errors might be slightly larger (because less data is used to estimate \(\sigma\)). However, I am happy to be corrected (and if I am wrong I will remove this section :-))
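One way to check (just a sketch, using the objects from the simulated example above) would be to fit the northern towns on their own, which is what I believe geom_smooth does within that facet, and compare the interval widths with those from the pooled interaction model:

## fit to the northern towns only, as geom_smooth presumably does within that facet
north.df  = subset(water.df, north == "north")
north.fit = lm(mortality ~ calcium, data = north.df)

## average confidence interval width at the observed calcium values
w.facet  = predict(north.fit, north.df, interval = "confidence")
w.pooled = predict(water.fit1, north.df, interval = "confidence")
mean(w.facet[, "upr"] - w.facet[, "lwr"])
mean(w.pooled[, "upr"] - w.pooled[, "lwr"])

If the per-facet intervals come out wider, that would back up the suspicion; either way, the predict approach above sidesteps the question.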
