Those of you who know me well know I am not really a `ggplot2` type of guy. Not because I do not think that it is good work—I would be an idiot to claim that probably the most downloaded package in the history of CRAN is not good—but because I predate both Hadley and `ggplot2` (yes, I admit it, I am a dinosaur now), and I can do most of what I want in base R. However, there are times in exploratory data analysis where going the `ggplot2` route makes more sense. One situation where that happens for me is when I am fitting linear models with a mixture of continuous explanatory variables and factors. Some people have an old-fashioned name for this situation—but really, repeating that name would be a get-burned-at-the-stake type of offence. It is useful when fitting linear models of this type to be able to see the behaviour of the response with respect to a continuous predictor for each combination of the levels of the factors. `ggplot2` makes it fairly easy to produce this type of plot through its faceting mechanism. It also makes it really easy to add a fitted line with a pretty confidence interval to each facet. I should say at this point that this is not restricted to linear models, and in fact works for generalised linear models as well, and for semi-parametric models like smoothers. It is cool stuff.

### I want a pony, and I want it NOW Daddy

This is, as I have said, made easy to do in `ggplot2`, and a half hour of Googling will get you to the point where you can do it with your data. However, I found myself with the following problem. I had a situation where there was a suggestion that an interaction might be significant, and so I wanted to explore visually how the fitted models differed with and without interaction. You often find yourself in this situation with tests suggesting the interactions are significant, only to find that it is driven by one combination of the factors, or even worse, by a single outlier. Making things explicit, in this problem I was interested in the difference between this model:

$$ Y \sim X * F_1 * F_2 $$

and this model

$$Y \sim X * F_1 + F_2$$

These models are expressed in standard Wilkinson and Rogers notation, where \(Y\) and \(X\) are continuous, and \(F_1\) and \(F_2\) are factors. However, the same issues arise if we are interested in

$$ Y \sim X * F $$

and this model

$$ Y \sim X + F $$

The only difference is that the different levels of the factor \(F\) can be represented on a single row—if there are not too many levels—or in a wrapped facet plot if there are quite a few levels.
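
The practical difference between the two formulas is easy to see in the coefficients R estimates. Here is a minimal sketch on simulated data (the data frame `toy` and the fit names are mine, purely for illustration): `Y ~ X * F` gets a separate slope for each level of \(F\), while `Y ~ X + F` shares one slope across levels.

```
## Toy illustration: interaction vs additive formula for one factor.
## With a two-level factor F, Y ~ X * F estimates 4 coefficients
## (intercept, slope, intercept shift, slope shift), while
## Y ~ X + F estimates 3 (no slope shift).
set.seed(1)
toy = data.frame(X = runif(40),
                 F = factor(rep(c("a", "b"), each = 20)))
toy$Y = 1 + 2 * toy$X + 0.5 * (toy$F == "b") + rnorm(40, 0, 0.1)

fit.int = lm(Y ~ X * F, data = toy)
fit.add = lm(Y ~ X + F, data = toy)

names(coef(fit.int)) ## "(Intercept)" "X" "Fb" "X:Fb"
names(coef(fit.add)) ## "(Intercept)" "X" "Fb"
```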

### We need some data

The data in this post come from Hand, D.J., Daly, F., Lunn, A.D., McConway, K.J. & Ostrowski, E. (1994). *A Handbook of Small Data Sets* Boca Raton, Florida: Chapman and Hall/CRC. The data try to relate the mortality in UK towns and cities—measured as deaths per 100,000 head of population—to the hardness of the water—measured by calcium concentration (in parts-per-million = ppm). The original authors also had the theory that this relationship changed when the town or city was in “The North.” I have always found the English concept of “Up North” rather bizarre. As you leave the outskirts of London you start seeing road signs telling you how far it is to “The North.” In this data set, “The North” is defined as being north of the English city of Derby. The data is available here: water.csv.

```
water.df = read.csv("water.csv")
```

It is worth having a quick look at the summary data

```
library(skimr)
skim(water.df)
```

```
## Skim summary statistics
## n obs: 61
## n variables: 4
##
## Variable type: factor
## variable missing complete n n_unique top_counts
## town 0 61 61 61 Bat: 1, Bir: 1, Bir: 1, Bla: 1
## ordered
## FALSE
##
## Variable type: integer
## variable missing complete n mean sd p0 p25 p50 p75 p100
## calcium 0 61 61 47.18 38.09 5 14 39 75 138
## mortality 0 61 61 1524.15 187.67 1096 1379 1555 1668 1987
## north 0 61 61 0.57 0.5 0 0 1 1 1
```

We can see from this that our factor `north` is not being treated as such. I will recode it, and since we are using `ggplot2` we might as well go into the `tidyverse` too. I think I am going to violate a Hadley principle by overwriting the existing data, but I can live with this.

```
library(tidyverse)
water.df = water.df %>%
  mutate(north = factor(ifelse(north == 1, "north", "south")))
```

### The initial plot

I am going to plot `mortality` as a function of `calcium`, and I am going to facet (if that is a verb) by our factor `north`. There is a good argument to be made that because both variables are rates (deaths per 100,000 and parts per million) we should work on a log-log scale. However, given this is not a post on regression analysis, I will not make this transformation (Spoiler: there is no evidence that the factor makes any difference on this scale, which kind of makes it useless to show what I want to show ☺).

```
library(ggplot2)
p = water.df %>%
  ggplot(aes(x = calcium, y = mortality)) +
  geom_point()
p = p + facet_wrap(~north)
p
```

It looks like there is evidence of a shift, but is there evidence of a difference in slopes?

```
water.fit = lm(mortality ~ calcium * north, data = water.df)
summary(water.fit)
```

```
##
## Call:
## lm(formula = mortality ~ calcium * north, data = water.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -221.27 -75.45 8.52 87.48 310.14
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1692.3128 32.2005 52.555 <2e-16 ***
## calcium -1.9313 0.8081 -2.390 0.0202 *
## northsouth -169.4978 58.5916 -2.893 0.0054 **
## calcium:northsouth -0.1614 1.0127 -0.159 0.8740
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 123.2 on 57 degrees of freedom
## Multiple R-squared: 0.5909, Adjusted R-squared: 0.5694
## F-statistic: 27.44 on 3 and 57 DF, p-value: 4.102e-11
```

Ignoring all model checking (it ain’t so bad STATS 201x kids), there does not appear to be much support for the *different slopes* model. However, there is support for the *different intercepts* model. I would like to explore this visually.
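
As an aside, with a single interaction coefficient the t test in the summary is equivalent to the nested-model F test you get from `anova()`. A self-contained sketch on simulated data (all object names here are mine; for the real data the call would simply compare the additive and interaction fits to `water.df`):

```
## Nested-model comparison: the F statistic for dropping the single
## interaction term is the square of its t statistic in summary().
set.seed(11)
d = data.frame(calcium = runif(61, 5, 138),
               north = factor(rep(c("north", "south"), length.out = 61)))
d$mortality = 1700 - 2 * d$calcium - 170 * (d$north == "south") +
  rnorm(61, 0, 120)

additive = lm(mortality ~ calcium + north, data = d)
full     = lm(mortality ~ calcium * north, data = d)

anova(additive, full)  ## F test for the interaction term
tval = summary(full)$coefficients["calcium:northsouth", "t value"]
all.equal(anova(additive, full)$F[2], tval^2)  ## TRUE
```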

### So about those lines…

What I would like to do is to place the fitted lines from each model on each facet of the plot. In addition, I would like the confidence interval for each fitted line to be added to the plot. There might be a way to do this with `geom_smooth`, and I would be happy to hear about it. My solution involves three steps:

- Fitting each of the models, and using `predict` to get the confidence intervals
- Adding the fitted line from each model to each facet
- Adding the confidence intervals from each model to each facet

#### Step 1—getting the data for the confidence intervals

This step involves, as previously mentioned, using the `predict` function (in fact `predict.lm`). This should be fairly straightforward. The key here is to add the interval information and the fitted values to either a new data frame or our existing data frame. I will opt for the latter.
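
It is worth seeing exactly what `predict.lm` hands back when you ask for an interval: a matrix, not a data frame, with columns `fit`, `lwr` and `upr`. A toy sketch (names are illustrative):

```
## predict.lm with interval = "confidence" returns a matrix whose
## columns are the fitted value and the interval limits.
set.seed(2)
d = data.frame(x = runif(20), g = factor(rep(c("a", "b"), 10)))
d$y = 1 + 2 * d$x + 0.5 * (d$g == "b") + rnorm(20, 0, 0.1)

fit = lm(y ~ x * g, data = d)
ci = predict(fit, d, interval = "confidence")

colnames(ci)  ## "fit" "lwr" "upr"
```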

```
water.fit1 = water.fit ## interaction model
water.fit2 = lm(mortality ~ calcium + north, data = water.df)
water.pred1 = predict(water.fit1, water.df, interval = "confidence")
water.pred2 = predict(water.fit2, water.df, interval = "confidence")
water.df = water.df %>%
  mutate(fit1 = water.pred1[, "fit"],
         lwr1 = water.pred1[, "lwr"],
         upr1 = water.pred1[, "upr"],
         fit2 = water.pred2[, "fit"],
         lwr2 = water.pred2[, "lwr"],
         upr2 = water.pred2[, "upr"])
```

#### Step 2—Adding the fitted lines

Remember that our plot is stored in the variable `p`. We will add the fitted lines using the `geom_line` function. We could have done this with `geom_abline` and just the coefficients; however, this would have made the method less flexible, because we could not accommodate a simple quadratic model, for example.

```
p = p + geom_line(data = water.df, mapping = aes(x = calcium, y = fit1),
                  col = "blue", size = 1.2)
p = p + geom_line(data = water.df, mapping = aes(x = calcium, y = fit2),
                  col = "red", size = 1.2)
p
```

We can see already the lack of support for the *different slopes* model, however, let’s add the confidence intervals.

#### Step 3—Adding the confidence intervals

We add the confidence intervals by using the `geom_ribbon` function. The aesthetic for `geom_ribbon` requires two sets of y-values, `ymin` and `ymax`. These, clearly, are the values we calculated for each of the confidence intervals. We set the colour of the bands by setting `fill`, while setting `col` controls the border colour of the ribbon. We make the bands transparent (so they do not obscure our fitted lines) by setting an `alpha` value closer to zero than to one (0.2 seems to be a good choice).

```
p = p + geom_ribbon(data = water.df,
                    mapping = aes(x = calcium, ymin = lwr1, ymax = upr1),
                    fill = "blue", alpha = 0.2)
p = p + geom_ribbon(data = water.df,
                    mapping = aes(x = calcium, ymin = lwr2, ymax = upr2),
                    fill = "red", alpha = 0.2)
p
```

As expected the intervals completely overlap, thus supporting our decision to choose the simpler model.

### Well that sucked

How about if there really was a slope effect? I have artificially added one, and redrawn the plot just so we can see what happens when there really is a difference.

```
## just so we all get the same result
set.seed(123)
## get the original coefficients
b = coef(water.fit1)
## and alter the effect for the interaction
b[4] = -5 # make a big difference to the slopes
## get the standard deviation of the residuals
sigma = summary(water.fit1)$sigma
## and use it to make a new set of responses
## (b[3] is the northsouth intercept shift and b[4] the slope shift,
## so the indicator needs to match the "south" level)
water.df = water.df %>%
  mutate(mortality = (b[1] + (north == "south") * b[3]) +
           (b[2] + (north == "south") * b[4]) * calcium +
           rnorm(61, 0, sigma))

## Now we repeat all the steps above
water.fit1 = lm(mortality ~ calcium * north, data = water.df) ## interaction model
water.fit2 = lm(mortality ~ calcium + north, data = water.df)
water.pred1 = predict(water.fit1, water.df, interval = "confidence")
water.pred2 = predict(water.fit2, water.df, interval = "confidence")
water.df = water.df %>%
  mutate(fit1 = water.pred1[, "fit"],
         lwr1 = water.pred1[, "lwr"],
         upr1 = water.pred1[, "upr"],
         fit2 = water.pred2[, "fit"],
         lwr2 = water.pred2[, "lwr"],
         upr2 = water.pred2[, "upr"])

p = water.df %>%
  ggplot(aes(x = calcium, y = mortality)) +
  geom_point()
p = p + facet_wrap(~north)
p = p + geom_line(data = water.df, mapping = aes(x = calcium, y = fit1),
                  col = "blue", size = 1.2)
p = p + geom_line(data = water.df, mapping = aes(x = calcium, y = fit2),
                  col = "red", size = 1.2)
p = p + geom_ribbon(data = water.df,
                    mapping = aes(x = calcium, ymin = lwr1, ymax = upr1),
                    fill = "blue", alpha = 0.2)
p = p + geom_ribbon(data = water.df,
                    mapping = aes(x = calcium, ymin = lwr2, ymax = upr2),
                    fill = "red", alpha = 0.2)
p
```

### One final note

I suspect, but I have not checked properly, that when `geom_smooth` is used, the fitted lines on each facet are derived from the subset of the data in that facet, and therefore the standard errors might be slightly larger (because less data is used to estimate \(\sigma\)). However, I am happy to be corrected (and if I am wrong I will remove this section :-)).
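
A quick way to test this suspicion without reading the `ggplot2` source is to note that fitting within each facet is the same as running `lm` separately on each subset, and then comparing residual standard errors. A self-contained sketch on simulated data (all names are mine):

```
## If geom_smooth fits within each facet, each facet's sigma is
## estimated from only that subset's residual degrees of freedom,
## rather than from the pooled interaction model.
set.seed(7)
d = data.frame(x = runif(60, 0, 100),
               f = factor(rep(c("north", "south"), each = 30)))
d$y = 1500 - 2 * d$x - 150 * (d$f == "south") + rnorm(60, 0, 100)

## Pooled interaction model: one sigma from 60 - 4 = 56 residual df
sigma.pooled = summary(lm(y ~ x * f, data = d))$sigma

## Separate per-facet fits: each sigma from 30 - 2 = 28 residual df
sigma.by.facet = sapply(split(d, d$f),
                        function(s) summary(lm(y ~ x, data = s))$sigma)

sigma.pooled
sigma.by.facet
```

Note that the fitted lines themselves agree between the two approaches (the interaction model is saturated in the factor); it is only the \(\sigma\) estimates, and hence the interval widths, that can differ.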