Those of you who know me well know I am not really a
ggplot2 type of guy. Not because I do not think that it is good work—I would be an idiot to claim that probably the most downloaded package in the history of CRAN is not good—but because I predate both Hadley and
ggplot2 (yes I admit it, I am a dinosaur now) I can do most of what I want in base R. However, there are times in exploratory data analysis where going the
ggplot2 route makes more sense. One situation where that happens for me is when I am fitting linear models with a mixture of continuous explanatory variables and factors. Some people have an old-fashioned name for this situation—but really repeating that name would be a get-burned-at-a-stake type of offence. It is useful when fitting linear models of this type to be able to see the behaviour of the response with respect to a continuous predictor for each combination of the levels of the factors.
ggplot2 makes it fairly easy to produce this type of plot through its faceting mechanism. It also makes it really to add a fitted line with a pretty confidence interval to each facet. I should say at this point that this is not restricted to linear models, and in fact works for generalised linear models as well, and for semi-parametric models like smoothers. It is cool stuff.
I want a pony, and I want it NOW Daddy
This is, as I have said, made easy to do in
ggplot2 and a half hour of Googling will get you to the point where you can do it with your data. However, I found myself with the following problem. I had a situation where there was a suggestion that an interaction might be significant and so I wanted to explore visually how the fitted models differed with and without interaction. You often find yourself in this situation with tests suggesting the interactions are significant only to find that it is driven by one combination of the factors, or even worse, by a single outlier. Making things explicit, in this problem I was interested in the difference between this model:
$$ Y \sim X * F_1 * F_2 $$
and this model
$$Y \sim X * F_1 + F_2$$
These models are expressed in standard Wilkinson and Rogers notation, where \(Y\) and \(X\) are continuous, and \(F_1\) and \(F_2\) are factors. However, the same issues arise if we are interested in
$$ Y \sim X * F $$
and this model
$$ Y \sim X + F $$
The only difference is that the different levels of the factor \(F\) can be represented on a single row—if there are not too many levels—or in a wrapped facet plot if there are quite a few levels.
We need some data
The data in this post come from Hand, D.J., Daly, F., Lunn, A.D., McConway, K.J. & Ostrowski, E. (1994). A Handbook of Small Data Sets Boca Raton, Florida: Chapman and Hall/CRC. The data try to relate the mortality in UK towns and cities—measured as deaths per 100,000 head of population—to the hardness of the water—measured by calcium concentration (in parts-per-million = ppm). The original authors also had the theory that this relationship changed when the town or city was in “The North.” I have always found the English concept of “Up North” rather bizarre. As you leave the outskirts of London you start seeing road signs telling you how far it is to “The North.” In this data set, “The North” is defined as being north of the English city of Derby. The data is available here: water.csv.
water.df = read.csv("water.csv")
It is worth having a quick look at the summary data
## Skim summary statistics ## n obs: 61 ## n variables: 4 ## ## Variable type: factor ## variable missing complete n n_unique top_counts ## town 0 61 61 61 Bat: 1, Bir: 1, Bir: 1, Bla: 1 ## ordered ## FALSE ## ## Variable type: integer ## variable missing complete n mean sd p0 p25 p50 p75 p100 ## calcium 0 61 61 47.18 38.09 5 14 39 75 138 ## mortality 0 61 61 1524.15 187.67 1096 1379 1555 1668 1987 ## north 0 61 61 0.57 0.5 0 0 1 1 1
We can see from this that our factor
north is not being treated as such. I will recode it, and since we are using
ggplot2 we might as well go into the
tidyverse too. I think I am going to violate a Hadley principle by overwriting the existing data, but I can live with this.
library(tidyverse) water.df = water.df %>% mutate(north = factor(ifelse(north == 1, "north", "south")))
The initial plot
I am going to plot
mortality as a function of
calcium and I am going to facet (if that is a verb) by our factor
north. There is a good argument to make that because both variables are rates (deaths per 100,000 and parts per million) then we should work on a log-log scale. However, given this is not a post on regression analysis, I will not make this transformation (Spoiler: there is no evidence that the factor makes any difference on this scale which kind of makes it useless to show what I want to show ☺)
library(ggplot2) p = water.df %>% ggplot(aes(x = calcium, y = mortality)) + geom_point() p = p + facet_wrap(~north) p
It looks like there is evidence of a shift, but is there evidence of a difference in slopes?
water.fit = lm(mortality ~ calcium * north, data = water.df) summary(water.fit)
## ## Call: ## lm(formula = mortality ~ calcium * north, data = water.df) ## ## Residuals: ## Min 1Q Median 3Q Max ## -221.27 -75.45 8.52 87.48 310.14 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 1692.3128 32.2005 52.555 <2e-16 *** ## calcium -1.9313 0.8081 -2.390 0.0202 * ## northsouth -169.4978 58.5916 -2.893 0.0054 ** ## calcium:northsouth -0.1614 1.0127 -0.159 0.8740 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 123.2 on 57 degrees of freedom ## Multiple R-squared: 0.5909, Adjusted R-squared: 0.5694 ## F-statistic: 27.44 on 3 and 57 DF, p-value: 4.102e-11
Ignoring all model checking (it ain’t so bad STATS 201x kids), there does not appear to be much support for the different slopes model. However there is for the different intercepts model. I would like to explore this visually.
So about those lines…
What I would like to do is to place the fitted lines from each model on each facet of the plot. In addition, I would like the confidence intervals for each line to be added to each line. There might be a way to do this with
geom_smooth, and I would be happy to hear about it. My solution involves three steps:
- Fitting each of the models, and using
predictto get the confidence intervals
- Adding the fitted line from each model to each facet
- Adding the confidence intervals from each model to each facet
Step 1—getting the data for the confidence intervals
This step involves, as previously mentioned, using the
predict function (in fact
predict.lm). This should be fairly straightforward. The key here is to add the interval information, and the fitted values to either a new data frame or our existing data frame. I will opt for the latter.
water.fit1 = water.fit ## interaction model water.fit2 = lm(mortality ~ calcium + north, data = water.df) water.pred1 = predict(water.fit1, water.df, interval = "confidence") water.pred2 = predict(water.fit2, water.df, interval = "confidence") water.df = water.df %>% mutate(fit1 = water.pred1[,"fit"], lwr1 = water.pred1[,"lwr"], upr1 = water.pred1[,"upr"], fit2 = water.pred2[,"fit"], lwr2 = water.pred2[,"lwr"], upr2 = water.pred2[,"upr"])
Step 2—Adding the fitted lines
Remember that our plot is stored in the variable
p. We will add the fitted lines using the
geom_line function. We could have done this with the
geom_abline and just the coefficients, however this would have made the method less flexible because we could not accomodate a simple quadratic model for example.
p = p + geom_line(data = water.df, mapping = aes(x = calcium, y = fit1), col = "blue", size = 1.2) p = p + geom_line(data = water.df, mapping = aes(x = calcium, y = fit2), col = "red", size = 1.2) p
We can see already the lack of support for the different slopes model, however, let’s add the confidence intervals.
Step 3—Adding the confidence intervals
We add the confidence intervals by using the
geom_ribbon function. The aesthetic for
geom_ribbon requires two sets of y-values,
ymax. These, clearly, are the values we calculated for each of the confidence intervals. We set the color of the bands by setting
col controls the border colour of the ribbon. We make these transparent (so they do not obscure our fitted lines), by setting an
alpha value closer to zero than to one (0.2 seems to be a good choice).
p = p + geom_ribbon(data = water.df, mapping = aes(x = calcium, ymin = lwr1, ymax = upr1), fill = "blue", alpha = 0.2) p = p + geom_ribbon(data = water.df, mapping = aes(x = calcium, ymin = lwr2, ymax = upr2), fill = "red", alpha = 0.2) p
As expected the intervals completely overlap, thus supporting our decision to choose the simpler model.
Well that sucked
How about if there really was a slope effect? I have artificially added one, and redrawn the plot just so we can see what happens when there really is a difference.
## just so we all get the same result set.seed(123) ## get the original coefficient b = coef(water.fit1) ## and alter the effect for the interaction b = -5 # make a big difference to the slopes ## get the standard deviation of the residuals sigma = summary(water.fit1)$sigma ## and use it to make a new set of responses water.df = water.df %>% mutate(mortality = (b + (north == 'north') * b) + (b + (north == 'north') * b) * calcium + rnorm(61, 0, sigma)) ## Now we repeat all the steps above water.fit1 = lm(mortality ~ calcium * north, data = water.df) ## interaction model water.fit2 = lm(mortality ~ calcium + north, data = water.df) water.pred1 = predict(water.fit1, water.df, interval = "confidence") water.pred2 = predict(water.fit2, water.df, interval = "confidence") water.df = water.df %>% mutate(fit1 = water.pred1[,"fit"], lwr1 = water.pred1[,"lwr"], upr1 = water.pred1[,"upr"], fit2 = water.pred2[,"fit"], lwr2 = water.pred2[,"lwr"], upr2 = water.pred2[,"upr"]) p = water.df %&>% ggplot(aes(x = calcium, y = mortality)) + geom_point() p = p + facet_wrap(~north) p = p + geom_line(data = water.df, mapping = aes(x = calcium, y = fit1), col = "blue", size = 1.2) p = p + geom_line(data = water.df, mapping = aes(x = calcium, y = fit2), col = "red", size = 1.2) p = p + geom_ribbon(data = water.df, mapping = aes(x = calcium, ymin = lwr1, ymax = upr1), fill = "blue", alpha = 0.2) p = p + geom_ribbon(data = water.df, mapping = aes(x = calcium, ymin = lwr2, ymax = upr2), fill = "red", alpha = 0.2) p
One final note
I suspect, but I have not checked properly, that when
geom_smooth is used, that the fitted lines on each facet are derived from the subset of data in that facet, therefore the standard errors might be slightly larger (because less data is used to estimate \(\sigma\). However, I am happy to be corrected (and if I am wrong I will remove this section :-))