If you are hoping for blinding insight in this post, then I think you better stop reading now. A friend asked me how to show him how to display the observation labels (or class labels) in a PCA biplot. Given how long prcomp has been around, this is hardly new information. It might now even be a feature of plot.prcomp. However, for posterity, and for the searchers, I will give some simple solutions. I will, for the sake of pedagogy, even stop effing and blinding about how the Fisher’s Iris data set deserves to be left to crumble into a distant memory of when we used to do computing with a Jacquard Loom.

I will make use of tidyverse functions in this example, because kids these days. I would not worry about it too much. It just makes some of the data manipulation slightly more transparent. Firstly I will load the data. The Iris data set is an internal R data set so the data command will do it.

Performing the PCA itself is pretty straightforward. I have chosen to scale the variables in this example for no particular reason. The post is not about PCA so it does not matter too much.

pc = prcomp(iris.data, scale. = TRUE)

The scores—the projected values of the observations—are stored in a matrix called pc$x. First we will produce a plot with just observation number.

plot(pc$x[,1:2], type = 'n')
text(pc$x[,1:2], labels = 1:nrow(iris.data))

It would be nice to put the species labels on instead of numbers. The species names are rather long though, so I am going to recode them

plot(pc$x[,1:2], type = 'n')
text(pc$x[,1:2], labels = 1:nrow(iris.data))

Hey you promised us ggplot2 ya bum!

Okay, okay. To do this we need to coerce the scores into a data.frame. The nice thing about doing this is that the principal components are conveniently labelled PC1, PC2, etc. This makes the mapping fairly easy.

library(ggplot2)
pcs = data.frame(pc$x)
p = pcs %>%
ggplot(aes(x = PC1,
y = PC2,
label = iris.labels)) +
geom_text()
p

Those of you who know me well know I am not really a ggplot2 type of guy. Not because I do not think that it is good work—I would be an idiot to claim that probably the most downloaded package in the history of CRAN is not good—but because I predate both Hadley and ggplot2 (yes I admit it, I am a dinosaur now) I can do most of what I want in base R. However, there are times in exploratory data analysis where going the ggplot2 route makes more sense. One situation where that happens for me is when I am fitting linear models with a mixture of continuous explanatory variables and factors. Some people have an old-fashioned name for this situation—but really repeating that name would be a get-burned-at-a-stake type of offence. It is useful when fitting linear models of this type to be able to see the behaviour of the response with respect to a continuous predictor for each combination of the levels of the factors. ggplot2 makes it fairly easy to produce this type of plot through its faceting mechanism. It also makes it really to add a fitted line with a pretty confidence interval to each facet. I should say at this point that this is not restricted to linear models, and in fact works for generalised linear models as well, and for semi-parametric models like smoothers. It is cool stuff.

I want a pony, and I want it NOW Daddy

This is, as I have said, made easy to do in ggplot2 and a half hour of Googling will get you to the point where you can do it with your data. However, I found myself with the following problem. I had a situation where there was a suggestion that an interaction might be significant and so I wanted to explore visually how the fitted models differed with and without interaction. You often find yourself in this situation with tests suggesting the interactions are significant only to find that it is driven by one combination of the factors, or even worse, by a single outlier. Making things explicit, in this problem I was interested in the difference between this model:

$$ Y \sim X * F_1 * F_2 $$

and this model

$$Y \sim X * F_1 + F_2$$

These models are expressed in standard Wilkinson and Rogers notation, where \(Y\) and \(X\) are continuous, and \(F_1\) and \(F_2\) are factors. However, the same issues arise if we are interested in

$$ Y \sim X * F $$

and this model
$$ Y \sim X + F $$

The only difference is that the different levels of the factor \(F\) can be represented on a single row—if there are not too many levels—or in a wrapped facet plot if there are quite a few levels.

We need some data

The data in this post come from Hand, D.J., Daly, F., Lunn, A.D., McConway, K.J. & Ostrowski, E. (1994). A Handbook of Small Data Sets Boca Raton, Florida: Chapman and Hall/CRC. The data try to relate the mortality in UK towns and cities—measured as deaths per 100,000 head of population—to the hardness of the water—measured by calcium concentration (in parts-per-million = ppm). The original authors also had the theory that this relationship changed when the town or city was in “The North.” I have always found the English concept of “Up North” rather bizarre. As you leave the outskirts of London you start seeing road signs telling you how far it is to “The North.” In this data set, “The North” is defined as being north of the English city of Derby. The data is available here: water.csv.

water.df = read.csv("water.csv")

It is worth having a quick look at the summary data

We can see from this that our factor north is not being treated as such. I will recode it, and since we are using ggplot2 we might as well go into the tidyverse too. I think I am going to violate a Hadley principle by overwriting the existing data, but I can live with this.

I am going to plot mortality as a function of calcium and I am going to facet (if that is a verb) by our factor north. There is a good argument to make that because both variables are rates (deaths per 100,000 and parts per million) then we should work on a log-log scale. However, given this is not a post on regression analysis, I will not make this transformation (Spoiler: there is no evidence that the factor makes any difference on this scale which kind of makes it useless to show what I want to show ☺)

library(ggplot2)
p = water.df %>%
ggplot(aes(x = calcium, y = mortality)) + geom_point()
p = p + facet_wrap(~north)
p

It looks like there is evidence of a shift, but is there evidence of a difference in slopes?

Ignoring all model checking (it ain’t so bad STATS 201x kids), there does not appear to be much support for the different slopes model. However there is for the different intercepts model. I would like to explore this visually.

So about those lines…

What I would like to do is to place the fitted lines from each model on each facet of the plot. In addition, I would like the confidence intervals for each line to be added to each line. There might be a way to do this with geom_smooth, and I would be happy to hear about it. My solution involves three steps:

Fitting each of the models, and using predict to get the confidence intervals

Adding the fitted line from each model to each facet

Adding the confidence intervals from each model to each facet

Step 1—getting the data for the confidence intervals

This step involves, as previously mentioned, using the predict function (in fact predict.lm). This should be fairly straightforward. The key here is to add the interval information, and the fitted values to either a new data frame or our existing data frame. I will opt for the latter.

Remember that our plot is stored in the variable p. We will add the fitted lines using the geom_line function. We could have done this with the geom_abline and just the coefficients, however this would have made the method less flexible because we could not accomodate a simple quadratic model for example.

p = p + geom_line(data = water.df,
mapping = aes(x = calcium, y = fit1),
col = "blue", size = 1.2)
p = p + geom_line(data = water.df,
mapping = aes(x = calcium, y = fit2),
col = "red", size = 1.2)
p

We can see already the lack of support for the different slopes model, however, let’s add the confidence intervals.

Step 3—Adding the confidence intervals

We add the confidence intervals by using the geom_ribbon function. The aesthetic for geom_ribbon requires two sets of y-values, ymin and ymax. These, clearly, are the values we calculated for each of the confidence intervals. We set the color of the bands by setting fill. Setting col controls the border colour of the ribbon. We make these transparent (so they do not obscure our fitted lines), by setting an alpha value closer to zero than to one (0.2 seems to be a good choice).

p = p + geom_ribbon(data = water.df,
mapping = aes(x = calcium, ymin = lwr1, ymax = upr1),
fill = "blue", alpha = 0.2)
p = p + geom_ribbon(data = water.df,
mapping = aes(x = calcium, ymin = lwr2, ymax = upr2),
fill = "red", alpha = 0.2)
p

As expected the intervals completely overlap, thus supporting our decision to choose the simpler model.

Well that sucked

How about if there really was a slope effect? I have artificially added one, and redrawn the plot just so we can see what happens when there really is a difference.

## just so we all get the same result
set.seed(123)
## get the original coefficient
b = coef(water.fit1)
## and alter the effect for the interaction
b[4] = -5 # make a big difference to the slopes
## get the standard deviation of the residuals
sigma = summary(water.fit1)$sigma
## and use it to make a new set of responses
water.df = water.df %>%
mutate(mortality = (b[1] + (north == 'north') * b[2]) + (b[2] + (north == 'north') * b[4]) * calcium + rnorm(61, 0, sigma))
## Now we repeat all the steps above
water.fit1 = lm(mortality ~ calcium * north, data = water.df) ## interaction model
water.fit2 = lm(mortality ~ calcium + north, data = water.df)
water.pred1 = predict(water.fit1, water.df, interval = "confidence")
water.pred2 = predict(water.fit2, water.df, interval = "confidence")
water.df = water.df %>%
mutate(fit1 = water.pred1[,"fit"],
lwr1 = water.pred1[,"lwr"],
upr1 = water.pred1[,"upr"],
fit2 = water.pred2[,"fit"],
lwr2 = water.pred2[,"lwr"],
upr2 = water.pred2[,"upr"])
p = water.df %&>%
ggplot(aes(x = calcium, y = mortality)) + geom_point()
p = p + facet_wrap(~north)
p = p + geom_line(data = water.df,
mapping = aes(x = calcium, y = fit1),
col = "blue", size = 1.2)
p = p + geom_line(data = water.df,
mapping = aes(x = calcium, y = fit2),
col = "red", size = 1.2)
p = p + geom_ribbon(data = water.df,
mapping = aes(x = calcium, ymin = lwr1, ymax = upr1),
fill = "blue", alpha = 0.2)
p = p + geom_ribbon(data = water.df,
mapping = aes(x = calcium, ymin = lwr2, ymax = upr2),
fill = "red", alpha = 0.2)
p

One final note

I suspect, but I have not checked properly, that when geom_smooth is used, that the fitted lines on each facet are derived from the subset of data in that facet, therefore the standard errors might be slightly larger (because less data is used to estimate \(\sigma\). However, I am happy to be corrected (and if I am wrong I will remove this section :-))

A blog about statistics, computing, and other things that interest me