Producing a biplot with labels

If you are hoping for blinding insight in this post, then I think you had better stop reading now. A friend asked me to show him how to display the observation labels (or class labels) in a PCA biplot. Given how long prcomp has been around, this is hardly new information. It might now even be a feature of plot.prcomp. However, for posterity, and for the searchers, I will give some simple solutions. I will, for the sake of pedagogy, even stop effing and blinding about how Fisher’s Iris data set deserves to be left to crumble into a distant memory of when we used to do computing with a Jacquard loom.

I will make use of tidyverse functions in this example, because kids these days. I would not worry about it too much. It just makes some of the data manipulation slightly more transparent. Firstly I will load the data. The Iris data set is an internal R data set so the data command will do it.

data(iris)
names(iris)
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width" 
## [5] "Species"

Now I will pull out the labels into a separate vector:

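library(dplyr)  # provides %>%, pull(), select() and recode()

# Species labels as a plain character vector
iris.labels = iris %>% 
  pull(Species) %>% 
  as.character()
# The four numeric measurements, without the Species column
iris.data = iris %>% 
  select(-Species)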

Performing the PCA itself is pretty straightforward. I have chosen to scale the variables in this example for no particular reason. The post is not about PCA so it does not matter too much.

pc = prcomp(iris.data, scale. = TRUE)
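If you want a quick sanity check on the fit before plotting anything, a sketch along these lines shows the proportion of variance each component accounts for. The name var.explained is mine, not part of prcomp; the calculation just squares the standard deviations stored in pc$sdev.

```r
# Proportion of variance explained by each principal component
data(iris)
pc = prcomp(iris[, 1:4], scale. = TRUE)

# pc$sdev holds the standard deviations of the components,
# so squaring gives the variances
var.explained = pc$sdev^2 / sum(pc$sdev^2)
round(var.explained, 3)
```

The same numbers appear in the "Proportion of Variance" row of summary(pc).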

The scores (the projected values of the observations) are stored in a matrix called pc$x. First we will produce a plot with just the observation numbers.

plot(pc$x[,1:2], type = 'n')
text(pc$x[,1:2], labels = 1:nrow(iris.data))
A simple biplot with observation numbers plotted
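For anyone wondering what "projected values" means in practice, here is a small sketch that rebuilds the scores by hand: centre and scale the data, then multiply by the rotation matrix. The name scores.by.hand is made up for illustration, and this assumes the PCA was fitted with scale. = TRUE as above.

```r
# Rebuild the scores manually and compare them with pc$x
data(iris)
iris.data = iris[, 1:4]
pc = prcomp(iris.data, scale. = TRUE)

# scale() centres each column and divides by its standard deviation,
# matching what prcomp does when scale. = TRUE
scores.by.hand = scale(iris.data) %*% pc$rotation

# Compare the numbers only, ignoring any differences in attributes
all.equal(scores.by.hand, pc$x, check.attributes = FALSE)
```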

It would be nice to put the species labels on instead of numbers. The species names are rather long though, so I am going to recode them to single letters:

iris.labels = recode(iris.labels, 
                     'setosa' = 's', 
                     'versicolor' = 'v', 
                     'virginica' = 'i')

And now we can plot the labels if we want:


plot(pc$x[,1:2], type = 'n')
text(pc$x[,1:2], labels = iris.labels)  # recoded species labels, not row numbers
A biplot with class labels
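As it happens, base R's biplot() method for prcomp objects will take observation labels directly through its xlabs argument, so the request can also be met in one line. This is a minimal sketch; short.labels is a name I made up here, built with a plain named-vector lookup rather than recode. Note that biplot() rescales the scores and overlays the variable loadings as arrows, so the picture will not be identical to the score plots above.

```r
# Built-in biplot with class labels supplied via xlabs
data(iris)
pc = prcomp(iris[, 1:4], scale. = TRUE)

# Named-vector lookup: species name in, single letter out
lookup = c(setosa = 's', versicolor = 'v', virginica = 'i')
short.labels = unname(lookup[as.character(iris$Species)])

# xlabs replaces the default observation labels (the row numbers)
biplot(pc, xlabs = short.labels)
```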

Hey you promised us ggplot2 ya bum!

Okay, okay. To do this we need to coerce the scores into a data.frame. The nice thing about doing this is that the principal components are conveniently labelled PC1, PC2, etc. This makes the mapping fairly easy.

library(ggplot2)
pcs = data.frame(pc$x)
p = pcs %>% 
  ggplot(aes(x = PC1, 
             y = PC2, 
             label = iris.labels)) + 
  geom_text()
p

Et voilà!

Let's do it in ggplot2!
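One possible extension, if you would like colour as well as letters: keep the species as a column in the data frame and map it to the colour aesthetic. This is only a sketch under the same setup as above; the label column and the lookup vector are names I have introduced for illustration.

```r
library(ggplot2)

data(iris)
pc = prcomp(iris[, 1:4], scale. = TRUE)

# Keep the species alongside the scores so ggplot can map it
pcs = data.frame(pc$x, Species = iris$Species)

# Single-letter labels via a named-vector lookup
lookup = c(setosa = 's', versicolor = 'v', virginica = 'i')
pcs$label = unname(lookup[as.character(pcs$Species)])

p = ggplot(pcs, aes(x = PC1, y = PC2, label = label, colour = Species)) +
  geom_text()
p
```

Putting the labels inside the data frame, rather than referring to a separate vector from aes(), keeps the plot self-contained if the rows are later filtered or reordered.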
