The Curious Case of the TeX That Could

I have had, for approximately the last nine months, an RStudio/knitr/LaTeX problem. After spending several hours looking at it, I have discovered that it was a TeX issue, and no fault of the other two. However, neither offered me any clue as to what was going on. I will detail the problem and the solution (TL;DR: I removed MikTeX and replaced it with TeXLive).

The problem

I am (co-)authoring another book. This one, like my data analysis book, consists of a master document with multiple child documents. The data analysis book was written with Emacs and Sweave, whereas this one is being written with RStudio and knitr. The problem I was having was that the book would compile fine on my Mac OS X machines, but under Windows I would get a message like Running pdflatex.exe on chapter04.tex...failed, and the PDF viewer would not launch. I initially tracked this down to the inclusion of the \makeindex command in my master document. In fact, through the use of a minimal example, I discovered that even after I took this command back out of the document, it would not compile until I deleted the auxiliary files created by LaTeX – in particular, the .idx file for the document.

I started to write a post for Stack Overflow and, in doing so, decided I really should make sure that all my software was up to date. This, in turn, broke everything! It was, however, obvious that what was broken was MikTeX.

MikTeX – why hast thou forsaken me?

I have been a big fan of MikTeX for a very long time. It traditionally has been, at least for me, a really robust TeX installation for Windows. One of the features I really like is its “Install packages on-the-fly” mechanism, whereby it can automatically install any package that is on CTAN that it detects is missing when it tries to compile a document. This means that the initial install is quite minimal, and usually very fast. At some point in 2020 (or maybe even 2019—I wasn’t paying too much attention) MikTeX changed the way it works. In particular, it seemed to become quite finicky about installing package updates as an Administrator or as a User, and no matter what I did, I couldn’t get it to have those two modes in sync with each other. That is, MikTeX would complain that there were different versions installed for the Administrator than those installed for the User, and vice versa.

TeXLive – what fresh hell is this?

Given MikTeX seemed to be broken, I decided I would try to set up TeXLive. I did this for a couple of reasons. Firstly, MikTeX was pissing me off, and secondly, given I was trying to synchronise behaviour, I thought perhaps having the same TeX distribution as I have on my Macs might help. I have traditionally avoided TeXLive because it doesn’t have the “install on-the-fly” feature, it used to require a huge initial download (not that that really matters anymore), and it was a pain to keep up to date. On the Mac you don’t have a lot of choice about using it. TeXLive does offer an internet installer these days; however, something is seriously wrong with it. I tried three different repositories in three different countries, but each looked like it would take nearly 24 hours to download and install the software. I got fed up, and downloaded a 3.8GB ISO using a torrent in about five minutes. Windows 10 has a very nice feature which allows an ISO to be treated like a regular drive, so I could run the installer straight off the ISO. Even so, the install took over 12 minutes, which seems excessive on a Core i9 machine with 32GB of RAM and an SSD, but it worked. I then rather foolishly decided I should run the TeXLive Manager and update my installation. This process took 3 hours and 39 minutes!! I have no idea why it was so slow, but I have seen a lot of complaints about this on the net.

So it worked

So the problem was MikTeX. I still don’t know exactly what the issue was, but finally I can compile my book on my Windows machines again. I did have to install a custom class file for the book, but that was rather straightforward compared to the rest. And now I have wasted the rest of my day writing a blog post. I do sincerely hope that it helps someone!


Estimating the mean and standard deviation from a histogram or interval data

NB: I am unsure why the LaTeX is not rendering here. The plugin says that it has not been tested with WordPress 5.4, but it seems to render other posts correctly.

This post shows how to estimate the mean and standard deviation for a data set where we do not have the original values, but rather “binned” data, or a histogram. This is not particularly complex. In fact, we used to teach this in our first year statistics course—perhaps we still do. However, it came up yesterday in a post to Stack Overflow which someone closed because they thought the person was asking an ill-formed question. It is true that the question poster could have illustrated her question a little better, but equally the person who closed the post did not get it either.

The situation is this. Imagine you are working with data that has been collected as part of the standard sort of demographics that survey writers like to ask. In particular, we often get asked to provide an age range, rather than our exact age. I will simulate a little data that looks like this so that we can see how well the method I describe actually works.

set.seed(123)
library(dplyr)
df = data.frame(trueAge = round(rnorm(100, mean = 37, sd = 12.5)),
                stringsAsFactors = FALSE) %>% 
  mutate(ageInt = case_when(
    trueAge > 0 & trueAge <= 17 ~ "0-17",
    trueAge >= 18 & trueAge <= 24 ~ "18-24",
    trueAge >= 25 & trueAge <= 29 ~ "25-29",
    trueAge >= 30 & trueAge <= 34 ~ "30-34",
    trueAge >= 35 & trueAge <= 39 ~ "35-39",
    trueAge >= 40 & trueAge <= 44 ~ "40-44",
    trueAge >= 45 & trueAge <= 49 ~ "45-49",
    trueAge >= 50 & trueAge <= 54 ~ "50-54",
    trueAge >= 55 & trueAge <= 59 ~ "55-59",
    trueAge >= 60 & trueAge <= 64 ~ "60-64",
    trueAge >= 65 & trueAge <= 69 ~ "65-69",
    trueAge >= 70 & trueAge <= 74 ~ "70-74",
    trueAge >= 75 & trueAge <= 79 ~ "75-79",
    trueAge >= 80 & trueAge <= 99 ~ "80-99"
  ))

My data.frame, df, has 100 random observations on two variables: trueAge and ageInt. Imagine now that we only have the second. How would I go about estimating the mean and the standard deviation from such data? In the original post the questioner was trying to use the sd function on ageInt to get the standard deviation. Clearly this was never going to work, and this is what seemed to confuse the Stack Overflow contributor who closed the question off. The “trick”, if one really wants to call it that, is simply to replace the intervals with their midpoints. For example, “0-17” gets replaced with 8.5, “18-24” gets replaced with 21, and so on. The remainder of this post shows how to do that programmatically.

I will break this into two steps. Firstly I am going to extract the end points of each interval using a regular expression, and then I will convert these into numbers and average them. There are lots of ways to do this, but I will use stringr as it provides a simpler mechanism for recovering regular expression capture groups.

library(stringr)
endPoints = data.frame(str_match(df$ageInt, "^([0-9]+)[-]([0-9]+)$")[,2:3], stringsAsFactors = FALSE)

In case someone wants to tell me that I can use \\d+ instead of [0-9]+, thanks but I know this. I find the latter more readable than the former. I will now add these to my data.frame and convert them to numeric values at the same time. I will also calculate the midpoints.

df = df %>% 
  mutate(X1 = as.numeric(endPoints$X1),
         X2 = as.numeric(endPoints$X2),
         midPoint = 0.5 * (X1 + X2))

Now we are in a position to estimate the standard deviation, and we can do this by using df$midPoint as the input to the sd function:

> sd(df$midPoint)
[1] 11.84036

So how did we do? Here is the sample standard deviation of the actual age values:

> sd(df$trueAge)
[1] 11.41703

I would say we did pretty well! The title of this article says we could do this with a histogram as well. How do we do that? Let us say I did not have trueAge, but I had a histogram, i.e.:
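
The histogram in question is, of course, drawn from trueAge; a minimal sketch that produces something like it, assuming breaks every 5 years from 5 to 65 to match the axis described below, is:

hist(df$trueAge, breaks = seq(5, 65, by = 5), # breaks chosen to match the plot described in the text
     xlab = "Age", main = "")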

Could I estimate the standard deviation from this? The answer is, of course, “yes.” We simply follow the same steps as before. We can see from the plot that the x-axis starts at 5 and finishes at 65. Therefore the midpoints will be a regular sequence from 7.5 to 62.5 progressing in steps of 5. We can code that in R with

mids = seq(from = 7.5, to = 62.5, by = 5)

and although it is painful, we can read off the counts in each bin fairly easily from the histogram:

counts = c(1, 1, 2, 10, 10, 20, 15, 15, 11,  7,  5,  3)

I could go two ways at this point. One is to use counts and mids in conjunction with the rep function to generate a vector of length 100 containing the appropriate midpoint for each value (to which sd could then be applied directly), e.g.

ageInt = rep(mids, counts)

However, I am going to use a bit of statistics theory, namely the formulae for the mean and standard deviation of a discrete random variable. You might recall having seen this before. If X is a discrete random variable which can take any value x from the set Ω with probability f(x) = Pr(X = x), then the expected value of X is

\(
{\mathrm E}[X] = \sum_{x \in \Omega}xf(x)
\)

and the variance

\(
{\mathrm E}[(X-\mu)^2] = \sum_{x \in \Omega}(x-\mu)^2f(x)
\)

where μ = E[X]. This works for us because our counts can be converted to frequencies, and we can use the frequencies to approximate the probabilities in the expressions above, i.e.

\(
\hat{f}(x) \approx \frac1n\sum_{i=1}^{n}I(x_i = x)
\)

Therefore in R, we type:

freqs = counts / 100
EX = sum(mids * freqs)
EX2 = sum(mids^2 * freqs)
sigma = sqrt(EX2 - EX^2)

And this estimates the standard deviation as

> sigma
[1] 11.4

This is a better estimate than our previous effort purely because we have split the 0-17 age group into more bins.
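
As a footnote, the mean promised in the title comes for free from the same quantities:

mean(df$midPoint) # interval-data estimate of the mean
EX                # histogram estimate, computed above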


F*ck you LaTeX

Some of you know that I have agreed to finish the textbook started by my good friend David Lucy who died far too young in 2018. After some heroic efforts by our publisher Taylor and Francis, I now have a set of LaTeX files that match the last coherent PDF file that David gave our editor, Rob Calver, before he was unable to work any more.

I set about converting this rather large, monolithic file into my normal workflow of a master knitr file, with child files for each of the chapters. As you might expect, when writing a book you often want to compile the chapter you are working on, rather than the whole document. This is especially true with this book, because I would like the JAGS code to be executed whilst the chapter is being compiled, and with nearly twenty chapters, this could take some time.

Again, this process is fairly straightforward. However, when compiling a chapter, the paths have to be relative to the folder the chapter is in, rather than to the master document. This means that if you provide paths for figures, code listings, or the bibliography, then these need to change depending on whether you are compiling just the chapter or the whole book. After a bit of reading, I thought I should be able to do this in LaTeX itself as it is a programming language (or at least a macro language) of sorts—boy was I wrong!

The basic solution to the problem is as follows:

  1. In the master document, define a logical variable to indicate whether you are working with the whole book or just a chapter, and set it to TRUE or FALSE as needed.
  2. In each child document, set the paths relative to either the chapter or the book, according to the value of the logical variable.

LaTeX does not have a logical type, but it does have a numeric type in the form of a counter. So step one can be done using the following commands:

%%%%% Are you working with individual chapters
\newcounter{local}
\setcounter{local}{1} %1 = TRUE 0 = FALSE

Step two, in theory, is straightforward as well. Here is an example from my first chapter. This goes in my preamble:

\ifnum\value{local}=0{
  \graphicspath{{chapters/chapter01/figures/}}
  \lstset{inputpath=chapters/chapter01/R/}
}
\else{
  \graphicspath{{figures/}} 
  \lstset{inputpath=R/}
}\fi

and this appears at the end of the chapter

\ifnum\value{local}=1{
  \bibliography{../../bibliography/jags-references}
}\fi

We have to make sure that the bibliography is included if the whole book is compiled, so this gets added to the position where we would like the bibliography to appear in the master document:

\ifnum\value{local}=0{
  \bibliography{bibliography/jags-references}
}\fi

Seems fine, right? Does it work? No. It seems to half work. Sometimes it works for the chapter, sometimes for the book, but the switching is never seamless.

James, you are an idiot

So remember how I said I was working with knitr? The solution, of course, is to use a well-defined programming language—R—and get knitr to print out the right LaTeX code. (N.B. I am not going to argue the merits of R being a real programming language or not, but it is one used by real programmers!) It took a tiny bit of fooling about to remember how to do this, but the essence is R chunks with the chunk options echo=FALSE and results='asis', so that knitr does not try to mark up our LaTeX. So these two chunks appear in my master document:

<<R-master-chunk,echo=FALSE>>=
### Set this to true if compiling chapters only
compileChapterOnly = FALSE
@

and

<<R-set-bibpath-master,echo=FALSE,results='asis'>>=
cat("\\bibliography{bibliography/jags-references.bib}\n")
@

with the latter placed where I need the bibliography to appear.
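
For completeness, each chapter is pulled into the master document using knitr's child-document mechanism; the chunk looks something like this (the chunk label and file name are illustrative):

<<ch01, child='chapters/chapter01/chapter01.Rnw'>>=
@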

And these two chunks are placed in each child document:

<<R-set-local-paths-ch01,echo=FALSE,results='asis'>>=
if(compileChapterOnly){
  cat("\\graphicspath{{figures/}}
                     \\lstset{inputpath=R/}\n")
}else{
  cat("\\graphicspath{{chapters/chapter01/figures/}}
  \\lstset{inputpath=chapters/chapter01/R/}\n")
}
@

and

<<R-set-local-bibpath-ch01,echo=FALSE,results='asis'>>=
if(compileChapterOnly){
  cat("\\bibliography{../../bibliography/jags-references}\n")
}
@

Note that each chunk in the child document has a chapter specific chunk tag, just to avoid duplicate tags when I compile the whole book.

I am pretty sure someone will point out my elementary LaTeX mistake, but then again there is not enough space in my head for understanding how all of LaTeX works. After all, I do have to get this book written (and not write blog posts).


Struggles with getting the (Bayesian) language right

Maybe it is a function of being originally trained in the Frequentist paradigm, but I am finding it hard to get my Bayesian language right.

At some point when you are learning about Bayesian things you will come across a statement that says something along these lines:

Frequentists believe that the data is random and that the parameter of interest is unknown and fixed. Bayesians, on the other hand, believe that the data is fixed, and that the parameter of interest is unknown and random.

At a certain level I am reasonably happy with this. The Bayesian statement leaves out something about the data generating mechanism being random, but on the whole – well, fine. However, I feel this does not mesh well with the statements we actually make when reporting results. For example, let’s say we are doing a rather dull, but simple, univariate analysis with inference about the mean. We have a set of observations, and we might propose a normal prior, a normal likelihood, and consequently arrive at a normal posterior. How do we choose to summarise this information? Well, we might give a statement about the posterior density:

\(
\mu|\mathbf{x}\sim N(\mu^\prime, (\sigma^{\prime})^2)
\)

Or, we might give a (95%) credible interval:

\(
\Pr(q_{0.025}^{\prime} \lt \mu \lt q_{0.975}^{\prime} \mid \mathbf{x})=0.95
\)

Or we might just report the posterior mean and standard deviation. I think it is the last two where I question myself. Let’s take the credible interval. If we express it in words, then we say something along the lines of I am 95% sure the mean lies between a and b (where a and b are the respective lower and upper bounds of the credible interval). It seems implicit in this statement – at least to me, and feel free to disagree – that there is an underlying belief that there is a single true value for the mean. Similarly, if we report the posterior mean, there is a feel (again to me) that we are reporting our best estimate of the one true mean. Clearly this flies in the face of what we actually think as Bayesians.

What brought this to the fore was working on the book my good friend David Lucy left unfinished.

David, like me, came to Bayesian thinking later in his academic career. He started out his academic life as an archaeologist, working on age estimation. If you are unfamiliar with this problem, it is essentially a classical calibration problem. You have a set of remains – perhaps teeth – from individuals of known age, and for whom you can measure some characteristic, say Y, with variability. The idea is, given a new value of Y, can you estimate the age of the individual? It parallels the calibration problem because (in theory) the age is measured without error, whereas Y contains all the usual sources of variation. David became interested in Bayesian solutions to this problem and, based on his thesis work, was firmly convinced that it was the only way to approach it.

David liked ontology and epistemology, and as a consequence of this introduced the idea of a Platonic view of statistics. I don’t know if this single line captures his thinking, but it suffices. He wrote:

Statistical modelling is very Platonic in that it is the parameters of a model which are the ontologically real matter which control the way in which the world works.

Implicit in this statement, again perhaps only to me, is the idea that there is one true value. In fact, if David was around to argue the point, I would say, based on the little I know about Plato, that his ideas were essentially reductionist. That is perhaps not a flaw if the building blocks are probability distributions, and maybe this is what David was arguing.

Anyway – there is no natural conclusion to these thoughts, except perhaps to ask “Are we actually conveying our intent accurately with our language when it comes to reporting Bayesian results?”


I think I understand gather

gather and spread—are these the most misunderstood functions in all of tidyr? I don’t know, but maybe this will help you.

The simple case—gathering all columns

gather, at its very simplest, reduces your dataframe to a set of key-value pairs. This is best demonstrated by an example. I will create a dataframe with three variables x, y and z.

 
my.df  = data.frame(x = 1:3,                     
                    y = letters[1:3],
                    z = c("Alice", "Bob", "Carol"))

And this dataframe looks like this:

##   x y     z
## 1 1 a Alice
## 2 2 b   Bob
## 3 3 c Carol

gather, applied to this data frame with no additional arguments, will make two new variables, key and value. key will contain each of the variable names repeated as many times as there are rows in the original dataframe. value will contain the corresponding variable value. That is easier to see than to describe, so I will explain, then demonstrate. There are three variables in my.df and three rows. gather will change this into a dataframe with two variables, key and value, and nine (3 × 3) rows. The first row will have a key of x and a value of 1. The next row will have a key of x and a value of 2, and so on. The new dataframe looks like this:

my.df %>% 
  gather() 
##   key value
## 1   x     1
## 2   x     2
## 3   x     3
## 4   y     a
## 5   y     b
## 6   y     c
## 7   z Alice
## 8   z   Bob
## 9   z Carol

Hopefully the effect of gather is clear here, even if the use case is not.

The not so simple case—gathering some columns

How about when you want to gather some of the columns? Why might you want to do this? There are many reasons, but the most common may be that you have several measurements for an observation that you wish to summarise.
Consider the following example:

##    name time1 time2 time3
## 1 Alice  8.43  0.32  3.14
## 2   Bob  5.77  0.56  1.45
## 3 Carol 13.29  3.17 27.26

How would we go about finding either the total time, or the average time for each person in the dataframe?

Firstly we need to gather so that the three time variables are in a single column. We will store the variable names in a column named run, and we will store the actual time values in a variable named time:

set.seed(123)
ex2.df = data.frame(name = c("Alice", "Bob", "Carol"),
                         time1 = round(rexp(3, 0.10), 2), 
                         time2 = round(rexp(3, 0.10), 2),
                         time3 = round(rexp(3, 0.10), 2), 
                         stringsAsFactors = FALSE)

ex2.gathered.df = ex2.df %>% 
  gather(key = "run", value = "time", time1, time2, time3)

ex2.gathered.df

And the new dataframe looks like this:

##    name   run  time
## 1 Alice time1  8.43
## 2   Bob time1  5.77
## 3 Carol time1 13.29
## 4 Alice time2  0.32
## 5   Bob time2  0.56
## 6 Carol time2  3.17
## 7 Alice time3  3.14
## 8   Bob time3  1.45
## 9 Carol time3 27.26
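
As an aside, spread is just the inverse operation; applying it to the gathered data should give back the original wide layout (up to column order):

ex2.gathered.df %>% 
  spread(key = run, value = time)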

N.B. When choosing the name for your key, do not use the name of an existing variable in your data.frame, as the result will not be what you expect, e.g.

ex2.bad.gathered.df = ex2.df %>% 
  gather(key = "name", value = "time", time1, time2, time3)

ex2.bad.gathered.df
##    name  time
## 1 time1  8.43
## 2 time1  5.77
## 3 time1 13.29
## 4 time2  0.32
## 5 time2  0.56
## 6 time2  3.17
## 7 time3  3.14
## 8 time3  1.45
## 9 time3 27.26

We want the average time for each person, so we group_by name, and then we summarise.

ex2.gathered.df %>% 
  group_by(name) %>% 
  summarise(ave.time = mean(time))

Et voilà.

## # A tibble: 3 x 2
##   name  ave.time
##   <chr>    <dbl>
## 1 Alice     3.96
## 2 Bob       2.59
## 3 Carol    14.6

My own use case is different – I had a bunch of summary statistics in each row for one data set, and a set of observations in another. I wanted to put them all in a database table with a column indicating whether it was a summary statistic or not (by the name of the statistic), or whether it was an observation. So my gather statements look like:

a1 = a1 %>%
    gather(key = "treatments", value = "stat", mean, sd, min, max) 

a2 = a2 %>%
    rename(stat = accuracy) %>%
    mutate(treatments = "accuracy") %>%
    select(colour, method, reduced, baseline, treatments, stat)

a = bind_rows(a1, a2)

Update

Di Cook reminded me on Twitter that Hadley has been working on two replacements for gather and spread, called pivot_longer and pivot_wider respectively. News about these new functions (they are more than a renaming of gather and spread) can be found here. Once I understand them, I will add a section to this post.

Update 2—using pivot_longer

Note: If you want to try out these new functions, you need to install the development version of {tidyr}. This involved a lot of updates for me, including having to fix broken headers (thanks Apple) on OS X Mojave (I used the TL;DR solution I found on Stack Overflow).

pivot_longer improves gather, at least IMHO, by making it more explicit where information is coming from and where it is going to. Here is the same example using pivot_longer:

## Note: You need the development version of tidyr
## for this to work
## devtools::install_github("tidyverse/tidyr")
library(tidyr)

ex2.pivotlonger.df = ex2.df %>% 
  pivot_longer(cols = matches("time[1-3]"),
               names_to = "run",
               values_to = "time")

ex2.pivotlonger.df %>% 
  group_by(name) %>% 
  summarise(ave.time = mean(time))

The output is identical to above:

## # A tibble: 3 x 2
##   name  ave.time
##   <chr>    <dbl>
## 1 Alice     3.96
## 2 Bob       2.59
## 3 Carol    14.6

I like the way that I can tell pivot_longer which columns to gather using the cols argument, where to store the variable names using names_to, and where to store the values using values_to. Will I switch immediately? No. I think I will wait until these functions reach the CRAN version of tidyr. I am old enough to realise that living on the bleeding edge of technology does not always pay off!

Update 3

Charco Hui, an MSc student of Anna Fergusson’s and Chris Wild’s, has a nice package which animates the gather and spread operations; it can be found here. Check it out!


Collapsing or summarising data over the levels of a factor in the tidyverse

This is a simple post, and undoubtedly many people know this already, but I thought I would share it for people who are having trouble finding such things.

Let us say I have data that looks conceptually like this:

Conceptual layout of data from an experiment.

And further, assume I have done all the grunt work with the wonderful readxl package, and with dplyr, to get it into a sensible format. If anyone wants to know how to do that, let me know. So our data looks like this in R:

## # A tibble: 160 x 5
##    count replicate side  position size 
##    <dbl> <fct>     <fct> <fct>    <fct>
##  1     3 1         left  arm      small
##  2     1 2         left  arm      small
##  3     0 3         left  arm      small
##  4     3 4         left  arm      small
##  5     1 5         left  arm      small
##  6     1 6         left  arm      small
##  7     1 7         left  arm      small
##  8     1 8         left  arm      small
##  9     2 9         left  arm      small
## 10     0 10        left  arm      small
## # … with 150 more rows

In this example I will collapse over the levels of the variable side. The key functions here are group_by and summarise (or summarize if you hail from a certain country). group_by—as the name suggests—arranges the data into groups. It is important to note that a) this does not change how the data is displayed when you print it or View it, and b) it returns a grouped data frame. The latter is important, as sometimes it can cause problems. Because of that, I recommend you use the ungroup function as well.

The principle is straightforward—use group_by on the variables you wish to keep, and omit the variable(s) you wish to collapse over. In this example, as I said, I will summarise over side. That means I use group_by on replicate, position, and size.

reducedData = myFormattedData %>% 
  group_by(replicate, position, size) %>% 
  summarise(count = sum(count)) %>% 
  ungroup()

This code groups the data into, well, groups which have the same values of replicate, position, and size. That means that each group will have, in this case, a pair of observations, one with side equal to left and one with side equal to right. We then summarise each of these pairs of observations by adding their count values. Note that my code might be a little confusing in that I have used the same variable name, count, for the sum of the counts. The net effect is what we want:

## # A tibble: 80 x 4
##    replicate position size  count
##    <fct>     <fct>    <fct> <dbl>
##  1 1         arm      small     5
##  2 1         arm      big       3
##  3 1         hand     small     3
##  4 1         hand     big       3
##  5 2         arm      small     3
##  6 2         arm      big       4
##  7 2         hand     small     2
##  8 2         hand     big       3
##  9 3         arm      small     2
## 10 3         arm      big       3
## # … with 70 more rows

A quick check shows that the original data had a value of 3 for the first replicate that was small and on the arm on the left, and a value of 2 on the right, giving a sum of 5.
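
In code, that check is just a filter on the original data (no summarising needed); replicate is a factor here, hence the quotes:

myFormattedData %>% 
  filter(replicate == "1", position == "arm", size == "small")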


Beware the ticks of R Markdown

This will be a very short post. It addresses a problem I had and I have seen others ask about on various forums (but with no answer).

The problem

The icons have disappeared from the top of some of my R chunks in my R Markdown document.

Figure 1: Missing icons

Notice how there is no settings icon, or the little green run icon in the second R chunk. If you have eagle eyes you will see why.

The solution

If you look very closely in Figure 1, you will see there is an extra set of back ticks at the end of the chunk. Remove these and you fix the problem.

Figure 2: Problem solved

This might seem blindingly obvious to some, but it is really easy to miss in a large Markdown document. It also has follow-on consequences in that it makes it harder to run lines of code in a chunk—you have to select them before using Cmd/Ctrl-Enter.
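
For concreteness, a broken chunk looks something like this in plain text (the chunk contents are made up; the stray set of back ticks on the final line is the culprit):

```{r my-chunk}
x <- rnorm(10)
mean(x)
```
```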

But wait! There’s more!

This just in, highlighted by Twitter pal Michael MacAskill (@Sakkaden). A space between the ``` and the {r} can cause the same problem.

Figure 3: Watch out for rogue spaces.

Thanks Michael!


Producing a biplot with labels

If you are hoping for blinding insight in this post, then I think you better stop reading now. A friend asked me to show him how to display the observation labels (or class labels) in a PCA biplot. Given how long prcomp has been around, this is hardly new information. It might now even be a feature of plot.prcomp. However, for posterity, and for the searchers, I will give some simple solutions. I will, for the sake of pedagogy, even stop effing and blinding about how Fisher’s Iris data set deserves to be left to crumble into a distant memory of when we used to do computing with a Jacquard Loom.

I will make use of tidyverse functions in this example, because kids these days. I would not worry about it too much. It just makes some of the data manipulation slightly more transparent. Firstly I will load the data. The Iris data set is an internal R data set so the data command will do it.

data(iris)
names(iris)
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width" 
## [5] "Species"

Now I will pull out the labels into a separate vector

iris.labels = iris %>% 
  pull(Species) %>% 
  as.character()
iris.data = iris %>% 
  select(-Species)

Performing the PCA itself is pretty straightforward. I have chosen to scale the variables in this example for no particular reason. The post is not about PCA so it does not matter too much.

pc = prcomp(iris.data, scale. = TRUE)

The scores—the projected values of the observations—are stored in a matrix called pc$x. First we will produce a plot with just the observation numbers.

plot(pc$x[,1:2], type = 'n')
text(pc$x[,1:2], labels = 1:nrow(iris.data))
A simple biplot with observation numbers plotted

It would be nice to put the species labels on instead of numbers. The species names are rather long though, so I am going to recode them

iris.labels = recode(iris.labels, 
                     'setosa' = 's', 
                     'versicolor' = 'v', 
                     'virginica' = 'i')

And now we can plot the labels if we want:


plot(pc$x[,1:2], type = 'n')
text(pc$x[,1:2], labels = iris.labels)
A biplot with class labels

Hey you promised us ggplot2 ya bum!

Okay, okay. To do this we need to coerce the scores into a data.frame. The nice thing about doing this is that the principal components are conveniently labelled PC1, PC2, etc. This makes the mapping fairly easy.

library(ggplot2)
pcs = data.frame(pc$x)
p = pcs %>% 
  ggplot(aes(x = PC1, 
             y = PC2, 
             label = iris.labels)) + 
  geom_text()
p

Et voilà!

Let's do it in ggplot2!
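
If you would rather keep the labels inside the data frame (which avoids mapping an aesthetic to a vector floating around in the global environment), a small variation on the above is:

pcs$Species = iris.labels
p = pcs %>% 
  ggplot(aes(x = PC1, y = PC2, label = Species)) + 
  geom_text()
p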


Adding multiple regression lines to a faceted ggplot2 plot

Those of you who know me well know I am not really a ggplot2 type of guy. Not because I do not think that it is good work—I would be an idiot to claim that probably the most downloaded package in the history of CRAN is not good—but because I predate both Hadley and ggplot2 (yes I admit it, I am a dinosaur now) and can do most of what I want in base R. However, there are times in exploratory data analysis where going the ggplot2 route makes more sense. One situation where that happens for me is when I am fitting linear models with a mixture of continuous explanatory variables and factors. Some people have an old-fashioned name for this situation—but really repeating that name would be a get-burned-at-a-stake type of offence. It is useful when fitting linear models of this type to be able to see the behaviour of the response with respect to a continuous predictor for each combination of the levels of the factors. ggplot2 makes it fairly easy to produce this type of plot through its faceting mechanism. It also makes it really easy to add a fitted line with a pretty confidence interval to each facet. I should say at this point that this is not restricted to linear models, and in fact works for generalised linear models as well, and for semi-parametric models like smoothers. It is cool stuff.

I want a pony, and I want it NOW Daddy

This is, as I have said, made easy to do in ggplot2 and a half hour of Googling will get you to the point where you can do it with your data. However, I found myself with the following problem. I had a situation where there was a suggestion that an interaction might be significant and so I wanted to explore visually how the fitted models differed with and without interaction. You often find yourself in this situation with tests suggesting the interactions are significant only to find that it is driven by one combination of the factors, or even worse, by a single outlier. Making things explicit, in this problem I was interested in the difference between this model:

$$ Y \sim X * F_1 * F_2 $$

and this model

$$Y \sim X * F_1 + F_2$$

These models are expressed in standard Wilkinson and Rogers notation, where \(Y\) and \(X\) are continuous, and \(F_1\) and \(F_2\) are factors. However, the same issues arise if we are interested in

$$ Y \sim X * F $$

and this model
$$ Y \sim X + F $$

The only difference is that the different levels of the factor \(F\) can be represented on a single row—if there are not too many levels—or in a wrapped facet plot if there are quite a few levels.

We need some data

The data in this post come from Hand, D.J., Daly, F., Lunn, A.D., McConway, K.J. & Ostrowski, E. (1994). A Handbook of Small Data Sets Boca Raton, Florida: Chapman and Hall/CRC. The data try to relate the mortality in UK towns and cities—measured as deaths per 100,000 head of population—to the hardness of the water—measured by calcium concentration (in parts-per-million = ppm). The original authors also had the theory that this relationship changed when the town or city was in “The North.” I have always found the English concept of “Up North” rather bizarre. As you leave the outskirts of London you start seeing road signs telling you how far it is to “The North.” In this data set, “The North” is defined as being north of the English city of Derby. The data is available here: water.csv.

water.df = read.csv("water.csv")

It is worth having a quick look at the summary data

library(skimr)
skim(water.df)
## Skim summary statistics
##  n obs: 61 
##  n variables: 4 
## 
## Variable type: factor 
##  variable missing complete  n n_unique                     top_counts
##      town       0       61 61       61 Bat: 1, Bir: 1, Bir: 1, Bla: 1
##  ordered
##    FALSE
## 
## Variable type: integer 
##   variable missing complete  n    mean     sd   p0  p25  p50  p75 p100
##    calcium       0       61 61   47.18  38.09    5   14   39   75  138
##  mortality       0       61 61 1524.15 187.67 1096 1379 1555 1668 1987
##      north       0       61 61    0.57   0.5     0    0    1    1    1

We can see from this that our factor north is not being treated as such. I will recode it, and since we are using ggplot2 we might as well go into the tidyverse too. I think I am going to violate a Hadley principle by overwriting the existing data, but I can live with this.

library(tidyverse)
water.df = water.df %>% 
  mutate(north = factor(ifelse(north == 1, "north", "south")))

The initial plot

I am going to plot mortality as a function of calcium and I am going to facet (if that is a verb) by our factor north. There is a good argument to make that because both variables are rates (deaths per 100,000 and parts per million) then we should work on a log-log scale. However, given this is not a post on regression analysis, I will not make this transformation (Spoiler: there is no evidence that the factor makes any difference on this scale which kind of makes it useless to show what I want to show ☺)

library(ggplot2)
p = water.df %>% 
  ggplot(aes(x = calcium, y = mortality)) + geom_point()

p = p + facet_wrap(~north)
p
Figure 1: An initial plot of the data

It looks like there is evidence of a shift, but is there evidence of a difference in slopes?

water.fit = lm(mortality ~ calcium * north, data = water.df)
summary(water.fit)

## 
## Call:
## lm(formula = mortality ~ calcium * north, data = water.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -221.27  -75.45    8.52   87.48  310.14 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1692.3128    32.2005  52.555   <2e-16 ***
## calcium              -1.9313     0.8081  -2.390   0.0202 *  
## northsouth         -169.4978    58.5916  -2.893   0.0054 ** 
## calcium:northsouth   -0.1614     1.0127  -0.159   0.8740    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 123.2 on 57 degrees of freedom
## Multiple R-squared:  0.5909, Adjusted R-squared:  0.5694 
## F-statistic: 27.44 on 3 and 57 DF,  p-value: 4.102e-11

Ignoring all model checking (it ain’t so bad STATS 201x kids), there does not appear to be much support for the different slopes model. However there is for the different intercepts model. I would like to explore this visually.

So about those lines…

What I would like to do is place the fitted line from each model on each facet of the plot. In addition, I would like the confidence interval for each line to be added as well. There might be a way to do this with geom_smooth, and I would be happy to hear about it (there is a sketch of that route at the end of this post). My solution involves three steps:

  1. Fitting each of the models, and using predict to get the confidence intervals
  2. Adding the fitted line from each model to each facet
  3. Adding the confidence intervals from each model to each facet

Step 1—getting the data for the confidence intervals

This step involves, as previously mentioned, using the predict function (in fact predict.lm). This should be fairly straightforward. The key here is to add the interval information, and the fitted values to either a new data frame or our existing data frame. I will opt for the latter.

water.fit1 = water.fit ## interaction model
water.fit2 = lm(mortality ~ calcium + north, data = water.df)

water.pred1 = predict(water.fit1, water.df, interval = "confidence")
water.pred2 = predict(water.fit2, water.df, interval = "confidence")

water.df = water.df %>% 
  mutate(fit1 = water.pred1[,"fit"],
         lwr1 = water.pred1[,"lwr"],
         upr1 = water.pred1[,"upr"],
         fit2 = water.pred2[,"fit"],
         lwr2 = water.pred2[,"lwr"],
         upr2 = water.pred2[,"upr"])

Step 2—Adding the fitted lines

Remember that our plot is stored in the variable p. We will add the fitted lines using the geom_line function. We could have done this with geom_abline and just the coefficients; however, that would have made the method less flexible because we could not accommodate, for example, a simple quadratic model.

p = p + geom_line(data = water.df, 
                  mapping = aes(x = calcium, y = fit1), 
                  col = "blue", size = 1.2)
p = p + geom_line(data = water.df, 
                  mapping = aes(x = calcium, y = fit2), 
                  col = "red", size = 1.2)
p
Figure 2: Adding the lines

We can see already the lack of support for the different slopes model, however, let’s add the confidence intervals.

Step 3—Adding the confidence intervals

We add the confidence intervals by using the geom_ribbon function. The aesthetic for geom_ribbon requires two sets of y-values, ymin and ymax. These, clearly, are the values we calculated for each of the confidence intervals. We set the colour of the bands by setting fill. Setting col controls the border colour of the ribbon. We make the bands transparent (so they do not obscure our fitted lines) by setting an alpha value closer to zero than to one (0.2 seems to be a good choice).

p = p + geom_ribbon(data = water.df, 
                    mapping = aes(x = calcium, ymin = lwr1, ymax = upr1),
                    fill = "blue", alpha = 0.2)
p = p + geom_ribbon(data = water.df, 
                    mapping = aes(x = calcium, ymin = lwr2, ymax = upr2),
                    fill = "red", alpha = 0.2)
p
Figure 3: Lines with confidence intervals (as promised)

As expected the intervals completely overlap, thus supporting our decision to choose the simpler model.

Well that sucked

How about if there really was a slope effect? I have artificially added one, and redrawn the plot just so we can see what happens when there really is a difference.

## just so we all get the same result
set.seed(123) 

## get the original coefficient
b = coef(water.fit1) 

## and alter the effect for the interaction
b[4] = -5 # make a big difference to the slopes

## get the standard deviation of the residuals
sigma = summary(water.fit1)$sigma

## and use it to make a new set of responses
water.df = water.df %>% 
  mutate(mortality = (b[1] + (north == 'north') * b[2]) + (b[2] + (north == 'north') * b[4]) * calcium + rnorm(61, 0, sigma))

## Now we repeat all the steps above
water.fit1 = lm(mortality ~ calcium * north, data = water.df) ## interaction model
water.fit2 = lm(mortality ~ calcium + north, data = water.df)

water.pred1 = predict(water.fit1, water.df, interval = "confidence")
water.pred2 = predict(water.fit2, water.df, interval = "confidence")

water.df = water.df %>% 
  mutate(fit1 = water.pred1[,"fit"],
         lwr1 = water.pred1[,"lwr"],
         upr1 = water.pred1[,"upr"],
         fit2 = water.pred2[,"fit"],
         lwr2 = water.pred2[,"lwr"],
         upr2 = water.pred2[,"upr"])

p = water.df %>% 
  ggplot(aes(x = calcium, y = mortality)) + geom_point()

p = p + facet_wrap(~north)
p = p + geom_line(data = water.df, 
                  mapping = aes(x = calcium, y = fit1), 
                  col = "blue", size = 1.2)
p = p + geom_line(data = water.df, 
                  mapping = aes(x = calcium, y = fit2), 
                  col = "red", size = 1.2)
p = p + geom_ribbon(data = water.df, 
                    mapping = aes(x = calcium, ymin = lwr1, ymax = upr1),
                    fill = "blue", alpha = 0.2)
p = p + geom_ribbon(data = water.df, 
                    mapping = aes(x = calcium, ymin = lwr2, ymax = upr2),
                    fill = "red", alpha = 0.2)
p
Figure 4: Data with a real interaction

One final note

I suspect, but I have not checked properly, that when geom_smooth is used, the fitted lines on each facet are derived from the subset of data in that facet, and therefore the standard errors might be slightly larger (because less data is used to estimate \(\sigma\)). However, I am happy to be corrected (and if I am wrong I will remove this section :-))
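
For reference, the geom_smooth route looks something like the sketch below; as far as I can tell it fits a separate simple regression within each facet, which is the behaviour described above, so it cannot show the fits from two competing models the way this post needed.

p.smooth = water.df %>% 
  ggplot(aes(x = calcium, y = mortality)) + 
  geom_point() + 
  facet_wrap(~north) + 
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE)
p.smooth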


Some quirks of JAGS

I return to JAGS infrequently enough these days that I forget a lot of things. Do not get me wrong. I love JAGS and I think it is one of the most valuable tools in my toolbox. However administrative and professional duties often destroy any chance I might have of concentrating long enough to actually do some decent statistics.

Despite this, I have been working with my colleagues and friends Duncan Taylor and Mikkel Andersen – who has been visiting from Aalborg since February and really boosting my research – on some decent Bayesian modelling. Along the way we have encountered a few idiosyncrasies of JAGS which are probably worth documenting. I should point out that the issues here have been covered by Martyn Plummer and others on Stack Overflow and other forums, but perhaps not as directly.

Local variables

JAGS allows local variables. In fact, you probably use them without realising it, as loop indices. However, what it does not like is the declaration of a local variable within a loop. That is,

model{
  k <- 10
}

is fine, whereas

model{
  for(i in 1:10){
    k <- 10
  }
}

is not. This would be fine, except the second code snippet will yield an error which tells you that k is an unknown variable and asks you to: Either supply values for this variable with the data or define it on the left hand side of a relation. I think this error message is a bit of a head scratcher because you look at your code and say “But I have defined it, and it works in R.” Do not fall into that mode of thinking – it’s a trap!

There are a couple of solutions to this problem. Firstly, you could do as I did and give up on having your model code readable, and just not use temporary variables like this; or, secondly, you could let the temporary variable vary with the index, like so

model{
  for(i in 1:10){
    k[i] <- 10
  }
}

Missing data / ragged arrays

JAGS cannot deal with missing covariates very well. However, it is happy enough for you to include these observations and “step over” them somehow. An example of this might be a balanced experiment where some of the trials failed. As an example, let us consider a simple two-factor completely randomised design where each factor has only two levels. Let the first factor have levels A and B, and the second factor have levels 1 and 2. Furthermore, let there be 3 replicates for each treatment (combination of the factors). Our (frequentist) statistical model for this design would be

\(
y_{ijk}=\mu + \alpha_i + \beta_j + \gamma_{ij} + \varepsilon_{ijk},~i\in\{A,B\},j\in\{1,2\},k=1,\ldots,3,\varepsilon_{ijk}\sim N(0,\sigma^2)
\)

And we would typically programme this up in JAGS as something like this

model{
  for(i in 1:2){
    for(j in 1:2){
      for(k in 1:3){
        y[i,j,k] ~ dnorm(mu[i,j], tau)
      }
      mu[i,j] <- alpha[i] + beta[j] + gamma[i,j]
    }
  }

  for(i in 1:2){
    alpha[i] ~ dnorm(0, 0.000001)
  }

  for(j in 1:2){
    beta[j] ~ dnorm(0, 0.000001)
  }

  for(i in 1:2){
    for(j in 1:2){
      gamma[i,j] ~ dnorm(0, 0.000001)
    }
  }

  tau ~ dgamma(0.001, 0.001)
}

Implicit in this code is that y is a 2 x 2 x 3 array, and that we have a fully balanced design. Now let us assume that, for some reason, the treatments $$\tau\in\{A1,A2,B1,B2\}$$ have been replicated 2, 3, 2, 3 times respectively. We can deal with this in JAGS by creating another variable in our input data which we will call reps, with reps = c(2, 3, 2, 3). This then can be accommodated in our JAGS model by

model{
  for(i in 1:2){
    for(j in 1:2){
      for(k in 1:reps[(i-1) * 2 + j]){
        y[i,j,k] ~ dnorm(mu[i,j], tau)
      }
      mu[i,j] <- alpha[i] + beta[j] + gamma[i,j]
    }
  }

  ## priors on alpha, beta, gamma, and tau as in the first example
}

y is still a 2 x 2 x 3 array, but simply has NA stored in the positions without information. It is also worth noting that ever since JAGS 4.0 the syntax for loops allows

for(v in V){

where V is a vector containing integer indices. This is very useful for irregular data.
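
To make the ragged-array idea concrete, here is a minimal (untested, like the model above) R sketch of how the data might be passed to the model using the rjags package; the file name and the simulated values are purely illustrative, and other interfaces to JAGS would look similar:

library(rjags)

## 2 x 2 x 3 response array with NA in the unused slots
set.seed(1)
y = array(rnorm(12, mean = 100, sd = 5), dim = c(2, 2, 3))
y[1, 1, 3] = NA  ## treatment A1 only has 2 replicates
y[2, 1, 3] = NA  ## treatment B1 only has 2 replicates

reps = c(2, 3, 2, 3)  ## A1, A2, B1, B2

jagsData = list(y = y, reps = reps)

## assumes the (completed) model above is saved as "ragged-anova.jags"
m = jags.model("ragged-anova.jags", data = jagsData, n.chains = 3)
update(m, 1000)  ## burn-in
samples = coda.samples(m, variable.names = c("alpha", "beta", "gamma", "tau"),
                       n.iter = 10000)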

NOTE I have not compiled the JAGS code in my second example, so please let me know if there are mistakes. This is an ongoing article and so I may update it from time to time.

