# Struggles with getting the (Bayesian) language right

Maybe it is a function of being originally trained in the Frequentist paradigm, but I am finding it hard to get my Bayesian language right.

At some point when you are learning about Bayesian things you will come across a statement that says something along these lines:

Frequentists believe that the data is random and that parameter of interest is unknown and fixed. Bayesian, on the other hand, believe that the data is fixed, and that the parameter of interest is unknown and random.

At a certain level I am reasonably happy with this. The Bayesian statement leaves out something about the data generating mechanism being random, but on the whole – well, fine. However, I feel this does not mesh well with our statements. For example, let’s say we are doing a rather dull, but simple univariate analysis with inference about the mean. We have a set of observations, and we might propose a normal prior, a normal likelihood and consequently arrive at a normal posterior. How do we choose to summarise this information? Well, we might give a statement about the posterior density:

$$\mu|\mathbf{x}\sim N(\mu^\prime, (\sigma^{\prime})^2)$$

Or, we might give a (95%) credible interval:

$$\Pr(q_{0.025}^\prime\lt\mu|\mathbf{x}\lt q^prime_{0.975})=0.95$$

Or we might just report the posterior mean and standard deviation. I think it is the last two where I question myself. Let’s take the credible interval. If we express it in words, then we say something along the lines of I am 95% sure the mean lies between a and b (where a and b are the respective lower and upper bounds of the credible interval. It seems implicit in this statement, at least to mean–and feel free to disagree, that there is this underlying belief that there is a single true value for the mean. Similarly, if we report the posterior mean, there is a feel (again to me) that we are reporting our best estimate of the one true mean. Clearly this flies in the face of what we actually think as Bayesians.

What brought this to the fore was working on the book my good friend David Lucy left unfinished.

David, like me, came to Bayesian thinking later in his academic career. He started out his academic life as an archaeologist, working on age estimation. If you are unfamiliar with this problem, it is essentially a classical calibration problem. You have a set of remains–perhaps teeth–from individuals of known age, and for whom you can measure some characteristic, say Y with variability. The idea is given a new value of Y, can you estimate the age of the individual. It parallels the calibration problem because (in theory) the age is measured without error, whereas Y contains all the usual sources of variation. David became interested in Bayesian solutions to this problem, and based on his thesis work, was firmly convinced that it was the only way to approach it.

David liked ontology and epistemology, and as a consequence of this introduces the idea of a Platonic view of statistics. I don’t know if this line will capture his thinking but it suffices. He writes:

Statistical modelling is very Platonic in that it is the parameters of a model which are the ontologically real matter which control the way in which the world works.

Implicit in this statement, again perhaps only to me, is the idea that there is one true value. In fact, if David was around to argue the point, I would say, based on the little I know about Plato, that his ideas were essentially reductionist. That is perhaps not a flaw if the building blocks are probability distributions, and maybe this is what David was arguing.

Anyway – there is no natural conclusion to these thoughts, except perhaps to ask “Are we actually conveying our intent accurately with our language when it comes to reporting Bayesian results?”

# I think I understand gather

gather and spread—are these the most misunderstood functions in all of dplyr? I don’t know but maybe this will help you.

## The simple case—gathering all columns

gather, at its very simplest, reduces your dataframe to a set of key-value pairs. This is best demonstrated by an example. I will create a dataframe with three variables x, y and z.


my.df  = data.frame(x = 1:3,
y = letters[1:3],
z = c("Alice", "Bob", "Carol"))


And this dataframe looks like this:

##   x y     z
## 1 1 a Alice
## 2 2 b   Bob
## 3 3 c Carol

gather, applied to this data frame with no additional arguments, will make two new variables, key, and value. key will contain each of the variable names repeated as many times as there are rows in the original dataframe. value will contain the corresponding variable value. That is harder to understand. I will explain, then demonstrate. There are three variables in my.df and three rows. gather will change this into a dataframe with two variables, key, and value, and nine (3 × 3) rows. The first row will have a key of x and a value of 1. The next row will have a key of x and a value of 2 and so on. The new dataframe looks like this:

my.df %>%
gather()

##   key value
## 1   x     1
## 2   x     2
## 3   x     3
## 4   y     a
## 5   y     b
## 6   y     c
## 7   z Alice
## 8   z   Bob
## 9   z Carol

Hopefully the effect of gather is clear here, even if the use case is not.

## The not so simple case—gathering some columns

How about when you want to gather some of the columns? Why might you want to do this? There are many reasons, but the most common may be that you have several measurements for an observation that you wish to summarise.
Consider the following example:

##    name time1 time2 time3
## 1 Alice  8.43  0.32  3.14
## 2   Bob  5.77  0.56  1.45
## 3 Carol 13.29  3.17 27.26

How would we go about finding either the total time, or the average time for each person in the dataframe?

Firstly we need to gather so that the three time variables are in a single column. We will store the variable names in a column named run, and we will store the actual time values in a variable named time:

set.seed(123)
ex2.df = data.frame(name = c("Alice", "Bob", "Carol"),
time1 = round(rexp(3, 0.10), 2),
time2 = round(rexp(3, 0.10), 2),
time3 = round(rexp(3, 0.10), 2),
stringsAsFactors = FALSE)

ex2.gathered.df = ex2.df %>%
gather(key = "run", value = "time", time1, time2, time3)

ex2.gathered.df


And the new dataframe looks like this:

##    name   run  time
## 1 Alice time1  8.43
## 2   Bob time1  5.77
## 3 Carol time1 13.29
## 4 Alice time2  0.32
## 5   Bob time2  0.56
## 6 Carol time2  3.17
## 7 Alice time3  3.14
## 8   Bob time3  1.45
## 9 Carol time3 27.26

We want the average time for each person, so we group_by name, and then we summarise.

ex2.gathered.df %>%
group_by(name) %>%
summarise(ave.time = mean(time))


Et voilà.

## # A tibble: 3 x 2
##   name  ave.time
##   <chr>    <dbl>
## 1 Alice     3.96
## 2 Bob       2.59
## 3 Carol    14.6

My own use case is different – I had a bunch of summary statistics in each row for one data set, and a set of observations in another. I wanted to put them all in a database table with a column indicating whether it was a summary statistic or not (by the name of the statistic), or whether it was an observation. So my gather statements look like:

a1 = a1 %>%
gather(key = "treatments", value = "stat", mean, sd, min, max)

a2 = a2 %>%
rename(stat = accuracy) %>%
mutate(treatments = "accuracy") %>%
select(colour, method, reduced, baseline, treatments, stat)

a = bind_rows(a1, a2)


## Update

Di Cook reminded me on Twitter

that Hadley has been working to improve that Hadley has been working on two replacements for gather and spread, called pivot_longer and pivot_wider respectively. News about these new functions (they are more than renaming of gather and spread), can be found here. Once I understand them, I will add a section to this post.

## Update 2—using pivot_longer

Note: If you want to try out these new functions, you need to install the development version of {tidyr}. This involved a lot of updates for me, including having to fix broken headers (thanks Apple) on OS X Mojave (I used the TL;DR solution I found on Stack Overflow).

pivot_longer improves gather, at least IMHO, by making it more explicit where information is coming from and where it is going to. Here is the same example using pivot_longer:

## Note: You need the development version of tidyr
## for this to work
## devtools::install_github("tidyverse/tidyr")
library(tidyr)

ex2.pivotlonger.df = example2.df %>%
pivot_longer(cols = matches("time[1-3]"),
names_to = "run",
values_to = "time")

ex2.pivotlonger.df %>%
group_by(name) %>%
summarise(ave.time = mean(time))


The output is identical to above:

## # A tibble: 3 x 2
##   name  ave.time
##   <chr>    <dbl>
## 1 Alice     3.96
## 2 Bob       2.59
## 3 Carol    14.6

I like the way that I can tell pivot_longer which columns to gather using the cols argument, where to store the variable names using names_to, and where to store the values values_to. Will I switch immediately? No. I think I will wait until these functions enter CRAN versions of tidyr. I am old enough to release living on the bleeding edge of technology does not always pay off!

## Update 3

Charco Hui, an MSc student of Anna Fergusson’s and Chris Wild’s has a nice package which animates the gather and spread operations, which can be foundhere. Check it out!

# Collapsing or summarising data over the levels of a factor in the tidyverse

This is a simple post, and undoubtedly many people know this already, but I thought I would share it for people who are having trouble finding such things.

Let us say I have data that looks conceptually like this:

And further, assume I have done all the grunt work in with the wonderful readxl package, and dplyr to get it into a sensible format. If anyone wants to know how to do that, let me know. So our data looks like this in R:

## # A tibble: 160 x 5
##    count replicate side  position size
##    <dbl> <fct>     <fct> <fct>    <fct>
##  1     3 1         left  arm      small
##  2     1 2         left  arm      small
##  3     0 3         left  arm      small
##  4     3 4         left  arm      small
##  5     1 5         left  arm      small
##  6     1 6         left  arm      small
##  7     1 7         left  arm      small
##  8     1 8         left  arm      small
##  9     2 9         left  arm      small
## 10     0 10        left  arm      small
## # … with 150 more rows

In this example I will collapse over levels the variable side. The key functions here are group_by and summarise (or summarize if you hail from a certain country). group_by—as the name suggests—arranges the data into groups. It is important to note that a) this does not change how the data is displayed when you print it or View it, and b) it returns a grouped data frame. The latter is important, as sometimes it can cause problems. Because of that, I recommend you use the ungroup function as well.

The principle is straightforward—use group_by on the variables you wish to keep, and omit the variable(s) you wish to collapse over. In this example, as I said, I will summarise over side. That means I use group_by on replicate, position, and size.

reducedData = myFormattedData %>%
group_by(replicate, position, size) %>%
summarise(count = sum(count)) %>%
ungroup()


This code groups the data into well groups which have the same values of replicate, position, and size. That means that the each group will have, in this case, a pair of observations, one with side equal to left and one with side equal to right. We then summarise each of these pairs of observations by adding their count values. Note that my code might be a little confusing in that I have used the same variable name, count, for the sum of the counts. The net effect is what we want:

## # A tibble: 80 x 4
##    replicate position size  count
##    <fct>     <fct>    <fct> <dbl>
##  1 1         arm      small     5
##  2 1         arm      big       3
##  3 1         hand     small     3
##  4 1         hand     big       3
##  5 2         arm      small     3
##  6 2         arm      big       4
##  7 2         hand     small     2
##  8 2         hand     big       3
##  9 3         arm      small     2
## 10 3         arm      big       3
## # … with 70 more rows

A quick check shows that the original data had a value of 3 for the first replicate that was small and on the arm on the left, and a value of 2 on the right, giving a sum of 5.

# Beware the ticks of R Markdown

### The problem

The icons have disappeared from the top of some of my R chunks in my R Markdown document.

Notice how there is no settings icon, or the little green run icon in the second R chunk. If you have eagle eyes you will see why.

### The solution

If you look very closely in Figure 1, you will see there is an extra set of back ticks at the end of the chunk. Remove these and you fix the problem.

This might seem blindingly obvious to some, but it is really easy to miss in a large Markdown document. It also has follow on consequences in that it makes it harder to run lines of code in a chunk—you have to select them before using Cmd/Ctrl-Enter.

### But wait! There’s more!

This just in, highlighted by Twitter pal Michael MacAskill (@Sakkaden). A space between the  and the {r} can cause the same problem.

Thanks Michael!

# Producing a biplot with labels

If you are hoping for blinding insight in this post, then I think you better stop reading now. A friend asked me how to show him how to display the observation labels (or class labels) in a PCA biplot. Given how long prcomp has been around, this is hardly new information. It might now even be a feature of plot.prcomp. However, for posterity, and for the searchers, I will give some simple solutions. I will, for the sake of pedagogy, even stop effing and blinding about how the Fisher’s Iris data set deserves to be left to crumble into a distant memory of when we used to do computing with a Jacquard Loom.

I will make use of tidyverse functions in this example, because kids these days. I would not worry about it too much. It just makes some of the data manipulation slightly more transparent. Firstly I will load the data. The Iris data set is an internal R data set so the data command will do it.

data(iris)
names(iris)

## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"
## [5] "Species"

Now I will put out the labels into a separate vector

iris.labels = iris %>%
pull(Species) %>%
as.character()
iris.data = iris %>%
select(-Species)


Performing the PCA itself is pretty straightforward. I have chosen to scale the variables in this example for no particular reason. The post is not about PCA so it does not matter too much.

pc = prcomp(iris.data, scale. = TRUE)


The scores—the projected values of the observations—are stored in a matrix called pc$x. First we will produce a plot with just observation number. plot(pc$x[,1:2], type = 'n')
text(pc$x[,1:2], labels = 1:nrow(iris.data))  It would be nice to put the species labels on instead of numbers. The species names are rather long though, so I am going to recode them iris.labels = recode(iris.labels, 'setosa' = 's', 'versicolor' = 'v', 'virginica' = 'i')  And now we can plot the labels if we want:  plot(pc$x[,1:2], type = 'n')
text(pc$x[,1:2], labels = 1:nrow(iris.data))  ### Hey you promised us ggplot2 ya bum! Okay, okay. To do this we need to coerce the scores into a data.frame. The nice thing about doing this is that the principal components are conveniently labelled PC1, PC2, etc. This makes the mapping fairly easy. library(ggplot2) pcs = data.frame(pc$x)
p = pcs %>%
ggplot(aes(x = PC1,
y = PC2,
label = iris.labels)) +
geom_text()
p


Et voila!

# Adding multiple regression lines to a faceted ggplot2 plot

Those of you who know me well know I am not really a ggplot2 type of guy. Not because I do not think that it is good work—I would be an idiot to claim that probably the most downloaded package in the history of CRAN is not good—but because I predate both Hadley and ggplot2 (yes I admit it, I am a dinosaur now) I can do most of what I want in base R. However, there are times in exploratory data analysis where going the ggplot2 route makes more sense. One situation where that happens for me is when I am fitting linear models with a mixture of continuous explanatory variables and factors. Some people have an old-fashioned name for this situation—but really repeating that name would be a get-burned-at-a-stake type of offence. It is useful when fitting linear models of this type to be able to see the behaviour of the response with respect to a continuous predictor for each combination of the levels of the factors. ggplot2 makes it fairly easy to produce this type of plot through its faceting mechanism. It also makes it really to add a fitted line with a pretty confidence interval to each facet. I should say at this point that this is not restricted to linear models, and in fact works for generalised linear models as well, and for semi-parametric models like smoothers. It is cool stuff.

### I want a pony, and I want it NOW Daddy

This is, as I have said, made easy to do in ggplot2 and a half hour of Googling will get you to the point where you can do it with your data. However, I found myself with the following problem. I had a situation where there was a suggestion that an interaction might be significant and so I wanted to explore visually how the fitted models differed with and without interaction. You often find yourself in this situation with tests suggesting the interactions are significant only to find that it is driven by one combination of the factors, or even worse, by a single outlier. Making things explicit, in this problem I was interested in the difference between this model:

$$Y \sim X * F_1 * F_2$$

and this model

$$Y \sim X * F_1 + F_2$$

These models are expressed in standard Wilkinson and Rogers notation, where $$Y$$ and $$X$$ are continuous, and $$F_1$$ and $$F_2$$ are factors. However, the same issues arise if we are interested in

$$Y \sim X * F$$

and this model
$$Y \sim X + F$$

The only difference is that the different levels of the factor $$F$$ can be represented on a single row—if there are not too many levels—or in a wrapped facet plot if there are quite a few levels.

### We need some data

The data in this post come from Hand, D.J., Daly, F., Lunn, A.D., McConway, K.J. & Ostrowski, E. (1994). A Handbook of Small Data Sets Boca Raton, Florida: Chapman and Hall/CRC. The data try to relate the mortality in UK towns and cities—measured as deaths per 100,000 head of population—to the hardness of the water—measured by calcium concentration (in parts-per-million = ppm). The original authors also had the theory that this relationship changed when the town or city was in “The North.” I have always found the English concept of “Up North” rather bizarre. As you leave the outskirts of London you start seeing road signs telling you how far it is to “The North.” In this data set, “The North” is defined as being north of the English city of Derby. The data is available here: water.csv.

water.df = read.csv("water.csv")


It is worth having a quick look at the summary data

library(skimr)
skim(water.df)

## Skim summary statistics
##  n obs: 61
##  n variables: 4
##
## Variable type: factor
##  variable missing complete  n n_unique                     top_counts
##      town       0       61 61       61 Bat: 1, Bir: 1, Bir: 1, Bla: 1
##  ordered
##    FALSE
##
## Variable type: integer
##   variable missing complete  n    mean     sd   p0  p25  p50  p75 p100
##    calcium       0       61 61   47.18  38.09    5   14   39   75  138
##  mortality       0       61 61 1524.15 187.67 1096 1379 1555 1668 1987
##      north       0       61 61    0.57   0.5     0    0    1    1    1


We can see from this that our factor north is not being treated as such. I will recode it, and since we are using ggplot2 we might as well go into the tidyverse too. I think I am going to violate a Hadley principle by overwriting the existing data, but I can live with this.

library(tidyverse)
water.df = water.df %>%
mutate(north = factor(ifelse(north == 1, "north", "south")))


### The initial plot

I am going to plot mortality as a function of calcium and I am going to facet (if that is a verb) by our factor north. There is a good argument to make that because both variables are rates (deaths per 100,000 and parts per million) then we should work on a log-log scale. However, given this is not a post on regression analysis, I will not make this transformation (Spoiler: there is no evidence that the factor makes any difference on this scale which kind of makes it useless to show what I want to show ☺)

library(ggplot2)
p = water.df %>%
ggplot(aes(x = calcium, y = mortality)) + geom_point()

p = p + facet_wrap(~north)
p


It looks like there is evidence of a shift, but is there evidence of a difference in slopes?

water.fit = lm(mortality ~ calcium * north, data = water.df)
summary(water.fit)


##
## Call:
## lm(formula = mortality ~ calcium * north, data = water.df)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -221.27  -75.45    8.52   87.48  310.14
##
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)
## (Intercept)        1692.3128    32.2005  52.555   <2e-16 ***
## calcium              -1.9313     0.8081  -2.390   0.0202 *
## northsouth         -169.4978    58.5916  -2.893   0.0054 **
## calcium:northsouth   -0.1614     1.0127  -0.159   0.8740
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 123.2 on 57 degrees of freedom
## Multiple R-squared:  0.5909, Adjusted R-squared:  0.5694
## F-statistic: 27.44 on 3 and 57 DF,  p-value: 4.102e-11

Ignoring all model checking (it ain’t so bad STATS 201x kids), there does not appear to be much support for the different slopes model. However there is for the different intercepts model. I would like to explore this visually.

What I would like to do is to place the fitted lines from each model on each facet of the plot. In addition, I would like the confidence intervals for each line to be added to each line. There might be a way to do this with geom_smooth, and I would be happy to hear about it. My solution involves three steps:

1. Fitting each of the models, and using predict to get the confidence intervals
2. Adding the fitted line from each model to each facet
3. Adding the confidence intervals from each model to each facet

#### Step 1—getting the data for the confidence intervals

This step involves, as previously mentioned, using the predict function (in fact predict.lm). This should be fairly straightforward. The key here is to add the interval information, and the fitted values to either a new data frame or our existing data frame. I will opt for the latter.

water.fit1 = water.fit ## interaction model
water.fit2 = lm(mortality ~ calcium + north, data = water.df)

water.pred1 = predict(water.fit1, water.df, interval = "confidence")
water.pred2 = predict(water.fit2, water.df, interval = "confidence")

water.df = water.df %>%
mutate(fit1 = water.pred1[,"fit"],
lwr1 = water.pred1[,"lwr"],
upr1 = water.pred1[,"upr"],
fit2 = water.pred2[,"fit"],
lwr2 = water.pred2[,"lwr"],
upr2 = water.pred2[,"upr"])


#### Step 2—Adding the fitted lines

Remember that our plot is stored in the variable p. We will add the fitted lines using the geom_line function. We could have done this with the geom_abline and just the coefficients, however this would have made the method less flexible because we could not accomodate a simple quadratic model for example.

p = p + geom_line(data = water.df,
mapping = aes(x = calcium, y = fit1),
col = "blue", size = 1.2)
p = p + geom_line(data = water.df,
mapping = aes(x = calcium, y = fit2),
col = "red", size = 1.2)
p


We can see already the lack of support for the different slopes model, however, let’s add the confidence intervals.

#### Step 3—Adding the confidence intervals

We add the confidence intervals by using the geom_ribbon function. The aesthetic for geom_ribbon requires two sets of y-values, ymin and ymax. These, clearly, are the values we calculated for each of the confidence intervals. We set the color of the bands by setting fill. Setting col controls the border colour of the ribbon. We make these transparent (so they do not obscure our fitted lines), by setting an alpha value closer to zero than to one (0.2 seems to be a good choice).

p = p + geom_ribbon(data = water.df,
mapping = aes(x = calcium, ymin = lwr1, ymax = upr1),
fill = "blue", alpha = 0.2)
p = p + geom_ribbon(data = water.df,
mapping = aes(x = calcium, ymin = lwr2, ymax = upr2),
fill = "red", alpha = 0.2)
p


As expected the intervals completely overlap, thus supporting our decision to choose the simpler model.

### Well that sucked

How about if there really was a slope effect? I have artificially added one, and redrawn the plot just so we can see what happens when there really is a difference.

## just so we all get the same result
set.seed(123)

## get the original coefficient
b = coef(water.fit1)

## and alter the effect for the interaction
b[4] = -5 # make a big difference to the slopes

## get the standard deviation of the residuals
sigma = summary(water.fit1)$sigma ## and use it to make a new set of responses water.df = water.df %>% mutate(mortality = (b[1] + (north == 'north') * b[2]) + (b[2] + (north == 'north') * b[4]) * calcium + rnorm(61, 0, sigma)) ## Now we repeat all the steps above water.fit1 = lm(mortality ~ calcium * north, data = water.df) ## interaction model water.fit2 = lm(mortality ~ calcium + north, data = water.df) water.pred1 = predict(water.fit1, water.df, interval = "confidence") water.pred2 = predict(water.fit2, water.df, interval = "confidence") water.df = water.df %&gt;% mutate(fit1 = water.pred1[,"fit"], lwr1 = water.pred1[,"lwr"], upr1 = water.pred1[,"upr"], fit2 = water.pred2[,"fit"], lwr2 = water.pred2[,"lwr"], upr2 = water.pred2[,"upr"]) p = water.df %&>% ggplot(aes(x = calcium, y = mortality)) + geom_point() p = p + facet_wrap(~north) p = p + geom_line(data = water.df, mapping = aes(x = calcium, y = fit1), col = "blue", size = 1.2) p = p + geom_line(data = water.df, mapping = aes(x = calcium, y = fit2), col = "red", size = 1.2) p = p + geom_ribbon(data = water.df, mapping = aes(x = calcium, ymin = lwr1, ymax = upr1), fill = "blue", alpha = 0.2) p = p + geom_ribbon(data = water.df, mapping = aes(x = calcium, ymin = lwr2, ymax = upr2), fill = "red", alpha = 0.2) p  ### One final note I suspect, but I have not checked properly, that when geom_smooth is used, that the fitted lines on each facet are derived from the subset of data in that facet, therefore the standard errors might be slightly larger (because less data is used to estimate $$\sigma$$. However, I am happy to be corrected (and if I am wrong I will remove this section :-)) # Some quirks of JAGS I return to JAGS infrequently enough these days that I forget a lot of things. Do not get me wrong. I love JAGS and I think it is one of the most valuable tools in my toolbox. However administrative and professional duties often destroy any chance I might have of concentrating long enough to actually do some decent statistics. Despite this I have been working with my colleagues and friends, Duncan Taylor and Mikkel Andersen–who has been visiting from Aalborg since February and really boosting my research, on some decent Bayesian modelling. Along the way we have encountered a few idiosyncrasies of JAGS which are probably worth documenting. I should point out that the issues here have been covered by Martyn Plummer and others on Stack Overflow and other forums, but perhaps not as directly. ### Local variables JAGS allows local variables. In fact you probably use them without realizing as loop indices. However, what it does not like is the declaration of local variables within a loop. That is model{ k <- 10 }  is fine, whereas model{ for(i in 1:10){ k <- 10 } }  is not. This would be fine, except the second code snippet will yield an error which tells you that k is an unknown variable and asks you to:  Either supply values for this variable with the data or define it on the left hand side of a relation. I think this error message is a bit of a head scratcher because look at your code and say “But I have defined it, and it works in R.” – do not fall into the latter mode of thinking – it’s a trap! There are a couple of solutions to this problem. Firstly, you could do as I did and give up on having your model code readable, and just not use temporary variables like this, or, secondly, you could like the temporary variable vary with the index, like so model{ for(i in 1:10){ k[i] <- 10 } }  ### Missing data / ragged arrays JAGS cannot deal with missing covariates very well. However, it is happy enough for you to include these observations and “step over” them some how. An example of this might be a balanced experiment where some of the trials failed. As an example let us consider a simple two factorial completely randomised design where each factor only has two levels. Let the first factor have levels A and B, and the second factor have levels 1 and 2. Furthermore let there be 3 replicates for each treatment (combination of the factors). Our (frequentist) statistical model for this design would be $$y_{ijk}=\mu + \alpha_i + \beta_j + \gamma_{ij} + \varepsilon_{ijk},~i\in\{A,B\},j\in\{1,2\},k=1,\ldots,3,\varepsilon_{ijk}\sim N(0,\sigma^2)$$ And we would typically programme this up in JAGS as something like this model{ for(i in 1:2){ for(j in 1:2){ for(k in 1:3){ y[i,j,k] ~ dnorm(mu[i,j], tau) } mu[i,j] <- alpha[i] + beta[j] + gamma[i,j] } } for(i in 1:2){ alpha[i] ~ dnorm(0, 0.000001) } for(j in 1:2){ beta[j] ~ dnorm(0, 0.000001) } for(i in 1:2){ for(j in 1:2){ gammma[i,j] ~ dnorm(0, 0.000001) } } tau ~ dgamma(0.001, 0.001) }  Implicit in this code is that y is a 2 x 2 x 3 array, and that we have a fully balanced design. Now let us assume that, for some reason, the treatments $$\tau\in\{A1,A2,B1,B2\}$$ have been replicated 2, 3, 2, 3 times respectively. We can deal with this in JAGS by creating another variable in our input data which we will call reps with reps = c(3, 2, 3, 2). This then can be accommodated in our JAGS model by model{ for(i in 1:2){ for(j in 1:2){ for(k in 1:reps[(i-1) * 2 + j]){ y[i,j,k] ~ dnorm(mu[i,j], tau) } mu[i,j] <- alpha[i] + beta[j] + gamma[i,j] } }  y is still a 2 x 2 x 3 array, but simply has NA stored in the positions without information. It is also worth noting that ever since JAGS 4.0 the syntax for loops allows for(v in V){  where V is a vector containing integer indices. This is very useful for irregular data. NOTE I have not compiled the JAGS code in my second example, so please let me know if there are mistakes. This is an ongoing article and so I may update it from time to time. # Conference Programmes in R It has been a while folks! Lately – over the last month – I have been writing code that helps with the production of a timetable and abstract for conferences. The latest version of that code can be found here, the associated bookdown project here, and the final result here. Note that the programme may change a good number of times before next Sunday (December 10, 2017) when the conference starts. The code above is an improved version of the package I wrote for the Biometrics By The Border conference which worked solely with Google Sheets, by using it as a poor man’s database. I wouldn’t advise this because it is an extreme bottleneck when you constantly need to refresh the database information. The initial project was driven by necessity. I foolishly suggested we use EasyChair to collect information from speakers, not realising that the free version of EasyChair does not allow you to download the abstracts your speakers helpfully provided to you (and fair enough–they are trying to make some money from this service). Please note that if I am incorrect in this statement, I’d be more than happy to find out. A word of advice: if this is still the case, then get your participants to upload an R Markdown (or just plain text) file instead of using the abstract box on the webpage. You can download these files from EasyChair in a single zip file. The second project was also driven by necessity, but the need in this case had more to do with the sheer complexity of organising several hundred talks over a four day programme. Now that everything is in place, most changes requested by speakers or authors can be accomodated in a few minutes by simply moving the talk on the Google sheet that controls the programme, calling several R functions, recompiling the book and pushing it up to github. #### So how does it all work? The current package does depend on some basic information being stored on Google Sheets. The sheets for my current conference (Joint Conference of the New Zealand Statistical Association and the International Association of Statistical Computing (Asian Regional Section)) can be viewed here on Google Sheets. Not all the worksheets here are important. The key sheets are: Monday, Tuesday, Wednesday, Thursday, All Submissions, Allocations, All_Authors, Monday_Chairs, Tuesday_Chairs, Wednesday_Chairs, and Thursday_Chairs. The first four and the last four sheets (Monday-Thursday) are hand created. The colours are for our convenience and do not matter. The layout does, and some of this is hard-coded into the project. For example all though there are seven streams a day, only six of those belong directly to us as the seventh belong to a satellite conference. The code relies on every scheduled event (including meal breaks) having a time adjacent to it in the Time column. It also collects information about the rooms as the order of these rooms does change on some days. The All_Authors sheet comes from EasyChair. EasyChair provides the facility to download author email information as an Excel spreadsheet which in turn can be uploaded to Google Sheets easily. This sheet is simply the sheet labelled All in the EasyChair file. The All Submissions sheet is a copy-and-paste from the EasyChair submissions page. I probably could webscrape this with a little effort. The Allocations sheet is mostly reflective of the All Submissions sheet, but has user added categorizations that allow us to group the talks into sensible sessions. It also uses formulae to format the titles of the talks so that each title word is capitalized (this is an imperfect process), and that the name of the submitter (who we have assumed is the speaker) is appended to the title so that it can be pasted into the programme sheets. #### How do we get data out of Google Sheets? Enter googlesheets. The code works in a number of distinct phases. The first is capturing all the relevant information from the web in placing in a SQLite database. I used Jenny Bryan’s googlesheets package to extract the data from my spreadsheets. The process is quite straightforward, although I found that it worked a little more smoothly on my Mac than on my Windows box, although this may have more to do with the fact that the R install was brand new, and so the httr package was not installed. The difference between it being installed and not installed is that when it is installed authentication (with Google) happens entirely within the browser, whereas without you are required to copy and paste an authentication code back into R. When you are doing this many times, the former is clearly more desirable. Interaction with Google Sheets consists of first grabbing the workbook (Google Sheets does not use this term, it comes from Excel, but it encapsulates the idea that you have a collection of one or more worksheets in any one project, than the folder that contains them all is called a workbook), and then asking for information from the individual sheets. The request to see a workbook is what will prompt the authentication, e.g. library(googlesheets) mySheets = gs_ls() mySheets  The call to gs_ls will prompt authentication. It is important to note that this authentication will last for a certain period of time, and should only require interaction with a web browser the very first time you do it, and never again. Subsequent requests will result in a message along the lines of Auto-refreshing stale OAuth token. The call to gs_ls will return a list of available workbooks. A particular workbook may be retrieved by calling gs_title. For example, this function allows me to get to the conference workbook: getProgrammeSheet = function(title = "NZSA-IASC-ARS 2017"){ mySheets = gs_ls() ss = gs_title(title) return(ss) }  I can use the object returned by this function in turn to access the worksheets I want to work with using the gs_read. The functions in googlesheets are written to be compliant with the tidyverse, and in particular the pipe %>% operator. I will be the first to admit this is not my preferred way of working and I could have easily worked around it. However, in the interests of furthering my skills, I have tried to make the project tidyverse compliant and nearly all the functions will work using the pipe operator, although they take a worksheet or a database as their first argument rather than a tibble. Once I have a workbook object (really just an authenticated connection), I can then read each of the spreadsheets. This is done by a function called updateDB. The only thing worth commenting on in this function is that the worksheets for each of the days have headers which do not really resolve well into column names. They also have a fixed set of rows and columns that we wish to capture. To that end, the range is specified for each day, and the column headers are simply set to be the first seven (A–G) capital letters of the alphabet. These sheets are stored as tibbles which are then written to an internal SQLite database using the dbWriteTable function. There are eight functions (createRoomsTbl, createAffilTbl, createTitleTbl, createAbstractTbl, createAuthorTbl, createAuthorSubTbl, createProgTbl, createChairTbl) which operate on the database/spreadsheet tables to produce the database tables we need for generating the conference timetable, and for the abstract booklet. These functions are rarely called by themselves—we tend to call a single omnibus function rebuildBD. This function allows the user to refresh the information from the web if need be, and to recreate all of the tables in the database. The bottleneck in this function is retrieving the information from the internet which may take around 20-30 seconds depending on the connection speed. The database tables provide information for four functions: writeTT, writeProg, writeIndices, and writeSessionChairs. Each of these functions produces one or more R Markdown files in the directory used by the bookdown project. #### Bookdown, ePub and gitBook The final product is generated using bookdown. Bookdown, in explanation sounds simple. Implementation is really improved by the help of a master. I found this blog post by Sean Kross very helpful along with his minimal bookdown project from github. It would be misleading of me to suggest that the programme book was really produced using R Markdown. Small elements are in markdown, but the vast majority of the formatting is achieved by writing code which writes HTML. This is especially true of the conference timetable, and the hyperlinking of the abstracts to the timetable and other indices. The four functions listed above write out six markdown pages. These are the conference timetable, the session chairs table, four pages for each of the days of the conference, and two indices, one linking talks to authors, and one linking talk titles to submission numbers (which for the most part were issued by EasyChair). There is not a lot more to discussion involved here. Sean’s project sets up things in such a way that changes to the markdown or the yaml files will automatically trigger a rebuild. #### Things I struggled with Our conference had six parallel streams. Whilst it is easy enough to make tables that will hold all this information, it is very difficult to decide how best to display this in a way that would suit everyone. The original HTML tables were squashed almost to the point where there was a character per line on a mobile phone screen. We overcame this slightly by fixing the widths of the div element that holds the tables and adding horizontal scrolling. Many people found this feature confusing, and it did not necessarily translate into ePub. We then added some Javascript functionality by way of the tablesaw library. This allowed us to keep the time column persistant no matter which stream people were looking at, and it allowed better scrolling of streams that were offscreen. However, this was still as step too far for the technologically challenged. In the end we resorted to printing out the timetable. I also used excellent Calibre software to take the ePub as input and output it in other format—most usefully Microsoft Word’s .docx format. I know some of you are shuddering at this thought, but it did allow me to create a PDF with the programme timetable rotated and then create a PDF. This made the old fogeys immensely happy, and me immensely irritated, as I thought the gitBook version was quite useful. #### Not forgetting the abstracts Omitted from my workflow so far is mention of the abstracts. We had authors upload LaTeX (.tex) files to EasyChair, or text files. If you don’t do this, and they use EasyChair’s abstract box, then you have to find a way to scrape the data. There are downsides in doing so (and it even occurs in the user submitted files) in that some unicode text seems to creep in. Needless to say, even after I used an R function to convert all the files to markdown, we still had to do a bunch of manual cleaning. #### Anyway I hope someone finds this work useful. I have no intention of running a conference again for at least four years, but I would appreciate it if anyone wants to build on my work. # How do I match that? This is not a new post, but a repost after a failed WordPress upgrade One of the projects I am working on is the 3rd edition of Introduction to Bayesian Statistics, a text book predominantly written by my former colleague Bill Bolstad. It will be out later this year if you are interested. One of the things which will make no difference to the reader but will make a lot of difference to me is the replacement of all the manually numbered references in the book for things like chapters, sections, tables and figures. The problem I am writing about today arose from programmatically trying to label the latter — tables and figures — in LaTeX. LaTeX’s reference system, as best I understand it, requires that you place a \label command after the caption. For example \begin{figure} \centering \includegraphics{myfig} \caption{Here is a plot of something} \label{fig:myfig} \end{figure}  This piece of code will create a label fig:myfig which I can later use in a \ref{fig:myfig} command. This will in turn be replaced at compile time with a number dynamically formatted according to the chapter and the number of figures which precede this plot. The challenge The problem I was faced with is easy enough to describe. I needed to open each .tex file, find all of the figure and table environments, check to see if they contained a caption and add a label if they did. Regular expressions to the rescue Fortunately I know a bit about regular expressions, or at least enough to know when to ask for help. To make things more complicated for myself, I did this in R. Why? Well basically because a) I did not feel like dusting off my grossly unused Perl — I’ve been clean of Perl for 4 years now and I intend to stay that way — and b) I could not be bothered with Java’s file handling routines – I want to to be able to open files for reading with one command, not 4 or 8 or whatever the hell it is. Looking back I could have used C++, because the new C+11 standard finally has a regex library and the ability to have strings where everything does not have to be double escaped, i.e. I can write R”\label” to look for a line that has a \label on it rather than “\\label” where I have to escape the backslash. And before anyone feels the urge to suggest a certain language I remind you to “Just say no to Python!” Finding the figure and table environments is easy enough. I simply look for the \begin{figure} and \begin{table} tags, as well as the environment closing tags \end{figure} and \end{table}. It is possible to do this all in one regular expression, but I wanted to capture the \begin and \end pairs. I also wanted to deal with tables and figures separately. The reason for this is that it was possible to infer the figure labels from Bill’s file naming convention for his figures. The tables on the other hand could just be labelled sequentially, i.e. starting at 1 and counting upwards with a prefix reflecting the chapter number. Lines = readLines("Chapter.tex") begin = grep("\\begin\\{figure\\}", Lines) end = grep("\\end\\{figure\\}", Lines) n = length(begin) if(n != length(end)) stop("The number of begin and end pairs don't match") ## Now we can work on each figure environment in turn for(k in 1:n){ b = begin[k] e = end[k] block = paste0(Lines[b:e], collapse = "\n") if(!grepl("\\label", block){ ## only add a label if there isn't one already ## everything I'm really talking about is going to happen here. } }  So what I needed to be able to do was find the caption inside my block and then insert a label. Easy enough right? I should be able to write a regular expression to do that. How about something like this: pattern = "\\caption\\{([^}]+)\\}  That will work most of the time, except as you will find out when the caption contains braces itself, and we have some examples that do have just that \caption{If \emph{A} is true then \emph{B} is true.'' Deduction is possible.}  My first regular expression would only match up to the end of \emph{A}, which does not help me. I need something that could, in theory match an unlimited number of inner sets of braces. Matching nested parentheses Fortunately matching nested parentheses is a well-known problem and Hadley Wickham tweeted me a Stack Overflow link that got me started. There is also a similar solution on page 330 of Jeffrey Friedl’s very useful Mastering Regular Expressions book. The solution relies on a regular expression which employs recursion. Set perl = TRUE to use PCRE (and recursion) in R To make this work in R we have to make sure the PCRE library is used, and this is done by setting perl = TRUE in the call to gregexpr This is my solution: ## insert the label after the caption pat = “caption(\\{([^{}]+|(?1))*\\})” m = gregexpr(pat, block, perl = T) capStart = attr(m[[1]], “capture.start”, TRUE)[1] capLength = attr(m[[1]], “capture.length”, TRUE)[1] strLabel = paste0(“\\”,”label{fig:”, figNumber, “}\n”) newBlock = paste0(substr(block, 1, capStart + capLength), strLabel, substr(block, capStart + capLength + 1, nchar(block))) The regular expression assigned to pat is where the work gets done. Reading the expression from left to right it says: match caption literally open the first capture group match { literally open the second capture group match one or more instances of anything that is not an open brace { or a end brace } or open the third capture group and recursively the first sub-pattern. I will elaborate on this more in a bit close the second and third capture groups and ask R to match this pattern zero or more times literally match the end brace } close the first capture group I would be the first to admit that I do not quite understand what ?1 does in this regexp. The initial solution used ?R. The effect of this was that I could match all sets of paired braces within block, but I could not specifically match the caption. As much as I understand this, it seems to limit the recursion to the outer (first) capture group. I would be interested to know. The rest of the code breaks the string apart, inserts the correct label, and creates a new block with the label inserted. I replace the first line of the figure environment block with this new string, and keep a list of the remaining line numbers so that they can be omitted when the file is written back to disk. # An introduction to using Rcpp modules in an R package ### Introduction The aim of this post is to provide readers with a minimal example demonstrating the use of Rcpp modules within an R package. The code and all files for this example can be found on https://github.com/jmcurran/minModuleEx. ### What are Rcpp Modules? Rcpp modules allow R programmers to expose their C++ class to R. By “expose” I mean the ability to instantiate a C++ object from within R, and to call methods which have been defined in the C++ class definition. I am sure there are many reasons why this is useful, but the main reason for me is that it provides a simple mechanism to create multiple instances of the same class. An example of where I have used this is my multicool package which is used to generate permutations of multisets. One can certainly imagine a situation where you might need to generate the permutations of more than two multisets at the same time. multicool allows you to do this by instantiating multiple multicool objects. ### The Files I will make the assumption that you, the reader, know how to create a package which uses Rcpp. If you do not know how to do this, then I suggest you look at the section entitled “Creating a New Package” here on the Rstudio support site. Important: Although it is mentioned in the text, the image displayed on this page does not show that you should change the Type: drop down box to Package w/ Rcpp. This makes sure that a bunch of fields are set for you in the DESCRIPTION file that ensure Rcpp is linked to and imported. There are five files in this minimal example. They are • DESCRIPTION • NAMESPACE • R/minModuleEx-package.R • src/MyClass.cpp • R/zzz.R I will discuss each of these in turn. #### DESCRIPTION This is the standard DESCRIPTION file that all R packages have. The lines that are important are: Depends: Rcpp (>= 0.12.8) Imports: Rcpp (>= 0.12.8) LinkingTo: Rcpp RcppModules: MyModule  The imports and LinkingTo lines should be generated by Rstudio. The RcppModules: line should contain the names(s) of the module(s) that you want to use in this package. I have only one module in this package which is unimaginatively named MyModule. The module exposes two classes, MyClass and AnotherClass. #### NAMESPACE and R/minModule-Ex.R The first of these is the standard NAMESPACE file and it is automatically generated using roxygen2. To make sure this happens you need select Project Options… from the Tools menu. It will bring up the following dialogue box: Select the Build Tools tab, and make sure that the Generate documentation with Roxygen checkbox is ticked, then click on the Configure… button and make sure that that all the checkboxes that are checked below are checked: Note: If you don’t want to use Roxygen, then you do not need the R/minModuleEx-package.R file, and you simply need to put the following three lines in the NAMESPACE file: export(AnotherClass) export(MyClass) useDynLib(minModuleEx)  You need to notice two things. Firstly this NAMESPACE explicitly exports the two classes MyClass and AnotherClass. This means these classes are available to the user from the command prompt. If you only want access to the classes to be available to R functions in the package, then you do not need to export them. Secondly, as previously noted, if you are using Roxygen, then these export statements are generated dynamically from the comments just before each class declaration in the C++ code which is discussed in the next section. The useDynLib(minModuleEx) is generated from the line #' @useDynLib minModuleEx  in the R/minModuleEx-package.R file. #### src/MyClass.cpp This file contains the C++ class definition of each class (MyClass and AnotherClass). There is nothing particularly special about these class declarations, although the comment lines before the class declarations, //' @export MyClass class MyClass{  and //' @export AnotherClass class AnotherClass{  , generate the export statements in the NAMESPACE file. This file also contains the Rcpp Module definition: RCPP_MODULE(MyModule) { using namespace Rcpp; class_<MyClass>( "MyClass") .default_constructor("Default constructor") // This exposes the default constructor .constructor<NumericVector>("Constructor with an argument") // This exposes the other constructor .method("print", &MyClass::print) // This exposes the print method .property("Bender", &MyClass::getBender, &MyClass::setBender) // and this shows how we set up a property ; class_<AnotherClass>("AnotherClass") .default_constructor("Default constructor") .constructor<int>("Constructor with an argument") .method("print", &AnotherClass::print) ; }  In this module I have: 1. Two classes MyClass and AnotherClass. 2. Each class class has: • A default constructor • A constructor which takes arguments from R • A print method 3. In addition, MyClass demonstrates the use of a property field which (simplistically) provides the user with simple retrieval from and assignment to a scalar class member variable. It is unclear to me whether it works for more data types, but anecdotally, I had no luck with matrices. #### R/zzz.R As you might guess from the nonsensical name, it is not essential to call this file zzz.R. The name comes from a suggestion from Dirk Eddelbuettel. It contains a single, but absolutely essential line of code loadModule("MyModule", TRUE)  This code can actually be in any of the R files in your package. However, if you explicitly put it in R/zzz.R then it is easy to remember where it is. ### Using the Module from R Once the package is built and loaded, using the classes from the module is very straightforward. To instantiate a class you use the new function. E.g. m = new(MyClass) a = new(AnotherClass)  This code will call the default constructor for each class. If you want to call a constructor which has arguments, then they can be added to the call to new. E.g. set.seed(123) m1 = new(MyClass, rnorm(10))  Each of these objects has a print method which can be called using the $ operator. E.g.

m$print() a$print()
m1$print()  The output is > m$print()
1.000000 2.000000 3.000000
> a$print() 0 > m1$print()
1.224082 0.359814 0.400771 0.110683 -0.555841 1.786913 0.497850 -1.966617 0.701356 -0.472791


The MyClass class has a module property – a concept also used in C#. A property is a scalar class member variable that can either be set or retrieved. For example, m1 has been constructed with the default value of bBender = FALSE, however we can change it to TRUE easily

m1$Bender = TRUE m1$print()


Now our object m1 behaves more like Bender when asked to do something 🙂

> m1\$print()
Bite my shiny metal ass!


Hopefully this will help you to use Rcpp modules in your project. This is a great feature of Rcpp` and really makes it even more powerful.