# Some quirks of JAGS

I return to JAGS infrequently enough these days that I forget a lot of things. Do not get me wrong. I love JAGS and I think it is one of the most valuable tools in my toolbox. However administrative and professional duties often destroy any chance I might have of concentrating long enough to actually do some decent statistics.

Despite this I have been working with my colleagues and friends, Duncan Taylor and Mikkel Andersen–who has been visiting from Aalborg since February and really boosting my research, on some decent Bayesian modelling. Along the way we have encountered a few idiosyncrasies of JAGS which are probably worth documenting. I should point out that the issues here have been covered by Martyn Plummer and others on Stack Overflow and other forums, but perhaps not as directly.

### Local variables

JAGS allows local variables. In fact you probably use them without realizing as loop indices. However, what it does not like is the declaration of local variables within a loop. That is

model{
k <- 10
}


is fine, whereas

model{
for(i in 1:10){
k <- 10
}
}


is not. This would be fine, except the second code snippet will yield an error which tells you that k is an unknown variable and asks you to:  Either supply values for this variable with the data or define it on the left hand side of a relation. I think this error message is a bit of a head scratcher because look at your code and say “But I have defined it, and it works in R.” – do not fall into the latter mode of thinking – it’s a trap!

There are a couple of solutions to this problem. Firstly, you could do as I did and give up on having your model code readable, and just not use temporary variables like this, or, secondly, you could like the temporary variable vary with the index, like so

model{
for(i in 1:10){
k[i] <- 10
}
}


### Missing data / ragged arrays

JAGS cannot deal with missing covariates very well. However, it is happy enough for you to include these observations and “step over” them some how. An example of this might be a balanced experiment where some of the trials failed. As an example let us consider a simple two factorial completely randomised design where each factor only has two levels. Let the first factor have levels A and B, and the second factor have levels 1 and 2. Furthermore let there be 3 replicates for each treatment (combination of the factors). Our (frequentist) statistical model for this design would be

$$y_{ijk}=\mu + \alpha_i + \beta_j + \gamma_{ij} + \varepsilon_{ijk},~i\in\{A,B\},j\in\{1,2\},k=1,\ldots,3,\varepsilon_{ijk}\sim N(0,\sigma^2)$$

And we would typically programme this up in JAGS as something like this

model{
for(i in 1:2){
for(j in 1:2){
for(k in 1:3){
y[i,j,k] ~ dnorm(mu[i,j], tau)
}
mu[i,j] <- alpha[i] + beta[j] + gamma[i,j]
}
}

for(i in 1:2){
alpha[i] ~ dnorm(0, 0.000001)
}

for(j in 1:2){
beta[j] ~ dnorm(0, 0.000001)
}

for(i in 1:2){
for(j in 1:2){
gammma[i,j] ~ dnorm(0, 0.000001)
}
}

tau ~ dgamma(0.001, 0.001)
}


Implicit in this code is that y is a 2 x 2 x 3 array, and that we have a fully balanced design. Now let us assume that, for some reason, the treatments $$\tau\in\{A1,A2,B1,B2\}$$ have been replicated 2, 3, 2, 3 times respectively. We can deal with this in JAGS by creating another variable in our input data which we will call reps with reps = c(3, 2, 3, 2). This then can be accommodated in our JAGS model by

model{
for(i in 1:2){
for(j in 1:2){
for(k in 1:reps[(i-1) * 2 + j]){
y[i,j,k] ~ dnorm(mu[i,j], tau)
}
mu[i,j] <- alpha[i] + beta[j] + gamma[i,j]
}
}


y is still a 2 x 2 x 3 array, but simply has NA stored in the positions without information. It is also worth noting that ever since JAGS 4.0 the syntax for loops allows

for(v in V){


where V is a vector containing integer indices. This is very useful for irregular data.

NOTE I have not compiled the JAGS code in my second example, so please let me know if there are mistakes. This is an ongoing article and so I may update it from time to time.

# Python and statistics – is there any point?

This semester I gave my graduate student class a project. The brief was relatively simple: implement the iteratively reweighted least squares (IRLS) algorithm to perform a simple (single covariate) logistic regression in Python. Their programmes were supposed to be able to read data in from a text file, perform the simple matrix algebra and math needed to carry out the IRLS computation and return some formatted output – similar to that you would get from R’s summary.glm function. Of course, you do not need matrix algebra to do this, but the idea was for the students to learn a bit of mathematical statistics that they had not seen before.  On the IRLS front, they were allowed to use a simple least squares routine like numpy’s linalg.lstsq and some of numpy’s simple matrix operators, but expressly forbidden from simply loading pandas or statsmodels and using the generalized linear models functions contained therein.

I thought this sounded like a straightforward enough task. The students divided themselves into pairs to work on it, and they had 13 weeks to complete the task.

The kicker was that I did not provide any instruction, either in Python or in the IRLS algorithm. An aim of the project was to simulate the situation where someone asks you to solve a problem, and you have to go and do some research to do it. Their first task was to complete 100 exercises on codeacademy.com as a reasonable introduction to a language none of them had seen before.

## Problems – versions

There are two major versions of Python in the wild, 2.7 and 3.4. Codeacademy teaches using version 2.7. One fundamental difference between 2.7 and 3.4 is the syntax of the print function. All of my students are users of R, to varying levels of skill. When they go to install R at home, they know to go to the CRAN website, or a mirror, and download the current, stable release of R. If they followed this policy, as I did myself, then they would have installed Python 3.4 and found that the way they were taught to use print by Codeacademy does not work, without any sort of helpful “That syntax has been depricated. Python 3 onwards uses the syntax…” This is not the only issue, with the way Python 3.4 handles execution of loops over numbered ranges being another example of a fundamental difference.

## Problems – platform issues

Most students at my institution use Windows, especially at home. There is some Mac penetration, and Linux is virtually non-existent (these are statistics students, not computer science remember). The official Python installers work perfectly well under Windows in my experience. However, then we come to the issue of installing numpy. The official advice from the numpy website seems to be “download a third party version of Python which already has it.” For students who come from a world where a package can be installed by going to a menu, this is less than useful. The common advice from the web is that “there is no official release of numpy 1.8.1 for Python 2.7 or higher for Windows” but that you can download it and install it from a the builds very thoughtfully provided by Christoph Gohlke at UC Irvine here. Christoph’s builds work fine, but again, for something that seems, at least from the outside, very mainstream in the Python community should the user have to go to this level of effort?

## Problems – local installations

Like any instructor, I face the issue that a number of my students have no option but to use the computer laboratories provided for them by the university. This means that we encounter the issue of local installation of libraries for users. Most, if not all, R packages from CRAN can be installed in a local library. As far as I can tell, this is not true for a Windows installation of Python. I am happy to be corrected on this point. The aforementioned Python binaries come with proper Windows installers, which want to install into the Python root directory, something students do not have permission to do. If I had realized this problem in December of last year, I could have asked the admins to pre-install it for all users, however, given I only formulated the problem in February, it was just a tad too late.

## Would I do it again?

I might, but there would have to be serious efforts to resolve the problems listed above on my part. It also would not solve problems of students trying to set up Python at home, and I do not feel like hand-holding people through an installation process. My initial plan had been to try Javascript. I may return to this idea.

I would be the first to admit that I am not a Python user, but I am an experienced programmer with over thirty years of experience in at least a dozen different languages, and on multiple platforms. I know many people find Python a very useful language for their scientific computing, and I am not attempting to bad mouth the language – it seems a decent enough language with the constructs and functionality that I would expect to find in any modern language – but I do not think there is much incentive for a statistician to move away from R, or an R/C++ combination when raw compute power is required.

I am glad that my students experienced programming in a non-vectorized language. R does give a distorted perspective on programming with regards to its handling of vectors, and I think it is beneficial for students to learn about flow structures for element-wise computation.

## Update

Nat Dudley has made the suggestion I used on online IDE like nitrous.io.

Despite the difficulties, nearly all of my students have managed to complete the task, and some have done an exceptionally good job, even adding in the ability to parse R-like formulae.

# Bayesian modelling of left-censored data using JAGS

Bayesian modelling of left-censored data using JAGS

Martyn Plummer's JAGS very helpfully provides us with a way to model censored data through the use of the dinterval distribution. However, almost all of the examples that one finds on the web are for right censored data. The changes to model left censored data are not major, but I do think they warrant a) a post/page of their own and b) hopefully an easy-to-understand example.

Left-censored data arises very commonly when dealing with detection limits from instrumentation. In my own work, I often end up involved in the modelling of data derived from electropherograms.

I will start by generating some left censored data. For simplicity I am going to assume that my data is exponentially distributed, with a true rate of 1.05 ($$\lambda = 1.05$$), and a detection/censoring threshold at log(29). This means that approximately 97.1% of my data (on average) will not exceed my detection threshold. This may seem extreme, but it is the kind of setup that is common in my own work.

set.seed(35202)
x = rexp(10000, rate = 1.05)

## set all the censored values to NA's
x[x < log(29)] = NA


This gives us a data set of size 10,000 with 9,691 values that fall below the detection threshold.

It can be a useful check, if feasible, to see what the maximum likelihood estimate is. We can do this in R using the optim function

## define the log-likelihood
logLik = function(lambda){
isCensored = is.na(x)
nCensored = sum(isCensored)
LOD = log(29)

ll = sum(dexp(x[!isCensored], rate = lambda,
log = TRUE)) +
nCensored  * pexp(LOD, rate = lambda,
log = TRUE)

## return the -ve value because optim
## is a minimizer
return(-ll)
}

fit = optim(0.5, logLik, method = "Brent",
lower = 0, upper = 10,
hessian = TRUE)
fisherInfo = 1/fit$hessian sigma = sqrt(fisherInfo) upper = fit$par + 1.96 * sigma
lower = fit$par - 1.96 * sigma interval = c(lower, fit$par, upper)
names(interval) = c("lower", "MLE", "upper")
interval

## lower   MLE upper
## 1.001 1.032 1.064


So the 95% confidence interval contains our true value which is always a good start!

The trick, if there is any, to dealing with left-censored data in JAGS is to make sure that your indicator variable tells JAGS which variables are above the detection threshold.

So in the next step I will set up the list that contains my data.

bugsData = list(N = length(x),
isAboveLOD = ifelse(!is.na(x),
1, 0),
x = x,
LOD = rep(log(29), length(x)))


There are two points to make about the preceding code. Firstly the variable isAboveLOD uses the NA status of the data. If you have not recoded your censored values to NA then obviously this will not work. Secondly, there is a limit of detection vector LODregardless of whether the observation is below the limit of detection or not.

Next we need to set up some intial values. The key here is setting not only an initial value for $$\lambda$$, but initial values for the observations that have been censored.

bugsInits = list(list(lambda = 0.5,
x = rep(NA, length(x))))

## set the missing values to
## random variates from
## U(0, log(29))
nMissing = sum(!bugsData$isAboveLOD) bugsInits[[1]]$x[!bugsData$isAboveLOD] = runif(nMissing, 0, log(29))  I have chosen uniform random variates between zero and the limit of detection (log(29)) as initial values for my censored data. Now we need to set up our JAGS model. I will store this in a string, and then use writeLines to write this to disk. modelString = " model{ for(i in 1:N){ isAboveLOD[i] ~ dinterval(x[i], LOD[i]) x[i] ~ dexp(lambda) } lambda ~ dgamma(0.001, 0.001) } " writeLines(modelString, con = "model.bugs.R")  Note that I have used a vague gamma prior for $$\lambda$$. This is not especially noteworthy, except from the point of view about being explicit about what I have done. dinterval returns 0 if x[i] < LOD[i] and a 1 otherwise. Many people who try to use dinterval often get this cryptic error Observed node inconsistent with unobserved parents at initialization. This can happen for two reasons. Firstly the indicator variable can be incorrectly set, that is the observations above the limit of detection have been coded with a 0 instead of a 1 and vice versa for the censored observations. Secondly, the error can occur because the initial values for the censored observations are outside of the censored interval. We now have the three components necessary to fit our model: a list containing the data, a list of initial values, and a BUGS model. Firstly we initialize the model library(rjags)  ## Loading required package: coda ## Loading required package: lattice ## Linked to JAGS 3.3.0 ## Loaded modules: basemod,bugs  jagsModel = jags.model(file = "model.bugs.R", data = bugsData, inits = bugsInits)  ## Compiling model graph ## Resolving undeclared variables ## Allocating nodes ## Graph Size: 30003 ## ## Initializing model  Next, we let the model burn-in for an arbitrary period. I will use a burn-in period of 1,000 iterations. update(jagsModel, n.iter = 1000)  And finally, we take a sample (hopefully) from the posterior distribution of the parameter of interest, $$\lambda$$ parameters = c("lambda") simSamples = coda.samples(jagsModel, variable.names = parameters, n.iter = 10000) stats = summary(simSamples)  We can obtain a 95% confidence interval by using the posterior mean and standard error, i.e. mx = stats$statistics[1]
sx = stats$statistics[2] ci = mx + c(-1, 1) * 1.96 * sx ci  ## [1] 1.002 1.065  or we can get a 95% credible interval by using the posterior quantiles, i.e. ci = stats$quantiles
ci

##  2.5% 97.5%
## 1.002 1.065


Both intervals contain the true value of 1.05 and have a fairly reasonable estimate of it as well. They also are very close to the ML interval.

I hope this clears things up for someone.

### Credit where credit is due

I could not have written this without following the right censored example provided by John Kruschke here, and from reading Martyn Plummer's presentation here.