# Struggles with getting the (Bayesian) language right

Maybe it is a function of being originally trained in the Frequentist paradigm, but I am finding it hard to get my Bayesian language right.

At some point when you are learning about Bayesian things you will come across a statement that says something along these lines:

Frequentists believe that the data is random and that the parameter of interest is unknown and fixed. Bayesians, on the other hand, believe that the data is fixed, and that the parameter of interest is unknown and random.

At a certain level I am reasonably happy with this. The Bayesian statement leaves out something about the data generating mechanism being random, but on the whole – well, fine. However, I feel this does not mesh well with our statements. For example, let’s say we are doing a rather dull, but simple univariate analysis with inference about the mean. We have a set of observations, and we might propose a normal prior, a normal likelihood and consequently arrive at a normal posterior. How do we choose to summarise this information? Well, we might give a statement about the posterior density:

$$\mu|\mathbf{x}\sim N(\mu^\prime, (\sigma^{\prime})^2)$$
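
For the record, and as a reminder rather than anything the post needs to establish, the update behind that statement is the standard conjugate normal result. Assuming the data variance $$\sigma^2$$ is treated as known (an assumption I am adding for the sake of the example), with prior $$\mu\sim N(\mu_0,\sigma_0^2)$$ and $$n$$ observations with sample mean $$\bar{x}$$,

$$(\sigma^{\prime})^2=\left(\frac{1}{\sigma_0^2}+\frac{n}{\sigma^2}\right)^{-1},\qquad \mu^{\prime}=(\sigma^{\prime})^2\left(\frac{\mu_0}{\sigma_0^2}+\frac{n\bar{x}}{\sigma^2}\right)$$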

Or, we might give a (95%) credible interval:

$$\Pr(q^{\prime}_{0.025}<\mu<q^{\prime}_{0.975}\mid\mathbf{x})=0.95$$

Or we might just report the posterior mean and standard deviation. I think it is the last two where I question myself. Let's take the credible interval. If we express it in words, then we say something along the lines of "I am 95% sure the mean lies between a and b" (where a and b are the respective lower and upper bounds of the credible interval). It seems implicit in this statement, at least to me (and feel free to disagree), that there is an underlying belief that there is a single true value for the mean. Similarly, if we report the posterior mean, there is a feeling (again, to me) that we are reporting our best estimate of the one true mean. Clearly this flies in the face of what we actually think as Bayesians.

What brought this to the fore was working on the book my good friend David Lucy left unfinished.

David, like me, came to Bayesian thinking later in his academic career. He started out his academic life as an archaeologist, working on age estimation. If you are unfamiliar with this problem, it is essentially a classical calibration problem. You have a set of remains, perhaps teeth, from individuals of known age, for whom you can measure some characteristic, say Y, which is subject to variability. The idea is, given a new value of Y, to estimate the age of the individual. It parallels the calibration problem because (in theory) the age is measured without error, whereas Y contains all the usual sources of variation. David became interested in Bayesian solutions to this problem, and based on his thesis work, was firmly convinced that it was the only way to approach it.

David liked ontology and epistemology, and as a consequence of this introduced the idea of a Platonic view of statistics. I do not know whether this one line captures his thinking, but it suffices. He writes:

Statistical modelling is very Platonic in that it is the parameters of a model which are the ontologically real matter which control the way in which the world works.

Implicit in this statement, again perhaps only to me, is the idea that there is one true value. In fact, if David was around to argue the point, I would say, based on the little I know about Plato, that his ideas were essentially reductionist. That is perhaps not a flaw if the building blocks are probability distributions, and maybe this is what David was arguing.

Anyway – there is no natural conclusion to these thoughts, except perhaps to ask “Are we actually conveying our intent accurately with our language when it comes to reporting Bayesian results?”

# Some quirks of JAGS

I return to JAGS infrequently enough these days that I forget a lot of things. Do not get me wrong. I love JAGS and I think it is one of the most valuable tools in my toolbox. However, administrative and professional duties often destroy any chance I might have of concentrating long enough to actually do some decent statistics.

Despite this I have been working with my colleagues and friends, Duncan Taylor and Mikkel Andersen (who has been visiting from Aalborg since February and really boosting my research), on some decent Bayesian modelling. Along the way we have encountered a few idiosyncrasies of JAGS which are probably worth documenting. I should point out that the issues here have been covered by Martyn Plummer and others on Stack Overflow and other forums, but perhaps not as directly.

### Local variables

JAGS allows local variables. In fact, you probably use them without realizing it, as loop indices. However, what it does not like is the declaration of a local variable within a loop. That is

model{
  k <- 10
}


is fine, whereas

model{
  for(i in 1:10){
    k <- 10
  }
}


is not. This would be fine, except that the second code snippet yields an error telling you that k is an unknown variable, and asking you to "Either supply values for this variable with the data or define it on the left hand side of a relation." I find this error message a bit of a head scratcher, because you look at your code and say "But I have defined it, and it works in R." Do not fall into that mode of thinking: it is a trap!

There are a couple of solutions to this problem. Firstly, you could do as I did: give up a little readability and simply not use temporary variables like this. Or, secondly, you could let the temporary variable vary with the index, like so

model{
  for(i in 1:10){
    k[i] <- 10
  }
}
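
A third option, which is not mentioned above but avoids the issue entirely, is to pass the constant in through the data list: anything supplied as data is fixed, so it never needs to be defined inside the model block. A minimal sketch (the model and the made-up data here are mine, purely for illustration):

library(rjags)

modelString = "
model{
  for(i in 1:10){
    y[i] ~ dnorm(k, 1)  # k is supplied as data, not defined in the model
  }
}
"

jagsData = list(y = rnorm(10, mean = 10), k = 10)
jagsModel = jags.model(textConnection(modelString), data = jagsData)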


### Missing data / ragged arrays

JAGS cannot deal with missing covariates very well. However, it is happy enough for you to include these observations and "step over" them somehow. An example of this might be a balanced experiment where some of the trials failed. Let us consider a simple two-factor completely randomised design where each factor has only two levels. Let the first factor have levels A and B, and the second factor have levels 1 and 2. Furthermore, let there be 3 replicates for each treatment (combination of the factors). Our (frequentist) statistical model for this design would be

$$y_{ijk}=\mu + \alpha_i + \beta_j + \gamma_{ij} + \varepsilon_{ijk},~i\in\{A,B\},j\in\{1,2\},k=1,\ldots,3,\varepsilon_{ijk}\sim N(0,\sigma^2)$$

And we would typically programme this up in JAGS as something like this

model{
  for(i in 1:2){
    for(j in 1:2){
      for(k in 1:3){
        y[i,j,k] ~ dnorm(mu[i,j], tau)
      }
      mu[i,j] <- alpha[i] + beta[j] + gamma[i,j]
    }
  }

  for(i in 1:2){
    alpha[i] ~ dnorm(0, 0.000001)
  }

  for(j in 1:2){
    beta[j] ~ dnorm(0, 0.000001)
  }

  for(i in 1:2){
    for(j in 1:2){
      gamma[i,j] ~ dnorm(0, 0.000001)
    }
  }

  tau ~ dgamma(0.001, 0.001)
}


Implicit in this code is that y is a 2 x 2 x 3 array, and that we have a fully balanced design. Now let us assume that, for some reason, the treatments $$\tau\in\{A1,A2,B1,B2\}$$ have been replicated 2, 3, 2, 3 times respectively. We can deal with this in JAGS by creating another variable in our input data, which we will call reps, with reps = c(2, 3, 2, 3). This can then be accommodated in our JAGS model by

model{
  for(i in 1:2){
    for(j in 1:2){
      for(k in 1:reps[(i-1) * 2 + j]){
        y[i,j,k] ~ dnorm(mu[i,j], tau)
      }
      mu[i,j] <- alpha[i] + beta[j] + gamma[i,j]
    }
  }

  # priors on alpha, beta, gamma and tau as in the model above
}
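
For concreteness, here is one way the data might be assembled in R before being passed to JAGS. This is a sketch of my own; the response values are invented purely for illustration.

## Replication pattern for the treatments A1, A2, B1, B2
reps = c(2, 3, 2, 3)

## A 2 x 2 x 3 array with NA in the unused replicate slots
y = array(NA, dim = c(2, 2, 3))
y[1, 1, 1:2] = c(10.1, 9.8)         # treatment A1: 2 replicates
y[1, 2, 1:3] = c(12.3, 11.9, 12.7)  # treatment A2: 3 replicates
y[2, 1, 1:2] = c(9.2, 9.5)          # treatment B1: 2 replicates
y[2, 2, 1:3] = c(13.0, 12.4, 13.3)  # treatment B2: 3 replicates

bugsData = list(y = y, reps = reps)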


y is still a 2 x 2 x 3 array, but simply has NA stored in the positions without information. It is also worth noting that ever since JAGS 4.0 the syntax for loops allows

for(v in V){


where V is a vector containing integer indices. This is very useful for irregular data.
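
As a sketch of how this might be used (my own example, assuming as above that you are running JAGS 4.0 or later), suppose only some elements of a vector y are observed. You can pass the vector of observed indices in as data and loop over it directly:

library(rjags)

y = c(1.2, NA, 0.7, NA, 1.9)
bugsData = list(y = y, obsIdx = which(!is.na(y)))

modelString = "
model{
  for(v in obsIdx){
    y[v] ~ dnorm(mu, tau)
  }
  mu ~ dnorm(0, 0.000001)
  tau ~ dgamma(0.001, 0.001)
}
"

jagsModel = jags.model(textConnection(modelString), data = bugsData)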

NOTE I have not compiled the JAGS code in my second example, so please let me know if there are mistakes. This is an ongoing article and so I may update it from time to time.

# I am an applied statistician

Today brings a very nice blog post from Rafael Irizarry on being pragmatic in applied statistics rather than rigidly/religiously Bayesian or Frequentist.

Does this article reverse or contradict my thinking in forensic science? Not really. I am a strong proponent of Bayesian thinking in that field. However, in the shorter term I would be happier if practitioners simply had a better understanding of the Frequentist interpretation issues. As a statistician I depend on the collaboration of forensic scientists for both the problems and the data. Telling scientists that everything they are doing is incorrect is generally unhelpful. It is more productive to collaborate and make it better.

# MCMC in Excel — an exercise in perversity

I found myself working in Excel as part of the work I did fitting exponential (and gamma) distributions to left censored data. This was partly due to the fact that my research colleagues had done the initial distribution fitting using Microsoft Excel's Solver to do maximum likelihood, something it does quite well. A shortcoming of this approach is that you cannot get the Hessian matrix for models with two or more parameters, which you need if you want to place any sort of confidence interval around your estimates. There is nothing to stop you, of course, from doing the actual mathematics and calculating the values you need directly, but this all sounds like rather too much work and is distribution specific. One can equally make the criticism that the approximations used by BFGS and other quasi-Newton methods are not guaranteed to be close to the true Hessian matrix.

The next step along the chain (I am sure this is a terribly mixed metaphor, but hey, who cares), for me at least, was to use MCMC — in particular, to implement a simple random walk Metropolis-Hastings sampling scheme.

Note: The method I describe here becomes almost impossible for a model with more than a couple of parameters, or a model where the log-likelihood does not reduce to a simple sum of the data (or a sum of a function of the data). The reason for this is that Excel's distribution functions are not vector functions, which means that in many circumstances the values of the likelihood for different observations must be stored in separate cells and then summed over the cells. In a problem with n observations and m proposals, we would then have to store $$n\times m$$ values unless we resort to Visual Basic for Applications (VBA). However, I wanted to do this problem without VBA.

Note 2: I know that it is very easy to estimate the variance for the exponential distribution, but please refer to the title of this post.

## Microsoft Excel 2013 for Mac

In order to do MCMC we need to be able to generate random numbers. This functionality is provided in Excel by the Data Analysis Add-In. However, the Data Analysis Add-In has not been available in Excel for the Mac since Excel 2008. There is a claim that this functionality can be restored by a third-party piece of software called StatPlus LE, but in my limited time with it, it seemed a very limited solution. There are a number of other pieces of functionality missing from the Mac version of Excel, which reduces its usefulness greatly.

## What do we need?

### We need data

I will use data similar to that in my first post on this subject. However, this time I am going to switch to using data generated from a gamma distribution. To get some consistency, I have generated the data in R with shape $$\alpha=10.776$$, rate $$\beta=5.138$$, and a detection limit of $$\log(29)$$. These numbers might seem absurdly specific, but they come from maximum likelihood estimates in a real data set. This leaves me with 395 observations above the detection limit and 9,605 below it. We only need the sum of the observations above the limit, the sum of their logarithms, and the two counts, because the log-likelihood depends on the data only through these quantities. That is, let $$x_i\sim Gamma(\alpha,\beta), i=1,\ldots,N$$, where $$n$$ is the number of observations above the detection limit (395) and $$m=N-n$$ is the number of observations below it; then the likelihood function is

$$L(\alpha,\beta;{\mathbf x}) = \prod_{i=1}^{n}\frac{\beta^{\alpha}}{\Gamma(\alpha)}x_i^{\alpha-1}e^{-\beta x_i}\prod_{i=1}^{m}F(T; \alpha, \beta)$$

where $$T=\log(29)$$, and $$F(T; \alpha, \beta)$$ is the Gamma cumulative distribution function. The log-likelihood then simplifies to

\begin{align*} l(\alpha,\beta;{\mathbf x}) &= n\left(\alpha\log(\beta)-\log\Gamma(\alpha)\right)+(\alpha-1)\sum \log x_i -\beta\sum x_i \\ &+m \log F(T; \alpha, \beta) \\ &= (\alpha-1)\sum \log x_i -\beta\sum x_i + \kappa(\alpha,\beta,n,m)\\ \end{align*}

which depends only on the sum of the observations above the detection limit, the sum of their logarithms, and the numbers of observations above and below the detection limit.
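
As a cross-check on the spreadsheet formulae that follow, the same log-likelihood takes only a few lines of R. This is my own sketch; the function name and arguments are just choices made for illustration.

## Log-likelihood for gamma data left censored at LOD = log(29),
## written in terms of the sufficient statistics described above
LOD = log(29)

logLikGamma = function(alpha, beta, sum_x, sum_log_x, nObs, nCens){
  nObs * (alpha * log(beta) - lgamma(alpha)) +
    (alpha - 1) * sum_log_x - beta * sum_x +
    nCens * pgamma(LOD, shape = alpha, rate = beta, log.p = TRUE)
}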

### We need to specify the priors and to get a set of proposal values

In my JAGS model I used $$\Gamma(0.001, 0.001)$$ priors for $$\alpha$$ and $$\beta$$. This would be easy enough to implement in Excel if the inverse Gamma function were sufficiently robust. However, it is not, and so I have opted for priors on $$\log\alpha$$ and $$\log\beta$$ that are $$U(-2,3)$$.

This prior is a little less extreme than the $$\Gamma(0.001, 0.001)$$ prior but has reasonable properties for this example.

We can use the Data Analysis Add-In to generate a set of proposal values. The screen capture below shows the dialog box from the Random Number Generation part of the Data Analysis Add-In. We need proposals for both $$\alpha$$ and $$\beta$$, therefore we ask Excel to give us two random variables. In a standard MCMC implementation we usually choose a "burn-in" period to make sure our samples are not too correlated with the starting values, and to give the sampler time to get somewhere near the target distribution. In this example we will use a burn-in period of 1,000 iterations and then sample for a further 10,000 iterations, for a total of 11,000 iterations. We get Excel to put the proposals out into columns B and C, starting at row 2 (and extending to row 11,001). Note: I have set the random number seed (to 202) here so that my results can be replicated.

We also need a column of U[0,1] random variates for our Metropolis-Hastings update step. The screen capture below shows the dialog box where we set this up. We store these values in column F, and, as before, I have set the random number seed (to 456) so that my results can be replicated.

We use columns D and E to transform our uniform random variates back to the original scale to get our proposals for $$\alpha$$ and $$\beta$$. We do this by entering the formula

=exp(B2)

into cell D2, and then selecting cells D2 to D11001 and using the Fill Down command to propagate the formula. We then select the range D2:E11001 and use the Fill Right command to propagate the formula across for $$\beta$$. Columns D and E now contain my proposal values for $$\alpha$$ and $$\beta$$.

### We need some data

As noted before, all we need is the sum of the observed values and the sum of the logs of the observed values, plus the number of observed and censored values. The sum of the observed values in my data set is 1478.48929487124 (stupid accuracy for replication), and the sum of the logs of the observed values is 519.633872429806. As noted before, the number of observed values is 395, and there are 9,605 censored values. I will insert these values into cells I2 to I5 respectively, and in cells H2 to H5 I will enter the labels sum_x, sum_log_x, nObs, and nCens.

#### We need some names

It is useful to label cells with names when working with Excel formulae. This allows us to refer to cells containing values by a name that means something, rather than by a cell address. We can define names by using the tools on the Formula tab. I will use this tool to assign the names I put into cells H2 to H5 to the values I put into cells I2 to I5. To do this I select the range H2:I5, then click on the Formula tab, and then on the "Create names from Selection" button, as shown in the screenshot below. Note: I do not believe you can do this on the Mac, but I do not know for sure.

I can now use, for example, the name sum_x to refer to cell address $I$2 in my formulae. It also removes the need to make sure that the address is absolute every time I type it.

### We need to be able to calculate the log-likelihood

The only “tricks” we need to calculate the log likelihood are knowing how to calculate the natural logarithm, the Gamma cdf, and the log-gamma function. I mention the first because the natural logarithm in Excel conforms to mathematical convention in that $$\log_e(x)={\rm ln}(x)$$, and the corresponding Excel function is LN. The LOG function calculates $$\log_{10}(x)$$ in Excel. Excel’s GAMMA.DIST function provides both the pdf and the cdf. The latter is obtained by setting the fourth argument (CUMULATIVE) to TRUE. It is important to note that Excel uses parameters alpha and beta, but these correspond to shape and scale, not shape and rate. Finally, the GAMMALN function provides us with the logarithm of the complete gamma function. We will calculate the log-likelihood in column J, therefore we enter the following formula into cell J2

=(D2 - 1) * sum_log_x - E2 * sum_x
 + nObs * (D2 * LN(E2) - GAMMALN(D2))
 + nCens * LN(GAMMA.DIST(LN(29), D2, 1 / E2, TRUE))


After you have got this formula correct (and it will probably take more than one go), select cells J2:J11001 and use the "Fill Down" command (Ctrl-D on Windows) to propagate the formula down for every proposed value.
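
If you want to reassure yourself about the shape/scale versus shape/rate distinction, a quick check in R (an aside of mine, not part of the spreadsheet) is that both parameterisations give the same probability:

pgamma(log(29), shape = 10.776, rate = 5.138)
pgamma(log(29), shape = 10.776, scale = 1/5.138)  # identical, and matches GAMMA.DIST(LN(29), 10.776, 1/5.138, TRUE)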

### We need starting values

We will store our sampler’s current state in columns K, L and M. The current value of $$\alpha$$ gets stored in column K, the current value of $$\beta$$ in column L, and the current value of the log-likelihood in column M. We need some starting values, and so in this example we will use the first proposal values. In cell K2 we enter the formula

=D2


in cell L2 we enter the formula

=E2


and in cell M2 we enter the formula

=J2


Recall that the state of the sampler changes from $$G_0$$ to $$G_1$$ with probability 1 if $$G_1$$ is more likely than $$G_0$$, that is if $$l(G_1)>l(G_0)$$, or otherwise if $$\log(u) < l(G_1)-l(G_0)$$, where $$u\sim U(0,1)$$. Therefore, to implement this in Excel, we need to use an IF function in conjunction with an OR function. The OR function can handle our two conditions: either the proposal is more likely, or it is accepted with probability equal to the ratio of the likelihoods. Therefore, in cell K3, we enter the formula

=IF(OR(J3>M2, LN(F3)<(J3-M2)), D3, K2)

Note that the comparisons are made to the current state, not the previous cell. We could write very similar formulae for cells L3 and M3. However, we can instead just check whether cell K3 has the same value as cell K2, that is, whether there has been a change. Checking floating point numbers for equality is bad programming practice; however, if we were concerned about scientific accuracy we probably would not be doing this in Excel. Therefore, we insert the formulae

=IF(K3=K2, L2, E3)
=IF(K3=K2, M2, J3)

in cells L3 and M3 respectively. We need to propagate these formulae down to row 11,001 by selecting and using the "Fill Down" command as before.

### We need to summarize our chain

Finally, we need to gather some summary statistics about our sample from the posterior distribution. Recall we are using a burn-in period of 1,000 samples and a sampling period of 10,000 iterations. Therefore all our summary functions are only applied from row 1,002 to row 11,001. We are interested in the mean, standard deviation, and the 0.025 and 0.975 quantiles of $$\alpha$$ and $$\beta$$. We can get these values using the AVERAGE, STDEV.S, and PERCENTILE.EXC functions. In cells P2 to P7 we insert the following formulae.

=AVERAGE(K1002:K11001)
=STDEV.S(K1002:K11001)
=PERCENTILE.EXC(K1002:K11001, 0.025)
=PERCENTILE.EXC(K1002:K11001, 0.975)
=P2 - 1.96 * P3
=P2 + 1.96 * P3


and then we select cells P2:Q7 and use the “Fill Right” command.
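
For readers who would rather check the logic in R than in a spreadsheet, here is a rough sketch of the same scheme: proposals for $$\log\alpha$$ and $$\log\beta$$ drawn independently from U(-2,3), and a proposal accepted whenever the log of a uniform random variate is less than the difference in log-likelihoods. It reuses the logLikGamma function sketched earlier and the summary values given above; it is my own approximation of the spreadsheet rather than a line-for-line translation, and the random numbers will of course differ from Excel's.

## sufficient statistics from the data described above
sum_x = 1478.48929487124
sum_log_x = 519.633872429806
nObs = 395
nCens = 9605

set.seed(202)
nIter = 11000
propAlpha = exp(runif(nIter, -2, 3))
propBeta  = exp(runif(nIter, -2, 3))
u = runif(nIter)

alpha = beta = ll = numeric(nIter)
alpha[1] = propAlpha[1]
beta[1]  = propBeta[1]
ll[1] = logLikGamma(alpha[1], beta[1], sum_x, sum_log_x, nObs, nCens)

for(i in 2:nIter){
  llProp = logLikGamma(propAlpha[i], propBeta[i], sum_x, sum_log_x, nObs, nCens)
  if(log(u[i]) < llProp - ll[i - 1]){
    alpha[i] = propAlpha[i]; beta[i] = propBeta[i]; ll[i] = llProp
  }else{
    alpha[i] = alpha[i - 1]; beta[i] = beta[i - 1]; ll[i] = ll[i - 1]
  }
}

## posterior summaries after discarding the 1,000 burn-in iterations
keep = 1001:nIter
c(mean(alpha[keep]), sd(alpha[keep]), quantile(alpha[keep], c(0.025, 0.975)))
c(mean(beta[keep]),  sd(beta[keep]),  quantile(beta[keep],  c(0.025, 0.975)))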

## How did we do?

The screen capture above shows my results. Rather embarrassingly the 95% credible interval does not contain the true values $$\alpha=10.776$$ and $$\beta=5.138$$. The main reason for this is that there is a total of 10 acceptances in our sample of size 10,000! That is, the sampler is incredibly inefficient. This is not completely surprising. The interval calculated under the assumption that the posterior distribution is symmetric is a little wider and does contain the true values. However, I would not put much stock in this. Of course, we can easily have the sampler run for a longer time, by generating more random variates and using more cells. I will leave that task to the reader.

## Conclusions

In this article I have shown how to fit a gamma distribution to left censored data using MCMC in Excel. It is definitely not the best way to do this — I would use R and JAGS, which would be infinitely faster and give me more useable results — but it can be done. It does offer functionality to non-R users, of whom there are many more than R users, and it also allows us to observe the Markov chain so that we can see how the sampler is working at every stage of the process.

For completeness the sheet I created whilst writing this post is available from the link below.
MCMCGamma.xlsx.zip (1.7MB)

# Bayesian modelling of left-censored data using JAGS


Martyn Plummer's JAGS very helpfully provides us with a way to model censored data through the use of the dinterval distribution. However, almost all of the examples that one finds on the web are for right censored data. The changes to model left censored data are not major, but I do think they warrant a) a post/page of their own and b) hopefully an easy-to-understand example.

Left-censored data arises very commonly when dealing with detection limits from instrumentation. In my own work, I often end up involved in the modelling of data derived from electropherograms.

I will start by generating some left censored data. For simplicity I am going to assume that my data is exponentially distributed, with a true rate of 1.05 ($$\lambda = 1.05$$), and a detection/censoring threshold at log(29). This means that approximately 97.1% of my data (on average) will not exceed my detection threshold. This may seem extreme, but it is the kind of setup that is common in my own work.

set.seed(35202)
x = rexp(10000, rate = 1.05)

## set all the censored values to NA's
x[x < log(29)] = NA


This gives us a data set of size 10,000 with 9,691 values that fall below the detection threshold.
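
A quick sanity check (my own aside) that the observed censoring proportion is in line with the theoretical one:

pexp(log(29), rate = 1.05)  # theoretical proportion below the threshold, about 0.971
mean(is.na(x))              # observed proportion censored, 9691/10000 = 0.9691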

It can be a useful check, if feasible, to see what the maximum likelihood estimate is. We can do this in R using the optim function

## define the log-likelihood
logLik = function(lambda){
  isCensored = is.na(x)
  nCensored = sum(isCensored)
  LOD = log(29)

  ll = sum(dexp(x[!isCensored], rate = lambda, log = TRUE)) +
    nCensored * pexp(LOD, rate = lambda, log.p = TRUE)

  ## return the -ve value because optim
  ## is a minimizer
  return(-ll)
}

fit = optim(0.5, logLik, method = "Brent",
            lower = 0, upper = 10,
            hessian = TRUE)
fisherInfo = 1/fit$hessian
sigma = sqrt(fisherInfo)
upper = fit$par + 1.96 * sigma
lower = fit$par - 1.96 * sigma
interval = c(lower, fit$par, upper)
names(interval) = c("lower", "MLE", "upper")
interval

## lower   MLE upper
## 1.001 1.032 1.064


So the 95% confidence interval contains our true value which is always a good start!

The trick, if there is any, to dealing with left-censored data in JAGS is to make sure that your indicator variable tells JAGS which observations are above the detection threshold.

So in the next step I will set up the list that contains my data.

bugsData = list(N = length(x),
                isAboveLOD = ifelse(!is.na(x), 1, 0),
                x = x,
                LOD = rep(log(29), length(x)))


There are two points to make about the preceding code. Firstly, the variable isAboveLOD uses the NA status of the data. If you have not recoded your censored values to NA then obviously this will not work. Secondly, the limit of detection vector LOD contains a value for every observation, regardless of whether that observation is below the limit of detection or not.

Next we need to set up some initial values. The key here is setting not only an initial value for $$\lambda$$, but also initial values for the observations that have been censored.

bugsInits = list(list(lambda = 0.5,
                      x = rep(NA, length(x))))

## set the missing values to
## random variates from
## U(0, log(29))
nMissing = sum(!bugsData$isAboveLOD)
bugsInits[[1]]$x[!bugsData$isAboveLOD] = runif(nMissing, 0, log(29))


I have chosen uniform random variates between zero and the limit of detection (log(29)) as initial values for my censored data.

Now we need to set up our JAGS model. I will store this in a string, and then use writeLines to write it to disk.

modelString = "
model{
  for(i in 1:N){
    isAboveLOD[i] ~ dinterval(x[i], LOD[i])
    x[i] ~ dexp(lambda)
  }
  lambda ~ dgamma(0.001, 0.001)
}
"
writeLines(modelString, con = "model.bugs.R")


Note that I have used a vague gamma prior for $$\lambda$$. This is not especially noteworthy, except from the point of view of being explicit about what I have done. dinterval returns 0 if x[i] < LOD[i] and 1 otherwise. Many people who try to use dinterval get this cryptic error: Observed node inconsistent with unobserved parents at initialization. This can happen for two reasons. Firstly, the indicator variable may be incorrectly set; that is, the observations above the limit of detection have been coded with a 0 instead of a 1, and vice versa for the censored observations. Secondly, the error can occur because the initial values for the censored observations are outside of the censored interval.

We now have the three components necessary to fit our model: a list containing the data, a list of initial values, and a BUGS model. Firstly we initialize the model

library(rjags)

## Loading required package: coda
## Loading required package: lattice
## Linked to JAGS 3.3.0
## Loaded modules: basemod,bugs

jagsModel = jags.model(file = "model.bugs.R", data = bugsData, inits = bugsInits)

## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph Size: 30003
##
## Initializing model

Next, we let the model burn in for an arbitrary period. I will use a burn-in period of 1,000 iterations.

update(jagsModel, n.iter = 1000)

And finally, we take a sample (hopefully) from the posterior distribution of the parameter of interest, $$\lambda$$.

parameters = c("lambda")
simSamples = coda.samples(jagsModel,
                          variable.names = parameters,
                          n.iter = 10000)
stats = summary(simSamples)

We can obtain a 95% confidence interval by using the posterior mean and standard error, i.e.

mx = stats$statistics[1]
sx = stats$statistics[2]
ci = mx + c(-1, 1) * 1.96 * sx
ci

## [1] 1.002 1.065

or we can get a 95% credible interval by using the posterior quantiles, i.e.

ci = stats$quantiles[c(1, 5)]
ci

##  2.5% 97.5%
## 1.002 1.065


Both intervals contain the true value of 1.05 and have a fairly reasonable estimate of it as well. They also are very close to the ML interval.

I hope this clears things up for someone.

### Credit where credit is due

I could not have written this without following the right censored example provided by John Kruschke here, and from reading Martyn Plummer's presentation here.