# Extracting elements from lists in Rcpp

If you are an R programmer, especially one with a solid background in data structures or with experience in a more traditional object oriented environment, then you probably use lists to mimic the features you might expect from a C-style struct or a class in Java or C++. Retrieving information from a list of lists, or a list of matrices, or a list of lists of vectors is fairly straightforward in R, but you may encounter some compiler error messages in Rcpp if you do not take the right steps.

### Stupid as bro

This will not be a very long article, but I think it is useful to have this information somewhere other than Stack Overflow. Two posts, one from Dirk and one from Romain contain the requisite information.

The List class does not know what type of elements it contains. You have to tell it. That means if you have something like

x = list(a = matrix(1:9, ncol = 3), b = 4)


void Test(List x){
IntegerMatrix a = x["a"];
}


in your C++, then you might get a compiler error complaining about certain things not being overloaded. As Dirk points out in another post (which I cannot find right at this moment), the accessor operator for a List simply returns a SEXP. Rcpp has done a pretty good job of removing the need for us to get our hands dirty with SEXP‘s, but they are still there. If you know (and you should since you are the one writing the code and designing the data structures) that this SEXP actually is an IntegerMatrix then you should cast it as one using the as<T>() function. That is,

void Test(List x){
IntegerMatrix a = as<IntegerMatrix>(x["a"]);
}


### So why does this work?

If you look around the internet, you will see chunks of code like

int b = x["b"];
NumericVector y = x["y"];


which compile just fine. Why does this work? It works because the assignment operator has been overloaded for certain types in Rcpp, and so you will probably find you do not need explicit type coercion. However, it certainly will not hurt to explicitly do so for every assignment, and your code will benefit from doing so.

# Generating pseudo-random variates C++-side in Rcpp

It is well-known that if you are writing simulation code in R you can often gain a performance boost by rewriting parts of your simulation in C++. These days the easiest way to do that of course is to use Rcpp. Simulation usually depends on random variates, and usually great numbers of them. One of the issues that may arise is that your simulation needs to execute on the C++ side of things. For example, if you decide to programme your Metropolis-Hastings algorithm (not technically a simulation I know) in Rcpp, then you are going to need to be able to generate hundreds of thousands, if not millions, of random numbers. You can use Rcpp’s features to call R routines from within Rcpp to do this, e.g.

Function rnorm("rnorm");
rnorm(100, _["mean"] = 10.2, _["sd"] = 3.2 );


(Credit: Dirk Eddelbuettell)

but this has a certain overhead. C++ has had built-in in random number generation functionality since at least the C+11 standard (and probably since the C+0X standard). The random header file provides a Mersenne-Twister uniform random number generator (RNG), a Linear Congruential Generator (LCG), and a Subtract-with-Carry RNG. There is also a variety of standard distributions available, described here.

### Uniform random variates

The ability to generate good quality uniform random variates is essential, and the mt19937 engine provides. The 19937 refers to the Mersenne Prime $$(2^{19937}-1)$$ that this algorithm is based on, and also to its period length. There are four steps required to generate uniform random variates. These are:

1. Include the random header file
2. Construct an mt19937 random number engine, and initialise it with a seed
3. Construct a $$U(0,1)$$ random number generator
4. Use your engine and your uniform random number generator to draw variates

In code we would write

#include <random>
#include <Rcpp.h>

using namespace std;
using namespace Rcpp;

mt19937 mtEngine;
uniform_real_distribution<double> rngU;

//[[Rcpp::export]]
void setSeed(unsigned int seed){
mtEngine = mt19937(seed);
rngU = uniform_real_distribution<>(0.0, 1.0);
}

double runif(void){
return rngU(mtEngine);
}


The function runif can now be called with runif(). Note that the setSet function has been exported so that you can initialize the RNG engine with a seed of your choice.

### How about normal random variates?

It does not require very much more effort to add a normal RNG to your code. We simply add

normal_distribution<double> rngZ;


to our declared variables, and

//[[Rcpp::export]]
void setSeed(unsigned int seed){
mtEngine = mt19937(seed);
rngU = uniform_real_distribution<>(0.0, 1.0);
rngZ = normal_distribution<double>(0.0, 1.0);
}

double rnorm(double mu = 0, double sigma = 1){
return rngZ(mtEngine) * sigma + mu;
}


to our code base. Now rnorm can be called without arguments to get standard ($$N(0,1)$$) random variates, or with a mean, or a standard deviation, or both to get $$N(\mu,\sigma^2)$$ random variates

### Rcpp does it

No doubt someone is going to tell me that Romain and Dirk have thought of this already for you, and that my solution is unnecessary Morris Dancing. However, I think there is merit in knowing how to use the standard C++ libraries.

Please note that I do not usually advocate having global variables such as those in the code above. I would normally make mtEngine, rngU, and rngZ private member variables a class and then either instantiate it using an exported Rcpp function, or export the class and essential functions using an Rcpp module.

Working C++ code and an R test script can be found here in the RNG folder. Enjoy!

# Embracing the new. Is it time to ditch pointers in C++?

This post was originally posted December 22, 2014

Recently I had the opportunity to revisit some research that I did circa 1997-1998, because someone asked me to write a book chapter on the subject. It is an interesting process to go back and look at your old work and apply all the things that you have learned in the intervening time period.

In this case the research relied on some C/C++ simulation programmes that I had written. The simulations, even for small cases, performed hundreds of thousands of iterations to estimate lower bounds and so C++ was a natural choice at the time. R was still a fledgling, and Splus simply was not up to extensive simulation work. Given the nature of these simulations, I still do not think I would use R, even though it is very fast these days.

Simulations, being simulations, rely extensively on random number generation, and of course these programmes were no exception. Re-running the programmes seemed trivial, and of course the compute time had been substantially reduced over the years. This lead me to think that I could now explore some more realistic scenarios. If you, the reader, think I am being deliberately mysterious about my simulations, I am not. It is more that the actual research is a side issue to the problems I want to talk about here. The “more realistic inputs” simply correspond to larger simulated DNA databases, inline with those now maintained by many jurisdictions, and a set of allele frequencies generated from a much larger data set than that I had access to in 1997 with a different set of loci.

There would clearly be no story if something did not happen with the new work. My early work was with databases of 100, 400 and 1,000 individuals. When I expanded this to 5,000 and 10,000 individuals I found that things began to go wrong.

Firstly, the programme began to quit unexpectedly on Windows, and produce segmentation faults when compiled with gcc on Linux. The crashes only happened with the larger database sizes, but strangely in the case where N = 1,000 — where there had previously been no crash. I thought initially that this might be because I had inadvertently hard-coded some of the array dimensions, and that the new data sets, or larger runs where causing problems. Extensive examination of the code did not reveal any irregularities.
Random number generators and all that

I did discover fairly early on that I could no longer rely on George Marsaglia’s multiply-with-carry (MWC) uniform random number generator. The reason for this is that the generator, as coded, relies on integers of certain widths, and integer overflow, or wrapping. I had pretty much abandoned this some years ago when a former MSc student, Dr Alec Zwart discovered that there were irregularities in the distribution of the bits. Using a random number a bit at a time is very useful when simulating breeding populations — which is something else I do quite often.

### The Mersenne Twister

The Mersenne Twister has been around since 1997, and again advances in computing have made the computing overhead it incurs relatively negligible. My initial switch to a Mersenne Twister uniform random number generator (RNG) was through an implementation distributed by Agner Fog. This implementation has served me well for quite some time, and I have used it extensively. It sadly was not the case this time. I could not get Visual Studio 2013 to understand some of the enums, and faking it caused me problems elsewhere. I am sure there is nothing wrong with this implementation, but I certainly could not get to to work this time.

I discovered by reading around on the web that random number generation has become part of the new C+11 standard, and that it is fairly easy to get a Mersenne Twister random number generator. Most implementations start with a uniform integer, or long integer, random number stream and then wrap different classes around this stream. C++ is no exception


#include <random>

using namespace std;

static mt19937 mtEngine;
uniform_real_distribution<double> rngU;;

void init_gen(unsigned int seed){
mtEngine = mt19937(seed);
rngU = uniform_real_distribution<>(0.0, 1.0);
}

double runif(void){
return rngU(mtEngine);
}


I have used a static variable to store the stream in my implementation but there is no requirement to do this.

### Nothing is ever normal

I have also, for quite some time, used Chris Wallace’s Fastnorm code for very fast generation of standard normal random variates. However, I found that this too appeared to be causing me problems, especially when I changed operating systems. My programming philosophy these days is that my work should really be portable to any mainstream operating system (Windows, Linux, OS X), especially since I almost never write GUI code any more. Running on both Windows and Linux is useful, because when I want to run really big simulations I often will flick the code over to our cluster which strangely enough does not run on Windows – who knew?

It turns out that the C+11 also has a normal random number generator. I have done very little research to find out what method is used, but my guess is that it is either an inverse CDF method, or at worst a Box-Muller based method. Adding a standard normal generator is easy


static mt19937 mtEngine;
static normal_distribution<double> rngZ;

void init_gen(unsigned int seed){
mtEngine = mt19937(seed);
rngZ = normal_distribution<double>(0.0, 1.0);
}

double snorm(void){
return rngZ(mtEngine);
}


So that will work right?

After all of these changes, which do not seem substantial but bear in mind they took me a long time to get to them, everything was stable right? Well no, I was still getting a crash when N = 10,000, and this was not happening when I started the simulation with that case.
Java to the rescue

I decided, probably incorrectly with hindsight, that I must be making some sort of stupid mistake with allocating memory and releasing it. I decided to take that completely out of the equation by switching to Java. A port from C++ to Java is actually a relatively painless thing to do, and I had a working version of my programme in a couple of hours. This was made easier by the fact that my colleague Duncan Taylor had ported Ross Ihaka’s C code, ripped out of R, for gamma random number generation (yes I need that too), and with a little tweaking I had it running in my programme as well. The Java port let me recognize that I had done some silly things in my original programme, such as storing an entire bootstrap sample before processing it and in the process chewing up CPU and memory time with needless copying. And after a little more hacking (like three days) it ran to my satisfaction and all simulations duly completed with about three hours of run time.

Java has some pretty cool ideas, and it is a fun and easy language to programme in. However, my failure to get the C++ working was weighing heavily on my mind. I like to think that I am a hell of a lot better C++ programmer than a Java programmer, and I dislike the idea that I might be writing slower programmes. I also do not think Java is currently well-suited to scientific programming. I am sure some readers will tell me this is no longer true, but access to a well accepted scientific code library is missing, and although there are many good projects, a lot of them are one-man efforts, or have been abandoned. A good example of the latter is the COLT library from CERN.

### Out with pointers

I thought about this for sometime, and eventually it occurred to me that I could write a C++ programme that looked like a Java programme — that is, no pointers. C++ purists might shudder, but if you think of Java as simplified C++, then the concept is not so strange. Java treats every object argument in a function as being a reference. C++ can replicate this behaviour very easily by simply using its reference notation. The big trade-off was that I was going to also have to drop the pointers I used for dynamic allocation of memory. Java sort of fudges this as far as I can tell, because although the scalar types (int, double, boolean and others) are generally not treated as references, I think native arrays of them are, e.g. int[] or double[].

### The STL

The Standard Template Library (STL) provides a set of low-overhead C++ template container classes, such as lists, vectors, maps and queues. These classes can contain themselves, and they can be dynamically resized at execution time. I have avoided using them in this way, especially when writing code to be very fast. However, I am fairly sure my colleague Brendon Brewer, who is much younger and a more modern C++ programmer, has told me that he never uses pointers. Given I had just finished for the year, this seemed like an ideal quick summer project.

Another couple of days recoding got me to running mode, and now it is time to reveal probably what was the issue all along. Remember when I said I did this:


double runif(void){
return rngU(mtEngine);
}



double runif(void){
return mtEngine() / 4294967295.0;
}


The large constant there is $$2^{32}-1$$, the largest unsigned integer that can be represented on a 32-bit CPU. The mt19997 function mtEngine() returns an unsigned 32-bit integer, but for reasons that still escape me this piece of code:


return (int)floor(b * runif());


which should return a number between 0 and b-1 inclusive, was returning b, thereby causing the programme to address unallocated memory, and hence the crash. The reason it took so long to happen is that the random number stream had to be used for a very long time. Using the uniform_read_distribution class stopped this from happening.

So did I take a performance hit? I cannot say equivocally without going back to my original programme and adjusting the RNGs, but it appears that the C++ version actually takes about 10 minutes longer than the Java version. This is a very poor comparison, because the Java version is running on my Windows PC (core i7, 32GB RAM, HDD), and the C++ version is running on my Macbook (core i7, 16GB RAM, SSD), but also because the new C++ version is “more objected-oriented” than the Java version. That is, firstly I used native arrays, and arrays of arrays in Java, like int[][] and double[][]. If I had used Java ArrayLists (why would you), it might have been a different story. Secondly, there is a bit more OO design and architecture in the C++ version, including things like operator overloading and more extensive use of objects to represent the input data. All of these things cost, and they are probably costing a little too much in execution time, although they pay off in readability and usability, especially in well designed IDEs like Visual Studio and Xcode. Finally, my original C++ programme never finished, so I have no idea actually how long it would take to do the same set of simulations. I think in this case I will take a programme that works in three hours over a programme that quickly crashes in one.

# Forensic anthropologists — lend me your data

Please note: This is not a new post, but a restored post from August that I lost in a WordPress upgrade

## Friends, Romans, forensic anthropologists, lend me your data

I have been reading the Journal of Forensic Sciences (JFS) over the last couple of days to see what sort of research is being done in forensic science and to see how many studies are using statistics to make, or to reinforce their conclusions. The answer the the second question is “quite a few.” There has been quite significant adoption of multivariate analysis, most commonly PCA, in a wide variety of forensic disciplines fields and that is very pleasing to see.

## Forensic anthropology

Anthropologists, and in particular forensic anthropologists have long been heavy users of statistical methodology. Many studies use linear regression, linear discriminant analysis, principal component analysis and logistic regression. The well-known and widely used forensic anthropology computer programme FORDISC uses LDA. It is interesting to me to see the appearance of some newer/different classification techniques such as k-nearest neighbour, quadratic discriminant analysis, classification and regression trees, support vector machines, random forests, and neural networks.

Forensic anthropology features heavily in JFS, and the papers contain a large amount of statistical analysis of data. The focus of the articles is often on classification of remains in to age, gender, or racial groups, or on age estimation. The articles are generally quite interesting and well written.
Show me the data

However, there is almost never any provision of the raw data, and my experience whilst writing my data analysis book was that there was no response to my requests for data from even a single anthropologist of the dozen or so that I wrote to. Not even a polite “sorry but we are unable to release the data.” I understand that in all scientific disciplines data can be expensive, in terms of time or money, to collect, and so a researcher might justifiably want to retain a data set as long as possible to get as much research value from it as possible. However, surely there must be a point where the data could be released in the public domain? The University of Tennessee Knoxville does have a forensic anthropology databank, but, at least from the webpage, it seems that there is an emphasis on deposits rather than withdrawals.

## A challenge

I therefore issue a challenge to the forensic anthropology community – release some of your data into the wild. It will benefit your discipline as it will others, and you might find your work cited more as people give you credit for producing the data that they are applying their novel techniques to.

# I am an applied statistician

Today brings a very nice blog post from Rafael Irizarry on being pragmatic in applied statistics rather than rigidly/religiously Bayesian or Frequentist.

Does this article reverse or contradict my thinking in forensic science? Not really. I am a strong proponent of Bayesian thinking in that field. However, in the shorter term I would be happier if practitioners simply had a better understanding of the Frequentist interpretation issues. As a statistician I depend on the collaboration of forensic scientists for both the problems and the data. Telling scientists that everything they are doing is incorrect is generally unhelpful. It is more productive to collaborate and make it better.

I am seriously considering the introduction of R Markdown for assignments in our second year statistics course. The folks at RStudio have made some great improvements in the latest version of R Markdown (R Markdown V2), which allow you to add a Markdown document template to your R package, which in turn does things like let you provide a document skeleton for the user with as much information as you like, link CSS files (if you are doing HTML), and specify the output document format as well. The latter is an especially important addition to RStudio.

The lastest version of RStudio incorporates Pandoc which is a great format translation utility (and probably more) written by John Macfarlane. It is an important addition to RStudio because it makes it easy to author documents in Microsoft Word, as well as HTML, LaTeX, and PDF. I am sure that emphasizing the importance having the option to export to Word will cause some eye-rolling and groans, but I would remind you that we are teaching approximately 800 undergrads a year in this class, most of who will never ever take another statistics class again, and join a workforce where Microsoft Word is the dominant platform. I like LaTeX too (I do not think I will ever write another book ever again in Word), but it is not about what I like. I should also mention that there are some pretty neat features in the new R Markdown like authoring HTML slides in ioslides format, or PDF/Beamer presentations, and creating HTML documents with embedded Shiny apps (interactive statistics apps).

I think on the whole the students should deal with this pretty well, especially since they can tidy up their documents to their own satisfaction in Word — not saying that RStudio produces messy documents, but rather that the facility to edit post rendering is available.

### Help?

However there is one stumbling block that I hope my readers might provide some feedback on — the issue of loading data. My class is a data analysis class. Every assignment comes with its own data sets. The students are happy, after a while, using read.csv() or read.table in conjunction with file.choose(). However, from my own point of view, reproducible research documents with commands that require user input quickly become tedious because you tend to compile/render multiple times whilst getting your code and your document right. So we are going to have to teach something different. As background, our institution has large computing labs that any registered student can use. The machines boot in either Linux or Windows 7 (currently, and I do not think that is likely to change soon given how much people loathe Windows 8 and what a headache it is for IT support). There is moderate market penetration of Apple laptops in the student body (I would say around 10%). So here is my problem — we have to teach the concept of file paths to a large body of students who on the whole do not have this concept in their skill set and who will find it foreign/archaic/voodoo. They will also regard this as another burdensome thing to learn on top of a whole lot of other things they do not want to learn like R and R Markdown. To make things worse, we have to deal with file paths over multiple platforms.

My thoughts so far are:

• Making tutorial videos
• Providing the data for each assignment in an R package that is loaded at the start of the document
• Providing code in the template document that reads the data from the web

I do not really like the last two options as they let the students avoid learning how to read data into R. Obviously this is not a problem for those who do not go on, but it shifts the burden for those who do. So your thoughts please.

#### Update

One option that has sort of occurred to me before is that in the video I could show how the fully qualified path name to a file can be obtained using file.choose() and then then students could simply copy and paste that into their R code.

# MCMC in Excel — an exercise in perversity

I found myself working in Excel as part of the work I did fitting exponential (and gamma) distributions to left censored data. This was partly due to do the fact that my research colleagues had done the initial distribution fitting using Microsoft Excel’s Solver to do maximum likelihood, something it does quite well. A shortcoming of this approach is that you cannot get the Hessian matrix for models with two or more parameters which you need if you want to place any sort of confidence interval around your estimates. There is nothing stop you, of course, from doing the actual mathematics, and calculating the values you need directly, but this all sounds like rather too much work and is distribution specific. One can equally make the criticism that the approximations used by the BFGS and other quasi-Newton methods are not guaranteed to be close to the true Hessian matrix.

The next step along the chain (I am sure this is a terribly mixed-metaphor but hey who cares), for me at least, was to use MCMC — in particular, to implement a simple random walk Metropolis-Hastings sampling scheme.

Note: The method I describe here is almost impossible for a multi-parameter model, or a model where the log-likelihood does not reduce to a simple sum of the data (or a sum of a function of the data). The reason for this is that Excel’s distribution functions are not vector functions, which means in many circumstances the values of the likelihood for different observations must be stored in separate cells, and then we have to sum over the cells. In a problem with n observations and m proposals, we then would have to store $$n\times m$$ values unless we resort to Visual Basic for Applications (VBA). However, I wanted to do this problem without VBA.

Note 2:I know that it is very easy to estimate the variance for the exponential distribution, but please refer to the title of this post.

## Microsoft Excel 2013 for Mac

In order to do MCMC we need to be able to generate random numbers. This functionality is provided in Excel by the Data Analysis Add-In. However, the Data Analysis Add-In has not been available since Excel 2008 for the Mac. There is a claim that this functionality can be restored by a third party piece of software called StatPlus LE, but in my limited time with it it seems a very limited solution. There are number of other pieces of functionality missing in the Mac version of Excel, which reduces its usefulness greatly.

## What do we need?

### We need data

I will use the similar data from my first post on this subject. However, this time I am going to switch to using data generated from a gamma distribution. To get some consistency, I have generated the data in R with $$\alpha=\mbox{shape}=10.776$$, a rate of $$\beta=\mbox{rate}=5.138$$, and a detection limit of $$\log(29)$$. These numbers might seem absurdly specific but they come from maximum likelihood estimates in a real data set. This leaves me with 395 observations above the detection limit, and 9,605. below it. We only need the sum of the observations above the limit, the sum of the log of the observations above the limit, and the two counts, because the log-likelihood only depends on the sum, the sum of the logs, and the two counts. That is, if $$x_i\sim Gamma(\alpha,\beta), i=1,\ldots,N$$ where $$n$$ is the number of observations above the detection limit (395) (and $$m=N-n$$ is the number of observations that are below the detection limit) then the likelihood function is

$$L(\alpha,\beta;{\mathbf x}) = \prod_{i=1}^{n}\frac{\beta^{\alpha}}{\Gamma(\alpha)}x^{\alpha-1}e^{-\beta x_i}\prod_{i=1}^{m}F(T; \alpha, \beta)$$

where $$T=\log(29)$$, and $$F(T; \alpha, \beta)$$ is the Gamma cumulative distribution function. The log-likelihood then simplifies to

\begin{align*} l(\alpha,\beta;{\mathbf x}) &= n\left(\alpha\log(\beta)-\log\Gamma(\alpha)\right)+(\alpha-1)\sum \log x_i -\beta\sum x_i \\ &+m \log F(T; \alpha, \beta) \\ &= (\alpha-1)\sum \log x_i -\beta\sum x_i + \kappa(\alpha,\beta,n,m)\\ \end{align*}

which depends only the sum the observations above the limit, the sum of the logarithms of the observations above the detection limit, and the number of observations above and below the detection limit.

### We need to specify the priors and to get a set of proposal values

In my JAGS model I used a $$\Gamma(0.001, 0.001)$$ priors for $$\alpha$$ and $$\beta$$. This would be easy enough to implement in Excel if the inverse Gamma function was sufficiently robust. However, it is not, and so I have opted for a prior which is $$U(-2,3)$$ on log-scale.

This prior is a little less extreme than the $$\Gamma(0.001, 0.001)$$ prior but has reasonable properties for this example.

We can use the Data Analysis Add-In to generate a set of proposal values. The screen capture below shows the dialog box from the Random Number Generation part of the Data Analysis Add-In. We need proposals for both $$\alpha$$ and $$\beta$$. Therefore we ask Excel to give us 2 random variables. In a standard MCMC implementation we usually choose a “burn-in” period to make sure our samples are not to correlated with the starting values, and to give the sampler time to get somewhere near the target distribution. In this example we will use a burn-in period of 1,000 iterations and then sample for a further 10,000 iterations, for a total of 11,000 iterations. We get Excel to put the proposals out into columns B and C starting at row 2 (and extending to row 11,001). Note: I have set the random number seed (to 202) here so that my results can be replicated.

We also need a column of U[0,1] random variates for our Metropolis-Hastings update step. The screen capture below shows the dialog box how we set this up. We store these values in column F, and as before I have set the random number seed (to 456) so that my results can be replicated.

We use columns C and D to transform our uniform random variates to the original scale to get our proposals for $$\alpha$$ and $$\beta$$. We do this by entering the formula

=exp(B2)

into cell D2, and then selecting cells D2 to D11001 and using the Fill Down command to propagate the formula. We select the range D2:E:11001 and use the Fill Right command to propagate the formula formula across for $$\beta$$. Columns C and D contain my proposal values for $$\alpha$$ and $$\beta$$.

### We need some data

As noted before, all we need to is the sum of the observed values and the sum of the log of the observed values, plus the number of observed and censored values. The sum of the observed values in my data set is 1478.48929487124 (stupid accuracy for replication), and the sum of the logs of the observed values is 519.633872429806. As noted before the number of observed values is 395, and there are 9,605 censored values. I will insert these values in cells I2 to I5 respectively, and in cells H2 to H5 I will enter the labels sum_x, sum_log_x, nObs, and nCens.

#### We need some names

It is useful to label cells with names when working with Excel formulae. This allows us to refer to cells containing values by a name that means something rather than a cell address. We can define names by using the tools on Formula tab. I will use this tool to assign the names I put into cells H2 to H5 to the values I put into cells I2 to I5. To do this I select the range H2:I5, and the I click on the Formula tab, then the “Create names from Selection” button as shown in the screenshot below: Note I do not believe you can do this on the Mac, but I do not know for sure.

I can now use, for example, the name sum_x to refer to cell address $I$2 in my formulae. It also removes the need to make sure that the address is absolute every time I type it.

### We need to be able to calculate the log-likelihood

The only “tricks” we need to calculate the log likelihood are knowing how to calculate the natural logarithm, the Gamma cdf, and the log-gamma function. I mention the first because the natural logarithm in Excel conforms to mathematical convention in that $$\log_e(x)={\rm ln}(x)$$, and the corresponding Excel function is LN. The LOG function calculates $$\log_{10}(x)$$ in Excel. Excel’s GAMMA.DIST function provides both the pdf and the cdf. The latter is obtained by setting the fourth argument (CUMULATIVE) to TRUE. It is important to note that Excel uses parameters alpha and beta, but these correspond to shape and scale, not shape and rate. Finally, the GAMMALN function provides us with the logarithm of the complete gamma function. We will calculate the log-likelihood in column J, therefore we enter the following formula into cell J2

=(D2 - 1) * sum_log_x - E2 * sum_x
+ nObs * (D2 * LN(E2) - GAMMALN(D2))
+ nCens * LN(
GAMMA.DIST(LN(29), D2, 1 / E2, TRUE))
)


After you have got this formula correct (and it will probably take more than one go), then select cells J2:J11001 and use the “Fill Down” command (Ctrl-F on Windows) to propagate the formula down for every proposed value.

### We need starting values

We will store our sampler’s current state in columns K, L and M. The current value of $$\alpha$$ gets stored in column K, the current value of $$\beta$$ in column L, and the current value of the log-likelihood in column M. We need some starting values, and so in this example we will use the first proposal values. In cell K2 we enter the formula

=D2


in cell L2 we enter the formula

=E2


and in cell M2 we enter the formula

=J2


# New PS/4 game console disappointments

Last year around August, during all the hype about the new Xbox One and Sony’s PlayStation 4, I decided to pre-order both consoles. However, as the release date approached I realized that the number of titles available for both consoles was extremely limited with many studios aiming for a 2014 release date. Based on this I decided that I did not need approximately \$NZD1,300 of electronics that were glorified Blue ray players. I should inform readers (he says optimistically) who are not based in New Zealand or Australia that we do not have access to Netflix in New Zealand, and nor do we have cable TV that integrates with either box. Voice activation and control of the Xbox One is also currently disabled in Australia and New Zealand. Microsoft won’t say why but one strongly suspects that the software simply cannot deal with the Australasian accent. The voice features of Android used to be hopeless for anyone who did not have an American accent. Google has spent considerable time working to remedy this, and now it mostly works.

So I held off until May of this year, and finally purchased the PS/4. I had originally thought I would buy the Xbox first because I have been a much heaver Xbox 360 gamer, but the only two new games that interested me were Watchdogs, available on all platforms, and Sucker Punch’s inFamous: Second Son, which is a PS/4 exclusive. The latter swung the decision as I have enjoyed playing both inFamous and inFamous 2, and, to be fair, with the exception of the idiotic spray-painting side “quests”, I enjoyed this iteration as well.

###### Watch Dogs

Watch Dogs, from Ubisoft Montreal, was majorly hyped at 2013’s E3 event. Billed as open world action-adventure title with driving, shooting, running and climbing (parkour), and hacking all being part of the game. In reality, they should have just stuck with driving. This game reminds me so much of Grand Theft Auto it is not funny. Many of the integral parts of the story line require the player to drive at high-speed, either in pursuit of someone, or to get to a specific location within a time limit. These are the exact things I loathe about GTA, but can forgive in GTA, because hey that game is called Grand Theft Auto. It is about cars, and you expect cars to be an integral part of the game. Personally, I like in this sort of game to jump, climb and precision shoot. Watch Dog’s lead protagonist, Aiden Pearce, can run, and climb in a very limited fashion, but not jump — he can fall down to a lower level, or vault a fence which is really just an extension of the running, but he cannot jump from say a boat to a pier.

The hacking part of the game is sort of interesting in that it can provide you a way of carrying out a mission without entering a building, but mostly it doesn’t. It lets you steal money from unsuspecting citizens, which Eurogamer points out you can’t even give to the numerous beggars around the city. It does not really help you when you are driving, except in the odd case where you can cause bollards to rise out of the road to stop a car, or raise a bridge. Eurogamer states, again correctly, that it is a game about hacking where hacking quickly becomes incidental.

A number of side missions require you to interfere with, or stop, criminal activity. Fine, I have no problem with that, but then why do most of these missions require me to tackle someone who is shooting at me with an automatic weapon? Overall, I am extremely disappointed with this game and will probably not bother to complete it.

###### No DLNA support, no third party media

Both the Xbox 360 and Sony PlayStation 3 could see and play all the media content hosted on my Windows box in the basement. This was a nice feature because it meant I could play media (music, video, browse photos) without buggering around with discs, portable hard drives, or flash drives. So you might say, okay minor inconvenience, stick it on a flash drive and away you go. No dice. Sony, in its infinite wisdom, has decided that “Unfortunately as noted pre-launch of the PS4, you cannot play movies on the PS4 off of an external hard drive or memory stick. However this is something that is looking to be added in the future with an update.” One question – “Why?” It is hardly a revolutionary feature — my TV does it. The inability to insert an audio disc seems like a kick in the face to the consumer, and overall Sony, you are just pissing off a lot of people who give your company considerable revenue.

###### Final thoughts

Its possible that Sony may fix some of these issues, but I am not holding my breath. I am hoping that Far Cry 4, releasing in November, is as good as the previous titles in the series.