All posts by James Curran

Conference Programmes in R

It has been a while, folks!

Lately (over the last month) I have been writing code that helps with the production of a timetable and abstract booklet for conferences.

The latest version of that code can be found here, the associated bookdown project here, and the final result here. Note that the programme may change a good number of times before next Sunday (December 10, 2017) when the conference starts.

The code above is an improved version of the package I wrote for the Biometrics By The Border conference, which worked solely with Google Sheets, using it as a poor man’s database. I would not advise this approach, because it becomes an extreme bottleneck when you constantly need to refresh the information.

The initial project was driven by necessity. I foolishly suggested we use EasyChair to collect information from speakers, not realising that the free version of EasyChair does not allow you to download the abstracts your speakers helpfully provided to you (and fair enough: they are trying to make some money from this service). Please note that if I am incorrect in this statement, I’d be more than happy to find out. A word of advice: if this is still the case, then get your participants to upload an R Markdown (or just plain text) file instead of using the abstract box on the webpage. You can download these files from EasyChair in a single zip file.

The second project was also driven by necessity, but the need in this case had more to do with the sheer complexity of organising several hundred talks over a four-day programme. Now that everything is in place, most changes requested by speakers or authors can be accommodated in a few minutes by simply moving the talk on the Google sheet that controls the programme, calling several R functions, recompiling the book, and pushing it up to GitHub.

So how does it all work?

The current package does depend on some basic information being stored on Google Sheets. The sheets for my current conference (the Joint Conference of the New Zealand Statistical Association and the International Association of Statistical Computing (Asian Regional Section)) can be viewed here on Google Sheets. Not all the worksheets here are important. The key sheets are: Monday, Tuesday, Wednesday, Thursday, All Submissions, Allocations, All_Authors, Monday_Chairs, Tuesday_Chairs, Wednesday_Chairs, and Thursday_Chairs.

The first four and the last four sheets (the day sheets and the chair sheets) are created by hand. The colours are for our convenience and do not matter. The layout does, and some of it is hard-coded into the project. For example, although there are seven streams a day, only six of those belong directly to us; the seventh belongs to a satellite conference. The code relies on every scheduled event (including meal breaks) having a time adjacent to it in the Time column. It also collects information about the rooms, as the order of the rooms changes on some days.

The All_Authors sheet comes from EasyChair. EasyChair provides the facility to download author email information as an Excel spreadsheet, which in turn can be uploaded to Google Sheets easily. This sheet is simply the sheet labelled All in the EasyChair file. The All Submissions sheet is a copy-and-paste from the EasyChair submissions page. I probably could webscrape this with a little effort. The Allocations sheet mostly reflects the All Submissions sheet, but has user-added categorizations that allow us to group the talks into sensible sessions. It also uses formulae to format the titles of the talks so that each title word is capitalized (an imperfect process), and to append the name of the submitter (who we have assumed is the speaker) to the title so that it can be pasted into the programme sheets.
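As an aside, what those sheet formulae do can be mimicked in R. A rough sketch (formatTitle is a hypothetical name, and tools::toTitleCase is imperfect in much the same way the formulae are):

library(tools)

## capitalize the title words and append the submitter's name, ready for
## pasting into a programme sheet
formatTitle = function(title, submitter){
  paste0(toTitleCase(title), " - ", submitter)
}

formatTitle("a bayesian approach to forensic inference", "James Curran")
## [1] "A Bayesian Approach to Forensic Inference - James Curran"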

How do we get data out of Google Sheets?

Enter googlesheets.

The code works in a number of distinct phases. The first is capturing all the relevant information from the web and placing it in a SQLite database. I used Jenny Bryan’s googlesheets package to extract the data from my spreadsheets. The process is quite straightforward, although it worked a little more smoothly on my Mac than on my Windows box. This may have had more to do with the fact that the R install on the Windows machine was brand new, and so the httr package was not installed. The difference between httr being installed and not installed is that when it is installed, authentication (with Google) happens entirely within the browser, whereas without it you are required to copy and paste an authentication code back into R. When you are doing this many times, the former is clearly more desirable.

Interaction with Google Sheets consists of first grabbing the workbook, and then asking for information from the individual sheets. (Google Sheets does not use the term workbook; it comes from Excel, but it encapsulates the idea that any one project is a collection of one or more worksheets, and that the file containing them all is called a workbook.) The request to see a workbook is what will prompt the authentication, e.g.

library(googlesheets)
mySheets = gs_ls()
mySheets

The call to gs_ls will prompt authentication. It is important to note that this authentication lasts for a certain period of time, and should only require interaction with a web browser the very first time you do it. Subsequent requests will result in a message along the lines of Auto-refreshing stale OAuth token. The call to gs_ls returns a list of available workbooks. A particular workbook may be retrieved by calling gs_title. For example, this function allows me to get to the conference workbook:

getProgrammeSheet = function(title = "NZSA-IASC-ARS 2017"){
  mySheets = gs_ls()
  ss = gs_title(title)
  return(ss)
}

I can use the object returned by this function to access the worksheets I want to work with, using the gs_read function. The functions in googlesheets are written to be compliant with the tidyverse, and in particular with the pipe (%>%) operator. I will be the first to admit this is not my preferred way of working, and I could easily have worked around it. However, in the interests of furthering my skills, I have tried to make the project tidyverse compliant, and nearly all the functions will work with the pipe operator, although they take a worksheet or a database as their first argument rather than a tibble.
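For example, to pull one of the day sheets (the range here is illustrative, and col_names is passed through to the reader):

library(googlesheets)
library(magrittr)  ## for %>%

ss = getProgrammeSheet()
monday = ss %>% gs_read(ws = "Monday", range = "A1:G40",
                        col_names = LETTERS[1:7])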

Once I have a workbook object (really just an authenticated connection), I can then read each of the worksheets. This is done by a function called updateDB. The only thing worth commenting on in this function is that the worksheets for each of the days have headers which do not resolve well into column names. They also have a fixed set of rows and columns that we wish to capture. To that end, the range is specified for each day, and the column headers are simply set to the first seven capital letters of the alphabet (A–G). These sheets are stored as tibbles, which are then written to an internal SQLite database using the dbWriteTable function. There are eight functions (createRoomsTbl, createAffilTbl, createTitleTbl, createAbstractTbl, createAuthorTbl, createAuthorSubTbl, createProgTbl, createChairTbl) which operate on the database/spreadsheet tables to produce the database tables we need for generating the conference timetable and the abstract booklet. These functions are rarely called by themselves; we tend to call a single omnibus function, rebuildDB. This function allows the user to refresh the information from the web if need be, and to recreate all of the tables in the database. The bottleneck in this function is retrieving the information from the internet, which may take around 20–30 seconds depending on the connection speed.
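A stripped-down sketch of that read-then-store step for a single day (the function name, range, and database file name are illustrative; the real updateDB does rather more):

library(googlesheets)
library(DBI)
library(RSQLite)

storeDaySheet = function(ss, day, dbFile = "conference.db"){
  ## read the day's worksheet, using A-G as the column names
  tbl = gs_read(ss, ws = day, range = "A1:G40", col_names = LETTERS[1:7])

  ## write (or overwrite) the corresponding table in the SQLite database
  con = dbConnect(SQLite(), dbFile)
  dbWriteTable(con, day, as.data.frame(tbl), overwrite = TRUE)
  dbDisconnect(con)
}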

The database tables provide information for four functions: writeTT, writeProg, writeIndices, and writeSessionChairs. Each of these functions produces one or more R Markdown files in the directory used by the bookdown project.
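Each of these functions follows the same basic pattern: build up a string of markdown and HTML, then write it out as a page for bookdown to pick up. Schematically (the argument names here are mine; the real functions pull their content from the database):

## schematic only: the real writeTT builds timetableHTML from the database
writeTT = function(timetableHTML, bookdownDir = "."){
  lines = c("# Timetable {-}", "", timetableHTML)
  writeLines(lines, file.path(bookdownDir, "01-timetable.Rmd"))
}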

Bookdown, ePub and gitBook

The final product is generated using bookdown. Bookdown sounds simple in explanation; implementation is greatly improved by the help of a master. I found this blog post by Sean Kross very helpful, along with his minimal bookdown project on GitHub. It would be misleading of me to suggest that the programme book was really produced using R Markdown. Small elements are in markdown, but the vast majority of the formatting is achieved by writing code which writes HTML. This is especially true of the conference timetable, and of the hyperlinking of the abstracts to the timetable and other indices. The four functions listed above write out eight markdown pages: the conference timetable, the session chairs table, one page for each of the four days of the conference, and two indices, one linking talks to authors, and one linking talk titles to submission numbers (which for the most part were issued by EasyChair). There is not much more to discuss here. Sean’s project sets things up in such a way that changes to the markdown or the yaml files will automatically trigger a rebuild.
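Rebuilding the book by hand is a one-liner. Something like the following (the exact output formats are set in the project’s _output.yml) produces the gitBook version:

library(bookdown)

## render the gitBook version; swap in "bookdown::epub_book" for the ePub
render_book("index.Rmd", "bookdown::gitbook")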

Things I struggled with

Our conference had six parallel streams. Whilst it is easy enough to make tables that will hold all this information, it is very difficult to decide how best to display it in a way that suits everyone. The original HTML tables were squashed almost to the point where there was a character per line on a mobile phone screen. We overcame this slightly by fixing the width of the div element that holds the tables and adding horizontal scrolling. Many people found this feature confusing, and it did not necessarily translate into ePub. We then added some Javascript functionality by way of the tablesaw library. This allowed us to keep the time column persistent no matter which stream people were looking at, and it allowed better scrolling of streams that were offscreen. However, this was still a step too far for the technologically challenged. In the end we resorted to printing out the timetable. I also used the excellent Calibre software to take the ePub as input and output it in other formats, most usefully Microsoft Word’s .docx format. I know some of you are shuddering at this thought, but it did allow me to rotate the programme timetable to landscape and then create a PDF. This made the old fogeys immensely happy, and me immensely irritated, as I thought the gitBook version was quite useful.

Not forgetting the abstracts

Omitted from my workflow so far is any mention of the abstracts. We had authors upload LaTeX (.tex) files, or plain text files, to EasyChair. If you don’t do this, and they use EasyChair’s abstract box, then you have to find a way to scrape the data. One downside of scraping (and it even occurs in the user-submitted files) is that some stray unicode text seems to creep in. Needless to say, even after I used an R function to convert all the files to markdown, we still had to do a fair amount of manual cleaning.
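For what it is worth, a conversion like this can be done in one call via rmarkdown’s Pandoc wrapper (file names are made up; this is a sketch, not the exact function I used):

library(rmarkdown)

## convert a single LaTeX abstract to markdown; the unicode clean-up
## mentioned above still has to happen afterwards
pandoc_convert("abstract_42.tex", to = "markdown", output = "abstract_42.md")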

Anyway

I hope someone finds this work useful. I have no intention of running a conference again for at least four years, but I would be delighted if anyone wants to build on my work.


How do I match that?

This is not a new post, but a repost after a failed WordPress upgrade.

One of the projects I am working on is the 3rd edition of Introduction to Bayesian Statistics, a textbook predominantly written by my former colleague Bill Bolstad. It will be out later this year if you are interested.

One of the things which will make no difference to the reader, but a lot of difference to me, is the replacement of all the manually numbered references in the book to things like chapters, sections, tables and figures. The problem I am writing about today arose from programmatically trying to label the latter (tables and figures) in LaTeX. LaTeX’s reference system, as best I understand it, requires that you place a \label command after the caption. For example

\begin{figure}
\centering
\includegraphics{myfig}
\caption{Here is a plot of something}
\label{fig:myfig}
\end{figure}

This piece of code will create a label fig:myfig which I can later use in a \ref{fig:myfig} command, e.g. Figure~\ref{fig:myfig}. This will in turn be replaced at compile time with a number dynamically formatted according to the chapter and the number of figures which precede this plot.
The challenge

The problem I was faced with is easy enough to describe. I needed to open each .tex file, find all of the figure and table environments, check to see if they contained a caption and add a label if they did.
Regular expressions to the rescue

Fortunately I know a bit about regular expressions, or at least enough to know when to ask for help. To make things more complicated for myself, I did this in R. Why? Well, basically because a) I did not feel like dusting off my grossly unused Perl (I have been clean of Perl for 4 years now and I intend to stay that way), and b) I could not be bothered with Java’s file handling routines: I want to be able to open files for reading with one command, not 4 or 8 or whatever the hell it is. Looking back, I could have used C++, because the new C++11 standard finally has a regex library, and the ability to have raw strings where everything does not have to be double escaped, i.e. I can write R"(\label)" to look for a line that has a \label on it, rather than "\\label", where I have to escape the backslash.

And before anyone feels the urge to suggest a certain language I remind you to “Just say no to Python!”

Finding the figure and table environments is easy enough. I simply look for the \begin{figure} and \begin{table} tags, as well as the environment closing tags \end{figure} and \end{table}. It is possible to do this all in one regular expression, but I wanted to capture the \begin and \end pairs. I also wanted to deal with tables and figures separately. The reason for this is that it was possible to infer the figure labels from Bill’s file naming convention for his figures. The tables on the other hand could just be labelled sequentially, i.e. starting at 1 and counting upwards with a prefix reflecting the chapter number.

Lines = readLines("Chapter.tex")

## fixed = TRUE: we want to match the literal strings \begin{figure} and
## \end{figure}, not interpret them as regular expressions
begin = grep("\\begin{figure}", Lines, fixed = TRUE)
end = grep("\\end{figure}", Lines, fixed = TRUE)

n = length(begin)
if(n != length(end))
  stop("The number of begin and end pairs don't match")

## Now we can work on each figure environment in turn
for(k in 1:n){
  b = begin[k]
  e = end[k]

  block = paste0(Lines[b:e], collapse = "\n")

  if(!grepl("\\label", block, fixed = TRUE)){ ## only add a label if there isn't one already
    ## everything I'm really talking about is going to happen here.
  }
}

So what I needed to be able to do was find the caption inside my block and then insert a label. Easy enough right? I should be able to write a regular expression to do that. How about something like this:

pattern = "\\caption\\{([^}]+)\\}"

That will work most of the time, except, as you will find out, when the caption itself contains braces, and we have some examples that do just that:

\caption{``If \emph{A} is true then \emph{B} is true.''  Deduction is possible.}

My first regular expression would only match up to the end of \emph{A}, which does not help me. I need something that can, in theory, match an unlimited number of nested sets of braces.
Matching nested parentheses

Fortunately matching nested parentheses is a well-known problem and Hadley Wickham tweeted me a Stack Overflow link that got me started. There is also a similar solution on page 330 of Jeffrey Friedl’s very useful Mastering Regular Expressions book. The solution relies on a regular expression which employs recursion.
Set perl = TRUE to use PCRE (and recursion) in R

To make this work in R we have to make sure the PCRE library is used, and this is done by setting perl = TRUE in the call to gregexpr.

This is my solution:

## insert the label after the caption
pat = "caption(\\{([^{}]+|(?1))*\\})"
m = gregexpr(pat, block, perl = TRUE)
capStart = attr(m[[1]], "capture.start", TRUE)[1]
capLength = attr(m[[1]], "capture.length", TRUE)[1]

## figNumber is inferred from Bill's file naming convention for his figures
strLabel = paste0("\\", "label{fig:", figNumber, "}\n")
newBlock = paste0(substr(block, 1, capStart + capLength),
                  strLabel,
                  substr(block, capStart + capLength + 1, nchar(block)))

The regular expression assigned to pat is where the work gets done. Reading the expression from left to right, it says:

  • match caption literally
  • open the first capture group
  • match { literally
  • open the second capture group
  • match one or more instances of anything that is not an opening brace { or a closing brace }
  • or, alternatively, recurse into the first capture group via (?1) (I will elaborate on this more in a bit)
  • close the second capture group, and match it zero or more times
  • match the closing brace } literally
  • close the first capture group

I would be the first to admit that I did not quite understand what ?1 does in this regexp when I first used it. The initial solution used ?R. The effect of that was that I could match all sets of paired braces within block, but I could not specifically match the caption. As far as I understand it, (?R) recurses the entire pattern, including the literal caption, whereas (?1) recurses into just the first capture group, i.e. just the brace-matching part, which is exactly what is needed here. I would be interested to hear from anyone who knows better.
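A quick sanity check in R that the recursive pattern really does cope with the nested braces in the problem caption:

block = "\\caption{``If \\emph{A} is true then \\emph{B} is true.''  Deduction is possible.}"
pat = "caption(\\{([^{}]+|(?1))*\\})"

## the single match should be the whole caption, nested braces included
regmatches(block, gregexpr(pat, block, perl = TRUE))[[1]]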

The rest of the code breaks the string apart, inserts the correct label, and creates a new block with the label inserted. I replace the first line of the figure environment block with this new string, and keep a list of the remaining line numbers so that they can be omitted when the file is written back to disk.


An introduction to using Rcpp modules in an R package

Introduction

The aim of this post is to provide readers with a minimal example demonstrating the use of Rcpp modules within an R package. The code and all the files for this example can be found at https://github.com/jmcurran/minModuleEx.

What are Rcpp Modules?

Rcpp modules allow R programmers to expose their C++ classes to R. By “expose” I mean the ability to instantiate a C++ object from within R, and to call methods which have been defined in the C++ class definition. I am sure there are many reasons why this is useful, but the main one for me is that it provides a simple mechanism for creating multiple instances of the same class. An example of where I have used this is my multicool package, which is used to generate permutations of multisets. One can certainly imagine a situation where you might need to generate the permutations of more than one multiset at the same time. multicool allows you to do this by instantiating multiple multicool objects.

The Files

I will assume that you, the reader, know how to create a package which uses Rcpp. If you do not, then I suggest you look at the section entitled “Creating a New Package” here on the RStudio support site. Important: although it is mentioned in the text, the image displayed on that page does not show that you should change the Type: drop-down box to Package w/ Rcpp.

[Screenshot: creating a new package with Rcpp in RStudio]

This makes sure that a bunch of fields are set for you in the DESCRIPTION file that ensure Rcpp is linked to and imported.

There are five files in this minimal example. They are

  • DESCRIPTION
  • NAMESPACE
  • R/minModuleEx-package.R
  • src/MyClass.cpp
  • R/zzz.R

I will discuss each of these in turn.

DESCRIPTION

This is the standard DESCRIPTION file that all R packages have. The lines that are important are:

Depends: Rcpp (>= 0.12.8)
Imports: Rcpp (>= 0.12.8)
LinkingTo: Rcpp
RcppModules: MyModule

The Imports and LinkingTo lines should be generated by RStudio. The RcppModules: line should contain the name(s) of the module(s) that you want to use in this package. I have only one module in this package, which is unimaginatively named MyModule. The module exposes two classes, MyClass and AnotherClass.

NAMESPACE and R/minModuleEx-package.R

The first of these is the standard NAMESPACE file, and it is automatically generated using roxygen2. To make sure this happens you need to select Project Options… from the Tools menu, which will bring up the following dialogue box:

[Screenshot: the Project Options dialogue]

Select the Build Tools tab, and make sure that the Generate documentation with Roxygen checkbox is ticked. Then click on the Configure… button and make sure the checkboxes are set as shown below:

[Screenshot: configuring Roxygen]

Note: If you don’t want to use Roxygen, then you do not need the R/minModuleEx-package.R file; you simply need to put the following three lines in the NAMESPACE file:

export(AnotherClass)
export(MyClass)
useDynLib(minModuleEx)

There are two things to notice here. Firstly, this NAMESPACE explicitly exports the two classes MyClass and AnotherClass. This means these classes are available to the user from the command prompt. If you only want the classes to be available to R functions within the package, then you do not need to export them. Secondly, as previously noted, if you are using Roxygen, then these export statements are generated automatically from the comments just before each class declaration in the C++ code, which is discussed in the next section. The useDynLib(minModuleEx) line is generated from the line

#' @useDynLib minModuleEx

in the R/minModuleEx-package.R file.

src/MyClass.cpp

This file contains the C++ class definition of each class (MyClass and AnotherClass). There is nothing particularly special about these class declarations, although the comment lines before the class declarations,

//' @export MyClass
class MyClass{

and

//' @export AnotherClass
class AnotherClass{

generate the export statements in the NAMESPACE file.

This file also contains the Rcpp Module definition:

RCPP_MODULE(MyModule) {
  using namespace Rcpp;

  class_<MyClass>( "MyClass")
    .default_constructor("Default constructor") // This exposes the default constructor
    .constructor<NumericVector>("Constructor with an argument") // This exposes the other constructor
    .method("print", &MyClass::print) // This exposes the print method
    .property("Bender", &MyClass::getBender, &MyClass::setBender) // and this shows how we set up a property
  ;

  class_<AnotherClass>("AnotherClass")
    .default_constructor("Default constructor")
    .constructor<int>("Constructor with an argument")
    .method("print", &AnotherClass::print)
  ;
}

In this module I have:

  1. Two classes MyClass and AnotherClass.
  2. Each class has:
    • A default constructor
    • A constructor which takes arguments from R
    • A print method
  3. In addition, MyClass demonstrates the use of a property field, which (simplistically) provides the user with simple retrieval from, and assignment to, a scalar class member variable. It is unclear to me whether it works for other data types; anecdotally, I had no luck with matrices.

R/zzz.R

As you might guess from the nonsensical name, it is not essential to call this file zzz.R. The name comes from a suggestion by Dirk Eddelbuettel. It contains a single, but absolutely essential, line of code:

loadModule("MyModule", TRUE)

This code can actually be in any of the R files in your package. However, if you explicitly put it in R/zzz.R then it is easy to remember where it is.

Using the Module from R

Once the package is built and loaded, using the classes from the module is very straightforward. To instantiate a class you use the new function. E.g.

m = new(MyClass)
a = new(AnotherClass)

This code will call the default constructor for each class. If you want to call a constructor which has arguments, then they can be added to the call to new. E.g.

set.seed(123)
m1 = new(MyClass, rnorm(10))

Each of these objects has a print method which can be called using the $ operator. E.g.

m$print()
a$print()
m1$print()

The output is

> m$print()
1.000000 2.000000 3.000000
> a$print()
0
> m1$print()
1.224082 0.359814 0.400771 0.110683 -0.555841 1.786913 0.497850 -1.966617 0.701356 -0.472791

The MyClass class has a module property (a concept also used in C#). A property is a scalar class member variable that can either be set or retrieved. For example, m1 has been constructed with the default value of bBender = FALSE; however, we can change it to TRUE easily:

m1$Bender = TRUE
m1$print()

Now our object m1 behaves more like Bender when asked to do something 🙂

> m1$print()
Bite my shiny metal ass!

Hopefully this will help you to use Rcpp modules in your project. This is a great feature of Rcpp and really makes it even more powerful.


An R/Rcpp mystery

I have spent almost four hours today trying to deal with the fact that our R package DNAtools would not compile under Windows. The issue originated with a win-builder build which was giving errors like this:


I"D:/RCompile/recent/R/include" -DNDEBUG -I"d:/RCompile/CRANpkg/lib/3.4/Rcpp/include" -I"d:/Compiler/gcc-4.9.3/local330/include" -c DNTRare.cpp -o DNTRare.o
ID:/RCompile/recent/R/include: not found

and I replicated this (far too many times) on my Windows VM on my Mac.

In the end this boiled down to the presence of our Makevars file, which contained only one line:


CXX_STD = CXX14

Deleting it fixed the problem, and the package now compiles just fine locally. I am waiting for the response from the win-builder site; I do not anticipate an issue, but it would be useful to understand what was going wrong. I must admit that I have forgotten which aspects of the C++14 standard we are using, but I do know that changing the line to


PKG_CXXFLAGS= -std=c++14

which I use in my multicool package, gives me a different pain, with the compiler being unable to locate Rcpp.h after seeing a #include directive.


Extracting elements from lists in Rcpp

If you are an R programmer, especially one with a solid background in data structures or with experience in a more traditional object oriented environment, then you probably use lists to mimic the features you might expect from a C-style struct or a class in Java or C++. Retrieving information from a list of lists, or a list of matrices, or a list of lists of vectors is fairly straightforward in R, but you may encounter some compiler error messages in Rcpp if you do not take the right steps.


This will not be a very long article, but I think it is useful to have this information somewhere other than Stack Overflow. Two posts, one from Dirk and one from Romain, contain the requisite information.

The List class does not know what type of elements it contains. You have to tell it. That means if you have something like

x = list(a = matrix(1:9, ncol = 3), b = 4)

in your R code and

void Test(List x){
  IntegerMatrix a = x["a"];
}

in your C++ code, then you might get a compiler error complaining about certain things not being overloaded. As Dirk points out in another post (which I cannot find right at this moment), the accessor operator for a List simply returns a SEXP. Rcpp has done a pretty good job of removing the need for us to get our hands dirty with SEXPs, but they are still there. If you know (and you should, since you are the one writing the code and designing the data structures) that this SEXP is actually an IntegerMatrix, then you should cast it as one using the as<T>() function. That is,

void Test(List x){
  IntegerMatrix a = as<IntegerMatrix>(x["a"]);
}

So why does this work?

If you look around the internet, you will see chunks of code like

int b = x["b"];
NumericVector y = x["y"];

which compile just fine. Why does this work? It works because the assignment operator has been overloaded for certain types in Rcpp, so you will probably find you do not need explicit type coercion. However, it certainly will not hurt to be explicit in every assignment, and your code will be clearer for it.
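As a quick check, here is an R session exercising the function above (this assumes Test is preceded by // [[Rcpp::export]] and saved in a file called test.cpp):

library(Rcpp)
sourceCpp("test.cpp")

x = list(a = matrix(1:9, ncol = 3), b = 4)
Test(x)  ## extracts x[["a"]] as an IntegerMatrix on the C++ side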


Generating pseudo-random variates C++-side in Rcpp


It is well-known that if you are writing simulation code in R you can often gain a performance boost by rewriting parts of your simulation in C++. These days the easiest way to do that, of course, is to use Rcpp. Simulation usually depends on random variates, and usually great numbers of them. One of the issues that may arise is that your simulation needs to generate them on the C++ side of things. For example, if you decide to programme your Metropolis-Hastings algorithm (not technically a simulation, I know) in Rcpp, then you are going to need to be able to generate hundreds of thousands, if not millions, of random numbers. You can use Rcpp’s features to call R routines from within Rcpp to do this, e.g.

Function rnorm("rnorm");
rnorm(100, _["mean"] = 10.2, _["sd"] = 3.2 );

(Credit: Dirk Eddelbuettel)

but this has a certain overhead. C++ has had built-in random number generation functionality since the C++11 standard (and in draft form since C++0x). The random header file provides a Mersenne Twister uniform random number generator (RNG), a Linear Congruential Generator (LCG), and a Subtract-with-Carry RNG. There is also a variety of standard distributions available, described here.

Uniform random variates

The ability to generate good quality uniform random variates is essential, and the mt19937 engine provides just that. The 19937 refers to the Mersenne prime \(2^{19937}-1\) that this algorithm is based on, which is also its period length. There are four steps required to generate uniform random variates. These are:

  1. Include the random header file
  2. Construct an mt19937 random number engine, and initialise it with a seed
  3. Construct a \(U(0,1)\) random number generator
  4. Use your engine and your uniform random number generator to draw variates

In code we would write

#include <random>
#include <Rcpp.h>

using namespace std;
using namespace Rcpp;

mt19937 mtEngine;
uniform_real_distribution<double> rngU;

//[[Rcpp::export]]
void setSeed(unsigned int seed){
  mtEngine = mt19937(seed);
  rngU = uniform_real_distribution<>(0.0, 1.0);
}

double runif(void){
  return rngU(mtEngine);
}

The function runif can now be called C++-side as runif(). Note that the setSeed function has been exported so that you can initialise the RNG engine from R with a seed of your choice.
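From R, the seeding looks like this (assuming the C++ above is saved in rng.cpp; runif itself is deliberately not exported, so it is only callable from other C++ code):

library(Rcpp)
sourceCpp("rng.cpp")

setSeed(42)  ## initialise the Mersenne Twister from R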

How about normal random variates?

It does not require very much more effort to add a normal RNG to your code. We simply add

normal_distribution<double> rngZ;

to our declared variables, and

//[[Rcpp::export]]
void setSeed(unsigned int seed){
  mtEngine = mt19937(seed);
  rngU = uniform_real_distribution<>(0.0, 1.0);
  rngZ = normal_distribution<double>(0.0, 1.0);
}

double rnorm(double mu = 0, double sigma = 1){
    return rngZ(mtEngine) * sigma + mu;
}

to our code base. Now rnorm can be called without arguments to get standard (\(N(0,1)\)) random variates, or with a mean, or a standard deviation, or both, to get \(N(\mu,\sigma^2)\) random variates.

Rcpp does it

No doubt someone is going to tell me that Romain and Dirk have thought of this already for you, and that my solution is unnecessary Morris Dancing. However, I think there is merit in knowing how to use the standard C++ libraries.

Please note that I do not usually advocate having global variables such as those in the code above. I would normally make mtEngine, rngU, and rngZ private member variables of a class, and then either instantiate it using an exported Rcpp function, or export the class and essential functions using an Rcpp module.

Working C++ code and an R test script can be found here in the RNG folder. Enjoy!


Embracing the new. Is it time to ditch pointers in C++?

This post was originally published on December 22, 2014.

Recently I had the opportunity to revisit some research that I did circa 1997-1998, because someone asked me to write a book chapter on the subject. It is an interesting process to go back and look at your old work and apply all the things that you have learned in the intervening time period.

In this case the research relied on some C/C++ simulation programmes that I had written. The simulations, even for small cases, performed hundreds of thousands of iterations to estimate lower bounds and so C++ was a natural choice at the time. R was still a fledgling, and Splus simply was not up to extensive simulation work. Given the nature of these simulations, I still do not think I would use R, even though it is very fast these days.

Simulations, being simulations, rely extensively on random number generation, and of course these programmes were no exception. Re-running the programmes seemed trivial, and of course the compute time had been substantially reduced over the years. This led me to think that I could now explore some more realistic scenarios. If you, the reader, think I am being deliberately mysterious about my simulations, I am not. It is more that the actual research is a side issue to the problems I want to talk about here. The “more realistic inputs” simply correspond to larger simulated DNA databases, in line with those now maintained by many jurisdictions, and a set of allele frequencies generated from a much larger data set than the one I had access to in 1997, with a different set of loci.

There would clearly be no story if something did not happen with the new work. My early work was with databases of 100, 400 and 1,000 individuals. When I expanded this to 5,000 and 10,000 individuals I found that things began to go wrong.

Firstly, the programme began to quit unexpectedly on Windows, and to produce segmentation faults when compiled with gcc on Linux. The crashes only happened with the larger database sizes, but strangely also in the case where N = 1,000, where there had previously been no crash. I thought initially that this might be because I had inadvertently hard-coded some of the array dimensions, and that the new data sets, or larger runs, were causing problems. Extensive examination of the code did not reveal any irregularities.
Random number generators and all that

I did discover fairly early on that I could no longer rely on George Marsaglia’s multiply-with-carry (MWC) uniform random number generator. The reason for this is that the generator, as coded, relies on integers of certain widths and on integer overflow, or wrapping. I had pretty much abandoned it some years ago when a former MSc student, Dr Alec Zwart, discovered that there were irregularities in the distribution of the bits. Using a random number a bit at a time is very useful when simulating breeding populations, which is something else I do quite often.

The Mersenne Twister

The Mersenne Twister has been around since 1997, and again, advances in computing have made the overhead it incurs relatively negligible. My initial switch to a Mersenne Twister uniform random number generator (RNG) was through an implementation distributed by Agner Fog. This implementation has served me well for quite some time, and I have used it extensively. That sadly was not the case this time. I could not get Visual Studio 2013 to understand some of the enums, and faking it caused me problems elsewhere. I am sure there is nothing wrong with this implementation, but I certainly could not get it to work this time.

I discovered by reading around on the web that random number generation has become part of the new C++11 standard, and that it is fairly easy to get a Mersenne Twister random number generator. Most implementations start with a uniform integer, or long integer, random number stream and then wrap different classes around this stream. C++ is no exception:

#include <random>
 
using namespace std;
 
static mt19937 mtEngine;
uniform_real_distribution<double> rngU;
 
void init_gen(unsigned int seed){
  mtEngine = mt19937(seed);
  rngU = uniform_real_distribution<>(0.0, 1.0);
}
 
double runif(void){
    return rngU(mtEngine);
}

I have used a static variable to store the stream in my implementation but there is no requirement to do this.

Nothing is ever normal

I have also, for quite some time, used Chris Wallace’s Fastnorm code for very fast generation of standard normal random variates. However, I found that this too appeared to be causing me problems, especially when I changed operating systems. My programming philosophy these days is that my work should be portable to any mainstream operating system (Windows, Linux, OS X), especially since I almost never write GUI code any more. Running on both Windows and Linux is useful, because when I want to run really big simulations I often flick the code over to our cluster, which strangely enough does not run Windows. Who knew?

It turns out that C++11 also has a normal random number generator. I have done very little research into which method is used, but my guess is that it is either an inverse CDF method or, at worst, a Box-Muller based method. Adding a standard normal generator is easy:

static mt19937 mtEngine;
static normal_distribution<double> rngZ;
 
void init_gen(unsigned int seed){
  mtEngine = mt19937(seed);
  rngZ = normal_distribution<double>(0.0, 1.0);
}
 
double snorm(void){
  return rngZ(mtEngine);
}

So that will work, right?

After all of these changes (which do not seem substantial, but bear in mind they took me a long time to arrive at), everything was stable, right? Well, no. I was still getting a crash when N = 10,000, although not when I started the simulation with that case.
Java to the rescue

I decided, probably incorrectly with hindsight, that I must be making some sort of stupid mistake in allocating and releasing memory. I decided to take that completely out of the equation by switching to Java. A port from C++ to Java is actually a relatively painless thing to do, and I had a working version of my programme in a couple of hours. This was made easier by the fact that my colleague Duncan Taylor had ported Ross Ihaka’s C code, ripped out of R, for gamma random number generation (yes, I need that too), and with a little tweaking I had it running in my programme as well. The Java port let me recognize that I had done some silly things in my original programme, such as storing an entire bootstrap sample before processing it, chewing up CPU time and memory with needless copying. And after a little more hacking (like three days) it ran to my satisfaction, and all simulations duly completed in about three hours of run time.

But what about C++?

Java has some pretty cool ideas, and it is a fun and easy language to programme in. However, my failure to get the C++ working was weighing heavily on my mind. I like to think that I am a hell of a lot better C++ programmer than a Java programmer, and I dislike the idea that I might be writing slower programmes. I also do not think Java is currently well suited to scientific programming. I am sure some readers will tell me this is no longer true, but access to a well-accepted scientific code library is missing, and although there are many good projects, a lot of them are one-man efforts, or have been abandoned. A good example of the latter is the COLT library from CERN.

Out with pointers

I thought about this for some time, and eventually it occurred to me that I could write a C++ programme that looked like a Java programme, that is, one with no pointers. C++ purists might shudder, but if you think of Java as simplified C++, then the concept is not so strange. Java treats every object argument to a function as a reference. C++ can replicate this behaviour very easily by simply using its reference notation. The big trade-off was that I was also going to have to drop the pointers I used for dynamic allocation of memory. Java sort of fudges this as far as I can tell, because although the scalar types (int, double, boolean and others) are generally not treated as references, I think native arrays of them, e.g. int[] or double[], are.

The STL

The Standard Template Library (STL) provides a set of low-overhead C++ template container classes, such as lists, vectors, maps and queues. These classes can be nested inside one another, and they can be dynamically resized at execution time. I have avoided using them in this way in the past, especially when writing code that has to be very fast. However, I am fairly sure my colleague Brendon Brewer, who is much younger and a more modern C++ programmer, has told me that he never uses pointers. Given that I had just finished for the year, this seemed like an ideal quick summer project.

Another couple of days of recoding got me back to a running programme, and now it is time to reveal what was probably the issue all along. Remember when I said I did this:

double runif(void){
    return rngU(mtEngine);
}

What I actually did was this:

double runif(void){
    return mtEngine() / 4294967295.0;
}

The large constant there is \(2^{32}-1\), the largest unsigned integer that can be represented in 32 bits. The mt19937 engine mtEngine() returns an unsigned 32-bit integer, but this piece of code:

return (int)floor(b * runif());

which should return a number between 0 and b-1 inclusive, was occasionally returning b, thereby causing the programme to address unallocated memory, and hence the crash. The culprit is the division: when mtEngine() returns its maximum value \(2^{32}-1\), runif() returns exactly 1.0, and floor(b * 1.0) is b. The reason it took so long to happen is that this occurs with probability \(2^{-32}\) per draw, so the random number stream had to be used for a very long time. Using the uniform_real_distribution class stopped this from happening.

What about performance?

So did I take a performance hit? I cannot say unequivocally without going back to my original programme and adjusting the RNGs, but it appears that the C++ version actually takes about 10 minutes longer than the Java version. This is a very poor comparison, because the Java version was running on my Windows PC (Core i7, 32GB RAM, HDD) and the C++ version on my Macbook (Core i7, 16GB RAM, SSD), but also because the new C++ version is “more object-oriented” than the Java version. That is, firstly, I used native arrays, and arrays of arrays, in Java, like int[][] and double[][]. If I had used Java ArrayLists (why would you?), it might have been a different story. Secondly, there is a bit more OO design and architecture in the C++ version, including things like operator overloading and more extensive use of objects to represent the input data. All of these things cost, and they probably cost a little too much in execution time, although they pay off in readability and usability, especially in well designed IDEs like Visual Studio and Xcode. Finally, my original C++ programme never finished, so I have no idea how long it would actually take to do the same set of simulations. I think in this case I will take a programme that works in three hours over a programme that quickly crashes in one.


Forensic anthropologists — lend me your data

Please note: This is not a new post, but a restored post from August that I lost in a WordPress upgrade.

Friends, Romans, forensic anthropologists, lend me your data

I have been reading the Journal of Forensic Sciences (JFS) over the last couple of days to see what sort of research is being done in forensic science, and to see how many studies are using statistics to make, or to reinforce, their conclusions. The answer to the second question is “quite a few.” There has been quite significant adoption of multivariate analysis, most commonly PCA, in a wide variety of forensic disciplines, and that is very pleasing to see.

Forensic anthropology

Anthropologists, and in particular forensic anthropologists, have long been heavy users of statistical methodology. Many studies use linear regression, linear discriminant analysis, principal component analysis and logistic regression. The well-known and widely used forensic anthropology computer programme FORDISC uses LDA. It is interesting to me to see the appearance of some newer and different classification techniques such as k-nearest neighbour, quadratic discriminant analysis, classification and regression trees, support vector machines, random forests, and neural networks.

Forensic anthropology features heavily in JFS, and the papers contain a large amount of statistical analysis of data. The focus of the articles is often on the classification of remains into age, gender, or racial groups, or on age estimation. The articles are generally quite interesting and well written.
Show me the data

However, there is almost never any provision of the raw data, and my experience whilst writing my data analysis book was that there was no response to my requests for data from even a single anthropologist of the dozen or so that I wrote to. Not even a polite “sorry, but we are unable to release the data.” I understand that in all scientific disciplines data can be expensive, in terms of time or money, to collect, and so a researcher might justifiably want to retain a data set for as long as possible to get as much research value from it as possible. However, surely there must be a point where the data could be released into the public domain? The University of Tennessee Knoxville does have a forensic anthropology databank but, at least from the webpage, it seems that the emphasis is on deposits rather than withdrawals.

A challenge

I therefore issue a challenge to the forensic anthropology community: release some of your data into the wild. It will benefit your discipline as it will others, and you might find your work cited more as people give you credit for producing the data to which they are applying their novel techniques.


I am an applied statistician

Today brings a very nice blog post from Rafael Irizarry on being pragmatic in applied statistics rather than rigidly/religiously Bayesian or Frequentist.

Link to article: As an applied statistician I find the Frequentists versus Bayesians debate completely inconsequential.

Does this article reverse or contradict my thinking on forensic science? Not really. I am a strong proponent of Bayesian thinking in that field. However, in the shorter term I would be happier if practitioners simply had a better understanding of the Frequentist interpretation issues. As a statistician, I depend on the collaboration of forensic scientists for both the problems and the data. Telling scientists that everything they are doing is incorrect is generally unhelpful. It is more productive to collaborate and make things better.


R Markdown and undergraduates

I am seriously considering introducing R Markdown for assignments in our second year statistics course. The folks at RStudio have made some great improvements in the latest version of R Markdown (R Markdown V2), which allow you to add a Markdown document template to your R package. This in turn does things like letting you provide a document skeleton for the user with as much information as you like, link CSS files (if you are producing HTML), and specify the output document format as well. The latter is an especially important addition to RStudio.

The latest version of RStudio incorporates Pandoc, which is a great format translation utility (and probably more) written by John MacFarlane. It is an important addition to RStudio because it makes it easy to author documents in Microsoft Word format, as well as HTML, LaTeX, and PDF. I am sure that emphasizing the importance of having the option to export to Word will cause some eye-rolling and groans, but I would remind you that we are teaching approximately 800 undergraduates a year in this class, most of whom will never take another statistics class again, and will join a workforce where Microsoft Word is the dominant platform. I like LaTeX too (I do not think I will ever write another book in Word), but it is not about what I like. I should also mention that there are some pretty neat features in the new R Markdown, like authoring HTML slides in ioslides format, or PDF/Beamer presentations, and creating HTML documents with embedded Shiny apps (interactive statistics apps).

I think on the whole the students should deal with this pretty well, especially since they can tidy up their documents to their own satisfaction in Word. That is not to say that RStudio produces messy documents, but rather that the facility to edit after rendering is available.

Help?

However, there is one stumbling block that I hope my readers might provide some feedback on: the issue of loading data. My class is a data analysis class. Every assignment comes with its own data sets. The students are happy, after a while, using read.csv() or read.table() in conjunction with file.choose(). However, from my own point of view, reproducible research documents with commands that require user input quickly become tedious, because you tend to compile/render multiple times whilst getting your code and your document right. So we are going to have to teach something different.

As background, our institution has large computing labs that any registered student can use. The machines boot in either Linux or Windows 7 (currently, and I do not think that is likely to change soon, given how much people loathe Windows 8 and what a headache it is for IT support). There is moderate market penetration of Apple laptops in the student body (I would say around 10%).

So here is my problem: we have to teach the concept of file paths to a large body of students who, on the whole, do not have this concept in their skill set and who will find it foreign/archaic/voodoo. They will also regard this as another burdensome thing to learn on top of a whole lot of other things they do not want to learn, like R and R Markdown. To make things worse, we have to deal with file paths over multiple platforms.

My thoughts so far are:

  • Making tutorial videos
  • Providing the data for each assignment in an R package that is loaded at the start of the document
  • Providing code in the template document that reads the data from the web

I do not really like the last two options, as they let the students avoid learning how to read data into R. Obviously this is not a problem for those who do not go on, but it shifts the burden for those who do. So: your thoughts, please.

Update

One option that has sort of occurred to me before is that in the video I could show how the fully qualified path to a file can be obtained using file.choose(), and then students could simply copy and paste that into their R code.
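In code, the workflow would look something like this (the path is made up, and will differ for every student and machine):

## Step 1: run interactively, pick the file in the dialogue, and copy the
## path that is printed
file.choose()
## [1] "C:\\Users\\abc123\\Downloads\\assignment1.csv"

## Step 2: paste the path into the document so that it renders without any
## user interaction
assignment1.df = read.csv("C:\\Users\\abc123\\Downloads\\assignment1.csv")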
