R Markdown and undergraduates

I am seriously considering the introduction of R Markdown for assignments in our second year statistics course. The folks at RStudio have made some great improvements in the latest version of R Markdown (R Markdown V2), which allow you to add a Markdown document template to your R package, which in turn does things like let you provide a document skeleton for the user with as much information as you like, link CSS files (if you are doing HTML), and specify the output document format as well. The latter is an especially important addition to RStudio.

The lastest version of RStudio incorporates Pandoc which is a great format translation utility (and probably more) written by John Macfarlane. It is an important addition to RStudio because it makes it easy to author documents in Microsoft Word, as well as HTML, LaTeX, and PDF. I am sure that emphasizing the importance having the option to export to Word will cause some eye-rolling and groans, but I would remind you that we are teaching approximately 800 undergrads a year in this class, most of who will never ever take another statistics class again, and join a workforce where Microsoft Word is the dominant platform. I like LaTeX too (I do not think I will ever write another book ever again in Word), but it is not about what I like. I should also mention that there are some pretty neat features in the new R Markdown like authoring HTML slides in ioslides format, or PDF/Beamer presentations, and creating HTML documents with embedded Shiny apps (interactive statistics apps).

I think on the whole the students should deal with this pretty well, especially since they can tidy up their documents to their own satisfaction in Word — not saying that RStudio produces messy documents, but rather that the facility to edit post rendering is available.


However there is one stumbling block that I hope my readers might provide some feedback on — the issue of loading data. My class is a data analysis class. Every assignment comes with its own data sets. The students are happy, after a while, using read.csv() or read.table in conjunction with file.choose(). However, from my own point of view, reproducible research documents with commands that require user input quickly become tedious because you tend to compile/render multiple times whilst getting your code and your document right. So we are going to have to teach something different. As background, our institution has large computing labs that any registered student can use. The machines boot in either Linux or Windows 7 (currently, and I do not think that is likely to change soon given how much people loathe Windows 8 and what a headache it is for IT support). There is moderate market penetration of Apple laptops in the student body (I would say around 10%). So here is my problem — we have to teach the concept of file paths to a large body of students who on the whole do not have this concept in their skill set and who will find it foreign/archaic/voodoo. They will also regard this as another burdensome thing to learn on top of a whole lot of other things they do not want to learn like R and R Markdown. To make things worse, we have to deal with file paths over multiple platforms.

My thoughts so far are:

  • Making tutorial videos
  • Providing the data for each assignment in an R package that is loaded at the start of the document
  • Providing code in the template document that reads the data from the web

I do not really like the last two options as they let the students avoid learning how to read data into R. Obviously this is not a problem for those who do not go on, but it shifts the burden for those who do. So your thoughts please.


One option that has sort of occurred to me before is that in the video I could show how the fully qualified path name to a file can be obtained using file.choose() and then then students could simply copy and paste that into their R code.

Share Button

MCMC in Excel — an exercise in perversity

I found myself working in Excel as part of the work I did fitting exponential (and gamma) distributions to left censored data. This was partly due to do the fact that my research colleagues had done the initial distribution fitting using Microsoft Excel’s Solver to do maximum likelihood, something it does quite well. A shortcoming of this approach is that you cannot get the Hessian matrix for models with two or more parameters which you need if you want to place any sort of confidence interval around your estimates. There is nothing stop you, of course, from doing the actual mathematics, and calculating the values you need directly, but this all sounds like rather too much work and is distribution specific. One can equally make the criticism that the approximations used by the BFGS and other quasi-Newton methods are not guaranteed to be close to the true Hessian matrix.

The next step along the chain (I am sure this is a terribly mixed-metaphor but hey who cares), for me at least, was to use MCMC — in particular, to implement a simple random walk Metropolis-Hastings sampling scheme.

Note: The method I describe here is almost impossible for a multi-parameter model, or a model where the log-likelihood does not reduce to a simple sum of the data (or a sum of a function of the data). The reason for this is that Excel’s distribution functions are not vector functions, which means in many circumstances the values of the likelihood for different observations must be stored in separate cells, and then we have to sum over the cells. In a problem with n observations and m proposals, we then would have to store \(n\times m\) values unless we resort to Visual Basic for Applications (VBA). However, I wanted to do this problem without VBA.

Note 2:I know that it is very easy to estimate the variance for the exponential distribution, but please refer to the title of this post.

Microsoft Excel 2013 for Mac

In order to do MCMC we need to be able to generate random numbers. This functionality is provided in Excel by the Data Analysis Add-In. However, the Data Analysis Add-In has not been available since Excel 2008 for the Mac. There is a claim that this functionality can be restored by a third party piece of software called StatPlus LE, but in my limited time with it it seems a very limited solution. There are number of other pieces of functionality missing in the Mac version of Excel, which reduces its usefulness greatly.

What do we need?

We need data

I will use the similar data from my first post on this subject. However, this time I am going to switch to using data generated from a gamma distribution. To get some consistency, I have generated the data in R with \(\alpha=\mbox{shape}=10.776\), a rate of \(\beta=\mbox{rate}=5.138\), and a detection limit of \(\log(29)\). These numbers might seem absurdly specific but they come from maximum likelihood estimates in a real data set. This leaves me with 395 observations above the detection limit, and 9,605. below it. We only need the sum of the observations above the limit, the sum of the log of the observations above the limit, and the two counts, because the log-likelihood only depends on the sum, the sum of the logs, and the two counts. That is, if \(x_i\sim Gamma(\alpha,\beta), i=1,\ldots,N\) where \(n\) is the number of observations above the detection limit (395) (and \(m=N-n\) is the number of observations that are below the detection limit) then the likelihood function is

\(L(\alpha,\beta;{\mathbf x}) = \prod_{i=1}^{n}\frac{\beta^{\alpha}}{\Gamma(\alpha)}x^{\alpha-1}e^{-\beta x_i}\prod_{i=1}^{m}F(T; \alpha, \beta) \)

where \(T=\log(29)\), and \(F(T; \alpha, \beta)\) is the Gamma cumulative distribution function. The log-likelihood then simplifies to

l(\alpha,\beta;{\mathbf x}) &= n\left(\alpha\log(\beta)-\log\Gamma(\alpha)\right)+(\alpha-1)\sum \log x_i -\beta\sum x_i \\
&+m \log F(T; \alpha, \beta) \\
&= (\alpha-1)\sum \log x_i -\beta\sum x_i + \kappa(\alpha,\beta,n,m)\\

which depends only the sum the observations above the limit, the sum of the logarithms of the observations above the detection limit, and the number of observations above and below the detection limit.

We need to specify the priors and to get a set of proposal values

In my JAGS model I used a \(\Gamma(0.001, 0.001)\) priors for \(\alpha\) and \(\beta\). This would be easy enough to implement in Excel if the inverse Gamma function was sufficiently robust. However, it is not, and so I have opted for a prior which is \(U(-2,3)\) on log-scale.

Prior for alpha and beta. It is uniform (U[-2,3]) on the log-scale
Prior for alpha and \beta. It is uniform (U[-2,3]) on the log-scale

This prior is a little less extreme than the \(\Gamma(0.001, 0.001)\) prior but has reasonable properties for this example.

We can use the Data Analysis Add-In to generate a set of proposal values. The screen capture below shows the dialog box from the Random Number Generation part of the Data Analysis Add-In. We need proposals for both \(\alpha\) and \(\beta\). Therefore we ask Excel to give us 2 random variables. In a standard MCMC implementation we usually choose a “burn-in” period to make sure our samples are not to correlated with the starting values, and to give the sampler time to get somewhere near the target distribution. In this example we will use a burn-in period of 1,000 iterations and then sample for a further 10,000 iterations, for a total of 11,000 iterations. We get Excel to put the proposals out into columns B and C starting at row 2 (and extending to row 11,001). Note: I have set the random number seed (to 202) here so that my results can be replicated.

Generating uniform (U[-2,3]) random variates for the proposal distribution
Generating uniform (U[-2,3]) random variates for the proposal distribution

We also need a column of U[0,1] random variates for our Metropolis-Hastings update step. The screen capture below shows the dialog box how we set this up. We store these values in column F, and as before I have set the random number seed (to 456) so that my results can be replicated.


We use columns C and D to transform our uniform random variates to the original scale to get our proposals for \(\alpha\) and \(\beta\). We do this by entering the formula


into cell D2, and then selecting cells D2 to D11001 and using the Fill Down command to propagate the formula. We select the range D2:E:11001 and use the Fill Right command to propagate the formula formula across for \(\beta\). Columns C and D contain my proposal values for \(\alpha\) and \(\beta\).

We need some data

As noted before, all we need to is the sum of the observed values and the sum of the log of the observed values, plus the number of observed and censored values. The sum of the observed values in my data set is 1478.48929487124 (stupid accuracy for replication), and the sum of the logs of the observed values is 519.633872429806. As noted before the number of observed values is 395, and there are 9,605 censored values. I will insert these values in cells I2 to I5 respectively, and in cells H2 to H5 I will enter the labels sum_x, sum_log_x, nObs, and nCens.

We need some names

It is useful to label cells with names when working with Excel formulae. This allows us to refer to cells containing values by a name that means something rather than a cell address. We can define names by using the tools on Formula tab. I will use this tool to assign the names I put into cells H2 to H5 to the values I put into cells I2 to I5. To do this I select the range H2:I5, and the I click on the Formula tab, then the “Create names from Selection” button as shown in the screenshot below: Note I do not believe you can do this on the Mac, but I do not know for sure.

Excel (Windows) allows you to create multiple names in a spreadsheet at once.
Excel (Windows) allows you to create multiple names in a spreadsheet at once.

I can now use, for example, the name sum_x to refer to cell address $I$2 in my formulae. It also removes the need to make sure that the address is absolute every time I type it.

We need to be able to calculate the log-likelihood

The only “tricks” we need to calculate the log likelihood are knowing how to calculate the natural logarithm, the Gamma cdf, and the log-gamma function. I mention the first because the natural logarithm in Excel conforms to mathematical convention in that \(\log_e(x)={\rm ln}(x)\), and the corresponding Excel function is LN. The LOG function calculates \(\log_{10}(x)\) in Excel. Excel’s GAMMA.DIST function provides both the pdf and the cdf. The latter is obtained by setting the fourth argument (CUMULATIVE) to TRUE. It is important to note that Excel uses parameters alpha and beta, but these correspond to shape and scale, not shape and rate. Finally, the GAMMALN function provides us with the logarithm of the complete gamma function. We will calculate the log-likelihood in column J, therefore we enter the following formula into cell J2

=(D2 - 1) * sum_log_x - E2 * sum_x 
+ nObs * (D2 * LN(E2) - GAMMALN(D2))
+ nCens * LN(
GAMMA.DIST(LN(29), D2, 1 / E2, TRUE))

After you have got this formula correct (and it will probably take more than one go), then select cells J2:J11001 and use the “Fill Down” command (Ctrl-F on Windows) to propagate the formula down for every proposed value.

We need starting values

We will store our sampler’s current state in columns K, L and M. The current value of \(\alpha\) gets stored in column K, the current value of \(\beta\) in column L, and the current value of the log-likelihood in column M. We need some starting values, and so in this example we will use the first proposal values. In cell K2 we enter the formula


in cell L2 we enter the formula


and in cell M2 we enter the formula


We need a Metropolis-Hastings update step

Recall that the state of the sampler changes from \(G_0\) to \(G_1\) with probability 1 if \(G_0\) is more likely than \(G_1\) \((l(G_1)>l(G_0))\) or if \(\log(u) < l(G_1)-l(G_0)[/latex], where [latex]u\sim U(0,1)[/latex]. Therefore, to implement this in Excel, we need to use an IF function in conjunction with an OR function. The OR function can handle our two conditions: either the likelihood is bigger or with probability proportional to the ratio of the likelihoods. Therefore, in cell K3, we enter the formula [code] =IF(OR($J3>$M2, LN($F3)<($J3-$M2)),D3,K2) [/code] Note that the comparisons are made to the current state, not the previous cell. We could write a very similar formula for cells L3 and M3. However, we can just check to see with cell K3 has the same value as cell K2. That is we can see whether there has been a change. Checking for equality of floating point numbers is bad programming practice, however if we are concerned about scientific accuracy we probably would not be doing this in Excel. Therefore, we insert the formulae

=IF(K3=K2, L2, E3)
=IF(K3=K2, M2, J3)

is cells L3 and M3 respectively. We need to propagate these formulae down to row 11,001 by selecting and using the “Fill Down” command as before.

We need to summarize our chain

Finally, we need to gather some summary statistics about our sample from the posterior distribution. Recall we are using a burn-in period of 1,000 samples and a sampling period of 10,000 iterations. Therefore all our summary functions are only applied from row 1,002 to row 11,001.

We are interested in the mean, standard deviation, and 0.025 and 0.975 quantiles for [latex]\alpha\) and \(\beta\). We can get these values using the AVERAGE, STDEV.S, and PERCENTILE.EXC functions. In cells P2 to P7 we insert the following formulae.

=PERCENTILE.EXC(K1002:K11001, 0.025)
=PERCENTILE.EXC(K1002:K11001, 0.975)
=P2 - 1.96 * P3
=P2 + 1.96 * P3

and then we select cells P2:Q7 and use the “Fill Right” command.

How did we do?

How did we do?
How did we do?

The screen capture above shows my results. Rather embarrassingly the 95% credible interval does not contain the true values \(\alpha=10.776\) and \(\beta=5.138\). The main reason for this is that there is a total of 10 acceptances in our sample of size 10,000! That is, the sampler is incredibly inefficient. This is not completely surprising. The interval calculated under the assumption that the posterior distribution is symmetric is a little wider and does contain the true values. However, I would not put much stock in this. Of course, we can easily have the sampler run for a longer time, by generating more random variates and using more cells. I will leave that task to the reader.


In this article I have shown how to fit a gamma distribution to left censored data using MCMC in Excel. It is definitely not the best way to do this — I would use R and JAGS, which would be infinitely faster and give me more useable results — but it can be done. It does offer functionality to non-R users, of which there are many more than actual users, and it also allows the chance to observe the Markov chain so that we can see how the sampler is working at every stage of the process.


For completeness the sheet I created whilst writing this post is available from the link below.
MCMCGamma.xlsx.zip (1.7MB)

Share Button

Introduction to Using Regular Expressions in R

R and Regular Expressions

Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems. – Jamie Zawinski, courtesy of Jeffrey Friedl’s blog

This is a blog post on using regular expressions in R. I am sure there are plenty of others out there with the same information. However, this is also an exercise for me to see how hard it is to change my knitr .Rnw files into markdown and then into HTML. It turns out that most of the work can be done by running pandoc over the LaTeX file I get from knitting my .Rnw file. The rest I did manually.

What are regular expressions?

Regular expressions provide a powerful and sophisticated way of matching patterns in text. For example with regular expressions I can:

  • Find a word or a series of letters

  • Do wild-card searches

  • Match patterns at the start or the end of a line

  • Make replacements in text based on the match

A simple example

Very often when people read about regular expressions they do not grasp the power, thinking “I could do that without using regular expressions at all.”

Here is something I did the other day. I had a file whose lines consisted of the names of files containing R code. That is, I had twenty lines of text with the end of each line ending in .R. I needed to

  1. insert export( at the start of each line, and

  2. replace the .R at end of each line with )

You can probably think of a way of doing this with a mixture of find and replace, and manual insertion. That is all well and good if you only have 20 lines. What if you had 20,000 lines?

I did this in Notepad++ by matching the regular expression ^(.*)\.[rR]$ and replacing it with export(\1)

Before After
#onewayPlot.R# \export(onewayPlot)
autocor.plot.R \export(autocor.plot)
boxqq.r \export(boxqq)
boxqq.r~ \export(boxqq)
ciReg.R \export(ciReg)
cooks20x.R \export(cooks20x)
crossFactors.R \export(crossFactors)
crossFactors.R~ \export(crossFactors)
crosstabs.R \export(crosstabs)
eovcheck.R \export(eovcheck)
estimateContrasts.R \export(estimateContrasts)
estimateContrasts1.R \export(estimateContrasts1)
estimateContrasts2.R \export(estimateContrasts2)
freq1way.r \export(freq1way)
freq1way.r~ \export(freq1way)
getVersion.R \export(getVersion)
interactionPlots.R \export(interactionPlots)
layout20x.R \export(layout20x)
levene.test.R \export(levene.test)
levene.test.R~ \export(levene.test)
multipleComp.R \export(multipleComp)
normcheck.R \export(normcheck)
onewayPlot.R \export(onewayPlot)
onewayPlot.R~ \export(onewayPlot)
pairs20x.R \export(pairs20x)
pairs20x.R~ \export(pairs20x)
predict20x.R \export(predict20x)
predict20x.R~ \export(predict20x)
propslsd.new.R \export(propslsd.new)
residPlot.R \export(residPlot)
rowdistr.r \export(rowdistr)
  • Regular expressions are powerful way of describing patterns that we might search for or replace in a text document

  • In some sense they are an extension of the wildcard search and replace operations you might carry out in Microsoft word or a text editor.

  • To the untrained eye they look like gobbledygook!

  • Most programming languages have some form of regular expression library

  • Some text editors, such as Emacs, Notepad++, RStudio, also have regular expressions

  • This is very useful when you don't need to write a programme

  • The file utility grep uses regular expressions to find occurrences of a pattern in files

  • Mastering regular expressions could take a lifetime, however you can achieve a lot with a good introduction

  • A couple of very good references are:

    • Jeffery Freidl's Mastering Regular Expressions, (2006), 3rd Edition, O'Reilly Media, Inc.
    • Paul Murrell's Introduction to Data Technologies, (2009), Chapman & Hall/CRC Computer Science & Data Analysis.
    • This entire book is on the web: http://www.stat.auckland.ac.nz/~paul/ItDT/, but if you use it a lot you really should support Paul and buy it
    • Chapter 11 is the most relevant for this post

Tools for regular expressions in R

We need a set of tools to use regular expressions in something we understand — i.e. R. The functions I will make the most use of are

  • grepl and grep

  • gsub

  • gregexpr

Functions for matching

  • grep and grepl are the two simplest functions for pattern matching in R

  • By pattern matching I mean being able to either

    (i) Return the elements of, or indices of, a vector that match a set of characters (or a pattern)

    (ii) Return TRUE or FALSE for each element of a vector on the basis of whether it matches a set of characters (or a pattern)

grep does (i) and grepl does (ii).

Very simple regular expressions

At their simplest, a regular expression can just be a string you want to find. For example the commands below look for the string James in the vector of names names. This may seem like a silly example, but it demonstrates a very simple regular expression called a string literal, informally meaning match this string — literally!

names = c('James Curran', 'Robert Smith', 
          'James Last')
grep('James', names)
## [1] 1 3

Wild cards and other metacharacters

  • At the next level up, we might like to search for a string with a single wild card character

  • For example, my surname is sometimes (mis)spelled with an e or an i

  • The regular expression symbol/character for any character is the the full stop or period

  • So my regular expression would be Curr.n, e.g.

surnames = c('Curran', 'Curren', 'Currin',
grepl('Curr.n', surnames)
  • The character . is the simplest example of a regular expression metacharacter

  • The other metacharacters are [ ], [^ ], \, ?, *. +,{,}, ^, $, \<, \>, | and ()

  • If a character is a regular expression metacharacter then it has a special meaning to the regular expression interpreter

  • There will be times, however, when you want to search for a full stop (or any of the other metacharacters). To do this you can escape the metacharacter by preceding it with a double backslash \\.

  • Note that we only use two backslashes in R – nearly every other language uses a single backslash

  • Note that \\ followed by a digit from 0 to 9 has special meaning too, e.g. \\1


  • Whilst this example obviously works, there is a more sensible way to do this and that is to use the alternation or or operator |.

  • E.g.

grepl('Curr(a|e|i)n', c('Curran', 'Curren', 
                        'Currin', 'Curin'))
  • This regular expression contains two metacharacters ( and |

  • The round bracket ( has another meaning later on, but here it delimits the alternation.

  • We read (a|e|i) as a or e or i.

A bigger example – counting words

In this example we will use the text of Moby Dick. The R script below does the following things

  1. Opens a file connection to the text

  2. Reads all the lines into a vector of lines

  3. Counts the number of lines where the word whale is used

  4. Counts the number of lines where the word Ahab is used

## open a read connection to the Moby Dick 
## text from Project Gutenberg
mobyURL = 'http://www.gutenberg.org/cache/epub/2701/pg2701.txt'
f1 = url(mobyURL, 'r')

## read the text into memory and close
## the connection
Lines = readLines(f1)

Note: The code above is what I would normally do, but if you do it too often you get this nice message from Project Gutenberg

Don't use automated software to download lots of books. We have a limit on how fast you can go while using this site. If you surpass this limit you get blocked for 24h.

which is fair enough, so I am actually using a version I stored locally.

## Throw out all the lines before 
## 'Call me Ishmael'
i = grep('Call me Ishmael', Lines)
Lines = Lines[-(1:(i - 1))]

numWhale = sum(grepl('(W|w)hale', Lines))
numAhab = sum(grepl('(A|a)hab', Lines))

cat(paste('Whale:', numWhale, ' Ahab:', 
           numAhab, '\n'))
## Whale: 1487  Ahab: 491

Note:I am being explicit here about the capitals. In fact, as my friend Mikkel Meyer Andersen points out, I do not have to be. grep, grepl, regexpr and gregexpr all have an argument ignore.case which can be set to TRUE. However I did want to highlight that generally regular expressions are case sensitive.

This programme will count the number of lines containing the words whale or Ahab but not the number of occurrences. To count the number of occurrences, we need to work slightly harder. The gregexpr function globally matches a regular expression. If there is no match, then gregpexr returns -1. However, if there is one or match, then gregexpr returns the position of the match, and the length of the match for each matching instance. For example, we will look for the word the in the following three sentences:

s1 = 'James is a silly boy'
s2 = 'The cat is hungry'
s3 = 'I do not know about the cat but the dog is stupid'

## Set the regular expression
## Note: This would match 'There' and 'They' as
## well as others for example 
pattern = '[Tt]he'
## We will store the matches so that we 
## can examine them in turn
m1 = gregexpr(pattern, s1)
m2 = gregexpr(pattern, s2)
m3 = gregexpr(pattern, s3)

There are no matches in the first sentence so we expect gregexpr to return -1

## [[1]]
## [1] -1
## attr(,"match.length")
## [1] -1
## attr(,"useBytes")
## [1] TRUE

which it does.

In the second sentence, there is a single match at the start of the sentence

## [[1]]
## [1] 1
## attr(,"match.length")
## [1] 3
## attr(,"useBytes")
## [1] TRUE

This result tells us that there is a single match at character position 1, and that the match is 3 characters long.

In the third example there are two matches, at positions 21 and 33, and they are both 3 characters long

## [[1]]
## [1] 21 33
## attr(,"match.length")
## [1] 3 3
## attr(,"useBytes")
## [1] TRUE

So in order to count the number of occurences of a word we need to use gregexpr and keep all of those instances where the result is not -1 and then count, using the length function the number of matches, e.g.

## count the number of occurences of whale
pattern = '(W|w)hale'
matches = gregexpr(pattern, Lines)

## if gregexpr returns -1, then the number of 
## matches is 0 if not, then the number of 
## matches is given by length
counts = sapply(matches, 
                  if(x[1] == -1) 

cat(paste('Whale:', sum(counts), '\n'))
## Whale: 1564

Character classes/sets or the [ ] operator

  • Regular expression character sets provide a simple mechanism for matching any one of a set of characters

  • For example [Tt]he will match The and the

  • The real strength of character sets is in its special range and set operators

  • For example the regular expressions:

    • [0-9] will match any digit from 0 to 9
    • [a-z] will match any lower case letter from a to z
    • [A-Z0-9] will match any upper case letter from A to Z or any digit from 0 to 9 and so on
  • You may initially think that character sets are like alternation, but they are not. Character sets treat their sets as an unordered list of characters

    • So (se|ma)t will (fully) match set
    • But mat, [sema]t will not
    • Alternatively [sema]t will match st,at, mt and at
  • The POSIX system defines a set of special character classes which are supported in R and can be very useful. These are

[:alpha:] Alphabetic (only letters)
[:lower:] Lowercase letters
[:upper:] Uppercase letters
[:digit:] Digits
[:alnum:] Alphanumeric (letters and digits)
[:space:] White space
[:punct:] Punctuation
  • The regular expression [[:lower:]] will help you capture accented lower case letters like &agrave, &eacute, or &ntilde whereas [a-z] would miss all of them

  • You may think this is uncommon, but optical character recognition (OCR) text often has characters like this present

Negated character sets – [^...]

  • Negated character sets provide a way for you to match anything but the characters in this set

  • A very common example of this is when you want to match everything between a set of (round) brackets

  • E.g. The regular expression ([^)]) would match any single character between a pair of round brackets

Matching more than one character — quantifiers

Another common thing we might do is match zero, one, or more occurrences of a pattern. We have four ways to do this

  1. ? means match zero or one occurrences of the previous pattern
  2. * means match zero or more occurrences of the previous pattern

  3. + means match one or more occurrences of the previous pattern

  4. {a,b} means match from a to b occurrences of the previous pattern*

  5. b may be omitted so that {a,} means match a or more occurrences of the previous pattern

  6. b and the comma may be omitted so that {a} means match exactly a occurences of the previous pattern

Continuing with our misspelling example, I would like a way of picking up all of the possibilities of misspelling my surname. Variants I've seen are Curren, Currin, Curin, Curan, Curen, Curn and even Karen!

If I wanted to construct a regular expression to match all of these possibilities I need to match (in this order):

  1. a C or a K

  2. a u or an a

  3. one or two occurence of r

  4. zero or more occurrences of e or i

  5. and finally an n

This is how I do this with regular expressions

pattern = '[CK](u|a)r{1,2}(e|i)*n';
badNames = c('Curren', 'Currin', 'Curin', 
             'Curan', 'Curen', 'Curn', 'Karen')
grepl(pattern, badNames)

Notice how the regular expression didn't match Curan. To fix the code so that it does match we need to change the set of possible letters before the n from (e|i) to allow a as a possibility, i.e. (a|e|i) or alternatively [aei]

pattern1 = '[CK](u|a)r{1,2}(a|e|i)*n'
pattern2 = '[CK](u|a)r{1,2}[aei]*n'
badNames = c('Curren', 'Currin', 'Curin', 'Curan', 
             'Curen', 'Curn', 'Karen')
grepl(pattern1, badNames)
grepl(pattern2, badNames)

Anchors — matching a position

  • The metacharacters ^, $, \< and \> match positions

  • ^ and $ match the start and the end of a line respectively

  • \< and \> match the start and end of a word respectively

  • I find I use ^ and $ more often

  • For example

    • ^James will match "James hates the cat" but not "The cat does not like James"
    • cat$ will match "James hates the cat" but not "The cat does not like James"

Summary — R functions for matching regular expressions

The functions grep and regexpr are the most useful. Loosely,

  • grep tells you which elements of your vector match the regular expression

  • whereas regexpr tells you which elements match, where they match, and how long each match is

  • gregexpr matches every occurrence of the pattern, whereas regexpr only matches the first occurrence

The example below shows the difference

poss = c('Curren', 'Currin', 'Curin', 'Curan', 
         'Curen', 'Curn', 'Karen')
pattern = '[CK](u|a)r{1,2}(i|e)*n'
grep(pattern, poss)
## [1] 1 2 3 5 6 7
gregexpr(pattern, poss)
## [[1]]
## [1] 1
## attr(,"match.length")
## [1] 6
## attr(,"useBytes")
## [1] TRUE
## [[2]]
## [1] 1
## attr(,"match.length")
## [1] 6
## attr(,"useBytes")
## [1] TRUE
## [[3]]
## [1] 1
## attr(,"match.length")
## [1] 5
## attr(,"useBytes")
## [1] TRUE
## [[4]]
## [1] -1
## attr(,"match.length")
## [1] -1
## attr(,"useBytes")
## [1] TRUE
## [[5]]
## [1] 1
## attr(,"match.length")
## [1] 5
## attr(,"useBytes")
## [1] TRUE
## [[6]]
## [1] 1
## attr(,"match.length")
## [1] 4
## attr(,"useBytes")
## [1] TRUE
## [[7]]
## [1] 1
## attr(,"match.length")
## [1] 5
## attr(,"useBytes")
## [1] TRUE

String substitution

  • Finding or matching is often only one half of the equation

  • Quite often we want to find and replace

  • This process is called string substitution

  • R has two functions sub and gsub

  • The difference between them is that sub only replaces the first occurrence of the pattern whereas gsub replaces every occurrence

  • Normal usage is quite straight forward. E.g.

poss = 'Curren, Currin, Curin, Curan, Curen, Curn and Karen'
pattern = '[CK](u|a)r{1,2}(i|e)*n'

sub(pattern, 'Curran', poss)
## [1] "Curran, Currin, Curin, Curan, Curen, Curn and Karen"
gsub(pattern, 'Curran', poss)
## [1] "Curran, Curran, Curran, Curan, Curran, Curran and Curran"

Back substitution

  • Abnormal usage is probably quite unlike anything you have seen before

  • One of the most powerful features of regular expressions is the ability to re-use something that you matched in a regular expression

  • This idea is called back substitution

  • Imagine that I have a text document with numbered items in it. E.g.

    1. James
    2. David
    3. Kai
    4. Corinne
    5. Vinny
    6. Sonika
  • How would I go about constructing a regular expression that would take each of the lines in my document and turn them into a nice LaTeX itemized list where the numbers are the list item markers?

  • The trick is to capture the numbers at the start of each line and use them in the substitution

  • To do this we use the round brackets to capture the match of interest

  • And we use the \\1 and \\2 backreference operators to retrieve the information we matched. E.g.

Lines = c('1. James', '2. David', '3. Kai', 
          '4. Corinne', '5. Vinny', 
          '6. Sonika')
pattern = '(^[0-9]\\.)[[:space:]]+([[:upper:]][[:lower:]]+$)'
gsub(pattern, '\\\\item[\\1]{\\2}', Lines)
## [1] "\\item[1.]{James}"   "\\item[2.]{David}"  "\\item[3.]{Kai}"    
## [4] "\\item[4.]{Corinne}" "\\item[5.]{Vinny}"   "\\item[6.]{Sonika}"

Note the double backslash will become a single backslash when written to file.

I actually used a regular expression with back substitution to format output for LaTeX in the file name example at the start of this post. My regular expression was the following:


and this was my back substitution expression

 \\verb!\1!  &  \\verb!\\export(\2)! \\\\

There is only a single \ in the back references because I just did this in the RStudio editor, not in R. Note how there are two back references, corresponding to two capture groups, one of which is nested inside the other. In nesting situtations like this, the capture groups are labelled in order from the outermost inwards.

String manipulation

We need two more things to finish this section

  • The ability to extract smaller strings from larger strings

  • The ability to construct strings from smaller strings

  • The first is called extracting substrings

  • The second is called string concatenation

  • We use the functions substr and paste for these tasks respectively


  • substr is very simple

  • Its arguments are

    • the string, x,
    • a starting position, start,
    • and an stopping position, stop.
  • It extracts all the characters in x from start to stop

  • If the alias substring is used then stop is optional

  • If stop is not provided then substring extracts all the characters from start to the end of x


substr('abcdef', 2, 4)
## [1] "bcd"
substring('abcdef', 2)
## [1] "bcdef"


paste is a little more complicated:

  • paste takes 1 or more arguments, and two optional arguments sep and collapse

  • sep is short for separator

  • For the following discussion I am going to assume that I call paste with two arguments x and y

  • If x and y are both scalars thenpaste(x,y) will join them together in a single string separated bya space, e.g.

paste(1, 2)
## [1] "1 2"
paste('a', 'b')
## [1] "a b"
paste('a =', 1)
## [1] "a = 1"
  • If x and y are both scalars and you define sep to be "," say then paste(x, y, sep = ",") will join them together in a single string separated by a comma,
paste(1, 3, sep = ',')
## [1] "1,3"
paste(TRUE, 3, sep = ',')
## [1] "TRUE,3"
paste('a', 3, sep = ',')
## [1] "a,3"
  • If If x and y are both vectors and you define sep to be "," say then paste(x, y , sep = ",") will join each element of x with each element of y into a set of strings where the elements are separated by a comma, e.g.
x = 1:3
y = LETTERS[1:3]
paste(x, y, sep = ',')
## [1] "1,A" "2,B" "3,C"
  • If collapse is not NULL, e.g “-” then each of the strings will be pasted together into a single string.
x = 1:3
y = LETTERS[1:3]
paste(x, y, sep = ',', collapse = '-')
## [1] "1,A-2,B-3,C"
  • And of course you can take it as far as you like 🙂
x = 1:3
y = LETTERS[1:3]
paste('(', paste(x, y, sep = ','), ')', 
       sep = '', collapse = '')
## [1] "(1,A)-(2,B)-(3,C)"


Mikkel also points out that paste0 is a shortcut to avoid specifying sep = "" everytime

Share Button



If you are not a statistician, this cartoon is mocking statistical significance used in most frequentist hypothesis tests. If we use a threshold of \(\)P<0.05[/latex] then we accept that on average we will falsely declare there to be a significant departure from the null approximately one time in twenty ([latex]1/20 = 0.05[/latex]). In lay terms we will decide there is a difference when there really is not one time in twenty. Thanks Randall.

Share Button

New PS/4 game console disappointments

Last year around August, during all the hype about the new Xbox One and Sony’s PlayStation 4, I decided to pre-order both consoles. However, as the release date approached I realized that the number of titles available for both consoles was extremely limited with many studios aiming for a 2014 release date. Based on this I decided that I did not need approximately $NZD1,300 of electronics that were glorified Blue ray players. I should inform readers (he says optimistically) who are not based in New Zealand or Australia that we do not have access to Netflix in New Zealand, and nor do we have cable TV that integrates with either box. Voice activation and control of the Xbox One is also currently disabled in Australia and New Zealand. Microsoft won’t say why but one strongly suspects that the software simply cannot deal with the Australasian accent. The voice features of Android used to be hopeless for anyone who did not have an American accent. Google has spent considerable time working to remedy this, and now it mostly works.

So I held off until May of this year, and finally purchased the PS/4. I had originally thought I would buy the Xbox first because I have been a much heaver Xbox 360 gamer, but the only two new games that interested me were Watchdogs, available on all platforms, and Sucker Punch’s inFamous: Second Son, which is a PS/4 exclusive. The latter swung the decision as I have enjoyed playing both inFamous and inFamous 2, and, to be fair, with the exception of the idiotic spray-painting side “quests”, I enjoyed this iteration as well.

Watch Dogs

Watch Dogs, from Ubisoft Montreal, was majorly hyped at 2013’s E3 event. Billed as open world action-adventure title with driving, shooting, running and climbing (parkour), and hacking all being part of the game. In reality, they should have just stuck with driving. This game reminds me so much of Grand Theft Auto it is not funny. Many of the integral parts of the story line require the player to drive at high-speed, either in pursuit of someone, or to get to a specific location within a time limit. These are the exact things I loathe about GTA, but can forgive in GTA, because hey that game is called Grand Theft Auto. It is about cars, and you expect cars to be an integral part of the game. Personally, I like in this sort of game to jump, climb and precision shoot. Watch Dog’s lead protagonist, Aiden Pearce, can run, and climb in a very limited fashion, but not jump — he can fall down to a lower level, or vault a fence which is really just an extension of the running, but he cannot jump from say a boat to a pier.

The hacking part of the game is sort of interesting in that it can provide you a way of carrying out a mission without entering a building, but mostly it doesn’t. It lets you steal money from unsuspecting citizens, which Eurogamer points out you can’t even give to the numerous beggars around the city. It does not really help you when you are driving, except in the odd case where you can cause bollards to rise out of the road to stop a car, or raise a bridge. Eurogamer states, again correctly, that it is a game about hacking where hacking quickly becomes incidental.

A number of side missions require you to interfere with, or stop, criminal activity. Fine, I have no problem with that, but then why do most of these missions require me to tackle someone who is shooting at me with an automatic weapon? Overall, I am extremely disappointed with this game and will probably not bother to complete it.

No DLNA support, no third party media

Both the Xbox 360 and Sony PlayStation 3 could see and play all the media content hosted on my Windows box in the basement. This was a nice feature because it meant I could play media (music, video, browse photos) without buggering around with discs, portable hard drives, or flash drives. So you might say, okay minor inconvenience, stick it on a flash drive and away you go. No dice. Sony, in its infinite wisdom, has decided that “Unfortunately as noted pre-launch of the PS4, you cannot play movies on the PS4 off of an external hard drive or memory stick. However this is something that is looking to be added in the future with an update.” One question – “Why?” It is hardly a revolutionary feature — my TV does it. The inability to insert an audio disc seems like a kick in the face to the consumer, and overall Sony, you are just pissing off a lot of people who give your company considerable revenue.

Final thoughts

Its possible that Sony may fix some of these issues, but I am not holding my breath. I am hoping that Far Cry 4, releasing in November, is as good as the previous titles in the series.

Share Button

Python and statistics – is there any point?

This semester I gave my graduate student class a project. The brief was relatively simple: implement the iteratively reweighted least squares (IRLS) algorithm to perform a simple (single covariate) logistic regression in Python. Their programmes were supposed to be able to read data in from a text file, perform the simple matrix algebra and math needed to carry out the IRLS computation and return some formatted output – similar to that you would get from R’s summary.glm function. Of course, you do not need matrix algebra to do this, but the idea was for the students to learn a bit of mathematical statistics that they had not seen before.  On the IRLS front, they were allowed to use a simple least squares routine like numpy’s linalg.lstsq and some of numpy’s simple matrix operators, but expressly forbidden from simply loading pandas or statsmodels and using the generalized linear models functions contained therein.

I thought this sounded like a straightforward enough task. The students divided themselves into pairs to work on it, and they had 13 weeks to complete the task.

The kicker was that I did not provide any instruction, either in Python or in the IRLS algorithm. An aim of the project was to simulate the situation where someone asks you to solve a problem, and you have to go and do some research to do it. Their first task was to complete 100 exercises on codeacademy.com as a reasonable introduction to a language none of them had seen before.

Problems – versions

There are two major versions of Python in the wild, 2.7 and 3.4. Codeacademy teaches using version 2.7. One fundamental difference between 2.7 and 3.4 is the syntax of the print function. All of my students are users of R, to varying levels of skill. When they go to install R at home, they know to go to the CRAN website, or a mirror, and download the current, stable release of R. If they followed this policy, as I did myself, then they would have installed Python 3.4 and found that the way they were taught to use print by Codeacademy does not work, without any sort of helpful “That syntax has been depricated. Python 3 onwards uses the syntax…” This is not the only issue, with the way Python 3.4 handles execution of loops over numbered ranges being another example of a fundamental difference.

Problems – platform issues

Most students at my institution use Windows, especially at home. There is some Mac penetration, and Linux is virtually non-existent (these are statistics students, not computer science remember). The official Python installers work perfectly well under Windows in my experience. However, then we come to the issue of installing numpy. The official advice from the numpy website seems to be “download a third party version of Python which already has it.” For students who come from a world where a package can be installed by going to a menu, this is less than useful. The common advice from the web is that “there is no official release of numpy 1.8.1 for Python 2.7 or higher for Windows” but that you can download it and install it from a the builds very thoughtfully provided by Christoph Gohlke at UC Irvine here. Christoph’s builds work fine, but again, for something that seems, at least from the outside, very mainstream in the Python community should the user have to go to this level of effort?

Problems – local installations

Like any instructor, I face the issue that a number of my students have no option but to use the computer laboratories provided for them by the university. This means that we encounter the issue of local installation of libraries for users. Most, if not all, R packages from CRAN can be installed in a local library. As far as I can tell, this is not true for a Windows installation of Python. I am happy to be corrected on this point. The aforementioned Python binaries come with proper Windows installers, which want to install into the Python root directory, something students do not have permission to do. If I had realized this problem in December of last year, I could have asked the admins to pre-install it for all users, however, given I only formulated the problem in February, it was just a tad too late.

Would I do it again?

I might, but there would have to be serious efforts to resolve the problems listed above on my part. It also would not solve problems of students trying to set up Python at home, and I do not feel like hand-holding people through an installation process. My initial plan had been to try Javascript. I may return to this idea.

I would be the first to admit that I am not a Python user, but I am an experienced programmer with over thirty years of experience in at least a dozen different languages, and on multiple platforms. I know many people find Python a very useful language for their scientific computing, and I am not attempting to bad mouth the language – it seems a decent enough language with the constructs and functionality that I would expect to find in any modern language – but I do not think there is much incentive for a statistician to move away from R, or an R/C++ combination when raw compute power is required.

I am glad that my students experienced programming in a non-vectorized language. R does give a distorted perspective on programming with regards to its handling of vectors, and I think it is beneficial for students to learn about flow structures for element-wise computation.


Nat Dudley has made the suggestion I used on online IDE like nitrous.io.

Second updates

Despite the difficulties, nearly all of my students have managed to complete the task, and some have done an exceptionally good job, even adding in the ability to parse R-like formulae.

Share Button

Bayesian modelling of left-censored data using JAGS

Bayesian modelling of left-censored data using JAGS

Martyn Plummer's JAGS very helpfully provides us with a way to model censored data through the use of the dinterval distribution. However, almost all of the examples that one finds on the web are for right censored data. The changes to model left censored data are not major, but I do think they warrant a) a post/page of their own and b) hopefully an easy-to-understand example.

Left-censored data arises very commonly when dealing with detection limits from instrumentation. In my own work, I often end up involved in the modelling of data derived from electropherograms.

I will start by generating some left censored data. For simplicity I am going to assume that my data is exponentially distributed, with a true rate of 1.05 (\(\lambda = 1.05\)), and a detection/censoring threshold at log(29). This means that approximately 97.1% of my data (on average) will not exceed my detection threshold. This may seem extreme, but it is the kind of setup that is common in my own work.

x = rexp(10000, rate = 1.05)

## set all the censored values to NA's
x[x < log(29)] = NA

This gives us a data set of size 10,000 with 9,691 values that fall below the detection threshold.

It can be a useful check, if feasible, to see what the maximum likelihood estimate is. We can do this in R using the optim function

## define the log-likelihood
logLik = function(lambda){
  isCensored = is.na(x)
  nCensored = sum(isCensored)
  LOD = log(29)

  ll = sum(dexp(x[!isCensored], rate = lambda, 
                log = TRUE)) + 
       nCensored  * pexp(LOD, rate = lambda, 
                         log = TRUE)

  ## return the -ve value because optim 
  ## is a minimizer

fit = optim(0.5, logLik, method = "Brent", 
            lower = 0, upper = 10, 
            hessian = TRUE)
fisherInfo = 1/fit$hessian
sigma = sqrt(fisherInfo)
upper = fit$par + 1.96 * sigma
lower = fit$par - 1.96 * sigma
interval = c(lower, fit$par, upper)
names(interval) = c("lower", "MLE", "upper")
## lower   MLE upper 
## 1.001 1.032 1.064

So the 95% confidence interval contains our true value which is always a good start!

The trick, if there is any, to dealing with left-censored data in JAGS is to make sure that your indicator variable tells JAGS which variables are above the detection threshold.

So in the next step I will set up the list that contains my data.

bugsData = list(N = length(x),
                isAboveLOD = ifelse(!is.na(x), 
                                    1, 0),
                x = x,
                LOD = rep(log(29), length(x)))

There are two points to make about the preceding code. Firstly the variable isAboveLOD uses the NA status of the data. If you have not recoded your censored values to NA then obviously this will not work. Secondly, there is a limit of detection vector LODregardless of whether the observation is below the limit of detection or not.

Next we need to set up some intial values. The key here is setting not only an initial value for \(\lambda\), but initial values for the observations that have been censored.

bugsInits = list(list(lambda = 0.5, 
                      x = rep(NA, length(x))))

## set the missing values to 
## random variates from
## U(0, log(29))
nMissing = sum(!bugsData$isAboveLOD)
bugsInits[[1]]$x[!bugsData$isAboveLOD] = 
           runif(nMissing, 0, log(29))

I have chosen uniform random variates between zero and the limit of detection (log(29)) as initial values for my censored data.

Now we need to set up our JAGS model. I will store this in a string, and then use writeLines to write this to disk.

modelString = "
    for(i in 1:N){
      isAboveLOD[i] ~ dinterval(x[i], LOD[i])
      x[i] ~ dexp(lambda)

    lambda ~ dgamma(0.001, 0.001)

writeLines(modelString, con = "model.bugs.R")

Note that I have used a vague gamma prior for \(\lambda\). This is not especially noteworthy, except from the point of view about being explicit about what I have done. dinterval returns 0 if x[i] < LOD[i] and a 1 otherwise. Many people who try to use dinterval often get this cryptic error Observed node inconsistent with unobserved parents at initialization. This can happen for two reasons. Firstly the indicator variable can be incorrectly set, that is the observations above the limit of detection have been coded with a 0 instead of a 1 and vice versa for the censored observations. Secondly, the error can occur because the initial values for the censored observations are outside of the censored interval.

We now have the three components necessary to fit our model: a list containing the data, a list of initial values, and a BUGS model. Firstly we initialize the model

## Loading required package: coda
## Loading required package: lattice
## Linked to JAGS 3.3.0
## Loaded modules: basemod,bugs
jagsModel = jags.model(file = "model.bugs.R", data = bugsData, inits = bugsInits)
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
##    Graph Size: 30003
## Initializing model

Next, we let the model burn-in for an arbitrary period. I will use a burn-in period of 1,000 iterations.

update(jagsModel, n.iter = 1000)

And finally, we take a sample (hopefully) from the posterior distribution of the parameter of interest, \(\lambda\)

parameters = c("lambda")
simSamples = coda.samples(jagsModel, 
                          variable.names = parameters, 
                          n.iter = 10000)
stats = summary(simSamples)

We can obtain a 95% confidence interval by using the posterior mean and standard error, i.e.

mx = stats$statistics[1]
sx = stats$statistics[2]
ci = mx + c(-1, 1) * 1.96 * sx
## [1] 1.002 1.065

or we can get a 95% credible interval by using the posterior quantiles, i.e.

ci = stats$quantiles
##  2.5% 97.5% 
## 1.002 1.065

Both intervals contain the true value of 1.05 and have a fairly reasonable estimate of it as well. They also are very close to the ML interval.

I hope this clears things up for someone.

Credit where credit is due

I could not have written this without following the right censored example provided by John Kruschke here, and from reading Martyn Plummer's presentation here.

Share Button

A blog about statistics, computing, and other things that interest me