Category Archives: R

How do I match that?

This is not a new post, but a repost after a failed WordPress upgrade

One of the projects I am working on is the 3rd edition of Introduction to Bayesian Statistics, a textbook predominantly written by my former colleague Bill Bolstad. It will be out later this year if you are interested.

One of the things which will make no difference to the reader but will make a lot of difference to me is the replacement of all the manually numbered references in the book for things like chapters, sections, tables and figures. The problem I am writing about today arose from programmatically trying to label the latter — tables and figures — in LaTeX. LaTeX’s reference system, as best I understand it, requires that you place a \label command after the caption. For example

\begin{figure}
\centering
\includegraphics{myfig}
\caption{Here is a plot of something}
\label{fig:myfig}
\end{figure}

This piece of code will create a label fig:myfig which I can later use in a \ref{fig:myfig} command. This will in turn be replaced at compile time with a number dynamically formatted according to the chapter and the number of figures which precede this plot.

The challenge

The problem I was faced with is easy enough to describe. I needed to open each .tex file, find all of the figure and table environments, check to see if they contained a caption and add a label if they did.

Regular expressions to the rescue

Fortunately I know a bit about regular expressions, or at least enough to know when to ask for help. To make things more complicated for myself, I did this in R. Why? Well basically because a) I did not feel like dusting off my grossly unused Perl (I’ve been clean of Perl for 4 years now and I intend to stay that way) and b) I could not be bothered with Java’s file handling routines. I want to be able to open files for reading with one command, not 4 or 8 or whatever the hell it is. Looking back I could have used C++, because the C++11 standard finally has a regex library and raw string literals, so that not everything has to be double escaped, i.e. I can write R"(\\label)" to look for a line that has a \label on it rather than "\\\\label", where every backslash has to be escaped twice: once for the string and once for the regular expression.

And before anyone feels the urge to suggest a certain language I remind you to “Just say no to Python!”

Finding the figure and table environments is easy enough. I simply look for the \begin{figure} and \begin{table} tags, as well as the environment closing tags \end{figure} and \end{table}. It is possible to do this all in one regular expression, but I wanted to capture the \begin and \end pairs. I also wanted to deal with tables and figures separately. The reason for this is that it was possible to infer the figure labels from Bill’s file naming convention for his figures. The tables on the other hand could just be labelled sequentially, i.e. starting at 1 and counting upwards with a prefix reflecting the chapter number.

Lines = readLines("Chapter.tex")

## In an R string "\\\\" is the regular expression \\,
## i.e. one literal backslash
begin = grep("\\\\begin\\{figure\\}", Lines)
end = grep("\\\\end\\{figure\\}", Lines)

n = length(begin)
if(n != length(end))
  stop("The number of begin and end pairs don't match")

## Now we can work on each figure environment in turn
## (seq_len copes with the case where there are no figures)
for(k in seq_len(n)){
  b = begin[k]
  e = end[k]

  block = paste0(Lines[b:e], collapse = "\n")

  if(!grepl("\\\\label", block)){ ## only add a label if there isn't one already
    ## everything I'm really talking about is going to happen here.
  }
}

So what I needed to be able to do was find the caption inside my block and then insert a label. Easy enough right? I should be able to write a regular expression to do that. How about something like this:

pattern = "\\\\caption\\{([^}]+)\\}"

That will work most of the time, except, as you will find out, when the caption itself contains braces, and we have some examples that do just that:

\caption{``If \emph{A} is true then \emph{B} is true.''  Deduction is possible.}

My first regular expression would only match up to the end of \emph{A}, which does not help me. I need something that could, in theory, match an unlimited number of inner sets of braces.
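To see the failure for yourself, here is a minimal sketch that runs the naive pattern over the caption above. The variable names are mine and are only for illustration.

```r
## The caption from the example above, as an R string
block = "\\caption{``If \\emph{A} is true then \\emph{B} is true.''  Deduction is possible.}"

## The naive pattern: everything up to the first closing brace
naive = "\\\\caption\\{([^}]+)\\}"

## Extract whatever the pattern actually matched
m = regmatches(block, regexpr(naive, block))
print(m)
```

The match stops at the brace that closes \emph{A}, so the rest of the caption is lost.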

Matching nested parentheses

Fortunately, matching nested parentheses is a well-known problem and Hadley Wickham tweeted me a Stack Overflow link that got me started. There is also a similar solution on page 330 of Jeffrey Friedl’s very useful Mastering Regular Expressions book. The solution relies on a regular expression which employs recursion.

Set perl = TRUE to use PCRE (and recursion) in R

To make this work in R we have to make sure the PCRE library is used, and this is done by setting perl = TRUE in the call to gregexpr.

This is my solution:

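## insert the label after the caption
pat = "caption(\\{([^{}]+|(?1))*\\})"
m = gregexpr(pat, block, perl = TRUE)
capStart = attr(m[[1]], "capture.start", TRUE)[1]
capLength = attr(m[[1]], "capture.length", TRUE)[1]

## figNumber is inferred from the figure's file name (not shown here)
strLabel = paste0("\\label{fig:", figNumber, "}\n")
## capStart + capLength is one character past the closing brace, so the
## label goes in after the newline that follows the caption
newBlock = paste0(substr(block, 1, capStart + capLength),
                  strLabel,
                  substr(block, capStart + capLength + 1, nchar(block)))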

The regular expression assigned to pat is where the work gets done. Reading the expression from left to right it says:

match caption literally
open the first capture group
match { literally
open the second capture group
match one or more instances of anything that is not an open brace { or a close brace }
or recursively match the first capture group again; (?1) is not a capture group itself but a call back into group 1. I will elaborate on this more in a bit
close the second capture group and ask R to match this pattern zero or more times
literally match the end brace }
close the first capture group

I would be the first to admit that I did not quite understand what (?1) does in this regexp at first. The initial solution used (?R), which recurses into the whole pattern. That is fine when the pattern is just a pair of braces, and it let me match all sets of paired braces within block. Once the literal caption is added to the front, however, every nested brace would have to be preceded by caption as well, so I could not specifically match the caption. Using (?1) instead limits the recursion to the first capture group, i.e. the braces and their contents.
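A small experiment makes the difference between the two forms of recursion concrete. This is only a sketch on a made-up string s, not something from the book files.

```r
## A made-up test string with a nested caption and an unrelated brace pair
s = "\\caption{a \\emph{b} c} and {another} pair"

pat1 = "caption(\\{([^{}]+|(?1))*\\})"   # recurse into group 1 only
patR = "caption(\\{([^{}]+|(?R))*\\})"   # recurse into the whole pattern
braces = "\\{([^{}]+|(?R))*\\}"          # the original brace matcher

## (?1) matches the caption together with its nested braces
hit1 = regmatches(s, regexpr(pat1, s, perl = TRUE))

## (?R) would need the word caption before every nested brace, so it fails
hitR = grepl(patR, s, perl = TRUE)

## without the caption prefix, (?R) finds every balanced pair of braces
allBraces = regmatches(s, gregexpr(braces, s, perl = TRUE))[[1]]

print(hit1)
print(hitR)
print(allBraces)
```

The first pattern returns the whole caption, the second does not match at all, and the third returns both the caption argument and {another}.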

The rest of the code breaks the string apart, inserts the correct label, and creates a new block with the label inserted. I replace the first line of the figure environment block with this new string, and keep a list of the remaining line numbers so that they can be omitted when the file is written back to disk.
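For anyone who wants to try this without my files, here is a self-contained sketch of the idea. The function addLabel and the toy figure block are hypothetical and written just for this post. It also differs slightly from the code above, in that it puts the label on its own line directly after the caption's closing brace rather than after the following newline.

```r
## Hypothetical helper: insert \label{label} after the first caption in block
addLabel = function(block, label){
  ## a literal backslash, the word caption, then a balanced set of braces
  pat = "\\\\caption(\\{([^{}]+|(?1))*\\})"
  m = regexpr(pat, block, perl = TRUE)

  ## no caption, so nothing to do
  if(m[1] == -1)
    return(block)

  ## position of the brace that closes the caption
  capEnd = attr(m, "capture.start")[1] + attr(m, "capture.length")[1] - 1

  paste0(substr(block, 1, capEnd),
         "\n\\label{", label, "}",
         substr(block, capEnd + 1, nchar(block)))
}

## A toy figure environment with nested braces in the caption
block = paste0(c("\\begin{figure}",
                 "\\centering",
                 "\\includegraphics{myfig}",
                 "\\caption{If \\emph{A} then \\emph{B}.}",
                 "\\end{figure}"), collapse = "\n")

newBlock = addLabel(block, "fig:myfig")
cat(newBlock, "\n")
```

Splitting newBlock on newlines gives six lines, with the new \label line sitting between the caption and \end{figure}.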


An introduction to using Rcpp modules in an R package

 

 

Introduction

The aim of this post is to provide readers with a minimal example demonstrating the use of Rcpp modules within an R package. The code and all files for this example can be found on https://github.com/jmcurran/minModuleEx.

What are Rcpp Modules?

Rcpp modules allow R programmers to expose their C++ classes to R. By “expose” I mean the ability to instantiate a C++ object from within R, and to call methods which have been defined in the C++ class definition. I am sure there are many reasons why this is useful, but the main reason for me is that it provides a simple mechanism to create multiple instances of the same class. An example of where I have used this is my multicool package which is used to generate permutations of multisets. One can certainly imagine a situation where you might need to generate the permutations of more than two multisets at the same time. multicool allows you to do this by instantiating multiple multicool objects.

The Files

I will make the assumption that you, the reader, know how to create a package which uses Rcpp. If you do not know how to do this, then I suggest you look at the section entitled “Creating a New Package” here on the RStudio support site. Important: Although it is mentioned in the text, the image displayed on this page does not show that you should change the Type: drop down box to Package w/ Rcpp.

Creating a package with Rcpp

This makes sure that a bunch of fields are set for you in the DESCRIPTION file that ensure Rcpp is linked to and imported.

There are five files in this minimal example. They are

  • DESCRIPTION
  • NAMESPACE
  • R/minModuleEx-package.R
  • src/MyClass.cpp
  • R/zzz.R

I will discuss each of these in turn.

DESCRIPTION

This is the standard DESCRIPTION file that all R packages have. The lines that are important are:

Depends: Rcpp (>= 0.12.8)
Imports: Rcpp (>= 0.12.8)
LinkingTo: Rcpp
RcppModules: MyModule

The Imports and LinkingTo lines should be generated by RStudio. The RcppModules: line should contain the name(s) of the module(s) that you want to use in this package. I have only one module in this package which is unimaginatively named MyModule. The module exposes two classes, MyClass and AnotherClass.

NAMESPACE and R/minModuleEx-package.R

The first of these is the standard NAMESPACE file and it is automatically generated using roxygen2. To make sure this happens you need to select Project Options… from the Tools menu. It will bring up the following dialogue box:

Project Options

Select the Build Tools tab, and make sure that the Generate documentation with Roxygen checkbox is ticked, then click on the Configure… button and make sure that all the checkboxes that are checked below are checked:

Configuring Roxygen

Note: If you don’t want to use Roxygen, then you do not need the R/minModuleEx-package.R file, and you simply need to put the following three lines in the NAMESPACE file:

export(AnotherClass)
export(MyClass)
useDynLib(minModuleEx)

You need to notice two things. Firstly this NAMESPACE explicitly exports the two classes MyClass and AnotherClass. This means these classes are available to the user from the command prompt. If you only want access to the classes to be available to R functions in the package, then you do not need to export them. Secondly, as previously noted, if you are using Roxygen, then these export statements are generated dynamically from the comments just before each class declaration in the C++ code which is discussed in the next section. The useDynLib(minModuleEx) is generated from the line

#' @useDynLib minModuleEx

in the R/minModuleEx-package.R file.

src/MyClass.cpp

This file contains the C++ class definition of each class (MyClass and AnotherClass). There is nothing particularly special about these class declarations, although the comment lines before the class declarations,

//' @export MyClass
class MyClass{

and

//' @export AnotherClass
class AnotherClass{

, generate the export statements in the NAMESPACE file.

This file also contains the Rcpp Module definition:

RCPP_MODULE(MyModule) {
  using namespace Rcpp;

  class_<MyClass>( "MyClass")
    .default_constructor("Default constructor") // This exposes the default constructor
    .constructor<NumericVector>("Constructor with an argument") // This exposes the other constructor
    .method("print", &MyClass::print) // This exposes the print method
    .property("Bender", &MyClass::getBender, &MyClass::setBender) // and this shows how we set up a property
  ;

  class_<AnotherClass>("AnotherClass")
    .default_constructor("Default constructor")
    .constructor<int>("Constructor with an argument")
    .method("print", &AnotherClass::print)
  ;
}

In this module I have:

  1. Two classes MyClass and AnotherClass.
  2. Each class has:
    • A default constructor
    • A constructor which takes arguments from R
    • A print method
  3. In addition, MyClass demonstrates the use of a property field which (simplistically) provides the user with simple retrieval from and assignment to a scalar class member variable. It is unclear to me whether it works for more data types, but anecdotally, I had no luck with matrices.

R/zzz.R

As you might guess from the nonsensical name, it is not essential to call this file zzz.R. The name comes from a suggestion from Dirk Eddelbuettel. It contains a single, but absolutely essential line of code

loadModule("MyModule", TRUE)

This code can actually be in any of the R files in your package. However, if you explicitly put it in R/zzz.R then it is easy to remember where it is.

Using the Module from R

Once the package is built and loaded, using the classes from the module is very straightforward. To instantiate a class you use the new function. E.g.

m = new(MyClass)
a = new(AnotherClass)

This code will call the default constructor for each class. If you want to call a constructor which has arguments, then they can be added to the call to new. E.g.

set.seed(123)
m1 = new(MyClass, rnorm(10))

Each of these objects has a print method which can be called using the $ operator. E.g.

m$print()
a$print()
m1$print()

The output is

> m$print()
1.000000 2.000000 3.000000
> a$print()
0
> m1$print()
1.224082 0.359814 0.400771 0.110683 -0.555841 1.786913 0.497850 -1.966617 0.701356 -0.472791

The MyClass class has a module property – a concept also used in C#. A property is a scalar class member variable that can either be set or retrieved. For example, m1 has been constructed with the default value of bBender = FALSE; however, we can easily change it to TRUE:

m1$Bender = TRUE
m1$print()

Now our object m1 behaves more like Bender when asked to do something 🙂

> m1$print()
Bite my shiny metal ass!

Hopefully this will help you to use Rcpp modules in your project. This is a great feature of Rcpp and really makes it even more powerful.


An R/Rcpp mystery

Today I spent almost four hours trying to deal with the fact that our R package DNAtools would not compile under Windows. The issue originated with a win-builder build which was giving errors like this:


I"D:/RCompile/recent/R/include" -DNDEBUG -I"d:/RCompile/CRANpkg/lib/3.4/Rcpp/include" -I"d:/Compiler/gcc-4.9.3/local330/include" -c DNTRare.cpp -o DNTRare.o
ID:/RCompile/recent/R/include: not found

and I replicated this (far too many times) on my Windows VM on my Mac.

In the end this boiled down to the presence of our Makevars file which contained only one line:


CXX_STD = CXX14

Deleting it fixed the problem and the package now compiles just fine locally; I am waiting for the response from the win-builder site. I do not anticipate an issue, but it would be useful to understand what was going wrong. I must admit that I have forgotten what aspects of the C++14 standard we are using, but I do know that changing the line to


PKG_CXXFLAGS= -std=c++14

which I use in my multicool package, gives me a different pain, with the compiler being unable to locate Rcpp.h after seeing a #include directive.


Extracting elements from lists in Rcpp

If you are an R programmer, especially one with a solid background in data structures or with experience in a more traditional object oriented environment, then you probably use lists to mimic the features you might expect from a C-style struct or a class in Java or C++. Retrieving information from a list of lists, or a list of matrices, or a list of lists of vectors is fairly straightforward in R, but you may encounter some compiler error messages in Rcpp if you do not take the right steps.

Stupid as bro

This will not be a very long article, but I think it is useful to have this information somewhere other than Stack Overflow. Two posts, one from Dirk and one from Romain, contain the requisite information.

The List class does not know what type of elements it contains. You have to tell it. That means if you have something like

x = list(a = matrix(1:9, ncol = 3), b = 4)

in your R code and

void Test(List x){
  IntegerMatrix a = x["a"];
}

in your C++, then you might get a compiler error complaining about certain things not being overloaded. As Dirk points out in another post (which I cannot find right at this moment), the accessor operator for a List simply returns a SEXP. Rcpp has done a pretty good job of removing the need for us to get our hands dirty with SEXPs, but they are still there. If you know (and you should, since you are the one writing the code and designing the data structures) that this SEXP actually is an IntegerMatrix, then you should cast it as one using the as<T>() function. That is,

void Test(List x){
  IntegerMatrix a = as<IntegerMatrix>(x["a"]);
}

So why does this work?

If you look around the internet, you will see chunks of code like

int b = x["b"];
NumericVector y = x["y"];

which compile just fine. Why does this work? It works because the assignment operator has been overloaded for certain types in Rcpp, and so you will probably find you do not need explicit type coercion. However, it certainly will not hurt to explicitly do so for every assignment, and your code will benefit from doing so.


Generating pseudo-random variates C++-side in Rcpp


It is well-known that if you are writing simulation code in R you can often gain a performance boost by rewriting parts of your simulation in C++. These days the easiest way to do that of course is to use Rcpp. Simulation usually depends on random variates, and usually great numbers of them. One of the issues that may arise is that your simulation needs to execute on the C++ side of things. For example, if you decide to programme your Metropolis-Hastings algorithm (not technically a simulation I know) in Rcpp, then you are going to need to be able to generate hundreds of thousands, if not millions, of random numbers. You can use Rcpp’s features to call R routines from within Rcpp to do this, e.g.

Function rnorm("rnorm");
rnorm(100, _["mean"] = 10.2, _["sd"] = 3.2 );

(Credit: Dirk Eddelbuettel)

but this has a certain overhead. C++ has had built-in random number generation functionality since the C++11 standard (known as C++0x while it was being drafted). The random header file provides a Mersenne Twister uniform random number generator (RNG), a Linear Congruential Generator (LCG), and a Subtract-with-Carry RNG. There is also a variety of standard distributions available, described here.

Uniform random variates

The ability to generate good quality uniform random variates is essential, and the mt19937 engine provides it. The 19937 refers to the Mersenne prime \((2^{19937}-1)\) that this algorithm is based on, which is also its period length. There are four steps required to generate uniform random variates. These are:

  1. Include the random header file
  2. Construct an mt19937 random number engine, and initialise it with a seed
  3. Construct a \(U(0,1)\) random number generator
  4. Use your engine and your uniform random number generator to draw variates

In code we would write

#include <random>
#include <Rcpp.h>

using namespace std;
using namespace Rcpp;

mt19937 mtEngine;
uniform_real_distribution<double> rngU;

//[[Rcpp::export]]
void setSeed(unsigned int seed){
  mtEngine = mt19937(seed);
  rngU = uniform_real_distribution<>(0.0, 1.0);
}

double runif(void){
  return rngU(mtEngine);
}

The function runif can now be called from your C++ code with runif(). Note that the setSeed function has been exported so that you can initialize the RNG engine from R with a seed of your choice.

How about normal random variates?

It does not require very much more effort to add a normal RNG to your code. We simply add

normal_distribution<double> rngZ;

to our declared variables, and

//[[Rcpp::export]]
void setSeed(unsigned int seed){
  mtEngine = mt19937(seed);
  rngU = uniform_real_distribution<>(0.0, 1.0);
  rngZ = normal_distribution<double>(0.0, 1.0);
}

double rnorm(double mu = 0, double sigma = 1){
    return rngZ(mtEngine) * sigma + mu;
}

to our code base. Now rnorm can be called without arguments to get standard (\(N(0,1)\)) random variates, or with a mean, or a standard deviation, or both to get \(N(\mu,\sigma^2)\) random variates

Rcpp does it

No doubt someone is going to tell me that Romain and Dirk have thought of this already for you, and that my solution is unnecessary Morris Dancing. However, I think there is merit in knowing how to use the standard C++ libraries.

Please note that I do not usually advocate having global variables such as those in the code above. I would normally make mtEngine, rngU, and rngZ private member variables of a class and then either instantiate it using an exported Rcpp function, or export the class and essential functions using an Rcpp module.

Working C++ code and an R test script can be found here in the RNG folder. Enjoy!


Introduction to Using Regular Expressions in R

R and Regular Expressions


Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems. – Jamie Zawinski, courtesy of Jeffrey Friedl’s blog

This is a blog post on using regular expressions in R. I am sure there are plenty of others out there with the same information. However, this is also an exercise for me to see how hard it is to change my knitr .Rnw files into markdown and then into HTML. It turns out that most of the work can be done by running pandoc over the LaTeX file I get from knitting my .Rnw file. The rest I did manually.

What are regular expressions?

Regular expressions provide a powerful and sophisticated way of matching patterns in text. For example with regular expressions I can:

  • Find a word or a series of letters

  • Do wild-card searches

  • Match patterns at the start or the end of a line

  • Make replacements in text based on the match

A simple example

Very often when people read about regular expressions they do not grasp the power, thinking “I could do that without using regular expressions at all.”

Here is something I did the other day. I had a file whose lines consisted of the names of files containing R code. That is, I had twenty lines of text, each ending in .R. I needed to

  1. insert export( at the start of each line, and

  2. replace the .R at the end of each line with )

You can probably think of a way of doing this with a mixture of find and replace, and manual insertion. That is all well and good if you only have 20 lines. What if you had 20,000 lines?

I did this in Notepad++ by matching the regular expression ^(.*)\.[rR]$ and replacing it with export(\1)

Before                  After
#onewayPlot.R#          export(onewayPlot)
autocor.plot.R          export(autocor.plot)
boxqq.r                 export(boxqq)
boxqq.r~                export(boxqq)
ciReg.R                 export(ciReg)
cooks20x.R              export(cooks20x)
crossFactors.R          export(crossFactors)
crossFactors.R~         export(crossFactors)
crosstabs.R             export(crosstabs)
eovcheck.R              export(eovcheck)
estimateContrasts.R     export(estimateContrasts)
estimateContrasts1.R    export(estimateContrasts1)
estimateContrasts2.R    export(estimateContrasts2)
freq1way.r              export(freq1way)
freq1way.r~             export(freq1way)
getVersion.R            export(getVersion)
interactionPlots.R      export(interactionPlots)
layout20x.R             export(layout20x)
levene.test.R           export(levene.test)
levene.test.R~          export(levene.test)
multipleComp.R          export(multipleComp)
normcheck.R             export(normcheck)
onewayPlot.R            export(onewayPlot)
onewayPlot.R~           export(onewayPlot)
pairs20x.R              export(pairs20x)
pairs20x.R~             export(pairs20x)
predict20x.R            export(predict20x)
predict20x.R~           export(predict20x)
propslsd.new.R          export(propslsd.new)
residPlot.R             export(residPlot)
rowdistr.r              export(rowdistr)
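
The same replacement can be done in R itself. This is a minimal sketch using sub on a handful of the file names; the vector files is just for illustration, and note that the backslashes are doubled inside an R string.

```r
## A few of the file names from the list
files = c("autocor.plot.R", "boxqq.r", "ciReg.R", "cooks20x.R")

## Capture everything before the final .R or .r, then wrap it in export( )
exports = sub("^(.*)\\.[rR]$", "export(\\1)", files)
print(exports)
```

Because .* is greedy, only the last full stop is treated as the start of the extension, so autocor.plot.R becomes export(autocor.plot).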

  • Regular expressions are a powerful way of describing patterns that we might search for or replace in a text document

  • In some sense they are an extension of the wildcard search and replace operations you might carry out in Microsoft Word or a text editor.

  • To the untrained eye they look like gobbledygook!

  • Most programming languages have some form of regular expression library

  • Some text editors, such as Emacs, Notepad++ and RStudio, also support regular expressions

  • This is very useful when you don't need to write a programme

  • The file utility grep uses regular expressions to find occurrences of a pattern in files

  • Mastering regular expressions could take a lifetime; however, you can achieve a lot with a good introduction

  • A couple of very good references are:

    • Jeffrey Friedl's Mastering Regular Expressions, (2006), 3rd Edition, O'Reilly Media, Inc.
    • Paul Murrell's Introduction to Data Technologies, (2009), Chapman & Hall/CRC Computer Science & Data Analysis.
    • This entire book is on the web: http://www.stat.auckland.ac.nz/~paul/ItDT/, but if you use it a lot you really should support Paul and buy it
    • Chapter 11 is the most relevant for this post

Tools for regular expressions in R

We need a set of tools to use regular expressions in something we understand — i.e. R. The functions I will make the most use of are

  • grepl and grep

  • gsub

  • gregexpr

Functions for matching

  • grep and grepl are the two simplest functions for pattern matching in R

  • By pattern matching I mean being able to either

    (i) Return the elements of, or indices of, a vector that match a set of characters (or a pattern)

    (ii) Return TRUE or FALSE for each element of a vector on the basis of whether it matches a set of characters (or a pattern)

grep does (i) and grepl does (ii).

Very simple regular expressions

At their simplest, a regular expression can just be a string you want to find. For example the commands below look for the string James in the vector of names names. This may seem like a silly example, but it demonstrates a very simple regular expression called a string literal, informally meaning match this string — literally!

names = c('James Curran', 'Robert Smith', 
          'James Last')
grep('James', names)
## [1] 1 3
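
To make the difference between (i) and (ii) concrete, here is a short sketch using the same vector. All three calls are plain base R; value = TRUE asks grep for the matching elements rather than their indices.

```r
names = c('James Curran', 'Robert Smith',
          'James Last')

## (i) indices of the matching elements
idx = grep('James', names)

## (i) again, but returning the elements themselves
vals = grep('James', names, value = TRUE)

## (ii) one TRUE or FALSE per element
flags = grepl('James', names)

print(idx)
print(vals)
print(flags)
```

grepl is the more convenient of the two when you want to subset with a logical vector or count matches with sum.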

Wild cards and other metacharacters

  • At the next level up, we might like to search for a string with a single wild card character

  • For example, my surname is sometimes (mis)spelled with an e or an i

  • The regular expression symbol/character for any character is the full stop or period

  • So my regular expression would be Curr.n, e.g.

surnames = c('Curran', 'Curren', 'Currin',
             'Curin')
grepl('Curr.n', surnames)
## [1]  TRUE  TRUE  TRUE FALSE

  • The character . is the simplest example of a regular expression metacharacter

  • The other metacharacters are [ ], [^ ], \, ?, *, +, {, }, ^, $, \<, \>, | and ()

  • If a character is a regular expression metacharacter then it has a special meaning to the regular expression interpreter

  • There will be times, however, when you want to search for a full stop (or any of the other metacharacters). To do this you can escape the metacharacter by preceding it with a double backslash \\.

  • Note that we only use two backslashes in R – nearly every other language uses a single backslash

  • Note that \\ followed by a digit from 0 to 9 has special meaning too, e.g. \\1
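
Here is a small sketch of what escaping buys you, and of what \\1 and \\2 do in a replacement. The file names are made up for the example.

```r
files = c('boxqq.R', 'boxqqR', 'notes.Rmd')

## Unescaped: . matches any character, so 'boxqqR' matches too
unescaped = grepl('.R$', files)

## Escaped: \\. only matches a literal full stop
escaped = grepl('\\.R$', files)

## \\1 and \\2 refer back to the first and second bracketed groups
swapped = sub('(.*)\\.(.*)', '\\2.\\1', 'boxqq.R')

print(unescaped)
print(escaped)
print(swapped)
```

The last call swaps the two halves of the file name around the full stop, which is not useful in itself but shows that the groups are numbered from left to right.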

Alternation

  • Whilst this example obviously works, there is a more sensible way to do this and that is to use the alternation or or operator |.

  • E.g.

grepl('Curr(a|e|i)n', c('Curran', 'Curren', 
                        'Currin', 'Curin'))
## [1]  TRUE  TRUE  TRUE FALSE

  • This regular expression contains two metacharacters ( and |

  • The round bracket ( has another meaning later on, but here it delimits the alternation.

  • We read (a|e|i) as a or e or i.

A bigger example – counting words

In this example we will use the text of Moby Dick. The R script below does the following things

  1. Opens a file connection to the text

  2. Reads all the lines into a vector of lines

  3. Counts the number of lines where the word whale is used

  4. Counts the number of lines where the word Ahab is used

## open a read connection to the Moby Dick 
## text from Project Gutenberg
mobyURL = 'http://www.gutenberg.org/cache/epub/2701/pg2701.txt'
f1 = url(mobyURL, 'r')

## read the text into memory and close
## the connection
Lines = readLines(f1)
close(f1)

Note: The code above is what I would normally do, but if you do it too often you get this nice message from Project Gutenberg

Don't use automated software to download lots of books. We have a limit on how fast you can go while using this site. If you surpass this limit you get blocked for 24h.

which is fair enough, so I am actually using a version I stored locally.

## Throw out all the lines before 
## 'Call me Ishmael'
i = grep('Call me Ishmael', Lines)
Lines = Lines[-(1:(i - 1))]

numWhale = sum(grepl('(W|w)hale', Lines))
numAhab = sum(grepl('(A|a)hab', Lines))

cat(paste('Whale:', numWhale, ' Ahab:', 
           numAhab, '\n'))
## Whale: 1487  Ahab: 491

Note: I am being explicit here about the capitals. In fact, as my friend Mikkel Meyer Andersen points out, I do not have to be. grep, grepl, regexpr and gregexpr all have an argument ignore.case which can be set to TRUE. However, I did want to highlight that regular expressions are generally case sensitive.
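
A quick sketch of the two approaches side by side. The vector sampleLines is a made-up stand-in for the Moby Dick text, so you can run this without downloading anything.

```r
## A made-up stand-in for the text
sampleLines = c('The White Whale', 'a whale of a time',
                'Captain AHAB', 'Call me Ishmael')

## Explicit about the capital: only the first letter may vary
explicitWhale = grepl('(W|w)hale', sampleLines)
explicitAhab = grepl('(A|a)hab', sampleLines)

## Ignoring case altogether
relaxedWhale = grepl('whale', sampleLines, ignore.case = TRUE)
relaxedAhab = grepl('ahab', sampleLines, ignore.case = TRUE)

print(explicitAhab)
print(relaxedAhab)
```

The two whale patterns agree here, but AHAB in capitals is only picked up when ignore.case = TRUE, because (A|a)hab insists on lower case for the last three letters.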

This programme will count the number of lines containing the words whale or Ahab but not the number of occurrences. To count the number of occurrences, we need to work slightly harder. The gregexpr function globally matches a regular expression. If there is no match, then gregexpr returns -1. However, if there are one or more matches, then gregexpr returns the position and the length of each match. For example, we will look for the word the in the following three sentences:

s1 = 'James is a silly boy'
s2 = 'The cat is hungry'
s3 = 'I do not know about the cat but the dog is stupid'

## Set the regular expression
## Note: This would match 'There' and 'They' as
## well as others for example 
pattern = '[Tt]he'
  
## We will store the matches so that we 
## can examine them in turn
m1 = gregexpr(pattern, s1)
m2 = gregexpr(pattern, s2)
m3 = gregexpr(pattern, s3)

There are no matches in the first sentence so we expect gregexpr to return -1

print(m1)
## [[1]]
## [1] -1
## attr(,"match.length")
## [1] -1
## attr(,"useBytes")
## [1] TRUE

which it does.

In the second sentence, there is a single match at the start of the sentence

print(m2)
## [[1]]
## [1] 1
## attr(,"match.length")
## [1] 3
## attr(,"useBytes")
## [1] TRUE

This result tells us that there is a single match at character position 1, and that the match is 3 characters long.

In the third example there are two matches, at positions 21 and 33, and they are both 3 characters long

print(m3)
## [[1]]
## [1] 21 33
## attr(,"match.length")
## [1] 3 3
## attr(,"useBytes")
## [1] TRUE

So in order to count the number of occurrences of a word we need to use gregexpr, keep all of those instances where the result is not -1, and then count the number of matches using the length function, e.g.

## count the number of occurrences of whale
pattern = '(W|w)hale'
matches = gregexpr(pattern, Lines)

## if gregexpr returns -1, then the number of 
## matches is 0 if not, then the number of 
## matches is given by length
counts = sapply(matches, 
                function(x){
                  if(x[1] == -1) 
                    return(0)
                  else
                    return(length(x))})

cat(paste('Whale:', sum(counts), '\n'))
## Whale: 1564
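
An alternative that avoids the explicit test for -1 is to let regmatches pull out the matched strings and then count them with lengths. This is a sketch on a made-up vector sampleLines rather than the full text, so the numbers are not comparable with those above.

```r
## A made-up stand-in for the text
sampleLines = c('The whale, the Whale!', 'No cetaceans here',
                'whale whale whale')

pattern = '(W|w)hale'
matches = gregexpr(pattern, sampleLines)

## regmatches returns one character vector per line;
## a line with no match gives a vector of length zero
counts = lengths(regmatches(sampleLines, matches))

numLines = sum(grepl(pattern, sampleLines))

cat(paste('Lines:', numLines, ' Occurrences:', sum(counts), '\n'))
```

Two of the three lines contain the word, but it occurs five times in total, which is the same distinction as between the two whale counts above.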

Character classes/sets or the [ ] operator

  • Regular expression character sets provide a simple mechanism for matching any one of a set of characters

  • For example [Tt]he will match The and the

  • The real strength of character sets is in their special range and set operators

  • For example the regular expressions:

    • [0-9] will match any digit from 0 to 9
    • [a-z] will match any lower case letter from a to z
    • [A-Z0-9] will match any upper case letter from A to Z or any digit from 0 to 9 and so on
  • You may initially think that character sets are like alternation, but they are not. Character sets treat their contents as an unordered list of single characters

    • So (se|ma)t will (fully) match set and mat
    • But [sema]t will not
    • Instead, [sema]t will match st, et, mt and at
  • The POSIX system defines a set of special character classes which are supported in R and can be very useful. These are

[:alpha:] Alphabetic (only letters)
[:lower:] Lowercase letters
[:upper:] Uppercase letters
[:digit:] Digits
[:alnum:] Alphanumeric (letters and digits)
[:space:] White space
[:punct:] Punctuation
  • The regular expression [[:lower:]] will help you capture accented lower case letters like à, é, or ñ, whereas [a-z] would miss all of them

  • You may think this is uncommon, but optical character recognition (OCR) text often has characters like this present
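
A short sketch of the difference between alternation and character sets, using a made-up vector of words. The ^ and $ anchors and the + quantifier are explained further down; they are used here only to force a full match and to match a run of digits.

```r
words = c('set', 'mat', 'st', 'et', 'mt', 'at')

## alternation: the whole of 'se' or the whole of 'ma', followed by t
altMatch = grepl('^(se|ma)t$', words)
print(altMatch)
## [1]  TRUE  TRUE FALSE FALSE FALSE FALSE

## character set: exactly one of s, e, m or a, followed by t
setMatch = grepl('^[sema]t$', words)
print(setMatch)
## [1] FALSE FALSE  TRUE  TRUE  TRUE  TRUE

## a POSIX class inside a set: replace each run of digits with #
masked = gsub('[[:digit:]]+', '#', 'Chapter 12, page 345')
print(masked)
## [1] "Chapter #, page #"
```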

Negated character sets – [^...]

  • Negated character sets provide a way for you to match anything but the characters in this set

  • A very common example of this is when you want to match everything between a set of (round) brackets

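  • E.g. the regular expression \([^)]\) would match any single character between a pair of round brackets (the brackets are escaped so that they are matched literally rather than starting a capture group)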
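
Here is a minimal sketch of the bracket example in R, using a made-up string. Two assumptions are worth flagging: the round brackets are escaped (doubled backslashes inside an R string) so that they are treated as literal characters, and the * quantifier from the next section is added so that the set can match a run of characters rather than just one.

```r
x = 'alpha (beta) gamma (delta)'

## \\( and \\) are literal brackets; [^)]* is a run of anything except )
inBrackets = regmatches(x, gregexpr('\\([^)]*\\)', x))[[1]]
print(inBrackets)
## [1] "(beta)"  "(delta)"

## without a quantifier the set matches exactly one character
oneChar = grepl('\\([^)]\\)', c('f(x)', 'f(xy)', 'f()'))
print(oneChar)
## [1]  TRUE FALSE FALSE
```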

Matching more than one character — quantifiers

Another common thing we might do is match zero, one, or more occurrences of a pattern. We have four ways to do this

  1. ? means match zero or one occurrences of the previous pattern

  2. * means match zero or more occurrences of the previous pattern

  3. + means match one or more occurrences of the previous pattern

  4. {a,b} means match from a to b occurrences of the previous pattern

    • b may be omitted so that {a,} means match a or more occurrences of the previous pattern
    • b and the comma may be omitted so that {a} means match exactly a occurrences of the previous pattern
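
The quantifiers can be compared side by side. This is an illustrative sketch on a made-up vector of strings, where each pattern controls how many times b may appear between a and c; the ^ and $ anchors (covered below) force the whole string to match.

```r
x = c('ac', 'abc', 'abbc', 'abbbc', 'abbbbc')

optional   = grepl('^ab?c$', x)      # zero or one b
zeroPlus   = grepl('^ab*c$', x)      # zero or more b
onePlus    = grepl('^ab+c$', x)      # one or more b
twoThree   = grepl('^ab{2,3}c$', x)  # from two to three b
twoPlus    = grepl('^ab{2,}c$', x)   # two or more b
exactlyTwo = grepl('^ab{2}c$', x)    # exactly two b

## one row per quantifier, one column per test string
results = rbind(optional, zeroPlus, onePlus,
                twoThree, twoPlus, exactlyTwo)
colnames(results) = x
print(results)
```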

Continuing with our misspelling example, I would like a way of picking up all of the possibilities of misspelling my surname. Variants I've seen are Curren, Currin, Curin, Curan, Curen, Curn and even Karen!

If I wanted to construct a regular expression to match all of these possibilities I need to match (in this order):

  1. a C or a K

  2. a u or an a

  3. one or two occurrences of r

  4. zero or more occurrences of e or i

  5. and finally an n

This is how I do this with regular expressions

pattern = '[CK](u|a)r{1,2}(e|i)*n';
badNames = c('Curren', 'Currin', 'Curin', 
             'Curan', 'Curen', 'Curn', 'Karen')
grepl(pattern, badNames)
## [1]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE

Notice how the regular expression didn't match Curan. To fix the code so that it does match we need to change the set of possible letters before the n from (e|i) to allow a as a possibility, i.e. (a|e|i) or alternatively [aei]

pattern1 = '[CK](u|a)r{1,2}(a|e|i)*n'
pattern2 = '[CK](u|a)r{1,2}[aei]*n'
badNames = c('Curren', 'Currin', 'Curin', 'Curan', 
             'Curen', 'Curn', 'Karen')
grepl(pattern1, badNames)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE
grepl(pattern2, badNames)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE

Anchors — matching a position

  • The metacharacters ^, $, \< and \> match positions

  • ^ and $ match the start and the end of a line respectively

  • \< and \> match the start and end of a word respectively

  • I find I use ^ and $ more often

  • For example

    • ^James will match "James hates the cat" but not "The cat does not like James"
    • cat$ will match "James hates the cat" but not "The cat does not like James"
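
The two sentences above can be checked directly in R. The word anchor example at the end uses made-up strings, and the backslashes are doubled because the pattern is written as an R string.

```r
sentences = c('James hates the cat', 'The cat does not like James')

## ^ anchors the match to the start of the string
startsWithJames = grepl('^James', sentences)
print(startsWithJames)
## [1]  TRUE FALSE

## $ anchors the match to the end of the string
endsWithCat = grepl('cat$', sentences)
print(endsWithCat)
## [1]  TRUE FALSE

## \< and \> anchor the match to the start and end of a word,
## so cat inside concatenate does not count
wholeWord = grepl('\\<cat\\>', c('the cat sat', 'concatenate'))
print(wholeWord)
## [1]  TRUE FALSE
```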

Summary — R functions for matching regular expressions

The functions grep and regexpr are the most useful. Loosely,

  • grep tells you which elements of your vector match the regular expression

  • whereas regexpr tells you which elements match, where they match, and how long each match is

  • gregexpr matches every occurrence of the pattern, whereas regexpr only matches the first occurrence

The example below shows the difference between grep and gregexpr

poss = c('Curren', 'Currin', 'Curin', 'Curan', 
         'Curen', 'Curn', 'Karen')
pattern = '[CK](u|a)r{1,2}(i|e)*n'
grep(pattern, poss)
## [1] 1 2 3 5 6 7
gregexpr(pattern, poss)
## [[1]]
## [1] 1
## attr(,"match.length")
## [1] 6
## attr(,"useBytes")
## [1] TRUE
## 
## [[2]]
## [1] 1
## attr(,"match.length")
## [1] 6
## attr(,"useBytes")
## [1] TRUE
## 
## [[3]]
## [1] 1
## attr(,"match.length")
## [1] 5
## attr(,"useBytes")
## [1] TRUE
## 
## [[4]]
## [1] -1
## attr(,"match.length")
## [1] -1
## attr(,"useBytes")
## [1] TRUE
## 
## [[5]]
## [1] 1
## attr(,"match.length")
## [1] 5
## attr(,"useBytes")
## [1] TRUE
## 
## [[6]]
## [1] 1
## attr(,"match.length")
## [1] 4
## attr(,"useBytes")
## [1] TRUE
## 
## [[7]]
## [1] 1
## attr(,"match.length")
## [1] 5
## attr(,"useBytes")
## [1] TRUE
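
For comparison, here is a sketch of the same search using regexpr, which returns a single vector rather than a list because it only reports the first match in each element. The call to regmatches is an addition of mine to show how the matched text can be pulled out; note that it silently drops the elements that did not match.

```r
poss = c('Curren', 'Currin', 'Curin', 'Curan',
         'Curen', 'Curn', 'Karen')
pattern = '[CK](u|a)r{1,2}(i|e)*n'

m = regexpr(pattern, poss)

## where the first match starts in each element (-1 means no match)
starts = as.integer(m)
print(starts)
## [1]  1  1  1 -1  1  1  1

## how long each match is
lens = attr(m, 'match.length')
print(lens)
## [1]  6  6  5 -1  5  4  5

## the matched text, with non-matching elements dropped
matched = regmatches(poss, m)
print(matched)
## [1] "Curren" "Currin" "Curin"  "Curen"  "Curn"   "Karen"
```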

String substitution

  • Finding or matching is often only one half of the equation

  • Quite often we want to find and replace

  • This process is called string substitution

  • R has two functions sub and gsub

  • The difference between them is that sub only replaces the first occurrence of the pattern whereas gsub replaces every occurrence

  • Normal usage is quite straightforward. E.g.

poss = 'Curren, Currin, Curin, Curan, Curen, Curn and Karen'
pattern = '[CK](u|a)r{1,2}(i|e)*n'

sub(pattern, 'Curran', poss)
## [1] "Curran, Currin, Curin, Curan, Curen, Curn and Karen"
gsub(pattern, 'Curran', poss)
## [1] "Curran, Curran, Curran, Curan, Curran, Curran and Curran"
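
Notice that Curan survives the substitution, for the same reason it failed to match earlier. As a quick sketch, swapping in the corrected pattern from the quantifiers section (the one that uses [aei]) replaces every variant:

```r
poss = 'Curren, Currin, Curin, Curan, Curen, Curn and Karen'

## [aei]* allows an a before the final n, so Curan now matches too
pattern2 = '[CK](u|a)r{1,2}[aei]*n'

fixed = gsub(pattern2, 'Curran', poss)
print(fixed)
## [1] "Curran, Curran, Curran, Curran, Curran, Curran and Curran"
```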

Back substitution

  • Abnormal usage is probably quite unlike anything you have seen before

  • One of the most powerful features of regular expressions is the ability to re-use something that you matched in a regular expression

  • This idea is called back substitution

  • Imagine that I have a text document with numbered items in it. E.g.

    1. James
    2. David
    3. Kai
    4. Corinne
    5. Vinny
    6. Sonika
    
  • How would I go about constructing a regular expression that would take each of the lines in my document and turn them into a nice LaTeX itemized list where the numbers are the list item markers?

  • The trick is to capture the numbers at the start of each line and use them in the substitution

  • To do this we use the round brackets to capture the match of interest

  • And we use the \\1 and \\2 backreference operators to retrieve the information we matched. E.g.

Lines = c('1. James', '2. David', '3. Kai', 
          '4. Corinne', '5. Vinny', 
          '6. Sonika')
pattern = '(^[0-9]\\.)[[:space:]]+([[:upper:]][[:lower:]]+$)'
gsub(pattern, '\\\\item[\\1]{\\2}', Lines)
## [1] "\\item[1.]{James}"   "\\item[2.]{David}"  "\\item[3.]{Kai}"    
## [4] "\\item[4.]{Corinne}" "\\item[5.]{Vinny}"   "\\item[6.]{Sonika}"

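Note that the double backslash is just how R prints an escaped backslash. Each one is a single backslash in the string itself, and a single backslash is what gets written to file.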
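
A small sketch to convince yourself of this, using just the first item from the list. The temporary file is only there for illustration: print shows the escaped form, while cat and writeLines emit the actual characters.

```r
item = gsub('(^[0-9]\\.)[[:space:]]+([[:upper:]][[:lower:]]+$)',
            '\\\\item[\\1]{\\2}', '1. James')

print(item)       # escaped form, as R displays strings
## [1] "\\item[1.]{James}"
cat(item, '\n')   # the actual characters
## \item[1.]{James}

## the string holds one backslash, so it is 16 characters long
print(nchar(item))
## [1] 16

## round trip through a temporary file
tmp = tempfile(fileext = '.tex')
writeLines(item, tmp)
fromFile = readLines(tmp)
print(identical(fromFile, item))
## [1] TRUE
unlink(tmp)
```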

I actually used a regular expression with back substitution to format output for LaTeX in the file name example at the start of this post. My regular expression was the following:

(^[^A-Za-z]*([A-Za-z0-9.]+)[.][rR].*)

and this was my back substitution expression

 \\verb!\1!  &  \\verb!\\export(\2)! \\\\

There is only a single \ in the back references because I just did this in the RStudio editor, not in R. Note how there are two back references, corresponding to two capture groups, one of which is nested inside the other. In nesting situations like this, the capture groups are numbered in the order of their opening brackets, i.e. from the outermost inwards.
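
To see how nested groups are numbered, here is a sketch in R. The first pattern and its input are made up for illustration. The second applies the regular expression above to a hypothetical file name, myFunction.R, since the original input is not reproduced here; in R the back references need the doubled backslash.

```r
## group 1 is the outer pair of brackets,
## groups 2 and 3 are nested inside it, numbered left to right
pattern = '(([0-9]{4})-([0-9]{2}))'
relabelled = sub(pattern, 'whole=\\1 year=\\2 month=\\3', '2014-06')
print(relabelled)
## [1] "whole=2014-06 year=2014 month=06"

## the regular expression from above on a hypothetical file name:
## group 1 is the whole line, group 2 is the name without the extension
fnPattern = '(^[^A-Za-z]*([A-Za-z0-9.]+)[.][rR].*)'
groups = sub(fnPattern, '[\\1] [\\2]', 'myFunction.R')
print(groups)
## [1] "[myFunction.R] [myFunction]"
```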

String manipulation

We need two more things to finish this section

  • The ability to extract smaller strings from larger strings

  • The ability to construct strings from smaller strings

  • The first is called extracting substrings

  • The second is called string concatenation

  • We use the functions substr and paste for these tasks respectively

substr

  • substr is very simple

  • Its arguments are

    • the string, x,
    • a starting position, start,
    • and a stopping position, stop.
  • It extracts all the characters in x from start to stop

  • If the closely related function substring is used then the stopping position is optional

  • If stop is not provided then substring extracts all the characters from start to the end of x

E.g.

substr('abcdef', 2, 4)
## [1] "bcd"
substring('abcdef', 2)
## [1] "bcdef"
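
substr also combines nicely with the matching functions from earlier. The sketch below, on a made-up string, uses the start position and match length returned by regexpr to cut the match out by hand. The last line shows that substring is vectorised over its start and stop positions.

```r
x = 'The white whale'
m = regexpr('whale', x)

## convert the match information into start and stop positions
first = as.integer(m)
last = first + attr(m, 'match.length') - 1

found = substr(x, first, last)
print(found)
## [1] "whale"

## substring accepts vectors of positions
pieces = substring('abcdef', 1:3, 3:5)
print(pieces)
## [1] "abc" "bcd" "cde"
```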

paste

paste is a little more complicated:

  • paste takes 1 or more arguments, and two optional arguments sep and collapse

  • sep is short for separator

  • For the following discussion I am going to assume that I call paste with two arguments x and y

  • If x and y are both scalars then paste(x, y) will join them together in a single string separated by a space, e.g.

paste(1, 2)
## [1] "1 2"
paste('a', 'b')
## [1] "a b"
paste('a =', 1)
## [1] "a = 1"
  • If x and y are both scalars and you define sep to be "," say then paste(x, y, sep = ",") will join them together in a single string separated by a comma,
    e.g.
paste(1, 3, sep = ',')
## [1] "1,3"
paste(TRUE, 3, sep = ',')
## [1] "TRUE,3"
paste('a', 3, sep = ',')
## [1] "a,3"
  • If x and y are both vectors and you define sep to be "," say then paste(x, y, sep = ",") will join each element of x with the corresponding element of y into a set of strings where the elements are separated by a comma, e.g.
x = 1:3
y = LETTERS[1:3]
paste(x, y, sep = ',')
## [1] "1,A" "2,B" "3,C"
  • If collapse is not NULL, e.g. "-", then each of the strings will be pasted together into a single string, separated by the value of collapse.
x = 1:3
y = LETTERS[1:3]
paste(x, y, sep = ',', collapse = '-')
## [1] "1,A-2,B-3,C"
  • And of course you can take it as far as you like 🙂
x = 1:3
y = LETTERS[1:3]
paste('(', paste(x, y, sep = ','), ')', 
       sep = '', collapse = '-')
## [1] "(1,A)-(2,B)-(3,C)"

paste0

Mikkel also points out that paste0 is a shortcut to avoid specifying sep = "" every time
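
As a quick sketch of the equivalence, here is paste0 building LaTeX-style labels from some made-up figure names. The names and the fig: prefix are purely illustrative.

```r
## hypothetical figure names
figs = c('histogram', 'scatter', 'boxplot')

## these two calls give the same result
labels1 = paste('fig:', figs, sep = '')
labels2 = paste0('fig:', figs)
print(labels2)
## [1] "fig:histogram" "fig:scatter"   "fig:boxplot"
print(identical(labels1, labels2))
## [1] TRUE

## paste0 still accepts collapse
oneString = paste0(labels2, collapse = ', ')
print(oneString)
## [1] "fig:histogram, fig:scatter, fig:boxplot"

## wrap each label in a \label command; cat shows the single backslash
latexLabels = paste0('\\label{', labels2, '}')
cat(latexLabels, sep = '\n')
```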


Python and statistics – is there any point?

This semester I gave my graduate student class a project. The brief was relatively simple: implement the iteratively reweighted least squares (IRLS) algorithm to perform a simple (single covariate) logistic regression in Python. Their programmes were supposed to be able to read data in from a text file, perform the simple matrix algebra and math needed to carry out the IRLS computation and return some formatted output – similar to that you would get from R’s summary.glm function. Of course, you do not need matrix algebra to do this, but the idea was for the students to learn a bit of mathematical statistics that they had not seen before.  On the IRLS front, they were allowed to use a simple least squares routine like numpy’s linalg.lstsq and some of numpy’s simple matrix operators, but expressly forbidden from simply loading pandas or statsmodels and using the generalized linear models functions contained therein.

I thought this sounded like a straightforward enough task. The students divided themselves into pairs to work on it, and they had 13 weeks to complete the task.

The kicker was that I did not provide any instruction, either in Python or in the IRLS algorithm. An aim of the project was to simulate the situation where someone asks you to solve a problem, and you have to go and do some research to do it. Their first task was to complete 100 exercises on codeacademy.com as a reasonable introduction to a language none of them had seen before.

Problems – versions

There are two major versions of Python in the wild, 2.7 and 3.4. Codeacademy teaches using version 2.7. One fundamental difference between 2.7 and 3.4 is the syntax of the print function. All of my students are users of R, to varying levels of skill. When they go to install R at home, they know to go to the CRAN website, or a mirror, and download the current, stable release of R. If they followed this policy, as I did myself, then they would have installed Python 3.4 and found that the way they were taught to use print by Codeacademy does not work, without any sort of helpful message along the lines of “That syntax has been deprecated. Python 3 onwards uses the syntax…”. This is not the only issue, with the way Python 3.4 handles execution of loops over numbered ranges being another example of a fundamental difference.

Problems – platform issues

Most students at my institution use Windows, especially at home. There is some Mac penetration, and Linux is virtually non-existent (these are statistics students, not computer science remember). The official Python installers work perfectly well under Windows in my experience. However, then we come to the issue of installing numpy. The official advice from the numpy website seems to be “download a third party version of Python which already has it.” For students who come from a world where a package can be installed by going to a menu, this is less than useful. The common advice from the web is that “there is no official release of numpy 1.8.1 for Python 2.7 or higher for Windows” but that you can download it and install it from the builds very thoughtfully provided by Christoph Gohlke at UC Irvine here. Christoph’s builds work fine, but again, for something that seems, at least from the outside, very mainstream in the Python community, should the user have to go to this level of effort?

Problems – local installations

Like any instructor, I face the issue that a number of my students have no option but to use the computer laboratories provided for them by the university. This means that we encounter the issue of local installation of libraries for users. Most, if not all, R packages from CRAN can be installed in a local library. As far as I can tell, this is not true for a Windows installation of Python. I am happy to be corrected on this point. The aforementioned Python binaries come with proper Windows installers, which want to install into the Python root directory, something students do not have permission to do. If I had realized this problem in December of last year, I could have asked the admins to pre-install it for all users; however, given that I only formulated the problem in February, it was just a tad too late.

Would I do it again?

I might, but there would have to be serious efforts to resolve the problems listed above on my part. It also would not solve problems of students trying to set up Python at home, and I do not feel like hand-holding people through an installation process. My initial plan had been to try Javascript. I may return to this idea.

I would be the first to admit that I am not a Python user, but I am an experienced programmer with over thirty years of experience in at least a dozen different languages, and on multiple platforms. I know many people find Python a very useful language for their scientific computing, and I am not attempting to bad mouth the language – it seems a decent enough language with the constructs and functionality that I would expect to find in any modern language – but I do not think there is much incentive for a statistician to move away from R, or an R/C++ combination when raw compute power is required.

I am glad that my students experienced programming in a non-vectorized language. R does give a distorted perspective on programming with regards to its handling of vectors, and I think it is beneficial for students to learn about flow structures for element-wise computation.

Update

Nat Dudley has suggested that I use an online IDE like nitrous.io.

Second update

Despite the difficulties, nearly all of my students have managed to complete the task, and some have done an exceptionally good job, even adding in the ability to parse R-like formulae.


