Introduction to Using Regular Expressions in R

R and Regular Expressions

Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems. – Jamie Zawinski, courtesy of Jeffrey Friedl’s blog

This is a blog post on using regular expressions in R. I am sure there are plenty of others out there with the same information. However, this is also an exercise for me to see how hard it is to change my knitr .Rnw files into markdown and then into HTML. It turns out that most of the work can be done by running pandoc over the LaTeX file I get from knitting my .Rnw file. The rest I did manually.

What are regular expressions?

Regular expressions provide a powerful and sophisticated way of matching patterns in text. For example with regular expressions I can:

  • Find a word or a series of letters

  • Do wild-card searches

  • Match patterns at the start or the end of a line

  • Make replacements in text based on the match

A simple example

Very often when people read about regular expressions they do not grasp the power, thinking “I could do that without using regular expressions at all.”

Here is something I did the other day. I had a file whose lines consisted of the names of files containing R code. That is, I had twenty lines of text, each ending in .R. I needed to

  1. insert export( at the start of each line, and

  2. replace the .R at the end of each line with )

You can probably think of a way of doing this with a mixture of find and replace, and manual insertion. That is all well and good if you only have 20 lines. What if you had 20,000 lines?

I did this in Notepad++ by matching the regular expression ^(.*)\.[rR]$ and replacing it with export(\1)

Before After
#onewayPlot.R# \export(onewayPlot)
autocor.plot.R \export(autocor.plot)
boxqq.r \export(boxqq)
boxqq.r~ \export(boxqq)
ciReg.R \export(ciReg)
cooks20x.R \export(cooks20x)
crossFactors.R \export(crossFactors)
crossFactors.R~ \export(crossFactors)
crosstabs.R \export(crosstabs)
eovcheck.R \export(eovcheck)
estimateContrasts.R \export(estimateContrasts)
estimateContrasts1.R \export(estimateContrasts1)
estimateContrasts2.R \export(estimateContrasts2)
freq1way.r \export(freq1way)
freq1way.r~ \export(freq1way)
getVersion.R \export(getVersion)
interactionPlots.R \export(interactionPlots)
layout20x.R \export(layout20x)
levene.test.R \export(levene.test)
levene.test.R~ \export(levene.test)
multipleComp.R \export(multipleComp)
normcheck.R \export(normcheck)
onewayPlot.R \export(onewayPlot)
onewayPlot.R~ \export(onewayPlot)
pairs20x.R \export(pairs20x)
pairs20x.R~ \export(pairs20x)
predict20x.R \export(predict20x)
predict20x.R~ \export(predict20x)
residPlot.R \export(residPlot)
rowdistr.r \export(rowdistr)
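The same search and replace can be sketched in R with sub. The two file names below are a made-up sample rather than the full list, and the backslashes are doubled because the pattern is written as an R string.

```r
## A small, made-up sample of file names
files = c('ciReg.R', 'boxqq.r')

## Capture everything before the final .R or .r
## and wrap it in export( )
exports = sub('^(.*)\\.[rR]$', 'export(\\1)', files)
exports
## [1] "export(ciReg)" "export(boxqq)"
```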
  • Regular expressions are a powerful way of describing patterns that we might search for or replace in a text document

  • In some sense they are an extension of the wildcard search and replace operations you might carry out in Microsoft Word or a text editor.

  • To the untrained eye they look like gobbledygook!

  • Most programming languages have some form of regular expression library

  • Some text editors, such as Emacs, Notepad++, RStudio, also have regular expressions

  • This is very useful when you don't need to write a programme

  • The file utility grep uses regular expressions to find occurrences of a pattern in files

  • Mastering regular expressions could take a lifetime. However, you can achieve a lot with a good introduction

  • A couple of very good references are:

    • Jeffrey Friedl's Mastering Regular Expressions, (2006), 3rd Edition, O'Reilly Media, Inc.
    • Paul Murrell's Introduction to Data Technologies, (2009), Chapman & Hall/CRC Computer Science & Data Analysis.
    • This entire book is on the web, but if you use it a lot you really should support Paul and buy it
    • Chapter 11 is the most relevant for this post

Tools for regular expressions in R

We need a set of tools to use regular expressions in something we understand — i.e. R. The functions I will make the most use of are

  • grepl and grep

  • gsub

  • gregexpr

Functions for matching

  • grep and grepl are the two simplest functions for pattern matching in R

  • By pattern matching I mean being able to either

    (i) Return the elements of, or indices of, a vector that match a set of characters (or a pattern)

    (ii) Return TRUE or FALSE for each element of a vector on the basis of whether it matches a set of characters (or a pattern)

grep does (i) and grepl does (ii).

Very simple regular expressions

At their simplest, a regular expression can just be a string you want to find. For example the commands below look for the string James in the vector of names names. This may seem like a silly example, but it demonstrates a very simple regular expression called a string literal, informally meaning match this string — literally!

names = c('James Curran', 'Robert Smith', 
          'James Last')
grep('James', names)
## [1] 1 3
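As a small sketch of the difference between the two functions, the same search can be run three ways. The value = TRUE argument of grep returns the matching elements rather than their indices.

```r
names = c('James Curran', 'Robert Smith', 'James Last')

## grep with value = TRUE returns the matching elements
matched = grep('James', names, value = TRUE)
length(matched)
## [1] 2

## grepl returns TRUE or FALSE for every element
grepl('James', names)
## [1]  TRUE FALSE  TRUE

## so sum counts the matching elements
sum(grepl('James', names))
## [1] 2
```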

Wild cards and other metacharacters

  • At the next level up, we might like to search for a string with a single wild card character

  • For example, my surname is sometimes (mis)spelled with an e or an i

  • The regular expression symbol/character for any character is the full stop or period

  • So my regular expression would be Curr.n, e.g.

surnames = c('Curran', 'Curren', 'Currin', 
             'Curin')
grepl('Curr.n', surnames)
## [1]  TRUE  TRUE  TRUE FALSE
  • The character . is the simplest example of a regular expression metacharacter

  • The other metacharacters are [ ], [^ ], \, ?, *, +, {, }, ^, $, \<, \>, | and ()

  • If a character is a regular expression metacharacter then it has a special meaning to the regular expression interpreter

  • There will be times, however, when you want to search for a full stop (or any of the other metacharacters). To do this you can escape the metacharacter by preceding it with a double backslash \\.

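  • Note that we need two backslashes in R because the backslash is also the escape character inside an R string – tools that take the regular expression directly, such as a text editor's search box, use a single backslash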

  • Note that \\ followed by a digit from 0 to 9 has special meaning too, e.g. \\1


  • Whilst the Curr.n example works, it also matches any other character in place of the vowel. A more sensible way to do this is to use the alternation, or "or", operator |.

  • E.g.

grepl('Curr(a|e|i)n', c('Curran', 'Curren', 
                        'Currin', 'Curin'))
  • This regular expression contains two metacharacters ( and |

  • The round bracket ( has another meaning later on, but here it delimits the alternation.

  • We read (a|e|i) as a or e or i.
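To see how the wildcard, the alternation and an escaped full stop differ, here is a minimal sketch. The strings Currxn and Curr.n are invented purely to show the contrast.

```r
x = c('Curran', 'Currxn', 'Curr.n')

## The wildcard . matches any character at all
grepl('Curr.n', x)
## [1] TRUE TRUE TRUE

## The alternation only allows a, e or i
grepl('Curr(a|e|i)n', x)
## [1]  TRUE FALSE FALSE

## Escaping the full stop matches it literally
grepl('Curr\\.n', x)
## [1] FALSE FALSE  TRUE
```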

A bigger example – counting words

In this example we will use the text of Moby Dick. The R script below does the following things

  1. Opens a file connection to the text

  2. Reads all the lines into a vector of lines

  3. Counts the number of lines where the word whale is used

  4. Counts the number of lines where the word Ahab is used

## open a read connection to the Moby Dick 
## text from Project Gutenberg
mobyURL = ''
f1 = url(mobyURL, 'r')

## read the text into memory and close
## the connection
Lines = readLines(f1)
close(f1)

Note: The code above is what I would normally do, but if you do it too often you get this nice message from Project Gutenberg

Don't use automated software to download lots of books. We have a limit on how fast you can go while using this site. If you surpass this limit you get blocked for 24h.

which is fair enough, so I am actually using a version I stored locally.

## Throw out all the lines before 
## 'Call me Ishmael'
i = grep('Call me Ishmael', Lines)
Lines = Lines[-(1:(i - 1))]

numWhale = sum(grepl('(W|w)hale', Lines))
numAhab = sum(grepl('(A|a)hab', Lines))

cat(paste('Whale:', numWhale, ' Ahab:', 
           numAhab, '\n'))
## Whale: 1487  Ahab: 491

Note: I am being explicit here about the capitals. In fact, as my friend Mikkel Meyer Andersen points out, I do not have to be. grep, grepl, regexpr and gregexpr all have an argument, ignore.case, which can be set to TRUE. However, I did want to highlight that regular expressions are generally case sensitive.
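A quick sketch of the difference, using a few made-up strings rather than the book text:

```r
x = c('The White Whale', 'a whale', 'WHALE!', 'Ahab')

## Case sensitive by default
grepl('whale', x)
## [1] FALSE  TRUE FALSE FALSE

## Ignoring case
grepl('whale', x, ignore.case = TRUE)
## [1]  TRUE  TRUE  TRUE FALSE
```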

This programme will count the number of lines containing the words whale or Ahab, but not the number of occurrences. To count the number of occurrences, we need to work slightly harder. The gregexpr function globally matches a regular expression, that is, it finds every match in each string rather than just the first. If there is no match, then gregexpr returns -1. However, if there are one or more matches, then gregexpr returns the position and the length of each match. For example, we will look for the word the in the following three sentences:

s1 = 'James is a silly boy'
s2 = 'The cat is hungry'
s3 = 'I do not know about the cat but the dog is stupid'

## Set the regular expression
## Note: This would match 'There' and 'They' as
## well as others for example 
pattern = '[Tt]he'
## We will store the matches so that we 
## can examine them in turn
m1 = gregexpr(pattern, s1)
m2 = gregexpr(pattern, s2)
m3 = gregexpr(pattern, s3)

There are no matches in the first sentence so we expect gregexpr to return -1

## [[1]]
## [1] -1
## attr(,"match.length")
## [1] -1
## attr(,"useBytes")
## [1] TRUE

which it does.

In the second sentence, there is a single match at the start of the sentence

## [[1]]
## [1] 1
## attr(,"match.length")
## [1] 3
## attr(,"useBytes")
## [1] TRUE

This result tells us that there is a single match at character position 1, and that the match is 3 characters long.

In the third example there are two matches, at positions 21 and 33, and they are both 3 characters long

## [[1]]
## [1] 21 33
## attr(,"match.length")
## [1] 3 3
## attr(,"useBytes")
## [1] TRUE

So in order to count the number of occurrences of a word we need to use gregexpr, keep all of those instances where the result is not -1, and then count the number of matches using the length function, e.g.

## count the number of occurrences of whale
pattern = '(W|w)hale'
matches = gregexpr(pattern, Lines)

## if gregexpr returns -1, then the number of 
## matches is 0; if not, then the number of 
## matches is given by length
counts = sapply(matches, function(x) {
                  if(x[1] == -1) 0 else length(x)
                })

cat(paste('Whale:', sum(counts), '\n'))
## Whale: 1564
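The difference between counting lines and counting occurrences is easier to see on a tiny example. The three lines below are made up, not taken from the book. This sketch uses the base R functions regmatches and lengths as an alternative to checking for -1 by hand.

```r
## Three made-up lines
toy = c('The whale! The Whale!', 'No match here', 'A whale')

## Number of lines containing a match
sum(grepl('[Ww]hale', toy))
## [1] 2

## regmatches extracts the matched text for each line,
## and lengths counts the matches per line
found = regmatches(toy, gregexpr('[Ww]hale', toy))
lengths(found)
## [1] 2 0 1
sum(lengths(found))
## [1] 3
```

A line with no match yields an empty character vector, so it contributes a count of zero without any special handling.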

Character classes/sets or the [ ] operator

  • Regular expression character sets provide a simple mechanism for matching any one of a set of characters

  • For example [Tt]he will match The and the

  • The real strength of character sets is in their special range and set operators

  • For example the regular expressions:

    • [0-9] will match any digit from 0 to 9
    • [a-z] will match any lower case letter from a to z
    • [A-Z0-9] will match any upper case letter from A to Z or any digit from 0 to 9 and so on
  • You may initially think that character sets are like alternation, but they are not. A character set is an unordered list of characters, exactly one of which is matched

    • So (se|ma)t will (fully) match set and mat
    • But [sema]t will not
    • Instead, [sema]t will match st, et, mt and at
  • The POSIX system defines a set of special character classes which are supported in R and can be very useful. These are

[:alpha:] Alphabetic (only letters)
[:lower:] Lowercase letters
[:upper:] Uppercase letters
[:digit:] Digits
[:alnum:] Alphanumeric (letters and digits)
[:space:] White space
[:punct:] Punctuation
  • The regular expression [[:lower:]] will help you capture accented lower case letters like à, é, or ñ (depending on your locale), whereas [a-z] would miss all of them

  • You may think this is uncommon, but optical character recognition (OCR) text often has characters like this present
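Here is a short sketch of character sets in action on some made-up strings. Note that the POSIX classes are written inside the square brackets, which is why they appear with doubled brackets.

```r
x = c('abc', 'ABC', 'a1', '42', 'hi!')

## Any digit, written as a range and as a POSIX class
grepl('[0-9]', x)
## [1] FALSE FALSE  TRUE  TRUE FALSE
grepl('[[:digit:]]', x)
## [1] FALSE FALSE  TRUE  TRUE FALSE

## Any punctuation character
grepl('[[:punct:]]', x)
## [1] FALSE FALSE FALSE FALSE  TRUE

## A set matches ONE of its characters, whereas
## alternation matches one of several strings
y = c('st', 'et', 'set', 'mat')
grepl('^[sema]t$', y)
## [1]  TRUE  TRUE FALSE FALSE
grepl('^(se|ma)t$', y)
## [1] FALSE FALSE  TRUE  TRUE
```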

Negated character sets – [^...]

  • Negated character sets provide a way for you to match anything but the characters in this set

  • A very common example of this is when you want to match everything between a set of (round) brackets

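  • E.g. the regular expression \\([^)]\\) would match any single character, other than a closing bracket, between a pair of round brackets. The outer brackets are escaped so that they are matched literally rather than treated as metacharacters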
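As a sketch of the bracket example, the code below pulls out every bracketed piece of a made-up string. It assumes the literal round brackets are escaped with \\ so that they are not read as metacharacters, and it uses * (covered under quantifiers) to allow any number of characters between them.

```r
s = 'f(x) and g(y, z)'

## An escaped (, then any run of characters that
## are not ), then an escaped )
m = gregexpr('\\([^)]*\\)', s)
inBrackets = regmatches(s, m)[[1]]
inBrackets
## [1] "(x)"    "(y, z)"
```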

Matching more than one character — quantifiers

Another common thing we might do is match zero, one, or more occurrences of a pattern. We have four ways to do this

  1. ? means match zero or one occurrences of the previous pattern
  2. * means match zero or more occurrences of the previous pattern

  3. + means match one or more occurrences of the previous pattern

  4. {a,b} means match from a to b occurrences of the previous pattern

    • b may be omitted so that {a,} means match a or more occurrences of the previous pattern
    • b and the comma may be omitted so that {a} means match exactly a occurrences of the previous pattern
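The quantifiers can be compared side by side on a few made-up spellings. In this sketch the patterns are wrapped in ^ and $ so that the whole string has to match.

```r
x = c('color', 'colour', 'colouur')

grepl('^colou?r$', x)    # zero or one u
## [1]  TRUE  TRUE FALSE
grepl('^colou*r$', x)    # zero or more u
## [1] TRUE TRUE TRUE
grepl('^colou+r$', x)    # one or more u
## [1] FALSE  TRUE  TRUE
grepl('^colou{2,}r$', x) # two or more u
## [1] FALSE FALSE  TRUE
```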

Continuing with our misspelling example, I would like a way of picking up all of the possibilities of misspelling my surname. Variants I've seen are Curren, Currin, Curin, Curan, Curen, Curn and even Karen!

If I wanted to construct a regular expression to match all of these possibilities I need to match (in this order):

  1. a C or a K

  2. a u or an a

  3. one or two occurrences of r

  4. zero or more occurrences of e or i

  5. and finally an n

This is how I do this with regular expressions

pattern = '[CK](u|a)r{1,2}(e|i)*n'
badNames = c('Curren', 'Currin', 'Curin', 
             'Curan', 'Curen', 'Curn', 'Karen')
grepl(pattern, badNames)

Notice how the regular expression didn't match Curan. To fix the code so that it does match we need to change the set of possible letters before the n from (e|i) to allow a as a possibility, i.e. (a|e|i) or alternatively [aei]

pattern1 = '[CK](u|a)r{1,2}(a|e|i)*n'
pattern2 = '[CK](u|a)r{1,2}[aei]*n'
badNames = c('Curren', 'Currin', 'Curin', 'Curan', 
             'Curen', 'Curn', 'Karen')
grepl(pattern1, badNames)
grepl(pattern2, badNames)

Anchors — matching a position

  • The metacharacters ^, $, \< and \> match positions

  • ^ and $ match the start and the end of a line respectively

  • \< and \> match the start and end of a word respectively

  • I find I use ^ and $ more often

  • For example

    • ^James will match "James hates the cat" but not "The cat does not like James"
    • cat$ will match "James hates the cat" but not "The cat does not like James"
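The anchors can be tried out directly in R. The word concatenate is added here only to show what the word boundaries \< and \> do.

```r
x = c('James hates the cat', 
      'The cat does not like James', 
      'concatenate')

## James at the start of the string
grepl('^James', x)
## [1]  TRUE FALSE FALSE

## cat at the end of the string
grepl('cat$', x)
## [1]  TRUE FALSE FALSE

## cat as a whole word
grepl('\\<cat\\>', x)
## [1]  TRUE  TRUE FALSE
```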

Summary — R functions for matching regular expressions

The functions grep and regexpr are the most useful. Loosely,

  • grep tells you which elements of your vector match the regular expression

  • whereas regexpr tells you which elements match, where they match, and how long each match is

  • gregexpr matches every occurrence of the pattern, whereas regexpr only matches the first occurrence

The example below shows the difference

poss = c('Curren', 'Currin', 'Curin', 'Curan', 
         'Curen', 'Curn', 'Karen')
pattern = '[CK](u|a)r{1,2}(i|e)*n'
grep(pattern, poss)
## [1] 1 2 3 5 6 7
gregexpr(pattern, poss)
## [[1]]
## [1] 1
## attr(,"match.length")
## [1] 6
## attr(,"useBytes")
## [1] TRUE
## [[2]]
## [1] 1
## attr(,"match.length")
## [1] 6
## attr(,"useBytes")
## [1] TRUE
## [[3]]
## [1] 1
## attr(,"match.length")
## [1] 5
## attr(,"useBytes")
## [1] TRUE
## [[4]]
## [1] -1
## attr(,"match.length")
## [1] -1
## attr(,"useBytes")
## [1] TRUE
## [[5]]
## [1] 1
## attr(,"match.length")
## [1] 5
## attr(,"useBytes")
## [1] TRUE
## [[6]]
## [1] 1
## attr(,"match.length")
## [1] 4
## attr(,"useBytes")
## [1] TRUE
## [[7]]
## [1] 1
## attr(,"match.length")
## [1] 5
## attr(,"useBytes")
## [1] TRUE
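To see the difference between regexpr and gregexpr, it helps to search a single string that contains more than one match. This sketch uses regmatches to turn the positions and lengths into the matched text. The string itself is made up.

```r
s = 'Curren and Currin'
pattern = '[CK](u|a)r{1,2}(i|e)*n'

## regexpr reports only the first match
firstMatch = regmatches(s, regexpr(pattern, s))
firstMatch
## [1] "Curren"

## gregexpr reports every match
allMatches = regmatches(s, gregexpr(pattern, s))[[1]]
allMatches
## [1] "Curren" "Currin"
```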

String substitution

  • Finding or matching is often only one half of the equation

  • Quite often we want to find and replace

  • This process is called string substitution

  • R has two functions sub and gsub

  • The difference between them is that sub only replaces the first occurrence of the pattern whereas gsub replaces every occurrence

  • Normal usage is quite straightforward. E.g.

poss = 'Curren, Currin, Curin, Curan, Curen, Curn and Karen'
pattern = '[CK](u|a)r{1,2}(i|e)*n'

sub(pattern, 'Curran', poss)
## [1] "Curran, Currin, Curin, Curan, Curen, Curn and Karen"
gsub(pattern, 'Curran', poss)
## [1] "Curran, Curran, Curran, Curan, Curran, Curran and Curran"

Back substitution

  • Abnormal usage is probably quite unlike anything you have seen before

  • One of the most powerful features of regular expressions is the ability to re-use something that you matched in a regular expression

  • This idea is called back substitution

  • Imagine that I have a text document with numbered items in it. E.g.

    1. James
    2. David
    3. Kai
    4. Corinne
    5. Vinny
    6. Sonika
  • How would I go about constructing a regular expression that would take each of the lines in my document and turn them into a nice LaTeX itemized list where the numbers are the list item markers?

  • The trick is to capture the numbers at the start of each line and use them in the substitution

  • To do this we use the round brackets to capture the match of interest

  • And we use the \\1 and \\2 backreference operators to retrieve the information we matched. E.g.

Lines = c('1. James', '2. David', '3. Kai', 
          '4. Corinne', '5. Vinny', 
          '6. Sonika')
pattern = '(^[0-9]\\.)[[:space:]]+([[:upper:]][[:lower:]]+$)'
gsub(pattern, '\\\\item[\\1]{\\2}', Lines)
## [1] "\\item[1.]{James}"   "\\item[2.]{David}"  "\\item[3.]{Kai}"    
## [4] "\\item[4.]{Corinne}" "\\item[5.]{Vinny}"   "\\item[6.]{Sonika}"

Note the double backslash will become a single backslash when written to file.

I actually used a regular expression with back substitution to format output for LaTeX in the file name example at the start of this post. My regular expression was the following:


and this was my back substitution expression

 \\verb!\1!  &  \\verb!\\export(\2)! \\\\

There is only a single \ in the back references because I just did this in the RStudio editor, not in R. Note how there are two back references, corresponding to two capture groups, one of which is nested inside the other. In nesting situations like this, the capture groups are labelled in order from the outermost inwards.
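Nested capture groups can be sketched in R as follows. The pattern here is a hypothetical one written for illustration, not the regular expression I originally used. Group 1 is the whole file name, and group 2, nested inside it, is the name without the extension.

```r
## Hypothetical pattern for illustration:
## \\1 is the outer group, \\2 is the inner group
pattern = '^((.*)\\.[rR])$'
nested = sub(pattern, '\\1 -> export(\\2)', 
             c('ciReg.R', 'boxqq.r'))
nested
## [1] "ciReg.R -> export(ciReg)" "boxqq.r -> export(boxqq)"
```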

String manipulation

We need two more things to finish this section

  • The ability to extract smaller strings from larger strings

  • The ability to construct strings from smaller strings

  • The first is called extracting substrings

  • The second is called string concatenation

  • We use the functions substr and paste for these tasks respectively


  • substr is very simple

  • Its arguments are

    • the string, x,
    • a starting position, start,
    • and a stopping position, stop.
  • It extracts all the characters in x from start to stop

  • If the closely related function substring is used then the stopping position is optional

  • If it is not provided then substring extracts all the characters from start to the end of x


substr('abcdef', 2, 4)
## [1] "bcd"
substring('abcdef', 2)
## [1] "bcdef"
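substr also combines nicely with the matching functions from earlier. As a minimal sketch, the position and length returned by regexpr can be used to cut the matched text out of a made-up string. The variable names first and last are just for illustration.

```r
s = 'abc123def'

## Find the first run of digits
m = regexpr('[0-9]+', s)

## Turn the position and length into start and stop
first = as.integer(m)
last = first + attr(m, 'match.length') - 1
digits = substr(s, first, last)
digits
## [1] "123"
```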


paste is a little more complicated:

  • paste takes 1 or more arguments, and two optional arguments sep and collapse

  • sep is short for separator

  • For the following discussion I am going to assume that I call paste with two arguments x and y

  • If x and y are both scalars then paste(x, y) will join them together in a single string separated by a space, e.g.

paste(1, 2)
## [1] "1 2"
paste('a', 'b')
## [1] "a b"
paste('a =', 1)
## [1] "a = 1"
  • If x and y are both scalars and you define sep to be "," say then paste(x, y, sep = ",") will join them together in a single string separated by a comma, e.g.
paste(1, 3, sep = ',')
## [1] "1,3"
paste(TRUE, 3, sep = ',')
## [1] "TRUE,3"
paste('a', 3, sep = ',')
## [1] "a,3"
  • If x and y are both vectors and you define sep to be "," say then paste(x, y, sep = ",") will join each element of x with the corresponding element of y into a set of strings where the elements are separated by a comma, e.g.
x = 1:3
y = LETTERS[1:3]
paste(x, y, sep = ',')
## [1] "1,A" "2,B" "3,C"
  • If collapse is not NULL, e.g. "-", then each of the strings will be pasted together into a single string, separated by the value of collapse.
x = 1:3
y = LETTERS[1:3]
paste(x, y, sep = ',', collapse = '-')
## [1] "1,A-2,B-3,C"
  • And of course you can take it as far as you like 🙂
x = 1:3
y = LETTERS[1:3]
paste('(', paste(x, y, sep = ','), ')', 
       sep = '', collapse = '-')
## [1] "(1,A)-(2,B)-(3,C)"


Mikkel also points out that paste0 is a shortcut to avoid specifying sep = "" every time
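A short sketch of paste0, showing that it gives the same result as paste with an empty separator and that collapse still works:

```r
x = 1:3

paste0('x', x)
## [1] "x1" "x2" "x3"

## The same as paste with an empty separator
identical(paste0('x', x), paste('x', x, sep = ''))
## [1] TRUE

paste0('(', x, ')', collapse = '-')
## [1] "(1)-(2)-(3)"
```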
