# Conference Programmes in R

It has been a while, folks!

Lately, over the last month, I have been writing code that helps with the production of the timetable and abstract booklet for a conference.

The latest version of that code can be found here, the associated bookdown project here, and the final result here. Note that the programme may change a good number of times before next Sunday (December 10, 2017) when the conference starts.

The code above is an improved version of the package I wrote for the Biometrics By The Border conference which worked solely with Google Sheets, by using it as a poor man’s database. I wouldn’t advise this because it is an extreme bottleneck when you constantly need to refresh the database information.

The second project was also driven by necessity, but the need in this case had more to do with the sheer complexity of organising several hundred talks over a four-day programme. Now that everything is in place, most changes requested by speakers or authors can be accommodated in a few minutes by simply moving the talk on the Google sheet that controls the programme, calling several R functions, recompiling the book and pushing it up to GitHub.

#### So how does it all work?

The current package does depend on some basic information being stored on Google Sheets. The sheets for my current conference (the Joint Conference of the New Zealand Statistical Association and the International Association of Statistical Computing (Asian Regional Section)) can be viewed here on Google Sheets. Not all the worksheets are important. The key sheets are: Monday, Tuesday, Wednesday, Thursday, All Submissions, Allocations, All_Authors, Monday_Chairs, Tuesday_Chairs, Wednesday_Chairs, and Thursday_Chairs.

The first four and the last four sheets (the day sheets and the chairs sheets for Monday to Thursday) are created by hand. The colours are for our convenience and do not matter. The layout does, and some of this is hard-coded into the project. For example, although there are seven streams a day, only six of those belong directly to us, as the seventh belongs to a satellite conference. The code relies on every scheduled event (including meal breaks) having a time adjacent to it in the Time column. It also collects information about the rooms, as the order of these rooms does change on some days.

The All_Authors sheet comes from EasyChair. EasyChair provides the facility to download author email information as an Excel spreadsheet, which in turn can be uploaded to Google Sheets easily. This sheet is simply the sheet labelled All in the EasyChair file. The All Submissions sheet is a copy-and-paste from the EasyChair submissions page. I probably could webscrape this with a little effort.

The Allocations sheet mostly mirrors the All Submissions sheet, but has user-added categorizations that allow us to group the talks into sensible sessions. It also uses formulae to format the titles of the talks so that each title word is capitalized (this is an imperfect process), and so that the name of the submitter (who we have assumed is the speaker) is appended to the title, ready to be pasted into the programme sheets.
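The title formatting done by the spreadsheet formulae could equally be done in R. The following is a minimal sketch of that idea, not the formula used in the Allocations sheet: the function names, the made-up talk title, and the "Title (Speaker)" layout are all assumptions for illustration.

```r
# Capitalise the first letter of each word in a talk title.
# Like the spreadsheet formula it stands in for, this is imperfect:
# acronyms and small words such as "for" are treated like any other word.
capitaliseTitle <- function(title) {
  words <- strsplit(title, " ", fixed = TRUE)[[1]]
  words <- paste0(toupper(substring(words, 1, 1)), substring(words, 2))
  paste(words, collapse = " ")
}

# Append the submitter's name so the result can be pasted into a programme cell.
# The "Title (Speaker)" layout is an assumption, not the format used in the sheet.
formatTalk <- function(title, speaker) {
  paste0(capitaliseTitle(title), " (", speaker, ")")
}

# A made-up title and speaker, purely for illustration.
formatTalk("a made-up talk title for testing", "A. Speaker")
```

Words that should stay lower case, or acronyms that arrive in lower case, would still need fixing by hand.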

#### How do we get data out of Google Sheets?

The code works in a number of distinct phases. The first is capturing all the relevant information from the web and placing it in a SQLite database. I used Jenny Bryan’s googlesheets package to extract the data from my spreadsheets. The process is quite straightforward, although I found that it worked a little more smoothly on my Mac than on my Windows box. This may have more to do with the fact that the R install on the Windows box was brand new, and so the httr package was not installed. When httr is installed, authentication (with Google) happens entirely within the browser; without it, you are required to copy and paste an authentication code back into R. When you are doing this many times, the former is clearly more desirable.

Interaction with Google Sheets consists of first grabbing the workbook, and then asking for information from the individual sheets. (Google Sheets does not use the term workbook; it comes from Excel, but it captures the idea that a project is a collection of one or more worksheets, and that the container holding them all is the workbook.) The request to see a workbook is what will prompt the authentication, e.g.

```r
library(googlesheets)
mySheets = gs_ls()
mySheets
```


The call to gs_ls will prompt authentication. The authentication is cached for a period of time, so it should only require interaction with a web browser the very first time you do it. Subsequent requests will result in a message along the lines of Auto-refreshing stale OAuth token. The call to gs_ls will return a list of available workbooks. A particular workbook may be retrieved by calling gs_title. For example, this function allows me to get to the conference workbook:

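```r
getProgrammeSheet = function(title = "NZSA-IASC-ARS 2017"){
  # Listing the available workbooks triggers authentication if needed
  mySheets = gs_ls()
  # Register the workbook whose title matches
  ss = gs_title(title)
  return(ss)
}
```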


I can use the object returned by this function in turn to access the worksheets I want to work with, using gs_read. The functions in googlesheets are written to be compliant with the tidyverse, and in particular the pipe operator %>%. I will be the first to admit this is not my preferred way of working, and I could have easily worked around it. However, in the interests of furthering my skills, I have tried to make the project tidyverse compliant, and nearly all the functions will work using the pipe operator, although they take a worksheet or a database as their first argument rather than a tibble.
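As a rough sketch of what reading one worksheet with gs_read might look like: the wrapper name readDaySheet and the default range below are hypothetical placeholders, not the values used in the project. The function body is only evaluated when it is called, so nothing here contacts Google until you run it with a real workbook.

```r
# Read a fixed block of cells from one day's worksheet.
# `ss` is the registered workbook returned by getProgrammeSheet().
# The default range is a placeholder, not the range used in the real project.
readDaySheet <- function(ss, day = "Monday", range = "A1:G30") {
  googlesheets::gs_read(ss, ws = day, range = range,
                        col_names = LETTERS[1:7])
}

# Because the workbook is the first argument, the function also works with
# the pipe (this needs googlesheets and a package providing %>%, e.g. magrittr):
# monday <- getProgrammeSheet() %>% readDaySheet(day = "Monday")
```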

Once I have a workbook object (really just an authenticated connection), I can then read each of the worksheets. This is done by a function called updateDB. The only thing worth commenting on in this function is that the worksheets for each of the days have headers which do not resolve well into column names. They also have a fixed set of rows and columns that we wish to capture. To that end, the range is specified for each day, and the column headers are simply set to the first seven capital letters of the alphabet (A–G). These sheets are stored as tibbles, which are then written to an internal SQLite database using the dbWriteTable function.

There are eight functions (createRoomsTbl, createAffilTbl, createTitleTbl, createAbstractTbl, createAuthorTbl, createAuthorSubTbl, createProgTbl, createChairTbl) which operate on the database/spreadsheet tables to produce the database tables we need for generating the conference timetable and the abstract booklet. These functions are rarely called by themselves; we tend to call a single omnibus function, rebuildBD. This function allows the user to refresh the information from the web if need be, and to recreate all of the tables in the database. The bottleneck in this function is retrieving the information from the internet, which may take around 20-30 seconds depending on the connection speed.
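The step of storing a day sheet in SQLite can be sketched as follows. This is an illustration under stated assumptions rather than the body of updateDB: the data frame is a made-up stand-in for a downloaded worksheet, the database is in memory, and the table name "Monday" is a guess. The DBI and RSQLite calls only run if both packages are installed.

```r
# Made-up stand-in for one day's worksheet, with the columns named A to G.
monday <- data.frame(matrix("", nrow = 3, ncol = 7), stringsAsFactors = FALSE)
names(monday) <- LETTERS[1:7]
monday$A <- c("09:00", "10:30", "11:00")           # Time column
monday$B <- c("Opening", "Morning Tea", "Talk 1")  # first stream

if (requireNamespace("DBI", quietly = TRUE) &&
    requireNamespace("RSQLite", quietly = TRUE)) {
  # An in-memory database keeps the sketch self-contained.
  con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
  DBI::dbWriteTable(con, "Monday", monday, overwrite = TRUE)
  stored <- DBI::dbReadTable(con, "Monday")
  DBI::dbDisconnect(con)
} else {
  # Fallback so the sketch still runs without the database packages.
  stored <- monday
}

stored[, c("A", "B")]
```

Using overwrite = TRUE means the table is simply replaced each time the sheet is refreshed.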

The database tables provide information for four functions: writeTT, writeProg, writeIndices, and writeSessionChairs. Each of these functions produces one or more R Markdown files in the directory used by the bookdown project.

#### Bookdown, ePub and gitBook

The final product is generated using bookdown. Bookdown sounds simple when explained, but implementing it goes much more smoothly with help from an expert. I found this blog post by Sean Kross very helpful, along with his minimal bookdown project from GitHub. It would be misleading of me to suggest that the programme book was really produced using R Markdown. Small elements are in markdown, but the vast majority of the formatting is achieved by writing code which writes HTML. This is especially true of the conference timetable, and the hyperlinking of the abstracts to the timetable and other indices. The four functions listed above write out the markdown pages: the conference timetable, the session chairs table, four pages (one for each day of the conference), and two indices, one linking talks to authors, and one linking talk titles to submission numbers (which for the most part were issued by EasyChair). There is not a lot more to discuss here. Sean’s project sets things up in such a way that changes to the markdown or the YAML files will automatically trigger a rebuild.
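To give a flavour of "code which writes HTML", here is a minimal sketch. It is not writeTT: the function names, the made-up talks, and the "#talk-<id>" anchor convention are all invented for the example, and titles are not HTML-escaped.

```r
# Build one HTML table row for a time slot. Each talk links to an anchor
# of the form "#talk-<id>", an invented convention for this sketch.
timetableRow <- function(time, titles, ids) {
  cells <- sprintf('<td><a href="#talk-%s">%s</a></td>', ids, titles)
  paste0("<tr><td>", time, "</td>", paste(cells, collapse = ""), "</tr>")
}

# Write a minimal R Markdown page containing the table as raw HTML.
writeTimetablePage <- function(rows, path) {
  page <- c("# Timetable", "", "<table>", rows, "</table>")
  writeLines(page, path)
  invisible(path)
}

# Two made-up time slots with two streams each.
rows <- c(
  timetableRow("09:00", c("Talk A", "Talk B"), c(101, 102)),
  timetableRow("09:20", c("Talk C", "Talk D"), c(103, 104))
)
outFile <- writeTimetablePage(rows, tempfile(fileext = ".Rmd"))
readLines(outFile)
```

In a real project the rows would come from the programme table in the database, and the output path would be inside the bookdown directory.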

#### Things I struggled with

Our conference had six parallel streams. Whilst it is easy enough to make tables that will hold all this information, it is very difficult to decide how best to display this in a way that would suit everyone. The original HTML tables were squashed almost to the point where there was a character per line on a mobile phone screen. We overcame this slightly by fixing the widths of the div element that holds the tables and adding horizontal scrolling. Many people found this feature confusing, and it did not necessarily translate into ePub. We then added some JavaScript functionality by way of the tablesaw library. This allowed us to keep the time column persistent no matter which stream people were looking at, and it allowed better scrolling of streams that were offscreen. However, this was still a step too far for the technologically challenged. In the end we resorted to printing out the timetable. I also used the excellent Calibre software to take the ePub as input and output it in other formats, most usefully Microsoft Word’s .docx format. I know some of you are shuddering at this thought, but it did allow me to rotate the programme timetable and then create a PDF. This made the old fogeys immensely happy, and me immensely irritated, as I thought the gitBook version was quite useful.
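The fixed-width, horizontally scrolling container is simple to generate from R. This is only a sketch of the idea: the function name and the 600px default are arbitrary, and the styling actually used in the programme may well differ.

```r
# Wrap an HTML table in a fixed-width div that scrolls horizontally.
# The 600px default is an arbitrary illustration.
scrollableTable <- function(tableHtml, width = "600px") {
  paste0('<div style="width: ', width, '; overflow-x: auto;">\n',
         tableHtml, "\n</div>")
}

cat(scrollableTable("<table><tr><td>09:00</td><td>Talk A</td></tr></table>"))
```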

#### Not forgetting the abstracts

Omitted from my workflow so far is mention of the abstracts. We had authors upload LaTeX (.tex) files to EasyChair, or text files. If you don’t do this, and they use EasyChair’s abstract box, then you have to find a way to scrape the data. The downside of scraping (and it even occurs in the user-submitted files) is that some Unicode characters seem to creep in. Needless to say, even after I used an R function to convert all the files to markdown, we still had to do a bunch of manual cleaning.
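Some of the cleaning can be automated. The sketch below is not the conversion function mentioned above; it is a hypothetical helper that swaps a few common non-ASCII characters for plain equivalents and flags whatever is left for a manual look. The list of replacements is an assumption and would need extending for real abstracts.

```r
# Replace a few common non-ASCII characters with plain equivalents.
cleanText <- function(x) {
  from <- c("\u2018", "\u2019",  # curly single quotes
            "\u201c", "\u201d",  # curly double quotes
            "\u2013", "\u2014",  # en and em dashes
            "\u00a0")            # non-breaking space
  to   <- c("'", "'", "\"", "\"", "-", "-", " ")
  for (i in seq_along(from)) {
    x <- gsub(from[i], to[i], x, fixed = TRUE)
  }
  x
}

# Flag strings that still contain non-ASCII characters for manual checking.
hasNonAscii <- function(x) {
  vapply(x, function(s) any(utf8ToInt(s) > 127L), logical(1),
         USE.NAMES = FALSE)
}

abstractLines <- c("\u201cSmart\u201d quotes \u2013 and dashes",
                   "Caf\u00e9 data")
cleaned <- cleanText(abstractLines)
cleaned[hasNonAscii(cleaned)]  # lines that still need a human
```

Accented letters in author names are legitimate, so they are only flagged here, not replaced.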

#### Anyway

I hope someone finds this work useful. I have no intention of running a conference again for at least four years, but I would appreciate it if anyone wants to build on my work.

I am seriously considering the introduction of R Markdown for assignments in our second-year statistics course. The folks at RStudio have made some great improvements in the latest version of R Markdown (R Markdown V2), which allows you to add a Markdown document template to your R package. The template in turn lets you provide a document skeleton for the user with as much information as you like, link CSS files (if you are doing HTML), and specify the output document format as well. The latter is an especially important addition to RStudio.

The latest version of RStudio incorporates Pandoc, which is a great format translation utility (and probably more) written by John MacFarlane. It is an important addition to RStudio because it makes it easy to author documents in Microsoft Word, as well as HTML, LaTeX, and PDF. I am sure that emphasizing the importance of having the option to export to Word will cause some eye-rolling and groans, but I would remind you that we are teaching approximately 800 undergrads a year in this class, most of whom will never take another statistics class again, and will join a workforce where Microsoft Word is the dominant platform. I like LaTeX too (I do not think I will ever write another book in Word), but it is not about what I like. I should also mention that there are some pretty neat features in the new R Markdown, like authoring HTML slides in ioslides format, or PDF/Beamer presentations, and creating HTML documents with embedded Shiny apps (interactive statistics apps).

I think on the whole the students should deal with this pretty well, especially since they can tidy up their documents to their own satisfaction in Word — not saying that RStudio produces messy documents, but rather that the facility to edit post rendering is available.

### Help?

However, there is one stumbling block that I hope my readers might provide some feedback on: the issue of loading data. My class is a data analysis class. Every assignment comes with its own data sets. The students are happy, after a while, using read.csv() or read.table() in conjunction with file.choose(). However, from my own point of view, reproducible research documents with commands that require user input quickly become tedious, because you tend to compile/render multiple times whilst getting your code and your document right. So we are going to have to teach something different.

As background, our institution has large computing labs that any registered student can use. The machines boot in either Linux or Windows 7 (currently, and I do not think that is likely to change soon given how much people loathe Windows 8 and what a headache it is for IT support). There is moderate market penetration of Apple laptops in the student body (I would say around 10%). So here is my problem: we have to teach the concept of file paths to a large body of students who on the whole do not have this concept in their skill set, and who will find it foreign/archaic/voodoo. They will also regard this as another burdensome thing to learn on top of a whole lot of other things they do not want to learn, like R and R Markdown. To make things worse, we have to deal with file paths over multiple platforms.

My thoughts so far are:

• Making tutorial videos
• Providing the data for each assignment in an R package that is loaded at the start of the document
• Providing code in the template document that reads the data from the web

I do not really like the last two options as they let the students avoid learning how to read data into R. Obviously this is not a problem for those who do not go on, but it shifts the burden for those who do. So your thoughts please.

#### Update

One option that has sort of occurred to me before is that in the video I could show how the fully qualified path name to a file can be obtained using file.choose(), and then students could simply copy and paste that into their R code.
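A sketch of how that might look in an assignment document follows. The file name, the data, and the use of a temporary directory are stand-ins so that the example runs anywhere; in class the file would already be sitting on the student's machine.

```r
# Stand-in for an assignment data file; in class this would already exist.
dataDir <- tempdir()
write.csv(data.frame(x = 1:3, y = c(2.1, 3.9, 6.2)),
          file.path(dataDir, "assignment1.csv"), row.names = FALSE)

# Step 1 (run once, interactively, in the console): file.choose()
# Step 2: copy the quoted path that R prints and paste it into the document.
# Here file.path() plays the role of the pasted path; it joins the pieces
# with a separator that works on the current platform.
dataPath <- file.path(dataDir, "assignment1.csv")

assignment.df <- read.csv(dataPath)
str(assignment.df)
```

On Windows the path that R prints has its backslashes doubled, so pasting the printed string, quotes included, should work without further editing. The document then renders without any prompts, although the pasted path only works on the machine it was copied from.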