Python and statistics – is there any point?

This semester I gave my graduate student class a project. The brief was relatively simple: implement the iteratively reweighted least squares (IRLS) algorithm to perform a simple (single covariate) logistic regression in Python. Their programmes were supposed to be able to read data in from a text file, perform the simple matrix algebra and math needed to carry out the IRLS computation and return some formatted output – similar to that you would get from R’s summary.glm function. Of course, you do not need matrix algebra to do this, but the idea was for the students to learn a bit of mathematical statistics that they had not seen before.  On the IRLS front, they were allowed to use a simple least squares routine like numpy’s linalg.lstsq and some of numpy’s simple matrix operators, but expressly forbidden from simply loading pandas or statsmodels and using the generalized linear models functions contained therein.

I thought this sounded like a straightforward enough task. The students divided themselves into pairs to work on it, and they had 13 weeks to complete the task.

The kicker was that I did not provide any instruction, either in Python or in the IRLS algorithm. An aim of the project was to simulate the situation where someone asks you to solve a problem, and you have to go and do some research to do it. Their first task was to complete 100 exercises on codeacademy.com as a reasonable introduction to a language none of them had seen before.

Problems – versions

There are two major versions of Python in the wild, 2.7 and 3.4. Codeacademy teaches using version 2.7. One fundamental difference between 2.7 and 3.4 is the syntax of the print function. All of my students are users of R, to varying levels of skill. When they go to install R at home, they know to go to the CRAN website, or a mirror, and download the current, stable release of R. If they followed this policy, as I did myself, then they would have installed Python 3.4 and found that the way they were taught to use print by Codeacademy does not work, without any sort of helpful “That syntax has been depricated. Python 3 onwards uses the syntax…” This is not the only issue, with the way Python 3.4 handles execution of loops over numbered ranges being another example of a fundamental difference.

Problems – platform issues

Most students at my institution use Windows, especially at home. There is some Mac penetration, and Linux is virtually non-existent (these are statistics students, not computer science remember). The official Python installers work perfectly well under Windows in my experience. However, then we come to the issue of installing numpy. The official advice from the numpy website seems to be “download a third party version of Python which already has it.” For students who come from a world where a package can be installed by going to a menu, this is less than useful. The common advice from the web is that “there is no official release of numpy 1.8.1 for Python 2.7 or higher for Windows” but that you can download it and install it from a the builds very thoughtfully provided by Christoph Gohlke at UC Irvine here. Christoph’s builds work fine, but again, for something that seems, at least from the outside, very mainstream in the Python community should the user have to go to this level of effort?

Problems – local installations

Like any instructor, I face the issue that a number of my students have no option but to use the computer laboratories provided for them by the university. This means that we encounter the issue of local installation of libraries for users. Most, if not all, R packages from CRAN can be installed in a local library. As far as I can tell, this is not true for a Windows installation of Python. I am happy to be corrected on this point. The aforementioned Python binaries come with proper Windows installers, which want to install into the Python root directory, something students do not have permission to do. If I had realized this problem in December of last year, I could have asked the admins to pre-install it for all users, however, given I only formulated the problem in February, it was just a tad too late.

Would I do it again?

I might, but there would have to be serious efforts to resolve the problems listed above on my part. It also would not solve problems of students trying to set up Python at home, and I do not feel like hand-holding people through an installation process. My initial plan had been to try Javascript. I may return to this idea.

I would be the first to admit that I am not a Python user, but I am an experienced programmer with over thirty years of experience in at least a dozen different languages, and on multiple platforms. I know many people find Python a very useful language for their scientific computing, and I am not attempting to bad mouth the language – it seems a decent enough language with the constructs and functionality that I would expect to find in any modern language – but I do not think there is much incentive for a statistician to move away from R, or an R/C++ combination when raw compute power is required.

I am glad that my students experienced programming in a non-vectorized language. R does give a distorted perspective on programming with regards to its handling of vectors, and I think it is beneficial for students to learn about flow structures for element-wise computation.

Update

Nat Dudley has made the suggestion I used on online IDE like nitrous.io.

Despite the difficulties, nearly all of my students have managed to complete the task, and some have done an exceptionally good job, even adding in the ability to parse R-like formulae.

2 thoughts on “Python and statistics – is there any point?”

1. Josef says:

Just a comment to the installation problems:
http://winpython.sourceforge.net/ contains Python with all main scientific packages in a portable version for Windows that you can put on a memory stick or just unzip in a directory.
The reason 3rd party distributions are recommended is because they include everything and can be installed in a few clicks. Otherwise you need to follow a page of installation instructions to install all the pieces individually.
Another possibility is to install everything into a virtual environment, which is then just a directory on the file system that doesn’t need administrator privileges.

Once you allow the use of numpy, then Python is **also** a vectorized language. It has essentially the same array/matrix features as R or Matlab or similar specialized languages.

(Disclosure: I’m a statsmodels maintainer and do all my work and development on Windows.)

1. Thanks Josef. Comment from the experts is what I was after. I take your point about numpy and vectorization. Ideally I would like them to use Java or C++, but I think that is too bigger hurdle for students.

This site uses Akismet to reduce spam. Learn how your comment data is processed.