Reproducible research software: some more thoughts

So there was a lot written over the blogosphere, twittersphere and what-have-you-sphere about the need to publish code in scientific research. The latest volley was fired in a post from “JermDemo”, which also mentioned my post on making accountable research software by forming a volunteer “Bioinformatics Testing Consortium”. (My post, not my idea.) I won’t get into its content, since it is not the main point of this post. You can read it if you like; there are some interesting points there.

Anyhow, Leighton Pritchard commented on his ambivalence about “reproducible research software” on his Facebook page. I thought I’d reproduce it (ha!) here (almost) verbatim, as I share many of Leighton’s thoughts and qualms. I especially share his ambivalence, so to me it is refreshing to have a non-dogmatic position voicing some concerns from both sides of the argument. Thanks for letting me post your thoughts, Leighton.

<Channeling Leighton>

For what it’s worth, I have a confused and wavering opinion on this kind of ‘reproducibility’ in bioinformatics, which is one reason why I don’t comment on the topic more (same with Open Access). I see it, broadly, as a good thing: prosaically, it’s better to be able to check methods and results than not; I also like to be able to follow the clever stuff other people do in more detail, because I learn a lot that way. However, our goal as scientists is, arguably, progressively to produce more useful and more accurate/’truthful’ models of (bits of) the universe around us. Direct reproducibility of methods is a desirable, but not a strictly necessary component of that.

Sometimes I think I see an argument for reproducibility that borders on logical fallacy. Paraphrasing: “if you can’t re-enact the work exactly, then the results can’t be trusted. The corollary is that if work can be reproduced, it can be trusted.” Now there’s a straw man for you. Firstly, no-one wants to re-enact – for example – James Joule’s experiments to confirm the relationship between mechanical work and heat; we have other, independent, confirmatory, convenient ways to do this now. Secondly, several people reproduced Blondlot’s N-Rays, but the rays didn’t actually exist. Neither situation is a very good map onto bioinformatics or software, but they should be illustrative of how exact reproducibility of methods may be neither essential nor infallible in science.

I also share your opinion  (if I’ve read your blog post correctly, at least) that a large amount of practising bioinformatics code is pretty much extemporised to get data from one place/one form to another in an appropriate way without thinking too hard about the underlying computer science, or usability issues. This is where JermDemo’s chemistry analogy becomes interesting for me, as a former chemist: what matters about what we write/wrote down as chemists concerning the method was information about the reagents, how and when they interact and how long for, and what special conditions are needed at any step (along with the results, of course). We never had to package up the exact glassware, or share glassware (except for when it was novel or highly-specialised glassware; typically a company made it and it could be bought off-the-shelf; you just had to know how to use it) for replication. On one level, at least, software libraries = glassware; local scripts/algorithm choice = the arrangement of that glassware; algorithm parameters = experimental parameters: temperature, pressure, concentration etc.

Typically what people are interested in for chemistry methods is what was done with the reagents. In bioinformatics, what was done with the data. As an example (from

A suspension of isoquinolinequinone (1 mmol), the required amine (2 mmol), CeCl3.7H2O (0.05 mmol) and ethanol (25 mL) was left with stirring at rt after completion of the reaction as indicated by TLC. The solvent was removed under reduced pressure and the residue was column chromatographed over silica gel (85:15 CH2Cl2/AcOEt) to yield the corresponding mixture of regioisomers. These were analysed by 1H-NMR to evaluate the proportion between the 6- and 7-aminoisoquinolinequinone derivatives. Column chromatography of the mixture, over silica gel (95:5 CH2Cl2/AcOEt), provided pure samples of the regioisomers.

There’s an assumption that you know what each of these steps is, why you would do it (I remember both), and that you’re competent to do it with your own kit (I’ve lost my lab skills, sadly – but if this was 1996 I think I could have done it) if you’re interested in replication. The code is only a tool and the key thing is to know what you’re doing with the data. It’s the science itself, that path from data to analytical result, that needs to be reproducible, otherwise blind reuse of code (which we all know happens) is potentially just an electronic version of Blondlot’s N-Rays. Independent replication is the key to that, which is one reason why the bioinformatician’s core skillset is knowing what to do with the data and how to make computers do it for you; ‘trained-up wet lab biologists’ who just about know how to use BLAST and have a vague idea that HMMs exist are no more bioinformaticians in that sense than someone who knows how to do a t-test (but not when you *shouldn’t*) is a statistician.

As it happens, I also agree with Kevin [Karplus IF] that the BTC is a good idea for code of a certain quality level, but I have similar reservations about its practicality. (I disagree with him that it’s a step on the path to treating bioinformaticians as “servants” to “real researchers”, because I get quite enough of that already, without sharing my code. No smiley, you’ll note.)

A common gripe I have about bioinformatics research being irreproducible is that the written methods are actually insufficient for a competent person to reproduce the work by themselves – David’s ‘vague ref’* seems to be the norm for many papers that ‘use bioinformatics’ but aren’t bioinformatics papers, as it were. If code existed, then releasing it would be a help. I suspect that, in many cases, there isn’t code – these were largely web-based or other ad hoc applications of software, and that the person responsible doesn’t understand how or why it works.


* “I have just read a GWAS paper that does some very dodgy logistic regression corrected for ethnicity effects based on the first 5 PCA scores from some sort of unexplained meta-analysis (vague ref) and corrected for other multiple “confounders” using stepwise selection (no validation).” (IF: David Broadhurst, same FB thread, who also penned a paper in 2006 entitled Statistical Strategies for Avoiding False Discoveries in Metabolomics Experiments).

</Channeling Leighton>


2 Responses to “Reproducible research software: some more thoughts”

  1. Robert Sugar says:

    Very interesting post. I particularly liked the chemistry analogy, except for one thing: code is much easier to share than glassware. If there was an easy way in chemistry to download that, it would speed things up quite a bit.

    The other parts are (unfortunately) quite true. Reproducibility is useless if you re-run the same script on the same data, which proves exactly nothing. Things might become interesting, though, if you change the input data or the parameters (what if …), which is where I see the added benefit of sharing code.

  2. Leighton Pritchard says:

    Good point about the relative lack of ease of sharing glassware, Robert – since we *can* share easily, maybe we ought to just do it. Allowing others to modify input data/parameters is also a good reason to share code.

    Extending the glassware analogy, the ‘crappy code makes the result unsafe’ issue is a bit like what happens if it turns out that the boron in your particular glassware, or maybe just some contaminant from a previous experiment, was responsible – or necessary – for your result. You might want to run the experiment again in clean glassware (i.e. independent replication of what the code’s meant to be doing) to check, if it looked suspicious. Sharing the code/glassware lets *everyone* see whether it’s poorly-written/dirty or not. But if it looks dirty, maybe there’s more chance of someone rerunning your experiment with clean glassware, as it were? Is there an assumption that if code is highly-polished and passes its unit tests, say, it can be taken for granted that there are no fundamental algorithmic errors? They may be easier to spot in clean code, but how much is usually taken on trust? Do we only tend to inspect the internals of software when it doesn’t work, or doesn’t work as expected?

    Like I say, I have a ‘confused and wavering opinion’ ;)