A bit more on writing bioinformatic research code
There has been a lot of discussion recently, on this blog and others, about the need for robust scientific software. Most of the discussion I have been involved in comes from bioinformaticians, because, well, I am one. There has been plenty of talk about code robustness, sharing, and replicability vs. reproducibility. I do not want to get into that whole debate again, although it is a worthy debate with plenty of interesting things coming out of it. I do want to talk a bit about how to write research software as I understand it: mostly short scripts with a short lifespan, written in the lab for analyzing data. They are not written as a software product to be used by other labs (although they may eventually be, and you should keep that in mind as you code along). Nor are they written as computer-science-y software aimed at trying out a new algorithm or optimizing it. So I am not talking about optimizing a new assembly algorithm or the like.
I am writing about the software you write to answer a biological question using the sequence data, structure data and/or metadata that you have. Those are the programs your lab writes in Python, Perl, R or Matlab to classify, analyze, quantify and otherwise shake-and-bake the raw pink data you have until it is nicely and evenly cooked, with a fine brown glaze of information coating it.
FSM help you if you write in Perl, though. (Kidding, or maybe not. This post, like most of my blog, science, and life is heavily Python-biased.)
I am also not writing about the code you should have ready when you publish: that will come up in a followup post. I am talking about starting out and moving along in your project. And the list is quite partial, with obvious lacunae noted at the end.
So you started a new project — something exploratory, as new projects tend to be. Although the scripts you write may initially seem quick, dirty, and mostly throwaway, you should not treat them as such from the get-go. You cannot know in advance which of the scripts you write will be the ones you run once, twice, ten times and then never touch again — and which will be the ones that run all the way to the finish line (writing your paper) and beyond (maybe going into a piece of bona-fide software your lab, or some other lab, will produce down the line). Therefore, proper respect and care are due for every script you write — even though most of them die early. So: some minimal respect at first (the bare minimum is skeletal documentation), and growing respect as you move along.
1. Skeletal documentation. You have a really burning question you would like to answer. Something that can be answered very quickly if you code it right. You know exactly what you want to write, and your fingers are itching to get that scripting part out of the way so you can run it and answer your burning question. Sounds familiar? Slow down, just a bit. Start each script by writing the function header and its arguments. And, despite your itching fingers, take a few minutes to describe what this particular function does, and especially the meaning of each argument passed and the values returned. This will help you frame for yourself what you are doing, because defining to yourself what you're doing and writing it down is much, much better than being under the illusion that you can keep it all in your head. Hopefully, this will also save you from making some mistakes down the line when you actually start coding, as you can refer to that particular piece of inline documentation as your little working framework. More importantly, "future you" can refer to this bit of documentation tomorrow, or a week from now, or six months from now. This documentation doesn't really have to be much, something like:
def count_bad_reads(infile, thresh):
    # infile: FASTQ file
    # thresh: number of N's in a read for it to be declared "bad"
    # Returns:
    #   n_tot_reads: total # of reads
    #   n_bad_reads: number of bad reads
    .
    .
Once you have written this little bit of inline documentation, unleash your mad coding skillz and write up the function. Your stitch in time will save nine in the future, when you gaze wide-eyed at your code trying to remember what the heck you wrote it for in the first place. Do not take shortcuts with skeletal documentation; make it a habit. It always saves time, and not just in the long run but in the short run: the returns are almost immediate. You'd be surprised how many people don't even do this bare-bones bit…
2. 00README files are your lab notebook: as often happens, your quick-and-dirty hack starts to grow into something larger. The file you started with a couple of scripts is rapidly ballooning into something bigger. Split your editor and open a 00README text file. Write some more extensive documentation there. This is your "lab notebook", and you should have one per directory in your project. (More about maintaining directories below.) It should be in the form of a diary: date your entries. Write up what is going on with your script runs. As you create more and more output files for your interim data, give those files meaningful names. Document what is in them. Cut & paste the top data line (if they are text files) into your 00README. That will remind you of their format: what's in column 1, column 2, and so on. My 00README's often end with a TODO header, reminding tomorrow-me what today-me wanted to do before the kids phoned up with: "Daddy, when are you coming home?" README files are also good for preparing that lab meeting, follow-up discussions with your colleagues & PI, etc.
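For illustration, a hypothetical 00README entry (the file names, read ID and values here are all made up) might look like:

2013-01-14
Ran count_bad_reads.py on lane1.fastq with thresh=5.
Output: lane1_bad_read_counts.txt (column 1: read ID, column 2: # of N's)
Top line: HWI-ST1032:1:1101:1214:2014    7

TODO: rerun with thresh=3; plot the distribution of N counts.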
3. Streamline & command-line: So, about that file with a bunch of functions in it: you probably already have some way of stringing them together. Function A returns variables a1 and a2; a2 is input to function B, which returns b1; b1 and a2 go into function C… you get the picture. As soon as you see you are running this stringing routine more than a few times, write a caller function that wraps A, B and C, and calls them in that particular order. Don't forget to skeleton-document the caller function. You can also make that caller function runnable from the command line. So a Python file loaded with functions may end up like this toy example:
#!/usr/bin/env python
import sys

def find_taxon_proteins(infile, taxon_id_list):
    # function finds all proteins in the given taxa
    # input:
    #   infile: allProteins.dat, format: taxon prot_id
    #   taxon_id_list: ['13201', '7505', ...] (NCBI taxon IDs)
    # returns: dictionary of proteins in the passed taxa
    #   key: taxon; value: list of proteins
    .
    .
    return taxa_prot_dict

def find_enzymes(protein_list, enzyme_list):
    # find which of the passed proteins are enzymes
    # input:
    #   protein_list: list of proteins
    #   enzyme_list: list of enzymes
    # returns: list that is the intersection of the input lists
    #
    .
    .

######
# Command-line executable part
######
# argv[1]: allProteins.dat file
# argv[2]: file listing the proteins that are enzymes
#
if __name__ == '__main__':
    human_prot_dict = find_taxon_proteins(sys.argv[1], taxon_id_list=["9606"])
    enzyme_list = read_enzyme_list(sys.argv[2])   # reads argv[2] into a list
    human_enzymes = find_enzymes(human_prot_dict["9606"], enzyme_list)
    enz_fout = open("human_enzymes.csv", "w")
    for enzyme in human_enzymes:
        enz_fout.write("%s\n" % enzyme)
    enz_fout.close()
Very few doc lines per function, just enough to jog your memory when you revisit this code down the line: what does each function (briefly) do? Which files does it read? What arguments are passed? What does it return? Also, you can now run the caller function as an executable from the command line, which, at this point, is more convenient than calling the functions interactively from the Python shell.
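For instance, assuming you saved the file above as find_enzymes.py (a name I am making up here) and the enzyme listing is in enzyme_list.txt (also made up), the run would look like:

$ python find_enzymes.py allProteins.dat enzyme_list.txt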
4. If you need to gawk, gawk — but no more than twice, and put it in a file. Sometimes a quick-and-dirty gawk / shell one-liner seems like a good solution. Try to avoid it. But if you do find yourself hitting that "arrow up" key to recall your gawk run and modify it, please: stop. Mouse over the gawk code, grab it, put it in an editor, write some skeletal documentation, and save that code in a file instead of relying on your shell history. Also, if you have more than two gawk scripts, code a proper function in your language of choice and drop the gawks (a sketch of such a replacement follows the example below). Ugly connective one-liners are a stopgap measure, and should be treated as such.
#!/bin/bash
# command-line input: gene_association.goa_uniprot file
# output: a list of human proteins, and the number of annotations for each one
gawk '/taxon:9606/ {a[$2]++} END {for (i in a) print i, a[i]}' ${1} | sort
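And here is what a minimal Python replacement for that one-liner might look like. This is just a sketch; like the gawk version, it assumes a whitespace-delimited file with the protein ID in column 2, and matches the taxon tag anywhere on the line:

#!/usr/bin/env python
import sys

def count_annotations(infile, taxon="taxon:9606"):
    # infile: gene_association.goa_uniprot file
    # taxon: taxon tag to match anywhere on the line
    # Returns: dict mapping protein ID (column 2) to # of annotations
    counts = {}
    for line in open(infile):
        if taxon in line:
            prot_id = line.split()[1]
            counts[prot_id] = counts.get(prot_id, 0) + 1
    return counts

if __name__ == '__main__':
    counts = count_annotations(sys.argv[1])
    for prot_id in sorted(counts):
        print("%s\t%d" % (prot_id, counts[prot_id]))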
5. Multiple directories: for each new topic, create a new directory under your main project directory (which is usually the one you created first). Each directory is set aside for a different topic, and each has its own 00README describing what's in it. It is hard to define what constitutes a "new topic", but here's an example: you started out looking for enzymes in the human proteome. Now you are specifically interested in an in-depth analysis of those enzymes in fatty-acid pathways, and their orthologs in mammals. Open a new directory for each new question:
./fatty_acid
./mammal_orthos
At least for starters (you can always merge later), as those are separate, new directions in your research.
6. Use version control software. My lab uses Git and the free online GitHub. Some use Mercurial (which I know little about), some use others. At least for GitHub, you need to pay if you want to make your repositories non-public ($7.00/month for five non-public repositories). I pay, but some people just install Git and host their own repositories locally. Using version control can help you go back and pinpoint when that stupid bug was introduced, and, of course, share with others in your lab and in other labs.
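As a sketch of the day-to-day usage (the commit message and the tag name here are made up), including git bisect, which is one way of hunting down the commit where a bug crept in:

git init
git add find_enzymes.py 00README
git commit -m "first pass at human enzyme extraction"
# ...later, when something broke between two known points:
git bisect start
git bisect bad              # the current version is broken
git bisect good v0.1        # this older tag was known to be fine
# git then checks out revisions in between for you to test;
# run "git bisect reset" when done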
7. Back up everything… and version control software is not a backup solution. Use multiple offsite daily incremental backups. Automate via cron, and check your logfiles frequently to see that the backup script is still working! (Yeah, this should be obvious, but this is one time when stating the obvious can only do good.) This is also important for transparency and scientific integrity, analogous to preserving lab notebooks in a wet lab.
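As an illustration (all the paths and the host name here are made up), a crontab entry that runs a nightly rsync to an offsite machine and appends to a logfile might look like:

# m  h  dom mon dow  command
30 2  *   *   *   rsync -a --delete /home/me/project/ backuphost:/backups/me/project/ >> /home/me/logs/backup.log 2>&1

Since rsync only transfers what changed, you get the incremental part for free, and the logfile is what you check to make sure the job is still running.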
This list is incomplete. Specifically, I did not write anything about testing, which is a whole topic unto itself, or about "publication code", by which I mean what you should have ready when you are actually ready to publish (besides your paper & figures, haha). Hopefully that will come in a followup post to this one.
BitBucket provides free unlimited public or private repositories for Git and Mercurial, with the caveat that no more than 5 users can have access to a private repo. I was able to bump up this limit to 7 or 8 by inviting colleagues to create BitBucket accounts, but sadly I haven’t yet enjoyed the privilege of having TOO MANY people interested in contributing to my project. I doubt this limit will be a serious hindrance for most academic bioinformatics scientists.
Thanks for your tips, Iddo! I would like to add to this list:
– Write unit tests. I would advise everyone to do so. They make excellent documentation, can be written very quickly, and are obviously handy in many other ways. (See the first sketch below.)
– Use make for building pipelines. This way you have to automate everything. The more you reduce human intervention in your workflow, the better! (in my case…) (See the second sketch below.)
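For what it's worth, a minimal unit test for the count_bad_reads function from the post might look like this (the module name and the toy FASTQ file are made up):

import unittest
from count_reads import count_bad_reads   # module name assumed

class TestCountBadReads(unittest.TestCase):
    def test_toy_fastq(self):
        # toy_reads.fastq: a small hand-made file with 10 reads,
        # 3 of which have more than 5 N's
        n_tot, n_bad = count_bad_reads("toy_reads.fastq", thresh=5)
        self.assertEqual(n_tot, 10)
        self.assertEqual(n_bad, 3)

if __name__ == '__main__':
    unittest.main()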
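And a minimal sketch of a make-driven pipeline, reusing the toy files from the post (enzyme_list.txt is a made-up name; recipe lines must start with a TAB):

# Makefile
# "make human_enzymes.csv" reruns the script only when an input
# file or the script itself has changed
human_enzymes.csv: allProteins.dat enzyme_list.txt find_enzymes.py
	python find_enzymes.py allProteins.dat enzyme_list.txt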
I would like to share more on this topic, so I wrote a blogpost at http://bioinformatics-man.blogspot.be/2013/01/and-even-more-on-writing-bioinformatics.html.
Thanks
I couldn’t agree more about the README.txt files. It is unbelievable how much code/programs/scripts get published for usage with ABSOLUTELY NO DOCUMENTATION. This is simply bad practice. How about a few REM statements? Something to make the script interpretable and useful to someone who wasn’t intimately involved in writing it, or who has to fix it because, since the time it was written, some dependency has “drifted” away from where it used to be…
All good advice.
But does the PI also follow all the rules?
🙂