Archive

Archive for the ‘open source software’ Category

Bioinformatics Open Source Conference 2010 (and a poll)

June 14th, 2010 3 comments

The 11th Annual Bioinformatics Open Source Conference (BOSC) 2010 is coming up in Boston, July 9-10 2010. The BOSC meetings are a great get-together of a community of programmers who are like-minded in their advocacy of open source code for science, and specifically for bioinformatics. The whole thing is run by volunteers who take a lot of time and effort to bring a top-notch meeting every year, so a big thanks to this year’s organizing committee!

If you are reading this, and you are in Boston on those dates, consider showing up, it is a great experience. There will also be a codefest on the two days before the meeting. This year’s topic is cloud computing for bioinformatics. If you like using AWS for bioinformatics or if you want to learn more, this is your chance. Amazon have provided a grant towards this codefest. (Thanks!) Biopython, Bioperl, Biojava and Bioruby developers will all be there, tailoring code to the cloud.

Which brings me to the latest poll: if you are a bioinformatics programmer, which of the Bio* packages  are you using in your programming, if any? If more than one, check the one you use most frequently. Poll answers on the right. As with all Internet polls, you must be crazy if you take it at all seriously.

AMOS on Ubuntu

April 19th, 2010 Comments off

AMOS is a suite of genome assembly and editing software. It includes assemblers, validation, visualization, and scaffolding tools.  I have been having some issues installing AMOS on Ubuntu  9.10.  Specifically, Ubuntu 9.10 has gcc 4.4, which breaks the compilation of the AMOS release version. However, the development version has been fixed to accommodate that.

If you don’t know which Ubuntu version you are running, type:

$ lsb_release -a

No more than fifteen minutes after I posted my Q to the amos-help mailing list, Florent Angly came through with a solution. I am posting his email here.

Hi,

This issue was fixed in the development version of AMOS. See below for instructions on how to install this version on Ubuntu:

Download either the regular or development version of AMOS. As of April 4, 2010,
Minimo is only available from the development version of AMOS.
i/ The regular AMOS version is available from http://sourceforge.net/projects/amos/files/, e.g.:
$ wget http://sourceforge.net/projects/amos/files/amos/2.0.8/amos-2.0.8.tar.gz/download
ii/ The development version of AMOS is in a CVS repository. To get it, run:
$ cvs -z3 -d:pserver:anonymous@amos.cvs.sourceforge.net:/cvsroot/amos co -P AMOS

In the directory where the AMOS file are located, run the following to install
the prerequisites:
$ sudo aptitude install ash coreutils gawk gcc automake mummer mummer-doc libboost-dev

For the Hawkeye component of AMOS, you need Qt3:
$ sudo aptitude install libqt3-headers

For the standard version of AMOS, skip to next step, but for the CVS development version, first, run:
$ ./bootstrap

Then regardless of the version:
$ ./configure –with-Qt-dir=/usr/share/qt3 –prefix=/usr/local/AMOS
$ make
$ make check
$ sudo make install
$ sudo ln -s /usr/local/AMOS/bin/* /usr/local/bin/

Now all the programs shipped in AMOS should be available from the command-line.
For example try:
$ Minimo -h
Regards,

Florent

You will need the AMOS development version for Ubuntu 9.10 (and above, presumably), but the regular version for 9.04 (and below). If you are getting the development version, you will also need to install cvs on your machine:

$ sudo aptitude install cvs

Hope this helps anyone struggling with installing AMOS on Ubuntu or other Linux platforms.

Short bioinformatic hacks: reading between the genes

February 11th, 2010 6 comments

In celebration of the biohackathon happening now in Tokyo, I am putting up a script that is oddly missing from many bioinformatic packages: extracting intergenic regions. This one was written together with my student, Ian. As for the biohackathon itself, I’m not there, but I am following the tweets and  Brad Chapman’s excellent posts:

About intergenic regions: intergenic regions are as interesting and sometimes even more interesting than the genes themselves: when you are interested in promoters, transcription factor binding sites or almost any other transcription regulation mechanism. Here’s a simple script to find intergenic regions. It reads a genbank formatted file and uses the information there to extract the intergenic regions. The sequences are written to a FASTA file.

#!/usr/bin/env python
import sys
import Bio
from Bio import SeqIO, SeqFeature
from Bio.SeqRecord import SeqRecord
import os

# Copyright(C) 2009 Iddo Friedberg & Ian MC Fleming
# Released under Biopython license. http://www.biopython.org/DIST/LICENSE
# Do not remove this comment
def get_interregions(genbank_path,intergene_length=1):
    seq_record = SeqIO.parse(open(genbank_path), "genbank").next()
    cds_list_plus = []
    cds_list_minus = []
    intergenic_records = []
    # Loop over the genome file, get the CDS features on each of the strands
    for feature in seq_record.features:
        if feature.type == 'CDS':
            mystart = feature.location._start.position
            myend = feature.location._end.position
            if feature.strand == -1:
                cds_list_minus.append((mystart,myend,-1))
            elif feature.strand == 1:
                cds_list_plus.append((mystart,myend,1))
            else:
                sys.stderr.write("No strand indicated %d-%d. Assuming +\n" %
                                  (mystart, myend))
                cds_list_plus.append((mystart,myend,1))

    for i,pospair in enumerate(cds_list_plus[1:]):
        # Compare current start position to previous end position
        last_end = cds_list_plus[i][1]
        this_start = pospair[0]
        strand = pospair[2]
        if this_start - last_end >= intergene_length:
            intergene_seq = seq_record.seq[last_end:this_start]
            strand_string = "+"
            intergenic_records.append(
                  SeqRecord(intergene_seq,id="%s-ign-%d" % (seq_record.name,i),
                  description="%s %d-%d %s" % (seq_record.name, last_end+1,
                                                        this_start,strand_string)))
    for i,pospair in enumerate(cds_list_minus[1:]):
        last_end = cds_list_minus[i][1]
        this_start = pospair[0]
        strand = pospair[2]
        if this_start - last_end >= intergene_length:
            intergene_seq = seq_record.seq[last_end:this_start]
            strand_string = "-"
            intergenic_records.append(
                  SeqRecord(intergene_seq,id="%s-ign-%d" % (seq_record.name,i),
                  description="%s %d-%d %s" % (seq_record.name, last_end+1,
                                                        this_start,strand_string)))
    outpath = os.path.splitext(os.path.basename(genbank_path))[0] + "_ign.fasta"
    SeqIO.write(intergenic_records, open(outpath,"w"), "fasta")

if __name__ == '__main__':
    if len(sys.argv) == 2:
         get_interregions(sys.argv[1])
    elif len(sys.argv) == 3:
         get_interregions(sys.argv[1],int(sys.argv[2]))
    else:
         print "Usage: get_intergenic.py gb_file [intergenic_length]"
         sys.exit(0)

What are we seeing here?

Lines 11-16 are the preamble: we read the GenBank file using Biopython’s genbank parser in line 12.  Beacuse we expect a genome file, which contains a single record, this is a one-time read. Note that this is a rate limiting step, and can take a couple of seconds. Took me ~2secs to read the full E. coli genome on my Linux box.  We prepare one list for the + strand intergenic regions (13), another one for the minus strand intergenic regions (14) and one for all the records (line 15).

The rest of the code are three loop blocks: lines 16-28 I loop over the genbank features, extracting the coordinated of the genes themselves. Line 32-41 I find the intergenic regions on the + strand. Lines 42-52 I do the same for the “-” strand.

Now for a philosophical interlude: although there is a way to read all the intergenic regions in a single pass, I subscribe to the “code simple” doctrine of research software writing. Code performance optimization is a low priority for me. I’d much rather have something that is simple to write,read and modify. I also don’t want to spend too much time coding and elegant script for elegance’s sake, especially if I may not use it too much. Historically, scientific code written for research is mostly extinct: thrown away after a short lived hypothesis was tested and ended its days. Research coding is mostly throwaway glue code. Very rarely it matures into a product. Then, and only then, can you apply all those fine software engineering you learned in college. Before that, write fast and simple.

But I digress. Line 17 loops over the features in the genome file. Line 18 we identify if it is a coding sequence (CDS). If so, we identify the start position, and position and the strand the CDS is on. The list cds_list_minus is a list of 2-tuples. Each 2-tuple is the start and end positions of a CDS on the minus strand. (If you would like to go over the genes, as opposed to coding sequences, change line 18 to:

if feature.type == 'gene':

(or better yet, pass an argument that defines it.)

cds_list_plus, is, yes, the same as cds_list_minus, only for the plus strand (line 24).

Sometime, a CDS does not contain information on which strand it is. With genome files, that is usually the case with single stranded viral genomes. Therefore, we put in the default assumption that if there is no strand indication, then the feature is is on the plus strand. We generate a warning message nevertheless (lines 25-28).

Lines 30-41 we loop over the plus  strand list, and identify the coordinates between the genes. Python’s enumerate function is very useful here. The enumerate function allows us to iterate over a list, but at the same time keep track of which index we are in when looping over the list. So in line 30, pospair receives the start and end coordinates of a CDS as a 2-tuple, while i receives the actual number if the index in the plus strand CDS list. In that way, we can look back to the previous list member, find the coordinates where that CDS ends, and where the current CDS begins. The two coordinates make up the beginning and end of the intergenic regions between those two genes on that strand. In line 35 we check if the intergenic region length is equal to or larger than a threshold: suppose we are only interested in those intergenic regions that are longer than 100 bases? (The default value is 1, see line 11.) In lines 38-39 we build a biopython sequence object that contains an informative header, and the sequence of the intergenic region. The description which goes in the sequence header contains the start and end coordinates of the intergenic region, and the locus ID of the CDS directly downstream from it. The sequence object is appended to a list, which will eventually get written (lines 40-41).

Lines 42-52 are a repeat of lines 30-41, only for the minus strand.  Lines 53 & 54: the list that contains all the intergenic region sequence objects gets written to its own fasta file.

Finally, line 56-63 are boilerplate code, that make this script runnable from the command line. Have fun looking at intergenic regions. Let me know of you find something interesting.

Thankful for…

November 26th, 2009 1 comment

In no particular order or context. No personal stuff and by no means a complete list:

WordPress (like, duh).

icon_big

Wikipedia (default for looking up new stuff)

600px-Wikipedia-logo.svg

Wikis in general (great lab management tool. Don’t need LIMS)

Open Access Publishing and Creative Commons licensing.

cc.logo.circle

FLOSS licensing (90% of the software I use, and 100% of what I write)

opensource-logo

Science Bloggers (too numerous to link)

Science tweeters and FriendFeeders (too numerous to link. That’s how I keep up with things)

Facebook+Friendfeed-VS-Twitter

BLAST (Sometimes it feels like bioinformatics is should be renamed to blastology)

LaTeX (Wrote my dissertation in LaTeX, and never looked back)

latex_lion

OpenOffice.org (because not everyone uses LaTeX).

OpenOfficeLogo

CiteULike (Keeping my reference library up to date and in good order)

Citeulike_logo

Delicious (Keeping my bookmarks up to date and in good order)

delicious_logo

Gmail (because finding that document you sent me a month ago would be impossible otherwise)

super-gmail-logo

Google Scholar (For standing on the toes of Hobbits. Or something like that)

mainG

GIS (for blogging and making class slides)

Vim (because emacs blows)

vim-editor_logo

Python (ease & power)

python_logo_without_textsvg

Biopython (OK, conflict of interest here, since I contributed a bit)

biopython

Friendly colleagues (They certainly are!)

umured7

Good students (gotta make my lab page).

Goulash for dinner. Can’t stand oven Turkey.

turkey

Music. Especially the latest song that is going around in my head: