Short bioinformatic hacks: reading between the genes

In celebration of the biohackathon happening now in Tokyo, I am putting up a script that is oddly missing from many bioinformatic packages: extracting intergenic regions. This one was written together with my student, Ian. As for the biohackathon itself, I’m not there, but I am following the tweets and  Brad Chapman’s excellent posts:

About intergenic regions: intergenic regions are as interesting and sometimes even more interesting than the genes themselves: when you are interested in promoters, transcription factor binding sites or almost any other transcription regulation mechanism. Here’s a simple script to find intergenic regions. It reads a genbank formatted file and uses the information there to extract the intergenic regions. The sequences are written to a FASTA file.

#!/usr/bin/env python
import sys
import Bio
from Bio import SeqIO, SeqFeature
from Bio.SeqRecord import SeqRecord
import os

# Copyright(C) 2009 Iddo Friedberg & Ian MC Fleming
# Released under Biopython license. http://www.biopython.org/DIST/LICENSE
# Do not remove this comment
def get_interregions(genbank_path,intergene_length=1):
    seq_record = SeqIO.parse(open(genbank_path), "genbank").next()
    cds_list_plus = []
    cds_list_minus = []
    intergenic_records = []
    # Loop over the genome file, get the CDS features on each of the strands
    for feature in seq_record.features:
        if feature.type == 'CDS':
            mystart = feature.location._start.position
            myend = feature.location._end.position
            if feature.strand == -1:
                cds_list_minus.append((mystart,myend,-1))
            elif feature.strand == 1:
                cds_list_plus.append((mystart,myend,1))
            else:
                sys.stderr.write("No strand indicated %d-%d. Assuming +\n" %
                                  (mystart, myend))
                cds_list_plus.append((mystart,myend,1))

    for i,pospair in enumerate(cds_list_plus[1:]):
        # Compare current start position to previous end position
        last_end = cds_list_plus[i][1]
        this_start = pospair[0]
        strand = pospair[2]
        if this_start - last_end >= intergene_length:
            intergene_seq = seq_record.seq[last_end:this_start]
            strand_string = "+"
            intergenic_records.append(
                  SeqRecord(intergene_seq,id="%s-ign-%d" % (seq_record.name,i),
                  description="%s %d-%d %s" % (seq_record.name, last_end+1,
                                                        this_start,strand_string)))
    for i,pospair in enumerate(cds_list_minus[1:]):
        last_end = cds_list_minus[i][1]
        this_start = pospair[0]
        strand = pospair[2]
        if this_start - last_end >= intergene_length:
            intergene_seq = seq_record.seq[last_end:this_start]
            strand_string = "-"
            intergenic_records.append(
                  SeqRecord(intergene_seq,id="%s-ign-%d" % (seq_record.name,i),
                  description="%s %d-%d %s" % (seq_record.name, last_end+1,
                                                        this_start,strand_string)))
    outpath = os.path.splitext(os.path.basename(genbank_path))[0] + "_ign.fasta"
    SeqIO.write(intergenic_records, open(outpath,"w"), "fasta")

if __name__ == '__main__':
    if len(sys.argv) == 2:
         get_interregions(sys.argv[1])
    elif len(sys.argv) == 3:
         get_interregions(sys.argv[1],int(sys.argv[2]))
    else:
         print "Usage: get_intergenic.py gb_file [intergenic_length]"
         sys.exit(0)

What are we seeing here?

Lines 11-16 are the preamble: we read the GenBank file using Biopython’s genbank parser in line 12.  Beacuse we expect a genome file, which contains a single record, this is a one-time read. Note that this is a rate limiting step, and can take a couple of seconds. Took me ~2secs to read the full E. coli genome on my Linux box.  We prepare one list for the + strand intergenic regions (13), another one for the minus strand intergenic regions (14) and one for all the records (line 15).

The rest of the code are three loop blocks: lines 16-28 I loop over the genbank features, extracting the coordinated of the genes themselves. Line 32-41 I find the intergenic regions on the + strand. Lines 42-52 I do the same for the “-” strand.

Now for a philosophical interlude: although there is a way to read all the intergenic regions in a single pass, I subscribe to the “code simple” doctrine of research software writing. Code performance optimization is a low priority for me. I’d much rather have something that is simple to write,read and modify. I also don’t want to spend too much time coding and elegant script for elegance’s sake, especially if I may not use it too much. Historically, scientific code written for research is mostly extinct: thrown away after a short lived hypothesis was tested and ended its days. Research coding is mostly throwaway glue code. Very rarely it matures into a product. Then, and only then, can you apply all those fine software engineering you learned in college. Before that, write fast and simple.

But I digress. Line 17 loops over the features in the genome file. Line 18 we identify if it is a coding sequence (CDS). If so, we identify the start position, and position and the strand the CDS is on. The list cds_list_minus is a list of 2-tuples. Each 2-tuple is the start and end positions of a CDS on the minus strand. (If you would like to go over the genes, as opposed to coding sequences, change line 18 to:

if feature.type == 'gene':

(or better yet, pass an argument that defines it.)

cds_list_plus, is, yes, the same as cds_list_minus, only for the plus strand (line 24).

Sometime, a CDS does not contain information on which strand it is. With genome files, that is usually the case with single stranded viral genomes. Therefore, we put in the default assumption that if there is no strand indication, then the feature is is on the plus strand. We generate a warning message nevertheless (lines 25-28).

Lines 30-41 we loop over the plus  strand list, and identify the coordinates between the genes. Python’s enumerate function is very useful here. The enumerate function allows us to iterate over a list, but at the same time keep track of which index we are in when looping over the list. So in line 30, pospair receives the start and end coordinates of a CDS as a 2-tuple, while i receives the actual number if the index in the plus strand CDS list. In that way, we can look back to the previous list member, find the coordinates where that CDS ends, and where the current CDS begins. The two coordinates make up the beginning and end of the intergenic regions between those two genes on that strand. In line 35 we check if the intergenic region length is equal to or larger than a threshold: suppose we are only interested in those intergenic regions that are longer than 100 bases? (The default value is 1, see line 11.) In lines 38-39 we build a biopython sequence object that contains an informative header, and the sequence of the intergenic region. The description which goes in the sequence header contains the start and end coordinates of the intergenic region, and the locus ID of the CDS directly downstream from it. The sequence object is appended to a list, which will eventually get written (lines 40-41).

Lines 42-52 are a repeat of lines 30-41, only for the minus strand.  Lines 53 & 54: the list that contains all the intergenic region sequence objects gets written to its own fasta file.

Finally, line 56-63 are boilerplate code, that make this script runnable from the command line. Have fun looking at intergenic regions. Let me know of you find something interesting.

Share and Enjoy:
  • Fark
  • Digg
  • Technorati
  • del.icio.us
  • StumbleUpon
  • Facebook
  • Reddit
  • Twitter
  • FriendFeed
  • PDF
  • email
  • Print
  • Google Bookmarks

6 Responses to “Short bioinformatic hacks: reading between the genes”

  1. Peter says:

    Hi Iddo – I haven’t read the code but this looks like a good idea for a Biopython cookbook entry 🙂

    I wanted to say the official twitter tag for the event is #BioHackathon2010 but there are quite a few tweets tagged just #BioHackathon as well. See http://hackathon3.dbcls.jp/

    Peter

  2. Jason says:

    Here’s mine with some extra debugging and pulling from a Bio::DB::SeqFeature::Store database that also runs Gbrowse instance if you are more of the perl persuasion.

    http://github.com/hyphaltip/genome-scripts/blob/master/seqfeature/get_intergenic_seq.pl

  3. Iddo says:

    @Peter
    I put it in the CookBook Wiki under “I” for “intergenic” http://biopython.org/wiki/Intergenic_regions

  4. Andrey says:

    Will that work correctly for CDS defined on circular construct and spanning the origin, like:
    http://www.ncbi.nlm.nih.gov/nuccore/AE017195.1
    FEATURES Location/Qualifiers
    source 1..208369
    /organism=”Bacillus cereus ATCC 10987″
    /mol_type=”genomic DNA”
    /strain=”ATCC 10987″
    /db_xref=”ATCC:10987″
    /db_xref=”taxon:222523″
    /plasmid=”pBc10987″
    gene join(207497..208369,1..687)
    /locus_tag=”BCE_A0242″
    /old_locus_tag=”BCEA0242″
    CDS join(207497..208369,1..687)

  5. Iddo says:

    @Andrey
    Probably not, sorry.

  6. Andrey says:

    I settled on this trick:
    all_cds_mask = numpy.zeros(len(seq_record.seq),dtype=bool)
    ind = numpy.arange(len(seq_record.seq)) # returns [0,1,2,…]
    for feature in seq_record.features:
    if feature.type == ‘CDS’:
    ind_cds = feature_cds.extract(ind) # returns coords of cds nucleotides
    all_cds_mask[ind_cds] = 1
    seq = numpy.fromstring(str(seq_record.seq),dtype=”S1″) #sequence as array of characters
    all_cds_seq = seq[all_cds_mask]
    all_non_cds_seq = seq[numpy.logical_not(all_cds_mask)]
    # or iterate through the mask if runs of non-cds are needed as separate records
    # reverse-compliment locations can be handled with a separate mask array
    there is still a a slight issue of how to treat fuzzy ranges like 30..>1030, where we do not actually know if, say, 1031 is intergenic or not