Short Bioinformatics Hacks: Glimmer Splitter

By Iddo on October 29th, 2009

Glimmer is a program that predicts ORFs in bacterial and archeal genomes. The input is the assembled genome FASTA file, the output are several files of the predictions in different stages. The terminal output file is the .predict file. which looks something like this:

>NODE_1_length_38001_cov_935.551880
orf00001 481      362 -2     1.45
orf00002      451      567 +1     0.59
orf00004     3691      623 -2     5.43
orf00006     4254     5228 +3     4.65
orf00007     5204     5326 +2     7.04
orf00009     5587     6921 +1     5.20
orf00011     7062     8135 +3     5.48
orf00013     8327     9238 +2     4.26
orf00014     9241    10116 +1     3.26
orf00015    10119    10280 +3     2.81
orf00016    10296    10673 +3     6.61
orf00017    11288    10683 -3     6.35
orf00018    11910    11305 -1     7.18
orf00019    12313    11894 -2     5.22

The first column is the predicted ORF ID, the second is the start position, the third is end position, the fourth is the reading frame used and the fifth is a reliability score. For full details and how to install glimmer see Glimmer’s documentation. Glimmer3 also comes packaged with Ubuntu.

Here is a short Python script whose input is the genomic FASTA file which contains a single, assembled sequence and the glimmer file. The output is a FASTA file containing a separate entry for each predicted gene. Of course, it uses Biopython.

#!/usr/bin/env python
import sys
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO

glimmer_file = sys.argv[1]
fasta_file = sys.argv[2]

# Read the sequence file
seq_record = SeqIO.parse(open(fasta_file),"fasta").next()

outseqrecords = []
# Read the glimmer file, record by record
for inline in file(glimmer_file):
    if '>' in inline:
        seqname = inline.split()[0][1:]
        outfilename = "%s_g3.tfa" % (seqname)
        continue
    if "orf" not in inline:
        continue
    orfname, sbegin, send, rf, score = inline.strip().split()
    sbegin = int(sbegin)
    send = int(send)
    rf = int(rf)
    # reverse complement
    if rf < 0:
        sbegin, send = send, sbegin
    sbegin -= 1     # Python indexes start a 0
    score = float(score)
    # split the sequence record
    newseq = seq_record.seq[sbegin:send]
    if rf < 0:
        newseq = newseq.reverse_complement()
    # Add a sequence record to the output
    seqrecord_description = "begin=%d end=%d rf=%d score=%.2f" % (sbegin+1, send, rf, score)
    outseqrecords.append(SeqRecord(newseq,id=seqname+"_"+orfname, description=seqrecord_description))

SeqIO.write(outseqrecords,open(outfilename,"w"),"fasta")

To run simply type:

./split_by_g3.py predict_file_name fasta_file_name

The output will be named with a _g3.tfa suffix. You can change that in line 17 (variable “outfilename”).

Share and Enjoy:

Categorized under: Bioinformatics, Genomics.
Tagged with: Bioinformatics, gene finding, genomics, ORF finding, programming.

5 Responses to “Short Bioinformatics Hacks: Glimmer Splitter”

Andrew Perry says:

29-October-2009 at 10:51 PM

Thanks for sharing this Iddo … your timing is perfect 🙂

I’ve been meaning to run GlimmerHMM on a draft eukaryotic genome for a particular project, and have been putting it off because I needed to write a script to do this step. I’ll need to modify it to concatenate together exons from the same transcript, but otherwise it should do the trick.

This seemed like something that should already implemented in some software package, somewhere, and I assumed if I held off for a little while I would find it (the analysis isn’t urgent, yet). I wonder what other people use when faced with converting GFF formatted ‘orf’ features to FASTA formatted sequences ?
Andrew Perry says:

29-October-2009 at 11:08 PM

Whoops, looks like in my excitement, I responded too soon. GlimmerHMM outputs slightly different ‘predict’ format (or, alternatively, GFF3 format). If I don’t find a pre-written solution first, I’ll probably write my script to do GFF3 exon features -> FASTA ORFs.
Peter says:

30-October-2009 at 7:07 AM

Hi Iddo, Nice post 🙂

Did you know that with Biopython 1.45 or later, if the file contains *exactly* once sequence, you can do:
seq_record = SeqIO.read(open(fasta_file),”fasta”)

instead of the more cryptic:
seq_record = SeqIO.parse(open(fasta_file),”fasta”).next()

For the final steps of the script, since Biopython 1.50 you can slice a SeqRecord, and I’ve started thinking about how a SeqRecord reverse complement method might work:
http://lists.open-bio.org/pipermail/biopython-dev/2009-October/006851.html
Maybe you could share your thoughts on this over on the mailing list?
Iddo says:

30-October-2009 at 8:48 AM

@Andrew Perry
You may want to check out Brad Chapman’s blog:

http://bcbio.wordpress.com/2009/03/22/mapreduce-implementation-of-gff-parsing-for-biopython/
edge says:

7-March-2010 at 10:22 PM

@Andrew Perry
Hi Perry,

Do you already successful write a script to generate GFF3 format by using GlimmerHMM output?
Can I ask your opinion about the way to write the script?
Thanks a lot for your sharing.
5

Byte Size Biology

The musings and ravings of a computational biologist about science, computers, music and, you know, stuff

Short Bioinformatics Hacks: Glimmer Splitter

5 Responses to “Short Bioinformatics Hacks: Glimmer Splitter”

Categories

Tags

Recent Posts

Recent Comments

Other stuff I read

Science blogs I like to read

Twitter