Short Bioinformatics Hacks: Glimmer Splitter
Glimmer is a program that predicts ORFs in bacterial and archeal genomes. The input is the assembled genome FASTA file, the output are several files of the predictions in different stages. The terminal output file is the .predict file. which looks something like this:
>NODE_1_length_38001_cov_935.551880
orf00001 481 362 -2 1.45
orf00002 451 567 +1 0.59
orf00004 3691 623 -2 5.43
orf00006 4254 5228 +3 4.65
orf00007 5204 5326 +2 7.04
orf00009 5587 6921 +1 5.20
orf00011 7062 8135 +3 5.48
orf00013 8327 9238 +2 4.26
orf00014 9241 10116 +1 3.26
orf00015 10119 10280 +3 2.81
orf00016 10296 10673 +3 6.61
orf00017 11288 10683 -3 6.35
orf00018 11910 11305 -1 7.18
orf00019 12313 11894 -2 5.22
The first column is the predicted ORF ID, the second is the start position, the third is end position, the fourth is the reading frame used and the fifth is a reliability score. For full details and how to install glimmer see Glimmer’s documentation. Glimmer3 also comes packaged with Ubuntu.
Here is a short Python script whose input is the genomic FASTA file which contains a single, assembled sequence and the glimmer file. The output is a FASTA file containing a separate entry for each predicted gene. Of course, it uses Biopython.
#!/usr/bin/env python import sys from Bio.SeqRecord import SeqRecord from Bio import SeqIO glimmer_file = sys.argv[1] fasta_file = sys.argv[2] # Read the sequence file seq_record = SeqIO.parse(open(fasta_file),"fasta").next() outseqrecords = [] # Read the glimmer file, record by record for inline in file(glimmer_file): if '>' in inline: seqname = inline.split()[0][1:] outfilename = "%s_g3.tfa" % (seqname) continue if "orf" not in inline: continue orfname, sbegin, send, rf, score = inline.strip().split() sbegin = int(sbegin) send = int(send) rf = int(rf) # reverse complement if rf < 0: sbegin, send = send, sbegin sbegin -= 1 # Python indexes start a 0 score = float(score) # split the sequence record newseq = seq_record.seq[sbegin:send] if rf < 0: newseq = newseq.reverse_complement() # Add a sequence record to the output seqrecord_description = "begin=%d end=%d rf=%d score=%.2f" % (sbegin+1, send, rf, score) outseqrecords.append(SeqRecord(newseq,id=seqname+"_"+orfname, description=seqrecord_description)) SeqIO.write(outseqrecords,open(outfilename,"w"),"fasta")
To run simply type:
./split_by_g3.py predict_file_name fasta_file_name
The output will be named with a _g3.tfa suffix. You can change that in line 17 (variable “outfilename”).
Thanks for sharing this Iddo … your timing is perfect 🙂
I’ve been meaning to run GlimmerHMM on a draft eukaryotic genome for a particular project, and have been putting it off because I needed to write a script to do this step. I’ll need to modify it to concatenate together exons from the same transcript, but otherwise it should do the trick.
This seemed like something that should already implemented in some software package, somewhere, and I assumed if I held off for a little while I would find it (the analysis isn’t urgent, yet). I wonder what other people use when faced with converting GFF formatted ‘orf’ features to FASTA formatted sequences ?
Whoops, looks like in my excitement, I responded too soon. GlimmerHMM outputs slightly different ‘predict’ format (or, alternatively, GFF3 format). If I don’t find a pre-written solution first, I’ll probably write my script to do GFF3 exon features -> FASTA ORFs.
Hi Iddo, Nice post 🙂
Did you know that with Biopython 1.45 or later, if the file contains *exactly* once sequence, you can do:
seq_record = SeqIO.read(open(fasta_file),”fasta”)
instead of the more cryptic:
seq_record = SeqIO.parse(open(fasta_file),”fasta”).next()
For the final steps of the script, since Biopython 1.50 you can slice a SeqRecord, and I’ve started thinking about how a SeqRecord reverse complement method might work:
http://lists.open-bio.org/pipermail/biopython-dev/2009-October/006851.html
Maybe you could share your thoughts on this over on the mailing list?
@Andrew Perry
You may want to check out Brad Chapman’s blog:
http://bcbio.wordpress.com/2009/03/22/mapreduce-implementation-of-gff-parsing-for-biopython/
@Andrew Perry
Hi Perry,
Do you already successful write a script to generate GFF3 format by using GlimmerHMM output?
Can I ask your opinion about the way to write the script?
Thanks a lot for your sharing.
5