Closing gaps

By Iddo on May 30th, 2010

Geek alert: this post for coders.
So you sequenced your genome, reached an optimally small number of contigs, they look sane, and now you would like to see what you need for the finishing stage. Namely, how many gaps you have and what are their sizes. UPDATE: “might just be worth clarifying this is for gaps in scaffolds produced by short-read de novo assemblers like Velvet, SOAPdenovo working on paired-end data. Rather than ‘gaps’ (technically more often unresolvable repeats than true gaps) between contigs which would require a different strategy to resolve lengths between.” (Thanks for this one Nick!)

This little script will help. Note that you need to download and install biopython for it to work.

Short version: to run this on a Un*x machine, you need to download the file, then:

chmod +x uncalled_base_stats.py

./uncalled_base_stats.py contig-filename.fasta

#!/usr/bin/env python
import sys
import re
from Bio import SeqIO
"""
input: a contig file, FASTA formatted.
output: a histogram of gaps in the contig file, and contig node names
output format:
L N [n1, n2,...,nk]
Where:
L is the length of the gap
N is the number of gap lengths
The list is square brackets are the sequence ids nodes where gaps of this length are
found
"""

def uncalled_base_stats(seq_record):
    histo = {}
    for n_match in re.finditer('N+',seq_record.seq.tostring()):
        n_len = n_match.end()- n_match.start()
        histo[n_len] = histo.get(n_len,0) + 1
    return histo

if __name__ == '__main__':
    total_histo = {}
    contig_ids = {}
    for seq_record in SeqIO.parse(open(sys.argv[1]), "fasta"):
        histo = uncalled_base_stats(seq_record)
        for i in histo:
            total_histo[i] = total_histo.get(i,0) + histo[i]
            contig_ids.setdefault(i,[]).append(seq_record.id)
    gap_sizes = total_histo.keys()
    gap_sizes.sort()
    for i in gap_sizes:
        print i, total_histo[i], contig_ids[i]

Line 1: Linux shell magic line for an executable file. You can omit this if you are using another OS.

Line 2-4: necessary imports, including biopython.

Lines 17-22: the function uncalled_base_stats accepts a biopython sequence object (a contig) and returns a histogram in the form of a dictionary (a Python associative array) called histo. The dictionary keys are the gap lengths and the values are the respective number of contigs. For example:

{5:10, 17:3}

means that there are 10 gaps of length 5, and 3 gaps of length 17.

Line 18: initiate the dictionary histo

Line 19: here we create an iterator that loops through matches of the regular expression for finding one or more “N” characters, “N+”.

Line 20: n_match is a Python regexp object, returned by the re.finditer method, if there is a match to the “N+” regexp. The .end() method is the position of the last character of the match, the .start() is the first one. The difference is the length of the match. Note that this is Python indexing, so .end() actually gives the index of the position after the match, so no need to add one to the difference beween n_match.end() and n_match.start()

Line 21: add one to the histogram dictionary

Line 22: function returns the histogram dictionary

Line 24-35: the bit that gets executed when you run the program from your OS.

24-26: total_histo: a sum of all the distributions of “N” in the reads; contig-ids: the ids for all the contigs that have a sequence of N of a certain length. The key is the sequence length.

27: Loop on the contig FASTA formatted file. Each loop iteration processes a single contig record.

28: Find consecutive groups of gaps in the contig record. Put them in the histogram for a single record, as described in lines 17-22.

29-31: Pool the histogram for this specific contig into a histogram for all contigs named total_histo.

31: Another dictionary, contig_ids, has the gap sizes as keys, and an accumulated list of contig IDs having thoise gap sizes as values. Example: we have 3 contigs, whose ids are “contig_a”, “contig_b” and “contig_c”. contig_a has two gaps of lengths 10 and 15. contig_b has three gaps of 10, 15 and 30. contig_c has one gap of length 5. The dictionary contig_ids will look like:

contig_ids = {5: ['contig_c'],
        10: ['contig_a','contig_b'],
        15:['contig_a','contig_b'],
        20:['contig_b']}

32-33: gap_sizes is a sorted list of all gap sizes

34-35: print a histogram of all gap sizes, and the contigs invovled. Using the example above, the output will look like:

5    1  ['contig_c']
10 2    ['contig_a', 'contig_b']
15 2    ['contig_a','contig_b']
20 1    ['contig_c']

I originally wrote this script for a colleague who was worried about how many gaps he will have to close in the finishing process. Turned out to be a very popular bit of code…

Share and Enjoy:

Categorized under: Bioinformatics, Genomics.
Tagged with: Bioinformatics, Biopython, second generation sequencing, sequencing, Software.

One Response to “Closing gaps”

Nick Loman says:

30-May-2010 at 1:36 PM

Nice one Iddo, might just be worth clarifying this is for N gaps in scaffolds produced by short-read de novo assemblers like Velvet, SOAPdenovo working on paired-end data. Rather than ‘gaps’ (technically more often unresolvable repeats than true gaps) between contigs which would require a different strategy to resolve lengths between.

Byte Size Biology

The musings and ravings of a computational biologist about science, computers, music and, you know, stuff

Closing gaps

One Response to “Closing gaps”

Categories

Tags

Recent Posts

Recent Comments

Other stuff I read

Science blogs I like to read

Twitter