Short bioinformatics hacks: reading mate-pairs from a fastq file

By Iddo on November 10th, 2011

If you have a merged file of paired-end reads, here is a quick way to read them using Biopython:

from Bio import SeqIO
from itertools import izip_longest
# Loop over pairs of reads
readiter = SeqIO.parse(open(inpath), "fastq")
for rec1, rec2 in izip_longest(readiter, readiter):
    print rec1.id  # do something with rec1
    print rec2.id  # do something with rec2
    .
    .

izip_longest is fed the same iterator, readiter, twice. However, readiter.next(), which advances the iterator, is called on the first argument and then on the second argument. Since next() is being called on the same iterator, successive records are yielded.

By “merged file” I mean a fastq file where the mate-pairs are one after the other, as in:

@HWUSI-EAS687_112864999:8:1:1980:1055#CGAGAA/1
GTTTGTTTTAATTTCAGTGATTCATCAATTTTAAAAAAAGATGAGAATAATAACTATTATAAAAAGATAAATAAATGTGAAATTTATATTTCAAATTCAA
+
@:DGBGDDD@GGGDGDGDDGD@GGGGE@GGG?EBGGGADDDDGEG4?3BA*::7:GEGGGG>EDDDDAG@G><ADDGBGGGGEGGGGDGGGFEGGGEFDE
@HWUSI-EAS687_112864999:8:1:1980:1055#CGAGAA/2
AATGAATTGAATAAATATAAGAAGGATGATTAATAATAATTCTTGAATTTGAAATATAAATTTCACATTTATTTATCTTTTTATAATAGTTATTATTCTC
+
D?DB:@8EBDB>GG:=<DED79>>A8CEC8DGDGG8CEC<BGGG+BAAEA@D<2D71;:8AG<ABBEEEEBEDC?C>AACDDDCD>AD<@EFFDDDECBB
@HWUSI-EAS687_112864999:8:1:2274:1058#CGAGAA/1
CCTCAGTTAGCTTCTATTGGTATTAACATGGGTGAATTTACTAAACAATTTAATGACCAAACTAAAGATAAAAATGGTGAAGTTATACCTTGTATAATTA
+
GFGGGHHGHHHHHHGHHHHHGHHHHHHHFBGDBGEHHHHFHHEHHHHDFHCGFFFHHHHHHHGHHGGEBHEEFFCEE@E>A>>8A@EBE@BBB>BGEEDB
@HWUSI-EAS687_112864999:8:1:2274:1058#CGAGAA/2
AACTGGAGTTGTTTTAATTTCAAAAGTAAAAGATTTATCTTTAAATGCTGTAATTATACAAGGTATAACTTCACCATTTTTATCTTTAGTTTGGTCATTA
+
IIIIIIIIIIGIIIDHHIIIIDIHD8CGGGGDADEIIIIIIIHIIGBGD>DGDGGDGIGIIIIBGDG@GFHIIII<C<CCGHHHIHIBGDEEB3BEDEE@

The solution is derived from this Stackoverflow entry.

Of course, if the mate-pair files are not merged then you can use this script to merge them. Also illustrates using iterators from two different files in one for loop:

#!/usr/bin/env python
from Bio import SeqIO
import itertools
import sys
import os
def merge_fastq(fastq_path1, fastq_path2, outpath):
    outfile = open(outpath,"w")
    fastq_iter1 = SeqIO.parse(open(fastq_path1),"fastq")
    fastq_iter2 = SeqIO.parse(open(fastq_path2),"fastq")
    for rec1, rec2 in itertools.izip(fastq_iter1, fastq_iter2):
        SeqIO.write([rec1,rec2], outfile, "fastq")
    outfile.close()

if __name__ == '__main__':
    outpath = "%s.merged.fastq" % os.path.splitext(sys.argv[1])[0]
    merge_fastq(sys.argv[1],sys.argv[2],outpath)

Share and Enjoy: