Short bioinformatics hacks: reading mate-pairs from a fastq file
If you have a merged file of paired-end reads, here is a quick way to read them using Biopython:
from Bio import SeqIO from itertools import izip_longest # Loop over pairs of reads readiter = SeqIO.parse(open(inpath), "fastq") for rec1, rec2 in izip_longest(readiter, readiter): print rec1.id # do something with rec1 print rec2.id # do something with rec2 . .
izip_longest is fed the same iterator, readiter, twice. However, readiter.next(), which advances the iterator, is called on the first argument and then on the second argument. Since next() is being called on the same iterator, successive records are yielded.
By “merged file” I mean a fastq file where the mate-pairs are one after the other, as in:
@HWUSI-EAS687_112864999:8:1:1980:1055#CGAGAA/1 GTTTGTTTTAATTTCAGTGATTCATCAATTTTAAAAAAAGATGAGAATAATAACTATTATAAAAAGATAAATAAATGTGAAATTTATATTTCAAATTCAA + @:DGBGDDD@GGGDGDGDDGD@GGGGE@GGG?EBGGGADDDDGEG4?3BA*::7:GEGGGG>EDDDDAG@G><ADDGBGGGGEGGGGDGGGFEGGGEFDE @HWUSI-EAS687_112864999:8:1:1980:1055#CGAGAA/2 AATGAATTGAATAAATATAAGAAGGATGATTAATAATAATTCTTGAATTTGAAATATAAATTTCACATTTATTTATCTTTTTATAATAGTTATTATTCTC + D?DB:@8EBDB>GG:=<DED79>>A8CEC8DGDGG8CEC<BGGG+BAAEA@D<2D71;:8AG<ABBEEEEBEDC?C>AACDDDCD>AD<@EFFDDDECBB @HWUSI-EAS687_112864999:8:1:2274:1058#CGAGAA/1 CCTCAGTTAGCTTCTATTGGTATTAACATGGGTGAATTTACTAAACAATTTAATGACCAAACTAAAGATAAAAATGGTGAAGTTATACCTTGTATAATTA + GFGGGHHGHHHHHHGHHHHHGHHHHHHHFBGDBGEHHHHFHHEHHHHDFHCGFFFHHHHHHHGHHGGEBHEEFFCEE@E>A>>8A@EBE@BBB>BGEEDB @HWUSI-EAS687_112864999:8:1:2274:1058#CGAGAA/2 AACTGGAGTTGTTTTAATTTCAAAAGTAAAAGATTTATCTTTAAATGCTGTAATTATACAAGGTATAACTTCACCATTTTTATCTTTAGTTTGGTCATTA + IIIIIIIIIIGIIIDHHIIIIDIHD8CGGGGDADEIIIIIIIHIIGBGD>DGDGGDGIGIIIIBGDG@GFHIIII<C<CCGHHHIHIBGDEEB3BEDEE@
The solution is derived from this Stackoverflow entry.
Of course, if the mate-pair files are not merged then you can use this script to merge them. Also illustrates using iterators from two different files in one for loop:
#!/usr/bin/env python from Bio import SeqIO import itertools import sys import os def merge_fastq(fastq_path1, fastq_path2, outpath): outfile = open(outpath,"w") fastq_iter1 = SeqIO.parse(open(fastq_path1),"fastq") fastq_iter2 = SeqIO.parse(open(fastq_path2),"fastq") for rec1, rec2 in itertools.izip(fastq_iter1, fastq_iter2): SeqIO.write([rec1,rec2], outfile, "fastq") outfile.close() if __name__ == '__main__': outpath = "%s.merged.fastq" % os.path.splitext(sys.argv[1])[0] merge_fastq(sys.argv[1],sys.argv[2],outpath)
Comments are closed.