Short bioinformatics hacks: merging fastq files

By Iddo on August 25th, 2011

So you received your mate-paired reads in two different files, and you need to merge them for your assembler. Here is a quick Python script to do that. You will need Biopython installed.

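#!/usr/bin/env python
from Bio import SeqIO
import sys
import os
# Copyright(C) 2011 Iddo Friedberg
# Released under Biopython license. http://www.biopython.org/DIST/LICENSE
# Do not remove this comment
def merge_fastq(fastq_path1, fastq_path2, outpath):
    """Interleave the records of two FASTQ files into outpath."""
    # The with block closes all three files, even if parsing fails.
    with open(fastq_path1) as fastq1, open(fastq_path2) as fastq2, \
            open(outpath, "w") as outfile:
        fastq_iter1 = SeqIO.parse(fastq1, "fastq")
        fastq_iter2 = SeqIO.parse(fastq2, "fastq")
        # zip is lazy in Python 3, so only one pair of records is in
        # memory at a time. On Python 2, use itertools.izip instead.
        for rec1, rec2 in zip(fastq_iter1, fastq_iter2):
            SeqIO.write([rec1, rec2], outfile, "fastq")

if __name__ == '__main__':
    # Name the output after the first input: X.fastq -> X.merged.fastq
    outpath = "%s.merged.fastq" % os.path.splitext(sys.argv[1])[0]
    merge_fastq(sys.argv[1], sys.argv[2], outpath)

The neat trick is the zip call in the for loop: it pairs up the two iterators so the loop walks both files in parallel, two fastq records at a time. zip is lazy in Python 3; on Python 2, use itertools.izip to get the same behavior.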
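One caveat with zipping two iterators: the loop stops quietly at the end of the shorter input, so a truncated file would go unnoticed. Here is a minimal sketch of a stricter variant, assuming you would rather get an error in that case. The name interleave_strict is a hypothetical helper, not part of Biopython, and plain strings stand in for the fastq records.

```python
from itertools import zip_longest

_MISSING = object()  # sentinel marking an iterator that ran out early


def interleave_strict(iter1, iter2):
    """Yield items alternately from two iterables; fail if lengths differ."""
    for item1, item2 in zip_longest(iter1, iter2, fillvalue=_MISSING):
        if item1 is _MISSING or item2 is _MISSING:
            raise ValueError("inputs contain different numbers of records")
        yield item1
        yield item2


# Plain strings stand in for fastq records here.
print(list(interleave_strict(["r1/1", "r2/1"], ["r1/2", "r2/2"])))
# prints ['r1/1', 'r1/2', 'r2/1', 'r2/2']
```

Because it only needs two iterables, the same generator should work on the two SeqIO.parse iterators, with its output passed straight to SeqIO.write.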

How to use this script: save it to a file called merge_fastq (or whatever you like), then make it executable:

$ chmod +x merge_fastq

And you are ready to go.

$ ./merge_fastq myseq_1_.fastq myseq_2_.fastq

The merged file will be called myseq_1_.merged.fastq
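The output name comes from stripping the last extension off the first input file and appending .merged.fastq. A small sketch of that rule, with merged_name as a hypothetical helper wrapping the same expression the script uses:

```python
import os


def merged_name(first_input):
    """Return the output path derived from the first input file."""
    return "%s.merged.fastq" % os.path.splitext(first_input)[0]


print(merged_name("myseq_1_.fastq"))        # myseq_1_.merged.fastq
print(merged_name("reads/myseq_1_.fq.gz"))  # reads/myseq_1_.fq.merged.fastq
```

Note that os.path.splitext removes only the final extension, so a double extension such as .fq.gz leaves the .fq in place. The output also lands in the same directory as the first input file.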

Categorized under: Bioinformatics, programming, Software.
Tagged with: Bioinformatics, programming, second generation sequencing, sequencing, short read sequencing.


Byte Size Biology

The musings and ravings of a computational biologist about science, computers, music and, you know, stuff
