Short bioinformatics hacks, ch. 2: chunk it.
First, a non-bioinformatic one liner, which is very relevant to most of us working on 3 different machines simultaneously, not including the 80 in our cluster. ssh-ing and giving your password each time is painful, and makes it almost impossible to do scripted file transfers, like backups. A good solution is shared key ssh in which the host machine recognizes the trusted machine as a client. Thus, after an initial setup you can ssh without typing in your password each time. However, if you want to establish a shared key ssh to a remote machine, you have to (1) generate a local key by running ssh-keygen locally; (2) scp the ~/.ssh/id_rsa.pub to the remote server (3) ssh to the remote server and append that key to your remote ~/.ssh/authorized_keys file. Which means 2 password typings at least, and lots of stuff to recall (or to search on the web) if you do not do this too often. Kyle Rankin comes to our rescue offering this one liner to solve the problem in the June 2009 issue of Linux Journal:
ssh email@example.com "cat >> ~/.ssh/authorized_keys" < ~/.ssh/id_rsa.pub
This hack will establish your shared key ssh with one fell Enter keystroke. Done.
And now, to bioinformatics.
I am partial to Python and Biopython, so the following example will be in Python, using the Biopython package. Biopython is an open-source package with many bioinformatic tools. You can download and install it from the biopython.org site, or if you are using Linux, some distributions carry their own version. For example, in Ubuntu installation is as easy as:
sudo aptitude install python-biopython python-biopython-doc
The problem at hand is splitting a sequence file containing many sequences (thousands or millions) into something more manageable for, say, over the network analysis, where bandwidth or file size are restricted. Or maybe as a first step in Embarrassingly ParallelTM analysis on a cluster computer. However, before I continue I should mention that the Biopython Cookbook has another, more generic solution to this problem.
#!/usr/bin/python # Copyright (c) 2009 Iddo Friedberg. Distributed under the Biopython # license available from http://www.biopython.org/DIST/LICENSE import os import sys import getopt from Bio import SeqIO def chunk_sequences(infile, outfile_basename=None, chunk_size=100, seq_format="fasta"): if not outfile_basename: outfile_basename = os.path.splitext(infile.name) n = 0 chunk_list =  for seq_record in SeqIO.parse(infile, seq_format): n += 1 chunk_list.append(seq_record) if n % chunk_size == 0: fout = open("%s_%d.%s" % (outfile_basename, n//chunk_size, seq_format), "w") SeqIO.write(chunk_list, fout, seq_format) chunk_list =  if chunk_list: fout = open("%s_%d.%s" % (outfile_basename, n//chunk_size + 1, seq_format), "w") SeqIO.write(chunk_list, fout, seq_format) def usage(): print "usage: chunk_sequences -i infile [-o outfile] [-s size] [-f format]" if __name__ == '__main__': try: opts, args = getopt.getopt(sys.argv[1:], "i:o:s:f:") except getopt.GetoptError: usage() sys.exit(2) inpath = None outfile_basename = None chunk_size = 100 seq_format = "fasta" for o, a in opts: if o == "-i": inpath = a elif o == "-o": outfile_basename = a elif o == "-s": chunk_size = int(a) elif o == "-f": seq_format = a if not inpath: usage() sys.exit(2) chunk_sequences(open(inpath), outfile_basename, chunk_size, seq_format)
Walking through the code. Line 7 imports a Biopython package: SeqIO reads and writes sequence files in various formats. Most common formats are supported for reading, but not all are supported for writing. This changes over Biopython versions, so check the documentation of your own Biopython installation to see which file formats are write-supported. In any case, FASTA format is supported, and that is what we will be using here.
Lines 8-20 are the actual splitting, or “chunking” function. Function definition of chunk_sequences on line 8: infile is the seqeunce file to be read, outfile_basename is the base for the output files (that will be numbered, eg. “basefile_1.fa, basefile_2.fa”, etc.); chunk_size is the number of sequences per output file (default 100).
Lines 9-12: initializations.
Lines 13-22: read the FASTA file in a for loop using SeqIO’s parser functionality. chunk_list is a mutable array, the Python workhorse, called a list. We append sequence records to chunk_list until the number of records in a chunk is reached (checked in line 14). If it is reached, the chunk file is written to, and the chunk counter is incremented. Once we leave the loop, we write the remaining sequences, if any, one last time.
The function chunk_sequences can be called from a Python shell, from another Python function. The file can also be run as a command from the operating system shell. That is what lines 25-53 are for
Lines 24-25: a short function writing the text of an error message.
Line 27: a standard Python toy t check if the program is being run from the command line.
Lines 28- end: make use of Python’s arguments/ options command line parser, using the Python implementation of GNU getopt. For more on how this works, read here.
To split a large FASTA file into a series of files containing 500 sequences each:
chunk_sequences mylargefile.fa -o smallchunk -s 500 -f fasta
This will create a series of files: smallchunk_1.fasta smallchunk_2.fasta and so on. Yay.
Update: thanks to Peter for the fixes (see comment section). I updated the script.