Short bioinformatics hacks, ch. 2: chunk it.

First, a non-bioinformatic one liner, which is very relevant to most of us working on 3 different machines simultaneously, not including the 80 in our cluster. ssh-ing and giving your password each time is painful, and makes it almost impossible to do scripted file transfers, like backups. A good solution is shared key ssh in which the host machine recognizes the trusted machine as a client. Thus, after an initial setup you can ssh without typing in your password each time. However, if you want to establish a shared key ssh to a remote machine, you have to (1) generate a local key by running ssh-keygen locally; (2) scp the ~/.ssh/id_rsa.pub to the remote server (3) ssh to the remote server and append that key to your remote ~/.ssh/authorized_keys file. Which means 2 password typings at least, and lots of stuff to recall (or to search on the web) if you do not do this too often. Kyle Rankin comes to our rescue offering this one liner to solve the problem in the June 2009 issue of Linux Journal:

ssh user@server.example.net "cat >> ~/.ssh/authorized_keys" < ~/.ssh/id_rsa.pub

This hack will establish your shared key ssh with one fell Enter keystroke. Done.

And now, to bioinformatics.

I am partial to Python and Biopython, so the following example will be in Python, using the Biopython package. Biopython is an open-source package with many bioinformatic tools. You can download and install it from the biopython.org site, or  if you are using Linux, some distributions carry their own version. For example, in Ubuntu installation is as easy as:

sudo aptitude install python-biopython python-biopython-doc

The problem at hand is splitting a sequence file containing many sequences (thousands or millions) into something more manageable for, say, over the network analysis, where bandwidth or file size are restricted. Or maybe as a first step in Embarrassingly ParallelTM analysis on a cluster computer. However, before I continue I should mention that the Biopython Cookbook has another, more generic solution to this problem.

#!/usr/bin/python
# Copyright (c) 2009 Iddo Friedberg. Distributed under the Biopython
# license available from  http://www.biopython.org/DIST/LICENSE
import os
import sys
import getopt
from Bio import SeqIO
def chunk_sequences(infile, outfile_basename=None, chunk_size=100, seq_format="fasta"):
    if not outfile_basename:
        outfile_basename = os.path.splitext(infile.name)[0]
    n = 0
    chunk_list = []
    for seq_record in SeqIO.parse(infile, seq_format):
        n += 1
        chunk_list.append(seq_record)
        if n % chunk_size == 0:
            fout = open("%s_%d.%s" % (outfile_basename, n//chunk_size, seq_format), "w")
            SeqIO.write(chunk_list, fout, seq_format)
            chunk_list = []
    if chunk_list:
        fout = open("%s_%d.%s" % (outfile_basename, n//chunk_size + 1, seq_format), "w")
        SeqIO.write(chunk_list, fout, seq_format)

def usage():
    print "usage: chunk_sequences -i infile [-o outfile] [-s size] [-f format]"

if __name__ == '__main__':
    try:
        opts, args = getopt.getopt(sys.argv[1:], "i:o:s:f:")
    except getopt.GetoptError:
        usage()
        sys.exit(2)
    inpath = None
    outfile_basename = None
    chunk_size = 100
    seq_format = "fasta"

    for o, a in opts:
        if o == "-i":
            inpath = a
        elif o == "-o":
            outfile_basename = a
        elif o == "-s":
            chunk_size = int(a)
        elif o == "-f":
            seq_format = a

    if not inpath:
        usage()
        sys.exit(2)

    chunk_sequences(open(inpath), outfile_basename, chunk_size, seq_format)


Walking through the code. Line 7 imports a Biopython package:  SeqIO reads and writes sequence files in various formats. Most common formats are supported for reading, but not all  are supported for writing. This changes over Biopython versions, so check the documentation of your own Biopython installation to see which file formats are write-supported. In any case, FASTA format is supported, and that is what we will be using here.

Lines 8-20 are the actual splitting, or “chunking” function. Function definition of chunk_sequences on line 8: infile is the seqeunce file to be read, outfile_basename is the base for the output files (that will be numbered, eg.  “basefile_1.fa, basefile_2.fa”, etc.); chunk_size is the number of sequences per output file (default 100).

Lines 9-12: initializations.

Lines 13-22: read the FASTA file in a for loop using SeqIO’s parser functionality. chunk_list is a mutable array, the Python workhorse, called a list. We append sequence records to chunk_list until the number of records in a chunk is reached (checked in line 14). If it is reached, the chunk file is written to, and the chunk counter is incremented. Once we leave the loop, we write the remaining sequences, if any, one last time.

The function chunk_sequences can be called from a Python shell, from another Python function. The file can also be run as a command from the operating system shell. That is what lines 25-53 are for

Lines 24-25: a short function writing the text of an error message.

Line 27: a standard Python toy t check if the program is being run from the command line.

Lines 28- end: make use of Python’s arguments/ options command line parser, using the Python implementation of GNU getopt. For more on how this works, read here.

chunky_monkey

To split a large FASTA file into a series of files containing 500 sequences each:

chunk_sequences mylargefile.fa -o smallchunk  -s 500 -f fasta

This will create a series of files: smallchunk_1.fasta  smallchunk_2.fasta and so on. Yay.

Happy chunking.

Update: thanks to Peter for the fixes (see comment section). I updated the script.

Share and Enjoy:
  • Fark
  • Digg
  • Technorati
  • del.icio.us
  • StumbleUpon
  • Facebook
  • Reddit
  • Twitter
  • FriendFeed
  • PDF
  • email
  • Print
  • Google Bookmarks

5 Responses to “Short bioinformatics hacks, ch. 2: chunk it.”

  1. Peter Cock says:

    Hi Iddo,

    Nice post!

    I hope you don’t mind if I nit pic, and point out a few minor things you could improve:

    (1) Use enumerate(…) to keep track of the record number index (n), rather than incrementing it in the loop. This is just a style thing.

    (2) Close your output handles with a couple of fout.close() statements. This should be happening for you behind the scenes when the script finishes, but it is good practice.

    (3) Use an explicit integer divide (n//chunk_size rather than n/chunk_size). Good practice looking ahead to Python 3.

    (4) You don’t actually need to import the SeqRecord class in this example.

    Peter

  2. Iddo says:

    Hi Peter: enumerate will start n at 0, I need it at 1. I could do:

    for i, seq_record in enumerate(SeqIO.parse(infile, seq_format)):
           n = i+1 #continue to use n as before
    .
    .
    .

    But to me, that is less clear than what I am doing right now.

    close output handles: good point

    explicit divide: you mean I am not working in Python 1.5 anymore? Good point.

    SeqRecord: D’oh!

    I’ll fix this soon, based on your suggestions. Thanks!

  3. gioby says:

    I suggest you to install biopython with easy_install, rather than with the package manager from your distribution.
    The egg on pypi is usually more up-to-date, moreover easy_install can work in any operating system.
    Usually easy_install is included in the package python-setuptools in debian.

    $: sudo apt-get install python-setuptools
    $: sudo easy_install biopython

  4. gioby says:

    Well, I would also add some good documentation and at least one doctest to your scripts.
    Moreover, speaking of bioinformatics tips, I would put at least an example to a makefile 🙂

  5. must have a look, and plz pass ur review and comments on this site on bioinformatics http://bionet.awardspace.info/index.html