The Genome Center at the University of California, Davis and researchers at UC Santa Cruz are organizing a genome assembly competition which they call The Assemblathon. They have released two simulated genomes for competing groups to assemble as best they can. Assemblies are due February 6th, 2011, so there is still time if you would like to showcase your genome assembly skillz. The rules are pretty straightforward:
- A total of two genome sequences (originally three) will be made available to participants.
- One genome sequence will be a real set of Illumina reads from an unspecified organism. Update: real Illumina data will now be available as part of Assemblathon 2.
- The other two genomes will be derived from a pair of related ‘virtual’ species whose genomes have been artificially evolved using the EVOLVER program (Edgar, R.C., Asimenos, G., Batzoglou, S., and Sidow, A.). The estimated divergence time between the two virtual species is 100 million years.
- One synthetic genome will already be assembled and participants will be able to (optionally) use information from this ‘sister’ species’ genome to guide the assembly of the other unassembled genome (which will exist as a set of synthetic Illumina reads).
- Reads from the virtual assembly will be made to simulate the dynamics of ‘real’ Illumina reads as much as possible and will be derived from a mixture of paired reads and mate pairs with a range of insert sizes.
- Genome data will be available to download from December 1st, 2010.
- Participants have until February 6th 2011 to submit their assemblies for the synthetic genome.
- Assemblies will be assessed using a variety of metrics: one of the goals of the Assemblathon is to devise new ways of quantifying and qualifying genome assemblies.
- Results from the Assemblathon will be discussed by invited participants at the Genome Assembly Workshop in March 2011.
Why do we need an Assemblathon?
There are many genome assembly programs out there, but it is not always clear which is the best. Part of the problem is that it is not easy to define what ‘best’ is, and an assembler that works well in one situation (e.g. assembling a high-repeat-content genome) might not fare as well in others. Part of the reason for organizing this Assemblathon is to see if we can produce newer metrics for assessing the quality of a genome assembly that will complement existing statistics such as N50 contig size.
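For readers who have not met N50 before, here is a minimal sketch of how it is commonly computed. It assumes the usual definition: sort the contigs from longest to shortest, and N50 is the length of the contig at which the running total first reaches half of the total assembly length. The function name `n50` and the example contig lengths are made up for illustration; they are not part of the Assemblathon materials.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    Assumes the common definition: the length L such that contigs of
    length >= L together cover at least half of the total assembly.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("need at least one contig length")

    half_total = sum(lengths) / 2
    running_total = 0
    for length in lengths:
        running_total += length
        # The first contig that pushes the running total to (or past)
        # half of the assembly size defines the N50.
        if running_total >= half_total:
            return length


# Hypothetical assembly with six contigs (lengths in kb), 290 kb in total.
# The two longest contigs (80 + 70 = 150) already cover half of it.
print(n50([80, 70, 50, 40, 30, 20]))  # prints 70
```

Note that N50 only describes how contiguous an assembly is. A misassembled genome that wrongly joins contigs can score a higher N50 than a correct one, which is one reason complementary metrics are worth looking for.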
The ever-changing landscape of sequencing technology also means that it is important to continually appraise new methods as well as re-appraise old ones. Assemblers that work well with the short reads from ‘next generation’ sequencers (e.g. Illumina and SOLiD) might not work as well (or at all) with reads from even newer technologies such as the sequencers from PacBio. There is also a separate, but related, need to assemble transcriptomes from RNA-Seq data.
Another, more fundamental, need for a project such as the Assemblathon is that even when we believe that an assembler has done a good job of assembling a genome, we are never entirely sure what the actual solution is. This is a bit like putting a jigsaw together where the pieces are all one of four different colors; how do you know what the final picture is supposed to look like? To tackle this issue, the Assemblathon will provide participants with simulated reads from synthetic genomes. By starting with a complete genome that has been generated in silico, we will know what the final ‘answer’ should be. There will also be a ‘real’ genome sequence to assemble.
As someone involved in two analogous efforts (prediction of protein function and of genotype/phenotype connections), I think this is a terrific idea that will help us advance the field of sequence assembly. Of course, I would have called this effort CASA: Critical Assessment of Sequence Assemblies. First, it fits well with the “Critical Assessment” meme, which describes community efforts of bioinformaticians to assess how well their software is performing (e.g. CASP, CAFASP, CAPRI, CAGI, CAFA etc.). Second, you can have a lot of fun with the word casa, especially in California.