Squeezing DNA
The state of biology today:
Our main problem is turning these DNA data into useful information. Finding genes and other functional genomic elements, characterizing them, understanding their function and their impact on Life – all these are challenges that will remain with us for a long time, and which have revolutionized biology into the information science it is today.
Before all that, science is a collaborative endeavor. To collaborate, scientists need to exchange data, including sequence data. But the flood of data is very hard to channel through the narrow Internet tubes.
We need to compress these data. There is generic compression software – zip, gzip and bzip2 come to mind. However, could we do better with a solution tailored to DNA? After all, we are talking about a string taken from a four-letter alphabet, with many repeats.
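To make the four-letter intuition concrete, here is a minimal sketch in Python. It is not the method of any contest entry: it simply packs each base into 2 bits, so that four bases fit in one byte, and prints the result next to what the generic zlib compressor achieves. The names pack_2bit and unpack_2bit are made up for this illustration, and the sketch assumes the sequence contains only A, C, G and T (real reads also carry N calls, headers and quality scores).

```python
import zlib

# Map each base to a 2-bit code. Assumption: only A, C, G, T occur;
# real reads also contain N and quality scores, which are ignored here.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack_2bit(seq: str) -> bytes:
    """Pack a string over A/C/G/T into 2 bits per base (4 bases per byte)."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        # Left-align a short final chunk so unpacking is uniform.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack_2bit(data: bytes, length: int) -> str:
    """Inverse of pack_2bit; length is the original number of bases."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    # Drop the padding bases introduced by the last, partially filled byte.
    return "".join(bases[:length])


if __name__ == "__main__":
    seq = "ACGTTTGACA" * 100  # toy, highly repetitive sequence
    packed = pack_2bit(seq)
    assert unpack_2bit(packed, len(seq)) == seq
    print("raw bytes:   ", len(seq))
    print("2-bit packed:", len(packed))
    print("zlib:        ", len(zlib.compress(seq.encode("ascii"))))
```

Packing alone gives a fixed fourfold reduction over one byte per base, whatever the sequence looks like. A generic compressor can beat that on repetitive input because it exploits the repeats, which is why a tailored approach would want to combine both ideas.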
So the Pistoia Alliance announced a competition, Sequence Squeeze, “putting forward a prize fund of US$15,000 to the best novel open-source NGS compression algorithm submitted before the closing date of 15 March 2012.” The paper describing the competition recently came out in GigaScience (which is why I am hearing about this only now).
The nice thing about Sequence Squeeze is that the scoreboard was dynamic and gave immediate feedback on how well a compression algorithm was doing. The criteria for performance were a combination of time, CPU usage, memory usage, compression ratio, and decompression quality. To wit:
Each judging instance contained a simple script which controlled the judging process. It operated as follows:
1. Download the entry
2. Set up the contest data (a random extract from the 1000 Genomes Project)
3. Secure the firewall
4. Run the entry in compression mode
5. Measure CPU and memory usage
6. Assess the compression ratio
7. Run the entry in decompression mode
8. Check that the total combined output files contain exactly the same information (header, sequence, and quality lines) as the input files
9. Update the results database
10. Email the results
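The core of that loop (steps 4 to 8) can be sketched in a few lines of Python. This is only an illustration under loud assumptions, not the contest's actual script: the function name judge is hypothetical, the entry is called in-process as a pair of Python functions rather than run as an external program behind a firewall, zlib stands in for a real entry, memory is measured with tracemalloc (which only sees allocations made through Python), and the ratio is taken as compressed size divided by original size, which may not match the contest's definition.

```python
import time
import tracemalloc
import zlib
from typing import Callable, Dict


def judge(data: bytes,
          compress: Callable[[bytes], bytes],
          decompress: Callable[[bytes], bytes]) -> Dict[str, object]:
    """Score one entry on one input, loosely following steps 4 to 8."""
    tracemalloc.start()
    cpu_start = time.process_time()
    packed = compress(data)                          # step 4: compression mode
    cpu_seconds = time.process_time() - cpu_start    # step 5: CPU time
    _, peak_bytes = tracemalloc.get_traced_memory()  # step 5: peak traced memory
    tracemalloc.stop()

    # Step 6. Assumption: ratio = compressed / original, smaller is better.
    ratio = len(packed) / len(data)

    restored = decompress(packed)                    # step 7: decompression mode
    lossless = restored == data                      # step 8: same information?

    return {
        "cpu_seconds": cpu_seconds,
        "peak_bytes": peak_bytes,
        "ratio": ratio,
        "lossless": lossless,
    }


if __name__ == "__main__":
    # A toy FASTQ-like input: header, sequence, separator and quality lines.
    sample = b"@read1\nACGTTTGACA\n+\nIIIIIIIIII\n" * 1000
    print(judge(sample, zlib.compress, zlib.decompress))
```

The round-trip check in step 8 is what keeps the scoreboard honest: an entry that shrinks the data by throwing some of it away would post a great ratio but fail the comparison.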
The winner of the first (and, as far as I can tell, the only) round of Sequence Squeeze was James Bonfield from the Sanger Institute. You can read more about Sequence Squeeze in the Pistoia Alliance’s blog and in the paper.
Holland RC, Lynch N (2013). Sequence squeeze: an open contest for sequence compression. GigaScience, 2(1). PMID: 23596984