Absolut standards: report from the Metagenomics, Metadata and Metaanalysis 2009 meeting. Part 1
The first Metagenomics, Metadata and Metaanalysis (M3) meeting, held in Stockholm on June 27, 2009, was a raging success. People were standing all the way back into the hall, jostling for elbow room, while all the other concurrent meetings were pitifully empty after word got out about how awesome we were.
OK, I may be exaggerating slightly, since I was the meeting’s co-organizer, co-chair, program committee co-chair, and bartender. (If you were there and you don’t remember me tending bar, then I must have done a good job.) Well, maybe I wasn’t a bartender. Fine.
So what was the meta(genomics, data, analysis) meeting about?
I’ve talked about metagenomics in several earlier posts. Just in case you are new here: metagenomics is the study of genetic material that comes directly from the environment. It is a technique used to study genetic material from organisms (usually microbes) that cannot be cultured in a lab, and to get a picture of organisms in their natural environment, which often differ from lab clones.
While in genomics we strive to obtain a full picture of an organism’s DNA, in metagenomics we sample the environment for whatever DNA we can get. We are actually merging population biology with genomics: while in population biology our basic unit of study is an organism, in metagenomics it is a DNA sequence. This presents many challenges: properly sampling the microbial habitat and extracting the DNA, understanding which organisms the DNA in the samples came from, gauging sampling depth, assembling the sequences, identifying genes, and assigning biological functions to those genes, to name a few. There are many different experimental and computational procedures for doing all of this, and they should be meticulously documented, as Nikos Kyrpides from the Joint Genome Institute writes in this month’s Nature Biotechnology:
Like molecular biology, genomics has been fueled by the innovative energy of many interdisciplinary activities. Unlike molecular biology, which has thrived on the principle of standardized methods and protocols, genomics has progressed without regard for the critical importance of shared standards. Now, 14 years since the first complete genome was published and with more than 900 genome sequences finished, it is astonishing to observe the lack of standards for so many critical procedures in the field, ranging from simple data exchange to gene finding, function prediction and metabolic pathway description.
Now for the kick in the head:
As an example, we compared the genomes of two closely related organisms, Burkholderia mallei ATCC 23344 (ref. 19) and Burkholderia pseudomallei K96243 (each sequenced by a different sequencing center) [...] we identified 548 genes in B. mallei that are absent from B. pseudomallei and are potentially related to their different lifestyles. Manual curation of those 548 genes revealed that, in fact, 497 of them are also in the B. pseudomallei genome, but there they had not been identified as 'real' genes. The reason for this discrepancy? The two sequencing centers used different gene finding methods. The consequence was an almost 90% error rate in the results of our comparison.
Ouch. Ouch, ouch, ouch. And that is not an anecdotal example. Furthermore, it also applies to metagenomics, and even more so, since many of the standard operating procedures (SOPs) in metagenomics are still in the process of inventing themselves.
Metadata is the “data about the data”: all the habitat descriptions, SOPs, and abiotic measurements that are in dire need of the standardization Kyrpides writes about.
Lastly, metaanalysis would be the analysis of genomes and metagenomes. Since the M3 meeting was held under the auspices of the International Society for Computational Biology, it attracted mainly computational biologists: the type who analyze, rather than sample and sequence (but the differences are rapidly blurring, as we saw in many talks).
But things are actually looking better for standards. In 2005, the Genomic Standards Consortium (GSC) was formed to address this problem. Renzo Kottmann from the Max Planck Institute for Marine Microbiology in Bremen, Germany, talked about software development within the GSC, and specifically about his own project: the Genomic Contextual Data Markup Language, or GCDML. GCDML is an XML-based standard for describing everything associated with a genomic or metagenomic sample: where it was taken from, under what conditions, and which protocols were used to extract, sequence, assemble, finish, and analyze the metagenome. Again, my own personal bias here: I am a heavy user of GCDML, as I am writing my own data-insertion software, and I headed such an effort for a while at the University of California, San Diego. Here are Kottmann’s slides, and you can also read more about GCDML.
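To give a flavor of what such a contextual-data record might look like, here is a minimal Python sketch that writes out a GCDML-style sample description. The element names and values are simplified stand-ins of my own invention, not the actual GCDML schema; see the Kottmann et al. paper below for the real thing.

```python
# Toy sketch of a GCDML-style contextual-data record.
# Element names are simplified stand-ins, NOT the real GCDML schema.
import xml.etree.ElementTree as ET

sample = ET.Element("sample", id="M3-demo-001")  # hypothetical sample ID

habitat = ET.SubElement(sample, "habitat")
ET.SubElement(habitat, "description").text = "surface seawater, Western English Channel"
ET.SubElement(habitat, "latitude").text = "50.25"
ET.SubElement(habitat, "longitude").text = "-4.22"

protocol = ET.SubElement(sample, "protocol")
ET.SubElement(protocol, "extraction").text = "phenol-chloroform"
ET.SubElement(protocol, "sequencing").text = "454 pyrosequencing"

ET.indent(sample)  # pretty-printing requires Python 3.9+
print(ET.tostring(sample, encoding="unicode"))
```

The point of the markup is exactly the standardization Kyrpides calls for: every sample carries its habitat and its SOPs in a machine-readable form, so that two metagenomes processed at two different centers can be compared without guessing how each was produced.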
Daniel Richter talked about the functional annotation (assigning biological functions to genes) of metagenomes using the Gene Ontology, a technique he developed with Daniel Huson at the University of Tübingen, Germany. The Gene Ontology, or GO, is a major bioinformatics initiative to unify the representation of gene and gene product attributes across all species. It is composed of a vocabulary of some 27,000 terms, with hierarchical relationships defined between them, from the general (“catalytic activity”) to the specific (“phosphatase activity”) to the more specific still (“tyrosine phosphatase activity”). (Graph theory prudes: GO is a DAG, not really a hierarchy, I know, I know.) Richter assigns functions to sequences hypothesized to be genes using the Last Common Ancestor (LCA) approach. LCA works as follows: once a high enough similarity is found between a sequence from a metagenome and a sequence in a reference database, LCA collects all the other, related reference sequences whose similarity scores are above a certain threshold. It then assigns the most general GO function that fits all of those hits: their last common ancestor in the GO graph.
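To make the idea concrete, here is a minimal Python sketch of LCA-style assignment over a toy fragment of the GO graph. The terms, scores, and threshold are made up for illustration; the real GO and Richter’s actual pipeline are, of course, much richer.

```python
# Minimal sketch of Last Common Ancestor (LCA) assignment over a toy GO DAG.
# The terms, edges, scores, and threshold below are illustrative inventions.

TOY_GO = {  # term -> set of parent terms
    "tyrosine phosphatase activity": {"phosphatase activity"},
    "serine phosphatase activity": {"phosphatase activity"},
    "phosphatase activity": {"catalytic activity"},
    "kinase activity": {"catalytic activity"},
    "catalytic activity": set(),  # root of this toy fragment
}

def ancestors_or_self(term, dag):
    """All terms reachable by following parent links, including the term itself."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in dag[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def depth(term, dag):
    """Longest path from the term up to a root; deeper terms are more specific."""
    parents = dag[term]
    return 0 if not parents else 1 + max(depth(p, dag) for p in parents)

def lca_annotation(hits, dag, threshold=50.0):
    """Assign the deepest GO term shared by all hits scoring above the threshold."""
    good = [term for term, score in hits if score >= threshold]
    if not good:
        return None
    common = set.intersection(*(ancestors_or_self(t, dag) for t in good))
    return max(common, key=lambda t: depth(t, dag))

# Two strong hits annotated with sibling terms collapse to their shared parent:
hits = [("tyrosine phosphatase activity", 92.5), ("serine phosphatase activity", 88.0)]
print(lca_annotation(hits, TOY_GO))  # -> "phosphatase activity"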
Jack Gilbert from Plymouth Marine Laboratory, Plymouth, UK, talked about a year of sampling the marine microbiome in the Western English Channel. He went through many of the sampling and normalization problems involved.
Tom Matthews from the National Microbiology Laboratory in Canada talked about a profiling pipeline for the fast typing of pathogens in case of an outbreak.
There were more presentations, but I think I’ll give it a rest and get back to them in part 2. I am also waiting for some people to upload their slides… you know who you are!
Kyrpides, N. (2009). Fifteen years of microbial genomics: meeting the challenges and fulfilling the dream. Nature Biotechnology, 27(7), 627-632. DOI: 10.1038/nbt.1552
Kottmann, R., Gray, T., Murphy, S., Kagan, L., Kravitz, S., Lombardot, T., Field, D., & Glöckner, F. (2008). A Standard MIGS/MIMS Compliant XML Schema: Toward the Development of the Genomic Contextual Data Markup Language (GCDML). OMICS: A Journal of Integrative Biology, 12(2), 115-121. DOI: 10.1089/omi.2008.0A10