The Human Genome Variome Project and Google News Reader



Apparently sequencing two white males of European extraction does not make for a very good sample of mankind, and if we really want a good view of what we are really like, we need to sequence a couple more. Maybe even, you know, a woman, or someone from India or China or Egypt or Brazil… just a thought.

Until sequencing becomes cheap enough to get the full genome of every Tom, Dick & Harriet, the technique of choice would be to (1) sequence as many genomic bits and pieces as we can; (2) map them to a reference genome and (3) make the sequence information and metadata globally available. The first part is being done all the time: every time future parents go for genetic testing, every time a genetic disease or a cancer is diagnosed (or negatively diagnosed), every time 23andme holds a spit party and, of course, every time Horatio Caine takes off his sunglasses and says in that terse controlled voice: “Well Miss Boa Vista, what have you got for me?” — every time one of those happens, yet another variation of part of the human genome has been sequenced. Or rather, of “humanity’s genome”.

CSI Miami

(Uh… y’all do  know Horatio Caine is a fictional character… OK, good. Let’s move on).

Cue the Human Variome Project (HVP): a large international effort that aims to consolidate all these genomic data and associated metadata. Here is the list of the HVP’s goals as outlined in the most recent meeting report paper:

1. Capture and archive all human gene variation associated with human
disease in a central location with mirror sites in other countries. Data
governance will ensure security and integrity through the use of
auditing and security technologies but, nevertheless, allow searching
across all genes using a common interface.

2. Provide a standardized system of gene variation nomenclature,
reference sequences, and support systems that will enable diagnostic
laboratories to use and contribute to total human variation knowledge.

3. Establish systems that ensure adequate curation of human variation
knowledge from gene-specific (locus-specific), country-specific, or
disease specific database perspective to improve accuracy, reduce
errors, and develop a comprehensive data set comprising all human genes.

4. Facilitate the development of software to collect and exchange human
variation data in a federation of gene-specific (locus-specific),
country specific, disease-specific, and general databases.

5. Establish a structured and tiered mechanism that clinicians can use
to determine the health outcomes associated with genetic variation. This
will work as a dialogue between those who use human variation data and
those who provide them. Clinicians will be encouraged to provide data
and will have open access to complete variation data.

6. Create a support system for research laboratories that provides for
the collection of genotypic and phenotypic data together using the
defined reference sequence in a free, unrestricted and open access
system and create a simple mechanism for logging discoveries.

7. Develop ethical standards to ensure open access to all human
variation data that are to be used for global public good and address
the needs of ‘‘indigenous’’ communities under threat of dilution in
emerging countries.

8. Provide support to developing countries to build capacity and to
fully participate in the collection, analysis and sharing of genetic
variation information.

9. Establish a communication and education program to collect and spread
knowledge related to human variation knowledge to all countries of the
world.

10. Continue to carry out research within the opportunities presented by
investigation of human genetic variation and to present these findings
to users of this information for the benefit of all.

All very laudable goals. Each one could be the subject of a blog post and more, and each one is a serious challenge, since these topics have all been tackled with varying levels of success in other contexts. A central problem is coordinating so many people, labs, entities and clinics spread across different countries and cultures. That means many different legal systems; different trademark, patent and medical privacy laws; and different types of data and metadata kept or discarded. How do we get them all to disclose their data and metadata in a uniform fashion?

This problem is discussed at length in the paper. It is clear the HVP will not start with a huge central repository with all sorts of data input mechanisms: that is a recipe for a white elephant if there ever was one, and the risk of it being unusable, or only very partially usable, at release is too high. It is simply not practical to force a huge and diverse population of clinicians, researchers, government labs and commercial labs to deposit data in a uniform manner. Instead of the huge-database approach, the HVP people are developing the concept of Locus Specific Databases (LSDBs). Many LSDBs already exist, although not named as such. They might not be locus-specific but phenotype-specific, i.e. dealing with a certain disease or set of diseases that may or may not be tied to a single locus. Nevertheless, the phenotype–genotype association is there.

Once these individual LSDBs are developed and curated locally, ethically appropriate data elements can be deposited in national or international databases (NCBI or EBI).

An example of such a local database is ARUP at the University of Utah, dealing with galactosemia; many others are listed in the paper.

So we have a collection of thousands of information producers, publishing in many variable formats, and one project that seeks to aggregate them automatically and present that information in a digestible, standard fashion. Does this remind you of something? It reminds me of the search-engine approach to consolidating news data: it is Google News‘s responsibility to cluster all the news on a given news topic in the same page section. It is not the responsibility of the NY Times, CNN or Huffington Post to align themselves with Google News’s reporting standards. Those three news resources may actually have the facilities and even the motivation to do so, but what about the Ithaca Journal or the Alice Springs News? Running with this analogy, the simple plug-and-play equivalent of an Atom feed may be all that is needed (and in many cases all that will be tolerated) by the LSDBs to propagate their information. The consolidation, standardization and subsequent delivery are all the responsibility of the one large project (Google News, HVP), rather than of the thousands of small ones.
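To make the Atom-feed idea concrete, here is a minimal sketch of the aggregator side of the analogy: the LSDB publishes a dead-simple feed, and all the parsing work happens at the aggregator. The feed content and the notion of a per-LSDB variant feed are hypothetical illustrations, not anything the HVP has actually specified.

```python
# Sketch: an aggregator consuming a hypothetical LSDB Atom feed.
# The feed XML below is invented for illustration; only the Atom
# format itself (RFC 4287) is real.
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# A toy feed such as a hypothetical galactosemia LSDB might expose.
feed_xml = """<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example LSDB updates</title>
  <entry>
    <title>GALT c.563A&gt;G (p.Gln188Arg)</title>
    <updated>2009-04-01T00:00:00Z</updated>
  </entry>
  <entry>
    <title>GALT c.855G&gt;T (p.Lys285Asn)</title>
    <updated>2009-04-02T00:00:00Z</updated>
  </entry>
</feed>"""

def entries(xml_text):
    """Return (title, updated) pairs for every entry in an Atom feed."""
    root = ET.fromstring(xml_text)
    return [(e.findtext(ATOM + "title"), e.findtext(ATOM + "updated"))
            for e in root.findall(ATOM + "entry")]

for title, updated in entries(feed_xml):
    print(updated, title)
```

The point is the asymmetry: the producer only has to emit a trivially simple feed once, while the reading, clustering and standardizing logic lives entirely in the one big aggregator.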

But why stop with the HVP? Many other genomic data projects may benefit from this line of thinking. The responsibility for the correct acquisition of molecular data should rest chiefly with the aggregator project, rather than with the hundreds of smaller projects producing the data being aggregated. Note that I wrote “chiefly”. The analogy to news aggregators breaks down at some point, as all analogies do: some minimal standards should still be adhered to by the information producers. If your habitat metadata contains a temperature but no units, it is not up to the aggregator to guess whether you were using Celsius or Kelvin (I hope no one in science is still using Fahrenheit!). Similarly, with sequence data, don’t go inventing your own non-redundant amino acid alphabet; and with structure data, there are two good-enough standard formats (mmCIF and PDB): use one of them. Beyond those caveats, the aggregator should be the one reading, parsing and consolidating the data, possibly with some feedback mechanism letting the small information feeders know how they might improve.
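The temperature example above can be sketched in a few lines. This is a hedged illustration of the “minimal standards” caveat, with an invented record format and field names: the producer must declare units, and the aggregator normalizes what is declared but refuses to guess what isn’t.

```python
# Sketch: the aggregator normalizes declared units but never guesses.
# The record layout ({"temperature": ..., "unit": ...}) is hypothetical.
def normalize_temperature(record):
    """Convert a temperature reading to Celsius, or refuse to guess."""
    value = record.get("temperature")
    unit = record.get("unit")
    if value is None or unit is None:
        # Not the aggregator's job to decide Celsius vs Kelvin.
        raise ValueError("temperature record missing value or unit")
    if unit == "C":
        return value
    if unit == "K":
        return round(value - 273.15, 2)
    raise ValueError("unsupported unit: %r" % unit)

print(normalize_temperature({"temperature": 310.15, "unit": "K"}))  # prints 37.0
```

A record without a unit raises an error that flows back to the producer through the feedback mechanism, rather than being silently “fixed” by the aggregator.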

Happy data mining.

Kaput, J., Cotton, R., Hardman, L., Watson, M., Al Aqeel, A., Al-Aama, J., Al-Mulla, F., Alonso, S., Aretz, S., Auerbach, A., Bapat, B., Bernstein, I., Bhak, J., Bleoo, S., Blöcker, H., Brenner, S., Burn, J., Bustamante, M., Calzone, R., Cambon-Thomsen, A., Cargill, M., Carrera, P., Cavedon, L., Cho, Y., Chung, Y., Claustres, M., Cutting, G., Dalgleish, R., den Dunnen, J., Díaz, C., Dobrowolski, S., dos Santos, M., Ekong, R., Flanagan, S., Flicek, P., Furukawa, Y., Genuardi, M., Ghang, H., Golubenko, M., Greenblatt, M., Hamosh, A., Hancock, J., Hardison, R., Harrison, T., Hoffmann, R., Horaitis, R., Howard, H., Barash, C., Izagirre, N., Jung, J., Kojima, T., Laradi, S., Lee, Y., Lee, J., Gil-da-Silva-Lopes, V., Macrae, F., Maglott, D., Marafie, M., Marsh, S., Matsubara, Y., Messiaen, L., Möslein, G., Netea, M., Norton, M., Oefner, P., Oetting, W., O’Leary, J., de Ramirez, A., Paalman, M., Parboosingh, J., Patrinos, G., Perozzi, G., Phillips, I., Povey, S., Prasad, S., Qi, M., Quin, D., Ramesar, R., Richards, C., Savige, J., Scheible, D., Scott, R., Seminara, D., Shephard, E., Sijmons, R., Smith, T., Sobrido, M., Tanaka, T., Tavtigian, S., Taylor, G., Teague, J., Töpel, T., Ullman-Cullere, M., Utsunomiya, J., van Kranen, H., Vihinen, M., Webb, E., Weber, T., Yeager, M., Yeom, Y., Yim, S., Yoo, H., et al. (2009). Planning the Human Variome Project: The Spain report. Human Mutation, 30(4), 496–510. DOI: 10.1002/humu.20972


One Response to “The Human Genome Variome Project and Google News Reader”

  1. Thank you for the analysis of the strategy outlined in our paper. This is a global project with thousands of potential contributors and we would like to hear from anyone interested in helping.