The 11th Annual Bioinformatics Open Source Conference (BOSC) 2010 is coming up in Boston, July 9-10, 2010. The BOSC meetings are a great get-together of a community of programmers who are like-minded in their advocacy of open source code for science, and specifically for bioinformatics. The whole thing is run by volunteers who put a lot of time and effort into bringing a top-notch meeting every year, so a big thanks to this year’s organizing committee!
If you are reading this, and you are in Boston on those dates, consider showing up: it is a great experience. There will also be a codefest on the two days before the meeting. This year’s topic is cloud computing for bioinformatics. If you like using AWS for bioinformatics or if you want to learn more, this is your chance. Amazon have provided a grant towards this codefest. (Thanks!) Biopython, Bioperl, Biojava and Bioruby developers will all be there, tailoring code to the cloud.
Which brings me to the latest poll: if you are a bioinformatics programmer, which of the Bio* packages are you using in your programming, if any? If more than one, check the one you use most frequently. Poll answers on the right. As with all Internet polls, you must be crazy if you take it at all seriously.
Recently, a judge in Federal District Court in Manhattan ruled that Myriad’s patents on the BRCA1 and BRCA2 genes were invalid, the genes being “products of the law of nature” that could be patented no more than, say, Mount Everest. These two genes are associated with breast and ovarian cancer, and are used in testing for susceptibility to these types of cancer; for the patents’ duration, that testing goes through Myriad’s labs. The ruling, if it holds up on appeal, will change the way pharmaceutical business is done: there are over 4,300 gene patents today. BRCA2 tests cost $3,000 in the US, where Myriad has exclusivity. In some provinces of Canada, where Myriad’s exclusivity is not honored, BRCA tests cost considerably less. As an aside, one of the successes that the plaintiffs attribute to the verdict is the contribution to women’s health. True, but not exclusively so: there is growing evidence that BRCA1/2 mutations are associated with pancreatic cancer and testicular cancer.
Stephen Colbert has something to say about it; but in this case, although he is his usual facetiously hilarious self, he seemed to confuse the ACLU, which was one of the plaintiffs. Actually, he confused me too. His arguments for the patent invalidation seem a tad self-defeating, rather unusual for Colbert. See for yourself.
AMOS is a suite of genome assembly and editing software. It includes assemblers, validation, visualization, and scaffolding tools. I have been having some issues installing AMOS on Ubuntu 9.10. Specifically, Ubuntu 9.10 has gcc 4.4, which breaks the compilation of the AMOS release version. However, the development version has been fixed to accommodate that.
If you don’t know which Ubuntu version you are running, type:
$ lsb_release -a
No more than fifteen minutes after I posted my question to the amos-help mailing list, Florent Angly came through with a solution. I am posting his email here.
Hi,
This issue was fixed in the development version of AMOS. See below for instructions on how to install this version on Ubuntu:
In the directory where the AMOS files are located, run the following to install the prerequisites:
$ sudo aptitude install ash coreutils gawk gcc automake mummer mummer-doc libboost-dev
For the Hawkeye component of AMOS, you need Qt3:
$ sudo aptitude install libqt3-headers
For the standard version of AMOS, skip to next step, but for the CVS development version, first, run:
$ ./bootstrap
Then regardless of the version:
$ ./configure --with-Qt-dir=/usr/share/qt3 --prefix=/usr/local/AMOS
$ make
$ make check
$ sudo make install
$ sudo ln -s /usr/local/AMOS/bin/* /usr/local/bin/
Now all the programs shipped in AMOS should be available from the command-line.
For example try:
$ Minimo -h
Regards,
Florent
You will need the AMOS development version for Ubuntu 9.10 (and above, presumably), but the regular version for 9.04 (and below). If you are getting the development version, you will also need to install cvs on your machine:
$ sudo aptitude install cvs
Hope this helps anyone struggling with installing AMOS on Ubuntu or other Linux platforms.
In celebration of the biohackathon happening now in Tokyo, I am putting up a script that is oddly missing from many bioinformatics packages: extracting intergenic regions. This one was written together with my student, Ian. As for the biohackathon itself, I’m not there, but I am following the tweets and Brad Chapman’s excellent posts.
About intergenic regions: they are as interesting as, and sometimes even more interesting than, the genes themselves, for example when you are interested in promoters, transcription factor binding sites or almost any other transcription regulation mechanism. Here’s a simple script to find intergenic regions. It reads a GenBank-formatted file and uses the information there to extract the intergenic regions. The sequences are written to a FASTA file.
#!/usr/bin/env python
import sys
import Bio
from Bio import SeqIO, SeqFeature
from Bio.SeqRecord import SeqRecord
import os

# Copyright(C) 2009 Iddo Friedberg & Ian MC Fleming
# Released under Biopython license. http://www.biopython.org/DIST/LICENSE
# Do not remove this comment
def get_interregions(genbank_path,intergene_length=1):
    seq_record = next(SeqIO.parse(genbank_path, "genbank"))
    cds_list_plus = []
    cds_list_minus = []
    intergenic_records = []
    # Loop over the genome file, get the CDS features on each of the strands
    for feature in seq_record.features:
        if feature.type == 'CDS':
            mystart = int(feature.location.start)
            myend = int(feature.location.end)
            if feature.location.strand == -1:
                cds_list_minus.append((mystart,myend,-1))
            elif feature.location.strand == 1:
                cds_list_plus.append((mystart,myend,1))
            else:
                sys.stderr.write("No strand indicated %d-%d. Assuming +\n" %
                                  (mystart, myend))
                cds_list_plus.append((mystart,myend,1))

    for i,pospair in enumerate(cds_list_plus[1:]):
        # Compare current start position to previous end position
        last_end = cds_list_plus[i][1]
        this_start = pospair[0]
        strand = pospair[2]
        if this_start - last_end >= intergene_length:
            intergene_seq = seq_record.seq[last_end:this_start]
            strand_string = "+"
            intergenic_records.append(
                  SeqRecord(intergene_seq,id="%s-ign-%d" % (seq_record.name,i),
                  description="%s %d-%d %s" % (seq_record.name, last_end+1,
                                                        this_start,strand_string)))
    for i,pospair in enumerate(cds_list_minus[1:]):
        last_end = cds_list_minus[i][1]
        this_start = pospair[0]
        strand = pospair[2]
        if this_start - last_end >= intergene_length:
            intergene_seq = seq_record.seq[last_end:this_start]
            strand_string = "-"
            intergenic_records.append(
                  SeqRecord(intergene_seq,id="%s-ign-%d" % (seq_record.name,i),
                  description="%s %d-%d %s" % (seq_record.name, last_end+1,
                                                        this_start,strand_string)))
    outpath = os.path.splitext(os.path.basename(genbank_path))[0] + "_ign.fasta"
    SeqIO.write(intergenic_records, outpath, "fasta")

if __name__ == '__main__':
    if len(sys.argv) == 2:
        get_interregions(sys.argv[1])
    elif len(sys.argv) == 3:
        get_interregions(sys.argv[1],int(sys.argv[2]))
    else:
        print("Usage: get_intergenic.py gb_file [intergenic_length]")
        sys.exit(0)
What are we seeing here?
Lines 11-16 are the preamble: we read the GenBank file using Biopython’s GenBank parser in line 12. Because we expect a genome file, which contains a single record, this is a one-time read. Note that this is the rate-limiting step, and can take a couple of seconds: it took me ~2 seconds to read the full E. coli genome on my Linux box. We prepare one list for the CDSs on the + strand (line 13), another for the CDSs on the minus strand (line 14) and one for all the intergenic records (line 15).
The rest of the code consists of three loop blocks: in lines 16-28 I loop over the GenBank features, extracting the coordinates of the genes themselves. In lines 30-41 I find the intergenic regions on the + strand. In lines 42-52 I do the same for the “-” strand.
Now for a philosophical interlude: although there is a way to read all the intergenic regions in a single pass, I subscribe to the “code simple” doctrine of research software writing. Code performance optimization is a low priority for me. I’d much rather have something that is simple to write, read and modify. I also don’t want to spend too much time coding an elegant script for elegance’s sake, especially if I may not use it too much. Historically, scientific code written for research is mostly extinct: thrown away after a short-lived hypothesis was tested and ended its days. Research coding is mostly throwaway glue code. Very rarely does it mature into a product. Then, and only then, can you apply all that fine software engineering you learned in college. Before that, write fast and simple.
But I digress. Line 17 loops over the features in the genome file. In line 18 we check whether the feature is a coding sequence (CDS). If so, we record the start position, the end position and the strand the CDS is on. The list cds_list_minus is a list of 3-tuples. Each tuple holds the start position, end position and strand of a CDS on the minus strand. If you would like to go over the genes, as opposed to coding sequences, change line 18 to:
if feature.type == 'gene':
(or better yet, pass an argument that defines it.)
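One way to pass the feature type as an argument can be sketched as follows. This is a minimal illustration and not part of the original script: the Feature tuple is a hypothetical stand-in for a Biopython feature (so the sketch runs without a GenBank file), and the names split_by_strand and feature_type are invented here.

```python
from collections import namedtuple

# Hypothetical stand-in for a Biopython SeqFeature; only the fields
# this sketch needs are modelled.
Feature = namedtuple("Feature", ["type", "start", "end", "strand"])

def split_by_strand(features, feature_type="CDS"):
    """Return (plus, minus) lists of (start, end, strand) tuples
    for features of the requested type."""
    plus, minus = [], []
    for feature in features:
        if feature.type != feature_type:
            continue
        if feature.strand == -1:
            minus.append((feature.start, feature.end, -1))
        else:
            # Strand 1, or no strand given: assume the plus strand,
            # mirroring the default used in the script.
            plus.append((feature.start, feature.end, 1))
    return plus, minus

features = [
    Feature("gene", 0, 100, 1),
    Feature("CDS", 10, 90, 1),
    Feature("CDS", 150, 300, -1),
    Feature("gene", 140, 310, -1),
]
cds_plus, cds_minus = split_by_strand(features)            # CDS by default
gene_plus, gene_minus = split_by_strand(features, "gene")  # genes instead
```

In the real script the same idea would amount to adding one keyword argument to get_interregions and comparing feature.type against it.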
cds_list_plus is, yes, the same as cds_list_minus, only for the plus strand (line 24).
Sometimes, a CDS does not contain information on which strand it is on. With genome files, that is usually the case with single-stranded viral genomes. Therefore, we put in the default assumption that if there is no strand indication, then the feature is on the plus strand. We generate a warning message nevertheless (lines 25-28).
In lines 30-41 we loop over the plus strand list, and identify the coordinates between the genes. Python’s enumerate function is very useful here: it allows us to iterate over a list while keeping track of which index we are at. So in line 30, pospair receives the start coordinate, end coordinate and strand of a CDS as a tuple, while i receives the index. Because the loop runs over the list minus its first member, index i points at the previous member of the full plus strand CDS list. In that way, we can look back to the previous list member, find the coordinate where that CDS ends, and where the current CDS begins. The two coordinates make up the beginning and end of the intergenic region between those two genes on that strand. In line 35 we check if the intergenic region length is equal to or larger than a threshold: suppose we are only interested in those intergenic regions that are at least 100 bases long? (The default value is 1, see line 11.) In lines 38-41 we build a Biopython sequence record that contains an informative header and the sequence of the intergenic region. The description that goes in the sequence header contains the record name, the start and end coordinates of the intergenic region, and the strand. The sequence record is appended to a list, which will eventually get written.
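The look-back trick with enumerate can be seen in isolation in the sketch below. It works on plain (start, end, strand) tuples rather than Biopython objects, the function name find_gaps is made up for this illustration, and it assumes the tuples are already sorted by start coordinate, as features in a GenBank file usually are.

```python
def find_gaps(intervals, min_length=1):
    """Return (gap_start, gap_end) pairs between consecutive intervals."""
    gaps = []
    # Looping over intervals[1:] shifts the numbering by one, so
    # intervals[i] is the interval just before the current one.
    for i, current in enumerate(intervals[1:]):
        last_end = intervals[i][1]
        this_start = current[0]
        if this_start - last_end >= min_length:
            gaps.append((last_end, this_start))
    return gaps

cds_plus = [(0, 10, 1), (15, 30, 1), (30, 40, 1), (140, 200, 1)]
all_gaps = find_gaps(cds_plus)                   # every gap of 1 base or more
long_gaps = find_gaps(cds_plus, min_length=100)  # only gaps of 100+ bases
```

Adjacent or overlapping CDSs (here the second and third tuples) give a difference of zero or less, so they never pass the threshold.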
Lines 42-52 are a repeat of lines 30-41, only for the minus strand. Lines 53 & 54: the list that contains all the intergenic region sequence objects gets written to its own fasta file.
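How the output file name is derived from the input path may not be obvious at first glance, so here is that one expression pulled out into a small sketch. The helper name intergenic_fasta_name and the example paths are made up for illustration.

```python
import os

def intergenic_fasta_name(genbank_path):
    """Build the output name the same way the script does."""
    base = os.path.basename(genbank_path)  # drop the directory part
    stem = os.path.splitext(base)[0]       # drop only the last extension
    return stem + "_ign.fasta"

name = intergenic_fasta_name("/data/genomes/ecoli_k12.gbk")
dotted = intergenic_fasta_name("genome.v2.gb")
```

Because the directory part is dropped, the FASTA file ends up in the current working directory rather than next to the GenBank file.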
Finally, lines 56-63 are boilerplate code that makes this script runnable from the command line. Have fun looking at intergenic regions. Let me know if you find something interesting.
Science is many things to many people, but any lab-rat will tell you that research is mainly long stretches of frustration, interspersed with flashes of satisfying success. The best laid schemes of mice and men gang aft agley. A scientist’s path leads to blind alleys more than anywhere else, and meticulous experimental preparation only serves to mitigate the problem somewhat, if you’re lucky. This doesn’t work, that doesn’t work either and this technique worked perfectly in Dr. X’s lab, why can’t I get this to work for me? My experiment was invalidated by my controls; my controls didn’t work the way the controls were supposed to work in the first place. I keep getting weird results from this assay. I can’t explain my latest results in any coherent way… These statements are typical of daily life in the lab.
This stumped and stymied day-to-day life is not the impression of science we get when reading a research paper, listening to a lecture, or watching a science documentary show. When science is actually presented, it seems that the path to discovery was carefully laid out, planned and flawlessly executed, a far cry from the frustrating, bumbling mess that really led to the discovery. There are three chief reasons for the disparity between how research is presented and what really goes on. First, no one wants to look like an idiot, least of all scientists, part of whose professional trappings is strutting their smarts. Second, there are only so many pages to write a paper, one hour to present a seminar or one hour for a documentary: there is no time to present all the stuff that did not work. Third, who cares about what didn’t work? Science is linked to progress, not to regress. OK, you had a hard time finding this out, we sympathize and thank you for blazing the trail for the rest of us. Make a note for yourself not to go into those blind alleys that held you back for years and move on. We’re not interested in your tales of woe.
Only maybe these tales of woe should be interesting to other people. If you make your negative results public, that could help others avoid the same pitfalls you had. If you share the limits of a technique, a protocol or software then someone can avoid using it in a way that does not work. A lab’s publications are actually the tip of the iceberg of its accumulated knowledge. Every lab has its own oral tradition of accumulated do’s and don’ts. Not oral in the literal sense: they may even be written down for internal use, but never published. UPDATE (2-FEB-2010): most peer-reviewed journals don’t like stuff that does not work. Thanks to Mickey Kosloff for pointing out the Journal of Negative Results in Biomedicine and The Journal of Negative Results – Ecology and Evolutionary Biology.
Until now.
The Journal of Serendipitous and Unexpected Results aims to help us examine the sunken eight-ninths of the scientific knowledge iceberg, in life science and in computer science. (So an additional field over JNRB and JNREEB). From JSUR’s homepage:
Help disseminate untapped knowledge in the Computational or Life Sciences
Can you demonstrate that:
* Technique X fails on problem Y.
* Hypothesis X can’t be proven using method Y.
* Protocol X performs poorly for task Y.
* Method X has unexpected fundamental limitations.
* While investigating X, you discovered Y.
* Model X can’t capture the behavior of phenomenon Y.
* Failure X is explained by Y.
* Assumption X doesn’t hold in domain Y.
* Event X shouldn’t happen, but it does.
The problem with the JSUR model, and the nature of discovery
I expect JSUR will be a great way to comment on methods and techniques. Indeed it will codify a trend that has been going on for some time: public protocol knowledge sharing. Many sites like openwetware, seqanswers or the UC Davis bioinformatics wiki have been doing this for a while. Not to mention a plethora of blogs. Scientists are willing to share their experience with working protocols and procedures, and if this sharing of knowledge can be now monetized to that all-important coin of academia, the peer-reviewed publication, all the better.
So where is the problem? The problem lies with discovery, and credit given towards it. It would be very hard to get anyone to share awkward, unexpected or yet-uninterpreted results. First, as I said, no one wants to look like an idiot. Second, unexpected or yet-uninterpreted results are often viewed as a precursor to yet another avenue of exploration. A scientist would rather pursue that avenue, with the hope of the actual meaningful discovery occurring in the lab. At most, there will be a consultation with a handful of trusted colleagues in a closed forum. If the results are made public, someone else might take the published unexpected and uninterpreted results, interpret them using complementary knowledge gained in their lab, and publish them as a bona fide research paper. The scientist who catalyzed the research paper with his JSUR publication receives, at best, secondary credit. The story of Rosalind Franklin’s under-appreciated contribution to the discovery of the structure of DNA comes to mind. Watson and Crick used the X-ray diffraction patterns generated by Franklin to solve the three-dimensional structure of the DNA molecule. Yet she was not given a co-authorship on the paper. (And she did not even make the results public; they were shared without her knowledge.) Unexpected results are viewed either as an opportunity or an embarrassment, and given the competitive nature of science, no one wants to advertise either: the first due to the fear of getting scooped, the second for fear of soiling a reputation. I expect JSUR would have a harder time filling in the odd-results niche, but I hope I am wrong.
But if you have protocols you are willing to share… what are you waiting for? Get those old lab notebooks, 00README files and forum posts, and start editing them into a paper. You are sitting on a goldmine of publishable data and you did not even realize it.
Finally, here are two scientists who never declined sharing their unexpected results.
This post has been slashdotted. Exercise extreme caution.
Speaking of sampling bacteria, this ties in well with the previous post about GEBA. And by “well” I mean “in an alternate-universe / altered-consciousness manner”.
Google flew the green-starred flag of hope yesterday, in celebration of the 150th birthday of a man who constructed a whole language based upon hope. He called himself Doctor Hopeful, and he wanted the language he created to help break down national barriers. He made it easy to learn, so that people would be motivated to learn it as their second language. They would then speak the Language of Hope, understand each other, and not be so insular. For isolation breeds suspicion, and suspicion breeds hostility and ultimately violence.
Unfortunately, neither his language nor his vision of a more understanding and tolerant mankind caught on. One hundred and fifty years after his birth, and 122 after the publication of his book, the world is no friendlier or more tolerant than it was when Ludwig Zamenhof set out to correct it by publishing his book International Language: Foreword And Complete Textbook under the pseudonym of Doktoro Esperanto.
English has become the second language of choice for many. But English became the lingua franca through the increasing dominance of English-speaking powers over the last 200 years, and many interpret its adoption as an imposition from above. English is perceived by many who wish to preserve their non-Anglo cultures as overwhelming, a threat to their local culture, which would be diluted to extinction through constant bombardment by English-language movies, TV shows, and Internet content. Zamenhof would not have liked that, as Esperanto was intended to be adopted by choice, without carrying any threatening cultural baggage.
The Internet itself is hailed by many as a medium to strike down barriers to knowledge and help communications. But national firewalls, traffic monitoring, crackdowns on content sharing, criminal abuse and vilification in the popular media cause many to see it more as a threat to their own society, rather than a promise for all societies. And let us not forget that it is still mostly a developed world’s medium, with most of the content and cultural narrative originating from rich countries.
Neither a world-wide communication technology nor a globally dominant language seems to have brought us closer to the peaceful, understanding and egalitarian world that Zamenhof envisioned. We should be mindful of that, and of Esperanto. The Esperanto language is viewed as a curiosity at best, and Esperantists as people with a quaint hobby. Happily, Esperantists do not view themselves as such. They are continuing the mission of Zamenhof for a more understanding humankind. Esperanto is kept alive by the two million who speak it, by national and international organizations, by books, magazines, and even music. Happy Birthday, Doctor Hopeful.
Martin Weise of the Swedish Esperanto-singing band Persone, from his solo album Pli ol nenio (“More than nothing”).
One problem that I am facing is convincing colleagues of the utility of an Open Access publication. The usual arguments (more visibility, retention of the right to re-use material, the Greater Good, taxpayer access to taxpayer-funded research and so on) don’t stick very well when faced with a $1500-$2500 or higher publication fee. These can be very big expenses if one is working on medium to small size grants, and where publication fees are sought, in part, from the College. Note: in many cases the OA fees are not unaffordable, so one would not request, in good faith, that the fees be waived or discounted by the publisher. But if one can use this money to pay the summer salary of a couple more students, go to a conference, or upgrade or repair equipment, then shelling out this money for a publication seems of marginal utility, almost frivolous. In the US, funding agencies require, at most, that publications resulting from their funding be available on PubMed Central within a certain time period, and many non-OA publications comply, or they would lose the ability to publish a large chunk of NIH/NSF funded research projects. But doing so is not really timely OA. The bottom line is, if the grant is smaller than R01 size, many applicants would rather budget the expected $8000 of OA fees for the 3-4 year grant period for other line items that have a more palpable payoff, so to speak.
I don’t really have a point to this post, other than raising a problem that seems to be ignored, or marginalized, by many OA advocates. Not everyone operates on large grants. Many lab budgets leave very little room to buy a new laptop, let alone pay for an OA publication (typically the price of two of said laptops).
CLARIFICATION: the events described here have not happened. Yet.
We are a few years into the future. Whole human genomes can be sequenced relatively cheaply and accurately. Direct to Consumer Genomics companies offer true genomic analyses now, not just marker analyses. They BLAST* your sequence against known genotype & disease databases, looking for known genotypic associations. Furthermore, individuals who are “bioinformatics savvy” can analyze their own genome. We hear of the first life-saving BLAST: a person found an association between one of his SNPs and pancreatic cancer, and managed to undergo a life-saving operation in time. We also hear, tragically, of the first BLAST related murder: a molecular biologist killed her infant child and herself after she discovered on her own she and her son are both destined to have Huntington’s chorea. Another, similar suicide took place, but in that second case the person misdiagnosed himself. In a few US states as well as in Italy, the police have successfully subpoenaed DNA sequences from DTC genomics companies. In Singapore, a mandatory database of the genome of all citizens has been announced.
Credit: Adrian Cousins, Wellcome Images
Worldwide, calls for legislation abound that would limit individuals’ access to their own genomic data. At the same time, a loose coalition of political activists, scientists and journalists advocate a “Genomic Freedom Movement” to legislate a governmental and insurance company “hands off” policy. Finally, insurance companies (not just health), financial companies and employers are all interested in the new field of “genomic personality studies”, or “Tarot card genomics” as those studies are called by their opponents. With the advent of many complete human genomes, there has been an explosion of studies that tie personality traits, life expectancy, lifestyle, earning power, accident-proneness and even sexual prowess to genomic data. These studies, some of questionable quality, are gaining strong public attention. Cosmopolitan has just published “Is He Right for You?: How to Get His Genome and What You Can Learn From It”. A whole industry of “compatibility genomics” for couples about to be married is flourishing. The Lubavitcher Hassidim are maintaining a “shidduch” genomic database for eligible singles.
The future of genomic data, who can access it and for what reasons seems murky at best. Under those conditions will you have your own genome sequenced? Note that there is no company that will give up that data (you can have your DNA sequence file, but they wish to keep it too, although they promise complete anonymity and privacy).
So will you have your genome sequenced?
——
(*) BLAST is used as a generic name for any sequence-based database searching software. We may have something else that rules the roost 5 years from now.
What is it? Open Notebook means “no insider information”. Your lab notebook is on a wiki, out there for everyone to see. Negative results & all. You share your research process with the world as you go along. There are many shades to this process: you may share some of your data, edit it, sanitize it… but the general idea holds, that you share a major part of your data, methods and thoughts prior to the official publication.
Why doesn’t it work? Social and cultural reasons. A basic tenet of science culture is that competition breeds quality and innovation. Researchers need to pass a series of competitive thresholds to be able to continue and expand their research: secure a position to be able to start your independent research, compete for a grant to fund it (at a 10-15% funding rate in the US for biomedical research), compete for more grants so one can fund an expanding vision of one’s research, pass a threshold to receive tenure (or rather, not get fired after 6 years). In places with no tenure, pass periodic reviews. Search committees, grant review panels and tenure / periodic review committees judge a scientist by the number of publications, their innovation, how attributable they are to his group as opposed to the collaborating groups and how much impact they carry in the field. And, of course, by the $$$ brought in by grant overheads. To reach a truly innovative leap in research, there is a period when you have to play your cards close to the chest, sharing your findings only with your lab, your collaborators and trusted colleagues. Revealing findings too early will get you scooped by a better equipped lab, or at best dilute the innovative impact: your open lab notebook wiki can and will be construed as a prior publication.
Taking openness and collaboration to the extreme, if you put your notebook on a wiki, and your field is “hot” enough, you can be sure someone will use those ideas to their own benefit, very likely at your expense. It need not be malign: they could make an intuitive leap of reasoning reading your notebook before you can. Even if they are honest and generous enough to credit you by co-authorship, how much of the innovation would be attributed to you? And if you receive less credit for research innovation than you could, that would lower your evaluation score at whatever career stage you are in. By and large, this culture does not appear to be changing. The need to be identified with a certain type of research you can call “your own” and the need to innovate trump those collaborations that, in the eyes of your peers and evaluators, only serve to dilute your achievements.
Therefore, in the foreseeable future, I believe that the Open Science vision will be limited to non-competitive endeavors that don’t have potential for high-impact research papers down the line. Those usually have more to do with tool and technology development rather than innovative research. That is actually a great thing: at least open-notebook science enables protocol, tool and software development more quickly. But anyone who has been involved with Free and Open Source Software has known that for three decades or more.
Different disciplines in science have different cultures. The biomedical field is known to be especially competitive. Also, the field is going through very fast changes. I am referring to this field. I realize that things are different in physics, for example, where pre-publication of results is encouraged and credited. All the more proof that openness, or lack of it, is a cultural issue, rather than inherent in academic research.
What does work? Collaborative technologies: wikis, blogs and discussion forums are great for publicizing oneself (HEY!) and for asking general questions about one’s methodologies, protocols, howtos, software or equipment. OpenWetWare is an example of such a success story for the experimental biology community, being a central repository for protocols and general lab how-tos. But the lab notebooks section only contains a handful of notebooks, most of them out of date. Social bookmarking like Delicious or specialized social bookmarking like CiteULike are catching on, maybe a bit slower than expected. Wikis (not open ones) are great for internal lab management as well, as more labs are discovering.
The free and open source software culture, where one is free to modify and distribute software so licensed, has enabled new feats in scientific computation infrastructure by leveling the playing field so that anyone can use, modify and re-distribute software. In a similar vein, grid technologies are leveling the field of computational power and hardware. Publications like PLoS ONE, which accept research based on scientific rigor rather than innovation leaps and “exceptional interest”, have filled the gap necessary to communicate research that is of interest, yet will not be accepted to journals demanding an innovative edge. Freely available data, post-publication, makes it easier for third parties to validate research and build upon it. And of course, there is Open Access, which makes publications available to all: not only to read, but to further publicize.
For another view that advocates a change in scientific culture that will make Open Science part of the academic incentive structure, just as publications are today, read here.
Community annotation
Credit: victoriapeckham Flickr
What is it? Genomics has become a data-rich science. The deluge of genomes and metagenomes is too much to handle for a group of curators. The idea some genomic database maintainers have come up with is borrowed from the success of Wikipedia. If enough users come in to annotate their favorite genes, we will eventually end up with a comprehensive collection of annotations for most if not all genes in a sequenced genome. If this system is good for Wikipedia entries, why not for genes?
Why doesn’t it work?
Why would anyone expect—or even worse, depend on—a community annotation effort? Imagine investing millions of dollars into state-of-the-art sequencing facilities, and then expecting volunteers from the community to stop by and run the sequencing machines. One might argue that this analogy is not valid because running a sequencing facility requires well-trained personnel, standardized protocols, clear procedures, quality controls and, most of all, tight coordination. Yet, the same professional standards are required for data curation, and it is precisely these aspects that are rarely achieved through a community contribution approach. Community annotation should be encouraged and facilitated, but the curation of biological data cannot depend solely on volunteer work. High standards and quality implies professionalism, and this, in turn, requires investing in dedicated professionals. Until this is done, data curation—and consequently the whole field of microbial genomics—will not move beyond the amateur stage.
What does work? The failure of community-based annotations has brought the often overlooked but crucial activity of biocurators into the limelight. Recently, the International Society for Biocuration was formed. From the mission statement:
Strong support from the research community, the journal publishers, and the funding agencies is indispensable for databases to continue to provide the valuable tools on which a large fraction of research vitally depends. Structured ways for biocurators and associated developers to increase the sharing of tools and ideas through conferences and high quality peer-reviewed publications need to be developed. This will improve data capture, representation, and analysis. Secondly, biocurators, researchers and publishers need to collaborate to facilitate data integration into public resources. Researchers should be encouraged to directly participate in annotation. This will lead to improved productivity and better quality of published papers as well as stronger integrity of the data represented in databases. Thirdly, funding agencies need to recognize the importance of database for basic research by providing increased and stable funding. Finally, the recognition of biocuration as a professional career path will ensure the continued recruitment of highly qualified scientists to this field, which benefits the wider world of biomedical sciences.
So it’s back to expert handling of data, perhaps with some community assistance. This goes back to the attribution problem discussed above: in the current culture, there is hardly any career-building attribution to community annotations. For true community involvement, this would need to change. At the same time, biocuration needs to be recognized as a valid and important career path.
Virtual Conferences
Credit: NASA
What is it? Why pay over $2000 for an international conference, suffer through delayed flights, lost baggage, forgotten poster tubes, jet lag, overpriced meals and hotels (“conference discount” my a$$), sweaty poster sessions and tight-fisted finance admins when you finally get home and try to get reimbursed (phew!) — when you can attend a conference using webcasting in the comfort of your home for a fraction of the price if not for free?
Why doesn’t it work? First: virtual conferencing technology sucks. It doesn’t matter if you use free Skype on a $150 netbook, or state-of-the-art teleconferencing equipment with a 52″ screen and Dolby Surround, piped through at hundreds of Gigabits per second. You will get interruptions, cuts, lags, annoyances and embarrassing moments. Second: social reasons. The important parts of a conference take place in the hallways, poster sessions, meals, banquets and, of course, the pub across the street. Incipient collaborations, exchange of ideas, brainstorming: all those take place around the dinner table and in the halls, with food, coffee and alcohol providing the social lubrication, and the talks and posters the intellectual one. A conference is much more than a series of talks.
To summarize: until we reach a level of virtuality akin to that of the Star-Trek holodeck, or at least something that manages to sync picture & sound without one or the other dropping every 3 minutes, we have no choice but to continue taking off our shoes and belts in front of uniformed strangers.
What does work? Live and archived webcasts can be an acceptable substitute for the lecture part if you could not make it to the meatspace meeting, although you probably will not spend the time at home watching the webcasts of all the keynote speakers you would have gone to at the conference. Microblogging is emerging as a time-saving device for those who were not there: you don’t need to devote 45 minutes to read a microblog from that talk you really wanted to attend. Done properly, perhaps with the speaker’s slides shared somewhere, it is less time consuming than watching a day’s worth of webcasts. And you can filter your interests using the microblogging notes taken by your colleagues, posted on friendfeed or such. It is no substitute for the real deal, which is shmoozing in the hallways. But at least you’ll get an idea about the latest & greatest in research in your field.
This is not to say that the Internet obviates socializing and work collaborations, quite the opposite of course. Most of my collaborators are time zones away from me, and I use email, chat, wikis, Googledocs, and even (shudder) Skype conference calls for working with them. But the experience of a critical mass of people meeting for real and getting things done in a very short space of time has yet to be duplicated by technological means.
The “End of Theory” science
What is it? I am referring to the Wired article penned by Wired’s editor-in-chief, Chris Anderson, last year. It generated a large response, and a resounding echo of “me too” and “he’s so right” articles and blog posts. The message of this article was that with such a deluge of data in the natural sciences, scientists can stop going through the “hypothesize, model, test” cycle. Rather, they can simply look for statistical correlations and draw conclusions from them.
Why doesn’t it work? Because it was wrong from the get-go. I don’t think any serious scientist ever went through the cycle Anderson superficially outlined. He neglected to prefix the “observe” phase to “hypothesize, model, test”. Observation, a.k.a. data collection, is the foundation of whatever comes after. Scientists first observe, then if enough observations are made that seem to fit a certain trend, they formulate one or more hypotheses. Those are tested, and the hypotheses refined or discarded based on test results. Finally, some model may or may not emerge. In any case, the empirical process of research is more of an “(1) observe, (2) hypothesize, (3) test, (4) observe again, (5) retest, (6) correct hypothesis, (7) bumble through the previous 6 stages for quite a while, and if you’re lucky you may have a (8) model”. This is the way science is done regardless of whether you have 20 data points or 20 trillion. There are, of course, qualitative differences with large quantities of data: methods of observation and sifting through data become rather different, and technology starts playing a major role: you really need that computer cluster power (see also above, on community annotation). That does not preclude the need to go through the previous stages, even more carefully than you would with 20 data points. In the end, science is about providing explanations for observed phenomena, and that is what a model is: an explanation, the best we can come up with at this time. If you don’t have hypotheses, models and theories you don’t have science.
What does work?
M. Mitchell Waldrop (2008). Science 2.0: Is Open Access Science the Future? Scientific American, 298 (5), 68-73 PMID: 18444327
Hoffmann, R. (2008). A wiki for the life sciences where authorship matters Nature Genetics, 40 (9), 1047-1051 DOI: 10.1038/ng.f.217
Sagotsky, J., Zhang, L., Wang, Z., Martin, S., & Deisboeck, T. (2008). Life Sciences and the web: a new era for collaboration Molecular Systems Biology, 4 DOI: 10.1038/msb.2008.39
Intelligent Systems in Molecular Biology (ISMB) is a large international gathering of computational biologists, mostly from the bioinformatics side: genomics, structural bioinformatics, computational genomics, etc. This year there is a friendfeed room for microblogging ISMB 2009. So if you are not in Stockholm, or also if you are, look it up. Most of the microbloggers also have their own blogs, and recap posts on those will be forthcoming (I hope).
My take on microblogging is that it is a nice public note-taking mechanism, but going back to recap and provide a deeper analysis of the session one has attended is probably more useful in the long run. Now that I have said it, I will probably have to do it. Watch this space.
eMusic, a subscription-based indie music estore, has hiked its prices and concurrently signed a deal with Sony BMG to sell their back catalog. What’s wrong with this? Well, a lot. Read my previous post for details. It seems like the reaction on the intertubes has been less than joyful, with phrases like “corporate sellout” and “breach of contract” dominating.
I created an informal poll for eMusic customers; it is ongoing until June 15. The results are below. Yes, I know the many statistical caveats to an Internet-based poll. But if you take the results of this survey together with the bulk of the reactions on eMusic’s message boards, various blogs, and chatter on social networks, it seems that eMusic have a serious problem to fix. If you are an eMusic customer, please take the short poll.
eMusic is by far my favorite music store. A huge collection of indie, jazz, blues, classical, world, ambient… anything but mainstream. They keep the music DRM-free, which means you are free to make as many copies as you please, and play on whatever device you like. For this reason, eMusic has little to offer from the Big Four labels (EMI, Universal, Sony and Warner). In the name of copyright, these labels make various attempts to limit your listening experience, restricting you to certain platforms, operating systems or music players, including placing rootkit software on your computer to lock you out of your music or (as experienced by some customers of Sony) lock you out of your computer completely. Thanks, I do not need that. As I am mainly a jazz, blues and classical music fan, with the occasional sprinkled indies, I get most of my listening needs from eMusic. I download the occasional album from Amazon MP3 store if I really feel the need for something not on eMusic.
I have a $15.99/30 days account which lets me download 50 tracks every 30 days: that’s about $0.32 / track, a fantastic value considering that Amazon MP3 and iTunes charge upwards of $0.90 per track. Also, I don’t support the Big Four’s DRM shenanigans, lawsuit frenzy and profit margins, and I do support independent artists that would never have a chance of signing up with any of the Big Four.
Until now, that is.
Today I logged into eMusic to put some music in my basket, as my 30 day account refills around the 14th every month. I am flying on June 14th, and I would like a quick download of new tracks on my MP3 player for the plane once my quota kicks in. I discovered that on May 31 eMusic announced that they are expanding their music catalog and hiking their prices. So come July my $15.99 will buy me 37 tracks instead of 50, which means I will be paying $0.43 / track. That’s a 35% price increase! Although it is still cheaper than what is out there, it does not allow for the fun of cheap experimenting with new music: instead of thinking twice before I download a new track, and then saying “what the hell, it’s only $0.32” and downloading it anyway, I’ll probably download less, and go for the “sure things”. The experience of discovering the musical diamonds in the rough has suddenly become somewhat pricey. I used to allocate 5-10 tracks to experimenting, usually relegating them to the “listened once, did not like” list. But I occasionally discovered wonderful things. That left me with 40 well thought out downloads per month, and 10 frivolous ones. Now, I will probably have to cut back on the frivolity, as would many other eMusic customers. In the long run, that is probably not good for the discovery of interesting new and different music that is out there, which is what eMusic was all about. Also, you have to be a subscriber to buy from eMusic: you cannot buy occasionally like you do from Amazon MP3. This means that prices should be somewhat cheaper, due to the guaranteed customer loyalty in their business model.
OK, but $0.43/track is still cheaper than Amazon or iTunes, so why am I making such a big fuss? Also, times are tough, and they are running a business, not a fuzz-and-wawa charity. Well, the other problem is that eMusic justifies the hike by a deal with Sony, getting Sony’s back catalog that expands eMusic’s repertoire by some 200,000 tracks. But the reason I and many others are eMusic customers in the first place is that we do not really care to listen (at least not for the bulk of our listening time) to the Dixie Chicks, or Leonard Cohen, or Bruce Springsteen. So eMusic have hiked their prices to subsidize a deal with the kind of label most eMusic customers would not go to anyway! Not only because of the music, but also because of the above-mentioned business practices. Also because Sony would never have given a starting chance to the likes of Department of Eagles, Bon Iver, Shearwater or Vic Ruggiero. (Look them up)
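For anyone who wants to check the per-track arithmetic, here is a minimal sketch in Python. It takes the plan price and the track counts quoted above at face value; the function and variable names are made up for illustration.

```python
# Figures quoted in the post; all names here are illustrative.
PLAN_PRICE = 15.99   # dollars per 30-day period
OLD_TRACKS = 50      # downloads per period before the change
NEW_TRACKS = 37      # downloads per period after the change


def price_per_track(plan_price: float, tracks: int) -> float:
    """Return the effective cost of one download, in dollars."""
    return plan_price / tracks


def percent_increase(old: float, new: float) -> float:
    """Return the relative increase from old to new, as a percentage."""
    return (new - old) / old * 100


old_price = price_per_track(PLAN_PRICE, OLD_TRACKS)
new_price = price_per_track(PLAN_PRICE, NEW_TRACKS)

print(f"old: ${old_price:.2f} per track")
print(f"new: ${new_price:.2f} per track")
print(f"increase: {percent_increase(old_price, new_price):.0f}%")
```

Since the plan price is the same before and after, it cancels out of the percentage: the relative increase depends only on the ratio of the track counts, 50/37.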
Finally, eMusic announced this event as a done deal, completely surprising its customers. Danny Stein, eMusic’s CEO, published a letter on May 31. There are now 1200 customer replies to Mr. Stein’s letter, and it seems like many, if not most, are terribly unhappy about the whole affair.
I could [sic] care less if the selection is broadened when my plan is cut in half. If eMusic didn’t already have what I wanted, then I wouldn’t have subscribed, much less bumped up the subscription plan. Thanks for nothing. Now, where is that exit door
Getting 90 downloads a month for $191.90 a year probably was too cheap. I only wish that its rectification wasn’t such a thinly veiled jump in the sack with big business.
There were other responses too, those that understood the need for a price hike, although they were still grumbling about expanding with a Big Four label, instead of with more indies.
I set up a three-question survey, which I also publicized on eMusic’s blog. I know these surveys are crap as far as sampling goes, but I did it anyway, just to get a feel for things. Also, if I get a few hundred respondents, it may help to send the results to Mr. Stein. If you are an eMusic customer, please take a minute to fill in the survey.
Here is a piece on how eMusic manhandled the whole situation by a poor, mismanaged response to their customers. I mentioned that inherent customer loyalty is a vital part of eMusic’s business structure. Well, it seems that eMusic has taken that loyalty for granted. Not a smart business move, as is obvious from the waves of ire on eMusic’s blog, other blogs and the various social networks.
What will I do? See how this new situation plays out. eMusic was insanely cheap, and a price hike was due at some point. I am just not happy about the way they raised their prices, and who they did it for.
Finally, here is a great video to a great song from Department of Eagles; one of the bands I discovered in my frivolous downloads. I always imagined “No One Does it Like You” to be a mellow, tender, morning-after love song. This video turned it into a rather disturbing and haunting battle of the sexes. Beautiful though, in its own way.
Harvard University has removed from YouTube the video I embedded in my Leonardo Da Vinci and the F0-F1 ATPase post, due to copyright concerns. It is a pity. I believe the main sufferer from this step is the lab that actually created this video, and now has one outlet less to publicize its work. One would think that after a projected loss of 30% of their endowment, Harvard would come up with more creative ideas for freely publicizing their researchers’ fine work, not less. (Yeah, I know no one reads my blog, but everyone goes to YouTube, including people who don’t normally read Nature).
Whatever. I hope that the IP admins at the MRC in Cambridge (UK) have a more advanced view on these matters than their counterparts in Cambridge (US), and will keep the following videos up. Here are two F0-F1 ATPase videos from Dr. John E. Walker’s lab. Incidentally, John E. Walker shared the 1997 Nobel Prize in Chemistry for his work on the ATPase enzymatic mechanism. You may find some of these movies on his web page.
The first is a general overview of the F0-F1 in action:
The second shows views from above and then below the F1 domain around the rotating gamma subunit (that’s the blue eccentric rotor in the middle):
The third is a group of what appear to be Japanese grad students /postdocs demonstrating the ATPase dance. I have no idea where this came from. I give them a “C-” in dancing, but an “A” in structural biology (to get an A+ they should have tossed tennis balls to represent synthesized ATP):