Long-time readers of this blog (hi mom!) know that I am working with many other people on an effort called CAFA: the Critical Assessment of protein Function Annotation. This is a challenge in which many research groups participate, and its goal is to determine how well we can predict a protein's function from its amino-acid sequence. The first CAFA challenge (CAFA1) was held in 2010-2011 and published in 2013. We learned several things from CAFA1. First, the top-ranking function prediction programs generally perform quite well in predicting the molecular function of a protein. For example, a good function predictor can identify whether the protein is an enzyme, and approximately which biochemical reaction it performs. However, the programs are not as good at predicting the biological process the protein participates in: it is more difficult to predict whether your protein is involved in cell division, apoptosis, both, or neither. For proteins that can influence different phenotypes (pleiotropic proteins), or that have different functions in different tissues or organisms, the predictions may be even worse.
So I saw Star Wars VII: “The Force Awakens” the other day. Great movie, which has mostly erased the shame of episodes I-III. Even with more than the usual suspension of disbelief, it’s a great SF flick.
(Major spoilers below! You have been warned!)
One mystery which will hopefully be resolved in the upcoming episodes is the origin of Rey. She is obviously strong with the Force: she uses a Jedi mind trick on a stormtrooper (played by Daniel Craig!) to escape, has visions related to the Force, fights well with a lightsaber despite no previous training, and is a general badass. However, while it is pretty clear that Rey’s midi-chlorian count is violently high, it is not clear how she relates to the Force-inhabited family (if at all). One idea is that she might be Luke’s daughter; another that she may be a second offspring of Leia, making her a sister or half-sister to Kylo Ren.
Genetics can help us solve this problem. My assumption is that the Sith gene (sitH) is X-linked and recessively inherited. X-linked means that the gene sits on the X chromosome: so for someone to turn to the Dark Side, they need not only a high midi-chlorian count, but also a copy of sitH on an X chromosome. Recessively inherited means that for the trait to be expressed, women need two copies of the gene, but men only need one. Females would need two copies (alleles) of sitH, one on each X chromosome (X+/X+), which is why we see few Sith females. But a male inheriting an X chromosome carrying sitH from his mother (X+/Y0) will inevitably turn to the Dark Side, because he does not have a second, x- chromosome to “block” the evil X+. (x- denotes an X chromosome with no copy of sitH; Y0 denotes a Y chromosome, which cannot carry sitH.)
Under this assumption, let’s examine the genotypes of the Skywalker family:
Anakin Skywalker (Darth Vader): X+/Y0. Anakin has a copy of sitH on his sole X chromosome. He fought for three miserable movies against going to the Dark Side, but failed and gave in to his genotype.
Leia Organa: X+/x-. Leia has to have one copy of sitH, since she received one from Anakin, her father, who is X+. Thankfully, the other copy, from her mother Padme Amidala, is x-. sitH being recessive, Leia remained firmly in the Light Side.
Luke Skywalker: x-/Y0. Leia’s twin brother. He received the Y chromosome from Vader, and the x- copy from Padme.
Kylo Ren: X+/Y0. The new evil guy. Not quite a Sith Lord, but definitely embedded in the Dark Side. Leia and Han’s son, he received the X+ copy from mom, unfortunately.
Rey: X+/x- or x-/x-. If Rey is indeed Luke’s daughter, she would have received an x- copy from him. Assuming her mother is not of Jedi / Sith descent, being Luke’s daughter would make her an x-/x-. Having an X+ copy would mean she got it from a parent carrying the X+ allele; given the age differences and the possible range of parents, that would probably mean that Leia is her mother.
(More on X-linked recessive inheritance).
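For fun, the inheritance argument above can be written as a few lines of Python (a toy sketch of my own; the allele labels X+, x-, and Y0 are from this post, not real genetic nomenclature):

```python
from itertools import product

def offspring(mother, father):
    """Enumerate the four equally likely offspring genotypes from two
    parents, each given as a tuple of two sex-chromosome alleles."""
    return [tuple(sorted(pair)) for pair in product(mother, father)]

def dark_side(genotype):
    """X-linked recessive rule: males (any Y0) need one X+ copy to turn;
    females need two."""
    if "Y0" in genotype:
        return "X+" in genotype
    return genotype.count("X+") == 2

# Leia (X+/x- carrier) with Han (assumed x-/Y0): what fraction of sons turn?
kids = offspring(("X+", "x-"), ("x-", "Y0"))
sons = [k for k in kids if "Y0" in k]
print(sum(dark_side(s) for s in sons) / len(sons))  # 0.5 -- poor Kylo
```

Which is exactly why Kylo Ren's fall was a coin flip, and why a hypothetical daughter of Leia could be a carrier without ever turning.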
Dark matter is a proposed kind of matter that cannot be seen, but that we believe accounts for most of the mass in the universe. Its existence, mass, and properties are inferred from its gravitational effects on visible matter. The most favored hypothesis is that dark matter is not composed of baryons, the basic components of visible matter, but of completely different types of particles. In any case, while the actual nature of dark matter is a mystery, its effect on the universe is well-documented, and the term is quite precise in its usage and meaning.
Biological “dark matter” is quite a different thing, as the term is not original, but a metaphor taken from astronomy. As such, it has been used to mean various different things in different fields. In molecular biology, dark matter has been used to mean junk DNA, and non-coding RNA, among other things. In microbiology it has been used to describe the large number of microbes we are unable to culture and classify, and whose nature we mostly infer from metagenomic data. “Dark matter” has also been used for the shadow biosphere, a microbial world that does not feature any known biochemistry, and whose existence is only hypothesized.
I am uncomfortable with the use of the dark matter metaphor in biology. First, the term is used to mean so many different things that it ends up meaning none of them. If someone tells me they work on biological dark matter, my first reaction is: “which one?”. Instead of being a clarifying shortcut, the use of this metaphor adds a layer of confusion. Second, as metaphors go, it is misused. With the exception of the shadow biosphere, we do understand what the various “dark matters” are composed of, and in many cases their effects on known biological mechanisms. The components of the various types of non-coding RNA are well known to us: they are the same nucleotides that make up any RNA molecule. We may not understand their exact effects and how they work, but their nature (nucleic acids which may carry information and catalytic ability) is not outside the scope of our current biological worldview. Similarly with junk DNA: yes, most of the DNA in some eukaryotes does not get transcribed or translated. Some of it may regulate gene expression; some (or a lot) of it is selfish DNA that simply replicates itself. But again, junk DNA is not dark matter: its composition is not alien to our understanding of biology, and there is no biological phenomenon that requires its existence to be explained. In fact, the opposite is true: we are not sure why some organisms have accumulated so much junk in their chromosomes, although we have some interesting hypotheses, and we are pretty sure there is no single underlying cause for the existence of junk DNA. As for the use of dark matter in microbiology: we probably have much to discover about microbial diversity, but it is no dark matter. We do not expect new microbial species to be fundamentally different from those we know now: we expect them to be CHON life with mechanisms fundamentally similar to those we know.
That is not to say we are not hopeful of discovering new and exciting biochemistry and molecular biology in these undiscovered species. I am convinced that there are untold riches there! We may discover new metabolites, new amino acids, maybe even new epigenetic or genetic mechanisms. But those would still be within the framework of known biochemistry. And even if we discover a radically new biochemistry, the second part of the metaphor does not hold: we do not need, at the moment, to assume a completely different microbial world to explain current life. This brings us to the realm of the shadow biosphere: the whole existence of a shadow biosphere, while intriguing, is hypothetical. It is not as if there are unexplained phenomena in biology that can only be explained by the existence of a shadow biosphere, and we have scant evidence of its existence. One day we may very well discover that we are sharing this planet with microbes of non-classical biochemistry. But, for now, we do not see any effects on our own biosphere that can only be explained by a shadow biosphere.
So is there dark matter in biology? Is there a phenomenon that cannot be explained unless we assume a mysterious new player that explains it?
Biology actually has a history of dark matter hypotheses. Biologists have always strived to explain how life differs from non-life, and for that they had to resort to any one of several mysterious “dark matter-like” (or, more appropriately, dark energy-like) explanations. Over time, the differentiation of life from non-life was explained by anything from the Aristotelian Final Cause to, most recently, vitalism: an unmeasurable force or energy that exists only in living creatures and differentiates a frog from a rock. All of these proposals were used to explain the apparent purposefulness of action, or teleonomy, in anything from microbes to humans. As we grew to understand biochemistry and the complexity of life, vitalism went out the window of respected science. Why? Do we understand the apparent purposefulness life displays? Not really, but we do not need to resort to an overarching force that explains it. Our (still highly imperfect) understanding of life is that organisms are emergent highly complex systems, whose apparent purposefulness is tied to evolution and their ability to self-organize and reproduce.
So while there are plenty of unexplained phenomena in biology, at the moment there is really none that requires postulating a dark matter analogous to astronomy’s. Tempting as it may be, perhaps we should calm down on the use of the term dark matter in biology. Biology is confusing, complicated, and mysterious enough without it.
I recently attended a conference which was unusual for me, as most of the speakers came from a computer science culture rather than a biology one. Somewhat outside my comfort zone. The science that was discussed was quite different from that of the more biological bioinformatics meetings, the reason being the different motivations of the scientists and what they value in their research culture.
Biology is a discovery science. Earth’s life is out there, and the biologist’s aim is to discover new things about it: a new species, a new cellular mechanism, a new important gene function, a new disease, or a new understanding of a known disease. Biology is a science of observations and discoveries. It is also a science of history: evolutionary biology aims to find the true relationships among species, which is historical research.
In contrast, the goal of computer science is the study and development of computational systems: chiefly the feasibility, efficiency, and structure of algorithms. So we have two different drivers here: in biology, we try to discover and/or “fill in the blanks” from what we see in nature. In computer science, we seek to understand and better perform computation.
When shall the twain meet? When the problem in biology is that of information processing, and when computer science can innovate in processing that information. Textbook case in point: a basic statistical model in biology today is that of sequence evolution. It states that, given two DNA sequences descended from a common ancestor, their descent can be depicted as a series of nucleotide deletions, insertions and substitutions. In fact, since historically a deletion event in one sequence can be viewed as an insertion event in another, the model actually narrows down to two types of historical events: the insertion/deletion event (or indel), and the substitution event. The model turns out to be a powerful tool, since it can be used to make predictions. Namely, if two DNA (or protein) sequences are found to have a relatively small number of indel and substitution events between them, they are considered homologous. The “relatively small number” is key here, and understanding when the number of steps is small enough to call the similarity between the sequences homology is a whole field unto itself. Finding homologous sequences is important for understanding evolutionary history, but not only for that. If the sequences are homologous, there is a good chance that the proteins they encode have similar functions, even in different organisms: which is the basis for the use of model organisms throughout biomedical research.
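To make the sequence editing model concrete, here is a minimal Python sketch of it (my own illustration: unit-cost dynamic programming over substitutions and indels, whereas real aligners use substitution matrices and affine gap penalties):

```python
def edit_distance(a, b):
    """Minimum number of substitutions and indels turning sequence a into b.

    dp[i][j] holds the edits needed to turn a[:i] into b[:j]."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i              # delete all i characters of a[:i]
    for j in range(n + 1):
        dp[0][j] = j              # insert all j characters of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = dp[i - 1][j - 1] + (a[i - 1] != b[j - 1])  # substitution (or match)
            dp[i][j] = min(sub,
                           dp[i - 1][j] + 1,   # deletion from a
                           dp[i][j - 1] + 1)   # insertion into a
    return dp[m][n]

print(edit_distance("GATTACA", "GATCA"))  # 2: two deletions suffice
```

Two sequences with a small distance relative to their length are candidate homologs; deciding what “small” means is, as noted above, a field unto itself.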
But this is where the biologist and the computer scientist may part ways. The biologist (here “biologist” is shorthand for “biological-discovery-oriented researcher”) will continue to treat the sequence editing model as a tool for discovering things about life, such as finding a human homolog to a gene we know is involved in cancer in mice. The computer scientist (= “computational method investigator and/or developer”) may wish to refine the algorithm or create a new one to make the process faster or more memory-efficient.
When do we have a problem? When researchers in one field do not understand the purview of the other, and seek a measure of simplicity where there is none. I was once told by a rather prominent virologist that “bioinformatics is all about pipelines”. I asked him what he meant by that, and he basically said that all he really needs is a tool that will give him a result and an e-value, “like BLAST”. When I said that statistical significance is, at best, one of several metrics that can be used to understand results, and that sometimes it does not coincide with biological significance or is simply inappropriate, he replied: “well, it shouldn’t be that way; as a biologist I need to know whether the result means something or not, and have a simple metric that tells me that”.
On the other hand, I had a computer scientist claim that, since some proteins are products fused from different genes (he actually meant different ORFs), this phenomenon upends the definition of the gene, and that we should actually have a “new biology” which is not “gene centric”. To that, my reply was twofold: first, that biology is not “gene centric” any more than it is “ATP centric” or “photosynthesis centric”; and second, that the best description I can come up with for a gene is a “unit of heredity”. The reply I got was that this definition is not a good one since it is not rigorous, and too open-ended to be workable. (Note that I could not provide a definition for a gene, only a description.)
Both the virologist and the computer scientist were seeking simplicity or unequivocality in the “other’s” field where those are not to be found. The problem stems from a misunderstanding of each other’s fields, which they see only through the interface to their own. Biologists, who think in terms of discovery using hypothesis-driven research, would like tools that help test their hypotheses. A computer scientist would like a biological phenomenon that is clear-cut and therefore amenable to rigorous modelling. Both are flummoxed when they discover that ambiguity rests in their peers’ fields, even though they can totally accept it in their own.
What is to be done? First, learn more about each other’s fields. If you are a biologist using BLAST (and almost all are), please take care to read up on the statistics behind BLAST results. This will give you an idea of the different metrics BLAST provides you with, and what their meanings are. Do the same for the other software you use, and understand it is not just all about “pipelines”. If you are a computer scientist, and (for example) are interested in genomic annotation, please respect the 150 years* of thought invested in modern biology, which naturally keeps revising itself. Understanding basic biological concepts is necessary before you go about arguing against what might be an unintentional strawman.
Also, try to listen more, and attend meetings outside your comfort zone. It seems I learn more from conversations in my “non-regular” meetings than in my “regular” ones. Of course, once the “non-regular” become my “regular” meetings I will learn less, so basically I may have to constantly shift my comfort zone. Then again, to me it seems like science is always poking and prodding outside one’s comfort zone.
(*I picked the publication date of Origin of Species as an arbitrary start date, one might think this is conservative and go back even further).
Starting June 1, 2015, my lab is moving to Iowa State University in Ames, Iowa, and I’m very excited about this. I’ll be joining a growing cohort of researchers as part of a presidential “big data” hire the university started a year ago. The research environment is superb, and there are some great bioinformaticians and genomics people there already (I’m not naming names, because there are too many, I’ll forget someone, and get people upset at me), as well as an excellent bioinformatics graduate program. I’ll be setting up shop as an Associate Professor in the Department of Veterinary Microbiology and Preventative Medicine. This means I’ll still get to be in an experimental microbiology department, which I like, and work with experimentalists and computational people all around campus. I’m honored that I was chosen for this position! Everyone at the university has been very helpful setting us up, and we’re not even physically there yet.
So if you are looking for a postdoc or graduate studies in computational biology, and you are interested in bacterial genome evolution, document mining, protein function prediction, or the animal, human, or soil microbiome (among other things), I’m hiring. Ames, Iowa has been ranked as one of the best places to live in the US: what are you waiting for? Graduate students: please apply through the Bioinformatics and Computational Biology Graduate Program, and/or contact me directly at Friedberg.lab.jobs at gmail ‘dot’ com. Postdocs: see ad below.
The Friedberg Lab is recruiting postdoctoral fellows for several newly funded projects. The lab is relocating to Iowa State University in Ames, Iowa as part of a university-wide Big Data initiative. Iowa State is a large research university with world-leading computational resources, and a strong, highly collaborative community of bioengineers, bioinformaticians, and life science researchers.
The successful candidates will be joining the lab at the College of Veterinary Medicine, Department of Veterinary Microbiology. Areas of interest include: bacterial genome evolution, gene and protein function prediction, microbial genome mining, animal and human microbiome, and biological database analysis.
These are bioinformatics postdoc positions, and the successful applicants would be required to perform research employing computational biology skills.
Requirements: a PhD in microbiology, bioinformatics, or a related field; a strong publication record in peer-reviewed journals; strong programming skills; strong oral and written communication skills in English; and strong domain knowledge of molecular biology. Salary is competitive and commensurate with experience. The Friedberg lab is a computational biology lab equipped with high-end cluster computers and bioinformatics support.
Ames, Iowa is constantly ranked as one of the best places to live in the US, and has received numerous awards for being a progressive, innovative and exciting community with high affordability and high quality of life.
Candidates should send a C.V. and statement of interest as one PDF document, and have three letters of reference sent independently by their authors to Dr. Iddo Friedberg at Friedberg.Lab.Jobs “at” gmail dot com. Screening of applications begins immediately and will continue until the positions are filled. The positions are expected to start on or after June 2015.
All offers of employment, oral and written, are contingent upon the university’s verification of credentials and other information required by federal and state law, ISU policies/procedures, and may include the completion of a background check. Iowa State University is an Equal Opportunity/Affirmative Action employer. All qualified applicants will receive consideration for employment without regard to race, color, age, religion, sex, sexual orientation, gender identity, genetic information, national origin, marital status, disability, or protected veteran status, and will not be discriminated against. Inquiries can be directed to the Director of Equal Opportunity, 3350 Beardshear Hall, (515) 294-7612
If you haven’t read the transcript of Sean Eddy’s recent talk “On High Throughput Sequencing for Neuroscience”, go ahead and read it. It’s full of many observations and insights into the relationships between computational and “wet” biology, and it is very well-written. I agree with many of his points, for example, that sequencing is not “Big Science”, and that people are overenamored with high throughput sequencing without understanding that it’s basically a tool. The talk, posted on his blog, prompted me to think, yet again, about the relationships between experimental and computational biology. A few things in this talk rubbed me the wrong way though, and this is my attempt to put down my thoughts regarding some of Eddy’s observations.
One of Eddy’s main thrusts is that biologists doing high throughput sequencing should do their own data analyses, and therefore should learn how to script.
The most important thing I want you to take away from this talk tonight is that writing scripts in Perl or Python is both essential and easy, like learning to pipette.
There are two levels to this argument, and I disagree with both. First, scripting and pipetting are not equivalent in the skill level they require, the training time, the aptitude and experience needed to do them well, or the number of things that can go wrong. It may very well be that scripting is a required skill given the lab’s needs, but it is not as easy to learn to do proficiently as pipetting. (Although pipetting correctly is not as trivial as it sounds.)
But there is a deeper issue here; bear with me. Obviously, different labs, and the people in those labs, have different skill sets. After all, it would be surprising if the same lab were proficient in two completely disparate techniques, say both high-resolution microscopy and structural bioinformatics. We expect labs to specialize in certain lines of research, in the consequent use and/or development of certain tools, and as a result in the aggregation of people with certain skill sets and competencies. There are excellent biologists who do wonders with the microscope in terms of getting the right images. Then there are others who have invaluable skills in field work, animal handling, primary cell-culture harvesting and treatment, growing crystals, farming worms; the list goes on. All of these require training, some require months of experience to do right, and even more years to do well. All are time-consuming, both in training and in execution, and all produce data, including, if needed (and sometimes not-so-needed), large volumes of data. Different people have different aptitudes, and if a lab lacks a certain set of skills to complete a project, that is not necessarily a deficiency. It may be something that can be filled through collaboration.
- Unlock the secrets of animals that survive freezing: a crowdfunded science project to sequence the genome of the North American wood frog. Thirteen days to go!
- Why information security is a joke.
- This week was Open Access week. While many researchers support the idea of Open Access, few see it as a consideration for publication. Also, there is the problem of Open Access publication fees being regressive.
- An illustrated book of bad arguments. Very useful.
- Evolution explained clearly, via these beautiful videos.
- Lior Pachter on the two-body opportunity.
- Finally, I’ll put this image here. When you see it…
A question to genome annotators out there. I need a simple genome annotator for annotating bacteriophage genomes in an undergraduate course. Until now, we used DNAMaster but for various reasons I would like to move away from that. Here’s what I need for class:
1. Annotate a single assembled linear chromosome, about 50,000 bp, 80-120 genes, no introns (it’s a bacteriophage). Annotation consists of ORF calling and basic functional annotation (which can be done in conjunction with BLAST, InterProScan, etc.).
2. Simple learning curve: these are undergraduates with no experience in the field.
3. Preferably Linux compatible
4. Can read FASTA, GenBank, GFF3
5. Output in a standard format, i.e. GenBank / GFF3.
6. Simple installation, preferably no unwieldy many-tabled databases.
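For reference, the ORF-calling half of requirement 1 is conceptually simple; here is a toy Python sketch of my own (a naive three-frame, forward-strand ATG-to-stop scanner, nothing like a trained gene caller such as Glimmer, which also handles alternative starts and the reverse strand):

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_len=300):
    """Naive ORF caller: scan the three forward reading frames for
    ATG...stop spans of at least min_len nucleotides.
    Returns (start, end) coordinates, end-exclusive."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG" and start is None:
                start = i                      # open an ORF at the first ATG
            elif codon in STOP_CODONS and start is not None:
                if i + 3 - start >= min_len:   # keep it only if long enough
                    orfs.append((start, i + 3))
                start = None                   # close the frame either way
    return orfs

demo = "ATG" + "GCT" * 40 + "TAA"              # a 126-nt toy gene
print(find_orfs(demo, min_len=100))            # [(0, 126)]
```

For a 50 kb phage genome with 80-120 genes this naive approach over-calls badly, which is exactly why the course pairs ORF calling with BLAST/InterProScan evidence.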
I’ve been playing with Artemis using a Glimmer3 annotated genome and it seems OK so far, but what else is out there?
Please comment here and/or tweet @iddux with your ideas. Thanks!
Scammers are cashing in on the ebola scare. The news media is cashing in on the ebola scare. Politicians are cashing in on the ebola scare. Unfortunately, neither international healthcare nor biomedical research are cashing in on the ebola scare.
I found the first software patch. Seems pretty robust.
Four years ago I wrote about how Open Access would be adopted if it were convenient. Polls at the time showed that few scientists actively sought to publish OA, even though many supported it. The reasons given, in no particular order: aiming for journals that were not OA, and high publication fees. My conclusion was that researchers will publish OA not from any OA ideology, but out of convenience. I should have also added palpable gain (e.g. in publication venue prestige).
So for open access week, I decided to revisit that post “The Revolution Will be Convenient“. Has anything changed since?
No and yes. In a 2014 survey of Canadian researchers, immediate Open Access received only 3.3 out of a possible 100 importance points. OA publishing after an embargo period received 2.2 points. The top considerations for choosing a publication venue were impact factor (26.8 points) and journal reputation (42.9). So in that sense (and assuming the Canadian survey reflects the attitudes of scientists in other countries), little has changed. While there may be a positive attitude towards publishing immediate OA, there is little incentive to do so. Scientists want to make a splash every time they publish, but most do not seem to equate Open Access with “making a splash”. Researchers are still mostly more concerned about promotion and grant review committees reading where they published than about growing the audience reading what they published.
“The survey suggests however, that there is a disconnect between researchers’ apparent agreement with the principle of open access (i.e., that research should be freely available to everyone) and their publishing decision criteria. Although the vast majority of researchers (83%) agree with the principle of open access, the availability of open access as a publishing option was not an important decision criterion when selecting a journal in which to publish. In this regard, availability of open access ranked 6th out of 18 possibility criteria. It was eight times less important than impact factor and thirteen times less important than journal reputation when selecting a journal” (Source)
But it seems like things are changing, although those changes are being made by the establishment rather than by the people. So not exactly a revolution: the people still overwhelmingly favor prestige over access. Increasingly, funding agencies and universities are mandating OA publication, although not immediate, author-pays “gold” open access. The mandates are mostly for self-archiving, or “green” open access. The NIH mandates that all NIH-funded research be made publicly available within 12 months of publication. Other US Federal agencies have been directed by the White House Office of Science and Technology Policy to develop policies for expanding access to Federally funded research. Open access mandates by funding agencies are not a done deal yet, but the wind seems to be blowing in that direction. Many journals now allow self-archiving of preprints and pre-publication copies. Green open access is convenient, free of publication charges, and mostly does not interfere with the main consideration researchers have for a publication venue: the “high profile” journal.
So is there a problem? Yes, several. First, the embargo period. Green OA allows for an embargo period, which means that the final paper is not immediately freely available. Anyone wishing to read the latest and hopefully greatest achievements in science would still be stumped by a paywall. Of course, this goes back to the whole “do I care who reads me?” question: researchers seem to mostly care about other researchers reading what (and where) they published, and those colleagues would mostly have access to the manuscript anyway. Second, self-archiving and preprint policies vary; even if a self-archived copy is available, it may take some effort to locate it, although Google Scholar seems to be doing a pretty good job in that department. Finally, publishers’ policies regarding preprints vary and are sometimes unclear, which can deter researchers from self-archiving lest they violate some policy. So green is not without its shortcomings, even without an embargo period.
In the UK, a 2012 report by The Working Group on Expanding Access to Published Research Findings supported Gold OA. It discounted the Green methods for many of the reasons stated above, and recommended the author-pays, immediate publishing model.
“If green is cumbersome, messy, involves assumptions about cooperation and investment in infrastructure, and still only delivers an imperfect version of the article, and then several months after publication, surely it’s better to pay for the final version to be accessible upon publication?” (Source)
On the plus side, no one is really arguing anymore about whether we should even publish open access. OA is here to stay, and the questions asked today relate to degrees of accessibility and freedom to reproduce, and to financial models to support OA. But the Canadian survey has brought this to the forefront: something is wrong in a scientific culture that has turned communication into coinage, and the disconnect between the values researchers profess (overwhelmingly pro open access) and what they actually practice (OA counts for little when choosing a publication venue) is worrying.
So great are the rewards for publishing in top academic journals that everyone games furiously – authors, editors, universities and academic publishers… (Source)
(With apologies to the memory of Elizabeth Barrett Browning)
How shall I license thee? Let me count the ways
I license thee to be free to distribute and embed
My code can be buggy, when I wrote it late last night
“While” loops have been made without a stated end
I license thee to change and modify
Most urgent need, by emacs and vi
I license thee MIT, as men strive for Right,
I license thee QPL, as they turn from Praise.
I license thee with a Python, put to use
In my old griefs, and with my postdoc’s faith.
I license thee with a license I seemed to lose
With my crashed disk, — I license thee with the BSD
Mozilla, Apache, of all my web-stuff! – and, if Stallman choose,
I shall license thee better with GPLv3.
TL; DR: The genome sequence of the North American Wood Frog will tell us a lot about the genetic control of freezing and reanimating whole organisms. My friend and colleague, Dr. Andor Kiss is crowdfunding this project. If you would like to help, please go to experiment.com. You will get acknowledged by name in the paper. To learn more on why this is cool and important, read below.
Eighteen people die each day in the US waiting for an organ transplant. Every ten minutes, a person gets added to the waiting list. The need for improvement in organ donations is real.
Why are these statistics so grim? Even when a potentially good match is found (which can take months or years), there is a very short window between the time an organ is donated and the time it can be transplanted. The maximum viability time for a human kidney is estimated at 35 hours; a liver, 20; and a lung, less than 10. This time constraint also limits the availability of matching organs. Just imagine if we could freeze and thaw organs without the risk of killing them, keeping them viable for months or even years. The time patients need to wait would be shorter, and better matches might be found as the number of frozen organs increases. If we could learn to freeze organs without damaging them, we would revolutionize organ transplantation in the same way refrigeration and freezing revolutionized the food industry. Today, however, freezing organs is not an option: once an organ is frozen, there is irreversible and widespread damage from the formation of ice crystals. Cells shrivel and collapse, blood vessels disintegrate, and connective tissue rips apart.
But there are animals that can freeze and reanimate multiple times. In fact, if you live in the northern parts of North America, you have probably seen one, and almost surely heard it: the North American Wood Frog. The Wood Frog can freeze solid and then thaw, multiple times, with no ill effect. During a freeze event, the frog dumps glucose (a sugar) and high levels of urea (a compound normally found in urine) into its bloodstream. The glucose pulls water out of the cells and causes ice to form outside of the cells, a type of cryo-dehydration. This prevents ice from forming inside the cells, where it would cause irreparable damage. The urea is thought to do two things: one, it also protects the cells’ integrity from damage, and two, it helps slow down the frog’s metabolism. The fact that the frog can freeze is, in and of itself, pretty spectacular: no heartbeat, no brain activity, no movement. When it thaws, the animal spontaneously reanimates.
What seems even more bizarre about this animal is that once the frog is acclimated to summer, freezing it will simply kill it. We think there is some sort of seasonal trigger that prepares the animal for winter and the possibility of freezing. There must therefore be a change in gene expression between the summer and winter frogs. One could think of this animal as its own experimental control! So to understand how the Wood Frog survives freezing, we just pick frogs from different seasons and look at the differences in RNA expression. This can clue us into what makes a freeze-adapted frog different from a non-freeze-adapted one. Andor has actually been doing that, and will be talking about it next week at the American Physiological Society Meeting in San Diego (if you’re there, walk up to him and say “hi”). But what Andor doesn’t have is a good reference genome. Nothing close to the Wood Frog has been sequenced yet. Xenopus is a genus of laboratory frogs whose genomes have been sequenced, but as a reference for the Wood Frog, the Xenopus genomes aren’t good: the two species are too far apart.
Even more interesting, having the genome of the Wood Frog will enable studies of the different epigenetic patterns between summer and winter frogs. The control of gene expression is ultimately what Andor is interested in sorting out, and it’s likely that epigenetics is involved: changes to the DNA that are not in the actual sequence (like methylation) but affect the production of RNA and proteins. Additionally, because we don’t know the Wood Frog genome yet, we may find gene family expansions and contractions, and novel genes that impart freeze tolerance to the animal, which we could not possibly predict using a hypothesis-driven approach.
And it can all be done relatively cheaply. For less than $4,000 (which is what Andor is asking for), one can sequence a vertebrate genome. That makes it feasible for a single researcher to (a) build the library and (b) have the sequencing done. Annotation, of course, is another story, but we are planning a jamboree for that. Stay tuned.
Interested? The project is not too expensive, only $4,000! Any little bit helps a lot. Please go to the science crowdfunding site experiment.com, and give something. You will get acknowledged by name in the paper as part of the “Wood Frog Sequencing Consortium”. Thank you!
Fakeference invitation: an email from Nancy, Sally or June, inviting you, for the second time (“perhaps you didn’t get my first invitation, there may be something wrong with my email”) to speak at a conference. The meeting has 5-10 Nobel laureates listed as invited speakers, and covers everything in science, from quantum mechanics to fish breeding. Needless to say, there is no meeting.
Fauxpen access: when a publisher offers “open access” publication, but not really. The paper is not under Creative Commons; the publisher still holds the copyright. That doesn’t stop them from charging you $4,000.
Grantxiety: that time between the moment you hear that your grant has scored well, and the moment you hear that it is still too low to be funded.
LinkedWho: an invitation from someone you don’t know to join on LinkedIn.
Paper turfing: when you refuse a request to review a manuscript, because the abstract is so poorly written you don’t even want to think what it would be like to slog through the whole paper.
PCWave: (rhymes with “PCA”): someone showing a principal components analysis chart at a seminar, furiously waving their hands around the data points to convince the audience they are clustered in some meaningful way.
Spamdoc: an email that begins with “Dear esteemed professor”, continues with a scientific biography that has no relevance to the research you do, and ends with a request to join your lab.
Starer bars: uncategorized, unannotated error bars in a graph. Are these SD, SE, CI or what? I don’t know, I guess I’ll just stare.
Travel dead zone: too far to drive, too near to fly (does not apply to countries that have good trains).
Virtual absence: taking your laptop to a coffee shop, activating the away message on your email and not answering the phone because you want to get some work done.
Virtual presence: being at a remote conference but answering work emails and Skype calls any time. Including 3am.
Workminar: when you go to a seminar for politeness sake, but take your laptop and work furiously through it because the grant deadline is tomorrow.
Or: “Estimating how much we don’t know, and how much it can hurt us”.
One of the main activities I’m involved with is CAFA,* the critical assessment of function annotations. The general idea of CAFA is to assess how well protein function prediction algorithms work.
Why is it important to do that? Because today most of our understanding of what genes do comes from computational predictions, rather than actual experiments. For almost any given gene that is sequenced, its function is determined by putting its sequence through one or more function annotation algorithms. Computational annotation is cheaper and more feasible than cloning, translating, and assaying the gene product (typically a protein) to find out exactly what it does. Experiments can be long, expensive and, in many cases, impossible to perform.
But if we resort to computational annotation of protein function, we need to know how well these algorithms actually perform. Enter CAFA, of which I have written before. CAFA is a community challenge that assesses the performance of protein function prediction algorithms.
How does the CAFA challenge work? Well, briefly:
1. Target selection: we select a large number of proteins from SwissProt, UniProt-GOA and other databases. Those proteins have no experimental annotations, only computational ones. Those are the prediction targets.
2. Prediction phase: we publish the targets. Participating CAFA teams now have four months to provide their own functional annotations, using the Gene Ontology, a controlled vocabulary describing protein functions.
3. Growth phase: after four months, we close the predictions and wait for another six months or so. During those six months, some of the targets acquire experimentally-validated annotations. This typically means that biocurators have associated some of these proteins with papers that provide experimental data. We call the targets that were experimentally annotated during this phase benchmarks. Typically, the benchmarks are a small fraction of the targets. But since we pick about 100,000 targets, even a small fraction comprises a few hundred benchmarks, which is enough to assess how well programs match these proteins to their functions.
4. Assessment: we now use the benchmark proteins to assess how well the predictions are doing. We look at the GO terms assigned to the benchmarks, and compare them with the GO terms which the predictors gave.
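To make the assessment step concrete, here is a minimal sketch in Python of the comparison for a single benchmark protein. The flat-set treatment is a simplification of my own: the real CAFA assessment propagates terms up the ontology graph and scores across prediction thresholds.

```python
# Toy assessment of one benchmark protein: compare the GO terms a method
# predicted against the experimentally validated GO terms.
# (Illustrative only; real CAFA assessment is ontology-aware.)
def precision_recall(predicted, experimental):
    """Return (precision, recall) for one protein, given two sets of GO terms."""
    tp = len(predicted & experimental)   # predicted and experimentally confirmed
    fp = len(predicted - experimental)   # predicted but not confirmed
    fn = len(experimental - predicted)   # confirmed but missed by the predictor
    precision = tp / (tp + fp) if predicted else 0.0
    recall = tp / (tp + fn) if experimental else 0.0
    return precision, recall

# Hypothetical example: a method predicted kinase activity and ATP binding,
# but only kinase activity was experimentally confirmed during the growth phase.
predicted = {"GO:0016301", "GO:0005524"}
experimental = {"GO:0016301"}
print(precision_recall(predicted, experimental))  # (0.5, 1.0)
```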
Sounds good, what’s the problem then?
Well, there are a few, as there are with any methods challenge. For one, who’s to say that the metrics we use to assess algorithm prediction quality are the best ones? To address that, we actually use several metrics, but there are some we don’t use. Our AFP meetings always have very lively discussions about these metrics. No bloody noses yet, but pretty damn close.
But the issue I would like to talk about here is how much we can know at the time of the assessment. Suppose someone predicts that the target protein “X” is a kinase. Protein “X” happened to be experimentally annotated during the growth phase, so now it is a benchmark. However, it was not annotated as a kinase. So the prediction that “X” is a kinase is considered, at the assessment, a false positive. But is it really? Suppose that a year later, someone does discover that “X” is a kinase. The prediction was correct after all, but because we did not know it at the time, we dinged that method a year ago when we shouldn’t have. The same goes for the converse: suppose “X” was not predicted to be a kinase, and was also assessed not to be a kinase. Two years later, “X” is found to be a kinase. In this case, not predicting a kinase function for “X” was a false negative, and we gave a free pass to the methods that made it. The figure below illustrates that at time “A” we have incomplete knowledge compared to a later time “B”.
So how many of these false false-positives and false false-negatives are there? Putting it another way, by how much does the missing information at any given time affect our ability to accurately assess the accuracy of function prediction algorithms? If we assess an algorithm today only to discover that our assessment is wrong a year from now, our assessment is not worth much, is it?
To answer this question, I first have to explain how we assess CAFA predictions. There are two chief ways we do that. First, there is a method using precision (pr) and recall (rc). Precision/recall assessments are quite common when assessing the performance of prediction programs.
The precision is the number of correct predictions out of all predictions, true and false. Recall is the number of correct predictions out of all known true annotations. (tp: true positives; fp: false positives; fn: false negatives). We can use the harmonic mean of precision and recall to boil them down to one number also known as F1:
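Written out, with tp, fp, and fn as defined above (my reconstruction of the standard definitions the text refers to):

```latex
pr = \frac{tp}{tp + fp}, \qquad
rc = \frac{tp}{tp + fn}, \qquad
F_1 = \frac{2 \cdot pr \cdot rc}{pr + rc}
```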
The F1 is one metric we use to rank performance in CAFA. If a prediction is perfect, i.e. there are no false positives or false negatives (fp = 0, fn = 0), then the precision equals 1, the recall equals 1, and F1 = 1. On the other hand, if there are no true positives (tp = 0, that is, the method didn’t get anything right), then F1 = 0. Between these two extremes lies the spectrum of scores of the methods predicting in CAFA. Here is how well the top-scoring methods did in CAFA1:
But the F1 (or rather Fmax, the maximum F1 for each method) given here is for time (A), when we first assess CAFA. But at time (A) our knowledge is incomplete. At time (B) we know more! We have some idea of the alpha and beta errors we made. (At later time points, C etc., we will know more still.)
OK, so let’s rescore the methods at time (B), and call this new score F’1. First, the new precision and recall, pr’ and rc’, at time (B), when we are wiser than (or at least more knowledgeable of our errors than at) time (A).
So F’1 is:
So what? Bear with me a bit longer. We have now formalized F1 (at the time of our first assessment) and F’1 (at some later time, when we know more and recognize our α and β errors). What we are interested in is whether the differences between F1 and F’1 are significant. We know that pr’ > pr because β > 0. The change in precision is:
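Writing β for the number of time-(A) false positives later found to be true positives (my reconstruction of the missing equation; note that the set of predictions itself does not change between (A) and (B), so the denominator stays put):

```latex
pr' = \frac{tp + \beta}{tp + fp}, \qquad
\Delta pr = pr' - pr = \frac{\beta}{tp + fp} \ge 0
```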
So the precision can only grow, and the more false positives we discover to be true positives at time (B), the better the precision gets. In other words, when we deal with precision, the information missing at time (A) penalizes methods unfairly: once we know more at time (B), some of their apparent mistakes turn out to be correct predictions.
But what about recall and missing information on false positives? After all, F1 is a product of both precision and recall.
With recall, the story is slightly different. Here, rc’ is the recall on the new annotations.
I won’t get into the details here, you can see the full derivation in the paper (or work it out for yourself), but the bottom line is:
This is a surprising finding: F’1 will exceed F1 (that is, ΔF1 > 0) only if rc’ is greater than half of F1. ΔF1 does not depend directly on precision, only on recall!
So the F1 measure is quite robust to these changes for prediction tools operating in the high-precision, low-recall regime, which is characteristic of many of the tools participating in CAFA. The study also shows that, on real protein data, changes in F1 are not that large over time.
To be fair, the other metric we use in CAFA, semantic distance, is more sensitive to varying values of δ. But even then the error rate is low, and we can estimate it using studies of predictions over previous years. Our study also includes simulations, playing around with α and β values to see how much we can aggravate changes in F1 and semantic distance.
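In that spirit, here is a toy simulation of my own (not the paper’s actual experiment): take made-up confusion counts at time (A) and reclassify β former false positives as true positives, to see how far F1 moves. Only the β effect is modeled here; newly annotated terms that no one predicted (the δ effect) would pull recall the other way.

```python
# Toy simulation: how does F1 shift when beta false positives from time (A)
# are later discovered to be true positives? (All counts below are invented.)
def f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

tp, fp, fn = 80, 20, 40                 # made-up confusion counts at time (A)
score_a = f1(tp, fp, fn)

# Reclassifying beta false positives as true positives: tp grows, fp shrinks,
# and the pool of known true annotations (tp + fn) grows by beta as well.
for beta in (0, 5, 10):
    score_b = f1(tp + beta, fp - beta, fn)
    print(f"beta={beta}: F1 {score_a:.3f} -> {score_b:.3f}")
```

With these particular numbers the score only creeps up as β grows, which is the robustness the study reports; a method with much lower recall would show larger swings.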
Bottom line: missing information will always give us some error in the way we assess function prediction programs. But we can at least estimate the extent of the error and, under realistic conditions, the error rate is acceptably low. Making decisions when you know you don’t have enough data is a toughie, and is a much-studied problem in machine learning and game theory. Here we quantified this problem for CAFA. At least in our case, what we don’t know can hurt us, but we can estimate the level of hurt (misassessing algorithm accuracy), and it doesn’t really hurt. Much.
Jiang, Y., Clark, W., Friedberg, I., & Radivojac, P. (2014). The impact of incomplete knowledge on the evaluation of protein function prediction: a structured-output learning perspective. Bioinformatics, 30(17). DOI: 10.1093/bioinformatics/btu472
(*For all you Hebrew speakers sniggering out there — yes, the acronym is, purposefully, the Hebrew slang for “slap” כאפה).