Changing directions
For some reason, this reminds me a lot of the way some of my research has been going recently….
I’m organizing a workshop later this month (see here, scroll to session V), and I have just received the attendees list from the main conference’s organizers. Since I need to spam, er, send the attendees an informative email about the workshop, I needed their email addresses. Here’s what I did.
The file itself is an MS Word doc. I save those in the native openoffice format on my system. Now, an openoffice document is really just a bunch of mostly XML documents zipped together. If you do the following:
unzip -l conference-delegates.odt
You get a listing that looks like this:
Archive:  conference-delegates.odt
  Length      Date    Time    Name
---------  ---------- -----   ----
       39  2010-09-01 18:16   mimetype
    71244  2010-09-01 18:16   content.xml
       94  2010-09-01 18:16   layout-cache
    15522  2010-09-01 18:16   styles.xml
     1241  2010-09-01 18:16   meta.xml
    24852  2010-09-01 18:16   Thumbnails/thumbnail.png
        0  2010-09-01 18:16   Configurations2/accelerator/current.xml
        0  2010-09-01 18:16   Configurations2/progressbar/
        0  2010-09-01 18:16   Configurations2/floater/
        0  2010-09-01 18:16   Configurations2/popupmenu/
        0  2010-09-01 18:16   Configurations2/menubar/
        0  2010-09-01 18:16   Configurations2/toolbar/
        0  2010-09-01 18:16   Configurations2/images/Bitmaps/
        0  2010-09-01 18:16   Configurations2/statusbar/
     8961  2010-09-01 18:16   settings.xml
     1988  2010-09-01 18:16   META-INF/manifest.xml
---------                     -------
   123941                     16 files
Wow. Which file contains the delegates’ emails in all that? Actually, content.xml contains the textual content of the openoffice.org document. You can open it with your favorite XML viewer and see how it’s constructed (I like Firefox myself for browsing, and XML Copy Editor for more in-depth diagnosis). But for now, we would like to extract the emails. So we unzip content.xml only:
unzip conference-delegates.odt content.xml
This unzip command will only extract content.xml from the archive that is the .odt file.
When looking at the content.xml file, we see lines like this:
<text:a xlink:type="simple" xlink:href="mailto:noone@usc.edu">
  <text:span text:style-name="Internet_20_link">
    <text:span text:style-name="T2">noone@usc.edu</text:span>
  </text:span>
</text:a>
</text:p>
Which means that “noone’s” (usernames have been changed to protect the innocent) email appears both as text and as a hyperlink. It may or may not be that all the delegates’ emails are hyperlinked, so we may expect some duplications we need to get rid of.
To get the email addresses themselves, we use egrep. egrep uses the extended regular expression syntax in searching for emails. What is a good regex for email addresses? There is a good discussion of that at the regex-guru site. I use the rather simple form:
egrep -o -i '\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b' content.xml
Explanation: the -o option prints only the part of each line that matches the regex, one match per output line. -i means a case-insensitive match. egrep is the extended version of grep, which can handle regexes with things like {m,n} repeats. However, the result of our little exercise would still have duplicate emails, because of the hyperlinking tags. Here is how to get rid of the duplicates:
egrep -o -i '\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b' content.xml | sort | uniq
sort sorts the output alphabetically, preparing it for uniq to get rid of duplicates.
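To see the deduplication at work without the real delegate list, here is a small sketch that runs the same regex over a made-up sample. The sample lines, the addresses in them, and the function name extract_emails are assumptions for illustration only. It uses grep -E rather than egrep, since recent GNU grep releases warn that egrep is obsolescent, and sort -u, which does the job of sort | uniq in one step. The \b word-boundary escape is an extension that GNU and BSD grep both accept, but it is not guaranteed everywhere.

```shell
#!/bin/sh
# extract_emails FILE: print each distinct email address found in FILE.
# (Hypothetical helper name; the regex is the one used in the post.)
extract_emails() {
    # -E: extended regex, -o: print only the matching part, -i: ignore case
    grep -E -o -i '\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b' "$1" | sort -u
}

# Made-up sample mimicking content.xml: each address shows up twice,
# once in the mailto: link and once as the visible text.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<text:a xlink:href="mailto:noone@usc.edu"><text:span>noone@usc.edu</text:span></text:a>
<text:a xlink:href="mailto:someone@example.org"><text:span>someone@example.org</text:span></text:a>
EOF

extract_emails "$sample"   # four raw matches collapse to two addresses
rm -f "$sample"
```

Without the sort -u stage the sample would yield four lines, two per delegate, which is exactly the duplication caused by the hyperlink markup.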
One last touch-up: we really don’t need to physically extract the content.xml file. “unzip -c” extracts files to stdout. Therefore, we can get the email addresses without cluttering our disk:
unzip -c conference-delegates.odt content.xml | \
egrep -o -i '\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b' | sort |\
uniq > email-these.txt
Voila! email-these.txt now contains the emails of the conference delegates.
One last word: it may have been easier just to save the MS-Word doc file as text using the “File -> Save as…” option in openoffice.org. Suppose we saved the file as conference-delegates.txt. We wouldn’t have to muck about with all the XML, or remove the email address duplicates due to hyperlinking. We could have just done:
egrep -o -i '\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b' \
conference-delegates.txt > email-these.txt
But where’s the fun in that?
Happy spamming!
No, not a new hunter-killer drone, neither is it the n+1 installment in the sci-fi horror series. Rather, Myxococcus xanthus. Again.
M. xanthus is a highly cooperative bacterium, as we have already seen: when starved, most cells “commit suicide” while a few form spores, to survive the lean times. But M. xanthus also cooperates when times are good and food plentiful: M. xanthus cells form swarms, which envelop their prey and increase the concentration of digestive enzymes they secrete to the environment.
M. xanthus cells also ripple together to better ingest the nutrients released after digesting their prey. To do so, they use a type of pilus (a motility organ, like a bacterial tentacle) called the type IV pilus. But amazingly enough, in mutant M. xanthus that are unable to make these pili, a new mechanism for cooperative swarming evolved. In 2003, Velicer & Yu from the University of Tuebingen used a group of mutant M. xanthus lacking the type IV pili gene to show that cooperative swarming can evolve using an alternative mechanism. The mutant bacteria do that by forming a physical net of sugars and proteins that connects them, and their rippling motion, together. Watching behavior evolve: how cool is that?
Finally, a movie of M. xanthus swarming & rippling:
Velicer, G., & Yu, Y. (2003). Evolution of novel cooperative swarming in the bacterium Myxococcus xanthus. Nature, 425 (6953), 75-78. DOI: 10.1038/nature01908
Berleman, J., Chumley, T., Cheung, P., & Kirby, J. (2006). Rippling is a predatory behavior in Myxococcus xanthus. Journal of Bacteriology, 188 (16), 5888-5895. DOI: 10.1128/JB.00559-06
I have written before about bacterial cooperation, and how cheating works, up to a point, in an environment of bacterial cooperation. That post talked about bacterial quorum sensing, the collective signaling mechanism by which bacteria construct supra-cellular structures called biofilms. Biofilms are tough multicellular enclosures that allow bacteria to survive and thrive in hostile environments, and to invade host species. Both studies discussed there showed that freeloading does not pay off. Bacteria that do not chip in to build the biofilm, yet benefit from it, are ultimately doomed, and sometimes doom the collective of which they are constituents. That post dealt with the “here and now” aspect of cooperation and cheating.
This post deals with another aspect of bacterial cooperation: how does it evolve? Why cooperate in the first place at all? Every time an individual cooperates, short-term gains are sacrificed for long-term ones, but those long-term ones are contingent upon all or most cooperating individuals doing their bit. Think about standing in line for the bus. If everyone cooperates, we get on the bus faster, but some of us may be forced to stand. On the other hand, shoving your way to the beginning of the line will assure you a good seat, albeit at the expense of glares from your fellow-passengers, and maybe a few altercations along the way. In evolutionary terms, selfishness seems like a sounder strategy than cooperating. After all, if you manage to gain a better position for yourself in life’s pecking order, you pass those genes that enable that to your progeny, and further down the line. Why cooperate or act selflessly in the first place? Why let someone else share the gene pool with you when you can have it all to yourself?
Unless that “someone else” shared genes with you: that is, they were related in some way. Suddenly, cooperation seems to have evolutionary benefits: you are preserving and passing on some of the same genes. Protecting kin is the most often-used explanation for how cooperation evolved in the first place: kin selection, meaning favoring cooperation with those individuals with whom you share a larger number of genes over those with whom you share fewer. Evolutionary biologists use Hamilton’s rule as a guideline: the higher the benefit of the cooperation, the lower the cost, and the closer the relatedness of the individuals cooperating, the more likely it is that there will be cooperation. Putting it into an equation, cooperation will evolve if the following condition is met:
rb - c > 0
Where r is the relatedness (on a scale of 0 to 1, where 0 means no genetic relation and 1 means self), b is the benefit of cooperation and c is the cost. This rule, formulated by William Donald Hamilton, is a centerpiece of evolutionary biology. Imagine going on a day’s hunt where the quarry is a large animal that can feed one hunter for 35 days, but requires at least five hunters to take it down. Now there are also smaller animals around that can be hunted by one individual, and they supply enough food for one hunter for two days. Is it beneficial to hunt alone or together? Let’s figure the benefits and costs. For hunting the large animal, the one that requires at least five hunters, the benefit is a week of food each (b = 35/5 = 7) while the cost is one day of hunting (c = 1). If the individuals are cousins sharing an average of 20% of the genetic material then:
0.2×7 – 1 = 0.4 is the benefit score
If they are siblings, sharing 50% of the genetic material, then:
0.5×7 – 1 = 2.5 the benefit score is even higher
But what about individual hunting? The benefit of the smaller quarry is 2 days’ worth of food, the cost is one day of hunting, and you do it alone. So r=1 (yourself), b=2 and c=1, giving us:
1×2-1 = 1
In this hypothetical model, a group of siblings will cooperate to hunt big game, while cousins would probably hunt smaller game alone. If you want to dig deeper into how Hamilton’s rule was derived, and further implications of the rule, I recommend this post.
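The three scores above can be recomputed with a few lines of shell, as a quick check on the arithmetic. This is only a sketch: the function name hamilton is made up here, and the numbers are the ones from the hypothetical hunting example, not measured values.

```shell
#!/bin/sh
# hamilton R B C: print the score r*b - c from Hamilton's rule.
# A positive score means the rule favors the behavior; when comparing
# options, the higher score wins. (Hypothetical helper name.)
hamilton() {
    awk -v r="$1" -v b="$2" -v c="$3" 'BEGIN { printf "%.1f\n", r * b - c }'
}

echo "cousins, big game:  $(hamilton 0.2 7 1)"   # 0.4
echo "siblings, big game: $(hamilton 0.5 7 1)"   # 2.5
echo "alone, small game:  $(hamilton 1 2 1)"     # 1.0
```

Siblings score 2.5 on the group hunt against 1.0 for hunting alone, so they cooperate. Cousins score only 0.4 on the group hunt, so going solo looks better to them.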
Any mechanism in evolution is examined through the lens of fitness. Fitness is the relative ability to produce and support viable progeny. So if cooperation increases fitness, we can use the following simple graph to explain the difference between cooperating and non-cooperating individuals in a cooperating population using Hamilton’s rule:

Figure 1: Hamilton's rule prediction: the fitness of cooperators (blue) and non-cooperators (red) increases as the number of cooperators among social neighbors (x-axis) increases. The slope of both lines is the benefit (b).
The benefit, b, is the slope of these two lines, and the vertical gap between them is the cost, c. Note that for any given frequency of cooperation in the population, the non-cooperating individuals (red line) have a higher fitness than the cooperating ones (blue line). It seems that it “pays off” to be a self-server no matter the social environment you are in, even though you still benefit from being in a cooperating community. Yeah, we all know the type.
But what happens when the difference between cooperating and not cooperating depends on the percentage of cooperators in the population? Not too hard to imagine: if most of the population is playing nicely together and benefiting from it, then this might change the attitude of the selfish individuals more readily than if only a small fraction of the population is cooperating. But as it stands, Hamilton’s rule does not provide for this type of model. However, the following modification of Hamilton’s rule does:
r ⋅ b – c + m ⋅ d > 0
Relatedness, r, is now not a scalar (a single number), but a vector (an ordered set of values) r = {r1, r2, …} describing relatedness under different conditions. Ditto the benefit vector, b, which holds the coefficients of the equation describing the fitness of non-cooperators as a function of how many neighboring cooperators there are in the population (red lines). In a linear setting (Figure 1), r = {r1}, b = {b1} and m⋅d = 0, collapsing the expanded equation into the classical Hamilton’s rule. We won’t get into m and d in this post. They are important though, and you should read the paper to understand the role they play.
Expanding r and b from scalars to vectors enables a richer and more flexible description of Hamilton’s rule, allowing it to describe non-linear relationships like this:
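Written out, the collapse to the classical rule looks like this. The sketch assumes that “⋅” stands for the ordinary dot product between vectors of equal length, which this post does not spell out. The index k is introduced here only for illustration.

```latex
% Generalized rule, with the dot product written out (assumed meaning of the dot)
\[
  \mathbf{r}\cdot\mathbf{b} - c + \mathbf{m}\cdot\mathbf{d} > 0,
  \qquad
  \mathbf{r}\cdot\mathbf{b} = \sum_{k} r_k b_k .
\]
% Linear case of Figure 1: one-element vectors and a vanishing m.d term
\[
  \mathbf{r} = (r_1),\quad \mathbf{b} = (b_1),\quad \mathbf{m}\cdot\mathbf{d} = 0
  \;\Longrightarrow\;
  r_1 b_1 - c > 0 .
\]
```

With r1 and b1 renamed r and b, the last inequality is the familiar rb - c > 0.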

Figure 2: Note two things. First, the relationship between fitness and the fraction of cooperators in the population is not linear. Second, the difference in fitness between cooperators and non-cooperators decreases as the fraction of cooperators in the population goes up. These two phenomena cannot be described by the classic Hamilton's rule equation. They can be described using the modified rule.
This modification of Hamilton’s rule was developed by Jeff Smith and colleagues, at the department of Biology at Indiana University. Armed with the new equation, Smith and his colleagues decided to see how well it can be applied. They decided to look at Myxococcus xanthus. M. xanthus bacteria behave normally as long as food is abundant: they swim around and proliferate by cell-division as bacteria do. But when starved, they aggregate, and some cells form resistant spores, while the others die. Some cheating strains sporulate well when in cooperating populations, but do poorly on their own. The scientists mixed a cooperator strain with a cheating strain at different frequencies, starved them, and measured the fraction of each strain in the population of surviving spores. They found the following: first, the fitness effect was non-linear; in fact, it was almost exponential. Second, cooperators were more fit than cheaters at low cooperator frequencies, but cheaters fared better at high cooperator frequencies. So it pays to freeload when most people around you behave nicely. In the case of M. xanthus, the added value to the population is quite high. In fact, the scientists found that cooperation in M. xanthus is very robust and resistant to cheating: cheaters were viable (i.e. had a positive fitness) only in groups that had more than 70% cooperators. So it is only when cheaters have a large cooperating population to buffer their nasty habits that they can thrive.

Figure 3: Relative fitness of cooperators (blue) and cheaters (red) in populations with different relative frequencies of cooperators. Note that the fitness scale is logarithmic: the fitness increase is very much non-linear, as in Figure 2.
Moral of this story: if you got to cheat, make sure there are a lot of nice people around. Otherwise it won’t work out very well. In evolutionary terms, the trait for cooperation and kin selection has evolved to become strongly entrenched, so much that cheaters can only survive if cushioned by a high frequency of cooperators. Favoring your own and acting selflessly towards them seems to be the way to go, in the case of M. xanthus.
Previously on our show: ‘Homology is Not a Quantitative Term’. Homology is a drop-in replacement for “common ancestry”. It does not make any sense to say “low common ancestry”, “high common ancestry”, “micro common ancestry” or (egads!) “70% common ancestry”. You cannot be 70% homologous any more than you can be 70% pregnant.
Why am I harping on this again? Because the term “low homology” managed to sneak itself, of all places, into the title of a paper published in Bioinformatics. Ouch. Bioinformaticians should know better.
Just for kicks, I decided to look at how many papers published this year (January 1 through today) misuse these terms in their title or abstract. Here are the results:
I could not find others such as “weak homologs”, “strong homologs”. Small mercies. Well, there is some work to do still in removing bad habits.
Last week I posted a video of Dan Telfer arguing with his audience over who is the best dinosaur. Well, The Black Keys, a blues band from Akron, Ohio came up with the best dinosaur. His name is Frank, and he is a Funkasaurus rex. See and, more importantly, listen for yourselves. Epic dino-slide is epic. Tighten Up!
Update: the video embedding has been disabled. You can still watch it on YouTube.
Some headlines just write themselves…
It has been known for some time that an approaching large herbivore causes aphids to abandon ship …err plant. Makes sense since, after all, there’s not much of a point in staying on the particular bit of shrubbery that will be consumed, lock, stalk and barrel by a ravenous forager. However, it was not exactly clear what in the herbivore causes the aphids to drop. Well, it is not the shaking of the twigs, as rustling the plant did not cause a substantial number of the aphids to drop. Rather, it’s the breath. The researchers had a human, a sheep and a goat all breathe on an aphid-infested plant, with equal results: the aphids dropped from the plant en masse. But what in the breath causes aphids to do that? Well, it is neither the CO2 nor the air movement itself. Rather, it is the heat and the humidity of the breath, as tested by Moshe Gish and his colleagues at the University of Haifa.
This is a great example of adaptation: after all, bush movement may be due to many different factors, including uninterested rodents and carnivores. Also, air movement can be simply caused by wind, including hot or humid air. But someone breathing directly on you, hot and damp can only mean one thing to an aphid: abandon plant or be goat dinner!
YouTube is chock-a-block with vids of Richard Feynman. I love the way he uses analogies to explain science. Here is one of my favorites, the discovery of natural laws as viewing a chess game.
Not only the funniest, but also the best-informed rant on dinosaurs I have ever heard. OK, I only heard this one, but it cracked me up. NSFW language.
These are just ridiculously cute, I had to put them in. The sloth counterpart of Marilyn Manson appears at the end.
A few interesting facts about sloths (edited from Wikipedia):
Meet the sloths from Amphibian Avenger on Vimeo.
Much too noisy. When looking at a population of genetically identical bacteria, the number of proteins they produce varies. The picture below shows the levels of one type of protein that was fused to a green fluorescent protein (so we can see it): clearly there is a variation in how much of the protein each cell produces (“protein expression” in molbio-speak), even though the bacteria are genetically identical. Why is that? In 2006, a group of researchers at the University of California San Diego and Boston University looked at the variation in protein expression in genetically identical bacteria, and what it could mean. They constructed a simple and well-defined computational model first. The researchers were surprised when their model showed that the variation actually increased when cell growth and division was slowed or stopped. This prediction seemed paradoxical: if the cells are less active, how come the variation in protein expression increases? Shouldn’t they all be going into some “baseline production mode”? To answer these questions, they took them to the lab. Nicholas Guido and his colleagues engineered bacteria with simple gene networks, where the production of the gene could be induced, repressed, or both induced and repressed simultaneously. The gene itself was fused to green fluorescent protein, so that the more protein is produced, the brighter the cells shine under light. Lo and behold, the computational predictions were correct! (1) the expression of the protein was not uniform (even though the cells were genetically identical) and (2) variation in protein expression increased when the proliferation of the bacterial cells was slowing down or had stopped.
Their explanation for this random noise in protein production: the need for variation to survive. Bacteria often deal with quickly changing conditions: temperature, oxygen concentration, water, chemicals, antibiotics… all these can kill. If the population is identical, what kills one kills all. But if even within a genetically identical population there are variations in protein expression levels, then the population is not phenotypically identical even though it is genetically so. Some bacteria may survive the dry spell, the heat or (what concerns us quite a bit) the onslaught by antibiotics. The random population variation or “noise” in protein expression levels is an evolutionary survival mechanism.
Fast forward from 2006 to last week. In a brilliant work published in Science, Yuichi Taniguchi and colleagues from Harvard University and University of Toronto looked at individual E. coli cells for protein expression. They examined different strains, 96 at a time, using a microfluidic chip. Each lane on the chip has room for a single cell, enabling them to quantify the levels of proteins in single cells from the same or different strains very quickly. Taniguchi and colleagues examined 1018 different genes in E. coli, which covers about 25% of the genome. Like Guido and colleagues, Taniguchi and colleagues also found a large variation in the expression of the same protein in different cells which were otherwise genetically identical, no matter what the protein was. They also found that different kinds of proteins were produced in different distributions in the cell. What they also measured was noise: how much randomness went into the production of proteins. They found two kinds of noise. One type was from proteins that were produced in small numbers (fewer than 10 molecules per cell): for these, the more protein produced, the lower the variation in protein production, or “noise”, between cells. A second type was from proteins that were produced in larger numbers. For those, there is a “noise floor”: the fluctuation in protein production does not decrease below a certain point, and there is less fluctuation in proteins that are produced in high numbers than in those produced in low numbers. This means that the cellular mechanisms of protein degradation and/or production control may hit some sort of steady-state once protein production reaches a certain level.
They did not stop at looking at proteins, though. In each cell, they also looked at the level of mRNA coding for that protein. mRNA production numbers are also very noisy: actually, noisier than those of protein. But surprisingly, Taniguchi discovered that when looking at single cells, mRNA and protein levels do not correlate. Not even a weak correlation, and no matter what the protein. The high noise levels and lack of correlation in expression can be explained by the different lifetimes of protein vs. mRNA. mRNA is quickly degraded in the cell, while proteins may outlive cell-division. mRNA is produced in short bursts and “lives fast and dies fast”, while the longer-lived proteins are buffered from such sharp fluctuations.
Looking at these studies together, we can say that there is a lot of noise in the system, but it serves a purpose: not only on the selection level (as discovered in 2006), but also on the systems level (as shown last week). On the selection level, noise fosters differences between individuals, which gives at least someone from the bacterial population a chance to survive if conditions change drastically. It is less clear what is happening at the level of the intracellular system: for proteins expressed in large numbers, it seems like there is some external control mechanism at work that keeps noise above a certain level.
How does mRNA and protein production noise then propagate, say, across gene expression pathways, when one protein can cascade the production of many others? How much is noise a control mechanism on a cellular and cellular-ensemble level? Are there “noise clamping” and “noise amplification” mechanisms that have yet to be discovered? The Taniguchi study hints that there are, and the Guido study strongly suggests that they are affected by the environment. I think we are only beginning to hear the noise bacteria make.
Guido, N., Wang, X., Adalsteinsson, D., McMillen, D., Hasty, J., Cantor, C., Elston, T., & Collins, J. (2006). A bottom-up approach to gene regulation. Nature, 439 (7078), 856-860. DOI: 10.1038/nature04473
Arcade Fire’s new album, The Suburbs is officially available today. Unofficially, there is already a fan video. Pretty cool. You can also listen to a stream of the entire album from NPR.
The Third Reviewer is a website for those of us who would rather show up to a journal club late, beer in hand and in our pajamas. Which means basically 100% of all scientists I know. TTR pulls feeds from multiple journals, and posts the abstracts on its site for us to comment upon; anonymously if so wished. The site is called The Third Reviewer, since most papers are reviewed by two people, and the third would be the rest of the scientific community. From their Welcome page:
The Third Reviewer is a forum for scientists to share opinions about recently published research. It’s like journal club, but…
- Faster. No need to set aside an hour of your time.
- Convenient. Check in from home or at lab, at 5 a.m. or 10 p.m.
- Comprehensive. Browse papers from lots of journals, all on one site.
TTR started with Neurobiology papers only. Now they added microbiology, so it’s nice to see that there is some interesting science going on there too… (Ow, owwww, ow… Kidding. KIDDING!!!).
Bora Zivkovic, the BUCA (Best Universal Common Ancestor) of science bloggers, has tagged this blog with a Blog of Substance award. As a grateful recipient of this award I am obligated to do two things:
1. Sum up my blogging motivation, philosophy and experience in exactly 10 words.
2. Pass this award on to 10 other blogs.
Of course, I never do anything without researching it first, because I am such an awesome scientist, or detail-oriented !@#*^, depending on whether you ask me or my students. So I looked up “substance” in the Merriam-Webster dictionary. Here is what I found:
Main Entry: sub·stance
Pronunciation: \ˈsəb-stən(t)s\
Function: noun
Etymology: Middle English, from Anglo-French, from Latin substantia, from substant-, substans, present participle of substare to stand under, from sub- + stare to stand — more at stand
Date: 14th century
1 a : essential nature : essence b : a fundamental or characteristic part or quality c Christian Science : god 1b
2 a : ultimate reality that underlies all outward manifestations and change b : practical importance : meaning, usefulness
3 a : physical material from which something is made or which has discrete existence b : matter of particular or definite chemical constitution c : something (as drugs or alcoholic beverages) deemed harmful and usually subject to legal restriction
Hmmm… 2a and 2b seem to be relevant. Perhaps 3c should be too, as my blogging could be construed as harmful to other more productive activities, which I am obviously not engaged in at this moment. Actually you, gentle reader, are not engaged in more productive activities either right now. Be that as it may, the word substance does seem to have an air of permanence about it, which is contrary to the perceived ephemeral nature of blogging. Bora is actually one of the people who are doing something about making blogs less ephemeral by publishing The Open Laboratory collection (full disclosure: I’m published in the 2009 book) and by supporting science bloggers, blogging and activities wherever they may be. This makes me so happy to be among Bora’s chosen 10 (OK, 11, he cheated a bit) among the hundreds of blogs he must be reading. Thanks Bora!
I do wonder though, eighty-five years from now, how many of us science bloggers would be remembered for our blogging? Well, maybe not as individuals, but what kind of impact are we having now, and how much will it remain 85 years from now? Hopefully as a collective, science bloggers are impacting the understanding of science, which is one of the reasons I am blogging. Hopefully, we do have substance, as a group if not as individuals.
Why eighty-five years? Well, the answer to that brings me to the main topic (substance?) part of this post, which is the anniversary of the Scopes trial. This month, 85 years ago, a schoolteacher in Tennessee was convicted of a high misdemeanor for violating the State of Tennessee’s Butler Act which prohibited the teaching of evolution in any of the state’s public schools and universities. He was fined $100.
PUBLIC ACTS
OF THE
STATE OF TENNESSEE
PASSED BY THE
SIXTY – FOURTH GENERAL ASSEMBLY
1925
CHAPTER NO. 27
House Bill No. 185
(By Mr. Butler)
AN ACT prohibiting the teaching of the Evolution Theory in all the Universities, Normals and all other public schools of Tennessee, which are supported in whole or in part by the public school funds of the State, and to provide penalties for the violations thereof.
Section 1. Be it enacted by the General Assembly of the State of Tennessee, That it shall be unlawful for any teacher in any of the Universities, Normals and all other public schools of the State which are supported in whole or in part by the public school funds of the State, to teach any theory that denies the story of the Divine Creation of man as taught in the Bible, and to teach instead that man has descended from a lower order of animals.
Section 2. Be it further enacted, That any teacher found guilty of the violation of this Act, Shall be guilty of a misdemeanor and upon conviction, shall be fined not less than One Hundred $ (100.00) Dollars nor more than Five Hundred ($ 500.00) Dollars for each offense.
Section 3. Be it further enacted, That this Act take effect from and after its passage, the public welfare requiring it.
Passed March 13, 1925
W. F. Barry,
Speaker of the House of Representatives
L. D. Hill,
Speaker of the Senate
Approved March 21, 1925.
Austin Peay,
Governor.
Seems incredible in this day and age… or maybe not so incredible given recent events in Louisiana.
The trial, which originated as something of a publicity affair for the town of Dayton, Tennessee, quickly became a battleground for evolution vs. creation. In the short term, the trial actually increased the number of anti-evolution bills proposed in different state legislatures in the US. In the long term, however, Tennessee vs. Scopes is seen as a watershed moment in the teaching and public acceptance of evolution, and has had long-term ramifications in the US and internationally. Scopes himself spoke only once at the trial, was not called to testify, and only had this to say when granted a statement after sentence was passed:
Your honor, I feel that I have been convicted of violating an unjust statute. I will continue in the future, as I have in the past, to oppose this law in any way I can. Any other action would be in violation of my ideal of academic freedom — that is, to teach the truth as guaranteed in our constitution, of personal and religious freedom. I think the fine is unjust.
Now that is substance.
Back to the award; I still have some conditions to fulfill:
1. Sum up your blogging motivation, philosophy and experience in exactly 10 words.
¹Blogging ²motivation, ³philosophy ⁴and ⁵experience ⁶cannot ⁷be ⁸summed ⁹in ¹⁰ten ¹¹words.
2. Pass this award on to 10 other blogs
Given the 10^n growth rate of tagged blogs, chain-letter fashion, I wonder how this Blogging with Substance award originated. Search engines were no help, as so many blogs are now tagged with the Blogging with Substance award. If someone has an answer, let me know. Anyhow, here are my 10 tags, based on what I am reading nowadays, ephemerality of blogging substance, and all that jazz. Tough choices though, so many good blogs out there:
2. Sandwalk
3. Thoughtomics
4. The Loom
6. Genomics, Evolution and Pseudoscience
10. Mystery Rays from Outer Space
Final word: if this post seems a bit confused, and you are not sure that you are “getting it”, well, that’s this post’s substance.
The trouble with genomic sequencing, is that it is too cheap. Anyone that has a bit of extra cash laying around, you can scrape the bugs off your windshield, sequence them, and write a paper. Seriously?
Yes, seriously now: as we sequence more and more genomes, our annotation tools cannot keep up with them. It’s like unearthing thousands of books at some vast archaeological dig of an ancient library, but being able to read only a few pages here and there. Simply put: what do all these genes do? The gap between what we do know and what we do not know is constantly growing. We are unearthing more and more books (genomes) at an ever-increasing pace, but we cannot keep up with the influx of new and strange words (genes) of this ancient language. Many genes are being tested for their function experimentally in laboratories. But the number of genes whose function we are determining using experiments is but a drop in the ocean compared to the number of genes we have sequenced and whose function is not known. We may be sitting on the next drug target for cancer or Alzheimer’s disease, but those proteins are labeled as “unknown function” in the databases.

The red line is the growth of protein sequences deposited in TrEMBL, a comprehensive protein sequence database. The blue line illustrates the growth of proteins in TrEMBL whose function is known, or at least can be predicted with some reasonable accuracy. The green line is the growth in the proteins whose 3D structure has been solved. Note the logarithmically increasing gap between what we know (blue) and what we do not know (red). Image courtesy of Predrag Radivojac.
Enter bioinformatics. CPU hours are cheaper than high throughput screening assays. And if the algorithms are good, software can do the work of determining function much more cheaply than experiments. But therein lies the rub: how do we know how well function prediction algorithms perform? How do we compare their accuracy? Which method performs best, and are different methods better for different types of function predictions? This is important because most of the functional annotations in the databases come from bioinformatic prediction tools, not from experimental evidence. We need to know how accurate these tools are. Think about it this way: even an increase of 1% in accuracy would mean that hundreds of thousands of sequence database entries are better annotated, which in turn means a lot less time in the lab or in high throughput screening labs going after false drug leads.
So a few of us got together and decided to run an experiment to compare the performance of different function prediction software tools. We call our initiative the CAFA challenge: Critical Assessment of Function Annotation. There are many research groups that are developing algorithms for gene and protein function prediction, but those have not been compared on a large scale yet. OK then: let’s have some fun. We, the CAFA challenge organizers, will release the sequences of some 50,000 proteins whose functions are unknown. The various research groups will predict their functions using their own software. By January 2011 all the predictions should be submitted to the CAFA experiment website. Over the next few months, some of these proteins will get annotated experimentally. Not many, probably no more than a few hundred judging by the slow growth of the experimental annotations in the databases. But we don’t need that many to score the predictions. A few dozen will do.
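To make the idea of "scoring the predictions" concrete, here is a toy sketch in Python. It is only an illustration under my own assumptions: the protein IDs, the function labels, and the choice of per-protein precision and recall are all made up for the example, and it is not the actual CAFA assessment procedure, which is still being hashed out. The point is simply that only the proteins that pick up an experimental annotation after the submission deadline get scored.

```python
def score_predictions(predicted, experimental):
    """Average per-protein precision and recall.

    predicted:    dict of protein ID -> set of predicted function terms
    experimental: dict of protein ID -> set of experimentally determined terms
    Only proteins present in `experimental` are scored.
    """
    if not experimental:
        raise ValueError("no experimentally annotated proteins to score against")

    precisions, recalls = [], []
    for protein, true_terms in experimental.items():
        guessed = predicted.get(protein, set())
        hits = len(guessed & true_terms)
        # Precision: what fraction of the predicted terms were confirmed?
        precisions.append(hits / len(guessed) if guessed else 0.0)
        # Recall: what fraction of the confirmed terms were predicted?
        recalls.append(hits / len(true_terms) if true_terms else 0.0)

    n = len(experimental)
    return sum(precisions) / n, sum(recalls) / n


# Hypothetical data: three released targets, two annotated experimentally later.
predicted = {
    "P1": {"kinase activity", "ATP binding"},
    "P2": {"DNA binding"},
    "P3": {"transporter activity"},  # never annotated, so it is not scored
}
experimental = {
    "P1": {"kinase activity"},
    "P2": {"DNA binding", "transcription factor activity"},
}

precision, recall = score_predictions(predicted, experimental)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Even with a few dozen newly annotated proteins, averages like these would let us rank methods against each other, which is why we don't need the whole target set to be annotated.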
On July 15, 2011 we will all meet in Vienna, and hold the first-ever CAFA meeting as a satellite meeting of ISMB 2011. This will be the fifth Automated Function Prediction meeting we have been holding since 2005. Only this time, there won’t just be the usual talks and posters, there will be the results of a very interesting experiment. The International Society for Computational Biology is generously hosting our meeting, and judging by the response we are getting so far, we will need one of the larger halls.
Learn more at http://biofunctionprediction.org. If computational protein function prediction is your thing, join the CAFA challenge. If you are just an interested observer, keep an eye on the site. In any case, please spread the word. Finally, if your company wants some publicity, get in touch! We could use the sponsorship ^_^
Acknowledgements: I would like to thank the CAFA co-organizers, Michal Linial and Predrag Radivojac; the CAFA steering committee, Burkhard Rost, Steven Brenner, Patsy Babbitt and Christine Orengo, for supporting us, keeping us on the straight and narrow, and for incredibly useful and insightful suggestions; Sean Mooney and Amos Bairoch for hashing out the assessment; Tal Ronnen-Oron and the rest of Sean Mooney’s group for setting up the CAFA website; the International Society for Computational Biology for sponsoring us; the community of computational function predictors who have participated in and supported past meetings on computational function prediction; the research groups that have registered for CAFA so far, and those that will register soon.
Finally, Inbal Halperin-Landsberg for coining the name CAFA. I apologize in advance if I left someone out.
GO CAFA!