WoW is full of bacteria

Speaking of sampling bacteria, this ties in well with the previous post about GEBA. And by “well” I mean “in an alternate-universe/altered-consciousness manner”.

The voices in the song are sampled from this KFC employee training tape. The video won a prize at machinima.com. So if you like World of Warcraft, bacteria, KFC, sampled music, or any combination of the above, you’re gonna love this.

Filling in the evolutionary blanks, genome by genome

After hearing Jonathan Eisen and Nikos Kyrpides talk about GEBA in various meetings, it is great to see the paper finally come out, and under a CC license too. Good move for everyone.

GEBA is the Genomic Encyclopedia of Bacteria and Archaea. The idea is simple: we have >1000 prokaryotic genomes in GenBank as of today. But those were sequenced for a myriad of reasons: clinical relevance, functional interest, ease of sequencing, biotechnological or pharmaceutical potential, etc. In evolutionary terms, those 1000 genomes provide a very biased view of the tree of microbial life. That would be like sampling mammalian life in Europe and North America only: you would miss out on most big cats, elephants, and rhinos, not to mention nearly all the marsupials. To correct this situation, teams from the Joint Genome Institute, UC Davis and several others set out to perform a more uniform sampling across the tree of prokaryotic life. The first batch of 56 genomes from GEBA is published today in Nature: fifty-three bacterial and three archaeal.

Maximum-likelihood phylogenetic tree of the bacterial domain based on a concatenated alignment of 31 broadly conserved protein-coding genes. Phyla are distinguished by colour of the branch and GEBA genomes are indicated in red in the outer circle of species names. Click to open original in Nature.

It seems that they are on the right track to enrich our understanding of bacterial genes and genomes using this phylogenetically-mindful sampling strategy. For example, they show that their sampling enables the discovery of an average of 1,060 protein families per genome. Sampling a single bacterial family would provide 121 new protein families, sampling within a bacterial phylum would give an average of 308 new protein families, and within a bacterial domain, 650. They have discovered a total of 1,798 families that seem to have no similarity to any existing family, hinting at new bacterial functionality (or maybe some new prophages?). They have discovered a few new cellulases, enzymes that break down cellulose, the polymer that makes up plant cell walls. Cellulases are the holy grail of the biofuel prospecting industry: specifically, a cellulase that can be exploited en masse to turn plant matter into fuel economically. They also discovered a homolog of actin, a cytoskeletal protein thought until now to exist only in eukaryotes.

One thing that is sorely missing is accessibility. Yes, the individual genome papers are all published in SIGS and in Nature under open access, which is great. But when you go to the GEBA site, you get a simple description of the candidate genomes. The annotations are somewhere behind a password-protected site, but I could not seem to get an account to view them. A proper genomic browser for the sequenced and annotated genomes, with some phylogenetic map showing who is located where on the tree, would go a long way towards helping the rest of us explore this new comprehensive picture of prokaryotic genome space.

Finally, if you want to hear more about how they did what, here’s Eisen talking about GEBA.


Wu, D., Hugenholtz, P., Mavromatis, K., Pukall, R., Dalin, E., Ivanova, N., Kunin, V., Goodwin, L., Wu, M., Tindall, B., Hooper, S., Pati, A., Lykidis, A., Spring, S., Anderson, I., D’haeseleer, P., Zemla, A., Singer, M., Lapidus, A., Nolan, M., Copeland, A., Han, C., Chen, F., Cheng, J., Lucas, S., Kerfeld, C., Lang, E., Gronow, S., Chain, P., Bruce, D., Rubin, E., Kyrpides, N., Klenk, H., & Eisen, J. (2009). A phylogeny-driven genomic encyclopaedia of Bacteria and Archaea. Nature, 462 (7276), 1056-1060. DOI: 10.1038/nature08656

Five octopus vids

The coconut-using octopus has been making the news lately, as the first evidence of tool use by these animals. A good opportunity to post some vids of these cool creatures:

UPDATE: the “first tool use” has been somewhat oversold. Thanks to Zen Faulkes for calling my attention to this.

They are resourceful:

Camouflage skills:

More camouflage skills, and dining habits:

More dining habits:

Flexible:

Feliĉa Naskiĝtago Doktoro Esperanto

Google flew the green-starred flag of hope yesterday, in celebration of the 150th birthday of a man who constructed a whole language based upon hope. He called himself Doctor Hopeful, and he wanted the language he created to help break down national barriers. He made it easy to learn, so that people would be motivated to learn it as their second language. They would then speak the Language of Hope, understand each other, and not be so insular. For isolation breeds suspicion, and suspicion breeds hostility and ultimately violence.

Unfortunately, neither his language nor his vision of a more understanding and tolerant mankind caught on. One hundred and fifty years after his birth, and 122 after the publication of his book, the world is no friendlier or more tolerant than it was when Ludwig Zamenhof set out to correct it by publishing his book International Language: Foreword And Complete Textbook under the pseudonym of Doktoro Esperanto.

English has become the second language of choice for many. But the increasing dominance of English-speaking powers over the last 200 years, which made English the lingua franca, leads many to see English as an imposition from above rather than a choice. Many who wish to preserve their non-Anglo cultures perceive English as overwhelming, a threat to their local culture, which would be diluted to extinction through constant bombardment by English-language movies, TV shows, and Internet content. Zamenhof would not have liked that, as Esperanto was intended to be adopted by choice, without carrying any threatening cultural baggage.

The Internet itself is hailed by many as a medium to strike down barriers to knowledge and help communication. But national firewalls, traffic monitoring, crackdowns on content sharing, criminal abuse and vilification in the popular media cause many to see it more as a threat to their own society than as a promise for all societies. And let us not forget that it is still mostly a medium of the developed world, with most of the content and cultural narrative originating from rich countries.

Neither a world-wide communication technology nor a globally dominant language seems to have brought us closer to the peaceful, understanding and egalitarian world that Zamenhof envisioned. We should be mindful of that, and of Esperanto. The Esperanto language is viewed as a curiosity at best, and Esperantists as people with a quaint hobby. Happily, Esperantists do not view themselves as such. They are continuing Zamenhof’s mission for a more understanding humankind. Esperanto is kept alive by the two million who speak it, by national and international organizations, by books, magazines, and even music. Happy Birthday Doctor Hopeful.

Martin Wiese of the Swedish Esperanto-singing band Persone, from his solo album Pli ol nenio (“More Than Nothing”).

Gene and protein annotation: it’s worse than you thought

Sequencing centers keep pumping large amounts of sequence data into the omics-sphere (will I get a New Worst omics Word Award for this?). There is no way we can annotate even a small fraction of those sequences experimentally, and indeed most annotations are automatic, done bioinformatically. Typically, function is inferred by homology: if the protein sequence is similar enough to that of a protein whose function has been determined, then homology is inferred; that is, the unknown and the known protein are assumed to descend from a common ancestor. Beyond that, functional identity to the known protein is inferred, the assumption being that the function did not change if the common ancestral protein is recent enough, that is, if the sequence identity is high enough. But there are problems: what is the threshold for determining not only homology, but functional identity? Even if two proteins are 95% identical in their amino-acid sequence, if the remaining 5% happen to include active site residues, these proteins may do completely different things. Nevertheless, most new sequences are annotated just this way, with some variations.
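To make the pitfall concrete, here is a minimal Python sketch of threshold-based annotation transfer. Everything in it is made up for illustration: the two toy sequences, the 90% cutoff, the position of the "active site", and the function names. Real pipelines work from alignments (BLAST hits and the like), not from equal-length strings.

```python
def percent_identity(seq_a, seq_b):
    """Percent of identical positions in two pre-aligned, equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return 100.0 * matches / len(seq_a)


def transfer_annotation(query, reference, reference_function, cutoff=90.0):
    """Naively copy the reference's function if identity meets the cutoff."""
    if percent_identity(query, reference) >= cutoff:
        return reference_function
    return "unknown"


# Toy 20-residue sequences. Suppose (hypothetically) that position 10,
# counting from 0, is an active-site residue.
reference = "MKTAYIAKQRHISLLKDEVA"
query = "MKTAYIAKQRAISLLKDEVA"  # differs only at the assumed active site

identity = percent_identity(query, reference)
label = transfer_annotation(query, reference, "hypothetical hydrolase")
print(identity, label)
```

The query clears the cutoff comfortably, so it inherits the reference's label even though the one residue that differs is the one we assumed to matter for function.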

Because of its volume, the veracity of electronic annotation is rarely checked by experts. Also, the electronic annotations come from far and wide, with different annotation software using different databases to infer gene and protein function. This sets the stage for a huge game of Broken Telephone, where wrong annotations can propagate through many databases, accumulating errors. Imagine that we have an annotation program with a 90% accuracy rate. This means that given a query protein sequence and a “gold standard” 100% correct reference database, this program infers the query sequence’s correct function 90 out of every 100 times. For a typical bacterial genome of 5000 genes, this would mean that 500 genes are wrongly annotated. Let’s call our bacterium Bug1. Now we place those 500 wrong annotations (along with the 4500 correct ones) in the “definitive database” for this bacterium, called Bug1DB. Now this Bug1DB is used as a “gold standard”, and another genome is annotated, this time of Bug2. Let’s suppose, for argument’s sake, that the two genomes contain roughly the same homologous genes. The 500 genes that were wrongly annotated in Bug1 stay wrong in Bug2 (we assume that “two wrongs do not make a right”: mis-annotating an incorrectly annotated gene again would not revert it to a correct annotation). But since every gene has a 10% probability of being wrongly annotated when transferred from Bug1 to Bug2, on average 0.10 * 4500 = 450 genes that were correctly annotated the first time would be incorrectly annotated the second time. This means that Bug2 now has 500 + 450 = 950 mis-annotated genes. And this is through just two rounds of a Broken Telephone game using a highly accurate annotation program.
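The same back-of-the-envelope arithmetic can be written as a few lines of Python. This is only a sketch of the toy model above: the function name is mine, and it assumes that a wrong annotation never reverts to a correct one and that every round of transfer has the same accuracy.

```python
def expected_misannotated(n_genes, accuracy, rounds):
    """Expected number of wrong annotations after `rounds` of transfer.

    Toy model: each round keeps a correct annotation correct with
    probability `accuracy`; a wrong annotation stays wrong forever.
    """
    correct = float(n_genes)
    for _ in range(rounds):
        correct *= accuracy
    return n_genes - correct


# 5000 genes, 90% accurate annotation program, as in the example above.
for rounds in (1, 2, 3, 5):
    print(rounds, round(expected_misannotated(5000, 0.90, rounds)))
```

One round gives the 500 wrong genes of Bug1, two rounds give the 950 of Bug2, and the number keeps climbing with every further hop between databases.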

The trouble is that a 90% accuracy rate is unrealistically optimistic. Also, having all 5000 genes in a genome annotated with some function (as opposed to simply “unknown”) is rather fanciful. So the mis-annotation problem is worse, even if transfer and re-annotation do not take place exactly as described. But just how much worse?

The question is answered in a rather disturbing study published in PLoS Computational Biology by Alexandra Schnoes and her colleagues in Patricia Babbitt’s group at the University of California, San Francisco. They used 37 experimentally characterized enzyme families to test different databases. They found a high level of misannotation, but also a highly variable one. For example, the manually curated SwissProt database had a very low level of errors. On the other hand, TrEMBL, which uses simple sequence similarity for annotations, had a high level. So did NR, the combined GenBank coding sequence translations+RefSeq Proteins+PDB+SwissProt+PIR+PRF; pretty much the default reference database against which biologists BLAST their sequences. They found that 40% of the genes they examined were mis-annotated in NR. They also went back in time, examining the misannotation fraction in their gold standard 37 families, and found that the fraction of misannotated genes has increased, from 15% in 1995 to 40% in 2005.

The change in misannotation over time in the NR database for the 37 families investigated. Sequences are plotted by the year when they were originally deposited in the database (x-axis). The number of sequences (left y-axis, bar graph) found to be correctly annotated is shown in green. The number of sequences found to be misannotated is shown in red. The bars for each year represent only the sequences deposited into the database in that year. The fraction (right y-axis, line plot) of sequences deposited each year into the NR database that were misannotated is given by the open nodes, connected by the black line to aid in visualizing the overall trend. This fraction represents the number of sequences in the 37 test families predicted to be misannotated divided by the total number of sequences deposited each year from the test set, i.e. the sum of the sequences depicted in the red and green bars for each year. (From: Schnoes AM, Brown SD, Dodevski I, Babbitt PC, 2009 Annotation Error in Public Databases: Misannotation of Molecular Function in Enzyme Superfamilies. PLoS Comput Biol 5(12): e1000605. doi:10.1371/journal.pcbi.1000605)

There are also many ways to be wrong, as Schnoes and her colleagues have discovered. Overprediction is one, where proteins are annotated with functions that are more specific than the available evidence supports. 85% of misannotations were found to be overpredictions. Of the remaining 15%, about half were found to be missing important amino acid residues, which means that they could not carry out the functions with which they were annotated. The other half were simply not within the similarity threshold necessary to include them in one of the superfamilies they have examined.

By now you are wondering, who is validating the validators? That is, if Schnoes and her colleagues use a single cutoff for inclusion in a protein family, they might count falsely annotated proteins as correctly annotated (false positives), or count correctly annotated proteins as mis-annotated (false negatives). To avoid that, they applied three different similarity thresholds to their 37 families, and examined which proteins the similarity searches attract at each. The lowest of these thresholds, which they called the “lenient threshold”, deliberately allowed up to 5% false positives. Checking their results under all three thresholds, they found a slight increase, but no overall substantial change, in the level of misannotation discovered in the databases, even when lowering the bar to the lenient threshold.

So how bad is the level of misannotation in the databases? It depends on the protein superfamily they checked against, and on the database. Here is an excerpt from another figure, showing the misannotation of protein families in the haloacid dehalogenase (HAD) and amidohydrolase (AH) superfamilies of enzymes. Each rectangle represents a different database. The bar is the mean error in that database for that particular superfamily, and each colored circle is a protein family, placed at the level of average misannotation for that family. The circle size indicates the family size.

Percent misannotation in the families and superfamilies tested

Note that SwissProt fares very well, although it lacks some families (those with an “X” through the blank circle). For the HAD superfamily, we see an error of 60% in the three other heavily used databases, and for AH we see a 40% error. That is brutally high, and quite worrying. Other families fared little better when checked against those databases. Some went up to 80% and 100%(!)

So what can be done? Schnoes and her colleagues suggest several remedies. First, include “evidence codes” with the annotations. Those will let us know how each annotation is inferred, and thus how trustworthy it is. Additionally, avoid overprediction, which accounts for 85% of wrong annotations. Many protein functions are described too specifically, without enough evidence to support the annotation claim. Taking a step back and giving a more general description of the function would go a long way towards cleaning up the databases. The manually curated databases such as SwissProt did fare very well in their examination, but manual curation is not possible anymore with the post-genomic and metagenomic data deluge. Large databases have to clean up the mess pretty much the same way it was created: by automated means. Let’s hope it will happen soon enough. A 40% error rate in the database you are looking at can really put a damper on your analysis.


Schnoes, A., Brown, S., Dodevski, I., & Babbitt, P. (2009). Annotation Error in Public Databases: Misannotation of Molecular Function in Enzyme Superfamilies. PLoS Computational Biology, 5 (12). DOI: 10.1371/journal.pcbi.1000605

Structuregate?

The University of Alabama at Birmingham issued a statement last week asking that 11 structures be removed from the Protein Data Bank, as they are quite possibly fabricated. Wow. Very little detail was given by UAB’s statement (below), or by the media. Apparently all the structures are tied to one person, HMK Murthy, who could not be reached or traced, as reported by the Birmingham News.

The structures’ PDB codes are:

1BEF, 1CMW, 1DF9/2QID, 1G40, 1G44, 1L6L, 2OU1, 1RID, 1Y8E, 2A01, and 2HR0. Some of them are still in the databank.

The University of Alabama at Birmingham has requested that the Research Collaboratory for Structural Bioinformatics Protein Data Bank remove certain protein structure files deposited by a former UAB employee. UAB also has identified nine publications related to the same protein structures that should be retracted from various scientific journals, and is making those journals aware of this matter.

Allegations of data fabrication and/or falsification were made concerning certain protein structures published by the former UAB employee. In accordance with UAB’s scientific integrity policy, and that of the Office of Research Integrity of the U.S. Department of Health & Human Services, UAB empanelled a committee of experts with no conflicting interests to investigate these allegations. After a thorough examination of the available data, which included a re-analysis of each structure alleged to have been fabricated, the committee found a preponderance of evidence that structures 1BEF, 1CMW, 1DF9/2QID, 1G40, 1G44, 1L6L, 2OU1, 1RID, 1Y8E, 2A01, and 2HR0 were more likely than not falsified and/or fabricated and recommended that they be removed from the public record.

“Scientific misconduct is absolutely unacceptable,” said UAB Scientific Integrity Officer Richard B. Marchase, Ph.D., vice president for Research and Economic Development. “It was important that the files be removed from the database and the articles be retracted to ensure that future research in the areas of macromolecular structure analysis and the function of proteins could continue uncompromised by faulty data.”

Some of these structures date back to 2002; this has been going on for quite a while then. Apparently the investigation ended in May 2009, but UAB only issued a statement now. The associated papers are also being retracted. If anyone has more information on this strange affair, please share here.

The Ultimate Rebuttal Letter

Floated in my email inbox recently. Bears blogging.

Dear Editor,

I would like to thank the editorial board and the referees for their comments and contributions to our manuscript. We have carefully considered the comments when rewriting the manuscript, and believe it to be much improved now…

…Oh, screw this. Let’s cut the bull. Mmkay?

Referee #1 did not even bother to read the paper. He basically glanced at the references, realized he was not cited enough to his taste, got pissed off, and attached a Pubmed dump of his papers in the last 10 years. All three of them. There is a reason none of these papers went beyond a single digit number of citations: they suck! Also, I fail to see how a paper discussing semantic distances as applied to an “endoplasmatic reticulum membrane elasticity ontology” has anything to do with my paper. Or with anything of interest, for that matter.

Referee #2 requested reanalysis of our data, using Boyle-Scott statistics. Applying Boyle-Scott statistics to our work would be like draping a hornet’s nest with clingwrap while wearing a bathing suit: a long and painful process which is utterly pointless. B-S statistics are exactly what they are, and if you think I will be bothered to do that, with my grad student finally graduating and taking off, you’re as delusional as Dr. Boyle was when he was researching REM sleep in cannabis-treated amphibians just before he went completely schizo and had to be locked up.

Referee #3 Actually read the manuscript carefully. Which is both commendable and rare. Unfortunately, judging by the comments presented, it was not my manuscript.

Finally, I would request that you as an editor grow a brain. Did you even read their comments before passing them on to me? Shipping out papers to referees, then getting them back, pasting them together and slapping on some boilerplate text from your journal’s editor’s site is not editorial work. In fact, a middle school student that volunteers in my lab wrote up a script yesterday that does just that. We are thinking of installing it in your esteemed journal’s author’s website and waiting to see if this editorial version of the Turing test would pass. We are very optimistic about the results, and we plan to write a paper about them.

Sincerely,

Prof. I. M. Irritated

Going to GOA: pt. 1

GOA, the Gene Ontology Annotation project, provides Gene Ontology (GO) annotations for proteins in UniProt. It also provides GO annotations for several genome projects: Chicken, Arabidopsis, Fly, Human, Mouse, Rat and Cow. Anyone working on any of those genomes, or on UniProt, who is interested in annotation would most likely need to query GOA once in a while. Here I will show some ways to read and interrogate GOA files.

Each GOA-annotated genome has two files associated with it. One is gene_association.* and the other is *.xrefs. So for example, the Mouse genome has the two files gene_association.goa_mouse and mouse.xrefs

The gene_association file is a tab-delimited file, with each row containing a gene and one of its GO associations. Each gene may have more than one GO annotation, so multiple rows may exist for the same gene.

def read_goa_associations(inpath):
    goa_records = {}
    for inline in open(inpath):
        # Read a single GOA record (15 tab-delimited fields)
        db, db_object_id, db_object_symbol, qualifier, go_id, \
        db_reference, evidence, withit, aspect, \
        db_object_name, synonym, db_object_type, \
        taxon_id, date, assigned_by = inline.rstrip('\r\n').split('\t')
        # The dictionary key is a concatenation of the database and the gene id
        key = db+":"+db_object_id

        goa_records.setdefault(key,[]).append({'db':db,
                          'db_object_id':db_object_id,
                          'db_object_symbol': db_object_symbol,
                          'qualifier': qualifier,
                          'go_id': go_id,
                          'db_reference': db_reference,
                          'evidence': evidence,
                          'with': withit,
                          'aspect': aspect,
                          'db_object_name': db_object_name,
                          'synonym': synonym,
                          'db_object_type': db_object_type,
                          'taxon_id': taxon_id,
                          'date': date,
                          'assigned_by': assigned_by})
    return goa_records

Let’s break this one down. The records will be stored in the dictionary goa_records, which is initialized in line 2. We loop through the lines in the file in line 3. Lines 5-8 are broken up physically, but they form a single statement that parses the 15 tab-delimited fields.

Now,  as I mentioned, there may be more than one GO association with a single gene. So the value of each dictionary entry is actually a list which may contain one or more GO associations. Line 12 shows us how to do that. The format:

dict.setdefault(dakey,[]).append(davalue)

tells us that for dictionary dict, if it does not have the key dakey, then that key is added with an empty list [] as its value. setdefault then returns the list stored under dakey (the new empty one, or the one that was already there), and davalue is appended to it. This one-liner allows us to add multiple values associated with one key.
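A tiny stand-alone demonstration of the idiom, with made-up gene and GO labels rather than real GOA data:

```python
records = {}
# (key, value) pairs; "geneA" appears twice, like a gene with two GO rows.
pairs = [("geneA", "GO:term1"), ("geneB", "GO:term2"), ("geneA", "GO:term3")]

for key, value in pairs:
    # First time a key is seen: an empty list is stored and returned.
    # Later times: the existing list is returned. Either way we append to it.
    records.setdefault(key, []).append(value)

print(records)
```

After the loop, "geneA" maps to a two-item list and "geneB" to a one-item list. collections.defaultdict(list) from the standard library would do the same job; setdefault just keeps goa_records a plain dictionary.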

In the statement stretching from line 12 to line 26, we add a value which is itself a dictionary. Each key in that dictionary is a field of the gene_association record. Each value is the value read for that field in lines 5-8.

So what have we got now in goa_records?
Let’s try running this on an Arabidopsis gene_association file. Download the file:

gene_association.goa_arabidopsis.51.gz

Unzip it:

gunzip gene_association.goa_arabidopsis.51.gz

This is the Arabidopsis GOA from 6-OCT-2009. It has 109339 lines, which cover 22778 annotated genes. How do I know that? Easy: count the lines made unique by fields #1 and #2, the database and the gene id.

gawk 'BEGIN {FS="\t"} {print $1":"$2}' gene_association.goa_arabidopsis.51 | sort | uniq | wc -l
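For those who would rather stay in Python, here is a rough equivalent of the one-liner above. It is a sketch: the function name is mine, and unlike wc -l it skips blank or malformed lines that have fewer than two fields.

```python
def count_lines_and_genes(lines):
    """Return (number of records, number of unique db:gene_id keys)."""
    n_lines = 0
    gene_keys = set()
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        n_lines += 1
        gene_keys.add(fields[0] + ":" + fields[1])
    return n_lines, len(gene_keys)


# Toy input with made-up ids: three records covering two genes.
toy_lines = [
    "DB\tgene1\tsym1\n",
    "DB\tgene1\tsym1\n",
    "DB\tgene2\tsym2\n",
]
print(count_lines_and_genes(toy_lines))
```

On the real file, pass an open file handle instead of the toy list: count_lines_and_genes(open("gene_association.goa_arabidopsis.51")).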

Also, download the code file, and call it gene_assoc.py.
Now run the code. Open a Python shell:

$ python
>>> import gene_assoc as ga
>>> goa_arabid = ga.read_goa_associations("gene_association.goa_arabidopsis.51")
>>>

goa_arabid now contains the records. We can do a few simple statistics. For example, how many genes are annotated as hydrolases? For that, we need to know that the hydrolase_activity GO accession number is “GO:0016787”. We can now ask our question in the Python shell:

>>> n=0
>>> for i in goa_arabid:
...     for j in goa_arabid[i]:
...             if j['go_id'] == "GO:0016787": n+=1
...
>>> print(n)
975

OK, this may seem a bit silly. We could have just as easily used a gawk one-liner to search for all the lines containing “GO:0016787”. Why go into all the trouble of a Python data structure? The answer is, we are just getting started.

When genes are annotated, they are annotated with different strengths of evidence. Many genes are annotated simply by sequence similarity to other annotated genes. This is fair evidence, but not as good as, say, experimental evidence. When curators annotate genes, they also add an “evidence code” that tells us how good the evidence is that the assigned GO term is actually true. Let’s find how many hydrolases we have whose annotations were inferred from experimental evidence. See the Guide to GO evidence codes for a full description.

Experimental Evidence Codes
EXP: Inferred from Experiment
IDA: Inferred from Direct Assay
IPI: Inferred from Physical Interaction
IMP: Inferred from Mutant Phenotype
IGI: Inferred from Genetic Interaction
IEP: Inferred from Expression Pattern

Computational Analysis Evidence Codes
ISS: Inferred from Sequence or Structural Similarity
ISO: Inferred from Sequence Orthology
ISA: Inferred from Sequence Alignment
ISM: Inferred from Sequence Model
IGC: Inferred from Genomic Context
RCA: Inferred from Reviewed Computational Analysis

Author Statement Evidence Codes
TAS: Traceable Author Statement
NAS: Non-traceable Author Statement

Curator Statement Evidence Codes
IC: Inferred by Curator
ND: No biological Data available

Automatically-assigned Evidence Codes
IEA: Inferred from Electronic Annotation

Obsolete Evidence Codes
NR: Not Recorded

>>> n=0
>>> for i in goa_arabid:
...     for j in goa_arabid[i]:
...             if j['go_id'] == "GO:0016787" and j['evidence'] in ('EXP','IDA','IPI','IMP','IGI','IEP'): n += 1
...
>>> print(n)
1

Uh, oh. Seems like only one gene was inferred to be a hydrolase by experimental evidence?
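One way to see where the rest of the annotations come from is to tally them by evidence group instead of checking one group at a time. The sketch below assumes records shaped like the output of read_goa_associations; the function name, the group labels and the toy records (with made-up gene ids) are mine.

```python
from collections import Counter

# Evidence-code groups, following the list above.
EVIDENCE_GROUPS = {
    'experimental': ('EXP', 'IDA', 'IPI', 'IMP', 'IGI', 'IEP'),
    'computational': ('ISS', 'ISO', 'ISA', 'ISM', 'IGC', 'RCA'),
    'author': ('TAS', 'NAS'),
    'curator': ('IC', 'ND'),
    'electronic': ('IEA',),
    'obsolete': ('NR',),
}
CODE_TO_GROUP = {code: group
                 for group, codes in EVIDENCE_GROUPS.items()
                 for code in codes}


def evidence_breakdown(goa_records, go_id):
    """Count the annotations to go_id per evidence group."""
    counts = Counter()
    for annotations in goa_records.values():
        for annotation in annotations:
            if annotation['go_id'] == go_id:
                group = CODE_TO_GROUP.get(annotation['evidence'], 'other')
                counts[group] += 1
    return counts


# Toy records with made-up gene ids, only the two fields we need.
toy_records = {
    'DB:gene1': [{'go_id': 'GO:0016787', 'evidence': 'IEA'},
                 {'go_id': 'GO:0016787', 'evidence': 'IDA'}],
    'DB:gene2': [{'go_id': 'GO:0016787', 'evidence': 'IEA'}],
    'DB:gene3': [{'go_id': 'GO:9999999', 'evidence': 'ND'}],
}
print(evidence_breakdown(toy_records, 'GO:0016787'))
```

In the shell session above, the call would be evidence_breakdown(goa_arabid, "GO:0016787"). Like the loops above, this counts annotation rows, so a gene with two rows for the same term is counted twice.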

Note another problem: we only checked for genes annotated with “hydrolase_activity” GO term. But a hydrolase is a very generic term. This does not mean that other genes are not hydrolases. Hydrolases are a very large enzymatic class that includes many different enzymes: any enzyme that uses water to break a covalent bond, basically. For a complete list of hydrolase subclasses in GO see: here. Note that there are more GO terms under the fold (just click on the nodes annotated with ‘+’). So we are definitely not looking at all the enzymes that have hydrolase activity, only those that are annotated with the words “hydrolase_activity”.  For example, all phosphatases are hydrolases, but if a protein is annotated with the GO term “phosphatase_activity”, it won’t show up on our search. How do we handle that problem?

Wait for the next installment.

Byte Size Hedgehog

I don’t know whether to categorize this guy under microbiology or zoology. He’s so small!

Cute little fella

From pixdaus.com.

Videos on sequencing

A few cool vids on sequencing. Company infomercials, but still entertaining and informative. Thanks to my student, David Ream, for finding these.

Pyrosequencing:

Helicos:

SOLiD:

BASE™ nanopore sequencing:

The Tao of Programming

I was recently reminded of this classic by Geoffrey James. Here are a few of my favorites. The whole text is available online.

In the beginning was the Tao. The Tao gave birth to Space and Time. Therefore Space and Time are Yin and Yang of programming.

Programmers that do not comprehend the Tao are always running out of time and space for their programs. Programmers that comprehend the Tao always have enough time and space to accomplish their goals.

How could it be otherwise?

tao

Thus spake the master programmer:

“After three days without programming, life becomes meaningless.”

tao

A novice asked the master: “I have a program that sometimes runs and sometimes aborts. I have followed the rules of programming, yet I am totally baffled. What is the reason for this?”

The master replied: “You are confused because you do not understand Tao. Only a fool expects rational behavior from his fellow humans. Why do you expect it from a machine that humans have constructed? Computers simulate determinism; only Tao is perfect.

“The rules of programming are transitory; only Tao is eternal. Therefore you must contemplate Tao before you receive enlightenment.”

“But how will I know when I have received enlightenment?” asked the novice.

“Your program will then run correctly,” replied the master.

tao

A novice asked the Master: “Here is a programmer that never designs, documents or tests his programs. Yet all who know him consider him one of the best programmers in the world. Why is this?”

The Master replies: “That programmer has mastered the Tao. He has gone beyond the need for design; he does not become angry when the system crashes, but accepts the universe without concern. He has gone beyond the need for documentation; he no longer cares if anyone else sees his code. He has gone beyond the need for testing; each of his programs are perfect within themselves, serene and elegant, their purpose self-evident. Truly, he has entered the mystery of Tao.”


Thankful for…

In no particular order or context. No personal stuff and by no means a complete list:

WordPress (like, duh).

Wikipedia (default for looking up new stuff)

Wikis in general (great lab management tool. Don’t need LIMS)

Open Access Publishing and Creative Commons licensing.

FLOSS licensing (90% of the software I use, and 100% of what I write)

Science Bloggers (too numerous to link)

Science tweeters and FriendFeeders (too numerous to link. That’s how I keep up with things)

BLAST (Sometimes it feels like bioinformatics should be renamed blastology)

LaTeX (Wrote my dissertation in LaTeX, and never looked back)

OpenOffice.org (because not everyone uses LaTeX).

CiteULike (Keeping my reference library up to date and in good order)

Delicious (Keeping my bookmarks up to date and in good order)

Gmail (because finding that document you sent me a month ago would be impossible otherwise)

Google Scholar (For standing on the toes of Hobbits. Or something like that)

GIS (for blogging and making class slides)

Vim (because emacs blows)

Python (ease & power)

Biopython (OK, conflict of interest here, since I contributed a bit)

Friendly colleagues (They certainly are!)

Good students (gotta make my lab page).

Goulash for dinner. Can’t stand oven Turkey.

Music. Especially the latest song that is going around in my head:


Photosynthesis, phages and structures: there’s treasure everywhere!

ResearchBlogging.org

Here’s a really cool work, published this September in Nature. Why did I choose this work? Well, it’s a major discovery, and it’s all done using bioinformatics, and fairly simple bioinformatics at that. The power of metagenomics and bioinformatics: in a mass of data you just have to know what you are looking for, and how to look for it.

Obviously not CC licensed, but I couldn't resist using this very appropriate strip

Viruses as a bacterial genetic mechanism

Viruses follow some interesting and sometimes convoluted evolutionary paths. One is “infect quick, reproduce fast, and make sure you can get to the next host before you kill this one”. That is pretty extreme: smallpox was doing that, when there was smallpox. Ebola is doing that, but not very well: killing the host too quickly means that the disease is contained, especially in rural areas. Another strategy is “slow and easy wins the race”. The herpes virus does that. It is not lethal; lying dormant in nerve cells, it stays infectious but rarely causes anything more than the occasional cold sore (which, admittedly, is painful and disturbing). Still, it manages to infect up to 90% of the human population, most of whom are completely unaware they harbor it and will never develop any symptoms.

Most of the viruses on earth don’t infect humans, animals, or plants. They infect microbes, where the same spectrum of evolutionary strategies applies. Some attack quickly, killing the microbial population they infect. Others can remain dormant for a long time. It is becoming clear that bacterial viruses, or bacteriophages, are responsible for a large portion, if not the majority, of genetic variance in bacteria. In fact, viruses are a major component in bacterial genetics. The mechanism is called transduction, and it is illustrated below. Bacteriophages pick up DNA from bacteria they infect, and transfer it to other bacteria, creating genetic variance in the bacterial population.

Generalized transduction. Source: Indian River State College

Viral transduction also adapts

But viral transduction does not just carry random genes. Natural selection favors transduced genes that increase the bacterial host’s fitness. When a bacterium is infected by a virus, its protein-making machinery is used to express viral genes. But when the viral genome includes genes that are beneficial to the host as well, then everybody wins: the phage-infected bacterial species gets genes which enable it to compete better for resources with other bacterial species, while the phage gets a larger number of hosts to infect. Of course, this has to go hand in hand with a relatively benign virus that remains dormant long enough to let the bacterial host species enjoy the benefits of the transduced genes.

Such is the case of cyanophages and cyanobacteria. Cyanobacteria are photosynthetic bacteria, and cyanophages are the viruses that infect them. Several studies have shown that cyanophages have acquired whole photosynthetic genes from bacteria. Viruses do not photosynthesize, but when they infect cyanobacteria, the viral photosynthetic system is added to the bacterial one, boosting bacterial photosynthetic activity and ultimately increasing bacterial energy production.

The photosynthetic mechanism is divided into two components: photosystem I and photosystem II (PSI and PSII). For a few years now, PSII has been known to be transduced by cyanophages.

A more recent study by Itai Sharon and colleagues, published in Nature this September, shows that PSI proteins are also transduced by cyanophages. It also seems that the viral PSI has some interesting properties that may give it an advantage over the cyanobacterial PSI. Two proteins in the bacterial PSI are called PsaJ and PsaF. The authors found that the homologous protein in cyanophages is a fusion of the two, PsaJF. When they modeled PsaJF inserted into the bacterial photosystem I, it seemed that the bacterial PSI with the viral insert could function more efficiently than the original bacterial PSI. As a rule, PSI is a system that accepts electrons from PSII via a protein called plastocyanin. The donated electrons are excited by light, and the energized electrons are used to synthesize ATP and NADPH, the energy coinage of the cell, which are used to synthesize sugar from CO2. However, when the bacterial PsaJ and PsaF are replaced by the viral fusion PsaJF, it seems that plastocyanin does not have to be the only electron donor to the newly minted, virally donated PSI. The PSI may now accept electrons not only from plastocyanin, but from other electron-carrying proteins as well, for example proteins involved in respiration, which also donate electrons. The advantage of such a setup is that electrons whose reducing power would otherwise go to waste instead go through PSI to form extra NADPH and ATP. Sharon and colleagues do not prove all this experimentally, but they make a pretty strong case, citing some analogous cases.

Electron transport from PSII to PSI via plastocyanin. Source: wikimedia commons.

a, The structure of T. elongatus PSI (subunits) was illustrated by PyMOL (http://pymol.sourceforge.net/) using a PSI monomer (adopted from Protein Data Bank (PDB) accession 1jb0). PsaF is in magenta, PsaJ is in blue, and all of the other subunits are in green. b, A model for the structure of the viral PsaJF fusion protein (red) substituting the original PsaF and PsaJ subunits. Reproduced under NPG Licensing terms for non-commercial / educational purposes. doi:10.1038/nature08284

Like I said, this work is purely bioinformatics. They basically mined the Global Ocean Sampling metagenomic data, over six million sequences from marine microbes collected by the J. Craig Venter Institute, which I mentioned in another post. They then identified sequences that contain PSI genes, and sifted through those to find sequences that also contain genes that are exclusively viral. Having both a PSI gene and a viral gene on the same DNA clone ensures the clone was taken from a virus. I’m not sure how they did the structural modeling and insertion of the PsaJF. This seems to be missing from both the Nature article and the supplementary material. Yes, it’s one of those Nature works with 3 pages of article, and 28 of supplementary. Great read though, there’s treasure everywhere.
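To make the screening logic concrete, here is a minimal Python sketch of the co-occurrence filter described above. It is not the authors' pipeline: it assumes each clone has already been annotated with gene labels (in practice that would come from similarity searches against reference proteins), and the gene sets, the function name find_viral_psi_clones and the clone IDs are all made up for illustration.

```python
# Illustrative sketch only: gene labels, clone IDs and the function name
# are hypothetical, not taken from the paper.

# Labels we treat as photosystem I genes (assumed to be pre-assigned).
PSI_GENES = frozenset({"psaA", "psaB", "psaJF"})

# Placeholder labels standing in for genes found only in viruses.
VIRAL_ONLY_GENES = frozenset({"phage_portal", "phage_capsid"})


def find_viral_psi_clones(clone_annotations):
    """Return IDs of clones carrying at least one PSI gene AND at least
    one exclusively viral gene on the same piece of DNA."""
    hits = []
    for clone_id, genes in clone_annotations.items():
        gene_set = set(genes)
        has_psi = bool(gene_set & PSI_GENES)
        has_viral = bool(gene_set & VIRAL_ONLY_GENES)
        if has_psi and has_viral:
            hits.append(clone_id)
    return sorted(hits)


# Toy input: clone ID -> gene labels found on that clone.
example = {
    "clone_001": ["psaA", "psaB", "phage_portal"],
    "clone_002": ["psaA", "psaB"],           # PSI only: likely bacterial
    "clone_003": ["phage_capsid"],           # viral, but no PSI gene
    "clone_004": ["psaJF", "phage_capsid", "unknown_orf"],
}

print(find_viral_psi_clones(example))  # ['clone_001', 'clone_004']
```

The point of requiring both kinds of gene on one clone is that a PSI gene by itself could just as well come from a cyanobacterium; only the viral neighbor on the same DNA fragment ties it to a phage.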


Sharon, I., Alperovitch, A., Rohwer, F., Haynes, M., Glaser, F., Atamna-Ismaeel, N., Pinter, R., Partensky, F., Koonin, E., Wolf, Y., Nelson, N., & Béjà, O. (2009). Photosystem I gene cassettes are present in marine virus genomes. Nature, 461 (7261), 258-262. DOI: 10.1038/nature08284

Lindell, D., Jaffe, J., Johnson, Z., Church, G., & Chisholm, S. (2005). Photosynthesis genes in marine viruses yield proteins during host infection. Nature, 438 (7064), 86-89. DOI: 10.1038/nature04111


The Warren L. DeLano Memorial Award for Computational Biosciences

Warren DeLano passed away suddenly and at a young age at his home on Nov 3, 2009. He was the author of PyMOL, a very popular molecular visualization program, and a strong advocate of open source software. The family of Warren Lyford DeLano has created an “In Memoriam” page and blog. Also, a memorial award is being set up in his name, as per this email circulated on various mailing lists.

Dear friends and colleagues:

It’s now been over a week since Warren has passed away.  We are trying to
move toward a permanent way to honor Warren’s memory and what
he stood for: Open Source Computational Biosciences and molecular
visualization. To do this, Jim Wells and I put together a mission statement
with the approval of Warren’s family:
The Warren L. DeLano Memorial Award for Computational Biosciences

This award shall be given to a top computational bioscientist in
recognition of the contributions made by Warren L. DeLano to creating powerful
visualization tools for three dimensional structures and making them freely accessible.
The award, accompanying lecture, and honorarium will be given annually in the context of a
national bioscience meeting or a Bay Area gathering of
computational bioscientists at Stanford, UCSF or UC Berkeley. For the award special emphasis
will be given for Open Source developments and service to the bioscience community.
The award selection committee, consisting of experts in the computational and
biological sciences, will accept nominations from anyone.
To make something like this happen in perpetuity would take about ~100K for
the endowment.

For donations, Warren’s family has set up a tax deductible fund:

Silicon Valley Community Foundation
memo:  Warren L. DeLano Memorial Fund
2440 West El Camino Real, Suite 300
Mountain View, CA 94040
tel: 650.450.5400

We hope that you’ll consider making a contribution (no matter
how small) in Warren’s honor.  Also, please forward this message
to anybody who might be willing to contribute.

Best regards,
Axel

Axel T. Brunger
Investigator,  Howard Hughes Medical Institute
Professor of Molecular and Cellular Physiology
Stanford University


Book: Thirteen / Black Man. Richard K. Morgan

To say that Thirteen is a futuristic Chandlerian hardboiled-detective-fiction meets Gibson cyberpunk in a Swiftian satire of contemporary USA would be a cumbersomely loaded one-liner describing a no less loaded but sleekly streamlined novel. Saying that would also do injustice to Gibson, Chandler, Swift, the English language and especially Richard Morgan.

This book has it all, and then some. Frame-by-frame violence? Check. Cool tech including AIs of different flavors, weapons, and assorted plug-in-your-CNS-and-get-groovy? Check. Biotechnology that includes genetically enhanced soldiers practicing a Martian martial art and shooting virus-laced bullets? Check. An uneasy, sexually-charged partnership between a lady cop who plays by the rules and a bounty hunter who doesn’t? Check. Political statements about the current US through a futuristic caricature thereof? Check.

Carl Marsalis is a “variant Thirteen”, a modified human raised in a crèche with his peers to become a lone-wolf, high-aggression, low-sociability fighter. Thirteens’ genomes were designed to resuscitate the primitive human male that existed during hunter-gatherer times, when alpha males ruled, and sociopathy combined with aggression had its survival merits. In other words, Marsalis is a throwback to a time when Men were Men, something the 22nd century society cannot handle. But the same society that cannot handle such men is still fighting wars in remote places, and how would a government raise an army from a pool of girlie-men? The answer: use genetic engineering to create a Yojimbo, a Conan, a Tarzan, a Universal Soldier, a Mad Max, a John Carter, or a Dirty Harry: take your pick. All of them are Real Men who get out, get the job done, don’t ask questions and don’t give a shit about what the rest of us think about them. Make enough of those to fight your remote wars in Central Asia and South America, and you can keep the rest of the cudlips (a derogatory term used by Thirteens to allude to humans as herd animals) happy and carefree.

“We’re not like you. We’re the Witches. We’re the violent exiles, the lone-wolf nomads that you bred out of the race back when growing crops and living in one place became so popular. We don’t have, and we don’t need a social context.”

As usual, those plans backfire, and Thirteens are deemed too dangerous to keep on Earth. In response to humanity’s xenophobic backlash against the “twists”, the pejorative term for Thirteens, Earth’s governments exile them to Mars. Mars is a terraformed New New World resembling the early Australia or America: somewhere to send the social misfits, never mind that society itself created them. But Marsalis wins a return-to-Earth lottery and gets a ticket back to Earth. He is tolerated as long as he keeps his genetic identity secret, since it is not legal for Thirteens to walk free. Another condition is that he works as a bounty hunter for UNGLA, the United Nations Genetic Legislation Authority. Apparently not all Thirteens have gone willingly to Mars, and Marsalis is the one who tracks them down and kills them. The book opens with such an execution, and some “collateral damage”. Marsalis’s almost blasé attitude towards the value of life offends us until, quite quickly, we realize that many non-engineered humans are much worse morally than he is.

Richard Morgan has also decided to take current US political partisanship to the extreme, and divide the US into three countries. Remember the Jesusland maps that were circulating on the Internet after Bush’s second election? Morgan uses those maps to outline future America. First there is the secessionist Pacific Rim (RimSec), which is pretty much modern-day California, with its perceived cultural openness and innovation-driven boom-bust supercapitalist economy, allied with the strong economies of Asia. This is contrasted with The Republic, or Jesusland. Oppressively religious, with a failed education system and failing agrarian economy, the fundamentalist-run Jesusland is a Bible-Belt pastiche. Finally, the Northeastern states are dominated by the UN, and heavily aligned with a culturally-tolerant, socialistic Europe. A strong endorsement of Morgan’s futuristic vision is the title under which his book was distributed in the US. The original title, Black Man, was deemed offensive by the US publishers, and was changed by them to Thirteen. Jesusland would probably have banned the book altogether: graphic sex, violence, drugs, profanity and general apostasy and heresy.

“To be a believer, you … have to want something big and patriarchal around to take care of business for you. You have to be apt for worship, and Thirteens don’t do worship, of anyone or anything”.

Marsalis is black and a Thirteen. Twenty-first Century racial prejudice meets the 22nd Century genetic one, as one Jesusland officer called him the “nigger twist”. This is played upon a lot in the book. Perhaps a bit too much.

After Marsalis completes a UNGLA-sanctioned murder mission of another Thirteen in South America, he gets arrested in a police entrapment in Jesusland during a layover on his flight home to London. He is sprung from jail by COLIN, the UN COLonial INitiative authority from the Northeastern states, to track down a renegade Thirteen who also came back from Mars and is on a killing spree across North America. The victims are purposefully targeted, yet how they are connected is a mystery. As usual in hardboiled fiction, the arduous and twisted (pun intended?) detection trail slowly reveals corruption in the upper echelons. The rotten firmaments of civilized society are exposed as being run by people even more sociopathic and dangerous than the elusive and murderous Thirteen Marsalis is tracking. When Marsalis finally figures things out and goes on his own revenge spree, we cannot help but cheer. That is about all of the plot I should probably reveal.

Interestingly, as we go through the book Marsalis is slowly revealed as the antithesis to what a Thirteen is supposed to be. Marsalis holds his aggression in check better than most cudlips. He follows society’s norms (although, as the text repeatedly states, only because he is smart enough to stay out of trouble, not because he accepts them). He is cool, calculated, and even when following his gut feeling he seems more cerebral than his human partners. We learn that a cold-blooded killer is more humane than a hot-blooded one. Twists may kill, but cudlips massacre. The reason lies with the particular brand of sociopathy wired into Thirteens: their disinterest in society and social norms defuses the aspiration for power over others that some humans have. A Thirteen would never climb the corporate ladder, run for political office, impose his religion over others, or decide to conquer the world: he has no interest. Speaking of religion, see the quote above. Thirteens are wired to be atheists.

Not aspiring to a position of power, not wanting to shepherd the flock, means a Thirteen will not do the damage that it takes to get that position, or to hold it.

“Warlord wants the same thing any cudlip politician wants — legitimacy, recognition, and respect from the rest of the herd. The whole nine-car motorcade.”

Morgan paints a mixed picture of humanity: collaborative societies accomplish things. They grow corn, build cities, cure diseases and colonize Mars. But in a collaborative society, it is mostly the power-hungry sociopaths that make it to the top, or to a top. Humanity has accomplished all its good things by working together, and deferring to authority while doing so. The price we pay for working together is having someone work us, and that someone may not have the Greater Good in mind. The Thirteen can look at us from the sidelines, and expose our leaders for what they are, but at the end of the day:

“They (cudlips) won because it worked. Group cooperation and bowing down to some thug with a beard worked better than standing alone as a thirteen was ever going to…they hunted us down, they exterminated us, and they got the future as the prize.”
