Archive

Archive for the ‘Bioinformatics’ Category

Bioinformatics Blog Carnival #1

March 10th, 2010 1 comment

Yes! Why should the evolution people have all the fun with their blog carnival? (After all, it is only a theory.) It’s time for bioinformaticians to show what we are made of, and to have a carnival of our own. Bio::blogs had a good run some time ago. I decided to reconnect what is hopefully now a larger and more networked community under the title of Bioinformatics Blog Carnival.

Northampton: monument to Francis Crick, born in Northampton. Copyright Ian Rob, licensed for reuse.

Programming

For many new bioinformatic programmers, there is a question of which Bio* package to choose from. The Bio* packages (Biopython, BioPerl, BioJava and BioRuby) are open source licensed packages which are used heavily in bioinformatics coding. They contain parsers for bioinformatic data file, file format converters, SQL interfaces and other goodies that make a biohacker’s life manageable. Walter Jessen presents a rapid-fire no-nonsense review of Open Source Programming with Bio* Libraries on Expressing Scientific Insight. Finally, if you would like to do everything wrong, Manuel Corpas presents 10 Sarcastic Rules on How to Be a Bioinformatician posted at Manuel Corpas’ Blog.

Databases

Morgan Langille talks about BioTorrents – a file sharing resource for scientists posted at his blog Beta Science. My take on BioTorrents is that it is a cool idea, but as most institutes block BitTorrent along with other peer-to-peer sharing utilities, I doubt there is a critical mass of feeders to make BioTorrents viable. Things can change though: institutes might decide to set up dedicated servers to Torrent scientific data, just like there are legitimate Torrent servers for Linux distros. George presents Bio-graphics, BioSQL and Rails part 1 posted at Biorelated. He talks about how to quickly add graphics support to a bioinformatics database rails application.

Brad Chapman’s Blue Collar Bioinformatics is a treasure-trove of useful bioinformatics methods. Brad is very thorough in his writing, and he covers a wide variety of topics. I particularly enjoyed reading about his adventures at the biohackathon 2010 in Tokyo, and the resulting Python query interface to BioGateway SPARQL endpoint and InterMine.

Genomics

Nick Loman from Pathogens: Genes and Genomes gives some excellent tips for de-novo genome assembly. He also gives the necessary scripts,which are great companions to Velvet, the popular short read assembler. Luke Jostins writes about AGBT: Speculating on Third Gen Tech posted at Genetic Inference, “An investigation into what data from various third generation sequencing technologies may look like.” SM presents Resources for Exome sequencing annotation posted at Organizing the Strands of Curiosity.

Structural biology

Sean Seaver talks about the retraction of several protein structure papers published by one group at the University of Alabama, Birmingham. Structuregate was reported both in the media and in the blogosphere. Sean walks us through how it might have been done, and how structural bioinformatics techniques found out the wrong structures in Origin and Orientation posted at P212121. Maria Hodges talks about how difficult it is to explain structural genomics to the man in the pub. Menachem Fromer presents Tradeoff between stability and multispecificity in the design of promiscuous proteins. Promiscuous proteins are proteins which bind different partners (ligands, or other proteins). However, the more partners they are able to bind, the less stable they are, as shown in a series of simulated evolution studies, summarized at Nir London’s Macromolecular Modeling Blog. OK, promiscuity and and a pub. There must be a joke somewhere there.

Credit: Iddo Friedberg & 3D chem for the Anthrax toxin lethal factor image

A common lament among bloggers and other enthusiastic adopters of Web 2.0 technology is the lack of mainstream uptake of these tools by active scientists. A recent report from University of California Berkeley confirmed this: “The advice given to pre-tenure scholars was consistent across all fields: focus on publishing in the right venues and avoid spending too much time on public engagement, committee work, writing op-ed pieces, developing websites, blogging, and other non-traditional forms of electronic dissemination (including online course activities)“. Maria Hodges argues that Web 2.0 thrives where journals don’t, and that the NMR community might be the first to reach the tipping point, where your career is harmed by not contributing. She talks of two wikis used by the NMR community. It is a small community, which has a need for sharing methods that are generally not publishable in peer-reviewed journals. The wiki venue makes for an ideal dissemination method for this community. On the subject of data sharing, and how it can backfire: here is an interesting connection between the history of the PDB, and the email affair known as ‘climategate’.

A sh*tload of data

March 4th, 2010 1 comment

ResearchBlogging.org

There are more microbial cells in our body than our own. Those microbes are not just passive hitchhikers or conversely, malicious agents of disease. They affect our well-being and health in a much broader spectrum than simply “bad” or “passive”. Among other things our gut microbes play an important role in digestion, have been linked to obesity, conditions as severe as certain inflammatory bowel syndromes or as relatively mild as traveler’s diarrhea. The microbes living on our skin also affect many things: from body odor and dandruff and acne to dermatitis and psoriasis. Also, being the most exposed microbial population means that they are themselves affected by our constant exposure to various agents,and their population varies by our own behavior –  such as how often, and with what, do we wash our hands: antibacterial soap has been named as a major culprit in the development of resistant bacterial strains. In all organs the native flora, relatively benign, protects us against colonization by more virulent bacteria.

Indeed, our gut, skin, mouth and genital microbiomes can be viewed as  additional organs in the way that they affect us. If you take a long course of powerful systemic antibiotics, the ensuing diarrhea and sometimes mouth thrush are the result of these “organs” — in your mouth and gut — being removed temporarily from your body.

Credit: David Gregory&Debbie Marshall. Wellcome Images images@wellcome.ac.uk

Today, another big step has been taken towards understanding the role of our microbiomes play. Just how big, in what direction, or what will be the consequences of this step is unclear, and will remain unclear for quite some time. A group of Chinese and European researchers have published the largest sequencing effort yet of gut bacteria. Their current yield, 576 Gigabases of DNA from the feces of 124 European people is considerably larger than the previous large effort in the US: 3GB from sequenced from 33 US and Japanese adults. Also, Qin and colleagues looked at obese, lean, healthy and sick (inflamatory bowel syndrome) individuals. They identified 1,000 to 1,100 different bacterial species, with about 160 different species per individual. Healthy individual’s bacterial population was markedly different that those with inflammatory bowel syndromes, and the populations of those with IBS differed depending on disease type.

What does this all mean? Well, like in the human genome project, it will take a while before  we understand not only what these data contain, but what are the limits of our ability to interpret them, and how best they can serve us. What we have right now is the equivalent of the outline of a newly discovered continent. It is up to many individual explorers to discover and chart the myriad living things in terra excreta. Which genes are most associated with obesity or with ulcerative colitis? How prevalent is gene transfer between gut bacteria, and how much of a role does this play in antibiotic resistance? Are there microbial species more prone to changes in their genomes than others? Are there metabolic pathways that are shared between different microbial species? Are some genes faster evolving than others, what would be the ecological role different species play in the gut? And how do different microbial populations ultimately affect their human hosts? Do different bacterial species have a preference with whom they share our guts? These broad questions can be broken down into individual questions relating to a lab’s pet genes, metabolic pathways, microbial species and metabolic conditions. There is a lot to sift through here and these data will keep us busy for years to come.

Pathogenic E. coli on the intestinal lining. Credit: S. Schuller.

In other blogs Carl Zimmer has a great post on how he, for one welcomes our bacterial overlords, while Ed Yong talks about the science of things to come.

Oh, and for those of you who, like me, can’t wait to plug the data into your favorite analysis pipeline, you can get the gene and assembled and annotated sequence data data from Peer Bork’s lab at the European Molecular Biology Laboratory, or the Beijing Genome Institute. The raw sequence data are available from the European Bioinformatics Institute under the accession ERA000116. (I can’t find the latter myself right now, hopefully they will show up in a couple of days). Update March 7, 2010: the sequences are now deposited at EBI. Also, on BioTorrents (hat tip to Morgan Langille).


Qin, J., Li, R., Raes, J., Arumugam, M., Burgdorf, K., Manichanh, C., Nielsen, T., Pons, N., Levenez, F., Yamada, T., Mende, D., Li, J., Xu, J., Li, S., Li, D., Cao, J., Wang, B., Liang, H., Zheng, H., Xie, Y., Tap, J., Lepage, P., Bertalan, M., Batto, J., Hansen, T., Le Paslier, D., Linneberg, A., Nielsen, H., Pelletier, E., Renault, P., Sicheritz-Ponten, T., Turner, K., Zhu, H., Yu, C., Li, S., Jian, M., Zhou, Y., Li, Y., Zhang, X., Li, S., Qin, N., Yang, H., Wang, J., Brunak, S., Doré, J., Guarner, F., Kristiansen, K., Pedersen, O., Parkhill, J., Weissenbach, J., Antolin, M., Artiguenave, F., Blottiere, H., Borruel, N., Bruls, T., Casellas, F., Chervaux, C., Cultrone, A., Delorme, C., Denariaz, G., Dervyn, R., Forte, M., Friss, C., van de Guchte, M., Guedon, E., Haimet, F., Jamet, A., Juste, C., Kaci, G., Kleerebezem, M., Knol, J., Kristensen, M., Layec, S., Le Roux, K., Leclerc, M., Maguin, E., Melo Minardi, R., Oozeer, R., Rescigno, M., Sanchez, N., Tims, S., Torrejon, T., Varela, E., de Vos, W., Winogradsky, Y., Zoetendal, E., Bork, P., Ehrlich, S., & Wang, J. (2010). A human gut microbial gene catalogue established by metagenomic sequencing Nature, 464 (7285), 59-65 DOI: 10.1038/nature08821

Bioinformatics blog carnival

February 18th, 2010 2 comments

Byte Size Biology will be hosting the first edition of the bioinformatics blog carnival. All you bioinformatics bloggers, submit your entries by Mar 9, 2010 23:59:03  EST. Note the 3 second extension I have already given. There will be no more deadline extensions, I’ve been generous enough as it is. The carnival will be posted here by March 15, 2010.

Any blog posts that have to do with the computational aspects of: genomics, nextgen sequencing, sequence analysis, gene expression, systems biology, ontologies, databases, structural biology, metagenomics, phylogenetics, function prediction and I probably forgot a few other categories so don’t hold it against me, just submit. Early and often. Your own posts and others that you liked. Nothing too old please, 1/1/2009 and later. I reserve the right to be a less-than-benevolent dictator and screen out posts. This applies especially to commercial plugs with no other merits.

Please retweet, reblog, rebuzz and remember:  submit to the bioinformatics blog carnival!

If you have a cool logo for this carnival, email bioinfo.blog.carnival AT gmail.com OhOne and OhToo will pick the winning logo to be displayed on the carnival. Likewise, if you would like to host the next edition, let me know. Do not submit posts by email, only via blogcarnival.com.

Happy carny-ing.

Ancient Greenlander’s DNA reveals ugly mullet

February 13th, 2010 2 comments

ResearchBlogging.org

Seriously, this is what I first thought when I saw the cover of this week’s Nature, and the associated drawings  in the press.  The dude’s haircut seems like it was even bad in the ’80s… 2080 BCE that is, which is when his body is dated. Approximately.

Inuk, artist's impression

A large group of researchers were involved in analyzing DNA from a 4,000 year old man found in Greenland. There is an interesting story on how the sample was obtained. Eske Villerslev spent 2 months looking for human remains in archaeological sites in Greenland, to no avail. When complaining about it to a friend, he learned that that friend’s father had already given hair samples to the national museum of Denmark… back  in 1986. Serious slap on the forehead!

The preserved DNA was of great quality. Also, this is 2010: we have a plethora of genotypic data that lets us associate a genome with phenotypes, and geographical origin. Inuk, as he was dubbed, shares many traits with east Asians, and much less with contemporary North American Inuit or other Native Americans. This means that Inuk and his people migrated fairly recently to Greenland from Siberia and across North America. Inuk and his people, the (now extinct) Saqqaq have no relationship to contemporary Native Americans. This was also shown in an earlier study of the mitochondrial DNA.

What do we know about Inuk himself? He had blood type A+; thick black hair and very likely a receding hairline due to a baldness allele; brown eyes; shovel-graded front teeth (East  Asian characteristic); dry type earwax… before you laugh that is actually, another characteristic of Asian populations on the Siberian and Chinese east coast. He also had alleles associates with human adaptation to cold temperatures.

And a bad haircut; but great press nevertheless (just got to Google News and search for  “Greenland Genome” – see, for example, the NYTimes.)

This post has been Slashdotted. Exercise extreme caution.


Rasmussen, M., Li, Y., Lindgreen, S., Pedersen, J., Albrechtsen, A., Moltke, I., Metspalu, M., Metspalu, E., Kivisild, T., Gupta, R., Bertalan, M., Nielsen, K., Gilbert, M., Wang, Y., Raghavan, M., Campos, P., Kamp, H., Wilson, A., Gledhill, A., Tridico, S., Bunce, M., Lorenzen, E., Binladen, J., Guo, X., Zhao, J., Zhang, X., Zhang, H., Li, Z., Chen, M., Orlando, L., Kristiansen, K., Bak, M., Tommerup, N., Bendixen, C., Pierre, T., Grønnow, B., Meldgaard, M., Andreasen, C., Fedorova, S., Osipova, L., Higham, T., Ramsey, C., Hansen, T., Nielsen, F., Crawford, M., Brunak, S., Sicheritz-Pontén, T., Villems, R., Nielsen, R., Krogh, A., Wang, J., & Willerslev, E. (2010). Ancient human genome sequence of an extinct Palaeo-Eskimo Nature, 463 (7282), 757-762 DOI: 10.1038/nature08835

Gilbert MT, Kivisild T, Grønnow B, Andersen PK, Metspalu E, Reidla M, Tamm E, Axelsson E, Götherström A, Campos PF, Rasmussen M, Metspalu M, Higham TF, Schwenninger JL, Nathan R, De Hoog CJ, Koch A, Møller LN, Andreasen C, Meldgaard M, Villems R, Bendixen C, & Willerslev E (2008). Paleo-Eskimo mtDNA genome reveals matrilineal discontinuity in Greenland. Science (New York, N.Y.), 320 (5884), 1787-9 PMID: 18511654

Short bioinformatic hacks: reading between the genes

February 11th, 2010 3 comments

In celebration of the biohackathon happening now in Tokyo, I am putting up a script that is oddly missing from many bioinformatic packages: extracting intergenic regions. This one was written together with my student, Ian. As for the biohackathon itself, I’m not there, but I am following the tweets and  Brad Chapman’s excellent posts:

About intergenic regions: intergenic regions are as interesting and sometimes even more interesting than the genes themselves: when you are interested in promoters, transcription factor binding sites or almost any other transcription regulation mechanism. Here’s a simple script to find intergenic regions. It reads a genbank formatted file and uses the information there to extract the intergenic regions. The sequences are written to a FASTA file.

#!/usr/bin/env python
import sys
import Bio
from Bio import SeqIO, SeqFeature
from Bio.SeqRecord import SeqRecord
import os

# Copyright(C) 2009 Iddo Friedberg & Ian MC Fleming
# Released under Biopython license. http://www.biopython.org/DIST/LICENSE
# Do not remove this comment
def get_interregions(genbank_path,intergene_length=1):
    seq_record = SeqIO.parse(open(genbank_path), "genbank").next()
    cds_list_plus = []
    cds_list_minus = []
    intergenic_records = []
    # Loop over the genome file, get the CDS features on each of the strands
    for feature in seq_record.features:
        if feature.type == 'CDS':
            mystart = feature.location._start.position
            myend = feature.location._end.position
            if feature.strand == -1:
                cds_list_minus.append((mystart,myend))
            elif feature.strand == 1:
                cds_list_plus.append((mystart,myend))
            else:
                sys.stderr.write("No strand indicated %d-%d. Assuming +\n" %
                                  (mystart, myend))
                cds_list_plus.append((mystart,myend,1))

    for i,pospair in enumerate(cds_list_plus[1:]):
        # Compare current start position to previous end position
        last_end = cds_list_plus[i][1]
        this_start = pospair[0]
        strand = pospair[2]
        if this_start - last_end >= intergene_length:
            intergene_seq = seq_record.seq[last_end:this_start]
            strand_string = "+"
            intergenic_records.append(
                  SeqRecord(intergene_seq,id="%s-ign-%d" % (seq_record.name,i),
                  description="%s %d-%d %s" % (seq_record.name, last_end+1,
                                                        this_start,strand_string)))
    for i,pospair in enumerate(cds_list_minus[1:]):
        last_end = cds_list_minus[i][1]
        this_start = pospair[0]
        strand = pospair[2]
        if this_start - last_end >= intergene_length:
            intergene_seq = seq_record.seq[last_end:this_start]
            strand_string = "-"
            intergenic_records.append(
                  SeqRecord(intergene_seq,id="%s-ign-%d" % (seq_record.name,i),
                  description="%s %d-%d %s" % (seq_record.name, last_end+1,
                                                        this_start,strand_string)))
    outpath = os.path.splitext(os.path.basename(genbank_path))[0] + "_ign.fasta"
    SeqIO.write(intergenic_records, open(outpath,"w"), "fasta")

if __name__ == '__main__':
    if len(sys.argv) == 2:
         get_interregions(sys.argv[1])
    elif len(sys.argv) == 3:
         get_interregions(sys.argv[1],int(sys.argv[2]))
    else:
         print "Usage: get_intergenic.py gb_file [intergenic_length]"
         sys.exit(0)

What are we seeing here?

Lines 11-16 are the preamble: we read the GenBank file using Biopython’s genbank parser in line 12.  Beacuse we expect a genome file, which contains a single record, this is a one-time read. Note that this is a rate limiting step, and can take a couple of seconds. Took me ~2secs to read the full E. coli genome on my Linux box.  We prepare one list for the + strand intergenic regions (13), another one for the minus strand intergenic regions (14) and one for all the records (line 15).

The rest of the code are three loop blocks: lines 16-28 I loop over the genbank features, extracting the coordinated of the genes themselves. Line 32-41 I find the intergenic regions on the + strand. Lines 42-52 I do the same for the “-” strand.

Now for a philosophical interlude: although there is a way to read all the intergenic regions in a single pass, I subscribe to the “code simple” doctrine of research software writing. Code performance optimization is a low priority for me. I’d much rather have something that is simple to write,read and modify. I also don’t want to spend too much time coding and elegant script for elegance’s sake, especially if I may not use it too much. Historically, scientific code written for research is mostly extinct: thrown away after a short lived hypothesis was tested and ended its days. Research coding is mostly throwaway glue code. Very rarely it matures into a product. Then, and only then, can you apply all those fine software engineering you learned in college. Before that, write fast and simple.

But I digress. Line 17 loops over the features in the genome file. Line 18 we identify if it is a coding sequence (CDS). If so, we identify the start position, and position and the strand the CDS is on. The list cds_list_minus is a list of 2-tuples. Each 2-tuple is the start and end positions of a CDS on the minus strand. (If you would like to go over the genes, as opposed to coding sequences, change line 18 to:

if feature.type == 'gene':

(or better yet, pass an argument that defines it.)

cds_list_plus, is, yes, the same as cds_list_minus, only for the plus strand (line 24).

Sometime, a CDS does not contain information on which strand it is. With genome files, that is usually the case with single stranded viral genomes. Therefore, we put in the default assumption that if there is no strand indication, then the feature is is on the plus strand. We generate a warning message nevertheless (lines 25-28).

Lines 30-41 we loop over the plus  strand list, and identify the coordinates between the genes. Python’s enumerate function is very useful here. The enumerate function allows us to iterate over a list, but at the same time keep track of which index we are in when looping over the list. So in line 30, pospair receives the start and end coordinates of a CDS as a 2-tuple, while i receives the actual number if the index in the plus strand CDS list. In that way, we can look back to the previous list member, find the coordinates where that CDS ends, and where the current CDS begins. The two coordinates make up the beginning and end of the intergenic regions between those two genes on that strand. In line 35 we check if the intergenic region length is equal to or larger than a threshold: suppose we are only interested in those intergenic regions that are longer than 100 bases? (The default value is 1, see line 11.) In lines 38-39 we build a biopython sequence object that contains an informative header, and the sequence of the intergenic region. The description which goes in the sequence header contains the start and end coordinates of the intergenic region, and the locus ID of the CDS directly downstream from it. The sequence object is appended to a list, which will eventually get written (lines 40-41).

Lines 42-52 are a repeat of lines 30-41, only for the minus strand.  Lines 53 & 54: the list that contains all the intergenic region sequence objects gets written to its own fasta file.

Finally, line 56-63 are boilerplate code, that make this script runnable from the command line. Have fun looking at intergenic regions. Let me know of you find something interesting.

The polypharmacome

January 23rd, 2010 1 comment

ResearchBlogging.org
This post was chosen as an Editor's Selection for ResearchBlogging.org
Pharmaceutical companies are always on the lookout for secondary drug targets. After all, if you invest billions developing a single drug, you would be more than happy to sell it as a treatment for two, three, or more different ailments. Sildenafil citrate was developed to treat angina and hypertension, but during phase I clinical trials, it was found that Sildenafil induces penile erections. The drug was branded Viagra, and the rest is history. Eflornithine, an anti-cancer drug, is also effective against the agent of African sleeping sickness, Trypanosoma brucei. African Sleeping Sickness is known as a “neglected disease”, for which drug development is not profitable and therefore not a priority. However, having a drug already on hand makes it easier to distribute in affected areas, since the R&D costs are recovered elsewhere.

Another example of polypharmacology is a drug that binds to multiple targets in the human body. This could be used for overcoming drug resistance, a known problem with cancer. Cancer tumors often develop a resistance to anti-cancer drugs by simple natural selection: the protein that the drug binds to mutates, and no longer binds the drug. However, if the drug acts by binding redundantly to several proteins, it would be more effective, since several mutations would be required to effect drug resistance.

Another important polypharmacological consideration  is toxicity. If a drug binds to one protein, its drug target, it may also bind to another one which it should not bind to as it disrupts the normal functions and the health of the patient. If the side effects outweigh the cure, the drug is no good.

How one drug (cyan) can bind to two different proteins with different overall shapes (pink and green), but with similar binding sites

Because it can either increase  profits, or conversely derail a whole process of drug development, predicting polypharmacophoric effects is very much something drug developers want. A study published yesterday in PLoS Computational Biology by Jacob Durrant and his colleagues suggest a way  bioinformatics and theoretical biophysics can help in identifying multiple drug targets. Durrant’s goal was simple: given the molecular structure of a candidate drug, which proteins are expected to bind it? The strategy this group took is a combination of bioinformatic and experimental screening.

Finding additional drug targets in four  steps

Reproduced under CC license from doi:10.1371/journal.pcbi.1000648.g001. Click for original image.

Step 1 (A-C in the figure above): identify the known target protein. Now pick all the protein structures that are not similar to it. Why those that are not similar? Similar proteins could be potential drug targets, since they have a similar shape to the known target protein. But here they are interested in finding targets from proteins that are of a different shape, and have no homology to the known target protein: secondary targets.

Step 2 (D): take this set of dissimilar proteins, and look for binding site similarities. Binding sites are clefts in the protein that may bind drugs. If those clefts are similar in shape to the cleft in the known target proteins, they may bind the drug. Leave only those non-homologous proteins that have similar binding sites (D in the figure)

Step 3 (E): Now add all the proteins that are homologous to the set generated in step 2. This increases the number of possible targets to homologs of the proteins that were initially selected only for binding site similarity.

Step 4 (F): take the drug molecule, and try to dock it to the various protein structures. Rank the druggability of each protein according to the score provided by the drug docking software (Autodock).

Experimental verification

Now for the cool part. Durrant and colleagues tested  the computational prediction in the lab, using the compound NSC-45208. NSC-45208 inhibits a protein responsible for RNA processing in Trypanosoma brucei. So we know it is a potential drug against African sleeping sickness. What else is it good for?

(NSC-45208), 4,5-dihydroxy-3-(1-naphthyldiazenyl)-2,7 -naphthalenedisulfonic acid, a recently discovered inhibitor of T. brucei RNA editing ligase 1 (TbREL1)

“The predicted secondary targets that gave the best docking scores, H. sapiens mitochondrial 2-enoyl thioester reductase (HsETR1), T. brucei UDP-galactose 4′ epimerase (TbGalE), H. sapiens phosphodiesterase 9A (HsPDE9A2), and Streptococcus pneumoniae teichoic acid phosphorylcholine esterase (SpPce), were subsequently tested experimentally.”

Durrant and his colleagues tested their predictions that NSC-45208 also binds to two human proteins (HsETR1 and HsPDE9A2), one additional Trypansome protein (TbGalE), and a bacterial (Streptococcus pneumoniae) protein (SpPce). Not only binds, but also inhibits their enzymatic activity. Their predictions worked well on the top two predicted targets: NSC-45308 inhibited the enzymatic activity of HsETR1and TbGalE, but HsPDE9A2 and SpPce were not affected by the drug.

At first blush, this does not seem to be much of a batting average: two positives and two false positives. But we have to remember that the experimental verification of these predictions — even if you have a good enzymatic assay to check predictions– can be very time- and resource consuming. An exhaustive test of the predictions for several compounds and many target enzymes is still not possible. As initial proof-of-principle, this work goes much farther than most other joint experimental / computational works I have read. The authors also go into lengthy and detailed discussions on limitations and improvements, as well as on another form of non-specific inhibition of the secondary targets, makes for a really interesting read on polypharmacological considerations in drug screening.


Durrant, J., Amaro, R., Xie, L., Urbaniak, M., Ferguson, M., Haapalainen, A., Chen, Z., Di Guilmi, A., Wunder, F., Bourne, P., & McCammon, J. (2010). A Multidimensional Strategy to Detect Polypharmacological Targets in the Absence of Structural and Sequence Homology PLoS Computational Biology, 6 (1) DOI: 10.1371/journal.pcbi.1000648

Fold.it: wasting time in a good cause

January 16th, 2010 3 comments

I just spent the better part of a Saturday playing with Foldit. Foldit is an ongoing experiment in finding protein structures by harnessing the power of the mob – or gamers, as is the case here. The player is presented with a backbone & sidechain configuration, with the secondary structures mostly pre-determined. The problem is to get the protein to fold into the correct conformation. You can tweak the secondary structures, rubberband the beta strands together into sheets and rotate the sidechains. The residues are colored by hydrophobicity, so you know who should be facing where. The 23 tutorial cases walk you through the  simple yet powerful interface to folding the structures. You can rotate helices, rubberband strands, flip sheets, etc. The interface gives you feedback on sidechain clashes and voids in the structure among other things.  The examples also teach you the basic of protein structural considerations: maximize hydrogen bonds, hydrophobic side chains should be buried in the structure, strands should combine to sheets, and so on. When you feel you are ready, you can start solving the various folding puzzles presented online. You can work solo, or as part of a team. The “correct solution” is, of course, unknown. The best you can do is accumulate as many points as you can, which represent how stable is the conformation you are building. Foldit, like many other cool things in structural biology, is the product of David Baker’s lab at the University of Washington. Here is a video from the YouTube UWFoldit channel showing the coolness of it all. If you decide to get your fold on, make sure you can make the time. It’s flippin’ addictive.

Going to GOA, pt. 2: children of a lesser GO

January 1st, 2010 No comments

The source file associated with this post can be downloaded here.

The last time I talked about how to read a GOA gene_associations file into a Python dictionary data structure.  Our goal was to find all genes that are annotated as hydrolases in the GOA gene_associations file. The tricky part is, most enzymes are not annotated with a general term such as “hydrolase_activity”, but with a more specific term, such “phosphatase_activity” or “tyrosine_phosphatase_activity”. “phosphatase_activity” can be viewed as a “private case” or in ontological terms a descendant of  hydrolase_activity. “tyrosine_phosphatase_activity” is even more specific, being a descendant of “phosphatase_activity” and, by transitivity of “hydrolase_activity”. The bottom line is,  we need to know all the descendants of  our query term to get the properly annotated genes form the GOA gene_associations file.

To do that, we would need to access the GO database. You can import it from the Gene Ontology FTP site as a MySQL dump, and install it locally. This would probably make your query faster. However, you would need to worry about maintaining and regularly updating your database from GO. Instead, it might make sense to connect to one of the mirrors of the GO MySQL database, and query it remotely. This is what I will do here.

To do that, I will create a class that does two things: (1) opens (and closes) a MySQL connection to a GO mirror site and (2) reads the GOA gene_associations file. We saw how to read the file in the previous part. What I am doing here is simply rolling that code into a method (a function called by a class).

import UserDict
import MySQLdb

class GOA_ga(UserDict.UserDict):
    def __init__(self):
        self.data = {}

    def godb_open(self):
        self.dbcon = MySQLdb.connect(host="mysql.ebi.ac.uk", user="go_select",
                   passwd="amigo", db="go_latest", port=4085)
        self.dbcursor = self.dbcon.cursor()

    def godb_close(self):
        self.dbcursor.close()

    def read_gene_assoc(self,inpath):
        for inline in file(inpath):
            db, db_object_id, db_object_symbol, qualifier, go_id, \
            db_reference, evidence, withit, aspect, \
            db_object_name, synonym, db_object_type, \
            taxon_id, date, assigned_by = inline.strip().split('\t')
            key = db+":"+db_object_id
            self.data.setdefault(key,[]).append({'db':db,'db_object_id':db_object_id,
                                            'db_object_symbol': db_object_symbol,
                                            'qualifier': qualifier,
                                            'go_id': go_id,
                                            'db_reference': db_reference,
                                            'evidence': evidence,
                                            'with': withit,
                                            'aspect': aspect,
                                            'db_object_name': db_object_name,
                                            'synonym': synonym,
                                            'db_object_type': db_object_type,
                                            'taxon_id': taxon_id,
                                            'date': date,
                                            'assigned_by': assigned_by})

What we have here is a class with four methods. The class itself is an instantiation of the Python UserDict class. This means it is primarily a dictionary, and can be used as such. But there are other methods too.

Lines 5-6: The __init__ method describes what happens when the class is instantiated, and is actually superfluous here. I just like to put it as good programming practice. Instantiating GOA_ga created an empty dictionary. That’s it.

Lines 8-11: The second method will open a MySQL connection to a remote server. dbcon and dbcursor will be used to access the database from Python, as we shall see soon. I import the Python MySQLdb module which contains functions to access MySQL through Python. You can get it from SourceForge, and it is also in the repositories of most Linux distributions. To install in Ubuntu Linux simply type at the command-line:

sudo aptitude install python-mysqldb

Lines 13-14: The go_dbclose method simply closes the database connection. It is good practice to do so after all queries are done.

Lines 16-36: this is basically the code to read the GOA gene_associations file that we saw in part 1.
The file go_associations.py contains this code (and more below). You can download it, import it, and run the following on your GOA gene association file of choice, using the Python shell. I chose arabidopsis in this example.

>>> import go_associations as ga
>>> my_goa = ga.GOA_ga() # create a class instantiation
>>> my_goa.read_gene_assoc("gene_association.goa_arabidopsis")  # read all GOA

So what we have is a class that lets us read GOA gene_associations file, and also query the GOA MySQL database.

Remember the original goal: get all the records associated with a certain function, or its children. The MySQL code for that is:

SELECT
  rchild.acc
FROM
  term AS rchild, term AS ancestor, graph_path
WHERE
  graph_path.term2_id = rchild.id and
  graph_path.term1_id = ancestor.id and
  ancestor.acc = 'GO:0016787'

This mysql query will return a list of all the GO accession codes that are the descendants of the go accession code “GO:0016187″. How do you know which GO accession code belongs to which term? You can use the keyword search feature in AmiGO. By the way, I pulled this query from the GO Example Queries page on the GO site. There are many useful queries there!

So how do we insert this into Python code? Here it is:

def find_go_descendants(goa_ga, parent_go_id):
   find_descendants_go = """
   SELECT
     rchild.acc
   FROM
     term AS rchild, term AS ancestor, graph_path
   WHERE
     graph_path.term2_id = rchild.id and
     graph_path.term1_id = ancestor.id and
     ancestor.acc = '%s'
   """
   go_children = {}
   found_genes = {}
   # query GO database for all the child terms of the parent_go_id
   goa_ga.dbcursor.execute(find_descendants_go % parent_go_id)
   rpl = goa_ga.dbcursor.fetchall()
   # Flatten the list. Put it in dictionary keys for faster searching.
   for i in rpl:
      go_children[i[0]] = None
   for gene_id in goa_ga.keys():
      for goa_rec in goa_ga[gene_id]:
         if goa_rec['go_id'] in go_children.keys():
            found_genes.setdefault(gene_id,[]).append(goa_rec)
   return found_genes

Lines 2-11 are the templated MySQL query to get all the descendants of the provided GO access number.
Line 12: go_children is a dictionary whose keys are all the GO accession numbers that are the descendants of our query. The values are null. It is faster to search a dictionary keys than a list.
Line 13: found_genes is a GOA_ga dictionary which will contain the found go_associations entries. The keys are the UniProt IDs, and the values are GOA_ga dictionary entries
Lines 15-16: Now we start working. First, we query the GO MySQL database for all the descendant GO terms of go_id. The returned value is a tuple of tuples which looks something like this:

(('GO:0012345',), ('GO:0034567',),('GO:003617',),....)

That is simply how MySQLdb’s fetchall method works: each MySQL record fetched is a tuple. Since we are fetching only one field with our query, each tuple element is  We now need to flatten this tuple — remove the superfluous parentheses. Lines 18-19 do that. They also put the descendant GO terms a dictionary keys structure, rather than a list structure. A dictionary keys structure is faster to search as we do repeatedly in line 23.

Line 20 is the outer loop of the search. We loop over all UniProt gene IDs in the goa_associations file we read into our goa_ga data structure.
Line 21 is the inner loop. Each Uniprot ID may appear in more than one record, as the protein it represents may be associated with more than one GO term.

Lines 22-23: if the associated GO term is in our list of descendant, add that GOA record to our found_genes dictionary, and key it with the proper UniProt ID.

Line 24: return the dictionary of all GOA records that are annotated as the descendants of the query term, or as the query term itself.

Putting it all together:

>>> import go_associations as ga
>>> my_goa = ga.GOA_ga() # create a class instantiation
>>> my_goa.read_gene_assoc("gene_association.goa_arabidopsis")  # read all GOA
>>> my_goa.godb_open()     # open a connection to the GO database
>>> all_hydrolases  = ga.find_go_descendants(my_goa,"GO:0016787") # find the descendants of "hydrolase_activity"

all_hydrolases now contains the GOA records of all the arabidopsis proteins annotated as hydrolases. How many are there?

>>> len(all_hydrolases)
3617
>>>

(This may vary when the latest version of this file changes. GOA is regularly updated). Some genes may be annotated as hydrolases more than once. This might either be an annotation error, as the is annotated twice with the same function, or the protein might be promiscuous with more than one function, or it may have two domains, each with a specific function. To find out all the hydrolase annotations:

>>> n=0
>>> for i in all_hydrolases:
...      n += len(all_hydrolases[i])
>>> print n
5753
>>>

How many of those annotations have a good evidence code?

>>> n=0
>>> for i in all_hydrolases:
...     for j in all_hydrolases[i]:
...             if  j['evidence'] in ('EXP','IDA','IPI','IMP','IGI','IEP'): n += 1
...
>>> print n
416
>>>

416 out of 5763: just over 7%. Keep that in mind next time you delve into an annotation database. (Exercise: find out how many enzymes have been annotated with good evidence in arabidopsis, or in another GOA-annotated organism).

Have fun GOing, and getting there!

Filling in the evolutionary blanks, genome by genome

December 23rd, 2009 8 comments

ResearchBlogging.org

After hearing Jonathan Eisen and Nikos Kyripdes talk about GEBA in various meetings, it is great to see the paper finally come out, and under a CC license too. Good move for everyone.

GEBA is the Genomic Encyclopedia of Bacteria and Archaea. The idea is simple: we have >1000 prokaryotic genomes in GenBank as of today.  But those were sequenced under a myriad of interests: clinical, functional, ease, biotechnological or pharmaceutical potential, etc.  In evolutionary terms, those 1000 genomes provide a very biased view of the tree of microbial life. That would be like sampling mammalian life in Europe and North America only: you would miss out on most big cats, Elephants, Rhinos, not to mention all the marsupials. To correct this situation, teams from the  Joint Genome Institute,  UC Davis and several others set out to perform a more uniform sampling across the tree of prokaryotic life. The first batch of 56 genomes from GEBA is published today in Nature; fifty-three bacterial and three archaeal.

Maximum-likelihood phylogenetic tree of the bacterial domain based on a concatenated alignment of 31 broadly conserved protein-coding genes. Phyla are distinguished by colour of the branch and GEBA genomes are indicated in red in the outer circle of species names. Click to open original in Nature.

It seems that they are on the right track to enrich our understanding of bacterial genes and genomes using this phylogenetically-mindful sampling strategy.  For example, they show that their sampling enables the discovery of an average of 1,060 protein families/genome. Sampling a single bacterial family would provide 121 new protein families, sampling within a bacterial phylum would give an average of 308 new protein families, and within a bacterial domain, 650. They have discovered a total of 1,798 families that seem to have no similarity to any existing family, hinting at new bacterial functionality (or maybe some new prophages?) They have  discovered a few new cellulases, genes that break down cellulose, the polymer that makes up plant cell walls. Cellulases are the holy grail of the biofuel prospecting industry: specifically,  a cellulase that can be exploited en-masse to turn plant matter into fuel economically. They also discovered a homolog of Actin, a cytoskeletal protein thought until now to only exist in eukaryotes.

One thing that is sorely missing is accessibility. Yes, the individual genome papers are all published in SIGS and in Nature under open access, which is great. But when you go to the GEBA site, you get a simple description of the candidate genomes. The annotations are somewhere behind a password-protected site, but I could not seem to get an account to view them. A proper genomic browser for the sequenced and annotated genomes, with some phylogenetic map showing who is located where on the tree would go a long way towards  helping the rest of us explore this new comprehensive picture of prokaryotic genome space.

Finally, if you want to hear more about how they did what, here’s Eisen talking about GEBA.


Wu, D., Hugenholtz, P., Mavromatis, K., Pukall, R., Dalin, E., Ivanova, N., Kunin, V., Goodwin, L., Wu, M., Tindall, B., Hooper, S., Pati, A., Lykidis, A., Spring, S., Anderson, I., D’haeseleer, P., Zemla, A., Singer, M., Lapidus, A., Nolan, M., Copeland, A., Han, C., Chen, F., Cheng, J., Lucas, S., Kerfeld, C., Lang, E., Gronow, S., Chain, P., Bruce, D., Rubin, E., Kyrpides, N., Klenk, H., & Eisen, J. (2009). A phylogeny-driven genomic encyclopaedia of Bacteria and Archaea Nature, 462 (7276), 1056-1060 DOI: 10.1038/nature08656

Gene and protein annotation: it’s worse than you thought

December 14th, 2009 2 comments

ResearchBlogging.org

Sequencing centers keep pumping large amounts of sequence data into the omics-sphere (will I get a New Worst omics Word Award for this?)  There is no way we can annotate even a small fraction of those experimentally and indeed most  annotations are automatic, done bioinformatically. Typically function is inferred by homology: if the protein sequence is similar enough to that of a protein whose function has been determined, then homology is inferred: that is, the unknown and the known protein are descended from a common ancestor. Even more so,  functional identity to the known protein are inferred: the assumption being, that the function did not change if the common ancestral protein is recent enough: that is, if the sequence identity is high enough.  But there are problems: what is the threshold for determining not only homology, but functional identity? Even if two proteins are 95% identical in their amino-acid sequence, if the remaining 5% happen to include active site residues, these proteins may do completely different things. However, most new sequences are annotated just this way, with some variations.

Because of its volume, the veracity of the electronic annotation is rarely checked by experts.  Also, the electronic annotations come from far and wide, with different annotation software using different databases to infer gene and protein function. This sets the stage to a huge game of Broken Telephone, where  wrong annotations can propagate through many databases, accumulating errors. Imagine that we have an annotation program with a 90% accuracy rate. This means that given a query protein sequence and a “gold standard” 100% correct reference database, this programs infers the query sequence’s correct function 90 out of every 100 times.  For a typical bacterial genome of 5000 genes, this would mean that 500 genes are wrongly annotated. Let’s cal our bacterium Bug1. Now we place those 500 wrong annotations (along with the 4500 correct ones) in the “definitive database” for this bacterium, called Bug1DB.  Now this Bug1DB is used as  a “gold standard”, and  another genome is annotated, this time of Bug2. Let’s suppose, for argument’s sake, that the two genomes contain roughly the same homologous genes.  Since every gene in Bug1 has a 10% probability of being wrongly re-annotated when transferred to Bug2, this would mean a compounding error of  0.10 * 500 = 50 genes from the original  wrong 500 genes (we assume that “two wrongs do not make a right” and that an incorrect annotation of any incorrectly annotated gene would not revert to a correct annotation my mis-annotating it again).  But it would also mean that, on average, 500-50 =450 genes from A that were correctly annotated the first time would  be incorrectly annotated the second time. This means that Bug B now has 500+450= 950 mis-annotated genes. And this is through two filters of a Broken Telephone game using a highly accurate annotation program.

The trouble is, that a 90% accuracy rate is unrealistically optimistic. Also, having all 5000 genes in a genome annotated with some function (as opposed to simply “unknown”) is rather fanciful. So the mis-annotation problem is worse, even if transfer and re-annotation does not take place exactly as described. But just how much worse?

The question is answered in  a rather disturbing study published in PLoS Computational Biology by Alexandra Schnoes and her colleagues in Patricia Babbit’s group at th University of California, San Francisco. They used 37 experimentally characterized enzyme families to test different databases.  They found a high level of misannotation, but also a highly variable one. For example, the manually curated SwissProt database had a very low level of errors. On the other hand, TrEMBL, which uses simple sequence similarity for annotations, had a high level. So did NR, the combined GenBank coding sequence translations+RefSeq Proteins+PDB+SwissProt+PIR+PRF; pretty much the default reference database against which biologists BLAST their sequences.  They found that 40% of the genes they examined were mis-annotated in NR. They also went back in time, examining the misannotaion fraction of their gold standard 37 families, and found that the fraction of misannotated genes has increased,  from 15% in 1995 to 40% in 2005.

growing-over-time
The change in misannotation over time in the NR database for the 37 families investigated. Sequences are plotted by the year when they were originally deposited in the database (x-axis). The number of sequences (left y-axis, bar graph) found to be correctly annotated is shown in green. The number of sequences found to be misannotated is shown in red. The bars for each year represent only the sequences deposited into the database in that year. The fraction (right y-axis, line plot) of sequences deposited each year into the NR database that were misannotated is given by the open nodes, connected by the black line to aid in visualizing the overall trend. This fraction represents the number of sequences in the 37 test families predicted to be misannotated divided by the total number of sequences deposited each year from the test set, i.e. the sum of the sequences depicted in the red and green bars for each year. (From: Schnoes AM, Brown SD, Dodevski I, Babbitt PC, 2009 Annotation Error in Public Databases: Misannotation of Molecular Function in Enzyme Superfamilies. PLoS Comput Biol 5(12): e1000605. doi:10.1371/journal.pcbi.1000605)

There are also many ways to be wrong, as Schnoes and her colleagues have discovered. Overprediction is one, where proteins are annotated with functions that are more specific than the available evidence supports. 85% of misannotations were found to be overpredictions. Of the remianing 15%, about half were found to be missing important amino acid residues, which means that they could not carry out the functions by which they were annotated. The other half were simply not within the similarity threshold necessary to include them in one of the superfamilies they have examined.

By now you are wondering, who is validating the validators? That is, if Schnoes and her colleagues determine a single cutoff for inclusion in a protein family, they might also include falsely annotated proteins as correctly annotated (false positive), or exclude correctly annotated proteins as mis-annotated (false negative). To avoid that, they set three different similarity thresholds to their 37 superfamlies, and examined which proteins the similarity searches attract. In the lowest of these threshold, they purposefully included the ability to include up to 5% false positives. This they called the “lenient threshold”, and they did check their results using these different thresholds (three of them). They found there was a slight increase, but no overall substantial change, in the discovered level of misannotation in the databases, even when lowering the bar to the lenient threshold.

So how bad is the level of misanntoation in the databases? It depends on the protein superfamily they checked against, and on the database. Here is an excerpt from another figure, showing the misannotation of protein families in the HAD haloacid dehalogenase (HAD) and amidohydrolaseand (AH)  superfamilies of enzymes. Each rectangle represents a different database. The bar is the mean error in that database for that particular superfamily, and each colored circle is a protein family, placed and the level of  average misannotation for that family. The circle size indicates the family size.

Percent misannotation in the families and superfamilies tested

Percent misannotation in the families and superfamilies tested

Note that SwissProt fares very well, although lacks some families (those with an “X” through the blank circle).  For the HAD superfamily, we see an error of 60% in the three other heavily used databases, and for AH we see a 40% error. That is brutally high, and quite worrying. Other families fared little better when checked against those databases. Some went up to 80% and 100%(!)

So what can be done? Schnoes and her colleagues suggest several remedies. First, include “evidence codes” with the annotations. Those will let us know how each annotation is inferred, and thus how trustworthy it is. Additionally, avoid overprediction, which accounts for 85% of wrong annotations. Many protein functions are described too specifically, without enough evidence to support the annotation claim. Taking a step back and giving a more general description of the function would go a long way towards cleaning up the databases. The manually curated databases such as SwissProt did fare very well in their examination, but manual curation is not possible anymore with the post-genomic and metagenomic data deluge. Large databases  have to clean up the mess pretty much the same way it was created: by automated means.  Let’s hope it will happen soon enough. A 40% error rate in the database you are looking at can really put a damp on your analysis.


Schnoes, A., Brown, S., Dodevski, I., & Babbitt, P. (2009). Annotation Error in Public Databases: Misannotation of Molecular Function in Enzyme Superfamilies PLoS Computational Biology, 5 (12) DOI: 10.1371/journal.pcbi.1000605

Going to GOA: pt. 1

December 7th, 2009 No comments

GOA, the Gene Ontology Annotation, provide Gene Ontology annotation to proteins in UniProt. It also provides GO annotations to several genome projects: Chicken, Arabidopsis, Fly, Human, Mouse, Rat and Cow. Anyone working on any of those genomes, or on UniProt and is interested in annotation, would most likely need to query GOA once in a while. Here I will show some to read and interrogate GOA files.

Each GOA-annotated genome has two files associated with it. One is gene_association.* and the other is *.xrefs. So for example, the Mouse genome has the two files gene_association.goa_mouse and mouse.xrefs

The gene_association file is a tab-delimited file,with each row containing a gene and its GO associations. Each gene may have more than a single GO annotation, so multiple rows for the same gene exist. The fields are tab-delimited.

def read_goa_associations(inpath):
      goa_records = {}
      for inline in file(inpath):
             # Read a single GOA record
            db, db_object_id, db_object_symbol, qualifier, go_id, \
            db_reference, evidence, withit, aspect, \
            db_object_name, synonym, db_object_type, \
            taxon_id, date, assigned_by = inline.strip().split('\t')
            # The dictionary key is a concatenation of the gene-id and the database
            key = db+":"+db_object_id

            goa_records.setdefault(key,[]).append({'db':db,
                              'db_object_id':db_object_id,
                              'db_object_symbol': db_object_symbol,
                              'qualifier': qualifier,
                              'go_id': go_id,
                              'db_reference': db_reference,
                              'evidence': evidence,
                              'with': withit,
                              'aspect': aspect,
                              'db_object_name': db_object_name,
                              'synonym': synonym,
                              'db_object_type': db_object_type,
                              'taxon_id': taxon_id,
                              'date': date,
                              'assigned_by': assigned_by})
      return goa_records

Let’s break this one down. The records will be stored in the dictionary goa_records, which is initialized in line 2.  We loop through the lines in in the files in line 3. Lines 5-8 are broken down physically, but they form a single command line to parse the 15 tab-delimited fields.

Now,  as I mentioned, there may be more than one GO association with a single gene. So the value of each dictionary entry is actually a list which may contain one or more GO associations. Line 12 shows us how to do that. The format:

dict.setdefault(dakey,[]).append(davalue)

tells us that for dictionary dict, if it does not have the key dakey then that key is added, and the value associated with it is an empty list []. Otherwise, if the dictionary already has the key dakey, then the value davalue is appended to the list. This one-liner allows us to add multiple values associated with one key.

In the command stretching from line 12 to line 26, we add a value which is actually a dictionary by itself. Each key in the dictionary is a field of the gene_associations records. Each value is the value read in that field in lines 5-8.

So what have we got now in goa_records?
Let’s try running this on an Arabidopsis gene_associations file. Download the file:

gene_association.goa_arabidopsis.51.gz
unzip it:

gunzip gene_association.goa_arabidopsis.51.gz

this is is the Arabidopsis GOA from 6-OCT-2009. It has 109339 lines which are 22778 annotated genes. How do I know that? easy:

Number of lines uniqueified by field #2, which is the gene id.

gawk 'BEGIN {FS="\t"} {print $1":"$2}' gene_association.goa_arabidopsis.51 | sort | uniq | wc -l

Also, download the code file, and call it gene_assoc.py
Now run the code, Open a Python shell:

$ python
>>> import gene_assoc as ga
>>> goa_arabid = ga.read_goa_associations("gene_association.goa_arabidopsis.51")
>>>

go_arabid now contains the records. We can do a few simple statistics. For example, how many genes are annotated as hydrolases? For that, we need to know that the hydrolase_activity GO accession number is “GO:0016787″. We can now ask our question in the Python shell:

>>> n=0
>>> for i in goa_arabid:
...     for j in goa_arabid[i]:
...             if j['go_id'] == "GO:0016787": n+=1
...
>>> print n
975

OK, this may seem a bit silly. We could have just as easily used a gawk one-liner to search for all the lines containing “GO:0016787″. Why go into all the trouble of a Python data structure? The answer is, we are just getting started.

When genes are annotated, they are annotated by different strengths of evidence. Many genes are annotated simply by sequence similarity to other annotated genes. This is fair evidence, but not as good as, say, experimental evidence. When curators annotate genes, they also add an “evidence code” that tells s how good is the evidence that the assigned GO term is actually true. Let’s find how many hydrolases we have whose annotations were inferred by experimental evidence. See the Guide to GO evidence codes for a full description.

Experimental Evidence Codes
EXP: Inferred from Experiment
IDA: Inferred from Direct Assay
IPI: Inferred from Physical Interaction
IMP: Inferred from Mutant Phenotype
IGI: Inferred from Genetic Interaction
IEP: Inferred from Expression Pattern

Computational Analysis Evidence Codes
ISS: Inferred from Sequence or Structural Similarity
ISO: Inferred from Sequence Orthology
ISA: Inferred from Sequence Alignment
ISM: Inferred from Sequence Model
IGC: Inferred from Genomic Context
RCA: inferred from Reviewed Computational Analysis

Author Statement Evidence Codes
TAS: Traceable Author Statement
NAS: Non-traceable Author Statement
Curator Statement Evidence Codes
IC: Inferred by Curator
ND: No biological Data available

Automatically-assigned Evidence Codes
IEA: Inferred from Electronic Annotation
Obsolete Evidence Codes
NR: Not Recorded

>>> n=0
>>> for i in goa_arabid:
...     for j in goa_arabid[i]:
...             if j['go_id'] == "GO:0016787" and j['evidence'] in ('EXP','IDA','IPI','IMP','IGI','IEP'): n += 1
...
>>> print n
1

Uh, oh. Seems like only one gene was inferred to be a hydrolase by experimental evidence?

Note another problem: we only checked for genes annotated with “hydrolase_activity” GO term. But a hydrolase is a very generic term. This does not mean that other genes are not hydrolases. Hydrolases are a very large enzymatic class that includes many different enzymes: any enzyme that uses water to break a covalent bond, basically. For a complete list of hydrolase subclasses in GO see: here. Note that there are more GO terms under the fold (just click on the nodes annotated with ‘+’). So we are definitely not looking at all the enzymes that have hydrolase activity, only those that are annotated with the words “hydrolase_activity”.  For example, all phosphatases are hydrolases, but if a protein is annotated with the GO term “phosphatase_activity”, it won’t show up on our search. How do we handle that problem?

Wait for the next installment.

Photosynthesis, phages and structures: there’s treasure everywhere!

November 24th, 2009 No comments

ResearchBlogging.org

Here’s a really cool work, published this September in Nature.. Why did I choose this work?  Well, it’s a major discovery, and it’s all done using bioinformatics, and fairly simple bioinformatics at that. The power of metagenomics and bioinfromatics: in a mass of data you just have to know what you are looking for, and how to look for it.

Obviously not CC licensed, but I couldn't resist using this very appropriate strip

Obviously not CC licensed, but I couldn't resist using this very appropriate strip

Viruses as a bacterial genetic mechanism

Viruses follow some interesting and sometimes convoluted evolutionary paths.  One is “infect quick, reproduce fast, and make sure you can get to the next host before you kill this one”.  That is pretty extreme: smallpox was doing that, when there was smallpox. Ebola is doing that, but not very well: killing the host too quickly means that the disease is contained, especially in rural areas. Another strategy is: “slow and easy wins the race”. The herpes virus does that. Not lethal, but laying dormant in the central nervous system, it is  infectious, but rarely causes anything more than they occasional cold sore (which admittedly, is painful and disturbing). Still, it manages to infect up to 90% of the human population, most of which are completely unaware they harbor it, and would never develop any symptoms.

Most of the viruses on earth don’t infect humans, nor animals, nor plants. They infect microbes, where the same spectrum of evolutionary strategies applies. Some attack quickly, killing the microbial population they infect. Other can remain dormant for a long time. It is becoming clear to us that bacterial viruses or bacteriophages, are responsible for a large portion, if not the majority, of genetic variance in bacteria. In fact, viruses are a major component in bacterial genetics. The mechanism is called transduction, and it is illustrated below. Bacteriophages pick up DNA from bacteria they infect, and transfer it to other bacteria, creating genetic variance in the bacterial population.

Generalized transduciotn. Source: Indian River State College

Generalized transduction. Source: Indian River State College

Viral transduction also adapts

But viral transduction does not just carry random genes. Natural selection favors transduced genes that increase the bacterial host’s fitness. Because when a bacteria is infected by a virus, its protein making machinery is used to make viral genes. But when the viral genes include genes that are beneficial to the host as well, then everybody wins: the phage-infected bacterial species gets genes which enable it to compete better for resources with other bacterial species, while the phage gets a larger number of hosts to infect. Of course, this has to go hand in hand with a relatively benign virus that remains dormant long enough to let the bacterial host species enjoy the benefits of the transduced genes.

Such is the case of cyanophages and cyanobacteria. Cyanobacteria are photosynthetic bacteria, and cyanophages are the viruses that infect them. Several studies have shown that cyanophages have acquired whole photosynthetic genes from bacteria. Viruses do not photosynthesize, but when they infect cyanobaceria, the viral photosynthetic system is added to the bacterial one, boosting bacterial photosynthetic activity and ultimately increasing bacterial energy production.

The photosynthetic mechanism is  divided into two components: photosystem I and photosystem II (PSI and PSII). For a few years now, PSII has been known to be transduced by cyanophages.

A  more recent study by Itai Sharon and colleagues published in Nature this September shows that PSI proteins are also tranduced by cyanophages. Also, it seems like the viral PSI has some interesting properties that may make it advantageous over the cyanobacterial PSI. Two proteins in the bacterial PSI are called PsaJ and PsaF.  They found that the homologous protein in cyanophages is a fusion of the two, PsaJF. When they modeled an insert of PsaJF into the bacterial photosystem I it seemed that the bacterial PSI with the viral insert can now function more efficiently than the the original bacterial PSI. As a rule, PSI is a system that accepts electrons from PSII via a protein called plastocyanin. The donated electrons are excited by light, and the energized electrons are used to synthesize ATP and NADPH, the energy coinage of the cell, which are used to synthesize sugar from CO2. However, when the bacterial PsaJ and PsaF are replaced by the viral compound PsaJF, it seems like plastocyanin does not have to be the only electron donor to the newly minted virally-donated PSI. This means that the PSI may now accept electrons not only from plastocyanin, but from other electron-carrying proteins as well. Such proteins that are involved in the respiratory system, for example, which also donate electrons. The advantage of such a setup is that electrons whose reducing power would otherwise go to waste, got through PSII for formation of extra NADPH and ATP. Sharon and colleagues do not prove all this experimentally, but they make a pretty strong case, citing some analogous cases.

Electron transport from PSII to PSI via plastocyanin

Electron transport from PSII to PSI via plastocyanin. Source: wikimedia commons.

a, The structure of T. elongatus PSI (subunits) was illustrated by PyMOL (http://pymol.sourceforge.net/) using a PSI monomer (adopted from Protein Data Bank (PDB) accession 1jb0). PsaF is in magenta, PsaJ is in blue, and all of the other subunits are in green. b, A model for the structure of the viral PsaJF fusion protein (red) substituting the original PsaF and PsaJ subunits. Reproduced under NPG Liceensing terms for non-commercial / educational purposes

a, The structure of T. elongatus PSI (subunits) was illustrated by PyMOL (http://pymol.sourceforge.net/) using a PSI monomer (adopted from Protein Data Bank (PDB) accession 1jb0). PsaF is in magenta, PsaJ is in blue, and all of the other subunits are in green. b, A model for the structure of the viral PsaJF fusion protein (red) substituting the original PsaF and PsaJ subunits. Reproduced under NPG Licensing terms for non-commercial / educational purposes. doi:10.1038/nature08284

Like I said,  this work is purely bioinformatics. They basically mined the Global Ocean Survey metagenomic data, over six million sequences from marine microbes collected by the J. Craig Venter Institute which I mentioned in another post. They then identified sequences that contain PSI genes, and sifted through those to find sequences that also contain genes that are exclusively viral. Having both a PSI gene and a viral gene on the same DNA clone ensures they were taken from a virus. I’m not sure how they did the structural modeling and insertion of the PsaJF. This seems to be missing both from the Nature article, and the supplementary material. Yes, it’s one of those Nature works with 3 pages of article, and 28 of supplementary. Great read though, there’s treasure everywhere.


Sharon, I., Alperovitch, A., Rohwer, F., Haynes, M., Glaser, F., Atamna-Ismaeel, N., Pinter, R., Partensky, F., Koonin, E., Wolf, Y., Nelson, N., & Béjà, O. (2009). Photosystem I gene cassettes are present in marine virus genomes Nature, 461 (7261), 258-262 DOI: 10.1038/nature08284

Lindell, D., Jaffe, J., Johnson, Z., Church, G., & Chisholm, S. (2005). Photosynthesis genes in marine viruses yield proteins during host infection Nature, 438 (7064), 86-89 DOI: 10.1038/nature04111

The Warren L. DeLano Memorial Award for Computational Biosciences

November 15th, 2009 No comments

Warren DeLano passed away suddenly and at a young age at his home Nov 3, 2009. He was the author of PyMol, a very popular molecular visualization program, and a strong advocate of open source software. The family of Warren Lyford DeLano has created a “In Memorium” page and blog. Also, a memorial award is being set up in his name, as per this email circulated on various mailing lists.

Dear friends and colleagues:

It’s now been over a week since Warren has passed away.  We are trying to
move toward a permanent way to honor Warren’s memory and what
he stood for: Open Source Computational Biosciences and molecular
visualization. To do this, Jim Wells and I put together a mission statement
with the approval of Warren’s family:
The Warren L. DeLano Memorial Award for Computational Biosciences

This award shall be given to a top computational bioscientist in
recognition of the contributions made by Warren L. DeLano to creating powerful
visualization tools for three dimensional structures and making them freely accessible.
The award, accompanying lecture, and honorium will be given annually in the context of a
national bioscience meeting or a Bay Area gathering of
computational bioscientists at Stanford, UCSF or UC Berkeley. For the award special emphasis
will be given for Open Source developments and service to the bioscience community.
The award selection committee, consisting of experts in the computational and
biological sciences, will accept nominations from anyone.
To make something like this happen in perpetuity would take about ~100K for
the endowment.

For donations, Warren’s family has set up a tax deductible fund:

Silicon Valley Community Foundation
memo:  Warren L. DeLano Memorial Fund
2440 West El Camino Real, Suite 300
Mountain View, CA 94040
tel: 650.450.5400

We hope that you’ll consider making a contribution (not matter
how small) in Warren’s honor.  Also, please forward this message
to anybody who might be able be willing to contribute.

Best regards,
Axel

Axel T. Brunger
Investigator,  Howard Hughes Medical Institute
Professor of Molecular and Cellular Physiology
Stanford University

The Genomic Ark: 10,000 vertebrate genomes

November 5th, 2009 No comments

ResearchBlogging.org

The first bioinformatics meeting I went to was in 1996 at the  Nachsholim resort,  north of Tel Aviv. I received a fellowship for the duration, and shared a room with the brilliant Golan Yona, then a grad student at the Hebrew University. I was doing biochemistry at the time and knew next to nothing about bioinformatics, except that it seemed like an interesting thing to get into if you liked biology and programming. The meeting was great: Samuel Karlin, Pavel Pevzner, Dannie Durand, Temple Smith and Eugene Myers were there. Lots of down time on the beach and in the pub by the beach.  I learned an incredible amount in four days and by the time the meeting ended, I was hooked. I wrapped up my grad school work in biochemistry as a Master’s degree, and joined Hanah Margalit’s lab for a PhD in bioinformatics.

Dan Graur gave a talk at that meeting on The One True Phylogenetic Tree of Mammals. Dan’s talks are fast and funny. His tactic of building audience interest is by making them think they are missing something great if they even dare blink when he is talking;  it works. Dan was complaining that all genomic efforts were invested in inconsequential organisms such as humans, mice and Drosophila, and no one was interested in the Aardvark or Sloth genomes. He bemoaned the situation, as he needed the Aardvark, and a few thousand other mammalian species to get the “One True Tree”. Later that day, over dinner, Pavel Pevzner suggested sequencing the X chromosome from all mammals using the then-new DNA chip technology. The X chromosome being a “microgenome”, with no transposable elements from other chromosomes, making it a perfect candidate for being a proxy for a genome.

In 1996, capillary sequencing was well established, but still quite expensive,usable only by large institutions and companies.  DNA chips, however, were thought to become the next cheap sequencing technology, and there were many expectations that they would enable mass genomics. Chips turned out to be useful in many other applications, but not in mass sequencing. We had to wait almost 10 years for pyrosequencing  and other cheap mass sequencing technologies to hit the scene.

The cost of sequencing is still dropping exponentially, so fulfilling Dan’s wishes is very much in the making now. We are getting closer to getting the genomes, not only of all mammals, but of all vertebrates. The Genome 10K initiative was officially launched in April 2009. Today, the paper describing the project has been published in the Journal of Heredity. The goal is to collect and systematically sequence 10,000 vertebrate (not just mammalian) genomes. 10,000 is a nice round number, but looking at the paper, their actual aim is 16,203. Wow! That includes some recently extinct species for which genomic material may still be obtained like the Tasmanian Wolf.

Entry of the Animals Into Noah's Ark / Jan Breughel the Elder

Entry of the Animals Into Noah's Ark / Jan Breughel the Elder

Note that they do not plan to begin sequencing immediately. The cost of sequencing is still too high, and they are still waiting for costs to decrease to $2500 per genome, which is one-hundred times cheaper than it is today. But at the rate cost is dropping, they estimate that mass sequencing can be started in a few years. In the meantime, they are soliciting samples from the community.

A lot of effort for the True Tree… but it’s not only for that. It is the next logical step to take after completing the genome of a few select organisms. The library of life. To achieve an understanding of animal evolution on a level that in 1996 we could only  joke about. More information can be found on their site. Here is the closing paragraph from the article:

As the printing of the first book by Johannes Gutenberg altered the course of human history, so did the human genome project forever change the course of the life sciences with the publication of the first full vertebrate genome sequence. When Gutenberg’s success was followed by the publication of other books, libraries naturally emerged to hold the fruits of this new technology for the benefit of all who sought to imbibe the vast knowledge made available by the new print medium. We must now follow the human genome project with a library of vertebrate genome sequences, a genomic ark for thriving and threatened species alike, and a permanent digital record of countless molecular triumphs and stumbles across some 600 million years of evolutionary episodes that forged the “endless forms most beautiful” that make up our living world.

. (2009). Genome 10K: A Proposal to Obtain Whole-Genome Sequence for 10 000 Vertebrate Species Journal of Heredity DOI: 10.1093/jhered/esp086

Check Hayden, E. (2009). 10,000 genomes to come Nature, 462 (7269), 21-21 DOI: 10.1038/462021a

Short Bioinformatics Hacks: Glimmer Splitter

October 29th, 2009 5 comments

Glimmer is a program that predicts ORFs in bacterial and archeal genomes. The input is the assembled genome FASTA file, the output are several files of the predictions in different stages. The terminal output file is the .predict file. which looks something like this:

>NODE_1_length_38001_cov_935.551880
orf00001 481      362  -2     1.45
orf00002      451      567  +1     0.59
orf00004     3691      623  -2     5.43
orf00006     4254     5228  +3     4.65
orf00007     5204     5326  +2     7.04
orf00009     5587     6921  +1     5.20
orf00011     7062     8135  +3     5.48
orf00013     8327     9238  +2     4.26
orf00014     9241    10116  +1     3.26
orf00015    10119    10280  +3     2.81
orf00016    10296    10673  +3     6.61
orf00017    11288    10683  -3     6.35
orf00018    11910    11305  -1     7.18
orf00019    12313    11894  -2     5.22

The first column is the predicted ORF ID, the second is the start position, the third is end position, the fourth is the reading frame used and the fifth is a reliability score. For full details and how to install glimmer see Glimmer’s documentation. Glimmer3 also comes packaged with Ubuntu.

Here is a short Python script whose input is the genomic FASTA file which contains a single, assembled sequence and the glimmer file. The output is a FASTA file containing a separate entry for each predicted gene. Of course, it uses Biopython.

#!/usr/bin/env python
import sys
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO

glimmer_file = sys.argv[1]
fasta_file = sys.argv[2]

# Read the sequence file
seq_record = SeqIO.parse(open(fasta_file),"fasta").next()

outseqrecords = []
# Read the glimmer file, record by record
for inline in file(glimmer_file):
    if '>' in inline:
        seqname = inline.split()[0][1:]
        outfilename = "%s_g3.tfa" % (seqname)
        continue
    if "orf" not in inline:
        continue
    orfname, sbegin, send, rf, score = inline.strip().split()
    sbegin = int(sbegin)
    send = int(send)
    rf = int(rf)
    # reverse complement
    if rf < 0:
        sbegin, send = send, sbegin
    sbegin -= 1     # Python indexes start a 0
    score = float(score)
    # split the sequence record
    newseq = seq_record.seq[sbegin:send]
    if rf < 0:
        newseq = newseq.reverse_complement()
    # Add a sequence record to the output
    seqrecord_description = "begin=%d end=%d rf=%d score=%.2f" % (sbegin+1, send, rf, score)
    outseqrecords.append(SeqRecord(newseq,id=seqname+"_"+orfname, description=seqrecord_description))

SeqIO.write(outseqrecords,open(outfilename,"w"),"fasta")

To run simply type:

./split_by_g3.py predict_file_name fasta_file_name

The output will be named with a _g3.tfa suffix. You can change that in line 17 (variable “outfilename”).