Short bioinformatic hacks: reading between the genes

In celebration of the biohackathon happening now in Tokyo, I am putting up a script that is oddly missing from many bioinformatic packages: extracting intergenic regions. This one was written together with my student, Ian. As for the biohackathon itself, I’m not there, but I am following the tweets and  Brad Chapman’s excellent posts:

About intergenic regions: intergenic regions are as interesting and sometimes even more interesting than the genes themselves: when you are interested in promoters, transcription factor binding sites or almost any other transcription regulation mechanism. Here’s a simple script to find intergenic regions. It reads a genbank formatted file and uses the information there to extract the intergenic regions. The sequences are written to a FASTA file.

#!/usr/bin/env python
import sys
import Bio
from Bio import SeqIO, SeqFeature
from Bio.SeqRecord import SeqRecord
import os

# Copyright(C) 2009 Iddo Friedberg & Ian MC Fleming
# Released under Biopython license. http://www.biopython.org/DIST/LICENSE
# Do not remove this comment
def get_interregions(genbank_path,intergene_length=1):
    seq_record = SeqIO.parse(open(genbank_path), "genbank").next()
    cds_list_plus = []
    cds_list_minus = []
    intergenic_records = []
    # Loop over the genome file, get the CDS features on each of the strands
    for feature in seq_record.features:
        if feature.type == 'CDS':
            mystart = feature.location._start.position
            myend = feature.location._end.position
            if feature.strand == -1:
                cds_list_minus.append((mystart,myend,-1))
            elif feature.strand == 1:
                cds_list_plus.append((mystart,myend,1))
            else:
                sys.stderr.write("No strand indicated %d-%d. Assuming +\n" %
                                  (mystart, myend))
                cds_list_plus.append((mystart,myend,1))

    for i,pospair in enumerate(cds_list_plus[1:]):
        # Compare current start position to previous end position
        last_end = cds_list_plus[i][1]
        this_start = pospair[0]
        strand = pospair[2]
        if this_start - last_end >= intergene_length:
            intergene_seq = seq_record.seq[last_end:this_start]
            strand_string = "+"
            intergenic_records.append(
                  SeqRecord(intergene_seq,id="%s-ign-%d" % (seq_record.name,i),
                  description="%s %d-%d %s" % (seq_record.name, last_end+1,
                                                        this_start,strand_string)))
    for i,pospair in enumerate(cds_list_minus[1:]):
        last_end = cds_list_minus[i][1]
        this_start = pospair[0]
        strand = pospair[2]
        if this_start - last_end >= intergene_length:
            intergene_seq = seq_record.seq[last_end:this_start]
            strand_string = "-"
            intergenic_records.append(
                  SeqRecord(intergene_seq,id="%s-ign-%d" % (seq_record.name,i),
                  description="%s %d-%d %s" % (seq_record.name, last_end+1,
                                                        this_start,strand_string)))
    outpath = os.path.splitext(os.path.basename(genbank_path))[0] + "_ign.fasta"
    SeqIO.write(intergenic_records, open(outpath,"w"), "fasta")

if __name__ == '__main__':
    if len(sys.argv) == 2:
         get_interregions(sys.argv[1])
    elif len(sys.argv) == 3:
         get_interregions(sys.argv[1],int(sys.argv[2]))
    else:
         print "Usage: get_intergenic.py gb_file [intergenic_length]"
         sys.exit(0)

What are we seeing here?

Lines 11-16 are the preamble: we read the GenBank file using Biopython’s genbank parser in line 12.  Beacuse we expect a genome file, which contains a single record, this is a one-time read. Note that this is a rate limiting step, and can take a couple of seconds. Took me ~2secs to read the full E. coli genome on my Linux box.  We prepare one list for the + strand intergenic regions (13), another one for the minus strand intergenic regions (14) and one for all the records (line 15).

The rest of the code are three loop blocks: lines 16-28 I loop over the genbank features, extracting the coordinated of the genes themselves. Line 32-41 I find the intergenic regions on the + strand. Lines 42-52 I do the same for the “-” strand.

Now for a philosophical interlude: although there is a way to read all the intergenic regions in a single pass, I subscribe to the “code simple” doctrine of research software writing. Code performance optimization is a low priority for me. I’d much rather have something that is simple to write,read and modify. I also don’t want to spend too much time coding and elegant script for elegance’s sake, especially if I may not use it too much. Historically, scientific code written for research is mostly extinct: thrown away after a short lived hypothesis was tested and ended its days. Research coding is mostly throwaway glue code. Very rarely it matures into a product. Then, and only then, can you apply all those fine software engineering you learned in college. Before that, write fast and simple.

But I digress. Line 17 loops over the features in the genome file. Line 18 we identify if it is a coding sequence (CDS). If so, we identify the start position, and position and the strand the CDS is on. The list cds_list_minus is a list of 2-tuples. Each 2-tuple is the start and end positions of a CDS on the minus strand. (If you would like to go over the genes, as opposed to coding sequences, change line 18 to:

if feature.type == 'gene':

(or better yet, pass an argument that defines it.)

cds_list_plus, is, yes, the same as cds_list_minus, only for the plus strand (line 24).

Sometime, a CDS does not contain information on which strand it is. With genome files, that is usually the case with single stranded viral genomes. Therefore, we put in the default assumption that if there is no strand indication, then the feature is is on the plus strand. We generate a warning message nevertheless (lines 25-28).

Lines 30-41 we loop over the plus  strand list, and identify the coordinates between the genes. Python’s enumerate function is very useful here. The enumerate function allows us to iterate over a list, but at the same time keep track of which index we are in when looping over the list. So in line 30, pospair receives the start and end coordinates of a CDS as a 2-tuple, while i receives the actual number if the index in the plus strand CDS list. In that way, we can look back to the previous list member, find the coordinates where that CDS ends, and where the current CDS begins. The two coordinates make up the beginning and end of the intergenic regions between those two genes on that strand. In line 35 we check if the intergenic region length is equal to or larger than a threshold: suppose we are only interested in those intergenic regions that are longer than 100 bases? (The default value is 1, see line 11.) In lines 38-39 we build a biopython sequence object that contains an informative header, and the sequence of the intergenic region. The description which goes in the sequence header contains the start and end coordinates of the intergenic region, and the locus ID of the CDS directly downstream from it. The sequence object is appended to a list, which will eventually get written (lines 40-41).

Lines 42-52 are a repeat of lines 30-41, only for the minus strand.  Lines 53 & 54: the list that contains all the intergenic region sequence objects gets written to its own fasta file.

Finally, line 56-63 are boilerplate code, that make this script runnable from the command line. Have fun looking at intergenic regions. Let me know of you find something interesting.

Share and Enjoy:
  • Fark
  • Digg
  • Technorati
  • del.icio.us
  • StumbleUpon
  • Facebook
  • Reddit
  • Twitter
  • FriendFeed
  • PDF
  • email
  • Print
  • Google Bookmarks

Leonardo da Vinci’s Resume

Marketing yourself is a process you go through many times. The job hunt comes to mind — but not only. Academia is rife with self-marketing: grant applications, promotion & tenure reports, attracting students to your courses and to your lab,  competing for conference lecture slots, giving a lecture.

But not only academia, and not only in the present day.

Leonardo di ser Piero da Vinci was, among many other things, a civil and military engineer.  Marc Cenedella has this report on Da Vinci’s resume, sent to the Duke of Milan, who apparently was in need of some military hardware.  The resume, such as it is, is truly a great piece of self-marketing. Note how LDV tailors his resume to the Duke’s needs. He does not list his artistic achievements (which were many by the time), but only those achievements and skills that fit his prospective employer’s interests. My observations are in boldface.

“Most Illustrious Lord, Having now sufficiently considered the specimens of all those who proclaim themselves skilled contrivers of instruments of war, and that the invention and operation of the said instruments are nothing different from those in common use: I shall endeavor, without prejudice to any one else, to explain myself to your Excellency, showing your Lordship my secret, and then offering them to your best pleasure and approbation to work with effect at opportune moments on all those things which, in part, shall be briefly noted below.

1. I have a sort of extremely light and strong bridges, adapted to be most easily carried, and with them you may pursue, and at any time flee from the enemy; and others, secure and indestructible by fire and battle, easy and convenient to lift and place. Also methods of burning and destroying those of the enemy. [Before the legendary British WWII Bailey Bridge]

2. I know how, when a place is besieged, to take the water out of the trenches, and make endless variety of bridges, and covered ways and ladders, and other machines pertaining to such expeditions.

3. If, by reason of the height of the banks, or the strength of the place and its position, it is impossible, when besieging a place, to avail oneself of the plan of bombardment, I have methods for destroying every rock or other fortress, even if it were founded on a rock, etc.

4. Again, I have kinds of mortars; most convenient and easy to carry; and with these I can fling small stones almost resembling a storm; and with the smoke of these cause great terror to the enemy, to his great detriment and confusion. [Battlefield smoke].

5. And if the fight should be at sea I have kinds of many machines most efficient for offense and defense; and vessels which will resist the attack of the largest guns and powder and fumes.

6. I have means by secret and tortuous mines and ways, made without noise, to reach a designated spot, even if it were needed to pass under a trench or a river.

7. I will make covered chariots, safe and unattackable, which, entering among the enemy with their artillery, there is no body of men so great but they would break them. And behind these, infantry could follow quite unhurt and without any hindrance. [Yes, LDV invented the tank! Note also the combined infantry / armor tactics].

8. In case of need I will make big guns, mortars, and light ordnance of fine and useful forms, out of the common type. [Not only can I do it, I can do it cheaply, by re-purposing your existing ordinance.]

9. Where the operation of bombardment might fail, I would contrive catapults, mangonels, trabocchi, and other machines of marvelous efficacy and not in common use. And in short, according to the variety of cases, I can contrive various and endless means of offense and defense.

10. In times of peace I believe I can give perfect satisfaction and to the equal of any other in architecture and the composition of buildings public and private; and in guiding water from one place to another. [Dear Duke: you want to hire me even if you are not fighting, or you will want to keep me after your wars are over.]

11. I can carry out sculpture in marble, bronze, or clay, and also I can do in painting whatever may be done, as well as any other, be he who he may. [Another peacetime skill.]

Again, the bronze horse may be taken in hand, which is to be to the immortal glory and eternal honor of the prince your father of happy memory, and of the illustrious house of Sforza. [Catering to the employer’s vanity. Gotta know how to do it right…]

And if any of the above-named things seem to anyone to be impossible or not feasible, I am most ready to make the experiment in your park, or in whatever place may please your Excellency – to whom I comment myself with the utmost humility, etc.” [I realize make big claims, but I can back them up. Just get me an interview.]

edw513 at Ycombinator.com has adapted this resume to fit current market needs:

If it worked for Leonardo da Vinci, maybe it could work for me. The next time I’m looking for a job, I’ll try this:“Most Illustrious Proprietor, Having now sufficiently considered the specimens of all those who proclaim themselves skilled developers of applications of business, and that the invention and operation of the said programs are nothing different from those in common use: I shall endeavor, without prejudice to any one else, to explain myself to your Company, showing your Management my secret, and then offering them to your best pleasure and approbation to work with effect at opportune moments on all those things which, in part, shall be briefly noted below.

1. I have a sort of extremely light and strong functions and modules, adapted to be most easily ftp’d, and with them you may pursue, and at any time combine them with others, secure and indestructible by standard mean time to failure of hardware and denial of service, easy and convenient to compile and catalog. Also methods of unzipping and storing the data of the customers.

2. I know how, when a website is besieged, to shard data onto the cloud, and make endless variety of mirrors, and fault tolerant disks and RAIDs, and other machines pertaining to such concerns.

3. If, by reason of the volume of the data, or the structure of the btrees and its indexes, it is impossible, when conducting a search, to avail oneself of sub-second response time, I have methods for benchmarking every process or other function, even if it were interpreted, etc.

4. Again, I have kinds of functions; most convenient and easy to ftp; and with these I can spawn lots of data almost resembling a torrent; and with the download of these cause great terror to the competitor, to his great detriment and confusion.

5. And if the processing should be on the desktop I have apps of many machines most efficient for data entry and reporting; and utilities which will satisfy the needs of the most demanding customers and users and consumers.

6. I have means by secret and tortuous scripts and modules, made without leaving tracks, to generate source code, even if it were needed to run on a client or a server.

7. I will make secure firewalls, safe and unattackable, which, entering among the hackers with their utilities, there is no body of crackers so great but they would break them. And behind these, software could run quite unhurt and without any hindrance.

8. In case of need I will make big properties, methods, and collections and useful forms, out of the common type.

9. Where the operation of compiling might fail, I would contrive scripts, functions, routines, and other parameter driven processes of marvellous efficacy and not in common use. And in short, according to the variety of cases, I can contrive various and endless means of data entry, reporting, and storage.

10. In times of low revenue I believe I can give perfect satisfaction and to the equal of any other in maintenance and the refactoring of code public and private; and in guiding data from one warehouse to another.

11. I can carry out code in Javascript, PHP, or C, and also I can do in network administration whatever may be done, as well as any other, be he who he may.

Again, the intranet app may be taken in hand, which is to be to the immortal glory and eternal honor of all your customers of happy memory, and of the illustrious house of Google.

And if any of the above-named things seem to anyone to be impossible or not feasible, I am most ready to make the experiment in your data center, or in whatever place may please your Businessperson – to whom I comment myself with the utmost humility, etc.”

Share and Enjoy:
  • Fark
  • Digg
  • Technorati
  • del.icio.us
  • StumbleUpon
  • Facebook
  • Reddit
  • Twitter
  • FriendFeed
  • PDF
  • email
  • Print
  • Google Bookmarks

All theories proven with one graph

Boys and girls, it can be done. Published in the Journal of Irreproducible Results (Where else?)

Share and Enjoy:
  • Fark
  • Digg
  • Technorati
  • del.icio.us
  • StumbleUpon
  • Facebook
  • Reddit
  • Twitter
  • FriendFeed
  • PDF
  • email
  • Print
  • Google Bookmarks

JSUR? Yes, sir. (Updated 2-FEB-2010)

The most exciting phrase to hear in science, the one that heralds new
discoveries, is not ‘Eureka!’, but ‘That’s funny…’ -Isaac Asimov

Thanks to Ruchira Datta for pointing out this one.

Science is many things to many people, but any lab-rat will tell you that research is mainly long stretches of frustration, interspersed with flashes of satisfying success. The best laid schemes of mice and men gang aft agley. A scientist’s path contains leads to blind alleys more than anything else, and meticulous experimental preparation only serves to somehow mitigate the problem, if you’re lucky. This doesn’t work, that doesn’t work either and this technique worked perfectly in Dr. X’s lab, why can’t I get this to work for me?  My experiment was invalidated by my controls; my controls didn’t work the way the controls were supposed to work in the first place. I keep getting weird results from this assay. I can’t explain my latest results in any coherent way… these statements are typical of daily life in the lab.

This stumped and stymied day-to-day life is not the impression of science we get from reading a research paper, when listening to a lecture, or when watching a science documentary show. When science is actually presented, it seems that the path to discovery was carefully laid out, planned and  flawlessly executed, a far cry from the frustrating, bumbling mess that really led to the discovery. There are three chief reasons for the disparity between how research is presented, as opposed to what really goes on. First, no one wants to look like an idiot, least of all scientists whose part of their professional trappings is strutting their smarts. Second, there are only so many pages to write a paper, one hour to present a seminar or one hour for a documentary: there is no time to present all the stuff that did not work. Third, who cares about what didnt work? Science is linked to progress, not to regress. OK, you had a hard time finding this out, we sympathize and thank you for blazing the trail for the rest of us. Make a note for yourself not to go into those blind alleys that held you back for years and move on. We’re not interested in your tales of woe.

Only maybe these tales of woe should be interesting to other people. If you make your negative results public, that could help others avoid the same pitfalls you had. If you share the limits of a technique, a protocol or software then someone can avoid using it in a way that does not work. A lab’s publications are actually the tip of the sum total of its accumulated knowledge.Every lab has its own oral tradition of accumulated do’s and dont’s. Not oral in the literal sense: they may even be written down for internal use, but never published. UPDATE (2-FEB-2010): most peer-reviewed journals don’t like stuff that does not work. Thanks to Mickey Kosloff for pointing out the Journal of Negative Results in Biomedicine and The Journal of Negative Results – Ecology and Evolutionary Biology.

Until now.

The Journal of Serendipitous and Unexpected Results aims to help us examine the sunken eight-ninths of the scientific knowledge iceberg, in life science and in computer science. (So an additional field over JNRB and JNREEB). From JSUR’s homepage:

Help disseminate untapped knowledge in the Computational or Life Sciences

Can you demonstrate that:

* Technique X fails on problem Y.
* Hypothesis X can’t be proven using method Y.
* Protocol X performs poorly for task Y.
* Method X has unexpected fundamental limitations.
* While investigating X, you discovered Y.
* Model X can’t capture the behavior of phenomenon Y.
* Failure X is explained by Y.
* Assumption X doesn’t hold in domain Y.
* Event X shouldn’t happen, but it does.

The problem with the JSUR model, and the nature of discovery

I expect JSUR will be a great way to comment on  methods and techniques. Indeed it will codify a trend that has been going on for some time: public protocol knowledge sharing. Many sites like openwetwareseqanswers or the UC Davis bioinformatics wiki have been doing this for a while. Not to mention a plethora of blogs. Scientists are willing to share their experience with working protocols and procedures, and if this sharing of knowledge can be now monetized to that all-important coin of academia, the  peer-reviewed publication, all the better.

So where is the problem? The problem lies with discovery, and credit given towards it. It would be very hard to get anyone to share awkward, unexpected or yet-uninterpreted results. First, as I said, no one wants to look like an idiot. Second, unexpected or yet uninterpreted results are often viewed as a precursor to yet another avenue of exploration. A scientist would rather pursue that avenue, with the hope of  the actual meaningful discovery occurring in the lab. At most, there will be a consultation with a handful of trusted colleagues in a closed forum. If the results are made public, someone else might take the published unexpected and uninterpreted results, interpret them using complementary knowledge gained in their lab, and publish them as a bona-fide research paper. The scientist who catalyzed the research paper with his JSUR publication receives, at best, secondary credit. The story of Rosalind Franklin’s under-appreciated contribution to the discovery of the structure of DNA comes to mind. Watson and Crick used the X-ray diffraction patterns generated by Franklin to solve the three dimensional structure of the DNA molecule. Yet she was not given a co-authorship on the paper. (And she did not even make the results public, they were shared without her knowledge.) Unexpected results are viewed either as an opportunity or an embarrassment, and given the competitive nature of science, no on wants to advertise either: the first due to the fear of getting scooped, the second for fear of soiling a reputation. I expect JSUR would have a harder time filling in the odd-results niche, but I hope I am wrong.

But if you have protocols you are willing to share…what are you waiting for? Get those old lab notebooks, 00README files, forum posts  and start editing them to a paper. You are sitting on a goldmine of publishable data and you did not even realize it.

Finally, here are two scientists who never declined sharing their unexpected results.

This post has been slashdotted. Exercise extreme caution.


Share and Enjoy:
  • Fark
  • Digg
  • Technorati
  • del.icio.us
  • StumbleUpon
  • Facebook
  • Reddit
  • Twitter
  • FriendFeed
  • PDF
  • email
  • Print
  • Google Bookmarks

BsB in high school science… nice

A  small spike on my blog traffic yesterday led me to look for the source via Google Analytics. (If you are a blogger, you should really use this tool, lots of useful traffic information.) Seems like most of the traffic came from the page of a high school science teacher at Badin High School in Hamilton, OH. Apparently the students were to be quizzed today on two of my posts about endosymbiosis (and one from 80Beats; I’m in good company.) So they were very busy Sunday. It’s encouraging to know that some of my posts are accessible enough for high school science. Finally, quite a few Miami students come from Hamilton (we’re close). So I might see some of them next year.

Muahahaha!

Share and Enjoy:
  • Fark
  • Digg
  • Technorati
  • del.icio.us
  • StumbleUpon
  • Facebook
  • Reddit
  • Twitter
  • FriendFeed
  • PDF
  • email
  • Print
  • Google Bookmarks

The polypharmacome

ResearchBlogging.org
This post was chosen as an Editor's Selection for ResearchBlogging.org
Pharmaceutical companies are always on the lookout for secondary drug targets. After all, if you invest billions developing a single drug, you would be more than happy to sell it as a treatment for two, three, or more different ailments. Sildenafil citrate was developed to treat angina and hypertension, but during phase I clinical trials, it was found that Sildenafil induces penile erections. The drug was branded Viagra, and the rest is history. Eflornithine, an anti-cancer drug, is also effective against the agent of African sleeping sickness, Trypanosoma brucei. African Sleeping Sickness is known as a “neglected disease”, for which drug development is not profitable and therefore not a priority. However, having a drug already on hand makes it easier to distribute in affected areas, since the R&D costs are recovered elsewhere.

Another example of polypharmacology is a drug that binds to multiple targets in the human body. This could be used for overcoming drug resistance, a known problem with cancer. Cancer tumors often develop a resistance to anti-cancer drugs by simple natural selection: the protein that the drug binds to mutates, and no longer binds the drug. However, if the drug acts by binding redundantly to several proteins, it would be more effective, since several mutations would be required to effect drug resistance.

Another important polypharmacological consideration  is toxicity. If a drug binds to one protein, its drug target, it may also bind to another one which it should not bind to as it disrupts the normal functions and the health of the patient. If the side effects outweigh the cure, the drug is no good.

How one drug (cyan) can bind to two different proteins with different overall shapes (pink and green), but with similar binding sites

Because it can either increase  profits, or conversely derail a whole process of drug development, predicting polypharmacophoric effects is very much something drug developers want. A study published yesterday in PLoS Computational Biology by Jacob Durrant and his colleagues suggest a way  bioinformatics and theoretical biophysics can help in identifying multiple drug targets. Durrant’s goal was simple: given the molecular structure of a candidate drug, which proteins are expected to bind it? The strategy this group took is a combination of bioinformatic and experimental screening.

Finding additional drug targets in four  steps

Reproduced under CC license from doi:10.1371/journal.pcbi.1000648.g001. Click for original image.

Step 1 (A-C in the figure above): identify the known target protein. Now pick all the protein structures that are not similar to it. Why those that are not similar? Similar proteins could be potential drug targets, since they have a similar shape to the known target protein. But here they are interested in finding targets from proteins that are of a different shape, and have no homology to the known target protein: secondary targets.

Step 2 (D): take this set of dissimilar proteins, and look for binding site similarities. Binding sites are clefts in the protein that may bind drugs. If those clefts are similar in shape to the cleft in the known target proteins, they may bind the drug. Leave only those non-homologous proteins that have similar binding sites (D in the figure)

Step 3 (E): Now add all the proteins that are homologous to the set generated in step 2. This increases the number of possible targets to homologs of the proteins that were initially selected only for binding site similarity.

Step 4 (F): take the drug molecule, and try to dock it to the various protein structures. Rank the druggability of each protein according to the score provided by the drug docking software (Autodock).

Experimental verification

Now for the cool part. Durrant and colleagues tested  the computational prediction in the lab, using the compound NSC-45208. NSC-45208 inhibits a protein responsible for RNA processing in Trypanosoma brucei. So we know it is a potential drug against African sleeping sickness. What else is it good for?

(NSC-45208), 4,5-dihydroxy-3-(1-naphthyldiazenyl)-2,7 -naphthalenedisulfonic acid, a recently discovered inhibitor of T. brucei RNA editing ligase 1 (TbREL1)

“The predicted secondary targets that gave the best docking scores, H. sapiens mitochondrial 2-enoyl thioester reductase (HsETR1), T. brucei UDP-galactose 4′ epimerase (TbGalE), H. sapiens phosphodiesterase 9A (HsPDE9A2), and Streptococcus pneumoniae teichoic acid phosphorylcholine esterase (SpPce), were subsequently tested experimentally.”

Durrant and his colleagues tested their predictions that NSC-45208 also binds to two human proteins (HsETR1 and HsPDE9A2), one additional Trypansome protein (TbGalE), and a bacterial (Streptococcus pneumoniae) protein (SpPce). Not only binds, but also inhibits their enzymatic activity. Their predictions worked well on the top two predicted targets: NSC-45308 inhibited the enzymatic activity of HsETR1and TbGalE, but HsPDE9A2 and SpPce were not affected by the drug.

At first blush, this does not seem to be much of a batting average: two positives and two false positives. But we have to remember that the experimental verification of these predictions — even if you have a good enzymatic assay to check predictions– can be very time- and resource consuming. An exhaustive test of the predictions for several compounds and many target enzymes is still not possible. As initial proof-of-principle, this work goes much farther than most other joint experimental / computational works I have read. The authors also go into lengthy and detailed discussions on limitations and improvements, as well as on another form of non-specific inhibition of the secondary targets, makes for a really interesting read on polypharmacological considerations in drug screening.


Durrant, J., Amaro, R., Xie, L., Urbaniak, M., Ferguson, M., Haapalainen, A., Chen, Z., Di Guilmi, A., Wunder, F., Bourne, P., & McCammon, J. (2010). A Multidimensional Strategy to Detect Polypharmacological Targets in the Absence of Structural and Sequence Homology PLoS Computational Biology, 6 (1) DOI: 10.1371/journal.pcbi.1000648

Share and Enjoy:
  • Fark
  • Digg
  • Technorati
  • del.icio.us
  • StumbleUpon
  • Facebook
  • Reddit
  • Twitter
  • FriendFeed
  • PDF
  • email
  • Print
  • Google Bookmarks

Poll: favorite kingdom?

Seems like I had a couple of taxonomic posts recently.  So BsB would like to know what is your favorite Kingdom of life. I used the 6 kingdom system just to make things slightly more interesting. Vote in the pollbox on the righthand sidebar. You only have yourself to blame if your favorite kingdom is not represented adequately.

Share and Enjoy:
  • Fark
  • Digg
  • Technorati
  • del.icio.us
  • StumbleUpon
  • Facebook
  • Reddit
  • Twitter
  • FriendFeed
  • PDF
  • email
  • Print
  • Google Bookmarks

Social networking for taxonomists

ResearchBlogging.org

Despite our best attempts to remove species from the face of the Earth, there is still quite a bit of life out there and it is still quite diverse. Also, there are still quite a few people who want to document, describe and make the rest of us aware of the magnitude and diversity of life. In the time-honored tradition of natural history they take the tools of their trade to the field to sample, collect, identify, tag, photograph, film, mount and otherwise understand and advertise Nature’s progeny.  There are probably thousands of web sites dedicated to taxonomy of families, genuses genera and species. Trouble is, they are all “out there”, and require quite a bit of searching to find those that are of interest to you. And what if you just want to browse, the Internet equivalent of going through a natural history museum?

Museums were historically the place where natural historians brought their findings from far away places to be studied and displayed (if lucky) or simply archived. If the finding was still alive, it would go to the zoo or the botanical garden.  And people would go to the museum, zoo or garden to see the  latest wonders of life.

(From Life Through a Lens: the Natural History museum (of London) 1880-1950)

Virtually every biology lab with some taxonomic interest in  a given species, genus or family would have such a site. That’s a lot of taxonomic information , but is not very accessible or organized! What if all these sites could be rounded up together, connected in some contextual fashion, some semblance of standardization? Then we could have an online Natural History museum, where the displays accumulate, are constantly updated, and we can walk through virtual halls looking for the plant or animal of our fancy.

Scratchpads is a cool way of getting to that online Natural History museum. Actually, it is hosted in the great-grandaddy of brick-and-mortar natural history museums, the one in London.

Scratchpads are an easy to use, social networking application that enable communities of researchers to manage, share and publish taxonomic data online. Sites are hosted at the Natural History Museum London, and offered free to any scientist that completes an online registration form. Key features of the Scratchpads include tools to manage:

Taxonomy Classifications Phylogeny Phylogenies
Literature Bibliographies Documents Documents
Images Image galleries Custom Data Custom data
Specimens Specimen records Simple Maps Maps

Users control who has access to content, which is published on the site under Creative Commons (by-nc-sa) license.

Data added to a Scratchpad are automatically classified and grouped around a taxonomy that is supplied by the users or imported from EOL. This is optionally supplemented with information from high quality web accessible databases, to automatic construct content rich web pages about any documented taxon. Currently these sources include Genbank, GBIF, Biodiversity Heritage Library, Yahoo! Images, flickr, Google Scholar and Wikipedia.

When you start a Scratchpad site, you are given a template on which you can build your taxonomy site. the template includes a tree-building widget, space for image galleries, a blog area, a bibliography area etc. You insert the content, and your site links back to the scratchpad mother site.  The Scratchpads linked from the mother site are in various stages of construction, which is actually pretty cool as you can see a village of taxonomy sites being constructed. Some are fairly complete and interesting to browse through, like the Scratchpad about freeloader flies or the one about nanofossils.

Time will tell if Scratchpads will catch on with taxonomists. I certainly hope it does. It seems  quite a bit of thought has gone into balancing standardization with creativity, and making scratchpading easy and manageabe. The authors have chosen templates based on the Drupal content management system, which has a very gentle learning curve. I am waiting on my own scratchpad account, where I intend to… well, you’ll see it on this blog eventually, I hope.

So if you are involved in taxonomy in some way, consider getting your own Scratchpad. This could make for a great student project, by the way.  If you know someone who is a taxonomist, pass this information on to them.


Smith VS, Rycroft SD, Harman KT, Scott B, & Roberts D (2009). Scratchpads: a data-publishing framework to build, share and manage information on the diversity of life. BMC bioinformatics, 10 Suppl 14 PMID: 19900302

Share and Enjoy:
  • Fark
  • Digg
  • Technorati
  • del.icio.us
  • StumbleUpon
  • Facebook
  • Reddit
  • Twitter
  • FriendFeed
  • PDF
  • email
  • Print
  • Google Bookmarks

Fold.it: wasting time in a good cause

I just spent the better part of a Saturday playing with Foldit. Foldit is an ongoing experiment in finding protein structures by harnessing the power of the mob – or gamers, as is the case here. The player is presented with a backbone & sidechain configuration, with the secondary structures mostly pre-determined. The problem is to get the protein to fold into the correct conformation. You can tweak the secondary structures, rubberband the beta strands together into sheets and rotate the sidechains. The residues are colored by hydrophobicity, so you know who should be facing where. The 23 tutorial cases walk you through the  simple yet powerful interface to folding the structures. You can rotate helices, rubberband strands, flip sheets, etc. The interface gives you feedback on sidechain clashes and voids in the structure among other things.  The examples also teach you the basic of protein structural considerations: maximize hydrogen bonds, hydrophobic side chains should be buried in the structure, strands should combine to sheets, and so on. When you feel you are ready, you can start solving the various folding puzzles presented online. You can work solo, or as part of a team. The “correct solution” is, of course, unknown. The best you can do is accumulate as many points as you can, which represent how stable is the conformation you are building. Foldit, like many other cool things in structural biology, is the product of David Baker’s lab at the University of Washington. Here is a video from the YouTube UWFoldit channel showing the coolness of it all. If you decide to get your fold on, make sure you can make the time. It’s flippin’ addictive.

Share and Enjoy:
  • Fark
  • Digg
  • Technorati
  • del.icio.us
  • StumbleUpon
  • Facebook
  • Reddit
  • Twitter
  • FriendFeed
  • PDF
  • email
  • Print
  • Google Bookmarks

The Science(?) in Avatar

Saw Avatar with OhOne last weekend. Awesome cinematography, cool effects, great work. A few random observations, with no particular theme in mind. Note that James Cameron says that “Avatar [is not] science fiction, it is an action/adventure/science fantasy film”. So, I might just being too sciency here.

1. All of Pandora’s animals are hexapods, except for the Na’vi which are quadrupod. The Na’vi don’t even seem to have vestigial limbs. For a species that is supposed to be intimately connected with all life on the planet, they are seriously missing a Hox gene that everybody else seems to have.

2. What would be the fitness increase in having four eyes, paired in twos? Increased ability to gauge distances? Actually, that is kinda cool. Again, only the Na’vi don’t have them. Did they also lose a copy of eyeless?

3. Hometree’s central structure is DNA shaped, a sort of in-your-face symbolism. It’s spiraling the wrong way though. Just saying.

4. What would cause the Hammerheads to invest so much in a bullet-proof armor? Also, with the high density the armor must have, and their body size that means a lot of calorie consumption just to be able just to carry your own weight.

5. Somehow, Ursula Le Guin’s The Word for World is Forest is not mentioned as an influence to Avatar, but it certainly seems to be. Both have tropical rain forest worlds with sentient humanoid natives that are exploited by invading Terrans. The native humanoids have an intimate connection to the planet the Terran invaders do not appreciate and which ultimately cause their defeat. And yes, both stories are not-too-subtle contemporary political allegories.

6. I doubt a moon orbiting a gas giant would be able to sustain life, the orbit would be too erratic, tidal pulls would be too varied and strong, not to mention the climatic effects of prolonged planetary eclipses.

7. I want one of those AMP suits!

Share and Enjoy:
  • Fark
  • Digg
  • Technorati
  • del.icio.us
  • StumbleUpon
  • Facebook
  • Reddit
  • Twitter
  • FriendFeed
  • PDF
  • email
  • Print
  • Google Bookmarks

w00t! Post selected for Open Laboratory 2009

My post The Incredible Shrinking Genome was selected for publication in Open Laboratory 2009. The Open Laboratory books are anthologies of 52 posts from various science blogs selected annually by a panel of judges . This year the judges waded through 470 740 nominations (thanks for catching this Bora), so it is great to be selected! How on Earth did they go through 740 posts?

You must buy the book once it’s out, office and home copies. Don’t forget your spouse, lover, friends, colleagues, children and pet hamster, for they will all be deeply offended if you do not get them a copy of Open Laboratory 2009, which may have serious ramifications.  I will link to the book once it is out, in a month or so. In the meantime, you should seriously  consider purchasing the 2008, 2007 and 2006 editions.

Share and Enjoy:
  • Fark
  • Digg
  • Technorati
  • del.icio.us
  • StumbleUpon
  • Facebook
  • Reddit
  • Twitter
  • FriendFeed
  • PDF
  • email
  • Print
  • Google Bookmarks

Happy Birthday Elvis

Celebrate with a peanut butter and banana sandwich, while listening to The King.

Share and Enjoy:
  • Fark
  • Digg
  • Technorati
  • del.icio.us
  • StumbleUpon
  • Facebook
  • Reddit
  • Twitter
  • FriendFeed
  • PDF
  • email
  • Print
  • Google Bookmarks

Newton’s birthday and crop diversity

Today is the 366th birthday of Sir Isaac Newton. Formulator of the three laws of motion, the theory of gravity, inventor of the first reflecting telescope, theory of color, calculus (with due credit to Gottfried Leibniz), the generalized binomial theorem, and president of the Royal Society.

Newton in a 1702 portrait by Godfrey Kneller

All which ties in directly to retail, and biodiversity. Huh?

Co-operative Farms (UK) recently bought 1,000  rare and endangered apple varieties, with colorful names like Great Expectations, Fairie Queen, Northern Spy, Forty Shilling, Duck’s Bill and Bloody Ploughman. (I wonder how the last name came to be; actually, maybe I shouldn’t.) This also includes Isaac’s Newton’s Tree: the apple variety cultivated from the descendants of the tree which inspired Newton to formulate the Theory of Gravity. Many of those apples were dessert apples, but some fell out of favor the strains were no longer grown, threatening to disappear.

Co-op Farms are bottling them up as the “Truly Irresistible Tillington 1,000” pressed apple juice. I think it is great that a retail chain is funding crop diversity and finding a way to make some money in the process. Although with 7,500 cultivars worldwide, apples as are not exactly under an extinction threat. But there is also the matter of food variety, cultural heritage and, of course, preserving the history of physics. Or bottling it up, whichever the case may be.

Also, fruit, including apples, are important in the small-arms industry:

Share and Enjoy:
  • Fark
  • Digg
  • Technorati
  • del.icio.us
  • StumbleUpon
  • Facebook
  • Reddit
  • Twitter
  • FriendFeed
  • PDF
  • email
  • Print
  • Google Bookmarks

Real programmers use…

A nice take on the vi / emacs wars

vim rules, emacs drools

Also, real programmers browse the web using the vimperator.

Share and Enjoy:
  • Fark
  • Digg
  • Technorati
  • del.icio.us
  • StumbleUpon
  • Facebook
  • Reddit
  • Twitter
  • FriendFeed
  • PDF
  • email
  • Print
  • Google Bookmarks

Going to GOA, pt. 2: children of a lesser GO

The source file associated with this post can be downloaded here.

The last time I talked about how to read a GOA gene_associations file into a Python dictionary data structure.  Our goal was to find all genes that are annotated as hydrolases in the GOA gene_associations file. The tricky part is, most enzymes are not annotated with a general term such as “hydrolase_activity”, but with a more specific term, such “phosphatase_activity” or “tyrosine_phosphatase_activity”. “phosphatase_activity” can be viewed as a “private case” or in ontological terms a descendant of  hydrolase_activity. “tyrosine_phosphatase_activity” is even more specific, being a descendant of “phosphatase_activity” and, by transitivity of “hydrolase_activity”. The bottom line is,  we need to know all the descendants of  our query term to get the properly annotated genes form the GOA gene_associations file.

To do that, we would need to access the GO database. You can import it from the Gene Ontology FTP site as a MySQL dump, and install it locally. This would probably make your query faster. However, you would need to worry about maintaining and regularly updating your database from GO. Instead, it might make sense to connect to one of the mirrors of the GO MySQL database, and query it remotely. This is what I will do here.

To do that, I will create a class that does two things: (1) opens (and closes) a MySQL connection to a GO mirror site and (2) reads the GOA gene_associations file. We saw how to read the file in the previous part. What I am doing here is simply rolling that code into a method (a function called by a class).

import UserDict
import MySQLdb

class GOA_ga(UserDict.UserDict):
    def __init__(self):
        self.data = {}

    def godb_open(self):
        self.dbcon = MySQLdb.connect(host="mysql.ebi.ac.uk", user="go_select",
                   passwd="amigo", db="go_latest", port=4085)
        self.dbcursor = self.dbcon.cursor()

    def godb_close(self):
        self.dbcursor.close()

    def read_gene_assoc(self,inpath):
        for inline in file(inpath):
            db, db_object_id, db_object_symbol, qualifier, go_id, \
            db_reference, evidence, withit, aspect, \
            db_object_name, synonym, db_object_type, \
            taxon_id, date, assigned_by = inline.strip().split('\t')
            key = db+":"+db_object_id
            self.data.setdefault(key,[]).append({'db':db,'db_object_id':db_object_id,
                                            'db_object_symbol': db_object_symbol,
                                            'qualifier': qualifier,
                                            'go_id': go_id,
                                            'db_reference': db_reference,
                                            'evidence': evidence,
                                            'with': withit,
                                            'aspect': aspect,
                                            'db_object_name': db_object_name,
                                            'synonym': synonym,
                                            'db_object_type': db_object_type,
                                            'taxon_id': taxon_id,
                                            'date': date,
                                            'assigned_by': assigned_by})

What we have here is a class with four methods. The class itself is an instantiation of the Python UserDict class. This means it is primarily a dictionary, and can be used as such. But there are other methods too.

Lines 5-6: The __init__ method describes what happens when the class is instantiated, and is actually superfluous here. I just like to put it as good programming practice. Instantiating GOA_ga created an empty dictionary. That’s it.

Lines 8-11: The second method will open a MySQL connection to a remote server. dbcon and dbcursor will be used to access the database from Python, as we shall see soon. I import the Python MySQLdb module which contains functions to access MySQL through Python. You can get it from SourceForge, and it is also in the repositories of most Linux distributions. To install in Ubuntu Linux simply type at the command-line:

sudo aptitude install python-mysqldb

Lines 13-14: The go_dbclose method simply closes the database connection. It is good practice to do so after all queries are done.

Lines 16-36: this is basically the code to read the GOA gene_associations file that we saw in part 1.
The file go_associations.py contains this code (and more below). You can download it, import it, and run the following on your GOA gene association file of choice, using the Python shell. I chose arabidopsis in this example.

>>> import go_associations as ga
>>> my_goa = ga.GOA_ga() # create a class instantiation
>>> my_goa.read_gene_assoc("gene_association.goa_arabidopsis")  # read all GOA

So what we have is a class that lets us read GOA gene_associations file, and also query the GOA MySQL database.

Remember the original goal: get all the records associated with a certain function, or its children. The MySQL code for that is:

SELECT
  rchild.acc
FROM
  term AS rchild, term AS ancestor, graph_path
WHERE
  graph_path.term2_id = rchild.id and
  graph_path.term1_id = ancestor.id and
  ancestor.acc = 'GO:0016787'

This mysql query will return a list of all the GO accession codes that are the descendants of the go accession code “GO:0016187”. How do you know which GO accession code belongs to which term? You can use the keyword search feature in AmiGO. By the way, I pulled this query from the GO Example Queries page on the GO site. There are many useful queries there!

So how do we insert this into Python code? Here it is:

def find_go_descendants(goa_ga, parent_go_id):
   find_descendants_go = """
   SELECT
     rchild.acc
   FROM
     term AS rchild, term AS ancestor, graph_path
   WHERE
     graph_path.term2_id = rchild.id and
     graph_path.term1_id = ancestor.id and
     ancestor.acc = '%s'
   """
   go_children = {}
   found_genes = {}
   # query GO database for all the child terms of the parent_go_id
   goa_ga.dbcursor.execute(find_descendants_go % parent_go_id)
   rpl = goa_ga.dbcursor.fetchall()
   # Flatten the list. Put it in dictionary keys for faster searching.
   for i in rpl:
      go_children[i[0]] = None
   for gene_id in goa_ga.keys():
      for goa_rec in goa_ga[gene_id]:
         if goa_rec['go_id'] in go_children.keys():
            found_genes.setdefault(gene_id,[]).append(goa_rec)
   return found_genes

Lines 2-11 are the templated MySQL query to get all the descendants of the provided GO access number.
Line 12: go_children is a dictionary whose keys are all the GO accession numbers that are the descendants of our query. The values are null. It is faster to search a dictionary keys than a list.
Line 13: found_genes is a GOA_ga dictionary which will contain the found go_associations entries. The keys are the UniProt IDs, and the values are GOA_ga dictionary entries
Lines 15-16: Now we start working. First, we query the GO MySQL database for all the descendant GO terms of go_id. The returned value is a tuple of tuples which looks something like this:

(('GO:0012345',), ('GO:0034567',),('GO:003617',),....)

That is simply how MySQLdb’s fetchall method works: each MySQL record fetched is a tuple. Since we are fetching only one field with our query, each tuple element is  We now need to flatten this tuple — remove the superfluous parentheses. Lines 18-19 do that. They also put the descendant GO terms a dictionary keys structure, rather than a list structure. A dictionary keys structure is faster to search as we do repeatedly in line 23.

Line 20 is the outer loop of the search. We loop over all UniProt gene IDs in the goa_associations file we read into our goa_ga data structure.
Line 21 is the inner loop. Each Uniprot ID may appear in more than one record, as the protein it represents may be associated with more than one GO term.

Lines 22-23: if the associated GO term is in our list of descendant, add that GOA record to our found_genes dictionary, and key it with the proper UniProt ID.

Line 24: return the dictionary of all GOA records that are annotated as the descendants of the query term, or as the query term itself.

Putting it all together:

>>> import go_associations as ga
>>> my_goa = ga.GOA_ga() # create a class instantiation
>>> my_goa.read_gene_assoc("gene_association.goa_arabidopsis")  # read all GOA
>>> my_goa.godb_open()     # open a connection to the GO database
>>> all_hydrolases  = ga.find_go_descendants(my_goa,"GO:0016787") # find the descendants of "hydrolase_activity"

all_hydrolases now contains the GOA records of all the arabidopsis proteins annotated as hydrolases. How many are there?

>>> len(all_hydrolases)
3617
>>>

(This may vary when the latest version of this file changes. GOA is regularly updated). Some genes may be annotated as hydrolases more than once. This might either be an annotation error, as the is annotated twice with the same function, or the protein might be promiscuous with more than one function, or it may have two domains, each with a specific function. To find out all the hydrolase annotations:

>>> n=0
>>> for i in all_hydrolases:
...      n += len(all_hydrolases[i])
>>> print n
5753
>>>

How many of those annotations have a good evidence code?

>>> n=0
>>> for i in all_hydrolases:
...     for j in all_hydrolases[i]:
...             if  j['evidence'] in ('EXP','IDA','IPI','IMP','IGI','IEP'): n += 1
...
>>> print n
416
>>>

416 out of 5763: just over 7%. Keep that in mind next time you delve into an annotation database. (Exercise: find out how many enzymes have been annotated with good evidence in arabidopsis, or in another GOA-annotated organism).

Have fun GOing, and getting there!

Share and Enjoy:
  • Fark
  • Digg
  • Technorati
  • del.icio.us
  • StumbleUpon
  • Facebook
  • Reddit
  • Twitter
  • FriendFeed
  • PDF
  • email
  • Print
  • Google Bookmarks