Archive

Archive for the ‘Science’ Category

A sh*tload of data

March 4th, 2010 1 comment

ResearchBlogging.org

There are more microbial cells in our body than our own. Those microbes are not just passive hitchhikers or conversely, malicious agents of disease. They affect our well-being and health in a much broader spectrum than simply “bad” or “passive”. Among other things our gut microbes play an important role in digestion, have been linked to obesity, conditions as severe as certain inflammatory bowel syndromes or as relatively mild as traveler’s diarrhea. The microbes living on our skin also affect many things: from body odor and dandruff and acne to dermatitis and psoriasis. Also, being the most exposed microbial population means that they are themselves affected by our constant exposure to various agents,and their population varies by our own behavior –  such as how often, and with what, do we wash our hands: antibacterial soap has been named as a major culprit in the development of resistant bacterial strains. In all organs the native flora, relatively benign, protects us against colonization by more virulent bacteria.

Indeed, our gut, skin, mouth and genital microbiomes can be viewed as  additional organs in the way that they affect us. If you take a long course of powerful systemic antibiotics, the ensuing diarrhea and sometimes mouth thrush are the result of these “organs” — in your mouth and gut — being removed temporarily from your body.

Credit: David Gregory&Debbie Marshall. Wellcome Images images@wellcome.ac.uk

Today, another big step has been taken towards understanding the role of our microbiomes play. Just how big, in what direction, or what will be the consequences of this step is unclear, and will remain unclear for quite some time. A group of Chinese and European researchers have published the largest sequencing effort yet of gut bacteria. Their current yield, 576 Gigabases of DNA from the feces of 124 European people is considerably larger than the previous large effort in the US: 3GB from sequenced from 33 US and Japanese adults. Also, Qin and colleagues looked at obese, lean, healthy and sick (inflamatory bowel syndrome) individuals. They identified 1,000 to 1,100 different bacterial species, with about 160 different species per individual. Healthy individual’s bacterial population was markedly different that those with inflammatory bowel syndromes, and the populations of those with IBS differed depending on disease type.

What does this all mean? Well, like in the human genome project, it will take a while before  we understand not only what these data contain, but what are the limits of our ability to interpret them, and how best they can serve us. What we have right now is the equivalent of the outline of a newly discovered continent. It is up to many individual explorers to discover and chart the myriad living things in terra excreta. Which genes are most associated with obesity or with ulcerative colitis? How prevalent is gene transfer between gut bacteria, and how much of a role does this play in antibiotic resistance? Are there microbial species more prone to changes in their genomes than others? Are there metabolic pathways that are shared between different microbial species? Are some genes faster evolving than others, what would be the ecological role different species play in the gut? And how do different microbial populations ultimately affect their human hosts? Do different bacterial species have a preference with whom they share our guts? These broad questions can be broken down into individual questions relating to a lab’s pet genes, metabolic pathways, microbial species and metabolic conditions. There is a lot to sift through here and these data will keep us busy for years to come.

Pathogenic E. coli on the intestinal lining. Credit: S. Schuller.

In other blogs Carl Zimmer has a great post on how he, for one welcomes our bacterial overlords, while Ed Yong talks about the science of things to come.

Oh, and for those of you who, like me, can’t wait to plug the data into your favorite analysis pipeline, you can get the gene and assembled and annotated sequence data data from Peer Bork’s lab at the European Molecular Biology Laboratory, or the Beijing Genome Institute. The raw sequence data are available from the European Bioinformatics Institute under the accession ERA000116. (I can’t find the latter myself right now, hopefully they will show up in a couple of days). Update March 7, 2010: the sequences are now deposited at EBI. Also, on BioTorrents (hat tip to Morgan Langille).


Qin, J., Li, R., Raes, J., Arumugam, M., Burgdorf, K., Manichanh, C., Nielsen, T., Pons, N., Levenez, F., Yamada, T., Mende, D., Li, J., Xu, J., Li, S., Li, D., Cao, J., Wang, B., Liang, H., Zheng, H., Xie, Y., Tap, J., Lepage, P., Bertalan, M., Batto, J., Hansen, T., Le Paslier, D., Linneberg, A., Nielsen, H., Pelletier, E., Renault, P., Sicheritz-Ponten, T., Turner, K., Zhu, H., Yu, C., Li, S., Jian, M., Zhou, Y., Li, Y., Zhang, X., Li, S., Qin, N., Yang, H., Wang, J., Brunak, S., Doré, J., Guarner, F., Kristiansen, K., Pedersen, O., Parkhill, J., Weissenbach, J., Antolin, M., Artiguenave, F., Blottiere, H., Borruel, N., Bruls, T., Casellas, F., Chervaux, C., Cultrone, A., Delorme, C., Denariaz, G., Dervyn, R., Forte, M., Friss, C., van de Guchte, M., Guedon, E., Haimet, F., Jamet, A., Juste, C., Kaci, G., Kleerebezem, M., Knol, J., Kristensen, M., Layec, S., Le Roux, K., Leclerc, M., Maguin, E., Melo Minardi, R., Oozeer, R., Rescigno, M., Sanchez, N., Tims, S., Torrejon, T., Varela, E., de Vos, W., Winogradsky, Y., Zoetendal, E., Bork, P., Ehrlich, S., & Wang, J. (2010). A human gut microbial gene catalogue established by metagenomic sequencing Nature, 464 (7285), 59-65 DOI: 10.1038/nature08821

Blogosphere catches: Marco Island, finding Ada and blog carnivals

March 2nd, 2010 No comments

Some interesting events cropped up recently. The Marco Island Advances in Genome Biology and Technology meeting was heavily tweeted and blogged about.  Pacific Biosciences unveiled their third generation sequencer. Ostensibly, it can sequence reads of 20,000 length, but the fraction of actual long reads in a run, and their quality is still a bit hazy. The most interested to me is the Ion Torrent. Being rather low on budget, this seems like the family budget car of high throughput sequencing: cheap, reliable, and does not offer more than I really need. $50,000 for a sequencer with $500 runs with 160MB/hr? Nice. Genetic Inference has a great summary of the various technologies presented.

Overall, we are starting to see a divergence in sequencing technologies, as each tech concentrates on having clearly defined advantages and potential applications that differ from all others. This means that the scientists themselves can more closely tailor their choice of tech to fit their situation. Are you a small lab that needs 10 high-quality genomes on a budget? Go to Complete. Want a cheap, fast machine for library validation? Use Ion Torrent. Setting up a pipeline for sequencing thousands of genomes? Go Illumina.

The review article on metagenomics I recently published in PLoS Computational Biology (yeah, yeah, shameless plug) already starts to feels somewhat outdated on the sequencing technology front.

Carnival of Evolution #21 the superstar edition is up: check it out. It’s a nice and detailed one,. Some posts I liked included talking about how human fingers evolved, and why it is important to consult evolutionary biologists when making decision about conservation.

An interesting email I got yesterday: PubGet, a search engine for PDFs of scientific articles, is no linked to PLoS. PubGet is a very useful service that gets  you the article PDF immediately, without going through he usual clickeroo via Google,  pubmed, publisher’s gateway, journal gateway and then squinting along the sidebar to find the PDF link. Nice to see that these two are teaming up.

Finally, two reminders. First, Ada Lovelace day, a blogging day celebrating the achievements of women in science and technology is coming up, March 24. Go ahead, pledge and blog! Second, the Byte Size Biology will be hosting a Carnival of Bioinformatics. Quite a few posts have been submitted already, please submit yours, deadline: March 9.

Most common weaknesses of grant applications

February 20th, 2010 1 comment

I’ve been cleaning out my Gmail from attachment-emails recently. (Why do people continue to send me video files when there’s YouTube?) Anyhow, Mickey Kosloff sent me this. In this pic, form meet function, being concise and to the point, as a good grant application should be.

A picture is worth one-thousand study sections

Categories: Funding Tags: ,

Bioinformatics blog carnival

February 18th, 2010 2 comments

Byte Size Biology will be hosting the first edition of the bioinformatics blog carnival. All you bioinformatics bloggers, submit your entries by Mar 9, 2010 23:59:03  EST. Note the 3 second extension I have already given. There will be no more deadline extensions, I’ve been generous enough as it is. The carnival will be posted here by March 15, 2010.

Any blog posts that have to do with the computational aspects of: genomics, nextgen sequencing, sequence analysis, gene expression, systems biology, ontologies, databases, structural biology, metagenomics, phylogenetics, function prediction and I probably forgot a few other categories so don’t hold it against me, just submit. Early and often. Your own posts and others that you liked. Nothing too old please, 1/1/2009 and later. I reserve the right to be a less-than-benevolent dictator and screen out posts. This applies especially to commercial plugs with no other merits.

Please retweet, reblog, rebuzz and remember:  submit to the bioinformatics blog carnival!

If you have a cool logo for this carnival, email bioinfo.blog.carnival AT gmail.com OhOne and OhToo will pick the winning logo to be displayed on the carnival. Likewise, if you would like to host the next edition, let me know. Do not submit posts by email, only via blogcarnival.com.

Happy carny-ing.

“Codon” is now a four letter word

February 17th, 2010 5 comments

ResearchBlogging.org

As part of the process of manufacturing  a new car,  the designers will take the blueprints to the factory floor. There they will set up an experimental assembly line, tinkering with the manufacturing process of the prototype until it is ready for mass-production. Can we do the same with the machinery of life – the assembly of proteins? Can we set up an alternative assembly line for a new protein prototype — and then actually set up a working assembly line for the whole new protein?  A proof-of-concept has been published this week in Nature by Jason Chin’s group at the Medical Research Council Laboratory of Molecular Biology, Cambridge UK.

If there is a single common denominator to all life, it is the genetic code. All life is built around DNA encoding information for proteins  nucleotide triplets or codons. Since there are four types of nucleotides (A,T,G,C)  that are read in words of thee, there are 43 = 64 possible codons: more than enough to encode for the 22 amino acids that make up proteins. There is nothing more basic and fundamental to life on Earth than the three-letter based genetic code.

Until now.

Chin’s group has created a four-nucleotide codon system.  It is not that the DNA is different: it is the way the cellular machinery decoding  RNA transcripts interprets the nucleotide sequence. Ribosomes –large RNA and protein complexes  which are the platform upon which messenger RNA is read and decoded — are set to serve up messenger RNA three nucleotides at a time. (Messenger RNA or mRNA is a transcript of the DNA which is carried to the ribosome.)  Transfer RNA or tRNA is a short RNA molecule that shuttles the proper amino acid to the ribosome, but will only attach if the proper codon is served up by the ribosome. The whole protein synthesis “assembly line” looks something like this:

Protein synthesis. Credit: Wikimedia Commons.

To change the interpretation of the genetic code from three lettered words  to four, Chin and his colleagues had to make new ribosomes, and new tRNAs.  To create these new ribosomes, they designed orthogonal ribosomes, or o-ribosomes. O-ribosomes are genes inserted to produce extra ribosomes that operate in the cell alongside the regular ribosomes. The cell functions because it has the regular ribosomes to maintain its viability. The ribosomal RNA in the o-ribosomes is free to be mutated to create new unnatural traits: in this case, the ability to serve as a platform read four-letter codons. They selected for Escherichia coli bacterial cells that expressed a o-ribosomes which translated a four-letter codon in a gene, which would otherwise go untranslated by the regular ribosome. The gene gives the bacterial cells resistance to the antibiotic chloramphenicol. So cells that survive a dosage of chloramphenicol are those which have functioning o-ribosomes, as they have the chloramphenicol resistance gene that is being translated by the o-ribosomes.

They also needed to create new tRNAs that have an four-nucleotide anticodon (the part that complementarily binds to the messenger RNA –  see figure above.)  So the surviving E. coli cells have a population of working o-ribosomes, regular ribosomes, modified tRNA (with a  four-letter anticodon) and regular tRNA.

Then they took their work a step further. Each three-letter tRNA carries a specific amino-acid, depending on its anticodon. Thus tRNAAAG will always have a phenylalanine attached, because CTT (the complement of AAG on the messenger RNA) codes for phenylalanine. If you start messing with that, the translation machinery will produce non-functional proteins, which will probably kill the cells pretty quick. But with the orthogonal 4-letter code machinery, that is not really a problem: the orthogonal machinery operates alongside the normal one. Also, there are no amino acids naturally assigned to any four letter code, because this code does not appear in nature in the first place! So Chin’s lab assigned an unnatural amino acids to a four-letter code. The non-naturally occurring p-azido-l-phenylalanine amino acid was assigned to tRNAUCCU. They then showed that the whole alternative translational machinery worked by synthesizing a mutant of the protein calmodulin which used this amino-acid in its structure.

Why do it? Well, personally I don’t see the need for justification: just being able to do it is so cool!  But seriously: think of the ability to design proteins from up to 44=256 different amino acids other than the 22 we have now.  The possibilities of tinkering with existing proteins using this orthogonal, four-letter based machinery are huge. The other benefit of this orthogonal synthesis setup is the ability to control this orthogonal translational machinery: because it does not use the three-letter vocabulary, this orthogonal machinery would be much easier to manipulate, tinker with and switch on and off without getting in the way of regular cellular translational machinery. The analogy to a car assembly line breaks here. It is as if two different models are being assembled on the same line just by using different robots. The better analogy is for a program source code to be read by two different compilers, each producing a different program. Awesome.


Neumann, H., Wang, K., Davis, L., Garcia-Alai, M., & Chin, J. (2010). Encoding multiple unnatural amino acids via evolution of a quadruplet-decoding ribosome Nature DOI: 10.1038/nature08817

Ancient Greenlander’s DNA reveals ugly mullet

February 13th, 2010 2 comments

ResearchBlogging.org

Seriously, this is what I first thought when I saw the cover of this week’s Nature, and the associated drawings  in the press.  The dude’s haircut seems like it was even bad in the ’80s… 2080 BCE that is, which is when his body is dated. Approximately.

Inuk, artist's impression

A large group of researchers were involved in analyzing DNA from a 4,000 year old man found in Greenland. There is an interesting story on how the sample was obtained. Eske Villerslev spent 2 months looking for human remains in archaeological sites in Greenland, to no avail. When complaining about it to a friend, he learned that that friend’s father had already given hair samples to the national museum of Denmark… back  in 1986. Serious slap on the forehead!

The preserved DNA was of great quality. Also, this is 2010: we have a plethora of genotypic data that lets us associate a genome with phenotypes, and geographical origin. Inuk, as he was dubbed, shares many traits with east Asians, and much less with contemporary North American Inuit or other Native Americans. This means that Inuk and his people migrated fairly recently to Greenland from Siberia and across North America. Inuk and his people, the (now extinct) Saqqaq have no relationship to contemporary Native Americans. This was also shown in an earlier study of the mitochondrial DNA.

What do we know about Inuk himself? He had blood type A+; thick black hair and very likely a receding hairline due to a baldness allele; brown eyes; shovel-graded front teeth (East  Asian characteristic); dry type earwax… before you laugh that is actually, another characteristic of Asian populations on the Siberian and Chinese east coast. He also had alleles associates with human adaptation to cold temperatures.

And a bad haircut; but great press nevertheless (just got to Google News and search for  “Greenland Genome” – see, for example, the NYTimes.)

This post has been Slashdotted. Exercise extreme caution.


Rasmussen, M., Li, Y., Lindgreen, S., Pedersen, J., Albrechtsen, A., Moltke, I., Metspalu, M., Metspalu, E., Kivisild, T., Gupta, R., Bertalan, M., Nielsen, K., Gilbert, M., Wang, Y., Raghavan, M., Campos, P., Kamp, H., Wilson, A., Gledhill, A., Tridico, S., Bunce, M., Lorenzen, E., Binladen, J., Guo, X., Zhao, J., Zhang, X., Zhang, H., Li, Z., Chen, M., Orlando, L., Kristiansen, K., Bak, M., Tommerup, N., Bendixen, C., Pierre, T., Grønnow, B., Meldgaard, M., Andreasen, C., Fedorova, S., Osipova, L., Higham, T., Ramsey, C., Hansen, T., Nielsen, F., Crawford, M., Brunak, S., Sicheritz-Pontén, T., Villems, R., Nielsen, R., Krogh, A., Wang, J., & Willerslev, E. (2010). Ancient human genome sequence of an extinct Palaeo-Eskimo Nature, 463 (7282), 757-762 DOI: 10.1038/nature08835

Gilbert MT, Kivisild T, Grønnow B, Andersen PK, Metspalu E, Reidla M, Tamm E, Axelsson E, Götherström A, Campos PF, Rasmussen M, Metspalu M, Higham TF, Schwenninger JL, Nathan R, De Hoog CJ, Koch A, Møller LN, Andreasen C, Meldgaard M, Villems R, Bendixen C, & Willerslev E (2008). Paleo-Eskimo mtDNA genome reveals matrilineal discontinuity in Greenland. Science (New York, N.Y.), 320 (5884), 1787-9 PMID: 18511654

Highly Evolved

February 12th, 2010 4 comments

If the title of this post makes you cringe, then you belong to a minority of people who realize why the phrase “highly evolved” is so wrong. Unfortunately, “highly evolved” (as an absolute term) and “more evolved” (as a comparative term) seem  to be used all-too frequently.  They are uttered not only by non-scientists and non-biologists but even by scientists who should know better. Even when they catch themselves after blurting out “highly evolved” in a conversation (or, more embarrassingly, in a lecture), the damage is done. Yet another Freudian (Darwinian?) slip that tells of a fundamentally bad grasp on evolution. And, yes, I know, this topic has been written about by many of my betters, who are vastly more evolved better writers than I, with much better breadth and depth of knowledge of evolution, and a reach to a much wider audience.

Not a sponge. Source: wikimedia commons. Public Domain.

More evolved? Source: Wikimedia commons, public domain.

So why am I writing about it? Well, this is my blog and ranting in it is my prerogative. And despite the Richard Dawkinses and Steven Jay Goulds of this world, the use of this phrase still persists. So it is up to us foot soldiers of the blogging community to do our own modest bit. If I prevent any of my six readers from being tempted to utter this phrase the next time it is (wrongly) deemed appropriate, then I have done my bit.

Why is this “highly evolved” used so much? And why is it wrong?

Consider the sponge, and then consider Albert Einstein. There are certain traits that Einstein had, that a sponge does not. We deem these traits to be of merit. Einstein developed a fundamental theory in physics. He  played the violin. He  ate with a knife and fork, had binocular color vision, opposable thumbs and he cultivated his facial hair in the form of a mustache.

A sponge… well, to be brief, does not have all those qualities we hold in such high merit. It kinda sits there at the bottom of the shallow ocean, flopping about, filter feeding, pooping and apparently not much else. Clearly, there are qualities to Einstein that make him more interesting than the sponge.

Less evolved? Source: Wikimedia Commons

Einstein seems, intuitively, to be more complex than a sponge, and that complexity can be quantified directly, in many ways. Actually, this is a pretty contentious point by itself: can we speak of organism complexity? Can we quantify the complexity of an organism and compare between different species? And what exactly would the complexity metric we choose tell us?

But let us assume, for argument’s sake, that our intuition that Einstein is more complex than a sponge is correct. For example, we can imagine a measure derived from the diversity and number of cells. Obviously there are more cell types in Einstein than in a sponge. Does that mean he is also more evolved? Are humans a more  evolved than sponges? Chimps? After all, did life not start 3.85 billion years ago as simple and over time became more complex? Progressing, as it were from simple unicellular bacteria through more complex sponges all the way to the crowning achievement of humans? Had humans not, in a sort of (alas, Pyrrhic) victory, mastered the Earth and competed with many of earth’s species to the latter’s extinction? Isn’t competition what evolution is all about? And isn’t human victory a direct result of human complexity making humans “more evolved”? So isn’t “complexity” an end product of evolution, the more complex you are the more successful you are, and the more evolved you are?

No, no, no, no, no, and no.

The reason for this series of compounding errors is the mistaken notion that evolution by natural selection is a progression resulting in a production of increasingly complex life.  Evolution is not goal oriented, and there is no teleology involved. The increasing complexity of organisms along time may seem to involve a  progressive process, but there is none. It is a “statistical illusion”.  What do I mean by that? Well, life did start out in less complex forms, that became more complex. But the less complex forms remained as well. Thus the distribution of complexity increased over time, but there is no directionality towards progress: the less complex life remained around as well. But over 3.85B years, complexity has had a chance to manifest itself in life, as natural selection favored some initial complexities, and those extended to become even more complex. Yes, we can trace a direct route from the first multicellular organisms, through sponges, invertebrates, vertebrates. But humans, chimps, sponges and bacteria living on Earth today are the result of exactly the same selective forces that have shaped life since  it crawled out of an underwater volcano, or wherever. Complexity emerged over time, and is still emerging. But complex organisms are being added to the pool of life, rather than replacing the simple organisms. The result is an increase of a distribution of complexity levels, not the moving of an entire curve of complexity rightwards.

Apparent progress due to a to a 'wall' restricting where random change can take things. Adapted from SJ Gould. Reproduced under CC from talkorigins.org

The point I am trying to make is that humans may be more complex than sponges, but we are not “more evolved” nor are we “highly evolved”.  There is no progressive process, and all of life on earth is the result of the same 3.85B years of selective pressures.

For a really good historical overview of teleological, or purpose-driven, thought in evolution, look to talkorigins.org.

Few know that Einstein was teaching evolution at Princeton. Physics was just a cover.

All of this does not mean that Highly Evolved by The Vines is not a kick-ass song. Listening to it is also  a good way to get the rage from hearing “highly evolved” out of your system.  Note the low complexity of the video:

Short bioinformatic hacks: reading between the genes

February 11th, 2010 3 comments

In celebration of the biohackathon happening now in Tokyo, I am putting up a script that is oddly missing from many bioinformatic packages: extracting intergenic regions. This one was written together with my student, Ian. As for the biohackathon itself, I’m not there, but I am following the tweets and  Brad Chapman’s excellent posts:

About intergenic regions: intergenic regions are as interesting and sometimes even more interesting than the genes themselves: when you are interested in promoters, transcription factor binding sites or almost any other transcription regulation mechanism. Here’s a simple script to find intergenic regions. It reads a genbank formatted file and uses the information there to extract the intergenic regions. The sequences are written to a FASTA file.

#!/usr/bin/env python
import sys
import Bio
from Bio import SeqIO, SeqFeature
from Bio.SeqRecord import SeqRecord
import os

# Copyright(C) 2009 Iddo Friedberg & Ian MC Fleming
# Released under Biopython license. http://www.biopython.org/DIST/LICENSE
# Do not remove this comment
def get_interregions(genbank_path,intergene_length=1):
    seq_record = SeqIO.parse(open(genbank_path), "genbank").next()
    cds_list_plus = []
    cds_list_minus = []
    intergenic_records = []
    # Loop over the genome file, get the CDS features on each of the strands
    for feature in seq_record.features:
        if feature.type == 'CDS':
            mystart = feature.location._start.position
            myend = feature.location._end.position
            if feature.strand == -1:
                cds_list_minus.append((mystart,myend))
            elif feature.strand == 1:
                cds_list_plus.append((mystart,myend))
            else:
                sys.stderr.write("No strand indicated %d-%d. Assuming +\n" %
                                  (mystart, myend))
                cds_list_plus.append((mystart,myend,1))

    for i,pospair in enumerate(cds_list_plus[1:]):
        # Compare current start position to previous end position
        last_end = cds_list_plus[i][1]
        this_start = pospair[0]
        strand = pospair[2]
        if this_start - last_end >= intergene_length:
            intergene_seq = seq_record.seq[last_end:this_start]
            strand_string = "+"
            intergenic_records.append(
                  SeqRecord(intergene_seq,id="%s-ign-%d" % (seq_record.name,i),
                  description="%s %d-%d %s" % (seq_record.name, last_end+1,
                                                        this_start,strand_string)))
    for i,pospair in enumerate(cds_list_minus[1:]):
        last_end = cds_list_minus[i][1]
        this_start = pospair[0]
        strand = pospair[2]
        if this_start - last_end >= intergene_length:
            intergene_seq = seq_record.seq[last_end:this_start]
            strand_string = "-"
            intergenic_records.append(
                  SeqRecord(intergene_seq,id="%s-ign-%d" % (seq_record.name,i),
                  description="%s %d-%d %s" % (seq_record.name, last_end+1,
                                                        this_start,strand_string)))
    outpath = os.path.splitext(os.path.basename(genbank_path))[0] + "_ign.fasta"
    SeqIO.write(intergenic_records, open(outpath,"w"), "fasta")

if __name__ == '__main__':
    if len(sys.argv) == 2:
         get_interregions(sys.argv[1])
    elif len(sys.argv) == 3:
         get_interregions(sys.argv[1],int(sys.argv[2]))
    else:
         print "Usage: get_intergenic.py gb_file [intergenic_length]"
         sys.exit(0)

What are we seeing here?

Lines 11-16 are the preamble: we read the GenBank file using Biopython’s genbank parser in line 12.  Beacuse we expect a genome file, which contains a single record, this is a one-time read. Note that this is a rate limiting step, and can take a couple of seconds. Took me ~2secs to read the full E. coli genome on my Linux box.  We prepare one list for the + strand intergenic regions (13), another one for the minus strand intergenic regions (14) and one for all the records (line 15).

The rest of the code are three loop blocks: lines 16-28 I loop over the genbank features, extracting the coordinated of the genes themselves. Line 32-41 I find the intergenic regions on the + strand. Lines 42-52 I do the same for the “-” strand.

Now for a philosophical interlude: although there is a way to read all the intergenic regions in a single pass, I subscribe to the “code simple” doctrine of research software writing. Code performance optimization is a low priority for me. I’d much rather have something that is simple to write,read and modify. I also don’t want to spend too much time coding and elegant script for elegance’s sake, especially if I may not use it too much. Historically, scientific code written for research is mostly extinct: thrown away after a short lived hypothesis was tested and ended its days. Research coding is mostly throwaway glue code. Very rarely it matures into a product. Then, and only then, can you apply all those fine software engineering you learned in college. Before that, write fast and simple.

But I digress. Line 17 loops over the features in the genome file. Line 18 we identify if it is a coding sequence (CDS). If so, we identify the start position, and position and the strand the CDS is on. The list cds_list_minus is a list of 2-tuples. Each 2-tuple is the start and end positions of a CDS on the minus strand. (If you would like to go over the genes, as opposed to coding sequences, change line 18 to:

if feature.type == 'gene':

(or better yet, pass an argument that defines it.)

cds_list_plus, is, yes, the same as cds_list_minus, only for the plus strand (line 24).

Sometime, a CDS does not contain information on which strand it is. With genome files, that is usually the case with single stranded viral genomes. Therefore, we put in the default assumption that if there is no strand indication, then the feature is is on the plus strand. We generate a warning message nevertheless (lines 25-28).

Lines 30-41 we loop over the plus  strand list, and identify the coordinates between the genes. Python’s enumerate function is very useful here. The enumerate function allows us to iterate over a list, but at the same time keep track of which index we are in when looping over the list. So in line 30, pospair receives the start and end coordinates of a CDS as a 2-tuple, while i receives the actual number if the index in the plus strand CDS list. In that way, we can look back to the previous list member, find the coordinates where that CDS ends, and where the current CDS begins. The two coordinates make up the beginning and end of the intergenic regions between those two genes on that strand. In line 35 we check if the intergenic region length is equal to or larger than a threshold: suppose we are only interested in those intergenic regions that are longer than 100 bases? (The default value is 1, see line 11.) In lines 38-39 we build a biopython sequence object that contains an informative header, and the sequence of the intergenic region. The description which goes in the sequence header contains the start and end coordinates of the intergenic region, and the locus ID of the CDS directly downstream from it. The sequence object is appended to a list, which will eventually get written (lines 40-41).

Lines 42-52 are a repeat of lines 30-41, only for the minus strand.  Lines 53 & 54: the list that contains all the intergenic region sequence objects gets written to its own fasta file.

Finally, line 56-63 are boilerplate code, that make this script runnable from the command line. Have fun looking at intergenic regions. Let me know of you find something interesting.

JSUR? Yes, sir. (Updated 2-FEB-2010)

January 31st, 2010 2 comments

The most exciting phrase to hear in science, the one that heralds new
discoveries, is not ‘Eureka!’, but ‘That’s funny…’ -Isaac Asimov

Thanks to Ruchira Datta for pointing out this one.

Science is many things to many people, but any lab-rat will tell you that research is mainly long stretches of frustration, interspersed with flashes of satisfying success. The best laid schemes of mice and men gang aft agley. A scientist’s path contains leads to blind alleys more than anything else, and meticulous experimental preparation only serves to somehow mitigate the problem, if you’re lucky. This doesn’t work, that doesn’t work either and this technique worked perfectly in Dr. X’s lab, why can’t I get this to work for me?  My experiment was invalidated by my controls; my controls didn’t work the way the controls were supposed to work in the first place. I keep getting weird results from this assay. I can’t explain my latest results in any coherent way… these statements are typical of daily life in the lab.

This stumped and stymied day-to-day life is not the impression of science we get from reading a research paper, when listening to a lecture, or when watching a science documentary show. When science is actually presented, it seems that the path to discovery was carefully laid out, planned and  flawlessly executed, a far cry from the frustrating, bumbling mess that really led to the discovery. There are three chief reasons for the disparity between how research is presented, as opposed to what really goes on. First, no one wants to look like an idiot, least of all scientists whose part of their professional trappings is strutting their smarts. Second, there are only so many pages to write a paper, one hour to present a seminar or one hour for a documentary: there is no time to present all the stuff that did not work. Third, who cares about what didnt work? Science is linked to progress, not to regress. OK, you had a hard time finding this out, we sympathize and thank you for blazing the trail for the rest of us. Make a note for yourself not to go into those blind alleys that held you back for years and move on. We’re not interested in your tales of woe.

Only maybe these tales of woe should be interesting to other people. If you make your negative results public, that could help others avoid the same pitfalls you had. If you share the limits of a technique, a protocol or software then someone can avoid using it in a way that does not work. A lab’s publications are actually the tip of the sum total of its accumulated knowledge.Every lab has its own oral tradition of accumulated do’s and dont’s. Not oral in the literal sense: they may even be written down for internal use, but never published. UPDATE (2-FEB-2010): most peer-reviewed journals don’t like stuff that does not work. Thanks to Mickey Kosloff for pointing out the Journal of Negative Results in Biomedicine and The Journal of Negative Results – Ecology and Evolutionary Biology.

Until now.

The Journal of Serendipitous and Unexpected Results aims to help us examine the sunken eight-ninths of the scientific knowledge iceberg, in life science and in computer science. (So an additional field over JNRB and JNREEB). From JSUR’s homepage:

Help disseminate untapped knowledge in the Computational or Life Sciences

Can you demonstrate that:

* Technique X fails on problem Y.
* Hypothesis X can’t be proven using method Y.
* Protocol X performs poorly for task Y.
* Method X has unexpected fundamental limitations.
* While investigating X, you discovered Y.
* Model X can’t capture the behavior of phenomenon Y.
* Failure X is explained by Y.
* Assumption X doesn’t hold in domain Y.
* Event X shouldn’t happen, but it does.

The problem with the JSUR model, and the nature of discovery

I expect JSUR will be a great way to comment on  methods and techniques. Indeed it will codify a trend that has been going on for some time: public protocol knowledge sharing. Many sites like openwetwareseqanswers or the UC Davis bioinformatics wiki have been doing this for a while. Not to mention a plethora of blogs. Scientists are willing to share their experience with working protocols and procedures, and if this sharing of knowledge can be now monetized to that all-important coin of academia, the  peer-reviewed publication, all the better.

So where is the problem? The problem lies with discovery, and credit given towards it. It would be very hard to get anyone to share awkward, unexpected or yet-uninterpreted results. First, as I said, no one wants to look like an idiot. Second, unexpected or yet uninterpreted results are often viewed as a precursor to yet another avenue of exploration. A scientist would rather pursue that avenue, with the hope of  the actual meaningful discovery occurring in the lab. At most, there will be a consultation with a handful of trusted colleagues in a closed forum. If the results are made public, someone else might take the published unexpected and uninterpreted results, interpret them using complementary knowledge gained in their lab, and publish them as a bona-fide research paper. The scientist who catalyzed the research paper with his JSUR publication receives, at best, secondary credit. The story of Rosalind Franklin’s under-appreciated contribution to the discovery of the structure of DNA comes to mind. Watson and Crick used the X-ray diffraction patterns generated by Franklin to solve the three dimensional structure of the DNA molecule. Yet she was not given a co-authorship on the paper. (And she did not even make the results public, they were shared without her knowledge.) Unexpected results are viewed either as an opportunity or an embarrassment, and given the competitive nature of science, no on wants to advertise either: the first due to the fear of getting scooped, the second for fear of soiling a reputation. I expect JSUR would have a harder time filling in the odd-results niche, but I hope I am wrong.

But if you have protocols you are willing to share…what are you waiting for? Get those old lab notebooks, 00README files, forum posts  and start editing them to a paper. You are sitting on a goldmine of publishable data and you did not even realize it.

Finally, here are two scientists who never declined sharing their unexpected results.

This post has been slashdotted. Exercise extreme caution.


BsB in high school science… nice

January 25th, 2010 2 comments

A  small spike on my blog traffic yesterday led me to look for the source via Google Analytics. (If you are a blogger, you should really use this tool, lots of useful traffic information.) Seems like most of the traffic came from the page of a high school science teacher at Badin High School in Hamilton, OH. Apparently the students were to be quizzed today on two of my posts about endosymbiosis (and one from 80Beats; I’m in good company.) So they were very busy Sunday. It’s encouraging to know that some of my posts are accessible enough for high school science. Finally, quite a few Miami students come from Hamilton (we’re close). So I might see some of them next year.

Muahahaha!

The polypharmacome

January 23rd, 2010 1 comment

ResearchBlogging.org
This post was chosen as an Editor's Selection for ResearchBlogging.org
Pharmaceutical companies are always on the lookout for secondary drug targets. After all, if you invest billions developing a single drug, you would be more than happy to sell it as a treatment for two, three, or more different ailments. Sildenafil citrate was developed to treat angina and hypertension, but during phase I clinical trials, it was found that Sildenafil induces penile erections. The drug was branded Viagra, and the rest is history. Eflornithine, an anti-cancer drug, is also effective against the agent of African sleeping sickness, Trypanosoma brucei. African Sleeping Sickness is known as a “neglected disease”, for which drug development is not profitable and therefore not a priority. However, having a drug already on hand makes it easier to distribute in affected areas, since the R&D costs are recovered elsewhere.

Another example of polypharmacology is a drug that binds to multiple targets in the human body. This could be used for overcoming drug resistance, a known problem with cancer. Cancer tumors often develop a resistance to anti-cancer drugs by simple natural selection: the protein that the drug binds to mutates, and no longer binds the drug. However, if the drug acts by binding redundantly to several proteins, it would be more effective, since several mutations would be required to effect drug resistance.

Another important polypharmacological consideration  is toxicity. If a drug binds to one protein, its drug target, it may also bind to another one which it should not bind to as it disrupts the normal functions and the health of the patient. If the side effects outweigh the cure, the drug is no good.

How one drug (cyan) can bind to two different proteins with different overall shapes (pink and green), but with similar binding sites

Because it can either increase  profits, or conversely derail a whole process of drug development, predicting polypharmacophoric effects is very much something drug developers want. A study published yesterday in PLoS Computational Biology by Jacob Durrant and his colleagues suggest a way  bioinformatics and theoretical biophysics can help in identifying multiple drug targets. Durrant’s goal was simple: given the molecular structure of a candidate drug, which proteins are expected to bind it? The strategy this group took is a combination of bioinformatic and experimental screening.

Finding additional drug targets in four  steps

Reproduced under CC license from doi:10.1371/journal.pcbi.1000648.g001. Click for original image.

Step 1 (A-C in the figure above): identify the known target protein. Now pick all the protein structures that are not similar to it. Why those that are not similar? Similar proteins could be potential drug targets, since they have a similar shape to the known target protein. But here they are interested in finding targets from proteins that are of a different shape, and have no homology to the known target protein: secondary targets.

Step 2 (D): take this set of dissimilar proteins, and look for binding site similarities. Binding sites are clefts in the protein that may bind drugs. If those clefts are similar in shape to the cleft in the known target proteins, they may bind the drug. Leave only those non-homologous proteins that have similar binding sites (D in the figure)

Step 3 (E): Now add all the proteins that are homologous to the set generated in step 2. This increases the number of possible targets to homologs of the proteins that were initially selected only for binding site similarity.

Step 4 (F): take the drug molecule, and try to dock it to the various protein structures. Rank the druggability of each protein according to the score provided by the drug docking software (Autodock).

Experimental verification

Now for the cool part. Durrant and colleagues tested  the computational prediction in the lab, using the compound NSC-45208. NSC-45208 inhibits a protein responsible for RNA processing in Trypanosoma brucei. So we know it is a potential drug against African sleeping sickness. What else is it good for?

(NSC-45208), 4,5-dihydroxy-3-(1-naphthyldiazenyl)-2,7 -naphthalenedisulfonic acid, a recently discovered inhibitor of T. brucei RNA editing ligase 1 (TbREL1)

“The predicted secondary targets that gave the best docking scores, H. sapiens mitochondrial 2-enoyl thioester reductase (HsETR1), T. brucei UDP-galactose 4′ epimerase (TbGalE), H. sapiens phosphodiesterase 9A (HsPDE9A2), and Streptococcus pneumoniae teichoic acid phosphorylcholine esterase (SpPce), were subsequently tested experimentally.”

Durrant and his colleagues tested their predictions that NSC-45208 also binds to two human proteins (HsETR1 and HsPDE9A2), one additional Trypansome protein (TbGalE), and a bacterial (Streptococcus pneumoniae) protein (SpPce). Not only binds, but also inhibits their enzymatic activity. Their predictions worked well on the top two predicted targets: NSC-45308 inhibited the enzymatic activity of HsETR1and TbGalE, but HsPDE9A2 and SpPce were not affected by the drug.

At first blush, this does not seem to be much of a batting average: two positives and two false positives. But we have to remember that the experimental verification of these predictions — even if you have a good enzymatic assay to check predictions– can be very time- and resource consuming. An exhaustive test of the predictions for several compounds and many target enzymes is still not possible. As initial proof-of-principle, this work goes much farther than most other joint experimental / computational works I have read. The authors also go into lengthy and detailed discussions on limitations and improvements, as well as on another form of non-specific inhibition of the secondary targets, makes for a really interesting read on polypharmacological considerations in drug screening.


Durrant, J., Amaro, R., Xie, L., Urbaniak, M., Ferguson, M., Haapalainen, A., Chen, Z., Di Guilmi, A., Wunder, F., Bourne, P., & McCammon, J. (2010). A Multidimensional Strategy to Detect Polypharmacological Targets in the Absence of Structural and Sequence Homology PLoS Computational Biology, 6 (1) DOI: 10.1371/journal.pcbi.1000648

Poll: favorite kingdom?

January 22nd, 2010 6 comments

Seems like I had a couple of taxonomic posts recently.  So BsB would like to know what is your favorite Kingdom of life. I used the 6 kingdom system just to make things slightly more interesting. Vote in the pollbox on the righthand sidebar. You only have yourself to blame if your favorite kingdom is not represented adequately.

Categories: Taxonomy Tags: ,

Social networking for taxonomists

January 21st, 2010 2 comments

ResearchBlogging.org

Despite our best attempts to remove species from the face of the Earth, there is still quite a bit of life out there and it is still quite diverse. Also, there are still quite a few people who want to document, describe and make the rest of us aware of the magnitude and diversity of life. In the time-honored tradition of natural history they take the tools of their trade to the field to sample, collect, identify, tag, photograph, film, mount and otherwise understand and advertise Nature’s progeny.  There are probably thousands of web sites dedicated to taxonomy of families, genuses genera and species. Trouble is, they are all “out there”, and require quite a bit of searching to find those that are of interest to you. And what if you just want to browse, the Internet equivalent of going through a natural history museum?

Museums were historically the place where natural historians brought their findings from far away places to be studied and displayed (if lucky) or simply archived. If the finding was still alive, it would go to the zoo or the botanical garden.  And people would go to the museum, zoo or garden to see the  latest wonders of life.

(From Life Through a Lens: the Natural History museum (of London) 1880-1950)

Virtually every biology lab with some taxonomic interest in  a given species, genus or family would have such a site. That’s a lot of taxonomic information , but is not very accessible or organized! What if all these sites could be rounded up together, connected in some contextual fashion, some semblance of standardization? Then we could have an online Natural History museum, where the displays accumulate, are constantly updated, and we can walk through virtual halls looking for the plant or animal of our fancy.

Scratchpads is a cool way of getting to that online Natural History museum. Actually, it is hosted in the great-grandaddy of brick-and-mortar natural history museums, the one in London.

Scratchpads are an easy to use, social networking application that enable communities of researchers to manage, share and publish taxonomic data online. Sites are hosted at the Natural History Museum London, and offered free to any scientist that completes an online registration form. Key features of the Scratchpads include tools to manage:

Taxonomy Classifications Phylogeny Phylogenies
Literature Bibliographies Documents Documents
Images Image galleries Custom Data Custom data
Specimens Specimen records Simple Maps Maps

Users control who has access to content, which is published on the site under Creative Commons (by-nc-sa) license.

Data added to a Scratchpad are automatically classified and grouped around a taxonomy that is supplied by the users or imported from EOL. This is optionally supplemented with information from high quality web accessible databases, to automatic construct content rich web pages about any documented taxon. Currently these sources include Genbank, GBIF, Biodiversity Heritage Library, Yahoo! Images, flickr, Google Scholar and Wikipedia.

When you start a Scratchpad site, you are given a template on which you can build your taxonomy site. the template includes a tree-building widget, space for image galleries, a blog area, a bibliography area etc. You insert the content, and your site links back to the scratchpad mother site.  The Scratchpads linked from the mother site are in various stages of construction, which is actually pretty cool as you can see a village of taxonomy sites being constructed. Some are fairly complete and interesting to browse through, like the Scratchpad about freeloader flies or the one about nanofossils.

Time will tell if Scratchpads will catch on with taxonomists. I certainly hope it does. It seems  quite a bit of thought has gone into balancing standardization with creativity, and making scratchpading easy and manageabe. The authors have chosen templates based on the Drupal content management system, which has a very gentle learning curve. I am waiting on my own scratchpad account, where I intend to… well, you’ll see it on this blog eventually, I hope.

So if you are involved in taxonomy in some way, consider getting your own Scratchpad. This could make for a great student project, by the way.  If you know someone who is a taxonomist, pass this information on to them.


Smith VS, Rycroft SD, Harman KT, Scott B, & Roberts D (2009). Scratchpads: a data-publishing framework to build, share and manage information on the diversity of life. BMC bioinformatics, 10 Suppl 14 PMID: 19900302

Fold.it: wasting time in a good cause

January 16th, 2010 3 comments

I just spent the better part of a Saturday playing with Foldit. Foldit is an ongoing experiment in finding protein structures by harnessing the power of the mob – or gamers, as is the case here. The player is presented with a backbone & sidechain configuration, with the secondary structures mostly pre-determined. The problem is to get the protein to fold into the correct conformation. You can tweak the secondary structures, rubberband the beta strands together into sheets and rotate the sidechains. The residues are colored by hydrophobicity, so you know who should be facing where. The 23 tutorial cases walk you through the  simple yet powerful interface to folding the structures. You can rotate helices, rubberband strands, flip sheets, etc. The interface gives you feedback on sidechain clashes and voids in the structure among other things.  The examples also teach you the basic of protein structural considerations: maximize hydrogen bonds, hydrophobic side chains should be buried in the structure, strands should combine to sheets, and so on. When you feel you are ready, you can start solving the various folding puzzles presented online. You can work solo, or as part of a team. The “correct solution” is, of course, unknown. The best you can do is accumulate as many points as you can, which represent how stable is the conformation you are building. Foldit, like many other cool things in structural biology, is the product of David Baker’s lab at the University of Washington. Here is a video from the YouTube UWFoldit channel showing the coolness of it all. If you decide to get your fold on, make sure you can make the time. It’s flippin’ addictive.

Newton’s birthday and crop diversity

January 4th, 2010 No comments

Today is the 366th birthday of Sir Isaac Newton. Formulator of the three laws of motion, the theory of gravity, inventor of the first reflecting telescope, theory of color, calculus (with due credit to Gottfried Leibniz), the generalized binomial theorem, and president of the Royal Society.

Newton in a 1702 portrait by Godfrey Kneller

All which ties in directly to retail, and biodiversity. Huh?

Co-operative Farms (UK) recently bought 1,000  rare and endangered apple varieties, with colorful names like Great Expectations, Fairie Queen, Northern Spy, Forty Shilling, Duck’s Bill and Bloody Ploughman. (I wonder how the last name came to be; actually, maybe I shouldn’t.) This also includes Isaac’s Newton’s Tree: the apple variety cultivated from the descendants of the tree which inspired Newton to formulate the Theory of Gravity. Many of those apples were dessert apples, but some fell out of favor the strains were no longer grown, threatening to disappear.

Co-op Farms are bottling them up as the “Truly Irresistible Tillington 1,000″ pressed apple juice. I think it is great that a retail chain is funding crop diversity and finding a way to make some money in the process. Although with 7,500 cultivars worldwide, apples as are not exactly under an extinction threat. But there is also the matter of food variety, cultural heritage and, of course, preserving the history of physics. Or bottling it up, whichever the case may be.

Also, fruit, including apples, are important in the small-arms industry: