Byte Size Biology

It’s a small (RNA) world after all

By Iddo on June 29th, 2010

The central dogma of molecular biology (edit: the sequence hypothesis; thanks for setting me straight, Kamel!), as formulated 57 years ago, was simple: DNA is transcribed to mRNA, and mRNA is translated to proteins. Proteins are the business end of this process. mRNA is only the messenger: its sole function is to deliver information from the template (DNA) to the business end (protein). It was thought to have no more function than a fax, email or instruction manual. Well, except for tRNAs, which carry amino acids to the ribosome. Oh, and ribosomal RNA, which helps the specific binding of tRNA to the ribosomal complex, uh, yes, somehow.

Credit: Daniel Horspool. Wikimedia Commons

This rather convenient view of RNA’s role in life has been repeatedly assaulted. The discovery that RNA can be catalytic (ribozymes) was the most dramatic change. It transformed our perception of the ribosome from a protein complex with some RNA to an active RNA complex with some protein. Since RNA was understood to both carry information and perform action, it led to the hypothesis of the RNA World: the primitive scaffold of molecular replicators and enzymes composed solely of RNA, later to be joined by DNA and protein. The supporters of the RNA World hypothesis see the evidence for it all around: riboswitches, which are mRNA molecules that regulate their own activity by binding to small molecules; small interfering RNAs and micro RNAs, both RNA species that silence or help degrade mRNA. Small nucleolar RNA act to process tRNA and rRNA and their cousins, small nuclear RNA. All showing that the role of RNA in life is diverse, far-reaching and critical.

Last week a study published in Nature added yet another role to RNA, and from an unexpected source: pseudogenes. Pseudogenes are generally considered to be vestigial, the remains of once active genes. The DNA sequence of a pseudogene resembles that of a gene, but owing to any of several reasons such as early termination, mutations or the inability to transcribe, pseudogenes are lying about the genome’s junkyard, accumulating rust and slowly being phased out of the genome. Until recently, pseudogenes were interesting mostly to evolutionary biologists. The reason is that, since pseudogenes share ancestry with active genes, they can inform us of the evolutionary history of the genome: a sort of molecular fossil or molecular-archaeological ruin, if you will. Informing us about the past, but not having much of a function at present.

But apparently there is more to pseudogenes than just being archaeological genome dig markers, and actually they regulate the expression of their kin “real” genes. “Real” being in quotes, because suddenly pseudogenes are “real” too — they are actually doing something, not just rusting away in the genome. And performing in a very interesting way, too.

PTEN is a gene whose protein product is a tumor suppressor. In normal cells, its expression is tightly regulated, since even changes in the amount of protein produced might cause cells to turn cancerous. PTEN is regulated by several types of microRNAs, or miRNAs. miRNAs are non-coding RNAs, approximately 22 nucleotides long, that can control gene expression by interacting with their target mRNA. So miRNA molecules interact with PTEN mRNA. Each PTEN mRNA that interacts with PTEN-specific miRNA gets knocked out of circulation, lowering the number of PTEN proteins and increasing the chances of a cell turning cancerous. (Don’t confuse “miRNA” and “mRNA”!)

PTEN1 is a pseudogene which shares a very recent common ancestor with PTEN. A mutation in PTEN1 prevents it from being translated into a protein product, but it can still be transcribed to PTEN1 mRNA. Laura Poliseno and her colleagues have shown that PTEN1 mRNA, being very similar in sequence to PTEN mRNA, attracts the miRNA molecules that target PTEN mRNA. In other words, PTEN1 mRNA lures PTEN-specific miRNA molecules away from PTEN mRNA, lowering the number of inactivated PTEN mRNAs.

PTEN1 acts as a decoy for PTEN-specific miRNAs. (A): transcription and translation of the PTEN gene into the tumor-suppressing PTEN protein. (B): PTEN translation inhibited by miRNA. A high enough concentration of miRNA can lower the cell PTEN protein concentration below a threshold, inducing cancer. (C) The mRNA of the PTEN1 pseudogene attracts PTEN miRNA, effectively lowering its concentration in the cell, enabling PTEN protein production

So PTEN1 is not really a relic, but rather an active tumor suppression gene, on the mRNA level.
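To make the decoy arithmetic concrete, here is a toy sketch in Python. It is not the model from the paper: it assumes that each miRNA molecule silences exactly one transcript and that it picks PTEN or PTEN1 mRNA in proportion to their abundance. The function names and all the molecule counts are made up for illustration.

```python
def silenced_pten(mirna, pten_mrna, decoy_mrna):
    """Toy estimate of how many PTEN mRNAs get silenced.

    Assumption (not from the paper): every miRNA binds one transcript,
    choosing PTEN or decoy mRNA in proportion to their abundance.
    """
    total_targets = pten_mrna + decoy_mrna
    if total_targets == 0:
        return 0.0
    bound_to_pten = mirna * pten_mrna / total_targets
    # Cannot silence more PTEN transcripts than actually exist.
    return min(bound_to_pten, pten_mrna)


def free_pten(mirna, pten_mrna, decoy_mrna):
    """PTEN mRNAs left available for translation in the toy model."""
    return pten_mrna - silenced_pten(mirna, pten_mrna, decoy_mrna)


# Hypothetical counts: 50 miRNA molecules, 100 PTEN mRNAs.
print(free_pten(50, 100, 0))    # no decoy: 50.0 PTEN mRNAs left
print(free_pten(50, 100, 100))  # equal amount of decoy: 75.0 left
```

In this sketch, adding as much decoy as there is PTEN mRNA halves the number of silenced PTEN transcripts. Real miRNA binding kinetics are certainly messier, but the direction of the effect is the one described above.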

Is this PTEN/PTEN1 relationship unique? Apparently not. Looking through the mouse genome, Poliseno and colleagues have found other pseudogenes homologous to active genes, with possible miRNA binding sites. Indeed, they have shown that the same mechanism holds for the KRAS gene and its pseudogene, KRAS1. KRAS is a cancer-causing gene, or oncogene, and KRAS1, acting as a miRNA decoy, enhances KRAS expression. Therefore KRAS1 is also an oncogene. The authors suggest a new general model for mRNA-mediated biology, which they call competitive endogenous RNA, or ceRNA.

The implications of this finding go beyond tumor suppression control. The number of pseudogenes in a typical animal genome meets or exceeds that of regular genes. It may very well be that many of them are functional on the RNA level, which offers a whole new outlook on what a gene is, what a pseudogene is, and how they function and exert control in the cell. It will take a while for the implications of this study to sink in, but it seems that Poliseno and her colleagues have just opened up a whole new subfield in molecular biology, with a whole new set of roles for RNA.

Oh, and if you were expecting a video of one of the most annoying songs ever, forget it.

Poliseno, L., Salmena, L., Zhang, J., Carver, B., Haveman, W., & Pandolfi, P. (2010). A coding-independent function of gene and pseudogene mRNAs regulates tumour biology. Nature, 465 (7301), 1033-1038. DOI: 10.1038/nature09144

Funny, Science Comments turned off

Science as Middle-Earth

By Iddo on June 26th, 2010

From Abstruse Goose. I like it that Biology is in Mirkwood, and that Bioinformatics is on the left bank of Anduin while CS is on the right.

I would have put Botany in Fangorn (because of the Ents), Microbiology in the Sea of Rhûn for beyond it are “wide uncharted lands, nameless plains, and forests unexplored” and machine learning in the Misty Mountains (close enough to Bioinformatics, statistics and Computer Science).

Also, if the social sciences are in Mordor, what does that say about Sauron?

Blues Comments turned off

Sonny Moorman’s group at Oxford

By Iddo on June 25th, 2010

This Cincinnati group played a concert today (Thursday) at the uptown park in Oxford, Ohio. A trio with Sonny on guitar, Dennis “Willy D” Williams on bass & vocals, and Dave Fair on drums & vocals. They played powerful electric blues-rock, with great covers of Blind Willie McTell, Robert Johnson and many originals. Sonny played a variety of guitars, switching between a slide lap guitar, a Dobro and a Gibson Flying V. At one point he stepped off stage playing his lap guitar and let the children in the audience walk up to him and see him play. Strong blues played by a friendly band on a beautiful summer night. A great way to end the week.

Sonny showing his pickings to the younger audience at uptown Oxford, OH

Playing Crossroads Motel. Not sure where or when this was recorded:

Ecology Comments turned off

Scary

By Iddo on June 23rd, 2010

Image acquired June 19, 2010. Source: earthobservatory.nasa.gov

Genomics, Health Comments turned off

Celebromics? HeavyMetalomics? Advertomics? Anniversomics!

By Iddo on June 22nd, 2010

René Goscinny would probably have done a better job of naming the new trend of personal genomics (genomix?) companies sequencing celebrities’ genomes. Heck, we might have even done Obelix’s and Asterix’s genomes to find out if Obelix can drink the magic potion without Getafix’s (Panoramix’s) admonishments that it might do him harm, or to find out how Asterix can control the feathers in his helmet. I believe that any good, comical or neutral publicity for genomics is a good thing. Hence, last week’s headline in The Sunday Times announcing that “Genetics to solve why Ozzy Osbourne is still alive” is still a good thing, tongue-in-lacerated-cheek attitude notwithstanding. If something is revealed about Ozzy’s robustness to decades of substance abuse that helps the rest of us, so much the better. The more likely case would be that Ozzy’s genomic makeup will reveal certain irregularities, which may or may not be correlated with his remarkable tolerance for alcohol, drugs and the consumption of raw animals on stage.

Trust me. I founded Black Sabbath (ozzyosbourne.com)

I must add that in my humble opinion making Ozzy the Sunday Times’s health advice columnist might be taking things a bit too far. (Next they’ll be putting Hannibal Lecter in charge of the personal relationships advice column.)

"Dear Distraught. I suggest your sister-in-law and you resolve your differences by dining on your husband's liver with a bottle of fine Chianti and some fava beans." (Photo credit: Wikimedia Commons)

There is quite a bit of public FUD surrounding genomics. Most people get their science information from the mainstream media and the movie industry. The mainstream media and the movie industry tend to highlight, and in many cases exaggerate, the catastrophic potential of any new technological or scientific achievement. This attitude draws attention and sells, but does not educate. Jurassic Park and GATTACA-like dystopias are easier to wrap into an entertaining package than the equally dramatic but somewhat more cerebral The Genome Wars. Also, expectations from scientists are set high (perhaps because in the movies they solve everything in 2 hours flat), such as those that were set for the human genome. It is therefore not too surprising that the New York Times ran an editorial and two stories that expressed a guarded disappointment with the lack of benefits from the human genome project at its 10th “anniversary”. The complaint is that we were promised the cure for cancer and for Alzheimer’s disease, but in most cases we get a muddle of hundreds of variants that may or may not be linked to any disease. We were hoping for a clear roadmap of the genetic cause of diseases, but instead we got an intractable Gordian knot of hundreds of common variants and environmental causes. It is becoming clear that the genetic contribution to many diseases, when it does exist, lies with rare genomic variants, those that can only be found by sequencing a large number of whole genomes, which is only now becoming economically feasible.

"By Toutatis! We will be able to sequence our genomes for 1,000 sestertii in three years! Long live Genomix!"

Make no mistake: the human genome project is undoubtedly one of the greatest scientific achievements ever accomplished. Actually, let’s put that in context: the field of genomics, human and otherwise, is one of the greatest scientific achievements ever accomplished. As with most great scientific achievements (heliocentricity, relativity, quantum mechanics or evolution), it is the shift in the way we perceive nature that is the great achievement, rather than the tangible benefits that inevitably follow, even if those lag behind. With relativity, we learned that the speed of light is absolute, time can dilate, energy and mass are transmutable. The tangibles such as Global Positioning Systems and nuclear energy followed only decades later. With genomics, we learned to view an organism not only in terms of its anatomy, physiology and biochemistry (i.e. what it does) but in terms of its genetic makeup: what it can do, and how it relates to other organisms. This is a deep and fundamental change to life science, and advertising the grandness of this achievement, as well as realizing the benefits, will take time.

Some tangibles already exist: genetic assays to quickly identify pathogen variants for correct antibiotic treatment; identification of common cancer or the proper dosing of an anticoagulant drug. Many more are in the works.

In the meantime, you might want to read Neil Saunders’s post on his experience with personal genomics (which should really be named “personal common Haplotype analysis”). Not quite the cure for Alzheimer’s yet, but it does not detract from the amazing achievement of being able to learn fundamental things about yourself that you could not have learned a decade ago. Also, Mike the Mad Biologist writes why human sequencing should be regulated, as many other diagnostics are.

Finally, we need milestones for achievements, so a happy 10th anniversary of the sequencing of the first human genome. And when criticizing the achievements (or lack thereof) of genomics, remember that scientists are not Miracle Men.

Bioinformatics, open source software 3 comments

Bioinformatics Open Source Conference 2010 (and a poll)

By Iddo on June 14th, 2010

The 11th Annual Bioinformatics Open Source Conference (BOSC) 2010 is coming up in Boston, July 9-10 2010. The BOSC meetings are a great get-together of a community of programmers who are like-minded in their advocacy of open source code for science, and specifically for bioinformatics. The whole thing is run by volunteers who take a lot of time and effort to bring a top-notch meeting every year, so a big thanks to this year’s organizing committee!

If you are reading this, and you are in Boston on those dates, consider showing up, it is a great experience. There will also be a codefest on the two days before the meeting. This year’s topic is cloud computing for bioinformatics. If you like using AWS for bioinformatics or if you want to learn more, this is your chance. Amazon have provided a grant towards this codefest. (Thanks!) Biopython, Bioperl, Biojava and Bioruby developers will all be there, tailoring code to the cloud.

Which brings me to the latest poll: if you are a bioinformatics programmer, which of the Bio* packages are you using in your programming, if any? If more than one, check the one you use most frequently. Poll answers on the right. As with all Internet polls, you must be crazy if you take it at all seriously.

Biochemistry, Evolution 5 comments

Protein function, promiscuity, moonlighting and philosophy

By Iddo on June 12th, 2010

I recently received an email from a graduate student in Philosophy regarding protein function. Not sure if that person wants his name advertised, so I will keep it to myself.

“I am a fan of your blog, and interested in the philosophy of biology. One particularly interesting question is what makes something have a function; when it comes to artifacts, we just check with whoever designed the thing. It gets more complicated when functions change, and things are used for purposes other than what they were originally designed for, but it’s still pretty straightforward. However, biological functions can’t go that route (unless maybe one is a fan of intelligent design). I’m curious what you think about this, after seeing you mention your interest in predicting the function of genes and proteins. Is the function of something just the causal role that it plays in some larger mechanism? Do you have to include evolutionary considerations? If you ever have the time, I’d love to hear your thoughts about this.”

Thanks very much

My rather rambling answer follows:
“Ouff, you’ve opened a pretty big can of worms, which many of us are having a problem with.

Function in biology is context dependent. An enzyme catalyzes a biochemical reaction, say, removing a phosphate molecule from a protein. However, by removing that phosphate from the protein, the enzyme changes something in the function of the cell, as phosphate molecules are the ‘signaling currency’ of the cell. So the enzyme fulfills a cellular function as well. Finally, suppose this cell is in a developing embryo, and the phosphate removal from this type of protein in many cells catalyzes the creation of a limb, or a particular organ or tissue: now we have a whole organismal functional context. Which one of those, the biochemical, cellular or organismal, is the ‘real’ function of the enzyme? Well, obviously all three are ‘real’.

To add a twist, suppose that this enzyme is also active in removing phosphates from proteins in the adult animal. Now the animal has reached maturity, and because of a mutation in one of the cells that enzyme does not work anymore. The intra-cellular signaling becomes defective and the protein accumulates in its ‘phosphorylated’ form. This signals a division of the cell, and suddenly you have a pre-cancerous situation. So from a health point of view, this mutant plays a role in the survival and proliferation of cancer cells. Interestingly, a protein that causes our spittle to froth (don’t try doing this around other people, gross) was first discovered as a nasopharyngeal cancer associated protein, and it is named as such. Many genes and proteins are named after they are found to do one thing, even though we generally associate them with something else, simply because of the context in which they were discovered.

Also, there are moonlighting proteins, which may simply perform different functions. A protein called APIS is part of the proteasome: a cellular protein shredder which is itself a rather large protein complex. APIS also plays a role in transcribing DNA to RNA: thus, it is part of a protein creation complex, and of a destruction complex. See this short paper on Moonlighting proteins.

Yes, evolutionary considerations always come into play; evolution is the lens through which we examine all biological phenomena. Evolution does cause certain proteins to be ‘multi-purposed’. Also, some types of protein structures are more amenable to a certain set of functions than others. Furthermore, certain proteins are ‘promiscuous’: certain enzymes may work on more than a single substrate (“Promiscuous” is different from “moonlighting”, where enzymes do completely different jobs; being “promiscuous” means a single enzyme does the same thing, but with different partners: i.e. catalyze the destruction of a sugar, but with different types of sugar molecules). Promiscuous enzymes can clearly show a ‘trajectory of evolution’, i.e. going from being very specific for one substrate to non-specific for several substrates (or vice-versa). Promiscuity is a good example of molecular adaptation and tradeoff: a promiscuous enzyme means you have a jack-of-all-trades in your genomic complement, and you have to spend less energy on controlling the production of several different enzymes for several different tasks. However, the flipside of having a jack of all trades is that he is the master of none: the catalysis reactions are generally less efficient, which may cause problems for the cell/organism.

Phew, I hope I managed to convey some of the complexities of this issue, and how we try to deal with them in a systematic fashion.
[… edited out]

Cheers,

me”

The difference between moonlighting...

...and promiscuous

Khersonsky O, Roodveldt C, & Tawfik DS (2006). Enzyme promiscuity: evolutionary and mechanistic aspects. Current opinion in chemical biology, 10 (5), 498-508 PMID: 16939713

Jeffery, C. (2003). Moonlighting proteins: old proteins learning new tricks. Trends in Genetics, 19 (8), 415-417. DOI: 10.1016/S0168-9525(03)00167-7

Jazz Comments turned off

Black Cat Zoot – No Swingin’ In Your Walkin’

By Iddo on June 10th, 2010

Once again, Black Cat Zoot, whose walkin’ is always swingin’:

Funny Comments turned off

What is the function of a necktie?

By Iddo on June 9th, 2010

From a random bulletin board…

Biochemistry, Bioinformatics, Genomics, Microbiology Comments turned off

Computational Bridge to Experiments

By Iddo on June 8th, 2010

A bit of background information: this is a meeting I am really happy to be part of, and even more so honored to be a co-organizer. One of my main scientific interests is the prediction of the function of genes and proteins of unknown function.

Some background information: we have sequenced more than 1000 genomes of microbes, and hundreds of plants and animals. Additionally, we have millions of partial DNA sequences, RNA sequences, proteins, genomic fragments and millions of genes sequenced from metagenomic data. Problem: for most of these sequenced genes, we do not know what they are doing. That’s right: most of the sequence data that we have is just that: data. Not information. We are amassing an ever-growing collection of books that are written in a mostly incomprehensible language. We know (or “educatedly guess”) where the words in those books (the genes) are located, because we have sequence signals that indicate where the bits of the DNA that code for genes are. For some of the words, we know the meaning. But in many cases (and by some estimates in most cases) we fail to understand the meaning of the words (genes) in those books (genomes). Drawing further on the book<–>genome and gene<–>word metaphor, we sometimes know one meaning of a word, but we all know that words in human languages can hold different meanings, depending on context. “Whatever floats your boat” can be read literally, but more often this particular collection of words in this order is a figure of speech. The same thing goes for genes: a gene may code for a certain enzyme, catalyzing a simple chemical reaction. But in another context, it may perform a developmental function for the whole organism, which has different implications than just the biochemical level.

Where's one of those when we need them?

We can’t just rely on computational means to find out what’s doing what. Bioinformatics can help us annotate genes that are similar to those already discovered, and in some cases give us new insights into the function of unknown genes. But for truly novel functions, and to know whether our boat is real or a metaphor for “what works best”, we may need to run experiments. And we need a good collaboration between those who do the computational work and those who do the experimental work in identifying which are the most important books to look at, and what words in them we need to decipher first.

The COMBREX meeting aims to start this large-scale and long-term decoding, a collaboration between experimentalists and computational biologists.
Note that the COMBREX workshop is part of the larger Microbial Genomics meeting at Lake Arrowhead, California.

Here is the announcement. Feel free to cut & paste and forward:

Announcing the first COMBREX Workshop for Computational and Experimental Determination of Protein Function. September 15, 2010 Lake Arrowhead, California USA

COMBREX (Computational Bridge to Experiments) is a new NIH funded effort that aims to increase the pace of experimental determination of the function of large and high priority gene families in bacterial genomes. The principal investigators are Richard Roberts (New England Biolabs), Simon Kasif (Boston University) and Martin Steffen (Boston University). This effort will form a consortium of experimental and computational biologists that would collaborate directly to test the predicted functions or specificity of high-priority genes.

Central to this effort would be the creation of a community web-based database that will allow computational and experimental scientists to communicate easily and assist experimentalists in identifying high-priority genes with high-quality computational predictions. Experimentalists will be able to submit bids (proposals) to validate individual predictions, and if successful, will receive modest funding from COMBREX to perform the validation.

The website can be found at http://combrex.bu.edu/ .

A workshop to discuss issues related to the formation and operation of COMBREX will take place on Wednesday, September 15, 2010, as part of the 18th Annual International Meeting on Microbial Genomics at Lake Arrowhead, CA, outside of Los Angeles. A preliminary program can be found at http://www.mimg.ucla.edu/arrowhead2010/program.html (COMBREX is formerly SciBay). Confirmed speakers include Richard Roberts, Simon Kasif, Manuel Ferrer (CSIC, Madrid), Patricia Babbit (UCSF), John Gerlt (Illinois), Peter Karp (SRI), Alexander Yakunin (Toronto), Steven Brenner (UC Berkeley) and Bruno Sobral (Virginia Tech).

The morning session will provide an overview of COMBREX, including both the experimental and computational challenges, related talks, and a description of topics to be discussed by breakout groups. These groups will convene in the afternoon to discuss the topics and prepare a short summary, for presentation to the entire workshop after dinner.

Topics to be discussed by the breakout groups will roughly divide into the following areas: (1) whole genome annotation, (2) assessment of computational predictions, (3) use of structure to predict function, and (4) infrastructure for function annotation. General topics to be discussed include:

1. How to prioritize predictions?

2. How to evaluate experimental bids?

3. How to handle non-enzymatic proteins?

4. How best to handle predictions/phenotypes from high-throughput experimentation?

A key desired outcome of the workshop is the identification of opportunities and the catalysis of collaborations between computational and experimental biologists.

We hope you will be able to join us for this event. You can register at: http://www.mimg.ucla.edu/arrowhead2010/registration.html

For further information please contact the organizers:

Co-chairs: Martin Steffen, Boston University, steffen ‘at’ bu ‘dot’ edu
Iddo Friedberg, Miami University, i.friedberg ‘at’ muohio ‘dot’ edu

Steering Committee: Simon Kasif and Richard J. Roberts

Physics Comments turned off

Awesomest Cola & Mentos yet

By Iddo on June 7th, 2010

Yeah, yeah, Cola & Mentos videos are getting somewhat tired. Still, this one really goes overboard:

Ha! Now how does the Cola & Mentos reaction work?

Well, first, the Cola & Mentos thing is a physical reaction, more than a chemical one: it happens mainly due to nucleation sites provided by the pitted surface of the Mentos candy. This allows for bubbles to form quickly. The candy sinks to the bottom, so the pressure from the gas forming at the bottom of the bottle pushes the liquid up, rather violently. Gum arabic (in the Mentos candy) and aspartame (in the Diet Cola) also help the reaction: Diet Cola works better than regular. Gelatin and gum arabic from the dissolving candy break the surface tension, letting bubbles form faster. This paper in the American Journal of Physics actually has surface pictures of Fruit Mentos and Mint Mentos taken with a scanning electron microscope. They checked the nucleation capabilities of both candies, under different conditions, in Diet Coke, Caffeine Free Diet Coke, Coca-Cola Classic, Caffeine Free Coca-Cola Classic, seltzer water, seltzer water with potassium benzoate added, seltzer water with aspartame added, tonic water, and diet tonic water. They also used different nucleation surfaces, including Mint Mentos, Fruit Mentos, a mixture of Dawn dishwashing detergent and water, playground sand, table salt, rock salt, Wint-o-Green Lifesavers, a mixture of baking soda and water, liquid gum arabic, and molecular sieve beads. They found that the least amount of work needed to create the bubbles was in diet, caffeinated cola. The best nucleation sites were formed on Mentos (no difference found between the Mint and Fruit Mentos).

SEM images of Mint Mentos [(a) and (c)] and Fruit Mentos with a candy coating [(b) and (d)]. The scale bars in each image represent the lengths (a) 200 µm, (b) 100 µm, (c) 20 µm, and (d) 20 µm. The images were acquired with a beam energy of 12.5 kV and a spot size of 5.0 nm. The lower magnification image of the Fruit Mentos has smooth patches in contrast to the lower magnification image of the Mint Mentos, but the candy coating is not uniform. The higher magnification image of the Fruit Mentos is zoomed in on one of the rougher patches

Coffey, T. (2008). Diet Coke and Mentos: What is really behind this physical reaction? American Journal of Physics, 76 (6). DOI: 10.1119/1.2888546

Bioinformatics, Genomics 1 comment

Closing gaps

By Iddo on May 30th, 2010

Geek alert: this post is for coders.
So you sequenced your genome, reached an optimally small number of contigs, they look sane, and now you would like to see what you need for the finishing stage. Namely, how many gaps you have and what their sizes are. UPDATE: “might just be worth clarifying this is for gaps in scaffolds produced by short-read de novo assemblers like Velvet, SOAPdenovo working on paired-end data. Rather than ‘gaps’ (technically more often unresolvable repeats than true gaps) between contigs which would require a different strategy to resolve lengths between.” (Thanks for this one Nick!)

This little script will help. Note that you need to download and install biopython for it to work.

Short version: to run this on a Un*x machine, you need to download the file, then:

chmod +x uncalled_base_stats.py

./uncalled_base_stats.py contig-filename.fasta

#!/usr/bin/env python
import sys
import re
from Bio import SeqIO
"""
input: a contig file, FASTA formatted.
output: a histogram of gaps in the contig file, and contig node names
output format:
L N [n1, n2,...,nk]
Where:
L is the length of the gap
N is the number of gaps of that length
The list in square brackets holds the ids of the sequences (nodes) where gaps of
this length are found
"""

def uncalled_base_stats(seq_record):
    histo = {}
    for n_match in re.finditer('N+', str(seq_record.seq)):  # str() replaces the removed Seq.tostring()
        n_len = n_match.end()- n_match.start()
        histo[n_len] = histo.get(n_len,0) + 1
    return histo

if __name__ == '__main__':
    total_histo = {}
    contig_ids = {}
    for seq_record in SeqIO.parse(open(sys.argv[1]), "fasta"):
        histo = uncalled_base_stats(seq_record)
        for i in histo:
            total_histo[i] = total_histo.get(i,0) + histo[i]
            contig_ids.setdefault(i,[]).append(seq_record.id)
    gap_sizes = list(total_histo.keys())  # a real list, so it can be sorted in Python 3
    gap_sizes.sort()
    for i in gap_sizes:
        print(i, total_histo[i], contig_ids[i])

Line 1: Linux shell magic line for an executable file. You can omit this if you are using another OS.

Lines 2-4: necessary imports, including biopython.

Lines 17-22: the function uncalled_base_stats accepts a biopython sequence object (a contig) and returns a histogram in the form of a dictionary (a Python associative array) called histo. The dictionary keys are the gap lengths and the values are the number of gaps of each length in that contig. For example:

{5:10, 17:3}

means that there are 10 gaps of length 5, and 3 gaps of length 17.

Line 18: initiate the dictionary histo

Line 19: here we create an iterator that loops through matches of the regular expression for finding one or more “N” characters, “N+”.

Line 20: n_match is a Python regexp match object, yielded by the re.finditer iterator for each match to the “N+” regexp. The .start() method gives the index of the first character of the match, and .end() gives the index of the position just after the match (this is Python indexing). The difference between n_match.end() and n_match.start() is therefore the length of the match, with no need to add one.

Line 21: add one to the count for this gap length in the histogram dictionary

Line 22: function returns the histogram dictionary

Line 24-35: the bit that gets executed when you run the program from your OS.

24-26: total_histo: the sum of the gap-length histograms over all the contigs; contig_ids: the ids of all the contigs that have a run of N of a certain length. The key is the gap length.

27: Loop on the contig FASTA formatted file. Each loop iteration processes a single contig record.

28: Find consecutive groups of gaps in the contig record. Put them in the histogram for a single record, as described in lines 17-22.

29-31: Pool the histogram for this specific contig into a histogram for all contigs named total_histo.

31: Another dictionary, contig_ids, has the gap sizes as keys, and an accumulated list of contig IDs having thoise gap sizes as values. Example: we have 3 contigs, whose ids are “contig_a”, “contig_b” and “contig_c”. contig_a has two gaps of lengths 10 and 15. contig_b has three gaps of 10, 15 and 30. contig_c has one gap of length 5. The dictionary contig_ids will look like:

contig_ids = {5: ['contig_c'],
              10: ['contig_a', 'contig_b'],
              15: ['contig_a', 'contig_b'],
              20: ['contig_b']}

32-33: gap_sizes is a sorted list of all gap sizes

34-35: print a histogram of all gap sizes, and the contigs involved. Using the example above, the output will look like:

5 1 ['contig_c']
10 2 ['contig_a', 'contig_b']
15 2 ['contig_a', 'contig_b']
20 1 ['contig_b']
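The pooling step can be reproduced with a short sketch. The assumptions here are mine: pool_gap_stats is a hypothetical helper, the contigs are a plain dictionary of made-up sequences (built to match the contig_ids dictionary shown above) rather than records read from a FASTA file with Biopython, and the code uses Python 3 syntax.

```python
import re

def pool_gap_stats(contigs):
    """Pool per-contig gap histograms, as the main block of the script does.

    contigs: dictionary mapping contig id -> sequence string (illustrative input).
    Returns (total_histo, contig_ids).
    """
    total_histo = {}
    contig_ids = {}
    for contig_id, sequence in contigs.items():
        # Histogram for a single contig: gap length -> number of gaps
        histo = {}
        for n_match in re.finditer("N+", sequence):
            n_len = n_match.end() - n_match.start()
            histo[n_len] = histo.get(n_len, 0) + 1
        # Pool into the totals and remember which contig had this gap size
        for n_len, count in histo.items():
            total_histo[n_len] = total_histo.get(n_len, 0) + count
            contig_ids.setdefault(n_len, []).append(contig_id)
    return total_histo, contig_ids

# Made-up contigs: contig_a has gaps of 10 and 15, contig_b of 10, 15 and 20,
# contig_c has a single gap of 5
contigs = {
    "contig_a": "ACGT" + "N" * 10 + "ACGT" + "N" * 15 + "ACGT",
    "contig_b": "AC" + "N" * 10 + "GT" + "N" * 15 + "AC" + "N" * 20 + "GT",
    "contig_c": "ACGT" + "N" * 5 + "ACGT",
}

total_histo, contig_ids = pool_gap_stats(contigs)
for size in sorted(total_histo):
    print(size, total_histo[size], contig_ids[size])
# 5 1 ['contig_c']
# 10 2 ['contig_a', 'contig_b']
# 15 2 ['contig_a', 'contig_b']
# 20 1 ['contig_b']
```

Note that a contig is listed once per gap size even if it contains several gaps of that size, while total_histo counts every gap.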

I originally wrote this script for a colleague who was worried about how many gaps he would have to close in the finishing process. It turned out to be a very popular bit of code…



A cure for Ebola?

By Iddo on May 29th, 2010

There are few infectious diseases as violent and as lethal as Ebola haemorrhagic fever. This terrible disease was first described in 1976 at a mission hospital at the Ebola river in Zaire (now the Democratic Republic of the Congo). The disease is 80% fatal; the victims die painfully from a literal meltdown of their organs. Because the disease is transmitted via direct contact with bodily fluids, and because it is relatively non-infectious during the incubation period, it does not spread as well as airborne agents such as the flu virus. Outbreaks are relatively short and contained, and there hasn’t been an Ebola pandemic. The different Ebola virus strains are classified as Category A bioterrorism agents: although not spread by aerosol, the viral particles can still be inhaled, and the violence of the disease, its rapid spread and the lack of a cure are worrying. Ebola is lethal in other primates besides humans: a study from 2006 claims that Ebola may have killed up to 5,000 gorillas.

Yesterday, a report was published in The Lancet that may herald a cure for the disease, and provide hope for a treatment for many other viral diseases. The researchers infected seven rhesus macaques with the lethal Zaire Ebola virus, and then injected them with synthetic RNA that specifically targets the virus. All but one of the infected monkeys survived and fully recovered.

Transmission electron micrograph of the Ebola virus. Credit: US Centers for Disease Control and Prevention

The interesting bit about the cure is how it works. The researchers used small interfering RNA, or siRNA. siRNA is part of RNA interference, a natural mechanism in plants and animals that interferes with the expression of specific RNA and causes its destruction in the cell. siRNA molecules are themselves short RNA molecules (21-23 bp long). They block other RNA molecules very specifically, without disrupting the whole cell. Think of a military censor who blacks out only very specific words in soldiers’ letters, those that may cause damage if taken any further, but leaves in the benign stuff. (OK, that was before email; I am not even sure there are military censors now.) RNA interference is a form of control of RNA expression in the cell, but it is also used to interfere with the expression of harmful alien RNA, such as viral RNA: viruses operate by hijacking the cellular machinery to make their own RNA and proteins, which is the reason why some of them cause diseases.

Now, siRNA can also be synthesized in the lab and delivered intravenously using lipid capsules. The capsules travel via the vascular system to the infected cells, deliver the siRNA into the cells and block the virus from making its own RNA. If done right, the synthetic siRNA blocks only the viral RNA, leaving the cellular RNA alone. Here the researchers targeted the RNA that codes for the viral polymerase protein: the very protein that the virus uses to copy its own RNA. So no polymerase RNA –> no polymerase protein –> no Ebola making factory. More than that, this is the first time that siRNA treatment has been shown to work in primate models of a human disease. So what we have here is a new class of antiviral drugs that may be used against other diseases.
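For the computationally inclined, the sequence specificity of siRNA can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the sequences are made up, the guide strand is only 8 nt long rather than 21-23, and exact string matching is a crude stand-in for how the antisense strand actually recognizes its target. The point is only that a guide strand pairs with an RNA where the RNA contains the guide's reverse complement, and leaves other RNA alone.

```python
# Watson-Crick pairing for RNA: A-U and G-C
COMPLEMENT = str.maketrans("ACGU", "UGCA")

def reverse_complement(rna):
    """Return the reverse complement of an RNA string."""
    return rna.translate(COMPLEMENT)[::-1]

def find_target_sites(guide, mrna):
    """Return the start indices in mrna that the guide strand can pair with.

    Toy model: a site is an exact occurrence of the guide's reverse complement.
    """
    site = reverse_complement(guide)
    positions = []
    start = mrna.find(site)
    while start != -1:
        positions.append(start)
        start = mrna.find(site, start + 1)
    return positions

# Made-up sequences, far shorter than real transcripts
guide = "GCUAACGU"                       # hypothetical antisense guide strand
viral_mrna = "AUGGCUACGUUAGCCGAUAA"      # contains the target site ACGUUAGC
cellular_mrna = "AUGCCCGGGAAAUUUCCCUAA"  # has no matching site

print(find_target_sites(guide, viral_mrna))     # [6]
print(find_target_sites(guide, cellular_mrna))  # []
```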

I seriously doubt we will see controlled human trials before the next natural outbreak, so this is as close as we can get to a proven treatment for Ebola. Finally, here is a short clip showing how siRNA works in the cell. Here the siRNA is injected, rather than delivered using lipid capsules, but the rest remains the same.

“Small interfering RNAs (siRNAs) are 21-23nt dsRNA (double-stranded RNA) molecules that facilitate potent and sequence-specific gene suppression via the mechanism of RNAi (RNA interference). siRNA pathway animation gives an idea on the mechanism of gene supression by siRNA. When introduced into cultured mammalian cells, siRNAs facilitate the degradation of mRNA sequences to which they are homologous; thereby silencing the encoding gene. The basic mechanism behind RNAi is the breaking of a dsRNA matching a specific gene sequence into short pieces of siRNA. These siRNAs post-transcriptionally silences a gene through mRNA degradation. mRNA silencing involves the chopping of long dsRNA into smaller pieces, corresponding to both sense and antisense strands of the target gene by the Rnase-III (Ribonuclease-III) family member, Dicer. Dicer chops dsRNA into two classes of smaller RNAs—miRNAs (microRNAs) and siRNAs. Dicer delivers these siRNAs to a group of proteins called the RISC (RNA-Inducing Silencing Complex), which uses the antisense strand of the siRNA to bind to and degrade the corresponding mRNA, resulting in gene silencing. siRNAs are associated with silencing triggered by transgenes, microinjected RNA, viruses, and transposons, and hence can be considered intermediaries in host defense pathways against foreign nucleic”

(From the caption of the Protein Lounge video http://www.proteinlounge.com)

Thomas W Geisbert, Amy CH Lee, Marjorie Robbins, Joan B Geisbert, Anna N Honko, Vandana Sood, Joshua C Johnson, Susan de Jong, Iran Tavakoli, Adam Judge, Lisa E Hensley, Ian MacLachlan (2010). Postexposure protection of non-human primates against a lethal Ebola virus challenge with RNA interference: a proof-of-concept study. The Lancet, 375 (9729), 1896-1905. doi:10.1016/S0140-6736(10)60357-1


Android apps for scientists

By Iddo on May 27th, 2010

A few science apps for the Android mobile phone operating system. Some of these I have, some I don’t, and some I really would like to check out. Feel free to add more that you know of in the comment section. Better yet, make a wish…

Science Literature:

AgileMedSearch searches the PubMed database. It is pretty much bare-bones: you can search for articles, read abstracts, and email details. PubMedMobile seems to have some more functionality, with a link to the article, if available, and more search parameters.


Chemistry:

Elements is a periodic table with more data than you will probably ever need. PubChemMobile allows you to search the PubChem database, a central repository of compounds. MolPad is a compound drawing app.


Lab:

AgileSciTools is a calculator set for biologists. Includes functions to determine molarity dilutions, cell dilutions, MOI calculations, and primer resuspension volumes. You can also count cells using the Laboratory Cell Counter.


Not exactly for scientists, but still interesting:

PersonalGenomics offers a side-by-side comparison of three personal genomics services, deCODEme, Navigenics and 23andMe, by loci and variants for the top 20 conditions. I wonder if any of those companies offer a mobile-friendly interface, or an app to read their results. Then again, why would you want to browse your own genome data from your phone?


Finally, there is the Star Trek Tricorder. It costs 0.99 €, but since it comes from the future, a paid app is not unreasonable.


Life on earth in 60 seconds

By Iddo on May 26th, 2010

Gives a nice temporal perspective.



The musings and ravings of a computational biologist about science, computers, music and, you know, stuff


Computational Bridge to Experiments

Announcing the first COMBREX Workshop for Computational and Experimental Determination of Protein Function, September 15, 2010, Lake Arrowhead, California, USA.

The website can be found at http://combrex.bu.edu/.

Topics to be discussed by the breakout groups will roughly divide into the following areas: (1) whole genome annotation, (2) assessment of computational predictions, (3) use of structure to predict function, and (4) infrastructure for function annotation. General topics to be discussed include:

1. How to prioritize predictions?

2. How to evaluate experimental bids?

3. How to handle non-enzymatic proteins?

4. How best to handle predictions/phenotypes from high-throughput experimentation?

A key desired outcome of the workshop is to identify opportunities for collaboration between computational and experimental biologists, and to catalyze such collaborations.

We hope you will be able to join us for this event. You can register at: http://www.mimg.ucla.edu/arrowhead2010/registration.html

For further information please contact the organizers:

Co-chairs: Martin Steffen, Boston University, steffen ‘at’ bu ‘dot’ edu
Iddo Friedberg, Miami University, i.friedberg ‘at’ muohio ‘dot’ edu

Steering Committee: Simon Kasif and Richard J. Roberts
