Archive

Archive for the ‘Bioinformatics’ Category

Protein Function: how do we know that we know what we know?

July 22nd, 2010 6 comments

ResearchBlogging.org

The trouble with genomic sequencing, is that it is too cheap. Anyone that has a bit of extra cash laying around, you can scrape the bugs off your windshield, sequence them, and write a paper. Seriously?

Yes, seriously now: as we sequence more and more genomes, our annotation tools cannot keep up with them. It’s like unearthing thousands of books at some vast archaeological dig of an ancient library, but being able to read only a few pages here and there. Simply put: what do all these genes do? The gap between what we do know and what we do not know is constantly growing. We are unearthing more and more books (genomes) at an ever-increasing pace, but we cannot keep up with the influx of new and strange words (genes) of this ancient language. Many genes are being tested for their function experimentally in laboratories. But the number of genes whose function we are determining using experiments is but a drop in the ocean compared to the number of genes we have sequenced and whose whose function is not known We may be sitting on the next drug target for cancer or Alzheimer’s disease, but those proteins are labeled as “unknown function” in the databases.

The red line is the growth of protein sequences deposited in TrEMBL, a comprehensive protein sequence database. The blue line illustrates the growth proteins in TrEMBL whose function is know, or at least can be predicted with some reasonable accuracy. The green line is the growth in the proteins whose 3D structure has been solved. Note the logarithmically increasing gap between what we know (blue) and what we do not know (red). Image courtesy of Predrag Radivojac.

Enter bioinformatics. CPU hours are cheaper than high throughput screening assays. And if the algorithms are good, software can do the work of determining function much cheaper than experiments. But therein lies the rub: how do we know how well function prediction algorithms perform? How do we compare their accuracy? Which method performs best, and are different methods better for different types of function predictions? This is important because most of the functional annotations in the databases come from bioinformatic prediction tools, not from experimental evidence. We need to know how accurate these tools are. Think about it this way: even an increase of 1% in accuracy  would means that hundreds of thousands of sequence database entries are better annotated, which in turn means a lot less time in the lab or in high throughput screening labs going after false drug leads.

So a few of us got together and decided to run an experiment to compare the performance of different function prediction software tools.  We call our initiative the CAFA challenge: Critical Assessment of Function Annotation. There are many research groups that are developing algorithms for gene and protein function prediction, but those have not been compared on a large scale, yet. OK then: let’s have some fun. We, the CAFA challenge organizers, will release the sequences of some 50,000 proteins whose functions are unknown. The various research groups will predict their functions using their own software. By January 2011 all the predictions should be submitted to the CAFA experiment website. Over the net few months, some of these proteins will get annotated experimentally. Not many, probably no more than a few hundred judging by the slow growth of the experimental annotations in the databases. But we don’t need that many to score the predictions. A few dozen will do.

On July 15, 2011 we will all meet in Vienna, and hold the first-ever CAFA meeting as a satellite meeting of ISMB 2011. This will be the fifth Automated Function Prediction meeting we have been holding since 2005. Only this time, there won’t just be the usual talks and posters, there will be the results of a very interesting experiment. The International Society for Computational Biology is generously hosting our meeting, and judging by the response we are getting so far, we will need one of the larger halls.

Learn more at http://biofunctionprediction.org If computational protein function prediction is your thing, join the CAFA challenge. If you are just an interested observer, keep an eye on the site. In any case, please spread the word.  Finally, if your company wants some publicity, get in touch! We could use the sponsorship ^_^

Acknowledgements: I would like to thank the CAFA co-organizers, Michal Linial and Predrag Radivojac. The CAFA steeering committee: Burkhard Rost, Steven Brenner, Patsy Babbitt and Christine Orengo for supporting us, keeping us on the straight and narrow and for incredibly useful and insightful suggestions.  Sean Mooney and Amos Bairoch for hashing out the assessment.  Tal Ronnen-Oron and the rest of Sean Mooney’s group for setting up the CAFA website. The International Society for Computational Biology for sponsoring us. The community of computational function predictors that have participated in and supported past meetings on computational function prediction, the research groups that have registered to CAFA so far, and those that will register soon :)   Finally, Inbal Halperin-Landsberg for coining the name CAFA. I apologize in advance if I left someone out.

GO CAFA!


Godzik, A., Jambon, M., & Friedberg, I. (2007). Computational protein function prediction: Are we making progress? Cellular and Molecular Life Sciences, 64 (19-20), 2505-2511 DOI: 10.1007/s00018-007-7211-y

I’m Shipping up to Boston

July 19th, 2010 No comments

Got back recently from the ISMB 2010 meeting in Boston. Five days amongst great science and scientists. I microblogged some of the talks on my FriendFeed channel, but definitely not all I attended. All the keynotes were microblogged by many of the attendees on the ISMB 2010 feed.

An a conference is not just about the talks, but food and drink too: Newbury street is filled with great restaurants.  And seeing people you have not seen in a while. And goodies at the booths. My favorite was a Dengue virus fold-your-own kit at the PDB booth.

And here are the Dropkick Murphys, a Boston-based Celtic punk band singing  “I’m Shipping up to Boston”. Yes, it’s “that song” from The Departed.

Bioinformatics Open Source Conference 2010 (and a poll)

June 14th, 2010 3 comments

The 11th Annual Bioinformatics Open Source Conference (BOSC) 2010 is coming up in Boston, July 9-10 2010. The BOSC meetings are a great get-together of a community of programmers who are like-minded in their advocacy of open source code for science, and specifically for bioinformatics. The whole thing is run by volunteers who take a lot of time and effort to bring a top-notch meeting every year, so a big thanks to this year’s organizing committee!

If you are reading this, and you are in Boston on those dates, consider showing up, it is a great experience. There will also be a codefest on the two days before the meeting. This year’s topic is cloud computing for bioinformatics. If you like using AWS for bioinformatics or if you want to learn more, this is your chance. Amazon have provided a grant towards this codefest. (Thanks!) Biopython, Bioperl, Biojava and Bioruby developers will all be there, tailoring code to the cloud.

Which brings me to the latest poll: if you are a bioinformatics programmer, which of the Bio* packages  are you using in your programming, if any? If more than one, check the one you use most frequently. Poll answers on the right. As with all Internet polls, you must be crazy if you take it at all seriously.

Computational Bridge to Experiments

June 8th, 2010 Comments off

A bit of background information: this is a meeting I am really happy to be part of, and even more so honored to be a co-organizer. One of my main scientific interests is the prediction of the function of genes and proteins of unknown function.

Some background information: we have sequenced more than 1000 genomes of microbes, and hundreds of plants and animals. Additionally, we have millions of partial DNA sequences, RNA sequences, proteins, genomic fragments and millions of genes sequenced from metagenomic data. Problem: for most of these sequenced genes, we do not know what they are doing. That’s right:  most of the sequence data that we have is just that: data. Not information. We are amassing an ever-growing collection of books that are written in a mostly incomprehensible  language. We know (or “educatedly guess”) where the words in those books (the genes)  are located, because we have sequence signals that indicate where the bits of the DNA that code for genes is. For some of the words, we know the meaning. But in many cases, (and by some estimates in most cases) we fail to understand the meaning of the words (genes) in those books (genomes). Drawing further on the book<–>genome and gene<–>word metaphor, we sometimes know one  meaning of a word, but we all know that words in human languages can hold different meanings, depending on context. “Whatever floats your boat” can be read literally, but more often this particular collection of words in this order is a figure of speech. The same thing goes for genes: a gene may code for a certain enzyme, catalyzing a simple chemical reaction. But in another context, it may perform  developmental function for the whole organism, which has different implications than just the biochemical level.

Where's one of those when we need them?

We can’t just rely on computational means to find out what’s doing what. Bioinformatics can help us annotate genes that are similar to those already discovered, and in some cases give us new insights to the function of unknown genes. But for truly novel functions, and to known whether our boat is real or a metaphor for “what works best” we may need to run experiments. And we need a good collaboration between those who do the computational work, and those who do the experimental work in identifying which are the most important books to look at, and what words in them we need to decipher first.

The COMBREX meeting aims to start this large-scale and long-term decoding, a collaboration between experimentalists and computational biologists.
Note that the COMBREX workshop is part of the larger Microbial Genomics meeting at Lake Arrowhead, California.

Here is the announcement. Feel free to cut & paste and forward:

Announcing the first COMBREX Workshop for Computational and Experimental Determination of Protein Function. September 15, 2010 Lake Arrowhead, California USA

COMBREX (Computational Bridge to Experiments) is a new NIH funded effort that aims to increase the pace of experimental determination of the function of large and high priority gene families in bacterial genomes. The Principal investigators are  Richard Roberts (New England Biolabs) Simon Kasif (Boston University) and Martin Steffen (Boston University), this effort will form a consortium of experimental and computational biologists that would collaborate directly to test the predicted functions or specificity of high-priority genes.
Central to this effort would be the creation of a community web-based database that will  allow computational and experimental scientists to communicate easily and assist experimentalists in identifying high-priority genes with high-quality computational predictions. Experimentalists will be able to submit bids (proposals) to validate individual predictions, and if successful, will receive modest funding from COMBREX to perform the validation.
The website can be found at http://combrex.bu.edu/ .
A workshop to discuss issues related to the formation and operation of COMBREX will take place on Wednesday, September 15, 2010, as part of the 18th Annual International Meeting on Microbial Genomics at Lake Arrowhead, CA, outside of Los Angeles. A preliminary program can be found at http://www.mimg.ucla.edu/arrowhead2010/program.html (COMBREX is formerly SciBay). Confirmed speakers include Richard Roberts, Simon
Kasif, Manuel Ferrer (CSIC, Madrid), Patricia Babbit (UCSF), John Gerlt (Illinois), Peter Karp (SRI), Alexander Yakunin (Toronto), Steven Brenner (UC Berkeley) and Bruno Sobral (Virginia Tech).
The morning session will provide an overview of COMBREX, including both the experimental and computational challenges, related talks, and a
description of topics to be discussed by breakout groups. These groups will convene in the afternoon to discuss the topics and prepare a short summary, for presentation to the entire workshop after dinner.
Topics to be discussed by the breakout groups will roughly divide into the following areas: (1) whole genome annotation, (2) assessment of computational predictions, (3) use of structure to predict function, and (4) infrastructure for function annotation. General topics to be discussed include:
1. How to prioritize predictions?
2. How to evaluate experimental bids?
3. How to handle non-enzymatic proteins?
4. How best to handle predictions/phenotypes from high-throughput experimentation?
A key desired outcome of the workshop is the identification of opportunities and catalysis collaborations between computational and experimental biologists.
We hope you will be able to join us for this event. You can register at: http://www.mimg.ucla.edu/arrowhead2010/registration.html
For further information please contact the organizers:
Co-chairs: Martin Steffen, Boston University, steffen ‘at’ bu ‘dot’ edu
Iddo Friedberg, Miami University, i.friedberg ‘at’ muohio ‘dot’ edu

Steering Committee: Simon Kasif and Richard J. Roberts

Closing gaps

May 30th, 2010 1 comment

Geek alert: this post for coders.

So you sequenced your genome, reached an optimally small number of contigs, they look sane, and now you would like to see what you need for the finishing stage. Namely, how many gaps you have and what are their sizes. UPDATE: “might just be worth clarifying this is for gaps in scaffolds produced by short-read de novo assemblers like Velvet, SOAPdenovo working on paired-end data. Rather than ‘gaps’ (technically more often unresolvable repeats than true gaps) between contigs which would require a different strategy to resolve lengths between.” (Thanks for this one Nick!)

This little script will help. Note that you need to download and install biopython for it to work.

Short version: to run this on a Un*x machine, you need to download the file, then:

chmod +x uncalled_base_stats.py
./uncalled_base_stats.py contig-filename.fasta
#!/usr/bin/env python
import sys
import re
from Bio import SeqIO
"""
input: a contig file, FASTA formatted.
output: a histogram of gaps in the contig file, and contig node names
output format:
L N [n1, n2,...,nk]
Where:
L is the length of the gap
N is the number of gap lengths
The list is square brackets are the sequence ids nodes where gaps of this length are
found
"""

def uncalled_base_stats(seq_record):
    histo = {}
    for n_match in re.finditer('N+',seq_record.seq.tostring()):
        n_len = n_match.end()- n_match.start()
        histo[n_len] = histo.get(n_len,0) + 1
    return histo

if __name__ == '__main__':
    total_histo = {}
    contig_ids = {}
    for seq_record in SeqIO.parse(open(sys.argv[1]), "fasta"):
        histo = uncalled_base_stats(seq_record)
        for i in histo:
            total_histo[i] = total_histo.get(i,0) + histo[i]
            contig_ids.setdefault(i,[]).append(seq_record.id)
    gap_sizes = total_histo.keys()
    gap_sizes.sort()
    for i in gap_sizes:
        print i, total_histo[i], contig_ids[i]

Line 1:  Linux shell magic line for an executable file. You can omit this if you are using another OS.

Line 2-4: necessary imports, including biopython.

Lines 17-22: the function uncalled_base_stats accepts a biopython sequence object (a contig) and returns a histogram in the form of a dictionary (a Python associative array) called histo. The dictionary keys are the gap lengths and the values are the respective number of contigs. For example:

{5:10, 17:3}

means that there are 10 gaps of length 5, and  3 gaps of length 17.

Line 18: initiate the dictionary histo

Line 19: here we create an iterator that loops through matches of the regular expression for finding one or more “N” characters, “N+”.

Line 20: n_match is a Python regexp object, returned by the re.finditer method, if there is a match to the “N+” regexp. The .end() method is the position of the last character of the match, the .start() is the first one. The difference is the length of the match. Note that this is Python indexing, so .end() actually gives the index of the position after the match, so no need to add one to the difference beween n_match.end() and n_match.start()

Line 21: add one to the histogram dictionary

Line 22: function returns the histogram dictionary

Line 24-35: the bit that gets executed when you run the program from your OS.

24-26: total_histo: a sum of all the distributions of “N” in the reads; contig-ids: the ids for all the contigs that have a sequence of N of a certain length. The key is the sequence length.

27: Loop on the contig FASTA formatted file. Each loop iteration processes a single contig record.

28: Find consecutive groups of gaps in the contig record. Put them in the histogram for a single record, as described in lines 17-22.

29-31: Pool the histogram for this specific contig into a histogram for all contigs named total_histo.

31: Another dictionary, contig_ids, has the gap sizes as keys, and an accumulated list of contig IDs having thoise gap sizes as values. Example: we have 3 contigs, whose  ids are “contig_a”, “contig_b” and “contig_c”. contig_a has two gaps of lengths 10 and 15. contig_b has three gaps of 10, 15 and 30. contig_c has one gap of length 5. The dictionary contig_ids will look like:

contig_ids = {5: ['contig_c'],
        10: ['contig_a','contig_b'],
        15:['contig_a','contig_b'],
        20:['contig_b']}

32-33: gap_sizes is a sorted list of all gap sizes

34-35: print a histogram of all gap sizes, and the contigs invovled. Using the example above, the output will look  like:

5    1  ['contig_c']
10 2    ['contig_a', 'contig_b']
15 2    ['contig_a','contig_b']
20 1    ['contig_c']

I originally wrote this script for a colleague who was worried about how many gaps he will have to close in the finishing process. Turned out to be a very popular bit of code…

Comparative Functional Genomics: Penguin vs. Bacterium

May 4th, 2010 2 comments

No, not the flesh-blood-and-feathers penguin, but rather Tux, the beloved mascot of the Linux operating system. Compared with Escherichia coli, the model organism of choice for microbiologists.

We refer to DNA as “the book of life”; some geeks refer to it as the “operating system of life”. Just like in a computer’s operating system, DNA contains all the instructions on how to “execute” life and to keep things humming.  Many genes make proteins or RNA than act as switches to activate the synthesis of other proteins, sometimes in a two- three- or higher level hierarchy.  These switches are conditional, based on environmental conditions, or whether it’s time to replicate the DNA and divide into two daughter cells, and so on. Some genes activate the transcription of other genes, but are not regulated themselves by other genes, those can be dubbed  “master regulators”. Some genes are both activated by other genes, and activate other genes themselves: “middle management”. Finally, there are genes that are activated, but do not regulate other genes: the “workhorses”. This information, known as the transcriptional regulatory network exists for 1,378 genes of the E. coli bacterium.
ResearchBlogging.org

Paralleling this in Linux, there are programs that call other programs; again, in a hierarchical fashion.  According to the calling structure, they also can be dubbed Master Regulators (calling other programs but not being called themselves), Middle Management (calling other programs and being called), and  Workhorse (only being called).

Koon-Kiu Yan and his colleagues from Yale mapped the program call graph in Linux by setting each program as a node and drawing lines to the programs that call it, and to the programs it calls. They did the same thing for E. coli‘s transcriptional regulatory network. Here are the graphs they got:

So it seems like Linux is middle-management heavy, whereas E. coli is workhorse heavy. 30% of Linux programs are top management, as opposed to only 5% in E. coli.

Looking at the actual functions for the genes/programs, it seems that Linux programs also have much more of a  functional redundancy than in E. coli: 3.5% of E. coli‘s genes have “reusable” functions, as opposed to 8.4% of Linux programs. But if we look at entire working subgraphs of these two graphs, the subgraph overlap in Linux is 87%, whereas in E.coli the overlap is only 4.3%. This means that the division of labor in E. coli is much more distinct than in Linux. There are many ways of activating the same hierarchy in Linux, but in E. coli there is rarely more than one way to do it. Note that Linux is top-heavy, whereas E. coli has a pyramid-like structure. It is pretty obvious that the Workhorse modules in Linux go through heavy reuse while those in E. coli do not.

The scientists then decided to look into how these two networks developed.  The oldest genes in E. coli are the Workhorses, whereas the regulatory genes in middle and top management arrived more recently. In contrast, the newest programs — the most heavily rewritten ones– in Linux are the Workhorses, whereas the ones in the management echelons are  less changed than their predecessors. The oldest programs are those that are in Middle Management. they are also the most abundant type in Linux’s call graph.

Who are the Workhorses in E. coli? Those are mostly enzymes, the proteins that catalyze specific biochemical functions.  As a rule, enzymes are very specific: an enzyme would catalyze only one type of reaction, and only with a very specific chemical (substrate). Examples are enzymes that break up sugars: there is a specific enzyme for every type of sugar molecule. Who are the Workhorses in Linux? Those are the functions that get used all the time in thousands of different programs: strlen (measuring a character string’s length) or malloc (allocating memory for a data structure).  The Workhorses in Linux are non-specific while the Workhorses in E. coli are very specific.

So how to account for these differences? Nothing in biology makes sense except in light of evolution, and we have to look to the evolutionary history of both the bacterial and the computational systems for answers.  The major constraint in E. coli‘s evolution is fitness. If something breaks down in E. coli‘s Workhorse it wont get passed on to the next generation: the cell with the lethal mutation would never reproduce and will get thrown into Darwin’s rubbish bin. This leads to single-function workhorses because a multi-functional Workhorse would be too prone to messing too many systems up when it  mutates, and would never make it to the next generation, which is why the Workhorses in  E. coli‘s call graph have a lower connectivity that those in Linux’s call graph.

The authors conclude that the E. coli‘s call graph evolved bottom-up, with system robustness being the main selective trait. In contrast, Linux evolved top-bottom, with reusability of the Workhorses being the main selective trait. Reusability and robustness are tradeoffs. In the case of a man-made system like Linux, bugs in reusable modules are is not a problem, since Workhorse bugs are easily fixed in the next release. It is much less costly, in coding time, to tweak existing Workhorses than to build new ones.  Mutations in reusable workhorses in E. coli would weed out those kinds of proteins from the gene pool, and therefore E. coli‘s Workhorses are not reusable.

I’m not exactly sure what insight we can get by comparing natural vs. man-made networks. But hey, sometimes science is not about insight – sometimes is just about being totally cool; and The Coolness is strong with this work.


Yan, K., Fang, G., Bhardwaj, N., Alexander, R., & Gerstein, M. (2010). Comparing genomes to computer operating systems in terms of the topology and evolution of their regulatory control networks Proceedings of the National Academy of Sciences DOI: 10.1073/pnas.0914771107

Combrex: Computational Bridge to Experiments

May 4th, 2010 Comments off

Combrex is an exciting new project at Boston University to bridge computational and experimental techniques to functionally annotate proteins. They are hiring, see below:

JOB POST

We are seeking to hire a creative computational scientist for a
transformative project: COMBREX: A Computational Bridge to Experiments.

The work will involve building a novel resource that combines databases,
science, social networking and machine learning.

The position is available immediately.

For some preliminary information pls. see

www.combrex.org

BS or MS in Computer Science, Informatics, Engineering or related field is required.

Applicants with PhD’s would be considered for a separate Research Associate position.

Pls send CV and names (emails) of two references to:

Prof. Simon Kasif

kasif@bu.edu

Subject Line: COMBREX POSITION

AMOS on Ubuntu

April 19th, 2010 Comments off

AMOS is a suite of genome assembly and editing software. It includes assemblers, validation, visualization, and scaffolding tools.  I have been having some issues installing AMOS on Ubuntu  9.10.  Specifically, Ubuntu 9.10 has gcc 4.4, which breaks the compilation of the AMOS release version. However, the development version has been fixed to accommodate that.

If you don’t know which Ubuntu version you are running, type:

$ lsb_release -a

No more than fifteen minutes after I posted my Q to the amos-help mailing list, Florent Angly came through with a solution. I am posting his email here.

Hi,

This issue was fixed in the development version of AMOS. See below for instructions on how to install this version on Ubuntu:

Download either the regular or development version of AMOS. As of April 4, 2010,
Minimo is only available from the development version of AMOS.
i/ The regular AMOS version is available from http://sourceforge.net/projects/amos/files/, e.g.:
$ wget http://sourceforge.net/projects/amos/files/amos/2.0.8/amos-2.0.8.tar.gz/download
ii/ The development version of AMOS is in a CVS repository. To get it, run:
$ cvs -z3 -d:pserver:anonymous@amos.cvs.sourceforge.net:/cvsroot/amos co -P AMOS

In the directory where the AMOS file are located, run the following to install
the prerequisites:
$ sudo aptitude install ash coreutils gawk gcc automake mummer mummer-doc libboost-dev

For the Hawkeye component of AMOS, you need Qt3:
$ sudo aptitude install libqt3-headers

For the standard version of AMOS, skip to next step, but for the CVS development version, first, run:
$ ./bootstrap

Then regardless of the version:
$ ./configure –with-Qt-dir=/usr/share/qt3 –prefix=/usr/local/AMOS
$ make
$ make check
$ sudo make install
$ sudo ln -s /usr/local/AMOS/bin/* /usr/local/bin/

Now all the programs shipped in AMOS should be available from the command-line.
For example try:
$ Minimo -h
Regards,

Florent

You will need the AMOS development version for Ubuntu 9.10 (and above, presumably), but the regular version for 9.04 (and below). If you are getting the development version, you will also need to install cvs on your machine:

$ sudo aptitude install cvs

Hope this helps anyone struggling with installing AMOS on Ubuntu or other Linux platforms.

I never metagenomics I didn’t like

April 7th, 2010 1 comment

ResearchBlogging.org

“Let another man praise thee, and not thine own mouth; a stranger, and not thine own lips.” — Proverbs 27:2

“What-ever” –  Me

In PLoS Computational Biology this week, a trio of researchers provides a review of the challenges that metagenomics might ― and already do ― pose for bioinformaticians. The authors refer to metagenomic sequencing data as “noisy and partial.” Their review specifically addresses the computational requirements presented by metagenomics, rather than a comprehensive review of the current technologies. The review concludes with a “representative studies illustrating different facets of recent scientific discoveries made using metagenomics.”

–   http://www.genomeweb.com/blog/week-plos-75

I participated in writing this article for two reasons: first, Phil Bourne, the Editor in Chief of PLoS Computational Biology is a very persuasive fellow (in a good way). Second, one of the mandates of my position at the CAMERA project at the time was to develop new and innovative bioinformatics methods for metagenomics.  As I was immersed in the field and the latest professional literature anyway, writing a review seems like a good way to communicate the latest and greatest in the field, both to others but also to ourselves (John, Adam and I, all working in CAMERA). So we decided to write this up. While writing it, this article went through a few drastic alterations: Life Happened to me a couple of times, including changing jobs and a 3,500 km move, which caused a few months  of hiatus in the writing. Once we got back to writing, the field changed: new research and new software materialized and at least one software package we were writing about fell off the face of the Earth, so revisions and insertions were necessary. Second, the manuscript started to blow up uncontrollably: there is so much to touch upon in computational metagenomics! Obviously, the field is a moving target but anything we write, especially about software, will be outdated by the time it is published. We therefore decided against the laundry list approach, and focused more on general methods. We also culled a lot of interesting things (sequencing validation, eukaryotic metagenomics)  to keep the article relatively brief and flowing.

I hope you like the final result. There’s the comment section in PLoS, you are welcome to make good use of it.


Wooley, J., Godzik, A., & Friedberg, I. (2010). A Primer on Metagenomics PLoS Computational Biology, 6 (2) DOI: 10.1371/journal.pcbi.1000667

Paweł Szczęsny in TEDx Warsaw

March 30th, 2010 Comments off

Pawel on Open Science. Full disclosure: I consider sharing an office with this guy for over a year to be one of the best experiences of my postdoc.

Bioinformatics Blog Carnival #1

March 10th, 2010 2 comments

Yes! Why should the evolution people have all the fun with their blog carnival? (After all, it is only a theory.) It’s time for bioinformaticians to show what we are made of, and to have a carnival of our own. Bio::blogs had a good run some time ago. I decided to reconnect what is hopefully now a larger and more networked community under the title of Bioinformatics Blog Carnival.

Northampton: monument to Francis Crick, born in Northampton. Copyright Ian Rob, licensed for reuse.

Programming

For many new bioinformatic programmers, there is a question of which Bio* package to choose from. The Bio* packages (Biopython, BioPerl, BioJava and BioRuby) are open source licensed packages which are used heavily in bioinformatics coding. They contain parsers for bioinformatic data file, file format converters, SQL interfaces and other goodies that make a biohacker’s life manageable. Walter Jessen presents a rapid-fire no-nonsense review of Open Source Programming with Bio* Libraries on Expressing Scientific Insight. Finally, if you would like to do everything wrong, Manuel Corpas presents 10 Sarcastic Rules on How to Be a Bioinformatician posted at Manuel Corpas’ Blog.

Databases

Morgan Langille talks about BioTorrents – a file sharing resource for scientists posted at his blog Beta Science. My take on BioTorrents is that it is a cool idea, but as most institutes block BitTorrent along with other peer-to-peer sharing utilities, I doubt there is a critical mass of feeders to make BioTorrents viable. Things can change though: institutes might decide to set up dedicated servers to Torrent scientific data, just like there are legitimate Torrent servers for Linux distros. George presents Bio-graphics, BioSQL and Rails part 1 posted at Biorelated. He talks about how to quickly add graphics support to a bioinformatics database rails application.

Brad Chapman’s Blue Collar Bioinformatics is a treasure-trove of useful bioinformatics methods. Brad is very thorough in his writing, and he covers a wide variety of topics. I particularly enjoyed reading about his adventures at the biohackathon 2010 in Tokyo, and the resulting Python query interface to BioGateway SPARQL endpoint and InterMine.

Genomics

Nick Loman from Pathogens: Genes and Genomes gives some excellent tips for de-novo genome assembly. He also gives the necessary scripts,which are great companions to Velvet, the popular short read assembler. Luke Jostins writes about AGBT: Speculating on Third Gen Tech posted at Genetic Inference, “An investigation into what data from various third generation sequencing technologies may look like.” SM presents Resources for Exome sequencing annotation posted at Organizing the Strands of Curiosity.

Structural biology

Sean Seaver talks about the retraction of several protein structure papers published by one group at the University of Alabama, Birmingham. Structuregate was reported both in the media and in the blogosphere. Sean walks us through how it might have been done, and how structural bioinformatics techniques found out the wrong structures in Origin and Orientation posted at P212121. Maria Hodges talks about how difficult it is to explain structural genomics to the man in the pub. Menachem Fromer presents Tradeoff between stability and multispecificity in the design of promiscuous proteins. Promiscuous proteins are proteins which bind different partners (ligands, or other proteins). However, the more partners they are able to bind, the less stable they are, as shown in a series of simulated evolution studies, summarized at Nir London’s Macromolecular Modeling Blog. OK, promiscuity and and a pub. There must be a joke somewhere there.

Credit: Iddo Friedberg & 3D chem for the Anthrax toxin lethal factor image

A common lament among bloggers and other enthusiastic adopters of Web 2.0 technology is the lack of mainstream uptake of these tools by active scientists. A recent report from University of California Berkeley confirmed this: “The advice given to pre-tenure scholars was consistent across all fields: focus on publishing in the right venues and avoid spending too much time on public engagement, committee work, writing op-ed pieces, developing websites, blogging, and other non-traditional forms of electronic dissemination (including online course activities)“. Maria Hodges argues that Web 2.0 thrives where journals don’t, and that the NMR community might be the first to reach the tipping point, where your career is harmed by not contributing. She talks of two wikis used by the NMR community. It is a small community, which has a need for sharing methods that are generally not publishable in peer-reviewed journals. The wiki venue makes for an ideal dissemination method for this community. On the subject of data sharing, and how it can backfire: here is an interesting connection between the history of the PDB, and the email affair known as ‘climategate’.

A sh*tload of data

March 4th, 2010 1 comment

ResearchBlogging.org
This post was chosen as an Editor's Selection for ResearchBlogging.org

There are more microbial cells in our body than our own. Those microbes are not just passive hitchhikers or conversely, malicious agents of disease. They affect our well-being and health in a much broader spectrum than simply “bad” or “passive”. Among other things our gut microbes play an important role in digestion, have been linked to obesity, conditions as severe as certain inflammatory bowel syndromes or as relatively mild as traveler’s diarrhea. The microbes living on our skin also affect many things: from body odor and dandruff and acne to dermatitis and psoriasis. Also, being the most exposed microbial population means that they are themselves affected by our constant exposure to various agents,and their population varies by our own behavior –  such as how often, and with what, do we wash our hands: antibacterial soap has been named as a major culprit in the development of resistant bacterial strains. In all organs the native flora, relatively benign, protects us against colonization by more virulent bacteria.

Indeed, our gut, skin, mouth and genital microbiomes can be viewed as  additional organs in the way that they affect us. If you take a long course of powerful systemic antibiotics, the ensuing diarrhea and sometimes mouth thrush are the result of these “organs” — in your mouth and gut — being removed temporarily from your body.

Credit: David Gregory&Debbie Marshall. Wellcome Images images@wellcome.ac.uk

Today, another big step has been taken towards understanding the role of our microbiomes play. Just how big, in what direction, or what will be the consequences of this step is unclear, and will remain unclear for quite some time. A group of Chinese and European researchers have published the largest sequencing effort yet of gut bacteria. Their current yield, 576 Gigabases of DNA from the feces of 124 European people is considerably larger than the previous large effort in the US: 3GB from sequenced from 33 US and Japanese adults. Also, Qin and colleagues looked at obese, lean, healthy and sick (inflamatory bowel syndrome) individuals. They identified 1,000 to 1,100 different bacterial species, with about 160 different species per individual. Healthy individual’s bacterial population was markedly different that those with inflammatory bowel syndromes, and the populations of those with IBS differed depending on disease type.

What does this all mean? Well, like in the human genome project, it will take a while before  we understand not only what these data contain, but what are the limits of our ability to interpret them, and how best they can serve us. What we have right now is the equivalent of the outline of a newly discovered continent. It is up to many individual explorers to discover and chart the myriad living things in terra excreta. Which genes are most associated with obesity or with ulcerative colitis? How prevalent is gene transfer between gut bacteria, and how much of a role does this play in antibiotic resistance? Are there microbial species more prone to changes in their genomes than others? Are there metabolic pathways that are shared between different microbial species? Are some genes faster evolving than others, what would be the ecological role different species play in the gut? And how do different microbial populations ultimately affect their human hosts? Do different bacterial species have a preference with whom they share our guts? These broad questions can be broken down into individual questions relating to a lab’s pet genes, metabolic pathways, microbial species and metabolic conditions. There is a lot to sift through here and these data will keep us busy for years to come.

Pathogenic E. coli on the intestinal lining. Credit: S. Schuller.

In other blogs Carl Zimmer has a great post on how he, for one welcomes our bacterial overlords, while Ed Yong talks about the science of things to come.

Oh, and for those of you who, like me, can’t wait to plug the data into your favorite analysis pipeline, you can get the gene and assembled and annotated sequence data data from Peer Bork’s lab at the European Molecular Biology Laboratory, or the Beijing Genome Institute. The raw sequence data are available from the European Bioinformatics Institute under the accession ERA000116. (I can’t find the latter myself right now, hopefully they will show up in a couple of days). Update March 7, 2010: the sequences are now deposited at EBI. Also, on BioTorrents (hat tip to Morgan Langille).


Qin, J., Li, R., Raes, J., Arumugam, M., Burgdorf, K., Manichanh, C., Nielsen, T., Pons, N., Levenez, F., Yamada, T., Mende, D., Li, J., Xu, J., Li, S., Li, D., Cao, J., Wang, B., Liang, H., Zheng, H., Xie, Y., Tap, J., Lepage, P., Bertalan, M., Batto, J., Hansen, T., Le Paslier, D., Linneberg, A., Nielsen, H., Pelletier, E., Renault, P., Sicheritz-Ponten, T., Turner, K., Zhu, H., Yu, C., Li, S., Jian, M., Zhou, Y., Li, Y., Zhang, X., Li, S., Qin, N., Yang, H., Wang, J., Brunak, S., Doré, J., Guarner, F., Kristiansen, K., Pedersen, O., Parkhill, J., Weissenbach, J., Antolin, M., Artiguenave, F., Blottiere, H., Borruel, N., Bruls, T., Casellas, F., Chervaux, C., Cultrone, A., Delorme, C., Denariaz, G., Dervyn, R., Forte, M., Friss, C., van de Guchte, M., Guedon, E., Haimet, F., Jamet, A., Juste, C., Kaci, G., Kleerebezem, M., Knol, J., Kristensen, M., Layec, S., Le Roux, K., Leclerc, M., Maguin, E., Melo Minardi, R., Oozeer, R., Rescigno, M., Sanchez, N., Tims, S., Torrejon, T., Varela, E., de Vos, W., Winogradsky, Y., Zoetendal, E., Bork, P., Ehrlich, S., & Wang, J. (2010). A human gut microbial gene catalogue established by metagenomic sequencing Nature, 464 (7285), 59-65 DOI: 10.1038/nature08821

Bioinformatics blog carnival

February 18th, 2010 2 comments

Byte Size Biology will be hosting the first edition of the bioinformatics blog carnival. All you bioinformatics bloggers, submit your entries by Mar 9, 2010 23:59:03  EST. Note the 3 second extension I have already given. There will be no more deadline extensions, I’ve been generous enough as it is. The carnival will be posted here by March 15, 2010.

Any blog posts that have to do with the computational aspects of: genomics, nextgen sequencing, sequence analysis, gene expression, systems biology, ontologies, databases, structural biology, metagenomics, phylogenetics, function prediction and I probably forgot a few other categories so don’t hold it against me, just submit. Early and often. Your own posts and others that you liked. Nothing too old please, 1/1/2009 and later. I reserve the right to be a less-than-benevolent dictator and screen out posts. This applies especially to commercial plugs with no other merits.

Please retweet, reblog, rebuzz and remember:  submit to the bioinformatics blog carnival!

If you have a cool logo for this carnival, email bioinfo.blog.carnival AT gmail.com OhOne and OhToo will pick the winning logo to be displayed on the carnival. Likewise, if you would like to host the next edition, let me know. Do not submit posts by email, only via blogcarnival.com.

Happy carny-ing.

Ancient Greenlander’s DNA reveals ugly mullet

February 13th, 2010 2 comments

ResearchBlogging.org

Seriously, this is what I first thought when I saw the cover of this week’s Nature, and the associated drawings  in the press.  The dude’s haircut seems like it was even bad in the ’80s… 2080 BCE that is, which is when his body is dated. Approximately.

Inuk, artist's impression

A large group of researchers were involved in analyzing DNA from a 4,000 year old man found in Greenland. There is an interesting story on how the sample was obtained. Eske Villerslev spent 2 months looking for human remains in archaeological sites in Greenland, to no avail. When complaining about it to a friend, he learned that that friend’s father had already given hair samples to the national museum of Denmark… back  in 1986. Serious slap on the forehead!

The preserved DNA was of great quality. Also, this is 2010: we have a plethora of genotypic data that lets us associate a genome with phenotypes, and geographical origin. Inuk, as he was dubbed, shares many traits with east Asians, and much less with contemporary North American Inuit or other Native Americans. This means that Inuk and his people migrated fairly recently to Greenland from Siberia and across North America. Inuk and his people, the (now extinct) Saqqaq have no relationship to contemporary Native Americans. This was also shown in an earlier study of the mitochondrial DNA.

What do we know about Inuk himself? He had blood type A+; thick black hair and very likely a receding hairline due to a baldness allele; brown eyes; shovel-graded front teeth (East  Asian characteristic); dry type earwax… before you laugh that is actually, another characteristic of Asian populations on the Siberian and Chinese east coast. He also had alleles associates with human adaptation to cold temperatures.

And a bad haircut; but great press nevertheless (just got to Google News and search for  “Greenland Genome” – see, for example, the NYTimes.)

This post has been Slashdotted. Exercise extreme caution.


Rasmussen, M., Li, Y., Lindgreen, S., Pedersen, J., Albrechtsen, A., Moltke, I., Metspalu, M., Metspalu, E., Kivisild, T., Gupta, R., Bertalan, M., Nielsen, K., Gilbert, M., Wang, Y., Raghavan, M., Campos, P., Kamp, H., Wilson, A., Gledhill, A., Tridico, S., Bunce, M., Lorenzen, E., Binladen, J., Guo, X., Zhao, J., Zhang, X., Zhang, H., Li, Z., Chen, M., Orlando, L., Kristiansen, K., Bak, M., Tommerup, N., Bendixen, C., Pierre, T., Grønnow, B., Meldgaard, M., Andreasen, C., Fedorova, S., Osipova, L., Higham, T., Ramsey, C., Hansen, T., Nielsen, F., Crawford, M., Brunak, S., Sicheritz-Pontén, T., Villems, R., Nielsen, R., Krogh, A., Wang, J., & Willerslev, E. (2010). Ancient human genome sequence of an extinct Palaeo-Eskimo Nature, 463 (7282), 757-762 DOI: 10.1038/nature08835

Gilbert MT, Kivisild T, Grønnow B, Andersen PK, Metspalu E, Reidla M, Tamm E, Axelsson E, Götherström A, Campos PF, Rasmussen M, Metspalu M, Higham TF, Schwenninger JL, Nathan R, De Hoog CJ, Koch A, Møller LN, Andreasen C, Meldgaard M, Villems R, Bendixen C, & Willerslev E (2008). Paleo-Eskimo mtDNA genome reveals matrilineal discontinuity in Greenland. Science (New York, N.Y.), 320 (5884), 1787-9 PMID: 18511654

Short bioinformatic hacks: reading between the genes

February 11th, 2010 6 comments

In celebration of the biohackathon happening now in Tokyo, I am putting up a script that is oddly missing from many bioinformatic packages: extracting intergenic regions. This one was written together with my student, Ian. As for the biohackathon itself, I’m not there, but I am following the tweets and  Brad Chapman’s excellent posts:

About intergenic regions: intergenic regions are as interesting and sometimes even more interesting than the genes themselves: when you are interested in promoters, transcription factor binding sites or almost any other transcription regulation mechanism. Here’s a simple script to find intergenic regions. It reads a genbank formatted file and uses the information there to extract the intergenic regions. The sequences are written to a FASTA file.

#!/usr/bin/env python
import sys
import Bio
from Bio import SeqIO, SeqFeature
from Bio.SeqRecord import SeqRecord
import os

# Copyright(C) 2009 Iddo Friedberg & Ian MC Fleming
# Released under Biopython license. http://www.biopython.org/DIST/LICENSE
# Do not remove this comment
def get_interregions(genbank_path,intergene_length=1):
    seq_record = SeqIO.parse(open(genbank_path), "genbank").next()
    cds_list_plus = []
    cds_list_minus = []
    intergenic_records = []
    # Loop over the genome file, get the CDS features on each of the strands
    for feature in seq_record.features:
        if feature.type == 'CDS':
            mystart = feature.location._start.position
            myend = feature.location._end.position
            if feature.strand == -1:
                cds_list_minus.append((mystart,myend,-1))
            elif feature.strand == 1:
                cds_list_plus.append((mystart,myend,1))
            else:
                sys.stderr.write("No strand indicated %d-%d. Assuming +\n" %
                                  (mystart, myend))
                cds_list_plus.append((mystart,myend,1))

    for i,pospair in enumerate(cds_list_plus[1:]):
        # Compare current start position to previous end position
        last_end = cds_list_plus[i][1]
        this_start = pospair[0]
        strand = pospair[2]
        if this_start - last_end >= intergene_length:
            intergene_seq = seq_record.seq[last_end:this_start]
            strand_string = "+"
            intergenic_records.append(
                  SeqRecord(intergene_seq,id="%s-ign-%d" % (seq_record.name,i),
                  description="%s %d-%d %s" % (seq_record.name, last_end+1,
                                                        this_start,strand_string)))
    for i,pospair in enumerate(cds_list_minus[1:]):
        last_end = cds_list_minus[i][1]
        this_start = pospair[0]
        strand = pospair[2]
        if this_start - last_end >= intergene_length:
            intergene_seq = seq_record.seq[last_end:this_start]
            strand_string = "-"
            intergenic_records.append(
                  SeqRecord(intergene_seq,id="%s-ign-%d" % (seq_record.name,i),
                  description="%s %d-%d %s" % (seq_record.name, last_end+1,
                                                        this_start,strand_string)))
    outpath = os.path.splitext(os.path.basename(genbank_path))[0] + "_ign.fasta"
    SeqIO.write(intergenic_records, open(outpath,"w"), "fasta")

if __name__ == '__main__':
    if len(sys.argv) == 2:
         get_interregions(sys.argv[1])
    elif len(sys.argv) == 3:
         get_interregions(sys.argv[1],int(sys.argv[2]))
    else:
         print "Usage: get_intergenic.py gb_file [intergenic_length]"
         sys.exit(0)

What are we seeing here?

Lines 11-16 are the preamble: we read the GenBank file using Biopython’s genbank parser in line 12.  Beacuse we expect a genome file, which contains a single record, this is a one-time read. Note that this is a rate limiting step, and can take a couple of seconds. Took me ~2secs to read the full E. coli genome on my Linux box.  We prepare one list for the + strand intergenic regions (13), another one for the minus strand intergenic regions (14) and one for all the records (line 15).

The rest of the code are three loop blocks: lines 16-28 I loop over the genbank features, extracting the coordinated of the genes themselves. Line 32-41 I find the intergenic regions on the + strand. Lines 42-52 I do the same for the “-” strand.

Now for a philosophical interlude: although there is a way to read all the intergenic regions in a single pass, I subscribe to the “code simple” doctrine of research software writing. Code performance optimization is a low priority for me. I’d much rather have something that is simple to write,read and modify. I also don’t want to spend too much time coding and elegant script for elegance’s sake, especially if I may not use it too much. Historically, scientific code written for research is mostly extinct: thrown away after a short lived hypothesis was tested and ended its days. Research coding is mostly throwaway glue code. Very rarely it matures into a product. Then, and only then, can you apply all those fine software engineering you learned in college. Before that, write fast and simple.

But I digress. Line 17 loops over the features in the genome file. Line 18 we identify if it is a coding sequence (CDS). If so, we identify the start position, and position and the strand the CDS is on. The list cds_list_minus is a list of 2-tuples. Each 2-tuple is the start and end positions of a CDS on the minus strand. (If you would like to go over the genes, as opposed to coding sequences, change line 18 to:

if feature.type == 'gene':

(or better yet, pass an argument that defines it.)

cds_list_plus, is, yes, the same as cds_list_minus, only for the plus strand (line 24).

Sometime, a CDS does not contain information on which strand it is. With genome files, that is usually the case with single stranded viral genomes. Therefore, we put in the default assumption that if there is no strand indication, then the feature is is on the plus strand. We generate a warning message nevertheless (lines 25-28).

Lines 30-41 we loop over the plus  strand list, and identify the coordinates between the genes. Python’s enumerate function is very useful here. The enumerate function allows us to iterate over a list, but at the same time keep track of which index we are in when looping over the list. So in line 30, pospair receives the start and end coordinates of a CDS as a 2-tuple, while i receives the actual number if the index in the plus strand CDS list. In that way, we can look back to the previous list member, find the coordinates where that CDS ends, and where the current CDS begins. The two coordinates make up the beginning and end of the intergenic regions between those two genes on that strand. In line 35 we check if the intergenic region length is equal to or larger than a threshold: suppose we are only interested in those intergenic regions that are longer than 100 bases? (The default value is 1, see line 11.) In lines 38-39 we build a biopython sequence object that contains an informative header, and the sequence of the intergenic region. The description which goes in the sequence header contains the start and end coordinates of the intergenic region, and the locus ID of the CDS directly downstream from it. The sequence object is appended to a list, which will eventually get written (lines 40-41).

Lines 42-52 are a repeat of lines 30-41, only for the minus strand.  Lines 53 & 54: the list that contains all the intergenic region sequence objects gets written to its own fasta file.

Finally, line 56-63 are boilerplate code, that make this script runnable from the command line. Have fun looking at intergenic regions. Let me know of you find something interesting.