The Search for Small finds Life on a Gradient

In Chapter 3 of The House at Pooh Corner, Rabbit organizes a search for Small, “One of my friends and relations.” Like a good manager (or scientist), Pooh lays out a program:


As soon as Rabbit was out of sight, Pooh remembered that he had forgotten to ask who Small was, and whether he was the sort of friend-and-relation who settled on one’s nose, or the sort who got trodden on by mistake, and as it was Too Late Now, he thought he would begin the Hunt by looking for Piglet, and asking him what they were looking for before he looked for it.
“And it’s no good looking at the Six Pine Trees for Piglet,” said Pooh to himself, “because he’s been organdized in a special place of his own. So I shall have to look for the Special Place first. I wonder where it is.”
And he wrote it down in his head like this:

ORDER OF LOOKING FOR THINGS

  1. Special Place (To find Piglet)
  2. Piglet (To find who Small is)
  3. Small (To find Small)
  4. Rabbit (To tell him I’ve found Small)
  5. Small again (To tell him I’ve found Rabbit)

“Which makes it look like a bothering sort of day,” thought Pooh as he stumped along.

Of course, it does turn out to be a bothering sort of day, and nothing goes according to plan. Pooh does find Small, but that is almost an afterthought considering the other things he discovers that day.


Just like science. You set out looking for something, and you find a bunch of other things. You may or may not find what you originally set out to look for, but by the time you get to finding Small, finding him may not be the accomplishment you originally thought it would be. Something else has superseded it.

I started writing this post about the search for the smallest organism. Why? Because life in small packages fascinates me. How small can a biological package be, and still be considered living? Or: “The Search for Small(est)”.

But like Pooh, I bumbled along into other things.

So what is the smallest living thing? Starting at the smallest scale, viruses are considered by most scientists to be replicators rather than organisms. They do not metabolize, and they do not carry a full complement of reproductive machinery. They affect life profoundly, but they are missing a few essential components to actually be living. This view has been shaken up recently by the discovery of giant viruses that have genomes larger than those of some bacteria. These genomes are also quite complex: they code for a large part of the reproductive machinery, a selective membrane, and other of life’s goodies. Still, even if we accept that giant viruses (or mimiviruses, as they are called) have crossed the border from non-life into life, they are no longer the smallest things around, neither in genome size nor in particle size. Indeed, mimiviruses were, for a long time, mistaken for bacteria due to their size, which is where they got their name: “mimi” is short for microbial mimic.

So: small bacteria? The bacterium Candidatus Carsonella ruddii is really small: its genome is just shy of 160,000 base pairs and it codes for about 182 predicted genes. But carsonella is an obligate endosymbiont: it lives inside the cells of a special organ in the jumping plant louse, or psyllid, an insect that feeds on plant phloem. Carsonella cannot survive outside its host and, in fact, its genome has lost so many genes that it is practically an organelle, not much larger than a mitochondrion. (A human mitochondrion has about 16,500 base pairs and 37 genes.) Mitochondria are not living, although they originated from bacteria. Is carsonella there yet? Has it crossed from life into non-life, just as mimiviruses may have crossed from non-life into life? Buchnera, another insect endosymbiont, is not much larger, with about 400 genes.

Mycoplasma genitalium is parasitic, but at least it codes for (almost) all of its proteins. It is usually heralded as “the smallest organism that can be grown in cell-free culture”. Its genome is 521 genes strong: about 3 times as many as that of carsonella. It is not an obligate endosymbiont, but it is a parasite: we can trick it into living and growing in a nutrient-rich soup, but in nature you will not find it outside a host.

Pelagibacter ubique, a marine bacterium, is, as far as we know, the smallest free-living organism, with approximately 1390 predicted genes.

So in searching for Small, I was asking a question that seemed to become more awkward each time I thought I found him: is this Small I found living or not? Each of the Smalls has certain characteristics of life, but where on the scale outlined by pelagibacter, mycoplasma, carsonella and a mitochondrion does life turn into non-life?

When a question you ask makes you feel weird, you may want to consider whether you are asking the right question. So maybe I was asking the wrong question. Maybe the definition of life is not a binary one and we should not think in terms of “living” and “not living”. Life may very well be a quantitative thing. Life sheds itself into non-life gradually, from free-living to parasitic to endosymbiont to organelle. Indeed, self-replicating proteins (prions) and self-replicating RNA (viroids) are the byproducts of much more complex life, which arose before those replicators were derived. As they are, they are non-living, but they owe their existence to life.

So there is no single boundary where “life” crosses over to “non-life”. That’s not the right way to look at it. When journeying from virus through mimivirus, organelle, various endosymbionts and parasites to free-living organisms, we are simply hitting milestones on a continuum. Perhaps not that different from the continuum from which life emerged in the first place.

Understanding this probably beats actually finding Small.

 

Music Monday: War Again

Balkan Beat Box, from “Blue Eyed Black Boy”. I like the animated rendering of Tomer Yosef.

 

 

Microbial Pancakes

 

Prepared by daughter. Not to scale. Species not yet identified. Delicious.

 

Gut microbes and diabetes

It seems that every day we are discovering more about the role of microbes in our health. We really have to revise our definition of what a human (or any other animal or plant) is: we are not just creatures of 10,000,000,000,000 cells containing the DNA we got from mother and father. We carry 10 times that many microbial cells, and we are only now beginning to understand how profoundly they affect us.

Together with obesity, insulin resistance is the harbinger of metabolic syndrome. Insulin resistance is when the body cannot use insulin effectively. Insulin is needed to help control the amount of sugar in the body, so when the body resists it, blood sugar and fat levels rise. Therein lies the path to morbid obesity, diabetes, stroke, and heart problems.

This post was chosen as an Editor's Selection for ResearchBlogging.org

So what’s the connection of metabolic disease to bacteria? Well, for one thing, we know that in obese people the bacterial population in the gut is different, and the different population of bacteria may lead to a vicious cycle contributing to obesity.

Another possible connection has to do with an important group of molecules in our body called Toll-like receptors, or TLRs. TLRs are a family of membrane proteins that sense a wide variety of bacterial populations and activate our innate immune system. TLRs are like a first-defense warning station: they sense the bacterial enemy first and, if needed, activate the proper defense mechanisms. Researchers studying TLR-2 have created knockout mice lacking TLR-2, and they discovered that many of the TLR-2 knockout mice do not develop insulin resistance when fed a high-fat diet. Think about it: all the McCrap you can eat, yet your blood sugar level remains normal (although you still grow fat). So why does that happen? How come these mice lacking a bacterial sensor also seem immune to insulin resistance?

 

To answer this, we must understand that TLRs (quite a few are known so far) serve as a bridge (or “mediate crosstalk”) between the immune system and the body’s metabolism. Mice without TLR-5 develop an eating disorder known as hyperphagia, which is characterized by an increased appetite; they also show other pre-diabetic symptoms: hypertension, high lipids, and insulin resistance. I have posted before about how TLR-5 may control the type of gut bacteria mice have and, in turn, control their propensity for obesity.

TLR-4 deficient mice, on the other hand, seem to be protected from insulin resistance, just like TLR-2 deficient mice. So a connection between these front-line sensors of the immune system and whole body metabolism is well-known.

A group of researchers from Brazil decided to look further into these “diabetes resistant” mice. The thing about mutant TLR-deficient mice is that they are normally grown in sterile conditions, both because of possible infections and because gut microbes add uncontrolled variables to any experiment. When scientists cannot precisely control experimental conditions, they face two choices. One: deviate from the model to emulate the “real world” better, but sacrifice control of one or more of the variables in the experiment. Two: maintain full control of the experiment and sacrifice some realism in whatever they are trying to model. Most scientists go with the second option: they prefer a well-controlled model, even if that detracts from its practicality and application to “real life”. That is because a model (in our case, mutant TLR mice) is somewhat removed from the real thing anyway: mutant TLR-deficient mice are basically an artificial construct used to investigate the effect of knocking out a TLR from the mouse, so that hopefully we can draw conclusions about humans. So the second type of possible error is the one scientists generally prefer to make.

 

Toll-like receptors: it's complicated.

But Andrea Caricilli and colleagues decided to look at TLR-2 knockout mice in non-sterile conditions. Remember: TLR-2 knockouts seem to be protected from insulin resistance. What Caricilli and her colleagues discovered was quite the opposite of what was known so far: in non-sterile conditions, TLR-2 knockout mice were not protected from insulin resistance. On the contrary, the mutant mice developed metabolic syndrome. But did the gut bacteria do it? To check that, the researchers treated the mice with broad-spectrum antibiotics for 20 days. After that, the bacterial species that re-colonized the mice’s guts were quite different in their composition from the bacterial species that originally inhabited them. And the mice no longer had metabolic disease, or their symptoms were much less severe.

So here it is: changing the mice’s gut microbiota changed them from mice with insulin resistance to mice without insulin resistance. Yes, these are mutant mice, but still: insulin resistance was turned off by changing the types of microbes in the gut.

They then transplanted the microbiota from TLR-2 mutants which had insulin resistance to regular mice. And what do you know: the regular mice then showed symptoms of insulin resistance and metabolic disease.

There is a lot more to this paper than these two experiments: the authors also investigated many other parameters, trying to come up with the chain of events that bacteria trigger when causing metabolic disease. I won’t get into that; the paper is quite long, with some 20(!) figures. A lot of work went into this. But the bottom line again supports what has been shown in other studies: the bacteria that live in our gut help shape our metabolism, and it is the interaction between the bacteria and our immune system that not only protects us from pathogens, but also protects us (or not) from metabolic disease.

 

 Update: this post has been slashdotted. Exercise extreme caution.


Caricilli, A., Picardi, P., de Abreu, L., Ueno, M., Prada, P., Ropelle, E., Hirabara, S., Castoldi, A., Vieira, P., Camara, N., Curi, R., Carvalheira, J., & Saad, M. (2011). Gut Microbiota Is a Key Modulator of Insulin Resistance in TLR 2 Knockout Mice. PLoS Biology, 9 (12). DOI: 10.1371/journal.pbio.1001212

 

Nobody knows you

With deepest apologies to the memory of Jimmy Cox.

EDIT: I got a couple of concerned emails. No, this did not happen to me. Yet.

Once I lived the life of a PI so rich,
Research was going along without a hitch.
Lab manager, four postdocs and grad students eight,
My lab took up the whole floor, and that felt great.

Five years later it all went to hell,
My renewal was declined, because no papers in Cell.
But I just read an RFA that is out,
I’m going to apply, and get it, without a doubt.

Nobody knows you,
when you lose your grant.
In your funding, not one penny,
and as for postdocs, I haven’t any.

If I ever get back on my feet again,
My department chair will not treat me with disdain.
It’s mighty strange, which is why I’m doing this rant,
Nobody knows you when you lose your grant.

Best version of the original, IMHO:

Seriously?

I get the weirdest emails sometimes….

 

Dear  professor  Iddo Friedberg,

    

    First of all, I would like to introduce myself, my name is _____, 30 years old I  occupy a staff position of instructor at the department of pharmaceutical microbiology, faculty of pharmacy, ____ University, _____.

I was graduated in 2003 with an overall grade “excellence with honors” & I was the second among 1000 students.  I have a PhD scholarship totally funded by the ______ Ministry of High Education. The scholarship includes tuition fees, residence, health insurance and everything. Here is the link of the scholarships offered _________. I am interested in doing my PhD study in the field of molecular biology under your supervision. To fulfill the requirements of the scholarship, I should have a research proposal including the following items. I will apply to The University as soon as I receive the proposal.

 

The research proposal includes:

· Title Page (including Title, Keywords).
· Abstract
· General Overview of Research Area and Literature
· Key Research Questions and Objectives
· Methodology
· Tentative Timetable
· Selective Research Bibliography

 

I will be so grateful to you if you generously send me  a research proposal few days before the deadline(January 10, 2012)

 

 

N.B. other information is available in the attached C.V.

If you need any additional information or documents, please let me know

I am waiting for your reply as soon as possible in case of either acceptance or refusal.

Kind regards

Open Cancer Research

 

“We seek to download from the amazing successes of the computer industry two principles: that of open source, and that of crowdsourcing; to quickly, responsibly accelerate the delivery of targeted therapeutics to cancer patients. Our business model involves all of you. This research is funded by the public.”

 

Music: The Black Keys, El Camino

The Black Keys’ new album El Camino is coming out today. I am not entirely sure why they called the album El Camino and then put a picture of a 1994 Chrysler Town & Country van on it:

 

What a Chevrolet El Camino might look like:

1968 El Camino

 

Anyhow, the music is great. Here is the first track, Lonely Boy. Enjoy:

 

Circumcision, preventing fraud, and icky toilets. You know you’re going to read this.

In no particular order or ranking, recent and not-so-recent articles from PLoS-1. The common thread (if any): I thought they were pretty cool in one way or another.


 

1. Men don’t tell the truth about their penis. No kidding? But this is somewhat more serious. It has been accepted for some time that male circumcision dramatically reduces the rate of HIV infection. But recently, some reports have shown that high rates of infection prevail among circumcised men as well. But since circumcision is usually self-reported, could there be a problem there? This study shows that in a cross-sectional (sorry…) study among recruits to the Lesotho Defense Force, 50% of the men that reported they were circumcised were, in fact, partially (27%) or completely (23%) not circumcised. The researchers conclude that biases in the self-reporting of male circumcision may lead to erroneous reports that show high HIV infection rates among circumcised men.

Concluding quote:

…until further research can document improved methods for obtaining accurate self-reported MC [male circumcision, I.F.] data, all assessments of MC and HIV prevalence, as well as projections for VMMC [voluntary medical male circumcision, I.F.] interventions, should be informed by physical-exam-based data [as opposed to self-reporting, I.F.].

http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0027561


2. Share your data or GTFO. 

Can sharing data help prevent errors and fraud?

From the abstract:

Background: The widespread reluctance to share published research data is often hypothesized to be due to the authors’ fear that reanalysis may expose errors in their work or may produce conclusions that contradict their own. However, these hypotheses have not previously been studied systematically

So Jelte Wicherts and his colleagues from the University of Amsterdam wanted to see whether sharing data was related to the number of statistical analysis errors in a paper. So, to phrase this as a null and alternative hypothesis:

H0: There is no difference in the number of statistical errors between papers whose authors are willing to share data and papers whose authors are unwilling to do so.

H1 (one-sided): Papers whose authors are unwilling to share data show weaker evidence and more statistical errors than papers whose authors are willing to share data.
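For the curious, here is a minimal sketch of how a one-sided comparison like this could be run as a permutation test. To be clear about the assumptions: the error counts below are invented purely for illustration (they are not the data from the paper), the function and variable names are mine, and the permutation test is a stand-in technique, not necessarily the analysis Wicherts and colleagues used.

```python
import random

def one_sided_permutation_test(not_shared, shared, n_perm=10_000, seed=0):
    """Estimate how often random relabeling gives a mean difference
    (not_shared minus shared) at least as large as the observed one."""
    rng = random.Random(seed)
    observed = sum(not_shared) / len(not_shared) - sum(shared) / len(shared)
    pooled = list(not_shared) + list(shared)
    k = len(not_shared)
    hits = 0
    for _ in range(n_perm):
        # Under H0 the group labels are exchangeable, so shuffle them.
        rng.shuffle(pooled)
        diff = sum(pooled[:k]) / k - sum(pooled[k:]) / (len(pooled) - k)
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    return observed, (hits + 1) / (n_perm + 1)

# HYPOTHETICAL error counts per paper, made up for this sketch.
errors_not_shared = [3, 0, 5, 2, 4, 1, 6, 2]
errors_shared = [0, 1, 0, 2, 0, 1]

observed_diff, p_value = one_sided_permutation_test(errors_not_shared, errors_shared)
print(f"observed difference in mean errors: {observed_diff:.2f}")
print(f"one-sided p-value: {p_value:.4f}")
```

A small p-value here would mean that a difference this large rarely shows up when the "shared" and "not shared" labels are assigned at random, which is what rejecting H0 in favor of the one-sided H1 amounts to.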

Wicherts and colleagues contacted authors of 141 papers published in five journals of the American Psychological Association, requesting their data. Trouble is, they could not get enough authors to share data to make their own study significant: in a previous study, some 73% of the authors contacted were unwilling to share data. Wow.

However, authors publishing in two of these journals, Journal of Personality and Social Psychology (JPSP) and Journal of Experimental Psychology: Learning, Memory, and Cognition (JEP:LMC), were somewhat more forthcoming. Wicherts and colleagues therefore limited their analysis to a subset of 49 papers published in those journals. (Note that sometimes lack of data sharing is due to legitimate considerations, such as being part of an ongoing study, or third-party proprietary rights. However, those were not considerations in the 49 papers analyzed here.)

Wicherts then checked for specific types of statistical errors in these papers, and compared the number of errors in papers whose authors were willing to share data with the number in papers whose authors were not. Here are some of the findings:

Distribution of the number of errors in the reporting of p-values for 28 papers from which the data were not shared (left column) and 21 from which the data were shared (right column) for all misreporting errors (upper row), larger misreporting errors at the 2nd decimal (middle row), and misreporting errors that concerned statistical significance (p<.05; bottom row). doi:10.1371/journal.pone.0026828.g001

 

Pretty clear picture: those papers where the authors were willing to share data were less prone to statistical errors.

Concluding quote:

In this sample of psychology papers, the authors’ reluctance to share data was associated with more errors in reporting of statistical results and with relatively weaker evidence (against the null hypothesis). The documented errors are arguably the tip of the iceberg of potential errors and biases in statistical analyses and the reporting of statistical results. It is rather disconcerting that roughly 50% of published papers in psychology contain reporting errors [33] and that the unwillingness to share data was most pronounced when the errors concerned statistical significance.

Although note that Wicherts is very careful about drawing conclusions:

Although our results are consistent with the notion that the reluctance to share data is generated by the author’s fear that reanalysis will expose errors and lead to opposing views on the results, our results are correlational in nature and so they are open to alternative interpretations. Although the two groups of papers are similar in terms of research fields and designs, it is possible that they differ in other regards. Notably, statistically rigorous researchers may archive their data better and may be more attentive towards statistical power than less statistically rigorous researchers. If so, more statistically rigorous researchers will more promptly share their data, conduct more powerful tests, and so report lower p-values. However, a check of the cell sizes in both categories of papers (see Text S2) did not suggest that statistical power was systematically higher in studies from which data were shared.

 

In fact, Wicherts also wrote a piece in Nature where he argued that sharing data can help avoid fraud, such as in the recent infamous case of Diederik Stapel, a highly regarded psychologist at Tilburg University in the Netherlands.

http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0026828


3. Toilet paper. A study of surfaces of public restrooms has shown that they are covered with bacteria, mainly the kind that is known to live on and in humans. So now we have a somewhat broader view of the species living in restrooms, including the uncultured ones.

Two interesting quotes from the paper:

Although many of the source-tracking results evident from the restroom surfaces sampled here are somewhat obvious, this may not always be the case in other environments or locations.

Not sure about this bit: if the sources here are obvious, then is this paper a proof of concept?

Also:

Unfortunately, previous studies have documented that college students (who are likely the most frequent users of the studied restrooms) are not always the most diligent of hand-washers.

No shit! (Pun intended).

Concluding quote:

Although the methods used here did not provide the degree of phylogenetic resolution to directly identify likely pathogens, the prevalence of gut and skin-associated bacteria throughout the restrooms we surveyed is concerning since enteropathogens or pathogens commonly found on skin (e.g. Staphylococcus aureus) could readily be transmitted between individuals by the touching of restroom surfaces.

Translation:

http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0028132


Thomas, A., Tran, B., Cranston, M., Brown, M., Kumar, R., & Tlelai, M. (2011). Voluntary Medical Male Circumcision: A Cross-Sectional Study Comparing Circumcision Self-Report and Physical Examination Findings in Lesotho. PLoS ONE, 6 (11). DOI: 10.1371/journal.pone.0027561

Wicherts, J., Bakker, M., & Molenaar, D. (2011). Willingness to Share Research Data Is Related to the Strength of the Evidence and the Quality of Reporting of Statistical Results. PLoS ONE, 6 (11). DOI: 10.1371/journal.pone.0026828

Flores, G., Bates, S., Knights, D., Lauber, C., Stombaugh, J., Knight, R., & Fierer, N. (2011). Microbial Biogeography of Public Restroom Surfaces. PLoS ONE, 6 (11). DOI: 10.1371/journal.pone.0028132

So what’s new with humans?

Man is the only animal that laughs and weeps, for he is the only animal that is struck with the difference between what things are and what they ought to be.
– William Hazlitt

We like to think that we are the only species capable of emotional self-awareness and therefore the only “animal that laughs and weeps”, but that is quite probably untrue, as other animals have been shown to laugh and perhaps weep.

Credit: Shiny Things, Flickr

 

Whatever that elusive quality is that distinguishes us from our closest cousins, the chimps and the bonobos, it is to be found in our genome. Since the genomes of humans, some great apes and other primates have been sequenced, the basis for comparing these blueprints exists. Many studies have compared the conservation of genes, copy numbers of genes, intergenic regions, control regions, synteny, splicing and other mechanisms that may explain the differences between us and our 96% cousins. As expected, no one factor can explain why bonobos are peaceful and sexual, chimps are aggressive and patriarchal, and humans worry about taxes and blog.

Are there any new genes in humans that can help explain these differences? New genes can arise in various ways: gene duplication, exon shuffling, horizontal transfer, or genes splitting up (fission) or merging (fusion).

But how about genes that are completely new in humans? Do we have genes that we can claim as our own and are neither homologous to those in other apes nor have arisen from a mix & match manipulation in the common lineage of all apes? Are there actually human genes that are just that: exclusively human?

A group from China and Canada has decided to tackle that question. They looked specifically for genes that are new in the human lineage and absent from chimp and orangutan. (I’m not exactly sure why they did not look in gorilla too, which is the other great ape with a mostly sequenced genome, perhaps because the assembly is still very much in progress.)

So how does one go about looking for genes that are human-only? The pipeline Wu and colleagues have set up looks like this:

 

Clockwise, from top left:

1. They scanned the human genome for genes that have a highly similar counterpart in the genomes of chimp, orangutan and rhesus macaque, and set those aside. That left them with 584 genes (out of roughly 25,000) which did not have an ortholog in other primates.

2. A simple sanity check: those human genes with no start or stop codons were probably mis-identified. We are now down to 352 genes.

3. Of the 352, they looked for those that have disrupted homologous regions in chimp and/or orangutan. That means that while the gene is functional in humans, it is not functional in the other primates. Disrupted homologous regions can mean that in non-humans the gene does not have a start codon, or has a premature stop codon, or has some frameshift mutation that renders it non-translatable. From 352 we are now down to 66 new human gene candidates.

4. But a human gene, even if not functional in other primates, may have been functional in a common ancestor of all primates, lost in the orangutan and chimp lineages, but maintained in humans. Such a history would not make the gene brand-new and human-only. So in the 66 remaining genes they looked for sequences where the mutation that rendered them functional (like an ATG start codon, or the removal of a missense mutation) was found only in humans. Now we are left with 46 genes.

5. Great, so we have 46 open reading frames in humans that look like original, human-lineage only genes. But are they functional? Do they actually transcribe into RNA and translate into protein? (RNA-only genes were excluded from this rather conservative pipeline; they are hard enough to identify as it is.) To find that out, they looked for transcribed regions in EST databases (for RNA), and in the PRIDE peptide database (for protein). Now we are left with 27 genes that are novel in humans and, because they are translated, are probably active.
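The start codon, stop codon and frameshift checks in steps 2 and 3 can be sketched in a few lines. This is a toy illustration, not the authors’ pipeline: the function name and both sequences are invented, and a real frameshift call would need an alignment against the human gene, which is replaced here by a crude length check.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def orf_status(seq):
    """Classify a candidate coding sequence as intact, or name the first disruption found."""
    seq = seq.upper()
    if not seq.startswith("ATG"):
        return "no start codon"
    if len(seq) % 3 != 0:
        # Crude stand-in for a frameshift: the sequence no longer divides into codons.
        return "frameshift (length not a multiple of 3)"
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    if codons[-1] not in STOP_CODONS:
        return "no stop codon"
    if any(codon in STOP_CODONS for codon in codons[1:-1]):
        return "premature stop codon"
    return "intact"

# Toy sequences, invented for illustration only.
human_like = "ATGGCTGAAACCTAA"   # ATG GCT GAA ACC TAA
chimp_like = "ATGGCTTGAACCTAA"   # ATG GCT TGA ACC TAA, with an internal TGA

print(orf_status(human_like))  # intact
print(orf_status(chimp_like))  # premature stop codon
```

In these terms, a step 3 candidate is a region that comes out "intact" in human and disrupted in chimp and/or orangutan.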

Trouble is, some of these genes are listed only in certain versions of Ensembl, the genome database from which the researchers took their data (they used version 56). This highlights a problem with the annotation of genes with no homologs: their annotation is volatile, and may change between different versions of the same database of the exact same genome. To overcome this problem, the researchers subjected different versions of Ensembl (40 through 55) to the same pipeline described above. They discovered an additional 33 genes that are candidates for de novo human-lineage only active genes, bringing the total up to 60.
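To get a feel for how steep this funnel is, here is a short sketch that tallies the counts quoted above. The step labels are my own shorthand, and the 25,000 starting figure is the rough number used in this post, so the percentages are ballpark at best.

```python
# Gene counts after each filtering step, as quoted in the post.
funnel = [
    ("annotated human genes (rough figure)", 25_000),
    ("no ortholog in other primates", 584),
    ("has start and stop codons", 352),
    ("disrupted in chimp and/or orangutan", 66),
    ("enabling mutation found only in humans", 46),
    ("evidence of transcription and translation", 27),
]

previous = None
for label, count in funnel:
    if previous is None:
        print(f"{count:>6}  {label}")
    else:
        print(f"{count:>6}  {label} ({count / previous:.1%} of previous step)")
    previous = count

# Candidates recovered from the older Ensembl versions (40 through 55).
from_older_ensembl = 33
total_candidates = funnel[-1][1] + from_older_ensembl
share_of_genome = total_candidates / funnel[0][1]
print(f"total de novo candidates: {total_candidates} ({share_of_genome:.2%} of all genes)")
```

Even with the extra 33 genes, the candidates come to roughly a quarter of a percent of the gene catalog.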

What are those genes like? Why are they found only in humans? Can they help explain the differences between humans and other primates? Well, for one, they’re short: only one or, at most, two exons. This makes sense, as these relatively new genes have not had the time to accumulate splice sites.

The researchers moved on to look where the genes were expressed. They used RNA-Seq data from 11 different human tissues: adipose, whole brain, cerebral cortex, breast, colon, heart, liver, lymph node, skeletal muscle, lung and testes.

Here is what they found:

Levels of expression of de novo genes in 11 tissues. (A) Mean normalized expression levels of de novo originated genes in 11 tissues are defined by the mean level of expression as the numbers of unique reads mapping to coding regions divided by the total length of all the coding regions, divided by the total number of valid reads in the samples (×10⁻⁸). The vertical axis represents value of mean the normalized expression levels and abscissa axis represents the 11 tissues. (B) The proportion of the de novo originated genes that have expressed reads in the 11 tissues. The vertical axis represents the values of proportion, and abscissa axis represents the 11 tissues. (C) The proportion of the de novo originated genes having their highest normalized expression levels in each of the 11 tissues. The vertical axis represents the values of proportion, and abscissa axis represents the 11 tissues. doi:10.1371/journal.pgen.1002379.g002
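The normalization in panel A is easier to see as arithmetic. Here is a minimal sketch, assuming I am reading the caption correctly: reads per base of coding sequence, per valid read in the sample, with the plotted axis in units of 10⁻⁸. The function name and the input numbers are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def normalized_expression(unique_reads, coding_length_bp, total_valid_reads):
    """Unique reads per base of coding sequence, per valid read in the sample."""
    return unique_reads / coding_length_bp / total_valid_reads

# HYPOTHETICAL inputs, not taken from the paper.
value = normalized_expression(
    unique_reads=120,
    coding_length_bp=6_000,
    total_valid_reads=20_000_000,
)
in_axis_units = value / 1e-8  # my reading of the caption's scale factor

print(f"normalized expression: {value:.2e}")
print(f"in units of 1e-8: {in_axis_units:.2f}")
```

Dividing by coding length keeps long genes from looking more expressed just because they collect more reads, and dividing by the total number of valid reads makes samples of different sequencing depth comparable.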

 

Panel C is the business bit: the expression of the 60 de novo human genes normalized by the general expression levels of genes in those tissues. (Pray, where are the error bars?) It seems that these genes have their highest expression in Woody Allen’s two favorite organs, the testes and the cerebral cortex. This actually makes some sort of sense: the testes are hypothesized to be a hotbed (sorry…) of evolutionary novelty, with all the meiosis going on there. The high expression of the de-novo human genes in the cerebral cortex also seems to confirm our anthropocentric prejudice: we are smarter. Yay. EDIT: Following MRR’s comment: yes, we should check de-novo genes and their expression in chimps. Perhaps de-novo genes exclusive to the chimp lineage are also most highly expressed in the cerebral cortex and testes.

 

The authors do point out that there may be many other de-novo human lineage genes:

Our estimated rate, though, for de novo origin may be underestimated due to the conservativeness of our pipeline. First, as described above, in our pipeline, translatable open reading frames must have been complete in the human genome and disrupted in both the chimpanzee and orangutan genomes to be candidates as a de novo gene. Genes that did not have a clear ortholog (i.e., a sequence with very high similarity) in either the chimpanzee or the orangutan genomes (both of which are less complete than the human genome, and thus could be a missing genes) were not used. It is also often difficult to determine whether a protein-coding gene originated specifically on the human lineage or if it originated in a primate ancestor but was then lost on both the chimpanzee and orangutan lineages. The conservativeness of our pipeline thus only allowed us to accept genes where we could clearly show human specific mutations generated complete protein-coding reading frames, and that these were conserved for disrupting state in both the chimpanzee and orangutan genomes. As both the chimpanzee and orangutan sequences should be non-functional sequences, and thus not under selection, there is a reasonable likelihood that a second mutation, in addition to the human open reading frame completing mutation, could have occurred in the chimpanzee or orangutan that would prevent us for identifying these genes as having a de novo origin on the human lineage.

Also, PRIDE and PeptideAtlas, the protein databases they used, may be underpopulated and may be missing many other proteins.


To conclude, yes, humans do have their own brand-new genes which, together with many other genomic features, may help explain the differences between humans and other primates. And there are probably more of these genes than we have found so far.

 

 

As for what it means to be human:

Far out in the uncharted backwaters of the unfashionable end of the Western Spiral arm of the Galaxy lies a small unregarded yellow sun. Orbiting this at a distance of roughly ninety-eight million miles is an utterly insignificant little blue-green planet whose ape-descended life forms are so amazingly primitive that they still think digital watches are a pretty neat idea.

Perhaps it was the late, great Douglas Adams who nailed it.


Wu, D., Irwin, D., & Zhang, Y. (2011). De Novo Origin of Human Protein-Coding Genes. PLoS Genetics, 7 (11). DOI: 10.1371/journal.pgen.1002379

Oh, but to receive such a rejection letter!

It is with no inconsiderable degree of reluctance that I decline the offer of any Paper from you. I think, however, you will upon reconsideration of the subject be of opinion that I have no other alternative. The subjects you propose for a series of Mathematical and Metaphysical Essays are so very profound, that there is perhaps not a single subscriber to our Journal who could follow them.

David Brewster, physicist, mathematician and inventor, acting as editor of The Edinburgh Journal of Science, to Charles Babbage, mathematician, philosopher, inventor, mechanical engineer and father of the computer, circa 1821.

The genomics programming language

Genomics is a new and exciting programming language based on Brainfsck. Here are the commands:

g    Move pointer to the right.
e    Move pointer to the left.
n    Increment the cell at the pointer.
o    Decrement the cell at the pointer.
m    Jump forward past the matching i if the cell at the current pointer is zero.
i    Jump backward to the matching m unless the cell at the current pointer is zero.
c    Output the value of the cell at the pointer.
s    Input a byte and store it in the cell at the pointer.
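
Since each genomics command corresponds one-to-one to a Brainfuck command, a genomics program can be turned into Brainfuck with a simple character translation. Here is a minimal sketch in Python 3; the function name genomics_to_bf is made up for illustration, and the mapping is read straight off the table above (g is >, e is <, n is +, o is -, m is [, i is ], c is . and s is ,). I am assuming that, as in Brainfuck, any other character is ignored.

```python
# Genomics letter -> Brainfuck symbol, read off the command table above.
GENOMICS_TO_BF = str.maketrans("genomics", "><+-[].,")


def genomics_to_bf(source):
    """Translate genomics source into Brainfuck.

    Assumption: characters that are not genomics commands (newlines,
    spaces, other letters) are dropped, as a Brainfuck interpreter
    would ignore them.
    """
    commands = "".join(ch for ch in source if ch in "genomics")
    return commands.translate(GENOMICS_TO_BF)


print(genomics_to_bf("genomics"))  # ><+-[].,
```

The output can then be fed to any Brainfuck interpreter.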

As you can probably tell, I spent a lot of time working on genomics, but out of pure generosity I am placing this incredibly useful language in the public domain. I’m sure we will see a BioGenomics group on Open Bioinformatics Forum any day now, and that genomics will prove to be a game-changer in the field of, um, genomics.

Allow me to end this post with the following inspirational statement:

nnnnnnnnnnmgnnngnnnnnngnnnnnnnngnnnnnnnnnngnnnnnnnnnnneeeeeoiggnnnnnnnn
nnnnncenncgggnnnnnnnncgncnnnnnnnceoooooooceeecgggnncoocgoooooooocncennn
nnnnncoooocennnnnnnnnnnnnnnnnnncggnnnnceeeencoooooooooooooooooooooooc

Thank you.

Short bioinformatics hacks: reading mate-pairs from a fastq file

If you have a merged file of paired-end reads, here is a quick way to read them using Biopython:

from Bio import SeqIO
from itertools import zip_longest

inpath = "merged.fastq"  # placeholder: path to your merged fastq file

# Loop over pairs of reads
with open(inpath) as handle:
    readiter = SeqIO.parse(handle, "fastq")
    for rec1, rec2 in zip_longest(readiter, readiter):
        print(rec1.id)  # do something with rec1
        print(rec2.id)  # do something with rec2
        # ...

zip_longest is fed the same iterator, readiter, twice. To build each pair it calls next() on its first argument and then on its second. Since both arguments are the same iterator, every call advances it, so each pair consists of two successive records.
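
The trick does not depend on Biopython, so it can be tried on any iterable. A minimal sketch in Python 3, where izip_longest is called zip_longest; the helper name read_pairs and the toy read names are mine, for illustration only:

```python
from itertools import zip_longest


def read_pairs(items):
    """Yield successive, non-overlapping pairs from any iterable."""
    it = iter(items)            # one shared iterator
    return zip_longest(it, it)  # each tuple pulls two items from `it`


print(list(read_pairs(["r1/1", "r1/2", "r2/1", "r2/2"])))
# [('r1/1', 'r1/2'), ('r2/1', 'r2/2')]
```

Note that with an odd number of items the last pair is padded with None, so a truncated merged file shows up as a pair whose second record is None.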

By “merged file” I mean a fastq file where the mate-pairs are one after the other, as in:

@HWUSI-EAS687_112864999:8:1:1980:1055#CGAGAA/1
GTTTGTTTTAATTTCAGTGATTCATCAATTTTAAAAAAAGATGAGAATAATAACTATTATAAAAAGATAAATAAATGTGAAATTTATATTTCAAATTCAA
+
@:DGBGDDD@GGGDGDGDDGD@GGGGE@GGG?EBGGGADDDDGEG4?3BA*::7:GEGGGG>EDDDDAG@G><ADDGBGGGGEGGGGDGGGFEGGGEFDE
@HWUSI-EAS687_112864999:8:1:1980:1055#CGAGAA/2
AATGAATTGAATAAATATAAGAAGGATGATTAATAATAATTCTTGAATTTGAAATATAAATTTCACATTTATTTATCTTTTTATAATAGTTATTATTCTC
+
D?DB:@8EBDB>GG:=<DED79>>A8CEC8DGDGG8CEC<BGGG+BAAEA@D<2D71;:8AG<ABBEEEEBEDC?C>AACDDDCD>AD<@EFFDDDECBB
@HWUSI-EAS687_112864999:8:1:2274:1058#CGAGAA/1
CCTCAGTTAGCTTCTATTGGTATTAACATGGGTGAATTTACTAAACAATTTAATGACCAAACTAAAGATAAAAATGGTGAAGTTATACCTTGTATAATTA
+
GFGGGHHGHHHHHHGHHHHHGHHHHHHHFBGDBGEHHHHFHHEHHHHDFHCGFFFHHHHHHHGHHGGEBHEEFFCEE@E>A>>8A@EBE@BBB>BGEEDB
@HWUSI-EAS687_112864999:8:1:2274:1058#CGAGAA/2
AACTGGAGTTGTTTTAATTTCAAAAGTAAAAGATTTATCTTTAAATGCTGTAATTATACAAGGTATAACTTCACCATTTTTATCTTTAGTTTGGTCATTA
+
IIIIIIIIIIGIIIDHHIIIIDIHD8CGGGGDADEIIIIIIIHIIGBGD>DGDGGDGIGIIIIBGDG@GFHIIII<C<CCGHHHIHIBGDEEB3BEDEE@
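
Reading records two at a time only works if the file really is interleaved, so it can be worth checking that each pair belongs together. A small sketch in Python 3, assuming the read ids end in /1 and /2 as in the example above (other naming schemes are not handled); the function names are hypothetical:

```python
def mate_key(read_id):
    """Split a read id ending in /1 or /2 into (fragment name, mate number)."""
    name, sep, mate = read_id.rpartition("/")
    if not sep or mate not in ("1", "2"):
        raise ValueError("id does not end in /1 or /2: %r" % read_id)
    return name, mate


def is_mate_pair(id1, id2):
    """True if id1 is the /1 read and id2 the /2 read of the same fragment."""
    name1, mate1 = mate_key(id1)
    name2, mate2 = mate_key(id2)
    return name1 == name2 and mate1 == "1" and mate2 == "2"
```

Inside the loop over pairs, is_mate_pair(rec1.id, rec2.id) could be used to stop as soon as the file goes out of step.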

The solution is derived from this Stack Overflow entry.

Of course, if the mate-pair files are not merged, you can use this script to merge them. It also illustrates using iterators from two different files in one for loop:

#!/usr/bin/env python
from Bio import SeqIO
import sys
import os

def merge_fastq(fastq_path1, fastq_path2, outpath):
    # Interleave the records of two fastq files into one merged file
    with open(fastq_path1) as in1, open(fastq_path2) as in2, \
         open(outpath, "w") as outfile:
        fastq_iter1 = SeqIO.parse(in1, "fastq")
        fastq_iter2 = SeqIO.parse(in2, "fastq")
        # zip stops at the end of the shorter file
        for rec1, rec2 in zip(fastq_iter1, fastq_iter2):
            SeqIO.write([rec1, rec2], outfile, "fastq")

if __name__ == '__main__':
    # Output name is derived from the first input file
    outpath = "%s.merged.fastq" % os.path.splitext(sys.argv[1])[0]
    merge_fastq(sys.argv[1], sys.argv[2], outpath)

Brainf**k while waiting for a flight

Warning: NSFW language.

Brainfuck is a Turing-complete programming language consisting of eight commands, each of which is represented as a single character.

>    Increment the pointer.
<    Decrement the pointer.
+    Increment the cell at the pointer.
-    Decrement the cell at the pointer.
.    Output the ASCII value of the cell at the pointer.
,    Input a byte and store it in the cell at the pointer.
[    Jump forward past the matching ] if the cell at the current pointer is zero.
]    Jump backward to the matching [ unless the cell at the current pointer is zero.

Having arrived at JFK almost 3 hours early for my flight back to Cincinnati, I spent the time coding up a Python script that takes a string as input and outputs Brainfuck source code which, when run with a Brainfuck interpreter, prints said string. So for example:

to_brainfuck "Hello, world!"

Will output:

++++++++++[>+++>++++++>++++++++>++++++++++>+++++++++++<<<<<-]>>++++++++
++++.>>+.+++++++..>+.<<<<++++++++++++++.------------.>>>>++++++++.----
----.+++.<.--------.<<<+.-----------------------.

 

The horror above is what Brainfuck source code looks like. When you run the above code with a Brainfuck interpreter, it will print "Hello, world!".
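
To see why the generated code starts with a loop: the opening ++++++++++[>+++>++++++>++++++++>++++++++++>+++++++++++<<<<<-] runs ten times and adds 3, 6, 8, 10 and 11 to cells 1 to 5, leaving them at 30, 60, 80, 100 and 110. Each character then costs only a short run of + or - from the nearest cell. A small sketch of that arithmetic in Python 3 (the function names are mine, for illustration only):

```python
def cells_after_setup(counter=10, steps=(3, 6, 8, 10, 11)):
    """Values left in cells 1..5 by the opening loop of the generated code."""
    return [counter * step for step in steps]


def offset_from(cell_value, char):
    """Signed count of + (positive) or - (negative) needed to reach char."""
    return ord(char) - cell_value


cells = cells_after_setup()
print(cells)                       # [30, 60, 80, 100, 110]
print(offset_from(cells[1], "H"))  # 12: the twelve + signs before the first '.'
print(offset_from(cells[3], "e"))  # 1
```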

Brainfuck interpreters and compilers can be found here. Ubuntu has a Brainfuck interpreter called bf.

Probably not the best code I have written; it could use some honing. Still, it served the purpose of killing a couple of hours.

#!/usr/bin/env python
import sys

class bf:

    def __init__(self,format_bf=True):
        """
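        Initiate the brainfuck code string. The opening loop sets the
        cells to the following values; each cell then serves the
        range of characters shown next to it.
        cell0 = 10  loop counter (zero once the loop ends)
        cell1 = 30  characters below '@'
        cell2 = 60  '@' up to 'O'
        cell3 = 80  'P' up to 'c'
        cell4 = 100 'd' up to 'm'
        cell5 = 110 'n' and above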
        """
        self.bf_code = ''
        self.bf_code += "++++++++++[>+++>++++++>++++++++>++++++++++>+++++++++++<<<<<-]"        

        # Index: cell number. Value: cell value.
        self.ptrs = {1: 30, 2: 60, 3: 80, 4: 100, 5:110}
        self.ptr_idx = 0 # which pointer is being used
        self.format_bf = format_bf # Format the bf code. Default True.

    def string_bf(self,instring):
        # Accepts a string. Outputs bf code which prints that string
        # when run with a bf interpreter
        for c in instring:
            self.bf_code += self.to_bf(c)
        # add a newline
        self.bf_code += self.to_bf(chr(10))
        if self.format_bf:
            self.bf_code = self._format_bf_code(self.bf_code)
        return self.bf_code

    def _format_bf_code(self,bf_code):
        # Format the bf source code to 70 chars / line
        outstr = ''
        for i,c in enumerate(bf_code):
            if i % 70 == 0 and i > 0:
                outstr += '%s\n' % c
            else:
                outstr += c
        return outstr

    def to_bf(self,c):
        # accept a character c, generate the bf code to print that
        # character.

        # increment / decrement the data pointer
        if c < '@':
            ptr_target = 1
        elif c >= '@' and c < 'P':
            ptr_target = 2
        elif c >= 'P' and c < 'd':
            ptr_target = 3
        elif c >= 'd' and c <'n':
            ptr_target = 4
        else:
            ptr_target = 5
        ptr_inc_str = self.increment_ptr(ptr_target)
        # Now increment / decrement the value the pointer points to
        ascii_target = ord(c)
        ascii_val = self.ptrs[self.ptr_idx]
        inc_val, inc_val_str = self.increment_val(ascii_val, ascii_target)
        self.ptrs[self.ptr_idx] += inc_val
        return ptr_inc_str+inc_val_str+'.'

    def increment_val(self,ascii_val, ascii_target):
        inc_val = ascii_target - ascii_val
        if inc_val < 0:
            inc_val_str = '-'*abs(inc_val)
        elif inc_val > 0:
            inc_val_str = '+'*abs(inc_val)
        else:
            inc_val_str = ''
        return inc_val, inc_val_str

    def increment_ptr(self,ptr_target):
        ptr_inc = ptr_target - self.ptr_idx
        if ptr_inc < 0:
            ptr_str = '<'*abs(ptr_inc)
        elif ptr_inc > 0:
            ptr_str = '>'*ptr_inc
        else:
            ptr_str = ''
        self.ptr_idx += ptr_inc
        return ptr_str

if __name__ == '__main__':
    my_bf = bf()
    if sys.argv[1] == '-f':
        intext = open(sys.argv[2]).read()
    else:
        intext = sys.argv[1]
    o = my_bf.string_bf(intext)
    sys.stdout.write("%s\n" % o)
    

To run:

chmod +x bf_string.py
./bf_string.py "Brainfork is awesome!" > mycode.bf # generate Brainfuck code into mycode.bf
bf mycode.bf # The brainfuck interpreter bf
Brainfork is awesome!

And mycode.bf will contain:

++++++++++[>+++>++++++>++++++++>++++++++++>+++++++++++<<<<<-]>>++++++.>
>>++++.<<+++++++++++++++++.>+++++.>----.<---.>+.+++.<+++++.<<<++.>>>--
.>+.<<<<.>>.>>++++.<----.>----.----.<++++++++.--------.<<<+.----------
-------------.

You can also run it with the -f option, where the input string will be read from a file.
 

UPDATE: and here is a brainfuck interpreter, written in Python.
UPDATE II: Following Vincent's comment, here is a fixed version of the interpreter. This time it should work with nested loops. Thanks Vincent.

#!/usr/bin/env python
import sys
class BfInterpreter:
    def __init__(self,inpath):
        self.iptr = 0
        self.cells =[0]
        self.cmdptr = 0
        self.infile = open(inpath)
        self.bfcode = self.infile.read()
        self.cloop_stack = [] # location of current startloop in bf code
        self.ploop_stack = [] # which ptr is current loopcounter
        self.loop_ended = False # Indicates if a loop counter just
                                # reached zero
    def inc_ptr(self):
        self.iptr += 1
        if self.iptr > len(self.cells) - 1:
            self.cells.append(0)
    def dec_ptr(self):
        self.iptr -= 1
        if self.iptr <= -1:
            raise ValueError("negative pointer")
    def inc_cell(self):
        self.cells[self.iptr] += 1
    def dec_cell(self):
        self.cells[self.iptr] -= 1
        # Check if this is a loop counter
        if self.ploop_stack:
            if self.ploop_stack[-1] == self.iptr and \
               self.cells[self.iptr] == 0:
                self.loop_ended = True
    def start_loop(self):
        self.cloop_stack.append(self.cmdptr)
        self.ploop_stack.append(self.iptr)
    def end_loop(self):
        if self.cells[self.iptr] > 0:
            if not self.cloop_stack:
                raise ValueError("no startloop character found")
            else:
                self.cmdptr = self.cloop_stack[-1]
        elif self.cells[self.iptr] == 0 and self.loop_ended:
            self.loop_ended = False
            self.cloop_stack.pop()
            self.ploop_stack.pop()

    def putc(self):
        sys.stdout.write("%s" % chr(self.cells[self.iptr]))
    def getc(self):
        self.cells[self.iptr] = ord(sys.stdin.read(1))
    def run_bf(self):
        self.cmdptr = -1
        while True:
            self.cmdptr += 1
            if self.cmdptr >= len(self.bfcode):
                break
            cmd = self.bfcode[self.cmdptr]
            # print cmd,
            if cmd == '>':
                self.inc_ptr()
            elif cmd == '<':
                self.dec_ptr()
            elif cmd == '+':
                self.inc_cell()
            elif cmd == '-':
                self.dec_cell()
            elif cmd == '[':
                self.start_loop()
            elif cmd == ']':
                self.end_loop()
            elif cmd == '.':
                self.putc()
            elif cmd == ',':
                self.getc()
if __name__ == '__main__':
    bf = BfInterpreter(sys.argv[1])
    bf.run_bf()

To run, download the file, name it (say, pybf) and then:

chmod +x pybf
./pybf bf_source_code_file.bf
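
The interpreter above keeps track of loops with stacks at run time. A common alternative is to match the brackets once, before running, which also makes it easy to skip a loop whose cell is already zero when [ is reached. Here is a minimal sketch of that approach in Python 3. It is not a drop-in replacement for the script above: the function names are mine, and I am assuming 8-bit cells that wrap around and a zero byte on end of input, neither of which is fixed by the language.

```python
def match_brackets(code):
    """Map the index of every '[' to its matching ']' and vice versa."""
    stack, pairs = [], {}
    for pos, cmd in enumerate(code):
        if cmd == '[':
            stack.append(pos)
        elif cmd == ']':
            if not stack:
                raise ValueError("unmatched ']' at position %d" % pos)
            start = stack.pop()
            pairs[start], pairs[pos] = pos, start
    if stack:
        raise ValueError("unmatched '[' at position %d" % stack[-1])
    return pairs


def run_bf(code, instream=""):
    """Run Brainfuck source and return its output as a string."""
    pairs = match_brackets(code)
    cells, ptr, pc = [0], 0, 0
    inchars = iter(instream)
    out = []
    while pc < len(code):
        cmd = code[pc]
        if cmd == '>':
            ptr += 1
            if ptr == len(cells):
                cells.append(0)          # grow the tape on demand
        elif cmd == '<':
            if ptr == 0:
                raise ValueError("negative pointer")
            ptr -= 1
        elif cmd == '+':
            cells[ptr] = (cells[ptr] + 1) % 256   # assumption: 8-bit cells
        elif cmd == '-':
            cells[ptr] = (cells[ptr] - 1) % 256
        elif cmd == '.':
            out.append(chr(cells[ptr]))
        elif cmd == ',':
            # assumption: end of input reads as a zero byte
            cells[ptr] = ord(next(inchars, '\0')) % 256
        elif cmd == '[' and cells[ptr] == 0:
            pc = pairs[pc]               # skip the loop body entirely
        elif cmd == ']' and cells[ptr] != 0:
            pc = pairs[pc]               # jump back to the loop start
        pc += 1
    return ''.join(out)


print(run_bf("++++++++[>++++++++<-]>+."))  # A
```

Any character that is not one of the eight commands falls through the if chain and is ignored, so formatted source with newlines runs unchanged.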

Music Monday: Whole Lotta Love

This excellent cover of “Whole Lotta Love” went viral last week. Michael Winslow of Police Academy fame gives his interpretation of the Led Zeppelin classic:

And if that gave you a taste for the original, go here.