Comparative Functional Genomics: Penguin vs. Bacterium
No, not the flesh-blood-and-feathers penguin, but rather Tux, the beloved mascot of the Linux operating system. Compared with Escherichia coli, the model organism of choice for microbiologists.
We refer to DNA as “the book of life”; some geeks refer to it as the “operating system of life”. Just like a computer’s operating system, DNA contains all the instructions on how to “execute” life and to keep things humming. Many genes make proteins or RNA that act as switches to activate the synthesis of other proteins, sometimes in a two-, three-, or higher-level hierarchy. These switches are conditional: they depend on environmental conditions, on whether it’s time to replicate the DNA and divide into two daughter cells, and so on. Some genes activate the transcription of other genes but are not themselves regulated by other genes; those can be dubbed “master regulators”. Some genes are both activated by other genes and activate other genes themselves: “middle management”. Finally, there are genes that are activated but do not regulate other genes: the “workhorses”. This information, known as the transcriptional regulatory network, exists for 1,378 genes of the E. coli bacterium.
Paralleling this in Linux, there are programs that call other programs; again, in a hierarchical fashion. According to the calling structure, they also can be dubbed Master Regulators (calling other programs but not being called themselves), Middle Management (calling other programs and being called), and Workhorse (only being called).
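To make the three tiers concrete, here is a minimal sketch of how you could sort the nodes of any directed "who calls whom" (or "who regulates whom") graph into them. It is only an illustration of the definitions above, not the authors' code; the function name classify_nodes and the toy edge list (main, parse_args, run) are made up for the example.

```python
from collections import defaultdict

def classify_nodes(edges):
    """Label each node by its position in the hierarchy.

    edges: iterable of (caller, callee) pairs, i.e. caller -> callee.
    """
    out_degree = defaultdict(int)  # how many targets a node calls/regulates
    in_degree = defaultdict(int)   # how many nodes call/regulate it
    nodes = set()
    for caller, callee in edges:
        out_degree[caller] += 1
        in_degree[callee] += 1
        nodes.update((caller, callee))

    roles = {}
    for node in nodes:
        calls_others = out_degree.get(node, 0) > 0
        is_called = in_degree.get(node, 0) > 0
        if calls_others and not is_called:
            roles[node] = "master regulator"
        elif calls_others and is_called:
            roles[node] = "middle management"
        else:
            roles[node] = "workhorse"  # only called, never calls
    return roles

# Toy call graph (hypothetical, for illustration only).
toy_edges = [
    ("main", "parse_args"),
    ("main", "run"),
    ("parse_args", "strlen"),
    ("run", "strlen"),
    ("run", "malloc"),
]
roles = classify_nodes(toy_edges)
for name in sorted(roles):
    print(name, "->", roles[name])
```

The same function works for a gene network if each pair is read as (regulator, regulated gene).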
Koon-Kiu Yan and his colleagues from Yale mapped the program call graph in Linux by setting each program as a node and drawing lines to the programs that call it, and to the programs it calls. They did the same thing for E. coli’s transcriptional regulatory network. Here are the graphs they got:
So it seems like Linux is middle-management heavy, whereas E. coli is workhorse heavy. 30% of Linux programs are top management, as opposed to only 5% in E. coli.
Looking at the actual functions of the genes/programs, it seems that Linux programs also have much more functional redundancy than E. coli genes: 3.5% of E. coli’s genes have “reusable” functions, as opposed to 8.4% of Linux programs. But if we look at entire working subgraphs of these two graphs, the subgraph overlap in Linux is 87%, whereas in E. coli the overlap is only 4.3%. This means that the division of labor in E. coli is much more distinct than in Linux. There are many ways of activating the same hierarchy in Linux, but in E. coli there is rarely more than one way to do it. Note that Linux is top-heavy, whereas E. coli has a pyramid-like structure. The Workhorse modules in Linux evidently go through heavy reuse, while those in E. coli do not.
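For the curious, here is one rough way to put a number on "subgraph overlap". This is a sketch under my own assumptions, not the measure used in the paper: I treat the module of a top-level node as everything reachable from it, and I score overlap between two modules with the Jaccard index (shared nodes divided by all nodes). The names reachable, jaccard, mean_module_overlap, and both toy graphs are hypothetical.

```python
from itertools import combinations

def reachable(graph, start):
    """Return every node that can be reached from start by following edges."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def jaccard(a, b):
    """Shared members divided by total members; 0.0 if both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def mean_module_overlap(graph, roots):
    """Average pairwise Jaccard overlap between the modules under each root."""
    modules = [reachable(graph, root) for root in roots]
    pairs = list(combinations(modules, 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Toy graphs (made up): adjacency lists mapping a node to what it calls/regulates.
shared_workhorses = {
    "prog_a": ["strlen", "malloc"],
    "prog_b": ["strlen", "malloc"],
}
separate_workhorses = {
    "reg_a": ["enzyme_1"],
    "reg_b": ["enzyme_2"],
}

overlap_shared = mean_module_overlap(shared_workhorses, ["prog_a", "prog_b"])
overlap_separate = mean_module_overlap(separate_workhorses, ["reg_a", "reg_b"])
print(overlap_shared, overlap_separate)
```

In the first toy graph both top-level programs lean on the same two Workhorses, so their modules coincide; in the second, each regulator has its own private Workhorse and the modules share nothing.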
The scientists then decided to look into how these two networks developed. The oldest genes in E. coli are the Workhorses, whereas the regulatory genes in middle and top management arrived more recently. In contrast, the newest programs in Linux (the most heavily rewritten ones) are the Workhorses, whereas the ones in the management echelons are less changed than their predecessors. The oldest programs are those that are in Middle Management. They are also the most abundant type in Linux’s call graph.
Who are the Workhorses in E. coli? Those are mostly enzymes, the proteins that catalyze specific biochemical functions. As a rule, enzymes are very specific: an enzyme would catalyze only one type of reaction, and only with a very specific chemical (substrate). Examples are enzymes that break up sugars: there is a specific enzyme for every type of sugar molecule. Who are the Workhorses in Linux? Those are the functions that get used all the time in thousands of different programs: strlen (measuring a character string’s length) or malloc (allocating memory for a data structure). The Workhorses in Linux are non-specific while the Workhorses in E. coli are very specific.
So how to account for these differences? Nothing in biology makes sense except in light of evolution, and we have to look to the evolutionary history of both the bacterial and the computational systems for answers. The major constraint in E. coli’s evolution is fitness. If something breaks down in one of E. coli’s Workhorses, it won’t get passed on to the next generation: the cell with the lethal mutation will never reproduce and will get thrown into Darwin’s rubbish bin. This leads to single-function Workhorses, because a multi-functional Workhorse would be too prone to messing up too many systems when it mutates, and would never make it to the next generation. That is why the Workhorses in E. coli’s network have a lower connectivity than those in Linux’s call graph.
The authors conclude that E. coli’s network evolved bottom-up, with system robustness being the main selective trait. In contrast, Linux evolved top-down, with reusability of the Workhorses being the main selective trait. Reusability and robustness trade off against each other. In the case of a man-made system like Linux, bugs in reusable modules are not a problem, since Workhorse bugs are easily fixed in the next release. It is much less costly, in coding time, to tweak existing Workhorses than to build new ones. Mutations in reusable Workhorses in E. coli would weed out those kinds of proteins from the gene pool, and therefore E. coli’s Workhorses are not reusable.
I’m not exactly sure what insight we can get by comparing natural vs. man-made networks. But hey, sometimes science is not about insight – sometimes it is just about being totally cool; and The Coolness is strong with this work.
Yan, K., Fang, G., Bhardwaj, N., Alexander, R., & Gerstein, M. (2010). Comparing genomes to computer operating systems in terms of the topology and evolution of their regulatory control networks. Proceedings of the National Academy of Sciences. DOI: 10.1073/pnas.0914771107