Distant homology and being a little pregnant
(Thanks to F.B. for the inspiration).
Sigh… people don’t seem to learn. It’s been almost 22 years (yikes!) since a distinguished group of scientists published a letter in Cell calling for a responsible use of the word “homology”. If you were born when that letter was published, then in the US you can already drink legally. And you may very well want to, by the time you finish reading this post.
As of today there are one hundred and sixty-seven articles listed in PubMed with the phrases “distant homology” or “remote homology” in either the title or the abstract.
Please: make it stop.
Homology is a qualitative term. It means having a common evolutionary origin. Two genes / proteins / organs are either homologous, or they are not. They cannot be “somewhat homologous” or “partially homologous” or (a favorite among molecular and structural biologists) “distantly / remotely homologous”.
Homology is inferred from similarity. Similarity is quantitative. If organs are sufficiently similar, like mammalian forelimbs, then they are considered to be homologous. They may be more similar (like the hands of humans and chimpanzees), or less similar (like a human hand and a bat wing). Nevertheless, once they pass a certain similarity threshold, homology is inferred. The same applies to sequences of proteins and nucleic acids. Similarity can be measured. Different degrees of similarity can be compared and scaled.
If two protein sequences are aligned, and 40% of the amino acids in the alignment are identical, then the two sequences have a 40% identity. They do not have a 40% homology. They are homologous, and the homology is inferred from the similarity. We observe that the two sequences are similar, and then we conclude that they are homologous. We use the sequence similarity, as measured by percent identity, to trace a line of common descent for those proteins we deem homologous.
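To make the observation side of this concrete, here is a minimal sketch of how percent identity could be computed from two sequences that have already been aligned. The function name, the gap character, and the two fragments are made up for illustration, and the choice to count only columns where neither sequence has a gap is an assumption: other denominators are in use, and they give different numbers for the same alignment.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over aligned columns where neither sequence has a gap.

    Assumption: the denominator is the number of ungapped columns.
    Other conventions (full alignment length, shorter sequence length) exist.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only the columns where both sequences have a residue.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != gap and b != gap]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


# Two hypothetical, pre-aligned fragments (not real proteins).
seq_a = "MKTAYIAKQR"
seq_b = "MRSAWLAKHK"

identity = percent_identity(seq_a, seq_b)
print(f"{identity:.0f}% identity")  # 4 of 10 columns match
```

Note what the function returns: a number. Whether the two sequences are homologous is a separate, yes-or-no conclusion that has to be argued from that number (or, better, from a proper statistical score), not read off it.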
(As an aside I should say that the percentage of sequence identity, or %ID, is not a very good measure for inferring homology, nor is it for measuring similarity. It is an easy one to use, but it is very coarse and prone to errors. There are many better measures out there, including statistical ones like e-values and p-values, or information-theoretic ones like bit scores. But I digress, and this is a matter for another post.)
But once we confuse observations with conclusions, things quickly become an impossible muddle.
Am I not just picking nits here? I mean, surely when the term “distant homology” comes up in a paper or in conversation, we all know the meaning. Distant homology means having a common evolutionary origin, but with a common ancestor that was around a long time ago. “Distant homology” is intuitive, brief yet understandable. It is less cumbersome than: “homologous, with a distant common ancestor, as concluded from a low yet statistically significant similarity” which is what we really should say if we properly separate observations from conclusions, as captain nitpick would have us do.
Allow me to answer with two examples. First, I have read several papers discussing “structural homology” in the context of protein structure. Those papers that discuss structural homology were actually using a verbal shortcut for a homology inferred from structural similarity. That is, they inferred common descent from protein structural similarity. This kind of inference is highly contentious, and while not necessarily wrong, must be done with great care and proper caveats. However, once the researchers rolled up observations with conclusions by using the “structural homology” verbal shortcut, they absolved themselves from convincing the reader that structural similarity is indeed a good measure of homology, and jumped directly to the conclusion that there is indeed a homology here. The framework for inferring homology from sequence similarity is well worked out, but not so for structure, yet. Therefore, even if we do use the verbal shortcut “distant homology”, we can only use it by virtue of having a certain measure of similarity well-established already, as in sequence-based similarity. If the measure is not well established, as is the case with structural similarity, then using the shortcut means we skip the proper scientific process of providing convincing observations before providing conclusions.
Second: even worse is the use of the term “functional homology”. This is a clear case of the word homology used as a drop-in synonym for similarity. The misnomer “functional homology” is typically used in studies where proteins that are clearly not homologous perform similar functions. Why infer evolutionary descent when clearly that was not intended in the first place? Well, once you start confusing similarity with homology, observations with conclusions, and make them synonymous, this is what happens.
So don’t even start this confusion. Separate observations from conclusions, and make the former support the latter. Homology is qualitative, similarity is quantitative. Genes cannot be distantly homologous any more than a woman can be a little pregnant.
Now you can have that drink. Unless you are a little pregnant.
Gerald R. Reeck, Christoph de Haën, David C. Teller, Russell F. Doolittle, Walter M. Fitch, Richard E. Dickerson, Pierre Chambon, Andrew D. McLachlan, Emanuel Margoliash, Thomas H. Jukes and Emile Zuckerkandl (1987). “Homology” in proteins and nucleic acids: A terminology muddle and a way out of it. Cell, 50 (5). DOI: 10.1016/0092-8674(87)90322-9