05 November 2007

My First Paper

The inauguration of this blog was just barely in time for me to report my first paper as primary (and sole) author:

KEESEY, T. M. 2007. A mathematical approach to defining clade names, with potential applications to computer storage and processing. Zoologica Scripta 36 (6): 607–621. doi:10.1111/j.1463-6409.2007.00302.x

Here's the abstract:
Clade names may be objectively defined based on conditions of phylogeny. Definitions usually take one of three forms — node-, branch- or apomorphy-based — but other forms and complex permutations of these forms are also possible. Some database projects have attempted to store definitions of clade names in a manner accessible to computer applications, but, so far, they have only provided ways of storing the most common types of definition. To create a more extensible system, I have taken a mathematical approach to defining clade names. To render definitions accessible to computer storage and analysis, I propose using Mathematical Markup Language (MATHML) with extensions. Since the mathematical approach is granular to the level of the organism, not to fuzzy higher levels such as population or species, it sheds light on some theoretical difficulties with defining clade names. For example, some definitions do not resolve to a single organism as the ancestor, but to sets of organisms which are not ancestral to each other and share common descendants. I term such sets ‘cladogenetic sets’.
If you made it through that, congratulations. Now you may have some questions.

What is a "clade"?

An ancestor and all of its descendants. As an example, mammals form a clade. Fish do not form a clade, since they exclude some descendants (tetrapods). Hoofed mammals ("ungulates") do not form a clade, since their common ancestors were not hoofed (instead, hooves have evolved several times among placental mammals).
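To make "an ancestor and all of its descendants" concrete, here is a minimal Python sketch (not code from the paper). The tree and every node name in it are made-up placeholders, and it assumes a simple tree in which each organism has a single parent.

```python
# Toy phylogeny: each key is a parent, each value lists its children.
# All node names are illustrative placeholders, not real taxonomic data.
CHILDREN = {
    "vertebrate_ancestor": ["ray_finned_fish", "lobe_finned_ancestor"],
    "lobe_finned_ancestor": ["coelacanth", "tetrapod_ancestor"],
    "tetrapod_ancestor": ["frog", "mammal_ancestor"],
    "mammal_ancestor": ["platypus", "human"],
}


def clade(ancestor):
    """Return the ancestor together with all of its descendants."""
    members = {ancestor}
    for child in CHILDREN.get(ancestor, []):
        members |= clade(child)
    return members


def is_clade(group):
    """True if the group equals some member plus all of that member's descendants."""
    return any(clade(member) == set(group) for member in group)


mammals = {"mammal_ancestor", "platypus", "human"}
# "Fish" here means everything in the toy tree except the tetrapod branch.
fish = {"vertebrate_ancestor", "ray_finned_fish", "lobe_finned_ancestor", "coelacanth"}

print(is_clade(mammals))  # True
print(is_clade(fish))     # False
```

The "fish" group fails the test because its earliest ancestor also has tetrapod descendants that the group leaves out.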

What is "branch-based", again?

The PhyloCode is a set of rules being put together to deal with the naming of clades. It recommends certain forms of definition. The main ones (but certainly not the only ones), with examples, are:
  • node-based. "Mammalia is the final common ancestor of platypuses and humans, and all descendants of that ancestor."
  • branch-based. "Synapsida is the initial ancestor of humans which is not also ancestral to sand lizards, and all descendants of that ancestor." (The image below represents two branch-based clades, one in red and one in yellow. White dots represent organisms in both clades.)
  • apomorphy-based. "Avialae is the first ancestor of Andean condors to possess powered flight homologous with that in Andean condors, and all descendants of that ancestor."


(Actual definitions would use proper scientific names instead of "platypuses", "humans", etc. but you get the idea.)
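The node-based and branch-based forms above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the formulation from the paper: the tree is a made-up toy with one parent per organism, the function names are hypothetical, and an organism is treated as part of its own lineage. The apomorphy-based form is left out because it would also need character data.

```python
# Toy phylogeny as a child -> parent map. Names are illustrative placeholders.
PARENT = {
    "sand_lizard": "amniote_ancestor",
    "synapsid_ancestor": "amniote_ancestor",
    "dimetrodon": "synapsid_ancestor",
    "mammal_ancestor": "synapsid_ancestor",
    "platypus": "mammal_ancestor",
    "human": "mammal_ancestor",
}


def lineage(organism):
    """List the organism and its ancestors, from the organism back to the root."""
    path = [organism]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def clade(ancestor):
    """Return the ancestor plus every organism whose lineage passes through it."""
    everyone = set(PARENT) | set(PARENT.values())
    return {o for o in everyone if ancestor in lineage(o)}


def node_based(a, b):
    """Clade of the final (most recent) common ancestor of a and b."""
    b_line = set(lineage(b))
    last_common = next(x for x in lineage(a) if x in b_line)
    return clade(last_common)


def branch_based(internal, external):
    """Clade of the initial ancestor of `internal` that is not ancestral to `external`."""
    external_line = set(lineage(external))
    exclusive = [x for x in lineage(internal) if x not in external_line]
    return clade(exclusive[-1])  # the last entry is the earliest such ancestor


mammalia = node_based("platypus", "human")
synapsida = branch_based("human", "sand_lizard")

print(sorted(mammalia))
# ['human', 'mammal_ancestor', 'platypus']
print(sorted(synapsida))
# ['dimetrodon', 'human', 'mammal_ancestor', 'platypus', 'synapsid_ancestor']
```

Note that the two definitions use the same organisms as reference points but pick out different ancestors, so the branch-based clade is the more inclusive of the two.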


This stands in contrast to the current taxonomic codes, which are rank-based. Definitions under rank-based codes look more like, "Homo is the genus that includes Homo sapiens." There is a very important difference between these two styles of definition. Rank-based definitions are based (at least partly) on subjective opinions, since the ranks (with the possible, but contentious, exception of species) do not have any objective meaning. We all probably learned about kingdoms, classes, orders, families, and genera in biology class, but these ranks don't have any intrinsic meaning. A family of birds might include a few closely related species, while a family of insects might include thousands, with more distant common ancestry.

Phylogenetic definitions, on the other hand, proceed directly from our knowledge of phylogeny. When two researchers disagree on the content of a rank-based taxon, they might be arguing about aesthetics, actual relationships, or both. When they disagree about the content of a phylogenetic taxon, they can only be arguing about actual relationships.


So, what did you do?

Since phylogenetic definitions are based directly on phylogeny, without need for opinions, they can be expressed in completely unambiguous language. This includes:
  • Mathematical formulas.
  • Computer languages.
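As I discuss in the paper, some people have created unambiguous shorthand formulas and unambiguous database schemas for representing phylogenetic definitions. But these previous efforts have all focused on simple definitional formats, ignoring other formats and complex permutations. To create a more extensible system, I took a mathematical approach to defining clade names, and I propose storing the definitions as MathML with extensions.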

Well, la-ti-da. So what?

This means more of the taxonomic process can be automated. With rank-based definitions, there has to be an expert to "feel out" how expansive a genus, family, order, etc. should be. But with phylogenetic definitions, you can feed a computer application the phylogeny encoded in a popular file format (e.g., NEXUS) and taxonomic definitions encoded in a popular file format (MathML), and it can figure out the content referred to by a taxonomic name in fractions of a second.
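As a rough idea of what that kind of automation could look like, here is a small Python sketch. It is not Names on NEXUS and not the encoding from the paper. The tree is a Newick string (the notation NEXUS files use for trees) with made-up labels on every node, and the definitions are plain tuples standing in for the MathML encoding. All function and variable names are hypothetical.

```python
# Toy tree in Newick notation; every node, including ancestors, is labelled.
TREE = "(sand_lizard,(dimetrodon,(platypus,human)mammal_ancestor)synapsid_ancestor)amniote_ancestor;"

# Definitions as data: (form, first specifier, second specifier).
# These tuples are a stand-in for a real encoding such as MathML.
DEFINITIONS = {
    "Mammalia": ("node", "platypus", "human"),
    "Synapsida": ("branch", "human", "sand_lizard"),
}


def parse_newick(text):
    """Parse a labelled Newick string ending in ';' into a child -> parent map."""
    parent = {}
    pos = 0

    def node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1  # skip "("
            children.append(node())
            while text[pos] == ",":
                pos += 1  # skip ","
                children.append(node())
            pos += 1  # skip ")"
        start = pos
        while text[pos] not in ",();":
            pos += 1
        label = text[start:pos]
        for child in children:
            parent[child] = label
        return label

    node()
    return parent


def lineage(parent, organism):
    """List the organism and its ancestors, from the organism back to the root."""
    path = [organism]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def resolve(parent, definition):
    """Return the set of organisms a definition refers to on this phylogeny."""
    form, first, second = definition
    second_line = set(lineage(parent, second))
    if form == "node":
        ancestor = next(x for x in lineage(parent, first) if x in second_line)
    elif form == "branch":
        ancestor = [x for x in lineage(parent, first) if x not in second_line][-1]
    else:
        raise ValueError(f"unsupported definition form: {form}")
    everyone = set(parent) | set(parent.values())
    return {o for o in everyone if ancestor in lineage(parent, o)}


PARENT = parse_newick(TREE)
for name, definition in DEFINITIONS.items():
    print(name, sorted(resolve(PARENT, definition)))
```

Swapping in a different tree changes the content of each name without touching the definitions, which is the point: the definitions stay fixed and the phylogeny decides what they contain.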

Okay, so where's the application?

I'm still working on one, called Names on NEXUS. So far it's going well; I just need to refactor and complete the server-side application and touch up the client-side application. Should have some time for that next year.

