03 March 2009

Refactoring "Names on Nodes" Entities, Part I

(Warning: If you are not me, this post may not make much sense. Same could be said for many recent posts. Sorry for all the self-indulgence here, lately, but I'm trying to work through a lot of thorny issues.)

Last year I wrote a post about some revisions to the entity schema of Names on Nodes, my longstanding project to automate the application of phylogenetic nomenclature. The revisions were pretty hefty, and necessitated a rewrite of much of the project. I got pretty far without making any further major modifications to the schema. But, after a few months of work, some flaws are beginning to show.

Once again, here is the UML diagram:
And, once again: The white arrows indicate inheritance, i.e., "is-a" relationships. For example, a PhyloDefinition is a type of Definition. The black diamonds indicate composition, i.e., "has" relationships. For example, a Definition has any number of Anchor entities, each of which has exactly one Signifier entity.

So, the problems...

The nomenclature is confusing.

Not all of it, but some. What I was calling a SignifierIdentity is, in fact, a taxon (in a somewhat loose sense, i.e., any set of organisms, or subset of life, or whatever; more here), and a Signifier is just a taxon identifier. What I was calling an Authority is actually an authority identifier, and what I was calling an AuthorityIdentity ... is really an authority!

Anchors are insufficient.

The idea of the Anchor class was to allow every definition, be it rank-based or phylogenetic, to be connected with any number of taxa, namely, those taxa required by the definition. Each Anchor object specifies a taxon (through an identifier/signifier) and tells whether it is internal or external. I had hoped that this would work equally well for both rank-based and phylogenetic definitions, modeling biological types for the former and the specifiers for the latter. But there are some crucial differences between types and specifiers:
  • A rank-based definition may not have a type; but a phylogenetic definition must have at least one specifier (usually two or more, but in theory you could get by with one, e.g., "Homo erectus (Dubois 1892) and all of its descendants," not that I'd recommend it in most cases).
  • A specifier can be a character state description, but a type cannot. (Both can be taxonomic names or specimen identifiers.)
  • Types are always internal, so it's pointless to have to mark them as such.
  • A type is always included in the taxon. A specifier, even an internal one, may not be, since phylogenetically-defined taxa are potentially empty.
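
To make those differences concrete, here is a rough Python sketch of what modeling types and specifiers separately might look like. This is not the actual Names on Nodes code; the class and field names (Specifier, RankBasedDefinition, PhyloDefinition, taxon_id, type_id) are made up for illustration, and taxa are stood in for by plain strings.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Specifier:
    """Hypothetical specifier: a taxon reference plus an internal/external flag."""
    taxon_id: str    # a name, a specimen identifier, or a character state description
    internal: bool   # specifiers have to say which side of the definition they are on


@dataclass(frozen=True)
class RankBasedDefinition:
    """Hypothetical rank-based definition: the type is optional and always internal."""
    type_id: Optional[str] = None   # no internal/external flag needed for a type


@dataclass(frozen=True)
class PhyloDefinition:
    """Hypothetical phylogenetic definition: at least one specifier is required."""
    specifiers: Tuple[Specifier, ...]

    def __post_init__(self):
        # Unlike a rank-based definition, this one cannot be left without anchors.
        if not self.specifiers:
            raise ValueError("a phylogenetic definition needs at least one specifier")


# A rank-based definition with no type is allowed.
typeless = RankBasedDefinition()

# A one-specifier phylogenetic definition is legal, if not recommended.
erectus_clade = PhyloDefinition((Specifier("Homo erectus", internal=True),))
```

Note that nothing here says an internal specifier is a member of the defined taxon; that can only be known once the definition is applied, since the result may be empty.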

Relations are insufficient.

Why do Parentage and Inclusion both extend Relation? Because they can. They both require two ordered operands (parent and child for the former, superset and subset for the latter). There really is no other reason; modeling them this way doesn't make calculations faster (in fact, it slows them down), and gives no benefit otherwise. Furthermore, the Relation class is incapable of modeling other types of relations, like equation (i.e., subjective and heterodefinitional synonymy), which has two or more unordered operands. (Note: objective/homodefinitional synonymy is already well-handled by the relation of identifiers/signifiers to taxa/identities.)
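
The ordered-versus-unordered distinction can be sketched in a few lines of Python. Again, this is only an illustration under my own assumptions, not the project's code: Parentage and Inclusion are shown as independent classes with two named operands each, and Equation is a hypothetical class holding an unordered set of two or more operands.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Parentage:
    """Ordered: swapping parent and child gives a different relation."""
    parent: str
    child: str


@dataclass(frozen=True)
class Inclusion:
    """Ordered: swapping superset and subset gives a different relation."""
    superset: str
    subset: str


@dataclass(frozen=True)
class Equation:
    """Hypothetical unordered relation for subjective/heterodefinitional synonymy."""
    operands: FrozenSet[str]

    def __post_init__(self):
        # Synonymy only makes sense between at least two identifiers.
        if len(self.operands) < 2:
            raise ValueError("an equation needs two or more operands")


# Operand order matters for parentage...
forward = Parentage("A", "B")
backward = Parentage("B", "A")

# ...but not for equation, and an equation may have more than two operands.
eq1 = Equation(frozenset({"X", "Y", "Z"}))
eq2 = Equation(frozenset({"Z", "X", "Y"}))
```

A two-slot Relation base class fits the first two shapes but has nowhere to put the third, which is the limitation described above.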

Relators are insufficient.

Why do Definition, DefinitionApplication, and Dataset all extend Relator? Good question. The idea was that all of them indicate relations of some kind. But this resemblance only goes so far.
  • Rank-based definitions do indicate that the types are included by the defined identifier/signifier, but phylogenetic definitions don't really indicate anything, since they potentially yield empty results.
  • The inclusions indicated by rank-based definitions are redundant with the information about their types. I had to implement an awkward system to synchronize this.
  • Datasets are the only relators that can indicate parentage; the other two can only indicate inclusion.
  • Only datasets can indicate subjective synonymy, and only definition applications can indicate heterodefinitional synonymy. Those relations aren't currently modeled at all, but should be.

Definitions do not need to reference an authority.

For a while, I had been considering taxonomic names as defined by different authorities to share the same identity. This proved unworkable. Instead, whenever an authority defines a name, it is either coining that name anew, or converting it into a new name (that happens to have the same spelling, but a different authority). For example, Aves under Linnaeus 1758 and Aves under the ICZN are the same thing, but Aves sensu Gauthier & de Queiroz 2001 and Aves sensu Sereno 2005 are different entities.

For this reason, a definition can be considered to have the same authority as the name it defines. Keeping an extra reference to an authority is redundant. Under this system, every name gets only one definition (if that).

Looking up contextual relations is awkward.

Those other problems are pretty minor compared to this one. One of the core ideas of Names on Nodes is that you are free to create a phylogenetic context. A context is basically a way of saying which datasets you want to use (and which you want to ignore). Every definition is true for all contexts, but the application of each definition may differ.

Thus, when looking up things like whether taxon A is ancestral to taxon B (something you have to do a lot when applying phylogenetic definitions), the algorithm has to look at every single relation and decide whether it belongs or not. Does it belong to a definition? Then it belongs. Does it belong to a definition application? Then it belongs if that application is under the specified context. Does it belong to a dataset? Then it belongs if that dataset is included in the context. I optimized this a lot, but, at the end of the day, I was making it do something it did not really need to do. Which brings me to my last point.
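
The per-relation check described above boils down to something like the following Python sketch. It is a simplification under assumed names, not the real implementation: Context, Relation, source_kind, source_id, belongs and relations_in_context are all hypothetical, and each relation's owner is reduced to a kind tag plus an identifier.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Context:
    """Hypothetical context: a name plus the datasets the user chose to include."""
    name: str
    dataset_ids: FrozenSet[str]


@dataclass(frozen=True)
class Relation:
    """Hypothetical relation record tagged with the relator it came from."""
    source_kind: str   # "definition", "application", or "dataset"
    source_id: str     # context name for an application, dataset id for a dataset
    a: str             # first operand (parent or superset)
    b: str             # second operand (child or subset)


def belongs(relation: Relation, context: Context) -> bool:
    """Decide whether a single relation counts under the given context."""
    if relation.source_kind == "definition":
        return True                                       # true in every context
    if relation.source_kind == "application":
        return relation.source_id == context.name         # only under its own context
    if relation.source_kind == "dataset":
        return relation.source_id in context.dataset_ids  # only if the dataset is included
    return False


def relations_in_context(relations: Iterable[Relation], context: Context) -> List[Relation]:
    """Scan every relation and keep those that belong: the costly part."""
    return [r for r in relations if belongs(r, context)]
```

Every ancestry query has to run a filter like this over all relations first, which is why the lookup stays awkward no matter how much the inner check is tuned.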

Looking up contextual relations is not easily optimized.

The Context class is pretty bare bones, and that's not a good thing. I've been looking into implementing some of the optimizations present in Bender & al. 2005, but it's not possible with the current schema.



So, some revisions are needed. Not nearly as major as last time, but fairly significant. More in Part II, coming some day....
