When you use a thesaurus to index content that spans multiple disciplines, the need to disambiguate terms increases. This fact of thesaurus life was well illustrated in a presentation at this year’s DHUG (Data Harmony Users Group) meeting. The presentation, by Rachel Drysdale, Taxonomy Manager of the Public Library of Science (PLOS), was titled “The PLOS Thesaurus: the first year.”

While Rachel discussed a variety of aspects of thesaurus implementation and maintenance, what caught my interest and sympathy as a fellow taxonomist was her description of what she called “taxonomy funnies.” Anyone who has been a taxonomist for any length of time has run into such funnies: problems that are chuckle-worthy but still need to be dealt with.

In the talk, Rachel discussed the refinement of indexing rules. PLOS maintains its thesaurus in MAIstro, a Data Harmony software application that integrates a taxonomy management tool, Thesaurus Master, with M.A.I., an indexing application in which a “rule base” of indexing rules is maintained. In MAIstro, when a term is added to a thesaurus, a simple identity rule is automatically created in the associated rule base. So when the Animals branch was being developed, the addition of “Pumas” caused the creation of a rule that looked like this:

Text to Match [in the text being read and parsed by M.A.I.]: pumas

USE [Indexing term] Pumas

M.A.I. also recognizes singular and plural variants. In the absence of any rule or condition to the contrary, the rule above would cause the indexing term “Pumas” to be assigned automatically, or suggested to a human editor, whenever M.A.I. came across the text string “puma” or “pumas”.
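For readers who think in code, here is a minimal Python sketch of what an identity rule does. It is not M.A.I.'s actual implementation or rule syntax; the function name `identity_rule`, the naive plural handling, and the case-insensitive matching are all assumptions made for illustration.

```python
import re

def identity_rule(term):
    """Return a function that suggests `term` when its singular or plural
    form appears in the text. Hypothetical helper, not a Data Harmony API."""
    stem = term.lower().rstrip("s")  # naive singularization: "Pumas" -> "puma"
    # Assumption: matching ignores case and accepts an optional trailing "s".
    pattern = re.compile(rf"\b{re.escape(stem)}s?\b", re.IGNORECASE)

    def apply(text):
        return [term] if pattern.search(text) else []

    return apply

pumas_rule = identity_rule("Pumas")
print(pumas_rule("A puma was sighted near the ridge."))   # ['Pumas']
print(pumas_rule("Pumas are solitary hunters."))          # ['Pumas']
print(pumas_rule("This article is about house cats."))    # []
```

Note that a rule like this looks only at the string itself. It has no idea what the surrounding text is about.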

PLOS content has good coverage of zoological topics, but is also especially heavy on molecular biology, particularly genetics. The PLOS wordsmiths were mystified when they found that multitudes of genetics articles were being indexed with the term “Pumas”. True, there might have been a sprinkling of articles about wild feline genetics, but this would not account for the number of articles that boasted the “Pumas” descriptor.

The taxonomists at PLOS looked at the articles in question and found the culprit. “PUMA” was appearing in those articles as an acronym for a gene whose full name is “p53 upregulated modulator of apoptosis.” (I can’t blame the geneticists for using an acronym for that one. The full name isn’t very conversation-friendly.) And it’s not specific to pumas; humans have it, and so do such diverse creatures as fish and frogs. So the PLOS taxonomists modified the indexing rule, adding conditions that required at least one other word or phrase having to do with the world of wild feline creatures to be present before “Pumas” could be assigned or suggested. The addition of a few synonyms and quasi-synonyms for pumas made the rule richer and better able to disambiguate pumas from PUMAs. The rule ended up looking like this:

Text to Match: pumas

IF (MENTIONS “feline*” OR MENTIONS “jaguar*” … OR MENTIONS “panther*” OR MENTIONS “cougar*” OR MENTIONS “catamount*” …)

USE Pumas

ENDIF
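The logic of the conditional rule above can be approximated in a few lines of Python. This is only a sketch under stated assumptions: `MENTIONS` is treated as "a word starting with this stem appears anywhere in the text", the function names are hypothetical, and the cues elided with “…” in the rule are simply left out rather than guessed.

```python
import re

# Context cues copied from the rule above; elided ("…") entries are omitted.
CONTEXT_CUES = ["feline", "jaguar", "panther", "cougar", "catamount"]

def mentions(text, stem):
    """Rough stand-in for MENTIONS "stem*": any word beginning with stem."""
    return re.search(rf"\b{re.escape(stem)}\w*", text, re.IGNORECASE) is not None

def pumas_rule(text):
    """Suggest "Pumas" only when the trigger and a context cue both appear."""
    if not re.search(r"\bpumas?\b", text, re.IGNORECASE):
        return []  # trigger text not present
    if any(mentions(text, cue) for cue in CONTEXT_CUES):
        return ["Pumas"]
    return []  # trigger present but no supporting context, so stay silent

print(pumas_rule("PUMA is induced by p53 during apoptosis."))  # []
print(pumas_rule("Pumas are large feline predators."))         # ['Pumas']
```

The design choice is the interesting part: the rule stays silent unless it sees corroborating evidence, so an ambiguous string alone is no longer enough to trigger the term.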

The next indexing run was much better. Alas, there were still some articles inappropriately indexed with “Pumas”. What was wrong? The PLOS editors did some more detective work.

It turned out that some of the problem articles were about the toxoplasma parasite, which has many variant strains and is found in a wide variety of organisms, including people, frogs, and cats. One of those variant strains is known as COUGAR. A conceptual relationship with actual cougar critters does exist; the variant was first discovered in a group of Canadian cougars. That’s rather tangential, though. The toxoplasma articles in question aren’t really about cougars. The problem was that as far as animals (and the PLOS rule base) were concerned, “Cougars” is a synonym of “Pumas”. So when the indexing system read “COUGAR” in the text, “Pumas” got popped onto the list of subject terms for each of those toxoplasma articles.
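The COUGAR mix-up is easy to reproduce in miniature. The Python sketch below assumes a simple synonym table that maps matched text to an indexing term; the table layout and the function name `suggest_terms` are hypothetical and are not how the PLOS rule base is actually stored.

```python
import re

# Assumed synonym table: text stem -> indexing term.
SYNONYMS = {"puma": "Pumas", "cougar": "Pumas"}

def suggest_terms(text):
    """Collect every indexing term whose stem (singular or plural) appears."""
    found = set()
    for stem, term in SYNONYMS.items():
        if re.search(rf"\b{stem}s?\b", text, re.IGNORECASE):
            found.add(term)
    return sorted(found)

# A strain name in capitals is indistinguishable from the animal here.
print(suggest_terms("The COUGAR strain of Toxoplasma gondii was genotyped."))
# ['Pumas']
```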

The next critter slithering amok through the PLOS records was the snail. What would make snails unruly? The real culprit is once again a gene in disguise, in this case SNAI1, naturally referred to frequently as SNAIL. Once such a culprit is properly identified, it’s a straightforward matter to modify the rule so that it prevents the wrong term from being suggested or assigned, by considering likely contexts and reflecting those in the rule conditions. One bonus of the situation is that the same rule can be further modified to enable indexing of the formerly problematic document with a more appropriate term.
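Here is a rough Python sketch of that two-for-one idea: one rule that blocks the wrong term and redirects to a better one. Everything specific in it is invented for illustration, since the talk did not give the actual rule: the cue lists, the function names, and the replacement term “Transcription factors” are all hypothetical.

```python
import re

# Hypothetical context cues; the real PLOS rule conditions are not given.
GENE_CUES = ["snai1", "transcription factor", "gene expression"]
ANIMAL_CUES = ["gastropod", "mollus", "shell", "slug"]

def has_cue(text, cues):
    """True if any cue occurs as a substring of the lowercased text."""
    lowered = text.lower()
    return any(cue in lowered for cue in cues)

def snails_rule(text):
    """Disambiguate "snail": gene context redirects, animal context confirms."""
    if not re.search(r"\bsnails?\b", text, re.IGNORECASE):
        return []
    if has_cue(text, GENE_CUES):
        return ["Transcription factors"]  # hypothetical replacement term
    if has_cue(text, ANIMAL_CUES):
        return ["Snails"]
    return []  # ambiguous: suggest nothing rather than guess

print(snails_rule("SNAIL (SNAI1) represses E-cadherin gene expression."))
# ['Transcription factors']
print(snails_rule("Freshwater snails graze on algae; shell size varied."))
# ['Snails']
```

Checking the gene cues first is a deliberate choice in this sketch: in a collection heavy on molecular biology, the gene reading is the one most likely to cause trouble.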

There’s no reason to be afraid of the wild animals in your thesaurus, as long as you stay alert for them. You can tame the mighty mountain lion and the slithery snail.

Barbara Gilles, Taxonomist
Access Innovations