Connecting the Taxonomy with Search

A customer asked me, “How is the taxonomy connected to search?” Their search vendor says it can implement the full taxonomy. It is a simple question with a long answer.

The taxonomy (the hierarchical view of a controlled vocabulary) is really just a list of terms in their preferred form, used to tag the individual items in the customer’s collection by subject. The tagged items can be books, book chapters, journal articles, museum specimens, white papers, technical reports, HR documents, payroll records, or anything else that needs to be retrieved from the computer system. Let’s call them “information objects” (IO). The popular name for the application is search, a term that has itself become amorphous, but the same tagging applies when items are loaded into a records management system (RMS), a content management system (CMS), SharePoint, or a library catalog (OPAC). Each application calls the activity something different, which makes our lives confusing: metadata tagging, subject headings, keywords, taxonomy terms, thesaurus terms, ontology, descriptors, controlled terms, and so forth are all used in different venues.

The workflow has two major parts. The first is attaching the appropriate terms to the individual information objects (IO). That means the terms become elements in the XML record for the IO. They might also sit in a table in the relational database management system (RDBMS), usually as a secondary key linked to an accession or other item number. Ideally, the system will allow more than one term per IO.
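To make the relational option concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names, accession number, and terms are all hypothetical, chosen only for illustration; a real RDBMS schema will differ. The point is simply that a separate linking table lets one IO carry several terms.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE information_object (
        accession_no TEXT PRIMARY KEY,
        title        TEXT NOT NULL
    );
    -- One row per (IO, term) pair, so an IO can carry many terms.
    CREATE TABLE io_term (
        accession_no TEXT NOT NULL REFERENCES information_object(accession_no),
        term         TEXT NOT NULL,
        PRIMARY KEY (accession_no, term)
    );
""")

# Tag one made-up information object with two taxonomy terms.
conn.execute(
    "INSERT INTO information_object VALUES (?, ?)",
    ("A-0001", "Annual payroll summary"),
)
conn.executemany(
    "INSERT INTO io_term VALUES (?, ?)",
    [("A-0001", "Payroll records"), ("A-0001", "Human resources")],
)


def terms_for(accession_no):
    """Return the taxonomy terms attached to one information object."""
    rows = conn.execute(
        "SELECT term FROM io_term WHERE accession_no = ? ORDER BY term",
        (accession_no,),
    )
    return [term for (term,) in rows]
```

Calling terms_for("A-0001") returns both terms attached to that record.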

The second part is search itself. Once the terms are attached to the records, or share a common accession number with them, they can be used in the search system. Voila! The search system is using the taxonomy!
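As a rough illustration of what "used in the search system" means, the sketch below builds a tiny inverted file from tagged records: each term points to the accession numbers that carry it. The records and function names are made up for the example, and a real engine such as Lucene does far more (tokenizing, ranking, caching).

```python
from collections import defaultdict

# Hypothetical tagged records: accession number -> taxonomy terms attached to it.
tagged = {
    "A-0001": ["Payroll records", "Human resources"],
    "A-0002": ["Museum specimens"],
    "A-0003": ["Human resources", "Technical reports"],
}


def build_inverted_index(records):
    """Map each lowercased term to the set of accession numbers tagged with it."""
    index = defaultdict(set)
    for accession_no, terms in records.items():
        for term in terms:
            index[term.lower()].add(accession_no)
    return index


def search(index, term):
    """Return the accession numbers tagged with the term, in sorted order."""
    return sorted(index.get(term.lower(), set()))


index = build_inverted_index(tagged)
```

A lookup such as search(index, "human resources") then returns every record tagged with that term, regardless of the words used in the record's own text.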

However, if you want to leverage the full potential of a taxonomy in search, you have to have it on the user interface (UI) end. All the wonders of the taxonomy, such as related terms, synonyms, hierarchy, type-ahead, browsing the navigation tree, and recommendations (“more like these”), happen in the search presentation layer of the user-facing page. This is where some understanding of basic information architecture, and of how to leverage the taxonomy, comes into play. If you go to the www.mediasleuth.com web site you can see search done using a taxonomy. The search engine underneath is Lucene, which is open source search software. This search interface can use any standards-compliant taxonomy and any data, and can ride on most search software.

So in summary,

The taxonomy is used as a flat file to tag records with meaningful subject access to the records.
The search software then includes the taxonomy terms in the inverted or searchable file of the retrieval engine.
The taxonomy is showcased on the user experience or search interface where the user can browse, get additional suggestions, and be prompted to do a better search by leveraging the taxonomy.

To do this, the search presentation layer is really interacting with two things simultaneously. It sends the search to the search software’s indexes and caches; at the same time, it sends the term searched for to the taxonomy interface to surface additional terms. These additional terms might be related terms and/or narrower terms that show part of the navigation hierarchy. The same interaction also enables the type-ahead feature.
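The two-way interaction can be sketched as follows. This is only an illustration under stated assumptions: the taxonomy fragment, the index contents, and the function names handle_query and type_ahead are all hypothetical, and both lookups are plain in-memory dictionaries standing in for a real search engine and a real taxonomy service.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    label: str
    narrower: list = field(default_factory=list)
    related: list = field(default_factory=list)


# Hypothetical taxonomy fragment, keyed by lowercased preferred term.
TAXONOMY = {
    "human resources": Term(
        "Human resources",
        narrower=["Payroll records", "Recruitment"],
        related=["Labor law"],
    ),
    "payroll records": Term("Payroll records"),
    "recruitment": Term("Recruitment"),
    "labor law": Term("Labor law"),
}

# Hypothetical search index: term -> accession numbers of tagged records.
INDEX = {
    "human resources": ["A-0001", "A-0003"],
    "payroll records": ["A-0001"],
}


def type_ahead(prefix, limit=5):
    """Suggest preferred terms that start with what the user has typed."""
    prefix = prefix.lower()
    matches = [t.label for key, t in TAXONOMY.items() if key.startswith(prefix)]
    return sorted(matches)[:limit]


def handle_query(query):
    """Send the query to the index and to the taxonomy, then combine the results."""
    key = query.lower()
    hits = INDEX.get(key, [])   # what the search software returns
    term = TAXONOMY.get(key)    # what the taxonomy interface returns
    return {
        "hits": hits,
        "narrower": term.narrower if term else [],
        "related": term.related if term else [],
    }
```

Here handle_query("Human resources") returns the matching records together with narrower and related terms the interface could display, while type_ahead("pa") suggests "Payroll records" as the user types.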

When the search vendor says they can fully use the taxonomy, find out what they mean by that.

Is this a W3C taxonomy or an ISO/ANSI/NISO/BSI taxonomy definition? The latter is more robust.
Can they load synonyms to the search indexes? Is there a limit on the synonyms?
Can they use and show the hierarchy in the search results?
Do they allow related terms to take the user across hierarchies to other topical areas, or do you have to go up and down silos of information?
How are the taxonomy terms added to the data? Manually? Are they controlled terms or are they automatically generated by statistical pointers (uncontrolled terms)?
If they say you can manage the terms from within their system, then ask whether you can add related terms (associations), broader and narrower terms (hierarchies), and synonyms (equivalence terms). What about scope notes: where do they go? Does the system keep a history of activities on terms, or an audit trail?
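To make the last question concrete, here is a hedged sketch of what a single managed term record might need to hold. The class TermRecord, its field names, and the sample values are hypothetical and are not taken from any vendor's system or from a standard's data model; the sketch only shows the kinds of relationships, notes, and history worth asking about.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class TermRecord:
    preferred: str
    broader: set = field(default_factory=set)    # hierarchy, upward
    narrower: set = field(default_factory=set)   # hierarchy, downward
    related: set = field(default_factory=set)    # associations across hierarchies
    synonyms: set = field(default_factory=set)   # equivalence terms
    scope_note: str = ""
    history: list = field(default_factory=list)  # simple audit trail

    def _log(self, action, value):
        # Record when each change happened, what it was, and the value involved.
        self.history.append((datetime.now(timezone.utc), action, value))

    def add_synonym(self, label):
        self.synonyms.add(label)
        self._log("add_synonym", label)

    def add_related(self, label):
        self.related.add(label)
        self._log("add_related", label)

    def set_scope_note(self, text):
        self.scope_note = text
        self._log("set_scope_note", text)


# Made-up example term and edits.
term = TermRecord("Payroll records", broader={"Human resources"})
term.add_synonym("Salary records")
term.add_related("Tax withholding")
term.set_scope_note("Use for records of wages paid to employees.")
```

If a vendor's term editor cannot represent each of these fields, or cannot show something like the history list, that is a useful thing to learn before signing.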

Marjorie M.K. Hlava
President, Access Innovations
