Introduction to Taxonomies – definitions

To discuss the semantic integration or the leveraging of a taxonomy in search, web sites, mashups, and other places, we should first review what taxonomies are. Let’s look at the definitions and then at the taxonomy as a building block for the larger information architecture of an organization. We need to keep that bigger picture in mind when we talk about where we apply taxonomies. Once those are out of the way I will review some use-cases and show what makes them work.

A taxonomy is a “knowledge organization system” – or a KOS. There are many types of KOS, which you will see in a minute, but basically it is a set of words that have been organized to control the terms used in a field, a “vocabulary”, so that things can be easily found in a specific subject area. These Knowledge Organization Systems are usually specific to a knowledge domain or a topical area, a subject area, a scholarly or scientific or an enterprise area. They’re really descriptive labels. We have many names we can call something, but we settle on one descriptive label as the main term and then post all the other terms we use to that label as its synonyms, no matter what we call it.

Taxonomies often put words (controlled terms) together in a hierarchical view, meaning a broader/narrower navigation tree, so that people can browse the tree to find information quickly. They are certainly used as a storage and retrieval aid. That is why we started using them in the first place. We use them both to tag information objects as we add them to a database and to search as we query that same database.

Those controlled vocabularies or those KOS – Knowledge Organization Systems – can be anything from a flat list, like your Saturday to-do list, to a list with ambiguity control provided by a synonym ring. We might put some structure to it and map the terms into a hierarchy, which gives us a taxonomy. We move to a full thesaurus by adding relationships between the terms, notes, and other features, along with the ambiguity control from a synonym ring. The thesaurus has those four features: the hierarchy (taxonomy), the relationships (associative or related terms), ambiguity control (non-preferred terms, synonyms, use references), and various kinds of notes. The next step in complexity would be to define those associative relationships in many different ways, i.e., as an ontology. The final option in the increasing complexity is the Linked Data or Semantic Web options, where the actual items described in many different systems are hooked together by this KOS. People are increasingly saying taxonomy or ontology and meaning thesaurus. Whatever the instance is called, they are still all knowledge organization systems, just as classification systems are.

In an organization system, we put control on the terminology, but we also have a hierarchical format – parent-child, genus-species, and broader-narrower types of relationships. Specific items can occur as final leaves on those hierarchies; that is, they sit on the branches of those navigation trees. They are sometimes known as narrower term instances; sometimes they are an actual URL or a document itself. Taxonomies are very common on websites. They are used as drop-down pick lists. They may represent a sizeable directory and, as you will see, a lot of other variations.
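As a minimal sketch of that broader-narrower structure, here is a small navigation tree in Python. The terms and the function names are made up for illustration; a real system would load its hierarchy from a taxonomy tool or a database.

```python
# Hypothetical broader -> narrower mapping; the terms are illustrative only.
NARROWER = {
    "Vehicles": ["Cars", "Bicycles"],
    "Cars": ["Electric cars", "Sports cars"],
    "Bicycles": [],
    "Electric cars": [],
    "Sports cars": [],
}


def broader_of(term):
    """Return the broader (parent) term, or None if this is a top term."""
    for parent, children in NARROWER.items():
        if term in children:
            return parent
    return None


def path_to_top(term):
    """Walk from a term up to its top term, the way a breadcrumb trail does."""
    path = [term]
    parent = broader_of(term)
    while parent is not None:
        path.append(parent)
        parent = broader_of(parent)
    return list(reversed(path))


def leaves_under(term):
    """Collect the final leaves below a term in the navigation tree."""
    children = NARROWER.get(term, [])
    if not children:
        return [term]
    leaves = []
    for child in children:
        leaves.extend(leaves_under(child))
    return leaves


print(" > ".join(path_to_top("Sports cars")))
print(leaves_under("Vehicles"))
```

A drop-down pick list is essentially the list of children for one node, and a browse tree is the same mapping walked from the top term down.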

Taxonomies themselves, as a single group, are just an emerging set of standards. They aren’t fully standardized as yet. A taxonomy does have a definition within the Z39.19 Controlled Vocabularies standard, published by ANSI (American National Standards Institute) and NISO (National Information Standards Organization): it is a “hierarchically organized vocabulary based on a classification system”. That definition does not include synonyms, disambiguation for homographs, or related terms (also known as associative relationships). Those are not included in the classic definition of a taxonomy. You can download this standard at no charge from NISO.

There is a companion item, also a controlled vocabulary and the focus of the standard, which is a thesaurus. A controlled vocabulary focuses on concepts. It’s not the items themselves – not the specific items – but rather the concepts. So, you will group things together by concepts. A thesaurus also has a hierarchy, and it has a lot of different display formats and allows a lot of networks of relationships or related terms that could be considered friends or cousins or aliases. It also allows things like scope notes and term histories. It is a more elaborate and informative structure, and there are long-established standards for this area.

Starting with Cutter and Dewey at the turn of the 19th century and moving forward, we have the evolving standards for thesauri. In the same standard, Z39.19, a thesaurus is defined as a controlled vocabulary of terms in natural language order. It is designed for “post-coordination”.

Those are fairly heavy terms. Post-coordination means that the terms are combined at the time the search is run or the question is asked. So it is when we make a query that we combine the terms. We aren’t combining them as we did in subject headings and back-of-the-book indexes; those are combined in a “pre-coordinated” fashion, at the time the index is created. When we use a thesaurus or a taxonomy, we use it in a post-coordination activity. That is a big difference. We need to think about how that happens. When someone is “post-coordinate indexing” instead of cataloging, we get a big flip in the way the terms are created. Some people are indexing – doing the back of the book – and others are indexing using a controlled vocabulary. The processes are quite different. So, in the Taxonomy Division, we don’t really talk about back-of-the-book indexing and subject heading indexing. We talk about using a thesaurus or using a taxonomy for that indexing.
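To make the difference concrete, here is a small Python sketch. The documents, the terms, and the pre-coordinated heading are all invented for illustration, and the function name is my own.

```python
# Hypothetical post-coordinate index: each document is tagged with
# single-concept terms from the controlled vocabulary.
INDEX = {
    "doc1": {"Solar energy", "Economics"},
    "doc2": {"Solar energy", "Agriculture"},
    "doc3": {"Economics", "Agriculture"},
}

# A pre-coordinated heading, by contrast, is combined when the index is
# built; the searcher has to look it up as one fixed string.
PRECOORDINATED = {
    "Solar energy--Economic aspects": ["doc1"],
}


def post_coordinate_search(*terms):
    """Combine terms at query time: return documents tagged with all of them."""
    wanted = set(terms)
    return sorted(doc for doc, tags in INDEX.items() if wanted <= tags)


print(post_coordinate_search("Solar energy", "Economics"))
print(post_coordinate_search("Agriculture"))
```

The indexer only ever assigns single concepts; any combination, including ones nobody anticipated, can be asked for later without re-indexing.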

We’ve gotten to the point where the terms thesaurus and taxonomy are used interchangeably quite often. People think of a thesaurus as a taxonomy with extras. It has related terms, non-preferred terms as synonyms (the use and use-for references are another way to talk about them), scope notes, and a whole lot more. But we need to be careful when we use the word thesaurus, because the lay person will automatically assume that we mean a synonymy, like Roget’s Thesaurus, and that is not what we mean at all. Although synonyms are important, they are not the only piece of what we are doing.

This is a shot of what a taxonomy and a thesaurus can look like.

On the left, you see the taxonomy view. We have a broader term and some narrower terms, and we even have top terms in the taxonomy view. This view is driven by broader and narrower terms, so it is hierarchical in nature, and it is called the taxonomy view.

The full term record centers on the term in question, so it might have broader terms and narrower terms, as we see here. But it could also have a status, related terms, some see-also or non-preferred synonym references, as well as scope notes, editorial notes, a history field, facets, and other things. This set of fields is the default set – the basic individual set for the system.

The basic taxonomy features are the hierarchy structure: the broad, general concepts, which are known as broader terms, and the more specific concepts – but still concepts – which are the narrower terms. Related terms are conceptual cousins. They don’t fit into that subsidiary relationship. They might have something to do with the term, but they are not exactly the same.

Then we have those equivalency relationships – term equivalents – the synonyms, non-preferred terms, the use and use-for kinds of references. You can have a lot of those. You can even make them multilingual equivalents. You can have an English-Spanish-French-Chinese-German or some other set of languages for all of these synonyms.

We can also have facets, which are another way to sort the data. Frequently, faceted search, in the way it is most broadly approached now, is really field-formatted search. But there is a large body of knowledge about faceting information, from Ranganathan and other luminaries in the field.

Scope notes might be notes that you want somebody to see. Glossary notes are dictionary definitions, whereas editorial notes might be something you want to keep to yourselves and your team and not show to your user base.

All of those together make up a term record.
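A term record along those lines could be sketched as a Python dataclass. The field names, the "Accepted" default status, and the sample values are my own assumptions for illustration, not the layout of any particular thesaurus system.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class TermRecord:
    """One thesaurus term with its relationships and notes (illustrative)."""

    term: str
    status: str = "Accepted"  # assumed default; real systems define their own
    broader_terms: list = field(default_factory=list)
    narrower_terms: list = field(default_factory=list)
    related_terms: list = field(default_factory=list)        # conceptual cousins
    non_preferred_terms: list = field(default_factory=list)  # "use for" synonyms
    scope_note: str = ""      # shown to users
    editorial_note: str = ""  # kept within the editorial team
    history: str = ""
    facets: list = field(default_factory=list)

    def public_view(self):
        """Return the record as a dict without the internal editorial note."""
        view = asdict(self)
        view.pop("editorial_note")
        return view


record = TermRecord(
    term="Solar energy",
    broader_terms=["Renewable energy"],
    narrower_terms=["Photovoltaics"],
    related_terms=["Solar cells"],
    non_preferred_terms=["Solar power"],
    scope_note="Energy derived from sunlight.",
    editorial_note="Review narrower terms next quarter.",
)
print(record.public_view())
```

The taxonomy view uses only the broader and narrower fields; the thesaurus view uses the whole record.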

We put concepts into a taxonomy. We put People, Places, and Things into authority lists. Name authority files are one kind of authority file. We have a name authority list, which might be the way you want to talk about somebody. For instance, my name might be Marjorie Hlava, or Marge Hlava, or Margie Hlava, or Marjorie M.K. Hlava, or it might be Marjorie Maxine Kimmel Hlava, or the even longer name I was baptized with. Getting that name authority so that all of those names post to one place is increasingly important as we get more and more connected. There is a lot of work going on, for example, in author authority files and author disambiguation problems. Libraries have always known about those. Is it Mark Twain or is it Samuel Clemens? There are many other pen name situations that we deal with. Name authority files are also important in corporations when you talk about brand names. For example, in a pharmaceutical firm, something starts off as a chemical name, becomes a code name, becomes a production name, and then becomes something they want to try out and actually launch to the public. In different markets in different countries it will have a lot of different names. You need to keep all of those names together for research purposes. An authority file lends itself to that kind of thing.
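Here is a minimal sketch of a name authority lookup in Python, using the name variants mentioned above. The choice of preferred forms and the function name are assumptions made for the example.

```python
# Hypothetical authority file: preferred form -> known variants.
# Which form is "preferred" is an editorial decision; these are examples.
AUTHORITY = {
    "Hlava, Marjorie M.K.": [
        "Marjorie Hlava",
        "Marge Hlava",
        "Margie Hlava",
        "Marjorie M.K. Hlava",
        "Marjorie Maxine Kimmel Hlava",
    ],
    "Twain, Mark": ["Mark Twain", "Samuel Clemens"],
}


def _normalize(name):
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(name.lower().split())


# Build the variant -> preferred lookup once, including the preferred form.
LOOKUP = {}
for preferred, variants in AUTHORITY.items():
    LOOKUP[_normalize(preferred)] = preferred
    for variant in variants:
        LOOKUP[_normalize(variant)] = preferred


def resolve(name):
    """Return the preferred form for a name, or None if it is not on file."""
    return LOOKUP.get(_normalize(name))


print(resolve("marge  hlava"))
print(resolve("Samuel Clemens"))
```

The same structure works for brand names: one preferred product name, with the chemical name, code name, and market-specific names all posting to it.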

A synonym set or synonym ring is something where you have a lot of terms that mean essentially the same thing. You might have keywords, descriptors, subject headings, etc., that are equal to each other. You can begin to put some control on the vocabulary with synonyms but you can control it further with broader-narrower terms and on up through thesaurus, ontology, and eventually, when applied to data, a semantic network.
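A synonym ring can be sketched in a few lines of Python. The rings and the function name below are invented for illustration. Note that a ring has no preferred term; every member is treated as equal.

```python
# Hypothetical synonym rings: each set holds terms treated as equivalent.
RINGS = [
    {"car", "automobile", "motorcar"},
    {"physician", "doctor", "medical practitioner"},
]


def expand_query(word):
    """Return every term in the same ring as the word, or just the word."""
    key = word.lower()
    for ring in RINGS:
        if key in ring:
            return sorted(ring)
    return [key]


print(expand_query("Automobile"))
print(expand_query("bicycle"))
```

A search that expands the query this way finds items indexed under any member of the ring, which is the first level of vocabulary control before a hierarchy is added.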

Taxonomy terms are an important part of meta data. In particular, they describe subject meta data, or concept meta data. Meta data itself is all the kinds of information about the information you are talking about. So, it is not the information itself. It is the description of that information. When we talk about a bibliographic citation with indexing, we are really talking about the meta data on the journal article or book. The data “about” data is meta data.

The Dublin Core is one meta data standard, but there are a bunch of others – meta data schemas or standards – that are commonly designed specifically for a body of knowledge.
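As a rough sketch of where taxonomy terms sit in a meta data record, here is a Python dictionary keyed by the fifteen Dublin Core element names. The record itself is made up, and the helper function names are my own.

```python
# The fifteen elements of the Dublin Core Metadata Element Set.
DUBLIN_CORE_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}

# A made-up record: this is data "about" an article, not the article itself.
record = {
    "title": "An example article about taxonomies",
    "creator": ["Example Author"],
    "date": "2012",
    "language": "en",
    # Subject meta data: terms taken from the controlled vocabulary.
    "subject": ["Taxonomies", "Thesauri"],
}


def unknown_elements(rec):
    """List any keys in the record that are not Dublin Core elements."""
    return sorted(set(rec) - DUBLIN_CORE_ELEMENTS)


def subject_terms(rec):
    """Return the subject (concept) meta data of the record."""
    return list(rec.get("subject", []))


print(subject_terms(record))
print(unknown_elements(record))
```

Everything in the record describes the article; the subject element is the part a taxonomy or thesaurus controls.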

As I said, a taxonomy is one of the building blocks. It’s the words representing concepts. It is also known as semantics. Semantics is a popular way of defining this now. Just as you can substitute the word “about” for meta data, you can substitute the word “words” for semantics, and you will have a good idea of what is driving the process and how something works. So, semantics is driving the process, or the words are driving the process, of applying this kind of control to a knowledge domain. The taxonomy is core to that process because it gives the organizing principles for all of the content that you are going to deliver. And, if you are lucky, in your application you are going to be able to use it all the way from website design to new product offerings to searches, to the way things are stored, and maybe the records management part of the organization. Looking at all of those different pieces, we want to see where these can be applied in real life to our systems.

Check back next week when we continue the series and talk about Cases for Semantic Enrichment Using Taxonomies.

Marjorie M.K. Hlava
President, Access Innovations