A thesaurus is also known as an indexing language. Basically, to use a thesaurus for indexing, we assign terms to documents, based on the presence of corresponding concepts.

We’re controlling the synonyms – different terms for the same concept. We also deal with polysemes or homonyms, depending on whether you are European or stateside. Those are the same words with different meanings, like “lead.”

We are going to delineate the scope of meaning. We are going to say that this term means this to us in this thesaurus. We are going to do that by where it sits in the taxonomy; we are going to do it by what we add as synonyms, as broader and narrower terms, and all of those different things.

In our term equivalence, we are going to be linking the synonyms, and we will disambiguate homonyms so that we will know what those particular items mean in our taxonomy. If we have to use all of them, we will use all of them, but we will differentiate them.
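As a minimal sketch of term equivalence (the terms and qualifiers here are made up for illustration, not drawn from any particular thesaurus), synonyms can be linked by mapping every entry term to a single preferred term, while homonyms are differentiated with parenthetical qualifiers so each sense is its own term:

```python
# Hypothetical term-equivalence table: each entry (non-preferred) term
# maps to the preferred term that is actually used for indexing.
USE_FOR = {
    "cars": "automobiles",
    "autos": "automobiles",
    "motor cars": "automobiles",
    # Homonyms get parenthetical qualifiers, so each sense of "lead"
    # is a distinct, disambiguated term in the taxonomy.
    "lead (metal)": "lead (metal)",
    "lead (dog restraint)": "leashes",
}

def preferred_term(term: str) -> str:
    """Resolve an entry term to its preferred form; unknown terms pass through."""
    return USE_FOR.get(term.lower(), term)

print(preferred_term("Autos"))                  # automobiles
print(preferred_term("lead (dog restraint)"))   # leashes
```

The point is that every synonym resolves to one indexing term, and the two senses of “lead” can never collide.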

The precision options in search have to do with the specificity with which we apply the language and with how we coordinate the language.

Two contrasting approaches to precision involve pre-coordination and post-coordination of terms. Less is better in a taxonomy, so that the searcher can do the coordination at search time. Compound terms are a simple form of pre-coordination. We might do some compound coordination, in that we talk about aerospace engineering as opposed to electrical engineering. We can be fairly specific in delineation, but we don’t want those phrases to get too long. Term pairs are very useful in a taxonomy.
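To make the contrast concrete (the document and terms below are invented for illustration), a pre-coordinated vocabulary stores the compound concept as a single term assigned up front, while a post-coordinated approach keeps the simple terms separate and combines them at search time:

```python
# Hypothetical document and indexing under each approach.
doc = "A study of propulsion in aerospace engineering programs."

# Pre-coordination: the indexer assigns the compound term at indexing time.
precoordinated_index = {"aerospace engineering"}

# Post-coordination: simple terms are assigned separately...
postcoordinated_index = {"aerospace", "engineering", "propulsion"}

# ...and the searcher combines them later with Boolean AND.
query = {"aerospace", "engineering"}
hit = query <= postcoordinated_index   # subset test: both terms present
print(hit)                             # True
```

Both indexes can retrieve the document; the difference is whether the coordination happens when the indexer works or when the searcher does.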

Certainly, in trying to delineate them, we can include in search some proximity or word distance indication. We might say it’s ‘near’ (within two or three words) or we might say it’s ‘with’ (contained in the same sentence) or we might say that it ‘mentions’ it, meaning it is in the same paragraph or whatever – any of those kinds of things that can be defined as a distance indicator.
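Here is one way those three distance indicators might be sketched in code. The operator names (near, with, mentions) come from the discussion above; the specific choices – a three-word window, splitting sentences on punctuation, splitting paragraphs on blank lines, and simple substring matching – are assumptions for the sake of a short example:

```python
import re

def near(text: str, a: str, b: str, window: int = 3) -> bool:
    """True if terms a and b occur within `window` words of each other."""
    words = re.findall(r"\w+", text.lower())
    pos_a = [i for i, w in enumerate(words) if w == a]
    pos_b = [i for i, w in enumerate(words) if w == b]
    return any(abs(i - j) <= window for i in pos_a for j in pos_b)

def with_(text: str, a: str, b: str) -> bool:
    """True if a and b occur in the same sentence (crude substring match)."""
    sentences = re.split(r"[.!?]", text.lower())
    return any(a in s and b in s for s in sentences)

def mentions(text: str, a: str, b: str) -> bool:
    """True if a and b occur in the same paragraph (blank-line delimited)."""
    paragraphs = text.lower().split("\n\n")
    return any(a in p and b in p for p in paragraphs)

text = "Solar panels convert light. Batteries store the energy."
print(near(text, "solar", "light"))        # True  (3 words apart)
print(near(text, "solar", "energy"))       # False (too far apart)
print(with_(text, "solar", "light"))       # True  (same sentence)
print(mentions(text, "solar", "batteries"))  # True (same paragraph)
```

A production search engine would work from a positional index rather than rescanning text, but the nesting of the three distances – word window inside sentence inside paragraph – is the idea.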

For precision, we are looking at structural relationships – the links and roles of terms. In the 1980s, we had a great many links and roles applied to terms; the National Library of Medicine, ERIC, Engineering Information Inc., and others talked about linking roles of the terms. That approach was also covered by COSATI. We have gotten away from those. They have not proven as effective as we wanted them to be, but I have noticed that the semantic net/semantic web community has started to use linking roles fairly heavily. They may come back to us.

Treatment and aspect codes are other things that we have gotten away from. They dealt with how we planned to use terms, and where those terms fit within a particular term cluster. Secondary terms, or facets, were then added to further define how the terms are used.

Term weighting is used pretty heavily in a lot of the search engines that we work with. With term weighting, you might specify something like “If this word appears in the title [that is, the title field in the database record], it will be more important” or “If this term is used in conjunction with that term, it is going to be more important.”
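A small sketch of field-based term weighting follows. The weight values and field names are hypothetical – every search engine tunes these differently – but the mechanism is the one described above: a hit in the title counts for more than a hit elsewhere in the record.

```python
# Hypothetical field weights: a term hit in the title counts more than
# one in the subject field, which counts more than one in the abstract.
FIELD_WEIGHTS = {"title": 3.0, "subject": 2.0, "abstract": 1.0}

def score(record: dict, term: str) -> float:
    """Sum the weights of every field in which the term appears."""
    term = term.lower()
    return sum(
        weight
        for field, weight in FIELD_WEIGHTS.items()
        if term in record.get(field, "").lower()
    )

record = {
    "title": "Taxonomy Basics",
    "subject": "thesauri; indexing",
    "abstract": "An introduction to taxonomy construction.",
}
print(score(record, "taxonomy"))  # 4.0 = 3.0 (title) + 1.0 (abstract)
print(score(record, "indexing"))  # 2.0 (subject field only)
```

The second idea from the paragraph – boosting a term when it co-occurs with another term – would be a further condition layered on top of the same scoring loop.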

Below is an example of what you might have as fields in a record in a database.

You might have the name and the language, country code, abstract, title, and so forth, but the subject indexing is going to come from a controlled vocabulary. You might also have an authority term field, which you will fill from an authority list. I would keep those separate.
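As a sketch of such a record (all field names and term lists here are invented for illustration), the descriptive fields hold free text, while the subject field is restricted to the controlled vocabulary and the authority term field to a separate authority list:

```python
# Hypothetical controlled lists; a real system would load these from
# the thesaurus and the authority file, respectively.
CONTROLLED_VOCABULARY = {"taxonomies", "thesauri", "indexing"}
AUTHORITY_LIST = {"Access Innovations", "National Library of Medicine"}

record = {
    "name": "Sample Document",
    "language": "English",
    "country_code": "US",
    "title": "Building a Thesaurus",
    "abstract": "An overview of thesaurus construction.",
    "subjects": ["thesauri", "indexing"],        # controlled vocabulary only
    "authority_terms": ["Access Innovations"],   # authority list only
}

def validate(rec: dict) -> bool:
    """Check that the two controlled fields contain only approved terms."""
    return (set(rec["subjects"]) <= CONTROLLED_VOCABULARY
            and set(rec["authority_terms"]) <= AUTHORITY_LIST)

print(validate(record))  # True
```

Keeping the subject and authority fields separate, as suggested above, means each can be validated against its own list.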

We discussed pre-coordination already. Card catalogs generally have pre-coordinated indexing. They depend on controlled vocabularies – sometimes much more tightly controlled – and they are expensive to apply. If you have ever done library cataloging, you probably know that it’s pretty expensive to do. Original cataloging can take about two hours per catalog record, which is why places like OCLC got there first. To share that labor, it was worth paying $5 to OCLC to get a catalog record, or depositing one for $1 so that you could get credits. It was a nice model: you could share that cataloging. The very high cost of pre-coordinated indexing was partly because the pre-coordinated headings are so confusing to construct.

With post-coordination, we coordinate things at the time of the search. This really started when the IBM punch cards came out. You could sort, and it was easier to sort at a really discrete level. That was empowering in the mid-1960s, because the cards were machine-readable. Very handy. The terms are natural language; they are current and very specific. That is the idea of it. You can have exhaustive coverage, but you may lose precision, because you don’t have the coordinated concept that pre-coordination gives you. On the other hand, input costs are low, because you don’t have to know the whole bundle or the whole coding sequence. You can just put in the natural-language terms.
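The card-sorting idea survives today as the inverted index: each term maps to the set of documents carrying it, and coordination happens at search time by intersecting those sets. A minimal sketch, with invented terms and document numbers:

```python
# Hypothetical inverted index: each natural-language term maps to the
# set of document IDs it was assigned to at indexing time.
index = {
    "solar":   {1, 2, 5},
    "energy":  {2, 3, 5},
    "storage": {3, 5},
}

def search(*terms: str) -> set:
    """Post-coordinate the terms at search time with Boolean AND."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

print(search("solar", "energy"))             # {2, 5}
print(search("solar", "energy", "storage"))  # {5}
```

Adding terms to the query narrows the intersection – which is exactly where the precision loss mentioned above shows up: “solar” AND “energy” retrieves documents about either phrasing of the combined concept, coordinated or not.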

Next time, we’ll look at some methodologies for thesaurus creation.

Marjorie M.K. Hlava, President Access Innovations

Note: The above posting is one of a series based on a presentation, The Theory of Knowledge, given at the Data Harmony Users Group meeting in February of 2011. The presentation covered the theory of knowledge as it relates to search and taxonomies.