A taxonomy is an organization system. It is a controlled vocabulary with a parent-child, or hierarchical, relationship among its terms. The specificity happens at the lower levels: at the branches, at the leaves, or at the end of the list. Taxonomies are very common on websites. They are also commonly presented as pick lists, such as a dropdown menu of ten or twelve items. Sometimes they are browsable directories. There are many different ways to put them into play.
One standard that addresses taxonomy is ANSI/NISO Z39.19-2005. It talks about a taxonomy as a ‘collection of controlled vocabulary terms organized into a hierarchical structure’. Notice that it doesn’t say anything about equivalence relationships or synonyms, homographs, associative relationships, notes, etc. It is just the hierarchy.
A thesaurus is also a controlled vocabulary. It focuses on concepts. However, neither the thesaurus nor the taxonomy focuses on the information object itself. You use the taxonomy or thesaurus to find the information objects (your content). The vocabulary does not outline those information objects; it gives you a guide to them. There are different ways to display it, and the network of relationships among the terms helps you find what you are looking for.
We use the controlled vocabulary for indexing and retrieval. The hierarchical array is for our convenience, either for navigating the collection when we are searching or for organizing the terms. Nowadays, taxonomy and thesaurus are often used interchangeably. With a taxonomy, you have the hierarchy; add related terms, synonyms, and other elements as you need them, and you have a thesaurus instead.
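To make the distinction concrete, here is a minimal sketch of a term record. The `Term` class, its field names, and the sample terms are all hypothetical, chosen only for illustration; they are not taken from the standard or from any particular product. A vocabulary that fills in only the broader and narrower fields is a taxonomy in the sense described above; once synonyms, related terms, or notes appear, it has become a thesaurus.

```python
import copy
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Term:
    """One preferred term in a controlled vocabulary (hypothetical structure)."""
    label: str
    broader: List[str] = field(default_factory=list)   # parents in the hierarchy
    narrower: List[str] = field(default_factory=list)  # children in the hierarchy
    related: List[str] = field(default_factory=list)   # associative links
    synonyms: List[str] = field(default_factory=list)  # non-preferred entry terms
    scope_note: str = ""


def is_hierarchy_only(vocab: Dict[str, Term]) -> bool:
    """True when no term carries anything beyond broader/narrower links."""
    return not any(t.related or t.synonyms or t.scope_note for t in vocab.values())


# A taxonomy: nothing but the hierarchy.
taxonomy = {
    "Transportation": Term("Transportation", narrower=["Road works", "Railways"]),
    "Road works": Term("Road works", broader=["Transportation"]),
    "Railways": Term("Railways", broader=["Transportation"]),
    "Traffic safety": Term("Traffic safety"),
}

# The same terms, enriched with a synonym and a related-term pair.
thesaurus = copy.deepcopy(taxonomy)
thesaurus["Road works"].synonyms.append("Road construction")
thesaurus["Road works"].related.append("Traffic safety")
thesaurus["Traffic safety"].related.append("Road works")

print(is_hierarchy_only(taxonomy))
print(is_hierarchy_only(thesaurus))
```

Note that the set of preferred terms is identical in both structures; only the relationships recorded around them differ.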
Someone asked the other day if a related term could be used as an indexing term. Interesting question…how to answer that? A related term is in the thesaurus as an approved term. Of course you can use it because it is an approved term, a preferred term. It is a term. What the question implied was “How do those relationships work?” The relationships are there for a couple of reasons.
1) They are there to help us in building that outline of the field. That is why we do the nesting and the adding of the terms, the broader and narrower relationships of the terms. It is really a guide to us in finding our way around this huge field. The way you think of the field and the way I think of the field might be different. It doesn’t really matter what that hierarchical structure is. It just matters that we put it there to figure out where we are at any given time.
We do try to lump those things based on the way the literature indicates that these things are thought out by experts in the field. At the end of the day, that outline of knowledge, the hierarchy of the thesaurus, depends on your perspective, doesn’t it? I might have one point of view and your point of view may be very different. So, if I do an outline of information science, my outline of information science will be based on, and biased by, my concentration on the organization of information.
Other people will want to talk about how linguistic analysis works. That will be a bigger topic for them, and they will want to use that focus in everything they do to organize the field. So, you need to be careful, especially when you are talking to subject matter experts, and realize that they see the field from their point of view. The more expert a person becomes, the more certain it is that the “expert” will have a focus. You did your dissertation on XYZ and, you know what, the entirety of the rest of the field is viewed through how that works.
We need to be careful when we ask subject matter experts, realizing that they have a focus, just one way of thinking about the field, and we need to be careful that we don’t end up biasing the information. The stronger, i.e., the more expert, someone is in any particular area, the more they will try to force everyone to think just like them. It’s very important to involve subject matter experts, but it’s also very dangerous because they can torpedo a project; they can delay it for years. And, they love it! This is the subject of their academic realm. Of course they want to talk about it, and argue about it, and argue about it some more and some more. And, then there’s the “Can you bring in a few other guys so we can talk to them about it?”
To me, it’s the same thing as discussing the plays in the Super Bowl. You can talk about those plays for years and depending on which side you were on — who you were supporting — you are viewing the entire game from that point of view. The same thing happens when you look at something with an expert view.
2) It is the network of the relationships. You can display the same information – the same set of taxonomy terms – in many different ways. Having said that the hierarchy is going to depend on your view of the field also means that when you are surfacing some particular collection, some particular group of information, you might want to change the hierarchy. You aren’t changing the terms, you are using the same terms underneath to index the corpus. However, the way it is displayed, the way it is given to the user is going to depend on which group you are trying to attract.
For instance, if you have two million items, you might decide that you want to make a new collection of those items. You go across the corpus, run your automatic indexing on the items, bring together those that have the right subject terms, and look at them. Then you look at the taxonomy suggested by what’s left (because not everything in the corpus, and therefore not all of the terms in your taxonomy, will be used). That collection will be much smaller, so you only have a residual set of terms. You may look at them and say, “You know, for this set of people, for the Spanish Studies people, I am going to take a different collection than I had for those interested in World Wars. A lot of the documents will be the same, but I will present them in a really different fashion. Different purpose, different group of people, I want to organize the information differently. I am tagging it the same way and using the same thesaurus to tag it.” But from a product development point of view, I’ve come up with an entirely new offering using the same set of data.
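Here is a rough sketch of that idea, under some loud assumptions: the document IDs, the subject terms, and the two display groupings are invented for illustration, and the "index" is just a dictionary from document to the set of terms already applied to it. The point is only that the tagging stays fixed while the selection and the display headings change per audience.

```python
# Hypothetical index: document ID -> terms applied during indexing.
INDEX = {
    "doc1": {"Spanish Civil War", "Propaganda"},
    "doc2": {"World War II", "Propaganda"},
    "doc3": {"Spanish literature"},
    "doc4": {"Agriculture"},
}


def select(index, subject_terms):
    """Keep only documents tagged with at least one of the wanted terms."""
    return {doc: terms for doc, terms in index.items() if terms & subject_terms}


def residual_terms(collection):
    """The terms actually used in the smaller collection."""
    used = set()
    for terms in collection.values():
        used |= terms
    return used


def group(collection, display_hierarchy):
    """Arrange documents under audience-specific headings.

    display_hierarchy maps a heading to the set of terms shown beneath it.
    """
    return {
        heading: sorted(doc for doc, terms in collection.items() if terms & members)
        for heading, members in display_hierarchy.items()
    }


# Two offerings built from the same tagged corpus.
spanish = select(INDEX, {"Spanish Civil War", "Spanish literature"})
wars = select(INDEX, {"Spanish Civil War", "World War II"})

spanish_view = {
    "History": {"Spanish Civil War"},
    "Literature": {"Spanish literature"},
    "Media": {"Propaganda"},
}
wars_view = {
    "Conflicts": {"Spanish Civil War", "World War II"},
    "Information warfare": {"Propaganda"},
}

print(sorted(residual_terms(spanish)))
print(group(spanish, spanish_view))
print(group(wars, wars_view))
```

In this toy data, `doc1` lands in both collections but appears under different headings in each, and "Agriculture" drops out of both residual term sets because nothing selected was tagged with it.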
There are a lot of things that can be done with it. I just don’t want people to get so hung up on their hierarchy that they forget that you can change your hierarchy tomorrow. We change the organization of government, we change the entire organization of the Superintendent of Documents collection, because we’re following the outline of government. The documents are the same. Fortunately, there is a subject catalog so we can still get to the document. We don’t need to worry about the classification system. New people come in; they think of something in a different way, they change the hierarchy. So, these standards and their kind of point of view are pretty well established.
The indexing of the content depends only on the terms themselves and their synonyms. The related terms and broader/narrower relationships are great for managing the taxonomy and for presentation and a search interface, but they are completely independent of the use of the terms in indexing. Automatic indexing systems like the Data Harmony M.A.I. work the terms as though they are all equal level. The placement in the hierarchy is irrelevant to the term usage in the indexing process.
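A toy illustration of "all terms at equal level" follows. This is not how Data Harmony M.A.I. or any real automatic indexer works internally; it is naive phrase matching, with an invented vocabulary, meant only to show that the lookup used for indexing is a flat mapping from terms and synonyms to preferred terms. No hierarchy appears anywhere in it.

```python
import re

# Hypothetical vocabulary: preferred term -> synonyms. No hierarchy is stored.
VOCAB = {
    "Transportation": ["transport", "transit"],
    "Road works": ["road construction", "road repair"],
}


def build_lookup(vocab):
    """Flatten preferred terms and synonyms into one phrase -> preferred-term map."""
    lookup = {}
    for preferred, synonyms in vocab.items():
        for phrase in [preferred, *synonyms]:
            lookup[phrase.lower()] = preferred
    return lookup


def index_text(text, lookup):
    """Return the preferred terms whose phrase or synonym occurs as whole words."""
    lowered = text.lower()
    found = set()
    for phrase, preferred in lookup.items():
        if re.search(r"\b" + re.escape(phrase) + r"\b", lowered):
            found.add(preferred)
    return sorted(found)


lookup = build_lookup(VOCAB)
print(index_text("Chapter 12: Road repair schedules for county highways", lookup))
print(index_text("Public transit and road construction funding", lookup))
```

The first sample text receives only "Road works", even though a road-repair chapter might sit under a transportation title; the broader term is applied only when the content itself calls for it, as in the second sample.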
So, just as indexers have always done in the past, you index to the specific level of the code. The term “transportation”, for example, is applied where it is appropriate to the content. There is no implied hierarchy in using a top-level term for indexing.
Now, that said, I am sympathetic to the unending repetition of the terms in the index. It becomes unwieldy. But each unit is indexed only according to the content it contains. Something with a transportation title in a state code may really only have to do with road works. So it does not need transportation applied to it.
When you do the display of the records using the hierarchy, all things having to do with transportation will be resident and clear to the users. Take a look at Media Sleuth — there is a browseable tree on the left side. Each record is attached according to the terms used to index it.
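The display side can be sketched like this. The parent map, record IDs, and function names are hypothetical, and I am not describing how Media Sleuth is actually built; this is just one simple way a browse tree could gather records under a broad heading even though each record was indexed only with its specific term.

```python
from collections import defaultdict

# Hypothetical hierarchy (term -> parent) and index (record -> applied terms).
PARENT_OF = {
    "Transportation": None,
    "Road works": "Transportation",
    "Railways": "Transportation",
}
RECORDS = {
    "rec1": ["Road works"],
    "rec2": ["Railways"],
    "rec3": ["Transportation"],
}


def children_map(parent_of):
    """Invert the term -> parent map into parent -> list of children."""
    children = defaultdict(list)
    for term, parent in parent_of.items():
        if parent is not None:
            children[parent].append(term)
    return children


def subtree(term, children):
    """The term itself plus every narrower term beneath it."""
    terms = [term]
    for child in children.get(term, []):
        terms.extend(subtree(child, children))
    return terms


def records_under(term, parent_of, records):
    """Records to show when the user opens a node of the browse tree."""
    wanted = set(subtree(term, children_map(parent_of)))
    return sorted(rec for rec, terms in records.items() if wanted & set(terms))


print(records_under("Transportation", PARENT_OF, RECORDS))
print(records_under("Road works", PARENT_OF, RECORDS))
```

Opening "Transportation" gathers all three records, while "Road works" shows only the one record tagged with it. The roll-up happens at display time, so the records never needed the broader term applied to them.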
There are two sides to this coin — the application of specific terms to the records of content at all levels according to their specific (or at the chapter level broad) content and then the display of that data in the user interface.
So yes, you use the related term for indexing because it is a term.
Marjorie M.K. Hlava
President, Access Innovations