Methodologies for Taxonomy and Thesaurus Creation

When you are building a thesaurus, you can build it from scratch, working from the original text of a new intellectual area, or you can build from an existing vocabulary that already covers the topic. You can also do a combination of the two.

What often happens is that you take your text and apply it to a taxonomy you think might be fairly parallel, and then you customize it to meet your own needs. That is very common practice and it gets you there faster.

Your first goal is to achieve some degree of vocabulary control. In achieving vocabulary control, you are looking at getting terms and getting them organized. You start with an uncontrolled list – that’s all you have got at first – and you group them together, grouping ever tighter. Normally, only one or two people work on that, because you need some people dedicated to getting their heads around the whole thing.

I often convene a team to do a taxonomy, but I have one person start to get the general groupings; once I have got the general top terms together, then I have people work on a branch. They can take this top term or that top term or a couple or three of them that are fairly similar and grab all of the terms they think are appropriate to them. If they are working at a term record level, then other people who want that same term for their own branch can just add their top term to its record as another broader term. The term can sit in more than one branch.
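Here is a minimal sketch of what working "at a term record level" might look like, assuming each term is stored as one record that carries a set of broader terms. The `TermRecord` class, the `branches` helper, and the sample terms are all hypothetical and are not taken from any particular thesaurus tool.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class TermRecord:
    """One term in the taxonomy (a hypothetical, simplified record layout)."""
    label: str
    broader: Set[str] = field(default_factory=set)  # parent terms, possibly several

    def add_broader(self, parent: str) -> None:
        # Adding a parent never removes an existing one, so two editors
        # working on different branches can both claim the same term.
        self.broader.add(parent)


def branches(records: Dict[str, TermRecord]) -> Dict[str, List[str]]:
    """Group term labels under every broader term they were assigned to."""
    result: Dict[str, List[str]] = {}
    for record in records.values():
        for parent in record.broader:
            result.setdefault(parent, []).append(record.label)
    return {parent: sorted(children) for parent, children in result.items()}


records = {"Solar panels": TermRecord("Solar panels")}
records["Solar panels"].add_broader("Energy")         # editor on the Energy branch
records["Solar panels"].add_broader("Manufacturing")  # editor on the Manufacturing branch
print(branches(records))
```

Because the broader terms live on the record rather than in a fixed tree, the same term shows up under both branches without being duplicated.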

You can have a lot of people working on the taxonomy at one time. To start, though, you have to get that big set of bins to sort things into, bins that people can generally agree with. What I normally find is that people have some kind of a category list or a general outline of what they have. It might be departments; it might be product lines; it might be something that exists from someplace else. Normally, a client will have 16-20 general categories that they use to think about their company. We use those to start. If you don’t have those, you have to look at the literature and invent them.

I normally take out all of the people, places, and things, all the authority file stuff, and park them to one side. I can fold them in later if that’s the kind of taxonomy I need. Because they behave differently from concepts – each one represents a concept of one – I want to be able to put each of them in its appropriate place. That way you divide the mass in half, and you can fold them back in afterward.

Try out the taxonomy on real data. I generally ask for 1,000-5,000 records, because you want a big enough set that you really get a good transect of what they are covering in their system. Getting a good amount of data to work with is important.

When you are adopting an existing taxonomy, you want to look at the terms in current use and fold them into the existing taxonomy. Take the terms from your data – the kind of data that you are dealing with – and you fold those into the existing taxonomy and decide which way you want the term to be stated. That is the preferred usage. Use the terms the users are going to use.
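One way to picture settling on the preferred usage is a simple lookup from the wordings that turn up in your data to the single form you decided on. This is only a sketch: the `PREFERRED` table, the `to_preferred` function, and the sample terms are made up for illustration.

```python
from typing import Optional

# Hypothetical mapping from variant wordings found in the data to the
# preferred term chosen for the taxonomy.
PREFERRED = {
    "automobiles": "Cars",
    "autos": "Cars",
    "motor cars": "Cars",
    "cars": "Cars",
}


def to_preferred(raw_term: str) -> Optional[str]:
    """Return the preferred term for a raw term, or None if it is unknown."""
    return PREFERRED.get(raw_term.strip().lower())


print(to_preferred(" Autos "))  # a known variant
print(to_preferred("trucks"))   # not in the table yet
```

Terms that come back as `None` are the ones you still have to fold in or decide against.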

If you are not sure which way you’re going to go, then try out some and see what happens. If you can do it automatically, apply the terms automatically from the test thesaurus to 5,000 units and see how they look. That makes for a really good test. You might do only 100 or you might do 50, but do try it on some set of the data to see if it will work.

When you do this type of exercise, you’ll probably get a list of terms that were not used at all. That’s part of the reason for having a good-sized set to try them out. Review the list of terms that were not used to see if you are throwing out three-quarters of the taxonomy; if so, then it is not really appropriate to your data. If you are only throwing out five or ten percent, then obviously the coverage is pretty close. You get a sense very quickly, by looking at it that way, of whether it will work or not. That’s one way you can decide whether or not you should adopt an existing taxonomy.
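The unused-terms check can be sketched roughly as follows. It assumes the records are plain text strings and that a case-insensitive whole-word match is good enough to count a term as "used"; real automatic indexing is more sophisticated than that. The function names and the sample data are hypothetical.

```python
import re


def unused_terms(taxonomy_terms, records):
    """Return the taxonomy terms that match none of the sample records."""
    used = set()
    for term in taxonomy_terms:
        # Whole-word, case-insensitive match; a crude stand-in for real indexing.
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        if any(pattern.search(text) for text in records):
            used.add(term)
    return [term for term in taxonomy_terms if term not in used]


def unused_share(taxonomy_terms, records):
    """Fraction of the taxonomy that was never applied to the sample."""
    if not taxonomy_terms:
        return 0.0
    return len(unused_terms(taxonomy_terms, records)) / len(taxonomy_terms)


terms = ["solar panels", "wind turbines", "coal mining", "geothermal"]
sample = [
    "New solar panels installed on the warehouse roof.",
    "Offshore wind turbines pass their first inspection.",
    "Solar panels and wind turbines share a grid connection.",
]
print(unused_terms(terms, sample))
print(f"{unused_share(terms, sample):.0%} of terms unused")
```

With a tiny sample like this the share means little; on a few thousand records, a share near three-quarters suggests the taxonomy does not fit the data, while five or ten percent suggests the coverage is close.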

It’s also important to have an area for candidate terms, or a way of flagging them. As the jargon evolves, you might find a term you haven’t seen before but now you have seen it five times in the last two weeks. Is it a term that is going to stick? You need to think about where you will put the terms, and then have a staging area for terms to be taken care of.
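A staging area for candidate terms can be as simple as a log of sightings plus a rule for when a term deserves a look. The sketch below assumes a hypothetical `CandidateTerms` class; the thresholds of five sightings in fourteen days just echo the example above and are not a recommendation.

```python
from collections import Counter
from datetime import date, timedelta


class CandidateTerms:
    """Staging area for terms that are not yet in the taxonomy."""

    def __init__(self):
        self.sightings = []  # list of (term, date seen)

    def record(self, term, seen_on):
        # Store the term in lowercase so different capitalizations count together.
        self.sightings.append((term.lower(), seen_on))

    def ready_for_review(self, today, window_days=14, min_sightings=5):
        """Terms seen at least min_sightings times in the last window_days."""
        cutoff = today - timedelta(days=window_days)
        recent = Counter(
            term for term, seen in self.sightings if cutoff <= seen <= today
        )
        return sorted(term for term, count in recent.items() if count >= min_sightings)


staging = CandidateTerms()
today = date(2011, 2, 28)
for offset in range(5):
    staging.record("Cloud computing", today - timedelta(days=offset * 2))
staging.record("Grid computing", today - timedelta(days=40))
print(staging.ready_for_review(today))
```

Whether a flagged term actually sticks is still an editorial call; the count only tells you which terms to look at.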

When you are “finalizing” the terms in your taxonomy, you need to be sure to allow for new jargon. Any living field, even the study of the ancient Dead Sea Scrolls, continues to evolve. There are always new terms coming into use, so you cannot say that the terminology is frozen. I know of one organization that has now frozen its taxonomy three times. STOP IT! You can’t do it. It is a living beast, and you had better fold maintenance of that thesaurus into the work plan. Otherwise, it will go out of date, people will stop using it, and all of the effort to build it will be totally wasted.

You need a place to put the terms. That seems to be a frequently overlooked detail. You build the taxonomy; you are applying the taxonomy to the data; the data needs a field for the taxonomy term. I don’t care what you call the field, but please build in a field. We see a lot of the Fire-Ready-Aim approach: I built the hardware, I chose this software package. Okay, let’s put the data in the software. Well, there’s no room for that piece of data, so we are not going to put that in. Oh, there’s no room for a taxonomy in this database, so you can’t do it. Well, you just wasted a bunch of money, right? So, look at your data. Build the taxonomy that fits the data. Then, choose the software that will support your data.

Marjorie M.K. Hlava, President, Access Innovations

Note: The above posting is one of a series based on a presentation, The Theory of Knowledge, given at the Data Harmony Users Group meeting in February of 2011. The presentation covered the theory of knowledge as it relates to search and taxonomies.
