During the initial stages of discussing a new taxonomy project, I am frequently asked questions like:

How granular does my taxonomy need to be?

How many levels deep should the vocabulary go?

And especially:

How many terms should my thesaurus have?

The answer is—of course—it depends.

The smallest thesaurus project with which I’ve ever been involved was for a thesaurus of 11 terms; the largest is a 57,000-word vocabulary.

We once lost a bid because we refused to agree to build a 10,000-word thesaurus (not approximately, exactly); no matter how loudly we insisted that it’s far more logical (“best practice”) to let the data decide the size of the thesaurus, someone had already decided on an arbitrary number.

At Access Innovations, we like to say that we build “content-aware” taxonomies, meaning that the data will tell us how large the taxonomy should be. The primary data point is the content: How much is there? What is the ongoing volume being published? Clearly, no one needs a 25,000-word thesaurus to index 1,000 documents; similarly, a 200-term thesaurus is not going to be that useful if you have 800,000 journal articles.

Just as returning 2,000,000 search results is not very helpful (unless what you’re looking for is on the first page), a thesaurus term with which 20,000 articles are tagged isn’t doing that much good—more granularity is probably required. There are very likely sub-types or sub-categories of that concept that you can research and add.

The flip side is that you don’t need terms in your vocabulary, no matter how cool they may be, if there is little or no content requiring them for indexing. Your 1,500-word branch of particle physics terms is just dead weight in the great psychology thesaurus you’re developing.
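As a rough sketch of how both checks could be run against indexing data, the snippet below counts how many documents carry each term, then flags terms that look too broad and terms that are never used. The function name `audit_term_usage`, the toy data, and the cutoff are all assumptions made for illustration; a real cutoff would depend on the size of your corpus.

```python
from collections import Counter


def audit_term_usage(doc_tags, vocabulary, max_docs_per_term=20000):
    """Flag terms that look too broad or unused (hypothetical helper).

    doc_tags: one list of thesaurus terms per indexed document.
    vocabulary: the set of preferred terms in the thesaurus.
    max_docs_per_term: illustrative cutoff, not a recommended value.
    """
    # Count each term once per document, even if it was applied twice.
    counts = Counter(term for tags in doc_tags for term in set(tags))
    # Terms tagged on too many documents may need narrower terms.
    too_broad = sorted(t for t in vocabulary if counts[t] > max_docs_per_term)
    # Terms tagged on no documents are candidates for removal.
    unused = sorted(t for t in vocabulary if counts[t] == 0)
    return too_broad, unused


# Toy corpus: three documents, with a tiny cutoff so the effect is visible.
doc_tags = [["Sofa", "Chair"], ["Sofa"], ["Sofa", "Table"]]
vocabulary = {"Sofa", "Chair", "Table", "Quark"}
too_broad, unused = audit_term_usage(doc_tags, vocabulary, max_docs_per_term=2)
print(too_broad)  # ['Sofa']
print(unused)     # ['Quark']
```

A term on the first list is a prompt to research sub-types; a term on the second is a prompt to ask whether it belongs in this thesaurus at all.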

Other factors include the type of users you have searching your content: Are they third-graders? Professional astrophysicists? High school teachers? Reviewing search logs and interviewing users are other ways to focus your approach, which in turn will help you gauge how large your taxonomy will be in the end.

Let’s make up an example (as an excuse to post pictures that are fun to look at). We’re building a taxonomy that includes some terms about furniture, including the concept Sofa.

PT (preferred term) = Sofa

NPT (non-preferred term) = Couch

Now, being good taxonomists, we’re obviously lightning-fast researchers, so we quickly uncover some other candidate terms:

[Images of sofa styles, including the English Rolled Arm]

It looks like a real taxonomy of sofas would depend at least partly on arm height.

Whereas “couch” is clearly a synonym, these could all be narrower terms (NTs) for Sofa, as they are all distinct types, styles, and sub-classes. Alternatively, these could all be made NPTs for Sofa so that any occurrence of the words above would index to Sofa and be available for search, browse, etc.

How do we decide the proper course of action?

We let the content tell us.

How many articles in our imaginary corpus reference e.g. the Cabriole, Camelback, or Canapé?

  • If the answer is “none”, the term clearly does not need its own entry; adding it as an NPT will still catch any future occurrences, so we may as well be completist.
  • If the answer is “many” (some significant proportion of the total mentions of Sofa or Couch), then the term definitely merits its own place in the taxonomy.
  • If the answer is “few” (more than none, but not enough to warrant its own entry), go ahead and add it as an NPT. You can always promote it to preferred term status later.

However—and this is a big exception—if you find through reviewing search logs that a significant number of searchers were looking for a particular term, it might signal that it’s an emerging concept, new trend, or hot topic, in which case you may decide to override the statistical analysis and err on the side of adding it to the thesaurus. It won’t hurt anything, and as long as your hierarchy is well formed and your thesaurus is rich in related terms, people will find what they’re looking for…which is, after all, the goal.
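The decision rule, including the search-log exception, can be sketched in a few lines. This is only an illustration: the function name `decide_term_status` and both thresholds (`many_share`, `search_share`) are assumptions, since what counts as a “significant” proportion is a judgment call for each project.

```python
def decide_term_status(candidate_mentions, parent_mentions,
                       candidate_searches=0, total_searches=0,
                       many_share=0.10, search_share=0.05):
    """Return "PT" (give the term its own entry) or "NPT" (hypothetical helper).

    candidate_mentions: documents mentioning the candidate, e.g. Camelback.
    parent_mentions: documents mentioning the broader concept, e.g. Sofa or Couch.
    candidate_searches / total_searches: counts taken from search logs.
    many_share, search_share: illustrative thresholds, not recommended values.
    """
    content_share = candidate_mentions / parent_mentions if parent_mentions else 0.0
    query_share = candidate_searches / total_searches if total_searches else 0.0

    # Search-log exception: strong user interest overrides the content statistics.
    if query_share >= search_share:
        return "PT"
    # "Many": a significant proportion of the broader concept's mentions.
    if content_share >= many_share:
        return "PT"
    # "None" and "few" both end up as non-preferred terms for the broader concept.
    return "NPT"


print(decide_term_status(0, 400))    # NPT (no mentions)
print(decide_term_status(120, 400))  # PT (many mentions)
print(decide_term_status(6, 400))    # NPT (few mentions)
print(decide_term_status(6, 400, candidate_searches=90, total_searches=1000))  # PT
```

The last call shows the exception at work: the content alone would make the term an NPT, but searcher interest tips it into the thesaurus.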

So remember: It’s not only the size of your taxonomy that’s important—it’s how relevant it is to the content and users for which it’s designed.

Bob Kasenchak, Project Coordinator
Access Innovations