by Barbara Gilles, Access Innovations thesaurian
Observing the practices described below for building a thesaurus will help ensure:
- Effective searches to enable actual retrieval
- A rich and functional network of relationships among terms
- A standards-compliant thesaurus and taxonomy
- Efficient development of a controlled vocabulary that meets users’ needs
1. Background: The Building Blocks of a Thesaurus
Building a thesaurus database involves creating conceptual records to support concepts and their relationships. The parts of a conceptual record are:
- A term representing the concept (the “main term” of the record)
- Broader terms
- Narrower terms
- Related terms
- Synonyms, common misspellings, and other non-preferred versions that express the concept
- Notes (definitions, scope notes, etc.)
- Tracking information
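As a rough illustration of how these parts fit together, a conceptual record could be modeled as a small data structure. This is only a sketch: the class name, field names, and sample values are assumptions made for the example and do not reflect how any particular thesaurus software stores its records.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TermRecord:
    """One conceptual record; all field names are illustrative only."""
    term: str                                                # the "main term"
    broader: List[str] = field(default_factory=list)         # broader terms (BT)
    narrower: List[str] = field(default_factory=list)        # narrower terms (NT)
    related: List[str] = field(default_factory=list)         # related terms (RT)
    non_preferred: List[str] = field(default_factory=list)   # synonyms, misspellings
    notes: Dict[str, str] = field(default_factory=dict)      # definitions, scope notes
    tracking: Dict[str, str] = field(default_factory=dict)   # e.g. editor, date


# Hypothetical record for illustration.
record = TermRecord(term="Diesel engines", broader=["Engines"])
print(record.term, "->", record.broader)
```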
2. Starting a Thesaurus Project
Get a sense of the subject area that the thesaurus covers. If you are not already familiar with it, now is the time to learn its scope and concepts. Read encyclopedia articles and skim through or study introductory textbooks.
Decide what areas will be central and will receive plenty of expansion as well as what peripheral topics to include.
The thesaurus project administrator should establish team members’ editorial permissions (through the Data Harmony Administrative Module), taking into account the overall workflow, team members’ capabilities and knowledge of the subject area, and the thesaurus review process.
Team members don’t need to be experts in the subject area. Find wordsmiths fascinated with the workings of language.
Even if you are an expert on a subject within the thesaurus scope, you may need to refresh and broaden your perspective.
3. Choosing Terms
Each term should be self-sufficient. Never use “Other” as a term or as the beginning of a term. Avoid making a term reliant on a broader term to complete its sense. For example, “Diesel” should not be a narrower term of “Engines” or “Fuel.” Instead, use “Diesel engines” under “Engines,” or “Diesel fuel” under “Fuel.”
Gather a list of tentative terms. Brainstorm. Pull main concepts from articles. Use search logs, textbook indexes, terms from already indexed documents, and so forth.
Review the terms for clarity, expanding them as necessary. For example, “Control” is too ambiguous and vague to be a useful term. “Process control” and “Remote control” may be specific enough to work well as thesaurus terms in a technology thesaurus.
Don’t use the same term for two distinct concepts. For example, “Vectors” should not be a narrower term of both “Epidemiology” and “Mathematics”; “Bats” should not be a narrower term of both “Sports equipment” and “Animals.” Instead, modify or expand the term for one or both of the concepts (“Biological vectors”; “Vector mathematics”; “Baseball and cricket bats”).
Weed out terms that don’t seem appropriate. Use common sense.
Don’t use terms that are too general; take into consideration the scope of the thesaurus. For example, if the entire thesaurus is about process control, don’t use “Process control” as a thesaurus term. For texts that survey the field, you may want to add a term such as “General process control concepts.”
Don’t clutter the thesaurus with terms that are not likely to be used for indexing or searching.
4. Style and Spelling
All thesaurus terms should be nouns or noun phrases.
Generally, use plural forms. Use common sense in applying this rule; don’t change “Water” and “Money” into “Waters” and “Monies.” Abstract terms, such as the “-tions” and “-ilities,” should normally stay singular. Be careful with words whose meanings change between singular and plural (art/arts, novelty/novelties, quality/qualities, speech/speeches).
In terms consisting of two or more words, observe natural language order. For example, use “Classical musicians” instead of “Musicians, classical.”
The first character of each term should be consistently capitalized or lower case, with certain exceptions. For example, “pH” is correct, although it is contrary to an initial capitalization scheme (and, incidentally, to the practice of making characters after the first one lower case). And, of course, the first letter of capitalized acronyms should stay capitalized.
Use lowercase as much as practicable. Words with all letters capitalized are difficult to read and can be confused with acronyms.
In general, use a spelled-out term in preference to its acronym.
In some cases, the spelled-out version of an acronym is not widely recognized. In such cases, the acronym is preferable, as long as the intended meaning is clear. Use “LASIK” or “LASIK surgery” as the main term in a record concerning laser-assisted in situ keratomileusis.
Avoid using acronyms whenever there is any chance of confusion as to the meaning of the acronyms. For example, if your thesaurus covers the 20th century in the United States, SNL could refer to either Sandia National Laboratories or Saturday Night Live.
In general, avoid punctuation.
Avoid the inclusion of hyphens, except for terms that are normally spelled with a hyphen. Never use hyphens simply as punctuation (Trees – deciduous).
Don’t use commas except in proper nouns that require them, such as “Bureau of Alcohol, Tobacco, Firearms and Explosives.”
Avoid the use of parentheses in thesaurus terms.
If a term has one or more alternate spellings, the preferred term should have the spelling that is most widely accepted by the expected or intended users of the thesaurus. Use the alternative spelling as a synonym (UF or Non-Preferred Term).
For geographical spelling variants, such as American English and British English spellings (Theatres/Theaters), consistently use the spelling of one of the styles throughout the thesaurus for the preferred terms. (However, you should not alter spellings of proper nouns. “Drury Lane Theatre” is a perfectly good narrower term of “Theaters.”)
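Some of the rules above lend themselves to a quick automated pass before human review. The sketch below is a hypothetical helper, not part of any thesaurus tool. It flags terms that start with “Other,” contain commas or parentheses, use a hyphen or dash as punctuation, or include an all-caps word. The checks are heuristics only: a flagged term may be perfectly valid (a proper noun with a comma, a well-known acronym), so the output is a list of things to look at, not a list of errors.

```python
import re
from typing import List


def style_warnings(term: str) -> List[str]:
    """Return heuristic warnings; an empty list means nothing was flagged."""
    warnings = []
    if re.match(r"other\b", term, flags=re.IGNORECASE):
        warnings.append("starts with 'Other'")
    if "," in term:
        warnings.append("contains a comma (fine only in proper nouns)")
    if "(" in term or ")" in term:
        warnings.append("contains parentheses")
    # A hyphen or dash with whitespace on both sides is acting as punctuation.
    if re.search(r"\s[-\u2013\u2014]\s", term):
        warnings.append("hyphen or dash used as punctuation")
    # Fully capitalized words may be acronyms; a person should confirm them.
    if any(len(word) > 1 and word.isupper() for word in term.split()):
        warnings.append("all-caps word (confirm it is a clear acronym)")
    return warnings


for candidate in ["Classical musicians", "Musicians, classical",
                  "Trees - deciduous", "Other engines", "LASIK surgery"]:
    print(candidate, "->", style_warnings(candidate))
```

Rules that depend on meaning, such as singular versus plural forms or whether a term is a noun phrase, are deliberately left out because they need editorial judgment.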
5. Crafting the Hierarchical Structure
Establish some general categories within the thesaurus subject area to serve tentatively as top terms.
Eventually, there should be between five and 20 top terms. (Twenty is the number of terms that will fit easily in a screen display of a thesaurus hierarchy.) If it is necessary to have more, try not to have more than 50 top terms.
See which terms can be grouped under an existing, more general term. In general, go ahead and make them narrower terms of that term.
A term should fit completely within the category that its broader term represents. This is the all-or-some rule. Remember that if all the instances of a concept fit under another term, that concept’s term may be a good narrower term; if only some of the instances fit, a related term relationship would be more appropriate. In a food thesaurus, “Quiche” or “Quiches” would not be appropriate as a narrower term under “Vegetarian foods,” because some quiches contain bacon or ham.
Don’t group compound terms together simply because they have a word in common. For example, don’t put “Pest control” and “Process control” together under “Control.”
Limit the number of terms at the same level in each branch to 20 terms if practicable. If there are more than 20 terms, see if one of them could serve as a broader term for some of the narrower terms.
In general, a polyhierarchical structure is best. If a term could legitimately be a narrower term of terms in two or more separate branches, go ahead and add those terms as broader terms. For example, “Benjamin Franklin” could have a variety of broader terms, including “Authors,” “Diplomats,” “Inventors,” “Politicians,” and “Scientists.”
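The structural guidelines in this section (a limited number of top terms, no more than about 20 terms at one level of a branch, and polyhierarchy where it is legitimate) can be checked mechanically once the broader-term links are available as data. The following is a minimal sketch under assumed data: the dictionary layout and the top term “People” are hypothetical, and the threshold simply restates the guideline above.

```python
from collections import defaultdict

MAX_SIBLINGS = 20  # guideline from the text, not a hard limit

# Hypothetical data: each term maps to its broader terms (empty list = top term).
broader_terms = {
    "People": [],
    "Authors": ["People"],
    "Diplomats": ["People"],
    "Inventors": ["People"],
    "Benjamin Franklin": ["Authors", "Diplomats", "Inventors"],
}

# Invert the broader-term links to get the narrower terms of each term.
narrower_terms = defaultdict(list)
for term, parents in broader_terms.items():
    for parent in parents:
        narrower_terms[parent].append(term)

top_terms = [t for t, parents in broader_terms.items() if not parents]
crowded = {t: len(kids) for t, kids in narrower_terms.items()
           if len(kids) > MAX_SIBLINGS}
polyhierarchical = [t for t, parents in broader_terms.items() if len(parents) > 1]

print("Top terms:", top_terms)
print("Levels over the guideline:", crowded)
print("Terms with several broader terms:", polyhierarchical)
```

A report like this only points to places worth a second look. Whether a crowded level should be split, or whether a second broader term satisfies the all-or-some rule, remains an editorial decision.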
6. Developing the Non-Hierarchical Relationships
Provide cross-hierarchy relationships and enrich the information that your term records provide to users by entering existing thesaurus terms in the Related Term field. In a term record for “Quiche,” you could add “Vegetarian foods” as a related term. You can add related terms freely, as long as the concepts they represent are genuinely related in some way. They can even be opposites. Related terms provide additional information for searches.
Identify terms that are synonymous with each other. Decide which one to use as a regular thesaurus term. Then consider adding the synonym(s) in that preferred term’s Non-Preferred Term field.
Consider term variants that searchers might use. Add these as non-preferred terms.
For terms for which alternate spellings exist (Cultural centers/Cultural centres), add the alternate spellings as non-preferred terms in the term record of the preferred term.
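To see why non-preferred terms matter for retrieval, consider a lookup that maps whatever a searcher types to the preferred term. This is a sketch under assumptions: the data layout and function name are hypothetical, and the choice of “Biochemistry” as the preferred term over “Biological chemistry” is made only for the example.

```python
from typing import Optional

# Hypothetical entries: preferred term -> its non-preferred terms.
non_preferred = {
    "Cultural centers": ["Cultural centres"],
    "Biochemistry": ["Biological chemistry"],
}

# Build a case-insensitive lookup from every variant to its preferred term.
lookup = {}
for preferred, variants in non_preferred.items():
    for label in [preferred, *variants]:
        lookup[label.casefold()] = preferred


def resolve(query: str) -> Optional[str]:
    """Return the preferred term for a query, or None if it is unknown."""
    return lookup.get(query.strip().casefold())


print(resolve("cultural centres"))
print(resolve("Biological chemistry"))
```

A searcher who enters either spelling is led to the same record, which is the practical payoff of recording variants as non-preferred terms instead of leaving them out.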
7. Adding Term Record Annotations
Enter a note in the Scope Note field of a term for which there might be some uncertainty or misunderstanding about the scope covered by the term. For “Classical music,” the scope note might explain whether or not the term covers concert music written after the Classical era in Europe, and whether or not it covers classical music of non-European cultures such as China, India, and Persia/Iran.
Team members should use this field to document information that might be of value to other thesaurus editors or compilers, such as reasons for choosing a particular term in preference to other possibilities.
Other Annotation Fields
The thesaurus project administrator should consider setting up additional fields for such things as definitions, bibliographic references, and cross-references to statutes or external classification systems.
8. Evaluating Your Thesaurus
Proofread your thesaurus. Verify spelling of unfamiliar or difficult terms. Misspellings not only look bad but can also thwart thesaurus searches, rule base searches, indexing, and document searches.
Check once more for balanced coverage of the subject area, watching out for possible gaps and unnecessary duplication.
When surveying your thesaurus, take advantage of Thesaurus Master’s capability to expand and collapse the visual display of the thesaurus structure. This capability enables you to skim through terms at the same level, and to “drill down” to focus on terms at various levels in a single branch.
Have a person who is knowledgeable in the subject area of the thesaurus do a thorough review of the terms and relationships. Also have a senior editor who is familiar with thesaurus best practices do a thorough review.
Use the thesaurus on a test set of documents to ensure that it covers the content appropriately.
9. Maintaining Your Thesaurus
After your thesaurus has been used for a few months, examine the keywords that editors have suggested but that don’t appear in the thesaurus. Some of those keywords may indicate terms that should be added to the thesaurus.
As concepts and technologies change and advance, so does terminology.
Review your thesaurus on a regular basis to determine what updates you should make.
Before adding a new term, always check for existing terms that cover the concept or that could or should be modified. When searching the thesaurus, use single words rather than phrases, and use character strings truncated by an asterisk (a wildcard character in Thesaurus Master), rather than a full word with an overly specific ending.
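The advice to search on a truncated string can be illustrated with a small sketch. It uses Python's standard `fnmatch` module to imitate an asterisk wildcard and is not a description of how Thesaurus Master implements its search. The term list and the function name are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import List

# Hypothetical term list for illustration.
terms = ["Process control", "Pest control", "Remote control",
         "Controllers", "Diesel engines"]


def wildcard_search(pattern: str, candidates: List[str]) -> List[str]:
    """Match a pattern against each word of each term, ignoring case."""
    pattern = pattern.lower()
    return [
        term for term in candidates
        if any(fnmatchcase(word, pattern) for word in term.lower().split())
    ]


print(wildcard_search("control*", terms))     # truncated string
print(wildcard_search("controlling", terms))  # overly specific ending
```

The truncated pattern surfaces every term built on the same stem, while the full word with a specific ending finds nothing, so an existing term that already covers the concept could be missed. Note that `fnmatch` also treats `?` and square brackets as special characters, which may differ from other tools.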
Before deleting a term, consider the potential effects in relation to documents that have already been indexed with the term, and in relation to future searches.
10. Common Mistakes
- Overly vague terms
- Lack of balance in terms
- Gaps in coverage, sometimes severe
- Too many top terms
- Not enough levels (“flat” structure)
- Same term (essentially) in two places in thesaurus, but with different style (Milky Way galaxy/Milky Way Galaxy; Bird houses/Birdhouses)
- Two or more synonymous terms (Biochemistry/Biological chemistry) as regular thesaurus terms
- Too many terms at any one level within a branch
- Inappropriate NT-BT relationships
- Term assigned to entirely incorrect/inappropriate place in thesaurus because meaning or relationship misunderstood or not well thought out
- Term misplaced in thesaurus because of shared word or morpheme
- Spelling errors (more common than many might think)
- “Other” as a term or as the beginning of a term
- Terms that don’t mean anything, or whose meaning isn’t clear, apart from hierarchy context
- Meaningless (to the user) abbreviations or lingo
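One mistake in the list above, the same term appearing twice in different styles, is easy to screen for automatically. The sketch below groups terms that become identical once case, spaces, and hyphens are ignored. The function names are hypothetical and the normalization is a deliberately crude assumption.

```python
import re
from collections import defaultdict
from typing import List


def style_key(term: str) -> str:
    """Collapse case, spaces, and hyphens so stylistic variants compare equal."""
    return re.sub(r"[\s\-]+", "", term.casefold())


def find_style_duplicates(candidates: List[str]) -> List[List[str]]:
    """Group terms that differ only in capitalization, spacing, or hyphens."""
    groups = defaultdict(list)
    for term in candidates:
        groups[style_key(term)].append(term)
    return [group for group in groups.values() if len(group) > 1]


terms = ["Milky Way galaxy", "Milky Way Galaxy", "Bird houses",
         "Birdhouses", "Biochemistry"]
print(find_style_duplicates(terms))
```

A check like this cannot find true synonyms such as Biochemistry/Biological chemistry, because those differ in wording and not just in style. They still have to be caught during review.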
Revised December 2010