When selecting terms, you can draw on all the standard sources that might be available to you, such as these:
- Existing taxonomies, thesauri, and classification schemes
- Encyclopedias, lexicons, dictionaries, and glossaries
- Books and journals, and their indexes
- Databases
- Annual reviews and surveys
Also scan the literature in general: not just your own literature, but that of other publishers as well. I would encourage you to watch the international literature. A tremendous amount of what is happening these days is happening outside the United States. We tend to take a rather parochial view of what is happening in our own field and in knowledge organization systems. I would say that there is more happening in Europe than in the United States at the moment. They are way ahead of us in actually getting taxonomic implementations done and in pushing the envelope of thinking. So, those of us in non-European countries need to pay attention to what is happening in Europe.
Of course, capture the knowledge of users and of subject matter experts. You have to handle the experts with care. Their time is valuable, and they tend to have a narrow view of the field. Knowledgeable people with a bachelor’s degree have a broad view. Master’s degree holders have more focused knowledge. By the time people have finished their doctorates and have areas of expertise that they teach in, they are experts in those areas, but not in the entire field. Although you may need their input, you also need to realize that your taxonomy could become completely skewed if you listen to only one expert.
You need to find a way to capture terms and keep track of them, particularly how often they appear. Frequency lists, such as search log files that you sort by frequency, give you an idea of how often particular words and phrases are used in a particular environment. Words and phrases that are used all the time are not useful as index terms; they don’t serve to distinguish one file from another, or to filter search results. Words and phrases that are used a medium amount are the ones that you probably want to have in your taxonomy. You might decide on a cutoff, taking the terms with a frequency of 50 or whatever, whether the list is from search logs or from parsing document texts. Consider reviewing the terms that don’t make the numeric cut; you might spot some worth keeping to represent concepts that might not always be expressed in the exact wording of those terms, or to represent emergent concepts.
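To make the frequency idea concrete, here is a minimal sketch in Python. It assumes the search log has already been reduced to a plain list of query strings. The function name, the `min_count` and `max_share` parameters, and the sample queries are all hypothetical, chosen only for illustration; a real log would call for a much higher cutoff, such as the 50 mentioned above.

```python
from collections import Counter

def sort_terms_by_usefulness(queries, min_count=2, max_share=0.4):
    """Split logged queries into three groups by frequency.

    min_count and max_share are illustrative thresholds, not recommendations.
    """
    counts = Counter(q.strip().lower() for q in queries if q.strip())
    total = sum(counts.values())
    too_common, candidates, review = [], [], []
    for term, n in counts.most_common():
        if n / total > max_share:
            too_common.append(term)   # used all the time: poor index term
        elif n >= min_count:
            candidates.append(term)   # medium frequency: likely taxonomy term
        else:
            review.append(term)       # below the cutoff: check by hand
    return too_common, candidates, review

# Hypothetical sample log
queries = (["report"] * 6 + ["solar energy"] * 3
           + ["wind turbines"] * 2 + ["perovskite cells"])
too_common, candidates, review = sort_terms_by_usefulness(queries)
```

The third list is the one to read through by hand; it is where an emergent concept would most likely show up.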
You might consider tracking the search logs for customer queries, if you can do that. There are a lot of privacy issues now surrounding that. But if you can get the search logs, they are a helpful source.
You can run a term and phrase list against a representative corpus of texts, using stop words, to get a good harvest of terms. That’s one way to harvest from full text. If you note the co-occurrences, that also will give you an idea of conceptual relationships, whether they be broader or narrower or related terms.
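As a rough illustration of harvesting from full text, the sketch below treats stop words and punctuation as phrase boundaries, counts the resulting phrases, and tallies which phrases appear in the same document. The stop word list, the function name, and the three sample sentences are assumptions made for this example; this is only one simple way to do it, not a description of any particular tool.

```python
import re
from collections import Counter
from itertools import combinations

# A deliberately tiny, hypothetical stop word list
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "is", "are",
              "for", "to", "on", "with", "by"}

def harvest_phrases(text):
    """Return runs of consecutive non-stop words as candidate phrases."""
    phrases, run = [], []
    # Tokens are either words or single punctuation/digit characters
    for token in re.findall(r"[a-z]+|[^a-z\s]", text.lower()):
        if token.isalpha() and token not in STOP_WORDS:
            run.append(token)
        else:
            if run:
                phrases.append(" ".join(run))
            run = []
    if run:
        phrases.append(" ".join(run))
    return phrases

# Hypothetical three-document corpus
docs = [
    "Solar energy is converted by photovoltaic cells.",
    "Photovoltaic cells are a source of solar energy.",
    "Wind turbines and solar energy.",
]

term_counts = Counter()
pair_counts = Counter()
for doc in docs:
    phrases = harvest_phrases(doc)
    term_counts.update(phrases)
    # Count each pair of distinct phrases once per document
    pair_counts.update(combinations(sorted(set(phrases)), 2))
```

A high pair count only suggests that two terms are related. Deciding whether the relationship is broader, narrower, or related is still a job for a person.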
Be careful not to overdo it on the co-occurrence. There are Bayesian and vector analysis approaches that, frankly, I have not found really helpful in building a controlled vocabulary. Those approaches are heavily used in search, so you need to be wary of search systems that rely heavily on a Bayesian approach. You can get some interesting results in the search. The search sometimes runs contrary to what you think it might do.
In the process of gathering your terms, keep track of the literary warrant for each term. If you record the places that you found the terms used, you can go back and say, “Right here. That is how they used it. Is this wrong?” This can be useful in the case of disagreements; I have heard more than one subject matter expert say something like, “Pshaw! John doesn’t know what he’s talking about!” Also, if dates are connected with the literary warrants, you can get an idea of which terms are emergent and therefore worth considering, even if they don’t show up frequently yet.
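One simple way to keep literary warrant is a small record per sighting of a term. The sketch below is only an illustration: the `Warrant` fields, the function names, the cutoff date, and the sample entries (including the source names) are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Warrant:
    term: str
    source: str    # where the term was found
    seen_on: date  # publication or capture date of that source
    context: str   # the passage showing how the term was used

def group_by_term(warrants):
    """Collect every recorded sighting under its term."""
    grouped = defaultdict(list)
    for w in warrants:
        grouped[w.term].append(w)
    return grouped

def emergent_terms(warrants, since):
    """Return terms whose earliest recorded sighting is on or after `since`."""
    first_seen = {}
    for w in warrants:
        if w.term not in first_seen or w.seen_on < first_seen[w.term]:
            first_seen[w.term] = w.seen_on
    return sorted(t for t, d in first_seen.items() if d >= since)

# Hypothetical sample records
warrants = [
    Warrant("thesaurus", "Journal A", date(1998, 3, 1), "a thesaurus of terms"),
    Warrant("thesaurus", "Glossary B", date(2009, 5, 12), "thesaurus: see also"),
    Warrant("linked data", "Journal A", date(2010, 6, 20), "publishing linked data"),
]

by_term = group_by_term(warrants)
new_terms = emergent_terms(warrants, since=date(2010, 1, 1))
```

With the context stored alongside each source, you can show an expert exactly how a term was used, and the dates flag terms that are recent even when their counts are still low.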
Once you have a tentative collection of terms, the next step is to organize them. Next time, I’ll discuss some ways of doing that.
Marjorie M.K. Hlava, President, Access Innovations
Note: The above posting is one of a series based on a presentation, The Theory of Knowledge, given at the Data Harmony Users Group meeting in February of 2011. The presentation covered the theory of knowledge as it relates to search and taxonomies.