I received several thoughtful comments on my Beyond Search Web log from well-known search and content processing experts (not the search engine optimization type or the MBA analyst species). These comments addressed the topic of taxonomies. One senior manager at a leading search and content processing firm referenced David Weinberger’s quite good book, Everything is Miscellaneous. My copy has gone missing, so join me in ordering a new one from Amazon. Taxonomy and taxonomies have attained fad status in behind-the-firewall search and content processing. Every vendor has to support taxonomies. Every licensee wants to “have” a taxonomy.
A “taxonomy” is a classification of things. Let me narrow my focus to behind-the-firewall content processing. In an organization, a taxonomy provides a conceptual framework that can be used to organize the organization’s information. Synonyms for taxonomy include classification, categorization, ontology, typing, and grouping. Each of these terms can be used with broader or narrower meanings, but for my purpose, I will assume each can be used interchangeably. In my experience, most vendors and consultants toss these terms around as interchangeable Lego blocks.
A fad, as you know, is an interest that is followed for some period of time with intense enthusiasm. Think Elvis, bell bottoms, and speaking Starbucks coffee language.
A Small Acorn
A few years ago, a consultant approached me to write about indexing content inside an organization. This individual had embarked on a consulting career and needed information for her Web site. I dipped into my files, collected some useful information about the challenges corporate jargon presented, and added some definitions of search-related terms.
I did work for hire, so my client could reuse the information to suit specific needs. Imagine my pleasant surprise when I found my information recycled multiple times and used to justify a custom taxonomy for an enterprise. I was pleased to have become a catalyst for a boom in taxonomy seminars, newsgroups, and consulting businesses. One remarkable irony was that a person who had recycled the information I sold to consultant A thousands of miles away turned up as consultant B at a company in which I was an investor. I sat in a meeting and heard my own information delivered back to me as a way to orient me about classifying an organization’s information.
A taxonomy revolution had taken place, and I was only partially aware. A new industry had taken root, flowered, and spread like kudzu around me.
The interest in taxonomies continues to grow. Having completed the descriptions of companies offering what I call rich content processing, I can report that organizations looking for taxonomy-centric systems have many choices. Of the 24 companies profiled in the Beyond Search study, all 24 “do” taxonomies. Obviously there are greater and lesser degrees of stringency. One company has a system that supports American National Standards Institute guidelines for controlled terms and taxonomies. Other companies “discover” categories on the fly. Between these two extremes there are numerous variations. One conclusion I drew after this exhausting analysis is that it is difficult to locate a system that can’t “do” taxonomies.
What’s Behind the Fad?
Let me consider briefly a question that I don’t tackle in Beyond Search: “Why the white-hot interest in taxonomies?”
Taxonomies have a long and distinguished history in library science, philosophy, and epistemology. For those of you who are a bit rusty, “epistemology” is the theory of knowledge. Taxonomies require a grasp, no matter how weak, on knowledge. No matter how clever, a person creating a taxonomy must figure out how to organize email, proposals, legal documents, and the other effluvia of organizational existence.
I think people have enough experience with key word search to realize its strengths and limitations. Key words — either controlled terms or free text — work wonderfully when I know what’s in an electronic collection, and I know the jargon or “secret words” to use to get the information I need.
Boolean logic (implicit or explicit) is not too useful when one is trying to find information in a typical corpus today. There’s no editorial policy at work. Anything the indexing subsystem is fed is tossed into an inverted index. This is the “miscellaneous” in David Weinberger’s book.
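To make the point concrete, here is a minimal sketch of an inverted index with an implicit Boolean AND. The document ids, the sample text, and the function names are my own assumptions for illustration, not any vendor's implementation. Whatever is fed in gets indexed, and a query works only when the user already knows the words in the collection.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercase token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def boolean_and(index, *terms):
    """Return ids of documents that contain every query term."""
    postings = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*postings) if postings else set()

# Hypothetical mini-corpus: no editorial policy, everything goes in.
docs = {
    "email-1": "budget forecast for the finance team",
    "proposal-7": "proposal for new accounting software",
    "memo-3": "finance memo about the quarterly budget",
}
index = build_inverted_index(docs)

print(boolean_and(index, "finance", "budget"))  # matches email-1 and memo-3
print(boolean_and(index, "ledger"))             # empty: wrong "secret word"
```

The second query returns nothing even though the collection is full of financial material. The user guessed a word the authors never used.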
A taxonomy becomes a way to index content so the user can look at a series of headings and subheadings. A series of headings and sub-headings makes it possible to see the forest, not the trees. Clever systems can take the category tags and marry them to a graphical interface. With hyperlinks, it is possible to follow one’s nose — what some vendors call exploratory search or search by serendipity.
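A rough sketch of the headings-and-subheadings idea follows. I assume each document already carries one two-level category tag. The tags, document ids, and function names are hypothetical, and a real system would hang hyperlinks on each line rather than print text.

```python
from collections import defaultdict

# Hypothetical category tags: each document carries (heading, subheading).
tagged_docs = {
    "memo-3": ("Finance", "Budgets"),
    "email-1": ("Finance", "Budgets"),
    "invoice-9": ("Finance", "Accounts Payable"),
    "contract-2": ("Legal", "Contracts"),
}

def build_browse_tree(tagged):
    """Group document ids under heading -> subheading."""
    tree = defaultdict(lambda: defaultdict(list))
    for doc_id, (heading, subheading) in sorted(tagged.items()):
        tree[heading][subheading].append(doc_id)
    return tree

def render(tree):
    """Return the tree as indented lines with a document count per node."""
    lines = []
    for heading in sorted(tree):
        total = sum(len(ids) for ids in tree[heading].values())
        lines.append(f"{heading} ({total})")
        for sub in sorted(tree[heading]):
            lines.append(f"  {sub} ({len(tree[heading][sub])})")
    return lines

tree = build_browse_tree(tagged_docs)
print("\n".join(render(tree)))
```

The user sees the forest first (Finance, Legal) and drills down by pointing and clicking. No query is required.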
A taxonomy, when properly implemented, yields three payoffs:
First, users like to point-and-click to discover information without having to craft a query. Believe me, most busy people in an organization don’t like trying to outfox the search box.
Second, the categories — even when hidden behind a naked search box interface — are intuitively obvious to a user. An accountant may (as I have seen) enter the term finance and then point-and-click through results. When I ask users if they know specific taxonomy terms, I hear, “What’s a taxonomy?” Intuitive search techniques should be a part of behind-the-firewall search and content processing systems.
Third, management is willing to invest in fine-tuning a taxonomy. Unlike a proposed change to a controlled vocabulary, a suggestion to add categories meets with surprisingly little resistance. I think the intuitive usefulness of cataloging and categorizing is obvious to people who tell other people to search for them.
There are some pitfalls in the taxonomy game. The standard warnings are “Don’t expect miracles when you categorize modest volumes of content” and “Be prepared for some meetings that are more like a graduate class in logic than an effort to figure out how to deliver what the marketing department needs in a search system.” Etc.
On the whole, the investment in a system that automatically indexes is a wise one. It becomes ever wiser when the system can use knowledge bases, word lists, taxonomies, and other information inputs to index more accurately.
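Here is one way word lists can feed automatic indexing, as a minimal sketch. The category names, the term lists, and the scoring rule (count distinct matching terms) are my assumptions for illustration. Commercial systems use far richer inputs and methods.

```python
import re

# Hypothetical word lists: each category maps to terms that signal it.
TERM_LISTS = {
    "Finance": {"budget", "invoice", "forecast", "ledger"},
    "Legal": {"contract", "liability", "clause"},
    "Marketing": {"campaign", "brand", "launch"},
}

def categorize(text, term_lists=TERM_LISTS, min_hits=1):
    """Return categories ranked by number of distinct matching terms."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    scores = {cat: len(tokens & terms) for cat, terms in term_lists.items()}
    # Sort by hit count (descending), then by name for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [cat for cat, hits in ranked if hits >= min_hits]

print(categorize("Review the contract clause before the budget forecast"))
print(categorize("We signed a brand new contract"))
```

The first call tags the text as both Finance and Legal, which is reasonable. The second also tags the text as Marketing because of the word "brand", a small example of how a purely mechanical match can go wrong without human tuning.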
Keep in mind that “smart” systems can be right most of the time and then without warning run into a ditch. At some point, you will have to hunker down and do the hard thinking that a useful taxonomy requires. If you are not sure how to proceed, try to get your hands on the taxonomies that once were available from Convera. Oracle once offered vertical term lists. You can also Google for taxonomies. A little work will return some useful examples.
To wrap up, I am delighted that so many individuals and organizations have an interest in taxonomies, whether a fad or something epistemologically more satisfying. The content processing industry is maturing.
Stephen Arnold, February 8, 2008