I would like to make some observations about statistics-based categorization and search, and about the advantages that their proponents claim.

First of all, statistics-based co-occurrence approaches do have their place, particularly for wide-ranging bodies of text such as email archives and social media exchanges, and for assessing the nature of an unknown collection of documents. In those circumstances, a well-defined collection of concepts covering a predetermined area of study and practice may not be practicable. Lacking a relevant controlled vocabulary foundation, and lacking other practical options, analysis falls back on less-than-ideal mathematical methods.
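To make the contrast concrete, here is a minimal, hypothetical sketch of what a co-occurrence approach does at its core: it counts which words appear near one another in an uncurated corpus and treats frequent pairings as evidence of relatedness, with no controlled vocabulary involved. The corpus, window size, and function name are invented for illustration.

```python
from collections import Counter

# Toy corpus standing in for an uncurated collection (emails, social media posts).
corpus = [
    "the board approved the budget for the new library",
    "library budget cuts worry the board",
    "new budget approved after board vote",
]

def cooccurrence_counts(docs, window=3):
    """Count how often two words appear within `window` words of each other."""
    pairs = Counter()
    for doc in docs:
        words = doc.split()
        for i, word in enumerate(words):
            for neighbor in words[i + 1 : i + 1 + window]:
                if neighbor != word:
                    pairs[tuple(sorted((word, neighbor)))] += 1
    return pairs

# Frequent pairings are taken as evidence of relatedness -- no human-defined concepts involved.
for pair, count in cooccurrence_counts(corpus).most_common(5):
    print(pair, count)
```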

Co-occurrence can do strange things. You may have done Google searches that were mysteriously steered toward a different set of search strings, apparently based on what other people have been searching on. This is a bit like the proverbial search for lost keys under the street light, rather than in the places the keys are more likely to be, simply because the light is better and the searching is easier there.

These approaches are known for low search-result accuracy (60 percent or less), which is unacceptable for the article databases of research institutions, professional associations, and scholarly organizations. Not only is this a disservice to searchers in general; it is an extreme disservice to the authors whose insights and research reports get overlooked, and to the researchers who might otherwise find a vital piece of information in a document that the search misses.

The literature databases and repositories of research-oriented organizations cover specific disciplines, with well-defined subdisciplines and related fields. This makes them ideal for a keyword/keyphrase approach built on a thesaurus. These well-defined disciplines have well-defined terminology that is readily captured in a taxonomy or thesaurus. The thesaurus has additional value as a navigable guide for searchers and for human indexers, including the authors and researchers who know the material best. Surely they know the material far better than any algorithm could, and they can take full advantage of the indexing and search benefits that a thesaurus can offer. An additional benefit (and a huge one) is that a thesaurus can serve as the basis for an associated indexing rule base that, if properly developed, can run rings around any statistics-based approach.
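To illustrate what such a rule base might look like (a simplified sketch in the spirit of the idea, not any particular product's rule syntax), the fragment below pairs one invented thesaurus record, with its entry terms (UF), broader (BT), narrower (NT), and related (RT) terms, with a simple rule that assigns the preferred term whenever the text mentions it or one of its entry terms.

```python
import re

# One hypothetical thesaurus record: a preferred term with its entry terms (UF),
# broader term (BT), narrower terms (NT), and related terms (RT).
thesaurus = {
    "Myocardial infarction": {
        "UF": ["heart attack", "cardiac infarction"],
        "BT": ["Heart diseases"],
        "NT": ["Anterior myocardial infarction"],
        "RT": ["Coronary thrombosis"],
    },
}

def index_document(text, thesaurus):
    """Assign a preferred term whenever the text mentions it or one of its entry terms."""
    assigned = set()
    for preferred, record in thesaurus.items():
        for form in [preferred] + record["UF"]:
            if re.search(r"\b" + re.escape(form) + r"\b", text, re.IGNORECASE):
                assigned.add(preferred)
                break
    return sorted(assigned)

print(index_document("The patient suffered a heart attack last spring.", thesaurus))
# ['Myocardial infarction']
```

A production rule base would, of course, carry many more conditions (context words, exclusions, combinations), but the principle is the same: human-written rules anchored to controlled terms.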

Proponents of statistics-based semantic indexing approaches claim that searchers need to guess the specific words and phrases that appear in the documents of potential interest. On the contrary, a human can anticipate those words and phrases far better than a co-occurrence application can. Further, with an integrated rule-based implementation, the searcher does not need to guess all of the exact words and phrases that express particular concepts in the documents.

With indexing rooted in the controlled vocabulary of the thesaurus, documents that express the same concept in various ways (including ways that completely circumvent anything that co-occurrence could latch onto) are brought together in the same set of search results. Admittedly, I do appreciate the statistical approaches’ success in pulling together certain words that they have learned, through training, may be important. The proximity or distance between those words figures into the ranking of the returned results, implying relevance to the query. However, that unknown distance makes it hard for a statistical approach to latch onto a concept reliably.
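Here is a rough sketch of how that plays out at search time, assuming documents have already been indexed with preferred terms along the lines sketched above; the document identifiers and terms are made up. The query is mapped to its preferred term, and every document indexed with that term is returned, however its author phrased the concept.

```python
# Hypothetical index: documents already tagged with preferred thesaurus terms,
# regardless of how each author phrased the concept.
indexed_docs = {
    "doc1": ["Myocardial infarction"],   # the text said "heart attack"
    "doc2": ["Myocardial infarction"],   # the text said "cardiac infarction"
    "doc3": ["Coronary thrombosis"],
}

# Entry-term lookup derived from the thesaurus (variant phrasing -> preferred term).
entry_terms = {
    "heart attack": "Myocardial infarction",
    "cardiac infarction": "Myocardial infarction",
}

def search(query, indexed_docs, entry_terms):
    """Map the query to its preferred term, then return every document indexed with it."""
    concept = entry_terms.get(query.lower(), query)
    return [doc for doc, terms in indexed_docs.items() if concept in terms]

print(search("heart attack", indexed_docs, entry_terms))
# ['doc1', 'doc2'] -- both variants retrieved without guessing the authors' wording
```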

Use of a thesaurus also enables the search interface to recommend related concepts to search on, as well as narrower (more specific) and broader concepts of possible interest. The searcher and the indexer can also navigate the thesaurus (if it is online or in the search interface) to discover concepts and terms of interest.
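A small sketch of the kind of recommendation a thesaurus-backed interface can make, reusing the same invented record structure as above: given the searcher's term, offer its broader, narrower, and related terms as alternative entry points.

```python
# Minimal, illustrative thesaurus record (same made-up structure as sketched earlier).
thesaurus = {
    "Myocardial infarction": {
        "BT": ["Heart diseases"],
        "NT": ["Anterior myocardial infarction"],
        "RT": ["Coronary thrombosis"],
    },
}

def suggest(term, thesaurus):
    """Offer broader, narrower, and related terms as alternative search entry points."""
    record = thesaurus.get(term, {})
    return {
        "broader": record.get("BT", []),
        "narrower": record.get("NT", []),
        "related": record.get("RT", []),
    }

print(suggest("Myocardial infarction", thesaurus))
# {'broader': ['Heart diseases'], 'narrower': ['Anterior myocardial infarction'],
#  'related': ['Coronary thrombosis']}
```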

With statistics-based indexing, the algorithms are hidden in a mysterious “black box” that must not be tinkered with. With rule-based, taxonomy-based approaches, the curtain can be pulled back, and the workings can be dealt with directly, using human intelligence and expert knowledge of the subject matter.

As for costs, statistical approaches generally require “training sets” of documents from which to “learn” or develop their algorithms. Any newly added concept or term means collecting another ten to fifty documents on the topic. If the new term is to be a narrower term of an existing term, that broader term’s training set loses meaning and must be redone to differentiate the two concepts. Consider that point in the ongoing management of a set of concepts.

Any concept that is not easily expressed in a clear word or phrase, or that is often referred to through creative expressions that vary from instance to instance, will require a significantly larger training set. Even the expanded set is likely to miss relevant resources while retrieving irrelevant ones.

Dealing with training sets is costly, and likely to be somewhat ineffective at that: if the selected documents don’t cover all the concepts that appear in the full database, you won’t end up with all the algorithms you need, and the ones that are developed will be incomplete and likely inaccurate. So with a statistics-based approach, you won’t reap the full value of your efforts.

Barbara Gilles, Taxonomist
Access Innovations