Understanding “Rule Based” vs. “Statistics Based” Indexing Systems

by Marjorie M.K. Hlava
reprinted from Information Outlook with permission

It’s never been more fascinating—or more challenging—to be an information professional.

Surveys show that corporate librarians have more information to manage, and less staff to help manage it, than ever before. Not only is the universe of data expanding by the minute, but we are beset with confusion over the variety of tools and methods for managing and retrieving electronic information.

Perhaps the most striking example of this confusion surrounds automatic indexing. Also called “automatic categorization,” it’s the process of mechanically analyzing concepts and themes in a database’s stored and newly added content to create linkages between keywords and phrases.

And it may be the most critical, too. The costs and benefits of how you categorize your information collection will cascade through your organization for years to come. In our increasingly knowledge-based economy, indexing amounts to basic infrastructure.

So why is automatic indexing important? It’s the glue that holds your content together. It’s the underlying layer of order that makes your database productive, robust and responsive—and thus best able to serve the needs of your organization.

Without automatic indexing, you may find the precise bit of data that will ignite a new market, but at what cost, if you and your staff have spent hours wading through a river of irrelevant documents called up from an online search? Even more likely are instances of lost opportunities to deliver on requests for research, competitive intelligence or industry awareness because no one had the means to put missing or disparate pieces of information together.

So which system for automatic indexing of data is fastest? Easiest to implement and update? Which provides the best return on investment? Which system will “understand” that when you’re interested in the REM sleep phase that you’re not interested in the brainy rock band called R.E.M.?

To better grasp the options for categorizing data, we’ll be smart librarians and impose some categories here ourselves. We will divide the major systems for automatic indexing into two groups – rule-based vs. statistics-based.

We’ll look at the return on investment, long and short term, for each, as well as how each method compares for ease of implementation, user access and accuracy. Most systems require a thesaurus in order to start, and we’ll assume one here for each system. (The thesaurus is a controlled vocabulary that lists the main components of the data collection, along with appropriate synonyms and antonyms. It guides both the indexer and the searcher to select the same terms to describe a particular subject.)
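
To make the idea concrete, here is a minimal sketch in Python of how a few thesaurus entries might be represented; the structure and the sample entries are illustrative only, not any particular product’s format.

    # Simplified controlled vocabulary: each preferred term carries the
    # synonyms that should lead an indexer or searcher to it.
    thesaurus = {
        "REM sleep": {"synonyms": ["rapid eye movement sleep", "paradoxical sleep"]},
        "bush": {"synonyms": ["shrub", "shrubbery"]},
    }
    # Real thesauri also record broader, narrower and related terms,
    # but this is enough for the sketches that follow.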

The critical notion for us is that one has to “teach” an automatic indexing system to identify relevant data and how to categorize it. How is this accomplished in each of the two automatic indexing systems?

Rule-Based Indexing

The newest type of automatic indexing system – rule-based – is a leap forward in the science of indexing. It offers greater precision, while burning up fewer dollars and hours than previous systems. In a rule-based system, simple categorization rules are automatically generated, matching bits of text (“prompt words”) to the thesaurus or taxonomy terms that tell the software how to categorize the document. Editors may further refine the rules by telling the system what words must be present or absent in the text, or by adding other specific instructions that point the document to a particular category. For the organization conducting sleep research, you tell the software that a document with REM and “music” is a “miss,” and it doesn’t bring up any documents about the band again. Think of a rule-based approach as driving a car using specific directions to get to a destination.
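
As a rough sketch, and assuming nothing about any particular vendor’s rule syntax, such a rule boils down to logic like this:

    # Illustrative categorization rule: index a document under "REM sleep"
    # when the prompt word "REM" appears, unless a disqualifying word such
    # as "music" also appears in the text.
    def apply_rem_rule(text: str) -> list[str]:
        tokens = text.lower().split()
        if "rem" in tokens and "music" not in tokens:
            return ["REM sleep"]
        return []

    print(apply_rem_rule("Subjects entered REM within 90 minutes"))  # ['REM sleep']
    print(apply_rem_rule("The band played new music on tour"))       # []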

Statistics-Based Indexing

The second system – statistics-based – is “trained” by examining a set of 50 or so documents associated with each keyword in the thesaurus. This builds statistical profiles of word occurrence and location in the training documents. In this system, the software would deduce that REM is part of the science of sleep, and not a 1990s rock band, since none of the training documents mentioned music. Think of this as giving a driver a big set of maps and telling him to figure out the best way to get to the destination.
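
In rough outline, and with invented function names, the training and scoring steps of a statistics-based system look something like this:

    from collections import Counter

    # Training: for each thesaurus term, tally the words that appear in its
    # ~50 hand-picked training documents, giving the term a word-occurrence
    # profile.
    def train(training_docs_by_term: dict[str, list[str]]) -> dict[str, Counter]:
        profiles = {}
        for term, docs in training_docs_by_term.items():
            counts = Counter()
            for doc in docs:
                counts.update(doc.lower().split())
            profiles[term] = counts
        return profiles

    # Scoring: a new document is indexed under the terms whose training
    # vocabulary it overlaps most heavily.
    def score(text: str, profile: Counter) -> int:
        return sum(profile[token] for token in text.lower().split())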

These are sophisticated systems, and the upfront investment for any automatic indexing system is substantial. For our comparison, let’s assume an existing thesaurus (sometimes called a controlled vocabulary) of 6,000 terms, which is typical. We’ll base hourly rates and units per hour on industry rules of thumb. And we’ll assume that 85 percent accuracy is the baseline required for implementation to save personnel time.

Now for a test drive—our experience with two different clients, using the two different systems.

The Rule-Based Approach

A simple rule base matches terms in the thesaurus to exact terms and synonyms in the documents to capture the appropriate indexing from the target text. With an existing thesaurus or authority file, this is a two-hour process. Rules for both synonyms and preferred terms are generated automatically. So, for example, if the thesaurus term were “bush,” the system would also recognize “shrub” as a synonym. The simple rule base alone usually provides about 60% accuracy.
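
A minimal sketch of that automatic generation step, assuming a thesaurus structured like the one sketched earlier, might look like this:

    # Generate simple rules: every preferred term and each of its synonyms
    # becomes a match rule pointing back at the preferred term.
    def build_simple_rules(thesaurus: dict) -> dict[str, str]:
        rules = {}
        for preferred, entry in thesaurus.items():
            rules[preferred.lower()] = preferred
            for synonym in entry.get("synonyms", []):
                rules[synonym.lower()] = preferred
        return rules

    rules = build_simple_rules({"bush": {"synonyms": ["shrub", "shrubbery"]}})
    # rules == {"bush": "bush", "shrub": "bush", "shrubbery": "bush"}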

The editor also adds more complex rules, which might instruct the system, in its search for shrubbery documents, to ignore a document containing the word “bush” if it appears within a few words of “president,” or if “bush” is spelled with a capital B. Complex rules such as these cover about 10% of the terms in the vocabulary. A person functioning as an index editor can create rules at a rate of four to six per hour. So we may assume that for a 6,000-term thesaurus, creating 600 complex rules at six per hour requires about 100 hours, or 2.5 person-weeks. Enhancing the simple rules through editorial rule building offers the potential to achieve 85% or higher accuracy.
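
What such a complex rule does can be sketched as follows; the five-word proximity window and the capitalization test are our own illustrative choices, not a standard.

    # Illustrative complex rule: index a document under "bush" (the plant)
    # only if the word is not capitalized mid-sentence and does not occur
    # within a few words of "president".
    def bush_rule(text: str, window: int = 5) -> list[str]:
        tokens = text.split()
        lowered = [t.strip(".,;:").lower() for t in tokens]
        for i, token in enumerate(tokens):
            if lowered[i] != "bush":
                continue
            # "Bush" with a capital B mid-sentence is a miss.
            if token[0].isupper() and i > 0 and not tokens[i - 1].endswith("."):
                return []
            # "president" within `window` words is a miss.
            if "president" in lowered[max(0, i - window): i + window + 1]:
                return []
        return ["bush"] if "bush" in lowered else []

    print(bush_rule("Trim the bush along the fence line"))         # ['bush']
    print(bush_rule("The president met Bush advisers yesterday"))  # []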

The rule-based approach places no limit on the number of terms in a taxonomy or the number of taxonomies held on a server. Our client was up and running with their rule-based index in a month.

So, let’s add it up.

  • Software is about $60,000, including training and support.
  • Conversion of the thesaurus is about two hours at $125 per hour in programming time, or $250.
  • Loading the thesaurus and creating the rule base, two hours of editorial time at $45 per hour or $90.
  • Complex rule building, 100 to 150 hours of editorial time at $45 per hour, or $4,500 to $6,750.
  • The total, based on those assumptions and the faster rule-building rate, would be $64,840 (see the tally below).
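
As a quick check on the arithmetic, here is the tally in Python, using the 100-hour, six-rules-per-hour figure for complex rule building:

    # Tallying the rule-based line items listed above.
    software      = 60_000      # software, training and support
    conversion    = 2 * 125     # thesaurus conversion, programming time
    loading       = 2 * 45      # loading and simple rule generation, editorial time
    complex_rules = 100 * 45    # 600 complex rules at six per hour = 100 editorial hours
    print(software + conversion + loading + complex_rules)  # 64840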

The client for whom we prepared the rule base reported 92% accuracy and a four-fold increase in productivity.

The Statistical Approach

We’ll start with the same pre-existing 6,000-term thesaurus. The cost of the software in this system usually starts at about $75,000, and one week of training is typically required, at about $10,000.

Now you must address the documents – news articles, for instance – that “train” the system on the thesaurus terms. The documents can be collected by using software programs, but the document sets for each thesaurus term must be reviewed by editors to remove misleading records. If the reviews require 15 minutes per thesaurus term, that comes to 1,500 hours of editorial time at $45 per hour, or $67,500 for review of the training set collection.

Next, you run the training documents through the software, with programming time of 40 hours at $125 per hour, or $5,000. The index editor reviews the results, and repeats the collection of training sets for the thesaurus terms that didn’t return good data sets. The second run is reviewed. Editorial time of 40 hours at $45 per hour costs $1,800.

The next step is to collect additional training data for the bad sets. If 25 percent of the thesaurus terms need new training sets, that is 1,500 terms; at one hour of editorial time per term, that adds 1,500 hours at $45 per hour, or $67,500. The training set is re-run, assuming 20 hours of programming time at $125 per hour, or $2,500. Review of the results requires 20 hours of editorial time at $45 per hour, or $900.
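
The arithmetic behind these training rounds, as we have reconstructed it, is summarized below.

    # Reconstructed line items for the statistical training rounds.
    first_review  = 6_000 * 0.25 * 45   # 15 min/term = 1,500 editorial hours -> $67,500
    first_run     = 40 * 125            # programming time -> $5,000
    first_check   = 40 * 45             # editorial review -> $1,800
    recollection  = 1_500 * 1 * 45      # 25% of terms, one hour each -> $67,500
    second_run    = 20 * 125            # programming time -> $2,500
    second_check  = 20 * 45             # editorial review -> $900
    print(first_review, first_run, first_check, recollection, second_run, second_check)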

In our case study with a client, the resulting accuracy at this point was 60%. To reach a reliable improvement in productivity requires 85% accuracy. So at this point, an editor can write rules in a programming language such as SQL. If you train editors to write these rules, you can avoid higher programmer rates. Still, writing four SQL rules per hour for 1,500 terms (25 percent of the thesaurus terms) requires 375 more editorial hours at $45 per hour, or $16,875.
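
What such a hand-written rule might look like is sketched below, using Python’s built-in SQLite module as a stand-in for whatever document store the statistical system actually uses; the table, columns and sample records are invented for the illustration.

    import sqlite3

    # Hypothetical corrective rule, expressed in SQL: tag documents that
    # mention REM as "REM sleep" unless they also mention give-away words
    # about the band. (A production rule would also need word-boundary
    # handling so that "REM" does not match inside longer words.)
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, text TEXT, term TEXT)")
    conn.execute("INSERT INTO documents (text) VALUES ('REM latency fell after treatment')")
    conn.execute("INSERT INTO documents (text) VALUES ('The band toured and released new music')")
    conn.execute("""
        UPDATE documents
           SET term = 'REM sleep'
         WHERE text LIKE '%REM%'
           AND text NOT LIKE '%music%'
           AND text NOT LIKE '%band%'
    """)
    print(conn.execute("SELECT text, term FROM documents").fetchall())
    # [('REM latency fell after treatment', 'REM sleep'),
    #  ('The band toured and released new music', None)]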

Again, adding up the costs:

  • Total implementation time frame was 33 weeks.
  • Total person-hours were 6,488, plus 40 hours of editor training.
  • Upfront cost of $449,375.
  • Maximum accuracy achieved was 72%, but productivity doubled.

So, what is the return on investment? Assuming six editors are involved in the process, a rule-based system recoups its value in one month, compared with almost 60 months under the statistics-based approach. It’s not hard to see why we prefer the rule-based index.