A Case Study Comparison of Rule Base and Statistical Approaches

by Marjorie M.K. Hlava

There is a lot of confusion in the marketplace. Word is out that rule bases take a lot of up-front investment, while concurrence (statistical, training-set-based) systems advertise a purely programmatic solution that appeals to many IT professionals.

  • What is the real up-front cost of a rule-based versus a training-set-based system?
  • Which approach takes more up front investment? 
  • Which is faster to implement? 
  • Which has a higher accuracy level?
  • What is the cost of the system with the additional cost of creating the rule base or collecting the training set? 

Let’s look at the actual data. This case study compares an implementation of the Data Harmony MAIstro rule-based system with an implementation of a statistics-based system (such as Autonomy, Nstein, or Stratify).
 
First, a couple of assumptions:

  1. There is an existing thesaurus or controlled vocabulary. If there is not, then we need to add the cost of thesaurus creation*.
  2. Hourly rates are estimated for the comparison; units per hour are based on our 25 years of experience in this field.

Rule Base Approach (Data Harmony’s MAIstro™ Software):

A simple rule base (matching preferred and equivalent/synonym terms) is created automatically as terms are added to the controlled vocabulary (thesaurus, taxonomy, etc.). When an existing thesaurus or authority file is imported, rules for both preferred and equivalent (synonym) terms are created as part of the import; this is a 2-hour process.
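As a rough illustration of what a simple rule base does, the sketch below builds one match rule per preferred term and per synonym from a tiny vocabulary and applies the rules to a sentence. The vocabulary, the function names `build_simple_rules` and `index_text`, and the whole-word matching logic are assumptions made for this example; they are not M.A.I.'s actual rule syntax or engine.

```python
import re

# Hypothetical three-term vocabulary: preferred term -> equivalent (synonym) terms.
VOCABULARY = {
    "Automobiles": ["cars", "motor vehicles"],
    "Petroleum": ["crude oil"],
    "Indexing": [],
}


def build_simple_rules(vocabulary):
    """Create one rule per preferred term and per synonym: match text -> preferred term."""
    rules = {}
    for preferred, synonyms in vocabulary.items():
        rules[preferred.lower()] = preferred
        for synonym in synonyms:
            rules[synonym.lower()] = preferred
    return rules


def index_text(text, rules):
    """Return the sorted preferred terms whose match text occurs as whole words."""
    lowered = text.lower()
    found = set()
    for match_text, preferred in rules.items():
        if re.search(r"\b" + re.escape(match_text) + r"\b", lowered):
            found.add(preferred)
    return sorted(found)


RULES = build_simple_rules(VOCABULARY)
SUGGESTED = index_text("Crude oil prices affect sales of cars.", RULES)
print(len(RULES), SUGGESTED)
```

Because the rules are derived mechanically from the vocabulary, adding a term or a synonym adds a rule with no extra editorial work, which is why this step is counted in hours rather than weeks.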

Complex rules are generally needed for fewer than 10% of the terms in the vocabulary, and they are created at a rate of 4–6 per hour. So with a 6000-term thesaurus, 600 complex rules at 6 per hour require 100 hours, or 2.5 man weeks (150 hours at 4 per hour). Some people begin indexing with the software immediately to get some baseline statistics and then do the rule building. The simple rule base alone usually provides 60% accuracy. With the addition of complex rules, the accuracy increases to 85–92%.
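The effort estimate can be checked with a few lines of arithmetic. This is a minimal sketch using the figures quoted above; the 40-hour man week is an assumption, and the function name `complex_rule_effort` is made up for the example.

```python
TERMS = 6000            # thesaurus size used in the example
COMPLEX_SHARE = 0.10    # upper bound on the share of terms needing complex rules
HOURS_PER_WEEK = 40     # assumption: one man week is 40 hours


def complex_rule_effort(terms, share, rules_per_hour, hours_per_week=HOURS_PER_WEEK):
    """Return (number of complex rules, editor hours, man weeks)."""
    rules = round(terms * share)
    hours = rules / rules_per_hour
    return rules, hours, hours / hours_per_week


FAST = complex_rule_effort(TERMS, COMPLEX_SHARE, 6)  # 6 rules per hour
SLOW = complex_rule_effort(TERMS, COMPLEX_SHARE, 4)  # 4 rules per hour
print(FAST, SLOW)
```

At 6 rules per hour this gives 100 hours (2.5 man weeks); at 4 per hour it gives 150 hours (3.75 man weeks), which is the 100–150 hour range used in the cost breakdown.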

There is no limit to the number of users, the number of terms used in the taxonomy created, or the number of taxonomies put on a server.

Data (e.g., an existing taxonomy) can be preloaded to the Data Harmony software before shipping. It may already be available in one of three formats (tab or comma delimited, XML, or left tagged ASCII). If not, a short conversion script can put it into an appropriate format.

On average, Data Harmony customers are up and running one month after the contract is done.

The up-front time and dollar investment, based on the implementation workflow for the full Thesaurus Master and M.A.I. (Machine Aided Indexer) package (together, MAIstro), is:

  1. Software: $60,000. Training is included in the price if held in Albuquerque; on site it is $1500 per day. Support is priced the same way: free over the phone for the first year, $1500 on site. We recommend two days of training for M.A.I. and one day for Thesaurus Master software. This includes practicums.
  2. Conversion of thesaurus — 2 hours (programming @$125)
  3. Loading thesaurus and creating rule base — 2 hours (editor @$65)
  4. Complex rule building — 100–150 hours (editor@$65)
  5. Therefore, with editors at $65 per hour and programmers at $125 per hour, and taking the 150-hour upper bound for complex rule building, the “up front” cost is $60,000 software + $250 programming + $130 editorial + $9,750 editorial = $70,130
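The total can be reproduced directly from the line items. The sketch below is only a check of the arithmetic, using the rates and hours listed above; the function name `rule_base_upfront_cost` is made up for the example.

```python
EDITOR_RATE = 65        # dollars per hour
PROGRAMMER_RATE = 125   # dollars per hour
SOFTWARE = 60_000       # Thesaurus Master plus M.A.I.


def rule_base_upfront_cost(complex_rule_hours):
    """Sum the up-front cost for a given number of complex rule building hours."""
    conversion = 2 * PROGRAMMER_RATE          # thesaurus conversion, 2 hours
    loading = 2 * EDITOR_RATE                 # loading and simple rule base, 2 hours
    complex_rules = complex_rule_hours * EDITOR_RATE
    return SOFTWARE + conversion + loading + complex_rules


LOW = rule_base_upfront_cost(100)    # lower bound of the 100-150 hour range
HIGH = rule_base_upfront_cost(150)   # upper bound, the figure quoted in the text
print(LOW, HIGH)
```

The quoted $70,130 corresponds to the 150-hour upper bound; at 100 hours the same sum comes to $66,880.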

Rule Base Implementation

  • Total time frame: 1 month
  • Total man hours: 104–154 plus training of 24 hours/editor
  • Up front cost:  $70,130

There is a proven four-fold productivity increase for editors. With an editor's loaded expense at $65 per hour, the investment pays for itself in about 8 months with one editor, or about 1 month with 8 editors. Editors will index with more accuracy, be more consistent, and do deeper indexing (and enjoy the system). Time made available can be used for those other things you’ve been wanting to get to, such as talking with customers, increasing coverage, or automatically filtering data.
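One way to arrive at payback periods of this size is sketched below. It is an interpretation, not the author's stated method: it assumes 2080 working hours per year per editor, and it reads a four-fold productivity increase as the same work being done in a quarter of the time, so that three quarters of each editor's hours are freed and valued at the loaded rate.

```python
UPFRONT_COST = 70_130        # rule base up-front cost from the breakdown above
EDITOR_RATE = 65             # loaded expense, dollars per hour
HOURS_PER_YEAR = 2080        # assumption: 52 weeks of 40 hours
PRODUCTIVITY_FACTOR = 4      # four-fold productivity increase


def payback_months(editors, upfront_cost=UPFRONT_COST):
    """Months until the value of freed editor time equals the up-front cost."""
    hours_per_month = HOURS_PER_YEAR / 12
    freed_share = 1 - 1 / PRODUCTIVITY_FACTOR       # 0.75 of each editor's time
    monthly_saving = editors * hours_per_month * EDITOR_RATE * freed_share
    return upfront_cost / monthly_saving


PAYBACK_ONE = payback_months(1)
PAYBACK_EIGHT = payback_months(8)
print(round(PAYBACK_ONE, 1), round(PAYBACK_EIGHT, 1))
```

Under these assumptions one editor recovers the investment in a little over 8 months and eight editors in about one month.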

The U.S. GAO (Lockheed) reports 92% accuracy, and CSA reports a four-fold increase in productivity.

Statistical Approach – Training Set Solution

If we take the same example, based on a 6000-term thesaurus, the thesaurus creation cost should be the same.

The cost of the software usually starts at about $75,000. (We will use this lower number, although it can be much higher.) Training and support are an additional expense of about $2000 per day. Usually one week of training is required ($10,000).

The up front time and dollar investment based on the workflow for implementation for the statistical (Bayesian, DNA, etc.) systems is:

  1. Collection of the training set data. Usually 20 to 60 items per term are required for a good, diverse training set; the more collected, the better the resulting accuracy. At one hour per term, this is 6000 hours @ $65 editorial = $390,000. Some systems limit the number of terms allowed in a taxonomy, which means an extra license or secondary file building.
  2. Some of this may be done programmatically, and then the data sets are reviewed to remove misleading or false drop records. The review could be done at 15 minutes per term, which is 1500 hours @ $65 editorial, or only $97,500 for a training set collection.
  3. Run the training sets — Programming time, usually one week 40 hours @ $125 per hour = $5,000
  4. Review of the results, recollection of training sets for terms which do not return good data sets
  5. Review of results — 40 hours editorial @$65 = $2600.
  6. Collect additional training data for bad sets — Estimate 25% of the term list 1500 terms x 1 hour editorial @$65 = $97,500
  7. Rerun training set: 20 hours programming @ $125 = $2,500.
  8. Review of results — Editor 20 hours editorial @ $65 = $1,300.
  9. Most systems require at least one more round of review, collection and revision. But let’s stop here.
  10. Resulting accuracy 60%
  11. To get to 85% which is the level where you can get improvement in productivity or use the system for filtering, one has to write rules (which is where Data Harmony starts the process). Write sequel rules @ 4 per hour for 25% of the terms – 375 editor hours x $65 per hour editorial = $24,375.

Now you are ready to begin implementation.
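The listed steps can be tallied the same way as the rule base costs. This sketch counts both the manual collection (step 1) and the programmatic review (step 2), as the quoted totals appear to do, and uses the programmer rate for both runs of the training sets. It is a check of the arithmetic under those assumptions, not the source of the published figures.

```python
EDITOR_RATE = 65        # dollars per hour
PROGRAMMER_RATE = 125   # dollars per hour
SOFTWARE = 75_000       # starting price used in the text
TRAINING = 5 * 2_000    # one week of training at $2000 per day

# (step, hours, hourly rate) as listed in the numbered steps
STEPS = [
    ("collect training sets, 1 hour per term", 6000, EDITOR_RATE),
    ("review programmatically collected sets, 15 minutes per term", 1500, EDITOR_RATE),
    ("run training sets", 40, PROGRAMMER_RATE),
    ("review results", 40, EDITOR_RATE),
    ("collect additional data for 25% of terms", 1500, EDITOR_RATE),
    ("rerun training sets", 20, PROGRAMMER_RATE),
    ("review rerun results", 20, EDITOR_RATE),
    ("write rules for 25% of terms at 4 per hour", 375, EDITOR_RATE),
]

LABOR_HOURS = sum(hours for _, hours, _ in STEPS)
LABOR_COST = sum(hours * rate for _, hours, rate in STEPS)
COST_WITH_SOFTWARE = LABOR_COST + SOFTWARE
COST_WITH_TRAINING = COST_WITH_SOFTWARE + TRAINING

print(LABOR_HOURS, LABOR_COST, COST_WITH_SOFTWARE, COST_WITH_TRAINING)
```

Summed this way, the steps come to 9,495 hours and $695,775 including software (before the week of training), which is close to, though not identical with, the totals quoted for this approach.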

Elapsed time: Assume all people are ready and standing by to move to the next step when needed.

  • Tailor software for installation and delivery — 2 weeks
  • Collect training sets — 6000 hours (35 man months ), so if six people work on it full time it can be done in 6 months
  • Run training sets — One week
  • Review results — One week
  • Rerun training sets — 3 days
  • Review data — 3 days
  • Write sequel rules — 9 man weeks; use two people and do it in a little over a month

Concurrence Training Set Implementation

  • Total time frame:  33 plus weeks, if nothing goes wrong.
  • Total Man hours:  9475 man hours  plus training of 40 hours (add $2,600)
  • Up front cost:  $695,875

A two-fold productivity increase has been noted by the American Psychological Association. No accuracy above 72% is known at present.


Summary

The table below compares the return on investment for the rule-based system and the statistics-based system in terms of total cost and time to implementation. Under the assumptions outlined above, the rule-based system is less expensive than the statistics-based system by a factor of almost seven.

Return on Investment Comparison

 
                          Rules Based           Statistics Based

Total time frame:         1 month               33+ weeks

Total man hours:          104-154 hours         6488 hours
                          +24 hours training    +40 hours training

Total up-front cost:      $64,840               $449,375

ROI assuming 6 editors:   1 month               57.73 months

                                                                         


The Data Harmony M.A.I. system is both efficient and cost effective right out of the box. A simple rule base is generated automatically on the basis of your controlled vocabulary (thesaurus, taxonomy, authority file). Rules are generated for both preferred terms and specified synonym terms.

The accuracy of results from the simple rule base is enhanced by fine-tuning the rules to reflect editorial analysis, interpretation, and insight. For about 10% of the terms, complex rules are required to capture the meaning and conditions of use of the term. (This estimate varies with the wording of taxonomy terms and document writing style.)

How quickly can M.A.I. be implemented?
The software is delivered by CD-ROM or FTP immediately upon payment. Your data can be preloaded in the software for immediate use. Data in tab- or comma-delimited format, XML, or left-tagged ASCII is ready to go; converting other formats requires a small amount of additional time. Our customers are typically up and running one month after the contract is done.

What about automatic taxonomy generation?
This is partially possible. However, using training sets and full or unstructured text to create a categorization system causes many misleading information channels to appear. For example, if Enron is searched in the news today, it will co-occur with fraud, embezzlement, etc. If the same search had been run four years ago, it would have co-occurred with energy, gas distribution, etc. Using rules will ensure the proper usage and application of the language over the life of the project. We recommend augmentation of an existing vocabulary as a faster, more accurate, more reliable, and more consistent methodology for taxonomy creation. It is also less expensive.
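The drift described in the Enron example can be seen in a toy co-occurrence count. The headlines below are invented for illustration, the `FIXED_RULE` label is hypothetical, and the counting is a deliberately simple stand-in; it does not represent how any particular statistical product works.

```python
from collections import Counter

# Hypothetical headlines, invented for this illustration only.
EARLIER = [
    "enron expands gas distribution network",
    "enron energy trading grows",
    "enron signs gas pipeline deal",
]
LATER = [
    "enron executives face fraud charges",
    "enron fraud inquiry widens",
    "enron embezzlement trial opens",
]

# Hypothetical editorial rule: the name always maps to the same vocabulary term.
FIXED_RULE = {"enron": "Enron Corporation"}


def cooccurring_terms(documents, target):
    """Count the words that appear in the same document as the target word."""
    counts = Counter()
    for document in documents:
        words = set(document.split())
        if target in words:
            counts.update(words - {target})
    return counts


def top_term(documents, target):
    """Return the single most frequent co-occurring word."""
    return cooccurring_terms(documents, target).most_common(1)[0][0]


EARLY_TOP = top_term(EARLIER, "enron")
LATE_TOP = top_term(LATER, "enron")
print(EARLY_TOP, LATE_TOP, FIXED_RULE["enron"])
```

The strongest statistical association moves from "gas" to "fraud" as the news changes, while the rule keeps assigning the same term in both periods.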

*Can I purchase or augment an existing thesaurus?

  • Yes. Data Harmony offers 40 Knowledge Domains, including ready-made thesauri with associated rule bases covering a variety of topics.
  • With our experience, we can construct a thesaurus for your specific needs.
  • Time required varies with the topic; estimates provided reflect an average 6000 term thesaurus.
  • Time investment: 4 months
  • Cost investment: $32,000 including software