September 6, 2010 – There are two main approaches to automatic metadata generation/extraction and, within those, many variations.

Statistical
Generally speaking, the statistical approaches include Bayesian methods, vector models, neural nets, automatic clustering, and the like. These methods work on the principle that if two words occur together frequently in a large set of data, then those words are related conceptually.
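
To make that co-occurrence principle concrete, here is a minimal Python sketch (my own illustration, not any vendor's algorithm) that counts how often pairs of words appear together across a tiny toy corpus; pairs that co-occur in many documents are treated as candidates for conceptual relatedness.

```python
from collections import Counter
from itertools import combinations

# Toy corpus; a real system would process thousands of documents.
docs = [
    "metadata extraction uses a taxonomy and a thesaurus",
    "the thesaurus and the taxonomy guide automatic indexing",
    "statistical indexing counts word co-occurrence in documents",
]

pair_counts = Counter()
for doc in docs:
    words = set(doc.lower().split())              # unique words per document
    pair_counts.update(combinations(sorted(words), 2))

# Word pairs that appear together in many documents are assumed to be related.
for (w1, w2), n in pair_counts.most_common(5):
    print(f"{w1} + {w2}: co-occur in {n} document(s)")
```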

Statistical methods depend on the algorithms that the developer has come up with. Some vendors lock down the numerical recipes and provide few or no user controls. These systems usually require “training”. The way they do this is to take a list of terms, often a thesaurus, taxonomy, or authority file, along with a corpus of text. The system manager processes the inputs, with spot checking and often manual intervention by a human subject matter expert. Once the system has been trained, test queries are run to verify that it performs as desired on fresh content.
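
As a hedged illustration of that training workflow (a toy sketch under assumed inputs, not a description of any specific product), the snippet below “trains” a single term from a few vetted example documents by collecting the words that accompany it, then suggests the term for fresh content only when enough of that evidence appears.

```python
from collections import Counter

def train_term(term, vetted_docs, top_n=10):
    """Collect the words that most often accompany a term in vetted documents."""
    evidence = Counter()
    for doc in vetted_docs:
        words = doc.lower().split()
        if term in words:
            evidence.update(w for w in words if w != term)
    return {w for w, _ in evidence.most_common(top_n)}

def suggest(term, evidence, doc, threshold=2):
    """Suggest the term when enough trained evidence words appear in the document."""
    return len(evidence & set(doc.lower().split())) >= threshold

# Hypothetical vetted corpus in which "indexing" is used in the intended sense.
vetted = [
    "automatic indexing assigns thesaurus terms to scholarly documents",
    "indexing accuracy depends on the taxonomy and the training documents",
]
evidence = train_term("indexing", vetted)

# Verify on fresh content, as a subject matter expert would spot-check.
print(suggest("indexing", evidence, "the taxonomy helps assign terms to new documents"))
```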

There are drawbacks in production:

  1. In order to train the set, you need to find or create a suitable corpus in which a term is used “correctly”. This is expensive and time consuming, and the cost is not broadly shared before you buy. It takes us about 300 articles to find 100 in which a term is used properly for training, so expect to spend at least one hour per term reviewing the documents to find the right usage.
  2. When you add new terms to the taxonomy, the meaning and use of existing terms change. You need to reset the vectors, or the statistical values will drift. That means the set needs to be retrained. (You will also need to go back to the vendor to retrain the sets.) Language drift is a normal effect of human communication.
  3. Statistical systems typically return accuracy levels of 40 to 60 percent. Accuracy can be improved with better training or with rules, which are discussed in the next section of this post. To avoid this extra step, some statistical systems use a form of relevance ranking based on a confidence factor. The relevance “score” and the resulting ranking are a way of getting around the accuracy measures of precision and recall measured against a vetted set produced by human indexers (a sketch of those measures follows this list).
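
For readers who want those precision and recall measures spelled out, here is a small Python sketch that scores a system’s suggested terms against a vetted set from human indexers; the term lists are invented purely for illustration.

```python
def precision_recall(system_terms, vetted_terms):
    """Precision and recall of suggested index terms against a human-vetted set."""
    system, vetted = set(system_terms), set(vetted_terms)
    hits = system & vetted
    precision = len(hits) / len(system) if system else 0.0
    recall = len(hits) / len(vetted) if vetted else 0.0
    return precision, recall

# Hypothetical terms for a single document.
system_terms = ["taxonomy", "indexing", "ontology", "search engines"]
vetted_terms = ["taxonomy", "indexing", "thesauri", "metadata", "search engines"]

p, r = precision_recall(system_terms, vetted_terms)
print(f"precision = {p:.2f}, recall = {r:.2f}")  # precision = 0.75, recall = 0.60
```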

The US government, particularly the intelligence community, has been enthralled with statistical systems for years. The government has funded many types of systems, and none of them work particularly well. Early on, In-Q-Tel learned that Stratify (now part of Iron Mountain) required significant human input. Digital Reasoning’s method is an automated process that “discovers” concepts, bound phrases, and entities. In my opinion, neither Stratify nor Digital Reasoning delivers a slam dunk.

One challenge for users is that important information may be missed, and there can be latency in the system while adjustments are made to terms and vectors. The bad guys know how to “game” such systems, which is really very easy, and I think this has led to massive intelligence failures.

Vendors in this space have been heavily financed by the government. The statistical approach has also become the pet of some university researchers, who can earn a PhD and then start their own data mining business or get hired by a government contractor and benefit from that “heavy financing”. This is not the Holy Grail of data mining. The Holy Grail would be a system that tells a specific user what he or she needs to know; end users need predictive information access.

There is a lot of overlap between these systems and the search vendors that depend on the same kinds of processing. Autoindexing vendors using statistics include Nstein, Convera, Coveo, Clearview, ClearForest, Just Systems, TEMIS, and Calais, which is offered as a web service. Other search vendors using statistics include Google and Autonomy.

Rule-Based
The second approach is rule-based. These systems also start with a list of terms; for example, a thesaurus, taxonomy, or authority file. A system administrator builds rules of two types: [a] simple (match-and-identify rules: if the term appears, use the term; if a synonym appears, use the term) and [b] complex (rules that include conditions beyond the initial text match). The complexity of the rules can vary considerably. Access Innovations software has eight types of conditions that can be included in complex rules.
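
To show the difference between the two rule types, here is a generic Python sketch (not the actual Data Harmony rule syntax) with one simple match-and-identify rule and one complex rule whose extra condition, checking for automotive words near “jaguar”, is invented for illustration.

```python
import re

# Simple rules: if the term or one of its synonyms appears, use the term.
simple_rules = {
    "automobiles": ["automobile", "automobiles", "car", "cars"],
}

# Complex rule: a match plus an extra condition on the surrounding text.
# "jaguar" maps to "automobiles" only if an automotive word is nearby,
# otherwise to "animals" (an invented condition for illustration).
def complex_rule_jaguar(text):
    if re.search(r"\bjaguar\b", text, re.I):
        if re.search(r"\b(engine|sedan|dealer|car)\b", text, re.I):
            return "automobiles"
        return "animals"
    return None

def index(text):
    terms = set()
    lowered = text.lower()
    for term, triggers in simple_rules.items():
        if any(re.search(rf"\b{t}\b", lowered) for t in triggers):
            terms.add(term)
    hit = complex_rule_jaguar(text)
    if hit:
        terms.add(hit)
    return terms

print(index("The jaguar sedan has a new engine"))  # {'automobiles'}
print(index("The jaguar hunts at night"))          # {'animals'}
```

In production, rules like these would be expressed in the vendor’s own rule language rather than ad hoc code, but the logic, a term match plus additional conditions, is the same.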

There is some natural language processing underneath the system, but only the rules layer can be augmented by users. There is no training set, but rules must be built. In our case, we find that about 80 percent of the rules are simple and 20 percent need to be complex. With simple rules only we get about 60 percent accuracy; with complex rules we get about 85 to 95 percent accuracy.

Testing the rules against the data can be called training the rules: we take a test batch of data, run the rules, and have a human review the results. We then add complex rules for those terms where the “noise” and “misses” are high, as sketched below. Noise is created when the computer suggests an indexing term that a human would not choose; misses are terms the human would suggest but the system did not.
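
A minimal sketch of that review step (the batch data and threshold are invented): given the system’s suggestions and the human indexer’s terms for each document in a test batch, tally noise and misses per term and flag the terms that most need complex rules.

```python
from collections import Counter

def triage(batch, flag_threshold=2):
    """Tally noise (system-only) and misses (human-only) per term over a test batch."""
    noise, misses = Counter(), Counter()
    for system_terms, human_terms in batch:
        system, human = set(system_terms), set(human_terms)
        noise.update(system - human)    # suggested by the system, rejected by the human
        misses.update(human - system)   # chosen by the human, missed by the system
    flagged = {t for t in set(noise) | set(misses)
               if noise[t] + misses[t] >= flag_threshold}
    return noise, misses, flagged

# Hypothetical test batch: (system suggestions, human indexing) per document.
batch = [
    (["taxonomy", "search"], ["taxonomy", "metadata"]),
    (["search"], ["metadata", "indexing"]),
    (["indexing", "search"], ["indexing"]),
]

noise, misses, flagged = triage(batch)
print("noise:", dict(noise))
print("misses:", dict(misses))
print("needs complex rules:", flagged)
```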

The differences among vendors in this space come down to how long it takes to build a rule and whether the vendor has a thesaurus/taxonomy tool. That is the real cost of implementation.

The problem with rules is that humans have to create them. Google has automated systems that generate changes to rules, so once some rules exist, humans can move on to other things. This is a key advantage for Google, but one it has not yet leveraged in a way that generates revenue. Endeca, at its foundation, is about human indexing and rules.

Autoindexing vendors using rules include Data Harmony MAI and MAIstro, Teragram, SmartLogic, and Silverchair. Search vendors using rules or faceted search include Endeca and possibly Exalead, which is similar to Google and is a hybrid system. The startup Perfect Search Corporation uses a system that works as an accelerator for relational database queries; other functions are absent. Silverchair is looking for automated solutions but still relies on humans; it is less a key technology player than a service.

Access Innovations is one of a very small number of companies able to help its clients generate ANSI/ISO/W3C-compliant taxonomies. By focusing on making information findable, we produce knowledge organization that works.

Margie Hlava
President, Access Innovations