People often ask us how much time it will take to manage a rule base with Data Harmony software. We reply with specific numbers from customer experience and tell them to expect a few hours per month of editorial time to maintain both the thesaurus and the rule base. One customer of ours, the American Institute of Physics, found that maintaining their thesaurus and rule base takes less than 15 hours per month for a throughput of 2000 articles per week. Another customer, The Weather Channel, manages breaking news all day long with four hours per month of maintenance. It takes the editorial team just a few hours per month to keep up with the changing trends and events within their field and transfer those into the organizational knowledge base represented by the M.A.I.™ rule base. This is a small investment that gives the organization the highest level of accuracy in coding (usually well over 90% hits without human intervention). It also supports analysis of trends in the business, the creation of author profiles, semantic fingerprints of the entire organizational holdings, and extraction of real meaning from all the data. Other customers, such as IEEE and the US GAO, find the accuracy of their Data Harmony software implementations so high that they now only sample the data periodically to glean new terms and trends. They do not see the need to review every single item.
The real question, though, should be a matter of control. If a rule-based solution maintained by the editorial staff is the approach taken, then full control remains with the editorial department. If a programmatic learning system (the seductive call of the purely automatic system) is the choice, then oversight either remains with the vendor or moves to the IT (information technology) department. The lower accuracy of the indexing returns (usually in the 60% range) means the editorial department spends much more time producing the taxonomy-tagged items. The time that would have been spent improving the knowledge base is instead spent in production, processing records, because of the lower accuracy levels.
Here’s an example: let’s assume 1000 articles per month. Using 90% accuracy versus 60% accuracy, how much extra production time is involved? Let’s also suppose, for easy calculations, that there are 10 terms per article. If our rule base indexing is 90% accurate, then only one term per article will need to be reviewed, researched, and replaced or discarded. If alternative indexing methods produce 60% accuracy, then there are four terms per article to research, replace, or discard. The time to research a term and decide on its disposition is conservatively two minutes. So two minutes per term at one term per article, across 1000 articles, is 2000 minutes, or just 33.3 hours per month. But if four terms per article (60% accuracy) need reviewing, then 8000 minutes, or 133.3 editorial hours per month, are needed: four times the effort. Moreover, the rule base improves over time with this small editorial input, so the maintenance time continues to decrease.
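The arithmetic in this example can be sketched in a few lines of Python. This is only an illustration of the calculation above, not part of Data Harmony; the function name `review_hours` and its parameters are hypothetical, and the inputs are the assumed figures from the example (1000 articles per month, 10 terms per article, two minutes per term).

```python
def review_hours(articles_per_month, terms_per_article, accuracy, minutes_per_term=2.0):
    """Estimate editorial hours per month spent reviewing incorrect terms.

    Hypothetical helper: 'accuracy' is the fraction of terms indexed
    correctly without human intervention (0.90 for 90%).
    """
    # Terms per article that still need review, research, and disposition.
    wrong_terms_per_article = terms_per_article * (1.0 - accuracy)
    total_minutes = articles_per_month * wrong_terms_per_article * minutes_per_term
    return total_minutes / 60.0


# Assumed figures taken from the example in the text.
rule_based = review_hours(1000, 10, 0.90)
statistical = review_hours(1000, 10, 0.60)

print(f"Rule-based (90% accurate):  {rule_based:.1f} hours per month")
print(f"Statistical (60% accurate): {statistical:.1f} hours per month")
print(f"Ratio: {statistical / rule_based:.1f}x")
```

Changing the article volume, the number of terms per article, or the minutes per term lets an organization plug in its own numbers; the four-to-one ratio holds for any volume as long as the two accuracy figures stay at 90% and 60%.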
A statistical approach can appear to be a gift on a silver platter, but beware: such an approach means more time spent on production, less on building a knowledge base, lower accuracy, higher throughput costs, and no chance to learn about the data through semantic fingerprinting. To make matters even more frustrating, you have little control over the system. It has to be improved and worked on by the vendor or the IT department. New terms require a full revamping of the system each time, resulting in costly delays, rather than the real-time, instant updates that a system based on Java object-oriented programming allows. As a result, the taxonomy is not responsive to the organization’s data.
It is tempting to think that content can be classified without a vetted, properly applied taxonomy, or that the taxonomy only provides a convenient file folder naming convention. Unfortunately, the cost of making that choice is high. When you use a statistical system, the accuracy is lower, the throughput is slower, and the clerical side of the indexing process grows. In addition, control no longer rests with the editorial department but shifts to IT and the vendor. The power dynamic of the choice is clear: IT versus editorial. Who do you want to be in control of your indexing?
Marjorie M.K. Hlava
President, Access Innovations