Semantic enhancement extends beyond journal article indexing, though the central purpose remains the same: enabling users to easily find all the relevant articles (your assets) when searching. Now, in addition to articles, semantic “fingerprinting” is used to identify and cluster ancillary published resources, media, events, authors, members or subscribers, and industry experts.
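One simple way to picture a semantic fingerprint is as the set of thesaurus concepts assigned to an item, with overlapping sets suggesting candidates for clustering. The sketch below is purely illustrative; the concept names and the similarity measure are assumptions, not any particular system’s method.

    # A minimal sketch: a semantic "fingerprint" as the set of thesaurus
    # concepts assigned to an asset, with set overlap (Jaccard similarity)
    # used to suggest clusters. All names here are illustrative only.

    def jaccard(a: set, b: set) -> float:
        """Overlap between two concept sets: 0 = disjoint, 1 = identical."""
        return len(a & b) / len(a | b) if (a | b) else 0.0

    article = {"Machine-aided indexing", "Thesauri", "Precision"}
    expert  = {"Thesauri", "Taxonomy construction", "Machine-aided indexing"}
    event   = {"Open access publishing", "Conferences"}

    print(jaccard(article, expert))  # higher overlap: cluster together
    print(jaccard(article, event))   # no overlap: keep separate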
The system you choose to enhance the value of your assets, and the people behind it, is extraordinarily important.
It starts with a profile of your electronic collection, and it may include a profile of your organization as well. As you choose the concepts that represent the areas of research today and in the past, the ideas of your most articulate representatives, and the emerging methods and technologies, you assemble a picture of the overall effort. This can be done with a thesaurus: an organized list of terms representing those concepts (a taxonomy) enhanced with relationship links between terms, such as synonyms, related terms, web references, and scope notes. The profile illustrates the nature of the intellectual effort being expended and, equally important, the shape of the organizational knowledge that is your key asset.
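A thesaurus entry of this kind can be represented quite simply in software. The sketch below shows one term with the relationship links just described; the field names, example values, and placeholder URL are assumptions made for illustration, not a standard interchange format.

    # Illustrative sketch of a single thesaurus entry with its relationship links.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class ThesaurusTerm:
        preferred: str                                        # preferred label for the concept
        synonyms: List[str] = field(default_factory=list)     # non-preferred ("use for") terms
        related: List[str] = field(default_factory=list)      # related terms
        web_refs: List[str] = field(default_factory=list)     # supporting web references
        scope_note: str = ""                                  # guidance on when to apply the term

    term = ThesaurusTerm(
        preferred="Automated indexing",
        synonyms=["Machine-aided indexing"],
        related=["Thesauri", "Controlled vocabularies"],
        web_refs=["https://example.org/automated-indexing"],  # placeholder URL
        scope_note="Use for software-assisted assignment of subject terms.",
    )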
We’d like to convince you that human intelligence is still the most powerful engine driving the development and maintenance of this lexicographic profile. Technology tools help with content mining, frequency analyses, and other measures valuable to a taxonomist, but the organization, the expression of concepts, and the building of relationships are still best done by humans.
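As one example of the kind of tool support meant here, a frequency analysis can surface candidate terms from a corpus for a taxonomist to review and organize. The sketch below is deliberately simple and is not a production text-mining pipeline.

    # A small sketch of frequency analysis in support of a taxonomist:
    # count words across a corpus and surface the most frequent ones
    # as *candidates* for human review.
    from collections import Counter
    import re

    corpus = [
        "Semantic enhancement improves article discovery.",
        "Thesaurus terms support semantic search and discovery.",
    ]

    counts = Counter()
    for doc in corpus:
        counts.update(re.findall(r"[a-z]+", doc.lower()))

    print(counts.most_common(5))  # candidates only; a human decides what belongs in the thesaurus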
Similarly, the application of the thesaurus is best done by humans. Because of the volume of content items being created every day, it may not be possible to have human indexers review each of them. Our automated systems can achieve perhaps 90% “accuracy” (i.e., matching what a human indexer would choose), so high-value content is still indexed by humans, much more efficiently than in the past, but still by humans. The balance of the content requires human contributions to inform the algorithm in actual natural (human) language. Fully enabled, the automated system produces impressive precision in identifying the “aboutness” of a piece of content.
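The “accuracy” referred to here is simply the degree to which the automatically assigned terms match what a human indexer chose for the same item. The toy arithmetic below illustrates that comparison with made-up terms; it does not reproduce the 90% figure quoted above.

    # Toy illustration of the accuracy comparison: the share of the human
    # indexer's terms that the automated system also assigned.
    human_terms   = {"Thesauri", "Automated indexing", "Precision"}
    machine_terms = {"Thesauri", "Automated indexing", "Recall"}

    agreement = len(human_terms & machine_terms) / len(human_terms)
    print(f"{agreement:.0%}")  # 67% agreement for this made-up example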
And how can a system achieve accuracy and consistency? Our approach is to reflect the reasoning process of humans, using a set of rules. Our rule base is simple to enhance and simple to maintain, and, like the thesaurus, flexible enough to accommodate new terminology in a discipline as it evolves. About 80% of the rules work well just as they are initially (automatically) created. The other 20% achieve better precision when “touched” by a human who adds conditions to limit, broaden, or disambiguate the use of the term that triggers the rule.
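To show what a term-triggered rule with human-added conditions might look like, here is a hypothetical sketch. The rule function and its disambiguation conditions are invented for illustration and are not the actual rule syntax described above.

    # Hypothetical sketch: a rule triggered by the term "mercury", with
    # human-added conditions that disambiguate between two concepts.

    def rule_mercury(text: str) -> list:
        """Trigger on 'mercury', then use nearby words to pick the right concept."""
        tags = []
        lowered = text.lower()
        if "mercury" in lowered:
            if any(w in lowered for w in ("planet", "orbit", "probe")):
                tags.append("Mercury (planet)")
            elif any(w in lowered for w in ("toxic", "thermometer", "element")):
                tags.append("Mercury (element)")
            # If no condition matches, assign nothing and leave the item for human review.
        return tags

    print(rule_mercury("The probe entered orbit around Mercury."))
    print(rule_mercury("Mercury exposure from a broken thermometer."))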
Mathematical analyses identify statistical characteristics of large numbers of items and are quite useful in making business decisions. But making decisions about meaning? For many decades now, researchers have been working to find a way to analyze natural language that would come anywhere near the precision provided by human indexers and abstractors. Look at IBM’s supercomputer “Watson” and the years and resources invested to produce it. It still misses the simple (to us) relationships between words and context that humans understand intuitively.
Mary Garcia, Systems Analyst
Access Innovations