Large Data Sets and Ever Changing Terminology

September 12, 2011  
Posted in Access Insights, Featured, indexing

Indexing enables accurate, consistent retrieval across the full depth and breadth of the collection. This does not mean that the statistics-based systems the government loves so much will go away, but the people who rely on them are learning to embrace the addition of taxonomy terms as indexing.

To answer your question, relevant metadata, tagging, normalization of entity references and similar indexing functions simply make it easier for a person to locate what’s needed. Easy to say and very hard to do.

Search is like having to stand in a long line waiting to order a cold drink on a hot day. So there will always be dissatisfaction because “search” stands between you and what you want. You want the drink but hate the line. That said, I think the reason controlled indexing (taxonomy or thesaurus) is so popular compared to free-ranging keywords is that it gives users control. It makes moving through the line efficient. You know how long the wait is and what terms you need to achieve the result.

Indexing, which I define as the tagging of records with controlled vocabularies, is not new. Indexing has been around since before Cutter and Dewey. My hunch is that librarians in Alexandria put tags on scrolls thousands of years ago.

What is different is that it is now widely recognized that search is better with the addition of controlled vocabularies. The use of classification systems, subject headings, thesauri and authority files certainly has been around for a long time.

When we were just searching the abstract or a summary, the need was not as great because those content objects are often tightly written. The hard sciences went online first and STM [scientific, technical, medical] content is more likely to use the same terms worldwide for the same things.

As social sciences, business information, popular literature and especially full text have come online, search has become overwhelming, inaccurate, and frustrating. I know that you have reported that more than half the users of an enterprise search system are dissatisfied with that system. I hear complaints about people struggling with Bing and Google.

What does one do with large data sets and ever changing terminology? We can handle “big data” and changing data. We designed the Data Harmony M.A.I. to handle large volumes of digital content. In fact, that is part of the reason for it. We needed something to help us with the editorial process of determining what a content object is about in a “big data” context. M.A.I. stands for machine assisted indexing. That sounds old fashioned now, so we just use the initials.
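To make the general idea of machine assisted indexing concrete, here is a toy sketch in Python. It is not how Data Harmony M.A.I. works; the vocabulary, the trigger phrases and the `suggest_terms` function are all hypothetical, invented for illustration. It only shows the basic pattern: software proposes controlled terms for a content object, and an editor can then accept or reject them.

```python
import re

# Hypothetical controlled vocabulary (an assumption for this sketch):
# preferred term -> phrases that suggest the term applies.
VOCABULARY = {
    "Information retrieval": ["information retrieval", "enterprise search", "search engine"],
    "Controlled vocabularies": ["taxonomy", "thesaurus", "subject headings"],
}

def suggest_terms(text, vocabulary=VOCABULARY):
    """Return the preferred terms whose trigger phrases occur in the text."""
    lowered = text.lower()
    suggestions = []
    for preferred, phrases in vocabulary.items():
        # Match whole words or phrases only, ignoring case.
        if any(re.search(r"\b" + re.escape(p) + r"\b", lowered) for p in phrases):
            suggestions.append(preferred)
    return sorted(suggestions)

# Example: an editor would review these suggestions before tagging the record.
print(suggest_terms("Our thesaurus improves enterprise search."))
```

A real system would need far richer rules than simple phrase matching, which is exactly why the editorial review step matters.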

Concepts covered well today will always change over time. Therefore, we needed to deal with changes to a knowledge domain. When a domain remains static, it is dead. Therefore, we must build the ability to handle change into the system and workflow. We have automated most of the processes so that volume is not the problem. The hard problem is finding. Searching has many methods. Finding is very hard.

A typical client for us comes with a set of documents (structured, unstructured or both) and a need to structure them for storage and retrieval. Our clients normally want a neutral format that is not tied to a particular software or hardware platform. Like most people who need to access information efficiently, the clients would like to find, reuse and often repackage parts of the information.

Our clients usually have three user communities: The people who create the documents, those who want to use them and those who need to manage them. Our clients’ needs are quite varied. Some have the further complications of regulatory bodies or other legal requirements placed on the collections.

Once the records are tagged with the current terms, the records are stored in a database. Most database system administrators are not enthusiastic when the data set has to be rebuilt due to term changes.

Although we have found some partners for whom term change is not a problem, we have many clients who have to keep pace with change. So we concentrate on the user interface side. Recall the interface work I noted, which can include synonyms and other kinds of terms.

If you cannot rebuild the search index, you can build the newer terms into the interface while keeping the older terms in use as well. Our approach provides additional flexibility. We keep track of those changes in the Thesaurus Master. Then whether the user is using Microsoft SharePoint, Lucene, Autonomy, Endeca, Perfect Search, or another retrieval system, he or she can still find the current terms. We harvest those terms in a number of ways to ensure currency.
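The idea of handling term change at the interface rather than in the index can be sketched in a few lines of Python. This is an illustration only, not the Thesaurus Master or any retrieval system's API: the `TERM_HISTORY` table, its sample entries and the `expand_query` function are hypothetical. The assumption is that records in the index may still carry older terms, so the interface sends the current term and its older forms together.

```python
# Hypothetical term history (made-up entries for this sketch):
# older or variant term -> current preferred term.
TERM_HISTORY = {
    "cellular phones": "mobile phones",
    "wireless telephones": "mobile phones",
    "electronic mail": "email",
}

def expand_query(term, history=TERM_HISTORY):
    """Return the current term followed by every older term that maps to it."""
    key = term.strip().lower()
    # If the user typed an older term, switch to its current form.
    current = history.get(key, key)
    # Collect older terms so records tagged before the change still match.
    older = sorted(old for old, new in history.items() if new == current)
    return [current] + older

# The expanded list can be passed to whatever search system sits underneath,
# for example as an OR query, without rebuilding the index.
print(expand_query("Cellular phones"))
```

Because the mapping lives outside the index, adding a new term means adding one entry to the table, and the database administrator never has to rebuild the data set.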

Marjorie M.K. Hlava
President, Access Innovations