Our regular blogger is out of commission while recovering from surgery. Please enjoy this classic blog post. Margie Hlava – founder, former President, and current Chief Science Officer of Access Innovations – wrote this post in response to a podcast she was listening to at the time. Corrections are noted with brackets.
Field by Any Other Name
July 26, 2010 — In one of the chatty podcasts about software engineering, one of the commentators pointed out that metadata was one of the most critical aspects of information governance. We agree. The chatter continued and another podcast voice mentioned that there was a relationship between database field names and metadata.
I realized that the folks on this free podcast were blowing smoke. The words were recognizable, but they were jumbled. To a casual listener, the chipper voices sounded convincing.
In a database, the value of a particular cell is important; for example, in a family checking account, the entry $3,216.66 has context and meaning. The number is in a check book. The check book has an account number and bank number. I write my name and the date I started the check record in my checking account book. I also include the number of the check, the date on which I wrote the check, and I record the amount of the check. The checking account is an example of structured data with entries tagged with field names. I can take a piece of paper and map out the row numbers assigning each the number of the corresponding check. The column names are field names and identify what bit of information goes where; for example, date gets the date, payee gets the name of the outfit to which I made out the check.
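To make the checkbook example concrete, here is a minimal Python sketch of a check register as structured data. The class name `CheckEntry`, its field names, and the payees are hypothetical choices for illustration; only the $3,216.66 amount comes from the example above.

```python
from dataclasses import dataclass, fields
from datetime import date
from decimal import Decimal


@dataclass
class CheckEntry:
    """One row of the check register; each attribute is a field name."""
    check_number: int
    date_written: date
    payee: str
    amount: Decimal


# Two hypothetical rows. The field names give each value its context.
register = [
    CheckEntry(101, date(2010, 7, 1), "Example Mortgage Co.", Decimal("3216.66")),
    CheckEntry(102, date(2010, 7, 3), "Example Utility", Decimal("84.10")),
]

# The column headings of the paper register are the field names.
field_names = [f.name for f in fields(CheckEntry)]

# Because every amount sits in a known field, totaling is trivial.
total = sum((entry.amount for entry in register), Decimal("0"))
```

The bare number 3216.66 means little on its own. Stored under `amount`, next to `payee` and `date_written`, it can be sorted, summed, and searched without any guesswork.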
Unstructured information like an email is less rigidly organized. The email address and other header information and the date value are easily identified field names. The text in the body of the message is a different kettle of fish. I also get email with attachments, and there can be a number of different file formats.
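The split between the tidy header and the free-form body can be seen with Python's standard `email` module. This is only a sketch: the message text and addresses below are made up for illustration.

```python
from email import message_from_string

# A made-up message: header lines, a blank line, then the body.
raw = """\
From: alice@example.com
To: bob@example.com
Date: Mon, 26 Jul 2010 09:00:00 +0000
Subject: Lunch

Are we still on for lunch? The report is attached.
"""

msg = message_from_string(raw)

# The header parses cleanly into field name / value pairs.
header_fields = dict(msg.items())

# The body comes back as one undifferentiated string of text.
body = msg.get_payload()
```

The parser hands back `From`, `To`, `Date`, and `Subject` as ready-made fields. Nothing in the body says what the message is about; that is the part that still needs indexing.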
The transformation of structured and unstructured information into intelligently tagged information objects is a complicated job. Today most systems, including the recently released IBM OmniFind 9.1 system [Omnifind was withdrawn in April 2011], suggest that any information can be ingested, processed, and made searchable. The implication is an extension of the Google approach of making information retrieval a non-issue.
The impression I get is that search is no big deal.
The problem, of course, has many different sub-problems. Let’s look at three.
First, there is the problem of terminology. Exactly what do the terms indexing, metadata, and tagging mean? How do these terms relate to the old-fashioned notion of field names in a database? Without nailing down these terms, it is unlikely that those involved in tackling information governance, or what I call an editorial policy, will make significant progress. The terminology is sufficiently opaque to make misunderstanding almost a certainty.
Second, there is the problem of context. Processing content for words and phrases is a start. The challenge is to make certain that the context of the data item or the information is recognized, assigned to the article or record, and used consistently throughout the content processing system. Messy taxonomies mean that findability is compromised and precision and recall get stuffed into the dustbin. How can a high-performance content processing system identify and index context? That’s an important question, and we think our approach is one that works quite well. Most systems struggle with context. In fact, Google’s purchase of Metaweb can be viewed as a very expensive way to buy some context technology that works better than Google’s homegrown systems.
Third, there is the problem of normalization. The old-fashioned idea of field names has been gussied up as well-formed XML tags. No problem with this, but it is important to recognize that disparate information objects can be tough to normalize. In fact, content transformation can chew up as much as one-third of an information technology budget. Normalization requires a taxonomy crafted to the highest standards and significant effort.
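A toy Python sketch shows where the normalization effort goes. The mapping table `FIELD_MAP`, the function `normalize`, and both sample records are hypothetical; a real project would take its canonical names from a controlled vocabulary rather than a hand-typed dictionary.

```python
# Hypothetical mapping from source field names to one canonical name.
FIELD_MAP = {
    "payee": "payee", "paid_to": "payee", "recipient": "payee",
    "date": "date", "date_written": "date", "dt": "date",
    "amount": "amount", "amt": "amount", "total": "amount",
}


def normalize(record):
    """Rename known fields to canonical names; set aside the rest."""
    normalized = {}
    unmapped = {}
    for name, value in record.items():
        canonical = FIELD_MAP.get(name.strip().lower())
        if canonical is None:
            unmapped[name] = value  # needs a human decision
        else:
            normalized[canonical] = value
    return normalized, unmapped


# Two made-up sources describing the same kind of record differently.
bank_export = {"Paid_To": "Example Utility", "DT": "2010-07-03", "Amt": "84.10"}
ledger_row = {"payee": "Example Utility", "date": "2010-07-03",
              "amount": "84.10", "memo": "July bill"}

bank_norm, bank_left = normalize(bank_export)
ledger_norm, ledger_left = normalize(ledger_row)
```

After renaming, the two records line up field for field. The leftover `memo` field is the expensive part: somebody has to decide where it belongs, and that decision has to be made for every source.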
Access Innovations is one of a very small number of companies able to help its clients generate ANSI-compliant taxonomies. We avoid jargon. We focus on making information findable. We produce knowledge organization that works. No excuses.
Margie Hlava,
President [formerly President, current Chief Science Officer], Access Innovations