July 26, 2010 — In one of the chatty podcasts about software engineering, one of the commentators pointed out that metadata was one of the most critical aspects of information governance. We agree. The chatter continued and another podcast voice mentioned that there was a relationship between database field names and metadata.
I realized that the folks on this free podcast were blowing smoke. The words were recognizable, but they were jumbled. To a casual listener, the chipper voices sounded convincing.
In a database, the value of a particular cell is important; for example, in a family checking account, the entry “$3,216.66” has context and meaning. The number is in a check book. The check book has an account number and bank number. I write my name and the date I started the check record in my checking account book. I also include the number of the check, the date on which I wrote the check, and I record the amount of the check. The checking account is an example of structured data with entries tagged with field names. I can take a piece of paper and map out the row numbers assigning each the number of the corresponding check. The column names are field names and identify what bit of information goes where; for example, date gets the date, payee gets the name of the outfit to which I made out the check.
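To make the check register concrete, here is a minimal sketch of one row expressed as a record with named fields. The class name CheckEntry, the field names, the check number, the date, and the payee are all hypothetical choices for illustration; only the amount comes from the example above.

```python
from dataclasses import dataclass, fields
from datetime import date
from decimal import Decimal


@dataclass
class CheckEntry:
    """One row of the check register; each attribute is a field name."""
    check_number: int   # identifies the row
    written_on: date    # date the check was written
    payee: str          # the outfit the check was made out to
    amount: Decimal     # Decimal avoids floating-point rounding on money


# Hypothetical sample row; only the amount is taken from the text.
register = [
    CheckEntry(101, date(2010, 7, 1), "Example Utility Co.", Decimal("3216.66")),
]

# The column headings of the paper register are simply the field names.
field_names = [f.name for f in fields(CheckEntry)]
print(field_names)
```

The point of the sketch is that the value 3216.66 means nothing on its own. It becomes information only because it sits in the amount field of a row that also names a payee and a date.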
Unstructured information like an email is less rigidly organized. The email address and other header information and the date value are easily identified field names. The text in the body of the message is a different kettle of fish. I also get email with attachments, and there can be a number of different file formats.
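The split between the tidy header and the untidy body can be seen with Python's standard email module. This is a sketch only; the message, the addresses, and the variable names are made up for illustration.

```python
from email import message_from_string

# A made-up message: headers first, a blank line, then free text.
raw = """From: alice@example.com
To: bob@example.com
Date: Mon, 26 Jul 2010 09:00:00 -0400
Subject: Invoice question

Hi Bob, can you confirm the amount on check 101? Thanks.
"""

msg = message_from_string(raw)

# The header behaves like a set of field names with values.
header_fields = dict(msg.items())

# The body is one undifferentiated string; nothing in it is tagged.
body = msg.get_payload()

print(sorted(header_fields))
print(body)
```

The parser hands back the sender, the date, and the subject as labeled values for free. Any tagging of the body, such as recognizing that "check 101" refers to a check number, has to come from additional processing.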
The transformation of structured and unstructured information into intelligently tagged information objects is a complicated job. Today most systems, including the recently released IBM OmniFind 9.1 system, suggest that any information can be ingested, processed, and made searchable. The implication is an extension of the Google approach of making information retrieval a non-issue.
The impression I get is that search is no big deal.
The problem, of course, has many different sub-problems. Let’s look at three.
First, there is the problem of terminology. Exactly what do the terms indexing, metadata, and tagging mean? How do these terms relate to the old-fashioned notion of field names in a database? Without nailing down these terms it is unlikely that those involved in tackling information governance or what I call an editorial policy will make significant progress. The terminology is sufficiently opaque to make misunderstanding almost a certainty.
Second, there is the problem of context. Processing content for words and phrases is a start. The challenge is to make certain that the context of the data item or the information is recognized, assigned to the article or record, and used consistently throughout the content processing system. Messy taxonomies mean that findability is compromised and precision and recall get stuffed into the dust bin. How can a high performance content processing system identify and index context? That’s an important question, and we think our approach is one that works quite well. Most systems struggle with context. In fact, Google’s purchase of Metaweb can be viewed as a very expensive way to buy some context technology that works better than Google’s home grown systems.
Third, there is the problem of normalization. The old-fashioned idea of field names has been gussied up as well-formed XML tags. No problem with this, but it is important to recognize that disparate information objects can be tough to normalize. In fact, content transformation can chew up as much as one third of an information technology budget. Normalization requires a taxonomy crafted to the highest standards and significant effort.
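A toy illustration of why normalization takes effort: two sources describe the same kind of record with different field names, and someone has to decide which names mean the same thing. The mapping table and function below are hypothetical and deliberately simple. They are not a description of any particular product's method, and a real project would draw the canonical names from a controlled vocabulary or taxonomy.

```python
# Hypothetical mapping from source field names to canonical field names.
FIELD_MAP = {
    "payee": "payee",
    "paid_to": "payee",
    "recipient": "payee",
    "date": "date",
    "written_on": "date",
    "amount": "amount",
    "amt": "amount",
}


def normalize(record):
    """Rename known fields to canonical names; set aside unknown ones."""
    normalized = {}
    unmapped = {}
    for key, value in record.items():
        canonical = FIELD_MAP.get(key.strip().lower())
        if canonical is None:
            unmapped[key] = value  # needs a human decision
        else:
            normalized[canonical] = value
    return normalized, unmapped


clean, leftover = normalize({"Paid_To": "Example Utility Co.", "amt": "3216.66", "memo": "July bill"})
print(clean)
print(leftover)
```

Even in this small case one field is left unmapped and waits for an editorial decision. Multiply that by every source and every format in an organization and the cost of content transformation becomes easier to understand.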
Access Innovations is one of a very small number of companies able to help its clients generate ANSI-compliant taxonomies. We avoid jargon. We focus on making information findable. We produce knowledge organization that works. No excuses.
Margie Hlava,
President, Access Innovations
Metadata and database fields–a fine example of everything old being new again. The early leaders in the young field of documentation had it right by breaking down documents into critical fields and relying on controlled and well-organized vocabularies to precisely identify information in the data. Search is a big deal. It’s frustrating and memorable when bad but rarely noticed when good. Good search results require a solid information infrastructure, including database fields or metadata–you choose the word–and a good taxonomy.
“content transformation can chew up as much as one third of an information technology budget”
It is not uncommon to find legacy unstructured content in an organization needing metadata tags to be useful. That is understandable.
But to continue to produce unstructured content when it is relatively easy to assist the creator in identifying critical metadata is unforgivable.