Taxonomies and Metadata

Let’s talk about metadata a little bit. This is to give you a broad overview so that you know how to build those taxonomies.

Metadata is data about data, which means it is really information about information. If you look at the metadata standards world, and especially the Dublin Core Metadata Initiative, it has taken flight again. It’s really interesting. Dublin Core dates back to the mid-1990s and has suddenly become relevant again.

Metadata describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage an information resource. That is the whole idea. There are many kinds of metadata, and the particular kind matters less than the fact that it is about the material you are working with. A simpler way to say it is that metadata is data that characterizes other data in a reflexive way. It may include descriptive information about the content, quality, and condition, or other characteristics of the data.

“Metadata” covers a multitude of sins at this point. So, if you remember our basic list from our discussion of markup languages, that is basic metadata.

If we look at keywords, subject headings, index terms, identifiers, taxonomy terms, controlled descriptors, controlled vocabulary, and so forth, we can see that they are a type of metadata. That’s why taxonomists are interested in metadata.

Keywords (aka subject headings, index terms, identifiers, or subject area) are one type of metadata. A bibliographic database record usually includes keyword information, as well as information regarding the author(s), title, language, and date of creation. So does a traditional library card catalog.

A bibliographic citation is metadata, and so is a library card.
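To make the idea concrete, here is a minimal sketch in Python of a bibliographic record treated as metadata. The class name, field names, and sample records are assumptions made up for illustration; they do not follow any particular cataloging standard.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BibliographicRecord:
    """Metadata describing a document; the document itself lives elsewhere."""
    title: str
    authors: List[str]
    language: str
    date_created: str  # kept as plain text, e.g. "2011-02"
    keywords: List[str] = field(default_factory=list)  # taxonomy terms go here


def records_with_keyword(records, term):
    """Return the records tagged with the given term (case-insensitive match)."""
    wanted = term.casefold()
    return [r for r in records if any(k.casefold() == wanted for k in r.keywords)]


# Hypothetical sample data, for illustration only.
sample_records = [
    BibliographicRecord(
        title="An Example Report on Cataloging",
        authors=["A. Author"],
        language="en",
        date_created="2011-02",
        keywords=["Metadata", "Cataloging"],
    ),
    BibliographicRecord(
        title="An Example Report on Shipping",
        authors=["B. Writer"],
        language="en",
        date_created="2010-11",
        keywords=["Logistics"],
    ),
]

matches = records_with_keyword(sample_records, "metadata")
print([r.title for r in matches])
```

The lookup never opens the documents themselves. It works only from the keywords in the records, which is the job a card catalog does.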

An HTML header can include metadata.

Not all web pages have metadata. A lot of websites don’t, because the people making the pages don’t fill in the headers. This has been one of the big problems for corporations in trying to exercise control over their intranets and being able to search their intranets. If you fill in the headers on your web pages, you can fill them in with your taxonomy terms. That’s where they would go.
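Here is a rough sketch of what filling in the header might look like, written as a small Python function that builds the head section of a page. The function name and the sample terms are hypothetical. The keywords meta tag is standard HTML, and the "DC.subject" name follows the common convention for embedding Dublin Core in HTML, though whether a given search engine uses either is an assumption you would need to check.

```python
from html import escape


def render_head_metadata(title, taxonomy_terms, description=""):
    """Build the <head> section of a page, carrying taxonomy terms as metadata."""
    lines = ["<head>", f"  <title>{escape(title)}</title>"]
    if description:
        lines.append(f'  <meta name="description" content="{escape(description)}">')
    if taxonomy_terms:
        # One comma-separated keywords tag holding all the terms.
        joined = ", ".join(taxonomy_terms)
        lines.append(f'  <meta name="keywords" content="{escape(joined)}">')
    for term in taxonomy_terms:
        # One DC.subject tag per term (Dublin Core in HTML convention).
        lines.append(f'  <meta name="DC.subject" content="{escape(term)}">')
    lines.append("</head>")
    return "\n".join(lines)


# Hypothetical page title and taxonomy terms, for illustration only.
print(render_head_metadata("Travel Policy", ["Human resources", "Travel expenses"]))
```

Because the terms are drawn from the taxonomy rather than typed freehand, every page on the intranet is tagged with the same vocabulary, and that is what makes the pages searchable as a set.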

Metadata standardization has been around for a while. Let’s look at some early metadata initiatives.

MARC, the Machine-Readable Cataloging format standard from the 1960s, was a metadata initiative spearheaded by the Library of Congress. AACR2, the Anglo-American Cataloguing Rules, 2nd edition (first published in 1978 and revised in 1988), was the style sheet for MARC records.

There are quite a few metadata initiatives nowadays. These are some of the more prominent ones:

You may already know about Dublin Core, and perhaps the indecs Content Model, which gave rise to ONIX.

ONIX, the ONline Information eXchange, is a set of XML standards that publishers use as the metadata for marketing and shipping their books and other publications. ONIX records tell you everything from how many of a certain book will fit into a box to what kind of display items will come with it. So, if it’s a new Harry Potter book display and it has some big cutout deal that goes with it, it is described using ONIX. ONIX records can also describe all of the CIPI codes and other things – serials, book/item identifier, contribution identifier and all of the additional information that Amazon or Ingram or any of those other guys need in order to ship a book to you or to each other. That is a huge standard and it was a huge effort.

The Text Encoding Initiative guidelines are also useful. And the list goes on. What’s important for us to realize is that there are a lot of different metadata initiatives. Dublin Core is one of them. I’ll discuss Dublin Core in another installment in this series.

Marjorie M.K. Hlava, President, Access Innovations

Note: The above posting is one of a series based on a presentation, The Theory of Knowledge, given at the Data Harmony Users Group meeting in February of 2011. The presentation covered the theory of knowledge as it relates to search and taxonomies.
