Taxonomies and Standards

We’ve discussed what it takes to make the components of a digital information model work. If you want it all to work together, one of the most important things is standards.

There are several controlled vocabulary standards, as well as networking protocols, that have an impact on taxonomy implementation. There are also standards having to do with markup and with metadata and data modeling that impact thesauri.

Most of the content standards are NISO/ISO standards. ISO is the International Organization for Standardization, based in Geneva, Switzerland. (“ISO” is not an acronym; it’s a short form based on the Greek word isos, meaning “equal.”) ISO has 163 member nations, and each nation has one vote on each standard. The standards world is very complex; it covers everything from the threads on a light bulb to the way that sockets work in the walls to how we make a taxonomy. To handle that range, ISO has a multitude of Technical Advisory Groups, or TAGs. The standards having to do with thesauri and terminologies are concentrated in TAG 46 and TAG 37. TAG 37 deals primarily with computer system standards for those, and TAG 46 deals with international library standards.

ANSI, the American National Standards Institute, votes on behalf of the United States at ISO. NISO, the National Information Standards Organization, is part of ANSI. NISO is one of the 30-odd maintenance agencies for standards in the United States; it handles the library and information standards. Standards identified with “Z39” are from NISO. ANSI/NISO Z39.19 (Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies) is the official thesaurus standard for the United States.

The organizational and control standards from ISO/NISO are among the ones that we look at. We also look at the storage, retrieval, and preservation standards associated with field formatting and tagging, because the thesaurus terms need to go into one or more fields in a database. If you are dealing with structured data, when you add metadata, you are adding structure to undifferentiated text. We want to put that data into a field, and we need to know which one. There are standards for that, and we want to know how they work so that we know where to put the data.
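
To make the point about fields concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the Record layout, the "subject" field name, and the tiny PREFERRED_TERMS list are all hypothetical and are not taken from any of the standards mentioned here. It shows candidate terms being normalized to their preferred form and stored in one designated field, rather than being left in the undifferentiated text.

```python
from dataclasses import dataclass, field
from typing import Iterable, List

# Hypothetical controlled vocabulary: entry term (lowercased) -> preferred term.
# A real thesaurus would be far larger and would carry relationships as well.
PREFERRED_TERMS = {
    "taxonomies": "Taxonomies",
    "controlled vocabularies": "Controlled vocabularies",
    "thesaurus": "Thesauri",
    "thesauri": "Thesauri",
}


@dataclass
class Record:
    """A database record; 'subject' is the assumed field for thesaurus terms."""
    title: str
    body: str
    subject: List[str] = field(default_factory=list)


def add_subject_terms(record: Record, candidates: Iterable[str]) -> Record:
    """Store the preferred form of each recognized candidate in record.subject."""
    for candidate in candidates:
        preferred = PREFERRED_TERMS.get(candidate.strip().lower())
        # Skip terms outside the vocabulary and avoid duplicate entries.
        if preferred is not None and preferred not in record.subject:
            record.subject.append(preferred)
    return record


if __name__ == "__main__":
    rec = Record(title="Taxonomies and Standards", body="...")
    add_subject_terms(rec, ["Thesaurus", "thesauri", "light bulbs"])
    print(rec.subject)
```

The design choice worth noticing is that the indexing terms live in their own field, separate from the body text, so a search system that knows the field convention can query them directly.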

There are also some classification standards and some publishing standards. In our case, we are particularly interested in the subject indexing ones. There used to be a standard for “How to index”, but it ran into trouble because back-of-the-book indexing and pre-coordinate indexing are not done the same way as indexing for online databases. Trying to wrap all of those into a single standard didn’t work. The poor guy who wrote the standard, and did a masterful job, was James Anderson out of Rutgers. The thing finally sank. It is not technically a standard, because people couldn’t agree, but it has been published as a NISO technical report, Guidelines for Indexes and Related Information Retrieval Devices, NISO TR02-1997. Someday the dust will settle and we’ll take it up again and make two standards and everyone will be happy.

Controlled vocabulary standards have been around at least since the 1960s in one form or another. In 1967, the Committee on Scientific and Technical Information (COSATI) of the Federal Council on Science and Technology developed TEST – the Thesaurus of Engineering and Scientific Terms – and they also published the Guidelines for the Development of Information Retrieval Thesauri, which is actually an introduction to TEST. At about the same time, the French wrote a thesaurus standard that is amazingly similar. So did the Germans. So did the Americans. In 1974, we came out with the first edition of our standard, based on the COSATI guidelines. Then the others came forth with standards as well.

In 1985 and 1986, ISO published ISO 2788 and ISO 5964, standards for monolingual thesauri and for multilingual thesauri, respectively. Later, between 2005 and 2008, the British Standards Institution (BSI) released the five parts of BS 8723, Structured vocabularies for information retrieval. Those began the ever-so-slow march toward a compatible and technologically updated ISO standard. You will notice that the most recent version of Z39.19 was published in 2005 as well. We had the British standard saying one thing and the U.S. standard saying the exact opposite, which was very frustrating. We wanted an international standard.

You know, the nice thing about standards is that there are so many to choose from.

I was part of the team that helped to make sure that the new standards don’t conflict with each other. We are still in that process as we go forward to ISO 25964, Thesauri and interoperability with other vocabularies. Part 1 (Thesauri for information retrieval) was published in August 2011, and Part 2 (Interoperability with other vocabularies) is still under review.

All of these are good standards. I think the U.S. one, ANSI/NISO Z39.19-2005, also known as Z39.19-2010 (NISO reaffirmed its validity in 2010), is the easiest to read. If you are going to stay up late and want a nice standard to read, then that would be the one I’d recommend. The BSI standards are a little harder to stay awake with, but they are very thorough.

ISO 25964-1 “Information and documentation — Thesauri and interoperability with other vocabularies — Part 1: Thesauri for information retrieval.”

As I mentioned, the new ISO standard is coming out in two parts. The first part is already published. The original intention was that it would be very much like the British standard. In fact, the author of Parts 1 and 2 of BS 8723, Stella Dextre Clarke, has taken over writing it. She is running Part 2 of the ISO standard through the gauntlet of the standards approval process. Parts 1 through 4 of the British standard have been completely rewritten and folded into Part 1 of the ISO standard. Part 2 of the ISO standard, which significantly expands on the British standard, covers interoperability. The coverage includes interoperability among standards but also with search systems. It is a really fine piece of work.

All of these are good references.

Marjorie M.K. Hlava
President, Access Innovations

Note: The above posting is one of a series based on a presentation, The Theory of Knowledge, given at the Data Harmony Users Group meeting in February of 2011. The presentation covered the theory of knowledge as it relates to search and taxonomies.
