There are several ways to create a taxonomy. One is to build it from the existing data, which is my preferred method. Another is to treat it as an intellectual outline of the discipline, thinking about what ought to be included. That is what the monks did many years ago in their cold towers on those mountainsides. Few of us have that luxury. Most of us will be dealing with a specific corpus of information.

You can also divide by discipline or by subject. In the case of discipline, you might reach an academic consensus on what the field studies. On the other hand, you might talk with an expert in a particular field, and to that expert it is the most important field: why would anybody ever study anything other than that field? So, you get a little skewing.

We also find that within any discipline, there are the traditional subjects and then there are the ones on the edge, which might involve a lot of peripheral fields and a lot of peripheral science. Not too long ago, I took ten years of a professional association’s publication data, the same ten years of Medline, and the same ten years of patent data, and I indexed each of them with three different taxonomies. I ended up with a nine-part grid (three data sets by three taxonomies) with which I could, slice by slice, figure out where any topical area was going by how much the papers were moving. I made the assumption that people would file a patent first, then do a popular paper like a conference proceeding or a magazine article, and then they might do papers for publication in the peer-reviewed journals. So, if you want to track the way science is heading, you could, theoretically, do it that way. And, if you indexed with different vocabularies, getting different knowledge domains, you might be able to interpret where the edges of those sciences are going. It is a fun way to look at what else we might need to be covering in the field.
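
To make the grid idea concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the source names, taxonomy names, years, and terms are made-up placeholders, not the actual data or tools from that project. Each record says that a document from one source, indexed with one taxonomy, received a given term in a given year. The grid counts documents per cell per year, and a second helper reports the earliest year each source picked up the term.

```python
from collections import Counter, defaultdict

# Hypothetical indexed records: (source, taxonomy, year, term).
# Source and taxonomy names are placeholders, not the author's actual data.
RECORDS = [
    ("patents", "taxonomy_a", 2001, "stem cells"),
    ("patents", "taxonomy_b", 2001, "stem cells"),
    ("association", "taxonomy_a", 2003, "stem cells"),
    ("medline", "taxonomy_a", 2005, "stem cells"),
    ("medline", "taxonomy_a", 2006, "stem cells"),
    ("medline", "taxonomy_c", 2006, "imaging"),
]


def build_grid(records, term):
    """Return {(source, taxonomy): Counter({year: count})} for one term."""
    grid = defaultdict(Counter)
    for source, taxonomy, year, record_term in records:
        if record_term == term:
            grid[(source, taxonomy)][year] += 1
    return dict(grid)


def first_year_by_source(grid):
    """Earliest year each source mentions the term, across all taxonomies."""
    first = {}
    for (source, _taxonomy), counts in grid.items():
        earliest = min(counts)  # smallest year key in this cell
        first[source] = min(earliest, first.get(source, earliest))
    return first


grid = build_grid(RECORDS, "stem cells")
print(first_year_by_source(grid))
```

With real data, comparing the first-appearance years across the sources is one way to test the patent-first assumption for a given topic.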

What you find in many disciplines is that you have the traditional core subject and then the multiple fields that are related to it. You need to find those so that you can move forward. The subject usually has some fundamental facets, and you will see the things that are studied and talked about and researched or published in that field. Then you can define the separations. So, if you think about it as ‘books and chapters’, you might have a book on the subject in general, and the chapters represent ever finer parts of the field.

Whether you are going to take a discipline approach as outlined here, or you are just going to go for the subject and the basic facets, I recommend working from the data, if you can. The reason for this is that the thesaurus is built for the data. You are dealing with a subject domain, and for all practical purposes that domain is really the things that you are going to index using that taxonomy or that you are going to search using that taxonomy. This is the whole purpose for building it. It is the reason it exists. People lose track of that as they get into the intellectual enterprise of building it, and they forget that you have to tie it back to the data. Otherwise, why build it at all?
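
One simple way to tie a draft back to the data is to check how often each candidate term actually appears in the corpus. The sketch below is an illustration, not a description of any particular tool: the function name, the sample documents, and the candidate terms are all hypothetical, and whole-phrase matching is a deliberately crude stand-in for real indexing.

```python
import re
from collections import Counter


def term_coverage(documents, terms):
    """Count how many documents mention each candidate term (case-insensitive)."""
    counts = Counter({term: 0 for term in terms})
    for doc in documents:
        text = doc.lower()
        for term in terms:
            # Match the term as a whole word or phrase, not as a substring.
            if re.search(r"\b" + re.escape(term.lower()) + r"\b", text):
                counts[term] += 1
    return counts


# Hypothetical corpus and candidate terms, for illustration only.
documents = [
    "Indexing journal articles with a controlled vocabulary.",
    "Search behavior in digital libraries.",
    "Controlled vocabulary maintenance for indexing teams.",
]
candidate_terms = ["controlled vocabulary", "indexing", "alchemy"]

coverage = term_coverage(documents, candidate_terms)
unused = [term for term, count in coverage.items() if count == 0]
print(coverage)
print("Terms with no support in the data:", unused)
```

A term that never appears in the data is not necessarily wrong, but it is a candidate to question before it goes into the taxonomy.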

So, although there is some mixing of approaches in practice, and you can go find additional treatises or taxonomies that you can adapt, there is a vital advantage in having those core subject areas covered. You know that those areas are going to be relevant to your data.

So basically, starting from the general concepts, you should identify the peripheral disciplines and get the core subjects established. Then you can work from there.

The next step is to select the terms. I’ll cover that topic in the next installment.

Marjorie M.K. Hlava
President, Access Innovations


Note: The above posting is one of a series based on a presentation, The Theory of Knowledge, given at the Data Harmony Users Group meeting in February of 2011. The presentation covered the theory of knowledge as it relates to search and taxonomies.