Leveraging Your Taxonomy – Part 1

February 6, 2012  
Posted in Access Insights, Featured, search, Taxonomy

This series of blog posts will explore how search works. We need to have a basic understanding of search fundamentals in order to know where taxonomies come in. The modules of search are:

  • Search software – of course
  • Computer network
  • Parsing of text
  • Well formed or structured text
  • CLEAN DATA
  • Computer software – network
  • Computer hardware
  • Telecommunications connection
  • Training sets for statistical systems
  • Search technology
  • Ranking algorithms
  • Query language
  • Federators
  • Cache
  • Inverted index
  • Other enhancements
  • Presentation layer

We will cover each of those items over the series; however, we also need to measure the accuracy of search. Accuracy in search is measured in three major ways: precision, recall, and relevance. Each of these can be handled in different ways. Part of the challenge in measuring search accuracy is that there are two major theoretical directions in search. One of them is based on Bayes’ theorem and the other on the logic of Boole. I will explore the work of these two gentlemen and of more recent people supporting search enrichment. Then I will discuss what effect they have on search as we know it today. Finally, I will discuss the effect of taxonomy on search.
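As a rough illustration of the first two measures, here is a minimal Python sketch. It assumes the common textbook definitions: precision is the share of retrieved documents that are relevant, and recall is the share of relevant documents that were retrieved. The function name and the document IDs are made up for the example. Relevance, which is about how results are ranked, is not covered by this sketch.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: IDs the search system returned (hypothetical data)
    relevant:  IDs a human judged to be relevant (hypothetical data)
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually found

    # Guard against division by zero when either set is empty.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Made-up example: four documents came back, three were truly relevant.
retrieved_docs = ["d1", "d2", "d3", "d4"]
relevant_docs = ["d2", "d4", "d7"]

precision, recall = precision_recall(retrieved_docs, relevant_docs)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this made-up case, two of the four retrieved documents are relevant, and two of the three relevant documents were found.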

How does search work?

Here is the normal path people take to search implementation. “Well, I think I will get some hardware.” And they are told, “Well, you can’t go wrong with Company X.”

  • So they buy hardware and
  • they buy the software that will work on that hardware.
  • Then they design a system that will work with that software and
  • then they try to load their data and,
  • finally, they try to enhance the data with a taxonomy.

In my opinion, that is totally backwards. What they should be doing is looking at what they are building the system for in the first place – that is, the data. How about we build a system to hold your data?

We assess the data first, so that we know what fields there are and can base the design on them. I have written about this backwards approach before.

What are you building? 

  • Assess the data
  • Do the design
  • Decide what else needs to be added
      • Taxonomy terms
      • Other controls
  • Find a system that will work with your data

Let’s outline the pieces of the search implementation. There are a lot of parts to search, and one of them, of course, is the search software itself, which runs on a computer network. The software depends on parsing, that is, cutting the text into specific pieces so that it can be searched. That means that the data needs to be structured or, in the XML vernacular, well-formed.

Unstructured data is simply data that has not been tagged into fields. You could transform the text of a Word document, which is generally considered unstructured, into a well-formed XML document by simply putting an opening tag such as <body> at the beginning of the text and the matching closing tag </body> at the end. Yes, it can be that deceptively simple, technically speaking. The whole notion of structured and unstructured text is a bit of a misnomer and a little bit hard to understand, because most of us don’t think of data in that way. In fact, a Word document has a Properties table in it that may or may not be populated, and some of its entries are populated by default. So it is actually partially structured already. It can even be saved as an XML document.

The search software depends on clean data for its implementation. That means the data has to be clean and well-formed, preferably with all of the metadata fields filled in, including the taxonomy terms in a specific tagged field or element.
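To make the tagging idea concrete, here is a small Python sketch using only the standard library. The element names (document, taxonomy, term, body), the helper function, and the sample text are assumptions chosen for illustration, not a schema from any particular search product. The sketch wraps plain text in tags, puts taxonomy terms in their own tagged field, and then parses the result to confirm that it is well-formed.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def wrap_as_xml(text, taxonomy_terms=()):
    """Wrap plain text and taxonomy terms in a hypothetical XML record."""
    # escape() replaces &, < and > so the text cannot break the markup.
    terms = "".join(f"<term>{escape(t)}</term>" for t in taxonomy_terms)
    return (
        "<document>"
        f"<taxonomy>{terms}</taxonomy>"
        f"<body>{escape(text)}</body>"
        "</document>"
    )


raw_text = "Taxonomies & search: tagging text into fields."
xml_doc = wrap_as_xml(raw_text, ["Taxonomies", "Information retrieval"])

# fromstring() raises ParseError if the markup is not well-formed.
root = ET.fromstring(xml_doc)
body_text = root.findtext("body")
terms = [t.text for t in root.iter("term")]

print(body_text)
print(terms)
```

Once the record is tagged this way, a search system could index the body and the taxonomy terms as separate fields, which is the point of filling in the metadata before loading the data.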

The computer software runs on a network, which runs on hardware. To get to it, you need to have a telecommunications connection of some kind. It might be a hard network wire within your organization – so you connect from one place to another within the firm – or it might be something that goes over the Internet to a remote location. It doesn’t really matter – the connection is still a telecommunications connection that transfers data in an orderly fashion over a wire.

Next week we will talk about the search software itself.

Marjorie M.K. Hlava
President, Access Innovations