Search has many parts.
Those parts are moving parts, and every system handles them a little differently. There’s the search software, which comes from one of two major camps. Then there’s the computer network that it rides on. Then there’s the way the text is parsed, which is not always the same.
There’s also a distinction between whether the search is working on well-formed or structured text. With XML, people talk about well-formed data, which essentially means that the data is tagged correctly. The same idea may be called structured data or field-formatted data, depending on how far back your vocabulary goes.
There’s the computer software – related, of course – running on the network. There’s the hardware; not everything runs on every piece of hardware, because the operating systems are different. And there’s the telecommunications system and how you access it.
Then, if you have chosen to go a Bayesian route, there are training sets for the statistical system.
There are a lot of different pieces.
In the search software itself, the search technology, there are also several pieces. One piece is the ranking algorithms.
If you look at systems like Autonomy or Google, there is the query language. This is what someone uses to get a result. The query language usually has two levels: what the user types into a box, and the command-line query that it is translated into and sent to the search software itself.
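To make those two levels concrete, here is a minimal sketch in Python. The field names and the backend syntax are invented for illustration rather than taken from any particular vendor’s query language.

```python
# A minimal sketch (not any vendor's actual syntax) of the two query levels:
# what the user types in the box, and the command-line query the engine receives.

def translate_query(user_input: str, fields=("title", "body")) -> str:
    """Turn a plain user query into a hypothetical backend query string."""
    terms = user_input.strip().split()
    clauses = []
    for term in terms:
        # Each term is searched in each field; terms are ANDed together.
        field_matches = " OR ".join(f'{field}:"{term}"' for field in fields)
        clauses.append(f"({field_matches})")
    return " AND ".join(clauses)

print(translate_query("knowledge taxonomy"))
# (title:"knowledge" OR body:"knowledge") AND (title:"taxonomy" OR body:"taxonomy")
```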
There are federators, which provide a way to gather information from several repositories and bring it back as a single result – also called ‘federated search’.
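Here is a toy sketch of a federator. The repository names and scores are made up; the point is simply that one query fans out to several sources and comes back as a single ranked list.

```python
# A minimal federator sketch, assuming each repository exposes a search(query)
# function returning (doc_id, score) pairs; repository names are hypothetical.

def federate(query, repositories):
    """Send the query to every repository and merge the results into one list."""
    merged = []
    for name, search in repositories.items():
        for doc_id, score in search(query):
            merged.append((score, name, doc_id))
    # Present a single ranked result set, highest score first.
    merged.sort(reverse=True)
    return [(name, doc_id) for score, name, doc_id in merged]

repositories = {
    "intranet": lambda q: [("doc-12", 0.91), ("doc-07", 0.55)],
    "library":  lambda q: [("rec-3301", 0.78)],
}
print(federate("taxonomy governance", repositories))
```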
Then there are the caching algorithms. Caching is what happens behind the scenes. You run a query, and it starts to display the results. It displays the first ten because it is not serving all one million hits into the little memory partition that has been allocated to you; it is caching them. As you page through the results, if you jump to the last item in the list, it takes a while to get that information to you. The system queues the information up in case you want it, but it only goes a little bit ahead of your query, because it doesn’t want to chew up more memory than it needs to. So, cache memory and the caching of results are aspects of a search system, and it’s important to understand how they work.
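That lookahead behavior can be sketched roughly like this. The page size, the lookahead depth, and the backend are all invented; the point is that only a small slice of the result set ever sits in memory at once.

```python
# A minimal sketch of result caching: the engine fetches only a little ahead
# of the page the user is viewing instead of loading all hits into memory.

class ResultCache:
    def __init__(self, fetch_page, page_size=10, lookahead=2):
        self.fetch_page = fetch_page      # function that fetches one page of hits
        self.page_size = page_size
        self.lookahead = lookahead        # how many pages to queue up in advance
        self.cache = {}

    def get_page(self, page):
        # Fetch the requested page plus a small lookahead, caching as we go.
        for p in range(page, page + self.lookahead + 1):
            if p not in self.cache:
                self.cache[p] = self.fetch_page(p, self.page_size)
        return self.cache[page]

# Hypothetical backend: pretend there are a million hits.
def fetch_page(page, size):
    start = page * size
    return [f"hit-{n}" for n in range(start, start + size)]

cache = ResultCache(fetch_page)
print(cache.get_page(0)[:3])   # first page is fetched; pages 1 and 2 are queued
```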
The inverted index is basic to all search systems. It is the alphabetic or alphanumeric listing of every word, with pointers from each word to everything in the database that contains it. So, it is an important thing for practically all searches.
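A toy example makes the idea concrete. The three documents below are invented, but the structure is the essence of the inverted file: every word points to the documents that contain it.

```python
# A minimal inverted index sketch: every word maps to the documents it occurs in.

from collections import defaultdict

documents = {
    "doc1": "taxonomy terms enrich search",
    "doc2": "search systems build an inverted index",
    "doc3": "an index maps every word to documents",
}

inverted_index = defaultdict(set)
for doc_id, text in documents.items():
    for word in text.lower().split():
        inverted_index[word].add(doc_id)

# Looking up a query word is now a direct lookup, not a scan of every document.
print(sorted(inverted_index["index"]))   # ['doc2', 'doc3']
print(sorted(inverted_index["search"]))  # ['doc1', 'doc2']
```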
The final thing in search is the presentation layer. That is what you see. On top of all that technical machinery sits the user interface, and that is the presentation layer of search.
We can see a presentation layer in this diagram of a search system.
You notice that this presentation layer goes two places. Here I am the user, and I am going to the federator and to the repository cache at the same time. My federator takes me out to all the query servers, because I might have data in a lot of different places that I am pulling information out of. At the same time, I am going to the repository cache, where the information might have come from the source data after some cleanup algorithms have made sure that it presents properly going into the cache. That information then goes into the cache builders, and from there into both the deploy hub and the index builders, which is where the inverted file is built. All of this information then comes back through the query servers and the federators and gives me my answers. There is a whole lot going on behind the scenes.
These things are built in different ways. Below is one example.
This is FAST, which was bought by Microsoft. Here you have all of the information coming in through crawlers, file traversers, and API connectors to other data systems. Maybe it is indexing all of the email and other content coming from Oracle, Documentum, FileNet, or some other big systems. It all goes through a content application programming interface (API). This API translates the information that came in from someplace else, builds a cache, and loads it into what FAST calls a document processor and what the other guys call a deploy hub.
When the data lands here, it is a really good time to apply the taxonomy terms; this is where we add them. Then that information goes out to build the inverted file, the indexed database. It also goes to build another database that supports sending out alerts, RSS feeds, and the like.
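A very simplified sketch of that tagging step might look like the following. The taxonomy terms and their trigger words are invented for illustration; a real system would use a full rule base or a statistical categorizer.

```python
# A minimal sketch of applying taxonomy terms as documents land in the
# document processor / deploy hub; the taxonomy and rules here are invented.

taxonomy_rules = {
    "Information Retrieval": ["search", "query", "index"],
    "Knowledge Organization": ["taxonomy", "thesaurus", "ontology"],
}

def tag_document(doc):
    """Attach taxonomy terms to a document before it goes to the index builders."""
    text = doc["text"].lower()
    doc["taxonomy_terms"] = [
        term for term, triggers in taxonomy_rules.items()
        if any(trigger in text for trigger in triggers)
    ]
    return doc

doc = {"id": "doc-42", "text": "Building a thesaurus improves query results."}
print(tag_document(doc)["taxonomy_terms"])
# ['Information Retrieval', 'Knowledge Organization']
```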
This information – the index – is searchable through a query made in the presentation layer, which is where the user comes in. Whether users come in on a mobile device or a regular laptop, they all come in through that query processor. If you also apply the taxonomy terms at the query end, in the search presentation layer, you get a better search.
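Applied at the query end, the taxonomy acts as a kind of query expansion. The little thesaurus below is invented, but it shows how a user’s term can be broadened with related vocabulary before it hits the index.

```python
# A minimal sketch of applying the taxonomy at the query end: the user's term
# is expanded with related vocabulary before it reaches the index.
# The thesaurus entries below are invented for illustration.

thesaurus = {
    "taxonomy": ["controlled vocabulary", "thesaurus"],
    "search": ["retrieval", "query"],
}

def expand_query(user_query):
    expanded = []
    for term in user_query.lower().split():
        expanded.append(term)
        expanded.extend(thesaurus.get(term, []))
    # ORing the variants together widens recall without extra work by the user.
    return " OR ".join(f'"{t}"' for t in expanded)

print(expand_query("taxonomy search"))
# "taxonomy" OR "controlled vocabulary" OR "thesaurus" OR "search" OR "retrieval" OR "query"
```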
These are two places in the search system where you might want to apply the taxonomy terms if they were not already attached to the record. At a point like that, you want to enrich the data so that you can do a better search.
In the next installment, we’ll examine measuring and achieving accuracy in search.
Marjorie M.K. Hlava, President, Access Innovations
Note: The above posting is one of a series based on a presentation, The Theory of Knowledge, given at the Data Harmony Users Group meeting in February of 2011. The presentation covered the theory of knowledge as it relates to search and taxonomies.