This series of blog posts is exploring how search works. We need to have a basic understanding of search fundamentals in order to know where taxonomies come in. Last week we started with the various modules of search. This week we are addressing the search software itself.
The technical parts normally include several pieces:
- ranking algorithms, so that results come back in the order most likely to answer the user's query or search question;
- a query language and syntax that enable you to actually ask a question of a computer;
- federators that gather together the information from all kinds of different places and put it into a single cache to be searched;
- and the cache itself, which is often the collection of data. People talk about cache and cache memory, but they are not the same thing: one is active in random access memory (RAM) and the other is on the storage disk.
- Once you have all those things, all search software that I know of builds an inverted index: an alphabetical sort of every term in the searchable areas that you will be covering, with a record of where each term occurs.
- Most other enhancements take place in the presentation layer – that’s the interface that you see on the screen. The look and feel of the system is often what sells it, even though the technology underneath varies widely.
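To make the inverted index a little more concrete, here is a minimal Python sketch. The sample documents, the function names (`tokenize`, `build_inverted_index`, `search`), and the ranking by simple term counts are all assumptions made for illustration; real search software uses far more elaborate ranking and query syntax.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(docs):
    """Map each term to {doc_id: number of occurrences in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    # Keep the terms in alphabetical order, as described above.
    return dict(sorted(index.items()))

def search(index, query):
    """Return ids of documents containing every query term, best match first.

    Ranking is a plain sum of term counts (an assumption for this sketch).
    """
    terms = tokenize(query)
    if not terms:
        return []
    postings = [index.get(term, {}) for term in terms]
    # Documents must appear in the posting list of every query term.
    matching = set(postings[0]).intersection(*postings[1:])

    def score(doc_id):
        return sum(p[doc_id] for p in postings)

    return sorted(matching, key=lambda d: (-score(d), d))

# Hypothetical sample data
docs = {
    1: "Taxonomy terms improve search",
    2: "Search software builds an inverted index",
    3: "A search index speeds up search",
}
index = build_inverted_index(docs)
results = search(index, "search index")
```

With this sample data, `results` lists document 3 ahead of document 2, because document 3 mentions "search" twice. Document 1 is left out because it never mentions "index".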
Here is my preferred methodology:
- Design the system application.
- Decide what else needs to be added to the data so that you can enhance it. This is where the taxonomy comes in, for dealing with subject metadata.
- Consider what other metadata, other data, and other controls your system needs in order to work properly.
- Once you have done all that, then find a system that will work with your data.
- Don’t spend months working on a document type definition (DTD) just to find out, when you try to stuff your data into it, that you forgot to allow for multiple authors or forgot to add pagination. Both of those examples have happened to me with clients recently. We waited months for a DTD, and when we finally got it, it allowed only a single subject term and no more than one author, it didn’t have pagination, and it didn’t even allow a place for an abstract. Then they said that the DTD was locked, sorry, we can’t change it. So, we are stuffing the data into inappropriate fields.
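One way to look at the data first is to profile it before anyone locks the DTD. The Python sketch below is only an illustration under stated assumptions: the record layout, the field names, and the `dtd_limits` table (a hand-written stand-in for what a DTD permits) are all hypothetical, and the code does not parse a real DTD.

```python
def profile_records(records):
    """Return the largest number of values seen for each field."""
    max_counts = {}
    for record in records:
        for field, value in record.items():
            values = value if isinstance(value, list) else [value]
            max_counts[field] = max(max_counts.get(field, 0), len(values))
    return max_counts

def find_schema_gaps(max_counts, limits):
    """List fields the data needs but the schema omits or caps too low.

    `limits` maps field name -> maximum allowed values (None = unlimited).
    """
    problems = []
    for field, needed in sorted(max_counts.items()):
        if field not in limits:
            problems.append(f"{field}: no place for it in the schema")
        elif limits[field] is not None and needed > limits[field]:
            problems.append(
                f"{field}: data has up to {needed}, schema allows {limits[field]}"
            )
    return problems

# Hypothetical sample records and schema limits
records = [
    {"title": "Paper A", "author": ["Lee", "Ortiz"],
     "subject": ["search", "taxonomy"], "abstract": "A short summary."},
    {"title": "Paper B", "author": ["Kim"],
     "subject": ["indexing"], "pages": "10-18"},
]
dtd_limits = {"title": 1, "author": 1, "subject": 1}
gaps = find_schema_gaps(profile_records(records), dtd_limits)
```

Here `gaps` flags the missing abstract and pagination fields and the single-author and single-subject limits, which are the same problems described above, before any data has been forced into the wrong fields.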
Many organizations have five or more kinds of search software. All too often, none of them work, which is why they end up just sitting on the shelf. The organizations aren’t looking at the data first. Okay, enough of my rant. Next week we’ll continue with the pieces of search.
Marjorie M.K. Hlava
President, Access Innovations