This series of blog posts explores how search works; we need a basic understanding of search fundamentals in order to know where taxonomies come in. Last week we started talking about search software, and today we continue with that topic.
I believe in the data first, as you know if you’ve been following this blog. Starting in the diagram with your source data, you can see how the data flows. First you need to clean the source data into a uniform format. This is often called the conversion process, or ETL: extract, transform, and load.
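To make that conversion step concrete, here is a minimal ETL sketch in Python. The file layout, field names, and function names are my own illustrations, not any particular product’s interface.

```python
# A minimal ETL sketch: extract raw records, transform them to a
# uniform format, and load them into a repository. All names here
# are illustrative assumptions, not a real product's API.
import json
from pathlib import Path

def extract(source_dir):
    """Read raw source files; this sketch assumes one JSON document per file."""
    for path in Path(source_dir).glob("*.json"):
        yield json.loads(path.read_text(encoding="utf-8"))

def transform(record):
    """Normalize each record to a uniform format with consistent field names."""
    return {
        "id": str(record.get("id", "")).strip(),
        "title": record.get("title", "").strip(),
        "body": record.get("body", record.get("text", "")).strip(),
    }

def load(records, repository):
    """Deposit the cleaned records into the repository (a list here,
    standing in for the cache or CMS)."""
    repository.extend(records)

repository = []
load((transform(r) for r in extract("source_data")), repository)
```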
Next we deposit the clean data into a cache – a repository of the data. Some people call this the content management system, or CMS. Then we build access to that repository with the cache builders. From there the data goes into some kind of hub that deploys the information into search. From the hub we build the inverted file indexes, so that those indexes are available for searching. The user submits queries through the search presentation layer. The query goes to a federator, because – as in the case of SharePoint – we may have a whole lot of projects that we want to search across. If you have multiple repositories to search across, you need a federator, which passes the query to the query servers; those access the deployment hub and the indexes and return an answer to the user.
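Here is a small sketch of the indexing and federation steps just described: building an inverted file index from the repository, and a federator that fans a query out across several indexes. The data shapes are assumptions carried over from the sketch above.

```python
from collections import defaultdict

def build_inverted_index(repository):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc in repository:
        for term in doc["body"].lower().split():
            index[term].add(doc["id"])
    return index

def federated_search(indexes, term):
    """Fan a query out across several repositories' indexes and merge
    the results, as a federator does."""
    results = set()
    for index in indexes:
        results |= index.get(term.lower(), set())
    return results

docs = [{"id": "1", "body": "taxonomy terms improve search"},
        {"id": "2", "body": "search across many repositories"}]
print(federated_search([build_inverted_index(docs)], "search"))  # {'1', '2'}
```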
If you look at a system like FAST (offered by a company recently bought by Microsoft), you see all of this content coming in on the left of the diagram. That information has to be digested by the system. It is digested by a number of different connectors or conversion programs that then load it into the content API – for FAST in this case, or for some other search software.
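A connector can be as simple as a script that converts one source format and hands the result to the content API. The sketch below is hypothetical – the submit callable stands in for whatever ingestion interface the product actually exposes, and the CSV columns are assumed.

```python
import csv

def csv_connector(path, submit):
    """A toy connector: convert rows of a CSV file into documents and
    hand each one to a content API. 'submit' is a placeholder for the
    real ingestion call, which varies by product."""
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            submit({"id": row["id"], "title": row["title"], "body": row["body"]})

# Usage: feed a (hypothetical) articles.csv into our repository list.
# csv_connector("articles.csv", repository.append)
```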
The document is processed, which means it is broken into pieces so that the search system can use it. The search system will want to serve the full article, so it has to have a server to present those articles. It will have an index so that it can show the individual files, and it may have RSS feeds and other filters to send alerts out to different kinds of receiving organizations.
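In code, that breaking-into-pieces might look like the sketch below: the document is kept whole for the article server, tokenized for the index, and summarized for feeds and alerts. The particular split is my own illustration of the idea.

```python
import re

def process_document(doc):
    """Break a document into the pieces a search system needs."""
    return {
        "stored": doc,                                            # kept whole so the article server can return it
        "tokens": re.findall(r"[a-z0-9]+", doc["body"].lower()),  # fed to the inverted index
        "summary": doc["body"][:200],                             # e.g. for RSS feeds and alerts
    }
```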
If you have chosen a Bayesian system, then you need to collect training sets so that the system’s statistical analysis has something to work on. That means you need at least 20 records (100 is better) for each term you want to train the system to search, and in those records the term must be used correctly. Our experience is that you need records for each distinct way a term is used, because in English we use terms in so many ways. That is the Achilles heel of the statistics-based systems: the expense of building those training sets.
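As one way to picture what those training sets feed, here is a tiny Bayesian categorizer using scikit-learn. In practice you would gather the 20–100 vetted records per term described above; two records per sense of “bank” are shown here just to keep the example short.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hand-picked training records in which each term is used correctly.
training_texts = [
    "the bank approved the loan application",
    "interest rates at the bank rose again",
    "we walked along the river bank at dusk",
    "erosion wore away the river bank",
]
training_labels = ["finance", "finance", "geography", "geography"]

vectorizer = CountVectorizer()
model = MultinomialNB().fit(vectorizer.fit_transform(training_texts),
                            training_labels)

print(model.predict(vectorizer.transform(["the bank raised its rates"])))
# -> ['finance']
```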
When the user submits a query, the search request moves through a query API – an application programming interface – something that translates the search question into the syntax used within the system. It goes into the query processor and then accesses the guts of the system, the inverted file.
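A minimal sketch of that translation step, assuming an engine that takes an AND-of-terms syntax (every engine defines its own):

```python
import re

def translate_query(user_query):
    """Translate a free-text question into the engine's internal query
    syntax. The AND-of-terms output is illustrative only."""
    return " AND ".join(re.findall(r"[a-z0-9]+", user_query.lower()))

print(translate_query("taxonomy governance layer?"))
# -> taxonomy AND governance AND layer
```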
The taxonomy comes into play in two places. There may be a taxonomy governance layer. It is great to apply the taxonomy terms to documents as they are processed into the system; that activity should happen somewhere in the document processing pipeline. If you are lucky, you can also use the taxonomy at the search end, so that people can disambiguate their queries as they search the system. Taxonomy, to my mind, should be in both of those places within your search implementation. Really, search is the reason you built the taxonomy in the first place.
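Here is a sketch of those two taxonomy touch points – tagging documents in the processing pipeline, and offering disambiguation choices at query time. The tiny taxonomy and the matching logic are stand-ins for a real controlled vocabulary and a real tagging engine.

```python
# A toy taxonomy mapping words to controlled terms.
TAXONOMY = {
    "mercury": ["Mercury (planet)", "Mercury (element)"],
    "etl": ["Data conversion"],
}

def tag_document(doc):
    """Indexing side: attach taxonomy terms as the document moves
    through the document processing pipeline."""
    terms = {t for w in doc["body"].lower().split() for t in TAXONOMY.get(w, [])}
    doc["taxonomy_terms"] = sorted(terms)
    return doc

def disambiguate(query):
    """Search side: when a query word maps to more than one taxonomy
    term, surface the choices so the user can narrow the query."""
    return {w: TAXONOMY[w] for w in query.lower().split()
            if len(TAXONOMY.get(w, [])) > 1}

print(disambiguate("mercury levels"))
# -> {'mercury': ['Mercury (planet)', 'Mercury (element)']}
```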
Next week we’ll address the need for accuracy in search.
Marjorie M.K. Hlava
President, Access Innovations