As we continue the series on search and how it works, we are looking at file indexes more closely. More specifically, we are looking at complex inverted file indexes.
Stemming involves such actions as de-pluralization and removing gerund endings. It is often loosely called lemmatization, although strictly speaking lemmatization maps a word to its dictionary form rather than just trimming endings. Truncation – left and right – is a popular technique in search. Right truncation, basically chopping a word off at its end, is pretty easy. Left truncation is tricky. Consider the word ‘organization’, which can be spelled with either an ‘s’ or a ‘z’, depending on where you are from. The ‘-ation’ can be chopped off pretty easily, but for the rest of the word, I have to build an entire index, starting with o, or, org, orga, and so on, so that I can go through all of those to see where the full extension is. When people do left truncation, it is a lot more expensive, because it needs a much bigger, additional index.
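To make the cost difference concrete, here is a minimal sketch in Python. It assumes one common way of supporting left truncation, a second index holding every term spelled backwards, which is not necessarily how any particular search engine does it. The term list and the function names are hypothetical.

```python
from bisect import bisect_left

def prefix_range(sorted_terms, prefix):
    """Return every term in sorted_terms that starts with prefix."""
    start = bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break
        matches.append(term)
    return matches

# Hypothetical term list standing in for the keys of an inverted file.
TERMS = sorted(["nation", "organ", "organisation", "organization",
                "organize", "station"])

# The additional index assumed for left truncation: every term reversed.
REVERSED_TERMS = sorted(term[::-1] for term in TERMS)

def right_truncate(stem):
    """organi* : a prefix lookup on the ordinary sorted term list."""
    return prefix_range(TERMS, stem)

def left_truncate(ending):
    """*ation : a prefix lookup on the reversed list, then flip hits back."""
    hits = prefix_range(REVERSED_TERMS, ending[::-1])
    return sorted(hit[::-1] for hit in hits)

print(right_truncate("organi"))
print(left_truncate("ation"))
```

Right truncation reuses the term list the index already has. Left truncation, in this sketch, needs a whole second copy of the vocabulary, which is where the extra size comes from.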
Variant spellings need to be considered in taxonomy building. They also need to be considered in search. You can see that a lot of what we do in taxonomies is also used in thinking about search.
That taxonomy effect raises some questions: Where do the terms go? How are they used? What other ways can I use the taxonomy in search? Shown is a site that has a whole lot of search embedded in it. You might want to search the site. You might want to search a whole bunch of combined sites – a federated search. You might want to just use the taxonomy in navigation. Or you might want to search for books or the journals in publication or all publications. Each of those is going to take you to one of two kinds of places, depending on how it is set up at the back end. You will end up either at one great big database that combines everything and allows you to parse the file and search only for books or only for journals or only on the site, or at a separate search system for each of these. It is very popular these days to have it all in different places, which means you get different kinds of results from the different searches. If you have them all combined behind the scenes, it gives the user a much more powerful search.
Another way of doing a search presentation is shown on the left side, where you can see a hierarchical view. This view is actually a view of the taxonomy itself. It tells you how many items are tagged with that term in the database, and you can browse up and down it. You are using the full taxonomy tree for navigation.
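As a rough illustration of that kind of view, the sketch below counts how many records are tagged with each term and prints the tree with the counts beside each term. The taxonomy, the records, and the function names are all made up for the example, and the counts cover only direct tags; rolling counts up from narrower terms is left out.

```python
from collections import Counter

# Hypothetical taxonomy: each term mapped to its narrower (child) terms.
CHILDREN = {
    "Publications": ["Books", "Journals"],
    "Books": [],
    "Journals": [],
}

# Hypothetical records, each tagged with one or more taxonomy terms.
RECORDS = [
    {"id": 1, "tags": ["Books"]},
    {"id": 2, "tags": ["Journals"]},
    {"id": 3, "tags": ["Books", "Publications"]},
]

def tag_counts(records):
    """Count the records tagged directly with each term."""
    counts = Counter()
    for record in records:
        counts.update(set(record["tags"]))
    return counts

def render_tree(term, children, counts, depth=0):
    """Return indented lines, one per term, with its record count."""
    lines = [f"{'  ' * depth}{term} ({counts[term]})"]
    for child in children.get(term, []):
        lines.extend(render_tree(child, children, counts, depth + 1))
    return lines

print("\n".join(render_tree("Publications", CHILDREN, tag_counts(RECORDS))))
```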
In the search box, we see the implementation of the taxonomy for type-ahead. In this case we type the word ‘sell’ and by rotating the terms – both the synonyms and the preferred terms – we are finding a drop-down menu that will change as we continue to type. People don’t have to actually know the term that was preferred in building the taxonomy; it is doing auto-completion. Auto-completion is popular. It can go from the term index, or from that inverted file, or from the taxonomy, or from some other dictionary that you don’t even know about that may or may not be attached to the taxonomy. When it is attached to the taxonomy, it really leverages search.
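Here is a small sketch of taxonomy-driven type-ahead, under the assumption that "rotating" means every word of every term, preferred or synonym, becomes an entry point. The taxonomy fragment and the function names are invented for illustration.

```python
# Hypothetical taxonomy fragment: preferred term -> its synonyms.
TAXONOMY = {
    "Retail sales": ["selling", "retail selling"],
    "Sellers": ["vendors", "merchants"],
    "Marketing": ["sell-side promotion"],
}

def build_rotated_index(taxonomy):
    """List (word, term, preferred term) for every word of every term."""
    entries = []
    for preferred, synonyms in taxonomy.items():
        for term in [preferred, *synonyms]:
            for word in term.lower().split():
                entries.append((word, term, preferred))
    return sorted(entries)

def suggest(index, typed, limit=10):
    """Return (matched term, preferred term) pairs for what was typed."""
    typed = typed.lower()
    suggestions = []
    for word, term, preferred in index:
        if word.startswith(typed):
            pair = (term, preferred)
            if pair not in suggestions:
                suggestions.append(pair)
    return suggestions[:limit]

INDEX = build_rotated_index(TAXONOMY)
for term, preferred in suggest(INDEX, "sell"):
    print(f"{term}  ->  {preferred}")
```

Because each suggestion carries its preferred term, the search can run on the preferred term even when the user picked a synonym from the drop-down.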
If we were to click on one of the terms in the drop-down list, we would get a search that returns conceptually appropriate records. From there we could click through to see the record or, in this case, click through to the site to buy the item described in the record. We can get a little snippet about the record or a short abstract. It could be just a snippet like you see in a Google search. What is also displayed are additional conceptually appropriate terms, which include thesaurus-related terms to expand the search and narrower terms to narrow the search; it is just another way to look at how things are going.
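The expand and narrow suggestions can be sketched as a simple lookup against the thesaurus. The entries and the function name below are hypothetical; a real thesaurus would carry more relationship types.

```python
# Hypothetical thesaurus entries with narrower and related terms.
THESAURUS = {
    "Sales": {
        "narrower": ["Retail sales", "Wholesale sales"],
        "related": ["Marketing"],
    },
    "Retail sales": {
        "narrower": [],
        "related": ["Consumer behavior"],
    },
}

def refinements(term, thesaurus):
    """Return terms to offer beside the results for the searched term."""
    entry = thesaurus.get(term, {})
    return {
        "expand": list(entry.get("related", [])),   # related terms widen
        "narrow": list(entry.get("narrower", [])),  # narrower terms focus
    }

print(refinements("Sales", THESAURUS))
```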
Next week we will look behind the scenes at databases and taxonomies.
Marjorie M.K. Hlava
President, Access Innovations