Inverted Files, Parsing, Discovery, and Clustering

Inverted files and Boolean logic are basic to most, if not all, search.

So, what we have is the inverted file, that big alphabetical list of all the terms and where they came from, and Boolean logic as a way to combine them. At the end of the day, no matter which direction you come from, and no matter how sophisticated the high-end features and presentation layers are, you are depending on these two to make it work. If you add a taxonomy to that, you are strengthening the use of your terms.

Here’s an example of some text.

Pretend that this box shows you a page and focus on any one term, like thesaurus.

If we were to build an inverted file of this data, this is what we would do: make an alphabetical list of the terms.

If we wanted to make it much more complex, we would add stop words and we would add the conditions.

We could say that this term appears in these places. That would give me and my computer a way to combine the terms, because I know from their positions where they are. At the end of the day, this is what a very simple inverted index display looks like; it is basic to search.
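As a rough illustration of the idea above, here is a minimal Python sketch of an inverted file that records where each term appears and then combines terms with Boolean AND and a simple adjacency check. The sample pages, the stop list, and the function names (`build_inverted_file`, `boolean_and`, `adjacent`) are assumptions made up for this example; they are not taken from any particular search product.

```python
from collections import defaultdict
import re

# Hypothetical stop list, for illustration only.
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to"}

def build_inverted_file(pages):
    """Map each term to {page_id: [word positions on that page]}."""
    index = defaultdict(lambda: defaultdict(list))
    for page_id, text in pages.items():
        words = re.findall(r"[a-z0-9]+", text.lower())
        for position, term in enumerate(words):
            # Stop words are skipped, but they still count toward positions,
            # so adjacency between the remaining terms stays accurate.
            if term not in STOP_WORDS:
                index[term][page_id].append(position)
    return index

def boolean_and(index, a, b):
    """Pages that contain both terms."""
    return set(index.get(a, {})) & set(index.get(b, {}))

def adjacent(index, first, second):
    """Pages where `second` immediately follows `first`."""
    hits = set()
    for page_id in boolean_and(index, first, second):
        second_positions = set(index[second][page_id])
        if any(p + 1 in second_positions for p in index[first][page_id]):
            hits.add(page_id)
    return hits

# Made-up sample pages.
pages = {
    1: "A thesaurus is a controlled list of terms.",
    2: "The taxonomy and the thesaurus support search.",
    3: "Search terms come from the taxonomy.",
}
index = build_inverted_file(pages)

def display(index):
    """Return the alphabetical term list with page and position data."""
    return [f"{term}: {dict(index[term])}" for term in sorted(index)]
```

Calling `display(index)` gives the alphabetical list, and `boolean_and(index, "thesaurus", "taxonomy")` returns the pages where both terms occur. Because positions are stored, `adjacent` can answer phrase-style questions that a plain term list cannot.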

We also use a lot of other techniques, such as stemming, truncation, and wild cards. In addition, we accommodate misspellings, variant spellings, and so forth, so that we can make search even better.

Right truncation is very common, and so are wild cards and some stemming algorithms. Left-hand truncation is hard, because you have to build an entire inverted file covering every letter to the left side of every term. That makes for an extremely large index very quickly. Wild cards are fine as term substitutions, but left truncation is very hard and very expensive to implement. Before you ask for it, look at what it is going to do to your system.
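To make the cost difference concrete, here is a small Python sketch. Right truncation can be answered straight from the alphabetical term list, because all terms sharing a beginning sit next to each other. For left truncation, this sketch assumes one common workaround, a second sorted list of every term spelled backwards; that is an illustrative choice, not necessarily how any given engine does it, and the term list and function names are invented for the example.

```python
from bisect import bisect_left

def prefix_range(sorted_terms, prefix):
    """Return the terms in a sorted list that start with `prefix`."""
    lo = bisect_left(sorted_terms, prefix)
    # "\uffff" acts as an upper bound; assumes terms use ordinary characters.
    hi = bisect_left(sorted_terms, prefix + "\uffff")
    return sorted_terms[lo:hi]

# Made-up vocabulary, kept in alphabetical order like an inverted file.
terms = sorted([
    "index", "indexes", "indexing", "reindex",
    "search", "searching", "taxonomy", "thesaurus",
])

# Extra structure needed only for left truncation: every term reversed.
# This roughly doubles the size of the term list.
reversed_terms = sorted(term[::-1] for term in terms)

def right_truncate(stem):
    """Answer a query like index* from the ordinary term list."""
    return prefix_range(terms, stem)

def left_truncate(ending):
    """Answer a query like *index using the reversed term list."""
    matches = prefix_range(reversed_terms, ending[::-1])
    return sorted(match[::-1] for match in matches)
```

Here `right_truncate("index")` needs nothing beyond the list that already exists, while `left_truncate("index")` only works because a second full copy of the vocabulary was built and sorted.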

Discovery is said to be a very popular activity, and it has received a great deal of research money and attention. However, only about 2% of searchers’ time is actually spent in discovery; the other 98% is spent doing whatever search is needed to update themselves. So, if they’re mostly updating themselves, that calls for one kind of system, and if they are in discovery mode, they might want a different kind of system.

One of the big questions in search is, what kind of search are you going to do? Are you doing discovery, looking for new things and new ways of combining them? Or are you trying to do an exhaustive, find-everything search? Do you need relevance, recall, and precision, or are you in discovery mode? It makes a big difference as to what kinds of search features you need to design, depending on what the users are primarily going to be doing.
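For readers who want the arithmetic behind recall and precision, here is a minimal Python sketch using the standard definitions: precision is the share of retrieved documents that are relevant, and recall is the share of relevant documents that were retrieved. The document IDs are made up, and the function name `precision_recall` is just a label for this example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one search, given document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    # Guard against empty sets so the division never fails.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Made-up example: the search returned 4 documents, 2 of them relevant,
# out of 5 relevant documents in the whole collection.
precision, recall = precision_recall(
    retrieved=[1, 2, 3, 4],
    relevant=[2, 4, 5, 6, 7],
)
```

An exhaustive search cares most about the second number, since it measures how much of the relevant material was found at all.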

Vivisimo, for instance, is a discovery system. It searches on the fly, does automatic clustering, and it doesn’t return the same clusters each time. Each time, the clusters are new. It is not the same search; it’s not the same results. It really angers some researchers to get different stuff the next time. What they really want is additive results. They want to see what they got before, and anything that is new. That’s a different search presentation.
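The additive presentation described above can be sketched in a few lines of Python: keep the document IDs from the last run and, on the next run, separate what the searcher has already seen from what is new. This is only an illustration of the idea under made-up data; the function name `additive_view` and the ID values are assumptions, and it says nothing about how Vivisimo or any other product works internally.

```python
def additive_view(previous_ids, current_ids):
    """Split the current results into already-seen and new documents."""
    previous = set(previous_ids)
    current = set(current_ids)
    return {
        # Order of the current result list is preserved in both groups.
        "seen_before": [d for d in current_ids if d in previous],
        "new": [d for d in current_ids if d not in previous],
        # Documents from last time that this search no longer returns.
        "no_longer_returned": [d for d in previous_ids if d not in current],
    }

# Made-up result lists from two runs of the same query.
last_week = ["doc-12", "doc-07", "doc-31"]
today = ["doc-07", "doc-44", "doc-12", "doc-50"]
view = additive_view(last_week, today)
```

The only design requirement is that something remembers the previous result set between searches.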

I won’t go into detail about all the types of clustering. Suffice it to say that you can do it in many different ways.

You can group everything together into hierarchical clusters, or you can partition the material into separate groups. You can even add a thesaurus to a clustering system, although the system won’t really like it.

This is a view by Vivisimo of its own clustering module. The automatic clustering is shown on the side; this is what will change each time you search.

Remember, I showed you FAST earlier. This is yet another kind of model of how search is done. We have a core engine, and then everything else is connected by application programming interfaces (APIs). These are modules. You don’t get all of these when you buy it; you get just some of them. They are different faces of the data.

So, if you wanted to search a bunch of different databases, then you would buy the federation module. If you wanted to go to a bunch of different query servers, then you’d get the module for that. If you want to look at it in different languages, particularly in different character sets, then you would get this module. If you want to use the results, you would need this one. If you want to connect it to other search engines, you need this one. If you want to display those clusters on the fly, then this is the cluster module.

Collaboration is where people can view and tag those documents and show them to other people. That’s where you keep what you have, because next time you search, it will be different.

Marjorie M.K. Hlava, President, Access Innovations

Note: The above posting is one of a series based on a presentation, The Theory of Knowledge, given at the Data Harmony Users Group meeting in February of 2011. The presentation covered the theory of knowledge as it relates to search and taxonomies.
