This is the next piece in our series of blog posts on search and how it works. This time, let’s look at an inverted file index. Let’s pretend that the figure below is the outline of a presentation. It contains Define Key Terminology, Thesaurus Tools, Functions, Features, Class, Construction of the Thesaurus, and so on. You can see that the word “Thesaurus” is used three times, and there are a number of other words you might pick out to see where they occur. To turn this outline into an inverted file, the simple inverted file index just takes the words and puts them into an alphabetic list. It sorts the characters with the lowest ASCII codes first (most special characters and the numbers), and then it sorts the rest of the words alphabetically.
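To make the idea concrete, here is a minimal sketch of a simple inverted file in Python. The outline lines are made up for illustration (the real figure is not reproduced here), and the name `build_simple_index` is my own, not part of any search product.

```python
import re
from collections import defaultdict

# Hypothetical outline lines, loosely based on the headings named above.
OUTLINE = [
    "Define Key Terminology",
    "Thesaurus Tools",
    "Functions, Features, Class",
    "Construction of the Thesaurus",
    "3 Thesaurus Standards",
]


def build_simple_index(lines):
    """Map each word to the sorted list of 1-based line numbers containing it."""
    index = defaultdict(set)
    for line_no, line in enumerate(lines, start=1):
        for word in re.findall(r"[a-z0-9]+", line.lower()):
            index[word].add(line_no)
    # A plain string sort follows character-code order,
    # so entries starting with digits come before entries starting with letters.
    return {word: sorted(index[word]) for word in sorted(index)}


simple_index = build_simple_index(OUTLINE)
for word, line_numbers in simple_index.items():
    print(word, line_numbers)
```

In this toy outline, "thesaurus" ends up with three line numbers, and the entry "3" sorts ahead of every alphabetic word.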
That gives us a nice alphabetical list. But remember, computers are just great big calculators in the end, so I need to make this a little fancier.
In order to make it fancier, I need to apply some intelligence to the list. Some things (like of, 1, 2, 3) I won’t search on, so I will make them stop words. For the other words, I will tell the system where they are (for example, line 7, paragraph 2) and what type of thing each one is (for example, a subheading). So the index now records titles (marked T), headings, stop words, and placement.
What this does for that great big calculator is give it numbers that it can compare and add up. If I want to know about construction costs, I compare two terms, construction and costs, and see whether they are located close to each other. One is in paragraph 2 and one is in paragraph 1. If I wanted to define futures, both terms are in paragraph 1, and those might be pretty handy. So I am looking for those Boolean interchanges; I want to find out where these terms occur together. Here I have when and why, with a stop word in between them: line 9, positions 1 and 3. Because they sit so close together, I would present those to the user first as an answer.
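Here is a rough sketch of that fancier index in Python. Everything specific in it is an assumption made for illustration: the sample lines, the stop list, the field codes (T, H, S), and the `closeness` ordering are mine, not the figure’s or any particular search engine’s.

```python
import re
from dataclasses import dataclass

STOP_WORDS = {"of", "and", "the", "1", "2", "3"}  # assumed stop list


@dataclass(frozen=True)
class Posting:
    line: int
    paragraph: int
    position: int  # word position within the line, counting stop words
    field: str     # assumed codes: "T" title, "H" heading, "S" subheading


# Hypothetical document rows: (line, paragraph, field, text)
DOCUMENT = [
    (1, 1, "T", "Construction of the Thesaurus"),
    (2, 1, "H", "Define Key Terminology"),
    (3, 1, "S", "Features"),
    (7, 2, "S", "Costs"),
    (9, 2, "S", "When and Why"),
]


def build_positional_index(document):
    """Map each non-stop word to the list of places where it occurs."""
    index = {}
    for line, paragraph, field, text in document:
        words = re.findall(r"[a-z0-9]+", text.lower())
        for position, word in enumerate(words, start=1):
            if word in STOP_WORDS:
                continue  # skipped, but its position is still used up
            index.setdefault(word, []).append(
                Posting(line, paragraph, position, field)
            )
    return index


def closeness(a, b):
    """Smaller tuples are closer: same line, then same paragraph, then the rest."""
    if a.line == b.line:
        return (0, abs(a.position - b.position))
    if a.paragraph == b.paragraph:
        return (1, abs(a.line - b.line))
    return (2, abs(a.paragraph - b.paragraph))


def match_pair(index, term_a, term_b):
    """Boolean AND of two terms; returns the closest pair of postings, or None."""
    pairs = [
        (closeness(a, b), a, b)
        for a in index.get(term_a, [])
        for b in index.get(term_b, [])
    ]
    return min(pairs, key=lambda pair: pair[0]) if pairs else None


positional_index = build_positional_index(DOCUMENT)
for terms in [("when", "why"), ("construction", "costs")]:
    print(terms, match_pair(positional_index, *terms))
```

With this sample data, when and why land on the same line two positions apart, so they score as closer than construction and costs, which sit in different paragraphs. A real system would weigh the field type as well; this sketch only stores it.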
Next week we will continue with a look at complex inverted file indexes.
Marjorie M.K. Hlava
President, Access Innovations