As we continue the series on search and how it works, we are looking at file indexes in more depth, and specifically at complex inverted file indexes.
Stemming is de-pluralization, or removing endings such as the gerund ‘-ing’. It is often loosely called lemmatization, although strictly speaking lemmatization maps a word to its dictionary form rather than just stripping a suffix. Truncation, both left and right, is a popular part of search. Right-hand truncation, chopping a word off at its end, is pretty easy in most cases. Left-hand truncation is hard. Take the word ‘organization’, which can be spelled with either an ‘s’ or a ‘z’ depending on where you are from. The ‘-ation’ ending can be chopped off pretty easily, but to match on the rest of the word I have to build an entire index of the partial forms, starting with o, or, org, orga, and so on, so that I can go through all of those to see where the full extension is. When people do left-hand truncation, it is a lot more expensive, because it needs a much bigger, additional index.
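To make the cost difference concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not how any particular search engine does it: the term list is made up, the function names (`prefix_matches`, `right_truncation`, `left_truncation`) are hypothetical, and the second index of reversed terms is one common way to support left-hand truncation rather than necessarily the partial-form index described above.

```python
from bisect import bisect_left


def prefix_matches(sorted_terms, prefix):
    """Return every term in an already-sorted list that starts with `prefix`."""
    # Binary search jumps straight to the first candidate term.
    start = bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break  # sorted order means no later term can match
        matches.append(term)
    return matches


# Hypothetical vocabulary of an inverted index (the terms only, no postings).
terms = sorted(
    ["organ", "organisation", "organization", "nation", "station", "search"]
)

# Assumption: left-hand truncation is served by a SECOND index that stores
# every term reversed. This is the extra cost: the vocabulary is held twice.
reversed_terms = sorted(term[::-1] for term in terms)


def right_truncation(prefix):
    """Query like 'organi*': one lookup in the ordinary sorted term list."""
    return prefix_matches(terms, prefix)


def left_truncation(suffix):
    """Query like '*ation': a prefix lookup against the reversed-term index."""
    hits = prefix_matches(reversed_terms, suffix[::-1])
    return sorted(hit[::-1] for hit in hits)


print(right_truncation("organi"))
print(left_truncation("ation"))
```

The right-hand query only needs the term list the inverted index already keeps in sorted order. The left-hand query only becomes fast once the second, reversed list exists, which is where the additional index size comes from.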