Search systems are advertised in different ways: some as keyword search, some as Bayesian, some as Boolean, and some primarily as ranking algorithms.

The different kinds of search draw on the work of four well-known theorists, two of whom are dead and two of whom are still alive.

The first of these theorists was George Boole, a mathematician and logician who lived from 1815 to 1864.

Yes, this is the man behind Boolean algebra, an algebraic system of logic.

Boolean algebra is the basis of Boolean search, in which Web pages are handled as elements of Boolean sets.
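As a rough illustration of pages handled as elements of Boolean sets, here is a minimal Python sketch. The three pages and the helper name `pages_containing` are invented for this example; a real engine would work from an index rather than scanning text.

```python
# Hypothetical toy collection: page id -> page text (invented for illustration).
pages = {
    "p1": "boolean algebra and logic",
    "p2": "bayesian probability and inference",
    "p3": "boolean search with probability ranking",
}

def pages_containing(term):
    """Return the set of page ids whose text contains the term as a whole word."""
    return {pid for pid, text in pages.items() if term in text.split()}

# Each Boolean operator corresponds to a set operation.
boolean_and = pages_containing("boolean") & pages_containing("probability")  # AND: intersection
boolean_or = pages_containing("boolean") | pages_containing("bayesian")      # OR: union
boolean_not = pages_containing("boolean") - pages_containing("probability")  # AND NOT: difference

print(sorted(boolean_and), sorted(boolean_or), sorted(boolean_not))
```

A page either belongs to the result set or it does not; nothing in this model ranks one matching page above another.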

The next theorist of interest is Thomas Bayes, a mathematician who lived from 1702 to 1761.

Bayes’ Theorem, which uses probability inductively, established a mathematical basis for probability inference. The theorem provides a way of calculating, from the number of times an event has and has not occurred, the probability that it will occur in future trials.
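To make the idea concrete, here is a small Python sketch. The first function applies Bayes’ Theorem directly; the second is the classic "rule of succession," which estimates the chance of an event on the next trial from counts of past outcomes, assuming a uniform prior. The function names and the numbers in the example are assumptions chosen for illustration, not figures from any real search system.

```python
from fractions import Fraction

def bayes_posterior(prior, p_evidence_if_true, p_evidence_if_false):
    """Bayes' Theorem: P(H | E) = P(E | H) * P(H) / P(E)."""
    p_evidence = p_evidence_if_true * prior + p_evidence_if_false * (1 - prior)
    return p_evidence_if_true * prior / p_evidence

def rule_of_succession(occurred, not_occurred):
    """Probability of the event on the next trial, assuming a uniform prior."""
    return Fraction(occurred + 1, occurred + not_occurred + 2)

# Hypothetical numbers: 1 in 10 documents is relevant; a query term appears in
# 4 of 5 relevant documents and in 1 of 5 irrelevant ones.
posterior = bayes_posterior(Fraction(1, 10), Fraction(4, 5), Fraction(1, 5))
print(posterior)                 # probability a document is relevant, given the term
print(rule_of_succession(3, 1))  # event seen 3 times, not seen once
```

Exact fractions are used only to keep the arithmetic easy to check by hand; ordinary floats would work as well.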

We’ve seen already that search technology often involves probability; some of this technology is based on Bayes’ Theorem. While Bayesian methods have value, they need to be used with caution.

Our next person of interest is Peter D. Turney, a Canadian expert in computational linguistics. One of his main contributions to search technology is Turney’s algorithm, which is useful for sentiment analysis, that is, for gauging the attitude behind the words.
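Assuming that "Turney’s algorithm" here refers to his semantic orientation measure, which compares how strongly a phrase is associated with a positive reference word ("excellent") versus a negative one ("poor"), the core calculation can be sketched as follows. The hit counts are invented, and the `smoothing` parameter is this sketch’s own guard against division by zero.

```python
import math

def semantic_orientation(near_excellent, near_poor,
                         hits_excellent, hits_poor, smoothing=0.01):
    """Log-ratio of a phrase's association with 'excellent' versus 'poor'.

    near_excellent / near_poor: how often the phrase occurs near each word.
    hits_excellent / hits_poor: how often each reference word occurs overall.
    A positive result suggests a favorable phrase, a negative one unfavorable.
    """
    numerator = (near_excellent + smoothing) * hits_poor
    denominator = (near_poor + smoothing) * hits_excellent
    return math.log2(numerator / denominator)

# Hypothetical counts for two phrases.
print(semantic_orientation(80, 20, 1000, 1000))  # leans positive
print(semantic_orientation(5, 60, 1000, 1000))   # leans negative
```

Averaging such scores over the phrases in a review gives a rough overall sentiment for the document.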

Then there is Marco Dorigo, a Belgian researcher and the initial developer of the ant colony optimization algorithms. Search that uses this approach employs artificial ants (software agents) that mimic the search behavior of real ants.

Finally, we’ll take a look at a search approach with a more corporate background: the ranking algorithms of Google.

The terms that you input, that is, the queries that you type, are called the term inputs. Google weights those term inputs using an extremely large number of factors. If you were to get the Google Search Appliance to run your in-house searches, it would use exactly the same algorithms. Assuming you subscribe to the system, those algorithms are updated for you whenever the algorithms behind the main search box are upgraded. That means the real depth of your in-house search depends on the same ranking algorithms that you get on Google.

You might like that kind of search if it works well for you. If you don’t like it and feel that you get too many false drops, then you are going to get the same situation in-house. One way to mitigate that is through the data that you load to Google. You can make it partially well-formed by adding taxonomy terms to it, and then your search improves. Some people have just thrown up their hands: “Oh no! This is giving me a Google search appliance. I don’t know why I ever bothered to do a taxonomy in the first place.” However, if you add those terms at the time that you are loading the data to Google, you can search the taxonomy and improve your search results.
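Adding taxonomy terms at load time might look something like the Python sketch below. Everything in it is hypothetical: the two-term taxonomy, the trigger words, and the `subject` field name are invented for illustration and do not describe the actual metadata format of any Google product.

```python
# Hypothetical taxonomy: preferred term -> words that trigger it.
taxonomy = {
    "Cardiology": {"heart", "cardiac"},
    "Oncology": {"tumor", "cancer"},
}

def tag_document(text):
    """Return the taxonomy terms whose trigger words appear in the text."""
    words = set(text.lower().split())
    return sorted(term for term, triggers in taxonomy.items() if words & triggers)

def enrich(doc_id, text):
    """Build the record to load: the original text plus a taxonomy field."""
    return {"id": doc_id, "body": text, "subject": tag_document(text)}

record = enrich("d1", "Cardiac arrest after tumor surgery")
print(record["subject"])
```

Because the taxonomy terms travel with the document as a separate field, a search can match on them even when the user’s wording never appears in the body text.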

Another approach used in search, and compatible with the statistical approaches, is natural language processing (NLP). NLP uses natural human language input, or output, or both, in combination with computer processing.

There are certain basic tenets of natural language processing. All of them are used, to varying degrees, in search software. Some search products use NLP as a base layer and never let anyone touch it. Some use it as an interface for people. Often these are black box applications, because the algorithms, particularly the grammatical ones, are hard to follow, and vendors don’t want people tinkering with them and breaking them. These applications have not been as popular lately as they were some years ago.

What has become more popular is automatic language processing (ALP).

This is where you could apply a lot of different tenets. I used to go to computational linguistics meetings and artificial intelligence meetings and automatic translation meetings and ASIS&T meetings, and suddenly a lot of people that I saw in those artificial intelligence meetings were showing up at the ASIS&T meetings. There is a lot of overlap in these fields, but what is happening with ALP is that they are trying to automate most of the pieces. It’s essentially NLP with a few more computational heuristics, and it lends itself well to search.

Another kind of search is statistical search.

It is primarily Bayesian, with different systems making different levels of use of the Bayesian algorithms. You might hear about the ‘smart factors’ from Cornell, or ‘cluster analysis’ from other sources; you might hear about neural nets and neural networks from concept analysts; you might hear about co-occurrence engines from Autonomy; you might hear about Bayesian inference engines from other people. They are all doing a variation on a theme: some kind of heavy statistical analysis of the data, sometimes combined with ALP or NLP techniques, to derive a search result for you.

Basic to all of these kinds of search are inverted files and Boolean logic. We’ll cover inverted files (among other topics) in the next installment.

Marjorie M.K. Hlava, President, Access Innovations


Note: The above posting is one of a series based on a presentation, The Theory of Knowledge, given at the Data Harmony Users Group meeting in February of 2011. The presentation covered the theory of knowledge as it relates to search and taxonomies.