This is the next post in our series on search and how it works. Last week we ended with natural language processing (NLP). This week we pick up with automatic language processing, or ALP.

Automatic language processing is different. It involves automatic translation, automatic indexing, or automatic abstracting. It is a pillar of artificial intelligence and of many search systems, though it is frequently built on top of the precepts of natural language processing. It might add spell checking. Many of the initiatives for the semantic web are based on some sort of automatic language processing, along with linking algorithms, other kinds of NLP, and other kinds of computational linguistics.

Statistical search has evolved to encompass a wide variety of options. Starting from Bayesian statistics, which go back many years, we get cluster analysis, neural network search, vector searching, co-occurrence, Bayesian inference, and latent semantics; these are all different search methods. At the end of the day they are all based on statistics, and they all depend on the input data and on training sets. When you implement a search system like this, you need to factor in what it will cost to train it. Training itself is really just batch processing, and programs do it. But for those programs to do their work, they need examples of every term (every term in your taxonomy, for example) used correctly in quite a few articles, typically 20 to 50. In my experience, you have to look at about three times that many articles to find ones where the term is used spot-on, specifically the way you want to use it. You present those as training sets. Collecting the training sets for statistical search takes quite a while, and that is the real Achilles' heel of these systems.
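To get a feel for that cost, here is a rough back-of-the-envelope sketch in Python. It uses the figures above (20 to 50 good examples per term, and roughly three articles reviewed for each good one). The function name, the 1,000-term taxonomy, and the assumption that every term is counted separately (no article serves more than one term) are illustrative choices, not measurements from a real project.

```python
def estimate_training_effort(num_terms, good_per_term=(20, 50), review_ratio=3):
    """Estimate training-set size for a taxonomy (hypothetical helper).

    good_per_term: low/high number of correctly used examples needed per term.
    review_ratio:  articles reviewed per good example found (about 3, per the text).
    Assumes each term is counted independently, so this is a pessimistic estimate.
    """
    low, high = good_per_term
    return {
        # Articles that end up in the training sets
        "good_examples": (num_terms * low, num_terms * high),
        # Articles someone has to read to find those examples
        "articles_reviewed": (
            num_terms * low * review_ratio,
            num_terms * high * review_ratio,
        ),
    }


# Example: a taxonomy of 1,000 terms (an assumed size)
effort = estimate_training_effort(1000)
print(effort)
```

Even under these simple assumptions, the review effort runs well into the tens of thousands of articles, which is why collecting training sets tends to dominate the cost.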

In all of these systems, whether they are statistical, natural language, or anything else, the foundation is an inverted file and, at a minimum, the Boolean operators And, Or, and And Not. We frequently build a searchable index, which is the inverted file index, and then on the user end we might build some kind of presentation layer: a hierarchical display or browsable list, which frequently comes from the taxonomy view of the thesaurus.
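As a minimal illustration of that foundation, the Python sketch below builds a toy inverted file and runs And, Or, and And Not against it. The three sample documents, the function names, and the whitespace tokenizer are all assumptions made for the example; a production index would add stemming, stop words, positions, and much more.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():  # naive whitespace tokenizer
            index[token].add(doc_id)
    return index


def search_and(index, a, b):
    """Documents containing both terms."""
    return index.get(a, set()) & index.get(b, set())


def search_or(index, a, b):
    """Documents containing either term."""
    return index.get(a, set()) | index.get(b, set())


def search_and_not(index, a, b):
    """Documents containing the first term but not the second."""
    return index.get(a, set()) - index.get(b, set())


# Hypothetical sample documents
docs = {
    1: "statistical search uses training sets",
    2: "boolean search uses an inverted file",
    3: "statistical methods need an inverted file too",
}
index = build_inverted_index(docs)

print(search_and(index, "search", "statistical"))      # {1}
print(search_or(index, "inverted", "training"))        # {1, 2, 3}
print(search_and_not(index, "search", "statistical"))  # {2}
```

Each Boolean operator becomes a set operation on the posting lists, which is why the inverted file sits underneath so many otherwise different search methods.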

Next week we will talk more about inverted file indexes.

Marjorie M.K. Hlava
President, Access Innovations