Search is such a wonderful and widely debated issue. To me there are three main kinds of search, and each is accomplished in different ways.
Assume that we have an article indexed with the terms appropriate to the content of that article. These terms are specific to the content. The terms applied are not broad categories unless the article itself is broad. We put limits in the rule base on broad terms so they do not get over-productive in search results.
Boolean search – that’s what is usually behind the little search box on a page. Put a word in, get all the articles tagged with that word. If a taxonomy is implemented and the search software explodes the query to include all synonyms, you will get all the articles on the same concept even when it is stated differently; that is, the synonyms are also applied to the query. Boolean search often depends on field-formatted data, which may be held in XML or in a relational database.
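To make the synonym "explosion" concrete, here is a minimal sketch in Python. The articles, index terms, and synonym ring are invented sample data, and the function names are hypothetical; real search software would do this against a much larger inverted index.

```python
from collections import defaultdict

# Hypothetical sample data: article id -> index terms applied to it.
ARTICLES = {
    1: {"heart attack", "aspirin"},
    2: {"myocardial infarction"},
    3: {"aspirin", "headache"},
}

# Hypothetical synonym rings taken from a thesaurus.
SYNONYMS = [
    {"heart attack", "myocardial infarction"},
]


def build_index(articles):
    """Build an inverted index: term -> set of article ids tagged with it."""
    index = defaultdict(set)
    for doc_id, terms in articles.items():
        for term in terms:
            index[term].add(doc_id)
    return index


def expand(term, synonym_rings):
    """Return the term plus every synonym that shares a ring with it."""
    expanded = {term}
    for ring in synonym_rings:
        if term in ring:
            expanded |= ring
    return expanded


def search_term(index, term, synonym_rings=()):
    """OR together the postings of the term and all of its synonyms."""
    hits = set()
    for variant in expand(term, synonym_rings):
        hits |= index.get(variant, set())
    return hits


INDEX = build_index(ARTICLES)

# Without expansion only the literal tag matches.
plain = search_term(INDEX, "heart attack")

# With expansion the article tagged with the synonym is found as well.
exploded = search_term(INDEX, "heart attack", SYNONYMS)

# A Boolean AND is a set intersection of two term searches.
both = search_term(INDEX, "heart attack", SYNONYMS) & search_term(INDEX, "aspirin")
```

In this sketch OR is a set union and AND is a set intersection, which is roughly what sits behind the search box when the data is cleanly tagged.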
Browse – this uses a navigation tree, either as bread crumbs or as a hierarchy. This may be “hardwired” into the user interface or dynamic. Normally it is hardwired, and the programmers really do not like to change it. This is a problem, because the data is constantly rearranging itself based on the trends and fads in the science presented. But browsing can be done dynamically using the hierarchical view of a thesaurus, also known as a taxonomy, and that is why taxonomies are so popular at the moment.
Take a look at www.mediasleuth.com. When you click on a term in the left-side taxonomic view, it will search for all documents tagged with that taxonomy term in the full underlying database collection of articles. You can choose to see only the items tagged with that term, or the items tagged with that term together with everything underneath it in the hierarchy. It allows for a “roll up” or a dive down into the data.
Another way to look at it is at http://www.dataharmony.com/products/navtree.html or download a free version at http://www.phpclasses.org/package/837-PHP-Explorer-like-tree.html. It works well with Lucene search software in all flavors and Lucene works well with MySQL.
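The "roll up" behavior of a taxonomy-driven browse can be sketched in a few lines of Python. The taxonomy, the tagged item ids, and the function names below are all made up for illustration; this is not how any of the products mentioned above is actually implemented.

```python
# Hypothetical taxonomy: term -> list of narrower terms beneath it.
NARROWER = {
    "Science": ["Physics", "Chemistry"],
    "Physics": ["Optics"],
    "Chemistry": [],
    "Optics": [],
}

# Hypothetical tagging: term -> ids of the items tagged with that term.
TAGGED = {
    "Science": {10},
    "Physics": {11, 12},
    "Optics": {13},
    "Chemistry": {14},
}


def descendants(term, narrower):
    """Return the term and every term beneath it in the hierarchy."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guards against a term reachable by two paths
        seen.add(current)
        stack.extend(narrower.get(current, []))
    return seen


def browse(term, roll_up=False):
    """Items tagged with the term, optionally including all narrower terms."""
    terms = descendants(term, NARROWER) if roll_up else {term}
    items = set()
    for t in terms:
        items |= TAGGED.get(t, set())
    return items


only_physics = browse("Physics")                 # items tagged "Physics" itself
physics_branch = browse("Physics", roll_up=True)  # plus everything under it
```

Because the tree is plain data rather than code, rearranging the hierarchy changes the browse results without anyone touching the user interface.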
Statistical approaches to search are many, and occupy the bulk of government-funded university research. They go by names like neural networks, latent semantics, vector search, Bayesian, co-occurrence, clustering, etc. The algorithms sound fabulous in the demo data sets. They are processor intensive. They perform much better with well-tagged data. They do depend on Boolean operators for the base inverted indexes but add a great deal of calculation to the results, allowing different views and clusters of the data based on n-grams, clusters, and data points. They are great for small, deeply mined, static sets. If the data is always being added to, updated, or changed in some way, the vectors have to be reset for each analysis, increasing the processing overhead and slowing the reports from the data. They are the holy grail of search and still very actively pursued.
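One way to see why changing data forces the vectors to be reset is a small TF-IDF sketch in Python. TF-IDF is used here only as a familiar stand-in for the statistical family, and the three tiny "documents" are invented. The weight of a word depends on how many documents in the whole collection contain it, so adding a single document changes the vectors of documents that were not touched.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (word -> weight) per document."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]

    # Document frequency: in how many documents does each word occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({word: tf[word] * math.log(n / df[word]) for word in tf})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


docs = ["taxonomy search index", "neural search ranking"]
before = tfidf_vectors(docs)

# Adding one document changes the collection-wide counts, so the vector
# for the first, unchanged document has to be computed again.
after = tfidf_vectors(docs + ["search engine caching"])
weight_changed = before[0]["taxonomy"] != after[0]["taxonomy"]
```

On a handful of documents the recalculation is trivial; on a large collection that changes constantly it becomes the processing overhead described above.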
There are many other search options, such as weighting, ranking, major and minor topic (term) determinations, roles, term stemming, right truncation of words, and less often left truncation (which requires a much bigger inverted index). Many of these search options can be automatically applied.
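A rough Python sketch shows why left truncation costs more than right truncation. It assumes one common approach, which is to keep a second copy of the term list with every term spelled backwards; the term list and function names are hypothetical, and real engines may handle this differently.

```python
from bisect import bisect_left

# Hypothetical vocabulary of indexed terms.
TERMS = ["index", "indexing", "search", "searching", "research", "taxonomy"]

# Right truncation only needs the terms in sorted order.
SORTED_TERMS = sorted(TERMS)

# Left truncation (assumed approach): a second sorted list with every
# term reversed, which is the extra index space mentioned above.
SORTED_REVERSED = sorted(term[::-1] for term in TERMS)


def prefix_range(sorted_terms, prefix):
    """All entries of a sorted list that start with the given prefix."""
    lo = bisect_left(sorted_terms, prefix)
    hi = bisect_left(sorted_terms, prefix + "\uffff")
    return sorted_terms[lo:hi]


def right_truncation(stem):
    """search* : terms that begin with the stem."""
    return prefix_range(SORTED_TERMS, stem)


def left_truncation(ending):
    """*search : terms that end with the given string."""
    matches = prefix_range(SORTED_REVERSED, ending[::-1])
    return sorted(term[::-1] for term in matches)


begins_with = right_truncation("search")
ends_with = left_truncation("search")
```

Each matching term would then be looked up in the inverted index and the postings combined with OR, just as with synonyms.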
In fact, I think that, with a couple of rare exceptions, search itself has not made many recent advances. Under the hood in search, things like new caching algorithms, faster stemming, and co-occurrence indications are pretty esoteric and don’t make much news. The only one I have seen that is new on the back end is that of Perfect Search Corporation (www.perfectsearchcorp.com). Lucene is free; I think that is why so many people use it. Lucene Solr has great documentation, as does the Red Hat improvement on Debian Linux. Still the same stuff on the back end, though.
The advances are in the user experience. The layer between the technology infrastructure and the web face is where the action resides. The user interface also plays a very important part in search. Advanced search usually takes you to the fielded search page. Displaying the hierarchy and searching the database through it can also be done as indicated above.
Word completion can be effected through a taxonomy or through a dictionary. “Did you mean” can be accomplished using synonyms from your thesaurus. Related terms can expand a search and narrower terms can make it more precise. These two can also be used to create a recommendation system for searchers.
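These thesaurus-driven features can be sketched in Python as follows. The thesaurus records, field names, and function names are invented for illustration and do not reflect any particular product.

```python
# Hypothetical thesaurus: preferred term -> its relationships.
THESAURUS = {
    "Automobiles": {
        "synonyms": ["cars", "autos"],
        "narrower": ["Electric vehicles"],
        "related": ["Highways"],
    },
    "Electric vehicles": {"synonyms": ["EVs"], "narrower": [], "related": []},
    "Highways": {"synonyms": ["freeways"], "narrower": [], "related": []},
}


def complete(prefix):
    """Word completion: preferred terms that start with what was typed."""
    typed = prefix.lower()
    return sorted(t for t in THESAURUS if t.lower().startswith(typed))


def did_you_mean(query):
    """Map a synonym the searcher typed to the preferred term, if any."""
    typed = query.lower()
    for term, record in THESAURUS.items():
        if typed in (s.lower() for s in record["synonyms"]):
            return term
    return None


def suggestions(term):
    """Related terms to expand a search, narrower terms to sharpen it."""
    record = THESAURUS.get(term)
    if record is None:
        return {"broaden": [], "narrow": []}
    return {"broaden": record["related"], "narrow": record["narrower"]}


completions = complete("auto")
preferred = did_you_mean("cars")
recommended = suggestions("Automobiles")
```

Showing the "broaden" and "narrow" lists next to the results is a simple form of the recommendation system mentioned above.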
The world of search is ever-changing, and fascinating to try to keep up with.
Marjorie M.K. Hlava
President, Access Innovations