This is the next piece in our series of blog posts on search and how it works. This time we are talking about the importance of relevance.

Relevance is how well a set of returned documents answers the information need; it is another way of talking about accuracy. But it is tied to the objective of the search. Different user communities can get exactly the same answer set from the same information resources, with one finding the set relevant and the other not. So there is a healthy tension between the user's needs and the available context, which is why many relevance engines do extensive profiling of their users.

If I search Google with a particular question, I might get one answer, and each of you might search the same question and get a different one. That is because your profile and the things you have clicked on in the past indicate to Google that your answers should lean toward one sphere or another. So relevance is really a confidence factor, a guesstimate of how well this set of documents will answer this particular user's query.

There are formulas for recall, precision, and relevance. Recall is the number of relevant items retrieved from the system divided by the total number of relevant items in the collection; one hundred percent recall means we retrieved every relevant item in the entire collection. Precision is the number of relevant items retrieved (those I actually wanted to see as a user) divided by the total number of items retrieved. So if only five of the first ten hits presented to me were actually germane to my question, my precision score is only 50%. Relevance is precision in relation to recall: the items that are right for my question versus everything in the database that deals with the topic. You can see that precision, recall, and relevance are quite different measures.
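These two ratios can be computed directly from counts; here is a minimal sketch in Python (the function names and the recall numbers are ours, added for illustration):

```python
def precision(relevant_retrieved, total_retrieved):
    """Fraction of the retrieved items that were relevant."""
    return relevant_retrieved / total_retrieved

def recall(relevant_retrieved, total_relevant_in_collection):
    """Fraction of all relevant items in the collection that were retrieved."""
    return relevant_retrieved / total_relevant_in_collection

# The example from the text: 5 relevant articles in the first 10 hits.
print(precision(5, 10))  # 0.5, i.e., 50% precision
# If the collection held 20 relevant items and we retrieved 5 of them:
print(recall(5, 20))     # 0.25, i.e., 25% recall
```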

When we measure relevance, we look at the context: this searcher, this searcher's profile, what they have in their head, what they are looking for. We usually also take the age of the documents into account, because people are normally looking for the most recent material. We look at how many of the documents we got back, the completeness of the returned data set (the recall), and the measure of quality, which is often very hard to determine because it is in the eye of the user. Relevance, or at least the attempt to get at it, is often statistically determined; we will cover more of that with Mr. Bayes. In the end it is a subjective evaluation, or the system's confidence that what it is presenting to you is indeed what you want. There are a lot of different and complex factors in measuring relevance itself.

If we look at the different kinds of search, we encounter several main options: keyword search, Bayesian search, Boolean search, and ranking algorithms. In truth, most search systems include some or all of these options; what matters is which one they depend on the most. These options trace back to a few famous theorists, two still living and two long gone: Boole and Boolean algebra; Bayes and Bayesian techniques; Turney and his algorithms for enriched structured data; and Marco Dorigo and his ant colony theory. There is a very large body of research on search; please consider this only a very high-level sampling.

George Boole was a mathematician who lived from 1815 to 1864, not all that long a life, as you can see. The Wikipedia article on him is quite good. He came up with an algebraic system to express the logic of what we know as the AND, OR, NOT, and AND NOT expressions. Those of you who have been searching for quite a while may have searched Dialog or BRS, IBM STAIRS, Ovid, CD Plus, or SilverPlatter. These and many other systems, including MedLine, are based on the Boolean algebra approach. It has been around for a long time and remains quite popular, providing very high precision and recall.

Boolean expressions are usually represented with a Venn diagram, which shows the intersections between sets. Suppose I search on two terms, A and B. If I just entered the two terms as a single expression, A and B might be treated as an automatic intersection. If I want A OR B, I get everything in both circles (the union). If I want A AND B, I get only the overlapping region (the intersection). If I want A XOR B, I get everything in either circle except the overlap. And I could say A NOT B and get rid of the overlap while keeping the rest of A. A lot of these expressions are easiest to explain through a Venn diagram.
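The Venn-diagram regions map directly onto set operations; here is a small sketch (the two document sets A and B are invented for illustration):

```python
# Two hypothetical result sets from single-term searches.
A = {"doc1", "doc2", "doc3"}
B = {"doc2", "doc3", "doc4"}

print(A & B)  # A AND B: only the overlapping region
print(A | B)  # A OR B: everything in both circles
print(A - B)  # A NOT B: A with the overlap removed
print(A ^ B)  # A XOR B: both circles minus the overlap
```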

Mr. Thomas Bayes was an earlier mathematician (1702 – 1761) who wrote a great deal about probability. He theorized that if we had a known set and we knew that certain things usually happen, then when we got a new set, we could infer that the same things would probably happen. That is probability based on what happened in the past, used to forecast what will happen in the future. It is a nice, fairly well-established approach: we can say, "Well, if these 5,000 articles were about this topic, then if the same term set appears in the next 5,000 articles, they are probably about the same thing." But the distribution of probabilities changes, particularly in active areas like news or cutting-edge science, and people may not want to depend on the distribution of historical data to predict future data. A user might also make a new kind of request, something they have never queried of the system before, and getting that information out of the network is much harder. We face a computational linguistic difficulty when we explore a set of data with an unknown new kind of request.
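Bayes' rule itself is a one-liner; here is a hedged sketch applying it to the article example above (the topic, term, and all of the counts are invented for illustration, not from the text):

```python
def bayes(prior, likelihood, evidence):
    """P(topic | term) = P(term | topic) * P(topic) / P(term)."""
    return likelihood * prior / evidence

# Suppose 5,000 of 10,000 past articles were about "security" (prior = 0.5),
# the term "firewall" appeared in 80% of those (likelihood = 0.8),
# and "firewall" appeared in 50% of all articles overall (evidence = 0.5).
posterior = bayes(prior=0.5, likelihood=0.8, evidence=0.5)
print(posterior)  # 0.8 -> a new "firewall" article is probably about security
```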

The Bayesian assumption is, "We knew this to be true in the past, and therefore it will be true in the future." So if you are mining terrorist literature, for example, trying to predict what may happen next based on what has happened before, you should expect that the terrorists are aware of that and will constantly change their behavior in new and novel ways to trick the system: figure out the distribution of probabilities and actually feed erroneous results to people. If you depend on a Bayesian engine to keep track of rapidly unfolding events, you often find your predictions come up a little off from what actually happens. Put another way, you have to assume that the prior knowledge is reliable and will hold in the future; if the underlying behavior shifts, the next results will be invalid. You want the statistical distribution you use for modeling your data to stay consistent. If you have a consistent set of data and it is not going to change much, this is a good way to go. Otherwise, you have to retrain every time you add new data sets, particularly if the direction of the field has changed.

A more recent researcher, Peter Turney, hails from Canada and writes about learning algorithms for key phrase extraction. He says you can model extraction as a tree, making decisions as you move down it; he called this the "tree induction algorithm." He also works in lexical semantics, which is a way of saying, "As I make decisions going down that tree, things are going to change."
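A tree-induction decision process can be pictured as nested yes/no tests over features of a candidate phrase; here is a toy, hand-built sketch (the features, names, and thresholds are our invention, not Turney's actual learned trees):

```python
def is_keyphrase_candidate(phrase, first_occurrence_ratio, frequency):
    """Walk a tiny hand-built decision tree over invented features.
    Real tree induction learns these tests from training data."""
    if frequency < 2:                    # rare phrases are unlikely keyphrases
        return False
    if first_occurrence_ratio < 0.2:     # phrase appears early in the document
        return True
    return len(phrase.split()) >= 2      # later phrases must be multiword

print(is_keyphrase_candidate("machine learning", 0.5, 3))  # True
print(is_keyphrase_candidate("the", 0.9, 50))              # False
```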


Extraction vs. generation and sentiment of words:

        (hits(word AND "excellent") x hits("poor"))
log2  -----------------------------------------------
        (hits(word AND "poor") x hits("excellent"))

His learning algorithms for key phrase extraction, and the formula he came up with, give you an idea of how you can do plain extraction of data from a system versus trying to generate even sentiment from the words that are there. He reported about 80% accuracy with these results, better than the roughly 60% from Bayes.
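The formula above can be evaluated directly once you have the four hit counts; here is a minimal sketch with invented counts (in practice the counts come from a search engine, and the numbers below are purely illustrative):

```python
import math

def semantic_orientation(hits_word_excellent, hits_poor,
                         hits_word_poor, hits_excellent):
    """log2 of the hit-count ratio in the formula above.
    Positive -> the word leans toward "excellent"; negative -> "poor"."""
    return math.log2((hits_word_excellent * hits_poor) /
                     (hits_word_poor * hits_excellent))

# Invented hit counts for illustration only.
so = semantic_orientation(hits_word_excellent=640, hits_poor=1000,
                          hits_word_poor=80, hits_excellent=1000)
print(so)  # 3.0 -> strongly positive sentiment
```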

Another researcher, Marco Dorigo, research director for the Belgian Fonds de la Recherche Scientifique and research director of the IRIDIA lab at the Université Libre de Bruxelles, talks about swarm intelligence: if you look at the way things move and change, you could turn that information on a dime if you knew which way it was going to move. There is the data itself, and then there is the way the data is used, what is important about it at the moment, what is heuristically important. Ant colony optimization is a metaheuristic for combinatorial optimization problems that uses "swarm intelligence." It makes statements about value importance versus heuristic importance and is therefore useful in search prediction. Look, for example, at the way people analyze Twitter feeds; that is ant colony optimization, that is swarm intelligence. Suddenly a Twitter stream emerges out of nowhere and gathers importance very, very quickly, just like a bunch of ants suddenly attacking a piece of peanut butter and jelly sandwich that landed on the ground only minutes ago.
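The core of ant colony optimization is a pheromone loop: deposits reinforce good routes while evaporation forgets stale ones. Here is a deterministic toy sketch of that feedback (two routes, invented lengths and rates; real ACO uses probabilistic ants and real cost functions):

```python
# Two routes to the "food"; each round, the ant traffic splits in
# proportion to pheromone, pheromone evaporates, and traveled routes
# get a deposit inversely proportional to their length.
pheromone = {"short": 1.0, "long": 1.0}
lengths = {"short": 1.0, "long": 2.0}
EVAPORATION = 0.1  # fraction of pheromone lost per round

for _ in range(50):
    total = sum(pheromone.values())
    shares = {r: pheromone[r] / total for r in pheromone}  # traffic split
    for route in pheromone:
        pheromone[route] *= (1 - EVAPORATION)              # evaporation
        pheromone[route] += shares[route] / lengths[route] # deposit

print(pheromone["short"] > pheromone["long"])  # True: traffic converges
                                               # to the shorter path
```

The positive feedback is the interesting part: the shorter route collects more pheromone per round, which draws more traffic, which deposits still more pheromone, exactly the swarm behavior described above.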

Another big area that we know about is natural language processing (NLP). Natural language processing is frequently used these days in conjunction with another system. How much weight each of the main pillars of natural language processing carries depends on the researcher and the system being built.

  1. Syntactics: the rules of the language and how they govern sentences in any individual language.
  2. Semantics: the words themselves and how they are stated and behave.
  3. Morphology: the forms of those words, the singulars and plurals and other variations.
  4. Phraseology: the use of those words in phrases.
  5. Stemming or lemmatization: cutting off endings such as the -eds and the -ings and other suffixes to get to the word root.
  6. Statistical options, as outlined above.
  7. Grammatical applications, some of which actually fully diagram sentences. Some of you are old enough to remember diagramming sentences in school.
  8. Then, at the end of the day, there is just good common sense. It is really handy to have a common-sense algorithm you can apply. That is often done in a rules base, where you say, "It is pretty clear to us how this works." Natural language processing in combination with Boolean operators, for example, makes a nice rules-based system for people.
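To make one of these pillars concrete, the suffix stripping in item 5 can be sketched naively (a toy of our own; production systems use full stemmers such as Porter's, or true lemmatizers):

```python
def naive_stem(word):
    """Strip a few common English suffixes to approximate the word root.
    A toy illustration only; real stemmers handle many more rules."""
    for suffix in ("ings", "ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

print(naive_stem("searched"))   # search
print(naive_stem("searching"))  # search
print(naive_stem("searches"))   # search
```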

That is a nice segue to automatic language processing, which is where we’ll pick up next week.

Marjorie M.K. Hlava
President, Access Innovations