July 19, 2010 — Xiaoxin Yin and Sarthak Shah wrote a paper published in April 2010, “Building Taxonomy of Web Search Intents for Name Entity Queries.” We revisited this paper because Yahoo announced that it was considering using popular queries as the basis for a new Yahoo news service. You can read a summary of the service here.

Microsoft’s partnership with Yahoo may make certain linguistic and content processing technology part of the partners’ search services. We dug out our copy of the Yin and Shah paper because it contains a number of interesting and thought-provoking ideas. If you have not seen this research paper, you can get a copy at this link.

What we recalled is that searches for the names of people, places, products, and things are a popular way to locate information. For example, a teen may want to find Lady Gaga’s concert dates from a mobile phone. The query “Lady Gaga” should display basic information. If the search system knows the user’s profile, concert data could be among the top results for that person’s query.

However, entities are notoriously difficult to get right. Many systems rely on controlled term lists, dictionaries, recursive extraction, and sometimes hybrid methods.

The Microsoft team applies the novel idea of building a hierarchical taxonomy of the generic search intents for a class of name entities. The idea is that intent and entities in combination provide a useful “hook” to use when responding to queries. (Queries can come from a person or from a software component.) The Microsoft methods, according to the paper:

…are purely based on search logs, and do not utilize any existing taxonomies such as Wikipedia.
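
To make the log-only idea concrete, here is a minimal sketch in Python. It is our own illustration, not the method described in the paper: it assumes a list of entity names is already known, treats whatever words remain after removing the entity from a query as an "intent phrase," and groups those phrases under their first word to form a crude two-level hierarchy. The function names and the sample log are hypothetical.

```python
from collections import Counter, defaultdict


def intent_phrases(queries, entities):
    """Count the words left over after removing a known entity name from each query."""
    counts = Counter()
    for query in queries:
        q = query.lower().strip()
        for entity in entities:
            e = entity.lower()
            # Skip queries that are only the entity name: they carry no intent words.
            if e in q and q != e:
                phrase = " ".join(q.replace(e, " ").split())
                if phrase:
                    counts[phrase] += 1
                break  # Assumption: at most one entity per query.
    return counts


def build_taxonomy(counts):
    """Group intent phrases under their first word to form a two-level tree."""
    tree = defaultdict(dict)
    for phrase, n in counts.items():
        head = phrase.split()[0]
        tree[head][phrase] = n
    return dict(tree)


# Hypothetical log entries, invented for illustration only.
log = [
    "lady gaga concert dates",
    "lady gaga concert tickets",
    "madonna concert dates",
    "lady gaga lyrics",
    "madonna lyrics",
    "lady gaga",
]
entities = ["Lady Gaga", "Madonna"]

counts = intent_phrases(log, entities)
taxonomy = build_taxonomy(counts)
print(taxonomy)
```

Because the intent phrases are pooled across every entity in the class, "concert dates" counts for both performers. That pooling is the sense in which the intents are generic to a class of entities rather than tied to one name.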

Our view is that this approach may yield some useful data. However, in our experience, Microsoft is pushing toward automation of this process. The key to success will be processing information quickly and then making those outputs available to the search system.

The weak spot in any automated system is latency. If the system lacks current entity information in today’s fast-moving data flows, the user may run a query and not get the expected results.

This challenge is one that any company relying on automated entity extraction faces. Perhaps more robust hardware and high-performance infrastructure will address the challenge?

In our opinion, humans and specialized tagging methods are required to deal with certain types of entities. On the day the iPad became available, was that entity identified and available? Our tests showed that the string “iPad” was the way to obtain the information. “Side searches” and facets did not pick up the iPad as a new class of device. No problem in consumer products. Problem in other information arenas, however.

Stephen E Arnold