At lunch today, the conversation turned to algorithmic methods of applying tags to content objects. At the table were one person deeply steeped in human-centric methods and one academic who embraced seed lists and algorithmic approaches. What was interesting was that both agreed on the need for improved tagging of content objects.

In the course of the conversation, both of my lunch partners mentioned a paper I had not read. “Semantic Taxonomy Induction from Heterogeneous Evidence” became available in 2006. It was one of the first presentations of a method that is becoming increasingly important at a certain large Web search company, if the information I gleaned from the lunch conversation is accurate.

I tracked down a copy of this paper. Here’s the abstract:

We propose a novel algorithm for inducing semantic taxonomies. Previous algorithms for taxonomy induction have typically focused on independent classifiers for discovering new single relationships based on hand-constructed or automatically discovered textual patterns. By contrast, our algorithm flexibly incorporates evidence from multiple classifiers over heterogeneous relationships to optimize the entire structure of the taxonomy, using knowledge of a word’s coordinate terms to help in determining its hypernyms, and vice versa. We apply our algorithm on the problem of sense-disambiguated noun hyponym acquisition, where we combine the predictions of hypernym and coordinate term classifiers with the knowledge in a preexisting semantic taxonomy (WordNet 2.1). We add 10,000 novel synsets to WordNet 2.1 at 84% precision, a relative error reduction of 70% over a non-joint algorithm using the same component classifiers. Finally, we show that a taxonomy built using our algorithm shows a 23% relative F-score improvement over WordNet 2.1 on an independent test set of hypernym pairs.
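
A quick back-of-the-envelope check helps me read the two headline numbers in that abstract together. The sketch below assumes that "error" means one minus precision and that "relative error reduction" means the drop in error divided by the baseline error. Both definitions are my assumptions, and the function name is my own invention. The abstract does not state the baseline precision, so the result is only what the quoted figures would imply under those assumptions.

```python
def implied_baseline_precision(joint_precision: float,
                               relative_error_reduction: float) -> float:
    """Back out the baseline precision implied by a relative error reduction.

    Assumes error = 1 - precision and
    reduction = (baseline_error - joint_error) / baseline_error.
    """
    if not 0.0 <= relative_error_reduction < 1.0:
        raise ValueError("relative_error_reduction must be in [0, 1)")
    joint_error = 1.0 - joint_precision
    # Rearranged from the definition: joint_error = baseline_error * (1 - reduction)
    baseline_error = joint_error / (1.0 - relative_error_reduction)
    return 1.0 - baseline_error


# Figures quoted in the abstract: 84% precision, 70% relative error reduction.
implied = implied_baseline_precision(0.84, 0.70)
print(f"Implied baseline precision: {implied:.1%}")
```

If those assumptions hold, the non-joint algorithm would have been right a bit less than half the time on the same task, which is why the jump to 84% is worth a second look.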

If you want to get a copy of this interesting paper by Rion Snow, Daniel Jurafsky, and Andrew Ng, click here. Hurry, the document could be removed without warning.

Stephen E Arnold, July 1, 2010