Using N-grams to discover taxonomy terms

Words that occur together frequently are likely to encode important concepts. Therefore, simply sorting a list of phrases by how often they occur in a body of text is an automatic way of capturing the important concepts associated with the subject of that text. Word order is important since meaning tends to be associated with order. Thus, word order must be preserved when creating phrases from text.

The basic idea of N-gram analysis is to count phrases consisting of N sequential words in a document. Sorting all of the phrases of all sizes contained in a corpus of documents by frequency yields a list of candidate phrases that are likely to encode concepts important in the corpus. Frequency of occurrence tends to be correlated with the importance of the concept, though the phrases at the very highest frequency levels are often less than helpful. They may simply be commonly seen phrases, so frequency can't be taken on its own; the human element must still come into play.
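The counting and sorting described above can be sketched in a few lines of Python. This is a minimal illustration, not a production tool: the function names (`tokenize`, `count_ngrams`), the `max_n` parameter, and the two sample sentences are assumptions made up for the example, and the tokenizer is deliberately crude (lowercase words only, no handling of sentence boundaries or accented characters).

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into words, preserving word order."""
    return re.findall(r"[a-z0-9']+", text.lower())


def count_ngrams(documents, max_n=3):
    """Count every phrase of 1..max_n consecutive words across all documents.

    Phrases never span two documents, but this sketch does let them
    span sentence boundaries within a document.
    """
    counts = Counter()
    for text in documents:
        words = tokenize(text)
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    return counts


# Hypothetical two-document corpus, for illustration only.
documents = [
    "Cold fusion was reported in a nuclear physics journal.",
    "The nuclear physics community could not reproduce cold fusion.",
]

counts = count_ngrams(documents, max_n=2)

# Candidate phrases, most frequent first.
for phrase, freq in counts.most_common(5):
    print(freq, phrase)
```

Even on this toy corpus the sorted list mixes useful candidates with single words that repeat for uninteresting reasons, which is why the list is a starting point for a taxonomist rather than a finished vocabulary.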

Concepts are not necessarily captured by phrases of a particular length. Indeed, some concepts might be complex enough to require several sentences to describe them. Therefore, it is appropriate to explore phrases of various lengths when searching for concepts in text. Of course, human propensity for acronyms and economy of communication tends to drive the representation of important concepts toward shorter words or phrases.

Thus there are two opposing forces at work in the representation of complex ideas: short sequences of characters that need to be supported by a large dictionary of complex concepts, and long sequences of words that can be supported by a smaller dictionary of simpler concepts.

A fine example of this is the comparison of the Windows and Linux operating systems. There is a stark contrast between the point-and-grunt paradigm of Windows, where enormously complex concepts are embodied in the act of pointing to a simple button, and the verbosity of Linux, where paragraphs of text are used to describe the same operation.

New ideas can be discovered by finding combinations of words that have not been seen before or that are occurring with higher (or lower) frequency than in the past. Therefore, having the capability of detecting changes in the frequency of occurrence of phrases can be a path toward discovery of new or evolving concepts. In addition, when starting a new taxonomy, useful groups of words can be selected from the N-grams as a starting point for the taxonomy.

N-grams are not only good for discovering new concepts, though. Equally important is the ability to use N-grams to discover concepts that are no longer being discussed. Take a journal on nuclear physics. As early as the 1920s, scientists were hypothesizing on the concept that would come to be known as “cold fusion.” Papers were published on the topic all the way into the late 1980s, when Martin Fleischmann and Stanley Pons drew wide media attention after reporting that their experiment actually worked. The idea was a cause for celebration in the wake of rising energy costs and the need for cheap, clean energy.

Had N-grams been run on the corpus of that nuclear physics journal in 1989, “cold” and “fusion” would frequently have been seen together, making the phrase an obvious choice for a candidate term in a taxonomy. Within a year, however, after nobody could repeat the Fleischmann-Pons experiment, the concept was debunked and quickly considered a joke. Afterward, almost nobody wrote on the concept and it became extremely rare to find anything on the topic in a reputable journal. Running N-grams on the same journal today would reveal it as a concept that may well no longer belong in the vocabulary at all. N-grams have considerable value in understanding the evolution of a single concept or an entire discipline.
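Comparing the same source at two points in time, as in the cold fusion story, amounts to subtracting one set of phrase frequencies from another. The sketch below is one simple way to do that, assuming two-word phrases and relative frequencies (a phrase's count divided by the total number of phrases) so that periods of different size can be compared. The function names and the two tiny sample texts are hypothetical.

```python
import re
from collections import Counter


def bigram_rates(text):
    """Return each two-word phrase's share of all two-word phrases in the text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter(" ".join(words[i:i + 2]) for i in range(len(words) - 1))
    total = sum(counts.values()) or 1  # avoid dividing by zero on empty text
    return {phrase: count / total for phrase, count in counts.items()}


def frequency_changes(earlier_text, later_text):
    """Rank phrases by change in relative frequency, biggest decline first."""
    before = bigram_rates(earlier_text)
    after = bigram_rates(later_text)
    changes = {
        phrase: after.get(phrase, 0.0) - before.get(phrase, 0.0)
        for phrase in set(before) | set(after)
    }
    return sorted(changes.items(), key=lambda item: item[1])


# Hypothetical stand-ins for an earlier and a later slice of a journal.
earlier = "cold fusion results cold fusion claims"
later = "nuclear physics results nuclear physics claims"

ranked = frequency_changes(earlier, later)
print("fading:", ranked[0][0])     # phrase with the largest drop
print("emerging:", ranked[-1][0])  # phrase with the largest rise
```

Phrases at the top of the ranking are candidates for retirement from the vocabulary, and phrases at the bottom are candidates for new terms. In both cases a taxonomist still has to judge whether the change is meaningful.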

They can also be extremely useful in limiting the concepts in a taxonomy to those that are actually needed. Say, for instance, I'm building a taxonomy of food for a website and I come to a branch for “Cheese.” There are thousands of different styles of cheese and I could fairly easily get a list of cheeses and add them all to the branch. That's simple, but extremely time-consuming and, ultimately, not very useful. If there is no content on this website about Abondance, an excellent but relatively unpopular cheese, and nobody is searching for content about it, why would it be in the taxonomy? It'll just sit there uselessly. The answer, of course, is to run N-grams on the site content and the visitor search logs. The cheeses that appear in the results are the ones that could be considered highly useful in the taxonomy, helping to keep it clean, concise and, especially, relevant to your content.
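The cheese example can be sketched as a filter: count the phrases in the site content and the search logs, then keep only the candidate terms that actually show up. Everything here is illustrative. The function names, the `min_count` threshold, and the sample pages and queries are assumptions, and a real site would need better tokenization and some handling of plurals and misspellings.

```python
import re
from collections import Counter


def words_of(text):
    """Lowercase the text and split it into words, preserving word order."""
    return re.findall(r"[\w']+", text.lower())


def phrase_counts(texts, max_n=3):
    """Count every phrase of 1..max_n consecutive words in a list of texts."""
    counts = Counter()
    for text in texts:
        words = words_of(text)
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    return counts


def useful_terms(candidates, site_content, search_queries, min_count=1):
    """Keep candidates seen at least min_count times in content plus searches."""
    content = phrase_counts(site_content)
    searches = phrase_counts(search_queries)
    kept = []
    for term in candidates:
        key = " ".join(words_of(term))  # normalize the term like the text
        if content[key] + searches[key] >= min_count:
            kept.append(term)
    return kept


# Hypothetical candidate list, site pages, and visitor search log.
cheeses = ["Abondance", "Brie", "Cheddar", "Monterey Jack"]
pages = ["Our guide to aging cheddar at home.", "Brie pairs well with fruit."]
queries = ["monterey jack recipes", "cheddar"]

print(useful_terms(cheeses, pages, queries))
```

Raising `min_count` makes the filter stricter, which is one way to keep a large branch down to the terms that content and visitors actually support.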

N-grams may not be perfect, but they're a great beginning to a controlled vocabulary. Their quick analysis is brilliant for going through large volumes of content, but they still absolutely require the human element to be useful. We at Access Innovations use N-grams in close conjunction with our taxonomists to help bring out the most in our clients' content.

Daniel Vasicek, Programmer
Daryl Loomis, Business Development
Access Innovations
