Some vendors of text analytics software claim that their software can identify occurrences of text reflecting specific taxonomy terms using “fuzzy matching” or “fuzzy term matching,” with the strong, and false, implication that it identifies all such occurrences. Explanations of the technology from Techopedia and Wikipedia show that it is a fairly crude mathematical approach, similar to the co-occurrence statistical approaches that such software also tends to use, and no match for rule-based indexing approaches that derive their effectiveness from human intelligence.
I remember searching online for information about the Data Harmony Users Group (DHUG) meeting. Google, in its infinitely fuzzy wisdom, asked, “Did you mean ‘thug’?”
As explained in Techopedia,
Fuzzy matching is a method that provides an improved ability to process word-based matching queries to find matching phrases or sentences from a database. When an exact match is not found for a sentence or phrase, fuzzy matching can be applied. Fuzzy matching attempts to find a match which, although not a 100 percent match, is above the threshold matching percentage set by the application.
Fuzzy matching is mainly used in computer-assisted translation and other related applications.
Fuzzy matching searches a translation memory for a query’s phrases or words, finding derivatives by suggesting words with approximate matching in meanings as well as spellings.
The fuzzy matching technique applies a matching percentage. The database returns possible matches for the queried word between a certain percentage (the threshold percentage) and 100 percent.
So far, fuzzy matching is not capable of replacing humans in language translation processing.
And the Wikipedia article on the subject explains as follows:
“The closeness of a match is measured in terms of the number of primitive operations necessary to convert the string into an exact match. This number is called the edit distance between the string and the pattern. The usual primitive operations are:
- insertion: cot → coat
- deletion: coat → cot
- substitution: coat → cost
These three operations may be generalized as forms of substitution by adding a NULL character (here symbolized by *) wherever a character has been deleted or inserted:
- insertion: co*t → coat
- deletion: coat → co*t
- substitution: coat → cost
- …
The most common application of approximate matchers until recently has been spell checking.”
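The edit distance described in the Wikipedia excerpt can be sketched in a few lines of Python. This is a minimal textbook implementation (the classic dynamic-programming formulation usually called Levenshtein distance), not the code of any particular text analytics product; the function name `edit_distance` is my own.

```python
def edit_distance(source: str, target: str) -> int:
    """Count the insertions, deletions, and substitutions needed to turn source into target."""
    # previous[j] is the distance between the part of source processed so far
    # and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning i characters into an empty string takes i deletions
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # substitute s_char with t_char (or keep a match)
            ))
        previous = current
    return previous[-1]


# The three examples quoted above each take exactly one operation.
for source, target in [("cot", "coat"), ("coat", "cot"), ("coat", "cost")]:
    print(source, "->", target, ":", edit_distance(source, target))
```

Note that `edit_distance("dhug", "thug")` is also 1, which is all the measure knows about how those two strings relate.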
Blog commenters have pointed out the large number of false positives this would create. As a small example of the possibilities, think what would happen with the American Society of Civil Engineers (ASCE) thesaurus and “bridging the gap”: a fuzzy matcher could easily treat “bridging” as a variant of a bridge-related term. Not to mention the many, many words and acronyms that a rule-based approach could easily disambiguate.
And then there are all the occurrences of the concept that would be totally missed by the fuzzy matching approach. Not all synonyms, nor all the other kinds of expressions of various concepts, are lexical variants that are similar character strings. Fuzzy matching has no way of dealing with these alternative expressions, and they happen often.
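Both failure modes can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the term “Bridges,” the 0.6 threshold, the word “overpass” as an alternative expression of the concept, and the use of Python's `difflib.SequenceMatcher` ratio as a stand-in for whatever similarity score a given product actually applies.

```python
from difflib import SequenceMatcher

THRESHOLD = 0.6  # hypothetical cut-off; real products set their own


def similarity(word: str, term: str) -> float:
    """Character-level similarity between 0.0 and 1.0, ignoring case."""
    return SequenceMatcher(None, word.lower(), term.lower()).ratio()


def fuzzy_hits(text: str, term: str, threshold: float = THRESHOLD) -> list[str]:
    """Return the words in text whose similarity to term meets the threshold."""
    words = [w.strip(".,;:") for w in text.split()]
    return [w for w in words if similarity(w, term) >= threshold]


# False positive: a figurative phrase is matched to the (assumed) term "Bridges".
print(fuzzy_hits("Bridging the gap between research and practice", "Bridges"))

# Miss: a different word for the same kind of structure shares too few characters.
print(fuzzy_hits("The overpass was inspected in May.", "Bridges"))
```

The first call reports “Bridging” as a hit and the second reports nothing, because the score reflects only shared characters, not meaning.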
There is another problem with these approaches. They are sometimes tied in with weighting, with the “edit distance” mentioned in the Wikipedia article used to downgrade the supposed relevance of a lexical variant. Why on earth should a variant be downgraded, if the intended concept is completely identical to the one expressed by the corresponding preferred term?
The fuzzy approach does not save human time and effort. Rules covering a wide variety of lexical variants can be written using truncated strings as the basic text to match, and proximity conditions added as desired to make those rules more accurate.
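As a rough sketch of what such a rule does, here is a Python approximation. It is not the syntax of any actual rule-based indexing product; the term, the truncated string `bridg`, the excluded words, and the three-word window are all hypothetical choices made for illustration.

```python
import re


def index_bridges(text: str, window: int = 3) -> bool:
    """Decide whether to index text under the (assumed) term "Bridges"."""
    words = re.findall(r"[a-z]+", text.lower())
    for i, word in enumerate(words):
        # Truncated string: covers bridge, bridges, bridging, bridged, ...
        if not word.startswith("bridg"):
            continue
        # Proximity condition: skip the figurative "bridging the gap" usage
        # when "gap" or "gaps" appears within `window` words on either side.
        nearby = words[max(0, i - window): i + window + 1]
        if "gap" in nearby or "gaps" in nearby:
            continue
        return True
    return False


print(index_bridges("Bridging the gap between research and practice"))  # False
print(index_bridges("Inspectors examined three bridges in May"))        # True
```

One truncated string handles the lexical variants, and one human-written condition removes the false positive that a similarity score alone cannot see.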
Sometimes (in fact, rather frequently), a concept is expressed in separate parts over the course of a sentence, a paragraph, or a larger span of text. I am not aware of any way for fuzzy matching to deal with that. A rule-based approach can.
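A minimal sketch of the idea, again in Python rather than in any real rule language: the concept “bridge failure,” the stems, and the choice of the paragraph as the span are all assumptions for illustration.

```python
import re


def mentions_bridge_failure(paragraph: str) -> bool:
    """True when both parts of the (hypothetical) concept occur in the same paragraph."""
    words = set(re.findall(r"[a-z]+", paragraph.lower()))
    has_structure = any(w.startswith("bridg") for w in words)
    has_failure = any(w.startswith(("collaps", "fail")) for w in words)
    return has_structure and has_failure


text = ("The bridge had carried traffic for forty years. "
        "Last spring, after heavy flooding, it collapsed.")
print(mentions_bridge_failure(text))                        # True
print(mentions_bridge_failure("The bridge was repainted."))  # False
```

No single string in the first paragraph resembles “bridge failure,” so there is nothing for a character-level comparison to score, yet a rule that looks for the parts separately finds the concept.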
In short, fuzzy matching has serious deficiencies as far as indexing and search are concerned, and is vastly inferior to a rule-based approach.
Barbara Gilles, Taxonomist
Access Innovations