It is difficult to find a list of taxonomy management software that is both comprehensive and up to date, yet not overwhelmed with related products and services. For a long time, perhaps the most comprehensive directory of taxonomy software was that of the British consultant Leonard Will, who has since retired. Considering the valuable and respected content, we at Access Innovations recognize our good fortune, and our huge responsibility, in now hosting and maintaining the Willpower thesaurus software directory. The history of how this came to be can be found in The Accidental Taxonomist blog post titled “Taxonomy Software Directories.”
The core of TaxoBank’s directory, “Software for building and editing thesauri,” is at present still essentially the same as the Willpower site, with some additions and changes here and there.
Melody K. Smith
Sponsored by Data Harmony, a unit of Access Innovations, the world leader in indexing and making content findable.
Learn from professionals with decades of experience, all while enjoying Canadian hospitality in beautiful Vancouver. The “Introduction to Taxonomies” all-day workshop features our own Marjorie Hlava and Bob Kasenchak, and is part of the Special Libraries Association (SLA) Annual Conference in Vancouver, June 6-10, 2014. The SLA Annual Conference is an excellent international venue for learning new ideas and identifying information trends.
The workshop introduces participants to the basic methodologies and techniques of taxonomy development and provides an overview of taxonomy standards and their application in search, web sites, publishing, retail and e-commerce, records management, and other organizational needs. After learning about the principles and core standards of controlled vocabularies, participants will explore key concepts of taxonomies, thesauri, indexing, classification, and filtering. Discussion will include the basics of a taxonomy record and fundamental term relationships. Attendees will put concepts into practice through multiple exercises, including creating a simple taxonomy.
Register for the workshop here.
Melody K. Smith
Sponsored by Access Innovations, the world leader in taxonomies, metadata, and semantic enrichment to make your content findable.
Some vendors of text analytics software claim that their software can identify the occurrences of text reflecting specific taxonomy terms (with the strong, and false, implication that it identifies all such occurrences) using “fuzzy matching” or “fuzzy term matching.” Some explanations of the technology, from Techopedia and Wikipedia, show that it is a fairly crude mathematical approach, similar to the co-occurrence statistical approaches that such software also tends to use, and no match for rule-based indexing approaches that derive their effectiveness from human intelligence.
I remember searching online for information about the Data Harmony Users Group (DHUG) meeting. Google, in its infinitely fuzzy wisdom, asked, “Did you mean ‘thug’?”
As explained in Techopedia,
Fuzzy matching is a method that provides an improved ability to process word-based matching queries to find matching phrases or sentences from a database. When an exact match is not found for a sentence or phrase, fuzzy matching can be applied. Fuzzy matching attempts to find a match which, although not a 100 percent match, is above the threshold matching percentage set by the application.
Fuzzy matching is mainly used in computer-assisted translation and other related applications.
Fuzzy matching searches a translation memory for a query’s phrases or words, finding derivatives by suggesting words with approximate matching in meanings as well as spellings.
The fuzzy matching technique applies a matching percentage. The database returns possible matches for the queried word between a certain percentage (the threshold percentage) and 100 percent.
So far, fuzzy matching is not capable of replacing humans in language translation processing.
And the Wikipedia article on the subject explains as follows:
“The closeness of a match is measured in terms of the number of primitive operations necessary to convert the string into an exact match. This number is called the edit distance between the string and the pattern. The usual primitive operations are:
- insertion: cot → coat
- deletion: coat → cot
- substitution: coat → cost
These three operations may be generalized as forms of substitution by adding a NULL character (here symbolized by *) wherever a character has been deleted or inserted:
- insertion: co*t → coat
- deletion: coat → co*t
- substitution: coat → cost
The most common application of approximate matchers until recently has been spell checking.”
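To make the “edit distance” idea concrete, here is a minimal sketch in Python of the classic dynamic-programming (Levenshtein) computation, paired with a threshold-based matcher of the kind the Techopedia description implies. The function names and the 0.8 threshold are illustrative only, not any particular vendor’s implementation.

```python
def edit_distance(s, t):
    # Classic dynamic-programming Levenshtein distance: counts the
    # insertions, deletions, and substitutions needed to turn s into t.
    m, n = len(s), len(t)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n]

def fuzzy_match(candidate, pattern, threshold=0.8):
    # A match "percentage" derived from edit distance, with 1.0 meaning
    # identical; returns True when the score clears the threshold.
    distance = edit_distance(candidate, pattern)
    score = 1 - distance / max(len(candidate), len(pattern))
    return score >= threshold
```

With this scoring, “cot” and “coat” are one edit apart, and “whisky” matches “whiskey” at roughly 86 percent, which is exactly the kind of lexical-variant match, and nothing more, that fuzzy matching delivers.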
Commenters on blog posts have noted the large number of false positives this would create. As a small example of the possibilities, think what would happen with the American Society of Civil Engineers (ASCE) thesaurus and “bridging the gap.” Not to mention the many, many words and acronyms that a rule-based approach could easily disambiguate.
And then there are all the occurrences of the concept that would be totally missed by the fuzzy matching approach. Not all synonyms, nor all the other kinds of expressions of various concepts, are lexical variants that are similar character strings. Fuzzy matching has no way of dealing with these alternative expressions, and they happen often.
There is another problem with these approaches. They are sometimes tied in with weighting, with the “edit distance” mentioned in the Wikipedia article used to downgrade the supposed relevance of a lexical variant. Why on earth should a variant be downgraded, if the intended concept is completely identical to the one expressed by the corresponding preferred term?
The fuzzy approach does not save human time and effort. Rules covering a wide variety of lexical variants can be written using truncated strings as the basic text to match, and proximity conditions added as desired to make those rules more accurate.
Sometimes (in fact, rather frequently), a concept is expressed in separate parts over the course of a sentence, a paragraph, or a larger span of text. There is no way that I’m aware of that fuzzy matching can deal with that. A rule-based approach can.
In short, fuzzy matching has serious deficiencies as far as indexing and search are concerned, and is vastly inferior to a rule-based approach.
Barbara Gilles, Taxonomist
As noted last week in our article, “A Spirit of Another Name,” Saveur has created a glossary of whisk(e)ys. However, as we all know, a glossary does not a taxonomy make – it can, however, be a good starting point.
One of the problems, of course, is that national styles – and even spellings – are mutable.
In general, “whiskey” comes from Ireland and the United States, while “whisky” (no ‘e’) comes from Canada and Scotland. However, well-known bourbon Maker’s Mark long ago decided to buck the semantic trend and drop the “e” despite being an all-American brand.
The Japanese usually use the Scottish version (being heavily influenced by Scotch). However, there are now whisk(e)ys being made in France, Wales, Germany, Australia, Finland, India, Sweden, Spain, and the Czech Republic, to name a few. Adoption of one or the other spelling variant is, well, varied.
Besides, once we accept that “whiskey” and “whisky” are synonyms, the spelling has little to do with the semantics.
It’s more important to understand the legal (and in some cases, traditional but perhaps not codified) production requirements that define the various styles; these are often (but not always) defined by name and region.
For example: Canadian whisky is, by law, allowed to have up to 9.09% “flavorings” – a category of adulterants with no formal definition, but one that in practice includes artificial colors, many different sweeteners, prune and other fruit juices, etc. – and this is what gives Canadian whisky its characteristic sweet taste. Scotch whisky can have caramel color added, but no flavorings. Bourbon, on the other hand, can be cut to proof with water, but must by law have no additives for either color or flavor.
Straight Bourbon Whiskey is, in fact, the most strictly defined and regulated of the whiskeys – although, contrary to common belief, it need not be made in Kentucky. It must, however, be produced in the United States from spring water; the mash (mixture of grains) must comprise at least 51% corn (the rest being barley, wheat, and rye); it must be aged no less than 24 months in new charred American white oak barrels; and of course it must not contain any additives. (For the record, Rye is identical to Bourbon with the very important exception that its mash must contain no less than 51% rye.)
Other factors are also in play. For example, Irish whiskey is almost always triple-distilled, while Scotch is almost always double-distilled. Scotch is further delineated by region (Highland, Campbeltown, Islay, Highland Islands, Lowland; Speyside is a sub-region of Highland) and by drying method (whether peat, gas, or coal is used to dry the grain to stop the germination process). Then there are the various permutations of blends, single malts, and vatted malts (by many names), not to mention other variants such as single-barrel, cask-strength, and various “finishes” in casks that formerly held other kinds of spirits.
Now that is a categorization problem.
In constructing a taxonomy of whisk(e)y, a faceted approach might be best. However, given the limited space here, let’s just take a quick crack. (The Top Term is of course “Whiskey” UF=Whisky.)
. . Blended Whiskey
. . Grain whiskey
. . . . Corn whiskey
. . Vatted Malts [UF=Blended Malts SN=Composed of various single malts, no “grain”]
. . Single Malts
. . Peated Whiskey [RT=Islay Whisky]
. . Pot still Whiskey
. . Single barrel Whiskey [UF=Single-barrel Whiskey]
. . Small batch Whiskey
. . Whiskey Distillation
. . . . Double Distillation [UF=Double-distilled]
. . . . Triple Distillation [UF=Triple-distilled]
. . American Whiskey
. . . . Bourbon Whiskey
. . . . California Whiskey
. . . . Oregon Whiskey
. . . . [add other states as necessary]
. . . . Rye Whiskey
. . . . Tennessee Whiskey
. . Australian Whiskey
. . Canadian Whisky
. . European Whiskeys
. . . . Czech Whisky
. . . . Finnish Whisky
. . . . French Whisky
. . . . German Whisky
. . . . Irish Whiskey
. . . . Scotch Whisky
. . . . . . Campbeltown Whisky
. . . . . . Highland Whisky
. . . . . . . . Speyside Whisky
. . . . . . . . Highland Island Whisky
. . . . . . Islay Whisky
. . . . . . Lowland Whisky
. . . . Welsh Whisky
. . . . Spanish Whisky
. . . . Swedish Whisky
. . . . [add others as needed]
. . Indian Whisky
. . Japanese Whisky
. . Cask Strength
. . Overproof [SN=95 proof or higher]
. . Standard proof [SN=80 to 94 proof*]
*this is a little arbitrary but reflects industry norms
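A hierarchy like the one sketched above can be captured in a few lines of code. The Python fragment below is purely illustrative – the field names follow common thesaurus conventions (BT for broader term, UF for used for, SN for scope note), and only a handful of the terms are included:

```python
# Illustrative storage for a few of the whiskey terms above; field names
# (BT, UF, SN) mirror standard thesaurus relationships.
thesaurus = {
    "Whiskey": {"BT": None, "UF": ["Whisky"]},
    "European Whiskeys": {"BT": "Whiskey"},
    "Scotch Whisky": {"BT": "European Whiskeys"},
    "Highland Whisky": {"BT": "Scotch Whisky"},
    "Speyside Whisky": {"BT": "Highland Whisky"},
    "Vatted Malts": {"BT": "Whiskey", "UF": ["Blended Malts"],
                     "SN": "Composed of various single malts, no grain"},
}

def broader_chain(term):
    # Walk up the BT links from a term to the top term.
    chain = []
    while term is not None:
        chain.append(term)
        term = thesaurus[term].get("BT")
    return chain
```

Walking the chain for “Speyside Whisky” climbs through Highland Whisky and Scotch Whisky up to the top term – which is exactly how a narrower term inherits the properties of its broader terms.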
Clearly I’m missing cask finishes (mostly in Scotch, but now reaching Bourbon territory) and a few other things. (Hey, it’s just a blog post.) Ages, also, could quickly become a problem: no one wants a list of cardinal numbers in their thesaurus.
The various brands could, then, be narrower terms in the hierarchy I’ve sketched out above.
In order to avoid massive categorization issues and massive duplication (instead of going Netflix-style – “Single malt overproof cask-finished Campbeltown Scotch whisky” is a pretty unwieldy taxonomy term), you’d have to apply multiple labels to categorize each individual item. Imagining this would be most useful for e-commerce (as opposed to scholarly document categorization) helps: think about browseable tabs on a website. You’d want to find Laphroaig under both “Peated Whiskeys” and “Islay Whiskys” to allow people to find what they were looking for using multiple approaches.
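As a sketch of that multiple-label idea, here is how a couple of catalog entries might carry one value per facet; the facet names and tags are invented for illustration, not a real product catalog:

```python
# Each bottle gets one value per facet, so it can surface under any of
# them -- the "browseable tab" behavior described above.
bottles = {
    "Laphroaig 10": {"region": "Islay Whisky",
                     "style": "Single Malts",
                     "character": "Peated Whiskey"},
    "Maker's Mark": {"region": "Bourbon Whiskey",
                     "style": "Small batch Whiskey",
                     "character": "Standard proof"},
}

def browse(facet, value):
    # Return every bottle carrying the given facet value.
    return sorted(name for name, tags in bottles.items()
                  if tags.get(facet) == value)
```

The same bottle then appears under every facet value it carries, with no duplicated records and no unwieldy compound terms.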
This is why I described it as a “faceted approach.” But let’s not get into that now. For the same reason, though, I’m going to stop while I’m ahead.
Bob Kasenchak, Project Coordinator
Let’s call him George. George was having a very bad day. He needed legal advice. So, over his lunch hour he scheduled an appointment. After George described his situation, the lawyer pulled a book from the shelf behind him. After briefly scanning it and checking precedent, he confidently told George with a smile, “Relax. This is a slam dunk.” On his way out of the office, George gazed at the consultation invoice and muttered to himself: “15 minutes! Only 15 minutes and he charged me $325.00!”
On his way back to the office, George’s car started making some serious grinding noises whenever he turned to the right. He pulled into the local Fix-It-All Garage and described the noise to the technician. After turning over the keys, George looked through the large glass window as the mechanic pulled here and tugged there at his car up on the lift. After only a few keystrokes at his computer station, the technician began installing over the next ten minutes what looked to George like a $10.00 part. In the blink of an eye, George was standing at the counter with another invoice. He called his wife to grumble: “He pulled and tugged in two different spots and then charged me $325.00. I’m in the wrong line of work!”
Certain that his ulcer was acting up, George stopped at the clinic on his way home that day. The doctor, who agreed to fit him in right away, asked a few short questions, consulted his desk reference guide, and started writing a prescription. Moments later, speechless George could only grimace as he faced yet another hefty bill. Poor George.
Besides a considerable amount of cash, what was George missing? Someone might say that George knew the cost of everything, but the value of nothing. Because George was able to successfully confront and overcome several perplexing and complex problems, someone else might say, “What a great quality of life George has!” It all looked so deceptively easy. However, to focus only on the “interface” is to fail to consider the years of training and experience behind each professional who knew just what questions to ask, just where to look and pull and tug, and just which resource to consult.
How does one measure the true value of successful information organization, navigation, and retrieval? Access Innovations, Inc. offers superior customer service, ease of product use, and support, combined with years of experience, in order to provide outstanding quality. Speak with the CEO of Access Innovations, Inc., Jay Ven Eman, about the qualitative and quantitative criteria used to assess successful KOSs and the proper rationale for measuring ROI in your setting. Are you getting real value for the cost?
Eric Ziecker, Information Consultant
Access Innovations, Inc.
Of course, records play a vital role in the litigation process, as any records management professional would affirm. But what exactly is their role?
Attend the 2014 ARMA Rio Grande conference, “Information Governance: ORDER! Are you ready for court?”, and you can find out. A substantial line-up of speakers will share first-hand lessons on how records are used and how a good records management program assists in the litigation process.
Our own Margie Hlava will be speaking about the basics of building a functional taxonomy and examining how good taxonomy structure contributes to eDiscovery success. She will share how a well-built taxonomy is part of the foundation for information architecture that underlies content management systems (CMS), web sites, corporate intranets, search retrieval, and access to relevant content in databases.
Melody K. Smith
Sponsored by Access Innovations, the world leader in thesaurus, ontology, and taxonomy creation and metadata application.
Baseball and biology are not commonly found in the same conceptual space. Neither do you find taxonomy associated with baseball, but in recent news these connections were made. Grant Bisbee, editor of “Baseball Nation”, digresses into the arcane as he laments the coming of the “He’s In the Best Shape of His Life” season. This is the time of year baseball writers must assess the prospects for the coming season, and clichés and hyperbole reign. The dubious practice of evaluating the physical condition of players runs rampant as spring training begins. With tongue in cheek, Bisbee tries to shape a taxonomy to classify this spring ritual. His would be the taxonomy of the “In the Best-shape Stories”.
Bisbee suggests three categories: “Player X Got in Shape”, “Fixed Eyesight”, and “A Serious, Previously Undiscovered Affliction”. He compiles representative stories for each classification. He is somewhat skeptical that a player could be in the “best shape of his life” as a result of some of the reported training regimens. Can someone really put on 18 pounds of muscle in the off-season? Legally? Shouldn’t vision examinations be a regular part of baseball teams’ operations? After all, they’re paying millions for their players’ hand-eye coordination.
Are there baseball taxonomies already out there that could help Mr. Bisbee classify his apocryphal collection? A “Google” search of baseball taxonomies returns a million items more or less – plus or minus – give or take – a few hundred thousand. Not much help there. At the very least, Bisbee could expand the classification to include other sports and how each portrays their mystical, preseason rituals. How are preseason body rebuilding rhythms disturbed by “Seasonus Interruptus”?
And biology is experiencing a potential season-disrupting trend. Tom Spears, reporting for the “Ottawa Citizen”, filed the story “Taxing times for taxonomy”. On a more serious note, Mr. Spears chronicles the demise of field-based taxonomic study in biology in favor of lab work. Computers and DNA studies have relegated classification to the closet. While the work done in laboratories is vital to the field, biologist Ernest Small is quoted: “How can you do a study of forests without knowing the trees?” The reverse is also true; you can’t really study a tree without understanding its forest. Commenting on his lab-centric colleagues, Dr. Small laments “the lack of knowledge of what plants and animals make up our world”. Spears goes on to write, “We need to understand whole species, not just genes, if we are to solve the problems of agriculture, fisheries, insect pests, ecology and the spread of diseases.”
Taxonomies are central to understanding a field as a whole, whether biology or sports. And they are critical to understanding individual members of the taxonomy. The taxonomy of plants and animals tells you about the relationships and characteristics of its members. The child, or narrower node in a taxonomy, inherits the properties of the parent, or broader node. Reviewing an entire hierarchical branch and its relationship to other branches conveys the properties, similarities, and differences of branches. This knowledge can lead to sleuthing out meaningful trends and can help explain interactions or identify possible interactions. Taxonomy for organizing sports articles might reveal interesting groupings and trends. Are there more injuries being reported this season? Are there geographic trends or team trends? Applying graphical analysis visualizes the content in ways that enhance spotting of trends.
What these two diverse stories have in common besides the core theme of taxonomies is the need for hands-on, down-and-dirty fieldwork. Sports writers need to see players perform during preseason to really assess their fitness. Biologists can learn a great deal about a species by looking through a microscope, but they can’t really understand behavior that way. They need to observe firsthand. When the field data and lab data are gathered, placing it in context gives it meaning and leads to insights. Taxonomy is ideal for grouping data in meaningful ways, and that can lead to insights. It can also lead you back to the data at some future point. Taxonomy is often focused on organizing content for search to make search easy, fast, reliable, and replicable. Taxonomy also defines a domain or conceptual space. It can help you find your way and keep you from getting lost. Equally important, it provides meaning to the conceptual space you’re wandering through. Spring is almost here – enjoy the first Danaus plexippus of the season and know it has a genus and a tribe and a family and more. Play ball!
Jay Ven Eman, CEO, Access Innovations
First published on January 30, 2012.
Data Harmony, a division of Access Innovations, Inc., has announced that version 3.9 of its suite of software tools will be released on March 31, 2014. Current customers will be the first to receive the new version.
“Our recent enhancements provide greater search efficiency while browsing your thesaurus, through callouts and hyperlinks to related and narrower terms,” said Allex Lyons, a member of the Data Harmony programming team. “Dropdown menus in the search window display subject terms with an auto-complete function that assists users of any skill or experience level.”
“The Thesaurus Master™ module is a wonderful tool for creating and maintaining taxonomies, and when it’s coupled with M.A.I.™, our customers have a world-class semantic enrichment platform,” comments Bob Kasenchak, Project Coordinator at Access Innovations. “In addition, this release of the software works with our new Semantic Fingerprinting application for creation of disambiguated author networks. For a number of reasons, author nets are becoming a critical component of the publishing workflow for many of our publishing clients.”
Access Innovations’ editorial team collaborated with Data Harmony programmers to update the user guides and improve technical documentation for all software modules. “For version 3.9, MAIstro is featuring several new thesaurus export formats, and there are great new functions inside the user interface,” noted Kirk Sanders, Project Manager, Access Innovations Editorial Department. “In addition, we’re releasing several new Web service modules that extend Data Harmony software in intriguing ways. I’m eager to see how our clients gain from the creative implementation of Data Harmony Inline Tagging, Search Harmony, and the Data Harmony Author Submission System — and the other Web services!” The Data Harmony Web service modules offer the power to configure instantaneous Web functions inside a graphical user interface based on Data Harmony software actions.
Current Data Harmony customers can expect to be contacted about version 3.9 by the end of March, according to Marjorie M. K. Hlava, President of Access Innovations. Further information will be available on the Data Harmony website at www.dataharmony.com.
Founded in 1978, Access Innovations has extensive experience with Internet technology applications, master data management, database creation, thesaurus/taxonomy creation, and semantic integration. Access Innovations’ Data Harmony software includes automatic indexing, thesaurus management, an XML Intranet System (XIS), and metadata extraction for content creation developed to meet production environment needs. Data Harmony is used by publishers, governments, and corporate clients throughout the world.
Many institutions and organizations – notably (but not limited to) publishers – have large, or sometimes very, very large, lists of names. These names come from member directories, employees and staff, clients and customers, marketing, development, and many other sources; indeed, oftentimes the lists from various departments in the same organization are not connected or resolved with one another in any way.
This growing problem has given rise to a sub-field in the information/data industry variously called “named entity disambiguation” or “author disambiguation” or “name disambiguation”, among other monikers. In the academic publishing space, disambiguation of author names is a common challenge.
In a nutshell: given a list of names – let’s say, oh, 3.2 million names – we need to determine which ones are the same person and which are not.
The goal is, as automatically as possible, to sort out which of these records should be merged. Once accomplished, you (a publisher) could make a webpage for each author listing all publications and so forth for your users to browse.
Clearly, some of the names above are potentially the same person, while others are not. For example, B. Caldwell Smith, B.C. Smith, Brandon C. Smith, and Brandon Caldwell Smith look like they might be the same person. To find out without looking at every name and every article (3.2 million, remember?), we need more information.
To accomplish this task, the metadata associated with each author is examined and compared to try to eliminate duplicates. For example, from each article we can associate an author with their co-authors, the institution with which they were affiliated when the paper was published, email addresses, dates of publication, and so forth.
Well, some things are clearer, but some are not. Whereas before we may have suspected that Rodger Smith and Roger Smith were different people, they published at the same institution in the same year; maybe it’s just a typo? And maybe Brandon C. Caldwell moved from Harvard to Yale (not unheard of) sometime between 1961 and 1972?
At Access Innovations we’ve been developing a way to add some certainty to the process using semantic metadata—it’s not a silver bullet, but it is a bigger gun. We call the process “semantic fingerprinting” and it’s based on our thesaurus and indexing technology.
Every author’s works (papers, conference proceedings, editorial roles) associate them with one or more pieces of content, and for each piece of content we have indexing terms from a thesaurus particular to that client. By associating the author directly with the indexing terms, we develop a semantic profile (or “fingerprint”) for each one. Since most authors write multiple papers (see “Lotka’s Law”), we compile the subject terms from each paper to make a more complete profile; obviously, the more papers we have, the more accurate these profiles are.
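A toy version of such a profile comparison can be sketched in Python. The names reuse the hypothetical Smiths from above, the indexing terms are invented, and the Jaccard overlap stands in for whatever similarity measure a production system would actually use:

```python
from collections import defaultdict

# (author-name string, indexing terms from one paper) -- invented data.
papers = [
    ("B. C. Smith",       {"bridges", "structural engineering"}),
    ("B. Caldwell Smith", {"bridges", "seismic design"}),
    ("Brandon C. Smith",  {"medieval literature", "poetics"}),
]

# Pool each name's terms across all of its papers into one profile.
profiles = defaultdict(set)
for author, terms in papers:
    profiles[author] |= terms

def fingerprint_similarity(a, b):
    # Jaccard overlap of the two pooled term sets: 1.0 = identical
    # subject profiles, 0.0 = no subject overlap at all.
    shared = profiles[a] & profiles[b]
    total = profiles[a] | profiles[b]
    return len(shared) / len(total)
```

Here the two bridge-engineering records overlap while the literature scholar scores zero – the kind of separation a semantic fingerprint can add on top of name and affiliation matching.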
Returning to our example:
What we suspected to be perhaps one person based on our best information turns out pretty obviously to be two distinct researchers once the areas of expertise are added to the equation.
While the process is far from foolproof, it does help to automate the disambiguation process, which cuts down on the number of human hours required to review the work.
The concept of the “semantic fingerprint” can be applied to a paper, a school, an editor, or any other entity for which subject metadata is available. So this same basic process can be used for other purposes; for example, to:
- Disambiguate institution names
- Match articles to peer reviewers or editors
- Demonstrate what areas of research are exploding at:
- A journal
- A college
- A research laboratory
As datasets get cleaner and cleaner, the accuracy of, and uses for, semantic technologies – such as Access Innovations’ Semantic Fingerprinting techniques – will continue to increase.
Bob Kasenchak, Project Coordinator
Semantic Fingerprinting image © Access Innovations, Inc.
Access Innovations’ CEO, Jay Ven Eman shared his views of taxonomies and their correlation to knowledge management in a recent article feature.
According to Ven Eman, “Well-built taxonomies are the core of successful knowledge management. We have pioneered and perfected the development and deployment of taxonomies that provide knowledge management, observing the essential guiding principles needed for effective and efficient knowledge organization. Without the purpose-built taxonomy, the language of the knowledge contributor (for instance, an author, scientist, or knowledge worker) and the language of the knowledge user (for instance, a customer) will typically fail to be bridged. If you can’t find it, it might as well not be there.”
Ven Eman pointed out that Access Innovations has built more successful enterprise taxonomies than anyone. Well-built taxonomies bridge language and usage differences, and provide accurate, consistent, persistent, and precise navigation and retrieval. We are proud to share this feature from KMWorld titled “Access Innovations, Inc., Jay Ven Eman CEO, Marjorie Hlava, President: View From the Top.”
Melody K. Smith
Sponsored by Access Innovations, the world leader in thesaurus, ontology, and taxonomy creation and metadata application.