Access Innovations, Inc. Announces Release of the Author Submit Extension Module to Data Harmony Version 3.9

May 12, 2014  
Posted in Access Insights, Featured, Taxonomy

Access Innovations, Inc. has announced the Author Submit extension module as part of its Data Harmony Version 3.9 release. Author Submit is a Data Harmony application that integrates author-selected subject metadata into a publishing workflow during the paper submission or upload process. Author Submit facilitates the addition of taxonomy terms by the author: each author provides subject metadata from the publisher taxonomy to accompany the item being submitted. During the submission process, Data Harmony’s M.A.I. core application suggests subject terms based on a controlled vocabulary, and the author chooses appropriate terms to describe the content of the document. This enables early categorization, supports selection of peer reviewers, and provides data for trend analysis.

“Author Submit is an exciting addition to the Data Harmony repertoire,” said Marjorie M. K. Hlava, President of Access Innovations, Inc. “Publishers can easily incorporate Author Submit, streamlining several steps at the beginning of their workflow, as well as semantically enriching that content at the beginning of the production process. They are getting far more benefits, and doing so without adding time and effort.”

“The approach is simple on the surface and supported by very sophisticated software,” remarked Bob Kasenchak, Production Manager at Access Innovations. “The document is indexed using the Data Harmony software, which returns a list of suggested thesaurus terms, from which the author selects appropriate terms. Author Submit supports the creation of a ‘semantic fingerprint’ for the author, collecting additional information along with the subject metadata. Finally, the tagged content is added to the digital record to complete production and be added to the data repository. It’s an amazing system to see in action.”
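The suggest-and-select workflow Kasenchak describes can be sketched in miniature. This is a hypothetical illustration only; the vocabulary, function names, and matching logic below are invented for the example, and the real M.A.I. engine uses a far more sophisticated rule base.

```python
# Hypothetical sketch of the suggest-and-select submission step; the
# vocabulary and matching logic here are invented for illustration.
CONTROLLED_VOCABULARY = {
    "Machine learning": ["machine learning", "statistical learning"],
    "Taxonomies": ["taxonomy", "taxonomies", "controlled vocabulary"],
}

def suggest_terms(document_text):
    """Return vocabulary terms whose synonyms appear in the document."""
    text = document_text.lower()
    return [term for term, synonyms in CONTROLLED_VOCABULARY.items()
            if any(s in text for s in synonyms)]

def author_selects(suggestions, accepted):
    """The author keeps only the suggested terms they judge appropriate."""
    return [t for t in suggestions if t in accepted]

suggested = suggest_terms("A taxonomy-driven approach to machine learning.")
subject_metadata = author_selects(suggested, accepted={"Taxonomies"})
```

The selected terms would then travel with the submission as subject metadata, available downstream for reviewer matching and trend analysis.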

Author Submit can be implemented in several ways:

  • as a tool that helps authors self-select appropriate metadata, assigned to their name and research, at the point of submission into the publishing pipeline;
  • as an editorial interface between a semantically enriched controlled vocabulary and potential submissions to the literature corpus;
  • for editorial review of subject indexing at the point of submission, enabling robust evolution of a controlled vocabulary (taxonomy or thesaurus) by encouraging timely rule base refinement;
  • for simultaneous assignment of descriptive and subject metadata by curators of document repositories, for efficient integration of documents into a large collection; and
  • as a method of tracking authors and their submissions for conference proceedings, symposia, and the like.

“This flexible system opens the door between Data Harmony software and an author submission pipeline in fascinating new ways,” commented Kirk Sanders, an Access Innovations taxonomist. “Users can choose a configuration that maximizes the gain from their organizational taxonomy, at the point it is needed most: when their authors log on to submit their documents.”


About Access Innovations, Inc.

Founded in 1978, Access Innovations has extensive experience with Internet technology applications, master data management, database creation, thesaurus/taxonomy creation, and semantic integration. Access Innovations’ Data Harmony software includes machine aided indexing, thesaurus management, an XML Intranet System (XIS), and metadata extraction for content creation developed to meet production environment needs.  Data Harmony is used by publishers, governments, and corporate clients throughout the world.

Hold the Mayo! A study in ambiguity

When we (at least those of us in Greater Mexico) hear of or read about Cinco de Mayo, there is no question in our minds that “Mayo” refers to the month of May. The preceding “Cinco de” (Spanish for “Fifth of”) pretty much clinches it. Of course, if the overall content is in Spanish, there might still be some ambiguity about whether it is the holiday that is being referred to, or simply a date that happens to be the one after the fourth of May. (As in “Hey, what day do we get off work?” “The fourth of July, I think.”)

We can generally resolve this kind of ambiguity by the context, as can a good indexing system and a rule base associated with a taxonomy.

If you’re reading this posting, you read English. So there’s a good chance that when you read the word “mayo”, you think of the sandwich spread formerly and formally known as mayonnaise.

Or perhaps the famous Mayo Clinic comes to mind. If you’re an American football fan (I had to throw “American” in there to differentiate the mentioned sport from soccer), you might think of New England Patriots linebacker Jerod Mayo.

The context enables us to recognize which mayo we’re dealing with. Likewise, an indexing system might take context into account when encountering the slippery word. A really good indexing rule base might help you sort things out when you’ve got text about Jerod Mayo’s line of mayonnaise, the proceeds of which he is donating to the Boston (not Mayo) Clinic.


As a person of Irish descent, I know perfectly well that that is not the end of Mayo’s spread. There is a County Mayo in Ireland, which has a few other Mayos, too.

If you consult the Mayo disambiguation page in Wikipedia, you will quickly discover that Mayo goes much further than Ireland. There are Mayos of one sort or another all over the world: towns, rivers, and an assortment of other geographical entities that might easily co-exist in a taxonomy or gazetteer.

Traveling down past the geographical Mayos on the Wikipedia page, one finds the names of dozens and dozens of people, many of whom have Mayo as a first name, and many of whom have Mayo as a last name. Thank goodness the four relatively famous William Mayos have different middle names.

The final category on Wikipedia’s Mayo page is, perhaps inevitably, Other. There are quite a few Other Mayos. And what might the last one be?  Where has this journey taken us?

Mayo, the Spanish word for May.

Hold the Cinco de Mayo celebration!

Barbara Gilles, Taxonomist
Access Innovations

The Supposed Advantages of Statistical Indexing

I would like to make some observations about statistics-based categorization and search, and about the advantages that their proponents claim.

First of all, statistics-based co-occurrence approaches do have their place, notably for wide-ranging bodies of text such as email archives and social media exchanges, and for assessing the nature of an unknown collection of documents. In these circumstances, a well-defined collection of concepts covering a pre-determined area of study and practice might not be practicable. For lack of a relevant controlled vocabulary foundation, and for lack of other practical approaches, attempts at analysis fall back on less-than-ideal mathematical approaches.

Co-occurrence can do strange things. You may have done Google searches that got mysteriously steered to a different set of search strings, apparently based on what other people have been searching on. This is a bit like the proverbial search for a key under a street light, instead of the various places the key is more likely to be, simply because the light is better and the search is easier under the street light.
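To make the co-occurrence idea concrete, here is a bare-bones sketch; the documents and counting scheme are invented for illustration, and production systems add windowing, weighting, and normalization on top of counts like these.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(documents):
    """Count how often each pair of distinct words appears in the same document."""
    counts = Counter()
    for doc in documents:
        words = sorted(set(doc.lower().split()))
        counts.update(combinations(words, 2))
    return counts

docs = ["key found under street light",
        "street light repair",
        "lost key under sofa"]
counts = cooccurrence_counts(docs)
# ("light", "street") co-occurs in two documents; the raw counts say
# nothing about where the key actually is -- only where the light is good.
```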

These approaches are known for low search results accuracy (60% or less); this is unacceptable for the article databases of research institutions, professional associations, and scholarly organizations. Not only is this a disservice to searchers in general; it is an extreme disservice to the authors whose insights and research reports get overlooked, and to the researchers who might otherwise find a vital piece of information in a document that the search misses.

The literature databases and repositories of research-oriented organizations cover specific disciplines, with well-defined subdisciplines and related fields. This makes them ideal for a keyword/keyphrase approach that utilizes a thesaurus. These well-defined disciplines have well-defined terminology that is readily captured in a taxonomy or thesaurus. The thesaurus has additional value as a navigable guide for searchers and for human indexers, including the authors and researchers who know the material best. Surely they know the material far better than any algorithm could, and can take full advantage of the indexing and search benefits that a thesaurus can offer. An additional benefit (and a huge one) of a thesaurus is that it can serve as the basis for an associated indexing rule base that, if properly developed, can run rings around any statistics-based approach.

Proponents of statistics-based semantic indexing approaches claim that, with keyword approaches, searchers need to guess the specific words and phrases that appear in the documents of potential interest. On the contrary, a human can anticipate these things much better than a co-occurrence application can. Further, with an integrated rule-based implementation, the searcher does not need to guess all of the exact words and phrases that express particular concepts in the documents.

With indexing rooted in the controlled vocabulary of the thesaurus, documents that express the same concept in various ways (including ways that completely circumvent anything that co-occurrence could latch onto) are brought together in the same set of search results. Admittedly, I do appreciate the statistical approaches’ success in pulling together certain words that they have learned (through training) may be important. The proximity or distance between the words matters in the ranking of returned results, implying relevance to the query. However, that unknown distance makes it hard for a statistical approach to latch onto a concept, and therefore makes it unreliable.

Use of a thesaurus also enables search interface recommendations for tangentially related concepts to search on, as well as more specific concepts and broader concepts of possible interest. The searcher and the indexer can also navigate the thesaurus (if it’s online or in the search interface) to discover concepts and terms of interest.

With statistic-based indexing, the algorithms are hidden in a mysterious “black box” that mustn’t be tinkered with. With rule-based, taxonomy-based approaches, the curtain can be pulled back, and the workings can be directly dealt with, using human intelligence and expert knowledge of the subject matter.

As for costs, statistical approaches generally require ‘training sets’ of documents to ‘learn’ or develop their algorithms. Any newly added concept/term means collecting another ten to fifty documents on the topic. If the new term is to be a narrower term of an existing term, that broader term’s training set loses meaning and must be redone to differentiate the concepts. Consider that point in the ongoing management of a set of concepts.

Any concept that is not easily expressed in a clear word or phrase, or that is often referred to through creative expressions that vary from instance to instance, will require a significantly larger training set. The expanded set will still be likely to miss relevant resources while retrieving irrelevant resources.

Dealing with training sets is costly, and is likely to be somewhat ineffective, at that; if the selected documents don’t cover all the concepts that are covered in the full database, you won’t end up with all the algorithms you need, and the ones that are developed will be incomplete and likely inaccurate. So with a statistics-based approach, you won’t reap the full value of your efforts.

Barbara Gilles, Taxonomist
Access Innovations

Data Harmony® Software Version 3.9 Now Available

April 21, 2014  
Posted in Access Insights, Autoindexing, Featured

Access Innovations, Inc. has announced that Version 3.9 of its Data Harmony Suite of software tools is now available.

The Data Harmony Suite provides content management solutions that improve information organization by systematically applying a taxonomy or thesaurus, fully integrated with patented natural language processing methods. MAIstro, the award-winning flagship software module of the Data Harmony product line, combines Thesaurus Master® (for taxonomy creation and maintenance) with M.A.I. (Machine Aided Indexer) for interactive text analysis and better subject tagging.

Data Harmony software gives users the power to quickly attach precise indexing terms for their documents, reflecting the controlled vocabulary of their business sector or academic discipline – at document level, in the metadata. With the MAIstro interface, editors can adjust vocabulary terms or corresponding rules that govern indexing, to generate very accurate and consistent subject metadata for content objects. The MAIstro rule-building screen and indexing rule base syntax are highly intuitive, and accessible to non-programmers.

Access Innovations has added the following custom features and improvements to Data Harmony software for this release: 

MAIstro Version 3.9 – for user-friendly, full-featured, structured vocabulary development and management

  • Emphasis display in Thesaurus Master – permitting use of bolding, italics, and underlining for terms
  • New Thesaurus Master export formats – OWL2 (Web Ontology Language), Full Path XML to support XML-driven content management systems and systems that rely on breadcrumb navigation, and the Custom XML Fields format (users select the fields to be included when generating an export)
  • Enhanced Adobe PDF support – Test MAI supports text from newer, full-feature PDFs, to be used during spot-testing of the indexing rule base
  • Improvements to Thesaurus Master’s Import Module and the Data Harmony command line project creation tool – for improved preservation of source file term IDs

M.A.I. Version 3.9 – for automated indexing, machine-aided term selection interactivity, and extending the implementation of indexing based on a very accessible rule base

  • Support for returning the full path of suggested terms
  • Enhanced reporting features for identifying the usage of conditional rules
  • Increased maximum rule length, to allow for more complex and specific rule-building
  • Easy implementation of secure server login (SSL) beginning at project creation, if desired
  • More flexible and more memory-efficient capabilities for implementation with inline tagging

For Version 3.9, application developers placed an emphasis on creatively leveraging API calls to accomplish specialized goals for existing Data Harmony customers, resulting in new software packages becoming available for the first time. Data Harmony API calls offer a method for system administrators to facilitate ongoing Web services for Data Harmony users. API calls also provide a way to configure direct machine-to-machine operations for processing information stored in Data Harmony. All APIs are available as Web services.

Additional major updates in this Data Harmony software release include improvements to Search Harmony and the addition of a new service, Data Harmony’s Author Submit extension module.

“This version of the software shows a strong move to Web-based functionality,” said Allex Lyons, a member of the Data Harmony programming team. “Our recent enhancements provide greater search efficiency while browsing your thesaurus, through callouts and hyperlinks to related and narrower terms.”


About Access Innovations, Inc.

Founded in 1978, Access Innovations has extensive experience with Internet technology applications, master data management, database creation, thesaurus/taxonomy creation, and semantic integration. Access Innovations’ Data Harmony software includes automatic indexing, thesaurus management, an XML Intranet System (XIS), and metadata extraction for content creation developed to meet production environment needs.  Data Harmony is used by publishers, governments, and corporate clients throughout the world.

Limitations of Fuzzy Matching of Lexical Variants

April 14, 2014  
Posted in Access Insights, Featured, semantic

Some vendors of text analytics software claim that their software can identify the occurrences of text reflecting specific taxonomy terms (with the strong, and false, implication that it identifies all such occurrences) using “fuzzy matching” or “fuzzy term matching.” Explanations of the technology from Techopedia and Wikipedia show that it is a fairly crude mathematical approach, similar to the co-occurrence statistical approaches that such software also tends to use, and no match for rule-based indexing approaches that derive their effectiveness from human intelligence.

I remember searching online for information about the Data Harmony Users Group (DHUG) meeting. Google, in its infinitely fuzzy wisdom, asked, “Did you mean ‘thug’?”

As explained in Techopedia,

Fuzzy matching is a method that provides an improved ability to process word-based matching queries to find matching phrases or sentences from a database. When an exact match is not found for a sentence or phrase, fuzzy matching can be applied. Fuzzy matching attempts to find a match which, although not a 100 percent match, is above the threshold matching percentage set by the application.

Fuzzy matching is mainly used in computer-assisted translation and other related applications.

Fuzzy matching searches a translation memory for a query’s phrases or words, finding derivatives by suggesting words with approximate matching in meanings as well as spellings.

The fuzzy matching technique applies a matching percentage. The database returns possible matches for the queried word between a certain percentage (the threshold percentage) and 100 percent.

So far, fuzzy matching is not capable of replacing humans in language translation processing.
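The thresholded matching Techopedia describes can be sketched with Python’s standard-library similarity ratio; the function name and the 0.75 threshold below are illustrative, not taken from any particular product.

```python
from difflib import SequenceMatcher

def fuzzy_match(query, candidates, threshold=0.75):
    """Return candidates whose similarity ratio to the query meets the
    threshold percentage, as in the Techopedia description."""
    scored = [(c, round(SequenceMatcher(None, query.lower(), c.lower()).ratio(), 2))
              for c in candidates]
    return [(c, r) for c, r in scored if r >= threshold]

fuzzy_match("taxonomy", ["taxonomy", "taxonomies", "economy", "astronomy"])
# [('taxonomy', 1.0), ('taxonomies', 0.78)]
```

At a 0.75 threshold, “taxonomies” squeaks in while “economy” and “astronomy” fall below it; tightening the threshold to 0.8 would wrongly drop “taxonomies” as well, which is exactly the kind of brittleness discussed below.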

And the Wikipedia article on the subject explains as follows:

“The closeness of a match is measured in terms of the number of primitive operations necessary to convert the string into an exact match. This number is called the edit distance between the string and the pattern. The usual primitive operations are:

    • insertion: cot → coat
    • deletion: coat → cot
    • substitution: coat → cost

 These three operations may be generalized as forms of substitution by adding a NULL character (here symbolized by *) wherever a character has been deleted or inserted:

    • insertion: co*t → coat
    • deletion: coat → co*t
    • substitution: coat → cost

The most common application of approximate matchers until recently has been spell checking."
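The primitive operations quoted above are exactly what the classic Levenshtein edit distance counts, and the standard dynamic-programming computation fits in a few lines:

```python
def edit_distance(s, t):
    """Classic Levenshtein distance: the minimum number of insertions,
    deletions, and substitutions needed to turn s into t."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                       # delete all of s[:i]
    for j in range(n + 1):
        d[0][j] = j                       # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

edit_distance("cot", "coat")   # 1 (one insertion)
edit_distance("coat", "cost")  # 1 (one substitution)
```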

In blog comments, readers have pointed out the large number of false positives this would create. As a small example of the possibilities, think what would happen with the American Society of Civil Engineers (ASCE) thesaurus and “bridging the gap.” Not to mention the many, many words and acronyms that a rule-based approach could easily disambiguate.

And then there are all the occurrences of the concept that would be totally missed by the fuzzy matching approach. Not all synonyms, nor all the other kinds of expressions of various concepts, are lexical variants that are similar character strings. Fuzzy matching has no way of dealing with these alternative expressions, and they happen often.

There is another problem with these approaches. They are sometimes tied in with weighting, with the “edit distance” mentioned in the Wikipedia article used to downgrade the supposed relevance of a lexical variant. Why on earth should a variant be downgraded, if the intended concept is completely identical to the one expressed by the corresponding preferred term?

The fuzzy approach does not save human time and effort. Rules covering a wide variety of lexical variants can be written using truncated strings as the basic text to match, and proximity conditions added as desired to make those rules more accurate.
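As a hypothetical illustration of that rule-writing approach (this is not Data Harmony’s actual rule syntax, just a sketch of a truncated match string combined with a proximity condition):

```python
import re

def rule_matches(text, stem, nearby, window=40):
    """Match a truncated stem only when a context word occurs within
    `window` characters of it -- a crude stand-in for a proximity condition."""
    for m in re.finditer(stem, text, re.IGNORECASE):
        start = max(0, m.start() - window)
        context = text[start:m.end() + window]
        if re.search(nearby, context, re.IGNORECASE):
            return True
    return False

# The stem "categori" covers categorize, categorise, categorization,
# categorical, and so on; the proximity condition keeps the rule on topic.
rule_matches("Automatic categorization of documents", r"categori", r"document")  # True
rule_matches("Categorical grants in fiscal policy", r"categori", r"document")    # False
```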

Sometimes (in fact, rather frequently), a concept is expressed in separate parts over the course of a sentence, a paragraph, or a larger span of text. There is no way that I’m aware of that fuzzy matching can deal with that. A rule-based approach can.

In short, fuzzy matching has serious deficiencies as far as indexing and search are concerned, and is vastly inferior to a rule-based approach.

Barbara Gilles, Taxonomist
Access Innovations

The Semantics of Whisk(e)y

April 7, 2014  
Posted in Access Insights, Featured, Taxonomy

As noted last week in our article, “A Spirit of Another Name”, Saveur has created a glossary of whisk(e)ys. However, as we all know, a glossary does not a taxonomy make – it can, however, be a good starting point.

One of the problems, of course, is that national styles – and even spellings – are mutable.

In general, “whiskey” comes from Ireland and the United States, while “whisky” (no ‘e’) comes from Canada and Scotland.  However, well-known bourbon Maker’s Mark long ago decided to buck the semantic trend and drop the “e” despite being an all-American brand.

The Japanese usually use the Scottish version (being heavily influenced by Scotch). However, there are now whisk(e)ys being made in France, Wales, Germany, Australia, Finland, India, Sweden, Spain, and the Czech Republic, to name a few. Adoption of one or the other spelling variant is, well, varied.

Besides, once we accept that “whiskey” and “whisky” are synonyms, the spelling has little to do with the semantics.

It’s more important to understand the legal (and in some cases, traditional but perhaps not codified) production requirements that define the various styles; these are often (but not always) defined by name and region.

For example: Canadian whisky is, by law, allowed to have up to 9.09% “flavorings” – a category of adulterants with no definition (but which in practice include artificial colors, many different sweeteners, prune and other fruit juices, etc.). Scotch whisky can have caramel color added, but no flavorings. Bourbon, on the other hand, can be cut to proof with water, but must by law have no additives for either color or flavor. This gives Canadian whisky its characteristic sweet taste.

Straight Bourbon Whiskey is, in fact, the most strictly defined and regulated of the whiskeys – although, contrary to common beliefs, it need not be made in Kentucky. It must, however, be produced in the United States from spring water; the mash (mixture of grains) must comprise at least 51% corn (the rest being barley, wheat, and rye); it must be aged no less than 24 months in new charred American white oak barrels; and of course it must not contain any additives. (For the record, Rye is identical to Bourbon with the very important exception that it must contain no less than 51% rye.)

Other factors are also in play. For example, Irish whiskey is almost always triple-distilled, while Scotch is almost always double-distilled. Scotch is further delineated by region (Highland, Campbeltown, Islay, Highland Islands, Lowland; Speyside is a sub-region of Highland) and drying methods (whether peat, gas, or coal is used to dry the grain to stop the germination process) as well as the various permutations of blends, single malts, and vatted malts (by many names), not to mention other variants such as single-barrel, cask-strength, and various “finishes” in casks which formerly held other kinds of spirits.

Now that is a categorization problem.

In constructing a taxonomy of whisk(e)y, a faceted approach might be best. However, given the limited space here, let’s just take a quick crack.  (The Top Term is of course “Whiskey” UF=Whisky.)

Whiskey Blends
. . Blended Whiskey
. . Grain whiskey
. . . . Corn whiskey
. . Vatted Malts   [UF=Blended Malts   SN=Comprised of various single malts, no “grain”]
. . Single Malts
Whiskey Production
. . Peated Whiskey   [RT=Islay Whisky]
. . Pot still Whiskey
. . Single barrel Whiskey   [UF=Single-barrel Whiskey]
. . Small batch Whiskey
. . Whiskey Distillation
. . . . Double Distillation   [UF=Double-distilled]
. . . . Triple Distillation   [UF=Triple-distilled]
Whiskey Regions
. . American Whiskey
. . . . Bourbon Whiskey
. . . . California Whiskey
. . . . Oregon Whiskey
. . . . [add other states as necessary]
. . . . Rye Whiskey
. . . . Tennessee Whiskey
. . Australian Whiskey
. . Canadian Whisky
. . European Whiskeys
. . . . Czech Whisky
. . . . Finnish Whisky
. . . . French Whisky
. . . . German Whisky
. . . . Irish Whiskey
. . . . Scotch Whisky
. . . . . . Campbeltown Whisky
. . . . . . Highland Whisky
. . . . . . . . Speyside Whisky
. . . . . . . . Highland Island Whisky
. . . . . . Islay Whisky
. . . . . . Lowland Whisky
. . . . Spanish Whisky
. . . . Swedish Whisky
. . . . Welsh Whisky
. . . . [add others as needed]
. . Indian Whisky
. . Japanese Whiskey
Whiskey Strengths
. . Cask Strength
. . Overproof   [SN=95 proof or higher]
. . Standard proof   [SN=80 to 94 proof*]

*this is a little arbitrary but reflects industry norms

UF=Use For

SN=Scope Note

RT=Related term

Clearly I’m missing cask finishes (mostly in Scotch, but now reaching Bourbon territory) and a few other things. (Hey, it’s just a blog post.)  Ages, also, could quickly become a problem: no one wants a list of cardinal numbers in their thesaurus.

The various brands could, then, be narrower terms in the hierarchy I’ve sketched out above.

In order to avoid massive categorization issues and massive duplication (instead of going Netflix-style, as “Single malt overproof cask-finished Campbeltown Scotch whiskey” is a pretty unwieldy taxonomy term), you’d have to apply multiple labels to categorize each individual item. Imagining this would be most useful for e-commerce (as opposed to scholarly document categorization) helps: think about browseable tabs on a website. You’d want to find Laphroaig under both “Peated Whiskeys” and “Islay Whiskies” to allow people to find what they were looking for using multiple approaches.
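A minimal sketch of that multi-label tagging, using the hierarchy above (the term records and product list are illustrative):

```python
# Each product is tagged with every applicable taxonomy term, rather
# than with one giant compound term.
TAXONOMY = {
    "Peated Whiskey": {"BT": "Whiskey Production", "RT": ["Islay Whisky"]},
    "Islay Whisky":   {"BT": "Scotch Whisky"},
    "Cask Strength":  {"BT": "Whiskey Strengths"},
}

PRODUCTS = {
    "Laphroaig 10": ["Peated Whiskey", "Islay Whisky", "Cask Strength"],
}

def browse(term):
    """All products findable under a given browseable tab."""
    return [p for p, terms in PRODUCTS.items() if term in terms]

browse("Peated Whiskey")  # ["Laphroaig 10"]
browse("Islay Whisky")    # ["Laphroaig 10"]
```

The same bottle surfaces under every tab that applies to it, which is the whole point of the multi-label, faceted approach.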

This is why I described it as a “faceted approach.” But let’s not get into that now. For the same reason, though, I’m going to stop while I’m ahead.

Bob Kasenchak, Project Coordinator
Access Innovations

Knowledge Organization Systems and Return on Investment (KOSs and ROI)

Let’s call him George. George was having a very bad day. He needed legal advice. So, over his lunch hour he scheduled an appointment. After George described his situation, the lawyer pulled a book from the shelf behind him. After briefly scanning it and checking precedent, he confidently told George with a smile, “Relax. This is a slam dunk.” On his way out of the office, George gazed at the consultation invoice and muttered to himself: “15 minutes! Only 15 minutes and he charged me $325.00!”

On his way back to the office, George’s car started making some serious grinding noises whenever he turned to the right. He pulled into the local Fix-It-All Garage and described the noise to the technician. After turning over the keys, George looked through the large glass window as the mechanic pulled here and tugged there at his car up on the lift. After only a few keystrokes at his computer station, the technician spent the next ten minutes installing what looked to George like a $10.00 part. In the blink of an eye, George was standing at the counter with another invoice. He called his wife to grumble: “He pulled and tugged in two different spots and then charged me $325.00. I’m in the wrong line of work!”

Certain that his ulcer was acting up, George stopped at the clinic on his way home that day. The doctor, who agreed to fit him in right away, asked a few short questions, consulted his desk reference guide,  and started writing a prescription. Moments later, speechless George could only grimace as he faced yet another hefty bill. Poor George.

Besides a considerable amount of cash, what was George missing? Someone might say that George knew the cost of everything, but the value of nothing. Because George was able to successfully confront and overcome several perplexing and complex problems, someone else might say, “What a great quality of life George has!” It all looked so deceptively easy. However, to focus only on the “interface” is to fail to consider the years of training and experience behind each professional who knew just what questions to ask, just where to look and pull and tug, and just which resource to consult.

How does one measure the true value of successful information organization, navigation, and retrieval? Access Innovations Inc. offers superior customer service, ease of product use, and support, combined with years of experience in order to provide outstanding quality. Speak with the CEO of Access Innovations, Inc., Jay Van Eman, about the qualitative and quantitative criteria used to assess successful KOSs and the proper rationale for measuring ROI in your setting. Are you getting real value for the cost?

Check out these additional resources:

Why Knowledge Management Is Important To The Success Of Your Company

The Use of Return on Investment (ROI) in the Performance Measurement and Evaluation of Information Systems

ROI & Impact: Quantitative & Qualitative Measures for Taxonomies

Eric Ziecker, Information Consultant
Access Innovations, Inc.

Of Taxonomies, Biology, and Moneyball

March 24, 2014  
Posted in Access Insights, Featured, Taxonomy

Baseball and biology are not commonly found in the same conceptual space. Neither do you find taxonomy associated with baseball, but in recent news these connections were made. Grant Bisbee, editor of “Baseball Nation”, digresses into the arcane as he laments the coming of the “He’s In the Best Shape of His Life” season. This is the time of year baseball writers must assess the prospects for the coming season, and clichés and hyperbole reign. The dubious practice of evaluating the physical condition of players runs rampant as spring training begins. With tongue in cheek, Bisbee tries to shape a taxonomy to classify this spring ritual. His would be the taxonomy of the “In the Best-shape Stories”.

Bisbee suggests three categories: “Player X Got in Shape”, “Fixed Eyesight”, and “A Serious, Previously Undiscovered Affliction”. He compiles representative stories for each classification. He is somewhat skeptical that a player could be in the “best shape of his life” as a result of some of the reported training regimens. Can someone really put on 18 pounds of muscle in the off-season? Legally? Shouldn’t vision examinations be a regular part of baseball teams’ operations? After all, they’re paying millions for their players’ hand-eye coordination.

Are there baseball taxonomies already out there that could help Mr. Bisbee classify his apocryphal collection? A Google search of baseball taxonomies returns a million items more or less – plus or minus – give or take – a few hundred thousand. Not much help there. At the very least, Bisbee could expand the classification to include other sports and how each portrays its mystical, preseason rituals. How are preseason body-rebuilding rhythms disturbed by “Seasonus Interruptus”?

And biology is experiencing a potential season-disrupting trend. Tom Spears, reporting for the “Ottawa Citizen”, filed the story “Taxing times for taxonomy”. On a more serious note, Mr. Spears chronicles the demise of field-based taxonomic study in biology in favor of lab work. Computers and DNA studies have relegated classification to the closet. While the work done in laboratories is vital to the field, biologist Ernest Small is quoted: “How can you do a study of forests without knowing the trees?” The reverse is also true; you can’t really study a tree without understanding its forest. Commenting on his lab-centric colleagues, Dr. Small laments “the lack of knowledge of what plants and animals make up our world”. Spears goes on to write, “We need to understand whole species, not just genes, if we are to solve the problems of agriculture, fisheries, insect pests, ecology and the spread of diseases.”

Taxonomies are central to understanding a field as a whole, whether biology or sports, and they are critical to understanding individual members of that field. The taxonomy of plants and animals tells you about the relationships and characteristics of its members. The child, or narrower node, in a taxonomy inherits the properties of the parent, or broader node. Reviewing an entire hierarchical branch and its relationship to other branches conveys the properties, similarities, and differences of those branches. This knowledge can help in sleuthing out meaningful trends and can help explain interactions or identify possible ones. A taxonomy for organizing sports articles might reveal interesting groupings and trends. Are more injuries being reported this season? Are there geographic trends or team trends? Applying graphical analysis visualizes the content in ways that make such trends easier to spot.
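The inheritance described above can be sketched in a few lines of code. This is a minimal illustration only (the class, labels, and properties are hypothetical examples, not part of any Data Harmony product): each narrower node inherits the properties of its broader node, and a node’s own properties can refine what it inherits.

```python
class TaxNode:
    """A minimal taxonomy node: a narrower term that inherits
    the properties of its broader (parent) term."""

    def __init__(self, label, properties=None, broader=None):
        self.label = label
        self.own = properties or {}
        self.broader = broader

    def properties(self):
        # Walk up the hierarchy, merging broader-term properties first
        # so that a node's own values override inherited ones.
        inherited = self.broader.properties() if self.broader else {}
        return {**inherited, **self.own}


animal = TaxNode("Animalia", {"alive": True})
insect = TaxNode("Insecta", {"legs": 6}, broader=animal)
monarch = TaxNode("Danaus plexippus", {"migratory": True}, broader=insect)

print(monarch.properties())  # {'alive': True, 'legs': 6, 'migratory': True}
```

The monarch butterfly node “knows” it is alive and six-legged without stating so itself; those facts flow down from its broader terms.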

What these two diverse stories have in common, besides the core theme of taxonomies, is the need for hands-on, down-and-dirty fieldwork. Sports writers need to see players perform during preseason to really assess their fitness. Biologists can learn a great deal about a species by looking through a microscope, but they can’t really understand behavior that way. They need to observe firsthand. Once the field data and lab data are gathered, placing them in context gives them meaning and leads to insights. Taxonomy is ideal for grouping data in meaningful ways, and that grouping can lead to insights. It can also lead you back to the data at some future point. Taxonomy is often focused on organizing content to make search easy, fast, reliable, and replicable. Taxonomy also defines a domain or conceptual space. It can help you find your way and keep you from getting lost. Equally important, it provides meaning to the conceptual space you’re wandering through. Spring is almost here – enjoy the first Danaus plexippus of the season and know that it has a genus and a tribe and a family and more. Play ball!

Jay Ven Eman, CEO, Access Innovations

First published on January 30, 2012.

Release of Data Harmony® Software Version 3.9 Planned for March 31, 2014

March 17, 2014  
Posted in Access Insights, Featured, indexing, metadata

Data Harmony, a division of Access Innovations, Inc., has announced that version 3.9 of its suite of software tools will be released on March 31, 2014. Current customers will be the first to receive the new version.

“Our recent enhancements provide greater search efficiency while browsing your thesaurus through callouts and hyperlinks to related and narrower terms,” said Allex Lyons, one of the Data Harmony programming team. “Dropdown menus in the search window display subject terms with an auto-complete function that assists users of any skill or experience level.”

“The Thesaurus Master™  module is a wonderful tool for creating and maintaining taxonomies, and when it’s coupled with M.A.I.™, our customers have a world-class semantic enrichment platform,” comments Bob Kasenchak, Project Coordinator at Access Innovations. “In addition, this release of the software works with our new Semantic Fingerprinting application for creation of disambiguated author networks. For a number of reasons, author nets are becoming a critical component of the publishing workflow for many of our publishing clients.”

Access Innovations’ editorial team collaborated with Data Harmony programmers to update the user guides and improve technical documentation for all software modules. “For version 3.9, MAIstro is featuring several new thesaurus export formats, and there are great new functions inside the user interface,” noted Kirk Sanders, Project Manager, Access Innovations Editorial Department. “In addition, we’re releasing several new Web service modules that extend Data Harmony software in intriguing ways. I’m eager to see how our clients gain from the creative implementation of Data Harmony Inline Tagging, Search Harmony, and the Data Harmony Author Submission System — and the other Web services!” The Data Harmony Web service modules offer the power to configure instantaneous Web functions inside a graphical user interface based on Data Harmony software actions.

Current Data Harmony customers can expect to be contacted about version 3.9 by the end of March, according to Marjorie M. K. Hlava, President of Access Innovations. Further information will be available on the Data Harmony website.


About Access Innovations, Inc.

Founded in 1978, Access Innovations has extensive experience with Internet technology applications, master data management, database creation, thesaurus/taxonomy creation, and semantic integration. Access Innovations’ Data Harmony software includes automatic indexing, thesaurus management, an XML Intranet System (XIS), and metadata extraction for content creation developed to meet production environment needs.  Data Harmony is used by publishers, governments, and corporate clients throughout the world.

Semantic Fingerprinting for Name Disambiguation

Many institutions and organizations – notably (but not limited to) publishers – have large, sometimes very large, lists of names. These names come from member directories, employees and staff, clients and customers, marketing, development, and many other sources; indeed, the lists from various departments of the same organization are often not connected or reconciled with one another in any way.

This growing problem has given rise to a sub-field in the information/data industry variously called “named entity disambiguation” or “author disambiguation” or “name disambiguation”, among other monikers. In the academic publishing space, disambiguation of author names is a common challenge.

In a nutshell: given a list of names—let’s say, oh, 3.2 million names—the task is to determine which records refer to the same person and which do not. We might proceed as follows:

[Image: a sample list of author name records]

The goal is, as automatically as possible, to sort out which of these records should be merged. Once accomplished, you (a publisher) could make a webpage for each author listing all publications and so forth for your users to browse.

Clearly, some of the names above are potentially the same person, while others are not. For example, B. Caldwell Smith, B.C. Smith, Brandon C. Smith, and Brandon Caldwell Smith look like they might be the same person. To find out without looking at every name and every article (3.2 million, remember?), we need more information.

To accomplish this task, the metadata associated with each author is examined and compared to try to eliminate duplicates. For example, from each article we can associate an author with their co-authors, the institution they were affiliated with when the paper was published, email addresses, dates of publication, and so forth.
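As a rough sketch of this kind of metadata comparison (the record fields, sample data, and matching heuristic below are hypothetical illustrations, not the actual Data Harmony implementation), each pair of records sharing a surname can be checked for overlapping metadata:

```python
from itertools import combinations

def could_match(a, b):
    """Heuristic: same surname plus at least one shared piece of
    metadata (a co-author, an institution, or an email address)."""
    if a["surname"].lower() != b["surname"].lower():
        return False
    return bool(set(a["coauthors"]) & set(b["coauthors"])
                or (a["institution"] and a["institution"] == b["institution"])
                or (a["email"] and a["email"] == b["email"]))

records = [
    {"name": "B. C. Smith", "surname": "Smith",
     "coauthors": ["J. Doe"], "institution": "Harvard", "email": ""},
    {"name": "Brandon C. Smith", "surname": "Smith",
     "coauthors": ["J. Doe"], "institution": "Harvard", "email": ""},
    {"name": "R. Jones", "surname": "Jones",
     "coauthors": [], "institution": "Yale", "email": ""},
]

# Compare every pair of records and keep the plausible matches.
candidates = [(a["name"], b["name"])
              for a, b in combinations(records, 2)
              if could_match(a, b)]
print(candidates)  # [('B. C. Smith', 'Brandon C. Smith')]
```

In practice a real pipeline would block on surname first rather than compare all pairs—with 3.2 million names, an all-pairs comparison is far too expensive.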

[Image: the same name records with associated metadata – co-authors, institutions, and dates]

Well, some things are clearer, but some are not. Whereas before we may have suspected that Rodger Smith and Roger Smith were different people, they published at the same institution in the same year; maybe one is just a typo? And maybe Brandon Caldwell Smith moved from Harvard to Yale (not unheard of) sometime between 1961 and 1972?

At Access Innovations we’ve been developing a way to add some certainty to the process using semantic metadata—it’s not a silver bullet, but it is a bigger gun. We call the process “semantic fingerprinting” and it’s based on our thesaurus and indexing technology.

Every author’s works (papers, conference proceedings, editorial roles) associate the author with one or more pieces of content, and for each piece of content we have indexing terms from a thesaurus particular to that client. By associating the author directly with those indexing terms, we develop a semantic profile (or “fingerprint”) for each one. Since each author usually writes multiple papers (see “Lotka’s Law”), we compile the subject terms from each paper to build a more complete profile; obviously, the more papers we have, the more accurate these profiles are.
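One simple way to sketch this idea (again, an illustrative toy with made-up terms, not the production system) is to treat each fingerprint as the set of thesaurus terms drawn from an author’s papers and to compare two authors by the overlap of their sets, for instance with Jaccard similarity:

```python
def fingerprint(papers):
    """An author's semantic fingerprint: the union of the thesaurus
    terms assigned to all of that author's papers."""
    terms = set()
    for paper_terms in papers:
        terms |= set(paper_terms)
    return terms

def jaccard(fp_a, fp_b):
    """Overlap between two fingerprints, from 0.0 (disjoint) to 1.0 (identical)."""
    if not fp_a or not fp_b:
        return 0.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

# Three "R. Smith" records, each with term lists from their indexed papers.
smith_a = fingerprint([["ornithology", "migration"], ["ornithology", "ecology"]])
smith_b = fingerprint([["semiconductors", "photonics"]])
smith_c = fingerprint([["ecology", "migration", "wetlands"]])

print(jaccard(smith_a, smith_b))  # 0.0 -> almost certainly different people
print(jaccard(smith_a, smith_c))  # 0.5 -> plausibly the same researcher
```

A score near zero is strong evidence that two same-named records belong to different researchers, while substantial overlap supports a merge—exactly the kind of signal the metadata comparison alone could not provide.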

Returning to our example:

[Image: the name records with their subject terms added]

What we suspected to be perhaps one person based on our best information turns out pretty obviously to be two distinct researchers once the areas of expertise are added to the equation.

While the process is far from foolproof, it does help to automate the disambiguation process, which cuts down on the number of human hours required to review the work.

The concept of the “semantic fingerprint” can be applied to a paper, a school, an editor, or any other entity for which subject metadata is available. So this same basic process can be used for other purposes; for example, to:

  • Disambiguate institution names
  • Match articles to peer reviewers or editors
  • Demonstrate what areas of research are exploding at:
    • A journal
    • A college
    • A research laboratory

As datasets get cleaner and cleaner the accuracy of, and uses for, semantic technologies—such as Access Innovations’ Semantic Fingerprinting techniques—will continue to increase.

Bob Kasenchak, Project Coordinator
Access Innovations

Semantic Fingerprinting image © Access Innovations, Inc.
