Data Harmony Version 3.9 Includes MAI Batch GUI – A New Interface For M.A.I.™ (Machine Aided Indexer) and MAIstro™ Modules

June 16, 2014  
Posted in Access Insights, Featured, metadata, semantic

Access Innovations, Inc. has announced the inclusion of the MAI Batch Graphical User Interface (GUI) as part of the recent Data Harmony Version 3.9 software update release. MAI Batch GUI is a new interface for running a full directory of files through the M.A.I. Concept Extractor, enabling large amounts of text to be processed with a single command. Typically used with legacy or archival files, it allows complete semantic enrichment of entire back files in a short time. Once a batch has run, terms from the thesaurus or taxonomy become part of each record itself.

“For Data Harmony Version 3.9, we decided to add the interface to the MAIstro and M.A.I. modules to allow use directly from the desktop, giving more power to the user,” remarked Marjorie M. K. Hlava, President of Access Innovations, Inc. “It’s a fast, easy way to perform machine-aided indexing on batches of documents, without any need for command-line instructions.”

“M.A.I.’s batch-indexing capability has been in place for years via command line interface,” noted Bob Kasenchak, Production Manager at Access Innovations. “This new GUI makes it really easy to use. Customers only need to open ‘MAI Batch app’ in their Data Harmony Administrative Module, choose the files or directories to process, and submit the job.”

The purpose of MAI Batch is to provide immediate processing of data files on demand. MAI Batch can be deployed to achieve rapid subject indexing of legacy text collections.
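As a rough illustration of what such a batch run involves, here is a minimal Python sketch that walks a directory and submits each file for concept extraction. The endpoint URL, payload shape, and JSON response format are hypothetical placeholders, not the actual M.A.I. interface.

```python
import pathlib

import requests  # third-party HTTP client, assumed installed

# Hypothetical Concept Extractor endpoint; a real installation would
# use its own configured service address and response format.
MAI_ENDPOINT = "http://localhost:8080/mai/extract"

def index_directory(directory: str) -> dict:
    """Submit every file in a directory and collect suggested terms."""
    results = {}
    for path in sorted(pathlib.Path(directory).iterdir()):
        if not path.is_file():
            continue
        with open(path, "rb") as f:
            response = requests.post(MAI_ENDPOINT, files={"document": f})
        response.raise_for_status()
        # Assume the service answers with JSON like {"terms": [...]}.
        results[path.name] = response.json()["terms"]
    return results

if __name__ == "__main__":
    for name, terms in index_directory("./backfile").items():
        print(name, "->", ", ".join(terms))
```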

MAI Batch GUI offers semantic enrichment by extracting concepts from input text in most file formats, including the following:

  • Adobe PDFs
  • MS Word DOC files
  • HTM/HTML pages
  • RTF documents
  • XML files

For XML files, the ‘XML Tags’ option permits users to define specific XML elements for MAI Batch GUI to analyze during batch processing. This option opens the door for indexing source documents that are tagged according to different XML schemas. XML Tags also permits the exclusion during indexing of sections in the document structure, as designated by the user.
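As a sketch of the underlying idea (not Data Harmony's actual implementation), the following shows how text can be collected only from user-designated XML elements while user-excluded subtrees are skipped entirely; the tag names are illustrative.

```python
import xml.etree.ElementTree as ET

# Illustrative tag configuration: index text from these elements...
INCLUDE_TAGS = {"title", "abstract", "body"}
# ...and skip these subtrees entirely during indexing.
EXCLUDE_TAGS = {"references", "acknowledgments"}

def extract_indexable_text(xml_path: str) -> str:
    """Gather text from included elements, ignoring excluded sections."""
    chunks = []

    def walk(element, inside_included=False):
        if element.tag in EXCLUDE_TAGS:
            return  # user-designated exclusion: drop the whole subtree
        inside = inside_included or element.tag in INCLUDE_TAGS
        if inside and element.text and element.text.strip():
            chunks.append(element.text.strip())
        for child in element:
            walk(child, inside)
            if inside and child.tail and child.tail.strip():
                chunks.append(child.tail.strip())

    walk(ET.parse(xml_path).getroot())
    return "\n".join(chunks)
```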

The interface’s Input and Output panes present a practical view of the batch during processing, enabling a degree of interactivity – M.A.I. is a very accessible automatic indexing system, and it remains a ‘machine-aided’ approach even when applied to batches of documents. IT support is helpful but not required to run and maintain the Data Harmony Suite of products.

When the documents already contain indexing terms, MAI Batch GUI derives accuracy statistics for the batch and logs them in the output. M.A.I. calculates the indexing accuracy of the terms that Concept Extractor suggests, compared to the previously applied subject terms. This powerful method for enhancing the accuracy of subject indexing is based on reports generated by the M.A.I. Statistics Collector, giving a taxonomy administrator all the data needed to continually improve the results based on the system’s recommendations, selections, and additions.
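Conceptually, that comparison reduces to set overlap between the suggested terms and the previously applied terms. A minimal sketch, with the caveat that the Statistics Collector's actual metrics may differ:

```python
def indexing_accuracy(suggested: set, applied: set) -> dict:
    """Compare machine-suggested terms with previously applied terms."""
    hits = suggested & applied
    return {
        "hits": len(hits),
        # Share of suggestions that match the existing indexing.
        "precision": len(hits) / len(suggested) if suggested else 0.0,
        # Share of existing terms that the machine also found.
        "recall": len(hits) / len(applied) if applied else 0.0,
    }

stats = indexing_accuracy(
    {"Education", "Agricultural Education", "Young Farmer Education"},
    {"Agricultural Education", "Young Farmer Education"},
)
print(stats)  # hits: 2, precision: ~0.67, recall: 1.0
```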

About Access Innovations, Inc. – www.accessinn.com, www.dataharmony.com, www.taxodiary.com

Founded in 1978, Access Innovations has leveraged semantic enrichment of text for internet technology applications, master data management, database creation, thesaurus/taxonomy creation, and semantic integration. Access Innovations’ Data Harmony software includes machine aided indexing, thesaurus management, an XML Intranet System (XIS), and metadata extraction for content creation developed to meet production environment needs.  Data Harmony is used by publishers, governments, and corporate clients throughout the world.

Blind Alleys, Dead Ends, and Mazes

June 9, 2014  
Posted in Access Insights, Featured, Taxonomy

“I don’t know where I am!”

[Image: a maze]

Time traveler Clara Oswald becomes disoriented once again, in a scary encounter with a taxonomy displayed in flat format.

Taxonomies can be displayed in a variety of ways. One of the display types that we occasionally see is known as the flat format display. It’s described in the main U.S. standard for controlled vocabularies, ANSI/NISO Z39.19 (Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies, published by the National Information Standards Organization) as follows:

“The flat format is the most commonly used controlled vocabulary display format. It consists of all the terms arranged in alphabetical order, including their term details, and one level of BT/NT hierarchy.”

At the top level, this format might (or might not) look like that of other hierarchical vocabularies when they are collapsed. What happens, though, when you start navigating to deeper levels? Let’s take a look at the ERIC Thesaurus published by the U.S. Department of Education’s Institute of Education Sciences. Here’s the initial view, when you choose to browse the thesaurus:

[Screenshot: ERIC Thesaurus browse view]

Aha, top terms, yes? Unfortunately, no. These are non-hierarchy category labels into which the actual terms are grouped, without regard for hierarchical placement. Clicking on any of these category labels results in a flat alphabetical display of all the terms in that category. This is something that thesaurus publishers can get away with when they use a flat format display.

If you click on the first category, Agriculture and Natural Resources, you see a flat alphabetical list of terms, including Agricultural Education. Clicking on that, you would discover that its one broader term is Education (no, not Agriculture and Natural Resources), and that its one narrower term is Young Farmer Education. What you see is basically a term record, and that’s all. That’s flat format display.

[Screenshot: flat format term record]

Are there problems with this? I think so. Even if the vocabulary is viewed only by the people constructing and maintaining it, those people will have difficulty spotting gaps and redundancies. And even if the vocabulary is used only by in-house human indexers, they will have difficulty exploring it to find the most appropriate terms to apply for indexing, and they will tend to use the first terms they come across that seem to fit. In the latter scenario, the ignored terms are apt to fall victim to usage statistics, even if they’re good terms that should have been used. (I’ve seen this happen to at least one taxonomy.)

While the format may have simplified things in the days of printed taxonomies, it poses problems for today’s taxonomists and indexers. And think of the problems encountered by searchers looking for information resources. Searchers benefit from being able to navigate and explore a taxonomy, and to take full advantage of its hierarchical structure. The flat format doesn’t present a hierarchy; instead, it presents obstacles.
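The contrast is easy to see in miniature. Below is a toy sketch, using the ERIC terms from the example above, of what a flat format record shows (one level of BT/NT) versus what a full hierarchical display shows.

```python
# Toy thesaurus fragment: term -> its narrower terms.
NARROWER = {
    "Education": ["Agricultural Education"],
    "Agricultural Education": ["Young Farmer Education"],
    "Young Farmer Education": [],
}
BROADER = {
    "Agricultural Education": "Education",
    "Young Farmer Education": "Agricultural Education",
}

def flat_record(term):
    """Flat format: one level up, one level down, and nothing else."""
    print(term)
    print("  BT:", BROADER.get(term, "(none)"))
    print("  NT:", ", ".join(NARROWER[term]) or "(none)")

def full_hierarchy(term, depth=0):
    """Hierarchical display: the entire branch at a glance."""
    print("  " * depth + term)
    for narrower in NARROWER[term]:
        full_hierarchy(narrower, depth + 1)

flat_record("Agricultural Education")  # the blinkered flat view
full_hierarchy("Education")            # the panoramic view
```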

Blind Alleys


While you’re traveling down one path, you don’t have an opportunity to see what’s in nearby pathways, or in distant but related pathways.

Dead Ends


You can’t see where you’re headed, or how far the path goes. The path that you originally saw as promising might only lead to a stone wall, after you’ve already traveled one term at a time to get there. (Some flat format taxonomies, though, turn out to be unexpectedly shallow, so you’re more apt to hit a dead end sooner rather than later.)

Mazes


Because you can’t see more than one level before and after the term you’re in, and you can’t see over the hedge to other pathways, you may end up zigzagging and backtracking through the taxonomy in a frustrating guessing game.

Getting a Better View


Ideally, you should be able to view the full panorama of a taxonomy’s coverage. At the same time, you should be able to focus on the areas of interest to you. And you should be able to view more than one branch at the same time, and to view entire branches. To accomplish those goals, you need a full hierarchical display that you can expand and collapse as needed. The example below is a screenshot of the MediaSleuth thesaurus, with some branches expanded to full view with a click of the mouse.

[Screenshot: MediaSleuth thesaurus, expanded view]

With this kind of view, we can see our way in all directions, from wherever we are. We can see where we might want to go from there, and how to get there. We know exactly where we are.

Barbara Gilles, Taxonomist
Access Innovations

Access Innovations, Inc. Announces Release of the Semantic Fingerprinting Web Service Extension for Data Harmony Version 3.9

June 2, 2014  
Posted in Access Insights, Featured, semantic

Access Innovations, Inc. announces the Semantic Fingerprinting Web service extension as part of their Data Harmony Version 3.9 release. Semantic Fingerprinting is a managed Web service offered to scholarly publishers to disambiguate author names and affiliations by leveraging semantic metadata within an existing publishing pipeline.

The Semantic Fingerprinting Web service data mines a publisher’s document collection to build a database of named authors and affiliated institutions, and then expands the database over time with customization and administration services provided by Access Innovations during configuration. The author/affiliation database powers M.A.I.™ (Machine Aided Indexer) algorithms for matching names in new content received from contributors. During the configuration phase, an essential component is the graphical user interface (GUI) where users disambiguate unmatched names using clues that M.A.I. surfaces as a result of rigorous document analysis.

“Like a fingerprint, each author has a unique ‘semantic profile’ that captures the specific disciplines and topic areas in which they publish – reflecting subject areas covered in their body of research. Data Harmony generates subject keywords that describe the document’s content, to increase the number of author name matches a reviewer can find during editorial review of unresolved names,” explained Kirk Sanders, Access Innovations Taxonomist and Data Harmony Technical Editor.
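As a hedged sketch of the matching idea: candidate authors with the same or similar names can be separated by comparing subject-keyword profiles. The scoring below is purely illustrative, not the patented M.A.I. algorithm.

```python
def jaccard(a: set, b: set) -> float:
    """Overlap between two sets of subject keywords."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Two database records for authors named 'J. Smith', with the subject
# keywords accumulated from their prior publications.
candidates = {
    "J. Smith (Univ. A)": {"taxonomies", "metadata", "indexing"},
    "J. Smith (Univ. B)": {"particle physics", "detectors", "colliders"},
}

# Subject keywords generated for a newly submitted paper.
new_paper_terms = {"metadata", "indexing", "thesauri"}

best = max(candidates, key=lambda name: jaccard(candidates[name], new_paper_terms))
print(best, jaccard(candidates[best], new_paper_terms))
# J. Smith (Univ. A) 0.5
```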

“Semantic Fingerprinting is a versatile addition to the Data Harmony software lineup,” said Marjorie M. K. Hlava, President of Access Innovations, Inc. “Publishers can incorporate Semantic Fingerprinting to build each author’s profile, precisely reflecting that person’s research and publication achievements and institutional affiliations – all driven by information that’s already moving through the pipeline. It’s an elegant approach to data-mining a document stream for highly practical purposes, an approach presenting immediate benefits for the scholarly publisher.”

“Semantic Fingerprinting is driven by patented natural language processing algorithms,” responded Bob Kasenchak, Production Manager at Access Innovations, when asked to comment on the module’s inclusion in the Version 3.9 software update release. “The Web service enables a publisher to move far beyond adding subject metadata in their pipeline by supplementing it with the author’s research profile. This module and the process also offer a new way to improve precise document search and retrieval. Enhancements to document metadata also present opportunities to support other functions related to marketing or assigning appropriate peer reviewers.”

The Semantic Fingerprinting extension for Data Harmony 3.9 is a Web service (managed by Access Innovations) that relates terms from a publisher’s controlled vocabulary (a taxonomy or thesaurus) to the contributing authors, their affiliated institutions, and other relevant metadata. Software components such as the user interfaces and entity-matching algorithms are adjustable, because every data set needs a targeted approach. As more data is processed by the matching algorithms and/or human editors, the name authority file and other processes require routine monitoring and adjustments. In many cases, suggestions for adjustments will come from human editors, based on questionable entities that they resolve by searching the name authority file in the Semantic Fingerprinting interface.

About Access Innovations, Inc. – www.accessinn.com, www.dataharmony.com, www.taxodiary.com

Founded in 1978, Access Innovations has extensive experience with Internet technology applications, master data management, database creation, thesaurus/taxonomy creation, and semantic integration. Access Innovations’ Data Harmony software includes machine aided indexing, thesaurus management, an XML Intranet System (XIS), and metadata extraction for content creation developed to meet production environment needs.  Data Harmony is used by publishers, governments, and corporate clients throughout the world.

Putting Human Intelligence To Work To Enhance the Value of Information Assets

May 26, 2014  
Posted in Access Insights, Featured, Taxonomy

Semantic enhancement extends beyond journal article indexing, though the ability of users to easily find all the relevant articles (your assets) when searching still remains the central purpose. Now, in addition to articles, semantic “fingerprinting” is used for identifying and clustering ancillary published resources, media, events, authors, members or subscribers, and industry experts.

The system you choose to enhance the value of your assets, and the people behind it, are extraordinarily important.

It starts with a profile of your electronic collection. It may include a profile of your organization as well. As you choose the concepts that represent the areas of research today and in the past, the ideas and thoughts of your most articulate representatives, the emerging methods and technologies, you bring together a picture of the overall effort. This can be done with a thesaurus, an organized list of terms representing those concepts (taxonomy) enhanced with relationship links between terms (synonyms, related terms, web references, scope notes). The profile provides an illustration of the nature of intellectual effort being expended and, equally important, the shape of the organizational knowledge that is your key asset.
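As a rough sketch, a single term record in such a thesaurus might carry fields like these (the field names are illustrative, not Data Harmony's internal format):

```python
from dataclasses import dataclass, field

@dataclass
class ThesaurusTerm:
    """One concept in the profile, with its relationship links."""
    preferred_term: str
    synonyms: list = field(default_factory=list)        # non-preferred terms
    broader_terms: list = field(default_factory=list)
    narrower_terms: list = field(default_factory=list)
    related_terms: list = field(default_factory=list)
    web_references: list = field(default_factory=list)
    scope_note: str = ""

term = ThesaurusTerm(
    preferred_term="Semantic enrichment",
    synonyms=["Semantic enhancement"],
    related_terms=["Subject indexing", "Metadata"],
    scope_note="Adding concept-level metadata to content.",
)
print(term.preferred_term, "->", term.related_terms)
```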

We’d like to convince you that human intelligence is still the most powerful engine driving the development and maintenance of this lexicographic profile. Technology tools help with the content mining, frequency analyses, and other measures valuable to a taxonomist, but the organization, concept expression, and relationship building are still best done by humans.

Similarly, the application of the thesaurus is best done by humans. Because of the volume of content items being created every day, it may not be possible to have human indexers review each of them. Our automated systems can achieve perhaps 90% “accuracy” (i.e., matching what a human indexer would choose), so high-value content is still indexed by humans, much more efficiently than in the past, but still by humans. The rest requires the contribution of humans to inform the algorithm in actual natural (human) language. Fully enabled, the automated system produces impressive precision in identifying the “aboutness” of a piece of content.

And how can a system achieve accuracy and consistency? Our approach is to reflect the reasoning process of humans, using a set of rules. Our rule base is simple to enhance and simple to maintain, and like the thesaurus, flexible enough to accommodate new terminology in a discipline as it evolves.  About 80% of the rules work well just as initially (automatically) created. The other 20% achieve better precision when ‘touched’ by a human who adds conditions to limit, broaden, or disambiguate the use of the term triggering the rule.
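Here is a hedged illustration of that 80/20 split, in toy Python rather than Data Harmony's actual rule syntax: an automatically created rule simply maps a trigger phrase to a term, while a human-touched rule adds conditions to disambiguate an ambiguous trigger.

```python
import re

def suggest_terms(text: str) -> set:
    """Apply two toy indexing rules to a piece of text."""
    terms = set()
    # Automatically created rule: trigger phrase -> term, no conditions.
    if re.search(r"\bsolar energy\b", text, re.I):
        terms.add("Solar energy")
    # Human-refined rule: 'mercury' is ambiguous, so added conditions
    # limit and disambiguate the use of the triggering word.
    if re.search(r"\bmercury\b", text, re.I):
        if re.search(r"\b(planet|orbit|NASA)\b", text, re.I):
            terms.add("Mercury (planet)")
        elif re.search(r"\b(toxic|exposure|thermometer)\b", text, re.I):
            terms.add("Mercury (element)")
    return terms

print(suggest_terms("NASA sent a probe into orbit around Mercury."))
# {'Mercury (planet)'}
```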

Mathematical analyses work to identify statistical characteristics of large numbers of items and are quite useful in making business decisions. But making decisions about meaning? For many decades now, researchers have been working to find a way to analyze natural language that would come somewhere near the precision provided by human indexers and abstractors. Look at IBM’s supercomputer “Watson” and the years and resources invested to produce it. It continues to miss the simple (to us) relationships between words and context that humans understand intuitively.

Mary Garcia, Systems Analyst
Access Innovations

The Size of Your Thesaurus

May 19, 2014  
Posted in Access Insights, Featured, Taxonomy

During the initial stages of discussing a new taxonomy project, I am frequently asked questions like:

How granular does my taxonomy need to be?

How many levels deep should the vocabulary go?

And especially:

How many terms should my thesaurus have?

The answer is—of course—it depends.

The smallest thesaurus project with which I’ve ever been involved was for a thesaurus of 11 terms; the largest is a 57,000-word vocabulary.

We once lost a bid because we refused to agree to build a 10,000-word thesaurus (not approximately, exactly); no matter how loudly we insisted that it’s far more logical (“best practice”) to let the data decide the size of the thesaurus, someone had already decided on an arbitrary number.

At Access Innovations, we like to say that we build “content-aware” taxonomies, that the data will tell us how large the taxonomy should be. The primary data point is the content: How much is there? What is the ongoing volume being published? Clearly, no one needs a 25,000-word thesaurus to index 1000 documents; similarly, a 200-term thesaurus is not going to be that useful if you have 800,000 journal articles.

Just as returning 2,000,000 search results is not very helpful (unless what you’re looking for is on the first page), a thesaurus term with which 20,000 articles are tagged isn’t doing that much good—more granularity is probably required. There are very likely sub-types or sub-categories of that concept that you can research and add.

The flip side is that you don’t need terms in your vocabulary—no matter how cool they may be—if there is little or no content requiring them for indexing. Your 1500-word branch of particle physics terms is just dead weight in the great psychology thesaurus you’re developing.

Other factors include the type of users you have searching your content: Are they third-graders? Professional astrophysicists? High school teachers? Reviewing search logs and interviewing users is another way to focus your approach, which in turn will help you gauge the size your taxonomy will be in the end.

Let’s make up an example (as an excuse to post pictures that are fun to look at). We’re building a taxonomy that includes some terms about furniture, including the concept Sofa.

PT = Sofa

NPT = Couch

Now, being good taxonomists, we’re obviously lightning-fast researchers, so we quickly uncover some other candidate terms:

Cabriole

Camelback

Canapé

Chesterfield

Davenport

Daybed

Divan

Empire(-style)

English Rolled Arm

Lawson

Loveseat

Settee

Tuxedo

[Image: assorted sofa styles]

It looks like a real taxonomy of sofas would depend at least partly on arm height?

Whereas “couch” is clearly a synonym, these could all be narrower terms (NTs) for Sofa, as they are all distinct types, styles, and sub-classes. Alternatively, these could all be made NPTs for Sofa, so that any occurrence of the words above would index to Sofa and be available for search, browse, etc.

How do we decide the proper course of action?

We let the content tell us.

How many articles in our imaginary corpus reference, say, the Cabriole, Camelback, or Canapé?

  • If the answer is “none”, there’s clearly no need for this term; however, adding it as an NPT will catch any future occurrences, so we may as well be completist.
  • If the answer is “many”—some significant proportion of the total mentions of Sofa or Couch—then the term definitely merits its own place in the taxonomy.
  • If the answer is “few”—more than none, but not enough to warrant inclusion—go ahead and add it as an NPT. You can always promote it to preferred term status later. (A rough sketch of this decision logic follows.)
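A rough sketch of that decision logic, with thresholds invented purely for illustration (a real project would tune them against the corpus):

```python
def classify_candidate(mentions: int, parent_mentions: int,
                       significant_share: float = 0.1,
                       minimum: int = 5) -> str:
    """Toy version of the 'let the content tell us' rule.
    Thresholds are illustrative, not a prescribed best practice."""
    if mentions == 0:
        return "add as NPT (catch future occurrences)"
    if mentions >= minimum and mentions >= significant_share * parent_mentions:
        return "add as preferred term (NT of Sofa)"
    return "add as NPT (promote later if usage grows)"

for candidate, count in [("Chesterfield", 120), ("Canapé", 3), ("Cabriole", 0)]:
    print(candidate, "->", classify_candidate(count, parent_mentions=400))
```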

However—and this is a big exception—if you find through reviewing search logs that a significant number of searchers were looking for a particular term, it might signal that it’s an emerging concept, new trend, or hot topic, in which case you may decide to override the statistical analysis and err on the side of adding it to the thesaurus. It won’t hurt anything, and as long as your hierarchy is well formed and your thesaurus is rich in related terms, people will find what they’re looking for…which is, after all, the goal.

So remember: It’s not only the size of your taxonomy that’s important—it’s how relevant it is to the content and users for which it’s designed.

Bob Kasenchak, Project Coordinator
Access Innovations

Access Innovations, Inc. Announces Release of the Smart Submit Extension Module to Data Harmony Version 3.9

May 12, 2014  
Posted in Access Insights, Featured, Taxonomy

Access Innovations, Inc. has announced the Smart Submit extension module as part of their Data Harmony Version 3.9 release. Smart Submit is a Data Harmony application for integrating author-selected subject metadata into a publishing workflow during the paper submission or upload process. Smart Submit facilitates the addition of taxonomy terms by the author. With Smart Submit, each author provides subject metadata from the publisher taxonomy to accompany the item they are submitting. During the submission process, Data Harmony’s M.A.I. core application suggests subject terms based on a controlled vocabulary, and the author chooses appropriate terms to describe the content of their document, thus enabling early categorization, selection of peer reviewers, and support for trend analysis.

“Smart Submit is an exciting addition to the Data Harmony repertoire,” said Marjorie M. K. Hlava, President of Access Innovations, Inc. “Publishers can easily incorporate Smart Submit, streamlining several steps at the beginning of their workflow, as well as semantically enriching that content at the beginning of the production process. They are getting far more benefits, and doing so without adding time and effort.”

“The approach is simple on the surface and supported by very sophisticated software,” remarked Bob Kasenchak, Production Manager at Access Innovations. “The document is indexed using the Data Harmony software, which returns a list of suggested thesaurus terms, from which the author selects appropriate terms. Smart Submit supports the creation of a ‘semantic fingerprint’ for the author, collecting additional information along with the subject metadata. Finally, the tagged content is added to the digital record to complete production and be added to the data repository. It’s an amazing system to see in action.”

Smart Submit can be implemented in several ways, including:

  • as a tool for assisting authors to self-select appropriate metadata assigned to their name and research at the point of submission into the publishing pipeline;
  • as an editorial interface between a semantically enriched controlled vocabulary and potential submissions to the literature corpus;
  • for editorial review of subject indexing at the point of submission, enabling a robust evolution of a controlled vocabulary (taxonomy or thesaurus) by encouraging timely rule base refinement;
  • for simultaneous assignment of descriptive and subject metadata by curators of document repositories, for efficient integration of documents in a large collection; and
  • as a method of tracking authors and their submissions for conference proceedings, symposia, and the like.

“This flexible system opens the door between Data Harmony software and an author submission pipeline in fascinating new ways,” commented Kirk Sanders, an Access Innovations taxonomist. “Users can choose a configuration that maximizes the gain from their organizational taxonomy, at the point it is needed most: when their authors log on to submit their documents.”

 

About Access Innovations, Inc. – www.accessinn.com, www.dataharmony.com, www.taxodiary.com

Founded in 1978, Access Innovations has extensive experience with Internet technology applications, master data management, database creation, thesaurus/taxonomy creation, and semantic integration. Access Innovations’ Data Harmony software includes machine aided indexing, thesaurus management, an XML Intranet System (XIS), and metadata extraction for content creation developed to meet production environment needs.  Data Harmony is used by publishers, governments, and corporate clients throughout the world.

Hold the Mayo! A study in ambiguity


When we (at least those of us in Greater Mexico) hear of or read about Cinco de Mayo, there is no question in our minds that “Mayo” refers to the month of May. The preceding “Cinco de” (Spanish for “Fifth of”) pretty much clinches it. Of course, if the overall content is in Spanish, there might still be some ambiguity about whether it is the holiday that is being referred to, or simply a date that happens to be the one after the fourth of May. (As in “Hey, what day do we get off work?” “The fourth of July, I think.”)

We can generally resolve this kind of ambiguity by the context, as can a good indexing system and a rule base associated with a taxonomy.

If you’re reading this posting, you read English. So there’s a good chance that when you read the word “mayo”, you think of the sandwich spread formerly and formally known as mayonnaise.


Or perhaps the famous Mayo Clinic comes to mind. If you’re an American football fan (I had to throw “American” in there to differentiate the mentioned sport from soccer), you might think of New England Patriots linebacker Jerod Mayo.

The context enables us to recognize which mayo we’re dealing with. Likewise, an indexing system might take context into account when encountering the slippery word. A really good indexing rule base might help you sort things out when you have got text about Jerod Mayo’s line of mayonnaise, the proceeds of which he is donating to the Boston (not Mayo) Clinic.
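In toy form, such context rules might look like the following; the trigger words are invented for illustration, not drawn from any real rule base.

```python
import re

def disambiguate_mayo(text: str) -> str:
    """Pick a sense for 'mayo' from surrounding context (toy rules)."""
    t = text.lower()
    if "cinco de mayo" in t or re.search(r"\bde mayo\b", t):
        return "May (month)"
    if re.search(r"\b(clinic|hospital|physician)\b", t):
        return "Mayo Clinic"
    if re.search(r"\b(county|ireland|castle)\b", t):
        return "County Mayo"
    if re.search(r"\b(jerod|linebacker|patriots)\b", t):
        return "Jerod Mayo"
    if re.search(r"\b(sandwich|spread|jar)\b", t):
        return "Mayonnaise"
    return "ambiguous: route to a human indexer"

print(disambiguate_mayo("Hold the mayo on that sandwich."))  # Mayonnaise
```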


As a person of Irish descent, I know perfectly well that that is not the end of Mayo’s spread. There is a County Mayo in Ireland, which has a few other Mayos, too.


If you consult the Mayo disambiguation page in Wikipedia, you will quickly discover that Mayo goes much further than Ireland. There are Mayos of one sort or another all over the world: towns, rivers, and an assortment of other geographical entities that might easily co-exist in a taxonomy or gazetteer.

Traveling down past the geographical Mayos on the Wikipedia page, one finds the names of dozens and dozens of people, many of whom have Mayo as a first name, and many of whom have Mayo as a last name. Thank goodness the four relatively famous William Mayos have different middle names.

The final category on Wikipedia’s Mayo page is, perhaps inevitably, Other. There are quite a few Other Mayos. And what might the last one be?  Where has this journey taken us?

Mayo, the Spanish word for May.

Hold the Cinco de Mayo celebration!


Barbara Gilles, Taxonomist
Access Innovations

The Supposed Advantages of Statistical Indexing

I would like to make some observations about statistics-based categorization and search, and about the advantages that their proponents claim.

First of all, statistics-based co-occurrence approaches do have their place: for wide-ranging bodies of text such as email archives and social media exchanges, and for assessing the nature of an unknown collection of documents. In these circumstances, a well-defined collection of concepts covering a pre-determined area of study and practice might not be practicable. For lack of a relevant controlled vocabulary foundation, and for lack of other practical approaches, attempts at analysis fall back on less-than-ideal mathematical approaches.

Co-occurrence can do strange things. You may have done Google searches that got mysteriously steered to a different set of search strings, apparently based on what other people have been searching on. This is a bit like the proverbial search for a key under a street light, instead of in the various places where the key is more likely to be, simply because the light is better and the search is easier there.

These approaches are known for low search results accuracy (60% or less); this is unacceptable for the article databases of research institutions, professional associations, and scholarly organizations. Not only is this a disservice to searchers in general; it is an extreme disservice to the authors whose insights and research reports get overlooked, and to the researchers who might otherwise find a vital piece of information in a document that the search misses.

The literature databases and repositories of research-oriented organizations cover specific disciplines, with well-defined subdisciplines and related fields. This makes them ideal for a keyword/keyphrase approach that utilizes a thesaurus. These well-defined disciplines have well-defined terminology that is readily captured in a taxonomy or thesaurus. The thesaurus has additional value as a navigable guide for searchers and for human indexers, including the authors and researchers who know the material best. Surely, they know the material far better than any algorithm could, and can take full advantage of the indexing and search benefits that a thesaurus can offer. An additional benefit (and a huge one) of a thesaurus is that it can serve as the basis for an associated indexing rule base that, if properly developed, can run rings around any statistics-based approach.

Proponents of statistics-based semantic indexing approaches claim that searchers need to guess the specific words and phrases that appear in the documents of potential interest. On the contrary, a human can anticipate these things much better than a co-occurrence application can. Further, with an integrated rule-based implementation, the searcher does not need to guess all of the exact words and phrases that express particular concepts in the documents.

With indexing rooted in the controlled vocabulary of the thesaurus, documents that express the same concept in various ways (including ways that completely circumvent anything that co-occurrence could latch onto) are brought together in the same set of search results. Admittedly, I do appreciate the statistical approaches’ success in pulling together certain words that they have learned (through training) may be important. The proximity or distance between the words matters in the ranking of returned results, implying relevance to the query. However, that unknown distance makes it hard for a statistical approach to latch onto a concept, and therefore makes it unreliable.

Use of a thesaurus also enables search interface recommendations for tangentially related concepts to search on, as well as more specific concepts and broader concepts of possible interest. The searcher and the indexer can also navigate the thesaurus (if it’s online or in the search interface) to discover concepts and terms of interest.

With statistic-based indexing, the algorithms are hidden in a mysterious “black box” that mustn’t be tinkered with. With rule-based, taxonomy-based approaches, the curtain can be pulled back, and the workings can be directly dealt with, using human intelligence and expert knowledge of the subject matter.

As for costs, statistical approaches generally require ‘training sets’ of documents to ‘learn’ or develop their algorithms. Any newly added concept/term means collecting another ten to fifty documents on the topic. If the new term is to be a narrower term of an existing term, that broader term’s training set loses meaning and must be redone to differentiate the concepts. Consider that point in ongoing management of a set of concepts.

Any concept that is not easily expressed in a clear word or phrase, or that is often referred to through creative expressions that vary from instance to instance, will require a significantly larger training set. The expanded set will still be likely to miss relevant resources while retrieving irrelevant resources.

Dealing with training sets is costly, and is likely to be somewhat ineffective, at that; if the selected documents don’t cover all the concepts that are covered in the full database, you won’t end up with all the algorithms you need, and the ones that are developed will be incomplete and likely inaccurate. So with a statistics-based approach, you won’t reap the full value of your efforts.

Barbara Gilles, Taxonomist
Access Innovations

Data Harmony® Software Version 3.9 Now Available

April 21, 2014  
Posted in Access Insights, Autoindexing, Featured

Access Innovations, Inc. has announced that Version 3.9 of its Data Harmony Suite of software tools is now available.

The Data Harmony Suite provides content management solutions to improve information organization by systematically applying a taxonomy or thesaurus in total integration with patented natural language processing methods. MAIstro, the award-winning flagship software module of the Data Harmony product line, combines Thesaurus Master® (for taxonomy creation and maintenance) with M.A.I. (Machine Aided Indexer) for interactive text analysis and better subject tagging.

Data Harmony software gives users the power to quickly attach precise indexing terms for their documents, reflecting the controlled vocabulary of their business sector or academic discipline – at document level, in the metadata. With the MAIstro interface, editors can adjust vocabulary terms or corresponding rules that govern indexing, to generate very accurate and consistent subject metadata for content objects. The MAIstro rule-building screen and indexing rule base syntax are highly intuitive, and accessible to non-programmers.

Access Innovations has added the following custom features and improvements to Data Harmony software for this release: 

MAIstro Version 3.9 – for user-friendly, full-featured, structured vocabulary development and management

  • Emphasis display in Thesaurus Master – permitting use of bolding, italics, and underlining for terms
  • New Thesaurus Master export formats – OWL2 (Web Ontology Language), Full Path XML to support XML-driven content management systems and systems that rely on breadcrumb navigation, and the Custom XML Fields format (users select the fields to be included when generating an export)
  • Enhanced Adobe PDF support –  Test MAI supports text from newer, full-feature PDFs, to be used during spot-testing of the indexing rule base
  • Improvements to Thesaurus Master’s Import Module and the Data Harmony command line project creation tool – for improved preservation of source file term IDs

M.A.I. Version 3.9 – for automated indexing, machine-aided term selection interactivity, and extending the implementation of indexing based on a very accessible rule base

  • Support for returning the full path of suggested terms
  • Enhanced reporting features for identifying the usage of conditional rules
  • Increased maximum rule length, to allow for more complex and specific rule-building
  • Easy implementation of secure login via SSL (Secure Sockets Layer) beginning at project creation, if desired
  • More flexible and more memory-efficient capabilities for implementation with inline tagging

For Version 3.9, application developers placed an emphasis on creatively leveraging API calls to accomplish specialized goals for existing Data Harmony customers, resulting in new software packages becoming available for the first time. Data Harmony API calls offer a method for system administrators to facilitate ongoing Web services for Data Harmony users. API calls also provide a way to configure direct machine-to-machine operations for processing information stored in Data Harmony. All APIs are available as Web services.

Additional major updates in this Data Harmony software release include improvements to Search Harmony and the addition of a new service, Data Harmony’s Smart Submit extension module for author submission of articles and other documents.

“This version of the software shows a strong move to Web-based functionality,” said Allex Lyons, one of the Data Harmony programming team. “Our recent enhancements provide greater search efficiency while browsing your thesaurus, through callouts and hyperlinks to related and narrower terms.”

For further information, see www.dataharmony.com.

About Access Innovations, Inc. – www.accessinn.com, www.dataharmony.com, www.taxodiary.com

Founded in 1978, Access Innovations has extensive experience with Internet technology applications, master data management, database creation, thesaurus/taxonomy creation, and semantic integration. Access Innovations’ Data Harmony software includes automatic indexing, thesaurus management, an XML Intranet System (XIS), and metadata extraction for content creation developed to meet production environment needs.  Data Harmony is used by publishers, governments, and corporate clients throughout the world.

Limitations of Fuzzy Matching of Lexical Variants

April 14, 2014  
Posted in Access Insights, Featured, semantic

Some vendors of text analytics software claim that their software can identify the occurrences of text reflecting specific taxonomy terms  (with the strong, and false, implication that it identifies all such occurrences) using “fuzzy matching” or “fuzzy term matching.” Some explanations of the technology, from Techopedia and Wikipedia, show that it is a fairly crude mathematical approach, similar to the co-occurrence statistical approaches that such software also tends to use, and no match for rule-based indexing approaches that derive their effectiveness from human intelligence.

I remember searching online for information about the Data Harmony Users Group (DHUG) meeting. Google, in its infinitely fuzzy wisdom, asked “Did you mean ‘thug’?”

As explained in Techopedia,

Fuzzy matching is a method that provides an improved ability to process word-based matching queries to find matching phrases or sentences from a database. When an exact match is not found for a sentence or phrase, fuzzy matching can be applied. Fuzzy matching attempts to find a match which, although not a 100 percent match, is above the threshold matching percentage set by the application.

Fuzzy matching is mainly used in computer-assisted translation and other related applications.

Fuzzy matching searches a translation memory for a query’s phrases or words, finding derivatives by suggesting words with approximate matching in meanings as well as spellings.

The fuzzy matching technique applies a matching percentage. The database returns possible matches for the queried word between a certain percentage (the threshold percentage) and 100 percent.

So far, fuzzy matching is not capable of replacing humans in language translation processing.

And the Wikipedia article on the subject explains as follows:

“The closeness of a match is measured in terms of the number of primitive operations necessary to convert the string into an exact match. This number is called the edit distance between the string and the pattern. The usual primitive operations are:

    • insertion: cot → coat
    • deletion: coat → cot
    • substitution: coat → cost

These three operations may be generalized as forms of substitution by adding a NULL character (here symbolized by *) wherever a character has been deleted or inserted:

    • insertion: co*t → coat
    • deletion: coat → co*t
    • substitution: coat → cost

The most common application of approximate matchers until recently has been spell checking.”
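For concreteness, the edit distance the article describes can be computed with the standard dynamic-programming algorithm; “cot” and “coat”, for instance, are one insertion apart.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

print(edit_distance("cot", "coat"))   # 1 (one insertion)
print(edit_distance("coat", "cost"))  # 1 (one substitution)
# Fuzzy matchers accept a candidate when a similarity score derived
# from this distance clears the threshold percentage.
```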

In blog comments, readers have pointed out the large number of false positives this approach would create. As a small example of the possibilities, think what would happen with the American Society of Civil Engineers (ASCE) thesaurus and “bridging the gap.” Not to mention the many, many words and acronyms that a rule-based approach could easily disambiguate.

And then there are all the occurrences of the concept that would be totally missed by the fuzzy matching approach. Not all synonyms, nor all the other kinds of expressions of various concepts, are lexical variants that are similar character strings. Fuzzy matching has no way of dealing with these alternative expressions, and they happen often.

There is another problem with these approaches. They are sometimes tied in with weighting, with the “edit distance” mentioned in the Wikipedia article used to downgrade the supposed relevance of a lexical variant. Why on earth should a variant be downgraded, if the intended concept is completely identical to the one expressed by the corresponding preferred term?

The fuzzy approach does not save human time and effort. Rules covering a wide variety of lexical variants can be written using truncated strings as the basic text to match, and proximity conditions added as desired to make those rules more accurate.
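For example, with a regular expression standing in for actual rule syntax, a truncated stem catches the lexical variants deterministically, and a proximity condition keeps the rule from firing in the wrong contexts:

```python
import re

# Truncated stem: catches crystallize, crystallized, crystallization, ...
STEM = r"\bcrystalli[sz]\w*"
# Proximity condition: a chemistry word within ~40 characters of the stem.
RULE = re.compile(STEM + r".{0,40}\b(protein|solution|polymer)\b", re.I | re.S)

def rule_fires(text: str) -> bool:
    return bool(RULE.search(text))

print(rule_fires("Crystallization of the protein proceeded slowly."))  # True
print(rule_fires("Her plans began to crystallize over the winter."))   # False
```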

Sometimes (in fact, rather frequently), a concept is expressed in separate parts over the course of a sentence, a paragraph, or a larger span of text. There is no way that I’m aware of that fuzzy matching can deal with that. A rule-based approach can.

In short, fuzzy matching has serious deficiencies as far as indexing and search are concerned, and is vastly inferior to a rule-based approach.

Barbara Gilles, Taxonomist
Access Innovations
