Semantic enhancement extends beyond journal article indexing, though the ability of users to easily find all the relevant articles (your assets) when searching remains the central purpose. Now, in addition to articles, semantic “fingerprinting” is used for identifying and clustering ancillary published resources, media, events, authors, members or subscribers, and industry experts.
The system you choose to enhance the value of your assets, and the people behind it, is extraordinarily important.
It starts with a profile of your electronic collection. It may include a profile of your organization as well. As you choose the concepts that represent the areas of research today and in the past, the ideas and thoughts of your most articulate representatives, the emerging methods and technologies, you bring together a picture of the overall effort. This can be done with a thesaurus, an organized list of terms representing those concepts (taxonomy) enhanced with relationship links between terms (synonyms, related terms, web references, scope notes). The profile provides an illustration of the nature of intellectual effort being expended and, equally important, the shape of the organizational knowledge that is your key asset.
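The kind of profile described above can be sketched as a simple data model. Here is a minimal illustration in Python; the field names and the sample term are invented for this sketch and do not reflect any particular thesaurus product's schema.

```python
from dataclasses import dataclass, field

# A minimal sketch of a thesaurus entry with the relationship links
# described above; field names and the sample term are illustrative.
@dataclass
class ThesaurusTerm:
    preferred_term: str                            # the concept's label used in indexing
    synonyms: list = field(default_factory=list)   # non-preferred terms
    broader: list = field(default_factory=list)    # links up the hierarchy
    narrower: list = field(default_factory=list)   # links down the hierarchy
    related: list = field(default_factory=list)    # cross-links to related concepts
    web_references: list = field(default_factory=list)
    scope_note: str = ""                           # usage guidance for indexers

term = ThesaurusTerm(
    preferred_term="Particle physics",
    synonyms=["High-energy physics"],
    broader=["Physics"],
    related=["Particle accelerators"],
    scope_note="Use for the study of subatomic particles and their interactions.",
)
```

Collected across a whole vocabulary, entries like this one form the lexicographic profile: the hierarchy shows the shape of the organizational knowledge, and the synonym and related-term links capture how people actually talk about it.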
We’d like to convince you that human intelligence is still the most powerful engine driving the development and maintenance of this lexicographic profile. Technology tools help with the content mining, frequency analyses, and other measures valuable to a taxonomist, but the organization, concept expression, and relationship building are still best done by humans.
Similarly, the application of the thesaurus is best done by humans. Because of the volume of content items being created every day, it may not be possible to have human indexers review each of them. Our automated systems can achieve perhaps 90% “accuracy” (i.e., matching what a human indexer would choose), so high-valued content is still indexed by humans, much more efficiently than in the past, but still by humans. The balance of the content requires humans to inform the algorithm, in actual natural (human) language. Fully enabled, the automated system produces impressive precision in identifying the “aboutness” of a piece of content.
And how can a system achieve accuracy and consistency? Our approach is to reflect the reasoning process of humans, using a set of rules. Our rule base is simple to enhance and simple to maintain, and like the thesaurus, flexible enough to accommodate new terminology in a discipline as it evolves. About 80% of the rules work well just as initially (automatically) created. The other 20% achieve better precision when ‘touched’ by a human who adds conditions to limit, broaden, or disambiguate the use of the term triggering the rule.
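As a rough illustration of that 80/20 split, here is a toy rule base in Python. The rule structure, trigger words, and context window are invented for this sketch and are not Data Harmony's actual rule language: most rules simply assign a term whenever the trigger appears, while a human-edited rule adds a context condition to limit or disambiguate the match.

```python
import re

# Toy sketch of a rule base: an automatically created rule assigns its
# term whenever the trigger string appears; a human-edited rule also
# requires a nearby context word, limiting or disambiguating the match.
def make_rule(trigger, assign, require_near=None, window=40):
    def rule(text):
        for m in re.finditer(re.escape(trigger), text, re.IGNORECASE):
            if require_near is None:
                return assign
            context = text[max(0, m.start() - window):m.end() + window].lower()
            if any(word in context for word in require_near):
                return assign
        return None
    return rule

rules = [
    make_rule("thesaurus", "Thesauri"),               # works as auto-created
    make_rule("java", "Java (programming language)",  # needed a human touch
              require_near=["code", "class", "compiler"]),
]

def index(text):
    """Return the terms the rule base assigns to a piece of text."""
    return sorted({r(text) for r in rules} - {None})
```

With this sketch, “The compiler rejected my Java class” triggers the programming term, while “I sailed past the island of Java” triggers nothing, which is the kind of precision the human-added condition buys.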
Mathematical analyses identify statistical characteristics of large numbers of items and are quite useful in making business decisions. But making decisions about meaning? For many decades now, researchers have been working to find a way to analyze natural language that would come somewhere near the precision provided by human indexers and abstractors. Look at IBM’s supercomputer “Watson” and the years and resources invested to produce it. It continues to miss the simple (to us) relationships between words and context that humans understand intuitively.
Mary Garcia, Systems Analyst
During the initial stages of discussing a new taxonomy project, I am frequently asked questions like:
How granular does my taxonomy need to be?
How many levels deep should the vocabulary go?
How many terms should my thesaurus have?
The answer is—of course—it depends.
The smallest thesaurus project with which I’ve ever been involved was for a thesaurus of 11 terms; the largest is a 57,000-word vocabulary.
We once lost a bid because we refused to agree to build a 10,000-word thesaurus (exactly 10,000, not approximately); no matter how loudly we insisted that it’s far more logical (“best practice”) to let the data decide the size of the thesaurus, someone had already decided on an arbitrary number.
At Access Innovations, we like to say that we build “content-aware” taxonomies, that the data will tell us how large the taxonomy should be. The primary data point is the content: How much is there? What is the ongoing volume being published? Clearly, no one needs a 25,000-word thesaurus to index 1000 documents; similarly, a 200-term thesaurus is not going to be that useful if you have 800,000 journal articles.
Just as returning 2,000,000 search results is not very helpful (unless what you’re looking for is on the first page), a thesaurus term with which 20,000 articles are tagged isn’t doing that much good—more granularity is probably required. There are very likely sub-types or sub-categories of that concept that you can research and add.
The flip side is that you don’t need terms in your vocabulary—no matter how cool they may be—if there is little or no content requiring them for indexing. Your 1500-word branch of particle physics terms is just dead weight in the great psychology thesaurus you’re developing.
Other factors include the type of users you have searching your content: Are they third-graders? Professional astrophysicists? High school teachers? Reviewing search logs and interviewing users is another way to focus your approach, which in turn will help you gauge the size your taxonomy will be in the end.
Let’s make up an example (as an excuse to post pictures that are fun to look at). We’re building a taxonomy that includes some terms about furniture, including the concept Sofa.
PT = Sofa
NPT = Couch
Now, being good taxonomists, we’re obviously lightning-fast researchers, so we quickly uncover some other candidate terms:
Cabriole
Camelback
Canapé
English Rolled Arm
Whereas “couch” is clearly a synonym, these could all be narrower terms (NTs) for Sofa, as they are all distinct types, styles, and sub-classes. Alternatively, they could all be made NPTs for Sofa, so that any occurrence of the words above would index to Sofa and be available for search, browse, etc.
How do we decide the proper course of action?
We let the content tell us.
How many articles in our imaginary corpus reference, say, the Cabriole, Camelback, or Canapé?
- If the answer is “none”, there’s clearly no need for this term; however, adding it as an NPT will catch any future occurrences, so we may as well be completist.
- If the answer is “many”—some significant proportion of the total mentions of Sofa or Couch—then the term definitely merits its own place in the taxonomy.
- If the answer is “few”—more than none, but not enough to warrant inclusion—go ahead and add it as an NPT. You can always promote it to preferred term status later.
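The decision procedure above can be expressed as a small function. The 10% share threshold below is an invented cutoff for illustration; in practice, what counts as a “significant proportion” is a judgment call made against the actual corpus.

```python
# Sketch of the "let the content tell us" decision; the promote_share
# threshold is an illustrative assumption, not an actual best-practice cutoff.
def place_candidate(candidate_count, parent_count, promote_share=0.10):
    """Decide where a candidate term belongs, given how often it appears
    in the corpus relative to its would-be broader term."""
    if candidate_count == 0:
        return "NPT"   # completist: catches any future occurrences
    if candidate_count >= promote_share * parent_count:
        return "NT"    # merits its own place in the taxonomy
    return "NPT"       # few occurrences: can be promoted later

# e.g., 300 mentions of "Camelback" against 2,000 mentions of Sofa/Couch
print(place_candidate(300, 2000))  # prints "NT"
```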
However—and this is a big exception—if you find through reviewing search logs that a significant number of searchers were looking for a particular term, it might signal that it’s an emerging concept, new trend, or hot topic, in which case you may decide to override the statistical analysis and err on the side of adding it to the thesaurus. It won’t hurt anything, and as long as your hierarchy is well formed and your thesaurus is rich in related terms, people will find what they’re looking for…which is, after all, the goal.
So remember: It’s not only the size of your taxonomy that’s important—it’s how relevant it is to the content and users for which it’s designed.
Bob Kasenchak, Project Coordinator
Access Innovations, Inc. Announces Release of the Author Submit Extension Module to Data Harmony Version 3.9
Access Innovations, Inc. has announced the Author Submit extension module as part of its Data Harmony Version 3.9 release. Author Submit is a Data Harmony application for integrating author-selected subject metadata into a publishing workflow during the paper submission or upload process. Author Submit facilitates the addition of taxonomy terms by the author. With Author Submit, each author provides subject metadata from the publisher taxonomy to accompany the item they are submitting. During the submission process, Data Harmony’s M.A.I. core application suggests subject terms based on a controlled vocabulary, and the author chooses appropriate terms to describe the content of their document. This enables early categorization, supports the selection of peer reviewers, and provides data for trend analysis.
“Author Submit is an exciting addition to the Data Harmony repertoire,” said Marjorie M. K. Hlava, President of Access Innovations, Inc. “Publishers can easily incorporate Author Submit, streamlining several steps at the beginning of their workflow, as well as semantically enriching that content at the beginning of the production process. They are getting far more benefits, and doing so without adding time and effort.”
“The approach is simple on the surface and supported by very sophisticated software,” remarked Bob Kasenchak, Production Manager at Access Innovations. “The document is indexed using the Data Harmony software, which returns a list of suggested thesaurus terms, from which the author selects appropriate terms. Author Submit supports the creation of a ‘semantic fingerprint’ for the author, collecting additional information along with the subject metadata. Finally, the tagged content is added to the digital record to complete production and be added to the data repository. It’s an amazing system to see in action.”
Author Submit can be implemented in several ways:
- as a tool for assisting authors to self-select appropriate metadata assigned to their name and research at the point of submission into the publishing pipeline;
- as an editorial interface between a semantically enriched controlled vocabulary and potential submissions to the literature corpus;
- for editorial review of subject indexing at the point of submission, enabling a robust evolution of a controlled vocabulary (taxonomy or thesaurus) by encouraging timely rule base refinement;
- for simultaneous assignment of descriptive and subject metadata by curators of document repositories, for efficient integration of documents in a large collection;
- and as a method of tracking authors and their submissions for conference proceedings, symposia, and the like.
“This flexible system opens the door between Data Harmony software and an author submission pipeline in fascinating new ways,” commented Kirk Sanders, an Access Innovations taxonomist. “Users can choose a configuration that maximizes the gain from their organizational taxonomy, at the point it is needed most: when their authors log on to submit their documents.”
Founded in 1978, Access Innovations has extensive experience with Internet technology applications, master data management, database creation, thesaurus/taxonomy creation, and semantic integration. Access Innovations’ Data Harmony software includes machine aided indexing, thesaurus management, an XML Intranet System (XIS), and metadata extraction for content creation developed to meet production environment needs. Data Harmony is used by publishers, governments, and corporate clients throughout the world.
Collaboration is a key component of research. Original research papers with a single author are a vanishing breed, particularly in the life sciences. This makes it difficult to identify author contributions and acknowledgements, as well as to mine any data from the unstructured information. These observations come from an article in the international science journal Nature, “Publishing: Credit where credit is due,” in which our own Margie Hlava and her contribution to the issue were highlighted.
Developments in digital technology are making a difference in this area. Digital taxonomies can help identify contributor roles relatively easily in structured formats during the process of developing and publishing a paper. This was demonstrated in an experiment conducted by the authors of the Nature article. Read the article for their interesting perspectives.
Melody K. Smith
Sponsored by Access Innovations, the world leader in thesaurus, ontology, and taxonomy creation and metadata application.
When we (at least those of us in Greater Mexico) hear of or read about Cinco de Mayo, there is no question in our minds that “Mayo” refers to the month of May. The preceding “Cinco de” (Spanish for “Fifth of”) pretty much clinches it. Of course, if the overall content is in Spanish, there might still be some ambiguity about whether it is the holiday that is being referred to, or simply a date that happens to be the one after the fourth of May. (As in “Hey, what day do we get off work?” “The fourth of July, I think.”)
We can generally resolve this kind of ambiguity by the context, as can a good indexing system and a rule base associated with a taxonomy.
If you’re reading this posting, you read English. So there’s a good chance that when you read the word “mayo”, you think of the sandwich spread formerly and formally known as mayonnaise.
Or perhaps the famous Mayo Clinic comes to mind. If you’re an American football fan (I had to throw “American” in there to differentiate the mentioned sport from soccer), you might think of New England Patriots linebacker Jerod Mayo.
The context enables us to recognize which mayo we’re dealing with. Likewise, an indexing system might take context into account when encountering the slippery word. A really good indexing rule base might help you sort things out when you have text about Jerod Mayo’s line of mayonnaise, the proceeds of which he is donating to the Boston (not Mayo) Clinic.
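A toy version of such a disambiguating rule might score each sense of “Mayo” by counting context cues in the surrounding text. The senses and cue lists below are invented for illustration; a real rule base would be far richer, but the idea is the same.

```python
# Invented context cues for each sense of "Mayo"; a real rule base
# would have many more senses and cues, but the scoring idea holds.
CUES = {
    "Mayo Clinic": ["clinic", "hospital", "physician"],
    "Mayonnaise": ["sandwich", "spread", "jar", "recipe"],
    "County Mayo": ["ireland", "county", "irish"],
    "Jerod Mayo": ["patriots", "linebacker", "football"],
    "May (month)": ["cinco de", "fifth of"],
}

def disambiguate_mayo(text):
    """Pick the sense whose context cues appear most often in the text."""
    text = text.lower()
    scores = {sense: sum(cue in text for cue in cues)
              for sense, cues in CUES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "Mayo (ambiguous)"
```

When no cue fires, the sketch declines to choose, which mirrors what a careful indexing system should do: flag the ambiguity rather than guess.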
As a person of Irish descent, I know perfectly well that that is not the end of Mayo’s spread. There is a County Mayo in Ireland, which has a few other Mayos, too.
If you consult the Mayo disambiguation page in Wikipedia, you will quickly discover that Mayo goes much further than Ireland. There are Mayos of one sort or another all over the world: towns, rivers, and an assortment of other geographical entities that might easily co-exist in a taxonomy or gazetteer.
Traveling down past the geographical Mayos on the Wikipedia page, one finds the names of dozens and dozens of people, many of whom have Mayo as a first name, and many of whom have Mayo as a last name. Thank goodness the four relatively famous William Mayos have different middle names.
The final category on Wikipedia’s Mayo page is, perhaps inevitably, Other. There are quite a few Other Mayos. And what might the last one be? Where has this journey taken us?
“Mayo, the Spanish word for May”
Hold the Cinco de Mayo celebration!
Barbara Gilles, Taxonomist
Last week Access Innovations released Data Harmony 3.9, and all week long we shared some of our own developers’ favorite new features. It was nice to see KMWorld mention some of the great new features available in its article, “Content enrichment from Access Innovations.”
The Data Harmony Suite is designed to enhance the organization of information by systematically applying a taxonomy or thesaurus in total integration. MAIstro is the flagship software module of the Data Harmony line, and combines Thesaurus Master with M.A.I. (Machine Aided Indexer) for interactive text analysis and better subject tagging. What does this mean for users?
Data Harmony software gives users the power to quickly attach precise indexing terms to their content items. With the MAIstro interface, editors can adjust vocabulary terms or corresponding rules that govern indexing. This enables the users to generate highly accurate and consistent subject metadata for content objects.
Melody K. Smith
Sponsored by Access Innovations, the world leader in taxonomies, metadata, and semantic enrichment to make your content findable.
I would like to make some observations about statistics-based categorization and search, and about the advantages that their proponents claim.
First of all, statistics-based co-occurrence approaches do have their place: for wide-ranging bodies of text such as email archives and social media exchanges, and for assessing the nature of an unknown collection of documents. In these circumstances, a well-defined collection of concepts covering a pre-determined area of study and practice might not be practicable. For lack of a relevant controlled vocabulary foundation, and for lack of other practical approaches, attempts at analysis fall back on less-than-ideal mathematical approaches.
Co-occurrence can do strange things. You may have done Google searches that got mysteriously steered to a different set of search strings, apparently based on what other people have been searching on. This is a bit like the proverbial search for a key under a street light, instead of the various places the key is more likely to be, simply because the light is better and the search is easier under the street light.
These approaches are known for low search results accuracy (60% or less); this is unacceptable for the article databases of research institutions, professional associations, and scholarly organizations. Not only is this a disservice to searchers in general; it is an extreme disservice to the authors whose insights and research reports get overlooked, and to the researchers who might otherwise find a vital piece of information in a document that the search misses.
The literature databases and repositories of research-oriented organizations cover specific disciplines, with well-defined subdisciplines and related fields. This makes them ideal for a keyword/keyphrase approach that utilizes a thesaurus. These well-defined disciplines have well-defined terminology that is readily captured in a taxonomy or thesaurus. The thesaurus has additional value as a navigable guide for searchers and for human indexers, including the authors and researchers who know the material best. Surely, they know the material far better than any algorithm could, and can take full advantage of the indexing and search benefits that a thesaurus can offer. An additional benefit (and a huge one) of a thesaurus is that it can serve as the basis for an associated indexing rule base that, if properly developed, can run rings around any statistics-based approach.
Proponents of statistics-based semantic indexing approaches claim that searchers need to guess the specific words and phrases that appear in the documents of potential interest. On the contrary, a human can anticipate these things much better than a co-occurrence application can. Further, with an integrated rule-based implementation, the searcher does not need to guess all of the exact words and phrases that express particular concepts in the documents.
With indexing rooted in the controlled vocabulary of the thesaurus, documents that express the same concept in various ways (including ways that completely circumvent anything that co-occurrence could latch onto) are brought together in the same set of search results. Admittedly, I do appreciate the statistical approaches’ success in pulling together certain words that they have learned (through training) may be important. The proximity or distance between the words matters in ranking the returned results, implying relevance to the query. However, that unknown distance makes it hard for a statistical approach to latch onto a concept, and therefore makes the approach unreliable.
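A tiny sketch of that unification: documents that never share a matching keyword still surface together under one concept, because the thesaurus maps each document's surface wording to the same preferred term. The vocabulary fragment below is made up for illustration.

```python
import string

# A made-up vocabulary fragment mapping surface terms (non-preferred
# terms) to one preferred term, so differently worded documents are
# retrieved together under the same concept.
NPT_TO_PT = {
    "sofa": "Sofa",
    "couch": "Sofa",
    "settee": "Sofa",
    "chesterfield": "Sofa",
}

def index_doc(text):
    """Return the preferred terms a document's wording maps to."""
    words = (w.strip(string.punctuation) for w in text.lower().split())
    return {NPT_TO_PT[w] for w in words if w in NPT_TO_PT}

docs = [
    "The settee was reupholstered in velvet.",
    "A leather chesterfield anchored the room.",
    "We finally bought a new couch.",
]
# All three documents are hits for the concept "Sofa", even though
# no two of them share a matching keyword.
hits = [d for d in docs if "Sofa" in index_doc(d)]
```

No co-occurrence statistic can connect “settee” and “chesterfield” unless it happens to have seen them together in training; the controlled vocabulary connects them by design.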
Use of a thesaurus also enables search interface recommendations for tangentially related concepts to search on, as well as more specific concepts and broader concepts of possible interest. The searcher and the indexer can also navigate the thesaurus (if it’s online or in the search interface) to discover concepts and terms of interest.
With statistics-based indexing, the algorithms are hidden in a mysterious “black box” that mustn’t be tinkered with. With rule-based, taxonomy-based approaches, the curtain can be pulled back, and the workings can be directly dealt with, using human intelligence and expert knowledge of the subject matter.
As for costs, statistical approaches generally require ‘training sets’ of documents to ‘learn’ or develop their algorithms. Any newly added concept/term means collecting another ten to fifty documents on the topic. If the new term is to be a narrower term of an existing term, that broader term’s training set loses meaning and must be redone to differentiate the concepts. Consider that point in the ongoing management of a set of concepts.
Any concept that is not easily expressed in a clear word or phrase, or that is often referred to through creative expressions that vary from instance to instance, will require a significantly larger training set. The expanded set will still be likely to miss relevant resources while retrieving irrelevant resources.
Dealing with training sets is costly, and is likely to be somewhat ineffective, at that; if the selected documents don’t cover all the concepts that are covered in the full database, you won’t end up with all the algorithms you need, and the ones that are developed will be incomplete and likely inaccurate. So with a statistics-based approach, you won’t reap the full value of your efforts.
Barbara Gilles, Taxonomist
Clearer Page Layouts – You betcha!
- Data Harmony onscreen interface tabs, button names, menu options, dialog boxes, text boxes, and actual computer code examples: each category is presented with a distinct emphasis and font style in the software instructions, to reduce confusion for the new reader.
More Navigation Features – For Increased Accessibility
- Every manual has more internal cross-references than before, providing the reader with convenient spots to move directly to a related section and bypass information they are already familiar with.
- Every instruction manual displays PDF bookmarks in a pane to the left of the written page, for all major sections – accessible from anywhere in the document. A reader can effectively navigate backward or forward to any section they need, from any page of the guide, anytime, without the extra step of consulting the full table of contents.
Kirk Sanders, Editorial Services
Project Manager, Access Innovations
Glad you asked.
In 2014, we improved the instruction manuals significantly, making them more comprehensive without being harder on the eyes. We upgraded the documentation to be clearer, more consistent, and more complete than before, while removing extraneous material.
Data Harmony customers will find that the software guides are now improved in three ways: A) better content, B) clearer page layouts, and C) more navigation features.
- Overview section presents a less-technical description of what the Data Harmony module is accomplishing and how,
- Instructions for doing tasks in interfaces are streamlined, with more tips for the frequent user,
- Glossary no longer contains Data Harmony software definitions mixed in alongside general thesaurus/indexing definitions. Now there are two highly-focused knowledge resources to help our users directly:
- ‘Thesaurus Master® and MAIstro™ Glossary’ defines terminology that a person should know to gain the greatest advantage while using either program to develop or administer a controlled vocabulary
- ‘M.A.I.™ Glossary’ defines terminology that a person needs to know in order to work with a rule base for subject-indexing text
- A recent emphasis by development programmers on customizing the use of API calls to meet specialized customer goals is now reflected in documentation for several ‘Data Harmony extensions,’ available for purchase this year for the first time, such as Search Harmony & Recommender, Inline Tagging, Semantic Fingerprinting, Metadata Extractor, and Author Submit. User guides for these Data Harmony Web services/extensions contain implementation examples that illustrate the software in action, with corresponding explanations.
- Wait… what was that last one?
- The API calls offer a method for system administrators to facilitate ongoing Web services for Data Harmony users in an interface (or through machine-to-machine automated operations).
But wait, there’s more.
Kirk Sanders, Editorial Services Project Manager
I like the variety of exports available. We’re still creating custom ones for content and web content management systems that require specialized formats, and we’ll continue to do that. But now we can offer exports of limited (top) levels of a hierarchy and exports that include only the fields that you want. There’s no need to share editorial comments and term origin information if you don’t want to. And it’s possible to run an export with just related terms or just editorial notes when that’s what you need for vocabulary review. These are seemingly small changes that can make a big difference.
Data Harmony is becoming more verbose, too. It has more messages to let you know that your export or import has been successful (or messages to let you know why not!). It also has more reports (and more on the way). We love hearing from our users, and much of the new verbosity results from suggestions our users have made.
Mary Garcia, Systems Analyst