It can be tempting to look at a taxonomy and only see it in terms of the search results generated from it. That part of it is very important, no doubt, but taxonomies have far-reaching tendrils that extend deep into many industries, especially publishing. They work subtly and are often invisible to the end user, but the effects can be extraordinarily powerful.
Drawing by Pierre Dénys de Montfort, 1801, http://commons.wikimedia.org/wiki/Cryptozoology#mediaviewer/File:Colossal_octopus_by_Pierre_Denys_de_Montfort.jpg
Taxonomies interact with and enrich the publication pipeline in so many ways. This is especially apparent in the world of academic publishing, but the principles can be applied throughout the industry.
With the explosion of electronic content, the use of metadata has become increasingly important. Hunting down print records from an archive by hand is simply unacceptable today; documents get lost, forcing the unnecessary duplication of work. This is even more frustrating with electronic content because, in theory, the conversion to digital was supposed to streamline the retrieval process.
It hasn’t really worked out so smoothly. This isn’t simply about finding something with a search function; inefficiencies gum up the works at every stage, costing companies valuable time that could be more effectively used elsewhere. In fact, studies have shown that people in publishing spend up to 35% of their time hunting down information. That’s simply too much time.
Access Innovations’ Data Harmony software products can help clean up the mess, both literally and figuratively. With applications at every stage, from submission to publication, the software suite is specifically designed to improve accuracy and efficiency, from the writer submitting the work to the publisher putting it in a journal, to the end user who needs to access it.
We will examine these ideas in a nine-part series to see just how integral taxonomies are to publishers. Some features will specifically address how Data Harmony works to improve data, while some will be more general. The following topics will be covered:
- THE NEED FOR QUICK PUBLISHING: Academics publishing papers in journals—crucial for tenure—often face months and even years of waiting, sometimes so long that the tenure window passes. How does Data Harmony, specifically Smart Submit, streamline the submission process?
- SEMANTIC ENRICHMENT: The capture of metadata is key to meaningful analysis of content. The resulting metadata allows a user to view content in unique ways to draw previously invisible patterns that dramatically improve the accuracy of content retrieval and enable analytics.
- BY THE NUMBERS: We look at the real cost of slow turnaround in academic publishing—it doesn’t just affect the writer. There is a real cost to publishers when their systems are inefficient.
- INLINE TAGGING—WHEN YOU STILL CAN’T FIND IT: Data Harmony’s constant refinement has now led to inline tagging solutions that allow users to easily zoom in on and identify concepts in a full text document.
- FINGERS IN MANY POTS—FROM SUBMISSION TO PUBLICATION: From Smart Submit to precision search results at the end, Data Harmony has a hand in making each piece of the workflow production pipeline smoother, easier, and more accessible.
- FEEDBACK LOOPS FOR TRULY ACCURATE INDEXING: One of the more interesting aspects of Data Harmony indexing solutions is how they enable constant checking of content extraction effectiveness, so that it becomes ever more accurate with each new piece of added content.
- THE FUTURE—REAL-TIME ANALYSIS OF THE CHANGES IN PUBLISHING: The explosion of content being housed online instead of in print brings up some interesting speculative issues about the future of journal publication.
The series will close with an installment that will demonstrate, on a broader level, that high-quality metadata and accurate taxonomy-based indexing streamline and enrich the publication process. Today’s technology gives publishers the opportunity to enhance the way their journals are compiled and disseminated to the public. However, many publishers have not yet taken advantage of that opportunity. This makes for a frustrating search experience for the end user that, most importantly, does not precisely deliver the content they require.
Ultimately, that end user experience is most crucial, but in order for a system to really work, it requires improved efficiency at every level. Certainly, good semantic enrichment software such as Data Harmony can help with that, but changes in mentality are also required. That doesn’t happen overnight, but the faster that publishers can start to implement new methods of looking at their data and making their content available, the faster that users throughout the pipeline will get true satisfaction from their results.
When the author is satisfied and the publisher is satisfied, the end user wins. That builds trust in your client base, which translates into revenue. That revenue allows the publisher to offer improved and diversified products, leading to a broader user base that is assured of a reliable content offering, which, once again, leads to even more revenue. This kind of cycle drives industry to improve upon itself, and good semantic enrichment software such as Data Harmony can facilitate that process.
Access Innovations, Inc. is pleased to announce that the Data Harmony software line now includes inline tagging capability, beginning with Version 3.9, released earlier this year.
Inline Tagging Feature In Data Harmony 3.9 Installations
Employing inline tags, the TestMAI screen in the MAIstro™ module now displays matching content in a highlighted font when a Data Harmony user runs M.A.I.™ (Machine Aided Indexer), putting the focus on indexing concepts as they appear in the input text. XML elements are applied in the text at the exact location where a word or phrase triggered a term suggestion from the controlled vocabulary (taxonomy or thesaurus).
Inline tagging keeps the context in view when M.A.I. makes a subject indexing match, and generates a statistical summary showing the list of matched terms. This provides a look at the operation of term-matching rules and improves the user’s experience during rule base testing and refinement.
Inline Tagging As a Software Product
The Inline Tagging extension module is available as a Data Harmony web-based service, with configuration options for integrating M.A.I. into the user interface of a content management system.
“For Version 3.9, we added inline tagging for MAIstro and M.A.I. installations, as Data Harmony customers are seeing now when they run TestMAI on their data,” remarked Marjorie M. K. Hlava, President of Access Innovations, Inc. “And, we packaged Inline Tagging as a software module to be integrated with an organization’s content management program, in the interface.
The Inline Tagging extension puts M.A.I.’s indexing precision within the reach of users, users who may be requesting more Web functionality! Often, the first step to offering more interactivity at document level is the precise placement of inline subject tags… this step is accomplished through a thoughtful configuration of the Inline Tagging extension.”
“The Inline Tagging extension inserts an XML ‘wrapper element’ around matched words; it locates concepts from your vocabulary inside full-text documents,” explains Bob Kasenchak, Production Manager at Access Innovations. “The XML wrapper can be configured so your interface displays related information when a user points their cursor at the match. Or, the interface can display a link for the user to visit. But that’s not all… an IT department can deploy inline text tags to make search and retrieval in a large document collection more precise and context-sensitive.”
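As a rough illustration of the wrapper idea described in the quote above, here is a minimal Python sketch that wraps vocabulary matches in an XML element. It is not the Data Harmony API: the element name `term`, the `id` attribute, the term identifiers, and the tiny vocabulary are all hypothetical, and plain phrase lookup stands in for M.A.I.'s rule-based matching.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Hypothetical controlled vocabulary: lowercase phrase -> term identifier.
VOCABULARY = {
    "synthetic biology": "T001",
    "hypertension": "T002",
}

def tag_inline(text, vocabulary=VOCABULARY, element="term"):
    """Wrap each vocabulary match in an XML element, escaping all other text."""
    if not vocabulary:
        return escape(text)
    # Longest phrases first so a longer phrase wins over a shorter one.
    phrases = sorted(vocabulary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(p) for p in phrases) + r")\b",
        re.IGNORECASE,
    )
    pieces = []
    last = 0
    for match in pattern.finditer(text):
        # Text between matches is copied through with XML escaping.
        pieces.append(escape(text[last:match.start()]))
        term_id = vocabulary[match.group(0).lower()]
        pieces.append(
            "<{0} id={1}>{2}</{0}>".format(
                element, quoteattr(term_id), escape(match.group(0))
            )
        )
        last = match.end()
    pieces.append(escape(text[last:]))
    return "".join(pieces)
```

Calling `tag_inline("Hypertension & synthetic biology")` wraps both phrases and escapes the ampersand. An interface could then use the `id` attribute to look up whatever it wants to show on hover or link to.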
Implementation Ideas for Inline Tagging Web-based Service Extension:
- As a search engine tool to boost document search results
- To turn up the volume of social media postings
- To create a better, more searchable index of an XML database
- To foster increased user interaction opportunities with documents
How the Inline Tagging Extension Works, Simplified
Implementations of the Inline Tagging extension module are based on an organization’s controlled vocabulary and accompanying rule base, stored in a Data Harmony installation. The Inline Tagging API can retrieve any element from the vocabulary term record in order to offer relevant information alongside matched content, and it handles this ‘on the fly.’ This makes it convenient for a Web developer to implement customized functions for a content management system. Driven by data already stored in the term records, with Web-based services configured for the CMS interface, Inline Tagging gives users immediate access to supplemental information about the relevant controlled vocabulary terms, and any related concepts.
About Access Innovations, Inc. –
Access Innovations has extensive experience with Internet technology applications, master data management, database creation, content-aware thesaurus/taxonomy creation, and semantic integration. Access Innovations’ Data Harmony software includes machine aided indexing, thesaurus management, an XML Intranet System (XIS), and metadata extraction for content creation developed to meet production environment needs. Data Harmony is used by publishers, governments, and corporate clients throughout the world. Access Innovations has been changing search to found since 1978.
I recently had the opportunity to see webinars featuring a couple of software systems for taxonomy construction/management and content categorization. The systems were both impressive and, if I didn’t have 20 years in the business, I would have been totally awed … and snowed. It’s easy to be overwhelmed by a slick appearance and professional presentation.
Early in my career, I was overwhelmed and confused by the terminology—its abundance, multiplicity, and ambiguity. Each software company used different words, all very catchy, developed by a creative marketing department. I didn’t get whether they were talking about different concepts or the same in different verbal wrappers. Cutting through the terminology to identify key software features and functions can be tough. Yet that’s just what must be done for an informed buying decision.
One of the buzzwords I came across in these recent webinars was “content-driven” (or “data-driven”) to describe a taxonomy. To my amazement, this was described as a “trend” in taxonomy construction by the presenter for the company “with over 15 years of experience.” Apparently it was intended as a strike at a “top-down” approach to pulling together terms for a taxonomy based on an abstract, authoritative view of a domain. The top-down approach was described as more complex than necessary and including nodes not reflecting your content.
However, the discussion ignored the equally familiar and long established counterpart to top-down. This is the “bottom-up” approach, drawing terms directly from the documents to be categorized, i.e., content-driven. Here’s a link to a brief description of the strategies written in 1996 by Jessica Milstead.
In most cases, building a taxonomy or thesaurus requires a hybrid approach, with the overall organization based on a top-down approach for navigation and the bulk of terms reflecting the preferred terms for concepts in the domain and drawn from the actual documents. The strategies are most often used in balance, with the taxonomist providing a logical “top” structure into which the content-linked terms can fit.
The software on display generated a list of candidate terms, offering words and phrases from the content as terms. But this was just a starting point in taxonomy construction. Time for the taxonomist to add the value of organization through hierarchical, associative, and equivalence relationships.
Ah, “relationships” takes me to semantics, another buzzword that sounds very impressive and truly represents the power of taxonomies. The key thing to remember is that semantics in a taxonomy starts with the hierarchical, associative, and equivalence relationships. (Actually, a taxonomy with all those features is more accurately called a thesaurus.) Organizing terms in a hierarchy of broader and narrower concepts, from general to specific, and recognizing synonymous alternative expressions and internal conceptual links all add semantic richness to terms by providing context based on the meanings of words. These are features built into a well-developed taxonomy, providing pivot points from one term to another through logical semantic associations. Applied as metadata to content items, the taxonomy terms provide semantic enrichment.
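For readers who think in data structures, the three relationship types can be sketched in a few lines of Python. This is only an illustration, not how any particular product stores a thesaurus; the class names, method names, and sample terms are made up for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Term:
    label: str
    broader: set = field(default_factory=set)    # hierarchical: broader terms
    narrower: set = field(default_factory=set)   # hierarchical: narrower terms
    related: set = field(default_factory=set)    # associative: related terms
    used_for: set = field(default_factory=set)   # equivalence: non-preferred synonyms

class Thesaurus:
    def __init__(self):
        self.terms = {}     # preferred label -> Term
        self.synonyms = {}  # lowercase non-preferred label -> preferred label

    def add(self, label):
        """Return the Term for a label, creating it if needed."""
        return self.terms.setdefault(label, Term(label))

    def add_hierarchy(self, broader, narrower):
        """Record a broader/narrower pair in both directions."""
        self.add(broader).narrower.add(narrower)
        self.add(narrower).broader.add(broader)

    def add_related(self, a, b):
        """Record a symmetric associative link."""
        self.add(a).related.add(b)
        self.add(b).related.add(a)

    def add_synonym(self, preferred, variant):
        """Record a non-preferred variant that points to a preferred term."""
        self.add(preferred).used_for.add(variant)
        self.synonyms[variant.lower()] = preferred

    def preferred(self, label):
        """Resolve a label to its preferred term, or None if unknown."""
        if label in self.terms:
            return label
        return self.synonyms.get(label.lower())
```

The "pivot points" are the three sets on each term: from any term you can step up or down the hierarchy, sideways to a related concept, or from a synonym to the preferred wording.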
Another slick webinar focused on semantic enrichment with an artfully designed but effective presentation. As jaded as I have become, I was duly impressed by the appealing motifs, the jazzy colors, the graphics in motion, and the requisite buzzwords in the opening. This is the part you show to the CIO, CTO, etc., the one with final budget authority. We are still talking about semantically enriching content with metadata from a domain-specific taxonomy. You say, “This is just what I need!”
Several modules were described. One extracts words and phrases as key topics for a taxonomy-ish product, called by a name not found in the ANSI/NISO Z39.19 standard for taxonomy construction. Another is for human taxonomy building from scratch, if the ready-built domain taxonomies are not a good fit. Others serve categorizing/indexing/tagging/annotating content (choose your favorite expression), also known as applying taxonomy terms as metadata or … semantically enriching the content.
I must admit I was impressed, but not snowed. I’m an editor, not in marketing or in an art department. I knew what this was all about because this is basic taxonomy and indexing work that I do daily, using software that delivers these functions. I know that slick is cool to look at, but it comes at a price.
“Gee, thanks for the spin in your Ferrari, but I was hoping for a Chevy price tag and Honda/Volvo/Subaru reliability.”
Photo by Ahmad Mukhlis, http://en.wikipedia.org/wiki/Honda_Civic#mediaviewer/File:Honda_Civic_Hybrid_%28Malaysia%29.jpg / CC BY-SA 3.0
I also know that essential functions are available in products much more accessible to organizations on a budget.
If you are interested in software for taxonomy creation, management, and application, don’t get snowed by the buzzwords and bling. Know the basics of taxonomy construction and implementation, and use that knowledge as a starting point when comparing software. Know the functions you need to perform and avoid slick but unnecessary frills, as alluring as they may be. Know if the product will work with other systems and whether you’ll need a high-priced mechanic or an editor to do the work. When you hear about trends, consider established history and experience. Data Harmony software from Access Innovations was developed for a demanding production setting, starting in 1995 and continually improving over the years. It may be eclipsed on the slick presentation front, but the software has proven it’s up to the job. Just ask any of our satisfied customers. Contact us; we’d be happy to give you the list.
Alice Redmond-Neal, Senior Taxonomist
If you look at a branch of a typical deciduous tree, you can see that it looks like a smaller tree. Likewise, that branch branches off into smaller branches that look like even smaller trees.
This characteristic of trees is an example of what mathematicians, biologists, and systems scientists call self-similarity. Self-similar systems repeat their basic geometry at smaller and smaller scales, creating multiple miniatures of themselves at different scales. In general, natural and mathematical systems in which self-similarity results in complex and detailed patterns are referred to as fractal systems.
Many natural phenomena are or can be fractal:
Photo of a 12-sided snowflake by Becky Ramotowski, www.srh.noaa.gov/abq/?n=features_snowflake
Painting by Katsushika Hokusai, www.katsushikahokusai.org/Mount-Fuji-Seen-Below-a-Wave-at-Kanagawa.html /CC BY-NC-ND 3.0
and even broccoli.
Photo by Jon Sullivan, en.wikipedia.org/wiki/Romanesco_broccoli#mediaviewer/File:Fractal_Broccoli.jpg
Trees are loosely fractal. While the trunks don’t keep replicating, the branches do. As the Fractal Explorer observes:
If you don’t know anything about fractals a tree might seem as a very random object. No patterns, no rules. But if you know something about fractals and look closer you can see that basically a tree is a trunk with trees on it. That is a basic pattern that every tree follows.
Taxonomies are often described as taxonomic trees, or as having a tree-like structure. To carry the analogy further, we often refer to the progressively more specific and more numerous hierarchical subdivisions in a taxonomy as branches. The overall domain of a taxonomy, while sometimes referred to as its root, might also be viewed as its trunk.
This raises the question: Are taxonomies fractal? As it turns out, several authors have written articles on the fractal nature of biological genus-and-species taxonomies. These articles discuss the branching characteristics of these taxonomies, the same branching characteristics that we see in taxonomies outside the realm of biological species categorization. They also discuss the mathematical tendencies of the proportions of the various branches, tendencies that could perhaps be a natural result of the degree to which things in a group need to be different before we find it appropriate to give them different names.
In recent years, interdisciplinary scientists such as Christophe Eloy have been studying the natural forces that make trees grow the way they do, and how their growth patterns might make them resilient in windstorms. Interestingly enough, these scientists have been inspired, in part, by an observation that another person with an interdisciplinary approach, Leonardo da Vinci, made 500 years ago.
As Joe Palca explains in “The Wisdom Of Trees (Leonardo Da Vinci Knew It)”:
Leonardo noticed that when trees branch, smaller branches have a precise, mathematical relationship to the branch from which they sprang. Many people have verified Leonardo’s rule, as it’s known, but no one had a good explanation for it. …
Leonardo’s rule is fairly simple, but stating it mathematically is a bit, well, complicated. Eloy did his best:
“When a mother branch branches in two daughter branches, the diameters are such that the surface areas of the two daughter branches, when they sum up, is equal to the area of the mother branch.”
Translation: The surface areas of the two daughter branches add up to the surface area of the mother branch.
Here’s another explanation, from Esther Inglis-Arkell’s article “Scientists Still Puzzled by a Fractal Discovered 500 Years Ago”, that might be more intuitive:
Strip the leaves off of the average tree, soak the whole thing in water until it gets mushy, bundle the branches up together, and you’ll get what looks like one long trunk. That’s what Leonardo Da Vinci said in the fifteen hundreds. If a tree trunk splits off into three main branches, each of the branches will be one third the size of the trunk. When each of those branches splits into three again, making nine branches on the second ‘tier’ of the tree, each of these second tier branches will be one ninth the size of the trunk. As the branches grow and split, they will always be a particular fraction of the size of the trunk, and adding together all the fractional bits of each ‘tier’ of branches will always add up to ‘one trunk.’ This isn’t the case in all trees, but the majority hold to this pattern.
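The arithmetic in that description is easy to check with a small Python sketch. It assumes that "size" means cross-sectional area, so that the parent diameter squared equals the sum of the daughter diameters squared, which is how Leonardo's rule is commonly stated. It also assumes that every branch splits into equal daughters. The function names and the sample numbers are only for illustration.

```python
import math

def tier_diameters(trunk_diameter, branching_factor, tiers):
    """Diameter of a single branch at each tier, starting with the trunk.

    Assumes equal daughters and Leonardo's rule in cross-sectional form:
    parent_d ** 2 == sum of daughter_d ** 2.
    """
    diameters = [trunk_diameter]
    for _ in range(tiers):
        # n equal daughters share the parent's area, so d shrinks by sqrt(n).
        diameters.append(diameters[-1] / math.sqrt(branching_factor))
    return diameters

def tier_total_area(diameter, count):
    """Combined cross-sectional area of `count` branches of equal diameter."""
    return count * math.pi * (diameter / 2) ** 2
```

With a trunk diameter of 30 and three-way splits, `tier_diameters(30.0, 3, 2)` gives one diameter per tier, and `tier_total_area` returns the same total for 1 trunk, 3 first-tier branches, and 9 second-tier branches, up to floating-point rounding. That is the "one trunk" in the quote.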
Can we gain a new perspective on taxonomies from all this? I think the lesson might have to do with scope, specificity, and detail. According to da Vinci’s observation, tree branches uniformly become ever thinner until they taper off, yet their total bulk at most levels of the tree will be approximately the same. So, in a taxonomy that grows naturally, we might expect that the terms at any given depth might be at approximately the same level of specificity. At the same time, their individual scopes at any given depth will add up to a sum total that will ideally (I think) cover the same scope as the top level of terms. As with tree branches tapering off, though, this will be less true as the taxonomy branches naturally taper off and end at the most specific levels.
Inglis-Arkell sums up with some interesting observations about the beauty of branches:
This pattern of growth has a mathematical, as well as physical, beauty. Trees are natural fractals, patterns that repeat smaller and smaller copies of themselves. Each tree branch, from the trunk to the tips, is a copy of the one that came before it. Branches split off from the highest tip the same way they do from the trunk, and set of branches splits off at the same angle to each other. Physics, math, and biology come together to create the simplest and most efficient growth pattern. It just took Leonardo Da Vinci to first notice it, the big show-off.
Barbara Gilles, Taxonomist
The following post, by Rachel Drysdale, originally appeared in PLOS BLOGS on April 8, 2014.
Science does not stand still and neither does the PLOS thesaurus. With more than 10,700 Subject Area terms, we use the thesaurus to index our articles and provide useful links to related papers, enhanced search functions, and, for PLOS ONE (more than 90 articles published every day!), customizable Subject Area-based email alerts and Subject Area landing pages.
Sometimes we decide to renovate a sector of the thesaurus to better reflect the make-up of the PLOS corpus. For example, we’ve long had a Subject Area term for “Synthetic biology,” sitting beneath “Biology and life sciences.” We even have a healthy Synthetic Biology Collection. However, the Subject Area term “Synthetic biology” was being applied to only a handful of articles despite the fact that many more PLOS articles were about synthetic biology and should ideally have been indexed accordingly. Why was this?
Part of the explanation is that ‘synthetic biology’ is not a phrase that is frequently used in natural language. So whereas an article about hypertension may use the word ‘hypertension’ 26 times within the text, an article about synthetic biology might state ‘synthetic biology’ rarely, if at all. This poses a challenge to the Machine Aided Indexing process which assigns Subject Areas to articles based on the frequency of matches in the text.
The way around this is to introduce a level of abstraction to the rulebase that governs the Machine Aided Indexing. The base rules are very literal: “if I see ‘synthetic biology’ in the text I’m going to use the ‘Synthetic biology’ Subject Area term.” But there are additional words and phrases that are diagnostic of synthetic biology topics, such as “biobricks” and “Registry of Standard Biological Parts.” Adding rules for these terms – for example “if I see ‘Registry of Standard Biological Parts’ in the text I’m going to use ‘Synthetic biology’” – increases the frequency of indexing to “Synthetic biology” and thus the retrieval of relevant articles in our searches.
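To make the idea concrete, here is a minimal Python sketch of frequency-based indexing with literal rules plus the extra "diagnostic" rules. It is not the Machine Aided Indexing rule syntax: the `RULES` table, the function name, and the naive substring counting are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical rule base: lowercase text trigger -> Subject Area term.
RULES = {
    "synthetic biology": "Synthetic biology",                      # literal rule
    "biobricks": "Synthetic biology",                              # diagnostic rule
    "registry of standard biological parts": "Synthetic biology",  # diagnostic rule
    "hypertension": "Hypertension",                                # literal rule
}

def suggest_subject_areas(text, rules=RULES, min_hits=1):
    """Count rule triggers in the text and return (term, hits), most hits first."""
    lowered = text.lower()
    hits = Counter()
    for trigger, term in rules.items():
        # Naive, non-overlapping substring count; a real indexer would tokenize.
        hits[term] += lowered.count(trigger)
    return [(term, n) for term, n in hits.most_common() if n >= min_hits]
```

A sentence such as "We assembled BioBricks from the Registry of Standard Biological Parts." never says "synthetic biology", so the literal rule alone suggests nothing. With the two diagnostic rules in place, the same sentence scores two hits for "Synthetic biology".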
A second factor is to do with the hierarchical structure of the thesaurus – an especially important factor given that our search functionality is designed to utilize this hierarchy. For example, a Subject search for “Vascular medicine,” beneath which Hypertension sits, retrieves articles indexed specifically with Hypertension, even if they have not been explicitly tagged with “Vascular medicine.” In earlier versions of the PLOS thesaurus “Synthetic biology” had no narrower terms, and this was doing it no favours with regard to how useful it was for retrieving relevant articles. We therefore reviewed essays about synthetic biology, scope descriptions from relevant institutional and departmental web sites, and proceedings from synthetic biology conferences, all in light of the content of our articles, and introduced new, narrower terms to sit beneath our existing “Synthetic biology” where that made sense. So we went from having the single “Synthetic biology” term to the new structure of 30 terms in one renovation. Here is what we have now:
Much of the evolution of the PLOS thesaurus is gradual, as for example when we realised that “puma” can be used as an abbreviation for “p53 upregulated modulator of apoptosis” as well as a kind of big cat, or learned that asteroids can be starfish. Dealing with these indexing missteps requires small-scale changes to specific rules. But sometimes the change needs to be more radical. Our new “Synthetic biology” sector was implemented in Ambra 2.9.12 (released March 26th, 2014). Where previously only a handful of articles was indexed with “Synthetic biology,” now a Subject search across all PLOS journals retrieves over 400 “Synthetic biology” articles – much more fitting for this important and developing field.
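The hierarchy-aware Subject search described earlier, where a search on a broader term also retrieves articles indexed only with its narrower terms, can be sketched in Python as follows. This is an illustration under stated assumptions, not the Ambra implementation: the `NARROWER` and `ARTICLE_TERMS` tables, the article identifiers, and the function names are all hypothetical.

```python
# Hypothetical slice of a thesaurus: broader term -> narrower terms.
NARROWER = {
    "Vascular medicine": ["Hypertension"],
    "Hypertension": [],
}

# Hypothetical index: article id -> Subject Area terms applied to it.
ARTICLE_TERMS = {
    "article-1": {"Hypertension"},
    "article-2": {"Vascular medicine"},
    "article-3": {"Synthetic biology"},
}

def expand(term, narrower=NARROWER):
    """Return the term plus all of its descendants in the hierarchy."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guards against accidental cycles in the data
        seen.add(current)
        stack.extend(narrower.get(current, []))
    return seen

def subject_search(term, article_terms=ARTICLE_TERMS, narrower=NARROWER):
    """Find articles indexed with the term or with any narrower term."""
    wanted = expand(term, narrower)
    return sorted(a for a, terms in article_terms.items() if terms & wanted)
```

In this toy data, a search for "Vascular medicine" returns both `article-1` and `article-2`, even though `article-1` carries only the narrower "Hypertension" tag. This is why a term with no narrower terms beneath it retrieves less than it could.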
For more about the work PLOS is doing with Synthetic biology see “An Invitation to Contribute to the Second Life of the Synthetic Biology Collection.”
Access Innovations, Inc. Now Accepting Presentation Abstracts for the Eleventh Annual Data Harmony Users Group Meeting
Access Innovations, Inc. is pleased to announce the Call for Presentations for the 2015 Data Harmony Users Group (DHUG) meeting. The annual DHUG meeting is held every February at Access Innovations company headquarters in Albuquerque, New Mexico. DHUG 2015 is the eleventh annual meeting and will focus on leveraging of taxonomies and tagged data, techniques for integrating tagged data flows into production cycles, and inventive ways to improve the user experience.
The theme for the meeting, “Beyond Subject Metadata, or, So you have a Taxonomy!… now what?” urges Data Harmony users to ask questions such as the following:
- What do I do now that my content is tagged?
- How do I integrate that tagged content into my workflow or production cycle?
- How can I get my newly-tagged content in front of my users?
- How can I improve the search experience for my users who want to access these information assets?
- Are there other features I can add based on the metadata tagging now in place?
- What other implementations can I set up to capitalize on content objects organized around my taxonomy?
For the first time, Data Harmony users can now submit presentation proposals using the company’s Smart Submit software extension module, at http://www.dataharmony.com/dhug/submissions. The system is a full working implementation of the module and demonstrates how easy it is to use. The deadline for inclusion in the preliminary program is September 20, 2014.
In the DHUG 2015 implementation of Smart Submit, the first screen includes fields for entering such information as title, creator (author or presenter, usually a DHUG member), abstract, contact information, and a brief biography of the presenter. Optionally, the user may choose to upload a PDF or Microsoft Word file. There are also some fields customized for the meeting organizer, such as on what day of the week a presenter would prefer to be scheduled, and how long his/her presentation will be.
In the second screen, Smart Submit uses Data Harmony’s M.A.I.(TM) (Machine Aided Indexer) software module to display suggested indexing terms from the Access Innovations thesaurus to characterize the presentation. M.A.I. bases its automated indexing assistance on the text in the title, the abstract, and any PDF or Microsoft Word document that was uploaded via the first screen. The presenter chooses to retain or remove each of the suggested terms and may add additional terms from the thesaurus. The system also allows for searching the thesaurus and adding terms from the search results view.
“This is an exciting addition to the DHUG meeting planning process,” remarked Heather Kotula, Marketing Coordinator for Access Innovations. “We made it a priority to showcase our own software this year. Using Smart Submit to collect presentation proposals is going to make my job of organizing the meeting easier, faster, more complete, and more accurate.”
DHUG registration includes breakfast, lunch, and breaks with refreshments for all five days of the meeting, February 16th-20th, 2015. A networking reception will be held Monday evening at the University/Midtown Hampton Inn. On Tuesday evening, dinner will be provided for all attendees at a unique Albuquerque attraction. The University/Midtown Hampton Inn is the primary DHUG meeting hotel, offering a $79 nightly rate for members.
For more information about DHUG 2015, please visit http://www.dataharmony.com/dhug/dhug2015.
Data Harmony recently released their Inline Tagging Web service extension, so let’s talk about inline tagging software and the information environments best suited to benefit from it.
Web developers are implementing inline tagging software in an increasing variety of information environments, spurred on by the creativity of users requesting new features based on accurate placement of inline tags. And it’s probably safe to say many users aren’t aware it’s inline tagging that propels some of the innovations they enjoy in their graphical user interface (GUI)… at the level of the onscreen text.
Data Harmony recently released their Inline Tagging Web service as one of the Version 3.9 ‘extension modules’ – causing me to wonder:
- What kinds of Web computing environments are well-suited for leveraging subject tags at the level of inline text?
- What is inline tagging good for? What can a subject tag accomplish when it’s been matched to a specific word’s location in the input text?
- What is the Data Harmony development team’s vision for implementation of the Inline Tagging extension?
- Can tags other than subject indexing terms be deployed for inline tagging?
To begin at the end of the tale, the answer to the last question is ‘Yes’ – geographical terms and other non-subject tags can be deployed for inline tagging, since inline tags are based on accurate indexing, which in turn is reliant on controlled vocabularies.
Controlled vocabularies such as taxonomies and thesauri can also store place names and other kinds of terms that don’t capture strictly conceptual information. Such terms serve as an authority file for other forms of information, for example, geographical. Inline tagging applications can match these non-conceptual terms during analysis of input text and be configured to extend functionality, for instance by linking to a geographical database for supporting information. For example, if ‘Canada’ were matched in the text, inline tagging might activate a mouseover window that offers the user a chance to look at a relevant entry from an atlas or encyclopedia. If the user clicks on the word ‘Canada’ in the text, a new interface tab opens to the relevant entry.
Guess what I discovered on taking my questions to the Data Harmony 3.9 developers… implementation ideas!
As a tool for search engines to boost the results of document search and retrieval
When a tag is included inline in a text object found by a search engine, words immediately around the tag (or the entire sentence) can be returned to the search engine, to supplement search results by providing context information about the match’s location in the found document.
The capability to return search term matches along with their context is significant in publications with multiple sections or chapters, to permit easier division into identifiable sections and subsections. Many publishers now offer content for sale in smaller pieces, so each customer can put together a ‘customized electronic book’ by combining chunks from different sources. Search and retrieval in publication collections retrieves relevant sections and subsections for recombination into new content objects. Accurate inline tagging facilitates this highly effective search strategy.
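Here is a minimal Python sketch of returning a match together with its sentence. It assumes the inline tags look like `<term id="...">...</term>`, which is a hypothetical format rather than the one Data Harmony emits, and it uses a deliberately naive sentence splitter.

```python
import re

# Hypothetical inline tag format: <term id="...">matched words</term>
TAG = re.compile(r'<term id="([^"]+)">(.*?)</term>')

def matches_with_context(tagged_text):
    """Return (term_id, matched_text, sentence) for each inline tag found."""
    results = []
    # Naive sentence split: break after ., ! or ? when followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", tagged_text.strip()):
        # The context handed back is the sentence with the tags stripped out.
        plain = TAG.sub(r"\2", sentence)
        for match in TAG.finditer(sentence):
            results.append((match.group(1), match.group(2), plain))
    return results
```

A search engine could display the returned sentence next to each hit, or use the tag's position to decide which section or subsection of a longer publication to retrieve.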
To turn up the volume of social media postings
Inline tagging can add value to search and retrieval within social media communities, increasing the gain of metadata information that’s already there in posts! You can use it for better categorization and for linking related Twitter ‘tweets,’ professional discussions, social issue blogs, and closed community forums (chat rooms) – for turning up the volume!
A well-placed inline tag inside a blog entry offers a semantic hook for Web applications to latch onto: blog postings can be followed within a certain date range only, or sent to designated recipients automatically when contributors write about any subject of definite interest.
As a lexicological training tool
Inline tagging methods can provide information for a language learner or human indexer about the meaning, form, and usage of words, while keeping the context in view.
In XML databases
XML databases often build indexes of searchable data by polling, at incredible speeds, all text in all available XML files – even for millions of records – and storing results in a repository. Inline tagging offers an alternative to the traditional polling method that often serves as the foundation for document search and retrieval in an XML database. Inline tagging methods enable you to describe fields with unique inline XML tags, for later recognition and retrieval by the spidering engines.
Kirk Sanders, Editorial Services
Project Manager, Access Innovations
People often ask us how much time it will take to manage a rule base with Data Harmony software. We reply with specific customer experience numbers and tell them that it takes a few hours per month of editorial time to maintain both the thesaurus and the rule base. One customer of ours, the American Institute of Physics, found that maintaining their thesaurus and rule base takes less than 15 hours per month at a throughput of 2,000 articles per week. Another customer, The Weather Channel, manages breaking news all day long with four hours per month of maintenance. It takes the editorial team just a few hours per month to keep up with the changing trends and events within their field and transfer those into the organizational knowledge base represented by the M.A.I.™ rule base. This is a small investment that provides the organization with the highest level of accuracy in coding (usually well over 90% hits without human intervention), and that also supports analysis of the trends in the business, the creation of author profiles, semantic fingerprints of the entire organizational holdings, and extraction of real meaning from all the data. Other customers, such as IEEE and the US GAO, find the accuracy of their Data Harmony software implementations so high that they now only sample the data periodically to glean new terms and trends. They do not see the need to review every single item.
The real question, though, should be a matter of control. If a rule-based solution maintained by the editorial staff is the approach taken, then full control remains with the editorial department. If a programmatic learning system – the seductive call of the purely automatic system – is the choice, then oversight either remains with the vendor or moves to the IT (information technology) department. The lower accuracy of the indexing returns (usually in the 60% range) means much more time spent by the editorial department on the production of the taxonomy-tagged items. The time that would have been spent improving the knowledge base is instead spent processing records in production, because of the lower accuracy levels.
Here’s an example: let’s assume 1,000 articles per month. At 90% accuracy versus 60% accuracy, how much extra production time is involved? Let’s also suppose, for easy calculations, that there are 10 terms per article. If our rule base indexing is 90% accurate, then only one term per article will need to be reviewed, researched, and replaced or discarded. If alternative indexing methods produce 60% accuracy, then there are four terms per article to research, replace, or discard. The time to research a term and decide on its disposition is conservatively two minutes. So two minutes per term at one term per article, across 1,000 articles, comes to 2,000 minutes, or just 33.3 hours per month. But if four terms per article (60% accuracy) need reviewing, that is 8,000 minutes, so 133.3 editorial hours per month are needed – obviously, four times the effort. Moreover, the rule base improves over time with this small editorial input, so the maintenance time continues to decrease.
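The arithmetic above can be checked with a few lines of Python. The function name `review_hours` and its parameters are simply labels for the figures used in this example (1,000 articles, 10 terms per article, two minutes per term); nothing here comes from the Data Harmony software.

```python
def review_hours(articles_per_month, terms_per_article, accuracy_percent,
                 minutes_per_term=2):
    """Editorial hours per month spent reviewing incorrectly assigned terms."""
    # Terms per article that still need review, e.g. 10 terms at 90% -> 1 term.
    terms_to_review = terms_per_article * (100 - accuracy_percent) / 100
    total_minutes = articles_per_month * terms_to_review * minutes_per_term
    return total_minutes / 60


rule_based = review_hours(1000, 10, 90)
statistical = review_hours(1000, 10, 60)

print(f"90% accuracy: {rule_based:.1f} hours per month")   # 33.3
print(f"60% accuracy: {statistical:.1f} hours per month")  # 133.3
```

Changing the inputs shows how quickly the gap grows: the review time scales linearly with article volume, so the same difference in accuracy costs ten times as many hours at 10,000 articles per month.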
A statistical approach can appear to be a gift on a silver platter, but beware – such an approach means more time spent on production, less on building a knowledge base, lower accuracy, higher throughput costs, and no chance to learn about the data through semantic fingerprinting. To make matters even more frustrating, you have little control of the system. It has to be improved and worked on by the vendor or the IT department. New terms require a full revamping of the system each time, resulting in costly delays, rather than the real-time, instant updates that a system based on Java object-oriented programming allows. As a result, the taxonomy is not responsive to the organization’s data.
It is tempting to think that the classification of content can be done without the use of a vetted taxonomy properly applied or that the taxonomy only provides a convenient file folder naming convention. Unfortunately, the cost is high to make that choice. The accuracy is lower, the throughput is slower, and the clerical aspect of the indexing process is increased when you use a statistical system. In addition, control is no longer with the editorial department, but shifted to IT and the vendor. The power dynamic of the choice is clear: IT versus editorial. Who do you want to be in control of your indexing?
Marjorie M.K. Hlava
President, Access Innovations
Data Harmony Version 3.9 Includes MAI Batch GUI – A New Interface For M.A.I.™ (Machine Aided Indexer) and MAIstro™ Modules
Access Innovations, Inc. has announced the inclusion of the MAI Batch Graphical User Interface (GUI) as part of the recent Data Harmony Version 3.9 software update release. MAI Batch GUI is a new interface for running a full directory of files through the M.A.I. Concept Extractor. This tool enables processing of large amounts of text through the Data Harmony M.A.I. Concept Extractor with a single command. Usually used in working with legacy or archival files, it allows complete semantic enrichment of entire back files in a short time. Once the batch has run, the subject terms from a thesaurus or taxonomy become part of the record itself.
“For Data Harmony Version 3.9, we decided to add the interface to the MAIstro and M.A.I. modules to allow use directly from the desktop, giving more power to the user,” remarked Marjorie M. K. Hlava, President of Access Innovations, Inc. “It’s a fast, easy way to perform machine-aided indexing on batches of documents, without any need for command-line instructions.”
“M.A.I.’s batch-indexing capability has been in place for years via command line interface,” noted Bob Kasenchak, Production Manager at Access Innovations. “This new GUI makes it really easy to use. Customers only need to open ‘MAI Batch app’ in their Data Harmony Administrative Module, choose the files or directories to process, and submit the job.”
The purpose of MAI Batch is to provide immediate processing of data files on demand. MAI Batch can be deployed to achieve rapid subject indexing of legacy text collections.
MAI Batch GUI offers semantic enrichment by extracting concepts from input text in most file formats, including the following:
- Adobe PDFs
- MS Word DOC files
- HTM/HTML pages
- RTF documents
- XML files
For XML files, the ‘XML Tags’ option permits users to define specific XML elements for MAI Batch GUI to analyze during batch processing. This option opens the door for indexing source documents that are tagged according to different XML schemas. XML Tags also permits the exclusion during indexing of sections in the document structure, as designated by the user.
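The idea of analyzing only designated XML elements, and skipping others, can be sketched with Python’s standard library. This is an illustration of the concept only, not the ‘XML Tags’ option itself: the function `text_for_indexing`, its parameters, and the sample element names are all hypothetical.

```python
import xml.etree.ElementTree as ET


def text_for_indexing(xml_string, include_tags, exclude_tags=()):
    """Collect text inside the included elements, skipping excluded sections."""
    root = ET.fromstring(xml_string)
    chunks = []

    def walk(element, inside):
        if element.tag in exclude_tags:
            return  # skip this element and everything beneath it
        inside = inside or element.tag in include_tags
        if inside and element.text and element.text.strip():
            chunks.append(element.text.strip())
        for child in element:
            walk(child, inside)
            # Text that follows a child element still belongs to the parent.
            if inside and child.tail and child.tail.strip():
                chunks.append(child.tail.strip())

    walk(root, False)
    return " ".join(chunks)


sample = (
    "<article><title>Soil moisture in wheat fields</title>"
    "<body><p>Irrigation schedules vary.</p>"
    "<references><ref>Smith 2010</ref></references></body></article>"
)
print(text_for_indexing(sample, {"title", "body"}, {"references"}))
```

Passing a different set of element names is all it takes to handle documents tagged under another schema, which is the flexibility the option is described as providing.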
The interface’s Input and Output panes present a practical view of the batch during processing, enabling a degree of interactivity – M.A.I. is a very accessible automatic indexing system. It’s a ‘machine-aided’ software approach, even when applied to batches of documents. IT support is helpful, but it is not required in order to run and maintain the Data Harmony suite of products.
When the documents already contain indexing terms, MAI Batch GUI will derive accuracy statistics for inclusion in the output, logging the statistics of indexing accuracy for the batch. M.A.I. calculates the indexing accuracy of its suggested terms from Concept Extractor compared to the previously-applied subject terms. This powerful method for enhancing the accuracy of subject indexing is based on reports generated by the M.A.I. Statistics Collector, giving a taxonomy administrator all the data needed to continually improve the results based on the system recommendations, selections, and additions.
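One plausible way to compare suggested terms against previously applied terms is sketched below. The source does not describe how the M.A.I. Statistics Collector computes its figures, so the measures used here (precision and recall), the labels ‘hits’, ‘misses’, and ‘noise’, and the sample terms are all assumptions made for illustration.

```python
def indexing_accuracy(suggested, applied):
    """Compare machine-suggested terms with previously applied subject terms."""
    suggested, applied = set(suggested), set(applied)
    hits = suggested & applied          # suggested and also applied earlier
    return {
        "hits": sorted(hits),
        "misses": sorted(applied - suggested),   # applied but not suggested
        "noise": sorted(suggested - applied),    # suggested but not applied
        "precision": len(hits) / len(suggested) if suggested else 0.0,
        "recall": len(hits) / len(applied) if applied else 0.0,
    }


report = indexing_accuracy(
    suggested=["Irrigation", "Wheat", "Soil moisture", "Tractors"],
    applied=["Irrigation", "Wheat", "Soil moisture", "Crop yields"],
)
print(report)
```

Lists of misses and noise are the practical part of such a report: each one points to a term whose rules, or whose place in the thesaurus, an editor may want to revisit.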
Founded in 1978, Access Innovations has leveraged semantic enrichment of text for internet technology applications, master data management, database creation, thesaurus/taxonomy creation, and semantic integration. Access Innovations’ Data Harmony software includes machine aided indexing, thesaurus management, an XML Intranet System (XIS), and metadata extraction for content creation developed to meet production environment needs. Data Harmony is used by publishers, governments, and corporate clients throughout the world.
“I don’t know where I am!”
Time traveler Clara Oswald becomes disoriented once again, in a scary encounter with a taxonomy displayed in flat format.
Taxonomies can be displayed in a variety of ways. One of the display types that we occasionally see is known as the flat format display. It’s described in the main U.S. standard for controlled vocabularies, ANSI/NISO Z39.19 (Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies, published by the National Information Standards Organization) as follows:
“The flat format is the most commonly used controlled vocabulary display format. It consists of all the terms arranged in alphabetical order, including their term details, and one level of BT/NT hierarchy.”
At the top level, this format might (or might not) look like that of other hierarchical vocabularies when they are collapsed. What happens, though, when you start navigating to deeper levels? Let’s take a look at the ERIC Thesaurus published by the U.S. Department of Education’s Institute of Education Sciences. Here’s the initial view, when you choose to browse the thesaurus:
Aha, top terms, yes? Unfortunately, no. These are non-hierarchy category labels into which the actual terms are grouped, without regard for hierarchical placement. Clicking on any of these category labels results in a flat alphabetical display of all the terms in that category. This is something that thesaurus publishers can get away with when they use a flat format display.
If you click on the first category, Agriculture and Natural Resources, you see a flat alphabetical list of terms, including Agricultural Education. Clicking on that, you would discover that its one broader term is Education (no, not Agriculture and Natural Resources), and that its one narrower term is Young Farmer Education. What you see is basically a term record, and that’s all. That’s flat format display.
Are there problems with this? I think so. Even if the vocabulary is viewed only by the people constructing and maintaining it, those people will have difficulty spotting gaps and redundancies. And even if the vocabulary is used only by in-house human indexers, they will have difficulty exploring it to find the most appropriate terms to apply for indexing, and they will tend to use the first terms they come across that seem to fit. In the latter scenario, the ignored terms are apt to fall victim to usage statistics, even if they’re good terms that should have been used. (I’ve seen this happen to at least one taxonomy.)
The format may have simplified things in the days of printed taxonomies, but if taxonomists and indexers have problems with it, think of the problems encountered by searchers looking for information resources. Searchers benefit from being able to navigate and explore a taxonomy, and to take full advantage of its hierarchical structure. The flat format doesn’t present a hierarchy; instead, it presents obstacles.
While you’re traveling down one path, you don’t have an opportunity to see what’s in nearby pathways, or in distant but related pathways.
You can’t see where you’re headed, or how far the path goes. The path that you originally saw as promising might only lead to a stone wall, after you’ve already traveled one term at a time to get there. (Some flat format taxonomies, though, turn out to be unexpectedly shallow, so you’re more apt to hit a dead end sooner rather than later.)
Because you can’t see more than one level before and after the term you’re in, and you can’t see over the hedge to other pathways, you may end up zigzagging and backtracking through the taxonomy in a frustrating guessing game.
Getting a Better View
Ideally, you should be able to view the full panorama of a taxonomy’s coverage. At the same time, you should be able to focus on the areas of interest to you. And you should be able to view more than one branch at the same time, and to view entire branches. To accomplish those goals, you need a full hierarchical display that you can expand and collapse as needed. The example below is a screenshot of the MediaSleuth thesaurus, some branches of which I’ve temporarily exposed to an expanded view with a click of the mouse.
With this kind of view, we can see our way in all directions, from wherever we are. We can see where we might want to go from there, and how to get there. We know exactly where we are.
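The difference between the two displays can be sketched in a few lines of Python. The term names Education, Agricultural Education, and Young Farmer Education come from the ERIC example earlier; ‘Adult Education’ is added purely for illustration, and the functions `flat_record` and `tree` are hypothetical, not part of any thesaurus software mentioned here.

```python
# Narrower-term links. "Adult Education" is included for illustration only.
NARROWER = {
    "Education": ["Agricultural Education", "Adult Education"],
    "Agricultural Education": ["Young Farmer Education"],
    "Adult Education": [],
    "Young Farmer Education": [],
}


def flat_record(term):
    """Flat format: the term with one level of BT/NT and nothing else."""
    broader = [bt for bt, nts in NARROWER.items() if term in nts]
    lines = [term]
    lines += [f"  BT {bt}" for bt in broader]
    lines += [f"  NT {nt}" for nt in NARROWER[term]]
    return "\n".join(lines)


def tree(term, expanded, depth=0):
    """Hierarchical display: show children of every term in `expanded`."""
    children = NARROWER[term]
    if not children:
        marker = " "            # leaf term, nothing to expand
    elif term in expanded:
        marker = "-"            # expanded branch
    else:
        marker = "+"            # collapsed branch
    lines = ["  " * depth + f"{marker} {term}"]
    if term in expanded:
        for child in children:
            lines.extend(tree(child, expanded, depth + 1))
    return lines


print(flat_record("Agricultural Education"))
print()
print("\n".join(tree("Education", {"Education", "Agricultural Education"})))
```

The flat record shows only the immediate neighbors of one term, while the tree shows the whole branch at once and lets the viewer choose which parts to open or close.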
Barbara Gilles, Taxonomist