Access Innovations recently debuted Data Harmony Version 3.9. Among its new features and fixes is a sneakily clever module called Inline Tagging. On the surface, it does exactly what the name says: it shows the user, quickly and clearly, which concepts in a piece of content triggered subject tagging by the software, and exactly where in the text they occur. It seems simple enough, a handy tool, but upon closer inspection, it really opens doors for the user.
Once the text is tagged, it becomes a question of what the user wants to do with it. That’s where the possibilities start to get really intriguing. In part, it allows an editor to do some very helpful things internally. Once term indexing triggers are tagged in a document, the editor could, for instance, go to a term’s thesaurus listing to see broader and related terms, along with synonyms or any number of other facets of the taxonomy.
Thus, Inline Tagging is a helpful tool in aiding the editing process, but my thoughts are moving more toward the end user right now. It’s they who can truly reap its benefits. That’s because Inline Tagging can easily serve as a conduit for linking data, which has the potential to dramatically enrich a user’s search experience, something that is absolutely crucial in publishing.
We’ve already seen how massive the amount of data in the world has become, and we’ve seen the need to understand and control it. We see the emergent patterns in that data, and we work with it to discover new avenues for viewership or revenue or education. But that’s using just a handful of datasets. No matter how large they might be, the size of that data pales in comparison to the data in the world. If we could harness that power, what could we do?
Linked data, which has emerged as one of the most important concepts in data publishing, could well be the answer. In a database that implements Inline Tagging, the key terms and concepts in each document are located at their exact occurrences within that document. By using Inline Tagging, you turn a passage of text into a data item that can be quickly plucked for analysis. But how does that help us?
It can work on a number of levels. This can be as simple as having a taxonomy term link to a definition page, with broader and narrower terms, synonyms, etc. That right there can help with clarity, speed, and accuracy, but that’s just the beginning. There could also be a more substantial relationship between a thesaurus and the world’s data, one that allows users to take those data items and send them out to mine the web for related tags, drawing them back to the original page as related materials.
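To make the idea concrete, here is a minimal sketch in Python of what locating taxonomy terms at their exact positions in a document might look like. The mini-taxonomy and sample sentence are invented for illustration; this is not the actual Inline Tagging implementation.

```python
# Hypothetical sketch: find each taxonomy term in a document and record
# its exact character offsets, the raw material for inline tags that
# could link out to a term's thesaurus or definition page.

def inline_tag(text, taxonomy):
    """Return (term, start, end) hits for each taxonomy term found in text."""
    hits = []
    lowered = text.lower()
    for term in taxonomy:
        start = lowered.find(term.lower())
        while start != -1:
            hits.append((term, start, start + len(term)))
            start = lowered.find(term.lower(), start + 1)
    return sorted(hits, key=lambda h: h[1])

# Invented example data:
taxonomy = ["cheetah", "parental care"]
doc = "The cheetah mother provides all parental care for her cubs."
for term, start, end in inline_tag(doc, taxonomy):
    print(f"{term!r} found at characters {start}-{end}")
```

With the offsets in hand, a display layer could wrap each occurrence in a link to the corresponding term page, which is where the linking possibilities discussed above begin.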
Say somebody is starting to write a paper on how a cheetah raises its young. They go online to research it and find a paper that addresses the topic perfectly. Now, this website also happens to implement linked data, so when the user queried “cheetahs raising young,” not only did the search result in a strong match on the site, it also, in turn, queried the cloud of data in the web. On its own, it locates information on other sites on the same topic and pulls down additional links: a wiki page, other related articles and papers, videos, or really anything.
It’s well known that people love one-stop shopping. That’s true in retail and that’s true in publishing. If the researcher can get all that information, curated personally for them in a clear, concise, and most importantly, highly accurate manner, they’ll almost certainly make that site their primary resource.
Some of these concepts have already been implemented in places, notably by the BBC, whose unique Sport Ontology, created for the 2012 Olympic Games, revealed just some of the potential of linked data. The idea was to personalize how the viewer watched the Olympics, understanding that enriched, relevant information delivered to the viewer in real time would drive traffic to the site.
There are even bigger ways linked data is being used, or potentially being used. The European Union is funding a project called Digitised Manuscripts to Europeana (DM2E), which aims to link all of Europe’s memory institutions to Europeana, the EU’s largest cultural heritage portal, to give free access to the stores of European history.
What if, in theory, a medical organization had access to linked data during flu season? That organization could pull information from not only medical records, but from, say, community records, school data, and other sources to try to predict when and where outbreaks might occur to minimize the damage. Certainly, there are issues with privacy and other hurdles that would need to be addressed, but even though that example is theoretical, the potential is massive.
Of course, proper implementation of linked data takes plenty of cooperation, so the jury is still out on how much or how soon sophisticated linked data usage could come about. The possibilities for academia, cultural awareness, and even retail look too enticing for it not to flourish. I, for one, am looking forward to a day where information I never dreamed of is right at my fingertips. I don’t know what it’s going to be, but it should be a fun ride.
Access Innovations, Inc. has announced that the Data Harmony Metadata Extractor is available as an extension of MAIstro™, the flagship thesaurus and indexing application in the company’s Data Harmony software line. Metadata Extractor is a managed Web-based service for revealing the hidden structure in an organization’s content, through superior data mining of publication elements, to normalize and automate document metadata tagging for the benefit of the organization.
Data Harmony Version 3.9 software achieves user-friendly integration of a taxonomy (or thesaurus) with an existing content platform or publishing pipeline. Patented indexing algorithms generate terms that describe what documents are really about, and precise keywords are attached for retrieving those content objects later, under different conditions. Among other benefits, deploying Data Harmony for subject tagging throughout a document collection creates a better search experience for users, because the results they get are closer to the point – there’s less extraneous material.
Leveraging a patented approach to text analysis for better keyword tagging is only one of the advantages to be gained from implementing the new Metadata Extractor Web service.
Quality Metadata Is Essential for Effective Content Management
To enhance the quality of metadata, this Data Harmony extension generates a complete bibliographic citation, creates an auto-summarized abstract of an article’s content, handles author parsing, and assigns subject keywords automatically. Metadata Extractor takes an unstructured or semi-structured article as input and returns an XML document with richer, more descriptive information captured in the metadata elements.
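To picture that output, here is a hypothetical sketch in Python of building the kind of enriched XML record such a service might return. The element names and sample data are purely illustrative; they are not the actual Data Harmony schema.

```python
# Illustrative only: assembling an XML record with the metadata elements
# described above (citation-style fields, parsed authors, an abstract,
# and assigned subject keywords). Element names are invented.
import xml.etree.ElementTree as ET

def build_record(title, authors, abstract, keywords):
    doc = ET.Element("document")
    ET.SubElement(doc, "title").text = title
    authors_el = ET.SubElement(doc, "authors")
    for name in authors:                      # result of author parsing
        ET.SubElement(authors_el, "author").text = name
    ET.SubElement(doc, "abstract").text = abstract   # auto-summarized text
    kw_el = ET.SubElement(doc, "keywords")
    for kw in keywords:                       # assigned from the taxonomy
        ET.SubElement(kw_el, "keyword").text = kw
    return ET.tostring(doc, encoding="unicode")

xml = build_record(
    "Cheetah Parental Care",
    ["A. Researcher", "B. Coauthor"],
    "A study of how cheetah mothers raise their young.",
    ["cheetahs", "parental care"],
)
print(xml)
```

The point of the sketch is the shape of the result: an unstructured article goes in, and a structured record with richer, more descriptive metadata elements comes out.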
The Metadata Extractor extension identifies descriptive information in a document, distilling and normalizing it with a method far more sophisticated than merely matching keywords in text. The extension attaches this enhanced metadata to boost the long-term value of the content object. It’s been shown that high-quality metadata, consistently applied, reduces a common source of user frustration: not finding the appropriate document at the right time in an oversized, disorganized file system.
Publishers Stand to Gain From Implementation
“Metadata Extractor is an essential addition to the Data Harmony software lineup for scholarly publishers, especially,” said Marjorie M. K. Hlava, President of Access Innovations, when asked to comment on its release. “Since every publication style sheet requires a targeted approach to leverage the most appropriate fields, Access Innovations provides customization supporting each new implementation. The result is a highly specialized output of accurate, consistent metadata for client documents, with subject keywords applied from their own unique vocabulary.”
M.A.I.™ Sets This Metadata Tool Apart from the Rest
“The extraction process uses element-based semantic algorithms mediated by M.A.I., the Machine Aided Indexer,” said Bob Kasenchak, Access Innovations’ Production Manager. “It draws on a set of Data Harmony programs that harness natural language processing (NLP) for targeted text analysis. During configuration, elements in the document schema are specified for metadata extraction, to reflect the structure of input articles. Then, whenever someone processes an article with Metadata Extractor, M.A.I. algorithms go to work surfacing crucial pieces of information to identify that document, and that document only.”
The graphical user interfaces (GUIs) and input elements for the Metadata Extractor Web service are adjustable based on the nature of incoming data and user needs.
Data Harmony Extension Modules
Access Innovations offers an expanding selection of Web-based service extension modules that are opening up new avenues between content management platforms and the innovative Data Harmony core applications: Thesaurus Master® and M.A.I.™ (Machine Aided Indexer).
To supplement an organization’s publishing pipeline or document collection with great tools for knowledge discovery, the Data Harmony Web service extensions operate on the basis of rigorous taxonomy structures, creative data extraction methods, patented text analytics, and flexible implementation options. All Data Harmony software is designed for excellent cross-platform interoperability, offering convenient opportunities for integration in all kinds of computing environments and content management systems (CMSs).
Visit the Data Harmony Products page to explore the range of focused solutions that are presented by Data Harmony Version 3.9 extension modules.
About Access Innovations, Inc. –
Founded in 1978, Access Innovations has extensive experience with Internet technology applications, master data management, database creation, thesaurus/taxonomy creation, and semantic integration. Access Innovations’ Data Harmony software includes machine aided indexing, thesaurus management, an XML Intranet System (XIS), and metadata extraction for content creation developed to meet production environment needs. Data Harmony is used by publishers, governments, and corporate clients throughout the world.
Not that long ago, getting published was the big hurdle for a writer to overcome. You could produce all you wanted, but unless you knew how to get somebody to read your random submission, or you were rich enough to self-publish, your writing lived in a drawer, waiting for you to give it to a friend who doesn’t want to read it.
It’s hard to believe how fast technology has opened publishing up to people. Now, anyone with an opinion has a platform, and while it’s as tough as ever to make a living writing, the platform, in many cases, is totally free. So that changes the hurdle from publication to recognition. If everybody has a voice, how do you get heard?
This isn’t just a question of red-hot opinions on social media. The explosion of e-book publishing has enabled writers of all kinds and all backgrounds, and without a character restriction. Whether it’s through a blog, an e-book, or whatever, the gatekeeper has started to disappear, and to a writer who likes getting published, that prospect is thrilling.
But a new gatekeeper has replaced the old. The driving force of the explosion has been the Amazon Kindle. Since it was first released in 2007, Kindle titles have taken an increasingly large share of the industry, and now make up nearly 20% of all book sales, not just e-book sales.
That’s astonishingly fast, and the publishing industry has been dragged kicking and screaming behind. It’s easy to see how it could be a painful transition for them. There’s no physical copy to print and they’re out of the distribution game, so publishers naturally make less per book sold than they had in the past. Amazon made deals advantageous to themselves, of course, but sales have continued to increase. The downside is that issues have arisen as a result of Amazon trying to strong-arm publishers who don’t want to play ball.
By the same token, writers make less in royalties than they once did, as well. That’s the sad part, I guess, but the positive side is that more people are writing and more ideas are floating around, which is a beautiful thing and vital to the advancement of culture. It also presents a brand new problem for the industry: information overload.
As long as there was traditional publishing, there was a structure in place to determine what writing was deemed “worthy” of printing. It kept dangerous or controversial views out of the public, sure, but it also filtered out the garbage. Academic publishing still has its review system in place to make sure a work is suitable to print, but the non-academic side now has little to no filter.
Let’s face it: for all the good that open access to publication can do for society, it also means that one may have to wade through a lot of chaff to find high-quality, relevant material. So the question becomes how to organize access so that every time you want to find something, you don’t have to filter through a mass of irrelevant and useless material. It’s for this reason that data management has become so vital. Its use has resulted in revolutionary new ways to look at publishing.
The basic fact of having an individual platform is big enough. But there are larger, more groundbreaking efforts to take advantage of the opportunities the technology has afforded us. Norway, for instance, is in the process of digitizing all of its books, all of them, to make them available online to anyone with a Norwegian IP address; the Digital Public Library of America is a growing resource connecting libraries across the country; and the Public Library of Science has turned the paradigm of academic publishing on its ear.
The concept of the digital library isn’t new. Project Gutenberg has been around since 1971. Little did we know back then what kind of value that might have. It’s only becoming clear now that analytic software has become so advanced. For Amazon, books were a means to mine customer data for other products. Now, that kind of data mining is commonplace. It doesn’t have to be about sales, though. In these library projects, that same level of data mining can be used for all sorts of purposes, from recommending new reading materials to a better understanding of a student’s learning habits.
The potential in these projects is limitless, and it takes innovative thinkers to look for patterns and derive ways to utilize them. But the most important thing to me is that what I write, what anybody writes, can be published and accessed for all to see in one form or another if somebody is interested. After all, if I want to read about new methods in cancer treatment or some crazy person ranting about aliens, I should have that right, and so should everyone.
In her 1996 paper, The Rage to Master: The Decisive Role of Talent in the Visual Arts, Ellen Winner presents a concept she calls, well, the “rage to master.” The idea is that intellectually gifted children have a natural inclination to focus on a subject and immerse themselves in it until they reach mastery.
With proper support, the “rage to master” creates a positive feedback loop. The child’s interest combines with their gifts, enabling them to grasp a topic more easily than a more average individual could. This provides a feeling of satisfaction, reinforcement that encourages the child to continue mining the subject. Using the initial knowledge as a springboard, the cycle repeats itself, creating an outward-spreading spiral of knowledge.
Data Harmony has something in common with that gifted child: the feedback loop in its indexing. The software knows nothing at first, but when it is fed content, its subject of choice, and is given support and encouragement in the form of taxonomy building and editorial analysis, it can start the learning process.
With one piece of content, it can only learn so much. It grows with each new piece, the next feeding off what came before, but it needs consistent and diligent editing of those results. Given that, the software can become progressively smarter.
Just as the gifted child can never learn everything about a given subject, though, the feedback loop that indexing software creates won’t last forever. Eventually, progress will slow down. There’s a big difference between the highly accurate search results it delivers and perfectly accurate search results, an unattainable goal.
Voltaire’s aphorism, “Perfection is the enemy of the good,” applies well here. The “rage to master” in the gifted child depends on progress and satisfaction. Attempting perfection undermines both. Progress will slow to a halt, denying the child the satisfaction that was the driving force in the first place.
Of course, we’re talking about software here, so feelings don’t actually apply. Where the analogy does apply is with the user, who “motivates” the software by feeding it content. The user is the impetus for the software’s education, supplying new material while honing and fine-tuning the output. All of this delivers accurate results, and the user gets the feeling of satisfaction.
Indexing software has the “rage to master” content because it was built to serve that purpose. It can’t do anything alone, though. It takes a dedicated team of editors to feed it that content and interpret the results. The responsibility is on them to understand how to leverage the results into valuable commodities. Without that side of it, the software achieves very little.
The emergence of Big Data has made this increasingly vital to business in industries of any stripe. The amount of data is growing at an astonishing rate and shows no signs of slowing down. If it was difficult to collect and analyze large amounts of content manually a few decades ago, imagine the struggle today with the glut of tablets, phones, and computers collecting and transmitting data every moment of the day.
There is so much out there that even a large team of editors can struggle to sort and analyze it with much effectiveness or insight. But this is exactly where the feedback loop created by indexing software can change the game. The software speeds the process, facilitating the analysis, but it can’t make decisions on its own. The editors are absolutely crucial to the accuracy of the software’s output. It starts with an analysis of a single batch of content, but with their guidance, that analysis builds on itself with each new batch. Before long, patterns start to emerge.
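A toy sketch can illustrate that loop: the indexer starts with bare taxonomy terms as its only triggers, editors review a batch and add the trigger phrases the software missed, and coverage improves on the next pass. The rules and sample text below are invented; this is not how Data Harmony’s M.A.I. actually stores or applies its rules.

```python
# Toy editorial feedback loop: each taxonomy term has a list of trigger
# phrases, and editors grow those lists after reviewing each batch.

def index(doc, rules):
    """Return the set of taxonomy terms whose trigger phrases occur in doc."""
    text = doc.lower()
    return {term for term, triggers in rules.items()
            if any(t in text for t in triggers)}

# Initial rules: each term triggers only on its own literal wording.
rules = {"cheetah": ["cheetah"], "parental care": ["parental care"]}

batch = "The female raises her cubs alone."
print(index(batch, rules))        # no literal triggers match; both concepts missed

# Editors review the output and add the triggers the software missed:
rules["cheetah"].append("cubs")
rules["parental care"].append("raises her cubs")

print(index(batch, rules))        # the same batch now surfaces both concepts
```

Each round of editorial correction makes the next batch’s output more accurate, which is exactly the self-reinforcing pattern described above.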
Now, the people who would have had to endure the tedium of slowly going through the data by hand can work with these emergent patterns instead. This is a far more meaningful way to interact with data and enables new ways to look at the results. Now, people can more quickly and easily identify and react to trends in their industry.
In publishing, this means understanding how users search for content and potentially directing them to content they may not initially have found valuable. Using Data Harmony, the publisher has a controlled vocabulary that narrowly and accurately directs searches. It also allows them to observe and analyze how the user searches and what else they search for, which gives them the tools to find patterns in their customer base and tailor future initiatives to their specific needs.
The mountain of data in this world is only going to continue to grow, so while large-scale analysis is important today, it will be even more important tomorrow, next week, and in a year. Who knows what the landscape will look like in a decade, but we can safely speculate that the positive feedback loop that emerges from software like Data Harmony will enable organizations to handle it, no matter how massive it may have grown.
Access Innovations, Inc., the industry leader in data organization and innovator of the Data Harmony® software suite, is pleased to announce that KMWorld has selected Data Harmony 3.9 for their Trend-Setting Products list for 2014.
“We enhance and enlarge the Data Harmony offerings every year. This year the suite has increased to 14 modules. It is vitally important to stay at the forefront of knowledge management. With Data Harmony v.3.9, we have delivered the most integrated, flexible, and streamlined user-friendly semantic enrichment software on the market,” notes Marjorie Hlava, president of Access Innovations, Inc. “We will continue developing new and innovative ways to analyze, enhance, and access data to increase findability and distribution options for our customers.”
The proven, patented Data Harmony software is the knowledge management solution to index information resources and, in 2014, pushed farther into the future with the inclusion of Inline Tagging, which automatically finds and labels text strings, and Smart Submit, a module that greatly streamlines the author submission process. With these in place, Data Harmony offers a richer, more advanced, and friendlier customer experience.
The Trend-Setting Product awards from KMWorld began in 2003. More than 650 offerings from vendors were assessed by KMWorld’s judging panel, which consists of editorial colleagues, analysts, system integrators, vendors themselves, line-of-business managers, and users. All products selected demonstrate clearly identifiable technology breakthroughs that serve vendors’ full spectrum of constituencies, especially their customers.
“Data Harmony was selected by the panel because it demonstrates thoughtful, well-reasoned innovation and execution for the most important constituency of them all: the customers,” explained Hugh McKellar, editor-in-chief of KMWorld Magazine.
Data Harmony v.3.9 is available in the cloud as a hosted SaaS version or as an enterprise version hosted on a client’s server. More information about Data Harmony and the 14 software modules is available at www.dataharmony.com.
Access Innovations has extensive experience with Internet technology applications, master data management, content-aware database creation, thesaurus/taxonomy creation, and semantic integration. Access Innovations’ Data Harmony software includes machine aided indexing, thesaurus management, an XML Intranet System (XIS), and metadata extraction for content creation developed to meet production environment needs. Data Harmony is used by publishers, governments, and corporate clients throughout the world. Access Innovations: changing search to found since 1978.
About KMWorld – www.kmworld.com
KMWorld (www.kmworld.com) is the leading information provider serving the Knowledge Management systems market and covers the latest in Content, Document and Knowledge Management, informing more than 40,000 subscribers about the components and processes – and subsequent success stories – that together offer solutions for improving business performance. KMWorld is a publishing unit of Information Today, Inc. (www.infotoday.com).
Regardless of discipline, there’s one thing that connects most academics I’ve encountered: the desire to keep practicing their respective fields. They’ve spent years cultivating their expertise and want to make a difference in their field. But in order for that to happen, they all share the same obstacle: tenure.
Increasingly, universities are favoring adjunct jobs over tenured professorships. When one looks at it from a business perspective, as administrations with budgets are bound to, it isn’t hard to see why that happened. They get to pay less for the same work (though maybe not the same quality of work) and they retain power over the adjunct’s job security.
Whether one agrees with that policy, it makes a certain kind of sense from that side of the pipeline. The system isn’t exactly ideal for the academics in adjunct positions, though, whose lack of job security means that, year after year, the potential of finding a new job (another adjunct position, likely) weighs heavily on their minds. Nobody can do great work under that kind of pressure.
Finally, they do get that tenure-track position. Initially, it might seem like the hard part is over, but it’s only just begun. Convincing a university administration to offer a position is one thing; it’s a whole different story when it comes to the thing that most fuels a university’s engine: publication.
It’s tempting to think high-mindedly about higher education, but at the administrative level, a professor’s value is based far more on their academic prestige and contributions to the field than on their skills as an instructor. These contributions are measured by the quantity of articles published in academic journals and by the prestige of those journals. There really is no other road to tenure.
It’s a cutthroat game and professors are playing for keeps…they have to. There aren’t more total jobs on the academic market; tenured positions are replaced by adjunct ones at the first available opportunity. Those with tenure hold onto the privilege for dear life, and rarely does a seat open up at the table.
It’s easy to see, then, why the institution demands publication once a professor does find a seat. The institution wants prestige, which it gets through a faculty that publishes in renowned journals. They select for it because it’s a bottom-line situation for them; a well-respected faculty means a higher class of student, which means a higher rate of tuition and a better result at the end of the fiscal year.
This is why it’s vital for the journal itself to make sure that what is printed on their pages meets their academic standards. Enter the peer review process. While it was designed to uphold academic rigor (and often succeeds at that purpose), it has the consequence of acting as a gatekeeper for those seeking tenure. That consequence may not have been intentional, but it has become a growing issue.
Submitted articles, certainly, must be vetted for accuracy and content, but they also must be filtered so they get into the right hands for peer review. This process takes time—always has—but it has grown even slower in recent years. With fewer tenured positions, there are fewer people available to review articles. The number of articles hasn’t necessarily changed, though, so those available are now busier than ever and, unfortunately, less attentive on top of it.
The trouble is that those on the tenure track only have so much time before their window closes. It can take months and even years for an article to slog through the pipeline, often preventing viable candidates from receiving tenure for the simple and fixable issue of delay. This inefficiency does a grave disservice to the very people the system was designed to help.
Without a wholesale change in university administration mentality, the issue will not fix itself, so it must be addressed from a new angle. By identifying and analyzing the metadata present in a given article submission, it becomes clear where the submission comes from and whom it should go to. That streamlines the process, makes it easier on both the author and the peer reviewer, and subsequently speeds up the publishing process.
Data Harmony software is able to take care of this quickly and easily with the Smart Submit module. Using the article metadata in conjunction with a taxonomy, Smart Submit automatically identifies the subject areas covered in a submitted article. With that information, and with a properly designed management system, a publisher can find qualified peer reviewers for the submission and ensure that reviewers don’t get overwhelmed with submissions. A lighter workload means that more time and care can be taken with a given submission, making for a better work environment and, potentially, a smoother path through the pipeline.
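As a rough illustration of the routing idea, and not the actual Smart Submit implementation, a sketch might match a submission’s subject terms against reviewer expertise and prefer the qualified reviewer with the lightest current workload. All names and data here are invented.

```python
# Hypothetical sketch: route a submission to a qualified peer reviewer
# while balancing workload across the reviewer pool.

def pick_reviewer(article_subjects, reviewers):
    """Choose a reviewer with expertise overlap and the fewest assignments."""
    qualified = [r for r in reviewers
                 if set(article_subjects) & set(r["expertise"])]
    if not qualified:
        return None        # no expertise match; needs manual handling
    return min(qualified, key=lambda r: r["assigned"])

# Invented reviewer pool:
reviewers = [
    {"name": "Dr. Adams", "expertise": ["genomics", "oncology"], "assigned": 4},
    {"name": "Dr. Baker", "expertise": ["oncology", "immunology"], "assigned": 1},
    {"name": "Dr. Chen",  "expertise": ["astrophysics"],          "assigned": 0},
]

choice = pick_reviewer(["oncology"], reviewers)
print(choice["name"])   # Dr. Baker: qualified and least loaded
```

The subject terms feeding such a matcher are exactly what taxonomy-based tagging of the submission supplies, which is why the two steps fit together so naturally.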
Academic publishing is a two-way street. Publishers need authors to write articles to populate their journals. Authors need journals to publish their research, which furthers their career and their field. When the two sides work together, that’s when a field of study can really flourish.
Why set up these barriers? It should be difficult to get published in a prestigious journal because academic rigor demands it, not because of an inefficient system that helps neither side. Smart Submit won’t solve every problem an author might face in getting published, but streamlining the submission and review stages and making them more transparent will make the process less frustrating for users and, ideally, speed up an arduous process that too often hinders fresh voices when it should be an avenue for them to be heard.
It can be tempting to look at a taxonomy and only see it in terms of the search results generated from it. That part of it is very important, no doubt, but taxonomies have far-reaching tendrils that extend deep into many industries, especially publishing. They work subtly and are often invisible to the end user, but the effects can be extraordinarily powerful.
[Image: drawing by Pierre Dénys de Montfort, 1801, http://commons.wikimedia.org/wiki/Cryptozoology#mediaviewer/File:Colossal_octopus_by_Pierre_Denys_de_Montfort.jpg]
Taxonomies interact with and enrich the publication pipeline in so many ways. This is especially apparent in the world of academic publishing, but the principles can be applied throughout the industry.
With the explosion of electronic content, the use of metadata has become increasingly important. Hunting down print records from an archive by hand is simply unacceptable today; documents get lost, forcing the unnecessary duplication of work. This is even more frustrating with electronic content because, in theory, the conversion to digital was supposed to streamline the retrieval process.
It hasn’t really worked out so smoothly. This isn’t simply about finding something with a search function; inefficiencies gum up the works at every stage, costing companies valuable time that could be more effectively used elsewhere. In fact, studies have shown that people in publishing spend up to 35% of their time hunting down information. That’s simply too much time.
Access Innovations’ Data Harmony software products can help clean up the mess, both literally and figuratively. With applications at every stage, from submission to publication, the software suite is specifically designed to improve accuracy and efficiency, from the writer submitting the work to the publisher putting it in a journal, to the end user who needs to access it.
We will examine these ideas in a nine-part series to see just how integral taxonomies are to publishers. Some features will specifically address how Data Harmony works to improve data, while some will be more general. The following topics will be covered:
- THE NEED FOR QUICK PUBLISHING: Academics publishing papers in journals—crucial for tenure—often face months and even years of waiting, sometimes so long that the tenure window passes. How does Data Harmony, specifically Smart Submit, streamline the submission process?
- SEMANTIC ENRICHMENT: The capture of metadata is key to meaningful analysis of content. The resulting metadata allows a user to view content in unique ways to draw previously invisible patterns that dramatically improve the accuracy of content retrieval and enable analytics.
- BY THE NUMBERS: We look at the real cost of slow turnaround in academic publishing—it doesn’t just affect the writer. There is a real cost to publishers when their systems are inefficient.
- INLINE TAGGING—WHEN YOU STILL CAN’T FIND IT: Data Harmony’s constant refinement has now led to inline tagging solutions that allow users to easily zoom in on and identify concepts in a full text document.
- FINGERS IN MANY POTS—FROM SUBMISSION TO PUBLICATION: From Smart Submit to precision search results at the end, Data Harmony has a hand in making each piece of the workflow production pipeline smoother, easier, and more accessible.
- FEEDBACK LOOPS FOR TRULY ACCURATE INDEXING: One of the more interesting aspects of Data Harmony indexing solutions is how they enable constant checking of content extraction effectiveness, so that it becomes ever more accurate with each new piece of added content.
- THE FUTURE—REAL-TIME ANALYSIS OF THE CHANGES IN PUBLISHING: The explosion of content being housed online instead of in print brings up some interesting speculative issues about the future of journal publication.
The series will close with an installment that will demonstrate, on a broader level, that high-quality metadata and accurate taxonomy-based indexing streamline and enrich the publication process. Today’s technology gives publishers the opportunity to enhance the way their journals are compiled and disseminated to the public. However, many publishers have not yet taken advantage of that opportunity. The result is a frustrating search experience for the end user, one that, most importantly, fails to deliver precisely the content they require.
Ultimately, that end user experience is most crucial, but in order for a system to really work, it requires improved efficiency at every level. Certainly, good semantic enrichment software such as Data Harmony can help with that, but changes in mentality are also required. That doesn’t happen overnight, but the faster that publishers can start to implement new methods of looking at their data and making their content available, the faster that users throughout the pipeline will get true satisfaction from their results.
When the author is satisfied and the publisher is satisfied, the end user wins. That builds trust in your client base, which translates into revenue. That revenue allows the publisher to offer improved and diversified products, leading to a broader user base that is assured of a reliable content offering, which once again, leads to even more revenue. This kind of cycle drives industry to improve upon itself, and good semantic enrichment software such as Data Harmony can facilitate that process.
Access Innovations, Inc. is pleased to announce that the Data Harmony software line now includes inline tagging capability, beginning with Version 3.9, released earlier this year.
Inline Tagging Feature In Data Harmony 3.9 Installations
Employing inline tags, the TestMAI screen in the MAIstro™ module now displays matching content in a highlighted font when a Data Harmony user runs M.A.I.™ (Machine Aided Indexer), putting the focus on indexing concepts as they appear in the input text. XML elements are applied in the text at the exact location where a word or phrase triggered a term suggestion from the controlled vocabulary (taxonomy or thesaurus).
Inline tagging keeps the context in view when M.A.I. makes a subject indexing match, and generates a statistical summary showing the list of matched terms. This provides a look at the operation of term-matching rules and improves the user’s experience during rule base testing and refinement.
Inline Tagging As a Software Product
The Inline Tagging extension module is available as a Data Harmony web-based service, with configuration options for integrating M.A.I. into the user interface of a content management system.
“For Version 3.9, we added inline tagging for MAIstro and M.A.I. installations, as Data Harmony customers are seeing now when they run TestMAI on their data,” remarked Marjorie M. K. Hlava, President of Access Innovations, Inc. “And, we packaged Inline Tagging as a software module to be integrated with an organization’s content management program, in the interface.
“The Inline Tagging extension puts M.A.I.’s indexing precision within the reach of users, users who may be requesting more Web functionality! Often, the first step to offering more interactivity at document level is the precise placement of inline subject tags… this step is accomplished through a thoughtful configuration of the Inline Tagging extension.”
“The Inline Tagging extension inserts an XML ‘wrapper element’ around matched words; it locates concepts from your vocabulary inside full-text documents,” explains Bob Kasenchak, Production Manager at Access Innovations. “The XML wrapper can be configured so your interface displays related information when a user points their cursor at the match. Or, the interface can display a link for the user to visit. But that’s not all… an IT department can deploy inline text tags to make search and retrieval in a large document collection more precise and context-sensitive.”
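The wrapper-element idea is easy to picture in miniature. Purely as an illustration (the `concept` element, `term` attribute, vocabulary entries, and matching logic below are invented for this sketch, not Data Harmony’s actual schema or API), a tagger that wraps vocabulary matches in place might look like this:

```python
import re

# Toy controlled vocabulary mapping a matched phrase to a term identifier.
# The "concept" element and "term" attribute are invented for this sketch;
# a real installation defines its own wrapper element and attributes.
VOCABULARY = {
    "machine aided indexing": "T0412",
    "thesaurus": "T0107",
}

def inline_tag(text: str, vocabulary: dict) -> str:
    """Wrap each occurrence of a vocabulary phrase in an XML wrapper element."""
    for phrase, term_id in vocabulary.items():
        pattern = re.compile(re.escape(phrase), re.IGNORECASE)
        text = pattern.sub(
            lambda m, t=term_id: f'<concept term="{t}">{m.group(0)}</concept>',
            text,
        )
    return text

print(inline_tag("A thesaurus supports machine aided indexing.", VOCABULARY))
# A <concept term="T0107">thesaurus</concept> supports
# <concept term="T0412">machine aided indexing</concept>.
```

An interface can then style the wrapper, attach a tooltip, or turn the match into a link, which is exactly the kind of behavior described above.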
Implementation Ideas for the Inline Tagging Web-based Service Extension:
- As a search engine tool to boost document search results
- To turn up the volume of social media postings
- To create a better, more searchable index of an XML database
- To foster increased user interaction opportunities with documents
How the Inline Tagging Extension Works, Simplified
Implementations of the Inline Tagging extension module are based on an organization’s controlled vocabulary and accompanying rule base, stored in a Data Harmony installation. The Inline Tagging API can retrieve any element from the vocabulary term record in order to offer relevant information alongside matched content, and it handles this ‘on the fly.’ This makes it convenient for a Web developer to implement customized functions for a content management system. Driven by data already stored in the term records, with Web-based services configured for the CMS interface, Inline Tagging gives users immediate access to supplemental information about the relevant controlled vocabulary terms, and any related concepts.
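As a rough sketch of that retrieval idea (the record fields, identifier, and function name below are invented for illustration; an actual installation’s term records and API will differ):

```python
# Invented term records for illustration; real vocabulary records carry
# whatever fields the organization has defined in its thesaurus.
TERM_RECORDS = {
    "T0107": {
        "label": "Thesauri",
        "broader": ["Controlled vocabularies"],
        "related": ["Taxonomies", "Ontologies"],
    },
}

def supplemental_info(term_id: str) -> str:
    """Assemble on-the-fly supplemental text for a matched concept."""
    record = TERM_RECORDS[term_id]
    see_also = ", ".join(record["broader"] + record["related"])
    return f"{record['label']} (see also: {see_also})"

print(supplemental_info("T0107"))
# Thesauri (see also: Controlled vocabularies, Taxonomies, Ontologies)
```

The point is simply that everything shown to the user is driven by data already stored in the term record, so the interface layer stays thin.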
About Access Innovations, Inc. –
Access Innovations has extensive experience with Internet technology applications, master data management, database creation, content-aware thesaurus/taxonomy creation, and semantic integration. Access Innovations’ Data Harmony software includes machine aided indexing, thesaurus management, an XML Intranet System (XIS), and metadata extraction for content creation developed to meet production environment needs. Data Harmony is used by publishers, governments, and corporate clients throughout the world. Access Innovations has been changing search to found since 1978.
I recently had the opportunity to see webinars featuring a couple of software systems for taxonomy construction/management and content categorization. The systems were both impressive and, if I didn’t have 20 years in the business, I would have been totally awed … and snowed. It’s easy to be overwhelmed by a slick appearance and professional presentation.
Early in my career, I was overwhelmed and confused by the terminology—its abundance, multiplicity, and ambiguity. Each software company used different words, all very catchy, developed by a creative marketing department. I couldn’t tell whether they were talking about different concepts or the same ones in different verbal wrappers. Cutting through the terminology to identify key software features and functions can be tough. Yet that’s just what must be done for an informed buying decision.
One of the buzzwords I came across in these recent webinars was “content-driven” (or “data-driven”) to describe a taxonomy. To my amazement, this was described as a “trend” in taxonomy construction by the presenter for the company “with over 15 years of experience.” Apparently it was intended as a strike at a “top-down” approach to pulling together terms for a taxonomy based on an abstract, authoritative view of a domain. The top-down approach was described as more complex than necessary and including nodes not reflecting your content.
However, the discussion ignored the equally familiar and long established counterpart to top-down. This is the “bottom-up” approach, drawing terms directly from the documents to be categorized, i.e., content-driven. Here’s a link to a brief description of the strategies written in 1996 by Jessica Milstead.
In most cases, building a taxonomy or thesaurus requires a hybrid approach, with the overall organization based on a top-down approach for navigation and the bulk of terms reflecting the preferred terms for concepts in the domain and drawn from the actual documents. The strategies are most often used in balance, with the taxonomist providing a logical “top” structure into which the content-linked terms can fit.
The software on display generated a list of candidate terms, offering words and phrases from the content as terms. But this was just a starting point in taxonomy construction. Time for the taxonomist to add the value of organization through hierarchical, associative, and equivalence relationships.
Ah, “relationships” takes me to semantics, another buzzword that sounds very impressive and truly represents the power of taxonomies. The key thing to remember is that semantics in a taxonomy starts with the hierarchical, associative, and equivalence relationships. (Actually, a taxonomy with all those features is more accurately called a thesaurus.) Organizing terms in a hierarchy of broader and narrower concepts—from general to specific—and recognizing synonymous alternative expressions and internal conceptual links all add semantic richness to terms by providing context based on the meanings of words. These are features built into a well-developed taxonomy, providing pivot points from one term to another through logical semantic associations. Applied as metadata to content items, the taxonomy terms provide semantic enrichment.
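Those three relationship types map naturally onto a simple data structure. A minimal sketch with invented terms, using the standard thesaurus abbreviations BT (broader term), NT (narrower term), RT (related term), and UF (used for, i.e., non-preferred synonyms):

```python
# Toy thesaurus: each preferred term carries its hierarchical (BT/NT),
# associative (RT), and equivalence (UF) relationships.
thesaurus = {
    "Vehicles": {"BT": [], "NT": ["Cars", "Bicycles"], "RT": [], "UF": []},
    "Cars": {"BT": ["Vehicles"], "NT": [], "RT": ["Roads"], "UF": ["Automobiles"]},
    "Roads": {"BT": [], "NT": [], "RT": ["Cars"], "UF": []},
}

def resolve(term: str) -> str:
    """Equivalence: map a non-preferred synonym to its preferred term."""
    for preferred, rels in thesaurus.items():
        if term == preferred or term in rels["UF"]:
            return preferred
    raise KeyError(term)

def pivot(term: str) -> list:
    """Pivot points: every term reachable in one semantic step."""
    rels = thesaurus[resolve(term)]
    return rels["BT"] + rels["NT"] + rels["RT"]

print(resolve("Automobiles"))  # Cars
print(pivot("Automobiles"))    # ['Vehicles', 'Roads']
```

Even this toy version shows why the relationships, not the term list, carry the semantics: the synonym resolves to a preferred term, and from there the hierarchy and associations supply context.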
Another slick webinar focused on semantic enrichment with an artfully designed but effective presentation. As jaded as I have become, I was duly impressed by the appealing motifs, the jazzy colors, the graphics in motion, and the requisite buzzwords in the opening. This is the part you show to the CIO, CTO, etc., the one with final budget authority. We are still talking about semantically enriching content with metadata from a domain-specific taxonomy. You say, “This is just what I need!”
Several modules were described. One extracts words and phrases as key topics for a taxonomy-ish product, called by a name not found in the ANSI/NISO Z39.19 standard for taxonomy construction. Another is for human taxonomy building from scratch, if the ready-built domain taxonomies are not a good fit. Others serve categorizing/indexing/tagging/annotating content (choose your favorite expression), also known as applying taxonomy terms as metadata or … semantically enriching the content.
I must admit I was impressed, but not snowed. I’m an editor, not in marketing or in an art department. I knew what this was all about because this is basic taxonomy and indexing work that I do daily, using software that delivers these functions. I know that slick is cool to look at, but it comes at a price.
“Gee, thanks for the spin in your Ferrari, but I was hoping for a Chevy pricetag and Honda/Volvo/Subaru reliability.”
I also know that essential functions are available in products much more accessible to organizations on a budget.
If you are interested in software for taxonomy creation, management, and application, don’t get snowed by the buzzwords and bling. Know the basics of taxonomy construction and implementation, and use that knowledge as a starting point when comparing software. Know the functions you need to perform and avoid slick but unnecessary frills, as alluring as they may be. Know if the product will work with other systems and whether you’ll need a high-priced mechanic or an editor to do the work. When you hear about trends, consider established history and experience. Data Harmony software from Access Innovations was developed for a demanding production setting, starting in 1995 and continually improving over the years. It may be eclipsed on the slick presentation front, but the software has proven it’s up to the job. Just ask any of our satisfied customers. Contact us; we’d be happy to give you the list.
Alice Redmond-Neal, Senior Taxonomist
If you look at a branch of a typical deciduous tree, you can see that it looks like a smaller tree. In turn, that branch splits off into smaller branches that look like even smaller trees.
This characteristic of trees is an example of what mathematicians, biologists, and systems scientists call self-similarity. Self-similar systems repeat their basic geometry at smaller and smaller scales, creating multiple miniatures of themselves at different scales. In general, natural and mathematical systems in which self-similarity results in complex and detailed patterns are referred to as fractal systems.
Many natural phenomena are or can be fractal:
Photo of a 12-sided snowflake by Becky Ramotowski, www.srh.noaa.gov/abq/?n=features_snowflake
Painting by Katsushika Hokusai, www.katsushikahokusai.org/Mount-Fuji-Seen-Below-a-Wave-at-Kanagawa.html /CC BY-NC-ND 3.0
and even broccoli.
Photo by Jon Sullivan, en.wikipedia.org/wiki/Romanesco_broccoli#mediaviewer/File:Fractal_Broccoli.jpg
Trees are loosely fractal. While the trunks don’t keep replicating, the branches do. As the Fractal Explorer observes:
If you don’t know anything about fractals a tree might seem as a very random object. No patterns, no rules. But if you know something about fractals and look closer you can see that basically a tree is a trunk with trees on it. That is a basic pattern that every tree follows.
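“A trunk with trees on it” is, in effect, a recursive definition, and a few lines of toy code make the self-similarity concrete (the depth, length ratio, and fanout here are arbitrary illustrative numbers):

```python
def tree(depth: int, length: float, ratio: float = 0.6, fanout: int = 2) -> list:
    """A tree is a trunk plus smaller trees: return all branch lengths."""
    if depth == 0:
        return []
    # the trunk at this level, then the same pattern repeated, smaller
    return [length] + [
        b for _ in range(fanout) for b in tree(depth - 1, length * ratio, ratio, fanout)
    ]

branches = tree(depth=4, length=1.0)
print(len(branches))  # 15 branches: 1 + 2 + 4 + 8
```

The function calls itself on smaller copies of the same problem, which is exactly the basic pattern the Fractal Explorer describes.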
Taxonomies are often described as taxonomic trees, or as having a tree-like structure. To carry the analogy further, we often refer to the progressively more specific and more numerous hierarchical subdivisions in a taxonomy as branches. The overall domain of a taxonomy, while sometimes referred to as its root, might also be viewed as its trunk.
This raises the question: Are taxonomies fractal? As it turns out, several authors have written articles on the fractal nature of biological genus-and-species taxonomies. These articles discuss the branching characteristics of these taxonomies, the same branching characteristics that we see in taxonomies outside the realm of biological species categorization. They also discuss the mathematical tendencies of the proportions of the various branches, tendencies that could perhaps be a natural result of the degree to which things in a group need to be different before we find it appropriate to give them different names.
In recent years, interdisciplinary scientists such as Christophe Eloy have been studying the natural forces that make trees grow the way they do, and how their growth patterns might make them resilient in windstorms. Interestingly enough, these scientists have been inspired, in part, by an observation that another person with an interdisciplinary approach, Leonardo da Vinci, made 500 years ago.
As Joe Palca explains in “The Wisdom Of Trees (Leonardo Da Vinci Knew It)”:
Leonardo noticed that when trees branch, smaller branches have a precise, mathematical relationship to the branch from which they sprang. Many people have verified Leonardo’s rule, as it’s known, but no one had a good explanation for it. …
Leonardo’s rule is fairly simple, but stating it mathematically is a bit, well, complicated. Eloy did his best:
“When a mother branch branches in two daughter branches, the diameters are such that the surface areas of the two daughter branches, when they sum up, is equal to the area of the mother branch.”
Translation: The surface areas of the two daughter branches add up to the surface area of the mother branch.
Here’s another explanation, from Esther Inglis-Arkell’s article “Scientists Still Puzzled by a Fractal Discovered 500 Years Ago”, that might be more intuitive:
Strip the leaves off of the average tree, soak the whole thing in water until it gets mushy, bundle the branches up together, and you’ll get what looks like one long trunk. That’s what Leonardo Da Vinci said in the fifteen hundreds. If a tree trunk splits off into three main branches, each of the branches will be one third the size of the trunk. When each of those branches splits into three again, making nine branches on the second ‘tier’ of the tree, each of these second tier branches will be one ninth the size of the trunk. As the branches grow and split, they will always be a particular fraction of the size of the trunk, and adding together all the fractional bits of each ‘tier’ of branches will always add up to ‘one trunk.’ This isn’t the case in all trees, but the majority hold to this pattern.
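Reading “size” as cross-sectional area, the arithmetic in that description is easy to check: if each branch of diameter d splits into n equal daughters, Leonardo’s rule gives each daughter a diameter of d divided by the square root of n, so every tier of the tree sums back to the area of the trunk. A quick check with purely illustrative numbers:

```python
import math

def tier_areas(trunk_diameter: float, fanout: int, tiers: int) -> list:
    """Total cross-sectional area of each tier under Leonardo's rule."""
    totals = []
    d = trunk_diameter
    count = 1
    for _ in range(tiers):
        area = math.pi * (d / 2) ** 2
        totals.append(count * area)
        # Leonardo's rule: the n daughter areas sum to the mother's area,
        # so each daughter diameter is d / sqrt(n)
        d /= math.sqrt(fanout)
        count *= fanout
    return totals

totals = tier_areas(trunk_diameter=1.0, fanout=3, tiers=4)
print(totals)  # every tier has the same total area as the trunk
```

With a fanout of three, tier two has nine branches of one-ninth the trunk’s area each, just as the quoted passage says.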
Can we gain a new perspective on taxonomies from all this? I think the lesson might have to do with scope, specificity, and detail. According to da Vinci’s observation, tree branches uniformly become ever thinner until they taper off, yet their total bulk at most levels of the tree will be approximately the same. So, in a taxonomy that grows naturally, we might expect that the terms at any given depth might be at approximately the same level of specificity. At the same time, their individual scopes at any given depth will add up to a sum total that will ideally (I think) cover the same scope as the top level of terms. As with tree branches tapering off, though, this will be less true as the taxonomy branches naturally taper off and end at the most specific levels.
Inglis-Arkell sums up with some interesting observations about the beauty of branches:
This pattern of growth has a mathematical, as well as physical, beauty. Trees are natural fractals, patterns that repeat smaller and smaller copies of themselves. Each tree branch, from the trunk to the tips, is a copy of the one that came before it. Branches split off from the highest tip the same way they do from the trunk, and each set of branches splits off at the same angle to each other. Physics, math, and biology come together to create the simplest and most efficient growth pattern. It just took Leonardo Da Vinci to first notice it, the big show-off.
Barbara Gilles, Taxonomist