Many institutions and organizations – notably (but not limited to) publishers – have large, or sometimes very, very large, lists of names. These names come from member directories, employee and staff records, client and customer files, marketing, development, and many other sources; indeed, the lists from various departments in the same organization are often not connected or resolved with one another in any way.
This growing problem has given rise to a sub-field in the information/data industry variously called “named entity disambiguation” or “author disambiguation” or “name disambiguation”, among other monikers. In the academic publishing space, disambiguation of author names is a common challenge.
In a nutshell: given a list of names (let’s say, oh, 3.2 million names), how do we determine which ones are the same person and which are not? We might proceed as follows:
The goal is, as automatically as possible, to sort out which of these records should be merged. Once accomplished, you (a publisher) could make a webpage for each author listing all publications and so forth for your users to browse.
Clearly, some of the names above are potentially the same person, while others are not. For example, B. Caldwell Smith, B.C. Smith, Brandon C. Smith, and Brandon Caldwell Smith look like they might be the same person. To find out without looking at every name and every article (3.2 million, remember?) we need more information.
To accomplish this task, metadata associated with each author is examined and compared to try to eliminate duplicates. For example, from each article we can associate an author with their co-authors, the institution with which they were affiliated when the paper was published, email addresses, dates of publication, and so forth.
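To make the comparison step concrete, here is a minimal Python sketch of two checks of this kind: whether two name strings could belong to the same person, and what metadata two records share. It is only an illustration under stated assumptions, not the process used in production. The function names, the record layout, and the co-author names in the sample records are all hypothetical.

```python
import re

def name_parts(name):
    """Lowercase alphabetic tokens, so 'B.C. Smith' -> ['b', 'c', 'smith'].
    Assumption: ASCII names only; real data needs Unicode-aware handling."""
    return [t.lower() for t in re.findall(r"[A-Za-z]+", name)]

def tokens_match(x, y):
    """An initial matches any name starting with that letter;
    two full names must be identical."""
    if len(x) == 1 or len(y) == 1:
        return x[0] == y[0]
    return x == y

def names_compatible(a, b):
    """True when the surnames agree and the given names do not contradict."""
    pa, pb = name_parts(a), name_parts(b)
    if not pa or not pb or pa[-1] != pb[-1]:
        return False
    # zip() stops at the shorter list, so 'B. Smith' is compatible with
    # 'Brandon Caldwell Smith'.
    return all(tokens_match(x, y) for x, y in zip(pa[:-1], pb[:-1]))

def shared_evidence(rec_a, rec_b):
    """List the kinds of metadata that two records have in common."""
    evidence = []
    if rec_a["coauthors"] & rec_b["coauthors"]:
        evidence.append("coauthor")
    if rec_a["institution"] == rec_b["institution"]:
        evidence.append("institution")
    if rec_a["email"] and rec_a["email"] == rec_b["email"]:
        evidence.append("email")
    return evidence

# Hypothetical records; the co-author names are invented for illustration.
rec_a = {"name": "B.C. Smith", "institution": "Harvard",
         "coauthors": {"J. Doe"}, "email": None}
rec_b = {"name": "Brandon Caldwell Smith", "institution": "Yale",
         "coauthors": {"J. Doe", "A. Roe"}, "email": None}

if names_compatible(rec_a["name"], rec_b["name"]):
    print(shared_evidence(rec_a, rec_b))
```

A check like this only narrows the field. It treats a one-letter difference in a given name as a contradiction, so it will not catch typos, and compatible names with shared metadata are still only candidates for merging.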
Well, some things are clearer, but some are not. Whereas before we may have suspected that Rodger Smith and Roger Smith were different people, they published at the same institution in the same year; maybe it’s just a typo? And maybe Brandon C. Caldwell moved from Harvard to Yale (not unheard of) sometime between 1961 and 1972?
At Access Innovations we’ve been developing a way to add some certainty to the process using semantic metadata—it’s not a silver bullet, but it is a bigger gun. We call the process “semantic fingerprinting” and it’s based on our thesaurus and indexing technology.
Every author’s works (papers, conference proceedings, editorial roles) associate them with one or more pieces of content, and for each piece of content we have indexing terms from a thesaurus particular to that client. By associating the author directly with the indexing terms, we develop a semantic profile (or “fingerprint”) for each one. Since each author usually authors multiple papers (see “Lotka’s Law”), we compile the subject terms from each paper to make a more complete profile; obviously the more papers we have, the more accurate these profiles are.
Returning to our example:
What we suspected to be perhaps one person based on our best information turns out pretty obviously to be two distinct researchers once the areas of expertise are added to the equation.
While the process is far from foolproof, it does help to automate the disambiguation process, which cuts down on the number of human hours required to review the work.
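One way to picture a fingerprint is as a tally of indexing terms across an author’s papers. The Python sketch below builds such tallies and compares them. Cosine similarity is used here as an assumption, chosen because it is a common way to compare term profiles; nothing above says which measure the actual process uses. The indexing terms and function names are likewise made up for illustration.

```python
from collections import Counter
from math import sqrt

def fingerprint(papers):
    """Tally the indexing terms across all of one author's papers.
    `papers` is a list of term lists, one list per paper."""
    profile = Counter()
    for terms in papers:
        profile.update(terms)
    return profile

def cosine_similarity(fp_a, fp_b):
    """Return 0.0 for profiles with no terms in common and a value
    close to 1.0 for profiles with nearly the same mix of terms."""
    dot = sum(fp_a[t] * fp_b[t] for t in fp_a.keys() & fp_b.keys())
    norm_a = sqrt(sum(v * v for v in fp_a.values()))
    norm_b = sqrt(sum(v * v for v in fp_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical indexing terms for three author records.
physicist = fingerprint([
    ["Plasma physics", "Tokamaks"],
    ["Plasma physics", "Magnetic confinement"],
])
historian = fingerprint([
    ["Medieval history", "Manuscripts"],
    ["Manuscripts", "Paleography"],
])
unknown = fingerprint([["Tokamaks", "Plasma physics"]])

print(cosine_similarity(unknown, physicist))
print(cosine_similarity(unknown, historian))
```

Any threshold for deciding that two profiles are “the same person” would have to be tuned against reviewed examples, which is where the remaining human hours go.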
The concept of the “semantic fingerprint” can be applied to a paper, a school, an editor, or any other entity for which subject metadata is available. So this same basic process can be used for other purposes; for example, to:
- Disambiguate institution names
- Match articles to peer reviewers or editors
- Demonstrate what areas of research are exploding at:
  - A journal
  - A college
  - A research laboratory
As datasets get cleaner and cleaner the accuracy of, and uses for, semantic technologies—such as Access Innovations’ Semantic Fingerprinting techniques—will continue to increase.
Bob Kasenchak, Project Coordinator
Semantic Fingerprinting image © Access Innovations, Inc.
When you use a thesaurus for indexing context covering multiple disciplines, the need for disambiguation of terms is increased. This fact of thesaurus life was well illustrated in a presentation at this year’s DHUG (Data Harmony Users Group) meeting. The presentation, by Rachel Drysdale, Taxonomy Manager of the Public Library of Science (PLOS), was titled “The PLOS Thesaurus: the first year.”
While Rachel discussed a variety of aspects of thesaurus implementation and maintenance, what caught my interest and sympathy as a fellow taxonomist was her description of what she called “taxonomy funnies.” Anyone who has been a taxonomist for any length of time has run into such funnies: problems that are chuckle-worthy but still need to be dealt with.
In the talk, Rachel discussed the refinement of indexing rules. PLOS maintains its thesaurus in a Data Harmony software application, MAIstro, which integrates a taxonomy management tool, Thesaurus Master, with M.A.I., an indexing application in which a “rule base” of indexing rules is maintained. In MAIstro, when a term is added to a thesaurus, a simple identity rule is automatically created in the associated rule base. So when the Animals branch was being developed, the addition of “Pumas” caused the creation of a rule that looked like this:
Text to Match [in the text being read and parsed by M.A.I.]: pumas
USE [Indexing term] Pumas
M.A.I. also recognizes singular and plural variants. In the absence of any rule or condition to the contrary, the rule above would cause the automatic assignment, or suggestion to a human editor, of the indexing term “Pumas” when coming across the text string “puma”.
PLOS content has good coverage of zoological topics, but is also especially heavy on molecular biology, particularly genetics. The PLOS wordsmiths were mystified when they found that multitudes of genetics articles were being indexed with the term “Pumas”. True, there might have been a sprinkling of articles about wild feline genetics, but this would not account for the number of articles that boasted the “Pumas” descriptor.
The taxonomists at PLOS looked at the articles in question and found the culprit. “PUMA” was appearing in those articles, as an acronym for a gene whose full name is “p53 upregulated modulator of apoptosis.” (I can’t blame the geneticists for using an acronym for that one. The full name isn’t very conversation friendly.) And it’s not specific to pumas; humans have it, and so do such diverse creatures as fish and frogs. So the PLOS taxonomists modified the indexing rule, adding conditions that required at least one other word or phrase having to do with the world of wild feline creatures to be present before “Pumas” could be assigned or suggested. The addition of a few synonyms and quasi-synonyms for pumas made the rule richer and better able to disambiguate pumas from PUMAs. The rule ended up looking like this:
Text to Match: pumas
IF (MENTIONS “feline*” OR MENTIONS “jaguar*” … OR MENTIONS “panther*” OR MENTIONS “cougar*” OR MENTIONS “catamount*” …)
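For readers who think in code, the logic of a conditional rule like this can be sketched in a few lines of Python. This is an illustration only: it is not M.A.I. rule syntax, the function name is made up, and the context list contains only the words quoted in the rule above (the elided entries are left out).

```python
import re

# Context stems taken from the rule as quoted; the elided ones are omitted.
CONTEXT_STEMS = ["feline", "jaguar", "panther", "cougar", "catamount"]

def suggest_terms(text):
    """Suggest 'Pumas' only when 'puma' or 'pumas' appears together with
    at least one context word from the world of wild cats."""
    lowered = text.lower()
    terms = []
    if re.search(r"\bpumas?\b", lowered):
        # Each stem also matches longer forms, mirroring the trailing '*'.
        has_context = any(
            re.search(r"\b" + stem + r"\w*", lowered) for stem in CONTEXT_STEMS
        )
        if has_context:
            terms.append("Pumas")
    return terms

gene_text = "PUMA, the p53 upregulated modulator of apoptosis, was expressed in frog cells."
cat_text = "Pumas and other felines were tracked across the mountain range."

print(suggest_terms(gene_text))
print(suggest_terms(cat_text))
```

Because the sketch lowercases everything before matching, it cannot tell an acronym in capitals from an ordinary word. It relies entirely on the surrounding context words.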
The next indexing run was much better. Alas, there were still some articles inappropriately indexed with “Pumas”. What was wrong? The PLOS editors did some more detective work.
It turned out that some of the problem articles were about the toxoplasma parasite, which has many variant strains and is found in a wide variety of organisms, including people, frogs, and cats. One of those variant strains is known as COUGAR. A conceptual relationship with actual cougar critters does exist; the variant was first discovered in a group of Canadian cougars. That’s rather tangential, though. The toxoplasma articles in question aren’t really about cougars. The problem was that as far as animals (and the PLOS rule base) were concerned, “Cougars” is a synonym of “Pumas”. So when the indexing system read “COUGAR” in the text, “Pumas” got popped onto the list of subject terms for each of those toxoplasma articles.
The next critter slithering amok through the PLOS records was the snail. What would make snails unruly? The real culprit is once again a gene in disguise, in this case SNAI1, naturally referred to frequently as SNAIL. Once such a culprit is properly identified, it’s a straightforward matter to modify the rule so that it prevents the wrong term from being suggested or assigned, by considering likely contexts and reflecting those in the rule conditions. One bonus of the situation is that the same rule can be further modified to enable indexing of the formerly problematic document with a more appropriate term.
There’s no reason to be afraid of the wild animals in your thesaurus, as long as you stay alert for them. You can tame the mighty mountain lion and the slithery snail.
Barbara Gilles, Taxonomist
“Criteria for inclusion vary, but all companies have things in common. Access Innovations, Inc. has proven to define the spirit of practical innovation by blending sparkling technology with a deep, fundamental commitment to customer success,” says Hugh McKellar, KMWorld Editor-in-Chief.
Marjorie M. K. Hlava, president of Access Innovations, says she is honored by her company’s accolade. “Access Innovations prides itself on pushing the edges of technology to meet the needs of the next generation of knowledge management,” she says. “It’s challenging and rewarding to be at the cutting edge of knowledge management, and it’s delightful to be recognized as a leader in the field, making content findable for our customers and their users.”
The Top 100 Companies That Matter list is compiled annually by editorial colleagues, analysts, theorists and practitioners. Unlike many other trade lists, inclusion is not purchased and is at the sole discretion of KMWorld’s editors.
For a full list of the Top 100 Companies That Matter in Knowledge Management, pick up the March issue of KMWorld, which is available on newsstands now, or view the online article.
About Access Innovations, Inc.
Founded in 1978, Access Innovations has extensive experience with Internet technology applications, master data management, database creation, thesaurus/taxonomy creation, and semantic integration. Access Innovations’ Data Harmony software includes automatic indexing, thesaurus management, an XML Intranet System (XIS), and metadata extraction for content creation developed to meet production environment needs. Data Harmony is used by publishers, governments, and corporate clients throughout the world.
The leading information provider serving the knowledge, document, and content management systems market, KMWorld informs more than 45,000 subscribers about the components and processes—and subsequent success stories—that together offer solutions for improving business performance. KMWorld is a publishing unit of Information Today, Inc.
After critical business information has been identified at a high level and a focal point has been assigned, best practices from complementary disciplines can be incorporated.
Identify the main subjects for a business-specific controlled vocabulary
Each company or organization develops its own language for talking about what it does. Like all languages, organizational languages are based on a common way of seeing and thinking. A technology or farm machinery company may use alphanumeric designations to identify thousands of products. An entertainment firm may use cryptic acronyms in discussing thousands of events or programs. An agricultural organization may talk about plans and events as they relate to “the harvest.” Even within one company, the language varies between departments. A finance department is likely to have a language that is different from the language used in research or operations departments.
Controlled vocabularies provide the key for translating organizational language between departments, between new and experienced employees, and between internal and external stakeholders. Controlled vocabularies also provide the basis for consistent analysis, visualization, and reporting, as well as effective search, retrieval, and distribution. Maintaining effective, business-specific controlled vocabularies provides a competitive advantage. They can also provide operational advantages by supporting the translation of business concepts into rapidly evolving IT technical concepts.
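At its simplest, the translation role of a controlled vocabulary is a lookup from each department’s wording to one agreed, preferred term. The Python sketch below shows that idea and nothing more. The entry terms are hypothetical (only “the harvest” comes from the example above), the function names are made up, and a real vocabulary would also carry relationships and cross-references.

```python
# Entry term (lowercased) -> preferred term. All entries are hypothetical.
USE = {
    "harvest": "Harvest",
    "the harvest": "Harvest",
    "crop intake": "Harvest",      # e.g. wording used in operations
    "seasonal yield": "Harvest",   # e.g. wording used in finance
}

def preferred_term(term):
    """Return the agreed term for a departmental variant, or None if
    the vocabulary does not cover it yet."""
    return USE.get(term.strip().lower())

def same_concept(term_a, term_b):
    """True when two variants resolve to the same preferred term."""
    a, b = preferred_term(term_a), preferred_term(term_b)
    return a is not None and a == b

print(preferred_term(" Crop Intake "))
print(same_concept("seasonal yield", "the harvest"))
print(preferred_term("widget"))
```

Terms that come back as `None` are useful in their own right: they are candidates for review when the vocabulary is next updated.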
Creating and maintaining controlled vocabularies, including relationships and cross-references, has been a best practice in library science, information science, and records management for a very long time. Over time, effective principles, practices, and standards have been developed for them, but currently marketed tools do not always use them.
At the beginning, it is important to identify the main business subject areas that might benefit from a controlled vocabulary. They may be specific to one or more industries, to a discipline, or to a technique. Existing vocabularies and standards can then be identified as building blocks or goals for future cooperative efforts. Industry and subject vocabularies can usually be found through associations, through research, or through vocabulary lists such as Access Innovations’ TaxoBank. General standards for creating and maintaining vocabularies can be found through ANSI and ISO and apply more generally than technology-specific standards.
Once pertinent subjects, vocabularies and standards are identified, basic policies should be established regarding their use and upkeep. Like all languages, organizational languages change and evolve with use. Because cultural and business environments are rapidly evolving, vocabulary policies need to support rapid innovation and creativity.
Define the general types of needed metadata
The metadata needed to identify and track business critical information and data is specific to an organization and is a corporate asset. Defining and maintaining it in a consistent, reliable, useful form is essential, even when a tool provides “out-of-the-box (OOB)” implementations or automated discovery. At a minimum, tools must be configured, and business vocabularies, codes, users, and processes must be retrofitted to the tool and then maintained. Ongoing tool success and return on investment require significant effort and investment in the definition of policies, standards, vocabularies, processes, and procedures. Usually this involves changes in work, roles, and responsibilities, for which planning and ongoing management are essential and need to be added to tool costs.
As with vocabulary creation and maintenance, many effective principles, practices, and standards have been developed over time for metadata definition and maintenance, but tools do not always use them.
At the beginning, it is important to identify the main areas of business concern and vulnerability, such as regulatory compliance, product liability, cross-departmental standardization and communication, fulfillment of marketing strategies, or a variety of business specific customer and product-related issues. Each of these areas requires specific tracking techniques and processes that dictate specific metadata. ANSI, ISO, and technology specific standards, such as those designed for the Internet, may be applicable to a business. Determining which standards are applicable will require research.
Governance Level Understanding of Information and Data Needs
Developing a governance level understanding of information and data needs, consisting of the four steps outlined in this and a previous blog posting in this series, can be handled as a set of time-bounded projects. This high level understanding will be invaluable in providing a business-oriented basis for prioritizing and managing additional work, scoping and justifying the creation of an information and data governance program, and evaluating and efficiently implementing cost-effective new technologies.
Watch future blog postings for more on this subject.
Judith Gerber (guest blogger), JGG Enterprises
Sponsored by Access Innovations, the world leader in taxonomies, metadata, and semantic enrichment to make your content findable.
Marjorie M.K. Hlava Selected to Receive Prestigious Miles Conrad Award from the National Federation of Advanced Information Services (NFAIS™)
Marjorie (Margie) M.K. Hlava, president of Albuquerque-based Access Innovations, Inc., has been selected to receive the prestigious Miles Conrad Award from the National Federation of Advanced Information Services (NFAIS). The award will be presented at the upcoming 2014 NFAIS Annual Conference in Philadelphia, PA, February 23-25, 2014. In keeping with longstanding tradition, Marjorie will present a 45-minute lecture with her perspective on the information industry during the NFAIS Annual Conference.
“The objective of the Miles Conrad Memorial Lecture, established in 1965 in commemoration of NFAIS founder G. Miles Conrad, is to recognize and honor those members of the information community who have made significant contributions to the field of information science and to NFAIS itself,” said Marcie Granahan, Executive Director of NFAIS. “The lecture is presented every year at the organization’s annual conference by an outstanding person on a suitable topic in the field of abstracting and indexing, but above the level of any individual service.”
“Margie Hlava is a well-known and well-respected information industry pioneer,” said NFAIS President Suzanne BeDell. “She has worked behind the scenes for most of the major information organizations, including many NFAIS member organizations. Margie believes that you learn as much as you receive by being active in professional organizations, and she has been intimately involved in the standards process for much of her career. She served for seven years on the NISO board and was personally involved in the development of many NISO standards. She chaired the Special Libraries Association’s (SLA) Standards Committee for nine years, has chaired the NFAIS Standards Committee since 2001, and is currently a member of the NISO Content and Collection Management Topic Committee. Margie is one of my predecessors, having been NFAIS President from 2003 to 2004, as well as President of other organizations such as the American Society for Information Science and Technology and Documentation Abstracts. She has served on the Board of SLA twice and currently serves on several boards, including those for the ASIS&T Bulletin of which she is Chair, Information Systems and Use, Places and Spaces, University of North Carolina SILS, and the SLA Taxonomy Division, of which she is the founding chair. Margie also is a volunteer outside the information industry, serving on the boards of the New Mexico Information Commons, the Hubbell House Alliance, New Mexico Data Stream, and the Hubbell Family Historical Society. The NFAIS Board is delighted to confer our organization’s highest honor upon her.”
“Previous recipients are people I have long admired and looked up to as luminaries in our field,” said Marjorie Hlava when she was notified of the award. “I am truly honored to be among them.”
Margie was educated as a botanist and trained by NASA as an information engineer, a position she worked in for five years. She was a beta tester on the NASA Recon, Dialog, and other early online host systems such as BRS and SDC. She also worked for the Department of Energy National Energy Information Center and its affiliate NEICA, where she rose to the position of Information Director before taking her team private as Access Innovations, Inc. in 1978.
Margie’s abiding research interests center on speeding the human processes in knowledge management through productivity enhancements. She has developed the Data Harmony software suite specifically to increase accuracy and consistency while streamlining the clerical aspects in editorial and indexing tasks. The most recent innovation is applying those systems to medical records for medical claims compliance in a new division, Access Integrity.
Margie’s work has been acknowledged through numerous awards throughout her career, including ASIS&T’s Watson Davis award, and recognition both as an SLA Fellow and as a Woman of Influence for Technology. She is the author of two books and over 200 articles. She holds two U.S. patents encompassing 21 patent claims. She has no intention of resting on her laurels, but plans to continue her adventures in information science and explore the boundaries of new technology and methodologies. A complete list of prior Miles Conrad Award winners can be found on the NFAIS website.
About Access Innovations, Inc. – www.accessinn.com, www.dataharmony.com, www.taxodiary.com
About NFAIS – www.nfais.org Founded in 1958, NFAIS is a membership organization of more than 55 of the world’s leading producers of databases and related information services, information technology, and library services in the sciences, engineering, social sciences, business, and the arts and humanities. For more information on NFAIS and its member organizations, on NFAIS Annual Conferences and meetings or the Miles Conrad Memorial Lecture series, contact Jill O’Neill, Director of Communication and Planning (email@example.com or 215-893-1561), or visit the NFAIS website.
The link between business and information technology is the data, information, and process assets that are stored and automated through technical tools. This blog suggests the first steps toward governing and managing these important assets before tool implementation, helping to avoid the too common “graveyards” of expensive, underused tools.
Identify business critical information and data
In order to get past the confusion of rapidly evolving types, formats, risks, and tools, first identify the most important information and data assets for your organization and start treating them like assets. These assets may already be known but not documented, or identifying them may require chartering and funding a project. Critical information and data assets vary widely across organizations and departments. They need to be based on the core products, expertise, and risks of an organization, which also may need to be identified. For example, data from production machinery and its interpretation could be an unrecognized competitive asset. In other cases, information and data may not yet be regarded as critical assets, but regulatory scrutiny may be about to change that perception.
The identification and listing of important information and data assets should include brief descriptions, the most recent owner, and a relative value. This high level overview is intended to enable discussions about assets, prioritization of work and investments, and the creation of general policies. It should not be confused with the detailed, time-consuming asset management inventories for which records managers and librarians are trained. It is, however, the first step toward governance and “thoughtful localization and organization,” proven techniques which can later be the basis for advanced management techniques such as developing and using metadata, taxonomies, and controlled vocabularies. The overview can employ simple, existing tools such as a spreadsheet or database that can aid in analysis and produce reports.
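As a sketch of how small this overview can be, the Python below models one row per asset with the fields named above (description, most recent owner, relative value) and writes the list out as CSV so it can be opened in a spreadsheet. The class name, the 1-to-5 value scale, and the sample rows are assumptions made for illustration. Only the production machinery example comes from the text.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields

@dataclass
class InformationAsset:
    name: str
    description: str
    owner: str            # most recent owner
    relative_value: int   # hypothetical scale: 1 (low) to 5 (high)

def prioritized(assets):
    """Highest relative value first, to support prioritization discussions."""
    return sorted(assets, key=lambda a: a.relative_value, reverse=True)

def to_csv(assets):
    """Render the list as CSV text with a header row."""
    buffer = io.StringIO()
    columns = [f.name for f in fields(InformationAsset)]
    writer = csv.DictWriter(buffer, fieldnames=columns)
    writer.writeheader()
    for asset in assets:
        writer.writerow(asdict(asset))
    return buffer.getvalue()

# Hypothetical sample rows.
ASSETS = [
    InformationAsset("Customer contracts", "Signed agreements and amendments",
                     "Legal", 4),
    InformationAsset("Production machinery data",
                     "Sensor output and its interpretation", "Operations", 5),
    InformationAsset("Cafeteria menus", "Weekly menus", "Facilities", 1),
]

print(to_csv(prioritized(ASSETS)))
```

A plain spreadsheet does the same job. The point is that the overview needs only a handful of columns, not a detailed inventory.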
The first goal is to initiate discussions about how information and data assets support organizational strategies and to determine what governance and management programs are needed. Governance is the exercise of control over multiple operations through accountability frameworks and priorities. It may take some time to build out all the needed policies and measurements regarding decision rights, alignment, and communication, but the discussions will get the work started. Management, which is the exercise of control over day-to-day operations, decisions, work, people, or things, will come later and will comply with governance policies.
Assign an Information and Data Governance Focal Point
Responsibility for information and data governance needs to be assigned if progress is expected, even if the organization is not ready to fund a full-scale program. A part-time person can be responsible for the information and data asset list, act as an authenticating gatekeeper for changes, and make sure that it is discussed at appropriate high-level meetings. With a little bit of additional time the assignee could set up and publicize a mail box or shared site for collecting issues, ideas, and needs, compile them, and recommend projects that are worthy of investment.
The steps above are the beginning and will help to determine where effort, investment, and tools can be justified and what should be accomplished. Much additional work is needed to realize more significant competitive advantages, provide complete functional requirements for tools, and meet regulatory requirements.
Keeping in mind the principle that information is best understood and used by its primary users, governance, standardization, normalization, and coordination may be needed across departments to achieve strategic quality and integrity goals. In addition, specific, detailed, ongoing programs and organizations may be needed for information and data management, funded to evolve, grow, and change as uses, formats, and values fluctuate with business and regulatory changes.
Incrementally, over time, or as the result of concentrated, planned projects, a deeper understanding of needs can be achieved, and more advanced management techniques can be justified, funded, and adopted. Examples include more advanced techniques for asset valuation, cooperative metadata and vocabulary adoption and use, development of competitive information and data techniques, and strategic asset based service level agreements with vendors and operating level agreements with internal groups.
In most cases, over time, it will be beneficial to incorporate principles, standards, and best practices from a variety of complementary disciplines which have found successful ways to deal with the issues – records management, information science, library science, ISO, ANSI, related industries, project management, organizational change, and COBIT and ITIL frameworks for IT governance and management.
Watch future blog postings for more details on this subject.
Judith Gerber (guest blogger), JGG Enterprises
Heather Kotula, a long-time employee of Access Innovations, Inc., has recently been promoted to the position of DHUG (Data Harmony Users Group) Meeting and Marketing Coordinator. Heather is one of many faces of fresh, young talent at Access Innovations, and her promotion can only mean good things for the company.
Ms. Kotula started with Access Innovations in 1995, and has since filled many positions within the company and seen it grow over many years. She has worked in finance, as a project manager, office manager, and Vice President of Operations. Her versatility in her early years at Access Innovations gave her a strong background and knowledge of the company, and these have translated into new marketing ideas and skills that will propel the company forward.
Ms. Kotula has coordinated the past three DHUG meetings, and her new position within the company puts her at the forefront of new ideas for the annual DHUG gatherings. These meetings include two days of case studies and presentations and three days of software training. Access Innovations’ clients from around the world attend to present case studies, get training in the use of the Data Harmony software, network with other users, and become acquainted with the team at Access Innovations. As well as putting these meetings together, she handles and oversees the marketing endeavors at Access Innovations. She is always working to keep the company up to date, as well as providing ways to communicate the company’s products and services to the world to benefit others.
“The changes in the world of information over the past 20 years are astounding,” commented Ms. Kotula recently. “Since the founding of the company in 1978, Access Innovations has been preparing and waiting for the ‘Information Age’ to arrive. We have the best tools and an unparalleled breadth of experience in taking information from archive to actionable asset.”
Heather received her Bachelor’s degree in Distributed Foreign Languages from the University of New Mexico in 1991. Before she received her degree, Heather also attended the Goethe Institut in Munich and took German language classes there, as well as attending the Scuola Dante Alighieri in Florence, Italy where she received a Certificate of Fluency in Italian language. Heather received her Master’s degree in Business Administration from New Mexico State University in 1995.
As an IT process and governance consultant, I see a large number of software tools for managing knowledge, information, and data. The choices between vendors and in-house development options seem endless. Evaluation can be challenging because descriptions of purpose, use, and comparative approach are often not clear, standardized, or based on research.
Recently I heard a well-known software vendor describe one of its popular and respected products as being able to deliver far beyond its designed capabilities of content tracking and retrieval across multiple platforms. It was described as also able to “accelerate, automate, and maintain compliance with core business processes.” There was no mention of the significant management work required to make this happen. Most organizations are not even close to the needed level of defined processes, policies, measurements, and organizational roles.
I also recently heard a talented technical manager describe how he developed a taxonomy for his specialty, but he was unaware of existing taxonomy tools, standards, or related taxonomies with which his proprietary tool needs to interoperate for long-term success.
I am part of the IT community where management and historical perspectives are not always adequately evaluated before a technical solution is considered. Consequently, I have seen far too many graveyards of expensive tools that never met their potential and were discarded.
My heart goes out to those who feel overwhelmed by a need to do “something” with increasing numbers of content types, formats, devices, and security or litigation risks. Growing amounts of content, vaguely written regulations and laws, and nebulous but formidable concepts like “Big Data,” “The Cloud,” and “Dark Data” add further complexity. It is understandable why an easy one-tool solution is attractive. Nevertheless, it is not productive or necessary to keep buying more expensive tools that will only be discarded. There are better ways of addressing the problems.
Establish Governance and Management Processes First
Establishing basic governance and management processes for knowledge, information, and data is essential for informed decisions about tool purchase, configuration, coordination, and ongoing content viability and validity. The seemingly simpler choice of following what a tool vendor suggests for processes usually complicates the enterprise business environment. Designers of general and industry tools have no way of knowing specific business details or organization strengths, which are often competitive advantages.
Use What is Already Known
Having “too much information” is not new. Humans have survived by processing large amounts of information in the subconscious brain while concentrating the conscious mind on the most pressing external business. By around 2500 BC, libraries had begun as shared repositories, marking the beginnings of thought about managing “too much information.”
Marjorie M.K. Hlava, President, Access Innovations, states in a May 27, 2013, TaxoDiary blog post, “We librarians and information specialists get to view anarchy in the universe more often than other people do. And we are the ones who have the job of putting the universe into some sort of order. With a thousand points of knowledge…”
What has been learned by facing “a thousand points of knowledge” head-on applies to the terabytes, exabytes, zettabytes, and yottabytes we now face. They all seemed boundless when first encountered, but can be bounded with thoughtful localization and organization.
During my earlier career as a corporate librarian and records manager, I learned information science models that combined thousands of years of thought with current research and technology for addressing “too much information.” A few examples paired with initial action steps follow:
- Information is best understood and used by its primary users
Define core organizational products, areas of expertise, and risks. Focus and fund knowledge, information, and data work in these important areas.
- Information is a definable, manageable asset
Define the knowledge, information, and data assets needed to produce core products and maintain expertise, dividing the assets into manageable, coordinated groups.
- Managed metadata can describe information so that it can be found and/or linked
- Controlled, agreed vocabularies in areas of specialty can greatly enhance retrieval
Define and assign organizational entities and roles to make and maintain the asset definitions, decision rights, priorities, growth needs, agreed metadata schemas and vocabularies, measurements and reports.
- Information does not follow the rules of thermodynamics – it grows when it is used
Plan for growth, change, coordination and interoperability by using existing standards and making use of what is
Watch future blog postings for more details on this subject.
Judith Gerber (guest blogger)
Sponsored by Access Innovations, the world leader in taxonomies, metadata, and semantic enrichment to make your content findable.
The bell struck twelve.
The Phantom slowly, gravely, silently, approached. … It was shrouded in a deep black garment, which concealed its head, its face, its form, and left nothing of it visible save one outstretched hand. But for this it would have been difficult to detach its figure from the night, and separate it from the darkness by which it was surrounded.
(From A Christmas Carol in Prose; Being a Ghost Story of Christmas, by Charles Dickens.)
And so it is with emerging concepts, those concepts whose forms we can but vaguely discern at the present point in time, whose true reality lurks in the future.
As taxonomists, we have a responsibility to discern those future concepts, although they may still be invisible to most. We can save the various expressions of those concepts in search logs from being rejected from consideration for a vocabulary simply on account of their as yet infrequent appearance. In a taxonomy or thesaurus, we can provide labels that will consolidate the indexing for a concept whose researchers have not yet settled on a name. In some cases, especially with widely used vocabularies, we can perhaps determine the name by which a concept will be known on a standard basis.
This role in itself is one of the emerging responsibilities for taxonomists, thanks to the rapid advances in science and technology. In “What Next, Taxonomy?” (posted on The Taxonomy Blog on November 4, 2011), taxonomist Marlene Rockmore concludes that taxonomists need to deal with emerging technologies in a variety of ways, including collection of relevant content:
“So what next, taxonomy? What is nice to hear is that more taxonomists are surviving because their organizations understand their core roles. What’s the emerging topics and challenges – how to distribute and decentralize (localize) while having authority and control, how to collect new content on emerging, current topics, visualization, how to be more agile, how to fit in with new technologies like social media, mobile, and big data. Phew! That’s a challenge. Taxonomists have a chance to build relationships not only between terms, but with stakeholders on the way to a compelling, visualized, multidimensional content strategy. Good luck.”
This challenge has been growing in step with the rapid advances in science and technology. One example among the many advances in science is the ability of biologists to recognize new and emerging species, as well as life forms that have existed for a while but were formerly overlooked. The Live Science page Newfound Species observes:
“Science has identified some 2 million species of plants, animals and microbes on Earth, but scientists estimated there are millions more left to discover, and new species are constantly discovered and described. The most commonly discovered new species are typically insects, a type of animal with a high degree of biodiversity. Newly discovered mammal species are rare, but they do occur, typically in remote places that haven’t been well-studied previously. Some animals are found to be new species only when scientists peer at their genetic code, because they look outwardly similar to another species — these are called cryptic species. Some newfound species come from museum collections that haven’t been previously combed through and, of course, from fossils.”
Even the humble hosta has its own emergings, due in part to technological and social advances in communication.
“In past centuries, we used to talk about people “discovering” new species of plants. What this usually meant was that European, English or American plant explorers traveled to remote parts of the world and found plants that were new to them. Now, of course, we know that local people in those other parts of the world were often quite familiar with these plants all along. Many of the so-called new plants, including hostas, have been found in local paintings and documents produced long before the Westerners started poking around. In more recent times, however, with better communications, we more universally share the knowledge of different horticultural communities.”
As far as actually emerging species are concerned, evolutionary biologist Rob DeSalle of the American Museum of Natural History has indicated the continuing nature of species emergence:
“Identifying a new species as it emerges is the holy grail of evolutionary biology. … Species must be emerging someplace on earth. The best places to look would be places with lots of species, like rain forests, and islands, because isolation opens new niches.” (In “Q & A; Emerging Species” by C. Claiborne Ray, published June 17, 2003 in The New York Times)
The ScienceDaily website has a webpage dedicated to news about “new” species of plants and animals. While most of these will escape public awareness, Time Magazine has sifted through the barrage of information to identify the “Top 10 New Species” of 2013.
Speaking of top things of 2013, and moving on to emerging technologies, the Massachusetts Institute of Technology’s online Technology Review has published a list of “10 Breakthrough Technologies 2013”. A quantum internet that Los Alamos National Laboratory has been running, covered in the Technology Review’s “Best of 2013” (December 23, 2013), is one of many significant technologies that didn’t make the list, perhaps because the system has been running for the past two years.
The Wikipedia article “Emerging technologies” emphasizes the role of technology convergence in the emergence of new technologies. The article mentions an acronym of particular interest to those in the information technology world:
“NBIC, an acronym for Nanotechnology, Biotechnology, Information technology and Cognitive science, is currently the most popular term for emerging and converging technologies, and was introduced into public discourse through the publication of Converging Technologies for Improving Human Performance, a report sponsored in part by the U.S. National Science Foundation.”
Wikipedia also has a “List of emerging technologies” containing brief descriptions of “some of the most prominent ongoing developments, advances, and innovations in various fields of modern technology.” More than two hundred emerging technologies are listed.
There are and will continue to be many new and emerging concepts in science, technology, and other fields. Taxonomies can help define the terminology for those concepts. This is perhaps most readily evident for genus-species-subspecies-etc. names, whose designation is the territory of the biological taxonomist, or the biologist temporarily acting as taxonomist. Elsewhere, taxonomists can identify predominant labels and the occasionally used synonyms, and then use that information to add appropriate preferred terms and non-preferred synonyms to a vocabulary. They can also add definitions and scope notes. The skills of the taxonomist can bring clarity to formerly mysterious concepts and nomenclature.
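The bookkeeping of preferred terms, non-preferred synonyms, definitions, and scope notes can be sketched in a few lines of Python. This is only an illustration: the `Term` and `Vocabulary` classes and the sample entry are hypothetical and are not taken from any particular thesaurus tool.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Term:
    """One vocabulary entry: a preferred label plus its supporting notes."""
    preferred: str
    synonyms: List[str] = field(default_factory=list)  # non-preferred labels
    definition: str = ""
    scope_note: str = ""


class Vocabulary:
    """Hypothetical minimal vocabulary mapping any known label to its entry."""

    def __init__(self) -> None:
        self._by_label: Dict[str, Term] = {}

    def add(self, term: Term) -> None:
        # Register the preferred label and every synonym, case-insensitively.
        for label in [term.preferred, *term.synonyms]:
            self._by_label[label.casefold()] = term

    def resolve(self, label: str) -> Optional[str]:
        """Return the preferred term for a label, or None if it is unknown."""
        term = self._by_label.get(label.casefold())
        return term.preferred if term else None


vocab = Vocabulary()
vocab.add(Term(
    preferred="Quantum internet",
    synonyms=["quantum network", "quantum networking"],
    scope_note="Illustrative entry only; labels are assumed for this sketch.",
))

print(vocab.resolve("Quantum Network"))  # Quantum internet
print(vocab.resolve("hosta"))            # None
```

Because every label points to the same entry, a concept whose researchers have not yet settled on a name still gets indexed under a single preferred term, and the preferred label can be changed later without losing the variants.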
No fog, no mist; clear, bright, jovial, stirring, cold; cold, piping for the blood to dance to; Golden sunlight; Heavenly sky; sweet fresh air; merry bells. Oh, glorious! Glorious!
So don’t be scared of the ghosts of future concepts. Think of them as true spirits of the future, taking flight with the benefit of well-chosen terms and synonyms in a taxonomy or thesaurus.
Every time a new term rings true, an emerging concept gets its wings.
Barbara Gilles, Taxonomist
Much they saw, and far they went, and many homes they visited, but always with a happy end. The Spirit stood beside sick beds, and they were cheerful; on foreign lands, and they were close at home; by struggling men, and they were patient in their greater hope; by poverty, and it was rich.
(From A Christmas Carol in Prose; Being a Ghost Story of Christmas, by Charles Dickens. Illustrations by John Leech.)
Wouldn’t it be splendid if, in the ‘spirit’ of Dickens’ Ghost of Christmas Present, we could use taxonomies to accomplish the same things?
- Cure and eradicate sickness
- Promote international understanding
- Promote justice and social harmony
- Lessen and eradicate poverty
Admittedly, these are lofty goals. As it happens, taxonomies can help us accomplish these things. As taxonomist Alice Redmond-Neal has pointed out, “Verbalizing a concept identifies it, gives it substance, and makes it recognizable.” Taxonomies enable us to all agree on what we’re talking about, which can help us identify, quantify, and deal with problems.
Never underestimate the power of a taxonomy! Let’s take a short tour of taxonomies that reflect the spirit and intent of Dickens’ ghost.
“The Spirit stood beside sick beds, and they were cheerful”
The Public Library of Science (PLOS) has a large thesaurus reflecting the content of their digital library. (We at Access Innovations are very familiar with this thesaurus, as we helped develop it in its current form.) Most of PLOS’s journals focus on biological topics. Several of these journals present research related to disease control methods and eradication efforts:
- PLOS Medicine
- PLOS Pathogens
- PLOS Neglected Tropical Diseases
The last-named journal is especially noteworthy in that it offers a publication platform for researchers in third world countries who may have no opportunity for publication elsewhere. PLOS is probably the main publisher of articles on neglected (or at least previously neglected) tropical diseases.
Since the PLOS thesaurus was constructed to reflect the scope and depth of PLOS articles, the thesaurus covers hundreds of terms relevant to disease control methods and eradication efforts. The thesaurus serves as a basis for indexing the articles. As such, it guides searchers to information that can be used in current research, as well as information for healthcare providers and government officials to apply in disease control and eradication efforts.
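As a rough illustration of how a thesaurus can serve as a basis for indexing, the sketch below assigns a preferred term to a piece of text whenever the text mentions that term or one of its non-preferred synonyms. The term list and the `index_text` function are hypothetical. They are not drawn from the PLOS thesaurus, and real indexing systems typically rely on much richer rules than this plain phrase matching.

```python
import re
from typing import List

# Hypothetical mini-thesaurus: preferred term -> non-preferred synonyms.
THESAURUS = {
    "Dengue fever": ["breakbone fever"],
    "Vector control": ["mosquito control"],
    "Disease eradication": ["eradication campaign"],
}


def index_text(text: str) -> List[str]:
    """Return the sorted preferred terms whose labels appear in the text."""
    matched = set()
    for preferred, synonyms in THESAURUS.items():
        for label in [preferred, *synonyms]:
            # Whole-phrase, case-insensitive match on word boundaries.
            if re.search(rf"\b{re.escape(label)}\b", text, flags=re.IGNORECASE):
                matched.add(preferred)
                break
    return sorted(matched)


abstract = (
    "We evaluate mosquito control measures in districts where "
    "breakbone fever is endemic."
)
print(index_text(abstract))  # ['Dengue fever', 'Vector control']
```

A searcher who looks for the preferred term then finds the article even though its authors used only the synonyms.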
“The Spirit stood … on foreign lands, and they were close at home”
Probably the best-known organization concerned with international understanding and cooperation is the United Nations. It’s fitting that they have a thesaurus, and a multilingual one at that, in all the official languages of the United Nations: Arabic, Chinese, English, French, Russian and Spanish.
“The multilingual UNBIS Thesaurus, created by the Dag Hammarskjöld Library, United Nations Department of Public Information, contains the terminology used in subject analysis of documents and other materials relevant to United Nations programmes and activities. It is used as the subject authority of the United Nations Bibliographic Information System (UNBIS) and has been incorporated as the subject lexicon of the United Nations Official Document System. It is multidisciplinary in scope, reflecting the Organization’s wide-ranging concerns. The terms included are meant to reflect accurately, clearly, concisely and with a sufficient degree of specificity, matters of importance and interest to the United Nations.”
“The Spirit stood … by struggling men, and they were patient in their greater hope”
As HURIDOCS describes itself, it is “an international NGO [non-governmental organization] helping human rights organisations use information technologies and documentation methods to maximise the impact of their advocacy work.” Of potential interest to taxonomists, “HURIDOCS is also an informal, open and decentralised network of human rights organisations who wish to put together their experiences and creativity to develop common standards and tools for information management.”
One of those tools is a set of small thesauri, in a collection named “Micro-thesauri : a tool for documenting human rights violations”.
“This collection of 48 lists with terminology was developed by HURIDOCS or adapted from a variety of authoritative resources. The Micro-thesauri are intended for use in conjunction with HURIDOCS Standard Formats manuals, and in particular with the HURIDOCS Events Standard Formats: a tool for documenting human rights violations.
“The Micro-thesauri can be used as a starting point for developing one’s own index terms for libraries and documentation centres, as keywords for organising information on websites, or as controlled vocabularies for databases to record violations.
“They have been translated into the following languages, often by volunteers: English, French, Spanish, Arabic, Russian, Portuguese, and Bahasa Indonesia.”
“The Spirit stood … by poverty, and it was rich.”
The Oxford Poverty and Human Development Initiative (OPHI) is an economic research center of the Oxford Department of International Development, at the University of Oxford. The center’s goal is “to build and advance a more systematic methodological and economic framework for reducing multidimensional poverty, grounded in people’s experiences and values.” OPHI explains multidimensional poverty as follows:
“Most countries of the world define poverty by income. Yet poor people themselves define their poverty much more broadly, to include lack of education, health, housing, empowerment, humiliation, employment, personal security and more. No one indicator, such as income, is uniquely able to capture the multiple aspects that contribute to poverty.”
OPHI has identified various aspects of poverty, grouped into five “missing dimensions” of poverty “that deprived people cite as important in their experiences of poverty”:
- Quality of work
- Empowerment
- Physical safety
- Social connectedness
- Psychological wellbeing
While OPHI does not call their dimensional framework a taxonomy, it can certainly serve as one.
The Moral of This Posting
Use the power of the taxonomy! And as Obi-Wan Kenobi said in the movie Star Wars, “Use your power for good, not evil.”
And make sure your taxonomy gets used. Call attention to it, or to the search platform that it’s integrated with.
“Sometimes you have to…
SLAP them in the face just to get their attention.”
Carol Kane as The Ghost of Christmas Present, with Bill Murray, in the movie Scrooged (1988), written by Mitch Glazer and Michael O’Donoghue
Barbara Gilles, Taxonomist