Many institutions and organizations – notably (but not limited to) publishers – have large, sometimes very large, lists of names. These names come from member directories, employee and staff rosters, client and customer records, marketing, development, and many other sources; indeed, oftentimes the lists from various departments in the same organization are not connected or resolved with one another in any way.
This growing problem has given rise to a sub-field in the information/data industry variously called “named entity disambiguation” or “author disambiguation” or “name disambiguation”, among other monikers. In the academic publishing space, disambiguation of author names is a common challenge.
In a nutshell, the task is this: given a list of names (let’s say, oh, 3.2 million names), determine which ones are the same person and which are not.
The goal is, as automatically as possible, to sort out which of these records should be merged. Once accomplished, you (a publisher) could make a webpage for each author listing all publications and so forth for your users to browse.
Clearly, some of these names are potentially the same person, while others are not. For example, B. Caldwell Smith, B.C. Smith, Brandon C. Smith, and Brandon Caldwell Smith look like they might be the same person. To find out without looking at every name and every article (3.2 million, remember?), we need more information.
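To make that first comparison concrete, here is a minimal sketch of name-variant matching, assuming simple whitespace-separated Western-style names; the function names are ours for illustration only, not part of any shipping product:

```python
# Minimal sketch: are two name strings *compatible*, i.e. could they
# plausibly refer to the same person? (Illustrative only.)

def name_parts(name: str) -> list[str]:
    """Split 'B. Caldwell Smith' into lowercase parts with dots removed."""
    return [p.rstrip(".").lower() for p in name.split()]

def part_compatible(a: str, b: str) -> bool:
    """An initial matches any name starting with that letter; full names must match exactly."""
    if len(a) == 1 or len(b) == 1:
        return a[0] == b[0]
    return a == b

def names_compatible(name1: str, name2: str) -> bool:
    """Compare surnames exactly, then aligned forename parts.
    Deliberately simplistic: ignores extra middle names and reordering."""
    p1, p2 = name_parts(name1), name_parts(name2)
    if p1[-1] != p2[-1]:          # surnames must match exactly
        return False
    return all(part_compatible(a, b) for a, b in zip(p1[:-1], p2[:-1]))

print(names_compatible("B. Caldwell Smith", "Brandon Caldwell Smith"))  # True
print(names_compatible("Rodger Smith", "Roger Smith"))  # False: compatibility alone can't catch typos
```

Note what the last line shows: string comparison by itself can neither confirm a match nor catch a typo, which is exactly why we need more information.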
To accomplish this task, metadata associated with each author is examined and compared to try to eliminate duplicates. For example, from each article we can associate an author with their co-authors, the institution they were affiliated with when the paper was published, email addresses, dates of publication, and so forth.
Well, some things are clearer, but some are not. Whereas before we may have suspected that Rodger Smith and Roger Smith were different people, it turns out they published at the same institution in the same year; maybe it’s just a typo? And maybe Brandon Caldwell Smith moved from Harvard to Yale (not unheard of) sometime between 1961 and 1972?
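As a rough illustration of how such metadata comparison might work, consider this toy scoring function; the record fields and weights are invented for the example, not a real disambiguation schema:

```python
# Toy sketch: accumulate evidence that two author records describe the
# same person, based on shared metadata from their articles.

def match_score(rec1: dict, rec2: dict) -> int:
    score = 0
    if rec1.get("email") and rec1.get("email") == rec2.get("email"):
        score += 3                                  # a shared email address is strong evidence
    if rec1.get("institution") == rec2.get("institution"):
        score += 2                                  # same affiliation in the same period
    score += len(set(rec1.get("coauthors", [])) & set(rec2.get("coauthors", [])))
    return score                                    # one point per shared co-author

roger  = {"institution": "Harvard", "coauthors": {"J. Doe"}}
rodger = {"institution": "Harvard", "coauthors": {"J. Doe", "A. Lee"}}
print(match_score(roger, rodger))  # 3: same institution plus a shared co-author, so maybe a typo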
At Access Innovations we’ve been developing a way to add some certainty to the process using semantic metadata—it’s not a silver bullet, but it is a bigger gun. We call the process “semantic fingerprinting” and it’s based on our thesaurus and indexing technology.
Every author’s works (papers, conference proceedings, editorial roles) associate them with one or more pieces of content, and for each piece of content we have indexing terms from a thesaurus particular to that client. By associating the author directly with the indexing terms, we develop a semantic profile (or “fingerprint”) for each author. Since most authors write multiple papers (see “Lotka’s Law”), we compile the subject terms from all of them to build a more complete profile; obviously, the more papers we have, the more accurate the profile.
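A minimal sketch of the idea, using made-up thesaurus terms (the real terms come from the client’s controlled vocabulary), might look like this:

```python
from collections import Counter
from math import sqrt

def fingerprint(papers: list[list[str]]) -> Counter:
    """Pool the indexing terms of all of an author's papers into one weighted profile."""
    profile = Counter()
    for terms in papers:
        profile.update(terms)
    return profile

def cosine(p: Counter, q: Counter) -> float:
    """Cosine similarity of two term profiles: 0.0 = nothing in common, 1.0 = identical."""
    dot = sum(p[t] * q[t] for t in p)
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

# Two authors who share a name, with invented subject terms:
smith_1 = fingerprint([["catalysis", "organic chemistry"], ["catalysis", "polymers"]])
smith_2 = fingerprint([["medieval history", "paleography"]])
print(cosine(smith_1, smith_2))  # 0.0: same name, plainly different fields
```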
Returning to our example:
What we suspected might be one person based on our best information turns out, pretty obviously, to be two distinct researchers once areas of expertise are added to the equation.
While the process is far from foolproof, it does help to automate disambiguation, which cuts down on the number of human hours required to review the work.
The concept of the “semantic fingerprint” can be applied to a paper, a school, an editor, or any other entity for which subject metadata is available. So this same basic process can be used for other purposes; for example, to:
- Disambiguate institution names
- Match articles to peer reviewers or editors (see the sketch after this list)
- Demonstrate which areas of research are exploding at:
  - a journal
  - a college
  - a research laboratory
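As a sketch of the reviewer-matching case, assuming each reviewer already has a fingerprint of thesaurus terms (the article terms and reviewer profiles below are invented for illustration):

```python
# Toy sketch: rank candidate peer reviewers by how much of an article's
# indexing terms their semantic fingerprint covers.

def coverage(article_terms: set[str], reviewer_profile: set[str]) -> float:
    """Fraction of the article's terms that appear in the reviewer's fingerprint."""
    return len(article_terms & reviewer_profile) / len(article_terms)

article = {"taxonomies", "indexing", "machine learning"}
reviewers = {
    "R. Jones": {"taxonomies", "indexing", "thesauri"},
    "S. Patel": {"organic chemistry", "catalysis"},
}
ranked = sorted(reviewers, key=lambda name: coverage(article, reviewers[name]), reverse=True)
print(ranked)  # ['R. Jones', 'S. Patel']: the closest semantic match comes first
```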
As datasets get cleaner and cleaner, the accuracy of, and uses for, semantic technologies such as Access Innovations’ Semantic Fingerprinting techniques will continue to increase.
Bob Kasenchak, Project Coordinator
Access Innovations