Juliet: “What’s in a name? That which we call a rose, By any other name would smell as sweet.” Romeo and Juliet (II, ii, 1-2)
True, but try finding the right document set for your current project by sniffing it out from within a database of 8 million similar-smelling documents. This approach is all too common, very time-consuming, and unreliable, leaving you with aromatic, unpalatable results.
What’s in a person’s name? Take my name: Jay Ven Eman. What are the parts? What constitutes the last name or surname? Is that a middle name? My surname is Ven Eman and my first name is Jay. In XML it could look like <First_name>Jay</First_name><Surname>Ven Eman</Surname>. A small sampling of the name variants I’ve seen includes Jay Von Eman, Jay Van Eman, Jay van Eman, Jay ven Eman, Jay Veneman, and Jay Venema. I haven’t seen or used any aliases, but you have undoubtedly seen William Henry McCarty, aka Henry Antrim, aka William H. Bonney, aka Billy the Kid. Along with aliases, there are pseudonyms.
People’s names present a significant challenge. Name variants and aliases are difficult to identify and to track. Cultural differences in formatting names across languages contribute to the problem. Privacy concerns and the desire by many to remain relatively anonymous cause misidentification. Typographical errors, inconsistencies in original entry, and errors introduced during post-processing account for more of the confusion.
The magnitude of the mess is monumental. Facebook is projected to hit 700 million names. The world’s population is estimated to hit 7 billion this month. How will your boss, peers, anyone ever find you? How will you find the people you need?
A thesaurus is designed to provide guidance on all of the possible words that are used to label a rose and still allow it to smell as sweet. Thesaurus concepts can also be applied to proper nouns such as the names of things, places, and people. Synonym relationships can be used for name variants, aliases, and pseudonyms. These can be lumped into a single field or stored separately. Storing them in separate fields allows you to maintain more information about relationship types. When researching historical figures, for example, historians want to know what pseudonyms and aliases have been used and when. A system like our Data Harmony Thesaurus Master® makes it easier to capture and store person-name values. Our XIS™, the XML database management system, uses an XML DTD to control a master data record, or gold record, facilitating name management.
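To make the idea of separately stored relationship types concrete, here is a minimal sketch in Python. It is not how Thesaurus Master or XIS stores names; the `NameRecord` class, its field names, and the `build_index` helper are hypothetical, invented only to illustrate a preferred name with typed equivalents.

```python
from dataclasses import dataclass, field


@dataclass
class NameRecord:
    """Hypothetical 'gold record': a preferred name plus typed equivalents."""
    preferred: str
    variants: list[str] = field(default_factory=list)    # spelling variants
    aliases: list[str] = field(default_factory=list)     # other names used
    pseudonyms: list[str] = field(default_factory=list)  # pen names and the like


def build_index(records):
    """Map every known form (lowercased) to (preferred name, relationship type)."""
    index = {}
    for rec in records:
        index[rec.preferred.lower()] = (rec.preferred, "preferred")
        for relation in ("variants", "aliases", "pseudonyms"):
            for form in getattr(rec, relation):
                index[form.lower()] = (rec.preferred, relation)
    return index


records = [
    NameRecord(
        preferred="Jay Ven Eman",
        variants=["Jay Von Eman", "Jay Van Eman", "Jay Veneman", "Jay Venema"],
    ),
    NameRecord(
        preferred="William Henry McCarty",
        aliases=["Henry Antrim", "William H. Bonney", "Billy the Kid"],
    ),
]

index = build_index(records)
print(index["billy the kid"])  # ('William Henry McCarty', 'aliases')
print(index["jay veneman"])    # ('Jay Ven Eman', 'variants')
```

Because each form carries its relationship type, a lookup can answer not only "who is this?" but also "is this a spelling variant or an alias?", which is the extra information a single lumped field would lose.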
Capturing the core data correctly as it is added to a database system is much easier than cleaning data later. The Smart Submit author submission system that Access Innovations developed and installed for the American Society for Information Science and Technology (ASIS&T) is an example of the place to start. The entry template helps with formatting names, capturing each component of a name in separate fields. The name is then saved in native XML, allowing for greater flexibility in post-processing. Entries can be bounced against the master data record for verification by the author. This is a great opportunity to update your database of members or, when applied to a commercial Web site, a way to update your customer database.
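As a rough illustration of capturing name components in separate fields and comparing an entry with a master record, consider the Python sketch below. It does not show Smart Submit itself. The `First_name` and `Surname` element names follow the XML shown earlier; the `Person` wrapper, the `Middle_name` element, and both function names are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def name_to_xml(first_name, surname, middle_name=""):
    """Serialize separately captured name components as one XML element."""
    person = ET.Element("Person")  # hypothetical wrapper element
    ET.SubElement(person, "First_name").text = first_name.strip()
    if middle_name.strip():
        # Hypothetical element, written only when a middle name was entered.
        ET.SubElement(person, "Middle_name").text = middle_name.strip()
    ET.SubElement(person, "Surname").text = surname.strip()
    return ET.tostring(person, encoding="unicode")


def fields_to_verify(entry, master):
    """List the master-record fields where a new entry disagrees (case-insensitive)."""
    return [
        key
        for key, value in master.items()
        if entry.get(key, "").strip().lower() != value.strip().lower()
    ]


print(name_to_xml("Jay", "Ven Eman"))
# <Person><First_name>Jay</First_name><Surname>Ven Eman</Surname></Person>

master = {"first_name": "Jay", "surname": "Ven Eman"}
entry = {"first_name": "jay", "surname": "Van Eman"}
print(fields_to_verify(entry, master))  # ['surname']
```

Any field returned by `fields_to_verify` is a candidate to show back to the author for confirmation rather than something to overwrite silently.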
The example I use here is for an author submitting a manuscript for publication. At the point of name entry, the entered name can be bounced against your author database as mentioned and can also be integrated with external initiatives such as VIVO and ORCID. ORCID stands for Open Researcher & Contributor ID. It is “an independent, community effort to standardize researcher identification.” If you have a customer database, an opt-in initiative is an ideal way to create a much cleaner customer database.
Cleaning up a name database requires multiple strategies and multiple passes through the database. We make extensive use of semantics along with standard data processing techniques. Semantics add a richer layer to the analysis, improving your odds of getting it right.
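The multiple-pass idea can be sketched with standard string techniques alone. The Python example below is only an assumption about what two such passes might look like: an exact pass on a normalized key, then a fuzzy pass using the standard library's `difflib`. It does not represent the semantic analysis mentioned above, and the `normalize` and `match_name` helpers and the 0.85 cutoff are hypothetical choices.

```python
import difflib
import unicodedata


def normalize(name):
    """Build a comparison key: drop accents, punctuation, spacing, and case."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(c for c in decomposed if c.isalnum()).lower()


def match_name(name, master_names, cutoff=0.85):
    """Pass 1: exact match on the normalized key. Pass 2: closest fuzzy match."""
    keys = {normalize(m): m for m in master_names}
    key = normalize(name)
    if key in keys:
        return keys[key], "exact"
    close = difflib.get_close_matches(key, list(keys), n=1, cutoff=cutoff)
    if close:
        return keys[close[0]], "fuzzy"
    return None, "unmatched"


master_names = ["Jay Ven Eman", "William H. Bonney"]

print(match_name("Jay Veneman", master_names))    # ('Jay Ven Eman', 'exact')
print(match_name("Jay Von Eman", master_names))   # ('Jay Ven Eman', 'fuzzy')
print(match_name("Billy the Kid", master_names))  # (None, 'unmatched')
```

Note that the alias "Billy the Kid" stays unmatched: no amount of string similarity links it to William H. Bonney, which is why aliases and pseudonyms need explicitly recorded relationships and why fuzzy matches should be reviewed before records are merged.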
A complete discourse on creating and cleansing name databases cannot be covered in this limited venue. The work is not an insignificant undertaking, but the benefits are significant. Do it right and you will end up smelling like a rose by whatever name.
Jay Ven Eman, CEO
Access Innovations, http://www.accessinn.com/