Name Disambiguation Musings

On April 19, 2011, The Wall Street Journal talked about the need for customer name authority control in banks. Okay, so maybe that is not what they said. What they did outline was the problem of Arabic names and the many ways to write them in the Latin alphabet. This creates difficulties for banks and other organizations that try to track information on organizations or individuals, or to put a hold on their funds. The example used was Moammar Gadhafi. His first name could be transliterated as Muammar, Mummar, Mohamed Mahmut, Mehmud and more than 20 other variants. The same goes for the last name, which could be Gaddafi, Ghathafi, Elkaddafi, El-Kaddafi, Al-Gaddafi, Gadhafi, Qaddafi, Al-Qadhafi, El-Qaddfi, Qadhafi, Abu Miryar Al-Qahafi, Ghadaffi, and others. Any combination of these names is valid. Prefixes such as Abu, Al, or El, and other designations of honor, complicate matters further, making things even more interesting.
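To make the prefix problem concrete, here is a minimal Python sketch of a matching key that lowercases a surname and drops a leading Al- or El- when it is set off by a hyphen or space. The function name surname_key and the short particle list are assumptions made for illustration, not part of any product or standard, and the sketch deliberately does nothing about spelling differences such as Gaddafi versus Qaddafi.

```python
import re

# Hypothetical, deliberately short list of leading particles.
# A production list would need review by someone who knows the naming conventions.
PARTICLES = ("al", "el")

# Matches a particle at the start of the name followed by hyphens or spaces.
_PARTICLE_RE = re.compile(r"^(?:%s)[-\s]+" % "|".join(PARTICLES))


def surname_key(name: str) -> str:
    """Return a lowercase matching key with a leading Al-/El- particle removed."""
    lowered = name.strip().lower()
    return _PARTICLE_RE.sub("", lowered)


if __name__ == "__main__":
    for variant in ("Gaddafi", "Al-Gaddafi", "El-Kaddafi", "Elkaddafi"):
        print(variant, "->", surname_key(variant))
```

With this key, Gaddafi and Al-Gaddafi collapse to the same value, while Elkaddafi is left alone because nothing separates the particle from the rest of the name. That gap is exactly why a rule like this can only be one small piece of a larger alias list.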

The UN Sanctions group lists twelve variations for his name. The UN prefers a single form of the name (Muammar Mohammed Abu Miryar Qahafi), as do the Swiss (Muammar Ghedklafi). However, there is nothing unusual or illegal about someone filing under any of several valid variations of their name.

Arabic does not have a settled transliteration standard the way Chinese, Japanese, and Korean, or Tamil, or most other languages do. There are attempts, and the ISO has a transliteration standard. But true transliteration should work in both directions, and that fails here because the initial inputs are so variable. Arabic has sounds that do not exist in Latin-based languages, and the ways of rendering them are highly varied. A speaker can work out the intended name phonetically, but a machine cannot easily convert it back and forth. There is also the issue of dialects of Arabic. Speakers are very particular about how one group speaks compared to another. Recently, an Egyptian told me that the Algerians do not speak real Arabic, to which an Algerian responded that it was the Egyptians who did not know how to speak. Ugh.

Figuring out names, deciding how best to format them, and building databases of those names and their variations are increasingly important in our ever more digital world. Disambiguating the authors of papers in scientific journals has recently become a popular area with the introduction of author networks such as the AIP UniPHY and the Elsevier SciVal (formerly Collexis) offerings. The common examples involve Asian names, because the order of family and given names can be inverted, and anyone who does not know the naming conventions can easily confuse them. There are many Asian authors in Western journals. Names from languages like Arabic and Tamil, and from other alphabetic scripts that do not conform to Western naming conventions, do not lend themselves to simple solutions. Names also change as people marry, divorce, use nicknames, use or don't use middle names, initials, etc.

We believe that the need for transliteration standards, like the large number of them available from ISO, and for the accompanying tools and authority file databases of names, will continue to grow as we try to bridge the divides of information capture and sharing. Access Innovations uses Unicode to ensure that all character sets can be represented. The Data Harmony tool set accommodates the huge variations in naming conventions, and the XML option ensures that the data can be encapsulated and shared across many platforms. Using Java means that the tools are platform independent. Entity extraction (people, places, and things) tools need to pull as much conceptual value from digital objects as is available. Making sure that the desired information can be extracted in any language, and then gathered with the aliases of each name, forms the name disambiguation framework we support.
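The idea of gathering each name with its aliases can be sketched as a tiny authority file lookup. This is only an illustration in Python, not a description of how the Data Harmony tools work: the names normalize, build_index, and resolve are hypothetical, the choice of preferred form is arbitrary, and the aliases are a few combinations of the variants mentioned earlier in this post.

```python
import unicodedata
from typing import Dict, List, Optional

# Illustrative authority file: one preferred form mapped to known aliases.
# The preferred form chosen here is an assumption made for the example.
AUTHORITY: Dict[str, List[str]] = {
    "Moammar Gadhafi": [
        "Muammar Gaddafi",
        "Muammar Qaddafi",
        "Muammar Al-Qadhafi",
    ],
}


def normalize(name: str) -> str:
    """Unicode-normalize, case-fold, and collapse hyphens and whitespace."""
    text = unicodedata.normalize("NFKC", name).casefold()
    text = text.replace("-", " ")
    return " ".join(text.split())


def build_index(authority: Dict[str, List[str]]) -> Dict[str, str]:
    """Map every normalized alias (and preferred form) to its preferred form."""
    index: Dict[str, str] = {}
    for preferred, aliases in authority.items():
        index[normalize(preferred)] = preferred
        for alias in aliases:
            index[normalize(alias)] = preferred
    return index


def resolve(name: str, index: Dict[str, str]) -> Optional[str]:
    """Return the preferred form for a name, or None if it is not on file."""
    return index.get(normalize(name))


if __name__ == "__main__":
    idx = build_index(AUTHORITY)
    print(resolve("muammar  al qadhafi", idx))
    print(resolve("Jane Doe", idx))
```

A lookup like this only finds variants that someone has already recorded, which is why the authority file itself, and the effort to keep it current, matters more than the code around it.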

Marjorie M.K. Hlava
President, Access Innovations
