Semantic Integration – Part I of VI
November 28, 2011
Posted in Access Insights, Featured, Taxonomy, Term lists
Introduction to Taxonomies – definitions
To discuss semantic integration, or the leveraging of a taxonomy in search, web sites, mashups, and other places, we should first review what taxonomies are. Let’s look at the definitions and then at the integration of a taxonomy as a building block for the larger information architecture of an organization. We need to think of taxonomies in that bigger context when we talk about where we apply them. Once those are out of the way, I will review some use cases and show what makes them work.
A taxonomy is a “knowledge organization system”, or KOS. There are many types of KOS, which you will see in a minute, but basically it is a set of words that have been organized to control the terms used in a field (a “vocabulary”) so that things can be easily found in a specific subject area. These Knowledge Organization Systems are usually specific to a knowledge domain: a topical or subject area, or a scholarly, scientific, or enterprise area. They’re really descriptive labels. We have many names we can call something, but we settle on one descriptive label as the main term and then post all of the other terms we use to it as synonyms, no matter what we call it.
Taxonomies often put together words (controlled terms) in a hierarchical view, meaning a broader/narrower navigation tree, so that people can browse the tree to find information quickly. They are certainly used as a storage and retrieval aid. That is why we started using them in the first place. We use them both for tagging information objects as we add them to a database and for searching as we query that same database.
Those controlled vocabularies, or KOS (Knowledge Organization Systems), can be anything from a flat list, like your Saturday to-do list, to a list with ambiguity control provided by a synonym ring. We might put some structure to it and map the terms into a hierarchy, which gives us a taxonomy. We move to a full thesaurus by adding relationships between the terms, notes, and other features, along with the ambiguity control from a synonym ring. The thesaurus has those four features: the hierarchy (taxonomy), the relationships (associative or related terms), ambiguity control (nonpreferred terms, synonyms, use references), and various kinds of notes. The next step in complexity would be to define those associative relationships in many different ways, i.e., as an ontology. The final option in increasing complexity is Linked Data or the Semantic Web, where the actual items described in many different systems are hooked together by the KOS. People are increasingly saying taxonomy or ontology and meaning thesaurus. Whatever the instance is called, they are all still knowledge organization systems, just as classification systems are.
In an organization system, we are putting control on the terminology, but we also have a hierarchical format: parent-child, genus-species, and broader-narrower types of relationships. Specific items can occur as final leaves on those hierarchies. That is, they are on the branches of those navigation trees. They are sometimes known as narrower term instances; sometimes they are an actual URL or a document itself. Taxonomies are very common on websites. They are used as drop-down pick lists. They may represent a sizeable directory and, as you will see, a lot of other variations.
Taxonomies themselves, as a single group, have only an emerging set of standards. They aren’t fully standardized as yet. A taxonomy does have a definition within the Z39.19 Controlled Vocabularies standard, published by ANSI (American National Standards Institute) and NISO (National Information Standards Organization): it is a “hierarchically organized vocabulary based on a classification system”. That definition does not include synonyms, disambiguation of homographs, or related terms (also known as associative relationships). Those are not included in the classic definition of a taxonomy. You can download this standard at no charge from NISO.
There is a companion item, also a controlled vocabulary and the focus of the standard, which is a thesaurus. A controlled vocabulary focuses on concepts. It’s not the specific items themselves but rather the concepts. So, you will group things together by concepts. A thesaurus also has a hierarchy, it has a lot of different display formats, and it allows a lot of networks of relationships, or related terms that could be considered friends or cousins or aliases. It also allows things like scope notes and term histories. It is a more elaborate and informative addition, and there are long-established standards for this area.
Starting with Cutter and Dewey and the turn of the 19th Century and moving forward, we have the evolving standards for thesauri. In the same standard, Z39.19, a thesaurus is defined as a controlled vocabulary of terms in natural language order. It is designed for “post-coordination”.
Those are fairly heavy terms. Post-coordination means that the terms are combined at the time the search is run or the question is asked. So, it is when we make a query that we combine the terms. We aren’t combining them as we did in subject headings and back-of-the-book indexes. Those are combined in a “pre-coordinated” fashion, at the time the index is created. When we use a thesaurus or a taxonomy, we use it in a post-coordinate activity. That is a big difference. We need to think about how that happens. When someone is doing “post-coordinate indexing” instead of cataloging, we get a big flip in the way the terms are created. Some people are indexing the back of the book, and others are indexing using a controlled vocabulary. The processes are quite different. So, in the Taxonomy Division, we don’t really talk about back-of-the-book indexing and subject heading indexing. We talk about using a thesaurus or a taxonomy for that indexing.
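To make post-coordination concrete, here is a minimal Python sketch. It assumes each controlled term already points to the set of documents tagged with it; the terms, the document numbers, and the function name `post_coordinate` are all made up for illustration and do not come from any particular system.

```python
# Hypothetical postings: each single-concept term maps to the IDs of the
# documents that were tagged with it at indexing time.
postings = {
    "Taxonomies": {1, 2, 5},
    "Search engines": {2, 3, 5},
    "Metadata": {2, 4},
}


def post_coordinate(postings, *terms):
    """Combine terms at query time by intersecting their document sets."""
    sets = [postings.get(term, set()) for term in terms]
    if not sets:
        return set()
    return set.intersection(*sets)


# The combination happens when the question is asked, not when the index is built.
print(sorted(post_coordinate(postings, "Taxonomies", "Search engines")))  # [2, 5]
```

A pre-coordinated index would instead store one combined heading for the pair when the index is created, so every useful combination has to be anticipated in advance.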
We’ve gotten to the point where the terms thesaurus and taxonomy are used interchangeably quite often. People think of a thesaurus as a taxonomy with extras. It has related terms, non-preferred terms as synonyms, or the use and use-for references (another way to talk about them), has scope notes, and a whole lot more. But we need to be careful when we are talking about the word thesaurus because the lay person will automatically begin to assume that we mean a synonymy, like the Roget’s Thesaurus and that is not what we mean at all. Although synonyms are important, they are not the only piece of what we are doing.
This is a shot of what a taxonomy and a thesaurus can look like.
On the left, you see the taxonomy view. We have a broader term and some narrower terms and we have even top terms in the taxonomy view. This view is broader-narrower term driven so it is hierarchical in nature and it is called the taxonomy view.
The full term record focuses around the term in question so it might have broader terms and narrower terms as we see here. But, it could also have a status, related terms, some see also or non-preferred synonym references, as well as scope notes, editorial notes, a history field, facets and other things. This set of fields is kind of the default set – the basic individual set for the system.
The basic taxonomy features are the hierarchy structure, the broad or general concepts, which are known as broader terms, and then the more specific concepts (but still concepts), which are the narrower terms. Related terms are conceptual cousins. They don’t fit into that subsidiary relationship. So, they might have something to do with the term, but they are not exactly the same.
Then we have the equivalency relationships, or term equivalents: the synonyms, non-preferred terms, and the use and use-for kinds of references. You can have a lot of those. You can even make them multilingual equivalents. You can have an English-Spanish-French-Chinese-German or some other set of languages for all of these synonyms.
We can also have facets, which are another way to sort the data. Frequently, faceted search, as it is most broadly approached now, is really field-formatted search. But there is a large body of knowledge about faceting information from Ranganathan and other luminaries in the field.
Scope notes might be notes that you want somebody to see. Glossary notes are dictionary definitions, whereas editorial notes might be something you want to keep to yourselves and your team and not show to your user base.
All of those together make up a term record.
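The term record described above can be pictured as a simple data structure. The following Python sketch is only an illustration: the class name `TermRecord`, the field names, and the sample values are assumptions, not the layout of any actual thesaurus software.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class TermRecord:
    """One concept and everything recorded about it (hypothetical field set)."""
    term: str
    broader: list = field(default_factory=list)        # BT
    narrower: list = field(default_factory=list)       # NT
    related: list = field(default_factory=list)        # RT, the conceptual cousins
    non_preferred: list = field(default_factory=list)  # "use for" synonyms
    scope_note: str = ""                                # shown to users
    editorial_note: str = ""                            # kept within the team
    history: str = ""

    def public_view(self):
        """Return the record as a dict without the internal editorial note."""
        view = asdict(self)
        view.pop("editorial_note")
        return view


# Sample values are invented for illustration.
record = TermRecord(
    term="Taxonomies",
    broader=["Knowledge organization systems"],
    related=["Thesauri"],
    non_preferred=["Taxonomy"],
    scope_note="Hierarchically organized controlled vocabularies.",
    editorial_note="Review placement of this term.",
)
print(record.public_view()["non_preferred"])  # ['Taxonomy']
```

Keeping the editorial note as its own field is what makes it possible to hide it from the user base while still showing the scope note.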
We put concepts into a taxonomy. We put People, Places, and Things into authority lists. Name Authority files are one kind of authority file. We have a name authority list, which might be the way you want to talk about somebody. For instance, my name might be Marjorie Hlava, or Marge Hlava, or Margie Hlava, or Marjorie M.K. Hlava, or it might be Marjorie Maxine Kimmel Hlava or the even longer name I was baptized with. Getting that name authority so that all of those names post to one place is increasingly important as we get more and more connected. There is a lot of work going on, for example, in author authority files and author disambiguation problems. Libraries have always known about those. Is it Mark Twain or is it Samuel Clemens? There are many other pen name situations that we deal with. Name authority files are important. They are also important in corporations when you talk about brand names. For example, in a pharmaceutical firm, something starts off as a chemical name, becomes a code name, becomes a production name, and becomes something they want to try out and actually launch to the public. In different markets in different countries it will have a lot of different names. You need to keep all of those names together for research purposes. An authority file lends itself to that kind of thing.
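As a rough sketch of how a name authority file makes all of those names "post to one place", the Python below maps each variant to a single preferred form. Which form is treated as preferred here is an assumption made for the example, and the names `authority`, `build_lookup`, and `resolve` are hypothetical.

```python
# Hypothetical authority file: preferred form -> known variant forms.
# The choice of preferred form is an assumption for this example.
authority = {
    "Marjorie M.K. Hlava": [
        "Marjorie Hlava",
        "Marge Hlava",
        "Margie Hlava",
        "Marjorie Maxine Kimmel Hlava",
    ],
    "Mark Twain": ["Samuel Clemens"],
}


def build_lookup(authority):
    """Index every preferred and variant form under its preferred form."""
    lookup = {}
    for preferred, variants in authority.items():
        for name in [preferred, *variants]:
            lookup[name.casefold()] = preferred
    return lookup


lookup = build_lookup(authority)


def resolve(name):
    """Return the preferred form, or None if the name is not in the file."""
    return lookup.get(name.strip().casefold())


print(resolve("marge hlava"))     # Marjorie M.K. Hlava
print(resolve("Samuel Clemens"))  # Mark Twain
```

An exact-match lookup like this only covers variants that someone has already recorded; misspellings and new aliases still need a review step before they are added to the file.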
A synonym set or synonym ring is something where you have a lot of terms that mean essentially the same thing. You might have keywords, descriptors, subject headings, etc., that are equal to each other. You can begin to put some control on the vocabulary with synonyms but you can control it further with broader-narrower terms and on up through thesaurus, ontology, and eventually, when applied to data, a semantic network.
Taxonomy terms are an important part of meta data. In particular, they describe subject meta data, or concept meta data. Meta data itself is all the kinds of information about the information you are talking about. So, it is not the information itself. It is the description of that information. When we talk about a bibliographic citation with indexing, we are really talking about the meta data on the journal article or book. The data “about” data is meta data.
The Dublin Core is one meta data standard but there are a bunch of other ones – meta data, schemas, or standards – that are commonly designed specifically for a body of knowledge.
As I said, a taxonomy is one of the building blocks. It’s the words representing concepts. It is also known as semantics. Semantics is a popular way of defining this now. And just like in meta data, you can substitute the word “about”. In semantics, you can substitute the word “words” and you would have a good idea of what is driving the process and how something works. So, semantics is driving the process, or the words are driving the process, of applying this kind of control to a knowledge domain. The taxonomy is core to that process because it gives the organizing principles for all of the content that you are going to deliver. And, if you are lucky, in your application you are going to be able to use it all the way from website design to new product offerings to searches, to the way things are stored and maybe the records management part of the organization. Looking at all of those different pieces, we want to see where these can be applied in real life to our systems.
Check back next week when we continue the series and talk about Cases for Semantic Enrichment Using Taxonomies.
Marjorie M.K. Hlava
President, Access Innovations
What’s in a name?
October 31, 2011
Posted in Access Insights, Featured, Term lists
Juliet: “What’s in a name? That which we call a rose, By any other name would smell as sweet.” Romeo and Juliet (II, ii, 1-2)
True, but try finding the right document set for your current project by sniffing it out from within a database of 8 million similar-smelling documents. This approach is all too common, very time consuming, and unreliable, leaving you with aromatic, unpalatable results.
What’s in a person’s name? Take my name: Jay Ven Eman. What are the parts? What constitutes the last name or surname? Is that a middle name? My surname is Ven Eman and my first name is Jay. In XML it could look like <First_name>Jay</First_name><Surname>Ven Eman</Surname>. A small sampling of the name variants I’ve seen are Jay Von Eman, Jay Van Eman, Jay van Eman, Jay ven Eman, Jay Veneman, and Jay Venema. I haven’t seen or used any aliases, but you have undoubtedly seen William Henry McCarty, aka Henry Antrim, aka William H. Bonney, aka Billy the Kid. Along with aliases are pseudonyms.
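To show why separate name fields matter, here is a small Python sketch that reads the XML fragment above. The `<Person>` wrapper element is an assumption added so the fragment parses as a single document; it is not part of the original markup.

```python
import xml.etree.ElementTree as ET

# The <Person> wrapper is hypothetical; the two inner elements are from the text.
record = "<Person><First_name>Jay</First_name><Surname>Ven Eman</Surname></Person>"

person = ET.fromstring(record)
first_name = person.findtext("First_name")
surname = person.findtext("Surname")

# With the parts tagged, the sort form is unambiguous.
print(f"{surname}, {first_name}")  # Ven Eman, Jay

# Guessing the surname from an untagged string gets it wrong.
naive_surname = "Jay Ven Eman".split()[-1]
print(naive_surname)  # Eman
```

The last two lines illustrate the problem the post describes: a program that takes the final word as the surname drops "Ven", which is one way variants of a name get introduced.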
People’s names present a significant challenge. Name variants and aliases are difficult to identify and to track. Cultural differences in formatting names across languages contribute to the problem. Privacy concerns and the desire by many to remain relatively anonymous cause misidentification. Typographical errors, inconsistencies in original entry, and errors introduced during post processing account for more of the confusion.
The magnitude of the mess is monumental. Facebook is projected to hit 700 million names. The world’s population is estimated to hit 7 billion this month. How will your boss, peers, anyone ever find you? How will you find the people you need?
A thesaurus is designed to provide guidance on all of the possible words that are used to label a rose and still allow it to smell as sweet. Thesaurus concepts can also be applied to proper nouns such as things, places, and people. Synonym relationships can be used for name variants, aliases, and pseudonyms. These can be lumped into a single field or stored separately. Storing them in separate fields allows you to maintain more information about relationship types. When researching historical figures, for example, historians want to know what pseudonyms and aliases have been used and when. A system like our Data Harmony Thesaurus Master® makes it easier to capture and store person-name values. Our XIS™, the XML database management system, uses an XML DTD to control a master data record, or gold record, facilitating name management.
Capturing the core data correctly as it is added to a database system is much easier than cleaning data later. The author submission system Access Innovations developed and installed for the American Society for Information Science and Technology (ASIS&T) is an example of the place to start. The entry template formats names, capturing each component of a name in separate fields. Each entry is then saved in native XML, allowing for greater flexibility in post processing. Entries can be bounced against the master data record for verification by the author. This is a great opportunity to update your database of members or, when applied to a commercial Web site, a way to update your customer database.
The example I use here is for an author submitting a manuscript for publication. At the point of name entry, the entered name can be bounced against your author database as mentioned and can also be integrated with external initiatives such as VIVO and ORCID. ORCID stands for Open Researcher & Contributor ID. It is “an independent, community effort to standardize researcher identification.” If you have a customer database, an opt-in initiative is an ideal way to create a much cleaner customer database.
Cleaning up a name database requires multiple strategies and multiple passes through the database. We make extensive use of semantics along with standard data processing techniques. Semantics add a richer layer to the analysis, improving your odds of getting it right.
A complete discourse on creating and cleansing name databases cannot be covered in this limited venue. While it is not an insignificant undertaking, the benefits are significant. Do it right and you will end up smelling like a rose by whatever name.
Jay Ven Eman, CEO
Access Innovations http://www.accessinn.com/
Labeling or Profiling?
October 17, 2011
Posted in News, semantic, Term lists
Labeling people based on behaviors is not a new technology or even behavior. This has been done since the beginning of time. However, with the advancement of semantic technology and the flood of data available, it has become overwhelming to discern a simplified approach to organization.
We found this interesting information on Business 2 Community in their article, “Your Taxonomy or Mine?” It outlines a recent and exhaustive attempt to cut through the clutter and simplify segmentation as part of a Center of Excellence initiative for a major technology company. The project revealed some interesting finds and is definitely worth the read, especially if you have any interest in social behaviors and trends.
Melody K. Smith
Sponsored by Access Innovations, the world leader in taxonomies, metadata, and semantic enrichment to make your content findable.
Open Net – What a site!
September 26, 2011
Posted in Access Insights, Featured, Term lists
I was doing some poking around to find out about OpenNet (which the Department of State uses), and I came across a DOE implementation of it (they apparently helped invent it.) Clicking the author link works really well! The site is clean and crisp. Very professional looking.
The “Document Categories” list on the Advanced Search page gave me pause.
There are about 70 categories listed, in no discernible order, except for occasional apparent groupings of consecutive listings. One of those groupings, strangely enough, is “Laser Isotope Separation” and “Other Isotope Separation Information”, while “Isotope Separation” is the first category in the entire list. “Other Weapon Topics” is near the end; various weapon categories are sprinkled throughout the list. I guess you have to go through the whole list to see if your weapon of choice is “other”.
You might at least expect an alphabetical order. Best practices would suggest that we group the 70 terms into a list of 20 to 25 so they will display easily on a web page or in a long drop-down list. After that, a hierarchical grouping would be useful.
Fourth from the end is plain old “Other”. Not to be topped by that, the very last item is “General, Miscellaneous, Administrative, Historical and”. (I might be able to find out what comes after “and” if I did a search.) I’m not sure what the differences are among “Other”, “General”, and “Miscellaneous”. Perhaps in Government speak you need to get all of them covered, but it seems like a great place to use synonyms.
DOE had an excellent thesaurus inherited from ERDA, which replaced the AEC in 1975. It was built using the standards we all now follow. It became very large without strong controls on the terms added or other governance. More recently it has been subsumed in a joint effort with INIS called the ETDE/INIS Joint Thesaurus Project. That is a good idea, since the combination of the two nuclear information thesauri will better serve the greater global community with a single nomenclature.
The actual site, however, makes one wonder whether the lure of indexing by computer, without any help from the human brain, made them do away with the thesaurus and indexing practice and depend instead on the computer guessing what the user wants. What happened to the idea that the human can enhance search using a computer? I don’t know how anyone finds stuff using this system. No wonder our intelligence systems are so flawed! Or maybe that is the idea: the data is there and open… hard to pinpoint, but it satisfies the openness criteria.
Marjorie M.K. Hlava
President, Access Innovations
Finding New Among the Old
September 15, 2011
Posted in Access Insights, indexing, News, Term lists
Indexing is becoming one of the key information management methods. That is not new news. The fascination of categorizing and putting vast information in a findable form is what drives good taxonomy building.
This subject of the order of things triggered the discussion recently when Access Innovations’ president, Margie Hlava, sat down with Steve Arnold, a technology and financial analyst, owner of ArnoldIT and writer of Beyond Search.
Novelty detection is deployed when you are looking for new terms and uses in the community. Calling old things by new names, creating acronyms, etc. happens every day, but they are not new. So finding “novel” terms is sometimes difficult.
Making it even more complicated, term combinations happen to provide a new way to look at old things. Term mining applications are deployed to work through this maze.
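One simple way to picture the first pass of novelty detection is to screen candidate terms against everything the vocabulary already knows, including synonyms and acronyms. The Python sketch below is an illustration under that assumption only; the vocabulary, the candidate list, and the function name `novel_terms` are invented for the example and do not describe any particular term mining product.

```python
# Hypothetical vocabulary: preferred term -> known synonyms and acronyms.
known = {
    "Magnetic resonance imaging": {"MRI", "NMR imaging"},
}


def novel_terms(candidates, known):
    """Keep only candidates that match no preferred term or recorded variant."""
    seen = set()
    for preferred, variants in known.items():
        seen.add(preferred.casefold())
        seen.update(variant.casefold() for variant in variants)
    return [c for c in candidates if c.casefold() not in seen]


# "MRI" is an old thing under another name, so it is not reported as novel.
candidates = ["MRI", "Optogenetics", "magnetic resonance imaging"]
print(novel_terms(candidates, known))  # ['Optogenetics']
```

A filter like this only removes names that have already been recorded; a new acronym for an old concept would still come through and need an editor's judgment.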
This interesting discussion and approach is worth listening to in full. Listen to the complete podcast here.
Melody K. Smith
Sponsored by Access Innovations, the world leader in thesaurus, ontology, and taxonomy creation and metadata application.
Cost Reduction Options for Hospitals
September 5, 2011
Posted in Autoindexing, indexing, News, Taxonomy, Term lists
Many are concerned about the cost of healthcare, patients and hospitals alike. Cuts in Medicare funding, increases in insurance premiums, and new federal requirements (such as the ICD-10 medical coding transition) are putting pressure on everyone to find ways to cut costs.
We found this pertinent information on Becker’s Hospital Review in their article, “6 Ways Hospitals Can Survive the Onslaught of Medicare Cuts.” There are many options for hospitals to endure the continuing Medicare cuts and the laws that affect their bottom line. Many are outlined in the referenced article.
Access Innovations can also help. Developer of the M.A.I. machine assisted indexing system and specialists in complex coding, tagging, and indexing, Access Innovations provides a range of services that deliver tag integrity. Access Innovations provides training to a client’s staff and then offers quality assurance and validation services that can:
- Minimize the risk of a coding error
- Identify inappropriate or inadvertently applied tags
- Display a “map” of coding distributions to allow management to get a bird’s-eye view of the coding assignment flow.
Many widely used tagging systems lack a user-friendly interface and may not implement a rigorous ANSI-compliant coding subsystem. Access Innovations’ solutions are ANSI compliant and implement state-of-the-art technology to speed tagging and reduce errors. For more information, contact Access Innovations.
Melody K. Smith
From Simple to Complex
June 20, 2011
Posted in Access Insights, Featured, ontology, Standards, Taxonomy, Term lists
People talk about different kinds of vocabularies. The differences usually have to do with the structure, or lack thereof. Sometimes, people refer to “flat lists”. These are one-level lists with no hierarchical structure. They can be uncontrolled or controlled lists. An uncontrolled list is a simple, flat structure. The uncontrolled list is your “Saturday list”. It is also a candidate term list; nothing particularly formatted about it. It is a simple list but it is a list. It might look something like this:
Wash
Trim bushes
Clean cat box
Iron
Water plants
Make birthday cake
It is a simple, natural language list. It can be a list of candidate terms for a taxonomy or thesaurus, serving as an excellent starting point.
Normally, an authority list is a flat list, although it can be organized by broad categories. It often defines the “approved” forms of names. The names are often of people or places. If the list is associated with variant forms of the names, it can be used to control ambiguity. It also can control ambiguity by providing a consistent form of a name for indexing.
Synonym rings try to control ambiguity, but they put forth all of the synonyms. You might have descriptor, keyword, subject headings, thesaurus term, taxonomy term, etc., all meaning roughly the same thing. You need to determine which one is the primary way you will talk about it and that is the way you will use it.
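A synonym ring can be sketched in Python as a set of equivalent terms with no preferred member, used to expand a query. The ring below uses the terms named above; the function name `expand` and the list name `rings` are assumptions for the example.

```python
# One ring built from the terms listed in the text; no member is preferred.
rings = [
    frozenset({
        "descriptor",
        "keyword",
        "subject heading",
        "thesaurus term",
        "taxonomy term",
    }),
]


def expand(term, rings):
    """Return every term in the same ring, or just the term if it has no ring."""
    for ring in rings:
        if term in ring:
            return set(ring)
    return {term}


print(sorted(expand("keyword", rings)))
print(expand("facet", rings))  # {'facet'}
```

Choosing one member of the ring as the primary term, and pointing the others to it, is the step that turns a ring into the use and use-for references of a thesaurus.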
A taxonomy is solely the hierarchy, while a thesaurus tries to bring the ambiguity control, the synonyms, the hierarchies and the related terms to it. There are not many standards as yet for taxonomies (although there are many standards for thesauri).
One standard that does address taxonomy is ANSI/NISO Z39.19-2005. It defines a taxonomy as a ‘collection of controlled vocabulary terms organized into a hierarchical structure’. You may notice that it doesn’t say anything about equivalence, homographs, associative relationships, or notes. It is just the hierarchy. So, in a controlled vocabulary with parent-child or hierarchical relationships, the specificity happens at the lower levels: at the branches, at the leaves, at the end of the list. Taxonomies are very common on websites. They are also commonly deployed as pick lists, such as a drop-down menu of ten or twelve items. Sometimes they are browseable directories. There are many different ways to put them into play.
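Because a taxonomy in this sense is just the hierarchy, it can be sketched as a nested structure. The Python below is a minimal illustration; the sample terms and the function names `path_to` and `leaves` are assumptions made for the example.

```python
# Hypothetical taxonomy: each term maps to its narrower terms.
taxonomy = {
    "Energy": {
        "Nuclear energy": {
            "Isotope separation": {
                "Laser isotope separation": {},
            },
        },
        "Solar energy": {},
    },
}


def path_to(term, tree, trail=()):
    """Return the broader-to-narrower path from a top term down to the term."""
    for label, children in tree.items():
        here = trail + (label,)
        if label == term:
            return here
        found = path_to(term, children, here)
        if found:
            return found
    return None


def leaves(tree):
    """Yield the most specific terms, those with no narrower terms."""
    for label, children in tree.items():
        if children:
            yield from leaves(children)
        else:
            yield label


print(" > ".join(path_to("Laser isotope separation", taxonomy)))
print(list(leaves(taxonomy)))  # ['Laser isotope separation', 'Solar energy']
```

The path is what a browseable directory would show as the way down the tree, and the leaves are where the specificity sits. A short pick list would normally be built from a single level of the tree.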
A thesaurus is also a controlled vocabulary. Since many thesauri are hierarchical, they may be referred to as taxonomies. However, unlike a simple taxonomy, a thesaurus does have equivalence, synonyms, associative relationships (related terms), and scope notes. It may contain definitions, editorial notes, and mappings from other thesauri and/or from taxonomies.
A thesaurus focuses on concepts. It doesn’t focus on the information object itself. You have to identify the information object, or the concept, by using the thesaurus. Rather than outlining those information objects, as a taxonomy might, it is giving you a guide to those information objects. Thus, for instance, a multilingual thesaurus term record focuses on the concept of the terms, rather than on the term itself in any one particular language. There are different ways to display a thesaurus so that you can see the network of relationships between the terms.
An ontology usually doesn’t do too well on ambiguity, but it is strong on synonyms, is big on hierarchy, and has a swarm of additional relationships: “is a”, “as a”, “is part of”, and similar kinds of statements are made when you work in an ontology. Those are different kinds of relationships; restating them using Thesaurus Master software adds a different kind of related term.
Marjorie M.K. Hlava
President, Access Innovations
Pre or Post-Coordinate Indexing?
June 13, 2011
Posted in Access Insights, Featured, indexing, Term lists
Most people think about what they want to search for and are willing to combine their concepts at the time of search. If you think of the way people search in Google, they put in the combination of terms they are thinking of; they are doing the coordination of terms. It is up to the search software to do the intersection of the terms for them and figure out the post-coordination.
In the current online environment, very seldom do we put together terms in a pre-coordinate fashion. That is one of the challenges in taking older classified lists – back-of-the-book indexes – and making them into a post-coordinate system.
Pre-coordination of terms is typical of traditional classification systems, and not of most modern taxonomies and thesauri. Classification systems often concatenate separate concepts into a string of terms. Natural language is not used.
I really think that most people search by typing in words as they think of them, so we need to support natural language in our systems. This is part of why Access Innovations uses post-coordination in the taxonomies and thesauri it creates.
In a post-coordinate system, a single concept is represented by a single term. We’re not combining two concepts, we are keeping them separate. That way we are able to do a large amount of automated indexing. If we try to mine the text at the beginning, it is very system-intensive. So, we have taken another path, by and large, for that.
Marjorie M.K. Hlava
President, Access Innovations
Adventures of a TaxoTourist
May 23, 2011
Posted in Access Insights, Featured, Taxonomy, Term lists
The trip was awesome—a dream exotic vacation to Bali. It was not about eat, pray, love, but a rather unbalanced midpoint to meet my Oz-dwelling daughter. I enjoyed dashes of ecotourism and agritourism, but even in full vacation mode I couldn’t fully suppress my perspective as a taxonomist.
The place is idyllic, verdant, vibrant, unique, and quirky, with wondrous surprises at every turn. A few features were especially surprising, like getting candy instead of small change in a sales transaction (rupiah come in millions, like Italian lire before the euro), and motorbike fuel is sold in any container handy from vegetable oil jugs to Jack Daniels bottles. The lack of guard rails on twisty mountain roads, wildly variable heights of steps in a staircase and the wobbly concrete panels of sidewalk paving with gaping 2×3 foot holes in them, hopscotched on a dark night in the rain, would set a lawyer’s heart a-flutter.
And then there’s the spelling, a wonderfully creative adventure. The spelling of words on signs, menus, brochures and more never failed to surprise and entertain. Though the word “internet” was consistent, even in remote villages, other words were expressed with endless variety. Initially, I viewed the spelling as a phonology test in linguistics until I realized there was no rule, no rhyme, no reason to it. Seeing an English translation after the Indonesian text did not guarantee I could easily make sense of what I read, nor could I expect the same spelling the next time.
Though much of what I encountered was oriented to tourists, the unexpected—sidewalks, steps, spelling, and more—was well outside the range of my expectations and made navigating a challenge.
Travel is an adventure, venturing beyond the daily norm to get away from the routine and expected. We seek excitement and variety from travel, not efficiency. Yet efficiency has its place. Predictability and standardization make it easy to recognize where we are and how to move around. With familiar landmarks and a sense of what to expect, it is simply easier to find your way. Creatively gesturing to circumvent a language barrier can be fun in travel but ineffective to accurately convey an idea or important direction.
What is true for finding your way around a new place is also true for navigating a taxonomy. What’s around and what comes next should make sense. It’s easier to find information when your approach to knowledge organization and expression is similar to that reflected in the unfamiliar taxonomy. Ideally, the approaches are based on accepted and shared standards, which facilitate a searcher’s navigating a taxonomy. Those standards also support interoperability between vocabularies, their capacity to meld concepts and organizational structures in a reasonably smooth fashion. For taxonomies, that means following the ANSI/NISO Z39.19 standard, Guidelines for the Construction and Management of Monolingual Controlled Vocabularies, the British Standard BS 8723 Structured Vocabularies for Information Retrieval, or the comparable international standards ISO 2788 for monolingual thesauri and ISO 5964 for multilingual thesauri.
My short visit to Bali reminded me that travel and adventure present wonderful opportunities to explore. It is great to break out of everyday routines and enjoy what’s novel and unexpected. Surprises are usually welcome and often the highlight of a trip. But quirky doesn’t cut it when the goal is efficiency and productivity. Especially for taxonomy work, standards are good.
Alice Redmond-Neal
Chief Taxonomist, Senior Editor
Access Innovations
Originally posted on October 25, 2010
The Power Of Survey Taxonomies To Skew The Results The Way You Want Them
April 18, 2011
Posted in Access Insights, Featured, reference, semantic, Standards, Term lists
I went to the doctor’s office this week and they asked if I would participate in a short Federal survey. I said sure.
“What is your nationality?”
“American,” I said.
“That is not an option,” said the lady.
“What are the options?” I asked.
“Hispanic, Asian, Asian Black, African, Central American, Chicano, Cuban, Hispanic Latino, Mexican, Native American, Native Hawaiian, South American, Spanish, White, White Hispanic Other, Unknown and Refused,” she said.
“I am Native American, White, Spanish and Mexican,” I said.
“You can only pick one,” she said.
“I am also Welsh, English, Scottish, German, Dutch, Irish and married to a Bohemian Mexican, Spanish French man. Put me down as Refused!”
She said “Most people are putting down Refused or Other.”
I figured that since I was at the doctor’s office, and many groups have known medical predispositions to diseases, that must be why they were asking. Medical predispositions of some sort (whether susceptibility to certain diseases or response to certain drugs) might actually have been the reason; at least it’s quite plausible. Of course, there’s still a problem with the lack of granularity, whether they’re doing research or predicting risk.
One example is the ingestion of fava beans, which may be fatal for some people of Mediterranean descent. I’ve heard anecdotes about a U.S. Army cook serving up meals with fava beans, and the infirmary subsequently dealing with an influx of very sick people.
I don’t have a reference for the latest version of the OMB Directive (still the 1997 one), but I came across the FDA’s “Guidance for Industry: Collection of Race and Ethnicity Data in Clinical Trials”, which says, in part:
“Differences in response to medical products have already been observed in racially and ethnically distinct subgroups of the U.S. population. …For example, in the United States, Whites are more likely than persons of Asian and African heritage to have abnormally low levels of an important enzyme (CYP2D6) that metabolizes drugs belonging to a variety of therapeutic areas, such as antidepressants, antipsychotics, and beta blockers (Xie 2001). Other studies have shown that Blacks respond poorly to several classes of antihypertensive agents (beta blockers and angiotensin converting enzyme (ACE) inhibitors) (Exner 2001 and Yancy 2001). …Clinical trials have demonstrated lower responses to interferon-alpha used in the treatment of hepatitis C among Blacks when compared with other racial subgroups.”
Ashkenazi Jews are known to be especially vulnerable to certain diseases, e.g., breast cancer. And from an American Association for Cancer Research journal: “62% of the Taiwanese colorectal tumor specimens analyzed exhibited Eps8 over expression.”
Those would be excellent reasons to do this survey. Nope! This classification does not support them. The groups were incredibly unbalanced. All of Asia (Chinese, Korean, Indian, Malay, etc.) is in a single class: half the world’s population under a single classification. “African”? Africa is a huge continent. There are many phenotypes there, and all are grouped into a single lump. The option is simply White, not German, Scandinavian, English, or French, and most Spanish and Portuguese are Caucasian in origin as well.
More background. Many have tried to classify mankind. Bodin’s color classifications in the mid-1500s were descriptive, using neutral terms based on skin color such as “duskish colour, like roasted quinze, black, chestnut, and farish white.”
By the 1600s Bernier settled on four subgroups based on the four quarters of the globe and used Europeans, Far Easterners, Negroes (blacks), and Lapps.
In the 1800s Louis Agassiz made a case for a genre of scientific racism based on creationism and gained a wide following. We have Arthur de Gobineau to thank for the theory of the superior races and the Aryan race. He saw the intermingling of races, like French marrying Germans, as a degenerative process. Thomas Huxley and Charles Darwin were believers in monogenism (all humans descended from one evolutionary process). Huxley separated mankind into 9 types, four of them on the African continent, and three types of Mongoloid. Darwin argued that they were all one species, and in The Descent of Man, chapter VII, he observes that capable judges disagree on whether man “should be classed as a single species or race, or as two (Virey), as three (Jacquinot), as four (Kant), five (Blumenbach), six (Buffon), seven (Hunter), eight (Agassiz), eleven (Pickering), fifteen (Bory St. Vincent), sixteen (Desmoulins), twenty-two (Morton), sixty (Crawfurd), or as sixty-three, according to Burke. This diversity of judgment does not prove that the races ought not to be ranked as species, but it shews that they graduate into each other, and that it is hardly possible to discover clear distinctive characters between them.”
In the later 19th and 20th centuries there were a lot of mental excursions into classifications based on intelligence, skull shape, etc. By the 1930s people had stopped trying to do these types of classifications, and the rise of the Nazis underscored how damaging such classifications can be, leading to ethnic cleansing by those who deemed themselves superior.
In 1954 UNESCO condemned all approaches to classification by race saying that we should not make examples of the Caucasian, Negroid and Mongoloid races but rather talk about ethnic groups which share common cultural ties.
So what is the government doing? Recent news articles have heralded a 40% level of Hispanics in the US. Is that true? Do I have to be only one classification? How reliable are surveys where, of the 18 classifications available, 8 could be roughly grouped as Hispanic (what happened to Iberian?)? Aren’t the Spanish a combination of Moors and Celts? Why do we try to do this?
An interesting way to trace our thinking is to follow the US Census categories. In the 1790 Census the count was made on White Males, White Females, other free persons, and Slaves (all types). In 1940 Mexican was counted as white. In 2010 the census allows for an entire question on Hispanic origin including Argentinean, Salvadoran, etc., and an additional 15 categories for Race. Wikipedia itself has 35 entries for race and ethnicity. Seven of those are Hispanic, with an additional one for non-Hispanic whites.
The American Anthropological Association made recommendations for the classifications for 2010 but they were not accepted by the Census Bureau. There is still no “American” option for those of us who do not fit into one or even two classifications. Let’s see, 8 out of 18 classifications is … about 44 percent. The news says that the Hispanics are 40% of the population. I wonder what the Irish are. If we had a classification for Central Europeans would they be a bigger part of the population?
This shows the power of the classification system in surveys. If you want to get a certain answer, then you make that percentage of questions or answer options the percentage you hope for. How many Chinese Americans? They are under Asian. How many people from India? Look under Asian. Japanese, Filipino, Thai, Vietnamese? Guess where to look. All are classed together. Want to know how many Arabs? Tough!
What if we were to let people put in their own classification? What would the answer be? The 1980 and 1990 censuses came close to that option. But they did not allow multiple posting. You could be either Black or White. If you said White/Black you were classed as White; if you put Black/White you were coded as Black. I do understand that the big mainframe computers of that age had fixed-length fields and coded options with limited sort options. But those days are long past. Now we handle variable-length fields of text and multiple subfields, and we can sort and aggregate information in many ways.
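To make the contrast concrete, here is a minimal Python sketch; it is not how any census was actually processed. It assumes a hypothetical slash-separated write-in format and compares the first-entry coding described above with a tally that keeps every value a respondent lists. The function names and the sample responses are illustrative only.

```python
from collections import Counter

# Hypothetical write-in responses; "/" separates multiple self-identifications.
responses = ["White/Black", "Black/White", "White", "Black/White/Native American"]


def first_entry_tally(answers):
    """Fixed-field style coding: only the first value listed is kept."""
    return Counter(answer.split("/")[0].strip() for answer in answers)


def multi_value_tally(answers):
    """Variable-length coding: every value listed counts once per respondent."""
    tally = Counter()
    for answer in answers:
        values = {part.strip() for part in answer.split("/") if part.strip()}
        tally.update(values)
    return tally


single = first_entry_tally(responses)
multiple = multi_value_tally(responses)
print(single)
print(multiple)
```

With first-entry coding, the order in which someone happens to write their answer decides the category, and anything after the first slash disappears. The multi-value tally counts people per term, so its totals can exceed the number of respondents; that is the expected price of allowing multiple posting.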
What would I do?
- Work from the data. People got really annoyed with the census. Some refused to answer at all. The options were not ones they felt comfortable with. I would let people put in their own assessment. That would give a realistic assessment of what people prefer to think of themselves as. What if we decide to collect the information to see how diverse we really are? Actually we do not have this data, but we should try to collect it. It is not too early to decide on what to collect for the 2020 Census. Perhaps by then we can see …20/20.
- Ensure that the balance of the surveys is truly reflective of the data group. Do not bias it by the questions asked. If about 44% of the answers provide a single grouping, is that really a fair survey? I would not allow surveys that try to cram everyone into a single class (multiple broader terms should be allowed). I would allow as many listings (race or ethnicity) as people want to take the time to put in. We are a melting pot of a country. We used to be proud of it. Now we try to segment and separate, which drives wedges and divides.
- Provide associations. If you let people do their own classification, allowing free associations, then the results would provide linkages the creator of a survey instrument could not foresee. The richness of related terms in the thesaurus or links in the semantic web is a bonus in richness of expression.
- Make a hierarchy. Are all those classifications equal or are some subdivisions of others? Could someone choose a higher level because they are Cuban and Latino? Do some want to be grouped as Hispanic? It does not have to be a single flat list. Let people decide how discretely they want to be classified. That would tell us a lot about the nation. This step takes a lot of care; it is where an unscrupulous or careless group would have power to really slant the survey by the way it organizes the hierarchy.
- Does it matter what the group calls itself? There are shorthand ways of describing every ethnic group and race. Can we allow them to use those names and translate them into officialdom? I think that would make the results a better source of information about the groups themselves. Someone could decide on the preferred term usage, but not at the data collection level. That would interfere with the real data collection.
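The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `USE` and `BROADER` tables, the terms in them, and the function names are all hypothetical, not an official vocabulary. Write-ins are collected as entered, mapped to a preferred term afterward (equivalence), and rolled up through a hierarchy in which a term may have more than one broader term. Associative (related-term) links are left out to keep the sketch short.

```python
from collections import Counter

# Equivalence: hypothetical write-in variants mapped to a preferred term.
USE = {
    "chicano": "Mexican American",
    "mexican american": "Mexican American",
    "cuban": "Cuban",
    "latino": "Hispanic or Latino",
    "hispanic": "Hispanic or Latino",
    "welsh": "Welsh",
}

# Hierarchy: a preferred term may have several broader terms (illustrative).
BROADER = {
    "Mexican American": ["Hispanic or Latino", "North American"],
    "Cuban": ["Hispanic or Latino", "Caribbean"],
    "Welsh": ["European"],
}


def normalize(write_in):
    """Map a write-in to its preferred term; keep unknown terms as entered."""
    cleaned = write_in.strip()
    return USE.get(cleaned.lower(), cleaned)


def with_broader(term):
    """Return the term plus every broader term reachable from it."""
    found = {term}
    stack = [term]
    while stack:
        for parent in BROADER.get(stack.pop(), []):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def tally(respondents):
    """Count each respondent once per term, at every level of the hierarchy."""
    counts = Counter()
    for write_ins in respondents:
        terms = set()
        for write_in in write_ins:
            terms |= with_broader(normalize(write_in))
        counts.update(terms)
    return counts


survey = [["Chicano", "Welsh"], ["cuban", "Latino"], ["Bohemian"]]
counts = tally(survey)
print(counts)
```

Because the preferred-term mapping is applied after collection, the raw write-ins stay available, and an unforeseen answer such as "Bohemian" is kept rather than forced into Other or Refused.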
Summary
If the census and other surveys were built on controlled vocabulary principles, then there would be Associative, Equivalence and Hierarchical options. Working from the data instead of imposing a preferred order on the subjects would give a significantly enhanced data set. In this digital age, we should be able to do much better. We are no longer bound by old-style mainframe computing or tallying all results by hand. Let’s catch the census and other surveys up to the current information standard practice.
Marjorie M.K. Hlava
President, Access Innovations

