Adventures of a TaxoTourist

The trip was awesome—a dream exotic vacation to Bali. It was not about eat, pray, love, but a rather unbalanced midpoint to meet my Oz-dwelling daughter. I enjoyed dashes of ecotourism and agritourism, but even in full vacation mode I couldn’t fully suppress my perspective as a taxonomist.

The place is idyllic, verdant, vibrant, unique, and quirky, with wondrous surprises at every turn. A few features were especially surprising, like getting candy instead of small change in a sales transaction (rupiah come in millions, like Italian lire before the euro) and seeing motorbike fuel sold in any handy container, from vegetable oil jugs to Jack Daniel’s bottles. The lack of guard rails on twisty mountain roads, the wildly variable heights of steps in a staircase, and the wobbly concrete panels of sidewalk paving with gaping 2×3 foot holes in them, hopscotched on a dark night in the rain, would set a lawyer’s heart a-flutter.

And then there’s the spelling, a wonderfully creative adventure. The spelling of words on signs, menus, brochures and more never failed to surprise and entertain. Though the word “internet” was consistent, even in remote villages, other words were expressed with endless variety. Initially, I viewed the spelling as a phonology test in linguistics until I realized there was no rule, no rhyme, no reason to it. Seeing an English translation after the Indonesian text did not guarantee I could easily make sense of what I read, nor could I expect the same spelling the next time.

Though much of what I encountered was oriented to tourists, the unexpected—sidewalks, steps, spelling, and more—was well outside the range of my expectations and made navigating a challenge.

Travel is an adventure, venturing beyond the daily norm to get away from the routine and expected. We seek excitement and variety from travel, not efficiency. Yet efficiency has its place. Predictability and standardization make it easy to recognize where we are and how to move around. With familiar landmarks and a sense of what to expect, it is simply easier to find your way. Creatively gesturing to circumvent a language barrier can be fun in travel, but it is an ineffective way to convey an idea or an important direction accurately.

What is true for finding your way around a new place is also true for navigating a taxonomy. What’s around and what comes next should make sense. It’s easier to find information when your approach to knowledge organization and expression is similar to that reflected in the unfamiliar taxonomy. Ideally, the approaches are based on accepted and shared standards, which facilitate a searcher’s navigating a taxonomy. Those standards also support interoperability between vocabularies, their capacity to meld concepts and organizational structures in a reasonably smooth fashion. For taxonomies, that means following the ANSI/NISO Z39.19 standard, Guidelines for the Construction and Management of Monolingual Controlled Vocabularies, the British Standard BS 8723 Structured Vocabularies for Information Retrieval, or the comparable international standards ISO 2788 for monolingual thesauri and ISO 5964 for multilingual thesauri.

My short visit to Bali reminded me that travel and adventure present wonderful opportunities to explore. It is great to break out of everyday routines and enjoy what’s novel and unexpected. Surprises are usually welcome and often the highlight of a trip. But quirky doesn’t cut it when the goal is efficiency and productivity. Especially for taxonomy work, standards are good.

Alice Redmond-Neal
Chief Taxonomist, Senior Editor
Access Innovations

Originally posted on October 25, 2010

The Power Of Survey Taxonomies To Skew The Results The Way You Want Them

I went to the doctor’s office this week and they asked if I would participate in a short Federal survey. I said sure.

“What is your nationality?”
“American,” I said.
“That is not an option,” said the lady.
“What are the options?” I asked.
“Hispanic, Asian, Asian Black, African, Central American, Chicano, Cuban, Hispanic Latino, Mexican, Native American, Native Hawaiian, South American, Spanish, White, White Hispanic Other, Unknown and Refused,” she said.
“I am Native American, White, Spanish and Mexican,” I said.
“You can only pick one,” she said. 
“I am also Welsh, English, Scottish, German, Dutch, Irish and married to a Bohemian Mexican, Spanish French man. Put me down as Refused!” 
She said, “Most people are putting down Refused or Other.”

I figured that, since I was at the doctor’s office and many groups have known medical predispositions to diseases, that must be why they were asking. Medical predispositions of some sort (whether susceptibility to certain diseases or response to certain drugs) might actually have been the reason; at least it’s quite plausible. Of course, there’s still a problem with the lack of granularity, whether they’re doing research or predicting risk.

One example is the ingestion of fava beans, which may be fatal for some people of Mediterranean descent. I’ve heard anecdotes about a U.S. Army cook serving up meals with fava beans, and the infirmary subsequently dealing with an influx of very sick people.

I don’t have a reference for the latest version of the OMB Directive (still the 1997 one), but I came across the FDA’s “Guidance for Industry: Collection of Race and Ethnicity Data in Clinical Trials,” which says, in part:

“Differences in response to medical products have already been observed in racially and ethnically distinct subgroups of the U.S. population. …For example, in the United States, Whites are more likely than persons of Asian and African heritage to have abnormally low levels of an important enzyme (CYP2D6) that metabolizes drugs belonging to a variety of therapeutic areas, such as antidepressants, antipsychotics, and beta blockers (Xie 2001). Other studies have shown that Blacks respond poorly to several classes of antihypertensive agents (beta blockers and angiotensin converting enzyme (ACE) inhibitors) (Exner 2001 and Yancy 2001). …Clinical trials have demonstrated lower responses to interferon-alpha used in the treatment of hepatitis C among Blacks when compared with other racial subgroups.”

Ashkenazi Jews are known to be especially vulnerable to certain diseases, e.g. breast cancer. And from the American Association of Cancer Research Journal “62% of the Taiwanese colorectal tumor specimens analyzed exhibited Eps8 over expression.”

Those would be excellent reasons to do this survey. Nope! This classification cannot support them. The groups were incredibly unbalanced. Chinese, Korean, Indian, Malay, and others all fall under Asian: half the world’s population in a single class. The same goes for African. Africa is a huge continent with many phenotypes, and all are grouped into a single lump. White makes no distinction among German, Scandinavian, English, or French, and most Spanish and Portuguese are Caucasian in origin as well.

More background. Many have tried to classify mankind. Bodin’s color classifications in the mid-1500s were descriptive, using neutral terms based on skin color such as “duskish colour, like roasted quinze, black, chestnut, and farish white.”

By the 1600s Bernier had settled on four subgroups based on the four quarters of the globe: Europeans, Far Easterners, Negroes (blacks), and Lapps.

In the 1800s Louis Agassiz made a case for a genre of scientific racism based on creationism and gained a wide following. We have Arthur de Gobineau to thank for the theory of the superior races and the Aryan race. He saw the intermingling of races, like French marrying Germans, as a degenerative process. Thomas Huxley and Charles Darwin were believers in monogenism (all humans descended from one evolutionary process). Huxley separated mankind into 9 types, four of them on the African continent, and three types of Mongoloid. Darwin argued that they were all one species, and in The Descent of Man, chapter VII, he points out how widely capable judges disagreed on whether man “should be classed as a single species or race, or as two (Virey), as three (Jacquinot), as four (Kant), five (Blumenbach), six (Buffon), seven (Hunter), eight (Agassiz), eleven (Pickering), fifteen (Bory St. Vincent), sixteen (Desmoulins), twenty-two (Morton), sixty (Crawfurd), or as sixty-three, according to Burke. This diversity of judgment does not prove that the races ought not to be ranked as species, but it shews that they graduate into each other, and that it is hardly possible to discover clear distinctive characters between them.”

In the later 19th and 20th centuries there were a lot of mental excursions into classifications based on intelligence, skull shape, etc. By the 1930s people had stopped trying to do these types of classifications, and the rise of the Nazis underscored how damaging such classifications can be, leading to ethnic cleansing by those who considered themselves superior.

In 1954 UNESCO condemned all approaches to classification by race, saying that we should not make examples of the Caucasian, Negroid and Mongoloid races but rather talk about ethnic groups which share common cultural ties.

So what is the government doing? Recent news articles have heralded a 40% level of Hispanics in the US. Is that true? Do I have to be only one classification? How reliable are surveys where, of the 18 classifications available, 8 could be roughly grouped as Hispanic (and what happened to Iberian)? Aren’t the Spanish a combination of Moors and Celts? Why do we try to do this?

An interesting way to trace our thinking is to follow the US Census categories. In the 1790 Census the count was made on White Males, White Females, other free persons, and Slaves (all types). In 1940 Mexican was counted as white. In 2010 the census allows for an entire question on Hispanic origin including Argentinean, Salvadoran, etc., and an additional 15 categories for Race. Wikipedia itself has 35 entries for race and ethnicity. Seven of those are Hispanic, with an additional one for non-Hispanic whites.

The American Anthropological Association made recommendations for the classifications for 2010, but they were not accepted by the Census Bureau. There is still no “American” option for those of us who do not fit into one or even two classifications. Let’s see, 8 out of 18 classifications is … 44.4 percent. The news says that Hispanics are 40% of the population. I wonder what the Irish are. If we had a classification for Central Europeans, would they be a bigger part of the population?

This shows the power of the classification system in surveys. If you want to get a certain answer, then you make that percentage of questions or answer options the percentage you hope for. How many Chinese Americans? They are under Asian. How many people from India? Look under Asian. Japanese, Filipino, Thai, Vietnamese? Guess where to look. All are classed together. Want to know how many Arabs? Tough!

What if we were to let people put in their own classification? What would the answer be? The 1980 and 1990 censuses came close to that option, but they did not allow multiple postings. You could be either Black or White. If you said White/Black you were classed as White; if you put Black/White you were coded as Black. I do understand that the big mainframe computers of that age had fixed-length fields and coded options with limited sort options. But those days are long past. Now we handle variable-length fields of text and multiple subfields, and we can sort and aggregate information in many ways.
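The difference between first-listed coding and multiple posting can be sketched in a few lines of Python. The function names and the slash-separated response format are assumptions made for illustration; this is not the Census Bureau's actual processing.

```python
from collections import Counter

def code_first_listed(response: str) -> list[str]:
    """Fixed-field style: keep only the first group the respondent wrote."""
    return [response.split("/")[0].strip()]

def code_all_listed(response: str) -> list[str]:
    """Multiple posting: keep every group the respondent wrote."""
    return [part.strip() for part in response.split("/") if part.strip()]

def tally(responses, coder) -> Counter:
    """Count groups across all responses using the chosen coding rule."""
    counts = Counter()
    for response in responses:
        counts.update(coder(response))
    return counts

responses = ["White/Black", "Black/White", "White", "Black"]
first_only = tally(responses, code_first_listed)
everything = tally(responses, code_all_listed)
print(first_only)   # two White, two Black: the mixed answers are split by word order
print(everything)   # three White, three Black: both mixed answers count in both groups
```

Under the first rule, the two mixed answers land in different groups purely because of the order in which the words were written. Under the second, nothing the respondent said is thrown away.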

What would I do?

  1. Work from the data. People got really annoyed with the census. Some refused to answer at all. The options were not ones they felt comfortable with. I would let people put in their own assessment. That would give a realistic assessment of what people prefer to think of themselves as. What if we decide to collect the information to see how diverse we really are? Actually we do not have this data, but we should try to collect it. It is not too early to decide on what to collect for the 2020 Census. Perhaps by then we can see …20/20.
  2. Ensure that the balance of the surveys is truly reflective of the data group. Do not bias it by the questions asked. If 44.4% of the answer options lead to a single grouping, is that really a fair survey? I would not allow surveys that try to cram everyone into a single class (multiple broader terms should be allowed). I would allow as many listings (race or ethnicity) as people want to take the time to put in. We are a melting pot of a country. We used to be proud of it. Now we try to segment and separate, which drives wedges and divides.
  3. Provide associations. If you let people do their own classification, allowing free associations, then the results would provide linkages the creator of a survey instrument could not foresee. The related terms in a thesaurus, or the links in the semantic web, add a bonus richness of expression.
  4. Make a hierarchy. Are all those classifications equal or are some subdivisions of others? Could someone choose a higher level because they are Cuban and Latino? Would some want to be grouped as Hispanic? It does not have to be a single flat list. Let people decide how discretely they want to be classified. That would tell us a lot about the nation. This step takes a lot of care; it is where an unscrupulous or careless group would have the power to really slant the survey by the way it organizes the hierarchy.
  5. Does it matter what the group calls itself? There are shorthand ways of describing every ethnic group and race. Can we allow them to use those names and translate them into officialdom? I think that would make the results a better source of information about the groups themselves. Someone could decide on the preferred term usage, but not at the data collection level. That would interfere with the real data collection.
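As a rough sketch of how the steps above might fit together, the Python below accepts free-text answers, translates each self-chosen name to a preferred term after collection, and counts respondents at every level of a hierarchy that allows multiple broader terms. The equivalence table, the hierarchy, and all function names are hypothetical and exist only to show the mechanics; they are not a proposed classification.

```python
from collections import Counter

# Hypothetical equivalence table: self-chosen name -> preferred term.
USE = {
    "chicano": "Mexican American",
    "chicana": "Mexican American",
    "cuban": "Cuban",
    "irish": "Irish",
    "welsh": "Welsh",
}

# Hypothetical hierarchy: preferred term -> broader terms (more than one allowed).
BROADER = {
    "Mexican American": ["Hispanic or Latino"],
    "Cuban": ["Hispanic or Latino", "Caribbean"],
    "Irish": ["European"],
    "Welsh": ["European"],
}

def preferred(entry: str) -> str:
    """Translate a respondent's wording into a preferred term.
    Unknown wording is kept as written, so no answer is discarded."""
    cleaned = entry.strip()
    return USE.get(cleaned.lower(), cleaned)

def with_broader(term: str) -> set[str]:
    """Return the term plus every broader term reachable from it."""
    found = {term}
    for parent in BROADER.get(term, []):
        found |= with_broader(parent)
    return found

def tally(respondents: list[list[str]]) -> Counter:
    """Count each respondent once per term, at every level of the hierarchy."""
    counts = Counter()
    for entries in respondents:
        terms = set()
        for entry in entries:
            terms |= with_broader(preferred(entry))
        counts.update(terms)
    return counts

respondents = [
    ["Chicano", "Irish"],
    ["Cuban"],
    ["Welsh", "Irish", "Bohemian"],
]
counts = tally(respondents)
print(counts["Hispanic or Latino"], counts["European"], counts["Bohemian"])  # 2 2 1
```

Because the translation happens after collection, the raw answers remain available, and the preferred terms or the hierarchy can be revised later without re-surveying anyone.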

Summary

If the census and other surveys were built on controlled vocabulary principles, then there would be Associative, Equivalence, and Hierarchical options. Working from the data instead of imposing a preferred order on the subjects would give a significantly enhanced data set. In this digital age, we should be able to do much better. We are no longer bound by old-style mainframe computing or tallying all results by hand. Let’s catch the census and other surveys up to current standard practice in information management.

Marjorie M.K. Hlava
President, Access Innovations

Thesaurus from – and for – the Farm

February 1, 2011  
Posted in indexing, News, reference, Technology, Term lists

February 1, 2011 – The USDA has announced release of the 2011 edition of the on-line NAL Agricultural Thesaurus and Glossary (NALT). This release adds 3,441 new terms and 321 definitions to these vocabulary tools.

Per their web article, “USDA’s National Agricultural Library Releases 2011 Edition of Thesaurus,” terminology in the new edition has been expanded in areas associated with nanotechnology, food safety risk assessment and sustainable agriculture.

The thesaurus and glossary are used primarily for indexing and for improving the retrieval of agricultural information. They also serve as a resource for students, teachers, writers, and others who are seeking precise definitions of words from the agricultural sciences.

The NALT has a great reputation for being a thorough resource and would stand on its own as a great example of good cataloging.  

Melody K. Smith


A Well-Constructed Thesaurus

January 10, 2011  
Posted in Access Insights, Featured, Term lists

A thesaurus is a special type of controlled vocabulary, which is itself a list of specifically selected terms to represent a set of concepts – concepts described in an electronic collection of documents, podcasts, captioned photos, emails, or other types of content.

As content merchants (journal publishers and aggregators, specialized news delivery services, consultants) can attest, a controlled list of concept terms is essential for consistent indexing and ease of searching. With concept terms displayed in a hierarchical tree, like a classic taxonomy, users can most easily locate the term they are interested in using. A thesaurus adds value by describing rich relationships of various sorts, between terms near and far from each other in the tree, between alternate expressions of the concept and the “preferred” term, and between terms and notes about their origin and use, all of which, being electronic, can be used to deliver value to the end user.

A well-structured thesaurus hierarchy enables editorial staff to “drill down” into more and more specific categories, and to readily browse similar neighboring categories, to find the best terms for indexing. It invites web users to explore categories of interest, and to steer toward categories most closely meeting their interests and needs. Terms that logically belong to more than one broader category can be found by various pathways. The broader term – narrower term relationships are established with rigorous logic so that users can anticipate where to find a term for the desired subject area, and so that the thesaurus lends itself to further development.

The term records of a well-constructed thesaurus contain indications of other terms in the thesaurus in which the user may be interested, as well as acronyms and non-preferred synonyms. The term records will also include editorial and scope notes, along with definitions as needed or desired. All elements recommended in the most widely recognized thesaurus standards are present.

Access Innovations is one of a small number of companies that can provide your organization with a well-constructed thesaurus.

Barbara Gilles, Access Innovations Thesaurian

10 Reasons To Resolve To Create A Taxonomy For Your Business In 2011

January 3, 2011  
Posted in Access Insights, Featured, search, Taxonomy, Term lists

Just in time for New Year’s, we have compiled a list of 10 reasons for companies to resolve to create a taxonomy for their business in 2011. Because many businesses are unsure about how a taxonomy could benefit them, we believe offering some actual examples will be useful.

1. Searches on your website or database yield an avalanche of irrelevant returns. If you’re interested in string, a taxonomy associated with indexing software can help you filter out string cheese, string quartets, and string theory. It can also help sort out uses of Java as coffee, software, or an island.

2. Every person or department uses a different term, even though they’re all talking about the same thing. Your coworkers can’t find the company policy for the Fourth (or Fifth, or Sixth) of July, because it’s tagged as Independence Day? An enterprise taxonomy can get all of you searching the same language, if not talking it.

3. You know there’s a perfect search term for what you’re looking for, but you can’t remember it. With a taxonomy, you can browse from a general category to more specific levels. It’s like glancing through a book’s table of contents: “Section 4 looks interesting. So does Chapter 4.3. Aha, I’ve got to read Subchapter 4.3.2!”

4. A coworker just spent 45 minutes trying to locate a document, but didn’t know what search term to use. Taxonomy browsing should work for him or her. And with synonyms, he/she can look for eye doctors or even “optimalogists” and find ophthalmologists.

5. Your masterpiece report remains undiscovered, because it doesn’t fit neatly into any of the usual search topics. When you create a taxonomy, you become aware of terms that beg to be related in some way. They might not be siblings, parents, or children; let’s call them cousins. You can identify them as related terms in a taxonomy. Then people can search on hockey sticks, notice an enticing term to search on, and discover your obscure report on the history of curling equipment. (A taxonomy can also disentangle your report from one on hair curlers).

6.  Six people use six different spellings or punctuations of a term, so none of them can find what they’re looking for. Once again, synonyms to the rescue. Is it Bookkeeping? Or Book keeping? Or Book-keeping? (Or perhaps the new employee misspells it Bookeeping?) You can decide; they’ll all find the information they need. 

7. You have controlled vocabulary terms for left-handed widgets; unfortunately, you need to find the reports on right-handed widgets. When you organize terms using a taxonomy, you can easily see where your content organization scheme might not match your actual content. Then you can add terms for all your specialty widgets.

8. Everything for HR gets called “HR” – all 10,000 documents. Get your indexers, taggers, and searchers browsing down to the more specific terms that a taxonomy can show them. You have HR documents on free pizza as a fringe benefit? Add Fringe benefits as a narrower term, and add Free pizza under Fringe benefits, so people can save some dough.

9. Your website visitors keep searching on “bumbershoots” and “brollies”; forecasts for your umbrella department sales are gloomy. With a taxonomy, you can decide which side of the pond gets the preferred (term) treatment. Add the other term as a synonym. Rainy days will turn sunny for the umbrella department.

10. People ignore your term list because it looks like a shopping list. And you know full well that even when you take your shopping list to the store, there’s a good chance it’ll end up neglected at the bottom of the cart. 

The bottom line is that a good taxonomy can save your staff time, and your organization time and money. In today’s world, no business can afford not to make their website and published content as findable as possible.

Barbara Gilles, Access Innovations Thesaurian
Alice Redmond-Neal, Access Innovations Chief Taxonomist

Best Practices for Creating and Maintaining a Thesaurus with Thesaurus Master or MAIstro

December 8, 2010  
Posted in Access Insights, Term lists

by Barbara Gilles, Access Innovations thesaurian

Observing the practices described below for building a thesaurus will help ensure:

  • Effective searches to enable actual retrieval
  • A rich and functional network of relationships among terms
  • A standards-compliant thesaurus and taxonomy
  • Efficient development of a controlled vocabulary that meets users’ needs

1. Background: The Building Blocks of a Thesaurus 

Building a thesaurus database involves creating conceptual records to support concepts and their relationships. The parts of a conceptual record are:

  • A term representing the concept (the “main term” of the record)
  • Broader terms
  • Narrower terms
  • Related terms
  • Synonyms, common misspellings, and other non-preferred versions that express the concept
  • Notes (definitions, scope notes, etc.)
  • Tracking information
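The parts of a conceptual record map naturally onto a small data structure. The following Python is a minimal sketch, assuming in-memory records keyed by main term; the class and field names are invented for illustration and do not describe how Thesaurus Master or MAIstro stores its data.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class TermRecord:
    term: str                                               # the main term
    broader: set[str] = field(default_factory=set)          # BT
    narrower: set[str] = field(default_factory=set)         # NT
    related: set[str] = field(default_factory=set)          # RT
    non_preferred: set[str] = field(default_factory=set)    # synonyms, misspellings
    notes: dict[str, str] = field(default_factory=dict)     # scope notes, definitions
    created: date = field(default_factory=date.today)       # tracking information

class Thesaurus:
    def __init__(self):
        self.records: dict[str, TermRecord] = {}

    def add_term(self, term: str) -> TermRecord:
        """Return the record for a term, creating it if needed."""
        if term not in self.records:
            self.records[term] = TermRecord(term)
        return self.records[term]

    def add_narrower(self, broader: str, narrower: str) -> None:
        """Write the BT/NT link into both records so it stays reciprocal."""
        self.add_term(broader).narrower.add(narrower)
        self.add_term(narrower).broader.add(broader)

    def add_related(self, a: str, b: str) -> None:
        """Related-term links are symmetric, so record them on both sides."""
        self.add_term(a).related.add(b)
        self.add_term(b).related.add(a)

thesaurus = Thesaurus()
thesaurus.add_narrower("Engines", "Diesel engines")
thesaurus.add_narrower("Fuel", "Diesel fuel")
thesaurus.add_related("Diesel engines", "Diesel fuel")
thesaurus.add_term("Theaters").non_preferred.add("Theatres")
print(thesaurus.records["Diesel engines"].broader)  # {'Engines'}
```

Writing each relationship into both records at once is one way to keep the network consistent: a narrower term always knows its broader term, and the reverse.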

2. Starting a Thesaurus Project 

Get a sense of the subject area with which the thesaurus deals. If you are not already familiar with the subject area to be covered by the thesaurus, now is the time to familiarize yourself with the area’s scope and concepts. Read encyclopedia articles and skim through or study introductory textbooks.

Decide what areas will be central and will receive plenty of expansion as well as what peripheral topics to include.

The thesaurus project administrator should establish team members’ editorial permissions (through the Data Harmony Administrative Module), taking into account the overall workflow, team members’ capabilities and knowledge of the subject area, and the thesaurus review process.

Team members don’t need to be experts in the subject area. Find wordsmiths fascinated with the workings of language.

Even if you are an expert on a subject within the thesaurus scope, you may need to refresh and broaden your perspective.

3. Choosing Terms 

Each term should be self-sufficient. Never use “Other” as a term or as the beginning of a term. Avoid making a term reliant on a broader term to complete its sense. For example, “Diesel” should not be a narrower term of “Engines” or “Fuel.” Instead, use “Diesel engines” under “Engines,” or “Diesel fuel” under “Fuel.”

Gather a list of tentative terms. Brainstorm. Pull main concepts from articles. Use search logs, textbook indexes, terms from already indexed documents, and so forth.

Review the terms for clarity, expanding them as necessary. For example, “Control” is too ambiguous and vague to be a useful term. “Process control” and “Remote control” may be specific enough to work well as thesaurus terms in a technology thesaurus.

Don’t use the same term for two distinct concepts. For example, “Vectors” should not be a narrower term of both “Epidemiology” and “Mathematics”; “Bats” should not be a narrower term of both “Sports equipment” and “Animals.”  Instead, modify or expand the term for one or both of the concepts (“Biological vectors”; “Vector mathematics”; “Baseball and cricket bats”).

Weed out terms that don’t seem appropriate. Use common sense.

Don’t use terms that are too general—take into consideration the scope of the thesaurus. For example, if the entire thesaurus is about process control, don’t use “Process control” as a thesaurus term. For texts that survey the field, you may want to add a term such as “General process control concepts”.

Don’t clutter the thesaurus with terms that are not likely to be used for indexing or searching.

4. Style and Spelling 

All thesaurus terms should be nouns or noun phrases.

Generally, use plural forms. Use common sense in applying this rule; don’t change “Water” and “Money” into “Waters” and “Monies.” Abstract terms, such as the “-tions” and “-ilities,” should normally stay singular. Be careful with words whose meanings change between singular and plural (art/arts, novelty/novelties, quality/qualities, speech/speeches).

In terms consisting of two or more words, observe natural language order. For example, use “Classical musicians” instead of “Musicians, classical.”

The first character of each term should be consistently capitalized or lower case, with certain exceptions. For example, “pH” is correct, although it is contrary to an initial capitalization scheme (and, incidentally, to the practice of making characters after the first one lower case).  And, of course, the first letter of capitalized acronyms should stay capitalized.

Use lowercase as much as practicable. Words with all letters capitalized are difficult to read and can be confused with acronyms.

In general, use a spelled-out term in preference to its acronym.

In some cases, the spelled-out version of an acronym is not widely recognized.  In such cases, the acronym is preferable, as long as the intended meaning is clear. Use “LASIK” or “LASIK surgery” as the main term in a record concerning laser-assisted in situ keratomileusis.

Avoid using acronyms whenever there is any chance of confusion as to the meaning of the acronyms. For example, if your thesaurus covers the 20th century in the United States, SNL could refer to either Sandia National Laboratories or Saturday Night Live.

In general, avoid punctuation.

Avoid the inclusion of hyphens, except for terms that are normally spelled with a hyphen. Never use hyphens simply as punctuation (Trees – deciduous).

Don’t use commas except in proper nouns that require them, such as “Bureau of Alcohol, Tobacco, Firearms and Explosives.”

Avoid the use of parentheses in thesaurus terms.

If a term has one or more alternate spellings, the preferred term should have the spelling that is most widely accepted by the expected or intended users of the thesaurus. Use the alternative spelling as a synonym (UF or Non-Preferred Term).

For geographical spelling variants, such as American English and British English spellings (Theatres/Theaters), consistently use the spelling of one of the styles throughout the thesaurus for the preferred terms. (However, you should not alter spellings of proper nouns. “Drury Lane Theatre” is a perfectly good narrower term of “Theaters.”)
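Several of these style rules are mechanical enough to check automatically before an editor reviews the list. The Python below is a rough sketch of such a check. The function name, the warning wording, and the all-capitals heuristic are assumptions for illustration, and a human editor still has to judge the exceptions, such as proper nouns that require commas.

```python
import re

def style_warnings(term: str, allow_commas: bool = False) -> list[str]:
    """Return style problems found in a candidate thesaurus term."""
    warnings = []
    words = term.split()
    if words and words[0].lower() == "other":
        warnings.append("starts with 'Other'")
    if "(" in term or ")" in term:
        warnings.append("contains parentheses")
    if "," in term and not allow_commas:
        warnings.append("contains a comma")
    # A hyphen or dash with spaces on both sides is punctuation, not spelling.
    if re.search(r"\s[-\u2013\u2014]\s", term):
        warnings.append("hyphen or dash used as punctuation")
    # Rough heuristic: a multiword term in all capitals is probably not an acronym.
    if len(words) > 1 and term.isupper():
        warnings.append("all capitals")
    return warnings

for candidate in ["Classical musicians", "Musicians, classical",
                  "Trees - deciduous", "Other fuels", "pH", "LASIK"]:
    print(candidate, style_warnings(candidate))
```

A proper noun such as “Bureau of Alcohol, Tobacco, Firearms and Explosives” would be checked with `allow_commas=True`, which keeps the exception an explicit editorial decision.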

5. Crafting the Hierarchical Structure 

Establish some general categories within the thesaurus subject area to serve tentatively as top terms.

Eventually, there should be between five and 20 top terms. (Twenty is the number of terms that will fit easily in a screen display of a thesaurus hierarchy.) If it is necessary to have more, try not to have more than 50 top terms.

See what terms can be grouped under an existing term. In general, it’s good to go ahead and make the group of terms narrower terms of the more general term.

A term should fit completely within the category that its broader term represents. This is the all-or-some rule. Remember that if all the instances of a concept fit under another term, that concept’s term may be a good narrower term; if only some of the instances fit, a related term relationship would be more appropriate. In a food thesaurus, “Quiche” or “Quiches” would not be appropriate as a narrower term under “Vegetarian foods,” because some quiches contain bacon or ham.

Don’t group compound terms together simply because they have a word in common. For example, don’t put “Pest control” and “Process control” together under “Control.”

Limit the number of terms at the same level in each branch to 20 terms if practicable. If there are more than 20 terms, see if one of them could serve as a broader term for some of the narrower terms.

In general, a polyhierarchical structure is best. If a term could legitimately be a narrower term of terms in two or more separate branches, go ahead and add those terms as broader terms. For example, “Benjamin Franklin” could have a variety of broader terms, including “Authors”, “Diplomats”, “Inventors”, “Politicians”, and “Scientists”.
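The size guidelines in this section (five to 20 top terms, no more than 50, and about 20 terms per level in a branch) can be checked with a short script. This Python is a sketch under the assumption that the hierarchy is available as a plain mapping from each term to its narrower terms; the function name and the placeholder terms are invented for the example.

```python
def hierarchy_warnings(top_terms, narrower, sibling_limit=20, top_limit=50):
    """Flag departures from the size guidelines.

    top_terms: list of top-level terms
    narrower:  dict mapping a term to the list of its narrower terms
    """
    warnings = []
    count = len(top_terms)
    if count < 5:
        warnings.append(f"only {count} top terms; aim for at least 5")
    elif count > top_limit:
        warnings.append(f"{count} top terms; try not to exceed {top_limit}")
    elif count > sibling_limit:
        warnings.append(f"{count} top terms; more than {sibling_limit} may not fit one screen")
    for term, children in narrower.items():
        if len(children) > sibling_limit:
            warnings.append(
                f"'{term}' has {len(children)} narrower terms; "
                "look for an intermediate broader term"
            )
    return warnings

top_terms = ["Arts", "Government", "Science"]
narrower = {
    "Arts": ["Classical musicians", "Theaters"],
    "Science": [f"Topic {n}" for n in range(1, 22)],  # 21 placeholder terms
}
for warning in hierarchy_warnings(top_terms, narrower):
    print(warning)
```

The limits are parameters rather than constants because the text treats them as guidelines to follow "if practicable", not hard rules.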

6. Developing the Non-Hierarchical Relationships

Related Terms
Provide cross-hierarchy relationships and enrich the information that your term records provide to users by entering existing thesaurus terms in the Related Term field. In a term record for “Quiche,” you could add “Vegetarian foods” as a related term. You can add related terms freely, as long as the concepts they represent are genuinely related in some way. They can even be opposites. Related terms provide additional information for searches.

Non-Preferred Terms
Identify terms that are synonymous with each other. Decide which one to use as a regular thesaurus term. Then consider adding the synonym(s) in that preferred term’s Non-Preferred Term field.

Consider term variants that searchers might use. Add these as non-preferred terms.

For terms for which alternate spellings exist (Cultural centers/Cultural centres), add the alternate spellings as non-preferred terms in the term record of the preferred term.

7. Adding Term Record Annotations

Scope Notes
Enter a note in the Scope Note field of a term for which there might be some uncertainty or misunderstanding about the scope covered by the term. For “Classical music,” the scope note might explain whether or not the term covers concert music written after the Classical era in Europe, and whether or not it covers classical music of non-European cultures such as China, India, and Persia/Iran.

Editorial Notes
Team members should use this field to document information that might be of value to other thesaurus editors or compilers, such as reasons for choosing a particular term in preference to other possibilities.

Other Annotation Fields
The thesaurus project administrator should consider setting up additional fields for such things as definitions, bibliographic references, and cross-references to statutes or external classification systems.

8. Evaluating Your Thesaurus

Proofread your thesaurus. Verify spelling of unfamiliar or difficult terms. Misspellings not only look bad, but also can thwart thesaurus searches, rule base searches, indexing, and document searches.

Check once more for balanced coverage of the subject area, watching out for possible gaps and unnecessary duplication.

When surveying your thesaurus, take advantage of Thesaurus Master’s capability to expand and collapse the visual display of the thesaurus structure. This capability enables you to skim through terms at the same level, and to “drill down” to focus on terms at various levels in a single branch.

Have a person who is knowledgeable in the subject area of the thesaurus do a thorough review of the terms and relationships. Also have a senior editor who is familiar with thesaurus best practices do a thorough review.

Use the thesaurus on a test set of documents to ensure that it covers the content appropriately.

9. Maintaining Your Thesaurus

After your thesaurus has been used for a few months, examine the keywords that editors have suggested but that don’t appear in the thesaurus. Some of those keywords may indicate terms that should be added to the thesaurus.

As concepts and technologies change and advance, so does terminology. Review your thesaurus on a regular basis to determine what updates you should make.

Before adding a new term, always check for existing terms that cover the concept or that could or should be modified. When searching the thesaurus, use single words rather than phrases, and use character strings truncated by an asterisk (a wildcard character in Thesaurus Master), rather than a full word with an overly specific ending.
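The benefit of a truncated search string can be illustrated outside Thesaurus Master. The sketch below uses Python's standard `fnmatch` module as a stand-in; how Thesaurus Master itself matches wildcards may differ, and the term list and the `search_terms` helper are made up for illustration.

```python
from fnmatch import fnmatchcase


def search_terms(terms, pattern):
    """Return the terms that match a wildcard pattern, ignoring case."""
    pattern = pattern.lower()
    return [term for term in terms if fnmatchcase(term.lower(), pattern)]


terms = ["Cultural centers", "Culture", "Agriculture", "Classical music", "Quiche"]

# A truncated stem finds both "Culture" and "Cultural centers".
print(search_terms(terms, "cultur*"))   # ['Cultural centers', 'Culture']

# A leading asterisk also finds terms that contain the stem mid-word.
print(search_terms(terms, "*cultur*"))  # ['Cultural centers', 'Culture', 'Agriculture']
```

A full phrase such as "cultural centres" would have missed the existing term "Cultural centers", which is the kind of near-duplicate this check is meant to catch.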

Before deleting a term, consider the potential effects in relation to documents that have already been indexed with the term, and in relation to future searches.

10. Common Mistakes

  • Overly vague terms
  • Lack of balance in terms
  • Gaps in coverage, sometimes severe
  • Too many top terms
  • Not enough levels (“flat” structure)
  • Same term (essentially) in two places in thesaurus, but with different style (Milky Way galaxy/Milky Way Galaxy; Bird houses/Birdhouses)
  • Two or more synonymous terms (Biochemistry/Biological chemistry) as regular thesaurus terms
  • Too many terms at any one level within a branch
  • Inappropriate NT-BT relationships
  • Term assigned to entirely incorrect/inappropriate place in thesaurus because meaning or relationship misunderstood or not well thought out
  • Term misplaced in thesaurus because of shared word or morpheme
  • Spelling errors (more common than many might think)
  • “Other” as a term or as the beginning of a term
  • Terms that don’t mean anything, or whose meaning isn’t clear, apart from hierarchy context
  • Meaningless (to the user) abbreviations or lingo
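
Some of these mistakes can be caught mechanically. As one example, the following Python sketch flags terms that differ only in capitalization, spacing, or punctuation, such as the Milky Way and birdhouse pairs above. The `style_key` normalization rule is an assumption chosen for illustration; it ignores non-ASCII letters and will not catch true synonyms, which still need an editor's review.

```python
import re
from collections import defaultdict


def style_key(term):
    """Lowercase the term and drop everything except ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", term.lower())


def find_style_variants(terms):
    """Group terms whose normalized keys collide; return groups of two or more."""
    groups = defaultdict(list)
    for term in terms:
        groups[style_key(term)].append(term)
    return [group for group in groups.values() if len(group) > 1]


terms = ["Milky Way galaxy", "Birdhouses", "Quiche", "Milky Way Galaxy", "Bird houses"]
print(find_style_variants(terms))
# [['Milky Way galaxy', 'Milky Way Galaxy'], ['Birdhouses', 'Bird houses']]
```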

Revised December 2010

Knowledge Organization Systems

November 1, 2010  
Posted in Access Insights, Featured, semantic, Taxonomy, Term lists

November 1, 2010 – I attended the 2010 meeting of the American Society for Information Science and Technology (ASIS&T) in Pittsburgh. There were quite a few papers and posters I found aligned with taxonomies and the whole area of linked data, semantic implementations and the Dublin Core. The DC-2010 conference, sponsored by the Dublin Core Metadata Initiative (DCMI), was held immediately prior to ASIS&T in the same hotel so the overlap in participants and programming was spot on for my interests. The ASIS&T annual meeting is the main venue for disseminating research centered on advances in the information sciences and related applications of information technology. It has veered heavily into usability for the last few years and this change back to mainline information science was refreshing!

I moderated a panel discussion, “Knowledge Organization: Evaluating Foundation and Function in the Information Ecosystem,” put together by Jane Greenberg. The panelists (Hollie White, Denise Bedford and Gail Hodge) addressed the complexity of our digital information ecosystem, as the information transfer process becomes increasingly, and in some domains fully, digital. Indicative of this change are entirely new ways in which individuals and information systems generate, provide access to, and link information. In line with this change is a growing need to better integrate and leverage knowledge organization systems (KOS). The presentations were great, the room was packed, and the subsequent discussion was quite lively. How do we know if the work is a “good” taxonomy or not? Effective means are needed to measure the application of KOS as both an integral foundation in the information ecosystem and a core function. There are standards for creation, but they do not really address the need for evaluation and benchmarking. Many people have joined the ranks of “taxonomists” in the past five years. Some are excellent. Some are not. How does the organization in need of this basic building block for linked data and semantic web applications know if the created taxonomy will work for its needs?

We need to create standards. Starting with the basics, a taxonomy itself is a KOS, one of the many kinds of knowledge organization systems. Granted, it can range from simple to highly complex (semantics), but at its core is controlled vocabulary. Evaluating it (measuring its accuracy and determining its relationship to the knowledge domain) starts with getting it built. Identifying the parts of a KOS and, more specifically, the parts of the taxonomy, ontology, and thesaurus is important to an effective evaluation. Then we have to apply it to real data, and it has to be scalable for increasingly large collections of data. NKOS, the Networked Knowledge Organization Systems group, will take up the question. I am proud to be a member of the group. Marcia Zeng runs the discussion list and keeps the records for the organization here.

We have taken the first step, identification of the need, and the second step, identification of an ad hoc group to work on this challenge. I was tasked with working on functional requirements for KOS descriptions and potential registries. It is only the beginning of the work needed to develop standards and guidelines for KOS implementations. This set of meetings brought together KOS researchers, implementers, and developers to examine and share KOS approaches and evaluation strategies. The meeting also brought Linked Data, Dublin Core, publishers and programmers together for a joint discussion, often in the halls or over drinks at receptions. All felt the need to develop a deeper understanding of evaluation methods and gain a picture of an evolving framework for assessing KOS. Hopefully, after sharing our information, new approaches to using these systems effectively were considered, and we walked away with further insight into the research needs and priorities for KOS.

Marjorie M.K. Hlava
President and Chairman
Access Innovations / Data Harmony

Internet Titles – A Who’s Who Explanation

October 26, 2010  
Posted in News, Taxonomy, Term lists

October 26, 2010 – The Internet isn’t new, but for some, navigating sites doesn’t come easily. Those who have mastered it fly like the wind across links and site maps. But who are the people behind the scenes making your daily navigation a snap? An article titled “So You Work On The Internet. But What Do You Do?” offers a much closer look.

Five occupational activities are listed and defined, which should give us a better understanding of “how” the Internet operates or who is operating it.

1 – Information Architecture (IA). The structuring and classifying of information. It spans everything from the front end of defining a taxonomy to the back end of implementing a website’s functionality and code.

2 – User Experience (UX) Design. A user’s perception of, and emotive connection with, a website falls into this category. If users find what they are looking for easily, and then become repeat visitors, the UX designer has done the job correctly.

3 – Programming/Coding/Engineering/Developing. One can say that these are all the same. Different programs are utilized, but the end result is the same.

4 – Design. This is the graphical look and feel of a website or application.

5 – Content Strategy. These folks plan the type, frequency, tone and delivery schedules of content on a website. SEO, social media, UX and media knowledge all fall into this category.

These titles are well behind the scenes of the Internet, but are vital to its existence. They are not the glory jobs that some people have, but if it were easy, everyone would be doing it.

Glenn Black

Sponsored by Access Innovations, the world leader in indexing and making content findable.

Stretching Taxonomy in a Social Way

September 27, 2010  
Posted in Access Insights, Featured, metadata, Term lists

September 20, 2010 – We read with interest “A Revised Taxonomy of Social Networking Data.” In the last month, the arguments in the article have provided a framework for thinking about the amount of information flowing through Facebook, Twitter, Tumblr, and other social media conduits.

Let’s revisit the elements of the social media taxonomy. The principal categories are, and we quote:

“Service data is the data you give to a social networking site in order to use it. Such data might include your legal name, your age, and your credit-card number. Disclosed data is what you post on your own pages: blog entries, photographs, messages, comments, and so on. Entrusted data is what you post on other people’s pages. It’s basically the same stuff as disclosed data, but the difference is that you don’t have control over the data once you post it — another user does. Incidental data is what other people post about you: a paragraph about you that someone else writes, a picture of you that someone else takes and posts. Again, it’s basically the same stuff as disclosed data, but the difference is that you don’t have control over it, and you didn’t create it in the first place. Behavioral data is data the site collects about your habits by recording what you do and who you do it with. It might include games you play, topics you write about, news articles you access (and what that says about your political leanings), and so on. Derived data is data about you that is derived from all the other data. For example, if 80 percent of your friends self-identify as gay, you’re likely gay yourself.”

One question we discussed recently is, “Who controls the metadata?”

In a traditional indexing operation, the database publisher manages the terms, assigns them, and controls access. Usage data is provided by some vendors, but often that information is separate from the controlled terms.

In social media, there is a shift from the traditional term list to the usage information. The result is that social data evoke a different way to think about tagging, its utility, and its implications. The content takes a back seat to information about users.

“Who controls the metadata for social media?” Certainly not the user. The categories seem to emerge from user actions. The output is a new set of data, and the “metadata” angle is essentially outputs from traditional data mining and clustering, among other processes.

The second question is, “Is this type of tagging metadata?”

On the surface, the answer is “Yes.” As we discussed this issue, three observations surfaced:

First, in classical indexing the rules are known and can be explained. A term from a controlled list is assigned to an information object. Period. Clean. Easy to understand. The social media tagging is a different approach. The categories are emergent.

Second, metatagging in a commercial database operation is intentional. In the social media space, emergent processes operate. We don’t think this difference has been fully explored.

Third, traditional tagging operations impart consistency. Testing may be easier to benchmark. In the social media taxonomic system, consistency may not be a key factor. One may do less testing and more exploration or discovery.

Our view is that more work must be done with regard to a social media taxonomy. The difference between traditional indexing and social media indexing seems significant and not well understood. Social media tagging seems to be about business. Traditional indexing is about findability.

Users may be surprised about the intent of social media tagging.

Ken Toth

A New Thesaurus for Online Scholarly Publications

August 3, 2010  
Posted in Access Insights, News, reference, Term lists

August 3, 2010 – Access Innovations, Inc. has collaborated with the American Institute of Physics (AIP) to convert AIP’s Physics and Astronomy Classification Scheme (PACS) into a thesaurus for its nearly two million online scholarly articles indexed by PACS. Using the new thesaurus allows scientists to execute faster, more accurate and more efficient information searches.

“We’ve worked diligently to archive articles from our journals and conference proceedings as a valuable and convenient resource for researchers worldwide,” said Mark Cassar, publisher of journals and technical publications at AIP. “Unfortunately, the articles have become increasingly hard to find. The new thesaurus, however, represents a major step forward, as it allows researchers to achieve successful searches in much less time.”

PACS uses a decimal-based, six-digit alphanumerical notation classification scheme, arranged in a hierarchical format, which is limited to 10 top terms and nine second-level terms. In the past, this classification system resulted in multiple concepts expressed within the same term, and was not search friendly.

For complete details, see the official announcement.

Melody K. Smith

Sponsored by Data Harmony, a unit of Access Innovations, the world leader in indexing and making content findable.
