Maintaining a Thesaurus in an Excel Workbook
April 16, 2012
Posted in Access Insights, Featured, Taxonomy, Term lists
There’s been some discussion recently in the Taxonomy Community of Practice LinkedIn group about free or low cost thesaurus management software. I’ve noticed a dearth of postings about using Excel, a very popular tool, particularly if you already have a Microsoft Office license.
Experts disparage Excel as a tool, but it can provide a way to start your thesaurus development. And, if you are mindful of organizing your Excel worksheet so that its data can be imported later into a dedicated tool, you can achieve some important objectives. Excel is indeed the most popular thesaurus management tool. (See the Taxonomy & metadata strategies for effective content management workshop slides, in which taxonomy expert Joe Busch reiterates this.)
First requirement: a hierarchy
It’s not enough to collect the terms that represent the “aboutness” of your electronic collection; it’s also important to put them into a “human usable” format. A hierarchy, or tree, displays the relationship between a broader, more inclusive expression of a concept and increasingly narrower, more specific examples and instances of it. The breadth of your term list, as represented by the top terms, provides insight into the breadth of your content collection. The number of levels in the hierarchy, when representative of groups of content items, describes the depth of the collection.
A hierarchy is very easy to create and maintain in Excel. The columns provide the levels, and a row for a new term can be easily added with a right click on a row number and the choice of Insert. Synonyms, scope notes, and related terms (the term record) can be added in columns beyond the last hierarchical level (more about these next week). This works because Excel does not “complain” about empty cells within a row; it just doesn’t like empty rows within a text ‘list’. Dedicating the first three to six columns to terms only allows you to adjust the column widths until you get a pleasing hierarchy display. And you can select just the columns that include terms as the “print area” to print out a hierarchical list.
Dedicated software tools allow collapse and expansion of the hierarchy to assist in changing focus from general to specific. In Excel, you’ll need to manually select groups of rows (the branches of your hierarchy) and identify them as an outline group (Data ribbon > Outline > Group).
To get a list of the top terms, you can copy Column 1 of your Excel worksheet to another worksheet, add a column label, select the column, and use the Sort and Filter section of the Data ribbon with “Unique records only” checked to remove the blank rows. Use Sort A-Z function on the Data ribbon to put them in alphabetical order.
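For readers who would rather script this step, here is a minimal Python sketch of the same idea. It assumes the worksheet has already been read into a list of rows, with one term per row and the column position giving the level; the sample terms and the function name `top_terms` are made up for illustration.

```python
def top_terms(rows):
    """Return the unique, non-blank values of the first column, sorted A-Z."""
    # A set drops duplicates; blank first-column cells are skipped entirely.
    unique = {row[0].strip() for row in rows if row and row[0].strip()}
    # casefold makes the sort case-insensitive, like a typical A-Z sort.
    return sorted(unique, key=str.casefold)


# Hypothetical hierarchy: column 1 = top terms, column 2 = second level, etc.
rows = [
    ["Animals", "", ""],
    ["", "Mammals", ""],
    ["", "", "Dogs"],
    ["Plants", "", ""],
    ["", "Trees", ""],
]

print(top_terms(rows))
```

The rows could come from a tab-delimited export of the sheet; how you load them is up to you.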
A Different View
Dedicated software provides different views of your terms, for example alphabetical, permuted, or a list with full term records. The easiest of those to reproduce in Excel is an alphabetical list of terms. One way to accomplish this is to copy the term hierarchy columns to a new worksheet and use a concatenate formula in the next empty column, =CONCATENATE(A1,B1,C1,D1,E1,…), to get all your terms in a single column. Copy the formula down (Home ribbon > Editing > Fill > Down) to the last row of your hierarchy to display all the terms in a single column. To convert from displaying the results of your formula to an actual list of terms, copy the column and use Paste Special > Values (a right click option, or from the Home ribbon > Clipboard section) to create the list. Then, select the list (click on the first term and hold down the Shift key while you tap the End key and then the Down arrow (↓) key) and use the Sort A-Z function on the Data ribbon to get the terms in alphabetical order.
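The same flatten-and-sort step can be sketched outside Excel. The Python below assumes one term per row with the remaining hierarchy cells blank, so joining the cells of a row yields that row's single term, just as the concatenate formula does. The sample rows and the name `alphabetical_terms` are illustrative only.

```python
def alphabetical_terms(rows):
    """Collapse the hierarchy columns into one column and sort it A-Z."""
    # Joining the cells mirrors the concatenate formula: with one term per
    # row, the joined string is simply that term.
    terms = ["".join(cell.strip() for cell in row) for row in rows]
    # Drop rows that were entirely blank, then sort case-insensitively.
    return sorted((t for t in terms if t), key=str.casefold)


# Hypothetical hierarchy columns copied from the worksheet.
rows = [
    ["Animals", "", ""],
    ["", "Mammals", ""],
    ["", "", "Dogs"],
    ["Plants", "", ""],
    ["", "Trees", ""],
]

for term in alphabetical_terms(rows):
    print(term)
```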
Exporting your list
To export your list as a delimited text file, choose the “Save As” option from the File button/menu. Give the export file a name and, in the “Save as type” field, pick the Text (tab delimited) (*.txt) option. This means of export saves the entire sheet (but one sheet only). Using tabs as the delimiter prevents a comma appearing in a scope note from throwing off your export. If you need the terms only, copy the hierarchy columns to a separate worksheet first. When the webmaster asks you for the terms in just the top two levels (and maybe for only selected branches) of your taxonomy, copy the columns or branch sections to a new worksheet. In about four steps, and using a formula combining Excel’s IF and CONCATENATE functions, you can deliver just what’s needed.
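As an alternative to the worksheet steps, the "top two levels, selected branches" request can be sketched in Python. This is not the IF/CONCATENATE approach mentioned above, just a scripted equivalent under the assumption that the rows hold one term per row in hierarchy columns. The function name `export_top_levels` and its parameters are hypothetical.

```python
import csv
import io


def export_top_levels(rows, levels=2, branches=None):
    """Return tab-delimited text for the first `levels` columns.

    `branches` is an optional set of top terms; rows under any other
    top term are skipped.
    """
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    current_top = None
    for row in rows:
        # Remember which top-level branch the following rows belong to.
        if row and row[0].strip():
            current_top = row[0].strip()
        if branches is not None and current_top not in branches:
            continue
        kept = row[:levels]
        # Rows whose term sits deeper than `levels` are blank here; skip them.
        if any(cell.strip() for cell in kept):
            writer.writerow(kept)
    return out.getvalue()


# Hypothetical hierarchy rows.
rows = [
    ["Animals", "", ""],
    ["", "Mammals", ""],
    ["", "", "Dogs"],
    ["Plants", "", ""],
    ["", "Trees", ""],
]

print(export_top_levels(rows, levels=2, branches={"Plants"}))
```

Writing the returned text to a .txt file gives the same kind of tab-delimited file that "Save As" produces, limited to the requested levels and branches.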
Thesaurus standards
As your term list grows, you’ll want to check that there are no gaps in your hierarchy (e.g., a 3rd level term as ‘child’ of a 1st level term) and that you haven’t used a top term somewhere else in the hierarchy (which violates the NISO Z39.19 standard and ultimately will cause you trouble). You’ll probably need a Visual Basic program for your worksheet (a sophisticated macro). If you’re not a programmer, sharpen those negotiating skills to get your talented colleagues to develop one for you. Or, consider dedicated thesaurus management software.
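If Visual Basic is not an option, the two checks described above can also be run on an exported copy of the hierarchy. The Python sketch below assumes the rows contain only the hierarchy columns, one term per row; the function name `check_hierarchy` and the message wording are my own, and this is an illustration rather than a full standards check.

```python
def check_hierarchy(rows):
    """Report level gaps and top terms reused deeper in the hierarchy."""
    problems = []
    top = {row[0].strip() for row in rows if row and row[0].strip()}
    prev_level = 0
    for n, row in enumerate(rows, start=1):
        filled = [(i + 1, c.strip()) for i, c in enumerate(row) if c.strip()]
        if not filled:
            continue  # blank row: nothing to check
        level, term = filled[0]
        # A term may sit at most one level deeper than the row before it.
        if level > prev_level + 1:
            problems.append(f"row {n}: '{term}' at level {level} skips a level")
        # A top term should not appear again lower in the hierarchy.
        if level > 1 and term in top:
            problems.append(f"row {n}: top term '{term}' reused at level {level}")
        prev_level = level
    return problems


# Hypothetical rows with one gap and one reused top term.
rows = [
    ["Animals", "", ""],
    ["", "", "Dogs"],
    ["Plants", "", ""],
    ["", "Animals", ""],
]

for problem in check_hierarchy(rows):
    print(problem)
```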
Even though Excel may be a place to start in building your taxonomy, you’ll want to be aware of its limitations and which of them are “deal breakers” for your project and career. Begin evaluating dedicated software from the start.
The Data Harmony thesaurus management software produced by Access Innovations takes care of keeping your thesaurus hierarchy standards-compliant as well as offering many other features that make the people building and using a thesaurus significantly more productive.
Mary Garcia, Lead Technical Support Specialist
Access Innovations, Inc.
Access Innovations, Inc. Creates Taxonomy For Iowa Code, Administrative Code and Acts
April 5, 2012
Posted in Access Insights, indexing, News, Term lists
Enhanced Electronic Index Allows Access to Complete, Extensive Collection of Legal Documents By Topic of Interest
Access Innovations, Inc., a leader in the data management industry, has collaborated with the Iowa Legislative Services Agency to build a customized thesaurus that allows the Iowa Legislature General Assembly to easily access its extensive legal body of existing and proposed laws, bills, acts, and regulations by using controlled, vocabulary-driven indexing in addition to published indexing codes.
The new thesaurus also allows newly published unstructured content to be tied together across multiple indices and will drive a uniform index.
The six-month project utilized Access Innovations’ Data Harmony® software suite with the goal of providing subscription-based delivery of legal documents derived from uniform index entries. The project team created the thesaurus using MAIstro™, a software tool which includes both Thesaurus Master® (thesaurus and taxonomy management) and Machine Aided Indexer (M.A.I.™).
Marjorie M.K. Hlava, president of Access Innovations, explained that the new index makes it much easier for the Iowa Legislative Services Agency to deliver precise information to its users much more quickly.
“Before, they might have had 100 different terms covering laws and regulations related to agriculture. In order to find the right citation or receive updates, a user would have to know the correct term as well as the code. Now, Iowa users can enter a single term or the standardized code to subscribe to ‘Agriculture’ and be notified of changes or proposed changes to state laws on this topic. This saves them a huge amount of time and frustration,” she said.
Hlava added she believes the project can serve as a model for other states interested in creating more efficient and effective indexes covering their past, current and proposed laws, regulations and bills.
The project differed from typical index and thesaurus creation because the Iowa Legislative Services Agency needed to maintain its existing codes from each back-of-the-book index, rather than starting from scratch and creating new codes. One reference alone, the Blue Index, included 2,300 index terms. To create the thesaurus, Access Innovations looked at different methods to apply to each term according to the existing references, tied preferred terms to the existing codes, and added related terms to the preferred terms.
The codes covered previous legislation dating as far back as 1953 to legislation through 2010. Also, the custom taxonomy was built with only four levels in order to meet Iowa Legislative Services’ navigation requirements. Typically, thesauri are not limited by a specified number of levels.
Jack Bruce, a project manager at Access Innovations, noted, “The new index makes it much easier for Iowa to deliver precise information to all their users much more quickly. Their 300-page Iowa Code index is now 22 pages and provides fast, accurate access. I don’t think they could ever go back to the old pre-coordinated indexing days. They were interested in trying it but a bit skeptical at first. Now, they are believers.”
Bruce said that Access Innovations also provided on-site training and ongoing technical support so that Iowa Legislative Services Agency’s indexing staff could get used to the new thesaurus, ask questions, and learn in a hands-on environment how to work with it so they could, in turn, show their users how to get the most from it.
“We believe the new uniform index derived from this thesaurus is going to be an extremely efficient way to deliver a huge volume of information in a useful and easily searchable way for years to come,” Bruce said.
Hlava added, “The Iowa LSA is a very forward thinking group with the vision to harness the Data Harmony technology to streamline access to government information, enabling the democratic process in a much more transparent and efficient way.”
Semantic Integration – Part I of VI
November 28, 2011
Posted in Access Insights, Featured, Taxonomy, Term lists
Introduction to Taxonomies – definitions
To discuss the semantic integration or the leveraging of a taxonomy in search, web sites, mashups and other places, we should first review what taxonomies are. Let’s look at the definitions and then the integration of a taxonomy as a building block for the larger information architecture for an organization. We need to think of taxonomies in that bigger case when we are talking about where we apply them. Once those are out of the way I will review some use-cases and show what makes them work.
A taxonomy is a “knowledge organization system” – or a KOS. There are many types of KOS which you will see in a minute but, basically, it is a set of words that have been organized to control the use of terms used in a field or a “vocabulary” so that things can be easily found in a specific subject area. These Knowledge Organization Systems are usually specific to a knowledge domain or a topical area, a subject area, a scholarly or scientific or an enterprise area. They’re really descriptive labels. We have many names we can call something but we settle on one descriptive label and then we post all the terms we use to that descriptive label as a main term and its synonyms, no matter what we call it.
Taxonomies often put together words (controlled terms) in hierarchical view, meaning a broader/narrower navigation tree so that people can browse a tree to find information quickly. They are certainly used as a storage and retrieval aid. That is why we started using them in the first place. We use them both for tagging information objects as we add them to a data base and to search as we query that same database.
Those controlled vocabularies or those KOS – Knowledge Organization Systems – can be anything from a flat list, like your Saturday to do list, to a list with ambiguity control by using a synonym ring. We might put some structure to it and map the terms into a hierarchy, which gives us a taxonomy. We move to a full thesaurus by adding relationships between the terms, notes and other features, adding the ambiguity control from a synonym ring. The thesaurus has those four features: the hierarchy (taxonomy), the relationships (associative or related terms), ambiguity control (non-preferred terms, synonyms, use references), and various kinds of notes. The next step in complexity would be to define those associative relationships in many different ways, i.e., as an ontology. The final option in the increasing complexity is the Linked Data or Semantic Web options where the actual items described in many different systems are hooked together by this KOS. People are increasingly saying taxonomy or ontology and meaning thesaurus. Whatever the instance is called, they are still all knowledge organization systems just as classification systems are.
As an organization system, we are putting control on the terminology but we also have a hierarchical format – parent-child, genus-species, and broader-narrower type of relationships. Specific items can occur as final leaves on those hierarchies. That is, they are on the branches of those navigation trees. They are sometimes known as narrower term instances; sometimes they are an actual URL or a document itself. They are very common on websites. They are used as drop-down pick lists. They may represent a sizeable directory and, as you will see, a lot of other variations.
Taxonomies themselves, as a single group, are just an emerging set of standards. They aren’t fully standardized as yet. A taxonomy does have a definition within the Z39.19 Controlled Vocabularies standard, published by ANSI (American National Standards Institute) and NISO (National Information Standards Organization), in that it is a “hierarchically organized vocabulary based on a classification system”. That does not include the synonyms or the disambiguation for homographs, for example, or related terms, also known as associative relationships. Those are not included in the classic definition of a taxonomy. You can download this standard at no charge from NISO.
There is a companion item, also a controlled vocabulary and the focus of the standard, which is a thesaurus. A controlled vocabulary focuses on concepts. It’s not the items themselves – not the specific items – but rather the concepts. So, you will group things together by concepts. It also has a hierarchy and it has a lot of different display formats and allows a lot of networks of relationships or related terms that could be considered friends or cousins or aliases. It also allows things like scope notes and term histories. It is a more elaborative and informative addition and there are long-established standards for this area.
Starting with Cutter and Dewey at the turn of the 19th century and moving forward, we have the evolving standards for thesauri. In the same standard, Z39.19, a thesaurus is defined as a controlled vocabulary of terms in natural language order. It is designed for “post-coordination”.
Those are fairly heavy terms. Post-coordination means that the terms are going to be combined at the time that the search or question is asked. So, the terms are combined when we are making a query. We aren’t combining like we did in subject headings and back-of-the-book indexes. We combine those in a “pre-coordinated” fashion – at the time the index is created. In the use of a thesaurus or a taxonomy, we use them in a post-coordination activity. That is a big difference. We need to think about how that happens. When someone is “post coordinate indexing” instead of cataloging, we get a big flip in the way the terms are created. Some people are indexing – doing the back of the book – and others are indexing using a controlled vocabulary. The processes are quite different. So, in the Taxonomy Division, we don’t really talk about back-of-the-book indexing and subject heading indexing. We talk about using a thesaurus or using a taxonomy for that indexing.
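A small Python sketch may make post-coordination concrete. It assumes each document was tagged with single thesaurus terms at indexing time, and the terms are only combined when the query is asked. The index contents, document numbers, and the function name `search` are invented for illustration.

```python
# Hypothetical index: each controlled term points to the documents tagged
# with it. Terms were assigned one at a time, with no pre-built combinations.
index = {
    "Agriculture": {1, 2, 3},
    "Water rights": {2, 3, 5},
    "Taxation": {3, 4},
}


def search(index, *terms):
    """Combine terms at query time: return documents tagged with all of them."""
    sets = [index.get(term, set()) for term in terms]
    return set.intersection(*sets) if sets else set()


print(search(index, "Agriculture", "Water rights"))
```

A pre-coordinated index would instead need an entry such as "Agriculture, water rights" to exist before anyone asked the question.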
We’ve gotten to the point where the terms thesaurus and taxonomy are used interchangeably quite often. People think of a thesaurus as a taxonomy with extras. It has related terms, non-preferred terms as synonyms, or the use and use-for references (another way to talk about them), has scope notes, and a whole lot more. But we need to be careful when we are talking about the word thesaurus because the lay person will automatically begin to assume that we mean a synonymy, like the Roget’s Thesaurus and that is not what we mean at all. Although synonyms are important, they are not the only piece of what we are doing.
This is a shot of what a taxonomy and a thesaurus can look like.
On the left, you see the taxonomy view. We have a broader term and some narrower terms and we have even top terms in the taxonomy view. This view is broader-narrower term driven so it is hierarchical in nature and it is called the taxonomy view.
The full term record focuses around the term in question so it might have broader terms and narrower terms as we see here. But, it could also have a status, related terms, some see also or non-preferred synonym references, as well as scope notes, editorial notes, a history field, facets and other things. This set of fields is kind of the default set – the basic individual set for the system.
The basic taxonomy features are the hierarchy structure, the broad concepts-the general concepts, which are known as broader terms, and then the more specific concepts – but still concepts – are the narrower terms. Related terms are conceptual cousins. They don’t fit into that subsidiary relationship. So, they might have something to do with it, but they are not exactly the same. So, those are conceptual cousins.
Then we have those equivalency relationships – term equivalents – the synonyms, non-preferred terms, the use and use for kinds of references. You can have a lot of those. You can even make them multi-lingual equivalent. You can have an English-Spanish-French-Chinese-German or some other set of languages for all of these synonyms.
We can also have facets, which are another way to sort the data. Frequently, faceted search as it is most broadly approached now is really field-formatted search. But there is a large body of knowledge about faceting information from Ranganathan and other luminaries in the field.
Scope notes might be notes that you want somebody to see. Glossary notes are dictionary definitions, whereas editorial notes might be something you want to keep to yourselves and your team and not show to your user base.
All of those together make up a term record.
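To picture a term record as data, here is a minimal Python sketch. The field names follow the features listed above, but the class name `TermRecord`, the default status value, and the `public_view` helper are assumptions for illustration, not the layout of any particular thesaurus tool.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class TermRecord:
    term: str
    status: str = "candidate"  # hypothetical status value
    broader: list[str] = field(default_factory=list)
    narrower: list[str] = field(default_factory=list)
    related: list[str] = field(default_factory=list)        # conceptual cousins
    non_preferred: list[str] = field(default_factory=list)  # synonyms, use-for
    scope_note: str = ""      # meant to be shown to users
    editorial_note: str = ""  # kept within the team
    history: str = ""

    def public_view(self):
        """Return the record as a dict without the team-only editorial note."""
        data = asdict(self)
        data.pop("editorial_note")
        return data


# Hypothetical record.
record = TermRecord(
    term="Dogs",
    broader=["Mammals"],
    non_preferred=["Canines"],
    scope_note="Domestic dogs only.",
    editorial_note="Check overlap with Pets.",
)
print(record.public_view())
```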
We put concepts into a taxonomy. We put People, Places, and Things into authority lists. Name Authority files are one kind of the authority file options. We have a name authority list, which might be the way you want to talk about somebody. For instance my name might be Marjorie Hlava, or Marge Hlava, or Margie Hlava, or Marjorie M.K. Hlava, or it might be Marjorie Maxine Kimmel Hlava or the even longer name I was baptized with. Getting that name authority so that all of those names post to one place is increasingly important as we get more and more connected. There is a lot of work going on, for example, in author authority files and author disambiguation problems. Libraries have always known about those. Is it Mark Twain or is it Samuel Clemens? There are many other pen name situations that we deal with. Name authority files are important. They are also important in corporations when you talk about brand names. For example, in a pharmaceutical firm where something starts off as a chemical name, becomes a code name, becomes a production name, becomes something they want to try out and actually launch to the public. In different markets in different countries it will have a lot of different names. You need to keep all of those names together for research purposes. An authority file lends itself to that kind of thing.
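Here is a minimal Python sketch of "all of those names post to one place". It assumes a small hand-built authority file; which form is treated as preferred, and the names `AUTHORITY`, `build_lookup`, and `resolve`, are my own choices for illustration.

```python
# Hypothetical authority file: preferred form -> known variant forms.
AUTHORITY = {
    "Twain, Mark": ["Mark Twain", "Samuel Clemens", "Samuel Langhorne Clemens"],
    "Hlava, Marjorie M.K.": [
        "Marjorie Hlava",
        "Marge Hlava",
        "Margie Hlava",
        "Marjorie M.K. Hlava",
    ],
}


def build_lookup(authority):
    """Map every variant (and the preferred form itself) to the preferred form."""
    lookup = {}
    for preferred, variants in authority.items():
        for name in [preferred, *variants]:
            lookup[name.casefold()] = preferred
    return lookup


def resolve(name, lookup):
    """Return the preferred form for a name, or None if it is not in the file."""
    return lookup.get(name.strip().casefold())


LOOKUP = build_lookup(AUTHORITY)
print(resolve("samuel clemens", LOOKUP))
```

Exact matching like this only handles variants that are already listed; misspellings and unseen forms still need review before they are added to the file.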
A synonym set or synonym ring is something where you have a lot of terms that mean essentially the same thing. You might have keywords, descriptors, subject headings, etc., that are equal to each other. You can begin to put some control on the vocabulary with synonyms but you can control it further with broader-narrower terms and on up through thesaurus, ontology, and eventually, when applied to data, a semantic network.
Taxonomy terms are an important part of meta data. In particular, they describe subject meta data, or concept meta data. Meta data itself is all the kinds of information about the information you are talking about. So, it is not the information itself. It is the description of that information. When we talk about a bibliographic citation with indexing, we are really talking about the meta data on the journal article or book. The data “about” data is meta data.
The Dublin Core is one meta data standard but there are a bunch of other ones – meta data, schemas, or standards – that are commonly designed specifically for a body of knowledge.
As I said a taxonomy is one of the building blocks. It’s the words representing concepts. It is also known as semantics. Semantics is a popular way of defining this now. And just like in meta data, you can substitute the word “about”. In semantics, you can substitute the word “words” and you would have a good idea of what is driving the process and how something works. So, semantics is driving the process or the words are driving the process of applying this kind of control to a knowledge domain. The taxonomy is core to that process because it gives the organizing principles for all of the content that you are going to deliver. And, if you are lucky, in your application you are going to be able to use it all the way from website design to new product offerings to searches, to the way things are stored and maybe the records management part of the organization. Looking at all of those different pieces, we want to see where these can be applied in real-life to our systems.
Check back next week when we continue the series and talk about Cases for Semantic Enrichment Using Taxonomies.
Marjorie M.K. Hlava
President, Access Innovations
What’s in a name?
October 31, 2011
Posted in Access Insights, Featured, Term lists
Juliet: “What’s in a name? That which we call a rose, By any other name would smell as sweet.” Romeo and Juliet (II, ii, 1-2)
True, but try finding the right document set for your current project by sniffing it out from within a database of 8 million similar smelling documents. This approach is all too common, very time consuming, and unreliable, leaving you with aromatic, unpalatable results.
What’s in a person’s name? Take my name: Jay Ven Eman. What are the parts? What constitutes the last name or surname? Is that a middle name? My surname is Ven Eman and my first name is Jay. In XML it could look like <First_name>Jay</First_name><Surname>Ven Eman</Surname>. A small sampling of the name variants I’ve seen are Jay Von Eman, Jay Van Eman, Jay van Eman, Jay ven Eman, Jay Veneman, and Jay Venema. I haven’t seen or used any aliases, but you have undoubtedly seen William Henry McCarty, aka Henry Antrim, aka William H. Bonney, aka Billy the Kid. Along with aliases are pseudonyms.
Peoples’ names present a significant challenge. Name variants and aliases are difficult to identify and to track. Cultural differences in formatting names across languages contribute to the problem. Privacy concerns and the desire by many to remain relatively anonymous cause misidentification. Typographical errors, inconsistencies in original entry, and errors introduced during post processing, account for more of the confusion.
The magnitude of the mess is monumental. Facebook is projected to hit 700 million names. The world’s population is estimated to hit 7 billion this month. How will your boss, peers, anyone ever find you? How will you find the people you need?
A thesaurus is designed to provide guidance on all of the possible words that are used to label a rose and still allow it to smell as sweet. Thesaurus concepts can also be applied to proper nouns such as things, places, and people. Synonym relationships can be used for name variants, aliases, and pseudonyms. These can be lumped into a single field or stored separately. Storing them in separate fields allows you to maintain more information about relationship types. When researching historical figures, for example, historians want to know what pseudonyms and aliases have been used and when. A system like our Data Harmony Thesaurus Master® makes it easier to capture and store person-name values. Our XIS™, the XML database management system, uses an XML DTD to control a master data record, or gold record, facilitating name management.
Capturing the core data correctly as it is added to a database system is much easier than cleaning data later. The author submission system Access Innovations developed and installed for the American Society for Information Science and Technology (ASIS&T) is an example of the place to start. The entry template helps with formatting names, capturing each component of a name in separate fields. The entry is then saved in native XML, allowing for greater flexibility in post processing. Entries can be bounced against the master data record for verification by the author. This is a great opportunity to update your database of members or, when applied to a commercial Web site, a way to update your customer database.
The example I use here is for an author submitting a manuscript for publication. At the point of name entry, the entered name can be bounced against your author database as mentioned and can also be integrated with external initiatives such as VIVO and ORCID. ORCID stands for Open Researcher & Contributor ID. It is “an independent, community effort to standardize researcher identification.” If you have a customer database, an opt-in initiative is an ideal way to create a much cleaner customer database.
Cleaning up a name database requires multiple strategies and multiple passes through the database. We make extensive use of semantics along with standard data processing techniques. Semantics add a richer layer to the analysis, improving your odds of getting it right.
A complete discourse on creating and cleansing name databases cannot be covered in this limited venue and while not an insignificant undertaking, the benefits are significant. Do it right and you will end up smelling like a rose by whatever name.
Jay Ven Eman, CEO
Access Innovations http://www.accessinn.com/
Labeling or Profiling?
October 17, 2011
Posted in News, semantic, Term lists
Labeling people based on behaviors is not a new technology or even behavior. This has been done since the beginning of time. However, with the advancement of semantic technology and the flood of data available, it has become overwhelming to discern a simplified approach to organization.
We found this interesting information on Business 2 Community in their article, “Your Taxonomy or Mine?” It outlines a recent and exhaustive attempt to cut through the clutter and simplify segmentation as part of a Center of Excellence initiative for a major technology company. The project revealed some interesting finds and is definitely worth the read, especially if you have any interest in social behaviors and trends.
Melody K. Smith
Sponsored by Access Innovations, the world leader in taxonomies, metadata, and semantic enrichment to make your content findable.
Open Net – What a site!
September 26, 2011
Posted in Access Insights, Featured, Term lists
I was doing some poking around to find out about OpenNet (which the Department of State uses), and I came across a DOE implementation of it (they apparently helped invent it.) Clicking the author link works really well! The site is clean and crisp. Very professional looking.
The “Document Categories” list on the Advanced Search page gave me pause.
There are about 70 categories listed, in no discernible order, except for occasional apparent groupings of consecutive listings. One of those groupings, strangely enough, is “Laser Isotope Separation” and “Other Isotope Separation Information”, while “Isotope Separation” is the first category in the entire list. “Other Weapon Topics” is near the end; various weapon categories are sprinkled throughout the list. I guess you have to go through the whole list to see if your weapon of choice is “other”.
You might at least expect an alphabetical order. Best practices would suggest that we group the 70 terms into a list of 20 – 25 so they will easily display on a web page or in a long drop-down list. After that, a hierarchical grouping would be useful.
Fourth from the end is plain old “Other”. Not to be topped by that, the very last item is “General, Miscellaneous, Administrative, Historical and”. (I might be able to find out what comes after “and” if I did a search.) I’m not sure what the differences are among “Other”, “General”, and “Miscellaneous”. Perhaps in Government speak you need to get all of them covered but seems like a great place to use synonyms.
DOE had an excellent thesaurus inherited from ERDA, which replaced the AEC in 1975. It was built using the standards we all now follow. It became very large without strong controls on the terms added and other governance. More recently it has been subsumed in a joint effort with INIS called the ETDE/INIS Joint Thesaurus Project. This is a good idea, since the combination of the two nuclear information thesauri will better serve the greater global community with a single nomenclature.
The actual site, however, makes one wonder if the lure of indexing using the computer without any help from the human brain made them do away with application of the thesaurus/indexing practice and instead depend on the computer guessing what the user wants. What happened to the idea that the human can enhance search using a computer? I don’t know how anyone finds stuff using this system. No wonder our intelligence systems are so flawed! Or maybe that is the idea: the data is there and open… hard to pinpoint, but it satisfies the openness criteria.
Marjorie M.K. Hlava
President, Access Innovations
Finding New Among the Old
September 15, 2011
Posted in Access Insights, indexing, News, Term lists
Indexing is becoming one of the key information management methods. That is not new news. The fascination of categorizing and putting vast information in a findable form is what drives good taxonomy building.
This subject of the order of things triggered the discussion recently when Access Innovations’ president, Margie Hlava, sat down with Steve Arnold, a technology and financial analyst, owner of ArnoldIT and writer of Beyond Search.
Novelty detection is deployed when you are looking for new terms and uses in the community. Calling old things by new names, creating acronyms, etc. happens every day, but they are not new. So finding “novel” terms is sometimes difficult.
Making it even more complicated, term combinations happen to provide a new way to look at old things. Term mining applications are deployed to work through this maze.
This interesting discussion and approach is worth listening to in full. Listen to the complete podcast here.
Melody K. Smith
Sponsored by Access Innovations, the world leader in thesaurus, ontology, and taxonomy creation and metadata application.
Cost Reduction Options for Hospitals
September 5, 2011
Posted in Autoindexing, indexing, News, Taxonomy, Term lists
Many patients and hospitals alike are concerned about the cost of healthcare. Cuts in Medicare funding, increases in insurance premiums, and new federal requirements (such as the ICD-10 medical coding transition) are putting pressure on everyone to find ways to cut costs.
We found this pertinent information on Becker’s Hospital Review in their article, “6 Ways Hospitals Can Survive the Onslaught of Medicare Cuts.” There are many options for hospitals to endure through the consistent Medicare cuts and laws that affect their bottom dollar. Many are outlined in the referenced article.
Access Innovations can also help. Developer of the M.A.I. machine assisted indexing system and specialists in complex coding, tagging, and indexing, Access Innovations provides a range of services that deliver tag integrity. Access Innovations provides training to a client’s staff and then offers quality assurance and validation services that can:
- Minimize the risk of a coding error
- Identify inappropriate or inadvertently applied tags
- Display a “map” of coding distributions to allow management to get a bird’s-eye view of the coding assignment flow.
Many widely used tagging systems lack a user-friendly interface and may not implement a rigorous ANSI-compliant coding subsystem. Access Innovations’ solutions are ANSI-compliant and implement state-of-the-art technology to speed tagging and reduce errors. For more information, contact Access Innovations.
Melody K. Smith
Sponsored by Access Innovations, the world leader in taxonomies, metadata, and semantic enrichment to make your content findable.
From Simple to Complex
June 20, 2011
Posted in Access Insights, Featured, ontology, Standards, Taxonomy, Term lists
People talk about different kinds of vocabularies. The differences usually have to do with the structure, or lack thereof. Sometimes, people refer to “flat lists”. These are one-level lists with no hierarchical structure, and they can be uncontrolled or controlled. The uncontrolled list is your “Saturday list”: a simple, flat structure with nothing particularly formatted about it. It can also serve as a candidate term list. It might look something like this:
Wash
Trim bushes
Clean cat box
Iron
Water plants
Make birthday cake
It is a simple, natural language list. It can be a list of candidate terms for a taxonomy or thesaurus, serving as an excellent starting point.
Normally, an authority list is a flat list, although it can be organized by broad categories. It often defines the “approved” forms of names. The names are often of people or places. If the list is associated with variant forms of the names, it can be used to control ambiguity. It also can control ambiguity by providing a consistent form of a name for indexing.
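To make the authority list idea concrete, here is a minimal Python sketch, assuming a simple mapping from each approved name to its variant forms. The sample entry and the function names are hypothetical; the point is only that every variant resolves to one consistent form for indexing.

```python
# Hypothetical authority list: approved form -> known variant forms.
AUTHORITY = {
    "United States": ["USA", "U.S.", "United States of America"],
}


def build_variant_index(authority):
    """Map every approved and variant form (case-insensitively) to the approved form."""
    index = {}
    for approved, variants in authority.items():
        index[approved.casefold()] = approved
        for variant in variants:
            index[variant.casefold()] = approved
    return index


def normalize_name(name, index):
    """Return the approved form of a name, or None if the name is not in the list."""
    return index.get(name.strip().casefold())


index = build_variant_index(AUTHORITY)
approved = normalize_name(" usa ", index)
```

A name that is not in the list comes back as None, which is a reasonable cue to send it to an editor rather than guess.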
Synonym rings try to control ambiguity, but they put forth all of the synonyms. You might have descriptor, keyword, subject headings, thesaurus term, taxonomy term, etc., all meaning roughly the same thing. You need to determine which one is the primary way you will talk about it and that is the way you will use it.
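A synonym ring can be sketched as a plain set of equivalent terms with no preferred member. The Python below is an illustrative assumption rather than any product's behavior: a query on any member of the ring is expanded to all members.

```python
# One ring of roughly equivalent terms, taken from the examples above.
RING = {"descriptor", "keyword", "subject heading", "thesaurus term", "taxonomy term"}


def expand_query(term, rings):
    """Return every term in the ring containing `term`, or just the term itself."""
    for ring in rings:
        if term in ring:
            return set(ring)
    return {term}


expanded = expand_query("keyword", [RING])
```

Because no member is marked as primary, the ring by itself does not tell an indexer which form to use. That choice is what turns a ring into a controlled entry.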
A taxonomy is solely the hierarchy, while a thesaurus tries to bring the ambiguity control, the synonyms, the hierarchies and the related terms to it. There are not many standards as yet for taxonomies (although there are many standards for thesauri).
One standard that does address taxonomy is ANSI/NISO Z39.19-2005. It defines a taxonomy as a ‘collection of controlled vocabulary terms organized into a hierarchical structure’. You may notice that it doesn’t say anything about equivalence, homographs, associative relationships, or notes. It is just the hierarchy. A taxonomy, then, is a controlled vocabulary built on parent-child (hierarchical) relationships, in which the specificity happens at the lower levels: at the branches, at the leaves, at the end of the list. Taxonomies are very common on websites. They are also commonly presented as pick lists, such as a drop-down menu of ten or twelve items. Sometimes they are browseable directories. There are many different ways to put them into play.
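A hierarchy of this kind needs nothing more than a parent-to-children mapping. The Python sketch below reuses the “Saturday list” terms from earlier in this post; the grouping categories (“Household chores”, “Laundry”, and so on) are invented for illustration.

```python
# Parent -> children. Terms with no entry are the most specific (leaf) terms.
TAXONOMY = {
    "Household chores": ["Laundry", "Yard work", "Pet care"],
    "Laundry": ["Wash", "Iron"],
    "Yard work": ["Trim bushes", "Water plants"],
    "Pet care": ["Clean cat box"],
}


def render(taxonomy, term, depth=0):
    """Return the hierarchy under `term` as indented lines, one per term."""
    lines = ["  " * depth + term]
    for child in taxonomy.get(term, []):
        lines.extend(render(taxonomy, child, depth + 1))
    return lines


def leaves(taxonomy, term):
    """Return the most specific terms found beneath `term`."""
    children = taxonomy.get(term, [])
    if not children:
        return [term]
    found = []
    for child in children:
        found.extend(leaves(taxonomy, child))
    return found


tree_lines = render(TAXONOMY, "Household chores")
leaf_terms = leaves(TAXONOMY, "Household chores")
```

Printing tree_lines gives the kind of indented display you would see in a browseable directory, and leaf_terms recovers the original flat list.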
A thesaurus is also a controlled vocabulary. Since many thesauri are hierarchical, they may be referred to as taxonomies. However, unlike a simple taxonomy, a thesaurus does have equivalence, synonyms, associative relationships (related terms), and scope notes. It may contain definitions, editorial notes, and mappings from other thesauri and/or from taxonomies.
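The extra parts of a thesaurus term record can be sketched as a small data structure. The Python below uses the customary abbreviations BT, NT, RT, UF, and SN for broader term, narrower term, related term, used for, and scope note; the class name, the field names, and the sample record are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TermRecord:
    term: str
    broader: list = field(default_factory=list)   # BT
    narrower: list = field(default_factory=list)  # NT
    related: list = field(default_factory=list)   # RT (associative relationships)
    used_for: list = field(default_factory=list)  # UF (non-preferred synonyms)
    scope_note: str = ""                          # SN


def format_record(record):
    """Return the record as display lines, one relationship per line."""
    lines = [record.term]
    if record.scope_note:
        lines.append(f"  SN {record.scope_note}")
    groups = (
        ("UF", record.used_for),
        ("BT", record.broader),
        ("NT", record.narrower),
        ("RT", record.related),
    )
    for label, values in groups:
        for value in values:
            lines.append(f"  {label} {value}")
    return lines


# Hypothetical sample record.
record = TermRecord(
    term="Thesauri",
    broader=["Controlled vocabularies"],
    narrower=["Multilingual thesauri"],
    related=["Taxonomies"],
    used_for=["Thesauruses"],
    scope_note="Vocabularies with equivalence, hierarchical, and associative relationships.",
)
display = format_record(record)
```

Strip everything except the broader and narrower fields and what remains is the simple taxonomy described above.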
A thesaurus focuses on concepts. It doesn’t focus on the information object itself. You have to identify the information object, or the concept, by using the thesaurus. Rather than outlining those information objects, as a taxonomy might, it is giving you a guide to those information objects. Thus, for instance, a multilingual thesaurus term record focuses on the concept of the terms, rather than on the term itself in any one particular language. There are different ways to display a thesaurus so that you can see the network of relationships between the terms.
An ontology usually doesn’t do too well on ambiguity, but it is strong on synonyms, big on hierarchy, and has a swarm of additional relationships: “is a”, “as a”, “is part of”, and “kinds of” statements are made when you work in an ontology. Those are different kinds of relationships; restating them using Thesaurus Master software adds a different kind of related term.
Marjorie M.K. Hlava
President, Access Innovations
Pre or Post-Coordinate Indexing?
June 13, 2011
Posted in Access Insights, Featured, indexing, Term lists
Most people think about what they want to search for and are willing to combine their concepts at the time of search. If you think of the way people search in Google, they put in the combination of terms they are thinking of; they are doing the coordination of terms. It is up to the search software to do the intersection of the terms for them and figure out the post-coordination.
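The intersection that the search software performs can be sketched in a few lines of Python. This assumes each term has been indexed separately, with a set of document numbers recorded against it; the postings data and the function name are hypothetical.

```python
# Hypothetical postings: term -> set of document numbers indexed with that term.
POSTINGS = {
    "hospitals": {1, 2, 5},
    "cost reduction": {2, 3, 5},
    "medicare": {5, 7},
}


def post_coordinate_search(terms, postings):
    """Return the documents indexed with every one of the searcher's terms."""
    sets = [postings.get(term, set()) for term in terms]
    if not sets:
        return set()
    # The combination of concepts happens here, at search time.
    return set.intersection(*sets)


two_terms = post_coordinate_search(["hospitals", "cost reduction"], POSTINGS)
three_terms = post_coordinate_search(["hospitals", "cost reduction", "medicare"], POSTINGS)
```

Nothing in the index itself says that “hospitals” and “cost reduction” belong together. The searcher supplies the combination, and the software only has to intersect the sets.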
In the current online environment, very seldom do we put together terms in a pre-coordinate fashion. That is one of the challenges in taking older classified lists – back-of-the-book indexes – and making them into a post-coordinate system.
Pre-coordination of terms is typical of traditional classification systems, and not of most modern taxonomies and thesauri. Classification systems often concatenate separate concepts into a string of terms. Natural language is not used.
I really think that most people search by typing in words as they think of them, so we need to support natural language in our systems. This is part of why Access Innovations uses post-coordination in the taxonomies and thesauri it creates.
In a post-coordinate system, a single concept is represented by a single term. We’re not combining two concepts; we are keeping them separate. That way we are able to do a large amount of automated indexing. If we try to mine the text at the beginning, it is very system-intensive. So, we have taken another path, by and large, for that.
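For contrast, here is a minimal Python sketch of taking a pre-coordinated heading apart into single-concept terms. It assumes the concepts in the string are separated by a double hyphen, which is one common convention and not something every classified list follows; the sample heading and the function name are made up.

```python
def split_precoordinated(heading, delimiter="--"):
    """Split a concatenated heading into its separate single-concept terms."""
    return [part.strip() for part in heading.split(delimiter) if part.strip()]


# Hypothetical pre-coordinated heading.
single_terms = split_precoordinated("Hospitals -- Cost control -- United States")
```

Real legacy lists are rarely this tidy, so the output of a split like this is best treated as a candidate term list for editorial review.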
Marjorie M.K. Hlava
President, Access Innovations



