Maintaining a Thesaurus in an Excel Workbook, Part 2

May 14, 2012  
Posted in Access Insights, Featured, Standards, Taxonomy

In Part 1, we looked at maintaining a taxonomy in Excel – a set of preferred terms arranged in a hierarchy. This taxonomy structure is a handy way to organize a group of terms and can be used  across an industry for benchmarking or reporting requirements (see Strategies for Incorporating Data Exchange Standards in E-Business Taxonomies advocating for the construction industry and The IFRS Taxonomy, including  the labels used in the International Financial Reporting Standards). Excel works quite well to create and maintain a taxonomy, but how about a thesaurus?

A thesaurus is a taxonomy with enhancements. This is the kind of thesaurus containing a set of terms incorporating  the relevant concepts represented in an electronic collection (as opposed to Roget’s Thesaurus of synonyms). Though valuable for a number of purposes, the first and still most common use is to improve users’ search experience. Appropriate terms from the thesaurus are assigned to content items – articles, people profiles, image captions, policy statements, marketing materials, etc. – and a search using one or several of these terms produces significantly better recall and precision than a full text search based on a user’s impromptu search term.

The enhancements include relationships with other terms (e.g. synonyms, related terms), notes about the term (scope notes, editorial notes, literary warrant), and can also include additional identifying information (ID, URI, source) as well as the term’s history of addition and changes. For example:

Deep Space Network

DEF    A communication network managed by the Jet Propulsion Laboratory for command and control of all planetary flights.

UF      DSN (space network)

GS      networks

          . . . communication networks
          . . . Deep Space Network
          . . . tracking networks

RT      spacecraft tracking

www.sti.nasa.gov/thesvol1.pdf

Alternative crops

Used For                 alternate crops
Broader Term       crops
Related Term        alternative farming
Spanish                   cultivos alternativos

aims.fao.org/website/AGROVOC-Thesaurus/sub

But, for our Excel worksheet, we’ll make the term record horizontal:

This horizontal format allows saving as a tab delimited file, useful for importing into a dedicated thesaurus management application, or as a source file for an XML version of your thesaurus. You can adjust column widths to view the contents more easily; Excel will not catch spelling errors in your labels. (Various ways to view term records, and no need to type labels, are definite advantages of dedicated software.)

The preferred terms are the “chosen ones”, but many concepts have at least one other way to express them. These alternate expressions are called equivalent terms and include synonyms, near-synonyms, and sometimes antonyms (think literacy and illiteracy). They will also include alternate spellings and abbreviations of the preferred term. Though not recommended in the ANSI/NISO Z39.19 standard, common misspellings are often included too. Sometimes referred to as “entry terms”, equivalent terms point to, or lead to, the preferred term for that concept and on to the full thesaurus. Equivalent terms allow the thesaurus builder to include the various words and expressions that a searcher may use to find content. Traditionally, they are called “use for” terms since the preferred term is used for or in place of the equivalent term. UF is used as a label to identify them. Since authors often express central concepts in different ways, equivalent terms are important for automated indexing as well. Other uses for equivalent terms include terms mapped from another organization’s thesaurus, translations for other languages, and terms belonging to deeper branches “rolled up” when a thesurus needs to be displayed as only two or three levels for a navigation hierarchy.

In Excel, we’ll start with three columns reserved for equivalent terms. If a preferred term has more, we can insert additional columns.

Related terms (RTs) connect separate parts of the hierarchy. Product testing and Product defects represent closely related concepts, but if they appear in different branches of a hierarchy (for example, branches named Product development and Customer support), they can be associated as related terms. Identifying this relationship alleviates the need to have a term reiterated in several places in the hierarchy while alerting a user to a different perspective on the captured concept. The related term association is also useful to offer end users the opportunity to broaden a search to include content indexed with a related term as well as with their initial search term.

We’ll start out with a single column for related terms, but like equivalent terms, there may be a need to insert additional columns to handle more than one per preferred term. Related terms always come in pairs (a notation for each of the terms related to each other), so keeping them in a single column or adjacent columns will make it easier to check that all the pairs are complete.

Scope notes are used to clarify intent. A term like politics may be useful to index content about life activities, but much too broad a term for content describing the operation of popular government. In the latter case, where it may be included as a top or 2nd level term, a scope note can clarify that the term should be limited, used only to represent a broad category. Scope notes can also be used to disambiguate what may be misunderstood. A note for the term mercury might clarify its use to represent only the chemical element,instructing that a separate term, mercury poisoning, be used for the metal’s occurrence as a toxin.

We’ll need only one Excel column for scope notes. Other descriptive information, such as a term’s ID number and/or URI, its definition, its literary warrant or source, and a history of the dates added and changed will probably require a single column each.

Instead of labeling each term with its type, we can label the columns:

Use the View ribbon > Freeze Panes button > Freeze Top Row option. Or, if you want the top terms to be always visible also (column A), position your cursor in cell B2 and choose the Freeze Panes option instead.

You can choose a narrow width for the Scope Notes column because when you click on a cell in it, the cell’s entire contents are displayed in the formula bar, above the column letters.

If you need more columns for Related Terms, insert columns as needed.

Make the Related Terms label span all columns by selecting the cells in fow 1 that fall above the related terms and from the Home ribbon, Alignment section, choose Wrap Text (or right-click, choose Format Cells > Alignment tab, and add a check mark in the Wrap Text option in the Text Control section). If you need to add another column later, click the Wrap Text option (or remove check mark) to turn the feature off, reselect the cell span, and click Wrap Text (add check mark) again.

If you aren’t fortunate enough to have a 26” monitor, you may also find the Hide columns feature handy. Concentrating on Related Terms only? Choose the columns to hide by clicking and dragging across the column letters (e.g., F thru J above). Use the Ctrl key to select non-contiguous columns (e.g. if you want Scope Notes to stay visible). Right-click on one of the column letters and choose Hide. (Or, on the Home ribbon, in the Cells section, click the drop-down arrow next to Format, choose the Hide and Unhide group, and then Hide Columns.)

To unhide later, select the columns before and after the hidden section (G and K in this example) by clicking on the column letter and dragging to include the 2nd column. Right-click and choose Unhide (or the same process as used to hide the columns if using the menu ribbon).

Export as Delimited File

A delimited file format is probably the most useful way to transmit your thesaurus terms and term records. Delimited files are used to incorporate the thesaurus with other organzation systems – the intra- or internet web site, your content management system, a directory. Since notes fields (Scope Notes, Editorial Notes, Definition) may include commas in their entries, it’s best not to use a CSV (comma separated values) file. We can use a tab-delimited format instead.

Add labels as required by the application for which the delimited file will be used (e.g. UF=):

  • Right-click on the sheet tab and choose the Move or Copy option to create a copy to work with. Don’t forget to put a checkmark in the Create a Copy box and select which workbook sheet to place the copy in front of.
  • Insert a column to the right of the column requiring labels
  • Use a formula to add the label, for example
    =IF(ISBLANK(F2),”",CONCATENATE(“ID=”,F2))
    and fill down to copy the formula for each cell. While the cells are still selected, copy them (Ctrl + c OR Home > Clipboard section, Copy icon down-arrow, Copy). Use the Up arrow on your keyboard to move back to the top of the column and right- click on the first cell in the original column section (F2 in this example) and choose the Paste Values option (or from Home > Clipboard section, Paste down-arrow, Paste Values section, 1st button (paste values)).
  • Now, delete the added column.
  • For the multiple column sections, insert the same number of columns as are used for your equivalent, related, translation, etc. terms. After entering the formula, copy it across to all added columns (hover your mouse of the formula cell bottom right corner and when you see the + symbol, drag it across the cells to copy to). Then fill down all the columns, copy all, and Paste Values at the cell in the first row, first column of the section you’re labeling.
  • Finally, delete the title row.

Choose the Save As option (F12 key) and open the list of choices for Save as Type:. Select the Text (tab delimited) (*.txt) option. Excel will save the worksheet only, with tabs separating each column.

The tab-delimited file you create will work nicely if you want to import your thesaurus into Data Harmony’s Thesaurus Master® standards-compliant thesaurus management software. Data Harmony’s MAIstro product includes automated indexing as well.

Mary Garcia, Lead Technical Support Specialist
Access Innovations, Inc.