Maintaining a Thesaurus in an Excel Workbook, Part 2
May 14, 2012
Posted in Access Insights, Featured, Standards, Taxonomy
In Part 1, we looked at maintaining a taxonomy in Excel – a set of preferred terms arranged in a hierarchy. This taxonomy structure is a handy way to organize a group of terms and can be used across an industry for benchmarking or reporting requirements (see Strategies for Incorporating Data Exchange Standards in E-Business Taxonomies advocating for the construction industry and The IFRS Taxonomy, including the labels used in the International Financial Reporting Standards). Excel works quite well to create and maintain a taxonomy, but how about a thesaurus?
A thesaurus is a taxonomy with enhancements. This is the kind of thesaurus containing a set of terms incorporating the relevant concepts represented in an electronic collection (as opposed to Roget’s Thesaurus of synonyms). Though valuable for a number of purposes, the first and still most common use is to improve users’ search experience. Appropriate terms from the thesaurus are assigned to content items – articles, people profiles, image captions, policy statements, marketing materials, etc. – and a search using one or several of these terms produces significantly better recall and precision than a full text search based on a user’s impromptu search term.
The enhancements include relationships with other terms (e.g. synonyms, related terms), notes about the term (scope notes, editorial notes, literary warrant), and can also include additional identifying information (ID, URI, source) as well as the term’s history of addition and changes. For example:
Deep Space Network
DEF A communication network managed by the Jet Propulsion Laboratory for command and control of all planetary flights.
UF DSN (space network)
GS networks
. . . communication networks
. . . Deep Space Network
. . . tracking networks
RT spacecraft tracking
Alternative crops
Used For alternate crops
Broader Term crops
Related Term alternative farming
Spanish cultivos alternativos
aims.fao.org/website/AGROVOC-Thesaurus/sub
But, for our Excel worksheet, we’ll make the term record horizontal:
This horizontal format allows saving as a tab delimited file, useful for importing into a dedicated thesaurus management application, or as a source file for an XML version of your thesaurus. You can adjust column widths to view the contents more easily; Excel will not catch spelling errors in your labels. (Various ways to view term records, and no need to type labels, are definite advantages of dedicated software.)
The preferred terms are the “chosen ones”, but many concepts have at least one other way to express them. These alternate expressions are called equivalent terms and include synonyms, near-synonyms, and sometimes antonyms (think literacy and illiteracy). They will also include alternate spellings and abbreviations of the preferred term. Though not recommended in the ANSI/NISO Z39.19 standard, common misspellings are often included too. Sometimes referred to as “entry terms”, equivalent terms point to, or lead to, the preferred term for that concept and on to the full thesaurus. Equivalent terms allow the thesaurus builder to include the various words and expressions that a searcher may use to find content. Traditionally, they are called “use for” terms since the preferred term is used for or in place of the equivalent term. UF is used as a label to identify them. Since authors often express central concepts in different ways, equivalent terms are important for automated indexing as well. Other uses for equivalent terms include terms mapped from another organization’s thesaurus, translations for other languages, and terms belonging to deeper branches “rolled up” when a thesaurus needs to be displayed as only two or three levels for a navigation hierarchy.
In Excel, we’ll start with three columns reserved for equivalent terms. If a preferred term has more, we can insert additional columns.
Related terms (RTs) connect separate parts of the hierarchy. Product testing and Product defects represent closely related concepts, but if they appear in different branches of a hierarchy (for example, branches named Product development and Customer support), they can be associated as related terms. Identifying this relationship alleviates the need to have a term reiterated in several places in the hierarchy while alerting a user to a different perspective on the captured concept. The related term association is also useful to offer end users the opportunity to broaden a search to include content indexed with a related term as well as with their initial search term.
We’ll start out with a single column for related terms, but like equivalent terms, there may be a need to insert additional columns to handle more than one per preferred term. Related terms always come in pairs (a notation for each of the terms related to each other), so keeping them in a single column or adjacent columns will make it easier to check that all the pairs are complete.
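If the worksheet grows large, checking the pairs by eye gets tedious. As a rough sketch only, the Python below shows one way to list related-term links that have been entered in one direction but not the other. It assumes the Related Terms columns have already been read into a dictionary; the function name and the sample data are illustrative, not part of any tool mentioned here.

```python
def missing_reciprocals(related):
    """Return (term, related_term) pairs whose reverse link is absent.

    `related` maps each preferred term to the list of its related terms
    (a hypothetical in-memory form of the Related Terms columns).
    """
    missing = []
    for term, rts in related.items():
        for rt in rts:
            # A complete pair needs the link recorded in both directions.
            if term not in related.get(rt, []):
                missing.append((term, rt))
    return sorted(missing)


related_terms = {
    "Product testing": ["Product defects"],
    "Product defects": [],  # reverse link not yet entered
    "Alternative crops": ["Alternative farming"],
    "Alternative farming": ["Alternative crops"],
}

print(missing_reciprocals(related_terms))
```

Each pair reported is a row where the second term still needs the first term added to its own Related Terms columns.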
Scope notes are used to clarify intent. A term like politics may be useful to index content about life activities, but much too broad a term for content describing the operation of popular government. In the latter case, where it may be included as a top or 2nd level term, a scope note can clarify that the term should be limited, used only to represent a broad category. Scope notes can also be used to disambiguate what may be misunderstood. A note for the term mercury might clarify its use to represent only the chemical element, instructing that a separate term, mercury poisoning, be used for the metal’s occurrence as a toxin.
We’ll need only one Excel column for scope notes. Other descriptive information, such as a term’s ID number and/or URI, its definition, its literary warrant or source, and a history of the dates added and changed will probably require a single column each.
Instead of labeling each term with its type, we can label the columns:
Use the View ribbon > Freeze Panes button > Freeze Top Row option. Or, if you want the top terms to be always visible also (column A), position your cursor in cell B2 and choose the Freeze Panes option instead.
You can choose a narrow width for the Scope Notes column because when you click on a cell in it, the cell’s entire contents are displayed in the formula bar, above the column letters.
If you need more columns for Related Terms, insert columns as needed.
Make the Related Terms label span all columns by selecting the cells in row 1 that fall above the related terms and from the Home ribbon, Alignment section, choose Wrap Text (or right-click, choose Format Cells > Alignment tab, and add a check mark in the Wrap Text option in the Text Control section). If you need to add another column later, click the Wrap Text option (or remove check mark) to turn the feature off, reselect the cell span, and click Wrap Text (add check mark) again.
If you aren’t fortunate enough to have a 26” monitor, you may also find the Hide columns feature handy. Concentrating on Related Terms only? Choose the columns to hide by clicking and dragging across the column letters (e.g., F thru J above). Use the Ctrl key to select non-contiguous columns (e.g. if you want Scope Notes to stay visible). Right-click on one of the column letters and choose Hide. (Or, on the Home ribbon, in the Cells section, click the drop-down arrow next to Format, choose the Hide and Unhide group, and then Hide Columns.)
To unhide later, select the columns before and after the hidden section (G and K in this example) by clicking on the column letter and dragging to include the 2nd column. Right-click and choose Unhide (or the same process as used to hide the columns if using the menu ribbon).
Export as Delimited File
A delimited file format is probably the most useful way to transmit your thesaurus terms and term records. Delimited files are used to incorporate the thesaurus with other organization systems – the intranet or internet web site, your content management system, a directory. Since notes fields (Scope Notes, Editorial Notes, Definition) may include commas in their entries, it’s best not to use a CSV (comma separated values) file. We can use a tab-delimited format instead.
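To see why tabs are the safer delimiter here, consider a minimal Python sketch that writes one term record with the standard csv module. The column order, the equivalent term, and the wording of the scope note are made up for illustration.

```python
import csv
import io

# One hypothetical term record: preferred term, UF, RT, scope note.
row = [
    "mercury",
    "quicksilver",
    "mercury poisoning",
    "Use for the chemical element, not for its occurrence as a toxin.",
]

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(row)
line = buffer.getvalue()

# Splitting on tabs recovers the original fields even though the
# scope note contains a comma.
fields = line.rstrip("\n").split("\t")
print(fields == row)

# A naive split on commas would cut the scope note in two.
print(len(line.split(",")))
```

The comma inside the scope note is harmless when tabs separate the columns, which is the point of choosing the tab-delimited format.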
Add labels as required by the application for which the delimited file will be used (e.g. UF=):
- Right-click on the sheet tab and choose the Move or Copy option to create a copy to work with. Don’t forget to put a checkmark in the Create a Copy box and select which workbook sheet to place the copy in front of.
- Insert a column to the right of the column requiring labels
- Use a formula to add the label, for example
=IF(ISBLANK(F2),"",CONCATENATE("ID=",F2))
and fill down to copy the formula for each cell. While the cells are still selected, copy them (Ctrl + C, or Home > Clipboard section, Copy icon down-arrow, Copy). Use the Up arrow on your keyboard to move back to the top of the column, right-click on the first cell in the original column section (F2 in this example), and choose the Paste Values option (or from Home > Clipboard section, Paste down-arrow, Paste Values section, 1st button (paste values)).
- Now, delete the added column.
- For the multiple column sections, insert the same number of columns as are used for your equivalent, related, translation, etc. terms. After entering the formula, copy it across to all added columns (hover your mouse over the bottom right corner of the formula cell and, when you see the + symbol, drag it across the cells you want to copy to). Then fill down all the columns, copy all, and Paste Values at the cell in the first row, first column of the section you’re labeling.
- Finally, delete the title row.
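The labeling steps above can also be sketched outside Excel. The Python below mirrors the IF/ISBLANK/CONCATENATE formula: blank cells stay blank and non-blank cells get their column's label in front. The function name, the column layout, and the sample ID are assumptions for illustration; use whatever labels the receiving application requires.

```python
def add_labels(row, labels):
    """Prefix each non-blank cell with the label for its column.

    `labels` holds one label per column; an empty string means the
    column (for example, the preferred term) gets no label.
    """
    return [
        f"{label}{cell}" if cell != "" and label else cell
        for cell, label in zip(row, labels)
    ]


# Hypothetical column layout: preferred term, two UF columns, ID.
labels = ["", "UF=", "UF=", "ID="]
row = ["Alternative crops", "alternate crops", "", "1234"]

print(add_labels(row, labels))
```

As with the formula, the empty second UF cell is left empty rather than turned into a bare "UF=" label.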
Choose the Save As option (F12 key) and open the list of choices for Save as Type:. Select the Text (tab delimited) (*.txt) option. Excel will save the worksheet only, with tabs separating each column.
The tab-delimited file you create will work nicely if you want to import your thesaurus into Data Harmony’s Thesaurus Master® standards-compliant thesaurus management software. Data Harmony’s MAIstro product includes automated indexing as well.
Mary Garcia, Lead Technical Support Specialist
Access Innovations, Inc.
Natural Language Processing Only Goes So Far
May 7, 2012
Posted in Access Insights, Featured, search, semantic
The siren call of natural language processing has been issued again. In this study, the researchers compared the use of free text searches to administrative codes to see which would give a better indication of safety based on 20 indicators. The authors rightly suggest that instead of reliance only on the notoriously poor check boxes used with discharge orders, the hand-written notes from the discharge nurse or physician might be much more instructive.
There is a huge cognitive leap in this article, though, when they assume that natural language processing (NLP) will always be better than the 20 critical care indicators. NLP Systems are not designed to provide precision and recall. They are designed to give excellent indications of data and gist of meaning and guide the user in discovery. They are often combined with Bayesian systems (or other statistical methods like latent semantic, neural net, vector based) to further enhance the discovery aspects of the search.
However, in the end the searcher will have to prove where and how the data was accessed. This is particularly true when the discoveries lead to court cases. Using ever-changing pointers and word values only goes so far. They do not provide replicable results. They do not provide the same additive results. So if more data is added to the system, the information presented will change. The combination could lead first to excellent NLP discovery and then pinpoint the activities using more accurate Boolean approaches. Use the discovery the 5% of the time that you are in discovery mode trying to figure out the trends, and then use the accurate Boolean (AND, OR, and NOT commands) to zoom in on and collect all of the data related to the topic.
Marjorie M.K. Hlava
President, Access Innovations
Originally published September 19, 2011
The DSM-5 Draft: Half-Baked Meatloaf?
April 30, 2012
Posted in Access Insights, Featured, Taxonomy
For a short time in the 1990s, I helped a small company develop proposals for providing mental health services. My desk housed two standard references: the Chicago Manual of Style, and the fourth edition of the Diagnostic and Statistical Manual (DSM-IV) of the American Psychiatric Association (APA). The DSM is essentially a classification system, like a taxonomy, providing a structured list of the psychiatric diagnoses used by mental health practitioners.
The DSM has long been regarded as the psychiatrists’ bible, just as the Chicago Manual is still widely and highly regarded as the writers’ bible. It looks like the DSM is now going down a different path. The APA’s proposed changes for the May 2013 release of the DSM-5, which were made public two years ago, have triggered strong letters of protest from numerous experts and organizations, and even from within its own ranks. Some petitions are circulating, and an “Occupy the APA” protest demonstration is planned for this May.
As reported by Gary Greenberg (“Inside the Battle to Define Mental Illness,” Wired Magazine, January 2011), “Psychiatrists at the top of their specialties, clinicians at prominent hospitals, and even some contributors to the new edition have expressed deep reservations about it.” One of the most prominent and outspoken critics of the DSM-5 draft is Dr. Allen Frances, lead editor of the DSM-IV (he readily admits some of its errors and urges that they be corrected). Dr. Frances has fired off several open letters to the APA (see his blog) that have been published by the Psychiatric Times. Greenberg observes that “some are beginning to agree with Frances that public pressure may be the only way to derail a train that he fears will take psychiatry off a cliff.”
Do the DSM draft’s faults, whatever they are, matter all that much? Well, as Greenberg explains, “The book is the basis of psychiatrists’ authority to pronounce upon our mental health, to command health care dollars from insurance companies for treatment and from government agencies for research. It is as important to psychiatrists as the Constitution is to the US government or the Bible is to Christians. Outside the profession, too, the DSM rules, serving as the authoritative text for psychologists, social workers, and other mental health workers; it is invoked by lawyers in arguing over the culpability of criminal defendants and by parents seeking school services for their children. If, as Frances warns, the new volume is an “absolute disaster,” it could cause a seismic shift in the way health care is practiced in this country. It could cause the APA to lose its franchise on our psychic suffering, the naming rights to our pain.”
Why all the controversy? There are numerous reasons. One root cause is that the task force responsible for revision seems to have ignored best practices for classification system construction and revision. The task force has been widely accused of 1) a closed, secretive development process; and 2) resistance to constructive criticism and external recommendations.
The closed, secretive approach has promoted distrust, especially as details of potential conflicts of interest (including contracts with pharmaceutical companies) have emerged. Revision contributors have complained that they were obliged to sign nondisclosure agreements before participating in DSM-5 draft development. Benedict Carey, in an article for The New York Times (“Psychiatrists Revise the Book of Human Troubles”), quotes DSM-III lead reviser Dr. Robert Spitzer as saying, “When I first heard about this agreement, I just went bonkers. Transparency is necessary if the document is to have credibility, and, in time, you’re going to have people complaining all over the place that they didn’t have the opportunity to challenge anything.”
Furthermore, the closed approach seems to have gone hand in hand with a resistance and lack of responsiveness to external criticism and recommendations. The draft changes have been online and open to public comment since February of 2010. However, despite the many subsequent criticisms, the committee has persisted in defending its proposed changes, even the more controversial ones.
As reported by Michelle Diament on March 28, 2012 (“DSM Committee Standing Firm on Autism Changes”), “Members of the committee tasked with updating the diagnostic criteria for autism appear to be digging in as critics worry that proposed changes will strip many of their diagnosis. In a commentary released this week, members of the American Psychiatric Association panel charged with revising the autism definition appearing in the forthcoming edition of the Diagnostic and Statistical Manual of Mental Disorders, or DSM, defended the changes they’re proposing.” Despite petitions, and despite two recent scientific reports that indicate that a large number of people who rely on various medical and other services would lose those services with the proposed changes to the DSM-5, “the committee continues to defend its proposal.”
The large potential influence on funding, diagnosis, treatment, and services is why many parents and advocates of people on the “autism spectrum” are concerned about the proposed changes, as reported by Amy Harmon in The New York Times (“A Specialists’ Debate on Autism Has Many Worried Observers”). (Note: Autism, Asperger syndrome, and other autism spectrum conditions are neurological conditions, not mental disorders, but their inclusion in DSM means that the DSM governs diagnosis, funding, etc.) Harmon quotes the DSM task force: “We have to make sure not everybody who is a little odd gets a diagnosis of autism or Asperger disorder. It involves a use of treatment resources. It becomes a cost issue.” The committee’s solution was to lump all autism spectrum conditions together, subsuming autism, Asperger disorder, and “not otherwise specified” pervasive developmental disorders under the umbrella of a single diagnosis, “Autism spectrum disorder.” At the same time, they narrowed the scope of what conditions would be covered by that umbrella, forcing a narrowing of the options of clinicians and service providers.
As for the controversial disappearance of Asperger syndrome from the DSM, it’s not just about funding. Harmon quotes Michael John Carley, director of the Global and Regional Asperger Syndrome Partnership: “Having a diagnosis helps people understand why we process thoughts and emotions differently and make positive changes. Sadly, we may be heading back to the days when our differences are seen through the lens of character deficits rather than in the context of brain wiring.”
The committee justified their approach in an unusual and disturbing way on the APA’s DSM-5 website: “A single spectrum disorder is a better reflection of the state of knowledge about pathology and clinical presentation; previously, the criteria were equivalent to trying to ‘cleave meatloaf at the joints.’” The perceived problem was one of blurred boundaries. Taxonomies and classification systems often present taxonomists with the dilemma of blurred boundaries. Tossing out the more specific classifications might temporarily make things easier, but it sacrifices specificity down the road. True, taxonomies and indexing must do a balancing act between specificity and sensitivity, so as to adequately cover what’s represented by the text or the situation while not being too vague about it. In the present case, though, the APA has chopped off the ends of the so-called meatloaf, at the same time that it refuses to slice it. It has sacrificed both specificity and sensitivity.
Classification systems such as the DSM are not merely tools of convenience; they can shape thinking and practice in entire professional, scholarly, and scientific areas. Therefore, the revisers have a special responsibility to users and to others who may be affected by it. They might carefully set an initial framework, but from then on they should be open to at least considering the comments and recommendations of experts and users.
Barbara Gilles, taxonomist
Access Innovations, Inc.
Access Innovations, Inc. Creates Taxonomy For Iowa Code, Administrative Code, and Acts
April 23, 2012
Posted in Access Insights, Featured, Taxonomy
Enhanced Electronic Index Allows Access to Extensive Collection of Legal Documents By Topic of Interest
Access Innovations, Inc., a leader in the data management industry, has collaborated with the Iowa Legislative Services Agency to build a customized thesaurus that allows the Iowa Legislature General Assembly to easily access its extensive legal body of existing and proposed laws, bills, acts, and regulations by using controlled, vocabulary-driven indexing in addition to published indexing codes.
The new thesaurus also allows newly published unstructured content to be tied together across multiple indices and will drive a uniform index.
The six-month project utilized Access Innovations’ Data Harmony® software suite with the goal of providing subscription-based delivery of legal documents derived from uniform index entries. The project team created the thesaurus using MAIstro™, a software tool that includes both Thesaurus Master® (thesaurus and taxonomy management) and M.A.I.™ (Machine Aided Indexer).
Marjorie M.K. Hlava, president of Access Innovations, explained that the new index makes it much easier for the Iowa Legislative Services Agency to deliver precise information to its users much more quickly.
“Before, they might have had 100 different terms covering laws and regulations related to agriculture. In order to find the right citation or receive updates, a user would have to know the correct term as well as the code. Now, Iowa users can enter a single term or the standardized code to subscribe to ‘Agriculture’ and be notified of changes or proposed changes to state laws on this topic. This saves them a huge amount of time and frustration,” she said.
Hlava added she believes the project can serve as a model for other states interested in creating more efficient and effective indexes covering their past, current, and proposed laws, regulations and bills.
The project differed from typical index and thesaurus creation because the Iowa Legislative Services Agency needed to maintain its existing codes from each back-of-the-book index, rather than starting from scratch and creating new codes. One reference alone, the Blue Index, included 2,300 index terms. To create the thesaurus, Access Innovations looked at different methods to apply to each term according to the existing references, tied preferred terms to the existing codes, and added related terms to the preferred terms.
The codes covered previous legislation dating as far back as 1953 to legislation through 2010. Also, the custom taxonomy was built with only four levels in order to meet Iowa Legislative Services’ navigation requirements. Typically, thesauri are not limited by a specified number of levels.
Jack Bruce, a project manager at Access Innovations, noted, “The new index makes it much easier for Iowa to deliver precise information to all their users much more quickly. Their 300-page Iowa Code index is now 22 pages and provides fast, accurate access. I don’t think they could ever go back to the old pre-coordinated indexing days. They were interested in trying it but a bit skeptical at first. Now, they are believers.”
Bruce said that Access Innovations also provided on-site training and ongoing technical support so that Iowa Legislative Services Agency’s indexing staff could get used to the new thesaurus, ask questions, and learn in a hands-on environment how to work with it so they could, in turn, show their users how to get the most from it.
“We believe the new uniform index derived from this thesaurus is going to be an extremely efficient way to deliver a huge volume of information in a useful and easily searchable way for years to come,” Bruce said.
Hlava added, “The Iowa LSA is a very forward thinking group with the vision to harness the Data Harmony technology to streamline access to government information, enabling the democratic process in a much more transparent and efficient way.”
Maintaining a Thesaurus in an Excel Workbook
April 16, 2012
Posted in Access Insights, Featured, Taxonomy, Term lists
There’s been some discussion recently in the Taxonomy Community of Practice LinkedIn group about free or low cost thesaurus management software. I’ve noticed a dearth of postings about using Excel, a very popular tool, particularly if you already have a Microsoft Office license.
Experts disparage Excel as a tool, but it can provide a way to start your thesaurus development. And, if you are mindful of organizing your Excel worksheet so that its data can be imported later into a dedicated tool, you can achieve some important objectives. Excel is indeed the most popular thesaurus management tool. (See Taxonomy & metadata strategies for effective content management workshop slides, in which taxonomy expert Joe Busch reiterates this.)
First requirement: a hierarchy
It’s not enough to collect the terms that represent the “aboutness” of your electronic collection; it’s also important to put them into a “human usable” format. A hierarchy, or tree, displays the relationship between a broader, more inclusive expression of a concept and increasingly narrower, more specific examples and instances of it. The breadth of your term list, as represented by the top terms, provides insight into the breadth of your content collection. The number of levels in the hierarchy, when representative of groups of content items, describes the depth of the collection.
A hierarchy is very easy to create and maintain in Excel. The columns provide the levels, and a row for a new term can be easily added with a right click on a row number and the choice of Insert. Synonyms, scope notes, and related terms (the term record) can be added in columns beyond the last hierarchical level (more about these next week) because Excel does not “complain” about empty cells within a row, it just doesn’t like empty rows within a text ‘list’. Dedicating the first three to six columns to terms only allows you to adjust the column widths until you get a pleasing hierarchy display. And you can select just the columns that include terms as the “print area” to print out a hierarchical list.
Dedicated software tools allow collapse and expansion of the hierarchy to assist in changing focus from general to specific. In Excel, you’ll need to manually select groups of rows (the branches of your hierarchy) and identify them as an outline group (Data ribbon > Outline > Group).
To get a list of the top terms, you can copy Column 1 of your Excel worksheet to another worksheet, add a column label, select the column, and use the Sort and Filter section of the Data ribbon with “Unique records only” checked to remove the blank rows. Use Sort A-Z function on the Data ribbon to put them in alphabetical order.
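For comparison, here is a minimal Python sketch of the same idea: pull the non-blank values out of the first column, drop duplicates, and sort them. It assumes the worksheet has already been read into a list of rows; the function name and sample rows are illustrative.

```python
def top_terms(rows):
    """Return the unique, non-blank values from the first column, sorted A-Z."""
    seen = {row[0].strip() for row in rows if row and row[0].strip()}
    return sorted(seen, key=str.lower)


# Hypothetical three-level worksheet: one term per row, one column per level.
rows = [
    ["networks", "", ""],
    ["", "communication networks", ""],
    ["", "", "Deep Space Network"],
    ["", "tracking networks", ""],
    ["crops", "", ""],
    ["", "alternative crops", ""],
]

print(top_terms(rows))
```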
A Different View
Dedicated software provides different views of your terms, for example alphabetical, permuted, a list with full term records. The easiest of those to reproduce in Excel is an alphabetical list of terms. One way to accomplish this is to copy the term hierarchy columns to a new worksheet and use a concatenate formula in the next empty column (=CONCATENATE(A1,B1,C1,D1,E1,…)) to get all your terms in a single column. Copy the formula down: Home ribbon > Editing > Fill > Down to the last row of your hierarchy to display all the terms in a single column. To convert from displaying the results of your formula to an actual list of terms, copy the column and use the Paste Special > Values (a right click option, or from the Home ribbon > Clipboard section) to create the list. Then, select the list (click on the first term and hold down the Shift key while you tap the End key and then the Down arrow (↓) key) and use the Sort A-Z function on the Data ribbon to get them in alphabetical order.
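The concatenate-then-sort approach can be sketched in a few lines of Python. This assumes, as in the worksheet layout described above, that each row holds exactly one term in one of the level columns; the function name and sample rows are illustrative.

```python
def alphabetical_terms(rows, levels):
    """Collapse the level columns into one alphabetical list of terms."""
    terms = []
    for row in rows:
        # Joining the level columns works like CONCATENATE because only
        # one of them holds a term on any given row.
        term = "".join(row[:levels]).strip()
        if term:
            terms.append(term)
    return sorted(terms, key=str.lower)


rows = [
    ["networks", "", ""],
    ["", "communication networks", ""],
    ["", "", "Deep Space Network"],
    ["", "tracking networks", ""],
    ["crops", "", ""],
    ["", "alternative crops", ""],
]

for term in alphabetical_terms(rows, levels=3):
    print(term)
```

Sorting with a case-insensitive key keeps a capitalized term such as Deep Space Network in its expected place among lowercase terms.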
Exporting your list
To export your list as a delimited text file, choose the “Save As” option from the File button/menu. Give the export file a name and, in the “Save as type” field, pick the Text (tab delimited) (*.txt) option. This means of export saves the entire sheet (but one sheet only). Using tabs as the delimiter prevents a comma appearing in a scope note from throwing off your export. If you need the terms only, copy the hierarchy columns to a separate worksheet first. When the webmaster asks you for the terms in just the top two levels (and maybe for only selected branches) of your taxonomy, copy the columns or branch sections to a new worksheet. In about four steps, and using a formula combining Excel’s IF and CONCATENATE functions, you can deliver just what’s needed.
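As an alternative illustration (not the Excel formula mentioned above), the Python sketch below keeps only the terms in the top two levels and produces tab-delimited text. The function names, the default depth, and the sample rows are assumptions made for the example.

```python
def top_levels(rows, depth=2):
    """Keep only rows whose term sits in one of the first `depth` columns."""
    kept = []
    for row in rows:
        head = row[:depth]
        if any(cell.strip() for cell in head):
            kept.append(head)
    return kept


def to_tab_delimited(rows):
    """Join cells with tabs and rows with newlines."""
    return "\n".join("\t".join(row) for row in rows) + "\n"


rows = [
    ["networks", "", ""],
    ["", "communication networks", ""],
    ["", "", "Deep Space Network"],
    ["", "tracking networks", ""],
]

print(to_tab_delimited(top_levels(rows)), end="")
```

The third-level row drops out entirely, so the webmaster receives only the two columns requested.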
Thesaurus standards
As your term list grows, you’ll want to check that there are no gaps in your hierarchy (e.g., a 3rd level term as ‘child’ of a 1st level term) and that you haven’t used a top term somewhere else in the hierarchy (which violates the NISO Z39.19 standard and ultimately will cause you trouble). You’ll probably need a Visual Basic program for your worksheet (a sophisticated macro). If you’re not a programmer, sharpen those negotiating skills to get your talented colleagues to develop one for you. Or, consider dedicated thesaurus management software.
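To give a sense of what such a check involves, here is a rough sketch of both tests in Python rather than Visual Basic. It assumes the hierarchy columns have been read into a list of rows with one term per row; the function names, message wording, and sample rows are hypothetical.

```python
def level_of(row):
    """Return the 1-based column of the first non-blank cell, or None."""
    for index, cell in enumerate(row, start=1):
        if cell.strip():
            return index
    return None


def check_hierarchy(rows):
    """Report level gaps and top terms reused lower in the hierarchy."""
    problems = []
    previous_level = 0
    top = set()
    lower = []
    for number, row in enumerate(rows, start=1):
        level = level_of(row)
        if level is None:
            continue  # skip blank rows
        term = row[level - 1].strip()
        # A term may sit at most one level deeper than the term above it.
        if level > previous_level + 1:
            problems.append(f"row {number}: '{term}' skips a level")
        if level == 1:
            top.add(term.lower())
        else:
            lower.append((number, term))
        previous_level = level
    for number, term in lower:
        if term.lower() in top:
            problems.append(f"row {number}: '{term}' is also a top term")
    return problems


rows = [
    ["networks", "", ""],
    ["", "", "Deep Space Network"],  # 3rd level directly under 1st level
    ["crops", "", ""],
    ["", "networks", ""],            # top term reused at 2nd level
]

for problem in check_hierarchy(rows):
    print(problem)
```

A real worksheet macro would need the same two passes: one to compare each row's level with the row above it, and one to compare lower-level terms against the set of top terms.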
Even though Excel may be a place to start in building your taxonomy, you’ll want to be aware of its limitations and which of them are “deal breakers” for your project and career. Begin evaluating dedicated software from the start.
The Data Harmony thesaurus management software produced by Access Innovations takes care of keeping your thesaurus hierarchy standards-compliant as well as offering many other features that make the people building and using a thesaurus significantly more productive.
Mary Garcia, Lead Technical Support Specialist
Access Innovations, Inc.
Leveraging Your Taxonomy – Part 10 (Taxonomies in SharePoint)
April 9, 2012
Posted in Access Insights, Featured, search, Taxonomy
I hope this series on search has been helpful to users and professionals alike. Let’s close it with a look at taxonomies in SharePoint.
Let’s look at this data flow in another way. We have incoming information that we are going to dump into a repository. We need to add metadata to that repository. We want to add taxonomy terms. The taxonomy terms all need to be controlled or suggested. So, there’s a back end to do that. Once we have the data in that repository, it could be exported to a SQL or a relational database, transactional system, for e-commerce. It might be put into a repository so that the full displays can be done. It might be loaded into a search system and you also might have a presentation layer for display.
If you can, it is nice to save data to the search and the repositories all at the same time, so that when you save a record in one place, it saves automatically to all of the other places. That would mean that you would have a lot of immediate availability of the information to everybody. You can kind of group this sort of stuff that I just showed you that is all over the map into your source data, cleaning and enhancing the data, and making it searchable. You want to be able to get your data, load it and clean it up, and then export it to the several repositories with different requirements.

If you are doing that in SharePoint, it’s very similar. You have your client data; you have your own taxonomy; you run it through a battery of stuff to clean it up; dump it into the repository. You still need search software, separate. SharePoint actually comes with a small search system, but if you have a lot of documents, you will need to have something more robust for search – make it faster, faster query speed, faster results processing, etc. And, if you add the taxonomy back in at the front end, you can browse and increase the accuracy of the search results.
We’ve covered how search works and how to measure accuracy in search, and we’ve discussed some of the theoretical bases. We also discussed the taxonomy effect.
I really strongly recommend that you do the data first, if you can. Many of you have been able to before, but now all of you can. But, assess what you have. What does it need? How would you like to access it? What features do you like? Give lots of examples if you can find them. Look at the data before you create the specifications, because a DTD built without data isn’t going to work for your data. You’ll have to go back through a whole new big processing round. If you choose the system that will support your data first, then you are really in good shape. It is amazing to me that people don’t do that.
Marjorie M.K. Hlava
President, Access Innovations
Leveraging Your Taxonomy – Part 9
April 2, 2012
Posted in Access Insights, Featured, search, Taxonomy
As we continue the series on search, we are close to wrapping up with a more in-depth look behind the scenes of database management systems.
Let’s take a quick look behind the scenes. We want to connect the database management system to the thesaurus tool so that we can validate the terms and make sure that they are in good shape and, as people are adding records to the database, if they have any suggestions or candidates, we want to lock those in as well.
The thesaurus tool will tell you which terms are actually correct, allow you to add, change, and delete, and otherwise manage the term base. Then the indexing is used to actually suggest indexing terms to records as they are loaded to the database management system. That system can be SharePoint, it could be a content management system, it could be a Documentum or a FileNet, or any other thing you want to use as a repository to manage your data. That is driven by the taxonomy.
Here’s a taxonomy view. We have the hierarchical view on one side and we have the individual term records over here. The reason to maintain them as term records is because that way we have a related term that is attached to this term itself, and narrower terms for that term, and broader terms for that term, so you get the record as an object that you can keep track of.
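To make the idea of a term record as an object concrete, here is a small Python sketch. The `TermRecord` class, its field names, and the sample terms are assumptions for illustration; they are not the structure used by any particular thesaurus tool.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TermRecord:
    """One thesaurus term together with the relationships attached to it."""
    term: str
    broader: List[str] = field(default_factory=list)
    narrower: List[str] = field(default_factory=list)
    related: List[str] = field(default_factory=list)
    used_for: List[str] = field(default_factory=list)


def hierarchy_lines(records: Dict[str, TermRecord], root: str, depth: int = 0) -> List[str]:
    """Build the indented hierarchical view by following narrower terms."""
    lines = ["  " * depth + root]
    for child in sorted(records[root].narrower):
        lines.extend(hierarchy_lines(records, child, depth + 1))
    return lines


# Hypothetical sample records.
records = {
    "crops": TermRecord("crops", narrower=["alternative crops"]),
    "alternative crops": TermRecord(
        "alternative crops",
        broader=["crops"],
        related=["alternative farming"],
        used_for=["alternate crops"],
    ),
}

tree = hierarchy_lines(records, "crops")
```

The hierarchical view is derived from the records, so the two views in the display cannot drift apart.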
Where does this subject metadata, or the set of taxonomy terms, go? We are going to apply those to the content itself, and we can do that in the meta name="keywords" field in the HTML header, or we can connect it to keywords in the SQL database or other database tables, if it is a relational database system.
If you go to View, Source on any webpage, it will show you which DTD for HTML the page is using. This is HTML 4 Frameset in English and if your browser doesn’t have it, here’s where it can go get it. What is important here, besides that this is the DTD we’re using, is the meta name="keywords" field. This is how metadata as a term became popular in our business. What you are hoping is that internally or externally, people are filling in the content in the meta name="keywords" field. Ideally, on your website these should come from your taxonomy so that they are, indeed, populated and tied to the taxonomy; this gives you much more precise search.
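A minimal sketch of filling the keywords field from the taxonomy might look like this in Python. The function name `keywords_meta_tag` and the sample terms are hypothetical; the point is only that anything not in the controlled vocabulary is left out of the tag.

```python
from html import escape
from typing import Iterable, Set


def keywords_meta_tag(assigned: Iterable[str], taxonomy: Set[str]) -> str:
    """Build a keywords meta tag that contains only controlled taxonomy terms."""
    controlled = [term for term in assigned if term in taxonomy]
    content = ", ".join(controlled)
    return f'<meta name="keywords" content="{escape(content, quote=True)}">'


# Hypothetical taxonomy and a page with one uncontrolled keyword.
taxonomy = {"crops", "alternative crops", "alternative farming"}
tag = keywords_meta_tag(["alternative crops", "cool stuff", "crops"], taxonomy)
```

Here the uncontrolled keyword "cool stuff" is dropped, so the header stays tied to the taxonomy.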
If you are working in a relational database management system, you have a lot of tables, a lot of different kinds of information. Here it is a health database. The taxonomy terms are linked to a couple of places and you can see all of the fields that they can feed to. It’s a primary key, as opposed to a secondary key, and it is called ‘Category’. Here we have Category ID. That is the taxonomy field in this particular database. That is where we are going to dump the taxonomy terms. In this case, because it is an ID, I have to go get them. Here are all the fields of that taxonomy record.
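The lookup by ID can be sketched with Python's built-in `sqlite3` module. The table layout, column names, and rows below are invented for illustration and are not the schema of the health database described above; only the pattern (a Category ID in the content table that has to be resolved against the taxonomy table) follows the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical taxonomy table: one row per term, keyed by CategoryID.
    CREATE TABLE Category (
        CategoryID INTEGER PRIMARY KEY,
        Term       TEXT NOT NULL UNIQUE
    );
    -- Hypothetical content table that stores only the term's ID.
    CREATE TABLE Article (
        ArticleID  INTEGER PRIMARY KEY,
        Title      TEXT NOT NULL,
        CategoryID INTEGER REFERENCES Category(CategoryID)
    );
    INSERT INTO Category (CategoryID, Term) VALUES
        (1, 'nutrition'), (2, 'immunization');
    INSERT INTO Article (ArticleID, Title, CategoryID) VALUES
        (10, 'School lunch guidelines', 1),
        (11, 'Flu shot schedule', 2);
""")

# Because the article holds an ID, the term itself comes from a join.
rows = conn.execute("""
    SELECT a.Title, c.Term
    FROM Article AS a
    JOIN Category AS c ON c.CategoryID = a.CategoryID
    ORDER BY a.ArticleID
""").fetchall()
```

Storing the ID rather than the term text means a term can be renamed in one place without touching the content records.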
Another way to do it is if we have some XML database. The record shown below is from the National Information Center for Educational Media. I might have an abstract and title; based on those, I am going to go get some suggested terms automatically. It is going to tell me how many times they are suggested and dump them into the record. Then I go back to my view of the taxonomy and here is the hierarchical view; here’s the narrower term view and here is the related term view for the record. That is how it came up. It came directly from the taxonomy itself.
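As a very simplified picture of suggesting terms and reporting how many times each was suggested, here is a Python sketch that uses plain phrase matching. Real automatic indexing is considerably more sophisticated; the function `suggest_terms`, the lookup table, and the sample text are all assumptions made for this example.

```python
import re
from collections import Counter
from typing import Dict, List, Tuple


def suggest_terms(text: str, lookup: Dict[str, str]) -> List[Tuple[str, int]]:
    """Count whole-phrase matches and credit them to the preferred term."""
    lowered = text.lower()
    counts: Counter = Counter()
    for phrase, preferred in lookup.items():
        pattern = r"\b" + re.escape(phrase.lower()) + r"\b"
        hits = len(re.findall(pattern, lowered))
        if hits:
            counts[preferred] += hits
    return counts.most_common()


# Hypothetical lookup: every label (preferred or synonym) -> preferred term.
lookup = {
    "educational media": "educational media",
    "instructional media": "educational media",
    "film": "films",
    "films": "films",
}
text = (
    "Educational media catalog. Films and instructional media for "
    "classrooms; each film has an abstract."
)
suggestions = suggest_terms(text, lookup)
```

Synonym hits are counted toward the preferred term, so the record receives the controlled form even when the text uses a variant.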
If I am going to integrate this into enhanced findability, I am going to try to create some faceted navigation or browsable views. Maybe I’ll have smart search to search for term equivalents or synonyms. I might have taxonomy terms that are original or modified that I’ll use as labels in the search. And, I’ll have navigational aids for the user to help with those relationships. I might also want to be able to give searchers spelling alternatives and correct the spelling when they are wrong, with messages such as “Did you mean…” or “I’m searching for this instead …”. I might want to give some related concepts, or some statistical information about that metadata – how many records there are that are indexed with a particular term. I might provide some navigation or hierarchical trees and drill down. I might want to offer some recursive steps so that you can search within that search. I might want to give you some concept linking or even a dictionary look-up with a glossary of the terms, so that you can define what those terms mean online.
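One simple way to produce a "Did you mean…" message is to compare the query with the taxonomy labels using Python's standard `difflib` module. This is a sketch under that assumption, not a description of how any particular search product does it; the vocabulary and the 0.8 cutoff are arbitrary choices.

```python
import difflib
from typing import List, Optional


def did_you_mean(query: str, vocabulary: List[str], cutoff: float = 0.8) -> Optional[str]:
    """Return the closest vocabulary term for a misspelled query, if any."""
    lowered = {term.lower(): term for term in vocabulary}
    if query.lower() in lowered:
        return None  # exact match, nothing to correct
    matches = difflib.get_close_matches(query.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[matches[0]] if matches else None


# Hypothetical vocabulary drawn from a taxonomy.
vocabulary = ["spacecraft tracking", "tracking networks", "communication networks"]

suggestion = did_you_mean("spacecraft traking", vocabulary)
no_change = did_you_mean("tracking networks", vocabulary)
```

Drawing the candidates from the taxonomy keeps the corrections within terms that actually have records indexed to them.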
Next week we will finish up this series by looking at taxonomies in SharePoint.
Marjorie M.K. Hlava
President, Access Innovations
Leveraging Your Taxonomy – Part 8
March 26, 2012
Posted in Access Insights, Featured, search
As we continue the series on search and how it works, we are looking at file indexes more closely. More specifically, we are looking at complex inverted file indexes.
Stemming involves such actions as de-pluralization and removing gerund endings. It is closely related to lemmatization, and the two terms are often used interchangeably. Truncation – left and right – is a popular technique in search. Right truncation, basically chopping a word off at its end, is pretty easy. Left truncation is tricky. Consider the word ‘organization’, which can be spelled with either an ‘s’ or a ‘z’, depending on where you are from. The ‘-ation’ can be chopped off pretty easily, but for the right part, I have to build an entire index, starting with o, or, org, and so on, so that I can go through all of those to see where the full extension is. When people do left truncation, it is a lot more expensive. It is a much bigger, additional index.
Variant spellings need to be considered in taxonomy building. They also need to be considered in search. You can see that a lot of what we do in taxonomies is also used in thinking about search.
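The cost difference can be seen in a small Python sketch. Right truncation is a prefix lookup in the ordinary sorted word index. For left truncation, one common approach (assumed here, and not necessarily what any given search engine does) is to keep a second index of the words spelled backwards, which is the additional index. The word list and function names are hypothetical.

```python
from bisect import bisect_left
from typing import List


def prefix_matches(sorted_words: List[str], prefix: str) -> List[str]:
    """Return every word in a sorted list that starts with the prefix."""
    start = bisect_left(sorted_words, prefix)
    matches = []
    for word in sorted_words[start:]:
        if not word.startswith(prefix):
            break
        matches.append(word)
    return matches


# Ordinary word index, plus the extra reversed index for left truncation.
words = sorted(["organisation", "organization", "organize", "nation", "station", "orbit"])
reversed_words = sorted(word[::-1] for word in words)


def right_truncation(stem: str) -> List[str]:
    """organi* : look the stem up directly in the normal index."""
    return prefix_matches(words, stem)


def left_truncation(ending: str) -> List[str]:
    """*ation : look the reversed ending up in the reversed index."""
    hits = prefix_matches(reversed_words, ending[::-1])
    return sorted(word[::-1] for word in hits)


right_hits = right_truncation("organi")
left_hits = left_truncation("ation")
```

Note that the right-truncated search picks up both the ‘s’ and ‘z’ spellings without any extra work, while the left-truncated search needs the whole second index to exist first.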
That taxonomy effect – Where do the terms go? How are they used? What other ways can I use the taxonomy in search? Shown is a site that has a whole lot of search embedded in it. You might want to search the site. You might want to search a whole bunch of combined sites – a federated search. You might want to just use the taxonomy in navigation. Or you might want to search for books or the journals in publication or all publications. Each of those is going to take you to one of two kinds of places, depending on how it is set up at the back end. You will end up either at one great big database that combines everything and allows you to parse the file and search only for books or only for journals or only on the site, or at a separate search system for each of these. It is very popular these days to have it all in different places, which means you get different kinds of results from the different searches. If you have them all combined behind the scenes, it gives the user something much more powerful.
Another way of doing a search presentation is shown on the left side, where you can see a hierarchical view. This view is actually a view of the taxonomy itself. It tells you how many items are tagged with that term in the database, and you can browse up and down it. You are using the full taxonomy tree for navigation.
In the search box, we see the implementation of the taxonomy for type-ahead. In this case we type the word ‘sell’ and by rotating the terms – both the synonyms and the preferred terms – we are finding a drop-down menu that will change as we continue to type. People don’t have to actually know the term that was preferred in building the taxonomy; it is doing auto-completion. Auto-completion is popular. It can go from the term index, or from that inverted file, or from the taxonomy, or from some other dictionary that you don’t even know about that may or may not be attached to the taxonomy. When it is attached to the taxonomy, it really leverages search.
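Here is a minimal Python sketch of type-ahead over rotated terms. Each label, preferred or synonym, is indexed once for every word it contains, so typing ‘sell’ can match a word in the middle of a label. The thesaurus entries and the display format are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical entries: preferred term -> its synonyms.
THESAURUS: Dict[str, List[str]] = {
    "sales": ["selling"],
    "retail trade": ["retail selling"],
    "marketing": ["sell-side promotion"],
}


def build_rotated_index(thesaurus: Dict[str, List[str]]) -> List[Tuple[str, str, str]]:
    """Index each label under every rotation that starts at one of its words."""
    entries = []
    for preferred, synonyms in thesaurus.items():
        for label in [preferred, *synonyms]:
            words = label.lower().split()
            for i in range(len(words)):
                entries.append((" ".join(words[i:]), label, preferred))
    return sorted(entries)


def type_ahead(index: List[Tuple[str, str, str]], typed: str, limit: int = 10) -> List[str]:
    """Return drop-down suggestions for what has been typed so far."""
    typed = typed.lower()
    suggestions: List[str] = []
    for rotation, label, preferred in index:
        if rotation.startswith(typed):
            text = label if label == preferred else f"{label} (use: {preferred})"
            if text not in suggestions:
                suggestions.append(text)
    return suggestions[:limit]


index = build_rotated_index(THESAURUS)
first = type_ahead(index, "sell")
narrowed = type_ahead(index, "selli")
```

The list shrinks as more letters are typed, and each synonym points the user to the preferred term behind it.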
If we were to click on one of the terms in the drop-down list, we would get a search that would give us conceptually appropriate records, and I could click on this and see the record or I can click, in this case, on this site to go buy the item described in the record. I can get a little snippet about the record or a short abstract. It could be just a snippet like you see in a Google search. What is also displayed are additional conceptually appropriate terms, which include thesaurus-related terms to expand the search, and narrower terms to narrow the search; just another way to look at how things are going.
Next week we will look behind the scenes at databases and taxonomies.
Marjorie M.K. Hlava
President, Access Innovations
Leveraging Your Taxonomy – Part 7
March 19, 2012
Posted in Access Insights, Featured, search, Taxonomy
This is the next piece in our series of blog posts on search and how it works. Next let’s look at an inverted file index. Let’s pretend that this is the outline of the presentation. I have Define Key Terminology, Thesaurus Tools, Functions, Features, Class, Construction of the Thesaurus, etc. in the figure below. You can see that the word “Thesaurus” is used three times here. I have a number of other words that you might focus on to see where they are. If I am going to take these and make them into an inverted file, the simple inverted file index is just going to take them and make them into an alphabetic list. So it will sort the characters with the lowest ASCII codes first – the numbers and most special characters – and then it will sort the rest of them alphabetically.
That’s nice, a nice alphabetical list, but remember, computers are just great big calculators in the end, so what I need to do is make this a little fancier.
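A simple inverted file can be sketched in a few lines of Python. The outline below is a shortened, hypothetical version of the one in the figure, so the counts will not match the figure exactly.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical, shortened outline.
OUTLINE = [
    "1 Define Key Terminology",
    "2 Thesaurus Tools",
    "3 Functions, Features, Class",
    "4 Construction of the Thesaurus",
]


def simple_inverted_file(lines: List[str]) -> Dict[str, List[int]]:
    """Map each word to the line numbers where it occurs, in sorted order."""
    index = defaultdict(list)
    for line_no, line in enumerate(lines, start=1):
        for token in line.replace(",", " ").split():
            index[token].append(line_no)
    # A plain sort compares character codes, so digits come before letters
    # and uppercase letters come before lowercase ones.
    return dict(sorted(index.items()))


index = simple_inverted_file(OUTLINE)
word_list = list(index)
```

In this sketch the numbers land at the top of the list and lowercase words such as "of" and "the" land at the bottom, which is exactly the plain, unintelligent ordering described above.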
In order to make it fancier, I need to apply some intelligence to this. Some things (like of, 1, 2, 3) I won’t search on, so I will make them stop words. I will tell the system where the other words are. So, line 7, paragraph 2, and what type of thing it is – it’s a subheading. So, I have things that are in the titles, over here T, I have headings, I have stop words, I have placement. What this does for that great big calculator is to give me something so that I can add them up. So, if I want to know about construction costs, I am going to compare two things here – construction and costs – and see if they are located close to each other. One is in paragraph 2 and one is in paragraph 1. If I wanted to define features, here I am in paragraph 1 and paragraph 1, those might be pretty handy. So, I am looking for those Boolean interchanges. I want to find out where these things are. Here I have when and why; there’s a Stop Word in between them. I have when and why, line 9, positions 1 and 3. I would present those to the user first as an answer.
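The fancier index can be sketched as follows in Python. To keep it short, this version records only the line and word position of each term; paragraph numbers and the type of element (title, heading) are left out. The stop word list, sample lines, and function names are assumptions for the example.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

STOP_WORDS = {"of", "the", "and", "a", "1", "2", "3"}

# Hypothetical sample lines.
LINES = [
    "Construction of the Thesaurus",
    "Thesaurus construction costs",
    "When and why to build",
]


def positional_index(lines: List[str]) -> Dict[str, List[Tuple[int, int]]]:
    """Record (line, position) for every word that is not a stop word."""
    index = defaultdict(list)
    for line_no, line in enumerate(lines, start=1):
        tokens = re.findall(r"\w+", line.lower())
        # Stop words still occupy a position; they are just not indexed.
        for pos, token in enumerate(tokens, start=1):
            if token not in STOP_WORDS:
                index[token].append((line_no, pos))
    return index


def proximity_hits(index, first: str, second: str, max_gap: int = 2) -> List[Tuple[int, int, int]]:
    """Find lines where `second` follows `first` within max_gap positions."""
    return [
        (line1, pos1, pos2)
        for line1, pos1 in index.get(first, [])
        for line2, pos2 in index.get(second, [])
        if line1 == line2 and 0 < pos2 - pos1 <= max_gap
    ]


index = positional_index(LINES)
when_why = proximity_hits(index, "when", "why")
construction_costs = proximity_hits(index, "construction", "costs")
```

Because the stop word between "when" and "why" keeps its position, the two words come out at positions 1 and 3 of their line, and the calculator has numbers it can compare.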
Next week we will continue with a look at complex inverted file indexes.
Marjorie M.K. Hlava
President, Access Innovations
Leveraging Your Taxonomy – Part 6
March 12, 2012
Posted in Access Insights, Featured, search, Taxonomy
This is the next piece in our series of blog posts on search and how it works. Last week we ended with natural language processing. We are picking up this week on automatic language processing or ALP.
Automatic language processing is different. It involves automatic translation or automatic indexing or auto abstracting. It is a pillar for artificial intelligence. It is also a pillar in a lot of search systems but it is frequently built on top of the precepts of natural language processing. It might add spell checking. A lot of the initiatives for the semantic web are based on some sort of automatic language processing, as well as linking algorithms, other kinds of NLP, and other kinds of computational linguistics.
Statistical search has evolved to encompass a wide variety of options. Taking the Bayesian statistics from many years ago, we might be able to come up with cluster analysis, or neural network search, or vector searching, or co-occurrence, or Bayesian inference, or latent semantics; these are all different search methods. They all are, at the end of the day, based on statistics, and they all depend on the input data and on training sets. You need to factor into your calculations in implementing a search system like this, what it will cost to train the system. To train the system is really just batch processing. It takes programs to do it. But, in order for them to do their work, they need examples of every term – for example, in your taxonomy – used correctly in quite a few articles, like 20-50. In my experience, it takes about three times that many to find ones where the term is used spot-on, specifically the way you want to use it. You present those as training sets. It takes quite a while to collect the training sets for statistical search. That is the real Achilles heel of these systems.
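To put rough numbers on the training cost, here is a small Python calculation using the figures above (20 to 50 good examples per term, and about three times that many articles reviewed to find them). The 1,000-term taxonomy is a hypothetical size chosen only for the arithmetic.

```python
from typing import Dict, Tuple


def training_set_estimate(
    term_count: int,
    examples_per_term: Tuple[int, int] = (20, 50),
    review_factor: int = 3,
) -> Dict[str, Tuple[int, int]]:
    """Estimate (low, high) counts of good examples and articles to review."""
    low, high = examples_per_term
    return {
        "examples_needed": (term_count * low, term_count * high),
        "articles_to_review": (
            term_count * low * review_factor,
            term_count * high * review_factor,
        ),
    }


# Hypothetical taxonomy of 1,000 terms.
estimate = training_set_estimate(1000)
```

For 1,000 terms that works out to 20,000 to 50,000 usable examples, found by reviewing roughly 60,000 to 150,000 articles.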
In all of these systems, whether they are statistical, natural language, or whatever, at the end of the day they depend on an inverted file and some And, Or, And Not (at least) operators. We frequently build a searchable index, which is the inverted file index, and then we might build on the user end some kind of a presentation layer, which is a hierarchical display – or browsable list – and frequently that comes from the taxonomy view of the thesaurus.
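The And, Or, and And Not operators map directly onto set operations over an inverted file, as in this minimal Python sketch. The words and record numbers are hypothetical.

```python
from typing import Dict, Set

# Hypothetical inverted file: word -> set of record numbers containing it.
inverted_file: Dict[str, Set[int]] = {
    "thesaurus": {1, 2, 4},
    "construction": {2, 3},
    "costs": {3, 4},
}


def postings(term: str) -> Set[int]:
    """Return the records for a word, or an empty set if it is not indexed."""
    return inverted_file.get(term, set())


and_result = postings("thesaurus") & postings("construction")   # both words
or_result = postings("construction") | postings("costs")        # either word
and_not_result = postings("thesaurus") - postings("costs")      # first, not second
```

Whatever ranking or statistics sit on top, the lookups underneath reduce to combinations like these.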
Next week we will talk more about inverted file indexes.
Marjorie M.K. Hlava
President, Access Innovations

















