A Quick Look Behind the Scenes
March 4, 2013
Posted in Access Insights, Featured, Taxonomy
The workings of a taxonomy or thesaurus in a database or website can seem mysterious. Let’s take a look behind the scenes.
First of all, we need the taxonomy or thesaurus in digital form, either as a separate file or as it exists in a specialized software application. The screenshot below is from the editorial user interface of a thesaurus software application.
The left panel shows a taxonomy view, the hierarchical view. The right panel shows a term record. We have the broader term, the narrower terms, status, related terms, and then other information: synonyms, history, scope notes, and so forth. A lot of information is stored as an object. Here the object is “Heating, cooling, and ventilation”: the term and everything attached to it.
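To make the idea of a term record as an object concrete, here is a minimal sketch in Python. The class name, the field names, and every value other than the term itself are assumptions made for illustration; they are not taken from the thesaurus software shown in the screenshot.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TermRecord:
    """One thesaurus term and everything attached to it (hypothetical layout)."""
    term: str
    status: str = "accepted"
    broader: List[str] = field(default_factory=list)
    narrower: List[str] = field(default_factory=list)
    related: List[str] = field(default_factory=list)
    synonyms: List[str] = field(default_factory=list)
    scope_note: str = ""
    history: List[str] = field(default_factory=list)


# All relationship values below are invented for illustration only.
record = TermRecord(
    term="Heating, cooling, and ventilation",
    broader=["Building systems"],
    narrower=["Air conditioning", "Furnaces"],
    related=["Energy efficiency"],
    synonyms=["HVAC"],
)
print(record.term, "->", record.narrower)
```

The point of the sketch is only that the term and its relationships travel together as one unit.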
Now let’s look at how the thesaurus terms get connected with a website. (We’ve seen some of this before, in connection with the discussion of metadata.) Go to a site, open the browser’s View menu, choose the page source view, and you’ll see the source behind the page. It will look something like this.
Here you can see the meta name fields in the page source. Not all sites have them, but many do.
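If you wanted to pull those meta name fields out of a page programmatically, a rough sketch using Python's standard `html.parser` module might look like the following. The sample page, the `keywords` field, and the semicolon separator are assumptions for illustration; real sites vary in which meta names they use and how they delimit terms.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collects <meta name="..." content="..."> pairs from a page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        name = attr_map.get("name")
        if name:
            self.meta[name.lower()] = attr_map.get("content") or ""


# An invented page source, standing in for what "view source" would show.
SAMPLE_PAGE = """<html><head>
<title>Sample page</title>
<meta name="keywords" content="Heating; Cooling; Ventilation">
<meta name="description" content="An invented page for illustration.">
</head><body></body></html>"""

collector = MetaCollector()
collector.feed(SAMPLE_PAGE)

# Split the keywords field into individual terms (separator is an assumption).
terms = [t.strip() for t in collector.meta.get("keywords", "").split(";") if t.strip()]
print(terms)
```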
If you were to do this in a relational database, you would put your taxonomy terms in a table somewhere. You would need to be sure that they are related to the primary key or main records, so that you have them linked to the records.
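One common way to set this up is a terms table plus a link table keyed to the primary key of the main records. The sketch below uses Python's built-in `sqlite3` module; the table names, column names, and sample rows are all assumptions for illustration, and your own database design may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE records (
    record_id INTEGER PRIMARY KEY,
    title     TEXT NOT NULL
);
CREATE TABLE terms (
    term_id INTEGER PRIMARY KEY,
    label   TEXT NOT NULL UNIQUE
);
-- Link table: one row per (record, term) assignment.
CREATE TABLE record_terms (
    record_id INTEGER NOT NULL REFERENCES records(record_id),
    term_id   INTEGER NOT NULL REFERENCES terms(term_id),
    PRIMARY KEY (record_id, term_id)
);
""")

# Invented sample data.
conn.executemany("INSERT INTO records VALUES (?, ?)",
                 [(1, "Duct design basics"), (2, "Boiler maintenance")])
conn.executemany("INSERT INTO terms VALUES (?, ?)",
                 [(10, "Ventilation"), (11, "Heating")])
conn.executemany("INSERT INTO record_terms VALUES (?, ?)",
                 [(1, 10), (2, 11), (1, 11)])


def records_for_term(label):
    """Return titles of all records tagged with the given taxonomy term."""
    rows = conn.execute(
        """SELECT r.title FROM records r
           JOIN record_terms rt ON rt.record_id = r.record_id
           JOIN terms t ON t.term_id = rt.term_id
           WHERE t.label = ? ORDER BY r.record_id""",
        (label,),
    )
    return [title for (title,) in rows]


print(records_for_term("Heating"))
```

The link table is what keeps the terms tied to the main records: a record can carry many terms, and a term can be attached to many records.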
It doesn’t matter if you are in an object system or if you’re in a relational database management system (RDBMS). You want to have a place to put those terms. Be sure that the IT people give you a place to put the terms.
In object-oriented code, it would be a very similar kind of model. You want to be sure that the data transfers over.
You want to define the terms and their connections in the relational database. In the various relational database models, you have a lot of options as to how to do those things.
You might have an XML-based database system, in which case you can put in new text and have a way to suggest the terms automatically and add them to the system.
If you look at this site, you see the hierarchical list that comes from the hierarchical list of the taxonomy.
You might see that the narrower terms in the term record become the narrower terms in the search interface, and that the related terms from the term record will also be posted in the search interface. You can see that you can do a fairly direct connection of the two.
You want to integrate that taxonomy so that you can enhance the findability of the terms. You want to use them as labels in search and also use them in tagging the records behind the scene.
If you attach the taxonomy terms to the record, load them into the search system, and then use a variation of that same taxonomy on top of the search system, you are using the taxonomy to search and you are using the taxonomy to tag. Then when you do the search, you get better results.
It could be in a relational database management system. It could be that you use MySQL, for example, as your search software, or you could be using Lucene or Autonomy, or you could use Google, but you attach that taxonomy term to the term record. Then you put the taxonomy terms in the inverted file for search. Then when you choose a taxonomy term on the user interface, it goes to that inverted index and pulls back the appropriate records.
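As a rough illustration of that last step, here is a minimal Python sketch of an inverted file built from the taxonomy terms attached to records. The record identifiers, terms, and function names are invented; a real search engine such as Lucene manages its own index structures, so treat this as a picture of the idea rather than of any product.

```python
from collections import defaultdict

# Hypothetical tagged records: record id -> taxonomy terms attached to it.
tagged_records = {
    "rec-1": ["Heating", "Ventilation"],
    "rec-2": ["Cooling"],
    "rec-3": ["Heating"],
}


def build_term_index(records):
    """Invert record -> terms into term -> set of record ids."""
    index = defaultdict(set)
    for record_id, terms in records.items():
        for term in terms:
            index[term.lower()].add(record_id)
    return index


term_index = build_term_index(tagged_records)


def select_term(term):
    """Simulate clicking a taxonomy term in the user interface."""
    return sorted(term_index.get(term.lower(), set()))


print(select_term("Heating"))
```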
Here’s a workflow diagram that might help to clarify things.
You might have a lot of raw data that you put into a data repository. You’re adding the taxonomy terms to the records in that repository. Then that repository could be stored as an SQL file for e-commerce. It could be stored in a repository. It could be stored in a search system, and you might or might not use a presentation layer to do that search. So, from the repository where you have added the terms to the records, you can spin it out to all of these different places to put the records. You don’t have to, but you could.
So, as you can see, you could use the same set of taxonomy terms lots of places in your website. There are lots of things you can do with taxonomies.
Marjorie M.K. Hlava, President, Access Innovations
Note: The above posting is one of a series based on a presentation, The Theory of Knowledge, given at the Data Harmony Users Group meeting in February of 2011. The presentation covered the theory of knowledge as it relates to search and taxonomies.
Where Do Terms Go?
February 25, 2013
Posted in Access Insights, Featured, Taxonomy
Over the past several weeks, we have looked at various topics involving taxonomies and thesauri. And we have seen that controlled vocabularies are an important part of search and browsing. But how are taxonomies put to use? What about the terms? Where do they go? Exactly how are they put to use in search? What are the different ways that a taxonomy can be used in search?
To start understanding the answers to these questions, let’s look at an old page on the website of the American Society of Civil Engineers (ASCE).
From the bookstore search box near the middle, you could search the bookstore for journals and other ASCE publications. You could use the navigation bar to browse by topic. From the site search bar near the top, you could search the whole site. However, do all of those things really provide a complete search? The interface gave the users the feeling that they were searching all kinds of different places, but they were not. They were searching the same place.
Another way of providing website search capabilities is to use your taxonomy or thesaurus to drive the searches.
This is the search presentation layer. At the left, you have the taxonomy. In the parentheses behind each of the terms, you can see how many items are stored using that search term or that taxonomy term.
Near the top, you can see auto-completion at work, using the preferred and non-preferred terms; the potential terms and their synonyms are listed in a permuted drop-down list, permuting on the letter C. You don’t need to know the whole term. You can search on it either in its synonym form or the preferred form. You get a drop-down list as you type. When you click on one of those, it gives you your search results.
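A simplified sketch of that kind of auto-completion follows, in Python. The vocabulary is invented, and matching on the start of any word in an entry is only an approximation of a permuted list; the actual search interface may build its drop-down quite differently.

```python
# Hypothetical vocabulary: entry term -> preferred term.
# A preferred term maps to itself; a synonym maps to its preferred form.
ENTRY_TERMS = {
    "cells": "Cells",
    "stem cells": "Stem cells",
    "cell biology": "Cytology",
    "cytology": "Cytology",
    "cancer": "Neoplasms",
    "neoplasms": "Neoplasms",
    "carbon": "Carbon",
}


def suggest(typed, limit=10):
    """Return (entry, preferred term) pairs where any word starts with `typed`."""
    typed = typed.lower().strip()
    if not typed:
        return []
    hits = []
    for entry in sorted(ENTRY_TERMS):
        if any(word.startswith(typed) for word in entry.split()):
            hits.append((entry, ENTRY_TERMS[entry]))
    return hits[:limit]


for entry, preferred in suggest("ce"):
    print(entry, "->", preferred)
```

Because synonyms point at their preferred term, the searcher can pick either form from the list and still land on the same set of records.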
When the search comes back, in this case on cells, some related terms display to provide suggestions for further search, as do the narrower terms of the term originally searched on. All of that is served up from the taxonomy.
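Here is a small Python sketch of how a results page could be assembled from the taxonomy: the records for the searched term, plus its narrower and related terms with the item counts that appear in parentheses. The thesaurus entries, postings, and function name are all invented for illustration.

```python
# Hypothetical thesaurus entries and postings (term -> record ids).
THESAURUS = {
    "Cells": {"narrower": ["Stem cells", "Blood cells"], "related": ["Cytology"]},
    "Stem cells": {"narrower": [], "related": ["Cells"]},
}
POSTINGS = {
    "Cells": {1, 2, 3},
    "Stem cells": {2},
    "Blood cells": {3},
    "Cytology": {4, 5},
}


def search_with_suggestions(term):
    """Return results for `term` plus narrower/related suggestions with counts."""
    entry = THESAURUS.get(term, {"narrower": [], "related": []})

    def with_counts(labels):
        return [(label, len(POSTINGS.get(label, ()))) for label in labels]

    return {
        "results": sorted(POSTINGS.get(term, ())),
        "narrower": with_counts(entry["narrower"]),
        "related": with_counts(entry["related"]),
    }


page = search_with_suggestions("Cells")
print(page["narrower"], page["related"])
```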
Where we are building the foundation for that is in a database management system (DBMS). We have the thesaurus, and we have the indexing. These are interrelated, because we need to use the thesaurus to apply the terms to the records in the DBMS.
In another installment, we’ll take a quick look behind the scenes to see how this works.
Marjorie M.K. Hlava, President
Access Innovations
Note: The above posting is one of a series based on a presentation, The Theory of Knowledge, given at the Data Harmony Users Group meeting in February of 2011. The presentation covered the theory of knowledge as it relates to search and taxonomies.
Taxonomies Are Everywhere
February 25, 2013
Posted in Access Insights, News, Taxonomy
There are taxonomies everywhere. Walk into Home Depot or your nearest grocery store and look up. The aisle markers tell you where to find what you are looking for — in broad terms and hopefully, related terms.
In the wrap-up session of the final day of the 9th Annual Data Harmony Users Group meeting, Jay Ven Eman, CEO of Access Innovations, led an entertaining and informative discussion on views from on high, taxonomy-wise. Along with semantics and rule building, the discussion addressed other topics such as users’ behavior. Ven Eman advised those in attendance to ask their clients, whether in formal or impromptu surveys, why they do what they do.
What with most sessions coming at the end of two full days of sitting and listening (make that three days for some who came early for workshops), it can be difficult to keep your audience’s attention. However, the witty and genuinely knowledgeable Ven Eman pulled it off without a hitch. It has been an enjoyable time learning from clients of Data Harmony how they are using the product to build their taxonomies, as well as the challenges they faced along the way, and to witness the collaboration between clients as they work together to solve problems that have served as hurdles or walls along the way.
Melody K. Smith
Sponsored by Data Harmony, a unit of Access Innovations, the world leader in indexing and making content findable.
Best Practices Shared at DHUG
February 22, 2013
Posted in Access Insights, News
Building a thesaurus for a digital library of more than 1,500 academic journals, 15,000 scholarly books, and other sources, is no easy task. Currently there is no one single thesaurus that holds the terms to cover all the subjects included in these collections. Nancy Murray, Associate Director, Metadata for JSTOR is taking on that challenge with the assistance of Access Innovations and Data Harmony.
As we began a day of case studies at the 9th Annual Data Harmony Users Group (DHUG) gathering, Murray shared her experiences in this huge project, one that isn’t done yet. Subject matter experts (SMEs) are often used in the preparation of thesauri and taxonomies, and Murray assured attendees that her team, too, will be following this pattern. “However, I already feel like we have, with the vast experience and knowledge provided by the staff of Access Innovations,” said Murray.
This is just one example of the best practice stories being shared here at the DHUG meeting. Users who are new to the Data Harmony line of products are gaining valuable knowledge as they begin their own journeys. This morning over breakfast, one fellow attendee said, “I was so excited to get here today because it is case studies all day!”
I am never that enthusiastic that early in the morning, but it was nice to know that others also find this time well spent.
Melody K. Smith
Sponsored by Access Innovations, the world leader in taxonomies, metadata, and semantic enrichment to make your content findable.
Visualize Your Results
February 22, 2013
Posted in Access Insights, News
Many users of Data Harmony are intimidated by data visualization. Maybe intimidated is the wrong word, but they certainly have anxiety around seeing their analysis in images, as opposed to lists of tens of thousands of words.
During a presentation at the 9th Annual Data Harmony Users Group meeting, Access Innovations editor/taxonomist, Ben Barnes, made the topic of data visualization approachable. An anthropologist by training and experience, Barnes reminded us that written communication started as visuals. The early cave paintings depicted actual celestial events. Language has been symbolic from the beginning. So now we can look at our data representations with a different perspective, reminded of their pictorial origins.
This approach feeds that part of our brain that wants more than numbers and words. It feeds the dimensional parts of our vision and other senses, all the while providing valuable and usable data.
Melody K. Smith
Sponsored by Data Harmony, a unit of Access Innovations, the world leader in indexing and making content findable.
The Future is Unknown
February 21, 2013
Posted in Access Insights, News, Technology
In earlier times, the world was slow moving and you could anticipate the next year to be much like the current year. This world is no longer slow moving; in fact, it is at warp speed as technology drives more connectivity in the moment. The Internet provides access to all. Librarians are no longer the gatekeepers of information. The entire business model of information has changed. “We are doing business at the speed of thoughts,” says Margie Hlava, president and CEO of Access Innovations. During her keynote presentation at the 9th Annual Data Harmony Users Group this week, Hlava looked back at where we’ve been while looking forward at how Access Innovations is responding to the new industries and challenges that technology has brought us.
Interesting facts were shared that may bring you to a pause for absorption. For instance, 65% of the current preschoolers will be in a profession in their adult life that does not exist at the current time. Technology is advancing and progressing so fast that it is creating new needs on a daily basis. Are you ready?
Melody K. Smith
Sponsored by Data Harmony, a unit of Access Innovations, the world leader in indexing and making content findable.
The Perfect Storm
February 21, 2013
Posted in Access Insights, Autoindexing, indexing, News
In what appears to be a timely presentation, John Kuranz, CEO of Access Integrity, walked the group gathered at the 9th Annual Data Harmony Users Group meeting through the major events that are taking place in health care. He talked extensively about the ICD-10 coding classification that will be implemented on October 1, 2014. Many are facing the challenges that implementation poses, and then not far behind that is ICD-11 coming in 2015.
Adding to that seemingly impossible situation, the Affordable Care Act will bring 33 million new patients into the healthcare system. For providers, that means beefing up the number of patients to be seen each day, the number of procedures hospitals need to meet quota on, or the census days extended care facilities need to maintain.
These two challenges are posing a perfect storm for those in the medical industry and for those working in information management for the medical industry.
Medical coding requires specialized expertise and systems tailored to the regulatory requirements in which health care providers, hospitals, and doctors deliver their services.
Access Integrity, Inc., provides tools and services for quality assurance and validation of medical coding. Their Medical Claims Compliance system can be used to quickly and accurately validate medical coding or to locate errors in existing documentation. The cost savings are potentially huge.
Melody K. Smith
Sponsored by Data Harmony, a unit of Access Innovations, the world leader in indexing and making content findable.
Opening Ceremonies
February 20, 2013
Posted in Access Insights, News
Tuesday was the opening day for the 9th Annual Data Harmony Users Group meeting. Unlike the Olympics, there were no industrial revolution references, neither Harry Potter’s nemesis nor James Bond made an appearance, and Danny Boyle’s nod to his childhood did not make the agenda. However, this small group of 60+ in Albuquerque, New Mexico was globally represented with attendees and presenters from Holland, England, Spain, India, and widespread parts of the United States.
Margie Hlava, president and chairman of Access Innovations, started the session with an update on new features in the most recent release of Data Harmony, which debuted at the event. Tuesday’s agenda promised case studies by impressive clients such as PLOS (formerly the Public Library of Science) and the American Institute of Physics Publishing.
If you were unable to attend, check back here on TaxoDiary, as we will describe some highlights of the gathering over the next few days. When you have this much talent and expertise gathered in one room, you are bound to learn something, even if it is by osmosis.
Melody K. Smith
Sponsored by Access Innovations, the world leader in thesaurus, ontology, and taxonomy creation and metadata application.
Facets and Search
February 18, 2013
Posted in Access Insights, Featured, indexing
Most of you who have studied library science or information science are familiar with faceted classification as developed by Indian mathematician and librarian S. R. Ranganathan (1892-1972). His main contribution was the colon classification system, the first faceted classification. It allowed multiple classifications to be assigned to an object, rather than a single pre-coordinated taxonomic designation.
Dr. Ranganathan is still pretty popular in the search world, especially in e-commerce. The Endeca faceted search module is popular with online marketers, because they are looking at lots of different kinds of ways to filter data.
In the case of Endeca, their biggest usages are in retail. So, if people are looking for online ordering, then their system is generally going to be an Endeca search system. That’s because you have one object: A shirt that comes in five colors; it comes in four sizes and maybe there are some other attributes to it. You want to be able to search on any one of those classifications and get the same shirt.
So, I want it in a women’s size whatever, and he wants it in a men’s size whatever. He wants it in blue and I want it in green. You can make those orders with the same general properties. The shirt classification has a lot of sub-facets to it. So the system searches across all of those different facets, which we know as size and shape and color, to find the item and get it ordered. It isn’t a single taxonomic list; it is a lot of sub-lists that identify that object.
In the taxonomy, you could have built each of those out as a separate branch. More likely, you would build them all as little taxonomies that are separate, because they are basically little pick lists or authority lists. Any one of them is consistent. If I want women’s clothing on L.L. Bean, I am going to click on Women’s Clothing, and then the website is going to tell me that it has pants, shirts, and other things, and then I can choose from those. They are facets. I am going down a hierarchical approach to them if they are in Ranganathan. This is called faceted search. I can click through and get ever more detail.
In Lucene, which is an open source search system, I can do the same thing but a bit differently, because here the facet count is giving me the individual item. I have a lot of different facets, so I can search by manufacturer, choosing from a drop-down list of manufacturers. Each of these narrows the search. I can go by price, or I can go by resolution or by zoom range, and all of those things are just narrowing down my search.
Once again, I think we can see the influence of Boolean logic in search.
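To show how facet selection amounts to Boolean AND, here is a minimal Python sketch. It is not Lucene code; the product list, facet names, and helper functions are invented, and a real engine would compute the same thing from its index rather than by scanning a list.

```python
from collections import Counter

# Invented catalog: each item carries one value per facet.
CAMERAS = [
    {"id": 1, "manufacturer": "Acme", "resolution": "12 MP", "zoom": "3x"},
    {"id": 2, "manufacturer": "Acme", "resolution": "16 MP", "zoom": "5x"},
    {"id": 3, "manufacturer": "Zenit", "resolution": "12 MP", "zoom": "5x"},
]


def narrow(items, **selected):
    """Keep items matching every selected facet value (a Boolean AND)."""
    return [
        item for item in items
        if all(item.get(facet) == value for facet, value in selected.items())
    ]


def facet_counts(items, facet):
    """Count how many of the remaining items fall under each facet value."""
    return Counter(item[facet] for item in items)


acme = narrow(CAMERAS, manufacturer="Acme")
print(facet_counts(acme, "zoom"))
print([c["id"] for c in narrow(CAMERAS, manufacturer="Acme", zoom="5x")])
```

Each additional facet the shopper clicks adds another AND condition, which is why the result set can only shrink.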
Marjorie M.K. Hlava, President, Access Innovations
Note: The above posting is one of a series based on a presentation, The Theory of Knowledge, given at the Data Harmony Users Group meeting in February of 2011. The presentation covered the theory of knowledge as it relates to search and taxonomies.
Inverted Files, Parsing, Discovery, and Clustering
February 11, 2013
Posted in Access Insights, Featured, search
Inverted files and Boolean logic are basic to most or all search.
So, what we have is the inverted file – that big alphabetical list of all the terms and where they came from – and a way to combine them. At the end of the day, no matter which direction you come from on the high end and even higher presentation levels, you are depending on these two to make it work. If you add that taxonomy to that, you are strengthening the use of your terms.
Here’s an example of some text.
Pretend that this box shows you a page and focus on any one term, like thesaurus.
If we were to build an inverted file of this data, this is what we would be doing; we would make an alphabetic list.
If we wanted to make it much more complex, we would add stop words and we would add the conditions.
We could say that this term appears in these places. That would give me and my computer a way to combine the terms, because I know from their positions where they are. At the end of the day, this is what a very simple inverted index display looks like; it is basic to search.
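A very simple version of this can be sketched in Python. The two sample pages, the stop word list, and the function names are invented for illustration. Here stop words are left out of the index, and positions are counted across all words, including stop words, so that the recorded positions still reflect where each term sits on the page.

```python
import re
from collections import defaultdict

# A tiny, invented stop word list.
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to"}


def build_inverted_index(pages):
    """Map each term to a list of (page id, word position) postings."""
    index = defaultdict(list)
    for page_id, text in pages.items():
        words = re.findall(r"[a-z0-9]+", text.lower())
        for position, word in enumerate(words):
            if word not in STOP_WORDS:
                index[word].append((page_id, position))
    # Alphabetical order, like the printed list described above.
    return dict(sorted(index.items()))


def boolean_and(index, term_a, term_b):
    """Pages that contain both terms."""
    pages_a = {page for page, _ in index.get(term_a, [])}
    pages_b = {page for page, _ in index.get(term_b, [])}
    return sorted(pages_a & pages_b)


PAGES = {
    "p1": "A thesaurus is a controlled vocabulary",
    "p2": "The taxonomy and the thesaurus",
}
index = build_inverted_index(PAGES)
for term, postings in index.items():
    print(term, postings)
```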
We also use a lot of other stuff, like stemming and truncation and wild cards and other ways to do search. In addition, we accommodate all those misspellings, variant spellings, and so forth, so that we can make search even better.
Right truncation is very common, and so are wild cards and some stemming algorithms. Left hand truncation is hard, because you have to build an entire inverted file for every letter of everything to the left side. It makes an extremely large index very quickly. Wild cards are okay term substitutions, but left truncation is very hard and very expensive to implement. Before you ask for it, look at what it is going to do to your system.
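The asymmetry can be seen in a small Python sketch. Right truncation only needs a range lookup in the alphabetical term list. For left truncation, the sketch uses a second index of reversed terms, which is one common simplification and not necessarily the approach described above; either way, the cost is an additional index built solely to support that feature. The term list and function names are invented.

```python
import bisect

TERMS = sorted(["classification", "indexing", "taxonomy", "taxonomies", "thesaurus"])
# Second index needed only for left truncation: every term spelled backwards.
REVERSED_TERMS = sorted(term[::-1] for term in TERMS)


def prefix_range(sorted_terms, prefix):
    """All entries in a sorted list that start with `prefix`."""
    lo = bisect.bisect_left(sorted_terms, prefix)
    hi = bisect.bisect_left(sorted_terms, prefix + "\uffff")
    return sorted_terms[lo:hi]


def right_truncation(stem):
    """taxonom* style search: uses the ordinary alphabetical list."""
    return prefix_range(TERMS, stem)


def left_truncation(ending):
    """*ing style search: needs the extra reversed index."""
    matches = prefix_range(REVERSED_TERMS, ending[::-1])
    return sorted(term[::-1] for term in matches)


print(right_truncation("taxonom"))
print(left_truncation("ing"))
```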
It is said that discovery is a very popular action, and it has received a great deal of the research monies and a great deal of attention. However, only about 2% of searchers’ time is actually spent in discovery. 98% of their time is spent doing whatever search is needed to update themselves. So, if they’re mostly updating themselves, they would want one kind of system, and if they are in discovery mode, they might want a different kind of system.
One of the big questions in search is, what kind of search are you going to do? Are you doing discovery – looking for new things and new ways of combining stuff? Or are you trying to do an exhaustive everything search? Do you need relevance, recall, and precision, or are you in discovery mode? It makes a big difference as to what kinds of search stuff you need to design, depending on what the users are primarily going to be doing.
Vivisimo, for instance, is a discovery system. It searches on the fly, does automatic clustering, and it doesn’t return the same clusters each time. Each time, the clusters are new. It is not the same search; it’s not the same results. It really angers some researchers to get different stuff the next time. What they really want is additive results. They want to see what they got before, and anything that is new. That’s a different search presentation.
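An additive presentation of that kind can be sketched very simply in Python: keep what the researcher saw last time and flag only what is new. The document identifiers and the function name are invented, and this says nothing about how Vivisimo or any other product actually works.

```python
def additive_results(previous, current):
    """Split the current result list into items seen before and new items."""
    seen = set(previous)
    return {
        "seen_before": [doc for doc in current if doc in seen],
        "new": [doc for doc in current if doc not in seen],
    }


last_week = ["doc-1", "doc-2", "doc-3"]
today = ["doc-2", "doc-4", "doc-1", "doc-5"]
view = additive_results(last_week, today)
print(view["seen_before"], view["new"])
```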
I won’t go into detail about all the types of clustering. Suffice it to say that you can do it in many different ways.
You can group everything together into hierarchical clusters, or you can partition stuff. Even in a cluster you can add a thesaurus, although it won’t really like it; but you can.
This is a view by Vivisimo of its own clustering module. The automatic clustering is shown on the side; this is what will change each time you search.
Remember, I showed you FAST earlier. This is yet another kind of model of how search is done. We have a core engine, and then everything else is connected by application programming interfaces (APIs). These are modules. You don’t get all of these when you buy it; you get just some of them. They are different faces of the data. So, if you wanted to search a bunch of different databases, then you would buy the federation module. If you wanted to go to a bunch of different query servers, then you’d get the module for that. If you want to look at it in different languages, particularly in different character sets, then you would get this module. If you want to use the results, you would need this one. If you want to connect it to other search engines, you need this one. If you want to display those clusters on the fly, then this is the cluster module. Collaboration is where people can view and tag those documents and show them to other people. That’s where you keep what you have, because next time you search, it will be different.
Marjorie M.K. Hlava, President, Access Innovations
Note: The above posting is one of a series based on a presentation, The Theory of Knowledge, given at the Data Harmony Users Group meeting in February of 2011. The presentation covered the theory of knowledge as it relates to search and taxonomies.