Leveraging Your Taxonomy – Part 1
February 6, 2012
Posted in Access Insights, Featured, search, Taxonomy
This series of blog posts will explore how search works. We need to have a basic understanding of search fundamentals in order to know where taxonomies come in. The modules of search are:
- Search software – of course
- Computer network
- Parsing of text
- Well formed or structured text
- CLEAN DATA
- Computer software – network
- Computer hardware
- Telecommunications connection
- Training sets for statistical systems
- Search technology
- Ranking algorithms
- Query language
- Federators
- Cache
- Inverted index
- Other enhancements
- Presentation layer
We will cover each of those items over the series; however, we also need to measure the accuracy of search. Accuracy in search is measured in three major ways: precision, recall, and relevance. Each of these can be handled in different ways. Part of the challenge in measuring search accuracy is that there are two major theoretical directions in search. One is based on the theorem of Bayes and the other on the logic of Boole. I will explore the work of these two gentlemen and of more recent people supporting search enrichment. Then I will discuss what effect they have on search as we know it today. Finally, I will discuss the effect of taxonomy on search.
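To make two of those measures concrete, here is a minimal Python sketch of how precision and recall are commonly computed for a single query. The document IDs and the function name are made up for illustration, and relevance ranking is not covered by this sketch.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = share of retrieved documents that are relevant
    recall    = share of relevant documents that were retrieved
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document IDs, for illustration only.
retrieved_docs = ["d1", "d2", "d3", "d4"]   # what the search returned
relevant_docs = ["d2", "d4", "d7"]          # what a human judged relevant

p, r = precision_recall(retrieved_docs, relevant_docs)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the four retrieved documents are relevant, and two of the three relevant documents were found, so the two measures pull in different directions.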
How does search work?
Here is the normal path people take to search implementation. “Well, I think I will get some hardware.” And they say…. “Well, you can’t go wrong with Company X.”
- So they buy hardware and
- they buy the software that will work on that hardware.
- Then they design a system that will work with that software and
- then they try to load their data and,
- finally, they try to enhance the data with a taxonomy.
In my opinion, that is totally backwards. What they should be doing is looking at what they are building the system for in the first place – that is, the data. How about we build a system to hold your data?
We assess the data so that we can get a design; we know what fields there are. I have written about this backwards approach before.
What are you building?
- Assess the data
- Do the design
- Decide what else needs to be added
- Taxonomy terms
- Other controls
- Find a system that will work with your data
Let’s outline the pieces of the search implementation. There are a lot of parts to search, and one of them, of course, is the search software itself. The search software runs on a computer network. The software depends on parsing, that is, cutting the text into specific pieces so that it can be searched. That means that the data needs to be well-formed, if you are talking in the XML vernacular, or structured. Unstructured data is simply data that has not been tagged into fields. You could transform a Word document, which is generally considered unstructured, into a well-formed, XML-structured document by simply putting an opening tag such as <body> at the beginning of the text and the matching closing tag </body> at the end. Yes, it can be that deceptively and technically simple. The whole notion of structured and unstructured text is a bit of a misnomer and a little bit hard to understand, because most of us don’t think of data in that way. In fact, a Word document has a Properties table in it that may or may not be populated. Some things are populated in it by default. So it is, actually, partially structured already. It can even be saved as an XML document. The search software depends for its implementation on clean data. That means the data has to be clean and well-formed, preferably with all the metadata fields filled in, including the taxonomy terms added in a specific tagged field or element.
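The wrapping step described above can be sketched in a few lines of Python using only the standard library. The element name `body` and the helper names are assumptions made for illustration; the one extra detail the sketch adds is that characters such as `&` and `<` in the text must be escaped for the result to parse.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def wrap_as_xml(text, tag="body"):
    """Wrap plain text in a single element, escaping &, < and >."""
    return f"<{tag}>{escape(text)}</{tag}>"


def is_well_formed(xml_string):
    """Return True if the string parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_string)
        return True
    except ET.ParseError:
        return False


# Hypothetical text standing in for the body of a Word document.
plain = "Profits & losses were < 5% this year."
doc = wrap_as_xml(plain)

print(doc)
print(is_well_formed(doc))               # wrapped and escaped text parses
print(is_well_formed("<body>unclosed"))  # missing close tag does not parse
```

A document wrapped this way is well-formed but still has only one field; the value for search comes from tagging more specific fields, such as the metadata and taxonomy terms mentioned above.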
The computer software runs on a network, which runs on hardware. To get to it, you need to have a telecommunications connection of some kind. It might be a hard network wire within your organization – so you connect from one place to another within the firm – or it might be something that goes over the Internet to a remote location. It doesn’t really matter – the connection is still a telecommunications connection that transfers data in an orderly fashion over a wire.
Next week we will talk about the search software itself.
Marjorie M.K. Hlava
President, Access Innovations
Of Taxonomies, Biology, and Moneyball
January 30, 2012
Posted in Access Insights, Featured, Taxonomy
Baseball and biology are not commonly found in the same conceptual space. Neither do you find taxonomy associated with baseball, but in recent news these connections were made. Grant Bisbee, editor of “Baseball Nation”, digresses into the arcane as he laments the coming of the “He’s In the Best Shape of His Life” season. This is the time of year baseball writers must assess the prospects for the coming season, and clichés and hyperbole reign. The dubious practice of evaluating the physical condition of players runs rampant as spring training begins. With tongue in cheek, Bisbee tries to shape a taxonomy to classify this spring ritual. His would be the taxonomy of the “In the Best-shape Stories”.
Bisbee suggests three categories: “Player X Got in Shape”, “Fixed Eyesight”, and “A Serious, Previously Undiscovered Affliction”. He compiles representative stories for each classification. He is somewhat skeptical that a player could be in the “best shape of his life” as a result of some of the reported training regimes. Can someone really put on 18 pounds of muscle in the off-season? Legally? Shouldn’t vision examinations be a regular part of baseball teams’ operations? After all, they’re paying millions for their players’ hand-eye coordination.
Are there baseball taxonomies already out there that could help Mr. Bisbee classify his apocryphal collection? A “Google” search of baseball taxonomies returns a million items more or less – plus or minus – give or take – a few hundred thousand. Not much help there. At the very least, Bisbee could expand the classification to include other sports and how each portrays their mystical, preseason rituals. How are preseason body rebuilding rhythms disturbed by “Seasonus Interruptus”?
And biology is experiencing a potential season-disrupting trend. Tom Spears, reporting for the “Ottawa Citizen”, filed the story, “Taxing times for taxonomy”. On a more serious note, Mr. Spears chronicles the demise of field-based taxonomic study in biology in favor of lab work. Computers and DNA studies have relegated classification to the closet. While the work done in laboratories is vital to the field, biologist Ernest Small is quoted, “How can you do a study of forests without knowing the trees.” The reverse is also true; you can’t really study a tree without understanding its forest. Commenting on his lab-centric colleagues, Dr. Small laments “the lack of knowledge of what plants and animals make up our world”. Spears goes on to write, “We need to understand whole species, not just genes, if we are to solve the problems of agriculture, fisheries, insect pests, ecology and the spread of diseases.”
Taxonomies are central to understanding a field as a whole, whether biology or sports. And they are critical to understanding individual members of the taxonomy. The taxonomy of plants and animals tells you about the relationships and characteristics of its members. The child, or narrower node in a taxonomy, inherits the properties of the parent, or broader node. Reviewing an entire hierarchical branch and its relationship to other branches conveys the properties, similarities, and differences of branches. This knowledge can lead to sleuthing out meaningful trends and can help explain interactions or identify possible interactions. Taxonomy for organizing sports articles might reveal interesting groupings and trends. Are there more injuries being reported this season? Are there geographic trends or team trends? Applying graphical analysis visualizes the content in ways that enhance spotting of trends.
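The idea that a narrower node inherits the properties of its broader node can be sketched as a small Python data structure. The class name, the three sample nodes, and their property values are illustrative assumptions, not a real biological classification scheme.

```python
class TaxonomyNode:
    """A taxonomy term with an optional broader (parent) term."""

    def __init__(self, label, parent=None, properties=None):
        self.label = label
        self.parent = parent
        self.own_properties = dict(properties or {})

    def properties(self):
        """Merge inherited properties with this node's own properties."""
        inherited = self.parent.properties() if self.parent else {}
        return {**inherited, **self.own_properties}

    def path(self):
        """Return the labels from the top of the branch down to this node."""
        prefix = self.parent.path() if self.parent else []
        return prefix + [self.label]


# Illustrative three-level branch; values are simplified for the example.
insects = TaxonomyNode("Insecta", properties={"legs": 6})
butterflies = TaxonomyNode("Lepidoptera", parent=insects,
                           properties={"wings": "scaled"})
monarch = TaxonomyNode("Danaus plexippus", parent=butterflies,
                       properties={"migratory": True})

print(monarch.path())
print(monarch.properties())
```

Asking the narrowest node for its properties returns everything defined along the branch, which is what lets a reviewer compare whole branches rather than isolated terms.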
What these two diverse stories have in common besides the core theme of taxonomies is the need for hands-on, down-and-dirty fieldwork. Sports writers need to see players perform during preseason to really assess their fitness. Biologists can learn a great deal about a species by looking through a microscope, but they can’t really understand behavior that way. They need to observe firsthand. When the field data and lab data are gathered, placing it in context gives it meaning and leads to insights. Taxonomy is ideal for grouping data in meaningful ways, and that can lead to insights. It can also lead you back to the data at some future point. Taxonomy is often focused on organizing content for search to make search easy, fast, reliable, and replicable. Taxonomy also defines a domain or conceptual space. It can help you find your way and keep you from getting lost. Equally important, it provides meaning to the conceptual space you’re wandering through. Spring is almost here – enjoy the first Danaus plexippus of the season and know it has a genus and a tribe and a family and more. Play ball!
Jay Ven Eman, CEO
Access Innovations
Eighth Annual Data Harmony Users Group – February 7-9, 2012
January 23, 2012
Posted in Access Insights, Featured, News
Access Innovations will host the eighth annual Data Harmony® Users Group (DHUG) meeting this coming February 7-9 in Albuquerque, New Mexico. The meeting will focus on helping users get the most from their investment in the Data Harmony knowledge management software suite, which helps users organize information resources based on a well-built and systematically applied taxonomy or thesaurus.
“This meeting is an exciting opportunity to learn how to fully utilize the power of Data Harmony software to maximize the effectiveness and profitability of your organization for your members, customers and staff,” said Marjorie M.K. Hlava, president of Access Innovations.
On Tuesday, February 7, Access Innovations’ staff will present a full day of free training for new users or users who want to review and update their knowledge of the Data Harmony suite, which includes:
- Thesaurus Master® – Provides taxonomy and thesaurus construction and management;
- M.A.I.™ (Machine Aided Indexer) – Offers automatic indexing or editorial aid in indexing; and
- MAIstro™ – Combines Thesaurus Master and M.A.I. for maximum efficiency in both automatic indexing and taxonomy construction.
On Wednesday, February 8 and Thursday, February 9, speakers will introduce new features of the software and present case studies about how actual users have leveraged Data Harmony to organize their content, increase their productivity, lower their costs, and drive organizational or company revenues.
Access Innovations is encouraging anyone who wishes to share their story at the meeting to contact them. Registrations are also now being accepted. For more information about the eighth annual Data Harmony Users Group meeting, call (505) 998-0800 or 1-800-926-8328.
About Access Innovations – www.accessinn.com, www.dataharmony.com, www.taxodiary.com Founded in 1978, Access Innovations has extensive experience with Internet technology applications, master data management, database creation, thesaurus/taxonomy creation, and semantic integration. The Access Innovations Data Harmony software includes automatic indexing, thesaurus management, an XML Intranet System (XIS), and metadata extraction for content creation developed to meet production environment needs. Data Harmony is used by publishers, governments and corporate clients throughout the world.
2012: What Lies Ahead?
January 16, 2012
Posted in Access Insights, Autoindexing, Business strategy, Featured, semantic, Taxonomy
This time of year I read a lot of trends and reviews. What happened last year and what will happen in the coming year is a popular topic and it is in fact a good time to take stock and think about initiatives for the year ahead. So here is what I see coming down the road for 2012.
1. More people will be building and implementing taxonomies. The awareness of controlled vocabularies and their applications continues to grow. They will be applied not just in publishing but in websites, association offerings, commerce online and in records management and retention. There are many ways to leverage a taxonomy and information architects have only scratched the surface in their applications. Medical organizations will also embrace alternatives to ICD-9 and the complexity of government-driven coding to find accuracy through taxonomic means.
2. Taxonomies, controlled vocabularies, interoperability and linked data will become mainstream for corporations. Publishers and associations will also actively embrace the needs for control over ever growing collections. Universities and government will be late adopters.
3. With the increase in taxonomists there is also an explosion of “carpetbaggers”: people who see a hot trend and leap on the bandwagon with newly heralded expertise. It is true that vocabulary control has been around for well over 100 years, so many will have a background in the area and just a new field in which to apply their skill sets. It is also an area that is learned in practice, but it is not difficult to learn. The standards outline the options and there are many webinars, readings, and other training opportunities in the field. This means buyer beware: check the credentials of your service and software providers. Do they have hands-on experience or only book learning, or are they opportunity seekers?
4. I don’t think the semantic web will happen this year either. In fact I don’t think it will ever happen as originally envisioned. It is just too complicated and no search system to support it has gotten out of the lab to handle large data sets. Just as SGML gave way to HTML and then XML, Semantic web will fade and Linked Data will rise.
5. I do think that the linked data initiatives will take the lead. I expect the Linked Data community will get over its focus on syntax and start talking about implementation and application, leading the way by showing that it can be done in many ways and that no single path is needed to make those links. Mashups using linked data will become much more common. Some of these are already very active sites and many more will follow in 2012.
6. The rebirth of Dublin Core. Oh, I know it has been around for a long time – I have a few lashes to show for it myself. But when that standard (Z39.85) was “passed”, it was by inventing a new way to get around the standards consensus requirement, using a new program called “fast tracking”. Now, 15 years later, the reasons it got 7 NO votes are still there, but they are finally getting an honest appraisal and serious consideration of what the functional requirements need to be and how to make Dublin Core work as a real measurable standard rather than a guideline. The new crop of DC advocates will make it happen. In addition, the Linked Data crowd and the DC crowd are working together to bring real change and education to the marketplace as well as to university enclaves.
7. The ontology name is so cool. Not many are really sure what it means and very few mean the OWL standard when they use it. Having said that I think we will all be talking more about ontologies and less about taxonomies (and certainly not thesauri) this year. We might still mean the same thing but our words to describe it will change.
8. Finally, I think everyone will be more cutthroat. Manners and honesty will take a back seat to getting the sale and the close. We saw that increase last year and I think it will continue to grow as a problem in 2012. There are two underlying reasons. The shrinking marketplace, tied to the larger number of investment capital firms behind many businesses, will cause them to cut corners to get the sale so they can make their “numbers”. The other reason is that the increasingly uncivil political climate will bleed its desperation into all other corners of our lives.
Marjorie M.K. Hlava
President, Access Innovations
Taxonomy Meetings 2011 – A Year of Change or Realization?
January 9, 2012
Posted in Access Insights, Business strategy, Featured, reference, Taxonomy
What are the meetings that cater to people who use controlled vocabularies, like taxonomies? Where should a taxonomist go, click, or attend to learn about the latest implementations and uses of controlled vocabulary strategies? Every company thinks long and hard both about what they do and where to find customers for their products and services. The Information Industry is no different. In the Age of the Internet, when everyone “knows” about searching and information, it seems like the “Information Industry” should be booming, its conferences should be huge, and the attendance incredible, but that is not the case. Why? If the information industry and our little taxonomy segment of the business have gone mainstream, then where are all the people you would expect at the long-established industry meetings? The meetings we have attended for years are dying on the vine. The SLA Expo was sparse, the Information Today meetings are smaller, Online Information (formerly International Online) was nearly empty, and NFAIS remains the same size each year. ASIS&T is growing significantly. Frankfurt Book Fair is bigger than ever. Specific User Group meetings are increasingly targeted and well attended.
I believe there are several factors at work. The diminishing meetings have had challenges for years. Nor are we alone in this trend. It is national and perhaps international. Other options are now available. Nationally 126 million people attended meetings in 2009. In 2010 only 80 million attended. “There were 12 percent fewer attendees in 2010 than in 2005 – and 19.7 percent fewer in 2009 than in 2006,” notes the Baltimore Sun. The trend is downward significantly even with the problems of the economy. Let’s take them one at a time.
SLA Conference and Expo – an expensive and glitzy meeting held in conjunction with the SLA (aka Special Libraries Association) annual meeting. This meeting has had as many as 7,000 attendees and many auxiliary events such as user group meetings and advisory boards surrounding it. The meeting itself is significantly smaller now. Membership is down by half, from 14,000 to about 7,000, including a student contingent whose size is unknown but has likely held steady at about 2,000. This means they no longer command as large an audience. In spite of a well-meaning board trying to cater to the un- and underemployed by reduced fees, the membership has been shrinking. The Expo has held two functions in the industry. One, of course, is to show the companies’ wares to the attendees, people who work in corporate and other kinds of unusual libraries and often command large purchasing budgets. Second, the meeting of most of the players in the industry in a single exhibit hall allows for intellectual property rights discussions and business arrangements/deals to be made. But several things have happened to make this a less attractive venue.
Years ago SLA mandated that a company could not sponsor a division’s activities, that is, get close to the real customer group, unless it was an exhibitor. That meant paying for the booth (about $5000 for the smallest), paying for furniture, electric, carpet, Internet, card reader, plus the art and brochures, and giveaways, etc. (much more than $5000). Then you need staffing for the booth, including airfare and hotel for at least two, but usually more, staff. (Another $5 – 7,000 in direct cost per person plus a week away from the office.) After that you get to be the target of every division to sponsor their events – at $500 – $5000 each (there are 28 divisions and almost all of them will call you). So SLA needs to be at least a $30,000 line item in the budget, but is usually over $50,000 plus staff labor and opportunity cost. The business aspect of companies (a less degrading label than “vendors”; what are we, circus performers?) talking with companies has been good, but the increasing number of companies “suitcasing” (that is, attending without a booth) has made the exhibitors targets of not only the divisions and SLA, but also those who did not pay the freight to be in the show. Meanwhile, the attendees are walking the aisles, looking for giveaways, not making eye contact since they have no budget to spend.
More recently the Divisions have realized that they could get more out of their target companies, if they held out the carrot of a speaking slot. If you pay X you are a sponsor, if you pay Y you can also have a speaking slot. That all works as long as there is a large audience to talk with. But over the past two years there have been very few people attending the meetings. The sessions of substance are well attended. I went to the Taxonomy related ones and they were often standing room crowds. Buying of speaking slots, however, degrades the programming options and also makes the exhibitors feel cheap. My expertise, which I have been able to found and run a company on, is only worth hearing, if I pay you to listen? It feels like some kind of prostitution going on here!
At SLA 2011 many sessions had to do with how to get a job, get a raise, change careers, etc. These are helpful to the out of work perhaps, but NOT a persuasive reason for an employer to send a staff member to the meeting. Why should they send their staff to a meeting to learn how to get a different job? The early program was full of such sessions and a turn off to many of the employers and potential attendees I spoke with. They need to send people to the meeting for a skills and industry update and refresher.
So there are few attendees because the programs are not delivering content. Business discussions among exhibitors have held them in the hall for the past few years, but is that enough to make the show a go? There are new options out there, as you will see later in this article.
Heavy regulations, booth fee increases, and limited staff resources have resulted in a thin meeting on top of an already downward fiscal spiral for SLA. Can they pull it out? Perhaps they can, but probably not with the current strategy. That exhibit hall finances much of the SLA annual operations. An organization which gets more than half of its annual income from a single face-to-face meeting in the Internet age has some hard thinking to do.
Information Today built its reputation on the once premier meeting in the industry – the National Online Meeting. It sprang into being when SLA and ASIS&T missed the rise of online searching and the incipient Internet offerings as a potential big force in their lives. More recently this meeting has been cut into sections and targeted to specific groups, like Computers in Libraries, Internet Librarian, Taxonomy Boot Camp, Knowledge Management World, etc. Each of them seems to draw a small but loyal crowd of attendees. The business aspect of the meeting has been lost, not much deal making goes on here, and the exhibits are shrinking. Here too, if you are a potential exhibitor, you are generally not allowed a speaking slot unless you pony up for a booth.
This has led to a platform of consultants, who plead inability to exhibit, hawking their services from the podium. The quality of the program is diminished and the people with industry knowledge look for another avenue to get to the customer. The previous model, in which those who were speaking might perhaps also exhibit, has changed to no speaking unless you exhibit. Further, the segmentation of the meeting has meant that the exhibitors cannot form the deal-making side of attendance that is so important to their livelihood.
International Online was also an Information Today meeting (okay, Learned Information and when they sold it they had to change their name to the very successful industry newspaper they publish – not a bad thing). Traditionally held the first week of December in London, it was THE place to be for the buying and selling of digital rights and to see what new things were being released in the New Year. A vibrant, exciting meeting with a crush of people, big parties in the evenings, cutting edge presentations, and many user group meetings surrounding the IOM. One person commented that about 90% of the Intellectual property rights deals and changes for the year happened in that week in December. This year the meeting was a shadow of itself. Most of the big players did not exhibit, very few people walked through the hall. If you set up lots of meetings in advance, it was okay, otherwise a dud. What happened?
It became two unconnected meetings. One was the conference with delegates (attendees) held on the third floor a block away from the exhibit so the attendees seldom came down through the wet London cold to the exhibition. At the same time it became very expensive! Greed in the face of an economic down turn certainly plays a role, but this is not the only factor. Next year it is moving to Docklands from the Olympia and changing the format and venue. The meeting we knew is gone.
NFAIS has gone a different way. It is a membership organization of about 120 companies. But leveraging the intellectual value added, including controlled vocabularies, is not the current focus of the meeting of these former abstracting and indexing organizations. Their focus is on the “next big thing,” the trends in the industry. The program committee does NOT select member companies to speak. So if you are a member, you will not be on the podium except as a possible moderator. But NFAIS members would like to hear from members who are in a similar situation and find out how they have dealt with the problem. It is a cutting-edge meeting, well planned and thought out, but it does not grow due to self-imposed limits.
ASIS&T, the American Society for Information Science and Technology is often considered an academic meeting where professors can get their students’ papers on the program to showcase them. The Board is academic. The members are a mix, more academic than practitioners, but still a fair number of people looking for new technologies and a way to implement them on the home front. I used to survey the audience and decided it was in three segments. The academics sat in the front of the hall ready to comment and debate with the speaker, the practitioners and managers in the middle soaking up what they could from the presentations and questions, and the entrepreneurs and other misfits in the back, standing or on the aisles with an easy exit plan. It is still that way except that the middle has thinned out considerably. The meeting this year was a pleasant surprise on many fronts. It was a substantive program. Lots of hard hitting application and real life talks, less of the presentations on a sample of 10 – 30 and extrapolating unrealistic results. The talks were longer – 30 minutes and allowed enough time to actually describe the substance and then have penetrating questions. The student papers were moved to a huge poster session – 92 posters replacing the Presidential reception with dinner in the middle and posters around the edge – great for conversations, good learning experiences. Well done. Some even had to do with taxonomies.
But for a lot of application and implementation discussions, the action has moved to the ASIS&T IA Summit. The information architecture meeting now has as many attendees as the annual meeting (around 700 people) and has its own Web site and branding. Here it is far less academic and much more hands on discussions. I found the meetings clannish, but the discussions were worth listening to.
Frankfurt Book Fair – a few years ago this meeting was only for print publishers, although it was THE meeting for print. But as digital media has taken hold a new pavilion was added and the digital activity in Building 4 is now incredibly active. The rights trading is definitely done at this show now. The parties and the satellite meetings have mostly moved here. Publishers and the Online community have merged to be here in Frankfurt in October.
User Group Meetings – remember they used to be satellite meetings around the bigger meetings, but their members were no longer attending the big meetings. They now go for the shorter, pure vendor updates and presentations, which deal directly with their service, product, or software. They use these specialized events to learn what’s new and how to use it better. It pays off back at the office and you meet others who are using and leveraging the same things. I attended several of these during the year. They were uniformly well attended by enthusiastic people wanting to know more about the products and services so they could better manage their investments. The meetings that are viable are those that engage the attendee, and the User Group Meetings do. They are of two types: 1) those which follow the rock star level of presentation, like MarkLogic and SilverChair, and 2) those which are hands-on updates on the applications and use cases to leverage the customer investments, like Atypon and Data Harmony.
Summary:
Okay great – we know where the companies are going to get their work done, make deals, and to learn new things, but what about the individual? Where are they going? What are they now doing to learn and keep skills fresh?
The Internet has made many things possible that were not possible before. We can convene a meeting electronically in a very short time. We can have discussions over Skype or Webex or GoToMeeting. We can develop documents using collaborative wikis. We can have conference calls for people in many locations and several continents without leaving our desks. People have turned increasingly to webinars and web searching to find new things and answers. We follow blogs to read opinions and discussions to add to and enjoy.
If we go to a meeting, we are expecting something else. We want to find community. We want to build relationships, which can then be maintained on the Web once they are established. We want to have discussions. We want to help build, brainstorm, learn, and develop in a group setting. We want to make a deal, discuss the terms, and build trust, face to face. Teaching new skills, reading thought pieces, and announcements can now be done in a web-enabled environment.
Selling (Prostitution) of the speaking slots by the real vendors, those who put on the shows, has had a deleterious effect on the quality of the meetings. The costs have reached a tipping point where they no longer provide a good return on investment for attendee or exhibitor. It is no longer useful to have a big party for your users or to set up a user group meeting in conjunction with one of the big national meetings. But more than that, the challenge remains on how to engage the attendee. How can they be part of the meeting rather than a passive audience? How do you get a sense of community?
There are several budding online communities, which seem to be flourishing. The Taxonomy Community of Practice is one; the Taxonomy Division of SLA is another. The ones on LinkedIn and Facebook have not yet taken off. The rest are in user groups. Access Innovations’ Data Harmony User Group meeting will be held in Albuquerque February 7-9, 2012.
Come join the community!
Marjorie M.K. Hlava
President, Access Innovations
SharePoint and Taxonomies – Part VI of VI
January 2, 2012
Posted in Access Insights, Featured, semantic, Taxonomy
This final segment in our series on semantic integration specifically addresses SharePoint and Taxonomies.
SharePoint is a popular software product and comes free with the Microsoft Server. In fact, I think SharePoint, more than any other thing, has excited people’s interest in taxonomies. SharePoint 2010 has a taxonomy module and although it does not have everything that your heart might wish for, it is a significant step forward. A lot of people have been trying to figure out exactly how to best use their taxonomy within the SharePoint offering. This is one option.
SharePoint itself will only show you ten lines of a vocabulary. This particular application, Data Harmony, shows you a bunch more. In this case again, it’s when you are uploading a document, we want to be able to suggest, from that document, terms that are actually valid in your taxonomy and then post those as keywords in the SharePoint system so that you can search for them using your taxonomy. Since it is very easy to build a SharePoint application, just like it used to be very easy to build a Lotus Notes application, the control of that application gets out of hand quickly. People are looking hard to find ways to implement some kind of vocabulary control using SharePoint, particularly 2010, to a lesser extent 2007 so that they can actually index their documents and get them out easily. They are not going to have to remember what somebody called them. They can make broad use of synonyms and browse categories and generally and get at their information more easily.
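The term suggestion step described above can be sketched in a few lines. This is only an illustration of the idea, not how Data Harmony or SharePoint actually does it: the mini-taxonomy, its synonyms, and the function name `suggest_terms` are all made up for the example, and real systems use much richer rules than a plain phrase match.

```python
import re

# Hypothetical mini-taxonomy: preferred term -> synonyms (illustrative only).
TAXONOMY = {
    "Elder care": ["eldercare", "care of the elderly"],
    "Payroll records": ["pay stubs", "salary records"],
    "Retention schedules": ["retention schedule", "records retention"],
}


def suggest_terms(text, taxonomy=TAXONOMY):
    """Return preferred terms whose label or any synonym appears in the text."""
    lowered = text.lower()
    suggestions = []
    for preferred, synonyms in taxonomy.items():
        for label in [preferred] + synonyms:
            # Whole-phrase, case-insensitive match on word boundaries.
            if re.search(r"\b" + re.escape(label.lower()) + r"\b", lowered):
                suggestions.append(preferred)
                break  # one hit is enough to suggest this term
    return suggestions


doc = "Please file the pay stubs according to the records retention policy."
print(suggest_terms(doc))
```

Because the synonyms all point back to one preferred term, the person uploading the document never has to remember which wording the taxonomy settled on.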
Here are a couple of other implementations. The last one was Eldercare. This one is on Educational Information. People can browse the terms or they can type ahead and get the appropriate suggestion in the keyword field from the taxonomy. That is very helpful to people in their SharePoint implementations.
Another case is Records Management. People are even using SharePoint for Records Management but in the case here, because of the nature of Records Management, you might have groups of types of records or facets for record types. You could also have content types. The content types could be arranged in a taxonomic fashion. You might have Human Resources documents and, under those Human Resources documents, you might have many different kinds of items, e.g., reviews, résumés, payroll records, etc. Finance might also have payroll records. So you will want to give them multiple broader terms. Where you have a combination of the record types, the content types, and the creators of those records, you might be able to automate the retention schedule assignments. Figuring out the retention schedule for the record types and the content types, by creator, is a very heavy load for most organizations these days. How long do you need to keep those things? Do you need to keep them three years for tax? Do you need to keep them seven years for fraud? Do you need to keep them indefinitely for patent research? Do you need at least 17 years? There are a lot of different retention schedules to which you need to pay attention. Having this kind of automation help from the taxonomy might be very useful.
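As a rough sketch of what automating the retention assignment could look like, assume each record carries a record type, a content type, and a creator drawn from the taxonomy. The rule table, the year values, and the function name below are placeholders invented for the example; they are not legal guidance and not taken from any real retention schedule.

```python
# Hypothetical rules keyed by (record type, content type, creator).
# Year values are placeholders; None means "keep indefinitely".
RETENTION_RULES = {
    ("Personnel", "Payroll records", "Finance"): 7,
    ("Personnel", "Payroll records", "Human Resources"): 7,
    ("Personnel", "Résumés", "Human Resources"): 3,
    ("Research", "Patent files", "Legal"): None,
}


def retention_years(record_type, content_type, creator, rules=RETENTION_RULES):
    """Look up the retention period for one combination of taxonomy values."""
    key = (record_type, content_type, creator)
    if key not in rules:
        # Surface gaps instead of guessing a retention period.
        raise KeyError(f"No retention rule for {key}")
    return rules[key]


print(retention_years("Personnel", "Résumés", "Human Resources"))
```

The point of the sketch is that once documents are tagged consistently, the schedule becomes a lookup rather than a judgment call made document by document.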
Conclusion
We have covered a lot of different information, but what I would like to leave you with are the ideas that taxonomies and metadata are really the cornerstones of information architecture. They can be used as the basis for content organization and, if they are used that way, then they can build a browsable outline of the content. When you are using subject metadata, especially the taxonomy, you can get 100% recall of relevant information. That is a big thing for people who cannot afford to miss any of the information that is in your database corpus. Taxonomies are the basis for search and for labeling things for storage, and they are very useful in navigation and information architecture. When you recognize those synonyms, you can improve the taxonomy implementation considerably.
Taxonomies are great fun to build because they challenge your intellectual rigor. Applying them to data is what really makes the work worthwhile. That is where the rubber hits the road. So we have to figure out how best to use those taxonomies. The more ways you can find to use them, the more likely they are to be supported over your tenure with the taxonomy. Maintaining them and their applications is going to be what creates a strong knowledge management platform for an organization.
Marjorie M.K. Hlava
President, Access Innovations
Data Visualization and Term Analytics – Part V of VI
December 26, 2011
Posted in Access Insights, Featured, semantic, Taxonomy
Continuing with our series on semantic integration, let’s address Data Visualization and Term Analytics.

A completely different kind of use is term analytics. We have talked a lot about text analytics in the past, where people take great big full text files and run them through a lot of Bayesian, neural net, and latent semantic indexing kinds of engines to figure out how to compare things. You could do that using a taxonomy instead and still figure out the strengths of the organization: what are the strengths in the publications, and what are the emerging topics in your areas? You use people’s own data to address these questions and figure out the answers.
In this particular case, we took ten years of PubMed, ten years of IEEE, and ten years of the US Patents and we ran the MeSH subject headings, the IEEE thesaurus, and the DTIC thesaurus against each of the three collections. That gives you nine different sets of data. Then we compared them to see where the field is going. What is the next event? What are the trends? Since we were able to do it in one-year slices, we were able to map that term space to show the distributions and figure out where the overlaps were, where things were pulling apart, which groups needed to be changed or augmented, which ones could be enlarged or generally marketed to, what the new trends in the business are, and what new pieces of information we need to deal with. What came out of it is a lot of different ways to display the same sets of data. This is exactly the same data from that nine-point matrix displayed in a number of different visual applications. This long line at the bottom is just wrapped around a circle here to show different ways that the data can be displayed. In this matrix up here, we have the blue and the red – red being medical, blue being engineering – and we are able to show where bioengineering, for example, overlaps. This is good for business intelligence and data mining, harnessing the concepts from the taxonomy.
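To make the one-year slices concrete, here is a minimal sketch of counting taxonomy terms per year. The records and term names are invented for illustration; the actual study ran three thesauri against three large collections, which this toy example does not attempt to reproduce.

```python
from collections import Counter, defaultdict

# Hypothetical indexed records: (publication year, taxonomy terms assigned).
records = [
    (2009, ["Bioengineering", "Imaging"]),
    (2010, ["Bioengineering"]),
    (2010, ["Bioengineering", "Nanotechnology"]),
    (2010, ["Imaging"]),
]


def term_counts_by_year(records):
    """Count how many records carry each term, in one-year slices."""
    counts = defaultdict(Counter)
    for year, terms in records:
        counts[year].update(set(terms))  # count each term once per record
    return counts


counts = term_counts_by_year(records)
for year in sorted(counts):
    print(year, counts[year].most_common())
```

Comparing the slices year over year is what lets you see which terms are growing, which are fading, and where two vocabularies start to overlap.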
Mobile Options
A taxonomy really helps deliver things that are very precise and exactly what the user wants. You don’t have time on a mobile device to just give the user the normal million hits from Google. You want to provide very precise results with very good recall, so that what they are getting is everything on the topic but only the items that are precisely on the topic, not ones that wander off into different areas. Relevance to me is, as some of you have heard me say, a bit of a canard because it is just a guess. It is a confidence factor. I don’t really want your confidence that you are answering my question correctly. I want you to absolutely answer my question precisely.
Taxonomies in E-Commerce
E-commerce is yet another way to use taxonomies. You have seen this a lot on Amazon, L.L.Bean, or eBay. Take a look at an Amazon site with primary categorizations on the right. These are the top terms in their taxonomy. When they display the second level, with Arts and Photography as the first-level term, they are displaying not only the second-level terms but also the related categories and the related topics. They are trying to get to a lot more areas in the same screen. If I go to Photography and click again, I would then go down to an even narrower selection. Then I can click on any one of those and get to their offerings. We don’t want to go any deeper than three levels. Keeping it to three clicks for searching is the web concept; more clicks and the user has lost interest. On Amazon, where I think most of you have probably ordered, it’ll say something like “customers who bought (whatever you bought) also bought ….” Then they will show some thumbnails and the bibliographic citations based on the subject area or the taxonomy terms that are applied to these books. They are going to serve some additional content to the user so that he can link to more like these, based on what he saw and what he was searching for. You could drive this from the taxonomy instead, as a recommendation based on similar taxonomy terms.
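A recommendation driven by similar taxonomy terms could be sketched like this. The catalog, the item names, and the choice of Jaccard overlap as the similarity measure are all assumptions made for the example; nothing here describes how Amazon or any other site actually computes recommendations.

```python
def jaccard(a, b):
    """Share of taxonomy terms two items have in common (0.0 to 1.0)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical catalog: item -> taxonomy terms applied to it.
catalog = {
    "Book A": {"Photography", "Digital cameras"},
    "Book B": {"Photography", "Lighting"},
    "Book C": {"Cooking"},
}


def recommend(item, catalog, top_n=2):
    """Rank other items by taxonomy-term overlap with the given item."""
    scored = [
        (jaccard(catalog[item], terms), other)
        for other, terms in catalog.items()
        if other != item
    ]
    scored = [(score, other) for score, other in scored if score > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best overlap first
    return [other for _, other in scored[:top_n]]


print(recommend("Book A", catalog))
```

Unlike "customers also bought", this kind of suggestion works from the first day an item is indexed, since it needs no purchase history.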
Next week, we will finish up the series talking about SharePoint and Taxonomies.
Marjorie M.K. Hlava
President, Access Innovations
Authors at a Place – Part IV of VI
December 19, 2011
Posted in Access Insights, Featured, Taxonomy
Continuing with our series on semantic integration, let’s discuss Authors.
You can take a list of authors by place. Identifying marks or spots represent clusters of people who have published on a particular topical area. You take the author and their subject profile, attach the latitude and longitude of their address, and you can put them on a map.
These kinds of match-ups are pretty cool.
This one is a match-up using Google Earth. In this case, we’ve taken a bunch of cancer research institutes and their addresses. Because they have to do with cancer and because they have their place name attached to a latitude and longitude, we can do this graphical display of where they are located. So, then you will know if there is someone near you who might be doing this kind of work.
Displaying Authors and institutes on a map based on taxonomy
This one is using a little free program called SpicyNodes. It doesn’t work for large purposes but it is really fun for small ones. This shows a term and the terms that are the next level down. This is based on the broader-narrower term relationships. There is a top term, AACR. These are the major topical areas and medicine is one of those topical areas. These are the terms below medicine. You see first level, second level – including this one, and then third level terms. It is a fun way to display your taxonomy. If I could do a live moving screen through the connections, you would see the different connections changing as you get closer to and further from a term. You could see the arrows move and bounce. It is a very fun thing to do but it also shows connections in a way that you don’t normally see them in a flat navigation list.
Another way to use taxonomies is for an Expert Reviewers list. When you are looking for people to work on a project or to review a piece of work, it is hard to find people who have the expertise to actually do it. The same is true if you are trying to decide whether you should start a new project, or to have someone review a piece of research to see if it is valid – or not. How do you find people who have the expertise to give you any kind of advice on that? If the list is paper-based, it is not so easy to use, but if you changed it into a dynamic taxonomy, then the reviewers could be found from the member profiles. Flag those who are willing to do reviews, and it would supply you with a quick list of experts to review those articles.
Following along in that vein, you can also use it as a way to do member profiles so you can find people of like interest. Who else within SLA is interested in doing taxonomies? Should we do a survey and then canvass those who might be interested in joining us? If people have input information about themselves and the kinds of things that they have worked on, then the system could automatically suggest from the controlled vocabulary what might be of interest to them. (Actually, for SLA, I have created a taxonomy, which currently is not in use but was in use for a while on the website and I think it helped search a lot for that period of time.)
Here’s another example. This one is a fairly new system called XPeerient. XPeerient is a new company that is matching up technology sellers with technology buyers. What happens is that someone has a bit of information about hardware, software, or service and they upload it to XPeerient in an online template and then they click a button to auto-suggest the taxonomy attributes. The attributes that match this profile come up into the listing of terms – on the right – and you click the ones that you think are appropriate. Then, the system matches the profiles so that if someone is looking to buy software and someone is looking to sell, they can match the profiles and find each other more easily. It is like a dating service for hardware and software findings.
Next week we will talk about Data Visualization.
Marjorie M.K. Hlava
President, Access Innovations
Paper Submission – Empowering the Authors – Part III of VI
December 12, 2011
Posted in Access Insights, Featured, Taxonomy
As we continue talking about semantic integration and, more specifically, use cases, here is a different kind of use case.
Somebody in this case is uploading an article (and you’ll see this again in other venues), here a conference paper to the American Society for Information Science and Technology site. A prospective author uploads their paper, they fill in the blanks as appropriate, and up pops a list of appropriate taxonomy terms. The author, the submitting person, can then check off the terms they think are appropriate to their paper. Then the paper is saved, already indexed. They will have already indexed their paper, which saves people a lot of time. In any kind of venue, folks can do their own indexing.
Behind the scenes, it can be saved either as an HTML or as an XML file so that it can be integrated into the general editorial workflow. The terms themselves are added as a tagged XML subject feed for these records.
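As a rough illustration of what a tagged subject feed might look like, the sketch below writes the author-selected terms into a small XML record. The element names (`record`, `title`, `author`, `subjects`, `subject`) and the sample values are assumptions for the example; the actual schema used in the submission system is not described here.

```python
import xml.etree.ElementTree as ET


def build_record(title, author, terms):
    """Build a minimal XML record with one <subject> per selected term."""
    record = ET.Element("record")
    ET.SubElement(record, "title").text = title
    ET.SubElement(record, "author").text = author
    subjects = ET.SubElement(record, "subjects")
    for term in terms:
        ET.SubElement(subjects, "subject").text = term
    return record


# Hypothetical submission: the author checked off two taxonomy terms.
record = build_record("Sample paper", "A. Author", ["Taxonomies", "Indexing"])
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Keeping the subject terms in their own tagged elements is what lets the editorial workflow, and later the search system, pick them up without re-reading the paper.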
I mentioned, briefly, an author authority file. You can build it while you build the article base. You would create a full author record: Name, address, URLs, websites, telephone numbers, fax number, email, etc. You would also allow them to put in a profile of who they are.
You can add the subject terms from the thesaurus or taxonomy to the author record and let them choose the terms, just as we saw in the previous slide. If they entered their résumés or their CVs, for example, the terms would automatically pop up, but if they didn’t enter quite as much, you might have to discern them some other way.
You list authors and then you can find them in a place and you can put them into a social network. This not only works with authors but also with staff, for example, a highly distributed research staff throughout the nation or throughout the world. You might want to be able to find people that are working on similar things. They have to be able to describe what they are working on and they can use the taxonomy to do that.
If you’ve done that, then you could link the authors in these kinds of ways, in this case, by publications.
This author has published with these people and these people have published with the people in the secondary circle. So, you have two levels of author linkages and you can do this on subject as well as on name. Over here, we just have a fish-eye view of the authors as you circle over this field.
Next week we will talk about Authors.
Marjorie M.K. Hlava
President, Access Innovations
Use Cases for Semantic Enrichment Using Taxonomies – Part II of VI
December 5, 2011
Posted in Access Insights, Featured, Taxonomy
Last week, we started discussing semantic integration. We did a brief introduction and, now that the basics are settled, let us look at ways to enrich the user experience using semantics (the words from our taxonomies). We have a way to improve search. It is a search option driven by taxonomies that you can play with.

We have the hierarchy on the left. We can navigate the full taxonomic tree, as you can see in the screen shot. There is a term and then a number behind that term. That tells you how many records are tagged with that term in the corpus, the database or set of information that you have. You have all of that information available to you and you know exactly how much is in each of those branches. It is also a good way, if you are building a taxonomy, to get a feel for how many terms are overloaded and how many really don’t have any records tagged with them, in which case they really are not sensible taxonomy terms.
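The numbers behind each term are simple to compute once the records are tagged. The following is a minimal sketch with made-up terms, records, and an arbitrary overload threshold; it only illustrates how term counts can flag overloaded and unused terms.

```python
from collections import Counter

# Hypothetical taxonomy terms and tagged records (illustrative only).
taxonomy_terms = ["Diseases", "Sickle cell disease", "Anemia", "Malaria"]
record_tags = {
    "rec1": ["Sickle cell disease", "Anemia"],
    "rec2": ["Sickle cell disease"],
    "rec3": ["Sickle cell disease", "Diseases"],
}
OVERLOAD_THRESHOLD = 2  # arbitrary cutoff chosen for this example


def term_usage(taxonomy_terms, record_tags):
    """Count how many records are tagged with each taxonomy term."""
    counts = Counter({term: 0 for term in taxonomy_terms})
    for tags in record_tags.values():
        for term in set(tags):  # count a term once per record
            if term in counts:
                counts[term] += 1
    return counts


usage = term_usage(taxonomy_terms, record_tags)
unused = [t for t in taxonomy_terms if usage[t] == 0]
overloaded = [t for t in taxonomy_terms if usage[t] > OVERLOAD_THRESHOLD]
print(dict(usage), unused, overloaded)
```

Unused terms are candidates for removal or merging, and overloaded ones are candidates for splitting into narrower terms.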
Another way to use the taxonomy is to combine the synonyms as well as the main terms and put them in this permuted list. This is permuted, in that they wrap around. For example, sickle cell disease here would also be found as the same heading under disease and under sickle. The person wouldn’t have to know exactly what the term was that you finally decided on to put as the primary term in the taxonomy. They could just type in the term and they would have a good idea of where to search. If they click on it, then it would automatically implement the search. On the right you will notice, contextually appropriate, under Expand Your Search, we have a way to look at related terms for the search term that you chose. We also have narrower terms so that you can narrow the search and make it more specific. So, lots of ways to guide the user on the search side directly from the taxonomy itself. These were created from the term record in the full thesaurus using those relationships to leverage search.
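Here is one way the wrap-around listing could be generated. It is a sketch under stated assumptions: the labels, the synonym shown, and the function name `permuted_index` are invented for the example, and a production list would normally skip stop words such as "of" and "the".

```python
def permuted_index(labels):
    """Build wrap-around entries so each word of a label becomes a lookup key.

    labels maps every label (preferred term or synonym) to its preferred term.
    Each entry is (lookup word, rotated label, preferred term).
    """
    entries = []
    for label, preferred in labels.items():
        words = label.split()
        for i, word in enumerate(words):
            rotated = " ".join(words[i:] + words[:i])  # wrap around at word i
            entries.append((word.lower(), rotated, preferred))
    return sorted(entries)


# Hypothetical labels: one preferred term plus one synonym pointing to it.
labels = {
    "Sickle cell disease": "Sickle cell disease",
    "Sickle cell anemia": "Sickle cell disease",
}

for key, rotated, preferred in permuted_index(labels):
    print(f"{key:10} {rotated:25} -> {preferred}")
```

A searcher who types "disease" or "anemia" lands on an entry that leads to the preferred term, without needing to know which wording was chosen as primary.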
What we have found is that it is 50% faster for people to do a search if they can use browsable categories. Chin Che Chin and Dumais have written a very nice study on how helpful it is. It is even better in the case where you are doing a Google-type search and the results are not in the top 20; if you entered the specific term, you’d be able to go right to it. Searchers have many different learning styles. Therefore, they have different ways to search. But you can get them to browse in a category search and they prefer it, at least 50% of the time.
Serving lots of related content – an Association Example
Let’s look at the needs of organizations like associations to serve a great deal of content to their users on their web sites. If the taxonomy is used to tag all the content, then it can be delivered through the Content Management System each time the taxonomy terms (keywords is what they are called in this application) are surfaced in a search query. The main information is delivered and the collateral information is delivered to the page at the same time, giving a wonderful search experience.
If you use your taxonomy to index everything in your corpus – on your site – then when you are ready to serve it up to the users, you could link it by article (journal article), which is pretty common, or by articles from within a document in a repository or FileNet or wherever you want, because they would all be linked by keywords. You could also do it by activity. If people are working on an activity, you could serve that up at the same time by saying that they might also be interested in this activity that we are doing, this research, or this upcoming conference, or maybe even a job posting. If you have posted job openings on this topic and someone is searching on that topic, then maybe they would be interested in thinking about the work. If you have a podcast interview where someone is talking on the subject that a searcher just entered, that would also be of interest. Maybe there’s a grant possibility. Maybe there are other people who are working in the same field and have profiled themselves with that same taxonomy term; you could serve those up as well. There are lots of ways to serve up that information.
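Linking everything by keyword amounts to a lookup from taxonomy term to content items. The sketch below assumes each item has a title, a content type, and a list of keywords; the items, field names, and function names are hypothetical and stand in for whatever the Content Management System actually stores.

```python
from collections import defaultdict

# Hypothetical content items, all indexed with the same taxonomy.
content = [
    {"title": "Biomarker study", "type": "journal article", "keywords": ["Biomarkers"]},
    {"title": "Biomarkers workshop", "type": "conference", "keywords": ["Biomarkers"]},
    {"title": "Lab technician", "type": "job posting", "keywords": ["Biomarkers", "Epidemiology"]},
    {"title": "Field methods podcast", "type": "podcast", "keywords": ["Epidemiology"]},
]


def build_keyword_index(items):
    """Map each taxonomy keyword to the items tagged with it."""
    index = defaultdict(list)
    for item in items:
        for keyword in item["keywords"]:
            index[keyword].append(item)
    return index


def related_by_type(index, keyword):
    """Group the titles tagged with a keyword by content type."""
    grouped = defaultdict(list)
    for item in index.get(keyword, []):
        grouped[item["type"]].append(item["title"])
    return dict(grouped)


index = build_keyword_index(content)
print(related_by_type(index, "Biomarkers"))
```

When a search surfaces a keyword, the page can show the main result and pull the collateral items for that same keyword from the groups returned here.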
If you look at this page, you see an article on cancer epidemiology, biomarkers, and their prevention. There is a set of keywords attached to it, and from those we can decide what to serve to the user – this is from the American Association for Cancer Research. They might be interested in particular working groups, or awards, think tank reports, some webcasts, maybe some related book contents (in this case, we could even take you to the book chapter that has to do with your subjects), abstracts from upcoming conferences that might be useful, some workshops and other conferences that are being held on the topic and where they are, and even press releases. So this gives you an idea: if you have taken the taxonomy and linked the resources behind the scenes, by indexing them all using your taxonomy, then you can serve the content in a lot of different ways.
Next week we will talk about Paper Submission.
Marjorie M.K. Hlava
President, Access Innovations








