Leveraging Your Taxonomy – Part 5

March 5, 2012  
Posted in Access Insights, Featured, search, semantic

This is the next piece in our series of blog posts on search and how it works. This time we are talking about the importance of relevance.

Relevance is how well a set of returned documents answers the information need, another way of talking about accuracy. But, it’s related to the objective of the search. Different user communities can get exactly the same answer from a set of information resources, with one finding the set relevant and the other not. So, there’s a really healthy tension between the user needs and the context available. That’s why a lot of relevance engines do a lot of profiling of the users.

If I search Google with a particular question, I might get one answer and each of you will search it and get a different answer. That’s because your profile and the things that you’ve clicked on in the past will indicate to Google that your answers should be more in this sphere or more in that sphere. So, relevance is really a confidence factor or a guesstimate on how well this set of documents will answer this particular user’s query.

There are formulas for recall, precision, and relevance. Recall is the number of relevant items that have been retrieved from the system over the total number of relevant items in the collection. One hundred percent recall would mean that we got all of the relevant items of the entire collection. Precision is the number of relevant things – those things that I wanted to see as a user – out of the total number of items retrieved. So, if only five of the articles presented to me on my first ten hits were actually germane to my question, then I have only a 50% precision score. Relevance is precision in relation to recall, or those that are right for my questions vs. everything in the database that has to do with the particular topic. You can see that there is a lot of difference between precision, recall, and relevance.
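To make the arithmetic concrete, here is a minimal sketch of the precision and recall formulas described above. The function name and the document ids are made up for illustration; the only assumption is that we know which documents in the collection are relevant, which in real life is the hard part.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: ids of the documents the system returned
    relevant:  ids of every document in the collection that answers the need
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant          # relevant items that were retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Ten documents come back and five are germane; the collection
# holds twenty relevant documents in total (illustrative numbers).
retrieved = range(1, 11)           # documents 1..10
relevant = range(6, 26)            # documents 6..25
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)           # 0.5 0.25
```

Five good hits out of ten retrieved gives the 50% precision from the example above, while the same five out of twenty relevant documents gives only 25% recall.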

When we are measuring relevance we are looking at the context – i.e., this searcher, the profile of this searcher, what they have in their brain, and what they are looking for. We usually also take into account the age of the documents, because people normally are looking for the most recent. We are looking at how many of the documents we got – the completeness of the data set returned, or the recall – and at the measure of quality, which is often very hard to determine because it is in the eye of the user. Often relevance, or at least the attempt to get at it, is statistically determined. We will cover more of that with Mr. Bayes. It is really a subjective evaluation, or a confidence of the system that what I am presenting to you is, indeed, what you want. There are a lot of different and complex factors in measuring relevance itself.

If we look at the different kinds of search, we encounter several main options: keyword search, Bayesian search, Boolean search, and ranking algorithms. In truth, most search systems have some or all of these options in them. What matters is which one they depend on the most. These search options are dependent on a few particularly famous theorists. Two of them are still alive; two of them are long gone. There are Boole and Boolean algebra; Bayes and Bayesian techniques; Turney and his algorithms for enriched structured data; and Marco Dorigo and his ant colony theory. There is a very large body of research on search; please consider this as only a very high-level sampling.

George Boole was a mathematician who lived 1815 – 1864, not all that long as you can see. The Wikipedia article on him is quite good. He came up with an algebraic expression or algebraic system to express the logic of what we know as the “And”, “Or”, “Not”, and “And Not” expressions. Those of you who have been searching for quite a while may have searched Dialog or BRS, or the IBM Stairs or Ovid or CD Plus, or SilverPlatter. These, and a lot of different systems, including MedLine, are based on the Boolean algebra approach. It has been around for a long time. It is quite popular, providing very high end precision and recall statistics.

The Boolean representation is usually drawn as a Venn diagram of two overlapping circles, A and B, which shows the intersection. If I were to do a search on A and B and just entered them as a single expression, then A and B would be an automatic intersection – or could be. If I wanted A or B, I would get everything in both circles; if I want A and B, I only get the overlap of the two circles. If I want A XOR B, I would exclude the overlap. I could also say A Not B and get rid of the B circle but keep the rest of A. There are a lot of things in this expression that we can explain through a Venn diagram.
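The same circles can be sketched in a few lines of code. This is only an illustration, assuming each search term has a set of the document ids that contain it; the ids are invented.

```python
# Posting sets: which document ids contain each term (made-up data).
A = {1, 2, 3, 4}
B = {3, 4, 5, 6}

a_and_b = A & B    # AND: only the overlap of the two circles
a_or_b = A | B     # OR: everything in both circles
a_not_b = A - B    # A NOT B: drop the B circle, keep the rest of A
a_xor_b = A ^ B    # exclusive OR: both circles minus the overlap

print(a_and_b, a_or_b, a_not_b, a_xor_b)
```

A real Boolean engine does the same thing against the inverted index, just with far larger sets.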

Mr. Thomas Bayes was an earlier mathematician (1702 – 1761) and he wrote a lot about probability. He theorized, if we had a known set and we knew that these things usually happen, then when we got a new set, couldn’t we infer that the following will probably happen. That was a probability based on what had happened in the past so that we could forecast what would happen in the future. It’s a nice and fairly well-established algorithm and can be used so that we say, “Well, if these 5,000 articles were about this then, if the same term set is used in the next 5,000 articles, probably they are about the same thing.” But the distribution of probabilities changes, particularly in active areas like news or cutting edge science. People might not want to depend on the distribution of historical data to predict future data. A user might also make a new kind of request – something that is not what they have queried of the system in the past. So, to get that information out of the network is much harder. We have a computational linguistic difficulty if we explore a set of data with an unknown new kind of request.

What we have to do is to say, “We knew this to be true in the past and therefore it will be true in the future.” So, if you are looking at a lot of terrorist literature, for example, and trying to figure out what may happen in the future, based on what has happened in the past, one would expect that the terrorists would be aware of that and they would be constantly changing, in new and novel ways, so that they could trick the system – figure out the distribution of probabilities and actually give erroneous results to people. So, if you depend on a Bayesian engine to keep track of rapidly happening events, you often find that you come up a little off from what actually happens. Another way of saying that is that you have to assume that the prior knowledge is always reliable and is indeed what will happen in the future, because if you say that it’s going to be different, then the next results will be invalid. You want to be sure that the statistical distribution that you come up with to use for modeling your data is consistent. If you have a consistent set of data inside and it’s not going to change much, then this is a good way to go. Otherwise, you have to constantly train and re-train the data every time you add new data sets, particularly if the direction of the field has changed.
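Here is a very small sketch of that idea: learn word counts from articles whose subject is already known, then guess the subject of a new article from the same term set. It assumes a simple naive Bayes model with add-one smoothing, which is one common way to apply Bayes, not necessarily what any particular search engine does. The training sentences and labels are invented.

```python
import math
from collections import Counter


def train(labeled_docs):
    """labeled_docs: list of (label, text). Returns (word_counts, doc_counts)."""
    word_counts = {}          # label -> Counter of words seen under that label
    doc_counts = Counter()    # label -> number of training documents
    for label, text in labeled_docs:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts, doc_counts


def classify(text, word_counts, doc_counts):
    """Return the label with the highest posterior log-probability."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        score = math.log(doc_counts[label] / total_docs)    # the prior
        total_words = sum(counts.values())
        for word in text.lower().split():
            # add-one smoothing so an unseen word does not zero out a label
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


training = [
    ("baseball", "pitcher throws strike in spring training"),
    ("baseball", "batter hits home run"),
    ("biology", "species genus family classification"),
    ("biology", "field study of plant species"),
]
model = train(training)
print(classify("new plant species found", *model))   # biology
```

The weakness discussed above shows up directly: the model only knows the words in its training set, so when the vocabulary of the field shifts, it has to be retrained.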

A more recent guy, Peter Turney, hails from Canada and talks about learning algorithms for key phrase extraction. He says that you can do that as a tree. You can make decisions as you move down the tree. He called it the “tree induction algorithm”. And, he came up with something called lexical semantics which is a way to say, “As I make decisions going down that tree, things are going to change.”

 

Extraction vs. generation and sentiment of words:

$$\log_2 \frac{\text{hits}(\text{word AND “excellent”}) \times \text{hits}(\text{“poor”})}{\text{hits}(\text{word AND “poor”}) \times \text{hits}(\text{“excellent”})}$$

His work on learning algorithms for key phrase extraction, and the formula that he came up with, give you an idea of how you can do just plain extractions of data from a system versus trying to generate even sentiment from the words that are there. He could plot 80% accuracy in these results, which is better than 60% from Bayes.
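As a rough illustration of the sentiment formula above, the sketch below plugs hit counts into the log ratio. The function name and the hit counts are hypothetical; in practice the counts would come from a search engine, and the sketch does not handle a count of zero.

```python
import math


def sentiment_score(hits_with_excellent, hits_with_poor,
                    hits_excellent, hits_poor):
    """log2 of (hits(word AND excellent) * hits(poor))
              / (hits(word AND poor) * hits(excellent))."""
    numerator = hits_with_excellent * hits_poor
    denominator = hits_with_poor * hits_excellent
    return math.log2(numerator / denominator)


# Invented counts: the word co-occurs with "excellent" 800 times and
# with "poor" 100 times; "excellent" has 1,000,000 hits, "poor" 500,000.
score = sentiment_score(800, 100, 1_000_000, 500_000)
print(score)   # 2.0
```

A positive score means the word keeps company with “excellent” more than chance would suggest; a negative score leans toward “poor.”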

Another guy, Marco Dorigo, research director for the Belgian Fonds de la Recherche Scientifique and research director of the IRIDIA lab at the Université Libre de Bruxelles, talks about swarm intelligence. The idea is that if you look at the way things move and the way things change, you could actually change that information on a dime if you knew which way information was going to be moving. There’s the data itself, then there’s the way that data is used – or what is important about it at the moment, what is heuristically important. Ant colony optimization is a metaheuristic for combinatorial optimization problems using “swarm intelligence.” It makes statements about the value importance vs. heuristic importance and is therefore useful in search prediction. For example, you might look at the way people are analyzing Twitter feeds: it’s ant colony optimization, it’s swarm intelligence. Suddenly a Twitter stream emerges out of nowhere and gathers importance very, very quickly, just like a bunch of ants suddenly attacking a piece of peanut butter and jelly sandwich that landed on the ground only minutes ago.

Another big area that we know about is natural language processing (NLP). Natural language processing is frequently used these days in conjunction with another system. How much each of its main pillars is used depends on the researcher and on the system being created.

  1. Syntactics: the rules for the language and how they govern the sentences in any individual language.
  2. Semantics: the words themselves and how they are stated and behave.
  3. Morphology of those words – the singulars and plurals and other things about them.
  4. Phrase logical implementations – the use of those words in phrases.
  5. Stemming, or lemmatization: cutting off the endings, such as the -eds and the -ings, and other things that take you to the word root.
  6. Statistical options – as outlined above.
  7. Grammatical applications, some of them actually fully graphed sentences. Some of you are old enough to remember graphing sentences from school.
  8. Then, at the end of the day there is just nice common sense. It’s really handy to have a common-sense algorithm that you can apply. That is often done in a rules base or something like that where you say that it’s pretty clear to us how this works. Natural language processing in combination with Boolean operators, for example, makes a nice rules-based system for people.

That is a nice segue to automatic language processing, which is where we’ll pick up next week.

Marjorie M.K. Hlava
President, Access Innovations

Leveraging Your Taxonomy – Part 4

February 27, 2012  
Posted in Access Insights, Featured, search

As we continue this series on search and how it works, we have to address accuracy. First, how are we going to measure accuracy?

  • Relevance
  • Recall
  • Precision
  • Accuracy – Hits, misses, noise
  • Ranking
  • Linguistics
  • Query processing
  • Results processing
  • Display
  • Search refinement
  • Usability
  • Business rules

As you can see from this list, there are a whole lot of different ways to do it. Relevance has become extraordinarily popular, so I put that at the top, but there are lots of other ways to measure it. One is recall:  Did you get everything in the database having to do with your question? And precision:  So, did you get everything but not a lot of junk as well? If you got everything, you might have gotten a lot of junk that was not really responsive or appropriate to your question. Taxonomies help an extraordinary amount with recall and precision. They don’t help much with relevance.

Another way to measure accuracy is with statistics on hits, misses, and noise. That is when you look at the records retrieved – and you can use this for the indexing accuracy, as well. (Indexing and search go hand in hand and have done so since the first computer databases for text were developed in the mid-1960s.) A hit is something that a human thinks is exactly right and the computer also thought was right. A miss is something that the human would have suggested but the computer did not suggest. Noise is something the computer suggested, about which the human says, “That isn’t really right for my query.” So, hit, miss and noise statistics are another way to measure accuracy.
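The hit, miss, and noise tallies are easy to sketch. The example below assumes we have the terms a human indexer chose and the terms the computer suggested for the same record; the function name and the terms are invented.

```python
def hit_miss_noise(human_terms, machine_terms):
    """Compare what the human chose with what the computer suggested."""
    human, machine = set(human_terms), set(machine_terms)
    return {
        "hits": human & machine,     # both agree
        "misses": human - machine,   # human wanted it, computer did not suggest it
        "noise": machine - human,    # computer suggested it, human rejects it
    }


result = hit_miss_noise(
    human_terms={"taxonomy", "search", "indexing"},
    machine_terms={"taxonomy", "search", "baseball"},
)
print(result)
```

Counting the three sets over a sample of records gives you the statistics; the same comparison works for search results in place of indexing terms.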

Search results are often presented in rank order – those that are most appropriate are first, and those that match fewer and fewer parts of the query are closer to the bottom.
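One very simple way to get that kind of rank order is to count how many query terms each document matches. This is only a sketch of the idea with made-up documents; real ranking algorithms weigh many more factors.

```python
def rank_by_matches(query, docs):
    """Return doc ids ordered by how many distinct query terms they contain."""
    terms = set(query.lower().split())
    scored = []
    for doc_id, text in docs.items():
        matched = len(terms & set(text.lower().split()))
        if matched:                       # drop documents that match nothing
            scored.append((matched, doc_id))
    # most matched terms first; ties broken by doc id for a stable order
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


docs = {
    "a": "taxonomy search accuracy",
    "b": "search software",
    "c": "baseball season",
    "d": "taxonomy of search accuracy measures",
}
ranking = rank_by_matches("taxonomy search accuracy", docs)
print(ranking)   # ['a', 'd', 'b']
```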

Linguistic analysis or general linguistic applications using natural language processing are frequently applied in search and, there again, we measure them either by recall and precision or hit, miss, and noise statistics. 

Query Processing

In measuring accuracy in search or measuring how well search works, we also talk about query processing. In particular, we are interested in how fast the results are returned. So, there are two parts to that. One part is the query. I’ve asked a query; now, how fast is it going to come back to me with an answer? That often depends on how much of that query is going to be held in cache memory and how much is going to be accessed through a hard drive of some kind.

The second part is the results processing. We are looking at the results that are coming to the user. How fast can I see my results? How fast are they returned to the user? A query is where we are getting the pieces of the answer, but once the query has given me those pieces, I need to process those into a result – into the answer that you see. I got 55 hits and I want to see them. I want to click on something and get the actual document presented to me. Well, that is the display processing. Most systems do not assemble the entire record on the fly, or as you ask for it, from a lot of different pieces. They have what is called a display server, and they will show you the results from that as you indicate your approval by clicking the URL or the path that will take you to the full document. So, it pops up as a full document in all its original glory. The easy way is just to store the full document somewhere out there and let you go retrieve it. In a system like Google, obviously you are going to the original web page or whatever is referenced in the URL link.

There are lots of refinements on that. How long does it take you to narrow down your search?  Can you do a search within a search, also known as a recursive search? You have a general set. Can you just keep that set and throw out the junk and narrow in to what is precisely what you want, or do you have to start the search over with more words so that you can get an answer?

Usability is another way that we measure search results to see how good the search system is. This is when we say – “This was really easy to use.”  “It was very user-friendly.”  “I like the user interface.” All of those user experience kinds of questions are, of course, big in the end user’s mind. They may not be particularly important to the IT people because they are looking at the things we have already talked about. When you get down to the customer service and user experience interface, then you want to know how usable the system is.

Finally, there’s a whole set of business rules. How is the security of the system? Am I able to limit it so that people can search only in a couple of ways?

That takes us to relevance and we’ll pick up there next week.

Marjorie M.K. Hlava
President, Access Innovations

Leveraging Your Taxonomy – Part 3

February 20, 2012  
Posted in Access Insights, Featured, search, Taxonomy

This series of blog posts is exploring how search works. We need to have a basic understanding of search fundamentals in order to know where taxonomies come in. Last week we started talking about search software, and today we will continue with that topic.

I believe in the data first, as you know if you’ve been following this blog. Starting in the diagram with your source data, you can see how the data flows. You need to clean the source data to a uniform format. This is often called the conversion process or the ETL – extract, transform, and load.

Next we deposit the clean data into a cache – this is a repository of the data. Some people call it the content management system or CMS. Then build access to the repository in the cache builders. Next the data goes into some kind of a hub to deploy that information into search. So from the hub, we build the inverted file indexes, so that we have those indexes available for searching. The user will submit queries into the search presentation layer. That will go into a federator because we might have, as in the case of SharePoint, a whole lot of projects that we want to search across. If you have multiple repositories that you want to search across, you need to go to a federator, and that will go to the query servers that will access the deploy hub, access the indexes, and return an answer to the user.
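To show what building the inverted file index amounts to, here is a minimal sketch. It assumes plain whitespace tokenizing and tiny in-memory documents, which is far simpler than what a production system does; the function name and the sample records are invented.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    # keep the terms in alphabetical order, as an inverted file traditionally is
    return {term: sorted(ids) for term, ids in sorted(index.items())}


docs = {
    1: "taxonomy improves search",
    2: "search needs clean data",
    3: "clean data first",
}
index = build_inverted_index(docs)
print(index["search"])   # [1, 2]
print(index["clean"])    # [2, 3]
```

A query server answers a question by looking up each query term in this structure and combining the lists, rather than by reading the documents themselves.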

If you look at a system like FAST (offered by a company that has been bought recently by Microsoft), you see that you have all of this content coming in here on the left.  That information has to be digested by the system. It is digested by a lot of different connectors or conversion programs that then load it into the content API for, in this case, FAST, or some other search software.

The document is processed, and that means that it is broken into pieces so that the search system can use it.  The search system is going to want to serve the full article, so it has to have a server to present those articles. It’s going to have an index so that you can show the individual files and it’s maybe going to have some RSS feeds and other filters to send out alerts to different kinds of receiving organizations.

If you have chosen a Bayesian system, then you need to collect training sets so that the statistical analysis of the system has something to work on. That means you need at least 20 records (100 is better) for each term you want to train the system to search. In these records, the term must be used correctly. Our experience is that you need to collect at least three records for each term, because in English we use the terms in so many ways. That is the Achilles heel expense of the statistics-based systems.

When the user submits a query, the search request moves through a query API – a search application programming interface – something that will translate the search question into the syntax used within the system. It goes into the query processor and then accesses the guts of the system, the inverted file.

The taxonomy can come to play in two parts. There may be a taxonomy governance layer. It is great to apply the taxonomy terms to those documents as they are processed into the system. That activity should happen somewhere in the document processing pipeline activity. If you are lucky, you can also use the taxonomy at the search end so that people can disambiguate their queries as they are searching the system. Taxonomy, to my mind, should be in two places within your search implementation. Really, search is the reason you built the taxonomy in the first place.

Next week we’ll address the need for accuracy in search.

Marjorie M.K. Hlava
President, Access Innovations

Leveraging Your Taxonomy – Part 2

February 13, 2012  
Posted in Access Insights, Featured, search, Taxonomy

This series of blog posts is exploring how search works. We need to have a basic understanding of search fundamentals in order to know where taxonomies come in. Last week we started with the various modules of search. This week we are addressing the search software itself.

The technical parts normally include several pieces:

  • some ranking algorithms, so that you get the data in a format that is most likely to answer the query of the user or the search question;
  • a query language and syntax that enable you to actually ask a question of a computer;
  • federators that gather together the information from all kinds of different places and put it into a single cache to be searched;
  • and the cache itself, which is often the collection of data. There are two main types of cache. People talk about cache and cache memory. They are not the same thing. One is active in the random access memory (RAM) and one is on the storage disk.
  • Once you have all those things, all search software that I know of builds an inverted index, an alphabetical sort of every term in the searchable areas that you will be covering.
  • Most other enhancements take place in the presentation layer – that’s the interface that you see on the screen. The look and feel of the system is often what sells it, even though the technology underneath is widely variable.

Here is my preferred methodology:

  • Design the system application.
  • Decide what else needs to be added to the data so that you can enhance it. In that case, that would be where the taxonomy comes in, for dealing with subject metadata.
  • Consider what other metadata, other data, and other controls your system needs in order to work properly.
  • Once you have done all that, then find a system that will work with your data.
  • Don’t spend months working on a document type definition (DTD) just to find out that when you try to stuff your data into the DTD, you forgot to allow for multiple authors, or you forgot to add pagination. Both of those examples have happened to me with clients recently. We waited months for a DTD, got to the DTD. The DTD only allowed a single subject term, it allowed no more than one author, it didn’t have pagination, they didn’t even allow a place for an abstract. Then they said that the DTD is locked, sorry, we can’t change it. So, we are stuffing the data into inappropriate fields.

Many organizations have five or more kinds of search software. All too often, none of them work; this is why they end up just sitting on the shelf. They aren’t looking at the data first. Okay, enough of my rant. Next week we’ll continue with the pieces of search.

Marjorie M.K. Hlava
President, Access Innovations

Leveraging Your Taxonomy – Part 1

February 6, 2012  
Posted in Access Insights, Featured, search, Taxonomy

This series of blog posts will explore how search works. We need to have a basic understanding of search fundamentals in order to know where taxonomies come in. The modules of search are:

  • Search software – of course
  • Computer network
  • Parsing of text
  • Well formed or structured text
  • CLEAN DATA
  • Computer software – network
  • Computer hardware
  • Telecommunications connection
  • Training sets for statistical systems
  • Search technology
  • Ranking algorithms
  • Query language
  • Federators
  • Cache
  • Inverted index
  • Other enhancements
  • Presentation layer

We will cover each of those items over the series; however, we also need to measure the accuracy of search. Accuracy in search is measured by three major areas: precision, recall, and relevance. Each of these can be handled in different ways. Part of the challenge in measuring search accuracy is that in search, there are two major theoretical directions. One of them is based on the Bayes theorems and the other on the algorithms of Boole. I will explore the work of these two gentlemen and of more recent people supporting search enrichment. Then I will discuss what effect they have on search as we know it today. Finally, I will discuss the effect of taxonomy on search.

How does search work?

Here is the normal path people take to search implementation.  “Well, I think I will get some hardware.”  And they say…. “Well, you can’t go wrong with Company X.”

  • So they buy hardware and
  • they buy the software that will work on that hardware.
  • Then they design a system that will work with that software and
  • then they try to load their data and,
  • finally, they try to enhance the data with a taxonomy.

In my opinion, that is totally backwards. What they should be doing is looking at what they are building the system for in the first place – that is, the data. How about we build a system to hold your data?

We assess the data so that we can get a design; we know what fields there are. I have written about this backwards approach before.

What are you building? 

  • Assess the data
  • Do the design
  • Decide what else needs to be added
  • Taxonomy terms
  • Other controls
  • Find a system that will work with your data

Let’s outline the pieces of the search implementation. There are a lot of parts to search, and one of them, of course, is the search software itself. The search software runs on a computer network. The software depends on the parsing, or the cutting, of the text into specific pieces so that it can be searched. That means that the data needs to be well-formed, if you are talking in the XML vernacular, or structured. Unstructured data is simply data that has not been tagged into fields. You could transform the text of a Word document, which is generally considered unstructured, into a well-formed XML document by simply putting <Body> and </Body> at the beginning and end of the text (and escaping any stray & or < characters). Yes, it can be that deceptively and technically simple. The whole notion of structured and unstructured text is a bit of a misnomer and a little bit hard to understand, because most of us don’t think of data in that way. In fact, a Word document has a Properties table in it that may or may not be populated. Some things are populated in it by default. So it is, actually, partially structured already. It can even be saved as an XML document. The search software must depend for its implementation on clean data. That means the data has to be well-formed, preferably with all of the metadata fields filled in, including the taxonomy terms in a specific tagged field or element.
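As a small illustration of how little it takes to make plain text well-formed, the sketch below wraps a string in a body element using Python's standard XML library. The element names (Document, Title, Body) and the function name are hypothetical, not part of any particular DTD.

```python
import xml.etree.ElementTree as ET


def wrap_as_xml(text, title=None):
    """Wrap plain text in a minimal, well-formed XML document."""
    doc = ET.Element("Document")
    if title is not None:
        ET.SubElement(doc, "Title").text = title   # one tagged metadata field
    ET.SubElement(doc, "Body").text = text
    # serializing escapes characters such as & and < in the text
    return ET.tostring(doc, encoding="unicode")


xml_text = wrap_as_xml("Taxonomies & search: is A < B?", title="Notes")
print(xml_text)

# Parsing the result back proves that it is well-formed.
parsed = ET.fromstring(xml_text)
print(parsed.find("Body").text)
```

The taxonomy terms would go into their own tagged element in the same way, which is what lets the search software treat them as a separate field.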

The computer software runs on a network, which runs on hardware. To get to it, you need to have a telecommunications connection of some kind. It might be a hard network wire within your organization – so you connect from one place to another within the firm – or it might be something that goes over the Internet to a remote location. It doesn’t really matter – the connection is still a telecommunications connection that transfers data in an orderly fashion over a wire.

Next week we will talk about the search software itself.

Marjorie M.K. Hlava
President, Access Innovations

Of Taxonomies, Biology, and Moneyball

January 30, 2012  
Posted in Access Insights, Featured, Taxonomy

Baseball and biology are not commonly found in the same conceptual space. Neither do you find taxonomy associated with baseball, but in recent news these connections were made. Grant Bisbee, editor of “Baseball Nation”, digresses into the arcane as he laments the coming of the “He’s In the Best Shape of His Life” season. This is the time of year baseball writers must assess the prospects for the coming season, and clichés and hyperbole reign. The dubious practice of evaluating the physical condition of players runs rampant as spring training begins. With tongue in cheek, Bisbee tries to shape a taxonomy to classify this spring ritual. His would be the taxonomy of the “In the Best-shape Stories”.

Bisbee suggests three categories: “Player X Got in Shape”, “Fixed Eyesight”, and “A Serious, Previously Undiscovered Affliction”. He compiles representative stories for each classification. He is somewhat skeptical that a player could be in the “best shape of his life” as a result of some of the reported training regimes. Can someone really put on 18 pounds of muscle in the off-season? Legally? Shouldn’t vision examinations be a regular part of baseball teams’ operations? After all, they’re paying millions for their players’ hand-eye coordination.

Are there baseball taxonomies already out there that could help Mr. Bisbee classify his apocryphal collection? A “Google” search of baseball taxonomies returns a million items more or less – plus or minus – give or take – a few hundred thousand. Not much help there. At the very least, Bisbee could expand the classification to include other sports and how each portrays their mystical, preseason rituals. How are preseason body rebuilding rhythms disturbed by “Seasonus Interruptus”?

And biology is experiencing a potential season-disrupting trend. Tom Spears, reporting for the “Ottawa Citizen”, filed the story, “Taxing times for taxonomy”. On a more serious note, Mr. Spears chronicles the demise of field-based taxonomic study in biology in favor of lab work. Computers and DNA studies have relegated classification to the closet. While the work done in laboratories is vital to the field, biologist Ernest Small is quoted, “How can you do a study of forests without knowing the trees.” The reverse is also true; you can’t really study a tree without understanding its forest. Commenting on his lab-centric colleagues, Dr. Small laments “the lack of knowledge of what plants and animals make up our world”. Spears goes on to write, “We need to understand whole species, not just genes, if we are to solve the problems of agriculture, fisheries, insect pests, ecology and the spread of diseases.”

Taxonomies are central to understanding a field as a whole, whether biology or sports. And they are critical to understanding individual members of the taxonomy. The taxonomy of plants and animals tells you about the relationships and characteristics of its members. The child, or narrower node in a taxonomy, inherits the properties of the parent, or broader node. Reviewing an entire hierarchical branch and its relationship to other branches conveys the properties, similarities, and differences of branches. This knowledge can lead to sleuthing out meaningful trends and can help explain interactions or identify possible interactions. Taxonomy for organizing sports articles might reveal interesting groupings and trends. Are there more injuries being reported this season? Are there geographic trends or team trends? Applying graphical analysis visualizes the content in ways that enhance spotting of trends.
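The inheritance idea can be sketched with a simple broader-term table. The branch below follows the monarch butterfly up to its genus, tribe, and family; the properties attached to the nodes are invented for illustration, and the function names are hypothetical.

```python
# narrower term -> broader term
broader = {
    "Danaus plexippus": "Danaus",
    "Danaus": "Danaini",
    "Danaini": "Nymphalidae",
}

# properties recorded at individual nodes (illustrative values only)
properties = {
    "Nymphalidae": {"kind": "butterfly"},
    "Danaus plexippus": {"common_name": "monarch"},
}


def ancestors(term):
    """Walk up the hierarchy from a term to the top of its branch."""
    chain = []
    while term in broader:
        term = broader[term]
        chain.append(term)
    return chain


def inherited_properties(term):
    """Merge properties from the broadest node down to the term itself."""
    merged = {}
    for node in reversed([term] + ancestors(term)):
        merged.update(properties.get(node, {}))   # narrower nodes override
    return merged


print(ancestors("Danaus plexippus"))
print(inherited_properties("Danaus plexippus"))
```

Because the species sits under the family node, it picks up the family-level property without anyone having to record it again at the species level.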

What these two diverse stories have in common besides the core theme of taxonomies is the need for hands-on, down-and-dirty fieldwork. Sports writers need to see players perform during preseason to really assess their fitness. Biologists can learn a great deal about a species by looking through a microscope, but they can’t really understand behavior that way. They need to observe firsthand. When the field data and lab data are gathered, placing it in context gives it meaning and leads to insights. Taxonomy is ideal for grouping data in meaningful ways, and that can lead to insights. It can also lead you back to the data at some future point. Taxonomy is often focused on organizing content for search to make search easy, fast, reliable, and replicable. Taxonomy also defines a domain or conceptual space. It can help you find your way and keep you from getting lost. Equally important, it provides meaning to the conceptual space you’re wandering through. Spring is almost here – enjoy the first Danaus plexippus of the season and know it has a genus and a tribe and a family and more. Play ball!

Jay Ven Eman, CEO
Access Innovations

Eighth Annual Data Harmony Users Group – February 7-9, 2012

January 23, 2012  
Posted in Access Insights, Featured, News

Access Innovations will host the eighth annual Data Harmony® Users Group (DHUG) meeting this coming February 7-9 in Albuquerque, New Mexico. The meeting will focus on helping users get the most from their investment in the Data Harmony knowledge management software suite, which helps users organize information resources based on a well-built and systematically applied taxonomy or thesaurus.

“This meeting is an exciting opportunity to learn how to fully utilize the power of Data Harmony software to maximize the effectiveness and profitability of your organization for your members, customers and staff,” said Marjorie M.K. Hlava, president of Access Innovations.

On Tuesday, February 7, Access Innovations’ staff will present a full day of free training for new users or users who want to review and update their knowledge of the Data Harmony suite, which includes:

  • Thesaurus Master® – Provides taxonomy and thesaurus construction and management;
  • M.A.I.™ (Machine Aided Indexer) – Offers automatic indexing or editorial aid in indexing; and
  • MAIstro™ – Combines Thesaurus Master and M.A.I. for maximum efficiency in both automatic indexing and taxonomy construction

On Wednesday, February 8 and Thursday, February 9, speakers will introduce new features of the software and present case studies about how actual users have leveraged Data Harmony to organize their content, increase their productivity, lower their costs, and drive organizational or company revenues.

Access Innovations is encouraging anyone who wishes to share their story at the meeting to contact them. Registrations are also now being accepted. For more information about the eighth annual Data Harmony Users Group meeting, click here or call (505) 998-0800 or 1-800-926-8328.

About Access Innovations
www.accessinn.com, www.dataharmony.com, www.taxodiary.com

Founded in 1978, Access Innovations has extensive experience with Internet technology applications, master data management, database creation, thesaurus/taxonomy creation, and semantic integration. The Access Innovations Data Harmony software includes automatic indexing, thesaurus management, an XML Intranet System (XIS), and metadata extraction for content creation developed to meet production environment needs. Data Harmony is used by publishers, governments and corporate clients throughout the world.

2012: What Lies Ahead?

This time of year I read a lot of trends and reviews. What happened last year and what will happen in the coming year is a popular topic and it is in fact a good time to take stock and think about initiatives for the year ahead. So here is what I see coming down the road for 2012.

1. More people will be building and implementing taxonomies. The awareness of controlled vocabularies and their applications continues to grow. They will be applied not just in publishing but in websites, association offerings, online commerce, and records management and retention. There are many ways to leverage a taxonomy, and information architects have only scratched the surface in their applications. Medical organizations will also embrace alternatives to ICD-9 and the complexity of government-driven coding to find accuracy through taxonomic means.

2. Taxonomies, controlled vocabularies, interoperability and linked data will become mainstream for corporations. Publishers and associations will also actively embrace the needs for control over ever growing collections. Universities and government will be late adopters.

3. With the increase in taxonomists there is also an explosion of “carpetbaggers”: people who see a hot trend and leap on the wagon with newly heralded expertise. It is true that vocabulary control has been around for well over 100 years, so many will have a background in the area and just a new field in which to apply their skill sets. It is also an area that is learned in practice, but it is not difficult to learn. The standards outline the options, and there are many webinars, readings, and other training opportunities in the field. This means buyers should beware and check the credentials of their service and software providers. Do they have hands-on experience, only book learning, or are they opportunity seekers?

4. I don’t think the semantic web will happen this year either. In fact, I don’t think it will ever happen as originally envisioned. It is just too complicated, and no search system to support it has gotten out of the lab to handle large data sets. Just as SGML gave way to HTML and then XML, the Semantic Web will fade and Linked Data will rise.

5. I do think that the linked data initiatives will take the lead. I expect the Linked Data community will get over its focus on syntax and start talking about implementation and application, leading the way by showing that it can be done in many ways and that no single path is needed to make those links. Mashups using linked data will become much more common. Some of these are already very active sites, and many more will follow in 2012.

6. The rebirth of Dublin Core. Oh, I know it has been around for a long time – I have a few lashes to show for it myself. But when that standard (Z39.85) was “passed,” it was by inventing a new way to get around the standards consensus requirement, using a new program called “fast tracking.” Now, 15 years later, the reasons it got 7 NO votes are still there, but they are finally getting an honest appraisal and serious consideration of what the functional requirements need to be and how to make Dublin Core work as a real, measurable standard rather than a guideline. The new crop of DC advocates will make it happen. In addition, the Linked Data crowd and the DC crowd are working together to bring real change and education to the marketplace as well as to university enclaves.

7. The ontology name is so cool. Not many are really sure what it means and very few mean the OWL standard when they use it. Having said that I think we will all be talking more about ontologies and less about taxonomies (and certainly not thesauri) this year. We might still mean the same thing but our words to describe it will change.

8. Finally, I think everyone will be more cutthroat. Manners and honesty will take a back seat to getting the sale and the close. We saw that increase last year, and I think it will continue to grow as a problem in 2012. There are two underlying reasons. One is that the shrinking marketplace, tied to the larger number of investment capital firms behind many businesses, will cause them to cut corners to get the sale so they can make their “numbers.” The other is that the increasingly uncivil political climate will bleed its desperation into all other corners of our lives.

Marjorie M.K. Hlava
President, Access Innovations

Taxonomy Meetings 2011 – A Year of Change or Realization?

What are the meetings that cater to people who use controlled vocabularies, like taxonomies? Where should a taxonomist go, click, or attend to learn about the latest implementations and uses of controlled vocabulary strategies? Every company thinks long and hard both about what they do and where to find customers for their products and services. The information industry is no different. In the Age of the Internet, when everyone “knows” about searching and information, it seems like the information industry should be booming, its conferences should be huge, and the attendance incredible, but that is not the case. Why? If the information industry and our little taxonomy segment of the business have gone mainstream, then where are all the people you would expect at the long-established industry meetings? The meetings we have attended for years are dying on the vine. The SLA Expo was sparse, the Information Today meetings are smaller, Online Information (formerly International Online) was nearly empty, and NFAIS remains the same size each year. ASIS&T is growing significantly. Frankfurt Book Fair is bigger than ever. Specific User Group meetings are increasingly targeted and well attended.

I believe there are several factors at work. The diminishing meetings have had challenges for years. Nor are we alone in this trend; it is national and perhaps international. Other options are now available. Nationally 126 million people attended meetings in 2009. In 2010 only 80 million attended. “There were 12 percent fewer attendees in 2010 than in 2005 – and 19.7 percent fewer in 2009 than in 2006,” notes the Baltimore Sun. The trend is significantly downward, even allowing for the problems of the economy. Let’s take the meetings one at a time.

SLA Conference and Expo – an expensive and glitzy meeting held in conjunction with the SLA (aka Special Libraries Association) annual meeting. This meeting has had as many as 7,000 attendees and many auxiliary events, such as user group meetings and advisory boards, surrounding it. The meeting itself is significantly smaller now. The membership has dropped by half, from 14,000 to about 7,000, including a student contingent of unknown size that has likely held steady at about 2,000. This means SLA no longer commands as large an audience. In spite of a well-meaning board trying to cater to the un- and underemployed with reduced fees, the membership has been shrinking. The Expo has served two functions in the industry. One, of course, is to show the companies’ wares to the attendees, people who work in corporate and other kinds of unusual libraries and often command large purchasing budgets. The second is that having most of the players in the industry in a single exhibit hall allows intellectual property rights discussions to take place and business arrangements and deals to be made. But several things have happened to make this a less attractive venue.

Years ago SLA mandated that a company could not sponsor a division’s activities, that is, get close to the real customer group, unless it was an exhibitor. That meant paying for the booth (about $5000 for the smallest), paying for furniture, electric, carpet, Internet, card reader, plus the art and brochures, and giveaways, etc. (much more than $5000). Then you need staffing for the booth, including airfare and hotel for at least two, but usually more, staff. (Another $5 – 7,000 in direct cost per person plus a week away from the office.) After that you get to be the target of every division to sponsor their events – at $500 – $5000 each (there are 28 divisions and almost all of them will call you). So SLA needs to be a line item of at least $30,000 in the budget, but it is usually over $50,000 plus staff labor and opportunity cost. The business aspect of companies (a less degrading label than “vendors.” What are we, circus performers?) talking with companies has been good, but the increasing number of companies “suitcasing” (that is, attending without a booth) has made the exhibitors targets of not only the divisions and SLA, but also those who did not pay the freight to be in the show. Meanwhile, the attendees are walking the aisles, looking for giveaways, not making eye contact since they have no budget to spend.

More recently the Divisions have realized that they could get more out of their target companies if they held out the carrot of a speaking slot. If you pay X you are a sponsor; if you pay Y you can also have a speaking slot. That all works as long as there is a large audience to talk with. But over the past two years there have been very few people attending the meetings. The sessions of substance are well attended. I went to the taxonomy-related ones and they often drew standing-room crowds. Buying of speaking slots, however, degrades the programming options and also makes the exhibitors feel cheap. My expertise, on which I have been able to found and run a company, is only worth hearing if I pay you to listen? It feels like some kind of prostitution going on here!

At SLA 2011 many sessions had to do with how to get a job, get a raise, change careers, etc. These are helpful to the out of work perhaps, but NOT a persuasive reason for an employer to send a staff member to the meeting. Why should they send their staff to a meeting to learn how to get a different job? The early program was full of such sessions and a turn off to many of the employers and potential attendees I spoke with. They need to send people to the meeting for a skills and industry update and refresher.

So there are few attendees because the programs are not delivering content. Business discussions among exhibitors have held them in the hall for the past few years, but is that enough to make the show a go? There are new options out there, as you will see later in this article.

Burdensome regulations, booth fee increases, and limited staff resources have resulted in a thin meeting on top of an already downward fiscal spiral for SLA. Can they pull it out? Perhaps they can, but probably not with the current strategy. That exhibit hall finances much of the SLA annual operations. An organization which gets more than half of its annual income from a single face-to-face meeting in the Internet age has some hard thinking to do.

Information Today built its reputation on the once premier meeting in the industry – the National Online Meeting. It sprang into being when SLA and ASIS&T missed the rise of online searching and the incipient Internet offerings as a potential big force in their lives. More recently this meeting has been cut into sections and targeted to specific groups, such as Computers in Libraries, Internet Librarian, Taxonomy Boot Camp, and Knowledge Management World. Each of them seems to draw a small but loyal crowd of attendees. The business aspect of the meeting has been lost, not much deal making goes on here, and the exhibits are shrinking. Here too, if you are a potential exhibitor, you are generally not allowed a speaking slot unless you pony up for a booth.

This has led to a platform of consultants, who plead inability to exhibit, hawking their services from the podium. The quality of the program is diminished and the people with industry knowledge look for another avenue to get to the customer. The previous model of perhaps if they were speaking, they might also exhibit, has changed to no speaking unless you exhibit. Further the segmentation of the meeting has meant that the exhibitors cannot form the deal making side of attendance that is so important to their livelihood.

International Online was also an Information Today meeting (okay, it was Learned Information, and when they sold the meeting they had to change their name to that of the very successful industry newspaper they publish – not a bad thing). Traditionally held the first week of December in London, it was THE place to be for the buying and selling of digital rights and to see what new things were being released in the New Year. It was a vibrant, exciting meeting with a crush of people, big parties in the evenings, cutting-edge presentations, and many user group meetings surrounding the IOM. One person commented that about 90% of the intellectual property rights deals and changes for the year happened in that week in December. This year the meeting was a shadow of itself. Most of the big players did not exhibit, and very few people walked through the hall. If you set up lots of meetings in advance, it was okay; otherwise it was a dud. What happened?

It became two unconnected meetings. One was the conference, with delegates (attendees), held on the third floor a block away from the other, the exhibition, so the attendees seldom came down through the wet London cold to the exhibit hall. At the same time it became very expensive! Greed in the face of an economic downturn certainly plays a role, but this is not the only factor. Next year it is moving from the Olympia to Docklands and changing its format. The meeting we knew is gone.

NFAIS has gone a different way. It is a membership organization of about 120 companies. But leveraging the intellectual value added, including controlled vocabularies, is not the current focus of these former abstracting and indexing organizations’ meeting. Their focus is on the “next big thing,” the trends in the industry. The program committee does NOT select member companies to speak. So if you are a member, you will not be on the podium except as a possible moderator. But NFAIS members would like to hear from members who are in a similar situation and find out how they have dealt with the problem. It is a cutting-edge meeting, well planned and thought out, but it does not grow due to self-imposed limits.

ASIS&T, the American Society for Information Science and Technology, is often considered an academic meeting where professors can get their students’ papers on the program to showcase them. The Board is academic. The members are a mix, more academics than practitioners, but still a fair number of people looking for new technologies and a way to implement them on the home front. I used to survey the audience and decided it was in three segments. The academics sat in the front of the hall ready to comment and debate with the speaker, the practitioners and managers sat in the middle soaking up what they could from the presentations and questions, and the entrepreneurs and other misfits were in the back, standing or on the aisles with an easy exit plan. It is still that way except that the middle has thinned out considerably. The meeting this year was a pleasant surprise on many fronts. It was a substantive program, with lots of hard-hitting application and real-life talks and fewer presentations that take a sample of 10 – 30 and extrapolate unrealistic results. The talks were longer, at 30 minutes, which allowed enough time to actually describe the substance and then take penetrating questions. The student papers were moved to a huge poster session – 92 posters replacing the Presidential reception, with dinner in the middle and posters around the edge – great for conversations, good learning experiences. Well done. Some even had to do with taxonomies.

But for a lot of application and implementation discussions, the action has moved to the ASIS&T IA Summit. The information architecture meeting now has as many attendees as the annual meeting (around 700 people) and has its own Web site and branding. Here it is far less academic, with much more hands-on discussion. I found the meetings clannish, but the discussions were worth listening to.

Frankfurt Book Fair – a few years ago this meeting was only for print publishers, although it was THE meeting for print. But as digital media has taken hold a new pavilion was added and the digital activity in Building 4 is now incredibly active. The rights trading is definitely done at this show now. The parties and the satellite meetings have mostly moved here. Publishers and the Online community have merged to be here in Frankfurt in October.

User Group Meetings – remember, these used to be satellite meetings around the bigger meetings, but their members were no longer attending the big meetings. Members now go for the shorter, pure vendor updates and presentations, which deal directly with their service, product, or software. They use these specialized events to learn what’s new and how to use it better. It pays off back at the office, and you meet others who are using and leveraging the same things. I attended several of these during the year. They were uniformly well attended by enthusiastic people wanting to know more about the products and services so they could better manage their investments. The meetings that are viable are those that engage the attendee, and the User Group Meetings do. They are of two types: 1) those that follow the rock star level of presentation, like MarkLogic and SilverChair, and 2) those that are hands-on updates on the applications and use cases to leverage the customer investments, like Atypon and Data Harmony.

Summary:

Okay great – we know where the companies are going to get their work done, make deals, and to learn new things, but what about the individual? Where are they going?  What are they now doing to learn and keep skills fresh?

The Internet has made many things possible that were not possible before. We can convene a meeting electronically in a very short time. We can have discussions over Skype or Webex or GoToMeeting. We can develop documents using collaborative wikis.  We can have conference calls for people in many locations and several continents without leaving our desks. People have turned increasingly to webinars and web searching to find new things and answers. We follow blogs to read opinions and discussions to add to and enjoy.

If we go to a meeting, we are expecting something else. We want to find community. We want to build relationships, which can then be maintained on the Web once they are established. We want to have discussions. We want to help build, brainstorm, learn, and develop in a group setting. We want to make a deal, discuss the terms, and build trust, face to face. Teaching new skills, reading thought pieces, and announcements can now be done in a web-enabled environment.

Selling (Prostitution) of the speaking slots by the real vendors, those who put on the shows, has had a deleterious effect on the quality of the meetings. The costs have reached a tipping point where they no longer provide a good return on investment for attendee or exhibitor. It is no longer useful to have a big party for your users or to set up a user group meeting in conjunction with one of the big national meetings. But more than that, the challenge remains on how to engage the attendee. How can they be part of the meeting rather than a passive audience? How do you get a sense of community?

There are several budding online communities, which seem to be flourishing. Taxonomy Community of Practice is one; the Taxonomy Division of SLA is another. The ones on LinkedIn and Facebook have not yet taken off. The rest are in user groups. Access Innovations’ Data Harmony User Group meeting will be held in Albuquerque February 7-9, 2012.

Come join the community!

Marjorie M.K. Hlava
President, Access Innovations

SharePoint and Taxonomies – Part VI of VI

January 2, 2012  
Posted in Access Insights, Featured, semantic, Taxonomy

This final segment in our series on semantic integration specifically addresses SharePoint and Taxonomies.

SharePoint is popular software and comes free with the Microsoft Server. In fact, I think SharePoint, more than any other thing, has excited people’s interest in taxonomies. SharePoint 2010 has a taxonomy module, and although it does not have everything that your heart might wish for, it is a significant step forward. A lot of people have been trying to figure out exactly how best to use their taxonomy within the SharePoint offering. This is one option.

SharePoint itself will only show you ten lines of a vocabulary. This particular application, Data Harmony, shows you a bunch more. In this case again, when you are uploading a document, we want to be able to suggest, from that document, terms that are actually valid in your taxonomy and then post those as keywords in the SharePoint system so that you can search for them using your taxonomy. Since it is very easy to build a SharePoint application, just as it used to be very easy to build a Lotus Notes application, control of that application gets out of hand quickly. People are looking hard for ways to implement some kind of vocabulary control using SharePoint, particularly 2010 and to a lesser extent 2007, so that they can actually index their documents and get them out easily. They are not going to have to remember what somebody called them. They can make broad use of synonyms, browse categories, and generally get at their information more easily.
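To make the idea of suggesting only valid taxonomy terms concrete, here is a minimal Python sketch. It uses plain phrase matching against a synonym list, which is an assumption made for illustration and not a description of how Data Harmony or SharePoint actually does it. The `VOCABULARY` contents and the function names are hypothetical.

```python
import re

# Hypothetical mini-vocabulary: preferred term -> synonyms (illustrative only).
VOCABULARY = {
    "Payroll records": ["payroll", "pay stubs", "salary records"],
    "Résumés": ["resume", "resumes", "curriculum vitae", "cv"],
    "Performance reviews": ["performance review", "annual review", "appraisal"],
}


def build_lookup(vocabulary):
    """Map every label (preferred or synonym, lowercased) to its preferred term."""
    lookup = {}
    for preferred, synonyms in vocabulary.items():
        for label in [preferred, *synonyms]:
            lookup[label.lower()] = preferred
    return lookup


def suggest_terms(text, vocabulary):
    """Return the preferred terms whose labels appear as whole phrases in the text."""
    lowered = text.lower()
    found = set()
    for label, preferred in build_lookup(vocabulary).items():
        if re.search(r"\b" + re.escape(label) + r"\b", lowered):
            found.add(preferred)
    return sorted(found)


print(suggest_terms("Attached are the pay stubs and the CV for the new hire.",
                    VOCABULARY))
```

Because every synonym resolves to one preferred term, the keyword stored with the document is always a valid taxonomy term, whatever wording the author happened to use.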

Here are a couple other implementations. The last one was Eldercare. This one is on Educational Information. People can browse the terms or they could type ahead and get the appropriate suggestion in the keyword field from the taxonomy. That is very helpful to people in their SharePoint implementations.
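The type-ahead behavior can be approximated with a prefix search over a sorted term list. The sketch below is an assumption about one simple way to do it, not the implementation behind the keyword field; the `TERMS` list and the `type_ahead` function are made up for the example.

```python
from bisect import bisect_left

# Hypothetical education terms, kept sorted case-insensitively.
TERMS = sorted(
    [
        "Teacher training",
        "Educational technology",
        "Special education",
        "Early childhood education",
        "Educational assessment",
    ],
    key=str.lower,
)


def type_ahead(prefix, terms, limit=10):
    """Return up to `limit` terms starting with `prefix` (case-insensitive).

    `terms` must already be sorted by lowercase value.
    """
    keys = [term.lower() for term in terms]
    wanted = prefix.lower()
    start = bisect_left(keys, wanted)  # first position that could match
    matches = []
    for term, key in zip(terms[start:], keys[start:]):
        if not key.startswith(wanted):
            break  # sorted order means no later term can match
        matches.append(term)
        if len(matches) == limit:
            break
    return matches


print(type_ahead("edu", TERMS))
```

Since suggestions come only from the controlled list, whatever the user picks is guaranteed to be a term the taxonomy recognizes.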

Another case is Records Management. People are even using SharePoint for Records Management, but in the case here, because of the nature of Records Management, you might have groups of types of records or facets for record types. You could also have content types. The content types could be arranged in a taxonomic fashion. You might have Human Resources documents, and under those Human Resources documents you might have many different kinds of items, e.g., reviews, résumés, payroll records, etc. Finance might also have payroll records, so you will want to give them multiple broader terms. Where you have a combination of the record types, the content types, and the creators of those records, you might be able to automate the retention schedule assignments. Figuring out the retention schedule for the record types and the content types, by creator, is a very heavy load for most organizations these days. How long do you need to keep those things? Do you need to keep them three years for tax? Do you need to keep them seven years for fraud? Do you need to keep them indefinitely for patent research? Do you need at least 17 years? There are a lot of different retention schedules to which you need to pay attention. Having this kind of automation help from the taxonomy might be very useful.
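As a rough illustration of how multiple broader terms could drive retention assignment, the Python sketch below gives each content type a list of broader terms and keeps the record for the longest period any of them requires. The term names echo the examples above, but the retention periods, the "longest period wins" rule, and the function name are all assumptions made for the example; a real schedule would also take the record creator into account.

```python
# Hypothetical polyhierarchy: content type -> its broader terms.
BROADER = {
    "Payroll records": ["Human Resources documents", "Finance documents"],
    "Résumés": ["Human Resources documents"],
    "Patent research files": ["Legal documents"],
}

# Hypothetical retention periods in years; None means "keep indefinitely".
RETENTION_YEARS = {
    "Human Resources documents": 3,
    "Finance documents": 7,
    "Legal documents": None,
}


def retention_for(content_type):
    """Return the longest retention period required by any broader term."""
    periods = [RETENTION_YEARS[parent] for parent in BROADER[content_type]]
    if any(years is None for years in periods):
        return None  # at least one broader term requires indefinite retention
    return max(periods)


for content_type in BROADER:
    years = retention_for(content_type)
    label = "indefinitely" if years is None else f"{years} years"
    print(f"{content_type}: keep {label}")
```

Under these made-up rules, payroll records sit under both Human Resources and Finance, so they pick up the longer Finance period automatically.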

Conclusion

We have covered a lot of different information, but what I would like to leave you with is the idea that taxonomies and metadata are really the cornerstones of information architecture. They can be used as the basis for content organization and, if they are used that way, then they can build a browsable outline of the content. When you are using subject metadata, especially the taxonomy, you can get 100% recall of relevant information. That is a really big thing for people who really cannot afford to miss any of the information that is in your database corpus. Taxonomies and metadata are the basis for search and for labeling things for storage, and they are very useful in navigation and information architecture. When you recognize those synonyms, you can improve the taxonomy implementation considerably.

Taxonomies are great fun to build because they kind of challenge your intellectual rigor.  Applying them to data is what really makes the work worthwhile. That is where the rubber hits the road. So we have to figure out how to best use those taxonomies. The more ways you can find to use them, the more likelihood they will be supported over your lifetime tenure with the taxonomy. Maintaining them and their applications is going to be what creates a strong knowledge management platform for an organization.

Marjorie M.K. Hlava
President, Access Innovations
