Founder/President of Access Innovations Receives Prestigious Ann Marie Cunningham Award From NFAIS
March 13, 2012
Posted in Access Insights, News, Standards
Marjorie M.K. Hlava, founder and President of Access Innovations, recently received the prestigious Ann Marie Cunningham Award for outstanding service from NFAIS (the National Federation of Advanced Information Services).
Keith MacGregor, president of NFAIS, explained, “This year Marjorie Hlava is being recognized for her years of hard work as chair of the NFAIS Standards Committee. It is rare that a week goes by without an email alert from Margie regarding activities in the global standards community and she handles the NFAIS voting process related to NISO (the National Information Standards Organization). She has served NFAIS in many capacities – on the board of directors, as a former president and on the annual conference planning committee. She is now in her third year as the editor of our annual collection of NFAIS meeting papers published in Information Systems and Use. We are very grateful to Margie for all that she has done and continues to do.”
Hlava, a pioneer in the information management industry, founded Access Innovations in 1978. She holds patents for a number of technological processes, including automatic text processing and management and software-based methods for searching chemical names in text-containing documents.
Hlava is very committed to creating, updating, and promoting standards for the information industry. In addition to serving as chair of the NFAIS Standards Committee since 2005, she chaired the Special Libraries Association (SLA) standards committee for nine years, and was chair of the ASIDIC (Association of Information and Dissemination Centers) Standards Committee from 2003 to 2007. She currently serves on the NISO Content Committee for standards development.
Hlava served as a member of the Dublin Core Standards Committee Z39.85 from 1999 to 2002. She was on the redrafting committee for the NISO Z39.19 standard for controlled vocabularies. She has presented numerous standards updates to professional organizations, including the SLA, the Community of Practice under the auspices of the Library of Congress, and Information Today.
“I am so honored and surprised to receive this award! Since it is a surprise award, there was no time to prepare remarks; however, there is still room on the standards committee if you would like to serve,” Hlava said to a chuckling audience as she received the award. “Standards are critical in maintaining the integrity of information and in allowing the creation of taxonomies and thesauri that are accurate and useful. As more and more information is disseminated globally, standards will play an increasingly important role. I’m passionate about the work we do and appreciate the fact that we can come together to ensure that the information we create, organize and present is not only accessible to the world, but also credible.”
###
About Access Innovations – www.accessinn.com, www.dataharmony.com, www.taxodiary.com
Founded in 1978, Access Innovations has extensive experience with Internet technology applications, master data management, database creation, thesaurus/taxonomy creation and semantic integration. The Access Innovations Data Harmony software includes automatic indexing, thesaurus management, an XML intranet system (XIS), and metadata extraction for content creation developed to meet production environment needs. Data Harmony is used by publishers, governments and corporate clients throughout the world.
About NFAIS and the Ann Marie Cunningham Award – www.nfais.org
The National Federation of Advanced Information Services is a global, non-profit, volunteer-powered membership organization that serves the information community – all those who create, aggregate, organize, and otherwise provide ease of access to and effective navigation and use of authoritative, credible information. For more than 50 years, NFAIS has promoted the success of its members and provided a forum to address common interests through education and advocacy.
The Ann Marie Cunningham Award is presented to members who go above and beyond the normal call of duty and is named after Ann Marie Cunningham, who served as executive director of NFAIS from 1991 to 1994.
Leveraging Your Taxonomy – Part 6
March 12, 2012
Posted in Access Insights, Featured, search, Taxonomy
This is the next piece in our series of blog posts on search and how it works. Last week we ended with natural language processing. We are picking up this week on automatic language processing or ALP.
Automatic language processing is different. It involves automatic translation or automatic indexing or auto abstracting. It is a pillar for artificial intelligence. It is also a pillar in a lot of search systems but it is frequently built on top of the precepts of natural language processing. It might add spell checking. A lot of the initiatives for the semantic web are based on some sort of automatic language processing, as well as linking algorithms, other kinds of NLP, and other kinds of computational linguistics.
Statistical search has evolved to encompass a wide variety of options. Taking the Bayesian statistics from many years ago, we might be able to come up with cluster analysis, or neural network search, or vector searching, or co-occurrence, or Bayesian inference, or latent semantics; these are all different search methods. They all are, at the end of the day, based on statistics, and they all depend on the input data and on training sets. When you implement a search system like this, you need to factor in what it will cost to train the system. Training the system is really just batch processing. It takes programs to do it. But, in order for them to do their work, they need examples of every term (for example, every term in your taxonomy) used correctly in quite a few articles, like 20-50. In my experience, you have to review about three times that many articles to find ones where the term is used spot-on, specifically the way you want to use it. You present those as training sets. It takes quite a while to collect the training sets for statistical search. That is the real Achilles heel of these systems.
In all of these systems, whether they are statistical, natural language, or whatever, at the end of the day they depend on an inverted file and some And, Or, And Not (at least) operators. We frequently build a searchable index, which is the inverted file index, and then we might build on the user end some kind of a presentation layer, which is a hierarchical display – or browsable list – and frequently that comes from the taxonomy view of the thesaurus.
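To make the inverted file concrete, here is a minimal sketch in Python. The sample documents, the `build_inverted_index` helper, and the whitespace tokenization are assumptions made for illustration; a production search engine parses text far more carefully.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercased term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

# Hypothetical three-document collection.
docs = {
    1: "taxonomy improves search recall",
    2: "statistical search needs training sets",
    3: "taxonomy terms need governance",
}
index = build_inverted_index(docs)

# The inverted file is conceptually an alphabetical list of terms,
# each pointing to the documents where it occurs.
terms_in_order = sorted(index)

# "taxonomy AND NOT search" is a set difference on the two posting lists.
result = index["taxonomy"] - index["search"]
print(terms_in_order[:3])
print(sorted(result))
```

The query never scans the documents themselves; it only combines the stored lists of document ids, which is why the same structure can sit underneath very different search front ends.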
Next week we will talk more about inverted file indexes.
Marjorie M.K. Hlava
President, Access Innovations
Leveraging Your Taxonomy – Part 5
March 5, 2012
Posted in Access Insights, Featured, search, semantic
This is the next piece in our series of blog posts on search and how it works. This time we are talking about the importance of relevance.
Relevance is how well a set of returned documents answers the information need, another way of talking about accuracy. But, it’s related to the objective of the search. Different user communities can get exactly the same answer from a set of information resources, with one finding the set relevant and the other not. So, there’s a really healthy tension between the user needs and the context available. That’s why a lot of relevance engines do a lot of profiling of the users.
If I search Google with a particular question, I might get one answer and each of you will search it and get a different answer. That’s because your profile and the things that you’ve clicked on in the past will indicate to Google that your answers should be more in this sphere or more in that sphere. So, relevance is really a confidence factor or a guesstimate on how well this set of documents will answer this particular user’s query.
There are formulas for recall, precision, and relevance. Recall is the number of relevant items that have been retrieved from the system over the total number of relevant items in the collection. One hundred percent recall would mean that we got all of the relevant items of the entire collection. Precision is the number of relevant things – those things that I wanted to see as a user – out of the total number of items retrieved. So, if only five of the articles presented to me on my first ten hits were actually germane to my question, then I have only a 50% precision score. Relevance is precision in relation to recall, or those that are right for my questions vs. everything in the database that has to do with the particular topic. You can see that there is a lot of difference between precision, recall, and relevance.
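The precision and recall definitions above can be worked through with a small Python sketch. The document ids and the `precision_recall` helper are hypothetical; the example assumes ten hits on the first page, five of them relevant, and 20 relevant items in the whole collection.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant items retrieved / all items retrieved
    recall    = relevant items retrieved / all relevant items in the collection
    """
    retrieved, relevant = set(retrieved), set(relevant)
    found = retrieved & relevant
    precision = len(found) / len(retrieved) if retrieved else 0.0
    recall = len(found) / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = list(range(1, 11))                     # the first ten hits
relevant = [1, 3, 5, 7, 9] + list(range(101, 116)) # 5 retrieved + 15 never shown

precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)  # 0.5 0.25
```

Five germane articles out of ten hits gives the 50% precision described above, while recall is only 25% because fifteen relevant items were never retrieved.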
When we are measuring relevance we are looking at the context – i.e., this searcher, the profile of this searcher and what they have in their brain, what they are looking for. We usually also take into account the age of the documents, because people normally are looking for the most recent. We also look at how many of the relevant documents we got, that is, the completeness of the data set returned, or the recall, and at the measure of quality, which is often very hard to determine because it is in the eye of the user. Often relevance, or at least the attempt to get at it, is statistically determined. We will cover more of that with Mr. Bayes. It is really a subjective evaluation, or a confidence of the system that what I am presenting to you is, indeed, what you want. There are a lot of different and complex factors in measuring relevance itself.
If we look at the different kinds of search, we encounter several main options: keyword search, Bayesian search, Boolean search, and ranking algorithms. In truth, most search systems have some or all of these options in them. It is a matter of what they depend on the most that is important. These search options are dependent on a few particular famous theorists. Two of them are still alive; two of them are long gone. There is Boole and Boolean algebra; Bayes and Bayesian techniques; Turney and his algorithms for enriched structured data; and Marco Dorigo and his ant colony theory. There is a very large body of research on search; please consider this as only a very high level sampling.
George Boole was a mathematician who lived 1815 – 1864, not all that long as you can see. The Wikipedia article on him is quite good. He came up with an algebraic expression or algebraic system to express the logic of what we know as the “And”, “Or”, “Not”, and “And Not” expressions. Those of you who have been searching for quite a while may have searched Dialog or BRS, or the IBM Stairs or Ovid or CD Plus, or SilverPlatter. These, and a lot of different systems, including MedLine, are based on the Boolean algebra approach. It has been around for a long time. It is quite popular, providing very high end precision and recall statistics.
The Boolean representation is often drawn as a Venn diagram of two overlapping circles, which shows the intersection. If I were to do a search on A and B, and I just entered them as a single expression, then the system would (or could) treat it as an automatic intersection. If I want A or B, I get everything in both circles; if I want A and B, I get only the overlap between the circles. If I want A XOR B, I exclude the overlap and keep the rest of both circles. I could also say A Not B, which gets rid of the B circle but keeps the rest of A. There are a lot of these expressions that we can explain through a Venn diagram.
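The regions of the Venn diagram map directly onto set operations. In this short Python sketch the two result sets `a` and `b` are made-up document ids standing in for the two circles.

```python
# Hypothetical result sets for two search terms, A and B.
a = {"doc1", "doc2", "doc3"}
b = {"doc3", "doc4"}

a_and_b = a & b   # intersection: only the overlap of the circles
a_or_b = a | b    # union: everything in both circles
a_not_b = a - b   # difference: A with the B circle removed
a_xor_b = a ^ b   # exclusive or: both circles minus the overlap

for name, region in [("A AND B", a_and_b), ("A OR B", a_or_b),
                     ("A NOT B", a_not_b), ("A XOR B", a_xor_b)]:
    print(name, sorted(region))
```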
Mr. Thomas Bayes was an earlier mathematician (1702 – 1761) and he wrote a lot about probability. He theorized, if we had a known set and we knew that these things usually happen, then when we got a new set, couldn’t we infer that the following will probably happen. That was a probability based on what had happened in the past so that we could forecast what would happen in the future. It’s a nice and fairly well-established algorithm and can be used so that we say, “Well, if these 5,000 articles were about this then, if the same term set is used in the next 5,000 articles, probably they are about the same thing.” But the distribution of probabilities changes, particularly in active areas like news or cutting edge science. People might not want to depend on the distribution of historical data to predict future data. A user might also make a new kind of request – something that is not what they have queried of the system in the past. So, to get that information out of the network is much harder. We have a computational linguistic difficulty if we explore a set of data with an unknown new kind of request.
What we have to do is to say, “We knew this to be true in the past and therefore it will be true in the future.” So, if you are looking at a lot of terrorist literature, for example, and trying to figure out what may happen in the future, based on what has happened in the past, one would expect that the terrorists would be aware of that and they would be constantly changing, in new and novel ways, so that they could trick the system – figure out the distribution of probabilities and actually give erroneous results to people. So, if you depend on a Bayesian engine to keep track of rapidly happening events, you often find that you come up a little off from what actually happens. Another way of saying that is that you have to assume that the prior knowledge is always reliable and is indeed what will happen in the future, because if you say that it’s going to be different, then the next results will be invalid. You want to be sure that the statistical distribution that you come up with to use for modeling your data is consistent. If you have a consistent set of data inside and it’s not going to change much, then this is a good way to go. Otherwise, you have to constantly train and re-train the data every time you add new data sets, particularly if the direction of the field has changed.
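As an illustration of how a Bayesian engine leans on prior data, here is a minimal naive Bayes sketch in Python. The two topics, the tiny training sets, and the function names are all assumptions; it uses add-one smoothing and is not the method of any particular product.

```python
import math
from collections import Counter, defaultdict

def train(training_sets):
    """training_sets maps a topic label to a list of example texts."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocab = set()
    for label, texts in training_sets.items():
        for text in texts:
            words = text.lower().split()
            word_counts[label].update(words)
            vocab.update(words)
            doc_counts[label] += 1
    return word_counts, doc_counts, vocab

def classify(text, model):
    """Pick the topic with the highest log prior plus log likelihood."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Add-one smoothing so an unseen word does not zero out the topic.
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training sets: the "known" past the engine reasons from.
model = train({
    "chemistry": ["acid base reaction solvent", "polymer reaction catalyst"],
    "finance": ["stock bond market yield", "market interest rate bond"],
})
print(classify("catalyst reaction", model))  # chemistry
print(classify("bond market", model))        # finance
```

Everything the classifier knows comes from the word counts in the training sets, so if the vocabulary of a field shifts, the counts go stale and the model has to be retrained, which is the cost discussed above.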
A more recent guy, Peter Turney, hails from Canada and talks about learning algorithms for key phrase extraction. He says that you can do that as a tree. You can make decisions as you move down the tree. He called it the “tree induction algorithm”. And, he came up with something called lexical semantics which is a way to say, “As I make decisions going down that tree, things are going to change.”
Extraction vs. generation and sentiment of words:
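$$\log_2 \frac{\text{hits}(\textit{word}\ \text{AND}\ \text{“excellent”}) \cdot \text{hits}(\text{“poor”})}{\text{hits}(\textit{word}\ \text{AND}\ \text{“poor”}) \cdot \text{hits}(\text{“excellent”})}$$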
His learning algorithms for key phrase extraction, together with this formula, give you an idea of how you can do just plain extraction of data from a system versus trying to generate even sentiment from the words that are there. He could show 80% accuracy in these results, which is better than 60% from Bayes.
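The sentiment formula above is a base-2 log of a ratio of hit counts, and it can be evaluated directly. In this Python sketch the function name and the hit counts are invented for illustration, and all four counts are assumed to be greater than zero.

```python
import math

def semantic_orientation(hits_word_excellent, hits_word_poor,
                         hits_excellent, hits_poor):
    """Positive result: the word leans toward "excellent"; negative: toward "poor".

    Assumes every count is greater than zero (no smoothing is applied here).
    """
    ratio = (hits_word_excellent * hits_poor) / (hits_word_poor * hits_excellent)
    return math.log2(ratio)

# Hypothetical counts: the word co-occurs four times as often with "excellent".
score = semantic_orientation(
    hits_word_excellent=80,
    hits_word_poor=20,
    hits_excellent=1000,
    hits_poor=1000,
)
print(score)  # 2.0
```

Swapping the two co-occurrence counts gives -2.0, so the sign carries the sentiment and the magnitude carries its strength.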
Another guy, Marco Dorigo, research director for the Belgian Fonds de la Recherche Scientifique and research director of the IRIDIA lab at the Université Libre de Bruxelles, talks about swarm intelligence: if you look at the way things move and the way things change, you could actually change that information on a dime if you knew which way information was going to be moving. There’s the data itself, then there’s the way that data is used – or what is important about it at the moment, what is heuristically important. Ant colony optimization is a metaheuristic for combinatorial optimization problems using “swarm intelligence.” It makes statements about the value importance vs. heuristic importance and is therefore useful in search prediction. For example, you might look at the way people are analyzing Twitter feeds: it’s ant colony optimization, it’s swarm intelligence. Suddenly a Twitter stream has emerged out of nowhere and it gathers importance very, very quickly, just like a bunch of ants suddenly attacking a piece of peanut butter and jelly sandwich that landed on the ground only minutes ago.
Another big area that we know about is natural language processing (NLP). Natural language processing is frequently used these days in conjunction with another system. The main pillars of natural language processing are listed below; how much of each pillar is used depends on the researcher and on what is being built.
- Syntactics: the rules for the language and how they govern the sentences in any individual language.
- Semantics: the words themselves and how they are stated and behave.
- Morphology of those words – the singulars and plurals and other things about them.
- Phrase logical implementations – the use of those words in phrases.
- Stemming: or lemmatization, cutting off the endings, such as the eds, and the ings, and other things that take you to the word root.
- Statistical options – as outlined above.
- Grammatical applications, some of them actually fully graphed sentences. Some of you are old enough to remember graphing sentences from school.
- Then, at the end of the day there is just a nice common sense. It’s really handy to have a common-sense algorithm that you can apply. That is often done in a rules base or something like that where you say that it’s pretty clear to us how this works. Natural language processing in companion with Boolean operators, for example, makes a nice rules-based system for people.
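Of the pillars listed above, stemming is the easiest to show in a few lines. This Python sketch is a deliberately crude suffix stripper; the suffix list and the three-letter minimum are assumptions, and real stemmers and lemmatizers apply many more rules.

```python
def crude_stem(word, suffixes=("ing", "ed", "es", "s")):
    """Strip the first matching suffix, keeping at least three letters of root."""
    word = word.lower()
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for word in ["searching", "indexed", "terms", "Taxonomy"]:
    print(word, "->", crude_stem(word))
```

A stripper this simple will over-stem or under-stem plenty of English words, which is one reason rules bases and taxonomies are layered on top of it.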
That is a nice segue to automatic language processing, which is where we’ll pick up next week.
Marjorie M.K. Hlava
President, Access Innovations
Data Without Boundaries Project Launches
March 2, 2012
Posted in Access Insights, metadata, News
Metadata Technology is partnering with the Data Without Boundaries project where they will help design and implement a portal for the search and discovery of datasets held within national archives, statistical offices, and research centers across Europe.
We found this interesting information on PRWeb in their article, “Metadata Technology Providing Support for Data Without Boundaries Project.” Metadata Technology is the sole commercial partner in Data Without Boundaries, but there are 28 participants across 12 countries, including data archives and statistical agencies. The project is projected to last for 4 years and is supported by the European Union Seventh Framework Program through a funding of over 6 million Euros.
Melody K. Smith
Sponsored by Data Harmony, a unit of Access Innovations, the world leader in indexing and making content findable.
Leveraging Your Taxonomy – Part 4
February 27, 2012
Posted in Access Insights, Featured, search
As we continue this series on search and how it works, we have to address accuracy. First, how are we going to measure accuracy?
- Relevance
- Recall
- Precision
- Accuracy – Hits, misses, noise
- Ranking
- Linguistics
- Query processing
- Results processing
- Display
- Search refinement
- Usability
- Business rules
As you can see from this list, there are a whole lot of different ways to do it. Relevance has become extraordinarily popular, so I put that at the top, but there are lots of other ways to measure it. One is recall: Did you get everything in the database having to do with your question? And precision: So, did you get everything but not a lot of junk as well? If you got everything, you might have gotten a lot of junk that was not really responsive or appropriate to your question. Taxonomies help an extraordinary amount with recall and precision. They don’t help much with relevance.
Another way to measure accuracy is with statistics on hits, misses, and noise. That is when you look at the records retrieved – and you can use this for the indexing accuracy, as well. (Indexing and search go hand in hand and have done so since the first computer databases for text were developed in the mid-1960s.) A hit is something that a human thinks is exactly right and the computer also thought was right. A miss is something that the human would have suggested but the computer did not suggest. Noise is something the computer suggested, about which the human says, “That isn’t really right for my query.” So, hit, miss and noise statistics are another way to measure accuracy.
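Hit, miss, and noise counts fall out of comparing what the human chose with what the computer suggested. The index terms and the `hit_miss_noise` helper in this Python sketch are hypothetical.

```python
def hit_miss_noise(human_terms, machine_terms):
    """Compare human-approved terms with machine-suggested terms."""
    human, machine = set(human_terms), set(machine_terms)
    return {
        "hits": human & machine,    # both human and computer chose it
        "misses": human - machine,  # human wanted it, computer missed it
        "noise": machine - human,   # computer suggested it, human rejects it
    }

stats = hit_miss_noise(
    human_terms=["taxonomy", "thesaurus", "indexing"],
    machine_terms=["taxonomy", "indexing", "database"],
)
for kind, terms in stats.items():
    print(kind, sorted(terms))
```

The same comparison works for indexing accuracy and for search results; only the meaning of the items being compared changes.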
Search results are often presented in rank order – those that are most appropriate are first, and those that match fewer and fewer parts of the query are closer to the bottom.
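One very simple way to produce that rank order is to count how many query terms each document matches. This Python sketch assumes whitespace tokenization and a made-up four-document collection; real ranking algorithms weigh many more signals.

```python
def rank_by_matches(query, docs):
    """Order documents by how many distinct query terms they contain."""
    terms = set(query.lower().split())
    scored = []
    for doc_id, text in docs.items():
        matched = len(terms & set(text.lower().split()))
        if matched:  # documents matching nothing are not returned at all
            scored.append((matched, doc_id))
    # Most matches first; ties are broken by document id for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]

docs = {
    "a": "taxonomy search recall",
    "b": "search engines",
    "c": "gardening tips",
    "d": "taxonomy and thesaurus search precision",
}
ranking = rank_by_matches("taxonomy search precision", docs)
print(ranking)  # ['d', 'a', 'b']
```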
Linguistic analysis or general linguistic applications using natural language processing are frequently applied in search and, there again, we measure them either by recall and precision or hit, miss, and noise statistics.
Query Processing
In measuring accuracy in search or measuring how well search works, we also talk about query processing. In particular, we are interested in how fast the results are returned. So, there are two parts to that. One part is the query. I’ve asked a query; now, how fast is it going to come back to me with an answer? That often depends on how much of that query is going to be held in cache memory and how much is going to be accessed through a hard drive of some kind.
The second part is the results processing. We are looking at the results that are coming to the user. How fast can I see my results? How fast are they returned to the user? A query is where we are getting the pieces of the answer, but once the query has given me those pieces, I need to process those into a result – into the answer that you see. I got 55 hits and I want to see them. I want to click on something and get the actual document presented to me. Well, that is the display processing. Most systems do not assemble the entire record on the fly, or as you ask for it, from a lot of different pieces. They have what is called a display server, and they will show you the results from that as you indicate your approval by clicking the URL or the path that will take you to the full document. So, it pops up as a full document in all its original glory. The easy way is just to store the full document somewhere out there and let you go retrieve it. In a system like Google, obviously you are going to the original web page or whatever is referenced in the URL link.
There are lots of refinements on that. How long does it take you to narrow down your search? Can you do a search within a search, also known as a recursive search? You have a general set. Can you just keep that set and throw out the junk and narrow in to what is precisely what you want, or do you have to start the search over with more words so that you can get an answer?
Usability is another way that we measure search results to see how good the search system is. This is when we say – “This was really easy to use.” “It was very user-friendly.” “I like the user interface.” All of those user experience kinds of questions are, of course, big in the end user’s mind. They may not be particularly important to the IT people because they are looking at the things we have already talked about. When you get down to the customer service and user experience interface, then you want to know how usable the system is.
Finally, there’s a whole set of business rules. How is the security of the system? Am I able to limit it so that people can search only in a couple of ways?
That takes us to relevance and we’ll pick up there next week.
Marjorie M.K. Hlava
President, Access Innovations
Leveraging Your Taxonomy – Part 3
February 20, 2012
Posted in Access Insights, Featured, search, Taxonomy
This series of blog posts is exploring how search works. We need to have a basic understanding of search fundamentals in order to know where taxonomies come in. Last week we started talking about search software, and today we will continue with that topic.

I believe in the data first, as you know if you’ve been following this blog. Starting in the diagram with your source data, you can see how the data flows. You need to clean the source data to a uniform format. This is often called the conversion process or the ETL – extract, transform, and load.
Next we deposit the clean data into a cache – this is a repository of the data. Some people call it the content management system or CMS. Then build access to the repository in the cache builders. Next the data goes into some kind of a hub to deploy that information into search. So from the hub, we build the inverted file indexes, so that we have those indexes available for searching. The user will submit queries into the search presentation layer. That will go into a federator because we might have, as in the case of SharePoint, a whole lot of projects that we want to search across. If you have multiple repositories that you want to search across, you need to go to a federator, and that will go to the query servers that will access the deploy hub, access the indexes, and return an answer to the user.
If you look at a system like FAST (offered by a company that has been bought recently by Microsoft), you see that you have all of this content coming in here on the left. That information has to be digested by the system. It is digested by a lot of different connectors or conversion programs that then load it into the content API for, in this case, FAST, or some other search software.
The document is processed, and that means that it is broken into pieces so that the search system can use it. The search system is going to want to serve the full article, so it has to have a server to present those articles. It’s going to have an index so that you can show the individual files and it’s maybe going to have some RSS feeds and other filters to send out alerts to different kinds of receiving organizations.
If you have chosen a Bayesian system, then you need to collect training sets so that the statistical analysis of the system has something to work on. That means you need at least 20 records (100 is better) for each term you want to train the system to search. In these records, the term must be used correctly. Our experience is that you need to collect about three times that many records for each term to find enough where it is used correctly, because in English we use the terms in so many ways. That is the Achilles heel expense of the statistics-based systems.
When the user submits a query, the search request moves through a query API – a search application programming interface – something that will translate the search question into the syntax used within the system. It goes into the query processor and then accesses the guts of the system, the inverted file.
The taxonomy can come to play in two parts. There may be a taxonomy governance layer. It is great to apply the taxonomy terms to those documents as they are processed into the system. That activity should happen somewhere in the document processing pipeline activity. If you are lucky, you can also use the taxonomy at the search end so that people can disambiguate their queries as they are searching the system. Taxonomy, to my mind, should be in two places within your search implementation. Really, search is the reason you built the taxonomy in the first place.
Next week we’ll address the need for accuracy in search.
Marjorie M.K. Hlava
President, Access Innovations
Leveraging Your Taxonomy – Part 2
February 13, 2012
Posted in Access Insights, Featured, search, Taxonomy
This series of blog posts is exploring how search works. We need to have a basic understanding of search fundamentals in order to know where taxonomies come in. Last week we started with the various modules of search. This week we are addressing the search software itself.
The technical parts normally include several pieces:
- some ranking algorithms, so that you get the data in a format that is most likely to answer the query of the user or the search question;
- a query language and syntax that enable you to actually ask a question of a computer;
- federators that gather together the information from all kinds of different places and put it into a single cache to be searched;
- and the cache itself, which is often the collection of data. There are two main types of cache. People talk about cache and cache memory. They are not the same thing. One is active in the random access memory (RAM) and one is on the storage disk.
- Once you have all those things, all search software that I know of builds an inverted index, an alphabetical sort of every term in the searchable areas that you will be covering.
- Most other enhancements take place in the presentation layer – that’s the interface that you see on the screen. The look and feel of the system is often what sells it, even though the technology underneath is widely variable.
Here is my preferred methodology:
- Design the system application.
- Decide what else needs to be added to the data so that you can enhance it. In that case, that would be where the taxonomy comes in, for dealing with subject metadata.
- Consider what other metadata, other data, and other controls your system needs in order to work properly.
- Once you have done all that, then find a system that will work with your data.
- Don’t spend months working on a document type definition (DTD) just to find out that when you try to stuff your data into the DTD, you forgot to allow for multiple authors, or you forgot to add pagination. Both of those examples have happened to me with clients recently. We waited months for a DTD, got to the DTD. The DTD only allowed a single subject term, it allowed no more than one author, it didn’t have pagination, they didn’t even allow a place for an abstract. Then they said that the DTD is locked, sorry, we can’t change it. So, we are stuffing the data into inappropriate fields.
Many organizations have five or more kinds of search software. All too often, none of them work; this is why they end up just sitting on the shelf. They aren’t looking at the data first. Okay, enough of my rant. Next week we’ll continue with the pieces of search.
Marjorie M.K. Hlava
President, Access Innovations
Leveraging Your Taxonomy – Part 1
February 6, 2012
Posted in Access Insights, Featured, search, Taxonomy
This series of blog posts will explore how search works. We need to have a basic understanding of search fundamentals in order to know where taxonomies come in. The modules of search are:
- Search software – of course
- Computer network
- Parsing of text
- Well formed or structured text
- CLEAN DATA
- Computer software – network
- Computer hardware
- Telecommunications connection
- Training sets for statistical systems
- Search technology
- Ranking algorithms
- Query language
- Federators
- Cache
- Inverted index
- Other enhancements
- Presentation layer
We will cover each of those items over the series; however, we also need to measure the accuracy of search. Accuracy in search is measured by three major areas: precision, recall, and relevance. Each of these can be handled in different ways. Part of the challenge in measuring search accuracy is that in search, there are two major theoretical directions. One of them is based on the Bayes theorems and the other on the algorithms of Boole. I will explore the work of these two gentlemen and of more recent people supporting search enrichment. Then I will discuss what effect they have on search as we know it today. Finally, I will discuss the effect of taxonomy on search.
How does search work?
Here is the normal path people take to search implementation. “Well, I think I will get some hardware.” And they say…. “Well, you can’t go wrong with Company X.”
- So they buy hardware and
- they buy the software that will work on that hardware.
- Then they design a system that will work with that software and
- then they try to load their data and,
- finally, they try to enhance the data with a taxonomy.
In my opinion, that is totally backwards. What they should be doing is looking at what they are building the system for in the first place – that is, the data. How about we build a system to hold your data?
We assess the data so that we can get a design; we know what fields there are. I have written about this backwards approach before.
What are you building?
- Assess the data
- Do the design
- Decide what else needs to be added
  - Taxonomy terms
  - Other controls
- Find a system that will work with your data
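The first step above, assessing the data, can be sketched as a quick field survey: which fields exist across the records, and how often each one is actually filled in. This is only an illustration under stated assumptions. The records, the field names, and the `field_coverage` helper are all hypothetical.

```python
from collections import Counter


def field_coverage(records):
    """Map each field name to the fraction of records where it is filled in."""
    seen = set()        # every field name that appears anywhere
    filled = Counter()  # how many records have a non-empty value per field
    for record in records:
        for field, value in record.items():
            seen.add(field)
            if value:  # None, "" and empty lists count as missing
                filled[field] += 1
    total = len(records)
    return {field: filled[field] / total for field in sorted(seen)}


# Hypothetical records with uneven metadata
records = [
    {"title": "Search basics", "author": "A. Author",
     "abstract": "How search works", "subject_terms": []},
    {"title": "Taxonomy and search", "author": "",
     "abstract": "Where taxonomies fit"},
    {"title": "Clean data", "author": "B. Writer",
     "abstract": None, "subject_terms": ["Data quality"]},
]

for field, share in field_coverage(records).items():
    print(f"{field:15s} {share:.0%}")
```

A survey like this shows at a glance which fields the design can rely on and which ones, such as the taxonomy terms here, still need to be added.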
Let’s outline the pieces of the search implementation. There are a lot of parts to search, and one of them, of course, is the search software itself, which runs on a computer network. The software depends on parsing, that is, cutting the text into specific pieces so that it can be searched. That means the data needs to be structured or, in the XML vernacular, well-formed.
Unstructured data is simply data that has not been tagged into fields. You could transform a Word document, which is generally considered unstructured, into a well-formed XML document by simply putting <body> and </body> at the beginning and end of the text (and escaping any stray & or < characters inside it). Yes, it can be that deceptively and technically simple. The whole notion of structured and unstructured text is a bit of a misnomer and a little hard to understand, because most of us don’t think of data in that way. In fact, a Word document has a Properties table in it that may or may not be populated, and some of its entries are populated by default. So it is actually partially structured already. It can even be saved as an XML document.
The search software depends on clean data. That means the data is well-formed and, preferably, all the metadata fields are filled in, including the taxonomy terms in a specific tagged field or element.
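As a rough sketch of that idea, the Python below wraps plain text in a minimal XML record, puts taxonomy terms in their own tagged element, and checks that the result is well-formed. The element names (`record`, `subject`, `body`) and both helper functions are assumptions made for this example, not part of any particular search product.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def wrap_as_xml(text, taxonomy_terms=()):
    """Wrap plain text in a minimal XML record (hypothetical element names)."""
    parts = ["<record>"]
    for term in taxonomy_terms:
        # Each taxonomy term gets its own tagged field
        parts.append(f"  <subject>{escape(term)}</subject>")
    # escape() turns &, < and > into entities so the text cannot break the markup
    parts.append(f"  <body>{escape(text)}</body>")
    parts.append("</record>")
    return "\n".join(parts)


def is_well_formed(xml_string):
    """Return True if an XML parser accepts the string."""
    try:
        ET.fromstring(xml_string)
        return True
    except ET.ParseError:
        return False


doc = wrap_as_xml("Bayes & Boole: two views of search", ["Information retrieval"])
print(doc)
print(is_well_formed(doc))                           # True
print(is_well_formed("<body>Bayes & Boole</body>"))  # False: bare ampersand
```

The second check shows why the escaping step matters: a single unescaped ampersand is enough to make otherwise tagged text fail to parse.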
The computer software runs on a network, which runs on hardware. To get to it, you need to have a telecommunications connection of some kind. It might be a hard network wire within your organization – so you connect from one place to another within the firm – or it might be something that goes over the Internet to a remote location. It doesn’t really matter – the connection is still a telecommunications connection that transfers data in an orderly fashion over a wire.
Next week we will talk about the search software itself.
Marjorie M.K. Hlava
President, Access Innovations
Of Taxonomies, Biology, and Moneyball
January 30, 2012
Posted in Access Insights, Featured, Taxonomy
Baseball and biology are not commonly found in the same conceptual space. Nor is taxonomy usually associated with baseball, but in recent news both connections were made. Grant Bisbee, editor of “Baseball Nation”, digresses into the arcane as he laments the coming of the “He’s In the Best Shape of His Life” season. This is the time of year when baseball writers must assess the prospects for the coming season, and clichés and hyperbole reign. The dubious practice of evaluating the physical condition of players runs rampant as spring training begins. With tongue in cheek, Bisbee tries to shape a taxonomy to classify this spring ritual. His would be the taxonomy of the “In the Best-shape Stories”.
Bisbee suggests three categories: “Player X Got in Shape”, “Fixed Eyesight”, and “A Serious, Previously Undiscovered Affliction”. He compiles representative stories for each classification. He is somewhat skeptical that a player could be in the “best shape of his life” as a result of some of the reported training regimens. Can someone really put on 18 pounds of muscle in the off-season? Legally? Shouldn’t vision examinations be a regular part of baseball teams’ operations? After all, they’re paying millions for their players’ hand-eye coordination.
Are there baseball taxonomies already out there that could help Mr. Bisbee classify his apocryphal collection? A “Google” search of baseball taxonomies returns a million items more or less – plus or minus – give or take – a few hundred thousand. Not much help there. At the very least, Bisbee could expand the classification to include other sports and how each portrays its mystical preseason rituals. How are preseason body rebuilding rhythms disturbed by “Seasonus Interruptus”?
And biology is experiencing a potential season-disrupting trend. Tom Spears, reporting for the “Ottawa Citizen”, filed the story, “Taxing times for taxonomy”. On a more serious note, Mr. Spears chronicles the demise of field-based taxonomic study in biology in favor of lab work. Computers and DNA studies have relegated classification to the closet. While the work done in laboratories is vital to the field, biologist Ernest Small is quoted as asking, “How can you do a study of forests without knowing the trees?” The reverse is also true; you can’t really study a tree without understanding its forest. Commenting on his lab-centric colleagues, Dr. Small laments “the lack of knowledge of what plants and animals make up our world”. Spears goes on to write, “We need to understand whole species, not just genes, if we are to solve the problems of agriculture, fisheries, insect pests, ecology and the spread of diseases.”
Taxonomies are central to understanding a field as a whole, whether biology or sports. And they are critical to understanding individual members of the taxonomy. The taxonomy of plants and animals tells you about the relationships and characteristics of its members. The child, or narrower node in a taxonomy, inherits the properties of the parent, or broader node. Reviewing an entire hierarchical branch and its relationship to other branches conveys the properties, similarities, and differences of branches. This knowledge can lead to sleuthing out meaningful trends and can help explain interactions or identify possible interactions. Taxonomy for organizing sports articles might reveal interesting groupings and trends. Are there more injuries being reported this season? Are there geographic trends or team trends? Applying graphical analysis visualizes the content in ways that enhance spotting of trends.
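The inheritance idea can be shown with a small tree in Python. This is a minimal sketch, assuming each node stores a link to its broader term; the `Node` class is hypothetical and the properties attached to each level are illustrative rather than authoritative biology.

```python
class Node:
    """One term in a taxonomy, linked to its broader (parent) term."""

    def __init__(self, label, parent=None, **properties):
        self.label = label
        self.parent = parent
        self.properties = properties

    def _chain(self):
        """This node followed by each broader term up to the top."""
        node, chain = self, []
        while node is not None:
            chain.append(node)
            node = node.parent
        return chain

    def path(self):
        """Labels from the broadest term down to this node."""
        return [n.label for n in reversed(self._chain())]

    def inherited_properties(self):
        """Merge properties from the top down; narrower terms override broader ones."""
        merged = {}
        for n in reversed(self._chain()):
            merged.update(n.properties)
        return merged


# Illustrative branch: family > tribe > genus > species
family = Node("Nymphalidae", order="Lepidoptera")
tribe = Node("Danaini", family, larval_host="milkweed")
genus = Node("Danaus", tribe)
species = Node("Danaus plexippus", genus, common_name="monarch butterfly")

print(" > ".join(species.path()))
print(species.inherited_properties())
```

The species node declares only a common name, yet it reports the order and the larval host as well, because it picks them up from the broader terms above it.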
What these two diverse stories have in common, besides the core theme of taxonomies, is the need for hands-on, down-and-dirty fieldwork. Sports writers need to see players perform during preseason to really assess their fitness. Biologists can learn a great deal about a species by looking through a microscope, but they can’t really understand behavior that way. They need to observe firsthand. When the field data and lab data are gathered, placing them in context gives them meaning and leads to insights. Taxonomy is ideal for grouping data in meaningful ways, and it can also lead you back to the data at some future point. Taxonomy is often focused on organizing content for search to make search easy, fast, reliable, and replicable. Taxonomy also defines a domain or conceptual space. It can help you find your way and keep you from getting lost. Equally important, it provides meaning to the conceptual space you’re wandering through. Spring is almost here – enjoy the first Danaus plexippus of the season and know it has a genus and a tribe and a family and more. Play ball!
Jay Ven Eman, CEO
Access Innovations
SharePoint Gives Findability
January 25, 2012
Posted in Access Insights, News, search
Access to information is something we take for granted more and more every day. The Internet has changed our ability to find information and locate data, close or remote. Smart phones have taken that availability to unimagined realms. In the corporate world, there are still two groups – those who have findability of their data and those who do not.
We found this interesting viewpoint on KMWorld in their article, “SharePoint: Transforming the information have-nots into the haves.” Those who do have findability most likely employ some version of SharePoint. Access Innovations’ Data Harmony suite of content enrichment and thesaurus management tools can be fully integrated with Microsoft SharePoint 2010. Data Harmony fills semantic gaps in SharePoint to help users take full advantage of their metadata through auto-classification, enterprise taxonomy management, entity extraction, and search enhancements. The end result is information assets that are more searchable and more accessible.
Melody K. Smith
Sponsored by Access Innovations, the world leader in thesaurus, ontology, and taxonomy creation and metadata application.


