New Book Addresses Ontologies

March 29, 2012  
Posted in Folksonomy, News, ontology, reference, Taxonomy

Woodhead Publishing Ltd’s new book “Library Classification Trends in the 21st Century” traces developments in and around library classification as reported in literature published in the first decade of the 21st century.

Business Wire brought this news to our attention in their article, “Research and Markets: Library Classification Trends in the 21st Century.” The book’s review of literature published on various aspects of library classification includes modern applications of classification such as Internet resource discovery, automatic book classification, text categorization, modern manifestations of classification (such as taxonomies, folksonomies and ontologies), and interoperable systems enabling crosswalk.

It is an interesting read and will be useful to classification researchers, LIS faculties, and postgraduate students in library and information science.

Melody K. Smith

Sponsored by Data Harmony, a unit of Access Innovations, the world leader in indexing and making content findable.

Flickr as a Folksonomy

May 17, 2011  
Posted in Folksonomy, News, Taxonomy

Remember when you took film to be developed and then shared your packets of memories with friends? It seems so nostalgic, but this routine was common only a few years ago. Then came the age of digital photos, which led to online photo sites, out of which Flickr was born.

The Telegraph brought this topic to our attention in their article, “Flickr: the world’s photo album.” No one could have predicted that 35 million people would willingly choose to share their photos online with strangers. Even more, who could have guessed that people would join groups identified by their photo subjects, such as cats, dogs, or cancer patients?

Thomas Vander Wal calls this anomaly a “folksonomy”, an informal, collaborative taxonomy. Much like Wikipedia, it is a relatively small, highly committed bunch of users who busy themselves with the sifting and sorting out of Flickr’s content. Their diligence was recognized by the US Library of Congress, which added unmarked photographs from its archive to Flickr in the hope that the worker-ants might identify them. Other institutions quickly followed suit.

Melody K. Smith

Sponsored by Access Innovations, the world leader in taxonomies, metadata, and semantic enrichment to make your content findable.

Not True!

The Autonomy folks must be getting worried about the progress of taxonomy applications and the precision and recall that such systems provide. Autonomy and Google live on relevance rankings as the return to the user. Relevance, to me, is a confidence game. It is the best guess of the system as to whether the results returned will actually match the user’s request. If you have a big enough data set returned, certainly something in there will be useful. But the sheer number of items the user has to review (or the amount of noise they have to look at) is very annoying. So they rank the returns by relevance based on a number of statistical factors, so that the most likely items, based on co-occurrence with term matches and near matches, will appear at the top of the list. That is, they will be relevance ranked.

The Boolean systems depend on actually giving the user exactly what they want, without the records that do not match the topic asked for (precision), and all of the items that do match (recall). When a search system is combined with a taxonomy providing an outline of the contents of the file, the user will find exactly what they want, without the overburden of additional postings. The next time they interrogate the system they will get what they got before, reliably, and they will get the new stuff. The clusters will not be rearranged in a new order after each upload of additional data; retrieval will be consistent and persistent, based on precision and recall, and organized as expected. This means the researcher can return and update their work on a regular basis.
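As a rough illustration of the two measures, here is a minimal Python sketch. The function name and the record IDs are made up for this example and do not come from any particular search product. It compares a broad result set padded with noise against a narrow result set that misses half of the matching records.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: IDs the system returned
    relevant:  IDs that truly match the topic asked for
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    # Precision: share of returned items that are on topic.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall: share of on-topic items that were returned.
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical record IDs, for illustration only.
relevant = {"r1", "r2", "r3", "r4"}
broad_result = ["r1", "r2", "r3", "r4", "n1", "n2", "n3", "n4"]  # every match plus noise
narrow_result = ["r1", "r2"]                                     # clean but incomplete

broad = precision_recall(broad_result, relevant)
narrow = precision_recall(narrow_result, relevant)
```

The broad set finds everything but makes the user wade through as much noise as signal; the narrow set is clean but leaves half of the matching records unfound.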

Autonomy and other systems like Vivisimo are wonderful tools for discovery. Discovery in the research process takes up to 5% of the time in a project, whether it is a basis for legal proceedings or a research project. The other 95% of the search time is spent on reading, defining the terms of the project, and reviewing and updating the known science with new events and findings.

The Autonomy site says “Keyword and Boolean searches return only those documents that contain the terms queried by the user. …” The posting indicates that “other major flaws prevail:

1. It ignores the context in which the keywords were found, and therefore cannot accurately gauge whether the keywords found in the file also represent the main concepts of the file. Weighting of keywords (e.g. found in title vs. buried in the middle of the file; frequency of keywords) only mitigate this issue and does not remove this critical defect.”

MH Reply:

Depending on context as described here would imply the context in the full text of an article, report or other document. Certainly some search systems, Autonomy included, use weighting as a ranking algorithm. Frequency of keywords, also known as number of occurrences, is a basic premise of co-occurrence ranking systems, and it does not necessarily provide accurate search results. To say it does not remove this critical defect highlights a problem with statistically based systems and co-occurrence search software like Autonomy itself. Indexing (the practice of applying controlled vocabulary terms to a document) to the most specific level is the best practice according to the National Information Standards Organization (NISO) standard Z39.14. When you are at that specific level you are able to determine the context immediately surrounding the term generation spot. So the indexing is precisely what is needed at that point in the document.

2. It cannot find files that are conceptually relevant to the queried terms but do not contain the keywords used in the query.

MH Reply:

This would be true if there were no thesaurus or taxonomy involved in the implementation. But when term equivalency is implemented using non-preferred terms, an alias, synonymy, or a synonym ring, this distinction disappears, and the precision and recall can provide a time-saving and exact answer for the user, no matter what they call the concept. If you call it a “couch” and I call it a “sofa” and someone else calls it a “davenport” or “settee”, we will all get exactly the same search results using a good dictionary approach in the search. That dictionary, implemented as a thesaurus, will also provide the hierarchical and related terms to further enrich the search experience. The taxonomy can also provide the basis of a navigational structure, such as with MediaSleuth.
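The couch and sofa case can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the synonym ring, the document IDs, and the `search` function are hypothetical, and a real thesaurus-backed search would do far more.

```python
# Hypothetical synonym ring: every entry term maps to one preferred term.
SYNONYM_RING = {
    "sofa": "sofa",
    "couch": "sofa",
    "davenport": "sofa",
    "settee": "sofa",
}

# Hypothetical postings, indexed under preferred terms only.
INDEX = {
    "sofa": {"doc-101", "doc-205", "doc-317"},
    "armchair": {"doc-118"},
}


def search(query):
    """Resolve the query to its preferred term, then look up the postings."""
    term = query.strip().lower()
    preferred = SYNONYM_RING.get(term, term)  # unknown words are searched as typed
    return sorted(INDEX.get(preferred, set()))
```

Under these assumptions, `search("couch")`, `search("sofa")`, `search("davenport")`, and `search("settee")` all return the same three records, because each query is resolved to the preferred term before the lookup takes place.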

3. In order to maximize accuracy, it requires human intervention to manage and update keyword associations or categories.

MH Reply:

This is partially true. Much of this work has been done and is available for sharing or purchase. Examples include the WordNet system from Princeton University, a number of excellent dictionaries such as American Heritage, knowledge domain thesauri from Access Innovations, and taxonomies from Cengage and from WAND.

4. It is unable to learn and adapt through use; it cannot retrieve files by being shown an example.

MH Reply:

It is true that training sets and retraining are not needed in taxonomy or keyword-based systems; it is also true that these processes are labor intensive. Training and retraining of a Bayesian system requires a two-step process.

1. You have to find enough records where the term is used correctly so the training can take place. This is an editorial task.

2. Once the records are collected, a systems programmer needs to run the entire collection and reset the vectors or pointers so that retrieval can take place using the new concepts. Yes, you can train by example, but it is an expensive and time-consuming process. While it is being done, the system cannot use those new concept terms. In a keyword system, they are simply added with their synonyms so retrieval on the older material and the new can take place.

Parsing and Natural Language Analysis

“Due to the inherent complexity of language, it is unable to handle ambiguity (e.g. “The dog came into the room; it was furry.” What is the “it” it refers to?). Improving the algorithm requires the construction of a set of rules that are cumbersome and difficult to maintain.” If you were using a keyword system you would already know it was a “dog”. The “it” would not need to be redefined.

Manual Tagging, Weighting and XML

“Manual tagging schemes are becoming an increasingly popular method of labeling and categorizing digital material. However, it suffers from the following flaws:  It is descriptively inconsistent due to its reliance on human contribution. Each person may tag a given document differently (especially when the content deals with multiple themes), and/or people can get lax in their tagging and categorize most content under “general.”

MH Reply:

Manually intensive systems are indeed not reliable. Editorial drift is a well-known problem. The use of an automatic or assisted indexing system to suggest terms to the indexer or apply them automatically IS consistent. It will suggest the same keyword terms under the same conditions every time. Best practice indicates that the term “general” should never be used. It is lazy indexing and poor controlled vocabulary creation. “Not elsewhere classified” is a dumping ground and creates unfindable items. It should never be used in indexing or classification of information.
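To show why rule-based term suggestion is repeatable, here is a minimal Python sketch. The rule base, the trigger phrases, and the `suggest_terms` function are hypothetical and are not the rules of any actual indexing product. The point is only that the same text run against the same rules yields the same terms.

```python
import re

# Hypothetical rule base: controlled-vocabulary term -> trigger phrases.
# Note that there is no catch-all term such as "General".
RULES = {
    "Taxonomies": ["taxonomy", "taxonomies"],
    "Information retrieval": ["search results", "retrieval"],
    "Metadata": ["metadata", "metatagging"],
}


def suggest_terms(text):
    """Suggest every vocabulary term that has a trigger phrase in the text."""
    lowered = text.lower()
    suggestions = []
    for term, triggers in RULES.items():
        # Whole-word match, so "taxonomy" does not fire inside a longer word.
        if any(re.search(r"\b" + re.escape(t) + r"\b", lowered) for t in triggers):
            suggestions.append(term)
    return sorted(suggestions)


sample = "A taxonomy improves retrieval of documents."
first_pass = suggest_terms(sample)
second_pass = suggest_terms(sample)
```

Both passes suggest the same two terms, and a document that triggers no rule gets no terms at all rather than being dropped into a “general” bucket.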

“XML is not a set of standard tag definitions; it is a set of definitions that allow for tags to be defined. This poses difficulty when organizations or departments with different practices interoperate.”

MH Reply:

True, XML is a great format for transporting data but it is only that. To share the information, one must also have the DTD or schema that defines what the tags are for. That is why RDF wrappers for XML files are so wonderful. They wrap the XML in its own definitions. It is like the HTML header you can see when you view source on a web page. It tells the browser which DTD is in use, and if you don’t happen to have it stored in your browser, it tells you where you can go to get it to import and activate.

“Taxonomy creation and tagging involve costly manual labor, requiring input from librarians, users and IT staff.”

MH Reply:

Taxonomy creation is a largely automatic process. Either use and augment an existing thesaurus or create one from your data. Keep it up to date using the search logs and folksonomy implementations of various systems to capture the users’ actual questions and additions as they think of the content. Taxonomy creation should be done by a combination of taxonomists (wordsmiths from all walks of life make good taxonomists) and subject matter experts (SMEs). This brings together two kinds of experts: those who know how to put together a good taxonomy and those who know the subject content and how it fits together. The best combination is to use each where they are strongest. Use the taxonomists to create the term relationships (hierarchical, equivalence, and associative) and the notes about each term. Then use the subject matter experts to ensure that the organization matches the way people in the specialty field think of the data. Building a taxonomy is less expensive than creating training sets for automated systems. Taxonomies are easier to maintain, and they provide higher precision and higher recall. True relevance, measured by what the requester has to look at and how closely it matches what they were looking for, is much higher with taxonomy-based systems.

“Tags fail to highlight the relationships between subjects because they lack a conceptual understanding to form correlations. There are often vital relationships between seemingly separately tagged subjects such as wing design/low drag and aerofoil/efficiency, but this concept of “idea distancing” is not leveraged.”

MH Reply:

This assumes a taxonomy is solely a hierarchical presentation. Although it is possible that a taxonomic navigational structure will be implemented, a purely hierarchical taxonomy lacks the richness available from a well-constructed full thesaurus built according to the ISO FDIS 25964 or NISO Z39.19 vocabulary standards. The “related terms” option for associative terms provides the same effect as the “idea distancing” while ensuring consistent, reliable retrieval and replicable results.
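A thesaurus entry with its relationship types can be pictured with a small Python sketch. The `Term` class, the sample entries, and the `expand` function are hypothetical, built loosely around the wing design and aerofoil example in the quote above; they are not taken from either standard.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """One thesaurus entry with the usual relationship types."""
    preferred: str
    used_for: list = field(default_factory=list)  # equivalence (UF)
    broader: list = field(default_factory=list)   # hierarchical (BT)
    narrower: list = field(default_factory=list)  # hierarchical (NT)
    related: list = field(default_factory=list)   # associative (RT)


# Hypothetical entries, for illustration only.
THESAURUS = {
    "Wing design": Term("Wing design", broader=["Aircraft design"],
                        narrower=["Aerofoils"], related=["Drag reduction"]),
    "Aerofoils": Term("Aerofoils", used_for=["Airfoils"],
                      broader=["Wing design"], related=["Aerodynamic efficiency"]),
}


def expand(term_name):
    """Return the term plus its narrower and related terms, in a fixed order."""
    term = THESAURUS[term_name]
    return [term.preferred] + sorted(term.narrower) + sorted(term.related)
```

In this sketch, expanding “Wing design” pulls in “Aerofoils” through the hierarchy and “Drag reduction” through the related-term link, and it does so in the same order on every run, which is the replicability the reply is arguing for.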

“As the number of tags increases, so too does the likelihood of misclassification and the effort to maintain consistency. This approach is not scalable.”

MH Reply:

If one were using a folksonomy and limited the number of postings, this might be true. However, using a taxonomy which can grow with the content is in fact a much more scalable solution. The consistency in a rule-based application of terms is 100%. The terms are always applied the same way. There is no editorial drift. In fact, the Autonomy classification system using vectors is not truly scalable. The more data, the more likely the system is to get confused on the pointers, and the more likely that the system will provide unreliable, inconsistent, unreplicable result sets to the users. The massive amount of computational time needed to deliver these results is logarithmic in delivery and slows over time.

Marjorie M.K. Hlava
President, Access Innovations

Metadata: Use It or Lose It

March 9, 2011  
Posted in Folksonomy, metadata, News

March 9, 2011 – Metadata seems to be the word of the year. Though it certainly is a key component of any document management system, there seems to be ambiguity around the appropriate use of metadata.

CMSWire brought this news to our attention in their article, “Document Collaboration and Metadata: An Unholy Alliance.” Document management systems cover a wide range of features, including search, indexing, and metatagging. Metadata allows us to store more information about individual documents to help us understand them better, and ultimately search for them with assured findability.

Like most things though, it works only if used. Many users take a stab at it with a faint heart and leave many fields blank that could have been doors for future access. Many users create their own folksonomy using familiar terms, but again if it isn’t used consistently and with some sense of standards, it will fail in the end.

Machine assisted indexing against a solid custom taxonomy is the key to any document management system’s success.

Access Innovations is one of a very small number of companies that can help clients generate ANSI/ISO/W3C-compliant taxonomies to manage their content.

Melody K. Smith

Sponsored by Access Innovations, the world leader in thesaurus, ontology, and taxonomy creation and metadata application.

Contributor Version 4.2 Launched

July 14, 2010  
Posted in Folksonomy, News

July 14, 2010 – Colligo Networks Inc. has launched Version 4.2 of its Contributor for SharePoint product line, which is compatible with Microsoft Outlook/Office 2010.

Enterprise Systems shared this news in their article, “Colligo Networks Launches Version 4.2 of Contributor for SharePoint Product Line”. New features of Version 4.2 also include support for keyword metadata fields, in-place records management, and synchronization performance optimization. All features were designed to improve overall performance. In the end, good data management means good business.

Melody K. Smith


Heather Hedden, Taxonomist – Profiled

July 9, 2010  
Posted in Folksonomy, News, Taxonomy

July 9, 2010 – Heather Hedden, author of The Accidental Taxonomist, recently responded to questions profiling her career in taxonomy.

As part of its regular features, Infonista.com sat down with Heather Hedden and talked about her career as a respected taxonomist. The interview was shared in the post “Career Profile: Heather Hedden, Taxonomist”. Describing her fourteen-year career in the field of taxonomies, controlled vocabularies, and thesauri, Hedden said that the best part of her chosen career is that taxonomists do not have to specialize in one subject area. They are fortunate to work with all kinds of subject areas, learning about different things in different industries and with different kinds of professionals.

Hedden’s book, The Accidental Taxonomist, was recently reviewed by Margie Hlava on this site in the feature, “The Accidental Taxonomist by Heather Hedden Review”.

Melody K. Smith


Government 2.0 Gives a Laugh

June 17, 2010  
Posted in Autoindexing, Folksonomy, News

June 17, 2010 – A cartoon caption contest provides an opportunity for fun and creativity on a topic that is not always so humorous.

Federal Computer Weekly and GovLoop, a social network for federal employees, joined together in this challenge for their readers to create a caption with a government 2.0 theme for cartoonist John Klossner’s drawing. The challenge was made in May in their post “How is gov 2.0 like a cowboy in an office?”, where you can also see the cartoon.

The results of the contest were recently released in their post “What a laugh! The results of the FCW cartoon caption contest.” The winning entry was “I’m stalled at 1.0.”

The results show that there can be humor in what some would deem a dry subject.

Melody K. Smith


Folksonomy Persists in Folktales

June 1, 2010  
Posted in Folksonomy, News

June 1, 2010 – An expert at Knowledge Architecture Professional Services delivered a talk at a recent conference. The title was “Folksonomy Folktales.” In my opinion, the most interesting idea in the talk was the hybrid approach to taxonomy and folksonomy. The idea is that making use of both formal and informal methods yields good results. The challenge, quite rightly, is for an organization to create a framework within which folksonomies can contribute. The balance would be “Simple Taxonomies + Facets + Clusters/ folksonomies + Auto-Categorization.” We agree and Access Innovations has the expertise and the tools for this type of project. Learn more at our Web site.

Margie Hlava