The Autonomy folks must be getting worried about the progress of taxonomy applications and the precision and recall that such systems provide. Autonomy and Google live on relevance ranking as the return to the user. Relevance, to me, is a confidence game: it is the system's best guess as to whether the results returned will actually match the user's request. If a big enough data set is returned, certainly something in there will be useful, but the sheer number of items the user has to review (the amount of noise they have to wade through) is very annoying. So these systems rank the returns by relevance based on a number of statistical factors, so that the items most likely to match, based on co-occurrence with term matches and near matches, appear at the top of the list – that is, they are relevance ranked.
Boolean systems depend on giving the user exactly what they want and nothing that does not match the topic asked for (precision), along with all of the matching items in the collection (recall). When a search system is combined with a taxonomy providing an outline of the contents of the file, users find what they want – exactly, without the burden of additional postings. The next time they interrogate the system they will reliably get what they got before, plus the new material. The clusters will not be rearranged into a new order after each upload of additional data; retrieval is consistent and persistent, based on precision and recall, and organized as expected. This means researchers can return and update their work on a regular basis.
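The two measures discussed above can be made concrete with a small sketch. This is an illustration only; the document IDs are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for one query.

    precision: fraction of retrieved records that are relevant
    recall:    fraction of relevant records that were retrieved
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# A tight Boolean query returns a small, exact set: perfect precision,
# though one relevant record was missed (recall suffers).
p, r = precision_recall(retrieved={"d1", "d2"}, relevant={"d1", "d2", "d3"})
assert p == 1.0 and r == 2 / 3
```

A relevance-ranked system tends to trade the other way: it inflates the retrieved set to boost recall, and precision (and the user's patience) pays for it.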
Autonomy and other systems like Vivisimo are wonderful tools for discovery. But discovery takes up only about 5% of the time in a project, whether that project is the basis for legal proceedings or for research. The other 95% of the search time goes to reading, defining the terms of the project, and reviewing and updating the known science with new events and findings.
The Autonomy site says “Keyword and Boolean searches return only those documents that contain the terms queried by the user. …” The posting indicates that “other major flaws prevail:
1. It ignores the context in which the keywords were found, and therefore cannot accurately gauge whether the keywords found in the file also represent the main concepts of the file. Weighting of keywords (e.g. found in title vs. buried in the middle of the file; frequency of keywords) only mitigate this issue and does not remove this critical defect.”
Depending on context as described here would imply using the context of the full text of an article, report, or other document. Certainly some search systems, Autonomy included, use weighting as a ranking algorithm. Frequency of keywords – also known as number of occurrences – is a basic premise of co-occurrence ranking systems, and it does not necessarily provide accurate search results. To say weighting "does not remove this critical defect" highlights a problem with statistically based systems and co-occurrence search software like Autonomy itself. Indexing (the practice of applying controlled vocabulary terms to a document) at the most specific level is best practice according to the National Information Standards Organization (NISO) standard Z39.14. When you index at that specific level, you can determine the context immediately surrounding the spot where the term arises, so the indexing is precisely what is needed at that point in the document.
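The weakness of raw frequency as a ranking signal can be shown in a few lines. This is a deliberately simplified sketch (real systems normalize and weight), with made-up word lists:

```python
from collections import Counter

def frequency_score(document_words, query_terms):
    """Rank by raw keyword frequency alone -- the basic premise
    of co-occurrence ranking criticized in the text."""
    counts = Counter(document_words)
    return sum(counts[t] for t in query_terms)

# A long, mostly off-topic document that merely repeats a term
# outscores a short, precisely focused one. Frequency is not context.
chatty = ["taxonomy"] * 5 + ["unrelated"] * 95
focused = ["taxonomy", "thesaurus", "indexing"]
print(frequency_score(chatty, ["taxonomy"]))   # 5
print(frequency_score(focused, ["taxonomy"]))  # 1
```

A human or rule-based indexer looking at the term in its immediate context would index the focused document under "taxonomy" and likely skip the chatty one.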
2. It cannot find files that are conceptually relevant to the queried terms but do not contain the keywords used in the query.
This would be true if no thesaurus or taxonomy were involved in the implementation. But when term equivalency is implemented using non-preferred terms – an alias, a synonym, or a synonym ring – this distinction disappears, and precision and recall can provide a time-saving and exact answer for the user, no matter what they call the concept. If you call it a “couch” and I call it a “sofa” and someone else calls it a “davenport” or “settee”, we will all get exactly the same search results using a good dictionary approach in the search. That dictionary, implemented as a thesaurus, will also provide the hierarchical and related terms to further enrich the search experience. The taxonomy can also provide the basis of a navigational structure, as with MediaSleuth.
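The couch/sofa example above can be sketched as a synonym ring mapped to one preferred term. The vocabulary, index, and catalog IDs here are hypothetical:

```python
# Hypothetical synonym ring: every alias maps to one preferred term.
SYNONYM_RING = {
    "couch": "sofas",
    "sofa": "sofas",
    "davenport": "sofas",
    "settee": "sofas",
}

# Postings are stored only under the preferred term.
INDEX = {"sofas": {"catalog-101", "catalog-202"}}

def preferred_term(user_word):
    """Map whatever the user typed to the controlled vocabulary term."""
    word = user_word.lower()
    return SYNONYM_RING.get(word, word)

def search(user_word):
    return INDEX.get(preferred_term(user_word), set())

# All four users get exactly the same result set:
assert (search("couch") == search("sofa") == search("davenport")
        == search("settee") == {"catalog-101", "catalog-202"})
```

Because the equivalence is resolved at query time, no document ever needs to contain the exact word the user typed.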
3. In order to maximize accuracy, it requires human intervention to manage and update keyword associations or categories.
This is partially true. Much of this work has already been done and is available for sharing or purchase: the WordNet system from Princeton University, a number of excellent dictionaries such as American Heritage, knowledge domain thesauri from Access Innovations, and taxonomies from Cengage and from WAND.
4. It is unable to learn and adapt through use; it cannot retrieve files by being shown an example.
While it is true that training sets and retraining are not needed in taxonomy- or keyword-based systems, it is also true that, where they are needed, these processes are labor intensive. Training and retraining a Bayesian system requires a two-step process.
1. You have to find enough records where the term is used correctly so the training can take place. This is an editorial task.
2. Once the records are collected, a systems programmer needs to run the entire collection and reset the vectors, or pointers, so that retrieval can take place using the new concepts. Yes, you can train by example, but it is an expensive and time-consuming process, and while it is being done, the system cannot use those new concept terms. In a keyword system, the new terms are simply added with their synonyms, so retrieval on both the older material and the new can take place.
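The contrast in the last sentence can be illustrated with a toy keyword system. Adding a concept is a single vocabulary edit, effective immediately, with no reprocessing of the collection. The vocabulary, index, and document IDs are invented for the example:

```python
# Toy controlled vocabulary and keyword index (hypothetical data).
vocabulary = {"automobiles": {"car", "auto", "automobile"}}
index = {"automobiles": {"doc-1998", "doc-2023"}}

def add_synonym(preferred, new_alias):
    """Adding a term is one vocabulary edit -- no corpus rerun,
    no vector reset. Retrieval works immediately, and the older
    material is retrievable under the new alias as well."""
    vocabulary[preferred].add(new_alias)

def search(word):
    for preferred, aliases in vocabulary.items():
        if word == preferred or word in aliases:
            return index.get(preferred, set())
    return set()

assert search("motorcar") == set()        # unknown before the edit
add_synonym("automobiles", "motorcar")
print(sorted(search("motorcar")))         # ['doc-1998', 'doc-2023']
```

A Bayesian system, by contrast, cannot use "motorcar" until the editorial collection step and the full retraining run described above are complete.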
Parsing and Natural Language Analysis
“Due to the inherent complexity of language, it is unable to handle ambiguity (e.g. “The dog came into the room; it was furry.” What is the “it” it refers to?). Improving the algorithm requires the construction of a set of rules that are cumbersome and difficult to maintain.” If you were using a keyword system, you would already know it was a “dog”; the “it” would not need to be resolved.
Manual Tagging, Weighting and XML
“Manual tagging schemes are becoming an increasingly popular method of labeling and categorizing digital material. However, it suffers from the following flaws: It is descriptively inconsistent due to its reliance on human contribution. Each person may tag a given document differently (especially when the content deals with multiple themes), and/or people can get lax in their tagging and categorize most content under “general.”
Labor-intensive manual systems are indeed not reliable; editorial drift is a well-known problem. The use of an automatic or assisted indexing system to suggest terms to the indexer, or to apply them automatically, IS consistent: it will suggest the same keyword terms under the same conditions every time. Best practice indicates that the term “general” should never be used. It is lazy indexing and poor controlled vocabulary creation. “Not elsewhere classified” is a dumping ground and creates unfindable items; it should never be used in the indexing or classification of information.
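The consistency claim above is a property of determinism: a rule-based indexer is a pure function of the text, so the same input always yields the same terms. A minimal sketch, with a hypothetical two-rule rule base:

```python
import re

# Hypothetical rule base: text pattern -> controlled vocabulary term.
RULES = [
    (re.compile(r"\bwing design\b|\baerofoil\b", re.I), "Airfoil design"),
    (re.compile(r"\blow drag\b", re.I), "Drag reduction"),
]

def suggest_terms(text):
    """Deterministic rule-based indexing: identical text always yields
    identical suggested terms, so there is no editorial drift."""
    return sorted({term for pattern, term in RULES if pattern.search(text)})

doc = "New wing design achieves low drag at cruise speed."
# Run it twice (or a thousand times): the suggestions never vary.
assert suggest_terms(doc) == suggest_terms(doc) == ["Airfoil design", "Drag reduction"]
```

Production systems such as rule-based indexing engines carry far larger rule bases, but the determinism, and therefore the consistency, is the same.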
“XML is not a set of standard tag definitions; it is a set of definitions that allow for tags to be defined. This poses difficulty when organizations or departments with different practices interoperate.”
True, XML is a great format for transporting data, but it is only that. To share the information, one must also have the DTD or schema that defines what the tags are for. That is why RDF wrappers for XML files are so useful: they wrap the XML in its own definitions. It is like the DOCTYPE declaration you can see when you view the source of a web page. It tells the browser which DTD is in use, and if you don't happen to have it stored locally, it tells you where to go to get it, import it, and activate it.
“Taxonomy creation and tagging involve costly manual labor, requiring input from librarians, users and IT staff.”
Taxonomy creation is a largely automatic process. Either use and augment an existing thesaurus or create one from your data, and keep it up to date using search logs and the folksonomy implementations of various systems to capture the users' actual questions and additions as they think of the content. Taxonomy creation should be done by a combination of taxonomists (wordsmiths from all walks of life make good taxonomists) and subject matter experts (SMEs). This brings together two kinds of experts: those who know how to put together a good taxonomy, and those who know the subject content and how it fits together. The best combination is to use each where they are strongest. Use the taxonomists to create the term relationships – the hierarchical, equivalence, and associative relationships, and the notes about each term. Then use the subject matter experts to assure that the organization matches how people in the specialty field think of the data. Building a taxonomy is less expensive than creating training sets for automated systems; taxonomies are easier to maintain, and they provide higher precision and higher recall. True relevance, measured by what requesters have to look at and how closely it matches what they were looking for, is much higher with taxonomy-based systems.
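The term relationships the taxonomists build can be pictured as a simple record structure. This sketch uses the standard relationship abbreviations (BT, NT, RT, UF, SN); the example term and its links are hypothetical, borrowing the aeronautics vocabulary from the Autonomy posting:

```python
from dataclasses import dataclass, field

@dataclass
class ThesaurusTerm:
    """One thesaurus record with the relationship types described above."""
    term: str
    broader: list = field(default_factory=list)    # BT  (hierarchical)
    narrower: list = field(default_factory=list)   # NT  (hierarchical)
    related: list = field(default_factory=list)    # RT  (associative)
    used_for: list = field(default_factory=list)   # UF  (equivalence)
    scope_note: str = ""                           # SN

wing = ThesaurusTerm(
    term="Wing design",
    broader=["Aircraft design"],
    related=["Drag reduction", "Airfoil efficiency"],  # cross-hierarchy links
    used_for=["Aerofoil design"],
    scope_note="Use for structural and aerodynamic wing layout.",
)
print(wing.related)  # ['Drag reduction', 'Airfoil efficiency']
```

The RT field is the point: it records exactly the kind of cross-subject link (wing design to low drag, aerofoil to efficiency) that the Autonomy posting claims tags cannot express.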
“Tags fail to highlight the relationships between subjects because they lack a conceptual understanding to form correlations. There are often vital relationships between seemingly separately tagged subjects such as wing design/low drag and aerofoil/efficiency, but this concept of “idea distancing” is not leveraged.”
This assumes a taxonomy is solely a hierarchical presentation. Although it is possible that a taxonomic navigational structure will be implemented, a purely hierarchical taxonomy lacks the richness available from a well-constructed full thesaurus built according to the ISO FDIS 25964 or NISO Z39.19 vocabulary standards. The “related terms” option for associative terms provides the same effect as “idea distancing” while ensuring consistent, reliable retrieval and replicable results.
“As the number of tags increases, so too does the likelihood of misclassification and the effort to maintain consistency. This approach is not scalable.”
If one were using a folksonomy and limited the number of postings, this might be true. However, a taxonomy, which can grow with the content, is in fact a much more scalable solution. The consistency of a rules-based application of terms is 100%: the terms are always applied the same way, and there is no editorial drift. In fact, the Autonomy classification system, based on vectors, is not truly scalable. The more data there is, the more likely the system is to get confused on the pointers, and the more likely it is to provide unreliable, inconsistent, non-replicable result sets to the users. The computational time needed to deliver these results grows with the size of the collection, and delivery slows over time.
Marjorie M.K. Hlava
President, Access Innovations