The Apache Software Foundation has released Tika 1.0, an embeddable toolkit for content detection and analysis. The toolkit has been five years in the making.
CMS Wire brought this news to our attention in its article, “Apache Announces Toolkit for Content Detection and Analysis.” Tika is described as a one-stop shop for identifying, retrieving, and parsing text and metadata. That sounds comprehensive, and timely given the amount of data being produced every minute.
Tika has been tested in repositories with more than 500 million documents. Even NASA leverages Tika on several Earth science data system projects to help process hundreds of terabytes of scientific data in a variety of formats.
Melody K. Smith
Sponsored by Access Innovations, the world leader in thesaurus, ontology, and taxonomy creation and metadata application.