March 11, 2011 – A new software system addresses the lack of language processing systems for non-English electronic resources. By pairing data mining with more accurate translation, it supports the computational processing of documents.

We found this interesting piece of information on the University at Buffalo’s website in their post, “Digitizing Urdu: Software Will Improve Analysis of Documents, Social Networks in Pakistan’s National Language.” Computer scientists at the University at Buffalo and at Janya Inc. have developed the first software system that will allow for computational processing of documents in Urdu, Pakistan’s national language and one of the world’s five most-spoken languages. It will also help develop sophisticated ways to do sentiment analysis of social media content.

Many other languages lack the established electronic infrastructures that are taken for granted in English and the European languages. These infrastructures include lexicons, annotated electronic dictionaries, and well-developed ontologies that describe relationships among words and entities in documents. The new software is designed to take data in its raw, unprocessed state and mine it for further processing.

This is interesting work, and we hope it will be extended to other languages as it evolves, making language technology more inclusive.

Melody K. Smith

Sponsored by Data Harmony, a unit of Access Innovations, the world leader in indexing and making content findable.