Natural language processing (NLP) is the ability of a computer or machine to understand human language as it is written, along with its different characteristics. It is typically separated into four stages: understand, classify, retrieve and generate. The Times of India brought this interesting topic to our attention in its article, “Accuracy in AI is a function of availability of quality data … building NLP tools for low-resource Indian languages is hard.”
However, let’s face it. When we talk about NLP, we usually think of it understanding English. That reflects an ethnocentric streak in our society and our thinking. So, does it work beyond English?
NLP tools for languages like English, French and German benefit from the large amount of data available in news articles, web pages, etc. However, creating language models for other languages from Asia and Africa is a big challenge because far less data is available. To learn by example, a model needs to be given lots of sentences. But in Indian languages, for example, the biggest data sets might contain only a few thousand sentences.
Melody K. Smith
Sponsored by Data Harmony, a unit of Access Innovations, the world leader in indexing and making content findable.