The benefits of Natural Language Processing (NLP) tools are visible everywhere in day-to-day life. They power autocorrect, drive virtual assistants like Siri, and form the backbone of every search engine. However, not all languages are created equal when it comes to compatibility with common NLP techniques, limiting many people's access to strong NLP tools in the language they speak.
Why NLP is harder to apply to some languages than to others mainly comes down to two factors: the availability of language resources and the complexity of the language itself.
A lack of available language resources is the first barrier to applying NLP to a language. Modern language models rely on huge amounts of text data, and without that abundance, NLP techniques cannot be applied effectively. This is a problem for languages with few speakers, as well as for languages of countries that lack the resources to collect and process language data, regardless of the number of speakers. Unfortunately, this creates both data scarcity and the underrepresentation in NLP of many indigenous languages and languages of developing countries in South America, Africa, and Asia.
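The effect of data scarcity can be illustrated with a deliberately simple sketch (a toy, not any model from the literature): a character-bigram "model" trained on a tiny corpus has seen far fewer of the patterns that appear in new text than one trained on a larger corpus. The corpora and the `coverage` helper below are hypothetical examples invented for illustration.

```python
from collections import Counter

def bigram_counts(text: str) -> Counter:
    """Count adjacent character pairs (bigrams) in a text."""
    return Counter(zip(text, text[1:]))

def coverage(train: str, test: str) -> float:
    """Toy proxy for generalization: the fraction of the test text's
    bigrams that were ever seen in the training text."""
    seen = set(bigram_counts(train))
    test_bigrams = list(zip(test, test[1:]))
    return sum(b in seen for b in test_bigrams) / len(test_bigrams)

small_corpus = "the cat"
large_corpus = "the cat sat on the mat and the dog ran to the cat"
new_text = "the dog sat"

# More training data -> more of the new text's patterns have been seen.
print(coverage(small_corpus, new_text))  # 0.4
print(coverage(large_corpus, new_text))  # 1.0
```

Real language models are vastly more sophisticated, but the same principle scales up: the less text a language has available, the more of its patterns a model simply never sees during training.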
Even when a language has abundant resources, its complexity can be problematic as well. More specifically, a language's intricate syntax can be one source of hardship. The best example is Arabic. Arabic is one of the most widely spoken languages in the world, yet one of the most difficult to learn as a second language. Even humans struggle to pick up Arabic's elaborate grammatical structure, so it is no surprise that computers struggle too. Arabic's flexible word order creates obstacles for computers, which work best with strict rules and few exceptions. An added layer of complexity comes from Arabic's inventory of basic word roots, which is roughly ten times larger than English's. Developers have devised specialized methods for building Arabic NLP tools, but applications like machine translation are still affected and tend to be less accurate for Arabic than for English.
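One way to see why Arabic's root system complicates NLP: a single triliteral root generates many related surface words, multiplying the vocabulary a system must recognize. The root/word pairs below are real Arabic examples (the root k-t-b, relating to writing), but the `shares_root` check is a deliberately naive toy, nothing like a real morphological analyzer.

```python
root = "كتب"  # the triliteral root k-t-b, relating to writing

# Real Arabic words derived from this one root.
derived_words = {
    "كتاب": "book",
    "كاتب": "writer",
    "مكتب": "office/desk",
    "مكتبة": "library",
}

def shares_root(word: str, root: str) -> bool:
    """Naive check: do the root's consonants appear, in order, in the word?
    A real morphological analyzer is far more complex than this."""
    letters = iter(word)
    return all(ch in letters for ch in root)

for word, gloss in derived_words.items():
    print(word, gloss, shares_root(word, root))   # all True

print(shares_root("قلم", root))  # "pen" has a different root -> False
```

English mostly marks such relationships with prefixes and suffixes around a stable stem, so simple tokenization goes a long way; Arabic's pattern-based word formation is one reason generic English-first pipelines transfer to it poorly.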
Large language models (LLMs), machine learning models built to perform NLP tasks, are typically developed for English, since its syntax is relatively simple and there is a great deal of English data to work with. These models are then adapted to languages like French and Spanish so that speakers of those languages have access to them and the NLP tools they provide. However, thousands of languages are overlooked in NLP research because of scarce language resources, linguistic complexity, or both, leaving the people who speak them largely without NLP tools. This lack of accessible NLP tools makes it harder to search for information on the internet, which can hinder research efforts and disproportionately affects people in developing countries.
It is imperative that NLP researchers look past English when developing applications.