Generative AI Is Not a Death Sentence for Endangered Languages

According to UNESCO, up to half of the world’s languages could be extinct by 2100. Many now argue that generative AI is contributing to that decline.

The decline in language diversity didn’t start with AI, or even the Internet. But AI is in a position to accelerate the demise of Indigenous and low-resource languages.

Most of the world’s 7,000+ languages don’t have sufficient resources to train AI models—and many lack a written form. This means that a few major languages dominate humanity’s stock of potential AI training data, while most stand to be left behind in the AI revolution—and could disappear entirely.

The simple reason is that most available AI training data is in English. English dominates the training corpora behind large language models (LLMs), and people who speak less-common languages are finding themselves underrepresented in AI technology.

Consider these statistics from the World Economic Forum:

  • Two-thirds of all websites are in English.
  • Much of the data that GenAI learns from is scraped from the web.
  • Fewer than 20% of the world’s population speaks English.

As AI becomes more embedded in our daily lives, we should all be thinking about language equity. AI has unprecedented potential to problem-solve at scale, and its promise should not be limited to the English-speaking world. Yet so far, the conveniences and tools that enhance people’s personal and professional lives are accruing mostly to wealthy, developed nations.

Speakers of low-resource languages are accustomed to being underrepresented in technology, from not finding websites in their language to not having their dialect recognized by Siri. Much of the text available to train AI in lower-resourced languages is poor quality (often itself machine-translated, with questionable accuracy) and narrow in scope.

How can society ensure that lower-resourced languages don’t get left out of the AI equation? How can we ensure that language isn’t a barrier to the promise of AI?

In an effort toward language inclusivity, some major tech players have initiatives to train huge multilingual language models (MLMs). Microsoft Translator, for example, has pledged to support “every language, everywhere.” And Meta has a “No Language Left Behind” promise. These are laudable goals, but are they realistic?

Aspiring toward one model that handles every language in the world favors the privileged, because there are far greater volumes of data from the world’s major languages. When we move to lower-resource languages and languages with non-Latin scripts, training AI models becomes more arduous, more time-consuming, and more expensive. Think of it as an unintentional tax on underrepresented languages.
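One way to see that tax concretely is tokenization. Models whose tokenizers were trained mostly on English split text in other scripts into many more tokens per word, and every extra token adds training and inference cost. Below is a minimal sketch using the Hugging Face transformers library and GPT-2’s tokenizer; the model choice and the Amharic sample are illustrative assumptions, not a benchmark.

```python
from transformers import AutoTokenizer

# GPT-2's byte-level BPE vocabulary was learned almost entirely from
# English text, so scripts it rarely saw fragment into many small pieces.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

samples = {
    "English": "Hello, how are you?",
    "Amharic": "ሰላም እንደምን ነህ?",  # rough Amharic greeting (illustrative)
}

for language, text in samples.items():
    n_tokens = len(tokenizer(text)["input_ids"])
    print(f"{language}: {len(text)} characters -> {n_tokens} tokens")
```

Run against an English-centric tokenizer, the non-Latin line typically needs several times more tokens per character, and that is exactly the kind of hidden surcharge underrepresented languages pay.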

Advances in Speech Technology

AI models are largely trained on text, which naturally favors languages with deeper stores of text content. Language diversity would be better supported with systems that don’t depend on text. Human interaction at one time was all speech-based, and many cultures retain that oral focus. To better cater to a global audience, the AI industry must progress from text data to speech data.

Speech technology is making huge strides, but it still lags behind text-based technologies, and direct speech-to-speech translation in particular is far from mature. The reality is that the industry tends to move cautiously, adopting a technology only once it has advanced to a certain level.

TransPerfect’s newly released GlobalLink Live interpretation platform uses the more mature forms of speech technology, automatic speech recognition (ASR) and text-to-speech (TTS) chained around machine translation, precisely because direct speech-to-speech systems are not mature enough at this point. That said, our research teams are preparing for the day when direct speech-to-speech pipelines are ready for prime time.
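To make that cascade concrete, here is a minimal sketch of such a relay pipeline, assuming the Hugging Face transformers library; the checkpoints named below are common open models chosen for illustration, not the components of GlobalLink Live.

```python
from transformers import pipeline

# Cascaded ("relay") speech translation: ASR -> machine translation -> TTS.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
translate = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")
tts = pipeline("text-to-speech", model="suno/bark-small")

def interpret(audio_path: str) -> dict:
    """Spoken English in, synthesized French out, via a text relay."""
    text = asr(audio_path)["text"]                   # speech -> source text
    target = translate(text)[0]["translation_text"]  # source -> target text
    return tts(target)                               # target text -> speech

# result = interpret("meeting_clip.wav")  # hypothetical input file
# result["audio"] holds the waveform, result["sampling_rate"] its rate
```

Each hop in the relay adds latency and a chance for errors to compound, which is why direct speech-to-speech systems are so appealing once they mature.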

Speech-to-speech translation models offer huge promise in the preservation of oral languages. In 2022, Meta announced the first AI-powered speech-to-speech translation system for Hokkien, a primarily oral language spoken by about 46 million people in the Chinese diaspora. It’s part of Meta’s Universal Speech Translator project, which is developing new AI models that it hopes will enable real-time speech-to-speech translation across many languages. Meta opted to open-source its Hokkien translation models, evaluation datasets, and research papers so that others can reproduce and build on its work.
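The Hokkien models themselves were released through Meta’s fairseq research toolkit. As a rough sketch of what direct speech-to-speech looks like in code, here is the same idea using SeamlessM4T, a later model from Meta’s speech translation research, as exposed through Hugging Face transformers; the file path and target language are placeholders.

```python
import torchaudio
from transformers import AutoProcessor, SeamlessM4TModel

# Direct speech-to-speech translation: one model, no intermediate text relay.
processor = AutoProcessor.from_pretrained("facebook/hf-seamless-m4t-medium")
model = SeamlessM4TModel.from_pretrained("facebook/hf-seamless-m4t-medium")

# "clip.wav" is a placeholder; the model expects 16 kHz mono audio
# (resampling is omitted here for brevity).
waveform, sample_rate = torchaudio.load("clip.wav")
inputs = processor(audios=waveform.squeeze().numpy(),
                   sampling_rate=sample_rate, return_tensors="pt")

# A single generate() call maps source speech to target-language speech.
speech = model.generate(**inputs, tgt_lang="fra")[0].cpu().numpy().squeeze()
```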

Learning with Less

The fact that we as a global community lack resources around certain languages is not a death sentence for those languages. This is where multilingual models have a real advantage: the languages learn from each other. All languages follow patterns, and because knowledge transfers between them, the amount of training data each new language needs is reduced.

Suppose you have a model that’s learning 90 languages and you want to add Inuit (a group of Indigenous North American languages). Because of knowledge transfer from the other 90, you need far less Inuit data than you would to train from scratch. We are finding ways to learn with less: the amount of data needed to fine-tune engines keeps shrinking.
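As a rough illustration of that transfer, here is a minimal fine-tuning sketch assuming the Hugging Face transformers and datasets libraries. It starts from NLLB-200, a massively multilingual translation model, and adapts it with a tiny parallel corpus; the language pair (French to Fon, a language we return to below), the placeholder sentences, and the hyperparameters are all assumptions for illustration.

```python
from datasets import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

# A multilingual checkpoint already carries shared cross-lingual structure,
# so a small in-language corpus can go a long way.
checkpoint = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(
    checkpoint, src_lang="fra_Latn", tgt_lang="fon_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Hypothetical tiny parallel corpus; a real project would use
# community-collected sentence pairs.
corpus = Dataset.from_dict({
    "src": ["Bonjour, comment allez-vous ?", "Merci beaucoup."],
    "tgt": ["<Fon translation 1>", "<Fon translation 2>"],  # placeholders
})

def preprocess(batch):
    # Tokenize source and target sides together for seq2seq training.
    return tokenizer(batch["src"], text_target=batch["tgt"],
                     truncation=True, max_length=128)

tokenized = corpus.map(preprocess, batched=True,
                       remove_columns=["src", "tgt"])

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="nllb-fon-finetune",
                                  per_device_train_batch_size=8,
                                  num_train_epochs=3,
                                  learning_rate=1e-5),
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```

The point is not these particular settings but the shape of the workflow: most of the linguistic knowledge is inherited from the multilingual model, and only a thin layer is learned from the new language’s data.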

I’m hopeful about a future with more inclusive AI. I don’t believe we are doomed to watch scores of languages disappear, nor do I think AI will remain the domain of the English-speaking world. Already, we are seeing more awareness around the issue of language equity. From more diverse data collection to building more language-specific models, we are making headway.

Consider Fon, a language spoken by about 4 million people in Benin and neighboring African countries. Not too long ago, a popular AI model described Fon as a fictional language. A computer scientist named Bonaventure Dossou, whose mother speaks Fon, was used to this type of exclusion. Dossou, who speaks French, grew up with no translation program to help him communicate with his mother. Today he can communicate with her thanks to a Fon-French translator that he painstakingly built, and there is now a fledgling Fon Wikipedia as well.

In an effort to use technology to preserve languages, Turkish artist Refik Anadol has kicked off the creation of an open-source AI tool for Indigenous people. At the World Economic Forum, he asked: “How on Earth can we create an AI that doesn’t know the whole of humanity?”

We can’t, and we won’t.