Article by Brooke Tanner and Cameron F. Kerry: “Indigenous languages play a critical role in preserving cultural identity and transmitting unique worldviews, traditions, and knowledge, but at least 40% of the world’s 6,700 languages are currently endangered. The United Nations declared 2022-2032 as the International Decade of Indigenous Languages to draw attention to this threat, in hopes of supporting the revitalization of these languages and preservation of access to linguistic resources.
Building on the advantages of SLMs, several initiatives have successfully adapted these models specifically for Indigenous languages. Such Indigenous language models (ILMs) represent a subset of SLMs that are designed, trained, and fine-tuned with input from the communities they serve.
Case studies and applications
- Meta released No Language Left Behind (NLLB-200), a 54 billion–parameter open-source machine translation model that supports 200 languages as part of Meta’s universal speech translator project. The model includes support for languages with limited translation resources. While the model’s breadth of languages included is novel, NLLB-200 can struggle to capture the intricacies of local context for low-resource languages and often relies on machine-translated sentence pairs across the internet due to the scarcity of digitized monolingual data.
- Lelapa AI’s InkubaLM-0.4B is an SLM with applications for low-resource African languages. Trained on 1.9 billion tokens across languages including isiZulu, Yoruba, Swahili, and isiXhosa, InkubaLM-0.4B (with 400 million parameters) builds on Meta’s LLaMA 2 architecture, providing a smaller model than the original LLaMA 2 pretrained model with 7 billion parameters.
- IBM Research Brazil and the University of São Paulo have collaborated on projects aimed at preserving Brazilian Indigenous languages such as Guarani Mbya and Nheengatu. These initiatives emphasize co-creation with Indigenous communities and address concerns about cultural exposure and language ownership. Initial efforts included electronic dictionaries, word prediction, and basic translation tools. Notably, when a prototype writing assistant for Guarani Mbya raised concerns about exposing their language and culture online, project leaders paused further development pending community consensus.
- Researchers have fine-tuned pre-trained models for Nheengatu using linguistic educational sources and translations of the Bible, with plans to incorporate community-guided spellcheck tools. Since the translations relying on data from the Bible, primarily translated by colonial priests, often sounded archaic and could reflect cultural abuse and violence, they were classified as potentially “toxic” data that would not be used in any deployed system without explicit Indigenous community agreement…(More)”.