Speaking in Tongues — Teaching Local Languages to Machines


Report by DIAL: “…Machines learn to talk to people by digesting digital content in the languages people speak, through a technique called Natural Language Processing (NLP). As things stand, only about 85 of the world’s approximately 7,500 languages are represented in the major NLP systems, and just 7 languages, with English being the most advanced, account for the majority of the world’s digital knowledge corpus. Fortunately, many initiatives are underway to fill this knowledge gap. My new mini-report with Digital Impact Alliance (DIAL) highlights a few of them from Serbia, India, Estonia, and Africa.

The examples in the report are just a subset of initiatives on the ground to make digital services accessible to people in their local languages. They are a cause for excitement and hope (tempered by realistic expectations). A few themes across the initiatives include:

  • Despite the excitement and enthusiasm, most of the programs above are still at a very nascent stage: many may fail, and others will require investment and time to succeed. While countries such as India have initiated formal national NLP programs (which it is too early to assess), others such as Serbia have so far taken a more ad hoc approach.
  • Smaller countries like Estonia recognize the need for state intervention as the local population isn’t large enough to attract private sector investment. Countries will need to balance their local, cultural, and political interests against commercial realities as languages become digital or are digitally excluded.
  • Community engagement is an important component of almost all initiatives. India has set up a formal crowdsourcing program; other programs in Africa are experimenting with elements of participatory design and crowd curation.
  • While critics have accused ChatGPT and others of paying contributors from the global south very poorly for their labeling and other content services, it appears that many initiatives in the south are beginning to dabble in payment models to incentivize crowdsourcing and sustain contributions from the ground.
  • The engagement of local populations can ensure that NLP models learn appropriate cultural nuances, and better embody local social and ethical norms…(More)”.