Article by Sam Peters: “How do you teach somebody to read a language if there’s nothing for them to read? This is the problem facing developers across the African continent who are trying to train AI to understand and respond to prompts in local languages.
To train a language model, you need data. For a language like English, the easily accessible articles, books and manuals on the internet give developers a ready supply. But for most of Africa’s languages — of which there are estimated to be between 1,500 and 3,000 — there are few written resources available. Vukosi Marivate, a professor of computer science at the University of Pretoria in South Africa, uses the number of Wikipedia articles in each language to illustrate how much data is available. For English, there are over 7 million articles. Tigrinya, spoken by around 9 million people in Ethiopia and Eritrea, has 335. For Akan, the most widely spoken native language in Ghana, there are none.
Of those thousands of languages, only 42 are currently supported by a language model. Of Africa’s 23 scripts and alphabets, only three are available: Latin, Arabic and Ge’ez (used in the Horn of Africa). This underdevelopment “comes from a financial standpoint,” says Chinasa T. Okolo, the founder of Technēculturǎ, a research institute working to advance global equity in AI. “Even though there are more Swahili speakers than Finnish speakers, Finland is a better market for companies like Apple and Google.”
If more language models are not developed, the impact across the continent could be dire, Okolo warns. “We’re going to continue to see people locked out of opportunity,” she told CNN. As the continent looks to develop its own AI infrastructure and capabilities, those who do not speak one of these 42 languages risk being left behind…(More)”