UCT Researchers Create AI Model for 11 South African Languages

A New Frontier in AI for South African Languages

A team at the University of Cape Town (UCT) has made a significant breakthrough by developing a new artificial intelligence (AI) language model trained specifically on South Africa’s 11 official written languages. This innovation aims to bridge a critical gap that has left millions underserved by mainstream AI tools, which often lack support for these languages.

The research, set to be presented at the Language Resources and Evaluation Conference (LREC) in Mallorca, Spain, introduces two key contributions: MzansiText, a curated multilingual dataset covering all 11 official languages, and MzansiLM, a language model trained from scratch using this dataset. The project was led by Anri Lombard and Dr Jan Buys from UCT's Department of Computer Science, along with Dr Francois Meyer and a broader team of collaborators.

Addressing the Data Gap

As AI language tools become increasingly integrated into daily life, the disparity in support for different languages becomes more apparent. For speakers of most South African languages, the experience is quite different from that of English speakers. Ask a popular AI assistant a question in isiNdebele or Sepedi, and the response is likely to be poor, inconsistent, or simply incorrect. The researchers attribute this issue to data scarcity.

"In language modelling, languages are considered low resource, primarily because there are much fewer and smaller textual datasets available in these languages for training language models," said Dr Buys, a senior lecturer in the Department of Computer Science. "Our dataset, MzansiText, is still small compared to data available for high-resource languages such as English and major European and Asian languages, but larger than previous datasets for South African languages."

MzansiLM is believed to be the first publicly available decoder-only language model to explicitly target all 11 languages. Nine of South Africa’s 11 official written languages fall into the low-resource category, and some, such as isiNdebele and Sepedi, have been largely overlooked. The model represents a step forward in addressing this imbalance.

From Master’s Research to a Baseline for the Field

For Lombard, a master's student in computer science, the project began with a recurring question in his research. He focused on how different language-model architectures perform for low-resource languages, an area that remains relatively underexplored.

"One thing that stood out to me is that publicly available models tended to cover only a subset of the South African languages we care about. MzansiLM was meant to provide a small decoder-only baseline that future work can compare against and build on," he explained.

The model itself, with 125 million parameters, is modest by today’s standards. However, the team’s tests showed it performing competitively on specific tasks, even outperforming much larger open-source models on benchmarks in several South African languages. On isiXhosa text generation, for instance, it produced results that competed with encoder-decoder models more than 10 times its size.

Not a Chatbot, But a Foundation

It is important to clarify what MzansiLM is and what it is not. Unlike tools such as ChatGPT or Claude, it is not designed for open-ended conversation. Instead, it serves as a base model: a foundation that developers and researchers can adapt for specific purposes through a process known as fine-tuning.

"In practice, that means developers could build tools for specific use cases; for example, summarising information or annotating raw data, in South African languages," said Meyer. "Adapting MzansiLM for a limited use case might be more effective and affordable than relying on proprietary large language models, if you want users to be able to interact with a system in their home language."

The more immediate benefits for everyday users will come from future, larger versions of the model and from systems built on top of this foundation. However, the research also sheds light on a broader question: Why do even powerful commercial AI systems still struggle with languages other than English?

The Role of Open Research

The team emphasises that MzansiLM is a step, not a destination. Closing the gap between South African languages and the capabilities now available in English will require sustained, collective effort.

"A lot of the progress we were able to make depends on earlier open research from the African Natural Language Processing research community, so continuing that openness is essential," said Lombard. "We still need better and broader data sources, stronger benchmarks, and the kind of shared datasets, models, code, and results that make it possible for others to reproduce and extend the work."

Meyer echoed this sentiment. "The research community plays an important role here by working openly, sharing datasets, models, and findings so others can build on them. That kind of openness is often what leads to progress, especially compared to proprietary systems where much of the data and methodology isn't accessible."

Public Availability and Future Prospects

The UCT team has made both MzansiText and MzansiLM publicly available. The paper, "MzansiText and MzansiLM: An Open Corpus and Decoder-Only Language Model for South African Languages," is available on arXiv. The release gives researchers and developers an open starting point for building AI tools tailored to South African languages.