Google Research and African Partners Launch Open Speech Dataset to Advance AI for African Languages

2026-02-14 by Funds Beeline editor

In a major step toward making artificial intelligence (AI) more inclusive across the continent, Google Research Africa and a consortium of African research institutions have unveiled WAXAL, a large-scale open speech dataset designed to expand AI’s ability to understand and generate African languages. The announcement, made on 2 February 2026, aims to narrow a longstanding technological gap that has left most African languages underrepresented in voice-enabled tools and digital services.

The WAXAL dataset, named after the Wolof word for “speak”, provides foundational speech data for 21 Sub-Saharan African languages, opening up new opportunities for developers, researchers and entrepreneurs to build technology that truly speaks to African contexts. These languages include Hausa, Yoruba, Igbo, Swahili, Luganda, Acholi, Kikuyu and many others, together spoken by an estimated 100 million people or more across the region.

Most AI voice technologies, such as speech recognition and text-to-speech tools, have historically focused on widely spoken global languages, leaving the majority of Africa’s more than 2 000 indigenous languages without sufficient high-quality training data. This has limited the relevance and accessibility of modern AI services, including educational platforms, healthcare apps and digital public services, for communities whose first language is not English or another global lingua franca.

The WAXAL initiative was developed over three years with funding and technical support from Google, in collaboration with key African partners. These include Makerere University (Uganda), the University of Ghana and Digital Umuganda (Rwanda), with additional contributions from organisations such as the African Institute for Mathematical Sciences (AIMS).

Unlike many global data projects, the WAXAL effort was locally led: African institutions were responsible for data collection, community engagement and stewardship of the dataset, and they retain ownership of the data they gathered. This approach is intended to shift AI development away from extractive models toward more equitable, community-driven innovation.

The open-access dataset comprises approximately 1 250 hours of transcribed natural speech and more than 20 hours of studio-quality recordings, designed to support automatic speech recognition (ASR) and text-to-speech (TTS) systems respectively. Overall, more than 11 000 hours of speech data from nearly two million individual recordings have been included, representing a broad range of dialects, accents and expressive patterns.

Many of the speech samples were collected by asking participants to describe images or everyday scenes in their native languages, capturing natural patterns of use that are often absent in scripted datasets. This helps ensure that AI models trained on WAXAL can more accurately reflect real-world speech rather than artificial or overly uniform samples.

Aisha Walcott-Bryant, Head of Google Research Africa, described WAXAL as a “critical foundation” for empowering Africans to build technologies in their own languages and on their own terms. She emphasised that the dataset’s release under an open licence, available on platforms such as Hugging Face, will enable wide access for students, startups, developers and researchers working on voice and language technologies across sectors.

“For AI to have a real impact in Africa, it must speak our languages and understand our contexts,” said Joyce Nakatumba-Nabende, Senior Lecturer at Makerere University, highlighting how WAXAL has already supported local research capacity and inspired language technology projects in Uganda and beyond.

The project also has the potential to transform digital experiences in education, agriculture, healthcare and public services by powering voice-enabled tools that interact with users in the languages they use daily, breaking down barriers to access and participation in the digital economy.

The initiative addresses one of the most fundamental obstacles to equitable participation in AI, namely language representation, and lays the groundwork for voice technologies that are relevant, accessible and grounded in African cultural and linguistic diversity.

Photo courtesy / Medium

Article by Jed Mwangi

https://blog.google/intl/en-africa/company-news/outreach-and-initiatives/introducing-waxal-a-new-open-dataset-for-african-speech-technology/
