Although automatic speech recognition (ASR) could be considered a solved problem for high-resource languages like English, ASR performance on accented speech remains significantly worse. The recent emergence of large pretrained ASR models has enabled numerous transfer learning and domain adaptation efforts, in which performant general-purpose ASR models are fine-tuned for specific domains, such as clinical or accented speech. However, African-accented clinical speech recognition remains largely unexplored. We propose a semantically aligned, domain-specific multitask learning framework (generative and discriminative) and demonstrate empirically that semantically aligned multitask learning enhances ASR, outperforming the single-task architecture by 2.5% (relative). We find that the generative multitask design improves generalization to unseen accents, while the discriminative multitask approach improves clinical ASR for both majority and minority accents.
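The discriminative variant of such a framework is typically trained by combining the primary ASR objective with an auxiliary classification loss. As a minimal sketch (not the paper's actual implementation), the weighted combination of an ASR loss with a hypothetical accent-classification head might look like this; the function names, the weighting scheme, and the `alpha` parameter are illustrative assumptions:

```python
import numpy as np

def cross_entropy(logits, label):
    # Softmax cross-entropy for one auxiliary example
    # (e.g. an accent-ID prediction); numerically stabilized.
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

def multitask_loss(asr_loss, aux_logits, aux_label, alpha=0.1):
    # Weighted sum of the primary ASR loss (e.g. CTC or
    # cross-entropy over transcripts) and a hypothetical
    # discriminative auxiliary loss; alpha trades the two off.
    return asr_loss + alpha * cross_entropy(np.asarray(aux_logits), aux_label)

# Example: with alpha = 0 the objective reduces to plain single-task ASR.
single_task = multitask_loss(2.0, [0.0, 0.0, 0.0], 1, alpha=0.0)
joint = multitask_loss(2.0, [0.0, 0.0, 0.0], 1, alpha=0.3)
```

With uniform auxiliary logits the added term is `alpha * log(num_classes)`, so the joint loss degrades gracefully toward the single-task objective as `alpha` shrinks.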
