Factorizing E2E ASR into acoustic and language models

While end-to-end speech recognition systems dominate the leaderboards, it is still valuable to consider the acoustic and language models separately. This separation is present in the network itself: the lower layers handle acoustic information, filtering out noise, while the higher layers encode linguistic patterns, including cross-word dependencies. For several reasons, it remains beneficial to factorize large models into these acoustic and language subcomponents, and into other components as well, such as speaker identity.
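One standard way this factorization surfaces in practice is shallow fusion: at decoding time, the end-to-end model's acoustic score is combined with an external language model's score. The sketch below is a minimal, hypothetical n-best reranker; the function names, weights, and toy scores are my own illustrative assumptions, not any framework's real API.

```python
# Minimal shallow-fusion sketch (hypothetical): rerank n-best hypotheses by
# combining the E2E model's log-probability with an external LM score.
# score(W|X) = log P_am(W|X) + lm_weight * log P_lm(W) + length_bonus * |W|

def fuse_score(am_logprob: float, lm_logprob: float, num_tokens: int,
               lm_weight: float = 0.3, length_bonus: float = 0.5) -> float:
    return am_logprob + lm_weight * lm_logprob + length_bonus * num_tokens

def rerank(hypotheses):
    # hypotheses: list of (text, am_logprob, lm_logprob, num_tokens) tuples
    return max(hypotheses, key=lambda h: fuse_score(h[1], h[2], h[3]))

# Toy n-best list: the two transcripts are acoustically close, but the
# external LM strongly prefers the linguistically plausible one.
hyps = [
    ("recognize speech",   -4.0, -2.0, 2),
    ("wreck a nice beach", -3.8, -9.0, 4),
]
best = rerank(hyps)
print(best[0])  # -> recognize speech
```

The point of the sketch is that the language component enters as a separate, swappable term, which is exactly what a monolithic end-to-end model does not give you for free.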

A clear example is cross-lingual models. While the ideal is to learn everything from the acoustic features, in practice mastering the language component is just as critical: handling proper names, city names, and other language-specific elements. Achieving this requires massive datasets for each language, not just a large English corpus supplemented with small amounts of data for the others.

For instance, when we evaluated Qwen3-ASR-1.7B on Russian, the results were disappointing. The model struggled to recognize Russian words, even simple, common names, despite being trained on millions of hours of data. This illustrates how important it is to adequately address the language component during training.

Another challenge is domain transfer. If we have millions of hours of data from a real estate call center, can we repurpose it for banking assistants? Unfortunately, the answer is no: it is not easy to replace the internal LM without real target-domain data. This means that even with massive datasets, there will always be gaps, especially in niche domains like medical, where specialized systems like Google's MedASR outperform general models.
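One published line of work on exactly this problem is internal-LM estimation and the density-ratio approach: subtract an estimate of the source-domain LM that the end-to-end model has absorbed, and add a target-domain LM trained on text alone. The sketch below shows the scoring idea with toy numbers; the weights, hypothesis scores, and the "mortgage rate" vs. "interest rate" scenario are illustrative assumptions, not measured results.

```python
# Density-ratio / internal-LM subtraction sketch (hypothetical):
# score(W|X) = log P_am(W|X) - ilm_weight * log P_internal(W)
#                            + lm_weight  * log P_target(W)
# P_internal approximates the source-domain LM baked into the E2E model;
# P_target is an external LM trained only on target-domain text.

def domain_adapted_score(am_logprob: float, ilm_logprob: float,
                         target_lm_logprob: float,
                         ilm_weight: float = 0.3,
                         lm_weight: float = 0.5) -> float:
    return (am_logprob
            - ilm_weight * ilm_logprob
            + lm_weight * target_lm_logprob)

# Toy case: the source model (real estate calls) internally prefers
# "mortgage rate", but the banking-domain LM pulls toward "interest rate".
src_biased = domain_adapted_score(am_logprob=-3.0, ilm_logprob=-1.0,
                                  target_lm_logprob=-6.0)
adapted = domain_adapted_score(am_logprob=-3.2, ilm_logprob=-5.0,
                               target_lm_logprob=-1.5)
print(adapted > src_biased)  # -> True
```

Even with this machinery, the subtraction is only an estimate, which is consistent with the observation above that swapping the internal LM without real target-domain data remains hard.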

While the acoustic and language factors are distinct, achieving high accuracy depends on their deep integration. This is why I'm skeptical about audio LLMs: without a robust feedback loop between the acoustic layer and the LLM, these models won't achieve optimal robustness. More importantly, testing should focus on noisy, real-world conditions, not just clean datasets like LibriSpeech or FLEURS.

Currently, most frameworks don't support an explicit separation of the acoustic and language models. Hopefully, we'll see such tools in the future.