Written by Nickolay Shmyrev
Factorizing E2E into acoustic and language models
While end-to-end speech recognition systems are dominating leaderboards,
it’s still valuable to consider the acoustic and language models
separately. This separation is present in the network itself: the lower
layers handle acoustic information, filtering out noise, while the
higher layers encode linguistic patterns, including cross-word
dependencies. For various reasons, it remains beneficial to factorize
large models into these acoustic and language subcomponents, and into
other components as well, such as speaker identity.
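A toy way to see what this factorization means at decoding time is the classic noisy-channel view, where a hypothesis W is scored as P(X|W)·P(W): an acoustic score and a language-model score combined in log space. The sketch below is purely illustrative; the scores, the weight, and the candidate transcripts are all invented, not outputs of any real model:

```python
def combined_score(acoustic_logp: float, lm_logp: float,
                   lm_weight: float = 0.8) -> float:
    """Combine acoustic and language model scores in log space:
    log P(X|W) + lm_weight * log P(W)."""
    return acoustic_logp + lm_weight * lm_logp

# Hypothetical candidate transcripts with made-up log-probabilities.
# The acoustically slightly better hypothesis loses once the LM weighs in.
hypotheses = {
    "recognize speech":    {"am": -12.0, "lm": -3.0},
    "wreck a nice beach":  {"am": -11.5, "lm": -9.0},
}

best = max(hypotheses,
           key=lambda w: combined_score(hypotheses[w]["am"],
                                        hypotheses[w]["lm"]))
print(best)  # the LM term favors the linguistically plausible transcript
```

In an E2E model both terms are entangled inside one network, which is exactly why swapping out the language part later is hard.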
A clear example is cross-lingual models. While the ideal is to learn
everything from the acoustic features, in practice, mastering the
language component is just as critical. This includes handling proper
names, city names, and other language-specific elements. Achieving this
requires massive datasets for each language, not just a large English
corpus with small amounts of data for others.
For instance, when we evaluated Qwen3-ASR-1.7B for Russian, the results
were disappointing. The model struggled to recognize Russian words, even
simple, common names, despite being trained on millions of hours of data.
This illustrates the importance of adequately addressing the language
component in training.
Another challenge is domain transfer. If we have millions of hours of
data from a real estate call center, can we repurpose it for banking
assistants? Unfortunately, the answer is no: it is not easy to replace
the internal language model without real in-domain data. This means that
even with massive datasets, there will always be gaps, especially in
niche domains like medicine, where specialized systems like Google’s
MedASR outperform general models.
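One way to make the internal-LM problem concrete is the density-ratio / internal-LM-subtraction idea: estimate the language prior baked into the E2E model, down-weight it, and add an external LM trained on target-domain text. A minimal sketch, with all scores, weights, and candidates hypothetical:

```python
def domain_adapted_score(e2e_logp: float, internal_lm_logp: float,
                         target_lm_logp: float,
                         ilm_weight: float = 0.3,
                         ext_weight: float = 0.5) -> float:
    """Down-weight the source-domain prior inside the E2E model and
    add an external LM trained on target-domain text."""
    return e2e_logp - ilm_weight * internal_lm_logp + ext_weight * target_lm_logp

# Candidates for a banking assistant, scored by a model trained on
# real-estate calls; tuples are (e2e, internal LM, target LM) log-probs,
# all invented for illustration:
candidates = {
    "transfer to my savings account": (-10.0, -4.0, -2.0),
    "transfer to my sellers agent":   (-9.5,  -2.0, -8.0),
}
best = max(candidates, key=lambda w: domain_adapted_score(*candidates[w]))
print(best)
```

The catch the paragraph above points at: estimating the internal LM well, and tuning those weights, still requires real target-domain data.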
While the acoustic and language factors are distinct, perfect accuracy
depends on their deep integration. This is why I’m skeptical about audio
LLMs: without a robust feedback loop between the acoustic layer and the
LLM, these models won’t achieve optimal robustness. More importantly,
testing should focus on noisy, real-world conditions, not just clean
datasets like LibriSpeech or FLEURS.
Currently, most frameworks don’t support explicit separation of
acoustic and language models. Hopefully, we’ll see such tools in the
future.