In-depth evaluation of ASR engines

While everyone focuses on latency, there are many measurable aspects of ASR that are easy to evaluate and have a significant impact on user experience. Here are some of them:

  1. Hallucination rate on noisy inputs. This is really useful for estimating how robust your recognizer is. We just feed 1000 noisy samples into every engine and measure how many words it produces. Some systems hallucinate heavily, with an insertion rate exceeding 100%; others give just a few extra words. Very easy to measure, very useful in practice.

  2. Recognition of short inputs. Because everyone trains on longer samples, short inputs sometimes get less attention, and as a result systems can have serious trouble recognizing a simple “yes”. This is a huge thing for voicebots, much more important than latency. We specifically feed real-life short data and compare engines; results can differ a lot from system to system.

  3. Ability to identify non-speech sounds such as music and noise. As modern systems become more and more intelligent, users expect them to react to a wide range of conditions. Even embedded systems need to identify their environment properly. Very few engines are able to do that.

  4. The rare words problem. Because end-to-end systems learn distributions as is, mismatches in those distributions become a hidden but serious problem. For example, end-to-end systems learn to recognize frequent words well, but the meaning of an utterance obviously depends a lot on rare words - names, street names, products, medications. End-to-end systems have trouble with them, even the ones trained on millions of hours of data. People use LLMs to estimate WER and talk about BERT WER, but the reality is simple: you drop the 30k most frequent words and look at the substitution rate for the remaining tail. Take the standard Tedlium test and two modern English systems - Cohere and Qwen3-ASR.

|                        | Qwen3-ASR | Cohere |
|------------------------|-----------|--------|
| WER                    | 3.02      | 4.22   |
| Substitution rate      | 1.52      | 1.51   |
| Rare substitution rate | 14.69     | 11.88  |

As you can see, even top systems with different architectures (encoder-decoder and LLM) make mistakes on rare words much more often than on common words. A paper like “Improving accuracy of rare words for RNN-Transducer through unigram shallow fusion” might be a great start, as well as subsequent publications.
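The rare-word measurement described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the script behind the table: the function names (`top_k_words`, `align`, `rare_substitution_rate`) are illustrative, the frequency list is assumed to come from some large text corpus of your choice, and the rate is assumed to be normalized by the number of rare reference words.

```python
from collections import Counter


def top_k_words(corpus_lines, k=30000):
    """Return the set of the k most frequent words in a text corpus."""
    counts = Counter(w for line in corpus_lines for w in line.split())
    return {w for w, _ in counts.most_common(k)}


def align(ref, hyp):
    """Levenshtein-align two word lists; return (op, ref_word, hyp_word) triples."""
    n, m = len(ref), len(hyp)
    # cost[i][j] is the edit distance between ref[:i] and hyp[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Trace back from the bottom-right corner to recover the operations
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            op = "ok" if ref[i - 1] == hyp[j - 1] else "sub"
            ops.append((op, ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append(("del", ref[i - 1], None))
            i -= 1
        else:
            ops.append(("ins", None, hyp[j - 1]))
            j -= 1
    return ops[::-1]


def rare_substitution_rate(pairs, frequent_words):
    """Percentage of rare reference words that were substituted.

    Assumption: "rare" means "not in frequent_words", and the denominator
    is the number of rare words in the references.
    """
    rare_total = rare_subs = 0
    for ref, hyp in pairs:
        for op, ref_word, _ in align(ref.split(), hyp.split()):
            if ref_word is not None and ref_word not in frequent_words:
                rare_total += 1
                rare_subs += op == "sub"
    return 100.0 * rare_subs / rare_total if rare_total else 0.0


# Toy example: two rare reference words, one of them substituted
frequent = {"call", "doctor", "now", "take", "today"}
pairs = [
    ("call doctor zhivago now", "call doctor chicago now"),
    ("take ibuprofen today", "take ibuprofen today"),
]
print(rare_substitution_rate(pairs, frequent))  # prints 50.0
```

In practice the reference and hypothesis texts should go through the same normalization (case, punctuation, numbers) before alignment, otherwise formatting differences get counted as substitutions.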

Recently we worked a lot on Russian ASR, and we can now introduce a new version, Vosk 0.62, focused exactly on the problems above. You might see that the accuracy didn’t improve much, but performance in noise, on rare words, and in music detection improved significantly.

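| Dataset            | Vosk 0.54 | Vosk 0.62 | Vosk Small Streaming 0.54 | Vosk Small Streaming 0.62 | GigaAM3 RNNT | T-one CTC + LM Streaming |
|--------------------|-----------|-----------|---------------------------|---------------------------|--------------|--------------------------|
| Audiobooks         | 1.2       | 1.1       | 4.1                       | 4.0                       | 3.6          | 5.8                      |
| Ru Librispeech     | 9.4       | 8.4       | 14.4                      | 14.1                      | 4.4          | 6.2                      |
| CommonVoice 12.0   | 6.1       | 5.8       | 11.2                      | 10.9                      | 2.6          | 5.5                      |
| Golos Crowd        | 3.1       | 3.1       | 5.5                       | 5.6                       | 2.6          | 5.6                      |
| Golos Farfield     | 6.2       | 6.5       | 10.1                      | 10.6                      | 4.3          | 12.5                     |
| Sova Devices       | 11.6      | 11.8      | 14.7                      | 15.4                      | 10.2         | 10.1                     |
| TV Broadcast       | 16.6      | 16.6      | 19.8                      | 20.9                      | 12.0         | 19.5                     |
| Medical            | 15.7      | 14.3      | 17.9                      | 18.7                      | 9.0          | 17.1                     |
| Short commands     | 4.4       | 4.8       | 7.1                       | 6.4                       | 3.2          | 12.2                     |
| Callcenter orders  | 20.0      | 19.6      | 27.9                      | 29.7                      | 14.6         | 18.5                     |
| Callcenter support | 12.9      | 12.8      | 16.8                      | 17.2                      | 12.6         | 14.8                     |
| Average            | 9.74      | 9.53      | 13.95                     | 13.94                     | 7.19         | 11.62                    |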

Extra features

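| Dataset     | Vosk 0.54 | Vosk 0.62 | Vosk Small Streaming 0.54 | Vosk Small Streaming 0.62 | GigaAM3 RNNT | T-one CTC + LM Streaming |
|-------------|-----------|-----------|---------------------------|---------------------------|--------------|--------------------------|
| Noises      | 99.03     | 29.59     | 48.38                     | 40.04                     | 79.90        | 14.10                    |
| Music/Noise | -         | +         | -                         | +                         | -            | -                        |
| Rare words  | 28.39     | 25.7      | 35.44                     | 34.82                     | 24.44        | 38.91                    |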
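For the noise test from item 1 of the list, the counting itself is trivial. Below is a minimal sketch, assuming the clips contain no speech at all (so every emitted word counts as a hallucination) and that each engine is wrapped in a hypothetical `transcribe(path) -> str` callable. The exact formula behind the percentages reported in this post is not spelled out here, so treat the statistics below as one possible way to summarize the result rather than a reproduction of those numbers.

```python
from typing import Callable, Iterable


def hallucination_stats(clips: Iterable[str], transcribe: Callable[[str], str]) -> dict:
    """Count the words an engine emits on clips assumed to contain no speech."""
    n_clips = n_words = n_nonempty = 0
    for path in clips:
        words = transcribe(path).split()
        n_clips += 1
        n_words += len(words)
        n_nonempty += bool(words)  # clip produced at least one word
    return {
        "clips": n_clips,
        "words": n_words,
        "words_per_clip": n_words / n_clips if n_clips else 0.0,
        "nonempty_share": 100.0 * n_nonempty / n_clips if n_clips else 0.0,
    }


# Stub engine for illustration: a dict lookup stands in for a real recognizer
fake_outputs = {"fan.wav": "", "street.wav": "thank you", "rain.wav": ""}
stats = hallucination_stats(fake_outputs, fake_outputs.get)
print(stats["words"], stats["clips"])  # prints 2 3
```

Running the same clip list through every engine with its own `transcribe` wrapper gives directly comparable numbers.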

Although our models do not have the top accuracy, we still see them improve user experience significantly.

Latency and WER are, of course, still important, and we need to incorporate latency more systematically into our standard tests. Hopefully, we will also develop additional approaches for handling rare words, although the problem requires much deeper architectural changes and brings us back to the LM integration issues discussed in the previous post. More on that later.