Written by Nickolay Shmyrev
In-depth evaluation of ASR engines
While everyone focuses on latency, there are many measurable aspects of
ASR that are easy to evaluate and have a significant impact on user
experience. Here are some of them:
- Hallucination rates on noisy inputs. This is really useful for
estimating how robust your recognizer is. We just feed 1000 noisy samples
into every engine and measure how many words it produces. Some systems
hallucinate significantly, with insertion rates exceeding 100%, while
others produce just a few extra words. Very easy to measure, very useful
in practice.
- Recognition of short inputs. While everyone trains on longer samples,
short inputs sometimes get less attention, and as a result systems have
serious trouble recognizing a simple “yes”. This is a huge thing for
voicebots, much more important than latency. We specifically feed
real-life short data and compare engines; results can differ a lot from
system to system.
- Ability to identify non-speech sounds such as music and noise. As
modern systems become more and more intelligent, users expect them
to react to a wide range of conditions. Even embedded systems need to
identify the surrounding environment properly. Very few engines are able
to do that.
- Rare words problem. Because end-to-end systems learn distributions
as is, mismatches in those distributions become a hidden but serious
problem. For example, end-to-end systems learn to recognize frequent
words well, but the meaning of an utterance obviously depends a lot on
rare words: names, street names, products, medications. End-to-end
systems have trouble with them, even the ones trained on millions of
hours of data.
People use LLMs to estimate WER and talk about BERT WER, but the reality
is simple. You drop the 30k most frequent words and consider the
substitution rate for the remaining tail. Take the standard Tedlium test
and two modern English systems, Cohere and Qwen3-ASR.
| | Qwen3-ASR | Cohere |
|---|---|---|
| WER | 3.02 | 4.22 |
| Substitution rate | 1.52 | 1.51 |
| Rare substitution rate | 14.69 | 11.88 |
You can see that even top systems with different architectures
(encoder-decoder and LLM) make mistakes on rare words much more often than
on common words. A paper like "Improving accuracy of rare words for
RNN-Transducer through unigram shallow fusion" might be a great start, as
well as subsequent publications.
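The rare-word measurement described above can be sketched roughly as follows. This is a minimal illustration, not the scoring script behind the table: the helper names `align` and `rare_substitution_rate`, the whitespace tokenization, and the choice of rare reference words as the denominator are all assumptions, and `frequent_words` stands in for whatever 30k-word frequency list you use.

```python
def align(ref, hyp):
    """Levenshtein-align two word lists; returns (op, ref_word, hyp_word) tuples."""
    n, m = len(ref), len(hyp)
    # cost[i][j] = edit distance between ref[:i] and hyp[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Backtrace from the bottom-right corner to recover the operations.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])):
            op = "ok" if ref[i - 1] == hyp[j - 1] else "sub"
            ops.append((op, ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append(("del", ref[i - 1], None))
            i -= 1
        else:
            ops.append(("ins", None, hyp[j - 1]))
            j -= 1
    return ops[::-1]


def rare_substitution_rate(pairs, frequent_words):
    """Percentage of rare reference words that were substituted.

    pairs: iterable of (reference, hypothesis) strings.
    frequent_words: set of words to exclude (assumed: the top-30k list).
    """
    rare_total = 0
    rare_subs = 0
    for ref, hyp in pairs:
        for op, ref_word, _ in align(ref.split(), hyp.split()):
            # Insertions have no reference word; frequent words are dropped.
            if ref_word is None or ref_word in frequent_words:
                continue
            rare_total += 1
            rare_subs += op == "sub"
    return 100.0 * rare_subs / rare_total if rare_total else 0.0


if __name__ == "__main__":
    # Toy data for illustration only, not taken from any real test set.
    toy_pairs = [
        ("call doctor ibuprofen now", "call doctor ibuprofin now"),
        ("take the amoxicillin", "take the amoxicillin"),
    ]
    toy_frequent = {"call", "doctor", "now", "take", "the"}
    print(rare_substitution_rate(toy_pairs, toy_frequent))
```

Counting only substitutions keeps the number comparable with the overall substitution rate; deletions of rare words could be tracked the same way if they matter for your application.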
Recently we worked a lot on Russian ASR, and we can now introduce a new
version, Vosk 0.62, focused exactly on the problems above. You might see
that the accuracy didn’t improve much, but performance in noise, on rare
words and in music detection improved significantly.
| Dataset | Vosk 0.54 | Vosk 0.62 | Vosk Small Streaming 0.54 | Vosk Small Streaming 0.62 | GigaAM3 RNNT | T-one CTC + LM Streaming |
|---|---|---|---|---|---|---|
| Audiobooks | 1.2 | 1.1 | 4.1 | 4.0 | 3.6 | 5.8 |
| Ru Librispeech | 9.4 | 8.4 | 14.4 | 14.1 | 4.4 | 6.2 |
| CommonVoice 12.0 | 6.1 | 5.8 | 11.2 | 10.9 | 2.6 | 5.5 |
| Golos Crowd | 3.1 | 3.1 | 5.5 | 5.6 | 2.6 | 5.6 |
| Golos Farfield | 6.2 | 6.5 | 10.1 | 10.6 | 4.3 | 12.5 |
| Sova Devices | 11.6 | 11.8 | 14.7 | 15.4 | 10.2 | 10.1 |
| TV Broadcast | 16.6 | 16.6 | 19.8 | 20.9 | 12.0 | 19.5 |
| Medical | 15.7 | 14.3 | 17.9 | 18.7 | 9.0 | 17.1 |
| Short commands | 4.4 | 4.8 | 7.1 | 6.4 | 3.2 | 12.2 |
| Callcenter orders | 20.0 | 19.6 | 27.9 | 29.7 | 14.6 | 18.5 |
| Callcenter support | 12.9 | 12.8 | 16.8 | 17.2 | 12.6 | 14.8 |
| Average | 9.74 | 9.53 | 13.95 | 13.94 | 7.19 | 11.62 |
Extra features
| Dataset | Vosk 0.54 | Vosk 0.62 | Vosk Small Streaming 0.54 | Vosk Small Streaming 0.62 | GigaAM3 RNNT | T-one CTC + LM Streaming |
|---|---|---|---|---|---|---|
| Noises | 99.03 | 29.59 | 48.38 | 40.04 | 79.90 | 14.10 |
| Music/Noise | - | + | - | + | - | - |
| Rare words | 28.39 | 25.7 | 35.44 | 34.82 | 24.44 | 38.91 |
Despite not having the top accuracy, we still see our models improve user
experience significantly.
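For readers who want to run a similar noise test themselves, here is a rough sketch of counting hallucinated words on noise-only inputs. It assumes a hypothetical `recognize` callable that maps one sample to a transcript string; the function name `noise_hallucination_stats` and the reported statistics are illustrative choices, and the exact normalization behind the percentages in the table is not spelled out in this post, so the numbers are not directly comparable.

```python
def noise_hallucination_stats(recognize, noise_samples):
    """Count words an engine emits on samples that contain no speech.

    recognize: hypothetical callable, sample -> transcript string.
    noise_samples: list of samples (paths, arrays, whatever recognize accepts).
    """
    # Every word produced on a noise-only sample counts as a hallucination.
    word_counts = [len(recognize(sample).split()) for sample in noise_samples]
    n = len(word_counts)
    total = sum(word_counts)
    non_empty = sum(count > 0 for count in word_counts)
    return {
        "samples": n,
        "total_words": total,
        "words_per_sample": total / n if n else 0.0,
        "samples_with_output_pct": 100.0 * non_empty / n if n else 0.0,
    }


if __name__ == "__main__":
    # Stand-in recognizer backed by a dict, for illustration only.
    fake_outputs = {"a.wav": "", "b.wav": "thank you", "c.wav": "", "d.wav": "yes"}
    print(noise_hallucination_stats(fake_outputs.get, list(fake_outputs)))
```

Passing the engine in as a plain callable keeps the same loop reusable across engines, which is the point of the comparison.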
Latency and WER are, of course, still important, and we need to
incorporate latency more systematically into our standard tests.
Hopefully, we will also develop additional approaches for handling rare
words, although the problem requires much deeper architectural changes
and brings us back to the LM integration issues discussed in the previous
post. More on that later.