OpenAI Whisper Accuracy and other recent models (Nemo Transducer XLarge, Gigaspeech)

Everyone is crazy about OpenAI Whisper. Trained on 680 thousand hours of multilingual data, it indeed sets a new stage in speech recognition.

We tested it, together with some other recent models, on the same data we previously used to test Nemo Conformer and other models. The results are in the table below.
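
For context, this is roughly what a decoding pass looks like with the openai-whisper Python package. The model size and file names below are placeholders, but `load_model` and `transcribe` are the package's actual API:

```python
import whisper

# "tiny.en" is one of the English-only checkpoints from the table below
model = whisper.load_model("tiny.en")

# Placeholder file names; in practice we loop over each test set
for wav in ["utt1.wav", "utt2.wav"]:
    result = model.transcribe(wav)
    print(wav, result["text"])
```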

Some afterthoughts:

  • The models are impressive
  • More surprising is that the English tiny models are pretty good; the degradation from pruning is not that significant
  • Other models are not bad either: the Gigaspeech models from Wenet and K2, or the Nemo models, give about the same accuracy if you are looking for the best results
  • The Wenet Gigaspeech model (with LM rescoring) in particular is not that bad
  • RTF (decoding speed) varies, with K2 being the fastest
  • The Whisper LID model is not very good; on non-native speech it often confuses languages (see the language detection sketch after this list)
  • Whisper's postprocessing is nice, but its decisions are often arbitrary as well
  • Whisper produces funny hallucinations on short commands
  • Not all properties of the Whisper models are fully understood yet, first of all robustness to noise (we heard that echo hurts a lot)
  • We have yet to test the improvement on longer files compared to the utterance-based processing of other systems (this is one of Whisper's strongest points)
  • We have yet to test accuracy on other important languages systematically
  • Vosk accuracy is outdated; we need to improve it ASAP
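
To illustrate the LID point above: the openai-whisper package exposes language detection directly, so you can inspect the per-language probabilities it assigns. The audio file name here is a placeholder; `load_audio`, `pad_or_trim`, `log_mel_spectrogram` and `detect_language` are the package's real API:

```python
import whisper

# Any multilingual checkpoint works; "base" is just an example size
model = whisper.load_model("base")

# Standard openai-whisper preprocessing: 30 seconds of log-Mel features
audio = whisper.load_audio("non_native_speech.wav")  # placeholder file name
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# detect_language() returns per-language probabilities; on accented speech
# the top candidate is often wrong, which then derails transcription
_, probs = model.detect_language(mel)
print(sorted(probs.items(), key=lambda kv: -kv[1])[:5])
```

When the language is known in advance, forcing it with `model.transcribe(wav, language="en")` sidesteps the LID problem entirely.
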
Word error rates (WER, %), lower is better:

| Dataset | Vosk Small | Vosk Big | K2 Gigaspeech RNNT | Wenet Gigaspeech + LM | Nvidia Transducer XLarge | Whisper Tiny En | Whisper Small En | Whisper Large En |
|---|---|---|---|---|---|---|---|---|
| Librispeech test-clean | 10.1 | 5.7 | 3.1 | 3.4 | 1.6 | 7.0 | 4.4 | 4.0 |
| Tedlium test | 10.6 | 6.0 | 4.2 | 3.9 | 5.2 | 8.6 | 6.7 | 6.5 |
| Google commands | 26.4 | 19.7 | 27.7 | 16.1 | 19.8 | 33.9 | 22.4 | 33.9 |
| Non-native speech | 53.0 | 41.7 | 29.1 | 27.2 | 19.3 | 41.7 | 25.6 | 18.7 |
| Children speech | 18.2 | 9.5 | 5.5 | 5.1 | 4.1 | 10.4 | 5.4 | 4.5 |
| Podcasts | 23.4 | 15.8 | 11.5 | 11.4 | 13.8 | 17.0 | 13.9 | 13.5 |
| Callcenter bot | 31.3 | 18.0 | 11.5 | 11.0 | 11.4 | 22.0 | 13.1 | 10.1 |
| Callcenter 1 | 56.6 | 45.6 | 35.2 | 32.3 | 31.9 | 45.9 | 31.1 | 31.8 |
| Callcenter 2 | 49.4 | 29.7 | 25.2 | 22.6 | 28.8 | 33.4 | 28.4 | 26.0 |
| Callcenter 3 | 63.6 | 33.0 | 40.0 | 27.4 | 29.0 | 40.7 | 33.1 | 31.4 |
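
Since Whisper emits cased and punctuated text while the other systems produce plain words, scoring requires normalization, and as noted above Whisper's formatting decisions can be arbitrary, which itself shifts WER slightly. Here is a minimal sketch of how such scoring works; the `normalize` helper is our own illustration, not from any particular toolkit:

```python
import re

def normalize(text):
    # Drop casing and punctuation (keeping apostrophes) so Whisper's
    # formatted output is comparable with plain references
    return re.sub(r"[^\w\s']", "", text.lower()).split()

def wer(reference, hypothesis):
    # Classic word-level Levenshtein distance: substitutions, insertions
    # and deletions, divided by the number of reference words
    ref, hyp = normalize(reference), normalize(hypothesis)
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)

print(wer("turn on the light", "Turn on the lights!"))  # 25.0
```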