OpenAI Whisper Accuracy and other recent models (Nemo Transducer XLarge, Gigaspeech)

Everyone is crazy about OpenAI Whisper. Trained on 680 thousand hours of multilingual data, it indeed sets a new stage in speech recognition.

We tested it, together with some other recent models, on the same data we previously used to test Nemo Conformer and other models. The results are in the table below.
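
For context, this is roughly what a decoding pass looks like with the openai-whisper Python package. The model size and file names below are placeholders, but `load_model` and `transcribe` are the package's actual API:

```python
import whisper

# "tiny.en" is one of the English-only checkpoints from the table below
model = whisper.load_model("tiny.en")

# Placeholder file names; in practice we loop over each test set
for wav in ["utt1.wav", "utt2.wav"]:
    result = model.transcribe(wav)
    print(wav, result["text"])
```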

Some afterthoughts:

  • The models are impressive
  • More surprising is that the English tiny models are pretty good; the degradation from pruning is not that significant
  • Other models are not bad either: the Gigaspeech models from Wenet and K2, or the Nemo models, give about the same accuracy if you are looking for the best results
  • The Wenet Gigaspeech model (with LM rescoring) in particular is not that bad
  • RTF (decoding speed) varies, with K2 being the fastest
  • The Whisper LID model is not very good; on non-native speech it often confuses languages (see the language detection sketch after this list)
  • Whisper's postprocessing is nice, but its decisions are often arbitrary as well
  • Whisper produces funny hallucinations on short commands
  • Not all properties of the Whisper models are fully understood yet, first of all robustness to noise (we heard that echo hurts a lot)
  • We have yet to test the improvement on longer files compared to the utterance-based processing of other systems (this is one of Whisper's strongest points)
  • We have yet to test accuracy on other important languages systematically
  • Vosk accuracy is outdated; we need to improve it ASAP
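
To illustrate the LID point above: the openai-whisper package exposes language detection directly, so you can inspect the per-language probabilities it assigns. The audio file name here is a placeholder; `load_audio`, `pad_or_trim`, `log_mel_spectrogram` and `detect_language` are the package's real API:

```python
import whisper

# Any multilingual checkpoint works; "base" is just an example size
model = whisper.load_model("base")

# Standard openai-whisper preprocessing: 30 seconds of log-Mel features
audio = whisper.load_audio("non_native_speech.wav")  # placeholder file name
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# detect_language() returns per-language probabilities; on accented speech
# the top candidate is often wrong, which then derails transcription
_, probs = model.detect_language(mel)
print(sorted(probs.items(), key=lambda kv: -kv[1])[:5])
```

When the language is known in advance, forcing it with `model.transcribe(wav, language="en")` sidesteps the LID problem entirely.
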
Word error rates (WER, %), lower is better:

| Dataset | Vosk Small | Vosk Big | K2 Gigaspeech RNNT | Wenet Gigaspeech + LM | Nvidia Transducer XLarge | Whisper Tiny En | Whisper Small En | Whisper Large En |
|---|---|---|---|---|---|---|---|---|
| Librispeech test-clean | 10.1 | 5.7 | 3.1 | 3.4 | 1.6 | 7.0 | 4.4 | 4.0 |
| Tedlium test | 10.6 | 6.0 | 4.2 | 3.9 | 5.2 | 8.6 | 6.7 | 6.5 |
| Google commands | 26.4 | 19.7 | 27.7 | 16.1 | 19.8 | 33.9 | 22.4 | 33.9 |
| Non-native speech | 53.0 | 41.7 | 29.1 | 27.2 | 19.3 | 41.7 | 25.6 | 18.7 |
| Children speech | 18.2 | 9.5 | 5.5 | 5.1 | 4.1 | 10.4 | 5.4 | 4.5 |
| Podcasts | 23.4 | 15.8 | 11.5 | 11.4 | 13.8 | 17.0 | 13.9 | 13.5 |
| Callcenter bot | 31.3 | 18.0 | 11.5 | 11.0 | 11.4 | 22.0 | 13.1 | 10.1 |
| Callcenter 1 | 56.6 | 45.6 | 35.2 | 32.3 | 31.9 | 45.9 | 31.1 | 31.8 |
| Callcenter 2 | 49.4 | 29.7 | 25.2 | 22.6 | 28.8 | 33.4 | 28.4 | 26.0 |
| Callcenter 3 | 63.6 | 33.0 | 40.0 | 27.4 | 29.0 | 40.7 | 33.1 | 31.4 |
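
Since Whisper emits cased and punctuated text while the other systems produce plain words, scoring requires normalization, and as noted above Whisper's formatting decisions can be arbitrary, which itself shifts WER slightly. Here is a minimal sketch of how such scoring works; the `normalize` helper is our own illustration, not from any particular toolkit:

```python
import re

def normalize(text):
    # Drop casing and punctuation (keeping apostrophes) so Whisper's
    # formatted output is comparable with plain references
    return re.sub(r"[^\w\s']", "", text.lower()).split()

def wer(reference, hypothesis):
    # Classic word-level Levenshtein distance: substitutions, insertions
    # and deletions, divided by the number of reference words
    ref, hyp = normalize(reference), normalize(hypothesis)
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)

print(wer("turn on the light", "Turn on the lights!"))  # 25.0
```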