Kaldi Gigaspeech Vosk Model Release

Recently the Kaldi project released a pack of models trained on Gigaspeech; you can find them here. The models are good: not significantly better than our previous model, but not significantly worse either. We expect them to work better on YouTube-like inputs and podcasts. Some notes:

  • Unlike model 0.22, this new model is more stable and doesn’t output ‘the’ for long silence regions.
  • The model has a TDNN+LSTM architecture and is a bit slower.
  • We packaged it with a big graph, so the archive is larger (2.3 GB) and decoding requires 16 GB of RAM. This should not be a problem for modern servers.

You can download models packaged for Vosk here.

Here are the accuracy results (WER, %):

| Dataset                | Vosk 0.42 Gigaspeech | Vosk 0.22 | K2 Gigaspeech RNNT | Wenet Gigaspeech + LM | Nvidia Transducer Xlarge | Whisper Large En |
|------------------------|----------------------|-----------|--------------------|-----------------------|--------------------------|------------------|
| Librispeech test-clean | 5.64                 | 5.7       | 3.1                | 3.4                   | 1.6                      | 4.0              |
| Tedlium test           | 6.24                 | 6.0       | 4.2                | 3.9                   | 5.2                      | 6.5              |
| Google commands        | 25.71                | 19.7      | 27.7               | 16.1                  | 19.8                     | 33.9             |
| Non-native speech      | 40.53                | 41.7      | 29.1               | 27.2                  | 19.3                     | 18.7             |
| Children speech        | 10.62                | 9.5       | 5.5                | 5.1                   | 4.1                      | 4.5              |
| Podcasts               | 14.79                | 15.8      | 11.5               | 11.4                  | 13.8                     | 13.5             |
| Callcenter bot         | 17.91                | 18.0      | 11.5               | 11.0                  | 11.4                     | 10.1             |
| Callcenter 1           | 48.49                | 45.6      | 35.2               | 32.3                  | 31.9                     | 31.8             |
| Callcenter 2           | 30.17                | 29.7      | 25.2               | 22.6                  | 28.8                     | 26.0             |
| Callcenter 3           | 36.56                | 33.0      | 40.0               | 27.4                  | 29.0                     | 31.4             |