Kaldi Gigaspeech Vosk Model Release

Written by Nickolay Shmyrev
Recently the Kaldi project released a pack of models trained on Gigaspeech. You can find them here.
The models are good: not significantly better than our previous model, but not significantly worse either. We expect them to
work better on YouTube-like inputs and podcasts. Some notes:
- Unlike model 0.22, this new model is more stable and doesn’t output ‘the’ for long silence regions.
- The model has a TDNN+LSTM architecture and is a bit slow.
- We packaged it with a big graph, so the archive is larger (2.3 GB) and decoding requires 16 GB of RAM. This should not be a problem
for modern servers.
You can download models packaged for Vosk here.
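Once the model is downloaded and unpacked, using it from Python follows the standard Vosk workflow. A minimal sketch, assuming the archive unpacks to a directory named `model` next to the script and that `audio.wav` is a mono PCM WAV file (the directory and file names here are placeholders, not part of the release):

```python
import json
import wave

from vosk import Model, KaldiRecognizer

# Load the unpacked model directory (placeholder path).
model = Model("model")

wf = wave.open("audio.wav", "rb")
rec = KaldiRecognizer(model, wf.getframerate())

# Feed audio in chunks; AcceptWaveform returns True at utterance boundaries.
while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    rec.AcceptWaveform(data)

# FinalResult returns a JSON string with the recognized text.
print(json.loads(rec.FinalResult())["text"])
```

Decoding with the big graph is where the 16 GB memory requirement comes in, so run this on a machine with enough RAM.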
Here are the accuracy results (WER, %):
| Dataset | Vosk 0.42 Gigaspeech | Vosk 0.22 | K2 Gigaspeech RNNT | Wenet Gigaspeech + LM | Nvidia Transducer Xlarge | Whisper Large En |
|---|---|---|---|---|---|---|
| Librispeech test-clean | 5.64 | 5.7 | 3.1 | 3.4 | 1.6 | 4.0 |
| Tedlium test | 6.24 | 6.0 | 4.2 | 3.9 | 5.2 | 6.5 |
| Google commands | 25.71 | 19.7 | 27.7 | 16.1 | 19.8 | 33.9 |
| Non-native speech | 40.53 | 41.7 | 29.1 | 27.2 | 19.3 | 18.7 |
| Children speech | 10.62 | 9.5 | 5.5 | 5.1 | 4.1 | 4.5 |
| Podcasts | 14.79 | 15.8 | 11.5 | 11.4 | 13.8 | 13.5 |
| Callcenter bot | 17.91 | 18.0 | 11.5 | 11.0 | 11.4 | 10.1 |
| Callcenter 1 | 48.49 | 45.6 | 35.2 | 32.3 | 31.9 | 31.8 |
| Callcenter 2 | 30.17 | 29.7 | 25.2 | 22.6 | 28.8 | 26.0 |
| Callcenter 3 | 36.56 | 33.0 | 40.0 | 27.4 | 29.0 | 31.4 |
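For reference, the numbers above are word error rates: the word-level edit distance between the reference transcript and the hypothesis, divided by the reference length. A minimal self-contained sketch of the computation (a plain Levenshtein dynamic program, not the exact scoring tool used for these benchmarks):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + sub,  # substitution
            )
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, `wer("the cat sat on the mat", "the cat sat on mat")` gives one deletion over six reference words. Production scoring additionally normalizes text (case, punctuation, number formats), which is why raw WER scripts can disagree slightly across toolkits.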