# Kaldi models testing

Written by Nickolay Shmyrev
Many models and datasets have become available recently, so testing models against datasets has become more complicated and, at the same time, more fun.
Recently the Kaldi Active Grammar project released some new models and did some testing of our Vosk models, so I had to verify everything, since, as usual, the numbers we see and the numbers they report do not match.
So I tested the following models:
There is also a lookahead model from kaldi-active-grammar which can quickly rebuild the graph:
The kaldi-active-grammar model graph was recompiled to use the same language model as en-us-aspire (our big en-us language model), so that a more direct comparison is possible.
For testing I used the following datasets:
- Librispeech test-clean (2620 utts)
- Tedlium test set (1155 utts)
- Google commands set (11005 short commands)
- Fisher eval2000 subset (436 utts)
The Vosk library was used for testing; see the testing script
here
in our repo. A minimal sketch of such a decoding and scoring loop is shown after the results table. Here are the results I got (WER, %):
| Model             | Librispeech | Tedlium | Commands | Fisher |
|-------------------|-------------|---------|----------|--------|
| en-us-aspire      | 13.49       | 12.53   | 55.62    | 17.39  |
| en-us-daanzu      | 8.36        | 8.68    | 9.30     | 31.37  |
| en-us-small       | 15.34       | 12.09   | 45.52    | N/A    |
| en-us-librispeech | 4.37        | N/A     | N/A      | N/A    |
| deepspeech        | 6.12        | 18.03   | N/A      | N/A    |
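For reference, here is a minimal sketch of the kind of decoding and scoring loop the testing script implements. This is an illustration, not the actual script: the model directory, file layout, reference-file format, and the use of the jiwer package for WER are all assumptions made for the example.

```python
# Hypothetical evaluation loop built on the Vosk API (not the repo's script).
# Assumes 16 kHz mono PCM WAV files named <utt-id>.wav and a reference file
# with lines of the form "<utt-id> <transcript>". Paths are placeholders.
import json
import wave

from vosk import Model, KaldiRecognizer
from jiwer import wer  # third-party WER scorer, used here for brevity


def transcribe(model, wav_path):
    """Decode one WAV file with Vosk and return the recognized text."""
    wf = wave.open(wav_path, "rb")
    rec = KaldiRecognizer(model, wf.getframerate())
    chunks = []
    while True:
        data = wf.readframes(4000)
        if len(data) == 0:
            break
        if rec.AcceptWaveform(data):
            chunks.append(json.loads(rec.Result()).get("text", ""))
    chunks.append(json.loads(rec.FinalResult()).get("text", ""))
    return " ".join(c for c in chunks if c)


def evaluate(model_dir, reference_path, wav_dir):
    """Compute corpus WER for one model over one test set."""
    model = Model(model_dir)
    refs, hyps = [], []
    with open(reference_path) as f:
        for line in f:
            utt_id, text = line.strip().split(maxsplit=1)
            refs.append(text.lower())
            hyps.append(transcribe(model, f"{wav_dir}/{utt_id}.wav"))
    return wer(refs, hyps)


if __name__ == "__main__":
    # Placeholder paths: an unpacked model directory and a prepared test set.
    print("WER:", evaluate("model-en-us-aspire", "test/ref.txt", "test/wav"))
```

Running a loop like this per model and per test set reproduces the kind of table above, modulo text normalization, which matters a lot for the absolute numbers.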
Some thoughts on the results:
- Librispeech is a very artificial test set.
- There is still no good universal model; domain-specific models work much better than generic ones. The one
tuned for read speech (Librispeech) is much better than more generic ones tuned for a broader context.
- Short commands are an interesting domain we need to pay more attention to.
- If you need a good wideband model, take the en-us-daanzu model; it is really good (though your custom model will likely be better).
- It would be nice to automate all of this and run these tests on a systematic basis.
- As popularized by our colleague Snakers4 in his criticism, proper testing
requires deeper analysis of factors and conditions. A comparison on just a simple clean test set doesn't really prove anything.