# Kaldi models testing

Written by Nickolay Shmyrev
Many models and datasets have become available recently, so testing models against datasets has become more complicated and, at the same time, more fun.
Recently the Kaldi Active Grammar project released some new models and did some testing of our Vosk models, so I had to verify everything, since, as usual, the numbers we see and the numbers they report do not match.
So I tested the following models:
There is also a lookahead model from kaldi-active-grammar which can quickly rebuild the graph:
The kaldi-active-grammar model graph was recompiled to use the same language model as en-us-aspire (our big en-us language model), so that a more direct comparison is possible.
For testing I used the following datasets:
- Librispeech test-clean (2620 utts)
- Tedlium test set (1155 utts)
- Google commands set (11005 short commands)
- Fisher eval2000 subset (436 utts)
The Vosk library was used for testing; see the testing script
here
in our repo. A minimal sketch of such a decoding and scoring loop is shown after the results table. Here are the results I got (WER, %):
| Model             | Librispeech | Tedlium | Commands | Fisher |
|-------------------|-------------|---------|----------|--------|
| en-us-aspire      | 13.49       | 12.53   | 55.62    | 17.39  |
| en-us-daanzu      | 8.36        | 8.68    | 9.30     | 31.37  |
| en-us-small       | 15.34       | 12.09   | 45.52    | N/A    |
| en-us-librispeech | 4.37        | N/A     | N/A      | N/A    |
| deepspeech        | 6.12        | 18.03   | N/A      | N/A    |
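For reference, here is a minimal sketch of the kind of decoding and scoring loop the testing script implements. This is an illustration, not the actual script: the model directory, file layout, reference-file format, and the use of the jiwer package for WER are all assumptions made for the example.

```python
# Hypothetical evaluation loop built on the Vosk API (not the repo's script).
# Assumes 16 kHz mono PCM WAV files named <utt-id>.wav and a reference file
# with lines of the form "<utt-id> <transcript>". Paths are placeholders.
import json
import wave

from vosk import Model, KaldiRecognizer
from jiwer import wer  # third-party WER scorer, used here for brevity


def transcribe(model, wav_path):
    """Decode one WAV file with Vosk and return the recognized text."""
    wf = wave.open(wav_path, "rb")
    rec = KaldiRecognizer(model, wf.getframerate())
    chunks = []
    while True:
        data = wf.readframes(4000)
        if len(data) == 0:
            break
        if rec.AcceptWaveform(data):
            chunks.append(json.loads(rec.Result()).get("text", ""))
    chunks.append(json.loads(rec.FinalResult()).get("text", ""))
    return " ".join(c for c in chunks if c)


def evaluate(model_dir, reference_path, wav_dir):
    """Compute corpus WER for one model over one test set."""
    model = Model(model_dir)
    refs, hyps = [], []
    with open(reference_path) as f:
        for line in f:
            utt_id, text = line.strip().split(maxsplit=1)
            refs.append(text.lower())
            hyps.append(transcribe(model, f"{wav_dir}/{utt_id}.wav"))
    return wer(refs, hyps)


if __name__ == "__main__":
    # Placeholder paths: an unpacked model directory and a prepared test set.
    print("WER:", evaluate("model-en-us-aspire", "test/ref.txt", "test/wav"))
```

Running a loop like this per model and per test set reproduces the kind of table above, modulo text normalization, which matters a lot for the absolute numbers.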
Some thoughts on the results:
- Librispeech is a very artificial test set.
- There is still no good universal model; domain-specific models work much better than generic ones. The one
tuned for read speech (Librispeech) is much better than more generic ones tuned for a broader context.
- Short commands are an interesting domain we need to pay more attention to.
- If you need a good wideband model, take the en-us-daanzu model; it is really good (though your custom model will likely be better).
- It would be nice to automate all of this and run these tests on a systematic basis.
- As popularized by our colleague Snakers4 in his criticism, proper testing
requires deeper analysis of factors and conditions. A comparison on just a simple clean test set doesn't really prove anything.