Status of Vosk in October 2020

When you work on things day to day you lose the overall picture very quickly. We’ve been actively training models, fixing things here and there and adding new platforms. Vosk now supports 16 languages (recent additions are German, Catalan and Farsi) and several important platforms: Asterisk, Freeswitch and even Jigasi. It is fairly easy to get started, although the API needs more love, Windows support needs more love, and so on and so forth.
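
For those who have not tried it, the quick start with the Python bindings looks roughly like the sketch below. The model directory name and the WAV file are placeholders; the audio is assumed to be 16 kHz, 16-bit mono PCM matching the model’s sample rate.

    # Transcribe a WAV file with Vosk (paths are placeholders).
    import json
    import wave

    from vosk import Model, KaldiRecognizer

    wf = wave.open("test.wav", "rb")           # 16 kHz, 16-bit mono PCM
    model = Model("model")                     # unpacked model directory
    rec = KaldiRecognizer(model, wf.getframerate())

    while True:
        data = wf.readframes(4000)
        if len(data) == 0:
            break
        if rec.AcceptWaveform(data):
            print(json.loads(rec.Result())["text"])

    print(json.loads(rec.FinalResult())["text"])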

Two big events and a few small ones brought me back to Earth recently.

The first one was the Kaldi meeting, a really exciting event with very wide representation from academia, industry and even big corporations. It is amazing how many people use Kaldi in their practice and follow it. There were many interesting things on the agenda, but the major ones are:

  • The Kaldi C++ neural network library is not flexible enough; PyTorch is much more powerful and much more widely supported, so Kaldi is moving to PyTorch soon (a great thing).
  • These days people train very deep end-to-end networks (50+ layers, 100M+ parameters), while Kaldi by default uses around 30 layers and 20 million parameters, so the accuracy is below modern expectations. Modern architectures are also very fast compared to Kaldi; Kaldi CNNs and LSTMs are extremely slow to train.
  • Model quantization is very important for production and is not yet available in Kaldi. Hopefully we’ll get there; see the sketch after this list for the kind of thing PyTorch already offers.
  • Context dependency has to go; it creates problems during training and during decoding as well. Subword units are easier to learn and much more stable for fast conversational speech.
  • Modern architectures like NVIDIA QuartzNet are very fast (because the encoder is very accurate) and also easy to train.
  • The Kaldi RNNLM is also somewhat suboptimal compared to modern language models.
  • Backpropagation from NLU to the speech recognizer is important and was mentioned very frequently by the participants.
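
On the quantization point, since Kaldi is moving to PyTorch anyway, a rough sketch of what PyTorch already offers out of the box. The network below is a toy stand-in, not a Kaldi model; the layer sizes are arbitrary placeholders.

    # Dynamic int8 quantization of a toy acoustic-model-like network in PyTorch.
    # The architecture and layer sizes are arbitrary placeholders, not a Kaldi model.
    import torch
    import torch.nn as nn

    model = nn.Sequential(
        nn.Linear(440, 1024), nn.ReLU(),
        nn.Linear(1024, 1024), nn.ReLU(),
        nn.Linear(1024, 3000),  # output layer over senones or subword units
    )

    # Replace the Linear layers with int8 dynamically quantized versions.
    quantized = torch.quantization.quantize_dynamic(
        model, {nn.Linear}, dtype=torch.qint8
    )

    features = torch.randn(1, 440)    # a fake frame of stacked features
    print(quantized(features).shape)  # torch.Size([1, 3000])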

A good example of a modern, fairly simple decoder is Google’s ContextNet.

Most of those issues are going to be solved with the new K2, which is really promising, but there are a few things I would try in the original Kaldi itself (I have tried some of them already):

  • Train a TDNN without context dependency, using a monophone tree (the degradation is about 0.3%, which is not significant and should go away with a deeper network). The speed improvement is very nice. Basically, we give up context dependency but hope that a very deep network with large context (50 frames on the left) will learn the context anyway.
  • Train a much deeper TDNN with up to 50M parameters (training now; some improvements are visible, but more experiments are required).
  • Train with BPE/subword models (see the sketch after this list).
  • Try to implement faster convolutions in Kaldi (probably not worth it).
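
For the subword experiment, the units themselves are easy to get with sentencepiece; plugging them into the Kaldi lexicon and tree is the harder part, which I will not cover here. A minimal sketch, where the corpus path and vocabulary size are just placeholders:

    # Train a BPE subword model on a text corpus and tokenize a sentence with it.
    # "corpus.txt" and the vocabulary size are placeholders for illustration.
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(
        input="corpus.txt",
        model_prefix="bpe",
        vocab_size=1000,
        model_type="bpe",
    )

    sp = spm.SentencePieceProcessor(model_file="bpe.model")
    print(sp.encode("speech recognition with subword units", out_type=str))
    # e.g. ['▁speech', '▁recogni', 'tion', '▁with', '▁sub', 'word', '▁units']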

The second thing was that I tried to run Vosk on OSX with the microphone. The experience is awful, to be honest: the accuracy is pretty low and the response time is about 2 seconds even with a very small model. It is very important to “eat your own dog food”, as I have been reminded again and again. Sadly, I rarely use speech recognition in daily life; I definitely need to use it more frequently.
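
For reference, the microphone test was something like the sketch below, which follows the streaming example shipped with the Python bindings. It assumes a 16 kHz model unpacked into “model” and the sounddevice package for audio capture; the response time I am complaining about is the delay before the results printed here show up.

    # Stream microphone audio into Vosk and print partial and final results.
    # Assumes a 16 kHz model in "model" and the sounddevice package.
    import json
    import queue

    import sounddevice as sd
    from vosk import Model, KaldiRecognizer

    q = queue.Queue()

    def callback(indata, frames, time, status):
        q.put(bytes(indata))

    model = Model("model")
    rec = KaldiRecognizer(model, 16000)

    with sd.RawInputStream(samplerate=16000, blocksize=8000, dtype="int16",
                           channels=1, callback=callback):
        while True:
            data = q.get()
            if rec.AcceptWaveform(data):
                print(json.loads(rec.Result())["text"])
            else:
                print(json.loads(rec.PartialResult())["partial"])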

Things to do in Vosk:

  • Get rid of i-vector adaptation. While ContextNet above shows very good results with long context (essentially applying advanced i-vectors), it certainly leads to a very bad user experience. On the first utterance we cannot estimate the context at all, and the first few words are very frequently transcribed incorrectly by the recognizer. Users do not care about the second sentence; the first one is critically important.
  • These i-vectors are also kind of cheating, since in real life we do not know the speaker information. They are good for demonstrating results on the LibriSpeech dataset, or perhaps for batch transcription. It is worth noting that many LibriSpeech results are reported without mentioning whether speaker information was used, even though that matters for a one-to-one comparison. If we don’t use speaker information, Kaldi loses more ground compared to end-to-end systems, which rarely use it.
  • Try to improve the response time by reducing the right context. Again, this will lead to worse accuracy, but 2 seconds for a result is too much; we need much faster results for sure. Maybe we also need to integrate the endpointer with NLU so it can decide the end of the utterance more reliably.

Lots of fun ahead.