On latency of speech recognition
There are many factors that affect the quality of a speech recognition system. Word accuracy (word error rate) is one; intent accuracy is another, something big companies like Amazon and Microsoft have been optimizing recently. Then there are speed, memory usage, energy consumption and noise robustness. All these factors are equally important.
One factor that is frequently missing from evaluation is response time, or latency. If you look at real human-human dialogs you will see that human reaction is very fast, almost immediate. Sometimes the other person starts speaking even before the first one finishes the phrase. Such latency is very hard to achieve in modern recognizers.
Google, Amazon, Facebook and Microsoft understand the importance of this problem and are trying to develop solutions for it. Here are just three recent papers from Google:
- Universal ASR: Unify and Improve Streaming ASR with Full-context Modeling
- Transformer Transducer: One Model Unifying Streaming and Non-streaming Speech Recognition
- Improving Streaming Automatic Speech Recognition With Non-Streaming Model Distillation On Unsupervised Data
I particularly liked the last one, since it applies the interesting idea of unsupervised teacher-student training from a more accurate model to a less accurate one.
In general, the following factors affect the latency of a modern speech recognition system like Vosk:
- Non-streaming architectures like BLSTM and other bidirectional models. Such models have to apply a voice activity detection step to select utterances and then process the whole chunk at once, so their response time is usually extremely slow. They also cannot output partial results, which means no runtime analysis is possible.
- Too wide a right context. To gain accuracy, developers select a very wide right context; indeed, the more context the model sees, the better. In turn this means the model has to wait about half a second more before it can score a frame. Recent Vosk models use a right context of 42 frames and suffer from this issue.
- Batching of input frames. To speed up processing with caching, modern systems buffer input frames together so the neural network can process them faster. Because of that you have to wait for a certain number of input frames before scoring can start, which hurts latency but increases throughput.
- VAD instead of intent parsing for end-of-utterance detection. As noted above, a human does not wait for silence to process a sentence; we parse continuously and decide the end of an utterance not just by silence but also by content. Unfortunately, in the modern approach we have to wait for about half a second of silence before we process the input. The paper from Amazon, Combining Acoustic Embeddings and Decoding Features for End-of-utterance Detection in Real-time Far-field Speech Recognition Systems, discusses this.
- Extending the point above, besides real-time intent parsing one also needs real-time answer generation, and here streaming TTS is an important idea. Unfortunately streaming TTS is still rare, but there are some papers, like High Quality Streaming Speech Synthesis with Low, Sentence-Length-Independent Latency.
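To see how the first three factors add up, here is a rough back-of-the-envelope estimate (the numbers are illustrative; 10 ms is the usual Kaldi frame shift):

```python
# Rough worst-case latency estimate for a streaming recognizer.
# Illustrative numbers only; 10 ms is the usual Kaldi frame shift.

def latency_budget_ms(right_context_frames, chunk_frames,
                      endpoint_silence_ms, frame_shift_ms=10):
    """Sum the waiting times introduced by model right context,
    frame batching and silence-based end-of-utterance detection."""
    context_wait = right_context_frames * frame_shift_ms  # model look-ahead
    chunk_wait = chunk_frames * frame_shift_ms            # frame batching
    return context_wait + chunk_wait + endpoint_silence_ms

# A model with 42 frames of right context, 20-frame chunks and a
# 500 ms silence endpoint waits at least this long (in ms):
print(latency_budget_ms(42, 20, 500))   # -> 1120
```

Even before decoding cost, the look-ahead and the silence timeout alone push the response time past a second, which is why each of the factors above matters.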
Traditionally, Kaldi decoders have had very good support for streaming speech recognition. The strong features Vosk has here compared to other toolkits are: true stream-like processing, voice activity detection integrated with the recognition results, and the ability to build small (deterministic) lattices during decoding and backtrace the best result, so intent recognition can be a step in the processing pipeline. Many end-to-end systems do not support features like that.
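The stream-like processing with partial results looks roughly like this in client code. This is a self-contained sketch: StubRecognizer is a stand-in that mimics the AcceptWaveform/PartialResult/Result shape of vosk.KaldiRecognizer, so it runs without a model.

```python
# Sketch of a streaming decoding loop with partial results.
# StubRecognizer only mimics the interface of vosk.KaldiRecognizer;
# a real recognizer consumes raw PCM bytes, not word strings.

class StubRecognizer:
    def __init__(self):
        self.words = []

    def AcceptWaveform(self, chunk):
        # Pretend each chunk decodes to one word; return True when
        # an endpoint (here a silence marker) is detected.
        self.words.append(chunk)
        return chunk == "<silence>"

    def PartialResult(self):
        return " ".join(w for w in self.words if w != "<silence>")

    def Result(self):
        text = self.PartialResult()
        self.words = []          # recognizer resets after an utterance
        return text

def stream_decode(chunks, rec):
    """Feed chunks as they arrive; partials allow runtime analysis."""
    partials, finals = [], []
    for chunk in chunks:
        if rec.AcceptWaveform(chunk):
            finals.append(rec.Result())           # utterance finished
        else:
            partials.append(rec.PartialResult())  # intermediate hypothesis
    return partials, finals

partials, finals = stream_decode(
    ["turn", "on", "the", "light", "<silence>"], StubRecognizer())
print(finals)   # -> ['turn on the light']
```

The key point is that every chunk produces an up-to-date hypothesis immediately, so downstream intent analysis never has to wait for the utterance to end.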
Unfortunately, there are downsides too. The default models we use are not batch models, but they still use a very large right context (42 frames), as mentioned above. Daanzu models are not much better (30 frames of right context). For now, I must admit that Zamia models, which use just 10 frames of right context, feel more responsive than the Vosk ones. We will retrain all our mobile models to use just 10 frames to improve responsiveness; hopefully it will not hurt accuracy much.
Btw, if you want to figure out the context of a model you can use the Kaldi command nnet3-info like this:

```
$ nnet3-info final.mdl | head
left-context: 34
right-context: 34
```
Second, we need to add a latency test so we can evaluate real-life system latency and make sure it stays within a reasonable 0.2 seconds. Here we might need to play with frame batching (the --frames-per-chunk option). Hopefully it will not make the system much slower.
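Such a test could be as simple as timing the gap between the end of the audio and the final result. A minimal sketch, with a sleep standing in for the real final-result call of the decoder:

```python
import time

def measure_latency(decode_final, budget_s=0.2):
    """Time from end of audio to final result. decode_final is
    whatever call flushes the decoder and returns the final text."""
    audio_end = time.monotonic()   # the last sample has just arrived
    decode_final()                 # flush the decoder
    latency = time.monotonic() - audio_end
    return latency, latency <= budget_s

# Stand-in decoder flush: here just 50 ms of simulated work.
latency, within_budget = measure_latency(lambda: time.sleep(0.05))
print(within_budget)   # -> True
```

In a real test the stand-in would be replaced by the recognizer's final-result call on prerecorded audio, and the test would fail whenever the measured latency exceeds the 0.2 second budget.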
Third, our code examples should demonstrate real-time intent parsing instead; I am not sure how to implement it yet.
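One possible shape for it: run an intent matcher over every partial result and end the utterance as soon as a complete intent is seen, instead of waiting for silence. A hypothetical sketch (the keyword grammar here is made up):

```python
# Content-based end-of-utterance detection: cut the utterance as soon
# as a partial hypothesis already forms a complete command.

INTENTS = {                        # hypothetical command grammar
    ("turn", "on", "light"): "lights_on",
    ("stop", "music"): "music_stop",
}

def match_intent(partial_text):
    """Return an intent name if the partial contains all its keywords."""
    words = set(partial_text.split())
    for keywords, intent in INTENTS.items():
        if words.issuperset(keywords):
            return intent
    return None

def endpoint_on_intent(partials):
    """Scan streaming partials and stop at the first complete intent
    instead of waiting for a silence timeout."""
    for partial in partials:
        intent = match_intent(partial)
        if intent:
            return intent, partial
    return None, partials[-1] if partials else ""

intent, text = endpoint_on_intent(
    ["turn", "turn on", "turn on the", "turn on the light"])
print(intent)   # -> 'lights_on'
```

A real implementation would use the lattice-based intent recognition mentioned above rather than keyword sets, but the control flow, deciding the endpoint from content instead of silence, stays the same.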
So it looks like we will have to degrade accuracy a bit to get proper latency, at least until modern neural network architectures and better training procedures help us catch up.
P. S. Btw, another property of human speech recognition is streaming recognition of multi-speaker input. We can very quickly identify who is speaking and switch between speakers. No open-source system can do that yet, but Amazon is getting really close in Streaming Multi-speaker ASR with RNN-T.
P. P. S. There is a relevant Kaldi paper too: Low Latency Acoustic Modeling Using Temporal Convolution and LSTMs.