Lookahead composition in Kaldi and Vosk

In 2019 AlphaCephei made quite some good progress. We introduced a project called Vosk, which is meant to be a portable speech recognition API for a variety of platforms (Linux servers, Windows, iOS, Android, RPi, etc.), languages (English, Spanish, Portuguese, Chinese, Russian, German, French, more coming soon) and programming languages (C#, Java, Javascript, Python). It also has a unique lifelong learning component for training. It is still a work in progress: there are no good API docs or tutorials yet, and the API itself also needs attention.

One interesting thing we have implemented in the library is lookahead composition for efficient graph construction during decoding. This is an old idea from OpenFst which has been there for a long time but was rarely used.

The traditional Kaldi approach is still to create a huge decoding graph from the language model, dictionary and context-dependency graph, and to decode with a relatively simple decoder which just explores the best path. The size of the graph is usually at least 300 MB and can reach several gigabytes. The advantage of this approach is simple: decoding speed. The disadvantages are the huge model size, which doesn't fit mobile platforms, and the limited flexibility of the model. For example, you can not adjust word probabilities without graph recompilation tricks (the ones implemented in GrammarFst).

The new idea is to rearrange the lexicon graph so that composition with the grammar graph can be done on the fly during decoding. As a result, you don't need to distribute the huge graph with the application. Instead of a single 300 MB graph you give the user a 20 MB graph of the lexicon plus context-dependency transducer and a 20 MB compressed language model. And the language model is flexible: you can configure it for every new decoding instance. For the publication on the subject you are welcome to check:

A Generalized Composition Algorithm for Weighted Finite-State Transducers by Cyril Allauzen, Michael Riley, Johan Schalkwyk
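To make the idea more concrete, here is a minimal C++ sketch of delayed composition with stock OpenFst. The file names HCLr.fst and Gr.fst and the surrounding wiring are illustrative assumptions, not the exact Vosk or Kaldi code; in practice the Kaldi lookahead binaries also take care of label relabeling and lookahead matcher setup.

// Sketch only: compose the lexicon-side transducer with the grammar lazily.
#include <fst/fstlib.h>
#include <iostream>

int main() {
  // HCLr.fst: context + lexicon transducer prepared for lookahead composition.
  // Gr.fst: the grammar (language model) transducer. Names are illustrative.
  fst::StdFst *hclr = fst::StdFst::Read("HCLr.fst");
  fst::StdFst *gr = fst::StdFst::Read("Gr.fst");
  if (hclr == nullptr || gr == nullptr) return 1;

  // ComposeFst is delayed: states of HCLr o Gr are created only when the
  // decoder asks for them, so the full HCLG graph never sits in memory.
  // When HCLr is stored with a lookahead FST type (olabel_lookahead), the
  // registered lookahead matcher can be used during this composition.
  fst::ComposeFst<fst::StdArc> decode_graph(*hclr, *gr);

  // A decoder then traverses decode_graph exactly like a static graph.
  std::cout << "start state: " << decode_graph.Start() << std::endl;

  delete hclr;
  delete gr;
  return 0;
}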

For the Kaldi part of the setup please check this script; it explains how to compile the graph with lookahead and how to decode, and it also provides some decoding results on Librispeech:

run_lookahead.sh

Basically, the results are:

# %WER 4.86 [ 2553 / 52576, 315 ins, 222 del, 2016 sub ] exp/chain_cleaned/tdnn_1d_sp/decode_test_clean_lookahead/wer_11_0.0
# %WER 4.79 [ 2518 / 52576, 279 ins, 292 del, 1947 sub ] exp/chain_cleaned/tdnn_1d_sp/decode_test_clean_lookahead_arpa/wer_11_0.0
# %WER 4.82 [ 2532 / 52576, 286 ins, 290 del, 1956 sub ] exp/chain_cleaned/tdnn_1d_sp/decode_test_clean_lookahead_arpa_fast/wer_11_0.0
# %WER 4.86 [ 2553 / 52576, 314 ins, 222 del, 2017 sub ] exp/chain_cleaned/tdnn_1d_sp/decode_test_clean_lookahead_base/wer_11_0.0
# %WER 4.86 [ 2553 / 52576, 315 ins, 222 del, 2016 sub ] exp/chain_cleaned/tdnn_1d_sp/decode_test_clean_lookahead_static/wer_11_0.0


# Speed
#
# base       0.18 xRT
# static     0.18 xRT
# lookahead  0.29 xRT
# arpa       0.35 xRT
# arpa_fast  0.21 xRT

# Graph size
#
# Base                 476 Mb
# Static               621 Mb
# Lookahead            48 Mb HCL + 77 Mb Grammar
# Lookahead + OpenGrm  48 Mb HCL + 42 Mb Grammar

So the advantages of this approach are more or less obvious; let me write down the disadvantages I have figured out as a plan for future research directions on this subject. Some of them are implementation-specific, some are pretty deep and require rethinking of the decoder architecture.

Decoding speed

As you can see in the Kaldi results, lookahead composition comes with a significant drop in decoding speed (roughly twice slower in the default operation mode). The drop can be reduced with narrower beams, where search takes a smaller share of the time, but for good accuracy you still have to be slower (a small config sketch follows below). The original paper didn't mention this because in those GMM days acoustic model scoring took much longer, so the search part was relatively smaller and the lookahead penalty was less visible. These days, with very fast acoustic scoring based on neural networks (the chain model scores only 1 of every 3 frames and does it very quickly), the penalty is much more visible. Profiling shows that the C++ machinery in composition is quite costly; maybe it can be optimized, but the default OpenFst implementation doesn't look very efficient. There are caches, iterators and all those things which look very slow. A pure C optimized implementation could be much faster, and I have started to think again about an efficient pure C decoder.
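As an illustration of the beam trade-off mentioned above, here is a small hedged sketch using Kaldi's LatticeFasterDecoderConfig; the values are made up for illustration and need to be tuned per task.

// Illustrative values only: narrower beams cut the search share of the
// runtime, which is where the lookahead composition penalty lives.
#include "decoder/lattice-faster-decoder.h"

kaldi::LatticeFasterDecoderConfig NarrowBeamConfig() {
  kaldi::LatticeFasterDecoderConfig config;
  config.beam = 11.0;         // main pruning beam (Kaldi default is 16.0)
  config.max_active = 5000;   // cap on active states per frame
  config.lattice_beam = 4.0;  // tighter lattice pruning
  return config;
}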

Here is the profiler output so you can check the most expensive methods; sgemm is the acoustic model scoring, the rest are heavy OpenFst templates:

CPU: Intel Skylake microarchitecture, speed 4700 MHz (estimated)
Counted cpu_clk_unhalted events () with a unit mask of 0x00 (Core cycles when at least one thread on the physical core is not in halt state) count 100000
samples  %        image name               app name                 symbol name
-------------------------------------------------------------------------------
712671   22.3013  libmkl_avx2.so           nnet3-latgen-faster-lookahead mkl_blas_avx2_sgemm_kernel_nocopy_TN_b1
-------------------------------------------------------------------------------
334831   10.4777  libmkl_avx2.so           nnet3-latgen-faster-lookahead mkl_blas_avx2_sgemm_kernel_0
-------------------------------------------------------------------------------
214193    6.7027  olabel_lookahead-fst.so.0.0.0 nnet3-latgen-faster-lookahead long fst::LabelReachable<fst::ArcTpl<fst::TropicalWeightTpl<float> >, fst::FastLogAccumulator<fst::ArcTpl<fst::TropicalWeightTpl<float> > >, fst::LabelReachableData<int> >::LowerBound<fst::ArcIterator<fst::Fst<fst::ArcTpl<fst::TropicalWeightTpl<float> > > > >(fst::ArcIterator<fst::Fst<fst::ArcTpl<fst::TropicalWeightTpl<float> > > >*, long, long, int) const
-------------------------------------------------------------------------------
209942    6.5696  ngram-fst.so.0.0.0       nnet3-latgen-faster-lookahead fst::ArcIterator<fst::NGramFst<fst::ArcTpl<fst::TropicalWeightTpl<float> > > >::Value() const
-------------------------------------------------------------------------------
128013    4.0059  nnet3-latgen-faster-lookahead nnet3-latgen-faster-lookahead std::_Hashtable<int, int, fst::PoolAllocator<int>, std::__detail::_Identity, fst::CompactHashBiTable<int, fst::DefaultComposeStateTuple<int, fst::PairFilterState<fst::PairFilterState<fst::IntegerFilterState<signed char>, fst::WeightFilterState<fst::TropicalWeightTpl<float> > >, fst::IntegerFilterState<int> > >, fst::ComposeHash<fst::DefaultComposeStateTuple<int, fst::PairFilterState<fst::PairFilterState<fst::IntegerFilterState<signed char>, fst::WeightFilterState<fst::TropicalWeightTpl<float> > >, fst::IntegerFilterState<int> > > >, std::equal_to<fst::DefaultComposeStateTuple<int, fst::PairFilterState<fst::PairFilterState<fst::IntegerFilterState<signed char>, fst::WeightFilterState<fst::TropicalWeightTpl<float> > >, fst::IntegerFilterState<int> > > >, (fst::HSType)3>::HashEqual, fst::CompactHashBiTable<int, fst::DefaultComposeStateTuple<int, fst::PairFilterState<fst::PairFilterState<fst::IntegerFilterState<signed char>, fst::WeightFilterState<fst::TropicalWeightTpl<float> > >, fst::IntegerFilterState<int> > >, fst::ComposeHash<fst::DefaultComposeStateTuple<int, fst::PairFilterState<fst::PairFilterState<fst::IntegerFilterState<signed char>, fst::WeightFilterState<fst::TropicalWeightTpl<float> > >, fst::IntegerFilterState<int> > > >, std::equal_to<fst::DefaultComposeStateTuple<int, fst::PairFilterState<fst::PairFilterState<fst::IntegerFilterState<signed char>, fst::WeightFilterState<fst::TropicalWeightTpl<float> > >, fst::IntegerFilterState<int> > > >, (fst::HSType)3>::HashFunc, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, true, true> >::_M_find_before_node(unsigned long, int const&, unsigned long) const
-------------------------------------------------------------------------------
120748    3.7785  nnet3-latgen-faster-lookahead nnet3-latgen-faster-lookahead kaldi::LatticeFasterDecoderTpl<fst::Fst<fst::ArcTpl<fst::TropicalWeightTpl<float> > >, kaldi::decoder::StdToken>::PruneForwardLinks(int, bool*, bool*, float)
-------------------------------------------------------------------------------
107220    3.3552  libc-2.27.so             nnet3-latgen-faster-lookahead _int_malloc
-------------------------------------------------------------------------------
83195     2.6034  nnet3-latgen-faster-lookahead nnet3-latgen-faster-lookahead kaldi::LatticeFasterDecoderTpl<fst::Fst<fst::ArcTpl<fst::TropicalWeightTpl<float> > >, kaldi::decoder::StdToken>::ProcessEmitting(kaldi::DecodableInterface*)
-------------------------------------------------------------------------------
71555     2.2391  olabel_lookahead-fst.so.0.0.0 nnet3-latgen-faster-lookahead bool fst::LabelReachable<fst::ArcTpl<fst::TropicalWeightTpl<float> >, fst::FastLogAccumulator<fst::ArcTpl<fst::TropicalWeightTpl<float> > >, fst::LabelReachableData<int> >::Reach<fst::ArcIterator<fst::Fst<fst::ArcTpl<fst::TropicalWeightTpl<float> > > > >(fst::ArcIterator<fst::Fst<fst::ArcTpl<fst::TropicalWeightTpl<float> > > >*, long, long, bool)
-------------------------------------------------------------------------------
62693     1.9618  ngram-fst.so.0.0.0       nnet3-latgen-faster-lookahead fst::internal::NGramFstImpl<fst::ArcTpl<fst::TropicalWeightTpl<float> > >::Transition(std::vector<int, std::allocator<int> > const&, int) const
-------------------------------------------------------------------------------
59650     1.8666  libc-2.27.so             nnet3-latgen-faster-lookahead malloc_consolidate

Strictness of the OpenFst algorithm

The big problem with the OpenFst approach is that it tries to be very exact about composition. That is good from a research point of view, but a real-life decoder can cut many corners and maintain almost the same speed and accuracy. One example from PocketSphinx concerns cross-word triphones. Cross-word triphones usually create the biggest problems in search: since the next word is uncertain, there are too many variants of where to go next and the required search space explodes. Inside words the required space is not so big, since there are fewer hypotheses to search through, so you can use a narrow beam.

You can approximate cross-word triphones with word-internal ones, or just take the 1-best path during that transition without a breadth search; the accuracy level will be about the same, but the required beam width will be much smaller and search will be faster. Tricks like that are not very easy to pull off in the OpenFst framework.

That's why most commercial systems have custom decoders.
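As a toy illustration of the 1-best shortcut described above, here is a hypothetical sketch; the Token and Arc structs are simplified stand-ins, not Kaldi or OpenFst types.

// Hypothetical sketch: at a word-end state, follow only the cheapest
// cross-word arc instead of expanding every cross-word triphone context.
#include <vector>

struct Arc { int next_state; float weight; };
struct Token { int state; float cost; };

// Assumes cross_word_arcs is non-empty.
Token ExpandWordBoundary(const Token &tok,
                         const std::vector<Arc> &cross_word_arcs) {
  const Arc *best = &cross_word_arcs.front();
  for (const Arc &arc : cross_word_arcs) {
    if (arc.weight < best->weight) best = &arc;
  }
  // Only one successor survives the transition, so the beam can stay narrow
  // here; a strict FST composition would keep all of them.
  return Token{best->next_state, tok.cost + best->weight};
}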

LOUDS ngram model

OpenGrm uses a pretty interesting succinct data structure, LOUDS, for ngram model compression; it is a very cool idea of bit-level compression. It is slightly different from the conventional trie used in KenLM or ConstArpaLm and is claimed to be very fast and cache-efficient, but it turned out to be very slow in my experiments. The default Kaldi ConstArpaLm seems much more natural and faster, and the size of the compressed model is about the same. ConstArpaLm doesn't implement the full Fst interface from OpenFst right now, so you can't use it in decoding as is, but it can probably be implemented easily.
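For context, here is a minimal sketch of how ConstArpaLm can already be queried on demand today through Kaldi's deterministic-on-demand wrapper (the one used for lattice rescoring). Exposing the same data behind the full fst::Fst interface, with arc iterators over all words leaving a state, is the part that is still missing for lookahead composition; the function below is illustrative only.

// Sketch: on-demand query of a const ARPA LM in Kaldi. The function name is
// made up; ConstArpaLmDeterministicFst is the existing rescoring wrapper.
#include "lm/const-arpa-lm.h"

float WordCost(kaldi::ConstArpaLmDeterministicFst *lm_fst, int word_id) {
  fst::StdArc arc;
  fst::StdArc::StateId state = lm_fst->Start();
  // GetArc fills in the single arc for word_id leaving `state`; its weight
  // carries the (negated) LM log-probability, with backoff applied inside.
  if (lm_fst->GetArc(state, word_id, &arc)) return arc.weight.Value();
  return 0.0f;
}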

Quantization

This is probably low-hanging fruit: in ASR you don't really need 32-bit floats as weights, 16 bits is more than enough and sometimes even 8 bits. OpenFst supports compact FSTs, but I haven't yet figured out how to use them in decoding. Maybe I need to spend some more time on it.
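One cheap step in that direction, sketched below as an assumption rather than a recommendation, is OpenFst's QuantizeMapper: it rounds weights to multiples of a delta, which shrinks the set of distinct weight values and lets the graph compress better (the same operation is available as fstmap --map_type=quantize on the command line). It is not the same as storing weights in 16 or 8 bits, but it goes in the same direction.

// Sketch: round all arc and final weights to a grid so the graph has far
// fewer distinct weight values; the delta of 1/64 is an illustrative choice.
#include <fst/fstlib.h>

void QuantizeGraphWeights(fst::StdMutableFst *graph) {
  fst::ArcMap(graph, fst::QuantizeMapper<fst::StdArc>(1.0f / 64.0f));
}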

Also, I hope to see quantized acoustic models in Kaldi soon. The Kaldi team is working on PyTorch integration in the pybind11 branch; I really hope to see it working soon, since the standard PyTorch quantization process could then produce much more efficient mobile models. Currently the models are pretty basic (768-dimensional layers, about 15 MB total) and not very accurate.

Vocabulary

The next problem is that we can not easily change the vocabulary, and most realistic applications today (navigation queries, song selection and so on) expect huge vocabularies. This is more visible for languages like Russian or even German, which have many more word forms than English, but it must be visible in English as well. Modern subword-based end-to-end systems like wav2letter set a new level here, far beyond what we can achieve with the standard HCLG approach.

One nice thing we got on the Russian internet recently is the openstt-ru effort: a big dataset of real-life tasks with measurements of both commercial and open source recognizer accuracy, something that is still missing on the English internet. They run the tests on taxi driver queries, medical queries, public lectures and so on. The thing is that recent end-to-end recognizers demonstrate a much superior user experience (2-3 times fewer errors) when the required vocabulary is large. I attribute this to the flexible way they work with the vocabulary.

In dictation times it was ok to have a vocabulary of 20k entries and systems tuned to a specific domain; these days the expectation is that the system just works for millions of words, something an embedded decoder like vosk-api can't provide unless we change the approach to dictionary handling.

One way to deal with this issue is to run a live lifelong learning system that picks up new words, a Vosk-style approach, something I would like to post about separately.

Conclusion

Overall, you can see there are still many problems to solve: both speed and algorithm stability require quite some work. I have long been complaining about ivectors, which demonstrate ok improvements for long talks and dictation but create an awful experience on very short utterances. The first impression a user gets from an ivector system is usually not so nice. People are used to talking in short phrases rather than giving long talks, so the whole ivector thing is rather questionable. But that is worth a separate post.