Written by Nickolay Shmyrev
N-gram language model toolkits in 2020
N-gram language models are well understood and widely used. These days
n-grams are not the best models for common machine learning tasks like
translation or speech recognition; they have long been superseded by
giant neural networks. Still, for a fast first-pass computation n-grams
are very valuable, and beam search in speech recognition still uses
n-gram models.
There are a dozen toolkits that help build ARPA models, from very simple to
very complicated ones. The required features are pretty clear. The best
toolkit should have:
- Witten-Bell discounting (Kneser-Ney discounting doesn’t work well with pruning)
- Entropy pruning
- Simple command line interface
- Language model interpolation for quick domain adaptation
- License for commercial applications
- Ability to process 100+ GB of Common Crawl text
- Fast and compact binary representation
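All of these toolkits ultimately target the same plain-text ARPA format. A tiny illustrative fragment (the vocabulary, log10 probabilities and backoff weights are made up) looks like this:

```
\data\
ngram 1=5
ngram 2=3

\1-grams:
-1.0000  <unk>
-99      <s>    -0.3010
-0.6990  </s>
-0.6990  hello  -0.3010
-0.6990  world  -0.3010

\2-grams:
-0.3010  <s> hello
-0.3010  hello world
-0.3010  world </s>

\end\
```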
The list looks pretty simple, doesn’t it? Unfortunately, there is no such toolkit around, which is very surprising.
- SRILM - the most powerful toolkit around. Everything above is possible and works really stably (example commands for SRILM and KenLM follow this list). Unfortunately, the license doesn’t allow using it in most commercial applications.
- KenLM - the toolkit everyone loves, really fast and can build really big language models. But no model interpolation, only count interpolation. No Witten-Bell either, and no entropy pruning.
- MitLM - mostly unmaintained these days, no Witten-Bell, no pruning.
- IRSTLM - no simple command line, just a bunch of Perl scripts; no interpolation, no pruning.
- Kaldi pocolm - no simple command line, no interpolation.
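To give a feel for the workflow, here is roughly how a Witten-Bell model with entropy pruning and interpolation is built in SRILM, and how a model is built and binarized in KenLM. The commands are written from memory and the thresholds are arbitrary, so check each toolkit’s documentation before copying them:

```
# SRILM: count and estimate a 3-gram model with Witten-Bell discounting
ngram-count -order 3 -text corpus.txt -wbdiscount -lm model.lm
# entropy-prune the model and write it back out
ngram -order 3 -lm model.lm -prune 1e-8 -write-lm model-pruned.lm
# interpolate with a domain model (lambda is the mixture weight)
ngram -order 3 -lm model.lm -mix-lm domain.lm -lambda 0.5 -write-lm mixed.lm

# KenLM: estimate a 3-gram ARPA model and convert it to a fast binary format
lmplz -o 3 < corpus.txt > model.arpa
build_binary model.arpa model.binary
```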
There are great new toolkits like Tongrams with very efficient
computation based on hashing, but they still don’t offer the full feature set.
Surprisingly, there is a toolkit I rarely remember to suggest; it is also
rarely mentioned, but it fits almost all the requirements. It is
OpenGRM. It
follows the somewhat complicated, template-heavy OpenFST C++ style, but
otherwise it is very feature-rich:
- Many types of discount including Witten-Bell
- Different pruning types (entropy, Seymore pruning)
- Command line and C++/Python API
- Interpolation (Bayes, counts, contexts, etc.)
- Permissive Apache license
- Compact LOUDS representation
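For illustration, here is roughly what the standard pipeline from the OpenGRM NGram quick tour looks like; flag names and thresholds are from memory, so double-check them against the current documentation:

```
# build a symbol table and compile the corpus into an FST archive
ngramsymbols < corpus.txt > corpus.syms
farcompilestrings --fst_type=compact --symbols=corpus.syms --keep_symbols=1 corpus.txt > corpus.far

# count 3-grams, estimate a Witten-Bell model, entropy-prune it, export ARPA
ngramcount --order=3 corpus.far > corpus.cnts
ngrammake --method=witten_bell corpus.cnts > corpus.mod
ngramshrink --method=relative_entropy --theta=1e-7 corpus.mod > corpus.pruned
ngramprint --ARPA corpus.pruned > corpus.arpa
```

Interpolation of several models is handled by the separate ngrammerge tool.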
The only minor (well, not so minor) thing is that OpenGRM is pretty slow. It is
hard to use even with a 2 GB LM. Maybe something can be optimized
internally with better compiler options; I have yet to figure it out.
P.S. Surprisingly, n-gram model research in ASR is still going on; two notable recent papers:
- Connecting and Comparing Language Model Interpolation Techniques
- Efficient MDI Adaptation for n-gram Language Models