Models

This is the list of models compatible with Vosk-API.

To add a new model here, create an issue on GitHub.

| Model | Size | Accuracy | Notes | License |
|---|---|---|---|---|
| English | | | | |
| vosk-model-en-us-aspire-0.2 | 1.4G | 13.64 (librispeech test-clean), 12.89 (tedlium) | Trained on Fisher + more or less recent LM. Should be pretty good for generic US English transcription | Apache 2.0 |
| vosk-model-small-en-us-0.4 | 36M | 15.34 (librispeech test-clean), 12.09 (tedlium) | Lightweight wideband model for Android and RPi | Apache 2.0 |
| vosk-model-en-us-daanzu-20200905 | 1.0G | 7.08 (librispeech test-clean), 8.25 (tedlium) | Accurate wideband model for dictation from the Kaldi-active-grammar project | AGPL |
| vosk-model-en-us-daanzu-20200905-lgraph | 129M | 8.20 (librispeech test-clean), 9.28 (tedlium) | Accurate wideband model for dictation from the Kaldi-active-grammar project, with configurable graph | AGPL |
| vosk-model-en-us-librispeech-0.2 | 845M | TBD | Repackaged Librispeech model from Kaldi. Not very accurate, mainly for research | Apache 2.0 |
| Indian English | | | | |
| vosk-model-en-in-0.4 | 370M | TBD | Generic Indian English model for telecom and broadcast | Apache 2.0 |
| vosk-model-small-en-in-0.4 | 36M | TBD | Lightweight Indian English model for mobile applications | Apache 2.0 |
| Chinese | | | | |
| vosk-model-cn-0.1.zip | 195M | TBD | Big narrowband Chinese model for server processing | Apache 2.0 |
| vosk-model-small-cn-0.3 | 32M | TBD | Lightweight wideband model for Android and RPi | Apache 2.0 |
| Russian | | | | |
| vosk-model-ru-0.10.zip | 2.5G | 5.71 (our audiobooks), 16.26 (open_stt audiobooks), 26.20 (open_stt public_youtube_700_val), 40.15 (open_stt asr_calls_2_val) | Big narrowband Russian model for server processing | Apache 2.0 |
| vosk-model-small-ru-0.4 | 39M | TBD | Lightweight wideband model for Android and RPi | Apache 2.0 |
| French | | | | |
| vosk-model-small-fr-pguyot-0.3 | 39M | TBD | Lightweight wideband model for Android and RPi, trained by Paul Guyot | CC-BY-NC-SA 4.0 |
| fr-pguyot-zamia-20191016-tdnn_f | 282M | TBD | Bigger, more accurate model by Paul Guyot | CC-BY-NC-SA 4.0 |
| German | | | | |
| vosk-model-de-0.6 | 1.2G | 13.03 (Tuda-de test), 11.22 (Tuda-de rescore) | Big narrowband German model for telephony and server | Apache 2.0 |
| vosk-model-small-de-zamia-0.3 | 49M | TBD | Lightweight wideband model for Android and RPi | LGPL-3.0 |
| Spanish | | | | |
| vosk-model-small-es-0.3 | 33M | TBD | Lightweight wideband model for Android and RPi | Apache 2.0 |
| Portuguese | | | | |
| vosk-model-small-pt-0.3 | 31M | TBD | Lightweight wideband model for Android and RPi | Apache 2.0 |
| Greek | | | | |
| vosk-model-el-gr-0.7.zip | 1.1G | TBD | Big narrowband Greek model for server processing, not extremely accurate though | Apache 2.0 |
| Turkish | | | | |
| vosk-model-small-tr-0.3 | 35M | TBD | Lightweight wideband model for Android and RPi | Apache 2.0 |
| Vietnamese | | | | |
| vosk-model-small-vn-0.3 | 32M | TBD | Lightweight wideband model for Android and RPi | Apache 2.0 |
| Italian | | | | |
| vosk-model-small-it-0.4 | 32M | TBD | Lightweight wideband model for Android and RPi | Apache 2.0 |
| Dutch | | | | |
| vosk-model-nl-spraakherkenning-0.6.zip | 860M | TBD | Medium Dutch model from Kaldi_NL | Apache 2.0 |
| vosk-model-nl-spraakherkenning-0.6-lgraph.zip | 100M | TBD | Smaller model with dynamic graph | Apache 2.0 |
| Catalan | | | | |
| vosk-model-small-ca-0.4 | 42M | TBD | Lightweight wideband Catalan model for Android and RPi | Apache 2.0 |
| Arabic | | | | |
| vosk-model-ar-mgb2-0.4 | 318M | 16.40 (MGB-2 dev set) | Repackaged Arabic model trained on the MGB2 dataset, from Kaldi | Apache 2.0 |
| Farsi | | | | |
| vosk-model-small-fa-0.4 | 47M | TBD | Lightweight wideband Farsi (Persian) model for Android and RPi | Apache 2.0 |
| Speaker identification model | | | | |
| vosk-model-spk-0.4 | 13M | TBD | Model for speaker identification, should work for all languages | Apache 2.0 |
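
All of the transcription models above are used the same way once downloaded and unpacked. Below is a minimal sketch with the Vosk Python binding; the model directory name and the WAV file name are placeholders, and the audio is assumed to be mono PCM at the sample rate the model was built for (8 kHz for the narrowband server models, 16 kHz for the wideband ones).

```python
import json
import wave

from vosk import Model, KaldiRecognizer

# Placeholder paths: point these at an unpacked model directory and a mono PCM WAV file.
model = Model("vosk-model-small-en-us-0.4")
wf = wave.open("test.wav", "rb")

rec = KaldiRecognizer(model, wf.getframerate())

while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):
        # A completed utterance; Result() returns a JSON string with a "text" field.
        print(json.loads(rec.Result())["text"])

# Flush the final partial utterance.
print(json.loads(rec.FinalResult())["text"])
```

Models with a configurable or dynamic graph (the -lgraph variants above) can additionally take a JSON list of phrases as a third KaldiRecognizer argument to restrict the recognized vocabulary at runtime.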

Other models

Other places where you can check for models which might be compatible:

Training your own model

You can train your own model with the Kaldi toolkit. The training is pretty standard: you need a TDNN nnet3 model with i-vectors. You can check the mini_librispeech recipe for details. Some notes on training:

  • For smaller mobile models, watch the number of parameters.
  • Train the model without pitch. Pitch might help with a small amount of data, but on a large database it gives no advantage while complicating processing and increasing response time.
  • Train the i-vector extractor with dimension 30 instead of the standard 100 to save memory in mobile models.
  • The latest mini_librispeech recipe uses online CMVN, which we do not support yet. Use this script to train the nnet3 model.

PLEASE NOTE THAT THE SIMPLE GMM MODEL YOU TRAIN WITH THE “KALDI FOR DUMMIES” TUTORIAL DOES NOT WORK WITH VOSK. YOU NEED TO RUN MINI-LIBRISPEECH FROM START TO END, INCLUDING CHAIN MODEL TRAINING. You also need a CUDA GPU to train. If you do not have a GPU, try running Kaldi on Google Colab.

Model structure

Once you have trained the model, arrange the files according to the following layout (see en-us-aspire for details; a sketch that assembles this layout follows the list):

  • am/final.mdl - acoustic model
  • conf/mfcc.conf - MFCC config file. Make sure you take the mfcc_hires.conf version if you are using a hires model (most external ones are)
  • conf/model.conf - provides default decoding beams and silence phones. You have to create this file yourself; it is not present in the Kaldi model
  • ivector/final.dubm - i-vector files from the i-vector extractor (this folder is optional and only present if the model is trained with i-vectors)
  • ivector/final.ie
  • ivector/final.mat
  • ivector/splice.conf
  • ivector/global_cmvn.stats
  • ivector/online_cmvn.conf
  • graph/phones/word_boundary.int - from the graph
  • graph/HCLG.fst - the decoding graph, used if you are not running with lookahead
  • graph/HCLr.fst - use Gr.fst and HCLr.fst instead of one big HCLG.fst if you want to run rescoring
  • graph/Gr.fst
  • graph/phones.txt - from the graph
  • graph/words.txt - from the graph
  • rescore/G.carpa - CARPA rescoring is optional but helpful for big models. Usually located inside data/lang_test_rescore
  • rescore/G.fst - also optional if you want to use rescoring
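
If it helps, here is a rough Python sketch of that layout. It only creates the directory skeleton and writes a conf/model.conf; the output directory name is a placeholder, the Kaldi source paths depend on your recipe and are left for you to fill in, and the decoding values written to model.conf are example defaults you should adjust for your own model.

```python
import os
import shutil

MODEL_DIR = "vosk-model-custom"  # placeholder output directory name

# Target path inside the Vosk model -> file from your Kaldi experiment.
# The sources depend on your recipe, so they are left as None here; fill them in.
LAYOUT = {
    "am/final.mdl": None,                    # acoustic model
    "conf/mfcc.conf": None,                  # use mfcc_hires.conf for hires models
    "ivector/final.dubm": None,              # i-vector extractor files (optional block)
    "ivector/final.ie": None,
    "ivector/final.mat": None,
    "ivector/splice.conf": None,
    "ivector/global_cmvn.stats": None,
    "ivector/online_cmvn.conf": None,
    "graph/phones/word_boundary.int": None,  # from the graph
    "graph/HCLG.fst": None,                  # or HCLr.fst + Gr.fst if you run rescoring
    "graph/phones.txt": None,
    "graph/words.txt": None,
    "rescore/G.carpa": None,                 # optional rescoring files
    "rescore/G.fst": None,
}

for target, source in LAYOUT.items():
    dest = os.path.join(MODEL_DIR, target)
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    if source is not None:
        shutil.copy(source, dest)

# conf/model.conf is not produced by Kaldi, so write it yourself.
# These are only plausible example values for the decoding beams and
# silence phones; adjust them for your model.
with open(os.path.join(MODEL_DIR, "conf", "model.conf"), "w") as f:
    f.write(
        "--min-active=200\n"
        "--max-active=7000\n"
        "--beam=13.0\n"
        "--lattice-beam=6.0\n"
        "--acoustic-scale=1.0\n"
        "--endpoint.silence-phones=1:2:3:4:5:6:7:8:9:10\n"
    )
```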