Models
We have two types of models: big and small. Small models are ideal for limited tasks in mobile applications. They can run on smartphones and Raspberry Pis, and they are also recommended for desktop applications. A small model is typically around 50 MB in size and requires about 300 MB of memory at runtime. Big models are for high-accuracy transcription on a server. They require up to 16 GB of memory since they apply advanced AI algorithms. Ideally you run them on high-end servers with a CPU such as an Intel i7 or a recent AMD Ryzen. On AWS you can take a look at c5a machines, or at similar machines in other clouds.
Most small models allow dynamic vocabulary reconfiguration. Big models are static: the vocabulary cannot be modified at runtime.
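A minimal Python sketch of what dynamic vocabulary reconfiguration can look like with the vosk Python package, assuming a small model has been unpacked into a local directory. The helper names `build_grammar` and `recognize_with_grammar`, the paths, and the phrase list are illustrative and not part of the text above. The grammar is passed to the `KaldiRecognizer` constructor as a JSON list of phrases, with `[unk]` catching everything else.

```python
import json
import wave


def build_grammar(phrases):
    """Return a JSON list of phrases, with "[unk]" appended as the catch-all."""
    items = [p.strip() for p in phrases if p.strip()]
    if "[unk]" not in items:
        items.append("[unk]")
    return json.dumps(items)


def recognize_with_grammar(model_dir, wav_path, phrases):
    """Decode a WAV file while restricting the vocabulary to `phrases`.

    Assumes a mono 16-bit PCM WAV file and a model with a dynamic graph.
    """
    # Imported lazily so build_grammar can be used without vosk installed.
    from vosk import Model, KaldiRecognizer

    model = Model(model_dir)
    with wave.open(wav_path, "rb") as wf:
        rec = KaldiRecognizer(model, wf.getframerate(), build_grammar(phrases))
        while True:
            data = wf.readframes(4000)
            if not data:
                break
            rec.AcceptWaveform(data)
    # FinalResult() returns a JSON string; the transcript is under "text".
    return json.loads(rec.FinalResult())["text"]
```

A call such as `recognize_with_grammar("vosk-model-small-en-us-0.15", "test.wav", ["one two three", "yes", "no"])` would then only produce words from those phrases or `[unk]`; `test.wav` is a hypothetical file name. With a static big model the same grammar is not expected to take effect.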
Model list
This is the list of models compatible with Vosk-API.
To add a new model here, create an issue on GitHub.
Model | Size | Word error rate/Speed | Notes | License |
---|---|---|---|---|
English | ||||
vosk-model-small-en-us-0.15 | 40M | 9.85 (librispeech test-clean) 10.38 (tedlium) | Lightweight wideband model for Android and RPi | Apache 2.0 |
vosk-model-en-us-0.22 | 1.8G | 5.69 (librispeech test-clean) 6.05 (tedlium) 29.78(callcenter) | Accurate generic US English model | Apache 2.0 |
vosk-model-en-us-0.22-lgraph | 128M | 7.82 (librispeech) 8.20 (tedlium) | Big US English model with dynamic graph | Apache 2.0 |
vosk-model-en-us-0.42-gigaspeech | 2.3G | 5.64 (librispeech test-clean) 6.24 (tedlium) 30.17 (callcenter) | Accurate generic US English model trained by Kaldi on Gigaspeech. Mostly for podcasts, not for telephony | Apache 2.0 |
English Other | Older Models | |||
vosk-model-en-us-daanzu-20200905 | 1.0G | 7.08 (librispeech test-clean) 8.25 (tedlium) | Wideband model for dictation from Kaldi-active-grammar project | AGPL |
vosk-model-en-us-daanzu-20200905-lgraph | 129M | 8.20 (librispeech test-clean) 9.28 (tedlium) | Wideband model for dictation from Kaldi-active-grammar project with configurable graph | AGPL |
vosk-model-en-us-librispeech-0.2 | 845M | TBD | Repackaged Librispeech model from Kaldi, not very accurate | Apache 2.0 |
vosk-model-small-en-us-zamia-0.5 | 49M | 11.55 (librispeech test-clean) 12.64 (tedlium) | Repackaged Zamia model f_250, mainly for research | LGPL-3.0 |
vosk-model-en-us-aspire-0.2 | 1.4G | 13.64 (librispeech test-clean) 12.89 (tedlium) 33.82(callcenter) | Kaldi original ASPIRE model, not very accurate | Apache 2.0 |
vosk-model-en-us-0.21 | 1.6G | 5.43 (librispeech test-clean) 6.42 (tedlium) 40.63(callcenter) | Wideband model previous generation | Apache 2.0 |
Indian English | ||||
vosk-model-en-in-0.5 | 1G | 36.12 (NPTEL Pure) | Generic Indian English model for telecom and broadcast | Apache 2.0 |
vosk-model-small-en-in-0.4 | 36M | 49.05 (NPTEL Pure) | Lightweight Indian English model for mobile applications | Apache 2.0 |
Chinese | ||||
vosk-model-small-cn-0.22 | 42M | 23.54 (SpeechIO-02) 38.29 (SpeechIO-06) 17.15 (THCHS) | Lightweight model for Android and RPi | Apache 2.0 |
vosk-model-cn-0.22 | 1.3G | 13.98 (SpeechIO-02) 27.30 (SpeechIO-06) 7.43 (THCHS) | Big generic Chinese model for server processing | Apache 2.0 |
Chinese Other | ||||
vosk-model-cn-kaldi-multicn-0.15 | 1.5G | 17.44 (SpeechIO-02) 9.56 (THCHS) | Original Wideband Kaldi multi-cn model from Kaldi with Vosk LM | Apache 2.0 |
Russian | ||||
vosk-model-ru-0.42 | 1.8G | 4.5 (our audiobooks) 11.1 (open_stt audiobooks) 19.5 (open_stt youtube) 36.0 (openstt calls) 4.4 (golos crowd) 17.9 (sova devices) | Big mixed band Russian model for servers | Apache 2.0 |
vosk-model-small-ru-0.22 | 45M | 22.71 (openstt audiobooks) 31.97 (openstt youtube) 29.89 (sova devices) 11.79 (golos crowd) | Lightweight wideband model for Android/iOS and RPi | Apache 2.0 |
Russian Other | ||||
vosk-model-ru-0.22 | 1.5G | 5.74 (our audiobooks) 13.35 (open_stt audiobooks) 20.73 (open_stt youtube) 37.38 (openstt calls) 8.65 (golos crowd) 19.71 (sova devices) | Big mixed band Russian model for servers | Apache 2.0 |
vosk-model-ru-0.10 | 2.5G | 5.71 (our audiobooks) 16.26 (open_stt audiobooks) 26.20 (public_youtube_700_val open_stt) 40.15 (asr_calls_2_val open_stt) | Big narrowband Russian model for servers | Apache 2.0 |
French | ||||
vosk-model-small-fr-0.22 | 41M | 23.95 (cv test) 19.30 (mtedx) 27.25 (podcast) | Lightweight wideband model for Android/iOS and RPi | Apache 2.0 |
vosk-model-fr-0.22 | 1.4G | 14.72 (cv test) 11.64 (mls) 13.10 (mtedx) 21.61 (podcast) 13.22 (voxpopuli) | Big accurate model for servers | Apache 2.0 |
French Other | ||||
vosk-model-small-fr-pguyot-0.3 | 39M | 37.04 (cv test) 28.72 (mtedx) 37.46 (podcast) | Lightweight wideband model for Android and RPi trained by Paul Guyot | CC-BY-NC-SA 4.0 |
vosk-model-fr-0.6-linto-2.2.0 | 1.5G | 16.19 (cv test) 16.44 (mtedx) 23.77 (podcast) 0.4xRT | Model from LINTO project | AGPL |
German | ||||
vosk-model-de-0.21 | 1.9G | 9.83 (Tuda-de test), 24.00 (podcast) 12.82 (cv-test) 12.42 (mls) 33.26 (mtedx) | Big German model for telephony and server | Apache 2.0 |
vosk-model-de-tuda-0.6-900k | 4.4G | 9.48 (Tuda-de test), 25.82 (podcast) 4.97 (cv-test) 11.01 (mls) 35.20 (mtedx) | Latest big wideband model from Tuda-DE project | Apache 2.0 |
vosk-model-small-de-zamia-0.3 | 49M | 14.81 (Tuda-de test), 37.46 (podcast) | Zamia f_250 small model repackaged (not recommended) | LGPL-3.0 |
vosk-model-small-de-0.15 | 45M | 13.75 (Tuda-de test), 30.67 (podcast) | Lightweight wideband model for Android and RPi | Apache 2.0 |
Spanish | ||||
vosk-model-small-es-0.42 | 39M | 16.02 (cv test) 16.72 (mtedx test) 11.21 (mls) | Lightweight wideband model for Android and RPi | Apache 2.0 |
vosk-model-es-0.42 | 1.4G | 7.50 (cv test) 10.05 (mtedx test) 5.84 (mls) | Big model for Spanish | Apache 2.0 |
Portuguese/Brazilian Portuguese | ||||
vosk-model-small-pt-0.3 | 31M | 68.92 (coraa dev) 32.60 (cv test) | Lightweight wideband model for Android and RPi | Apache 2.0 |
vosk-model-pt-fb-v0.1.1-20220516_2113 | 1.6G | 54.34 (coraa dev) 27.70 (cv test) | Big model from FalaBrazil | GPLv3.0 |
Greek | ||||
vosk-model-el-gr-0.7 | 1.1G | TBD | Big narrowband Greek model for server processing, not extremely accurate though | Apache 2.0 |
Turkish | ||||
vosk-model-small-tr-0.3 | 35M | TBD | Lightweight wideband model for Android and RPi | Apache 2.0 |
Vietnamese | ||||
vosk-model-small-vn-0.4 | 32M | 15.70 (Vivos test) | Lightweight Vietnamese model | Apache 2.0 |
vosk-model-vn-0.4 | 78M | 15.70 (Vivos test) | Bigger Vietnamese model for server | Apache 2.0 |
Italian | ||||
vosk-model-small-it-0.22 | 48M | 16.88 (cv test) 25.87 (mls) 17.01 (mtedx) | Lightweight model for Android and RPi | Apache 2.0 |
vosk-model-it-0.22 | 1.2G | 8.10 (cv test) 15.68 (mls) 11.23 (mtedx) | Big generic Italian model for servers | Apache 2.0 |
Dutch | ||||
vosk-model-small-nl-0.22 | 39M | 22.45 (cv test) 26.80 (tv) 25.84 (mls) 24.09 (voxpopuli) | Lightweight model for Dutch | Apache 2.0 |
Dutch Other | ||||
vosk-model-nl-spraakherkenning-0.6 | 860M | 20.40 (cv test) 32.64 (tv) 17.73 (mls) 19.96 (voxpopuli) | Medium Dutch model from Kaldi_NL | CC-BY-NC-SA |
vosk-model-nl-spraakherkenning-0.6-lgraph | 100M | 22.82 (cv test) 34.01 (tv) 18.81 (mls) 21.01 (voxpopuli) | Smaller model with dynamic graph | CC-BY-NC-SA |
Catalan | ||||
vosk-model-small-ca-0.4 | 42M | TBD | Lightweight wideband model for Android and RPi for Catalan | Apache 2.0 |
Arabic | ||||
vosk-model-ar-mgb2-0.4 | 318M | 16.40 (MGB-2 dev set) | Repackaged Arabic model trained on MGB2 dataset from Kaldi | Apache 2.0 |
vosk-model-ar-0.22-linto-1.1.0 | 1.3G | 52.87 (cv test) 28.50 (MGB-2 dev set) 1.0xRT | Big model from LINTO project | AGPL |
Arabic Tunisian | ||||
vosk-model-small-ar-tn-0.1-linto | 158M | 16.06 (TARIC set) | Small Arabic Tunisian model from Linagora | Apache 2.0 |
vosk-model-ar-tn-0.1-linto | 517M | 16.06 (TARIC set) | Arabic Tunisian model from Linagora | Apache 2.0 |
Farsi | ||||
vosk-model-small-fa-0.4 | 47M | TBD | Lightweight wideband model for Android and RPi for Farsi (Persian) | Apache 2.0 |
vosk-model-fa-0.5 | 1G | TBD | Model with large vocabulary, not yet accurate but better than before (Persian) | Apache 2.0 |
vosk-model-small-fa-0.5 | 60M | TBD | Bigger small model for desktop application (Persian) | Apache 2.0 |
Filipino | ||||
vosk-model-tl-ph-generic-0.6 | 320M | 18.87 (FLEURS-dev) 18.61 (FLEURS-test) 97.9 (BABEL-dev) 41.31 (MATERIAL-dev) | Medium wideband model for Filipino (Tagalog) by feddybear | CC-BY-NC-SA 4.0 |
Ukrainian | ||||
vosk-model-small-uk-v3-nano | 73M | TBD | Nano model from Speech Recognition for Ukrainian | Apache 2.0 |
vosk-model-small-uk-v3-small | 133M | TBD | Small model from Speech Recognition for Ukrainian | Apache 2.0 |
vosk-model-uk-v3 | 343M | TBD | Bigger model from Speech Recognition for Ukrainian | Apache 2.0 |
vosk-model-uk-v3-lgraph | 325M | TBD | Big dynamic model from Speech Recognition for Ukrainian | Apache 2.0 |
Kazakh | ||||
vosk-model-small-kz-0.15 | 42M | 9.60(dev) 8.32(test) | Small mobile model from SAIDA_Kazakh | Apache 2.0 |
vosk-model-kz-0.15 | 378M | 8.06(dev) 6.81(test) | Bigger wideband model SAIDA_Kazakh | Apache 2.0 |
Swedish | ||||
vosk-model-small-sv-rhasspy-0.15 | 289M | TBD | Repackaged model from Rhasspy project | MIT |
Japanese | ||||
vosk-model-small-ja-0.22 | 48M | 9.52(csj CER) 17.07(ted10k CER) | Lightweight wideband model for Japanese | Apache 2.0 |
vosk-model-ja-0.22 | 1G | 8.40(csj CER) 13.91(ted10k CER) | Big model for Japanese | Apache 2.0 |
Esperanto | ||||
vosk-model-small-eo-0.42 | 42M | 7.24 (CV Test) | Lightweight model for Esperanto | Apache 2.0 |
Hindi | ||||
vosk-model-small-hi-0.22 | 42M | 20.89 (IITM Challenge) 24.72 (MUCS Challenge) | Lightweight model for Hindi | Apache 2.0 |
vosk-model-hi-0.22 | 1.5G | 14.85 (CV Test) 14.83 (IITM Challenge) 13.11 (MUCS Challenge) | Big accurate model for servers | Apache 2.0 |
Czech | ||||
vosk-model-small-cs-0.4-rhasspy | 44M | 21.29 (CV Test) | Lightweight model for Czech from Rhasspy project | MIT |
Polish | ||||
vosk-model-small-pl-0.22 | 50M | 18.36 (CV Test) 16.88 (MLS Test) 11.55 (Voxpopuli Test) | Lightweight model for Polish | Apache 2.0 |
Uzbek | ||||
vosk-model-small-uz-0.22 | 49M | 13.54 (CV Test) 12.92 (IS2AI USC test) | Lightweight model for Uzbek | Apache 2.0 |
Korean | ||||
vosk-model-small-ko-0.22 | 82M | 28.1 (Zeroth Test) | Lightweight model for Korean | Apache 2.0 |
Breton | ||||
vosk-model-br-0.8 | 70M | 36.4 (MCV11 Test) | Breton model from vosk-br project | MIT |
Gujarati | ||||
vosk-model-gu-0.42 | 700M | 16.45 (MS Test) | Big Gujarati model | Apache 2.0 |
vosk-model-small-gu-0.42 | 100M | 20.49 (MS Test) | Lightweight model for Gujarati | Apache 2.0 |
Tajik | ||||
vosk-model-tg-0.22 | 327M | 41.1 (Fleurs test) | Big Tajik model | Apache 2.0 |
vosk-model-small-tg-0.22 | 50M | 38.4 (Fleurs test) | Lightweight model for Tajik | Apache 2.0 |
Speaker identification model | ||||
vosk-model-spk-0.4 | 13M | TBD | Model for speaker identification, should work for all languages | Apache 2.0 |
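The accuracy figures in the table above are word error rates on the test set named in parentheses (character error rates where marked CER); values such as 0.4xRT describe speed instead. As a rough illustration of how such a number is obtained, the sketch below computes WER in percent under the standard definition: substitutions, deletions and insertions divided by the number of reference words. The function name and the sample sentences are illustrative, and the exact scoring and text normalization used for the table are not stated here.

```python
def word_error_rate(reference, hypothesis):
    """Return WER in percent: word-level edit distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# One substitution (sat -> sit) and one deletion (the) over 6 reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Lower is better, so an entry of 5.69 means that, on average, fewer than 6 word errors occur per 100 reference words on that test set.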
Punctuation models
For punctuation and case restoration we recommend the models trained with https://github.com/benob/recasepunc
Model | Size | License |
---|---|---|
English | ||
vosk-recasepunc-en-0.22 | 1.6G | Apache 2.0 |
Russian | ||
vosk-recasepunc-ru-0.22 | 1.6G | Apache 2.0 |
German | ||
vosk-recasepunc-de-0.21 | 1.1G | Apache 2.0 |
Other models
Other places where you can check for models which might be compatible:
- https://kaldi-asr.org/models.html - variety of models from Kaldi - librispeech, aspire, chinese models
- https://github.com/daanzu/kaldi-active-grammar/blob/master/docs/models.md - Big dictation models for English
- https://github.com/uhh-lt/vosk-model-tuda-de - German models
- https://github.com/german-asr/kaldi-german - Another German project
- https://zamia-speech.org/asr/ - German and English model from Zamia
- https://github.com/pguyot/zamia-speech/releases - French models for Zamia
- https://github.com/opensource-spraakherkenning-nl/Kaldi_NL - Dutch model
- https://montreal-forced-aligner.readthedocs.io/en/latest/pretrained_models.html (GMM models, not compatible but might be still useful)
- https://github.com/goodatlas/zeroth - Korean Kaldi (just a recipe and data to train)
- https://github.com/undertheseanlp/automatic_speech_recognition - Vietnamese Kaldi project
- https://doc.linto.ai/#/services/linstt - LINTO project by Linagora with French, English and Arabic models
- https://community.rhasspy.org/ - Rhasspy (some Kaldi models for Czech, probably even more)
- https://github.com/feddybear/flipside_ph - Filipino model project by Federico Ang
- https://github.com/alumae/kiirkirjutaja - Estonian Speech Recognition project with Vosk models
- https://github.com/falabrasil/kaldi-br - Portuguese models from FalaBrasil project
- https://github.com/egorsmkv/speech-recognition-uk - Ukrainian ASR project with Vosk models
- https://github.com/Appen/UHV-OTS-Speech - repository from Appen for Scalable Data Annotation Pipeline for High-Quality Large Speech Datasets Development
- https://github.com/vistec-AI/commonvoice-th - Thai models trained on CommonVoice
- https://github.com/gweltou/vosk-br - Vosk for Breton
Training your own model
You can train your model with the Kaldi toolkit. The training is pretty standard: you need a tdnn nnet3 model with i-vectors. You can check the Vosk recipe for details:
https://github.com/alphacep/vosk-api/tree/master/training
- For smaller mobile models watch the number of parameters
- Train the model without pitch. Pitch might help with a small amount of data, but for a large database it gives no advantage while complicating the processing and increasing response time.
- Train i-vectors of dimension 40 instead of the standard 100 to save memory in mobile models.
- Many Kaldi recipes are overcomplicated and do many unnecessary steps
- PLEASE NOTE THAT THE SIMPLE GMM MODEL YOU TRAIN WITH THE “KALDI FOR DUMMIES” TUTORIAL DOES NOT WORK WITH VOSK. YOU NEED TO RUN THE VOSK RECIPE FROM START TO END, INCLUDING CHAIN MODEL TRAINING. You also need a CUDA GPU to train. If you do not have a GPU, try to run Kaldi on Google Colab.
Model structure
Once you have trained the model, arrange the files according to the following layout (see en-us-aspire for details):
- am/final.mdl - acoustic model
- am/global_cmvn.stats - required for online-cmvn models, if present enables online cmvn on features.
- conf/mfcc.conf - mfcc config file. Make sure you take mfcc_hires.conf version if you are using hires model (most external ones)
- conf/model.conf - provide default decoding beams and silence phones. you have to create this file yourself, it is not present in kaldi model
- conf/pitch.conf - optional file to create feature pipeline with pitch features. Might be missing if model doesn’t use pitch
- ivector/final.dubm - take ivector files from ivector extractor (optional folder if the model is trained with ivectors)
- ivector/final.ie
- ivector/final.mat
- ivector/splice.conf
- ivector/global_cmvn.stats
- ivector/online_cmvn.conf
- graph/phones/word_boundary.int - from the graph
- graph/HCLG.fst - this is the decoding graph, if you are not using lookahead
- graph/HCLr.fst - use Gr.fst and HCLr.fst instead of one big HCLG.fst if you want to run rescoring
- graph/Gr.fst
- graph/phones.txt - from the graph
- graph/words.txt - from the graph
- rescore/G.carpa - carpa rescoring is optional but helpful in big models. Usually located inside data/lang_test_rescore
- rescore/G.fst - also optional if you want to use rescoring, also used for interpolation with RNNLM
- rnnlm/feat_embedding.final.mat - RNNLM embedding for rescoring. Optional if you have it.
- rnnlm/special_symbol_opts.conf - RNNLM model options
- rnnlm/final.raw - RNNLM model
- rnnlm/word_feats.txt - RNNLM model word feats
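Before loading a freshly packaged model, it can help to check the directory against the layout above. The following Python sketch does that with the standard library only. The function name `check_model_layout` is illustrative, and the split into required and optional files is an assumption read off the descriptions above (files described as optional or conditional are treated as optional), not a rule taken from Vosk itself.

```python
from pathlib import Path

# Assumption: files the layout describes without calling them optional.
REQUIRED = ("am/final.mdl", "conf/mfcc.conf", "conf/model.conf")

# Files the layout marks as optional or conditional; one representative
# file is checked per optional folder.
OPTIONAL = (
    "am/global_cmvn.stats",
    "conf/pitch.conf",
    "ivector/final.ie",
    "rescore/G.carpa",
    "rescore/G.fst",
    "rnnlm/final.raw",
)


def check_model_layout(model_dir):
    """Return (missing, optional_present) for a model directory."""
    root = Path(model_dir)
    missing = [p for p in REQUIRED if not (root / p).is_file()]

    # The graph is either one HCLG.fst or the HCLr.fst + Gr.fst pair.
    single_graph = (root / "graph/HCLG.fst").is_file()
    split_graph = all(
        (root / p).is_file() for p in ("graph/HCLr.fst", "graph/Gr.fst")
    )
    if not (single_graph or split_graph):
        missing.append("graph/HCLG.fst (or graph/HCLr.fst with graph/Gr.fst)")

    optional_present = [p for p in OPTIONAL if (root / p).is_file()]
    return missing, optional_present
```

An empty `missing` list only means the expected files exist; it does not verify that their contents are consistent with each other.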