Vosk/Kaldi German acoustic model for callcenter and broadcast transcription

There are many open source German models around already; unfortunately, most of them are not perfectly trained. Here is a review of the current state and some information about the new German model for Vosk.

Zamia

Zamia provides scripts for training Kaldi models as well as pretrained models. The pretrained models are pretty good, in particular the mobile one. Some points:

  • Uses a good dictionary with proper stress marks (well, except the glottal stop, which should not really be there)
  • Trained on only 400 hours, so it does not include the latest CommonVoice data. Honestly, CommonVoice data is pretty useless for accuracy, as you can see in the results below, so this is not a big problem.
  • The LM is not so good (probably trained on Wikipedia)
  • No rescoring setup
  • No updates since 2019

Tuda-de

Tuda provides a corpus for training German models as well as a smooth Kaldi training setup. The overall training process is ok, but there are some issues.

  • The dictionary needs love. First of all, it is case-sensitive, which is not so good for language modeling. Second, it has wrong stress marks. For example:
überzuckert ? y: b 6 'ts U k 6 t

here ' is a stress mark, and it should be on a vowel, since the stressed vowel is affected most of all, not on a consonant (a sketch of an automatic fix is shown after this list):

   überzuckert ? y: b 6 ts 'U k 6 t
  • No separate silence phone
  • The LM is not so good, and there is no rescoring setup.
  • Bad acoustic model network architecture (not TDNN-F), even though the scripts have a TDNN-F setup. The model is 300 MB and very slow.
  • No augmentation in the recipe, only speed perturbation
  • The default script configuration is too big for practical training (a very big LM, a 4000-dimensional layer in the acoustic model); one has to use smaller parameters to train reasonable models
  • Both the Zamia and Tuda models are trained for wideband audio and do not support telephony speech.
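
As mentioned above, a misplaced stress mark can be fixed mechanically by moving it onto the next vowel. Here is a minimal Python sketch of such a fix; the vowel inventory and the entry format are simplified illustrative assumptions, not the actual Tuda tooling:

    # Sketch: move a stress mark that landed on a consonant onto the
    # following vowel. The vowel set below is a simplified illustration.
    VOWELS = {"a", "e", "i", "o", "u", "E", "I", "O", "U", "Y", "6", "@",
              "a:", "e:", "i:", "o:", "u:", "y:", "E:", "2:", "aI", "aU", "OY"}

    def fix_stress(phones):
        out = list(phones)
        for i, p in enumerate(out):
            if p.startswith("'") and p[1:] not in VOWELS:
                out[i] = p[1:]                    # strip stress from the consonant
                for j in range(i + 1, len(out)):  # re-attach it to the next vowel
                    if out[j] in VOWELS:
                        out[j] = "'" + out[j]
                        break
                break
        return out

    print(" ".join(fix_stress("? y: b 6 'ts U k 6 t".split())))
    # prints: ? y: b 6 ts 'U k 6 t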

German ASR

German ASR is another project that trains on M-AILABS, SWC, and Tuda data. The scripts are more straightforward since they are based on the Librispeech recipe.

  • A better phonetic dictionary based on Wiktionary
  • No pretrained models; you have to run the scripts yourself
  • Overall a good pipeline
  • No augmentation

Deepspeech

There are several models: Deepspeech German provides a model for Deepspeech 0.6, and the Jaco Deepspeech Polyglot project provides a model for Deepspeech 0.7.

Vosk model

We have recently trained a Vosk German model, mostly following the Tuda recipe. Our model uses a proper big language model and a narrowband acoustic model, so it fits telephony applications. You can download the model here.

Vosk-server is also updated, so you can simply run:

docker run -d -p 2700:2700 alphacep/kaldi-de:latest
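
Once the container is running, any websocket client can stream audio to it. Below is a minimal Python sketch following the vosk-server websocket protocol, assuming the websockets package is installed and test.wav is a 16-bit mono PCM WAV file:

    import asyncio
    import json
    import wave

    import websockets

    async def transcribe(path):
        wf = wave.open(path, "rb")
        async with websockets.connect("ws://localhost:2700") as ws:
            # Tell the server the sample rate of the stream
            await ws.send(json.dumps({"config": {"sample_rate": wf.getframerate()}}))
            while True:
                data = wf.readframes(4000)
                if len(data) == 0:
                    break
                await ws.send(data)
                print(json.loads(await ws.recv()))  # partial and final results
            await ws.send('{"eof" : 1}')            # request the last result
            print(json.loads(await ws.recv()))

    asyncio.run(transcribe("test.wav"))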

We also have a small model for mobile applications, which is derived from the Zamia small model with an updated lookahead graph.
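
For offline use without the server, the models also work with the Vosk Python API directly. A minimal sketch, assuming the small model is unpacked into a directory next to the script and test.wav is 16-bit mono PCM:

    import json
    import wave

    from vosk import Model, KaldiRecognizer

    wf = wave.open("test.wav", "rb")
    model = Model("vosk-model-small-de")  # path to the unpacked model directory (adjust to your download)
    rec = KaldiRecognizer(model, wf.getframerate())

    while True:
        data = wf.readframes(4000)
        if len(data) == 0:
            break
        if rec.AcceptWaveform(data):
            print(json.loads(rec.Result())["text"])  # finalized utterances

    print(json.loads(rec.FinalResult())["text"])      # last chunk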

Test results

Here are the error rates on the Tuda-De test set and on our internal podcast transcription test:

Model                 Tuda Test WER   Podcast WER   Common Voice WER   Speed
Zamia                 11.48           31.12                            0.33xRT
Zamia Small / Vosk    14.81           37.46                            0.14xRT
Tuda pretrained       13.21           27.78                            0.9xRT
German ASR            12.80           ???                              ???
Deepspeech German     39.79           55.89                            Very slow
Deepspeech Polyglot   29.07           52.72                            Very slow
Vosk                  11.07           27.45                            0.33xRT
Vosk Rescoring        9.31            26.26                            0.33xRT
Vosk Mobile           13.75           30.67                            0.11xRT
Vosk DE 0.21          9.30            24.10         11.99              ???
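
For reference, the WER columns are the standard word error rate (word-level edit distance divided by the number of reference words), and NxRT means decoding takes N times the audio duration. A minimal sketch of the metric, with made-up example strings:

    def wer(ref, hyp):
        # Word-level Levenshtein distance divided by reference length
        r, h = ref.split(), hyp.split()
        d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
        for i in range(len(r) + 1):
            d[i][0] = i
        for j in range(len(h) + 1):
            d[0][j] = j
        for i in range(1, len(r) + 1):
            for j in range(1, len(h) + 1):
                cost = 0 if r[i - 1] == h[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution
        return d[len(r)][len(h)] / len(r)

    print(wer("das ist ein kleiner test", "das ist kein test"))
    # 0.4: one substitution and one deletion over 5 reference words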

Discussion

As you can see, all Kaldi German models perform more or less the same, since they are trained on about the same data. The models could still be improved significantly; we will post updates soon. Feel free to test and comment.

Update 2020-12

We have also retrained the small model vosk-model-small-de-0.3.15 with an optimized architecture, providing much lower latency and better accuracy at good speed. The results in the table are updated.

Update 2021-09

We have released an updated version 0.21 with a new language model and RNNLM rescoring. The numbers in the table are updated.

Update 2021-10

There is also Scribosermo, which claims 6.6% WER on CommonVoice. I have not tested it yet, but will do so soon. A test of Voxpopuli is also pending.