Vosk/Kaldi German acoustic model for call center and broadcast transcription
There are already many open source German models around; unfortunately, most of them are not trained as well as they could be. Here is a review of the current state and some information about the new German model for Vosk.
Zamia provides scripts for training Kaldi models as well as pretrained models. The pretrained models are pretty good, the mobile one in particular. Some points:
- Uses a good dictionary with proper stress marks (well, except for the glottal stop, which really should not be there)
- Trained on only 400 hours, so it doesn’t include the latest CommonVoice data. Honestly, CommonVoice data is pretty useless for accuracy, as you can see in the results below, so this is not a big problem.
- LM is not so good (probably trained on Wikipedia)
- No rescoring setup
- No updates since 2019
Tuda provides a corpus for training German models as well as a smooth Kaldi model training setup. The overall training process is OK, but there are some small issues.
- Dictionary needs love. First of all, it is case-sensitive, which is not so good for language modeling. Second, it has wrong stress marks. For example:
überzuckert ? y: b 6 'ts U k 6 t
Here ' is the stress mark. It should be on the vowel, not on the consonant, since the stressed vowel is the sound affected most of all. The entry should read:
überzuckert ? y: b 6 ts 'U k 6 t
- No separate silence phone
- Not so good LM, no rescoring.
- Bad AM network architecture (not TDNN-F), even though the scripts have a TDNN-F setup. The model is 300 MB and very slow.
- No augmentation in the recipe, only speed perturbation
- Default script configuration is too large for proper training (very big LM, 4000-dim layer in the acoustic model). One has to use smaller parameters to train reasonable models.
- Both Zamia and Tuda models are trained for wideband audio and don’t support telephony speech.
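The two dictionary fixes mentioned in the list above (case folding and moving the stress mark onto the vowel) can be sketched in a few lines of Python. This is only an illustration, not part of the Tuda scripts: it assumes each entry is a word followed by space-separated phones, that ' prefixes the stressed phone, and that the vowel inventory below matches the phone set. `VOWELS`, `move_stress_to_vowel` and `fix_entry` are hypothetical names.

```python
# Assumed SAMPA-like vowel inventory; adjust it to the real phone set.
VOWELS = {"a", "a:", "e", "e:", "E", "E:", "i", "i:", "I", "o", "o:", "O",
          "u", "u:", "U", "y", "y:", "Y", "2:", "9", "@", "6",
          "aI", "aU", "OY"}


def move_stress_to_vowel(phones):
    """Move a stress mark that sits on a consonant to the next vowel."""
    out = []
    pending = False  # True while a stress mark is waiting for a vowel
    for phone in phones:
        stressed = phone.startswith("'")
        base = phone[1:] if stressed else phone
        if base in VOWELS:
            out.append("'" + base if (stressed or pending) else base)
            pending = False
        else:
            out.append(base)  # consonants never carry the mark
            pending = pending or stressed
    # If no vowel follows a marked consonant, the mark is dropped.
    return out


def fix_entry(line):
    """Lowercase the word and repair stress placement in one entry."""
    word, *phones = line.split()
    return " ".join([word.lower()] + move_stress_to_vowel(phones))


print(fix_entry("überzuckert ? y: b 6 'ts U k 6 t"))
```

For the entry shown earlier this prints the corrected form with the mark on U. A real cleanup would also have to merge entries that become duplicates after lowercasing.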
German ASR is another project, which trains on M-AILABS, SwC and Tuda data. The scripts are more straightforward since they are based on the LibriSpeech recipe.
- Better phonetic dictionary based on Wiktionary
- No pretrained models, you have to run the scripts
- Overall good pipeline
- No augmentation
We have recently trained a Vosk German model, mostly following the Tuda recipe. Our model uses a proper big language model and a narrowband acoustic model, so it fits telephony. You can download the model here.
Vosk-server is also updated, so you can simply run:
docker run -d -p 2700:2700 alphacep/kaldi-de:latest
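Once the container is up, a client can stream audio to it. Below is a minimal sketch, assuming the server listens on ws://localhost:2700 and speaks the websocket protocol used by the vosk-server sample clients (a JSON config message, binary audio chunks, then an eof message). It needs the third-party websockets package and expects a 16-bit mono PCM WAV file. `config_message` and `transcribe` are hypothetical helper names.

```python
import asyncio
import json
import wave


def config_message(sample_rate):
    """Build the JSON message that tells the server the audio sample rate."""
    return json.dumps({"config": {"sample_rate": sample_rate}})


async def transcribe(wav_path, uri="ws://localhost:2700"):
    """Stream a WAV file to the server and return the recognized text."""
    import websockets  # third-party: pip install websockets

    texts = []
    with wave.open(wav_path, "rb") as wf:
        async with websockets.connect(uri) as ws:
            await ws.send(config_message(wf.getframerate()))
            frames_per_chunk = int(wf.getframerate() * 0.2)  # 0.2 s of audio
            while True:
                data = wf.readframes(frames_per_chunk)
                if not data:
                    break
                await ws.send(data)
                reply = json.loads(await ws.recv())
                if reply.get("text"):  # final result for one utterance
                    texts.append(reply["text"])
            await ws.send('{"eof" : 1}')  # flush the last utterance
            reply = json.loads(await ws.recv())
            if reply.get("text"):
                texts.append(reply["text"])
    return " ".join(texts)

# Usage, with a placeholder file name:
#   print(asyncio.run(transcribe("call.wav")))
```

Partial results arrive under a different key and are simply ignored here; a live application would probably want to display them.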
We also have a small model for mobile applications, derived from the Zamia small model with an updated lookahead graph.
Here are the error rates on the TUDA-De test set and on our internal podcast transcription test:
| Model | Tuda Test WER (%) | Podcast WER (%) | Speed |
|---|---|---|---|
| Zamia Small / Vosk | 14.81 | 37.46 | 0.14xRT |
| Deepspeech German | 39.79 | 55.89 | Very slow |
| Deepspeech Polyglot | 29.07 | 52.72 | Very slow |
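For readers who want to reproduce this kind of number on their own recordings, here is a minimal sketch of the two metrics. It assumes WER is the word-level edit distance divided by the reference length, in percent, and that a speed like 0.14xRT means processing time divided by audio duration. The function names are hypothetical, and this is not the scoring script used for the table.

```python
def word_error_rate(reference, hypothesis):
    """WER in percent: word-level edit distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return 100.0 * prev[-1] / len(ref)


def real_time_factor(processing_seconds, audio_seconds):
    """Processing time relative to audio length; below 1 is faster than real time."""
    return processing_seconds / audio_seconds


print(word_error_rate("das ist ein test", "das ist test"))  # one deleted word out of four
print(real_time_factor(14.0, 100.0))
```

Over a whole test set, the errors and reference words are usually summed across all utterances before dividing, rather than averaging per-utterance WER.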
So you see, all Kaldi German models are more or less the same, since they use about the same data. The models could still be improved significantly; we will post updates soon. Feel free to test and comment.
We have also retrained the small model, vosk-model-small-de-0.3.15, with an optimized architecture that provides much lower latency as well as the best accuracy at a good speed. The results in the table are updated.