Written by Nickolay Shmyrev
Vosk/Kaldi German acoustic model for call center and broadcast transcription
There are many open source German models around already; unfortunately, most of them are not perfectly trained. Here is a review of the current
state and some information about the new German model for Vosk.
Zamia
Zamia provides scripts for training Kaldi models as well as pretrained models. The pretrained
models are pretty good, in particular the mobile one. Some points:
- Uses a good dictionary with proper stress marks (well, except the glottal stop, which should not really be there)
- Trained on only 400 hours, so it doesn't include the latest CommonVoice data. Honestly, CommonVoice data is pretty useless for accuracy, as you can see in the results below, so this is not a big problem.
- The LM is not so good (probably trained on Wikipedia)
- No rescoring setup
- No updates since 2019
Tuda-de
Tuda provides a corpus for
training German models as well as a smooth Kaldi model training setup.
The overall training process is ok, but there are some issues:
- Dictionary needs love. First of all, it is case-sensitive, which is not so good for language modeling. Second, it has wrong stress marks (a sketch of an automatic fix is shown after this list). For example:
überzuckert ? y: b 6 'ts U k 6 t
Here ' is the stress mark, and it should be on the vowel, since the stressed vowel is affected most of all, not on the consonant. The correct entry is:
überzuckert ? y: b 6 ts 'U k 6 t
- No separate silence phone
- Not so good LM, no rescoring
- Bad AM network architecture in the pretrained model (not TDNN-F), while the scripts have a TDNN-F setup. The model is 300 MB and very slow.
- No augmentation in the recipe, only speed perturbation
- The default scripts configuration is too huge for proper training (very big LM, a 4000-dim layer in the acoustic model). One has to use smaller parameters
to train reasonable models
- Both Zamia and Tuda models are trained for wideband audio and don't support telephony speech.
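Coming back to the dictionary issue above: such misplaced stress marks can be fixed automatically. Below is a minimal sketch in Python; the `fix_stress` helper and the vowel set are my own illustration, not part of the Tuda scripts, and the vowel list covers only common German SAMPA symbols:

```python
# Illustrative vowel inventory for German SAMPA (not exhaustive).
VOWELS = {"a", "e", "i", "o", "u", "y", "E", "I", "O", "U", "Y",
          "a:", "e:", "i:", "o:", "u:", "y:", "E:", "2", "2:", "6", "9", "@"}

def fix_stress(pron):
    """Move a stress mark that sits on a consonant onto the next vowel."""
    phones = pron.split()
    for i, p in enumerate(phones):
        if p.startswith("'") and p.lstrip("'") not in VOWELS:
            phones[i] = p.lstrip("'")           # drop the misplaced mark
            for j in range(i + 1, len(phones)):
                if phones[j] in VOWELS:         # re-attach it to the vowel
                    phones[j] = "'" + phones[j]
                    break
            break
    return " ".join(phones)

print(fix_stress("? y: b 6 'ts U k 6 t"))  # -> ? y: b 6 ts 'U k 6 t
```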
German ASR
German ASR is another project that trains on M-AILABS, SWC and Tuda data. The scripts
are more straightforward since they are based on the Librispeech recipe.
- Better phonetic dictionary based on Wiktionary
- No pretrained models, you have to run the scripts
- Overall good pipeline
- No augmentation
Deepspeech
There are several models: one is Deepspeech
German with a model for DeepSpeech 0.6, another is the Jaco
Deepspeech Polyglot model with a model for DeepSpeech 0.7.
Vosk model
We have recently trained a Vosk German model, mostly following the Tuda recipe. Our model
uses a proper big language model and a narrowband acoustic model, so it fits
telephony. You can download the model here.
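For reference, here is a minimal decoding sketch with the Python vosk package; the file name and model directory are placeholders, and the audio is assumed to be 8 kHz 16-bit mono WAV to match the narrowband model:

```python
import json
import wave

from vosk import Model, KaldiRecognizer  # pip install vosk

wf = wave.open("call.wav", "rb")              # 8 kHz 16-bit mono PCM WAV
model = Model("vosk-model-de")                # path to the unpacked model
rec = KaldiRecognizer(model, wf.getframerate())

while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):              # a full utterance is ready
        print(json.loads(rec.Result())["text"])

print(json.loads(rec.FinalResult())["text"])  # flush the last utterance
```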
Vosk-server is also updated, so you can simply run:
docker run -d -p 2700:2700 alphacep/kaldi-de:latest
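Once the container is up, a client streams raw PCM over the websocket on port 2700. A minimal sketch, assuming 8 kHz 16-bit mono PCM in `call.raw` (the file name is a placeholder):

```python
import asyncio
import json

import websockets  # pip install websockets

async def transcribe(path):
    async with websockets.connect("ws://localhost:2700") as ws:
        # Tell the server the sample rate of the raw audio we stream
        await ws.send(json.dumps({"config": {"sample_rate": 8000}}))
        with open(path, "rb") as f:
            while chunk := f.read(8000):
                await ws.send(chunk)                # stream a PCM chunk
                print(json.loads(await ws.recv()))  # partial results
        await ws.send('{"eof" : 1}')                # signal end of stream
        print(json.loads(await ws.recv()))          # final result

asyncio.run(transcribe("call.raw"))
```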
We also have a small model for mobile applications, derived from the Zamia small model with an updated lookahead graph.
Test results
Here are the error rates on the Tuda-De test set, on our internal podcast transcription test, and on Common Voice, together with decoding speed:
| Model | Tuda Test WER | Podcast WER | Common Voice WER | Speed |
|---|---|---|---|---|
| Zamia | 11.48 | 31.12 | | 0.33xRT |
| Zamia Small / Vosk | 14.81 | 37.46 | | 0.14xRT |
| Tuda pretrained | 13.21 | 27.78 | | 0.9xRT |
| German ASR | 12.80 | ??? | | ??? |
| Deepspeech German | 39.79 | 55.89 | | Very slow |
| Deepspeech Polyglot | 29.07 | 52.72 | | Very slow |
| Vosk | 11.07 | 27.45 | | 0.33xRT |
| Vosk Rescoring | 9.31 | 26.26 | | 0.33xRT |
| Vosk Mobile | 13.75 | 30.67 | | 0.11xRT |
| Vosk DE 0.21 | 9.30 | 24.10 | 11.99 | ??? |
Discussion
As you can see, all the Kaldi German models perform more or less the same, since they
are trained on about the same data. The models could still be improved significantly;
we will post updates soon. Feel free to test and comment.
Update 2020-12
We have retrained the small model vosk-model-small-de-0.3.15
with an optimized architecture, providing much lower latency together with the best
accuracy at good speed. The results in the table are updated.
Update 2021-09
We have released an updated version 0.21 with a new language model and RNNLM rescoring. The numbers in the table are updated.
Update 2021-10
There is also
Scribosermo,
claimed to have 6.6% WER on CommonVoice. I didn't test it yet, but will test
soon. A test of
Voxpopuli is also pending.