Things about ESPnet
The ESPnet toolkit has seen some great recent developments, as described in the paper "Recent Developments on ESPnet Toolkit Boosted by Conformer".
The results include great accuracy numbers, so I recently tried ESPnet on LibriSpeech. Some secrets never mentioned in the paper:
- The default LibriSpeech LM recipe trains for 60 days on 4 GPUs. You can get that down to 10 days if you stop training earlier.
- The default LibriSpeech decoder uses beam 60 and takes a whole day to decode test-clean.
- People recommend beam 20, which results in 4 hours of decoding.
- For something practical you need beam 2, and then you get a significant drop in accuracy, even worse than DNN-HMM systems.
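To make the beam trade-off concrete, here is a toy beam search in plain Python. It is only a sketch, not ESPnet's decoder: `toy_model`, `beam_search` and the probabilities are made up for illustration, and each call to `toy_model` stands in for one decoder forward pass. It shows that the number of model calls grows with the beam, and that a narrow beam can miss the better hypothesis.

```python
import math

def toy_model(prefix):
    """Hypothetical next-token log-probabilities given the tokens so far."""
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"x": 0.55, "y": 0.45},
        ("b",): {"x": 0.9, "y": 0.1},
    }
    return {tok: math.log(p) for tok, p in table[prefix].items()}

def beam_search(model, num_steps, beam_size):
    """Return (best_tokens, best_log_prob, model_calls) for a fixed-length search."""
    beams = [((), 0.0)]  # each entry is (token tuple, accumulated log-probability)
    model_calls = 0
    for _ in range(num_steps):
        candidates = []
        for tokens, score in beams:
            log_probs = model(tokens)  # one "forward pass" per live hypothesis
            model_calls += 1
            for tok, lp in log_probs.items():
                candidates.append((tokens + (tok,), score + lp))
        # Keep only the beam_size highest-scoring hypotheses.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    best_tokens, best_score = beams[0]
    return best_tokens, best_score, model_calls

narrow = beam_search(toy_model, num_steps=2, beam_size=1)
wide = beam_search(toy_model, num_steps=2, beam_size=2)
print(narrow[0], round(math.exp(narrow[1]), 2), narrow[2])
print(wide[0], round(math.exp(wide[1]), 2), wide[2])
```

With beam 1 the search commits to "a" early and ends up with the less likely sequence, while beam 2 keeps "b" alive and finds the better one at the cost of more model calls. Real decoders batch these calls, so the wall-clock cost does not have to grow exactly linearly with the beam.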
Essentially, the great Conformer numbers reported in the READMEs are for extremely
slooow systems. You can’t even imagine how slow they are. Also, the models
are big: acoustic models tend to be about 300-500 MB in size.
Speed is not actually a problem, though: you can still use those big models as
teachers in a distillation or teacher-student setup and train more accurate
semi-supervised models on your custom data. You just can’t use those
models in production directly. A recent paper on the subject, "Uncertainty in
Structured Prediction", demonstrates that
you can get a lot from a mixture of experts. By the way, you can combine
many such expert models from different vendors to get really amazing
results. And remember the recent Switch Transformer from Google Brain, which
targets a similar direction.
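A minimal sketch of the teacher-student idea, in plain Python: average the output distributions of several big teacher models and train a small student against that soft target. This is not taken from ESPnet or from the paper above. The function names, the temperature parameter and the logits are illustrative assumptions, and a real setup would compute this per frame or per token inside a training framework.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def average_teachers(teacher_logits_list, temperature=1.0):
    """Combine several teachers by averaging their output distributions."""
    probs = [softmax(logits, temperature) for logits in teacher_logits_list]
    count = len(probs)
    return [sum(p[i] for p in probs) / count for i in range(len(probs[0]))]

def distillation_loss(student_logits, teacher_probs, temperature=1.0):
    """Cross-entropy between the teachers' soft targets and the student output."""
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs) if t > 0)

# Hypothetical logits from two large teachers for one output step (3 classes).
teachers = [[4.0, 1.0, 0.5], [3.5, 1.5, 0.2]]
soft_target = average_teachers(teachers, temperature=2.0)

good_student = [3.8, 1.2, 0.3]  # agrees with the teachers
bad_student = [0.3, 1.2, 3.8]   # disagrees with the teachers
print(distillation_loss(good_student, soft_target, temperature=2.0))
print(distillation_loss(bad_student, soft_target, temperature=2.0))
```

The student that agrees with the averaged teachers gets the lower loss. On unlabeled custom data the soft target replaces the missing transcript, which is what makes the semi-supervised training possible.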
More toolkits to try: nvidia/nemo, flashlight (wav2letter before),