Things about ESPnet

The ESPnet toolkit has seen some great recent developments, as described in the paper Recent Developments on ESPnet Toolkit Boosted by Conformer.

The reported accuracy numbers are great, so I recently tried ESPnet on LibriSpeech. Some secrets never mentioned in the paper:

  • The default LibriSpeech LM recipe trains for 60 days on 4 GPUs. You can cut that to about 10 days by stopping training early.
  • The default LibriSpeech decoder uses beam size 60 and takes a whole day to decode test-clean.
  • People recommend beam size 20, which brings decoding down to about 4 hours.
  • For anything practical you need beam size 2, and then you get a significant drop in accuracy, even worse than DNN-HMM systems.
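The beam-size numbers above trade accuracy for the number of hypotheses expanded at every step. Here is a minimal beam-search sketch (a toy with made-up scores, not ESPnet's actual decoder) showing both the pruning that caps the cost and why a small beam can miss the globally best hypothesis:

```python
import math

# Toy prefix-dependent probabilities (hypothetical numbers, chosen so
# that greedy decoding, i.e. beam size 1, is suboptimal).
TABLE = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"a": 0.5, "b": 0.5},
    ("b",): {"a": 0.9, "b": 0.1},
}

def step_scores(prefix):
    """Log-probabilities of the next token given the current prefix."""
    return {tok: math.log(p) for tok, p in TABLE[prefix].items()}

def beam_search(num_steps, beam_size):
    """Return (tokens, log_prob) of the best hypothesis found."""
    beams = [((), 0.0)]
    for _ in range(num_steps):
        # Expand every surviving hypothesis by every token:
        # work per step grows linearly with beam size.
        candidates = [
            (prefix + (tok,), score + lp)
            for prefix, score in beams
            for tok, lp in step_scores(prefix).items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]  # prune to the beam_size best
    return beams[0]
```

With beam size 1 the search commits to "a" first and ends with probability 0.3; beam size 2 keeps "b" alive and finds "ba" with probability 0.36, the true optimum.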

Essentially, the great Conformer numbers reported in the READMEs are for extremely slow systems. You can't even imagine how slow they are. The models are also big: an acoustic model tends to be about 300-500 MB in size.

Speed is not necessarily a problem: you can still use those big models in a distillation (teacher-student) setup and train more accurate semi-supervised models on your custom data. You just can't use those models in production directly. A recent paper on the subject, Uncertainty in Structured Prediction, demonstrates that you can get a lot from a mixture of experts. By the way, you can combine many such expert models from different vendors to get really amazing results. Also remember the recent Switch Transformer from Google Brain, which targets a similar direction.
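The two ideas above can be sketched in a few lines. Below, `distillation_loss` is the standard teacher-student objective (student cross-entropy against the teacher's temperature-softened distribution), and `ensemble_probs` is the simplest way to combine expert models (averaging their predictive distributions; real mixture-of-experts setups learn gating weights instead). All logits and the temperature are illustrative, not taken from ESPnet or the papers mentioned:

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable temperature-scaled softmax."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy of the student against the teacher's softened
    distribution -- the core of a teacher-student (distillation) setup."""
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    return -sum(p * math.log(q) for p, q in zip(t, s))

def ensemble_probs(all_logits):
    """Average the predictive distributions of several expert models
    (a simple uniform mixture; a learned gate would weight them)."""
    dists = [softmax(l) for l in all_logits]
    n = len(dists)
    return [sum(d[i] for d in dists) / n for i in range(len(dists[0]))]
```

The loss is minimized when the student matches the teacher exactly, so a big slow Conformer can supervise a small fast model on unlabeled custom data.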

More toolkits to try: nvidia/nemo, flashlight (formerly wav2letter), espresso, wenet, TensorFlowASR