Things about ESPnet

The ESPnet toolkit has seen some great recent developments, as described in the paper "Recent Developments on ESPnet Toolkit Boosted by Conformer".

The results include great accuracy numbers, so I recently tried ESPnet on LibriSpeech. Here are some secrets the paper never mentions:

Essentially, the great Conformer numbers reported in the READMEs are for extremely slow systems. You can’t even imagine how slow they are. The models are also big: an acoustic model tends to be about 300-500 MB in size.

Speed is not actually a problem, though: you can still use those big models as teachers in a distillation or teacher-student setup and train more accurate semi-supervised models on your own data. You just can’t use the big models themselves in production. A recent paper on the subject, "Uncertainty in Structured Prediction", demonstrates that you can get a lot from a mixture of experts. By the way, you can combine many such expert models from different vendors to get really amazing results. And remember the recent Switch Transformer from Google Brain, which targets a similar direction.
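
To make the teacher-student idea a bit more concrete, here is a minimal sketch in plain Python. It is not taken from ESPnet or from any of the papers above; the function names, the toy logits and the temperature value are all my own assumptions for illustration. Several big teacher models produce class probabilities for one frame, the probabilities are averaged into soft targets, and a small student is scored by cross-entropy against those targets.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature gives a softer distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def average_teachers(teacher_logits, temperature=1.0):
    """Average per-class probabilities over several teachers (a simple ensemble)."""
    probs = [softmax(logits, temperature) for logits in teacher_logits]
    count = len(probs)
    return [sum(p[i] for p in probs) / count for i in range(len(probs[0]))]

def distillation_loss(student_logits, soft_targets, temperature=1.0):
    """Cross-entropy between the teachers' soft targets and the student's prediction."""
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(soft_targets, student_probs))

# Toy data (hypothetical): one frame, three output classes, two teachers.
teachers = [[2.0, 0.5, -1.0], [1.5, 1.0, -0.5]]
targets = average_teachers(teachers, temperature=2.0)

good_student = [1.8, 0.7, -0.8]   # roughly agrees with the teachers
bad_student = [-1.0, 0.5, 2.0]    # prefers the wrong class

loss_good = distillation_loss(good_student, targets, temperature=2.0)
loss_bad = distillation_loss(bad_student, targets, temperature=2.0)
print(loss_good, loss_bad)
```

The student that agrees with the averaged teachers gets the lower loss. In a real setup this loss would be minimized over unlabeled audio with a deep learning framework, and the teachers could be models from different toolkits or vendors, as long as they share the same output units.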

More toolkits to try: NVIDIA NeMo, Flashlight (formerly wav2letter), Espresso, WeNet, TensorFlowASR.