NVIDIA NeMo Conformer-CTC model test results

Not long after Citrinet, NVIDIA NeMo released the Conformer-CTC model. As usual, you can forget about Citrinet now: Conformer-CTC is significantly better.

The model is available for download here; the latest NeMo repo supports it.

We tested the model on the same datasets as before; the results are in the table below.
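For readers who want to reproduce this kind of comparison, the standard metric behind the numbers below is word error rate. Here is a minimal self-contained sketch in plain Python (our actual evaluation scripts are not shown here):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("the" -> "a") out of six reference words.
print(round(wer("the cat sat on the mat", "the cat sat on a mat") * 100, 2))  # → 16.67
```

In practice you would also normalize text (case, punctuation) before scoring, which can shift the numbers noticeably.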

The model is very good: it gives the best or near-best WER on most of the test sets, including all three call-center sets.

Note that it is very important to build a language model (LM) and rescore with it. Ideally, rescore with a Transformer LM too. For details, see the previous Citrinet post.
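To illustrate the rescoring step: CTC beam search produces an n-best list with acoustic scores, and each hypothesis is re-ranked by adding a weighted LM score. The sketch below is a generic illustration, not NeMo's actual API; the function name, the weights alpha/beta, and the toy LM are all assumptions for the example.

```python
def rescore_nbest(hypotheses, lm_score, alpha=0.5, beta=1.0):
    """Pick the hypothesis maximizing AM score + alpha * LM score + beta * word count.

    hypotheses: list of (text, acoustic_logprob) pairs from CTC beam search.
    lm_score:   function text -> LM log-probability (e.g. an n-gram or Transformer LM).
    alpha/beta: LM weight and word-insertion bonus, tuned on a dev set
                (the defaults here are illustrative only).
    """
    def total(hyp):
        text, am = hyp
        return am + alpha * lm_score(text) + beta * len(text.split())
    return max(hypotheses, key=total)[0]

# Toy example: the LM prefers the fluent hypothesis even though its
# acoustic score is slightly lower.
nbest = [("recognize speech", -4.0), ("wreck a nice beach", -3.8)]
toy_lm = lambda t: -1.0 if t == "recognize speech" else -6.0
print(rescore_nbest(nbest, toy_lm))  # → recognize speech
```

A second rescoring pass with a Transformer LM typically reuses the same n-best list and interpolates the two LM scores.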

I believe that if NVIDIA implements an AED rescorer instead of plain CTC decoding, the results will be even better. See the Gigaspeech overview for details.

All numbers are WER, %:

| Dataset              | Vosk Aspire | Vosk Daanzu | Facebook RASR | Facebook Wav2Vec2.0 | Nvidia Citrinet | Nvidia Conformer-CTC |
|----------------------|-------------|-------------|---------------|---------------------|-----------------|----------------------|
| Librispeech test-clean | 11.72 | 7.08  | 3.30  | 2.6  | 2.78  | 2.26  |
| Tedlium test           | 11.23 | 8.25  | 5.96  | 6.3  | 5.61  | 4.89  |
| Google commands        | 46.76 | 11.64 | 20.06 | 24.1 | 28.15 | 19.77 |
| Non-native speech      | 57.92 | 33.31 | 26.99 | 29.6 | 28.78 | 24.22 |
| Children speech        | 20.29 | 9.90  | 6.17  | 5.5  | 6.85  | 5.52  |
| Podcasts               | 19.85 | 21.21 | 15.06 | 17.0 | 14.82 | 13.79 |
| Callcenter bot         | 17.20 | 19.22 | 14.55 | 22.5 | 12.85 | 13.96 |
| Callcenter 1           | 53.98 | 52.97 | 42.82 | 46.7 | 36.05 | 32.57 |
| Callcenter 2           | 33.82 | 43.02 | 30.41 | 36.9 | 29.40 | 27.82 |
| Callcenter 3           | 35.86 | 52.80 | 32.98 | 40.9 | 29.78 | 27.44 |