NVIDIA NeMo Test Results

In the Russian community, NVIDIA NeMo has gained some popularity recently. Big companies like VK.COM and Yandex announce that they use NeMo for their production systems, and the QuartzNet and Jasper architectures are mentioned here and there. Following the Facebook RASR tests from the previous post, I tested the NVIDIA NeMo QuartzNet15x5En-Base model.

The description of the model says:

QuartzNet15x5 model trained on six datasets: LibriSpeech, Mozilla Common Voice (validated clips from en_1488h_2019-12-10), WSJ, Fisher, Switchboard, and NSC Singapore English. It was trained with Apex/Amp optimization level O1 for 600 epochs. The model achieves a WER of 3.79% on LibriSpeech dev-clean, and a WER of 10.05% on dev-other.
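
For reference, running this checkpoint takes only a few lines with the NeMo Python API. A minimal sketch, assuming the NeMo 1.x API and placeholder file names (the exact transcribe() signature varies between NeMo releases):

```python
import nemo.collections.asr as nemo_asr

# Download the pretrained QuartzNet15x5 English checkpoint from NGC.
model = nemo_asr.models.EncDecCTCModel.from_pretrained(
    model_name="QuartzNet15x5Base-En"
)

# Greedy CTC transcription of 16 kHz mono WAV files (placeholder paths).
for text in model.transcribe(["test1.wav", "test2.wav"]):
    print(text)
```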

You can see the results below, compared with the Facebook RASR results from the previous post. All numbers are word error rates (WER, %).

| Dataset | Aspire Vosk model | Daanzu Vosk model | Facebook RASR model | NVIDIA NeMo |
|---|---|---|---|---|
| Librispeech test-clean | 11.72 | 7.08 | 3.30 | 3.78 |
| Tedlium test | 11.23 | 8.25 | 5.96 | 8.03 |
| Google commands | 46.76 | 11.64 | 20.06 | 44.40 |
| Non-native speech | 57.92 | 33.31 | 26.99 | 31.06 |
| Children speech | 20.29 | 9.90 | 6.17 | 8.17 |
| Podcasts | 19.85 | 21.21 | 15.06 | 24.47 |
| Callcenter bot | 17.20 | 19.22 | 14.55 | 14.55 |
| Callcenter 1 | 53.98 | 52.97 | 42.82 | 42.18 |
| Callcenter 2 | 33.82 | 43.02 | 30.41 | 31.45 |
| Callcenter 3 | 35.86 | 52.80 | 32.98 | 33.03 |
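
For readers reproducing these numbers: WER is the word-level edit distance between hypothesis and reference divided by the number of reference words. A minimal sketch of the computation (dataset-specific text normalization is omitted):

```python
def wer(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference words, in %."""
    ref, hyp = reference.split(), hypothesis.split()
    # Word-level Levenshtein distance via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # all deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return 100.0 * d[-1][-1] / len(ref)

print(wer("turn on the light", "turn of the light"))  # 25.0
```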

Some notes:

Tight LM adaptation for E2E models

I also investigated recognition with domain-specific LMs for end-to-end models like RASR and QuartzNet; surprisingly, they do not perform very well there. Here we compare the generic Commoncrawl LM against a domain-specific LM on the non-native speech test set (values are WER, %):

| Test LM | Daanzu Vosk model | Facebook RASR model | NVIDIA NeMo |
|---|---|---|---|
| Generic LM (perplexity 120) | 33.31 | 26.99 | 31.06 |
| Domain LM (perplexity 15) | 11.48 | 20.59 | 20.81 |
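
For reference, with the kenlm Python bindings, measuring the perplexity of an ARPA model on a sentence is a one-liner (the model path and text here are placeholders):

```python
import kenlm

lm = kenlm.Model("domain.arpa")  # placeholder path to an ARPA/binary LM
print(lm.perplexity("i would like to check my account balance"))
```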

The effect here is strange: the much better LM perplexity cuts the hybrid Daanzu model's error rate nearly threefold but barely improves E2E accuracy. I attribute it to the not-so-perfect decoder implementation (for example, there is no smearing in the PaddlePaddle ctc_decoder used in NeMo). As a consequence, E2E decoders might not be a good fit for audio alignment and similar tasks, where LM perplexities are much lower than in generic tasks. This problem needs further investigation.
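
For illustration, here is how a KenLM model is typically fused into CTC beam search. This sketch uses the pyctcdecode package rather than the PaddlePaddle ctc_decoder mentioned above, and the LM path and weights are placeholders:

```python
import nemo.collections.asr as nemo_asr
from pyctcdecode import build_ctcdecoder

# The acoustic model's vocabulary defines the decoder alphabet.
asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(
    model_name="QuartzNet15x5Base-En"
)

# Fuse a KenLM ARPA model into the beam search. alpha is the LM weight,
# beta the word insertion bonus; both need tuning on held-out data.
decoder = build_ctcdecoder(
    asr_model.decoder.vocabulary,
    kenlm_model_path="domain.arpa",  # placeholder
    alpha=0.5,
    beta=1.0,
)

# Frame-level log-probabilities instead of greedy text, then beam search.
logits = asr_model.transcribe(["test1.wav"], logprobs=True)[0]
print(decoder.decode(logits))
```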