NVIDIA NeMo Test Results

NVIDIA NeMo has recently gained some popularity in the Russian community. Big companies like VK.COM and Yandex announce that they use NeMo in their production systems, and the QuartzNet and Jasper architectures are mentioned here and there. Following the Facebook RASR tests, I tested the NVIDIA NeMo Quartznet15x5En-Base model.

The model description says:

QuartzNet15x5 model trained on six datasets: LibriSpeech, Mozilla Common Voice (validated clips from en_1488h_2019-12-10), WSJ, Fisher, Switchboard, and NSC Singapore English. It was trained with Apex/Amp optimization level O1 for 600 epochs. The model achieves a WER of 3.79% on LibriSpeech dev-clean, and a WER of 10.05% on dev-other.

You can see the results (WER, %) below, compared with the results from the previous post on the Facebook RASR model.

| Dataset | Aspire Vosk model | Daanzu Vosk model | Facebook RASR model | Nvidia NeMo |
|---|---|---|---|---|
| Librispeech test-clean | 11.72 | 7.08 | 3.30 | 3.78 |
| Tedlium test | 11.23 | 8.25 | 5.96 | 8.03 |
| Google commands | 46.76 | 11.64 | 20.06 | 44.40 |
| Non-native speech | 57.92 | 33.31 | 26.99 | 31.06 |
| Children speech | 20.29 | 9.90 | 6.17 | 8.17 |
| Podcasts | 19.85 | 21.21 | 15.06 | 24.47 |
| Callcenter bot | 17.20 | 19.22 | 14.55 | 14.55 |
| Callcenter 1 | 53.98 | 52.97 | 42.82 | 42.18 |
| Callcenter 2 | 33.82 | 43.02 | 30.41 | 31.45 |
| Callcenter 3 | 35.86 | 52.80 | 32.98 | 33.03 |

Some notes:

  • The overall NeMo ASR implementation is pretty lightweight and straightforward, so it is good as a baseline or a starting point for more advanced implementations.
  • The pretrained QuartzNet model is also impressive considering the model size. The model is particularly good for call center data (most likely due to pronunciation irregularities in conversational speech).
  • The model is just about 70M in size, not a big one; maybe NVIDIA can share a bigger model some day.
  • For beam search decoding I used the CommonCrawl LM, the same as for RASR.
  • The default code doesn't even demonstrate proper decoding with beam search; be careful to use beam decoding for best accuracy.
  • The default beam width in the demo code is 15, which is very narrow; with a beam width of around 500 you get significantly better results (see the sketch after this list).
  • The model is a bit overtuned for LibriSpeech.
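
To illustrate the last points, here is a minimal sketch of beam search decoding with an external KenLM model using the NeMo 1.x API. The audio file name, the LM path, and the alpha/beta values are my assumptions, not part of the NeMo demo; BeamSearchDecoderWithLM also requires the ctc_decoders package installed via scripts/install_beamsearch_decoders.sh from the NeMo repository.

```python
# A sketch, not NeMo's official demo: decode QuartzNet output with beam
# search and a KenLM model. "sample.wav" and "commoncrawl.bin" are
# hypothetical paths you need to replace with your own.
import numpy as np
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(
    model_name="QuartzNet15x5Base-En")

# Frame-level logits for the audio file, shape [time, vocab]
logits = asr_model.transcribe(["sample.wav"], logprobs=True)[0]

def softmax(x):
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

beam_search_lm = nemo_asr.modules.BeamSearchDecoderWithLM(
    vocab=list(asr_model.decoder.vocabulary),
    beam_width=500,             # the demo default of 15 is far too narrow
    alpha=2.0,                  # LM weight, tune on dev data (assumed value)
    beta=1.5,                   # word insertion bonus (assumed value)
    lm_path="commoncrawl.bin",  # KenLM binary model
    num_cpus=4,
    input_tensor=False)

# The decoder consumes a batch of [time, vocab] probability matrices and
# returns (score, transcript) pairs for each beam
hypotheses = beam_search_lm.forward(
    log_probs=np.array([softmax(logits)]), log_probs_length=None)
print(hypotheses[0][0])
```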

Tight LM adaptation for E2E models

I also investigated recognition with domain-specific LMs for end-to-end models like RASR and QuartzNet; surprisingly, they do not perform very well. Here we compare the generic CommonCrawl LM with a domain-specific LM on the non-native speech test set (WER, %):

| LM | Daanzu Vosk model | Facebook RASR model | Nvidia NeMo |
|---|---|---|---|
| Generic CommonCrawl LM (perplexity 120) | 33.31 | 26.99 | 31.06 |
| Domain-specific LM (perplexity 15) | 11.48 | 20.59 | 20.81 |

The effect here is strange: a much lower LM perplexity does not significantly improve E2E accuracy, while the hybrid Daanzu model improves dramatically. I attribute this to the not-so-perfect decoder implementation (for example, there is no smearing in the PaddlePaddle ctc_decoder used in NeMo). As a consequence, E2E decoders might not be a good fit for audio alignment and similar tasks where LM perplexities are much lower than in generic tasks. This problem needs further investigation.
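
For reference, this is a sketch of how the perplexities above could be measured with the kenlm Python bindings. The file names are hypothetical, and the transcript file is assumed to contain one normalized utterance per line.

```python
# A sketch: corpus perplexity of two KenLM models on the test transcripts.
# "commoncrawl.bin", "domain.bin" and "test_transcripts.txt" are hypothetical.
import kenlm

def corpus_perplexity(model, lines):
    # Perplexity = 10 ** (-total_log10_prob / total_tokens)
    log10_prob, tokens = 0.0, 0
    for line in lines:
        log10_prob += model.score(line)   # log10 P(line), includes </s>
        tokens += len(line.split()) + 1   # +1 for the end-of-sentence token
    return 10.0 ** (-log10_prob / tokens)

transcripts = [line.strip().lower() for line in open("test_transcripts.txt")]
for path in ("commoncrawl.bin", "domain.bin"):
    print(path, corpus_perplexity(kenlm.Model(path), transcripts))
```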