NVIDIA Nemo Test Results

Written by Nickolay Shmyrev
In the Russian community, NVIDIA Nemo has gained some popularity recently. Big companies like VK.COM and Yandex announce that they use Nemo for their production systems, and the Quartznet and Jasper architectures are mentioned here and there. Following the Facebook tests, I tested the NVIDIA Nemo Quartznet15x5En-Base model.
The description of the model says:
QuartzNet15x5 model trained on six datasets: LibriSpeech, Mozilla
Common Voice (validated clips from en_1488h_2019-12-10), WSJ, Fisher,
Switchboard, and NSC Singapore English. It was trained with Apex/Amp
optimization level O1 for 600 epochs. The model achieves a WER of 3.79%
on LibriSpeech dev-clean, and a WER of 10.05% on dev-other.
You can see the results below, compared with the Facebook RASR results from the previous post.
| Dataset | Aspire Vosk model | Daanzu Vosk model | Facebook RASR model | Nvidia NEMO |
|---|---|---|---|---|
| Librispeech test-clean | 11.72 | 7.08 | 3.30 | 3.78 |
| Tedlium test | 11.23 | 8.25 | 5.96 | 8.03 |
| Google commands | 46.76 | 11.64 | 20.06 | 44.40 |
| Non-native speech | 57.92 | 33.31 | 26.99 | 31.06 |
| Children speech | 20.29 | 9.90 | 6.17 | 8.17 |
| Podcasts | 19.85 | 21.21 | 15.06 | 24.47 |
| Callcenter bot | 17.20 | 19.22 | 14.55 | 14.55 |
| Callcenter 1 | 53.98 | 52.97 | 42.82 | 42.18 |
| Callcenter 2 | 33.82 | 43.02 | 30.41 | 31.45 |
| Callcenter 3 | 35.86 | 52.80 | 32.98 | 33.03 |
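The numbers above are word error rates (WER): the word-level edit distance (substitutions, insertions, deletions) between the reference transcript and the hypothesis, divided by the number of reference words. A minimal sketch of the computation (my own toy example, not the evaluation harness used for these tests):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# one substitution + one deletion over six reference words, ≈ 0.33
print(wer("the cat sat on the mat", "the cat sit on mat"))
```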
Some notes:
- The overall Nemo ASR implementation is pretty lightweight and straightforward, so it is good as a baseline or a starting point for more advanced implementations.
- The pretrained Quartznet model is also impressive considering the model size. The model is particularly good for callcenter data (most likely due to pronunciation irregularities in conversational speech).
- The model size is just about 70M, not a big one; maybe Nvidia can share a bigger model some day.
- For beam decoding I used the Commoncrawl LM, the same as for RASR.
- The default code doesn't even demonstrate proper decoding with beam search; be careful to use beam decoding for the best accuracy.
- The default beam in the demo code is 15, which is very narrow; with a beam of around 500 you get significantly better results.
- The model is a bit overtuned for Librispeech.
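To illustrate what the beam width actually controls, here is a minimal LM-free CTC prefix beam search sketch. This is my own toy implementation, not the PaddlePaddle ctc_decoders code NeMo uses; the beam width simply bounds how many candidate prefixes survive each frame, so a width of 15 prunes far more aggressively than 500:

```python
from collections import defaultdict

def ctc_beam_search(probs, alphabet, blank=0, beam_width=15):
    """Minimal CTC prefix beam search (no LM, no smearing).
    probs: T x V matrix of per-frame symbol probabilities."""
    # each beam entry: prefix -> (prob ending in blank, prob ending in non-blank)
    beams = {(): (1.0, 0.0)}
    for frame in probs:
        next_beams = defaultdict(lambda: (0.0, 0.0))
        for prefix, (p_b, p_nb) in beams.items():
            for s, p in enumerate(frame):
                if s == blank:
                    nb_b, nb_nb = next_beams[prefix]
                    next_beams[prefix] = (nb_b + (p_b + p_nb) * p, nb_nb)
                    continue
                last = prefix[-1] if prefix else None
                new_prefix = prefix + (s,)
                nb_b, nb_nb = next_beams[new_prefix]
                if s == last:
                    # a repeated symbol extends the prefix only after a blank;
                    # otherwise it collapses into the existing prefix
                    next_beams[new_prefix] = (nb_b, nb_nb + p_b * p)
                    sb, snb = next_beams[prefix]
                    next_beams[prefix] = (sb, snb + p_nb * p)
                else:
                    next_beams[new_prefix] = (nb_b, nb_nb + (p_b + p_nb) * p)
        # prune: keep only the beam_width most probable prefixes
        beams = dict(sorted(next_beams.items(),
                            key=lambda kv: sum(kv[1]), reverse=True)[:beam_width])
    best = max(beams.items(), key=lambda kv: sum(kv[1]))
    return "".join(alphabet[s] for s in best[0])

# toy 2-frame example over alphabet (blank, 'a', 'b')
print(ctc_beam_search([[0.1, 0.6, 0.3], [0.1, 0.6, 0.3]], ["_", "a", "b"]))
```

With real acoustic models and an LM rescoring each prefix, a narrow beam discards hypotheses before the LM can recover them, which is why widening the beam from 15 to 500 helps noticeably.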
Tight LM adaptation for E2E models
I also investigated recognition with domain-specific LMs for end-to-end models like RASR and Quartznet; surprisingly, they do not perform very well. Here we compare the generic Commoncrawl LM with a domain-specific LM:
| Dataset | Daanzu Vosk model | Facebook RASR model | Nvidia NEMO |
|---|---|---|---|
| LM perplexity 120 | 33.31 | 26.99 | 31.06 |
| LM perplexity 15 | 11.48 | 20.59 | 20.81 |
The effect here is strange: it seems that improving LM perplexity doesn't significantly improve E2E accuracy. I attribute it to a not-so-perfect decoder implementation (for example, there is no smearing in the PaddlePaddle ctc_decoder used in Nemo). As a consequence, E2E decoders might not be a good fit for audio alignment and similar tasks where LM perplexities are much lower than in generic tasks. This problem needs further investigation.
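For context on the perplexity figures above: perplexity is the exponential of the average negative log-probability the LM assigns to the test words, so a perplexity of 120 means the model is, on average, as uncertain as a uniform choice among 120 words. A minimal sketch of the computation (toy probabilities, not the actual Commoncrawl or domain LM):

```python
import math

def perplexity(word_log_probs):
    """Perplexity = exp(-mean per-word natural-log probability)."""
    return math.exp(-sum(word_log_probs) / len(word_log_probs))

# a model assigning every word probability 1/120 has perplexity ≈ 120
print(perplexity([math.log(1 / 120)] * 10))
```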