NVIDIA NeMo Citrinet model test results
The race for the biggest model continues. Recently NVIDIA released
Citrinet, a bigger and more advanced version of QuartzNet, described in the paper
Citrinet: Closing the Gap between Non-Autoregressive and Autoregressive End-to-End Models for Automatic Speech Recognition by Somshubra Majumdar et al.
The model is available for download here; the latest NeMo repo supports it.
We tested the model with the same datasets we tried before; see the results in the table below. We compared
it with the recently released NVIDIA QuartzNet, Wav2Letter RASR, Wav2Vec,
and Vosk models.
The model is trained on:
- Librispeech 960 hours of English speech
- Fisher Corpus
- Switchboard-1 Dataset
- WSJ-0 and WSJ-1
- National Speech Corpus - 1
- Mozilla Common Voice
The overall training size is more than 7000 hours. The training was
performed on 32 V100 GPUs for 1000 epochs, a huge amount of compute
that is out of reach for hobbyist hackers.
Overall the model is great: it gives very solid results and stays on
par with RASR. For callcenter audio it wins; for wideband it is almost the
same. Note that the RASR model used a private podcast corpus, so
given a big model and huge compute, the results are more or less
the same, as expected.
The decoding is not very straightforward. The recommended decoder uses
BPE and a BPE LM, so it doesn't even need a vocabulary (great!). On the bad
side, you have to follow a special script to train a 6-gram BPE KenLM
model to get the best results. Unfortunately you cannot use a word LM yet as
in the previous tests, so the results are somewhat LM-dependent. The
decoding script looks like this (not yet documented anywhere, but
hopefully that will improve):
import sys
import numpy as np
import nemo.collections.asr as nemo_asr

TOKEN_OFFSET = 100  # NeMo's default offset for mapping BPE token ids to unicode chars

asr_model = nemo_asr.models.EncDecCTCModel.restore_from('stt_en_citrinet_1024.nemo', strict=False)
vocab = asr_model.decoder.vocabulary
vocab = [chr(idx + TOKEN_OFFSET) for idx in range(len(vocab))]
ids_to_text_func = asr_model.tokenizer.ids_to_text

# alpha/beta/beam_width are examples, tune them for your own LM
beam_search_lm = nemo_asr.modules.BeamSearchDecoderWithLM(
    vocab=vocab, beam_width=16, alpha=2.0, beta=1.5,
    lm_path='lm.bin', num_cpus=4, input_tensor=False)

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum(axis=-1).reshape([x.shape[0], 1])

files = [x.strip().replace("/root/host/", "") for x in open(sys.argv[1]).readlines()]

for i, logits in enumerate(asr_model.transcribe(files, logprobs=True)):
    probs = softmax(logits)
    # forward returns a list of n-best lists of (score, transcript) pairs
    transcript = beam_search_lm.forward(log_probs=np.expand_dims(probs, axis=0), log_probs_length=None)[0][0][1]
    print(files[i].split("/")[-1][0:-4], ids_to_text_func([ord(c) - TOKEN_OFFSET for c in transcript]))
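The lexicon-free decoding above relies on a simple trick: each BPE token id is remapped to a single unicode character, so an ordinary character-level KenLM can score BPE sequences. A minimal sketch of that roundtrip (the offset of 100 matches NeMo's default, but treat it as an assumption; the token ids here are made up):

```python
# Map BPE token ids to single unicode characters so a character-level
# KenLM can score BPE token sequences, then map back after decoding.
TOKEN_OFFSET = 100  # assumed NeMo default; keeps the chars printable

def ids_to_chars(ids):
    """Encode a list of BPE token ids as a string of unicode chars."""
    return "".join(chr(i + TOKEN_OFFSET) for i in ids)

def chars_to_ids(s):
    """Decode the char string back to BPE token ids."""
    return [ord(c) - TOKEN_OFFSET for c in s]

ids = [5, 42, 17, 3]          # hypothetical token ids
encoded = ids_to_chars(ids)   # a 4-char string the LM can consume
assert chars_to_ids(encoded) == ids
```

This is why no word vocabulary is needed: the LM scores the remapped token stream directly, and the tokenizer maps the decoded ids back to text at the end.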
The strong sides of the model:
- Great accuracy, especially for callcenter audio
- Lexicon-free decoding with a BPE LM gives great results
- Results with a word-based LM are yet to be compared; hopefully they will be even stronger!
- Smaller size than RASR
- You can reuse the huge compute effort embedded in the model
Things to work on:
- The decoder is half-done; the API is not yet complete. No lattices, confidence scores, etc.
- The model "eats" words: deletions are more frequent than insertions. Not so good for alignment
- Accuracy on short commands and single words is not so great; results are much better on longer chunks
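The deletion bias is easy to measure: align each hypothesis against its reference and count the edit operations separately instead of reporting a single WER number. A minimal sketch with plain Levenshtein dynamic programming (no external scoring tools assumed):

```python
def edit_ops(ref, hyp):
    """Count (substitutions, deletions, insertions) between two
    transcripts using Levenshtein DP with backtracking."""
    ref, hyp = ref.split(), hyp.split()
    n, m = len(ref), len(hyp)
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + cost,  # match / substitute
                          d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1)         # insertion
    subs = dels = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            subs += ref[i - 1] != hyp[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return subs, dels, ins

# A hypothesis that "eats" a word shows up as a deletion
print(edit_ops("turn the lights on", "turn lights on"))  # (0, 1, 0)
```

Over a whole test set, deletions noticeably outnumbering insertions is exactly the "eats words" pattern, and for alignment tasks that matters more than the aggregate WER.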
We didn't investigate the adaptation capabilities of Citrinet. From
previous experiments, QuartzNet does not adapt very well, so we shall see.
The Wav2Vec model is not as strong as the others for callcenter audio, but it certainly
adapts very well; it even overtrains a bit. Adaptation is something to
explore in the future.