NVIDIA NeMo Citrinet model test results

The race for the biggest model continues. NVIDIA recently released Citrinet, a bigger and more advanced successor to Quartznet. The publication is:

“Citrinet: Closing the Gap between Non-Autoregressive and Autoregressive End-to-End Models for Automatic Speech Recognition” by Somshubra Majumdar et al.

The model is available for download here; the latest NeMo repo supports it.

We tested the model on the same datasets we tried before; see the results in the table below. We compared the model with the recently released Nvidia Quartznet, Facebook's Wav2Letter RASR, Wav2Vec, and Vosk models.

The model is trained on:

  • Librispeech 960 hours of English speech
  • Fisher Corpus
  • Switchboard-1 Dataset
  • WSJ-0 and WSJ-1
  • National Speech Corpus - 1
  • Mozilla Common Voice

The overall training set is more than 7000 hours. Training was performed on 32 V100 GPUs for 1000 epochs, a huge amount of compute that is out of reach for independent hackers.

Overall the model is great: it gives very solid results and stays on par with RASR. On callcenter data it wins; on wideband data it is almost the same. Note that the RASR model was trained with a private podcast corpus. So given a big model and huge compute, the results are more or less the same, as expected.

The decoding is not very straightforward. The recommended decoder works on BPE tokens with a BPE language model, so it doesn't even need a vocabulary (great!). On the downside, you have to follow a special script to train a 6-gram BPE KenLM model to get the best results. Unfortunately, you cannot yet use a word-level LM as in our previous tests, so the results are somewhat LM-dependent. The decoding script looks like this (not yet documented anywhere, but hopefully that will improve):

import os
import sys
import numpy as np
import nemo.collections.asr as nemo_asr

# Offset that maps BPE token ids to printable unicode characters, so the
# beam search decoder and KenLM can treat each token as one "character".
# 100 is the value used by NeMo's n-gram LM scripts.
TOKEN_OFFSET = 100

asr_model = nemo_asr.models.EncDecCTCModel.restore_from('stt_en_citrinet_1024.nemo', strict=False)

vocab = asr_model.decoder.vocabulary
vocab = [chr(idx + TOKEN_OFFSET) for idx in range(len(vocab))]
ids_to_text_func = asr_model.tokenizer.ids_to_text

beam_search_lm = nemo_asr.modules.BeamSearchDecoderWithLM(
    vocab=vocab,
    beam_width=128,                    # beam size, tune for speed vs accuracy
    alpha=1.0, beta=0.5,
    lm_path='lm.bin',                  # placeholder: path to your 6-gram BPE KenLM model
    num_cpus=max(os.cpu_count(), 1),
    input_tensor=False)

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum(axis=-1).reshape([x.shape[0], 1])

files = [x.strip().replace("/root/host/", "") for x in open(sys.argv[1]).readlines()]
for i, logits in enumerate(asr_model.transcribe(files, logprobs=True)):
    probs = softmax(logits)
    transcript = beam_search_lm.forward(log_probs=np.expand_dims(probs, axis=0), log_probs_length=None)
    print(files[i].split("/")[-1][0:-4], ids_to_text_func([ord(c) - TOKEN_OFFSET for c in transcript[0][0][1]]))
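
The chr(idx + TOKEN_OFFSET) remapping in the script above is what makes lexicon-free decoding possible: each BPE token id becomes a single printable unicode character, so KenLM and the beam search see an ordinary character-level problem. A minimal sketch of the round trip (TOKEN_OFFSET = 100 matches the value used in NeMo's n-gram scripts; the toy vocabulary here is made up):

```python
# Round trip between BPE token ids and the unicode "characters" the n-gram
# LM scores. Each id maps to exactly one printable character and back.
TOKEN_OFFSET = 100  # offset past control and whitespace characters

# Hypothetical BPE vocabulary; the real one comes from asr_model.decoder.vocabulary
vocab = ["\u2581the", "\u2581cat", "s", "\u2581sat"]

def ids_to_chars(ids):
    """Map token ids to a compact unicode string for the n-gram LM."""
    return "".join(chr(i + TOKEN_OFFSET) for i in ids)

def chars_to_ids(s):
    """Invert the mapping after beam search returns a character string."""
    return [ord(c) - TOKEN_OFFSET for c in s]

ids = [0, 1, 2, 3]            # "▁the ▁cat s ▁sat"
encoded = ids_to_chars(ids)   # four printable characters: "defg"
assert chars_to_ids(encoded) == ids
```

Because every token collapses to one character, a "6-gram" over these characters is really a 6-gram over BPE tokens, which is why no word vocabulary is needed.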

The strong sides of the model:

  • Great accuracy, especially on callcenter data
  • Lexicon-free decoding with a BPE LM gives great results
  • Results with a word-based LM are yet to be compared; hopefully they will be even stronger!
  • Smaller size than RASR
  • You can reuse the huge compute effort embedded in the model

Things to work on:

  • The decoder is half-done; the API is not yet complete. No lattices, confidence scores, etc.
  • The model “eats” words: deletions are more frequent than insertions. Not so good for alignment.
  • Accuracy on short commands and single words is not great; results on longer chunks are much better.
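
The deletion bias is easy to measure yourself: a standard Levenshtein alignment between reference and hypothesis splits errors into substitutions, insertions, and deletions. A minimal sketch (not the scoring tool we used; the example sentences are made up):

```python
# Word-level edit-distance alignment that counts substitutions, insertions
# and deletions separately, so a model that "eats" words (del > ins) is
# easy to spot.
def error_counts(ref, hyp):
    r, h = ref.split(), hyp.split()
    # dp[i][j] = (total_errors, subs, ins, dels) aligning r[:i] with h[:j]
    dp = [[None] * (len(h) + 1) for _ in range(len(r) + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, len(r) + 1):
        dp[i][0] = (i, 0, 0, i)          # delete every reference word
    for j in range(1, len(h) + 1):
        dp[0][j] = (j, 0, j, 0)          # insert every hypothesis word
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            if r[i - 1] == h[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]   # match, no new error
                continue
            sub, dele, ins = dp[i - 1][j - 1], dp[i - 1][j], dp[i][j - 1]
            m = min(sub[0], dele[0], ins[0])
            if m == sub[0]:
                dp[i][j] = (sub[0] + 1, sub[1] + 1, sub[2], sub[3])
            elif m == dele[0]:
                dp[i][j] = (dele[0] + 1, dele[1], dele[2], dele[3] + 1)
            else:
                dp[i][j] = (ins[0] + 1, ins[1], ins[2] + 1, ins[3])
    total, subs, inss, dels = dp[len(r)][len(h)]
    return {"sub": subs, "ins": inss, "del": dels, "wer": total / max(len(r), 1)}

# A model that drops words shows del > ins:
print(error_counts("please call the help desk now", "please call help desk"))
# {'sub': 0, 'ins': 0, 'del': 2, 'wer': 0.3333...}
```

Running a breakdown like this per dataset shows where the deletions concentrate, which matters if you plan to use the model for forced alignment.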

We didn't investigate the adaptation capabilities of Citrinet. From previous experiments, Quartznet does not adapt very well, so we shall see. The Wav2Vec model is not as strong as the others on callcenter data, but it certainly adapts very well; it even overtrains a bit. Adaptation is something to explore in the future.

Results, word error rate (WER, %):

Dataset                  Vosk Aspire   Vosk Daanzu   Facebook RASR   Facebook Wav2Vec2.0   Nvidia Quartznet   Nvidia Citrinet
Librispeech test-clean         11.72          7.08            3.30                   2.6               3.78              2.78
Tedlium test                   11.23          8.25            5.96                   6.3               8.03              5.61
Google commands                46.76         11.64           20.06                  24.1              44.40             28.15
Non-native speech              57.92         33.31           26.99                  29.6              31.06             28.78
Children speech                20.29          9.90            6.17                   5.5               8.17              6.85
Podcasts                       19.85         21.21           15.06                  17.0              24.47             14.82
Callcenter bot                 17.20         19.22           14.55                  22.5              14.55             12.85
Callcenter 1                   53.98         52.97           42.82                  46.7              42.18             36.05
Callcenter 2                   33.82         43.02           30.41                  36.9              31.45             29.40
Callcenter 3                   35.86         52.80           32.98                  40.9              33.03             29.78