Wav2Vec2.0 Test Results

We continue our testing of the most advanced ASR models; this time we try the famous Wav2Vec 2.0, an impressive work by Facebook. Here are the previous posts:

Nvidia Nemo

Wav2Letter RASR

The ideas behind Wav2Vec are extremely hot today: pretraining, contrastive learning, huge masked models, and so on. Indeed, as you can see below, the accuracy is pretty nice. It is not as good as RASR and Nemo overall, but still nice.

Here we tested the Wav2Vec 2.0 Large (LV-60) model with the Fairseq/Flashlight/Paddlepaddle/Kenlm decoder. We found this model works best across diverse conditions; the self-trained model seems to be even worse on callcenter and podcast data. Some observations:

  • The model was trained mostly on audiobooks (LibriSpeech and LibriLight), so it doesn’t work well with callcenter and accented data; maybe finetuning will help.

  • Decoding is not very easy to set up due to a separate data file format (not even similar to wav2letter’s) and several required preparation steps, but it is manageable.

  • Decoding speed is good even though the model itself is almost 3 GB.

  • The computational cost of training such a model from scratch is of course unbelievable. The whole point of this model is that you can reuse Facebook’s compute resources in your own research.

  • Be sure to use LM beam search decoding; it is much more accurate than the widely advised greedy decoding without an LM (for example, in the transformers setup). While greedy decoding is ok on LibriSpeech, on most noisy datasets it is obviously much worse.

  • The default recipe suggests an uppercase lexicon and LM, while most LMs are lowercase. I’d recommend moving to lowercase everywhere and converting the token vocabulary, the lexicon, and so on.

  • The default beams are too narrow; in general, the default options need care.
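To make the greedy-vs-beam-search point above concrete, here is a minimal sketch of greedy (argmax) CTC decoding in plain Python. The tiny vocabulary, blank symbol, and toy frame probabilities are all made up for illustration; real Wav2Vec 2.0 emits logits over its own token set, and an LM beam search instead keeps many partial hypotheses and rescores them with the language model rather than committing to one token per frame.

```python
# Toy greedy (argmax) CTC decoding: pick the best token per frame,
# collapse repeats, drop blanks. Vocabulary and probabilities are
# illustrative, not Wav2Vec 2.0's real output.
BLANK = "_"

def greedy_ctc_decode(frames, vocab, blank=BLANK):
    """frames: list of per-token probability lists, one per time step."""
    best = [vocab[max(range(len(f)), key=f.__getitem__)] for f in frames]
    out, prev = [], None
    for tok in best:
        if tok != prev and tok != blank:
            out.append(tok)
        prev = tok
    return "".join(out)

vocab = [BLANK, "c", "a", "t"]
# Six frames whose argmax sequence is: c, c, blank, a, t, t
frames = [
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.6, 0.2, 0.1],
    [0.8, 0.1, 0.05, 0.05],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
    [0.2, 0.1, 0.1, 0.6],
]
print(greedy_ctc_decode(frames, vocab))  # → cat
```

The weakness on noisy data follows from this structure: greedy decoding commits to the single best token at every frame, so a few noisy frames corrupt the output directly, while a beam search with an LM can recover because the language model outweighs locally bad acoustic scores.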
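On the lowercasing point, a hypothetical helper for converting a wav2letter-style lexicon line (word, tab, space-separated spelling) to lowercase. The exact line format is an assumption; adjust it to whatever your lexicon files actually contain, and apply the same treatment to the token vocabulary.

```python
# Assumed lexicon line format: "WORD\tW O R D |" (wav2letter style).
# This is a sketch, not the official conversion tool.
def lowercase_lexicon_line(line: str) -> str:
    word, _, spelling = line.rstrip("\n").partition("\t")
    return word.lower() + "\t" + spelling.lower()

print(lowercase_lexicon_line("HELLO\tH E L L O |"))  # hello / h e l l o |
```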

Word error rates (WER, %):

Dataset                  Vosk Aspire  Vosk Daanzu  Facebook RASR  Nvidia NEMO  Facebook Wav2Vec2.0
Librispeech test-clean         11.72         7.08           3.30         3.78                  2.6
Tedlium test                   11.23         8.25           5.96         8.03                  6.3
Google commands                46.76        11.64          20.06        44.40                 24.1
Non-native speech              57.92        33.31          26.99        31.06                 29.6
Children speech                20.29         9.90           6.17         8.17                  5.5
Podcasts                       19.85        21.21          15.06        24.47                 17.0
Callcenter bot                 17.20        19.22          14.55        14.55                 22.5
Callcenter 1                   53.98        52.97          42.82        42.18                 46.7
Callcenter 2                   33.82        43.02          30.41        31.45                 36.9
Callcenter 3                   35.86        52.80          32.98        33.03                 40.9


Same as before, the models don’t adapt well to LM perplexity improvements:

LM                 Vosk Daanzu  Facebook RASR  Nvidia NEMO  Wav2Vec2.0
LM perplexity 120        33.31          26.99        31.06        29.6
LM perplexity 15         11.48          20.59        20.81        23.3
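For readers unfamiliar with the metric: perplexity is the exponential of the average negative log-probability the LM assigns to the test words, so a lower value means the LM predicts the target domain better. A toy sketch (the numbers are illustrative, not the LMs from the table):

```python
import math

# Perplexity = exp(-mean log-probability per word). Toy values only.
def perplexity(word_logprobs):
    """word_logprobs: natural-log probabilities the LM gave each word."""
    return math.exp(-sum(word_logprobs) / len(word_logprobs))

# An LM that assigns every word probability 1/120 has perplexity 120.
print(round(perplexity([math.log(1 / 120)] * 50)))  # → 120
```

The point of the table above is that dropping perplexity from 120 to 15 helps the hybrid Vosk Daanzu system dramatically, while the end-to-end models improve much less, since their acoustic model already encodes a strong implicit language model from the training text.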


The overall question now is: can one build an accurate system with this technology in reasonable time and with reasonable resources? The promise of finetuning is that we can; we will explore this question in more detail in the next post.