Written by Nickolay Shmyrev
on March 02, 2021

Wav2Vec2.0 Test Results

We continue testing the most advanced ASR models. Here we try the famous Wav2Vec2.0, an impressive work by Facebook. Here are the previous posts:

Nvidia Nemo

Wav2Letter RASR

The ideas behind Wav2Vec are extremely hot today: pretraining, contrastive learning, huge masked models, etc. Indeed, as you can see below, the accuracy is pretty nice. It is not as good as RASR and Nemo, but still nice.

Here we tested the Wav2Vec 2.0 Large (LV-60) model with the Fairseq/Flashlight/Paddlepaddle/Kenlm decoder. We find this model works best across diverse conditions; the self-training model seems to be even worse on callcenter and podcast data.

The model is trained mostly on books (Librispeech and Librilight), so it doesn’t work well with callcenter and accented data. Maybe finetuning will help.
Decoding is not very easy to set up: the data files use their own format, not even similar to wav2letter, and several preparation steps are required. Still, it is manageable.
The decoding speed is good even though the model itself is almost 3 GB.
The computation cost to train such a model from scratch is of course unbelievable. The whole point of this model is that you can reuse Facebook’s compute resources in your own research.
Be sure to use LM beam search decoding. It is much more accurate than the widely advised greedy decoding without an LM, as used for example in the transformers setup. While greedy decoding is ok on Librispeech, on most noisy datasets it is obviously much worse.
The default recipe suggests an uppercase lexicon and LM, while most LMs are lowercase. I’d recommend moving to lowercase everywhere and converting the token vocabulary, the lexicon and so on.
The default beams are too narrow. In general, the default options need care.
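
To make clear what "greedy decoding without LM" means in the notes above, here is a minimal sketch of greedy CTC decoding in plain Python. It is not the Fairseq or transformers implementation. The toy vocabulary, the `<pad>` blank token, the `|` word separator and the `ctc_greedy_decode` name are assumptions made for illustration.

```python
from itertools import groupby


def ctc_greedy_decode(frame_scores, vocab, blank="<pad>", word_sep="|"):
    """Greedy CTC decoding: best token per frame, collapse repeats, drop blanks.

    frame_scores: list of per-frame score lists, one score per vocab entry.
    vocab, blank and word_sep are illustrative assumptions, not a real model's files.
    """
    # 1. Pick the highest-scoring token index in every frame.
    best = [max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores]
    # 2. Collapse consecutive repeats of the same index.
    collapsed = [idx for idx, _ in groupby(best)]
    # 3. Remove blanks and map the remaining indices to characters.
    chars = [vocab[i] for i in collapsed if vocab[i] != blank]
    # 4. Turn the word separator into spaces.
    return "".join(chars).replace(word_sep, " ").strip()


# Toy example with one-hot "scores" (hypothetical data, not model output).
vocab = ["<pad>", "|", "a", "c", "t"]


def one_hot(index, size=len(vocab)):
    return [1.0 if i == index else 0.0 for i in range(size)]


frames = [one_hot(i) for i in [3, 3, 0, 2, 2, 0, 4, 1, 2]]
decoded = ctc_greedy_decode(frames, vocab)
print(decoded)  # cat a
```

Every frame is decided on its own here, with no lexicon and no language model, so a noisy frame directly turns into a wrong letter. Beam search with an LM keeps several hypotheses alive and rescores them, which is a plausible reason why it holds up better on noisy data.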

Dataset	Vosk Aspire	Vosk Daanzu	Facebook RASR	Nvidia NEMO	Facebook Wav2Vec2.0
Librispeech test-clean	11.72	7.08	3.30	3.78	2.6
Tedlium test	11.23	8.25	5.96	8.03	6.3
Google commands	46.76	11.64	20.06	44.40	24.1
Non-native speech	57.92	33.31	26.99	31.06	29.6
Children speech	20.29	9.90	6.17	8.17	5.5
Podcasts	19.85	21.21	15.06	24.47	17.0
Callcenter bot	17.20	19.22	14.55	14.55	22.5
Callcenter 1	53.98	52.97	42.82	42.18	46.7
Callcenter 2	33.82	43.02	30.41	31.45	36.9
Callcenter 3	35.86	52.80	32.98	33.03	40.9

Adaptation

Same as before, the model doesn’t adapt well to LM perplexity improvements:

Dataset	Vosk Daanzu	Facebook RASR	Nvidia NEMO	Wav2Vec2.0
LM perplexity 120	33.31	26.99	31.06	29.6
LM perplexity 15	11.48	20.59	20.81	23.3
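
To put the adaptation numbers in perspective, the relative error reduction for each system can be computed from the table above. This is a small sketch, assuming the table values are error rates in percent. The `relative_reduction` helper is a name made up for this example.

```python
# Error rates copied from the adaptation table above:
# (LM perplexity 120, LM perplexity 15)
error_rates = {
    "Vosk Daanzu": (33.31, 11.48),
    "Facebook RASR": (26.99, 20.59),
    "Nvidia NEMO": (31.06, 20.81),
    "Wav2Vec2.0": (29.6, 23.3),
}


def relative_reduction(before, after):
    """Relative error reduction in percent when moving from `before` to `after`."""
    return 100.0 * (before - after) / before


reductions = {
    name: relative_reduction(before, after)
    for name, (before, after) in error_rates.items()
}

for name, value in reductions.items():
    print(f"{name}: {value:.1f}% relative reduction")
```

With these numbers, Vosk Daanzu gains about 65.5% relative from the better LM, while RASR, NEMO and Wav2Vec2.0 gain about 23.7%, 33.0% and 21.3%.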

Tuning

The overall question now is whether one can build an accurate system with this technology in reasonable time and with reasonable resources. The promise of finetuning is that we can. We will explore this question in more detail in the next post.
