Wav2Vec2.0 Test Results
We continue our testing of the most advanced ASR models; this time we try the famous Wav2Vec 2.0, an impressive work by Facebook.
The ideas behind Wav2Vec are extremely hot today - pretraining, contrastive learning, huge masked models, etc. Indeed, as you can see below, the accuracy is pretty nice. It is not as good as RASR and NeMo, but still nice.
Here we tested the Wav2Vec 2.0 Large (LV-60) model with the Fairseq/Flashlight/Paddlepaddle/KenLM decoder. We find this variant works best across diverse conditions; the self-training variant seems to be even worse for call center and podcast data.
The model was trained mostly on books (LibriSpeech and Libri-Light), so it doesn't work well with call center and accented data; maybe finetuning will help.
Decoding is not very easy to set up due to the separate data file format (not even similar to wav2letter's) and the several preparation steps required, but it is manageable.
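As a rough sketch of the kind of preparation involved, here is a hypothetical helper that writes the fairseq-style input files (a `.tsv` audio manifest plus `.wrd`/`.ltr` label files); the layout follows fairseq's `wav2vec_manifest.py`/`libri_labels.py` convention, so double-check it against your fairseq version:

```python
import os

def write_manifest(utterances, out_dir, split="test"):
    """Write fairseq-style wav2vec 2.0 decoding inputs: a .tsv audio
    manifest plus .wrd (words) and .ltr (letters) label files.
    `utterances` is a list of (relative_wav_path, num_samples, transcript)
    tuples. Hypothetical helper -- verify against your fairseq version."""
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(out_dir, split + ".tsv"), "w") as tsv, \
         open(os.path.join(out_dir, split + ".wrd"), "w") as wrd, \
         open(os.path.join(out_dir, split + ".ltr"), "w") as ltr:
        tsv.write("/\n")  # first manifest line is the audio root directory
        for path, n_samples, text in utterances:
            text = text.upper()  # the default recipe keeps everything uppercase
            tsv.write(f"{path}\t{n_samples}\n")
            wrd.write(text + "\n")
            # letters separated by spaces, word boundaries marked with "|"
            ltr.write(" ".join(list(text.replace(" ", "|"))) + " |\n")
```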
Decoding speed is good even though the model itself is almost 3 GB.
The computational cost of training such a model from scratch is of course enormous. The whole point of this model is that you can reuse Facebook's compute in your own research.
Be careful to use LM beam search decoding; it is much more accurate than the widely advised greedy decoding without an LM (for example, in the transformers setup). While greedy decoding is ok on LibriSpeech, on most noisy datasets it is obviously much worse.
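A toy illustration of why the two can disagree (hypothetical numbers, pure-Python sketch): greedy decoding picks the argmax symbol per frame, while an LM-aware decoder sums probability over paths and rescores whole strings, so a string the LM likes can win even when no single frame prefers it.

```python
import itertools

BLANK = "_"
vocab = [BLANK, "a", "b"]

# Two frames of toy per-symbol CTC posteriors over (blank, "a", "b");
# hypothetical numbers chosen so the two decoders disagree.
frames = [
    [0.1, 0.5, 0.4],
    [0.1, 0.5, 0.4],
]

def collapse(path):
    """Standard CTC collapse: merge repeated symbols, then drop blanks."""
    out, prev = [], None
    for s in path:
        if s != prev and s != BLANK:
            out.append(s)
        prev = s
    return "".join(out)

def greedy_decode(frames):
    """Argmax per frame, then CTC collapse -- no LM involved."""
    return collapse(
        [vocab[max(range(len(p)), key=p.__getitem__)] for p in frames]
    )

def lm_decode(frames, lm, lm_weight=1.0):
    """Score every frame path exhaustively (fine for a toy this small),
    sum acoustic probability per collapsed string, then rescore with a
    toy string-level LM. Real decoders (Flashlight, pyctcdecode) keep a
    pruned beam instead of enumerating all paths."""
    scores = {}
    for path in itertools.product(range(len(vocab)), repeat=len(frames)):
        p = 1.0
        for t, i in enumerate(path):
            p *= frames[t][i]
        s = collapse([vocab[i] for i in path])
        scores[s] = scores.get(s, 0.0) + p
    return max(scores, key=lambda s: scores[s] * lm.get(s, 1e-4) ** lm_weight)

# Greedy returns "a", but an LM that strongly prefers "ab" flips the result.
print(greedy_decode(frames))                                   # -> "a"
print(lm_decode(frames, {"ab": 0.6, "a": 0.1, "b": 0.1}))      # -> "ab"
```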
The default recipe suggests an uppercase lexicon and LM, while most LMs are lowercase. I'd recommend moving to lowercase everywhere and converting the token vocabulary, the lexicon and so on.
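The conversion itself is trivial; a minimal sketch for a Flashlight-style lexicon (each line is a word, a tab, then its space-separated spelling tokens) - a hypothetical helper, and remember to run the same conversion over the token dictionary and the LM vocabulary so all three agree:

```python
def lowercase_lexicon(lines):
    """Lowercase a Flashlight-style lexicon, where each line looks like
    'HELLO\tH E L L O |'. Hypothetical helper: apply the same treatment
    to the token dictionary and the LM vocabulary so they stay consistent."""
    out = []
    for line in lines:
        word, _, spelling = line.partition("\t")
        out.append(word.lower() + "\t" + spelling.lower())
    return out
```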
Default beams are too narrow; in general, the default options need care.
| Dataset | Vosk Aspire | Vosk Daanzu | Facebook RASR | Nvidia NEMO | Facebook Wav2Vec2.0 |
|---|---|---|---|---|---|
Same as before, the model doesn't adapt well to LM perplexity improvements:
| Dataset | Vosk Daanzu | Facebook RASR | Nvidia NEMO | Wav2Vec2.0 |
|---|---|---|---|---|
| LM perplexity 120 | 33.31 | 26.99 | 31.06 | 29.6 |
| LM perplexity 15 | 11.48 | 20.59 | 20.81 | 23.3 |
The overall question now is: can one build an accurate system with this technology in reasonable time and with reasonable resources? The promise of finetuning is that we can; we will explore this question in more detail in the next post.