Wav2Vec2.0 Test Results
We continue our testing of the most advanced ASR models; this time we try Wav2Vec 2.0,
an impressive and widely known work by Facebook. Here are our previous posts:
The ideas behind Wav2Vec 2.0 are extremely hot today - pre-training,
contrastive learning, huge masked models, etc. Indeed, as you can see
below, the accuracy is quite good. It is not as good as RASR and NeMo,
but still respectable.
Here we tested the Wav2Vec 2.0 Large (LV-60) model
with the Fairseq/Flashlight/Paddlepaddle/Kenlm decoder. We found that this model
works best across diverse conditions; the self-training variant seems to be even worse on call-center and podcast audio.
The model is trained mostly on audiobooks (LibriSpeech and LibriLight), so it does not work well with call-center or accented data; maybe fine-tuning will help.
Decoding is not very easy to set up: the data files use their own format,
not even similar to wav2letter's, and several preparation steps are
required, but it is manageable.
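To make the preparation steps concrete, here is a minimal sketch of writing the fairseq-style manifests the wav2vec examples expect: a `.tsv` file (audio root on the first line, then `path<TAB>frames` per utterance) and a `.ltr` file with space-separated letters and `|` as the word boundary. The file names, the frame count, and the helper itself are illustrative assumptions, not the exact recipe.

```python
import os
import tempfile

def write_manifests(root, utterances, split="test"):
    """Write a fairseq-style .tsv audio manifest and .ltr letter transcript.

    `utterances` is a list of (relative_wav_path, num_frames, transcript).
    The .tsv starts with the audio root dir, then one "path<TAB>frames"
    line per file; the .ltr file has one line per utterance with
    space-separated letters and "|" marking word boundaries.
    """
    tsv_path = os.path.join(root, f"{split}.tsv")
    ltr_path = os.path.join(root, f"{split}.ltr")
    with open(tsv_path, "w") as tsv, open(ltr_path, "w") as ltr:
        print(root, file=tsv)
        for rel_path, n_frames, text in utterances:
            print(f"{rel_path}\t{n_frames}", file=tsv)
            # "hello world" -> "H E L L O | W O R L D |"
            words = text.upper().split()
            print(" ".join(" ".join(w) + " |" for w in words), file=ltr)
    return tsv_path, ltr_path

# Hypothetical example; real frame counts would come from the audio headers.
root = tempfile.mkdtemp()
tsv, ltr = write_manifests(root, [("calls/utt1.wav", 48000, "hello world")])
```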
Decoding speed is good even though the model itself is almost 3 GB.
The compute cost of training such a model from scratch is, of course,
staggering. The whole point of this model is that you get to reuse
Facebook's compute in your own research by starting from their pretrained checkpoint.
Be sure to use LM beam-search decoding: it is much more accurate
than the widely advised greedy decoding without an LM (e.g. the
transformers setup). While greedy decoding is fine on LibriSpeech, on
most noisy datasets it is clearly much worse.
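For reference, "greedy" decoding is nothing more than taking the per-frame argmax, merging repeats, and dropping the CTC blank - a minimal sketch (the toy vocabulary below is made up):

```python
def ctc_greedy_decode(frame_ids, blank=0, id2char=None):
    """Collapse a per-frame argmax sequence: merge repeats, then drop blanks.

    This is the whole "greedy" decoder - no lexicon, no LM - which is why
    it degrades on noisy audio where the acoustic model alone is uncertain.
    """
    out, prev = [], None
    for i in frame_ids:
        if i != prev and i != blank:
            out.append(i)
        prev = i
    if id2char:
        return "".join(id2char[i] for i in out)
    return out

# Toy vocabulary; index 0 is the CTC blank.
id2char = {1: "c", 2: "a", 3: "t"}
print(ctc_greedy_decode([0, 1, 1, 0, 2, 2, 3, 0], id2char=id2char))  # -> cat
```

A beam-search decoder with a KenLM model replaces the per-frame argmax with a scored search over prefixes, which is what recovers accuracy on noisy data.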
The default recipe suggests an uppercase lexicon and LM, but most LMs are lowercase. I'd recommend moving to lowercase everywhere
and converting the token vocabulary, lexicon, and so on.
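The lexicon conversion can be sketched as a one-pass lowercasing that also drops the duplicate entries lowercasing creates (e.g. "The" vs "THE"). The `WORD<TAB>W O R D |` line format is assumed from the flashlight-style lexicons; the file names are hypothetical.

```python
import os
import tempfile

def lowercase_lexicon(in_path, out_path):
    """Lowercase a flashlight-style lexicon ("WORD<TAB>W O R D |" per line),
    skipping duplicate lines that lowercasing creates."""
    seen = set()
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            low = line.lower()
            if low not in seen:
                seen.add(low)
                fout.write(low)

# Hypothetical round trip on a tiny lexicon.
d = tempfile.mkdtemp()
src, dst = os.path.join(d, "lexicon.txt"), os.path.join(d, "lexicon.lc.txt")
with open(src, "w") as f:
    f.write("THE\tT H E |\nThe\tT H E |\nHELLO\tH E L L O |\n")
lowercase_lexicon(src, dst)
```

The token vocabulary (`dict.ltr.txt`-style files) needs the same treatment so the decoder, lexicon, and LM all agree on case.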
The default beams are too narrow; in general, the default options need care.
Same as before, the model doesn't adapt well to LM perplexity improvements:
[Table: results with LM perplexity 120 vs LM perplexity 15]
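As a reminder of what the comparison measures: perplexity is the exponential of the negative mean log-probability per token, so an LM at perplexity 15 is far less "surprised" per word than one at 120. A quick sketch with made-up uniform per-token probabilities:

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-mean log-probability per token).

    Lower means the LM finds the text more predictable; natural-log
    probabilities are assumed here.
    """
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# A model assigning every word probability 1/120 vs 1/15:
print(perplexity([math.log(1 / 120)] * 5))  # -> 120.0
print(perplexity([math.log(1 / 15)] * 5))   # -> 15.0
```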
The overall question now is: can one build an accurate system with this
technology in reasonable time and with reasonable resources? The promise of
fine-tuning is that we can; we will explore this question in more detail in the next post.