Wav2Vec2.0 Test Results
Written by Nickolay Shmyrev
We continue testing the most advanced ASR models; here we try the famous Wav2Vec2.0, an impressive work by Facebook. Here are the previous posts:
- Nvidia Nemo
- Wav2Letter RASR
The ideas behind Wav2Vec are extremely hot today: pretraining, contrastive learning, huge masked models, and so on. Indeed, as you can see below, the accuracy is pretty nice. It is not as good as RASR and Nemo, but still nice.
Here we tested the Wav2Vec 2.0 Large (LV-60) model with the Fairseq/Flashlight/Paddlepaddle/Kenlm decoder. We find this model works best across diverse conditions; the self-training model seems to be even worse on callcenter and podcast data.
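As a quick sanity check, the model can also be run through the transformers package. Below is a minimal greedy-decoding sketch; the facebook/wav2vec2-large-960h-lv60 checkpoint name and the test.wav input (16 kHz mono) are assumptions, not part of our test setup:

```python
# Minimal greedy (no-LM) decoding sketch via transformers.
# Assumptions: facebook/wav2vec2-large-960h-lv60 is the checkpoint that
# matches the LV-60 model discussed here, and test.wav is 16 kHz mono audio.
import torch
import soundfile as sf
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-960h-lv60")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h-lv60")
model.eval()

speech, sample_rate = sf.read("test.wav")
inputs = processor(speech, sampling_rate=sample_rate, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits  # (batch, frames, vocab)

# Greedy decoding: best token per frame, CTC collapsing done by the tokenizer.
ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(ids)[0])
```

Keep in mind that greedy decoding like this is only acceptable on clean data; see the notes below.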
- The model is trained mostly on books (Librispeech and Librilight), so it doesn't work well with callcenter and accented data; maybe finetuning will help.
- Decoding is not very easy to set up due to the separate format of the data files (not even similar to wav2letter) and the several preparation steps required, but it is manageable.
- The speed of decoding is good even though the model itself is almost 3 GB.
- The computation cost to train such a model from scratch is of course unbelievable. The whole point of this model is that you can reuse Facebook's compute resources in your own research.
- Be careful to use LM beam search decoding; it is much more accurate than the widely advised greedy decoding without an LM (the transformers setup, for example). While greedy decoding is ok on Librispeech, on most noisy datasets it is obviously much worse; see the sketch after this list.
- The default recipe suggests an uppercase lexicon and LM, while most LMs are lowercase. I'd recommend moving to lowercase everywhere and converting the token vocabulary, the lexicon, and so on.
- Default beams are too narrow; in general, the default options need care.
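To illustrate the last three points, here is a sketch of LM beam search decoding using pyctcdecode with KenLM rather than the Fairseq/Flashlight decoder we actually used for the numbers below; the checkpoint name, the lm.arpa path, and the beam width of 500 are assumptions:

```python
# LM beam search decoding sketch with pyctcdecode + KenLM.
# Assumptions: facebook/wav2vec2-large-960h-lv60 checkpoint, a lowercase
# ARPA language model at lm.arpa, and 16 kHz mono test.wav.
import torch
import soundfile as sf
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
from pyctcdecode import build_ctcdecoder

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-960h-lv60")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h-lv60")
model.eval()

# The checkpoint vocabulary is uppercase while most LMs are lowercase, so
# lowercase the labels; "<pad>" is the CTC blank and "|" the word separator.
tokenizer = processor.tokenizer
labels = [tokenizer.convert_ids_to_tokens(i).lower()
          for i in range(tokenizer.vocab_size)]
labels = [{"<pad>": "", "|": " "}.get(t, t) for t in labels]

decoder = build_ctcdecoder(labels, kenlm_model_path="lm.arpa")

speech, sample_rate = sf.read("test.wav")
inputs = processor(speech, sampling_rate=sample_rate, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits[0].numpy()

# Use a wider beam than the default; narrow default beams hurt accuracy.
print(decoder.decode(logits, beam_width=500))
```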
Word error rates (WER, %):

| Dataset | Vosk Aspire | Vosk Daanzu | Facebook RASR | Nvidia NEMO | Facebook Wav2Vec2.0 |
|---|---|---|---|---|---|
| Librispeech test-clean | 11.72 | 7.08 | 3.30 | 3.78 | 2.6 |
| Tedlium test | 11.23 | 8.25 | 5.96 | 8.03 | 6.3 |
| Google commands | 46.76 | 11.64 | 20.06 | 44.40 | 24.1 |
| Non-native speech | 57.92 | 33.31 | 26.99 | 31.06 | 29.6 |
| Children speech | 20.29 | 9.90 | 6.17 | 8.17 | 5.5 |
| Podcasts | 19.85 | 21.21 | 15.06 | 24.47 | 17.0 |
| Callcenter bot | 17.20 | 19.22 | 14.55 | 14.55 | 22.5 |
| Callcenter 1 | 53.98 | 52.97 | 42.82 | 42.18 | 46.7 |
| Callcenter 2 | 33.82 | 43.02 | 30.41 | 31.45 | 36.9 |
| Callcenter 3 | 35.86 | 52.80 | 32.98 | 33.03 | 40.9 |
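As a reference point, here is a tiny sketch of how such numbers can be computed; the jiwer package and the toy transcripts are assumptions, any WER tool would do:

```python
# WER computation sketch. Assumptions: the jiwer package, and hypothetical
# reference/hypothesis transcripts standing in for a real test set.
import jiwer

references = ["the quick brown fox", "hello world"]
hypotheses = ["the quick brown fox", "hello word"]

# jiwer.wer returns the aggregate word error rate over the whole set.
print(100 * jiwer.wer(references, hypotheses))  # WER in percent
```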
Adaptation
As before, the model doesn't adapt well to LM perplexity improvements:
| Setup | Vosk Daanzu | Facebook RASR | Nvidia NEMO | Wav2Vec2.0 |
|---|---|---|---|---|
| LM perplexity 120 | 33.31 | 26.99 | 31.06 | 29.6 |
| LM perplexity 15 | 11.48 | 20.59 | 20.81 | 23.3 |
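For completeness, here is how one can measure an ARPA model's perplexity with the KenLM Python bindings; the lm.arpa path and the test sentence are hypothetical:

```python
# Scoring a sentence with KenLM to check LM quality. Assumption: lm.arpa
# is a hypothetical path to your ARPA language model.
import kenlm

model = kenlm.Model("lm.arpa")
sentence = "how are you doing today"

print(model.score(sentence))       # total log10 probability
print(model.perplexity(sentence))  # per-word perplexity
```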
Tuning
The overall question now is: can one build an accurate system with this technology in reasonable time and with reasonable resources? The promise of finetuning is that we can; we will explore this question in more detail in the next post.