Written by Nickolay Shmyrev
Wav2Letter RASR Model Test Results
There are so many toolkits and model releases that some interesting
things are left unnoticed. Some time ago Facebook published a short paper,
"Rethinking Evaluation in ASR: Are Our Models Robust Enough?", and released a
set of robust US English models.
They trained a big English conformer model on a number of datasets
(librispeech, common voice, fisher, their private video dataset), about
9k hours in total. And, I must say, to date the model is one of the best
ones.
I ran a number of tests on different datasets, and here is the comparison
of WER (in percent, lower is better) for the existing Vosk models and the RASR model.
| Dataset | Aspire Vosk model | Daanzu Vosk model | RASR model |
|---|---|---|---|
| Librispeech test-clean | 11.72 | 7.08 | 3.30 |
| Tedlium test | 11.23 | 8.25 | 5.96 |
| Google commands | 46.76 | 11.64 | 20.06 |
| Non-native speech | 57.92 | 33.31 | 26.99 |
| Children speech | 20.29 | 9.90 | 6.17 |
| Podcasts | 19.85 | 21.21 | 15.06 |
| Callcenter bot | 17.20 | 19.22 | 14.55 |
| Callcenter 1 | 53.98 | 52.97 | 42.82 |
| Callcenter 2 | 33.82 | 43.02 | 30.41 |
| Callcenter 3 | 35.86 | 52.80 | 32.98 |
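For readers less familiar with the metric, here is a minimal sketch of how WER is usually computed: the word-level edit distance between the reference and the recognized text, divided by the number of reference words. The function name `word_error_rate` is made up for this illustration, and the sketch assumes both strings are already normalized (lowercased, punctuation removed). It is not the scoring tool used for the numbers above, which the post does not specify.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent using word-level Levenshtein distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (one row back) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing in hypothesis
            insertion = cur[j - 1] + 1  # extra word in hypothesis
            cur.append(min(substitution, deletion, insertion))
        prev = cur

    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # One deletion ("the") and one substitution ("on" -> "off") over 4 words.
    print(word_error_rate("turn the light on", "turn light off"))
```

Since insertions count as errors, WER can exceed 100% on very noisy output.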
I would say that today this model is quite competitive. I also
compared the model with a few specialized models, and while the specialized
models are better in terms of WER, the RASR model can probably be finetuned
to very good accuracy. I have yet to try it.
On callcenter data the model does not shine as much: the very old and slow
ASPIRE model is not significantly worse. But, again, given its finetuning
capabilities, it has great potential.
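To put the callcenter comparison into numbers, here is a small sketch that computes the relative WER reduction of the RASR model over the Aspire model. The WER values are copied from the results table; the helper name `relative_reduction` is just an illustration.

```python
# WER (%) copied from the results table: (Aspire Vosk model, RASR model)
callcenter_wer = {
    "Callcenter bot": (17.20, 14.55),
    "Callcenter 1": (53.98, 42.82),
    "Callcenter 2": (33.82, 30.41),
    "Callcenter 3": (35.86, 32.98),
}


def relative_reduction(baseline: float, candidate: float) -> float:
    """Relative WER reduction of candidate over baseline, in percent."""
    return 100.0 * (baseline - candidate) / baseline


reductions = {
    name: relative_reduction(aspire, rasr)
    for name, (aspire, rasr) in callcenter_wer.items()
}

for name, value in reductions.items():
    print(f"{name}: {value:.1f}% relative WER reduction")
```

The reductions come out at roughly 8% to 21% relative, noticeably smaller than the gap on Librispeech or Tedlium.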
For videos/podcasts the accuracy is great, even for non-native speech.
The model requires the new Flashlight library, which is not very easy to set up,
so I recommend trying it with Docker. The Docker setup works smoothly.
The model is pretty big (2.5G acoustic model, 2.5G language model), but
surprisingly it runs in a reasonable time even on CPU, so it can be used
both in training and sometimes for decoding if good accuracy is required
and speed is not an issue.
The paper doesn’t mention the training time for the models but I suspect
it was quite long and required many GPUs, so such a model is not easily
reproducible in a small organization. That makes the model even more
valuable.
Thumbs up, Facebook!