Wav2Letter RASR Model Test Results
There are so many toolkits and model releases these days that some interesting things go unnoticed. Some time ago Facebook published a short paper, Rethinking Evaluation in ASR: Are Our Models Robust Enough?, and released a set of robust US English models. They trained a big English conformer model on a number of datasets (LibriSpeech, Common Voice, Fisher, their private video dataset), about 9k hours in total. And, I must say, to date the model is one of the best.
I ran a number of tests on different datasets; here is a comparison of WER between the existing Vosk models and the RASR model.
| Dataset | Aspire Vosk model | Daanzu Vosk model | RASR model |
|---------|-------------------|-------------------|------------|
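For readers unfamiliar with the metric, the WER numbers above are word-level edit distance divided by reference length. A minimal stdlib-only sketch (this is not the scoring tool used in the paper, just an illustration of the computation):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution
        prev = cur
    return prev[-1] / len(ref)

# one substitution ("sat" -> "sit") + one deletion ("the") over 6 words
print(round(wer("the cat sat on the mat", "the cat sit on mat"), 3))  # → 0.333
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is why very noisy decodes sometimes score above 100%.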
I would say this model is quite competitive today. I also compared it with a few specialized models, and while the specialized models are better in terms of WER, the RASR model can probably be finetuned to very good accuracy. I have yet to try that.
On call center data the model is not that shiny: the very old and slow ASPIRE model is not significantly worse. But, again, given the finetuning capability, it has great potential.
For videos/podcasts the accuracy is great, even for non-native speech.
The model requires the new Flashlight library, which is not very easy to set up; I recommend trying it with Docker. The Docker setup works smoothly.
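A possible shape of the Docker route, assuming a prebuilt Flashlight image and a local `model/` directory with the downloaded files (both the image name and the paths are assumptions; check the Flashlight release page for the exact ones):

```shell
# Pull a prebuilt Flashlight image (tag is an assumption -- verify
# against the Flashlight/wav2letter documentation)
docker pull flml/flashlight:cpu-latest

# Start a shell in the container with the downloaded RASR models
# mounted at /model
docker run --rm -it -v "$(pwd)/model:/model" flml/flashlight:cpu-latest
```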
The model is pretty big (2.5G acoustic model, 2.5G language model), but surprisingly it runs in reasonable time even on CPU, so it can be used both for training and sometimes for decoding, when good accuracy is required and speed is not an issue.
The paper doesn’t mention the training time for the models, but I suspect it was quite long and required many GPUs, so such a model is not easily reproducible in a small organization. That makes the release even more valuable.
Thumbs up, Facebook!