OpenAI Whisper Accuracy (TFLite, Whisper.CPP and Large-V2)

Written by Nickolay Shmyrev
The Whisper popularity wave continues. Many projects are appearing for Whisper-based
web services, Whisper on mobile and so on. Some projects modify the Whisper
models and algorithms to improve speed, which raises questions about
their accuracy. Here we tested a couple of different projects to demonstrate
the effect those algorithmic modifications have on accuracy.
There is some accuracy drop, but accuracy is still extremely impressive.
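The numbers in the tables below are word error rates (WER) in percent. As a reference for how such numbers are computed, here is a minimal WER sketch based on word-level edit distance; it is only an illustration, not the exact scoring tool used for the tables:

```python
# Minimal word error rate (WER) computation via edit distance.
# Illustrative sketch only, not the exact scoring tool used for the tables below.

def wer(reference: str, hypothesis: str) -> float:
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return 100.0 * d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("turn on the light", "turn of the light"))  # 25.0
```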
Note that the TFLite model of just 40 MB demonstrates extremely good
performance even on complicated datasets.
TFLite decoding is a bit slow though, about 0.5xRT on an Intel CPU. With
modern hardware accelerators (like the M1 chip) the decoder runs quickly enough
even on mobile devices.
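For those who want to try it, here is a rough sketch of running such a TFLite export with the standard TensorFlow Lite interpreter. The model file name and the exact input/output signature (a log-mel spectrogram in, token IDs out) depend on the particular conversion, so treat them as assumptions, not a recipe:

```python
# Rough sketch of running a Whisper TFLite export with the TF Lite interpreter.
# The file name "whisper-tiny.tflite" and the input/output layout (log-mel
# spectrogram in, token IDs out) are assumptions about the conversion --
# check the signatures of your particular export.
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="whisper-tiny.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
print("expects:", input_details[0]["shape"])  # e.g. [1, 80, 3000] mel frames

# Dummy 30-second log-mel input just to exercise the graph.
mel = np.zeros(input_details[0]["shape"], dtype=np.float32)
interpreter.set_tensor(input_details[0]["index"], mel)
interpreter.invoke()

tokens = interpreter.get_tensor(output_details[0]["index"])
print("decoded token ids:", tokens)
```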
The Whisper.cpp project is still at a very early stage. There are many changes
to the algorithms (like the 2x speed-up of audio decoding, a questionable practice)
and issues with the Python bindings (the decoder object is not reusable, memory
corruption).
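One way to sidestep the binding problems is to call the whisper.cpp command-line tool once per file from Python, so every decode starts from a clean process. A sketch along these lines (the binary and model paths are placeholders, and the flags are those of the `main` example in the whisper.cpp repository; check your build):

```python
# Sketch: run the whisper.cpp "main" example per file via subprocess instead of
# the Python bindings, so each decode starts from a clean process.
# The binary path, model path and flags may differ for your whisper.cpp build.
import subprocess

def transcribe(wav_path: str,
               binary: str = "./main",
               model: str = "models/ggml-tiny.bin") -> str:
    result = subprocess.run(
        [binary, "-m", model, "-f", wav_path, "--no-timestamps"],
        capture_output=True, text=True, check=True)
    return result.stdout.strip()

print(transcribe("test.wav"))
```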
| Dataset | Whisper TFLite | Whisper CPP | Whisper Tiny |
|---|---|---|---|
| Librispeech test-clean | 6.3 | 6.6 | 7.0 |
| Tedlium test | 7.2 | 7.9 | 8.6 |
| Google commands | 34.4 | 100.0 | 33.9 |
| Non-native speech | 45.2 | 40.7 | 41.7 |
| Children speech | 9.7 | 13.2 | 10.4 |
| Podcasts | 16.7 | 16.3 | 17.0 |
| Callcenter bot | 17.6 | 20.8 | 22.0 |
| Callcenter 1 | 45.0 | 43.1 | 45.9 |
| Callcenter 2 | 34.3 | 31.9 | 33.4 |
| Callcenter 3 | 43.3 | 38.4 | 40.7 |
We also tested the recently released Large-V2 model
(https://huggingface.co/openai/whisper-large-v2), which is claimed to be more
accurate. Surprisingly, it is visibly worse than the original V1 version,
especially on short commands. Other users report it too, see
https://github.com/openai/whisper/discussions/657
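To look at the two checkpoints side by side one can decode the same file with both, for example with the openai-whisper package; "large-v1" and "large-v2" are the model names shipped with the package at the time of writing, and the file name below is just a placeholder:

```python
# Sketch: decode the same file with Large V1 and Large V2 for a side-by-side look.
# "large-v1" and "large-v2" are the model names in the openai-whisper package;
# the audio file name is a placeholder.
import whisper

audio = "podcast_chunk.wav"  # 16 kHz mono wav

for name in ("large-v1", "large-v2"):
    model = whisper.load_model(name)
    result = model.transcribe(audio, language="en")
    print(name, "->", result["text"])
```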
| Dataset | Whisper Large V1 | Whisper Large V2 |
|---|---|---|
| Librispeech test-clean | 4.0 | 4.1 |
| Tedlium test | 6.5 | 9.1 |
| Google commands | 33.9 | 54.0 |
| Non-native speech | 18.7 | 18.8 |
| Children speech | 4.5 | 4.9 |
| Podcasts | 13.5 | 15.8 |
| Callcenter bot | 10.1 | 14.9 |
| Callcenter 1 | 31.8 | 31.7 |
| Callcenter 2 | 26.0 | 29.9 |
| Callcenter 3 | 31.4 | 32.8 |
The big thing in Whisper is that it uses a huge context. But the extent to which
accuracy degrades without that context is not very clear. Here we can provide some numbers:
| Dataset | Whisper Large Segmented (3-5 seconds) | Whisper Large Whole Files |
|---|---|---|
| Podcasts | 13.5 | 11.6 |
As you can see, Whisper's advantage on long files is visible, although
probably not critical. As you also see above, recognition accuracy on
short commands is pretty bad, so it is not clear how well Whisper will
perform in realtime applications.
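For completeness, here is a rough sketch of what such a segmented-versus-whole-file comparison looks like with the openai-whisper package. The fixed 5-second windows below are only an illustration, not necessarily the exact 3-5 second segmentation used for the table above:

```python
# Sketch: compare whole-file decoding against decoding in short chunks to see
# the effect of context. Fixed 5-second windows are only an illustration; the
# table above was produced with a 3-5 second segmentation of the data.
import whisper
import soundfile as sf

model = whisper.load_model("large")
audio_path = "podcast.wav"  # 16 kHz mono

# Whole file: Whisper sees the full context and conditions on previous segments.
whole = model.transcribe(audio_path)["text"]

# Short chunks: each 5-second piece is decoded independently, with no context.
data, sr = sf.read(audio_path, dtype="float32")
chunk = 5 * sr
pieces = []
for start in range(0, len(data), chunk):
    pieces.append(model.transcribe(data[start:start + chunk])["text"])
segmented = " ".join(pieces)

print("whole-file length:", len(whole.split()), "words")
print("segmented length: ", len(segmented.split()), "words")
```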
Let's continue with Whisperology. There are still many, many questions about how
such a small model works so well.