OpenAI Whisper Accuracy (TFLite, Whisper.cpp and Large-V2)

The Whisper popularity wave continues. Many projects are appearing: Whisper-based web services, Whisper on mobile, and so on. Some of them modify the Whisper models and algorithms to improve speed, which raises questions about their accuracy. Here we tested a couple of different projects to demonstrate the effect those algorithmic modifications have on accuracy.
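Assuming the numbers in the tables below are word error rates (WER, in percent), as is standard for such comparisons, scoring a set of hypothesis transcripts against references looks roughly like this minimal sketch with the jiwer package (the file names are hypothetical placeholders):

```python
# Minimal WER scoring sketch using the jiwer package.
# Reference and hypothesis file names are hypothetical placeholders,
# one utterance per line in each file.
import jiwer

def corpus_wer(reference_path, hypothesis_path):
    with open(reference_path) as f:
        references = [line.strip().lower() for line in f if line.strip()]
    with open(hypothesis_path) as f:
        hypotheses = [line.strip().lower() for line in f if line.strip()]
    # jiwer.wer computes the word error rate over the whole corpus
    return 100.0 * jiwer.wer(references, hypotheses)

print("WER: %.1f%%" % corpus_wer("librispeech-ref.txt", "whisper-tflite-hyp.txt"))
```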

There is some accuracy drop, but accuracy is still very impressive. Note that the TFLite model of just 40 MB demonstrates extremely good performance even on complicated datasets.

TFLite decoding is a bit slow though, about 0.5xRT on an Intel CPU. With modern hardware accelerators (like the M1 chip) the decoder runs quickly enough even on mobile devices.
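For reference, here is a minimal sketch of how a real-time factor (xRT) like the 0.5 above can be measured with the reference openai-whisper package; the model size and audio file are placeholders:

```python
# Sketch: measuring the real-time factor (xRT) of Whisper decoding.
# xRT = processing time / audio duration, so 0.5xRT means decoding takes
# half as long as the audio itself. Model size and file path are placeholders.
import time
import soundfile as sf
import whisper  # pip install openai-whisper

audio_path = "test.wav"                 # hypothetical 16 kHz mono recording
audio, sample_rate = sf.read(audio_path)
duration = len(audio) / sample_rate

model = whisper.load_model("tiny")      # the smallest Whisper model
start = time.time()
model.transcribe(audio_path)
elapsed = time.time() - start

print("xRT: %.2f" % (elapsed / duration))
```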

The Whisper.cpp project is still at a very early stage. There are many changes to the algorithms (like the 2x speedup of audio decoding, a questionable practice) and issues with the Python bindings (the decoder object is not reusable, there is memory corruption).
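Until the bindings mature, one workaround is to call the whisper.cpp command-line tool in a fresh process per file, which sidesteps the non-reusable decoder object. A rough sketch (the binary and model paths are placeholders; whisper.cpp expects 16 kHz 16-bit WAV input):

```python
# Sketch: calling the whisper.cpp CLI in a separate process per file
# instead of going through the Python bindings. Binary and model paths
# are placeholders; input must be a 16 kHz 16-bit WAV file.
import subprocess

def transcribe_with_whisper_cpp(wav_path,
                                binary="./main",
                                model="models/ggml-base.en.bin"):
    # A new process per file avoids reusing the decoder object.
    result = subprocess.run(
        [binary, "-m", model, "-f", wav_path],
        capture_output=True, text=True, check=True)
    return result.stdout

print(transcribe_with_whisper_cpp("sample-16k.wav"))
```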

| Dataset | Whisper TFLite | Whisper.cpp | Whisper Tiny |
|---|---|---|---|
| Librispeech test-clean | 6.3 | 6.6 | 7.0 |
| Tedlium test | 7.2 | 7.9 | 8.6 |
| Google commands | 34.4 | 100.0 | 33.9 |
| Non-native speech | 45.2 | 40.7 | 41.7 |
| Children speech | 9.7 | 13.2 | 10.4 |
| Podcasts | 16.7 | 16.3 | 17.0 |
| Callcenter bot | 17.6 | 20.8 | 22.0 |
| Callcenter 1 | 45.0 | 43.1 | 45.9 |
| Callcenter 2 | 34.3 | 31.9 | 33.4 |
| Callcenter 3 | 43.3 | 38.4 | 40.7 |

We also tested the recently released Large-V2 model (https://huggingface.co/openai/whisper-large-v2), which is claimed to be more accurate. Surprisingly, it is visibly worse than the original V1 version, especially on short commands. Other users report this too, see https://github.com/openai/whisper/discussions/657

| Dataset | Whisper Large V1 | Whisper Large V2 |
|---|---|---|
| Librispeech test-clean | 4.0 | 4.1 |
| Tedlium test | 6.5 | 9.1 |
| Google commands | 33.9 | 54.0 |
| Non-native speech | 18.7 | 18.8 |
| Children speech | 4.5 | 4.9 |
| Podcasts | 13.5 | 15.8 |
| Callcenter bot | 10.1 | 14.9 |
| Callcenter 1 | 31.8 | 31.7 |
| Callcenter 2 | 26.0 | 29.9 |
| Callcenter 3 | 31.4 | 32.8 |
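
For anyone who wants to reproduce the V1 vs V2 comparison, a minimal sketch with the reference openai-whisper package looks roughly like this (the audio path is a placeholder; the model names follow the openai-whisper repository):

```python
# Sketch: decoding the same file with Large V1 and Large V2.
# The audio path is a placeholder; checkpoints are downloaded on first use.
import whisper

audio = "command.wav"  # e.g. a short voice command

for name in ("large-v1", "large-v2"):
    model = whisper.load_model(name)
    result = model.transcribe(audio)
    print(name, "->", result["text"])
```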

The big thing about Whisper is that it uses a very long context. However, the extent to which accuracy degrades when files are cut into short segments and that context is lost is not very clear. Here we can provide some numbers:

| Dataset | Whisper Large, segmented (3-5 seconds) | Whisper Large, whole files |
|---|---|---|
| Podcasts | 13.5 | 11.6 |

You can see that the advantage of Whisper on long files is visible, although probably not critical. As the tables above also show, recognition accuracy on short commands is pretty bad. So it is not clear how well Whisper will perform in real-time applications.
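The segmented numbers above can be approximated with a sketch like the following, which cuts the audio into short chunks and decodes each chunk independently so the long context is lost (the chunk length and file path are placeholders; model.transcribe accepts a 16 kHz float array directly):

```python
# Sketch: whole-file vs segmented decoding to show the effect of long context.
# Chunk length and file path are placeholders.
import whisper

model = whisper.load_model("large-v1")
audio = whisper.load_audio("podcast.wav")   # resampled to 16 kHz mono float32

# Whole-file decoding: Whisper sees the full context of the recording.
whole_text = model.transcribe(audio)["text"]

# Segmented decoding: ~5 second chunks decoded independently, no shared context.
chunk = 5 * whisper.audio.SAMPLE_RATE
segmented_text = " ".join(
    model.transcribe(audio[i:i + chunk])["text"]
    for i in range(0, len(audio), chunk))

print("whole file:", whole_text[:200])
print("segmented:", segmented_text[:200])
```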

Let's continue with Whisperology. There are still many questions about how such a small model works so well.