OpenAI Whisper Accuracy (TFLite, Whisper.CPP and Large-V2)

Written by Nickolay Shmyrev
The Whisper popularity wave continues. Many projects are appearing for Whisper-based
web services, Whisper on mobile and so on. Some projects modify the Whisper
models and algorithms to improve speed, which raises questions about
their accuracy. Here we tested a couple of different projects to demonstrate
the effect those algorithmic modifications have on accuracy.
There is some accuracy drop, but accuracy is still extremely impressive.
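The numbers in the tables below are word error rates (WER) in percent. As a reference for how such numbers are computed, here is a minimal WER sketch based on word-level edit distance; it is only an illustration, not the exact scoring tool used for the tables:

```python
# Minimal word error rate (WER) computation via edit distance.
# Illustrative sketch only, not the exact scoring tool used for the tables below.

def wer(reference: str, hypothesis: str) -> float:
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return 100.0 * d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("turn on the light", "turn of the light"))  # 25.0
```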
Note that the TFLite model of just 40 MB demonstrates extremely good
performance even on complicated datasets.
TFLite decoding is a bit slow though, about 0.5xRT on an Intel CPU. With
modern hardware accelerators (like the M1 chip) the decoder runs quickly enough
even on mobile devices.
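For those who want to try it, here is a rough sketch of running such a TFLite export with the standard TensorFlow Lite interpreter. The model file name and the exact input/output signature (a log-mel spectrogram in, token IDs out) depend on the particular conversion, so treat them as assumptions, not a recipe:

```python
# Rough sketch of running a Whisper TFLite export with the TF Lite interpreter.
# The file name "whisper-tiny.tflite" and the input/output layout (log-mel
# spectrogram in, token IDs out) are assumptions about the conversion --
# check the signatures of your particular export.
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="whisper-tiny.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
print("expects:", input_details[0]["shape"])  # e.g. [1, 80, 3000] mel frames

# Dummy 30-second log-mel input just to exercise the graph.
mel = np.zeros(input_details[0]["shape"], dtype=np.float32)
interpreter.set_tensor(input_details[0]["index"], mel)
interpreter.invoke()

tokens = interpreter.get_tensor(output_details[0]["index"])
print("decoded token ids:", tokens)
```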
The Whisper.cpp project is still at a very early stage. There are many changes
to the algorithms (like the 2x speed-up of audio decoding, a questionable practice)
and issues with the Python bindings (the decoder object is not reusable, memory
corruption).
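One way to sidestep the binding problems is to call the whisper.cpp command-line tool once per file from Python, so every decode starts from a clean process. A sketch along these lines (the binary and model paths are placeholders, and the flags are those of the `main` example in the whisper.cpp repository; check your build):

```python
# Sketch: run the whisper.cpp "main" example per file via subprocess instead of
# the Python bindings, so each decode starts from a clean process.
# The binary path, model path and flags may differ for your whisper.cpp build.
import subprocess

def transcribe(wav_path: str,
               binary: str = "./main",
               model: str = "models/ggml-tiny.bin") -> str:
    result = subprocess.run(
        [binary, "-m", model, "-f", wav_path, "--no-timestamps"],
        capture_output=True, text=True, check=True)
    return result.stdout.strip()

print(transcribe("test.wav"))
```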
| Dataset | Whisper TFLite | Whisper CPP | Whisper Tiny |
|---|---|---|---|
| Librispeech test-clean | 6.3 | 6.6 | 7.0 |
| Tedlium test | 7.2 | 7.9 | 8.6 |
| Google commands | 34.4 | 100.0 | 33.9 |
| Non-native speech | 45.2 | 40.7 | 41.7 |
| Children speech | 9.7 | 13.2 | 10.4 |
| Podcasts | 16.7 | 16.3 | 17.0 |
| Callcenter bot | 17.6 | 20.8 | 22.0 |
| Callcenter 1 | 45.0 | 43.1 | 45.9 |
| Callcenter 2 | 34.3 | 31.9 | 33.4 |
| Callcenter 3 | 43.3 | 38.4 | 40.7 |
We also tested the recently released Large-V2 model
(https://huggingface.co/openai/whisper-large-v2), which is claimed to be more
accurate. Surprisingly, it is visibly worse than the original V1 version,
especially on short commands. Other users report it too, see
https://github.com/openai/whisper/discussions/657
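To look at the two checkpoints side by side one can decode the same file with both, for example with the openai-whisper package; "large-v1" and "large-v2" are the model names shipped with the package at the time of writing, and the file name below is just a placeholder:

```python
# Sketch: decode the same file with Large V1 and Large V2 for a side-by-side look.
# "large-v1" and "large-v2" are the model names in the openai-whisper package;
# the audio file name is a placeholder.
import whisper

audio = "podcast_chunk.wav"  # 16 kHz mono wav

for name in ("large-v1", "large-v2"):
    model = whisper.load_model(name)
    result = model.transcribe(audio, language="en")
    print(name, "->", result["text"])
```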
| Dataset | Whisper Large V1 | Whisper Large V2 |
|---|---|---|
| Librispeech test-clean | 4.0 | 4.1 |
| Tedlium test | 6.5 | 9.1 |
| Google commands | 33.9 | 54.0 |
| Non-native speech | 18.7 | 18.8 |
| Children speech | 4.5 | 4.9 |
| Podcasts | 13.5 | 15.8 |
| Callcenter bot | 10.1 | 14.9 |
| Callcenter 1 | 31.8 | 31.7 |
| Callcenter 2 | 26.0 | 29.9 |
| Callcenter 3 | 31.4 | 32.8 |
The big thing in Whisper is that it uses a huge context. But the extent to which
accuracy degrades without that context is not very clear. Here we can provide some numbers:
| Dataset | Whisper Large Segmented (3-5 seconds) | Whisper Large Whole Files |
|---|---|---|
| Podcasts | 13.5 | 11.6 |
As you can see, Whisper's advantage on long files is visible, although
probably not critical. As you also see above, recognition accuracy on
short commands is pretty bad, so it is not clear how well Whisper will
perform in realtime applications.
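For completeness, here is a rough sketch of what such a segmented-versus-whole-file comparison looks like with the openai-whisper package. The fixed 5-second windows below are only an illustration, not necessarily the exact 3-5 second segmentation used for the table above:

```python
# Sketch: compare whole-file decoding against decoding in short chunks to see
# the effect of context. Fixed 5-second windows are only an illustration; the
# table above was produced with a 3-5 second segmentation of the data.
import whisper
import soundfile as sf

model = whisper.load_model("large")
audio_path = "podcast.wav"  # 16 kHz mono

# Whole file: Whisper sees the full context and conditions on previous segments.
whole = model.transcribe(audio_path)["text"]

# Short chunks: each 5-second piece is decoded independently, with no context.
data, sr = sf.read(audio_path, dtype="float32")
chunk = 5 * sr
pieces = []
for start in range(0, len(data), chunk):
    pieces.append(model.transcribe(data[start:start + chunk])["text"])
segmented = " ".join(pieces)

print("whole-file length:", len(whole.split()), "words")
print("segmented length: ", len(segmented.split()), "words")
```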
Let's continue with Whisperology. There are still many, many questions about how
such a small model works so well.