Status of Whisper ASR Libraries

Whisper ASR is a great technology with many innovations. Multi-objective transcription/translation training, a huge 680k-hour training dataset and long-context decoding were truly revolutionary at the time Whisper arrived.

Unfortunately, the Whisper ecosystem has hundreds of packages without proper support for the model's features. It is very hard to use the Whisper model properly unless you understand what the model actually provides.

  1. Whisper is trained on a long context of 30 seconds. It works best for long-form decoding such as podcasts, which is why everyone in the podcast industry is excited. It doesn’t work well for short commands and is not supposed to. Standard ASR test databases consist of short chunks and are therefore not really suitable for evaluating Whisper. Simple VAD chunking doesn’t really work well either.

  2. By the way, training is also better done on 30-second chunks, not on a standard ASR database (as everyone is doing).

  3. I expect the 30-second audio should preferably contain a single speaker. The effect of multiple speakers in one chunk has not been investigated yet: is it better to decode speakers separately and then merge, or to decode them jointly? (No numbers on that.)

  4. Whisper is very good for English but not nearly as good for many other languages; you’d better fine-tune it. For Chinese, for example, FunASR is much better.

  5. Streaming doesn’t really work either, unless you can tolerate a 30-second delay.

  6. Batching chunks is critical for high performance.

  7. Continuous batching (packing multiple requests into one batch) should also work, but I haven’t seen an implementation of it for Whisper. For LLMs it works.

  8. Beam search is a must for low-quality audio and a must overall for stable decoding (see the sketch after this list).

  9. Whisper assigns punctuation, but punctuation accuracy is not really known (no numbers on that).

  10. Quantization works pretty well: large models can be quantized to just 4 bits without loss of accuracy. All modern LLM quantization tricks, like quantization to 1.5 bits, might also work. If you are restricted on memory, it is much better to use a large quantized model than a small float model. Quantized models should also be faster if the quantization is implemented properly (reduced CPU cache load).

  11. Advanced LLM tricks like flash attention significantly improve decoding speed.

  12. LoRA fine-tuning makes sense for Whisper as well.

  13. Whisper was trained on a huge and diverse database including music sources, so it is very good at decoding song lyrics compared to other systems (no numbers on that).

  14. Whisper is not trained on silence; silence MUST be excluded before decoding (for example with a VAD filter, as in the sketch after this list).

  15. There are efficient algorithms to inject vocabulary lists into Whisper (TCPGen), but nobody implements them. There is a patch in ESPnet though, and code from the authors.

  16. Prompts for Whisper are yet to be investigated, but in general they have a huge effect on accuracy (as with every LLM).

  17. Whisper returns normalized text, so it is actually not trivial to properly evaluate its accuracy; you have to denormalize it properly (see the evaluation sketch after the benchmark table below).

  18. The timestamps Whisper assigns are not millisecond-accurate. I haven’t seen research on timestamp accuracy yet.

  19. There are no confidence scores.
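To make a few of the points above concrete (int8 quantization, beam search, VAD filtering, prompting), here is a minimal decoding sketch with Faster Whisper. The option names follow the faster-whisper API, but the audio path, the prompt text and the parameter values are purely illustrative, not recommended settings.

```python
from faster_whisper import WhisperModel

# Large model quantized to int8 (point 10): better than a small float model
# when memory is limited.
model = WhisperModel("large-v3", device="cuda", compute_type="int8_float16")

segments, info = model.transcribe(
    "audio.wav",                      # illustrative path
    beam_size=5,                      # beam search for stable decoding (point 8)
    vad_filter=True,                  # exclude silence before decoding (point 14)
    vad_parameters={"min_silence_duration_ms": 500},
    initial_prompt="Names: Nickolay, Shmyrev.",  # illustrative prompt (point 16)
)

print("Language:", info.language, info.language_probability)
for segment in segments:
    # Timestamps are approximate, not millisecond-accurate (point 18).
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")
```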

If we consider current libraries, feature support is not quite uniform:

  1. Original Whisper. No batching, no quantization, no VAD. Some initial hallucination detection on silence.

  2. Whisper.CPP. Very good quantization, no batching, no VAD. No good Python API (the current ones do not support GPU).

  3. Faster Whisper. VAD, batching. No good quantization (models are still large). Flash attention was recently implemented.

  4. HuggingFace Whisper. Quantization, batching. No VAD, no good Python API. LoRA fine-tuning with bitsandbytes (see the sketch after this list).

  5. Insanely-fast-whisper. Flash attention + batching. No VAD, no quantization, questionable advertising.

  6. Whisper-jax. No VAD, no quantization, questionable advertising. Slow on GPU; intended mainly for TPU.

  7. Sherpa-ONNX with Whisper ONNX models. Good batching, okay quantization (8-bit, but not 4-bit). Actually a good candidate, but VAD integration is not straightforward.

  8. WhisperX. Uses Faster Whisper under the hood.
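To illustrate the HuggingFace side (flash attention and chunked batching from points 6 and 11, LoRA from point 12), here is a rough sketch with transformers and peft. It assumes recent versions of transformers, flash-attn and peft are installed; the model id, batch size and LoRA hyperparameters are illustrative, and the actual LoRA training loop is omitted.

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from peft import LoraConfig, get_peft_model

model_id = "openai/whisper-large-v3"

# Flash attention needs fp16/bf16 weights and the flash-attn package installed.
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
)
processor = AutoProcessor.from_pretrained(model_id)

# Chunked long-form decoding, batching several 30-second windows together.
asr = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=30,
    batch_size=8,
    torch_dtype=torch.float16,
    device="cuda:0",
)
print(asr("audio.wav")["text"])  # illustrative path

# LoRA adapters on the attention projections; rank and target modules are
# illustrative and should be tuned per task.
lora = LoraConfig(r=32, lora_alpha=64, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"])
peft_model = get_peft_model(model, lora)
peft_model.print_trainable_parameters()
```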

Some recent numbers for quantization of the Whisper Large V3 model, measured on Russian audiobooks (single-file processing, utterances of about 7 seconds), GPU processing on an RTX 3090. Speed is the real-time factor (xRT, processing time divided by audio duration; lower is better).

Model                             WER, %   CER, %   Speed, xRT
Original Large V3                   10.4      4.7        0.270
Original Medium                     16.5      7.5        0.216
HuggingFace Large V3                10.8      5.3        0.190
Faster Whisper Large V3 32 bit      10.3      4.1        0.172
Faster Whisper Large V3 8 bit       10.2      4.1        0.112
Faster Whisper Medium 8 bit         13.2      5.0        0.094
Whisper.CPP Large V3 16 bit         10.7      5.0        0.128
Whisper.CPP Large V3 4 bit          11.0      5.2        0.108
Whisper.CPP Medium 5 bit            47.9     35.8        0.092

As you can see, the large 4-bit model makes a lot of sense. The Large V3 model is only about 800 MB then, and decoding takes about 2.5 GB of memory on the card. As for the Medium model, it simply doesn’t work at 4 or 5 bits; I am not sure why, whether it is bad quantization or just a model-size constraint. These results still need testing on a larger dataset, including noisy samples, so don’t take the numbers for granted.
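On point 17 above: WER/CER numbers like those in the table are only meaningful if the reference and the hypothesis are normalized consistently. Here is a rough scoring sketch using jiwer and the BasicTextNormalizer bundled with the openai-whisper package; the file names are illustrative and this is a simplification of a real evaluation pipeline (in particular, it does not convert written-form numbers to words).

```python
import jiwer
from whisper.normalizers import BasicTextNormalizer

normalize = BasicTextNormalizer()

# Illustrative layout: one utterance transcript per line, same order in both files.
with open("reference.txt", encoding="utf-8") as f:
    references = [normalize(line.strip()) for line in f]
with open("hypothesis.txt", encoding="utf-8") as f:
    hypotheses = [normalize(line.strip()) for line in f]

# BasicTextNormalizer lowercases and strips punctuation and symbols, which is
# enough to keep case and punctuation from inflating the error rates.
print("WER:", jiwer.wer(references, hypotheses))
print("CER:", jiwer.cer(references, hypotheses))
```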

Conclusion. If someone could take Whisper.CPP quantization, add it to something like Faster Whisper, mix in flash attention and LoRA tuning from HuggingFace, and implement a good Python library on top of that, it would be a perfect mix.