Whisper Fine-Tuning

Whisper is very popular these days, so here are some more observations on it.

Whisper has many nice properties, such as very good general transcription accuracy and accurate punctuation, but Whisper fine-tuning is a rather specific matter, so let us consider it in detail. First of all, let’s define fine-tuning more precisely. There are many fine-tuning tasks with different properties, and it is reasonable that they require different algorithms.

  1. Case one is a very limited task like command recognition. For such a task it is enough to have 1-2 hours of input. We do not consider that case.

  2. Case two is 10 minutes of in-domain speech from which we want to build a good speech recognition system. This is the wav2vec case, where wav2vec and related self-supervised features really shine. However, we do not consider this case very practical. If you have 10 minutes of annotated speech, what stops you from annotating another 10 hours? It will not take long, so it is not a reasonable restriction. The reasonable restriction is computation: you should be able to fine-tune your model on a 2x3090 server in a week.

  3. The realistic case is that we have about 100-1000 hours of in-domain speech and want to build a very accurate ASR system for it (lectures, for example).

All modern models like Whisper, Nemo Conformers and Wav2vec are good candidates for fine-tuning. Whisper can be fine-tuned with the Huggingface scripts, for example, and Nemo has its own fine-tuning setup. Even so, although thousands of fine-tuned models have been uploaded, most of them are of questionable quality.

Here are some points to consider:

  1. It is better to have a simpler speech-oriented architecture than a more complex one. As a result, a Whisper fine-tune is usually worse than a Conformer fine-tune.

  2. Fine-tuning is the same as training. You just start from a different point, but you can apply all the training tricks. Regularization is critical, for example, and Specaugment is a must. Note that very few fine-tuning recipes use Specaugment.

  3. It is still better to process more in-domain data than to train for 10 more epochs on the same small dataset. You still want to fine-tune on all the diverse datasets you want to transcribe (and possibly add something clean like Librispeech for regularization).

  4. It is reasonable to apply an LM. Conformer + LM is much faster to fine-tune than huge transducers.

  5. We look for bigger batch sizes and a higher learning rate. If a model like Whisper-large doesn’t fit in GPU memory, it might be better to use a smaller model with a bigger batch size than a big model with a small batch size.

  6. We might want to tune the model on one domain (like calls for banking) and use it for other related but different domains (like calls for insurance). Large-context models like conformers learn internal LMs, so if you train them on banking they will only be good at banking. You need simple models like wav2vec to learn acoustics.
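Since point 2 calls Specaugment a must, here is a minimal sketch of its two masking steps (frequency masking and time masking; the time-warping step of the original method is left out). It is an illustration in plain Python on a list-of-lists spectrogram, not the code of any particular recipe: the function name `spec_augment`, its parameters and their default values are assumptions made for this example, and a real training setup would apply the same idea to tensors inside the data loader.

```python
import random


def spec_augment(spec, num_freq_masks=2, max_freq_width=8,
                 num_time_masks=2, max_time_width=20, rng=None):
    """Return a masked copy of `spec`, a list of frames (each a list of mel bins).

    Hypothetical helper: the mask counts and widths are illustrative defaults.
    """
    if rng is None:
        rng = random.Random()
    num_frames = len(spec)
    num_bins = len(spec[0]) if spec else 0
    out = [list(frame) for frame in spec]  # copy, the input stays untouched

    # Frequency masking: zero a band of consecutive mel bins in every frame.
    for _ in range(num_freq_masks):
        width = rng.randint(0, min(max_freq_width, num_bins))
        start = rng.randint(0, num_bins - width)
        for frame in out:
            for f in range(start, start + width):
                frame[f] = 0.0

    # Time masking: zero a run of consecutive frames across all bins.
    for _ in range(num_time_masks):
        width = rng.randint(0, min(max_time_width, num_frames))
        start = rng.randint(0, num_frames - width)
        for t in range(start, start + width):
            out[t] = [0.0] * num_bins

    return out


# Toy input: 300 frames of 80 mel bins, all ones, so masked cells are easy to spot.
spec = [[1.0] * 80 for _ in range(300)]
augmented = spec_augment(spec, rng=random.Random(0))
masked_cells = sum(v == 0.0 for frame in augmented for v in frame)
print(f"masked {masked_cells} of {300 * 80} cells")
```

The masks are drawn fresh on every call, so the model sees a differently corrupted version of each utterance in every epoch, which is what makes it useful as a regularizer on a small fine-tuning set.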

Here are some related numbers on a special call center task:

Whisper medium                                     18.8
Whisper medium fine-tuned                          12.6
Nemo transducer x-large original                   18.6
Nemo conformer fine-tuned + LM                     12.2
Wav2Vec Robust fine-tuned + LM (fairseq tuning)    10.4
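The "+ LM" rows combine acoustic model scores with language model scores. One common way to do that is n-best rescoring, sketched below. This is not necessarily how the numbers above were produced: the function `rescore`, the weights, the toy LM and the example hypotheses are all made up for illustration.

```python
def rescore(nbest, lm_logprob, lm_weight=0.5, word_bonus=0.0):
    """Pick the best hypothesis from `nbest`, a list of (text, am_logprob) pairs.

    Total score = acoustic log-prob + lm_weight * LM log-prob + word_bonus * word count.
    """
    def total(hyp):
        text, am_logprob = hyp
        return am_logprob + lm_weight * lm_logprob(text) + word_bonus * len(text.split())

    return max(nbest, key=total)[0]


# Toy stand-in for a real LM (hypothetical): in-domain words are cheap, others costly.
IN_DOMAIN = {"check", "my", "account", "balance"}


def toy_lm(text):
    return sum(-1.0 if word in IN_DOMAIN else -6.0 for word in text.split())


# The acoustic model slightly prefers the wrong word; the LM corrects it.
nbest = [
    ("check my account ballast", -4.0),
    ("check my account balance", -4.5),
]
best = rescore(nbest, toy_lm)
print(best)
```

With `lm_weight=0.0` the first hypothesis wins on acoustic score alone, which is the situation an in-domain LM is meant to fix.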

Similar results on Russian call-center task:

Whisper original                                          39.3
Vosk                                                      33.1
Whisper fine-tuned (mitchelldehaven/whisper-large-v2-ru)  28.9
Nemo conformer transducer (original)                      25.3

You see, one can tune a simple model better than the much more advanced Whisper model. Whisper is actually not great for non-English languages and, moreover, it does not fine-tune very well.
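To put the tables in perspective, the gain from fine-tuning can be expressed as a relative reduction. This is a small sketch, assuming the figures are error rates where lower is better (the post does not name the metric) and that the "original" and "fine-tuned" Whisper rows in each table are comparable baselines. The helper name `relative_reduction` is made up for this illustration.

```python
def relative_reduction(baseline, tuned):
    """Relative reduction, in percent, going from `baseline` to `tuned`."""
    return 100.0 * (baseline - tuned) / baseline


# Figures copied from the two tables above.
pairs = {
    "Whisper medium -> fine-tuned (call center)": (18.8, 12.6),
    "Whisper original -> fine-tuned (Russian)": (39.3, 28.9),
}

for name, (before, after) in pairs.items():
    print(f"{name}: {relative_reduction(before, after):.1f}% relative reduction")
```

Fine-tuning removes roughly a third of the errors in the first case and roughly a quarter in the second, yet in both tables a simpler model still ends up ahead.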

We have yet to try the Whisper fine-tuning scripts from Espnet though; they will probably be better.


  1. The Whisper encoder is good, but not very strong, as demonstrated in Espnet experiments. The Whisper decoder is very good, but for many reasons, not just the amount of data. One example is multi-objective training (translation + recognition). If you fine-tune without translations, your results will not be good.

  2. Thanks to Stefano on Telegram, here are the results from Whisper fine-tuning events:

In both cases the tuned Whisper is not much better than the original conformer transducer (for German, for example, Whisper is at 5.76 and the transducer is at 4.93).