Written by Nickolay Shmyrev
Experiments with correction of speech recognition output with LLMs
Generative error correction (GEC) has become popular recently; there are many papers on it, and even
a challenge:
https://huggingface.co/GenSEC-LLM
Some notable papers:
Overall, GEC results are somewhat controversial, because most experiments are run on book-sourced texts, and LLMs know
book texts very well:
Most SpeechLLMs are trained on the test sets of common speech datasets
We recently tried to rescore 5-best transcriptions of Russian telephony
calls with LLMs. There are many LLMs to try. We tried the ones that fit an 8 GB
card, plus Gemini Flash Lite 2.0 as a big model. We also tried an LLM
fine-tuned specifically for Russian GEC, Meno Tiny.
Here is what our prompt looks like:
You need to edit and improve the output of a speech recognition system.
Here are 5 variants of a transcription of a support call.
The calls are in Russian.
The speech recognizer makes mistakes, for example it uses "southpan" instead of "southpark".
The first variant is the most precise.
The second variant recognizes proper names better than the others.
Correct the mistakes and print the most accurate transcription using the context, grammar and knowledge about phonetics.
You need to provide only one answer. The number of lines in the answer must match the number of lines in the first variant.
# Example input:
1.
what is the price for the house
ok good i got it
goodbye
2.
what is the price for the horse
ok good i've got it
ok goodbye
3.
what is the price for the house
ok great i've got it
goodbye
4.
what is the price for the house
ok great i've got it
ok goodbye
5.
what is the price for the house
ok great i've got it
ok goodbye
# Example output:
what is the price for the house
ok good i've got it
ok goodbye
# Input:
1.
<lines1>
2.
<lines2>
....
# Output:
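The prompt above can be assembled mechanically from the N-best lists. A minimal sketch in Python; note that `build_prompt` and `PROMPT_HEADER` are our own illustration here, not code from the actual pipeline, and the header is abbreviated:

```python
# Sketch of building the GEC prompt from N-best hypotheses.
# PROMPT_HEADER stands for the instruction text shown above
# (abbreviated here); build_prompt is a hypothetical helper.
PROMPT_HEADER = (
    "You need to edit and improve the output of a speech recognition system.\n"
    "Here are 5 variants of a transcription of a support call.\n"
    "..."  # the remaining instructions and the worked example go here
)

def build_prompt(nbest: list[list[str]]) -> str:
    """nbest[i] holds the lines of the (i+1)-th hypothesis."""
    parts = [PROMPT_HEADER, "# Input:"]
    for i, lines in enumerate(nbest, start=1):
        parts.append(f"{i}.")   # numbered variant header, as in the prompt
        parts.extend(lines)
    parts.append("# Output:")   # the model continues from here
    return "\n".join(parts)
```

The LLM completion that follows `# Output:` is then taken as the corrected transcript.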
Here are our approximate results:

| Model | WER |
|-------|-----|
| 1-best ASR | 15.9 |
| 5-best ROVER | 14.8 |
| Qwen2.5-7B-Instruct-1M-Q4_K_M | 100+, unstable |
| vikhr-llama3.1-8b-instruct-r-21-09-24-q4_k_m | 40+, unstable |
| vikhr-yandexgpt-5-lite-8b-it-q4_k_m | unstable |
| meno-tiny-0.1-fp16 | 40+, unstable |
| gemma-2-9b-it-Q4_K_M | 16.0 |
| google_gemma-3-4b-it-Q8_0 | 16.7 |
| Gemini Flash 2.0 Lite | 14.6 |
| Gemini Flash 2.0 Lite, English prompt | 14.7 |
| Gemini Flash 2.0 Lite, 10-line chunks | 14.8 |
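For reference, the WER numbers above are the standard metric: word-level edit distance divided by the reference length. A self-contained sketch:

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level Levenshtein distance over reference length."""
    r, h = ref.split(), hyp.split()
    # Rolling one-row dynamic-programming table for edit distance
    d = list(range(len(h) + 1))
    for i in range(1, len(r) + 1):
        prev = d[0]        # value from the previous row, column j-1
        d[0] = i
        for j in range(1, len(h) + 1):
            cur = d[j]     # previous-row value at column j
            d[j] = min(d[j] + 1,                      # deletion
                       d[j - 1] + 1,                  # insertion
                       prev + (r[i - 1] != h[j - 1])) # substitution / match
            prev = cur
    return d[len(h)] / max(len(r), 1)
```

In practice one would use an established implementation (e.g. the `jiwer` package), but the definition is this simple.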
Some of our observations:
- Most 8B models at 4-bit quantization are not very stable; hallucinations are present in about 25% of cases. Qwen is very unstable for this task.
- Gemma 2 and Gemma 3 are ok; we have yet to try the 27B version.
- The simple prompts from the papers certainly don't work. One has to provide much more detail and call out specific issues in the prompt. We have yet to work on the prompt more.
- Even prompt formatting matters: by modifying the prompt format we were able to reduce WER from 26% to 16%.
- For now GEC doesn't seem like breakthrough tech; some extra sauce seems to be needed, and simple ROVER is equally ok and much more stable.
- We discussed on the channel with iLa whether an English prompt helps for non-English languages. I think it is possible for some models, but I can't confirm it in our experiments.
- For the big model, splitting the input doesn't help much.
- There is still a lot of overcorrection of proper names that are rare and unknown to the LLM, as well as overcorrection of grammar. We need to work more on that.
- The difference between Gemma2-9B and Gemini Flash is not very large, except for the number of hallucinations.
- Most models have very poor knowledge of rare domains and poor knowledge about speech (phonetics).
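ROVER, the stable baseline mentioned above, aligns the hypotheses into a word transition network and takes a weighted majority vote. A much simplified sketch of the voting idea (real ROVER does iterative dynamic-programming alignment; `rover_vote` here is our own toy positional-voting illustration, not the NIST tool):

```python
from collections import Counter
from itertools import zip_longest

def rover_vote(hypotheses: list[str]) -> str:
    """Toy ROVER-style combination: vote word-by-word across hypotheses.
    Real ROVER first aligns hypotheses into a word transition network;
    here we simply vote position-by-position, padding short hypotheses."""
    tokenized = [h.split() for h in hypotheses]
    out = []
    for column in zip_longest(*tokenized):
        words = [w for w in column if w is not None]
        # Counter.most_common breaks ties by insertion order,
        # so on a tie the earlier (better-ranked) hypothesis wins
        out.append(Counter(words).most_common(1)[0][0])
    return " ".join(out)
```

Each position where two of three hypotheses agree keeps the majority word, which is why voting smooths out isolated recognizer errors without ever hallucinating new content.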
So, interesting results, and more work is needed. Eventually we could turn this into a real benchmark; it is actually interesting which LLM performs
best here.
PS. Sreyan Ghosh on Twitter pointed me to the following paper, which stresses the issues with named entities in GEC. Right
on the subject:
Failing Forward: Improving Generative Error Correction for ASR with Synthetic Data and Retrieval Augmentation
https://arxiv.org/abs/2410.13198