Written by Nickolay Shmyrev
Experiments with correction of speech recognition output with LLMs
Generative error correction (GEC) has become popular recently; there are many papers on it, and even
a challenge:
https://huggingface.co/GenSEC-LLM
Some notable papers:
Overall, GEC results are somewhat controversial, because most experiments are run on book-sourced texts, and LLMs know
book texts very well:
Most SpeechLLMs are trained on the test sets of common speech datasets
We recently tried to rescore 5-best transcriptions of Russian telephony
calls with LLMs. There are many LLMs to try. We tried the ones that fit an 8 GB
card, plus Gemini Flash Lite 2.0 as a big model. We also tried an LLM
fine-tuned specifically for Russian GEC, Meno Tiny.
Here is what our prompt looks like:
You need to edit and improve the output of a speech recognition system.
Here are 5 variants of a transcription of a support call.
The calls are in Russian.
The speech recognizer makes mistakes, for example it uses "southpan" instead of "southpark".
The first variant is the most precise.
The second variant recognizes proper names better than the others.
Correct the mistakes and print the most accurate transcription using the context, grammar and knowledge about phonetics.
You need to provide only one answer. The number of lines in the answer must match the number of lines in the first variant.
# Example input:
1.
what is the price for the house
ok good i got it
goodbye
2.
what is the price for the horse
ok good i've got it
ok goodbye
3.
what is the price for the house
ok great i've got it
goodbye
4.
what is the price for the house
ok great i've got it
ok goodbye
5.
what is the price for the house
ok great i've got it
ok goodbye
# Example output:
what is the price for the house
ok good i've got it
ok goodbye
# Input:
1.
<lines1>
2.
<lines2>
....
# Output:
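The prompt above can be assembled mechanically from the N-best lists. A minimal sketch in Python; note that `build_prompt` and `PROMPT_HEADER` are our own illustration here, not code from the actual pipeline, and the header is abbreviated:

```python
# Sketch of building the GEC prompt from N-best hypotheses.
# PROMPT_HEADER stands for the instruction text shown above
# (abbreviated here); build_prompt is a hypothetical helper.
PROMPT_HEADER = (
    "You need to edit and improve the output of a speech recognition system.\n"
    "Here are 5 variants of a transcription of a support call.\n"
    "..."  # the remaining instructions and the worked example go here
)

def build_prompt(nbest: list[list[str]]) -> str:
    """nbest[i] holds the lines of the (i+1)-th hypothesis."""
    parts = [PROMPT_HEADER, "# Input:"]
    for i, lines in enumerate(nbest, start=1):
        parts.append(f"{i}.")   # numbered variant header, as in the prompt
        parts.extend(lines)
    parts.append("# Output:")   # the model continues from here
    return "\n".join(parts)
```

The LLM completion that follows `# Output:` is then taken as the corrected transcript.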
Here are our approximate results:

| Model | WER |
|-------|-----|
| 1-best ASR | 15.9 |
| 5-best ROVER | 14.8 |
| Qwen2.5-7B-Instruct-1M-Q4_K_M | 100+, unstable |
| vikhr-llama3.1-8b-instruct-r-21-09-24-q4_k_m | 40+, unstable |
| vikhr-yandexgpt-5-lite-8b-it-q4_k_m | unstable |
| meno-tiny-0.1-fp16 | 40+, unstable |
| gemma-2-9b-it-Q4_K_M | 16.0 |
| google_gemma-3-4b-it-Q8_0 | 16.7 |
| Gemini Flash 2.0 Lite | 14.6 |
| Gemini Flash 2.0 Lite, English prompt | 14.7 |
| Gemini Flash 2.0 Lite, 10-line chunks | 14.8 |
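For reference, the WER numbers above are the standard metric: word-level edit distance divided by the reference length. A self-contained sketch:

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level Levenshtein distance over reference length."""
    r, h = ref.split(), hyp.split()
    # Rolling one-row dynamic-programming table for edit distance
    d = list(range(len(h) + 1))
    for i in range(1, len(r) + 1):
        prev = d[0]        # value from the previous row, column j-1
        d[0] = i
        for j in range(1, len(h) + 1):
            cur = d[j]     # previous-row value at column j
            d[j] = min(d[j] + 1,                      # deletion
                       d[j - 1] + 1,                  # insertion
                       prev + (r[i - 1] != h[j - 1])) # substitution / match
            prev = cur
    return d[len(h)] / max(len(r), 1)
```

In practice one would use an established implementation (e.g. the `jiwer` package), but the definition is this simple.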
Some of our observations:
- Most 8B models at 4-bit quantization are not very stable; hallucinations are present in about 25% of cases. Qwen is very unstable for this task.
- Gemma 2 and Gemma 3 are ok; we have yet to try the 27B version.
- The simple prompts from the papers certainly don't work. One has to provide much more detail and call out specific issues in the prompt. We have yet to work on the prompt more.
- Even prompt formatting matters: by modifying the prompt format we were able to reduce WER from 26% to 16%.
- For now GEC doesn't seem like breakthrough tech; some extra sauce seems to be needed, and simple ROVER is equally ok and much more stable.
- We discussed on the channel with iLa whether an English prompt helps for non-English languages. I think it is possible for some models, but I can't confirm it in our experiments.
- For the big model, splitting the input doesn't help much.
- There is still a lot of overcorrection of proper names that are rare and unknown to the LLM, as well as overcorrection of grammar. We need to work more on that.
- The difference between Gemma2-9B and Gemini Flash is not very large, except for the number of hallucinations.
- Most models have very poor knowledge of rare domains and poor knowledge about speech (phonetics).
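ROVER, the stable baseline mentioned above, aligns the hypotheses into a word transition network and takes a weighted majority vote. A much simplified sketch of the voting idea (real ROVER does iterative dynamic-programming alignment; `rover_vote` here is our own toy positional-voting illustration, not the NIST tool):

```python
from collections import Counter
from itertools import zip_longest

def rover_vote(hypotheses: list[str]) -> str:
    """Toy ROVER-style combination: vote word-by-word across hypotheses.
    Real ROVER first aligns hypotheses into a word transition network;
    here we simply vote position-by-position, padding short hypotheses."""
    tokenized = [h.split() for h in hypotheses]
    out = []
    for column in zip_longest(*tokenized):
        words = [w for w in column if w is not None]
        # Counter.most_common breaks ties by insertion order,
        # so on a tie the earlier (better-ranked) hypothesis wins
        out.append(Counter(words).most_common(1)[0][0])
    return " ".join(out)
```

Each position where two of three hypotheses agree keeps the majority word, which is why voting smooths out isolated recognizer errors without ever hallucinating new content.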
So, interesting results, and more work is needed. Eventually we could turn this into a real benchmark; it is actually interesting which LLM performs
best here.
PS. Sreyan Ghosh on Twitter pointed me to the following paper, which stresses the issues with named entities in GEC. Right
on the subject:
Failing Forward: Improving Generative Error Correction for ASR with Synthetic Data and Retrieval Augmentation
https://arxiv.org/abs/2410.13198