Written by Nickolay Shmyrev
How to improve accuracy
People very often ask "how to improve accuracy?". Since I've got three questions like this today, I decided to write a more or less extensive description of the ways to solve this problem. It will probably be a bit sketchy, but I hope it will be helpful. Corrections will be very much appreciated as well.
1) Well, first of all let me mention that the problem is complex. It requires an understanding of Hidden Markov Models, beam search, language modelling and every other technology involved. I really recommend you read a book on speech recognition first. This one is very good:
http://www.amazon.com/Spoken-Language-Processing-Algorithm-Development/dp/0130226165
In theory it should be possible to build a speech recognition system without the extensive knowledge listed above, but that's not the case now. We are working on an easy-to-use system, but we are still at the very beginning.
Probably you don't have time to read and study all this. Then consider whether you have time to implement a speech recognition system at all. At least learn the basic concepts.
Please also learn how to program before you start. It should be obvious that you need software development experience; we can't really teach you Java.
2) The next step is to set up a basic example of the system and estimate its accuracy. While the first part is quite obvious, the second is often ignored. Don't do that. It's critical to test accuracy under realistic conditions during the whole development process.
Decide what kind of system you are going to implement - large vocabulary dictation, medium vocabulary name recognition, small vocabulary command and control, or some other task. Maybe you need an IVR system. For each kind of system there is a demo already. Use it as a base. Please don't try to build dictation from a command and control demo, it's just not suitable.
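For example, a minimal large vocabulary transcriber with the Sphinx4 high-level API looks roughly like the sketch below. The model paths refer to the stock US English models shipped with sphinx4-data, and test.wav is just a placeholder file; for command and control you would point the configuration at a grammar with setGrammarPath/setUseGrammar instead of a language model.

    import edu.cmu.sphinx.api.Configuration;
    import edu.cmu.sphinx.api.SpeechResult;
    import edu.cmu.sphinx.api.StreamSpeechRecognizer;

    import java.io.FileInputStream;
    import java.io.InputStream;

    public class TranscriberDemo {
        public static void main(String[] args) throws Exception {
            Configuration configuration = new Configuration();
            // Stock US English models; swap in task-specific models for your system.
            configuration.setAcousticModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us");
            configuration.setDictionaryPath("resource:/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict");
            configuration.setLanguageModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us.lm.bin");

            StreamSpeechRecognizer recognizer = new StreamSpeechRecognizer(configuration);
            InputStream stream = new FileInputStream("test.wav"); // 16 kHz, 16-bit, mono PCM

            recognizer.startRecognition(stream);
            SpeechResult result;
            while ((result = recognizer.getResult()) != null) {
                System.out.println(result.getHypothesis());
            }
            recognizer.stopRecognition();
        }
    }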
3) Now comes the important task: the estimation of the accuracy. Examples of how to do that can be found in the decoder sources, in the tutorial and in many more places. Collect a test database, recognize it and compute an exact accuracy estimate.
Expected error rates are roughly the following:
- Command and control: 5% WER (word error rate)
- Medium vocabulary: 15% WER
- Large vocabulary: 30% WER
- Large vocabulary, short utterances: 50% WER
If you have noisy audio or accented speech, multiply these numbers by 2.
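WER itself is just the word-level edit distance between the reference transcription and the hypothesis, divided by the number of reference words; the alignment tools shipped with the decoders also break it down into substitutions, insertions and deletions. A self-contained sketch of the bare computation:

    public class Wer {
        /** Word error rate: edit distance over words / reference length. */
        public static double wer(String reference, String hypothesis) {
            String[] ref = reference.trim().toLowerCase().split("\\s+");
            String[] hyp = hypothesis.trim().toLowerCase().split("\\s+");

            // Classic Levenshtein dynamic programming over words.
            int[][] d = new int[ref.length + 1][hyp.length + 1];
            for (int i = 0; i <= ref.length; i++) d[i][0] = i;
            for (int j = 0; j <= hyp.length; j++) d[0][j] = j;
            for (int i = 1; i <= ref.length; i++) {
                for (int j = 1; j <= hyp.length; j++) {
                    int sub = d[i - 1][j - 1] + (ref[i - 1].equals(hyp[j - 1]) ? 0 : 1);
                    d[i][j] = Math.min(sub, Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1));
                }
            }
            return (double) d[ref.length][hyp.length] / ref.length;
        }

        public static void main(String[] args) {
            // Two substitutions in four reference words -> 50% WER.
            System.out.println(wer("turn on the lights", "turn off the light"));
        }
    }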
4) Compare the actual accuracy with the expected value. If the accuracy is roughly as expected, proceed to the next step. If not, search for the bug. Most likely you've made a mistake in the system setup. Check that the configuration is suitable for the task, check the speech quality in a sound editor, look at the spectrum range, check the sampling rate, accent, dictionary and other parameters. If your WER is 90%, you have made a mistake for sure.
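Mismatched audio is the most common setup mistake of all: decoding, say, 44.1 kHz stereo audio with models trained on 16 kHz (or 8 kHz telephone) speech ruins accuracy. A quick way to inspect a WAV file, using only the standard javax.sound.sampled classes (test.wav is again a placeholder):

    import javax.sound.sampled.AudioFormat;
    import javax.sound.sampled.AudioSystem;
    import java.io.File;

    public class CheckAudio {
        public static void main(String[] args) throws Exception {
            // These parameters must match what the acoustic model was trained on.
            AudioFormat fmt = AudioSystem.getAudioFileFormat(new File("test.wav")).getFormat();
            System.out.println("Sample rate: " + fmt.getSampleRate());       // e.g. 16000.0
            System.out.println("Channels:    " + fmt.getChannels());         // expect 1 (mono)
            System.out.println("Bits:        " + fmt.getSampleSizeInBits()); // expect 16
            System.out.println("Encoding:    " + fmt.getEncoding());         // expect PCM_SIGNED
        }
    }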
For an example of task-dependent training, consider the acoustic database. If you train a small vocabulary acoustic model, you need a word-based acoustic model with a word-based phoneset (in the SphinxTrain case). If you are training a large vocabulary database, make sure your phoneset is not too large and that you have selected the proper number of senones/mixtures.
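As an illustration, in SphinxTrain the senone and density counts are set in sphinx_train.cfg with lines like the following (the values here are only placeholders; pick them according to the size of your database and check the comments in your own generated config):

    # Number of tied states (senones); small databases need a small value,
    # large vocabulary tasks typically use several thousand.
    $CFG_N_TIED_STATES = 200;

    # Number of Gaussian densities per mixture.
    $CFG_FINAL_NUM_DENSITIES = 8;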
5) Once you've reached the baseline, it will take a lot of effort to improve it. Think about whether it's enough for you and whether you can build your application with such accuracy. It's unlikely you'll get significantly more. But if you are brave enough:
- Use MLLT/VTLN feature space adaptation
- Use MLLR and other types of online speaker adaptation
- Adapt the language model, use context-sensitive language models
- Tune beams - try different values and experiment with them
- Implement the rejection for OOV (out-of-vocabulary) words and other noise sounds
- Implement noise cancellation.
- Adapt acoustic models and dictionary to your speakers/their accent.
- ....
The out-of-vocabulary words are the most frequent issue here. Contrary to what users expect, most demos don't do any OOV filtering out of the box, while it's critical for applications. Unlimited vocabulary systems do exist, but they are quite complex to implement; most systems have a limited vocabulary. That means you need to implement OOV detection and/or confidence scoring in order to filter out the garbage. This is doable and described in the demos too, for example in the confidence demo.
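With the Sphinx4 API, for instance, you can inspect per-word confidences and reject low-scoring words. A rough sketch, reusing the configuration from the transcriber above (the confidences are log-domain posteriors, and the threshold below is just a placeholder that has to be tuned on your own test data):

    import edu.cmu.sphinx.api.Configuration;
    import edu.cmu.sphinx.api.SpeechResult;
    import edu.cmu.sphinx.api.StreamSpeechRecognizer;
    import edu.cmu.sphinx.result.WordResult;

    import java.io.FileInputStream;

    public class ConfidenceFilter {
        public static void main(String[] args) throws Exception {
            Configuration configuration = new Configuration();
            configuration.setAcousticModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us");
            configuration.setDictionaryPath("resource:/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict");
            configuration.setLanguageModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us.lm.bin");

            StreamSpeechRecognizer recognizer = new StreamSpeechRecognizer(configuration);
            recognizer.startRecognition(new FileInputStream("test.wav"));

            final double THRESHOLD = -3000; // placeholder, tune on held-out data
            SpeechResult result;
            while ((result = recognizer.getResult()) != null) {
                for (WordResult word : result.getWords()) {
                    // getConfidence() returns a log-domain posterior probability.
                    String tag = word.getConfidence() < THRESHOLD ? "REJECT" : "ACCEPT";
                    System.out.println(tag + " " + word.getWord());
                }
            }
            recognizer.stopRecognition();
        }
    }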
If you are short of ideas here, join the mailing list - we have a lot of features to implement.