Dither is considered harmful

MFCC features used in speech recognition are still a reasonable choice if you want to recognize generic speech. With tunings like frequency warping for VTLN and MLLT they can still deliver reasonable performance. Although there are many parameters to tune, like the upper and lower frequencies, the shape of the mel filters and so on, the default values mostly work fine. Still, I had to spend this week on one issue related to zero-energy frames.

Zero-energy frames are quite common in telephone-recorded speech. Due to noise cancellation or VAD-based speech compression, telephony recordings are full of frames with zero energy. The issue is that the MFCC computation involves taking the log of the energies, so for such a frame you end up with the undefined value log 0. There are several ways to overcome this issue.
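To make the problem concrete, here is a minimal sketch of the log-energy step in plain NumPy. It is not the actual Sphinx code, and `mel_filterbank` stands in for a precomputed matrix of triangular mel filters:

```python
import numpy as np

def log_mel_energies(frame, mel_filterbank):
    # frame:          one windowed analysis frame of PCM samples
    # mel_filterbank: (n_filters, n_fft_bins) matrix of triangular filters
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    energies = mel_filterbank @ spectrum
    # For an all-zero frame every energy is 0 and np.log returns -inf:
    # exactly the undefined log 0 case described above.
    return np.log(energies)
```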

The approach used in HTK or SPTK, for example, is to floor the energy before taking the log, usually at some tiny value such as 1e-5, which is a large negative number in the log domain. This solution is actually quite bad, at least in its Sphinx implementation, because it largely affects the CMN computation: the mean goes down and bad things happen. A single silent frame can affect the result of the whole phrase.
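A toy illustration of the CMN effect, with made-up numbers: a single floored frame pulls the utterance mean far below the speech frames, so after mean subtraction every speech frame gets shifted.

```python
import numpy as np

LOG_FLOOR = np.log(1e-5)          # about -11.5 in the log domain

def cmn(feats):
    # Cepstral mean normalization: subtract the per-utterance mean
    return feats - feats.mean(axis=0)

# Four "speech" frames with typical log energies and one floored silent frame
log_energy = np.array([8.0, 7.5, 9.0, 8.2, LOG_FLOOR])
print(log_energy.mean())          # ~4.2 instead of ~8.2: the mean goes down
print(cmn(log_energy))            # every speech frame is shifted by about 4
```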

Another one is dither: you apply random 1-bit noise to the sound as a whole and use this modified waveform for training. Such a change is usually enough to make the log take acceptable values, around -1.
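A rough sketch of the idea, not the Sphinx implementation: add one least-significant bit of random noise to every sample so that no frame stays exactly zero.

```python
import numpy as np

def apply_dither(samples, seed=None):
    # samples: 1-D array of 16-bit PCM samples for the whole recording
    rng = np.random.default_rng(seed)
    noise = rng.integers(-1, 2, size=samples.shape)   # values in {-1, 0, 1}
    # Formerly all-zero frames now have a small nonzero energy, so the
    # log energies stay finite instead of going to -inf.
    return samples.astype(np.int32) + noise
```

Passing a fixed seed is what makes the dithered results reproducible from run to run.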

There have been complaints about dither; the most well-known one is that it affects recognition scores, so results can differ from run to run. That is a bad thing, but not that bad if you start with a predefined seed. So I used to think that dither was fine, and by default it is applied both in training and in the decoder. But recently, when I started testing the sphinxtrain tutorial, I came across a more serious issue.

See the results on the an4 database from run to run, without any modifications:

TOTAL Words: 773 Correct: 645 Errors: 139
TOTAL Percent correct = 83.44% Error = 17.98% Accuracy = 82.02%
TOTAL Insertions: 11 Deletions: 17 Substitutions: 111

TOTAL Words: 773 Correct: 633 Errors: 149
TOTAL Percent correct = 81.89% Error = 19.28% Accuracy = 80.72%
TOTAL Insertions: 9 Deletions: 23 Substitutions: 117

TOTAL Words: 773 Correct: 639 Errors: 142
TOTAL Percent correct = 82.66% Error = 18.37% Accuracy = 81.63%
TOTAL Insertions: 8 Deletions: 19 Substitutions: 115

TOTAL Words: 773 Correct: 650 Errors: 133
TOTAL Percent correct = 84.09% Error = 17.21% Accuracy = 82.79%
TOTAL Insertions: 10 Deletions: 17 Substitutions: 106

TOTAL Words: 773 Correct: 639 Errors: 142
TOTAL Percent correct = 82.66% Error = 18.37% Accuracy = 81.63%
TOTAL Insertions: 8 Deletions: 19 Substitutions: 115

If you are lucky you can even get a WER of 15.95%. That's certainly unacceptable, and it remains unclear why training is so sensitive to the dither that is applied. Clearly it makes any testing impossible. I checked these results on a medium-vocabulary 50-hour database and the picture is the same: the accuracy varies considerably from run to run. The interesting thing is that only training is affected that much; in decoding you see only a very slight difference, around 0.1%.

So far my solutions are:
  • Disable dither in training
  • Apply a patch to drop frames with zero energy (this seems useless, but it helps to be less nervous about the warnings; a toy version of the idea is sketched after this list)
  • Decode with dither
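The frame-dropping idea from the second item, as a toy sketch rather than the actual patch:

```python
import numpy as np

def drop_zero_energy_frames(frames):
    # frames: (n_frames, frame_len) array of windowed PCM samples
    energies = (frames.astype(np.float64) ** 2).sum(axis=1)
    # Keep only frames that carry some signal; the all-zero frames produced
    # by noise cancellation or VAD compression are removed before the log step.
    return frames[energies > 0.0]
```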
I hope I'll be able to provide more information in the future about the reasons for this instability, but for now that's all I know.