Harmonic Noise Model in Speech Recognition


Recently I came across a nice demo of generating natural sounds from physical models. This is a really exciting topic: while Hollywood can now draw almost anything, as in Star Wars, sound generation is still a fairly limited and unexplored area. For example, really high-quality speech still cannot be created by computers, no matter how powerful they are. This leads to the question of speech signal representation.

Accurate speech signal representations have made a big difference in several areas of speech processing, such as TTS, voice conversion and voice coding. The core idea is simple and straightforward but also powerful: acoustic signals are produced either by harmonic oscillation, in which case the signal has structure, or by turbulence, in which case we see something like white noise. In speech these two classes are represented by vowels and sibilant consonants; everything else is a mixture of the two, with some degree of turbulence and some degree of structure. However, this is not really speech-specific: all other real-world signals, except artificial ones, can be analyzed from this point of view.
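To make the "structure versus white noise" distinction concrete, here is a minimal sketch in plain Python. It is not taken from any of the systems mentioned in this post. It assumes a toy vowel-like signal built from a few sinusoids, Gaussian noise as a stand-in for turbulence, and the peak of the normalized autocorrelation as a rough "degree of structure" score. The function names, the 8 kHz sample rate and the 80-400 Hz pitch range are all illustrative choices.

```python
import math
import random


def harmonic_frame(f0, sample_rate, n_samples, n_harmonics=5):
    """Toy vowel-like frame: sinusoids at integer multiples of f0."""
    return [
        sum(math.sin(2 * math.pi * k * f0 * n / sample_rate) / k
            for k in range(1, n_harmonics + 1))
        for n in range(n_samples)
    ]


def noise_frame(n_samples, seed=0):
    """Toy sibilant-like frame: white Gaussian noise (seeded for repeatability)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n_samples)]


def harmonicity(frame, sample_rate, f0_min=80.0, f0_max=400.0):
    """Peak normalized autocorrelation over lags in an assumed pitch range.

    Close to 1.0 for a periodic frame, close to 0.0 for white noise.
    """
    lag_min = int(sample_rate / f0_max)
    lag_max = int(sample_rate / f0_min)
    best = 0.0
    for lag in range(lag_min, lag_max + 1):
        a = frame[:-lag]
        b = frame[lag:]
        num = sum(x * y for x, y in zip(a, b))
        den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
        if den > 0.0:
            best = max(best, num / den)
    return best


SAMPLE_RATE = 8000   # assumed sample rate in Hz
N_SAMPLES = 400      # one 50 ms frame

voiced = harmonic_frame(100.0, SAMPLE_RATE, N_SAMPLES)
unvoiced = noise_frame(N_SAMPLES)
# A "mixture" frame: the harmonic part plus noise at half amplitude.
mixed = [h + 0.5 * n for h, n in zip(voiced, unvoiced)]

for name, frame in [("harmonic", voiced), ("mixed", mixed), ("noise", unvoiced)]:
    print(name, round(harmonicity(frame, SAMPLE_RATE), 3))
```

The score orders the three frames as expected: the purely harmonic frame scores highest, the noise frame lowest, and the mixture lands in between. Real analysis systems estimate voicing in more elaborate ways, often per frequency band, so treat this only as an illustration of the idea.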

Such a representation greatly improved voice compression in the MELP (mixed excitation linear prediction) class of codecs. Basically, we represent speech as noise plus harmonics and compress them separately. That made it possible to compress the speech signal down to an unbelievable 600 bit/s. Mixed excitation has also been very important in text-to-speech synthesis, where it really made a big difference, as was shown quite some time ago in "Mixed excitation for HMM-based speech synthesis" by Takayoshi Yoshimura et al., 2001.
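As a rough illustration of what "mixed excitation" means, the sketch below blends a pulse train (the harmonic part) with white noise (the turbulent part) using a single voicing weight, then shapes the result with an all-pole filter, as linear prediction synthesis does. This is a deliberately simplified assumption and not the actual MELP algorithm: real codecs use voicing strengths per frequency band and quantized parameters, and the names, the single `voicing` weight and the two-pole resonator coefficients here are hypothetical.

```python
import math
import random


def mixed_excitation(n_samples, pitch_period, voicing, seed=0):
    """Blend a pulse train and white noise.

    voicing = 1.0 gives pure pulses, voicing = 0.0 gives pure noise.
    The energies of the two parts are not matched in this toy version.
    """
    rng = random.Random(seed)
    out = []
    for n in range(n_samples):
        pulse = 1.0 if n % pitch_period == 0 else 0.0
        noise = rng.gauss(0.0, 1.0)
        out.append(voicing * pulse + (1.0 - voicing) * noise)
    return out


def all_pole_filter(excitation, coeffs):
    """Synthesis filter: y[n] = x[n] - sum_k coeffs[k-1] * y[n-k]."""
    out = []
    for n, x in enumerate(excitation):
        y = x
        for k, a in enumerate(coeffs, start=1):
            if n - k >= 0:
                y -= a * out[n - k]
        out.append(y)
    return out


SAMPLE_RATE = 8000                   # assumed sample rate in Hz
PITCH_PERIOD = SAMPLE_RATE // 100    # 100 Hz pitch -> 80 samples

# Hypothetical stable two-pole resonator near 500 Hz, standing in for
# the linear prediction coefficients a codec would transmit.
radius = 0.9
theta = 2 * math.pi * 500 / SAMPLE_RATE
lpc = [-2 * radius * math.cos(theta), radius * radius]

vowel_like = all_pole_filter(mixed_excitation(400, PITCH_PERIOD, 1.0), lpc)
fricative_like = all_pole_filter(mixed_excitation(400, PITCH_PERIOD, 0.0), lpc)
in_between = all_pole_filter(mixed_excitation(400, PITCH_PERIOD, 0.6), lpc)
```

The point of the design is that only a handful of numbers per frame (pitch period, voicing weight, filter coefficients) need to be stored or transmitted, which is what makes very low bitrates plausible.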

Unfortunately, there is very little published research on mixed excitation models for speech recognition. I only found one paper, "A harmonic-model-based front end for robust speech recognition" by Michael L. Seltzer, which does consider a harmonic-plus-noise model but focuses on robust speech recognition rather than on the advantages of the model itself. However, I believe such a model can be quite important for speech analysis because it allows speech events to be classified with a very high degree of certainty. For example, if you consider the task of building a TTS system from voice recordings, you will notice that even the best algorithms still confuse sounds a lot, assign incorrect boundaries and select wrong annotations. A more accurate signal representation could help here.

It would be great if readers could share more links on this. Thank you!