Around noise-robust PNCC features

Last week I was working on PNCC features, the well-known noise-robust features for speech recognition by Chanwoo Kim and Richard Stern. I ran quite a few experiments with the parameters and did some research around PNCC. Here are some thoughts on that.

The fundamental paper on PNCC is C. Kim and R. M. Stern, "Power-Normalized Cepstral Coefficients for Robust Speech Recognition", IEEE Trans. Audio, Speech, and Language Processing, but for a detailed explanation of the process and the experiments one can look at C. Kim, "Signal Processing for Robust Speech Recognition Motivated by Auditory Processing", Ph.D. thesis. The Octave code is available too. The C implementation is also available in the bug tracking system, thanks to Vyacheslav Klimkov, and will be committed soon after some cleanup. I hope a Sphinx4 implementation will follow.

However, quite a lot of important information is not contained in the papers. The main PNCC pipeline is similar to the conventional MFCC pipeline except for a few modifications. First, a gammatone filterbank is used instead of the triangular filterbank. Second, the filterbank energies are filtered to remove the effects of noise and reverberation. And third, a power-law nonlinearity together with power normalization is applied instead of the log. Most of the pipeline design is inspired by research on the human auditory system.
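To make the shape of the pipeline concrete, here is a rough numpy sketch of the stages as I understand them. This is not the reference implementation: the filterbank below is a plain triangular mel filterbank standing in for the gammatone weights, the noise suppression is reduced to a crude per-channel bias subtraction, and apart from the 1/15 exponent from the paper the parameter values are just illustrative assumptions.

```python
# A rough sketch of the PNCC pipeline shape, for illustration only.
# The triangular mel filterbank stands in for the gammatone weights, and
# the noise suppression is reduced to a crude bias subtraction; the real
# pipeline uses asymmetric filtering and temporal masking.

import numpy as np
from scipy.fft import dct


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def triangular_filterbank(n_filters, n_fft, sample_rate, low_hz=100.0, high_hz=None):
    """Triangular filters on the mel scale (stand-in for the gammatone bank)."""
    high_hz = high_hz or sample_rate / 2.0
    edges = mel_to_hz(np.linspace(hz_to_mel(low_hz), hz_to_mel(high_hz), n_filters + 2))
    bins = np.floor((n_fft + 1) * edges / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fbank[i, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
    return fbank


def pncc_like(signal, sample_rate=16000, n_fft=512, hop=160,
              n_filters=40, n_ceps=13, exponent=1.0 / 15.0):
    # 1. Framing, windowing and power spectrum
    window = np.hamming(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2

    # 2. Filterbank energies (PNCC proper uses gammatone weights here)
    energies = power @ triangular_filterbank(n_filters, n_fft, sample_rate).T

    # 3. Crude noise/reverberation suppression: subtract part of the
    #    per-channel long-term average and clamp at a small floor
    bias = energies.mean(axis=0, keepdims=True)
    cleaned = np.maximum(energies - 0.9 * bias, 1e-10)

    # 4. Power normalization followed by the power-law nonlinearity
    normalized = cleaned / (cleaned.mean() + 1e-10)
    nonlinear = normalized ** exponent

    # 5. DCT to decorrelate, keeping the first n_ceps coefficients
    return dct(nonlinear, type=2, axis=1, norm='ortho')[:, :n_ceps]
```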

There is a lot of research on using auditory ideas, including power-law nonlinearities, gammatone filterbanks and so on, in speech recognition, and the PNCC papers do not cover it fully. Important ones are the fundamental paper about RASTA and some recent research on auditory-inspired features like "Gammatone Features and Feature Combination" by Schlüter et al.

The design of PNCC around auditory features raises quite fundamental questions that are not discussed in the papers above. One very important paper here is "Spectral Signal Processing for ASR" (1999) by Melvyn Hunt from Dragon. The key idea from the paper is:
The philosophical case for taking what we know about the human auditory system as an inspiration for the representation used in our automatic recognition systems was set out in the Introduction, and it seems quite strong. Unfortunately, there does not seem to be much solid empirical evidence to support this case. Sophisticated auditory models have not generally been found to be better than conventional representations outside the laboratories in which they were developed, and none has found its way into a major mainstream system. Certainly, there are successful approaches and features that are generally felt to have an auditory motivation—the use of the mel-scale, the cube-root representation, and PLP. However, this paper has sought to show that they have no need of the auditory motivation, and their properties can be better understood purely in signal processing terms, or in some cases in terms of the acoustic properties of the production process. Other successful approaches, such as LDA, made no pretense of having an auditory basis.
This point is very important because the PNCC paper is a very experimental one and doesn't really cover the theory behind the design of the filterbank. There are good things in the PNCC design, and not so clear things too. Here are some observations I had:

1. PNCC is a really simple and elegant feature extraction; all the steps can be clearly understood, and that makes PNCC very attractive. The noise robustness properties are really great too.

2. Noise filtering does reduce the accuracy in clean conditions. Usually this reduction is noticeable (about 5% relative), but it can be justified since we get quite a good improvement in noise. Although there is a claim that PNCC is better than MFCC on clean data, my experiments do not confirm that. The PNCC papers never provide exact numbers, only graphs, which makes it very hard to verify their findings.

3. Band bias subtraction and temporal masking are indeed very reasonable stages to apply in a feature extraction pipeline. Given that the noise is mostly additive with a slowly changing spectrum, it is easy to remove it using long-term integration and an analog of Wiener filtering (see the bias-subtraction sketch after this list).

4. The gammatone filterbank doesn't improve significantly over the triangular filterbank, so essentially its complexity is not justified. Moreover, the default PNCC filterbank is suboptimal compared to a well-tuned MFCC filterbank. The filterbank starts only from 200Hz, so for most broadcast recordings it has to be changed to start from 100Hz.

5. The power-law nonlinearity is mathematically not well justified, since it doesn't transform a channel modification into a simple addition that can later be removed with CMN (see the comparison sketch after this list). The tests were done on normalized databases like WSJ, while every real database will show a reduction in performance due to the more complex effects of the power law. The overall power normalization with a moving average makes things even worse and reduces the ability to normalize scaled audio at the training and decoding stages; for example, for very short utterances it is really hard to estimate the power properly. The power nonlinearity could be compensated with variance normalization, but there is no sign of that in the PNCC papers. So my personal choice is a shifted log nonlinearity, which is a log for high energies and has a shift at the low end to deal with noise. Log is probably a bit less accurate in noise, but it is stable and has good scaling properties.

6. For raw MFCC, a lifter has to be applied to the coefficients for best performance, or LDA/MLLT has to be applied to make the features more Gaussian-like (a small liftering sketch follows the list). Unfortunately, the PNCC paper doesn't say anything about liftering or LDA/MLLT. With LDA the results could be quite different from the ones reported.
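
To illustrate point 3, here is a minimal sketch of the kind of noise-floor tracking that bias subtraction relies on: an asymmetric first-order lowpass filter rises slowly and falls quickly, so it follows the additive noise floor of each channel, which is then subtracted. The smoothing constants are illustrative assumptions, not the tuned values from the paper.

```python
# A minimal sketch of the idea behind bias subtraction: an asymmetric
# first-order lowpass filter tracks the slowly varying noise floor of
# each filterbank channel, and that floor is then subtracted.
# The smoothing constants are illustrative assumptions.

import numpy as np


def asymmetric_noise_floor(energies, lam_up=0.999, lam_down=0.5):
    """Track the lower envelope of per-channel power.

    energies: array of shape (n_frames, n_channels)
    lam_up:   smoothing when the signal is above the current floor
              (close to 1.0, so the floor rises very slowly)
    lam_down: smoothing when the signal drops below the floor
              (smaller, so the floor follows dips quickly)
    """
    floor = np.empty_like(energies)
    floor[0] = 0.9 * energies[0]
    for m in range(1, len(energies)):
        above = energies[m] >= floor[m - 1]
        floor[m] = np.where(
            above,
            lam_up * floor[m - 1] + (1.0 - lam_up) * energies[m],
            lam_down * floor[m - 1] + (1.0 - lam_down) * energies[m])
    return floor


def subtract_bias(energies, floor_scale=1.0, eps=1e-10):
    """Subtract the tracked noise floor, clamping at a small positive value."""
    floor = asymmetric_noise_floor(energies)
    return np.maximum(energies - floor_scale * floor, eps)
```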
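On point 5, a tiny numerical check shows why the log plays nicer with CMN: a constant channel gain becomes an additive offset under the log and is cancelled exactly by mean subtraction, while under the power law it stays as a multiplicative factor. The shifted log at the end is just my own formulation of the variant I mention above, not something taken from the PNCC papers.

```python
# A tiny numerical illustration of point 5: a constant channel gain turns
# into an additive offset under the log, so cepstral mean normalization
# removes it exactly, while under the power law it stays multiplicative.

import numpy as np

rng = np.random.default_rng(0)
energies = rng.uniform(1.0, 100.0, size=1000)   # fake filterbank energies
gain = 3.0                                      # constant channel gain

# Log nonlinearity: the gain becomes an additive constant...
log_clean, log_scaled = np.log(energies), np.log(gain * energies)
print(np.allclose(log_scaled - log_clean, np.log(gain)))      # True
# ...so mean subtraction (CMN) cancels it exactly:
print(np.allclose(log_clean - log_clean.mean(),
                  log_scaled - log_scaled.mean()))             # True

# Power-law nonlinearity: the gain stays multiplicative (gain ** (1/15)),
# so mean subtraction alone does not cancel it:
pl_clean, pl_scaled = energies ** (1 / 15), (gain * energies) ** (1 / 15)
print(np.allclose(pl_clean - pl_clean.mean(),
                  pl_scaled - pl_scaled.mean()))               # False

# Shifted log: behaves like the log for large energies, but the shift keeps
# low-energy (noisy) channels from diving towards minus infinity.
def shifted_log(p, shift=1.0):
    return np.log(p + shift)
```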
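And for point 6, the liftering I mean is the usual sinusoidal one in its HTK-style form; L = 22 is just the conventional value, assumed here.

```python
# A short sketch of sinusoidal cepstral liftering (HTK-style form).

import numpy as np


def lifter(cepstra, L=22):
    """Apply sinusoidal liftering to a (n_frames, n_ceps) array of cepstra."""
    if L <= 0:
        return cepstra
    n = np.arange(cepstra.shape[1])
    return cepstra * (1.0 + (L / 2.0) * np.sin(np.pi * n / L))
```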

Still, PNCC seems to provide quite good robustness in noise, and I think it will improve the performance of the default models. The current plan is to import PNCC into pocketsphinx and sphinx4 as the default features and train models for them.