Not All Speaker Adaptations Are Equally Useful

Some time ago I was rather encouraged by VTLN, which is vocal tract length normalization. By so-called frequency warping it tries to normalize the vocal tract length across speakers and thus build a better model. It's done by shifting and adjusting the mel filter frequencies. This is implemented in SphinxTrain/PocketSphinx/Sphinx4. Basically all you need is to enable it in sphinx_train.cfg:

$CFG_VTLN = 'yes';


And run the training. It will extract features for every frequency warp parameter in some range with some step (mind the disk space), and will find the best one for each utterance by forced alignment. Then it will create new fileids and transcription files that reference the files with the proper warp parameter.
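Roughly, the per-utterance selection amounts to a grid search like the sketch below. This is my own Python illustration, not SphinxTrain's actual code: the warp grid is just a typical guess, and extract_features/align_score are hypothetical callables standing in for the feature extractor and the forced aligner.

def best_warp(utterance, extract_features, align_score,
              warps=(0.80, 0.85, 0.90, 0.95, 1.00, 1.05, 1.10, 1.15, 1.20)):
    """Return the warp factor whose features align best with the known transcript.

    extract_features(utterance, alpha) -> feature matrix computed with warp alpha
    align_score(features) -> forced-alignment log-likelihood of the transcript
    """
    scores = {alpha: align_score(extract_features(utterance, alpha)) for alpha in warps}
    return max(scores, key=scores.get)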

To decode with a VTLN model you need to guess the warp parameter. Several algorithms have been suggested for that: one analyses pitch, others employ a GMM for classification. Then you need to re-extract features with the predicted warp parameter. It gives a visible improvement in accuracy.
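As a sketch of the GMM-based guess (again my own illustration with scikit-learn, not the exact method from any particular paper): train one GMM per warp class on MFCC frames from training speakers whose best warp is already known, then pick the class that scores the test utterance highest.

from sklearn.mixture import GaussianMixture

def train_warp_gmms(frames_per_warp, n_components=8):
    # frames_per_warp: dict mapping warp factor -> (n_frames, n_ceps) array of
    # MFCC frames from training speakers whose best warp is that factor.
    return {alpha: GaussianMixture(n_components, covariance_type='diag').fit(frames)
            for alpha, frames in frames_per_warp.items()}

def predict_warp(gmms, test_frames):
    # The warp class whose GMM gives the highest average log-likelihood
    # over the frames of the test utterance.
    return max(gmms, key=lambda alpha: gmms[alpha].score(test_frames))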

But recently a set of articles like

"MLLR-Like Speaker Adaptation Based on Linearization of VTLN with MFCC Features" by Xiaodong Cui and Abeer Alwan

came to my attention thanks to antonsrv8. The simple idea is that any transform we apply to the features, especially a smooth one, can largely be replaced by a linear transform of the MFCC coefficients, basically by an MLLR transformation. This rather obvious fact makes me wonder whether we really need other transformations at all if MLLR is generic enough. MLLR is not harder to estimate than a warp factor, especially if there is enough data, which is usually the case. And another transformation applied on top will just conflict with MLLR. On large data sets this is confirmed by the experimental results in the article above.
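Here is a quick numerical sanity check of the linearization idea, a toy example of my own which assumes the warp is applied as interpolation of the log filterbank energies. Since both the interpolation and the DCT are linear, the warped cepstra are exactly a matrix transform of the original cepstra, which is precisely the form MLLR estimates.

import numpy as np
from scipy.fftpack import dct

n_filters = 24
alpha = 0.9  # warp factor

# W applies the frequency warp as linear interpolation over filterbank bins,
# so it is a linear operator on the log filterbank energies.
W = np.zeros((n_filters, n_filters))
for i in range(n_filters):
    pos = min(alpha * i, n_filters - 1.0)
    lo = int(np.floor(pos))
    hi = min(lo + 1, n_filters - 1)
    frac = pos - lo
    W[i, lo] += 1.0 - frac
    W[i, hi] += frac

# Orthonormal DCT matrix: cepstra = C @ log_filterbank
C = dct(np.eye(n_filters), type=2, norm='ortho', axis=0)

logfb = np.random.default_rng(0).normal(size=n_filters)  # a fake log filterbank frame
cep = C @ logfb               # original cepstra
cep_warped = C @ (W @ logfb)  # cepstra after warping the filterbank

A = C @ W @ C.T               # the same warp expressed in the cepstral domain
print(np.allclose(cep_warped, A @ cep))  # True: the warp is a linear transform of the cepstra

With the usual truncation to the first 13 coefficients the relation is only approximately linear, which is exactly the approximation the linearization in the paper deals with.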

Of course a non-linear transform could do better than a linear one, but it seems that transform is not VTLN. I hope the latest state of the art in voice conversion can suggest something better.

Update: this point was of course largely covered in research papers. Good coverage with math and results is provided in Luís's thesis:

Speaker Normalisation and Adaptation in Large Vocabulary Speech Recognition by Luís Felipe Uebel