Written by Nickolay Shmyrev
on January 21, 2010

Moving Beyond the `Beads-On-A-String'

Recently I've become interested in quite a large area of speech recognition research where old-school linguistics meets modern speech recognition. The basic idea is that variability in spontaneous speech is so large that phonetic transcriptions taken from the dictionary don't fit it well. In a plain CMUSphinx setup, linguistic information about phones is almost entirely lost: we don't care, for example, whether a phone is labial or dental. That information is used when building decision trees, but it's not clear whether such usage helps. It's definitely not good to drop so much information that could help with classification. So this idea is being actively developed, and you can find there probably everything you were missing: distinctive phone features, landmarks, spectrogram recognition.

I went through the following articles. The number of methods, approaches and implementations described there is really huge, and other articles are likely to add even more:

S. King, J. Frankel, K. Livescu, E. McDermott, K. Richmond, and M. Wester. Speech production knowledge in automatic speech recognition. Journal of the Acoustical Society of America, 121(2):723-742, February 2007.

Moving Beyond the `Beads-On-A-String' Model of Speech by M. Ostendorf

Speaking In Shorthand - A Syllable-Centric Perspective For Understanding Pronunciation Variation by Steven Greenberg

To be honest, the only idea from the articles that has stuck in my mind is that reductions in fast speech are the root of the problem. I noticed this in the early days too and experimented with skip states. Skips didn't give any improvement; they only made decoding slower. It will probably help to automatically increase lexicon variability and use forced alignment to get the proper pronunciation, at least at the training stage. As I understand it, I just need to take a dictionary with syllabification and create a dictionary with many reduced variants in which onsets are kept as is and codas are reduced in some form. Then we force-align, then train. The acoustic model will probably be better then.
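The dictionary expansion step could look roughly like the sketch below. It is only an illustration under several assumptions of mine: the syllabified input format (a list of onset, nucleus, coda triples) is made up, "reducing" a coda is modelled as simply dropping consonants from its end, and alternate pronunciations are numbered WORD(2), WORD(3) in the Sphinx dictionary style. Real reduction rules would need to come from the linguistic literature or from data.

```python
from itertools import product

# Hypothetical input format: each word maps to a list of syllables,
# and each syllable is an (onset, nucleus, coda) triple of phone lists.
SYLLABIFIED = {
    "WANTED": [(["W"], ["AA"], ["N", "T"]), ([], ["AH"], ["D"])],
}


def coda_variants(coda):
    """Return the full coda followed by progressively truncated forms.

    Truncation from the end is an assumed stand-in for real reduction rules:
    ["N", "T"] -> ["N", "T"], ["N"], [].
    """
    return [coda[:k] for k in range(len(coda), -1, -1)]


def reduced_pronunciations(syllables, max_variants=8):
    """List pronunciations with onsets and nuclei intact and codas reduced."""
    per_syllable = [
        [onset + nucleus + c for c in coda_variants(coda)]
        for onset, nucleus, coda in syllables
    ]
    seen = []
    for combo in product(*per_syllable):
        phones = [p for syl in combo for p in syl]
        if phones not in seen:  # skip duplicates produced by different reductions
            seen.append(phones)
        if len(seen) >= max_variants:  # cap the lexicon growth per word
            break
    return seen


def dictionary_lines(word, syllables, max_variants=8):
    """Format the variants as dictionary entries: WORD, WORD(2), WORD(3), ..."""
    lines = []
    variants = reduced_pronunciations(syllables, max_variants)
    for i, phones in enumerate(variants, start=1):
        label = word if i == 1 else f"{word}({i})"
        lines.append(f"{label} {' '.join(phones)}")
    return lines


if __name__ == "__main__":
    for word, syllables in SYLLABIFIED.items():
        print("\n".join(dictionary_lines(word, syllables)))
```

The canonical pronunciation always comes first, so forced alignment can still pick it when no reduction happened. The `max_variants` cap is there because the number of combinations grows quickly with the number of syllables.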

Another striking point is that I haven't found any significant accuracy improvement reported in the articles I read. An improvement like the 20% from discriminative training could get any method widely adopted, but nothing like that is mentioned. This research is probably at a very early stage.
