Written by Nickolay Shmyrev
Phoneset with stress
So I finally finished testing the stress-aware model. It took me a few months, and in the end I can say that lexical stress is definitely better: it provides better accuracy and, more importantly, more robustness than a model with a non-stressed phoneset.
I hope we retrain all the other models we have with the stressed phoneset. It's great that CMUDict provides enough information to do that. The story of testing this was quite interesting. I believed in stress for a long time but wasn't able to prove it. In theory it's clear why it helps: when speech rate changes, stressed syllables remain less corrupted than unstressed ones, so additional information like lexical stress gives us better control over the data. Of course the issue is the increased number of parameters to train, which is probably why early investigations concluded that a phoneset without stress is better. A discussion on cmusphinx-devel this summer also confirmed that Nuance moved to a model with stress in their automotive decoder.
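CMUDict marks lexical stress with a digit attached to each vowel (AH0 unstressed, AH1 primary, AH2 secondary), so both phonesets can be derived from the same file. Here is a minimal sketch in Python of collecting the two phone inventories; the paths and helper names are mine, not from any CMUSphinx tool:

```python
import re

def phones_with_stress(cmudict_path):
    """Collect the phoneset from a CMUDict file, keeping the
    0/1/2 stress digits attached to vowels."""
    phones = set()
    with open(cmudict_path, encoding="latin-1") as f:
        for line in f:
            if not line.strip() or line.startswith(";;;"):
                continue  # skip blanks and comment lines
            # entries look like: HELLO  HH AH0 L OW1
            word, *pron = line.split()
            phones.update(pron)
    return phones

def strip_stress(phone):
    """Drop the trailing stress digit: AH0/AH1/AH2 -> AH."""
    return re.sub(r"\d$", "", phone)
```

Keeping the digits roughly triples the vowel inventory (each of the 15 CMUDict vowels gets 0, 1 and 2 variants), which is exactly where the extra parameters to train come from.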
It's interesting how long I tested this. I made numerous attempts, and each one had bugs:
- The first attempt used bad features (adapted for 3gp) and didn't show any improvement
- The number of senones in the second training was too small, since I didn't know the reason for the first failure
- The third attempt had an issue with the question sets: automatic questions were accidentally used instead of the manual ones I wrote, and it went unnoticed
- The fourth attempt was rejected because of issues with the dictionary format in Sphinx4. By the way, never use FastDictionary, use FullDictionary. FastDictionary expects a specific dictionary format with pronunciation variants numbered like (2) (3) (4), not (1) or (2) and (4). See the sketch after this list
- Only the fifth attempt was good, and even then it showed improvement only on the big test set and not on the small one
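To catch that kind of dictionary trouble earlier, a small sanity check over the variant numbering helps. This is a hypothetical helper in Python (the function name and file handling are mine, not part of Sphinx4); it only encodes the numbering rule described above:

```python
import re
from collections import defaultdict

VARIANT = re.compile(r"^(.+?)\((\d+)\)$")

def check_variant_numbering(dict_path):
    """Report words whose alternate pronunciations are not
    numbered (2), (3), (4), ... without gaps -- the format
    FastDictionary expects."""
    variants = defaultdict(list)
    with open(dict_path, encoding="latin-1") as f:
        for line in f:
            if not line.strip() or line.startswith(";;"):
                continue  # skip blanks and comments
            head = line.split()[0]  # the word, possibly WORD(N)
            m = VARIANT.match(head)
            if m:
                variants[m.group(1)].append(int(m.group(2)))
    bad = []
    for word, nums in variants.items():
        if sorted(nums) != list(range(2, 2 + len(nums))):
            bad.append((word, sorted(nums)))
    return bad
```

Anything this returns would have tripped FastDictionary, while FullDictionary handles it.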
So basically, to check every fact you need to be very careful and double- or triple-check everything. Bugs are everywhere: in language model training, in the decoder, in the trainer, in the configuration. From run to run bugs can lead to different results, and even a small change can break everything. I think the optimal way for research could be to check the same proposition in independent teams using independent decoders and probably different data. Not sure if that's doable in the short term.