Written by Nickolay Shmyrev
Looking at the waves
Here is the puzzle: a perfectly good-looking sound file that is transcribed with only 10% accuracy. Sounds crazy, doesn't it? There is no noise and no accent.
Because of that, I'm looking at the state of the art in channel normalization, especially for non-linear channel distortions. There is no good solution yet; so far I've only found a description of the problem in a very old paper.
There is CDCN normalization, a few CMN improvements, RASTA and even the recently invented HN normalization. CDCN is surprisingly available in SphinxTrain, but nobody uses it. It gives no improvement, but it's an interesting approach that is worth documenting one day. The idea of collecting statistics from the speech and applying them later sounds nice.
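For reference, here is a minimal sketch of plain cepstral mean normalization (CMN), not CDCN or any of the improved variants mentioned above. It assumes cepstral features (e.g. MFCCs) are given as a list of frames, each a list of floats; the function and variable names are hypothetical. A stationary linear channel multiplies the spectrum, which becomes a constant additive offset after taking the log and computing the cepstrum, so subtracting the per-utterance mean removes it. A non-linear distortion does not reduce to such an offset, which is why it is much harder to compensate.

```python
from typing import List

Frame = List[float]  # one cepstral vector per frame (assumed representation)


def cepstral_mean_normalize(frames: List[Frame]) -> List[Frame]:
    """Subtract the per-utterance mean from every cepstral coefficient."""
    if not frames:
        return []
    dim = len(frames[0])
    n = len(frames)
    # Mean of each coefficient over all frames of the utterance
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    return [[f[d] - mean[d] for d in range(dim)] for f in frames]


# Toy data: a stationary linear channel adds the same vector to every frame
clean = [[1.0, 0.5], [2.0, -0.5], [3.0, 0.0]]
channel = [0.7, -0.2]  # hypothetical channel offset in the cepstral domain
distorted = [[c + h for c, h in zip(f, channel)] for f in clean]

print(cepstral_mean_normalize(clean))
print(cepstral_mean_normalize(distorted))
```

Up to floating-point rounding, both calls print the same normalized frames: the constant channel offset is gone. Real systems often use a running or windowed mean instead of the whole-utterance mean so they can work in live mode.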
There are also model-level approaches, various feature transforms, and adaptation methods, but they do not look that attractive. Most recent papers deal with channel compensation for speaker recognition rather than speech recognition. I must admit the topic is too large to survey in a few weeks.
Luckily, I can also spend time looking at waves like the one in the picture on the right. Somewhat more pleasant, I would say.