Decoding of Compressed Low-Bitrate Speech

I've spent some time optimizing accuracy for 3gp speech recordings from mobile phones. 3gp is a container format used on most mobile devices nowadays, with the speech inside compressed using AMR-NB. I converted audio to AMR-NB and back, extracted PLP features, and then trained a few models on that. The result is not encouraging: accuracy is worse than the stock model on both the original and the compressed/decompressed audio. Not dramatically worse, but significantly worse.
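For anyone who wants to reproduce the compress/decompress step, here is a minimal sketch of the round trip. It assumes an ffmpeg build with the libopencore_amrnb encoder enabled. The function names, file names and the 12.2 kbit/s mode are my choices for illustration, not necessarily what was used in the experiment above.

```python
import subprocess


def amr_roundtrip_commands(src_wav, amr_path, dst_wav, bitrate="12.2k"):
    """Build the two ffmpeg command lines for a WAV -> AMR-NB -> WAV round trip.

    Assumption: ffmpeg is on PATH and was built with libopencore_amrnb.
    """
    encode = [
        "ffmpeg", "-y", "-i", src_wav,
        "-ar", "8000", "-ac", "1",            # AMR-NB only supports 8 kHz mono
        "-c:a", "libopencore_amrnb", "-b:a", bitrate,
        amr_path,
    ]
    decode = [
        "ffmpeg", "-y", "-i", amr_path,
        "-c:a", "pcm_s16le",                  # back to plain 16-bit PCM
        dst_wav,
    ]
    return encode, decode


def amr_roundtrip(src_wav, amr_path, dst_wav, bitrate="12.2k"):
    """Run the encode and decode steps; raises CalledProcessError on failure."""
    for cmd in amr_roundtrip_commands(src_wav, amr_path, dst_wav, bitrate):
        subprocess.run(cmd, check=True)
```

Calling `amr_roundtrip("utt.wav", "utt.amr", "utt_amr.wav")` would leave the codec-degraded copy in `utt_amr.wav`, ready for feature extraction. Note that the decoded file is 8 kHz, so it has to match the sampling rate the acoustic model expects.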

It looks like traditional HMM issues such as the frame independence assumption come into play here, which is confirmed by the papers I found. This paper is quite useful, for example:

Vladimir Fabregas Surigué de Alencar and Abraham Alcaim. On the Performance of ITU-T G.723.1 and AMR-NB Codecs for Large Vocabulary Distributed Speech Recognition in Brazilian Portuguese

And this paper is good too:

Patrick Bauer, David Scheler, Tim Fingscheidt. WTIMIT: The TIMIT Speech Corpus Transmitted Over the 3G AMR Wideband Mobile Network

More research on the subject is needed. Surprisingly, there are only a few papers on it, far fewer than on reverberation. It looks like we have to build a specialized frontend specifically targeted at decoding low-bitrate compressed speech. Or we need to move to features more robust than PLP.

For now I would state the problem as developing a speech recognition framework that provides good accuracy on:
  • Unmodified speech
  • Noise-corrupted speech
  • Music-corrupted speech
  • Codec-corrupted speech
  • Long-distance speech
A good system should decode well in all cases.
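
One way to track this goal is to report word error rate separately for each condition instead of a single pooled number. The sketch below is only an illustration: the condition labels and the `(condition, reference, hypothesis)` tuple layout are assumptions of mine, and the WER is the plain word-level edit distance divided by the number of reference words.

```python
def word_errors(ref_words, hyp_words):
    """Levenshtein distance between two word lists (sub + ins + del)."""
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, 1):
        cur = [i]
        for j, h in enumerate(hyp_words, 1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference word
                cur[j - 1] + 1,          # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # match or substitution
            ))
        prev = cur
    return prev[-1]


def wer_by_condition(results):
    """results: iterable of (condition, reference, hypothesis) strings.

    Returns {condition: WER}, pooled over all utterances in that condition.
    """
    errors, words = {}, {}
    for condition, ref, hyp in results:
        ref_words, hyp_words = ref.split(), hyp.split()
        errors[condition] = errors.get(condition, 0) + word_errors(ref_words, hyp_words)
        words[condition] = words.get(condition, 0) + len(ref_words)
    return {c: errors[c] / words[c] for c in errors if words[c] > 0}


# Made-up transcripts, only to show the call shape.
example = [
    ("unmodified", "turn the light on", "turn the light on"),
    ("codec", "turn the light on", "turn a light"),
]
print(wer_by_condition(example))
```

A model that looks fine on the pooled number can still fall apart on the codec or long-distance subsets, which is exactly the case described above.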