Written by
Nickolay Shmyrev
on
When information is already lost
In speech recognition we frequently deal with noisy or simply corrupted recordings. For example, in call center recordings you still get error rates like 50% or 60% even with the best algorithms. Someone calls while driving, others on the noisy street. Some use really bad phones. And the question raises how to improve the accuracy for such recordings. People try different complex algorithms and train model on GPU for many weeks while the answer is simple - the information in such recordings is simply lost and you can not decode them accurately.
Data transfer and data storage are expensive and VoIP providers often save every cent since they all operate on very low margin. That means they often use codecs with bugs or bad transmission lines and as a result you simply get unintelligible speech. Then everyone uses cell phones thus you have multiple encoding-decoding rounds where information from the microphone is encoded with AMR then encoded into 729 and finally converted into MP3 for storage, you have many codecs and often frame drops. As a result the quality of sound is sometimes very bad. And recognition accuracy is zero.
The quality of speech is hard to measure but there are ways to do that. The easiest way requires controlled experiments where you send the data from one endpoint and measure distortion on another endpoint. There are also single-side tools developed by ITU in
PESQ 563 that simply take an audio file and give you the sound quality score which takes many parameters into account and estimates the quality of the speech. They are rough, but still can give you some idea how noisy your audio is. And if it is really noisy, the way to improve it is not to apply better algorithms and put more research in speech recognition but simply go to the VoIP provider and demand better sound quality.
Given we have such a tool we might want to introduce the normalized word error rate which takes into account how good the recording is. So you really want to decode high quality recordings accurately and you probably do not care about bad quality recordings.
When accuracy matters, sound quality is really important. If possible you can use your own VoIP stack sending audio directly from the mobile microphone to the server. But when calls come to play, it is usually hopeless.