Semi-supervised Learning and Frame Rate
With the development of neural network toolkits, it seems the technology has reached the point where a huge network can remember and recognize almost everything, as long as it was properly annotated in the training database. It might not be practical for all applications, but it works. For deployment, you can use a distilled model which has reasonable decoding speed and size.
The problem here is the training dataset: it is practically impossible to prepare a properly annotated dataset that accurately describes what is going on in the audio. To cover all possible accents, noises and nuances you need a huge database of many thousands of hours. Naturally, such a database is not very accurately annotated and often contains mistakes.
Algorithms to detect mistakes, correct them and learn from them are becoming more and more important. This is where semi-supervised learning (recently rebranded as self-supervised learning), improved confidence measures and result calibration come in. The overall approach looks like this:
- We have a huge teacher model which accurately recognizes speech
- We have a student model for production
- Our databases are not well annotated.
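The loop implied by these three points can be sketched as a simple pseudo-labeling filter. This is only an illustration: `teacher_decode`, the corpus format and the confidence threshold are hypothetical placeholders, not a real API.

```python
# Minimal pseudo-labeling sketch: a slow but accurate teacher relabels
# noisy data, and only confident utterances are kept to train the student.
# teacher_decode() and THRESHOLD are hypothetical placeholders.

THRESHOLD = 0.9  # keep only utterances the teacher is confident about


def teacher_decode(utt):
    # Stand-in for a big, slow, accurate model; returns (transcript, confidence).
    return utt["text"].lower(), utt["quality"]


def build_student_trainset(noisy_corpus):
    trainset = []
    for utt in noisy_corpus:
        transcript, confidence = teacher_decode(utt)
        if confidence >= THRESHOLD:
            trainset.append((utt["id"], transcript))
    return trainset


corpus = [
    {"id": "utt1", "text": "Hello World", "quality": 0.95},
    {"id": "utt2", "text": "Noisy Junk", "quality": 0.40},
]
print(build_student_trainset(corpus))  # only the confident utterance survives
```

In a real system the teacher's output would replace or correct the original annotation, and the filtered set would be fed into distillation or standard training of the student.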
The interesting point here is that the teacher model doesn't have to be fast; it only has to be accurate enough to recognize the training database. This leads to the interesting consequence that we might revisit some of the algorithms we traditionally designed to work in production. The teacher can be a huge model trained with contrastive learning (wav2vec), it can work from raw audio instead of MFCC features, and it can use a huge memory database (Vosk model).
One example is the sample rate we use for training. While we traditionally work with 16 kHz audio, nothing stops us from working with 48 kHz in the teacher model with high-quality speech recordings; in many cases it improves robustness in noise.
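As a toy illustration, going from high-quality 48 kHz recordings down to the usual 16 kHz is a 3:1 decimation. A sketch using `scipy.signal.resample_poly`, assuming SciPy is available:

```python
import numpy as np
from scipy.signal import resample_poly

# One second of a synthetic 440 Hz tone sampled at 48 kHz.
sr_high, sr_low = 48000, 16000
t = np.arange(sr_high) / sr_high
audio_48k = np.sin(2 * np.pi * 440 * t)

# Polyphase resampling from 48 kHz down to 16 kHz (ratio 1:3),
# which low-pass filters before decimation to avoid aliasing.
audio_16k = resample_poly(audio_48k, up=1, down=3)

print(len(audio_48k), len(audio_16k))  # 48000 16000
```

The point is that a production student model may still consume 16 kHz features, while the teacher sees the full-bandwidth signal.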
Another interesting example concerns the frame rate we traditionally use for speech recognition. It used to be 100 frames per second; with recent chain and end-to-end models the frame rate is frequently reduced to just 30 frames per second. Such a rate is fine for slow speech like audiobooks, and sometimes fine for telephony codecs. However, in conversational speech you might notice that chain models skip words without any notice. The deletion rate of a chain model is always higher than the insertion rate. This is one of the signs that a frame rate of 30 frames per second is sometimes not sufficient. Also note that with CTC topology the smallest unit we can model is 2 frames, so the smallest sound change we can represent is 0.02 seconds. With the standard feed-forward HMM topology it is 0.03 seconds. With chain models with subsampling it is 0.06 seconds. Pretty long, to be honest. Also, one cannot guarantee that a phone boundary falls on a multiple of 0.01 seconds, which introduces noise into the models.
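The minimum-duration arithmetic is easy to check: the smallest representable sound is the minimum number of frames a topology requires divided by the effective frame rate.

```python
def min_unit_duration(frames_per_second, min_frames):
    """Smallest sound duration a topology can represent, in seconds."""
    return min_frames / frames_per_second


# CTC topology at 100 fps needs at least 2 frames per unit.
print(min_unit_duration(100, 2))                 # 0.02 s
# Standard feed-forward HMM topology: 3 frames.
print(min_unit_duration(100, 3))                 # 0.03 s
# Chain model with 3x subsampling: ~33 fps, 2 frames.
print(round(min_unit_duration(100 / 3, 2), 2))   # 0.06 s
```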
From an investigation of fast conversational speech one can see that the rate of signal change in speech is much higher. To properly annotate the speech you need a frame rate of 200 frames per second or even 400. A phoneme might last just one pitch period, which could be as short as 1/400 Hz = 0.0025 seconds for children's or women's speech. This idea is not very novel; it has been investigated in a number of earlier publications.
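To make the pitch-period arithmetic concrete: a 400 Hz pitch period lasts 0.0025 seconds, which covers only a quarter of a frame at 100 fps but a full frame at 400 fps.

```python
# How many analysis frames cover a single 400 Hz pitch period?
pitch_hz = 400  # one pitch period = 1/400 Hz = 0.0025 s

for fps in (100, 200, 400):
    frames = fps / pitch_hz  # frames spanned by one pitch period
    print(f"{fps} fps: {frames} frames")
# At 100 fps the period spans only 0.25 of a frame and is easy to miss;
# at 400 fps it spans exactly one frame.
```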
And, not surprisingly, a higher frame rate significantly improves speech recognition accuracy for conversational speech. That's an interesting point to explore.
Of course, one can go further and try variable frame rates, or consider something like wavelets for the teacher model. Even if wavelets are slow, they can give much better accuracy, which means you can teach the student model better too. This is an interesting subject for research.