Semi-supervised Learning and Frame Rate
With the development of neural network toolkits, the technology seems to have
reached the point where a huge network can remember and recognize almost
everything, as long as it was properly annotated in the training database. It
might not be practical for all applications, but it works. For deployment, you
can use a distilled model which has reasonable decoding speed and size.
The problem here is the training dataset: it is practically impossible to
prepare a properly annotated dataset accurately describing what is going on in
the audio. To cover all possible accents, noises and nuances you need a huge
database of many thousands of hours. Of course, such a database is never
annotated very accurately and often contains mistakes. Algorithms that detect
mistakes, correct them and learn from them become more and more important.
This is where semi-supervised learning (rebranded as self-supervised
learning), improved confidence measures and result calibration come in. The
overall approach looks like this:
- We have a huge teacher model which accurately recognizes speech
- We have a student model for production
- Our databases are not well annotated
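This loop can be sketched roughly as follows. The decode function and
confidence values below are placeholders of my own, not any real toolkit API:
the teacher relabels the noisy corpus, and only utterances it is confident
about go into the student training set.

```python
# Hypothetical teacher-student pseudo-labeling sketch. teacher_decode stands
# in for a large, slow, accurate model; in practice it would run a real
# recognizer and return a transcript with a confidence score.

def teacher_decode(audio):
    """Placeholder teacher: pretend every utterance decodes to "hello"
    with high confidence."""
    return "hello", 0.92

def build_student_training_set(utterances, confidence_threshold=0.9):
    """Relabel a weakly annotated corpus with the teacher and keep only
    utterances the teacher is confident about."""
    selected = []
    for audio, noisy_label in utterances:
        transcript, confidence = teacher_decode(audio)
        if confidence >= confidence_threshold:
            # Trust the teacher's label over the noisy human annotation.
            selected.append((audio, transcript))
        # Low-confidence utterances are dropped (or sent for human review).
    return selected

corpus = [(b"...audio bytes...", "helo"), (b"...more audio...", "hello")]
train_set = build_student_training_set(corpus)
print(len(train_set), train_set[0][1])  # 2 hello
```

The confidence threshold is what connects this to improved confidence
measures and calibration: the better calibrated the teacher's scores are, the
fewer wrong labels leak into the student's training data.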
The interesting point here is that the teacher model doesn't have to be fast;
it only has to be accurate enough to recognize the training database. This
leads to an interesting consequence: we might revisit some of the algorithms
we traditionally designed to work in production. The teacher can be a huge
model trained with contrastive learning (wav2vec), it can work from raw audio
instead of MFCC features, and it can use a huge memory database (Vosk model).
One example is the sample rate we use for training. While we traditionally
work with 16 kHz audio, nothing stops us from using 48 kHz in the teacher
model with high-quality speech recordings; in many cases it improves
robustness in noise.
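A quick way to see what the higher sample rate buys is to count the samples
inside a standard analysis window. The 25 ms window and 10 ms hop below are
the usual defaults in speech front ends, not values mandated by any
particular toolkit:

```python
# Window and hop sizes in samples for a standard 25 ms / 10 ms analysis
# setup at different sample rates.

def frame_params(sample_rate_hz, window_ms=25, hop_ms=10):
    window = int(sample_rate_hz * window_ms / 1000)
    hop = int(sample_rate_hz * hop_ms / 1000)
    return window, hop

for sr in (8000, 16000, 48000):
    window, hop = frame_params(sr)
    print(f"{sr} Hz: {window} samples per window, {hop} per hop")
# At 48 kHz the same 25 ms window holds 3x more samples than at 16 kHz,
# so the teacher also sees spectral content above 8 kHz that a 16 kHz
# front end discards entirely.
```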
Another interesting example concerns the frame rate we traditionally use for
speech recognition. It used to be 100 frames per second; with recent chain
and end-to-end models the frame rate is frequently reduced to just 30 frames
per second. Such a rate is fine for slow speech like audiobooks, and
sometimes fine for telephony-like codecs. However, in conversational speech
you might notice that chain models silently skip words: the deletion rate of
chain models is consistently higher than the insertion rate. This is one of
the signs that a frame rate of 30 frames per second is sometimes not
sufficient. Also note that with CTC topology the smallest unit occupies 2
frames, so at 100 frames per second the smallest sound change we can model is
0.02 seconds. With the standard feed-forward HMM topology it is 0.03 seconds.
With chain models with subsampling it is 0.06 seconds. Pretty long, to be
honest. Moreover, one cannot guarantee that a phone boundary falls on a
multiple of 0.01 seconds, which introduces noise into the models.
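The arithmetic behind those numbers is simply minimum frames per unit divided
by frame rate. A small check, using the topologies mentioned above:

```python
# Smallest modellable sound duration per topology: the minimum number of
# frames a unit must occupy, divided by the frame rate.

def min_unit_seconds(min_frames, frames_per_second):
    return min_frames / frames_per_second

topologies = [
    ("CTC, 2-frame unit @ 100 fps",        2, 100),
    ("feed-forward HMM, 3 states @ 100 fps", 3, 100),
    ("chain model, 3x subsampling",          2, 100 / 3),
]
for name, frames, fps in topologies:
    print(f"{name}: {min_unit_seconds(frames, fps):.3f} s")
# A chain model with 3x subsampling cannot represent a sound shorter
# than 60 ms, while fast conversational phones can be much shorter.
```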
From an investigation of fast conversational speech one can figure out that
the rate of signal change in speech is much higher. To properly annotate the
speech you need a frame rate of 200 frames per second, or even 400. A phoneme
might be just one pitch period, which can be as short as 1 / 400 Hz = 0.0025
seconds for children's or women's speech. This idea is not very novel; it was
investigated in
End-to-End Speech Recognition with High-Frame-Rate Features Extraction by Cong-Thanh Do
and some earlier publications like:
On the Use of Variable Frame Rate Analysis in Speech Recognition
and many more like:
Impact of frame rate on automatic speech-text alignment for corpus-based phonetic studies
And, not surprisingly, a higher frame rate significantly improves speech
recognition accuracy for conversational speech. That's an interesting point
to explore.
Of course, one can go further and check variable frame rate, and also
consider something like wavelets for the teacher model; even if wavelets are
slow, they can give much better accuracy, meaning you can teach the student
model better too. This is an interesting subject for research.
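One common way to approach variable frame rate is to analyze at a high base
rate and then drop frames where nothing changes. The change detector below is
a minimal sketch of my own under that assumption, not a specific published
algorithm:

```python
# Variable-frame-rate sketch: keep a frame only when the feature vector has
# moved far enough from the last kept frame, so steady vowels use few frames
# while fast transitions keep many.

def select_frames(features, threshold=0.5):
    """features: list of feature vectors (lists of floats) at the base rate.
    Returns the indices of frames kept by the change detector."""
    kept = [0]  # always keep the first frame
    last = features[0]
    for i, frame in enumerate(features[1:], start=1):
        # Euclidean distance to the last kept frame
        change = sum((a - b) ** 2 for a, b in zip(frame, last)) ** 0.5
        if change >= threshold:
            kept.append(i)
            last = frame
    return kept

steady = [[1.0, 1.0]] * 5                          # steady sound: no change
transient = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]   # rapid change
print(select_frames(steady))     # [0]
print(select_frames(transient))  # [0, 1, 2]
```

The same idea fits the teacher-student setting naturally: the teacher can
afford a 400 fps base rate with frame selection on top, while the student
keeps its fixed, cheap frame rate.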