Written by
Nickolay Shmyrev
The masking problem - capsules, specaug, bert
An important issue with modern neural networks is their vulnerability
to masked corruption, that is, random corruption of a small
number of samples in an image or sound. It is well known that humans are
very robust to such noise: a person can ignore slightly but randomly
corrupted pictures, sounds, and sentences. MP3 compression uses
masking to drop unimportant bits of sound. Random impulse
background noise usually has little effect on speech recognition by
humans. On the other hand, it is very easy to demonstrate that modern ASR
is extremely vulnerable to random spontaneous noise, and that really makes a
difference: even a slight change of some frequencies can harm the accuracy
a lot.
Hinton understood this problem, and that is why he proposed capsule
networks as a
solution. The idea is that by using agreement between a set of experts
you can get a more reliable prediction, ignoring unreliable parts. Capsules
are not very popular yet, but they were designed exactly to solve the
masking problem.
On the other hand, Google/Facebook/OpenAI
tried to solve the same problem with more traditional networks. They
still use deep and densely connected architectures, but they decided to corrupt
the dataset with masks during training and teach the model to
recognize masked inputs. And it works well too: remember the
success of SpecAugment in speech recognition; BERT/RoBERTa/XLM in NLP are
very good examples too.
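The BERT-style corruption idea itself is simple. A minimal sketch (my own toy illustration in numpy, not the actual BERT code; `mask_tokens` and its parameters are made up for this example):

```python
import numpy as np

def mask_tokens(tokens, mask_id, mask_prob=0.15, rng=None):
    """Corrupt a token sequence BERT-style: replace a random ~15% of
    positions with a special mask token. Returns the corrupted sequence
    and the boolean mask, so a model can be trained to predict the
    original tokens at the masked positions."""
    if rng is None:
        rng = np.random.default_rng()
    tokens = np.asarray(tokens)
    mask = rng.random(tokens.shape) < mask_prob
    corrupted = np.where(mask, mask_id, tokens)
    return corrupted, mask

tokens = [12, 7, 99, 3, 42, 8, 15, 61]
corrupted, mask = mask_tokens(tokens, mask_id=0, rng=np.random.default_rng(0))
# the model sees `corrupted` and is trained to recover tokens at `mask`
```

The same trick transfers to speech: replace token ids with spectrogram values and the recipe is essentially the corruption half of SpecAugment.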
On the path to reproducing this idea it
is important to understand one thing. Since a neural network effectively
memorizes the input, to properly recognize masked images the trainer has to
see all possible masks and has to store their vectors in the network. That
means the training process has to be much, much longer and the network has to
be much, much bigger. We see that in BERT. The Kaldi people also saw it when
they tried to reproduce SpecAugment.
Given that, some future ideas:
- SpecAugment is not really random masking: it drops either a whole
  column or a whole row. I predict more effective masking would be to
  randomly drop 15% of the values over the whole 2-D spectrum,
  BERT-style. I think in the near future we shall see that
  idea implemented.
- The idea of masking can be applied to other sequence modeling
  problems in speech, for example in TTS; we shall see it soon in
  vocoders and in transformer/Tacotron models.
- The waste of resources for training and decoding with masking is
  obvious; a more intelligent architecture for recognizing masked inputs
  might change things significantly.
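The contrast in the first bullet can be sketched in a few lines of numpy (a toy illustration, not the actual SpecAugment implementation; the function names and mask sizes are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
spec = rng.random((80, 200))  # fake log-mel spectrogram: 80 bands x 200 frames

def specaugment_mask(spec, n_freq=8, n_time=20, rng=rng):
    """SpecAugment-style masking: zero out a whole block of rows
    (frequency bands) and a whole block of columns (time frames)."""
    out = spec.copy()
    f0 = rng.integers(0, spec.shape[0] - n_freq)
    t0 = rng.integers(0, spec.shape[1] - n_time)
    out[f0:f0 + n_freq, :] = 0.0   # frequency mask: whole rows
    out[:, t0:t0 + n_time] = 0.0   # time mask: whole columns
    return out

def random_mask(spec, p=0.15, rng=rng):
    """BERT-style masking: drop a random ~15% of individual
    time-frequency bins anywhere in the 2-D spectrum."""
    return np.where(rng.random(spec.shape) < p, 0.0, spec)
```

With `specaugment_mask` the corruption is highly structured (entire bands and frames disappear), while `random_mask` scatters the dropped bins over the whole plane, which is much closer to what BERT does to token sequences.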
Thanks to Robit Mann on @cmusphinx for the initial idea.