The masking problem - capsules, SpecAugment, BERT
An important weakness of modern neural networks is their vulnerability to masked corruption, that is, random corruption of a small fraction of the samples in an image or a sound. Humans are famously robust to such noise: we easily ignore slightly but randomly corrupted pictures, sounds, and sentences. MP3 compression exploits masking to drop perceptually unimportant bits of sound, and random impulse background noise usually has little effect on human speech recognition. Modern ASR, on the other hand, is demonstrably vulnerable to such spontaneous noise, and that really makes a difference: even a slight change in a few frequencies can hurt accuracy a lot.
Hinton understood this problem, and that is why he proposed capsule networks as a solution. The idea is that by requiring agreement among a set of experts you can get a more reliable prediction that ignores the unreliable parts of the input. Capsules are not very popular yet, but they were designed precisely to solve the masking problem.
Google, Facebook and OpenAI, on the other hand, attacked the same problem with more traditional networks. They kept deep, densely connected architectures, but corrupted the training data with masks and taught the model to handle them. And it works well too: recall the success of SpecAugment in speech recognition; BERT, RoBERTa and XLM in NLP are good examples as well.
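As a minimal sketch of this kind of training-time corruption, here is a SpecAugment-style masker written from scratch in NumPy (not the original implementation; the function and parameter names are my own). It zeroes out a few random frequency bands (rows) and time spans (columns) of a spectrogram, which the model then has to see through during training:

```python
import numpy as np

def spec_augment(spec, n_freq_masks=2, n_time_masks=2, max_width=10, rng=None):
    """Zero out random frequency bands (rows) and time spans (columns)
    of a (freq, time) spectrogram, in the spirit of SpecAugment.
    A rough sketch, not the published recipe."""
    rng = rng or np.random.default_rng()
    out = spec.copy()
    n_freq, n_time = out.shape
    for _ in range(n_freq_masks):
        w = int(rng.integers(1, max_width + 1))
        f0 = int(rng.integers(0, max(1, n_freq - w)))
        out[f0:f0 + w, :] = 0.0          # drop a whole frequency band
    for _ in range(n_time_masks):
        w = int(rng.integers(1, max_width + 1))
        t0 = int(rng.integers(0, max(1, n_time - w)))
        out[:, t0:t0 + w] = 0.0          # drop a whole time span
    return out
```

In a training loop this would be applied to each batch on the fly, so the network never sees the same masking twice.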
When reproducing this idea it is important to understand one thing. Since a neural network effectively memorizes its input, to properly recognize masked inputs the network has to see all possible masks during training and store their representations. That means the training process has to be much, much longer and the network has to be much, much bigger. We see that in BERT, and the Kaldi developers saw it too when they tried to reproduce SpecAugment.
Given that, some ideas for the future:
SpecAugment is not really random masking: it drops entire columns or rows of the spectrogram. I predict that a more effective scheme would randomly drop about 15% of the values over the whole 2-D spectrum, BERT-style. I think we shall see that idea implemented in the near future.
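The contrast with the row/column scheme is easy to see in code. A minimal sketch of the BERT-style variant (my own illustration; the function name and 15% default are assumptions matching the text) scatters the dropped bins over the whole 2-D spectrum instead of wiping out full rows or columns:

```python
import numpy as np

def random_mask(spec, p=0.15, rng=None):
    """Drop a random ~15% of individual time-frequency bins,
    BERT-style, rather than whole rows or columns."""
    rng = rng or np.random.default_rng()
    mask = rng.random(spec.shape) < p    # True for bins to drop
    out = spec.copy()
    out[mask] = 0.0
    return out
```

Unlike row/column masking, no single frequency band or time span is ever fully erased, so the model must reconstruct scattered local gaps rather than long contiguous ones.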
The idea of masking can be applied to other sequence modeling problems in speech, for example in TTS; we shall see it soon in vocoders and in Transformer/Tacotron models.
The waste of resources in training and decoding with masking is obvious; a more intelligent architecture for recognizing masked inputs might change things significantly.
Thanks to Robit Mann on @cmusphinx for the initial idea.