Interspeech 2020 Monday

Interspeech is overwhelming as usual. Thousands of papers and ideas, lives and thoughts.

On one hand, I kind of like the online format where you can participate in discussions sitting at home with a cup of tea. You can attend a presentation without running across the floors, no need to hurry. On the other hand, I miss Shanghai, which I really wanted to visit.

Monday seems to be a very active day, Tuesday not so much. I bookmarked more than 50 entries myself, so I won’t list them all here. Here are just some major ones:

  • It is still not clear which approach to ASR is the best: transformers, encoder-decoders, hybrid networks, and so on. There is no best direction to take yet, just many small but not so critical improvements. I haven’t got the full picture yet, but it seems the core ideas have not stabilized. There are problems in E2E, and there are problems in hybrid. In hybrid networks we still have quite a few things to do. It is sad that Kaldi doesn’t make it easy to experiment with architectures. The first thing is to get rid of the context dependency tree: Mon-2-11-3 Faster, Simpler and More Accurate Hybrid ASR Systems Using Wordpieces

  • E2E teams mostly work on context integration; it seems they have a big problem with context and with proper recognition of special cases (names, etc.).

  • Streaming ASR is a big problem. Some research is going on, but in general the situation is clear: there will be a significant drop in accuracy from the missing right context. Mon-2-3-2 Analyzing the Quality and Stability of a Streaming End-to-End On-Device Speech Recognizer

  • Looks like SpeechBrain is not going to be announced at Interspeech as planned. Not sure.

  • Non-native children’s speech recognition is a very interesting task with very cool results from the Cambridge HTK team, as well as many opportunities: Mon-SS-1-6-3 Non-Native Children’s Automatic Speech Recognition: the INTERSPEECH 2020 Shared Task ALTA Systems

  • Many presentations on semi-supervised and self-supervised learning and so on. Hopefully this area will grow even further.

  • Speech is not the frontier of modern AI, unfortunately. Most of the ideas first appear in NLP (memory-augmented transformers, noisy student augmentation, distillation, etc.) and are only then applied to speech. From that point of view, Interspeech is not the leading conference.

And the paper of the day:

Mon-3-7-1 Continual Learning in Automatic Speech Recognition

Very important research and a step in the right direction. Unfortunately, this work misses the core component that we implemented in Vosk, so it is not as efficient as it might be.