ICASSP 2021 Part 1
This week ICASSP 2021 starts online. It comes a bit late in the year, and most people have already looked through the publications. Many papers have been on Arxiv for some time, some have not. Still, it is an interesting snapshot of recent research.
ICASSP is not purely about speech: a significant share of the papers covers classical signal processing, all those MIMO algorithms, and it also includes a lot of machine learning theory. That makes it even more interesting than Interspeech due to the interdisciplinary connections. There are many papers from Google, Apple, and Amazon which I find interesting. Many, many, many papers from Google.
Transformers seem to give great decoding results, but they are pretty slow to train and slow to decode. They use context effectively, but one needs a lot of compute. One approach is to apply them as a rescoring pass on top of a quick first decoding result. Another is to restrict the attention. Several papers talk about that, for example
The same long story is going on in NLP: Big Bird, Longformer, Reformer, Compressive Transformer, and so on. I suspect the search for the proper approach will last for many years ahead.
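To make the restricted-attention idea above concrete, here is a toy sliding-window attention sketch in numpy. It is a minimal illustration, not taken from any of the papers mentioned; the function name and window size are hypothetical:

```python
import numpy as np

def local_attention(q, k, v, window=2):
    """Toy single-head attention where each position may only attend to
    neighbours within `window` steps. Restricting attention like this
    cuts the quadratic cost of full self-attention. Hypothetical sketch."""
    t, d = q.shape
    scores = q @ k.T / np.sqrt(d)                     # (t, t) attention logits
    idx = np.arange(t)
    mask = np.abs(idx[:, None] - idx[None, :]) > window
    scores[mask] = -np.inf                            # block out-of-window pairs
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)     # row-wise softmax
    return weights @ v

rng = np.random.default_rng(0)
t, d = 8, 4
q, k, v = rng.normal(size=(3, t, d))
out = local_attention(q, k, v, window=2)
print(out.shape)  # prints (8, 4)
```

Big Bird and Longformer combine such local windows with a few global tokens; the sketch keeps only the local part.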
Due to speed constraints, many papers combine computations from various networks and try to reuse context effectively. It's a productive area where you can invent many, many possible combinations.
The grapheme/phoneme/wordpiece battle is still going on, and hybrid systems still have a point. Convolutional Dropout and Wordpiece Augmentation for End-to-End Speech Recognition by Hainan Xu, Yinghui Huang, Yun Zhu, Kartik Audhkhasi, Bhuvana Ramabhadran: a proper wordpiece split is important, and convolutional dropout is interesting by itself.
A whole section (actually, two sections) on speech-to-intent models that ignore transcriptions. Many good points, but I suspect there are drawbacks too: DO as I Mean, Not as I Say: Sequence Loss Training for Spoken Language Understanding Milind Rao; Pranav Dheram; Gautam Tiwari; Anirudh Raju; Jasha Droppo; Ariya Rastrow; Andreas Stolcke
Several papers on OOV problems:
Google admitted that convolution should go first in the Conformer. They seriously insisted it should be the other way around before. Was it disinformation? A Better and Faster end-to-end Model for Streaming ASR Bo Li; Anmol Gulati; Jiahui Yu; Tara N. Sainath; Chung-Cheng Chiu; Arun Narayanan; Shuo-Yiin Chang et al.
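The ordering question boils down to where the convolution module sits inside the block. Here is a toy sketch with stand-in modules (all the functions are simplified placeholders, not the real Conformer layers), just to show the swap:

```python
import numpy as np

def ffn(x):
    # Stand-in feed-forward module with a residual connection.
    return x + 0.5 * np.tanh(x)

def self_attn(x):
    # Stand-in self-attention: softmax over pairwise scores, residual add.
    scores = x @ x.T / np.sqrt(x.shape[1])
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return x + w @ x

def conv(x):
    # Stand-in convolution: 3-tap moving average along time, residual add.
    pad = np.pad(x, ((1, 1), (0, 0)))
    return x + (pad[:-2] + pad[1:-1] + pad[2:]) / 3.0

def conformer_block(x, conv_first=False):
    """Toy Conformer-style block; `conv_first` swaps the order of the
    convolution and self-attention modules between the two FFN halves."""
    x = ffn(x)
    modules = (conv, self_attn) if conv_first else (self_attn, conv)
    for m in modules:
        x = m(x)
    return ffn(x)

rng = np.random.default_rng(0)
x = 0.1 * rng.normal(size=(6, 4))
y = conformer_block(x, conv_first=True)
print(y.shape)  # prints (6, 4)
```

Since the modules are nonlinear, the two orderings generally produce different outputs, which is why the choice matters in practice.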
Graphical tools for data analysis are important in the context of life-long learning. Hopefully such tools will become more powerful and enable efficient co-learning between humans and AI. This direction is definitely worth attention.
Many interesting ideas in federated learning, mostly about proper information sharing between parties. Something to think about in the context of life-long learning too: who will the parties be, how to connect them, and what information are they willing to share? A paper from Google:
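For readers unfamiliar with the setup, the core of the classic federated averaging recipe is simple to sketch. This is a minimal server-side illustration of the general idea, not the method of the paper; real systems add client sampling, clipping, secure aggregation, and so on:

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Server-side federated averaging: combine per-client model weights,
    weighted by each client's local dataset size. Minimal sketch only --
    the clients never share their raw data, just the trained weights."""
    total = sum(client_sizes)
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))

# Two hypothetical clients with locally trained weight vectors.
clients = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]
sizes = [1, 3]  # the second client has 3x more data, so it counts 3x more
global_w = fedavg(clients, sizes)
print(global_w)  # prints [2.5 3.5]
```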
A paper we already covered on our channel. The problem with TTS quality is that the internal mel representation doesn't describe the voice perfectly; in particular, it is not good for noisy parts. In speech coding, long ago, people shifted from plain LPC codecs to MELP codecs with mixed speech/noise excitation. The same holds in TTS: until you generate noise sufficiently well, you won't see great quality. A solution would be to avoid the internal mel representation altogether, like in the paper below. But I still hope one can figure out a proper internal representation that describes both voice and noise and is also easier to train.