ICASSP 2021 Part 1

This week ICASSP 2021 starts online. It is a bit late in the year and everyone has already looked at the publications; many papers have been on arXiv for some time, some not. Still, it is an interesting snapshot of recent research.

ICASSP is not purely about speech; it also has a significant share of papers on classical signal processing, all those MIMO algorithms, and covers a lot of machine learning theory. That makes it even more interesting than Interspeech due to the inter-disciplinary connections. There are many papers from Google, Apple and Amazon which I find interesting. Many, many, many papers from Google.

Transformers seem to give great decoding results, but they are pretty slow to train and slow to decode. They seem to use the context effectively, but one needs a lot of compute. One approach is to apply them as a rescoring pass on top of a quick first-pass decoding result. Another is to restrict the attention. Several papers talk about that, for example:

Capturing Multi-Resolution Context by Dilated Self-Attention Niko Moritz; Takaaki Hori; Jonathan Le Roux


Focus on the Present: A Regularization Method for the ASR Source-Target Attention Layer Nanxin Chen; Piotr Żelasko; Jesús Villalba; Najim Dehak

It is a long story also going on in NLP: Big Bird, Longformer, Reformer, Compressive Transformer and so on. I suspect the search for a proper approach will last for many years ahead.
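
To make the restricted-attention idea concrete, here is a minimal sketch of the kind of sparse mask such models use: each query attends to a small local window plus a set of dilated (strided) positions. This is only an illustration in the spirit of the dilated self-attention paper above; the parameter names and the exact pattern are my own, not the paper's.

```python
def dilated_attention_mask(seq_len, window, dilation):
    """Boolean mask: mask[q][k] is True if query position q may attend
    to key position k.  Each query sees a local window of neighbours
    plus a sparse set of strided positions, so the number of attended
    positions grows roughly linearly instead of quadratically."""
    mask = [[False] * seq_len for _ in range(seq_len)]
    for q in range(seq_len):
        for k in range(seq_len):
            local = abs(q - k) <= window        # nearby fine-grained context
            strided = (q - k) % dilation == 0   # coarse long-range context
            mask[q][k] = local or strided
    return mask
```

With window=1 and dilation=4 on a length-8 sequence, a query attends to only a handful of positions instead of all 8, which is where the speedup comes from.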

Due to speed constraints many papers combine computations from various networks and try to reuse context effectively. It is a productive area where you can invent many, many possible combinations.

The grapheme/phoneme/wordpiece battle is still going on, and hybrid systems still have a point. A proper wordpiece split is important, and convolutional dropout is interesting by itself:

Convolutional Dropout and Wordpiece Augmentation for End-to-End Speech Recognition Hainan Xu; Yinghui Huang; Yun Zhu; Kartik Audhkhasi; Bhuvana Ramabhadran
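
To show why the wordpiece split matters, here is a toy sampler of alternative segmentations, in the spirit of subword regularization. This is a hedged sketch, not the paper's actual method; the vocabulary and drop probability below are made up for illustration.

```python
import random

def sample_wordpiece_split(word, vocab, drop_prob=0.3, rng=None):
    """Greedy longest-match segmentation where each multi-character
    match may be randomly rejected, so repeated calls yield different
    (but still valid) splits of the same word -- a simple augmentation."""
    rng = rng or random.Random(0)
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            piece = word[i:j]
            # single characters are always accepted as a fallback
            if piece in vocab and (j - i == 1 or rng.random() > drop_prob):
                pieces.append(piece)
                i = j
                break
        else:
            pieces.append(word[i])  # character not in vocab: emit as-is
            i += 1
    return pieces
```

Training on several different splits of the same word makes the model less dependent on one fixed tokenization.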


Tiny Transducer: A Highly-Efficient Speech Recognition Model on Edge Devices (phoneme based) Yuekai Zhang; Sining Sun; Long Ma


Phoneme Based Neural Transducer for Large Vocabulary Speech Recognition Wei Zhou; Simon Berger; Ralf Schlüter; Hermann Ney

A whole section (actually, two sections) on speech-to-intent models that skip transcriptions. Many good points, but I suspect there are drawbacks too:

Do as I Mean, Not as I Say: Sequence Loss Training for Spoken Language Understanding Milind Rao; Pranav Dheram; Gautam Tiwari; Anirudh Raju; Jasha Droppo; Ariya Rastrow; Andreas Stolcke

Several papers on OOV problems:

Using Synthetic Audio to Improve the Recognition of Out-of-Vocabulary Words in End-to-End ASR Systems Xianrui Zheng; Yulan Liu; Deniz Gunceler; Daniel Willett


A Comparison of Methods for OOV-Word Recognition on a New Public Dataset Rudolf A. Braun; Srikanth Madikeri; Petr Motlicek

Google admitted that convolution should go first in the Conformer block. Before, they seriously insisted it should be the other way around. Was it disinformation?

A Better and Faster end-to-end Model for Streaming ASR Bo Li; Anmol Gulati; Jiahui Yu; Tara N. Sainath; Chung-Cheng Chiu; Arun Narayanan; Shuo-Yiin Chang et al.
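
That module order matters at all in a residual block can be seen with a toy scalar sketch. This is not the actual Conformer architecture (half-step feed-forwards, normalization and real modules are omitted); the stand-in functions below are made up purely to show that swapping the order changes the result.

```python
def residual_block(modules, x):
    """Apply a sequence of residual modules: x <- x + m(x)."""
    for m in modules:
        x = x + m(x)
    return x

# toy stand-ins for the real modules, just to show order matters
attention = lambda x: x + 1.0    # pretend self-attention
convolution = lambda x: x * x    # pretend convolution (nonlinear)

conv_first = residual_block([convolution, attention], 2.0)
attn_first = residual_block([attention, convolution], 2.0)
```

Because the modules are nonlinear, convolution-first and attention-first compositions are genuinely different functions, so the ordering is a real design decision, not a cosmetic one.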

Graphical tools for data analysis are important in the context of life-long learning. Hopefully such tools will become more powerful and enable efficient co-learning between human and AI. This direction is definitely worth attention.

NeMo Speech Data Explorer: Interactive Analysis Tool for Speech Datasets Vitaly Lavrukhin; Evelina Bakhturina; Boris Ginsburg (NVIDIA)

Many interesting ideas in federated learning, mostly about proper information sharing between parties. Something to think about in the context of life-long learning too: who will the parties be, how to connect them, what information are they willing to share. A paper from Google:

Training Speech Recognition Models with Federated Learning: A Quality/Cost Framework Dhruv Guliani; Françoise Beaufays; Giovanni Motta

A paper we covered on our channel already. The problem with TTS quality is that the internal mel representation doesn't describe the voice perfectly; in particular, it is not good for noisy parts. In speech coding, people long ago shifted from plain LPC vocoders to MELP codecs with mixed speech/noise excitation. Same in TTS: until you generate noise sufficiently well you won't see great quality. One solution is to avoid the internal mel representation altogether, as in the paper below. But I still hope that one can figure out a proper internal representation that describes both voice and noise and also trains more easily.

Wave-Tacotron: Spectrogram-Free End-to-End Text-to-Speech Synthesis Ron J. Weiss; RJ Skerry-Ryan; Eric Battenberg; Soroosh Mariooryad; Diederik P. Kingma
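
The mixed-excitation idea from classic vocoders can be sketched in a few lines: blend a periodic pulse train (the voiced part) with noise (the unvoiced part) according to a voicing degree. The parameter names and the simple linear mix are illustrative, not any particular codec's design.

```python
import random

def mixed_excitation(n_samples, f0, sample_rate, voicing, rng=None):
    """Vocoder excitation signal: a periodic pulse train mixed with
    Gaussian noise.  voicing=1.0 gives pure pulses, voicing=0.0 pure
    noise; values in between model breathy or noisy speech segments."""
    rng = rng or random.Random(0)
    period = max(1, int(sample_rate / f0))  # samples per pitch period
    signal = []
    for n in range(n_samples):
        pulse = 1.0 if n % period == 0 else 0.0
        noise = rng.gauss(0.0, 0.3)
        signal.append(voicing * pulse + (1.0 - voicing) * noise)
    return signal
```

A purely mel-driven synthesizer has no explicit noise component like this, which is one intuition for why noisy segments come out worse.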