ICASSP 2021 Part 1
Written by Nickolay Shmyrev
ICASSP 2021 starts online this week. It comes a bit late in the year, and
everyone has already looked at the publications; many papers have been on
arXiv for some time, some have not. Still, it is an interesting snapshot of
recent research.
ICASSP is not purely about speech: a significant share of the papers covers
standard signal processing, all those MIMO algorithms, and it also covers a
lot of machine learning theory. That makes it even more interesting than
Interspeech thanks to the interdisciplinary connections. There are many
papers from Google, Apple, and Amazon which I find interesting. Many, many,
many papers from Google.
Transformers seem to give great decoding results, but they are pretty slow
to train and slow to decode. They use the context effectively, but one
needs a lot of compute. One approach is to apply them as a rescoring pass
on top of a quick decoding result. Another is to restrict attention.
Several papers talk about that, for example
Capturing Multi-Resolution Context by Dilated Self-Attention
Niko Moritz; Takaaki Hori; Jonathan Le Roux
also
Focus on the Present: A Regularization Method for the ASR Source-Target Attention Layer
Nanxin Chen; Piotr Żelasko; Jesús Villalba; Najim Dehak
It is a long story also going on in NLP: Big Bird, Longformer, Reformer,
Compressive Transformer and so on. I suspect the search for a proper
approach will last for many years ahead.
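To make the idea of restricting attention concrete, here is a minimal sketch of a mask that lets each frame attend to a dense local window plus a sparse, strided set of more distant frames. This is a generic illustration, not the exact mechanism of the papers above (the dilated self-attention paper, for instance, summarizes distant frames rather than just masking them), and the window sizes are arbitrary:

```python
import torch

def restricted_attention_mask(seq_len, local_width=16, dilation=4, long_range=64):
    """Boolean mask for self-attention: True marks pairs that are NOT
    allowed to attend. Each frame sees a dense window of +/- local_width
    frames plus every dilation-th frame up to +/- long_range frames away."""
    idx = torch.arange(seq_len)
    dist = (idx[None, :] - idx[:, None]).abs()            # |i - j| for all pairs
    local = dist <= local_width                            # dense near-diagonal band
    dilated = (dist <= long_range) & (dist % dilation == 0)  # sparse long-range links
    return ~(local | dilated)

# Can be passed as attn_mask to torch.nn.MultiheadAttention so most of the
# quadratic attention matrix is masked out for long utterances.
mask = restricted_attention_mask(seq_len=200)
print(mask.shape, (~mask).float().mean())  # fraction of allowed pairs
```

Of course, masking a dense attention matrix does not save compute by itself; the point of these papers is to compute only the allowed entries.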
Due to speed constraints, many papers combine computations from several
networks and try to reuse context effectively. It is productive work where
you can invent many, many possible combinations.
The grapheme/phoneme/wordpiece battle is still going on, and hybrid systems still have a point.
Convolutional Dropout and Wordpiece Augmentation for End-to-End Speech Recognition
Hainan Xu, Yinghui Huang, Yun Zhu, Kartik Audhkhasi, Bhuvana Ramabhadran
A proper wordpiece split is important, and the convolutional dropout is interesting by itself (a small sketch of wordpiece sampling follows this list).
and
Tiny Transducer: A Highly-Efficient Speech Recognition Model on Edge Devices (phoneme based)
Yuekai Zhang; Sining Sun; Long Ma
and
Phoneme Based Neural Transducer for Large Vocabulary Speech Recognition
Wei Zhou; Simon Berger; Ralf Schlüter; Hermann Ney
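Wordpiece augmentation essentially means showing the model alternative segmentations of the same transcript. I am not claiming this is exactly what the paper does, but subword regularization in SentencePiece gives a similar effect. A minimal sketch, assuming a unigram SentencePiece model trained on your transcripts (the model filename here is hypothetical):

```python
import sentencepiece as spm

# Load a unigram SentencePiece model (hypothetical filename).
sp = spm.SentencePieceProcessor(model_file="wordpieces_unigram.model")

text = "speech recognition with wordpieces"

# Deterministic best segmentation, as used at decoding time.
print(sp.encode(text, out_type=str))

# Sampled segmentations: each call may split the words differently,
# which acts as augmentation on the label side during training.
for _ in range(3):
    print(sp.encode(text, out_type=str,
                    enable_sampling=True, alpha=0.1, nbest_size=-1))
```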
A whole section (actually, two sections) on speech-to-intent models that
ignore transcriptions. Many good points, but I suspect there are drawbacks
too:
Do as I Mean, Not as I Say: Sequence Loss Training for Spoken Language Understanding
Milind Rao; Pranav Dheram; Gautam Tiwari; Anirudh Raju; Jasha Droppo; Ariya Rastrow; Andreas Stolcke
Several papers on OOV problems:
Using Synthetic Audio to Improve the Recognition of Out-of-Vocabulary Words in End-to-End Asr Systems
Xianrui Zheng; Yulan Liu; Deniz Gunceler; Daniel Willett
and
A Comparison of Methods for OOV-Word Recognition on a New Public Dataset
Rudolf A. Braun; Srikanth Madikeri; Petr Motlicek
Google admitted that convolution should go first in the Conformer block. They seriously insisted it should
be the other way around before. Was it disinformation?
A Better and Faster end-to-end Model for Streaming ASR
Bo Li; Anmol Gulati; Jiahui Yu; Tara N. Sainath; Chung-Cheng Chiu; Arun Narayanan; Shuo-Yiin Chang et al.
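For reference, a Conformer block sandwiches self-attention and a depthwise convolution module between two half-step feed-forward modules; the change discussed above concerns the order of the middle two. Below is a minimal sketch of a block with the convolution module applied before self-attention, assuming PyTorch. Layer sizes are illustrative and the conv module is simplified (no batch norm, no relative positional encoding), so this is not the paper's exact recipe:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvFirstConformerBlock(nn.Module):
    """Conformer-style block with the convolution module placed before
    self-attention. Simplified sketch with illustrative sizes."""

    def __init__(self, d_model=256, n_heads=4, kernel_size=15, ff_mult=4):
        super().__init__()
        self.ff1 = self._ff(d_model, ff_mult)
        # Convolution module: pointwise (GLU) -> depthwise -> pointwise.
        self.conv_norm = nn.LayerNorm(d_model)
        self.pw1 = nn.Conv1d(d_model, 2 * d_model, 1)
        self.dw = nn.Conv1d(d_model, d_model, kernel_size,
                            padding=kernel_size // 2, groups=d_model)
        self.pw2 = nn.Conv1d(d_model, d_model, 1)
        # Self-attention module.
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff2 = self._ff(d_model, ff_mult)
        self.out_norm = nn.LayerNorm(d_model)

    @staticmethod
    def _ff(d_model, mult):
        return nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Linear(d_model, mult * d_model), nn.SiLU(),
            nn.Linear(mult * d_model, d_model),
        )

    def forward(self, x):                        # x: (batch, time, d_model)
        x = x + 0.5 * self.ff1(x)
        # Convolution module first ...
        y = self.conv_norm(x).transpose(1, 2)    # (batch, d_model, time)
        y = F.glu(self.pw1(y), dim=1)
        y = F.silu(self.dw(y))
        y = self.pw2(y).transpose(1, 2)
        x = x + y
        # ... then self-attention.
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]
        x = x + 0.5 * self.ff2(x)
        return self.out_norm(x)

# Smoke test: (batch=2, time=50, features=256) in, same shape out.
print(ConvFirstConformerBlock()(torch.randn(2, 50, 256)).shape)
```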
Graphical tools for data analysis are important in the context of life-long
learning. Hopefully such tools will become more powerful and enable
efficient co-learning between humans and AI. This direction is definitely
worth attention.
Nemo Speech Data Explorer: Interactive Analysis Tool for Speech Datasets
Vitaly Lavrukhin; Evelina Bakhturina; Boris Ginsburg (NVIDIA)
Many interesting ideas in federated learning, mostly about proper
information sharing between parties. Something to think about in the
context of life-long learning too: who will the parties be, how will they
be connected, and what information are they willing to share? A paper from
Google:
Training Speech Recognition Models with Federated Learning: A Quality/Cost Framework
Dhruv Guliani; Françoise Beaufays; Giovanni Motta
A paper we covered on our channel already. The problem with TTS quality is
that the internal mel representation does not describe the voice
perfectly; in particular, it is not good for the noisy parts. In speech
coding, people long ago moved from plain pulse/noise excitation in LPC
vocoders to MELP codecs with mixed speech/noise excitation. The same holds
in TTS: until you generate noise sufficiently well, you won’t see great
quality. A solution would be to avoid the internal mel representation
altogether, as in the paper below. But I still hope that one can figure
out a proper internal representation that describes both voice and noise
and is also easier to train.
Wave-Tacotron: Spectrogram-Free End-to-End Text-to-Speech Synthesis
Ron J. Weiss; RJ Skerry-Ryan; Eric Battenberg; Soroosh Mariooryad; Diederik P. Kingma