Failure of SSL
Written by Nickolay Shmyrev
The recent release of the FAIR omnilingual model, the LeCun news, and the active use of wav2vec in “semantics” made me think again about SSL in speech.
This is going to be a brick in speech technology science. The premise of SSL is that one can train a stable baseline multilingual model from acoustics alone and then finetune it for many languages. However, all our experience with end-to-end models has demonstrated that effective and robust speech recognition and understanding requires tight integration between the acoustic and understanding (language model) layers. Just like in Whisper.
This means that you cannot sort things out from the acoustics alone. And most multilingual models fail for exactly this reason. Despite claims to support 1000 languages, real support for language specifics is limited to the 3-5 major ones, and most of the others are not well supported. For example, omnilingual 300m performs really badly on Swedish, a pretty well-resourced language. Whisper is not good for Italian; language-specific conformers are much better. Essentially, there is no well-defined universal phonetic alphabet (like IPA): allophones are highly context-dependent, and the context is large (up to the semantic layers).
On top of that, many multilingual models (wav2vec, hubert, xls-r, mms, seamless, omnilingual) are not trained for robustness (noise augmentation, telephony augmentation, etc.). Only WavLM has that sort of multi-environment training, as Jinyu Li’s recent talk explained; the others usually focus on simple, uniform inputs like Fleurs.
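
To make the robustness point concrete, here is a minimal sketch of the kind of on-the-fly noise and telephony augmentation such training would involve, written with torchaudio. The file names, SNR range, and probabilities are placeholders for illustration, not the actual recipe of any of the models above.

```python
# Minimal sketch of on-the-fly noise and telephony-style augmentation.
# Assumes mono 16 kHz audio; paths, SNR range and probabilities are placeholders.
import random
import torch
import torchaudio
import torchaudio.functional as F

SAMPLE_RATE = 16_000

def add_background_noise(speech: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
    """Mix a random noise segment into the utterance at a random SNR (0-20 dB)."""
    if noise.shape[-1] < speech.shape[-1]:
        noise = noise.repeat(1, speech.shape[-1] // noise.shape[-1] + 1)
    offset = random.randint(0, noise.shape[-1] - speech.shape[-1])
    noise = noise[..., offset:offset + speech.shape[-1]]
    snr_db = torch.tensor([random.uniform(0.0, 20.0)])
    return F.add_noise(speech, noise, snr_db)

def simulate_telephony(speech: torch.Tensor) -> torch.Tensor:
    """Crudely simulate a narrowband channel: 300-3400 Hz band, 8 kHz, back to 16 kHz."""
    x = F.highpass_biquad(speech, SAMPLE_RATE, cutoff_freq=300.0)
    x = F.lowpass_biquad(x, SAMPLE_RATE, cutoff_freq=3400.0)
    x = F.resample(x, SAMPLE_RATE, 8_000)
    return F.resample(x, 8_000, SAMPLE_RATE)

def augment(speech: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
    """Apply each corruption with some probability, as a dataloader transform would."""
    if random.random() < 0.5:
        speech = add_background_noise(speech, noise)
    if random.random() < 0.3:
        speech = simulate_telephony(speech)
    return speech

if __name__ == "__main__":
    speech, sr = torchaudio.load("utterance.wav")        # placeholder input
    noise, _ = torchaudio.load("background_noise.wav")   # placeholder noise corpus
    speech = F.resample(speech, sr, SAMPLE_RATE)
    torchaudio.save("augmented.wav", augment(speech, noise), SAMPLE_RATE)
```

A real multi-environment setup mixes far more conditions (reverberation, overlapping speakers, codecs), but the structure is the same: corrupt every batch on the fly rather than training on clean, uniform audio.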
As a result, you need to finetune wav2vec A LOT to get good accuracy, essentially training it from scratch (kudos to our friend Anton Nekrasov, who mentions that frequently).
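
To give a sense of what that amounts to in code, here is a small sketch assuming the HuggingFace transformers API: a multilingual SSL checkpoint is loaded for CTC and everything except the convolutional frontend is left trainable. The checkpoint name and vocabulary size are just placeholders.

```python
# Sketch of heavy wav2vec 2.0 fine-tuning: only the CNN waveform frontend stays
# frozen, every transformer layer and the CTC head are trained. Checkpoint name
# and vocab_size are placeholders.
from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-xls-r-300m",   # a multilingual SSL checkpoint
    vocab_size=64,                    # placeholder: size of your target alphabet
    ctc_loss_reduction="mean",
)

# Freeze only the convolutional feature encoder; everything above it is updated,
# which in practice is close to training the model over again.
model.freeze_feature_encoder()

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"training {trainable / 1e6:.0f}M of {total / 1e6:.0f}M parameters")
```

Printing the counts makes the point: almost all of the 300M parameters end up in the update, so the “pretrained” part mostly buys an initialization, not a finished acoustic model.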
My estimate is that models of modern size can effectively learn 3-4 languages at once, not 1000. To learn 1000 languages one would need 100B-parameter models, and those models would have to have good speech understanding of every language they claim to support.
Given that, one can think about methods for training multilingual models that are modular enough to support many languages well, and that at least have some means to inject a language model quickly. For example, traditional semi-supervised learning that integrates some sort of language model into the learning process actually makes much more sense.
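
As one toy illustration of injecting a language model quickly, here is a sketch of shallow fusion during beam search: the acoustic score and an external LM score are simply added, so the LM can be swapped per language without retraining the acoustic model. The acoustic table, the bigram LM, and the fusion weight below are all made up for the example.

```python
# Toy shallow fusion: score = acoustic_logprob + lm_weight * lm_logprob.
# Both "models" are fixed tables standing in for real networks.
import math

VOCAB = ["a", "b", "c"]

# Per-step acoustic log-probabilities for a 3-step toy utterance.
ACOUSTIC_LOGPROBS = [
    [math.log(p) for p in (0.6, 0.3, 0.1)],
    [math.log(p) for p in (0.3, 0.5, 0.2)],
    [math.log(p) for p in (0.2, 0.2, 0.6)],
]

# Toy bigram LM: log P(next | previous), with a uniform floor for unseen pairs.
BIGRAM = {("<s>", "a"): math.log(0.5), ("a", "b"): math.log(0.7), ("b", "c"): math.log(0.6)}
LM_FLOOR = math.log(0.1)

def lm_logprob(prev: str, nxt: str) -> float:
    return BIGRAM.get((prev, nxt), LM_FLOOR)

def beam_search(beam_size: int = 2, lm_weight: float = 0.5):
    """Beam search where each hypothesis is scored by the fused score."""
    beams = [(["<s>"], 0.0)]  # (token history, cumulative score)
    for step_logprobs in ACOUSTIC_LOGPROBS:
        candidates = []
        for history, score in beams:
            for token, am_lp in zip(VOCAB, step_logprobs):
                fused = score + am_lp + lm_weight * lm_logprob(history[-1], token)
                candidates.append((history + [token], fused))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams

if __name__ == "__main__":
    for history, score in beam_search():
        print(" ".join(history[1:]), f"{score:.2f}")
```

The appeal of this kind of modularity is exactly the point above: the acoustic part can stay shared across languages, while the language-specific knowledge lives in a component you can replace in minutes rather than retrain for weeks.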