Failure of SSL

The recent release of the FAIR omnilingual model, the LeCun news, and the active use of wav2vec in “semantics” research made me think again about SSL in speech.

This is a fundamental issue for speech technology as a science. The premise of SSL is that one can train a stable baseline multilingual model from acoustics alone and then finetune it for many languages. However, all our experience with end-to-end models has demonstrated that effective and robust speech recognition and understanding require tight integration between the acoustic layers and the understanding (language model) layers. Whisper is a case in point: its decoder is essentially a language model trained jointly with the encoder.
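
To make that premise concrete, here is what the standard recipe looks like in practice: a minimal sketch using the Hugging Face transformers library, where the checkpoint name is real but the vocabulary size and the freezing choice are illustrative.

```python
# The SSL premise in code: take an acoustics-only pretrained encoder and
# attach a small CTC head for one target language. Assumes the Hugging Face
# transformers library; vocab_size is a hypothetical target alphabet.
from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-xls-r-300m",  # multilingual SSL checkpoint, acoustics only
    vocab_size=34,                   # hypothetical alphabet for the target language
    ctc_loss_reduction="mean",
)
model.freeze_feature_encoder()       # the usual "light finetune" recipe

# Note what is missing: the CTC head is a per-frame classifier, so all
# language "understanding" has to come from the acoustic representations.
```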

This means that you cannot sort things out from the acoustics alone, and most multilingual models fail for exactly that reason. Despite claims of supporting 1000 languages, real support for language specifics is limited to the 3-5 major ones, and most of the others are not well supported. For example, the omnilingual 300M model performs really badly on Swedish, a fairly well-resourced language. Whisper is not good at Italian; language-specific conformers are much better. Essentially, there is no well-defined universal phonetic alphabet (IPA notwithstanding): allophones are highly context-dependent, and the relevant context is large, reaching all the way up to the semantic layers.

On top of that, many multilingual models (wav2vec, HuBERT, XLS-R, MMS, Seamless, omnilingual) are not trained for robustness (noise augmentation, telephony augmentation, etc.). Only WavLM has that sort of multi-environment training, as Jinyu Li’s recent talk explained; the others usually focus on clean, uniform inputs like FLEURS.
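
For contrast, here is the kind of augmentation pipeline such robustness training would involve; a hedged sketch assuming a recent torchaudio, with the SNR range and cutoff frequencies chosen purely for illustration.

```python
# Robustness augmentation of the kind most SSL checkpoints skip:
# additive noise at a random SNR plus a crude telephony channel simulation.
# Assumes torchaudio >= 0.13; all parameters are illustrative.
import torch
import torchaudio.functional as F

def augment(wave: torch.Tensor, noise: torch.Tensor, sr: int = 16_000) -> torch.Tensor:
    """wave, noise: (batch, samples); noise must be at least as long as wave."""
    # mix in background noise at a random SNR between 5 and 20 dB
    snr = torch.randint(5, 21, (wave.shape[0],)).float()
    wave = F.add_noise(wave, noise[..., : wave.shape[-1]], snr)
    # rough telephony channel: 300-3400 Hz band plus an 8 kHz round trip
    wave = F.highpass_biquad(wave, sr, cutoff_freq=300.0)
    wave = F.lowpass_biquad(wave, sr, cutoff_freq=3400.0)
    return F.resample(F.resample(wave, sr, 8_000), 8_000, sr)
```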

As a result, you need to finetune wav2vec A LOT to get good accuracy, essentially training it from scratch (kudos to our friend Anton Nekrasov, who mentions this frequently).
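
In code, that "finetune A LOT" regime looks less like adapting a head and more like a full training run. A hedged sketch with illustrative hyperparameters, using the same checkpoint as above:

```python
# Full-parameter finetuning: nothing frozen, feature encoder included.
# Checkpoint, vocab size, and learning rate are all illustrative.
import torch
from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-xls-r-300m",
    vocab_size=34,                   # hypothetical target alphabet
)
# No freeze_feature_encoder() call: every parameter trains.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.0f}M trainable parameters")
```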

My estimate is that models of modern size can effectively learn 3-4 languages at once, not 1000. To learn 1000 languages, one would need 100B-parameter models, and those models would have to have good speech understanding of every language they claim to support.

Given that, one can think about training methods for multilingual models that are modular enough to support many languages well, and that at least provide a means to inject the language model quickly. For example, traditional semi-supervised learning, which integrates some sort of language model into the training process, actually makes much more sense.
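
One quick way to inject a language model without touching the acoustic model is shallow fusion during CTC beam search. A sketch assuming pyctcdecode with a KenLM n-gram model, where the label set and the LM path are hypothetical:

```python
# Shallow fusion: an external per-language n-gram LM rescoring a CTC beam
# search, with the acoustic model left untouched. Assumes pyctcdecode and
# kenlm are installed; labels and the .arpa path are hypothetical.
import numpy as np
from pyctcdecode import build_ctcdecoder

labels = ["", "a", "b", "c", " "]           # toy CTC vocabulary, blank first
decoder = build_ctcdecoder(
    labels,
    kenlm_model_path="lm/swedish.arpa",     # hypothetical per-language LM
    alpha=0.5,                              # LM weight
    beta=1.0,                               # word insertion bonus
)

logits = np.random.rand(200, len(labels))   # stand-in for per-frame CTC logits
print(decoder.decode(logits))
```

The appeal is exactly the modularity argued for above: the shared acoustic model stays fixed, while each language gets its own cheap, swappable LM.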