Written by Nickolay Shmyrev
Voting, Ensembles and bringing AI to life
While people have recently been arguing about whether Google’s model is sentient, we must admit
that another important property of living creatures has emerged in recent AI models: they have started to accumulate experience. They see new data and new tests,
and they improve over time. They teach each other: the K2 model learns from Hubert, and Vosk learns from NVIDIA Nemo. And this
experience is not really explainable. You cannot really understand what is going on inside the latest Nemo conformer model,
but you can still learn from the predictions it makes.
Treating AI models as living creatures with experience has important consequences. For example, it changes my view on system combination.
Long ago (in 2013) I thought that system combination was not a
reasonable idea and that it did not bring anything
practical.
I always blamed Kaggle competitions for ensembling models. I thought
that a single best model was possible. I used to think that it was not such a
great idea to have 10 projects dedicated to speech recognition, but recent
developments proved me wrong (again).
The core math here is actually very simple. If you have a complex
multidimensional space and you want to explore it, you cannot run any
reasonable loop over the points of the space. It is either too complex or too
slow. The only way is to drop some random points and hope that the
estimate computed from them shows something. You are not guaranteed to get
a result, but often you can get a very reasonable estimate quickly. This is the
idea behind Monte Carlo computation and behind the many other randomized
algorithms that exist. It is not always the best solution, but it is a
solution with many nice properties.
First, you get an estimate quickly. Second, you get not just
the first-order estimate but also a second-order estimate (deviation or
confidence), which is very important for applications. And in many cases,
like speech recognition, it is the only way to get proper confidence (see Uncertainty Estimation in Autoregressive Structured Prediction).
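The two properties above can be illustrated with a minimal sketch. It is not tied to any speech toolkit: the function `monte_carlo_estimate` and the test function `sum_of_squares` are hypothetical names chosen for this example, and the space is assumed to be the unit cube. The sketch drops random points, averages the function values (the first-order estimate) and reports the standard error of that average (the second-order estimate).

```python
import math
import random


def monte_carlo_estimate(f, dim, n_samples, seed=0):
    """Estimate the mean of f over the unit cube [0, 1)^dim.

    Returns (mean, std_error): the first-order estimate and the
    standard error of that estimate (a second-order estimate).
    """
    rng = random.Random(seed)
    total = 0.0
    total_sq = 0.0
    for _ in range(n_samples):
        # Drop one random point into the space and evaluate f there.
        point = [rng.random() for _ in range(dim)]
        value = f(point)
        total += value
        total_sq += value * value
    mean = total / n_samples
    # Population variance of the sampled values, clipped at zero
    # to guard against rounding errors.
    variance = max(total_sq / n_samples - mean * mean, 0.0)
    std_error = math.sqrt(variance / n_samples)
    return mean, std_error


def sum_of_squares(point):
    """Hypothetical test function; its exact mean over the cube is dim / 3."""
    return sum(x * x for x in point)


mean, std_error = monte_carlo_estimate(sum_of_squares, dim=10, n_samples=20000)
print(f"estimate = {mean:.3f} +/- {std_error:.3f} (exact value is {10 / 3:.3f})")
```

A full grid with only 10 points per axis would already need 10^10 evaluations in 10 dimensions, while the random sample here uses 20000 and also tells you how much to trust the answer.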
Yeah, that diversity and democracy enforced in Western universities
sometimes makes sense. Of course, sometimes it is not the best solution:
a single model is right and the majority is wrong. But in many cases it really
makes sense to vote.
If we treat speech recognition models this way, a few items emerge:
- The ROVER algorithm and many other algorithms for system combination
  become more and more important.
- Unlike the HF style, we should not simply put models on one page; we
  need to build tools to learn from each model and to perform cross-confirmation and
  distillation.
- Continuous evaluation of tools and models is also becoming relevant. Here
  it is worth mentioning the SpeechIO leaderboard, or the recent paper on the evaluation of different cloud
  providers on Switchboard.
- There will be no single best toolkit anymore: some models will be good for one domain, others for another. Unless we build a meta-toolkit (remember Turchin’s metasystems).
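To make the voting idea concrete, here is a heavily simplified sketch in the spirit of ROVER. It is not the ROVER algorithm itself: real ROVER first aligns the hypotheses into a word transition network with dynamic programming, while this sketch assumes the hypotheses are already aligned slot by slot, with an empty string marking a gap. The function name `vote` and the example sentences are made up for illustration.

```python
from collections import Counter


def vote(aligned_hypotheses):
    """Majority vote over pre-aligned word sequences.

    Each hypothesis is a list of words of the same length, where ""
    marks a gap in the alignment. Returns a list of (word, agreement)
    pairs, where agreement is the fraction of systems that voted for
    the winning word in that slot.
    """
    n_systems = len(aligned_hypotheses)
    result = []
    for slot in zip(*aligned_hypotheses):
        # Pick the most frequent entry in this slot; ties go to the
        # entry that appears first among the systems.
        word, count = Counter(slot).most_common(1)[0]
        if word:  # skip slots where the gap wins the vote
            result.append((word, count / n_systems))
    return result


# Three hypothetical recognizer outputs for the same utterance.
hypotheses = [
    ["the", "cat", "sat", "", "down"],
    ["the", "cat", "sat", "on", "down"],
    ["a", "cat", "sad", "", "down"],
]

combined = vote(hypotheses)
for word, agreement in combined:
    print(f"{word}\t{agreement:.2f}")
```

The agreement fraction doubles as a rough word-level confidence: words on which all systems agree are more trustworthy than words that won by a narrow margin.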
We consider this direction important for Vosk and for the industry in general.