Voting, Ensembles and bringing AI to life

While people have recently been arguing about whether Google’s model is sentient, we must admit that another important property of living creatures has emerged in recent AI models - they started to have experience. They see new data and new tests, and improve over time. They teach each other: the K2 model learns from Hubert, Vosk learns from NVIDIA Nemo. And this experience is not really explainable. You cannot really understand what is going on inside the latest Nemo conformer model, but you can still learn from the predictions it makes.

There are important consequences of treating AI models as living creatures with experience; for example, it changes my view on using system combinations.

Long ago (in 2013) I thought that system combination was not a reasonable idea and didn’t bring anything practical. I always blamed Kaggle competitions for ensembling models. I thought that a single best model was possible, and that it was not such a great idea to have 10 projects dedicated to speech recognition, but recent developments proved me wrong (again).

The core math here is actually very simple. If you have a complex multidimensional space that you want to explore, you cannot make any reasonable loop over the points of the space: it is either too complex or too slow. The only way is to drop some random points and hope that their estimate shows something. You are not guaranteed to get the result, but often you can get a very reasonable estimate quickly. This is the idea of Monte-Carlo computation and also of many other randomized algorithms, which are abundant. Not always the best solution, but a solution with many nice properties.
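The contrast between looping over a space and dropping random points can be sketched in a few lines. This is a toy illustration, not from the original post: the function and dimensions are made up, and the point is only that a grid over 10 dimensions is hopeless while a few thousand random samples already give a usable average.

```python
import random

# Monte-Carlo sketch: estimate the average of a black-box function
# over a 10-dimensional unit cube. A grid with just 10 points per
# axis would need 10**10 evaluations; random sampling gets a decent
# answer with 10,000.
def f(point):
    # an arbitrary function standing in for "quality at this point"
    return sum(x * x for x in point)

random.seed(0)
dims, samples = 10, 10_000
values = [f([random.random() for _ in range(dims)]) for _ in range(samples)]
estimate = sum(values) / samples

# the true average is dims / 3 ≈ 3.33; the estimate should land close
print(round(estimate, 2))
```

The same trick underlies randomized search over model architectures or hyperparameters: you never enumerate the space, you sample it.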

First, you get an estimate quickly. Second, you can actually get not just a first-order estimate but also a second-order estimate (deviation, or confidence), which is very important for applications. And in many cases, like speech recognition, it is the only way to get proper confidence (see Uncertainty Estimation in Autoregressive Structured Prediction). Yeah, that diversity and democracy enforced in Western universities, sometimes it makes sense. Of course, sometimes it is not the best solution: a single model is right and the majority wrong. But in many cases it really makes sense to vote.
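The second-order estimate is simply the spread between ensemble members. A minimal sketch with made-up numbers (the scores below are hypothetical per-model confidences for one recognized word, not real system output):

```python
from statistics import mean, stdev

# Five ensemble members score the same word; one disagrees strongly.
model_scores = [0.91, 0.88, 0.93, 0.52, 0.90]

prediction = mean(model_scores)    # first-order estimate: the score itself
uncertainty = stdev(model_scores)  # second-order estimate: disagreement

# a large spread flags a word that no single model's softmax would flag
print(f"score={prediction:.2f} spread={uncertainty:.2f}")
```

A single model can only report its own (often overconfident) softmax score; the disagreement between independent models is information none of them has alone.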

If we treat speech recognition models this way, a few points emerge:

  • The ROVER algorithm and many other algorithms for system combination gain more and more importance.

  • Instead of the HF style of simply putting models on one page, we need to build tools to learn from each model and perform cross-confirmation and distillation.

  • Continuous evaluation of tools and models is also becoming relevant. Here it is worth mentioning the SpeechIO leaderboard, or the recent paper on evaluation of different cloud providers on Switchboard.

  • There will be no single best toolkit anymore; some models will be good for one domain, others for another. Unless we build a meta-toolkit (remember Turchin’s metasystems).
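To make the ROVER idea concrete, here is a deliberately simplified vote. The real ROVER first aligns the hypotheses into a word transition network by dynamic programming and can weight votes by confidence; this sketch assumes the hypotheses are already aligned slot by slot and takes a plain majority (the example sentences are invented):

```python
from collections import Counter

def rover_vote(hypotheses):
    """Majority vote over pre-aligned word sequences ("@" = deletion)."""
    combined = []
    for slot in zip(*hypotheses):
        word, _ = Counter(slot).most_common(1)[0]
        if word != "@":  # drop slots where the majority voted for nothing
            combined.append(word)
    return " ".join(combined)

hyps = [
    "the cat sat on the mat".split(),
    "the cat sat on a mat".split(),
    "a cat sat on the mat".split(),
]
result = rover_vote(hyps)
print(result)  # each model makes one error; the vote recovers the sentence
```

Each individual hypothesis above has one wrong word, yet the combination is error-free - exactly the property that makes system combination worth the extra compute.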

We consider this direction important for Vosk and for the industry in general.