As dataset sizes grow beyond 10 thousand hours (Gigaspeech), the
compute requirements for speech recognition research grow as well. Any
research, even simple architecture testing, gets harder and harder:
with a single training run taking 2 weeks you cannot run
proper experiments. One solution would be to increase hardware
capabilities, but hardware is pretty expensive these days and sometimes
not even available. Even disk storage costs skyrocketed this year, not
to mention GPU prices. Small companies and open source projects can only
watch this race.
But there is one direction of research where even a small project like
Vosk can do interesting things - active learning. The idea is that we
don't really need to train on all those thousands of hours of data; we
can select a subset and train much faster, hopefully even an order of
magnitude faster. With just 1/10th of the dataset and a simple selection
method you can reach the full-dataset accuracy level. Something to try
in the near future.
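As a toy illustration (not from any particular paper below), the simplest selection scheme is just sorting utterances by model confidence and keeping a budget of the least confident ones. The scores and the budget here are made up:

```python
import numpy as np

def select_least_confident(confidences, budget):
    """Return indices of the `budget` utterances the model is least sure about."""
    order = np.argsort(confidences)  # ascending: least confident first
    return order[:budget]

# toy example: per-utterance confidences for 10 utterances, keep 3
conf = np.array([0.9, 0.2, 0.8, 0.5, 0.99, 0.1, 0.7, 0.6, 0.3, 0.95])
picked = select_least_confident(conf, budget=3)
print(sorted(picked.tolist()))  # indices of the three lowest-confidence utterances
```

Everything that follows is essentially variations on where the score comes from and how diversity is mixed in.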
Common note: most of the older papers assume the big dataset is
untranscribed and that transcribing (labelling) it is expensive. In the
recent setups the big dataset is transcribed (though not perfectly); we
just can't afford to train the model on all of it. It means the
algorithms will shift slightly to a different regime where we might
still process the big data, we just have to do it very fast.
Another common note: semi-supervised learning wins over active learning
in all the works, by a big margin. It seems it is still beneficial
to process the whole set at least once instead of forcing focus on a
smaller subset. No method could guarantee full accuracy with even 30% of
the samples; they are all suboptimal. The exception is the paper above,
but I do not think it applies to real-life data. Maybe the best idea
is to train on a subset and then finetune once on the whole set.
Another note: the lack of a proper evaluation set definitely makes
apples-to-apples comparison much harder. Some of the papers have a
different operating point (they try to select 1000 lines for labelling
from 10h untranscribed). It is hard to guess what happens if you select
more data.
An Analysis of Active Learning Strategies for Sequence Labeling Tasks - Not
really on speech, but a pretty fundamental paper evaluating many methods: least confidence, margin, token entropy, total token entropy,
sequence entropy, N-best sequence entropy, vote entropy, Kullback-Leibler, sequence vote entropy, sequence Kullback-Leibler, expected gradient length,
information density, Fisher information ratio. Information density wins in tests. Sequence vote entropy is also good.
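For reference, the three simplest measures from that list can be sketched in a few lines of numpy over a frame-by-label posterior matrix. The numbers below are made up, not from the paper:

```python
import numpy as np

def least_confidence(post):
    """1 minus the top posterior per frame, averaged; higher = more uncertain."""
    return 1.0 - post.max(axis=-1).mean()

def margin(post):
    """Average gap between the top two posteriors; small margin = uncertain."""
    top2 = np.sort(post, axis=-1)[..., -2:]
    return (top2[..., 1] - top2[..., 0]).mean()

def token_entropy(post, eps=1e-12):
    """Average per-frame entropy of the posterior distribution."""
    return -(post * np.log(post + eps)).sum(axis=-1).mean()

# toy input: 3 frames x 4 labels of softmax posteriors
post = np.array([[0.70, 0.10, 0.10, 0.10],
                 [0.40, 0.30, 0.20, 0.10],
                 [0.25, 0.25, 0.25, 0.25]])
print(least_confidence(post), margin(post), token_entropy(post))
```

The sequence-level variants replace the per-frame posteriors with probabilities of whole hypotheses from an N-best list or lattice.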
Active and Unsupervised Learning for Automatic Speech Recognition (2003) - Least confidence from model predictions. Very small dataset size (5000 lines).
Active Learning For Automatic Speech Recognition (2002) - Same authors, same method.
Active Learning: Theory and Applications to Automatic Speech Recognition (2005) - Same authors. Journal paper.
Active Learning for LF-MMI Trained Neural Networks in ASR (2018) - Least confidence from the model + confidence from multi-model agreement (voting confidence). Also cited as the HNN (Heterogeneous Neural Network) approach.
A Dropout-Based Single Model Committee Approach for Active Learning in ASR (2019)
Same approach overall. Different method to create voting models.
Speech modeling based on committee-based active learning
Committee-based confidence wins over simple single-model confidence. Vote entropy.
Acoustic model training using committee-based active and semi-supervised learning for speech recognition
Committee-based confidence wins over simple single-model confidence (vote entropy). An extension of the previous paper. Semi-supervised is much better than active.
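A minimal sketch of committee-based vote entropy, assuming we already have one decoded transcript per committee member (the transcripts below are toy data, not from these papers):

```python
import math
from collections import Counter

def vote_entropy(hypotheses):
    """Entropy of the committee's votes over full hypotheses.
    `hypotheses` holds one transcript string per committee member;
    high entropy means the members disagree, i.e. an informative utterance."""
    n = len(hypotheses)
    counts = Counter(hypotheses)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# a committee of 4 recognizers on one utterance (toy transcripts)
agree = ["the cat sat"] * 4
split = ["the cat sat", "the cat sat", "a cat sat", "the bat sat"]
print(vote_entropy(agree))  # full agreement: minimum entropy
print(vote_entropy(split))  # disagreement: worth sending for labelling
```

In practice the vote is usually taken per word over aligned hypotheses rather than over whole transcripts, but the entropy computation is the same.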
Active and Semi-Supervised Learning in ASR: Benefits on the Acoustic and Language Models
Amazon paper. Least-confidence selection. States that the confidence measure doesn't matter; even simple ones work. Detailed description of experiments. Utterance length filtering didn't work.
A generative model like an LM doesn't improve from confidence-based sampling. No mention of diversity sampling.
Active Learning Methods for Low Resource End-To-End Speech Recognition (2019)
Least confidence assisted with clustered i-vector max entropy. Better
results than simple least confidence.
Active learning for speech recognition: the power of gradients - Expected gradient length calculated with the model is a great measure, uncorrelated with
confidence. Wins in experiments, although not significantly.
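The idea can be sketched on a toy linear classifier: take the expectation of the gradient norm over the model's own label posterior. This is a big simplification of what the paper does with an end-to-end ASR model; all names and data below are made up:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def expected_gradient_length(W, x):
    """Expected L2 norm of the cross-entropy gradient wrt W, with the
    expectation taken over the model's own posterior (toy linear model)."""
    p = softmax(W @ x)
    egl = 0.0
    for y, py in enumerate(p):
        # gradient of cross-entropy for a linear classifier: (p - onehot(y)) x^T
        g = np.outer(p - np.eye(len(p))[y], x)
        egl += py * np.linalg.norm(g)
    return egl

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))  # 3 classes, 5-dim features (toy model)
x = rng.normal(size=5)       # one candidate sample
print(expected_gradient_length(W, x))
```

Samples with a large expected gradient are the ones that would move the model the most if labelled, which is why the measure ends up decorrelated from plain confidence.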
Gradient-based Active Learning Query Strategy for End-to-end Speech Recognition (2019)
Extends expected gradient length approach. Trains a special network to predict true gradient from expected gradient and entropy (GAN style). The evaluation
is not descriptive, only 10000 utterances selected.
Loss Prediction: End-to-End Active Learning Approach For Speech Recognition by Jian Luo, Jianzong Wang, Ning Cheng, Jing Xiao (2021)
GAN-like idea of creating an extra network to predict sample informativeness. Doesn't consider representativeness. The results
are not significantly better than random selection; maybe the network is not trained properly. 50% random selection is much
better than 20% selection with loss prediction.
Generative Adversarial Active Learning (2017) - GANs are popular in active learning in vision.
LMC-SMCA: A New Active Learning Method in ASR by XiuSong Sun et al. (2021)
Comittee-based confidence selection with LM-based entropy diversity selection. Both work well together.
Training data selection based on context dependent state matching (2014)
iVector-based Acoustic Data Selection (2013)
A big data approach to acoustic model training corpus selection (2014)
CD-state KL divergence and i-vector diversity, plus top confidence (notably, not least confidence). A set of papers from Google.
Kullback-Leibler divergence-based ASR training data selection - Kullback-Leibler divergence with n-grams. Very small datasets.
Should be better for generative models.
Active learning and semi-supervised learning for speech recognition: A unified framework using the global entropy reduction maximization
A Microsoft paper which introduces global entropy reduction maximization
(GERM), a greedy selection algorithm. The idea is to select lattices
with the most diverse paths covering the paths in the other lattices.
Demonstrated to be better than confidence-based selection, but just 2%
relative. The authors observed that when only 1% of utterances are to be
selected, most utterances selected by the confidence-based approach are
noise and garbage utterances that have extremely low confidence but
little value for improving the overall system, while only a few such
utterances are selected by the GERM algorithm. Seems reasonable.
Maximizing global entropy reduction for active learning in speech recognition (2009)
From same authors, conference paper.
Hierarchical Sampling for Active Learning
Sampling bias. See also: The Two Faces of Active Learning. A nice idea
but probably not directly applicable to speech, at least not
straightforwardly.
Active Learning based data selection for limited resource STT and KWS (2015)
HMM state entropy maximization. Similar to Rudnicky paper but for HMM states instead of words.
Parting with Illusions about Deep Active Learning - As usual, not everything is straightforward.
Most methods perform similarly and advanced methods have no effect. A vision paper though. Semi-supervised learning beats
active learning in vision. As in many other papers, the advantages of active learning are not straightforward.
"Modern semi-supervised learning algorithms applied in the conventional
active learning setting show a higher relative performance increase than
any of the active learning methods proposed in the recent years.
State-of-the-art active learning approaches often fail to outperform
simple random sampling, especially when the labeling budget is small - a
setting critically important for many real-world applications." Hmm…
N-best entropy based data selection for acoustic modeling (2012)
Good understanding of informativeness/representativeness. Informativeness is calculated as N-best entropy; representativeness
as KL divergence of phone distributions from the same N-best lattice.
The representativeness term had a negative gain in the small setup but wins with a larger amount of selected data (important).
Testing was done with GMM models; might be better with DNNs.
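The representativeness side can be sketched as a plain KL divergence between a candidate utterance's phone distribution and the corpus-wide one (the distributions below are toy numbers, not from the paper):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions, e.g. phone frequencies.
    A small epsilon keeps zero counts from blowing up the log."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float((p * np.log(p / q)).sum())

# toy phone distributions: candidate utterances vs. the whole corpus
corpus  = [0.30, 0.25, 0.20, 0.15, 0.10]
typical = [0.28, 0.27, 0.19, 0.16, 0.10]  # close to corpus: representative
odd     = [0.05, 0.05, 0.05, 0.05, 0.80]  # skewed: not representative
print(kl_divergence(typical, corpus))  # small divergence
print(kl_divergence(odd, corpus))      # large divergence
```

A selection criterion would then combine low divergence (representative) with high N-best entropy (informative).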
A Two-Stage Method for Active Learning of Statistical Grammars (2005)
Accompanying NLP paper
Submodular subset selection for large-scale speech training data (2014)
Replaces the greedy optimization algorithm of the Rudnicky approach with submodular function optimization. Reaches a better optimum
this way at the cost of more computation. The exact submodular optimization method is not well described though.
A Convex Optimization Framework for Active Learning (2013)
Not really on speech; a vision paper. Replaces greedy optimization in the task of selecting informative samples with convex optimization. Much faster and a better optimum.
Toward Optimal Active Learning through Sampling Estimation of Error Reduction
NLP paper but still interesting idea. For every new sample we take every possible label and retrain the model quickly. We assume we can do that
Then estimate how useful the sample would be. Well, it requires a special classifier. But at least it reaches good quality quickly.
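A toy sketch of that sampling-estimation idea, with a nearest-mean classifier standing in for the "special classifier" that can be retrained quickly. Everything below (names, data) is made up for illustration:

```python
import numpy as np

def retrain(X, y):
    """Toy 'fast retrain': a nearest-mean classifier (one centroid per class)."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def posteriors(model, X):
    """Soft posteriors from negative distances to the class centroids."""
    classes = sorted(model)
    d = np.stack([np.linalg.norm(X - model[c], axis=1) for c in classes], axis=1)
    e = np.exp(-d)
    return e / e.sum(axis=1, keepdims=True)

def expected_future_uncertainty(X_lab, y_lab, x_cand, X_pool):
    """For each possible label of the candidate: retrain, measure the
    remaining uncertainty on the unlabelled pool, and weight it by the
    current model's posterior for that label. Lower is a better candidate."""
    p_now = posteriors(retrain(X_lab, y_lab), x_cand[None, :])[0]
    score = 0.0
    for i, y in enumerate(np.unique(y_lab)):
        m = retrain(np.vstack([X_lab, x_cand]), np.append(y_lab, y))
        post = posteriors(m, X_pool)
        score += p_now[i] * (1.0 - post.max(axis=1)).mean()
    return score

# two labelled points, one candidate, a small unlabelled pool (toy data)
X_lab = np.array([[0.0, 0.0], [2.0, 2.0]])
y_lab = np.array([0, 1])
x_cand = np.array([1.0, 1.0])
X_pool = np.array([[0.1, 0.0], [1.9, 2.0], [1.0, 0.9]])
print(expected_future_uncertainty(X_lab, y_lab, x_cand, X_pool))
```

The expensive part is exactly what the paper notes: one retrain per candidate per possible label, which is why it needs a classifier that retrains cheaply.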