Written by Nickolay Shmyrev
on July 13, 2021

Active learning in speech recognition

As dataset sizes grow beyond 10 thousand hours (Gigaspeech), the compute requirements for speech recognition research grow too. Any research, even simple architecture testing, gets harder and harder: when a single training run takes 2 weeks, you won’t be able to run proper experiments. One solution would be to add more hardware, but it is pretty expensive these days and sometimes not even available. Even disk storage costs skyrocketed this year, not to mention GPU cards.

Small companies and open source projects can only watch this race. But there is one direction of research where even a small project like Vosk can do interesting things: active learning. The idea is that we don’t really need to train on those thousands of hours of data; we can select a subset of it and train much faster. Hopefully, even an order of magnitude faster.

An old paper from CMU is a great example:

Data Selection for Speech Recognition by Yi Wu, Alexander I. Rudnicky and Rong Zhang

With just 1/10th of the dataset and a simple selection method you can reach the accuracy level of the full dataset. Something to try in the near future.
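To give a feeling for what a "simple selection method" can look like, here is a minimal sketch of greedy selection that grows a subset so that the entropy of its word distribution stays as high as possible. This follows the way the notes below describe the approach (greedy entropy maximization over words); it is not the exact algorithm from the paper, and the function names and the whitespace tokenization are my own assumptions.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in nats) of a distribution given as unit counts."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log(c / total) for c in counts.values() if c > 0)


def greedy_entropy_selection(transcripts, budget):
    """Pick up to `budget` utterances, each time adding the one that gives
    the highest word-distribution entropy for the selected subset.

    Hypothetical helper for illustration: words are split on whitespace and
    ties are broken by the lower utterance index.
    """
    selected = []
    counts = Counter()
    remaining = set(range(len(transcripts)))
    while remaining and len(selected) < budget:
        best_idx, best_score = None, -1.0
        for idx in sorted(remaining):
            # Entropy of the subset if this utterance were added.
            trial = counts + Counter(transcripts[idx].split())
            score = entropy(trial)
            if score > best_score:
                best_idx, best_score = idx, score
        selected.append(best_idx)
        counts.update(transcripts[best_idx].split())
        remaining.remove(best_idx)
    return selected
```

The loop rescans every remaining utterance at each step, so it is only practical for small experiments; a real setup would need something faster.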

Common note: most of the older papers assume the big dataset is untranscribed and hard to transcribe (label). In recent setups the big dataset is transcribed (though not perfectly); we just can’t train the model on all of it. It means the algorithms shift slightly to a different area, where we might still process the big data but have to do it very quickly.

Another common note: semi-supervised learning wins over active learning by a big margin in all the works. It seems that it is still beneficial to process the whole set at least once instead of forcing focus on a smaller subset. No method could guarantee full accuracy with even 30% of the samples; they are all suboptimal. The exception is the paper above, but I do not think it applies to real-life data. Maybe the best idea is to train on a subset and then finetune once on the whole set.

Another note: the lack of a proper evaluation set definitely makes apples-to-apples comparison much harder. Some of the papers have a different working point (they try to select 1000 lines for labelling from 10h of untranscribed data). It is hard to guess what happens if you select more data.

An Analysis of Active Learning Strategies for Sequence Labeling Tasks - Not really on speech, but a pretty fundamental paper evaluating many methods: least confidence, margin, token entropy, total token entropy, sequence entropy, N-best sequence entropy, vote entropy, Kullback-Leibler, sequence vote entropy, sequence Kullback-Leibler, expected gradient length, information density, Fisher information ratio. Information density wins in tests. Sequence vote entropy is also good.
Active and Unsupervised Learning for Automatic Speech Recognition (2003) - Least confidence from model predictions. Very small dataset size (5000 lines).
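
The simplest scores from the list above (least confidence, margin, entropy) can be sketched in a few lines. This is a generic illustration over plain posterior distributions, not code from any of the papers; how you get an utterance-level distribution out of an ASR decoder is left out, and `select_most_uncertain` is a hypothetical helper name.

```python
import math


def least_confidence(probs):
    """1 minus the probability of the best label; higher means less certain."""
    return 1.0 - max(probs)


def margin(probs):
    """Gap between the two most likely labels; smaller means less certain."""
    top = sorted(probs, reverse=True)
    return top[0] - top[1]


def entropy(probs):
    """Shannon entropy (in nats) of the distribution; higher means less certain."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_most_uncertain(posteriors, k, score=least_confidence):
    """Return indices of the k items with the highest uncertainty score.

    `score` must grow with uncertainty, so pass `lambda p: -margin(p)`
    when ranking by margin.
    """
    ranked = sorted(
        range(len(posteriors)),
        key=lambda i: score(posteriors[i]),
        reverse=True,
    )
    return ranked[:k]
```

For example, `select_most_uncertain(posteriors, 1000)` would pick the 1000 items to send for labelling, and swapping the `score` argument is enough to compare the criteria on the same data.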

Active Learning For Automatic Speech Recognition (2002) Same authors, same method.

Active Learning: Theory and Applications to Automatic Speech Recognition (2005) Same authors. Journal paper.
Active Learning for LF-MMI Trained Neural Networks in ASR (2018) - Least confidence from model + confidence from multi-model agreement (voting confidence). Also cited as HNN (Heterogeneous Neural Network) approach.

A Dropout-Based Single Model Committee Approach for Active Learning in ASR (2019) Same approach overall. Different method to create voting models.
Speech modeling based on committee-based active learning Committee-based confidence wins over simple single-model confidence. Vote entropy.
Acoustic model training using committee-based active and semi-supervised learning for speech recognition Committee-based confidence wins over simple single-model confidence. Vote entropy. Extension of the previous paper. Semi-supervised much better than active.
Active and Semi-Supervised Learning in ASR: Benefits on the Acoustic and Language Models Amazon paper. Least-confidence selection. States that confidence measure doesn’t matter, even simple ones work. Detailed description of experiments. Utterance length filtering didn’t work. Generative model like LM doesn’t improve from confidence-based sampling. No mention of diversity sampling.
Active Learning Methods for Low Resource End-To-End Speech Recognition (2019) Least confidence assisted with clustered i-vector max entropy. Better results than simple least confidence.
Active learning for speech recognition: the power of gradients - Expected gradient length calculated with the model is a great measure uncorrelated with the confidence. Wins in experiments, although not significantly.
Gradient-based Active Learning Query Strategy for End-to-end Speech Recognition (2019) Extends expected gradient length approach. Trains a special network to predict true gradient from expected gradient and entropy (GAN style). The evaluation is not descriptive, only 10000 utterances selected.
Loss Prediction: End-to-End Active Learning Approach For Speech Recognition by Jian Luo, Jianzong Wang, Ning Cheng, Jing Xiao (2021) GAN-like idea to create extra network to predict sample informativeness. Doesn’t consider representativeness. The results are not significantly better than random selection, maybe network is not trained properly. 50% random selection is much better than 20% selection with loss prediction.
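
Vote entropy, which the committee-based entries above rely on, is easy to sketch once the committee outputs are aligned. The code below is a minimal illustration under a strong assumption: every committee member produces a label sequence of the same length (for example frame-level labels). Aligning ASR hypotheses of different lengths is not handled, and the function names are mine.

```python
import math
from collections import Counter


def vote_entropy(votes):
    """Entropy (in nats) of the labels voted by committee members for one position."""
    total = len(votes)
    return -sum(n / total * math.log(n / total) for n in Counter(votes).values())


def mean_vote_entropy(hypotheses):
    """Average per-position vote entropy over aligned, equal-length hypotheses.

    `hypotheses` holds one label sequence per committee member. A value of
    zero means full agreement; larger values mean more disagreement.
    """
    length = len(hypotheses[0])
    if any(len(h) != length for h in hypotheses):
        raise ValueError("hypotheses must be aligned to the same length")
    if length == 0:
        return 0.0
    # zip(*hypotheses) yields the votes of all members for each position.
    return sum(vote_entropy(position) for position in zip(*hypotheses)) / length
```

Utterances with the highest score are the ones where the models disagree most, so they are the candidates for labelling.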

Generative Adversarial Active Learning (2017) - GANs are popular in active learning in vision.
LMC-SMCA: A New Active Learning Method in ASR by XiuSong Sun et al. (2021) Committee-based confidence selection with LM-based entropy diversity selection. Both work well together.
Training data selection based on context dependent state matching (2014)

iVector-based Acoustic Data Selection (2013)

A big data approach to acoustic model training corpus selection (2014)

CD state KL divergence, i-vector diversity and also top confidence (not least confidence, by the way). Set of papers from Google.
Kullback-Leibler divergence-based ASR training data selection - Kullback-Leibler with ngrams. Very small datasets. Should be better for generative models.
Active learning and semi-supervised learning for speech recognition: A unified framework using the global entropy reduction maximization criterion
Microsoft paper which introduces global entropy reduction maximization (GERM), a greedy selection algorithm. The idea is that we try to select lattices with the most diverse paths covering paths in the other lattices. Demonstrated to be better than confidence-based selection, but by just 2% relative. Authors observed that if only 1% of utterances are to be selected, most utterances selected by the confidence-based approach are noise and garbage utterances that have extremely low confidence but have little value to improving the performance of the overall system, while only a few such utterances are selected by the GERM algorithm. Seems reasonable.

Maximizing global entropy reduction for active learning in speech recognition (2009) From same authors, conference paper.
Hierarchical Sampling for Active Learning Sampling bias. The two faces of active learning. Nice idea but probably not directly applicable to speech. At least not that straight.
Active Learning based data selection for limited resource STT and KWS (2015) HMM state entropy maximization. Similar to Rudnicky paper but for HMM states instead of words.
Parting with Illusions about Deep Active Learning - As usual, not everything is straightforward. Most methods are pretty similar and advanced methods have no effect. Vision paper though. Semi-supervised learning beats active learning in vision. As in many other papers, the advantages of active learning are not straightforward.

Modern semi-supervised learning algorithms applied in the conventional active learning setting show a higher relative performance increase than any of the active learning methods proposed in the recent years. State-of-the-art active learning approaches often fail to outperform simple random sampling, especially when the labeling budget is small - a setting critically important for many real-world applications. Hmm…
N-best entropy based data selection for acoustic modeling (2012) Good understanding of informativeness/representativeness. Informativeness calculated as n-best entropy. Representativeness as KL-divergence of phone distributions from the same n-best lattice.

The representativeness had a negative gain in the small setup, but wins with a larger amount of selected data (important).

Testing was done with GMM models, might be better with DNN
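
The two ingredients here can be sketched roughly as follows: entropy over normalized n-best scores for informativeness, and a divergence between phone distributions for representativeness. I assume the divergence is Kullback-Leibler and that n-best scores are log-domain; the add-one smoothing, the function names and the way a phone inventory is passed in are all my assumptions, and how the paper combines the two scores is not reproduced.

```python
import math
from collections import Counter


def nbest_entropy(log_scores):
    """Entropy (in nats) of the posterior obtained by normalizing n-best log scores."""
    peak = max(log_scores)
    # Subtract the maximum before exponentiating for numerical stability.
    weights = [math.exp(s - peak) for s in log_scores]
    total = sum(weights)
    return -sum(w / total * math.log(w / total) for w in weights if w > 0)


def phone_distribution(phones, inventory, smoothing=1.0):
    """Smoothed relative frequency of each phone from `inventory` in `phones`."""
    counts = Counter(phones)
    total = len(phones) + smoothing * len(inventory)
    return {p: (counts[p] + smoothing) / total for p in inventory}


def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same keys, with q > 0 everywhere."""
    return sum(p[k] * math.log(p[k] / q[k]) for k in p if p[k] > 0)
```

An utterance whose n-best list has near-equal scores gets high entropy (informative), and an utterance whose phone distribution is close to that of the whole pool gets low divergence (representative).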

A Two-Stage Method for Active Learning of Statistical Grammars (2005) Accompanying NLP paper
Submodular subset selection for large-scale speech training data (2014) Replaces greedy optimization algorithm of Rudnicky approach with submodular function optimization. Reaches better optimum point this way with more computation. Exact submodular optimization method is not well described though.
A Convex Optimization Framework for Active Learning (2013)

Not really on speech, vision paper. Replaces greedy optimization in task of selection of informative samples with convex optimization. Much faster and better optimum.
Toward Optimal Active Learning through Sampling Estimation of Error Reduction

NLP paper but still an interesting idea. For every new sample we take every possible label and retrain the model quickly (we assume we can do that). Then we estimate how useful the sample would be. Well, it requires a special classifier. But at least it reaches good quality quickly.

Books, surveys

Active Learning Literature Survey (2010), previous version, book - a survey from Burr Settles (Duolingo)
A literature survey of active machine learning in the context of natural language processing
Active Learning Query Strategies for Classification, Regression, and Clustering: A Survey
A Survey of Deep Active Learning - some interesting recent variational methods too (GANs)

Extra links

Not so interesting papers

Robustness Aspects of Active Learning for Acoustic Modeling (2003) Not very interesting, confidence-based digits experiments. There were 3-4 papers from the same authors.
Boosting Active Learning For Speech Recognition With Noisy Pseudo-Labelled Samples (2020) Not very interesting. Just adds pseudo-labelling to confidence-based selection. Was considered before with semi-supervised works.
Active Learning with Minimum Expected Error for Spoken Language Understanding NLP paper. Considers similarity between intents. Similar to GERM from Microsoft.
Semi-supervised and active-learning scenarios: Efficient acoustic model refinement for a low resource Indian language (2018) Confidence selection.
Incorporating Diversity in Active Learning with Support Vector Machines Time for SVMs hasn’t come yet.
Training Data Augmentation and Data Selection Not many details. I-vector selection.
Error-Driven Fixed-Budget ASR Personalization for Accented Speakers (2021) No clear objective or results, the idea of phonemes in non-native speech is doomed.
Active Learning for Hidden Markov Models: Objective Functions and Algorithms Not really related to speech though a bit interesting

On importance of active learning

Most Research in Deep Learning is a Total Waste of Time - Jeremy Howard

On RNNLMs

A similar task applies to simple RNNLM training, not necessarily for speech. With texts it is even easier to experiment.

Relevant papers:

Sampling Informative Training Data for RNN Language Models
Not All Samples Are Created Equal: Deep Learning with Importance Sampling
Deduplicating Training Data Makes Language Models Better - actually the amount of duplicated data is not that great and it does NOT affect the perplexity
