ML datasets are not relevant anymore

Written by Nickolay Shmyrev
We started promoting data collection for open source speech recognition
with the Voxforge project in 2007. That was well before the speech
recognition revolution, but even then the importance of public data was
obvious. In those days the 120-hour WSJ dataset was frequently used in
research. Later on, Voxforge was followed by Mozilla CommonVoice, which
expanded the data collection effort to multiple languages. Recently it
was announced that the CommonVoice project will go into maintenance mode,
which is unfortunate but not really critical. Let me explain why.
In 2015 the Librispeech corpus was created with an enormous 1000 hours of
speech data. At the time that was a huge amount. Librispeech is still
useful today and serves an important role in speech recognition research.
It is worth mentioning that the core Librispeech ideas were popularized
by CMU people, in particular Prof. James K. Baker, during a Summer of
Code 2011 project. Librispeech demonstrated that you do not really need
careful data collection to get a great speech dataset.
Since then, the development of the Kaldi toolkit, deep learning
algorithms in general, and progress in computing and cloud storage have
changed the picture even further. In particular, the following algorithms
enabled new approaches:
- Semi-supervised learning (see the sketch after this list)
- Reliable transfer learning between languages and acoustic conditions
- Unsupervised pretraining
- Online learning
- Federated learning
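
To make the first item concrete, here is a minimal pseudo-labeling
sketch, the simplest flavor of semi-supervised learning. It uses
scikit-learn on toy tabular data purely for illustration; in speech the
same loop applies, with a seed acoustic model transcribing untranscribed
audio and only the confident utterances joining the training set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy data: a small labeled set and a large unlabeled pool
X, y = make_classification(n_samples=2000, random_state=0)
X_lab, y_lab = X[:100], y[:100]
X_unlab = X[100:]

# Train a seed model on the labeled data only
model = LogisticRegression().fit(X_lab, y_lab)

# Pseudo-label the pool, keeping only confident predictions
proba = model.predict_proba(X_unlab)
confident = proba.max(axis=1) > 0.95
X_pseudo = X_unlab[confident]
y_pseudo = proba[confident].argmax(axis=1)

# Retrain on labeled plus pseudo-labeled data
model = LogisticRegression().fit(
    np.vstack([X_lab, X_pseudo]),
    np.concatenate([y_lab, y_pseudo]),
)
```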
These days you do not need carefully annotated data to learn great
things. Sometimes you do not need annotation at all, and there is more
data available for training than you could ever imagine: corporations
like Amazon train their systems on 1 million hours of audio, and Google
trains language models on terabytes of text.
On the other hand, we live in a fast-changing world which repeats itself
to some degree but never repeats exactly. The flow of time is not really
incorporated into our models, which are statically trained on fixed
datasets. One of the major tasks on the way to creating an AI is the
integration of the notion of time, which is itself a very important
subject in philosophy and metaphysics.
A good illustration of the problem is the toy case of the phonetic
dictionary. A dictionary is never fixed: new words appear and disappear.
Who knew the word TikTok a couple of years ago? Now it is a topic of
discussion in the news, and a speech recognition system should be able to
learn and use it. So we cannot go ahead with a straightforward phonetic
dictionary, no matter how big and accurate it is. We need a live system
which learns words every day in a streaming fashion. One should be able
to update the dictionary with new words automatically or
semi-automatically, verify and review the update, and distribute it to
all users.
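
To sketch what one step of such a live dictionary loop could look like,
here is a small Python example. It uses the real g2p_en
grapheme-to-phoneme package to propose pronunciations; the lexicon format
and the review workflow are assumptions for illustration, not any
particular toolkit's API.

```python
from g2p_en import G2p  # pip install g2p_en

g2p = G2p()

# Existing pronunciation lexicon: word -> phoneme string (assumed format)
lexicon = {"hello": "HH AH0 L OW1", "world": "W ER1 L D"}

def propose_entries(text, lexicon):
    """Collect out-of-vocabulary words from a text stream and
    propose pronunciations for later human review."""
    proposals = {}
    for token in text.lower().split():
        word = token.strip(".,!?\"'")
        if word and word not in lexicon and word not in proposals:
            phones = [p for p in g2p(word) if p.strip()]
            proposals[word] = " ".join(phones)
    return proposals

# A new word arrives in the daily news stream
print(propose_entries("TikTok is in the news again", lexicon))
```

The proposed entries would then be verified, reviewed, and merged, and
the updated dictionary distributed to all users.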
Another good example is Common Crawl News, a collection of news feeds
updated daily. It is called a dataset, but it is not really a dataset; it
is more a live system which extends itself over time. You can learn very
interesting things from CC News if you learn to learn from the daily
updates of the stream.
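
As a rough sketch of consuming this stream, here is how one day of CC
News could be read with the warcio package. The WARC URL is a
placeholder; the actual file names are published in the Common Crawl
listings.

```python
import requests
from warcio.archiveiterator import ArchiveIterator

# Placeholder: substitute a real file name from the CC-NEWS listings
WARC_URL = "https://data.commoncrawl.org/crawl-data/CC-NEWS/..."

def iter_news_pages(warc_url):
    """Stream one CC News WARC file and yield (url, html) pairs."""
    resp = requests.get(warc_url, stream=True)
    for record in ArchiveIterator(resp.raw):
        if record.rec_type == "response":
            page_url = record.rec_headers.get_header("WARC-Target-URI")
            yield page_url, record.content_stream().read()
```

A daily job would feed these pages into an incremental training step
instead of rebuilding a frozen dataset from scratch.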
Such live systems are more common in NLP and not yet as common in speech,
but they will certainly become popular. For example, one interesting new
toolkit is LAMOL, a toolkit for lifelong language modeling. Others like
Creme, Bytedance Fedlearner, or Snorkel will get more attention, and
hopefully “lifelong learning” will be the buzzword at some future
Interspeech.
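
As a taste of what online learning looks like in such toolkits, here is a
minimal test-then-train loop with Creme, close to its standard
documentation example (the library has since been merged into River). The
model sees one example at a time and never needs the full dataset.

```python
from creme import datasets, linear_model, metrics, preprocessing

# Pipeline: scale features online, then a logistic regression
model = preprocessing.StandardScaler() | linear_model.LogisticRegression()
metric = metrics.Accuracy()

# Learn from a stream, one example at a time
for x, y in datasets.Phishing():
    y_pred = model.predict_one(x)  # predict on the example first
    metric.update(y, y_pred)       # update the running metric
    model.fit_one(x, y)            # then learn from the example

print(metric)
```

The same loop is how a lifelong speech system would consume its daily
stream: predict, evaluate, and update, day after day.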