ML datasets are not relevant anymore

We started promoting data collection for open source speech recognition with the Voxforge project back in 2007. Those were great times, well before the speech recognition revolution, but even then the importance of public data was obvious. In those days the 120-hour WSJ dataset was frequently used in research. Later the Voxforge project was followed by Mozilla CommonVoice, which expanded the data collection effort to multiple languages. Recently it was announced that the CommonVoice project will go into maintenance mode, which is unfortunate but not really critical. Let me explain why.

In 2015 the Librispeech corpus was created, with an enormous 1000 hours of speech data. At the time that was a huge amount. The Librispeech corpus is still useful these days and plays an important role in speech recognition research. It is worth mentioning that the core Librispeech ideas were popularized by CMU people, in particular Prof. James K. Baker, during a Summer of Code 2011 project. Librispeech demonstrated that you do not really need careful data collection to get a great speech dataset.

Since then, the development of the Kaldi toolkit, deep learning algorithms in general, and progress in computing and cloud storage have changed the picture even further. In particular, the following approaches became practical (the first is illustrated in the sketch after the list):

  • Semi-supervised learning
  • Reliable transfer learning between languages and acoustic conditions
  • Unsupervised pretraining
  • Online learning
  • Federated learning
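
To make the first item on the list concrete, here is a minimal sketch of semi-supervised self-training on toy data (not speech). It uses scikit-learn's SelfTrainingClassifier, which iteratively pseudo-labels the unlabeled part of the data with the model's own confident predictions; the dataset, threshold, and base classifier are illustrative choices of mine, not anything from the projects mentioned here.

```python
# Toy illustration of semi-supervised self-training (not speech data).
# All numbers and model choices here are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Pretend only 5% of the data is labeled; -1 marks "unlabeled".
rng = np.random.RandomState(0)
y_partial = np.where(rng.rand(len(y)) < 0.05, y, -1)

# Self-training pseudo-labels the unlabeled points the model is confident
# about (probability above the threshold) and retrains on them.
model = SelfTrainingClassifier(SVC(probability=True), threshold=0.9)
model.fit(X, y_partial)

pred = model.predict(X)
print("accuracy on all data:", (pred == y).mean())
```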

These days you do not need carefully annotated data to learn great things anymore; sometimes you do not need annotation at all. And there is more data available for training than you could ever imagine: corporations like Amazon train their systems on a million hours of speech, and Google trains language models on terabytes of text.

On the other hand, we live in a fast-changing world, one which repeats itself to some degree but never repeats exactly. The flow of time is not really incorporated into our models, which are statically trained on fixed datasets. One of the major tasks on the way to creating an AI is integrating the notion of time, which is itself a very important subject in philosophy and metaphysics.

A good illustration of the problem is the toy case of the phonetic dictionary. A dictionary is never fixed: new words appear and disappear. Who knew the word TikTok a couple of years ago? Now it is a topic of discussion in the news, and a speech recognition system should be able to learn it and use it. So we cannot go ahead with a straightforward phonetic dictionary, no matter how big and accurate it is. We need a live system which learns words every day in a streaming fashion. One should be able to update the dictionary with new words automatically or semi-automatically, verify and review the update, and distribute it to all users, roughly as in the sketch below.
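
Here is a hedged sketch of what such a live dictionary could look like. The guess_pronunciation stub stands in for a real grapheme-to-phoneme model (for English one could plug in a package such as g2p_en); the class name and the review workflow are hypothetical, purely for illustration.

```python
# A minimal sketch of a "live" phonetic dictionary: new words stream in,
# get a proposed pronunciation, wait for human review, then go live.
# All names here are hypothetical; guess_pronunciation is a placeholder
# for a real grapheme-to-phoneme (g2p) model.

def guess_pronunciation(word: str) -> str:
    # Placeholder: naive letter-by-letter "pronunciation".
    # A real system would call a trained g2p model here.
    return " ".join(word.lower())

class LiveLexicon:
    def __init__(self):
        self.lexicon = {}   # word -> approved pronunciation
        self.pending = {}   # word -> proposed pronunciation awaiting review

    def observe(self, text: str):
        """Scan incoming text and propose entries for unknown words."""
        for word in text.split():
            w = word.strip(".,!?").lower()
            if w and w not in self.lexicon and w not in self.pending:
                self.pending[w] = guess_pronunciation(w)

    def review(self, word: str, approved: bool, pronunciation: str = None):
        """A human (or a verification model) approves or rejects a proposal."""
        proposal = self.pending.pop(word, None)
        if approved and proposal is not None:
            self.lexicon[word] = pronunciation or proposal
        # An approved entry would then be distributed to all users,
        # e.g. as a daily lexicon delta.

lex = LiveLexicon()
lex.observe("TikTok is in the news again")
print(lex.pending)   # {'tiktok': 't i k t o k', ...}
lex.review("tiktok", approved=True, pronunciation="T IH K T AA K")
print(lex.lexicon)
```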

Another good example is Common Crawl News, a collection of news feeds updated daily. It is called a dataset, but it is not really a dataset; it is more of a live system which extends itself over time. You can learn very interesting things from CC News once you learn to learn from the daily updates of the stream.
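
As one hedged illustration of learning from the daily updates of a stream (my toy example, not anything from the CC News tooling): keep exponentially decayed word counts over daily batches, so a word that suddenly spikes, like TikTok above, floats to the top.

```python
# A sketch of consuming a news stream day by day: exponentially decayed
# word counts make newly trending words stand out over the background.
# The decay factor and the toy "daily batches" are illustrative choices.
from collections import Counter

DECAY = 0.9  # how quickly old days fade

def update(counts: Counter, day_texts: list) -> Counter:
    # Fade yesterday's counts, then add today's words.
    counts = Counter({w: c * DECAY for w, c in counts.items()})
    for text in day_texts:
        counts.update(text.lower().split())
    return counts

counts = Counter()
daily_batches = [
    ["markets rise on earnings", "elections dominate headlines"],
    ["markets fall slightly", "tiktok ban debated in congress"],
    ["tiktok hearing continues", "tiktok creators respond"],
]
for day in daily_batches:
    counts = update(counts, day)

print(counts.most_common(3))  # "tiktok" should now rank near the top
```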

Such live systems are more common in NLP and not yet as common in speech, but they will certainly become popular. For example, one interesting new toolkit is LAMOL, a toolkit for lifelong language modeling. Others like Creme, Bytedance Fedlearner, or Snorkel will get more attention, and hopefully “lifelong learning” will be the buzzword at some future Interspeech.
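
To give a flavor of that style of online learning, here is a hedged sketch in the Creme style. Creme has since been merged into the river project, so this uses river's API (a bag-of-words pipeline into a naive Bayes classifier), and the tiny labeled headlines are invented for illustration. The point is that the model is updated one example at a time, with no fixed dataset anywhere.

```python
# A sketch of online (streaming) learning in the creme/river style:
# the model is updated one example at a time and never sees a fixed dataset.
# The labeled headlines are invented for illustration.
from river import feature_extraction, naive_bayes

# Bag-of-words features piped into a naive Bayes classifier, updated online.
model = feature_extraction.BagOfWords(lowercase=True) | naive_bayes.MultinomialNB()

stream = [
    ("stocks rally after strong earnings report", "business"),
    ("new social video app tops download charts", "tech"),
    ("central bank holds interest rates steady", "business"),
    ("chip maker unveils a faster processor", "tech"),
]

for text, label in stream:
    model.learn_one(text, label)  # one example at a time, forever

print(model.predict_one("bank raises rates"))  # likely 'business'
```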