Written by Nickolay Shmyrev
on August 21, 2013

Mixer 6 database release by LDC & Librivox

LDC has recently announced the availability of a very large speech database for acoustic model training. The database, named Mixer 6, contains an incredible 15000 hours of transcribed speech data from a few hundred speakers. While commercial companies have access to significantly bigger sets, Mixer 6 is bigger than any database used in research before. The previously available Fisher database has only around 2000 hours.

It would be really interesting to see the results obtained with this database; the sheer amount of data should improve the performance of existing systems. However, I think this dataset will pose a critical challenge to the research and development community. Essentially, data of this size is very hard to train on with conventional software and accessible hardware. For example, it takes about a week on a decent cluster to train a model on 1000 hours; with 15000 hours you would have to wait several months unless more efficient algorithms are introduced. So, it is not easy.
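To make the estimate above concrete, here is a back-of-the-envelope sketch. It assumes training time grows linearly with the amount of audio and that the same cluster is used, which is only a rough approximation; the function name and the 30-day month are my own choices for illustration.

```python
def estimated_training_weeks(hours, ref_hours=1000.0, ref_weeks=1.0):
    """Extrapolate training time linearly from a reference run.

    Assumption: time scales linearly with hours of audio on the same
    cluster. The default reference point (1000 hours in about a week)
    is the rough figure quoted in the text, not a measurement.
    """
    return ref_weeks * hours / ref_hours


weeks = estimated_training_weeks(15000)
months = weeks * 7 / 30.0  # approximate a month as 30 days
print("about %.0f weeks, or roughly %.1f months" % (weeks, months))
```

Under these assumptions the 15000 hours come out at about 15 weeks, which is where the "several months" comes from.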

On the other hand, we already have access to a comparable amount of data, if not more: the Librivox archive contains a huge number of high-quality recordings with the text available. Training a model on Librivox data certainly must be a focus of development. Such training is not going to be straightforward either; new algorithms and software must be created. A critical issue is to design an algorithm that improves the accuracy of the model without the need to process the whole dataset. Let me know if you are interested in this project.

By the way, Librivox accepts donations, and they definitely deserve them.
