On SANE 2015 Videos on Signal Separation

Recently a great collection of videos from the Speech and Audio in the Northeast (SANE) 2015 workshop was shared. The main topic of the workshop was sound signal separation, which I consider a very important research direction for the near future, something that will be critical to solve in order to reach human-like performance in speech recognition systems.

We did some experiments before with NMF and other methods to robustly recognize overlapped speech, but my conclusion is that unless training and test conditions are carefully matched, the whole system does not really work; anything unknown in the background destroys the recognition result. For that reason I was very interested to check the recent progress in the field. The research is still at a pretty early stage, but there are certainly very interesting results.

The talk by Dr. Paris Smaragdis is quite useful for understanding the connection between non-negative matrix factorization and the more recent neural network approach; it also demonstrates how a neural network works by selecting principal components from the data.
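
To make the factorization side of that connection concrete, here is a minimal NumPy sketch of NMF with the classic Lee-Seung multiplicative updates; in Smaragdis's framing, the columns of W are the learned spectral components, and a non-negative autoencoder learns essentially the same decomposition with an encoder playing the role of H. The function and its parameters are my own illustration, not code from the talk.

```python
import numpy as np

def nmf(V, rank, n_iter=200, eps=1e-9):
    """Factor a non-negative matrix V ~ W @ H with multiplicative
    updates that decrease ||V - WH||_F. For audio, V is typically a
    magnitude spectrogram whose columns are frames, so the columns
    of W act as learned spectral building blocks."""
    rng = np.random.default_rng(0)
    W = rng.random((V.shape[0], rank)) + eps
    H = rng.random((rank, V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Example: decompose a toy "spectrogram" into 2 components.
V = np.abs(np.random.default_rng(1).standard_normal((64, 100)))
W, H = nmf(V, rank=2)
print(np.linalg.norm(V - W @ H))
```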


One interesting bit from the talk above is the announcement of bitwise neural networks, which are a very fast and efficient way to classify inputs. I believe this could be another big advance in the performance of speech recognition algorithms. The details can be found in the publication Bitwise Neural Networks by Minje Kim and Paris Smaragdis. Overall, the idea of bit-compressed computation to reduce memory bandwidth seems very important (the LOUDS language model in the Google mobile recognizer is also from this area). I think NVIDIA should be really concerned about it, since a GPU is certainly not the device this type of algorithm needs. No more need for expensive Teslas.
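
I have not implemented the training procedure from the paper, but the inference trick is easy to illustrate. Below is a toy NumPy sketch of a single bitwise layer: with weights and activations constrained to +/-1, a dot product collapses to XNOR plus popcount, which is exactly why such networks need so little memory bandwidth. All the names and the random weights here are mine, not from the paper.

```python
import numpy as np

def binarize(x):
    """Quantize real values to {+1, -1} by sign."""
    return np.where(x >= 0, 1, -1).astype(np.int8)

def bitwise_layer(x_bits, W_bits):
    """One fully connected layer with +/-1 weights and inputs.
    A +/-1 dot product equals 2 * (number of matching bits) - n,
    so on real hardware it reduces to XNOR followed by popcount;
    here it is emulated with integer comparisons for clarity."""
    n = x_bits.shape[0]
    matches = (W_bits == x_bits).sum(axis=1)  # popcount of XNOR(row, x)
    return binarize(2 * matches - n)          # binary activations out

# Toy forward pass: 8 inputs -> 4 hidden units (weights are made up).
rng = np.random.default_rng(0)
W = binarize(rng.standard_normal((4, 8)))
x = binarize(rng.standard_normal(8))
print(bitwise_layer(x, W))
```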

Another interesting talk was given by Dr. Tuomas Virtanen, in which a very interesting database and an approach to using neural networks for separating different event types are presented. The results are pretty entertaining.

This video also had quite important bits; one of them is the announcement of the Detection and Classification of Acoustic Scenes and Events Challenge 2016 (DCASE 2016), in which acoustic scene classification will be evaluated. The goal of acoustic scene classification is to classify a test recording into one of the predefined classes that characterize the environment in which it was recorded, for example "park", "street", or "office". The discussion of the challenge, which starts soon, is already going on in the challenge group; it would be very interesting to participate.
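
Just to show the shape of the task, here is a rough sketch of the kind of simple starting point one could build for scene classification, assuming librosa and scikit-learn: summarize each recording with MFCC statistics and train a standard classifier on top. The `train_paths` and `train_labels` variables are hypothetical placeholders for the challenge metadata, and this is my own sketch, not the official baseline.

```python
import numpy as np
import librosa
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def scene_features(path, n_mfcc=20):
    """Summarize a whole recording by the mean and standard deviation
    of its MFCCs, a crude but serviceable scene-level descriptor."""
    y, sr = librosa.load(path, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# train_paths and train_labels would come from the challenge
# metadata; both names are hypothetical here.
X = np.array([scene_features(p) for p in train_paths])
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X, train_labels)
print(clf.predict([scene_features("some_test_recording.wav")]))
```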