ICASSP 2010

So, I'm back from ICASSP in Dallas, TX. It was a very impressive conference with lots of interesting and inspiring presentations, meetings and discussions. Amazingly, everyone was there, and I finally met all the speech people who have guided me for so long. I met ASR people Bhiksha Raj, David Huggins-Daines, Rita Singh and Richard Stern, and TTS people Alan W. Black, Keiichi Tokuda, Simon King and Heiga Zen. I was pleased to meet wonderful guys like Evandro and Peter for the second time. It's also worth mentioning that I was able to listen to talks by famous people like Hynek Hermansky.

We had the Sphinx Users and Developers Workshop there and also two CMU Sphinx development planning meetings, but those are a subject for another post. This one is just about interesting ideas presented at the conference by other people. I didn't have time to attend every presentation; I think that was impossible. You have to find time for sightseeing, and there were often two or three parallel lecture sessions plus poster presentations, which I liked most. I think a poster presentation is the best way to reach the author, ask questions and get feedback. Many posters were so popular that it was almost impossible to get to the stand.

Anyway, the number of talks I've collected already exceeds what can be consumed in a week. It would be nice to one day have all the information about current research collected into a structured or wiki-like resource. That's a huge amount of work for the future though.

So here are some presentations and ideas I came across there and found worth attention:

Robust Speaking Rate Estimation Using Broad Phonetic Class Recognition

Authors: Jiahong Yuan, Mark Liberman (University of Pennsylvania)

The presented work is about using a simple classifier to extract specific information about speech, for example to estimate syllable deletions and thus speech quality. This is actually a very promising approach which for some reason is ignored in most places where it seems practical. For example, it's not clear to me why speaker identification frameworks don't try to find phonetic classes first and build GMMs only after that. It seems to be a natural approach to improving SID performance.

Broad phonetic classes remind me of the idea from the famous face detection algorithm by Viola and Jones about applying a cascade for fast classification. I think this idea could be applied to speech in some form, as the authors suggest.
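
To make the idea concrete, here is a minimal sketch of how a speaking rate could be read off a broad-class labeling. It is not the authors' method: the three labels ("V", "C", "S"), the 10 ms frame length and the `min_vowel_frames` threshold are all assumptions made up for illustration, and a vowel run is simply treated as one syllable.

```python
from itertools import groupby

# Assumed frame length: 10 ms per label (hypothetical, not from the paper).
FRAME_SEC = 0.01


def speaking_rate(frames, frame_sec=FRAME_SEC, min_vowel_frames=3):
    """Estimate syllables per second of speech from per-frame broad classes.

    frames: sequence of labels, "V" (vowel), "C" (consonant), "S" (silence).
    A run of at least min_vowel_frames vowel frames counts as one syllable.
    """
    # Collapse the frame sequence into (label, run_length) pairs.
    runs = [(label, sum(1 for _ in group)) for label, group in groupby(frames)]
    syllables = sum(1 for label, n in runs if label == "V" and n >= min_vowel_frames)
    speech_frames = sum(n for label, n in runs if label != "S")
    if speech_frames == 0:
        return 0.0
    return syllables / (speech_frames * frame_sec)


# Toy input: silence, two consonant-vowel syllables, silence.
frames = list("S" * 20 + "C" * 8 + "V" * 12 + "C" * 6 + "V" * 14 + "S" * 10)
rate = speaking_rate(frames)
print(round(rate, 2))
```

The point of the sketch is only that a cheap classifier with a handful of classes already gives enough structure for a useful estimate; no full phone recognition is needed.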

Clap Your Hands! Calibrating Spectral Subtraction for Reverberation

Authors: Uwe Zaeh, Korbinian Riedhammer, Tobias Bocklet, Elmar Noeth

Reverberation was a very popular topic at this conference, and it's especially important for meetings. Different speech systems require different kinds of noise cancellation. Far-field microphones need to compensate for room reverberation, close-talking microphones need to fix clicks, and so on. Far-field setups sometimes run a calibration step to estimate reverberation. This defines the set of components sphinx4 could have in order to deal with various environmental conditions. Right now they are simply missing.
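
For readers who haven't seen it, plain spectral subtraction can be sketched in a few lines. This is the generic textbook form working on magnitude spectra, not the calibration procedure from the paper; the `alpha` and `floor` parameters and the toy numbers are assumptions for illustration, and the "calibration frames" stand in for whatever a system records during its calibration step.

```python
def estimate_noise(calibration_frames):
    """Average magnitude per frequency bin over the calibration frames."""
    n = len(calibration_frames)
    return [sum(bins) / n for bins in zip(*calibration_frames)]


def spectral_subtract(frame, noise, alpha=1.0, floor=0.05):
    """Subtract the scaled noise magnitude from each bin.

    The result is never allowed to drop below floor * original magnitude,
    which avoids negative magnitudes after subtraction.
    """
    return [max(m - alpha * d, floor * m) for m, d in zip(frame, noise)]


# Two toy calibration frames with three frequency bins each.
calibration = [[1.0, 2.0, 1.0], [3.0, 2.0, 1.0]]
noise = estimate_noise(calibration)

# One observed frame; the last bin is entirely explained by the estimate.
cleaned = spectral_subtract([5.0, 2.5, 1.0], noise)
print(noise, cleaned)
```

A component like this would sit early in the frontend, before feature extraction, with the noise estimate refreshed whenever calibration is rerun.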

Detecting Local Semantic Concepts in Environmental Sounds using Markov Model

Authors: Keansub Lee, Daniel Ellis, Alexander Loui

Interestingly, the classification database for this task is available at http://labrosa.ee.columbia.edu/projects/consumervideo/, so it could be a basis for non-speech recognition research.

Learning Task-Dependent Speech Variability In Discriminative Acoustic Model Adaptation

Authors: Shoei Sato, Takahiro Oku, Shinichi Homma, Akio Kobayashi, Toru Imai

Discriminative approaches are popular nowadays. Direct optimization of the cost function can be used at various stages of the training process. In this work, for example, the set of subword units is selected to minimize the decoding error rate.

An Improved Consensus-Like method for Minimum Bayes Risk Decoding and Lattice Combination

Authors: Haihua Xu, Daniel Povey, Lidia Mangu, Jie Zhu

This deals with a specific criterion for lattice decoding. Not just the best path can be chosen; other criteria like consensus can also be applied. For me personally, it would be very interesting to formalize and apply a criterion that ensures grammatical correctness of the result. I haven't found anything on this yet.
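
To show what "a criterion other than best path" means, here is a small sketch of minimum Bayes risk selection over an N-best list instead of a lattice. It is a simplification and not the algorithm from the paper: the hypotheses and posteriors are invented, and the risk is plain expected word-level edit distance against the other hypotheses.

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance between two word lists."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution or match
        prev = cur
    return prev[-1]


def mbr_decode(nbest):
    """Pick the hypothesis with the lowest expected edit distance.

    nbest: list of (word_list, posterior) pairs; posteriors should sum to 1.
    """
    def risk(hyp):
        return sum(p * edit_distance(hyp, other) for other, p in nbest)
    return min((hyp for hyp, _ in nbest), key=risk)


# Invented N-best list: the top-scoring hypothesis is an outlier.
nbest = [
    ("wreck a nice beach".split(), 0.40),
    ("recognize speech".split(), 0.35),
    ("recognize peach".split(), 0.25),
]
map_hyp = max(nbest, key=lambda item: item[1])[0]
mbr_hyp = mbr_decode(nbest)
print(" ".join(map_hyp), "|", " ".join(mbr_hyp))
```

In this toy case the highest-posterior hypothesis and the minimum-risk hypothesis differ, because the two similar hypotheses together outweigh the single best one. A grammaticality criterion would amount to swapping in a different loss function.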

Discriminative training based on an integrated view of MPE and MMI in margin and error space

Authors: Erik McDermott, Shinji Watanabe, Atsushi Nakamura
                   
It's interesting to find out that real math is going into ASR. Basically it was a long-awaited thing, and it seems it was started by Georg Heigold with his work on MMI and other methods. It would be nice to review this area to get some idea of what the outcome is. Heigold was cited in almost every presentation, so it's really getting popular.

Balancing False Alarms and Hits in Spoken Term Detection

Authors: Carolina Parada, Abhinav Sethy, Bhuvana Ramabhadran

It's interesting to see what tools are used. WFSTs are very convenient and used by everyone: IBM, Google, AT&T. This is also a topic for a separate post.

Bayesian Analysis of Finite Gaussian Mixtures

Authors: Mark Morelande, Branko Ristic

A rather old idea (there are similar papers from 1998): use Bayesian learning to estimate the number of mixtures in the model. I'm in favor of an approach that estimates all model parameters, including the number of mixtures, language weight and number of senones, at once.

Improving Speech Recognition by Explicit Modeling of Phone Deletions

Authors: Tom Ko

Modeling pronunciation variation by phone deletion looks very promising, since traditional linguists mostly complain about the sequential HMM model, which doesn't handle deletions correctly. Unfortunately, the effect seems to be small. The improvement cited is only from 91.5% to 92%.

An Efficient Beam Pruning With A Reward Considering The Potential To Reach Various Words

Authors: Tsuneo Kato, Kengo Fujita, Nobuyuki Nishizawa

Beam pruning according to the number of reachable words or some other risk function. It's a good idea to implement in sphinx4 to speed up recognition. The speedup factor cited is 1.2 for a large vocabulary.
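
Here is how I read the idea, as a rough sketch only. The reward form (a weighted log of the number of reachable words), the parameter names and the toy scores are my assumptions, not the function used in the paper.

```python
import math


def prune(hyps, beam=10.0, reward_weight=1.0):
    """Beam pruning with a bonus for hypotheses that can still reach many words.

    hyps: list of (name, log_score, reachable_words), reachable_words >= 1.
    A hypothesis survives if its rewarded score is within `beam` of the best.
    """
    adjusted = [(name, score + reward_weight * math.log(reach))
                for name, score, reach in hyps]
    best = max(a for _, a in adjusted)
    return [name for name, a in adjusted if a >= best - beam]


# Toy active list: a lexicon-tree root state, a word-final state, and a
# state deep in a sparse branch of the tree.
hyps = [
    ("root", -106.0, 1000),
    ("leaf", -95.0, 1),
    ("mid", -108.0, 5),
]
with_reward = prune(hyps)
plain_beam = prune(hyps, reward_weight=0.0)
print(with_reward, plain_beam)
```

With the reward, the root state survives even though its raw score is outside the beam, so the beam itself could be made narrower without losing hypotheses that still lead to many words. That is where the speedup would come from.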

That's it. Unfortunately I missed Friday and most early mornings, so there could have been something interesting there. I'm sure you could select your own set. It would be interesting to look at it.