Written by
Nickolay Shmyrev
on
ICASSP 2010
So, I'm back from ICASSP in Dallas, TX. It was a very impressive conference with lots of interesting and inspiring presentations, meetings and discussions. Amazingly, everyone was there, and I've finally met all the speech people who have guided me for such a long time. I met ASR people Bhiksha Raj, David Huggins-Daines, Rita Singh and Richard Stern, and TTS people Alan W. Black, Keiichi Tokuda, Simon King and Heiga Zen. I was pleased to meet wonderful guys like Evandro and Peter for the second time. It's also worth mentioning that I was able to listen to talks by famous people like Hynek Hermansky.
We had the Sphinx Users and Developers Workshop there and also two CMU Sphinx development planning meetings, but they are a subject for another post. This one is just about interesting ideas presented at the conference by other people. I didn't have time to attend every presentation; I think that was impossible. You have to find time for sightseeing, and there were often two or three parallel lecture sessions plus poster presentations, which I liked most. I think a poster presentation is the best way to reach the author, ask questions and get feedback. Many posters were so popular that it was almost impossible to get to the stand.
Anyway, the number of talks I've collected already exceeds what can be consumed in a week. It would be nice to one day have all the information about current research collected into a structured or wiki-like resource. That's a huge amount of work for the future though.
So here are some presentations and ideas I came across there and found worth attention:
Robust Speaking Rate Estimation Using Broad Phonetic Class Recognition
Authors: Jiahong Yuan, University of Pennsylvania; Mark Liberman, University of Pennsylvania
The presented work is about using a simple classifier to extract specific information about speech, for example to estimate syllable deletions and thus speech quality. This is actually a very promising approach which for some reason is ignored in most places where it seems practical. For example, it's not clear to me why speaker identification frameworks don't try to find phonetic classes first and build GMMs only after that. It seems to be a natural way to improve SID performance.
Broad phonetic classes remind me of the idea from the famous face detection algorithm by Viola and Jones, which applies a cascade of classifiers for fast classification. I think this idea could be applied to speech in some form, as the authors suggest.
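To show how cheap broad-class output can already give a useful measure, here is a minimal sketch of speaking rate estimation. It assumes a broad-class recognizer has already produced timed segments, and it treats every vowel segment as one syllable nucleus. The labels `V`, `C` and `SIL` and the function `speaking_rate` are my own illustration, not the class inventory or the method from the paper.

```python
from typing import List, Tuple

# Hypothetical broad-class labels; the real class inventory may differ.
VOWEL = "V"
SILENCE = "SIL"

# (broad class label, start time in seconds, end time in seconds)
Segment = Tuple[str, float, float]


def speaking_rate(segments: List[Segment]) -> float:
    """Estimate syllables per second, counting each vowel segment as one nucleus."""
    # Only non-silence time counts as speech time.
    speech_time = sum(end - start for label, start, end in segments
                      if label != SILENCE)
    if speech_time <= 0:
        return 0.0
    nuclei = sum(1 for label, _, _ in segments if label == VOWEL)
    return nuclei / speech_time


segments = [
    ("SIL", 0.0, 0.5),
    ("C", 0.5, 0.6),
    ("V", 0.6, 0.75),
    ("C", 0.75, 0.85),
    ("V", 0.85, 1.0),
    ("SIL", 1.0, 1.5),
]
print(round(speaking_rate(segments), 2))
```

Excluding silence from the denominator is a design choice here; including pauses would give a rate that also reflects how often the speaker stops.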
Clap Your Hands! Calibrating Spectral Subtraction for Reverberation
Authors: Uwe Zaeh, Korbinian Riedhammer, Tobias Bocklet, Elmar Noeth
Reverberation was a very popular topic at this conference, and it's especially important for meeting recordings. Different speech systems require different kinds of noise cancellation. Far-field microphones need to compensate for room reverberation, close-talking microphones need to deal with clicks, and so on. Far-field setups sometimes run a calibration step to estimate the reverberation. This defines the set of components sphinx4 would need in order to deal with various environmental conditions. Right now they are simply missing.
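For reference, the core of plain spectral subtraction fits in a few lines. This is a generic textbook-style sketch and not the calibrated method from the paper: it assumes the magnitude spectrum of one frame and an estimate of the interfering magnitude (noise or late reverberation) are already available. The function name and the `alpha` and `floor` parameters are my own choices.

```python
from typing import List


def spectral_subtract(frame_mag: List[float], noise_mag: List[float],
                      alpha: float = 1.0, floor: float = 0.05) -> List[float]:
    """Subtract a scaled interference estimate from one magnitude spectrum.

    alpha is an over-subtraction factor. floor keeps a fraction of the
    original magnitude in each bin so the result never becomes negative.
    """
    if len(frame_mag) != len(noise_mag):
        raise ValueError("spectra must have the same number of bins")
    return [max(m - alpha * n, floor * m)
            for m, n in zip(frame_mag, noise_mag)]


cleaned = spectral_subtract([1.0, 0.5, 0.2], [0.2, 0.2, 0.3])
print(cleaned)
```

In the last bin the estimate is larger than the observed magnitude, so the spectral floor takes over. How the interference estimate is obtained is exactly the part a calibration step would have to provide.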
Detecting Local Semantic Concepts in Environmental Sounds using Markov Model
Authors: Keansub Lee, Daniel Ellis, Alexander Loui
Interestingly, the classification database for this task is available at http://labrosa.ee.columbia.edu/projects/consumervideo/. It could serve as a base for non-speech recognition research.
Learning Task-Dependent Speech Variability In Discriminative Acoustic Model Adaptation
Authors: Shoei Sato, Takahiro Oku, Shinichi Homma, Akio Kobayashi, Toru Imai
Discriminative approaches are popular nowadays. Direct optimization of the cost function can be used at various stages of the training process. In this work, for example, the set of subword units is selected to minimize the decoding error rate.
An Improved Consensus-Like method for Minimum Bayes Risk Decoding and Lattice Combination
Authors: Haihua Xu, Daniel Povey, Lidia Mangu, Jie Zhu
This deals with a specific criterion for lattice decoding. The best path is not the only option; other criteria like consensus can be applied as well. For me personally it would be very interesting to formalize and apply a criterion that ensures the grammatical correctness of the result. I haven't found anything on this yet.
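To make the difference between the best path and a minimum Bayes risk choice concrete, here is a small sketch over an N-best list instead of a lattice. It is a simplification and not the consensus-like algorithm from the paper: each hypothesis comes with a posterior, the loss is word-level edit distance, and the hypothesis with the lowest expected loss wins. The names `edit_distance` and `mbr_decode` and the toy posteriors are my own.

```python
from typing import List, Sequence, Tuple


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Word-level Levenshtein distance between two hypotheses."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, start=1):
        cur = [i]
        for j, wb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (wa != wb)))  # substitution or match
        prev = cur
    return prev[-1]


def mbr_decode(nbest: List[Tuple[List[str], float]]) -> List[str]:
    """Return the hypothesis with the minimum expected edit distance."""
    total = sum(p for _, p in nbest)
    best_hyp: List[str] = []
    best_risk = float("inf")
    for hyp, _ in nbest:
        # Expected loss of choosing hyp, under the normalized posteriors.
        risk = sum(p / total * edit_distance(hyp, other) for other, p in nbest)
        if risk < best_risk:
            best_hyp, best_risk = hyp, risk
    return best_hyp


nbest = [
    (["recognize", "speech"], 0.40),
    (["wreck", "a", "nice", "beach"], 0.35),
    (["wreck", "a", "nice", "peach"], 0.25),
]
print(" ".join(mbr_decode(nbest)))
```

Here the single most probable hypothesis is "recognize speech", but the two "wreck a nice ..." variants together hold more probability mass and differ by one word, so the MBR choice is one of them. A grammaticality criterion could in principle be plugged in as a different loss, but that part is speculation on my side.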
Discriminative training based on an integrated view of MPE and MMI in margin and error space
Authors: Erik McDermott, Shinji Watanabe, Atsushi Nakamura
It's interesting to find out that real math goes into ASR. Basically it was a long-awaited thing, and it seems to have been started by Georg Heigold with his work on MMI and other methods. It would be nice to review this area to get some idea of what the outcome is. Heigold was cited in almost every presentation, so it's really getting popular.
Balancing False Alarms and Hits in Spoken Term Detection
Authors: Carolina Parada, Abhinav Sethy, Bhuvana Ramabhadran
It's interesting to see what tools are used. WFSTs are very convenient and are used by everyone: IBM, Google, AT&T. This is also a topic for a separate post.
Bayesian Analysis of Finite Gaussian Mixtures
Authors: Mark Morelande, Branko Ristic
This is a rather old idea (there are similar papers from 1998): use Bayesian learning to estimate the number of mixture components in the model. I'm in favor of an approach that estimates all model parameters at once, including the number of mixture components, the language weight and the number of senones.
Improving Speech Recognition by Explicit Modeling of Phone Deletions
Authors: Tom Ko
Modeling pronunciation variation through phone deletion looks very promising, since traditional linguists mostly complain that the sequential HMM model doesn't handle deletions correctly. Unfortunately, the effect seems to be small. The improvement cited is only from 91.5% to 92%.
An Efficient Beam Pruning With A Reward Considering The Potential To Reach Various Words
Authors: Tsuneo Kato, Kengo Fujita, Nobuyuki Nishizawa
Beam pruning according to the number of reachable words or to some other risk function. It's a good idea to implement in sphinx4 to speed up recognition. The speedup factor cited is 1.2 for a large vocabulary.
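The general idea can be sketched like this. Each active hypothesis gets a bonus that grows with the number of words still reachable from its position in the lexical tree, and the beam is applied to the adjusted score. The logarithmic form of the reward, the `reward_weight` parameter and the `Hyp` structure are assumptions of mine for illustration; the paper defines its own reward.

```python
import math
from typing import List, NamedTuple


class Hyp(NamedTuple):
    state: str            # position in the (hypothetical) lexical tree
    log_score: float      # accumulated log score of the partial path
    reachable_words: int  # number of words still reachable from this state


def prune(hyps: List[Hyp], beam: float, reward_weight: float = 0.0) -> List[Hyp]:
    """Keep hypotheses whose reward-adjusted score is within `beam` of the best."""
    if not hyps:
        return []

    def adjusted(h: Hyp) -> float:
        # Assumed reward: proportional to the log of the reachable word count.
        return h.log_score + reward_weight * math.log(max(h.reachable_words, 1))

    best = max(adjusted(h) for h in hyps)
    return [h for h in hyps if adjusted(h) >= best - beam]


hyps = [
    Hyp("near-leaf", -10.0, 1),
    Hyp("near-root", -13.0, 1000),
    Hyp("middle", -16.0, 10),
]
plain = prune(hyps, beam=2.5)
rewarded = prune(hyps, beam=2.5, reward_weight=0.5)
print([h.state for h in plain], [h.state for h in rewarded])
```

With the plain beam only the near-leaf hypothesis survives. With the reward, the near-root hypothesis that can still reach many words is kept as well, which is what would allow a narrower beam without losing those paths.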
That's it. Unfortunately I missed Friday and most early mornings, so there could have been something interesting there too. I'm sure you could select your own set; it would be interesting to look at it.