Multiview Representations at Interspeech

From my experience, it's important to have a multilevel view of any activity; interestingly, this is both part of Getting Things Done and simply good practice in software development. Multiple models of a process, or just different views of it, help you understand what's going on. The only problem is keeping those views consistent. That reminds me of the Russian model of the world.

So it's actually very interesting to get a high-level overview of what's going on in speech recognition. Luckily, to do that you just need to review some conference materials or journal articles. The latter is more complicated, while the former is feasible. So here are some topics from the plenary talks at Interspeech. Surprisingly, they are rather consistent with each other, and I hope they really represent trends, not just selected topics.

Speech To Information
by Mari Ostendorf

Multilevel representation is getting more and more important, in particular in speech recognition. The most complicated task, transcription of spontaneous meetings, requires unification of the recognition effort on all levels, from the acoustic representation to the semantic one. It's nice to call this approach "Speech To Information": as a result of speech recognition, not just the words are recovered but also the syntactic and semantic structure of the talk. One of the interesting subtasks is, for example, restoration of punctuation and capitalization, something that SRILM can do.
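
For a rough idea of how punctuation restoration can work, here is a minimal sketch of a hidden-event style approach in Python: punctuation marks are treated as hidden tokens between words, and at each word boundary the best-scoring event is picked with a simple bigram score. The probabilities in `bigram_logp` are made up just so the example runs; SRILM's hidden-ngram tool does the real job on top of a proper language model and a full search rather than this greedy pass.

```python
import math

# Toy "language model": log-probabilities of (previous token, next token) pairs.
# In a real system these would come from an n-gram LM trained on punctuated text.
def bigram_logp(prev, word):
    table = {
        ("well", ","): -0.5, (",", "i"): -0.7, ("well", "i"): -2.0,
        ("i", "agree"): -0.5, ("agree", "."): -0.3, (".", "</s>"): -0.1,
    }
    return table.get((prev, word), -5.0)  # flat back-off penalty for unseen pairs

EVENTS = ["", ",", "."]  # hidden events allowed between words

def restore_punctuation(words):
    """Greedy hidden-event tagging: after each word, insert the punctuation
    mark (or nothing) that maximizes the local bigram score."""
    out = []
    for i, word in enumerate(words):
        out.append(word)
        nxt = words[i + 1] if i + 1 < len(words) else "</s>"
        best = max(EVENTS, key=lambda e:
                   (bigram_logp(word, e) + bigram_logp(e, nxt)) if e
                   else bigram_logp(word, nxt))
        if best:
            out.append(best)
    return " ".join(out)

print(restore_punctuation(["well", "i", "agree"]))
# -> "well , i agree ."
```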

The good thing is that a test database for such material is already available for free download. It's a very uncommon situation to have such a representative database freely accessible. The AMI corpus looks like an amazing piece of work.

Single Method
by Sadaoki Furui

The WFST-based T3 decoder looks quite impressive. A single method of data representation used everywhere, which, more importantly, allows combining models, opens up wonderful opportunities. Consider, for example, building a high-quality Icelandic ASR system by combining a WFST built for English with a very basic Icelandic one. I imagine the decoder is really simple, since basically all structures, including G2P rules, the language model and the acoustic model, can be weighted finite-state transducers.
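
To make the "everything is a transducer" idea concrete, here is a minimal sketch in Python of composing two weighted transducers in the tropical semiring: weights add along a path, and composition matches the output labels of the first machine against the input labels of the second. The toy lexicon and grammar arcs are invented for illustration; a real system would build these with something like OpenFst and let the decoder search the composed machine.

```python
# A weighted FST here is a tuple (start, finals, arcs) where:
#   finals: dict state -> final weight
#   arcs:   dict state -> list of (in_label, out_label, weight, next_state)
# Weights are tropical (think negative log-probs), so they add along a path.

def compose(fst1, fst2):
    """Product construction: match output labels of fst1 with input labels
    of fst2 (no epsilon handling, to keep the sketch short)."""
    s1, f1, a1 = fst1
    s2, f2, a2 = fst2
    start = (s1, s2)
    finals, arcs, stack, seen = {}, {}, [start], {start}
    while stack:
        q1, q2 = q = stack.pop()
        if q1 in f1 and q2 in f2:
            finals[q] = f1[q1] + f2[q2]
        for i1, o1, w1, n1 in a1.get(q1, []):
            for i2, o2, w2, n2 in a2.get(q2, []):
                if o1 == i2:                      # labels must match
                    nq = (n1, n2)
                    arcs.setdefault(q, []).append((i1, o2, w1 + w2, nq))
                    if nq not in seen:
                        seen.add(nq)
                        stack.append(nq)
    return start, finals, arcs

# Toy example: a one-arc "lexicon" mapping phones to a word, composed with
# a one-arc "grammar" weighting that word. All symbols and weights invented.
L = (0, {1: 0.0}, {0: [("k ae t", "cat", 0.5, 1)]})
G = (0, {1: 0.0}, {0: [("cat", "cat", 1.2, 1)]})
print(compose(L, G))
# -> ((0, 0), {(1, 1): 0.0}, {(0, 0): [('k ae t', 'cat', 1.7, (1, 1))]})
```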

Bayesian Learning
by Tom Griffiths

Hierarchical Bayesian learning and things like compressed sensing seem to be hot topics in machine learning. Google does that. There are already some efforts to implement a speech recognizer based on hierarchical Bayesian learning. Indeed, it looks impressive to just feed audio to the recognizer and have it understand you.
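
To show what "hierarchical" means here, below is a minimal sketch, in Python with NumPy, of about the simplest hierarchical Bayesian model there is: several groups of observations share a common prior over their means, and a short Gibbs sampler estimates both the group means and the shared hyperparameter. The data and hyperparameters are invented for illustration; this is nothing like a full speech model, just the two-level idea.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented data: a few "groups" (think speakers), each with noisy observations
# around its own mean; the group means themselves share a common prior.
groups = [rng.normal(loc=m, scale=1.0, size=20) for m in (2.0, 2.5, 3.5)]

obs_var = 1.0              # assumed known observation variance
prior_var = 1.0            # assumed known variance of group means around mu_shared
mu0, tau0 = 0.0, 10.0 ** 2  # vague hyperprior on the shared mean

theta = np.array([g.mean() for g in groups])  # initialize group means

for _ in range(2000):  # Gibbs sampling
    # 1. Resample the shared hyperparameter given the current group means.
    k = len(theta)
    post_var = 1.0 / (1.0 / tau0 + k / prior_var)
    post_mean = post_var * (mu0 / tau0 + theta.sum() / prior_var)
    mu_shared = rng.normal(post_mean, np.sqrt(post_var))

    # 2. Resample each group mean given its data and the shared prior.
    for j, g in enumerate(groups):
        n = len(g)
        v = 1.0 / (1.0 / prior_var + n / obs_var)
        m = v * (mu_shared / prior_var + g.sum() / obs_var)
        theta[j] = rng.normal(m, np.sqrt(v))

print("estimated group means:", np.round(theta, 2))
print("estimated shared mean:", round(mu_shared, 2))
```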

Although the probabilistic point of view has always been questionable compared to precise discriminative methods like MPE, I'm still looking forward to seeing progress here. A huge amount of audio is required (I remember estimates of about 100,000 hours), but I think it's feasible nowadays. For example, such a system already recognizes handwritten digits, so success looks really close. And again, it's also multilevel!