Written by Nickolay Shmyrev
Multiview Representations On Interspeech
From my experience, it's important to have a multilevel view of any activity; interestingly, this idea is both part of
Getting Things Done and simply a good practice in
software development. Multiple models of the process, or just different views of it, help you understand what's going on. The only problem is keeping those views consistent. That reminds me of the
Russian model of the world.
So it's actually very interesting to get a high-level overview of what's going on in speech recognition. Luckily, to do that you just need to review some conference materials or journal articles. The latter is more complicated, while the former is feasible. So here are some topics from the plenary talks at Interspeech. Surprisingly, they are quite consistent with each other, and I hope they really represent trends, not just selected topics.
Speech To Information by Mari Ostendorf
Multilevel representation is becoming more and more important, in particular in speech recognition. The most complicated task, spontaneous meeting transcription, requires unifying the recognition effort on all levels, from the acoustic representation to the semantic one. It's nice to call this approach "Speech To Information": as a result of speech recognition, not just the words are recovered but also the syntactic and semantic structure of the talk. One of the interesting tasks, for example, is the restoration of punctuation and capitalization, something that SRILM
does.
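As a rough illustration of the idea, here is a minimal Python sketch of punctuation restoration as "hidden event" tagging, the approach behind SRILM's hidden-ngram tool: between words we pick the hidden token (nothing, comma, or period) that an n-gram model scores best. The toy counts and the greedy left-to-right search are my own simplifications; SRILM does a proper search over a real language model.

```python
from collections import defaultdict

HIDDEN = ["", ",", "."]  # candidate hidden events between words

def train_bigrams(sentences):
    """Count bigrams over tokenized sentences that keep punctuation as tokens."""
    counts = defaultdict(lambda: defaultdict(int))
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        for a, b in zip(tokens, tokens[1:]):
            counts[a][b] += 1
    return counts

def bigram_prob(counts, a, b):
    # add-one smoothing so unseen bigrams keep a small probability
    total = sum(counts[a].values())
    return (counts[a][b] + 1) / (total + 1000)

def restore(words, counts):
    """Greedily pick the best hidden event after each word (toy version)."""
    out, prev = [], "<s>"
    for i, w in enumerate(words):
        out.append(w)
        prev = w
        if i + 1 < len(words):
            nxt = words[i + 1]
            best = max(HIDDEN, key=lambda h:
                       bigram_prob(counts, prev, h) * bigram_prob(counts, h, nxt)
                       if h else bigram_prob(counts, prev, nxt))
            if best:
                out.append(best)
                prev = best
    # pick the sentence-final punctuation that best precedes </s>
    final = max([",", "."], key=lambda h:
                bigram_prob(counts, prev, h) * bigram_prob(counts, h, "</s>"))
    out.append(final)
    return " ".join(out)

# toy training data with punctuation kept as separate tokens
train = [["hello", ",", "world", "."], ["hello", "world", "."]]
counts = train_bigrams(train)
print(restore(["hello", "world"], counts))  # -> "hello world ."
```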
The good thing is that a testing database for such material is already
available for free download. It's a very uncommon situation to have such a representative database freely accessible. The AMI corpus looks like an amazing piece of work.
Single Method by Sadaoki Furui
The WFST-based T3 decoder looks quite impressive. A single method of data representation used everywhere, which more importantly allows models to be combined, opens up wonderful opportunities. For example, consider building a high-quality Icelandic ASR system by combining the WFST of an English system with a very basic Icelandic one. I imagine the decoder is really simple, since basically all the structures, including G2P rules, the language model, and the acoustic model, can be weighted finite-state transducers.
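To make the appeal of a single representation concrete, here is a toy Python sketch of weighted composition, the one generic operation that glues a lexicon, grammar, or G2P model together once they are all transducers. The tiny FST encoding and the example lexicon and grammar are my own illustration, not the T3 decoder's actual structures.

```python
# An FST here is a dict: a start state, a set of final states, and arcs
# (src, in_label, out_label, weight, dst). Weights are negative log
# probabilities (tropical semiring: add along a path, min over paths).

def compose(a, b):
    """Naive composition: output labels of a must match input labels of b.
    "<eps>" is treated as an ordinary symbol; real implementations
    need a proper epsilon filter."""
    arcs, finals = [], set()
    start = (a["start"], b["start"])
    stack, seen = [start], {start}
    while stack:
        s1, s2 = stack.pop()
        if s1 in a["final"] and s2 in b["final"]:
            finals.add((s1, s2))
        for (p1, i1, o1, w1, q1) in a["arcs"]:
            if p1 != s1:
                continue
            for (p2, i2, o2, w2, q2) in b["arcs"]:
                if p2 == s2 and o1 == i2:  # a's output feeds b's input
                    dst = (q1, q2)
                    arcs.append(((s1, s2), i1, o2, w1 + w2, dst))
                    if dst not in seen:
                        seen.add(dst)
                        stack.append(dst)
    return {"start": start, "final": finals, "arcs": arcs}

# L: toy pronunciation lexicon, phones -> words ("h ae" -> "ha")
L = {"start": 0, "final": {2},
     "arcs": [(0, "h", "ha", 0.0, 1), (1, "ae", "<eps>", 0.0, 2)]}
# G: toy grammar over words (accepts the single word "ha")
G = {"start": 0, "final": {1},
     "arcs": [(0, "ha", "ha", 0.7, 1), (1, "<eps>", "<eps>", 0.0, 1)]}

LG = compose(L, G)
print(LG["arcs"])  # phones in, words out, weights combined
```

Because the result of composition is again a transducer, the same decoder core can search any such combination, which is presumably what makes bootstrapping an Icelandic system from an English one so natural.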
Bayesian Learning by Tom Griffiths
Hierarchical Bayesian learning and things like
compressed sensing seem to be hot topics in machine learning. Google does
that. There are already some efforts to implement a speech recognizer based on hierarchical Bayesian learning. Indeed, it looks impressive to just feed audio to the recognizer and have it understand you.
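For readers unfamiliar with compressed sensing, here is a minimal self-contained sketch of its core claim: a sparse signal can be recovered from far fewer random measurements than its dimension via L1-regularized reconstruction. The sizes, the random matrix, and the plain ISTA solver are illustrative choices of mine, not anything from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 200, 60, 5            # signal length, measurements, nonzeros

x_true = np.zeros(n)            # sparse ground-truth signal
x_true[rng.choice(n, k, replace=False)] = rng.normal(size=k)

A = rng.normal(size=(m, n)) / np.sqrt(m)   # random measurement matrix
y = A @ x_true                             # m measurements, m << n

# ISTA: minimize 0.5*||Ax - y||^2 + lam*||x||_1
lam, step = 0.01, 1.0 / np.linalg.norm(A, 2) ** 2
x = np.zeros(n)
for _ in range(2000):
    g = A.T @ (A @ x - y)                  # gradient of the quadratic term
    z = x - step * g
    x = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0)  # soft threshold

print("relative recovery error:",
      np.linalg.norm(x - x_true) / np.linalg.norm(x_true))
```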
Though the probabilistic point of view was always questionable compared to precise discriminative methods like MPE, I'm still looking forward to seeing progress here. Even though a huge amount of audio is required (I remember estimates of about 100,000 hours), I think it's feasible nowadays. For example, such a system already recognizes written digits, so success looks really close. And again, it's also multilevel!