Building a language model for dialogs

I'm looking into how to build a combined language model suitable for dialog decoding. I have quite a lot of dialog transcriptions, but in terms of coverage they can't compete with a generic model built from large corpora. It would be nice to combine them somehow to get the structure of the first model and the diversity of the second one. In one article I read that it's possible to just interpolate them linearly, so probably I just need to get better acquainted with the SRILM toolkit.
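
To make the idea concrete, here's a minimal Python sketch of linear interpolation, not SRILM itself, just the concept: mix the two models' probabilities with a weight λ and pick the λ that minimises perplexity on held-out dialog data. The toy model dictionaries, the probability floor and the `held_out` set are made-up placeholders.

```python
import math

# Toy stand-ins for the two models: mapping (history, word) -> probability.
# In practice these would be full n-gram models (e.g. ARPA files).
dialog_lm  = {(("how", "are"), "you"): 0.4, (("are", "you"), "doing"): 0.3}
generic_lm = {(("how", "are"), "you"): 0.2, (("are", "you"), "doing"): 0.1}

FLOOR = 1e-7  # crude fallback for unseen n-grams, instead of real backoff

def prob(lm, history, word):
    return lm.get((history, word), FLOOR)

def interp_prob(history, word, lam):
    """Linear interpolation: lam * P_dialog + (1 - lam) * P_generic."""
    return lam * prob(dialog_lm, history, word) + (1 - lam) * prob(generic_lm, history, word)

def perplexity(data, lam):
    log_sum = sum(math.log(interp_prob(h, w, lam)) for h, w in data)
    return math.exp(-log_sum / len(data))

# Pick the mixing weight that minimises perplexity on held-out dialog data.
held_out = [(("how", "are"), "you"), (("are", "you"), "doing")]
best_lam = min((l / 10 for l in range(1, 10)), key=lambda l: perplexity(held_out, l))
print("best lambda:", best_lam)
```

SRILM seems to have this built in (its `ngram` tool can mix two models and write out the result), which is the main reason I want to dig into the toolkit rather than roll my own.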

It's discouraging that sphinx4 doesn't support high-order n-grams. Another article mentions a workaround: joining frequent word combinations into compound words, as in the sketch below.
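
A minimal sketch of that compound-word trick, assuming a whitespace-tokenised corpus: count adjacent word pairs and glue the frequent ones together with an underscore, so a trigram model effectively sees longer context. The frequency threshold and the joining convention here are my own assumptions.

```python
from collections import Counter

def find_frequent_bigrams(sentences, min_count=50):
    """Count adjacent word pairs across the corpus and keep the frequent ones."""
    counts = Counter()
    for sent in sentences:
        words = sent.split()
        counts.update(zip(words, words[1:]))
    return {pair for pair, c in counts.items() if c >= min_count}

def join_compounds(sentence, compounds):
    """Greedily merge adjacent words that form a frequent pair into one token."""
    words = sentence.split()
    out, i = [], 0
    while i < len(words):
        if i + 1 < len(words) and (words[i], words[i + 1]) in compounds:
            out.append(words[i] + "_" + words[i + 1])
            i += 2
        else:
            out.append(words[i])
            i += 1
    return " ".join(out)

corpus = ["thank you very much for calling", "thank you very much indeed"]
compounds = find_frequent_bigrams(corpus, min_count=2)
print(join_compounds("thank you very much", compounds))  # -> "thank_you very_much"
```

Of course the same joining would have to be reflected in the pronunciation dictionary, so the decoder actually knows about the compound tokens.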

Btw, the generic model gives 40% accuracy while the home-grown dialog model gives 60%, so it's a promising direction anyhow.