Written by Nickolay Shmyrev
On CMUCLMTK
I've rebuilt the Nexiwave language models and met some issues which would be nice to solve one day. The CMU language model toolkit is a nice, simple piece of software, but it definitely lacks many features that are required to build a good language model. So, thinking about the features a language modelling toolkit could provide, I created a list.
None of the existing toolkits have all of these features in place, which leads to unnecessary Java, Perl and Python coding every time I rebuild the model. It would be nice to see this list turned into software one day.
Language is a live, constantly changing structure. It's interesting that in 2006 nobody used words like "twitter", "obama" or "facebook". New words like "ipad" arrive every day and quickly become actively used. That makes it senseless to collect any static database to train the language model on. While pronunciation and acoustics change slowly (well, not so slowly now: with Skype used everywhere, the acoustics of speech have changed very significantly over the last years!), the language model needs to be adapted very quickly to stay current. Another problem is that Sphinx4 is not very robust to unknown words: if an unknown word is met, it basically screws up the whole utterance around it. That's why it's important to have an up-to-date vocabulary. Maybe it can be a small dynamic language model combined with a huge static one, I'm not sure.
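One way this combination could work is simple linear interpolation of a big static model with a small, frequently rebuilt dynamic one. Below is a minimal sketch, assuming toy unigram tables and an arbitrary interpolation weight; it's an illustration of the idea, not something cmuclmtk provides.

```python
# A minimal sketch of mixing a large static model with a small, frequently
# rebuilt "dynamic" model by linear interpolation. The unigram tables and the
# weight lam are illustrative assumptions.
import math

static_lm = {"the": 0.05, "president": 0.0001}                   # big, slowly changing
dynamic_lm = {"the": 0.04, "ipad": 0.002, "president": 0.0005}   # small, rebuilt daily

def interpolate(p_static, p_dynamic, lam=0.1):
    """Linear interpolation; lam is the weight given to the dynamic model."""
    return (1.0 - lam) * p_static + lam * p_dynamic

def logprob(word, lam=0.1, floor=1e-7):
    p = interpolate(static_lm.get(word, floor), dynamic_lm.get(word, floor), lam)
    return math.log10(p)

print(logprob("ipad"))       # mostly covered by the dynamic model
print(logprob("president"))  # covered by both models
```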
It can be clearly seen that Gigaword doesn't have enough coverage of modern terms. Models like lm_giga are nice, but they only work for old books. We need something live.
Another issue is where to find the texts. Unfortunately, really relevant spoken transcriptions aren't available; only companies which manually transcribe speech have them, I think. Written text is very different from spoken language.
So we need to collect data in real time from Twitter, Facebook, Google, Wikipedia, YouTube and the rest of the net. We also need to be able to process this data, classify it, convert it to a spoken form and train the language model on the result. The issues here are that crawled text is often corrupted, full of spelling errors and spam; making it usable is a huge task.
Google and Bing use a brute-force approach here: they just collect everything and hope it will be good enough. That can be seen in their n-gram data. I'm not sure this approach is helpful for ASR.
So, the features I'd like to see:
- Automated spelling error detection and correction for a unified written form. That includes automated abbreviation and number expansion, for which you need a good NLP component that can identify part of speech and other word properties (see the normalization sketch after this list). The interesting thing here is that the unified written form must be spelling-oriented: for example, "going to" and "gonna" are pronounced differently while from the language point of view they are identical. I haven't decided what to do with "gonna" yet.
- Automatic vocabulary selection. While in theory the decoder should operate with an unlimited vocabulary, in practice it's better to have a smaller one with good coverage. It's very important to be able to filter out common spelling errors here (see the vocabulary selection sketch after this list).
- The toolkit should support crawling from major sources like Twitter, Wikipedia and others.
- Though NLP is mentioned above, I think it shouldn't be used only for expansion. The toolkit should support many NLP features to be able to create more complex language models than simple n-grams.
I think it's a myth that n-gram models describe language well. An n-gram language model is not effective enough at rejecting transcriptions that aren't possible in the language at all. We already changed the decoder to penalize trigrams which aren't common in the language model and require backoff, but this change turned out to be not effective enough.
There is a nice idea of discriminative correction in the Joshua machine translation toolkit, for example:
Large-scale Discriminative n-gram Language Models for Statistical Machine Translation
Zhifei Li and Sanjeev Khudanpur
The paper has quite an interesting problem statement, correcting decoding results which are not possible in the language, but a solution that gives only a 5% improvement is certainly not worth the attention. We need to embed language knowledge deep in the search.
- The toolkit must be designed with parallel hardware in mind. Everything is getting distributed, and with today's information volumes the ability to process data in parallel is a hard requirement (a sketch of parallel counting is among the examples below).
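For the normalization item above, here is a rough sketch of what abbreviation and number expansion could look like. The abbreviation table and the number rules are hypothetical placeholders; a real implementation would need the NLP component mentioned above to disambiguate tokens by part of speech.

```python
# A minimal normalization sketch: expand a few abbreviations and integers below
# one hundred into a spelled-out written form. All tables here are illustrative.
import re

ABBREVIATIONS = {"dr.": "doctor", "st.": "street", "etc.": "et cetera"}
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]
TEENS = ["ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
         "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]

def number_to_words(n):
    """Spell out integers below 100; anything larger is left for a fuller expander."""
    if n < 10:
        return ONES[n]
    if n < 20:
        return TEENS[n - 10]
    if n < 100:
        return TENS[n // 10] + ("" if n % 10 == 0 else " " + ONES[n % 10])
    return str(n)

def normalize(text):
    words = []
    for token in text.lower().split():
        if token in ABBREVIATIONS:
            words.append(ABBREVIATIONS[token])
        elif re.fullmatch(r"\d+", token):
            words.append(number_to_words(int(token)))
        else:
            words.append(token)
    return " ".join(words)

print(normalize("Dr. Smith lives at 42 Main St."))
# -> doctor smith lives at forty two main street
```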
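For vocabulary selection, the simplest baseline is to keep the most frequent words and drop rare forms, which in crawled text tend to be spelling errors or noise. A minimal sketch, where the size limit, the count threshold and the trusted word list are all assumptions:

```python
# Keep the top-N most frequent words; drop rare tokens unless they appear in a
# trusted reference dictionary. Numbers here are illustrative, not recommended values.
from collections import Counter

def select_vocabulary(corpus_lines, top_n=65000, min_count=3, trusted=frozenset()):
    counts = Counter()
    for line in corpus_lines:
        counts.update(line.split())
    vocab = []
    for word, count in counts.most_common():
        if len(vocab) >= top_n:
            break
        # Rare unknown forms are likely typos or crawl noise; skip them.
        if count < min_count and word not in trusted:
            continue
        vocab.append(word)
    return vocab

corpus = ["the president spoke today", "the ipad launch was today", "teh president"]
print(select_vocabulary(corpus, top_n=10, min_count=2))
# -> ['the', 'president', 'today']  ("teh" is filtered out as a rare misspelling)
```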
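As for the parallel requirement, the counting step is the natural place to start: shard the corpus, count n-grams per shard, and merge. A minimal sketch with Python multiprocessing; the shard file names are hypothetical, and a real toolkit would merge counts on disk, map-reduce style, rather than in memory.

```python
# A minimal sketch of parallel trigram counting over corpus shards.
from collections import Counter
from multiprocessing import Pool

def count_trigrams(path):
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            words = line.split()
            for i in range(len(words) - 2):
                counts[tuple(words[i:i + 3])] += 1
    return counts

def merge(counters):
    total = Counter()
    for c in counters:
        total.update(c)
    return total

if __name__ == "__main__":
    shards = ["corpus-00.txt", "corpus-01.txt"]  # hypothetical shard names
    with Pool() as pool:
        trigram_counts = merge(pool.map(count_trigrams, shards))
    print(len(trigram_counts), "distinct trigrams")
```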
Quite a long list, to be honest: a few years of coding on top of cmuclmtk. It needs to be done anyway.
A paper on the subject:
Ivan Bulyko, Mari Ostendorf, Manhung Siu, Tim Ng, Andreas Stolcke. Web resources for language modeling in conversational speech recognition.