Written by Nickolay Shmyrev
Training language model with fragments
Sequitur G2P by M. Bisani and H. Ney is a cool package for letter-to-phone translation: quite accurate and, most importantly, open. But there are also several hidden gems in this package :)
One of them is the phone-oriented segmenter that splits words into chunks called graphones. A graphone is a joint unit consisting of a few letters and the corresponding phones; words are built from sequences of graphones. Graphones are used internally in g2p, but they are also very useful for building open-vocabulary models. The system as a whole is described here:
Open Vocabulary Spoken Term Detection Using Graphone-Based Hybrid Recognition Systems by M. Akbacak, D. Vergyri and A. Stolcke
and the details of the language model are in the original article:
Open Vocabulary Speech Recognition with Flat Hybrid Models by Maximilian Bisani and Hermann Ney
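To make the graphone idea concrete, here is a small sketch of a word represented as a sequence of letter/phone chunks. The segmentation of "phone" is made up by hand for illustration and is not the output of g2p.py; the names `Graphone`, `spelling`, `pronunciation` and `fits_size_constraints` are hypothetical helpers, not part of Sequitur.

```python
from collections import namedtuple

# A graphone pairs a chunk of letters with the phones it is pronounced as.
Graphone = namedtuple("Graphone", ["letters", "phones"])

# Hand-made segmentation of "phone" (illustrative only, not produced by g2p.py).
segmentation = [
    Graphone("ph", ("F",)),
    Graphone("o", ("OW",)),
    Graphone("ne", ("N",)),
]


def spelling(graphones):
    """Concatenate the letter parts to recover the written word."""
    return "".join(g.letters for g in graphones)


def pronunciation(graphones):
    """Concatenate the phone parts to recover the pronunciation."""
    return [phone for g in graphones for phone in g.phones]


def fits_size_constraints(graphones, max_letters=2, max_phones=2):
    """Check that every graphone stays within the given size limits."""
    return all(
        len(g.letters) <= max_letters and len(g.phones) <= max_phones
        for g in graphones
    )
```

Because each graphone carries both spelling and pronunciation, a language model over graphones can spell out and pronounce words it has never seen, which is what makes them attractive for open-vocabulary recognition.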
The interesting thing is that all the required components are already available; the only issue is finding the right options and building the system. So the quick recipe is:
1. Get Sequitur G2p
2. Patch it to support Python 2.5 (replace elementtree with xml.etree, since elementtree is deprecated now)
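The patch in step 2 usually boils down to changing an import. The sketch below assumes the affected files import the standalone package as `from elementtree import ElementTree`; the exact import lines in Sequitur are not reproduced here, so treat this as an illustration of the idea and not as the actual diff.

```python
try:
    # Bundled with the standard library since Python 2.5
    import xml.etree.ElementTree as ElementTree
except ImportError:
    # Fallback for older interpreters with the standalone elementtree package
    from elementtree import ElementTree

# Quick sanity check that the parser works after the change
root = ElementTree.fromstring("<lexicon><orth>hello</orth></lexicon>")
first_orth = root.find("orth").text
```

Keeping the fallback branch means the same file still runs where only the old package is installed.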
3. Convert the cmudict lexicon to the XML-based Bliss format (I'm not sure what exactly it is; I failed to find any information about it on the web)
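import sys
from xml.sax.saxutils import escape

# Convert a cmudict-style lexicon (a word followed by its phones, one entry
# per line) to XML. Element names are kept exactly as in the original script;
# check them against the lexicon reader of your Sequitur version.
print("<lexicon>")
lexicon_file = open(sys.argv[1], "r")
for line in lexicon_file:
    # Skip cmudict comment lines, which start with ";;;"
    if line.startswith(";;;"):
        continue
    toks = line.strip().split()
    # Skip blank lines and entries without a pronunciation
    if len(toks) < 2:
        continue
    word = toks[0]
    phones = " ".join(toks[1:])
    print("<orth>")
    # Escape &, < and > so that the output stays well-formed XML
    print(escape(word))
    print("</orth>")
    print("<pron>")
    print(phones)
    print("</pron>")
lexicon_file.close()
print("</lexicon>")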
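4. Train the segmenter model. The hardest part is figuring out the options for training a multigram model whose graphones span several phones. The default model used in g2p consists of graphones with 1 letter and 1 phone, which is not suitable for an OOV language model. As far as I can tell, the -s 0,2,0,2 option allows 0 to 2 letters and 0 to 2 phones per graphone.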
g2p.py --model model-1 --ramp-up --train cmudict.0.7a.train --devel 5% --write-model model-2 -s 0,2,0,2
5. Ramp up the model to make it more precise
6. Build the language model. Here you need the dictionary in XML format. As the article above describes, the original lexicon should contain around 10k words, and the subliminal training lexicon should contain 50k or so.
makeOvModel.py --order=4 -l cmudict.xml --subliminal-lexicon=cmudict.xml.test -g model-2 --write-lexicon=res.lexicon --write-tokens=res.tokens
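The command above needs two lexica. One way to prepare them is sketched below, under the assumption that the two lexica are disjoint and that you already have a list of in-vocabulary words (for example the most frequent words of some corpus, which this post does not provide). `split_lexicon` and `base_words` are hypothetical names, not part of Sequitur.

```python
def split_lexicon(entries, base_words):
    """Split (word, phones) pairs into a base lexicon and a subliminal one.

    Words listed in base_words go to the base lexicon; every other word
    goes to the subliminal lexicon.
    """
    base, subliminal = [], []
    for word, phones in entries:
        if word in base_words:
            base.append((word, phones))
        else:
            subliminal.append((word, phones))
    return base, subliminal


# Tiny hand-made example; a real run would read the cmudict entries instead.
entries = [
    ("HELLO", "HH AH0 L OW1"),
    ("WORLD", "W ER1 L D"),
    ("ZEBRA", "Z IY1 B R AH0"),
]
base, subliminal = split_lexicon(entries, base_words={"HELLO", "WORLD"})
```

Each half can then be written out with the conversion script from step 3.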
After running makeOvModel.py you get the tokens for the LM and, with additional options, even the counts for a language model that you can train with SRILM. I haven't finished the previous step yet, so this post should have a follow-up.
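For readers unfamiliar with what such counts look like, here is a simplified sketch of n-gram counting over token sequences. It is not what makeOvModel.py does internally, and the token strings are invented for the example. The output layout (n-gram, tab, count) is what I understand SRILM count files to look like, so verify it against the SRILM documentation before relying on it.

```python
from collections import Counter


def ngram_counts(sentences, order):
    """Count all n-grams up to the given order, with sentence boundary markers."""
    counts = Counter()
    for tokens in sentences:
        padded = ["<s>"] + list(tokens) + ["</s>"]
        for n in range(1, order + 1):
            for i in range(len(padded) - n + 1):
                counts[tuple(padded[i:i + n])] += 1
    return counts


def format_counts(counts):
    """Render counts as 'w1 w2 ...<TAB>count' lines, sorted for stable output."""
    return ["%s\t%d" % (" ".join(ngram), count)
            for ngram, count in sorted(counts.items())]


# Invented token sequences; real input would mix words and graphone tokens.
sentences = [["a", "b"], ["a", "c"]]
counts = ngram_counts(sentences, order=2)
lines = format_counts(counts)
```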