This is the second part of a series on voicemail transcription for Asterisk administrators. See the previous part, which describes how to set up Pocketsphinx.
So you have configured the recognizer to transcribe voicemails and now want to improve its accuracy. Honestly, you will not get perfect transcription results for free unless you send your voicemails to a human-assisted transcription company. You will not get them from Google either. There are several commercial services to try, such as Yap or PhoneTag, which specialize in voicemails. Our proprietary Nexiwave technology, for example, uses far more advanced algorithms and far larger speech databases than the ones distributed with Pocketsphinx, and the difference is clearly audible.
However, even the results you can get with Pocketsphinx can be very usable for you. I estimate you can easily reach 80-90% accuracy with little effort, provided the language of your voicemails is simple.
Now, the core components of the recognizer are:
- The language model, which constrains the likely sequences of words
- The acoustic model, which describes how each phone sounds
- The phonetic dictionary, which maps words to their phonetic representations
To get better accuracy you need to improve all three. By default the following models are used:
- Dictionary - pocketsphinx/model/lm/en_US/cmu07a.dic
- Language model - pocketsphinx/model/lm/en_US/hub4.5000.DMP
- Acoustic model - pocketsphinx/model/hmm/en_US/hub4wsj_sc_8k
So let's improve them step by step, in order of importance.
Language model
The core reason voicemail transcription is poor is that the default language model was built for a completely different domain. HUB4 was a DARPA task for transcribing broadcast news, so you can see it is very different from voicemail language. It is perfect for recognizing voicemails about NATO or democracy, but not about your wife's problems. We need to change the language model.
1) Transcribe some of your existing voicemails. A hundred is already a good start. Put the transcriptions in a text file, line by line:
hello jim it's steve let's meet at five p m
hello jim buy some milk
jim it's bob i should catch you tomorrow
you fired jim
....
The important points here are that the text is all lower case, one sentence per line, with no punctuation.
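The cleanup above can be sketched as a small shell pipeline. This is only an illustration: raw.txt and voicemail.txt are placeholder file names, and the tr filter keeps letters, apostrophes and spaces while dropping everything else:

```shell
# Placeholder input: raw transcripts, mixed case, with punctuation.
printf 'Hello Jim, buy some milk!\nYou fired, Jim.\n' > raw.txt

# Lowercase the text, then keep only letters, apostrophes, spaces and newlines.
tr '[:upper:]' '[:lower:]' < raw.txt | tr -cd "a-z' \n" > voicemail.txt
cat voicemail.txt
```

For a real corpus you will likely also want to spell out digits and split long paragraphs into one sentence per line, but this covers the case and punctuation requirements.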
2) Then, you can find some domain-specific texts on your computer. For example, if you work as a system administrator at a chemical company, chemical texts will help improve the quality of the language model. Take a few books and convert them to the same simple text form: strip out punctuation and formatting, and add them to the transcribed voicemail text. Consider your email archives too; they can also be a good source.
3) Then use the MITLM toolkit to convert the texts you've collected into a language model.
Download the MITLM language model toolkit here:
http://code.google.com/p/mitlm/
Run it as:
estimate-ngram -text voicemail.txt -write-lm your_model.lm
It will create the language model your_model.lm.
4) Sometimes it makes sense to mix your specific model with a generic model. It may help if your training text is small or your model is not good enough. To do that, download a generic model here:
http://keithv.com/software/giga/lm_giga_5k_nvp_3gram.zip
Then unpack it and interpolate it with your voicemail model using the MITLM tools:
interpolate-ngram -lm "your_model.lm, lm_giga_5k_nvp_3gram.arpa" -interpolation LI -op voicemail.txt -wl voicemail_interpolated.lm
See the MITLM tutorial for details:
http://code.google.com/p/mitlm/wiki/Tutorial
The lm_giga model is quite big, so you can also pick the hub4 language model for interpolation. To do that, you first need to convert it from binary to text form:
sphinx_lm_convert -ifmt dmp -ofmt arpa hub4.5000.DMP hub4.lm
One day you will be able to work with language models using the CMU language model toolkit, CMUCLMTK, but for now it is more complicated than MITLM, so MITLM is the recommended tool for language model operations.
5) To speed up recognizer startup, sort the model and convert it to binary format:
sphinx_lm_sort < your_model.lm > your_model_sorted.lm
sphinx_lm_convert -i your_model_sorted.lm -o your_model_sorted.lm.dmp
6) In your Pocketsphinx script, use your language model for transcription by adding the following argument:
-lm your_model_sorted.lm.dmp
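For example, a full batch transcription call might look like the sketch below. This assumes a pocketsphinx_continuous build that supports the -infile option; the file names are placeholders, and the -hmm and -dict paths are the defaults listed earlier:

```shell
# Decode one voicemail with the adapted language model (sketch only).
pocketsphinx_continuous \
    -hmm pocketsphinx/model/hmm/en_US/hub4wsj_sc_8k \
    -dict pocketsphinx/model/lm/en_US/cmu07a.dic \
    -lm your_model_sorted.lm.dmp \
    -infile voicemail.wav > transcript.txt
```

If your script invokes the decoder differently, the key point is the same: pass -lm with your sorted binary model instead of the stock hub4 one.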
That's it.
By the way, the Google API is mostly trained on search queries. While that makes it perfectly suitable for voice search, it is not good for voicemail transcription either. Moreover, voicemail texts are usually quite sensitive information, and it is very hard to get free access to them.
I think after this step the transcription accuracy will already be good enough. You will also be able to collect transcription results, fix them, and use them to improve the language model further.
Acoustic model
Sometimes it is useful to adapt the acoustic model. This step requires you to compile and set up SphinxTrain. Again, transcribe a few voicemails you've recorded and organize them into a database. Then follow the acoustic model adaptation HOWTO in the CMUSphinx wiki:
http://cmusphinx.sourceforge.net/wiki/tutorialadapt
Acoustic model adaptation always makes sense, but it is quite a time-consuming process. Maybe one day someone will automate it to make it truly seamless. For example, we have started a project to train and adapt models from a set of long audio files accompanied by text, instead of a carefully drafted database. Once this project is complete, it will be much easier to train and adapt acoustic models. Any help with it is appreciated.
Dictionary
There can be cases where you need to add a few missing words to the dictionary. For example, in step 1, when you adapted the language model, you may have found words that are missing from cmu07a.dic. Then it makes sense to add them. Just open the dictionary with a text editor, find the appropriate place, and add or edit the phonetic pronunciation of the word. For example, the CMU dictionary is missing the word "twitter":
twitter T W IH T ER
Usually this step is not needed, but if you have, for example, accented words or other unusual words, it may help.
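To sketch the idea, the snippet below appends a new entry and re-sorts the file so it lands in the right alphabetical place; my.dic is a stand-in for your working copy of cmu07a.dic:

```shell
# my.dic stands in for a copy of the real dictionary (two sample entries).
printf 'hello HH AH L OW\nzebra Z IY B R AH\n' > my.dic

# Append the missing word with its pronunciation, then keep the file sorted.
printf 'twitter T W IH T ER\n' >> my.dic
sort -o my.dic my.dic
grep '^twitter' my.dic
```

Editing the real dictionary in a text editor, as described above, works just as well; the only rule is one word per line followed by its phones.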
Test the model
After you have adapted the models, re-transcribe the files you have already collected and check whether the accuracy has improved.
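One crude way to sanity-check the results is to compare the reference transcripts with the recognizer output word by word. The sketch below is a position-based mismatch count, not a true edit-distance word error rate (SphinxTrain ships a word_align.pl script for proper WER scoring); ref.txt and hyp.txt are placeholder names:

```shell
# Reference transcript and a hypothetical recognizer output, one utterance per line.
printf 'hello jim buy some milk\n' > ref.txt
printf 'hello jim buy sun milk\n'  > hyp.txt

# Count positions where the hypothesis word differs from the reference word.
awk 'NR == FNR { for (i = 1; i <= NF; i++) ref[FNR " " i] = $i; total += NF; next }
     { for (i = 1; i <= NF; i++) if ($i != ref[FNR " " i]) errors++ }
     END { printf "%d errors out of %d words\n", errors, total }' \
    ref.txt hyp.txt > check.txt
cat check.txt
```

Running this before and after each change gives you a quick, repeatable number to track, even if it is rougher than a real alignment-based score.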
Follow up
So these are the directions to take. I understand it is some work, but you may find it worth the effort. We are really trying to make this process easier, and your comments on that will be much appreciated.