Speech Decoding Engines Part 1. Juicer, the WFST recognizer
Written by Nickolay Shmyrev
ASR today is quite diverse. While in 1998 there was only the HTK package and some in-house toolkits like CMUSphinx (released to the public in 2000), there are now dozens of very interesting recognizers available under open source licenses. Today we are starting a review series about them.
So, the first one is
Juicer, the WFST recognizer from IDIAP, the University of Edinburgh and the University of Sheffield.
Weighted finite state transducers (WFST) are a very popular trend in modern ASR, with famous adopters like Google, the IBM Watson center and many others. The basic idea is that you convert everything into the same format, which allows not just a unified representation but also more advanced operations: composing transducers to build a shared search space, or shrinking that search space with operations like determinization and minimization. The format also composes nicely with other knowledge sources; for example, you don't need to care about g2p anymore, since it can be handled automatically by a transducer. For WFST itself, I found a good tutorial by
Mohri, Pereira and Riley, "Speech Recognition with Weighted Finite-State Transducers". Juicer can do very efficient decoding with the standard set of ASR tools: an ARPA language model (bigram, due to memory requirements), a dictionary, and cross-word triphone models that can be trained with HTK. The BSD license makes Juicer very attractive. Juicer is part of the AMI project that targets meeting transcription; other AMI deliverables are a subject for separate posts though.
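To make the shared-format idea concrete, here is what a tiny lexicon transducer might look like in OpenFst's AT&T text format. This is a hypothetical one-word example (phones on the input side, the word "go" on the output side); the real lexicon transducer built later in this post is much larger:

```
# arc format: src-state dest-state input-symbol output-symbol [weight]
0 1 g go
1 2 ow <eps>
2
```

A file like this is compiled with OpenFst's fstcompile given input and output symbol tables, and can then be composed with a grammar transducer using fstcompose; that composition is essentially what the Juicer tools below do for us.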
So here is a description of how to try it. Don't expect it to be straightforward; it's not a trivial process. Well, one day we'll put everything on a live CD to make the ASR development environment easier. For now you can follow this step-by-step howto, as many of our young friends call such a thing. I wonder where people get the idea that there is a detailed step-by-step howto for everything.
So, let's start. Download Juicer and its dependencies:
Torch3src.tgz
tracter-0.6.0.tar.bz2
juicer-0.12.0.tar.bz2
kiss_fft-v1.2.8.tar.gz
openfst-1.1.tar.gz
Unpack and build torch
tar xf Torch3src.tgz
cd Torch3
cp config/Linux.cfg .
Edit Linux.cfg to include the required packages:
# Packages you want to use
packages = distributions gradients kernels speech datasets decoder
Continue with the build
./xmake all
cd ..
Unpack kiss_fft:
tar xf kiss_fft-v1.2.8.tar.gz
There is no need to build kiss_fft separately; its build is included in the next step.
Unpack and build tracter
tar xf tracter-0.6.0.tar.bz2
cd tracter-0.6.0
aclocal && libtoolize && automake -a && autoconf
mkdir m4
./configure \
--with-kiss-fft=/current_folder/kiss_fft_v1_2_8 \
--with-htk-includes="-I/htk_folder/HTKLib" \
--with-htk-libs="/htk_folder/HTKLib/HTKLib.a" \
--with-torch3=/current_folder/Torch3
make && make install
cd ..
Make sure you give full paths to the dependencies, since relative paths
will not work. Also note that for HTK you need to provide compiler
options, not folders. Alternatively you can increase your pain by trying
to build tracter with cmake as the README describes.
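Since relative paths break the configure step, one convenient pattern (assuming you unpacked everything into a single directory) is to capture the absolute path once and reuse it in the flags:

```shell
# Capture the absolute path of the build directory once;
# configure flags like --with-kiss-fft must be absolute, not relative
SRC=$(pwd)
echo "--with-kiss-fft=$SRC/kiss_fft_v1_2_8"
echo "--with-torch3=$SRC/Torch3"
```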
Unpack and build juicer
Make sure PKG_CONFIG_PATH makes tracter.pc reachable.
export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:/usr/local/lib/pkgconfig
tar xf juicer-0.12.0.tar.bz2
cd juicer-0.12.0
aclocal && libtoolize && automake -a && autoconf
mkdir m4
./configure \
--with-kiss-fft=/current_folder/kiss_fft_v1_2_8 \
--with-htk-includes="-I/htk_folder/HTKLib" \
--with-htk-libs="/htk_folder/HTKLib/HTKLib.a" \
--with-torch3=/current_folder/Torch3
make && make install
cd ..
Build openfst:
tar xf openfst-1.1.tar.gz
cd openfst-1.1
./configure && make && make install
cd ..
Setup environment variables:
export JUTOOLS=/current_folder/juicer-0.12.0/bin
At this point Juicer and the required tools are built, so let's try it with the HTK
WSJ model from
Keith Vertanen. Download the model
htk_wsj_si84_2750_8.zip and unpack it
unzip htk_wsj_si84_2750_8.zip
Convert model to ascii
mkdir ascii
touch empty
HHEd -D -T 1 -H hmmdefs -H macros -M ascii empty tiedlist
Convert the turtle model from pocketsphinx (a DMP file) to an ARPA model, turtle.lm
sphinx_lm_convert -i turtle.DMP -o turtle.lm -ifmt dmp -ofmt arpa
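For reference, the resulting turtle.lm is a plain-text ARPA file. A minimal bigram file has roughly this shape (the words and log-probabilities below are illustrative only, not the actual turtle model contents):

```
\data\
ngram 1=4
ngram 2=2

\1-grams:
-0.60 <s>      -0.30
-0.60 go       -0.30
-0.60 forward  -0.30
-0.60 </s>

\2-grams:
-0.30 <s> go
-0.30 go forward

\end\
```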
Remove alternative pronunciation numbers from turtle.dic and build phoneset
sed 's/([0-9])//g' turtle.dic | tr '[:upper:]' '[:lower:]' > turtle.dic.lower
mv turtle.dic.lower turtle.dic
echo "<s> sil" >> turtle.dic
echo "</s> sil" >> turtle.dic
for w in `cat turtle.dic | cut -d" " -f 2-`; do echo $w; done | sort | uniq > turtle.phone
echo sp >> turtle.phone
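To see what the commands above actually do, here is the same clean-up pipeline run on a tiny made-up dictionary (sample.dic is hypothetical; the real turtle.dic has many more entries):

```shell
# A two-entry sample dictionary with an alternative-pronunciation marker
cat > sample.dic <<'EOF'
GO g ow
FORWARD(2) f ao r w er d
EOF

# Strip the "(2)" markers and lowercase everything
sed 's/([0-9])//g' sample.dic | tr '[:upper:]' '[:lower:]' > sample.dic.lower
mv sample.dic.lower sample.dic

# Collect the phone set: one unique phone per line, sorted
for w in `cat sample.dic | cut -d" " -f 2-`; do echo $w; done | sort | uniq > sample.phone
cat sample.dic
```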
Due to some script limitations, two different words cannot have the same pronunciation. So open turtle.dic and remove the line with the entry "two t uw", because it conflicts with "to t uw".
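Instead of hunting for such conflicts by hand, you can list every pronunciation shared by more than one word. A small sketch using only coreutils, demonstrated here on a made-up sample; on the real data run the same cut/sort/uniq pipeline against turtle.dic:

```shell
# Tiny sample dictionary containing a known conflict
cat > sample.dic <<'EOF'
go g ow
to t uw
two t uw
EOF

# Print every pronunciation (everything after the word) that occurs
# more than once; each reported line is a conflict to resolve
cut -d" " -f 2- sample.dic | sort | uniq -d
# prints: t uw
```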
Now let's convert everything into WFST
gramgen -gramType ngram -lmFName turtle.lm -lexFName turtle.dic \
-fsmFName gram.fsm -inSymsFName gram.insyms -outSymsFName gram.outsyms \
-sentStartWord "<s>" -sentEndWord "</s>"
lexgen -lexFName turtle.dic -monoListFName turtle.phone \
-fsmFName dic.fsm -inSymsFName dic.insyms -outSymsFName dic.outsyms \
-sentStartWord "<s>" -sentEndWord "</s>" -pauseMonphone sp -addPronunsWithEndPause
cdgen -htkModelsFName wsj_si84_2750_8/ascii/hmmdefs -tiedListFName \
wsj_si84_2750_8/tiedlist -monoListFName turtle.phone -fsmFName wsj.fsm \
-inSymsFName wsj.insyms -outSymsFName wsj.outsyms
To work around a Juicer bug, comment out the following lines in juicer-0.12.0/bin/aux2eps.pl:
#if ( ! %AUXSYMS )
#{
# print "no aux syms in symbol file - nothing to do\n" ;
# exit 0 ;
#}
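If you prefer not to edit the file by hand, the same workaround can be applied with GNU sed, commenting out everything from the matching if-line down to the closing brace. Demonstrated here on a stub copy of the script (the real target is juicer-0.12.0/bin/aux2eps.pl, assuming the block appears verbatim as shown above):

```shell
# Stub copy of the offending block; point FILE at the real
# juicer-0.12.0/bin/aux2eps.pl in an actual build tree
FILE=aux2eps.pl
cat > "$FILE" <<'EOF'
if ( ! %AUXSYMS )
{
    print "no aux syms in symbol file - nothing to do\n" ;
    exit 0 ;
}
EOF

# Prefix every line of the block with '#' (GNU sed in-place edit)
sed -i '/if ( ! %AUXSYMS )/,/^}/ s/^/#/' "$FILE"
cat "$FILE"
```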
Now let's compose everything into a single WFST
build-wfst-openfst gram.fsm dic.fsm wsj.fsm
Everything is ready for decoding. Let's try with goforward.raw from pocketsphinx
sox -r 16000 -2 -s goforward.raw goforward.wav
Create the HTK config file named config:
SOURCEFORMAT = WAV
TARGETKIND = MFCC_0_D_A_Z
TARGETRATE = 100000.0
WINDOWSIZE = 250000.0
SAVECOMPRESSED = F
SAVEWITHCRC = F
USEHAMMING = T
PREEMCOEF = 0.97
NUMCHANS = 26
CEPLIFTER = 22
NUMCEPS = 12
ENORMALISE = T
ZMEANSOURCE = T
USEPOWER = T
Convert to mfcc
HCopy -C config goforward.wav goforward.mfc
Create control file:
echo goforward.mfc > train.scp
Decode
juicer -inputFormat htk -lexFName turtle.dic -inputFName train.scp -fsmFName final.fsm -inSymsFName final.insyms -outSymsFName final.outsyms -htkModelsFName wsj_si84_2750_8/ascii/hmmdefs
Get the result:
<s> go four are <s> ten meters </s>
It's not accurate for some reason. Probably the feature extraction is not the same as was used to train the acoustic model. Perhaps I should also use a word insertion penalty.
Of course not everything is perfect. The main issues with a WFST decoder are very well described in the documentation. Basically, they are the memory requirements of first-pass decoding (that's why Juicer can't run trigram models on commodity hardware) and the lack of the dynamic search optimizations that are more straightforward in conventional decoders. Anyway, the WFST framework has many applications beyond plain recognition: it is applied to speech indexing and open-vocabulary decoding, and it simplifies confidence scoring.
That's it; you can confirm it works and embed it into your software. Overall, it's an interesting package demonstrating how simple things can be when you put everything into a flexible format. I'm sure CMUSphinx will follow this direction and implement WFST decoding soon. At the very least, we ultimately need to introduce FST tools into our framework.