CMUSphinx decoders at a glance, so one can compare them. The table is incomplete and imprecise, of course.
| Feature | sphinx2 | sphinx3 | sphinx4 | pocketsphinx |
|---|---|---|---|---|
| Acoustic Lookahead | - | - | + | - |
| Alignment | + | + | + | - |
| Flat Forward Search | + | + | - | + |
| Finite Grammar Confidence | + | - | - | - |
| Full n-gram History Tree Search | - | - | + | - |
| HTK Features | - | + | + | + |
| Phonetic Loop Decoder | + | + | - | - |
| Phonetic Lookahead | + | + | - | + |
| PLP Features | - | - | + | - |
| PTM Models | - | - | - | + |
| Score Quantization | + | - | - | + |
| Semi-Continuous Models | + | + | - | + |
| Single Tree Search | + | - | - | + |
| Subvector Quantization | + | + | - | + |
| Time-Switching Tree Search | - | + | - | - |
| Tree Search Smear | - | + | + | - |
| Word-Switching Tree Search | - | + | - | - |
| Thread Safety | - | - | + | + |
| Keyword Spotting | - | - | + | - |
And here are the descriptions of the entries.
**Specific Applications**

**Phonetic Loop Decoder.** Phonetic loop decoding requires a specialized search algorithm; it is not implemented in Sphinx4, for example.
**Alignment.** Given the audio and its transcription, get the word timings.
**Keyword Spotting.** Searching for a keyword requires a separate search space and a different search approach.
**Finite Grammar Confidence.** Confidence estimation for a finite state grammar. This is a complex problem which requires additional operations during the search, for example a phone loop pass.
**Effective Pruning**

**Acoustic Lookahead.** Using the acoustic score for the current frame we can predict the score for the next frame and thus prune tokens early, as in the sketch below.
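A minimal sketch of the idea, with hypothetical token and score structures (not actual CMUSphinx code): a token is dropped when even its most optimistic next-frame score cannot keep it inside the beam.

```python
# Illustrative sketch of acoustic lookahead pruning.
# Token objects, reachable_senones, and the score dict are hypothetical.

def prune_with_lookahead(tokens, next_senone_scores, beam):
    """Drop tokens whose best achievable score in the next frame
    already falls outside the pruning beam (log domain)."""
    # Most optimistic total any token can reach in the next frame.
    best_next = max(next_senone_scores.values())
    best_total = max(t.score for t in tokens) + best_next
    threshold = best_total - beam
    survivors = []
    for t in tokens:
        # Optimistic estimate for this token: its path score plus the
        # best score among senones it can actually enter next frame.
        lookahead = t.score + max(next_senone_scores[s]
                                  for s in t.reachable_senones)
        if lookahead >= threshold:
            survivors.append(t)
    return survivors
```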
**Phonetic Lookahead.** Using a phonetic loop decoder we can predict which phones may come next and thus restrict the large-vocabulary search; see the sketch below.
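A sketch under the same caveat (hypothetical names, not the actual decoder code): word expansion is gated by the set of phones that a cheap phone-loop decoder considers active in the upcoming frames.

```python
# Illustrative sketch of phonetic lookahead; lexicon maps words to
# phone sequences, active_phones comes from a phone-loop decoder.

def expand_words(token, lexicon, active_phones):
    """Enter only the words whose first phone was predicted active
    by the phone loop, instead of the whole vocabulary."""
    for word, phones in lexicon.items():
        if phones[0] in active_phones:
            yield token.enter_word(word)   # hypothetical transition helper
```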
**Features**

**HTK Features.** CMUSphinx feature extraction differs from HTK's (different filterbank and transform). Providing HTK compatibility requires a dedicated HTK feature extraction front end.
**PLP Features.** A feature type different from the traditional MFCC; PLP features are more popular nowadays.
**Search Space**

**Flat Forward Search.** A search space in which word paths are not joined into a lextree. Keeping paths separate lets us apply the language model probability earlier, so the search is more accurate; but since the search space is bigger, it is also slower. Usually flat search is applied as a second pass after tree search.
**Full n-gram History Tree Search.** Tokens with different n-gram histories are tracked separately. For example, the token for "how are UW.." and the token for "hello are UW.." are tracked separately. In pocketsphinx such tokens are simply merged and only the best one survives. Full-history search is more accurate, but slower and more complex to implement; the sketch below contrasts the two recombination strategies.
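A sketch of the two recombination strategies, using hypothetical token objects carrying a lextree node, a score, and a word history (not actual decoder code):

```python
# Illustrative sketch of token recombination strategies.

def recombine_merged(tokens):
    """pocketsphinx style: keep the single best token per lextree
    node, ignoring word history."""
    best = {}
    for t in tokens:
        if t.node not in best or t.score > best[t.node].score:
            best[t.node] = t
    return list(best.values())

def recombine_full_history(tokens, n=3):
    """Full n-gram history style: keep the best token per
    (node, history) pair, so tokens for "how are ..." and
    "hello are ..." survive side by side."""
    best = {}
    for t in tokens:
        key = (t.node, tuple(t.history[-(n - 1):]))
        if key not in best or t.score > best[key].score:
            best[key] = t
    return list(best.values())
```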
**Word-Switching Tree Search.** Separate lextrees are kept for each unigram history. This search sits between keeping the full history and dropping the history altogether.
**Single Tree Search.** Lextree tokens ignore word history. This is faster, but less accurate.
**Time-Switching Tree Search.** Lextree states ignore word history, but several lextrees (3-5) are kept in memory. In this time-switching approach the lextrees are switched every frame, which gives a higher chance of tracking both histories; see the sketch below.
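A sketch of the switching mechanism, with a hypothetical lextree API:

```python
# Illustrative sketch of time-switching tree search.

class TimeSwitchingSearch:
    def __init__(self, make_lextree, n_trees=3):
        # Keep a few identical lextree copies (3-5 in practice).
        self.trees = [make_lextree() for _ in range(n_trees)]

    def enter_word_exits(self, frame, word_exit_tokens):
        # A different tree copy receives word exits each frame, so
        # tokens with different histories are less likely to collide.
        tree = self.trees[frame % len(self.trees)]
        for t in word_exit_tokens:
            tree.enter_root(t)             # hypothetical re-entry call
```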
**Tree Search Smear.** Lextree nodes carry a unigram probability, which makes it possible to prune tokens earlier based on the language score.
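One common way to compute such a smear, sketched with a hypothetical node structure: each node stores the best unigram log probability among the words below it, propagated up from the leaves.

```python
# Illustrative sketch of unigram smearing in a lextree.

def smear_unigram(node, unigram_logprob):
    """Attach to every node the best unigram log probability among
    words reachable below it, so the language score can contribute
    to pruning before a word end is reached."""
    if node.word is not None:              # leaf node: a complete word
        node.smear = unigram_logprob[node.word]
    else:
        node.smear = max(smear_unigram(child, unigram_logprob)
                         for child in node.children)
    return node.smear
```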
**Acoustic Scoring**

**PTM Models.** Models in which Gaussians are shared across senones with the same central phone. We do not need to compute Gaussian values for every senone, just a few values per central phone; applying each senone's own mixture weights then gives the senone scores. This reduces the required computation while keeping accuracy at a reasonable level. It is similar to semi-continuous models, where Gaussians are shared across all senones rather than just across senones with the same central phone.
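A sketch of PTM scoring with hypothetical Gaussian objects; decoders usually work in the log domain and often approximate the mixture sum with a max, as done here.

```python
# Illustrative sketch of PTM senone scoring; not actual CMUSphinx code.

def score_senones_ptm(frame, phone_codebooks, senones):
    """Evaluate Gaussians once per central phone; each senone then
    combines the shared densities with its own mixture weights."""
    density_cache = {}                     # log densities per central phone
    scores = {}
    for name, (phone, log_weights) in senones.items():
        if phone not in density_cache:
            density_cache[phone] = [g.logpdf(frame)
                                    for g in phone_codebooks[phone]]
        densities = density_cache[phone]
        # Max approximation to the log mixture sum.
        scores[name] = max(w + d for w, d in zip(log_weights, densities))
    return scores
```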
**Score Quantization.** In some cases (semi-continuous models and a specific feature set) acoustic scores can be represented with just 2 bytes. Scores are usually kept in the log domain and shifted by 10 bits. This reduces the memory needed for the acoustic model and for scoring, and speeds up computation, particularly on CPUs without an FPU.
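A sketch of the packing, taking only the 10-bit shift from the description above; the clamping and helper names are illustrative assumptions.

```python
# Illustrative sketch of 2-byte log-score quantization.

SHIFT = 10                                 # bits dropped, per description

def quantize(log_score):
    """Shift a fixed-point log score right by 10 bits and clamp it
    into the signed 16-bit range."""
    q = int(log_score) >> SHIFT
    return max(-32768, min(32767, q))

def dequantize(q):
    """Approximate reconstruction of the original log score."""
    return q << SHIFT
```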
**Semi-Continuous Models.** Gaussians are shared across all senones; only the mixture weights differ. Such models are fast and usually quite accurate. They are usually multistream (s2_4x, or 1s_c_d_dd with subvectors 0-12/13-25/26-38), since separate streams can be quantized better.
**Subvector Quantization.** A Gaussian selection approach that reduces the cost of acoustic scoring. Essentially, a continuous model is decomposed after training into several subvector Gaussians which are shared across senones and can therefore be scored efficiently; see the sketch below.
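A sketch of scoring with subvector codebooks, using hypothetical structures: the small shared codebooks are scored once per frame, and each senone score becomes a sum of table lookups.

```python
# Illustrative sketch of subvector-quantized senone scoring.

def score_frame_svq(frame, subvector_ranges, codebooks, senone_indices):
    """Score each shared subvector codebook once, then assemble
    per-senone scores from cheap table lookups."""
    # One score table per subvector: the cost is paid once per frame.
    tables = []
    for (lo, hi), codebook in zip(subvector_ranges, codebooks):
        sub = frame[lo:hi]
        tables.append([g.logpdf(sub) for g in codebook])
    scores = {}
    for senone, entries in senone_indices.items():
        # entries[k] selects this senone's codeword in subvector k.
        scores[senone] = sum(tables[k][idx]
                             for k, idx in enumerate(entries))
    return scores
```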