KISS Principle

Still think you can take the sphinx4 engine and build a state-of-the-art recognizer? Check what the AMI RT-09 entry does for meeting transcription, as described in the RT'09 workshop presentation "The AMI RT’09 STT and SASTT Systems" (a schematic sketch of the data flow follows the list):

  1. Segmentation
  2. Initial decoding of the full meeting with

    • 4g LM based on 50K vocabulary and weak acoustic model (ML) M1
    • 7g LM based on 6K vocabulary and strong acoustic model (MPE) M2
  3. Intersect output and adapt (CMLLR; see the adaptation note after this list)
  4. Decode using M2 models and 4g LM on 50K vocabulary
  5. Compute VTLN/SBN/fMPE (VTLN is sketched below)
  6. Adapt SBN/fMPE/MPE models M3 using CMLLR
  7. Adapt LCRCBN/fMPE/MPE models M4 using CMLLR and output of previous stage
  8. Generate 4g lattices with adapted M4 models
  9. Rescore using M1 models and CMLLR + MLLR adaptation
  10. Compute confusion networks (see the consensus formula below)
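
Steps 3, 6, 7 and 9 all lean on speaker adaptation, so it is worth recalling what that actually computes. These are the standard textbook definitions, not anything specific to the AMI system: CMLLR (also called fMLLR) estimates one affine transform of the feature vectors per speaker, while MLLR transforms the Gaussian means of the model, with the parameters in both cases chosen to maximize the likelihood of the adaptation data under the current model:

    \hat{o}_t = A\,o_t + b          (CMLLR, feature space)
    \hat{\mu} = W\mu + w            (MLLR, model space)

Because CMLLR is a pure feature transform, the same models and decoder can be reused unchanged between passes, which is what makes this kind of cascading workable at all.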
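
The VTLN of step 5 is a maximum-likelihood estimate too: a per-speaker frequency warp factor is picked by grid search so that the warped features best fit the current models and the hypothesis from the previous pass (again a generic description, not AMI-specific):

    \hat{\alpha}_s = \arg\max_{\alpha} \, p\big(O_s^{(\alpha)} \mid \lambda, W_s\big)

where O_s^{(\alpha)} are the features computed with the frequency axis warped by \alpha, \lambda is the acoustic model, and W_s is the current transcript hypothesis for speaker s.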
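
The confusion networks of step 10 implement consensus decoding: link posteriors are computed over the final lattice with the forward-backward algorithm, links are clustered into slots, and each slot outputs the word with the largest total posterior, which minimizes expected word error rather than sentence error:

    \hat{w}_i = \arg\max_{w} \sum_{\ell \in \mathrm{slot}_i,\, \mathrm{word}(\ell) = w} p(\ell \mid O)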
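
Finally, to make the sheer amount of plumbing concrete, here is a minimal runnable sketch of the data flow between the passes. Every name in it is a hypothetical placeholder: only the wiring between stages mirrors the slides, none of it is real sphinx4 or AMI tooling.

    # A runnable schematic of the ten stages above; all names are
    # hypothetical placeholders, only the wiring mirrors the slides.
    from dataclasses import dataclass, field

    @dataclass
    class Stage:
        name: str
        inputs: list = field(default_factory=list)

    def run(name, *inputs):
        """Record one processing stage and print what it consumed."""
        print(f"{name} <- {', '.join(s.name for s in inputs) or 'audio'}")
        return Stage(name, list(inputs))

    segments = run("segmentation")
    hyp_m1   = run("decode M1 (ML) + 4g/50K LM", segments)
    hyp_m2   = run("decode M2 (MPE) + 7g/6K LM", segments)
    adapted  = run("intersect outputs, adapt (CMLLR)", hyp_m1, hyp_m2)
    hyp2     = run("decode M2 + 4g/50K LM", adapted)
    feats    = run("compute VTLN/SBN/fMPE features", hyp2)
    m3       = run("CMLLR-adapt M3 (SBN/fMPE/MPE)", feats)
    m4       = run("CMLLR-adapt M4 (LCRCBN/fMPE/MPE)", m3)
    lat      = run("generate 4g lattices with adapted M4", m4)
    resc     = run("rescore with M1, CMLLR + MLLR", lat)
    cn       = run("compute confusion networks", resc)

Ten stages, four model sets and three kinds of adaptation, and that is only the condensed view: that is the distance between a decoder and a state-of-the-art system.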