Written by Nickolay Shmyrev
KISS Principle
Still think you can take the sphinx4 engine and build a state-of-the-art recognizer? Check what the AMI RT-09 entry does for meeting transcription, as described in the RT'09 workshop presentation "The AMI RT'09 STT and SASTT Systems" (a rough sketch of the whole cascade follows the list):
- Segmentation
- Initial decoding of the full meeting with:
  - 4g LM based on a 50K vocabulary and a weak acoustic model (ML), M1
  - 7g LM based on a 6K vocabulary and a strong acoustic model (MPE), M2
- Intersect the outputs and adapt (CMLLR)
- Decode using M2 models and the 4g LM on the 50K vocabulary
- Compute VTLN/SBN/fMPE
- Adapt SBN/fMPE/MPE models M3 using CMLLR
- Adapt LCRCBN/fMPE/MPE models M4 using CMLLR and the output of the previous stage
- Generate 4g lattices with the adapted M4 models
- Rescore using M1 models with CMLLR + MLLR adaptation
- Compute confusion networks
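To make the shape of that cascade concrete, here is a purely illustrative Python sketch of the control flow. Every function and model name is a hypothetical stub standing in for a whole subsystem; nothing here is AMI or sphinx4 code. The point is the structure: each pass produces hypotheses that supervise the adaptation for the next, stronger pass.

```python
# Illustrative stubs only -- each function stands in for a whole subsystem.

def segment(audio):
    # Speech activity + speaker segmentation (stub).
    return [("spk1", "seg1"), ("spk1", "seg2"), ("spk2", "seg3")]

def decode(segs, am, lm, cmllr=None, feats=None):
    # One full decoding pass (stub): one hypothesis per segment.
    tag = f"{am}/{lm}" + (f"+{cmllr}" if cmllr else "")
    return {seg: f"hyp({tag})" for _, seg in segs}

def intersect(h1, h2):
    # Keep only the parts where both first passes agree (stub).
    return {s: h1[s] for s in h1 if h1[s] == h2[s]} or h1

def estimate_cmllr(segs, supervision, am):
    # Per-speaker affine feature transform, supervised by `supervision`.
    return f"cmllr({am})"

def compute_vtln_sbn_fmpe(segs, hyp):
    # Speaker-normalized, discriminatively trained features (stub).
    return "vtln+sbn+fmpe feats"

def generate_lattices(segs, am, lm, cmllr, feats):
    return f"4g lattices({am})"

def rescore(lattices, am, adaptation):
    return f"rescored({lattices}, {am}, {adaptation})"

def confusion_networks(rescored):
    return f"consensus({rescored})"

def transcribe_meeting(audio):
    segs = segment(audio)
    # Pass 1: two deliberately different first decodes.
    h1 = decode(segs, am="M1(ML)", lm="7g-6k")
    h2 = decode(segs, am="M2(MPE)", lm="4g-50k")
    supervision = intersect(h1, h2)
    x2 = estimate_cmllr(segs, supervision, am="M2(MPE)")
    # Pass 2: adapted decode with the strong model and the big LM.
    h3 = decode(segs, am="M2(MPE)", lm="4g-50k", cmllr=x2)
    feats = compute_vtln_sbn_fmpe(segs, h3)
    # Passes 3-4: adapt the two bottleneck-feature systems in turn.
    x3 = estimate_cmllr(segs, h3, am="M3(SBN/fMPE/MPE)")
    h4 = decode(segs, am="M3(SBN/fMPE/MPE)", lm="4g-50k",
                cmllr=x3, feats=feats)
    x4 = estimate_cmllr(segs, h4, am="M4(LCRCBN/fMPE/MPE)")
    lat = generate_lattices(segs, am="M4(LCRCBN/fMPE/MPE)",
                            lm="4g-50k", cmllr=x4, feats=feats)
    # Final pass: rescore the lattices and take the consensus.
    out = rescore(lat, am="M1(ML)", adaptation="CMLLR+MLLR")
    return confusion_networks(out)

print(transcribe_meeting("meeting.wav"))
```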
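CMLLR (constrained MLLR, also known as fMLLR) appears at nearly every stage, so it is worth spelling out. It estimates a single per-speaker affine transform of the feature vectors that maximizes the likelihood of the adaptation data under the current model; in this pipeline the "adaptation data" is whatever the previous pass hypothesized:

$$
\hat{\mathbf{o}}_t = \mathbf{A}^{(s)} \mathbf{o}_t + \mathbf{b}^{(s)}
$$

where $\mathbf{o}_t$ is the feature vector at frame $t$, and $\mathbf{A}^{(s)}$ and $\mathbf{b}^{(s)}$ are estimated for speaker $s$.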
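The last step, consensus decoding over confusion networks, is the only genuinely simple one: the rescored lattices are collapsed into a sequence of slots, each holding word alternatives with posteriors, and the best word is picked per slot. A minimal self-contained sketch with a made-up toy network (a real one would come from the lattices above):

```python
# Toy confusion network: one list of (word, posterior) pairs per slot.
# "<eps>" marks the option of emitting nothing in that slot.
toy_confusion_network = [
    [("the", 0.7), ("a", 0.3)],
    [("meeting", 0.6), ("meaning", 0.4)],
    [("<eps>", 0.55), ("is", 0.45)],
    [("adjourned", 0.9), ("and", 0.1)],
]

def consensus_decode(cn):
    """Pick the highest-posterior word in each slot, dropping epsilons."""
    best = (max(slot, key=lambda wp: wp[1])[0] for slot in cn)
    return [w for w in best if w != "<eps>"]

print(consensus_decode(toy_confusion_network))
# ['the', 'meeting', 'adjourned']
```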
See the presentation for the details of the process.