KISS Principle

Still think you can take the sphinx4 engine and build a state-of-the-art recognizer? Check what the AMI RT-09 entry does for meeting transcription, as described in the RT'09 workshop presentation "The AMI RT’09 STT and SASTT Systems" (a schematic sketch of the data flow follows the list):

  1. Segmentation
  2. Initial decoding of the full meeting with

    • 4g LM based on 50K vocabulary and weak acoustic model (ML) M1
    • 7g LM based on 6K vocabulary and strong acoustic model (MPE) M2
  3. Intersect output and adapt (CMLLR; see the adaptation note after this list)
  4. Decode using M2 models and 4g LM on 50K vocabulary
  5. Compute VTLN/SBN/fMPE (VTLN is sketched below)
  6. Adapt SBN/fMPE/MPE models M3 using CMLLR
  7. Adapt LCRCBN/fMPE/MPE models M4 using CMLLR and output of previous stage
  8. Generate 4g lattices with adapted M4 models
  9. Rescore using M1 models and CMLLR + MLLR adaptation
  10. Compute confusion networks (see the consensus formula below)
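
Steps 3, 6, 7 and 9 all lean on speaker adaptation, so it is worth recalling what that actually computes. These are the standard textbook definitions, not anything specific to the AMI system: CMLLR (also called fMLLR) estimates one affine transform of the feature vectors per speaker, while MLLR transforms the Gaussian means of the model, with the parameters in both cases chosen to maximize the likelihood of the adaptation data under the current model:

    \hat{o}_t = A\,o_t + b          (CMLLR, feature space)
    \hat{\mu} = W\mu + w            (MLLR, model space)

Because CMLLR is a pure feature transform, the same models and decoder can be reused unchanged between passes, which is what makes this kind of cascading workable at all.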
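
The VTLN of step 5 is a maximum-likelihood estimate too: a per-speaker frequency warp factor is picked by grid search so that the warped features best fit the current models and the hypothesis from the previous pass (again a generic description, not AMI-specific):

    \hat{\alpha}_s = \arg\max_{\alpha} \, p\big(O_s^{(\alpha)} \mid \lambda, W_s\big)

where O_s^{(\alpha)} are the features computed with the frequency axis warped by \alpha, \lambda is the acoustic model, and W_s is the current transcript hypothesis for speaker s.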
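
The confusion networks of step 10 implement consensus decoding: link posteriors are computed over the final lattice with the forward-backward algorithm, links are clustered into slots, and each slot outputs the word with the largest total posterior, which minimizes expected word error rather than sentence error:

    \hat{w}_i = \arg\max_{w} \sum_{\ell \in \mathrm{slot}_i,\, \mathrm{word}(\ell) = w} p(\ell \mid O)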
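
Finally, to make the sheer amount of plumbing concrete, here is a minimal runnable sketch of the data flow between the passes. Every name in it is a hypothetical placeholder: only the wiring between stages mirrors the slides, none of it is real sphinx4 or AMI tooling.

    # A runnable schematic of the ten stages above; all names are
    # hypothetical placeholders, only the wiring mirrors the slides.
    from dataclasses import dataclass, field

    @dataclass
    class Stage:
        name: str
        inputs: list = field(default_factory=list)

    def run(name, *inputs):
        """Record one processing stage and print what it consumed."""
        print(f"{name} <- {', '.join(s.name for s in inputs) or 'audio'}")
        return Stage(name, list(inputs))

    segments = run("segmentation")
    hyp_m1   = run("decode M1 (ML) + 4g/50K LM", segments)
    hyp_m2   = run("decode M2 (MPE) + 7g/6K LM", segments)
    adapted  = run("intersect outputs, adapt (CMLLR)", hyp_m1, hyp_m2)
    hyp2     = run("decode M2 + 4g/50K LM", adapted)
    feats    = run("compute VTLN/SBN/fMPE features", hyp2)
    m3       = run("CMLLR-adapt M3 (SBN/fMPE/MPE)", feats)
    m4       = run("CMLLR-adapt M4 (LCRCBN/fMPE/MPE)", m3)
    lat      = run("generate 4g lattices with adapted M4", m4)
    resc     = run("rescore with M1, CMLLR + MLLR", lat)
    cn       = run("compute confusion networks", resc)

Ten stages, four model sets and three kinds of adaptation, and that is only the condensed view: that is the distance between a decoder and a state-of-the-art system.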