Training process

What I really like in Sphinxtrain is that it provides a straightforward way to train an acoustic model. It remains unclear to me why everyone bothers with the HTK Book while there is a clean and easy way to train a model: you just define the dictionary and the transcription and put the files in the proper folder. Anyway, I keep thinking about ways the sphinxtrain process could be improved. Currently it lacks a lot of critical information on training, and that makes it look incomplete.

Basically, here is what I would like to put into the next versions of sphinxtrain and the sphinxtrain tutorial:

  1. Description of how to prepare the data
  2. Building the database transcription. By the way, what has bothered me for the last month is the requirement to have fileids. I really think the fileids file could be silently dropped. What's the problem with getting the id of the file from the transcription labels?
  3. Automatic splitting into training data, testing data and development data. I see the presence of development data as a hard requirement for the training process. Unfortunately, the current documentation lacks it. There could be code to do that, though for most databases it's automatic of course.
  4. Bootstrapping from hand-labelled data. I think this is an important part of training, and HTK results confirm that. In general it mirrors human language learning, so I think it's natural as well.
  5. Training
  6. Optimizing the number of senones and mixtures on a development set
  7. Optimizing the most important parameters, like language weight, on the development set. This part is complicated as I see it. First of all, the reasoning behind proper language weight scaling is still unclear to me; I might one day write a separate post on it. Basically it depends on everything, even on the decoder
  8. Testing on the test set 
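To make the point in step 2 concrete: each line of a Sphinx-style transcription already ends with the utterance id in parentheses, so the fileids could in principle be derived from it. Here is a minimal sketch; the function name and the example ids are mine, and it assumes the usual `WORDS (utt_id)` line convention:

```python
import re

def fileids_from_transcription(lines):
    """Extract the utterance id from each Sphinx-style transcription line.

    Each line is expected to look like: SOME WORDS HERE (utt_id)
    Lines without a trailing (utt_id) are skipped.
    """
    ids = []
    for line in lines:
        m = re.search(r"\(([^()]+)\)\s*$", line)
        if m:
            ids.append(m.group(1))
    return ids

transcription = [
    "HELLO WORLD (speaker1/utt_0001)",
    "GOOD MORNING (speaker1/utt_0002)",
]
print(fileids_from_transcription(transcription))
```

Of course a real fileids file may contain directory components that the transcription labels omit, which is exactly the kind of mismatch that makes the separate file feel redundant.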
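The automatic splitting from step 3 could be as simple as a seeded random partition of the utterance ids. A sketch, with illustrative fractions; for a serious split you would rather hold out whole speakers to avoid leakage between sets:

```python
import random

def split_ids(ids, dev_frac=0.1, test_frac=0.1, seed=42):
    """Randomly partition utterance ids into train/dev/test lists.

    The seed makes the split reproducible across runs.
    """
    ids = list(ids)
    random.Random(seed).shuffle(ids)
    n_dev = int(len(ids) * dev_frac)
    n_test = int(len(ids) * test_frac)
    dev = ids[:n_dev]
    test = ids[n_dev:n_dev + n_test]
    train = ids[n_dev + n_test:]
    return train, dev, test

ids = ["utt_%03d" % i for i in range(100)]
train, dev, test = split_ids(ids)
print(len(train), len(dev), len(test))  # 80 10 10
```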
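The dev-set tuning in steps 6 and 7 boils down to a grid search: decode the development set at each candidate value and keep the one with the lowest word error rate. A sketch; `decode_dev` is a hypothetical callback standing in for a real decoding run, and the WER numbers below are made up for illustration:

```python
def tune_language_weight(weights, decode_dev):
    """Pick the language weight minimizing WER on the development set.

    `decode_dev` takes a candidate weight and returns the WER obtained
    by decoding the dev data with it (hypothetical callback).
    """
    return min(weights, key=decode_dev)

# Toy stand-in for real decoding results: WER per candidate weight.
wer = {6.5: 0.21, 8.0: 0.18, 10.0: 0.15, 12.0: 0.17}
print(tune_language_weight(sorted(wer), wer.__getitem__))  # 10.0
```

The same loop works for senone and mixture counts, except that each candidate there requires retraining the model rather than just re-decoding, so the search is far more expensive.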
If it's possible to keep all this as straightforward as it is now, that would be just perfect. If I start writing the chapter in a week, it could be ready by summer.