Training the large database trick

Training on a large database requires a cluster. SphinxTrain supports training on Torque/PBS, for example. To use it, set the queue type in the training configuration:

$CFG_QUEUE_TYPE = "Queue::PBS";

and set the number of parts to train ($CFG_NPART). The question is how many parts to use. I previously thought that accuracy depended on the number of parts, based on these results:

1 part:

TOTAL Words: 773 Correct: 660 Errors: 126
TOTAL Percent correct = 85.38% Error = 16.30% Accuracy = 83.70%
TOTAL Insertions: 13 Deletions: 9 Substitutions: 104

3 parts:

TOTAL Words: 773 Correct: 583 Errors: 262
TOTAL Percent correct = 75.42% Error = 33.89% Accuracy = 66.11%
TOTAL Insertions: 72 Deletions: 17 Substitutions: 173

10 parts:

TOTAL Words: 773 Correct: 633 Errors: 168
TOTAL Percent correct = 81.89% Error = 21.73% Accuracy = 78.27%
TOTAL Insertions: 28 Deletions: 10 Substitutions: 130

20 parts:

TOTAL Words: 773 Correct: 619 Errors: 181
TOTAL Percent correct = 80.08% Error = 23.42% Accuracy = 76.58%
TOTAL Insertions: 27 Deletions: 13 Substitutions: 141
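For reference, the figures in each report are tied together by simple arithmetic. Errors are the sum of insertions, deletions and substitutions, the error rate is errors divided by the number of reference words, and accuracy is 100% minus the error rate. Correct words plus substitutions plus deletions add up to the reference word count. A minimal Python sketch of that bookkeeping follows; the function name score is my own and not part of SphinxTrain or the scoring tool.

```python
def score(words, correct, insertions, deletions, substitutions):
    """Recompute the summary line of a word-alignment report from raw counts."""
    # Every reference word is either correct, substituted or deleted.
    assert correct + substitutions + deletions == words
    errors = insertions + deletions + substitutions
    percent_correct = 100.0 * correct / words
    error_rate = 100.0 * errors / words
    accuracy = 100.0 - error_rate
    return (errors,
            round(percent_correct, 2),
            round(error_rate, 2),
            round(accuracy, 2))


# Counts copied from the 1-part and 3-part reports above.
print(score(773, 660, 13, 9, 104))   # (126, 85.38, 16.3, 83.7)
print(score(773, 583, 72, 17, 173))  # (262, 75.42, 33.89, 66.11)
```

Because insertions count as errors but not as correct words, accuracy can be well below percent correct, as in the 3-part run.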

But it turned out that none of the above is true. One potential source of problems is that the norm.pl script indiscriminately picks up every subdirectory under the bwaccum directory. So if some old bwaccum directories are left over (e.g. if you train on 20 parts first and then start again with 10 without deleting the directories in between), the norm script will screw up (thanks to David Huggins-Daines for pointing that out to me). In this particular test there was a second problem: I forgot to update the mdef after rebuilding the model, and the old scripts didn't do that automatically. With multipart training the order of senones in the mdef is different, even though the set of senones is the same, and that is why there was a regression.
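A guard against leftover bwaccum directories could look like the sketch below. It is an illustration under assumptions, not part of SphinxTrain: it assumes that each per-part accumulator directory name ends in an underscore followed by the part number (for example exp_buff_7) and that parts are numbered from 1. The function name stale_part_dirs is hypothetical. Adjust the pattern to whatever layout your setup actually produces.

```python
import os
import re


def stale_part_dirs(bwaccum_dir, nparts):
    """List subdirectories whose trailing part number is larger than nparts.

    Assumes directory names end in "_<part number>"; this naming is an
    assumption made for the sketch, so check it against your own setup.
    """
    stale = []
    for name in os.listdir(bwaccum_dir):
        if not os.path.isdir(os.path.join(bwaccum_dir, name)):
            continue  # ignore plain files
        match = re.search(r"_(\d+)$", name)
        if match and int(match.group(1)) > nparts:
            stale.append((int(match.group(1)), name))
    # Sort by part number so the report is easy to read.
    return [name for _, name in sorted(stale)]
```

Running it with the current number of parts before the norm step and refusing to continue when the list is non-empty would have caught the 20-then-10 case described above.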

So the tests and statements above are completely wrong: accuracy doesn't depend on the number of parts used, as expected. This confirms the basic truth that setting up the experiment correctly is the most important thing in research.

Now only one issue is left: the drop in accuracy from the old tutorial to the new one. But that is a completely different issue, currently being discussed in my mails on cmusphinx-sdmeet.