Intelligent Testing In ASR
Written by Nickolay Shmyrev
To continue the previous topic about testing, I want to share a nice paper I read some time ago that I've long wanted to bring into our daily practice.
The issue is that the current way we test our systems is far from optimal; at least there is no real theory behind it. In practice I usually apply the 1/10th rule: split the data into a 9/10 training set and a 1/10 test set. This was done in VoxForge as well, which is not so good, since with 70 hours of VoxForge data the test set grows to 7 hours and takes ages to decode. I took this rule from Festival's traintest script, and it is more or less common practice in ASR, while things like 10-fold cross-validation aren't popular, mostly for computational reasons. Surprisingly, problems like that could be easily solved if only we focused on the real goal of testing: estimating recognition performance. All our test sets are oversized; one can easily see that by watching decoder results during testing. They tend to stabilize very quickly unless there is some data inconsistency.
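To make that stabilization concrete, here is a minimal sketch (my own illustration, not from any toolkit) that tracks the running word error rate over decoded utterances together with a binomial confidence interval. The per-utterance error counts are hypothetical; in a real run they would come from the decoder's scoring output.

```python
import math

# Hypothetical per-utterance results: (word errors, reference word count).
results = [(3, 40), (1, 25), (5, 38), (0, 22), (2, 31),
           (4, 45), (1, 19), (3, 36), (2, 28), (6, 50)]

errors = 0
words = 0
for i, (err, ref_len) in enumerate(results, 1):
    errors += err
    words += ref_len
    wer = errors / words
    # 95% confidence half-width from the normal approximation
    # to the binomial: 1.96 * sqrt(p * (1 - p) / N).
    half = 1.96 * math.sqrt(wer * (1.0 - wer) / words)
    print(f"after {i:2d} utterances ({words:4d} words): "
          f"WER = {wer:.3f} +/- {half:.3f}")
```

Once the interval is narrower than the difference you actually care about, decoding more data buys you very little.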
Speech recognition practice unfortunately doesn't cover this, even in scientific papers. Help comes from character recognition. The nice paper I found is:
Isabelle Guyon, John Makhoul, Richard Schwartz, and Vladimir Vapnik. "What Size Test Set Gives Good Error Rate Estimates?" IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 20, No. 1, January 1998.
The authors address the problem of determining what test set size guarantees statistically significant results in a character recognition task, as a function of the expected error rate. The paper is well written and actually rather easy to understand. There is no complex model behind the testing, nothing speech-specific. There are two valuable points there:
- An approach that puts reasoning behind the test process
- The formula itself
To put it simply:
The test set for a medium vocabulary task can be small. If the word error rate is expected to be around 10%, the table on page 9 shows that to compare two configurations differing by 0.5% absolute you need only about 13k words of test data. That's four times smaller than the current VoxForge test set. I think this estimate could be improved further if we specialize it to speech. I really hope this result will be useful for us and will help us speed up application testing and optimization.
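As a rough sanity check on that number, here is a back-of-the-envelope sketch using the plain normal approximation to the binomial. Everything here is my own assumption for illustration: the significance level, the independence of the two estimates, and the helper name; the paper's table is derived more carefully, so the numbers only roughly agree.

```python
import math

def words_needed(p, delta, z=1.64):
    """Words needed so that a difference delta between two independently
    estimated error rates is detectable: z times the standard error of
    the difference, sqrt(2 * p * (1 - p) / N), should not exceed delta."""
    return math.ceil(z * z * 2.0 * p * (1.0 - p) / (delta * delta))

# Expected WER around 10%, detecting a 0.5% absolute difference
# at a one-sided 95% level (z = 1.64) -- assumptions for illustration.
print(words_needed(0.10, 0.005))  # -> 19366
```

This naive bound lands within a factor of two of the table's 13k figure. In practice the two systems are scored on the same test data, so their errors are correlated and the variance of the difference is smaller, which pushes the required size down.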