Written by Nickolay Shmyrev
How to create a speech recognition application for your needs
Sometimes people ask: why are there no high-quality open source speech recognition applications (dictation applications, IVR applications, closed-caption alignment, language learning and so on)? The obvious answer is that nobody has written them and made them public. It's often noted, for example by Voxforge, that we lack a database for the acoustic model. I admit Voxforge has its reasons to say we need a database, but that's only a small part of the problem, not the problem as a whole.
And, as always happens, the way the question is posed doesn't allow a constructive answer. To get a constructive answer you need to ask a different question: how do I create a speech recognition application?
To answer that, let me give an example. Suppose we want to develop a Flash-based dictation website. The dictation application consists of the following parts, each of which needs to be built:
- Website, user accounting, user-dependent information storage
- Initial acoustic and language models trained on Voxforge audio and other free sources, passed through the Flash codecs so the training data matches the channel
- Recognizer setup to convert incoming streams into text, plus a distributed computation framework for the recognizer (see the first sketch after this list)
- Recognizer frontend with noise cancellation and voice activity detection (VAD)
- Acoustic model adaptation framework to let users adapt the generic acoustic model to their pronunciation
- Language model adaptation framework
- Transcription control package to process commands during dictation, such as error corrections and punctuation commands
- Post-processing package to insert punctuation and capitalization and to expand dates and acronyms (see the second sketch after this list)
- Test framework for dictation, with recorded dictation sessions and the ability to measure recognition accuracy (a minimal scoring sketch follows the component list below)
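To make the recognizer part concrete, here is a minimal sketch of the stream-to-text step using Sphinx4's high-level API (the `StreamSpeechRecognizer` class from recent Sphinx4 versions; older releases used an XML `ConfigurationManager` instead). The model paths are placeholders for models trained on Voxforge data, and the sketch assumes the incoming Flash audio has already been decoded to 16 kHz 16-bit mono PCM elsewhere; a real deployment would feed the decoded Red5 stream in and spread the work across recognizer nodes.

```java
import java.io.FileInputStream;
import java.io.InputStream;

import edu.cmu.sphinx.api.Configuration;
import edu.cmu.sphinx.api.SpeechResult;
import edu.cmu.sphinx.api.StreamSpeechRecognizer;

public class DictationRecognizer {
    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        // Placeholder paths: point these at models trained on Voxforge data.
        configuration.setAcousticModelPath("models/voxforge-en");
        configuration.setDictionaryPath("models/voxforge-en/cmudict.dict");
        configuration.setLanguageModelPath("models/voxforge-en/en.lm");

        StreamSpeechRecognizer recognizer = new StreamSpeechRecognizer(configuration);

        // In the real system this stream would be the Red5 audio stream,
        // already decoded from the Flash codec to 16 kHz 16-bit mono PCM.
        try (InputStream audio = new FileInputStream(args[0])) {
            recognizer.startRecognition(audio);
            SpeechResult result;
            while ((result = recognizer.getResult()) != null) {
                System.out.println(result.getHypothesis());
            }
            recognizer.stopRecognition();
        }
    }
}
```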
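And a second sketch for the transcription control and post-processing side. The class below is entirely hypothetical and shows only the general idea: replace spoken commands with punctuation, then restore sentence capitalization. A real package would also handle error-correction commands, dates and acronyms.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical post-processor: turns spoken punctuation commands into
 *  symbols and capitalizes sentence starts. */
public class TranscriptPostProcessor {
    private static final Map<String, String> COMMANDS = new LinkedHashMap<>();
    static {
        COMMANDS.put("period", ".");
        COMMANDS.put("comma", ",");
        COMMANDS.put("question mark", "?");
        COMMANDS.put("new paragraph", "\n\n");
    }

    public static String process(String hypothesis) {
        String text = hypothesis;
        // Replace each spoken command (with its leading space) by the symbol.
        for (Map.Entry<String, String> e : COMMANDS.entrySet()) {
            text = text.replace(" " + e.getKey(), e.getValue());
        }
        // Capitalize the first letter of every sentence.
        StringBuilder out = new StringBuilder(text);
        boolean capitalize = true;
        for (int i = 0; i < out.length(); i++) {
            char c = out.charAt(i);
            if (capitalize && Character.isLetter(c)) {
                out.setCharAt(i, Character.toUpperCase(c));
                capitalize = false;
            } else if (c == '.' || c == '?' || c == '!' || c == '\n') {
                capitalize = true;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints: "Hello world. This is a test, isn't it?"
        System.out.println(process(
                "hello world period this is a test comma isn't it question mark"));
    }
}
```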
Everything above can be done with open source tools; the parts are of roughly equal complexity and require minimal specialized knowledge. Performance-wise, such a system should be usable for large-vocabulary dictation by a wide range of users. The core components are:
- Red5 streaming server
- Adobe Flex SDK
- Sphinx4
- Sphinxtrain
- Language model toolkit (for example, CMUCLMTK or SRILM)
- Voxforge acoustic database
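As for the test framework mentioned above, its heart is a scorer that compares recognizer output against reference transcripts. Below is a minimal word error rate (WER) sketch, the standard substitutions-plus-insertions-plus-deletions metric; the surrounding harness that batch-decodes the recorded dictation sessions is left out.

```java
/** Minimal word error rate scorer for regression-testing the dictation
 *  pipeline against reference transcripts. */
public class WerScorer {
    /** WER = word-level Levenshtein distance divided by reference length. */
    public static double wer(String reference, String hypothesis) {
        String[] ref = reference.trim().toLowerCase().split("\\s+");
        String[] hyp = hypothesis.trim().toLowerCase().split("\\s+");
        int[][] d = new int[ref.length + 1][hyp.length + 1];
        for (int i = 0; i <= ref.length; i++) d[i][0] = i;
        for (int j = 0; j <= hyp.length; j++) d[0][j] = j;
        for (int i = 1; i <= ref.length; i++) {
            for (int j = 1; j <= hyp.length; j++) {
                int cost = ref[i - 1].equals(hyp[j - 1]) ? 0 : 1;
                d[i][j] = Math.min(d[i - 1][j - 1] + cost,          // substitution or match
                          Math.min(d[i - 1][j] + 1,                 // deletion
                                   d[i][j - 1] + 1));               // insertion
            }
        }
        return (double) d[ref.length][hyp.length] / ref.length;
    }

    public static void main(String[] args) {
        // One substitution in four words: prints "WER: 25.0%"
        System.out.printf("WER: %.1f%%%n",
                100 * wer("the quick brown fox", "the quick brown box"));
    }
}
```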
So you see, mostly it's just an implementation of existing algorithms and technologies. No rocket science. This makes me think that such an application is just a matter of time.