Written by Nickolay Shmyrev
How to create a speech recognition application for your needs
Sometimes people ask: why are there no high-quality open source speech recognition applications (dictation applications, IVR applications, closed-caption alignment, language learning and so on)? The obvious answer is that nobody has written them and made them public. It's often noted, for example by Voxforge, that we lack a database for the acoustic model. I admit Voxforge has its reasons to say we need a database, but that's only a small part of the problem, not the problem as a whole.
And, as always happens, the way the question is posed doesn't allow a constructive answer. To get a constructive answer you need to ask a different question: how do I create a speech recognition application?
To answer that, let me give an example. Suppose we want to develop a Flash-based dictation website. The dictation application consists of the following parts, each of which needs to be built:
- Website, user accounting, user-dependent information storage
- Initial acoustic and language models trained on Voxforge audio and other free sources, passed through the Flash codecs so the training data matches the channel
- Recognizer setup to convert incoming streams into text, plus a distributed computation framework for the recognizer (see the first sketch after this list)
- Recognizer frontend with noise cancellation and voice activity detection (VAD)
- Acoustic model adaptation framework to let users adapt the generic acoustic model to their pronunciation
- Language model adaptation framework
- Transcription control package to process commands during dictation, such as error corrections and punctuation commands
- Post-processing package to insert punctuation and capitalization and to expand dates and acronyms (see the second sketch after this list)
- Test framework for dictation, with recorded dictation sessions and the ability to measure recognition accuracy (a minimal scoring sketch follows the component list below)
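To make the recognizer part concrete, here is a minimal sketch of the stream-to-text step using Sphinx4's high-level API (the `StreamSpeechRecognizer` class from recent Sphinx4 versions; older releases used an XML `ConfigurationManager` instead). The model paths are placeholders for models trained on Voxforge data, and the sketch assumes the incoming Flash audio has already been decoded to 16 kHz 16-bit mono PCM elsewhere; a real deployment would feed the decoded Red5 stream in and spread the work across recognizer nodes.

```java
import java.io.FileInputStream;
import java.io.InputStream;

import edu.cmu.sphinx.api.Configuration;
import edu.cmu.sphinx.api.SpeechResult;
import edu.cmu.sphinx.api.StreamSpeechRecognizer;

public class DictationRecognizer {
    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        // Placeholder paths: point these at models trained on Voxforge data.
        configuration.setAcousticModelPath("models/voxforge-en");
        configuration.setDictionaryPath("models/voxforge-en/cmudict.dict");
        configuration.setLanguageModelPath("models/voxforge-en/en.lm");

        StreamSpeechRecognizer recognizer = new StreamSpeechRecognizer(configuration);

        // In the real system this stream would be the Red5 audio stream,
        // already decoded from the Flash codec to 16 kHz 16-bit mono PCM.
        try (InputStream audio = new FileInputStream(args[0])) {
            recognizer.startRecognition(audio);
            SpeechResult result;
            while ((result = recognizer.getResult()) != null) {
                System.out.println(result.getHypothesis());
            }
            recognizer.stopRecognition();
        }
    }
}
```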
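And a second sketch for the transcription control and post-processing side. The class below is entirely hypothetical and shows only the general idea: replace spoken commands with punctuation, then restore sentence capitalization. A real package would also handle error-correction commands, dates and acronyms.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical post-processor: turns spoken punctuation commands into
 *  symbols and capitalizes sentence starts. */
public class TranscriptPostProcessor {
    private static final Map<String, String> COMMANDS = new LinkedHashMap<>();
    static {
        COMMANDS.put("period", ".");
        COMMANDS.put("comma", ",");
        COMMANDS.put("question mark", "?");
        COMMANDS.put("new paragraph", "\n\n");
    }

    public static String process(String hypothesis) {
        String text = hypothesis;
        // Replace each spoken command (with its leading space) by the symbol.
        for (Map.Entry<String, String> e : COMMANDS.entrySet()) {
            text = text.replace(" " + e.getKey(), e.getValue());
        }
        // Capitalize the first letter of every sentence.
        StringBuilder out = new StringBuilder(text);
        boolean capitalize = true;
        for (int i = 0; i < out.length(); i++) {
            char c = out.charAt(i);
            if (capitalize && Character.isLetter(c)) {
                out.setCharAt(i, Character.toUpperCase(c));
                capitalize = false;
            } else if (c == '.' || c == '?' || c == '!' || c == '\n') {
                capitalize = true;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints: "Hello world. This is a test, isn't it?"
        System.out.println(process(
                "hello world period this is a test comma isn't it question mark"));
    }
}
```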
Everything above can be done with open source tools; the parts are of roughly equal complexity and require minimal specialized knowledge. Performance-wise, such a system should be usable for large-vocabulary dictation by a wide range of users. The core components are:
- Red5 streaming server
- Adobe Flex SDK
- Sphinx4
- Sphinxtrain
- Language model toolkit (for example, CMUCLMTK or SRILM)
- Voxforge acoustic database
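As for the test framework mentioned above, its heart is a scorer that compares recognizer output against reference transcripts. Below is a minimal word error rate (WER) sketch, the standard substitutions-plus-insertions-plus-deletions metric; the surrounding harness that batch-decodes the recorded dictation sessions is left out.

```java
/** Minimal word error rate scorer for regression-testing the dictation
 *  pipeline against reference transcripts. */
public class WerScorer {
    /** WER = word-level Levenshtein distance divided by reference length. */
    public static double wer(String reference, String hypothesis) {
        String[] ref = reference.trim().toLowerCase().split("\\s+");
        String[] hyp = hypothesis.trim().toLowerCase().split("\\s+");
        int[][] d = new int[ref.length + 1][hyp.length + 1];
        for (int i = 0; i <= ref.length; i++) d[i][0] = i;
        for (int j = 0; j <= hyp.length; j++) d[0][j] = j;
        for (int i = 1; i <= ref.length; i++) {
            for (int j = 1; j <= hyp.length; j++) {
                int cost = ref[i - 1].equals(hyp[j - 1]) ? 0 : 1;
                d[i][j] = Math.min(d[i - 1][j - 1] + cost,          // substitution or match
                          Math.min(d[i - 1][j] + 1,                 // deletion
                                   d[i][j - 1] + 1));               // insertion
            }
        }
        return (double) d[ref.length][hyp.length] / ref.length;
    }

    public static void main(String[] args) {
        // One substitution in four words: prints "WER: 25.0%"
        System.out.printf("WER: %.1f%%%n",
                100 * wer("the quick brown fox", "the quick brown box"));
    }
}
```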
So you see, mostly it's just an implementation of existing algorithms and technologies. No rocket science. This makes me think that such an application is just a matter of time.