Speech Decoding Engines, Part 2. SCARF, The Next Big Thing In Machine Learning

It seems that HMMs will not stay forever. Even if you aren't tied to speech but track big things in machine learning, you should have heard about the new thing - Conditional Random Fields. According to the recently started but very promising Metaoptimize, it's one of the most influential ideas in machine learning.

And, surprisingly, you can already apply this thing to speech recognition, thanks to Microsoft Research, in particular Geoffrey Zweig and Patrick Nguyen. It's SCARF, a Segmental Conditional Random Field speech recognition toolkit, currently at version 0.5. You can download its sources from the Microsoft Research website.



The idea behind SCARF is very elegant, I would say. In an HMM we use the joint probability distribution of observation features and state labels to estimate the probability of a label sequence. The showstopper here is the independence assumption this forces: each observation is assumed to depend only on the current state.

In a CRF we consider a different thing - the conditional probability of the label sequence given the observation sequence. Conditional models label a novel observation sequence x by selecting the label sequence that maximizes the conditional probability p(y|x). The conditional nature of such models means that no effort is wasted on modeling the observations, and one is free from having to make unwarranted independence assumptions about these sequences; arbitrary attributes of the observation data may be captured by the model, without the modeler having to worry about how these attributes are related.
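To make that concrete, here is the standard linear-chain CRF definition (generic notation, not anything specific to SCARF):

```latex
p(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \Big),
\qquad
Z(x) = \sum_{y'} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, x, t) \Big)
```

The feature functions f_k may look at the entire observation sequence x at once, which is exactly why no independence assumptions about the observations are needed; only the normalizer Z(x) ties the label sequence together.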

Applied to speech, the labels are states in a language model FST, or just words, and the features can be an arbitrary set, including pitch, spectral features or phonetic recognizer posteriors. In practice, though, SCARF doesn't operate on acoustic features directly; instead it's used as a postprocessing step over posteriors predicted by a conventional recognizer or some other detector. This presentation has more information on that. The use of high-level events makes it similar to other postprocessing decoders, like the consensus decoding of lattices that recently landed in CMUSphinx SVN.
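As a toy illustration of what a high-level feature over detector events might look like (the dictionary and function here are my own invention, not SCARF's actual input format), a feature function can simply ask whether the phones a detector fired inside a hypothesized word segment contain the word's dictionary pronunciation in order:

```python
# Illustrative pronunciation dictionary; real systems would load one
# from a lexicon file.
PRONUNCIATIONS = {"hello": ["hh", "ah", "l", "ow"]}

def existence_feature(word, detected_phones):
    """Return 1.0 if the word's dictionary phones occur, in order,
    as a subsequence of the phones detected inside the segment."""
    detected = iter(detected_phones)
    return float(all(p in detected for p in PRONUNCIATIONS[word]))
```

For example, `existence_feature("hello", ["hh", "ah", "ax", "l", "ow"])` fires (a spurious "ax" detection doesn't hurt), while `existence_feature("hello", ["hh", "l", "ow"])` doesn't. A trained weight on such a feature then rewards or penalizes the word hypothesis, with no assumptions about how the detections were produced.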



The whole point is that efficient training and decoding are possible even for such a complex model. Taking into account all the nice properties of CRFs, it sounds very promising.

The whole SCARF codebase is very small: the trainer and decoder together are just 6 KLOC. The included manual is very good, pretty simple, and describes everything needed in detail. The one little issue is obtaining data to train and test the model. The data formats are rather clear, but producing them still takes some effort. At least I didn't manage to prepare the input, so I had no luck testing it in action.


So if you are interested, you should definitely try to create a dataset for SCARF and train something.

Related posts:

Speech Decoding Engines Part 1. Juicer, the WFST recognizer