The case against probabilistic models in metric spaces

A recent discussion on kaldi group about OOV words reminded me about this old problem.

One of the things that makes modern recognizers so unnatural is probabilistic models behind them. It's a core design decision to build the recognizer on terms of probability of classes and use models which are all probabilistic. Probabilistic models are easy to estimate, but they do not often fit the reality.

In the most common situation, if you have two classes A and B and garbage class G, a point from the garbage is either estimated as A or B and it is very hard to properly classify it as G. While probability of the signal is easy to estimate from the database based on examples, probability of the garbage is very hard. You need to have a huge database of garbage examples or you will not be able to get the garbage estimate properly. As a result, the current systems can not drop non-speech sounds and often create very misleading hypothesis. Bad things also happen in training, incorrectly labelled examples significantly disturb correct probability estimation and model has no means to detect them.

And in a long term the chase for probabilistic model is getting worse, everything is reduced to probabilistic framework. People talk about graphical models, Gaussian processes, stick-breaking model, Monte-Carlo sampling when they simply need to optimize the number of Gaussians in the mixture with a simple cost function. And they never tell you can simply train 500 Gaussians mixture and that will work equally well.

Same issue you might see in search engines, you can not use "not" in the search, for example, you can not search for a "restaurant not on the river bank". Though some companies try to implement such search, this effort is not widespread yet.

Situation slightly changes if we consider some real space of variants, for example a metric space. Much more reasonable decision might be made with geometrical models. You just look on the distance between the observation and the expectation and make a decision based on certain threshold. Of course you need to train the threshold and the distance function but this decision relies only on observation and the distance, not on the probability of everything else. Yes, I'm talking about plain old SVMs.

Metric is really the key here, with generic space indeed you can not invent something more advanced than simple bayesian rule. However, in presence of metric you might hope that you'll get much more interesting results from using it or at least combining metric decision with probabilistic decision.

Unfortunately there is no much information about it on the net, almost all AI books start with probabilistic reasoning as a natural approach to intelligence. I found some research like this paper, but it is far from being complete. Any links on more complete research  on the topic would be really appreciated.