Experiments with solvers and decoding-time guidance in flow matching

Written by Nickolay Shmyrev
Some features are small and take only a few lines of code, not really worth a conference paper or a poster. Still, they are fairly widespread. A blog post about them feels just right.
In the P-Flow paper, P-Flow: A Fast and Data-Efficient Zero-Shot TTS through Speech Prompting, there is a big section on guided sampling. The authors claim that pronunciation clarity can be further enhanced by applying techniques from classifier-free guidance.
The implementation is really simple: you additionally run the estimator on the averaged conditioning (the mean) and shift the predicted vector field accordingly:
https://github.com/p0p4k/pflowtts_pytorch/blob/master/pflow/models/components/flow_matching.py#L168
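To illustrate the idea, here is a minimal sketch of this kind of guidance inside a single Euler step of the flow-matching ODE. The names (`estimator`, `mu`, `mask`, `guidance_scale`) follow Matcha/P-Flow conventions, but this is an illustrative sketch under those assumptions, not the exact repository code.

```python
import torch

def euler_step_with_guidance(estimator, x, mask, mu, t, dt, guidance_scale=1.0):
    """One Euler step of the flow-matching ODE with classifier-free-style guidance.

    `estimator` predicts the vector field dx/dt from the current state `x`
    and the text conditioning `mu` (shape: batch x channels x time).
    Sketch only; argument names are assumptions, not the repository API.
    """
    # Conditional prediction with the real text conditioning.
    dphi_dt = estimator(x, mask, mu, t)

    if guidance_scale > 0.0:
        # "Unconditional" prediction: replace mu with its time-averaged mean,
        # which washes out the phonetic content while keeping overall statistics.
        mu_avg = mu.mean(dim=2, keepdim=True).expand_as(mu)
        dphi_avg = estimator(x, mask, mu_avg, t)
        # Push the vector field away from the averaged prediction.
        dphi_dt = dphi_dt + guidance_scale * (dphi_dt - dphi_avg)

    return x + dt * dphi_dt
```

Note that each guided step runs the estimator twice, so it roughly doubles the decoding compute.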
Here are our experiments with guided sampling and different solvers.
As you can see, like any regularization method it helps to reduce artifacts and improve clarity (CER goes down). It also significantly reduces expressiveness (FAD goes up significantly). However, simply lowering the temperature has a similar effect. The question then is why we should spend extra compute on guided sampling at all. I have seen many times that researchers propose some new regularization method but never compare it against simpler alternatives.
As for solvers, I don't see any effect from the second-order Heun solver. Maybe the diffusion backbone has to be fixed first (replaced with a DiT).
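For reference, a Heun step is just an Euler predictor followed by a trapezoidal corrector. The sketch below, with a hypothetical vector-field callable `vf`, shows the difference from a plain Euler loop (which would keep only the `d1` term).

```python
import torch

def heun_sample(vf, x, t_steps):
    """Integrate dx/dt = vf(x, t) over the given time grid with Heun's
    second-order method. `vf` is a hypothetical callable returning the
    predicted vector field; this is a sketch, not any specific codebase."""
    for t0, t1 in zip(t_steps[:-1], t_steps[1:]):
        dt = t1 - t0
        d1 = vf(x, t0)                 # slope at the start of the step
        x_pred = x + dt * d1           # Euler predictor
        d2 = vf(x_pred, t1)            # slope at the predicted endpoint
        x = x + dt * 0.5 * (d1 + d2)   # trapezoidal corrector
    return x
```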
By the way, the default VITS temperature of 0.8 is pretty high and often leads to artifacts. I have heard many times in discussions that production people use lower values, down to 0.2-0.3. The voice is not as expressive, but artifacts are significantly reduced.
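For context, temperature here is just a scale on the sampled noise. A minimal sketch, assuming Matcha-style sampling where the ODE starts from scaled Gaussian noise (in VITS the analogous knob is the noise scale applied to the prior latent):

```python
import torch

mu = torch.zeros(1, 80, 100)   # placeholder for the predicted prior mean
temperature = 0.3              # instead of the default 0.8
# Lower temperature keeps the starting point closer to the mean:
# less expressive output, but noticeably fewer artifacts.
z0 = torch.randn_like(mu) * temperature
```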
By the way, Matcha/VITS also have problems with modeling speakers. More on that in the next post.