Matcha TTS notes
Written by Nickolay Shmyrev
Recently I’ve spent some time with Matcha by Shivam Mehta. Some related papers:
- Matcha-TTS: A fast TTS architecture with conditional flow matching
- Should you use a probabilistic duration model in TTS? Probably! Especially for spontaneous speech
- P-Flow: A Fast and Data-Efficient Zero-Shot TTS through Speech Prompting
Overall, Matcha is attractive because it is a very simple system following the VITS design and incorporating recent advances in TTS. It is fast and light. We tested Matcha on a Russian database of about 1000 hours and 100 speakers. I wish such things were mentioned in the paper, but papers these days are not a good source of information.
Better quality than VITS2
Out of the box, Matcha gives you better synthesis clarity (CER metric) and intonation (FAD metric) than VITS2, at the price of slightly slower speed and reduced naturalness (UTMOS). The quality drop is due to the codec and the mel-based architecture; end-to-end quality is better, as expected. Our results are as follows:
Metric  | VITS2 | Matcha | Matcha + Vocos
--------|-------|--------|---------------
CER     | 1.9   | 1.2    | 1.2
UTMOS   | 3.4   | 3.0    | 3.2
FAD     | 9.7   | 5.0    | 3.0
SIM     | 0.87  | 0.82   | 0.84
CPU xRT | 0.07  | 0.40   | 0.14
Note that MOS is not expected to be better, contrary to what the paper claims.
Focus on a single speaker
Matcha's default parameters are optimized for a single-speaker database. There is a VCTK setup too, but it doesn't feel optimal. Such a focus has its use case, but overall it is not sufficient for modern TTS. As a result, some parts need extra inputs (for example, the duration module doesn't use the speaker embedding, which creates big issues with duration), and some need more parameters (it is recommended to increase the decoder from 10M to 40M parameters). Overall, it is interesting that most of the current light models are underparameterized. An example of this is the good quality boost of Kokoro (a StyleTTS2 derivative) over plain StyleTTS2.
Mel parameters
Matcha uses an 8 kHz cut-off for the mel coefficients. Maybe this was relevant one day when you wanted to mix in 16 kHz data, but these days there is plenty of wideband data around. I have yet to retrain with a full 11 kHz mel, but I already expect an improvement. 80 mels doesn't feel like enough either; it is probably better to use 100.
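For reference, here is a minimal sketch of the two mel configurations using torchaudio. The sampling rate, n_fft and hop length below are typical assumptions, not necessarily Matcha's exact values, and the vocoder has to be retrained to match whatever settings you choose:

```python
import torch
import torchaudio

sr = 22050  # assumed sampling rate

# Default-style features: 80 mels with an 8 kHz cut-off
mel_80 = torchaudio.transforms.MelSpectrogram(
    sample_rate=sr, n_fft=1024, win_length=1024, hop_length=256,
    f_min=0.0, f_max=8000.0, n_mels=80)

# Wideband features: 100 mels with the cut-off at Nyquist (~11 kHz for 22.05 kHz audio)
mel_100 = torchaudio.transforms.MelSpectrogram(
    sample_rate=sr, n_fft=1024, win_length=1024, hop_length=256,
    f_min=0.0, f_max=sr / 2, n_mels=100)

wav = torch.randn(1, sr)      # one second of dummy audio
print(mel_80(wav).shape)      # (1, 80, frames)
print(mel_100(wav).shape)     # (1, 100, frames)
```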
BERT semantics
It is clear that proper synthesis is not really possible without text understanding, and, as Bert-VITS2 shows, BERT embeddings are a very helpful way to implement it. In our experiments, BERT embeddings are very helpful for Matcha as well: we see FAD improve from 3.0 to 2.3, which is a really good improvement. I am not sure why modern research systems still ignore this, essentially weakening their baselines.
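A minimal sketch of the general idea (not Bert-VITS2's or our exact pipeline): take word-level BERT vectors and repeat each one over the word's phonemes, so they can be concatenated with the phoneme embeddings at the text encoder input. The model name and the phones-per-word counts are assumptions; the latter would come from your G2P front-end.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Any BERT-like encoder for the target language; this checkpoint is just an example
NAME = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(NAME)
bert = AutoModel.from_pretrained(NAME)

@torch.no_grad()
def word_level_bert(text, n_words):
    """Return one BERT vector per word by mean-pooling its subword tokens."""
    enc = tokenizer(text, return_tensors="pt")
    hidden = bert(**enc).last_hidden_state[0]     # (tokens, 768)
    word_ids = enc.word_ids()                     # token index -> word index
    vecs = []
    for w in range(n_words):
        idx = [i for i, wid in enumerate(word_ids) if wid == w]
        vecs.append(hidden[idx].mean(dim=0))
    return torch.stack(vecs)                      # (n_words, 768)

def expand_to_phones(word_vecs, phones_per_word):
    """Repeat each word vector over its phonemes so the result aligns
    one-to-one with the phoneme sequence fed to the text encoder."""
    return torch.cat([v.expand(n, -1)
                      for v, n in zip(word_vecs, phones_per_word)], dim=0)

text = "hello world"
bert_feats = expand_to_phones(word_level_bert(text, 2), [4, 4])
print(bert_feats.shape)                           # (8, 768)
```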
Duration issues
Matcha copies the simple VITS duration predictor. It is sufficient for a single speaker, but not really sufficient for life-like synthesis.
Overall, duration is the weakest point of modern VITS successors. It is very sad that the problem is not fully understood and explored. For example, it is surprising that the duration paper linked above never mentions duration metrics, only WER. GlowTTS/VITS duration modeling is somewhat innovative (for example, MAS) but very basic: no skipping of silence phones, no chance to learn a proper alignment, and a very inaccurate plain CNN duration model. It is very sad that this piece of code is copied from repo to repo. As a result, researchers claim duration models are not helpful (as in E5/F5).
First of all, it is still not clear to me why we feed text encoder outputs to the duration predictor. The text encoder outputs are optimized with the prior loss, an L2 distance to the mel spectrogram, so essentially they are just a rough mel spectrogram where all semantics is lost. No punctuation, nothing. And we hope to predict duration from that.
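As a sketch of the alternative, a duration predictor could be fed directly from raw phoneme embeddings plus BERT and speaker vectors instead of the prior-optimized encoder outputs. The module below is illustrative only; dimensions and layers are my assumptions, not Matcha's code.

```python
import torch
import torch.nn as nn

class ConditionedDurationPredictor(nn.Module):
    """Predict log-duration per phone from raw phoneme embeddings,
    word-level BERT features and a speaker embedding."""

    def __init__(self, phone_dim=192, bert_dim=768, spk_dim=256, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(phone_dim + bert_dim + spk_dim, hidden, 3, padding=1),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Conv1d(hidden, hidden, 3, padding=1),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Conv1d(hidden, 1, 1),               # one log-duration per phone
        )

    def forward(self, phone_emb, bert_emb, spk_emb):
        # phone_emb: (B, T, phone_dim), bert_emb: (B, T, bert_dim), spk_emb: (B, spk_dim)
        spk = spk_emb.unsqueeze(1).expand(-1, phone_emb.size(1), -1)
        x = torch.cat([phone_emb, bert_emb, spk], dim=-1).transpose(1, 2)
        return self.net(x).squeeze(1)              # (B, T) predicted log-durations
```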
Another example of an issue copied from repo to repo is the ceiling used when converting predicted phoneme durations to frame counts (see the discussion "About ceiling for calculating phoneme duration"). Essentially what happens is that the VITS duration predictor is not very good and often predicts very short durations for important phones, which causes very perceptible phone skips in the audio. The ceil helps to hide this problem by pushing durations up, but it still creates many problems with intonation. To match the original length the sound has to be scaled down by a factor of approximately 0.9, and you still get irregular durations sometimes.
The proper solution would be to improve the duration predictor; then ceil can be replaced with round, which leads to much better intonation (FAD drops from 2.3 to 2.0). But then you get issues with phone skips (CER rises from 1.2 to 1.9).
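A minimal sketch of the trade-off, assuming VITS-style log-duration predictions (this is not the exact Matcha/VITS code): ceil pushes every phone up a little and inflates the total length, while round keeps the length but needs a minimum-frame clamp so short phones are not dropped entirely.

```python
import torch

def durations_to_frames(log_w, length_scale=1.0, use_ceil=True, min_frames=1):
    """Convert predicted log-durations to integer frame counts."""
    w = torch.exp(log_w) * length_scale                    # durations in frames
    frames = torch.ceil(w) if use_ceil else torch.round(w)
    return torch.clamp(frames, min=min_frames).long()

log_w = torch.tensor([0.1, -1.5, 1.2, 0.4])                # example predictions
print(durations_to_frames(log_w, use_ceil=True))           # tensor([2, 1, 4, 2])
print(durations_to_frames(log_w, use_ceil=False))          # tensor([1, 1, 3, 1])
```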
Our experiment with the flow-matching duration predictor from the paper above didn't demonstrate the advantages of the method. In fact, it got worse, probably because there is no speaker vector. See the note on the overall flow matching issues below. We have yet to explore this part.
Some other good duration ideas: the duration discriminator (VITS2), interpolation between the SDP and DP predictors (MeloTTS), and LSTM durations (StyleTTS2).
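The interpolation idea is simple enough to sketch. The blending in log-duration space below follows my reading of the MeloTTS/Bert-VITS2 approach, and the ratio is an illustrative value, not a recommendation:

```python
def blend_log_durations(log_w_sdp, log_w_dp, sdp_ratio=0.3):
    """Mix the stochastic duration predictor (livelier but noisier) with the
    deterministic one (more stable) in the log-duration domain."""
    return log_w_sdp * sdp_ratio + log_w_dp * (1.0 - sdp_ratio)
```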
MAS vs ASR alignment
MAS is somewhat innovative (it implements probabilistic alignment instead of the old fixed Viterbi-style alignment). However, it is certainly not sufficient for all use cases. It definitely works well for a large single-speaker database, but it is expected to fail for diverse data. Here are the cases:
- Fine-tuning on a small and emotional dataset of, say, 0.5 hours. It is a perfectly valid use case, but MAS is not going to work here: there is not enough data internally to align properly.
- Training on diverse datasets with a variable amount of speech per speaker. Some speakers have hours, some minutes. MAS will also fail here.
As a result, it becomes obvious that a modern light TTS system should have an ASR aligner, not a MAS aligner. This also fits the modularity requirements we covered in the previous post.
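With an external aligner, duration targets become plain supervision. A sketch follows, assuming the aligner (MFA, Kaldi or similar) outputs (phone, start, end) segments in seconds; the hop length and sampling rate are typical values, not a fixed requirement:

```python
def alignment_to_frame_durations(segments, hop_length=256, sample_rate=22050):
    """Convert phone segments from an external ASR/forced aligner into
    per-phoneme frame counts usable as duration targets."""
    frames_per_sec = sample_rate / hop_length
    return [(phone, max(1, round((end - start) * frames_per_sec)))
            for phone, start, end in segments]

# Toy alignment for the word "cat"
print(alignment_to_frame_durations(
    [("k", 0.00, 0.08), ("ae", 0.08, 0.21), ("t", 0.21, 0.30)]))
# [('k', 7), ('ae', 11), ('t', 8)]
```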
An interesting thing is that StyleTTS2 has the proper architecture here. Overall, StyleTTS2 has many proper decisions and deserves more attention.
Mel representation, Vocos and BigVGAN codecs
By default Matcha uses HiFi-GAN, which is OK and even has good quality, but it is not as universal and fast as modern vocoders.
We tried BigVGAN with Matcha expecting it to enable high-quality synthesis; unfortunately, it doesn't really work. UTMOS is 2.5 (10 NFE) and something like 2.8 (50 NFE), much lower than Vocos (3.2). The thing is that Matcha doesn't model the mel precisely, and BigVGAN is not just a speech codec, so it renders the inaccurate mel as non-speech sounds (clicks), affecting quality a lot.
We have yet to investigate other codecs; for example, some projects like Pflow-Encodec report good results with Encodec latents.
Mel is clearly not sufficient to reproduce speech cleanly. The VITS2 system has UTMOS 3.4, higher than Matcha, and uses a latent of dimension 192. Hopefully a more advanced codec could help fill the gap between Matcha and E2E systems while keeping a reasonable synthesis speed.
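Swapping the vocoder is straightforward with Vocos. The sketch below uses the public 24 kHz mel checkpoint and assumes the acoustic model was trained with a matching 100-bin mel configuration; otherwise the vocoder input will not line up:

```python
import torch
from vocos import Vocos

# Pretrained 24 kHz mel-based Vocos model
vocos = Vocos.from_pretrained("charactr/vocos-mel-24khz")

mel = torch.randn(1, 100, 256)       # (batch, n_mels, frames) from the acoustic model
with torch.no_grad():
    audio = vocos.decode(mel)        # (batch, samples) waveform
```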
Flow matching and outliers
Overall, flow matching and diffusion are not a silver bullet. They can emulate complex distributions with some precision, but they also have a drawback: outliers. If your distribution doesn't match the model, you get variance in the prediction. Sometimes this variance is nice, when you want emotional speech. But sometimes it really hurts: you get outliers that spoil the impression of the result. As an outcome, VITS CER is always very high compared to FastSpeech2 models: the former is usually about 2%, the latter 0.7%. VITS often skips or misrenders some phones due to the flow model.
We see the same thing with flow matching in Matcha. We improve expressiveness overall, but we introduce some gross outliers. We can hear them in acoustics and in duration too. Like VITS, Matcha is not very good at modeling liquid phones (l and r).
It is better to build a good model that matches the target distribution (for example, introduce an intonation vector as input) than to hope that flow matching will solve your problems. It will not.
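As a minimal sketch of what "introduce an intonation vector as input" could look like: project the vector and concatenate it to the flow-matching estimator input, the same way speaker conditioning is usually injected. The module below is a placeholder rather than Matcha's U-Net, and the flow step embedding is omitted for brevity:

```python
import torch
import torch.nn as nn

class ConditionedFlowDecoder(nn.Module):
    """Condition a flow-matching estimator on an extra intonation vector."""

    def __init__(self, mel_dim=80, cond_dim=128, hidden=256):
        super().__init__()
        self.cond_proj = nn.Linear(cond_dim, hidden)
        self.estimator = nn.Sequential(            # placeholder for the real U-Net
            nn.Conv1d(mel_dim + hidden, hidden, 3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, mel_dim, 3, padding=1),
        )

    def forward(self, x_t, intonation_vec):
        # x_t: (B, mel_dim, T) noisy mel at the current flow step
        # intonation_vec: (B, cond_dim), from a reference clip or predicted from text
        c = self.cond_proj(intonation_vec).unsqueeze(-1).expand(-1, -1, x_t.size(-1))
        return self.estimator(torch.cat([x_t, c], dim=1))   # predicted vector field
```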
No style vectors
Style vectors like in StyleTTS2 or HierSpeech++ seem like an important way to control synthesis. Besides a simple timbre vector, an emotional style vector could affect intonation based on a reference file and do many other things. Matcha doesn't use anything similar, probably due to its focus on more uniform speech, but later testing with more diverse databases and speech styles might demonstrate the need for architecture modifications.
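A rough sketch of the kind of reference encoder this implies: pool a reference mel spectrogram into a fixed-size style vector that can then condition duration, pitch and the decoder. Sizes and layers are illustrative, not StyleTTS2's actual architecture:

```python
import torch
import torch.nn as nn

class ReferenceStyleEncoder(nn.Module):
    """Pool a reference mel spectrogram into a fixed-size style vector."""

    def __init__(self, mel_dim=80, style_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(mel_dim, 256, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv1d(256, 256, 5, stride=2, padding=2), nn.ReLU(),
        )
        self.proj = nn.Linear(256, style_dim)

    def forward(self, ref_mel):               # (B, mel_dim, T)
        h = self.conv(ref_mel)                # (B, 256, T')
        return self.proj(h.mean(dim=-1))      # (B, style_dim) style vector
```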
Streaming
Low latency synthesis has been getting popular over the last year. While we don't believe in it, something like two-stage synthesis definitely makes sense: since we need to understand the full semantics to render the line properly, we still need to look at the whole sentence. For example, we can synthesize the intermediate representation of the whole line quickly and then render it with diffusion in a streaming fashion.
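A sketch of the two-stage idea: the acoustic model first produces the intermediate representation for the whole sentence, then the renderer turns it into audio chunk by chunk with a little left context, so playback can start early. Chunk sizes and the simple overlap-discard scheme are illustrative, and `vocode_chunk` stands for whatever vocoder or diffusion renderer is used:

```python
def stream_vocode(mel, vocode_chunk, chunk_frames=64, overlap_frames=8,
                  hop_length=256):
    """Render a full-sentence mel chunk by chunk, yielding audio as it is ready.
    `mel` has shape (n_mels, frames); `vocode_chunk` maps a mel slice to samples."""
    n_frames = mel.shape[1]
    start = 0
    while start < n_frames:
        end = min(start + chunk_frames, n_frames)
        ctx_start = max(0, start - overlap_frames)       # left context for the renderer
        audio = vocode_chunk(mel[:, ctx_start:end])
        yield audio[(start - ctx_start) * hop_length:]   # drop the context samples
        start = end
```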