Matcha TTS notes

Recently I’ve spent some time with Matcha-TTS by Shivam Mehta. Some related papers:

Matcha-TTS: A fast TTS architecture with conditional flow matching

Should you use a probabilistic duration model in TTS? Probably! Especially for spontaneous speech

P-Flow: A Fast and Data-Efficient Zero-Shot TTS through Speech Prompting

Overall, Matcha is attractive because it is a very simple system following the VITS design and incorporating recent advances in TTS. It is fast and light. We tested Matcha on a Russian database of about 1000 hours and 100 speakers. I wish such details were mentioned in the paper, but papers these days are not a good source of information.

Better quality than VITS2

Out of the box, Matcha gives you better synthesis clarity (CER metric) and intonation (FAD metric) than VITS2 at the price of slower synthesis and reduced perceived quality (UTMOS). The quality drop is due to the codec and the mel-based architecture; end-to-end quality is better, as expected. Our results are like this:

Metric     VITS2   Matcha   Matcha + Vocos
CER        1.9     1.2      1.2
UTMOS      3.4     3.0      3.2
FAD        9.7     5.0      3.0
SIM        0.87    0.82     0.84
CPU xRT    0.07    0.40     0.14

Note that MOS is not expected to be better than VITS2, contrary to what the paper claims.
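
For reference, here is a minimal sketch of how two of the metrics above can be measured: CER via an ASR transcription and CPU xRT as synthesis time divided by audio duration. The `tts_model` and `asr_model` objects are placeholders, not part of the Matcha codebase.

```python
import time
import jiwer  # pip install jiwer

def evaluate_utterance(tts_model, asr_model, text, sample_rate=22050):
    start = time.perf_counter()
    audio = tts_model.synthesize(text)        # hypothetical TTS call returning samples
    synthesis_time = time.perf_counter() - start

    hypothesis = asr_model.transcribe(audio)  # hypothetical ASR call
    cer = jiwer.cer(text, hypothesis)         # character error rate
    xrt = synthesis_time / (len(audio) / sample_rate)  # real-time factor
    return cer, xrt
```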

Focus on a single speaker

Matcha default parameters are optimized for a single-speaker database. There is a VCTK setup too, but it doesn’t feel optimal. Such a focus has its use case, but overall it is not sufficient for modern TTS. As a result, some parts need extra inputs (for example, the duration module doesn’t use the speaker embedding, which creates big issues with duration) and some need more parameters (it is recommended to increase the decoder from 10M to 40M parameters). Overall, it is interesting that most of the current light models are underparameterized. An example of this is the good boost in quality of Kokoro-StyleTTS2 vs plain StyleTTS2.
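
A minimal sketch of the speaker-conditioning idea for the duration module; module and argument names here are illustrative and do not match the Matcha-TTS code exactly.

```python
import torch
import torch.nn as nn

# Illustrative: condition a simple convolutional duration predictor on a
# speaker embedding by projecting it to the text channel dimension and adding
# it to every phoneme position.
class SpeakerAwareDurationPredictor(nn.Module):
    def __init__(self, in_channels=192, spk_channels=64, hidden=256):
        super().__init__()
        self.spk_proj = nn.Linear(spk_channels, in_channels)
        self.conv = nn.Sequential(
            nn.Conv1d(in_channels, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, 1, kernel_size=1),
        )

    def forward(self, text_hidden, spk_emb):
        # text_hidden: (B, C, T), spk_emb: (B, spk_channels)
        cond = self.spk_proj(spk_emb).unsqueeze(-1)   # (B, C, 1)
        return self.conv(text_hidden + cond)          # log-durations, (B, 1, T)
```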

Mel parameters

Matcha uses an 8 kHz cut-off for mel coefficients. Maybe that was once relevant when you wanted to mix in 16 kHz data, but these days there is plenty of wideband data around. I have yet to retrain with a full 11 kHz mel, but I already expect an improvement. 80 mel bins don’t feel like enough either; it is probably better to use 100.
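
For illustration, this is what the suggested mel settings look like with torchaudio for 22.05 kHz audio: 100 bins and the cut-off at the Nyquist frequency instead of 80 bins and 8 kHz. The FFT and hop sizes here are the usual 22.05 kHz values, not necessarily Matcha’s exact config.

```python
import torchaudio

mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=22050,
    n_fft=1024,
    hop_length=256,
    win_length=1024,
    n_mels=100,      # instead of 80
    f_min=0,
    f_max=11025,     # Nyquist, instead of the 8000 Hz cut-off
)

waveform, sr = torchaudio.load("sample.wav")  # any 22.05 kHz file
mel = mel_transform(waveform)                 # (channels, n_mels, frames)
```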

BERT semantics

It is clear that proper synthesis is not really possible without text understanding; as Bert-VITS2 shows, BERT embeddings are a simple way to add it. In our experiments, BERT embeddings are very helpful for Matcha as well: FAD improves from 3.0 to 2.3, which is a really good improvement.

It is not clear why modern research systems still ignore this, essentially weakening their baselines.
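
A rough sketch of the usual recipe, assuming a multilingual BERT and word-to-phoneme counts coming from the G2P front end: pool subword tokens into word vectors, then repeat each word vector for its phonemes and concatenate with the phoneme encoder input. This is a simplified illustration, not the Bert-VITS2 code.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
bert = AutoModel.from_pretrained("bert-base-multilingual-cased").eval()

def bert_word_features(text):
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**enc).last_hidden_state[0]      # (tokens, 768)
    word_ids = enc.word_ids(0)                         # token -> word index
    n_words = max(i for i in word_ids if i is not None) + 1
    # Mean-pool subword tokens into one vector per word.
    return torch.stack([
        hidden[[t for t, w in enumerate(word_ids) if w == i]].mean(dim=0)
        for i in range(n_words)
    ])                                                 # (n_words, 768)

def upsample_to_phones(word_vecs, phones_per_word):
    # Repeat each word vector for its phonemes (counts come from G2P).
    return torch.repeat_interleave(
        word_vecs, torch.tensor(phones_per_word), dim=0
    )                                                  # (n_phones, 768)
```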

Duration issues

Matcha copies the simple VITS duration predictor. It is sufficient for a single speaker, but not really sufficient for life-like synthesis.

Overall, duration is the weakest point of modern VITS successors. It is very sad that the problem is not fully understood and explored. For example, it is surprising that the duration paper linked above never reports duration metrics, only WER. GlowTTS/VITS duration modeling is somewhat innovative (for example, MAS) but very basic: no skipping of silence phones, no chance of learning a proper alignment, and a very inaccurate plain CNN duration model. It is very sad this piece of code is copied from repo to repo. As a result, researchers claim duration models are not helpful (as in E5/F5).

First of all, it is still not clear to me why we feed text encoder outputs to the duration predictor. Text encoder outputs are optimized with the prior loss, an L2 distance to the mel spectrogram, so essentially they are just a rough mel spectrogram where all semantics is lost. No punctuation, nothing. And we hope to predict duration from that.
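
This is roughly the form of the prior loss in GlowTTS/Grad-TTS-style models, which Matcha inherits: a unit-variance Gaussian negative log-likelihood of the mel frames given the aligned encoder outputs. Minimizing it pushes the encoder output toward an averaged mel spectrogram, which is why little semantics survives in it.

```python
import math
import torch

def prior_loss(mu_y, y, y_mask):
    # mu_y: aligned encoder outputs, y: mel frames, y_mask: (B, 1, T) frame mask
    nll = 0.5 * ((y - mu_y) ** 2 + math.log(2 * math.pi)) * y_mask
    return torch.sum(nll) / (torch.sum(y_mask) * y.shape[1])
```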

Another example of an issue copied from repo to repo is “About ceiling for calculating phoneme duration”. Essentially what happens here is that the VITS duration predictor is not very good and often predicts very short durations for important phones, which causes very perceptible phone skips in the audio. The ceil helps to hide this problem by pushing durations up, but it still creates many problems with intonation. To match the original length, the speech then has to be scaled down by a factor of approximately 0.9, and you still get irregular durations sometimes.

The proper solution would be to improve the duration predictor; then ceil can be replaced with round, which leads to much better intonation (FAD drops from 2.3 to 2.0). But without a better predictor you get issues with phone skips (CER rises from 1.2 to 1.9).
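
A sketch of the inference-time duration handling in question. Names follow the VITS/Matcha convention (logw: predicted log-durations, x_mask: phoneme mask); length_scale (around 0.9) is the compensation for the inflation ceil introduces.

```python
import torch

def durations(logw, x_mask, length_scale=1.0, use_ceil=True):
    w = torch.exp(logw) * x_mask * length_scale
    if use_ceil:
        # Original behaviour: hides too-short predictions, hurts intonation.
        return torch.ceil(w)
    # Alternative: better intonation, but short phones can drop to 0 frames
    # and get skipped (CER goes up) unless the predictor itself is improved.
    return torch.round(w)
```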

Our experiment with the flow-matching duration model from the paper above didn’t demonstrate any advantage; in fact, results got worse, probably because there is no speaker vector. See the note on overall flow matching issues below. We have yet to explore this part.

Some other good duration ideas: the duration discriminator (VITS2), interpolation between the SDP and DP predictors (MeloTTS), the LSTM duration model (StyleTTS2).

MAS vs ASR alignment

MAS is somewhat innovative (it implements probabilistic alignment learned during training instead of an old fixed Viterbi-style alignment). However, it is certainly not sufficient for all use cases. It definitely works well for a large single-speaker database, but it is expected to fail on diverse data. Here are the cases:

  1. Fine-tuning on a small and emotional dataset of, say, 0.5 hours. It is a perfectly valid use case, but MAS is not going to work here: there is not enough data to learn a proper alignment.

  2. Training on diverse datasets with a variable amount of speech per speaker. Some speakers have hours, some minutes. MAS will also fail here.

As a result, it becomes obvious that a modern light TTS system should use an external ASR aligner, not a MAS aligner. This also fits the modularity requirements we covered in the previous post.
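
A minimal sketch of the external-aligner route: per-phone segments from an ASR-based forced aligner (or MFA TextGrids) are converted into frame counts and used as direct supervision for the duration predictor instead of MAS. The `segments` structure and the L2-in-log-domain loss are illustrative choices, not Matcha code.

```python
import torch

def segments_to_frame_durations(segments, sample_rate=22050, hop_length=256):
    # segments: list of (phone, start_seconds, end_seconds) from the aligner
    frames_per_second = sample_rate / hop_length
    durs = [round((end - start) * frames_per_second) for _, start, end in segments]
    return torch.tensor(durs, dtype=torch.long)

def duration_loss(log_dur_pred, durs, mask):
    # Supervise predicted log-durations against the aligner durations.
    target = torch.log(durs.float().clamp(min=1))
    return torch.sum(((log_dur_pred - target) ** 2) * mask) / torch.sum(mask)
```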

Interestingly, StyleTTS2 has the proper architecture here. Overall, StyleTTS2 makes many sound decisions and deserves more attention.

Mel representation, Vocos and BigVGAN codecs

By default Matcha uses HiFi-GAN, which is OK and even has good quality, but it is not as universal or as fast as modern vocoders.

We tried BigVGAN with Matcha expecting it to enable high-quality synthesis; unfortunately, it doesn’t really work. UTMOS is 2.5 (10 NFE) and about 2.8 (50 NFE), much lower than Vocos (3.2). The thing is that Matcha doesn’t model the mel precisely, and BigVGAN is not just a speech codec, so it renders the inaccurate mel as non-speech sounds (clicks), hurting quality a lot.
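
For reference, decoding a mel with Vocos looks like the sketch below. Note that the published checkpoint expects its own mel configuration (24 kHz, 100 bins), so pairing Vocos with Matcha’s mel settings generally means retraining or fine-tuning the vocoder on those settings; the mel tensor here is a placeholder.

```python
import torch
from vocos import Vocos

vocos = Vocos.from_pretrained("charactr/vocos-mel-24khz")

mel = torch.randn(1, 100, 256)      # placeholder mel, (batch, n_mels, frames)
with torch.no_grad():
    audio = vocos.decode(mel)       # (batch, samples)
```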

We have yet to investigate other codecs; for example, projects like Pflow-Encodec report good results with Encodec latents.

Mel is clearly not a sufficient representation to reproduce speech faithfully. The VITS2 system has UTMOS 3.4, higher than Matcha, and uses a latent of dimension 192. Hopefully a more advanced codec could help fill the gap between Matcha and end-to-end systems while keeping reasonable synthesis speed.

Flow matching and outliers

Overall, flow matching and diffusion are not a silver bullet. They can emulate complex distributions with some precision, but they also have a drawback: outliers. If your data distribution doesn’t match the model, you get variance in the prediction. Sometimes this variance is nice, for example when you want emotional speech. But sometimes it really hurts: you get outliers which spoil the impression of the result. As an outcome, VITS CER is always high compared to FastSpeech2 models; the former is usually about 2%, the latter 0.7%. VITS often skips or misrenders some phones due to the flow model.

We see the same thing with flow matching in Matcha. Expressiveness improves overall, but some gross outliers appear. We can hear them in acoustics and in duration too. Like VITS, Matcha is not very good at modeling liquid phones (l and r).

It is better to create a good model that matches the target distribution (for example, introduce an intonation vector as an input) than to hope that flow matching will solve your problems. It will not.
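
For context, this is a minimal Euler sampler for a conditional flow-matching decoder, the scheme behind the NFE numbers above; `vector_field` stands for the trained decoder network and `cond` for the aligned encoder output plus any conditioning. More steps trade speed for fewer artifacts, but outliers from a mismatched distribution do not disappear with more NFE.

```python
import torch

@torch.no_grad()
def sample_cfm(vector_field, cond, shape, nfe=10, temperature=0.667):
    x = torch.randn(shape) * temperature            # start from scaled noise
    ts = torch.linspace(0.0, 1.0, nfe + 1)
    for i in range(nfe):
        t, dt = ts[i], ts[i + 1] - ts[i]
        x = x + dt * vector_field(x, t.expand(shape[0]), cond)  # Euler step
    return x
```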

No style vectors

Style vectors, as in StyleTTS2 or HierSpeech++, seem like an important way to control synthesis. Besides a simple timbre vector, an emotional style vector could drive intonation from a reference file and do many other things. Matcha doesn’t use anything similar, probably due to its focus on more uniform speech, but later testing with more diverse databases and speech styles might show that architecture modifications are required.

Streaming

Low-latency synthesis has been getting popular over the last year. While we don’t quite believe in it, something like two-stage synthesis definitely makes sense. Since we need to understand the full semantics to render a line properly, we still need to look at the whole sentence. For example, we can synthesize an intermediate representation of the whole line quickly and then render it with the diffusion decoder in a streaming fashion.
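
A sketch of that two-stage idea under simplifying assumptions: run the fast first stage over the whole sentence so full-sentence semantics is available, then render and vocode chunk by chunk, yielding audio as it becomes ready. All model calls are placeholders; in practice chunks need overlap or context at the boundaries.

```python
def stream_synthesis(text, encoder, decoder, vocoder, chunk_frames=100):
    # Stage 1: fast pass over the full sentence -> frame-level intermediate
    # representation (e.g. the prior / mu).
    mu = encoder(text)
    # Stage 2: render chunks with the (flow-matching / diffusion) decoder and
    # vocode them incrementally.
    for start in range(0, mu.shape[-1], chunk_frames):
        mel_chunk = decoder(mu[..., start:start + chunk_frames])
        yield vocoder(mel_chunk)
```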