Modern TTS - approaches and requirements

Speech technology is constantly being disrupted by neural networks and generative AI. TTS is a good example. In the last few years something like a hundred methods and models have been proposed: Tacotron2, FastPitch, GlowTTS, Flowtron and so on. Recent developments have made them all, if not obsolete, then at least not very relevant.

Modern TTS is approaching human-like quality. It is multilingual and multispeaker, and it is designed to render long texts and dialogs in streaming style. This changes the way TTS systems are designed, created, tested and used. Here are some points that are becoming important but that one doesn’t see in modern research papers or open source packages.

How we test TTS systems

Because TTS is now much more natural and variable, you need much more input to test it. Before, it was enough to test engine quality with 10 utterances and MOS-like scores. Now you need 10-20 hours of speech from multiple speakers to evaluate the engine reliably. This automatically means you can’t test the engine with subjective measures. Blizzard-style testing with MTurk doesn’t really work anymore. And even 10 years ago it was a huge effort to do subjective testing properly, since you had to collect opinions from hundreds of expert participants.

You can still compare 10 prompts, but proper testing has to be automated. This is one of the big mistakes every open source package makes. They share 10 samples of a creaky LJSpeech voice and try to convince you they are way better than the others. No metrics, no established tests, nothing. Let me repeat: stop using LJSpeech in demos. It was a nice public dataset 10 years ago, not anymore.

Of course automated metrics exist. There are things like WER testing, Frechet Audio Distance, and many more. But nobody uses them in open source.
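To make the WER part concrete, here is a minimal sketch of a corpus-level WER computation. It assumes you have already synthesized your test texts and run the audio through some ASR system, so each pair is (input text, ASR transcript). The function names and the lowercase-and-split normalization are my own illustration, not taken from any particular toolkit.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference token
                cur[j - 1] + 1,          # insertion of a hypothesis token
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = cur
    return prev[-1]


def corpus_wer(pairs):
    """WER pooled over all (reference_text, asr_transcript) pairs.

    Normalization here is just lowercasing and whitespace splitting,
    which is an assumption; a real setup needs proper text normalization.
    """
    errors = 0
    words = 0
    for ref_text, hyp_text in pairs:
        ref = ref_text.lower().split()
        hyp = hyp_text.lower().split()
        errors += edit_distance(ref, hyp)
        words += len(ref)
    return errors / words if words else 0.0
```

Pooling errors over the whole corpus rather than averaging per-utterance rates keeps short utterances from dominating the number. The same edit_distance works on phoneme sequences if you want PER instead of WER.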

Even when a project like MatchaTTS publishes WER numbers in the paper, it does not include the corresponding evaluation setup in the repository. Others simply don’t publish anything.

For example, StyleTTS2 recently made a splash, yet it still only uses MOS and mostly deals with short utterances.

Another big point is that modern TTS is generative, i.e. noise-based (GANs, diffusion models or flows). This means that noise is an organic part of the training process, and so loss values are meaningless for monitoring training. They mostly reflect the noise injected in a particular epoch and don’t really show the quality of the model. Results can also vary significantly from epoch to epoch. Again, a proper automated metric is required. The Stable Diffusion folks use Frechet Distance to estimate convergence of SD models. The same is critical for speech.
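As a rough illustration of tracking checkpoints with a Frechet-style distance instead of the loss, here is a dependency-free sketch. It assumes you already have fixed-size embeddings of real and synthesized audio from some pretrained audio model (which model is up to you and not specified here). Note the simplification: the real metric uses full covariance matrices and a matrix square root, while this sketch keeps only the diagonal. All names are hypothetical.

```python
import math


def mean_and_var(embeddings):
    """Per-dimension mean and (population) variance of equal-length vectors."""
    n = len(embeddings)
    dim = len(embeddings[0])
    mean = [sum(e[d] for e in embeddings) / n for d in range(dim)]
    var = [sum((e[d] - mean[d]) ** 2 for e in embeddings) / n for d in range(dim)]
    return mean, var


def frechet_distance_diag(real, synth):
    """Frechet distance between two Gaussians with diagonal covariance.

    Simplification: off-diagonal covariance terms are ignored, so this is
    not the full Frechet Audio Distance, only a cheap stand-in.
    """
    mu_r, var_r = mean_and_var(real)
    mu_s, var_s = mean_and_var(synth)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_s))
    cov_term = sum(a + b - 2.0 * math.sqrt(a * b) for a, b in zip(var_r, var_s))
    return mean_term + cov_term


def best_checkpoint(real, synth_by_checkpoint):
    """Return the checkpoint name with the lowest distance, plus all scores."""
    scores = {
        name: frechet_distance_diag(real, emb)
        for name, emb in synth_by_checkpoint.items()
    }
    return min(scores, key=scores.get), scores
```

The idea is to synthesize the same held-out texts with every saved checkpoint, embed the audio, and pick the checkpoint by the distance to real speech rather than by the last loss value.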

So there is a critical need to evaluate TTS systems objectively, on a large and diverse test set. And the reality is that automated metrics are far from perfect.

Take the major ones. Frechet Audio Distance measures mostly global properties: it more or less makes sure your pitch follows the original, but it doesn’t really account for glitches and phoneme replacements. And phoneme replacements are very common in modern algorithms, most of which are based on VITS. The secret is that the monotonic_align algorithm from VITS is very unstable. It doesn’t converge properly on silence breaks and many other things that can occur in a TTS database. As a result, VITS-based models usually have misplaced phonemes, with a PER (phoneme error rate) of around 3%. Not a lot, but easy for any listener to catch on sufficiently long input (1-2 minutes). Note that you cannot catch PER errors with 10 utterances totaling 10 seconds.
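A back-of-the-envelope calculation shows why a tiny test set can't pin down an error rate of this size. The sketch below treats phoneme errors as independent (a binomial model) and assumes a speaking rate of 12 phonemes per second. Both are my assumptions for illustration, not measured values.

```python
import math

PHONEMES_PER_SECOND = 12.0  # assumed speaking rate, for illustration only


def per_standard_error(per, seconds, rate=PHONEMES_PER_SECOND):
    """Standard error of a measured PER under a binomial model."""
    n = seconds * rate  # approximate number of phonemes in the test set
    return math.sqrt(per * (1.0 - per) / n)


def per_interval(per, seconds, z=1.96):
    """Approximate 95% interval for the measured PER, clipped at zero."""
    se = per_standard_error(per, seconds)
    return max(0.0, per - z * se), per + z * se


if __name__ == "__main__":
    for label, seconds in [("10 seconds", 10), ("2 minutes", 120), ("10 hours", 36000)]:
        low, high = per_interval(0.03, seconds)
        print(f"{label:>10}: true PER 3.0%, measured between {low:.2%} and {high:.2%}")
```

Under these assumptions, 10 seconds of audio gives an uncertainty of about plus or minus 3 percentage points around a true 3% PER, so you can't tell a good model from a bad one. With hours of test audio the interval shrinks to a small fraction of a point.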

Then you can do WER. It catches phoneme mistakes, but then another issue comes up. The WER metric prefers a plain, more robotic, less variable voice. Of course such a voice is easier to recognize, but that is the opposite of being natural. So just by optimizing WER you don’t get to the right place.

A better discriminator is required. Some networks like MOSNet try to model it. However, I’m not aware of a universal solution, and it is certainly not common practice to use anything like that among VITS clones. There is a nice publication from Bhiksha Raj which proposes to evaluate TTS by training an ASR recognizer on the synthesized speech, but I suppose the method is crazy slow. Overall, it is something we ultimately need.

Database design

The second thing is database design. Whisper made it clear: it is no longer enough to train on standalone utterances, because human context is much longer. Actually, I recently heard from some neuroscience people that it is about 30 seconds, just like the Whisper window. For TTS it means the following:

Databases should not consist of single sentences anymore. We need to capture inter-sentence intonation properly, and for this we need several sentences in every utterance. This is not the case for many currently open databases, and as a result many modern TTS systems can model sentences properly but can rarely synthesize complex dialogs. A good example is the popular VITS. Locally it renders sounds with good quality, and you probably can’t even distinguish it from a human. Globally it fails miserably. The model has no notion of long-term intonation, and as a result the intonation is totally random. For GPT-like models the situation is better but still very bad, since they don’t see enough context in training. Otherwise you need a 1M-hour TTS database for a GPT-like model.

Second, semantics of course doesn’t come from the audio alone. Through text you can keep a context much longer than 30 seconds. Modern LLMs compete on whole-book comprehension with very wide contexts, like 200k symbols. It is the same for TTS: the speech needs to account for a much wider context around the current utterance, and it can do so with LLM embeddings, for example.

So it is time to resegment audiobooks and podcasts into multi-sentence chunks with context, otherwise TTS will not be as good as we present it. A proper approach here is implemented in our beloved K2, but it doesn’t get as much attention as it deserves. Hopefully the situation will change in the near future.
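To show what such resegmentation could look like, here is a plain greedy sketch. This is not how K2 does it, just an illustration under my own assumptions: the audio is already aligned into sentences with start and end times, the chunk limit is 30 seconds, and the context is the text of a few preceding sentences. The Sentence and Chunk types and the make_chunks name are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sentence:
    text: str
    start: float  # seconds from the beginning of the recording
    end: float


@dataclass
class Chunk:
    text: str
    start: float
    end: float
    context: str  # text of the sentences right before the chunk


def _build(sentences, first_index, current, context_sentences):
    """Assemble a chunk and attach the preceding sentences as text context."""
    ctx = sentences[max(0, first_index - context_sentences):first_index]
    return Chunk(
        text=" ".join(s.text for s in current),
        start=current[0].start,
        end=current[-1].end,
        context=" ".join(s.text for s in ctx),
    )


def make_chunks(sentences: List[Sentence], max_seconds: float = 30.0,
                context_sentences: int = 3) -> List[Chunk]:
    """Greedily pack consecutive sentences into chunks of up to max_seconds.

    A single sentence longer than max_seconds still becomes its own chunk.
    """
    chunks = []
    current = []
    first_index = 0
    for i, sent in enumerate(sentences):
        # Close the chunk if adding this sentence would exceed the limit.
        if current and sent.end - current[0].start > max_seconds:
            chunks.append(_build(sentences, first_index, current, context_sentences))
            current = []
        if not current:
            first_index = i
        current.append(sent)
    if current:
        chunks.append(_build(sentences, first_index, current, context_sentences))
    return chunks
```

Each chunk keeps the pauses between its sentences because start and end span the original audio, and the context field is what you would feed to a text encoder or LLM to condition on what came before.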

Many of the points above are covered in recent publications, including ones from the Google TTS teams and from Apple. However, this is very far from current open source, where everyone is optimizing the next vocoder. So a lot of work again, comrades: time to drop all the code we wrote before and rewrite it from scratch. It is never boring.