TTS Design Thoughts

We spent last year working mostly on TTS, just as in the good old Festival times. Here are some more random thoughts I have on the subject. Rants follow; I still have trouble living in a positive-thinking world. That mindset of course has its advantages, as life demonstrates, but I just can't get there.

These days there are a dozen TTS systems around, and their strengths and weaknesses are not fully understood. Some simply mimic others. For example, the popular MeloTTS is just a VITS system with very tiny modifications and well-trained single-speaker voices. So it has the same weaknesses as VITS: weak global intonation, no emotion, no text understanding. It is fast, though, as any VITS is.

New audio LLMs arrive every week. Everyone markets short response latency. Somehow they all forget to mention WER, which is usually two times worse than in offline systems due to the streaming nature. It is actually crazy that LLM reviews talk about everything but forget to mention WER metrics.

Recently F5-TTS made a splash. People claim it is good after listening to a few short samples. Everyone finetunes it with 80 hours of data on their 4090 with gradient accumulation over 80 steps. Nobody tests anything rigorously, not even WER.
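
Rigorous testing here does not even have to be expensive. A minimal sketch of an automated check, assuming Whisper and jiwer as the tooling and a placeholder test set (proper text normalization is omitted for brevity):

```python
# Round-trip WER check: transcribe synthesized audio with an ASR model
# and compare against the text that was fed to the TTS.
import whisper
from jiwer import wer

asr = whisper.load_model("base.en")

# Placeholder test set: (synthesized audio file, input text) pairs.
test_set = [("sample_001.wav", "the reference text that was synthesized")]

errors = []
for audio_path, text in test_set:
    hyp = asr.transcribe(audio_path)["text"]
    errors.append(wer(text.lower(), hyp.lower().strip()))
print("mean WER:", sum(errors) / len(errors))
```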

The F5-TTS paper reports WER and speaker similarity but never UTMOS or F0 correlation. Not surprising, since F5 uses Vocos, which is not very strong on UTMOS and not very good for multilingual generation. The reasons are simple. First, Vocos is fast but trained on English data only; for other languages it needs a finetune. Second, it uses mel as input, and as a result it is not great for complex sounds like fricatives and clicks. It is kind of crazy that nobody mentions that.
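
And F0 correlation is cheap to measure. A minimal sketch, assuming librosa's pYIN tracker and naive frame-level alignment between reference and synthesized audio (a real protocol would time-align with DTW first):

```python
# F0 correlation between a reference and a synthesized utterance.
import librosa
import numpy as np

def f0_correlation(ref_path, syn_path, sr=22050):
    f0s = []
    for path in (ref_path, syn_path):
        y, _ = librosa.load(path, sr=sr)
        # pYIN returns NaN for unvoiced frames.
        f0, _, _ = librosa.pyin(
            y, fmin=librosa.note_to_hz("C2"),
            fmax=librosa.note_to_hz("C7"), sr=sr)
        f0s.append(f0)
    n = min(len(f0s[0]), len(f0s[1]))      # crude length alignment
    a, b = f0s[0][:n], f0s[1][:n]
    mask = ~(np.isnan(a) | np.isnan(b))    # keep frames voiced in both
    return np.corrcoef(a[mask], b[mask])[0, 1]
```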

Nobody mentions that simple transformers can't learn semantics, even on 200k hours of training data. Very few people honestly report that the transformer swallows syllables and has trouble with complex text inputs like repeating numbers. Nobody talks about pause control, pronunciation control and so on. Nevertheless, F5-TTS is described as the next "Elevenlabs killer".
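
These failure modes are easy to probe. A few stress-test inputs of the kind I mean (the list is my own illustration; feed each one to the system under test and run the round-trip WER check above):

```python
# Illustrative stress-test inputs; repeated numbers and dense digit
# strings are exactly where simple transformer TTS tends to swallow
# or merge syllables.
stress_tests = [
    "1 1 1 1 1 1 1 1 1 1",                              # repeating numbers
    "Call 8 800 555 35 35. Again: 8 800 555 35 35.",    # phone numbers
    "In 2019, 2020 and 2021 it grew 12%, 15% and 3%.",  # mixed digits
    "Wait... wait - no, WAIT!",                         # pauses, emphasis
]
```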

But I have to admit the diffusion transformer idea from Stable Diffusion 3 is super nice.

Let us describe a few basic principles and demonstrate how they affect the design of a modern TTS system.

Design purpose

The no-free-lunch principle is still in place, and the optimal system design depends on the purpose. Tests and everything else are shaped by purpose too. There are different purposes for TTS:

- Book reading TTS: has to read a whole book and render dialogs properly. It doesn't have to be fast, as the generation can be offline.
- On-device interactive TTS (information reading, simple chatbots): has to be fast and clear, with solid pronunciation.
- Emotional chatbots and production TTS: has to be emotional and deliver human-like speech in various conditions. It has to support arbitrary voices and probably arbitrary languages.
- Singing TTS: has a separate feature set altogether.

The thing is, the designs of TTS for the use cases above don't have to be the same. Many simple fast TTS systems (VITS, Matcha, Pflow) are actually single-voice TTS, and they serve their purpose well: they are fast and often clear enough to deliver information. An advanced GPT is unnecessary there, though there are specific things where AI is required (like number pronunciation).

Now, when one sees results for TTS systems in a paper, one has to account for the purpose of the system too. That is, you can't compare a single-speaker system with an emotional multi-speaker one. A single-speaker system is usually trained on LJSpeech, with results reported on LJSpeech only. Examples of such systems are VITS, MatchaTTS and StyleTTS2. Good for single-speaker training and simple, plain, non-emotional speech, they usually fail on emotional speech altogether. Typically such a system has a very plain, simple duration predictor with no semantics; for example, MatchaTTS has a very simple CNN-based duration predictor. For non-fiction audiobooks and simple reporting it is ok. If you ask them to render emotional conversational input, they usually fail in intonation significantly.
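
To make the "no semantics" point concrete, here is a minimal sketch of this kind of duration predictor in PyTorch (an illustrative reimplementation in the spirit of Glow-TTS/Matcha, not the actual code). The receptive field covers only a few neighboring phonemes, so sentence-level context simply cannot reach it:

```python
import torch
import torch.nn as nn

class ConvDurationPredictor(nn.Module):
    """Two 1D convolutions over phoneme encodings, projected to
    per-phoneme log-durations. No attention, no sentence context."""

    def __init__(self, channels=192, kernel_size=3, dropout=0.1):
        super().__init__()
        self.conv1 = nn.Conv1d(channels, channels, kernel_size,
                               padding=kernel_size // 2)
        self.norm1 = nn.LayerNorm(channels)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size,
                               padding=kernel_size // 2)
        self.norm2 = nn.LayerNorm(channels)
        self.proj = nn.Conv1d(channels, 1, 1)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):                     # x: [batch, channels, phonemes]
        x = torch.relu(self.conv1(x))
        x = self.drop(self.norm1(x.transpose(1, 2)).transpose(1, 2))
        x = torch.relu(self.conv2(x))
        x = self.drop(self.norm2(x.transpose(1, 2)).transpose(1, 2))
        return self.proj(x)                   # log-durations: [batch, 1, phonemes]
```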

On the other hand, proper multi-speaker, multi-purpose systems are usually extremely heavy and don't fit small devices.

At the moment it looks like we have to design several different TTS systems for different purposes as it is hard to construct a universal one.

Here are some more points which become very important.

Lean compute and modularity (the end of end2end)

In recent times, when good compute capabilities are not really available to us, a big part of the strategy is to reuse as many components as possible. It really means the end of end2end for small researchers. Things like speaker identification networks, LLMs and vocoders are all better taken pretrained than trained from scratch.
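
As a concrete example of such reuse, one can pull speaker embeddings from a pretrained model instead of training a speaker encoder jointly. A minimal sketch with SpeechBrain's public ECAPA-TDNN (the model name is real; the input file and usage are illustrative):

```python
import torchaudio
from speechbrain.pretrained import EncoderClassifier

# Pretrained speaker verification model from the SpeechBrain hub.
classifier = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb")

signal, sr = torchaudio.load("speaker_sample.wav")  # hypothetical input file
embedding = classifier.encode_batch(signal)         # [1, 1, 192] speaker vector
```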

From experiments, end-to-end systems like VITS trained for months on simple hardware still show very impressive quality compared to modular ones, so how to make the modular approach work is a big research task. For example, Vocos is very hard to use for multilingual cases, as I mentioned above. Hopefully we can find a reasonable replacement. One thing that is clear is that mel is certainly not good enough; probably some encodec-like multi-scale, multi-level features make more sense.
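
For instance, such multi-level features can be taken directly from a pretrained EnCodec model. A minimal sketch with Meta's encodec package (the bandwidth choice and file name are my assumptions):

```python
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

# Pretrained 24 kHz EnCodec; 6 kbps selects 8 residual codebooks.
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)

wav, sr = torchaudio.load("sample.wav")  # hypothetical input file
wav = convert_audio(wav, sr, model.sample_rate, model.channels)

with torch.no_grad():
    frames = model.encode(wav.unsqueeze(0))       # list of (codes, scale)
codes = torch.cat([c for c, _ in frames], dim=-1) # [1, n_codebooks, T]
```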

This requirement also means that all big monolithic architectures have to be passed over. It is very hard to resist the common opinion that big compute is always better, but right now I unfortunately don't see another way forward. Maybe I will be wrong again.

Recently we trained a good old Kaldi ASR for one not very well supported language. It is not perfect, but it ended up being more robust than any MMS-like or Whisper finetune. I kind of feel that a simple modern TDNN ASR system has great potential in the world of LLMs.

Dirty data training

One requirement that is frequently omitted in modern TTS papers is the ability to properly train from dirty data. Given the amounts of data and the number of languages, the input data is certainly dirty, and some methods work better with dirty data than others. There is no good research here, but hopefully we can do something about it soon.

Monotonic alignment, for example, is a nice move forward here. Due to its probabilistic nature it can actually deal with some data inconsistencies. But its full behavior and consequences are not yet well understood.
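
For readers who haven't looked inside: monotonic alignment search is a small dynamic program. A minimal numpy sketch in the style of Glow-TTS (an illustrative reimplementation, not the original cython code):

```python
import numpy as np

def monotonic_alignment_search(log_p):
    """log_p: [n_tokens, n_frames] log-likelihood of token i emitting frame j.
    Returns the token index assigned to each frame along the best
    monotonic path."""
    n_tokens, n_frames = log_p.shape
    Q = np.full((n_tokens, n_frames), -np.inf)
    Q[0, 0] = log_p[0, 0]
    for j in range(1, n_frames):
        # Token i can only appear at frame j if i <= j.
        for i in range(min(j + 1, n_tokens)):
            stay = Q[i, j - 1]                            # same token
            move = Q[i - 1, j - 1] if i > 0 else -np.inf  # next token
            Q[i, j] = log_p[i, j] + max(stay, move)
    # Backtrack: each frame gets exactly one token, indices never decrease.
    path = np.zeros(n_frames, dtype=np.int64)
    i = n_tokens - 1
    for j in range(n_frames - 1, 0, -1):
        path[j] = i
        if i > 0 and Q[i - 1, j - 1] >= Q[i, j - 1]:
            i -= 1
    path[0] = i  # ends at token 0 when n_frames >= n_tokens
    return path
```

Per-token durations then fall out as np.bincount(path, minlength=n_tokens). Since the path is re-estimated to maximize total likelihood rather than trusting an external aligner, mild transcript noise tends to get absorbed by neighboring tokens instead of derailing training, which is presumably part of why it tolerates some inconsistency.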

Everyone is using transformers; how good they are for dirty data, that is the question.