Modern TTS - details

There are many TTS engines around; here are some notes on them.


Per-module profiling (real-time factor below) shows the decoder takes most of the inference time in TTS, so decoder speed optimization is critical:

Module        RTF
Text encoder  0.010
Flow          0.019
Decoder       0.819
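As a quick sanity check, the RTF numbers from the table above translate into each module's share of total inference time (a minimal sketch; the numbers are the table's, the variable names are mine):

```python
# Per-module real-time factors from the table above.
rtf = {"text_encoder": 0.010, "flow": 0.019, "decoder": 0.819}

total = sum(rtf.values())
shares = {name: value / total for name, value in rtf.items()}

# The decoder accounts for the overwhelming majority of inference time,
# so halving decoder cost nearly halves end-to-end latency.
print(f"decoder share: {shares['decoder']:.1%}")  # ~96.6%
```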


HiFi-GAN, VITS - VITS, while end-to-end, uses a plain HiFi-GAN vocoder as its decoder: transposed-convolution (convnet) upsampling from frame rate to sample rate
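The upsampling that dominates this decoder's cost is a stack of 1-D transposed convolutions. A minimal numpy sketch of one such upsampling step (toy kernel and stride, not actual HiFi-GAN weights):

```python
import numpy as np

def conv_transpose_1d(x, kernel, stride):
    """Upsample a 1-D signal by `stride` via transposed convolution:
    each input sample stamps a scaled copy of the kernel into the output."""
    out = np.zeros((len(x) - 1) * stride + len(kernel))
    for i, v in enumerate(x):
        out[i * stride : i * stride + len(kernel)] += v * kernel
    return out

# 80 frames -> upsample 8x; HiFi-GAN chains several such stages
# (e.g. 8x8x2x2 = 256x total for hop size 256).
frames = np.random.randn(80)
audio = conv_transpose_1d(frames, kernel=np.hanning(16), stride=8)
print(audio.shape)  # (79*8 + 16,) = (648,)
```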

iSTFT-VITS - replaces the final HiFi-GAN layers with an inverse STFT, roughly doubling inference speed
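The iSTFT trick: instead of convolving all the way up to sample rate, the network predicts a complex spectrogram at frame rate and a cheap inverse STFT (irfft + overlap-add) produces the waveform. A self-contained numpy sketch of the overlap-add iSTFT itself (toy FFT size; in the real model a network predicts the magnitude and phase):

```python
import numpy as np

N_FFT, HOP = 16, 8                   # 50% overlap
WIN = np.hanning(N_FFT + 1)[:-1]     # periodic Hann: window sum is exactly 1 at 50% overlap

def stft(x):
    return np.array([np.fft.rfft(WIN * x[i:i + N_FFT])
                     for i in range(0, len(x) - N_FFT + 1, HOP)])

def istft(spec):
    # Overlap-add: the expensive conv upsampling is replaced by irfft + add.
    out = np.zeros(HOP * (len(spec) - 1) + N_FFT)
    for k, frame in enumerate(spec):
        out[k * HOP : k * HOP + N_FFT] += np.fft.irfft(frame, n=N_FFT)
    return out

x = np.random.randn(64)
y = istft(stft(x))
# Perfect reconstruction away from the edges, where the window sum is 1.
print(np.allclose(x[N_FFT:-N_FFT], y[N_FFT:-N_FFT]))  # True
```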

MB-iSTFT-VITS - multi-band iSTFT; speeds decoding up even further, but honestly quality degrades despite the paper's claims

Vocos - ConvNeXt backbone + iSTFT head, for both speedup and better quality

StyleTTS2 - source-filter decoder (takes F0 as an extra input)
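A source-filter decoder conditions on F0: a harmonic "source" (excitation) signal is generated from the pitch contour and then filtered by the network. A minimal numpy sketch of a sine excitation from an F0 contour (NSF-style source modules also add harmonics and noise; this is just the core idea):

```python
import numpy as np

def sine_source(f0, sr=16000):
    """Generate a sine excitation from a per-sample F0 contour (Hz).
    Unvoiced regions (f0 == 0) get silence here; real models use noise there."""
    phase = 2 * np.pi * np.cumsum(f0 / sr)       # integrate frequency -> phase
    return np.where(f0 > 0, np.sin(phase), 0.0)

# A steady 100 Hz pitch for 0.1 s yields a clean 100 Hz sine.
f0 = np.full(1600, 100.0)
src = sine_source(f0)
```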


MelGAN - multi-scale discriminator

HiFi-GAN, VITS - multi-period discriminator + multi-scale discriminator + feature-matching loss
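The multi-period discriminator's core trick is reshaping the 1-D waveform into a 2-D grid of shape (T/period, period), so a 2-D conv sees samples one period apart stacked in the same column. A sketch of just that reshape (the conv stacks are omitted):

```python
import numpy as np

def to_period_grid(wav, period):
    """Pad to a multiple of `period` and fold into (frames, period):
    periodic structure that a 1-D multi-scale discriminator misses
    becomes local vertical structure for a 2-D conv."""
    pad = (-len(wav)) % period
    wav = np.pad(wav, (0, pad), mode="reflect")
    return wav.reshape(-1, period)

wav = np.random.randn(22050)
# HiFi-GAN uses prime periods so sub-discriminators see disjoint structure.
grids = {p: to_period_grid(wav, p) for p in (2, 3, 5, 7, 11)}
print(grids[11].shape)  # (2005, 11)
```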

UnivNet, Descript Audio Codec, Vocos - add a multi-resolution spectrogram discriminator (UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation)

StyleTTS2, Bert-VITS2 - add a WavLM discriminator

Multi-discriminator losses

VITS, Descript Audio Codec, UnivNet, Bert-VITS2 - least-squares loss

Vocos, SoundStream - hinge loss (Geometric GAN)

StyleTTS2 - TPR-LS loss (Improve GAN-based Neural Vocoder using Truncated Pointwise Relativistic Least Square GAN)
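For reference, the two common adversarial losses above differ only in how discriminator outputs are scored. A minimal numpy sketch of the least-squares (LSGAN) and hinge discriminator losses (TPR-LS adds a truncated relativistic term on top of least squares and is omitted here):

```python
import numpy as np

def d_loss_least_squares(d_real, d_fake):
    # LSGAN: push D(real) -> 1 and D(fake) -> 0 (VITS, DAC, UnivNet).
    return np.mean((d_real - 1) ** 2) + np.mean(d_fake ** 2)

def d_loss_hinge(d_real, d_fake):
    # Hinge: margin loss; gradients vanish once samples are confidently
    # classified (Vocos, SoundStream).
    return np.mean(np.maximum(0, 1 - d_real)) + np.mean(np.maximum(0, 1 + d_fake))

d_real = np.array([0.9, 1.1])
d_fake = np.array([0.1, -0.2])
print(d_loss_least_squares(d_real, d_fake))  # 0.035
print(d_loss_hinge(d_real, d_fake))          # 1.0
```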

Failed experiments

Backward flow loss, as claimed in NaturalSpeech. In practice the backward loss requires Soft-DTW and realignment; without realignment it "flattens" speech, so WER goes down but FAD goes up significantly. The same effect can be achieved at runtime by simply reducing the noise scale from 0.6667 to 0.4.

Automatic Tuning of Loss Trade-offs without Hyper-parameter Search in End-to-End Zero-Shot Speech Synthesis proposes increasing c_mel to 100. Same effect (WER down, FAD up), and again achievable with a simple runtime noise-scale reduction.
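The runtime noise-scale knob mentioned above: in VITS-style inference the latent is sampled as z = m + noise_scale * sigma * eps, so lowering noise_scale from 0.6667 to 0.4 shrinks latent variance (flatter, more "average" prosody) without any retraining. A minimal sketch (names are illustrative, not the actual VITS code):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_latent(m, logs, noise_scale):
    # VITS-style prior sampling: z = m + noise_scale * exp(logs) * eps.
    eps = rng.standard_normal(m.shape)
    return m + noise_scale * np.exp(logs) * eps

m, logs = np.zeros(10000), np.zeros(10000)
z_default = sample_latent(m, logs, 0.6667)
z_flat = sample_latent(m, logs, 0.4)
# Lower noise scale -> lower latent variance -> flattened prosody:
# WER improves, FAD (distance from natural speech) degrades.
print(z_default.std() > z_flat.std())  # True
```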