Modern TTS - details

There are many TTS engines around. Here are some notes about their internals.

Speed

According to https://arxiv.org/pdf/2210.15975.pdf the decoder takes most of the inference time in TTS, so optimizing decoder speed is critical. The paper reports the following real-time factor (RTF) per module:

Module RTF
Text encoder 0.010
Flow 0.019
Decoder 0.819
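
As a rough illustration of what these numbers mean, here is a minimal sketch that computes an RTF and the decoder's share of the total. The function name and the assumption that the three modules run sequentially (so their RTFs simply add up) are mine, not taken from the paper.

```python
def real_time_factor(processing_seconds: float, num_samples: int, sample_rate: int) -> float:
    """RTF = time spent synthesizing / duration of the audio produced."""
    audio_seconds = num_samples / sample_rate
    return processing_seconds / audio_seconds


# Per-module RTF values from the table above.
module_rtf = {"text_encoder": 0.010, "flow": 0.019, "decoder": 0.819}

# Assumption: the modules run one after another, so their RTFs add up.
total_rtf = sum(module_rtf.values())
decoder_share = module_rtf["decoder"] / total_rtf

print(f"total RTF of listed modules: {total_rtf:.3f}")
print(f"decoder share of that total: {decoder_share:.1%}")
```

Under that assumption the decoder accounts for well over 90% of the listed time, which is why the rest of these notes focus on the decoder.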

Upsampler

HiFi-GAN, VITS - although VITS is end-to-end, its decoder is a plain HiFi-GAN vocoder that upsamples with a stack of transposed convolutions

iSTFT-VITS - replaces the final layers of HiFi-GAN with an iSTFT, which makes inference about twice as fast

MB-iSTFT-VITS - multiband iSTFT, which speeds up decoding even more, but in practice quality degrades despite the paper's claims

VOCOS - ConvNeXt backbone + iSTFT head, for both speed and better quality

StyleTTS2, https://github.com/chomeyama/SiFiGAN - source-filter model (takes f0 as an additional input)
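
The trade-off between convolutional upsampling and iSTFT can be sketched as simple arithmetic: whatever the decoder design, each input frame has to become one hop of waveform samples. The VITS values below (hop 256, upsample rates 8, 8, 2, 2) are the defaults of the public VITS config. The iSTFT-VITS and MB-iSTFT-VITS values are assumptions chosen for illustration and should be checked against the actual configs.

```python
from math import prod

HOP_LENGTH = 256  # VITS default: one latent frame per 256 waveform samples


def total_upsampling(conv_rates, istft_hop=1, subbands=1):
    """Waveform samples produced per input frame by a decoder of this shape."""
    return prod(conv_rates) * istft_hop * subbands


decoders = {
    # Four transposed-conv stages do all of the upsampling.
    "HiFi-GAN (VITS)": dict(conv_rates=[8, 8, 2, 2]),
    # Assumed values: two conv stages, the iSTFT supplies the last factor of 4.
    "iSTFT-VITS (assumed)": dict(conv_rates=[8, 8], istft_hop=4),
    # Assumed values: less conv upsampling, 4 subbands merged at the end.
    "MB-iSTFT-VITS (assumed)": dict(conv_rates=[4, 4], istft_hop=4, subbands=4),
}

for name, cfg in decoders.items():
    factor = total_upsampling(**cfg)
    assert factor == HOP_LENGTH
    print(f"{name}: conv x{prod(cfg['conv_rates'])}, total x{factor}")
```

The convolution layers that run at the highest time resolution are the expensive ones, so moving the last upsampling factors into an iSTFT (and into subbands) removes most of that cost.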

Discriminators

MelGAN - multiscale discriminator

HiFi-GAN, VITS - multiperiod discriminator + multiscale discriminator + feature matching

UnivNet, Descript Audio Codec, VOCOS - add a multiresolution spectrogram discriminator (UnivNet: A neural vocoder with multi-resolution spectrogram discriminators for high-fidelity waveform generation https://arxiv.org/pdf/2106.07889.pdf)

StyleTTS2, Bert-VITS2 - add a WavLM discriminator
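
To make the multiperiod idea concrete, here is a minimal pure-Python sketch of the reshaping step: the 1D waveform is folded into rows of `period` samples, so each column contains every `period`-th sample and a 2D convolution can look at periodic structure. The helper name is hypothetical, and the sketch assumes the signal is longer than the period. HiFi-GAN uses periods 2, 3, 5, 7 and 11.

```python
def to_period_grid(signal, period):
    """Fold a 1D signal into rows of `period` samples.

    Column k of the result holds samples k, k + period, k + 2 * period, ...
    """
    x = list(signal)
    remainder = len(x) % period
    if remainder:
        pad = period - remainder
        # Reflect padding: mirror the samples just before the last one.
        x = x + x[-2:-2 - pad:-1]
    return [x[i:i + period] for i in range(0, len(x), period)]


grid = to_period_grid(range(7), period=3)
for row in grid:
    print(row)

# First column: every 3rd sample of the original signal.
print([row[0] for row in grid])
```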

Multidiscriminator loss

VITS, Descript Audio Codec, UnivNet, Bert-VITS2 - least squares loss

Vocos, Soundstream - hinge loss (Geometric GAN https://arxiv.org/pdf/1705.02894.pdf)

StyleTTS2 - TPR-LS loss (Improve gan-based neural vocoder using truncated pointwise relativistic least square gan https://arxiv.org/pdf/2103.14245.pdf)
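
For comparison, here is a minimal sketch of the first two losses in their textbook form, written over plain lists of discriminator scores. Function names are illustrative, and individual codebases may use variants (for example a different generator-side hinge term) and sum the result over several discriminators. TPR-LS is left out here.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)


def ls_discriminator_loss(real_scores, fake_scores):
    # Least squares: push real scores towards 1 and fake scores towards 0.
    return mean((1.0 - r) ** 2 for r in real_scores) + mean(f ** 2 for f in fake_scores)


def ls_generator_loss(fake_scores):
    # Generator wants its samples scored as 1.
    return mean((1.0 - f) ** 2 for f in fake_scores)


def hinge_discriminator_loss(real_scores, fake_scores):
    # Hinge: only penalize real scores below +1 and fake scores above -1.
    return (mean(max(0.0, 1.0 - r) for r in real_scores)
            + mean(max(0.0, 1.0 + f) for f in fake_scores))


def hinge_generator_loss(fake_scores):
    # Textbook hinge generator term: raise the score of generated samples.
    return -mean(fake_scores)


real, fake = [2.0], [-2.0]  # a confident and correct discriminator
print("LS    D loss:", ls_discriminator_loss(real, fake))
print("hinge D loss:", hinge_discriminator_loss(real, fake))
```

The example shows the main practical difference: the hinge loss is zero once scores are beyond the margin, while the least squares loss keeps pulling them back towards the targets 1 and 0.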

Failed experiments

https://github.com/PlayVoice/vits_chinese - claims to have a backward flow loss like in NaturalSpeech. In practice the backward loss requires SoftDTW and realignment. Without realignment it “flattens” speech; as a result WER goes down and FAD goes up significantly. The same effect can be achieved by reducing the noise scale at runtime from 0.6667 to 0.4
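
The noise scale mentioned here is the factor applied to the prior's standard deviation when the latent is sampled at inference time. A minimal sketch of that sampling step, using plain Python lists instead of tensors (the function name and the toy mean and log-std values are made up for illustration):

```python
import math
import random


def sample_prior(mean, log_std, noise_scale, rng):
    """z = mean + eps * exp(log_std) * noise_scale, with eps ~ N(0, 1)."""
    return [m + rng.gauss(0.0, 1.0) * math.exp(s) * noise_scale
            for m, s in zip(mean, log_std)]


# Toy prior statistics for three latent dimensions (illustrative values).
mean = [0.0, 1.0, -1.0]
log_std = [0.0, -0.5, 0.5]

# Same seed for both calls, so only the noise scale differs.
default = sample_prior(mean, log_std, 0.6667, random.Random(0))
flatter = sample_prior(mean, log_std, 0.4, random.Random(0))

for m, a, b in zip(mean, default, flatter):
    print(f"mean {m:+.2f}  scale 0.6667 -> {a:+.3f}  scale 0.4 -> {b:+.3f}")
```

A smaller noise scale keeps every sample closer to the prior mean, which is consistent with the flatter but more intelligible speech described above.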

https://arxiv.org/abs/2305.16699 (Automatic Tuning of Loss Trade-offs without Hyper-parameter Search in End-to-End Zero-Shot Speech Synthesis) - proposes increasing c_mel to 100. The effect is the same: WER goes down and FAD goes up, which a simple reduction of the noise scale at runtime also achieves.