Written by Nickolay Shmyrev
Modern TTS - details
There are many TTS engines around. Here are some notes about their internals.
Speed
According to https://arxiv.org/pdf/2210.15975.pdf, the decoder takes most of the time in TTS, so decoder
speed optimization is critical.
| Module | RTF |
|---|---|
| Text encoder | 0.010 |
| Flow | 0.019 |
| Decoder | 0.819 |
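RTF (real-time factor) is processing time divided by the duration of the audio produced, so values below 1 mean faster than real time. A minimal sketch of the arithmetic behind the claim, using only the per-module numbers listed above; the `rtf` helper and the dictionary are illustrative names, not taken from the paper:

```python
# Per-module RTF values copied from the table above.
module_rtf = {"text_encoder": 0.010, "flow": 0.019, "decoder": 0.819}

def rtf(processing_seconds, audio_seconds):
    """Real-time factor: compute time spent per second of audio produced."""
    return processing_seconds / audio_seconds

# Share of the decoder among the modules listed in the table.
total_rtf = sum(module_rtf.values())
decoder_share = module_rtf["decoder"] / total_rtf

print(f"total RTF of listed modules: {total_rtf:.3f}")
print(f"decoder share: {decoder_share:.1%}")
```

Among the three listed modules the decoder accounts for roughly 97% of the time, which is why the upsampler design matters so much.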
Upsampler
HiFi-GAN, VITS - although VITS is end-to-end, its decoder is a plain HiFi-GAN vocoder with convolutional upsampling
iSTFT-VITS - replaces the final layers of HiFi-GAN with an iSTFT, which speeds up inference about twofold
MB-iSTFT-VITS - multiband iSTFT, speeds up decoding even more, but in practice quality degrades despite the paper's claims
Vocos - ConvNeXt + iSTFT for speedup and better quality
StyleTTS2, SiFiGAN (https://github.com/chomeyama/SiFiGAN) - source-filter model (takes f0 as an additional input)
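In the iSTFT-based variants the network stops upsampling early and predicts a magnitude and a phase per frame; a fixed inverse STFT then produces the waveform. A minimal NumPy sketch of that last step, assuming a Hann window and overlap-add synthesis. The `stft` helper only stands in for the network output, and all names, `n_fft` and `hop` values are illustrative:

```python
import numpy as np

def stft(signal, n_fft=16, hop=4):
    """Hann-windowed STFT; returns (magnitude, phase), each (frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft + 1)[:-1]  # periodic Hann window
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=1)
    return np.abs(spec), np.angle(spec)

def istft_head(magnitude, phase, n_fft=16, hop=4):
    """Magnitude and phase to waveform: inverse FFT per frame, then overlap-add."""
    window = np.hanning(n_fft + 1)[:-1]
    spec = magnitude * np.exp(1j * phase)
    frames = np.fft.irfft(spec, n=n_fft, axis=1) * window
    length = n_fft + hop * (len(frames) - 1)
    audio = np.zeros(length)
    norm = np.zeros(length)
    for i, frame in enumerate(frames):
        audio[i * hop:i * hop + n_fft] += frame
        norm[i * hop:i * hop + n_fft] += window ** 2
    # Divide by the summed squared window to undo the analysis/synthesis windows.
    return audio / np.maximum(norm, 1e-8)

# Stand-in for predicted magnitude and phase: analyse a known sine wave.
x = np.sin(2 * np.pi * np.arange(64) / 8.0)
mag, phase = stft(x)
y = istft_head(mag, phase)
```

The iSTFT itself has no learned parameters and one frame yields `hop` new samples at once, which is where the speedup over stacked transposed convolutions comes from.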
Discriminators
MelGAN - multiscale discriminator
HiFi-GAN, VITS - multiperiod discriminator + multiscale discriminator + feature matching
UnivNet, Descript Audio Codec, Vocos - add a multiresolution spectrogram discriminator (UnivNet: A neural vocoder with
multi-resolution spectrogram discriminators for high-fidelity waveform generation, https://arxiv.org/pdf/2106.07889.pdf)
StyleTTS2, Bert-VITS2 - adds WavLM discriminator
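The multiperiod discriminator listed above works by folding the 1D waveform into a 2D array so that each column holds samples spaced one period apart, then applying 2D convolutions. A minimal NumPy sketch of the folding step only; `fold_by_period` is an illustrative name, and the reflect padding follows the HiFi-GAN approach as far as I recall it:

```python
import numpy as np

def fold_by_period(audio, period):
    """Reshape 1D audio to (frames, period); column k holds samples k, k+period, ..."""
    remainder = len(audio) % period
    if remainder:
        # Pad the tail so the length is a multiple of the period.
        audio = np.pad(audio, (0, period - remainder), mode="reflect")
    return audio.reshape(-1, period)

audio = np.arange(10.0)
folded = fold_by_period(audio, 3)
print(folded)
```

HiFi-GAN runs one such sub-discriminator for each of the periods 2, 3, 5, 7 and 11, so that the sub-discriminators overlap as little as possible.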
Multidiscriminator loss
VITS, Descript Audio Codec, UnivNet, Bert-VITS2 - least-squares loss
Vocos, Soundstream - hinge loss (Geometric GAN https://arxiv.org/pdf/1705.02894.pdf)
StyleTTS2 - TPR-LS loss (Improve GAN-based neural vocoder using truncated pointwise relativistic least square GAN https://arxiv.org/pdf/2103.14245.pdf)
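For comparison, the first two losses in their textbook form, as a minimal NumPy sketch. `d_real` and `d_fake` are hypothetical discriminator outputs on real and generated audio; individual codebases differ in details (for example, some clip the generator hinge term), and the TPR-LS loss is not covered here:

```python
import numpy as np

def ls_discriminator_loss(d_real, d_fake):
    """Least-squares GAN: push real outputs to 1 and fake outputs to 0."""
    return np.mean((1.0 - d_real) ** 2) + np.mean(d_fake ** 2)

def ls_generator_loss(d_fake):
    """Least-squares GAN: generator wants fake outputs scored as 1."""
    return np.mean((1.0 - d_fake) ** 2)

def hinge_discriminator_loss(d_real, d_fake):
    """Hinge: only penalise real outputs below +1 and fake outputs above -1."""
    real_term = np.mean(np.maximum(0.0, 1.0 - d_real))
    fake_term = np.mean(np.maximum(0.0, 1.0 + d_fake))
    return real_term + fake_term

def hinge_generator_loss(d_fake):
    """Hinge (unclipped form): generator raises the discriminator score."""
    return -np.mean(d_fake)

# Hypothetical discriminator outputs for two real and two generated clips.
d_real = np.array([1.0, 0.5])
d_fake = np.array([0.0, -1.0])

print(ls_discriminator_loss(d_real, d_fake), ls_generator_loss(d_fake))
print(hinge_discriminator_loss(d_real, d_fake), hinge_generator_loss(d_fake))
```

With several discriminators, these terms are typically computed per sub-discriminator and summed or averaged.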
Failed experiments
https://github.com/PlayVoice/vits_chinese - claims to have a backward flow loss like in NaturalSpeech. In practice
the backward loss requires SoftDTW and realignment. Without realignment it “flattens” speech: as a result WER goes down and FAD goes up significantly.
The same effect can be achieved by reducing the noise scale at runtime from 0.6667 to 0.4.
https://arxiv.org/abs/2305.16699 - Automatic Tuning of Loss Trade-offs without Hyper-parameter Search in End-to-End Zero-Shot Speech Synthesis.
Proposes increasing c_mel to 100. Same effect: WER goes down and FAD goes up, which a simple noise scale reduction at runtime also achieves.
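The noise scale mentioned in both notes multiplies the standard deviation of the text-conditioned prior at inference time, so lowering it pulls the sampled latents toward the prior mean and makes the output less varied. A minimal NumPy sketch, assuming the usual VITS-style sampling `z = mean + noise * exp(log_std) * noise_scale`; the function name, shapes and prior statistics are illustrative:

```python
import numpy as np

def sample_prior(mean, log_std, noise_scale, rng):
    """Draw a latent around the prior mean; noise_scale shrinks or widens the spread."""
    noise = rng.standard_normal(mean.shape)
    return mean + noise * np.exp(log_std) * noise_scale

# Hypothetical prior statistics for 1000 frames of a 4-dimensional latent.
mean = np.zeros((1000, 4))
log_std = np.zeros((1000, 4))

# Same seed for both draws, so only the noise scale differs.
z_default = sample_prior(mean, log_std, 0.6667, np.random.default_rng(0))
z_reduced = sample_prior(mean, log_std, 0.4, np.random.default_rng(0))

print("std at 0.6667:", z_default.std())
print("std at 0.4:   ", z_reduced.std())
```

The spread of the latents shrinks by the ratio of the two scales, which matches the flatter speech (lower WER, higher FAD) described above.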