Evaluation of Russian TTS models

We recently evaluated Russian open-source and proprietary TTS models. Here are the results:

Engine           Voice      CER   xRT GPU  xRT CPU        UTMOS  Similarity Avg/Min  Encodec FAD
Silero v3_1      Aidar      0.7   0.0177   0.1256         2.544  -                   97.36
Silero v3_1      Baya       0.7   0.0177   0.1256         2.978  -                   170.53
Silero 4         Aidar      1.0   0.0149   0.0544         1.755  -                   79.33
Silero 4         Baya       0.9   0.0149   0.0544         2.144  -                   118.63
Vosk-TTS 0.6     Multi      2.3   -        0.0605         3.283  0.869/0.571         9.99
TeraTTS          Natasha    1.6   -        0.1945         3.281  -                   70.10
UtrobinTTS       Female     2.1   0.0265   0.1323         2.851  -                   73.34
UtrobinTTS       Male       2.1   0.0265   0.1323         3.186  -                   46.14
XTTS2            Multi      2.7   0.3458   -              3.035  0.762/0.468         97.05
Vosk-TTS GPT     Multi      2.1   0.2690   -              3.381  0.814/0.544         10.08
Piper            Denis      3.7   -        0.045          3.056  -                   142.91
Piper            Dmitry     3.6   -        0.045          2.864  -                   130.90
Piper            Irina      1.4   -        0.045          3.672  -                   74.98
Piper            Ruslan     3.0   -        0.045          2.975  -                   72.22
BeneGes          Ruslan     2.4   -        0.321          2.537  -                   63.02
EdgeTTS          Dmitry     0.7   -        0.076 (cloud)  3.565  -                   32.69
EdgeTTS          Svetlana   0.7   -        0.076 (cloud)  3.513  -                   30.60
Yandex           Alexander  0.6   -        0.028 (cloud)  3.413  -                   54.10
Yandex           Marina     0.6   -        0.028 (cloud)  3.482  -                   49.40
Tortoise Ruslan  Multi      6.2   25.0300  -              2.893  0.660/0.483         14.21
Bark Small       Ru_4       10.3  1.201    -              2.554  -                   61.71

To make it clear, here are the evaluated metrics:

  • CER - character error rate, measured by running an ASR recognizer on the synthesized audio; lower is better. It reflects clarity of speech (phonemes are pronounced correctly, not swallowed). A minimal sketch of the computation follows this list.
  • xRT - synthesis speed as a real-time factor (synthesis time divided by the duration of the generated audio); it is computed in the same sketch.
  • UTMOS - an automatic MOS predictor; it reflects sound purity (studio sound quality); higher is better. Sketched further below.
  • Similarity - for multi-voice systems, the similarity of the synthesized voice to a reference sample; the Avg/Min columns are the average and the minimum over the test set. Sketched together with UTMOS below.
  • Encodec FAD - Fréchet Audio Distance over Encodec embeddings; it reflects intonation quality, and lower is better. A sketch follows the data description below.
  • The code for the evaluation is here: https://github.com/alphacep/vosk-tts/tree/master/extra/tts-test/ru
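
For reference, here is a minimal sketch of how the CER and xRT columns can be produced. It assumes Whisper as the ASR recognizer and jiwer for the error rate; the actual scripts in the repo above may use a different recognizer and different text normalization.

    # Sketch: CER via an ASR recognizer, plus xRT from wall-clock timing.
    # Assumptions: Whisper as the recognizer, jiwer for the error rate.
    import time
    import jiwer
    import librosa
    import whisper

    asr = whisper.load_model("large-v3")

    def cer_and_xrt(synthesize, text):
        """synthesize(text) must return the path to the generated wav."""
        start = time.perf_counter()
        wav_path = synthesize(text)
        elapsed = time.perf_counter() - start

        duration = librosa.get_duration(path=wav_path)
        hyp = asr.transcribe(wav_path, language="ru")["text"]

        # In practice both strings should be normalized (punctuation,
        # numbers) before scoring.
        cer = jiwer.cer(text.lower(), hyp.lower())  # lower is better
        xrt = elapsed / duration                    # real-time factor
        return cer, xrt

Note that jiwer.cer returns a fraction; the CER values in the table look like percentages, so multiply by 100 when comparing.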

A similar evaluation repo is https://github.com/Edresson/ZS-TTS-Evaluation
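
UTMOS and similarity are per-utterance scores. A hedged sketch, assuming the publicly available tarepan/SpeechMOS torch.hub entry point for UTMOS and Resemblyzer's speaker encoder for similarity; the test repo may well use different models:

    # Sketch: per-utterance UTMOS and speaker similarity.
    # Assumptions: UTMOS via the tarepan/SpeechMOS torch.hub entry
    # point, similarity via Resemblyzer d-vectors (not necessarily
    # what the vosk-tts test scripts use).
    import numpy as np
    import torch
    import librosa
    from resemblyzer import VoiceEncoder, preprocess_wav

    utmos = torch.hub.load("tarepan/SpeechMOS:v1.2.0", "utmos22_strong",
                           trust_repo=True)
    spk_encoder = VoiceEncoder()

    def utmos_score(path):
        wav, sr = librosa.load(path, sr=16000)
        with torch.no_grad():
            return utmos(torch.from_numpy(wav).unsqueeze(0), sr).item()

    def speaker_similarity(synth_path, reference_path):
        # Resemblyzer embeddings are L2-normalized, so the dot product
        # is the cosine similarity.
        e1 = spk_encoder.embed_utterance(preprocess_wav(synth_path))
        e2 = spk_encoder.embed_utterance(preprocess_wav(reference_path))
        return float(np.dot(e1, e2))

Averaging speaker_similarity over the test utterances gives the Avg column; the worst utterance gives the Min column.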

Some information about the evaluation data:

  • Audiobooks, about 100 speakers, about 1k utterances.
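
For Encodec FAD, the idea is to embed the reference and the synthesized audio with Encodec and compute the Fréchet distance between Gaussians fitted to the two embedding sets. A rough sketch, assuming the facebookresearch encodec package and its continuous encoder output as features (the exact features in the test repo may differ):

    # Sketch: Frechet Audio Distance over Encodec encoder features.
    import numpy as np
    import torch
    import torchaudio
    from scipy import linalg
    from encodec import EncodecModel
    from encodec.utils import convert_audio

    model = EncodecModel.encodec_model_24khz()

    def embed(paths):
        frames = []
        for p in paths:
            wav, sr = torchaudio.load(p)
            wav = convert_audio(wav, sr, model.sample_rate, model.channels)
            with torch.no_grad():
                z = model.encoder(wav.unsqueeze(0))  # (1, dim, frames)
            frames.append(z.squeeze(0).T.numpy())    # frames x dim
        return np.concatenate(frames)

    def fad(reference_paths, synth_paths):
        # Frechet distance between Gaussian fits of the two sets
        a, b = embed(reference_paths), embed(synth_paths)
        mu_a, mu_b = a.mean(axis=0), b.mean(axis=0)
        cov_a = np.cov(a, rowvar=False)
        cov_b = np.cov(b, rowvar=False)
        covmean = linalg.sqrtm(cov_a @ cov_b).real
        diff = mu_a - mu_b
        return float(diff @ diff + np.trace(cov_a + cov_b - 2 * covmean))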

Some observations:

  • FastSpeech2-style systems still show the best clarity (Silero/Yandex/EdgeTTS). They are not very good at intonation, but their clarity is hard to beat. For the end user this actually makes sense: you can live with flat intonation, but artifacts are really annoying.

  • The training database matters a lot; even a small dataset gives very good results (CER and UTMOS) if the data is good. Compare Piper Irina with the other Piper voices: its training data is only about 1 hour.

  • Multi-voice systems seriously suffer from fuzziness (CER grows from around 0.7 to 2.0+); something needs to be done about it.

  • Tortoise is pretty good at intonation (as expected).

  • We need to add another metric responsible for the liveliness of speech (F0 correlation? durations?). FAD is relevant, but it only works for multi-voice systems. A rough sketch of an F0 correlation metric follows this list.

  • XTTS2 results are much worse than I expected, in both similarity and clarity of speech.

  • A good metric to evaluate would be the diversity of the generated speech. VITS, for example, is specifically optimized for diversity compared to FastSpeech. Something to implement in the future.
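
For the liveliness metric mentioned above, one first cut could be the correlation of F0 contours between synthesized and reference recordings of the same text. A minimal sketch using librosa's pYIN tracker; a real metric would need proper time alignment (e.g. DTW), which is skipped here:

    # Sketch: F0 contour correlation as a rough liveliness metric.
    import numpy as np
    import librosa

    def f0_correlation(synth_path, ref_path, sr=16000):
        contours = []
        for path in (synth_path, ref_path):
            y, _ = librosa.load(path, sr=sr)
            # pYIN fills unvoiced frames with NaN by default
            f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                                    fmax=librosa.note_to_hz("C6"), sr=sr)
            contours.append(f0)

        # Crude alignment: truncate to the shorter contour. A proper
        # metric would align the two contours with DTW first.
        n = min(len(c) for c in contours)
        a, b = contours[0][:n], contours[1][:n]
        mask = ~np.isnan(a) & ~np.isnan(b)  # frames voiced in both
        if mask.sum() < 2:
            return float("nan")
        return float(np.corrcoef(a[mask], b[mask])[0, 1])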

See also https://alphacephei.com/nsh/2023/11/29/new-tts.html.