Evaluation of Russian TTS models

We recently evaluated Russian open-source and proprietary TTS models. Here are the results:

Engine           Voice      CER   xRT GPU  xRT CPU        UTMOS  Similarity Avg/Min  Encodec FAD
Silero v3_1      Aidar      0.7   0.0177   0.1256         2.544  -                   97.36
Silero v3_1      Baya       0.7   0.0177   0.1256         2.978  -                   170.53
Silero 4         Aidar      1.0   0.0149   0.0544         1.755  -                   79.33
Silero 4         Baya       0.9   0.0149   0.0544         2.144  -                   118.63
Vosk-TTS 0.6     Multi      2.3   -        0.0605         3.283  0.869/0.571         9.99
TeraTTS          Natasha    1.6   -        0.1945         3.281  -                   70.10
UtrobinTTS       Female     2.1   0.0265   0.1323         2.851  -                   73.34
UtrobinTTS       Male       2.1   0.0265   0.1323         3.186  -                   46.14
XTTS2            Multi      2.7   0.3458   -              3.035  0.762/0.468         97.05
Vosk-TTS GPT     Multi      2.1   0.2690   -              3.381  0.814/0.544         10.08
Piper            Denis      3.7   -        0.045          3.056  -                   142.91
Piper            Dmitry     3.6   -        0.045          2.864  -                   130.90
Piper            Irina      1.4   -        0.045          3.672  -                   74.98
Piper            Ruslan     3.0   -        0.045          2.975  -                   72.22
BeneGes          Ruslan     2.4   -        0.321          2.537  -                   63.02
EdgeTTS          Dmitry     0.7   -        0.076 (cloud)  3.565  -                   32.69
EdgeTTS          Svetlana   0.7   -        0.076 (cloud)  3.513  -                   30.60
Yandex           Alexander  0.6   -        0.028 (cloud)  3.413  -                   54.10
Yandex           Marina     0.6   -        0.028 (cloud)  3.482  -                   49.40
Tortoise Ruslan  Multi      6.2   25.0300  -              2.893  0.660/0.483         14.21
Bark Small       Ru_4       10.3  1.201    -              2.554  -                   61.71

To make it clear, here are the evaluated metrics:

  • CER - character error rate, measured by running an ASR recognizer on the synthesized audio; lower is better. It reflects clarity of speech (phonemes are pronounced correctly, not swallowed). A minimal sketch of the computation follows this list.
  • xRT - synthesis speed as a real-time factor (synthesis time divided by the duration of the generated audio); it is computed in the same sketch.
  • UTMOS - an automatic MOS predictor; it reflects sound purity (studio sound quality); higher is better. Sketched further below.
  • Similarity - for multi-voice systems, the similarity of the synthesized voice to a reference sample; the Avg/Min columns are the average and the minimum over the test set. Sketched together with UTMOS below.
  • Encodec FAD - Fréchet Audio Distance over Encodec embeddings; it reflects intonation quality, and lower is better. A sketch follows the data description below.
  • The code for the evaluation is here: https://github.com/alphacep/vosk-tts/tree/master/extra/tts-test/ru
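
For reference, here is a minimal sketch of how the CER and xRT columns can be produced. It assumes Whisper as the ASR recognizer and jiwer for the error rate; the actual scripts in the repo above may use a different recognizer and different text normalization.

    # Sketch: CER via an ASR recognizer, plus xRT from wall-clock timing.
    # Assumptions: Whisper as the recognizer, jiwer for the error rate.
    import time
    import jiwer
    import librosa
    import whisper

    asr = whisper.load_model("large-v3")

    def cer_and_xrt(synthesize, text):
        """synthesize(text) must return the path to the generated wav."""
        start = time.perf_counter()
        wav_path = synthesize(text)
        elapsed = time.perf_counter() - start

        duration = librosa.get_duration(path=wav_path)
        hyp = asr.transcribe(wav_path, language="ru")["text"]

        # In practice both strings should be normalized (punctuation,
        # numbers) before scoring.
        cer = jiwer.cer(text.lower(), hyp.lower())  # lower is better
        xrt = elapsed / duration                    # real-time factor
        return cer, xrt

Note that jiwer.cer returns a fraction; the CER values in the table look like percentages, so multiply by 100 when comparing.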

A similar evaluation repo is https://github.com/Edresson/ZS-TTS-Evaluation
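
UTMOS and similarity are per-utterance scores. A hedged sketch, assuming the publicly available tarepan/SpeechMOS torch.hub entry point for UTMOS and Resemblyzer's speaker encoder for similarity; the test repo may well use different models:

    # Sketch: per-utterance UTMOS and speaker similarity.
    # Assumptions: UTMOS via the tarepan/SpeechMOS torch.hub entry
    # point, similarity via Resemblyzer d-vectors (not necessarily
    # what the vosk-tts test scripts use).
    import numpy as np
    import torch
    import librosa
    from resemblyzer import VoiceEncoder, preprocess_wav

    utmos = torch.hub.load("tarepan/SpeechMOS:v1.2.0", "utmos22_strong",
                           trust_repo=True)
    spk_encoder = VoiceEncoder()

    def utmos_score(path):
        wav, sr = librosa.load(path, sr=16000)
        with torch.no_grad():
            return utmos(torch.from_numpy(wav).unsqueeze(0), sr).item()

    def speaker_similarity(synth_path, reference_path):
        # Resemblyzer embeddings are L2-normalized, so the dot product
        # is the cosine similarity.
        e1 = spk_encoder.embed_utterance(preprocess_wav(synth_path))
        e2 = spk_encoder.embed_utterance(preprocess_wav(reference_path))
        return float(np.dot(e1, e2))

Averaging speaker_similarity over the test utterances gives the Avg column; the worst utterance gives the Min column.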

Some information about the evaluation data:

  • Audiobooks, about 100 speakers, about 1k utterances.
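
For Encodec FAD, the idea is to embed the reference and the synthesized audio with Encodec and compute the Fréchet distance between Gaussians fitted to the two embedding sets. A rough sketch, assuming the facebookresearch encodec package and its continuous encoder output as features (the exact features in the test repo may differ):

    # Sketch: Frechet Audio Distance over Encodec encoder features.
    import numpy as np
    import torch
    import torchaudio
    from scipy import linalg
    from encodec import EncodecModel
    from encodec.utils import convert_audio

    model = EncodecModel.encodec_model_24khz()

    def embed(paths):
        frames = []
        for p in paths:
            wav, sr = torchaudio.load(p)
            wav = convert_audio(wav, sr, model.sample_rate, model.channels)
            with torch.no_grad():
                z = model.encoder(wav.unsqueeze(0))  # (1, dim, frames)
            frames.append(z.squeeze(0).T.numpy())    # frames x dim
        return np.concatenate(frames)

    def fad(reference_paths, synth_paths):
        # Frechet distance between Gaussian fits of the two sets
        a, b = embed(reference_paths), embed(synth_paths)
        mu_a, mu_b = a.mean(axis=0), b.mean(axis=0)
        cov_a = np.cov(a, rowvar=False)
        cov_b = np.cov(b, rowvar=False)
        covmean = linalg.sqrtm(cov_a @ cov_b).real
        diff = mu_a - mu_b
        return float(diff @ diff + np.trace(cov_a + cov_b - 2 * covmean))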

Some observations:

  • FastSpeech2-style systems still show the best clarity (Silero/Yandex/EdgeTTS). They are not very good at intonation, but their clarity is hard to beat. For the end user this actually makes sense: you can live with flat intonation, but artifacts are really annoying.

  • The training database matters a lot; even a small dataset gives very good results (CER and UTMOS) if the data is good. Compare Piper Irina with the other Piper voices: its training data is only about 1 hour.

  • Multi-voice systems seriously suffer from fuzziness (CER grows from around 0.7 to 2.0+); something needs to be done about it.

  • Tortoise is pretty good at intonation (as expected).

  • We need to add another metric responsible for the liveliness of speech (F0 correlation? durations?). FAD is relevant, but it only works for multi-voice systems. A rough sketch of an F0 correlation metric follows this list.

  • XTTS2 results are much worse than I expected, in both similarity and clarity of speech.

  • A good metric to evaluate would be the diversity of the generated speech. VITS, for example, is specifically optimized for diversity compared to FastSpeech. Something to implement in the future.
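
For the liveliness metric mentioned above, one first cut could be the correlation of F0 contours between synthesized and reference recordings of the same text. A minimal sketch using librosa's pYIN tracker; a real metric would need proper time alignment (e.g. DTW), which is skipped here:

    # Sketch: F0 contour correlation as a rough liveliness metric.
    import numpy as np
    import librosa

    def f0_correlation(synth_path, ref_path, sr=16000):
        contours = []
        for path in (synth_path, ref_path):
            y, _ = librosa.load(path, sr=sr)
            # pYIN fills unvoiced frames with NaN by default
            f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                                    fmax=librosa.note_to_hz("C6"), sr=sr)
            contours.append(f0)

        # Crude alignment: truncate to the shorter contour. A proper
        # metric would align the two contours with DTW first.
        n = min(len(c) for c in contours)
        a, b = contours[0][:n], contours[1][:n]
        mask = ~np.isnan(a) & ~np.isnan(b)  # frames voiced in both
        if mask.sum() < 2:
            return float("nan")
        return float(np.corrcoef(a[mask], b[mask])[0, 1])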

See also https://alphacephei.com/nsh/2023/11/29/new-tts.html.