Written by Nickolay Shmyrev
Evaluation of Russian TTS models
We recently evaluated Russian open source and proprietary TTS models. Here are the results:
| Engine | Voice | CER | xRT GPU | xRT CPU | UTMOS | Similarity Avg/Min | Encodec FAD |
|--------|-------|-----|---------|---------|-------|--------------------|-------------|
| Silero v3_1 | Aidar | 0.7 | 0.0177 | 0.1256 | 2.544 | - | 97.36 |
| Silero v3_1 | Baya | 0.7 | 0.0177 | 0.1256 | 2.978 | - | 170.53 |
| Silero 4 | Aidar | 1.0 | 0.0149 | 0.0544 | 1.755 | - | 79.33 |
| Silero 4 | Baya | 0.9 | 0.0149 | 0.0544 | 2.144 | - | 118.63 |
| Vosk-TTS 0.6 | Multi | 2.3 | - | 0.0605 | 3.283 | 0.869/0.571 | 9.99 |
| TeraTTS | Natasha | 1.6 | - | 0.1945 | 3.281 | - | 70.10 |
| UtrobinTTS | Female | 2.1 | 0.0265 | 0.1323 | 2.851 | - | 73.34 |
| UtrobinTTS | Male | 2.1 | 0.0265 | 0.1323 | 3.186 | - | 46.14 |
| XTTS2 | Multi | 2.7 | 0.3458 | - | 3.035 | 0.762/0.468 | 97.05 |
| Vosk-TTS GPT | Multi | 2.1 | 0.2690 | - | 3.381 | 0.814/0.544 | 10.08 |
| Piper | Denis | 3.7 | - | 0.045 | 3.056 | - | 142.91 |
| Piper | Dmitry | 3.6 | - | 0.045 | 2.864 | - | 130.9 |
| Piper | Irina | 1.4 | - | 0.045 | 3.672 | - | 74.98 |
| Piper | Ruslan | 3 | - | 0.045 | 2.975 | - | 72.22 |
| BeneGes | Ruslan | 2.4 | - | 0.321 | 2.537 | - | 63.02 |
| EdgeTTS | Dmitry | 0.7 | - | 0.076 (cloud) | 3.565 | - | 32.69 |
| EdgeTTS | Svetlana | 0.7 | - | 0.076 (cloud) | 3.513 | - | 30.60 |
| Yandex | Alexander | 0.6 | - | 0.028 (cloud) | 3.413 | - | 54.10 |
| Yandex | Marina | 0.6 | - | 0.028 (cloud) | 3.482 | - | 49.40 |
| Tortoise Ruslan | Multi | 6.2 | 25.0300 | - | 2.893 | 0.660/0.483 | 14.21 |
| Bark Small | Ru_4 | 10.3 | 1.201 | - | 2.554 | - | 61.71 |
For clarity, here are the evaluated metrics:
- CER - obtained by running an ASR recognizer on the synthesized audio. Measures clarity of speech (phonemes are pronounced correctly, not swallowed).
- xRT - synthesis rate (real-time factor).
- UTMOS - measures sound purity (studio sound quality).
- Similarity - for multi-voice systems, the similarity of the synthesized voice to a reference sample.
- Encodec FAD - intonation quality.
- The code for evaluation is here: https://github.com/alphacep/vosk-tts/tree/master/extra/tts-test/ru
A similar repo is https://github.com/Edresson/ZS-TTS-Evaluation
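The CER metric above reduces to a character-level edit distance between the ASR transcript and the reference text. A minimal self-contained sketch (the ASR step is omitted; the hypothesis string stands in for the recognizer output):

```python
def levenshtein(a: str, b: str) -> int:
    """Character edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate in percent of the reference length."""
    return 100.0 * levenshtein(reference, hypothesis) / len(reference)

print(cer("kitten", "sitting"))  # 50.0
```

The real evaluation in the repo linked above handles text normalization and averages over the whole test set; this just shows the core computation.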
Some information about the evaluation data:
- Audiobooks, about 100 speakers, about 1k utterances.
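The xRT columns in the table are real-time factors: wall-clock synthesis time divided by the duration of the audio produced (lower is faster; below 1.0 means faster than real time). A minimal sketch, where `synthesize()` is a hypothetical stand-in for any engine:

```python
import time

SAMPLE_RATE = 22050  # assumed output sample rate

def synthesize(text: str) -> list:
    """Hypothetical stand-in for a TTS engine: returns audio samples.
    Here it just fakes 0.05 s of silence per character."""
    return [0.0] * int(len(text) * 0.05 * SAMPLE_RATE)

def xrt(text: str) -> float:
    """Real-time factor: synthesis wall-clock time / audio duration."""
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    return elapsed / (len(samples) / SAMPLE_RATE)

print(f"xRT: {xrt('привет, мир'):.6f}")  # well below 1.0 for this fake engine
```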
Some observations:
- FastSpeech2-style methods still show the best clarity (Silero/Yandex/EdgeTTS). They are not very good at intonation, but their clarity is hard to beat. For the end user this really matters: you can live with flat intonation, but artifacts are really annoying.
- The training database matters a lot; even a small one gives very good results (CER and UTMOS) if the data is good. Compare Piper Irina to the other Piper voices - that voice was trained on only 1 hour of data.
- Multi-voice systems seriously suffer from fuzziness (CER goes from 0.7 to 2.0+); something needs to be done about it.
- Tortoise is pretty good at intonation (as expected).
- We need to add another metric for the liveliness of speech (F0 correlation? duration?). FAD is relevant, but it only works for multi-voice systems.
- XTTS2 results are much worse than I expected, in both similarity and clarity of speech.
- A good metric to evaluate would be the diversity of generated speech. VITS, for example, is specifically optimized for diversity compared to FastSpeech. Something to implement in the future.
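One possible shape for the liveliness metric mentioned above: Pearson correlation between the F0 (pitch) contour of synthesized speech and that of a natural reference. This is only a sketch of the idea; the contours here are synthetic arrays, while in practice they would come from a pitch tracker (e.g. pyworld or librosa's pyin) and would need time alignment first.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# A lively reference contour (Hz) and a copy shifted by a constant offset:
# correlation ignores the overall level, so a matching contour scores 1.0.
reference = [120 + 30 * math.sin(i / 5) for i in range(100)]
shifted = [f0 + 10 for f0 in reference]
print(round(pearson(reference, shifted), 3))  # 1.0
```

A flat (zero-variance) contour makes the correlation undefined, so a production version would need to handle unvoiced and monotone segments explicitly.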
See also https://alphacephei.com/nsh/2023/11/29/new-tts.html.