Speech factorization in NaturalSpeech 3

The recently published NaturalSpeech 3 paper https://arxiv.org/abs/2403.03100 has attracted some attention. While the ideas discussed there are fairly straightforward, it is nice to see a solid implementation from a reputable institution, with great results. It is also amazing how fast some developers are: an open-source implementation is already public https://github.com/open-mmlab/Amphion/pull/152.

While the main topic of the research is factorization, it seems to me that some important points are missing from the paper, which somewhat hides the true value of the work.

Factorization is important for many reasons:

  • Improves explainability of the model
  • Drastically reduces the data requirements to train the model
  • Reduces the parameter count of the model and speeds up inference

The factor approach has long been a focus of the voice conversion (VC) community. It is nice that it is now getting attention in TTS; hopefully it will also spread to other speech domains like ASR.
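The data and parameter savings can be illustrated with a toy sketch. Everything below is hypothetical (the sizes, the additive combination of factor codes), not the paper's architecture: the point is only that a factorized model stores codes per factor instead of per combination.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # conditioning dimension (illustrative)

# Hypothetical per-factor codebooks: a handful of discrete codes per factor.
factors = {
    "content": rng.normal(size=(100, DIM)),   # e.g. phone-like units
    "prosody": rng.normal(size=(16, DIM)),
    "timbre":  rng.normal(size=(32, DIM)),
}

def condition(content_id, prosody_id, timbre_id):
    """Compose a conditioning vector from independent factor codes."""
    return (factors["content"][content_id]
            + factors["prosody"][prosody_id]
            + factors["timbre"][timbre_id])

# A monolithic model would need a representation for every combination,
# while the factorized one only needs the sum of the codebook sizes:
combinations = 100 * 16 * 32          # every (content, prosody, timbre) triple
factorized_codes = 100 + 16 + 32
print(combinations, factorized_codes)  # → 51200 148
```

The same combinatorics is why factorized models need less training data: each factor can be learned from examples that vary only that factor, instead of requiring coverage of every combination.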

But there are also other items the paper doesn't mention:

  • One of the core use cases for factorization is reusing modules from external parties (something that is always missing from large corporations' research). The paper trains all diffusion modules on its own data and with its own models; however, it is actually critical to reuse existing components: speaker-ID networks like WeSpeaker, speech recognition networks like Whisper. Of course they do not fit as neatly into the picture, but they certainly allow us to build much more powerful systems.

    For example, it is known that Whisper was trained on a song database. Whisper models are far better at lyrics extraction than any other ASR model and, consequently, much more stable as a content extractor for singing voice conversion. See, for example, https://github.com/PlayVoice/lora-svc/issues/17 or https://arxiv.org/abs/2310.11160.
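A sketch of this reuse pattern, with a random projection standing in for a frozen pretrained encoder (a real SVC system would load, e.g., Whisper's audio encoder checkpoint instead; all names and dimensions here are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
N_MELS, D = 80, 64

# Frozen projection standing in for a pretrained ASR encoder; in practice
# you would run the actual model and take its hidden states.
W_content = rng.normal(size=(N_MELS, D)) / np.sqrt(N_MELS)

def extract_content(mel):
    """Frame-level content features from a (frames, n_mels) spectrogram."""
    return mel @ W_content

def svc_conditioning(src_mel, tgt_speaker_emb):
    """Singing-voice-conversion conditioning: content comes from the source
    singer's audio, timbre from the target speaker embedding."""
    return extract_content(src_mel) + tgt_speaker_emb[None, :]

mel = rng.normal(size=(120, N_MELS))   # fake source utterance, 120 frames
spk = rng.normal(size=(D,))            # fake target speaker embedding
cond = svc_conditioning(mel, spk)
print(cond.shape)  # → (120, 64)
```

The key property is that the content extractor is trained (and frozen) elsewhere, so the conversion model itself never needs labeled lyrics data.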

  • Factorization definitely has limits and is not always possible. Like any linear approximation, it doesn't fully cover the target domain. For example, it is actually very hard to fully disentangle speaker and pitch, or pitch and emotion. As a result, models do not fully reproduce the target domain and one can discover inconsistencies.

  • One consequence of the item above is the open question of whether utterance-level factorization is the only kind required. Many applications need more fine-grained control, so factorization could be both word-level and utterance-level. It is still an open question whether sequence-style attribute specification makes more sense than utterance-level factor specification. One can connect this to our recent post on spiking models.
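The speaker/pitch entanglement mentioned above shows up even in a toy simulation (the F0 ranges below are illustrative numbers, not measurements): a naive pitch statistic identifies the speaker almost perfectly, so any "pitch" factor built from it inevitably leaks speaker identity.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy F0 tracks for a low-pitched and a high-pitched speaker (Hz).
f0_spk_a = rng.normal(120, 15, size=500)   # speaker A
f0_spk_b = rng.normal(220, 25, size=500)   # speaker B

# Classify the speaker from pitch alone with a single threshold.
threshold = 170.0
acc = (np.mean(f0_spk_a < threshold) + np.mean(f0_spk_b >= threshold)) / 2
print(f"speaker accuracy from pitch alone: {acc:.2f}")
```

Since raw pitch predicts the speaker this well, a model cannot treat "pitch" and "speaker" as independent factors without some normalization (e.g. per-speaker F0 statistics), and even then the disentanglement is only approximate.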

Lastly, the paper doesn't say much about semantics. The text encoder considers only phonemes and is usually trained on a comparatively limited amount of data compared to LLMs. This is one of the weakest points of TTS research, and one can demonstrate it by specifically creating a test set where intonation is driven by semantics rather than syntax. For example, try “She said with a relaxed and sad voice - hello”. Very few engines will render “hello” with a sad voice.

So there are still many open items in modern TTS; we will enjoy great research papers in the future.