0. Contents

  1. Abstract
  2. Models description
  3. Samples of noisy target speakers from the TTS training set
  4. Samples of noisy target speakers from VCTK-noise
  5. Samples of the noisy target speaker from Real-noise (a noisy speaker from the Internet)
  6. References

1. Abstract

Building a voice conversion system for noisy target speakers, such as users who provide noisy samples or data found on the Internet, is a challenging task, since training on contaminated speech inevitably degrades conversion performance. In this paper, we leverage our recently proposed Glow-WaveGAN [1] and propose a noise-independent speech representation learning approach for high-quality voice conversion with noisy target speakers. Specifically, we learn a latent feature space in which the target distribution modeled by the conversion model is exactly the distribution modeled by the waveform generator. On this basis, we further make the latent features noise-invariant. To this end, we introduce a noise-controllable WaveGAN, whose encoder learns the noise-independent acoustic representation directly from the waveform and whose decoder controls noise in the hidden space through a FiLM [2] module. For the conversion model, importantly, we use a flow-based model to learn the distribution of the noise-independent but speaker-related latent features from phoneme posteriorgrams. Experimental results demonstrate that the proposed model achieves high speech quality and speaker similarity when converting to noisy target speakers.
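For readers curious how the FiLM-based noise control in the decoder might look, below is a minimal PyTorch sketch of a FiLM layer [2]. The class name, shape conventions, and the way the noise condition is encoded are illustrative assumptions; the paper only specifies that a FiLM module conditions the decoder's hidden features on the noise.

    import torch
    import torch.nn as nn

    class FiLM(nn.Module):
        """Feature-wise Linear Modulation (Perez et al. [2]): scale and
        shift each channel of the hidden features using parameters
        predicted from a conditioning vector, h' = gamma * h + beta."""

        def __init__(self, cond_dim: int, num_channels: int):
            super().__init__()
            # One projection predicts both gamma and beta.
            self.proj = nn.Linear(cond_dim, 2 * num_channels)

        def forward(self, h: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
            # h: (batch, channels, time); cond: (batch, cond_dim)
            gamma, beta = self.proj(cond).chunk(2, dim=-1)
            return gamma.unsqueeze(-1) * h + beta.unsqueeze(-1)

At conversion time, setting the condition to its "clean" value would let the decoder synthesize denoised speech from the noise-independent latent, which matches the goal of converting to noisy target speakers without reproducing their noise.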



2. Models description

Topline: auto-regressive conversion model + HiFi-GAN vocoder, trained on the target speakers' clean data

Baseline: auto-regressive conversion model + HiFi-GAN vocoder, trained on the target speakers' denoised data

Proposed: flow-based conversion model + noise-controllable WaveGAN, trained directly on the target speakers' noisy data (a sketch of the conversion flow follows)
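As a rough illustration of how the proposed pipeline fits together at conversion time, here is a hedged Python sketch. All function and variable names (ppg_extractor, flow_model, wavegan_decoder, the integer noise condition) are hypothetical stand-ins, not the paper's actual API:

    import torch

    def convert(source_wav: torch.Tensor,
                target_spk_emb: torch.Tensor,
                ppg_extractor, flow_model, wavegan_decoder) -> torch.Tensor:
        """Convert source speech into the target speaker's voice."""
        ppg = ppg_extractor(source_wav)      # speaker-independent linguistic content
        z = flow_model(ppg, target_spk_emb)  # noise-independent, speaker-related latent
        # Request clean output even though the target speaker's
        # training data was noisy (0 = "no noise", an assumed encoding).
        clean_cond = torch.zeros(z.size(0), dtype=torch.long)
        return wavegan_decoder(z, clean_cond)

The property exploited here is that the flow-based conversion model and the WaveGAN decoder share the same latent distribution, so samples from the conversion model can be fed directly to the decoder.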

3. Samples of noisy target speakers from the TTS training set

[Audio sample table — rows: VCTK-noise, Real-noise; columns: Clean speech for Topline, Denoised speech for Baseline, Noisy speech for Proposed.]

4. Samples of noisy target speakers from VCTK-noise

[Audio samples — columns: Source, Target, Topline, Baseline, Proposed; one row per source utterance below.]
Source text: He seems to have everything.
Source text: But it may take some time to confirm the findings.
Source text: You must always attempt to raise the bar.
Source text: That can cost him a fortune.
Source text: Scotland had great assets.

5. Samples of the noisy target speaker from Real-noise (a noisy speaker from the Internet)

[Audio samples — columns: Source, Target, Baseline, Proposed; one row per source utterance below.]
Source text: I'd just like to play.
Source text: Painful, but only because it's true.
Source text: This represents a tough game for us.
Source text: To the Hebrews it was a token that there would be no more universal floods.
Source text: Overall it has been deeply frustrating and disappointing.

6. References

[1] J. Cong, S. Yang, L. Xie, and D. Su, "Glow-WaveGAN: Learning Speech Representations from GAN-based Variational Auto-Encoder for High Fidelity Flow-based Speech Synthesis," in Interspeech, 2021.

[2] E. Perez, F. Strub, H. de Vries, V. Dumoulin, and A. Courville, "FiLM: Visual Reasoning with a General Conditioning Layer," in AAAI, 2018.