Audio samples from "A New Cycle Consistency Net for Cross-Speaker Style Transfer TTS Training"
Abstract: In this paper, we propose a new cycle consistency network for training an end-to-end, cross-speaker style transfer TTS system. We build on an end-to-end, multi-speaker model in which a Variational Autoencoder (VAE) learns a latent representation of each speaker's speaking style. In addition to the paired style reference input and its output, we augment the training structure with an unpaired style reference input and its transferred output, which makes the cross-speaker style transfer embeddings and the decoder more powerful. A cycle consistency neural net is also proposed to make the training process algorithmically tractable. Test results show that the proposed approach performs much better than both the Global Style Token (GST) and VAE based models in cross-speaker style transfer, for both parallel and non-parallel transfer.
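The general idea of a cycle consistency constraint on style can be illustrated with a toy sketch. This is not the paper's network or loss, which the abstract does not specify. The functions `encode_style`, `decode`, and `cycle_style_loss`, the `speaker_bias` parameter, and all numbers below are hypothetical stand-ins chosen for illustration. The sketch encodes a style from a reference, renders unpaired content with that style, re-encodes the style from the result, and penalizes the difference.

```python
from typing import List

Vector = List[float]


def encode_style(features: Vector) -> Vector:
    """Toy stand-in for a style encoder: returns [mean, range] of the features."""
    mean = sum(features) / len(features)
    spread = max(features) - min(features)
    return [mean, spread]


def decode(content: Vector, speaker_bias: float, style: Vector) -> Vector:
    """Toy stand-in for a decoder conditioned on content, speaker, and style.

    Assumes `content` has zero mean and unit range so that, with
    speaker_bias == 0, re-encoding the output recovers `style` exactly.
    """
    level, spread = style
    return [level + spread * c + speaker_bias for c in content]


def cycle_style_loss(reference: Vector, content: Vector, speaker_bias: float) -> float:
    """Squared distance between the reference style and the style
    re-encoded from the transferred output."""
    style = encode_style(reference)
    transferred = decode(content, speaker_bias, style)
    recovered = encode_style(transferred)
    return sum((a - b) ** 2 for a, b in zip(style, recovered))


# Made-up numbers: a style reference from a source speaker and
# unpaired content to be rendered in a target speaker's voice.
reference = [1.0, 2.0, 3.0]
content = [-0.5, 0.0, 0.5]

loss_clean = cycle_style_loss(reference, content, speaker_bias=0.0)
loss_leaky = cycle_style_loss(reference, content, speaker_bias=0.5)
```

In this toy setting the loss is zero when the target speaker's identity does not alter the recovered style (`loss_clean`) and positive when it does (`loss_leaky`). That is the kind of signal a cycle constraint can provide when no ground-truth target exists for an unpaired transfer.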