Audio samples from "A New Cycle Consistency Net for Cross-Speaker Style Transfer TTS Training"

Abstract: In this paper, we propose a new cycle consistency network for training an end-to-end, cross-speaker style transfer TTS system. Building on an end-to-end, multi-speaker model in which a Variational Autoencoder (VAE) learns a latent representation of each speaker's speaking style, we augment the training structure with unpaired style reference input and its transferred output, in addition to the paired style reference input and its output, to make the cross-speaker style transfer embeddings and the decoder more powerful. Additionally, a cycle consistency neural net is proposed to make the training process algorithmically tractable. Test results show that the proposed approach performs much better than both the Global Style Token (GST) and VAE based models in cross-speaker style transfer, for both parallel and unparallel transfer.
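
The abstract does not spell out the cycle consistency objective, so the following is only a toy sketch of one common reading of the idea: extract a style embedding from a reference, synthesize speech for a different speaker with it, re-encode the style from that output, and penalize the difference. Every name here (`style_encoder`, `decoder`, `cycle_style_loss`) is hypothetical, and the mean-pooling encoder and additive decoder are stand-ins chosen for illustration, not the paper's VAE or TTS decoder.

```python
def style_encoder(mel):
    """Toy stand-in for a style encoder: per-dimension mean over time frames."""
    n_frames = len(mel)
    n_dims = len(mel[0])
    return [sum(frame[d] for frame in mel) / n_frames for d in range(n_dims)]


def decoder(text_frames, style, speaker):
    """Toy stand-in for a TTS decoder: adds style and speaker offsets to each frame."""
    return [
        [t + s + k for t, s, k in zip(frame, style, speaker)]
        for frame in text_frames
    ]


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cycle_style_loss(text_frames, style_ref_mel, target_speaker):
    """Assumed cycle term: style re-encoded from the transferred output
    should match the style extracted from the reference."""
    z_style = style_encoder(style_ref_mel)                       # style from reference
    transferred = decoder(text_frames, z_style, target_speaker)  # unpaired transfer
    z_cycle = style_encoder(transferred)                         # style from the output
    return squared_l2(z_style, z_cycle)
```

In this toy setup the loss is zero only when the text and speaker contributions do not leak into the re-encoded style, which illustrates why such a term would encourage the style embedding to stay separate from speaker identity. A real system would use learned networks and gradient-based training.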

1. Parallel transfer

(1). And these crystals are just like globes of light."
Style reference audio Speaker reference audio

GST baseline model VAE baseline model Proposed Cycle model

(2). But I mean to see where fate will lead me.'
Style reference audio Speaker reference audio

GST baseline model VAE baseline model Proposed Cycle model

(3). 'Come along,' said Lawford, with a faint gust of laughter; 'let's see.'
Style reference audio Speaker reference audio

GST baseline model VAE baseline model Proposed Cycle model

(4). And he was eager to hear all that she could tell him.
Style reference audio Speaker reference audio

GST baseline model VAE baseline model Proposed Cycle model

(5). He waved his hand gaily, and approached us along the road.
Style reference audio Speaker reference audio

GST baseline model VAE baseline model Proposed Cycle model

2. Unparallel transfer

(1). And these crystals are just like globes of light."
Style reference audio Speaker reference audio

GST baseline model VAE baseline model Proposed Cycle model

(2). "You know yourself why he'll come.
Style reference audio Speaker reference audio

GST baseline model VAE baseline model Proposed Cycle model

(3). He's telling the whole truth, you may believe it."
Style reference audio Speaker reference audio

GST baseline model VAE baseline model Proposed Cycle model

(4). 'A week, seven days!'
Style reference audio Speaker reference audio

GST baseline model VAE baseline model Proposed Cycle model

(5). They were trying to focus all fear and resentment on him.
Style reference audio Speaker reference audio

GST baseline model VAE baseline model Proposed Cycle model