Cycle Consistent Network for End-to-end Style Transfer TTS Training
Authors:
Liumeng Xue1, Shifeng Pan2, Lei He2, Lei Xie1, Frank K. Soong2
1Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi'an, China
2Microsoft China
Abstract:
In this paper, we propose a cycle-consistent network based end-to-end TTS model for speaking style
transfer, covering intra-speaker, inter-speaker, and unseen-speaker style transfer in both parallel and
non-parallel settings. The proposed approach is built upon a multi-speaker Variational Autoencoder (VAE) TTS
model. Such a model is usually trained in a paired manner, where the reference speech fully matches the
target output in speaker identity, text, and style. To achieve better quality for style transfer,
which in most cases is unpaired, we augment the model with an unpaired path that has a separate
variational style encoder. The unpaired path takes an unpaired reference speech as input and yields an
unpaired output. The unpaired output, which lacks a direct ground-truth target, is then
constrained by a carefully designed cycle-consistent network. Specifically, the unpaired output of the
forward transfer is fed back into the model as an unpaired reference input, and the backward transfer
is expected to reproduce the original reference speech. An ablation study shows the
effectiveness of the unpaired path, the separate style encoders, and the cycle-consistent network in the proposed
model. The final evaluation demonstrates that the proposed approach significantly outperforms the Global Style
Token (GST) and VAE based systems in all six style transfer categories, in terms of naturalness, speech
quality, speaker similarity, and style similarity.
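Since the abstract only outlines the training scheme, the following is a minimal PyTorch sketch of one training step combining the paired path, the unpaired path with a separate variational style encoder, and the cycle-consistency constraint. All module names (StyleEncoder, Synthesizer), dimensions, and the specific L1/KL loss choices are illustrative assumptions on our part, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StyleEncoder(nn.Module):
    """Variational style encoder: mel spectrogram -> style latent (illustrative)."""
    def __init__(self, n_mels=80, style_dim=16):
        super().__init__()
        self.rnn = nn.GRU(n_mels, 128, batch_first=True)
        self.mu = nn.Linear(128, style_dim)
        self.logvar = nn.Linear(128, style_dim)

    def forward(self, mel):
        _, h = self.rnn(mel)                 # h: (1, B, 128)
        h = h.squeeze(0)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick for the VAE latent.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return z, mu, logvar

class Synthesizer(nn.Module):
    """Stand-in for the TTS backbone: (text, speaker, style) -> mel."""
    def __init__(self, text_dim=64, spk_dim=8, style_dim=16, n_mels=80, T=100):
        super().__init__()
        self.T = T
        self.proj = nn.Linear(text_dim + spk_dim + style_dim, n_mels)

    def forward(self, text, spk, style):
        h = torch.cat([text, spk, style], dim=-1)        # (B, D)
        return self.proj(h).unsqueeze(1).expand(-1, self.T, -1)

def kl_loss(mu, logvar):
    return -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())

# Separate style encoders for the paired and unpaired paths, as in the abstract.
paired_enc, unpaired_enc = StyleEncoder(), StyleEncoder()
synth = Synthesizer()

B = 4
text, spk = torch.randn(B, 64), torch.randn(B, 8)          # target utterance
mel_paired = torch.randn(B, 100, 80)                        # reference matching text/speaker
text_ref, spk_ref = torch.randn(B, 64), torch.randn(B, 8)   # transcript/speaker of unpaired reference
mel_unpaired = torch.randn(B, 100, 80)                      # unpaired style reference

# Paired path: the reference matches the target, so an ordinary
# reconstruction loss applies directly.
z_p, mu_p, lv_p = paired_enc(mel_paired)
recon = synth(text, spk, z_p)
loss_paired = F.l1_loss(recon, mel_paired) + kl_loss(mu_p, lv_p)

# Unpaired path (forward transfer): no ground truth for this output.
z_u, mu_u, lv_u = unpaired_enc(mel_unpaired)
mel_transferred = synth(text, spk, z_u)

# Cycle consistency (backward transfer): the transferred speech becomes the
# reference, synthesis is driven by the original reference's text/speaker,
# and the result is constrained to match the original reference speech.
z_b, _, _ = unpaired_enc(mel_transferred)
mel_back = synth(text_ref, spk_ref, z_b)
loss_cycle = F.l1_loss(mel_back, mel_unpaired) + kl_loss(mu_u, lv_u)

loss = loss_paired + loss_cycle
loss.backward()
```

In this sketch the cycle loss is what supervises the unpaired output: the forward transfer has no ground truth, but requiring the backward transfer to reproduce the original reference gives the unpaired path a well-defined training signal.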