Cycle Consistent Network for End-to-end Style Transfer TTS Training

Authors: Liumeng Xue [1], Shifeng Pan [2], Lei He [2], Lei Xie [1], Frank K. Soong [2]
              [1] Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi'an, China
              [2] Microsoft China
Abstract: In this paper, we propose an end-to-end TTS system based on a cycle consistent network for speaking style transfer, covering intra-speaker, inter-speaker, and unseen speaker style transfer in both parallel and unparallel settings. The proposed approach is built upon a multi-speaker Variational Autoencoder (VAE) TTS model. Such a model is usually trained in a paired manner, meaning that the reference speech fully matches the output in speaker identity, text, and style. To achieve better quality for style transfer, which in most cases is unpaired, we augment the model with an unpaired path that has a separate variational style encoder. The unpaired path takes an unpaired reference speech as input and yields an unpaired output. The unpaired output, which lacks a direct ground-truth target, is then constrained by a carefully designed cycle consistent network. Specifically, the unpaired output of the forward transfer is fed into the model again as an unpaired reference input, and the backward transfer yields an output that is expected to match the original reference speech. An ablation study shows the effectiveness of the unpaired path, the separate style encoders, and the cycle consistent network in the proposed model. The final evaluation demonstrates that the proposed approach significantly outperforms the Global Style Token (GST) and VAE based systems in all six style transfer categories, in terms of naturalness, speech quality, speaker similarity, and speaking style similarity.
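The forward and backward transfer described in the abstract can be illustrated with a toy sketch. This is not the paper's model: utterances are short lists of numbers, "synthesis" is plain addition, and every name here (Utterance, clean_style_encoder, leaky_style_encoder, synthesize, cycle_loss) is hypothetical. It also assumes that the backward transfer reuses the text and speaker of the original reference, which the abstract implies but does not state.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Utterance:
    text: List[float]     # stand-in for text/content features (assumption)
    speaker: List[float]  # stand-in for speaker identity features (assumption)
    mel: List[float]      # stand-in for acoustic features (assumption)


def synthesize(text: List[float], speaker: List[float], style: List[float]) -> Utterance:
    """Toy decoder: acoustics are the sum of text, speaker and style features."""
    mel = [t + s + z for t, s, z in zip(text, speaker, style)]
    return Utterance(text, speaker, mel)


def clean_style_encoder(ref: Utterance) -> List[float]:
    """Toy encoder that removes both text and speaker from the acoustics."""
    return [m - t - s for m, t, s in zip(ref.mel, ref.text, ref.speaker)]


def leaky_style_encoder(ref: Utterance) -> List[float]:
    """Toy encoder that forgets to remove the speaker, so identity leaks into style."""
    return [m - t for m, t in zip(ref.mel, ref.text)]


def cycle_loss(
    target_text: List[float],
    target_speaker: List[float],
    ref: Utterance,
    encoder: Callable[[Utterance], List[float]],
) -> float:
    # Forward transfer: the unpaired reference supplies only the style.
    unpaired_out = synthesize(target_text, target_speaker, encoder(ref))
    # Backward transfer: the unpaired output becomes the reference, and the
    # reference's own text and speaker are used to rebuild the original.
    reconstructed = synthesize(ref.text, ref.speaker, encoder(unpaired_out))
    # Mean absolute error between the reconstruction and the original reference.
    diffs = [abs(a - b) for a, b in zip(reconstructed.mel, ref.mel)]
    return sum(diffs) / len(diffs)


ref = Utterance(text=[1.0, 2.0], speaker=[0.5, 0.5], mel=[2.0, 3.5])
target_text, target_speaker = [3.0, 1.0], [0.25, 0.75]

print(cycle_loss(target_text, target_speaker, ref, clean_style_encoder))  # 0.0
print(cycle_loss(target_text, target_speaker, ref, leaky_style_encoder))  # 1.0
```

In this toy setting the cycle loss is zero only when the style encoder carries style and nothing else. The leaky encoder passes speaker information through both transfers and is penalized, which is the kind of supervision the unpaired output otherwise lacks.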

GST: Global style token baseline
VAE: Variational Autoencoder TTS baseline
OneStyEncNoCycle: The model with one style encoder and without the cycle consistent network.
OneStyEncCycle unpaired: The model with one style encoder and the cycle consistent network; the unpaired output is used for speech synthesis.
TwoStyEncCycle paired: The model with two style encoders and the cycle consistent network; the paired output is used for speech synthesis.
TwoStyEncCycle unpaired: The model with two style encoders and the cycle consistent network; the unpaired output is used for speech synthesis.

I. Ablation test (Inter-speaker style transfer)

Parallel transfer

(1). And these crystals are just like globes of light.
Style reference Speaker reference
VAE OneStyEncNoCycle OneStyEncCycle unpaired TwoStyEncCycle unpaired
(2). 'Come along,' said Lawford, with a faint gust of laughter; 'let's see.'
Style reference Speaker reference
VAE OneStyEncNoCycle OneStyEncCycle unpaired TwoStyEncCycle unpaired

Unparallel transfer

(1). He's telling the whole truth, you may believe it.
Style reference Speaker reference
VAE OneStyEncNoCycle OneStyEncCycle unpaired TwoStyEncCycle unpaired
(2). They were trying to focus all fear and resentment on him.
Style reference Speaker reference
VAE OneStyEncNoCycle OneStyEncCycle unpaired TwoStyEncCycle unpaired

II. Final test

1. Inter-speaker style transfer

Parallel transfer

(1). And these crystals are just like globes of light.
Style reference Speaker reference
GST VAE TwoStyEncCycle paired TwoStyEncCycle unpaired
(2). 'Come along,' said Lawford, with a faint gust of laughter; 'let's see.'
Style reference Speaker reference
GST VAE TwoStyEncCycle paired TwoStyEncCycle unpaired

Unparallel transfer

(1). He's telling the whole truth, you may believe it.
Style reference Speaker reference
GST VAE TwoStyEncCycle paired TwoStyEncCycle unpaired
(2). They were trying to focus all fear and resentment on him.
Style reference Speaker reference
GST VAE TwoStyEncCycle paired TwoStyEncCycle unpaired

2. Intra-speaker style transfer

Parallel transfer

(1). If you won't give me them, I will have every item out of the trucks and make a new list.
Style reference Speaker reference
GST VAE TwoStyEncCycle paired TwoStyEncCycle unpaired
(2). The doctor examined him in silence-while we too were still.
Style reference Speaker reference
GST VAE TwoStyEncCycle paired TwoStyEncCycle unpaired

Unparallel transfer

(1). The doctor examined him in silence-while we too were still.
Style reference Speaker reference
GST VAE TwoStyEncCycle paired TwoStyEncCycle unpaired
(2). Inside, Chris was waiting, carrying an official automatic.
Style reference Speaker reference
GST VAE TwoStyEncCycle paired TwoStyEncCycle unpaired

3. Unseen speaker style transfer

Parallel transfer

(1). "What was it?"
Style reference Speaker reference
GST VAE TwoStyEncCycle unpaired
(2). I don't know that they have boiling water.
Style reference Speaker reference
GST VAE TwoStyEncCycle unpaired

Unparallel transfer

(1). "See here!" said Sheridan.
Style reference Speaker reference
GST VAE TwoStyEncCycle unpaired
(2). "See here!" said Sheridan.
Style reference Speaker reference
GST VAE TwoStyEncCycle unpaired