Cycle Consistent Network for End-to-end Style Transfer TTS Training
Authors:
Liumeng Xue1, Shifeng Pan2, Lei He2, Lei Xie1, Frank K. Soong2
1Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi'an, China
2Microsoft China
Abstract:
In this paper, we propose a cycle-consistent network based end-to-end TTS model for speaking style
transfer, covering intra-speaker, inter-speaker, and unseen-speaker style transfer in both parallel and
non-parallel settings. The proposed approach is built upon a multi-speaker Variational Autoencoder (VAE) TTS
model. Such a model is usually trained in a paired manner, where the reference speech fully matches the
target output in speaker identity, text, and style. To achieve better quality for style transfer,
which in most cases is unpaired, we augment the model with an unpaired path that has a separate
variational style encoder. The unpaired path takes an unpaired reference speech as input and yields an
unpaired output. The unpaired output, which lacks a direct ground-truth target, is then
constrained by a carefully designed cycle-consistent network. Specifically, the unpaired output of the
forward transfer is fed back into the model as an unpaired reference input, and the backward transfer
is expected to reproduce the original reference speech. An ablation study shows the
effectiveness of the unpaired path, the separate style encoders, and the cycle-consistent network in the proposed
model. The final evaluation demonstrates that the proposed approach significantly outperforms the Global Style
Token (GST) and VAE based systems in all six style transfer categories, in terms of naturalness, speech
quality, speaker similarity, and style similarity.
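Since the abstract only outlines the training scheme, the following is a minimal PyTorch sketch of one training step combining the paired path, the unpaired path with a separate variational style encoder, and the cycle-consistency constraint. All module names (StyleEncoder, Synthesizer), dimensions, and the specific L1/KL loss choices are illustrative assumptions on our part, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StyleEncoder(nn.Module):
    """Variational style encoder: mel spectrogram -> style latent (illustrative)."""
    def __init__(self, n_mels=80, style_dim=16):
        super().__init__()
        self.rnn = nn.GRU(n_mels, 128, batch_first=True)
        self.mu = nn.Linear(128, style_dim)
        self.logvar = nn.Linear(128, style_dim)

    def forward(self, mel):
        _, h = self.rnn(mel)                 # h: (1, B, 128)
        h = h.squeeze(0)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick for the VAE latent.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return z, mu, logvar

class Synthesizer(nn.Module):
    """Stand-in for the TTS backbone: (text, speaker, style) -> mel."""
    def __init__(self, text_dim=64, spk_dim=8, style_dim=16, n_mels=80, T=100):
        super().__init__()
        self.T = T
        self.proj = nn.Linear(text_dim + spk_dim + style_dim, n_mels)

    def forward(self, text, spk, style):
        h = torch.cat([text, spk, style], dim=-1)        # (B, D)
        return self.proj(h).unsqueeze(1).expand(-1, self.T, -1)

def kl_loss(mu, logvar):
    return -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())

# Separate style encoders for the paired and unpaired paths, as in the abstract.
paired_enc, unpaired_enc = StyleEncoder(), StyleEncoder()
synth = Synthesizer()

B = 4
text, spk = torch.randn(B, 64), torch.randn(B, 8)          # target utterance
mel_paired = torch.randn(B, 100, 80)                        # reference matching text/speaker
text_ref, spk_ref = torch.randn(B, 64), torch.randn(B, 8)   # transcript/speaker of unpaired reference
mel_unpaired = torch.randn(B, 100, 80)                      # unpaired style reference

# Paired path: the reference matches the target, so an ordinary
# reconstruction loss applies directly.
z_p, mu_p, lv_p = paired_enc(mel_paired)
recon = synth(text, spk, z_p)
loss_paired = F.l1_loss(recon, mel_paired) + kl_loss(mu_p, lv_p)

# Unpaired path (forward transfer): no ground truth for this output.
z_u, mu_u, lv_u = unpaired_enc(mel_unpaired)
mel_transferred = synth(text, spk, z_u)

# Cycle consistency (backward transfer): the transferred speech becomes the
# reference, synthesis is driven by the original reference's text/speaker,
# and the result is constrained to match the original reference speech.
z_b, _, _ = unpaired_enc(mel_transferred)
mel_back = synth(text_ref, spk_ref, z_b)
loss_cycle = F.l1_loss(mel_back, mel_unpaired) + kl_loss(mu_u, lv_u)

loss = loss_paired + loss_cycle
loss.backward()
```

In this sketch the cycle loss is what supervises the unpaired output: the forward transfer has no ground truth, but requiring the backward transfer to reproduce the original reference gives the unpaired path a well-defined training signal.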