Audio samples from "Building a mixed-lingual neural TTS system with only monolingual data"

Paper: arXiv
Authors: Liumeng Xue¹, Wei Song², Guanghui Xu², Lei Xie¹, Zhizheng Wu²
              ¹ Shaanxi Provincial Key Lab of Speech and Image Information Processing, School of Computer Science, Northwestern Polytechnical University, Xi’an, China
              ² JD.COM
Abstract: When deploying a Chinese neural Text-to-Speech (TTS) system, one of the challenges is to synthesize Chinese utterances with English phrases or words embedded. This paper looks into the problem in the encoder-decoder framework when only monolingual data from a target speaker is available. Specifically, we view the problem from two aspects: speaker consistency within an utterance and naturalness. We start the investigation with an average voice model which is built from multi-speaker monolingual data, i.e., Mandarin and English data. On the basis of that, we look into speaker embedding for speaker consistency within an utterance and phoneme embedding for naturalness and intelligibility, and study the choice of data for model training. We report the findings and discuss the challenges of building a mixed-lingual TTS system with only monolingual data.
Paper accepted by INTERSPEECH 2019


1. Speaker embedding at different positions

Mandarin Samples

(1). 这儿夏天雨水很多,可是秋天很少雨。
Speaker embedding at encoder (SE-ENC) Speaker embedding at decoder (SE-DEC)
(2). 峨眉山今天下雨,不冷,西南风四到五级。
Speaker embedding at encoder (SE-ENC) Speaker embedding at decoder (SE-DEC)
(3). 不过稍微改变了一点儿说法。
Speaker embedding at encoder (SE-ENC) Speaker embedding at decoder (SE-DEC)

English Samples

(1). Hang on, gaps those of you in the know.
Speaker embedding at encoder (SE-ENC) Speaker embedding at decoder (SE-DEC)
(2). How do we develop football?
Speaker embedding at encoder (SE-ENC) Speaker embedding at decoder (SE-DEC)
(3). It's a pretty difficult journey.
Speaker embedding at encoder (SE-ENC) Speaker embedding at decoder (SE-DEC)

Mixed-lingual Samples

(1). Downie的才华才得到了michele的进一步赏识。
Speaker embedding at encoder (SE-ENC) Speaker embedding at decoder (SE-DEC)
(2). Love herby流花的灵动之美也更添情趣此外。
Speaker embedding at encoder (SE-ENC) Speaker embedding at decoder (SE-DEC)
(3). Mic drop已连续六天夺得了销量冠军。
Speaker embedding at encoder (SE-ENC) Speaker embedding at decoder (SE-DEC)

2. Including versus excluding the target speaker data in average voice model (AVM) training

Mandarin Samples

(1). 你只需要像这样坐着,弯腰然后深呼吸。
Excluding the target speaker data (Retrain-AVM) Including the target speaker data (SE-DEC)
(2). 我是顺路过来跟你告别的。
Excluding the target speaker data (Retrain-AVM) Including the target speaker data (SE-DEC)
(3). 紫貂把腰拱得像弯弓一般。
Excluding the target speaker data (Retrain-AVM) Including the target speaker data (SE-DEC)

English Samples

(1). How do we develop football?
Excluding the target speaker data (Retrain-AVM) Including the target speaker data (SE-DEC)
(2). It's a pretty difficult journey.
Excluding the target speaker data (Retrain-AVM) Including the target speaker data (SE-DEC)
(3). People are now losing their jobs.
Excluding the target speaker data (Retrain-AVM) Including the target speaker data (SE-DEC)

Mixed-lingual Samples

(1). Lemonade和随后的“formation世界巡回演唱会。
Excluding the target speaker data (Retrain-AVM) Including the target speaker data (SE-DEC)
(2). Downie的才华才得到了michele的进一步赏识。
Excluding the target speaker data (Retrain-AVM) Including the target speaker data (SE-DEC)
(3). Caviar lobster和牛排都一般,主打甜品阿拉斯加太甜。
Excluding the target speaker data (Retrain-AVM) Including the target speaker data (SE-DEC)

3. Different language training data

Mandarin Samples

(1). 谈好条件以后,他俩轮流嚼着口香糖。
Using Mandarin corpus based on SE-DEC (CORPUS-MAN) Using English corpus based on SE-DEC (CORPUS-ENG) Using Mixed Mandarin-English corpus based on SE-DEC (CORPUS-MIX)
(2). 不过稍微改变了一点儿说法。
Using Mandarin corpus based on SE-DEC (CORPUS-MAN) Using English corpus based on SE-DEC (CORPUS-ENG) Using Mixed Mandarin-English corpus based on SE-DEC (CORPUS-MIX)
(3). 根本无法把那么多的水一下子排泄出去。
Using Mandarin corpus based on SE-DEC (CORPUS-MAN) Using English corpus based on SE-DEC (CORPUS-ENG) Using Mixed Mandarin-English corpus based on SE-DEC (CORPUS-MIX)

English Samples

(1). It's a pretty difficult journey.
Using Mandarin corpus based on SE-DEC (CORPUS-MAN) Using English corpus based on SE-DEC (CORPUS-ENG) Using Mixed Mandarin-English corpus based on SE-DEC (CORPUS-MIX)
(2). A huge diverse country.
Using Mandarin corpus based on SE-DEC (CORPUS-MAN) Using English corpus based on SE-DEC (CORPUS-ENG) Using Mixed Mandarin-English corpus based on SE-DEC (CORPUS-MIX)
(3). Amitabha was also present at the music launch of the movie.
Using Mandarin corpus based on SE-DEC (CORPUS-MAN) Using English corpus based on SE-DEC (CORPUS-ENG) Using Mixed Mandarin-English corpus based on SE-DEC (CORPUS-MIX)

Mixed-lingual Samples

(1). Lemonade和随后的“formation世界巡回演唱会。
Using Mandarin corpus based on SE-DEC (CORPUS-MAN) Using English corpus based on SE-DEC (CORPUS-ENG) Using Mixed Mandarin-English corpus based on SE-DEC (CORPUS-MIX)
(2). Caviar lobster和牛排都一般,主打甜品阿拉斯加太甜。
Using Mandarin corpus based on SE-DEC (CORPUS-MAN) Using English corpus based on SE-DEC (CORPUS-ENG) Using Mixed Mandarin-English corpus based on SE-DEC (CORPUS-MIX)
(3). Chain和同事在一项研究中验证了edge的能力。
Using Mandarin corpus based on SE-DEC (CORPUS-MAN) Using English corpus based on SE-DEC (CORPUS-ENG) Using Mixed Mandarin-English corpus based on SE-DEC (CORPUS-MIX)

4. The use of phoneme-informed attention

Mandarin Samples

(1). 紫貂把腰拱得像弯弓一般。
Speaker embedding at decoder (SE-DEC) Phoneme embedding context vector based on SE-DEC (PECV) Residual encoder based on SE-DEC (RES)
(2). 我有你画的羊,羊的箱子和羊的嘴套子…
Speaker embedding at decoder (SE-DEC) Phoneme embedding context vector based on SE-DEC (PECV) Residual encoder based on SE-DEC (RES)
(3). 眼看郭丁香操劳过度,人老珠黄,没有年轻时漂亮了,张生慢慢变得喜新厌旧。
Speaker embedding at decoder (SE-DEC) Phoneme embedding context vector based on SE-DEC (PECV) Residual encoder based on SE-DEC (RES)

English Samples

(1). People are now losing their jobs.
Speaker embedding at decoder (SE-DEC) Phoneme embedding context vector based on SE-DEC (PECV) Residual encoder based on SE-DEC (RES)
(2). Why is oil so important?
Speaker embedding at decoder (SE-DEC) Phoneme embedding context vector based on SE-DEC (PECV) Residual encoder based on SE-DEC (RES)
(3). Ain't nobody watching over these crooks.
Speaker embedding at decoder (SE-DEC) Phoneme embedding context vector based on SE-DEC (PECV) Residual encoder based on SE-DEC (RES)

Mixed-lingual Samples

(1). Love herby流花的灵动之美也更添情趣此外。
Speaker embedding at decoder (SE-DEC) Phoneme embedding context vector based on SE-DEC (PECV) Residual encoder based on SE-DEC (RES)
(2). Change可以指任何变化,shift表示变动多指位置。
Speaker embedding at decoder (SE-DEC) Phoneme embedding context vector based on SE-DEC (PECV) Residual encoder based on SE-DEC (RES)
(3). Downie的才华才得到了michele的进一步赏识。
Speaker embedding at decoder (SE-DEC) Phoneme embedding context vector based on SE-DEC (PECV) Residual encoder based on SE-DEC (RES)