Audio samples for "Building a mixed-lingual neural TTS system with only monolingual data"

Authors: Liumeng Xue, Wei Song, Guanghui Xu, Lei Xie, Zhizheng Wu
Abstract: This paper investigates the problem of building a mixed-lingual text-to-speech (TTS) system when only monolingual data is available, which is a typical problem when deploying Chinese neural TTS, specifically encoder-decoder-based systems. Here, mixed-lingual refers to Mandarin Chinese with embedded English words or phrases. We investigate the problem from two aspects: speaker similarity and intelligibility. We use multi-speaker monolingual data, i.e. Mandarin data and English data, to build a base model that is expected to construct a Mandarin and English language space and a speaker space. We study the use of speaker embedding to make speaker similarity consistent across languages, analyze phoneme embedding and encoder output representations between the two languages, and investigate choices in model training. We report the findings and the challenges of building a mixed-lingual TTS system with monolingual data within the encoder-decoder framework.
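
The system labels on this page (SE-ENC, SE-DEC) distinguish where the speaker embedding enters the model. The page itself does not describe the architectures, so the following is only a minimal sketch of one common reading of those labels: in SE-ENC the speaker embedding is concatenated to the phoneme embeddings before the encoder, while in SE-DEC the encoder sees text only and the speaker embedding is concatenated to the encoder output that the decoder attends to. All names here (`toy_encoder`, `concat_speaker`, `encode_se_enc`, `encode_se_dec`) are hypothetical, and the toy encoder is a stand-in, not the authors' network.

```python
import math
from typing import List, Tuple

Vector = List[float]


def toy_encoder(frames: List[Vector]) -> List[Vector]:
    """Hypothetical stand-in for a text encoder.

    Squashes each input frame into two features that mix all input
    dimensions, so anything concatenated to the input affects the output.
    """
    return [[math.tanh(sum(f) / len(f)), math.tanh(max(f))] for f in frames]


def concat_speaker(frames: List[Vector], speaker: Vector) -> List[Vector]:
    """Append the same speaker embedding to every frame."""
    return [f + speaker for f in frames]


def encode_se_enc(phonemes: List[Vector], speaker: Vector) -> List[Vector]:
    """Assumed SE-ENC: speaker embedding is added before the encoder,
    so the encoder output itself depends on the speaker."""
    return toy_encoder(concat_speaker(phonemes, speaker))


def encode_se_dec(
    phonemes: List[Vector], speaker: Vector
) -> Tuple[List[Vector], List[Vector]]:
    """Assumed SE-DEC: the encoder sees text only; the speaker embedding
    is attached to the encoder output used by the decoder.

    Returns (text-only encoder output, decoder-side memory).
    """
    text_repr = toy_encoder(phonemes)
    return text_repr, concat_speaker(text_repr, speaker)


if __name__ == "__main__":
    # Made-up phoneme embeddings and speaker embeddings for illustration.
    phonemes = [[0.1, 0.2], [0.3, -0.1]]
    speaker_a = [0.5, -0.5]
    speaker_b = [-0.2, 0.9]

    text_a, memory_a = encode_se_dec(phonemes, speaker_a)
    text_b, memory_b = encode_se_dec(phonemes, speaker_b)
    print("SE-DEC text representation shared across speakers:", text_a == text_b)
    print(
        "SE-ENC encoder output differs across speakers:",
        encode_se_enc(phonemes, speaker_a) != encode_se_enc(phonemes, speaker_b),
    )
```

Under this reading, SE-DEC keeps the encoder output a function of the text alone, which is one plausible reason the two remaining systems on this page are built on SE-DEC; the paper should be consulted for the actual design.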

Mandarin Samples

1. 这儿夏天雨水很多,可是秋天很少雨。
Speaker embedding in encoder(SE-ENC) Speaker embedding in decoder(SE-DEC) Additional phoneme embedding context vector(using SE-DEC) Residual encoder (using SE-DEC)

2. 我很伤心,我准备冲网吧了。
Speaker embedding in encoder(SE-ENC) Speaker embedding in decoder(SE-DEC) Additional phoneme embedding context vector(using SE-DEC) Residual encoder (using SE-DEC)

3. 太好了,我喜欢吃中餐。
Speaker embedding in encoder(SE-ENC) Speaker embedding in decoder(SE-DEC) Additional phoneme embedding context vector(using SE-DEC) Residual encoder (using SE-DEC)

4. 我有你画的羊,羊的箱子和羊的嘴套子。
Speaker embedding in encoder(SE-ENC) Speaker embedding in decoder(SE-DEC) Additional phoneme embedding context vector(using SE-DEC) Residual encoder (using SE-DEC)

5. 音箱网络已断开,请检查网络后重新连接。
Speaker embedding in encoder(SE-ENC) Speaker embedding in decoder(SE-DEC) Additional phoneme embedding context vector(using SE-DEC) Residual encoder (using SE-DEC)

English Samples

6. An artwork that brings more attention to gay issues is worth lots of recognition.
Speaker embedding in encoder(SE-ENC) Speaker embedding in decoder(SE-DEC) Additional phoneme embedding context vector(using SE-DEC) Residual encoder (using SE-DEC)

7. It's a pretty difficult journey.
Speaker embedding in encoder(SE-ENC) Speaker embedding in decoder(SE-DEC) Additional phoneme embedding context vector(using SE-DEC) Residual encoder (using SE-DEC)

8. Strict rules apply to the exhumation of bodies.
Speaker embedding in encoder(SE-ENC) Speaker embedding in decoder(SE-DEC) Additional phoneme embedding context vector(using SE-DEC) Residual encoder (using SE-DEC)

9. A lot of times we'll go just because our friends are involved.
Speaker embedding in encoder(SE-ENC) Speaker embedding in decoder(SE-DEC) Additional phoneme embedding context vector(using SE-DEC) Residual encoder (using SE-DEC)

10. Ain't nobody watching over these crooks.
Speaker embedding in encoder(SE-ENC) Speaker embedding in decoder(SE-DEC) Additional phoneme embedding context vector(using SE-DEC) Residual encoder (using SE-DEC)

Mixed-lingual Samples

11. Exo的二零一七冬季特别专辑"universe"于今日揭开神秘面纱.
Speaker embedding in encoder(SE-ENC) Speaker embedding in decoder(SE-DEC) Additional phoneme embedding context vector(using SE-DEC) Residual encoder (using SE-DEC)

12. Evolve这张专辑让imagine dragons在自省中实现突破.
Speaker embedding in encoder(SE-ENC) Speaker embedding in decoder(SE-DEC) Additional phoneme embedding context vector(using SE-DEC) Residual encoder (using SE-DEC)

13. 八月"we young"等活动。
Speaker embedding in encoder(SE-ENC) Speaker embedding in decoder(SE-DEC) Additional phoneme embedding context vector(using SE-DEC) Residual encoder (using SE-DEC)

14. Anyway接待说你可以walk in进来.
Speaker embedding in encoder(SE-ENC) Speaker embedding in decoder(SE-DEC) Additional phoneme embedding context vector(using SE-DEC) Residual encoder (using SE-DEC)

15. Denny和bobby是双胞胎兄弟.
Speaker embedding in encoder(SE-ENC) Speaker embedding in decoder(SE-DEC) Additional phoneme embedding context vector(using SE-DEC) Residual encoder (using SE-DEC)