About me

I am currently an Assistant Professor at Nanjing University. Previously, I was a Postdoctoral Fellow at the Hong Kong University of Science and Technology, working with Prof. Yike Guo and Prof. Wei Xue. I was also a Postdoctoral Fellow at the Chinese University of Hong Kong, Shenzhen (CUHK-SZ), working with Prof. Haizhou Li and Prof. Zhizheng Wu. I obtained my Ph.D. degree from the Audio, Speech and Language Processing Laboratory at Northwestern Polytechnical University (ASLP@NWPU), supervised by Prof. Lei Xie. During my Ph.D. studies, I conducted research at JD AI Lab, Tencent AI Lab, and Microsoft.

My research interests mainly focus on audio, speech and language processing; speech, audio and music understanding and generation; emotional and expressive speech generation; conversational AI; and AI agents.

I am committed to building open-source tools and data resources for the research community, including Amphion for audio, music, and speech generation; Audio-FLAN, an open-source instruction-following dataset for unified understanding and generation of speech, music, and sound; and WenetSpeech4TTS, an open-source Mandarin dataset for speech generation. I have also contributed to research on large-scale generative models for speech, music, and audio, including Llasa, Spark-TTS, YuE, and AudioX.

📢 I am actively recruiting self-motivated Ph.D. students, Master’s students, and research interns. If you are interested in my research or would like to explore related topics together, please feel free to contact me at lmxue@nju.edu.cn.

News

🎉 July 17, 2026: MINT-Bench: A Comprehensive Multilingual Benchmark for Instruction-Following Text-to-Speech and PodEval: A Multimodal Evaluation Framework for Podcast Audio Generation were accepted to ACM MM 2026.
🎉 Jun 15, 2026: NVVSpeech Challenge@ISCSLP 2026 has launched.
🎉 Jun 4, 2026: NVV-SuperBench: Beyond Words, Beyond Quality-Benchmarking Nonverbal Vocalizations in Speech Generation was accepted to INTERSPEECH 2026 as a long paper.
🎉 Apr 13, 2026: The 15th International Symposium on Chinese Spoken Language Processing (ISCSLP 2026) is launched.
🎉 Apr 8, 2026: 2nd Challenge and Workshop on Multilingual Conversational Speech Language Model (MLC-SLM)@INTERSPEECH2026 is launched.
🎉 Apr 7, 2026: S2S-Arena: Evaluating Paralinguistic Instruction Following in Speech-to-Speech Models is accepted by ACL 2026.
🎉 Jan 26, 2026: AudioX: A Unified Framework for Anything-to-Audio Generation and YuE: Scaling Open Foundation Models for Long-Form Music Generation are accepted by ICLR 2026.
🎉 Jan 8, 2026: SLAM-LLM: A Modular, Open-Source Multimodal Large Language Model Framework and Best Practice for Speech, Language, Audio and Music Processing is accepted by IEEE Journal of Selected Topics in Signal Processing.
🎉 Sep 26, 2025: LLM4MA Workshop@ISMIR 2025 is launched.
🎉 Mar 19, 2025: Invited by Josh Gardner from Apple to give a talk on Audio-FLAN.
🎉 Mar 3, 2025: Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens is released. GitHub Demo
🎉 Feb 23, 2025: Audio-FLAN: An Instruction-Following Dataset for Unified Understanding and Generation of Speech, Music, and Sound is released. GitHub HuggingFace !Selected as one of the Best Audio Papers of 2025!
🎉 Feb 6, 2025: Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis is released. GitHub Demo
🎉 Jan 18, 2025: YuE: Scaling Open Foundation Models for Long-Form Music Generation is released. GitHub Demo
🎉 Aug 30, 2024: Amphion: An Open-Source Audio, Music and Speech Generation Toolkit is accepted by IEEE SLT 2024.
🎉 Aug 30, 2024: Leveraging Diverse Semantic-based Audio Pretrained Models for Singing Voice Conversion is accepted by IEEE SLT 2024.
🎉 Aug 20, 2024: The ISCSLP 2024 Conversational Voice Clone (CoVoC) Challenge: Tasks, Results and Findings is accepted by ISCSLP 2024.
🎉 Aug 20, 2024: SingVisio is accepted by Computers & Graphics.
🎉 Jun 13, 2024: Multi-Level Temporal-Channel Speaker Retrieval for Zero-Shot Voice Conversion is accepted and published by TASLP.
🎉 Jun 4, 2024: WenetSpeech4TTS, Single-Codec, TACA-Audiobook are accepted by INTERSPEECH2024.
🎉 May 31, 2024: Conversational Voice Clone Challenge (CoVoC) is launched.
🎉 Feb 28, 2024: IEEE SLT 2024 is launched.
🎉 Dec 18, 2023: Amphion is released. Amphion is an open-source platform for Audio, Music, and Speech Generation. As a co-founder of Amphion,
- I built the fundamental framework to integrate various generative tasks into a unified pipeline, including data pre-processing, model building, training, and inference. GitHub
- I led the reproduction of several typical TTS models and released the pre-trained models. HuggingFace
- I developed and led the SingVisio project, an interactive visualization platform that makes the inner working mechanism of diffusion models easy to understand in the context of singing voice conversion. Try it.

Research Experience

2021.11 - 2022.10, Researcher, Microsoft.
2021.06 - 2021.11, Researcher, Tencent AI Lab.
2019.04 - 2020.06, Researcher, Microsoft.
2018.10 - 2019.04, Researcher, JD.COM AI Lab.

Academic Activities

Conference Co-organizer: International Symposium on Chinese Spoken Language Processing 2026 (ISCSLP 2026), IEEE Spoken Language Technology Workshop 2024 (IEEE SLT workshop 2024).
Challenge Co-organizer: Multilingual Conversational Speech Language Model (MLC-SLM)@INTERSPEECH2026, Conversational Voice Clone Challenge (CoVoC)@ISCSLP2024.
Workshop Co-organizer: Large Language Models for Music & Audio (LLM4MA) Workshop@ISMIR 2025.
Invited Reviewer: ACL, ACM MM, ICASSP, INTERSPEECH, IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), Speech Processing Letters, Speech Communication, ASRU, SLT, ISCSLP, etc.

Selected Publications

AudioX: A Unified Framework for Anything-to-Audio Generation, Zeyue Tian, Zhaoyang Liu, Yizhu Jin, Ruibin Yuan, Liumeng Xue, Xu Tan, Qifeng Chen, Wei Xue, Yike Guo. 2026

PodEval: A Multimodal Evaluation Framework for Podcast Audio Generation, Yujia Xiao, Liumeng Xue, Lei He, Xinyi Chen, Aemon Yat Fei Chiu, Wenjie Tian, Shaofei Zhang, Qiuqiang Kong, Xinfa Zhu, Wei Xue, Tan Lee. 2025.
Audio-FLAN: A preliminary release, Liumeng Xue, Ziya Zhou, Jiahao Pan, Zixuan Li, Shuai Fan, Yinghao Ma, Sitong Cheng, Dongchao Yang, Haohan Guo, Yujia Xiao, Xinsheng Wang, Zixuan Shen, Chuanbo Zhu, Xinshen Zhang, Tianchi Liu, Ruibin Yuan, Zeyue Tian, Haohe Liu, Emmanouil Benetos, Ge Zhang, Yike Guo, Wei Xue. 2025.
YuE: Scaling Open Foundation Models for Long-Form Music Generation, Ruibin Yuan, Hanfeng Lin, Shuyue Guo, Ge Zhang, …, Liumeng Xue, Xingwei Qu, Yizhi Li, Shangda Wu, Tianhao Shen, Ziyang Ma, …, Wei Xue, Xu Tan, Yike Guo. 2025.
Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens, Xinsheng Wang, Mingqi Jiang, Ziyang Ma, …, Liumeng Xue, …, Xie Chen, Lei Xie, Yike Guo, Wei Xue. 2025.
Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis, Zhen Ye, Xinfa Zhu, Chi-Min Chan, Xinsheng Wang, Xu Tan, …, Liumeng Xue, …, Yike Guo, Wei Xue. 2025.
The ISCSLP 2024 Conversational Voice Clone (CoVoC) Challenge: Tasks, Results and Findings, Kangxiang Xia, Dake Guo, Jixun Yao, Liumeng Xue, Hanzhao Li, Shuai Wang, Zhao Guo, Lei Xie, Qingqing Zhang, Lei Luo, Minghui Dong, Peng Sun. ISCSLP, 2024.
SingVisio: Visual Analytics of Diffusion Model for Singing Voice Conversion, Liumeng Xue, Chaoren Wang, Mingxuan Wang, Xueyao Zhang, Jun Han, Zhizheng Wu, Computers & Graphics, 2024
Amphion: An open-source audio, music and speech generation toolkit, Xueyao Zhang^*, Liumeng Xue^*, Yicheng Gu^*, Yuancheng Wang^*, Jiaqi Li^*, Haorui He, Chaoren Wang, Songting Liu, Xi Chen, Junan Zhang, Zihao Fang, Haopeng Chen, Tze Ying Tang, Lexiao Zou, Mingxuan Wang, Jun Han, Kai Chen, Haizhou Li, Zhizheng Wu. IEEE SLT, 2024.
Leveraging Diverse Semantic-based Audio Pretrained Models for Singing Voice Conversion, Xueyao Zhang, Zihao Fang, Yicheng Gu, Haopeng Chen, Lexiao Zou, Junan Zhang, Liumeng Xue, Zhizheng Wu, IEEE SLT 2024.
Multi-level Temporal-channel Speaker Retrieval for Zero-shot Voice Conversion, Zhichao Wang, Liumeng Xue, Qiuqiang Kong, Lei Xie, Yuanzhe Chen, Qiao Tian, Yuping Wang, TASLP, 2024
Single-Codec: Single-Codebook Speech Codec towards High-Performance Speech Generation, Hanzhao Li, Liumeng Xue, Haohan Guo, Xinfa Zhu, Yuanjun Lv, Lei Xie, Yunlin Chen, Hao Yin, Zhifei Li. INTERSPEECH, 2024.
Text-aware and Context-aware Expressive Audiobook Speech Synthesis, Dake Guo, Xinfa Zhu, Liumeng Xue, Yongmao Zhang, Wenjie Tian, Lei Xie. INTERSPEECH, 2024.
WenetSpeech4TTS: A 12,800-hour Mandarin TTS Corpus for Large Speech Generation Model Benchmark, Linhan Ma, Dake Guo, Kun Song, Yuepeng Jiang, Shuai Wang, Liumeng Xue, Weiming Xu, Huan Zhao, Binbin Zhang, Lei Xie. INTERSPEECH, 2024.
An Investigation of Time-Frequency Representation Discriminators for High-Fidelity Vocoder, Yicheng Gu, Xueyao Zhang, Liumeng Xue, Haizhou Li, Zhizheng Wu, under review, 2024.
SponTTS: modeling and transferring spontaneous style for TTS, Hanzhao Li, Xinfa Zhu, Liumeng Xue, Yang Song, Yunlin Chen, Lei Xie. ICASSP, 2024.
Transfer the linguistic representations from TTS to accent conversion with non-parallel data, Xi Chen, Jiakun Pei, Liumeng Xue, Mingyang Zhang, ICASSP, 2024
Multi-Scale Sub-Band Constant-Q Transform Discriminator for High-Fidelity Vocoder, Yicheng Gu, Xueyao Zhang, Liumeng Xue, Zhizheng Wu, ICASSP, 2024.
An Initial Investigation of Neural Replay Simulator for Over-the-Air Adversarial Perturbations to Automatic Speaker Verification, Jiaqi Li, Li Wang, Liumeng Xue, Lei Wang, Zhizheng Wu, ICASSP, 2024.
HiGNN-TTS: Hierarchical Prosody Modeling with Graph Neural Networks for Expressive Long-form TTS, Dake Guo, Xinfa Zhu, Liumeng Xue, Tao Li, Yuanjun Lv, Yuepeng Jiang, Lei Xie. ASRU, 2023.
Leveraging Content-based Features from Multiple Acoustic Models for Singing Voice Conversion, Xueyao Zhang, Yicheng Gu, Haopeng Chen, Zihao Fang, Lexiao Zou, Liumeng Xue, Zhizheng Wu, ML4Audio @ NeurIPS 2023.
Expressive-VC: Highly Expressive Voice Conversion with Attention Fusion of Bottleneck and Perturbation Features, Ziqian Ning, Qicong Xie, Pengcheng Zhu, Zhichao Wang, Liumeng Xue, Jixun Yao, Lei Xie, Mengxiao Bi. ICASSP, 2023.
Learning Noise-independent Speech Representation for High-quality Voice Conversion for Noisy Target Speakers, Liumeng Xue, Shan Yang, Na Hu, Dan Su, Lei Xie. INTERSPEECH, 2022
ParaTTS: Learning Linguistic and Prosodic Cross-sentence Information in Paragraph-based TTS, Liumeng Xue, Frank K. Soong, Shaofei Zhang, Lei Xie. TASLP, 2022

Cycle consistent network for end-to-end style transfer TTS training, Liumeng Xue, Shifeng Pan, Lei He, Lei Xie, Frank K Soong. Neural Networks, 2021
Controllable emotion transfer for end-to-end speech synthesis, Tao Li, Shan Yang, Liumeng Xue, Lei Xie. ISCSLP, 2021

On the localness modeling for the self-attention based end-to-end speech synthesis, Shan Yang, Heng Lu, Shiyin Kang, Liumeng Xue, Jinba Xiao, Dan Su, Lei Xie, Dong Yu. Neural Networks, 2020

Building a mixed-lingual neural TTS system with only monolingual data, Liumeng Xue, Wei Song, Guanghui Xu, Lei Xie, Zhizheng Wu. INTERSPEECH, 2019

Liumeng Xue