Accepted as a long paper at Interspeech 2026
Nonverbal vocalizations (NVVs), such as laughing, sighing, and sobbing, are essential for human-like speech, yet standardized evaluation rarely jointly assesses whether systems generate the intended NVVs, place them correctly, and keep them salient without harming speech. We present NVV-SuperBench, a bilingual English/Chinese benchmark for speech generation with NVVs. It provides a unified 45-type taxonomy and a multi-axis protocol beyond conventional speech quality assessment, evaluating NVV-specific controllability, placement, and perceptual salience. We benchmark 15 speech generation systems spanning prompt-based and tag-based control paradigms, using objective metrics, human listening tests, and LLM-based multi-rater evaluation. Results show that NVV controllability often decouples from speech quality, while low-SNR oral cues and long-duration affective NVVs remain bottlenecks. NVV-SuperBench highlights current gaps and supports progress toward more human-like speech generation.
The taxonomy spans 6 categories and 45 fine-grained types, covering respiratory, throat, laughter, crying, emotional, and oral vocalizations.
The dataset contains 4,500 instances in English and Chinese, balanced across all 45 NVV types.
The benchmark covers 8 tag-based and 7 prompt-based systems, spanning commercial systems (ElevenLabs, Gemini 2.5, GPT-4o mini TTS) and open-source models (ChatTTS, CosyVoice 2, Bark, Dia, and more).
The evaluation protocol separates speech quality from NVV-specific controllability, placement accuracy, and salience, using objective metrics, human listening tests, and LLM-based multi-rater evaluation.
A unified taxonomy of 45 nonverbal vocalization types organized into 6 categories.
| Category | NVV Types | Count |
|---|---|---|
| Respiratory | breath, inhale, exhale, quick breath, sigh, gasp, panting, wheezing, snore, yawn | 10 |
| Throat / Physiological | cough, sneeze, throat clearing, hiccup, sniff, sniffle, snort | 7 |
| Laughter Spectrum | chuckle, giggle, laugh, laugh harder, start laughing, stifled laugh, burst of laughter | 7 |
| Crying | crying, sobbing, crying loudly, wail, whimper | 5 |
| Emotional Vocalizations | hum, humming, groan, moan, grunt, mumble, exclamation (ah, oh, hmm) | 7 |
| Oral / Miscellaneous | lipsmack, gulp, swallow, burp, tsk, sss, clucking, hissing, whisper | 9 |
| Total | | 45 |
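
The taxonomy table above can also be written down as a small data structure, which makes the per-category counts easy to check. This is a minimal sketch, not part of the benchmark's released code: the variable names are hypothetical, the "Crying" key is an assumed label for the category whose name is missing from the table, and the per-type instance count assumes the 4,500 instances are split exactly evenly across types.

```python
# Taxonomy transcribed from the table above. The "Crying" key is an assumed
# label; the type names are the ones listed on the page.
NVV_TAXONOMY = {
    "Respiratory": [
        "breath", "inhale", "exhale", "quick breath", "sigh",
        "gasp", "panting", "wheezing", "snore", "yawn",
    ],
    "Throat / Physiological": [
        "cough", "sneeze", "throat clearing", "hiccup",
        "sniff", "sniffle", "snort",
    ],
    "Laughter Spectrum": [
        "chuckle", "giggle", "laugh", "laugh harder",
        "start laughing", "stifled laugh", "burst of laughter",
    ],
    "Crying": ["crying", "sobbing", "crying loudly", "wail", "whimper"],
    "Emotional Vocalizations": [
        "hum", "humming", "groan", "moan", "grunt", "mumble",
        "exclamation (ah, oh, hmm)",
    ],
    "Oral / Miscellaneous": [
        "lipsmack", "gulp", "swallow", "burp", "tsk",
        "sss", "clucking", "hissing", "whisper",
    ],
}

TOTAL_INSTANCES = 4500  # benchmark size stated on the page

total_types = sum(len(types) for types in NVV_TAXONOMY.values())

# Assumes the balance is exact, i.e. the same number of instances per type.
instances_per_type = TOTAL_INSTANCES // total_types

for category, types in NVV_TAXONOMY.items():
    print(f"{category}: {len(types)} types")
print(f"total types: {total_types}, instances per type: {instances_per_type}")
```

Under these assumptions the six categories sum to the stated 45 types, and an even split gives 100 instances per type across the two languages combined.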
Listen to system outputs for all available NVV types in both languages (EN/ZH) and both synthesis paradigms (tag-based / prompt-based), with 3 samples per type whenever available.
Results across objective metrics, subjective listening tests, and LLM-based evaluation for 15 TTS systems.

Objective results for prompt-based systems, English:

| System | WER↓ | DNSMOS SIG↑ | DNSMOS BAK↑ | DNSMOS OVRL↑ | CLAP Score↑ |
|---|---|---|---|---|---|
| Parler-TTS Mini | 6.25 | 3.48 | 4.03 | 3.20 | 0.34 |
| Parler-TTS Large | 9.30 | 3.45 | 4.00 | 3.15 | 0.35 |
| CapSpeech | 4.56 | 3.46 | 3.96 | 3.16 | 0.40 |
| Qwen3-TTS | 2.06 | 3.56 | 4.07 | 3.30 | 0.45 |
| GPT-4o mini TTS | 4.81 | 3.59 | 4.14 | 3.35 | 0.44 |
| Gemini 2.5 Flash | 58.80 | 3.57 | 4.02 | 3.29 | 0.42 |
| Gemini 2.5 Pro | 5.40 | 3.52 | 4.00 | 3.23 | 0.41 |

Objective results for prompt-based systems, Chinese:

| System | CER↓ | DNSMOS SIG↑ | DNSMOS BAK↑ | DNSMOS OVRL↑ | CLAP Score↑ |
|---|---|---|---|---|---|
| Qwen3-TTS | 4.08 | 3.50 | 3.98 | 3.19 | 0.39 |
| GPT-4o mini TTS | 4.67 | 3.64 | 4.16 | 3.40 | 0.43 |
| Gemini 2.5 Flash | 16.45 | 3.57 | 3.97 | 3.25 | 0.41 |
| Gemini 2.5 Pro | 7.68 | 3.55 | 4.01 | 3.26 | 0.42 |

Objective results for tag-based systems, English:

| System | WER↓ | DNSMOS SIG↑ | DNSMOS BAK↑ | DNSMOS OVRL↑ | Coverage↑ | Precision↑ | Recall↑ | F1↑ | NTD↓ |
|---|---|---|---|---|---|---|---|---|---|
| Bark | 14.73 | 3.00 | 3.08 | 2.52 | 0.11 | 0.614 | 0.700 | 0.654 | 0.0037 |
| Higgs-Audio | 9.41 | 3.55 | 3.96 | 3.23 | 0.09 | 0.360 | 0.407 | 0.382 | 0.0111 |
| ChatTTS | 5.58 | 3.67 | 4.12 | 3.42 | 0.02 | 0.652 | 0.680 | 0.664 | 0.0028 |
| Fish-Speech | 5.65 | 3.55 | 4.06 | 3.28 | 0.16 | 0.447 | 0.418 | 0.432 | 0.0157 |
| Dia | 21.95 | 2.94 | 2.71 | 2.27 | 0.29 | 0.574 | 0.705 | 0.632 | 0.0052 |
| CosyVoice 2 | 3.82 | 3.67 | 4.19 | 3.45 | 0.18 | 0.475 | 0.451 | 0.463 | 0.0159 |
| Orpheus TTS | 4.98 | 3.62 | 4.11 | 3.34 | 0.18 | 0.687 | 0.774 | 0.728 | 0.0031 |
| ElevenLabs | 2.31 | 3.59 | 4.06 | 3.32 | 0.27 | 0.664 | 0.787 | 0.720 | 0.0091 |

Objective results for tag-based systems, Chinese:

| System | CER↓ | DNSMOS SIG↑ | DNSMOS BAK↑ | DNSMOS OVRL↑ | Coverage↑ | Precision↑ | Recall↑ | F1↑ | NTD↓ |
|---|---|---|---|---|---|---|---|---|---|
| Bark | 41.86 | 2.99 | 3.10 | 2.54 | 0.11 | 0.572 | 0.605 | 0.588 | 0.0141 |
| Orpheus TTS | 18.78 | 3.60 | 4.11 | 3.30 | 0.18 | 0.585 | 0.671 | 0.625 | 0.0159 |
| Higgs-Audio | 5.98 | 3.52 | 3.91 | 3.17 | 0.09 | 0.429 | 0.369 | 0.396 | 0.0186 |
| Fish-Speech | 9.25 | 3.49 | 4.00 | 3.19 | 0.16 | 0.592 | 0.605 | 0.598 | 0.0397 |
| ChatTTS | 4.52 | 3.49 | 3.50 | 2.94 | 0.02 | 0.644 | 0.773 | 0.703 | 0.0237 |
| CosyVoice 2 | 6.27 | 3.63 | 4.17 | 3.40 | 0.18 | 0.515 | 0.480 | 0.496 | 0.0383 |
| ElevenLabs | 4.13 | 3.61 | 4.03 | 3.32 | 0.27 | 0.630 | 0.750 | 0.684 | 0.0246 |
DNSMOS SIG/BAK/OVRL: DNSMOS P.835 metrics for speech signal quality, background noise intrusiveness, and overall quality (higher is better). NTD: Normalized Tag Distance (lower is better).
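
The page does not define how the F1 column is computed. A minimal sketch, assuming F1 is the usual harmonic mean of precision and recall, lets a reader check the reported values against that assumption. The function name and the selection of rows are illustrative; the numbers are copied from the tag-based WER table above.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the tag-based WER table.
reported = {
    "Bark": (0.614, 0.700, 0.654),
    "ChatTTS": (0.652, 0.680, 0.664),
    "Orpheus TTS": (0.687, 0.774, 0.728),
    "ElevenLabs": (0.664, 0.787, 0.720),
}

# Largest gap between the recomputed and the reported F1 over these rows.
max_gap = 0.0
for system, (p, r, f1_reported) in reported.items():
    f1_computed = f1_score(p, r)
    max_gap = max(max_gap, abs(f1_computed - f1_reported))
    print(f"{system}: computed {f1_computed:.3f}, reported {f1_reported:.3f}")
```

Small gaps between computed and reported values are expected, since the inputs are rounded to three decimals. They could also arise if F1 were averaged per NVV type before reporting, which the page does not specify.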

Subjective listening test results for prompt-based systems, English:

| System | Naturalness↑ | Quality↑ | NVV PE↑ | NVV IF↑ |
|---|---|---|---|---|
| Parler-TTS Large | 2.69 | 3.35 | 0.93 | 0.99 |
| Parler-TTS Mini | 2.62 | 3.40 | 0.85 | 0.92 |
| CapSpeech | 2.73 | 3.41 | 1.01 | 1.11 |
| GPT-4o mini TTS | 3.33 | 3.56 | 1.74 | 1.89 |
| Qwen3-TTS | 3.44 | 3.86 | 2.03 | 2.15 |
| Gemini 2.5 Flash | 4.00 | 4.28 | 2.60 | 2.67 |
| Gemini 2.5 Pro | 4.07 | 4.30 | 2.68 | 2.74 |

Subjective listening test results for prompt-based systems, Chinese:

| System | Naturalness↑ | Quality↑ | NVV PE↑ | NVV IF↑ |
|---|---|---|---|---|
| GPT-4o mini TTS | 2.11 | 2.77 | 0.84 | 0.92 |
| Qwen3-TTS | 3.45 | 3.93 | 2.01 | 1.98 |
| Gemini 2.5 Flash | 3.04 | 3.71 | 2.29 | 2.42 |
| Gemini 2.5 Pro | 3.38 | 3.75 | 2.07 | 2.11 |

Subjective listening test results for tag-based systems, English:

| System | Naturalness↑ | Quality↑ | NVV PE↑ | NVV Accuracy↑ |
|---|---|---|---|---|
| Bark | 2.75 | 2.37 | 2.07 | 2.65 |
| Dia | 3.12 | 2.82 | 2.45 | 2.99 |
| Fish-Speech | 3.53 | 3.62 | 1.01 | 1.04 |
| Higgs-Audio | 3.63 | 3.63 | 2.41 | 2.28 |
| CosyVoice 2 | 3.65 | 3.92 | 2.39 | 2.22 |
| ChatTTS | 3.30 | 3.10 | 3.00 | 3.40 |
| Orpheus TTS | 4.01 | 4.11 | 3.31 | 3.71 |
| ElevenLabs | 4.60 | 4.71 | 3.92 | 4.21 |

Subjective listening test results for tag-based systems, Chinese:

| System | Naturalness↑ | Quality↑ | NVV PE↑ | NVV Accuracy↑ |
|---|---|---|---|---|
| Bark | 2.23 | 2.00 | 1.09 | 1.12 |
| Fish-Speech | 2.71 | 3.09 | 0.10 | 0.13 |
| Orpheus TTS | 3.20 | 3.29 | 2.05 | 2.17 |
| Higgs-Audio | 3.48 | 3.68 | 1.38 | 1.18 |
| CosyVoice 2 | 3.76 | 4.35 | 1.56 | 1.65 |
| ChatTTS | 3.53 | 3.13 | 2.13 | 2.23 |
| ElevenLabs | 4.09 | 4.31 | 3.38 | 3.41 |
NVV PE: NVV Perceptual Effect. NVV IF: NVV Instruction Following. All scores are on a 5-point scale.
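
One way to probe the claim that NVV controllability decouples from speech quality is to rank-correlate the two columns. The sketch below is an illustration rather than the authors' analysis: it computes Spearman's rank correlation between Quality and NVV Accuracy for the eight tag-based systems in the subjective table that includes Dia. The helper names are hypothetical, and the closed-form formula used here is only valid when there are no tied values, which holds for this data.

```python
def ranks(values):
    """Return 1-based ranks in ascending order (assumes no tied values)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman rank correlation via the no-ties closed form."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# (Quality, NVV Accuracy) copied from the tag-based subjective table
# that includes Dia.
scores = {
    "Bark": (2.37, 2.65),
    "Dia": (2.82, 2.99),
    "Fish-Speech": (3.62, 1.04),
    "Higgs-Audio": (3.63, 2.28),
    "CosyVoice 2": (3.92, 2.22),
    "ChatTTS": (3.10, 3.40),
    "Orpheus TTS": (4.11, 3.71),
    "ElevenLabs": (4.71, 4.21),
}

quality = [q for q, _ in scores.values()]
accuracy = [a for _, a in scores.values()]
rho = spearman(quality, accuracy)
print(f"Spearman rho (Quality vs NVV Accuracy): {rho:.3f}")
```

A value well below 1 means that ranking systems by perceived quality does not reproduce their ranking by NVV accuracy. With only eight systems, such a coefficient is a rough indicator rather than a significance test.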

LLM-based evaluation results for prompt-based systems, English:

| System | Naturalness↑ | Quality↑ | NVV PE↑ | NVV IF↑ |
|---|---|---|---|---|
| Parler-TTS Mini | 1.65 | 3.30 | 0.23 | 0.30 |
| Parler-TTS Large | 1.67 | 3.13 | 0.30 | 0.38 |
| CapSpeech | 1.88 | 3.52 | 0.48 | 0.57 |
| GPT-4o mini TTS | 3.16 | 3.93 | 1.73 | 1.65 |
| Qwen3-TTS | 3.17 | 3.96 | 2.16 | 1.98 |
| Gemini 2.5 Flash | 3.30 | 3.80 | 2.84 | 2.78 |
| Gemini 2.5 Pro | 3.20 | 3.65 | 2.54 | 2.64 |

LLM-based evaluation results for prompt-based systems, Chinese:

| System | Naturalness↑ | Quality↑ | NVV PE↑ | NVV IF↑ |
|---|---|---|---|---|
| GPT-4o mini TTS | 2.50 | 3.85 | 0.92 | 1.04 |
| Qwen3-TTS | 3.23 | 3.88 | 2.10 | 1.99 |
| Gemini 2.5 Flash | 3.27 | 3.90 | 3.06 | 3.12 |
| Gemini 2.5 Pro | 3.11 | 3.73 | 2.54 | 2.71 |

LLM-based evaluation results for tag-based systems, English:

| System | Naturalness↑ | Quality↑ | NVV PE↑ | NVV Accuracy↑ |
|---|---|---|---|---|
| Bark | 1.78 | 2.20 | 1.43 | 2.34 |
| Dia | 2.19 | 2.51 | 2.08 | 3.14 |
| Fish-Speech | 1.97 | 3.00 | 0.48 | 0.71 |
| CosyVoice 2 | 2.46 | 3.72 | 1.41 | 1.80 |
| ChatTTS | 2.88 | 3.31 | 3.00 | 3.88 |
| Higgs-Audio | 2.77 | 3.50 | 1.90 | 2.15 |
| Orpheus TTS | 3.68 | 3.76 | 4.03 | 3.97 |
| ElevenLabs | 3.92 | 4.21 | 4.33 | 4.58 |

LLM-based evaluation results for tag-based systems, Chinese:

| System | Naturalness↑ | Quality↑ | NVV PE↑ | NVV Accuracy↑ |
|---|---|---|---|---|
| Bark | 1.44 | 1.84 | 1.28 | 2.25 |
| Fish-Speech | 2.65 | 3.83 | 1.29 | 1.58 |
| Orpheus TTS | 2.81 | 3.48 | 2.74 | 3.09 |
| CosyVoice 2 | 3.37 | 3.84 | 1.98 | 1.75 |
| Higgs-Audio | 3.10 | 3.37 | 2.79 | 3.24 |
| ChatTTS | 2.69 | 2.75 | 1.94 | 2.88 |
| ElevenLabs | 4.09 | 4.09 | 3.74 | 3.77 |
LLM-based scores use an automated multi-rater evaluation protocol with a 5-point scale. NVV PE: NVV Perceptual Effect. NVV IF: NVV Instruction Following.
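
A quick sanity check on the LLM-based protocol is to see how closely its scores track the human listening tests. The sketch below computes a Pearson correlation between human and LLM NVV PE scores for the eight tag-based systems in the tables that include Dia. It is an illustration under assumptions, not the authors' validation procedure: the helper name is hypothetical, and it assumes those two tables describe the same language condition.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# NVV PE per system, copied from the tag-based tables that include Dia.
human_pe = {
    "Bark": 2.07, "Dia": 2.45, "Fish-Speech": 1.01, "Higgs-Audio": 2.41,
    "CosyVoice 2": 2.39, "ChatTTS": 3.00, "Orpheus TTS": 3.31,
    "ElevenLabs": 3.92,
}
llm_pe = {
    "Bark": 1.43, "Dia": 2.08, "Fish-Speech": 0.48, "Higgs-Audio": 1.90,
    "CosyVoice 2": 1.41, "ChatTTS": 3.00, "Orpheus TTS": 4.03,
    "ElevenLabs": 4.33,
}

# Align the two score lists by system name before correlating.
systems = sorted(human_pe)
r = pearson([human_pe[s] for s in systems], [llm_pe[s] for s in systems])
print(f"Pearson r (human vs LLM NVV PE): {r:.3f}")
```

Keying the scores by system name avoids pairing values from different rows, since the two tables list the systems in different orders.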
Per-type NVV Perceptual Effect (PE) (0–5 scale) derived from subjective evaluation. Top panel: tag-based systems sorted by mean PE. Bottom panel: prompt-based systems. White cells = NVV type not supported.
If you use NVV-SuperBench in your research, please cite:
@article{xue2026nvvsuperbench,
  title   = {NVV-SuperBench: Beyond Words, Beyond Quality—Benchmarking Nonverbal Vocalizations in Speech Generation},
  author  = {Xue, Liumeng and Bian, Weizhen and Pan, Jiahao and Wu, Wenxuan and Ren, Yilin and Kang, Boyi and Hu, Jingbin and Ma, Ziyang and Wang, Shuai and Qian, Xinyuan and others},
  journal = {arXiv preprint arXiv:2604.16211},
  year    = {2026}
}