NVV-SuperBench: Beyond Words, Beyond Quality—Benchmarking Nonverbal Vocalizations in Speech Generation

Accepted as a long paper at Interspeech 2026

Nonverbal vocalizations (NVVs), such as laughing, sighing, and sobbing, are essential for human-like speech, yet standardized evaluation rarely jointly assesses whether systems generate the intended NVVs, place them correctly, and keep them salient without harming speech. We present NVV-SuperBench, a bilingual English/Chinese benchmark for speech generation with NVVs. It provides a unified 45-type taxonomy and a multi-axis protocol beyond conventional speech quality assessment, evaluating NVV-specific controllability, placement, and perceptual salience. We benchmark 15 speech generation systems spanning prompt-based and tag-based control paradigms, using objective metrics, human listening tests, and LLM-based multi-rater evaluation. Results show that NVV controllability often decouples from speech quality, while low-SNR oral cues and long-duration affective NVVs remain bottlenecks. NVV-SuperBench highlights current gaps and supports progress toward more human-like speech generation.
🌏

45-type NVV Taxonomy

6 categories and 45 fine-grained types covering respiratory, throat, laughter, crying, emotional, and oral vocalizations.

📊

Bilingual Evaluation Set

4,500 instances in English and Chinese, balanced across all 45 NVV types.

🤖

15 Speech Generation Systems

8 tag-based and 7 prompt-based systems, spanning commercial (ElevenLabs, Gemini 2.5, GPT-4o mini TTS) and open-source models (ChatTTS, CosyVoice 2, Bark, Dia, and more).

🔎

Multi-Axis Evaluation Protocol

Separates speech quality from NVV-specific controllability, placement accuracy, and salience using objective metrics, human listening tests, and LLM-based multi-rater evaluation.

NVV Taxonomy

A unified taxonomy of 45 nonverbal vocalization types organized into 6 categories.

| Category | NVV Types | Count |
| --- | --- | --- |
| Respiratory | breath, inhale, exhale, quick breath, sigh, gasp, panting, wheezing, snore, yawn | 10 |
| Throat / Physiological | cough, sneeze, throat clearing, hiccup, sniff, sniffle, snort | 7 |
| Laughter Spectrum | chuckle, giggle, laugh, laugh harder, start laughing, stifled laugh, burst of laughter | 7 |
| Crying Spectrum | crying, sobbing, crying loudly, wail, whimper | 5 |
| Emotional Vocalizations | hum, humming, groan, moan, grunt, mumble, exclamation (ah, oh, hmm) | 7 |
| Oral / Miscellaneous | lipsmack, gulp, swallow, burp, tsk, sss, clucking, hissing, whisper | 9 |
| Total | | 45 |
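
As a quick sanity check, the taxonomy table above can be written down as a plain mapping and its counts recomputed. This is only a sketch: the names `NVV_TAXONOMY` and `taxonomy_summary` are illustrative and are not part of any released NVV-SuperBench code.

```python
# Taxonomy transcribed from the table above: category -> list of NVV types.
# The variable and function names are illustrative, not an official API.
NVV_TAXONOMY = {
    "Respiratory": [
        "breath", "inhale", "exhale", "quick breath", "sigh",
        "gasp", "panting", "wheezing", "snore", "yawn",
    ],
    "Throat / Physiological": [
        "cough", "sneeze", "throat clearing", "hiccup",
        "sniff", "sniffle", "snort",
    ],
    "Laughter Spectrum": [
        "chuckle", "giggle", "laugh", "laugh harder",
        "start laughing", "stifled laugh", "burst of laughter",
    ],
    "Crying Spectrum": [
        "crying", "sobbing", "crying loudly", "wail", "whimper",
    ],
    "Emotional Vocalizations": [
        "hum", "humming", "groan", "moan", "grunt", "mumble",
        "exclamation (ah, oh, hmm)",
    ],
    "Oral / Miscellaneous": [
        "lipsmack", "gulp", "swallow", "burp", "tsk", "sss",
        "clucking", "hissing", "whisper",
    ],
}


def taxonomy_summary(taxonomy):
    """Return per-category counts and the total number of NVV types."""
    counts = {category: len(types) for category, types in taxonomy.items()}
    return counts, sum(counts.values())


counts, total = taxonomy_summary(NVV_TAXONOMY)
for category, n in counts.items():
    print(f"{category}: {n}")
print(f"Total: {total}")
```

The per-category counts and the total printed by this snippet should match the Count column of the table.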

Demo Samples

Listen to system outputs for all available NVV types in both languages (EN/ZH) and both synthesis paradigms (tag-based and prompt-based), with up to 3 samples per type.

<laugh> Tag-based: NVV type inserted as an inline tag marker in the text.
💬 Prompt-based: A natural-language caption describes the desired NVV behavior.
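
To make the two control paradigms concrete, the sketch below builds one input of each kind for the same sentence. Everything here is an assumption for illustration: the angle-bracket tag syntax follows the `<laugh>` example above, but individual systems may use their own tag formats, and the caption wording and function names are hypothetical rather than the benchmark's actual prompts.

```python
def build_tag_based_input(text, nvv_type, position):
    """Insert an inline tag such as <laugh> before the word at index `position`.

    The <...> syntax is an assumption; real systems may expect other formats.
    """
    words = text.split()
    if not 0 <= position <= len(words):
        raise ValueError("position out of range")
    tag = f"<{nvv_type}>"
    return " ".join(words[:position] + [tag] + words[position:])


def build_prompt_based_input(text, nvv_type, where="at the end"):
    """Pair the plain transcript with a natural-language caption.

    The caption template is hypothetical, not the benchmark's wording.
    """
    caption = f"The speaker produces a {nvv_type} {where} of the sentence."
    return {"text": text, "caption": caption}


sentence = "That was a great joke"
print(build_tag_based_input(sentence, "laugh", position=5))
print(build_prompt_based_input(sentence, "laugh"))
```

In the tag-based case the NVV position is explicit in the text, which is what allows placement to be scored; in the prompt-based case the system has to infer placement from the caption.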

Evaluation Results

Results across objective metrics, subjective listening tests, and LLM-based evaluation for 15 TTS systems.

Tag-based Systems (8)

ChatTTS (open), Higgs-Audio (open), Bark (open), Fish-Speech (open), Orpheus TTS (open), CosyVoice 2 (open), Dia (open), ElevenLabs (commercial)

Prompt-based Systems (7)

Gemini 2.5 Pro (commercial), Gemini 2.5 Flash (commercial), GPT-4o mini TTS (commercial), Qwen3-TTS (open), CapSpeech (open), Parler-TTS Mini (open), Parler-TTS Large (open)
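
Objective Metrics

Prompt-based Systems — English

| System | WER↓ | DNSMOS SIG↑ | DNSMOS BAK↑ | DNSMOS OVRL↑ | CLAP Score↑ |
| --- | --- | --- | --- | --- | --- |
| Parler-TTS Mini | 6.25 | 3.48 | 4.03 | 3.20 | 0.34 |
| Parler-TTS Large | 9.30 | 3.45 | 4.00 | 3.15 | 0.35 |
| CapSpeech | 4.56 | 3.46 | 3.96 | 3.16 | 0.40 |
| Qwen3-TTS | 2.06 | 3.56 | 4.07 | 3.30 | 0.45 |
| GPT-4o mini TTS | 4.81 | 3.59 | 4.14 | 3.35 | 0.44 |
| Gemini 2.5 Flash | 58.80 | 3.57 | 4.02 | 3.29 | 0.42 |
| Gemini 2.5 Pro | 5.40 | 3.52 | 4.00 | 3.23 | 0.41 |

Prompt-based Systems — Chinese

| System | CER↓ | DNSMOS SIG↑ | DNSMOS BAK↑ | DNSMOS OVRL↑ | CLAP Score↑ |
| --- | --- | --- | --- | --- | --- |
| Qwen3-TTS | 4.08 | 3.50 | 3.98 | 3.19 | 0.39 |
| GPT-4o mini TTS | 4.67 | 3.64 | 4.16 | 3.40 | 0.43 |
| Gemini 2.5 Flash | 16.45 | 3.57 | 3.97 | 3.25 | 0.41 |
| Gemini 2.5 Pro | 7.68 | 3.55 | 4.01 | 3.26 | 0.42 |

Tag-based Systems — English

| System | WER↓ | DNSMOS SIG↑ | DNSMOS BAK↑ | DNSMOS OVRL↑ | Coverage↑ | Precision↑ | Recall↑ | F1↑ | NTD↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Bark | 14.73 | 3.00 | 3.08 | 2.52 | 0.11 | 0.614 | 0.700 | 0.654 | 0.0037 |
| Higgs-Audio | 9.41 | 3.55 | 3.96 | 3.23 | 0.09 | 0.360 | 0.407 | 0.382 | 0.0111 |
| ChatTTS | 5.58 | 3.67 | 4.12 | 3.42 | 0.02 | 0.652 | 0.680 | 0.664 | 0.0028 |
| Fish-Speech | 5.65 | 3.55 | 4.06 | 3.28 | 0.16 | 0.447 | 0.418 | 0.432 | 0.0157 |
| Dia | 21.95 | 2.94 | 2.71 | 2.27 | 0.29 | 0.574 | 0.705 | 0.632 | 0.0052 |
| CosyVoice 2 | 3.82 | 3.67 | 4.19 | 3.45 | 0.18 | 0.475 | 0.451 | 0.463 | 0.0159 |
| Orpheus TTS | 4.98 | 3.62 | 4.11 | 3.34 | 0.18 | 0.687 | 0.774 | 0.728 | 0.0031 |
| ElevenLabs | 2.31 | 3.59 | 4.06 | 3.32 | 0.27 | 0.664 | 0.787 | 0.720 | 0.0091 |

Tag-based Systems — Chinese

| System | CER↓ | DNSMOS SIG↑ | DNSMOS BAK↑ | DNSMOS OVRL↑ | Coverage↑ | Precision↑ | Recall↑ | F1↑ | NTD↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Bark | 41.86 | 2.99 | 3.10 | 2.54 | 0.11 | 0.572 | 0.605 | 0.588 | 0.0141 |
| Orpheus TTS | 18.78 | 3.60 | 4.11 | 3.30 | 0.18 | 0.585 | 0.671 | 0.625 | 0.0159 |
| Higgs-Audio | 5.98 | 3.52 | 3.91 | 3.17 | 0.09 | 0.429 | 0.369 | 0.396 | 0.0186 |
| Fish-Speech | 9.25 | 3.49 | 4.00 | 3.19 | 0.16 | 0.592 | 0.605 | 0.598 | 0.0397 |
| ChatTTS | 4.52 | 3.49 | 3.50 | 2.94 | 0.02 | 0.644 | 0.773 | 0.703 | 0.0237 |
| CosyVoice 2 | 6.27 | 3.63 | 4.17 | 3.40 | 0.18 | 0.515 | 0.480 | 0.496 | 0.0383 |
| ElevenLabs | 4.13 | 3.61 | 4.03 | 3.32 | 0.27 | 0.630 | 0.750 | 0.684 | 0.0246 |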

DNSMOS SIG/BAK/OVRL: DNSMOS P.835 metrics for speech signal quality, background noise intrusiveness, and overall quality (higher is better). NTD: Normalized Tag Distance (lower is better).
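
The page does not spell out how Precision, Recall, and F1 are aggregated. As a rough consistency check, the sketch below assumes F1 is the usual harmonic mean of precision and recall and recomputes it from the English tag-based rows. Under that assumption the recomputed values agree with the reported ones to within roughly 0.002; the small gaps could come from rounding of the published figures or from averaging per NVV type, which this sketch does not model. The function names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the English tag-based table.
ENGLISH_TAG_ROWS = {
    "Bark": (0.614, 0.700, 0.654),
    "Higgs-Audio": (0.360, 0.407, 0.382),
    "ChatTTS": (0.652, 0.680, 0.664),
    "Fish-Speech": (0.447, 0.418, 0.432),
    "Dia": (0.574, 0.705, 0.632),
    "CosyVoice 2": (0.475, 0.451, 0.463),
    "Orpheus TTS": (0.687, 0.774, 0.728),
    "ElevenLabs": (0.664, 0.787, 0.720),
}


def max_f1_deviation(rows):
    """Largest absolute gap between recomputed and reported F1."""
    return max(abs(f1_score(p, r) - reported) for p, r, reported in rows.values())


for system, (p, r, reported) in ENGLISH_TAG_ROWS.items():
    print(f"{system}: recomputed F1 = {f1_score(p, r):.3f}, reported = {reported:.3f}")
print(f"max deviation = {max_f1_deviation(ENGLISH_TAG_ROWS):.4f}")
```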

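Subjective Listening Tests

Prompt-based Systems — English

| System | Naturalness↑ | Quality↑ | NVV PE↑ | NVV IF↑ |
| --- | --- | --- | --- | --- |
| Parler-TTS Large | 2.69 | 3.35 | 0.93 | 0.99 |
| Parler-TTS Mini | 2.62 | 3.40 | 0.85 | 0.92 |
| CapSpeech | 2.73 | 3.41 | 1.01 | 1.11 |
| GPT-4o mini TTS | 3.33 | 3.56 | 1.74 | 1.89 |
| Qwen3-TTS | 3.44 | 3.86 | 2.03 | 2.15 |
| Gemini 2.5 Flash | 4.00 | 4.28 | 2.60 | 2.67 |
| Gemini 2.5 Pro | 4.07 | 4.30 | 2.68 | 2.74 |

Prompt-based Systems — Chinese

| System | Naturalness↑ | Quality↑ | NVV PE↑ | NVV IF↑ |
| --- | --- | --- | --- | --- |
| GPT-4o mini TTS | 2.11 | 2.77 | 0.84 | 0.92 |
| Qwen3-TTS | 3.45 | 3.93 | 2.01 | 1.98 |
| Gemini 2.5 Flash | 3.04 | 3.71 | 2.29 | 2.42 |
| Gemini 2.5 Pro | 3.38 | 3.75 | 2.07 | 2.11 |

Tag-based Systems — English

| System | Naturalness↑ | Quality↑ | NVV PE↑ | NVV Accuracy↑ |
| --- | --- | --- | --- | --- |
| Bark | 2.75 | 2.37 | 2.07 | 2.65 |
| Dia | 3.12 | 2.82 | 2.45 | 2.99 |
| Fish-Speech | 3.53 | 3.62 | 1.01 | 1.04 |
| Higgs-Audio | 3.63 | 3.63 | 2.41 | 2.28 |
| CosyVoice 2 | 3.65 | 3.92 | 2.39 | 2.22 |
| ChatTTS | 3.30 | 3.10 | 3.00 | 3.40 |
| Orpheus TTS | 4.01 | 4.11 | 3.31 | 3.71 |
| ElevenLabs | 4.60 | 4.71 | 3.92 | 4.21 |

Tag-based Systems — Chinese

| System | Naturalness↑ | Quality↑ | NVV PE↑ | NVV Accuracy↑ |
| --- | --- | --- | --- | --- |
| Bark | 2.23 | 2.00 | 1.09 | 1.12 |
| Fish-Speech | 2.71 | 3.09 | 0.10 | 0.13 |
| Orpheus TTS | 3.20 | 3.29 | 2.05 | 2.17 |
| Higgs-Audio | 3.48 | 3.68 | 1.38 | 1.18 |
| CosyVoice 2 | 3.76 | 4.35 | 1.56 | 1.65 |
| ChatTTS | 3.53 | 3.13 | 2.13 | 2.23 |
| ElevenLabs | 4.09 | 4.31 | 3.38 | 3.41 |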

NVV PE: NVV Perceptual Effect. NVV IF: NVV Instruction Following. All scores are on a 5-point scale.

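LLM-based Evaluation

Prompt-based Systems — English

| System | Naturalness↑ | Quality↑ | NVV PE↑ | NVV IF↑ |
| --- | --- | --- | --- | --- |
| Parler-TTS Mini | 1.65 | 3.30 | 0.23 | 0.30 |
| Parler-TTS Large | 1.67 | 3.13 | 0.30 | 0.38 |
| CapSpeech | 1.88 | 3.52 | 0.48 | 0.57 |
| GPT-4o mini TTS | 3.16 | 3.93 | 1.73 | 1.65 |
| Qwen3-TTS | 3.17 | 3.96 | 2.16 | 1.98 |
| Gemini 2.5 Flash | 3.30 | 3.80 | 2.84 | 2.78 |
| Gemini 2.5 Pro | 3.20 | 3.65 | 2.54 | 2.64 |

Prompt-based Systems — Chinese

| System | Naturalness↑ | Quality↑ | NVV PE↑ | NVV IF↑ |
| --- | --- | --- | --- | --- |
| GPT-4o mini TTS | 2.50 | 3.85 | 0.92 | 1.04 |
| Qwen3-TTS | 3.23 | 3.88 | 2.10 | 1.99 |
| Gemini 2.5 Flash | 3.27 | 3.90 | 3.06 | 3.12 |
| Gemini 2.5 Pro | 3.11 | 3.73 | 2.54 | 2.71 |

Tag-based Systems — English

| System | Naturalness↑ | Quality↑ | NVV PE↑ | NVV Accuracy↑ |
| --- | --- | --- | --- | --- |
| Bark | 1.78 | 2.20 | 1.43 | 2.34 |
| Dia | 2.19 | 2.51 | 2.08 | 3.14 |
| Fish-Speech | 1.97 | 3.00 | 0.48 | 0.71 |
| CosyVoice 2 | 2.46 | 3.72 | 1.41 | 1.80 |
| ChatTTS | 2.88 | 3.31 | 3.00 | 3.88 |
| Higgs-Audio | 2.77 | 3.50 | 1.90 | 2.15 |
| Orpheus TTS | 3.68 | 3.76 | 4.03 | 3.97 |
| ElevenLabs | 3.92 | 4.21 | 4.33 | 4.58 |

Tag-based Systems — Chinese

| System | Naturalness↑ | Quality↑ | NVV PE↑ | NVV Accuracy↑ |
| --- | --- | --- | --- | --- |
| Bark | 1.44 | 1.84 | 1.28 | 2.25 |
| Fish-Speech | 2.65 | 3.83 | 1.29 | 1.58 |
| Orpheus TTS | 2.81 | 3.48 | 2.74 | 3.09 |
| CosyVoice 2 | 3.37 | 3.84 | 1.98 | 1.75 |
| Higgs-Audio | 3.10 | 3.37 | 2.79 | 3.24 |
| ChatTTS | 2.69 | 2.75 | 1.94 | 2.88 |
| ElevenLabs | 4.09 | 4.09 | 3.74 | 3.77 |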

LLM-based scores use an automated multi-rater evaluation protocol with a 5-point scale. NVV PE: NVV Perceptual Effect. NVV IF: NVV Instruction Following.
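
The page does not describe how the multi-rater scores are combined. One simple possibility, shown here purely as an assumption, is to average the raters' scores for each utterance and then average over utterances per system. The data layout, the system names, and the function name `aggregate_multi_rater` are all hypothetical.

```python
from statistics import mean


def aggregate_multi_rater(ratings):
    """Average over raters per item, then over items per system.

    ratings: {system: [[scores from each rater for item 1], [item 2], ...]}
    This two-level mean is an assumed aggregation, not the documented protocol.
    """
    summary = {}
    for system, items in ratings.items():
        per_item = [mean(scores) for scores in items if scores]
        summary[system] = round(mean(per_item), 2) if per_item else None
    return summary


# Hypothetical scores from three raters on two utterances per system.
example = {
    "system_a": [[4, 5, 3], [2, 3, 4]],
    "system_b": [[1, 2, 1], [2, 2, 2]],
}
print(aggregate_multi_rater(example))
```

Averaging per item first keeps an utterance with more ratings from outweighing the others, which matters if some raters fail to return a score.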

NVV Perceptual Effect by Type

Per-type NVV Perceptual Effect (PE) (0–5 scale) derived from subjective evaluation. Top panel: tag-based systems sorted by mean PE. Bottom panel: prompt-based systems. White cells = NVV type not supported.

[Figure: English NVV Perceptual Effect heatmap (tag-based top, prompt-based bottom)]
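
A heatmap of this kind can be assembled from per-utterance ratings. The sketch below is a minimal illustration under stated assumptions: the input is a list of hypothetical `(system, nvv_type, pe_score)` records, unsupported types are represented as `None` (the white cells), and systems are sorted by mean PE in descending order, since the page does not say which direction is used.

```python
from collections import defaultdict
from statistics import mean


def build_pe_matrix(records, nvv_types):
    """Return [(system, [mean PE per type or None])], sorted by mean PE.

    records: iterable of (system, nvv_type, pe_score) tuples (hypothetical layout).
    A type with no records for a system is treated as unsupported (None).
    """
    scores = defaultdict(lambda: defaultdict(list))
    for system, nvv_type, pe in records:
        scores[system][nvv_type].append(pe)

    rows = {
        system: [mean(by_type[t]) if t in by_type else None for t in nvv_types]
        for system, by_type in scores.items()
    }

    def row_mean(system):
        # Mean over supported types only; every system has at least one record.
        return mean(v for v in rows[system] if v is not None)

    ordered = sorted(rows, key=row_mean, reverse=True)  # descending is an assumption
    return [(system, rows[system]) for system in ordered]


# Hypothetical ratings for two systems and two NVV types.
records = [
    ("sys_a", "laugh", 4.0),
    ("sys_a", "laugh", 3.0),
    ("sys_a", "sigh", 2.0),
    ("sys_b", "laugh", 4.5),
]
for system, row in build_pe_matrix(records, ["laugh", "sigh"]):
    print(system, row)
```

The resulting rows can be passed to any plotting library, with `None` entries masked so that unsupported types render as blank cells.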

Citation

If you use NVV-SuperBench in your research, please cite:

@article{xue2026nvvsuperbench,
  title   = {NVV-SuperBench: Beyond Words, Beyond Quality—Benchmarking Nonverbal Vocalizations in Speech Generation},
  author  = {Xue, Liumeng and Bian, Weizhen and Pan, Jiahao and Wu, Wenxuan and Ren, Yilin and Kang, Boyi and Hu, Jingbin and Ma, Ziyang and Wang, Shuai and Qian, Xinyuan and others},
  journal = {arXiv preprint arXiv:2604.16211},
  year    = {2026}
}