NVV-SuperBench: Beyond Words, Beyond Quality—Benchmarking Nonverbal Vocalizations in Speech Generation

Accepted as a long paper at Interspeech 2026

Nonverbal vocalizations (NVVs), such as laughing, sighing, and sobbing, are essential for human-like speech, yet standardized evaluation rarely jointly assesses whether systems generate the intended NVVs, place them correctly, and keep them salient without harming speech. We present NVV-SuperBench, a bilingual English/Chinese benchmark for speech generation with NVVs. It provides a unified 45-type taxonomy and a multi-axis protocol beyond conventional speech quality assessment, evaluating NVV-specific controllability, placement, and perceptual salience. We benchmark 15 speech generation systems spanning prompt-based and tag-based control paradigms, using objective metrics, human listening tests, and LLM-based multi-rater evaluation. Results show that NVV controllability often decouples from speech quality, while low-SNR oral cues and long-duration affective NVVs remain bottlenecks. NVV-SuperBench highlights current gaps and supports progress toward more human-like speech generation.

45-type NVV Taxonomy

6 categories and 45 fine-grained types covering respiratory, throat, laughter, crying, emotional, and oral vocalizations.

Bilingual Evaluation Set

4,500 instances in English and Chinese, balanced across all 45 NVV types.

15 Speech Generation Systems

8 tag-based and 7 prompt-based systems, spanning commercial (ElevenLabs, Gemini 2.5, GPT-4o mini TTS) and open-source models (ChatTTS, CosyVoice 2, Bark, Dia, and more).

Multi-Axis Evaluation Protocol

Separates speech quality from NVV-specific controllability, placement accuracy, and salience using objective metrics, human listening tests, and LLM-based multi-rater evaluation.

NVV Taxonomy

Category	NVV Types	Count
Respiratory	breath, inhale, exhale, quick breath, sigh, gasp, panting, wheezing, snore, yawn	10
Throat / Physiological	cough, sneeze, throat clearing, hiccup, sniff, sniffle, snort	7
Laughter Spectrum	chuckle, giggle, laugh, laugh harder, start laughing, stifled laugh, burst of laughter	7
Crying Spectrum	crying, sobbing, crying loudly, wail, whimper	5
Emotional Vocalizations	hum, humming, groan, moan, grunt, mumble, exclamation (ah, oh, hmm)	7
Oral / Miscellaneous	lipsmack, gulp, swallow, burp, tsk, sss, clucking, hissing, whisper	9
Total		45
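
The taxonomy can be held as a simple mapping from category to NVV types, which makes the per-category counts and the total of 45 easy to check. The sketch below is illustrative only: the type names are as read from the table above, and the names `NVV_TAXONOMY`, `category_counts`, and `category_of` are hypothetical, not part of any released NVV-SuperBench code.

```python
# Taxonomy transcribed from the table above: category -> list of NVV types.
# The variable and function names are hypothetical, chosen for illustration.
NVV_TAXONOMY = {
    "Respiratory": [
        "breath", "inhale", "exhale", "quick breath", "sigh",
        "gasp", "panting", "wheezing", "snore", "yawn",
    ],
    "Throat / Physiological": [
        "cough", "sneeze", "throat clearing", "hiccup",
        "sniff", "sniffle", "snort",
    ],
    "Laughter Spectrum": [
        "chuckle", "giggle", "laugh", "laugh harder",
        "start laughing", "stifled laugh", "burst of laughter",
    ],
    "Crying Spectrum": ["crying", "sobbing", "crying loudly", "wail", "whimper"],
    "Emotional Vocalizations": [
        "hum", "humming", "groan", "moan", "grunt", "mumble",
        "exclamation (ah, oh, hmm)",
    ],
    "Oral / Miscellaneous": [
        "lipsmack", "gulp", "swallow", "burp", "tsk",
        "sss", "clucking", "hissing", "whisper",
    ],
}


def category_counts(taxonomy):
    """Return the number of NVV types in each category."""
    return {category: len(types) for category, types in taxonomy.items()}


def category_of(nvv_type, taxonomy=NVV_TAXONOMY):
    """Look up the category of an NVV type; raise KeyError if it is unknown."""
    for category, types in taxonomy.items():
        if nvv_type in types:
            return category
    raise KeyError(nvv_type)


counts = category_counts(NVV_TAXONOMY)
total_types = sum(counts.values())
print(counts)
print("total types:", total_types)
print("sigh ->", category_of("sigh"))
```

A flat mapping like this is enough for balancing an evaluation set by type or for grouping scores by category.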

Evaluation Results

Results across objective metrics, subjective listening tests, and LLM-based evaluation for 15 TTS systems.

Tag-based Systems (8)

ChatTTS (open), Higgs-Audio (open), Bark (open), Fish-Speech (open), Orpheus TTS (open), CosyVoice 2 (open), Dia (open), ElevenLabs (commercial)

Prompt-based Systems (7)

Gemini 2.5 Pro (commercial), Gemini 2.5 Flash (commercial), GPT-4o mini TTS (commercial), Qwen3-TTS (open), CapSpeech (open), Parler-TTS Mini (open), Parler-TTS Large (open)

Within each language & system-type block: bold = best | underline = second-best

Objective Metrics

Prompt-based Systems — English

System	WER↓	DNSMOS SIG↑	DNSMOS BAK↑	DNSMOS OVRL↑	CLAP Score↑
Parler-TTS Mini	6.25	3.48	4.03	3.20	0.34
Parler-TTS Large	9.30	3.45	4.00	3.15	0.35
CapSpeech	4.56	3.46	3.96	3.16	0.40
Qwen3-TTS	2.06	3.56	4.07	3.30	0.45
GPT-4o mini TTS	4.81	3.59	4.14	3.35	0.44
Gemini 2.5 Flash	58.80	3.57	4.02	3.29	0.42
Gemini 2.5 Pro	5.40	3.52	4.00	3.23	0.41
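
The best and second-best entries in a column depend on the direction of the arrow in its header (↓ lower is better, ↑ higher is better). As a minimal sketch, the helper below recovers both for the WER and CLAP Score columns of the table above. The function name `best_and_second` is hypothetical, and ties are simply resolved by sort order, which may differ from how the original page marks them.

```python
def best_and_second(scores, higher_is_better):
    """Return the names of the best and second-best systems for one column.

    scores: dict mapping system name -> metric value.
    higher_is_better: True for columns marked with an up arrow, False for down.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=higher_is_better)
    return ranked[0][0], ranked[1][0]


# Values copied from the prompt-based English table above.
wer_en = {
    "Parler-TTS Mini": 6.25, "Parler-TTS Large": 9.30, "CapSpeech": 4.56,
    "Qwen3-TTS": 2.06, "GPT-4o mini TTS": 4.81,
    "Gemini 2.5 Flash": 58.80, "Gemini 2.5 Pro": 5.40,
}
clap_en = {
    "Parler-TTS Mini": 0.34, "Parler-TTS Large": 0.35, "CapSpeech": 0.40,
    "Qwen3-TTS": 0.45, "GPT-4o mini TTS": 0.44,
    "Gemini 2.5 Flash": 0.42, "Gemini 2.5 Pro": 0.41,
}

print("WER  (lower is better): ", best_and_second(wer_en, higher_is_better=False))
print("CLAP (higher is better):", best_and_second(clap_en, higher_is_better=True))
```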

Prompt-based Systems — Chinese

System	CER↓	DNSMOS SIG↑	DNSMOS BAK↑	DNSMOS OVRL↑	CLAP Score↑
Qwen3-TTS	4.08	3.50	3.98	3.19	0.39
GPT-4o mini TTS	4.67	3.64	4.16	3.40	0.43
Gemini 2.5 Flash	16.45	3.57	3.97	3.25	0.41
Gemini 2.5 Pro	7.68	3.55	4.01	3.26	0.42

Tag-based Systems — English

System	WER↓	DNSMOS SIG↑	DNSMOS BAK↑	DNSMOS OVRL↑	Coverage↑	Precision↑	Recall↑	F1↑	NTD↓
Bark	14.73	3.00	3.08	2.52	0.11	0.614	0.700	0.654	0.0037
Higgs-Audio	9.41	3.55	3.96	3.23	0.09	0.360	0.407	0.382	0.0111
ChatTTS	5.58	3.67	4.12	3.42	0.02	0.652	0.680	0.664	0.0028
Fish-Speech	5.65	3.55	4.06	3.28	0.16	0.447	0.418	0.432	0.0157
Dia	21.95	2.94	2.71	2.27	0.29	0.574	0.705	0.632	0.0052
CosyVoice 2	3.82	3.67	4.19	3.45	0.18	0.475	0.451	0.463	0.0159
Orpheus TTS	4.98	3.62	4.11	3.34	0.18	0.687	0.774	0.728	0.0031
ElevenLabs	2.31	3.59	4.06	3.32	0.27	0.664	0.787	0.720	0.0091

Tag-based Systems — Chinese

System	CER↓	DNSMOS SIG↑	DNSMOS BAK↑	DNSMOS OVRL↑	Coverage↑	Precision↑	Recall↑	F1↑	NTD↓
Bark	41.86	2.99	3.10	2.54	0.11	0.572	0.605	0.588	0.0141
Orpheus TTS	18.78	3.60	4.11	3.30	0.18	0.585	0.671	0.625	0.0159
Higgs-Audio	5.98	3.52	3.91	3.17	0.09	0.429	0.369	0.396	0.0186
Fish-Speech	9.25	3.49	4.00	3.19	0.16	0.592	0.605	0.598	0.0397
ChatTTS	4.52	3.49	3.50	2.94	0.02	0.644	0.773	0.703	0.0237
CosyVoice 2	6.27	3.63	4.17	3.40	0.18	0.515	0.480	0.496	0.0383
ElevenLabs	4.13	3.61	4.03	3.32	0.27	0.630	0.750	0.684	0.0246
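
Two of the objective columns follow standard definitions that can be sketched briefly. F1 is the harmonic mean of precision and recall, and WER/CER is the token-level edit distance divided by the reference length. The code below is a sketch under assumptions, not the benchmark's implementation: it assumes the WER/CER columns are percentages, applies only lowercasing and whitespace handling as text normalization (the page does not state the normalization used), and uses hypothetical function names. For the Chinese tag-based rows above, the harmonic mean of the reported precision and recall agrees with the reported F1 to within 0.002. The small gaps are consistent with rounding, although per-item averaging is also possible.

```python
def f1_from_pr(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance over reference length, as a percentage."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def wer(reference, hypothesis):
    """Word error rate with naive normalization (lowercase, whitespace split)."""
    return error_rate(reference.lower().split(), hypothesis.lower().split())


def cer(reference, hypothesis):
    """Character error rate, ignoring spaces."""
    return error_rate(list(reference.replace(" ", "")),
                      list(hypothesis.replace(" ", "")))


# (precision, recall, reported F1) copied from the Chinese tag-based table above.
TAG_ZH = {
    "Bark": (0.572, 0.605, 0.588),
    "Orpheus TTS": (0.585, 0.671, 0.625),
    "Higgs-Audio": (0.429, 0.369, 0.396),
    "Fish-Speech": (0.592, 0.605, 0.598),
    "ChatTTS": (0.644, 0.773, 0.703),
    "CosyVoice 2": (0.515, 0.480, 0.496),
    "ElevenLabs": (0.630, 0.750, 0.684),
}

for system, (p, r, reported) in TAG_ZH.items():
    print(f"{system}: computed F1 {f1_from_pr(p, r):.3f}, reported {reported:.3f}")

# Toy transcripts (invented for illustration, not benchmark data).
print("WER:", wer("she laughed and then left", "she laughed then left"))
print("CER:", cer("他笑了", "他哭了"))
```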

DNSMOS SIG/BAK/OVRL: DNSMOS P.835 metrics for speech signal quality, background noise intrusiveness, and overall quality (higher is better). NTD: Normalized Tag Distance (lower is better).

Human Listening Tests

Prompt-based Systems — English

System	Naturalness↑	Quality↑	NVV PE↑	NVV IF↑
Parler-TTS Large	2.69	3.35	0.93	0.99
Parler-TTS Mini	2.62	3.40	0.85	0.92
CapSpeech	2.73	3.41	1.01	1.11
GPT-4o mini	3.33	3.56	1.74	1.89
Qwen3-TTS	3.44	3.86	2.03	2.15
Gemini 2.5 Flash	4.00	4.28	2.60	2.67
Gemini 2.5 Pro	4.07	4.30	2.68	2.74

Prompt-based Systems — Chinese

System	Naturalness↑	Quality↑	NVV PE↑	NVV IF↑
GPT-4o mini	2.11	2.77	0.84	0.92
Qwen3-TTS	3.45	3.93	2.01	1.98
Gemini 2.5 Flash	3.04	3.71	2.29	2.42
Gemini 2.5 Pro	3.38	3.75	2.07	2.11

Tag-based Systems — English

System	Naturalness↑	Quality↑	NVV PE↑	NVV Accuracy↑
Bark	2.75	2.37	2.07	2.65
Dia	3.12	2.82	2.45	2.99
Fish-Speech	3.53	3.62	1.01	1.04
Higgs-Audio	3.63	3.63	2.41	2.28
CosyVoice 2	3.65	3.92	2.39	2.22
ChatTTS	3.30	3.10	3.00	3.40
Orpheus TTS	4.01	4.11	3.31	3.71
ElevenLabs	4.60	4.71	3.92	4.21

Tag-based Systems — Chinese

System	Naturalness↑	Quality↑	NVV PE↑	NVV Accuracy↑
Bark	2.23	2.00	1.09	1.12
Fish-Speech	2.71	3.09	0.10	0.13
Orpheus TTS	3.20	3.29	2.05	2.17
Higgs-Audio	3.48	3.68	1.38	1.18
CosyVoice 2	3.76	4.35	1.56	1.65
ChatTTS	3.53	3.13	2.13	2.23
ElevenLabs	4.09	4.31	3.38	3.41

NVV PE: NVV Perceptual Effect. NVV IF: NVV Instruction Following. All scores are on a 5-point scale.

LLM-based Evaluation

Prompt-based Systems — English

System	Naturalness↑	Quality↑	NVV PE↑	NVV IF↑
Parler-TTS Mini	1.65	3.30	0.23	0.30
Parler-TTS Large	1.67	3.13	0.30	0.38
CapSpeech	1.88	3.52	0.48	0.57
GPT-4o mini	3.16	3.93	1.73	1.65
Qwen3-TTS	3.17	3.96	2.16	1.98
Gemini 2.5 Flash	3.30	3.80	2.84	2.78
Gemini 2.5 Pro	3.20	3.65	2.54	2.64

Prompt-based Systems — Chinese

System	Naturalness↑	Quality↑	NVV PE↑	NVV IF↑
GPT-4o mini	2.50	3.85	0.92	1.04
Qwen3-TTS	3.23	3.88	2.10	1.99
Gemini 2.5 Flash	3.27	3.90	3.06	3.12
Gemini 2.5 Pro	3.11	3.73	2.54	2.71

Tag-based Systems — English

System	Naturalness↑	Quality↑	NVV PE↑	NVV Accuracy↑
Bark	1.78	2.20	1.43	2.34
Dia	2.19	2.51	2.08	3.14
Fish-Speech	1.97	3.00	0.48	0.71
CosyVoice 2	2.46	3.72	1.41	1.80
ChatTTS	2.88	3.31	3.00	3.88
Higgs-Audio	2.77	3.50	1.90	2.15
Orpheus TTS	3.68	3.76	4.03	3.97
ElevenLabs	3.92	4.21	4.33	4.58

Tag-based Systems — Chinese

System	Naturalness↑	Quality↑	NVV PE↑	NVV Accuracy↑
Bark	1.44	1.84	1.28	2.25
Fish-Speech	2.65	3.83	1.29	1.58
Orpheus TTS	2.81	3.48	2.74	3.09
CosyVoice 2	3.37	3.84	1.98	1.75
Higgs-Audio	3.10	3.37	2.79	3.24
ChatTTS	2.69	2.75	1.94	2.88
ElevenLabs	4.09	4.09	3.74	3.77

LLM-based scores use an automated multi-rater evaluation protocol with a 5-point scale. NVV PE: NVV Perceptual Effect. NVV IF: NVV Instruction Following.
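
The page does not describe how scores from multiple raters are combined. One plausible scheme is sketched below: average the raters for each utterance, then average over utterances to get a system-level score, and report the mean per-utterance spread as a rough agreement indicator. The aggregation rule, the function name `aggregate_ratings`, and the sample ratings are all assumptions made for illustration, not the protocol used in NVV-SuperBench.

```python
from statistics import mean, pstdev


def aggregate_ratings(ratings):
    """Combine multi-rater scores into one system-level score.

    ratings: dict mapping utterance id -> list of scores, one per rater.
    Returns the mean of per-utterance means, the mean per-utterance
    standard deviation across raters, and the per-utterance means.
    """
    per_item = {item: mean(scores) for item, scores in ratings.items()}
    spread = {item: pstdev(scores) for item, scores in ratings.items()}
    return {
        "system_score": mean(per_item.values()),
        "mean_rater_spread": mean(spread.values()),
        "per_item": per_item,
    }


# Hypothetical ratings from three raters on two utterances (not benchmark data).
example = {
    "utt_001": [3.0, 4.0, 5.0],
    "utt_002": [1.0, 1.0, 1.0],
}
result = aggregate_ratings(example)
print(result["system_score"], result["mean_rater_spread"])
```

Averaging per utterance first keeps utterances with more ratings from dominating the system score. A large spread flags items where raters disagree and the mean is less trustworthy.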
