Interspeech 2026 — Anonymous Submission

NVBench: A Benchmark for Speech Synthesis
with Non-Verbal Vocalizations

Non-verbal vocalizations (NVVs) such as laughs, sighs, and sobs are essential for human-like speech, yet existing TTS benchmarks rarely test whether a system can generate the intended NVV, place it correctly, and keep it salient without harming speech. We present the Non-Verbal Vocalization Benchmark (NVBench), a bilingual (English/Chinese) benchmark for evaluating speech synthesis with NVVs. NVBench pairs a unified 45-type taxonomy with a curated bilingual dataset and introduces a multi-axis protocol that separates general speech naturalness and quality from NVV-specific controllability, placement, and salience. We benchmark 15 TTS systems using objective metrics, listening tests, and an LLM-based multi-rater evaluation. The evaluation code and dataset are available in our GitHub repository. Results reveal that NVV controllability often decouples from quality, while low-SNR oral cues and long-duration affective NVVs remain persistent bottlenecks.

45-Type NVV Taxonomy

A unified taxonomy with 6 categories and 45 fine-grained NVV types, spanning respiratory, throat, laughter, crying, emotional, and oral vocalizations.

Bilingual Evaluation Set

4,500 high-quality instances in English and Chinese with balanced coverage across all 45 NVV types and two synthesis paradigms.

Multi-Axis Evaluation Protocol

Disentangles general speech quality from NVV-specific controllability, placement accuracy, and salience across 15 TTS systems.

Category	NVV Types	Count
Respiratory	breath, inhale, exhale, quick breath, sigh, gasp, panting, wheezing, snore, yawn	10
Throat / Physiological	cough, sneeze, throat clearing, hiccup, sniff, sniffle, snort	7
Laughter Spectrum	chuckle, giggle, laugh, laugh harder, start laughing, stifled laugh, burst of laughter	7
Crying Spectrum	crying, sobbing, crying loudly, wail, whimper	5
Emotional Vocalizations	hum, humming, groan, moan, grunt, mumble, exclamation (ah, oh, hmm)	7
Oral / Miscellaneous	lipsmack, gulp, swallow, burp, tsk, sss, clucking, hissing, whisper	9
Total		45
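
For readers who want to work with the taxonomy programmatically, the table can be written as a plain mapping from category to NVV types. This is a minimal sketch: the names NVV_TAXONOMY and category_of are illustrative assumptions and are not taken from the released code.

```python
# Illustrative encoding of the taxonomy table. The variable and function
# names are assumptions, not identifiers from the NVBench repository.
NVV_TAXONOMY = {
    "Respiratory": [
        "breath", "inhale", "exhale", "quick breath", "sigh",
        "gasp", "panting", "wheezing", "snore", "yawn",
    ],
    "Throat / Physiological": [
        "cough", "sneeze", "throat clearing", "hiccup",
        "sniff", "sniffle", "snort",
    ],
    "Laughter Spectrum": [
        "chuckle", "giggle", "laugh", "laugh harder",
        "start laughing", "stifled laugh", "burst of laughter",
    ],
    "Crying Spectrum": [
        "crying", "sobbing", "crying loudly", "wail", "whimper",
    ],
    "Emotional Vocalizations": [
        "hum", "humming", "groan", "moan", "grunt", "mumble",
        "exclamation (ah, oh, hmm)",
    ],
    "Oral / Miscellaneous": [
        "lipsmack", "gulp", "swallow", "burp", "tsk", "sss",
        "clucking", "hissing", "whisper",
    ],
}


def category_of(nvv_type: str) -> str:
    """Return the category that contains the given NVV type."""
    for category, types in NVV_TAXONOMY.items():
        if nvv_type in types:
            return category
    raise KeyError(f"unknown NVV type: {nvv_type}")


# Sanity check against the "Total" row of the table.
total_types = sum(len(types) for types in NVV_TAXONOMY.values())
```

Calling category_of("sigh") looks up the category of a single type, and total_types can be compared against the Total row as a quick consistency check.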

Evaluation Results

Results across objective metrics, subjective listening tests, and LLM-based evaluation for 15 TTS systems.

Within each language & system-type block: bold = best | underline = second-best

Prompt-based Systems — English

System	WER↓	SIG↑	BAK↑	OVRL↑	CLAP Score↑
Parler-TTS Mini	6.25	3.48	4.03	3.20	0.34
Parler-TTS Large	9.30	3.45	4.00	3.15	0.35
CapSpeech	4.56	3.46	3.96	3.16	0.40
Qwen3-TTS	2.06	3.56	4.07	3.30	0.45
GPT-4o mini TTS	4.81	3.59	4.14	3.35	0.44
Gemini 2.5 Flash	58.80	3.57	4.02	3.29	0.42
Gemini 2.5 Pro	5.40	3.52	4.00	3.23	0.41

Prompt-based Systems — Chinese

System	CER↓	SIG↑	BAK↑	OVRL↑	CLAP Score↑
Qwen3-TTS	4.08	3.50	3.98	3.19	0.39
GPT-4o mini TTS	4.67	3.64	4.16	3.40	0.43
Gemini 2.5 Flash	16.45	3.57	3.97	3.25	0.41
Gemini 2.5 Pro	7.68	3.55	4.01	3.26	0.42

Tag-based Systems — English

System	WER↓	SIG↑	BAK↑	OVRL↑	Coverage↑	Precision↑	Recall↑	F1↑	NTD↓
Bark	14.73	3.00	3.08	2.52	0.11	0.614	0.700	0.654	0.0037
Higgs-Audio	9.41	3.55	3.96	3.23	0.09	0.360	0.407	0.382	0.0111
ChatTTS	5.58	3.67	4.12	3.42	0.02	0.652	0.680	0.664	0.0028
Fish-Speech	5.65	3.55	4.06	3.28	0.16	0.447	0.418	0.432	0.0157
Dia	21.95	2.94	2.71	2.27	0.29	0.574	0.705	0.632	0.0052
CosyVoice 2	3.82	3.67	4.19	3.45	0.18	0.475	0.451	0.463	0.0159
Orpheus TTS	4.98	3.62	4.11	3.34	0.18	0.687	0.774	0.728	0.0031
ElevenLabs	2.31	3.59	4.06	3.32	0.27	0.664	0.787	0.720	0.0091

Tag-based Systems — Chinese

System	CER↓	SIG↑	BAK↑	OVRL↑	Coverage↑	Precision↑	Recall↑	F1↑	NTD↓
Bark	41.86	2.99	3.10	2.54	0.11	0.572	0.605	0.588	0.0141
Orpheus TTS	18.78	3.60	4.11	3.30	0.18	0.585	0.671	0.625	0.0159
Higgs-Audio	5.98	3.52	3.91	3.17	0.09	0.429	0.369	0.396	0.0186
Fish-Speech	9.25	3.49	4.00	3.19	0.16	0.592	0.605	0.598	0.0397
ChatTTS	4.52	3.49	3.50	2.94	0.02	0.644	0.773	0.703	0.0237
CosyVoice 2	6.27	3.63	4.17	3.40	0.18	0.515	0.480	0.496	0.0383
ElevenLabs	4.13	3.61	4.03	3.32	0.27	0.630	0.750	0.684	0.0246

SIG/BAK/OVRL: DNSMOS metrics (higher is better). NTD: Normalized Tag Distance (lower is better).
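
As a reading aid for the Precision, Recall, and F1 columns, the sketch below recomputes F1 as the harmonic mean of the reported precision and recall for the English tag-based systems. It assumes the standard F1 definition; how the benchmark aggregates over NVV types (pooled counts versus per-type averaging) is not stated here, so small gaps from the reported values are expected.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the English tag-based table.
ENGLISH_TAG_ROWS = {
    "Bark": (0.614, 0.700, 0.654),
    "Higgs-Audio": (0.360, 0.407, 0.382),
    "ChatTTS": (0.652, 0.680, 0.664),
    "Fish-Speech": (0.447, 0.418, 0.432),
    "Dia": (0.574, 0.705, 0.632),
    "CosyVoice 2": (0.475, 0.451, 0.463),
    "Orpheus TTS": (0.687, 0.774, 0.728),
    "ElevenLabs": (0.664, 0.787, 0.720),
}

# Largest absolute gap between the recomputed and the reported F1.
max_gap = max(
    abs(f1_score(p, r) - reported)
    for p, r, reported in ENGLISH_TAG_ROWS.values()
)
```

Because the inputs are already rounded to three decimals, max_gap is a rough consistency check on the table rather than an exact reproduction of the benchmark's computation.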

Prompt-based Systems — English

System	Naturalness↑	Quality↑	NVV PE↑	NVV IF↑
Parler-TTS Large	2.69	3.35	0.93	0.99
Parler-TTS Mini	2.62	3.40	0.85	0.92
CapSpeech	2.73	3.41	1.01	1.11
GPT-4o mini	3.33	3.56	1.74	1.89
Qwen3-TTS	3.44	3.86	2.03	2.15
Gemini 2.5 Flash	4.00	4.28	2.60	2.67
Gemini 2.5 Pro	4.07	4.30	2.68	2.74

Prompt-based Systems — Chinese

System	Naturalness↑	Quality↑	NVV PE↑	NVV IF↑
GPT-4o mini	2.11	2.77	0.84	0.92
Qwen3-TTS	3.45	3.93	2.01	1.98
Gemini 2.5 Flash	3.04	3.71	2.29	2.42
Gemini 2.5 Pro	3.38	3.75	2.07	2.11

Tag-based Systems — English

System	Naturalness↑	Quality↑	NVV PE↑	NVV Accuracy↑
Bark	2.75	2.37	2.07	2.65
Dia	3.12	2.82	2.45	2.99
Fish-Speech	3.53	3.62	1.01	1.04
Higgs-Audio	3.63	3.63	2.41	2.28
CosyVoice 2	3.65	3.92	2.39	2.22
ChatTTS	3.30	3.10	3.00	3.40
Orpheus TTS	4.01	4.11	3.31	3.71
ElevenLabs	4.60	4.71	3.92	4.21

Tag-based Systems — Chinese

System	Naturalness↑	Quality↑	NVV PE↑	NVV Accuracy↑
Bark	2.23	2.00	1.09	1.12
Fish-Speech	2.71	3.09	0.10	0.13
Orpheus TTS	3.20	3.29	2.05	2.17
Higgs-Audio	3.48	3.68	1.38	1.18
CosyVoice 2	3.76	4.35	1.56	1.65
ChatTTS	3.53	3.13	2.13	2.23
ElevenLabs	4.09	4.31	3.38	3.41

Scores in this set of tables are from the subjective listening tests. NVV PE: NVV Perceptual Effect. NVV IF: NVV Instruction Following. All scores are on a 5-point scale.

Prompt-based Systems — English

System	Naturalness↑	Quality↑	NVV PE↑	NVV IF↑
Parler-TTS Mini	1.65	3.30	0.23	0.30
Parler-TTS Large	1.67	3.13	0.30	0.38
CapSpeech	1.88	3.52	0.48	0.57
GPT-4o mini	3.16	3.93	1.73	1.65
Qwen3-TTS	3.17	3.96	2.16	1.98
Gemini 2.5 Flash	3.30	3.80	2.84	2.78
Gemini 2.5 Pro	3.20	3.65	2.54	2.64

Prompt-based Systems — Chinese

System	Naturalness↑	Quality↑	NVV PE↑	NVV IF↑
GPT-4o mini	2.50	3.85	0.92	1.04
Qwen3-TTS	3.23	3.88	2.10	1.99
Gemini 2.5 Flash	3.27	3.90	3.06	3.12
Gemini 2.5 Pro	3.11	3.73	2.54	2.71

Tag-based Systems — English

System	Naturalness↑	Quality↑	NVV PE↑	NVV Accuracy↑
Bark	1.78	2.20	1.43	2.34
Dia	2.19	2.51	2.08	3.14
Fish-Speech	1.97	3.00	0.48	0.71
CosyVoice 2	2.46	3.72	1.41	1.80
ChatTTS	2.88	3.31	3.00	3.88
Higgs-Audio	2.77	3.50	1.90	2.15
Orpheus TTS	3.68	3.76	4.03	3.97
ElevenLabs	3.92	4.21	4.33	4.58

Tag-based Systems — Chinese

System	Naturalness↑	Quality↑	NVV PE↑	NVV Accuracy↑
Bark	1.44	1.84	1.28	2.25
Fish-Speech	2.65	3.83	1.29	1.58
Orpheus TTS	2.81	3.48	2.74	3.09
CosyVoice 2	3.37	3.84	1.98	1.75
Higgs-Audio	3.10	3.37	2.79	3.24
ChatTTS	2.69	2.75	1.94	2.88
ElevenLabs	4.09	4.09	3.74	3.77

LLM-based scores use an automated multi-rater evaluation protocol with a 5-point scale. NVV PE: NVV Perceptual Effect. NVV IF: NVV Instruction Following.
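
One way to see how closely the LLM-based scores track the other 5-point ratings is a rank correlation over systems. The sketch below computes Spearman's rho for the NVV PE column of the English tag-based systems, pairing the preceding set of 5-point tables (which, going by the order in the results summary, appear to be the listening tests) with the LLM-based tables. This comparison is an illustration and not an analysis reported by the authors; the function names are hypothetical.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward. Assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation using the no-ties closed form."""
    n = len(xs)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# NVV PE for English tag-based systems:
# (earlier 5-point table, LLM-based table), copied from the results above.
NVV_PE = {
    "Bark": (2.07, 1.43),
    "Dia": (2.45, 2.08),
    "Fish-Speech": (1.01, 0.48),
    "Higgs-Audio": (2.41, 1.90),
    "CosyVoice 2": (2.39, 1.41),
    "ChatTTS": (3.00, 3.00),
    "Orpheus TTS": (3.31, 4.03),
    "ElevenLabs": (3.92, 4.33),
}

first_scores = [pair[0] for pair in NVV_PE.values()]
llm_scores = [pair[1] for pair in NVV_PE.values()]
rho = spearman(first_scores, llm_scores)
```

With only eight systems the coefficient is a coarse indicator. Here the two orderings differ only in the relative position of Bark and CosyVoice 2.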
