NVBench: A Benchmark for Speech Synthesis
with Non-Verbal Vocalizations
Anonymous submission to Interspeech 2026
Non-verbal vocalizations (NVVs) such as laughter, sighs, and sobs are essential for human-like speech, yet existing TTS benchmarks rarely test whether systems can generate the intended NVVs, place them correctly, and keep them salient without degrading speech quality. We present the Non-verbal Vocalization Benchmark (NVBench), a bilingual (English/Chinese) benchmark that evaluates speech synthesis with NVVs. NVBench pairs a unified 45-type taxonomy with a curated bilingual dataset and introduces a multi-axis protocol that separates general speech naturalness and quality from NVV-specific controllability, placement, and salience. We benchmark 15 TTS systems using objective metrics, listening tests, and an LLM-based multi-rater evaluation. The evaluation code and dataset are available in our GitHub repository. Results reveal that NVV controllability often decouples from speech quality, while low-SNR oral cues and long-duration affective NVVs remain persistent bottlenecks.
🌏 45-Type NVV Taxonomy
A unified taxonomy with 6 categories and 45 fine-grained NVV types, spanning respiratory, throat, laughter, crying, emotional, and oral vocalizations.
📊 Bilingual Evaluation Set
4,500 high-quality instances in English and Chinese with balanced coverage across all 45 NVV types and two synthesis paradigms.
🔎 Multi-Axis Evaluation Protocol
Disentangles general speech quality from NVV-specific controllability, placement accuracy, and salience across 15 TTS systems.
Listen to system outputs for all available NVV types across both languages (EN/ZH) and two synthesis paradigms (tag-based / prompt-based), with 3 samples per type whenever available.
🏷️ Tag-based: the NVV type is inserted as an inline tag marker (e.g., <laugh>) in the text.
💬 Prompt-based: a natural-language caption describes the desired NVV behavior.
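The two input paradigms above can be sketched as follows. This is an illustrative, hypothetical construction (the sentence, caption wording, and field names are made up for the example; only the `<laugh>` tag format appears in the benchmark description):

```python
# Illustrative (hypothetical) inputs for the two synthesis paradigms.
text = "I can't believe you did that!"

# Tag-based: the NVV type appears as an inline marker at the desired position.
tag_based_input = f"{text} <laugh>"

# Prompt-based: a natural-language caption describes the desired NVV behavior.
prompt_based_input = {
    "text": text,
    "caption": "A cheerful voice that bursts into laughter after the sentence.",
}

print(tag_based_input)  # I can't believe you did that! <laugh>
```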
Evaluation Results
Results across objective metrics, subjective listening tests, and LLM-based evaluation for 15 TTS systems.
Objective Metrics: Prompt-based Systems — English
| System | WER↓ | SIG↑ | BAK↑ | OVRL↑ | CLAP Score↑ |
|---|---|---|---|---|---|
| Parler-TTS Mini | 6.25 | 3.48 | 4.03 | 3.20 | 0.34 |
| Parler-TTS Large | 9.30 | 3.45 | 4.00 | 3.15 | 0.35 |
| CapSpeech | 4.56 | 3.46 | 3.96 | 3.16 | 0.40 |
| Qwen3-TTS | 2.06 | 3.56 | 4.07 | 3.30 | 0.45 |
| GPT-4o mini TTS | 4.81 | 3.59 | 4.14 | 3.35 | 0.44 |
| Gemini 2.5 Flash | 58.80 | 3.57 | 4.02 | 3.29 | 0.42 |
| Gemini 2.5 Pro | 5.40 | 3.52 | 4.00 | 3.23 | 0.41 |
Objective Metrics: Prompt-based Systems — Chinese
| System | CER↓ | SIG↑ | BAK↑ | OVRL↑ | CLAP Score↑ |
|---|---|---|---|---|---|
| Qwen3-TTS | 4.08 | 3.50 | 3.98 | 3.19 | 0.39 |
| GPT-4o mini TTS | 4.67 | 3.64 | 4.16 | 3.40 | 0.43 |
| Gemini 2.5 Flash | 16.45 | 3.57 | 3.97 | 3.25 | 0.41 |
| Gemini 2.5 Pro | 7.68 | 3.55 | 4.01 | 3.26 | 0.42 |
Objective Metrics: Tag-based Systems — English
| System | WER↓ | SIG↑ | BAK↑ | OVRL↑ | Coverage↑ | Precision↑ | Recall↑ | F1↑ | NTD↓ |
|---|---|---|---|---|---|---|---|---|---|
| Bark | 14.73 | 3.00 | 3.08 | 2.52 | 0.11 | 0.614 | 0.700 | 0.654 | 0.0037 |
| Higgs-Audio | 9.41 | 3.55 | 3.96 | 3.23 | 0.09 | 0.360 | 0.407 | 0.382 | 0.0111 |
| ChatTTS | 5.58 | 3.67 | 4.12 | 3.42 | 0.02 | 0.652 | 0.680 | 0.664 | 0.0028 |
| Fish-Speech | 5.65 | 3.55 | 4.06 | 3.28 | 0.16 | 0.447 | 0.418 | 0.432 | 0.0157 |
| Dia | 21.95 | 2.94 | 2.71 | 2.27 | 0.29 | 0.574 | 0.705 | 0.632 | 0.0052 |
| CosyVoice 2 | 3.82 | 3.67 | 4.19 | 3.45 | 0.18 | 0.475 | 0.451 | 0.463 | 0.0159 |
| Orpheus TTS | 4.98 | 3.62 | 4.11 | 3.34 | 0.18 | 0.687 | 0.774 | 0.728 | 0.0031 |
| ElevenLabs | 2.31 | 3.59 | 4.06 | 3.32 | 0.27 | 0.664 | 0.787 | 0.720 | 0.0091 |
Objective Metrics: Tag-based Systems — Chinese
| System | CER↓ | SIG↑ | BAK↑ | OVRL↑ | Coverage↑ | Precision↑ | Recall↑ | F1↑ | NTD↓ |
|---|---|---|---|---|---|---|---|---|---|
| Bark | 41.86 | 2.99 | 3.10 | 2.54 | 0.11 | 0.572 | 0.605 | 0.588 | 0.0141 |
| Orpheus TTS | 18.78 | 3.60 | 4.11 | 3.30 | 0.18 | 0.585 | 0.671 | 0.625 | 0.0159 |
| Higgs-Audio | 5.98 | 3.52 | 3.91 | 3.17 | 0.09 | 0.429 | 0.369 | 0.396 | 0.0186 |
| Fish-Speech | 9.25 | 3.49 | 4.00 | 3.19 | 0.16 | 0.592 | 0.605 | 0.598 | 0.0397 |
| ChatTTS | 4.52 | 3.49 | 3.50 | 2.94 | 0.02 | 0.644 | 0.773 | 0.703 | 0.0237 |
| CosyVoice 2 | 6.27 | 3.63 | 4.17 | 3.40 | 0.18 | 0.515 | 0.480 | 0.496 | 0.0383 |
| ElevenLabs | 4.13 | 3.61 | 4.03 | 3.32 | 0.27 | 0.630 | 0.750 | 0.684 | 0.0246 |
SIG/BAK/OVRL: DNSMOS metrics (higher is better). NTD: Normalized Tag Distance (lower is better).
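The tag-level metrics above can be sketched as follows. This is a hedged illustration, not NVBench's official implementation: the multiset matching rule for precision/recall/F1 and the NTD formula (mean absolute offset between predicted and reference tag positions, normalized by utterance duration) are assumptions made for this example.

```python
# Hedged sketch of the tag-level metrics; the matching rule and the NTD
# definition below are assumptions, not the benchmark's official code.

def tag_metrics(pred_tags, ref_tags):
    """Precision/recall/F1 over NVV tag labels, matched as multisets.
    pred_tags: labels detected in the synthesized audio;
    ref_tags: labels requested in the input text."""
    matched = 0
    ref_pool = list(ref_tags)
    for t in pred_tags:
        if t in ref_pool:
            ref_pool.remove(t)
            matched += 1
    precision = matched / len(pred_tags) if pred_tags else 0.0
    recall = matched / len(ref_tags) if ref_tags else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def normalized_tag_distance(pred_pos, ref_pos, duration):
    """Assumed NTD: mean |predicted - reference| tag position (seconds),
    normalized by utterance duration, so lower is better."""
    offsets = [abs(p - r) for p, r in zip(pred_pos, ref_pos)]
    return sum(offsets) / (len(offsets) * duration) if offsets else 0.0

p, r, f = tag_metrics(["laugh", "sigh"], ["laugh", "sob"])
print(p, r, f)  # 0.5 0.5 0.5
```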
Listening Tests: Prompt-based Systems — English
| System | Naturalness↑ | Quality↑ | NVV PE↑ | NVV IF↑ |
|---|---|---|---|---|
| Parler-TTS Large | 2.69 | 3.35 | 0.93 | 0.99 |
| Parler-TTS Mini | 2.62 | 3.40 | 0.85 | 0.92 |
| CapSpeech | 2.73 | 3.41 | 1.01 | 1.11 |
| GPT-4o mini | 3.33 | 3.56 | 1.74 | 1.89 |
| Qwen3-TTS | 3.44 | 3.86 | 2.03 | 2.15 |
| Gemini 2.5 Flash | 4.00 | 4.28 | 2.60 | 2.67 |
| Gemini 2.5 Pro | 4.07 | 4.30 | 2.68 | 2.74 |
Listening Tests: Prompt-based Systems — Chinese
| System | Naturalness↑ | Quality↑ | NVV PE↑ | NVV IF↑ |
|---|---|---|---|---|
| GPT-4o mini | 2.11 | 2.77 | 0.84 | 0.92 |
| Qwen3-TTS | 3.45 | 3.93 | 2.01 | 1.98 |
| Gemini 2.5 Flash | 3.04 | 3.71 | 2.29 | 2.42 |
| Gemini 2.5 Pro | 3.38 | 3.75 | 2.07 | 2.11 |
Listening Tests: Tag-based Systems — English
| System | Naturalness↑ | Quality↑ | NVV PE↑ | NVV Accuracy↑ |
|---|---|---|---|---|
| Bark | 2.75 | 2.37 | 2.07 | 2.65 |
| Dia | 3.12 | 2.82 | 2.45 | 2.99 |
| Fish-Speech | 3.53 | 3.62 | 1.01 | 1.04 |
| Higgs-Audio | 3.63 | 3.63 | 2.41 | 2.28 |
| CosyVoice 2 | 3.65 | 3.92 | 2.39 | 2.22 |
| ChatTTS | 3.30 | 3.10 | 3.00 | 3.40 |
| Orpheus TTS | 4.01 | 4.11 | 3.31 | 3.71 |
| ElevenLabs | 4.60 | 4.71 | 3.92 | 4.21 |
Listening Tests: Tag-based Systems — Chinese
| System | Naturalness↑ | Quality↑ | NVV PE↑ | NVV Accuracy↑ |
|---|---|---|---|---|
| Bark | 2.23 | 2.00 | 1.09 | 1.12 |
| Fish-Speech | 2.71 | 3.09 | 0.10 | 0.13 |
| Orpheus TTS | 3.20 | 3.29 | 2.05 | 2.17 |
| Higgs-Audio | 3.48 | 3.68 | 1.38 | 1.18 |
| CosyVoice 2 | 3.76 | 4.35 | 1.56 | 1.65 |
| ChatTTS | 3.53 | 3.13 | 2.13 | 2.23 |
| ElevenLabs | 4.09 | 4.31 | 3.38 | 3.41 |
Subjective listening-test scores are on a 5-point scale. NVV PE: NVV Perceptual Effect. NVV IF: NVV Instruction Following.
LLM-based Evaluation: Prompt-based Systems — English
| System | Naturalness↑ | Quality↑ | NVV PE↑ | NVV IF↑ |
|---|---|---|---|---|
| Parler-TTS Mini | 1.65 | 3.30 | 0.23 | 0.30 |
| Parler-TTS Large | 1.67 | 3.13 | 0.30 | 0.38 |
| CapSpeech | 1.88 | 3.52 | 0.48 | 0.57 |
| GPT-4o mini | 3.16 | 3.93 | 1.73 | 1.65 |
| Qwen3-TTS | 3.17 | 3.96 | 2.16 | 1.98 |
| Gemini 2.5 Flash | 3.30 | 3.80 | 2.84 | 2.78 |
| Gemini 2.5 Pro | 3.20 | 3.65 | 2.54 | 2.64 |
LLM-based Evaluation: Prompt-based Systems — Chinese
| System | Naturalness↑ | Quality↑ | NVV PE↑ | NVV IF↑ |
|---|---|---|---|---|
| GPT-4o mini | 2.50 | 3.85 | 0.92 | 1.04 |
| Qwen3-TTS | 3.23 | 3.88 | 2.10 | 1.99 |
| Gemini 2.5 Flash | 3.27 | 3.90 | 3.06 | 3.12 |
| Gemini 2.5 Pro | 3.11 | 3.73 | 2.54 | 2.71 |
LLM-based Evaluation: Tag-based Systems — English
| System | Naturalness↑ | Quality↑ | NVV PE↑ | NVV Accuracy↑ |
|---|---|---|---|---|
| Bark | 1.78 | 2.20 | 1.43 | 2.34 |
| Dia | 2.19 | 2.51 | 2.08 | 3.14 |
| Fish-Speech | 1.97 | 3.00 | 0.48 | 0.71 |
| CosyVoice 2 | 2.46 | 3.72 | 1.41 | 1.80 |
| ChatTTS | 2.88 | 3.31 | 3.00 | 3.88 |
| Higgs-Audio | 2.77 | 3.50 | 1.90 | 2.15 |
| Orpheus TTS | 3.68 | 3.76 | 4.03 | 3.97 |
| ElevenLabs | 3.92 | 4.21 | 4.33 | 4.58 |
LLM-based Evaluation: Tag-based Systems — Chinese
| System | Naturalness↑ | Quality↑ | NVV PE↑ | NVV Accuracy↑ |
|---|---|---|---|---|
| Bark | 1.44 | 1.84 | 1.28 | 2.25 |
| Fish-Speech | 2.65 | 3.83 | 1.29 | 1.58 |
| Orpheus TTS | 2.81 | 3.48 | 2.74 | 3.09 |
| CosyVoice 2 | 3.37 | 3.84 | 1.98 | 1.75 |
| Higgs-Audio | 3.10 | 3.37 | 2.79 | 3.24 |
| ChatTTS | 2.69 | 2.75 | 1.94 | 2.88 |
| ElevenLabs | 4.09 | 4.09 | 3.74 | 3.77 |
LLM-based scores use an automated multi-rater evaluation protocol with a 5-point scale. NVV PE: NVV Perceptual Effect. NVV IF: NVV Instruction Following.
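A multi-rater aggregation step can be sketched as below. This is a minimal illustration under stated assumptions (each of several LLM raters returns a 0–5 score per axis, and per-axis scores are averaged); the rater count, axes, and the example scores are hypothetical, and NVBench's actual protocol may aggregate differently.

```python
# Minimal sketch: average per-axis scores from multiple (hypothetical) LLM
# raters into one score per axis. Not the benchmark's official protocol.
from statistics import mean

ratings = {  # hypothetical per-rater scores for one synthesized sample
    "Naturalness": [4, 4, 3],
    "Quality":     [4, 5, 4],
    "NVV PE":      [3, 2, 3],
}

aggregated = {axis: round(mean(scores), 2) for axis, scores in ratings.items()}
print(aggregated)  # {'Naturalness': 3.67, 'Quality': 4.33, 'NVV PE': 2.67}
```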
NVV Perceptual Effect by Type
Per-type NVV Perceptual Effect (PE, 0–5 scale) derived from subjective evaluation. Top panel: tag-based systems, sorted by mean PE. Bottom panel: prompt-based systems. White cells indicate an NVV type the system does not support.
Citation
If you use NVBench in your research, please cite:
@inproceedings{nvbench2026,
  title     = {NVBench: A Benchmark for Speech Synthesis with Non-Verbal Vocalizations},
  author    = {Anonymous},
  booktitle = {Interspeech 2026},
  year      = {2026}
}