Interspeech 2026 — Anonymous Submission

NVBench: A Benchmark for Speech Synthesis
with Non-Verbal Vocalizations


Non-verbal vocalizations (NVVs) such as laughs, sighs, and sobs are essential for human-like speech, yet existing TTS benchmarks rarely test whether a system can generate the intended NVV, place it correctly, and keep it salient without harming speech. We present the Non-Verbal Vocalization Benchmark (NVBench), a bilingual (English/Chinese) benchmark for evaluating speech synthesis with NVVs. NVBench pairs a unified 45-type taxonomy with a curated bilingual dataset and introduces a multi-axis protocol that separates general speech naturalness and quality from NVV-specific controllability, placement, and salience. We benchmark 15 TTS systems using objective metrics, listening tests, and an LLM-based multi-rater evaluation. The evaluation code and dataset are available in our GitHub repository. Results reveal that NVV controllability often decouples from quality, while low-SNR oral cues and long-duration affective NVVs remain persistent bottlenecks.

45-Type NVV Taxonomy

A unified taxonomy with 6 categories and 45 fine-grained NVV types, spanning respiratory, throat, laughter, crying, emotional, and oral vocalizations.


Bilingual Evaluation Set

4,500 high-quality instances in English and Chinese with balanced coverage across all 45 NVV types and two synthesis paradigms.


Multi-Axis Evaluation Protocol

Disentangles general speech quality from NVV-specific controllability, placement accuracy, and salience across 15 TTS systems.

💾 Dataset 🔗 Code

NVV Taxonomy

A unified taxonomy of 45 non-verbal vocalization types organized into 6 categories.

Category | NVV Types | Count
Respiratory | breath, inhale, exhale, quick breath, sigh, gasp, panting, wheezing, snore, yawn | 10
Throat / Physiological | cough, sneeze, throat clearing, hiccup, sniff, sniffle, snort | 7
Laughter Spectrum | chuckle, giggle, laugh, laugh harder, start laughing, stifled laugh, burst of laughter | 7
Crying Spectrum | crying, sobbing, crying loudly, wail, whimper | 5
Emotional Vocalizations | hum, humming, groan, moan, grunt, mumble, exclamation (ah, oh, hmm) | 7
Oral / Miscellaneous | lipsmack, gulp, swallow, burp, tsk, sss, clucking, hissing, whisper | 9
Total | | 45
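
For readers who want to work with the taxonomy programmatically, the table above can be written down as a plain mapping. This is only a sketch: the names `NVV_TAXONOMY` and `nvv_category` are illustrative assumptions and are not taken from the NVBench code.

```python
# The taxonomy as listed in the table above. The dict layout and all
# identifiers here are illustrative assumptions, not the NVBench API.
NVV_TAXONOMY = {
    "Respiratory": [
        "breath", "inhale", "exhale", "quick breath", "sigh",
        "gasp", "panting", "wheezing", "snore", "yawn",
    ],
    "Throat / Physiological": [
        "cough", "sneeze", "throat clearing", "hiccup",
        "sniff", "sniffle", "snort",
    ],
    "Laughter Spectrum": [
        "chuckle", "giggle", "laugh", "laugh harder",
        "start laughing", "stifled laugh", "burst of laughter",
    ],
    "Crying Spectrum": [
        "crying", "sobbing", "crying loudly", "wail", "whimper",
    ],
    "Emotional Vocalizations": [
        "hum", "humming", "groan", "moan", "grunt", "mumble",
        "exclamation (ah, oh, hmm)",
    ],
    "Oral / Miscellaneous": [
        "lipsmack", "gulp", "swallow", "burp", "tsk", "sss",
        "clucking", "hissing", "whisper",
    ],
}


def nvv_category(nvv_type: str) -> str:
    """Return the category that contains the given NVV type."""
    for category, types in NVV_TAXONOMY.items():
        if nvv_type in types:
            return category
    raise KeyError(f"unknown NVV type: {nvv_type}")


# Sanity check against the counts stated in the table.
TOTAL_TYPES = sum(len(types) for types in NVV_TAXONOMY.values())
```

Summing the per-category counts this way gives a quick check that the six categories add up to the stated 45 types.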

Demo Samples

Listen to system outputs for all available NVV types in both languages (EN/ZH) and both synthesis paradigms (tag-based / prompt-based), with 3 samples per type whenever available.

Tag-based: the NVV type is inserted as an inline tag marker in the text (e.g., <laugh>).
Prompt-based: a natural-language caption describes the desired NVV behavior.
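
To make the difference between the two paradigms concrete, the sketch below builds one input of each kind from the same sentence. The helper names, the word-index convention, the caption wording, and the dict keys are all hypothetical; each TTS system defines its own tag syntax and prompt format.

```python
def make_tag_input(text: str, nvv_type: str, word_index: int) -> str:
    """Tag-based: insert an inline tag such as <laugh> before the word
    at word_index (hypothetical convention, for illustration only)."""
    words = text.split()
    if not 0 <= word_index <= len(words):
        raise ValueError("word_index out of range")
    words.insert(word_index, f"<{nvv_type}>")
    return " ".join(words)


def make_prompt_input(text: str, nvv_type: str) -> dict:
    """Prompt-based: leave the text untouched and describe the desired
    NVV in a separate caption (caption wording is an assumption)."""
    caption = f"Speak this line and include a {nvv_type}."
    return {"caption": caption, "text": text}


tagged = make_tag_input("I cannot believe it", "laugh", 2)
prompted = make_prompt_input("I cannot believe it", "laugh")
```

The tag-based form fixes where the NVV should occur, which is why placement can be scored for those systems, while the prompt-based form leaves placement to the model.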

Evaluation Results

Results across objective metrics, subjective listening tests, and LLM-based evaluation for 15 TTS systems.

Within each language & system-type block: bold = best  |  underline = second-best
Prompt-based Systems — English
System | WER↓ | SIG↑ | BAK↑ | OVRL↑ | CLAP Score↑
Parler-TTS Mini | 6.25 | 3.48 | 4.03 | 3.20 | 0.34
Parler-TTS Large | 9.30 | 3.45 | 4.00 | 3.15 | 0.35
CapSpeech | 4.56 | 3.46 | 3.96 | 3.16 | 0.40
Qwen3-TTS | 2.06 | 3.56 | 4.07 | 3.30 | 0.45
GPT-4o mini TTS | 4.81 | 3.59 | 4.14 | 3.35 | 0.44
Gemini 2.5 Flash | 58.80 | 3.57 | 4.02 | 3.29 | 0.42
Gemini 2.5 Pro | 5.40 | 3.52 | 4.00 | 3.23 | 0.41
Prompt-based Systems — Chinese
System | CER↓ | SIG↑ | BAK↑ | OVRL↑ | CLAP Score↑
Qwen3-TTS | 4.08 | 3.50 | 3.98 | 3.19 | 0.39
GPT-4o mini TTS | 4.67 | 3.64 | 4.16 | 3.40 | 0.43
Gemini 2.5 Flash | 16.45 | 3.57 | 3.97 | 3.25 | 0.41
Gemini 2.5 Pro | 7.68 | 3.55 | 4.01 | 3.26 | 0.42
Tag-based Systems — English
System | WER↓ | SIG↑ | BAK↑ | OVRL↑ | Coverage↑ | Precision↑ | Recall↑ | F1↑ | NTD↓
Bark | 14.73 | 3.00 | 3.08 | 2.52 | 0.11 | 0.614 | 0.700 | 0.654 | 0.0037
Higgs-Audio | 9.41 | 3.55 | 3.96 | 3.23 | 0.09 | 0.360 | 0.407 | 0.382 | 0.0111
ChatTTS | 5.58 | 3.67 | 4.12 | 3.42 | 0.02 | 0.652 | 0.680 | 0.664 | 0.0028
Fish-Speech | 5.65 | 3.55 | 4.06 | 3.28 | 0.16 | 0.447 | 0.418 | 0.432 | 0.0157
Dia | 21.95 | 2.94 | 2.71 | 2.27 | 0.29 | 0.574 | 0.705 | 0.632 | 0.0052
CosyVoice 2 | 3.82 | 3.67 | 4.19 | 3.45 | 0.18 | 0.475 | 0.451 | 0.463 | 0.0159
Orpheus TTS | 4.98 | 3.62 | 4.11 | 3.34 | 0.18 | 0.687 | 0.774 | 0.728 | 0.0031
ElevenLabs | 2.31 | 3.59 | 4.06 | 3.32 | 0.27 | 0.664 | 0.787 | 0.720 | 0.0091
Tag-based Systems — Chinese
System | CER↓ | SIG↑ | BAK↑ | OVRL↑ | Coverage↑ | Precision↑ | Recall↑ | F1↑ | NTD↓
Bark | 41.86 | 2.99 | 3.10 | 2.54 | 0.11 | 0.572 | 0.605 | 0.588 | 0.0141
Orpheus TTS | 18.78 | 3.60 | 4.11 | 3.30 | 0.18 | 0.585 | 0.671 | 0.625 | 0.0159
Higgs-Audio | 5.98 | 3.52 | 3.91 | 3.17 | 0.09 | 0.429 | 0.369 | 0.396 | 0.0186
Fish-Speech | 9.25 | 3.49 | 4.00 | 3.19 | 0.16 | 0.592 | 0.605 | 0.598 | 0.0397
ChatTTS | 4.52 | 3.49 | 3.50 | 2.94 | 0.02 | 0.644 | 0.773 | 0.703 | 0.0237
CosyVoice 2 | 6.27 | 3.63 | 4.17 | 3.40 | 0.18 | 0.515 | 0.480 | 0.496 | 0.0383
ElevenLabs | 4.13 | 3.61 | 4.03 | 3.32 | 0.27 | 0.630 | 0.750 | 0.684 | 0.0246

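WER/CER: word/character error rate (lower is better). SIG/BAK/OVRL: DNSMOS speech-signal, background-noise, and overall quality scores (higher is better). NTD: Normalized Tag Distance (lower is better).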
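
As a reading aid for the objective tables, the sketch below shows the textbook definitions of two of the reported quantities. It rests on two assumptions that the page does not state: that WER/CER are edit-distance error rates expressed as percentages, and that F1 is the harmonic mean of the reported precision and recall (this reproduces most rows of the tag-based tables up to rounding). Function names are illustrative and are not taken from the NVBench evaluation code.

```python
def edit_distance(ref: list, hyp: list) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_tokens: list, hyp_tokens: list) -> float:
    """Error rate in percent. Pass word lists for WER and character
    lists for CER (percent scaling is an assumption)."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


wer = error_rate("the cat sat".split(), "the dog sat".split())
cer = error_rate(list("你好吗"), list("你好"))
bark_f1 = round(f1_score(0.614, 0.700), 3)  # Bark, tag-based English row
```

With the Bark English row (precision 0.614, recall 0.700), the harmonic mean rounds to the reported F1 of 0.654.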

Prompt-based Systems — English
System | Naturalness↑ | Quality↑ | NVV PE↑ | NVV IF↑
Parler-TTS Large | 2.69 | 3.35 | 0.93 | 0.99
Parler-TTS Mini | 2.62 | 3.40 | 0.85 | 0.92
CapSpeech | 2.73 | 3.41 | 1.01 | 1.11
GPT-4o mini | 3.33 | 3.56 | 1.74 | 1.89
Qwen3-TTS | 3.44 | 3.86 | 2.03 | 2.15
Gemini 2.5 Flash | 4.00 | 4.28 | 2.60 | 2.67
Gemini 2.5 Pro | 4.07 | 4.30 | 2.68 | 2.74
Prompt-based Systems — Chinese
System | Naturalness↑ | Quality↑ | NVV PE↑ | NVV IF↑
GPT-4o mini | 2.11 | 2.77 | 0.84 | 0.92
Qwen3-TTS | 3.45 | 3.93 | 2.01 | 1.98
Gemini 2.5 Flash | 3.04 | 3.71 | 2.29 | 2.42
Gemini 2.5 Pro | 3.38 | 3.75 | 2.07 | 2.11
Tag-based Systems — English
System | Naturalness↑ | Quality↑ | NVV PE↑ | NVV Accuracy↑
Bark | 2.75 | 2.37 | 2.07 | 2.65
Dia | 3.12 | 2.82 | 2.45 | 2.99
Fish-Speech | 3.53 | 3.62 | 1.01 | 1.04
Higgs-Audio | 3.63 | 3.63 | 2.41 | 2.28
CosyVoice 2 | 3.65 | 3.92 | 2.39 | 2.22
ChatTTS | 3.30 | 3.10 | 3.00 | 3.40
Orpheus TTS | 4.01 | 4.11 | 3.31 | 3.71
ElevenLabs | 4.60 | 4.71 | 3.92 | 4.21
Tag-based Systems — Chinese
System | Naturalness↑ | Quality↑ | NVV PE↑ | NVV Accuracy↑
Bark | 2.23 | 2.00 | 1.09 | 1.12
Fish-Speech | 2.71 | 3.09 | 0.10 | 0.13
Orpheus TTS | 3.20 | 3.29 | 2.05 | 2.17
Higgs-Audio | 3.48 | 3.68 | 1.38 | 1.18
CosyVoice 2 | 3.76 | 4.35 | 1.56 | 1.65
ChatTTS | 3.53 | 3.13 | 2.13 | 2.23
ElevenLabs | 4.09 | 4.31 | 3.38 | 3.41

Subjective listening-test scores. NVV PE: NVV Perceptual Effect. NVV IF: NVV Instruction Following. All scores are on a 5-point scale.

Prompt-based Systems — English
System | Naturalness↑ | Quality↑ | NVV PE↑ | NVV IF↑
Parler-TTS Mini | 1.65 | 3.30 | 0.23 | 0.30
Parler-TTS Large | 1.67 | 3.13 | 0.30 | 0.38
CapSpeech | 1.88 | 3.52 | 0.48 | 0.57
GPT-4o mini | 3.16 | 3.93 | 1.73 | 1.65
Qwen3-TTS | 3.17 | 3.96 | 2.16 | 1.98
Gemini 2.5 Flash | 3.30 | 3.80 | 2.84 | 2.78
Gemini 2.5 Pro | 3.20 | 3.65 | 2.54 | 2.64
Prompt-based Systems — Chinese
System | Naturalness↑ | Quality↑ | NVV PE↑ | NVV IF↑
GPT-4o mini | 2.50 | 3.85 | 0.92 | 1.04
Qwen3-TTS | 3.23 | 3.88 | 2.10 | 1.99
Gemini 2.5 Flash | 3.27 | 3.90 | 3.06 | 3.12
Gemini 2.5 Pro | 3.11 | 3.73 | 2.54 | 2.71
Tag-based Systems — English
System | Naturalness↑ | Quality↑ | NVV PE↑ | NVV Accuracy↑
Bark | 1.78 | 2.20 | 1.43 | 2.34
Dia | 2.19 | 2.51 | 2.08 | 3.14
Fish-Speech | 1.97 | 3.00 | 0.48 | 0.71
CosyVoice 2 | 2.46 | 3.72 | 1.41 | 1.80
ChatTTS | 2.88 | 3.31 | 3.00 | 3.88
Higgs-Audio | 2.77 | 3.50 | 1.90 | 2.15
Orpheus TTS | 3.68 | 3.76 | 4.03 | 3.97
ElevenLabs | 3.92 | 4.21 | 4.33 | 4.58
Tag-based Systems — Chinese
System | Naturalness↑ | Quality↑ | NVV PE↑ | NVV Accuracy↑
Bark | 1.44 | 1.84 | 1.28 | 2.25
Fish-Speech | 2.65 | 3.83 | 1.29 | 1.58
Orpheus TTS | 2.81 | 3.48 | 2.74 | 3.09
CosyVoice 2 | 3.37 | 3.84 | 1.98 | 1.75
Higgs-Audio | 3.10 | 3.37 | 2.79 | 3.24
ChatTTS | 2.69 | 2.75 | 1.94 | 2.88
ElevenLabs | 4.09 | 4.09 | 3.74 | 3.77

LLM-based scores use an automated multi-rater evaluation protocol with a 5-point scale. NVV PE: NVV Perceptual Effect. NVV IF: NVV Instruction Following.

NVV Perceptual Effect by Type

Per-type NVV Perceptual Effect (PE) (0–5 scale) derived from subjective evaluation. Top panel: tag-based systems sorted by mean PE. Bottom panel: prompt-based systems. White cells = NVV type not supported.

English NVV Perceptual Effect heatmap (tag-based top, prompt-based bottom)
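
The row ordering described for the heatmap (systems sorted by mean PE, with unsupported NVV types left blank) can be sketched as follows. This is an illustrative reconstruction only: the system names and scores are toy placeholders and not benchmark results, `None` stands in for a white (unsupported) cell, and descending order is an assumption.

```python
from statistics import mean


def mean_pe(scores: dict):
    """Mean PE over supported NVV types; None marks an unsupported type."""
    supported = [s for s in scores.values() if s is not None]
    return mean(supported) if supported else None


def sort_systems_by_mean_pe(pe_by_system: dict) -> list:
    """Return (system, mean PE) pairs, highest mean first (assumed order).
    Systems with no supported type are dropped."""
    ranked = [(name, mean_pe(scores)) for name, scores in pe_by_system.items()]
    ranked = [(name, m) for name, m in ranked if m is not None]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Toy placeholder values for illustration; NOT taken from NVBench.
toy_pe = {
    "system_b": {"laugh": 2.0, "sigh": 1.0, "cough": 3.0},
    "system_a": {"laugh": 4.0, "sigh": 3.0, "cough": None},
}
ranking = sort_systems_by_mean_pe(toy_pe)
```

Averaging only over supported types keeps a system from being penalized in the ordering for NVV types it cannot be prompted with at all; whether the figure uses this convention is not stated on the page.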

Citation

If you use NVBench in your research, please cite:

@inproceedings{nvbench2026,
  title     = {NVBench: A Benchmark for Speech Synthesis with Non-Verbal Vocalizations},
  author    = {Anonymous},
  booktitle = {Interspeech 2026},
  year      = {2026}
}