Silent Tags — Evaluation Details

Note on reproduction. The original ParaS2SBench proposed by Zhang et al. (arXiv:2511.08723) has not been publicly released: neither the prompt audio, the filtering scripts, nor the LLM-judge rubric is available at the time of writing. To make our results comparable, we rebuilt a benchmark of the same design by following the construction recipe described in the paper. This reconstructed benchmark is used for every ParaS2SBench number reported on the main page. We document the procedure below so that reviewers can audit any deviation from the original protocol.

Reference

Original ParaS2SBench (as described in the paper)

ParaS2SBench is a paralinguistic speech-to-speech benchmark that evaluates whether an S2S model's spoken response is appropriate to the paralinguistic attributes of the spoken input — not only its textual content. The benchmark covers four paralinguistic dimensions and combines synthesized and real speech.

Split	Dimension	Categories	Prompts	Utterances
Synthetic	Emotion	Happy / Surprised / Sad / Angry / Fear / Disgust	300	600
Synthetic	Sarcasm	Sincere / Sarcastic	300	600
Synthetic	Age	Adult / Child	300	600
Synthetic	Gender	Male / Female	300	600
Real	Emotion (IEMOCAP)	6 emotion classes	709	—
Real	Emotion (MELD)	6 emotion classes	781	—
Total			2,690	≈ 7.8 h

Our Reconstruction

Construction Pipeline

We follow the five-stage recipe described in §3 of the ParaS2SBench paper. Every stage is a best-effort re-implementation; we keep the category counts and per-category prompt counts identical to the paper so that scores are directly comparable in magnitude.

Figure. Five-stage reconstruction pipeline. Synthetic prompts flow through generation → filtering → TTS (two branches by dimension) → disjointness check → unanimous human verification; the real-speech split is sourced from public IEMOCAP and MELD releases and merged at the end to form the 2,690-prompt benchmark used for every ParaS2SBench number on the main page.

Stage 1 — Candidate generation. We prompt gpt-4o to write spoken queries spanning the topical domains listed in the paper (interests, work, studies, relationships, travel, health, religion, fashion, finance). For each query we also generate its contrastive style pair (e.g. the same sentence delivered sincerely vs. sarcastically) so that the paralinguistic channel — not the lexical content — is what determines the appropriate response.
Stage 2 — Quality filtering. Three LLM-based filters reject unsuitable candidates, mirroring the paper's filters:
- Neutrality test: a candidate is kept only if the sentence can plausibly be voiced in both contrastive styles, so that the text alone does not give the style away.
- Reasonability test: the content–style pairing must be sensible (e.g. not "shouting a lullaby").
- Paralinguistic-relevance test: the two styles must elicit different ideal responses, otherwise the prompt cannot tell paralinguistic-aware models apart from text-only ones.
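The filter chain of Stage 2 can be sketched as follows. This is a minimal illustration, not the released filtering script: the `Candidate` type, the `apply_filters` helper, and the idea of passing each filter as a plain yes/no callable are assumptions made for the example. In an actual pipeline each callable would wrap an LLM call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Candidate:
    """One generated query with its contrastive style pair (hypothetical schema)."""
    text: str
    style_a: str
    style_b: str


# A filter answers a single yes/no question about a candidate.
Filter = Callable[[Candidate], bool]


def apply_filters(
    candidates: List[Candidate],
    filters: Dict[str, Filter],
) -> Tuple[List[Candidate], List[Tuple[Candidate, str]]]:
    """Keep candidates that pass every filter.

    Rejected candidates are returned with the name of the first filter
    they failed, which makes the rejection reasons auditable.
    """
    kept: List[Candidate] = []
    rejected: List[Tuple[Candidate, str]] = []
    for cand in candidates:
        failed: Optional[str] = next(
            (name for name, passes in filters.items() if not passes(cand)),
            None,
        )
        if failed is None:
            kept.append(cand)
        else:
            rejected.append((cand, failed))
    return kept, rejected
```

Filters run in dictionary insertion order, so cheaper checks can be listed first to avoid unnecessary judge calls.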
Stage 3 — Speech synthesis. We synthesize the surviving queries with two TTS systems, matching the paper's assignment:
- gpt-4o-mini-tts with style instructions for the emotion and sarcasm splits;
- CosyVoice-300M zero-shot voice cloning for the age and gender splits, using reference voices sampled from LibriSpeech, CommonVoice and NNCES.
Synthesized utterances are post-filtered in two ways: by the word error rate (WER) of the Whisper-large-v3 transcript against the target transcript, and by an emotion2vec classifier score. Utterances that fail either check are re-synthesized or discarded.
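As an illustration of the post-filter, the sketch below computes word-level WER with a standard edit-distance recurrence and combines it with a style score. The thresholds `max_wer` and `min_style_score` are placeholder values chosen for the example, not the ones used in our pipeline, and the ASR transcript and classifier score are assumed to be computed elsewhere and passed in.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        return 0.0 if not hyp else 1.0
    # prev[j] holds the edit distance between the processed prefix of ref
    # and the first j words of hyp.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


def passes_post_filter(
    target_text: str,
    asr_text: str,
    style_score: float,
    max_wer: float = 0.1,          # placeholder threshold (assumption)
    min_style_score: float = 0.5,  # placeholder threshold (assumption)
) -> bool:
    """Accept an utterance only if both the WER and the style score pass."""
    wer_ok = word_error_rate(target_text, asr_text) <= max_wer
    return wer_ok and style_score >= min_style_score
```

The comparison here is case-insensitive but does not strip punctuation; a production filter would normally apply the same text normalization to both transcripts before scoring.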
Stage 4 — Train/test disjointness. Query topics and TTS reference speakers used for benchmark prompts are excluded from the training corpus of Silent Tags, exactly as the paper requires, to avoid contamination.
Stage 5 — Human verification. Three annotators listen to every retained prompt and mark content-correct and style-correct. Only prompts that pass both checks unanimously enter the final set. We retained 1,200 synthetic prompts after verification (300 per dimension), matching the paper.
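The unanimity rule of Stage 5 is simple enough to state in code. The sketch below assumes a hypothetical annotation format in which each annotator contributes one `Vote` per prompt; the names `Vote`, `is_unanimous`, and `select_verified` are illustrative only.

```python
from typing import Dict, List, NamedTuple


class Vote(NamedTuple):
    """One annotator's judgment of one prompt (hypothetical schema)."""
    content_correct: bool
    style_correct: bool


def is_unanimous(votes: List[Vote]) -> bool:
    """True only if every annotator marked both content and style as correct."""
    return bool(votes) and all(v.content_correct and v.style_correct for v in votes)


def select_verified(annotations: Dict[str, List[Vote]]) -> List[str]:
    """Return the IDs of prompts that pass both checks for all annotators."""
    return [pid for pid, votes in annotations.items() if is_unanimous(votes)]
```

A single dissenting vote on either check is enough to exclude a prompt, which trades benchmark size for label reliability.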

For the real-speech split we use the public IEMOCAP and MELD releases directly, applying the same six-emotion mapping as the paper. We do not redistribute the audio from these corpora; the benchmark metadata only references their official sample IDs.

Scoring

Automatic Judging Pipeline

Each S2S model under test consumes a prompt audio and produces a response audio. We then extract four fields and feed them to an LLM judge, following the same decomposition as the paper:

c_i — input content, transcribed by Whisper-large-v3.
s_i — input paralinguistic style, taken from the ground-truth label of the benchmark prompt.
c_o — output content, transcribed by Whisper-large-v3.
s_o — output paralinguistic style, predicted by AudioReasoner.

The judge returns a single fitness score on a 1–5 Likert scale. The paper used ChatGPT 4.1 as the judge. Because the original rubric is not released, we wrote a judge prompt that follows the paper's textual description of the rubric and use Qwen3-32B-Instruct as the scoring LLM (the same judge that produced the main-page numbers). The full judge prompt is included in our anonymous repository.
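To make the data flow concrete, the sketch below packs the four fields into a record, renders them as text for the judge, and extracts a 1–5 score from the judge's reply. The field layout in `render_judge_input` is a placeholder and not our actual judge prompt, and `parse_fitness_score` is one possible way to read the score, assuming the judge states it as a standalone digit.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class JudgeInput:
    c_i: str  # input content (ASR transcript of the prompt)
    s_i: str  # input style (ground-truth benchmark label)
    c_o: str  # output content (ASR transcript of the response)
    s_o: str  # output style (predicted by an audio model)


def render_judge_input(x: JudgeInput) -> str:
    """Render the four fields as plain text (placeholder layout, an assumption)."""
    return (
        f"Input content: {x.c_i}\n"
        f"Input style: {x.s_i}\n"
        f"Output content: {x.c_o}\n"
        f"Output style: {x.s_o}\n"
    )


# A standalone digit from 1 to 5; "10" or "42" do not match.
_SCORE_RE = re.compile(r"\b([1-5])\b")


def parse_fitness_score(reply: str) -> Optional[int]:
    """Return the first standalone 1-5 digit in the reply, or None if absent."""
    match = _SCORE_RE.search(reply)
    return int(match.group(1)) if match else None
```

Returning `None` instead of a default score lets the caller decide whether to retry the judge or drop the sample, so unparseable replies do not silently bias the mean.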

The reported ParaS2S average score on the main page covers all 2,690 prompts and weights the six subsets (the four synthetic dimensions and the two real corpora) equally: we average the fitness scores within each subset and then take the unweighted mean of the six subset means.
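Because the subsets differ in size (300 prompts per synthetic dimension versus 709 and 781 for the real corpora), equal weighting across subsets is not the same as a plain mean over prompts. The sketch below contrasts the two, reading "weighted equally" as a macro-average over subset means; the function names and the `(subset, score)` record format are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, float]  # (subset name, fitness score)


def micro_average(records: Iterable[Record]) -> float:
    """Plain mean over all prompts; larger subsets carry more weight."""
    return mean(score for _, score in records)


def macro_average(records: Iterable[Record]) -> float:
    """Mean of per-subset means; every subset carries the same weight."""
    by_subset: Dict[str, List[float]] = defaultdict(list)
    for subset, score in records:
        by_subset[subset].append(score)
    return mean(mean(scores) for scores in by_subset.values())
```

For example, with two Emotion scores of 4 and 2 and a single MELD score of 5, the micro-average is 11/3 while the macro-average is 4.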

Validation

Sanity Checks on the Reconstructed Benchmark

To confirm that our reconstruction is faithful, we ran two checks:

Human–automatic correlation. On a 120-prompt subset (30 per synthetic dimension), the Pearson correlation between the Qwen3 judge and the average of three human raters is 0.74 (Emotion), 0.71 (Sarcasm), 0.83 (Age), and 0.69 (Gender). These values are close to the paper's reported range of 0.70–0.86, with Gender falling marginally below its lower end.
Baseline reproduction. Running Qwen2.5-Omni (vanilla) on our reconstructed benchmark gives an average score within ±0.15 of the value the paper reports for the same model, indicating that the prompt difficulty is comparable.
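The human–automatic correlation check can be reproduced along these lines. The sketch averages the human ratings per prompt and computes the Pearson coefficient against the judge scores; the helper names and the input layout (one list of ratings per prompt) are assumptions for illustration.

```python
from math import sqrt
from typing import List, Sequence


def average_human_ratings(ratings_per_prompt: Sequence[Sequence[float]]) -> List[float]:
    """Average the raters' scores for each prompt."""
    return [sum(ratings) / len(ratings) for ratings in ratings_per_prompt]


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)
```

The per-dimension figures would come from calling `pearson` separately on the prompts of each dimension, rather than once on the pooled 120 prompts.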

We will publicly release the prompt manifests, the judge prompt, and the filtering scripts together with the camera-ready version. The anonymous repository linked from the main page already contains the audio prompts, the judge prompt, and the scoring code used to produce every ParaS2SBench number reported in the submission.