Note on reproduction.
The original ParaS2SBench proposed by Zhang et al. (arXiv:2511.08723) has not been publicly released: at the time of writing, none of the prompt audio, the filtering scripts, or the LLM-judge rubric is available.
To make our results comparable, we therefore re-built a benchmark of the same design by faithfully following the construction recipe described in the paper. The reconstructed benchmark is what is used for every ParaS2SBench number reported on the main page. We document the procedure below so that reviewers can audit any deviation from the original protocol.
Reference
Original ParaS2SBench (as described in the paper)
ParaS2SBench is a paralinguistic speech-to-speech benchmark that evaluates whether an S2S
model's spoken response is appropriate to the paralinguistic attributes of the
spoken input — not only its textual content. The benchmark covers four paralinguistic
dimensions and combines synthesized and real speech.
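| Split     | Dimension         | Categories                                       | Prompts | Utterances |
|-----------|-------------------|--------------------------------------------------|---------|------------|
| Synthetic | Emotion           | Happy / Surprised / Sad / Angry / Fear / Disgust | 300     | 600        |
| Synthetic | Sarcasm           | Sincere / Sarcastic                              | 300     | 600        |
| Synthetic | Age               | Adult / Child                                    | 300     | 600        |
| Synthetic | Gender            | Male / Female                                    | 300     | 600        |
| Real      | Emotion (IEMOCAP) | 6 emotion classes                                | 709     | -          |
| Real      | Emotion (MELD)    | 6 emotion classes                                | 781     | -          |
| Total     |                   |                                                  | 2,690   | ≈ 7.8 h    |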
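As a quick consistency check on the table, the per-row prompt counts can be summed in a few lines of Python. The dictionary layout and the helper name `split_total` are illustrative choices for this sketch, not part of any benchmark tooling.

```python
# Prompt counts copied from the table above, keyed by (split, dimension).
PROMPTS = {
    ("Synthetic", "Emotion"): 300,
    ("Synthetic", "Sarcasm"): 300,
    ("Synthetic", "Age"): 300,
    ("Synthetic", "Gender"): 300,
    ("Real", "Emotion (IEMOCAP)"): 709,
    ("Real", "Emotion (MELD)"): 781,
}


def split_total(split: str) -> int:
    """Sum the prompt counts of every row that belongs to the given split."""
    return sum(n for (s, _), n in PROMPTS.items() if s == split)


synthetic_total = split_total("Synthetic")
real_total = split_total("Real")
grand_total = synthetic_total + real_total
```

The synthetic rows sum to 1,200 prompts and the real rows to 1,490, which together give the 2,690 total.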
Our Reconstruction
Construction Pipeline
We follow the five-stage recipe described in §3 of the ParaS2SBench paper. Every stage is
a best-effort re-implementation; we keep the category counts and per-category prompt counts
identical to the paper so that scores are directly comparable in magnitude.
Figure. Five-stage reconstruction pipeline. Synthetic prompts flow through
generation → filtering → TTS (two branches by dimension) → disjointness check → unanimous
human verification; the real-speech split is sourced from public IEMOCAP and MELD releases
and merged at the end to form the 2,690-prompt benchmark used for every ParaS2SBench
number on the main page.
Stage 1 — Candidate generation.
We prompt gpt-4o to write spoken queries spanning the topical domains listed in
the paper (interests, work, studies, relationships, travel, health, religion, fashion,
finance). For each query we also generate its contrastive style pair
(e.g. the same sentence delivered sincerely vs. sarcastically) so that
the paralinguistic channel — not the lexical content — is what determines the appropriate
response.
Stage 2 — Quality filtering.
Three LLM-based filters reject unsuitable candidates, mirroring the paper's filters:
Neutrality test: a candidate is kept only if the sentence can plausibly be
voiced in both contrastive styles, so that the text alone does not reveal the style.
Reasonability test: the content–style pairing must be sensible
(e.g. not "shouting a lullaby").
Paralinguistic-relevance test: the two styles must elicit
different ideal responses, otherwise the prompt cannot tell paralinguistic-aware
models apart from text-only ones.
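The filtering stage can be pictured as a chain of accept/reject predicates in which a candidate survives only if every filter accepts it. The sketch below is a minimal illustration under that assumption: `Candidate`, `apply_filters`, and the two toy predicates are hypothetical names, and the toy predicates merely stand in for the LLM-based tests so that the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Candidate:
    """A query text together with its two contrastive styles."""
    text: str
    style_a: str
    style_b: str


# A filter returns True to keep a candidate and False to reject it.
Filter = Callable[[Candidate], bool]


def apply_filters(candidates: Iterable[Candidate], filters: List[Filter]) -> List[Candidate]:
    """Keep a candidate only if every filter accepts it."""
    return [c for c in candidates if all(f(c) for f in filters)]


# Toy stand-ins (assumptions): a real filter would prompt an LLM instead.
def non_empty(c: Candidate) -> bool:
    return bool(c.text.strip())


def distinct_styles(c: Candidate) -> bool:
    return c.style_a != c.style_b


candidates = [
    Candidate("I got the job.", "sincere", "sarcastic"),
    Candidate("   ", "sincere", "sarcastic"),
    Candidate("Nice weather today.", "sincere", "sincere"),
]
kept = apply_filters(candidates, [non_empty, distinct_styles])
```

Because `all` short-circuits, a candidate rejected by an early filter is never passed to the later ones, which matters when each filter is an LLM call.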
Stage 3 — Speech synthesis.
We synthesize the surviving queries with two TTS systems, matching the paper's assignment:
gpt-4o-mini-tts with style instructions for the emotion
and sarcasm splits;
CosyVoice-300M zero-shot voice cloning for the age
and gender splits, using reference voices sampled from LibriSpeech,
CommonVoice and NNCES.
Synthesized utterances are post-filtered by Whisper-large-v3 WER against the target transcript and
by an emotion2vec classifier score; failed utterances are re-synthesized or
discarded.
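The post-filter can be sketched as a word error rate (WER) check combined with a style-score check. The code below is an illustration only: the thresholds `max_wer` and `min_style_score` are hypothetical defaults (the text does not state the values used), the ASR transcript and classifier score are assumed to be computed elsewhere, and the text normalisation is reduced to lowercasing and whitespace splitting.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def passes_post_filter(
    target_text: str,
    asr_transcript: str,
    style_score: float,
    max_wer: float = 0.1,          # hypothetical threshold
    min_style_score: float = 0.5,  # hypothetical threshold
) -> bool:
    """Accept an utterance only if it is both intelligible and on-style."""
    wer = word_error_rate(target_text, asr_transcript)
    return wer <= max_wer and style_score >= min_style_score
```

An utterance that fails either check would then be re-synthesized or dropped, as described above.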
Stage 4 — Train/test disjointness.
Query topics and TTS reference speakers used for benchmark prompts are excluded from the
training corpus of Silent Tags, exactly as the paper requires, to avoid contamination.
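A disjointness check of this kind reduces to a set intersection between the identifiers used on the benchmark side and those on the training side. The sketch below assumes string identifiers (for example speaker IDs); the helper name `find_overlap` and the example IDs are made up for illustration.

```python
from typing import Iterable, Set


def find_overlap(benchmark_items: Iterable[str], training_items: Iterable[str]) -> Set[str]:
    """Return every identifier that appears on both sides."""
    return set(benchmark_items) & set(training_items)


# Hypothetical speaker IDs, used only to show the check in action.
benchmark_speakers = {"spk_001", "spk_002"}
training_speakers = {"spk_100", "spk_002", "spk_101"}
leaked = find_overlap(benchmark_speakers, training_speakers)
```

Any non-empty result names the items that would have to be removed from the training corpus before the two sets count as disjoint.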
Stage 5 — Human verification.
Three annotators listen to every retained prompt and judge whether it is content-correct
and style-correct. Only prompts that pass both checks unanimously enter the final
set. We retained 1,200 synthetic prompts after verification (300 per
dimension), matching the paper.
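The unanimity rule can be written down compactly: a prompt is retained only if every annotator marks it as both content-correct and style-correct. The `Vote` record and function name below are hypothetical and only illustrate the rule.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Vote:
    """One annotator's judgement of one prompt."""
    content_correct: bool
    style_correct: bool


def passes_unanimously(votes: List[Vote], n_annotators: int = 3) -> bool:
    """True only if all annotators approve both content and style."""
    if len(votes) != n_annotators:
        raise ValueError(f"expected {n_annotators} votes, got {len(votes)}")
    return all(v.content_correct and v.style_correct for v in votes)
```

Under this rule a single dissenting vote on either criterion is enough to drop the prompt.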
For the real-speech split we use the public IEMOCAP and MELD releases directly, applying the
same six-emotion mapping as the paper. We do not redistribute these audios; the benchmark
metadata only references their official sample IDs.
Scoring
Automatic Judging Pipeline
Each S2S model under test consumes a prompt audio and produces a response audio. We then
extract four fields and feed them to an LLM judge, following the same decomposition as the
paper:
c_i — input content, transcribed by Whisper-large-v3.
s_i — input paralinguistic style, taken from the ground-truth label of
the benchmark prompt.
c_o — output content, transcribed by Whisper-large-v3.
s_o — output paralinguistic style, predicted by AudioReasoner.
The judge returns a single 1–5 Likert fitness score. The paper used
ChatGPT 4.1 as the judge. Because the original rubric has not been released, we authored a
judge prompt that matches the paper's textual description of the rubric and use
Qwen3-32B-Instruct as the scoring LLM (the same judge that produced
the main-page numbers). The full judge prompt is included in our anonymous repository.
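To make the four-field decomposition concrete, the sketch below assembles c_i, s_i, c_o and s_o into a judge request and parses a 1–5 score from the reply. The template wording is a placeholder written for this example and is not the judge prompt from the repository; the reply format `Score: N` is likewise an assumption.

```python
import re

# Placeholder template (assumption), NOT the actual judge prompt.
JUDGE_TEMPLATE = (
    "Input content: {c_i}\n"
    "Input style: {s_i}\n"
    "Response content: {c_o}\n"
    "Response style: {s_o}\n"
    "Rate how well the response fits the input on a 1-5 scale.\n"
    "Answer in the form 'Score: N'."
)

_SCORE_RE = re.compile(r"score\s*:\s*([1-5])\b", re.IGNORECASE)


def build_judge_prompt(c_i: str, s_i: str, c_o: str, s_o: str) -> str:
    """Fill the template with the two transcripts and the two style labels."""
    return JUDGE_TEMPLATE.format(c_i=c_i, s_i=s_i, c_o=c_o, s_o=s_o)


def parse_score(reply: str) -> int:
    """Extract an integer score in [1, 5]; reject anything else."""
    match = _SCORE_RE.search(reply)
    if match is None:
        raise ValueError(f"no valid 1-5 score found in reply: {reply!r}")
    return int(match.group(1))
```

Rejecting malformed replies instead of silently defaulting keeps unparseable judge outputs from being counted as scores.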
The reported ParaS2S average score on the main page covers all 2,690 prompts. We first
take the mean fitness score within each of the six subsets (the four synthetic dimensions
and the two real corpora), then average the six subset means with equal weight.
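If the equal weighting is read as a macro-average over the six subsets (an interpretation assumed here), the aggregation can be sketched as below, next to the per-prompt (micro) mean for comparison. The function names and the toy scores are illustrative only.

```python
from statistics import mean
from typing import Dict, List


def macro_average(scores_by_subset: Dict[str, List[float]]) -> float:
    """Mean of per-subset means: every subset counts equally."""
    if not scores_by_subset or any(not s for s in scores_by_subset.values()):
        raise ValueError("every subset needs at least one score")
    return mean(mean(scores) for scores in scores_by_subset.values())


def micro_average(scores_by_subset: Dict[str, List[float]]) -> float:
    """Mean over all prompts: larger subsets weigh more."""
    flat = [x for scores in scores_by_subset.values() for x in scores]
    if not flat:
        raise ValueError("no scores given")
    return mean(flat)


# Toy scores (made up) showing that the two aggregates can differ.
toy = {"subset_a": [5.0, 3.0], "subset_b": [1.0]}
toy_macro = macro_average(toy)
toy_micro = micro_average(toy)
```

The distinction matters here because the subsets differ in size (300 prompts per synthetic dimension versus 709 and 781 for the real corpora), so the two aggregates generally do not coincide.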
Validation
Sanity Checks on the Reconstructed Benchmark
To confirm that our reconstruction is faithful, we ran two checks:
Human–automatic correlation. On a 120-prompt subset (30 per synthetic
dimension), Pearson correlation between the Qwen3 judge and the average of three human
raters is 0.74 (Emotion), 0.71 (Sarcasm),
0.83 (Age), 0.69 (Gender), comparable to the
paper's reported 0.70–0.86 (Gender falls marginally below that range).
Baseline reproduction. Running Qwen2.5-Omni (vanilla) on our
reconstructed benchmark gives an average score within ±0.15 of the value the paper
reports for the same model, indicating that the prompt difficulty is comparable.
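The human–automatic correlation check can be reproduced with a plain Pearson coefficient between the judge scores and the per-prompt mean of the human ratings. The sketch below uses only the standard library; the helper names and the toy ratings are assumptions for illustration and are not data from the study.

```python
import math
from typing import List, Sequence


def mean_human_rating(ratings_per_prompt: Sequence[Sequence[float]]) -> List[float]:
    """Average the raters' scores for each prompt."""
    return [sum(r) / len(r) for r in ratings_per_prompt]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Toy example (made-up numbers): four prompts, three raters each.
judge_scores = [1, 2, 3, 4]
human_ratings = [(1, 1, 2), (2, 3, 2), (3, 3, 3), (5, 4, 4)]
r = pearson(judge_scores, mean_human_rating(human_ratings))
```

Computing the coefficient separately per dimension, as reported above, only requires calling `pearson` on each dimension's subset of prompts.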
We will release the prompt manifests and the filtering scripts together
with the camera-ready version; the anonymous repository linked from the main page already
contains the audio prompts, the judge prompt, and the scoring code used to produce every
ParaS2SBench number reported in the submission.