Note on reproduction. The original ParaS2SBench proposed by Zhang et al. (arXiv:2511.08723) has not been publicly released: the prompt audio, the filtering scripts, and the LLM-judge rubric were all unavailable at the time of writing. To make our results comparable, we therefore rebuilt a benchmark of the same design by following the construction recipe described in the paper. The reconstructed benchmark is used for every ParaS2SBench number reported on the main page. We document the procedure below so that reviewers can audit any deviation from the original protocol.
Reference
Original ParaS2SBench (as described in the paper)

ParaS2SBench is a paralinguistic speech-to-speech benchmark that evaluates whether an S2S model's spoken response is appropriate to the paralinguistic attributes of the spoken input — not only its textual content. The benchmark covers four paralinguistic dimensions and combines synthesized and real speech.

Split      Dimension          Categories                                        Prompts  Utterances
Synthetic  Emotion            Happy / Surprised / Sad / Angry / Fear / Disgust  300      600
Synthetic  Sarcasm            Sincere / Sarcastic                               300      600
Synthetic  Age                Adult / Child                                     300      600
Synthetic  Gender             Male / Female                                     300      600
Real       Emotion (IEMOCAP)  6 emotion classes                                 709
Real       Emotion (MELD)     6 emotion classes                                 781
Total                                                                           2,690    ≈ 7.8 h
Our Reconstruction
Construction Pipeline

We follow the five-stage recipe described in §3 of the ParaS2SBench paper. Every stage is a best-effort re-implementation; we keep the category counts and per-category prompt counts identical to the paper so that scores are directly comparable in magnitude.

[Figure: pipeline diagram. Stage 1 Candidate Generation (contrastive style pairs, gpt-4o) → Stage 2 Quality Filtering (3 LLM-based filters) → Stage 3 Speech Synthesis (gpt-4o-mini-tts for Emotion / Sarcasm, CosyVoice zero-shot for Age / Gender; WER + emotion2vec filter) → Stage 4 Train/Test Disjoint → Stage 5 Human Verification (3 annotators, unanimous) → Reconstructed ParaS2SBench, merged with public real speech (IEMOCAP 709, MELD 781).]
Figure. Five-stage reconstruction pipeline. Synthetic prompts flow through generation → filtering → TTS (two branches by dimension) → disjointness check → unanimous human verification; the real-speech split is sourced from public IEMOCAP and MELD releases and merged at the end to form the 2,690-prompt benchmark used for every ParaS2SBench number on the main page.
  1. Stage 1 — Candidate generation. We prompt gpt-4o to write spoken queries spanning the topical domains listed in the paper (interests, work, studies, relationships, travel, health, religion, fashion, finance). For each query we also generate its contrastive style pair (e.g. the same sentence delivered sincerely vs. sarcastically) so that the paralinguistic channel — not the lexical content — is what determines the appropriate response.
  2. Stage 2 — Quality filtering. Three LLM-based filters reject unsuitable candidates, mirroring the paper's filters:
    • Neutrality test: a candidate is kept only if its text alone does not give away the style, i.e. the sentence could plausibly be voiced in either of the contrastive styles.
    • Reasonability test: the content–style pairing must be sensible (e.g. not "shouting a lullaby").
    • Paralinguistic-relevance test: the two styles must elicit different ideal responses, otherwise the prompt cannot tell paralinguistic-aware models apart from text-only ones.
  3. Stage 3 — Speech synthesis. We synthesize the surviving queries with two TTS systems, matching the paper's assignment:
    • gpt-4o-mini-tts with style instructions for the emotion and sarcasm splits;
    • CosyVoice-300M zero-shot voice cloning for the age and gender splits, using reference voices sampled from LibriSpeech, CommonVoice and NNCES.
    Synthesized utterances are post-filtered by Whisper-v3 WER against the target transcript and by an emotion2vec classifier score; failed utterances are re-synthesized or discarded.
  4. Stage 4 — Train/test disjointness. Query topics and TTS reference speakers used for benchmark prompts are excluded from the training corpus of Silent Tags, exactly as the paper requires, to avoid contamination.
  5. Stage 5 — Human verification. Three annotators listen to every retained prompt and mark content-correct and style-correct. Only prompts that pass both checks unanimously enter the final set. We retained 1,200 synthetic prompts after verification (300 per dimension), matching the paper.
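The acceptance rules of Stages 3 and 5 can be sketched as follows. This is a minimal illustration, not the scripts used for the benchmark: the function names, the `Vote` record, and both thresholds (`MAX_WER`, `MIN_CLASSIFIER_SCORE`) are assumptions introduced here, and the ASR transcript and classifier score are taken as precomputed inputs rather than obtained from Whisper or emotion2vec.

```python
from dataclasses import dataclass
from typing import List

MAX_WER = 0.1               # assumed threshold, not taken from the paper
MIN_CLASSIFIER_SCORE = 0.5  # assumed threshold, not taken from the paper


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def passes_stage3(target: str, asr_transcript: str, classifier_score: float) -> bool:
    """Keep an utterance only if both automatic post-filters accept it."""
    return (word_error_rate(target, asr_transcript) <= MAX_WER
            and classifier_score >= MIN_CLASSIFIER_SCORE)


@dataclass
class Vote:
    """One annotator's judgment of a prompt (hypothetical record layout)."""
    content_correct: bool
    style_correct: bool


def passes_stage5(votes: List[Vote]) -> bool:
    """Unanimous rule: all three annotators accept both content and style."""
    return len(votes) == 3 and all(
        v.content_correct and v.style_correct for v in votes
    )
```

Note that `word_error_rate` only lowercases and splits on whitespace; a real filter would probably also normalize punctuation before comparing transcripts.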

For the real-speech split we use the public IEMOCAP and MELD releases directly, applying the same six-emotion mapping as the paper. We do not redistribute this audio; the benchmark metadata references only the official sample IDs.

Scoring
Automatic Judging Pipeline

Each S2S model under test consumes a prompt audio and produces a response audio. We then extract four fields and feed them to an LLM judge, following the same decomposition as the paper:

The judge returns a single fitness score on a 1–5 Likert scale. The paper used ChatGPT 4.1 as the judge. Because the original rubric has not been released, we wrote a judge prompt that matches the paper's textual description of the rubric and use Qwen3-32B-Instruct as the scoring LLM (the same judge used for the main-page numbers). The full judge prompt is included in our anonymous repository.
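Turning a judge reply into a usable score could look like the sketch below. It assumes, purely for illustration, that the judge prompt asks the model to end its reply with a line such as `Score: 4`; that output format and the helper name `parse_fitness_score` are hypothetical and are not taken from the paper or from our judge prompt.

```python
import re
from typing import Optional

# Matches "score: N" or "score = N" where N is an integer from 1 to 5.
# The negative lookahead rejects "3.5" and "45" but accepts "4." at the
# end of a sentence.
_SCORE_PATTERN = re.compile(r"score\s*[:=]\s*([1-5])(?!\.?\d)", re.IGNORECASE)


def parse_fitness_score(judge_reply: str) -> Optional[int]:
    """Return the 1-5 score found in the reply, or None if none is valid."""
    match = _SCORE_PATTERN.search(judge_reply)
    return int(match.group(1)) if match else None
```

Returning `None` instead of a default value keeps malformed replies visible, so they can be re-queried or excluded rather than silently counted.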

The reported ParaS2S average score on the main page is the mean fitness score over all 2,690 prompts, weighted equally across the four synthetic dimensions and the two real corpora.
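One reading of the equal weighting described above is a macro-average: compute the mean score inside each of the six subsets (four synthetic dimensions, two real corpora), then average those six means. The sketch below shows that reading next to the plain per-prompt mean. The function names and the `(subset, score)` input layout are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple

Scores = List[Tuple[str, float]]  # (subset name, fitness score) per prompt


def subset_means(scores: Scores) -> Dict[str, float]:
    """Mean fitness score within each subset."""
    by_subset = defaultdict(list)
    for subset, score in scores:
        by_subset[subset].append(score)
    return {subset: mean(values) for subset, values in by_subset.items()}


def macro_average(scores: Scores) -> float:
    """Every subset counts equally, whatever its number of prompts."""
    return mean(subset_means(scores).values())


def micro_average(scores: Scores) -> float:
    """Every prompt counts equally, so larger subsets weigh more."""
    return mean(score for _, score in scores)
```

The two averages coincide only when all subsets have the same size, which is not the case here (300 prompts per synthetic dimension, 709 and 781 for the real corpora).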

Validation
Sanity Checks on the Reconstructed Benchmark

To confirm that our reconstruction is faithful, we ran two checks:

We will release the prompt manifests, the judge prompt, and the filtering scripts together with the camera-ready version; the anonymous repository linked from the main page already contains the audio prompts and the scoring code used to produce every ParaS2SBench number reported in the submission.