Demo · Paralinguistic Expressiveness
Emotion & Speaking Style

Both systems share the same Thinker: the baseline uses the vanilla Talker, while ours uses the Silent Tags Talker. All inputs are spoken queries, and responses are generated end-to-end with no explicit tags at inference.

| User Query | Emotion | Response | Prosody | Baseline (Qwen3-Omni) | Silent Tags |
| --- | --- | --- | --- | --- | --- |
| I just lost my job, but I have an interview tomorrow. | Anxious | That sounds really tough. But having an interview lined up already — that's something to hold on to. | soothing / encouraging | [audio] | [audio] |
| I finally finished my dissertation after three years! | Joyful | Three years of hard work — and now it's done. That's a huge achievement. | cheerful / excited | [audio] | [audio] |
| Why did you do that? I told you not to touch my things. | Angry | I'm sorry — I didn't mean to upset you. Can we talk about it? | gentle / apologetic | [audio] | [audio] |
| I'm feeling really tired lately and I don't know why. | Sad | That kind of persistent tiredness can be really draining. Sometimes our body is asking us to slow down. | soft / empathetic | [audio] | [audio] |
| [Whispered] I'm in a library — can you keep it down? | Neutral | Of course, I'll keep my voice low. What can I help you with? | whisper | [audio] | [audio] |
Demo · Intra-Response Trajectory
Token-Free Prosody Shift Within a Single Response

Silent tags re-read the Thinker's hidden states at every decoding step, so a reply can open soothingly and pivot to an encouraging tone mid-sentence — no inline tags inserted.
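
As a rough illustration, the sketch below (PyTorch; `talker`, `encoder_states`, and every name here are hypothetical placeholders, not a released API) shows where the per-step re-reading would sit in a greedy decoding loop:

```python
import torch

# Minimal sketch: the Talker consumes the SAME lower-layer Thinker states
# at every decoding step; what changes from step to step is the
# cross-attention pattern over those states, which is what lets prosody
# shift mid-sentence without any inline tags.
@torch.no_grad()
def greedy_decode(talker, thinker_lower_states, bos_id=1, eos_id=2, max_steps=512):
    """thinker_lower_states: (1, src_len, d_model) from the frozen Thinker."""
    tokens = [bos_id]
    for _ in range(max_steps):
        step_in = torch.tensor([tokens])
        # Cross-attention inside `talker` re-reads the hidden states here.
        logits = talker(step_in, encoder_states=thinker_lower_states)
        next_id = int(logits[0, -1].argmax())
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens
```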

| Scenario | Response (prosody labels shown for reference only) | Baseline | Silent Tags |
| --- | --- | --- | --- |
| User distressed about a setback | soothing: "I hear you — that's genuinely hard, and it's okay to feel overwhelmed." → encouraging: "But you've navigated tough situations before, and you have more tools than you realise." | [audio] | [audio] |
| User shares exciting news nervously | warm: "That's wonderful news — congratulations!" → excited: "I can only imagine how much work went into this. You should be really proud!" | [audio] | [audio] |
| Sarcastic question about obvious advice | dry: "Yes, drinking water when you're thirsty is indeed recommended." → playful: "But in all seriousness, hydration has measurable effects on focus — want the details?" | [audio] | [audio] |
Demo · Cross-Thinker Transfer
Qwen2.5-Omni → Qwen3 via Lightweight Adapter

The trained Talker is reused with a stronger Thinker through a ~1M-parameter linear adapter that trains in minutes (sketched below). Both expressiveness and reasoning improve with the stronger Thinker.
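
A minimal sketch of what such an adapter could look like, assuming hypothetical hidden sizes of 1024 (new Thinker) and 896 (Talker), which give roughly 0.9M parameters:

```python
import torch
import torch.nn as nn

class ThinkerAdapter(nn.Module):
    """Maps the new Thinker's lower-layer hidden states into the space the
    trained Talker already expects. Only this module is trained; both the
    Thinker and the Talker stay frozen."""

    def __init__(self, new_dim=1024, talker_dim=896):
        super().__init__()
        # A single linear map: 1024 * 896 weights + 896 biases ≈ 0.9M params.
        self.proj = nn.Linear(new_dim, talker_dim)

    def forward(self, hidden_states):
        # hidden_states: (batch, seq_len, new_dim) -> (batch, seq_len, talker_dim)
        return self.proj(hidden_states)
```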

| User Query | Response | Silent Tags (Qwen2.5-Omni) | Silent Tags (Qwen3 + adapter) |
| --- | --- | --- | --- |
| My dog passed away last night. I don't know how to cope. | I'm so sorry for your loss. Losing a pet is a real grief. Give yourself time — it's okay to feel exactly the way you're feeling right now. | [audio] | [audio] |
| Can you explain why the sky is blue? I have a presentation tomorrow. | Sure! The sky appears blue due to Rayleigh scattering — shorter wavelengths scatter more through the atmosphere. That one-sentence version should land well for your presentation. | [audio] | [audio] |

Method
How Silent Tags Works

Silent Tags comprises three components: a paralinguistically annotated speech-to-speech (S2S) corpus, a Talker extended with cross-attention over the Thinker's lower-layer hidden states, and a Talker-only training objective. The Thinker is always frozen.

A GPT-4-class LLM annotates responses with per-segment paralinguistic tags (48 categories covering emotions, speaking styles, and vocal events). A tag-controllable TTS teacher renders each segment and provides dense tag embeddings. The Talker's cross-attention reads only layers 0–6 of the Thinker — where paralinguistic content concentrates — forming silent tags: dense, step-by-step prosodic representations.
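
The reader below is a minimal PyTorch sketch of that idea, with hypothetical names and dimensions: a learned mixture over the Thinker's layers 0–6 is projected and cross-attended by the Talker, and the per-step attention output serves as the silent tag:

```python
import torch
import torch.nn as nn

class SilentTagReader(nn.Module):
    def __init__(self, d_thinker=1024, d_talker=896, n_heads=8, n_lower=7):
        super().__init__()
        # Learned mixture weights over layers 0..6, where paralinguistic
        # content concentrates.
        self.layer_weights = nn.Parameter(torch.zeros(n_lower))
        self.proj = nn.Linear(d_thinker, d_talker)
        self.cross_attn = nn.MultiheadAttention(d_talker, n_heads, batch_first=True)

    def forward(self, talker_states, thinker_layers):
        # talker_states: (batch, tgt_len, d_talker), one row per decoding step.
        # thinker_layers: list of 7 tensors, each (batch, src_len, d_thinker).
        w = torch.softmax(self.layer_weights, dim=0)
        mixed = sum(wi * h for wi, h in zip(w, thinker_layers))
        memory = self.proj(mixed)
        silent_tags, _ = self.cross_attn(talker_states, memory, memory)
        return silent_tags  # one dense prosodic vector per decoding step
```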

Training loss: L = L_gen + λ₁ L_distill + λ₂ L_aux. The distillation loss aligns the silent tags with the teacher's tag embeddings; the auxiliary classifier predicts segment-level tags. Both auxiliary terms are discarded at inference — no tokens, no extra latency.
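
A sketch of how the three terms might be combined in training code, assuming a next-token loss for L_gen, an MSE distillation term against the teacher's tag embeddings, and a 48-way segment classifier; all names and default weights are illustrative:

```python
import torch.nn.functional as F

def silent_tags_loss(gen_logits, speech_targets, silent_tags, teacher_embs,
                     tag_logits, tag_labels, lam1=1.0, lam2=0.1):
    # L_gen: standard next-token loss on the Talker's speech tokens.
    # gen_logits: (batch, seq, vocab); speech_targets: (batch, seq).
    l_gen = F.cross_entropy(gen_logits.transpose(1, 2), speech_targets)
    # L_distill: align the per-step silent tags with the TTS teacher's
    # dense tag embeddings.
    l_distill = F.mse_loss(silent_tags, teacher_embs)
    # L_aux: segment-level classification over the 48 tag categories.
    l_aux = F.cross_entropy(tag_logits, tag_labels)
    # Both auxiliary heads are dropped at inference time.
    return l_gen + lam1 * l_distill + lam2 * l_aux
```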

[ Figure 1: Overview ]
Figure 1. Overview of Silent Tags. The frozen Thinker processes speech input; the Talker attends to lower-layer hidden states via cross-attention, supervised by a tag-controllable TTS teacher. At inference, no explicit tags are used.
Results
Expressiveness Without the Reasoning Trade-off

Silent Tags reaches an average score of 4.31 on ParaS2SBench (Qwen3) and a 65.7% win rate on EmergentTTS-Eval (Kimi-Audio), matching or surpassing explicit-paradigm methods that modify the Thinker.

Crucially, VoiceBench (72.2 vs. 72.4), MMLU (68.0 vs. 68.1), and GSM8K (64.6 vs. 64.7) are preserved to within 0.2 points, whereas explicit methods lose 2–5 points on MMLU and 4–7 on GSM8K.

[ Table 1–2: ParaS2SBench & EmergentTTS-Eval — replace with pics/results.png ]
Table 1–2. Expressiveness benchmarks. Bold: best open-source result. † marks methods that modify the Thinker.