Demo · Paralinguistic Expressiveness
Emotion & Speaking Style

Same Thinker; vanilla Talker vs. Silent Tags. All inputs are spoken queries — responses are generated end-to-end with no explicit tags at inference.

| User Query | Emotion | Response | Prosody | Baseline (Qwen3-Omni) | Silent Tags |
|---|---|---|---|---|---|
| I just lost my job, but I have an interview tomorrow. | Anxious | That sounds really tough. But having an interview lined up already — that's something to hold on to. | soothing / encouraging | (audio) | (audio) |
| I finally finished my dissertation after three years! | Joyful | Three years of hard work — and now it's done. That's a huge achievement. | cheerful / excited | (audio) | (audio) |
| Why did you do that? I told you not to touch my things. | Angry | I'm sorry — I didn't mean to upset you. Can we talk about it? | gentle / apologetic | (audio) | (audio) |
| I'm feeling really tired lately and I don't know why. | Sad | That kind of persistent tiredness can be really draining. Sometimes our body is asking us to slow down. | soft / empathetic | (audio) | (audio) |
| [Whispered] I'm in a library — can you keep it down? | Neutral | Of course, I'll keep my voice low. What can I help you with? | whisper | (audio) | (audio) |
Demo · Intra-Response Trajectory
Token-Free Prosody Shift Within a Single Response

Silent tags re-read the Thinker's hidden states at every decoding step, so a reply can open soothingly and pivot to an encouraging tone mid-sentence — no inline tags inserted.

| Scenario | Response (prosody labels shown for reference only) | Baseline | Silent Tags |
|---|---|---|---|
| User distressed about a setback | [soothing] "I hear you — that's genuinely hard, and it's okay to feel overwhelmed." [encouraging] "But you've navigated tough situations before, and you have more tools than you realise." | (audio) | (audio) |
| User shares exciting news nervously | [warm] "That's wonderful news — congratulations!" [excited] "I can only imagine how much work went into this. You should be really proud!" | (audio) | (audio) |
| Sarcastic question about obvious advice | [dry] "Yes, drinking water when you're thirsty is indeed recommended." [playful] "But in all seriousness, hydration has measurable effects on focus — want the details?" | (audio) | (audio) |
Method
How Silent Tags Works

Silent Tags comprises three components: a paralinguistically annotated S2S corpus, a Talker extended with cross-attention over the Thinker's lower hidden states, and a Talker-only training objective. The Thinker is always frozen.
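To make the corpus component concrete, a per-segment annotation record might look like the following sketch; the field names and tag strings are illustrative assumptions, not the actual schema (the annotation pipeline itself is described next).

```python
# Hypothetical per-segment annotation record for the S2S corpus
# (field names and tag strings are illustrative, not the actual schema).
example_record = {
    "query_audio": "clips/user_0421.wav",      # spoken user query
    "response_text": "I hear you, that's genuinely hard. "
                     "But you've handled tough situations before.",
    "segments": [                              # per-segment tags, 48 categories total
        {"text": "I hear you, that's genuinely hard.", "tag": "soothing"},
        {"text": "But you've handled tough situations before.", "tag": "encouraging"},
    ],
}
```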

A GPT-4-class LLM annotates responses with per-segment paralinguistic tags (48 categories covering emotions, speaking styles, and vocal events). A tag-controllable TTS teacher renders each segment and provides dense tag embeddings. The Talker's cross-attention reads only layers 0–6 of the Thinker — where paralinguistic content concentrates — forming silent tags: dense, step-by-step prosodic representations.
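A minimal sketch of this readout, assuming a standard multi-head cross-attention block; the module name `SilentTagReadout`, the mean-pooling over the seven selected layers, and the head count are our assumptions, not the released design:

```python
import torch
import torch.nn as nn

class SilentTagReadout(nn.Module):
    """Cross-attention from the Talker over Thinker layers 0-6 (a sketch;
    names and pooling choices are assumptions, not the released design)."""

    def __init__(self, d_thinker: int, d_talker: int, n_heads: int = 8,
                 para_layers: range = range(0, 7)):
        super().__init__()
        self.para_layers = list(para_layers)
        self.proj = nn.Linear(d_thinker, d_talker)   # map Thinker dim -> Talker dim
        self.attn = nn.MultiheadAttention(d_talker, n_heads, batch_first=True)

    def forward(self, talker_h: torch.Tensor,
                thinker_hiddens: list[torch.Tensor]) -> torch.Tensor:
        # thinker_hiddens: per-layer hidden states, each [B, T_in, d_thinker],
        # as returned with output_hidden_states=True. Only layers 0-6 are read,
        # so the semantic (upper-layer) pathway is never touched.
        lower = torch.stack([thinker_hiddens[l] for l in self.para_layers], dim=1)
        memory = self.proj(lower.mean(dim=1))        # [B, T_in, d_talker]; mean-pool layers
        # talker_h: [B, T_dec, d_talker] -- queried at every decoding step,
        # which is what lets prosody shift mid-response.
        silent_tags, _ = self.attn(talker_h, memory, memory)
        return silent_tags                           # dense per-step prosody representation
```

Because the readout is queried with the Talker's hidden state at every decoding step, the resulting representation can change mid-response, which is what the intra-response demo above exercises.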

Training loss: L = L_gen + λ₁·L_distill + λ₂·L_aux. The distillation loss aligns silent tags with teacher embeddings; the auxiliary classifier predicts segment-level tags. Both are discarded at inference: no tokens, no extra latency.
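A sketch of how the three terms might combine, assuming a cosine-distance distillation loss and a 48-way cross-entropy auxiliary head; the loss forms, tensor shapes, and λ values are illustrative, not the published configuration:

```python
import torch.nn.functional as F

def silent_tags_loss(gen_logits, speech_targets,    # Talker generation head
                     silent_tags, teacher_emb,      # per-step readout vs. TTS teacher
                     aux_logits, segment_tags,      # 48-way segment tag labels
                     lam1: float = 1.0, lam2: float = 0.1):
    """L = L_gen + lam1 * L_distill + lam2 * L_aux (forms assumed: cosine
    distillation, cross-entropy aux; only the Talker receives gradients)."""
    # gen_logits: [B, T, V] -> [B, V, T] for token-level cross-entropy.
    l_gen = F.cross_entropy(gen_logits.transpose(1, 2), speech_targets)
    # Align per-step silent tags with the teacher's dense tag embeddings.
    l_distill = (1 - F.cosine_similarity(silent_tags, teacher_emb, dim=-1)).mean()
    # Auxiliary classifier over the 48 paralinguistic categories.
    l_aux = F.cross_entropy(aux_logits, segment_tags)
    return l_gen + lam1 * l_distill + lam2 * l_aux  # distill + aux dropped at inference
```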

Figure 1. Overview of Silent Tags. The frozen Thinker processes speech input; the Talker attends to lower-layer hidden states via cross-attention, supervised by a tag-controllable TTS teacher. At inference, no explicit tags are used.

Layer selection. Layer-wise studies on text and speech Transformers consistently report that lower layers preferentially encode prosodic, affective, and speaker information, while upper layers carry semantic content. We confirm this for an omni-modal Thinker before designing the Talker. Specifically, we train per-layer linear probes on frozen Qwen2.5-Omni hidden states to predict three paralinguistic events on a held-out subset of our corpus and report the F1 score at each layer (Figure 2). Probe F1 peaks sharply around layer 4 (78.7% on average, surpassing a Whisper-Large-v3 reference at 73.4%) and degrades monotonically toward the upper layers, falling below the random and text-embedding baselines beyond layer 20. This empirical layer profile motivates a hard cut: we designate layers 0 to 6 as paralinguistic layers and layers 7 to L as semantic layers, and restrict the Talker's cross-attention to the former. The semantic generation pathway is thus left intact, while the paralinguistic readout concentrates where the signal is strongest.
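This probing setup is straightforward to replicate. A minimal sketch, assuming hidden states have already been extracted from the frozen Thinker (e.g., with output_hidden_states=True) and mean-pooled over time; the function name and data layout are our own:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def probe_layers(hiddens_by_layer, labels, train_idx, test_idx):
    """Train one linear probe per Thinker layer on frozen features and
    report macro-F1. hiddens_by_layer: list of [N, D] arrays, one per layer
    (mean-pooled over time); labels: [N] paralinguistic event classes."""
    scores = []
    for layer, feats in enumerate(hiddens_by_layer):
        clf = LogisticRegression(max_iter=1000).fit(feats[train_idx], labels[train_idx])
        f1 = f1_score(labels[test_idx], clf.predict(feats[test_idx]), average="macro")
        scores.append((layer, f1))
    return scores  # expect a peak around layers 4-6, decaying toward upper layers
```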

Figure 2. Linear probing F1 (%) of paralinguistic events across Thinker layer indices. Layers 4–6 achieve the highest scores, confirming that paralinguistic content concentrates in the lower layers and motivating the cross-attention design over layers 0–6.
Results
Expressiveness Without the Reasoning Trade-off

Silent Tags reaches an average score of 4.31 on ParaS2SBench (Qwen3) and a 65.7% win rate on EmergentTTS-Eval (Kimi-Audio), matching or surpassing explicit-paradigm methods that modify the Thinker.

Crucially, reasoning and general ability are fully preserved: VoiceBench (72.2 vs. 72.4), MMLU (68.0 vs. 68.1), and GSM8K (64.6 vs. 64.7). Explicit methods, by contrast, lose 2–5 points on MMLU and 4–7 on GSM8K.

Figure 3. Ablation on cross-attention layer range. Restricting cross-attention to the shallow layers (Shallow, 0–6) or shallow-to-mid layers (Shallow+Mid, 0–13) consistently outperforms mid and deep configurations across ParaS2S, Emo Acc, Para Recall, UTMOS, and WER.