| Audio Interaction Model | Speech & audio | Zhifei Xie +10 | Jun 3, 2026 | 107 |
| EvalVerse: Pipeline-Aware and Expert-Calibrated Benchmarking for Professional Cinematic Video Generation | Speech & audio | Songlin Yang +25 | May 22, 2026 | 79 |
| SwanVoice: Expressive Long-Form Zero-Shot Speech Synthesis for Both Monologue and Dialogue | Speech & audio | Ruiqi Li +5 | May 29, 2026 | 57 |
| MMAE: A Massive Multitask Audio Editing Benchmark | Speech & audio | Ziyang Ma +37 | Jun 5, 2026 | 44 |
| LongAV-Compass: Towards Unified Evaluation of Minute-Scale Audio-Visual Generation Across T2AV, I2AV, and V2AV | Speech & audio | Tengfei Liu +19 | May 25, 2026 | 38 |
| Towards Streaming Synchronized Spatial Audio Generation via Autoregressive Diffusion Transformer | Speech & audio | Ke Lei +6 | May 29, 2026 | 37 |
| Native Audio-Visual Alignment for Generation | Speech & audio | Longbin Ji +8 | May 28, 2026 | 33 |
| Lip Forcing: Few-Step Autoregressive Diffusion for Real-time Lip Synchronization | Speech & audio | Paul Hyunbin Cho +9 | Jun 9, 2026 | 32 |
| Comprehensive Benchmarking of Long-Form Speech Generation in Diverse Scenarios | Speech & audio | Changhao Pan +14 | May 27, 2026 | 31 |
| Share More, Search Less: Collaborative Parallel Thinking for Efficient Test-Time Scaling | Speech & audio | Xinglin Wang +11 | May 26, 2026 | 31 |
| Orchestra-o1: Omnimodal Agent Orchestration | Speech & audio | Fan Zhang +10 | Jun 10, 2026 | 27 |
| Data Journalist Agent: Transforming Data into Verifiable Multimodal Stories | Speech & audio | Kevin Qinghong Lin +5 | Jun 9, 2026 | 26 |
| Gemini Embedding 2: A Native Multimodal Embedding Model from Gemini | Speech & audio | Madhuri Shanbhogue +39 | May 26, 2026 | 23 |
| Watch, Remember, Reason: Human-View Video Understanding with MLLMs | Speech & audio | Jiahao Meng +14 | Jun 5, 2026 | 21 |
| From Prompt Injection to Persistent Control: Defending Agentic Harness Against Trojan Backdoors | Speech & audio | Jiejun Tan +6 | May 29, 2026 | 18 |
| StreamChar: Long-Horizon Streaming Character Audio-Video Generation with Decoupled Orchestration | Speech & audio | Linrui Tian +2 | May 25, 2026 | 16 |
| Whisper Hallucination Detection and Mitigation via Hidden Representation Steering and Sparse AutoEncoders | Speech & audio | Georgii Aparin +3 | Jun 5, 2026 | 15 |
| dots.tts Technical Report | Speech & audio | Shi Lian +8 | Jun 5, 2026 | 15 |
| OmniCap-IF: Benchmarking and Improving Instruction Following Abilities for Omni-Video Captioning | Speech & audio | Jiahao Wang +15 | Jun 7, 2026 | 14 |
| Interpreting and Steering a Text-to-Speech Language Model with Sparse Autoencoders | Speech & audio | Nikita Koriagin +3 | Jun 8, 2026 | 12 |
| DEMON: Diffusion Engine for Musical Orchestrated Noise | Speech & audio | Ryan Fosdick | May 27, 2026 | 11 |
| OmniVerifier-M1: Multimodal Meta-Verifier with Explicit Structured Recalibration | Speech & audio | Xinchen Zhang +9 | May 27, 2026 | 11 |
| ChildVox: A Speech, Audio, and Large Audio-Language Model Benchmark in Understanding and Characterizing Sound across Childhood | Speech & audio | Tiantian Feng +12 | May 28, 2026 | 10 |
| CoSPlay: Cooperative Self-Play at Test-Time with Self-Generated Code and Unit Test | Speech & audio | Zhangyi Hu +8 | May 22, 2026 | 9 |
| OmniVideo-100K: A Dataset for Audio-Visual Reasoning through Structured Scripts and Evidence Chains | Speech & audio | Xinyue Cai +4 | Jun 12, 2026 | 8 |
| Risk Under Pressure: Compute-Aware Evaluation of Adversarial Robustness in Language Models | Speech & audio | Malikeh Ehghaghi +3 | Jun 9, 2026 | 8 |
| Liberating LLM Capabilities in Full-Duplex Speech Models | Speech & audio | Luoyuan Zhang +6 | May 4, 2026 | 8 |
| MERIT: Learning Disentangled Music Representations for Audio Similarity | Speech & audio | Abhinaba Roy +2 | May 26, 2026 | 8 |
| SoundnessBench: Can Your AI Scientist Really Tell Good Research Ideas from Bad Ones? | Speech & audio | Sy-Tuyen Ho +3 | May 28, 2026 | 8 |
| Convex Low-resource Accent-Robust Language Detection in Speech Recognition | Speech & audio | Miria Feng +2 | May 22, 2026 | 6 |
| POISE: Position-Aware Undetectable Skill Injection on LLM Agents | Speech & audio | Haochang Hao +6 | Jun 6, 2026 | 4 |
| Entropy as a Structural Prior: How a Log-Barrier on DiT Belief Space Drives Musical Diversity and Development | Speech & audio | Zixi Li +1 | Jun 5, 2026 | 4 |
| Towards Truly Multilingual ASR: Generalizing Code-Switching ASR to Unseen Language Pairs | Speech & audio | Gio Paik +2 | Jun 4, 2026 | 4 |
| Access Sets Matter: Budgeting Expert Reads for Scalable Weight-Space Model Merging | Speech & audio | Yuanyi Wang +7 | May 28, 2026 | 4 |
| OpenSTBench: Beyond Semantic Evaluation for Speech Translation | Speech & audio | Yanjie An +7 | May 29, 2026 | 3 |
| OmniInteract: Benchmarking Real-World Streaming Interaction for Real-Time Omnimodal Assistants | Speech & audio | Xudong Lu +10 | May 26, 2026 | 3 |
| Where Rectified Flows Leak: Characterising Membership Signals Along the Interpolation Path | Speech & audio | Thomas Sesmat +2 | Jun 5, 2026 | 2 |
| Score-Control for Hallucination Reduction in Diffusion Models | Speech & audio | Mahesh Bhosale +5 | May 29, 2026 | 2 |
| PianoKontext: Expressive Performance Rendering from Deadpan Context | Speech & audio | Dmitrii Gavrilev | Jun 10, 2026 | 1 |
| Towards Human-Like Interactive Speech Recognition With Agentic Correction and Semantic Evaluation | Speech & audio | Zixuan Jiang +10 | May 28, 2026 | 1 |
| Multimodal Music Recommendation System using LLMs | Speech & audio | Srikar Prabhas Kandagatla +8 | May 28, 2026 | 1 |
| SpeechEditBench: A Bilingual Multi-Attribute Benchmark for Instruction-Guided Speech Editing | Speech & audio | Hanlin Zhang +5 | Jun 3, 2026 | 1 |
| SuperMemory-VQA: An Egocentric Visual Question-Answering Benchmark for Long-Horizon Memory | Speech & audio | Samiul Alam +6 | May 30, 2026 | 1 |
| Lost in Translation? Exploring the Shift in Grammatical Gender from Latin to Occitan | Speech & audio | Ahan Chatterjee +4 | May 9, 2026 | 1 |