Literature

Papers

Notable = featured in Hugging Face daily papers (community-upvoted) · every paper links to arXiv · citation counts from OpenAlex

Showing 1–44 of 44 notable papers

Paper	Topic	Authors	Published	HF upvotes
Audio Interaction Model	Speech & audio	Zhifei Xie +10	Jun 3, 2026	107
EvalVerse: Pipeline-Aware and Expert-Calibrated Benchmarking for Professional Cinematic Video Generation	Speech & audio	Songlin Yang +25	May 22, 2026	79
SwanVoice: Expressive Long-Form Zero-Shot Speech Synthesis for Both Monologue and Dialogue	Speech & audio	Ruiqi Li +5	May 29, 2026	57
MMAE: A Massive Multitask Audio Editing Benchmark	Speech & audio	Ziyang Ma +37	Jun 5, 2026	44
LongAV-Compass: Towards Unified Evaluation of Minute-Scale Audio-Visual Generation Across T2AV, I2AV, and V2AV	Speech & audio	Tengfei Liu +19	May 25, 2026	38
Towards Streaming Synchronized Spatial Audio Generation via Autoregressive Diffusion Transformer	Speech & audio	Ke Lei +6	May 29, 2026	37
Native Audio-Visual Alignment for Generation	Speech & audio	Longbin Ji +8	May 28, 2026	33
Lip Forcing: Few-Step Autoregressive Diffusion for Real-time Lip Synchronization	Speech & audio	Paul Hyunbin Cho +9	Jun 9, 2026	32
Comprehensive Benchmarking of Long-Form Speech Generation in Diverse Scenarios	Speech & audio	Changhao Pan +14	May 27, 2026	31
Share More, Search Less: Collaborative Parallel Thinking for Efficient Test-Time Scaling	Speech & audio	Xinglin Wang +11	May 26, 2026	31
Orchestra-o1: Omnimodal Agent Orchestration	Speech & audio	Fan Zhang +10	Jun 10, 2026	27
Data Journalist Agent: Transforming Data into Verifiable Multimodal Stories	Speech & audio	Kevin Qinghong Lin +5	Jun 9, 2026	26
Gemini Embedding 2: A Native Multimodal Embedding Model from Gemini	Speech & audio	Madhuri Shanbhogue +39	May 26, 2026	23
Watch, Remember, Reason: Human-View Video Understanding with MLLMs	Speech & audio	Jiahao Meng +14	Jun 5, 2026	21
From Prompt Injection to Persistent Control: Defending Agentic Harness Against Trojan Backdoors	Speech & audio	Jiejun Tan +6	May 29, 2026	18
StreamChar: Long-Horizon Streaming Character Audio-Video Generation with Decoupled Orchestration	Speech & audio	Linrui Tian +2	May 25, 2026	16
Whisper Hallucination Detection and Mitigation via Hidden Representation Steering and Sparse AutoEncoders	Speech & audio	Georgii Aparin +3	Jun 5, 2026	15
dots.tts Technical Report	Speech & audio	Shi Lian +8	Jun 5, 2026	15
OmniCap-IF: Benchmarking and Improving Instruction Following Abilities for Omni-Video Captioning	Speech & audio	Jiahao Wang +15	Jun 7, 2026	14
Interpreting and Steering a Text-to-Speech Language Model with Sparse Autoencoders	Speech & audio	Nikita Koriagin +3	Jun 8, 2026	12
DEMON: Diffusion Engine for Musical Orchestrated Noise	Speech & audio	Ryan Fosdick	May 27, 2026	11
OmniVerifier-M1: Multimodal Meta-Verifier with Explicit Structured Recalibration	Speech & audio	Xinchen Zhang +9	May 27, 2026	11
ChildVox: A Speech, Audio, and Large Audio-Language Model Benchmark in Understanding and Characterizing Sound across Childhood	Speech & audio	Tiantian Feng +12	May 28, 2026	10
CoSPlay: Cooperative Self-Play at Test-Time with Self-Generated Code and Unit Test	Speech & audio	Zhangyi Hu +8	May 22, 2026	9
OmniVideo-100K: A Dataset for Audio-Visual Reasoning through Structured Scripts and Evidence Chains	Speech & audio	Xinyue Cai +4	Jun 12, 2026	8
Risk Under Pressure: Compute-Aware Evaluation of Adversarial Robustness in Language Models	Speech & audio	Malikeh Ehghaghi +3	Jun 9, 2026	8
Liberating LLM Capabilities in Full-Duplex Speech Models	Speech & audio	Luoyuan Zhang +6	May 4, 2026	8
MERIT: Learning Disentangled Music Representations for Audio Similarity	Speech & audio	Abhinaba Roy +2	May 26, 2026	8
SoundnessBench: Can Your AI Scientist Really Tell Good Research Ideas from Bad Ones?	Speech & audio	Sy-Tuyen Ho +3	May 28, 2026	8
Convex Low-resource Accent-Robust Language Detection in Speech Recognition	Speech & audio	Miria Feng +2	May 22, 2026	6
POISE: Position-Aware Undetectable Skill Injection on LLM Agents	Speech & audio	Haochang Hao +6	Jun 6, 2026	4
Entropy as a Structural Prior: How a Log-Barrier on DiT Belief Space Drives Musical Diversity and Development	Speech & audio	Zixi Li +1	Jun 5, 2026	4
Towards Truly Multilingual ASR: Generalizing Code-Switching ASR to Unseen Language Pairs	Speech & audio	Gio Paik +2	Jun 4, 2026	4
Access Sets Matter: Budgeting Expert Reads for Scalable Weight-Space Model Merging	Speech & audio	Yuanyi Wang +7	May 28, 2026	4
OpenSTBench: Beyond Semantic Evaluation for Speech Translation	Speech & audio	Yanjie An +7	May 29, 2026	3
OmniInteract: Benchmarking Real-World Streaming Interaction for Real-Time Omnimodal Assistants	Speech & audio	Xudong Lu +10	May 26, 2026	3
Where Rectified Flows Leak: Characterising Membership Signals Along the Interpolation Path	Speech & audio	Thomas Sesmat +2	Jun 5, 2026	2
Score-Control for Hallucination Reduction in Diffusion Models	Speech & audio	Mahesh Bhosale +5	May 29, 2026	2
PianoKontext: Expressive Performance Rendering from Deadpan Context	Speech & audio	Dmitrii Gavrilev	Jun 10, 2026	1
Towards Human-Like Interactive Speech Recognition With Agentic Correction and Semantic Evaluation	Speech & audio	Zixuan Jiang +10	May 28, 2026	1
Multimodal Music Recommendation System using LLMs	Speech & audio	Srikar Prabhas Kandagatla +8	May 28, 2026	1
SpeechEditBench: A Bilingual Multi-Attribute Benchmark for Instruction-Guided Speech Editing	Speech & audio	Hanlin Zhang +5	Jun 3, 2026	1
SuperMemory-VQA: An Egocentric Visual Question-Answering Benchmark for Long-Horizon Memory	Speech & audio	Samiul Alam +6	May 30, 2026	1
Lost in Translation? Exploring the Shift in Grammatical Gender from Latin to Occitan	Speech & audio	Ahan Chatterjee +4	May 9, 2026	1