Papers

Showing 1–44 of 44 notable papers
| Paper | Topic | Authors | Published | HF ▲ |
| --- | --- | --- | --- | --- |
| Audio Interaction Model | Speech & audio | Zhifei Xie +10 | Jun 3, 2026 | 107 |
| EvalVerse: Pipeline-Aware and Expert-Calibrated Benchmarking for Professional Cinematic Video Generation | Speech & audio | Songlin Yang +25 | May 22, 2026 | 79 |
| SwanVoice: Expressive Long-Form Zero-Shot Speech Synthesis for Both Monologue and Dialogue | Speech & audio | Ruiqi Li +5 | May 29, 2026 | 57 |
| MMAE: A Massive Multitask Audio Editing Benchmark | Speech & audio | Ziyang Ma +37 | Jun 5, 2026 | 44 |
| LongAV-Compass: Towards Unified Evaluation of Minute-Scale Audio-Visual Generation Across T2AV, I2AV, and V2AV | Speech & audio | Tengfei Liu +19 | May 25, 2026 | 38 |
| Towards Streaming Synchronized Spatial Audio Generation via Autoregressive Diffusion Transformer | Speech & audio | Ke Lei +6 | May 29, 2026 | 37 |
| Native Audio-Visual Alignment for Generation | Speech & audio | Longbin Ji +8 | May 28, 2026 | 33 |
| Lip Forcing: Few-Step Autoregressive Diffusion for Real-time Lip Synchronization | Speech & audio | Paul Hyunbin Cho +9 | Jun 9, 2026 | 32 |
| Comprehensive Benchmarking of Long-Form Speech Generation in Diverse Scenarios | Speech & audio | Changhao Pan +14 | May 27, 2026 | 31 |
| Share More, Search Less: Collaborative Parallel Thinking for Efficient Test-Time Scaling | Speech & audio | Xinglin Wang +11 | May 26, 2026 | 31 |
| Orchestra-o1: Omnimodal Agent Orchestration | Speech & audio | Fan Zhang +10 | Jun 10, 2026 | 27 |
| Data Journalist Agent: Transforming Data into Verifiable Multimodal Stories | Speech & audio | Kevin Qinghong Lin +5 | Jun 9, 2026 | 26 |
| Gemini Embedding 2: A Native Multimodal Embedding Model from Gemini | Speech & audio | Madhuri Shanbhogue +39 | May 26, 2026 | 23 |
| Watch, Remember, Reason: Human-View Video Understanding with MLLMs | Speech & audio | Jiahao Meng +14 | Jun 5, 2026 | 21 |
| From Prompt Injection to Persistent Control: Defending Agentic Harness Against Trojan Backdoors | Speech & audio | Jiejun Tan +6 | May 29, 2026 | 18 |
| StreamChar: Long-Horizon Streaming Character Audio-Video Generation with Decoupled Orchestration | Speech & audio | Linrui Tian +2 | May 25, 2026 | 16 |
| Whisper Hallucination Detection and Mitigation via Hidden Representation Steering and Sparse AutoEncoders | Speech & audio | Georgii Aparin +3 | Jun 5, 2026 | 15 |
| dots.tts Technical Report | Speech & audio | Shi Lian +8 | Jun 5, 2026 | 15 |
| OmniCap-IF: Benchmarking and Improving Instruction Following Abilities for Omni-Video Captioning | Speech & audio | Jiahao Wang +15 | Jun 7, 2026 | 14 |
| Interpreting and Steering a Text-to-Speech Language Model with Sparse Autoencoders | Speech & audio | Nikita Koriagin +3 | Jun 8, 2026 | 12 |
| DEMON: Diffusion Engine for Musical Orchestrated Noise | Speech & audio | Ryan Fosdick | May 27, 2026 | 11 |
| OmniVerifier-M1: Multimodal Meta-Verifier with Explicit Structured Recalibration | Speech & audio | Xinchen Zhang +9 | May 27, 2026 | 11 |
| ChildVox: A Speech, Audio, and Large Audio-Language Model Benchmark in Understanding and Characterizing Sound across Childhood | Speech & audio | Tiantian Feng +12 | May 28, 2026 | 10 |
| CoSPlay: Cooperative Self-Play at Test-Time with Self-Generated Code and Unit Test | Speech & audio | Zhangyi Hu +8 | May 22, 2026 | 9 |
| OmniVideo-100K: A Dataset for Audio-Visual Reasoning through Structured Scripts and Evidence Chains | Speech & audio | Xinyue Cai +4 | Jun 12, 2026 | 8 |
| Risk Under Pressure: Compute-Aware Evaluation of Adversarial Robustness in Language Models | Speech & audio | Malikeh Ehghaghi +3 | Jun 9, 2026 | 8 |
| Liberating LLM Capabilities in Full-Duplex Speech Models | Speech & audio | Luoyuan Zhang +6 | May 4, 2026 | 8 |
| MERIT: Learning Disentangled Music Representations for Audio Similarity | Speech & audio | Abhinaba Roy +2 | May 26, 2026 | 8 |
| SoundnessBench: Can Your AI Scientist Really Tell Good Research Ideas from Bad Ones? | Speech & audio | Sy-Tuyen Ho +3 | May 28, 2026 | 8 |
| Convex Low-resource Accent-Robust Language Detection in Speech Recognition | Speech & audio | Miria Feng +2 | May 22, 2026 | 6 |
| POISE: Position-Aware Undetectable Skill Injection on LLM Agents | Speech & audio | Haochang Hao +6 | Jun 6, 2026 | 4 |
| Entropy as a Structural Prior: How a Log-Barrier on DiT Belief Space Drives Musical Diversity and Development | Speech & audio | Zixi Li +1 | Jun 5, 2026 | 4 |
| Towards Truly Multilingual ASR: Generalizing Code-Switching ASR to Unseen Language Pairs | Speech & audio | Gio Paik +2 | Jun 4, 2026 | 4 |
| Access Sets Matter: Budgeting Expert Reads for Scalable Weight-Space Model Merging | Speech & audio | Yuanyi Wang +7 | May 28, 2026 | 4 |
| OpenSTBench: Beyond Semantic Evaluation for Speech Translation | Speech & audio | Yanjie An +7 | May 29, 2026 | 3 |
| OmniInteract: Benchmarking Real-World Streaming Interaction for Real-Time Omnimodal Assistants | Speech & audio | Xudong Lu +10 | May 26, 2026 | 3 |
| Where Rectified Flows Leak: Characterising Membership Signals Along the Interpolation Path | Speech & audio | Thomas Sesmat +2 | Jun 5, 2026 | 2 |
| Score-Control for Hallucination Reduction in Diffusion Models | Speech & audio | Mahesh Bhosale +5 | May 29, 2026 | 2 |
| PianoKontext: Expressive Performance Rendering from Deadpan Context | Speech & audio | Dmitrii Gavrilev | Jun 10, 2026 | 1 |
| Towards Human-Like Interactive Speech Recognition With Agentic Correction and Semantic Evaluation | Speech & audio | Zixuan Jiang +10 | May 28, 2026 | 1 |
| Multimodal Music Recommendation System using LLMs | Speech & audio | Srikar Prabhas Kandagatla +8 | May 28, 2026 | 1 |
| SpeechEditBench: A Bilingual Multi-Attribute Benchmark for Instruction-Guided Speech Editing | Speech & audio | Hanlin Zhang +5 | Jun 3, 2026 | 1 |
| SuperMemory-VQA: An Egocentric Visual Question-Answering Benchmark for Long-Horizon Memory | Speech & audio | Samiul Alam +6 | May 30, 2026 | 1 |
| Lost in Translation? Exploring the Shift in Grammatical Gender from Latin to Occitan | Speech & audio | Ahan Chatterjee +4 | May 9, 2026 | 1 |