MODELYST
Literature

Papers

Showing 1–100 of 136 notable papers
Paper | Topic | Authors | Published | HF ▲
Kwai Keye-VL-2.0 Technical Report | Vision & multimodal | Kwai Keye Team +39 | Jun 9, 2026 | 183
LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding | Vision & multimodal | Shihao Wang +12 | May 26, 2026 | 139
MiniMax Sparse Attention | Vision & multimodal | Xunhao Lai +10 | Jun 11, 2026 | 125
Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models | Vision & multimodal | Mahtab Bigverdi +10 | Jun 3, 2026 | 117
WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces | Vision & multimodal | Wanli Li +6 | Jun 8, 2026 | 97
Agent Explorative Policy Optimization for Multimodal Agentic Reasoning | Vision & multimodal | Minki Kang +6 | May 27, 2026 | 90
SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning | Vision & multimodal | Seokju Cho +10 | Jun 11, 2026 | 88
ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research | Vision & multimodal | Wanghan Xu +39 | May 28, 2026 | 87
Robust-U1: Can MLLMs Self-Recover Corrupted Visual Content for Robust Understanding? | Vision & multimodal | Jiaqi Tang +8 | Jun 6, 2026 | 75
From Pixels to Words -- Towards Native One-Vision Models at Scale | Vision & multimodal | Haiwen Diao +20 | May 27, 2026 | 73
Why Far Looks Up: Probing Spatial Representation in Vision-Language Models | Vision & multimodal | Cheolhong Min +7 | May 28, 2026 | 60
From Activation to Causality: Discovery of Causal Visual Representations in the Human Brain | Vision & multimodal | Yuval Golbari +7 | May 22, 2026 | 52
CoVEBench: Can Video Editing Models Handle Complex Instructions? | Vision & multimodal | Jiangtao Wu +9 | Jun 7, 2026 | 48
GGT-100K: Generative Ground Truth for Generalizable Real-World Image Restoration | Vision & multimodal | Xiangtao Kong +4 | May 29, 2026 | 44
SCAIL-2: Unifying Controlled Character Animation with End-to-end In-Context Conditioning | Vision & multimodal | Wenhao Yan +3 | Jun 9, 2026 | 41
SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks | Vision & multimodal | Hongcheng Gao +20 | Jun 8, 2026 | 41
Function2Scene: 3D Indoor Scene Layout from Functional Specifications | Vision & multimodal | Ruiqi Wang +6 | May 29, 2026 | 41
MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism | Vision & multimodal | Cong Chen +9 | Jun 5, 2026 | 38
VideoKR: Towards Knowledge- and Reasoning-Intensive Video Understanding | Vision & multimodal | Lin Fu +5 | Jun 3, 2026 | 35
X-Stream: Exploring MLLMs as Multiplexers for Multi-Stream Understanding | Vision & multimodal | Peiwen Sun +11 | Jun 1, 2026 | 35
LaRA: Layer-wise Representation Analysis for Detecting Data Contamination in RL Post-Training | Vision & multimodal | Minju Gwak +5 | May 28, 2026 | 34
EarlyTom: Early Token Compression Completes Fast Video Understanding | Vision & multimodal | Hesong Wang +6 | May 28, 2026 | 32
From 2D Grids to 1D Tokens: Reforming Shared Representations for Multimodal Image Fusion | Vision & multimodal | Yuchen Xian +3 | Jun 10, 2026 | 30
Reason, Then Re-reason: Cross-view Revisiting Improves Spatial Reasoning | Vision & multimodal | Chaofan Ma +7 | Jun 10, 2026 | 30
Linear Ensembles Wash Away Watermarks: On the Fragility of Distributional Perturbations in LLMs | Vision & multimodal | Zhihao Wu +4 | May 28, 2026 | 29
HYDRA-X: Native Unified Multimodal Models with Holistic Visual Tokenizers | Vision & multimodal | Guozhen Zhang +13 | Jun 11, 2026 | 27
M^3Eval: Multi-Modal Memory Evaluation through Cognitively-Grounded Video Tasks | Vision & multimodal | Jie Huang +6 | Jun 3, 2026 | 26
AutoMedBench: Towards Medical AutoResearch with Agentic AI Models | Vision & multimodal | Junqi Liu +14 | Jun 1, 2026 | 26
VLM3: Vision Language Models Are Native 3D Learners | Vision & multimodal | Zhipeng Cai +5 | May 28, 2026 | 26
Skill-3D: Evolving Scene-Aware Skills for Agentic 3D Spatial Reasoning | Vision & multimodal | Haoyuan Li +4 | Jun 5, 2026 | 24
QUACK: Questioning, Understanding, and Auditing Communicated Knowledge in Multimodal Social Deduction Agents | Vision & multimodal | Ye Yuan +14 | May 26, 2026 | 24
Benchmarking Visual State Tracking in Multimodal Video Understanding | Vision & multimodal | Sihyun Yu +10 | Jun 2, 2026 | 23
LoMo: Local Modality Substitution for Deeper Vision-Language Fusion | Vision & multimodal | Feng Han +4 | May 28, 2026 | 23
GUI-CIDER: Mid-training GUI Agents via Causal Internalization and Density-aware Exemplar Reselection | Vision & multimodal | Zheng Wu +7 | May 27, 2026 | 23
Decentralized Instruction Tuning: Conflict-Aware Splitting and Weight Merging | Vision & multimodal | Minsik Choi +1 | Jun 1, 2026 | 21
InternVideo3: Agentify Foundation Models with Multimodal Contextual Reasoning | Vision & multimodal | Ziang Yan +12 | Jun 10, 2026 | 20
ZipSplat: Fewer Gaussians, Better Splats | Vision & multimodal | Alexander Veicht +3 | Jun 3, 2026 | 20
PEEK: Picking Essential frames via Efficient Knowledge distillation | Vision & multimodal | Killian Steunou +4 | May 29, 2026 | 20
How and What to Imagine? Visual Thinking in Unified Multimodal Models for Cross-View Spatial Reasoning | Vision & multimodal | Qian Yang +6 | May 26, 2026 | 20
OmniGameArena: A Unified UE5 Benchmark for VLM Game Agents with Improvement Dynamics | Vision & multimodal | Mingxian Lin +11 | Jun 8, 2026 | 19
Personal AI Agent for Camera Roll VQA | Vision & multimodal | Thao Nguyen +4 | Jun 3, 2026 | 19
OpenWebRL: Demystifying Online Multi-turn Reinforcement Learning for Visual Web Agents | Vision & multimodal | Rui Yang +9 | Jun 1, 2026 | 19
Advancing Creative Physical Intelligence in Large Multimodal Models | Vision & multimodal | Cheng Qian +12 | May 25, 2026 | 19
Reroute, Don't Remove: Recoverable Visual Token Routing for Vision-Language Models | Vision & multimodal | Cheng-Yu Yang +2 | Jun 10, 2026 | 17
MapAgent: An Industrial-Grade Agentic Framework for City-scale Lane-level Map Generation | Vision & multimodal | Deguo Xia +8 | Jun 3, 2026 | 17
TRON: Targeted Rule-Verifiable Online Environments for Visual Reasoning RL | Vision & multimodal | Tianze Yang +5 | Jun 1, 2026 | 17
WorldMemArena: Evaluating Multimodal Agent Memory Through Action-World Interaction | Vision & multimodal | Chengzhi Liu +16 | May 28, 2026 | 17
DyCo-RL: Dynamic Cross-Modal Coordination for Visual Reasoning | Vision & multimodal | Hangui Lin +8 | Jun 6, 2026 | 16
One Token per Multimodal Evidence: Latent Memory for Resource-Constrained QA | Vision & multimodal | Zhi Zheng +4 | Jun 9, 2026 | 16
Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline Matching | Vision & multimodal | Hao Zhong +10 | Jun 2, 2026 | 16
Towards Verifiable Multimodal Deep Research: A Multi-Agent Harness for Interleaved Report Generation | Vision & multimodal | Chenghao Zhang +4 | May 28, 2026 | 16
Does Seeing More Mean Knowing More? Mono-Anchored Advantage Normalization for Multi-Source Visual Reasoning | Vision & multimodal | Fanhu Zeng +5 | May 25, 2026 | 16
Thinking with Imagination: Agentic Visual Spatial Reasoning with World Simulators | Vision & multimodal | Chenming Zhu +6 | Jun 4, 2026 | 15
PaddleOCR-VL-1.6: Expanding the Frontier of Document Parsing with Under-Optimized Region Refinement and Progressive Post-Training | Vision & multimodal | Zelun Zhang +14 | Jun 2, 2026 | 15
UI-KOBE: Knowledge-Oriented Behavior Exploration for Lightweight Graph-Guided GUI Agents | Vision & multimodal | Yuxiang Chai +5 | May 28, 2026 | 15
RedAct: Redacting Agent Capability Traces for Procedural Skill Protection | Vision & multimodal | Shuwen Xu +2 | Jun 10, 2026 | 14
MMG2Skill: Can Agents Distill In-the-Wild Guides into Self-Evolving Skills? | Vision & multimodal | Xinyu Che +12 | Jun 1, 2026 | 14
TVIR: Building Deep Research Agents Towards Text--Visual Interleaved Report Generation | Vision & multimodal | Xinkai Ma +23 | Jun 1, 2026 | 14
Brain-IT-VQA: From Brain Signals to Answers | Vision & multimodal | Roman Beliy +4 | May 28, 2026 | 14
HL-OutPaint: Coarse-to-Fine Video Outpainting for High-Resolution Long-Range Videos | Vision & multimodal | Jeongeun Park +6 | May 19, 2026 | 13
Seeing Isn't Knowing: Do VLMs Know When Not to Answer Spatial Questions (and Why)? | Vision & multimodal | Yue Zhang +5 | May 28, 2026 | 12
GradSentry: Gradient Spectral Entropy for Backdoor Sample Filtering in Large Language Model Fine-Tuning | Vision & multimodal | Haodong Zhao +4 | May 26, 2026 | 12
Count Anything | Vision & multimodal | Mengqi Lei +6 | May 29, 2026 | 11
Linear Scaling Video VLMs for Long Video Understanding | Vision & multimodal | Cristobal Eyzaguirre +2 | May 29, 2026 | 11
PANDO: Efficient Multimodal AI Agents via Online Skill Distillation | Vision & multimodal | Yubo Li +3 | May 24, 2026 | 11
LLM Agents Can See Code Repositories | Vision & multimodal | Dongjian Ma +5 | Jun 12, 2026 | 10
Ultralytics YOLO26: Unified Real-Time End-to-End Vision Models | Vision & multimodal | Glenn Jocher +5 | Jun 2, 2026 | 10
Agent Skills Should Go Beyond Text: The Case for Visual Skills | Vision & multimodal | Binxiao Xu +3 | May 31, 2026 | 10
PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding | Vision & multimodal | Selim Kuzucu +5 | May 28, 2026 | 10
CONF-KV: Confidence-Aware KV Cache Eviction with Mixed-Precision Storage for Long-Horizon LLM | Vision & multimodal | Yubo Li +1 | May 24, 2026 | 9
Beyond Final Answers: Auditing Trajectory-Level Hallucinations in Multi-Agent Industrial Workflows | Vision & multimodal | Harshada Badave +8 | May 22, 2026 | 9
WebRISE: Requirement-Induced State Evaluation for MLLM-Generated Web Artifacts | Vision & multimodal | Yuxin Meng +11 | Jun 2, 2026 | 8
SOCO: Benchmarking Semantic Object Correspondence in Vision Foundation Models | Vision & multimodal | Olaf Dünkel +5 | May 29, 2026 | 8
How can embedding models bind concepts? | Vision & multimodal | Arnas Uselis +2 | May 29, 2026 | 8
One Click per Cell Type Suffices: Training-free Group Interaction for Cell Instance Segmentation | Vision & multimodal | Sanghyun Jo +6 | May 28, 2026 | 8
OpenSkillEval: Automatically Auditing the Open Skill Ecosystem for LLM Agents | Vision & multimodal | Jiahao Ying +4 | May 22, 2026 | 8
Token-Level Generalization in LoRA Adapter Backdoors: Attack Characterization and Behavioral Detection | Vision & multimodal | Travis Lelle | May 28, 2026 | 8
Visual Para-Thinker++: A Single-Policy Multi-Agent Framework for Visual Reasoning | Vision & multimodal | Haoran Xu +6 | Jun 8, 2026 | 7
Almieyar-Oryx-BloomBench: A Bilingual Multimodal Benchmark for Cognitively Informed Evaluation of Vision-Language Models | Vision & multimodal | Mohammad Mahdi Abootorabi +6 | Jun 4, 2026 | 7
Towards One-to-Many Temporal Grounding | Vision & multimodal | Qi Xu +7 | Jun 4, 2026 | 7
Stateful Visual Encoders for Vision-Language Models | Vision & multimodal | Zirui Wang +5 | Jun 3, 2026 | 7
Multi-Agent Computer Use | Vision & multimodal | Jing Yu Koh +2 | Jun 1, 2026 | 7
3DCodeBench: Benchmarking Agentic Procedural 3D Modeling Via Code | Vision & multimodal | Yipeng Gao +7 | May 31, 2026 | 7
SVI-Bench: A Dynamic Microworld for Strategic Video Intelligence | Vision & multimodal | Yulu Pan +6 | May 29, 2026 | 7
Unified Neural Scaling Laws | Vision & multimodal | Ethan Caballero +3 | May 25, 2026 | 7
iVGR: Internalizing Visually Grounded Reasoning for MLLMs with Reinforcement Learning | Vision & multimodal | Chang-Bin Zhang +3 | May 29, 2026 | 7
A Multi-AI-agent Framework Enabling End-to-end Finite Element Analysis for Solid Mechanics Problems | Vision & multimodal | Titu Ranjan Sarker +5 | May 28, 2026 | 6
HakushoBench: A Japanese Chart and Table VQA Benchmark from Governmental White Papers | Vision & multimodal | Issa Sugiura +3 | May 31, 2026 | 6
AgentHijack: Benchmarking Computer Use Agent Robustness to Common Environment Corruptions | Vision & multimodal | Jingwei Sun +5 | May 25, 2026 | 6
Struct-Searcher: Agentic Structural Thinking Advances Multimodal Deep Information Seeking | Vision & multimodal | Fan Zhang +9 | Jun 5, 2026 | 5
Imagine Before You Predict: Interleaved Latent Visual Reasoning for Video Event Prediction | Vision & multimodal | Tianxiang Jiang +7 | Jun 4, 2026 | 5
AdaCodec: A Predictive Visual Code for Video MLLMs | Vision & multimodal | Haowen Hou +10 | Jun 1, 2026 | 5
BraveGuard: From Open-World Threats to Safer Computer-Use Agents | Vision & multimodal | Yunhao Feng +15 | Jun 2, 2026 | 5
Stable-Layers: Fine-Tuning Image Layer Decomposition Models with VLM-Scored Reinforcement Learning | Vision & multimodal | Ciara Rowles +4 | May 28, 2026 | 5
Mitigating Perceptual Judgment Bias in Multimodal LLM-as-a-Judge via Perceptual Perturbation and Reward Modeling | Vision & multimodal | Seojeong Park +5 | Jun 1, 2026 | 5
Representation over Routing: Diagnosing Temporal Routing Pathologies in Multi-Timescale PPO | Vision & multimodal | Jing Sun | May 30, 2026 | 5
ClinHallu: A Benchmark for Diagnosing Stage-Wise Hallucinations in Medical MLLM Reasoning | Vision & multimodal | Sicheng Yang +7 | Jun 12, 2026 | 4
ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction | Vision & multimodal | Amirhossein Abaskohi +5 | May 11, 2026 | 4
Optical Reasoning: Rethinking Images as an Expressive Reasoning Medium Beyond Text | Vision & multimodal | Yutong Bian +4 | Jun 8, 2026 | 4
AsyncWebRL: Efficient Multi-Step RL for Visual Web Agents | Vision & multimodal | Hao Bai +5 | Jun 4, 2026 | 4