| Kwai Keye-VL-2.0 Technical Report | Vision & multimodal | Kwai Keye Team +39 | Jun 9, 2026 | 183 |
| LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding | Vision & multimodal | Shihao Wang +12 | May 26, 2026 | 139 |
| MiniMax Sparse Attention | Vision & multimodal | Xunhao Lai +10 | Jun 11, 2026 | 125 |
| Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models | Vision & multimodal | Mahtab Bigverdi +10 | Jun 3, 2026 | 117 |
| WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces | Vision & multimodal | Wanli Li +6 | Jun 8, 2026 | 97 |
| Agent Explorative Policy Optimization for Multimodal Agentic Reasoning | Vision & multimodal | Minki Kang +6 | May 27, 2026 | 90 |
| SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning | Vision & multimodal | Seokju Cho +10 | Jun 11, 2026 | 88 |
| ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research | Vision & multimodal | Wanghan Xu +39 | May 28, 2026 | 87 |
| Robust-U1: Can MLLMs Self-Recover Corrupted Visual Content for Robust Understanding? | Vision & multimodal | Jiaqi Tang +8 | Jun 6, 2026 | 75 |
| From Pixels to Words – Towards Native One-Vision Models at Scale | Vision & multimodal | Haiwen Diao +20 | May 27, 2026 | 73 |
| Why Far Looks Up: Probing Spatial Representation in Vision-Language Models | Vision & multimodal | Cheolhong Min +7 | May 28, 2026 | 60 |
| From Activation to Causality: Discovery of Causal Visual Representations in the Human Brain | Vision & multimodal | Yuval Golbari +7 | May 22, 2026 | 52 |
| CoVEBench: Can Video Editing Models Handle Complex Instructions? | Vision & multimodal | Jiangtao Wu +9 | Jun 7, 2026 | 48 |
| GGT-100K: Generative Ground Truth for Generalizable Real-World Image Restoration | Vision & multimodal | Xiangtao Kong +4 | May 29, 2026 | 44 |
| SCAIL-2: Unifying Controlled Character Animation with End-to-end In-Context Conditioning | Vision & multimodal | Wenhao Yan +3 | Jun 9, 2026 | 41 |
| SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks | Vision & multimodal | Hongcheng Gao +20 | Jun 8, 2026 | 41 |
| Function2Scene: 3D Indoor Scene Layout from Functional Specifications | Vision & multimodal | Ruiqi Wang +6 | May 29, 2026 | 41 |
| MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism | Vision & multimodal | Cong Chen +9 | Jun 5, 2026 | 38 |
| VideoKR: Towards Knowledge- and Reasoning-Intensive Video Understanding | Vision & multimodal | Lin Fu +5 | Jun 3, 2026 | 35 |
| X-Stream: Exploring MLLMs as Multiplexers for Multi-Stream Understanding | Vision & multimodal | Peiwen Sun +11 | Jun 1, 2026 | 35 |
| LaRA: Layer-wise Representation Analysis for Detecting Data Contamination in RL Post-Training | Vision & multimodal | Minju Gwak +5 | May 28, 2026 | 34 |
| EarlyTom: Early Token Compression Completes Fast Video Understanding | Vision & multimodal | Hesong Wang +6 | May 28, 2026 | 32 |
| From 2D Grids to 1D Tokens: Reforming Shared Representations for Multimodal Image Fusion | Vision & multimodal | Yuchen Xian +3 | Jun 10, 2026 | 30 |
| Reason, Then Re-reason: Cross-view Revisiting Improves Spatial Reasoning | Vision & multimodal | Chaofan Ma +7 | Jun 10, 2026 | 30 |
| Linear Ensembles Wash Away Watermarks: On the Fragility of Distributional Perturbations in LLMs | Vision & multimodal | Zhihao Wu +4 | May 28, 2026 | 29 |
| HYDRA-X: Native Unified Multimodal Models with Holistic Visual Tokenizers | Vision & multimodal | Guozhen Zhang +13 | Jun 11, 2026 | 27 |
| M³Eval: Multi-Modal Memory Evaluation through Cognitively-Grounded Video Tasks | Vision & multimodal | Jie Huang +6 | Jun 3, 2026 | 26 |
| AutoMedBench: Towards Medical AutoResearch with Agentic AI Models | Vision & multimodal | Junqi Liu +14 | Jun 1, 2026 | 26 |
| VLM3: Vision Language Models Are Native 3D Learners | Vision & multimodal | Zhipeng Cai +5 | May 28, 2026 | 26 |
| Skill-3D: Evolving Scene-Aware Skills for Agentic 3D Spatial Reasoning | Vision & multimodal | Haoyuan Li +4 | Jun 5, 2026 | 24 |
| QUACK: Questioning, Understanding, and Auditing Communicated Knowledge in Multimodal Social Deduction Agents | Vision & multimodal | Ye Yuan +14 | May 26, 2026 | 24 |
| Benchmarking Visual State Tracking in Multimodal Video Understanding | Vision & multimodal | Sihyun Yu +10 | Jun 2, 2026 | 23 |
| LoMo: Local Modality Substitution for Deeper Vision-Language Fusion | Vision & multimodal | Feng Han +4 | May 28, 2026 | 23 |
| GUI-CIDER: Mid-training GUI Agents via Causal Internalization and Density-aware Exemplar Reselection | Vision & multimodal | Zheng Wu +7 | May 27, 2026 | 23 |
| Decentralized Instruction Tuning: Conflict-Aware Splitting and Weight Merging | Vision & multimodal | Minsik Choi +1 | Jun 1, 2026 | 21 |
| InternVideo3: Agentify Foundation Models with Multimodal Contextual Reasoning | Vision & multimodal | Ziang Yan +12 | Jun 10, 2026 | 20 |
| ZipSplat: Fewer Gaussians, Better Splats | Vision & multimodal | Alexander Veicht +3 | Jun 3, 2026 | 20 |
| PEEK: Picking Essential frames via Efficient Knowledge distillation | Vision & multimodal | Killian Steunou +4 | May 29, 2026 | 20 |
| How and What to Imagine? Visual Thinking in Unified Multimodal Models for Cross-View Spatial Reasoning | Vision & multimodal | Qian Yang +6 | May 26, 2026 | 20 |
| OmniGameArena: A Unified UE5 Benchmark for VLM Game Agents with Improvement Dynamics | Vision & multimodal | Mingxian Lin +11 | Jun 8, 2026 | 19 |
| Personal AI Agent for Camera Roll VQA | Vision & multimodal | Thao Nguyen +4 | Jun 3, 2026 | 19 |
| OpenWebRL: Demystifying Online Multi-turn Reinforcement Learning for Visual Web Agents | Vision & multimodal | Rui Yang +9 | Jun 1, 2026 | 19 |
| Advancing Creative Physical Intelligence in Large Multimodal Models | Vision & multimodal | Cheng Qian +12 | May 25, 2026 | 19 |
| Reroute, Don't Remove: Recoverable Visual Token Routing for Vision-Language Models | Vision & multimodal | Cheng-Yu Yang +2 | Jun 10, 2026 | 17 |
| MapAgent: An Industrial-Grade Agentic Framework for City-scale Lane-level Map Generation | Vision & multimodal | Deguo Xia +8 | Jun 3, 2026 | 17 |
| TRON: Targeted Rule-Verifiable Online Environments for Visual Reasoning RL | Vision & multimodal | Tianze Yang +5 | Jun 1, 2026 | 17 |
| WorldMemArena: Evaluating Multimodal Agent Memory Through Action-World Interaction | Vision & multimodal | Chengzhi Liu +16 | May 28, 2026 | 17 |
| DyCo-RL: Dynamic Cross-Modal Coordination for Visual Reasoning | Vision & multimodal | Hangui Lin +8 | Jun 6, 2026 | 16 |
| One Token per Multimodal Evidence: Latent Memory for Resource-Constrained QA | Vision & multimodal | Zhi Zheng +4 | Jun 9, 2026 | 16 |
| Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline Matching | Vision & multimodal | Hao Zhong +10 | Jun 2, 2026 | 16 |
| Towards Verifiable Multimodal Deep Research: A Multi-Agent Harness for Interleaved Report Generation | Vision & multimodal | Chenghao Zhang +4 | May 28, 2026 | 16 |
| Does Seeing More Mean Knowing More? Mono-Anchored Advantage Normalization for Multi-Source Visual Reasoning | Vision & multimodal | Fanhu Zeng +5 | May 25, 2026 | 16 |
| Thinking with Imagination: Agentic Visual Spatial Reasoning with World Simulators | Vision & multimodal | Chenming Zhu +6 | Jun 4, 2026 | 15 |
| PaddleOCR-VL-1.6: Expanding the Frontier of Document Parsing with Under-Optimized Region Refinement and Progressive Post-Training | Vision & multimodal | Zelun Zhang +14 | Jun 2, 2026 | 15 |
| UI-KOBE: Knowledge-Oriented Behavior Exploration for Lightweight Graph-Guided GUI Agents | Vision & multimodal | Yuxiang Chai +5 | May 28, 2026 | 15 |
| RedAct: Redacting Agent Capability Traces for Procedural Skill Protection | Vision & multimodal | Shuwen Xu +2 | Jun 10, 2026 | 14 |
| MMG2Skill: Can Agents Distill In-the-Wild Guides into Self-Evolving Skills? | Vision & multimodal | Xinyu Che +12 | Jun 1, 2026 | 14 |
| TVIR: Building Deep Research Agents Towards Text-Visual Interleaved Report Generation | Vision & multimodal | Xinkai Ma +23 | Jun 1, 2026 | 14 |
| Brain-IT-VQA: From Brain Signals to Answers | Vision & multimodal | Roman Beliy +4 | May 28, 2026 | 14 |
| HL-OutPaint: Coarse-to-Fine Video Outpainting for High-Resolution Long-Range Videos | Vision & multimodal | Jeongeun Park +6 | May 19, 2026 | 13 |
| Seeing Isn't Knowing: Do VLMs Know When Not to Answer Spatial Questions (and Why)? | Vision & multimodal | Yue Zhang +5 | May 28, 2026 | 12 |
| GradSentry: Gradient Spectral Entropy for Backdoor Sample Filtering in Large Language Model Fine-Tuning | Vision & multimodal | Haodong Zhao +4 | May 26, 2026 | 12 |
| Count Anything | Vision & multimodal | Mengqi Lei +6 | May 29, 2026 | 11 |
| Linear Scaling Video VLMs for Long Video Understanding | Vision & multimodal | Cristobal Eyzaguirre +2 | May 29, 2026 | 11 |
| PANDO: Efficient Multimodal AI Agents via Online Skill Distillation | Vision & multimodal | Yubo Li +3 | May 24, 2026 | 11 |
| LLM Agents Can See Code Repositories | Vision & multimodal | Dongjian Ma +5 | Jun 12, 2026 | 10 |
| Ultralytics YOLO26: Unified Real-Time End-to-End Vision Models | Vision & multimodal | Glenn Jocher +5 | Jun 2, 2026 | 10 |
| Agent Skills Should Go Beyond Text: The Case for Visual Skills | Vision & multimodal | Binxiao Xu +3 | May 31, 2026 | 10 |
| PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding | Vision & multimodal | Selim Kuzucu +5 | May 28, 2026 | 10 |
| CONF-KV: Confidence-Aware KV Cache Eviction with Mixed-Precision Storage for Long-Horizon LLM | Vision & multimodal | Yubo Li +1 | May 24, 2026 | 9 |
| Beyond Final Answers: Auditing Trajectory-Level Hallucinations in Multi-Agent Industrial Workflows | Vision & multimodal | Harshada Badave +8 | May 22, 2026 | 9 |
| WebRISE: Requirement-Induced State Evaluation for MLLM-Generated Web Artifacts | Vision & multimodal | Yuxin Meng +11 | Jun 2, 2026 | 8 |
| SOCO: Benchmarking Semantic Object Correspondence in Vision Foundation Models | Vision & multimodal | Olaf Dünkel +5 | May 29, 2026 | 8 |
| How can embedding models bind concepts? | Vision & multimodal | Arnas Uselis +2 | May 29, 2026 | 8 |
| One Click per Cell Type Suffices: Training-free Group Interaction for Cell Instance Segmentation | Vision & multimodal | Sanghyun Jo +6 | May 28, 2026 | 8 |
| OpenSkillEval: Automatically Auditing the Open Skill Ecosystem for LLM Agents | Vision & multimodal | Jiahao Ying +4 | May 22, 2026 | 8 |
| Token-Level Generalization in LoRA Adapter Backdoors: Attack Characterization and Behavioral Detection | Vision & multimodal | Travis Lelle | May 28, 2026 | 8 |
| Visual Para-Thinker++: A Single-Policy Multi-Agent Framework for Visual Reasoning | Vision & multimodal | Haoran Xu +6 | Jun 8, 2026 | 7 |
| Almieyar-Oryx-BloomBench: A Bilingual Multimodal Benchmark for Cognitively Informed Evaluation of Vision-Language Models | Vision & multimodal | Mohammad Mahdi Abootorabi +6 | Jun 4, 2026 | 7 |
| Towards One-to-Many Temporal Grounding | Vision & multimodal | Qi Xu +7 | Jun 4, 2026 | 7 |
| Stateful Visual Encoders for Vision-Language Models | Vision & multimodal | Zirui Wang +5 | Jun 3, 2026 | 7 |
| Multi-Agent Computer Use | Vision & multimodal | Jing Yu Koh +2 | Jun 1, 2026 | 7 |
| 3DCodeBench: Benchmarking Agentic Procedural 3D Modeling Via Code | Vision & multimodal | Yipeng Gao +7 | May 31, 2026 | 7 |
| SVI-Bench: A Dynamic Microworld for Strategic Video Intelligence | Vision & multimodal | Yulu Pan +6 | May 29, 2026 | 7 |
| Unified Neural Scaling Laws | Vision & multimodal | Ethan Caballero +3 | May 25, 2026 | 7 |
| iVGR: Internalizing Visually Grounded Reasoning for MLLMs with Reinforcement Learning | Vision & multimodal | Chang-Bin Zhang +3 | May 29, 2026 | 7 |
| A Multi-AI-agent Framework Enabling End-to-end Finite Element Analysis for Solid Mechanics Problems | Vision & multimodal | Titu Ranjan Sarker +5 | May 28, 2026 | 6 |
| HakushoBench: A Japanese Chart and Table VQA Benchmark from Governmental White Papers | Vision & multimodal | Issa Sugiura +3 | May 31, 2026 | 6 |
| AgentHijack: Benchmarking Computer Use Agent Robustness to Common Environment Corruptions | Vision & multimodal | Jingwei Sun +5 | May 25, 2026 | 6 |
| Struct-Searcher: Agentic Structural Thinking Advances Multimodal Deep Information Seeking | Vision & multimodal | Fan Zhang +9 | Jun 5, 2026 | 5 |
| Imagine Before You Predict: Interleaved Latent Visual Reasoning for Video Event Prediction | Vision & multimodal | Tianxiang Jiang +7 | Jun 4, 2026 | 5 |
| AdaCodec: A Predictive Visual Code for Video MLLMs | Vision & multimodal | Haowen Hou +10 | Jun 1, 2026 | 5 |
| BraveGuard: From Open-World Threats to Safer Computer-Use Agents | Vision & multimodal | Yunhao Feng +15 | Jun 2, 2026 | 5 |
| Stable-Layers: Fine-Tuning Image Layer Decomposition Models with VLM-Scored Reinforcement Learning | Vision & multimodal | Ciara Rowles +4 | May 28, 2026 | 5 |
| Mitigating Perceptual Judgment Bias in Multimodal LLM-as-a-Judge via Perceptual Perturbation and Reward Modeling | Vision & multimodal | Seojeong Park +5 | Jun 1, 2026 | 5 |
| Representation over Routing: Diagnosing Temporal Routing Pathologies in Multi-Timescale PPO | Vision & multimodal | Jing Sun | May 30, 2026 | 5 |
| ClinHallu: A Benchmark for Diagnosing Stage-Wise Hallucinations in Medical MLLM Reasoning | Vision & multimodal | Sicheng Yang +7 | Jun 12, 2026 | 4 |
| ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction | Vision & multimodal | Amirhossein Abaskohi +5 | May 11, 2026 | 4 |
| Optical Reasoning: Rethinking Images as an Expressive Reasoning Medium Beyond Text | Vision & multimodal | Yutong Bian +4 | Jun 8, 2026 | 4 |
| AsyncWebRL: Efficient Multi-Step RL for Visual Web Agents | Vision & multimodal | Hao Bai +5 | Jun 4, 2026 | 4 |