| Stream3D-VLM: Online 3D Spatial Understanding with Incremental Geometry Priors | Vision & multimodal | Hanxun Yu +6 | Jun 5, 2026 | 4 |
| Benchmark Everything Everywhere All at Once | Vision & multimodal | Shiyun Xiong +7 | Jun 4, 2026 | 4 |
| MechVQA: Benchmarking and Enhancing Multimodal LLMs on Comprehensive Mechanical Drawing Understanding | Vision & multimodal | Qian Kou +6 | May 29, 2026 | 4 |
| ClawHub Security Signals: When VirusTotal, Static Analysis, and SkillSpector Disagree | Vision & multimodal | Vincent Koc +5 | May 31, 2026 | 4 |
| MindZero: Learning Online Mental Reasoning With Zero Annotations | Vision & multimodal | Shunchi Zhang +5 | May 29, 2026 | 4 |
| Fine-tuning Multi-modal LLMs with ART: Art-based Reinforcement Training | Vision & multimodal | Michal Chudoba +3 | Jun 10, 2026 | 3 |
| Phase Marginalization for Patch-Grid Instability in Vision Transformers | Vision & multimodal | Oğuzhan Ercan | Jun 6, 2026 | 3 |
| SDR: Set-Distance Rewards for Radiology Report Generation | Vision & multimodal | Halil Ibrahim Gulluk +3 | May 30, 2026 | 3 |
| WorldBench: A Challenging and Visually Diverse Multimodal Reasoning Benchmark | Vision & multimodal | Yida Yin +11 | Jun 4, 2026 | 3 |
| Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models | Vision & multimodal | Haibo Wang +1 | Jun 4, 2026 | 3 |
| Video2LoRA: Parametric Video Internalization for Vision-Language Models | Vision & multimodal | Manan Suri +2 | Jun 3, 2026 | 3 |
| BRepCLIP: Contrastive Multimodal Pretraining on BRep Primitives for CAD Understanding | Vision & multimodal | Muhammad Usama +3 | Jun 3, 2026 | 3 |
| SpatialAct: Probing Spatial Reasoning-to-Action Capabilities of VLM Agents in 3D Scenes | Vision & multimodal | Tianhui Liu +8 | May 29, 2026 | 3 |
| Prior Availability in Industrial Visual Sim-to-Real: A Review of CAD-Guided and CAD-Unavailable Regimes | Vision & multimodal | Chenxi Tao +1 | May 28, 2026 | 3 |
| Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly | Vision & multimodal | Aditya Chetan +7 | May 20, 2026 | 3 |
| Chartographer: Counterfactual Chart Generation for Evaluating Vision-Language Models | Vision & multimodal | Yifan Jiang +3 | May 26, 2026 | 3 |
| Seeing the Needle in the Haystack: Towards Weakly-Supervised Log Instance Anomaly Localization via Counterfactual Perturbation | Vision & multimodal | Yutszyuk Wong +3 | May 9, 2026 | 3 |
| Dense Supervision, Sparse Updates: On the Sparsity and Geometry of On-Policy Distillation | Vision & multimodal | Guo Yu +5 | Jun 11, 2026 | 2 |
| ArogyaSutra: A Multi-Agent Framework for Multimodal Medical Reasoning in Indic Languages | Vision & multimodal | Tanmoy Kanti Halder +4 | Jun 11, 2026 | 2 |
| WebChallenger: A Reliable and Efficient Generalist Web Agent | Vision & multimodal | Jayoo Hwang +2 | Jun 9, 2026 | 2 |
| Can Generalist Agents Automate Data Curation? | Vision & multimodal | Feiyang Kang +7 | Jun 2, 2026 | 2 |
| Revising Context, Shifting Simulated Stance: Auditing LLM-Based Stance Simulation in Online Discussions | Vision & multimodal | Xinnong Zhang +4 | Jun 4, 2026 | 2 |
| When Graph Tokens Sink: A Mechanistic Analysis of Graph Language Models | Vision & multimodal | Ding Zhang +5 | Jun 2, 2026 | 2 |
| SynCred-Bench: Benchmarking Synthetic Credibility in AI-Generated Visual Misinformation | Vision & multimodal | Junxiao Yang +6 | Jun 2, 2026 | 2 |
| Pressure-Testing Deception Probes in LLMs: Scaling, Robustness, and the Geometry of Deceptive Representations | Vision & multimodal | Sachin Kumar | May 27, 2026 | 2 |
| Thinking in Blender: Staged Executable Inverse Graphics with Vision-Language Models | Vision & multimodal | Guangzhao He +3 | Jun 1, 2026 | 2 |
| ChartArena: Benchmarking Chart Parsing across Languages, Scenarios, and Formats | Vision & multimodal | Shangpin Peng +12 | May 31, 2026 | 2 |
| Leveraging Morphology for Historical Script Metrological Analysis | Vision & multimodal | Malamatenia Vlachou Efstathiou +3 | Jun 8, 2026 | 1 |
| APEX: A Network-Native Time-Series Foundation Model for Forecasting and Anomaly Detection for Wireless Edge Operations | Vision & multimodal | Swadhin Pradhan +2 | Jun 10, 2026 | 1 |
| Cosine Misleads: Auxiliary Losses Reshape Vision Language Models, Not Their Latents | Vision & multimodal | XiuYu Zhang +2 | Jun 4, 2026 | 1 |
| Send a SCOUT First: Pre-hoc Reasoning for Adaptive Detector Allocation in Prompt-Injection Defense | Vision & multimodal | Shuhao Zhang +4 | May 29, 2026 | 1 |
| Quality-Guided Semi-Supervised Learning for Medical Image Segmentation | Vision & multimodal | Kumar Abhishek +1 | Jun 1, 2026 | 1 |
| Benchmarking Composed Image Retrieval for Applied Earth Observation | Vision & multimodal | Bill Psomas +8 | May 23, 2026 | 1 |
| Tiny but Trusted: Efficient Vision-Language Reasoning for Time-Series Anomaly Detection | Vision & multimodal | Xiaona Zhou +4 | May 28, 2026 | 1 |
| Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning | Vision & multimodal | Chun-Hsiao Yeh +5 | May 28, 2026 | 1 |
| Don't Guess, Just Ask: Resolving Ambiguity in Referring Segmentation via Multi-turn Clarification | Vision & multimodal | Yuting Yang +4 | May 24, 2026 | 1 |