| Agents' Last Exam | Agents | Yiyou Sun +39 | Jun 3, 2026 | 338 |
| Crafter: A Multi-Agent Harness for Editable Scientific Figure Generation from Diverse Inputs | Agents | Haozhe Zhao +8 | May 28, 2026 | 192 |
| AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security | Agents | Dongrui Liu +49 | May 28, 2026 | 142 |
| EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments | Agents | Jundong Xu +13 | Jun 11, 2026 | 130 |
| COLLEAGUE.SKILL: Automated AI Skill Generation via Expert Knowledge Distillation | Agents | Tianyi Zhou +4 | May 29, 2026 | 111 |
| SWE-Explore: Benchmarking How Coding Agents Explore Repositories | Agents | Shaoqiu Zhang +10 | Jun 5, 2026 | 110 |
| Toward Generalist Autonomous Research via Hypothesis-Tree Refinement | Agents | Jiajie Jin +17 | Jun 10, 2026 | 109 |
| GrepSeek: Training Search Agents for Direct Corpus Interaction | Agents | Alireza Salemi +6 | May 28, 2026 | 104 |
| Role-Agent: Bootstrapping LLM Agents via Dual-Role Evolution | Agents | Xucong Wang +6 | Jun 9, 2026 | 75 |
| FORT-Searcher: Synthesizing Shortcut-Resistant Search Tasks for Training Deep Search Agents | Agents | Jia Deng +11 | Jun 10, 2026 | 71 |
| A Matter of TASTE: Improving Coverage and Difficulty of Agent Benchmarks | Agents | Tomer Keren +5 | May 27, 2026 | 65 |
| MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent Research | Agents | Dingbang Wu +10 | May 25, 2026 | 64 |
| Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks | Agents | Mengyu Zheng +15 | Jun 10, 2026 | 63 |
| LatentSkill: From In-Context Textual Skills to In-Weight Latent Skills for LLM Agents | Agents | Aofan Yu +10 | Jun 4, 2026 | 62 |
| Masking Stale Observations Helps Search Agents -- Until It Doesn't: A Regime Map and Its Mechanism | Agents | Haoxiang Zhang +6 | May 29, 2026 | 62 |
| Agentic Environment Engineering for Large Language Models: A Survey of Environment Modeling, Synthesis, Evaluation, and Application | Agents | Jiachun Li +14 | Jun 10, 2026 | 60 |
| K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts | Agents | Nahyun Lee +14 | Jun 1, 2026 | 55 |
| Mellum2 Technical Report | Agents | Marko Kojic +8 | May 29, 2026 | 54 |
| Where Do Deep-Research Agents Go Wrong? Span-Level Error Localization in Agent Trajectories | Agents | Jiaming Wang +10 | Jun 1, 2026 | 53 |
| Retrospective Harness Optimization: Improving LLM Agents via Self-Preference over Trajectory Rollouts | Agents | Wenbo Pan +7 | Jun 4, 2026 | 52 |
| SoCRATES: Towards Reliable Automated Evaluation of Proactive LLM Mediation across Domains and Socio-cognitive Variations | Agents | Taewon Yun +5 | Jun 4, 2026 | 51 |
| Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses | Agents | Pengcheng Jiang +7 | Jun 1, 2026 | 51 |
| SearchSwarm: Towards Delegation Intelligence in Agentic LLMs for Long-Horizon Deep Research | Agents | Pu Ning +9 | Jun 8, 2026 | 50 |
| ResearchMath-14K: Scaling Research-Level Mathematics via Agents | Agents | Guijin Son +5 | May 27, 2026 | 49 |
| ArcANE: Do Role-Playing Language Agents Stay in Character at the Right Time? | Agents | Woojung Song +5 | Jun 4, 2026 | 47 |
| APPO: Agentic Procedural Policy Optimization | Agents | Xucong Wang +7 | Jun 10, 2026 | 46 |
| TIDE: Proactive Multi-Problem Discovery via Template-Guided Iteration | Agents | Soyeong Jeong +3 | Jun 3, 2026 | 44 |
| Trust Region On-Policy Distillation | Agents | Xingrun Xing +4 | May 31, 2026 | 42 |
| LongTraceRL: Learning Long-Context Reasoning from Search Agent Trajectories with Rubric Rewards | Agents | Nianyi Lin +3 | May 29, 2026 | 41 |
| Personalize-then-Store: Benchmarking and Learning Personalized Memory for Long-horizon Agents | Agents | Yeonjun In +4 | May 25, 2026 | 41 |
| AdaPlanBench: Evaluating Adaptive Planning in Large Language Model Agents under World and User Constraints | Agents | Jiayu Liu +10 | Jun 4, 2026 | 40 |
| The MiniMax-M2 Series: Mini Activations Unleashing Max Real-World Intelligence | Agents | MiniMax +206 | May 26, 2026 | 39 |
| Learn from Weaknesses: Automated Domain Specialization for Small Computer-Use Agents | Agents | Suji Kim +2 | May 27, 2026 | 38 |
| Reproducing, Analyzing, and Detecting Reward Hacking in Rubric-Based Reinforcement Learning | Agents | Xuekang Wang +5 | Jun 3, 2026 | 37 |
| SkillAdaptor: Self-Adapting Skills for LLM Agents from Trajectories | Agents | Zhuoyun Yu +6 | May 31, 2026 | 35 |
| ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence | Agents | Rui Meng +12 | May 25, 2026 | 35 |
| Rethinking Memory as Continuously Evolving Connectivity | Agents | Jizhan Fang +14 | May 27, 2026 | 34 |
| CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents | Agents | Bowen Wang +13 | May 25, 2026 | 33 |
| Skill0.5: Joint Skill Internalization and Utilization for Out-of-Distribution Generalization in Agentic Reinforcement Learning | Agents | Jiapeng Zhu +7 | May 27, 2026 | 32 |
| Your Agents Are Aging Too: Agent Lifespan Engineering for Deployed Systems | Agents | Jianing Zhu +7 | May 25, 2026 | 32 |
| Memory is Reconstructed, Not Retrieved: Graph Memory for LLM Agents | Agents | Shuo Ji +2 | Jun 4, 2026 | 31 |
| DeNovoSWE: Scaling Long-Horizon Environments for Generating Entire Repositories from Scratch | Agents | Jiale Zhao +6 | Jun 9, 2026 | 31 |
| Streaming Communication in Multi-Agent Reasoning | Agents | Zhen Yang +5 | Jun 3, 2026 | 29 |
| AutoLab: Can Frontier Models Solve Long-Horizon Auto Research and Engineering Tasks? | Agents | Zhangchen Xu +18 | Jun 3, 2026 | 28 |
| OpenSkill: Open-World Self-Evolution for LLM Agents | Agents | Zhiling Yan +10 | Jun 4, 2026 | 27 |
| SkillGrad: Optimizing Agent Skills Like Gradient Descent | Agents | Hanyu Wang +4 | May 26, 2026 | 27 |
| MUSE-Autoskill: Self-Evolving Agents via Skill Creation, Memory, Management, and Evaluation | Agents | Huawei Lin +4 | May 26, 2026 | 27 |
| EurekAgent: Agent Environment Engineering is All You Need For Autonomous Scientific Discovery | Agents | Amy Xin +7 | Jun 11, 2026 | 26 |
| AI Research Agents Narrow Scientific Exploration | Agents | Yixuan Tang +1 | May 27, 2026 | 26 |
| End-to-End Context Compression at Scale | Agents | Ang Li +14 | Jun 8, 2026 | 25 |
| Benchmarks are Not Enough: RAMP for Runtime Assessing of Agentic Models in Production Systems | Agents | Yipeng Ouyang +5 | May 26, 2026 | 25 |
| SAAS: Self-Aware Reinforcement Learning for Over-Search Mitigation in Agentic Search | Agents | Yunbo Tang +6 | May 28, 2026 | 25 |
| When Tools Fail: Benchmarking Dynamic Replanning and Anomaly Recovery in LLM Agents | Agents | Dongsheng Zhu +7 | Jun 4, 2026 | 23 |
| Exploring Autonomous Agentic Data Engineering for Model Specialization | Agents | Yujie Luo +12 | May 28, 2026 | 23 |
| HarnessX: A Composable, Adaptive, and Evolvable Agent Harness Foundry | Agents | Tingyang Chen +13 | Jun 12, 2026 | 22 |
| From Chatbot to Digital Colleague: The Paradigm Shift Toward Persistent Autonomous AI | Agents | Yongheng Zhang +19 | Jun 12, 2026 | 21 |
| Breaking Entropy Bounds: Accelerating RL Training via MTP with Rejection Sampling | Agents | Yucheng Li +16 | Jun 10, 2026 | 21 |
| Rethinking Continual Experience Internalization for Self-Evolving LLM Agents | Agents | Jingwen Chen +9 | Jun 3, 2026 | 21 |
| LongDS-Bench: On the Failure of Long-Horizon Agentic Data Analysis | Agents | Kewei Xu +6 | May 28, 2026 | 21 |
| Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents | Agents | Minhua Lin +16 | May 28, 2026 | 21 |
| Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields | Agents | Liya Zhu +39 | Jun 9, 2026 | 20 |
| Recovering Policy-Induced Errors: Benchmarking and Trajectory Synthesis for Robust GUI Agents | Agents | Tianpeng Bu +9 | May 28, 2026 | 20 |
| SubtleMemory: A Benchmark for Fine-Grained Relational Memory Discrimination in Long-Horizon AI Agents | Agents | Wenxuan Wang +6 | Jun 4, 2026 | 19 |
| MLEvolve: A Self-Evolving Framework for Automated Machine Learning Algorithm Discovery | Agents | Shangheng Du +13 | Jun 4, 2026 | 19 |
| VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions | Agents | Yuxin Chen +13 | May 26, 2026 | 19 |
| EEVEE: Towards Test-time Prompt Learning in the Real World for Self-Improving Agents | Agents | Weixian Xu +2 | Jun 9, 2026 | 18 |
| TRACE: A Unified Rollout Budget Allocation Framework for Efficient Agentic Reinforcement Learning | Agents | Heming Zou +11 | Jun 9, 2026 | 17 |
| MemTrain: Self-Supervised Context Memory Training | Agents | Ziheng Li +4 | Jun 2, 2026 | 17 |
| When Does Multi-Agent RL Improve LLM Workflows? Workflow, Scale, and Policy-Sharing Tradeoffs | Agents | Yifan Zeng +6 | May 22, 2026 | 17 |
| LiteCoder-Terminal: Scaling Long-Horizon Terminal Environments for Learning Language Agents | Agents | Xiaoxuan Peng +7 | May 28, 2026 | 17 |
| LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know? | Agents | HuiMing Fan +7 | May 27, 2026 | 17 |
| MCP-Persona: Benchmarking LLM Agents on Real-World Personal Applications via Environment Simulation | Agents | Wenhao Wang +11 | Jun 1, 2026 | 16 |
| Skill is Not One-Size-Fits-All: Model-Aware Skill Alignment for LLM Agents | Agents | Jianxiang Yu +5 | May 29, 2026 | 16 |
| AsyncTool: Evaluating the Asynchronous Function Calling Capability under Multi-Task Scenarios | Agents | Kou Shi +9 | May 27, 2026 | 16 |
| Learning to Act under Noise: Enhancing Agent Robustness via Noisy Environments | Agents | Yuxin Chen +11 | May 26, 2026 | 16 |
| Efficient Agentic Reinforcement Learning with On-Policy Intrinsic Knowledge Boundary Enhancement | Agents | Dingwei Chen +7 | May 26, 2026 | 16 |
| Joint Agent Memory and Exploration Learning via Novelty Signals | Agents | Shizuo Tian +11 | Jun 1, 2026 | 15 |
| When Cloud Agents Meet Device Agents: Lessons from Hybrid Multi-Agent Systems | Agents | Corrado Rainone +3 | May 28, 2026 | 15 |
| VibeSearchBench: Benchmarking Long-horizon Proactive Search in the Wild | Agents | Xiaohongshu Inc | May 27, 2026 | 15 |
| DarkForest: Less Talk, Higher Accuracy for Multi-Agent LLMs | Agents | Yi Li +5 | May 24, 2026 | 15 |
| Bayesian-Agent: Posterior-Guided Skill Evolution for LLM Agent Harnesses | Agents | Xiaojun Wu +9 | Jun 6, 2026 | 14 |
| SIA: Self Improving AI with Harness & Weight Updates | Agents | Prannay Hebbar +6 | May 26, 2026 | 14 |
| From AGI to ASI | Agents | Tim Genewein +13 | Jun 10, 2026 | 13 |
| Skill-RM: Unifying Heterogeneous Evaluation Criteria via Agent Skill | Agents | Tao Chen +12 | Jun 2, 2026 | 13 |
| Ψ-Bench: Evaluating Persona-Sensitive Influencing in Persuasive Dialogues | Agents | Peixuan Han +5 | Jun 1, 2026 | 13 |
| SlimSearcher: Training Efficiency-Aware Web Agents via Adaptive Reward Gating | Agents | Zequn Xie +6 | Jun 5, 2026 | 12 |
| Socratic-SWE: Self-Evolving Coding Agents via Trace-Derived Agent Skills | Agents | Chuan Xiao +7 | Jun 5, 2026 | 12 |
| Unsupervised Skill Discovery for Agentic Data Analysis | Agents | Zhisong Qiu +6 | Jun 4, 2026 | 12 |
| Adaptive Auto-Harness: Sustained Self-Improvement for Agentic System Deployment on Open-Ended Task Streams | Agents | Zewen Liu +9 | Jun 1, 2026 | 12 |
| Emergent Languages in Populations of Language Model Agents: From Token Efficiency to Oversight Evasion | Agents | Stine Lyngsø Beltoft +7 | May 29, 2026 | 12 |
| HarnessBridge: Learnable Bidirectional Controller for LLM Agent Harness | Agents | Xiaoxuan Wang +5 | Jun 11, 2026 | 11 |
| EvoTrainer: Co-Evolving LLM Policies and Training Harnesses for Autonomous Agentic Reinforcement Learning | Agents | Guhong Chen +8 | Jun 2, 2026 | 11 |
| PhoneWorld: Scaling Phone-Use Agent Environments | Agents | Zhengyang Tang +23 | May 28, 2026 | 11 |
| AutoScientists: Self-Organizing Agent Teams for Long-Running Scientific Experimentation | Agents | Shanghua Gao +2 | May 27, 2026 | 11 |
| TreeSeeker: Tree-Structured Trial, Error, and Return in Deep Search | Agents | Zhuofan Shi +10 | Jun 10, 2026 | 10 |
| Online Skill Learning for Web Agents via State-Grounded Dynamic Retrieval | Agents | Jiaxi Li +7 | Jun 3, 2026 | 10 |
| SkillHarm: Lifecycle-Aware Skill-Based Attacks via Automated Construction | Agents | Yuting Ning +10 | Jun 1, 2026 | 10 |
| Rethinking How to Remember: Beyond Atomic Facts in Lifelong LLM Agent Memory | Agents | Jingwei Sun +4 | May 19, 2026 | 10 |
| From Model Scaling to System Scaling: Scaling the Harness in Agentic AI | Agents | Shangding Gu | May 25, 2026 | 9 |
| CoHyDE: Iterative Co-Training of LLM Rewriter & Dense Encoder for Tool Retrieval | Agents | Vaishali Senthil +2 | May 28, 2026 | 9 |