Literature

Papers

Notable = Hugging Face daily papers (community-upvoted) · every paper links to arXiv · citations from OpenAlex

Showing 1–100 of 155 notable papers

Paper	Topic	Authors	Published	HF upvotes
Agents' Last Exam	Agents	Yiyou Sun +39	Jun 3, 2026	338
Crafter: A Multi-Agent Harness for Editable Scientific Figure Generation from Diverse Inputs	Agents	Haozhe Zhao +8	May 28, 2026	192
AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security	Agents	Dongrui Liu +49	May 28, 2026	142
EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments	Agents	Jundong Xu +13	Jun 11, 2026	130
COLLEAGUE.SKILL: Automated AI Skill Generation via Expert Knowledge Distillation	Agents	Tianyi Zhou +4	May 29, 2026	111
SWE-Explore: Benchmarking How Coding Agents Explore Repositories	Agents	Shaoqiu Zhang +10	Jun 5, 2026	110
Toward Generalist Autonomous Research via Hypothesis-Tree Refinement	Agents	Jiajie Jin +17	Jun 10, 2026	109
GrepSeek: Training Search Agents for Direct Corpus Interaction	Agents	Alireza Salemi +6	May 28, 2026	104
Role-Agent: Bootstrapping LLM Agents via Dual-Role Evolution	Agents	Xucong Wang +6	Jun 9, 2026	75
FORT-Searcher: Synthesizing Shortcut-Resistant Search Tasks for Training Deep Search Agents	Agents	Jia Deng +11	Jun 10, 2026	71
A Matter of TASTE: Improving Coverage and Difficulty of Agent Benchmarks	Agents	Tomer Keren +5	May 27, 2026	65
MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent Research	Agents	Dingbang Wu +10	May 25, 2026	64
Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks	Agents	Mengyu Zheng +15	Jun 10, 2026	63
LatentSkill: From In-Context Textual Skills to In-Weight Latent Skills for LLM Agents	Agents	Aofan Yu +10	Jun 4, 2026	62
Masking Stale Observations Helps Search Agents -- Until It Doesn't: A Regime Map and Its Mechanism	Agents	Haoxiang Zhang +6	May 29, 2026	62
Agentic Environment Engineering for Large Language Models: A Survey of Environment Modeling, Synthesis, Evaluation, and Application	Agents	Jiachun Li +14	Jun 10, 2026	60
K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts	Agents	Nahyun Lee +14	Jun 1, 2026	55
Mellum2 Technical Report	Agents	Marko Kojic +8	May 29, 2026	54
Where Do Deep-Research Agents Go Wrong? Span-Level Error Localization in Agent Trajectories	Agents	Jiaming Wang +10	Jun 1, 2026	53
Retrospective Harness Optimization: Improving LLM Agents via Self-Preference over Trajectory Rollouts	Agents	Wenbo Pan +7	Jun 4, 2026	52
SoCRATES: Towards Reliable Automated Evaluation of Proactive LLM Mediation across Domains and Socio-cognitive Variations	Agents	Taewon Yun +5	Jun 4, 2026	51
Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses	Agents	Pengcheng Jiang +7	Jun 1, 2026	51
SearchSwarm: Towards Delegation Intelligence in Agentic LLMs for Long-Horizon Deep Research	Agents	Pu Ning +9	Jun 8, 2026	50
ResearchMath-14K: Scaling Research-Level Mathematics via Agents	Agents	Guijin Son +5	May 27, 2026	49
ArcANE: Do Role-Playing Language Agents Stay in Character at the Right Time?	Agents	Woojung Song +5	Jun 4, 2026	47
APPO: Agentic Procedural Policy Optimization	Agents	Xucong Wang +7	Jun 10, 2026	46
TIDE: Proactive Multi-Problem Discovery via Template-Guided Iteration	Agents	Soyeong Jeong +3	Jun 3, 2026	44
Trust Region On-Policy Distillation	Agents	Xingrun Xing +4	May 31, 2026	42
LongTraceRL: Learning Long-Context Reasoning from Search Agent Trajectories with Rubric Rewards	Agents	Nianyi Lin +3	May 29, 2026	41
Personalize-then-Store: Benchmarking and Learning Personalized Memory for Long-horizon Agents	Agents	Yeonjun In +4	May 25, 2026	41
AdaPlanBench: Evaluating Adaptive Planning in Large Language Model Agents under World and User Constraints	Agents	Jiayu Liu +10	Jun 4, 2026	40
The MiniMax-M2 Series: Mini Activations Unleashing Max Real-World Intelligence	Agents	MiniMax +206	May 26, 2026	39
Learn from Weaknesses: Automated Domain Specialization for Small Computer-Use Agents	Agents	Suji Kim +2	May 27, 2026	38
Reproducing, Analyzing, and Detecting Reward Hacking in Rubric-Based Reinforcement Learning	Agents	Xuekang Wang +5	Jun 3, 2026	37
SkillAdaptor: Self-Adapting Skills for LLM Agents from Trajectories	Agents	Zhuoyun Yu +6	May 31, 2026	35
ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence	Agents	Rui Meng +12	May 25, 2026	35
Rethinking Memory as Continuously Evolving Connectivity	Agents	Jizhan Fang +14	May 27, 2026	34
CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents	Agents	Bowen Wang +13	May 25, 2026	33
Skill0.5: Joint Skill Internalization and Utilization for Out-of-Distribution Generalization in Agentic Reinforcement Learning	Agents	Jiapeng Zhu +7	May 27, 2026	32
Your Agents Are Aging Too: Agent Lifespan Engineering for Deployed Systems	Agents	Jianing Zhu +7	May 25, 2026	32
Memory is Reconstructed, Not Retrieved: Graph Memory for LLM Agents	Agents	Shuo Ji +2	Jun 4, 2026	31
DeNovoSWE: Scaling Long-Horizon Environments for Generating Entire Repositories from Scratch	Agents	Jiale Zhao +6	Jun 9, 2026	31
Streaming Communication in Multi-Agent Reasoning	Agents	Zhen Yang +5	Jun 3, 2026	29
AutoLab: Can Frontier Models Solve Long-Horizon Auto Research and Engineering Tasks?	Agents	Zhangchen Xu +18	Jun 3, 2026	28
OpenSkill: Open-World Self-Evolution for LLM Agents	Agents	Zhiling Yan +10	Jun 4, 2026	27
SkillGrad: Optimizing Agent Skills Like Gradient Descent	Agents	Hanyu Wang +4	May 26, 2026	27
MUSE-Autoskill: Self-Evolving Agents via Skill Creation, Memory, Management, and Evaluation	Agents	Huawei Lin +4	May 26, 2026	27
EurekAgent: Agent Environment Engineering is All You Need For Autonomous Scientific Discovery	Agents	Amy Xin +7	Jun 11, 2026	26
AI Research Agents Narrow Scientific Exploration	Agents	Yixuan Tang +1	May 27, 2026	26
End-to-End Context Compression at Scale	Agents	Ang Li +14	Jun 8, 2026	25
Benchmarks are Not Enough: RAMP for Runtime Assessing of Agentic Models in Production Systems	Agents	Yipeng Ouyang +5	May 26, 2026	25
SAAS: Self-Aware Reinforcement Learning for Over-Search Mitigation in Agentic Search	Agents	Yunbo Tang +6	May 28, 2026	25
When Tools Fail: Benchmarking Dynamic Replanning and Anomaly Recovery in LLM Agents	Agents	Dongsheng Zhu +7	Jun 4, 2026	23
Exploring Autonomous Agentic Data Engineering for Model Specialization	Agents	Yujie Luo +12	May 28, 2026	23
HarnessX: A Composable, Adaptive, and Evolvable Agent Harness Foundry	Agents	Tingyang Chen +13	Jun 12, 2026	22
From Chatbot to Digital Colleague: The Paradigm Shift Toward Persistent Autonomous AI	Agents	Yongheng Zhang +19	Jun 12, 2026	21
Breaking Entropy Bounds: Accelerating RL Training via MTP with Rejection Sampling	Agents	Yucheng Li +16	Jun 10, 2026	21
Rethinking Continual Experience Internalization for Self-Evolving LLM Agents	Agents	Jingwen Chen +9	Jun 3, 2026	21
LongDS-Bench: On the Failure of Long-Horizon Agentic Data Analysis	Agents	Kewei Xu +6	May 28, 2026	21
Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents	Agents	Minhua Lin +16	May 28, 2026	21
Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields	Agents	Liya Zhu +39	Jun 9, 2026	20
Recovering Policy-Induced Errors: Benchmarking and Trajectory Synthesis for Robust GUI Agents	Agents	Tianpeng Bu +9	May 28, 2026	20
SubtleMemory: A Benchmark for Fine-Grained Relational Memory Discrimination in Long-Horizon AI Agents	Agents	Wenxuan Wang +6	Jun 4, 2026	19
MLEvolve: A Self-Evolving Framework for Automated Machine Learning Algorithm Discovery	Agents	Shangheng Du +13	Jun 4, 2026	19
VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions	Agents	Yuxin Chen +13	May 26, 2026	19
EEVEE: Towards Test-time Prompt Learning in the Real World for Self-Improving Agents	Agents	Weixian Xu +2	Jun 9, 2026	18
TRACE: A Unified Rollout Budget Allocation Framework for Efficient Agentic Reinforcement Learning	Agents	Heming Zou +11	Jun 9, 2026	17
MemTrain: Self-Supervised Context Memory Training	Agents	Ziheng Li +4	Jun 2, 2026	17
When Does Multi-Agent RL Improve LLM Workflows? Workflow, Scale, and Policy-Sharing Tradeoffs	Agents	Yifan Zeng +6	May 22, 2026	17
LiteCoder-Terminal: Scaling Long-Horizon Terminal Environments for Learning Language Agents	Agents	Xiaoxuan Peng +7	May 28, 2026	17
LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know?	Agents	HuiMing Fan +7	May 27, 2026	17
MCP-Persona: Benchmarking LLM Agents on Real-World Personal Applications via Environment Simulation	Agents	Wenhao Wang +11	Jun 1, 2026	16
Skill is Not One-Size-Fits-All: Model-Aware Skill Alignment for LLM Agents	Agents	Jianxiang Yu +5	May 29, 2026	16
AsyncTool: Evaluating the Asynchronous Function Calling Capability under Multi-Task Scenarios	Agents	Kou Shi +9	May 27, 2026	16
Learning to Act under Noise: Enhancing Agent Robustness via Noisy Environments	Agents	Yuxin Chen +11	May 26, 2026	16
Efficient Agentic Reinforcement Learning with On-Policy Intrinsic Knowledge Boundary Enhancement	Agents	Dingwei Chen +7	May 26, 2026	16
Joint Agent Memory and Exploration Learning via Novelty Signals	Agents	Shizuo Tian +11	Jun 1, 2026	15
When Cloud Agents Meet Device Agents: Lessons from Hybrid Multi-Agent Systems	Agents	Corrado Rainone +3	May 28, 2026	15
VibeSearchBench: Benchmarking Long-horizon Proactive Search in the Wild	Agents	Xiaohongshu Inc	May 27, 2026	15
DarkForest: Less Talk, Higher Accuracy for Multi-Agent LLMs	Agents	Yi Li +5	May 24, 2026	15
Bayesian-Agent: Posterior-Guided Skill Evolution for LLM Agent Harnesses	Agents	Xiaojun Wu +9	Jun 6, 2026	14
SIA: Self Improving AI with Harness & Weight Updates	Agents	Prannay Hebbar +6	May 26, 2026	14
From AGI to ASI	Agents	Tim Genewein +13	Jun 10, 2026	13
Skill-RM: Unifying Heterogeneous Evaluation Criteria via Agent Skill	Agents	Tao Chen +12	Jun 2, 2026	13
Ψ-Bench: Evaluating Persona-Sensitive Influencing in Persuasive Dialogues	Agents	Peixuan Han +5	Jun 1, 2026	13
SlimSearcher: Training Efficiency-Aware Web Agents via Adaptive Reward Gating	Agents	Zequn Xie +6	Jun 5, 2026	12
Socratic-SWE: Self-Evolving Coding Agents via Trace-Derived Agent Skills	Agents	Chuan Xiao +7	Jun 5, 2026	12
Unsupervised Skill Discovery for Agentic Data Analysis	Agents	Zhisong Qiu +6	Jun 4, 2026	12
Adaptive Auto-Harness: Sustained Self-Improvement for Agentic System Deployment on Open-Ended Task Streams	Agents	Zewen Liu +9	Jun 1, 2026	12
Emergent Languages in Populations of Language Model Agents: From Token Efficiency to Oversight Evasion	Agents	Stine Lyngsø Beltoft +7	May 29, 2026	12
HarnessBridge: Learnable Bidirectional Controller for LLM Agent Harness	Agents	Xiaoxuan Wang +5	Jun 11, 2026	11
EvoTrainer: Co-Evolving LLM Policies and Training Harnesses for Autonomous Agentic Reinforcement Learning	Agents	Guhong Chen +8	Jun 2, 2026	11
PhoneWorld: Scaling Phone-Use Agent Environments	Agents	Zhengyang Tang +23	May 28, 2026	11
AutoScientists: Self-Organizing Agent Teams for Long-Running Scientific Experimentation	Agents	Shanghua Gao +2	May 27, 2026	11
TreeSeeker: Tree-Structured Trial, Error, and Return in Deep Search	Agents	Zhuofan Shi +10	Jun 10, 2026	10
Online Skill Learning for Web Agents via State-Grounded Dynamic Retrieval	Agents	Jiaxi Li +7	Jun 3, 2026	10
SkillHarm: Lifecycle-Aware Skill-Based Attacks via Automated Construction	Agents	Yuting Ning +10	Jun 1, 2026	10
Rethinking How to Remember: Beyond Atomic Facts in Lifelong LLM Agent Memory	Agents	Jingwei Sun +4	May 19, 2026	10
From Model Scaling to System Scaling: Scaling the Harness in Agentic AI	Agents	Shangding Gu	May 25, 2026	9
CoHyDE: Iterative Co-Training of LLM Rewriter & Dense Encoder for Tool Retrieval	Agents	Vaishali Senthil +2	May 28, 2026	9
