Papers

Showing 1–100 of 155 notable papers
| Paper | Topic | Authors | Published | HF ▲ |
|---|---|---|---|---|
| Agents' Last Exam | Agents | Yiyou Sun +39 | Jun 3, 2026 | 338 |
| Crafter: A Multi-Agent Harness for Editable Scientific Figure Generation from Diverse Inputs | Agents | Haozhe Zhao +8 | May 28, 2026 | 192 |
| AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security | Agents | Dongrui Liu +49 | May 28, 2026 | 142 |
| EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments | Agents | Jundong Xu +13 | Jun 11, 2026 | 130 |
| COLLEAGUE.SKILL: Automated AI Skill Generation via Expert Knowledge Distillation | Agents | Tianyi Zhou +4 | May 29, 2026 | 111 |
| SWE-Explore: Benchmarking How Coding Agents Explore Repositories | Agents | Shaoqiu Zhang +10 | Jun 5, 2026 | 110 |
| Toward Generalist Autonomous Research via Hypothesis-Tree Refinement | Agents | Jiajie Jin +17 | Jun 10, 2026 | 109 |
| GrepSeek: Training Search Agents for Direct Corpus Interaction | Agents | Alireza Salemi +6 | May 28, 2026 | 104 |
| Role-Agent: Bootstrapping LLM Agents via Dual-Role Evolution | Agents | Xucong Wang +6 | Jun 9, 2026 | 75 |
| FORT-Searcher: Synthesizing Shortcut-Resistant Search Tasks for Training Deep Search Agents | Agents | Jia Deng +11 | Jun 10, 2026 | 71 |
| A Matter of TASTE: Improving Coverage and Difficulty of Agent Benchmarks | Agents | Tomer Keren +5 | May 27, 2026 | 65 |
| MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent Research | Agents | Dingbang Wu +10 | May 25, 2026 | 64 |
| Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks | Agents | Mengyu Zheng +15 | Jun 10, 2026 | 63 |
| LatentSkill: From In-Context Textual Skills to In-Weight Latent Skills for LLM Agents | Agents | Aofan Yu +10 | Jun 4, 2026 | 62 |
| Masking Stale Observations Helps Search Agents -- Until It Doesn't: A Regime Map and Its Mechanism | Agents | Haoxiang Zhang +6 | May 29, 2026 | 62 |
| Agentic Environment Engineering for Large Language Models: A Survey of Environment Modeling, Synthesis, Evaluation, and Application | Agents | Jiachun Li +14 | Jun 10, 2026 | 60 |
| K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts | Agents | Nahyun Lee +14 | Jun 1, 2026 | 55 |
| Mellum2 Technical Report | Agents | Marko Kojic +8 | May 29, 2026 | 54 |
| Where Do Deep-Research Agents Go Wrong? Span-Level Error Localization in Agent Trajectories | Agents | Jiaming Wang +10 | Jun 1, 2026 | 53 |
| Retrospective Harness Optimization: Improving LLM Agents via Self-Preference over Trajectory Rollouts | Agents | Wenbo Pan +7 | Jun 4, 2026 | 52 |
| SoCRATES: Towards Reliable Automated Evaluation of Proactive LLM Mediation across Domains and Socio-cognitive Variations | Agents | Taewon Yun +5 | Jun 4, 2026 | 51 |
| Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses | Agents | Pengcheng Jiang +7 | Jun 1, 2026 | 51 |
| SearchSwarm: Towards Delegation Intelligence in Agentic LLMs for Long-Horizon Deep Research | Agents | Pu Ning +9 | Jun 8, 2026 | 50 |
| ResearchMath-14K: Scaling Research-Level Mathematics via Agents | Agents | Guijin Son +5 | May 27, 2026 | 49 |
| ArcANE: Do Role-Playing Language Agents Stay in Character at the Right Time? | Agents | Woojung Song +5 | Jun 4, 2026 | 47 |
| APPO: Agentic Procedural Policy Optimization | Agents | Xucong Wang +7 | Jun 10, 2026 | 46 |
| TIDE: Proactive Multi-Problem Discovery via Template-Guided Iteration | Agents | Soyeong Jeong +3 | Jun 3, 2026 | 44 |
| Trust Region On-Policy Distillation | Agents | Xingrun Xing +4 | May 31, 2026 | 42 |
| LongTraceRL: Learning Long-Context Reasoning from Search Agent Trajectories with Rubric Rewards | Agents | Nianyi Lin +3 | May 29, 2026 | 41 |
| Personalize-then-Store: Benchmarking and Learning Personalized Memory for Long-horizon Agents | Agents | Yeonjun In +4 | May 25, 2026 | 41 |
| AdaPlanBench: Evaluating Adaptive Planning in Large Language Model Agents under World and User Constraints | Agents | Jiayu Liu +10 | Jun 4, 2026 | 40 |
| The MiniMax-M2 Series: Mini Activations Unleashing Max Real-World Intelligence | Agents | MiniMax +206 | May 26, 2026 | 39 |
| Learn from Weaknesses: Automated Domain Specialization for Small Computer-Use Agents | Agents | Suji Kim +2 | May 27, 2026 | 38 |
| Reproducing, Analyzing, and Detecting Reward Hacking in Rubric-Based Reinforcement Learning | Agents | Xuekang Wang +5 | Jun 3, 2026 | 37 |
| SkillAdaptor: Self-Adapting Skills for LLM Agents from Trajectories | Agents | Zhuoyun Yu +6 | May 31, 2026 | 35 |
| ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence | Agents | Rui Meng +12 | May 25, 2026 | 35 |
| Rethinking Memory as Continuously Evolving Connectivity | Agents | Jizhan Fang +14 | May 27, 2026 | 34 |
| CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents | Agents | Bowen Wang +13 | May 25, 2026 | 33 |
| Skill0.5: Joint Skill Internalization and Utilization for Out-of-Distribution Generalization in Agentic Reinforcement Learning | Agents | Jiapeng Zhu +7 | May 27, 2026 | 32 |
| Your Agents Are Aging Too: Agent Lifespan Engineering for Deployed Systems | Agents | Jianing Zhu +7 | May 25, 2026 | 32 |
| Memory is Reconstructed, Not Retrieved: Graph Memory for LLM Agents | Agents | Shuo Ji +2 | Jun 4, 2026 | 31 |
| DeNovoSWE: Scaling Long-Horizon Environments for Generating Entire Repositories from Scratch | Agents | Jiale Zhao +6 | Jun 9, 2026 | 31 |
| Streaming Communication in Multi-Agent Reasoning | Agents | Zhen Yang +5 | Jun 3, 2026 | 29 |
| AutoLab: Can Frontier Models Solve Long-Horizon Auto Research and Engineering Tasks? | Agents | Zhangchen Xu +18 | Jun 3, 2026 | 28 |
| OpenSkill: Open-World Self-Evolution for LLM Agents | Agents | Zhiling Yan +10 | Jun 4, 2026 | 27 |
| SkillGrad: Optimizing Agent Skills Like Gradient Descent | Agents | Hanyu Wang +4 | May 26, 2026 | 27 |
| MUSE-Autoskill: Self-Evolving Agents via Skill Creation, Memory, Management, and Evaluation | Agents | Huawei Lin +4 | May 26, 2026 | 27 |
| EurekAgent: Agent Environment Engineering is All You Need For Autonomous Scientific Discovery | Agents | Amy Xin +7 | Jun 11, 2026 | 26 |
| AI Research Agents Narrow Scientific Exploration | Agents | Yixuan Tang +1 | May 27, 2026 | 26 |
| End-to-End Context Compression at Scale | Agents | Ang Li +14 | Jun 8, 2026 | 25 |
| Benchmarks are Not Enough: RAMP for Runtime Assessing of Agentic Models in Production Systems | Agents | Yipeng Ouyang +5 | May 26, 2026 | 25 |
| SAAS: Self-Aware Reinforcement Learning for Over-Search Mitigation in Agentic Search | Agents | Yunbo Tang +6 | May 28, 2026 | 25 |
| When Tools Fail: Benchmarking Dynamic Replanning and Anomaly Recovery in LLM Agents | Agents | Dongsheng Zhu +7 | Jun 4, 2026 | 23 |
| Exploring Autonomous Agentic Data Engineering for Model Specialization | Agents | Yujie Luo +12 | May 28, 2026 | 23 |
| HarnessX: A Composable, Adaptive, and Evolvable Agent Harness Foundry | Agents | Tingyang Chen +13 | Jun 12, 2026 | 22 |
| From Chatbot to Digital Colleague: The Paradigm Shift Toward Persistent Autonomous AI | Agents | Yongheng Zhang +19 | Jun 12, 2026 | 21 |
| Breaking Entropy Bounds: Accelerating RL Training via MTP with Rejection Sampling | Agents | Yucheng Li +16 | Jun 10, 2026 | 21 |
| Rethinking Continual Experience Internalization for Self-Evolving LLM Agents | Agents | Jingwen Chen +9 | Jun 3, 2026 | 21 |
| LongDS-Bench: On the Failure of Long-Horizon Agentic Data Analysis | Agents | Kewei Xu +6 | May 28, 2026 | 21 |
| Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents | Agents | Minhua Lin +16 | May 28, 2026 | 21 |
| Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields | Agents | Liya Zhu +39 | Jun 9, 2026 | 20 |
| Recovering Policy-Induced Errors: Benchmarking and Trajectory Synthesis for Robust GUI Agents | Agents | Tianpeng Bu +9 | May 28, 2026 | 20 |
| SubtleMemory: A Benchmark for Fine-Grained Relational Memory Discrimination in Long-Horizon AI Agents | Agents | Wenxuan Wang +6 | Jun 4, 2026 | 19 |
| MLEvolve: A Self-Evolving Framework for Automated Machine Learning Algorithm Discovery | Agents | Shangheng Du +13 | Jun 4, 2026 | 19 |
| VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions | Agents | Yuxin Chen +13 | May 26, 2026 | 19 |
| EEVEE: Towards Test-time Prompt Learning in the Real World for Self-Improving Agents | Agents | Weixian Xu +2 | Jun 9, 2026 | 18 |
| TRACE: A Unified Rollout Budget Allocation Framework for Efficient Agentic Reinforcement Learning | Agents | Heming Zou +11 | Jun 9, 2026 | 17 |
| MemTrain: Self-Supervised Context Memory Training | Agents | Ziheng Li +4 | Jun 2, 2026 | 17 |
| When Does Multi-Agent RL Improve LLM Workflows? Workflow, Scale, and Policy-Sharing Tradeoffs | Agents | Yifan Zeng +6 | May 22, 2026 | 17 |
| LiteCoder-Terminal: Scaling Long-Horizon Terminal Environments for Learning Language Agents | Agents | Xiaoxuan Peng +7 | May 28, 2026 | 17 |
| LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know? | Agents | HuiMing Fan +7 | May 27, 2026 | 17 |
| MCP-Persona: Benchmarking LLM Agents on Real-World Personal Applications via Environment Simulation | Agents | Wenhao Wang +11 | Jun 1, 2026 | 16 |
| Skill is Not One-Size-Fits-All: Model-Aware Skill Alignment for LLM Agents | Agents | Jianxiang Yu +5 | May 29, 2026 | 16 |
| AsyncTool: Evaluating the Asynchronous Function Calling Capability under Multi-Task Scenarios | Agents | Kou Shi +9 | May 27, 2026 | 16 |
| Learning to Act under Noise: Enhancing Agent Robustness via Noisy Environments | Agents | Yuxin Chen +11 | May 26, 2026 | 16 |
| Efficient Agentic Reinforcement Learning with On-Policy Intrinsic Knowledge Boundary Enhancement | Agents | Dingwei Chen +7 | May 26, 2026 | 16 |
| Joint Agent Memory and Exploration Learning via Novelty Signals | Agents | Shizuo Tian +11 | Jun 1, 2026 | 15 |
| When Cloud Agents Meet Device Agents: Lessons from Hybrid Multi-Agent Systems | Agents | Corrado Rainone +3 | May 28, 2026 | 15 |
| VibeSearchBench: Benchmarking Long-horizon Proactive Search in the Wild | Agents | Xiaohongshu Inc | May 27, 2026 | 15 |
| DarkForest: Less Talk, Higher Accuracy for Multi-Agent LLMs | Agents | Yi Li +5 | May 24, 2026 | 15 |
| Bayesian-Agent: Posterior-Guided Skill Evolution for LLM Agent Harnesses | Agents | Xiaojun Wu +9 | Jun 6, 2026 | 14 |
| SIA: Self Improving AI with Harness & Weight Updates | Agents | Prannay Hebbar +6 | May 26, 2026 | 14 |
| From AGI to ASI | Agents | Tim Genewein +13 | Jun 10, 2026 | 13 |
| Skill-RM: Unifying Heterogeneous Evaluation Criteria via Agent Skill | Agents | Tao Chen +12 | Jun 2, 2026 | 13 |
| Ψ-Bench: Evaluating Persona-Sensitive Influencing in Persuasive Dialogues | Agents | Peixuan Han +5 | Jun 1, 2026 | 13 |
| SlimSearcher: Training Efficiency-Aware Web Agents via Adaptive Reward Gating | Agents | Zequn Xie +6 | Jun 5, 2026 | 12 |
| Socratic-SWE: Self-Evolving Coding Agents via Trace-Derived Agent Skills | Agents | Chuan Xiao +7 | Jun 5, 2026 | 12 |
| Unsupervised Skill Discovery for Agentic Data Analysis | Agents | Zhisong Qiu +6 | Jun 4, 2026 | 12 |
| Adaptive Auto-Harness: Sustained Self-Improvement for Agentic System Deployment on Open-Ended Task Streams | Agents | Zewen Liu +9 | Jun 1, 2026 | 12 |
| Emergent Languages in Populations of Language Model Agents: From Token Efficiency to Oversight Evasion | Agents | Stine Lyngsø Beltoft +7 | May 29, 2026 | 12 |
| HarnessBridge: Learnable Bidirectional Controller for LLM Agent Harness | Agents | Xiaoxuan Wang +5 | Jun 11, 2026 | 11 |
| EvoTrainer: Co-Evolving LLM Policies and Training Harnesses for Autonomous Agentic Reinforcement Learning | Agents | Guhong Chen +8 | Jun 2, 2026 | 11 |
| PhoneWorld: Scaling Phone-Use Agent Environments | Agents | Zhengyang Tang +23 | May 28, 2026 | 11 |
| AutoScientists: Self-Organizing Agent Teams for Long-Running Scientific Experimentation | Agents | Shanghua Gao +2 | May 27, 2026 | 11 |
| TreeSeeker: Tree-Structured Trial, Error, and Return in Deep Search | Agents | Zhuofan Shi +10 | Jun 10, 2026 | 10 |
| Online Skill Learning for Web Agents via State-Grounded Dynamic Retrieval | Agents | Jiaxi Li +7 | Jun 3, 2026 | 10 |
| SkillHarm: Lifecycle-Aware Skill-Based Attacks via Automated Construction | Agents | Yuting Ning +10 | Jun 1, 2026 | 10 |
| Rethinking How to Remember: Beyond Atomic Facts in Lifelong LLM Agent Memory | Agents | Jingwei Sun +4 | May 19, 2026 | 10 |
| From Model Scaling to System Scaling: Scaling the Harness in Agentic AI | Agents | Shangding Gu | May 25, 2026 | 9 |
| CoHyDE: Iterative Co-Training of LLM Rewriter & Dense Encoder for Tool Retrieval | Agents | Vaishali Senthil +2 | May 28, 2026 | 9 |