| On the Geometry of On-Policy Distillation | Reinforcement learning | Zhennan Shen +8 | Jun 5, 2026 | 71 |
| DenoiseRL: Bootstrapping Reasoning Models to Recover from Noisy Prefixes | Reinforcement learning | Caijun Xu +3 | May 27, 2026 | 46 |
| Beyond Uniform Token-Level Trust Region in LLM Reinforcement Learning | Reinforcement learning | Renjie Mao +9 | Jun 9, 2026 | 42 |
| Rethinking the Divergence Regularization in LLM RL | Reinforcement learning | Jiarui Yao +5 | Jun 8, 2026 | 32 |
| Language Models Need Sleep: Learning to Self-Modify and Consolidate Memories | Reinforcement learning | Ali Behrouz +2 | Jun 2, 2026 | 29 |
| Not All Disagreement Is Learnable: Token Teachability in On-Policy Distillation | Reinforcement learning | Yuanyi Wang +8 | May 26, 2026 | 26 |
| ThoughtFold: Folding Reasoning Chains via Introspective Preference Learning | Reinforcement learning | Ziyan Liu +9 | Jun 2, 2026 | 25 |
| N-GRPO: Embedding-Level Neighbor Mixing for Enhanced Policy Optimization | Reinforcement learning | Xukun Zhu +3 | Jun 9, 2026 | 24 |
| Demystifying Hidden-State Recurrence: Switchable Latent Reasoning with On-Policy Reinforcement Learning | Reinforcement learning | Jiayu Yang +8 | Jun 11, 2026 | 19 |
| ESPO: Early-Stopping Proximal Policy Optimization | Reinforcement learning | Zihang Li +10 | May 28, 2026 | 19 |
| Guiding LLM Post-training Data Engineering with Model Internals from Sparse Autoencoders | Reinforcement learning | Yi Jing +6 | May 26, 2026 | 15 |
| Small RL Controller, Large Language Model: RL-Guided Adaptive Sampling for Test-Time Scaling | Reinforcement learning | Runpeng Dai +4 | Jun 2, 2026 | 14 |
| RUBRIC-ARROW: Alternating Pointwise Rubric Reward Modeling for LLM Post-training in Non-verifiable Domains | Reinforcement learning | Haoxiang Jiang +7 | May 27, 2026 | 13 |
| Large Language Models Hack Rewards, and Society | Reinforcement learning | Wei Liu +4 | Jun 2, 2026 | 10 |
| The Flip Side of RLHF: On-Policy Feedback for Reward Model Self-Supervised Improvement | Reinforcement learning | Xiaobo Wang +5 | May 29, 2026 | 10 |
| OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification | Reinforcement learning | Yuhang Zhou +7 | May 31, 2026 | 8 |
| Verifiable Environments Are LEGO Bricks: Recursive Composition for Reasoning Generalization | Reinforcement learning | Hao Xiang +10 | Jun 10, 2026 | 7 |
| Off-the-Shelf LLMs as Process Scorers: Training-Free Alternative to PRMs for Mathematical Reasoning | Reinforcement learning | Atoosa Chegini +1 | Jun 1, 2026 | 7 |
| Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases | Reinforcement learning | Dongyoon Hahm +2 | May 26, 2026 | 7 |
| SG-OPD: Sign-Gated On-Policy Distillation via Sign-Consistency Gating and Phased Teacher Sampling | Reinforcement learning | Haoran Xu +5 | Jun 8, 2026 | 6 |
| How Does Reasoning Flow? Tracing Attention-Induced Information Flow for Targeted RL in LLMs | Reinforcement learning | Zhichen Dong +11 | Jun 9, 2026 | 6 |
| Not only where, But when: Temporal Scheduling for RLVR | Reinforcement learning | Jinghao Zhang +3 | May 25, 2026 | 6 |
| Joint Training of Multi-Token Prediction in Reinforcement Learning via Optimal Coefficient Calibration | Reinforcement learning | Zili Wang +5 | May 27, 2026 | 6 |
| CroCo: Cross-Lingual Contrastive Preference Tuning on Self-Generations | Reinforcement learning | Mike Zhang +2 | May 25, 2026 | 6 |
| VISTA: View-Consistent Self-Verified Training for GUI Grounding | Reinforcement learning | Xinyu Qiu +5 | Jun 12, 2026 | 5 |
| Smaller Models are Natural Explorers for Policy-Level Diversity in GRPO | Reinforcement learning | Yiming Ren +10 | Jun 2, 2026 | 4 |
| CORE: Contrastive Reflection Enables Rapid Improvements in Reasoning | Reinforcement learning | Linas Nasvytis +5 | May 27, 2026 | 4 |
| GRAIL: Gradient-Reweighted Advantages for Reinforcement Learning with Verifiable Rewards | Reinforcement learning | Tej Deep Pala +2 | Jun 3, 2026 | 4 |
| Directional Alignment Mitigates Reward Hacking in Reinforcement Learning for Language Models | Reinforcement learning | Wenlong Deng +6 | May 24, 2026 | 4 |
| Distilling LLM Feedback for Lean Theorem Proving | Reinforcement learning | Gaetan Narozniak +4 | May 29, 2026 | 3 |
| Trust Region Q Adjoint Matching | Reinforcement learning | Yonghoon Dong +4 | May 26, 2026 | 3 |
| Self-Evaluation Is Already There: Eliciting Latent Judge Calibration in Base LLMs with Minimal Data | Reinforcement learning | XiuYu Zhang +3 | Jun 3, 2026 | 1 |