Papers

Showing 1–32 of 32 notable papers
| Paper | Topic | Authors | Published | HF ▲ |
| --- | --- | --- | --- | --- |
| On the Geometry of On-Policy Distillation | Reinforcement learning | Zhennan Shen +8 | Jun 5, 2026 | 71 |
| DenoiseRL: Bootstrapping Reasoning Models to Recover from Noisy Prefixes | Reinforcement learning | Caijun Xu +3 | May 27, 2026 | 46 |
| Beyond Uniform Token-Level Trust Region in LLM Reinforcement Learning | Reinforcement learning | Renjie Mao +9 | Jun 9, 2026 | 42 |
| Rethinking the Divergence Regularization in LLM RL | Reinforcement learning | Jiarui Yao +5 | Jun 8, 2026 | 32 |
| Language Models Need Sleep: Learning to Self-Modify and Consolidate Memories | Reinforcement learning | Ali Behrouz +2 | Jun 2, 2026 | 29 |
| Not All Disagreement Is Learnable: Token Teachability in On-Policy Distillation | Reinforcement learning | Yuanyi Wang +8 | May 26, 2026 | 26 |
| ThoughtFold: Folding Reasoning Chains via Introspective Preference Learning | Reinforcement learning | Ziyan Liu +9 | Jun 2, 2026 | 25 |
| N-GRPO: Embedding-Level Neighbor Mixing for Enhanced Policy Optimization | Reinforcement learning | Xukun Zhu +3 | Jun 9, 2026 | 24 |
| Demystifying Hidden-State Recurrence: Switchable Latent Reasoning with On-Policy Reinforcement Learning | Reinforcement learning | Jiayu Yang +8 | Jun 11, 2026 | 19 |
| ESPO: Early-Stopping Proximal Policy Optimization | Reinforcement learning | Zihang Li +10 | May 28, 2026 | 19 |
| Guiding LLM Post-training Data Engineering with Model Internals from Sparse Autoencoders | Reinforcement learning | Yi Jing +6 | May 26, 2026 | 15 |
| Small RL Controller, Large Language Model: RL-Guided Adaptive Sampling for Test-Time Scaling | Reinforcement learning | Runpeng Dai +4 | Jun 2, 2026 | 14 |
| RUBRIC-ARROW: Alternating Pointwise Rubric Reward Modeling for LLM Post-training in Non-verifiable Domains | Reinforcement learning | Haoxiang Jiang +7 | May 27, 2026 | 13 |
| Large Language Models Hack Rewards, and Society | Reinforcement learning | Wei Liu +4 | Jun 2, 2026 | 10 |
| The Flip Side of RLHF: On-Policy Feedback for Reward Model Self-Supervised Improvement | Reinforcement learning | Xiaobo Wang +5 | May 29, 2026 | 10 |
| OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification | Reinforcement learning | Yuhang Zhou +7 | May 31, 2026 | 8 |
| Verifiable Environments Are LEGO Bricks: Recursive Composition for Reasoning Generalization | Reinforcement learning | Hao Xiang +10 | Jun 10, 2026 | 7 |
| Off-the-Shelf LLMs as Process Scorers: Training-Free Alternative to PRMs for Mathematical Reasoning | Reinforcement learning | Atoosa Chegini +1 | Jun 1, 2026 | 7 |
| Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases | Reinforcement learning | Dongyoon Hahm +2 | May 26, 2026 | 7 |
| SG-OPD: Sign-Gated On-Policy Distillation via Sign-Consistency Gating and Phased Teacher Sampling | Reinforcement learning | Haoran Xu +5 | Jun 8, 2026 | 6 |
| How Does Reasoning Flow? Tracing Attention-Induced Information Flow for Targeted RL in LLMs | Reinforcement learning | Zhichen Dong +11 | Jun 9, 2026 | 6 |
| Not only where, But when: Temporal Scheduling for RLVR | Reinforcement learning | Jinghao Zhang +3 | May 25, 2026 | 6 |
| Joint Training of Multi-Token Prediction in Reinforcement Learning via Optimal Coefficient Calibration | Reinforcement learning | Zili Wang +5 | May 27, 2026 | 6 |
| CroCo: Cross-Lingual Contrastive Preference Tuning on Self-Generations | Reinforcement learning | Mike Zhang +2 | May 25, 2026 | 6 |
| VISTA: View-Consistent Self-Verified Training for GUI Grounding | Reinforcement learning | Xinyu Qiu +5 | Jun 12, 2026 | 5 |
| Smaller Models are Natural Explorers for Policy-Level Diversity in GRPO | Reinforcement learning | Yiming Ren +10 | Jun 2, 2026 | 4 |
| CORE: Contrastive Reflection Enables Rapid Improvements in Reasoning | Reinforcement learning | Linas Nasvytis +5 | May 27, 2026 | 4 |
| GRAIL: Gradient-Reweighted Advantages for Reinforcement Learning with Verifiable Rewards | Reinforcement learning | Tej Deep Pala +2 | Jun 3, 2026 | 4 |
| Directional Alignment Mitigates Reward Hacking in Reinforcement Learning for Language Models | Reinforcement learning | Wenlong Deng +6 | May 24, 2026 | 4 |
| Distilling LLM Feedback for Lean Theorem Proving | Reinforcement learning | Gaetan Narozniak +4 | May 29, 2026 | 3 |
| Trust Region Q Adjoint Matching | Reinforcement learning | Yonghoon Dong +4 | May 26, 2026 | 3 |
| Self-Evaluation Is Already There: Eliciting Latent Judge Calibration in Base LLMs with Minimal Data | Reinforcement learning | XiuYu Zhang +3 | Jun 3, 2026 | 1 |