| Humanoid-GPT: Scaling Data and Structure for Zero-Shot Motion Tracking | Data & benchmarks | Zekun Qi +12 | Jun 2, 2026 | 38 |
| SCOPE: Self-Play via Co-Evolving Policies for Open-Ended Tasks | Data & benchmarks | Wai-Chung Kwan +3 | May 29, 2026 | 28 |
| Rethinking RAG in Long Videos: What to Retrieve and How to Use It? | Data & benchmarks | Yuho Lee +7 | Jun 11, 2026 | 20 |
| UnpredictaBench: A Benchmark for Evaluating Distributional Randomness in LLMs | Data & benchmarks | Amirhossein Abaskohi +6 | Jun 4, 2026 | 20 |
| ComBench: A Benchmark for Rigorous Proof Reasoning and Constructive Realization in Olympiad-Level Combinatorics | Data & benchmarks | Shunkai Zhang +17 | Jun 9, 2026 | 18 |
| PRISM: A Multi-Dimensional Benchmark for Evaluating LLM Peer Reviewers | Data & benchmarks | Ngoc Phan Phuoc Loc +10 | May 26, 2026 | 16 |
| RankJudge: A Multi-Turn LLM-as-a-Judge Synthetic Benchmark Generator | Data & benchmarks | Zhenwei Tang +5 | May 20, 2026 | 16 |
| REPOT: Recoverable Program-of-Thought via Checkpoint Repair | Data & benchmarks | Parsa Mazaheri | May 28, 2026 | 10 |
| LLMs Can Leak Training Data But Do They Want To? A Propensity-Aware Evaluation of Memorization in LLMs | Data & benchmarks | Gianluca Barmina +2 | Jun 4, 2026 | 8 |
| RayDer: Scalable Self-Supervised Novel View Synthesis from Real-World Video | Data & benchmarks | Ulrich Prestel +3 | May 29, 2026 | 7 |
| PRISM: Position-encoded Regressive Inverse Spectral Model for Multilayer Thin-Film Design | Data & benchmarks | Runtian Wang +3 | May 26, 2026 | 7 |
| SurGe: Improved Surface Geometry in Point Maps | Data & benchmarks | Karim Knaebel +6 | May 29, 2026 | 6 |
| Breaking the Chains of Probability: Neutrosophic Logic as a New Framework for Epistemic Uncertainty in Large Language Models | Data & benchmarks | Maikel Yelandi Leyva-Vázquez +1 | May 22, 2026 | 5 |
| Time-Series Foundation Model Embeddings for Remaining Useful Life Estimation | Data & benchmarks | Amir El-Ghoussani +3 | Jun 10, 2026 | 3 |
| On the Limits of LLM-as-Judge for Scientific Novelty Assessment | Data & benchmarks | Soumitra Sinhahajari +2 | Jun 10, 2026 | 3 |
| Chiaroscuro Attention: Spending Compute in the Dark | Data & benchmarks | Prateek Kumar Sikdar | Jun 6, 2026 | 1 |
| Geometric Latent Reasoning Induces Shorter Generations in LLMs | Data & benchmarks | Shashi Kumar +4 | Jun 1, 2026 | 1 |
| Model-Based Quality Assessment for Massively Multilingual Parallel Data | Data & benchmarks | Abdelaziz M. A. Ibrahim +3 | May 29, 2026 | 1 |
| The Chain Holds, the Answer Folds: Trace-Answer Dissociation in Reasoning Models Under Adversarial Pressure | Data & benchmarks | Yubo Li +2 | May 27, 2026 | 1 |