| Latent Spatial Memory for Video World Models | Image & video gen | Weijie Wang +9 | Jun 8, 2026 | 66 |
| CollectionLoRA: Collecting 50 Effects in 1 LoRA via Multi-Teacher On-Policy Distillation | Image & video gen | Fangtai Wu +9 | May 25, 2026 | 61 |
| Representation Forcing for Bottleneck-Free Unified Multimodal Models | Image & video gen | Yuqing Wang +12 | May 29, 2026 | 59 |
| Beyond Scalar Rewards by Internalizing Reasoning into Score Distributions | Image & video gen | Xin Jin +10 | Jun 8, 2026 | 58 |
| minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models | Image & video gen | Min Zhao +11 | May 28, 2026 | 58 |
| YoCausal: How Far is Video Generation from World Model? A Causality Perspective | Image & video gen | You-Zhe Xie +5 | May 28, 2026 | 54 |
| Geometry-Aware Representation Denoising for Robust Multi-view 3D Reconstruction | Image & video gen | Jin Hyeon Kim +10 | May 25, 2026 | 41 |
| Flow-DPPO: Divergence Proximal Policy Optimization for Flow Matching Models | Image & video gen | Bowen Ping +5 | Jun 9, 2026 | 40 |
| $D^2$-Monitor: Dynamic Safety Monitoring for Diffusion LLMs via Hesitation-Aware Routing | Image & video gen | Aoxi Liu +7 | May 25, 2026 | 39 |
| GenClaw: Code-Driven Agentic Image Generation | Image & video gen | Junyan Ye +6 | May 28, 2026 | 38 |
| dMoE: dLLMs with Learnable Block Experts | Image & video gen | Sicheng Feng +4 | May 29, 2026 | 36 |
| SANA-Streaming: Real-time Streaming Video Editing with Hybrid Diffusion Transformer | Image & video gen | Yuyang Zhao +8 | May 28, 2026 | 36 |
| Qwen-Image-Flash: Beyond Objective Design | Image & video gen | Tianhe Wu +23 | Jun 2, 2026 | 35 |
| Soap2Soap: Long Cinematic Video Remaking via Multi-Agent Collaboration | Image & video gen | Yiren Song +4 | May 17, 2026 | 33 |
| Echo-Memory: A Controlled Study of Memory in Action World Models | Image & video gen | Wayne King +15 | Jun 8, 2026 | 32 |
| JLT: Clean-Latent Prediction in Latent Diffusion Transformers | Image & video gen | Funing Fu +4 | May 26, 2026 | 32 |
| OmniDirector: General Multi-Shot Camera Cloning without Cross-Paired Data | Image & video gen | Jiwen Liu +10 | Jun 11, 2026 | 29 |
| World Models Meet Language Models: On the Complementarity of Concrete and Abstract Reasoning | Image & video gen | Yucheng Zhou +3 | Jun 2, 2026 | 29 |
| VLMs are Good Teachers for Video Reasoning via Adaptive Test-Time Optimization | Image & video gen | Junhao Cheng +6 | Jun 1, 2026 | 29 |
| Echo-Infinity: Learning Evolving Memory for Real-Time Infinite Video Generation | Image & video gen | Yuxuan Bian +11 | Jun 3, 2026 | 28 |
| VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion | Image & video gen | Hidir Yesiltepe +6 | May 28, 2026 | 26 |
| Which Pretraining Paradigm Better Serves Spatial Intelligence? An Empirical Comparison of Vision-Language and Video Generation Models | Image & video gen | Haozhan Shen +3 | May 27, 2026 | 25 |
| Colored Noise Diffusion Sampling | Image & video gen | Hadar Davidson +2 | May 28, 2026 | 25 |
| Triplet-Block Diffusion RWKV | Image & video gen | Ke Lin +4 | May 25, 2026 | 25 |
| ARM: An AutoRegressive Large Multimodal Model with Unified Discrete Representations | Image & video gen | Junke Wang +18 | Jun 9, 2026 | 24 |
| LoomVideo: Unifying Multimodal Inputs into Video Generation and Editing | Image & video gen | Jianzong Wu +13 | Jun 4, 2026 | 24 |
| OSP-Next: Efficient High-Quality Video Generation with Sparse Sequence Parallelism, HiF8 Quantization, and Reinforcement Learning | Image & video gen | Yunyang Ge +6 | May 27, 2026 | 24 |
| Bootstrap Your Generator: Unpaired Visual Editing with Flow Matching | Image & video gen | Yoad Tewel +3 | Jun 2, 2026 | 22 |
| Linearizing Vision Transformer with Test-Time Training | Image & video gen | Yining Li +5 | May 28, 2026 | 20 |
| Balancing Fidelity and Diversity in Diffusion Models via Symmetric Attention Decomposition: Hopfield Perspective | Image & video gen | Hyunmin Cho +2 | May 26, 2026 | 20 |
| LongLive-RAG: A General Retrieval-Augmented Framework for Long Video Generation | Image & video gen | Qixin Hu +4 | Jun 1, 2026 | 19 |
| Recursive Flow Matching | Image & video gen | Jiahe Huang +3 | May 26, 2026 | 19 |
| VideoMDM: Towards 3D Human Motion Generation From 2D Supervision | Image & video gen | Amir Mann +3 | Jun 11, 2026 | 18 |
| Complexity-Balanced Diffusion Splitting | Image & video gen | Noam Issachar +2 | Jun 4, 2026 | 17 |
| RT-Lynx: Putting the GEMM Sparsity In a Right Way for Diffusion Models | Image & video gen | Xing Cong +5 | May 26, 2026 | 16 |
| SwiftVR: Real-Time One-Step Generative Video Restoration | Image & video gen | Jiaqi Yan +7 | Jun 8, 2026 | 15 |
| Geometry Matters: 3D Foundation Priors for Learning Semantic Correspondence | Image & video gen | Artur Jesslen +2 | May 28, 2026 | 15 |
| Where, What, Why, and Importance: Structured Defect Grounding for Text-to-Image Feedback | Image & video gen | Huaisong Zhang +9 | Jun 4, 2026 | 14 |
| Physics in 2-Steps: Locking Motion Priors Before Visual Refinement Erases Them | Image & video gen | Woojung Han +5 | Jun 4, 2026 | 14 |
| AAD-1: Asymmetric Adversarial Distillation for One-Step Autoregressive Video Generation | Image & video gen | Haobo Li +8 | Jun 2, 2026 | 14 |
| Decoupled Residual Denoising Diffusion Models for Unified and Data Efficient Image-to-Image Translation | Image & video gen | Ziyue Lin +8 | May 31, 2026 | 14 |
| LVSA: Training-Free Sparse Attention for Long Video Diffusion | Image & video gen | Gael Glorian +4 | May 29, 2026 | 14 |
| Everything at Every Scale: Scale-Invariant Diffusion with Continuous Super-Resolution | Image & video gen | Zixin Jessie Chen +6 | May 25, 2026 | 13 |
| DecMem: Towards Minute-Long Consistent World Generation with Decoupled Memory | Image & video gen | Zhenhao Yang +7 | May 29, 2026 | 12 |
| AdaState: Self-Evolving Anchors for Streaming Video Generation | Image & video gen | Yusuf Dalva +1 | May 28, 2026 | 12 |
| NeuROK: Generative 4D Neural Object Kinematics | Image & video gen | Chen Geng +5 | May 28, 2026 | 12 |
| Squeezing Capacity from Multimodal Large Language Models for Subject-driven Generation | Image & video gen | Shuhong Zheng +4 | May 25, 2026 | 12 |
| MoVerse: Real-Time Video World Modeling with Panoramic Gaussian Scaffold | Image & video gen | Yang Zhou +6 | Jun 11, 2026 | 11 |
| Policy and World Modeling Co-Training for Language Agents | Image & video gen | Ning Lu +11 | Jun 1, 2026 | 11 |
| PhyGenHOI: Physically-Aware 4D Generation of Dynamic Human-Object Interactions | Image & video gen | Omer Benishu +2 | May 28, 2026 | 11 |
| Text-to-Image Models Need Less from Text Encoders Than You Think | Image & video gen | Nurit Spingarn +3 | Jun 2, 2026 | 10 |
| MAOAM: Unified Object and Material Selection with Vision-Language Models | Image & video gen | Jaden Park +7 | Jun 2, 2026 | 10 |
| High-Fidelity Two-Step Image Generation via Teacher-Aligned End-to-End Distillation | Image & video gen | Dongyang Liu +9 | Jun 10, 2026 | 9 |
| i1: A Simple and Fully Open Recipe for Strong Text-to-Image Models | Image & video gen | Boya Zeng +6 | Jun 9, 2026 | 9 |
| Lumos-Nexus: Efficient Frequency Bridging with Homogeneous Latent Space for Video Unified Models | Image & video gen | Jiazheng Xing +11 | May 29, 2026 | 8 |
| SmartDirector: Keyframe-Conditioned Cinematic Video Generation with Narrative Pacing Control | Image & video gen | Zhida Zhang +7 | May 27, 2026 | 8 |
| Learning A Unified Risk Map for Autonomous Driving in Partially Observable Environments | Image & video gen | Jie Jia +6 | May 21, 2026 | 8 |
| MRT: Masked Region Transformer for Layered Image Generation and Editing at Scale | Image & video gen | Zhicong Tang +8 | May 26, 2026 | 8 |
| Bridging the Agent-World Gap: Text World Models for LLM-based Agents | Image & video gen | Yixia Li +15 | Jun 8, 2026 | 7 |
| GDSD: Reinforcement Learning as Guided Denoiser Self-Distillation for Diffusion Language Models | Image & video gen | Xiaohang Tang +6 | May 28, 2026 | 7 |
| One-Forcing: Towards Stable One-Step Autoregressive Video Generation | Image & video gen | Jiaqi Feng +3 | May 22, 2026 | 7 |
| RepFusion: Leveraging Multimodal Priors for Denoising in Representation Space | Image & video gen | Xichen Pan +5 | Jun 12, 2026 | 6 |
| Injecting Image Guidance into Text-Conditioned Diffusion Models at Inference | Image & video gen | Agata Żywot +6 | May 24, 2026 | 5 |
| MaskAlign: Token-Subset Representation Alignment for Efficient Diffusion Training | Image & video gen | Lianyu Pang +7 | Jun 7, 2026 | 4 |
| MilliVid: Hierarchical Latents for Long-Range Consistency in Video Generation | Image & video gen | Ishaan Preetam Chandratreya +6 | Jun 8, 2026 | 4 |
| A Cookbook of 3D Vision: Data, Learning Paradigms, and Application | Image & video gen | Hongyang Du +10 | Jun 2, 2026 | 4 |
| Uniform Diffusion Models Revisited: Leave-One-Out Denoiser and Absorbing State Reformulation | Image & video gen | Samson Gourevitch +6 | May 21, 2026 | 4 |
| EverAnimate: Minute-Scale Human Animation via Latent Flow Restoration | Image & video gen | Wuyang Li +6 | May 14, 2026 | 4 |
| MotiMotion: Motion-Controlled Video Generation with Visual Reasoning | Image & video gen | Lee Hsin-Ying +5 | May 21, 2026 | 4 |
| Avatar V: Scaling Video-Reference Avatar Video Generation | Image & video gen | Benjamin Liang +22 | Jun 11, 2026 | 3 |
| Late-Layer Fusion is Enough: Dual-Path Vision Token Routing for Multimodal Large Language Models under Visual Saturation | Image & video gen | Siyuan Liu +1 | Jun 8, 2026 | 3 |
| EVA01: Unified Native 3D Understanding and Generation via Mixture-of-Transformers | Image & video gen | Zongyuan Yang +10 | May 16, 2026 | 3 |
| When Confidence Misleads: Suffix Anchoring and Anchor-Proximity Confidence Modulation for Diffusion Language Models | Image & video gen | Jungwon Park +4 | May 27, 2026 | 3 |
| Cross-scale Aligned Supervision for Training GANs | Image & video gen | Sangeek Hyun +2 | May 26, 2026 | 3 |
| IDEAL: In-DEpth ALignment Makes A Discrete Representation AutoEncoder | Image & video gen | Yitong Chen +7 | Jun 9, 2026 | 2 |
| Data-Efficient Autoregressive-to-Diffusion Language Models via On-Policy Distillation | Image & video gen | Xingyu Su +8 | Jun 4, 2026 | 2 |
| Discrete-WAM: Unified Discrete Vision-Action Token Editing for World-Policy Learning | Image & video gen | Ziyang Yao +12 | Jun 4, 2026 | 2 |
| Compositional Text-to-Image Generation Via Region-aware Bimodal Direct Preference Optimization | Image & video gen | Zhuohan Liu +3 | May 27, 2026 | 2 |
| Guidance Contrastive Token Credit Assignment for Discrete Policy Optimization | Image & video gen | Shufan Li +4 | May 29, 2026 | 2 |
| MoZoo: Unleashing Video Diffusion power in animal fur and muscle simulation | Image & video gen | Dongxia Liu +9 | Apr 8, 2026 | 2 |
| MBench: A Comprehensive Benchmark on Memory Capability for Video World Models | Image & video gen | Shengjun Zhang +13 | Jun 8, 2026 | 1 |
| Building Social World Models with Large Language Models | Image & video gen | Haofei Yu +3 | Jun 9, 2026 | 1 |
| FlowLet: Conditional 3D Brain MRI Synthesis using Wavelet Flow Matching | Image & video gen | Danilo Danese +4 | Jun 8, 2026 | 1 |
| Streaming Video Generation with Streaming Force Control | Image & video gen | Hanhui Wang +5 | Jun 5, 2026 | 1 |
| Is This Edit Correct? A Multi-Dimensional Benchmark for Reasoning-Aware Image Editing | Image & video gen | Yixuan Ding +4 | Apr 16, 2026 | 1 |