2026-02-03

Title: OGD4All: A Framework for Accessible Interaction with Geospatial Open Government Data Based on Large Language Models

Authors: Michael Siebenmann (1), Javier Argota Sánchez-Vaquerizo (1), Stefan Arisona (2), Krystian Samp (2), Luis Gisler (2), Dirk Helbing (1 and 3) ((1) Professorship of Computational Social Science, ETH Zurich, (2) Esri R&D Center Zurich, (3) Complexity Science Hub, Vienna)
Subjects: cs.LG, cs.AI, cs.CY, cs.IR
Abstract URL: https://arxiv.org/abs/2602.00012
Pdf URL: https://arxiv.org/pdf/2602.00012
Copy Paste: [[2602.00012]] OGD4All: A Framework for Accessible Interaction with Geospatial Open Government Data Based on Large Language Models(https://arxiv.org/abs/2602.00012)
Keywords: generation
Abstract: We present OGD4All, a transparent, auditable, and reproducible framework based on Large Language Models (LLMs) to enhance citizens' interaction with geospatial Open Government Data (OGD). The system combines semantic data retrieval, agentic reasoning for iterative code generation, and secure sandboxed execution that produces verifiable multimodal outputs. Evaluated on a 199-question benchmark covering both factual and unanswerable questions, across 430 City-of-Zurich datasets and 11 LLMs, OGD4All reaches 98% analytical correctness and 94% recall while reliably rejecting questions unsupported by available data, which minimizes hallucination risks. Statistical robustness tests, as well as expert feedback, show reliability and social relevance. The proposed approach shows how LLMs can provide explainable, multimodal access to public data, advancing trustworthy AI for open governance.

Title: ELLMPEG: An Edge-based Agentic LLM Video Processing Tool

Authors: Zoha Azimi, Reza Farahani, Radu Prodan, Christian Timmerer
Subjects: cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2602.00028
Pdf URL: https://arxiv.org/pdf/2602.00028
Copy Paste: [[2602.00028]] ELLMPEG: An Edge-based Agentic LLM Video Processing Tool(https://arxiv.org/abs/2602.00028)
Keywords: generation, generative
Abstract: Large language models (LLMs), the foundation of generative AI systems like ChatGPT, are transforming many fields and applications, including multimedia, enabling more advanced content generation, analysis, and interaction. However, cloud-based LLM deployments face three key limitations: high computational and energy demands, privacy and reliability risks from remote processing, and recurring API costs. Recent advances in agentic AI, especially in structured reasoning and tool use, offer a better way to exploit open and locally deployed tools and LLMs. This paper presents ELLMPEG, an edge-enabled agentic LLM framework for the automated generation of video-processing commands. ELLMPEG integrates tool-aware Retrieval-Augmented Generation (RAG) with iterative self-reflection to produce and locally verify executable FFmpeg and VVenC commands directly at the edge, eliminating reliance on external cloud APIs. To evaluate ELLMPEG, we collect a dedicated prompt dataset comprising 480 diverse queries covering different categories of FFmpeg and the Versatile Video Codec (VVC) encoder (VVenC) commands. We validate command generation accuracy and evaluate four open-source LLMs based on command validity, tokens generated per second, inference time, and energy efficiency. We also execute the generated commands to assess their runtime correctness and practical applicability. Experimental results show that Qwen2.5, when augmented with the ELLMPEG framework, achieves an average command-generation accuracy of 78 % with zero recurring API cost, outperforming all other open-source models across both the FFmpeg and VVenC datasets.

Title: RAPTOR-AI for Disaster OODA Loop: Hierarchical Multimodal RAG with Experience-Driven Agentic Decision-Making

Authors: Takato Yasuno
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2602.00030
Pdf URL: https://arxiv.org/pdf/2602.00030
Copy Paste: [[2602.00030]] RAPTOR-AI for Disaster OODA Loop: Hierarchical Multimodal RAG with Experience-Driven Agentic Decision-Making(https://arxiv.org/abs/2602.00030)
Keywords: generation
Abstract: Effective humanitarian assistance and disaster relief (HADR) requires rapid situational understanding, reliable decision support, and the ability to generalize across diverse and previously unseen disaster contexts. This work introduces an agentic Retrieval-Augmented Generation (RAG) framework designed to support the three canonical phases of disaster response: initial rescue, mid-term recovery, and long-term reconstruction. To achieve robust multimodal grounding, we construct a hierarchical knowledge base that integrates textual disaster manuals, historical lessons (e.g., the 2011 Tohoku earthquake), and both aerial and ground-level imagery. Our system builds on the open-source multimodal implementation, which processes 46 tsunami-related PDFs (2,378 pages) using BLIP-based image captioning, ColVBERT embeddings, and long-context summarization to generate an efficient, structured multimodal retrieval tree optimized for disaster knowledge preservation. An agentic controller dynamically selects retrieval strategies (e.g., RAPTOR, ColBERT) through entropy-aware scene abstraction, enabling adaptive reasoning across heterogeneous inputs. Additionally, a lightweight LoRA-based post-training method injects experiential knowledge from past disasters, enhancing the models' capacity to support both expert and non-expert responders. Experiments on real disaster datasets demonstrate improved situational grounding, enhanced task decomposition accuracy, and superior usability for emergency operations. Incorporating recent advances in long-context RAG systems, agentic information retrieval, and contemporary emergency response AI, our system achieves substantial gains through adaptive retrieval-augmented generation with self-reasoning and multimodal chain-of-thought capabilities.

Title: Enhancing few-shot time series forecasting with LLM-guided diffusion

Authors: Haonan Shi, Dehua Shuai, Liming Wang, Xiyang Liu, Long Tian
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2602.00040
Pdf URL: https://arxiv.org/pdf/2602.00040
Copy Paste: [[2602.00040]] Enhancing few-shot time series forecasting with LLM-guided diffusion(https://arxiv.org/abs/2602.00040)
Keywords: generative
Abstract: Time series forecasting in specialized domains is often constrained by limited data availability, where conventional models typically require large-scale datasets to effectively capture underlying temporal dynamics. To tackle this few-shot challenge, we propose LTSM-DIFF (Large-scale Temporal Sequential Memory with Diffusion), a novel learning framework that integrates the expressive power of large language models with the generative capability of diffusion models. Specifically, the LTSM module is fine-tuned and employed as a temporal memory mechanism, extracting rich sequential representations even under data-scarce conditions. These representations are then utilized as conditional guidance for a joint probability diffusion process, enabling refined modeling of complex temporal patterns. This design allows knowledge transfer from the language domain to time series tasks, substantially enhancing both generalization and robustness. Extensive experiments across diverse benchmarks demonstrate that LTSM-DIFF consistently achieves state-of-the-art performance in data-rich scenarios, while also delivering significant improvements in few-shot forecasting. Our work establishes a new paradigm for time series analysis under data scarcity.

Title: TextBFGS: Quasi-Newton Optimization for Discrete Executable Text via Gradient-Operator Retrieval

Authors: Zizheng Zhang, Yuyang Liao, Chen Chen, Jian He, Dun Wu, Qianjin Yu, Yanqin Gao, Jin Yang, Kailai Zhang, Eng Siong Chng, Xionghu Zhong
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2602.00059
Pdf URL: https://arxiv.org/pdf/2602.00059
Copy Paste: [[2602.00059]] TextBFGS: Quasi-Newton Optimization for Discrete Executable Text via Gradient-Operator Retrieval(https://arxiv.org/abs/2602.00059)
Keywords: generation
Abstract: Optimizing discrete executable text such as prompts and code has recently been framed as a gradient-based process, effectively translating backpropagation concepts to the semantic space. However, existing methods predominantly operate as first-order optimizers akin to Stochastic Gradient Descent, which suffer from slow convergence and instability because they neglect the semantic curvature of the optimization landscape. To bridge this gap, we introduce TextBFGS, a second-order framework that implements a Quasi-Newton optimization method for discrete text. Unlike traditional memory-based approaches that retrieve similar textual instances, TextBFGS approximates the inverse Hessian matrix by retrieving Gradient-Operators from the memory of pre-learned successful trajectories. Specifically, given textual gradient feedback, TextBFGS identifies historical correction patterns from the optimization knowledge base and tries to apply these abstract operators to the current variable. This mechanism enables a One-Pass Update, combining feedback generation and second-order correction into a single inference step. Empirical evaluations on code optimization across diverse domains (e.g., HumanEval, MBPP) demonstrate that TextBFGS significantly outperforms first-order baselines. It achieves superior pass rates with fewer model calls and exhibits strong cross-task transferability, thus establishing a mathematically grounded paradigm for efficient, memory-aware text optimization.

Title: The Impact of Machine Learning Uncertainty on the Robustness of Counterfactual Explanations

Authors: Leonidas Christodoulou, Chang Sun
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2602.00063
Pdf URL: https://arxiv.org/pdf/2602.00063
Copy Paste: [[2602.00063]] The Impact of Machine Learning Uncertainty on the Robustness of Counterfactual Explanations(https://arxiv.org/abs/2602.00063)
Keywords: generation
Abstract: Counterfactual explanations are widely used to interpret machine learning predictions by identifying minimal changes to input features that would alter a model's decision. However, most existing counterfactual methods have not been tested when model and data uncertainty change, resulting in explanations that may be unstable or invalid under real-world variability. In this work, we investigate the robustness of common combinations of machine learning models and counterfactual generation algorithms in the presence of both aleatoric and epistemic uncertainty. Through experiments on synthetic and real-world tabular datasets, we show that counterfactual explanations are highly sensitive to model uncertainty. In particular, we find that even small reductions in model accuracy - caused by increased noise or limited data - can lead to large variations in the generated counterfactuals on average and on individual instances. These findings underscore the need for uncertainty-aware explanation methods in domains such as finance and the social sciences.

Title: Generative AI-enhanced Probabilistic Multi-Fidelity Surrogate Modeling Via Transfer Learning

Authors: Jice Zeng, David Barajas-Solano, Hui Chen
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2602.00072
Pdf URL: https://arxiv.org/pdf/2602.00072
Copy Paste: [[2602.00072]] Generative AI-enhanced Probabilistic Multi-Fidelity Surrogate Modeling Via Transfer Learning(https://arxiv.org/abs/2602.00072)
Keywords: generative
Abstract: The performance of machine learning surrogates is critically dependent on data quality and quantity. This presents a major challenge, as high-fidelity (HF) data is often scarce and computationally expensive to acquire, while low-fidelity (LF) data is abundant but less accurate. To address this data scarcity problem, we develop a probabilistic multi-fidelity surrogate framework based on generative transfer learning. We employ a normalizing flow (NF) generative model as the backbone, which is trained in two phases: (i) the NF is first pretrained on a large LF dataset to learn a probabilistic forward model; (ii) the pretrained model is then fine-tuned on a small HF dataset, allowing it to correct for LF-HF discrepancies via knowledge transfer. To relax the dimension-preserving constraint of standard bijective NFs, we integrate surjective (dimension-reducing) layers with standard coupling blocks. This architecture enables learned dimension reduction while preserving the ability to train with exact likelihoods. The resulting surrogate provides fast probabilistic predictions with quantified uncertainty and significantly outperforms LF-only baselines while using fewer HF evaluations. We validate the approach on a reinforced concrete slab benchmark, combining many coarse-mesh (LF) simulations with a limited set of fine-mesh (HF) simulations. The proposed model achieves probabilistic predictions with HF accuracy, demonstrating a practical path toward data-efficient, generative AI-driven surrogates for complex engineering systems.

Title: Interpreting and Controlling Model Behavior via Constitutions for Atomic Concept Edits

Authors: Neha Kalibhat, Zi Wang, Prasoon Bajpai, Drew Proud, Wenjun Zeng, Been Kim, Mani Malek
Subjects: cs.LG, cs.AI, cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2602.00092
Pdf URL: https://arxiv.org/pdf/2602.00092
Copy Paste: [[2602.00092]] Interpreting and Controlling Model Behavior via Constitutions for Atomic Concept Edits(https://arxiv.org/abs/2602.00092)
Keywords: generation
Abstract: We introduce a black-box interpretability framework that learns a verifiable constitution: a natural language summary of how changes to a prompt affect a model's specific behavior, such as its alignment, correctness, or adherence to constraints. Our method leverages atomic concept edits (ACEs), which are targeted operations that add, remove, or replace an interpretable concept in the input prompt. By systematically applying ACEs and observing the resulting effects on model behavior across various tasks, our framework learns a causal mapping from edits to predictable outcomes. This learned constitution provides deep, generalizable insights into the model. Empirically, we validate our approach across diverse tasks, including mathematical reasoning and text-to-image alignment, for controlling and understanding model behavior. We found that for text-to-image generation, GPT-Image tends to focus on grammatical adherence, while Imagen 4 prioritizes atmospheric coherence. In mathematical reasoning, distractor variables confuse GPT-5 but leave Gemini 2.5 models and o4-mini largely unaffected. Moreover, our results show that the learned constitutions are highly effective for controlling model behavior, achieving an average of 1.86 times boost in success rate over methods that do not use constitutions.
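Illustration: the abstract describes atomic concept edits (ACEs) as operations that add, remove, or replace one interpretable concept in a prompt, with the effect on model behavior observed afterwards. The Python sketch below shows one way such edits could be represented. The names `ConceptEdit`, `apply_edit`, and `behavior_shift` are hypothetical, the string-level editing is a simplifying assumption, and `score` stands in for whatever behavior metric is being measured; none of this is taken from the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ConceptEdit:
    """One atomic edit on a prompt (hypothetical representation)."""
    kind: str                          # "add", "remove", or "replace"
    concept: str                       # phrase to add, remove, or replace
    replacement: Optional[str] = None  # only used when kind == "replace"


def apply_edit(prompt: str, edit: ConceptEdit) -> str:
    """Return a new prompt with exactly one concept edited."""
    if edit.kind == "add":
        return f"{prompt} {edit.concept}".strip()
    if edit.kind == "remove":
        # Drop the phrase, then collapse any doubled whitespace.
        return " ".join(prompt.replace(edit.concept, "").split())
    if edit.kind == "replace":
        if edit.replacement is None:
            raise ValueError("replace edits need a replacement phrase")
        return prompt.replace(edit.concept, edit.replacement)
    raise ValueError(f"unknown edit kind: {edit.kind}")


def behavior_shift(prompt: str, edit: ConceptEdit,
                   score: Callable[[str], float]) -> float:
    """Change in a behavior score caused by a single edit.

    `score` is an assumed black-box metric (for example, a success rate
    measured by querying the model); it is supplied by the caller.
    """
    return score(apply_edit(prompt, edit)) - score(prompt)
```

Collecting `behavior_shift` values over many edits and prompts would give the raw edit-to-outcome observations that a natural-language summary could then be written from.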

Title: Mirage2Matter: A Physically Grounded Gaussian World Model from Video

Authors: Zhengqing Gao, Ziwen Li, Xin Wang, Jiaxin Huang, Zhenyang Ren, Mingkai Shao, Hanlue Zhang, Tianyu Huang, Yongkang Cheng, Yandong Guo, Runqi Lin, Yuanyuan Wang, Tongliang Liu, Kun Zhang, Mingming Gong
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2602.00096
Pdf URL: https://arxiv.org/pdf/2602.00096
Copy Paste: [[2602.00096]] Mirage2Matter: A Physically Grounded Gaussian World Model from Video(https://arxiv.org/abs/2602.00096)
Keywords: generation, generative
Abstract: The scalability of embodied intelligence is fundamentally constrained by the scarcity of real-world interaction data. While simulation platforms provide a promising alternative, existing approaches often suffer from a substantial visual and physical gap to real environments and rely on expensive sensors, precise robot calibration, or depth measurements, limiting their practicality at scale. We present Simulate Anything, a graphics-driven world modeling and simulation framework that enables efficient generation of high-fidelity embodied training data using only multi-view environment videos and off-the-shelf assets. Our approach reconstructs real-world environments into a photorealistic scene representation using 3D Gaussian Splatting (3DGS), seamlessly capturing fine-grained geometry and appearance from video. We then leverage generative models to recover a physically realistic representation and integrate it into a simulation environment via a precision calibration target, enabling accurate scale alignment between the reconstructed scene and the real world. Together, these components provide a unified, editable, and physically grounded world model. Vision Language Action (VLA) models trained on our simulated data achieve strong zero-shot performance on downstream tasks, matching or even surpassing results obtained with real-world data, highlighting the potential of reconstruction-driven world modeling for scalable and practical embodied intelligence training.

Title: R3G: A Reasoning--Retrieval--Reranking Framework for Vision-Centric Answer Generation

Authors: Zhuohong Chen, Zhengxian Wu, Zirui Liao, Shenao Jiang, Hangrui Xu, Yang Chen, Chaokui Su, Xiaoyu Liu, Haoqian Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2602.00104
Pdf URL: https://arxiv.org/pdf/2602.00104
Copy Paste: [[2602.00104]] R3G: A Reasoning--Retrieval--Reranking Framework for Vision-Centric Answer Generation(https://arxiv.org/abs/2602.00104)
Keywords: generation
Abstract: Vision-centric retrieval for VQA requires retrieving images to supply missing visual cues and integrating them into the reasoning process. However, selecting the right images and integrating them effectively into the model's reasoning remains challenging. To address this challenge, we propose R3G, a modular Reasoning-Retrieval-Reranking framework. It first produces a brief reasoning plan that specifies the required visual cues, then adopts a two-stage strategy, with coarse retrieval followed by fine-grained reranking, to select evidence. On MRAG-Bench, R3G improves accuracy across six MLLM backbones and nine sub-scenarios, achieving state-of-the-art overall performance. Ablations show that sufficiency-aware reranking and reasoning steps are complementary, helping the model both choose the right images and use them well. We release code and data at this https URL.
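Illustration: the two-stage strategy in the abstract (coarse retrieval followed by fine-grained reranking) can be sketched generically in Python. This is a minimal sketch under stated assumptions, not R3G's implementation: cosine similarity over precomputed embeddings is assumed for the coarse stage, and `rerank_score` is a hypothetical caller-supplied function standing in for the more expensive second-stage scorer.

```python
import math
from typing import Callable, List, Mapping, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_then_rerank(
    query_vec: Sequence[float],
    candidates: Mapping[str, Sequence[float]],  # image id -> embedding
    rerank_score: Callable[[str], float],       # assumed second-stage scorer
    coarse_k: int,
    final_k: int,
) -> List[str]:
    """Keep the coarse_k nearest candidates, then reorder them by rerank_score."""
    # Stage 1: cheap similarity search over every candidate.
    coarse = sorted(
        candidates,
        key=lambda cid: cosine(query_vec, candidates[cid]),
        reverse=True,
    )[:coarse_k]
    # Stage 2: the expensive scorer only sees the shortlist.
    return sorted(coarse, key=rerank_score, reverse=True)[:final_k]
```

The point of the split is cost: the second-stage scorer is called `coarse_k` times rather than once per candidate, so a candidate dropped in stage 1 is never reconsidered, even if the reranker would have preferred it.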

Title: AI-Driven Three-Dimensional Reconstruction and Quantitative Analysis for Burn Injury Assessment

Authors: S. Kalaycioglu, C. Hong, K. Zhai, H. Xie, J.N. Wong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.00113
Pdf URL: https://arxiv.org/pdf/2602.00113
Copy Paste: [[2602.00113]] AI-Driven Three-Dimensional Reconstruction and Quantitative Analysis for Burn Injury Assessment(https://arxiv.org/abs/2602.00113)
Keywords: generation
Abstract: Accurate, reproducible burn assessment is critical for treatment planning, healing monitoring, and medico-legal documentation, yet conventional visual inspection and 2D photography are subjective and limited for longitudinal comparison. This paper presents an AI-enabled burn assessment and management platform that integrates multi-view photogrammetry, 3D surface reconstruction, and deep learning-based segmentation within a structured clinical workflow. Using standard multi-angle images from consumer-grade cameras, the system reconstructs patient-specific 3D burn surfaces and maps burn regions onto anatomy to compute objective metrics in real-world units, including surface area, TBSA, depth-related geometric proxies, and volumetric change. Successive reconstructions are spatially aligned to quantify healing progression over time, enabling objective tracking of wound contraction and depth reduction. The platform also supports structured patient intake, guided image capture, 3D analysis and visualization, treatment recommendations, and automated report generation. Simulation-based evaluation demonstrates stable reconstructions, consistent metric computation, and clinically plausible longitudinal trends, supporting a scalable, non-invasive approach to objective, geometry-aware burn assessment and decision support in acute and outpatient care.

Title: 1S-DAug: One-Shot Data Augmentation for Robust Few-Shot Generalization

Authors: Yunwei Bai, Ying Kiat Tan, Yao Shu, Tsuhan Chen
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2602.00114
Pdf URL: https://arxiv.org/pdf/2602.00114
Copy Paste: [[2602.00114]] 1S-DAug: One-Shot Data Augmentation for Robust Few-Shot Generalization(https://arxiv.org/abs/2602.00114)
Keywords: generative
Abstract: Few-shot learning (FSL) challenges model generalization to novel classes based on just a few shots of labeled examples, a testbed where traditional test-time augmentations fail to be effective. We introduce 1S-DAug, a one-shot generative augmentation operator that synthesizes diverse yet faithful variants from just one example image at test time. 1S-DAug couples traditional geometric perturbations with controlled noise injection and a denoising diffusion process conditioned on the original image. The generated images are then encoded and aggregated, alongside the original image, into a combined representation for more robust FSL predictions. Integrated as a training-free model-agnostic plugin, 1S-DAug consistently improves FSL across standard benchmarks of 4 different datasets without any model parameter update, including achieving over 10% proportional accuracy improvement on the miniImagenet 5-way-1-shot benchmark. Codes will be released.
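Illustration: the abstract says the generated variants are encoded and aggregated with the original image into a combined representation, but it does not specify the aggregation rule. The Python sketch below assumes the simplest choice, an unweighted mean of the embeddings; `combined_representation` and the `encode` callback are hypothetical names, and the actual method may weight or pool the embeddings differently.

```python
from typing import Callable, List, Sequence, TypeVar

Image = TypeVar("Image")


def combined_representation(
    original: Image,
    variants: Sequence[Image],
    encode: Callable[[Image], Sequence[float]],  # assumed frozen feature extractor
) -> List[float]:
    """Average the embedding of the original with those of its augmented variants.

    Assumption: an unweighted mean. No model parameters are updated; the
    encoder is only called in inference mode.
    """
    embeddings = [encode(original)] + [encode(v) for v in variants]
    dim = len(embeddings[0])
    if any(len(e) != dim for e in embeddings):
        raise ValueError("all embeddings must have the same dimension")
    count = len(embeddings)
    return [sum(e[i] for e in embeddings) / count for i in range(dim)]
```

With no variants the function returns the original embedding unchanged, so a classifier built on top of it degrades gracefully to the unaugmented baseline.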

Title: ALIGN: Aligned Delegation with Performance Guarantees for Multi-Agent LLM Reasoning

Authors: Tong Zhu, Baiting Chen, Jin Zhou, Hua Zhou, Sriram Sankararaman, Xiaowu Dai
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2602.00127
Pdf URL: https://arxiv.org/pdf/2602.00127
Copy Paste: [[2602.00127]] ALIGN: Aligned Delegation with Performance Guarantees for Multi-Agent LLM Reasoning(https://arxiv.org/abs/2602.00127)
Keywords: generation
Abstract: LLMs often underperform on complex reasoning tasks when relying on a single generation-and-selection pipeline. Inference-time ensemble methods can improve performance by sampling diverse reasoning paths or aggregating multiple candidate answers, but they typically treat candidates independently and provide no formal guarantees that ensembling improves reasoning quality. We propose a novel method, Aligned Delegation for Multi-Agent LLM Reasoning (ALIGN), which formulates LLM reasoning as an aligned delegation game. In ALIGN, a principal delegates a task to multiple agents that generate candidate solutions under designed incentives, and then selects among their outputs to produce a final answer. This formulation induces structured interaction among agents while preserving alignment between agent and principal objectives. We establish theoretical guarantees showing that, under a fair comparison with equal access to candidate solutions, ALIGN provably improves expected performance over single-agent generation. Our analysis accommodates correlated candidate answers and relaxes independence assumptions that are commonly used in prior work. Empirical results across a broad range of LLM reasoning benchmarks consistently demonstrate that ALIGN outperforms strong single-agent and ensemble baselines.

Title: Monte Carlo Tree Search for Execution-Guided Program Repair with Large Language Models

Authors: Yixuan Liang
Subjects: cs.LG, cs.SE
Abstract URL: https://arxiv.org/abs/2602.00129
Pdf URL: https://arxiv.org/pdf/2602.00129
Copy Paste: [[2602.00129]] Monte Carlo Tree Search for Execution-Guided Program Repair with Large Language Models(https://arxiv.org/abs/2602.00129)
Keywords: generation
Abstract: Automated program repair with large language models remains challenging at the repository level due to long-horizon reasoning requirements and the limitations of autoregressive decoding. We present CodePilot, a hybrid framework that integrates Monte Carlo Tree Search (MCTS) with large language models to enable execution-guided program repair for real-world GitHub issues. CodePilot performs hierarchical fault localization from repository to file and function level, explores diverse patch trajectories using MCTS, and leverages execution feedback as a reward signal to guide search and refinement. The framework further incorporates confidence-calibrated generation to selectively refine low-confidence outputs. Experiments on SWE-bench Lite demonstrate that CodePilot achieves a 24.67% issue resolution rate using open-weight models, outperforming comparable baselines. These results suggest that combining symbolic search with neural language models is an effective strategy for scalable, execution-aware software engineering automation.
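Illustration: the abstract describes exploring patch trajectories with MCTS and using execution feedback as the reward signal. The Python sketch below shows only the selection and backpropagation steps of a standard MCTS loop. It assumes the common UCB1 selection rule and a scalar reward such as the fraction of tests passed; the abstract does not state which selection rule or reward definition CodePilot uses, and `PatchNode`, `ucb1`, `select_child`, and `backpropagate` are hypothetical names.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class PatchNode:
    """A node in the search tree; each node holds one candidate patch."""
    patch: str
    visits: int = 0
    total_reward: float = 0.0
    children: List["PatchNode"] = field(default_factory=list)


def ucb1(child: PatchNode, parent_visits: int, c: float = 1.4) -> float:
    """UCB1 score: mean reward plus an exploration bonus."""
    if child.visits == 0:
        return math.inf  # always try unvisited patches first
    mean = child.total_reward / child.visits
    bonus = c * math.sqrt(math.log(max(parent_visits, 1)) / child.visits)
    return mean + bonus


def select_child(node: PatchNode, c: float = 1.4) -> PatchNode:
    """Pick the child with the highest UCB1 score."""
    return max(node.children, key=lambda ch: ucb1(ch, node.visits, c))


def backpropagate(path: List[PatchNode], reward: float) -> None:
    """Credit an execution reward (assumed to lie in [0, 1]) to every node on the path."""
    for node in path:
        node.visits += 1
        node.total_reward += reward
```

In a full loop, an expansion step would ask the language model for new candidate patches, and the reward would come from running the repository's tests against the patched code.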

Title: Learning Physics-Grounded 4D Dynamics with Neural Gaussian Force Fields

Authors: Shiqian Li, Ruihong Shen, Junfeng Ni, Chang Pan, Chi Zhang, Yixin Zhu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2602.00148
Pdf URL: https://arxiv.org/pdf/2602.00148
Copy Paste: [[2602.00148]] Learning Physics-Grounded 4D Dynamics with Neural Gaussian Force Fields(https://arxiv.org/abs/2602.00148)
Keywords: generation
Abstract: Predicting physical dynamics from raw visual data remains a major challenge in AI. While recent video generation models have achieved impressive visual quality, they still cannot consistently generate physically plausible videos due to a lack of modeling of physical laws. Recent approaches combining 3D Gaussian splatting and physics engines can produce physically plausible videos, but are hindered by high computational costs in both reconstruction and simulation, and often lack robustness in complex real-world scenarios. To address these issues, we introduce Neural Gaussian Force Field (NGFF), an end-to-end neural framework that integrates 3D Gaussian perception with physics-based dynamic modeling to generate interactive, physically realistic 4D videos from multi-view RGB inputs, running two orders of magnitude faster than prior Gaussian simulators. To support training, we also present GSCollision, a 4D Gaussian dataset featuring diverse materials, multi-object interactions, and complex scenes, totaling over 640k rendered physical videos (~4 TB). Evaluations on synthetic and real 3D scenarios show NGFF's strong generalization and robustness in physical reasoning, advancing video prediction towards physics-grounded world models.

Title: SDCM: Simulated Densifying and Compensatory Modeling Fusion for Radar-Vision 3-D Object Detection in Internet of Vehicles

Authors: Shucong Li, Xiaoluo Zhou, Yuqian He, Zhenyu Liu
Subjects: cs.CV, cs.RO, eess.IV
Abstract URL: https://arxiv.org/abs/2602.00149
Pdf URL: https://arxiv.org/pdf/2602.00149
Copy Paste: [[2602.00149]] SDCM: Simulated Densifying and Compensatory Modeling Fusion for Radar-Vision 3-D Object Detection in Internet of Vehicles(https://arxiv.org/abs/2602.00149)
Keywords: generation
Abstract: 3-D object detection based on 4-D radar-vision is an important part of the Internet of Vehicles (IoV). However, two challenges need to be addressed. First, 4-D radar point clouds are sparse, leading to poor 3-D representation. Second, vision data exhibit representation degradation in low-light, long-distance detection and dense occlusion scenes, providing unreliable texture information during the fusion stage. To address these issues, a framework named SDCM is proposed, which contains Simulated Densifying and Compensatory Modeling Fusion for radar-vision 3-D object detection in IoV. Firstly, considering point generation based on Gaussian simulation of key points obtained from 3-D Kernel Density Estimation (3-D KDE), and outline generation based on curvature simulation, the Simulated Densifying (SimDen) module is designed to generate dense radar point clouds. Secondly, considering that radar data can provide more real-time information than vision data, owing to the all-weather property of 4-D radar, the Radar Compensatory Mapping (RCM) module is designed to reduce the effects of the vision data's representation degradation. Thirdly, considering that feature tensor difference values contain the effective information of every modality, which can be extracted and modeled for heterogeneity reduction and modality interaction, the Mamba Modeling Interactive Fusion (MMIF) module is designed to reduce heterogeneity and achieve interactive fusion. Experimental results on the VoD, TJ4DRadSet and Astyx HiRes 2019 datasets show that SDCM achieves the best performance with a lower parameter count and faster inference speed. Our code will be available.

Title: Joint Continual Learning of Local Language Models and Cloud Offloading Decisions with Budget Constraints

Authors: Evan Chen, Wenzhi Fang, Shiqiang Wang, Christopher Brinton
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2602.00166
Pdf URL: https://arxiv.org/pdf/2602.00166
Copy Paste: [[2602.00166]] Joint Continual Learning of Local Language Models and Cloud Offloading Decisions with Budget Constraints(https://arxiv.org/abs/2602.00166)
Keywords: generation
Abstract: Locally deployed Small Language Models (SLMs) must continually support diverse tasks under strict memory and computation constraints, making selective reliance on cloud Large Language Models (LLMs) unavoidable. Regulating cloud assistance during continual learning is challenging, as naive reward-based reinforcement learning often yields unstable offloading behavior and exacerbates catastrophic forgetting as task distributions shift. We propose DA-GRPO, a dual-advantage extension of Group Relative Policy Optimization that incorporates cloud-usage constraints directly into advantage computation, avoiding fixed reward shaping and external routing models. This design enables the local model to jointly learn task competence and collaboration behavior, allowing cloud requests to emerge naturally during post-training while respecting a prescribed assistance budget. Experiments on mathematical reasoning and code generation benchmarks show that DA-GRPO improves post-switch accuracy, substantially reduces forgetting, and maintains stable cloud usage compared to prior collaborative and routing-based approaches.

Title: Stabilizing Diffusion Posterior Sampling by Noise--Frequency Continuation

Authors: Feng Tian, Yixuan Li, Weili Zeng, Weitian Zhang, Yichao Yan, Xiaokang Yang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2602.00176
Pdf URL: https://arxiv.org/pdf/2602.00176
Copy Paste: [[2602.00176]] Stabilizing Diffusion Posterior Sampling by Noise--Frequency Continuation(https://arxiv.org/abs/2602.00176)
Keywords: super-resolution
Abstract: Diffusion posterior sampling solves inverse problems by combining a pretrained diffusion prior with measurement-consistency guidance, but it often fails to recover fine details because measurement terms are applied in a manner that is weakly coupled to the diffusion noise level. At high noise, data-consistency gradients computed from inaccurate estimates can be geometrically incongruent with the posterior geometry, inducing early-step drift, spurious high-frequency artifacts, and sensitivity to schedules and ill-conditioned operators. To address these concerns, we propose a noise-frequency continuation framework that constructs a continuous family of intermediate posteriors whose likelihood enforces measurement consistency only within a noise-dependent frequency band. This principle is instantiated with a stabilized posterior sampler that combines a diffusion predictor, band-limited likelihood guidance, and a multi-resolution consistency strategy that aggressively commits reliable coarse corrections while conservatively adopting high-frequency details only when they become identifiable. Across super-resolution, inpainting, and deblurring, our method achieves state-of-the-art performance and improves motion deblurring PSNR by up to 5 dB over strong baselines.

Title: Reducing Memorisation in Generative Models via Riemannian Bayesian Inference

Authors: Johanna Marie Gegenfurtner, Albert Kjøller Jacobsen, Naima Elosegui Borras, Alejandro Valverde Mahou, Georgios Arvanitidis
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2602.00199
Pdf URL: https://arxiv.org/pdf/2602.00199
Copy Paste: [[2602.00199]] Reducing Memorisation in Generative Models via Riemannian Bayesian Inference(https://arxiv.org/abs/2602.00199)
Keywords: generative
Abstract: Modern generative models can produce realistic samples; however, balancing memorisation and generalisation remains an open problem. We approach this challenge from a Bayesian perspective by focusing on the parameter space of flow matching and diffusion models and constructing a predictive posterior that better captures the variability of the data distribution. In particular, we capture the geometry of the loss using a Riemannian metric and leverage a flexible approximate posterior that adapts to the local structure of the loss landscape. This approach allows us to sample generative models that resemble the original model but exhibit reduced memorisation. Empirically, we demonstrate that the proposed approach reduces memorisation while preserving generalisation. Further, we provide a theoretical analysis of our method, which explains our findings. Overall, our work illustrates how considering the geometry of the loss enables effective use of the parameter space, even for complex high-dimensional generative models.

Title: SANEval: Open-Vocabulary Compositional Benchmarks with Failure-mode Diagnosis

Authors: Rishav Pramanik, Ian E. Nielsen, Jeff Smith, Saurav Pandit, Ravi P. Ramachandran, Zhaozheng Yin
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2602.00249
Pdf URL: https://arxiv.org/pdf/2602.00249
Copy Paste: [[2602.00249]] SANEval: Open-Vocabulary Compositional Benchmarks with Failure-mode Diagnosis(https://arxiv.org/abs/2602.00249)
Keywords: generation
Abstract: The rapid progress of text-to-image (T2I) models has unlocked unprecedented creative potential, yet their ability to faithfully render complex prompts involving multiple objects, attributes, and spatial relationships remains a significant bottleneck. Progress is hampered by a lack of adequate evaluation methods; current benchmarks are often restricted to closed-set vocabularies, lack fine-grained diagnostic capabilities, and fail to provide the interpretable feedback necessary to diagnose and remedy specific compositional failures. We solve these challenges by introducing SANEval (Spatial, Attribute, and Numeracy Evaluation), a comprehensive benchmark that establishes a scalable new pipeline for open-vocabulary compositional evaluation. SANEval combines a large language model (LLM) for deep prompt understanding with an LLM-enhanced, open-vocabulary object detector to robustly evaluate compositional adherence, unconstrained by a fixed vocabulary. Through extensive experiments on six state-of-the-art T2I models, we demonstrate that SANEval's automated evaluations provide a more faithful proxy for human assessment; our metric achieves a Spearman's rank correlation with statistically different results than those of existing benchmarks across tasks of attribute binding, spatial relations, and numeracy. To facilitate future research in compositional T2I generation and evaluation, we will release the SANEval dataset and our open-source evaluation pipeline.

Title: TABES: Trajectory-Aware Backward-on-Entropy Steering for Masked Diffusion Models

Authors: Shreshth Saini, Avinab Saha, Balu Adsumilli, Neil Birkbeck, Yilin Wang, Alan C. Bovik
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2602.00250
Pdf URL: https://arxiv.org/pdf/2602.00250
Copy Paste: [[2602.00250]] TABES: Trajectory-Aware Backward-on-Entropy Steering for Masked Diffusion Models(https://arxiv.org/abs/2602.00250)
Keywords: generation, generative
Abstract: Masked Diffusion Models (MDMs) have emerged as a promising non-autoregressive paradigm for generative tasks, offering parallel decoding and bidirectional context utilization. However, current sampling methods rely on simple confidence-based heuristics that ignore the long-term impact of local decisions, leading to trajectory lock-in where early hallucinations cascade into global incoherence. While search-based methods mitigate this, they incur prohibitive computational costs ($O(K)$ forward passes per step). In this work, we propose Backward-on-Entropy (BoE) Steering, a gradient-guided inference framework that approximates infinite-horizon lookahead via a single backward pass. We formally derive the Token Influence Score (TIS) from a first-order expansion of the trajectory cost functional, proving that the gradient of future entropy with respect to input embeddings serves as an optimal control signal for minimizing uncertainty. To ensure scalability, we introduce ActiveQueryAttention, a sparse adjoint primitive that exploits the structure of the masking objective to reduce backward pass complexity. BoE achieves a superior Pareto frontier for inference-time scaling compared to existing unmasking methods, demonstrating that gradient-guided steering offers a mathematically principled and efficient path to robust non-autoregressive generation. We will release the code.

Title: World-Shaper: A Unified Framework for 360° Panoramic Editing

Authors: Dong Liang, Yuhao Liu, Jinyuan Jia, Youjun Zhao, Rynson W.H. Lau
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.00265
Pdf URL: https://arxiv.org/pdf/2602.00265
Copy Paste: [[2602.00265]] World-Shaper: A Unified Framework for 360° Panoramic Editing(https://arxiv.org/abs/2602.00265)
Keywords: generation
Abstract: Being able to edit panoramic images is crucial for creating realistic 360° visual experiences. However, existing perspective-based image editing methods fail to model the spatial structure of panoramas. Conventional cube-map decompositions attempt to overcome this problem but inevitably break global consistency due to their mismatch with spherical geometry. Motivated by this insight, we reformulate panoramic editing directly in the equirectangular projection (ERP) domain and present World-Shaper, a unified geometry-aware framework that bridges panoramic generation and editing within a single editing-centric design. To overcome the scarcity of paired data, we adopt a generate-then-edit paradigm, where controllable panoramic generation serves as an auxiliary stage to synthesize diverse paired examples for supervised editing learning. To address geometric distortion, we introduce a geometry-aware learning strategy that explicitly enforces position-aware shape supervision and implicitly internalizes panoramic priors through progressive training. Extensive experiments on our new benchmark, PEBench, demonstrate that our method achieves superior geometric consistency, editing fidelity, and text controllability compared to SOTA methods, enabling coherent and flexible 360° visual world creation with unified editing control. Code, model, and data will be released at our project page: this https URL

Title: PLACID: Identity-Preserving Multi-Object Compositing via Video Diffusion with Synthetic Trajectories

Authors: Gemma Canet Tarrés, Manel Baradad, Francesc Moreno-Noguer, Yumeng Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2602.00267
Pdf URL: https://arxiv.org/pdf/2602.00267
Copy Paste: [[2602.00267]] PLACID: Identity-Preserving Multi-Object Compositing via Video Diffusion with Synthetic Trajectories(https://arxiv.org/abs/2602.00267)
Keywords: generative
Abstract: Recent advances in generative AI have dramatically improved photorealistic image synthesis, yet they fall short for studio-level multi-object compositing. This task demands simultaneous (i) near-perfect preservation of each item's identity, (ii) precise background and color fidelity, (iii) control of layout and design elements, and (iv) complete, appealing displays showcasing all objects. However, current state-of-the-art models often alter object details, omit or duplicate objects, and produce layouts with incorrect relative sizing or inconsistent item presentations. To bridge this gap, we introduce PLACID, a framework that transforms a collection of object images into an appealing multi-object composite. Our approach makes two main contributions. First, we leverage a pretrained image-to-video (I2V) diffusion model with text control to preserve object consistency, identities, and background details by exploiting temporal priors from videos. Second, we propose a novel data curation strategy that generates synthetic sequences where randomly placed objects smoothly move to their target positions. This synthetic data aligns with the video model's temporal priors during training. At inference, objects initialized at random positions consistently converge into coherent layouts guided by text, with the final frame serving as the composite image. Extensive quantitative evaluations and user studies demonstrate that PLACID surpasses state-of-the-art methods in multi-object compositing, achieving superior identity, background, and color preservation, with fewer omitted objects and visually appealing results.

Title: TokenTrim: Inference-Time Token Pruning for Autoregressive Long Video Generation

Authors: Ariel Shaulov, Eitan Shaar, Amit Edenzon, Lior Wolf
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2602.00268
Pdf URL: https://arxiv.org/pdf/2602.00268
Copy Paste: [[2602.00268]] TokenTrim: Inference-Time Token Pruning for Autoregressive Long Video Generation(https://arxiv.org/abs/2602.00268)
Keywords: generation
Abstract: Auto-regressive video generation enables long video synthesis by iteratively conditioning each new batch of frames on previously generated content. However, recent work has shown that such pipelines suffer from severe temporal drift, where errors accumulate and amplify over long horizons. We hypothesize that this drift does not primarily stem from insufficient model capacity, but rather from inference-time error propagation. Specifically, we contend that drift arises from the uncontrolled reuse of corrupted latent conditioning tokens during auto-regressive inference. To correct this accumulation of errors, we propose a simple, inference-time method that mitigates temporal drift by identifying and removing unstable latent tokens before they are reused for conditioning. For this purpose, we define unstable tokens as latent tokens whose representations deviate significantly from those of the previously generated batch, indicating potential corruption or semantic drift. By explicitly removing corrupted latent tokens from the auto-regressive context, rather than modifying entire spatial regions or model parameters, our method prevents unreliable latent information from influencing future generation steps. As a result, it significantly improves long-horizon temporal consistency without modifying the model architecture, training procedure, or leaving latent space.
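
The pruning idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how deviation is measured or how the cutoff is chosen, so the L2 distance, the position-wise alignment of tokens across batches, the fixed `threshold`, and the function names below are all assumptions made for illustration.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length token vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def prune_unstable_tokens(prev_tokens, new_tokens, threshold):
    """Keep only tokens whose deviation from the previous batch is small.

    Hypothetical helper: tokens are compared position by position, and a
    token is treated as unstable when its L2 distance to the previous
    batch's token at the same position exceeds `threshold`.
    Returns (kept_indices, kept_tokens).
    """
    kept_indices = [
        i
        for i, (prev, new) in enumerate(zip(prev_tokens, new_tokens))
        if l2_distance(prev, new) <= threshold
    ]
    return kept_indices, [new_tokens[i] for i in kept_indices]


# Toy latent tokens: the middle token drifts far from its previous value.
prev_batch = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
new_batch = [[0.1, 0.0], [4.0, 5.0], [2.0, 2.1]]
kept_idx, kept_tokens = prune_unstable_tokens(prev_batch, new_batch, threshold=0.5)
print(kept_idx)
```

Only the surviving tokens would then be passed on as conditioning context for the next batch; the model weights and the rest of the pipeline stay untouched, which matches the inference-time framing in the abstract.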

Title: Generation Order and Parallel Decoding in Masked Diffusion Models: An Information-Theoretic Perspective

Authors: Shaorong Zhang, Longxuan Yu, Rob Brekelmans, Luhan Tang, Salman Asif, Greg Ver Steeg
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2602.00286
Pdf URL: https://arxiv.org/pdf/2602.00286
Copy Paste: [[2602.00286]] Generation Order and Parallel Decoding in Masked Diffusion Models: An Information-Theoretic Perspective(https://arxiv.org/abs/2602.00286)
Keywords: generation
Abstract: Masked Diffusion Models (MDMs) significantly accelerate inference by trading off sequential determinism. However, the theoretical mechanisms governing generation order and the risks inherent in parallelization remain under-explored. In this work, we provide a unified information-theoretic framework to decouple and analyze two fundamental sources of failure: order sensitivity and parallelization bias. Our analysis yields three key insights: (1) the benefits of Easy-First decoding (prioritizing low-entropy tokens) are magnified as model error increases; (2) factorized parallel decoding introduces intrinsic sampling errors that can lead to arbitrarily large Reverse KL divergence, capturing "incoherence" failures that standard Forward KL metrics overlook; and (3) while verification can eliminate sampling error, it incurs an exponential cost governed by the total correlation within a block. Conversely, heuristics like remasking, though computationally efficient, cannot guarantee distributional correctness. Experiments on a controlled Block-HMM and large-scale MDMs (LLaDA) for arithmetic reasoning validate our theoretical framework.
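
Two of the notions in this abstract can be made concrete with a small sketch. The toy distributions and function names are assumptions for illustration, not taken from the paper. The first function orders positions by predictive entropy (Easy-First); the second part shows how a factorized sample of two perfectly correlated tokens has finite Forward KL but unbounded Reverse KL.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def easy_first_order(distributions):
    """Positions sorted from lowest to highest predictive entropy.

    Static illustration only: a real decoder would recompute the
    distributions after each token is unmasked.
    """
    return sorted(range(len(distributions)), key=lambda i: entropy(distributions[i]))


def kl(p, q):
    """KL(p || q); infinite when p puts mass where q has none."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:
            if qi == 0.0:
                return math.inf
            total += pi * math.log(pi / qi)
    return total


order = easy_first_order([[0.5, 0.5], [0.9, 0.1], [1.0, 0.0]])

# Joint over two binary tokens in the order 00, 01, 10, 11.
# Truth: the tokens always agree. Factorized decoding samples each token
# from its marginal independently, so all four outcomes get mass 0.25.
true_joint = [0.5, 0.0, 0.0, 0.5]
factorized = [0.25, 0.25, 0.25, 0.25]
forward_kl = kl(true_joint, factorized)  # finite
reverse_kl = kl(factorized, true_joint)  # infinite: mass on impossible pairs
print(order, forward_kl, reverse_kl)
```

The factorized sampler emits the incoherent pairs 01 and 10 half of the time, which Forward KL scores as a modest penalty while Reverse KL flags it as unbounded.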

Title: TimeBlind: A Spatio-Temporal Compositionality Benchmark for Video LLMs

Authors: Baiqi Li, Kangyi Zhao, Ce Zhang, Chancharik Mitra, Jean de Dieu Nyandwi, Gedas Bertasius
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2602.00288
Pdf URL: https://arxiv.org/pdf/2602.00288
Copy Paste: [[2602.00288]] TimeBlind: A Spatio-Temporal Compositionality Benchmark for Video LLMs(https://arxiv.org/abs/2602.00288)
Keywords: generation
Abstract: Fine-grained spatio-temporal understanding is essential for video reasoning and embodied AI. Yet, while Multimodal Large Language Models (MLLMs) master static semantics, their grasp of temporal dynamics remains brittle. We present TimeBlind, a diagnostic benchmark for compositional spatio-temporal understanding. Inspired by cognitive science, TimeBlind categorizes fine-grained temporal understanding into three levels: recognizing atomic events, characterizing event properties, and reasoning about event interdependencies. Unlike benchmarks that conflate recognition with temporal reasoning, TimeBlind leverages a minimal-pairs paradigm: video pairs share identical static visual content but differ solely in temporal structure, utilizing complementary questions to neutralize language priors. Evaluating over 20 state-of-the-art MLLMs (e.g., GPT-5, Gemini 3 Pro) on 600 curated instances (2400 video-question pairs) reveals that the Instance Accuracy (correctly distinguishing both videos in a pair) of the best-performing MLLM is only 48.2%, far below human performance (98.2%). These results demonstrate that even frontier models rely heavily on static visual shortcuts rather than genuine temporal logic, positioning TimeBlind as a vital diagnostic tool for next-generation video understanding. Dataset and code are available at this https URL .
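
The gap between per-question and per-instance scoring can be seen in a short sketch. The exact scoring rule is an assumption here: it treats an instance as correct only when every video-question pair belonging to it is answered correctly, which is one reading of "correctly distinguishing both videos in a pair". The data layout and function names are hypothetical.

```python
def pair_accuracy(results):
    """Fraction of individual video-question pairs answered correctly."""
    answers = [ok for instance in results.values() for ok in instance]
    return sum(answers) / len(answers) if answers else 0.0


def instance_accuracy(results):
    """Fraction of instances whose video-question pairs are all correct.

    `results` maps an instance id to a list of booleans, one per
    video-question pair of that instance (assumed layout).
    """
    if not results:
        return 0.0
    return sum(1 for instance in results.values() if all(instance)) / len(results)


# Two toy instances with four video-question pairs each.
toy_results = {
    "instance_a": [True, True, True, True],
    "instance_b": [True, False, True, True],
}
print(pair_accuracy(toy_results), instance_accuracy(toy_results))
```

A single wrong answer costs the whole instance, so instance-level accuracy can sit far below per-question accuracy, which is why it is the stricter diagnostic.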

Title: Self-Attention at Constant Cost per Token via Symmetry-Aware Taylor Approximation

Authors: Franz A. Heinsen, Leo Kozachkov
Subjects: cs.LG, cs.AI, cs.DC
Abstract URL: https://arxiv.org/abs/2602.00294
Pdf URL: https://arxiv.org/pdf/2602.00294
Copy Paste: [[2602.00294]] Self-Attention at Constant Cost per Token via Symmetry-Aware Taylor Approximation(https://arxiv.org/abs/2602.00294)
Keywords: generation
Abstract: The most widely used artificial intelligence (AI) models today are Transformers employing self-attention. In its standard form, self-attention incurs costs that increase with context length, driving demand for storage, compute, and energy that is now outstripping society's ability to provide them. To help address this issue, we show that self-attention is efficiently computable to arbitrary precision with constant cost per token, achieving orders-of-magnitude reductions in memory use and computation. We derive our formulation by decomposing the conventional formulation's Taylor expansion into expressions over symmetric chains of tensor products. We exploit their symmetry to obtain feed-forward transformations that efficiently map queries and keys to coordinates in a minimal polynomial-kernel feature basis. Notably, cost is fixed inversely in proportion to head size, enabling application over a greater number of heads per token than otherwise feasible. We implement our formulation and empirically validate its correctness. Our work enables unbounded token generation at modest fixed cost, substantially reducing the infrastructure and energy demands of large-scale Transformer models. The mathematical techniques we introduce are of independent interest.
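
The general idea of constant cost per token can be sketched with a truncated Taylor series, under loud assumptions: this is a generic second-order approximation exp(s) ≈ 1 + s + s²/2 with a redundant (non-minimal) feature basis, not the paper's arbitrary-precision, symmetry-aware formulation. The class and function names are hypothetical. The point shown is only that a fixed-size running state replaces the growing key-value cache.

```python
import math


def taylor_features(x):
    """Feature map phi with phi(q) . phi(k) = 1 + q.k + (q.k)^2 / 2."""
    feats = [1.0] + list(x)
    feats += [xi * xj / math.sqrt(2.0) for xi in x for xj in x]
    return feats


class RunningAttention:
    """Causal attention with a fixed-size state (assumed sketch).

    State: kv[a][c] = sum_j phi(k_j)[a] * v_j[c] and z[a] = sum_j phi(k_j)[a].
    Each step costs O(feat_dim * value_dim), independent of context length.
    """

    def __init__(self, feat_dim, value_dim):
        self.kv = [[0.0] * value_dim for _ in range(feat_dim)]
        self.z = [0.0] * feat_dim

    def step(self, q, k, v):
        # Fold the new key and value into the running sums.
        for a, fa in enumerate(taylor_features(k)):
            self.z[a] += fa
            for c, vc in enumerate(v):
                self.kv[a][c] += fa * vc
        # Read out with the query; 1 + s + s^2/2 > 0, so denom is positive.
        fq = taylor_features(q)
        denom = sum(fa * za for fa, za in zip(fq, self.z))
        return [
            sum(fq[a] * self.kv[a][c] for a in range(len(fq))) / denom
            for c in range(len(v))
        ]


# Head size 2 gives 1 + 2 + 4 = 7 features.
attention = RunningAttention(feat_dim=7, value_dim=2)
first_output = attention.step([0.1, 0.2], [0.2, 0.1], [1.0, 0.0])
print(first_output)
```

With a single token in context the output equals that token's value vector, and later steps reuse the same seven-by-two state no matter how many tokens have been seen.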

Title: Agentic Framework for Epidemiological Modeling

Authors: Rituparna Datta, Zihan Guan, Baltazar Espinoza, Yiqi Su, Priya Pitre, Srini Venkatramanan, Naren Ramakrishnan, Anil Vullikanti
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2602.00299
Pdf URL: https://arxiv.org/pdf/2602.00299
Copy Paste: [[2602.00299]] Agentic Framework for Epidemiological Modeling(https://arxiv.org/abs/2602.00299)
Keywords: generation
Abstract: Epidemic modeling is essential for public health planning, yet traditional approaches rely on fixed model classes that require manual redesign as pathogens, policies, and scenario assumptions evolve. We introduce EPIAGENT, an agentic framework that automatically synthesizes, calibrates, verifies, and refines epidemiological simulators by modeling disease progression as an iterative program synthesis problem. A central design choice is an explicit epidemiological flow graph intermediate representation that links scenario specifications to model structure and enables strong, modular correctness checks before code is generated. Verified flow graphs are then compiled into mechanistic models supporting interpretable parameter learning under physical and epidemiological constraints. Evaluation on epidemiological scenario case studies demonstrates that EPIAGENT captures complex growth dynamics and produces epidemiologically consistent counterfactual projections across varying vaccination and immune escape assumptions. Our results show that the agentic feedback loop prevents degeneration and significantly accelerates convergence toward valid models by mimicking professional expert workflows.

Title: Bridging the Semantic Chasm: Synergistic Conceptual Anchoring for Generalized Few-Shot and Zero-Shot OOD Perception

Authors: Alexandros Christoforos, Sarah Jenkins, Michael Brown, Tuan Pham, David Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2602.00340
Pdf URL: https://arxiv.org/pdf/2602.00340
Copy Paste: [[2602.00340]] Bridging the Semantic Chasm: Synergistic Conceptual Anchoring for Generalized Few-Shot and Zero-Shot OOD Perception(https://arxiv.org/abs/2602.00340)
Keywords: generation
Abstract: This manuscript presents a pioneering Synergistic Neural Agents Network (SynerNet) framework designed to mitigate the phenomenon of cross-modal alignment degeneration in Vision-Language Models (VLMs) when encountering Out-of-Distribution (OOD) concepts. Specifically, four specialized computational units - visual perception, linguistic context, nominal embedding, and global coordination - collaboratively rectify modality disparities via a structured message-propagation protocol. The principal contributions encompass a multi-agent latent space nomenclature acquisition framework, a semantic context-interchange algorithm for enhanced few-shot adaptation, and an adaptive dynamic equilibrium mechanism. Empirical evaluations conducted on the VISTA-Beyond benchmark demonstrate that SynerNet yields substantial performance augmentations in both few-shot and zero-shot scenarios, exhibiting precision improvements ranging from 1.2% to 5.4% across a diverse array of domains.

Title: When RAG Hurts: Diagnosing and Mitigating Attention Distraction in Retrieval-Augmented LVLMs

Authors: Beidi Zhao, Wenlong Deng, Xinting Liao, Yushu Li, Nazim Shaikh, Yao Nie, Xiaoxiao Li
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2602.00344
Pdf URL: https://arxiv.org/pdf/2602.00344
Copy Paste: [[2602.00344]] When RAG Hurts: Diagnosing and Mitigating Attention Distraction in Retrieval-Augmented LVLMs(https://arxiv.org/abs/2602.00344)
Keywords: generation
Abstract: While Retrieval-Augmented Generation (RAG) is one of the dominant paradigms for enhancing Large Vision-Language Models (LVLMs) on knowledge-based VQA tasks, recent work attributes RAG failures to insufficient attention towards the retrieved context, proposing to reduce the attention allocated to image tokens. In this work, we identify a distinct failure mode that previous studies overlooked: Attention Distraction (AD). When the retrieved context is sufficient (highly relevant or including the correct answer), the retrieved text suppresses the visual attention globally, and the attention on image tokens shifts away from question-relevant regions. This leads to failures on questions the model could originally answer correctly without the retrieved text. To mitigate this issue, we propose MAD-RAG, a training-free intervention that decouples visual grounding from context integration through a dual-question formulation, combined with attention mixing to preserve image-conditioned evidence. Extensive experiments on OK-VQA, E-VQA, and InfoSeek demonstrate that MAD-RAG consistently outperforms existing baselines across different model families, yielding absolute gains of up to 4.76%, 9.20%, and 6.18% over the vanilla RAG baseline. Notably, MAD-RAG rectifies up to 74.68% of failure cases with negligible computational overhead.

Title: ReLAPSe: Reinforcement-Learning-trained Adversarial Prompt Search for Erased concepts in unlearned diffusion models

Authors: Ignacy Kolton, Kacper Marzol, Paweł Batorski, Marcin Mazur, Paul Swoboda, Przemysław Spurek
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.00350
Pdf URL: https://arxiv.org/pdf/2602.00350
Copy Paste: [[2602.00350]] ReLAPSe: Reinforcement-Learning-trained Adversarial Prompt Search for Erased concepts in unlearned diffusion models(https://arxiv.org/abs/2602.00350)
Keywords: restoration
Abstract: Machine unlearning is a key defense mechanism for removing unauthorized concepts from text-to-image diffusion models, yet recent evidence shows that latent visual information often persists after unlearning. Existing adversarial approaches for exploiting this leakage are constrained by fundamental limitations: optimization-based methods are computationally expensive due to per-instance iterative search, while reasoning-based and heuristic techniques lack direct feedback from the target model's latent visual representations. To address these challenges, we introduce ReLAPSe, a policy-based adversarial framework that reformulates concept restoration as a reinforcement learning problem. ReLAPSe trains an agent using Reinforcement Learning with Verifiable Rewards (RLVR), leveraging the diffusion model's noise prediction loss as a model-intrinsic and verifiable feedback signal. This closed-loop design directly aligns textual prompt manipulation with latent visual residuals, enabling the agent to learn transferable restoration strategies rather than optimizing isolated prompts. By pioneering the shift from per-instance optimization to global policy learning, ReLAPSe achieves efficient, near-real-time recovery of fine-grained identities and styles across multiple state-of-the-art unlearning methods, providing a scalable tool for rigorous red-teaming of unlearned diffusion models. Some experimental evaluations involve sensitive visual concepts, such as nudity. Code is available at this https URL

Title: Planning with Language and Generative Models: Toward General Reward-Guided Wireless Network Design

Authors: Chenyang Yuan, Xiaoyuan Cheng
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2602.00357
Pdf URL: https://arxiv.org/pdf/2602.00357
Copy Paste: [[2602.00357]] Planning with Language and Generative Models: Toward General Reward-Guided Wireless Network Design(https://arxiv.org/abs/2602.00357)
Keywords: generation, generative
Abstract: Intelligent access point (AP) deployment remains challenging in next-generation wireless networks due to complex indoor geometries and signal propagation. We first benchmark general-purpose large language models (LLMs) as agentic optimizers for AP planning and find that, despite strong wireless domain knowledge, their dependence on external verifiers results in high computational costs and limited scalability. Motivated by these limitations, we study generative inference models guided by a unified reward function capturing core AP deployment objectives across diverse floorplans. We show that diffusion samplers consistently outperform alternative generative approaches. The diffusion process progressively improves sampling by smoothing and sharpening the reward landscape, rather than relying on iterative refinement, which is effective for non-convex and fragmented objectives. Finally, we introduce a large-scale real-world dataset for indoor AP deployment, requiring over 50k CPU hours to train general reward functions, and evaluate in- and out-of-distribution generalization and robustness. Our results suggest that diffusion-based generative inference with a unified reward function provides a scalable and domain-agnostic foundation for indoor AP deployment planning.

Title: RePaint-Enhanced Conditional Diffusion Model for Parametric Engineering Designs under Performance and Parameter Constraints

Authors: Ke Wang, Nguyen Gia Hien Vu, Yifan Tang, Mostafa Rahmani Dehaghani, G. Gary Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2602.00384
Pdf URL: https://arxiv.org/pdf/2602.00384
Copy Paste: [[2602.00384]] RePaint-Enhanced Conditional Diffusion Model for Parametric Engineering Designs under Performance and Parameter Constraints(https://arxiv.org/abs/2602.00384)
Keywords: generation, generative
Abstract: This paper presents a RePaint-enhanced framework that integrates a pre-trained performance-guided denoising diffusion probabilistic model (DDPM) for performance- and parameter-constrained engineering design generation. The proposed method enables the generation of missing design components based on a partial reference design while satisfying performance constraints, without retraining the underlying model. By applying mask-based resampling during inference, RePaint allows efficient and controllable repainting of partial designs under both performance and parameter constraints, which is not supported by conventional DDPM-based methods. The framework is evaluated on two representative design problems, parametric ship hull design and airfoil design, demonstrating its ability to generate novel designs with expected performance based on a partial reference design. Results show that the method achieves accuracy comparable to or better than pre-trained models while enabling controlled novelty through fixing partial designs. Overall, the proposed approach provides an efficient, training-free solution for parameter-constraint-aware generative design in engineering applications.
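Illustrative sketch (not the authors' code): the mask-based step mentioned in the abstract can be pictured with the merge rule from the original RePaint formulation for image inpainting. At each reverse step, the fixed entries come from a noised copy of the reference and the free entries come from the model's denoised sample. The function names, the flat parameter-vector representation, the mask convention (1 = fixed by the reference design), and the sample values are all assumptions made for illustration.

```python
import math
import random


def noise_known(x0, alpha_bar, rng):
    """Forward-diffuse the reference design x0 to a given noise level.

    alpha_bar is the cumulative product of (1 - beta) up to the target step.
    """
    scale, sigma = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [scale * v + sigma * rng.gauss(0.0, 1.0) for v in x0]


def repaint_merge(x_known, x_generated, mask):
    """Keep known entries where mask == 1, generated entries where mask == 0."""
    return [m * k + (1 - m) * g for k, g, m in zip(x_known, x_generated, mask)]


rng = random.Random(0)
reference = [0.5, -1.0, 2.0, 0.0]   # hypothetical design parameters
mask = [1, 1, 0, 0]                 # first two parameters are fixed
generated = [9.0, 9.0, 0.3, -0.7]   # stand-in for the DDPM reverse-step output

# With alpha_bar = 1.0 there is no noise, so fixed entries match the reference.
merged = repaint_merge(noise_known(reference, 1.0, rng), generated, mask)
print(merged)
```

In a full sampler this merge would run once per reverse step, with alpha_bar taken from the model's noise schedule. How the paper combines it with performance guidance is not specified in the abstract.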

Title: A Fragile Guardrail: Diffusion LLM's Safety Blessing and Its Failure Mode

Authors: Zeyuan He, Yupeng Chen, Lang Lin, Yihan Wang, Shenxu Chang, Eric Sommerlade, Philip Torr, Junchi Yu, Adel Bibi, Jialin Yu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2602.00388
Pdf URL: https://arxiv.org/pdf/2602.00388
Copy Paste: [[2602.00388]] A Fragile Guardrail: Diffusion LLM's Safety Blessing and Its Failure Mode(https://arxiv.org/abs/2602.00388)
Keywords: generation
Abstract: Diffusion large language models (D-LLMs) offer an alternative to autoregressive LLMs (AR-LLMs) and have demonstrated advantages in generation efficiency. Beyond the utility benefits, we argue that D-LLMs exhibit a previously underexplored safety blessing: their diffusion-style generation confers intrinsic robustness against jailbreak attacks originally designed for AR-LLMs. In this work, we provide an initial analysis of the underlying mechanism, showing that the diffusion trajectory induces a stepwise reduction effect that progressively suppresses unsafe generations. This robustness, however, is not absolute. We identify a simple yet effective failure mode, termed context nesting, where harmful requests are embedded within structured benign contexts, effectively bypassing the stepwise reduction mechanism. Empirically, we show that this simple strategy is sufficient to bypass D-LLMs' safety blessing, achieving state-of-the-art attack success rates across models and benchmarks. Most notably, it enables the first successful jailbreak of Gemini Diffusion, to our knowledge, exposing a critical vulnerability in commercial D-LLMs. Together, our results characterize both the origins and the limits of D-LLMs' safety blessing, constituting an early-stage red-teaming of D-LLMs.

Title: Brazilian Portuguese Image Captioning with Transformers: A Study on Cross-Native-Translated Dataset

Authors: Gabriel Bromonschenkel, Alessandro L. Koerich, Thiago M. Paixão, Hilário Tomaz Alves de Oliveira
Subjects: cs.CV, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2602.00393
Pdf URL: https://arxiv.org/pdf/2602.00393
Copy Paste: [[2602.00393]] Brazilian Portuguese Image Captioning with Transformers: A Study on Cross-Native-Translated Dataset(https://arxiv.org/abs/2602.00393)
Keywords: generation
Abstract: Image captioning (IC) refers to the automatic generation of natural language descriptions for images, with applications ranging from social media content generation to assisting individuals with visual impairments. While most research has been focused on English-based models, low-resource languages such as Brazilian Portuguese face significant challenges due to the lack of specialized datasets and models. Several studies create datasets by automatically translating existing ones to mitigate resource scarcity. This work addresses this gap by proposing a cross-native-translated evaluation of Transformer-based vision and language models for Brazilian Portuguese IC. We use a version of Flickr30K composed of captions manually created by native Brazilian Portuguese speakers and compare it to a version with captions automatically translated from English to Portuguese. The experiments include a cross-context approach, where models trained on one dataset are tested on the other to assess the translation impact. Additionally, we incorporate attention maps for model inference interpretation and use the CLIP-Score metric to evaluate the image-description alignment. Our findings show that Swin-DistilBERTimbau consistently outperforms other models, demonstrating strong generalization across datasets. ViTucano, a Brazilian Portuguese pre-trained VLM, surpasses larger multilingual models (GPT-4o, LLaMa 3.2 Vision) in traditional text-based evaluation metrics, while GPT-4 models achieve the highest CLIP-Score, highlighting improved image-text alignment. Attention analysis reveals systematic biases, including gender misclassification, object enumeration errors, and spatial inconsistencies. The datasets and the models generated and analyzed during the current study are available at this https URL.
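Illustrative sketch: the CLIP-Score metric named in the abstract is commonly defined as a rescaled, clipped cosine similarity between the image embedding and the caption embedding, with rescaling weight w = 2.5. The snippet below assumes that standard definition and uses toy vectors in place of real CLIP embeddings. The exact CLIP backbone and settings used in the study are not given in the abstract.

```python
import math


def clip_score(image_emb, text_emb, w=2.5):
    """CLIPScore = w * max(cosine(image_emb, text_emb), 0)."""
    dot = sum(a * b for a, b in zip(image_emb, text_emb))
    norm = math.sqrt(sum(a * a for a in image_emb)) * math.sqrt(sum(b * b for b in text_emb))
    return w * max(dot / norm, 0.0)


# Toy vectors standing in for real CLIP embeddings.
image = [1.0, 0.0, 1.0]
good_caption = [2.0, 0.0, 2.0]    # same direction as the image embedding
bad_caption = [0.0, 3.0, 0.0]     # orthogonal to the image embedding
print(clip_score(image, good_caption), clip_score(image, bad_caption))
```

Because negative similarities are clipped to zero, the score only rewards agreement between image and caption and never falls below zero.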

Title: Toward Autonomous Laboratory Safety Monitoring with Vision Language Models: Learning to See Hazards Through Scene Structure

Authors: Trishna Chakraborty, Udita Ghosh, Aldair Ernesto Gongora, Ruben Glatt, Yue Dong, Jiachen Li, Amit K. Roy-Chowdhury, Chengyu Song
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2602.00414
Pdf URL: https://arxiv.org/pdf/2602.00414
Copy Paste: [[2602.00414]] Toward Autonomous Laboratory Safety Monitoring with Vision Language Models: Learning to See Hazards Through Scene Structure(https://arxiv.org/abs/2602.00414)
Keywords: generation
Abstract: Laboratories are prone to severe injuries from minor unsafe actions, yet continuous safety monitoring (beyond mandatory pre-lab safety training) is limited by human availability. Vision language models (VLMs) offer promise for autonomous laboratory safety monitoring, but their effectiveness in realistic settings is unclear due to the lack of visual evaluation data, as most safety incidents are documented primarily as unstructured text. To address this gap, we first introduce a structured data generation pipeline that converts textual laboratory scenarios into aligned triples of (image, scene graph, ground truth), using large language models as scene graph architects and image generation models as renderers. Our experiments on the synthetic dataset of 1,207 samples across 362 unique scenarios and seven open- and closed-source models show that VLMs perform effectively given a textual scene graph, but degrade substantially in visual-only settings, indicating difficulty in extracting structured object relationships directly from pixels. To overcome this, we propose a post-training context-engineering approach, scene-graph-guided alignment, to bridge perceptual gaps in VLMs by translating visual inputs into structured scene graphs better aligned with VLM reasoning, improving hazard detection performance in visual-only settings.

Title: Federated-inspired Single-cell Batch Integration in Latent Space

Authors: Quang-Huy Nguyen, Zongliang Yue, Hao Chen, Wei-Shinn Ku, Jiaqi Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2602.00423
Pdf URL: https://arxiv.org/pdf/2602.00423
Copy Paste: [[2602.00423]] Federated-inspired Single-cell Batch Integration in Latent Space(https://arxiv.org/abs/2602.00423)
Keywords: generation
Abstract: Advances in single-cell RNA sequencing enable the rapid generation of massive, high-dimensional datasets, yet the accumulation of data across experiments introduces batch effects that obscure true biological signals. Existing batch correction approaches either insufficiently correct batch effects or require centralized retraining on the complete dataset, limiting their applicability in distributed and continually evolving single-cell data settings. We introduce scBatchProx, a post-hoc optimization method inspired by federated learning principles for refining cell-level embeddings produced by arbitrary upstream methods. Treating each batch as a client, scBatchProx learns batch-conditioned adapters under proximal regularization, correcting batch structure directly in latent space without requiring raw expression data or centralized optimization. The method is lightweight and deployable, optimizing batch-specific adapter parameters only. Extensive experiments show that scBatchProx consistently yields relative gains of approximately 3-8% in overall embedding quality, with batch correction and biological conservation improving in 90% and 85% of data-method pairs, respectively. We envision this work as a step toward the practical refinement of learned representations in dynamic single-cell data systems.

Title: Open Materials Generation with Inference-Time Reinforcement Learning

Authors: Philipp Hoellmer, Stefano Martiniani
Subjects: cs.LG, cond-mat.mtrl-sci
Abstract URL: https://arxiv.org/abs/2602.00424
Pdf URL: https://arxiv.org/pdf/2602.00424
Copy Paste: [[2602.00424]] Open Materials Generation with Inference-Time Reinforcement Learning(https://arxiv.org/abs/2602.00424)
Keywords: generation, generative
Abstract: Continuous-time generative models for crystalline materials enable inverse materials design by learning to predict stable crystal structures, but incorporating explicit target properties into the generative process remains challenging. Policy-gradient reinforcement learning (RL) provides a principled mechanism for aligning generative models with downstream objectives but typically requires access to the score, which has prevented its application to flow-based models that learn only velocity fields. We introduce Open Materials Generation with Inference-time Reinforcement Learning (OMatG-IRL), a policy-gradient RL framework that operates directly on the learned velocity fields and eliminates the need for the explicit computation of the score. OMatG-IRL leverages stochastic perturbations of the underlying generation dynamics, preserving the baseline performance of the pretrained generative model while enabling exploration and policy-gradient estimation at inference time. Using OMatG-IRL, we present the first application of RL to crystal structure prediction (CSP). Our method enables effective reinforcement of an energy-based objective while preserving diversity through composition conditioning, and it achieves performance competitive with score-based RL approaches. Finally, we show that OMatG-IRL can learn time-dependent velocity-annealing schedules, enabling accurate CSP with order-of-magnitude improvements in sampling efficiency and, correspondingly, reduction in generation time.

Title: LLMs as High-Dimensional Nonlinear Autoregressive Models with Attention: Training, Alignment and Inference

Authors: Vikram Krishnamurthy
Subjects: cs.LG, cs.AI, cs.CL, eess.SP
Abstract URL: https://arxiv.org/abs/2602.00426
Pdf URL: https://arxiv.org/pdf/2602.00426
Copy Paste: [[2602.00426]] LLMs as High-Dimensional Nonlinear Autoregressive Models with Attention: Training, Alignment and Inference(https://arxiv.org/abs/2602.00426)
Keywords: generation
Abstract: Large language models (LLMs) based on transformer architectures are typically described through collections of architectural components and training procedures, obscuring their underlying computational structure. This review article provides a concise mathematical reference for researchers seeking an explicit, equation-level description of LLM training, alignment, and generation. We formulate LLMs as high-dimensional nonlinear autoregressive models with attention-based dependencies. The framework encompasses pretraining via next-token prediction, alignment methods such as reinforcement learning from human feedback (RLHF), direct preference optimization (DPO), rejection sampling fine-tuning (RSFT), and reinforcement learning from verifiable rewards (RLVR), as well as autoregressive generation during inference. Self-attention emerges naturally as a repeated bilinear-softmax-linear composition, yielding highly expressive sequence models. This formulation enables principled analysis of alignment-induced behaviors (including sycophancy), inference-time phenomena (such as hallucination, in-context learning, chain-of-thought prompting, and retrieval-augmented generation), and extensions like continual learning, while serving as a concise reference for interpretation and further theoretical development.
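Illustrative sketch: the bilinear, softmax, linear reading of self-attention in the abstract can be made concrete with a single-head causal attention layer in plain Python. The scaling by sqrt(d_k) and the causal mask follow the usual transformer conventions and are assumptions here, as are the function names and the toy inputs. The article's own notation may differ.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head causal attention over token embeddings x (T rows, d columns)."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    # Bilinear step: scores[t][s] = q_t . k_s, bilinear in (x_t, x_s).
    scores = matmul(q, [list(col) for col in zip(*k)])
    weights = []
    for t, row in enumerate(scores):
        # Causal mask: token t may only attend to positions s <= t.
        visible = [s / math.sqrt(d_k) for s in row[: t + 1]]
        weights.append(softmax(visible) + [0.0] * (len(row) - t - 1))
    # Linear step: each output is a weighted sum of value vectors.
    return matmul(weights, v), weights


identity = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three toy token embeddings
outputs, attn = causal_self_attention(tokens, identity, identity, identity)
print(outputs[0], attn[0])
```

The first token can only see itself, so its attention row is one-hot and its output equals its own value vector. Later rows spread their weight over all earlier positions.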

Title: FedMOA: Federated GRPO for Personalized Reasoning LLMs under Heterogeneous Rewards

Authors: Ziyao Wang, Daeun Jung, Yexiao He, Guoheng Sun, Zheyu Shen, Myungjin Lee, Ang Li
Subjects: cs.LG, cs.DC
Abstract URL: https://arxiv.org/abs/2602.00453
Pdf URL: https://arxiv.org/pdf/2602.00453
Copy Paste: [[2602.00453]] FedMOA: Federated GRPO for Personalized Reasoning LLMs under Heterogeneous Rewards(https://arxiv.org/abs/2602.00453)
Keywords: generation
Abstract: Group Relative Policy Optimization (GRPO) has recently emerged as an effective approach for improving the reasoning capabilities of large language models through online multi-objective reinforcement learning. While personalization on private data is increasingly vital, traditional Reinforcement Learning (RL) alignment is often memory-prohibitive for on-device federated learning due to the overhead of maintaining a separate critic network. GRPO's critic-free architecture enables feasible on-device training, yet transitioning to a federated setting introduces systemic challenges: heterogeneous reward definitions, imbalanced multi-objective optimization, and high training costs. We propose FedMOA, a federated GRPO framework for multi-objective alignment under heterogeneous rewards. FedMOA stabilizes local training through an online adaptive weighting mechanism via hypergradient descent, which prioritizes primary reasoning as auxiliary objectives saturate. On the server side, it utilizes a task- and accuracy-aware aggregation strategy to prioritize high-quality updates. Experiments on mathematical reasoning and code generation benchmarks demonstrate that FedMOA consistently outperforms federated averaging, achieving accuracy gains of up to 2.2% while improving global performance, personalization, and multi-objective balance.
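Illustrative sketch: the critic-free property of GRPO mentioned in the abstract comes from scoring each sampled response against the other responses drawn for the same prompt, instead of against a learned value network. The snippet below shows that group-relative normalization. The function name, the use of the population standard deviation, the eps term, and the reward values are assumptions, and FedMOA's adaptive multi-objective weighting is not reproduced here.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group (no critic)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Responses that beat the group mean get positive advantages and the rest get negative ones. When every response earns the same reward, all advantages are zero and the group contributes no policy-gradient signal.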

Title: LatentTrack: Sequential Weight Generation via Latent Filtering

Authors: Omer Haq
Subjects: cs.LG, cs.AI, cs.RO, stat.ML
Abstract URL: https://arxiv.org/abs/2602.00458
Pdf URL: https://arxiv.org/pdf/2602.00458
Copy Paste: [[2602.00458]] LatentTrack: Sequential Weight Generation via Latent Filtering(https://arxiv.org/abs/2602.00458)
Keywords: generation
Abstract: We introduce LatentTrack (LT), a sequential neural architecture for online probabilistic prediction under nonstationary dynamics. LT performs causal Bayesian filtering in a low-dimensional latent space and uses a lightweight hypernetwork to generate predictive model parameters at each time step, enabling constant-time online adaptation without per-step gradient updates. At each time step, a learned latent model predicts the next latent distribution, which is updated via amortized inference using new observations, yielding a predict-generate-update filtering framework in function space. The formulation supports both structured (Markovian) and unstructured latent dynamics within a unified objective, while Monte Carlo inference over latent trajectories produces calibrated predictive mixtures with fixed per-step cost. Evaluated on long-horizon online regression using the Jena Climate benchmark, LT consistently achieves lower negative log-likelihood and mean squared error than stateful sequential and static uncertainty-aware baselines, with competitive calibration, demonstrating that latent-conditioned function evolution is an effective alternative to traditional latent-state modeling under distribution shift.

Title: PSGS: Text-driven Panorama Sliding Scene Generation via Gaussian Splatting

Authors: Xin Zhang, Shen Chen, Jiale Zhou, Lei Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.00463
Pdf URL: https://arxiv.org/pdf/2602.00463
Copy Paste: [[2602.00463]] PSGS: Text-driven Panorama Sliding Scene Generation via Gaussian Splatting(https://arxiv.org/abs/2602.00463)
Keywords: generation
Abstract: Generating realistic 3D scenes from text is crucial for immersive applications like VR, AR, and gaming. While text-driven approaches promise efficiency, existing methods suffer from limited 3D-text data and inconsistent multi-view stitching, resulting in overly simplistic scenes. To address this, we propose PSGS, a two-stage framework for high-fidelity panoramic scene generation. First, a novel two-layer optimization architecture generates semantically coherent panoramas: a layout reasoning layer parses text into structured spatial relationships, while a self-optimization layer refines visual details via iterative MLLM feedback. Second, our panorama sliding mechanism initializes globally consistent 3D Gaussian Splatting point clouds by strategically sampling overlapping perspectives. By incorporating depth and semantic coherence losses during training, we greatly improve the quality and detail fidelity of rendered scenes. Our experiments demonstrate that PSGS outperforms existing methods in panorama generation and produces more appealing 3D scenes, offering a robust solution for scalable immersive content creation.

Title: ZS-TreeSeg: A Zero-Shot Framework for Tree Crown Instance Segmentation

Authors: Pengyu Chen, Fangzheng Lyu, Sicheng Wang, Cuizhen Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.00470
Pdf URL: https://arxiv.org/pdf/2602.00470
Copy Paste: [[2602.00470]] ZS-TreeSeg: A Zero-Shot Framework for Tree Crown Instance Segmentation(https://arxiv.org/abs/2602.00470)
Keywords: generation
Abstract: Individual tree crown segmentation is an important task in remote sensing for forest biomass estimation and ecological monitoring. However, accurate delineation in dense, overlapping canopies remains a bottleneck. While supervised deep learning methods suffer from high annotation costs and limited generalization, emerging foundation models (e.g., Segment Anything Model) often lack domain knowledge, leading to under-segmentation in dense clusters. To bridge this gap, we propose ZS-TreeSeg, a zero-shot framework that adapts two mature tasks: 1) canopy semantic segmentation; and 2) cell instance segmentation. By modeling tree crowns as star-convex objects within a topological flow field using Cellpose-SAM, the ZS-TreeSeg framework forces the mathematical separation of touching tree crown instances based on vector convergence. Experiments on the NEON and BAMFOREST datasets, together with visual inspection, demonstrate that our framework generalizes robustly across diverse sensor types and canopy densities, offering a training-free solution for tree crown instance segmentation and label generation.

Title: Diffusion LMs Can Approximate Optimal Infilling Lengths Implicitly

Authors: Hengchang Liu, Zhao Yang, Bing Su
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2602.00476
Pdf URL: https://arxiv.org/pdf/2602.00476
Copy Paste: [[2602.00476]] Diffusion LMs Can Approximate Optimal Infilling Lengths Implicitly(https://arxiv.org/abs/2602.00476)
Keywords: generation
Abstract: Diffusion language models (DLMs) provide a bidirectional generation framework naturally suited for infilling, yet their performance is constrained by the pre-specified infilling length. In this paper, we reveal that DLMs possess an inherent ability to discover the correct infilling length. We identify two key statistical phenomena in the first-step denoising confidence: a local Oracle Peak that emerges near the ground-truth length and a systematic Length Bias that often obscures this signal. By leveraging this signal and calibrating the bias, our training-free method CAL (Calibrated Adaptive Length) enables DLMs to approximate the optimal length through an efficient search before formal decoding. Empirical evaluations demonstrate that CAL improves Pass@1 by up to 47.7% over fixed-length baselines and 40.5% over chat-based adaptive methods in code infilling, while boosting BLEU-2 and ROUGE-L by up to 8.5% and 9.9% in text infilling. These results demonstrate that CAL paves the way for robust DLM infilling without requiring any specialized training. Code is available at this https URL.

Title: Quality-Diversity Optimization as Multi-Objective Optimization

Authors: Xi Lin, Ping Guo, Yilu Liu, Qingfu Zhang, Jianyong Sun
Subjects: cs.LG, cs.AI, cs.NE, math.OC
Abstract URL: https://arxiv.org/abs/2602.00478
Pdf URL: https://arxiv.org/pdf/2602.00478
Copy Paste: [[2602.00478]] Quality-Diversity Optimization as Multi-Objective Optimization(https://arxiv.org/abs/2602.00478)
Keywords: generation
Abstract: The Quality-Diversity (QD) optimization aims to discover a collection of high-performing solutions that simultaneously exhibit diverse behaviors within a user-defined behavior space. This paradigm has stimulated significant research interest and demonstrated practical utility in domains including robot control, creative design, and adversarial sample generation. A variety of QD algorithms with distinct design principles have been proposed in recent years. Instead of proposing a new QD algorithm, this work introduces a novel reformulation by casting the QD optimization as a multi-objective optimization (MOO) problem with a huge number of optimization objectives. By establishing this connection, we enable the direct adoption of well-established MOO methods, particularly set-based scalarization techniques, to solve QD problems through a collaborative search process. We further provide a theoretical analysis demonstrating that our approach inherits theoretical guarantees from MOO while providing desirable properties for the QD optimization. Experimental studies across several QD applications confirm that our method achieves performance competitive with state-of-the-art QD algorithms.

Title: OD-DEAL: Dynamic Expert-Guided Adversarial Learning with Online Decomposition for Scalable Capacitated Vehicle Routing

Authors: Dongbin Jiao, Zisheng Chen, Xianyi Wang, Jintao Shi, Shengcai Liu, Shi Yan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2602.00488
Pdf URL: https://arxiv.org/pdf/2602.00488
Copy Paste: [[2602.00488]] OD-DEAL: Dynamic Expert-Guided Adversarial Learning with Online Decomposition for Scalable Capacitated Vehicle Routing(https://arxiv.org/abs/2602.00488)
Keywords: generative
Abstract: Solving large-scale capacitated vehicle routing problems (CVRP) is hindered by the high complexity of heuristics and the limited generalization of neural solvers on massive graphs. We propose OD-DEAL, an adversarial learning framework that tightly integrates hybrid genetic search (HGS) and online barycenter clustering (BCC) decomposition, and leverages high-fidelity knowledge distillation to transfer expert heuristic behavior. OD-DEAL trains a graph attention network (GAT)-based generative policy through a minimax game, in which divide-and-conquer strategies from a hybrid expert are distilled into dense surrogate rewards. This enables high-quality, clustering-free inference on large-scale instances. Empirical results demonstrate that OD-DEAL achieves state-of-the-art (SOTA) real-time CVRP performance, solving 10000-node instances with near-constant neural scaling. This uniquely enables the sub-second, heuristic-quality inference required for dynamic large-scale deployment.

Title: Sparse Shortcuts: Facilitating Efficient Fusion in Multimodal Large Language Models

Authors: Jingrui Zhang, Feng Liang, Yong Zhang, Wei Wang, Runhao Zeng, Xiping Hu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2602.00505
Pdf URL: https://arxiv.org/pdf/2602.00505
Copy Paste: [[2602.00505]] Sparse Shortcuts: Facilitating Efficient Fusion in Multimodal Large Language Models(https://arxiv.org/abs/2602.00505)
Keywords: generation
Abstract: With the remarkable success of large language models (LLMs) in natural language understanding and generation, multimodal large language models (MLLMs) have rapidly advanced in their ability to process data across multiple modalities. While most existing efforts focus on scaling up language models or constructing higher-quality training data, limited attention has been paid to effectively integrating cross-modal knowledge into the language space. In vision-language models, for instance, aligning modalities using only high-level visual features often discards the rich semantic information present in mid- and low-level features, limiting the model's ability of cross-modality understanding. To address this issue, we propose SparseCut, a general cross-modal fusion architecture for MLLMs, introducing sparse shortcut connections between the cross-modal encoder and the LLM. These shortcut connections enable the efficient and hierarchical integration of visual features at multiple levels, facilitating richer semantic fusion without increasing computational overhead. We further introduce an efficient multi-grained feature fusion module, which performs the fusion of visual features before routing them through the shortcuts. This preserves the original language context and does not increase the overall input length, thereby avoiding an increase in computational complexity for the LLM. Experiments demonstrate that SparseCut significantly enhances the performance of MLLMs across various multimodal benchmarks with generality and scalability for different base LLMs.

Title: DuoGen: Towards General Purpose Interleaved Multimodal Generation

Authors: Min Shi, Xiaohui Zeng, Jiannan Huang, Yin Cui, Francesco Ferroni, Jialuo Li, Shubham Pachori, Zhaoshuo Li, Yogesh Balaji, Haoxiang Wang, Tsung-Yi Lin, Xiao Fu, Yue Zhao, Chieh-Yun Chen, Ming-Yu Liu, Humphrey Shi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.00508
Pdf URL: https://arxiv.org/pdf/2602.00508
Copy Paste: [[2602.00508]] DuoGen: Towards General Purpose Interleaved Multimodal Generation(https://arxiv.org/abs/2602.00508)
Keywords: generation
Abstract: Interleaved multimodal generation enables capabilities beyond unimodal generation models, such as step-by-step instructional guides, visual planning, and generating visual drafts for reasoning. However, the quality of existing interleaved generation models under general instructions remains limited by insufficient training data and base model capacity. We present DuoGen, a general-purpose interleaved generation framework that systematically addresses data curation, architecture design, and evaluation. On the data side, we build a large-scale, high-quality instruction-tuning dataset by combining multimodal conversations rewritten from curated raw websites, and diverse synthetic examples covering everyday scenarios. Architecturally, DuoGen leverages the strong visual understanding of a pretrained multimodal LLM and the visual generation capabilities of a diffusion transformer (DiT) pretrained on video generation, avoiding costly unimodal pretraining and enabling flexible base model selection. A two-stage decoupled strategy first instruction-tunes the MLLM, then aligns DiT with it using curated interleaved image-text sequences. Across public and newly proposed benchmarks, DuoGen outperforms prior open-source models in text quality, image fidelity, and image-context alignment, and also achieves state-of-the-art performance on text-to-image and image editing among unified generation models. Data and code will be released at this https URL.

Title: SAGE: Accelerating Vision-Language Models via Entropy-Guided Adaptive Speculative Decoding

Authors: Yujia Tong, Tian Zhang, Yunyang Wan, Kaiwei Lin, Jingling Yuan, Chuang Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.00523
Pdf URL: https://arxiv.org/pdf/2602.00523
Copy Paste: [[2602.00523]] SAGE: Accelerating Vision-Language Models via Entropy-Guided Adaptive Speculative Decoding(https://arxiv.org/abs/2602.00523)
Keywords: generation
Abstract: Speculative decoding has emerged as a promising approach to accelerate inference in vision-language models (VLMs) by enabling parallel verification of multiple draft tokens. However, existing methods rely on static tree structures that remain fixed throughout the decoding process, failing to adapt to the varying prediction difficulty across generation steps. This leads to suboptimal acceptance lengths and limited speedup. In this paper, we propose SAGE, a novel framework that dynamically adjusts the speculation tree structure based on real-time prediction uncertainty. Our key insight is that output entropy serves as a natural confidence indicator with strong temporal correlation across decoding steps. SAGE constructs deeper-narrower trees for high-confidence predictions to maximize speculation depth, and shallower-wider trees for uncertain predictions to diversify exploration. SAGE improves acceptance lengths and achieves faster acceleration compared to static tree baselines. Experiments on multiple benchmarks demonstrate the effectiveness of SAGE: without any loss in output quality, it delivers up to $3.36\times$ decoding speedup for LLaVA-OneVision-72B and $3.18\times$ for Qwen2.5-VL-72B.
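
The entropy-to-tree-shape idea in this abstract can be sketched as follows. This is a minimal illustration, not the SAGE implementation: the entropy thresholds and the (depth, width) pairs are placeholder assumptions, and `choose_tree_shape` is a hypothetical helper name.

```python
import math


def shannon_entropy(probs):
    """Entropy in nats of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def choose_tree_shape(probs, low=0.5, high=2.0):
    """Map prediction uncertainty to a (depth, width) speculation tree shape.

    The thresholds and shapes are illustrative placeholders, not paper values.
    """
    h = shannon_entropy(probs)
    if h < low:
        return (8, 2)   # confident step: deeper, narrower tree
    if h > high:
        return (3, 8)   # uncertain step: shallower, wider tree
    return (5, 4)       # in between: balanced tree


confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.1] * 10

confident_shape = choose_tree_shape(confident)
uncertain_shape = choose_tree_shape(uncertain)
```

A peaked distribution yields the deep, narrow shape, and a near-uniform one yields the shallow, wide shape, mirroring the trade-off the abstract describes.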

Title: Physiology as Language: Translating Respiration to Sleep EEG

Authors: Kaiwen Zha, Chao Li, Hao He, Peng Cao, Tianhong Li, Ali Mirzazadeh, Ellen Zhang, Jong Woo Lee, Yoon Kim, Dina Katabi
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2602.00526
Pdf URL: https://arxiv.org/pdf/2602.00526
Copy Paste: [[2602.00526]] Physiology as Language: Translating Respiration to Sleep EEG(https://arxiv.org/abs/2602.00526)
Keywords: generative
Abstract: This paper introduces a novel cross-physiology translation task: synthesizing sleep electroencephalography (EEG) from respiration signals. To address the significant complexity gap between the two modalities, we propose a waveform-conditional generative framework that preserves fine-grained respiratory dynamics while constraining the EEG target space through discrete tokenization. Trained on over 28,000 individuals, our model achieves a 7% Mean Absolute Error in EEG spectrogram reconstruction. Beyond reconstruction, the synthesized EEG supports downstream tasks with performance comparable to ground truth EEG on age estimation (MAE 5.0 vs. 5.1 years), sex detection (AUROC 0.81 vs. 0.82), and sleep staging (Accuracy 0.84 vs. 0.88), significantly outperforming baselines trained directly on breathing. Finally, we demonstrate that the framework generalizes to contactless sensing by synthesizing EEG from wireless radio-frequency reflections, highlighting the feasibility of remote, non-contact neurological assessment during sleep.

Title: Convergent World Representations and Divergent Tasks

Authors: Core Francisco Park
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2602.00533
Pdf URL: https://arxiv.org/pdf/2602.00533
Copy Paste: [[2602.00533]] Convergent World Representations and Divergent Tasks(https://arxiv.org/abs/2602.00533)
Keywords: generation
Abstract: While neural representations are central to modern deep learning, the conditions governing their geometry and their roles in downstream adaptability remain poorly understood. We develop a framework clearly separating the underlying world, the data generation process and the resulting model representations to study these questions in a controlled setup. 5,075 city coordinates define the world and 7 geometric tasks generate the training data for autoregressive training. We find that different tasks give rise to qualitatively and quantitatively distinct world representation geometries. However, multi-task training drives convergence of world representations: models trained on non-overlapping tasks develop aligned geometric representations, providing controlled evidence for the Multitask Scaling Hypothesis of the Platonic Representation Hypothesis. To study adaptation, we pretrain models on all tasks, then test whether new entities (cities) can be consistently integrated into the representation space via fine-tuning. Surprisingly, we find that despite multi-task pretraining, some tasks, which we call divergent, actively harm the representational integration of new entities and harm generalization. Our results show that training on multiple relational tasks reliably produces convergent world representations, but lurking divergent tasks can catastrophically harm new entity integration via fine-tuning.

Title: SADER: Structure-Aware Diffusion Framework with DEterministic Resampling for Multi-Temporal Remote Sensing Cloud Removal

Authors: Yifan Zhang, Qian Chen, Yi Liu, Wengen Li, Jihong Guan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.00536
Pdf URL: https://arxiv.org/pdf/2602.00536
Copy Paste: [[2602.00536]] SADER: Structure-Aware Diffusion Framework with DEterministic Resampling for Multi-Temporal Remote Sensing Cloud Removal(https://arxiv.org/abs/2602.00536)
Keywords: generative
Abstract: Cloud contamination severely degrades the usability of remote sensing imagery and poses a fundamental challenge for downstream Earth observation tasks. Recently, diffusion-based models have emerged as a dominant paradigm for remote sensing cloud removal due to their strong generative capability and stable optimization. However, existing diffusion-based approaches often suffer from limited sampling efficiency and insufficient exploitation of structural and temporal priors in multi-temporal remote sensing scenarios. In this work, we propose SADER, a structure-aware diffusion framework for multi-temporal remote sensing cloud removal. SADER first develops a scalable Multi-Temporal Conditional Diffusion Network (MTCDN) to fully capture multi-temporal and multimodal correlations via temporal fusion and hybrid attention. Then, a cloud-aware attention loss is introduced to emphasize cloud-dominated regions by accounting for cloud thickness and brightness discrepancies. In addition, a deterministic resampling strategy is designed for continuous diffusion models to iteratively refine samples under fixed sampling steps by replacing outliers through guided correction. Extensive experiments on multiple multi-temporal datasets demonstrate that SADER consistently outperforms state-of-the-art cloud removal methods across all evaluation metrics. The code of SADER is publicly available at this https URL.

Title: GLAD: Generative Language-Assisted Visual Tracking for Low-Semantic Templates

Authors: Xingyu Luo, Yidong Cai, Jie Liu, Jie Tang, Gangshan Wu, Limin Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.00570
Pdf URL: https://arxiv.org/pdf/2602.00570
Copy Paste: [[2602.00570]] GLAD: Generative Language-Assisted Visual Tracking for Low-Semantic Templates(https://arxiv.org/abs/2602.00570)
Keywords: generative
Abstract: Vision-language tracking has gained increasing attention in many scenarios. This task simultaneously deals with visual and linguistic information to localize objects in videos. Despite its growing utility, the development of vision-language tracking methods remains in its early stage. Current vision-language trackers usually employ Transformer architectures for interactive integration of template, search, and text features. However, persistent challenges with low-semantic images, such as prevalent blurriness and low resolution, may compromise model performance through degraded cross-modal understanding. To solve this problem, language assistance is usually used to deal with the obstacles posed by low-semantic images. However, due to the existing gap between current textual and visual features, direct concatenation and fusion of these features may have limited effectiveness. To address these challenges, we introduce a pioneering Generative Language-AssisteD tracking model, GLAD, which utilizes diffusion models for the generative multi-modal fusion of text description and template image to bolster compatibility between language and image and enhance template image semantic information. Our approach demonstrates notable improvements over the existing fusion paradigms. Blurry and semantically ambiguous template images can be restored to improve multi-modal features in the generative fusion paradigm. Experiments show that our method establishes a new state-of-the-art on multiple benchmarks and achieves an impressive inference speed. The code and models will be released at: this https URL

Title: Bridging Degradation Discrimination and Generation for Universal Image Restoration

Authors: JiaKui Hu, Zhengjian Yao, Lujia Jin, Yanye Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.00579
Pdf URL: https://arxiv.org/pdf/2602.00579
Copy Paste: [[2602.00579]] Bridging Degradation Discrimination and Generation for Universal Image Restoration(https://arxiv.org/abs/2602.00579)
Keywords: restoration, super-resolution, generation
Abstract: Universal image restoration is a critical task in low-level vision, requiring the model to remove various degradations from low-quality images to produce clean images with rich detail. The challenges lie in sampling the distribution of high-quality images and adjusting the outputs on the basis of the degradation. This paper presents a novel approach, Bridging Degradation discrimination and Generation (BDG), which aims to address these challenges concurrently. First, we propose the Multi-Angle and multi-Scale Gray Level Co-occurrence Matrix (MAS-GLCM) and demonstrate its effectiveness in performing fine-grained discrimination of degradation types and levels. Subsequently, we divide the diffusion training process into three distinct stages: generation, bridging, and restoration. The objective is to preserve the diffusion model's capability of restoring rich textures while simultaneously integrating the discriminative information from the MAS-GLCM into the restoration process. This enhances its proficiency in addressing multi-task and multi-degraded scenarios. Without changing the architecture, BDG achieves significant performance gains in all-in-one restoration and real-world super-resolution tasks, primarily evidenced by substantial improvements in fidelity without compromising perceptual quality. The code and pretrained models are provided in this https URL.
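
The abstract does not spell out how MAS-GLCM is built, so the sketch below shows only the standard gray-level co-occurrence matrix, computed over several angles and pixel distances. The function names, the choice of four directions, and the example distances are assumptions made for illustration, not the authors' construction.

```python
def glcm(image, dx, dy, levels):
    """Count how often gray level i co-occurs with gray level j at offset (dx, dy)."""
    rows, cols = len(image), len(image[0])
    matrix = [[0] * levels for _ in range(levels)]
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                matrix[image[r][c]][image[r2][c2]] += 1
    return matrix


def multi_angle_multi_scale_glcm(image, levels, distances=(1, 2)):
    """Collect GLCMs for 0, 45, 90, and 135 degrees at each pixel distance."""
    # Unit offsets as (dx, dy); rows grow downward, so dy = -1 points up.
    directions = [(1, 0), (1, -1), (0, -1), (-1, -1)]
    return {
        (d, (ux, uy)): glcm(image, ux * d, uy * d, levels)
        for d in distances
        for (ux, uy) in directions
    }


# Small 4-level test image.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]

horizontal = glcm(image, 1, 0, levels=4)
features = multi_angle_multi_scale_glcm(image, levels=4)
```

Here `horizontal` counts left-to-right neighbor pairs only (it is not symmetrized), and `features` holds one matrix per (distance, direction) pair, which a downstream classifier could consume as texture statistics.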

Title: MAUGen: A Unified Diffusion Approach for Multi-Identity Facial Expression and AU Label Generation

Authors: Xiangdong Li, Ye Lou, Ao Gao, Wei Zhang, Siyang Song
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2602.00583
Pdf URL: https://arxiv.org/pdf/2602.00583
Copy Paste: [[2602.00583]] MAUGen: A Unified Diffusion Approach for Multi-Identity Facial Expression and AU Label Generation(https://arxiv.org/abs/2602.00583)
Keywords: generation
Abstract: The lack of large-scale, demographically diverse face images with precise Action Unit (AU) occurrence and intensity annotations has long been recognized as a fundamental bottleneck in developing generalizable AU recognition systems. In this paper, we propose MAUGen, a diffusion-based multi-modal framework that jointly generates a large collection of photorealistic facial expressions and anatomically consistent AU labels, including both occurrence and intensity, conditioned on a single descriptive text prompt. Our MAUGen involves two key modules: (1) a Multi-modal Representation Learning (MRL) module that captures the relationships among the paired textual description, facial identity, expression image, and AU activations within a unified latent space; and (2) a Diffusion-based Image label Generator (DIG) that decodes the joint representation into aligned facial image-label pairs across diverse identities. Under this framework, we introduce Multi-Identity Facial Action (MIFA), a large-scale multimodal synthetic dataset featuring comprehensive AU annotations and identity variations. Extensive experiments demonstrate that MAUGen outperforms existing methods in synthesizing photorealistic, demographically diverse facial images along with semantically aligned AU labels.

Title: From Pixels to Facts (Pix2Fact): Benchmarking Multi-Hop Reasoning for Fine-Grained Visual Fact Checking

Authors: Yifan Jiang, Cong Zhang, Bofei Zhang, Yifan Yang, Bingzhang Wang, Yew-Soon Ong
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2602.00593
Pdf URL: https://arxiv.org/pdf/2602.00593
Copy Paste: [[2602.00593]] From Pixels to Facts (Pix2Fact): Benchmarking Multi-Hop Reasoning for Fine-Grained Visual Fact Checking(https://arxiv.org/abs/2602.00593)
Keywords: generation
Abstract: Despite progress on general tasks, VLMs struggle with challenges demanding both detailed visual grounding and deliberate knowledge-based reasoning, a synergy not captured by existing benchmarks that evaluate these skills separately. To close this gap, we introduce Pix2Fact, a new visual question-answering benchmark designed to evaluate expert-level perception and knowledge-intensive multi-hop reasoning. Pix2Fact contains 1,000 high-resolution (4K+) images spanning 8 daily-life scenarios and situations, with questions and answers meticulously crafted by annotators holding PhDs from top global universities working in partnership with a professional data annotation firm. Each question requires detailed visual grounding, multi-hop reasoning, and the integration of external knowledge to answer. Our evaluation of 9 state-of-the-art VLMs, including proprietary models like Gemini-3-Pro and GPT-5, reveals the substantial challenge posed by Pix2Fact: the most advanced model achieves only 24.0% average accuracy, in stark contrast to human performance of 56%. This significant gap underscores the limitations of current models in replicating human-level visual comprehension. We believe Pix2Fact will serve as a critical benchmark to drive the development of next-generation multimodal agents that combine fine-grained perception with robust, knowledge-based reasoning.

Title: Towards Interpretable Hallucination Analysis and Mitigation in LVLMs via Contrastive Neuron Steering

Authors: Guangtao Lyu, Xinyi Cheng, Qi Liu, Chenghao Xu, Jiexi Yan, Muli Yang, Fen Fang, Cheng Deng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.00621
Pdf URL: https://arxiv.org/pdf/2602.00621
Copy Paste: [[2602.00621]] Towards Interpretable Hallucination Analysis and Mitigation in LVLMs via Contrastive Neuron Steering(https://arxiv.org/abs/2602.00621)
Keywords: generation
Abstract: LVLMs achieve remarkable multimodal understanding and generation but remain susceptible to hallucinations. Existing mitigation methods predominantly focus on output-level adjustments, leaving the internal mechanisms that give rise to these hallucinations largely unexplored. To gain a deeper understanding, we adopt a representation-level perspective by introducing sparse autoencoders (SAEs) to decompose dense visual embeddings into sparse, interpretable neurons. Through neuron-level analysis, we identify distinct neuron types, including always-on neurons and image-specific neurons. Our findings reveal that hallucinations often result from disruptions or spurious activations of image-specific neurons, while always-on neurons remain largely stable. Moreover, selectively enhancing or suppressing image-specific neurons enables controllable intervention in LVLM outputs, improving visual grounding and reducing hallucinations. Building on these insights, we propose Contrastive Neuron Steering (CNS), which identifies image-specific neurons via contrastive analysis between clean and noisy inputs. CNS selectively amplifies informative neurons while suppressing perturbation-induced activations, producing more robust and semantically grounded visual representations. This not only enhances visual understanding but also effectively mitigates hallucinations. By operating at the prefilling stage, CNS is fully compatible with existing decoding-stage methods. Extensive experiments on both hallucination-focused and general multimodal benchmarks demonstrate that CNS consistently reduces hallucinations while preserving overall multimodal understanding.

Title: FaceSnap: Enhanced ID-fidelity Network for Tuning-free Portrait Customization

Authors: Benxiang Zhai, Yifang Xu, Guofeng Zhang, Yang Li, Sidan Du
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.00627
Pdf URL: https://arxiv.org/pdf/2602.00627
Copy Paste: [[2602.00627]] FaceSnap: Enhanced ID-fidelity Network for Tuning-free Portrait Customization(https://arxiv.org/abs/2602.00627)
Keywords: generation
Abstract: Benefiting from the significant advancements in text-to-image diffusion models, research in personalized image generation, particularly customized portrait generation, has also made great strides recently. However, existing methods either require time-consuming fine-tuning and lack generalizability or fail to achieve high fidelity in facial details. To address these issues, we propose FaceSnap, a novel method based on Stable Diffusion (SD) that requires only a single reference image and produces extremely consistent results in a single inference stage. This method is plug-and-play and can be easily extended to different SD models. Specifically, we design a new Facial Attribute Mixer that can extract comprehensive fused information from both low-level specific features and high-level abstract features, providing better guidance for image generation. We also introduce a Landmark Predictor that maintains reference identity across landmarks with different poses, providing diverse yet detailed spatial control conditions for image generation. Then we use an ID-preserving module to inject these into the UNet. Experimental results demonstrate that our approach performs remarkably in personalized and customized portrait generation, surpassing other state-of-the-art methods in this domain.

Title: S$^3$POT: Contrast-Driven Face Occlusion Segmentation via Self-Supervised Prompt Learning

Authors: Lingsong Wang, Mancheng Meng, Ziyan Wu, Terrence Chen, Fan Yang, Dinggang Shen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2602.00635
Pdf URL: https://arxiv.org/pdf/2602.00635
Copy Paste: [[2602.00635]] S$^3$POT: Contrast-Driven Face Occlusion Segmentation via Self-Supervised Prompt Learning(https://arxiv.org/abs/2602.00635)
Keywords: generation
Abstract: Existing face parsing methods usually misclassify occlusions as facial components. This is because occlusion is a high-level concept that does not refer to a concrete category of object. Thus, constructing a real-world face dataset covering all categories of occluding objects is almost impossible, and accurate mask annotation is labor-intensive. To deal with these problems, we present S$^3$POT, a contrast-driven framework synergizing face generation with self-supervised spatial prompting, to achieve occlusion segmentation. The framework is inspired by two insights: 1) modern face generators' ability to realistically reconstruct occluded regions, creating an image that preserves facial geometry while eliminating occlusion, and 2) foundation segmentation models' (e.g., SAM) capacity to extract precise masks when provided with appropriate prompts. In particular, S$^3$POT consists of three modules: Reference Generation (RF), Feature Enhancement (FE), and Prompt Selection (PS). First, a reference image is produced by RF using structural guidance from the parsed mask. Second, FE contrasts tokens between the raw and reference images to obtain an initial prompt, then modifies image features with the prompt by cross-attention. Third, based on the enhanced features, PS constructs a set of positive and negative prompts and screens them with a self-attention network for a mask decoder. The network is learned under the guidance of three novel and complementary objective functions, with no occlusion ground-truth masks involved. Extensive experiments on a dedicatedly collected dataset demonstrate S$^3$POT's superior performance and the effectiveness of each module.

Title: VIZOR: Viewpoint-Invariant Zero-Shot Scene Graph Generation for 3D Scene Reasoning

Authors: Vivek Madhavaram, Vartika Sengar, Arkadipta De, Charu Sharma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.00637
Pdf URL: https://arxiv.org/pdf/2602.00637
Copy Paste: [[2602.00637]] VIZOR: Viewpoint-Invariant Zero-Shot Scene Graph Generation for 3D Scene Reasoning(https://arxiv.org/abs/2602.00637)
Keywords: generation
Abstract: Scene understanding and reasoning has been a fundamental problem in 3D computer vision, requiring models to identify objects, their properties, and spatial or comparative relationships among the objects. Existing approaches enable this by creating scene graphs using multiple inputs such as 2D images, depth maps, object labels, and annotated relationships from a specific reference view. However, these methods often struggle with generalization and produce inaccurate spatial relationships like "left/right", which become inconsistent across different viewpoints. To address these limitations, we propose Viewpoint-Invariant Zero-shot scene graph generation for 3D scene Reasoning (VIZOR). VIZOR is a training-free, end-to-end framework that constructs dense, viewpoint-invariant 3D scene graphs directly from raw 3D scenes. The generated scene graph is unambiguous, as spatial relationships are defined relative to each object's front-facing direction, making them consistent regardless of the reference view. Furthermore, it infers open-vocabulary relationships that describe spatial and proximity relationships among scene objects without requiring annotated training data. We conduct extensive quantitative and qualitative evaluations to assess the effectiveness of VIZOR in scene graph generation and downstream tasks, such as query-based object grounding. VIZOR outperforms state-of-the-art methods, showing clear improvements in scene graph generation and achieving 22% and 4.81% gains in zero-shot grounding accuracy on the Replica and Nr3D datasets, respectively.
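The idea of defining spatial relations relative to each object's front-facing direction can be illustrated with a minimal 2D sketch. This is not the VIZOR implementation; the function name `relation_in_object_frame`, the ground-plane simplification, and the four relation labels are assumptions made only for illustration.

```python
import math

def relation_in_object_frame(anchor_pos, anchor_front, target_pos):
    """Describe where the target lies relative to the anchor's own front direction.

    Hypothetical helper: works on the 2D ground plane and ignores the camera,
    so the answer does not change when the scene is viewed from elsewhere.
    """
    fx, fy = anchor_front
    norm = math.hypot(fx, fy)
    if norm == 0.0:
        raise ValueError("anchor_front must be a non-zero vector")
    fx, fy = fx / norm, fy / norm
    dx = target_pos[0] - anchor_pos[0]
    dy = target_pos[1] - anchor_pos[1]
    forward = dx * fx + dy * fy   # projection onto the anchor's front axis
    lateral = fx * dy - fy * dx   # 2D cross product: positive means to the anchor's left
    if abs(forward) >= abs(lateral):
        return "in front of" if forward > 0 else "behind"
    return "left of" if lateral > 0 else "right of"

# An anchor facing +y sees a target at (-1, 0) on its left.
print(relation_in_object_frame((0, 0), (0, 1), (-1, 0)))
# Rotating the whole scene by 90 degrees leaves the relation unchanged.
print(relation_in_object_frame((0, 0), (-1, 0), (0, -1)))
```

Because the relation depends only on the anchor's pose and the target's position, rotating the entire scene (or moving the observer) yields the same label, which is the property the abstract describes.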

Title: Riemannian Flow Matching for Disentangled Graph Domain Adaptation

Authors: Yingxu Wang, Xinwang Liu, Mengzhu Wang, Siyang Gao, Nan Yin
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2602.00656
Pdf URL: https://arxiv.org/pdf/2602.00656
Copy Paste: [[2602.00656]] Riemannian Flow Matching for Disentangled Graph Domain Adaptation(https://arxiv.org/abs/2602.00656)
Keywords: generation
Abstract: Graph Domain Adaptation (GDA) typically uses adversarial learning to align graph embeddings in Euclidean space. However, this paradigm suffers from two critical challenges: Structural Degeneration, where hierarchical and semantic representations are entangled, and Optimization Instability, which arises from oscillatory dynamics of minimax adversarial training. To tackle these issues, we propose DisRFM, a geometry-aware GDA framework that unifies Riemannian embedding and flow-based transport. First, to overcome structural degeneration, we embed graphs into a Riemannian manifold. By adopting polar coordinates, we explicitly disentangle structure (radius) from semantics (angle). Then, we enforce topology preservation through radial Wasserstein alignment and semantic discrimination via angular clustering, thereby preventing feature entanglement and collapse. Second, we address the instability of adversarial alignment by using Riemannian flow matching. This method learns a smooth vector field to guide source features toward the target along geodesic paths, guaranteeing stable convergence. The geometric constraints further guide the flow to maintain the disentangled structure during transport. Theoretically, we prove the asymptotic stability of the flow matching and derive a tighter bound for the target risk. Extensive experiments demonstrate that DisRFM consistently outperforms state-of-the-art methods.
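The radius/angle split mentioned in the abstract can be pictured with a small sketch. It uses a plain Euclidean vector rather than the Riemannian manifold the paper works on, and the helper names `to_polar` and `angular_similarity` are hypothetical; it only shows how magnitude and direction can be separated so that each can be aligned on its own.

```python
import math

def to_polar(embedding):
    """Split a vector into (radius, unit direction). Illustrative only."""
    radius = math.sqrt(sum(x * x for x in embedding))
    if radius == 0.0:
        raise ValueError("zero vector has no direction")
    direction = [x / radius for x in embedding]
    return radius, direction

def angular_similarity(a, b):
    """Cosine of the angle between two embeddings; independent of their radii."""
    _, da = to_polar(a)
    _, db = to_polar(b)
    return sum(x * y for x, y in zip(da, db))

radius, direction = to_polar([3.0, 4.0])
print(radius, direction)
# Same direction, different radius: the angular part is identical.
print(angular_similarity([3.0, 4.0], [6.0, 8.0]))
```

In this picture, a radial alignment term would act only on `radius` and an angular clustering term only on `direction`, which is one way to read the disentanglement the authors describe.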

Title: Improving Neuropathological Reconstruction Fidelity via AI Slice Imputation

Authors: Marina Crespo Aguirre, Jonathan Williams-Ramirez, Dina Zemlyanker, Xiaoling Hu, Lucas J. Deden-Binder, Rogeny Herisse, Mark Montine, Theresa R. Connors, Christopher Mount, Christine L. MacDonald, C. Dirk Keene, Caitlin S. Latimer, Derek H. Oakley, Bradley T. Hyman, Ana Lawry Aguila, Juan Eugenio Iglesias
Subjects: cs.CV, cs.AI, physics.med-ph
Abstract URL: https://arxiv.org/abs/2602.00669
Pdf URL: https://arxiv.org/pdf/2602.00669
Copy Paste: [[2602.00669]] Improving Neuropathological Reconstruction Fidelity via AI Slice Imputation(https://arxiv.org/abs/2602.00669)
Keywords: super-resolution
Abstract: Neuropathological analyses benefit from spatially precise volumetric reconstructions that enhance anatomical delineation and improve morphometric accuracy. Our prior work has shown the feasibility of reconstructing 3D brain volumes from 2D dissection photographs. However, these outputs sometimes exhibit coarse, overly smooth reconstructions of structures, especially under high anisotropy (i.e., reconstructions from thick slabs). Here, we introduce a computationally efficient super-resolution step that imputes slices to generate anatomically consistent isotropic volumes from anisotropic 3D reconstructions of dissection photographs. By training on domain-randomized synthetic data, we ensure that our method generalizes across dissection protocols and remains robust to large slab thicknesses. The imputed volumes yield improved automated segmentations, achieving higher Dice scores, particularly in cortical and white matter regions. Validation on surface reconstruction and atlas registration tasks demonstrates more accurate cortical surfaces and MRI registration. By enhancing the resolution and anatomical fidelity of photograph-based reconstructions, our approach strengthens the bridge between neuropathology and neuroimaging. Our method is publicly available at this https URL.

Title: LocalV: Exploiting Information Locality for IP-level Verilog Generation

Authors: Hanqi Lyu, Di Huang, Yaoyu Zhu, Kangcheng Liu, Bohan Dou, Chongxiao Li, Pengwei Jin, Shuyao Cheng, Rui Zhang, Zidong Du, Qi Guo, Xing Hu, Yunji Chen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2602.00704
Pdf URL: https://arxiv.org/pdf/2602.00704
Copy Paste: [[2602.00704]] LocalV: Exploiting Information Locality for IP-level Verilog Generation(https://arxiv.org/abs/2602.00704)
Keywords: generation
Abstract: The generation of Register-Transfer Level (RTL) code is a crucial yet labor-intensive step in digital hardware design, traditionally requiring engineers to manually translate complex specifications into thousands of lines of synthesizable Hardware Description Language (HDL) code. While Large Language Models (LLMs) have shown promise in automating this process, existing approaches, including fine-tuned domain-specific models and advanced agent-based systems, struggle to scale to industrial IP-level design tasks. We identify three key challenges: (1) handling long, highly detailed documents, where critical interface constraints become buried in unrelated submodule descriptions; (2) generating long RTL code, where both syntactic and semantic correctness degrade sharply with increasing output length; and (3) navigating the complex debugging cycles required for functional verification through simulation and waveform analysis. To overcome these challenges, we propose LocalV, a multi-agent framework that leverages information locality in modular hardware design. LocalV decomposes the long-document to long-code generation problem into a set of short-document, short-code tasks, enabling scalable generation and debugging. Specifically, LocalV integrates hierarchical document partitioning, task planning, localized code generation, interface-consistent merging, and AST-guided locality-aware debugging. Experiments on RealBench, an IP-level Verilog generation benchmark, demonstrate that LocalV substantially outperforms state-of-the-art (SOTA) LLMs and agents, achieving a pass rate of 45.0% compared to 21.6%.

Title: Supervised makeup transfer with a curated dataset: Decoupling identity and makeup features for enhanced transformation

Authors: Qihe Pan, Yiming Wu, Xing Zhao, Liang Xie, Guodao Sun, Ronghua Liang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.00729
Pdf URL: https://arxiv.org/pdf/2602.00729
Copy Paste: [[2602.00729]] Supervised makeup transfer with a curated dataset: Decoupling identity and makeup features for enhanced transformation(https://arxiv.org/abs/2602.00729)
Keywords: generative
Abstract: Diffusion models have recently shown strong progress in generative tasks, offering a more stable alternative to GAN-based approaches for makeup transfer. Existing methods often suffer from limited datasets, poor disentanglement between identity and makeup features, and weak controllability. To address these issues, we make three contributions. First, we construct a curated high-quality dataset using a train-generate-filter-retrain strategy that combines synthetic, realistic, and filtered samples to improve diversity and fidelity. Second, we design a diffusion-based framework that disentangles identity and makeup features, ensuring facial structure and skin tone are preserved while applying accurate and diverse cosmetic styles. Third, we propose a text-guided mechanism that allows fine-grained and region-specific control, enabling users to modify eyes, lips, or face makeup with natural language prompts. Experiments on benchmarks and real-world scenarios demonstrate improvements in fidelity, identity preservation, and flexibility. Examples of our dataset can be found at: this https URL.

Title: HSI-VAR: Rethinking Hyperspectral Restoration through Spatial-Spectral Visual Autoregression

Authors: Xiangming Wang, Benteng Sun, Yungeng Liu, Haijin Zeng, Yongyong Chen, Jingyong Su, Jie Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.00749
Pdf URL: https://arxiv.org/pdf/2602.00749
Copy Paste: [[2602.00749]] HSI-VAR: Rethinking Hyperspectral Restoration through Spatial-Spectral Visual Autoregression(https://arxiv.org/abs/2602.00749)
Keywords: restoration, generation, generative
Abstract: Hyperspectral images (HSIs) capture richer spatial-spectral information beyond RGB, yet real-world HSIs often suffer from a composite mix of degradations, such as noise, blur, and missing bands. Existing generative approaches for HSI restoration, such as diffusion models, require hundreds of iterative steps, making them computationally impractical for high-dimensional HSIs. Regression models, in contrast, tend to produce oversmoothed results, failing to preserve critical structural details. We break this impasse by introducing HSI-VAR, rethinking HSI restoration as an autoregressive generation problem, where spectral and spatial dependencies can be progressively modeled rather than globally reconstructed. HSI-VAR incorporates three key innovations: (1) Latent-condition alignment, which couples semantic consistency between latent priors and conditional embeddings for precise reconstruction; (2) Degradation-aware guidance, which uniquely encodes mixed degradations as linear combinations in the embedding space for automatic control, remarkably achieving a nearly 50% reduction in computational cost at inference; (3) A spatial-spectral adaptation module that refines details across both domains in the decoding phase. Extensive experiments on nine all-in-one HSI restoration benchmarks confirm HSI-VAR's state-of-the-art performance, achieving a 3.77 dB PSNR improvement on ICVL and offering superior structure preservation with an inference speed-up of up to 95.5$\times$ compared with diffusion-based methods, making it a highly practical solution for real-world HSI restoration.

Title: Latent Shadows: The Gaussian-Discrete Duality in Masked Diffusion

Authors: Guinan Chen, Xunpeng Huang, Ying Sun, Shijin Wang, Yanyong Zhang, Chao Wang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2602.00792
Pdf URL: https://arxiv.org/pdf/2602.00792
Copy Paste: [[2602.00792]] Latent Shadows: The Gaussian-Discrete Duality in Masked Diffusion(https://arxiv.org/abs/2602.00792)
Keywords: generation
Abstract: Masked discrete diffusion is a dominant paradigm for high-quality language modeling where tokens are iteratively corrupted to a mask state, yet its inference efficiency is bottlenecked by the lack of deterministic sampling tools. While diffusion duality enables deterministic distillation for uniform models, these approaches generally underperform masked models and rely on complex integral operators. Conversely, in the masked domain, prior methods typically assume the absence of deterministic trajectories, forcing a reliance on stochastic distillation. To bridge this gap, we establish explicit Masked Diffusion Duality, proving that the masked process arises as the projection of a continuous Gaussian process via a novel maximum-value index preservation mechanism. Furthermore, we introduce Masked Consistency Distillation (MCD), a principled framework that leverages this duality to analytically construct the deterministic coupled trajectories required for consistency distillation, bypassing numerical ODE solvers. This result strictly improves upon prior stochastic distillation methods, achieving a 16$\times$ inference speedup without compromising generation quality. Our findings not only provide a solid theoretical foundation connecting masked and continuous diffusion, but also unlock the full potential of consistency distillation for high-performance discrete generation. Our code is available at this https URL.

Title: Edge-Native Generative De-identification: Inversion-Free Flow for Privacy-Preserving Federated Skin Image Analysis

Authors: Konstantinos Moutselos, Ilias Maglogiannis
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.00821
Pdf URL: https://arxiv.org/pdf/2602.00821
Copy Paste: [[2602.00821]] Edge-Native Generative De-identification: Inversion-Free Flow for Privacy-Preserving Federated Skin Image Analysis(https://arxiv.org/abs/2602.00821)
Keywords: generative
Abstract: The deployment of Federated Learning (FL) for clinical dermatology is hindered by the competing requirements of protecting patient privacy and preserving diagnostic features. Traditional de-identification methods often degrade pathological fidelity, while standard generative editing techniques rely on computationally intensive inversion processes unsuitable for resource-constrained edge devices. We propose a framework for identity-agnostic pathology preservation that serves as a client-side privacy-preserving utility. By leveraging inversion-free Rectified Flow Transformers (FlowEdit), the system performs high-fidelity identity transformation in near real-time (less than 20s), facilitating local deployment on clinical nodes. We introduce a "Segment-by-Synthesis" mechanism that generates counterfactual healthy and pathological twin pairs locally. This enables the extraction of differential erythema masks that are decoupled from biometric markers and semantic artifacts (e.g. jewelry). Pilot validation on high-resolution clinical samples demonstrates an Intersection over Union (IoU) stability greater than 0.67 across synthetic identities. By generating privacy-compliant synthetic surrogates at the edge, this framework mitigates the risk of gradient leakage at the source, providing a secure pathway for high-precision skin image analysis in federated environments.
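For readers unfamiliar with the Intersection over Union (IoU) figure quoted above, a minimal sketch of the metric on binary masks follows. The function name `mask_iou` and the nested-list mask format are assumptions; the paper's own evaluation code is not described in the abstract.

```python
def mask_iou(mask_a, mask_b):
    """Intersection over Union of two equally sized binary masks (lists of rows)."""
    intersection = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            if a and b:
                intersection += 1
            if a or b:
                union += 1
    # Two empty masks are treated as a perfect match (a convention, not from the paper).
    return intersection / union if union else 1.0

mask_1 = [[1, 1, 0],
          [0, 1, 0]]
mask_2 = [[1, 0, 0],
          [0, 1, 1]]
print(mask_iou(mask_1, mask_2))  # 2 shared pixels out of 4 covered pixels
```

An IoU stability above 0.67 across synthetic identities would mean that masks extracted from differently de-identified versions of the same lesion overlap by at least that fraction under this definition.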

Title: RMFlow: Refined Mean Flow by a Noise-Injection Step for Multimodal Generation

Authors: Yuhao Huang, Shih-Hsin Wang, Andrea L. Bertozzi, Bao Wang
Subjects: cs.LG, cs.AI, math.NA
Abstract URL: https://arxiv.org/abs/2602.00849
Pdf URL: https://arxiv.org/pdf/2602.00849
Copy Paste: [[2602.00849]] RMFlow: Refined Mean Flow by a Noise-Injection Step for Multimodal Generation(https://arxiv.org/abs/2602.00849)
Keywords: generation, generative
Abstract: Mean flow (MeanFlow) enables efficient, high-fidelity image generation, yet its single-function evaluation (1-NFE) generation often cannot yield compelling results. We address this issue by introducing RMFlow, an efficient multimodal generative model that integrates a coarse 1-NFE MeanFlow transport with a subsequent tailored noise-injection refinement step. RMFlow approximates the average velocity of the flow path using a neural network trained with a new loss function that balances minimizing the Wasserstein distance between probability paths and maximizing sample likelihood. RMFlow achieves near state-of-the-art results on text-to-image, context-to-molecule, and time-series generation using only 1-NFE, at a computational cost comparable to the baseline MeanFlows.

Title: Improving Flow Matching by Aligning Flow Divergence

Authors: Yuhao Huang, Taos Transue, Shih-Hsin Wang, William Feldman, Hong Zhang, Bao Wang
Subjects: cs.LG, cs.AI, math.NA
Abstract URL: https://arxiv.org/abs/2602.00869
Pdf URL: https://arxiv.org/pdf/2602.00869
Copy Paste: [[2602.00869]] Improving Flow Matching by Aligning Flow Divergence(https://arxiv.org/abs/2602.00869)
Keywords: generation, generative
Abstract: Conditional flow matching (CFM) stands out as an efficient, simulation-free approach for training flow-based generative models, achieving remarkable performance for data generation. However, CFM is insufficient to ensure accuracy in learning probability paths. In this paper, we introduce a new partial differential equation characterization for the error between the learned and exact probability paths, along with its solution. We show that the total variation gap between the two probability paths is bounded above by a combination of the CFM loss and an associated divergence loss. This theoretical insight leads to the design of a new objective function that simultaneously matches the flow and its divergence. Our new approach improves the performance of the flow-based generative model by a noticeable margin without sacrificing generation efficiency. We showcase the advantages of this enhanced training approach over CFM on several important benchmark tasks, including generative modeling for dynamical systems, DNA sequences, and videos. Code is available at this https URL (Utah-Math-Data-Science).
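As background for the CFM loss referred to above, here is a minimal sketch of the regression target in the commonly used linear-interpolation variant of conditional flow matching. The choice of a straight-line path, the function names, and the `velocity_fn` interface are assumptions; the divergence-matching term proposed in the paper is not specified in the abstract and is not sketched here.

```python
def cfm_training_pair(x0, x1, t):
    """Point on the straight path from noise x0 to data x1, and its target velocity.

    Assumes the linear path x_t = (1 - t) * x0 + t * x1, whose velocity is x1 - x0.
    """
    x_t = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    target = [b - a for a, b in zip(x0, x1)]
    return x_t, target

def cfm_loss(velocity_fn, x0, x1, t):
    """Squared error between the model's velocity at (x_t, t) and the target."""
    x_t, target = cfm_training_pair(x0, x1, t)
    predicted = velocity_fn(x_t, t)
    return sum((p - u) ** 2 for p, u in zip(predicted, target))

# A stand-in "model" that already outputs the correct velocity gives zero loss.
perfect = lambda x_t, t: [2.0, 4.0]
print(cfm_loss(perfect, [0.0, 0.0], [2.0, 4.0], 0.5))
```

In practice `velocity_fn` would be a neural network and the loss would be averaged over sampled `x0`, `x1`, and `t`; the paper's contribution, as described, is an additional term that also matches the divergence of the flow.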

Title: Dynamic Expert Sharing: Decoupling Memory from Parallelism in Mixture-of-Experts Diffusion LLMs

Authors: Hao Mark Chen, Zhiwen Mo, Royson Lee, Qianzhou Wang, Da Li, Shell Xu Hu, Wayne Luk, Timothy Hospedales, Hongxiang Fan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2602.00879
Pdf URL: https://arxiv.org/pdf/2602.00879
Copy Paste: [[2602.00879]] Dynamic Expert Sharing: Decoupling Memory from Parallelism in Mixture-of-Experts Diffusion LLMs(https://arxiv.org/abs/2602.00879)
Keywords: generation
Abstract: Among parallel decoding paradigms, diffusion large language models (dLLMs) have emerged as a promising candidate that balances generation quality and throughput. However, their integration with Mixture-of-Experts (MoE) architectures is constrained by an expert explosion: as the number of tokens generated in parallel increases, the number of distinct experts activated grows nearly linearly. This results in substantial memory traffic that pushes inference into a memory-bound regime, negating the efficiency gains of both MoE and parallel decoding. To address this challenge, we propose Dynamic Expert Sharing (DES), a novel technique that shifts MoE optimization from token-centric pruning and conventional expert skipping methods to sequence-level coreset selection. To maximize expert reuse, DES identifies a compact, high-utility set of experts to satisfy the requirements of an entire parallel decoding block. We introduce two innovative selection strategies: (1) Intra-Sequence Sharing (DES-Seq), which adapts optimal allocation to the sequence level, and (2) Saliency-Aware Voting (DES-Vote), a novel mechanism that allows tokens to collectively elect a coreset based on aggregated router weights. Extensive experiments on MoE dLLMs demonstrate that DES reduces unique expert activations by over 55% and latency by up to 38%, while retaining 99% of vanilla accuracy, effectively decoupling memory overhead from the degree of parallelism.
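The voting idea behind DES-Vote, as far as the abstract describes it, can be sketched as follows: router weights are aggregated over all tokens in a parallel decoding block, the top-scoring experts form a shared coreset, and each token then routes only within that coreset. The function names, the plain summation used for aggregation, and the example weights are assumptions, not details taken from the paper.

```python
def elect_coreset(router_weights, coreset_size):
    """Pick the experts with the highest router weight summed over all tokens.

    router_weights: one list of per-expert weights for each token in the block.
    """
    num_experts = len(router_weights[0])
    votes = [sum(token[e] for token in router_weights) for e in range(num_experts)]
    ranked = sorted(range(num_experts), key=lambda e: votes[e], reverse=True)
    return sorted(ranked[:coreset_size])

def route_within_coreset(token_weights, coreset, top_k):
    """Choose a token's top-k experts, restricted to the shared coreset."""
    ranked = sorted(coreset, key=lambda e: token_weights[e], reverse=True)
    return ranked[:top_k]

block = [
    [0.6, 0.2, 0.1, 0.1],  # token 0
    [0.1, 0.6, 0.2, 0.1],  # token 1
    [0.5, 0.1, 0.1, 0.3],  # token 2
]
coreset = elect_coreset(block, coreset_size=2)
print(coreset)
print([route_within_coreset(tok, coreset, top_k=2) for tok in block])
```

With independent top-2 routing, these three tokens would touch all four experts; restricting them to the elected coreset loads only two, which is the kind of reduction in unique expert activations the abstract reports.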

Title: DIAMOND: Directed Inference for Artifact Mitigation in Flow Matching Models

Authors: Alicja Polowczyk, Agnieszka Polowczyk, Piotr Borycki, Joanna Waczyńska, Jacek Tabor, Przemysław Spurek
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2602.00883
Pdf URL: https://arxiv.org/pdf/2602.00883
Copy Paste: [[2602.00883]] DIAMOND: Directed Inference for Artifact Mitigation in Flow Matching Models(https://arxiv.org/abs/2602.00883)
Keywords: generation, generative
Abstract: Despite impressive results from recent text-to-image models like FLUX, visual and anatomical artifacts remain a significant hurdle for practical and professional use. Existing methods for artifact reduction typically work in a post-hoc manner, consequently failing to intervene effectively during the core image formation process. Notably, current techniques require problematic and invasive modifications to the model weights, or depend on a computationally expensive and time-consuming process of regional refinement. To address these limitations, we propose DIAMOND, a training-free method that applies trajectory correction to mitigate artifacts during inference. By reconstructing an estimate of the clean sample at every step of the generative trajectory, DIAMOND actively steers the generation process away from latent states that lead to artifacts. Furthermore, we extend the proposed method to standard Diffusion Models, demonstrating that DIAMOND provides a robust, zero-shot path to high-fidelity, artifact-free image synthesis without the need for additional training or weight modifications in modern generative architectures. Code is available at this https URL.

Title: Data Augmentation for High-Fidelity Generation of CAR-T/NK Immunological Synapse Images

Authors: Xiang Zhang, Boxuan Zhang, Alireza Naghizadeh, Mohab Mohamed, Dongfang Liu, Ruixiang Tang, Dimitris Metaxas, Dongfang Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.00949
Pdf URL: https://arxiv.org/pdf/2602.00949
Copy Paste: [[2602.00949]] Data Augmentation for High-Fidelity Generation of CAR-T/NK Immunological Synapse Images(https://arxiv.org/abs/2602.00949)
Keywords: generation
Abstract: Chimeric antigen receptor (CAR)-T and NK cell immunotherapies have transformed cancer treatment, and recent studies suggest that the quality of the CAR-T/NK cell immunological synapse (IS) may serve as a functional biomarker for predicting therapeutic efficacy. Accurate detection and segmentation of CAR-T/NK IS structures using artificial neural networks (ANNs) can greatly increase the speed and reliability of IS quantification. However, a persistent challenge is the limited size of annotated microscopy datasets, which restricts the ability of ANNs to generalize. To address this challenge, we integrate two complementary data-augmentation frameworks. First, we employ Instance Aware Automatic Augmentation (IAAA), an automated, instance-preserving augmentation method that generates synthetic CAR-T/NK IS images and corresponding segmentation masks by applying optimized augmentation policies to original IS data. IAAA supports multiple imaging modalities (e.g., fluorescence and brightfield) and can be applied directly to CAR-T/NK IS images derived from patient samples. In parallel, we introduce a Semantic-Aware AI Augmentation (SAAA) pipeline that combines a diffusion-based mask generator with a Pix2Pix conditional image synthesizer. This second method enables the creation of diverse, anatomically realistic segmentation masks and produces high-fidelity CAR-T/NK IS images aligned with those masks, further expanding the training corpus beyond what IAAA alone can provide. Together, these augmentation strategies generate synthetic images whose visual and structural properties closely match real IS data, significantly improving CAR-T/NK IS detection and segmentation performance. By enhancing the robustness and accuracy of IS quantification, this work supports the development of more reliable imaging-based biomarkers for predicting patient response to CAR-T/NK immunotherapy.
摘要：嵌合抗原受体（CAR）-T和NK细胞免疫疗法已经改变了癌症治疗，最近的研究表明CAR-T/NK细胞免疫突触（IS）的质量可以作为预测治疗效果的功能生物标志物。使用人工神经网络 (ANN) 准确检测和分割 CAR-T/NK IS 结构可以大大提高 IS 定量的速度和可靠性。然而，一个持续的挑战是带注释的显微镜数据集的大小有限，这限制了人工神经网络的泛化能力。为了应对这一挑战，我们集成了两个互补的数据增强框架。首先，我们采用实例感知自动增强（IAAA），这是一种自动的、实例保留的增强方法，通过对原始 IS 数据应用优化的增强策略来生成合成的 CAR-T/NK IS 图像和相应的分割掩模。 IAAA 支持多种成像模式（例如荧光和明场），并且可以直接应用于源自患者样本的 CAR-T/NK IS 图像。与此同时，我们引入了语义感知 AI 增强 (SAAA) 管道，它将基于扩散的掩模生成器与 Pix2Pix 条件图像合成器相结合。第二种方法能够创建多样化的、解剖学上真实的分割掩模，并生成与这些掩模对齐的高保真 CAR-T/NK IS 图像，进一步扩展训练语料库，超出了 IAAA 单独提供的范围。这些增强策略共同生成合成图像，其视觉和结构特性与真实 IS 数据紧密匹配，从而显着提高 CAR-T/NK IS 检测和分割性能。通过增强 IS 定量的稳健性和准确性，这项工作支持开发更可靠的基于成像的生物标志物，用于预测患者对 CAR-T/NK 免疫疗法的反应。

Title: SAGE: Agentic Framework for Interpretable and Clinically Translatable Computational Pathology Biomarker Discovery

Authors: Sahar Almahfouz Nasser, Juan Francisco Pesantez Borja, Jincheng Liu, Tanvir Hasan, Zenghan Wang, Suman Ghosh, Sandeep Manandhar, Shikhar Shiromani, Twisha Shah, Naoto Tokuyama, Anant Madabhushi
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2602.00953
Pdf URL: https://arxiv.org/pdf/2602.00953
Copy Paste: [[2602.00953]] SAGE: Agentic Framework for Interpretable and Clinically Translatable Computational Pathology Biomarker Discovery(https://arxiv.org/abs/2602.00953)
Keywords: generation
Abstract: Despite significant progress in computational pathology, many AI models remain black-box and difficult to interpret, posing a major barrier to clinical adoption due to limited transparency and explainability. This has motivated continued interest in engineered image-based biomarkers, which offer greater interpretability but are often proposed based on anecdotal evidence or fragmented prior literature rather than systematic biological validation. We introduce SAGE (Structured Agentic system for hypothesis Generation and Evaluation), an agentic AI system designed to identify interpretable, engineered pathology biomarkers by grounding them in biological evidence. SAGE integrates literature-anchored reasoning with multimodal data analysis to correlate image-derived features with molecular biomarkers, such as gene expression, and clinically relevant outcomes. By coordinating specialized agents for biological contextualization and empirical hypothesis validation, SAGE prioritizes transparent, biologically supported biomarkers and advances the clinical translation of computational pathology.
摘要：尽管计算病理学取得了重大进展，但许多人工智能模型仍然是黑匣子且难以解释，由于透明度和可解释性有限，对临床采用构成了主要障碍。这激发了人们对基于图像的工程生物标志物的持续兴趣，这些生物标志物提供了更大的可解释性，但通常是基于轶事证据或零碎的先前文献而不是系统的生物验证而提出的。我们引入了 SAGE（用于假设生成和评估的结构化代理系统），这是一种代理 AI 系统，旨在通过以生物证据为基础来识别可解释的工程病理生物标志物。 SAGE 将文献锚定推理与多模式数据分析相结合，将图像衍生特征与分子生物标志物（例如基因表达和临床相关结果）相关联。通过协调生物背景化和经验假设验证的专门代理，SAGE 优先考虑透明的、生物学支持的生物标志物，并推进计算病理学的临床转化。

Title: Multimodal Scientific Learning Beyond Diffusions and Flows

Authors: Leonardo Ferreira Guilhoto, Akshat Kaushal, Paris Perdikaris
Subjects: cs.LG, cs.AI, cs.CE, stat.CO, stat.ML
Abstract URL: https://arxiv.org/abs/2602.00960
Pdf URL: https://arxiv.org/pdf/2602.00960
Copy Paste: [[2602.00960]] Multimodal Scientific Learning Beyond Diffusions and Flows(https://arxiv.org/abs/2602.00960)
Keywords: generative
Abstract: Scientific machine learning (SciML) increasingly requires models that capture multimodal conditional uncertainty arising from ill-posed inverse problems, multistability, and chaotic dynamics. While recent work has favored highly expressive implicit generative models such as diffusion and flow-based methods, these approaches are often data-hungry, computationally costly, and misaligned with the structured solution spaces frequently found in scientific problems. We demonstrate that Mixture Density Networks (MDNs) provide a principled yet largely overlooked alternative for multimodal uncertainty quantification in SciML. As explicit parametric density estimators, MDNs impose an inductive bias tailored to low-dimensional, multimodal physics, enabling direct global allocation of probability mass across distinct solution branches. This structure delivers strong data efficiency, allowing reliable recovery of separated modes in regimes where scientific data is scarce. We formalize these insights through a unified probabilistic framework contrasting explicit and implicit distribution networks, and demonstrate empirically that MDNs achieve superior generalization, interpretability, and sample efficiency across a range of inverse, multistable, and chaotic scientific regression tasks.
摘要：科学机器学习 (SciML) 越来越需要能够捕获由不适定反问题、多稳定性和混沌动力学引起的多模态条件不确定性的模型。虽然最近的工作倾向于高度表达的隐式生成模型，例如扩散和基于流的方法，但这些方法通常需要大量数据，计算成本高昂，并且与科学问题中常见的结构化解决方案空间不一致。我们证明，混合密度网络 (MDN) 为 SciML 中的多模态不确定性量化提供了一种有原则但在很大程度上被忽视的替代方案。作为显式参数密度估计器，MDN 施加了针对低维、多模态物理的归纳偏差，从而能够在不同的解决方案分支之间直接全局分配概率质量。这种结构提供了强大的数据效率，允许在科学数据稀缺的情况下可靠地恢复分离模式。我们通过对比显式和隐式分布网络的统一概率框架将这些见解形式化，并凭经验证明 MDN 在一系列逆向、多稳态和混沌科学回归任务中实现了卓越的泛化性、可解释性和样本效率。
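
As a rough illustration of the explicit parametric density that an MDN outputs, the sketch below evaluates the negative log-likelihood of a scalar target under a one-dimensional Gaussian mixture. The function name `mixture_nll` and the 1-D setting are assumptions made for illustration; the abstract does not specify the paper's architecture or loss. In an actual MDN, the weights, means, and standard deviations would be produced by a network conditioned on the input.

```python
import math


def mixture_nll(y, weights, means, sigmas):
    """Negative log-likelihood of scalar y under a 1-D Gaussian mixture.

    weights must be positive and sum to 1; sigmas must be positive.
    (Hypothetical helper for illustration, not the paper's code.)
    """
    log_terms = []
    for w, mu, sigma in zip(weights, means, sigmas):
        # Log of the Gaussian density N(y; mu, sigma^2).
        log_pdf = (-0.5 * ((y - mu) / sigma) ** 2
                   - math.log(sigma)
                   - 0.5 * math.log(2.0 * math.pi))
        log_terms.append(math.log(w) + log_pdf)
    # Log-sum-exp for numerical stability.
    m = max(log_terms)
    return -(m + math.log(sum(math.exp(t - m) for t in log_terms)))
```

With two well-separated components, the loss is low near either mode and high between them, which is the kind of direct allocation of probability mass across solution branches the abstract refers to.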

Title: Toward Universal and Transferable Jailbreak Attacks on Vision-Language Models

Authors: Kaiyuan Cui, Yige Li, Yutao Wu, Xingjun Ma, Sarah Erfani, Christopher Leckie, Hanxun Huang
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2602.01025
Pdf URL: https://arxiv.org/pdf/2602.01025
Copy Paste: [[2602.01025]] Toward Universal and Transferable Jailbreak Attacks on Vision-Language Models(https://arxiv.org/abs/2602.01025)
Keywords: generation
Abstract: Vision-language models (VLMs) extend large language models (LLMs) with vision encoders, enabling text generation conditioned on both images and text. However, this multimodal integration expands the attack surface by exposing the model to image-based jailbreaks crafted to induce harmful responses. Existing gradient-based jailbreak methods transfer poorly, as adversarial patterns overfit to a single white-box surrogate and fail to generalise to black-box models. In this work, we propose Universal and transferable jailbreak (UltraBreak), a framework that constrains adversarial patterns through transformations and regularisation in the vision space, while relaxing textual targets through semantic-based objectives. By defining its loss in the textual embedding space of the target LLM, UltraBreak discovers universal adversarial patterns that generalise across diverse jailbreak objectives. This combination of vision-level regularisation and semantically guided textual supervision mitigates surrogate overfitting and enables strong transferability across both models and attack targets. Extensive experiments show that UltraBreak consistently outperforms prior jailbreak methods. Further analysis reveals why earlier approaches fail to transfer, highlighting that smoothing the loss landscape via semantic objectives is crucial for enabling universal and transferable jailbreaks. The code is publicly available in our GitHub repository (this https URL).
摘要：视觉语言模型 (VLM) 通过视觉编码器扩展大型语言模型 (LLM)，从而能够根据图像和文本生成文本。然而，这种多模式集成通过将模型暴露于旨在引发有害响应的基于图像的越狱，扩大了攻击面。现有的基于梯度的越狱方法迁移性很差，因为对抗模式过度拟合单个白盒代理并且无法泛化到黑盒模型。在这项工作中，我们提出了通用和可转移的越狱（UltraBreak），这是一个通过视觉空间中的转换和正则化来限制对抗模式的框架，同时通过基于语义的目标来放松文本目标。通过定义目标 LLM 文本嵌入空间中的损失，UltraBreak 发现了跨不同越狱目标的通用对抗模式。视觉级正则化和语义引导文本监督的结合可以减轻代理过度拟合，并实现跨模型和攻击目标的强大可转移性。大量实验表明 UltraBreak 的性能始终优于以前的越狱方法。进一步的分析揭示了为什么早期的方法无法转移，强调通过语义目标平滑损失景观对于实现通用和可转移的越狱至关重要。该代码可在我们的 \href{此 https URL}{GitHub 存储库}中公开获取。

Title: FUSE-Flow: Scalable Real-Time Multi-View Point Cloud Reconstruction Using Confidence

Authors: Chentian Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.01035
Pdf URL: https://arxiv.org/pdf/2602.01035
Copy Paste: [[2602.01035]] FUSE-Flow: Scalable Real-Time Multi-View Point Cloud Reconstruction Using Confidence(https://arxiv.org/abs/2602.01035)
Keywords: generation
Abstract: Real-time multi-view point cloud reconstruction is a core problem in 3D vision and immersive perception, with wide applications in VR, AR, robotic navigation, digital twins, and computer interaction. Despite advances in multi-camera systems and high-resolution depth sensors, fusing large-scale multi-view depth observations into high-quality point clouds under strict real-time constraints remains challenging. Existing methods relying on voxel-based fusion, temporal accumulation, or global optimization suffer from high computational complexity, excessive memory usage, and limited scalability, and fail to simultaneously achieve real-time performance, reconstruction quality, and multi-camera extensibility. We propose FUSE-Flow, a frame-wise, stateless, and linearly scalable point cloud streaming reconstruction framework. Each frame independently generates point cloud fragments, which are fused via two weights (measurement confidence and 3D distance consistency) to suppress noise while preserving geometric details. For large-scale multi-camera efficiency, we introduce an adaptive spatial hashing-based weighted aggregation method: 3D space is adaptively partitioned by local point cloud density, representative points are selected per cell, and weighted fusion is performed to handle both sparse and dense regions. With GPU parallelization, FUSE-Flow achieves high-throughput, low-latency point cloud generation and fusion with linear complexity. Experiments demonstrate that the framework improves reconstruction stability and geometric fidelity in overlapping, depth-discontinuous, and dynamic scenes, while maintaining real-time frame rates on modern GPUs, verifying its effectiveness, robustness, and scalability.
摘要：实时多视点点云重建是3D视觉和沉浸式感知的核心问题，在VR、AR、机器人导航、数字孪生和计算机交互等领域有着广泛的应用。尽管多摄像头系统和高分辨率深度传感器取得了进步，但在严格的实时约束下将大规模多视图深度观测融合到高质量点云中仍然具有挑战性。现有的依赖于基于体素的融合、时间累积或全局优化的方法存在计算复杂度高、内存使用过多和可扩展性有限的问题，无法同时实现实时性能、重建质量和多相机可扩展性。我们提出了 FUSE-Flow，一种逐帧、无状态、线性可扩展的点云流式重建框架。每帧独立生成点云片段，通过两个权重、测量置信度和 3D 距离一致性进行融合，以抑制噪声，同时保留几何细节。为了实现大规模多摄像机效率，我们引入了一种基于自适应空间哈希的加权聚合方法：根据局部点云密度自适应地划分3D空间，为每个单元选择代表点，并执行加权融合来处理稀疏区域和密集区域。通过GPU并行化，FUSE-Flow实现了高吞吐量、低延迟的点云生成和线性复杂度的融合。实验表明，该框架提高了重叠、深度不连续和动态场景中的重建稳定性和几何保真度，同时在现代 GPU 上保持实时帧速率，验证了其有效性、鲁棒性和可扩展性。
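
The spatial-hashing aggregation described above can be sketched, under simplifying assumptions, as hashing each point into a grid cell and taking a weighted average per cell. This sketch uses a fixed `cell_size` rather than the density-adaptive partitioning the abstract describes, and it assumes each point already carries a single combined weight (for example, the product of measurement confidence and distance consistency). The function name `fuse_points` is hypothetical.

```python
import math
from collections import defaultdict


def fuse_points(points, cell_size):
    """Fuse (x, y, z, weight) points that fall into the same grid cell.

    Returns a dict mapping each occupied cell index to one
    weighted-average point. Single pass, linear in the number of points.
    """
    # Per-cell accumulators: sum(w*x), sum(w*y), sum(w*z), sum(w).
    cells = defaultdict(lambda: [0.0, 0.0, 0.0, 0.0])
    for x, y, z, w in points:
        key = (math.floor(x / cell_size),
               math.floor(y / cell_size),
               math.floor(z / cell_size))
        acc = cells[key]
        acc[0] += w * x
        acc[1] += w * y
        acc[2] += w * z
        acc[3] += w
    return {key: (sx / sw, sy / sw, sz / sw)
            for key, (sx, sy, sz, sw) in cells.items() if sw > 0.0}
```

Because each frame is processed independently and no state is carried between calls, the function is stateless in the same sense as the frame-wise design in the abstract.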

Title: PISA: Piecewise Sparse Attention Is Wiser for Efficient Diffusion Transformers

Authors: Haopeng Li, Shitong Shao, Wenliang Zhong, Zikai Zhou, Lichen Bai, Hui Xiong, Zeke Xie
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.01077
Pdf URL: https://arxiv.org/pdf/2602.01077
Copy Paste: [[2602.01077]] PISA: Piecewise Sparse Attention Is Wiser for Efficient Diffusion Transformers(https://arxiv.org/abs/2602.01077)
Keywords: generation
Abstract: Diffusion Transformers are fundamental for video and image generation, but their efficiency is bottlenecked by the quadratic complexity of attention. While block sparse attention accelerates computation by attending only to critical key-value blocks, it degrades at high sparsity because it discards context. In this work, we discover that the attention scores of non-critical blocks exhibit distributional stability, allowing them to be approximated accurately and efficiently rather than discarded, which is essential for sparse attention design. Motivated by this key insight, we propose PISA, a training-free Piecewise Sparse Attention that covers the full attention span with sub-quadratic complexity. Unlike the conventional keep-or-drop paradigm that directly drops the non-critical block information, PISA introduces a novel exact-or-approximate strategy: it maintains exact computation for critical blocks while efficiently approximating the remainder through block-wise Taylor expansion. This design allows PISA to serve as a faithful proxy to full attention, effectively bridging the gap between speed and quality. Experimental results demonstrate that PISA achieves speedups of 1.91 times and 2.57 times on Wan2.1-14B and Hunyuan-Video, respectively, while consistently maintaining the highest quality among sparse attention methods. Notably, even for image generation on FLUX, PISA achieves a 1.2 times acceleration without compromising visual quality. Code is available at: this https URL.
摘要：扩散变压器是视频和图像生成的基础，但其效率受到注意力二次复杂度的瓶颈。虽然块稀疏注意力通过仅关注关键键值块来加速计算，但它会因丢弃上下文而在高稀疏性下遭受退化。在这项工作中，我们发现非关键块的注意力分数表现出分布稳定性，使它们能够准确有效地近似而不是被丢弃，这对于稀疏注意力设计至关重要。受这一关键见解的启发，我们提出了 PISA，这是一种免训练的分段稀疏注意力机制，涵盖了具有次二次复杂度的完整注意力跨度。与直接删除非关键块信息的传统保留或删除范式不同，PISA 引入了一种新颖的精确或近似策略：它保持关键块的精确计算，同时通过块式泰勒展开有效地近似剩余部分。这种设计使 PISA 成为充分关注的忠实代表，有效地弥合了速度和质量之间的差距。实验结果表明，PISA 在 Wan2.1-14B 和 Hunyuan-Video 上分别实现了 1.91 倍和 2.57 倍的加速，同时始终保持稀疏注意力方法中的最高质量。值得注意的是，即使对于 FLUX 上的图像生成，PISA 也能在不影响视觉质量的情况下实现 1.2 倍的加速。代码可在以下位置获得：此 https URL。
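
To make the exact-or-approximate idea concrete, here is a minimal single-query sketch. It is an assumption about one way a block-wise Taylor expansion could be applied, not the paper's algorithm: for a non-critical block, exp(q·k_j) is expanded to first order around the block's mean key, so the block contributes through a few precomputed statistics instead of per-key work. Values are scalars and there is no max-subtraction for numerical stability, both for brevity; all function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def block_stats(keys, values):
    """Query-independent statistics of one block (cacheable in practice)."""
    n = len(keys)
    mean_k = [sum(col) / n for col in zip(*keys)]
    sum_v = sum(values)
    # First-order moment: sum_j (k_j - mean_k) * v_j, one entry per dimension.
    moment = [sum((k[d] - mean_k[d]) * v for k, v in zip(keys, values))
              for d in range(len(mean_k))]
    return n, mean_k, sum_v, moment


def piecewise_attention(q, key_blocks, value_blocks, critical):
    """Softmax attention output for one query with scalar values.

    Blocks whose index is in `critical` are computed exactly; the others use
    exp(q.k_j) ~= exp(q.mean_k) * (1 + q.(k_j - mean_k)).
    """
    num, den = 0.0, 0.0
    for idx, (keys, values) in enumerate(zip(key_blocks, value_blocks)):
        if idx in critical:
            for k, v in zip(keys, values):
                w = math.exp(dot(q, k))
                num += w * v
                den += w
        else:
            n, mean_k, sum_v, moment = block_stats(keys, values)
            base = math.exp(dot(q, mean_k))
            # The first-order term cancels in the denominator because the
            # deviations from the mean key sum to zero.
            num += base * (sum_v + dot(q, moment))
            den += base * n
    return num / den
```

The approximation is exact when all keys in a block coincide and degrades as keys spread out within the block, which is why only non-critical blocks are treated this way.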

Title: Differential Vector Erasure: Unified Training-Free Concept Erasure for Flow Matching Models

Authors: Zhiqi Zhang, Xinhao Zhong, Yi Sun, Shuoyang Sun, Bin Chen, Shu-Tao Xia, Xuan Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.01089
Pdf URL: https://arxiv.org/pdf/2602.01089
Copy Paste: [[2602.01089]] Differential Vector Erasure: Unified Training-Free Concept Erasure for Flow Matching Models(https://arxiv.org/abs/2602.01089)
Keywords: generative
Abstract: Text-to-image diffusion models have demonstrated remarkable capabilities in generating high-quality images, yet their tendency to reproduce undesirable concepts, such as NSFW content, copyrighted styles, or specific objects, poses growing concerns for safe and controllable deployment. While existing concept erasure approaches primarily focus on DDPM-based diffusion models and rely on costly fine-tuning, the recent emergence of flow matching models introduces a fundamentally different generative paradigm for which prior methods are not directly applicable. In this paper, we propose Differential Vector Erasure (DVE), a training-free concept erasure method specifically designed for flow matching models. Our key insight is that semantic concepts are implicitly encoded in the directional structure of the velocity field governing the generative flow. Leveraging this observation, we construct a differential vector field that characterizes the directional discrepancy between a target concept and a carefully chosen anchor concept. During inference, DVE selectively removes concept-specific components by projecting the velocity field onto the differential direction, enabling precise concept suppression without affecting irrelevant semantics. Extensive experiments on FLUX demonstrate that DVE consistently outperforms existing baselines on a wide range of concept erasure tasks, including NSFW suppression, artistic style removal, and object erasure, while preserving image quality and diversity.
摘要：文本到图像扩散模型在生成高质量图像方面表现出了卓越的能力，但它们倾向于重现不良概念，例如 NSFW 内容、受版权保护的样式或特定对象，这引起了人们对安全和可控部署的日益关注。虽然现有的概念擦除方法主要关注基于 DDPM 的扩散模型并依赖于昂贵的微调，但最近出现的流匹配模型引入了一种根本不同的生成范式，先前的方法无法直接适用。在本文中，我们提出了差分向量擦除（DVE），这是一种专为流匹配模型设计的免训练概念擦除方法。我们的主要见解是，语义概念隐式编码在控制生成流的速度场的方向结构中。利用这一观察结果，我们构建了一个微分向量场，该向量场表征目标概念和精心选择的锚概念之间的方向差异。在推理过程中，DVE 通过将速度场投影到微分方向上，选择性地去除概念特定的分量，从而实现精确的概念抑制，而不影响不相关的语义。对 FLUX 的大量实验表明，DVE 在各种概念擦除任务上始终优于现有基线，包括 NSFW 抑制、艺术风格去除和对象擦除，同时保持图像质量和多样性。
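
The projection step can be illustrated with plain vector arithmetic. The sketch below assumes the differential direction is simply the difference between a velocity predicted under the target-concept prompt and one predicted under the anchor prompt, and that erasure removes the component of the current velocity along that direction. The function name `erase_concept` and its arguments are hypothetical; the abstract does not say how DVE constructs or applies the differential field in detail.

```python
def erase_concept(velocity, v_target, v_anchor, eps=1e-12):
    """Remove the component of `velocity` along (v_target - v_anchor).

    All arguments are flat lists of floats of equal length.
    """
    diff = [t - a for t, a in zip(v_target, v_anchor)]
    norm_sq = sum(d * d for d in diff)
    if norm_sq < eps:
        # Target and anchor agree: there is nothing concept-specific to remove.
        return list(velocity)
    coeff = sum(v * d for v, d in zip(velocity, diff)) / norm_sq
    return [v - coeff * d for v, d in zip(velocity, diff)]
```

After the call, the returned velocity is orthogonal to the differential direction, while components in all other directions are left unchanged.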

Title: Self-Generative Adversarial Fine-Tuning for Large Language Models

Authors: Shiguang Wu, Yaqing Wang, Quanming Yao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2602.01137
Pdf URL: https://arxiv.org/pdf/2602.01137
Copy Paste: [[2602.01137]] Self-Generative Adversarial Fine-Tuning for Large Language Models(https://arxiv.org/abs/2602.01137)
Keywords: generation, generative
Abstract: Fine-tuning large language models (LLMs) for alignment typically relies on supervised fine-tuning or reinforcement learning from human feedback, both limited by the cost and scarcity of high-quality annotations. Recent self-play and synthetic data approaches reduce this dependence but often rely on heuristic assumptions or ungrounded self-evaluation, which can cause bias accumulation and performance drift. In this paper, we propose Self-Generative Adversarial LLM (SGALM), a unified fine-tuning framework that formulates alignment as a generative adversarial game within a single LLM. SGALM jointly evolves generation and discrimination capabilities without external reward models. Theoretical and empirical results demonstrate that SGALM achieves state-of-the-art performance and serves as both an effective alignment algorithm and a robust synthetic data engine.
摘要：微调大型语言模型（LLM）以进行对齐通常依赖于监督微调或来自人类反馈的强化学习，这两者都受到高质量注释的成本和稀缺性的限制。最近的自我对弈和合成数据方法减少了这种依赖性，但通常依赖于启发式假设或无根据的自我评估，这可能会导致偏差累积和性能漂移。在本文中，我们提出了自我生成对抗性 LLM (SGALM)，这是一个统一的微调框架，它将对齐制定为单个 LLM 内的生成对抗性游戏。 SGALM 共同发展生成和区分能力，无需外部奖励模型。理论和实证结果表明，SGALM 实现了最先进的性能，可作为有效的对齐算法和强大的合成数据引擎。

Title: Generalized Radius and Integrated Codebook Transforms for Differentiable Vector Quantization

Authors: Haochen You, Heng Zhang, Hongyang He, Yuqi Li, Baojing Liu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2602.01140
Pdf URL: https://arxiv.org/pdf/2602.01140
Copy Paste: [[2602.01140]] Generalized Radius and Integrated Codebook Transforms for Differentiable Vector Quantization(https://arxiv.org/abs/2602.01140)
Keywords: generation, generative
Abstract: Vector quantization (VQ) underpins modern generative and representation models by turning continuous latents into discrete tokens. Yet hard nearest-neighbor assignments are non-differentiable and are typically optimized with heuristic straight-through estimators, which couple the update step size to the quantization gap and train each code in isolation, leading to unstable gradients and severe codebook under-utilization at scale. In this paper, we introduce GRIT-VQ (Generalized Radius and Integrated Transform-Vector Quantization), a unified surrogate framework that keeps hard assignments in the forward pass while making VQ fully differentiable. GRIT-VQ replaces the straight-through estimator with a radius-based update that moves latents along the quantization direction with a controllable, geometry-aware step, and applies a data-agnostic integrated transform to the codebook so that all codes are updated through shared parameters instead of independently. Our theoretical analysis clarifies the fundamental optimization dynamics introduced by GRIT-VQ, establishing conditions for stable gradient flow, coordinated codebook evolution, and reliable avoidance of collapse across a broad family of quantizers. Across image reconstruction, image generation, and recommendation tokenization benchmarks, GRIT-VQ consistently improves reconstruction error, generative quality, and recommendation accuracy while substantially increasing codebook utilization compared to existing VQ variants.
摘要：矢量量化（VQ）通过将连续的潜在特征转化为离散的标记来支撑现代生成和表示模型。然而，硬最近邻分配是不可微分的，并且通常使用启发式直通估计器进行优化，该估计器将更新步长与量化间隙耦合并单独训练每个代码，从而导致不稳定的梯度和严重的码书大规模利用不足。在本文中，我们介绍了 GRIT-VQ（广义半径和集成变换矢量量化），这是一种统一的代理框架，可以在前向传递中保留硬分配，同时使 VQ 完全可微分。 GRIT-VQ 用基于半径的更新取代了直通估计器，该更新通过可控的、几何感知的步骤沿量化方向移动潜在变量，并对码本应用与数据无关的集成变换，以便所有代码都通过共享参数而不是独立更新。我们的理论分析阐明了 GRIT-VQ 引入的基本优化动力学，为稳定的梯度流、协调的码本演化以及跨广泛的量化器系列可靠地避免崩溃奠定了条件。在图像重建、图像生成和推荐标记化基准测试中，GRIT-VQ 持续改善重建误差、生成质量和推荐准确性，同时与现有 VQ 变体相比，大幅提高码本利用率。

Title: Improving Robustness of Vision-Language-Action Models by Restoring Corrupted Visual Inputs

Authors: Daniel Yezid Guarnizo Orjuela, Leonardo Scappatura, Veronica Di Gennaro, Riccardo Andrea Izzo, Gianluca Bardaro, Matteo Matteucci
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2602.01158
Pdf URL: https://arxiv.org/pdf/2602.01158
Copy Paste: [[2602.01158]] Improving Robustness of Vision-Language-Action Models by Restoring Corrupted Visual Inputs(https://arxiv.org/abs/2602.01158)
Keywords: restoration
Abstract: Vision-Language-Action (VLA) models have emerged as a dominant paradigm for generalist robotic manipulation, unifying perception and control within a single end-to-end architecture. However, despite their success in controlled environments, reliable real-world deployment is severely hindered by their fragility to visual disturbances. While existing literature extensively addresses physical occlusions caused by scene geometry, a critical mode remains largely unexplored: image corruptions. These sensor-level artifacts, ranging from electronic noise and dead pixels to lens contaminants, directly compromise the integrity of the visual signal prior to interpretation. In this work, we quantify this vulnerability, demonstrating that state-of-the-art VLAs such as $\pi_{0.5}$ and SmolVLA suffer catastrophic performance degradation under common signal artifacts, dropping from 90% success rates to as low as 2%. To mitigate this, we introduce the Corruption Restoration Transformer (CRT), a plug-and-play and model-agnostic vision transformer designed to immunize VLA models against sensor disturbances. Leveraging an adversarial training objective, CRT restores clean observations from corrupted inputs without requiring computationally expensive fine-tuning of the underlying model. Extensive experiments across the LIBERO and Meta-World benchmarks demonstrate that CRT effectively recovers lost performance, enabling VLAs to maintain near-baseline success rates, even under severe visual corruption.
摘要：视觉-语言-动作（VLA）模型已成为通用机器人操作的主导范例，将感知和控制统一在单个端到端架构中。然而，尽管它们在受控环境中取得了成功，但它们对视觉干扰的脆弱性严重阻碍了它们在现实世界中的可靠部署。虽然现有文献广泛讨论了场景几何造成的物理遮挡，但一个关键模式仍然很大程度上未被探索：图像损坏。这些传感器级伪影（从电子噪声和坏像素到镜头污染物）直接损害了解释之前视觉信号的完整性。在这项工作中，我们量化了这个漏洞，证明了最先进的 VLA，例如 $\pi_{0.5}$ 和 SmolVLA，在常见的信号伪影下会遭受灾难性的性能下降，成功率从 90% 下降到低至 2%。为了缓解这一问题，我们引入了损坏恢复变压器 (CRT)，这是一种即插即用且与模型无关的视觉变压器，旨在使 VLA 模型免受传感器干扰。利用对抗性训练目标，CRT 从损坏的输入中恢复干净的观察结果，而不需要对底层模型进行计算成本高昂的微调。 LIBERO 和 Meta-World 基准测试的大量实验表明，CRT 可以有效恢复损失的性能，使 VLA 即使在严重的视觉损坏情况下也能保持接近基线的成功率。

Title: EEmo-Logic: A Unified Dataset and Multi-Stage Framework for Comprehensive Image-Evoked Emotion Assessment

Authors: Lancheng Gao, Ziheng Jia, Zixuan Xing, Wei Sun, Huiyu Duan, Guangtao Zhai, Xiongkuo Min
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.01173
Pdf URL: https://arxiv.org/pdf/2602.01173
Copy Paste: [[2602.01173]] EEmo-Logic: A Unified Dataset and Multi-Stage Framework for Comprehensive Image-Evoked Emotion Assessment(https://arxiv.org/abs/2602.01173)
Keywords: generation
Abstract: Understanding the multi-dimensional attributes and intensity nuances of image-evoked emotions is pivotal for advancing machine empathy and empowering diverse human-computer interaction applications. However, existing models are still limited to coarse-grained emotion perception or have deficient reasoning capabilities. To bridge this gap, we introduce EEmoDB, the largest image-evoked emotion understanding dataset to date. It features 5 analysis dimensions spanning 5 distinct task categories, facilitating comprehensive interpretation. Specifically, we compile 1.2M question-answering (QA) pairs (EEmoDB-QA) from 125k images via automated generation, alongside a 36k dataset (EEmoDB-Assess) curated from 25k images for fine-grained assessment. Furthermore, we propose EEmo-Logic, an all-in-one multimodal large language model (MLLM) developed via instruction fine-tuning and task-customized group relative preference optimization (GRPO) with novel reward design. Extensive experiments demonstrate that EEmo-Logic achieves robust performance in in-domain and cross-domain datasets, excelling in emotion QA and fine-grained assessment. The code is available at this https URL.
摘要：了解图像引发的情感的多维属性和强度细微差别对于提高机器同理心和支持多样化的人机交互应用至关重要。然而，现有模型仍然局限于粗粒度的情感感知或推理能力不足。为了弥补这一差距，我们引入了 EEmoDB，这是迄今为止最大的图像诱发情感理解数据集。它具有涵盖 5 美元不同任务类别的 5 美元分析维度，有助于全面解释。具体来说，我们通过自动生成从 12.5 万美元的图像中编译了 120 万美元的问答 (QA) 对 (EEmoDB-QA)，同时从 2.5 万美元的图像中整理了一个 3.6 万美元的数据集 (EEmoDB-Assess)，以进行细粒度评估。此外，我们提出了 EEmo-Logic，一种通过指令微调和任务定制组相对偏好优化（GRPO）开发的一体化多模态大语言模型（MLLM），并具有新颖的奖励设计。大量实验表明，EEmo-Logic 在域内和跨域数据集上实现了稳健的性能，在情感 QA 和细粒度评估方面表现出色。该代码可从此 https URL 获取。

Title: Q-DiT4SR: Exploration of Detail-Preserving Diffusion Transformer Quantization for Real-World Image Super-Resolution

Authors: Xun Zhang, Kaicheng Yang, Hongliang Lu, Haotong Qin, Yong Guo, Yulun Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.01273
Pdf URL: https://arxiv.org/pdf/2602.01273
Copy Paste: [[2602.01273]] Q-DiT4SR: Exploration of Detail-Preserving Diffusion Transformer Quantization for Real-World Image Super-Resolution(https://arxiv.org/abs/2602.01273)
Keywords: super-resolution
Abstract: Recently, Diffusion Transformers (DiTs) have emerged in Real-World Image Super-Resolution (Real-ISR) to generate high-quality textures, yet their heavy inference burden hinders real-world deployment. While Post-Training Quantization (PTQ) is a promising solution for acceleration, existing methods in super-resolution mostly focus on U-Net architectures, whereas generic DiT quantization is typically designed for text-to-image tasks. Directly applying these methods to DiT-based super-resolution models leads to severe degradation of local textures. Therefore, we propose Q-DiT4SR, the first PTQ framework specifically tailored for DiT-based Real-ISR. We propose H-SVD, a hierarchical SVD that integrates a global low-rank branch with a local block-wise rank-1 branch under a matched parameter budget. We further propose Variance-aware Spatio-Temporal Mixed Precision: VaSMP allocates cross-layer weight bit-widths in a data-free manner based on rate-distortion theory, while VaTMP schedules intra-layer activation precision across diffusion timesteps via dynamic programming (DP) with minimal calibration. Experiments on multiple real-world datasets demonstrate that our Q-DiT4SR achieves SOTA performance under both W4A6 and W4A4 settings. Notably, the W4A4 quantization configuration reduces model size by 5.8$\times$ and computational operations by over 60$\times$. Our code and models will be available at this https URL.
摘要：最近，扩散变压器（DiT）已经出现在真实世界图像超分辨率（Real-ISR）中，可以生成高质量的纹理，但其沉重的推理负担阻碍了现实世界的部署。虽然训练后量化 (PTQ) 是一种很有前景的加速解决方案，但现有的超分辨率方法主要集中在 U-Net 架构上，而通用 DiT 量化通常是为文本到图像任务而设计的。直接将这些方法应用于基于 DiT 的超分辨率模型会导致局部纹理严重退化。因此，我们提出了 Q-DiT4SR，这是第一个专门为基于 DiT 的 Real-ISR 量身定制的 PTQ 框架。我们提出了 H-SVD，这是一种分层 SVD，它在匹配的参数预算下集成了全局低秩分支与局部块级 1 分支。我们进一步提出了方差感知时空混合精度：VaSMP 基于率失真理论以无数据的方式分配跨层权重位宽，而 VaTMP 通过动态规划（DP）以最小的校准来跨扩散时间步安排层内激活精度。对多个真实数据集的实验表明，我们的 Q-DiT4SR 在 W4A6 和 W4A4 设置下均实现了 SOTA 性能。值得注意的是，W4A4 量化配置将模型大小减少了 5.8$\times$，计算操作减少了 60$\times$。我们的代码和模型将在此 https URL 中提供。

Title: EDIS: Diagnosing LLM Reasoning via Entropy Dynamics

Authors: Chenghua Zhu, Siyan Wu, Xiangkang Zeng, Zishan Xu, Zhaolu Kang, Yifu Guo, Yuquan Lu, Junduan Huang, Guojing Zhou
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2602.01288
Pdf URL: https://arxiv.org/pdf/2602.01288
Copy Paste: [[2602.01288]] EDIS: Diagnosing LLM Reasoning via Entropy Dynamics(https://arxiv.org/abs/2602.01288)
Keywords: generation
Abstract: Entropy-based confidence signals are increasingly leveraged to improve reasoning in large language models (LLMs), yet existing approaches treat confidence as a static quantity, typically aggregated over tokens. We show that the temporal evolution of confidence during generation carries richer information than aggregate statistics alone. Analyzing token-level entropy trajectories, we identify characteristic patterns distinguishing correct from incorrect reasoning: erroneous solutions exhibit unstable dynamics, including burst spikes (sustained uncertainty growth) and peak-valley spikes (sharp rebounds following transient confidence). These patterns persist across models and training stages, suggesting they reflect intrinsic properties of reasoning failure rather than superficial noise. To formalize this observation, we introduce the Entropy Dynamics Instability Score (EDIS), a trajectory-level metric quantifying instability in entropy evolution. EDIS serves as an effective diagnostic signal for inference-time selection, substantially improving reasoning accuracy, and offers a promising direction for training-time sample curation. Our findings establish entropy dynamics as an underexplored yet informative lens for understanding and improving LLM reasoning.
摘要：基于熵的置信度信号越来越多地被用来改进大型语言模型 (LLM) 中的推理，但现有方法将置信度视为静态量——通常通过标记进行聚合。我们表明，在生成过程中信心的\emph{时间演化}比单独的聚合统计数据携带更丰富的信息。通过分析令牌级熵轨迹，我们识别出区分正确推理和错误推理的特征模式：错误的解决方案表现出不稳定的动态，包括突发尖峰（持续的不确定性增长）和峰谷尖峰（瞬态置信度后的急剧反弹）。这些模式在模型和训练阶段中持续存在，表明它们反映了推理失败的内在属性，而不是表面噪音。为了形式化这一观察结果，我们引入了熵动力学不稳定性得分（\textbf{EDIS}），这是一种量化熵演化中不稳定性的轨迹级度量。 EDIS 可作为推理时间选择的有效诊断信号，显着提高推理准确性，并为训练时间样本管理提供有希望的方向。我们的研究结果将熵动力学确立为理解和改进法学硕士推理的一个尚未充分探索但信息丰富的镜头。
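
For readers unfamiliar with token-level entropy trajectories, the sketch below computes the Shannon entropy of each next-token distribution and a simple instability proxy over the resulting sequence. The proxy (mean absolute change between consecutive entropies) is an illustrative stand-in chosen for this example; it is not the EDIS formula, which the abstract does not give. All function names are hypothetical.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_trajectory(step_distributions):
    """Entropy at each generation step, in order."""
    return [token_entropy(p) for p in step_distributions]


def instability_proxy(entropies):
    """Mean absolute change between consecutive entropies.

    Illustrative only: a flat trajectory scores 0, an oscillating one
    scores high. This is NOT the paper's EDIS definition.
    """
    if len(entropies) < 2:
        return 0.0
    diffs = [abs(b - a) for a, b in zip(entropies, entropies[1:])]
    return sum(diffs) / len(diffs)
```

An aggregate such as the mean entropy would assign the same value to a steady trajectory and to one that oscillates around the same level; a trajectory-level statistic separates the two.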

Title: ReDiStory: Region-Disentangled Diffusion for Consistent Visual Story Generation

Authors: Ayushman Sarkar, Zhenyu Yu, Chu Chen, Wei Tang, Kangning Cui, Mohd Yamani Idna Idris
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.01303
Pdf URL: https://arxiv.org/pdf/2602.01303
Copy Paste: [[2602.01303]] ReDiStory: Region-Disentangled Diffusion for Consistent Visual Story Generation(https://arxiv.org/abs/2602.01303)
Keywords: generation
Abstract: Generating coherent visual stories requires maintaining subject identity across multiple images while preserving frame-specific semantics. Recent training-free methods concatenate identity and frame prompts into a unified representation, but this often introduces inter-frame semantic interference that weakens identity preservation in complex stories. We propose ReDiStory, a training-free framework that improves multi-frame story generation via inference-time prompt embedding reorganization. ReDiStory explicitly decomposes text embeddings into identity-related and frame-specific components, then decorrelates frame embeddings by suppressing shared directions across frames. This reduces cross-frame interference without modifying diffusion parameters or requiring additional supervision. Under identical diffusion backbones and inference settings, ReDiStory improves identity consistency while maintaining prompt fidelity. Experiments on the ConsiStory+ benchmark show consistent gains over 1Prompt1Story on multiple identity consistency metrics. Code is available at: this https URL
摘要：生成连贯的视觉故事需要在多个图像中保持主题身份，同时保留特定于帧的语义。最近的免训练方法将身份和框架提示连接成统一的表示，但这通常会引入帧间语义干扰，从而削弱复杂故事中的身份保留。我们提出了 ReDiStory，这是一种免训练框架，可通过推理时提示嵌入重组来改进多帧故事生成。 ReDiStory 将文本嵌入显式分解为身份相关和特定于框架的组件，然后通过抑制跨框架的共享方向来解相关框架嵌入。这减少了跨帧干扰，无需修改扩散参数或需要额外的监督。在相同的扩散主干和推理设置下，ReDiStory 提高了身份一致性，同时保持即时保真度。 ConsiStory+ 基准测试的实验表明，在多个身份一致性指标上，与 1Prompt1Story 相比，取得了一致的收益。代码位于：此 https URL
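
One simple way to picture "suppressing shared directions across frames" is to estimate a shared direction as the mean of the frame embeddings and subtract each embedding's projection onto it. This is an assumption made for illustration: the abstract does not state how ReDiStory identifies shared directions, and the `strength` parameter and function name are hypothetical.

```python
def decorrelate_frames(frame_embeddings, strength=1.0, eps=1e-12):
    """Reduce the component each frame embedding shares with the others.

    frame_embeddings: list of equal-length lists of floats.
    strength: 1.0 removes the shared component fully, 0.0 leaves inputs as-is.
    """
    n = len(frame_embeddings)
    # Mean embedding across frames, used here as the shared direction.
    shared = [sum(col) / n for col in zip(*frame_embeddings)]
    norm_sq = sum(s * s for s in shared)
    if norm_sq < eps:
        return [list(e) for e in frame_embeddings]
    out = []
    for e in frame_embeddings:
        coeff = sum(x * s for x, s in zip(e, shared)) / norm_sq
        out.append([x - strength * coeff * s for x, s in zip(e, shared)])
    return out
```

Only the text embeddings are changed in this sketch, in line with the abstract's statement that diffusion parameters are left unmodified.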

Title: StoryState: Agent-Based State Control for Consistent and Editable Storybooks

Authors: Ayushman Sarkar, Zhenyu Yu, Wei Tang, Chu Chen, Kangning Cui, Mohd Yamani Idna Idris
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.01305
Pdf URL: https://arxiv.org/pdf/2602.01305
Copy Paste: [[2602.01305]] StoryState: Agent-Based State Control for Consistent and Editable Storybooks(https://arxiv.org/abs/2602.01305)
Keywords: generation
Abstract: Large multimodal models have enabled one-click storybook generation, where users provide a short description and receive a multi-page illustrated story. However, the underlying story state, such as characters, world settings, and page-level objects, remains implicit, making edits coarse-grained and often breaking visual consistency. We present StoryState, an agent-based orchestration layer that introduces an explicit and editable story state on top of training-free text-to-image generation. StoryState represents each story as a structured object composed of a character sheet, global settings, and per-page scene constraints, and employs a small set of LLM agents to maintain this state and derive 1Prompt1Story-style prompts for generation and editing. Operating purely through prompts, StoryState is model-agnostic and compatible with diverse generation backends. System-level experiments on multi-page editing tasks show that StoryState enables localized page edits, improves cross-page consistency, and reduces unintended changes, interaction turns, and editing time compared to 1Prompt1Story, while approaching the one-shot consistency of Gemini Storybook. Code is available at this https URL
摘要：大型多模式模型支持一键生成故事书，用户提供简短的描述并收到多页插图故事。然而，底层的故事状态，例如角色、世界设置和页面级对象，仍然是隐式的，这使得编辑变得粗粒度，并且常常破坏视觉一致性。我们提出了 StoryState，一个基于代理的编排层，它在免训练的文本到图像生成之上引入了明确且可编辑的故事状态。 StoryState 将每个故事表示为由角色表、全局设置和每页场景约束组成的结构化对象，并采用一小组 LLM 代理来维护此状态并派生 1Prompt1Story 风格的提示以进行生成和编辑。 StoryState 完全通过提示进行操作，与模型无关，并且与不同世代的后端兼容。多页面编辑任务的系统级实验表明，与 1Prompt1Story 相比，StoryState 能够实现本地化页面编辑，提高跨页面一致性，减少意外更改、交互次数和编辑时间，同时接近 Gemini Storybook 的一次性一致性。代码可在此 https URL 获取
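
The explicit story state described above (character sheet, global settings, per-page scene constraints) maps naturally onto a small structured object. The sketch below is a hypothetical layout: the field names, the prompt format, and the `edit_page` and `prompts` methods are assumptions, and the LLM agents that maintain the state in the paper are omitted.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class StoryState:
    """Explicit, editable story state (illustrative field names)."""
    character_sheet: Dict[str, str]
    global_settings: Dict[str, str]
    page_constraints: List[str] = field(default_factory=list)

    def edit_page(self, page: int, constraint: str) -> None:
        """Localized edit: only the given page's scene constraint changes."""
        self.page_constraints[page] = constraint

    def prompts(self) -> List[str]:
        """Derive one prompt per page from the shared identity and style."""
        identity = "; ".join(f"{name}: {desc}"
                             for name, desc in self.character_sheet.items())
        style = ", ".join(f"{k}={v}" for k, v in self.global_settings.items())
        return [f"{identity} | {style} | {scene}"
                for scene in self.page_constraints]
```

Because every page prompt is derived from the same character sheet and settings, editing one page's constraint leaves the prompts for all other pages byte-for-byte unchanged.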

Title: FlowCast: Trajectory Forecasting for Scalable Zero-Cost Speculative Flow Matching

Authors: Divya Jyoti Bajpai, Shubham Agarwal, Apoorv Saxena, Kuldeep Kulkarni, Subrata Mitra, Manjesh Kumar Hanawal
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.01329
Pdf URL: https://arxiv.org/pdf/2602.01329
Copy Paste: [[2602.01329]] FlowCast: Trajectory Forecasting for Scalable Zero-Cost Speculative Flow Matching(https://arxiv.org/abs/2602.01329)
Keywords: generation
Abstract: Flow Matching (FM) has recently emerged as a powerful approach for high-quality visual generation. However, the prohibitively slow inference of FM models, caused by a large number of denoising steps, limits their potential use in real-time or interactive applications. Existing acceleration methods, like distillation, truncation, or consistency training, either degrade quality, incur costly retraining, or lack generalization. We propose FlowCast, a training-free speculative generation framework that accelerates inference by exploiting the fact that FM models are trained to preserve constant velocity. FlowCast speculates future velocity by extrapolating current velocity without incurring additional time cost, and accepts it if it is within a mean-squared error threshold. This constant-velocity forecasting allows redundant steps in stable regions to be aggressively skipped while retaining precision in complex ones. FlowCast is a plug-and-play framework that integrates seamlessly with any FM model and requires no auxiliary networks. We also present a theoretical analysis and bound the worst-case deviation between speculative and full FM trajectories. Empirical evaluations demonstrate that FlowCast achieves $>2.5\times$ speedup in image generation, video generation, and editing tasks, outperforming existing baselines with no quality loss as compared to standard full generation.
摘要：流匹配 (FM) 最近已成为高质量视觉生成的强大方法。然而，由于大量的去噪步骤，它们的推理速度极其缓慢，限制了它们在实时或交互式应用程序中的潜在使用。现有的加速方法，如蒸馏、截断或一致性训练，要么会降低质量，要么需要昂贵的再训练成本，要么缺乏泛化性。我们提出了 FlowCast，这是一种无需训练的推测生成框架，它利用 FM 模型经过训练以保持恒定速度这一事实来加速推理。 FlowCast 通过推断当前速度来推测未来速度，而不会产生额外的时间成本，如果它在均方误差阈值内，则接受它。这种恒速预测允许积极跳过稳定区域中的冗余步骤，同时保持复杂区域中的精度。 FlowCast 是一个即插即用框架，可与任何 FM 模型无缝集成，无需辅助网络。我们还提出了理论分析，并限制了推测轨迹和完整 FM 轨迹之间的最坏情况偏差。实证评估表明，FlowCast 在图像生成、视频生成和编辑任务方面实现了 $>2.5\times$ 加速，优于现有基线，与标准完整生成相比，没有质量损失。
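The idea of skipping steps where the velocity is nearly constant can be illustrated with a toy Euler integrator. This is a minimal sketch under assumptions, not FlowCast itself: the function name `integrate_with_skips`, the rule of comparing two consecutive velocities, and the one-step skip are hypothetical stand-ins, since the abstract does not specify how a speculated velocity is verified.

```python
import numpy as np


def integrate_with_skips(velocity_fn, x0, n_steps=20, mse_threshold=1e-4):
    """Euler integration from t=0 to t=1 with a toy constant-velocity skip rule.

    Returns (x_final, n_model_calls). `velocity_fn(x, t)` stands in for the
    FM model; every call to it counts as one model evaluation.
    """
    dt = 1.0 / n_steps
    x = np.asarray(x0, dtype=float).copy()
    prev_v = None
    calls = 0
    step = 0
    while step < n_steps:
        v = velocity_fn(x, step * dt)
        calls += 1
        x = x + dt * v
        step += 1
        # Assumed rule: if the velocity barely changed since the last model
        # call, speculate that it stays constant for one more step and reuse
        # it without calling the model again.
        if prev_v is not None and step < n_steps:
            if np.mean((v - prev_v) ** 2) < mse_threshold:
                x = x + dt * v
                step += 1
        prev_v = v
    return x, calls
```

On a perfectly straight flow (constant velocity) this toy rule reaches the same endpoint as plain Euler integration with fewer model calls; on a curved flow the threshold decides how often the shortcut is taken.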

Title: Beyond Pixels: Visual Metaphor Transfer via Schema-Driven Agentic Reasoning

Authors: Yu Xu, Yuxin Zhang, Juan Cao, Lin Gao, Chunyu Wang, Oliver Deussen, Tong-Yee Lee, Fan Tang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2602.01335
Pdf URL: https://arxiv.org/pdf/2602.01335
Copy Paste: [[2602.01335]] Beyond Pixels: Visual Metaphor Transfer via Schema-Driven Agentic Reasoning(https://arxiv.org/abs/2602.01335)
Keywords: generation, generative
Abstract: A visual metaphor constitutes a high-order form of human creativity, employing cross-domain semantic fusion to transform abstract concepts into impactful visual rhetoric. Despite the remarkable progress of generative AI, existing models remain largely confined to pixel-level instruction alignment and surface-level appearance preservation, failing to capture the underlying abstract logic necessary for genuine metaphorical generation. To bridge this gap, we introduce the task of Visual Metaphor Transfer (VMT), which challenges models to autonomously decouple the "creative essence" from a reference image and re-materialize that abstract logic onto a user-specified target subject. We propose a cognitive-inspired, multi-agent framework that operationalizes Conceptual Blending Theory (CBT) through a novel Schema Grammar ("G"). This structured representation decouples relational invariants from specific visual entities, providing a rigorous foundation for cross-domain logic re-instantiation. Our pipeline executes VMT through a collaborative system of specialized agents: a perception agent that distills the reference into a schema, a transfer agent that maintains generic space invariance to discover apt carriers, a generation agent for high-fidelity synthesis and a hierarchical diagnostic agent that mimics a professional critic, performing closed-loop backtracking to identify and rectify errors across abstract logic, component selection, and prompt encoding. Extensive experiments and human evaluations demonstrate that our method significantly outperforms SOTA baselines in metaphor consistency, analogy appropriateness, and visual creativity, paving the way for automated high-impact creative applications in advertising and media. Source code will be made publicly available.
摘要：视觉隐喻构成了人类创造力的高阶形式，利用跨领域语义融合将抽象概念转化为有影响力的视觉修辞。尽管生成式人工智能取得了显着进步，但现有模型仍然很大程度上局限于像素级指令对齐和表面级外观保留，未能捕获真正隐喻生成所需的底层抽象逻辑。为了弥补这一差距，我们引入了视觉隐喻迁移（VMT）的任务，该任务挑战模型自动将“创意本质”与参考图像解耦，并将抽象逻辑重新具体化到用户指定的目标主题上。我们提出了一个认知启发的多智能体框架，通过新颖的模式语法（“G”）来操作概念混合理论（CBT）。这种结构化表示将关系不变量与特定视觉实体解耦，为跨域逻辑重新实例化提供了严格的基础。我们的管道通过专门代理的协作系统来执行VMT：感知代理将参考提炼成模式，传输代理保持通用空间不变性以发现合适的载体，用于高保真合成的生成代理和模仿专业批评家的分层诊断代理，执行闭环回溯以识别和纠正抽象逻辑、组件选择和提示编码中的错误。大量的实验和人类评估表明，我们的方法在隐喻一致性、类比适当性和视觉创造力方面显着优于 SOTA 基线，为广告和媒体中自动化的高影响力创意应用铺平了道路。源代码将公开。

Title: MTC-VAE: Multi-Level Temporal Compression with Content Awareness

Authors: Yubo Dong, Linchao Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.01340
Pdf URL: https://arxiv.org/pdf/2602.01340
Copy Paste: [[2602.01340]] MTC-VAE: Multi-Level Temporal Compression with Content Awareness(https://arxiv.org/abs/2602.01340)
Keywords: generative
Abstract: Latent Video Diffusion Models (LVDMs) rely on Variational Autoencoders (VAEs) to compress videos into compact latent representations. For continuous VAEs, achieving higher compression rates is desirable; yet, the efficiency notably declines when extra sampling layers are added without expanding the dimensions of hidden channels. In this paper, we present a technique to convert fixed compression rate VAEs into models that support multi-level temporal compression, providing a straightforward and minimal fine-tuning approach to counteract performance decline at elevated compression. We examine how varying compression levels impact model performance over video segments with diverse characteristics, offering empirical evidence on the effectiveness of our proposed approach. We also investigate the integration of our multi-level temporal compression VAE with diffusion-based generative models, DiT, highlighting successful concurrent training and compatibility within these frameworks. This investigation illustrates the potential uses of multi-level temporal compression.
摘要：潜在视频扩散模型 (LVDM) 依靠变分自动编码器 (VAE) 将视频压缩为紧凑的潜在表示。对于连续变分自动编码器（VAE），需要实现更高的压缩率；然而，当添加额外的采样层而不扩展隐藏通道的维度时，效率显着下降。在本文中，我们提出了一种将固定压缩率 VAE 转换为支持多级时间压缩的模型的技术，提供了一种简单且最小的微调方法来抵消压缩率提高时的性能下降。我们研究了不同的压缩级别如何影响具有不同特征的视频片段的模型性能，为我们提出的方法的有效性提供了经验证据。我们还研究了多级时间压缩 VAE 与基于扩散的生成模型 DiT 的集成，强调了这些框架内成功的并发训练和兼容性。这项研究说明了多级时间压缩的潜在用途。

Title: Adaptive Visual Autoregressive Acceleration via Dual-Linkage Entropy Analysis

Authors: Yu Zhang, Jingyi Liu, Feng Liu, Duoqian Miao, Qi Zhang, Kexue Fu, Changwei Wang, Longbing Cao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.01345
Pdf URL: https://arxiv.org/pdf/2602.01345
Copy Paste: [[2602.01345]] Adaptive Visual Autoregressive Acceleration via Dual-Linkage Entropy Analysis(https://arxiv.org/abs/2602.01345)
Keywords: generation
Abstract: Visual AutoRegressive modeling (VAR) suffers from substantial computational cost due to the massive token count involved. Failing to account for the continuous evolution of modeling dynamics, existing VAR token reduction methods face three key limitations: heuristic stage partition, non-adaptive schedules, and limited acceleration scope, thereby leaving significant acceleration potential untapped. Since entropy variation intrinsically reflects the transition of predictive uncertainty, it offers a principled measure to capture modeling dynamics evolution. Therefore, we propose NOVA, a training-free token reduction acceleration framework for VAR models via entropy analysis. NOVA adaptively determines the acceleration activation scale during inference by identifying, online, the inflection point of scale entropy growth. Through scale-linkage and layer-linkage ratio adjustment, NOVA dynamically computes distinct token reduction ratios for each scale and layer, pruning low-entropy tokens while reusing the cache derived from the residuals at the prior scale to accelerate inference and maintain generation quality. Extensive experiments and analyses validate NOVA as a simple yet effective training-free acceleration framework.
摘要：由于涉及大量令牌，视觉自回归模型 (VAR) 面临着巨大的计算成本。由于未能考虑建模动态的持续演变，现有的 VAR 令牌缩减方法面临三个关键限制：启发式阶段划分、非自适应调度和有限的加速范围，从而导致未开发显着的加速潜力。由于熵变化本质上反映了预测不确定性的转变，因此它提供了捕获建模动态演化的原则性措施。因此，我们提出 NOVA，一种通过熵分析用于 VAR 模型的免训练令牌缩减加速框架。 NOVA通过在线识别尺度熵增长的拐点，自适应地确定推理过程中的加速激活尺度。通过尺度联动和层联动比率调整，NOVA 动态计算每个尺度和层的不同令牌缩减比率，修剪低熵令牌，同时重用从先前尺度的残差得出的缓存，以加速推理并保持生成质量。大量的实验和分析验证了 NOVA 是一个简单而有效的免训练加速框架。
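The step of pruning low-entropy tokens can be made concrete with a generic entropy computation. This is a minimal sketch under assumptions: the function names, the use of softmax logits, and the fixed `reduction_ratio` are hypothetical; the scale-linkage and layer-linkage ratio adjustment and the cache reuse described in the abstract are not modeled.

```python
import numpy as np


def token_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits) for each token.

    `logits` has shape (n_tokens, vocab_size).
    """
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)


def keep_high_entropy(logits, reduction_ratio):
    """Sorted indices of the tokens kept after dropping the lowest-entropy fraction."""
    h = token_entropy(logits)
    n_keep = max(1, int(round(len(h) * (1.0 - reduction_ratio))))
    order = np.argsort(-h, kind="stable")  # highest entropy first
    return np.sort(order[:n_keep])
```

Tokens whose predictive distribution is already sharply peaked carry little remaining uncertainty, which is the intuition behind dropping them first.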

Title: T2M Mamba: Motion Periodicity-Saliency Coupling Approach for Stable Text-Driven Motion Generation

Authors: Xingzu Zhan, Chen Xie, Honghang Chen, Yixun Lin, Xiaochun Mai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.01352
Pdf URL: https://arxiv.org/pdf/2602.01352
Copy Paste: [[2602.01352]] T2M Mamba: Motion Periodicity-Saliency Coupling Approach for Stable Text-Driven Motion Generation(https://arxiv.org/abs/2602.01352)
Keywords: generation
Abstract: Text-to-motion generation, which converts motion language descriptions into coherent 3D human motion sequences, has attracted increasing attention in fields such as avatar animation and humanoid robotic interaction. Though existing models have achieved significant fidelity, they still suffer from two core limitations: (i) They treat motion periodicity and keyframe saliency as independent factors, overlooking their coupling and causing generation drift in long sequences. (ii) They are fragile to semantically equivalent paraphrases, where minor synonym substitutions distort textual embeddings, propagating through the decoder and producing unstable or erroneous motions. In this work, we propose T2M Mamba to address these limitations by (i) proposing Periodicity-Saliency Aware Mamba, which utilizes novel algorithms for keyframe weight estimation via enhanced Density Peaks Clustering and motion periodicity estimation via FFT-accelerated autocorrelation to capture coupled dynamics with minimal computational overhead, and (ii) constructing a Periodic Differential Cross-modal Alignment Module (PDCAM) to enhance robust alignment of textual and motion embeddings. Extensive experiments on the HumanML3D and KIT-ML datasets confirm the effectiveness of our approach, which achieves an FID of 0.068 and consistent gains on all other metrics.
摘要：文本到动作生成将动作语言描述转换为连贯的 3D 人体动作序列，在阿凡达动画和人形机器人交互等领域引起了越来越多的关注。尽管现有模型已经实现了显着的保真度，但它们仍然存在两个核心局限性：（i）它们将运动周期性和关键帧显着性视为独立因素，忽略了它们的耦合并导致长序列中的生成漂移。 (ii) 它们对于语义等效的释义很脆弱，其中较小的同义词替换会扭曲文本嵌入，通过解码器传播并产生不稳定或错误的动作。在这项工作中，我们提出 T2M Mamba 来解决这些限制，方法是 (i) 提出周期性显着性感知 Mamba，它利用新算法通过增强的密度峰值聚类进行关键帧权重估计，并通过 FFT 加速自相关进行运动周期性估计，以最小的计算开销捕获耦合动态，以及 (ii) 构建周期性差分跨模态对齐模块 (PDCAM) 以增强文本和运动嵌入的鲁棒对齐。我们已经在 HumanML3D 和 KIT-ML 数据集上进行了广泛的实验，证实了我们方法的有效性，实现了 0.068 的 FID 以及所有其他指标的一致增益。
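Periodicity estimation via FFT-accelerated autocorrelation can be sketched with the textbook approach (autocorrelation as the inverse FFT of the power spectrum). This is a generic illustration under assumptions, not the authors' algorithm: the function names and the peak-picking rule (largest autocorrelation value after the first negative lag) are hypothetical, and a 1D signal stands in for a motion feature.

```python
import numpy as np


def autocorrelation_fft(signal):
    """Linear autocorrelation of a 1D signal, computed in O(n log n) via FFT."""
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()
    n = len(x)
    # Zero-pad to length 2n so the circular correlation equals the linear one.
    spectrum = np.fft.rfft(x, n=2 * n)
    acf = np.fft.irfft(spectrum * np.conj(spectrum), n=2 * n)[:n]
    return acf / acf[0] if acf[0] > 0 else acf


def estimate_period(signal):
    """Lag (in samples) of the strongest autocorrelation peak, or None."""
    acf = autocorrelation_fft(signal)
    negative = np.flatnonzero(acf < 0)
    if negative.size == 0:
        return None  # no oscillation detected
    start = int(negative[0])  # skip the trivial peak around lag 0
    return start + int(np.argmax(acf[start:]))
```

For example, `estimate_period(np.sin(2 * np.pi * np.arange(200) / 25))` recovers the 25-sample period of the sine wave.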

Title: PolyGen: Fully Synthetic Vision-Language Training via Multi-Generator Ensembles

Authors: Leonardo Brusini, Cristian Sbrolli, Eugenio Lomurno, Toshihiko Yamasaki, Matteo Matteucci
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2602.01370
Pdf URL: https://arxiv.org/pdf/2602.01370
Copy Paste: [[2602.01370]] PolyGen: Fully Synthetic Vision-Language Training via Multi-Generator Ensembles(https://arxiv.org/abs/2602.01370)
Keywords: generative
Abstract: Synthetic data offers a scalable solution for vision-language pre-training, yet current state-of-the-art methods typically rely on scaling up a single generative backbone, which introduces generator-specific spectral biases and limits feature diversity. In this work, we introduce PolyGen, a framework that redefines synthetic data construction by prioritizing manifold coverage and compositional rigor over simple dataset size. PolyGen employs a Polylithic approach to train on the intersection of architecturally distinct generators, effectively marginalizing out model-specific artifacts. Additionally, we introduce a Programmatic Hard Negative curriculum that enforces fine-grained syntactic understanding. By structurally reallocating the same data budget from unique captions to multi-source variations, PolyGen achieves a more robust feature space, outperforming the leading single-source baseline (SynthCLIP) by +19.0% on aggregate multi-task benchmarks and by +9.1% on the SugarCrepe++ compositionality benchmark. These results demonstrate that structural diversity is a more data-efficient scaling law than simply increasing the volume of single-source samples.
摘要：合成数据为视觉语言预训练提供了可扩展的解决方案，但当前最先进的方法通常依赖于扩展单个生成主干，这会引入特定于生成器的光谱偏差并限制特征多样性。在这项工作中，我们介绍了 PolyGen，这是一个框架，它通过优先考虑流形覆盖和组合严谨性而不是简单数据集大小来重新定义合成数据构造。 PolyGen 采用 Polylithic 方法在架构不同的生成器的交集上进行训练，有效地边缘化特定于模型的工件。此外，我们还引入了程序化硬否定课程，以加强细粒度的句法理解。通过从结构上重新分配相同的数据预算，从独特的字幕到多源变体，PolyGen 实现了更强大的特征空间，在聚合多任务基准和 SugarCrepe++ 组合性基准 (+9.1%) 上比领先的单源基线 (SynthCLIP) 高出 +19.0%。这些结果表明，与简单地增加单源样本的数量相比，结构多样性是一种更有效的数据缩放法则。

Title: PromptRL: Prompt Matters in RL for Flow-Based Image Generation

Authors: Fu-Yun Wang, Han Zhang, Michael Gharbi, Hongsheng Li, Taesung Park
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2602.01382
Pdf URL: https://arxiv.org/pdf/2602.01382
Copy Paste: [[2602.01382]] PromptRL: Prompt Matters in RL for Flow-Based Image Generation(https://arxiv.org/abs/2602.01382)
Keywords: generation
Abstract: Flow matching models (FMs) have revolutionized text-to-image (T2I) generation, with reinforcement learning (RL) serving as a critical post-training strategy for alignment with reward objectives. In this research, we show that current RL pipelines for FMs suffer from two underappreciated yet important limitations: sample inefficiency due to insufficient generation diversity, and pronounced prompt overfitting, where models memorize specific training formulations and exhibit dramatic performance collapse when evaluated on semantically equivalent but stylistically varied prompts. We present PromptRL (Prompt Matters in RL for Flow-Based Image Generation), a framework that incorporates language models (LMs) as trainable prompt refinement agents directly within the flow-based RL optimization loop. This design yields two complementary benefits: rapid development of sophisticated prompt rewriting capabilities and, critically, a synergistic training regime that reshapes the optimization dynamics. PromptRL achieves state-of-the-art performance across multiple benchmarks, obtaining scores of 0.97 on GenEval, 0.98 on OCR accuracy, and 24.05 on PickScore. Furthermore, we validate the effectiveness of our RL approach on large-scale image editing models, improving the EditReward of FLUX.1-Kontext from 1.19 to 1.43 with only 0.06 million rollouts, surpassing Gemini 2.5 Flash Image (also known as Nano Banana), which scores 1.37, and achieving comparable performance with ReasonNet (1.44), which relied on fine-grained data annotations along with a complex multi-stage training. Our extensive experiments empirically demonstrate that PromptRL consistently achieves higher performance ceilings while requiring over 2$\times$ fewer rollouts compared to naive flow-only RL. Our code is available at this https URL.
摘要：流匹配模型 (FM) 彻底改变了文本到图像 (T2I) 的生成，强化学习 (RL) 是与奖励目标保持一致的关键训练后策略。在这项研究中，我们表明，当前 FM 的 RL 管道存在两个未被充分认识但很重要的局限性：由于生成多样性不足而导致样本效率低下，以及明显的提示过度拟合，其中模型会记住特定的训练公式，并在对语义等效但风格不同的提示进行评估时表现出戏剧性的性能崩溃。我们提出了 PromptRL（Prompt Matters in RL for Flow-Based Image Generation），这是一个框架，它将语言模型 (LM) 作为可训练的提示细化代理直接纳入基于流的 RL 优化循环中。这种设计产生了两个互补的好处：复杂的即时重写功能的快速开发，以及最重要的是重塑优化动态的协同训练机制。 PromptRL 在多个基准测试中实现了最先进的性能，在 GenEval 上获得了 0.97 的分数，在 OCR 准确性上获得了 0.98 的分数，在 PickScore 上获得了 24.05 的分数。此外，我们验证了我们的 RL 方法在大规模图像编辑模型上的有效性，仅在 6 万次部署的情况下将 FLUX.1-Kontext 的 EditReward 从 1.19 提高到 1.43，超过了得分 1.37 的 Gemini 2.5 Flash Image（也称为 Nano Banana），并实现了与 ReasonNet（1.44）相当的性能，ReasonNet 依赖于细粒度的数据注释和复杂的多阶段培训。我们大量的实验凭经验证明，与朴素的仅流 RL 相比，PromptRL 始终能够实现更高的性能上限，同时所需的部署次数减少 2$\times$ 以上。我们的代码可以在这个 https URL 上找到。

Title: Cross-Paradigm Evaluation of Gaze-Based Semantic Object Identification for Intelligent Vehicles

Authors: Penghao Deng, Jidong J. Yang, Jiachen Bian
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2602.01452
Pdf URL: https://arxiv.org/pdf/2602.01452
Copy Paste: [[2602.01452]] Cross-Paradigm Evaluation of Gaze-Based Semantic Object Identification for Intelligent Vehicles(https://arxiv.org/abs/2602.01452)
Keywords: generation
Abstract: Understanding where drivers direct their visual attention during driving, as characterized by gaze behavior, is critical for developing next-generation advanced driver-assistance systems and improving road safety. This paper tackles this challenge as a semantic identification task from the road scenes captured by a vehicle's front-view camera. Specifically, the collocation of gaze points with object semantics is investigated using three distinct vision-based approaches: direct object detection (YOLOv13), segmentation-assisted classification (SAM2 paired with EfficientNetV2 versus YOLOv13), and query-based Vision-Language Models, VLMs (Qwen2.5-VL-7b versus Qwen2.5-VL-32b). The results demonstrate that the direct object detection (YOLOv13) and Qwen2.5-VL-32b significantly outperform other approaches, achieving Macro F1-Scores over 0.84. The large VLM (Qwen2.5-VL-32b), in particular, exhibited superior robustness and performance for identifying small, safety-critical objects such as traffic lights, especially in adverse nighttime conditions. Conversely, the segmentation-assisted paradigm suffers from a "part-versus-whole" semantic gap that leads to large recall failures. The results reveal a fundamental trade-off between the real-time efficiency of traditional detectors and the richer contextual understanding and robustness offered by large VLMs. These findings provide critical insights and practical guidance for the design of future human-aware intelligent driver monitoring systems.
摘要：了解驾驶员在驾驶过程中将视觉注意力转移到何处（以凝视行为为特征）对于开发下一代先进驾驶员辅助系统和改善道路安全至关重要。本文通过车辆前视摄像头捕获的道路场景来解决这一挑战，作为语义识别任务。具体来说，使用三种不同的基于视觉的方法研究凝视点与对象语义的搭配：直接对象检测（YOLOv13）、分割辅助分类（SAM2 与 EfficientNetV2 与 YOLOv13 配对）和基于查询的视觉语言模型、VLM（Qwen2.5-VL-7b 与 Qwen2.5-VL-32b）。结果表明，直接目标检测 (YOLOv13) 和 Qwen2.5-VL-32b 显着优于其他方法，宏 F1 得分超过 0.84。特别是大型 VLM (Qwen2.5-VL-32b) 在识别交通灯等小型安全关键物体方面表现出卓越的稳健性和性能，尤其是在恶劣的夜间条件下。相反，分割辅助范式存在“部分与整体”语义差距，导致召回失败。结果揭示了传统检测器的实时效率与大型 VLM 提供的更丰富的上下文理解和鲁棒性之间的基本权衡。这些发现为未来人类感知智能驾驶员监控系统的设计提供了重要的见解和实用指导。
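Collocating a gaze point with detector output, as in the direct object detection paradigm, reduces to a point-in-box test. This is a minimal sketch under assumptions: the `Detection` type, the `gazed_object` helper, and the tie-break rule (prefer the smallest box containing the gaze point) are hypothetical and not taken from the paper; no real detector API is called.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class Detection(NamedTuple):
    """Hypothetical detector output: a label and an axis-aligned box in pixels."""
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def gazed_object(gaze_xy: Tuple[float, float],
                 detections: Sequence[Detection]) -> Optional[str]:
    """Label of the smallest detected box containing the gaze point, if any."""
    gx, gy = gaze_xy
    hits = [d for d in detections
            if d.x_min <= gx <= d.x_max and d.y_min <= gy <= d.y_max]
    if not hits:
        return None
    # Assumed tie-break: small objects (e.g. a traffic light) win over the
    # larger boxes that enclose them.
    smallest = min(hits, key=lambda d: (d.x_max - d.x_min) * (d.y_max - d.y_min))
    return smallest.label
```

A gaze point that falls outside every box yields `None`, which would count as a miss when recall is computed.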

Title: P-EAGLE: Parallel-Drafting EAGLE with Scalable Training

Authors: Mude Hui, Xin Huang, Jaime Campos Salas, Yue Sun, Nathan Pemberton, Xiang Song, Ashish Khetan, George Karypis
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2602.01469
Pdf URL: https://arxiv.org/pdf/2602.01469
Copy Paste: [[2602.01469]] P-EAGLE: Parallel-Drafting EAGLE with Scalable Training(https://arxiv.org/abs/2602.01469)
Keywords: generation
Abstract: Reasoning LLMs produce longer outputs, requiring speculative decoding drafters trained on extended sequences. Parallel drafting - predicting multiple tokens per forward pass - offers latency benefits over sequential generation, but training complexity scales quadratically with the product of sequence length and parallel positions, rendering long-context training impractical. We present P(arallel)-EAGLE, which transforms EAGLE from autoregressive to parallel multi-token prediction via a learnable shared hidden state. To scale training to long contexts, we develop a framework featuring attention mask pre-computation and sequence partitioning techniques, enabling gradient accumulation within individual sequences for parallel-prediction training. We implement P-EAGLE in vLLM and demonstrate speedups of 1.10-1.36x over autoregressive EAGLE-3 across GPT-OSS 120B, 20B, and Qwen3-Coder 30B.
摘要：推理型 LLM 会产生更长的输出，因此需要在更长序列上训练的推测解码草稿模型。并行起草（预测每个前向传递的多个标记）比顺序生成提供了延迟优势，但训练复杂性随着序列长度和并行位置的乘积呈二次方扩展，使得长上下文训练不切实际。我们提出了 P(arallel)-EAGLE，它通过可学习的共享隐藏状态将 EAGLE 从自回归转换为并行多标记预测。为了将训练扩展到长上下文，我们开发了一个具有注意掩模预计算和序列分区技术的框架，从而能够在单个序列内进行梯度累积以进行并行预测训练。我们在 vLLM 中实现 P-EAGLE，并在 GPT-OSS 120B、20B 和 Qwen3-Coder 30B 上展示了比自回归 EAGLE-3 提高 1.10-1.36 倍的速度。

Title: OpInf-LLM: Parametric PDE Solving with LLMs via Operator Inference

Authors: Zhuoyuan Wang, Hanjiang Hu, Xiyu Deng, Saviz Mowlavi, Yorie Nakahira
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2602.01493
Pdf URL: https://arxiv.org/pdf/2602.01493
Copy Paste: [[2602.01493]] OpInf-LLM: Parametric PDE Solving with LLMs via Operator Inference(https://arxiv.org/abs/2602.01493)
Keywords: generation
Abstract: Solving diverse partial differential equations (PDEs) is fundamental in science and engineering. Large language models (LLMs) have demonstrated strong capabilities in code generation, symbolic reasoning, and tool use, but reliably solving PDEs across heterogeneous settings remains challenging. Prior work on LLM-based code generation and transformer-based foundation models for PDE learning has shown promising advances. However, a persistent trade-off between execution success rate and numerical accuracy arises, particularly when generalization to unseen parameters and boundary conditions is required. In this work, we propose OpInf-LLM, an LLM parametric PDE solving framework based on operator inference. The proposed framework leverages a small amount of solution data to enable accurate prediction of diverse PDE instances, including unseen parameters and configurations, and provides seamless integration with LLMs for natural language specification of PDE solving tasks. Its low computational demands and unified tool interface further enable a high execution success rate across heterogeneous settings. By combining operator inference with LLM capabilities, OpInf-LLM opens new possibilities for generalizable reduced-order modeling in LLM-based PDE solving.
摘要：求解不同的偏微分方程 (PDE) 是科学和工程的基础。大型语言模型 (LLM) 在代码生成、符号推理和工具使用方面表现出了强大的能力，但跨异构环境可靠地解决偏微分方程仍然具有挑战性。先前针对 PDE 学习的基于 LLM 的代码生成和基于 Transformer 的基础模型的工作已经显示出有希望的进展。然而，执行成功率和数值精度之间会出现持续的权衡，特别是当需要推广到看不见的参数和边界条件时。在这项工作中，我们提出了 OpInf-LLM，一种基于算子推理的 LLM 参数化 PDE 求解框架。所提出的框架利用少量的解决方案数据来准确预测不同的 PDE 实例，包括看不见的参数和配置，并提供与 LLM 的无缝集成，以实现 PDE 求解任务的自然语言规范。其低计算需求和统一的工具界面进一步实现了跨异构设置的高执行成功率。通过将算子推理与 LLM 功能相结合，OpInf-LLM 为基于 LLM 的 PDE 求解中的可推广降阶建模开辟了新的可能性。

Title: A Relative-Budget Theory for Reinforcement Learning with Verifiable Rewards in Large Language Model Reasoning

Authors: Akifumi Wachi, Hirota Kinoshita, Shokichi Takakura, Rei Higuchi, Taiji Suzuki
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2602.01523
Pdf URL: https://arxiv.org/pdf/2602.01523
Copy Paste: [[2602.01523]] A Relative-Budget Theory for Reinforcement Learning with Verifiable Rewards in Large Language Model Reasoning(https://arxiv.org/abs/2602.01523)
Keywords: generation
Abstract: Reinforcement learning (RL) is a dominant paradigm for improving the reasoning abilities of large language models, yet its effectiveness varies across tasks and compute budgets. We propose a \emph{relative-budget} theory explaining this variation through a single quantity called relative budget $\xi := H/\mathbb{E}[T]$, where $H$ is the generation horizon (token budget) and $T$ denotes the number of tokens until the first correct solution under a base policy. We show that $\xi$ determines sample efficiency by controlling reward variance and the likelihood of informative trajectories. Our analysis reveals three regimes: in the \emph{deficient} regime ($\xi \to 0$), informative trajectories are rare and the sample complexity explodes; in the \emph{balanced} regime ($\xi=\Theta(1)$), informative trajectories occur with non-negligible probability and RL is maximally sample-efficient; and in the \emph{ample} regime ($\xi \to \infty$), learning remains stable but marginal gains per iteration diminish. We further provide finite-sample guarantees for online RL that characterize learning progress across these regimes. Specifically, in a case study under idealized distributional assumptions, we show that the relative budget grows linearly over iterations. Our empirical results confirm these predictions in realistic settings, identifying a budget $\xi \in [1.5, 2.0]$ that maximizes learning efficiency and coincides with peak reasoning performance.
摘要：强化学习 (RL) 是提高大型语言模型推理能力的主要范例，但其有效性因任务和计算预算而异。我们提出了一个 \emph{relative-budget} 理论，通过称为相对预算 $\xi := H/\mathbb{E}[T]$ 的单个量来解释这种变化，其中 $H$ 是生成范围（代币预算），$T$ 表示在基本策略下第一个正确解决方案之前的代币数量。我们表明 $\xi$ 通过控制奖励方差和信息轨迹的可能性来确定样本效率。我们的分析揭示了三种状态：在 \emph{deficient} 状态（$\xi \to 0$）中，信息轨迹很少，样本复杂性爆炸；在 \emph{balanced} 状态 ($\xi=\Theta(1)$) 中，信息轨迹以不可忽略的概率出现，并且 RL 具有最大样本效率；在 \emph{ample} 状态下（$\xi \to \infty$），学习保持稳定，但每次迭代的边际收益减少。我们进一步为在线强化学习提供有限样本保证，以表征这些体系中的学习进度。具体来说，在理想化分布假设下的案例研究中，我们表明相对预算随着迭代呈线性增长。我们的实证结果在现实环境中证实了这些预测，确定了预算 $\xi \in [1.5, 2.0]$，它可以最大限度地提高学习效率并与峰值推理性能相一致。
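The relative budget $\xi := H/\mathbb{E}[T]$ is simple to estimate from samples. This is a minimal sketch under assumptions: the function names are hypothetical, $\mathbb{E}[T]$ is approximated by a sample mean over observed token counts to the first correct solution, and the interval check only restates the empirically reported range $[1.5, 2.0]$; it does not encode the asymptotic regime definitions.

```python
from statistics import mean


def relative_budget(horizon, tokens_to_first_correct):
    """Estimate xi = H / E[T] from sampled values of T under the base policy."""
    return horizon / mean(tokens_to_first_correct)


def in_reported_range(xi, low=1.5, high=2.0):
    """True if xi falls in the budget range the abstract reports as most efficient."""
    return low <= xi <= high


# Example: a 1000-token horizon and three sampled values of T.
xi = relative_budget(1000, [400, 600, 500])
```

Here the sample mean of $T$ is 500 tokens, so the estimate is $\xi = 1000/500 = 2.0$, at the upper end of the reported range.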

Title: The Inlet Rank Collapse in Implicit Neural Representations: Diagnosis and Unified Remedy

Authors: Jianqiao Zheng, Hemanth Saratchandran, Simon Lucey
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2602.01526
Pdf URL: https://arxiv.org/pdf/2602.01526
Copy Paste: [[2602.01526]] The Inlet Rank Collapse in Implicit Neural Representations: Diagnosis and Unified Remedy(https://arxiv.org/abs/2602.01526)
Keywords: restoration
Abstract: Implicit Neural Representations (INRs) have revolutionized continuous signal modeling, yet they struggle to recover fine-grained details within finite training budgets. While empirical techniques, such as positional encoding (PE), sinusoidal activations (SIREN), and batch normalization (BN), effectively mitigate this, their theoretical justifications are predominantly post hoc, focusing on the global NTK spectrum only after modifications are applied. In this work, we reverse this paradigm by introducing a structural diagnostic framework. By performing a layer-wise decomposition of the NTK, we mathematically identify the "Inlet Rank Collapse": a phenomenon where the low-dimensional input coordinates fail to span the high-dimensional embedding space, creating a fundamental rank deficiency at the first layer that acts as an expressive bottleneck for the entire network. This framework provides a unified perspective to re-interpret PE, SIREN, and BN as different forms of rank restoration. Guided by this diagnosis, we derive a Rank-Expanding Initialization, a minimalist remedy that ensures the representation rank scales with the layer width without architectural modifications or computational overhead. Our results demonstrate that this principled remedy enables standard MLPs to achieve high-fidelity reconstructions, proving that the key to empowering INRs lies in the structural optimization of the initial rank propagation to effectively populate the latent space.
摘要：隐式神经表示（INR）彻底改变了连续信号建模，但它们很难在有限的训练预算内恢复细粒度的细节。虽然位置编码 (PE)、正弦激活 (SIREN) 和批量归一化 (BN) 等经验技术可以有效缓解这一问题，但它们的理论依据主要是事后的，仅在应用修改后才关注全局 NTK 频谱。在这项工作中，我们通过引入结构诊断框架来扭转这种范式。通过对 NTK 进行分层分解，我们在数学上识别了“入口秩崩溃”：一种低维输入坐标无法跨越高维嵌入空间的现象，在第一层产生了基本的秩缺陷，成为整个网络的表达瓶颈。该框架提供了一个统一的视角，将 PE、SIREN 和 BN 重新解释为不同形式的排名恢复。在这一诊断的指导下，我们得出了等级扩展初始化，这是一种极简的补救措施，可确保表示等级随层宽度而缩放，而无需架构修改或计算开销。我们的结果表明，这种原则性的补救措施使标准 MLP 能够实现高保真度重建，证明增强 INR 的关键在于初始秩传播的结构优化，以有效地填充潜在空间。
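The rank deficiency at the first layer follows from an elementary linear-algebra fact: multiplying $N \times 2$ coordinates by a $2 \times \text{width}$ weight matrix gives a matrix of rank at most 2, however wide the layer is. The sketch below checks this numerically and contrasts it with a Fourier positional encoding. It is an illustration under assumptions, not the paper's NTK analysis or its Rank-Expanding Initialization; the sizes, the random weights, and the `fourier_features` helper are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
coords = rng.uniform(-1.0, 1.0, size=(256, 2))  # 2D input coordinates
width = 64

# Raw coordinates: the first-layer pre-activations have rank <= 2.
W_raw = rng.normal(size=(2, width))
rank_raw = np.linalg.matrix_rank(coords @ W_raw)


def fourier_features(x, n_freqs=8):
    """Map (N, d) coordinates to (N, 2 * d * n_freqs) sin/cos features."""
    freqs = (2.0 ** np.arange(n_freqs)) * np.pi
    angles = x[:, :, None] * freqs                       # (N, d, n_freqs)
    feats = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return feats.reshape(len(x), -1)


# Positional encoding: the inlet now feeds many more independent directions.
encoded = fourier_features(coords)
W_pe = rng.normal(size=(encoded.shape[1], width))
rank_pe = np.linalg.matrix_rank(encoded @ W_pe)

print(rank_raw, rank_pe)
```

The first printed rank is capped by the input dimension, while the second is capped by the number of encoded features, which is one way to read positional encoding as a form of rank restoration.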

Title: Making Avatars Interact: Towards Text-Driven Human-Object Interaction for Controllable Talking Avatars

Authors: Youliang Zhang, Zhengguang Zhou, Zhentao Yu, Ziyao Huang, Teng Hu, Sen Liang, Guozhen Zhang, Ziqiao Peng, Shunkai Li, Yi Chen, Zixiang Zhou, Yuan Zhou, Qinglin Lu, Xiu Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2602.01538
Pdf URL: https://arxiv.org/pdf/2602.01538
Copy Paste: [[2602.01538]] Making Avatars Interact: Towards Text-Driven Human-Object Interaction for Controllable Talking Avatars(https://arxiv.org/abs/2602.01538)
Keywords: generation
Abstract: Generating talking avatars is a fundamental task in video generation. Although existing methods can generate full-body talking avatars with simple human motion, extending this task to grounded human-object interaction (GHOI) remains an open challenge, requiring the avatar to perform text-aligned interactions with surrounding objects. This challenge stems from the need for environmental perception and the control-quality dilemma in GHOI generation. To address this, we propose a novel dual-stream framework, InteractAvatar, which decouples perception and planning from video synthesis for grounded human-object interaction. Leveraging detection to enhance environmental perception, we introduce a Perception and Interaction Module (PIM) to generate text-aligned interaction motions. Additionally, an Audio-Interaction Aware Generation Module (AIM) is proposed to synthesize vivid talking avatars performing object interactions. With a specially designed motion-to-video aligner, PIM and AIM share a similar network structure and enable parallel co-generation of motions and plausible videos, effectively mitigating the control-quality dilemma. Finally, we establish a benchmark, GroundedInter, for evaluating GHOI video generation. Extensive experiments and comparisons demonstrate the effectiveness of our method in generating grounded human-object interactions for talking avatars. Project page: this https URL
摘要：生成会说话的头像是视频生成中的一项基本任务。尽管现有方法可以通过简单的人体动作生成全身说话的化身，但将此任务扩展到接地人与物体交互（GHOI）仍然是一个开放的挑战，要求化身与周围的物体执行文本对齐的交互。这一挑战源于对环境感知的需求和 GHOI 一代的控制质量困境。为了解决这个问题，我们提出了一种新颖的双流框架 InteractAvatar，它将感知和规划与视频合成解耦，以实现基础的人机交互。利用检测来增强环境感知，我们引入了感知和交互模块（PIM）来生成文本对齐的交互动作。此外，还提出了音频交互感知生成模块（AIM）来合成执行对象交互的生动的说话化身。借助专门设计的运动到视频对齐器，PIM 和 AIM 共享相似的网络结构，并能够并行共同生成运动和合理的视频，有效缓解控制质量困境。最后，我们建立了一个基准 GroundedInter，用于评估 GHOI 视频生成。大量的实验和比较证明了我们的方法在为说话的化身生成接地的人机交互方面的有效性。项目页面：此 https URL

Title: InfoTok: Regulating Information Flow for Capacity-Constrained Shared Visual Tokenization in Unified MLLMs

Authors: Lv Tang, Tianyi Zheng, Bo Li, Xingyu Li
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2602.01554
Pdf URL: https://arxiv.org/pdf/2602.01554
Copy Paste: [[2602.01554]] InfoTok: Regulating Information Flow for Capacity-Constrained Shared Visual Tokenization in Unified MLLMs(https://arxiv.org/abs/2602.01554)
Keywords: generation
Abstract: Unified multimodal large language models (MLLMs) integrate image understanding and generation in a single framework, with the visual tokenizer acting as the sole interface that maps visual inputs into tokens for downstream tasks. However, existing shared-token designs are mostly architecture-driven and lack an explicit criterion for what information tokens should preserve to support both understanding and generation. Therefore, we introduce a capacity-constrained perspective, highlighting that in shared-token unified MLLMs the visual tokenizer behaves as a compute-bounded learner, so the token budget should prioritize reusable structure over hard-to-exploit high-entropy variations and redundancy. Motivated by this perspective, we propose InfoTok, an information-regularized visual tokenization mechanism grounded in the Information Bottleneck (IB) principle. InfoTok formulates tokenization as controlling information flow from images to shared tokens to multimodal outputs, yielding a principled trade-off between compression and task relevance via mutual-information regularization. We integrate InfoTok into three representative unified MLLMs without introducing any additional training data. Experiments show consistent improvements on both understanding and generation, supporting information-regularized tokenization as a principled foundation for learning a shared token space in unified MLLMs.
摘要：统一多模态大语言模型 (MLLM) 将图像理解和生成集成在单个框架中，视觉标记生成器充当将视觉输入映射到下游任务标记的唯一接口。然而，现有的共享令牌设计大多是架构驱动的，并且缺乏关于应保留哪些信息令牌以支持理解和生成的明确标准。因此，我们引入了容量受限的视角，强调在共享令牌统一 MLLM 中，视觉令牌生成器表现为计算受限学习器，因此令牌预算应优先考虑可重用结构，而不是难以利用的高熵变化和冗余。受此观点的启发，我们提出了 InfoTok，一种基于信息瓶颈（IB）原理的信息规范化视觉标记化机制。 InfoTok 将标记化制定为控制从图像到共享标记再到多模态输出的信息流，通过互信息正则化在压缩和任务相关性之间产生原则性的权衡。我们将 InfoTok 集成到三个代表性的统一 MLLM 中，而不引入任何额外的训练数据。实验表明，理解和生成方面都得到了持续改进，支持信息规范化标记化作为学习统一 MLLM 中共享标记空间的原则基础。
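For orientation, the classical Information Bottleneck Lagrangian that this kind of trade-off builds on can be written as follows. This is the textbook formulation, stated here as an assumption about the general shape of the objective; the abstract does not give InfoTok's exact loss. The symbols $X$ (input image), $Z$ (shared visual tokens), $Y$ (task-relevant target), and $\beta$ (trade-off weight) are introduced for illustration.

```latex
\min_{p(z \mid x)} \; I(X; Z) \;-\; \beta \, I(Z; Y), \qquad \beta > 0
```

Minimizing $I(X;Z)$ compresses the tokens, while the $-\beta\, I(Z;Y)$ term rewards keeping information relevant to the downstream output; a larger $\beta$ favors task relevance over compression.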

Title: Combined Flicker-banding and Moire Removal for Screen-Captured Images

Authors: Libo Zhu, Zihan Zhou, Zhiyi Zhou, Yiyang Qu, Weihang Zhang, Keyu Shi, Yifan Fu, Yulun Zhang
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2602.01559
Pdf URL: https://arxiv.org/pdf/2602.01559
Copy Paste: [[2602.01559]] Combined Flicker-banding and Moire Removal for Screen-Captured Images(https://arxiv.org/abs/2602.01559)
Keywords: restoration
Abstract: Capturing display screens with mobile devices has become increasingly common, yet the resulting images often suffer from severe degradation caused by the coexistence of moiré patterns and flicker-banding. Because these two artifacts are strongly coupled in real imaging processes, existing methods designed for single degradations fail to generalize to such compound scenarios. In this paper, we present the first systematic study on joint removal of moiré patterns and flicker-banding in screen-captured images, and propose a unified restoration framework, named CLEAR. To support this task, we construct a large-scale dataset containing both moiré patterns and flicker-banding, and introduce an ISP-based flicker simulation pipeline to stabilize model training and expand the degradation distribution. Furthermore, we design a frequency-domain decomposition and re-composition module together with a trajectory alignment loss to enhance the modeling of compound artifacts. Extensive experiments demonstrate that the proposed method consistently outperforms existing image restoration approaches across multiple evaluation metrics, validating its effectiveness in complex real-world scenarios.

Title: Generative Visual Code Mobile World Models

Authors: Woosung Koh, Sungjun Han, Segyu Lee, Se-Young Yun, Jamin Shin
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2602.01576
Pdf URL: https://arxiv.org/pdf/2602.01576
Copy Paste: [[2602.01576]] Generative Visual Code Mobile World Models(https://arxiv.org/abs/2602.01576)
Keywords: generation, generative
Abstract: Mobile Graphical User Interface (GUI) World Models (WMs) offer a promising path for improving mobile GUI agent performance at train- and inference-time. However, current approaches face a critical trade-off: text-based WMs sacrifice visual fidelity, while visual WMs cannot render text precisely and therefore rely on slow, complex pipelines that depend on numerous external models. We propose a novel paradigm: visual world modeling via renderable code generation, where a single Vision-Language Model (VLM) predicts the next GUI state as executable web code that renders to pixels, rather than generating pixels directly. This combines the strengths of both approaches: VLMs retain their linguistic priors for precise text rendering while their pre-training on structured web code enables high-fidelity visual generation. We introduce gWorld (8B, 32B), the first open-weight visual mobile GUI WMs built on this paradigm, along with a data generation framework (gWorld) that automatically synthesizes code-based training data. In extensive evaluation across 4 in- and 2 out-of-distribution benchmarks, gWorld sets a new Pareto frontier in accuracy versus model size, outperforming 8 frontier open-weight models over 50.25x larger. Further analyses show that (1) scaling training data via gWorld yields meaningful gains, (2) each component of our pipeline improves data quality, and (3) stronger world modeling improves downstream mobile GUI policy performance.

Title: Know Your Step: Faster and Better Alignment for Flow Matching Models via Step-aware Advantages

Authors: Zhixiong Yue, Zixuan Ni, Feiyang Ye, Jinshan Zhang, Sheng Shen, Zhenpeng Mi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.01591
Pdf URL: https://arxiv.org/pdf/2602.01591
Copy Paste: [[2602.01591]] Know Your Step: Faster and Better Alignment for Flow Matching Models via Step-aware Advantages(https://arxiv.org/abs/2602.01591)
Keywords: generation
Abstract: Recent advances in flow matching models, particularly with reinforcement learning (RL), have significantly enhanced human preference alignment in few-step text-to-image generators. However, existing RL-based approaches for flow matching models typically rely on numerous denoising steps and suffer from sparse, imprecise reward signals that often lead to suboptimal alignment. To address these limitations, we propose Temperature Annealed Few step Sampling with Group Relative Policy Optimization (TAFS GRPO), a novel framework for training flow-matching text-to-image models into efficient few-step generators that are well aligned with human preferences. Our method iteratively injects adaptive temporal noise onto the results of one-step samples. By repeatedly annealing the model's sampled outputs, it introduces stochasticity into the sampling process while preserving the semantic integrity of each generated image. Moreover, its step-aware advantage integration mechanism is combined with GRPO to avoid the need for a differentiable reward function and to provide dense, step-specific rewards for stable policy optimization. Extensive experiments demonstrate that TAFS GRPO achieves strong performance in few-step text-to-image generation and significantly improves the alignment of generated images with human preferences. The code and models of this work will be made available to facilitate further research.
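
For readers unfamiliar with GRPO, the group-relative advantage it is generally known for can be sketched as below. This is a minimal illustration of the standard normalization only, assuming scalar rewards for one group of samples drawn from the same prompt; the function name and the `eps` constant are hypothetical, and the step-aware integration described in the abstract is not reproduced because the abstract does not specify it.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group (GRPO-style).

    Each advantage is the reward's deviation from the group mean, scaled by
    the group's population standard deviation. No gradient of the reward
    function is needed, only its scalar values.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps (an assumed constant) avoids division by zero when all rewards match.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four images generated from the same prompt.
advantages = group_relative_advantages([0.2, 0.5, 0.9, 0.4])
```

Samples that score above the group mean receive positive advantages and the rest receive negative ones, which is why the reward model only has to be evaluated, not differentiated.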

Title: Boosting Maximum Entropy Reinforcement Learning via One-Step Flow Matching

Authors: Zeqiao Li, Yijing Wang, Haoyu Wang, Zheng Li, Zhiqiang Zuo
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2602.01606
Pdf URL: https://arxiv.org/pdf/2602.01606
Copy Paste: [[2602.01606]] Boosting Maximum Entropy Reinforcement Learning via One-Step Flow Matching(https://arxiv.org/abs/2602.01606)
Keywords: generation
Abstract: Diffusion policies are expressive yet incur high inference latency. Flow Matching (FM) enables one-step generation, but integrating it into Maximum Entropy Reinforcement Learning (MaxEnt RL) is challenging: the optimal policy is an intractable energy-based distribution, and the efficient log-likelihood estimation required to balance exploration and exploitation suffers from severe discretization bias. We propose Flow-based Log-likelihood-Aware Maximum Entropy RL (FLAME), a principled framework that addresses these challenges. First, we derive a Q-Reweighted FM objective that bypasses partition function estimation via importance reweighting. Second, we design a decoupled entropy estimator that rigorously corrects bias, which enables efficient exploration and brings the policy closer to the optimal MaxEnt policy. Third, we integrate the MeanFlow formulation to achieve expressive and efficient one-step control. Empirical results on MuJoCo show that FLAME outperforms Gaussian baselines and matches multi-step diffusion policies with significantly lower inference cost. Code is available at this https URL.
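
As background for why the optimal policy is called an intractable energy-based distribution, the standard MaxEnt RL objective and its optimal policy can be written as follows. This is the textbook formulation, not the paper's own derivation; the temperature $\alpha$, the soft optimal Q-function $Q^{*}$, and the partition function $Z(s)$ are notation assumed here.

```latex
J(\pi) = \mathbb{E}_{\pi}\!\left[ \sum_{t} r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right],
\qquad
\pi^{*}(a \mid s) = \frac{\exp\!\big(Q^{*}(s, a) / \alpha\big)}{Z(s)},
\qquad
Z(s) = \int \exp\!\big(Q^{*}(s, a') / \alpha\big) \, \mathrm{d}a'
```

The integral defining $Z(s)$ over a continuous action space is the partition function that the abstract says the Q-Reweighted FM objective bypasses.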

Title: Token Pruning for In-Context Generation in Diffusion Transformers

Authors: Junqing Lin, Xingyu Zheng, Pei Cheng, Bin Fu, Jingwei Sun, Guangzhong Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.01609
Pdf URL: https://arxiv.org/pdf/2602.01609
Copy Paste: [[2602.01609]] Token Pruning for In-Context Generation in Diffusion Transformers(https://arxiv.org/abs/2602.01609)
Keywords: generation
Abstract: In-context generation significantly enhances Diffusion Transformers (DiTs) by enabling controllable image-to-image generation through reference examples. However, the resulting input concatenation drastically increases sequence length, creating a substantial computational bottleneck. Existing token reduction techniques, primarily tailored for text-to-image synthesis, fall short in this paradigm as they apply uniform reduction strategies, overlooking the inherent role asymmetry between reference contexts and target latents across spatial, temporal, and functional dimensions. To bridge this gap, we introduce ToPi, a training-free token pruning framework tailored for in-context generation in DiTs. Specifically, ToPi utilizes offline calibration-driven sensitivity analysis to identify pivotal attention layers, serving as a robust proxy for redundancy estimation. Leveraging these layers, we derive a novel influence metric to quantify the contribution of each context token for selective pruning, coupled with a temporal update strategy that adapts to the evolving diffusion trajectory. Empirical evaluations demonstrate that ToPi can achieve over 30% speedup in inference while maintaining structural fidelity and visual consistency across complex image generation tasks.
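
The selective pruning step can be pictured as keeping only the highest-scoring context tokens. The sketch below is a generic top-k selection, assuming a per-token influence score has already been computed; the function name, the `keep_ratio` parameter, and the example scores are hypothetical, and ToPi's actual influence metric and temporal update strategy are not reproduced here.

```python
from math import ceil


def prune_context_tokens(scores, keep_ratio):
    """Return the indices of context tokens to keep, in original order.

    `scores` holds one (assumed, precomputed) influence value per context
    token; the tokens with the highest values are retained.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    keep = ceil(len(scores) * keep_ratio)
    # Rank token indices from most to least influential.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restore positional order so the kept tokens stay a valid subsequence.
    return sorted(ranked[:keep])


# Hypothetical influence scores for four reference-context tokens.
kept = prune_context_tokens([0.1, 0.9, 0.3, 0.7], keep_ratio=0.5)
```

Only the reference-context tokens are candidates for removal in this sketch, which mirrors the role asymmetry between context and target latents that the abstract emphasizes.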

Title: Omni-Judge: Can Omni-LLMs Serve as Human-Aligned Judges for Text-Conditioned Audio-Video Generation?

Authors: Susan Liang, Chao Huang, Filippos Bellos, Yolo Yunlong Tang, Qianxiang Shen, Jing Bi, Luchuan Song, Zeliang Zhang, Jason Corso, Chenliang Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.01623
Pdf URL: https://arxiv.org/pdf/2602.01623
Copy Paste: [[2602.01623]] Omni-Judge: Can Omni-LLMs Serve as Human-Aligned Judges for Text-Conditioned Audio-Video Generation?(https://arxiv.org/abs/2602.01623)
Keywords: generation
Abstract: State-of-the-art text-to-video generation models such as Sora 2 and Veo 3 can now produce high-fidelity videos with synchronized audio directly from a textual prompt, marking a new milestone in multi-modal generation. However, evaluating such tri-modal outputs remains an unsolved challenge. Human evaluation is reliable but costly and difficult to scale, while traditional automatic metrics, such as FVD, CLAP, and ViCLIP, focus on isolated modality pairs, struggle with complex prompts, and provide limited interpretability. Omni-modal large language models (omni-LLMs) present a promising alternative: they naturally process audio, video, and text, support rich reasoning, and offer interpretable chain-of-thought feedback. Driven by this, we introduce Omni-Judge, a study assessing whether omni-LLMs can serve as human-aligned judges for text-conditioned audio-video generation. Across nine perceptual and alignment metrics, Omni-Judge achieves correlation comparable to traditional metrics and excels on semantically demanding tasks such as audio-text alignment, video-text alignment, and audio-video-text coherence. It underperforms on high-FPS perceptual metrics, including video quality and audio-video synchronization, due to limited temporal resolution. Omni-Judge provides interpretable explanations that expose semantic or physical inconsistencies, enabling practical downstream uses such as feedback-based refinement. Our findings highlight both the potential and current limitations of omni-LLMs as unified evaluators for multi-modal generation.

Title: PISCES: Annotation-free Text-to-Video Post-Training via Optimal Transport-Aligned Rewards

Authors: Minh-Quan Le, Gaurav Mittal, Cheng Zhao, David Gu, Dimitris Samaras, Mei Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.01624
Pdf URL: https://arxiv.org/pdf/2602.01624
Copy Paste: [[2602.01624]] PISCES: Annotation-free Text-to-Video Post-Training via Optimal Transport-Aligned Rewards(https://arxiv.org/abs/2602.01624)
Keywords: generation, generative
Abstract: Text-to-video (T2V) generation aims to synthesize videos with high visual quality and temporal consistency that are semantically aligned with input text. Reward-based post-training has emerged as a promising direction to improve the quality and semantic alignment of generated videos. However, recent methods either rely on large-scale human preference annotations or operate on misaligned embeddings from pre-trained vision-language models, leading to limited scalability or suboptimal supervision. We present PISCES, an annotation-free post-training algorithm that addresses these limitations via a novel Dual Optimal Transport (OT)-aligned Rewards module. To align reward signals with human judgment, PISCES uses OT to bridge text and video embeddings at both distributional and discrete token levels, enabling reward supervision to fulfill two objectives: (i) a Distributional OT-aligned Quality Reward that captures overall visual quality and temporal coherence; and (ii) a Discrete Token-level OT-aligned Semantic Reward that enforces semantic, spatio-temporal correspondence between text and video tokens. To our knowledge, PISCES is the first to improve annotation-free reward supervision in generative post-training through the lens of OT. Experiments on both short- and long-video generation show that PISCES outperforms both annotation-based and annotation-free methods on VBench across Quality and Semantic scores, with human preference studies further validating its effectiveness. We show that the Dual OT-aligned Rewards module is compatible with multiple optimization paradigms, including direct backpropagation and reinforcement learning fine-tuning.

Title: Chance-Constrained Inference for Hallucination Risk Control in Large Language Models

Authors: Sreenivasan Mohandas
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2602.01637
Pdf URL: https://arxiv.org/pdf/2602.01637
Copy Paste: [[2602.01637]] Chance-Constrained Inference for Hallucination Risk Control in Large Language Models(https://arxiv.org/abs/2602.01637)
Keywords: generation
Abstract: Large language models generate outputs stochastically and may produce fluent but invalid responses, including factual hallucinations. Existing mitigation strategies reduce average error rates but do not provide explicit control over the frequency of such failures under repeated use. We formulate inference as a deployment-time risk control problem and introduce chance-constrained inference, which directly bounds the probability of hallucinations among accepted generations. Hallucinations are modeled as stochastic constraint violations, and we show that confidence-based selective prediction does not, in general, imply probabilistic risk guarantees. To enforce chance constraints efficiently, we propose a sequential, anytime-valid inference procedure that adaptively certifies feasibility or infeasibility using finite samples, avoiding conservative fixed-sample bounds. Experiments on questions inspired by NaturalQuestions and controlled multi-hop question answering demonstrate reliable risk control, early detection of intrinsically infeasible inputs, and safe composition under repeated use, while confidence-based baselines fail to provide consistent guarantees.
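
The phrase "bounds the probability of hallucinations among accepted generations" corresponds to a chance constraint of the following general shape. The notation is assumed for illustration and is not taken from the paper: $x$ is the input, $y$ a sampled generation, $V(x, y) \in \{0, 1\}$ a hypothetical violation indicator (1 for a hallucination), and $\epsilon$ the tolerated risk level.

```latex
\Pr\big( V(x, y) = 1 \;\big|\; y \text{ is accepted} \big) \;\le\; \epsilon
```

Under this reading, an input is feasible when some acceptance rule satisfies the bound, and the sequential procedure in the abstract is what certifies feasibility or infeasibility from a finite number of samples.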

Title: ReCALL: Recalibrating Capability Degradation for MLLM-based Composed Image Retrieval

Authors: Tianyu Yang, ChenWei He, Xiangzhao Hao, Tianyue Wang, Jiarui Guo, Haiyun Guo, Leigang Qu, Jinqiao Wang, Tat-Seng Chua
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.01639
Pdf URL: https://arxiv.org/pdf/2602.01639
Copy Paste: [[2602.01639]] ReCALL: Recalibrating Capability Degradation for MLLM-based Composed Image Retrieval(https://arxiv.org/abs/2602.01639)
Keywords: generative
Abstract: Composed Image Retrieval (CIR) aims to retrieve target images based on a hybrid query comprising a reference image and a modification text. Early dual-tower Vision-Language Models (VLMs) struggle with the cross-modality compositional reasoning required for this task. Recently, adapting generative Multimodal Large Language Models (MLLMs) for retrieval has offered a promising direction. However, we identify that this adaptation strategy overlooks a fundamental issue: adapting a generative MLLM into a single-embedding discriminative retriever triggers a paradigm conflict, which leads to Capability Degradation, that is, the deterioration of native fine-grained reasoning after retrieval adaptation. To address this challenge, we propose ReCALL (Recalibrating Capability Degradation), a model-agnostic framework that follows a diagnose-generate-refine pipeline. First, we diagnose cognitive blind spots of the retriever via self-guided informative instance mining. Next, we generate corrective instructions and triplets by CoT prompting the foundation MLLM and conduct quality control with VQA-based consistency filtering. Finally, we refine the retriever through continual training on these triplets with a grouped contrastive scheme, thereby internalizing fine-grained visual-semantic distinctions and realigning the discriminative embedding space of the retriever with the intrinsic compositional reasoning within the MLLM. Extensive experiments on CIRR and FashionIQ show that ReCALL consistently recalibrates degraded capabilities and achieves state-of-the-art performance. Code will be released soon.

Title: De Novo Molecular Generation from Mass Spectra via Many-Body Enhanced Diffusion

Authors: Xichen Sun, Wentao Wei, Jiahua Rao, Jiancong Xie, Yuedong Yang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2602.01643
Pdf URL: https://arxiv.org/pdf/2602.01643
Copy Paste: [[2602.01643]] De Novo Molecular Generation from Mass Spectra via Many-Body Enhanced Diffusion(https://arxiv.org/abs/2602.01643)
Keywords: generation
Abstract: Molecular structure generation from mass spectrometry is fundamental for understanding cellular metabolism and discovering novel compounds. Although tandem mass spectrometry (MS/MS) enables the high-throughput acquisition of fragment fingerprints, these spectra often reflect higher-order interactions involving the concerted cleavage of multiple atoms and bonds, which are crucial for resolving complex isomers and non-local fragmentation mechanisms. However, most existing methods adopt atom-centric and pairwise interaction modeling, overlooking higher-order edge interactions and lacking the capacity to systematically capture essential many-body characteristics for structure generation. To overcome these limitations, we present MBGen, a Many-Body enhanced diffusion framework for de novo molecular structure Generation from mass spectra. By integrating a many-body attention mechanism and higher-order edge modeling, MBGen comprehensively leverages the rich structural information encoded in MS/MS spectra, enabling accurate de novo generation and isomer differentiation for novel molecules. Experimental results on the NPLIB1 and MassSpecGym benchmarks demonstrate that MBGen achieves superior performance, with improvements of up to 230% over state-of-the-art methods, highlighting the scientific value and practical utility of many-body modeling for mass spectrometry-based molecular generation. Further analysis and ablation studies show that our approach effectively captures higher-order interactions and exhibits enhanced sensitivity to complex isomeric and non-local fragmentation information.

Title: From Perception to Action: Spatial AI Agents and World Models

Authors: Gloria Felicia, Nolan Bryant, Handi Putra, Ayaan Gazali, Eliel Lobo, Esteban Rojas
Subjects: cs.LG, cs.AI, cs.CV, cs.MA, cs.RO
Abstract URL: https://arxiv.org/abs/2602.01644
Pdf URL: https://arxiv.org/pdf/2602.01644
Copy Paste: [[2602.01644]] From Perception to Action: Spatial AI Agents and World Models(https://arxiv.org/abs/2602.01644)
Keywords: generation
Abstract: While large language models have become the prevailing approach for agentic reasoning and planning, their success in symbolic domains does not readily translate to the physical world. Spatial intelligence, the ability to perceive 3D structure, reason about object relationships, and act under physical constraints, is an orthogonal capability that proves important for embodied agents. Existing surveys address either agentic architectures or spatial domains in isolation. None provide a unified framework connecting these complementary capabilities. This paper bridges that gap. Through a thorough review of over 2,000 papers, citing 742 works from top-tier venues, we introduce a unified three-axis taxonomy connecting agentic capabilities with spatial tasks across scales. Crucially, we distinguish spatial grounding (metric understanding of geometry and physics) from symbolic grounding (associating images with text), arguing that perception alone does not confer agency. Our analysis reveals three key findings mapped to these axes: (1) hierarchical memory systems (Capability axis) are important for long-horizon spatial tasks. (2) GNN-LLM integration (Task axis) is a promising approach for structured spatial reasoning. (3) World models (Scale axis) are essential for safe deployment across micro-to-macro spatial scales. We conclude by identifying six grand challenges and outlining directions for future research, including the need for unified evaluation frameworks to standardize cross-domain assessment. This taxonomy provides a foundation for unifying fragmented research efforts and enabling the next generation of spatially-aware autonomous systems in robotics, autonomous vehicles, and geospatial intelligence.

Title: Efficient Adversarial Attacks on High-dimensional Offline Bandits

Authors: Seyed Mohammad Hadi Hosseini, Amir Najafi, Mahdieh Soleymani Baghshah
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2602.01658
Pdf URL: https://arxiv.org/pdf/2602.01658
Copy Paste: [[2602.01658]] Efficient Adversarial Attacks on High-dimensional Offline Bandits(https://arxiv.org/abs/2602.01658)
Keywords: generative
Abstract: Bandit algorithms have recently emerged as a powerful tool for evaluating machine learning models, including generative image models and large language models, by efficiently identifying top-performing candidates without exhaustive comparisons. These methods typically rely on a reward model, often distributed with public weights on platforms such as Hugging Face, to provide feedback to the bandit. While online evaluation is expensive and requires repeated trials, offline evaluation with logged data has become an attractive alternative. However, the adversarial robustness of offline bandit evaluation remains largely unexplored, particularly when an attacker perturbs the reward model (rather than the training data) prior to bandit training. In this work, we fill this gap by investigating, both theoretically and empirically, the vulnerability of offline bandit training to adversarial manipulations of the reward model. We introduce a novel threat model in which an attacker exploits offline data in high-dimensional settings to hijack the bandit's behavior. Starting with linear reward functions and extending to nonlinear models such as ReLU neural networks, we study attacks on two Hugging Face evaluators used for generative model assessment: one measuring aesthetic quality and the other assessing compositional alignment. Our results show that even small, imperceptible perturbations to the reward model's weights can drastically alter the bandit's behavior. From a theoretical perspective, we prove a striking high-dimensional effect: as input dimensionality increases, the perturbation norm required for a successful attack decreases, making modern applications such as image evaluation especially vulnerable. Extensive experiments confirm that naive random perturbations are ineffective, whereas carefully targeted perturbations achieve near-perfect attack success rates ...

Title: Moonworks Lunara Aesthetic II: An Image Variation Dataset

Authors: Yan Wang, Partho Hassan, Samiha Sadeka, Nada Soliman, M M Sayeef Abdullah, Sabit Hassan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.01666
Pdf URL: https://arxiv.org/pdf/2602.01666
Copy Paste: [[2602.01666]] Moonworks Lunara Aesthetic II: An Image Variation Dataset(https://arxiv.org/abs/2602.01666)
Keywords: generation
Abstract: We introduce Lunara Aesthetic II, a publicly released, ethically sourced image dataset designed to support controlled evaluation and learning of contextual consistency in modern image generation and editing systems. The dataset comprises 2,854 anchor-linked variation pairs derived from original art and photographs created by Moonworks. Each variation pair applies contextual transformations, such as illumination, weather, viewpoint, scene composition, color tone, or mood, while preserving a stable underlying identity. Lunara Aesthetic II operationalizes identity-preserving contextual variation as a supervision signal while also retaining Lunara's signature high aesthetic scores. Results show high identity stability, strong target attribute realization, and a robust aesthetic profile that exceeds large-scale web datasets. Released under the Apache 2.0 license, Lunara Aesthetic II is intended for benchmarking, fine-tuning, and analysis of contextual generalization, identity preservation, and edit robustness in image generation and image-to-image systems with interpretable, relational supervision. The dataset is publicly available at: this https URL.

Title: Beyond Mode Elicitation: Diversity-Preserving Reinforcement Learning via Latent Diffusion Reasoner

Authors: Haoqiang Kang, Yizhe Zhang, Nikki Lijing Kuang, Yi-An Ma, Lianhui Qin
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2602.01705
Pdf URL: https://arxiv.org/pdf/2602.01705
Copy Paste: [[2602.01705]] Beyond Mode Elicitation: Diversity-Preserving Reinforcement Learning via Latent Diffusion Reasoner(https://arxiv.org/abs/2602.01705)
Keywords: generation
Abstract: Recent reinforcement learning (RL) methods improve LLM reasoning by optimizing discrete Chain-of-Thought (CoT) generation; however, exploration in token space often suffers from diversity collapse as policy entropy decreases due to mode elicitation behavior in discrete RL. To mitigate this issue, we propose Latent Diffusion Reasoning with Reinforcement Learning (LaDi-RL), a framework that conducts exploration directly in a continuous latent space, where latent variables encode semantic-level reasoning trajectories. By modeling exploration via guided diffusion, multi-step denoising distributes stochasticity and preserves multiple coexisting solution modes without mutual suppression. Furthermore, by decoupling latent-space exploration from text-space generation, we show that latent diffusion-based optimization is more effective than text-space policy optimization alone, while a complementary text policy provides additional gains when combined with latent exploration. Experiments on code generation and mathematical reasoning benchmarks demonstrate consistent improvements in both pass@1 and pass@k over discrete RL baselines, with absolute pass@1 gains of +9.4% on code generation and +5.7% on mathematical reasoning, highlighting diffusion-based latent RL as a principled alternative to discrete token-level RL for reasoning.
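
The pass@1 and pass@k figures quoted above are commonly computed with the unbiased estimator popularized by code-generation benchmarks. The sketch below shows that estimator; whether this paper uses exactly this estimator is an assumption, and the function name and sample counts are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct.

    Computes 1 - C(n - c, k) / C(n, k): the probability that a random
    subset of k samples contains at least one correct solution.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every subset of size k has a hit.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 10 samples drawn, 3 of them pass the tests.
p1 = pass_at_k(n=10, c=3, k=1)
p5 = pass_at_k(n=10, c=3, k=5)
```

Because pass@k rewards having at least one correct sample among k, it is sensitive to the diversity of solutions, which is the property the abstract argues latent-space exploration preserves.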
摘要：最近的强化学习（RL）方法通过优化离散思想链（CoT）生成来改进 LLM 推理；然而，由于离散强化学习中的模式引发行为导致策略熵下降，代币空间的探索经常会遭受多样性崩溃。为了缓解这个问题，我们提出了带有强化学习的潜在扩散推理（LaDi-RL），这是一个直接在连续潜在空间中进行探索的框架，其中潜在变量编码语义级推理轨迹。通过引导扩散对探索进行建模，多步去噪可分布随机性并保留多种共存的解决方案模式而不会相互抑制。此外，通过将潜在空间探索与文本空间生成解耦，我们表明基于潜在扩散的优化比单独的文本空间策略优化更有效，而补充文本策略在与潜在探索相结合时提供了额外的收益。代码生成和数学推理基准测试表明，pass@1 和 pass@k 相对于离散 RL 基线的持续改进，代码生成的绝对 pass@1 增益为 +9.4%，数学推理的绝对增益为 +5.7%，这凸显了基于扩散的潜在 RL 作为离散令牌级 RL 推理的原则性替代方案。
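
The abstract above reports gains in pass@1 and pass@k. As a reminder of what these metrics measure, here is a minimal sketch of the unbiased pass@k estimator commonly used for code-generation benchmarks. Whether the paper uses exactly this estimator is an assumption, and the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled solutions, c of which are correct.

    Computes 1 - C(n - c, k) / C(n, k): the probability that a random
    subset of k samples (out of the n drawn) contains at least one
    correct solution.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every k-subset has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # With 10 samples and 3 correct, pass@1 is the plain success rate.
    print(pass_at_k(n=10, c=3, k=1))
    print(pass_at_k(n=10, c=3, k=5))
```

For k = 1 the estimator reduces to c / n, so a pass@1 gain is a gain in single-sample accuracy, while pass@k for larger k also rewards diversity among the samples.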

Title: Physics Informed Generative AI Enabling Labour Free Segmentation For Microscopy Analysis

Authors: Salma Zahran, Zhou Ao, Zhengyang Zhang, Chen Chi, Chenchen Yuan, Yanming Wang
Subjects: cs.CV, cond-mat.mtrl-sci, cs.AI
Abstract URL: https://arxiv.org/abs/2602.01710
Pdf URL: https://arxiv.org/pdf/2602.01710
Copy Paste: [[2602.01710]] Physics Informed Generative AI Enabling Labour Free Segmentation For Microscopy Analysis(https://arxiv.org/abs/2602.01710)
Keywords: generative
Abstract: Semantic segmentation of microscopy images is a critical task for high-throughput materials characterisation, yet its automation is severely constrained by the prohibitive cost, subjectivity, and scarcity of expert-annotated data. While physics-based simulations offer a scalable alternative to manual labelling, models trained on such data historically fail to generalise due to a significant domain gap, lacking the complex textures, noise patterns, and imaging artefacts inherent to experimental data. This paper introduces a novel framework for labour-free segmentation that successfully bridges this simulation-to-reality gap. Our pipeline leverages phase-field simulations to generate an abundant source of microstructural morphologies with perfect, intrinsically-derived ground-truth masks. We then employ a Cycle-Consistent Generative Adversarial Network (CycleGAN) for unpaired image-to-image translation, transforming the clean simulations into a large-scale dataset of high-fidelity, realistic SEM images. A U-Net model, trained exclusively on this synthetic data, demonstrated remarkable generalisation when deployed on unseen experimental images, achieving a mean Boundary F1-Score of 0.90 and an Intersection over Union (IOU) of 0.88. Comprehensive validation using t-SNE feature-space projection and Shannon entropy analysis confirms that our synthetic images are statistically and featurally indistinguishable from the real data manifold. By completely decoupling model training from manual annotation, our generative framework transforms a data-scarce problem into one of data abundance, providing a robust and fully automated solution to accelerate materials discovery and analysis.
摘要：显微图像的语义分割是高通量材料表征的一项关键任务，但其自动化却受到高昂的成本、主观性和专家注释数据的稀缺性的严重限制。虽然基于物理的模拟提供了手动标记的可扩展替代方案，但由于存在显着的域差距，缺乏实验数据固有的复杂纹理、噪声模式和成像伪影，因此在此类数据上训练的模型历来无法泛化。本文介绍了一种无需人工分割的新颖框架，成功弥补了模拟与现实之间的差距。我们的流程利用相场模拟来生成丰富的微观结构形态源，以及完美的、本质上衍生的地面实况掩模。然后，我们采用循环一致生成对抗网络 (CycleGAN) 进行不配对的图像到图像的转换，将干净的模拟转换为高保真、真实 SEM 图像的大规模数据集。专门针对这些合成数据进行训练的 U-Net 模型在部署到未见过的实验图像上时表现出了显着的泛化能力，实现了 0.90 的平均边界 F1 分数和 0.88 的并交交集 (IOU)。使用 t-SNE 特征空间投影和香农熵分析进行的综合验证证实，我们的合成图像在统计和特征上与真实数据流形无法区分。通过将模型训练与手动注释完全解耦，我们的生成框架将数据稀缺问题转化为数据丰富问题，提供了一种强大且全自动的解决方案来加速材料发现和分析。
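
The segmentation result above is reported partly as Intersection over Union (IoU). The following is a minimal sketch of IoU for two binary masks, assuming NumPy arrays of equal shape. The function name `iou` and the convention for two empty masks are assumptions, not details from the paper.

```python
import numpy as np


def iou(pred, target) -> float:
    """Intersection over Union of two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    if pred.shape != target.shape:
        raise ValueError("masks must have the same shape")
    union = np.logical_or(pred, target).sum()
    if union == 0:
        # Both masks are empty; treating this as perfect agreement is a
        # convention chosen here, not something the abstract specifies.
        return 1.0
    intersection = np.logical_and(pred, target).sum()
    return float(intersection / union)


if __name__ == "__main__":
    pred = [[1, 1], [0, 0]]
    target = [[1, 0], [0, 0]]
    print(iou(pred, target))  # one shared pixel out of two covered pixels
```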

Title: MSign: An Optimizer Preventing Training Instability in Large Language Models via Stable Rank Restoration

Authors: Lianhai Ren, Yucheng Ding, Xiao Liu, Qianxiao Li, Peng Cheng, Yeyun Gong
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2602.01734
Pdf URL: https://arxiv.org/pdf/2602.01734
Copy Paste: [[2602.01734]] MSign: An Optimizer Preventing Training Instability in Large Language Models via Stable Rank Restoration(https://arxiv.org/abs/2602.01734)
Keywords: restoration
Abstract: Training instability remains a critical challenge in large language model (LLM) pretraining, often manifesting as sudden gradient explosions that waste significant computational resources. We study training failures in a 5M-parameter NanoGPT model scaled via $\mu$P, identifying two key phenomena preceding collapse: (1) rapid decline in weight matrix stable rank (ratio of squared Frobenius norm to squared spectral norm), and (2) increasing alignment between adjacent layer Jacobians. We prove theoretically that these two conditions jointly cause exponential gradient norm growth with network depth. To break this instability mechanism, we propose MSign, a new optimizer that periodically applies matrix sign operations to restore stable rank. Experiments on models from 5M to 3B parameters demonstrate that MSign effectively prevents training failures with a computational overhead of less than 7.0%.
摘要：训练不稳定仍然是大型语言模型（LLM）预训练中的一个关键挑战，通常表现为突然的梯度爆炸，浪费大量的计算资源。我们研究了通过 $\mu$P 缩放的 5M 参数 NanoGPT 模型中的训练失败，识别了崩溃之前的两个关键现象：(1) 权重矩阵稳定秩快速下降（Frobenius 范数平方与谱范数平方之比），以及 (2) 相邻层雅可比行列式之间的对齐增加。我们从理论上证明，这两个条件共同导致梯度范数随网络深度呈指数增长。为了打破这种不稳定机制，我们提出了 MSign，一种新的优化器，它定期应用矩阵符号运算来恢复稳定的排名。在5M到3B参数的模型上进行的实验表明，MSign可以有效防止训练失败，计算开销低于7.0%。
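
The abstract defines stable rank as the ratio of the squared Frobenius norm to the squared spectral norm. The sketch below computes it with NumPy and shows the effect of a matrix sign operation. It assumes the matrix sign is the polar factor U V^T of the SVD, which is one common definition. The abstract does not specify the exact operator, how often it is applied, or any rescaling, so this should not be read as the authors' implementation.

```python
import numpy as np


def stable_rank(w: np.ndarray) -> float:
    """Squared Frobenius norm divided by squared spectral norm."""
    fro_sq = float(np.sum(w ** 2))
    spectral = float(np.linalg.norm(w, ord=2))  # largest singular value
    return fro_sq / spectral ** 2


def matrix_sign(w: np.ndarray) -> np.ndarray:
    """Polar factor U @ V^T (assumed definition of the matrix sign)."""
    u, _, vt = np.linalg.svd(w, full_matrices=False)
    return u @ vt


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # A nearly rank-1 weight matrix: one dominant direction plus small noise.
    w = np.outer(rng.normal(size=8), rng.normal(size=4))
    w += 1e-3 * rng.normal(size=(8, 4))
    print("stable rank before:", stable_rank(w))
    print("stable rank after: ", stable_rank(matrix_sign(w)))
```

Because the polar factor sets every nonzero singular value to 1, a full-rank 8 x 4 matrix ends up with stable rank 4, the maximum possible for that shape.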

Title: Probability-Entropy Calibration: An Elastic Indicator for Adaptive Fine-tuning

Authors: Wenhao Yu, Shaohang Wei, Jiahong Liu, Yifan Li, Minda Hu, Aiwei Liu, Hao Zhang, Irwin King
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2602.01745
Pdf URL: https://arxiv.org/pdf/2602.01745
Copy Paste: [[2602.01745]] Probability-Entropy Calibration: An Elastic Indicator for Adaptive Fine-tuning(https://arxiv.org/abs/2602.01745)
Keywords: generation
Abstract: Token-level reweighting is a simple yet effective mechanism for controlling supervised fine-tuning, but common indicators are largely one-dimensional: the ground-truth probability reflects downstream alignment, while token entropy reflects intrinsic uncertainty induced by the pre-training prior. Ignoring entropy can misidentify noisy or easily replaceable tokens as learning-critical, while ignoring probability fails to reflect target-specific alignment. RankTuner introduces a probability-entropy calibration signal, the Relative Rank Indicator, which compares the rank of the ground-truth token with its expected rank under the prediction distribution. The inverse indicator is used as a token-wise Relative Scale to reweight the fine-tuning objective, focusing updates on truly under-learned tokens without over-penalizing intrinsically uncertain positions. Experiments on multiple backbones show consistent improvements on mathematical reasoning benchmarks, transfer gains on out-of-distribution reasoning, and better code generation performance than probability-only or entropy-only reweighting baselines.
摘要：令牌级重新加权是一种简单而有效的控制监督微调的机制，但常见的指标很大程度上是一维的：真实概率反映了下游对齐，而令牌熵反映了预训练先验引起的内在不确定性。忽略熵可能会将噪声或易于替换的标记错误地识别为学习关键，而忽略概率则无法反映特定于目标的对齐。 RankTuner 引入了概率熵校准信号，即相对排名指示器，它将真实标记的排名与其在预测分布下的预期排名进行比较。逆指标用作令牌明智的相对尺度来重新衡量微调目标，将更新重点放在真正学习不足的令牌上，而不会过度惩罚本质上不确定的位置。在多个主干上的实验表明，数学推理基准、分布外推理的转移增益以及预代码生成性能均优于仅概率或仅熵重新加权基线。
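
The Relative Rank Indicator above compares two quantities: the rank of the ground-truth token and the expected rank under the prediction distribution. The sketch below computes only these two ingredients for a single position. How the paper combines them into the indicator and the Relative Scale is not stated in the abstract, so that step is deliberately left out. The function name `rank_statistics` and the tie-breaking rule are assumptions.

```python
import numpy as np


def rank_statistics(probs, target: int):
    """Return (ground-truth rank, expected rank) for one token position.

    Rank 1 is the most probable token. Ties are broken by token index,
    which is an arbitrary choice made for this sketch.
    """
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(-probs, kind="stable")   # token ids, most probable first
    ranks = np.empty(probs.size, dtype=int)
    ranks[order] = np.arange(1, probs.size + 1)  # rank of every token id
    gt_rank = int(ranks[target])
    expected_rank = float(np.sum(probs * ranks))  # E[rank] under probs
    return gt_rank, expected_rank


if __name__ == "__main__":
    # Confident prediction that misses the target: the expected rank is low,
    # so a ground-truth rank of 2 stands out.
    print(rank_statistics([0.7, 0.2, 0.1], target=1))
    # Flat distribution: the expected rank is already high, so a poor
    # ground-truth rank says less about under-learning.
    print(rank_statistics([0.25, 0.25, 0.25, 0.25], target=3))
```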

Title: Mind-Brush: Integrating Agentic Cognitive Search and Reasoning into Image Generation

Authors: Jun He, Junyan Ye, Zilong Huang, Dongzhi Jiang, Chenjue Zhang, Leqi Zhu, Renrui Zhang, Xiang Zhang, Weijia Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.01756
Pdf URL: https://arxiv.org/pdf/2602.01756
Copy Paste: [[2602.01756]] Mind-Brush: Integrating Agentic Cognitive Search and Reasoning into Image Generation(https://arxiv.org/abs/2602.01756)
Keywords: generation
Abstract: While text-to-image generation has achieved unprecedented fidelity, the vast majority of existing models function fundamentally as static text-to-pixel decoders. Consequently, they often fail to grasp implicit user intentions. Although emerging unified understanding-generation models have improved intent comprehension, they still struggle to accomplish tasks involving complex knowledge reasoning within a single model. Moreover, constrained by static internal priors, these models remain unable to adapt to the evolving dynamics of the real world. To bridge these gaps, we introduce Mind-Brush, a unified agentic framework that transforms generation into a dynamic, knowledge-driven workflow. Simulating a human-like 'think-research-create' paradigm, Mind-Brush actively retrieves multimodal evidence to ground out-of-distribution concepts and employs reasoning tools to resolve implicit visual constraints. To rigorously evaluate these capabilities, we propose Mind-Bench, a comprehensive benchmark comprising 500 distinct samples spanning real-time news, emerging concepts, and domains such as mathematical and Geo-Reasoning. Extensive experiments demonstrate that Mind-Brush significantly enhances the capabilities of unified models, realizing a zero-to-one capability leap for the Qwen-Image baseline on Mind-Bench, while achieving superior results on established benchmarks like WISE and RISE.
摘要：虽然文本到图像的生成已经实现了前所未有的保真度，但绝大多数现有模型基本上都是作为静态文本到像素解码器运行的。因此，他们常常无法掌握隐含的用户意图。尽管新兴的统一理解生成模型提高了意图理解能力，但它们仍然难以在单个模型中完成涉及复杂知识推理的任务。此外，受静态内部先验的限制，这些模型仍然无法适应现实世界不断变化的动态。为了弥补这些差距，我们引入了 Mind-Brush，这是一个统一的代理框架，可将生成转变为动态的、知识驱动的工作流程。 Mind-Brush 模拟类似人类的“思考-研究-创造”范式，主动检索多模态证据以支持非分布概念，并采用推理工具来解决隐含的视觉约束。为了严格评估这些能力，我们提出了 Mind-Bench，这是一个综合基准，由 500 个不同的样本组成，涵盖实时新闻、新兴概念以及数学和地理推理等领域。大量实验表明，Mind-Brush显着增强了统一模型的能力，在Mind-Bench上实现了Qwen-Image基线从0到1的能力飞跃，同时在WISE、RISE等既定基准上取得了优异的结果。

Title: MagicFuse: Single Image Fusion for Visual and Semantic Reinforcement

Authors: Hao Zhang, Yanping Zha, Zizhuo Li, Meiqi Gong, Jiayi Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.01760
Pdf URL: https://arxiv.org/pdf/2602.01760
Copy Paste: [[2602.01760]] MagicFuse: Single Image Fusion for Visual and Semantic Reinforcement(https://arxiv.org/abs/2602.01760)
Keywords: generation
Abstract: This paper focuses on a highly practical scenario: how to continue benefiting from the advantages of multi-modal image fusion under harsh conditions when only visible imaging sensors are available. To achieve this goal, we propose a novel concept of single-image fusion, which extends conventional data-level fusion to the knowledge level. Specifically, we develop MagicFuse, a novel single image fusion framework capable of deriving a comprehensive cross-spectral scene representation from a single low-quality visible image. MagicFuse first introduces an intra-spectral knowledge reinforcement branch and a cross-spectral knowledge generation branch based on the diffusion models. They mine scene information obscured in the visible spectrum and learn thermal radiation distribution patterns transferred to the infrared spectrum, respectively. Building on them, we design a multi-domain knowledge fusion branch that integrates the probabilistic noise from the diffusion streams of these two branches, from which a cross-spectral scene representation can be obtained through successive sampling. Then, we impose both visual and semantic constraints to ensure that this scene representation can satisfy human observation while supporting downstream semantic decision-making. Extensive experiments show that our MagicFuse achieves visual and semantic representation performance comparable to or even better than state-of-the-art fusion methods with multi-modal inputs, despite relying solely on a single degraded visible image.
摘要：本文重点讨论一个高度实用的场景：当只有可见光成像传感器可用时，如何在恶劣条件下继续受益于多模态图像融合的优势。为了实现这一目标，我们提出了一种新的单图像融合概念，将传统的数据级融合扩展到知识级。具体来说，我们开发了 MagicFuse，这是一种新颖的单图像融合框架，能够从单个低质量可见图像中导出全面的跨光谱场景表示。 MagicFuse首先引入了基于扩散模型的谱内知识强化分支和跨谱知识生成分支。他们分别挖掘可见光谱中隐藏的场景信息并学习转移到红外光谱的热辐射分布模式。在此基础上，我们设计了一个多领域知识融合分支，该分支集成了来自这两个分支的扩散流的概率噪声，从中可以通过连续采样获得跨谱场景表示。然后，我们施加视觉和语义约束，以确保该场景表示能够满足人类观察，同时支持下游语义决策。大量的实验表明，我们的 MagicFuse 所实现的视觉和语义表示性能与具有多模态输入的最先进的融合方法相当甚至更好，尽管仅依赖于单个降级的可见图像。

Title: IRIS: Implicit Reward-Guided Internal Sifting for Mitigating Multimodal Hallucination

Authors: Yuanshuai Li, Yuping Yan, Jirui Han, Fei Ming, Lingjuan Lv, Yaochu Jin
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2602.01769
Pdf URL: https://arxiv.org/pdf/2602.01769
Copy Paste: [[2602.01769]] IRIS: Implicit Reward-Guided Internal Sifting for Mitigating Multimodal Hallucination(https://arxiv.org/abs/2602.01769)
Keywords: generation
Abstract: Hallucination remains a fundamental challenge for Multimodal Large Language Models (MLLMs). While Direct Preference Optimization (DPO) is a key alignment framework, existing approaches often rely heavily on costly external evaluators for scoring or rewriting, incurring off-policy learnability gaps and discretization loss. Due to the lack of access to internal states, such feedback overlooks the fine-grained conflicts between different modalities that lead to hallucinations during generation. To address this issue, we propose IRIS (Implicit Reward-Guided Internal Sifting), which leverages continuous implicit rewards in the native log-probability space to preserve full information density and capture internal modal competition. This on-policy paradigm eliminates learnability gaps by utilizing self-generated preference pairs. By sifting these pairs based on multimodal implicit rewards, IRIS ensures that optimization is driven by signals that directly resolve modal conflicts. Extensive experiments demonstrate that IRIS achieves highly competitive performance on key hallucination benchmarks using only 5.7k samples, without requiring any external feedback during preference alignment. These results confirm that IRIS provides an efficient and principled paradigm for mitigating MLLM hallucinations.
摘要：幻觉仍然是多模态大语言模型（MLLM）的一个基本挑战。虽然直接偏好优化（DPO）是一个关键的对齐框架，但现有的方法通常严重依赖昂贵的外部评估者进行评分或重写，从而导致偏离策略的可学习性差距和离散化损失。由于缺乏对内部状态的访问，这种反馈忽略了不同模式之间的细粒度冲突，这些冲突导致了生成过程中的幻觉。为了解决这个问题，我们提出了 IRIS（隐式奖励引导内部筛选），它利用本地对数概率空间中的连续隐式奖励来保留完整的信息密度并捕获内部模态竞争。这种在策略范式通过利用自我生成的偏好对消除了可学习性差距。通过根据多模态隐式奖励筛选这些对，IRIS 确保优化是由直接解决模态冲突的信号驱动的。大量实验表明，IRIS 仅使用 5.7k 个样本就在关键幻觉基准测试中实现了极具竞争力的性能，并且在偏好调整期间不需要任何外部反馈。这些结果证实 IRIS 为减轻 MLLM 幻觉提供了一种有效且有原则的范例。
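
IRIS builds on the implicit reward of Direct Preference Optimization, which in standard DPO is beta times the log-probability ratio between the policy and a reference model. The sketch below computes that reward and keeps only preference pairs whose reward margin exceeds a threshold. The margin filter, the dictionary keys, and the default values are hypothetical stand-ins: the abstract does not describe the actual multimodal sifting criterion.

```python
def implicit_reward(logp_policy: float, logp_ref: float, beta: float = 0.1) -> float:
    """Standard DPO implicit reward: beta * (log pi(y|x) - log pi_ref(y|x))."""
    return beta * (logp_policy - logp_ref)


def sift_pairs(pairs, beta: float = 0.1, min_margin: float = 0.0):
    """Keep pairs whose chosen-vs-rejected reward margin exceeds min_margin.

    Each pair is a dict with sequence log-probabilities under the policy
    and the reference model (hypothetical keys, for illustration only).
    """
    kept = []
    for pair in pairs:
        r_chosen = implicit_reward(pair["chosen_logp"], pair["chosen_ref_logp"], beta)
        r_rejected = implicit_reward(pair["rejected_logp"], pair["rejected_ref_logp"], beta)
        if r_chosen - r_rejected > min_margin:
            kept.append(pair)
    return kept


if __name__ == "__main__":
    pairs = [
        {"chosen_logp": -10.0, "chosen_ref_logp": -12.0,
         "rejected_logp": -15.0, "rejected_ref_logp": -14.0},
        {"chosen_logp": -10.0, "chosen_ref_logp": -10.0,
         "rejected_logp": -11.0, "rejected_ref_logp": -11.0},
    ]
    print(len(sift_pairs(pairs, min_margin=0.05)))
```

Working in log-probability space keeps the signal continuous, which matches the abstract's point about avoiding the discretization loss of external scoring.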

Title: Fast Autoregressive Video Diffusion and World Models with Temporal Cache Compression and Sparse Attention

Authors: Dvir Samuel, Issar Tzachor, Matan Levy, Michael Green, Gal Chechik, Rami Ben-Ari
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2602.01801
Pdf URL: https://arxiv.org/pdf/2602.01801
Copy Paste: [[2602.01801]] Fast Autoregressive Video Diffusion and World Models with Temporal Cache Compression and Sparse Attention(https://arxiv.org/abs/2602.01801)
Keywords: generation
Abstract: Autoregressive video diffusion models enable streaming generation, opening the door to long-form synthesis, video world models, and interactive neural game engines. However, their core attention layers become a major bottleneck at inference time: as generation progresses, the KV cache grows, causing both increasing latency and escalating GPU memory, which in turn restricts usable temporal context and harms long-range consistency. In this work, we study redundancy in autoregressive video diffusion and identify three persistent sources: near-duplicate cached keys across frames, slowly evolving (largely semantic) queries/keys that make many attention computations redundant, and cross-attention over long prompts where only a small subset of tokens matters per frame. Building on these observations, we propose a unified, training-free attention framework for autoregressive diffusion: TempCache compresses the KV cache via temporal correspondence to bound cache growth; AnnCA accelerates cross-attention by selecting frame-relevant prompt tokens using fast approximate nearest neighbor (ANN) matching; and AnnSA sparsifies self-attention by restricting each query to semantically matched keys, also using a lightweight ANN. Together, these modules reduce attention compute and memory and are compatible with existing autoregressive diffusion backbones and world models. Experiments demonstrate end-to-end speedups of up to 5x-10x while preserving near-identical visual quality and, crucially, maintaining stable throughput and nearly constant peak GPU memory usage over long rollouts, where prior methods progressively slow down and suffer from increasing memory usage.
摘要：自回归视频扩散模型支持流式生成，为长格式合成、视频世界模型和交互式神经游戏引擎打开了大门。然而，它们的核心注意力层成为推理时的主要瓶颈：随着生成的进展，KV 缓存不断增长，导致延迟增加和 GPU 内存不断增加，这反过来又限制了可用的时间上下文并损害了远程一致性。在这项工作中，我们研究了自回归视频扩散中的冗余，并确定了三个持久源：跨帧的近乎重复的缓存键、缓慢演变的（主要是语义的）查询/键，这使得许多注意力计算变得冗余，以及长提示上的交叉注意力，其中每帧只有一小部分标记很重要。基于这些观察，我们提出了一个用于自回归扩散的统一的、免训练的注意力框架：TempCache 通过时间对应来压缩 KV 缓存以限制缓存增长； AnnCA 通过使用快速近似最近邻 (ANN) 匹配选择与帧相关的提示标记来加速交叉注意力； AnnSA 通过将每个查询限制为语义匹配的键（也使用轻量级 ANN）来稀疏自注意力。这些模块共同减少了注意力、计算和记忆，并且与现有的自回归扩散主干和世界模型兼容。实验证明，端到端加速高达 x5--x10，同时保持几乎相同的视觉质量，最重要的是，在长期部署过程中保持稳定的吞吐量和几乎恒定的峰值 GPU 内存使用量，而之前的方法会逐渐减慢速度并受到内存使用量增加的影响。
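
The idea behind AnnSA, restricting each query to a small set of matched keys, can be illustrated with a top-k sparse attention sketch in NumPy. Note the substitution: this sketch selects keys by exact top-k over the full score matrix, whereas the paper uses approximate nearest neighbor matching precisely to avoid computing all scores. The function name and shapes are assumptions.

```python
import numpy as np


def topk_sparse_attention(q, k, v, top_k: int):
    """Attention where each query attends only to its top_k highest-scoring keys.

    q: (n_queries, d), k: (n_keys, d), v: (n_keys, d_v).
    Exact top-k selection stands in for the ANN lookup used in the paper.
    """
    q = np.asarray(q, dtype=float)
    k = np.asarray(k, dtype=float)
    v = np.asarray(v, dtype=float)
    top_k = max(1, min(top_k, k.shape[0]))

    scores = q @ k.T / np.sqrt(q.shape[-1])            # (n_queries, n_keys)
    # Indices of the top_k largest scores per query (unordered).
    idx = np.argpartition(-scores, top_k - 1, axis=-1)[:, :top_k]

    # Additive mask: 0 for selected keys, -inf for all others.
    mask = np.full_like(scores, -np.inf)
    np.put_along_axis(mask, idx, 0.0, axis=-1)

    masked = scores + mask
    masked -= masked.max(axis=-1, keepdims=True)       # numerical stability
    weights = np.exp(masked)                           # exp(-inf) == 0
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, k, v = rng.normal(size=(3, 4)), rng.normal(size=(6, 4)), rng.normal(size=(6, 2))
    print(topk_sparse_attention(q, k, v, top_k=2).shape)
```

With `top_k` equal to the number of keys the result matches dense softmax attention, and with `top_k=1` each query simply copies the value of its best-matching key.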

Title: GPD: Guided Progressive Distillation for Fast and High-Quality Video Generation

Authors: Xiao Liang, Yunzhu Zhang, Linchao Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.01814
Pdf URL: https://arxiv.org/pdf/2602.01814
Copy Paste: [[2602.01814]] GPD: Guided Progressive Distillation for Fast and High-Quality Video Generation(https://arxiv.org/abs/2602.01814)
Keywords: generation
Abstract: Diffusion models have achieved remarkable success in video generation; however, the high computational cost of the denoising process remains a major bottleneck. Existing approaches have shown promise in reducing the number of diffusion steps, but they often suffer from significant quality degradation when applied to video generation. We propose Guided Progressive Distillation (GPD), a framework that accelerates the diffusion process for fast and high-quality video generation. GPD introduces a novel training strategy in which a teacher model progressively guides a student model to operate with larger step sizes. The framework consists of two key components: (1) an online-generated training target that reduces optimization difficulty while improving computational efficiency, and (2) frequency-domain constraints in the latent space that promote the preservation of fine-grained details and temporal dynamics. Applied to the Wan2.1 model, GPD reduces the number of sampling steps from 48 to 6 while maintaining competitive visual quality on VBench. Compared with existing distillation methods, GPD demonstrates clear advantages in both pipeline simplicity and quality preservation.
摘要：扩散模型在视频生成方面取得了显着的成功；然而，去噪过程的高计算成本仍然是一个主要瓶颈。现有方法在减少扩散步骤数量方面表现出了希望，但在应用于视频生成时，它们常常会出现质量显着下降的问题。我们提出了引导渐进蒸馏（GPD），这是一个加速扩散过程以快速生成高质量视频的框架。 GPD 引入了一种新颖的训练策略，其中教师模型逐步引导学生模型以更大的步长进行操作。该框架由两个关键组件组成：（1）在线生成的训练目标，可降低优化难度，同时提高计算效率；（2）潜在空间中的频域约束，可促进细粒度细节和时间动态的保存。应用于 Wan2.1 模型时，GPD 将采样步骤数从 48 个减少到 6 个，同时在 VBench 上保持有竞争力的视觉质量。与现有的蒸馏方法相比，GPD 在管道简单和质量保存方面都表现出明显的优势。

Title: Prism: Efficient Test-Time Scaling via Hierarchical Search and Self-Verification for Discrete Diffusion Language Models

Authors: Jinbin Bai, Yixuan Li, Yuchen Zhu, Yi Xin, Qingyu Shi, Aosong Feng, Xiaohong Liu, Molei Tao, Jianru Xue, Xiangtai Li, Ming-Hsuan Yang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2602.01842
Pdf URL: https://arxiv.org/pdf/2602.01842
Copy Paste: [[2602.01842]] Prism: Efficient Test-Time Scaling via Hierarchical Search and Self-Verification for Discrete Diffusion Language Models(https://arxiv.org/abs/2602.01842)
Keywords: generation, generative
Abstract: Inference-time compute has re-emerged as a practical way to improve LLM reasoning. Most test-time scaling (TTS) algorithms rely on autoregressive decoding, which is ill-suited to discrete diffusion language models (dLLMs) due to their parallel decoding over the entire sequence. As a result, developing effective and efficient TTS methods to unlock dLLMs' full generative potential remains an underexplored challenge. To address this, we propose Prism (Pruning, Remasking, and Integrated Self-verification Method), an efficient TTS framework for dLLMs that (i) performs Hierarchical Trajectory Search (HTS) which dynamically prunes and reallocates compute in an early-to-mid denoising window, (ii) introduces Local branching with partial remasking to explore diverse implementations while preserving high-confidence tokens, and (iii) replaces external verifiers with Self-Verified Feedback (SVF) obtained via self-evaluation prompts on intermediate completions. Across four mathematical reasoning and code generation benchmarks on three dLLMs, including LLaDA 8B Instruct, Dream 7B Instruct, and LLaDA 2.0-mini, our Prism achieves a favorable performance-efficiency trade-off, matching best-of-N performance with substantially fewer function evaluations (NFE). The code is released at this https URL.
摘要：推理时间计算已重新成为改进 LLM 推理的实用方法。大多数测试时间缩放 (TTS) 算法依赖于自回归解码，这不适合离散扩散语言模型 (dLLM)，因为它们在整个序列上进行并行解码。因此，开发有效且高效的 TTS 方法来释放 dLLM 的全部生成潜力仍然是一个尚未充分探索的挑战。为了解决这个问题，我们提出了 Prism（修剪、重新屏蔽和集成自我验证方法），这是一种用于 dLLM 的高效 TTS 框架，它（i）执行分层轨迹搜索（HTS），在早期到中期的去噪窗口中动态修剪和重新分配计算，（ii）引入具有部分重新屏蔽的本地分支来探索不同的实现，同时保留高置信度令牌，以及（iii）用通过中间完成情况的自我评估提示获得自我验证反馈（SVF）。在三个 dLLM 上的四个数学推理和代码生成基准测试中，包括 LLaDA 8B Instruct、Dream 7B Instruct 和 LLaDA 2.0-mini，我们的 Prism 实现了有利的性能与效率权衡，将最佳性能与大幅减少的函数评估 (NFE) 相匹配。代码在此 https URL 发布。

Title: No Generation without Representation: Efficient Causal Protein Language Models Enable Zero-Shot Fitness Estimation

Authors: Furkan Eris
Subjects: cs.LG, q-bio.QM
Abstract URL: https://arxiv.org/abs/2602.01845
Pdf URL: https://arxiv.org/pdf/2602.01845
Copy Paste: [[2602.01845]] No Generation without Representation: Efficient Causal Protein Language Models Enable Zero-Shot Fitness Estimation(https://arxiv.org/abs/2602.01845)
Keywords: generation, generative
Abstract: Protein language models (PLMs) face a fundamental divide: masked language models (MLMs) excel at fitness prediction while causal models enable generation, forcing practitioners to maintain separate architectures. We introduce Proust, a 309M-parameter causal PLM that bridges this gap through architectural innovations adapted from recent LLM research, including grouped-query attention with shared K/V projections, cross-layer value residuals, and depthwise causal convolutions. Trained on 33B tokens in 40 B200 GPU-hours, Proust achieves Spearman $\rho = 0.390$ on ProteinGym substitutions, competitive with MLMs requiring 50--200$\times$ the compute. On indels, Proust sets a new state-of-the-art, outperforming models up to 20$\times$ larger. On EVEREST viral fitness benchmarks, it approaches structure-aware methods using sequence alone. These powerful representations position Proust in a sweet spot as it also retains native generative capabilities that MLMs lack by design. Interpretability analysis reveals that per-position entropy variance predicts, to an extent, when retrieval augmentation helps and hurts. Such insights can grow in both quantity and quality at scale and inform capabilities such as test-time scaling. Code and weights are available at this https URL.
摘要：蛋白质语言模型 (PLM) 面临着根本性的分歧：掩码语言模型 (MLM) 擅长适应度预测，而因果模型支持生成，迫使从业者维护单独的架构。我们引入了 \textbf{Proust}，这是一个 309M 参数的因果 PLM，它通过改编自最近的 LLM 研究的架构创新来弥补这一差距，包括具有共享 K/V 投影的分组查询注意力、跨层值残差和深度因果卷积。在 40 B200 GPU 小时内对 33B 代币进行训练后，Proust 在 ProteinGym 替代品上实现了 Spearman $\rho = 0.390$，与需要 50--200$\times$ 计算的 MLM 竞争。在插入/缺失方面，普鲁斯特建立了一种新的最先进的模型，其性能比模型大 20 倍。在 EVEREST 病毒适应性基准上，它仅使用序列来实现结构感知方法。这些强大的表现形式使普鲁斯特处于最佳位置，因为它还保留了传销在设计上所缺乏的原生生成能力。可解释性分析表明，每个位置的熵方差在一定程度上可以预测检索增强何时有益或有害。这些见解可以在数量和质量上大规模增长，并为测试时间扩展等能力提供信息。代码和权重可在此 https URL 获取

Title: Self-Rewarding Sequential Monte Carlo for Masked Diffusion Language Models

Authors: Ziwei Luo, Ziqi Jin, Lei Wang, Lidong Bing, Thomas B. Schön
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2602.01849
Pdf URL: https://arxiv.org/pdf/2602.01849
Copy Paste: [[2602.01849]] Self-Rewarding Sequential Monte Carlo for Masked Diffusion Language Models(https://arxiv.org/abs/2602.01849)
Keywords: generation
Abstract: This work presents self-rewarding sequential Monte Carlo (SMC), an inference-time scaling algorithm enabling effective sampling of masked diffusion language models (MDLMs). Our algorithm stems from the observation that most existing MDLMs rely on a confidence-based sampling strategy, where only tokens with the highest prediction confidence are preserved at each step. This restricts the generation to a noise-sensitive, greedy decoding paradigm, resulting in an inevitable collapse in the diversity of possible paths. We address this problem by launching multiple interacting diffusion processes in parallel, referred to as particles, for trajectory exploration. Importantly, we introduce the trajectory-level confidence as a self-rewarding signal for assigning particle importance weights. During sampling, particles are iteratively weighted and resampled to systematically steer generation towards globally confident, high-quality samples. Our self-rewarding SMC is verified on various masked diffusion language models and benchmarks, achieving significant improvement without extra training or reward guidance, while effectively converting parallel inference capacity into improved sampling quality. Our code is available at this https URL.
摘要：这项工作提出了自我奖励顺序蒙特卡罗（SMC），这是一种推理时间缩放算法，可以对掩蔽扩散语言模型（MDLM）进行有效采样。我们的算法源于这样的观察：大多数现有 MDLM 依赖于基于置信度的采样策略，其中每一步只保留具有最高预测置信度的标记。这将生成限制为对噪声敏感的贪婪解码范式，导致可能路径的多样性不可避免地崩溃。我们通过并行启动多个相互作用的扩散过程（称为粒子）来解决这个问题，以进行轨迹探索。重要的是，我们引入轨迹级置信度作为分配粒子重要性权重的自我奖励信号。在采样过程中，粒子被迭代加权和重新采样，以系统地引导生成全局可信的高质量样本。我们的自我奖励 SMC 在各种掩码扩散语言模型和基准上进行了验证，无需额外训练或奖励指导即可实现显着改进，同时有效地将并行推理能力转化为改进的采样质量。我们的代码可以在这个 https URL 上找到。
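
The weighting and resampling loop described above can be sketched with a single generic sequential Monte Carlo resampling step. The sketch assumes each particle carries a trajectory-level log-confidence that acts as its unnormalized log-weight, and it uses plain multinomial resampling. The paper's exact weight definition and resampling scheme are not given in the abstract, and all names here are illustrative.

```python
import numpy as np


def resample(particles, log_conf, rng):
    """One multinomial resampling step driven by per-particle log-confidence.

    Returns the resampled particles and the normalized importance weights.
    """
    log_conf = np.asarray(log_conf, dtype=float)
    weights = np.exp(log_conf - log_conf.max())   # subtract max for stability
    weights /= weights.sum()
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return [particles[i] for i in idx], weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    particles = ["draft A", "draft B", "draft C", "draft D"]
    log_conf = [-1.0, -3.0, -0.5, -6.0]  # hypothetical trajectory confidences
    new_particles, weights = resample(particles, log_conf, rng)
    print(weights.round(3))
    print(new_particles)
```

Low-confidence particles tend to be dropped and high-confidence ones duplicated, which is how parallel inference capacity is converted into better samples without a separate reward model.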

Title: WS-IMUBench: Can Weakly Supervised Methods from Audio, Image, and Video Be Adapted for IMU-based Temporal Action Localization?

Authors: Pei Li, Jiaxi Yin, Lei Ouyang, Shihan Pan, Ge Wang, Han Ding, Fei Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.01850
Pdf URL: https://arxiv.org/pdf/2602.01850
Copy Paste: [[2602.01850]] WS-IMUBench: Can Weakly Supervised Methods from Audio, Image, and Video Be Adapted for IMU-based Temporal Action Localization?(https://arxiv.org/abs/2602.01850)
Keywords: generation
Abstract: IMU-based Human Activity Recognition (HAR) has enabled a wide range of ubiquitous computing applications, yet its dominant clip classification paradigm cannot capture the rich temporal structure of real-world behaviors. This motivates a shift toward IMU Temporal Action Localization (IMU-TAL), which predicts both action categories and their start/end times in continuous streams. However, current progress is strongly bottlenecked by the need for dense, frame-level boundary annotations, which are costly and difficult to scale. To address this bottleneck, we introduce WS-IMUBench, a systematic benchmark study of weakly supervised IMU-TAL (WS-IMU-TAL) under only sequence-level labels. Rather than proposing a new localization algorithm, we evaluate how well established weakly supervised localization paradigms from audio, image, and video transfer to IMU-TAL under only sequence-level labels. We benchmark seven representative weakly supervised methods on seven public IMU datasets, resulting in over 3,540 model training runs and 7,080 inference evaluations. Guided by three research questions on transferability, effectiveness, and insights, our findings show that (i) transfer is modality-dependent, with temporal-domain methods generally more stable than image-derived proposal-based approaches; (ii) weak supervision can be competitive on favorable datasets (e.g., with longer actions and higher-dimensional sensing); and (iii) dominant failure modes arise from short actions, temporal ambiguity, and proposal quality. Finally, we outline concrete directions for advancing WS-IMU-TAL (e.g., IMU-specific proposal generation, boundary-aware objectives, and stronger temporal reasoning). Beyond individual results, WS-IMUBench establishes a reproducible benchmarking template, datasets, protocols, and analyses, to accelerate community-wide progress toward scalable WS-IMU-TAL.
摘要：基于 IMU 的人类活动识别 (HAR) 已经实现了广泛的普适计算应用，但其占主导地位的剪辑分类范式无法捕获现实世界行为的丰富时间结构。这促使人们转向 IMU 时间动作定位 (IMU-TAL)，它可以预测连续流中的动作类别及其开始/结束时间。然而，当前的进展受到对密集的帧级边界注释的需求的严重瓶颈，这些注释成本高昂且难以扩展。为了解决这个瓶颈，我们引入了 WS-IMUBench，这是一种仅在序列级标签下弱监督 IMU-TAL (WS-IMU-TAL) 的系统基准研究。我们不是提出一种新的定位算法，而是评估仅在序列级标签下从音频、图像和视频传输到 IMU-TAL 的弱监督定位范例的建立情况。我们在七个公共 IMU 数据集上对七种代表性的弱监督方法进行了基准测试，产生了超过 3,540 次模型训练运行和 7,080 次推理评估。在关于可迁移性、有效性和洞察力的三个研究问题的指导下，我们的研究结果表明：（i）迁移依赖于模态，时域方法通常比基于图像的提案方法更稳定； (ii) 弱监督可以在有利的数据集上具有竞争力（例如，具有更长的动作和更高维度的传感）； (iii) 主要失败模式是由短期行动、时间模糊性和提案质量引起的。最后，我们概述了推进 WS-IMU-TAL 的具体方向（例如，IMU 特定的建议生成、边界感知目标和更强的时间推理）。除了个人结果之外，WS-IMUBench 还建立了可重复的基准测试模板、数据集、协议和分析，以加速社区范围内朝着可扩展的 WS-IMU-TAL 迈进。

Title: How Well Do Models Follow Visual Instructions? VIBE: A Systematic Benchmark for Visual Instruction-Driven Image Editing

Authors: Huanyu Zhang, Xuehai Bai, Chengzu Li, Chen Liang, Haochen Tian, Haodong Li, Ruichuan An, Yifan Zhang, Anna Korhonen, Zhang Zhang, Liang Wang, Tieniu Tan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.01851
Pdf URL: https://arxiv.org/pdf/2602.01851
Copy Paste: [[2602.01851]] How Well Do Models Follow Visual Instructions? VIBE: A Systematic Benchmark for Visual Instruction-Driven Image Editing(https://arxiv.org/abs/2602.01851)
Keywords: generative
Abstract: Recent generative models have achieved remarkable progress in image editing. However, existing systems and benchmarks remain largely text-guided. In contrast, human communication is inherently multimodal, where visual instructions such as sketches efficiently convey spatial and structural intent. To address this gap, we introduce VIBE, the Visual Instruction Benchmark for Image Editing with a three-level interaction hierarchy that captures deictic grounding, morphological manipulation, and causal reasoning. Across these levels, we curate high-quality and diverse test cases that reflect progressively increasing complexity in visual instruction following. We further propose a robust LMM-as-a-judge evaluation framework with task-specific metrics to enable scalable and fine-grained assessment. Through a comprehensive evaluation of 17 representative open-source and proprietary image editing models, we find that proprietary models exhibit early-stage visual instruction-following capabilities and consistently outperform open-source models. However, performance degrades markedly with increasing task difficulty even for the strongest systems, highlighting promising directions for future research.
摘要：最近的生成模型在图像编辑方面取得了显着的进展。然而，现有的系统和基准仍然很大程度上以文本为指导。相比之下，人类交流本质上是多模式的，草图等视觉指令可以有效地传达空间和结构意图。为了解决这一差距，我们引入了 VIBE，即图像编辑的视觉指令基准，具有三级交互层次结构，可捕获指示基础、形态操作和因果推理。在这些级别上，我们策划了高质量和多样化的测试用例，反映了视觉指令遵循中逐渐增加的复杂性。我们进一步提出了一个强大的 LMM 作为法官评估框架，具有特定于任务的指标，以实现可扩展和细粒度的评估。通过对 17 个具有代表性的开源和专有图像编辑模型的综合评估，我们发现专有模型表现出早期视觉指令跟踪能力，并且始终优于开源模型。然而，即使对于最强大的系统，随着任务难度的增加，性能也会显着下降，这凸显了未来研究的有希望的方向。

Title: Time2Vec-Integrated Transformer for Robust Gesture Recognition from Low-Density sEMG

Authors: Blagoj Hristov, Hristijan Gjoreski, Vesna Ojleska Latkoska, Gorjan Nadzinski
Subjects: cs.LG, cs.AI, eess.SP
Abstract URL: https://arxiv.org/abs/2602.01855
Pdf URL: https://arxiv.org/pdf/2602.01855
Copy Paste: [[2602.01855]] Time2Vec-Integrated Transformer for Robust Gesture Recognition from Low-Density sEMG(https://arxiv.org/abs/2602.01855)
Keywords: generation
Abstract: Accurate and responsive myoelectric prosthesis control typically relies on complex, dense multi-sensor arrays, which limits consumer accessibility. This paper presents a novel, data-efficient deep learning framework designed to achieve precise and accurate control using minimal sensor hardware. Leveraging an external dataset of 8 subjects, our approach implements a hybrid Transformer optimized for sparse, two-channel surface electromyography (sEMG). Unlike standard architectures that use fixed positional encodings, we integrate Time2Vec learnable temporal embeddings to capture the stochastic temporal warping inherent in biological signals. Furthermore, we employ a normalized additive fusion strategy that aligns the latent distributions of spatial and temporal features, preventing the destructive interference common in standard implementations. A two-stage curriculum learning protocol is utilized to ensure robust feature extraction despite data scarcity. The proposed architecture achieves a state-of-the-art multi-subject F1-score of 95.7% $\pm$ 0.20% for a 10-class movement set, statistically outperforming both a standard Transformer with fixed encodings and a recurrent CNN-LSTM model. Architectural optimization reveals that a balanced allocation of model capacity between spatial and temporal dimensions yields the highest stability. Furthermore, while direct transfer to a new unseen subject led to poor accuracy due to domain shifts, a rapid calibration protocol utilizing only two trials per gesture recovered performance from 21.0% $\pm$ 2.98% to 96.9% $\pm$ 0.52%. By validating that high-fidelity temporal embeddings can compensate for low spatial resolution, this work challenges the necessity of high-density sensing. The proposed framework offers a robust, cost-effective blueprint for next-generation prosthetic interfaces capable of rapid personalization.
摘要：准确且响应灵敏的肌电假肢控制通常依赖于复杂、密集的多传感器阵列，这限制了消费者的可及性。本文提出了一种新颖的、数据高效的深度学习框架，旨在使用最少的传感器硬件实现精确控制。利用 8 个受试者的外部数据集，我们的方法实现了针对稀疏、双通道表面肌电图 (sEMG) 进行优化的混合 Transformer。与使用固定位置编码的标准架构不同，我们集成了 Time2Vec 可学习时间嵌入来捕获生物信号中固有的随机时间扭曲。此外，我们采用归一化加性融合策略来对齐空间和时间特征的潜在分布，防止标准实现中常见的破坏性干扰。尽管数据稀缺，但利用两阶段课程学习协议来确保稳健的特征提取。所提出的架构在 10 级运动集上实现了最先进的多主题 F1 分数 95.7% $\pm$ 0.20%，统计上优于具有固定编码的标准 Transformer 和循环 CNN-LSTM 模型。架构优化表明，模型容量在空间和时间维度之间的平衡分配可产生最高的稳定性。此外，虽然直接转移到新的看不见的对象会因域转移而导致准确性较差，但每个手势仅使用两次试验的快速校准协议将性能从 21.0% $\pm$ 2.98% 恢复到 96.9% $\pm$ 0.52%。通过验证高保真时间嵌入可以补偿低空间分辨率，这项工作挑战了高密度传感的必要性。所提出的框架为能够快速个性化的下一代假肢接口提供了一个强大的、具有成本效益的蓝图。
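
The Time2Vec embedding named in this abstract can be illustrated with a minimal sketch. It follows the commonly cited Time2Vec form (one linear component plus sine components) and is not the authors' implementation: in a real model `omega` and `phi` would be learned parameters, whereas here they are fixed, hypothetical values passed in by the caller.

```python
import math


def time2vec(tau, omega, phi):
    """Embed a scalar time `tau` as a vector.

    Component 0 is linear (omega[0] * tau + phi[0]); every other
    component i is periodic: sin(omega[i] * tau + phi[i]).
    `omega` and `phi` stand in for learnable frequencies and phases.
    """
    if not omega or len(omega) != len(phi):
        raise ValueError("omega and phi must be non-empty and equally long")
    linear = omega[0] * tau + phi[0]
    periodic = [math.sin(w * tau + p) for w, p in zip(omega[1:], phi[1:])]
    return [linear] + periodic


# Hypothetical parameters, chosen only for illustration.
embedding = time2vec(0.25, omega=[1.0, 2.0, 4.0], phi=[0.0, 0.1, 0.2])
print(len(embedding))
```

Unlike a fixed sinusoidal positional encoding, the frequencies and phases here are free parameters, which is what would let training adapt them to the timing of the signal.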

Title: Trust but Verify: Adaptive Conditioning for Reference-Based Diffusion Super-Resolution via Implicit Reference Correlation Modeling

Authors: Yuan Wang, Yuhao Wan, Siming Zheng, Bo Li, Qibin Hou, Peng-Tao Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.01864
Pdf URL: https://arxiv.org/pdf/2602.01864
Copy Paste: [[2602.01864]] Trust but Verify: Adaptive Conditioning for Reference-Based Diffusion Super-Resolution via Implicit Reference Correlation Modeling(https://arxiv.org/abs/2602.01864)
Keywords: restoration, super-resolution
Abstract: Recent works have explored reference-based super-resolution (RefSR) to mitigate hallucinations in diffusion-based image restoration. A key challenge is that real-world degradations make correspondences between low-quality (LQ) inputs and reference (Ref) images unreliable, requiring adaptive control of reference usage. Existing methods either ignore LQ-Ref correlations or rely on brittle explicit matching, leading to over-reliance on misleading references or under-utilization of valuable cues. To address this, we propose Ada-RefSR, a single-step diffusion framework guided by a "Trust but Verify" principle: reference information is leveraged when reliable and suppressed otherwise. Its core component, Adaptive Implicit Correlation Gating (AICG), employs learnable summary tokens to distill dominant reference patterns and capture implicit correlations with LQ features. Integrated into the attention backbone, AICG provides lightweight, adaptive regulation of reference guidance, serving as a built-in safeguard against erroneous fusion. Experiments on multiple datasets demonstrate that Ada-RefSR achieves a strong balance of fidelity, naturalness, and efficiency, while remaining robust under varying reference alignment.
摘要：最近的工作探索了基于参考的超分辨率（RefSR）来减轻基于扩散的图像恢复中的幻觉。一个关键的挑战是现实世界的退化使得低质量（LQ）输入和参考（Ref）图像之间的对应关系不可靠，需要对参考使用进行自适应控制。现有方法要么忽略 LQ-Ref 相关性，要么依赖脆弱的显式匹配，导致过度依赖误导性参考或未充分利用有价值的线索。为了解决这个问题，我们提出了 Ada-RefSR，这是一个以“信任但验证”原则为指导的单步扩散框架：参考信息在可靠时被利用，否则被抑制。其核心组件自适应隐式相关门控（AICG）采用可学习的摘要标记来提取主要参考模式并捕获与 LQ 特征的隐式相关性。 AICG 集成到注意力主干中，提供轻量级、自适应的参考指导调节，作为防止错误融合的内置保护措施。对多个数据集的实验表明，Ada-RefSR 实现了保真度、自然度和效率的强大平衡，同时在不同的参考对齐下保持鲁棒性。

Title: ProxyImg: Towards Highly-Controllable Image Representation via Hierarchical Disentangled Proxy Embedding

Authors: Ye Chen, Yupeng Zhu, Xiongzhen Zhang, Zhewen Wan, Yingzhe Li, Wenjun Zhang, Bingbing Ni
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.01881
Pdf URL: https://arxiv.org/pdf/2602.01881
Copy Paste: [[2602.01881]] ProxyImg: Towards Highly-Controllable Image Representation via Hierarchical Disentangled Proxy Embedding(https://arxiv.org/abs/2602.01881)
Keywords: generative
Abstract: Prevailing image representation methods, including explicit representations such as raster images and Gaussian primitives, as well as implicit representations such as latent images, either suffer from representation redundancy that leads to heavy manual editing effort, or lack a direct mapping from latent variables to semantic instances or parts, making fine-grained manipulation difficult. These limitations hinder efficient and controllable image and video editing. To address these issues, we propose a hierarchical proxy-based parametric image representation that disentangles semantic, geometric, and textural attributes into independent and manipulable parameter spaces. Based on a semantic-aware decomposition of the input image, our representation constructs hierarchical proxy geometries through adaptive Bezier fitting and iterative internal region subdivision and meshing. Multi-scale implicit texture parameters are embedded into the resulting geometry-aware distributed proxy nodes, enabling continuous high-fidelity reconstruction in the pixel domain and instance- or part-independent semantic editing. In addition, we introduce a locality-adaptive feature indexing mechanism to ensure spatial texture coherence, which further supports high-quality background completion without relying on generative models. Extensive experiments on image reconstruction and editing benchmarks, including ImageNet, OIR-Bench, and HumanEdit, demonstrate that our method achieves state-of-the-art rendering fidelity with significantly fewer parameters, while enabling intuitive, interactive, and physically plausible manipulation. Moreover, by integrating proxy nodes with Position-Based Dynamics, our framework supports real-time physics-driven animation using lightweight implicit rendering, achieving superior temporal consistency and visual realism compared with generative approaches.
摘要：流行的图像表示方法，包括诸如光栅图像和高斯基元之类的显式表示，以及诸如潜在图像之类的隐式表示，要么遭受表示冗余的困扰，导致繁重的手动编辑工作，要么缺乏从潜在变量到语义实例或部分的直接映射，从而使得细粒度操作变得困难。这些限制阻碍了高效且可控的图像和视频编辑。为了解决这些问题，我们提出了一种基于分层代理的参数图像表示，它将语义、几何和纹理属性分解为独立且可操作的参数空间。基于输入图像的语义感知分解，我们的表示通过自适应贝塞尔拟合和迭代内部区域细分和网格划分构建分层代理几何形状。多尺度隐式纹理参数嵌入到生成的几何感知分布式代理节点中，从而实现像素域中的连续高保真重建以及实例或部分独立的语义编辑。此外，我们引入了局部自适应特征索引机制来确保空间纹理的一致性，这进一步支持高质量的背景完成而不依赖于生成模型。关于图像重建和编辑基准（包括 ImageNet、OIR-Bench 和 HumanEdit）的大量实验表明，我们的方法以明显更少的参数实现了最先进的渲染保真度，同时实现了直观、交互式和物理上合理的操作。此外，通过将代理节点与基于位置的动力学集成，我们的框架支持使用轻量级隐式渲染的实时物理驱动动画，与生成方法相比，实现了卓越的时间一致性和视觉真实感。

Title: Internal Flow Signatures for Self-Checking and Refinement in LLMs

Authors: Sungheon Jeong, Sanggeon Yun, Ryozo Masukawa, Wenjun Haung, Hanning Chen, Mohsen Imani
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2602.01897
Pdf URL: https://arxiv.org/pdf/2602.01897
Copy Paste: [[2602.01897]] Internal Flow Signatures for Self-Checking and Refinement in LLMs(https://arxiv.org/abs/2602.01897)
Keywords: generation
Abstract: Large language models can generate fluent answers that are unfaithful to the provided context, while many safeguards rely on external verification or a separate judge after generation. We introduce internal flow signatures that audit decision formation from depthwise dynamics at a fixed inter-block monitoring boundary. The method stabilizes token-wise motion via bias-centered monitoring, then summarizes trajectories in compact moving readout-aligned subspaces constructed from the top token and its close competitors within each depth window. Neighboring window frames are aligned by an orthogonal transport, yielding depth-comparable transported step lengths, turning angles, and subspace drift summaries that are invariant to within-window basis choices. A lightweight GRU validator trained on these signatures performs self-checking without modifying the base model. Beyond detection, the validator localizes a culprit depth event and enables a targeted refinement: the model rolls back to the culprit token and clamps an abnormal transported step at the identified block while preserving the orthogonal residual. The resulting pipeline provides actionable localization and low-overhead self-checking from internal decision dynamics. Code is available at this http URL.
摘要：大型语言模型可以生成不忠实于所提供上下文的流畅答案，而许多保障措施依赖于外部验证或生成后的单独评判模型。我们引入了内部流签名，它可以在固定的块间监控边界上根据深度方向的动态来审核决策形成。该方法通过以偏置为中心的监控来稳定逐词元的运动，然后在紧凑的、移动的读出对齐子空间中总结轨迹，这些子空间由每个深度窗口内的最高词元及其接近的竞争者构建。相邻窗口的标架通过正交传输对齐，产生深度可比较的传输步长、转向角和子空间漂移摘要，这些量对窗口内基的选择保持不变。在这些签名上训练的轻量级 GRU 验证器可以在不修改基础模型的情况下执行自我检查。除了检测之外，验证器还可以定位肇因深度事件并实现有针对性的细化：模型回滚到肇因词元，并在已识别的块上钳制异常的传输步，同时保留正交残差。由此产生的流程提供了可操作的定位以及基于内部决策动态的低开销自我检查。代码可在此 http URL 获得。

Title: Q Cache: Visual Attention is Valuable in Less than Half of Decode Layers for Multimodal Large Language Model

Authors: Jiedong Zhuang, Lu Lu, Ming Dai, Rui Hu, Jian Chen, Qiang Liu, Haoji Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.01901
Pdf URL: https://arxiv.org/pdf/2602.01901
Copy Paste: [[2602.01901]] Q Cache: Visual Attention is Valuable in Less than Half of Decode Layers for Multimodal Large Language Model(https://arxiv.org/abs/2602.01901)
Keywords: generation
Abstract: Multimodal large language models (MLLMs) are plagued by exorbitant inference costs attributable to the profusion of visual tokens within the vision encoder. The redundant visual tokens engender a substantial computational load and a key-value (KV) cache footprint bottleneck. Existing approaches focus on token-wise optimization, leveraging diverse intricate token pruning techniques to eliminate non-crucial visual tokens. Nevertheless, these methods often unavoidably undermine the integrity of the KV cache, resulting in failures in long-text generation tasks. To this end, we conduct an in-depth investigation into the attention mechanism of the model from a new perspective, and discern that the attention within more than half of all decode layers is semantically similar. Based on this finding, we contend that the attention in certain layers can be streamlined by inheriting the attention from their preceding layers. Consequently, we propose Lazy Attention, an efficient attention mechanism that enables cross-layer sharing of similar attention patterns. It ingeniously reduces layer-wise redundant computation in attention. In Lazy Attention, we develop a novel layer-shared cache, Q Cache, tailored for MLLMs, which facilitates the reuse of queries across adjacent layers. In particular, Q Cache is lightweight and fully compatible with existing inference frameworks, including Flash Attention and KV cache. Additionally, our method is highly flexible as it is orthogonal to existing token-wise techniques and can be deployed independently or combined with token pruning approaches. Empirical evaluations on multiple benchmarks demonstrate that our method can reduce KV cache usage by over 35% and achieve 1.5x throughput improvement, while sacrificing only approximately 1% of performance on various MLLMs. Compared with SOTA token-wise methods, our technique achieves superior accuracy preservation.
摘要：多模态大语言模型 (MLLM) 受到由于视觉编码器中大量视觉标记导致推理成本过高的困扰。冗余的视觉令牌会产生大量的计算负载和键值 (KV) 缓存占用瓶颈。现有的方法侧重于标记优化，利用各种复杂的标记修剪技术来消除非关键的视觉标记。然而，这些方法往往不可避免地破坏KV缓存的完整性，导致长文本生成任务失败。为此，我们从一个新的角度对模型的注意力机制进行了深入研究，发现所有解码层中超过一半的注意力是语义相似的。根据这一发现，我们认为某些层中的注意力可以通过继承前一层的注意力来简化。因此，我们提出了惰性注意力（Lazy Attention），这是一种有效的注意力机制，可以跨层共享相似的注意力模式。它巧妙地减少了注意力中的逐层冗余计算。在 Lazy Attention 中，我们开发了一种新颖的层共享缓存 Q Cache，专为 MLLM 量身定制，有助于跨相邻层重用查询。特别是，Q Cache 是轻量级的，并且与现有的推理框架完全兼容，包括 Flash Attention 和 KV 缓存。此外，我们的方法非常灵活，因为它与现有的令牌技术正交，并且可以独立部署或与令牌修剪方法结合使用。对多个基准的实证评估表明，我们的方法可以将 KV 缓存使用率减少超过 35%，并实现 1.5 倍的吞吐量改进，同时在各种 MLLM 上仅牺牲约 1% 的性能。与 SOTA token-wise 方法相比，我们的技术实现了卓越的准确性保持。

Title: Learning Sparse Visual Representations via Spatial-Semantic Factorization

Authors: Theodore Zhengde Zhao, Sid Kiblawi, Jianwei Yang, Naoto Usuyama, Reuben Tan, Noel C Codella, Tristan Naumann, Hoifung Poon, Mu Wei
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2602.01905
Pdf URL: https://arxiv.org/pdf/2602.01905
Copy Paste: [[2602.01905]] Learning Sparse Visual Representations via Spatial-Semantic Factorization(https://arxiv.org/abs/2602.01905)
Keywords: generative
Abstract: Self-supervised learning (SSL) faces a fundamental conflict between semantic understanding and image reconstruction. High-level semantic SSL (e.g., DINO) relies on global tokens that are forced to be location-invariant for augmentation alignment, a process that inherently discards the spatial coordinates required for reconstruction. Conversely, generative SSL (e.g., MAE) preserves dense feature grids for reconstruction but fails to produce high-level abstractions. We introduce STELLAR, a framework that resolves this tension by factorizing visual features into a low-rank product of semantic concepts and their spatial distributions. This disentanglement allows us to perform DINO-style augmentation alignment on the semantic tokens while maintaining the precise spatial mapping in the localization matrix necessary for pixel-level reconstruction. We demonstrate that as few as 16 sparse tokens under this factorized form are sufficient to simultaneously support high-quality reconstruction (2.60 FID) and match the semantic performance of dense backbones (79.10% ImageNet accuracy). Our results highlight STELLAR as a versatile sparse representation that bridges the gap between discriminative and generative vision by strategically separating semantic identity from spatial geometry. Code available at this https URL.
摘要：自监督学习（SSL）面临语义理解和图像重建之间的根本冲突。高级语义 SSL（例如 DINO）依赖于全局标记，这些全局标记被迫保持位置不变以进行增强对齐，该过程本质上丢弃了重建所需的空间坐标。相反，生成式 SSL（例如 MAE）保留了用于重建的密集特征网格，但无法生成高级抽象。我们引入了 STELLAR，这是一个通过将视觉特征分解为语义概念及其空间分布的低阶乘积来解决这种紧张关系的框架。这种解开使我们能够对语义标记执行 DINO 风格的增强对齐，同时保持像素级重建所需的定位矩阵中的精确空间映射。我们证明，在这种因式分解形式下，只需 16 个稀疏标记就足以同时支持高质量重建（2.60 FID）并匹配密集骨干网的语义性能（79.10% ImageNet 准确率）。我们的结果强调 STELLAR 作为一种通用的稀疏表示，通过策略性地将语义身份与空间几何分离，弥合了判别视觉和生成视觉之间的差距。代码可在此 https URL 获取。
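
The low-rank factorization described in this abstract can be made concrete with a small sketch. It is only an illustration of the general idea (dense patch features approximated as a localization matrix times a few semantic tokens), not STELLAR itself. The function names are hypothetical, and the sizes 196 patches and 768 dimensions are assumptions chosen to resemble a typical ViT feature grid; only the count of 16 tokens comes from the abstract.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    inner = len(b)
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions do not match")
    cols = len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def dense_size(num_patches, dim):
    """Numbers stored by a dense feature grid (num_patches x dim)."""
    return num_patches * dim


def factored_size(num_patches, dim, num_tokens):
    """Numbers stored by localization (num_patches x num_tokens)
    plus semantic tokens (num_tokens x dim)."""
    return num_tokens * (num_patches + dim)


# Toy example: 3 patches, 2 semantic tokens, 2 feature dimensions.
localization = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # where each concept is
semantic_tokens = [[1.0, 2.0], [3.0, 4.0]]            # what each concept is
features = matmul(localization, semantic_tokens)
print(features)

# Assumed grid size (196 x 768) with the 16 tokens from the abstract.
print(dense_size(196, 768), factored_size(196, 768, 16))
```

In this form the semantic tokens carry no position information, so an augmentation-alignment loss could act on them alone while the localization matrix keeps the spatial mapping needed for reconstruction.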

Title: Bayesian Integration of Nonlinear Incomplete Clinical Data

Authors: Lucía González-Zamorano, Nuria Balbás-Esteban, Vanessa Gómez-Verdejo, Albert Belenguer-Llorens, Carlos Sevilla-Salcedo
Subjects: cs.LG, cs.CY
Abstract URL: https://arxiv.org/abs/2602.01924
Pdf URL: https://arxiv.org/pdf/2602.01924
Copy Paste: [[2602.01924]] Bayesian Integration of Nonlinear Incomplete Clinical Data(https://arxiv.org/abs/2602.01924)
Keywords: generative
Abstract: Multimodal clinical data are characterized by high dimensionality, heterogeneous representations, and structured missingness, posing significant challenges for predictive modeling, data integration, and interpretability. We propose BIONIC (Bayesian Integration of Nonlinear Incomplete Clinical data), a unified probabilistic framework that integrates heterogeneous multimodal data under missingness through a joint generative-discriminative latent architecture. BIONIC uses pretrained embeddings for complex modalities such as medical images and clinical text, while incorporating structured clinical variables directly within a Bayesian multimodal formulation. The proposed framework enables robust learning in partially observed and semi-supervised settings by explicitly modeling modality-level and variable-level missingness, as well as missing labels. We evaluate BIONIC on three multimodal clinical and biomedical datasets, demonstrating strong and consistent discriminative performance compared to representative multimodal baselines, particularly under incomplete data scenarios. Beyond predictive accuracy, BIONIC provides intrinsic interpretability through its latent structure, enabling population-level analysis of modality relevance and supporting clinically meaningful insight.
摘要：多模态临床数据具有高维度、异构表示和结构化缺失的特点，对预测建模、数据集成和可解释性提出了重大挑战。我们提出了 BIONIC（非线性不完整临床数据的贝叶斯集成），这是一个统一的概率框架，通过联合生成判别潜在架构集成缺失情况下的异构多模态数据。 BIONIC 对医学图像和临床文本等复杂模式使用预训练嵌入，同时将结构化临床变量直接纳入贝叶斯多模式公式中。所提出的框架通过显式建模模态级别和变量级别缺失以及缺失标签，在部分观察和半监督的环境中实现稳健的学习。我们在三个多模态临床和生物医学数据集上评估 BIONIC，与代表性多模态基线相比，显示出强大且一致的判别性能，特别是在不完整的数据场景下。除了预测准确性之外，BIONIC 通过其潜在结构提供内在的可解释性，从而能够对模态相关性进行人群水平分析并支持具有临床意义的见解。

Title: Boundary-Constrained Diffusion Models for Floorplan Generation: Balancing Realism and Diversity

Authors: Leonardo Stoppani, Davide Bacciu, Shahab Mokarizadeh
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2602.01949
Pdf URL: https://arxiv.org/pdf/2602.01949
Copy Paste: [[2602.01949]] Boundary-Constrained Diffusion Models for Floorplan Generation: Balancing Realism and Diversity(https://arxiv.org/abs/2602.01949)
Keywords: generation, generative
Abstract: Diffusion models have become widely popular for automated floorplan generation, producing highly realistic layouts conditioned on user-defined constraints. However, optimizing for perceptual metrics such as the Fréchet Inception Distance (FID) causes limited design diversity. To address this, we propose the Diversity Score (DS), a metric that quantifies layout diversity under fixed constraints. Moreover, to improve geometric consistency, we introduce a Boundary Cross-Attention (BCA) module that enables conditioning on building boundaries. Our experiments show that BCA significantly improves boundary adherence, while prolonged training drives diversity collapse undiagnosed by FID, revealing a critical trade-off between realism and diversity. Out-Of-Distribution evaluations further demonstrate the models' reliance on dataset priors, emphasizing the need for generative systems that explicitly balance fidelity, diversity, and generalization in architectural design tasks.
摘要：扩散模型在自动平面图生成中已广泛流行，可根据用户定义的约束生成高度逼真的布局。然而，优化 Fréchet 起始距离 (FID) 等感知指标会导致设计多样性有限。为了解决这个问题，我们提出了多样性得分（DS），这是一种在固定约束下量化布局多样性的指标。此外，为了提高几何一致性，我们引入了边界交叉注意（BCA）模块，该模块可以对建筑物边界进行调节。我们的实验表明，BCA 显着提高了边界遵守率，而长时间的训练会导致 FID 未诊断出的多样性崩溃，揭示了现实性和多样性之间的关键权衡。分布外评估进一步证明了模型对数据集先验的依赖，强调了需要在架构设计任务中明确平衡保真度、多样性和泛化性的生成系统。

Title: Grounding Generated Videos in Feasible Plans via World Models

Authors: Christos Ziakas, Amir Bar, Alessandra Russo
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2602.01960
Pdf URL: https://arxiv.org/pdf/2602.01960
Copy Paste: [[2602.01960]] Grounding Generated Videos in Feasible Plans via World Models(https://arxiv.org/abs/2602.01960)
Keywords: generative
Abstract: Large-scale video generative models have shown emerging capabilities as zero-shot visual planners, yet video-generated plans often violate temporal consistency and physical constraints, leading to failures when mapped to executable actions. To address this, we propose Grounding Video Plans with World Models (GVP-WM), a planning method that grounds video-generated plans into feasible action sequences using a learned action-conditioned world model. At test-time, GVP-WM first generates a video plan from initial and goal observations, then projects the video guidance onto the manifold of dynamically feasible latent trajectories via video-guided latent collocation. In particular, we formulate grounding as a goal-conditioned latent-space trajectory optimization problem that jointly optimizes latent states and actions under world-model dynamics, while preserving semantic alignment with the video-generated plan. Empirically, GVP-WM recovers feasible long-horizon plans from zero-shot image-to-video-generated and motion-blurred videos that violate physical constraints, across navigation and manipulation simulation tasks.
摘要：大规模视频生成模型已经显示出作为零镜头视觉规划器的新兴功能，但视频生成的计划经常违反时间一致性和物理约束，导致映射到可执行动作时失败。为了解决这个问题，我们提出了基于世界模型的视频计划（GVP-WM），这是一种使用学习的动作条件世界模型将视频生成的计划转化为可行的动作序列的规划方法。在测试时，GVP-WM 首先根据初始观察和目标观察生成视频计划，然后通过视频引导的潜在搭配将视频引导投影到多种动态可行的潜在轨迹上。特别是，我们将基础制定为目标条件潜在空间轨迹优化问题，在世界模型动态下联合优化潜在状态和动作，同时保持与视频生成计划的语义对齐。根据经验，GVP-WM 从零镜头图像到视频生成的运动模糊视频中恢复了可行的长期计划，这些视频违反了导航和操作模拟任务的物理约束。

Title: Your AI-Generated Image Detector Can Secretly Achieve SOTA Accuracy, If Calibrated

Authors: Muli Yang, Gabriel James Goenawan, Henan Wang, Huaiyuan Qin, Chenghao Xu, Yanhua Yang, Fen Fang, Ying Sun, Joo-Hwee Lim, Hongyuan Zhu
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2602.01973
Pdf URL: https://arxiv.org/pdf/2602.01973
Copy Paste: [[2602.01973]] Your AI-Generated Image Detector Can Secretly Achieve SOTA Accuracy, If Calibrated(https://arxiv.org/abs/2602.01973)
Keywords: generation
Abstract: Despite being trained on balanced datasets, existing AI-generated image detectors often exhibit systematic bias at test time, frequently misclassifying fake images as real. We hypothesize that this behavior stems from distributional shift in fake samples and implicit priors learned during training. Specifically, models tend to overfit to superficial artifacts that do not generalize well across different generation methods, leading to a misaligned decision threshold when faced with test-time distribution shift. To address this, we propose a theoretically grounded post-hoc calibration framework based on Bayesian decision theory. In particular, we introduce a learnable scalar correction to the model's logits, optimized on a small validation set from the target distribution while keeping the backbone frozen. This parametric adjustment compensates for distributional shift in model output, realigning the decision boundary even without requiring ground-truth labels. Experiments on challenging benchmarks show that our approach significantly improves robustness without retraining, offering a lightweight and principled solution for reliable and adaptive AI-generated image detection in the open world. Code is available at this https URL.
摘要：尽管在平衡数据集上进行了训练，现有的人工智能生成的图像检测器在测试时经常表现出系统偏差，经常将假图像错误地分类为真实图像。我们假设这种行为源于假样本的分布变化和训练期间学习的隐式先验。具体来说，模型往往会过度拟合表面的工件，而这些工件在不同的生成方法中不能很好地泛化，从而在面临测试时间分布变化时导致决策阈值错位。为了解决这个问题，我们提出了一个基于贝叶斯决策理论的理论基础事后校准框架。特别是，我们对模型的 logits 引入了可学习的标量校正，在目标分布的小型验证集上进行优化，同时保持骨干网冻结。这种参数调整可以补偿模型输出的分布变化，即使不需要真实标签也可以重新调整决策边界。在具有挑战性的基准上进行的实验表明，我们的方法无需重新训练即可显着提高鲁棒性，为开放世界中可靠且自适应的人工智能生成图像检测提供轻量级且有原则的解决方案。代码可从此 https URL 获取。
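
A scalar logit correction of the kind this abstract describes can be sketched as follows. The abstract does not state the optimization objective, so this sketch substitutes one simple label-free criterion as an assumption: choose the bias `b` so that the mean predicted fake probability on unlabeled target samples matches an assumed prior `target_fake_rate`. The function names and the prior are hypothetical, and the detector itself stays frozen.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logit_bias(logits, target_fake_rate=0.5, iters=100):
    """Find a scalar b such that mean(sigmoid(logit + b)) is close to
    target_fake_rate. Uses bisection, which works because the mean is
    monotonically increasing in b. No ground-truth labels are used.
    """
    lo, hi = -20.0, 20.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        mean_p = sum(sigmoid(z + mid) for z in logits) / len(logits)
        if mean_p < target_fake_rate:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Hypothetical logits from a detector that leans toward "real" (negative).
val_logits = [-2.0, -1.0, 0.0]
bias = fit_logit_bias(val_logits)
calibrated = [sigmoid(z + bias) for z in val_logits]
print(round(bias, 6), [round(p, 3) for p in calibrated])
```

Because only one scalar is fitted, a small validation set is enough, and the correction shifts the decision threshold without touching the backbone weights.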

Title: Leveraging Latent Vector Prediction for Localized Control in Image Generation via Diffusion Models

Authors: Pablo Domingo-Gregorio, Javier Ruiz-Hidalgo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.01991
Pdf URL: https://arxiv.org/pdf/2602.01991
Copy Paste: [[2602.01991]] Leveraging Latent Vector Prediction for Localized Control in Image Generation via Diffusion Models(https://arxiv.org/abs/2602.01991)
Keywords: generation
Abstract: Diffusion models have emerged as a leading approach in text-to-image generation, producing high-quality images from textual descriptions. However, attempting to achieve detailed control to get a desired image solely through text remains a laborious trial-and-error endeavor. Recent methods have introduced image-level controls alongside text prompts, using prior images to extract conditional information such as edges, segmentation and depth maps. While effective, these methods apply conditions uniformly across the entire image, limiting localized control. In this paper, we propose a novel methodology to enable precise local control over user-defined regions of an image, while leaving to the diffusion model the task of autonomously generating the remaining areas according to the original prompt. Our approach introduces a new training framework that incorporates masking features and an additional loss term, which leverages the prediction of the initial latent vector at any diffusion step to enhance the correspondence between the current step and the final sample in the latent space. Extensive experiments demonstrate that our method effectively synthesizes high-quality images with controlled local conditions.
摘要：扩散模型成为文本到图像生成的领先方法，从文本描述生成高质量图像。然而，尝试仅通过文本实现详细控制以获得所需的图像仍然是一项艰苦的试错工作。最近的方法引入了图像级控制和文本提示，使用先前的图像来提取条件信息，例如边缘、分割和深度图。这些方法虽然有效，但在整个图像上统一应用条件，限制了局部控制。在本文中，我们提出了一种新颖的方法，可以对图像的用户定义区域进行精确的局部控制，同时将根据原始提示自主生成剩余区域的任务留给扩散模型。我们的方法引入了一种新的训练框架，该框架结合了掩蔽特征和附加损失项，该框架利用任何扩散步骤中初始潜在向量的预测来增强当前步骤与潜在空间中最终样本之间的对应性。大量的实验表明，我们的方法可以在受控的局部条件下有效地合成高质量图像。

Title: On the Limits of Layer Pruning for Generative Reasoning in LLMs

Authors: Safal Shrestha, Anubhav Shrestha, Aadim Nepal, Minwu Kim, Keith Ross
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2602.01997
Pdf URL: https://arxiv.org/pdf/2602.01997
Copy Paste: [[2602.01997]] On the Limits of Layer Pruning for Generative Reasoning in LLMs(https://arxiv.org/abs/2602.01997)
Keywords: generation, generative
Abstract: Recent works have shown that layer pruning can compress large language models (LLMs) while retaining strong performance on classification benchmarks with little or no finetuning. However, existing pruning techniques often suffer severe degradation on generative reasoning tasks. Through a systematic study across multiple model families, we find that tasks requiring multi-step reasoning are particularly sensitive to depth reduction. Beyond surface-level text degeneration, we observe degradation of critical algorithmic capabilities, including arithmetic computation for mathematical reasoning and balanced parenthesis generation for code synthesis. Under realistic post-training constraints, without access to pretraining-scale data or compute, we evaluate a simple mitigation strategy based on supervised finetuning with Self-Generated Responses. This approach achieves strong recovery on classification tasks, retaining up to 90% of baseline performance, and yields substantial gains of up to 20-30 percentage points on generative benchmarks compared to prior post-pruning techniques. Crucially, despite these gains, recovery for generative reasoning remains fundamentally limited relative to classification tasks and is viable primarily at lower pruning ratios. Overall, we characterize the practical limits of layer pruning for generative reasoning and provide guidance on when depth reduction can be applied effectively under constrained post-training regimes.
摘要：最近的工作表明，层剪枝可以压缩大型语言模型（LLM），同时在分类基准上保持强大的性能，而无需进行微调或很少进行微调。然而，现有的剪枝技术在生成推理任务上经常遭受严重退化。通过对多个模型系列的系统研究，我们发现需要多步骤推理的任务对深度缩减特别敏感。除了表面级别的文本退化之外，我们还观察到关键算法能力的退化，包括数学推理的算术计算和代码合成的平衡括号生成。在现实的训练后约束下，在无法访问预训练规模数据或计算的情况下，我们评估了一种基于自生成响应的监督微调的简单缓解策略。与之前的后剪枝技术相比，这种方法在分类任务上实现了强劲的恢复，保留了高达 90% 的基线性能，并且在生成基准上获得了高达 20--30 个百分点的大幅收益。至关重要的是，尽管取得了这些进展，但生成推理的恢复相对于分类任务仍然受到根本限制，并且主要在较低的剪枝率下可行。总的来说，我们描述了生成推理的层剪枝的实际限制，并就何时可以在受约束的训练后制度下有效应用深度缩减提供指导。

Title: UniDriveDreamer: A Single-Stage Multimodal World Model for Autonomous Driving

Authors: Guosheng Zhao, Yaozeng Wang, Xiaofeng Wang, Zheng Zhu, Tingdong Yu, Guan Huang, Yongchen Zai, Ji Jiao, Changliang Xue, Xiaole Wang, Zhen Yang, Futang Zhu, Xingang Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.02002
Pdf URL: https://arxiv.org/pdf/2602.02002
Copy Paste: [[2602.02002]] UniDriveDreamer: A Single-Stage Multimodal World Model for Autonomous Driving(https://arxiv.org/abs/2602.02002)
Keywords: generation
Abstract: World models have demonstrated significant promise for data synthesis in autonomous driving. However, existing methods predominantly concentrate on single-modality generation, typically focusing on either multi-camera video or LiDAR sequence synthesis. In this paper, we propose UniDriveDreamer, a single-stage unified multimodal world model for autonomous driving, which directly generates multimodal future observations without relying on intermediate representations or cascaded modules. Our framework introduces a LiDAR-specific variational autoencoder (VAE) designed to encode input LiDAR sequences, alongside a video VAE for multi-camera images. To ensure cross-modal compatibility and training stability, we propose Unified Latent Anchoring (ULA), which explicitly aligns the latent distributions of the two modalities. The aligned features are fused and processed by a diffusion transformer that jointly models their geometric correspondence and temporal evolution. Additionally, structured scene layout information is projected per modality as a conditioning signal to guide the synthesis. Extensive experiments demonstrate that UniDriveDreamer outperforms previous state-of-the-art methods in both video and LiDAR generation, while also yielding measurable improvements in downstream
摘要：世界模型已经展示了自动驾驶数据合成的巨大前景。然而，现有方法主要集中于单模态生成，通常侧重于多摄像头视频或激光雷达序列合成。在本文中，我们提出了 UniDriveDreamer，这是一种用于自动驾驶的单阶段统一多模态世界模型，它直接生成多模态未来观测结果，而不依赖于中间表示或级联模块。我们的框架引入了 LiDAR 专用的变分自动编码器 (VAE)，旨在对输入 LiDAR 序列进行编码，以及用于多摄像头图像的视频 VAE。为了确保跨模态兼容性和训练稳定性，我们提出了统一潜在锚定（ULA），它明确地对齐了两种模态的潜在分布。对齐的特征由扩散变换器融合和处理，该扩散变换器联合模拟它们的几何对应和时间演化。此外，结构化场景布局信息按模态投影作为调节信号来指导合成。大量实验表明，UniDriveDreamer 在视频和 LiDAR 生成方面均优于以前最先进的方法，同时在下游方面也产生了可衡量的改进

Title: Logic-Guided Vector Fields for Constrained Generative Modeling

Authors: Ali Baheri
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2602.02009
Pdf URL: https://arxiv.org/pdf/2602.02009
Copy Paste: [[2602.02009]] Logic-Guided Vector Fields for Constrained Generative Modeling(https://arxiv.org/abs/2602.02009)
Keywords: generation, generative
Abstract: Neuro-symbolic systems aim to combine the expressive structure of symbolic logic with the flexibility of neural learning; yet, generative models typically lack mechanisms to enforce declarative constraints at generation time. We propose Logic-Guided Vector Fields (LGVF), a neuro-symbolic framework that injects symbolic knowledge, specified as differentiable relaxations of logical constraints, into flow matching generative models. LGVF couples two complementary mechanisms: (1) a training-time logic loss that penalizes constraint violations along continuous flow trajectories, with weights that emphasize correctness near the target distribution; and (2) an inference-time adjustment that steers sampling using constraint gradients, acting as a lightweight, logic-informed correction to the learned dynamics. We evaluate LGVF on three constrained generation case studies spanning linear, nonlinear, and multi-region feasibility constraints. Across all settings, LGVF reduces constraint violations by 59-82% compared to standard flow matching and achieves the lowest violation rates in each case. In the linear and ring settings, LGVF also improves distributional fidelity as measured by MMD, while in the multi-obstacle setting, we observe a satisfaction-fidelity trade-off, with improved feasibility but increased MMD. Beyond quantitative gains, LGVF yields constraint-aware vector fields exhibiting emergent obstacle-avoidance behavior, routing samples around forbidden regions without explicit path planning.
摘要：神经符号系统旨在将符号逻辑的表达结构与神经学习的灵活性相结合；然而，生成模型通常缺乏在生成时强制执行声明性约束的机制。我们提出了逻辑引导向量场（LGVF），这是一种神经符号框架，它将符号知识（指定为逻辑约束的可微松弛）注入到流匹配生成模型中。LGVF 结合了两种互补机制：(1) 训练时逻辑损失，用于惩罚沿连续流轨迹的约束违规，其权重强调目标分布附近的正确性；(2) 推理时调整，使用约束梯度引导采样，充当对学习到的动态的轻量级、基于逻辑的校正。我们通过三个涵盖线性、非线性和多区域可行性约束的约束生成案例研究来评估 LGVF。在所有设置中，与标准流匹配相比，LGVF 将约束违规减少了 59-82%，并在每种情况下实现了最低违规率。在线性和环形设置中，LGVF 还提高了以 MMD 衡量的分布保真度，而在多障碍设置中，我们观察到约束满足度与保真度之间的权衡，可行性提高但 MMD 增加。除了定量收益之外，LGVF 还产生约束感知向量场，表现出涌现的避障行为，在没有显式路径规划的情况下引导样本绕过禁区。
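
The inference-time mechanism in this abstract (steering sampling with constraint gradients) can be sketched on a toy problem. Everything below is an assumption made for illustration: the constant `base_field` stands in for a learned flow-matching vector field, the constraint is "first coordinate at most 0.5" relaxed to a squared hinge penalty, and `guidance` is a hypothetical step-size weight. It is not the LGVF implementation.

```python
def violation(x, limit=0.5):
    """Differentiable relaxation of the constraint x[0] <= limit."""
    return max(0.0, x[0] - limit) ** 2


def violation_grad(x, limit=0.5):
    """Gradient of `violation` with respect to x."""
    grad = [0.0] * len(x)
    grad[0] = 2.0 * max(0.0, x[0] - limit)
    return grad


def base_field(x, t):
    """Toy stand-in for a learned vector field: drift along the first axis."""
    return [1.0] + [0.0] * (len(x) - 1)


def sample(x0, steps=100, guidance=0.0):
    """Euler-integrate dx/dt = v(x, t) - guidance * grad violation(x)
    from t = 0 to t = 1."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = base_field(x, i * dt)
        g = violation_grad(x)
        x = [xi + dt * (vi - guidance * gi) for xi, vi, gi in zip(x, v, g)]
    return x


plain = sample([0.0, 0.0])
steered = sample([0.0, 0.0], guidance=5.0)
print(violation(plain), violation(steered))
```

The unsteered sample ends deep in the forbidden region, while the steered one settles near the boundary. The correction acts only where the constraint is violated, so trajectories that never approach the boundary are left unchanged.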

Title: One Size, Many Fits: Aligning Diverse Group-Wise Click Preferences in Large-Scale Advertising Image Generation

Authors: Shuo Lu, Haohan Wang, Wei Feng, Weizhen Wang, Shen Zhang, Yaoyu Li, Ao Ma, Zheng Zhang, Jingjing Lv, Junjie Shen, Ching Law, Bing Zhan, Yuan Xu, Huizai Yao, Yongcan Yu, Chenyang Si, Jian Liang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2602.02033
Pdf URL: https://arxiv.org/pdf/2602.02033
Copy Paste: [[2602.02033]] One Size, Many Fits: Aligning Diverse Group-Wise Click Preferences in Large-Scale Advertising Image Generation(https://arxiv.org/abs/2602.02033)
Keywords: generation
Abstract: Advertising image generation has increasingly focused on online metrics like Click-Through Rate (CTR), yet existing approaches adopt a "one-size-fits-all" strategy that optimizes for overall CTR while neglecting preference diversity among user groups. This leads to suboptimal performance for specific groups, limiting targeted marketing effectiveness. To bridge this gap, we present One Size, Many Fits (OSMF), a unified framework that aligns diverse group-wise click preferences in large-scale advertising image generation. OSMF begins with product-aware adaptive grouping, which dynamically organizes users based on their attributes and product characteristics, representing each group with rich collective preference features. Building on these groups, preference-conditioned image generation employs a Group-aware Multimodal Large Language Model (G-MLLM) to generate tailored images for each group. The G-MLLM is pre-trained to simultaneously comprehend group features and generate advertising images. Subsequently, we fine-tune the G-MLLM using our proposed Group-DPO for group-wise preference alignment, which effectively enhances each group's CTR on the generated images. To further advance this field, we introduce the Grouped Advertising Image Preference Dataset (GAIP), the first large-scale public dataset of group-wise image preferences, including around 600K groups built from 40M users. Extensive experiments demonstrate that our framework achieves state-of-the-art performance in both offline and online settings. Our code and datasets will be released at this https URL.
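摘要：广告图像生成越来越关注点击率（CTR）等在线指标，但现有方法采用“一刀切”策略，只优化整体点击率，而忽视了用户群体之间的偏好多样性。这导致特定群体的效果欠佳，限制了定向营销的有效性。为了弥补这一差距，我们提出了 \textit{One Size, Many Fits}（OSMF），这是一个在大规模广告图像生成中对齐多样化群体点击偏好的统一框架。OSMF 首先进行产品感知自适应分组，根据用户的属性和产品特征动态地组织用户，以丰富的集体偏好特征来表示每个群体。在这些群体的基础上，偏好条件图像生成采用群体感知多模态大语言模型（G-MLLM）来为每个群体生成定制图像。G-MLLM 经过预训练，能够同时理解群体特征并生成广告图像。随后，我们使用我们提出的 Group-DPO 对 G-MLLM 进行微调，以实现群体偏好对齐，从而有效提升每个群体在生成图像上的点击率。为了进一步推进这一领域，我们引入了分组广告图像偏好数据集（GAIP），这是第一个基于分组的图像偏好的大规模公共数据集，包括由 4000 万用户构建的约 60 万个组。大量实验表明，我们的框架在离线和在线环境中均达到了最先进的性能。我们的代码和数据集将在此 https URL 发布。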
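
The abstract above names Group-DPO but does not give its objective. As background only, the following is a minimal sketch of the standard DPO loss for a single preference pair; it is not the paper's Group-DPO. The function names, the default `beta`, and the idea that the log-probabilities would be computed conditioned on a group's features are assumptions made for illustration.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (preferred, rejected) pair.

    logp_w / logp_l: policy log-probabilities of the preferred and rejected
    outputs. ref_logp_w / ref_logp_l: the same under a frozen reference model.
    Assumption for illustration: in a group-wise setting these would be
    conditioned on the group's features; the abstract does not specify this.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) == softplus(-margin)
    return softplus(-margin)
```

When the policy equals the reference, the margin is zero and the loss is log 2; raising the preferred output's log-probability relative to the reference lowers the loss.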

Title: On Stability and Robustness of Diffusion Posterior Sampling for Bayesian Inverse Problems

Authors: Yiming Yang, Xiaoyuan Cheng, Yi He, Kaiyu Li, Wenxuan Yuan, Zhuo Sun
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2602.02045
Pdf URL: https://arxiv.org/pdf/2602.02045
Copy Paste: [[2602.02045]] On Stability and Robustness of Diffusion Posterior Sampling for Bayesian Inverse Problems(https://arxiv.org/abs/2602.02045)
Keywords: generation
Abstract: Diffusion models have recently emerged as powerful learned priors for Bayesian inverse problems (BIPs). Diffusion-based solvers rely on a presumed likelihood for the observations in BIPs to guide the generation process. However, the link between likelihood and recovery quality for BIPs is unclear in previous works. We bridge this gap by characterizing the posterior approximation error and proving the \emph{stability} of the diffusion-based solvers. Meanwhile, an immediate result of our findings on stability demonstrates the lack of robustness in diffusion-based solvers, which remains unexplored. This can degrade performance when the presumed likelihood mismatches the unknown true data generation processes. To address this issue, we propose a simple yet effective solution, \emph{robust diffusion posterior sampling}, which is provably \emph{robust} and compatible with existing gradient-based posterior samplers. Empirical results on scientific inverse problems and natural image tasks validate the effectiveness and robustness of our method, showing consistent performance improvements under challenging likelihood misspecifications.
摘要：扩散模型最近已成为贝叶斯逆问题（BIP）的强大学习先验。基于扩散的求解器依靠 BIP 中观测值的假定可能性来指导生成过程。然而，在之前的研究中，BIP 的可能性和恢复质量之间的联系尚不清楚。我们通过表征后验近似误差并证明基于扩散的求解器的\emph{稳定性}来弥补这一差距。与此同时，我们对稳定性的研究结果表明，基于扩散的求解器缺乏鲁棒性，这一点尚未得到探索。当假定的可能性与未知的真实数据生成过程不匹配时，这可能会降低性能。为了解决这个问题，我们提出了一个简单而有效的解决方案，\emph{鲁棒扩散后验采样}，它被证明是\emph{鲁棒}并且与现有的基于梯度的后验采样器兼容。科学逆问题和自然图像任务的实证结果验证了我们方法的有效性和鲁棒性，在具有挑战性的可能性错误指定下显示出一致的性能改进。

Title: AICD Bench: A Challenging Benchmark for AI-Generated Code Detection

Authors: Daniil Orel, Dilshod Azizov, Indraneil Paul, Yuxia Wang, Iryna Gurevych, Preslav Nakov
Subjects: cs.LG, cs.SE
Abstract URL: https://arxiv.org/abs/2602.02079
Pdf URL: https://arxiv.org/pdf/2602.02079
Copy Paste: [[2602.02079]] AICD Bench: A Challenging Benchmark for AI-Generated Code Detection(https://arxiv.org/abs/2602.02079)
Keywords: generation
Abstract: Large language models (LLMs) are increasingly capable of generating functional source code, raising concerns about authorship, accountability, and security. While detecting AI-generated code is critical, existing datasets and benchmarks are narrow, typically limited to binary human-machine classification under in-distribution settings. To bridge this gap, we introduce AICD Bench, the most comprehensive benchmark for AI-generated code detection. It spans 2M examples, 77 models across 11 families, and 9 programming languages, including recent reasoning models. Beyond scale, AICD Bench introduces three realistic detection tasks: (i) Robust Binary Classification under distribution shifts in language and domain, (ii) Model Family Attribution, grouping generators by architectural lineage, and (iii) Fine-Grained Human-Machine Classification across human, machine, hybrid, and adversarial code. Extensive evaluation on neural and classical detectors shows that performance remains far below practical usability, particularly under distribution shift and for hybrid or adversarial code. We release AICD Bench as a unified, challenging evaluation suite to drive the next generation of robust approaches for AI-generated code detection. The data and the code are available at this https URL.
摘要：大型语言模型 (LLM) 生成功能源代码的能力越来越强，引发了对作者身份、责任和安全性的担忧。虽然检测人工智能生成的代码至关重要，但现有的数据集和基准测试范围狭窄，通常仅限于分布内设置下的二元人机分类。为了弥补这一差距，我们引入了 AICD Bench，这是用于 AI 生成代码检测的最全面的基准。它涵盖 200 万个示例、11 个系列的 77 个模型和 9 种编程语言，包括最新的推理模型。除了规模之外，AICD Bench 引入了三个现实的检测任务：(i) 在语言和领域的分布变化下的鲁棒二元分类，(ii) 模型家族归因，按架构谱系对生成器进行分组，以及 (iii) 跨人类、机器、混合和对抗性代码的细粒度人机分类。对神经和经典检测器的广泛评估表明，性能仍然远远低于实际可用性，特别是在分布转移和混合或对抗代码的情况下。我们将 AICD Bench 作为一个统一的、具有挑战性的评估套件发布，以推动下一代人工智能生成代码检测的稳健方法。数据和代码可在此 https URL 获取。

Title: FSVideo: Fast Speed Video Diffusion Model in a Highly-Compressed Latent Space

Authors: FSVideo Team, Qingyu Chen, Zhiyuan Fang, Haibin Huang, Xinwei Huang, Tong Jin, Minxuan Lin, Bo Liu, Celong Liu, Chongyang Ma, Xing Mei, Xiaohui Shen, Yaojie Shen, Fuwen Tan, Angtian Wang, Xiao Yang, Yiding Yang, Jiamin Yuan, Lingxi Zhang, Yuxin Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.02092
Pdf URL: https://arxiv.org/pdf/2602.02092
Copy Paste: [[2602.02092]] FSVideo: Fast Speed Video Diffusion Model in a Highly-Compressed Latent Space(https://arxiv.org/abs/2602.02092)
Keywords: generation
Abstract: We introduce FSVideo, a fast transformer-based image-to-video (I2V) diffusion framework. We build our framework on the following key components: (1) a new video autoencoder with a highly compressed latent space ($64\times64\times4$ spatial-temporal downsampling ratio), achieving competitive reconstruction quality; (2) a diffusion transformer (DIT) architecture with a new layer memory design to enhance inter-layer information flow and context reuse within DIT; and (3) a multi-resolution generation strategy via a few-step DIT upsampler to increase video fidelity. Our final model, which contains a 14B DIT base model and a 14B DIT upsampler, achieves competitive performance against other popular open-source models, while being an order of magnitude faster. We discuss our model design as well as training strategies in this report.
摘要：我们介绍 FSVideo，一种基于高速转换器的图像到视频 (I2V) 扩散框架。我们的框架基于以下关键组件：1.）具有高度压缩潜在空间（$64\times64\times4$时空下采样率）的新视频自动编码器，实现有竞争力的重建质量； 2.) 扩散变压器 (DIT) 架构，采用新的层内存设计，以增强 DIT 内的层间信息流和上下文重用，以及 3.) 通过几步 DIT 上采样器的多分辨率生成策略，以提高视频保真度。我们的最终模型包含一个 14B DIT 基础模型和一个 14B DIT 上采样器，与其他流行的开源模型相比，其性能具有竞争力，同时速度快了一个数量级。我们在本报告中讨论了我们的模型设计以及培训策略。

Title: Unifying Masked Diffusion Models with Various Generation Orders and Beyond

Authors: Chunsan Hong, Sanghyun Lee, Jong Chul Ye
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2602.02112
Pdf URL: https://arxiv.org/pdf/2602.02112
Copy Paste: [[2602.02112]] Unifying Masked Diffusion Models with Various Generation Orders and Beyond(https://arxiv.org/abs/2602.02112)
Keywords: generation, generative
Abstract: Masked diffusion models (MDMs) are a potential alternative to autoregressive models (ARMs) for language generation, but generation quality depends critically on the generation order. Prior work either hard-codes an ordering (e.g., blockwise left-to-right) or learns an ordering policy for a pretrained MDM, which incurs extra cost and can yield suboptimal solutions due to the two-stage optimization. Motivated by this, we propose order-expressive masked diffusion model (OeMDM) for a broad class of diffusion generative processes with various generation orders, enabling the interpretation of MDM, ARM, and block diffusion in a single framework. Furthermore, building on OeMDM, we introduce learnable-order masked diffusion model (LoMDM), which jointly learns the generation ordering and diffusion backbone through a single objective from scratch, enabling the diffusion model to generate text in context-dependent ordering. Empirically, we confirm that LoMDM outperforms various discrete diffusion models across multiple language modeling benchmarks.
摘要：掩蔽扩散模型 (MDM) 是语言生成自回归模型 (ARM) 的潜在替代方案，但生成质量主要取决于生成顺序。先前的工作要么对排序进行硬编码（例如，按块从左到右），要么学习预训练 MDM 的排序策略，这会产生额外的成本，并且由于两阶段优化可能会产生次优的解决方案。受此启发，我们提出了顺序表达掩蔽扩散模型（OeMDM），用于具有各种生成顺序的广泛扩散生成过程，从而能够在单个框架中解释 MDM、ARM 和块扩散。此外，在 OeMDM 的基础上，我们引入了可学习顺序掩码扩散模型（LoMDM），该模型从头开始通过单个目标共同学习生成顺序和扩散主干，使扩散模型能够以上下文相关的顺序生成文本。根据经验，我们确认 LoMDM 在多个语言建模基准测试中优于各种离散扩散模型。

Title: Enhancing Diffusion-Based Quantitatively Controllable Image Generation via Matrix-Form EDM and Adaptive Vicinal Training

Authors: Xin Ding, Yun Chen, Sen Zhang, Kao Zhang, Nenglun Chen, Peibei Cao, Yongwei Wang, Fei Wu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2602.02114
Pdf URL: https://arxiv.org/pdf/2602.02114
Copy Paste: [[2602.02114]] Enhancing Diffusion-Based Quantitatively Controllable Image Generation via Matrix-Form EDM and Adaptive Vicinal Training(https://arxiv.org/abs/2602.02114)
Keywords: generation
Abstract: Continuous Conditional Diffusion Model (CCDM) is a diffusion-based framework designed to generate high-quality images conditioned on continuous regression labels. Although CCDM has demonstrated clear advantages over prior approaches across a range of datasets, it still exhibits notable limitations and has recently been surpassed by a GAN-based method, namely CcGAN-AVAR. These limitations mainly arise from its reliance on an outdated diffusion framework and its low sampling efficiency due to long sampling trajectories. To address these issues, we propose an improved CCDM framework, termed iCCDM, which incorporates the more advanced \textit{Elucidated Diffusion Model} (EDM) framework with substantial modifications to improve both generation quality and sampling efficiency. Specifically, iCCDM introduces a novel matrix-form EDM formulation together with an adaptive vicinal training strategy. Extensive experiments on four benchmark datasets, spanning image resolutions from $64\times64$ to $256\times256$, demonstrate that iCCDM consistently outperforms existing methods, including state-of-the-art large-scale text-to-image diffusion models (e.g., Stable Diffusion 3, FLUX.1, and Qwen-Image), achieving higher generation quality while significantly reducing sampling cost.
摘要：连续条件扩散模型 (CCDM) 是一种基于扩散的框架，旨在生成以连续回归标签为条件的高质量图像。尽管 CCDM 在一系列数据集上表现出了优于先前方法的明显优势，但它仍然表现出明显的局限性，并且最近已被基于 GAN 的方法（即 CcGAN-AVAR）超越。这些限制主要源于它对过时的扩散框架的依赖以及由于采样轨迹长而导致的采样效率低。为了解决这些问题，我们提出了一种改进的 CCDM 框架，称为 iCCDM，它结合了更先进的 \textit{阐明扩散模型} (EDM) 框架，并进行了大量修改，以提高生成质量和采样效率。具体来说，iCCDM 引入了一种新颖的矩阵形式 EDM 公式以及自适应邻域训练策略。对四个基准数据集（图像分辨率从 $64\times64$ 到 $256\times256$）的广泛实验表明，iCCDM 始终优于现有方法，包括最先进的大规模文本到图像扩散模型（例如，Stable Diffusion 3、FLUX.1 和 Qwen-Image），实现了更高的生成质量，同时显着降低了采样成本。

Title: Scalable Spatio-Temporal SE(3) Diffusion for Long-Horizon Protein Dynamics

Authors: Nima Shoghi, Yuxuan Liu, Yuning Shen, Rob Brekelmans, Pan Li, Quanquan Gu
Subjects: cs.LG, cs.AI, physics.bio-ph, q-bio.BM, q-bio.QM
Abstract URL: https://arxiv.org/abs/2602.02128
Pdf URL: https://arxiv.org/pdf/2602.02128
Copy Paste: [[2602.02128]] Scalable Spatio-Temporal SE(3) Diffusion for Long-Horizon Protein Dynamics(https://arxiv.org/abs/2602.02128)
Keywords: generation, generative
Abstract: Molecular dynamics (MD) simulations remain the gold standard for studying protein dynamics, but their computational cost limits access to biologically relevant timescales. Recent generative models have shown promise in accelerating simulations, yet they struggle with long-horizon generation due to architectural constraints, error accumulation, and inadequate modeling of spatio-temporal dynamics. We present STAR-MD (Spatio-Temporal Autoregressive Rollout for Molecular Dynamics), a scalable SE(3)-equivariant diffusion model that generates physically plausible protein trajectories over microsecond timescales. Our key innovation is a causal diffusion transformer with joint spatio-temporal attention that efficiently captures complex space-time dependencies while avoiding the memory bottlenecks of existing methods. On the standard ATLAS benchmark, STAR-MD achieves state-of-the-art performance across all metrics--substantially improving conformational coverage, structural validity, and dynamic fidelity compared to previous methods. STAR-MD successfully extrapolates to generate stable microsecond-scale trajectories where baseline methods fail catastrophically, maintaining high structural quality throughout the extended rollout. Our comprehensive evaluation reveals severe limitations in current models for long-horizon generation, while demonstrating that STAR-MD's joint spatio-temporal modeling enables robust dynamics simulation at biologically relevant timescales, paving the way for accelerated exploration of protein function.
摘要：分子动力学 (MD) 模拟仍然是研究蛋白质动力学的黄金标准，但其计算成本限制了生物学相关时间尺度的获取。最近的生成模型在加速模拟方面表现出了希望，但由于架构限制、错误累积和时空动力学建模不足，它们在长范围生成方面遇到了困难。我们提出了 STAR-MD（分子动力学时空自回归），这是一种可扩展的 SE(3) 等变扩散模型，可在微秒时间尺度上生成物理上合理的蛋白质轨迹。我们的关键创新是具有联合时空注意力的因果扩散变压器，可以有效捕获复杂的时空依赖性，同时避免现有方法的内存瓶颈。在标准 ATLAS 基准上，STAR-MD 在所有指标上都实现了最先进的性能——与以前的方法相比，显着提高了构象覆盖率、结构有效性和动态保真度。 STAR-MD 成功地推断生成稳定的微秒级轨迹，而基线方法会灾难性地失败，从而在整个扩展过程中保持较高的结构质量。我们的综合评估揭示了当前长视野生成模型的严重局限性，同时证明 STAR-MD 的联合时空建模能够在生物相关时间尺度上进行稳健的动力学模拟，为加速探索蛋白质功能铺平道路。

Title: Eliminating Registration Bias in Synthetic CT Generation: A Physics-Based Simulation Framework

Authors: Lukas Zimmermann, Michael Rauter, Maximilian Schmid, Dietmar Georg, Barbara Knäusl
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.02130
Pdf URL: https://arxiv.org/pdf/2602.02130
Copy Paste: [[2602.02130]] Eliminating Registration Bias in Synthetic CT Generation: A Physics-Based Simulation Framework(https://arxiv.org/abs/2602.02130)
Keywords: generation
Abstract: Supervised synthetic CT generation from CBCT requires registered training pairs, yet perfect registration between separately acquired scans remains unattainable. This registration bias propagates into trained models and corrupts standard evaluation metrics. This may suggest that superior benchmark performance indicates better reproduction of registration artifacts rather than anatomical fidelity. We propose physics-based CBCT simulation to provide geometrically aligned training pairs by construction, combined with evaluation using geometric alignment metrics against input CBCT rather than biased ground truth. On two independent pelvic datasets, models trained on synthetic data achieved superior geometric alignment (Normalized Mutual Information: 0.31 vs 0.22) despite lower conventional intensity scores. Intensity metrics showed inverted correlations with clinical assessment for deformably registered data, while Normalized Mutual Information consistently predicted observer preference across registration methodologies (rho = 0.31, p < 0.001). Clinical observers preferred synthetic-trained outputs in 87% of cases, demonstrating that geometric fidelity, not intensity agreement with biased ground truth, aligns with clinical requirements.
摘要：从 CBCT 生成有监督的合成 CT 需要配准训练对，但单独采集的扫描之间的完美配准仍然无法实现。这种配准偏差会传播到经过训练的模型中并破坏标准评估指标。这可能表明卓越的基准性能表明配准伪影的更好再现而不是解剖保真度。我们提出基于物理的 CBCT 模拟，通过构造提供几何对齐的训练对，并结合使用针对输入 CBCT 的几何对齐度量而不是有偏差的地面实况进行评估。在两个独立的骨盆数据集上，尽管传统强度分数较低，但在合成数据上训练的模型仍实现了出色的几何对齐（标准化互信息：0.31 vs 0.22）。强度指标显示与变形注册数据的临床评估呈反向相关，而标准化互信息一致地预测了不同注册方法中的观察者偏好（rho = 0.31，p < 0.001）。在 87% 的情况下，临床观察者更喜欢经过综合训练的输出，这表明几何保真度（而不是与有偏差的基本事实的强度一致性）符合临床要求。
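
The abstract above evaluates geometric alignment with Normalized Mutual Information (NMI) but does not state which normalization is used. The following is a minimal sketch of one common variant, 2·I(A;B)/(H(A)+H(B)), estimated from a joint intensity histogram. The function name, the bin count, and the choice of normalization are assumptions for illustration, not the paper's implementation.

```python
import numpy as np


def normalized_mutual_information(a, b, bins: int = 32) -> float:
    """Histogram-based NMI between two equally sized images.

    Uses NMI = 2 * I(A;B) / (H(A) + H(B)), which lies in [0, 1].
    Assumption: other normalizations exist and the paper may use a different one.
    """
    joint, _, _ = np.histogram2d(np.ravel(a), np.ravel(b), bins=bins)
    pxy = joint / joint.sum()   # joint intensity distribution
    px = pxy.sum(axis=1)        # marginal of image a
    py = pxy.sum(axis=0)        # marginal of image b

    def entropy(p):
        p = p[p > 0]            # ignore empty bins (0 * log 0 := 0)
        return float(-np.sum(p * np.log(p)))

    hx, hy, hxy = entropy(px), entropy(py), entropy(pxy.ravel())
    if hx + hy == 0.0:          # both images are constant
        return 0.0
    mutual_info = hx + hy - hxy
    return 2.0 * mutual_info / (hx + hy)
```

An image compared with itself scores close to 1, while two independent noise images score close to 0; finite-sample histogram bias keeps the latter slightly above zero.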

Title: DCoPilot: Generative AI-Empowered Policy Adaptation for Dynamic Data Center Operations

Authors: Minghao Li, Ruihang Wang, Rui Tan, Yonggang Wen
Subjects: cs.LG, cs.AI, eess.SY
Abstract URL: https://arxiv.org/abs/2602.02137
Pdf URL: https://arxiv.org/pdf/2602.02137
Copy Paste: [[2602.02137]] DCoPilot: Generative AI-Empowered Policy Adaptation for Dynamic Data Center Operations(https://arxiv.org/abs/2602.02137)
Keywords: generation, generative
Abstract: Modern data centers (DCs) hosting artificial intelligence (AI)-dedicated devices operate at high power densities with rapidly varying workloads, making minute-level adaptation essential for safe and energy-efficient operation. However, manually designing piecewise deep reinforcement learning (DRL) agents cannot keep pace with frequent dynamics shifts and service-level agreement (SLA) changes of an evolving DC. This specification-to-policy lag causes a lack of timely, effective control policies, which may lead to service outages. To bridge the gap, we present DCoPilot, a hybrid framework for generative control policies in dynamic DC operation. DCoPilot synergizes two distinct generative paradigms, i.e., a large language model (LLM) that performs symbolic generation of structured reward forms, and a hypernetwork that conducts parametric generation of policy weights. DCoPilot operates through three coordinated phases: (i) simulation scale-up, which stress-tests reward candidates across diverse simulation-ready (SimReady) scenes; (ii) meta policy distillation, where a hypernetwork is trained to output policy weights conditioned on SLA and scene embeddings; and (iii) online adaptation, enabling zero-shot policy generation in response to updated specifications. Evaluated across five control task families spanning diverse DC components, DCoPilot achieves near-zero constraint violations and outperforms all baselines across specification variations. Ablation studies validate the effectiveness of LLM-based unified reward generation in enabling stable hypernetwork convergence.
摘要：托管人工智能 (AI) 专用设备的现代数据中心 (DC) 以高功率密度运行，工作负载快速变化，因此分钟级的适应对于安全和节能运行至关重要。然而，手动设计分段深度强化学习 (DRL) 代理无法跟上不断发展的 DC 的频繁动态变化和服务级别协议 (SLA) 变化。这种规范到策略的滞后导致缺乏及时、有效的控制策略，从而可能导致服务中断。为了弥补这一差距，我们提出了 DCoPilot，这是一种用于动态 DC 运行中生成控制策略的混合框架。 DCoPilot 协同两个不同的生成范式，即执行结构化奖励形式的符号生成的大型语言模型（LLM）和执行政策权重参数化生成的超网络。 DCoPilot 通过三个协调阶段进行运作：(i) 模拟放大，压力测试奖励跨不同模拟就绪 (SimReady) 场景的候选者； (ii) 元策略蒸馏，其中训练超网络以输出以 SLA 和场景嵌入为条件的策略权重； (iii) 在线适应，能够根据更新的规范生成零样本策略。通过对涵盖不同 DC 组件的五个控制任务系列进行评估，DCoPilot 实现了接近零的约束违规，并超越了规格变化的所有基线。消融研究验证了基于 LLM 的统一奖励生成在实现稳定的超网络收敛方面的有效性。

Title: Learning Generative Selection for Best-of-N

Authors: Shubham Toshniwal, Aleksander Ficek, Siddhartha Jain, Wei Du, Vahid Noroozi, Sadegh Mahdavi, Somshubra Majumdar, Igor Gitman
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2602.02143
Pdf URL: https://arxiv.org/pdf/2602.02143
Copy Paste: [[2602.02143]] Learning Generative Selection for Best-of-N(https://arxiv.org/abs/2602.02143)
Keywords: generative
Abstract: Scaling test-time compute via parallel sampling can substantially improve LLM reasoning, but is often limited by Best-of-N selection quality. Generative selection methods, such as GenSelect, address this bottleneck, yet strong selection performance remains largely limited to large models. We show that small reasoning models can acquire strong GenSelect capabilities through targeted reinforcement learning. To this end, we synthesize selection tasks from large-scale math and code instruction datasets by filtering to instances with both correct and incorrect candidate solutions, and train 1.7B-parameter models with DAPO to reward correct selections. Across math (AIME24, AIME25, HMMT25) and code (LiveCodeBench) reasoning benchmarks, our models consistently outperform prompting and majority-voting baselines, often approaching or exceeding much larger models. Moreover, these gains generalize to selecting outputs from stronger models despite training only on outputs from weaker models. Overall, our results establish reinforcement learning as a scalable way to unlock strong generative selection in small models, enabling efficient test-time scaling.
摘要：通过并行采样扩展测试时间计算可以显着改善 LLM 推理，但通常受到 Best-of-N 选择质量的限制。 GenSelect 等生成选择方法解决了这一瓶颈，但强大的选择性能仍然很大程度上局限于大型模型。我们证明，小型推理模型可以通过有针对性的强化学习获得强大的 GenSelect 能力。为此，我们通过过滤具有正确和不正确候选解决方案的实例来综合来自大规模数学和代码指令数据集的选择任务，并使用 DAPO 训练 1.7B 参数模型以奖励正确的选择。在数学（AIME24、AIME25、HMMT25）和代码（LiveCodeBench）推理基准中，我们的模型始终优于提示和多数投票基准，通常接近或超过更大的模型。此外，尽管仅对较弱模型的输出进行训练，但这些收益可以推广到从较强模型中选择输出。总的来说，我们的结果将强化学习确立为一种可扩展的方式，可以在小模型中解锁强大的生成选择，从而实现高效的测试时间扩展。
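
The abstract above compares learned generative selection against majority-voting baselines for Best-of-N. As a minimal sketch of those two selection styles: `majority_vote` picks the most frequent final answer, and `best_of_n` picks the candidate that a scoring function rates highest. The `score` callable is a hypothetical stand-in; GenSelect itself has a model read the candidates and choose one, which is not reproduced here.

```python
from collections import Counter
from typing import Callable, Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Return the most frequent answer; ties go to the earliest occurrence."""
    if not answers:
        raise ValueError("answers must be non-empty")
    counts = Counter(answers)
    top = max(counts.values())
    for answer in answers:
        if counts[answer] == top:
            return answer
    raise AssertionError("unreachable")


def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate with the highest score (first one on ties).

    `score` is a hypothetical selector, e.g. a reward model or a learned judge.
    """
    if not candidates:
        raise ValueError("candidates must be non-empty")
    return max(candidates, key=score)
```

Majority voting needs comparable final answers, so it suits math answers better than free-form code, which is one reason a selector that reads whole solutions can be useful.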

Title: Lung Nodule Image Synthesis Driven by Two-Stage Generative Adversarial Networks

Authors: Lu Cao, Xiquan He, Junying Zeng, Chaoyun Mai, Min Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.02171
Pdf URL: https://arxiv.org/pdf/2602.02171
Copy Paste: [[2602.02171]] Lung Nodule Image Synthesis Driven by Two-Stage Generative Adversarial Networks(https://arxiv.org/abs/2602.02171)
Keywords: generative
Abstract: The limited sample size and insufficient diversity of lung nodule CT datasets severely restrict the performance and generalization ability of detection models. Existing methods generate images with insufficient diversity and controllability, suffering from issues such as monotonous texture features and distorted anatomical structures. Therefore, we propose a two-stage generative adversarial network (TSGAN) to enhance the diversity and spatial controllability of synthetic data by decoupling the morphological structure and texture features of lung nodules. In the first stage, StyleGAN is used to generate semantic segmentation mask images, encoding lung nodules and tissue backgrounds to control the anatomical structure of lung nodule images. The second stage uses the DL-Pix2Pix model to translate the mask map into CT images, employing local importance attention to capture local features, while utilizing dynamic weight multi-head window attention to enhance the modeling capability of lung nodule texture and background. Relative to the original dataset, accuracy improves by 4.6% and mAP by 4% on the LUNA16 dataset. Experimental results demonstrate that TSGAN can enhance the quality of synthetic images and the performance of detection models.
摘要：肺结节CT数据集的样本量有限和多样性不足严重限制了检测模型的性能和泛化能力。现有方法生成的图像缺乏多样性和可控性，存在纹理特征单调和解剖结构扭曲等问题。因此，我们提出了一种两阶段生成对抗网络（TSGAN），通过解耦肺结节的形态结构和纹理特征来增强合成数据的多样性和空间可控性。第一阶段，使用StyleGAN生成语义分割掩模图像，对肺结节和组织背景进行编码，以控制肺结节图像的解剖结构；第二阶段使用DL-Pix2Pix模型将掩模图转化为CT图像，利用局部重要性注意力捕获局部特征，同时利用动态权重多头窗口注意力增强肺结节纹理和背景的建模能力。与原始数据集相比，在 LUNA16 数据集上精度提高了 4.6%，mAP 提高了 4%。实验结果表明，TSGAN 可以提高合成图像的质量和检测模型的性能。

Title: ECHO-2: A Large Scale Distributed Rollout Framework for Cost-efficient Reinforcement Learning

Authors: Jie Xiao, Meng Chen, Qingnan Ren, Song Jingwei, Jiaqi Huang, Yangshen Deng, Chris Tong, Wanyi Chen, Suli Wang, Ziqian Bi, Shuo Lu, Yiqun Duan, Lynn Ai, Eric Yang, Bill Shi
Subjects: cs.LG, cs.DC
Abstract URL: https://arxiv.org/abs/2602.02192
Pdf URL: https://arxiv.org/pdf/2602.02192
Copy Paste: [[2602.02192]] ECHO-2: A Large Scale Distributed Rollout Framework for Cost-efficient Reinforcement Learning(https://arxiv.org/abs/2602.02192)
Keywords: generation
Abstract: Reinforcement learning (RL) is a critical stage in post-training large language models (LLMs), involving repeated interaction between rollout generation, reward evaluation, and centralized learning. Distributing rollout execution offers opportunities to leverage more cost-efficient inference resources, but introduces challenges in wide-area coordination and policy dissemination. We present ECHO-2, a distributed RL framework for post-training with remote inference workers and non-negligible dissemination latency. ECHO-2 combines centralized learning with distributed rollouts and treats bounded policy staleness as a user-controlled parameter, enabling rollout generation, dissemination, and training to overlap. We introduce an overlap-based capacity model that relates training time, dissemination latency, and rollout throughput, yielding a practical provisioning rule for sustaining learner utilization. To mitigate dissemination bottlenecks and lower cost, ECHO-2 employs peer-assisted pipelined broadcast and cost-aware activation of heterogeneous workers. Experiments on GRPO post-training of 4B and 8B models under real wide-area bandwidth regimes show that ECHO-2 significantly improves cost efficiency while preserving RL reward comparable to strong baselines.
摘要：强化学习（RL）是大型语言模型（LLM）训练后的关键阶段，涉及推出生成、奖励评估和集中学习之间的重复交互。分布式部署执行提供了利用更具成本效益的推理资源的机会，但在广域协调和政策传播方面带来了挑战。我们提出了 ECHO-2，这是一种分布式 RL 框架，用于远程推理工作者的后期训练和不可忽略的传播延迟。 ECHO-2 将集中式学习与分布式部署相结合，并将有限的策略过时性视为用户控制的参数，从而使部署生成、传播和训练能够重叠。我们引入了一种基于重叠的容量模型，该模型将训练时间、传播延迟和推出吞吐量联系起来，从而产生了维持学习者利用率的实用配置规则。为了缓解传播瓶颈并降低成本，ECHO-2 采用对等辅助的管道广播和异构工作人员的成本感知激活。在真实广域带宽条件下对 4B 和 8B 模型进行 GRPO 后训练的实验表明，ECHO-2 显着提高了成本效率，同时保持了与强基线相当的 RL 奖励。

Title: Hierarchical Adaptive Eviction for KV Cache Management in Multimodal Language Models

Authors: Xindian Ma, Yidi Lu, Peng Zhang, Jing Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2602.02197
Pdf URL: https://arxiv.org/pdf/2602.02197
Copy Paste: [[2602.02197]] Hierarchical Adaptive Eviction for KV Cache Management in Multimodal Language Models(https://arxiv.org/abs/2602.02197)
Keywords: generation
Abstract: The integration of visual information into Large Language Models (LLMs) has enabled Multimodal LLMs (MLLMs), but the quadratic memory and computational costs of Transformer architectures remain a bottleneck. Existing KV cache eviction strategies fail to address the heterogeneous attention distributions between visual and text tokens, leading to suboptimal efficiency or degraded performance. In this paper, we propose Hierarchical Adaptive Eviction (HAE), a KV cache eviction framework that optimizes text-visual token interaction in MLLMs by implementing Dual-Attention Pruning during pre-filling (leveraging visual token sparsity and attention variance) and a Dynamic Decoding Eviction Strategy (inspired by OS Recycle Bins) during decoding. HAE minimizes KV cache usage across layers, reduces computational overhead via index broadcasting, and theoretically ensures superior information integrity and lower error bounds compared to greedy strategies, enhancing efficiency in both comprehension and generation tasks. Empirically, HAE reduces KV-Cache memory by 41% with minimal accuracy loss (0.3% drop) in image understanding tasks and accelerates story generation inference by 1.5x while maintaining output quality on the Phi3.5-Vision-Instruct model.
摘要：将视觉信息集成到大型语言模型 (LLM) 中使得多模态 LLM (MLLM) 成为可能，但 Transformer 架构的二次内存和计算成本仍然是一个瓶颈。现有的 KV 缓存驱逐策略无法解决视觉和文本标记之间的异构注意力分布问题，导致效率不佳或性能下降。在本文中，我们提出了分层自适应逐出（HAE），这是一种 KV 缓存逐出框架，通过在预填充期间实现双注意力修剪（利用视觉令牌稀疏性和注意力方差）和在解码过程中实现动态解码逐出策略（受操作系统回收站启发）来优化 MLLM 中的文本-视觉令牌交互。 HAE 最大限度地减少了跨层 KV 缓存的使用，通过索引广播减少了计算开销，并且理论上与贪婪策略相比确保了卓越的信息完整性和更低的错误界限，从而提高了理解和生成任务的效率。根据经验，HAE 在图像理解任务中将 KV 缓存内存减少了 41%，同时准确度损失最小（0.3% 下降），并将故事生成推理速度提高了 1.5 倍，同时保持 Phi3.5-Vision-Instruct 模型的输出质量。
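
The abstract above contrasts HAE with greedy eviction strategies without detailing either. As context, the following is a minimal sketch of a generic greedy, score-based KV cache eviction rule: keep the most recent tokens plus the older tokens with the highest accumulated attention. This is not HAE; the function name, the `recent` window, and the scoring input are assumptions for illustration.

```python
import numpy as np


def greedy_evict(attn_scores, budget: int, recent: int = 0) -> np.ndarray:
    """Return sorted indices of cached tokens to keep under a fixed budget.

    attn_scores: accumulated attention mass per cached token (1-D).
    budget: maximum number of tokens to retain.
    recent: number of most recent tokens that are always retained.
    """
    if budget < 0:
        raise ValueError("budget must be non-negative")
    scores = np.asarray(attn_scores, dtype=float)
    n = scores.shape[0]
    if budget >= n:
        return np.arange(n)               # nothing needs to be evicted
    recent = min(max(recent, 0), budget)
    keep_recent = np.arange(n - recent, n)
    older = np.arange(n - recent)
    k = budget - recent
    # Highest-scoring older tokens first; a stable sort keeps earlier tokens on ties.
    top_older = older[np.argsort(-scores[older], kind="stable")[:k]]
    return np.sort(np.concatenate([top_older, keep_recent]))
```

For scores `[0.5, 0.1, 0.3, 0.05, 0.05]` with `budget=3` and `recent=1`, the rule keeps tokens 0, 2, and 4. A rule like this decides once per step from a single score and cannot recover an evicted token, which is the kind of limitation that adaptive schemes aim to address.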

Title: Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation

Authors: Hongzhou Zhu, Min Zhao, Guande He, Hang Su, Chongxuan Li, Jun Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.02214
Pdf URL: https://arxiv.org/pdf/2602.02214
Copy Paste: [[2602.02214]] Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation(https://arxiv.org/abs/2602.02214)
Keywords: generation
Abstract: To achieve real-time interactive video generation, current methods distill pretrained bidirectional video diffusion models into few-step autoregressive (AR) models, facing an architectural gap when full attention is replaced by causal attention. However, existing approaches do not bridge this gap theoretically. They initialize the AR student via ODE distillation, which requires frame-level injectivity, where each noisy frame must map to a unique clean frame under the PF-ODE of an AR teacher. Distilling an AR student from a bidirectional teacher violates this condition, preventing recovery of the teacher's flow map and instead inducing a conditional-expectation solution, which degrades performance. To address this issue, we propose Causal Forcing, which uses an AR teacher for ODE initialization, thereby bridging the architectural gap. Empirical results show that our method outperforms all baselines across all metrics, surpassing the SOTA Self Forcing by 19.3% in Dynamic Degree, 8.7% in VisionReward, and 16.7% in Instruction Following. Project page and the code: this https URL
摘要：为了实现实时交互式视频生成，当前的方法将预训练的双向视频扩散模型提炼为少步自回归（AR）模型，当完全注意力被因果注意力取代时，面临着架构差距。然而，现有的方法在理论上并不能弥合这一差距。他们通过 ODE 蒸馏来初始化 AR 学生，这需要帧级注入性，其中每个噪声帧必须映射到 AR 教师的 PF-ODE 下的唯一干净帧。从双向教师中提取 AR 学生违反了此条件，从而阻止了教师流程图的恢复，而是引入了条件期望解决方案，从而降低了性能。为了解决这个问题，我们提出因果强迫，它使用 AR 教师进行 ODE 初始化，从而弥合架构差距。实证结果表明，我们的方法在所有指标上都优于所有基线，在动态度上超过 SOTA 自强迫 19.3%，在 VisionReward 上超过 8.7%，在指令跟随上超过 16.7%。项目页面和代码：此 https URL

Title: MIRROR: Manifold Ideal Reference ReconstructOR for Generalizable AI-Generated Image Detection

Authors: Ruiqi Liu, Manni Cui, Ziheng Qin, Zhiyuan Yan, Ruoxin Chen, Yi Han, Zhiheng Li, Junkai Chen, ZhiJin Chen, Kaiqing Lin, Jialiang Shen, Lubin Weng, Jing Dong, Yan Wang, Shu Wu
Subjects: cs.CV, cs.CR
Abstract URL: https://arxiv.org/abs/2602.02222
Pdf URL: https://arxiv.org/pdf/2602.02222
Copy Paste: [[2602.02222]] MIRROR: Manifold Ideal Reference ReconstructOR for Generalizable AI-Generated Image Detection(https://arxiv.org/abs/2602.02222)
Keywords: generative
Abstract: High-fidelity generative models have narrowed the perceptual gap between synthetic and real images, posing serious threats to media security. Most existing AI-generated image (AIGI) detectors rely on artifact-based classification and struggle to generalize to evolving generative traces. In contrast, human judgment relies on stable real-world regularities, with deviations from the human cognitive manifold serving as a more generalizable signal of forgery. Motivated by this insight, we reformulate AIGI detection as a Reference-Comparison problem that verifies consistency with the real-image manifold rather than fitting specific forgery cues. We propose MIRROR (Manifold Ideal Reference ReconstructOR), a framework that explicitly encodes reality priors using a learnable discrete memory bank. MIRROR projects an input into a manifold-consistent ideal reference via sparse linear combination, and uses the resulting residuals as robust detection signals. To evaluate whether detectors reach the "superhuman crossover" required to replace human experts, we introduce the Human-AIGI benchmark, featuring a psychophysically curated human-imperceptible subset. Across 14 benchmarks, MIRROR consistently outperforms prior methods, achieving gains of 2.1% on six standard benchmarks and 8.1% on seven in-the-wild benchmarks. On Human-AIGI, MIRROR reaches 89.6% accuracy across 27 generators, surpassing both lay users and visual experts, and further approaching the human perceptual limit as pretrained backbones scale. The code is publicly available at: this https URL
摘要：高保真生成模型缩小了合成图像与真实图像之间的感知差距，对媒体安全构成严重威胁。大多数现有的人工智能生成图像（AIGI）检测器依赖于基于工件的分类，并且很难推广到不断演变的生成痕迹。相比之下，人类的判断依赖于稳定的现实世界规律，与人类认知流形的偏差成为更普遍的伪造信号。受这一见解的启发，我们将 AIGI 检测重新表述为参考比较问题，用于验证与真实图像流形的一致性，而不是拟合特定的伪造线索。我们提出了 MIRROR（Manifold Ideal Reference ReconstructOR），这是一个使用可学习的离散存储库显式编码现实先验的框架。 MIRROR 通过稀疏线性组合将输入投影到流形一致的理想参考中，并使用所得残差作为鲁棒检测信号。为了评估探测器是否达到取代人类专家所需的“超人交叉”，我们引入了人类-AIGI 基准，该基准以心理物理学策划的人类不可感知的子集为特色。在 14 个基准测试中，MIRROR 始终优于之前的方法，在 6 个标准基准测试中实现了 2.1% 的收益，在 7 个野外基准测试中实现了 8.1% 的收益。在 Human-AIGI 上，MIRROR 在 27 个生成器上达到了 89.6% 的准确率，超越了非专业用户和视觉专家，并随着预训练骨干网的规模进一步接近人类感知极限。该代码可在以下位置公开获取：此 https URL

Title: Show, Don't Tell: Morphing Latent Reasoning into Image Generation

Authors: Harold Haodong Chen, Xinxiang Yin, Wen-Jie Shu, Hongfei Zhang, Zixin Zhang, Chenfei Liao, Litao Guo, Qifeng Chen, Ying-Cong Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.02227
Pdf URL: https://arxiv.org/pdf/2602.02227
Copy Paste: [[2602.02227]] Show, Don't Tell: Morphing Latent Reasoning into Image Generation(https://arxiv.org/abs/2602.02227)
Keywords: generation
Abstract: Text-to-image (T2I) generation has achieved remarkable progress, yet existing methods often lack the ability to dynamically reason and refine during generation--a hallmark of human creativity. Current reasoning-augmented paradigms mostly rely on explicit thought processes, where intermediate reasoning is decoded into discrete text at fixed steps with frequent image decoding and re-encoding, leading to inefficiencies, information loss, and cognitive mismatches. To bridge this gap, we introduce LatentMorph, a novel framework that seamlessly integrates implicit latent reasoning into the T2I generation process. At its core, LatentMorph introduces four lightweight components: (i) a condenser for summarizing intermediate generation states into compact visual memory, (ii) a translator for converting latent thoughts into actionable guidance, (iii) a shaper for dynamically steering next image token predictions, and (iv) an RL-trained invoker for adaptively determining when to invoke reasoning. By performing reasoning entirely in continuous latent spaces, LatentMorph avoids the bottlenecks of explicit reasoning and enables more adaptive self-refinement. Extensive experiments demonstrate that LatentMorph (I) enhances the base model Janus-Pro by 16% on GenEval and 25% on T2I-CompBench; (II) outperforms explicit paradigms (e.g., TwiG) by 15% and 11% on abstract reasoning tasks like WISE and IPV-Txt; (III) reduces inference time by 44% and token consumption by 51%; and (IV) exhibits 71% cognitive alignment with human intuition on reasoning invocation.

Title: Geometry- and Relation-Aware Diffusion for EEG Super-Resolution

Authors: Laura Yao, Gengwei Zhang, Moajjem Chowdhury, Yunmei Liu, Tianlong Chen
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2602.02238
Pdf URL: https://arxiv.org/pdf/2602.02238
Copy Paste: [[2602.02238]] Geometry- and Relation-Aware Diffusion for EEG Super-Resolution(https://arxiv.org/abs/2602.02238)
Keywords: super-resolution, generation, generative
Abstract: Recent electroencephalography (EEG) spatial super-resolution (SR) methods, while showing improved quality by either directly predicting missing signals from visible channels or adapting latent diffusion-based generative modeling to temporal data, often lack awareness of physiological spatial structure, thereby constraining spatial generation performance. To address this issue, we introduce TopoDiff, a geometry- and relation-aware diffusion model for EEG spatial super-resolution. Inspired by how human experts interpret spatial EEG patterns, TopoDiff incorporates topology-aware image embeddings derived from EEG topographic representations to provide global geometric context for spatial generation, together with a dynamic channel-relation graph that encodes inter-electrode relationships and evolves with temporal dynamics. This design yields a spatially grounded EEG spatial super-resolution framework with consistent performance improvements. Across multiple EEG datasets spanning diverse applications, including SEED/SEED-IV for emotion recognition, PhysioNet motor imagery (MI/MM), and TUSZ for seizure detection, our method achieves substantial gains in generation fidelity and leads to notable improvements in downstream EEG task performance.

Title: Learning While Staying Curious: Entropy-Preserving Supervised Fine-Tuning via Adaptive Self-Distillation for Large Reasoning Models

Authors: Hao Wang, Hao Gu, Hongming Piao, Kaixiong Gong, Yuxiao Ye, Xiangyu Yue, Sirui Han, Yike Guo, Dapeng Wu
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2602.02244
Pdf URL: https://arxiv.org/pdf/2602.02244
Copy Paste: [[2602.02244]] Learning While Staying Curious: Entropy-Preserving Supervised Fine-Tuning via Adaptive Self-Distillation for Large Reasoning Models(https://arxiv.org/abs/2602.02244)
Keywords: generation
Abstract: The standard post-training recipe for large reasoning models, supervised fine-tuning followed by reinforcement learning (SFT-then-RL), may limit the benefits of the RL stage: while SFT imitates expert demonstrations, it often causes overconfidence and reduces generation diversity, leaving RL with a narrowed solution space to explore. Adding entropy regularization during SFT is not a cure-all; it tends to flatten token distributions toward uniformity, increasing entropy without improving meaningful exploration capability. In this paper, we propose CurioSFT, an entropy-preserving SFT method designed to enhance exploration capabilities through intrinsic curiosity. It consists of (a) Self-Exploratory Distillation, which distills the model toward a self-generated, temperature-scaled teacher to encourage exploration within its capability; and (b) Entropy-Guided Temperature Selection, which adaptively adjusts distillation strength to mitigate knowledge forgetting by amplifying exploration at reasoning tokens while stabilizing factual tokens. Extensive experiments on mathematical reasoning tasks demonstrate that, in the SFT stage, CurioSFT outperforms vanilla SFT by 2.5 points on in-distribution tasks and 2.9 points on out-of-distribution tasks. We also verify that exploration capabilities preserved during SFT successfully translate into concrete gains in the RL stage, yielding an average improvement of 5.0 points.
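The idea of distilling toward a self-generated, temperature-scaled teacher can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the logits, and the temperature value are assumptions chosen for illustration. It shows only that a teacher built from the model's own logits at a temperature above 1 has higher entropy than the student while keeping the same token ranking, which is the property that distinguishes it from flattening toward a uniform distribution.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with temperature; temperature > 1 flattens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical next-token logits produced by the model itself.
logits = [4.0, 2.0, 1.0, 0.5]

student = softmax(logits)                    # the model's own distribution
teacher = softmax(logits, temperature=2.0)   # assumed temperature-scaled teacher

# Distillation term that pulls the student toward the softer teacher.
distill_loss = kl_divergence(teacher, student)

print("student entropy:", entropy(student))
print("teacher entropy:", entropy(teacher))
print("distillation loss:", distill_loss)
```

In the abstract's terms, Entropy-Guided Temperature Selection would vary the temperature per token; here it is a single fixed constant.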

Title: Unlocking the Duality between Flow and Field Matching

Authors: Daniil Shlenskii, Alexander Varlamov, Nazar Buzun, Alexander Korotin
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2602.02261
Pdf URL: https://arxiv.org/pdf/2602.02261
Copy Paste: [[2602.02261]] Unlocking the Duality between Flow and Field Matching(https://arxiv.org/abs/2602.02261)
Keywords: generative
Abstract: Conditional Flow Matching (CFM) unifies conventional generative paradigms such as diffusion models and flow matching. Interaction Field Matching (IFM) is a newer framework that generalizes Electrostatic Field Matching (EFM) rooted in Poisson Flow Generative Models (PFGM). While both frameworks define generative dynamics, they start from different objects: CFM specifies a conditional probability path in data space, whereas IFM specifies a physics-inspired interaction field in an augmented data space. This raises a basic question: are CFM and IFM genuinely different, or are they two descriptions of the same underlying dynamics? We show that they coincide for a natural subclass of IFM that we call forward-only IFM. Specifically, we construct a bijection between CFM and forward-only IFM. We further show that general IFM is strictly more expressive: it includes EFM and other interaction fields that cannot be realized within the standard CFM formulation. Finally, we highlight how this duality can benefit both frameworks: it provides a probabilistic interpretation of forward-only IFM and yields novel, IFM-driven techniques for CFM.
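For readers unfamiliar with CFM, the notion of a conditional probability path in data space can be made concrete with its most common instance, the linear path x_t = (1 - t) x_0 + t x_1 with regression target x_1 - x_0. The sketch below assumes that linear path; it is not the paper's construction and does not cover IFM. The function names and sample values are illustrative.

```python
import random

def cfm_training_pair(x0, x1, t):
    """Return (x_t, target_velocity) for the linear path x_t = (1 - t) * x0 + t * x1."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]  # d x_t / d t along the linear path
    return x_t, velocity

def cfm_loss(predicted_velocity, target_velocity):
    """Mean squared error between a model's velocity prediction and the target."""
    n = len(target_velocity)
    return sum((p - v) ** 2 for p, v in zip(predicted_velocity, target_velocity)) / n

rng = random.Random(0)
x0 = [rng.gauss(0.0, 1.0) for _ in range(3)]  # noise sample
x1 = [1.0, -2.0, 0.5]                         # hypothetical data sample
t = rng.random()                              # time drawn uniformly from [0, 1)

x_t, target = cfm_training_pair(x0, x1, t)
# A real model would predict the velocity from (x_t, t); a zero prediction stands in here.
print("loss of a zero predictor:", cfm_loss([0.0, 0.0, 0.0], target))
```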

Title: MoLF: Mixture-of-Latent-Flow for Pan-Cancer Spatial Gene Expression Prediction from Histology

Authors: Susu Hu, Stefanie Speidel
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2602.02282
Pdf URL: https://arxiv.org/pdf/2602.02282
Copy Paste: [[2602.02282]] MoLF: Mixture-of-Latent-Flow for Pan-Cancer Spatial Gene Expression Prediction from Histology(https://arxiv.org/abs/2602.02282)
Keywords: generative
Abstract: Inferring spatial transcriptomics (ST) from histology enables scalable histogenomic profiling, yet current methods are largely restricted to single-tissue models. This fragmentation fails to leverage biological principles shared across cancer types and hinders application to data-scarce scenarios. While pan-cancer training offers a solution, the resulting heterogeneity challenges monolithic architectures. To bridge this gap, we introduce MoLF (Mixture-of-Latent-Flow), a generative model for pan-cancer histogenomic prediction. MoLF leverages a conditional Flow Matching objective to map noise to the gene latent manifold, parameterized by a Mixture-of-Experts (MoE) velocity field. By dynamically routing inputs to specialized sub-networks, this architecture effectively decouples the optimization of diverse tissue patterns. Our experiments demonstrate that MoLF establishes a new state-of-the-art, consistently outperforming both specialized and foundation model baselines on pan-cancer benchmarks. Furthermore, MoLF exhibits zero-shot generalization to cross-species data, suggesting it captures fundamental, conserved histo-molecular mechanisms.

Title: Implicit neural representation of textures

Authors: Albert Kwok, Zheyuan Hu, Dounia Hammou
Subjects: cs.CV, cs.AI, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2602.02354
Pdf URL: https://arxiv.org/pdf/2602.02354
Copy Paste: [[2602.02354]] Implicit neural representation of textures(https://arxiv.org/abs/2602.02354)
Keywords: generation
Abstract: Implicit neural representation (INR) has proven to be accurate and efficient in various domains. In this work, we explore how different neural networks can be designed as a new texture INR, which operates in a continuous manner rather than a discrete one over the input UV coordinate space. Through thorough experiments, we demonstrate that these INRs perform well in terms of image quality, with considerable memory usage and rendering inference time. We analyze the balance between these objectives. In addition, we investigate various related applications in real-time rendering and down-stream tasks, e.g. mipmap fitting and INR-space generation.

Title: Unified Personalized Reward Model for Vision Generation

Authors: Yibin Wang, Yuhang Zang, Feng Han, Jiazi Bu, Yujie Zhou, Cheng Jin, Jiaqi Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.02380
Pdf URL: https://arxiv.org/pdf/2602.02380
Copy Paste: [[2602.02380]] Unified Personalized Reward Model for Vision Generation(https://arxiv.org/abs/2602.02380)
Keywords: generation, generative
Abstract: Recent advancements in multimodal reward models (RMs) have significantly propelled the development of visual generation. Existing frameworks typically adopt Bradley-Terry-style preference modeling or leverage generative VLMs as judges, and subsequently optimize visual generation models via reinforcement learning. However, current RMs suffer from inherent limitations: they often follow a one-size-fits-all paradigm that assumes a monolithic preference distribution or relies on fixed evaluation rubrics. As a result, they are insensitive to content-specific visual cues, leading to systematic misalignment with subjective and context-dependent human preferences. To this end, inspired by human assessment, we propose UnifiedReward-Flex, a unified personalized reward model for vision generation that couples reward modeling with flexible and context-adaptive reasoning. Specifically, given a prompt and the generated visual content, it first interprets the semantic intent and grounds on visual evidence, then dynamically constructs a hierarchical assessment by instantiating fine-grained criteria under both predefined and self-generated high-level dimensions. Our training pipeline follows a two-stage process: (1) we first distill structured, high-quality reasoning traces from advanced closed-source VLMs to bootstrap SFT, equipping the model with flexible and context-adaptive reasoning behaviors; (2) we then perform direct preference optimization (DPO) on carefully curated preference pairs to further strengthen reasoning fidelity and discriminative alignment. To validate the effectiveness, we integrate UnifiedReward-Flex into the GRPO framework for image and video synthesis, and extensive results demonstrate its superiority.

Title: Self-Supervised Learning from Structural Invariance

Authors: Yipeng Zhang, Hafez Ghaemi, Jungyoon Lee, Shahab Bakhtiari, Eilif B. Muller, Laurent Charlin
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2602.02381
Pdf URL: https://arxiv.org/pdf/2602.02381
Copy Paste: [[2602.02381]] Self-Supervised Learning from Structural Invariance(https://arxiv.org/abs/2602.02381)
Keywords: generative
Abstract: Joint-embedding self-supervised learning (SSL), the key paradigm for unsupervised representation learning from visual data, learns from invariances between semantically-related data pairs. We study the one-to-many mapping problem in SSL, where each datum may be mapped to multiple valid targets. This arises when data pairs come from naturally occurring generative processes, e.g., successive video frames. We show that existing methods struggle to flexibly capture this conditional uncertainty. As a remedy, we introduce a latent variable to account for this uncertainty and derive a variational lower bound on the mutual information between paired embeddings. Our derivation yields a simple regularization term for standard SSL objectives. The resulting method, which we call AdaSSL, applies to both contrastive and distillation-based SSL objectives, and we empirically show its versatility in causal representation learning, fine-grained image understanding, and world modeling on videos.

Title: SLIME: Stabilized Likelihood Implicit Margin Enforcement for Preference Optimization

Authors: Maksim Afanasyev, Illarion Iov
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2602.02383
Pdf URL: https://arxiv.org/pdf/2602.02383
Copy Paste: [[2602.02383]] SLIME: Stabilized Likelihood Implicit Margin Enforcement for Preference Optimization(https://arxiv.org/abs/2602.02383)
Keywords: generation
Abstract: Direct preference optimization methods have emerged as a computationally efficient alternative to Reinforcement Learning from Human Feedback (RLHF) for aligning Large Language Models (LLMs). Latest approaches have streamlined the alignment process by deriving implicit reward functions, yet they often suffer from a critical objective mismatch: optimizing the relative margin between chosen and rejected responses does not guarantee the preservation of the chosen response's absolute likelihood. This can lead to ``unlearning'', where the model degrades the probability of high-quality outputs to satisfy margin constraints, and ``formatting collapse'' caused by the over-penalization of rejected sequences. In this work, we introduce SLIME (Stabilized Likelihood Implicit Margin Enforcement), a reference-free alignment objective designed to decouple preference learning from generation quality. SLIME incorporates a three-pronged objective: (1) an anchoring term to maximize the likelihood of preferred responses; (2) a stabilizing penalty that prevents the probabilities of rejected tokens from collapsing to zero; and (3) a dual-margin mechanism that combines hard and soft constraints for precise boundary shaping. Our results demonstrate that SLIME achieves superior performance compared to state-of-the-art baselines while maintaining higher generation stability.

Title: Personalized Image Generation via Human-in-the-loop Bayesian Optimization

Authors: Rajalaxmi Rajagopalan, Debottam Dutta, Yu-Lin Wei, Romit Roy Choudhury
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2602.02388
Pdf URL: https://arxiv.org/pdf/2602.02388
Copy Paste: [[2602.02388]] Personalized Image Generation via Human-in-the-loop Bayesian Optimization(https://arxiv.org/abs/2602.02388)
Keywords: generation, generative
Abstract: Imagine Alice has a specific image $x^\ast$ in her mind, say, the view of the street in which she grew up during her childhood. To generate that exact image, she guides a generative model with multiple rounds of prompting and arrives at an image $x^{p*}$. Although $x^{p*}$ is reasonably close to $x^\ast$, Alice finds it difficult to close that gap using language prompts. This paper aims to narrow this gap by observing that even after language has reached its limits, humans can still tell when a new image $x^+$ is closer to $x^\ast$ than $x^{p*}$. Leveraging this observation, we develop MultiBO (Multi-Choice Preferential Bayesian Optimization) that carefully generates $K$ new images as a function of $x^{p*}$, gets preferential feedback from the user, uses the feedback to guide the diffusion model, and ultimately generates a new set of $K$ images. We show that within $B$ rounds of user feedback, it is possible to arrive much closer to $x^\ast$, even though the generative model has no information about $x^\ast$. Qualitative scores from $30$ users, combined with quantitative metrics compared across $5$ baselines, show promising results, suggesting that multi-choice feedback from humans can be effectively harnessed for personalized image generation.

Title: Infinite-World: Scaling Interactive World Models to 1000-Frame Horizons via Pose-Free Hierarchical Memory

Authors: Ruiqi Wu, Xuanhua He, Meng Cheng, Tianyu Yang, Yong Zhang, Zhuoliang Kang, Xunliang Cai, Xiaoming Wei, Chunle Guo, Chongyi Li, Ming-Ming Cheng
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2602.02393
Pdf URL: https://arxiv.org/pdf/2602.02393
Copy Paste: [[2602.02393]] Infinite-World: Scaling Interactive World Models to 1000-Frame Horizons via Pose-Free Hierarchical Memory(https://arxiv.org/abs/2602.02393)
Keywords: generation, generative
Abstract: We propose Infinite-World, a robust interactive world model capable of maintaining coherent visual memory over 1000+ frames in complex real-world environments. While existing world models can be efficiently optimized on synthetic data with perfect ground-truth, they lack an effective training paradigm for real-world videos due to noisy pose estimations and the scarcity of viewpoint revisits. To bridge this gap, we first introduce a Hierarchical Pose-free Memory Compressor (HPMC) that recursively distills historical latents into a fixed-budget representation. By jointly optimizing the compressor with the generative backbone, HPMC enables the model to autonomously anchor generations in the distant past with bounded computational cost, eliminating the need for explicit geometric priors. Second, we propose an Uncertainty-aware Action Labeling module that discretizes continuous motion into a tri-state logic. This strategy maximizes the utilization of raw video data while shielding the deterministic action space from being corrupted by noisy trajectories, ensuring robust action-response learning. Furthermore, guided by insights from a pilot toy study, we employ a Revisit-Dense Finetuning Strategy using a compact, 30-minute dataset to efficiently activate the model's long-range loop-closure capabilities. Extensive experiments, including objective metrics and user studies, demonstrate that Infinite-World achieves superior performance in visual quality, action controllability, and spatial consistency.

Title: Superman: Unifying Skeleton and Vision for Human Motion Perception and Generation

Authors: Xinshun Wang, Peiming Li, Ziyi Wang, Zhongbin Fang, Zhichao Deng, Songtao Wu, Jason Li, Mengyuan Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.02401
Pdf URL: https://arxiv.org/pdf/2602.02401
Copy Paste: [[2602.02401]] Superman: Unifying Skeleton and Vision for Human Motion Perception and Generation(https://arxiv.org/abs/2602.02401)
Keywords: generation, generative
Abstract: Human motion analysis tasks, such as temporal 3D pose estimation, motion prediction, and motion in-betweening, play an essential role in computer vision. However, current paradigms suffer from severe fragmentation. First, the field is split between ``perception'' models that understand motion from video but only output text, and ``generation'' models that cannot perceive from raw visual input. Second, generative MLLMs are often limited to single-frame, static poses using dense, parametric SMPL models, failing to handle temporal motion. Third, existing motion vocabularies are built from skeleton data alone, severing the link to the visual domain. To address these challenges, we introduce Superman, a unified framework that bridges visual perception with temporal, skeleton-based motion generation. Our solution is twofold. First, to overcome the modality disconnect, we propose a Vision-Guided Motion Tokenizer. Leveraging the natural geometric alignment between 3D skeletons and visual data, this module pioneers robust joint learning from both modalities, creating a unified, cross-modal motion vocabulary. Second, grounded in this motion language, a single, unified MLLM architecture is trained to handle all tasks. This module flexibly processes diverse, temporal inputs, unifying 3D skeleton pose estimation from video (perception) with skeleton-based motion prediction and in-betweening (generation). Extensive experiments on standard benchmarks, including Human3.6M, demonstrate that our unified method achieves state-of-the-art or competitive performance across all motion tasks. This showcases a more efficient and scalable path for generative motion analysis using skeletons.

Title: Trust Region Continual Learning as an Implicit Meta-Learner

Authors: Zekun Wang, Anant Gupta, Christopher J. MacLellan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2602.02417
Pdf URL: https://arxiv.org/pdf/2602.02417
Copy Paste: [[2602.02417]] Trust Region Continual Learning as an Implicit Meta-Learner(https://arxiv.org/abs/2602.02417)
Keywords: generation, generative
Abstract: Continual learning aims to acquire tasks sequentially without catastrophic forgetting, yet standard strategies face a core tradeoff: regularization-based methods (e.g., EWC) can overconstrain updates when task optima are weakly overlapping, while replay-based methods can retain performance but drift due to imperfect replay. We study a hybrid perspective: \emph{trust region continual learning} that combines generative replay with a Fisher-metric trust region constraint. We show that, under local approximations, the resulting update admits a MAML-style interpretation with a single implicit inner step: replay supplies an old-task gradient signal (query-like), while the Fisher-weighted penalty provides an efficient offline curvature shaping (support-like). This yields an emergent meta-learning property in continual learning: the model becomes an initialization that rapidly \emph{re-converges} to prior task optima after each task transition, without explicitly optimizing a bilevel objective. Empirically, on task-incremental diffusion image generation and continual diffusion-policy control, trust region continual learning achieves the best final performance and retention, and consistently recovers early-task performance faster than EWC, replay, and continual meta-learning baselines.
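The combination of a replay term with a Fisher-weighted penalty can be sketched as a single objective. The sketch below is an assumption-laden illustration, not the paper's formulation: it uses a diagonal Fisher approximation in the style of EWC, and the function names, the `strength` parameter, and the way the terms are summed are all hypothetical.

```python
def fisher_penalty(theta, theta_old, fisher_diag, strength):
    """Quadratic trust-region penalty: 0.5 * strength * sum_i F_i * (theta_i - theta_old_i)^2.

    A diagonal Fisher approximation is assumed here for simplicity.
    """
    return 0.5 * strength * sum(
        f * (t - o) ** 2 for t, o, f in zip(theta, theta_old, fisher_diag)
    )

def combined_objective(new_task_loss, replay_loss, theta, theta_old, fisher_diag, strength=1.0):
    """Sum of the new-task loss, the (generative) replay loss, and the Fisher penalty."""
    return new_task_loss + replay_loss + fisher_penalty(theta, theta_old, fisher_diag, strength)

# Hypothetical values: the first parameter matters a lot for old tasks, the second very little.
theta_old = [0.0, 2.0]
theta = [1.0, 3.0]
fisher_diag = [4.0, 0.1]

print(combined_objective(0.7, 0.2, theta, theta_old, fisher_diag, strength=0.5))
```

Moving the first parameter is penalized far more than moving the second, which is how the penalty shapes updates along directions that mattered for earlier tasks.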

Title: Repurposing Protein Language Models for Latent Flow-Based Fitness Optimization

Authors: Amaru Caceres Arroyo, Lea Bogensperger, Ahmed Allam, Michael Krauthammer, Konrad Schindler, Dominik Narnhofer
Subjects: cs.LG, q-bio.QM
Abstract URL: https://arxiv.org/abs/2602.02425
Pdf URL: https://arxiv.org/pdf/2602.02425
Copy Paste: [[2602.02425]] Repurposing Protein Language Models for Latent Flow-Based Fitness Optimization(https://arxiv.org/abs/2602.02425)
Keywords: generation
Abstract: Protein fitness optimization is challenged by a vast combinatorial landscape where high-fitness variants are extremely sparse. Many current methods either underperform or require computationally expensive gradient-based sampling. We present CHASE, a framework that repurposes the evolutionary knowledge of pretrained protein language models by compressing their embeddings into a compact latent space. By training a conditional flow-matching model with classifier-free guidance, we enable the direct generation of high-fitness variants without predictor-based guidance during the ODE sampling steps. CHASE achieves state-of-the-art performance on AAV and GFP protein design benchmarks. Finally, we show that bootstrapping with synthetic data can further enhance performance in data-constrained settings.

Title: Embedding Perturbation may Better Reflect the Uncertainty in LLM Reasoning

Authors: Qihao Wen, Jiahao Wang, Yang Nan, Pengfei He, Ravi Tandon, Han Xu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2602.02427
Pdf URL: https://arxiv.org/pdf/2602.02427
Copy Paste: [[2602.02427]] Embedding Perturbation may Better Reflect the Uncertainty in LLM Reasoning(https://arxiv.org/abs/2602.02427)
Keywords: generation
Abstract: Large Language Models (LLMs) have achieved significant breakthroughs across diverse domains; however, they can still produce unreliable or misleading outputs. For responsible LLM application, Uncertainty Quantification (UQ) techniques are used to estimate a model's uncertainty about its outputs, indicating the likelihood that those outputs may be problematic. For LLM reasoning tasks, it is essential to estimate the uncertainty not only for the final answer, but also for the intermediate steps of the reasoning, as this can enable more fine-grained and targeted interventions. In this study, we explore which UQ metrics better reflect the LLM's ``intermediate uncertainty'' during reasoning. Our study reveals that an LLM's incorrect reasoning steps tend to contain tokens which are highly sensitive to perturbations of the preceding token embeddings. In this way, incorrect (uncertain) intermediate steps can be readily identified using this sensitivity score as guidance in practice. In our experiments, we show that such a perturbation-based metric achieves stronger uncertainty quantification performance compared with baseline methods such as token (generation) probability and token entropy. Moreover, unlike approaches that rely on multiple sampling, the perturbation-based metrics offer better simplicity and efficiency.
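The sensitivity score described above can be illustrated on a toy stand-in for a language model. Everything in this sketch is hypothetical: the one-layer "model", the Gaussian noise, the noise scale, and the choice of mean absolute probability change as the score are assumptions for illustration, not the paper's metric.

```python
import math
import random

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_next_token_probs(embedding, weights):
    """Hypothetical stand-in for an LLM: one linear layer followed by softmax."""
    logits = [sum(w * e for w, e in zip(row, embedding)) for row in weights]
    return softmax(logits)

def sensitivity_score(embedding, weights, token_id, noise_scale=0.1, n_samples=32, seed=0):
    """Mean absolute change in the chosen token's probability when Gaussian
    noise is added to the preceding token embedding."""
    rng = random.Random(seed)
    base = toy_next_token_probs(embedding, weights)[token_id]
    total = 0.0
    for _ in range(n_samples):
        noisy = [e + noise_scale * rng.gauss(0.0, 1.0) for e in embedding]
        total += abs(toy_next_token_probs(noisy, weights)[token_id] - base)
    return total / n_samples

weights = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # hypothetical 3-token vocabulary
embedding = [1.0, 0.9]                          # hypothetical preceding-token embedding
token_id = 0                                    # the token that was generated

score = sensitivity_score(embedding, weights, token_id)
print("sensitivity score:", score)
```

A higher score would flag the token, and hence the reasoning step that contains it, as more uncertain. Unlike sampling-based estimates, the score is computed around a single generated sequence.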

Title: UniReason 1.0: A Unified Reasoning Framework for World Knowledge Aligned Image Generation and Editing

Authors: Dianyi Wang, Chaofan Ma, Feng Han, Size Wu, Wei Song, Yibin Wang, Zhixiong Zhang, Tianhang Wang, Siyuan Wang, Zhongyu Wei, Jiaqi Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2602.02437
Pdf URL: https://arxiv.org/pdf/2602.02437
Copy Paste: [[2602.02437]] UniReason 1.0: A Unified Reasoning Framework for World Knowledge Aligned Image Generation and Editing(https://arxiv.org/abs/2602.02437)
Keywords: generation
Abstract: Unified multimodal models often struggle with complex synthesis tasks that demand deep reasoning, and typically treat text-to-image generation and image editing as isolated capabilities rather than interconnected reasoning steps. To address this, we propose UniReason, a unified framework that harmonizes these two tasks through a dual reasoning paradigm. We formulate generation as world knowledge-enhanced planning to inject implicit constraints, and leverage editing capabilities for fine-grained visual refinement to further correct visual errors via self-reflection. This approach unifies generation and editing within a shared representation, mirroring the human cognitive process of planning followed by refinement. We support this framework by systematically constructing a large-scale reasoning-centric dataset (~300k samples) covering five major knowledge domains (e.g., cultural commonsense, physics, etc.) for planning, alongside an agent-generated corpus for visual self-correction. Extensive experiments demonstrate that UniReason achieves advanced performance on reasoning-intensive benchmarks such as WISE, KrisBench and UniREditBench, while maintaining superior general synthesis capabilities.

Title: Certain Head, Uncertain Tail: Expert-Sample for Test-Time Scaling in Fine-Grained MoE

Authors: Yuanteng Chen, Peisong Wang, Nanxin Zeng, Yuantian Shao, Gang Li, Jing Liu, Jian Cheng
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2602.02443
Pdf URL: https://arxiv.org/pdf/2602.02443
Copy Paste: [[2602.02443]] Certain Head, Uncertain Tail: Expert-Sample for Test-Time Scaling in Fine-Grained MoE(https://arxiv.org/abs/2602.02443)
Keywords: generation
Abstract: Test-time scaling improves LLM performance by generating multiple candidate solutions, yet token-level sampling requires temperature tuning that trades off diversity against stability. Fine-grained MoE, featuring hundreds of well-trained experts per layer and multi-expert activation per token, offers an unexplored alternative through its rich routing space. We empirically characterize fine-grained MoE routing and uncover an informative pattern: router scores exhibit a certain head of high-confidence experts followed by an uncertain tail of low-confidence candidates. While single-run greedy accuracy remains stable when fewer experts are activated, multi-sample pass@n degrades significantly, suggesting that the certain head governs core reasoning capability while the uncertain tail correlates with reasoning diversity. Motivated by these findings, we propose Expert-Sample, a training-free method that preserves high-confidence selections while injecting controlled stochasticity into the uncertain tail, enabling diverse generation without destabilizing outputs. Evaluated on multiple fine-grained MoE models across math, knowledge reasoning, and code tasks, Expert-Sample consistently improves pass@n and verification-based accuracy. On Qwen3-30B-A3B-Instruct evaluated on GPQA-Diamond with 32 parallel samples, pass@32 rises from 85.4% to 91.9%, and accuracy improves from 59.1% to 62.6% with Best-of-N verification.
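Illustration: One way to picture "keep the certain head, randomize the uncertain tail" is the routing sketch below. The abstract does not give the selection rule, so everything specific here is an assumption: the function name `expert_sample`, the fixed `head_size`, and the choice to fill the remaining slots by sampling tail experts without replacement in proportion to their router scores (taken to be positive, e.g. post-softmax).

```python
import random
from typing import List, Sequence


def expert_sample(
    scores: Sequence[float],
    k: int,
    head_size: int,
    rng: random.Random,
) -> List[int]:
    """Select k experts: the top `head_size` greedily, the rest sampled from the tail."""
    if not 0 <= head_size <= k <= len(scores):
        raise ValueError("need 0 <= head_size <= k <= number of experts")
    # Rank expert indices by router score, highest first.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    head = ranked[:head_size]   # high-confidence experts, always kept
    tail = ranked[head_size:]   # low-confidence candidates, sampled
    sampled: List[int] = []
    for _ in range(k - head_size):
        # Assumed rule: draw one tail expert with probability proportional to its score.
        weights = [scores[i] for i in tail]
        pick = rng.choices(tail, weights=weights, k=1)[0]
        sampled.append(pick)
        tail.remove(pick)       # sample without replacement
    return head + sampled
```

Setting `head_size == k` reduces this to ordinary deterministic top-k routing, while a smaller `head_size` lets parallel samples activate different tail experts.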

Title: Expanding the Capabilities of Reinforcement Learning via Text Feedback

Authors: Yuda Song, Lili Chen, Fahim Tajwar, Remi Munos, Deepak Pathak, J. Andrew Bagnell, Aarti Singh, Andrea Zanette
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2602.02482
Pdf URL: https://arxiv.org/pdf/2602.02482
Copy Paste: [[2602.02482]] Expanding the Capabilities of Reinforcement Learning via Text Feedback(https://arxiv.org/abs/2602.02482)
Keywords: generation
Abstract: The success of RL for LLM post-training stems from an unreasonably uninformative source: a single bit of information per rollout as binary reward or preference label. At the other extreme, distillation offers dense supervision but requires demonstrations, which are costly and difficult to scale. We study text feedback as an intermediate signal: richer than scalar rewards, yet cheaper than complete demonstrations. Textual feedback is a natural mode of human interaction and is already abundant in many real-world settings, where users, annotators, and automated judges routinely critique LLM outputs. Towards leveraging text feedback at scale, we formalize a multi-turn RL setup, RL from Text Feedback (RLTF), where text feedback is available during training but not at inference. Therefore, models must learn to internalize the feedback in order to improve their test-time single-turn performance. To do this, we propose two methods: Self Distillation (RLTF-SD), which trains the single-turn policy to match its own feedback-conditioned second-turn generations; and Feedback Modeling (RLTF-FM), which predicts the feedback as an auxiliary objective. We provide theoretical analysis of both methods and evaluate them empirically on reasoning puzzles, competition math, and creative writing tasks. Our results show that both methods consistently outperform strong baselines across benchmarks, highlighting the potential of RL with an additional source of rich supervision at scale.
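Illustration: The data flow behind the self-distillation idea (single-turn prompt paired with the model's own feedback-conditioned second attempt) might look like the sketch below. It covers only how such pairs could be assembled; the training objective and RL update are not described in the abstract and are not modeled here. The names `DistillExample` and `build_self_distillation_data`, the `policy` and `critic` callables, and the second-turn prompt template are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class DistillExample:
    prompt: str  # single-turn input, with no feedback attached
    target: str  # second-turn generation that was produced with feedback


def build_self_distillation_data(
    prompts: Sequence[str],
    policy: Callable[[str], str],
    critic: Callable[[str, str], str],
) -> List[DistillExample]:
    """Pair each prompt with the policy's feedback-conditioned second attempt."""
    examples: List[DistillExample] = []
    for prompt in prompts:
        first = policy(prompt)            # turn 1: answer without feedback
        feedback = critic(prompt, first)  # text critique of the first answer
        # Assumed template for the second-turn input.
        second_input = (
            f"{prompt}\n\nPrevious attempt:\n{first}\n\nFeedback:\n{feedback}"
        )
        second = policy(second_input)     # turn 2: answer with feedback in context
        # The single-turn policy would be trained to produce `second` from `prompt` alone.
        examples.append(DistillExample(prompt=prompt, target=second))
    return examples
```

Because the target is paired with the feedback-free prompt, feedback is needed only while building the data, which matches the setting where it is unavailable at inference.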

Title: PixelGen: Pixel Diffusion Beats Latent Diffusion with Perceptual Loss

Authors: Zehong Ma, Ruihan Xu, Shiliang Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2602.02493
Pdf URL: https://arxiv.org/pdf/2602.02493
Copy Paste: [[2602.02493]] PixelGen: Pixel Diffusion Beats Latent Diffusion with Perceptual Loss(https://arxiv.org/abs/2602.02493)
Keywords: generation, generative
Abstract: Pixel diffusion generates images directly in pixel space in an end-to-end manner, avoiding the artifacts and bottlenecks introduced by VAEs in two-stage latent diffusion. However, it is challenging to optimize high-dimensional pixel manifolds that contain many perceptually irrelevant signals, leaving existing pixel diffusion methods lagging behind latent diffusion models. We propose PixelGen, a simple pixel diffusion framework with perceptual supervision. Instead of modeling the full image manifold, PixelGen introduces two complementary perceptual losses to guide the diffusion model towards learning a more meaningful perceptual manifold. An LPIPS loss facilitates learning better local patterns, while a DINO-based perceptual loss strengthens global semantics. With perceptual supervision, PixelGen surpasses strong latent diffusion baselines. It achieves an FID of 5.11 on ImageNet-256 without classifier-free guidance using only 80 training epochs, and demonstrates favorable scaling performance on large-scale text-to-image generation with a GenEval score of 0.79. PixelGen requires no VAEs, no latent representations, and no auxiliary stages, providing a simpler yet more powerful generative paradigm. Code is publicly available at this https URL.
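Illustration: A plausible way to read "two complementary perceptual losses" is as extra terms added to the base diffusion objective. The weighted-sum form below is an assumption made for illustration; the abstract does not state how the terms are combined, and the symbols $\hat{x}_0$ (the model's predicted clean image), $x_0$ (the ground-truth image), and the weights $\lambda_{\text{LPIPS}}$, $\lambda_{\text{DINO}}$ are introduced here, not taken from the paper.

```latex
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{diff}}
  + \lambda_{\text{LPIPS}} \, \mathcal{L}_{\text{LPIPS}}(\hat{x}_0, x_0)
  + \lambda_{\text{DINO}} \, \mathcal{L}_{\text{DINO}}(\hat{x}_0, x_0)
```

Under this reading, $\mathcal{L}_{\text{LPIPS}}$ would account for local patterns and $\mathcal{L}_{\text{DINO}}$ for global semantics, with both computed directly on pixels since no VAE is involved.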