2026-03-10

Title: vLLM Hook v0: A Plug-in for Programming Model Internals on vLLM

Authors: Ching-Yun Ko, Pin-Yu Chen
Subjects: cs.LG, cs.CL, cs.PL
Abstract URL: https://arxiv.org/abs/2603.06588
Pdf URL: https://arxiv.org/pdf/2603.06588
Copy Paste: [[2603.06588]] vLLM Hook v0: A Plug-in for Programming Model Internals on vLLM(https://arxiv.org/abs/2603.06588)
Keywords: generation
Abstract: Modern artificial intelligence (AI) models are deployed on inference engines to optimize runtime efficiency and resource allocation, particularly for transformer-based large language models (LLMs). The vLLM project is a major open-source library to support model serving and inference. However, the current implementation of vLLM limits programmability of the internal states of deployed models. This prevents the use of popular test-time model alignment and enhancement methods. For example, it prevents the detection of adversarial prompts based on attention patterns or the adjustment of model responses based on activation steering. To bridge this critical gap, we present vLLM Hook, an open-source plug-in to enable the programming of internal states for vLLM models. Based on a configuration file specifying which internal states to capture, vLLM Hook provides seamless integration to vLLM and supports two essential features: passive programming and active programming. For passive programming, vLLM Hook probes the selected internal states for subsequent analysis, while keeping the model generation intact. For active programming, vLLM Hook enables efficient intervention in model generation by altering the selected internal states. In addition to presenting the core functions of vLLM Hook, in version 0, we demonstrate three use cases: prompt injection detection, enhanced retrieval-augmented generation (RAG), and activation steering. Finally, we welcome the community's contribution to improve vLLM Hook via this https URL.
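To make the passive/active distinction concrete, here is a minimal sketch in plain Python. It does not use vLLM or the vLLM Hook API: `TinyModel`, `capture_hook`, and `make_steering_hook` are hypothetical names, and the "layers" are toy functions standing in for transformer blocks. A passive hook only records an activation, while an active hook returns a modified activation that flows into the next layer.

```python
from typing import Callable, Dict, List, Optional

Vector = List[float]
# A hook receives (layer_name, activation). Returning None leaves the
# activation untouched (passive); returning a vector replaces it (active).
Hook = Callable[[str, Vector], Optional[Vector]]


class TinyModel:
    """Hypothetical toy model: a fixed chain of named layers."""

    def __init__(self) -> None:
        self.layers: Dict[str, Callable[[Vector], Vector]] = {
            "layer0": lambda h: [2.0 * x for x in h],
            "layer1": lambda h: [x + 1.0 for x in h],
        }
        self.hooks: Dict[str, List[Hook]] = {}

    def register_hook(self, layer_name: str, hook: Hook) -> None:
        self.hooks.setdefault(layer_name, []).append(hook)

    def forward(self, h: Vector) -> Vector:
        for name, layer in self.layers.items():
            h = layer(h)
            for hook in self.hooks.get(name, []):
                replacement = hook(name, h)
                if replacement is not None:
                    h = replacement
        return h


captured: Dict[str, Vector] = {}


def capture_hook(name: str, h: Vector) -> Optional[Vector]:
    """Passive: store a copy of the activation, change nothing."""
    captured[name] = list(h)
    return None


def make_steering_hook(direction: Vector, scale: float) -> Hook:
    """Active: add a scaled steering direction to the activation."""
    def steering_hook(name: str, h: Vector) -> Optional[Vector]:
        return [x + scale * d for x, d in zip(h, direction)]
    return steering_hook


passive_model = TinyModel()
passive_model.register_hook("layer0", capture_hook)
passive_out = passive_model.forward([1.0, -1.0])

active_model = TinyModel()
active_model.register_hook("layer0", make_steering_hook([1.0, 0.0], scale=0.5))
active_out = active_model.forward([1.0, -1.0])
```

In this toy setup the passive run returns the same output as an unhooked model, while the active run shifts the first coordinate by the steering offset.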

Title: Switchable Activation Networks

Authors: Laha Ale, Ning Zhang, Scott A. King, Pingzhi Fan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.06601
Pdf URL: https://arxiv.org/pdf/2603.06601
Copy Paste: [[2603.06601]] Switchable Activation Networks(https://arxiv.org/abs/2603.06601)
Keywords: generative
Abstract: Deep neural networks, and more recently large-scale generative models such as large language models (LLMs) and large vision-action models (LVAs), achieve remarkable performance across diverse domains, yet their prohibitive computational cost hinders deployment in resource-constrained environments. Existing efficiency techniques offer only partial remedies: dropout improves regularization during training but leaves inference unchanged, while pruning and low-rank factorization compress models post hoc into static forms with limited adaptability. Here we introduce SWAN (Switchable Activation Networks), a framework that equips each neural unit with a deterministic, input-dependent binary gate, enabling the network to learn when a unit should be active or inactive. This dynamic control mechanism allocates computation adaptively, reducing redundancy while preserving accuracy. Unlike traditional pruning, SWAN does not simply shrink networks after training; instead, it learns structured, context-dependent activation patterns that support both efficient dynamic inference and conversion into compact dense models for deployment. By reframing efficiency as a problem of learned activation control, SWAN unifies the strengths of sparsity, pruning, and adaptive inference within a single paradigm. Beyond computational gains, this perspective suggests a more general principle of neural computation, where activation is not fixed but context-dependent, pointing toward sustainable AI, edge intelligence, and future architectures inspired by the adaptability of biological brains.
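The idea of a deterministic, input-dependent binary gate per unit can be illustrated with a small sketch. This is an assumption-laden toy, not SWAN's actual formulation: the gate is taken to be a thresholded linear function of the input, and `gated_layer` and all parameter values are hypothetical. How such gates are trained is not shown.

```python
from typing import List, Tuple


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def gated_layer(
    x: List[float],
    weights: List[List[float]],
    biases: List[float],
    gate_weights: List[List[float]],
    gate_biases: List[float],
) -> Tuple[List[float], int]:
    """Return (outputs, number of active units) for one input."""
    outputs: List[float] = []
    active = 0
    for w, b, u, c in zip(weights, biases, gate_weights, gate_biases):
        # Assumed gate form: the unit is on iff u.x + c > 0.
        if dot(u, x) + c > 0.0:
            active += 1
            outputs.append(max(0.0, dot(w, x) + b))  # ReLU unit
        else:
            outputs.append(0.0)  # unit skipped, no dot(w, x) computed
    return outputs, active


weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
biases = [0.0, 0.0, 0.0]
gate_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
gate_biases = [0.0, 0.0, -4.0]

out_a, n_a = gated_layer([2.0, -1.0], weights, biases, gate_weights, gate_biases)
out_b, n_b = gated_layer([-1.0, 3.0], weights, biases, gate_weights, gate_biases)
out_c, n_c = gated_layer([3.0, 3.0], weights, biases, gate_weights, gate_biases)
```

Different inputs switch on different subsets of units, so the amount of computation varies per input.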

Title: Know When You're Wrong: Aligning Confidence with Correctness for LLM Error Detection

Authors: Xie Xiaohu, Liu Xiaohu, Yao Benjamin
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2603.06604
Pdf URL: https://arxiv.org/pdf/2603.06604
Copy Paste: [[2603.06604]] Know When You're Wrong: Aligning Confidence with Correctness for LLM Error Detection(https://arxiv.org/abs/2603.06604)
Keywords: generation
Abstract: As large language models (LLMs) are increasingly deployed in critical decision-making systems, the lack of reliable methods to measure their uncertainty presents a fundamental trustworthiness risk. We introduce a normalized confidence score based on output anchor token probabilities: classification labels for structured tasks and self-evaluation responses (Yes/No) for open-ended generation. This enables direct detection of errors and hallucinations with minimal overhead and without external validation. We make three key contributions. First, we propose a normalized confidence score and self-evaluation framework that exposes reliable confidence estimates for error detection across seven diverse benchmark tasks and five LLMs of varying architectures and sizes. Second, our theoretical analysis reveals that supervised fine-tuning (SFT) yields well-calibrated confidence through maximum-likelihood estimation, whereas reinforcement learning methods (PPO, GRPO) and DPO induce overconfidence via reward exploitation. Third, we propose post-RL SFT with self-distillation to restore confidence reliability in RL-trained models. Empirical results demonstrated that SFT improved average confidence-correctness AUROC from 0.806 to 0.879 and reduced calibration error from 0.163 to 0.034 on Qwen3-4B, while GRPO and DPO degraded confidence reliability. We demonstrated practical value through adaptive retrieval-augmented generation (RAG) that selectively retrieves context when the model lacks confidence, using only 58% of retrieval operations to recover 95% of the maximum achievable accuracy gain on TriviaQA.
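A normalized confidence score over anchor tokens, and its use as a retrieval trigger, might look like the following sketch. The exact normalization used in the paper is not given in the abstract; here it is assumed that the probability mass on the anchor tokens is renormalized to sum to one. The function names, the example log-probabilities, and the 0.8 threshold are all hypothetical.

```python
import math
from typing import Dict


def normalized_confidence(anchor_logprobs: Dict[str, float]) -> Dict[str, float]:
    """Renormalize probability mass over the anchor tokens only."""
    probs = {tok: math.exp(lp) for tok, lp in anchor_logprobs.items()}
    total = sum(probs.values())
    return {tok: p / total for tok, p in probs.items()}


def should_retrieve(
    anchor_logprobs: Dict[str, float],
    positive: str = "Yes",
    threshold: float = 0.8,  # hypothetical cut-off, would be tuned in practice
) -> bool:
    """Adaptive RAG trigger: retrieve only when self-evaluated confidence is low."""
    return normalized_confidence(anchor_logprobs)[positive] < threshold


# Hypothetical log-probabilities of the self-evaluation tokens.
confident = {"Yes": math.log(0.60), "No": math.log(0.05)}
unsure = {"Yes": math.log(0.30), "No": math.log(0.25)}

conf_confident = normalized_confidence(confident)["Yes"]  # 0.60 / 0.65
conf_unsure = normalized_confidence(unsure)["Yes"]        # 0.30 / 0.55
```

Under these assumptions the first answer is kept as-is and only the second triggers a retrieval call.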

Title: Correlation Analysis of Generative Models

Authors: Zhengguo Li, Chaobing Zheng, Wei Wang
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2603.06614
Pdf URL: https://arxiv.org/pdf/2603.06614
Copy Paste: [[2603.06614]] Correlation Analysis of Generative Models(https://arxiv.org/abs/2603.06614)
Keywords: generative
Abstract: Based on a literature review of existing diffusion and flow matching models, which use a neural network to predict a predefined target from noisy data, this paper first proposes a unified representation of these models using two simple linear equations. A theoretical analysis of the proposed representation is then presented. The analysis shows that the correlation between the noisy data and the predicted target is sometimes weak in existing diffusion and flow matching models. This may affect the prediction (or learning) process, which plays a crucial role in all of these models.
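The abstract does not state the two linear equations, so the following is only one plausible reading: the noisy data is x_t = a*x0 + b*eps and the target is y = c*x0 + d*eps, with x0 and eps taken as independent, zero-mean, unit-variance scalars. Under those assumptions the Pearson correlation has a closed form, (a*c + b*d) / sqrt((a^2 + b^2) * (c^2 + d^2)), which the sketch below evaluates for two common parameterizations.

```python
import math


def correlation(a: float, b: float, c: float, d: float) -> float:
    """Correlation of x_t = a*x0 + b*eps with y = c*x0 + d*eps.

    Assumes x0 and eps are independent with zero mean and unit variance.
    """
    covariance = a * c + b * d
    return covariance / math.sqrt((a * a + b * b) * (c * c + d * d))


def flow_matching_corr(t: float) -> float:
    """Linear path x_t = (1 - t)*x0 + t*eps with velocity target eps - x0."""
    return correlation(1.0 - t, t, -1.0, 1.0)


def noise_prediction_corr(alpha_bar: float) -> float:
    """x_t = sqrt(alpha_bar)*x0 + sqrt(1 - alpha_bar)*eps with target eps."""
    return correlation(math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar), 0.0, 1.0)


corr_mid = flow_matching_corr(0.5)         # vanishes at the midpoint
corr_start = flow_matching_corr(0.0)       # -1 / sqrt(2)
corr_low_noise = noise_prediction_corr(0.99)  # sqrt(1 - 0.99) = 0.1
```

In this toy setting the correlation is exactly zero at t = 0.5 for the linear path and small at low noise levels for noise prediction, which is consistent with the abstract's claim that the correlation is sometimes weak.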

Title: Annealed Co-Generation: Disentangling Variables via Progressive Pairwise Modeling

Authors: Hantao Zhang, Jieke Wu, Mingda Xu, Xiao Hu, Yingxuan You, Pascal Fua
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.06615
Pdf URL: https://arxiv.org/pdf/2603.06615
Copy Paste: [[2603.06615]] Annealed Co-Generation: Disentangling Variables via Progressive Pairwise Modeling(https://arxiv.org/abs/2603.06615)
Keywords: generation
Abstract: For multivariate co-generation in scientific applications, we advocate pairwise block modeling rather than joint modeling of all variables. This design mitigates the computational burden and data imbalance. To this end, we propose an Annealed Co-Generation (ACG) framework that replaces high-dimensional diffusion modeling with a low-dimensional diffusion model, which enables multivariate co-generation by composing pairwise variable generations. We first train an unconditional diffusion model over causal variables that are disentangled into pairs. At inference time, we recover the joint distribution by coupling these pairwise models through shared common variables, enabling coherent multivariate generation without any additional training. By employing a three-stage annealing process (Consensus, Heating, and Cooling), our method enforces consistency across shared common variables and progressively constrains each pairwise data distribution to lie on a learnable manifold, while maintaining high likelihood within each pair. We demonstrate the framework's flexibility and efficacy on two distinct scientific tasks: flow-field completion and antibody generation. All datasets and code will be made publicly available upon publication.

Title: Evo: Autoregressive-Diffusion Large Language Models with Evolving Balance

Authors: Junde Wu, Minhao Hu, Jiayuan Zhu, Yuyuan Liu, Tianyi Zhang, Kang Li, Jingkun Chen, Jiazhen Pan, Min Xu, Yueming Jin
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.06617
Pdf URL: https://arxiv.org/pdf/2603.06617
Copy Paste: [[2603.06617]] Evo: Autoregressive-Diffusion Large Language Models with Evolving Balance(https://arxiv.org/abs/2603.06617)
Keywords: generation, generative
Abstract: We introduce Evo, a duality latent trajectory model that bridges autoregressive (AR) and diffusion-based language generation within a continuous evolutionary generative framework. Rather than treating AR decoding and diffusion generation as separate paradigms, Evo reconceptualizes text generation as a latent flow: each token is associated with a vector-valued embedding that evolves over a progression variable $t_i \in [0, 1]$, indicating its semantic maturity. Low $t_i$ values correspond to confident AR-like refinement, while high values invoke diffusion-style planning, allowing the model to adaptively balance AR and diffusion based on uncertainty. Theoretically, we show that both AR and diffusion models emerge as discretizations of a shared probability flow, and we derive Evo's training objective from a unified variational ELBO. The model is implemented as a time-conditioned Transformer governed by a shared vector field, trained end-to-end to jointly infer latent codes and their progression times. During decoding, Evo performs efficient, semantics-aware refinement, achieving high-quality outputs without sacrificing speed. Empirically, Evo 8B achieves state-of-the-art or highly competitive results on 15 diverse benchmarks, including reasoning (GSM8K, ARC-C), code generation (HumanEval, MBPP), and general language understanding, while maintaining fast inference speed. Our results demonstrate that Evo delivers a new paradigm for LLM design with strong generation quality, robust symbolic reasoning, and decoding efficiency.

Title: Advances in GRPO for Generation Models: A Survey

Authors: Zexiang Liu, Xianglong He, Yangguang Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.06623
Pdf URL: https://arxiv.org/pdf/2603.06623
Copy Paste: [[2603.06623]] Advances in GRPO for Generation Models: A Survey(https://arxiv.org/abs/2603.06623)
Keywords: restoration, generation, generative
Abstract: Large-scale flow matching models have achieved strong performance across generative tasks such as text-to-image, video, 3D, and speech synthesis. However, aligning their outputs with human preferences and task-specific objectives remains challenging. Flow-GRPO extends Group Relative Policy Optimization (GRPO) to generation models, enabling stable reinforcement learning alignment for generative systems. Since its introduction, Flow-GRPO has triggered rapid research growth, spanning methodological refinements and diverse application domains. This survey provides a comprehensive review of Flow-GRPO and its subsequent developments. We organize existing work along two primary dimensions. First, we analyze methodological advances beyond the original framework, including reward signal design, credit assignment, sampling efficiency, diversity preservation, reward hacking mitigation, and reward model construction. Second, we examine extensions of GRPO-based alignment across generative paradigms and modalities, including text-to-image, video generation, image editing, speech and audio, 3D modeling, embodied vision-language-action systems, unified multimodal models, autoregressive and masked diffusion models, and restoration tasks. By synthesizing theoretical insights and practical adaptations, this survey highlights Flow-GRPO as a general alignment framework for modern generative models and outlines key open challenges for scalable and robust reinforcement-based generation.
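The core of GRPO is a group-relative advantage: several samples are generated for the same prompt, and each sample's reward is standardized against the group's mean and standard deviation instead of a learned value baseline. A minimal sketch follows; the use of the population standard deviation and the `eps` stabilizer are assumptions, and nothing specific to Flow-GRPO's sampling procedure is modeled.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one group of samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group (assumed)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four generations from one prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# If all samples receive the same reward, no sample is preferred.
flat = group_relative_advantages([0.5, 0.5, 0.5])
```

Samples above the group mean get positive advantages and samples below it get negative ones, so the update signal depends only on relative quality within the group.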

Title: SmartBench: Evaluating LLMs in Smart Homes with Anomalous Device States and Behavioral Contexts

Authors: Qingsong Zou, Zhi Yan, Zhiyao Xu, Kuofeng Gao, Jingyu Xiao, Yong Jiang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.06636
Pdf URL: https://arxiv.org/pdf/2603.06636
Copy Paste: [[2603.06636]] SmartBench: Evaluating LLMs in Smart Homes with Anomalous Device States and Behavioral Contexts(https://arxiv.org/abs/2603.06636)
Keywords: generation
Abstract: Due to the strong context-awareness capabilities demonstrated by large language models (LLMs), recent research has begun exploring their integration into smart home assistants to help users manage and adjust their living environments. While LLMs have been shown to effectively understand user needs and provide appropriate responses, most existing studies primarily focus on interpreting and executing user behaviors or instructions. However, a critical function of smart home assistants is the ability to detect when the home environment is in an anomalous state. This involves two key requirements: the LLM must accurately determine whether an anomalous condition is present, and provide either a clear explanation or actionable suggestions. To enhance the anomaly detection capabilities of next-generation LLM-based smart home assistants, we introduce SmartBench, which is the first smart home dataset designed for LLMs, containing both normal and anomalous device states as well as normal and anomalous device state transition contexts. We evaluate 13 mainstream LLMs on this benchmark. The experimental results show that most state-of-the-art models cannot achieve good anomaly detection performance. For example, Claude-Sonnet-4.5 achieves only 66.1% detection accuracy on context-independent anomaly categories, and performs even worse on context-dependent anomalies, with an accuracy of only 57.8%. More experimental results suggest that next-generation LLM-based smart home assistants are still far from being able to effectively detect and handle anomalous conditions in the smart home environment. Our dataset is publicly available at this https URL.

Title: HEARTS: Benchmarking LLM Reasoning on Health Time Series

Authors: Sirui Li, Shuhan Xiao, Mihir Joshi, Ahmed Metwally, Daniel McDuff, Wei Wang, Yuzhe Yang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.06638
Pdf URL: https://arxiv.org/pdf/2603.06638
Copy Paste: [[2603.06638]] HEARTS: Benchmarking LLM Reasoning on Health Time Series(https://arxiv.org/abs/2603.06638)
Keywords: generation
Abstract: The rise of large language models (LLMs) has shifted time series analysis from narrow analytics to general-purpose reasoning. Yet, existing benchmarks cover only a small set of health time series modalities and tasks, failing to reflect the diverse domains and extensive temporal dependencies inherent in real-world physiological modeling. To bridge these gaps, we introduce HEARTS (Health Reasoning over Time Series), a unified benchmark for evaluating hierarchical reasoning capabilities of LLMs over general health time series. HEARTS integrates 16 real-world datasets across 12 health domains and 20 signal modalities, and defines a comprehensive taxonomy of 110 tasks grouped into four core capabilities: Perception, Inference, Generation, and Deduction. Evaluating 14 state-of-the-art LLMs on more than 20K test samples reveals intriguing findings. First, LLMs substantially underperform specialized models, and their performance is only weakly related to general reasoning scores. Moreover, LLMs often rely on simple heuristics and struggle with multi-step temporal reasoning. Finally, performance declines with increasing temporal complexity, with similar failure modes within model families, indicating that scaling alone is insufficient. By making these gaps measurable, HEARTS provides a standardized testbed and living benchmark for developing next-generation LLM agents capable of reasoning over diverse health signals.

Title: HURRI-GAN: A Novel Approach for Hurricane Bias-Correction Beyond Gauge Stations using Generative Adversarial Networks

Authors: Noujoud Nadera, Hadi Majed, Stefanos Giaremis, Rola El Osta, Clint Dawson, Carola Kaiser, Hartmut Kaiser
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.06649
Pdf URL: https://arxiv.org/pdf/2603.06649
Copy Paste: [[2603.06649]] HURRI-GAN: A Novel Approach for Hurricane Bias-Correction Beyond Gauge Stations using Generative Adversarial Networks(https://arxiv.org/abs/2603.06649)
Keywords: generative
Abstract: The coastal regions of the eastern and southern United States are impacted by severe storm events, leading to significant loss of life and property. Accurately forecasting storm surge and wind impacts from hurricanes is essential for mitigating some of the impacts, e.g., through timely preparation of evacuations and other countermeasures. Physical simulation models like the ADCIRC hydrodynamics model, which run on high-performance computing resources, are sophisticated tools that produce increasingly accurate forecasts as the resolution of the computational meshes improves. However, a major drawback of these models is the significant time required to generate results at very high resolutions, which may not meet the near real-time demands of emergency responders. The presented work introduces HURRI-GAN, a novel AI-driven approach that augments the results produced by physical simulation models using time series generative adversarial networks (TimeGAN) to compensate for systemic errors of the physical model, thus reducing the necessary mesh size and runtime without loss in forecasting accuracy. We present first results in extrapolating model bias corrections to spatial regions beyond the positions of the water level gauge stations. The presented results show that our methodology can accurately generate bias corrections at target locations spatially beyond gauge station locations. The model's performance, as indicated by low root mean squared error (RMSE) values, highlights its capability to generate accurate extrapolated data. Applying the corrections generated by HURRI-GAN to the ADCIRC-modeled water levels improved the overall prediction at the majority of the testing gauge stations.

Title: EnsAug: Augmentation-Driven Ensembles for Human Motion Sequence Analysis

Authors: Bikram De, Habib Irani, Vangelis Metsis
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2603.06661
Pdf URL: https://arxiv.org/pdf/2603.06661
Copy Paste: [[2603.06661]] EnsAug: Augmentation-Driven Ensembles for Human Motion Sequence Analysis(https://arxiv.org/abs/2603.06661)
Keywords: generation
Abstract: Data augmentation is a crucial technique for training robust deep learning models for human motion, where annotated datasets are often scarce. However, generic augmentation methods often ignore the underlying geometric and kinematic constraints of the human body, risking the generation of unrealistic motion patterns that can degrade model performance. Furthermore, the conventional approach of training a single generalist model on a dataset expanded with a mixture of all available transformations does not fully exploit the unique learning signals provided by each distinct augmentation type. We challenge this convention by introducing a novel training paradigm, EnsAug, that strategically uses augmentation to foster model diversity within an ensemble. Our method involves training an ensemble of specialists, where each model learns from the original dataset augmented by only a single, distinct geometric transformation. Experiments on sign language and human activity recognition benchmarks demonstrate that our diversified ensemble methodology significantly outperforms the standard practice of training one model on a combined augmented dataset and achieves state-of-the-art accuracy on two sign language and one human activity recognition dataset while offering greater modularity and efficiency. Our primary contribution is the empirical validation of this training strategy, establishing an effective baseline for leveraging data augmentation in skeletal motion analysis.

Title: HyperTokens: Controlling Token Dynamics for Continual Video-Language Understanding

Authors: Toan Nguyen, Yang Liu, Celso De Melo, Flora D. Salim
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2603.06662
Pdf URL: https://arxiv.org/pdf/2603.06662
Copy Paste: [[2603.06662]] HyperTokens: Controlling Token Dynamics for Continual Video-Language Understanding(https://arxiv.org/abs/2603.06662)
Keywords: generation
Abstract: Continual VideoQA with multimodal LLMs is hindered by interference between tasks and the prohibitive cost of storing task-specific prompts. We introduce HyperTokens, a transformer-based token generator that produces fine-tuning tokens on demand, giving explicit control over prompt updates while keeping memory fixed. To suppress forgetting, we propose meta-inspired regularisers that look ahead to avoid task-specific sharp directions and anchor the evolving generator to prior tasks. We further connect our objective to sharpness-aware optimisation, providing insight into why it encourages flatter cross-task minima and improves retention. Beyond regularisation, HyperTokens exploits lightweight auxiliary multimodal supervision through shared generation weights; guided by a causal perspective, we design feasible objectives and surrogate mutual-information losses to regularise anti-causal cross-modal directions. Across two standard continual VideoQA benchmarks, HyperTokens achieves higher average accuracy with substantially lower forgetting. Finally, we introduce a challenging cross-modal ImageQA->VideoQA protocol and show that HyperTokens enables robust continual transfer in this setting.

Title: Accelerating Video Generation Inference with Sequential-Parallel 3D Positional Encoding Using a Global Time Index

Authors: Chao Yuan, Pan Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.06664
Pdf URL: https://arxiv.org/pdf/2603.06664
Copy Paste: [[2603.06664]] Accelerating Video Generation Inference with Sequential-Parallel 3D Positional Encoding Using a Global Time Index(https://arxiv.org/abs/2603.06664)
Keywords: generation
Abstract: Diffusion Transformer (DiT)-based video generation models inherently suffer from bottlenecks in long video synthesis and real-time inference, which can be attributed to the use of full spatiotemporal attention. Specifically, this mechanism leads to explosive O(N^2) memory consumption and high first-frame latency. To address these issues, we implement system-level inference optimizations for a causal autoregressive video generation pipeline. We adapt the Self-Forcing causal autoregressive framework to sequence-parallel inference and implement a sequence-parallel variant of the causal rotary position embedding, which we refer to as Causal-RoPE SP. This adaptation enables localized computation and reduces cross-rank communication in sequence-parallel execution. In addition, computation and communication pipelines are optimized through operator fusion and RoPE precomputation. Experiments conducted on an eight-GPU A800 cluster show that the optimized system achieves comparable generation quality, sub-second first-frame latency, and near real-time inference speed. For generating five-second 480P videos, a 1.58x speedup is achieved, thereby providing effective support for real-time interactive applications.
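Why a global index matters for rotary embeddings under sequence parallelism can be shown with a small one-dimensional sketch. This is not the paper's Causal-RoPE SP, which concerns a 3D positional encoding in a causal video pipeline; it only illustrates the assumed underlying idea that each rank can rotate its own shard locally, without communication, as long as it uses the global position (rank offset plus local index) rather than the local index. All function names here are hypothetical.

```python
import math
from typing import List


def rope_angles(position: int, dim: int, base: float = 10000.0) -> List[float]:
    """One rotation angle per 2D pair for the given (global) position."""
    return [position * base ** (-2.0 * k / dim) for k in range(dim // 2)]


def apply_rope(vec: List[float], position: int) -> List[float]:
    """Rotate consecutive pairs of the vector by position-dependent angles."""
    out: List[float] = []
    for k, angle in enumerate(rope_angles(position, len(vec))):
        x, y = vec[2 * k], vec[2 * k + 1]
        c, s = math.cos(angle), math.sin(angle)
        out.extend([x * c - y * s, x * s + y * c])
    return out


def rope_full(seq: List[List[float]]) -> List[List[float]]:
    """Reference: unsharded RoPE over the whole sequence."""
    return [apply_rope(v, p) for p, v in enumerate(seq)]


def rope_on_rank(chunk: List[List[float]], rank: int, chunk_len: int) -> List[List[float]]:
    """Per-rank RoPE: local computation using the global position index."""
    offset = rank * chunk_len
    return [apply_rope(v, offset + i) for i, v in enumerate(chunk)]


seq = [[float(p + 1), 0.5, -1.0, 2.0] for p in range(8)]
world_size, chunk_len = 4, 2

sharded: List[List[float]] = []
for rank in range(world_size):
    chunk = seq[rank * chunk_len:(rank + 1) * chunk_len]
    sharded.extend(rope_on_rank(chunk, rank, chunk_len))

reference = rope_full(seq)
```

Concatenating the per-rank results reproduces the unsharded encoding, whereas restarting positions at zero on every rank would not.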

Title: SJD-PV: Speculative Jacobi Decoding with Phrase Verification for Autoregressive Image Generation

Authors: Zhehao Yu, Baoquan Zhang, Bingqi Shan, Xinhao Liu, Dongliang Zhou, Guotao Liang, Guangming Ye, Yunming Ye
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.06666
Pdf URL: https://arxiv.org/pdf/2603.06666
Copy Paste: [[2603.06666]] SJD-PV: Speculative Jacobi Decoding with Phrase Verification for Autoregressive Image Generation(https://arxiv.org/abs/2603.06666)
Keywords: generation, generative
Abstract: Autoregressive (AR) image models have recently demonstrated remarkable generative capability, but their sequential nature results in significant inference latency. Existing training-free acceleration methods typically verify tokens independently, overlooking the strong co-occurrence patterns between adjacent visual tokens. This independence assumption often leads to contextual inconsistency and limits decoding efficiency. In this work, we introduce a novel training-free acceleration framework that performs phrase-level speculative verification, enabling the model to jointly validate multiple correlated tokens within each decoding window. To construct such phrase units, we analyze token co-occurrence statistics from the training corpus and group frequently co-occurring tokens into semantically coherent visual phrases. During inference, the proposed phrase-level verification evaluates aggregated likelihood ratios over each phrase, allowing simultaneous acceptance of multiple tokens while preserving generation quality. Extensive experiments on autoregressive text-to-image generation show that our method significantly reduces the number of function evaluations (NFE) and achieves up to 30% faster decoding without compromising visual fidelity. Our findings reveal that modeling short-range token co-occurrence provides an effective and general principle for accelerating autoregressive inference.
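The contrast between token-wise and phrase-level verification can be sketched as follows. The abstract says only that aggregated likelihood ratios are evaluated over each phrase; the specific acceptance rule below (accept the whole phrase when a uniform draw is at most the product of the per-token target/draft probability ratios, capped at 1) is an assumption for illustration, as are the function names and probabilities. The random draws are passed in explicitly so the example is deterministic.

```python
import math
from typing import List


def tokenwise_accept(target: List[float], draft: List[float], draws: List[float]) -> int:
    """Standard speculative check: accept tokens one by one until a rejection."""
    accepted = 0
    for p, q, u in zip(target, draft, draws):
        if u <= min(1.0, p / q):
            accepted += 1
        else:
            break
    return accepted


def phrase_accept(target: List[float], draft: List[float], draw: float) -> bool:
    """Assumed phrase-level check on the aggregated likelihood ratio."""
    log_ratio = sum(math.log(p) - math.log(q) for p, q in zip(target, draft))
    return draw <= min(1.0, math.exp(log_ratio))


# Hypothetical probabilities of a two-token phrase under each distribution.
target_probs = [0.30, 0.60]  # verifying (target) probabilities
draft_probs = [0.50, 0.40]   # drafting probabilities

n_tokenwise = tokenwise_accept(target_probs, draft_probs, [0.8, 0.8])
phrase_ok = phrase_accept(target_probs, draft_probs, 0.8)
```

Here the first token alone has ratio 0.6 and is rejected by the token-wise check, but the aggregated ratio 0.6 * 1.5 = 0.9 lets the phrase through as a unit.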

Title: Does Semantic Noise Initialization Transfer from Images to Videos? A Paired Diagnostic Study

Authors: Yixiao Jing, Chaoyu Zhang, Zixuan Zhong, Peizhou Huang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.06672
Pdf URL: https://arxiv.org/pdf/2603.06672
Copy Paste: [[2603.06672]] Does Semantic Noise Initialization Transfer from Images to Videos? A Paired Diagnostic Study(https://arxiv.org/abs/2603.06672)
Keywords: generation
Abstract: Semantic noise initialization has been reported to improve robustness and controllability in image diffusion models. Whether these gains transfer to text-to-video (T2V) generation remains unclear, since temporal coupling can introduce extra degrees of freedom and instability. We benchmark semantic noise initialization against standard Gaussian noise using a frozen VideoCrafter-style T2V diffusion backbone and VBench on 100 prompts. Using prompt-level paired tests with bootstrap confidence intervals and a sign-flip permutation test, we observe a small positive trend on temporal-related dimensions; however, the 95 percent confidence interval includes zero (p ~ 0.17) and the overall score remains on par with the baseline. To understand this outcome, we analyze the induced perturbations in noise space and find patterns consistent with weak or unstable signal. We recommend prompt-level paired evaluation and noise-space diagnostics as standard practice when studying initialization schemes for T2V diffusion.
摘要：据报道，语义噪声初始化可以提高图像扩散模型的鲁棒性和可控性。这些收益是否会转移到文本到视频（T2V）生成仍不清楚，因为时间耦合会引入额外的自由度和不稳定性。我们使用冻结的 VideoCrafter 式 T2V 扩散主干和 100 个提示上的 VBench，针对标准高斯噪声对语义噪声初始化进行基准测试。使用具有引导置信区间的提示级配对测试和符号翻转排列测试，我们观察到时间相关维度上的小积极趋势；然而，95% 的置信区间包括零 (p ~ 0.17)，总体得分仍与基线持平。为了理解这个结果，我们分析了噪声空间中引起的扰动，并找到与弱或不稳定信号一致的模式。在研究 T2V 扩散的初始化方案时，我们建议将即时级配对评估和噪声空间诊断作为标准做法。
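
The prompt-level paired evaluation recommended above can be sketched in a few lines of Python. This is a minimal illustration, assuming one score difference per prompt (semantic initialization minus Gaussian baseline); the function names, resample counts, and percentile indexing are illustrative choices, not the authors' code.

```python
import random
from statistics import mean


def paired_bootstrap_ci(diffs, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean paired difference."""
    rng = random.Random(seed)
    n = len(diffs)
    # Resample prompts with replacement and record the mean difference each time.
    means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_boot))
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


def sign_flip_p_value(diffs, n_perm=5000, seed=0):
    """Two-sided sign-flip permutation test for a zero mean paired difference."""
    rng = random.Random(seed)
    observed = abs(mean(diffs))
    hits = 0
    for _ in range(n_perm):
        # Under the null, each paired difference is equally likely to be +d or -d.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction avoids p = 0
```

If the interval returned by `paired_bootstrap_ci` contains zero, as in the result reported above, the observed trend is not distinguishable from noise at that confidence level.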

Title: Chart Deep Research in LVLMs via Parallel Relative Policy Optimization

Authors: Jiajin Tang, Gaoyang, Wenjie Wang, Sibei Yang, Xing Chen
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.06677
Pdf URL: https://arxiv.org/pdf/2603.06677
Copy Paste: [[2603.06677]] Chart Deep Research in LVLMs via Parallel Relative Policy Optimization(https://arxiv.org/abs/2603.06677)
Keywords: generation
Abstract: With the rapid advancement of data science, charts have evolved from simple numerical presentation tools to essential instruments for insight discovery and decision-making support. However, current chart data intelligence exhibits significant limitations in deep research capabilities, with existing methods predominantly addressing shallow tasks such as visual recognition or factual question-answering, rather than the complex reasoning and high-level data analysis that deep research requires. This limitation stems from two primary technical bottlenecks: at the training level, existing post-training techniques exhibit deficiencies in handling multi-dimensional reward signal interference and heterogeneous data gradient conflicts, preventing models from achieving balanced development across multiple capability dimensions; at the evaluation level, current methods remain limited to factual retrieval and basic computation, failing to assess end-to-end analytic reasoning and other deep research capabilities. To address the training challenge, we propose PRPO, which performs parallel optimization across reward dimensions and capability partitioning across data types, effectively disentangling conflicts between heterogeneous data and multi-dimensional reward signals while ensuring optimization stability. For the evaluation challenge, we construct MCDR-Bench based on the "error uniqueness principle," transforming subjective generation assessment into objective error identification through controllable error injection, enabling quantifiable evaluation of deep research capabilities. Experimental validation confirms that the proposed PRPO and MCDR-Bench jointly establish a unified framework that systematically advances chart deep research through enhanced collaborative training and objective evaluation.
摘要：随着数据科学的快速发展，图表已经从简单的数字呈现工具发展成为洞察发现和决策支持的重要工具。然而，当前的图表数据智能在深度研究能力方面表现出很大的局限性，现有方法主要解决视觉识别或事实问答等浅层任务，而不是深度研究所需的复杂推理和高级数据分析。这种限制源于两个主要的技术瓶颈：在训练层面，现有的训练后技术在处理多维奖励信号干扰和异构数据梯度冲突方面存在缺陷，导致模型无法实现跨多个能力维度的平衡发展；在评估层面，目前的方法仍然局限于事实检索和基础计算，未能评估端到端的分析推理和其他深度研究能力。为了应对训练挑战，我们提出了PRPO，它可以跨奖励维度和跨数据类型的能力划分进行并行优化，有效消除异构数据和多维奖励信号之间的冲突，同时确保优化稳定性。针对评估挑战，我们构建了基于“错误唯一性原则”的MCDR-Bench，通过可控错误注入将主观生成评估转变为客观错误识别，从而能够量化评估深度研究能力。实验验证证实，所提出的PRPO和MCDR-Bench共同建立了一个统一的框架，通过加强协作训练和客观评估，系统地推进图表深度研究。

Title: VB: Visibility Benchmark for Visibility and Perspective Reasoning in Images

Authors: Neil Tripathi
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.06680
Pdf URL: https://arxiv.org/pdf/2603.06680
Copy Paste: [[2603.06680]] VB: Visibility Benchmark for Visibility and Perspective Reasoning in Images(https://arxiv.org/abs/2603.06680)
Keywords: generation
Abstract: We present VB, a benchmark that tests whether vision-language models can determine what is and is not visible in a photograph, and abstain when a human viewer cannot reliably answer. Each item pairs a single photo with a short yes/no visibility claim; the model must output VISIBLY_TRUE, VISIBLY_FALSE, or ABSTAIN, together with a confidence score. Items are organized into 100 families using a 2x2 design that crosses a minimal image edit with a minimal text edit, yielding 300 headline evaluation cells. Unlike prior unanswerable-VQA benchmarks, VB tests not only whether a question is unanswerable but why (via reason codes tied to specific visibility factors), and uses controlled minimal edits to verify that model judgments change when and only when the underlying evidence changes. We score models on confidence-aware accuracy with abstention (CAA), minimal-edit flip rate (MEFR), confidence-ranked selective prediction (SelRank), and second-order perspective reasoning (ToMAcc); all headline numbers are computed on the strict XOR subset (three cells per family, 300 scored items per model). We evaluate nine models spanning flagship and prior-generation closed-source systems, and open-source models from 8B to 12B parameters. GPT-4o and Gemini 3.1 Pro effectively tie for the best composite score (0.728 and 0.727), followed by Gemini 2.5 Pro (0.678). The best open-source model, Gemma 3 12B (0.505), surpasses one prior-generation closed-source system. Text-flip robustness exceeds image-flip robustness for six of nine models, and confidence calibration varies substantially: GPT-4o and Gemini 2.5 Pro achieve similar accuracy yet differ sharply in selective prediction quality.
摘要：我们提出了 VB，这是一个基准测试，用于测试视觉语言模型是否可以确定照片中可见和不可见的内容，并在人类观看者无法可靠回答时放弃。每个项目都将一张照片与简短的是/否可见性声明配对；模型必须输出 VISIBLY_TRUE、VISIBLY_FALSE 或 ABSTAIN，以及置信度分数。使用 2x2 设计将项目组织成 100 个系列，该设计将最少的图像编辑与最少的文本编辑相结合，产生 300 个标题评估单元。与之前无法回答的 VQA 基准不同，VB 不仅测试问题是否无法回答，还测试原因（通过与特定可见性因素相关的原因代码），并使用受控的最小编辑来验证模型判断在且仅当基础证据发生变化时才会发生变化。我们对弃权置信度准确度（CAA）、最小编辑翻转率（MEFR）、置信度排序选择性预测（SelRank）和二阶透视推理（ToMAcc）等模型进行评分；所有标题数字均根据严格的 XOR 子集计算（每个系列三个单元，每个模型 300 个评分项目）。我们评估了涵盖旗舰和上一代闭源系统的九个模型，以及从 8B 到 12B 参数的开源模型。 GPT-4o 和 Gemini 3.1 Pro 实际上并列获得最佳综合得分（0.728 和 0.727），其次是 Gemini 2.5 Pro（0.678）。最好的开源模型 Gemma 3 12B (0.505) 超越了上一代闭源系统。九个模型中的六个模型的文本翻转鲁棒性超过了图像翻转鲁棒性，并且置信度校准差异很大：GPT-4o 和 Gemini 2.5 Pro 实现了相似的准确度，但在选择性预测质量方面却存在巨大差异。

Title: ECHO: Event-Centric Hypergraph Operations via Multi-Agent Collaboration for Multimedia Event Extraction

Authors: Hailong Chu, Shuo Zhang, Yunlong Chu, Shutai Huang, Xingyue Zhang, Tinghe Yan, Jinsong Zhang, Lei Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.06683
Pdf URL: https://arxiv.org/pdf/2603.06683
Copy Paste: [[2603.06683]] ECHO: Event-Centric Hypergraph Operations via Multi-Agent Collaboration for Multimedia Event Extraction(https://arxiv.org/abs/2603.06683)
Keywords: generation
Abstract: Multimedia Event Extraction (M2E2) involves extracting structured event records from both textual and visual content. Existing approaches, ranging from specialized architectures to direct Large Language Model (LLM) prompting, typically rely on a linear, end-to-end generation and thus suffer from cascading errors: early cross-modal misalignments often corrupt downstream role assignment under strict grounding constraints. We propose ECHO (Event-Centric Hypergraph Operations), a multi-agent framework that iteratively refines a shared Multimedia Event Hypergraph (MEHG), which serves as an explicit intermediate structure for multimodal event hypotheses. Unlike dialogue-centric frameworks, ECHO coordinates specialized agents by applying atomic hypergraph operations to the MEHG. Furthermore, we introduce a Link-then-Bind strategy that enforces deferred commitment: agents first identify relevant arguments and only then determine their precise roles, mitigating incorrect grounding and limiting error propagation. Extensive experiments on the M2E2 benchmark show that ECHO significantly outperforms the state-of-the-art (SOTA): with Qwen3-32B, it achieves a 7.3% and 15.5% improvement in average event mention and argument role F1, respectively.
摘要：多媒体事件提取（M2E2）涉及从文本和视觉内容中提取结构化事件记录。现有的方法，从专门的架构到直接的大型语言模型（LLM）提示，通常依赖于线性的端到端生成，因此会遭受级联错误：早期的跨模式错位通常会在严格的基础约束下破坏下游角色分配。我们提出了 ECHO（以事件为中心的超图操作），这是一种多智能体框架，可迭代地完善共享多媒体事件超图（MEHG），作为多模态事件假设的显式中间结构。与以对话为中心的框架不同，ECHO 通过将原子超图操作应用于 MEHG 来协调专门的代理。此外，我们引入了一种强制延迟承诺的“先链接后绑定”策略：代理首先识别相关参数，然后才确定其精确角色，从而减少不正确的接地并限制错误传播。 M2E2 基准的大量实验表明，ECHO 显着优于最先进的 (SOTA)：使用 Qwen3-32B，它在平均事件提及和争论角色 F1 方面分别实现了 7.3% 和 15.5% 的改进。

Title: One step further with Monte-Carlo sampler to guide diffusion better

Authors: Minsi Ren, Wenhao Deng, Ruiqi Feng, Tailin Wu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.06685
Pdf URL: https://arxiv.org/pdf/2603.06685
Copy Paste: [[2603.06685]] One step further with Monte-Carlo sampler to guide diffusion better(https://arxiv.org/abs/2603.06685)
Keywords: generation, generative
Abstract: Stochastic differential equation (SDE)-based generative models have achieved substantial progress in conditional generation via training-free differentiable loss-guided approaches. However, existing methodologies utilizing posterior sampling typically confront a substantial estimation error, which results in inaccurate gradients for guidance and leads to inconsistent generation results. To mitigate this issue, we propose that performing an additional backward denoising step and Monte-Carlo sampling (ABMS) can achieve better guided diffusion, which is a plug-and-play adjustment strategy. To verify the effectiveness of our method, we provide theoretical analysis and propose the adoption of a dual-focus evaluation framework, which further serves to highlight the critical problem of cross-condition interference prevalent in existing approaches. We conduct experiments across various task settings and data types, mainly including conditional online handwritten trajectory generation, image inverse problems (inpainting, super-resolution, and Gaussian deblurring), molecular inverse design, and so on. Experimental results demonstrate that our approach can be effectively used with higher-order samplers and consistently improves the quality of generated samples across all the different scenarios.
摘要：基于随机微分方程（SDE）的生成模型通过免训练的可微损失引导方法在条件生成方面取得了实质性进展。然而，利用后验采样的现有方法通常会面临很大的估计误差，这会导致指导梯度不准确并导致生成结果不一致。为了缓解这个问题，我们建议执行额外的后向降噪步骤和蒙特卡罗采样（ABMS）可以实现更好的引导扩散，这是一种即插即用的调整策略。为了验证我们方法的有效性，我们提供了理论分析并建议采用双焦点评估框架，这进一步突出了现有方法中普遍存在的跨条件干扰的关键问题。我们针对各种任务设置和数据类型进行了实验，主要包括条件在线手写轨迹生成、图像逆问题（修复、超分辨率和高斯去模糊）分子逆设计等。实验结果表明，我们的方法可以有效地与更高阶的采样器一起使用，并持续提高所有不同场景中生成样本的质量。

Title: Narrative Weaver: Towards Controllable Long-Range Visual Consistency with Multi-Modal Conditioning

Authors: Zhengjian Yao, Yongzhi Li, Xinyuan Gao, Quan Chen, Peng Jiang, Yanye Lu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.06688
Pdf URL: https://arxiv.org/pdf/2603.06688
Copy Paste: [[2603.06688]] Narrative Weaver: Towards Controllable Long-Range Visual Consistency with Multi-Modal Conditioning(https://arxiv.org/abs/2603.06688)
Keywords: generation, generative
Abstract: We present "Narrative Weaver", a novel framework that addresses a fundamental challenge in generative AI: achieving multi-modal controllable, long-range, and consistent visual content generation. While existing models excel at generating high-fidelity short-form visual content, they struggle to maintain narrative coherence and visual consistency across extended sequences - a critical limitation for real-world applications such as filmmaking and e-commerce advertising. Narrative Weaver introduces the first holistic solution that seamlessly integrates three essential capabilities: fine-grained control, automatic narrative planning, and long-range coherence. Our architecture combines a Multimodal Large Language Model (MLLM) for high-level narrative planning with a novel fine-grained control module featuring a dynamic Memory Bank that prevents visual drift. To enable practical deployment, we develop a progressive, multi-stage training strategy that efficiently leverages existing pre-trained models, achieving state-of-the-art performance even with limited training data. Recognizing the absence of suitable evaluation benchmarks, we construct and release the E-commerce Advertising Video Storyboard Dataset (EAVSD) - the first comprehensive dataset for this task, containing over 330K high-quality images with rich narrative annotations. Through extensive experiments across three distinct scenarios (controllable multi-scene generation, autonomous storytelling, and e-commerce advertising), we demonstrate our method's superiority while opening new possibilities for AI-driven content creation.
摘要：我们提出了“Narrative Weaver”，这是一个新颖的框架，它解决了生成人工智能的基本挑战：实现多模式可控、远程和一致的视觉内容生成。虽然现有模型擅长生成高保真短格式视觉内容，但它们很难在扩展序列中保持叙事连贯性和视觉一致性——这对于电影制作和电子商务广告等现实世界应用来说是一个关键限制。 Narrative Weaver 推出了第一个整体解决方案，无缝集成了三个基本功能：细粒度控制、自动叙事规划和远程连贯性。我们的架构将用于高级叙事规划的多模态大语言模型（MLLM）与新颖的细粒度控制模块相结合，该模块具有可防止视觉漂移的动态内存库。为了实现实际部署，我们开发了一种渐进式多阶段训练策略，该策略有效地利用现有的预训练模型，即使在训练数据有限的情况下也能实现最先进的性能。认识到缺乏合适的评估基准，我们构建并发布了电子商务广告视频故事板数据集 (EAVSD)，这是该任务的第一个综合数据集，包含超过 330K 具有丰富叙事注释的高质量图像。通过对三个不同场景（可控多场景生成、自主讲故事和电子商务广告）的广泛实验，我们展示了我们方法的优越性，同时为人工智能驱动的内容创作开辟了新的可能性。

Title: SIQA: Toward Reliable Scientific Image Quality Assessment

Authors: Wenzhe Li, Liang Chen, Junying Wang, Yijing Guo, Ye Shen, Farong Wen, Chunyi Li, Zicheng Zhang, Guangtao Zhai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.06700
Pdf URL: https://arxiv.org/pdf/2603.06700
Copy Paste: [[2603.06700]] SIQA: Toward Reliable Scientific Image Quality Assessment(https://arxiv.org/abs/2603.06700)
Keywords: quality assessment
Abstract: Scientific images fundamentally differ from natural and AI-generated images in that they encode structured domain knowledge rather than merely depict visual scenes. Assessing their quality therefore requires evaluating not only perceptual fidelity but also scientific correctness and logical completeness. However, existing image quality assessment (IQA) paradigms primarily focus on perceptual distortions or image-text alignment, implicitly assuming that depicted content is factually valid. This assumption breaks down in scientific contexts, where visually plausible figures may still contain conceptual errors or incomplete reasoning. To address this gap, we introduce Scientific Image Quality Assessment (SIQA), a framework that models scientific image quality along two complementary dimensions: Knowledge (Scientific Validity and Scientific Completeness) and Perception (Cognitive Clarity and Disciplinary Conformity). To operationalize this formulation, we design two evaluation protocols: SIQA-U (Understanding), which measures semantic comprehension of scientific content through multiple-choice tasks, and SIQA-S (Scoring), which evaluates alignment with expert quality judgments. We further construct the SIQA Challenge, consisting of an expert-annotated benchmark and a large-scale training set. Experiments across representative multimodal large language models (MLLMs) reveal a consistent discrepancy between scoring alignment and scientific understanding. While models can achieve strong agreement with expert ratings under SIQA-S, their performance on SIQA-U remains substantially lower. Fine-tuning improves both metrics, yet gains in scoring consistently outpace improvements in understanding. These results suggest that rating consistency alone may not reliably reflect scientific comprehension, underscoring the necessity of multidimensional evaluation for scientific image quality assessment.
摘要：科学图像与自然图像和人工智能生成的图像有着根本的不同，因为它们编码结构化的领域知识，而不仅仅是描绘视觉场景。因此，评估其质量不仅需要评估感知的保真度，还需要评估科学的正确性和逻辑的完整性。然而，现有的图像质量评估（IQA）范式主要关注感知扭曲或图像文本对齐，隐含地假设所描述的内容实际上是有效的。这种假设在科学背景下不成立，视觉上合理的数字可能仍然包含概念错误或不完整的推理。为了解决这一差距，我们引入了科学图像质量评估（SIQA），这是一个沿着两个互补维度对科学图像质量进行建模的框架：知识（科学有效性和科学完整性）和感知（认知清晰度和学科一致性）。为了实施这个公式，我们设计了两种评估协议：SIQA-U（理解），通过多项选择任务衡量科学内容的语义理解；SIQA-S（评分），评估与专家质量判断的一致性。我们进一步构建了 SIQA 挑战赛，其中包括专家注释的基准和大规模训练集。跨代表性多模态大语言模型 (MLLM) 的实验揭示了评分一致性和科学理解之间的一致差异。虽然模型可以与 SIQA-S 下的专家评级达成高度一致，但它们在 SIQA-U 上的表现仍然要低得多。微调可以改善这两个指标，但评分的提高始终超过理解的提高。这些结果表明，仅评级一致性可能无法可靠地反映科学理解力，强调了多维评估对科学图像质量评估的必要性。

Title: From Statistical Fidelity to Clinical Consistency: Scalable Generation and Auditing of Synthetic Patient Trajectories

Authors: Guanglin Zhou, Armin Catic, Motahare Shabestari, Matthew Young, Chaiquan Li, Katrina Poppe, Sebastiano Barbieri
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.06720
Pdf URL: https://arxiv.org/pdf/2603.06720
Copy Paste: [[2603.06720]] From Statistical Fidelity to Clinical Consistency: Scalable Generation and Auditing of Synthetic Patient Trajectories(https://arxiv.org/abs/2603.06720)
Keywords: generation, generative
Abstract: Access to electronic health records (EHRs) for digital health research is often limited by privacy regulations and institutional barriers. Synthetic EHRs have been proposed as a way to enable safe and sovereign data sharing; however, existing methods may produce records that capture overall statistical properties of real data but present inconsistencies across clinical processes and observations. We developed an integrated pipeline to make synthetic patient trajectories clinically consistent through two synergistic steps: high-fidelity generation and scalable auditing. Using the MIMIC-IV database, we trained a knowledge-grounded generative model that represents nearly 32,000 distinct clinical events, including demographics, laboratory measurements, medications, procedures, and diagnoses, while enforcing structural integrity. To support clinical consistency at scale, we incorporated an automated auditing module leveraging large language models to filter out clinical inconsistencies (e.g., contraindicated medications) that escape probabilistic generation. We generated 18,071 synthetic patient records derived from a source cohort of 180,712 real patients. While synthetic clinical event probabilities demonstrated robust agreement (mean bias effectively 0.00) and high correlation (R2=0.99) with the real counterparts, review of a random sample of synthetic records (N=20) by three clinicians identified inconsistencies in 45-60% of them. Automated auditing reduced the difference between real and synthetic data (Cohen's effect size d between 0.59 and 1.60 before auditing, and between 0.18 and 0.67 after auditing). Downstream models trained on audited data matched or even exceeded real-data performance. We found no evidence of privacy risks, with membership inference performance indistinguishable from random guessing (F1-score=0.51).
摘要：用于数字健康研究的电子健康记录 (EHR) 的获取通常受到隐私法规和制度障碍的限制。综合电子病历已被提议作为实现安全和主权数据共享的一种方式；然而，现有方法可能会生成捕获真实数据整体统计特性的记录，但在临床过程和观察中呈现不一致。我们开发了一个集成管道，通过两个协同步骤使合成患者轨迹在临床上保持一致：高保真生成和可扩展审核。使用 MIMIC-IV 数据库，我们训练了一个基于知识的生成模型，该模型代表近 32,000 个不同的临床事件，包括人口统计、实验室测量、药物、程序和诊断，同时加强结构完整性。为了支持大规模的临床一致性，我们整合了一个自动审核模块，利用大型语言模型来过滤掉逃避概率生成的临床不一致（例如禁忌药物）。我们从 180,712 名真实患者的源队列中生成了 18,071 份合成患者记录。虽然合成临床事件概率与真实对应事件表现出强烈的一致性（有效平均偏差为 0.00）和高度相关性（R2=0.99），但三位临床医生对合成记录的随机样本（N=20）进行审查后发现，其中 45-60% 存在不一致之处。自动审核减少了真实数据和合成数据之间的差异（审核前科恩效应大小 d 在 0.59 到 1.60 之间，审核后在 0.18 到 0.67 之间）。根据审计数据训练的下游模型匹配甚至超过了真实数据的性能。我们没有发现隐私风险的证据，成员资格推断性能与随机猜测无法区分（F1-score = 0.51）。
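
The effect sizes quoted above are Cohen's d values. As a minimal sketch, assuming the standard pooled-standard-deviation form (the abstract does not say which variant the authors used), d for one feature measured in a real and a synthetic cohort could be computed as follows; `cohens_d` and its argument names are illustrative.

```python
import math
from statistics import mean, variance


def cohens_d(real, synthetic):
    """Cohen's d with a pooled standard deviation (sample variances, ddof=1)."""
    n1, n2 = len(real), len(synthetic)
    pooled_var = (
        (n1 - 1) * variance(real) + (n2 - 1) * variance(synthetic)
    ) / (n1 + n2 - 2)
    # Difference of means expressed in units of the pooled standard deviation.
    return (mean(real) - mean(synthetic)) / math.sqrt(pooled_var)
```

Values near 0 indicate that the two cohorts are hard to tell apart on that feature, which is the direction of the change reported after auditing.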

Title: Safe Transformer: An Explicit Safety Bit For Interpretable And Controllable Alignment

Authors: Jingyuan Feng, Andrew Gambardella, Gouki Minegishi, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.06727
Pdf URL: https://arxiv.org/pdf/2603.06727
Copy Paste: [[2603.06727]] Safe Transformer: An Explicit Safety Bit For Interpretable And Controllable Alignment(https://arxiv.org/abs/2603.06727)
Keywords: generation
Abstract: Current safety alignment methods encode safe behavior implicitly within model parameters, creating a fundamental opacity: we cannot easily inspect why a model refuses a request, nor intervene when its safety judgments fail. We propose Safe Transformer, a modular approach that augments pre-trained language models by inserting a discrete information bottleneck containing an explicit safety bit between transformer layers. The safety bit serves as both an interpretable signal of the model's safety classification and a controllable switch: through contrastive training, the model learns disentangled representations where the safety bit governs the behavioral mode - producing helpful responses when $s=1$ and refusals when $s=0$ - while additional unsupervised bits $u$ encode semantic content for generation. Additional unsupervised bits in the information bottleneck allow semantic information to flow through, preserving the model's generation capabilities. This design achieves both interpretability (the safety decision is directly readable) and controllability (the safety bit can be manually overridden), requiring only lightweight fine-tuning without pre-training from scratch. In red-team benchmarks, Safe Transformer achieves near-zero Attack Success Rate, substantially outperforming base models and safety fine-tuning baselines.
摘要：当前的安全对齐方法将安全行为隐式编码在模型参数中，从而产生了根本性的不透明性：我们无法轻松检查模型拒绝请求的原因，也无法在其安全判断失败时进行干预。我们提出了 Safe Transformer，这是一种模块化方法，通过在变压器层之间插入包含显式安全位的离散信息瓶颈来增强预先训练的语言模型。安全位既充当模型安全分类的可解释信号，又充当可控开关：通过对比训练，模型学习解开的表示，其中安全位控制行为模式 - 当 $s=1$ 时产生有用的响应，当 $s=0$ 时产生拒绝 - 而额外的无监督位 $u$ 编码语义内容以进行生成。信息瓶颈中的附加无监督位允许语义信息流过，从而保留模型的生成能力。这种设计同时实现了可解释性（安全决策可直接读取）和可控性（安全位可手动覆盖），只需要轻量级微调，无需从头开始预训练。在红队基准测试中，Safe Transformer 实现了接近于零的攻击成功率，大大优于基础模型和安全微调基线。

Title: Vessel-Aware Deep Learning for OCTA-Based Detection of AMD

Authors: Margalit G. Mitzner, Moinak Bhattacharya, Zhilin Zou, Chao Chen, Prateek Prasanna
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.06735
Pdf URL: https://arxiv.org/pdf/2603.06735
Copy Paste: [[2603.06735]] Vessel-Aware Deep Learning for OCTA-Based Detection of AMD(https://arxiv.org/abs/2603.06735)
Keywords: generation
Abstract: Age-related macular degeneration (AMD) is characterized by early micro-vascular alterations that can be captured non-invasively using optical coherence tomography angiography (OCTA), yet most deep learning (DL) models rely on global features and fail to exploit clinically meaningful vascular biomarkers. We introduce an external multiplicative attention framework that incorporates vessel-specific tortuosity maps and vasculature dropout maps derived from arteries, veins, and capillaries. These biomarker maps are generated from vessel segmentations and smoothed across multiple spatial scales to highlight coherent patterns of vascular remodeling and capillary rarefaction. Tortuosity reflects abnormalities in vessel geometry linked to impaired auto-regulation, while dropout maps capture localized perfusion deficits that precede structural retinal damage. The maps are fused with the OCTA projection to guide a deep classifier toward physiologically relevant regions. Arterial tortuosity provided the most consistent discriminative value, while capillary dropout maps performed best among density-based variants, especially at larger smoothing scales. Our proposed method offers interpretable insights aligned with known AMD pathophysiology.
摘要：年龄相关性黄斑变性 (AMD) 的特点是早期微血管改变，可以使用光学相干断层扫描血管造影 (OCTA) 非侵入性地捕获这些改变，但大多数深度学习 (DL) 模型依赖于全局特征，无法利用有临床意义的血管生物标志物。我们引入了一个外部乘法注意力框架，该框架结合了血管特定的弯曲图和源自动脉、静脉和毛细血管的脉管系统丢失图。这些生物标志物图是根据血管分割生成的，并在多个空间尺度上进行平滑处理，以突出血管重塑和毛细血管稀疏的连贯模式。弯曲反映了与自动调节受损相关的血管几何形状异常，而辍学图则捕捉了视网膜结构损伤之前的局部灌注不足。这些图与 OCTA 投影融合，引导深度分类器进入生理相关区域。动脉迂曲度提供了最一致的判别值，而毛细血管失活图在基于密度的变体中表现最佳，尤其是在较大的平滑尺度下。我们提出的方法提供了与已知的 AMD 病理生理学一致的可解释的见解。

Title: Rank-Factorized Implicit Neural Bias: Scaling Super-Resolution Transformer with FlashAttention

Authors: Dongheon Lee, Seokju Yun, Jaegyun Im, Youngmin Ro
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.06738
Pdf URL: https://arxiv.org/pdf/2603.06738
Copy Paste: [[2603.06738]] Rank-Factorized Implicit Neural Bias: Scaling Super-Resolution Transformer with FlashAttention(https://arxiv.org/abs/2603.06738)
Keywords: super-resolution
Abstract: Recent Super-Resolution (SR) methods mainly adopt Transformers for their strong long-range modeling capability and exceptional representational capacity. However, most SR Transformers rely heavily on relative positional bias (RPB), which prevents them from leveraging hardware-efficient attention kernels such as FlashAttention. This limitation imposes a prohibitive computational burden during both training and inference, severely restricting attempts to scale SR Transformers by enlarging the training patch size or the self-attention window. Consequently, unlike other domains that actively exploit the inherent scalability of Transformers, SR Transformers remain heavily focused on effectively utilizing limited receptive fields. In this paper, we propose Rank-factorized Implicit Neural Bias (RIB), an alternative to RPB that enables FlashAttention in SR Transformers. Specifically, RIB approximates positional bias using low-rank implicit neural representations and concatenates them with pixel content tokens in a channel-wise manner, turning the element-wise bias addition in attention score computation into a dot-product operation. Further, we introduce a convolutional local attention and a cyclic window strategy to fully leverage the advantages of long-range interactions enabled by RIB and FlashAttention. We enlarge the window size up to 96×96 while jointly scaling the training patch size and the dataset size, maximizing the benefits of Transformers in the SR task. As a result, our network achieves 35.63 dB PSNR on Urban100 ×2, while reducing training and inference time by 2.1× and 2.9×, respectively, compared to the RPB-based SR Transformer (PFT).
摘要：最近的超分辨率（SR）方法主要采用 Transformer，因为它们具有强大的远程建模能力和出色的表示能力。然而，大多数 SR Transformer 严重依赖相对位置偏差（RPB），这使得它们无法利用 FlashAttention 等硬件高效的注意力内核。这种限制在训练和推理过程中带来了巨大的计算负担，严重限制了通过扩大训练补丁大小或自注意力窗口来扩展 SR Transformer 的尝试。因此，与积极利用 Transformers 固有可扩展性的其他领域不同，SR Transformers 仍然高度关注有效利用有限的感受野。在本文中，我们提出了 Rank-factorized Implicit Neural Bias（RIB），它是 RPB 的替代方案，可在 SR Transformer 中实现 FlashAttention。具体来说，RIB 使用低秩隐式神经表示来近似位置偏差，并以通道方式将它们与像素内容标记连接起来，将注意力分数计算中的元素偏差加法转变为点积运算。此外，我们引入了卷积局部注意力和循环窗口策略，以充分利用 RIB 和 FlashAttention 实现的远程交互的优势。我们将窗口大小扩大到 96×96，同时联合缩放训练补丁大小和数据集大小，最大限度地发挥 Transformer 在 SR 任务中的优势。因此，与基于 RPB 的 SR Transformer (PFT) 相比，我们的网络在 Urban100 ×2 上实现了 35.63 dB PSNR，同时将训练和推理时间分别减少了 2.1× 和 2.9×。
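
The step that turns an additive bias into a dot product can be checked with a small Python sketch. It assumes the positional bias factorizes as b_ij = u_i · v_j for low-rank vectors u_i and v_j, so that appending u_i to the query and v_j to the key reproduces q_i · k_j + b_ij with a plain dot product. All names are illustrative, and the softmax scaling and the implicit network that would produce u and v are omitted.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def scores_with_additive_bias(Q, K, U, V):
    """Attention logits q_i . k_j + b_ij, with the low-rank bias b_ij = u_i . v_j."""
    return [[dot(q, k) + dot(u, v) for k, v in zip(K, V)] for q, u in zip(Q, U)]


def scores_with_concatenation(Q, K, U, V):
    """Same logits from one dot product over channel-wise concatenated vectors."""
    Qc = [q + u for q, u in zip(Q, U)]  # list concatenation: [q_i ; u_i]
    Kc = [k + v for k, v in zip(K, V)]  # list concatenation: [k_j ; v_j]
    return [[dot(q, k) for k in Kc] for q in Qc]
```

Because the second form needs no separate bias tensor, it has the shape a fused attention kernel expects. A real implementation would also have to account for the 1/sqrt(d) scaling applied to the concatenated vectors.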

Title: Heterogeneous Decentralized Diffusion Models

Authors: Zhiying Jiang, Raihan Seraj, Marcos Villagra, Bidhan Roy
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2603.06741
Pdf URL: https://arxiv.org/pdf/2603.06741
Copy Paste: [[2603.06741]] Heterogeneous Decentralized Diffusion Models(https://arxiv.org/abs/2603.06741)
Keywords: generative
Abstract: Training frontier-scale diffusion models often requires substantial computational resources concentrated in tightly coupled clusters, limiting participation to well-resourced institutions. While Decentralized Diffusion Models (DDM) enable training multiple experts in isolation, existing approaches require 1176 GPU-days and homogeneous training objectives across all experts. We present an efficient framework that reduces resource requirements while supporting heterogeneous training objectives. Our approach combines three contributions: (1) a heterogeneous decentralized training paradigm that allows experts to use different objectives (DDPM and Flow Matching), unified at inference time via a deterministic schedule-aware conversion into a common velocity space without retraining; (2) pretrained checkpoint conversion from ImageNet-DDPM to Flow Matching objectives, accelerating convergence and enabling initialization without objective-specific pretraining; and (3) PixArt-alpha's efficient AdaLN-Single architecture, reducing parameters while maintaining quality. Experiments on LAION-Aesthetics show that, relative to the training scale reported for prior DDM work, our approach reduces compute from 1176 to 72 GPU-days (16x) and data from 158M to 11M (14x). Under aligned inference settings, our heterogeneous 2DDPM:6FM configuration achieves better FID (11.88 vs. 12.45) and higher intra-prompt diversity (LPIPS 0.631 vs. 0.617) than the homogeneous 8FM baseline. By eliminating synchronization requirements and enabling mixed DDPM/FM objectives, our framework lowers infrastructure requirements for decentralized generative model training.
摘要：训练前沿规模的扩散模型通常需要大量的计算资源集中在紧密耦合的集群中，从而限制了资源充足的机构的参与。虽然去中心化扩散模型 (DDM) 可以单独训练多个专家，但现有方法需要 1176 个 GPU 天以及所有专家的同质训练目标。我们提出了一个有效的框架，可以减少资源需求，同时支持异构培训目标。我们的方法结合了三个贡献：（1）异构分散训练范例，允许专家使用不同的目标（DDPM 和流匹配），通过确定性调度感知转换到公共速度空间而在推理时统一，而无需重新训练； (2) 从 ImageNet-DDPM 到 Flow Matching 目标的预训练检查点转换，加速收敛并无需针对特定目标的预训练即可进行初始化； (3) PixArt-alpha 高效的 AdaLN-Single 架构，在保持质量的同时减少参数。 LAION-Aesthetics 的实验表明，相对于之前 DDM 工作报告的训练规模，我们的方法将计算量从 1176 个 GPU 天减少到 72 个 GPU 天 (16x)，数据量从 158M 减少到 11M (14x)。在对齐的推理设置下，我们的异构 2DDPM:6FM 配置比同构 8FM 基线实现了更好的 FID（11.88 与 12.45）和更高的提示内多样性（LPIPS 0.631 与 0.617）。通过消除同步要求并实现混合 DDPM/FM 目标，我们的框架降低了去中心化生成模型训练的基础设施要求。
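
To make the idea of a common velocity space concrete, here is a minimal Python sketch of one possible conversion. It assumes a DDPM expert predicts the noise in x_t = alpha_t * x_0 + sigma_t * eps, and that the target velocity follows the linear-path convention x_t = (1 - t) * x_0 + t * eps, whose velocity is eps - x_0. The paper's actual conversion, including how timesteps of the two schedules are matched, is not given in the abstract, and `eps_to_velocity` is a hypothetical name.

```python
def eps_to_velocity(x_t, eps_hat, alpha_t, sigma_t):
    """Convert a noise prediction into a velocity estimate (assumed convention).

    x_t, eps_hat: flat lists of floats with the same length.
    alpha_t, sigma_t: DDPM signal and noise scales at the current step.
    """
    # Recover the clean-sample estimate implied by the noise prediction.
    x0_hat = [(x - sigma_t * e) / alpha_t for x, e in zip(x_t, eps_hat)]
    # Linear-path velocity: d/dt [(1 - t) * x_0 + t * eps] = eps - x_0.
    return [e - x0 for e, x0 in zip(eps_hat, x0_hat)]
```

Once every expert's output is expressed this way, predictions from DDPM and Flow Matching experts can be combined in the same sampler step.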

Title: Improved Constrained Generation by Bridging Pretrained Generative Models

Authors: Xiaoxuan Liang, Saeid Naderiparizi, Yunpeng Liu, Berend Zwartsenberg, Frank Wood
Subjects: cs.LG, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2603.06742
Pdf URL: https://arxiv.org/pdf/2603.06742
Copy Paste: [[2603.06742]] Improved Constrained Generation by Bridging Pretrained Generative Models(https://arxiv.org/abs/2603.06742)
Keywords: generation, generative
Abstract: Constrained generative modeling is fundamental to applications such as robotic control and autonomous driving, where models must respect physical laws and safety-critical constraints. In real-world settings, these constraints rarely take the form of simple linear inequalities, but instead complex feasible regions that resemble road maps or other structured spatial domains. We propose a constrained generation framework that generates samples directly within such feasible regions while preserving realism. Our method fine-tunes a pretrained generative model to enforce constraints while maintaining generative fidelity. Experimentally, our method exhibits characteristics distinct from existing fine-tuning and training-free constrained baselines, revealing a new compromise between constraint satisfaction and sampling quality.
摘要：约束生成模型是机器人控制和自动驾驶等应用的基础，其中模型必须遵守物理定律和安全关键约束。在现实世界中，这些约束很少采用简单的线性不等式的形式，而是采用类似于路线图或其他结构化空间域的复杂可行区域。我们提出了一个约束生成框架，可以直接在此类可行区域内生成样本，同时保持真实性。我们的方法对预训练的生成模型进行微调，以在保持生成保真度的同时强制执行约束。在实验上，我们的方法表现出与现有微调和免训练约束基线不同的特征，揭示了约束满足和采样质量之间的新折衷。

Title: Enhancing Instruction Following of LLMs via Activation Steering with Dynamic Rejection

Authors: Minjae Kang, Jaehyung Kim
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.06745
Pdf URL: https://arxiv.org/pdf/2603.06745
Copy Paste: [[2603.06745]] Enhancing Instruction Following of LLMs via Activation Steering with Dynamic Rejection(https://arxiv.org/abs/2603.06745)
Keywords: generation
Abstract: Large Language Models (LLMs), despite advances in instruction tuning, often fail to follow complex user instructions. Activation steering techniques aim to mitigate this by manipulating model internals, but have a potential risk of oversteering, where excessive emphasis on the instruction degrades task accuracy and overall text quality. To address this, we introduce DIRECTER (Dynamic rejection steering), a novel steering method that dynamically modulates steering strength by scaling the KV cache without an extra dataset. DIRECTER couples steering with a plausibility-guided decoding loop, which adaptively adjusts steering strength at each step by comparing the steered output distribution to the original. If the steered output is deemed implausible, steering strength is progressively weakened. This strength modulation is guided by a lightweight, one-time attention sensitivity analysis that ranks layers by their influence on model representations. Extensive evaluations show that DIRECTER significantly enhances instruction-following capabilities across diverse benchmarks, improving accuracy by up to 6.5% over baselines without the common trade-offs in generation quality or task fidelity. The proposed dynamic, plausibility-guided control during activation steering further demonstrates its potential as a general mechanism for mitigating oversteering that is compatible with existing baselines.
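Summary: Despite advances in instruction tuning, large language models (LLMs) often fail to follow complex user instructions. Activation steering techniques aim to mitigate this by manipulating model internals, but they carry a risk of oversteering, where excessive emphasis on the instruction degrades task accuracy and overall text quality. To address this, we introduce DIRECTER (Dynamic rejection steering), a novel steering method that dynamically modulates steering strength by scaling the KV cache, with no extra dataset required. DIRECTER couples steering with a plausibility-guided decoding loop that adaptively adjusts steering strength at each step by comparing the steered output distribution with the original one. If the steered output is deemed implausible, steering strength is progressively weakened. This modulation is guided by a lightweight, one-time attention sensitivity analysis that ranks layers by their influence on model representations. Extensive evaluations show that DIRECTER significantly improves instruction following across diverse benchmarks, raising accuracy by up to 6.5% over baselines without the usual trade-offs in generation quality or task fidelity. The proposed dynamic, plausibility-guided control further shows potential as a general mechanism for mitigating oversteering that is compatible with existing baselines.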
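The plausibility-guided loop described in the DIRECTER abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the abstract does not give the plausibility test, so the `beta` threshold rule below is a stand-in, and steering is modeled as an additive shift on logits rather than the KV-cache scaling the paper uses. The function names are hypothetical.

```python
import numpy as np

def softmax(logits):
    z = logits - np.max(logits)
    p = np.exp(z)
    return p / p.sum()

def plausible(p_orig, p_steer, beta=0.1):
    # Assumed test: the steered top token must keep at least
    # beta * (max original probability) under the unsteered model.
    top = int(np.argmax(p_steer))
    return p_orig[top] >= beta * p_orig.max()

def choose_strength(logits_orig, steered_logits_fn,
                    strengths=(1.0, 0.5, 0.25), beta=0.1):
    # Try the strongest steering first and weaken it until the
    # steered prediction is judged plausible; fall back to no steering.
    p_orig = softmax(logits_orig)
    for alpha in strengths:
        p_steer = softmax(steered_logits_fn(alpha))
        if plausible(p_orig, p_steer, beta):
            return alpha, int(np.argmax(p_steer))
    return 0.0, int(np.argmax(p_orig))

# Toy vocabulary of three tokens; token 2 is very unlikely without steering.
logits_orig = np.array([4.0, 3.5, -2.0])
steer_dir = np.array([0.0, 2.0, 8.0])  # hypothetical steering direction
alpha, token = choose_strength(logits_orig,
                               lambda a: logits_orig + a * steer_dir)
print(alpha, token)
```

In this toy case full-strength steering selects a token the unsteered model considers implausible, so the loop rejects it and settles on a weaker strength that still changes the chosen token.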

Title: Implementation of Quantum Implicit Neural Representation in Deterministic and Probabilistic Autoencoders for Image Reconstruction/Generation Tasks

Authors: Saadet Müzehher Eren
Subjects: cs.LG, quant-ph
Abstract URL: https://arxiv.org/abs/2603.06755
Pdf URL: https://arxiv.org/pdf/2603.06755
Copy Paste: [[2603.06755]] Implementation of Quantum Implicit Neural Representation in Deterministic and Probabilistic Autoencoders for Image Reconstruction/Generation Tasks(https://arxiv.org/abs/2603.06755)
Keywords: generation, generative
Abstract: We propose a quantum implicit neural representation (QINR)-based autoencoder (AE) and variational autoencoder (VAE) for image reconstruction and generation tasks. Our purpose is to demonstrate that the QINR in VAEs and AEs can transform information from the latent space into highly rich, periodic, and high-frequency features. Additionally, we aim to show that the QINR-VAE can be more stable than various quantum generative adversarial network (QGAN) models in image generation because it can address the low diversity problem. Our quantum-classical hybrid models consist of a classical convolutional neural network (CNN) encoder and a quantum-based QINR decoder. We train the QINR-AE/VAE with binary cross-entropy with logits (BCEWithLogits) as the reconstruction loss. For the QINR-VAE, we additionally employ Kullback-Leibler divergence for latent regularization with beta/capacity scheduling to prevent posterior collapse. We introduce learnable angle-scaling in data reuploading to address optimization challenges. We test our models on the MNIST, E-MNIST, and Fashion-MNIST datasets to reconstruct and generate images. Our results demonstrate that the QINR structure in VAE can produce a wider variety of images with a small amount of data than various generative models that have been studied. We observe that the generated and reconstructed images from the QINR-VAE/AE are clear with sharp boundaries and details. Overall, we find that the addition of QINR-based quantum layers into the AE/VAE frameworks enhances the performance of reconstruction and generation with a constrained set of parameters.
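Summary: We propose an autoencoder (AE) and a variational autoencoder (VAE) based on quantum implicit neural representation (QINR) for image reconstruction and generation tasks. Our purpose is to demonstrate that the QINR in VAEs and AEs can transform latent-space information into highly rich, periodic, high-frequency features. We also aim to show that the QINR-VAE can be more stable than various quantum generative adversarial network (QGAN) models in image generation because it can address the low-diversity problem. Our hybrid quantum-classical models consist of a classical convolutional neural network (CNN) encoder and a quantum QINR decoder. We train the QINR-AE/VAE with binary cross-entropy with logits (BCEWithLogits) as the reconstruction loss. For the QINR-VAE, we additionally use the Kullback-Leibler divergence for latent regularization, with beta/capacity scheduling to prevent posterior collapse. We introduce learnable angle scaling in data reuploading to address optimization challenges. We test our models on the MNIST, E-MNIST, and Fashion-MNIST datasets for image reconstruction and generation. Our results show that the QINR structure in a VAE can produce a wider variety of images from a small amount of data than the various generative models that have been studied. The images generated and reconstructed by the QINR-VAE/AE are clear, with sharp boundaries and details. Overall, we find that adding QINR-based quantum layers to the AE/VAE frameworks improves reconstruction and generation performance with a constrained set of parameters.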
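The training objective named in the QINR-VAE abstract (BCEWithLogits reconstruction plus a KL term with beta/capacity scheduling) can be sketched in NumPy as follows. The abstract does not state the exact form of the schedule, so the capacity-annealed penalty `beta * |KL - C(step)|` and the values of `beta`, `c_max`, and `warmup_steps` are assumptions chosen for illustration.

```python
import numpy as np

def bce_with_logits(logits, targets):
    # Numerically stable binary cross-entropy on raw logits:
    # max(x, 0) - x * t + log(1 + exp(-|x|)), summed per sample.
    loss = (np.maximum(logits, 0) - logits * targets
            + np.log1p(np.exp(-np.abs(logits))))
    return loss.reshape(len(logits), -1).sum(axis=1).mean()

def gaussian_kl(mu, logvar):
    # KL( N(mu, diag(exp(logvar))) || N(0, I) ), averaged over the batch.
    kl = 0.5 * (np.exp(logvar) + mu ** 2 - 1.0 - logvar)
    return kl.sum(axis=1).mean()

def capacity(step, c_max=5.0, warmup_steps=1000):
    # Assumed schedule: target KL capacity grows linearly, then saturates.
    return c_max * min(1.0, step / warmup_steps)

def vae_loss(logits, targets, mu, logvar, step, beta=4.0):
    recon = bce_with_logits(logits, targets)
    kl = gaussian_kl(mu, logvar)
    # Penalize deviation of the KL from the scheduled capacity.
    return recon + beta * abs(kl - capacity(step))

# Toy batch: 2 samples, 4 "pixels", 3 latent dimensions.
logits = np.zeros((2, 4))
targets = np.array([[0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]])
mu, logvar = np.zeros((2, 3)), np.zeros((2, 3))
print(vae_loss(logits, targets, mu, logvar, step=0))
```

Letting the KL target grow from zero gives the decoder time to start using the latent code, which is one common way to counter posterior collapse.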

Title: Failure Detection in Chemical Processes using Symbolic Machine Learning: A Case Study on Ethylene Oxidation

Authors: Julien Amblard, Niklas Groll, Matthew Tait, Mark Law, Gürkan Sin, Alessandra Russo
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.06767
Pdf URL: https://arxiv.org/pdf/2603.06767
Copy Paste: [[2603.06767]] Failure Detection in Chemical Processes using Symbolic Machine Learning: A Case Study on Ethylene Oxidation(https://arxiv.org/abs/2603.06767)
Keywords: generation
Abstract: Over the past decade, Artificial Intelligence has significantly advanced, mostly driven by large-scale neural approaches. However, in the chemical process industry, where safety is critical, these methods are often unsuitable due to their brittleness, and lack of explainability and interpretability. Furthermore, open-source real-world datasets containing historical failures are scarce in this domain. In this paper, we investigate an approach for predicting failures in chemical processes using symbolic machine learning and conduct a feasibility study in the context of an ethylene oxidation process. Our method builds on a state-of-the-art symbolic machine learning system capable of learning predictive models in the form of probabilistic rules from context-dependent noisy examples. This system is a general-purpose symbolic learner, which makes our approach independent of any specific chemical process. To address the lack of real-world failure data, we conduct our feasibility study leveraging data generated from a chemical process simulator. Experimental results show that symbolic machine learning can outperform baseline methods such as random forest and multilayer perceptron, while preserving interpretability through the generation of compact, rule-based predictive models. Finally, we explain how such learned rule-based models could be integrated into agents to assist chemical plant operators in decision-making during potential failures.
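Summary: Over the past decade, artificial intelligence has advanced significantly, driven mostly by large-scale neural approaches. In the chemical process industry, however, where safety is critical, these methods are often unsuitable because of their brittleness and their lack of explainability and interpretability. In addition, open-source real-world datasets containing historical failures are scarce in this domain. In this paper, we investigate an approach to predicting failures in chemical processes using symbolic machine learning and conduct a feasibility study in the context of an ethylene oxidation process. Our method builds on a state-of-the-art symbolic machine learning system that can learn predictive models in the form of probabilistic rules from context-dependent noisy examples. The system is a general-purpose symbolic learner, which makes our approach independent of any specific chemical process. To address the lack of real-world failure data, we run the feasibility study on data generated by a chemical process simulator. Experimental results show that symbolic machine learning can outperform baseline methods such as random forest and multilayer perceptron while preserving interpretability through compact, rule-based predictive models. Finally, we explain how such learned rule-based models could be integrated into agents that assist chemical plant operators in decision-making during potential failures.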

Title: xaitimesynth: A Python Package for Evaluating Attribution Methods for Time Series with Synthetic Ground Truth

Authors: Gregor Baer
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.06781
Pdf URL: https://arxiv.org/pdf/2603.06781
Copy Paste: [[2603.06781]] xaitimesynth: A Python Package for Evaluating Attribution Methods for Time Series with Synthetic Ground Truth(https://arxiv.org/abs/2603.06781)
Keywords: generation
Abstract: Evaluating time series attribution methods is difficult because real-world datasets rarely provide ground truth for which time points drive a prediction. A common workaround is to generate synthetic data where class-discriminating features are placed at known locations, but each study currently reimplements this from scratch. We introduce xaitimesynth, a Python package that provides reusable infrastructure for this evaluation approach. The package generates synthetic time series following an additive model where each sample is a sum of background signal and a localized, class-discriminating feature, with the feature window automatically tracked as a ground truth mask. A fluent data generation API and YAML configuration format allow flexible and reproducible dataset definitions for both univariate and multivariate time series. The package also provides standard localization metrics, including AUC-PR, AUC-ROC, Relevance Mass Accuracy, and Relevance Rank Accuracy. xaitimesynth is open source and available at this https URL.
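Summary: Evaluating time series attribution methods is difficult because real-world datasets rarely provide ground truth for which time points drive a prediction. A common workaround is to generate synthetic data in which class-discriminating features are placed at known locations, but each study currently reimplements this from scratch. We introduce xaitimesynth, a Python package that provides reusable infrastructure for this evaluation approach. The package generates synthetic time series following an additive model in which each sample is the sum of a background signal and a localized, class-discriminating feature, with the feature window tracked automatically as a ground truth mask. A fluent data generation API and a YAML configuration format allow flexible, reproducible dataset definitions for both univariate and multivariate time series. The package also provides standard localization metrics, including AUC-PR, AUC-ROC, Relevance Mass Accuracy, and Relevance Rank Accuracy. xaitimesynth is open source and available at this https URL.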
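The additive model and two of the localization metrics named in the xaitimesynth abstract can be illustrated with plain NumPy. This sketch does not use or reproduce the xaitimesynth API; the function names, the Gaussian background, the constant-level feature, and the use of absolute attribution values are all assumptions made for illustration.

```python
import numpy as np

def make_sample(length=100, start=40, width=20, amplitude=2.0,
                label=1, seed=0):
    # Additive model: sample = background + localized feature.
    rng = np.random.default_rng(seed)
    background = rng.normal(0.0, 1.0, size=length)
    feature = np.zeros(length)
    mask = np.zeros(length, dtype=bool)  # ground-truth feature window
    if label == 1:
        feature[start:start + width] = amplitude
        mask[start:start + width] = True
    return background + feature, mask

def relevance_mass_accuracy(attribution, mask):
    # Share of total attribution mass that falls inside the mask.
    rel = np.abs(attribution)
    total = rel.sum()
    return float(rel[mask].sum() / total) if total > 0 else 0.0

def relevance_rank_accuracy(attribution, mask):
    # Fraction of the K highest-attributed points inside the mask,
    # where K is the number of ground-truth points.
    k = int(mask.sum())
    if k == 0:
        return 0.0
    top_k = np.argsort(-np.abs(attribution))[:k]
    return float(mask[top_k].mean())

x, mask = make_sample()
perfect = mask.astype(float)  # attribution exactly on the feature window
print(relevance_mass_accuracy(perfect, mask),
      relevance_rank_accuracy(perfect, mask))
```

Because the mask is produced together with the data, any attribution method can be scored against it without manual labeling.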

Title: Physics-Informed Diffusion Model for Generating Synthetic Extreme Rare Weather Events Data

Authors: Marawan Yakout, Tannistha Maiti, Monira Majhabeen, Tarry Singh
Subjects: cs.LG, cs.AI, physics.ao-ph, physics.geo-ph
Abstract URL: https://arxiv.org/abs/2603.06782
Pdf URL: https://arxiv.org/pdf/2603.06782
Copy Paste: [[2603.06782]] Physics-Informed Diffusion Model for Generating Synthetic Extreme Rare Weather Events Data(https://arxiv.org/abs/2603.06782)
Keywords: generative
Abstract: Data scarcity is a primary obstacle in developing robust Machine Learning (ML) models for detecting rapidly intensifying tropical cyclones. Traditional data augmentation techniques (rotation, flipping, brightness adjustment) fail to preserve the physical consistency and high-intensity gradients characteristic of rare Category 4-equivalent events, which constitute only 0.14% of our dataset (202 of 140,514 samples). We propose a physics-informed diffusion model based on the Context-UNet architecture to generate synthetic, multi-spectral satellite imagery of extreme weather events. Our model is conditioned on critical atmospheric parameters such as average wind speed, ocean type, and stage of development (early, mature, late, etc.), the known drivers of rapid intensification. Using a controlled pre-generated noise sampling strategy and mixed-precision training, we generated $16\times16$ wind-field samples, cropped from multi-spectral satellite imagery, that preserve realistic spatial autocorrelation and physical consistency. Results demonstrate that our model successfully learns discriminative features across ten distinct context classes, effectively mitigating the data bottleneck. Specifically, we address the extreme class imbalance in our dataset, where Class 4 (Ocean 2, early stage, hurricane with an average wind speed of 50 kn) contains only 202 samples compared to 79,768 samples in Class 0. This generative framework provides a scalable solution for augmenting training datasets for operational weather detection algorithms. The results yield an average Log-Spectral Distance (LSD) of 4.5 dB, demonstrating a scalable framework for enhancing operational weather detection algorithms.
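Summary: Data scarcity is a primary obstacle to developing robust machine learning (ML) models for detecting rapidly intensifying tropical cyclones. Traditional data augmentation techniques (rotation, flipping, brightness adjustment) fail to preserve the physical consistency and high-intensity gradients characteristic of rare Category 4-equivalent events, which make up only 0.14% of our dataset (202 of 140,514 samples). We propose a physics-informed diffusion model based on the Context-UNet architecture to generate synthetic multi-spectral satellite imagery of extreme weather events. The model is conditioned on critical atmospheric parameters such as average wind speed, ocean type, and stage of development (early, mature, late, etc.), the known drivers of rapid intensification. Using a controlled pre-generated noise sampling strategy and mixed-precision training, we generated $16\times16$ wind-field samples, cropped from multi-spectral satellite imagery, that preserve realistic spatial autocorrelation and physical consistency. Results show that the model learns discriminative features across ten distinct context classes, effectively mitigating the data bottleneck. Specifically, we address the extreme class imbalance in our dataset, where Class 4 (Ocean 2, early stage, hurricane with an average wind speed of 50 kn) contains only 202 samples compared with 79,768 samples in Class 0. This generative framework provides a scalable solution for augmenting training datasets for operational weather detection algorithms. The results yield an average Log-Spectral Distance (LSD) of 4.5 dB, demonstrating a scalable framework for enhancing operational weather detection algorithms.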
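The two numbers quoted in the abstract above, the 0.14% class share and a Log-Spectral Distance in dB, can be made concrete with a short sketch. The abstract does not define its LSD variant, so the version below (root-mean-square difference of 2D log power spectra, in dB) is an assumption, as are the function name and the `eps` stabilizer.

```python
import numpy as np

def log_spectral_distance(reference, generated, eps=1e-12):
    # Power spectra of the two fields via the 2D FFT.
    p_ref = np.abs(np.fft.fft2(reference)) ** 2
    p_gen = np.abs(np.fft.fft2(generated)) ** 2
    # Per-frequency difference in dB, then root-mean-square over bins.
    diff_db = 10.0 * np.log10((p_ref + eps) / (p_gen + eps))
    return float(np.sqrt(np.mean(diff_db ** 2)))

# Share of the rare class quoted in the abstract: 202 of 140,514 samples.
rare_share_percent = 100.0 * 202 / 140_514
print(round(rare_share_percent, 2))

# Toy 16x16 field standing in for a wind-field sample.
rng = np.random.default_rng(0)
field = rng.normal(size=(16, 16))
print(log_spectral_distance(field, field))
print(log_spectral_distance(field, 10.0 * field))
```

Scaling a field by 10 multiplies its power spectrum by 100 at every frequency, so under this definition the distance is 20 dB; an identical field gives 0 dB.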

Title: NEST: Network- and Memory-Aware Device Placement For Distributed Deep Learning

Authors: Irene Wang, Vishnu Varma Venkata, Arvind Krishnamurthy, Divya Mahajan
Subjects: cs.LG, cs.DC, stat.ML
Abstract URL: https://arxiv.org/abs/2603.06798
Pdf URL: https://arxiv.org/pdf/2603.06798
Copy Paste: [[2603.06798]] NEST: Network- and Memory-Aware Device Placement For Distributed Deep Learning(https://arxiv.org/abs/2603.06798)
Keywords: generation
Abstract: The growing scale of deep learning demands distributed training frameworks that jointly reason about parallelism, memory, and network topology. Prior works often rely on heuristic or topology-agnostic search, handling communication and memory separately. Without per-device memory awareness, these methods typically ensure feasibility post hoc by sharding parameters and activations across many devices, increasing synchronization, inflating communication, and underutilizing compute, limiting scalability and efficiency on real datacenter networks. We present NEST, a network-, compute-, and memory-aware device placement framework that unifies model parallelism, topology modeling, and memory feasibility via structured dynamic programming. NEST's DP operates on operator graphs with tensor and expert parallel configurations, explicit allreduce latencies across hierarchical or arbitrary networks, and memory/compute profiles. By factoring parallelism across tensor, pipeline, data, and expert dimensions, NEST defines a principled search space for hybrid strategies while jointly optimizing co-location, network latency, and memory feasibility. Evaluations across diverse hardware and networks show NEST achieves up to 2.43 times higher throughput, better memory efficiency, and improved scalability over state-of-the-art baselines, providing a foundation for co-designing parallelization strategies and datacenter interconnects for next-generation AI infrastructure. The source code of NEST is available at: this https URL
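Summary: The growing scale of deep learning calls for distributed training frameworks that reason jointly about parallelism, memory, and network topology. Prior work often relies on heuristic or topology-agnostic search and handles communication and memory separately. Without per-device memory awareness, these methods typically ensure feasibility after the fact by sharding parameters and activations across many devices, which increases synchronization, inflates communication, and underutilizes compute, limiting scalability and efficiency on real datacenter networks. We present NEST, a network-, compute-, and memory-aware device placement framework that unifies model parallelism, topology modeling, and memory feasibility through structured dynamic programming. NEST's DP operates on operator graphs with tensor and expert parallel configurations, explicit allreduce latencies across hierarchical or arbitrary networks, and memory/compute profiles. By factoring parallelism across tensor, pipeline, data, and expert dimensions, NEST defines a principled search space for hybrid strategies while jointly optimizing co-location, network latency, and memory feasibility. Evaluations across diverse hardware and networks show that NEST achieves up to 2.43 times higher throughput, better memory efficiency, and improved scalability over state-of-the-art baselines, providing a foundation for co-designing parallelization strategies and datacenter interconnects for next-generation AI infrastructure. The source code of NEST is available at: this https URL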
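To give a feel for what "memory feasibility via dynamic programming" can mean, here is a deliberately small toy. It is not NEST's algorithm: it handles only a linear chain of layers, contiguous pipeline stages, and a per-device memory cap, and it ignores the network topology, allreduce latencies, and tensor/expert parallelism that the abstract describes. All names and numbers are hypothetical.

```python
import math
from functools import lru_cache

def place_chain(times, mems, num_devices, mem_cap):
    """Split a chain of layers into contiguous stages, one per device,
    minimizing the slowest stage subject to a per-device memory cap.
    Returns (bottleneck_time, stage_end_indices)."""
    n = len(times)
    pt, pm = [0.0], [0.0]  # prefix sums of compute time and memory
    for t, m in zip(times, mems):
        pt.append(pt[-1] + t)
        pm.append(pm[-1] + m)

    @lru_cache(maxsize=None)
    def best(i, d):
        # Best placement of the first i layers on exactly d devices.
        if i == 0:
            return (0.0, ()) if d == 0 else (math.inf, ())
        if d == 0:
            return (math.inf, ())
        result = (math.inf, ())
        for s in range(i):  # last stage holds layers s .. i-1
            if pm[i] - pm[s] > mem_cap:
                continue  # stage would not fit in device memory
            prev_cost, prev_cuts = best(s, d - 1)
            cost = max(prev_cost, pt[i] - pt[s])
            if cost < result[0]:
                result = (cost, prev_cuts + (i,))
        return result

    return min((best(n, d) for d in range(1, num_devices + 1)),
               key=lambda r: r[0])

times = [4, 2, 3, 1]  # per-layer compute time (arbitrary units)
mems = [3, 2, 2, 1]   # per-layer memory footprint (arbitrary units)
print(place_chain(times, mems, num_devices=2, mem_cap=5))
print(place_chain(times, mems, num_devices=3, mem_cap=4))
```

Checking the memory cap inside the recurrence means infeasible placements are never explored, instead of being repaired afterwards by extra sharding.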

Title: Stochastic Attention via Langevin Dynamics on the Modern Hopfield Energy

Authors: Abdulrahman Alswaidan, Jeffrey D. Varner
Subjects: cs.LG, q-fin.CP
Abstract URL: https://arxiv.org/abs/2603.06875
Pdf URL: https://arxiv.org/pdf/2603.06875
Copy Paste: [[2603.06875]] Stochastic Attention via Langevin Dynamics on the Modern Hopfield Energy(https://arxiv.org/abs/2603.06875)
Keywords: generation
Abstract: Attention heads retrieve: given a query, they return a softmax-weighted average of stored values. We show that this computation is one step of gradient descent on a classical energy function, and that Langevin sampling from the corresponding distribution yields stochastic attention: a training-free sampler controlled by a single temperature. Lowering the temperature gives exact retrieval; raising it gives open-ended generation. Because the energy gradient equals the attention map, no score network, training loop, or learned model is required. We validate on four domains (64 to 4,096 dimensions). At generation temperature, stochastic attention is 2.6 times more novel and 2.0 times more diverse than the best learned baseline (a variational autoencoder trained on the same patterns), while matching a Metropolis-corrected gold standard. A simple signal-to-noise rule selects the operating temperature for any dimension. The approach requires no architectural changes and extends naturally to retrieval-augmented generation and in-context learning.
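Summary: Attention heads retrieve: given a query, they return a softmax-weighted average of stored values. We show that this computation is one step of gradient descent on a classical energy function, and that Langevin sampling from the corresponding distribution yields stochastic attention, a training-free sampler controlled by a single temperature. Lowering the temperature gives exact retrieval; raising it gives open-ended generation. Because the energy gradient equals the attention map, no score network, training loop, or learned model is needed. We validate on four domains (64 to 4,096 dimensions). At generation temperature, stochastic attention is 2.6 times more novel and 2.0 times more diverse than the best learned baseline (a variational autoencoder trained on the same patterns), while matching a Metropolis-corrected gold standard. A simple signal-to-noise rule selects the operating temperature for any dimension. The approach requires no architectural changes and extends naturally to retrieval-augmented generation and in-context learning.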
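The link between attention and Langevin sampling stated in the abstract above can be sketched numerically. The sketch assumes the commonly used modern Hopfield energy E(x) = -(1/beta) * logsumexp(beta * X^T x) + 0.5 * ||x||^2 and a plain unadjusted Langevin update; the paper's exact energy, step-size rule, and temperature convention may differ, and the function names are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attention_retrieve(X, x, beta):
    # X holds stored patterns as columns; returns their softmax-weighted mean.
    return X @ softmax(beta * (X.T @ x))

def energy(X, x, beta):
    z = beta * (X.T @ x)
    lse = z.max() + np.log(np.sum(np.exp(z - z.max())))
    return -lse / beta + 0.5 * (x @ x)

def energy_grad(X, x, beta):
    # Gradient of the energy: the state minus the attention output.
    return x - attention_retrieve(X, x, beta)

def langevin_step(X, x, beta, step, temperature, rng):
    # Gradient descent on the energy plus temperature-scaled Gaussian noise.
    noise = rng.normal(size=x.shape)
    return (x - step * energy_grad(X, x, beta)
            + np.sqrt(2.0 * step * temperature) * noise)

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 5))  # 5 stored patterns in 8 dimensions
x0 = rng.normal(size=8)      # query / initial state
one_step = langevin_step(X, x0, beta=2.0, step=1.0, temperature=0.0, rng=rng)
print(np.allclose(one_step, attention_retrieve(X, x0, 2.0)))
```

With unit step size and zero temperature the update reduces to a single attention read-out; a positive temperature adds noise, so repeated steps explore around the stored patterns rather than collapsing onto one.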

Title: Learning From Design Procedure To Generate CAD Programs for Data Augmentation

Authors: Yan-Ying Chen, Dule Shu, Matthew Hong, Andrew Taber, Jonathan Li, Matthew Klenk
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2603.06894
Pdf URL: https://arxiv.org/pdf/2603.06894
Copy Paste: [[2603.06894]] Learning From Design Procedure To Generate CAD Programs for Data Augmentation(https://arxiv.org/abs/2603.06894)
Keywords: generation
Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities in a wide range of code generation tasks. However, generating code for certain domains remains challenging. One such domain is Computer-Aided Design (CAD) program generation, where the goal is to produce scripted parametric models that define object geometry for precise design and manufacturing applications. A key challenge in LLM-based CAD program generation is the limited geometric complexity of generated shapes compared to those found in real-world industrial designs. This shortfall is in part due to the lack of diversity in the available CAD program training data. To address this, we propose a novel data augmentation paradigm that prompts an LLM to generate CAD programs conditioned on a reference surface program and a modeling procedure, an idea inspired by practices in industrial design. By varying the reference surface using a collection of organic shapes, our method enriches the geometric distribution of generated CAD models. In particular, it introduces edges and faces defined by spline-based curvature, which are typically missing or underrepresented in existing open-source CAD program datasets. Experiments show that our method produces CAD samples with significantly greater geometric diversity and a higher resemblance to industry-grade CAD designs in terms of the proportion of organic shape primitives. This enhancement makes our CAD data augmentation approach a useful tool for training LLMs and other deep learning models in CAD generation.
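Summary: Large language models (LLMs) have demonstrated impressive capabilities across a wide range of code generation tasks. However, generating code for certain domains remains challenging. One such domain is Computer-Aided Design (CAD) program generation, where the goal is to produce scripted parametric models that define object geometry for precise design and manufacturing applications. A key challenge in LLM-based CAD program generation is that the geometric complexity of generated shapes is limited compared with shapes found in real-world industrial designs. This shortfall is partly due to the lack of diversity in the available CAD program training data. To address this, we propose a novel data augmentation paradigm that prompts an LLM to generate CAD programs conditioned on a reference surface program and a modeling procedure, an idea inspired by industrial design practice. By varying the reference surface using a collection of organic shapes, our method enriches the geometric distribution of generated CAD models. In particular, it introduces edges and faces defined by spline-based curvature, which are typically missing or underrepresented in existing open-source CAD program datasets. Experiments show that our method produces CAD samples with significantly greater geometric diversity and a closer resemblance to industry-grade CAD designs in terms of the proportion of organic shape primitives. This makes our CAD data augmentation approach a useful tool for training LLMs and other deep learning models for CAD generation.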

Title: XGenBoost: Synthesizing Small and Large Tabular Datasets with XGBoost

Authors: Jim Achterberg, Marcel Haas, Bram van Dijk, Marco Spruit
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.06904
Pdf URL: https://arxiv.org/pdf/2603.06904
Copy Paste: [[2603.06904]] XGenBoost: Synthesizing Small and Large Tabular Datasets with XGBoost(https://arxiv.org/abs/2603.06904)
Keywords: generative
Abstract: Tree ensembles such as XGBoost are often preferred for discriminative tasks in mixed-type tabular data, due to their inductive biases, minimal hyperparameter tuning, and training efficiency. We argue that these qualities, when leveraged correctly, can make for better generative models as well. As such, we present XGenBoost, a pair of generative models based on XGBoost: i) a Denoising Diffusion Implicit Model (DDIM) with XGBoost as score-estimator suited for smaller datasets, and ii) a hierarchical autoregressive model whose conditionals are learned via XGBoost classifiers, suited for large-scale tabular synthesis. The architectures follow from the natural constraints imposed by tree-based learners, e.g., in the diffusion model, combining Gaussian and multinomial diffusion to leverage native categorical splits and avoid one-hot encoding while accurately modelling mixed data types. In the autoregressive model, we use a fixed-order factorization, a hierarchical classifier to impose ordinal inductive biases when modelling numerical features, and de-quantization based on empirical quantile functions to model the non-continuous nature of most real-world tabular datasets. Through two benchmarks, one containing smaller and the other larger datasets, we show that our proposed architectures outperform previous neural- and tree-based generative models for mixed-type tabular synthesis at lower training cost.
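Summary: Tree ensembles such as XGBoost are often preferred for discriminative tasks on mixed-type tabular data because of their inductive biases, minimal hyperparameter tuning, and training efficiency. We argue that these qualities, when leveraged correctly, can make for better generative models as well. We therefore present XGenBoost, a pair of generative models based on XGBoost: i) a Denoising Diffusion Implicit Model (DDIM) with XGBoost as score estimator, suited to smaller datasets, and ii) a hierarchical autoregressive model whose conditionals are learned by XGBoost classifiers, suited to large-scale tabular synthesis. The architectures follow from the natural constraints imposed by tree-based learners. For example, the diffusion model combines Gaussian and multinomial diffusion to leverage native categorical splits and avoid one-hot encoding while accurately modeling mixed data types. In the autoregressive model, we use a fixed-order factorization, a hierarchical classifier that imposes ordinal inductive biases when modeling numerical features, and de-quantization based on empirical quantile functions to model the non-continuous nature of most real-world tabular datasets. On two benchmarks, one containing smaller and the other larger datasets, we show that the proposed architectures outperform previous neural and tree-based generative models for mixed-type tabular synthesis at lower training cost.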
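One ingredient mentioned in the XGenBoost abstract, de-quantization based on empirical quantile functions, can be illustrated without XGBoost itself. The sketch below assumes a numerical column has been discretized into equal-probability quantile bins and that a classifier (not shown) has already chosen a bin; the value is then drawn through the inverse empirical CDF. The binning scheme and function names are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def fit_bins(values, num_bins):
    # Bin b covers quantile levels [edges[b], edges[b + 1]).
    sorted_vals = np.sort(np.asarray(values, dtype=float))
    edges = np.linspace(0.0, 1.0, num_bins + 1)
    return sorted_vals, edges

def empirical_quantile(sorted_vals, u):
    # Inverse empirical CDF: smallest observed value whose CDF is >= u.
    n = len(sorted_vals)
    idx = np.clip(np.ceil(u * n).astype(int) - 1, 0, n - 1)
    return sorted_vals[idx]

def dequantize(bin_ids, sorted_vals, edges, rng):
    # Draw a quantile level uniformly inside each chosen bin, then map
    # it back to an observed value of the training column.
    lo, hi = edges[bin_ids], edges[bin_ids + 1]
    u = rng.uniform(lo, hi)
    return empirical_quantile(sorted_vals, u)

# Toy "age" column with repeated, non-continuous values.
values = [18, 18, 21, 21, 21, 30, 45, 45, 60, 60]
sorted_vals, edges = fit_bins(values, num_bins=2)
rng = np.random.default_rng(0)
low = dequantize(np.zeros(1000, dtype=int), sorted_vals, edges, rng)
high = dequantize(np.ones(1000, dtype=int), sorted_vals, edges, rng)
print(sorted(set(low)), sorted(set(high)))
```

Because sampled values pass through the empirical quantile function, the synthetic column only contains values that occur in the data, which preserves spikes and rounding patterns that a continuous density would smooth away.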

Title: HIERAMP: Coarse-to-Fine Autoregressive Amplification for Generative Dataset Distillation

Authors: Lin Zhao, Xinru Jiang, Xi Xiao, Qihui Fan, Lei Lu, Yanzhi Wang, Xue Lin, Octavia Camps, Pu Zhao, Jianyang Gu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.06932
Pdf URL: https://arxiv.org/pdf/2603.06932
Copy Paste: [[2603.06932]] HIERAMP: Coarse-to-Fine Autoregressive Amplification for Generative Dataset Distillation(https://arxiv.org/abs/2603.06932)
Keywords: generation, generative
Abstract: Dataset distillation often prioritizes global semantic proximity when creating small surrogate datasets for original large-scale ones. However, object semantics are inherently hierarchical. For example, the position and appearance of a bird's eyes are constrained by the outline of its head. Global proximity alone fails to capture how object-relevant structures at different levels support recognition. In this work, we investigate the contributions of hierarchical semantics to effective distilled data. We leverage the vision autoregressive (VAR) model whose coarse-to-fine generation mirrors this hierarchy and propose HIERAMP to amplify semantics at different levels. At each VAR scale, we inject class tokens that dynamically identify salient regions and use their induced maps to guide amplification at that scale. This adds only marginal inference cost while steering synthesis toward discriminative parts and structures. Empirically, we find that semantic amplification leads to more diverse token choices in constructing coarse-scale object layouts. Conversely, at fine scales, the amplification concentrates token usage, increasing focus on object-related details. Across popular dataset distillation benchmarks, HIERAMP consistently improves validation performance without explicitly optimizing global proximity, demonstrating the importance of semantic amplification for effective dataset distillation.
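Summary: Dataset distillation often prioritizes global semantic proximity when creating small surrogate datasets for original large-scale ones. However, object semantics are inherently hierarchical. For example, the position and appearance of a bird's eyes are constrained by the outline of its head. Global proximity alone fails to capture how object-relevant structures at different levels support recognition. In this work, we investigate the contribution of hierarchical semantics to effective distilled data. We leverage the vision autoregressive (VAR) model, whose coarse-to-fine generation mirrors this hierarchy, and propose HIERAMP to amplify semantics at different levels. At each VAR scale, we inject class tokens that dynamically identify salient regions and use their induced maps to guide amplification at that scale. This adds only marginal inference cost while steering synthesis toward discriminative parts and structures. Empirically, we find that semantic amplification leads to more diverse token choices when constructing coarse-scale object layouts. Conversely, at fine scales, amplification concentrates token usage, increasing the focus on object-related details. Across popular dataset distillation benchmarks, HIERAMP consistently improves validation performance without explicitly optimizing global proximity, demonstrating the importance of semantic amplification for effective dataset distillation.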

Title: Joint MDPs and Reinforcement Learning in Coupled-Dynamics Environments

Authors: Ege C. Kaya, Mahsa Ghasemi, Abolfazl Hashemi
Subjects: cs.LG, math.OC
Abstract URL: https://arxiv.org/abs/2603.06946
Pdf URL: https://arxiv.org/pdf/2603.06946
Copy Paste: [[2603.06946]] Joint MDPs and Reinforcement Learning in Coupled-Dynamics Environments(https://arxiv.org/abs/2603.06946)
Keywords: generative
Abstract: Many distributional quantities in reinforcement learning are intrinsically joint across actions, including distributions of gaps and probabilities of superiority. However, the classical Markov decision process (MDP) formalism specifies only marginal laws and leaves the joint law of counterfactual one-step outcomes across multiple possible actions at a state unspecified. We study coupled-dynamics environments with a multi-action generative interface which can sample counterfactual one-step outcomes for multiple actions under shared exogenous randomness. We propose joint MDPs (JMDPs) as a formalism for such environments by augmenting an MDP with a multi-action sample transition model which specifies a coupling of one-step counterfactual outcomes, while preserving standard MDP interaction as marginal observations. We adopt and formalize a one-step coupling regime where dependence across actions is confined to immediate counterfactual outcomes at the queried state. In this regime, we derive Bellman operators for $n$th-order return moments, providing dynamic programming and incremental algorithms with convergence guarantees.
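Summary: Many distributional quantities in reinforcement learning are intrinsically joint across actions, including distributions of gaps and probabilities of superiority. However, the classical Markov decision process (MDP) formalism specifies only marginal laws and leaves unspecified the joint law of counterfactual one-step outcomes across multiple possible actions at a state. We study coupled-dynamics environments with a multi-action generative interface that can sample counterfactual one-step outcomes for multiple actions under shared exogenous randomness. We propose joint MDPs (JMDPs) as a formalism for such environments by augmenting an MDP with a multi-action sample transition model that specifies a coupling of one-step counterfactual outcomes, while preserving standard MDP interaction as marginal observations. We adopt and formalize a one-step coupling regime in which dependence across actions is confined to immediate counterfactual outcomes at the queried state. In this regime, we derive Bellman operators for $n$th-order return moments, providing dynamic programming and incremental algorithms with convergence guarantees.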
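Why quantities such as the probability of superiority depend on the joint law, and not only on the marginals, can be shown with a toy one-step example. The outcome model (action mean plus Gaussian noise), the function names, and the sample size are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def step_outcome(action_mean, noise):
    # One-step outcome of an action under a given exogenous noise draw.
    return action_mean + noise

def prob_superiority(mean_a, mean_b, num_samples, rng, coupled):
    noise_a = rng.normal(size=num_samples)
    # Coupled: both actions see the same exogenous randomness.
    # Independent: each action gets its own draw (same marginals).
    noise_b = noise_a if coupled else rng.normal(size=num_samples)
    gaps = step_outcome(mean_a, noise_a) - step_outcome(mean_b, noise_b)
    return float(np.mean(gaps > 0)), float(np.std(gaps))

rng = np.random.default_rng(0)
p_coupled, std_coupled = prob_superiority(0.5, 0.0, 100_000, rng, coupled=True)
p_indep, std_indep = prob_superiority(0.5, 0.0, 100_000, rng, coupled=False)
print(p_coupled, p_indep)
```

Both settings give each action exactly the same marginal outcome distribution, yet the estimated probability that action a beats action b and the spread of the gap differ sharply, which is the kind of information a marginal-only MDP specification leaves undetermined.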

Title: SurgCUT3R: Surgical Scene-Aware Continuous Understanding of Temporal 3D Representation

Authors: Kaiyuan Xu, Fangzhou Hong, Daniel Elson, Baoru Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.06971
Pdf URL: https://arxiv.org/pdf/2603.06971
Copy Paste: [[2603.06971]] SurgCUT3R: Surgical Scene-Aware Continuous Understanding of Temporal 3D Representation(https://arxiv.org/abs/2603.06971)
Keywords: generation
Abstract: Reconstructing surgical scenes from monocular endoscopic video is critical for advancing robotic-assisted surgery. However, the application of state-of-the-art general-purpose reconstruction models is constrained by two key challenges: the lack of supervised training data and performance degradation over long video sequences. To overcome these limitations, we propose SurgCUT3R, a systematic framework that adapts unified 3D reconstruction models to the surgical domain. Our contributions are threefold. First, we develop a data generation pipeline that exploits public stereo surgical datasets to produce large-scale, metric-scale pseudo-ground-truth depth maps, effectively bridging the data gap. Second, we propose a hybrid supervision strategy that couples our pseudo-ground-truth with geometric self-correction to enhance robustness against inherent data imperfections. Third, we introduce a hierarchical inference framework that employs two specialized models to effectively mitigate accumulated pose drift over long surgical videos: one for global stability and one for local accuracy. Experiments on the SCARED and StereoMIS datasets demonstrate that our method achieves a competitive balance between accuracy and efficiency, delivering near state-of-the-art but substantially faster pose estimation and offering a practical and effective solution for robust reconstruction in surgical environments. Project page: this https URL.
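Summary: Reconstructing surgical scenes from monocular endoscopic video is critical for advancing robotic-assisted surgery. However, the application of state-of-the-art general-purpose reconstruction models is constrained by two key challenges: the lack of supervised training data and performance degradation over long video sequences. To overcome these limitations, we propose SurgCUT3R, a systematic framework that adapts unified 3D reconstruction models to the surgical domain. Our contributions are threefold. First, we develop a data generation pipeline that exploits public stereo surgical datasets to produce large-scale, metric-scale pseudo-ground-truth depth maps, effectively bridging the data gap. Second, we propose a hybrid supervision strategy that couples the pseudo-ground-truth with geometric self-correction to improve robustness against inherent data imperfections. Third, we introduce a hierarchical inference framework that uses two specialized models, one for global stability and one for local accuracy, to mitigate accumulated pose drift over long surgical videos. Experiments on the SCARED and StereoMIS datasets show that our method achieves a competitive balance between accuracy and efficiency, delivering near state-of-the-art but substantially faster pose estimation and offering a practical, effective solution for robust reconstruction in surgical environments. Project page: this https URL.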

Title: Conditional Unbalanced Optimal Transport Maps: An Outlier-Robust Framework for Conditional Generative Modeling

Authors: Jiwoo Yoon, Kyumin Choi, Jaewoong Choi
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2603.06972
Pdf URL: https://arxiv.org/pdf/2603.06972
Copy Paste: [[2603.06972]] Conditional Unbalanced Optimal Transport Maps: An Outlier-Robust Framework for Conditional Generative Modeling(https://arxiv.org/abs/2603.06972)
Keywords: generative
Abstract: The Conditional Optimal Transport (COT) problem aims to find a transport map between conditional source and target distributions while minimizing the transport cost. Recently, these transport maps have been utilized in conditional generative modeling tasks to establish efficient mappings between the distributions. However, classical COT inherits a fundamental limitation of optimal transport, i.e., sensitivity to outliers, which arises from the hard distribution matching constraints. This limitation becomes more pronounced in a conditional setting, where each conditional distribution is estimated from a limited subset of data. To address this, we introduce the Conditional Unbalanced Optimal Transport (CUOT) framework, which relaxes conditional distribution-matching constraints through Csiszár divergence penalties while strictly preserving the conditioning marginals. We establish a rigorous formulation of the CUOT problem and derive its dual and semi-dual formulations. Based on the semi-dual form, we propose Conditional Unbalanced Optimal Transport Maps (CUOTM), an outlier-robust conditional generative model built upon a triangular $c$-transform parameterization. We theoretically justify the validity of this parameterization by proving that the optimal triangular map satisfies the $c$-transform relationships. Our experiments on 2D synthetic and image-scale datasets demonstrate that CUOTM achieves superior outlier robustness and competitive distribution-matching performance compared to existing COT-based baselines, while maintaining high sampling efficiency.
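Summary: The Conditional Optimal Transport (COT) problem aims to find a transport map between conditional source and target distributions while minimizing the transport cost. Recently, such transport maps have been used in conditional generative modeling tasks to establish efficient mappings between distributions. However, classical COT inherits a fundamental limitation of optimal transport, namely sensitivity to outliers, which arises from the hard distribution-matching constraints. This limitation becomes more pronounced in the conditional setting, where each conditional distribution is estimated from a limited subset of the data. To address this, we introduce the Conditional Unbalanced Optimal Transport (CUOT) framework, which relaxes the conditional distribution-matching constraints through Csiszár divergence penalties while strictly preserving the conditioning marginals. We establish a rigorous formulation of the CUOT problem and derive its dual and semi-dual formulations. Based on the semi-dual form, we propose Conditional Unbalanced Optimal Transport Maps (CUOTM), an outlier-robust conditional generative model built on a triangular $c$-transform parameterization. We justify this parameterization theoretically by proving that the optimal triangular map satisfies the $c$-transform relationships. Experiments on 2D synthetic and image-scale datasets show that CUOTM achieves superior outlier robustness and competitive distribution-matching performance compared with existing COT-based baselines, while maintaining high sampling efficiency.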

Title: Diffusion Controller: Framework, Algorithms and Parameterization

Authors: Tong Yang, Moonkyung Ryu, Chih-Wei Hsu, Guy Tennenholtz, Yuejie Chi, Craig Boutilier, Bo Dai
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.06981
Pdf URL: https://arxiv.org/pdf/2603.06981
Copy Paste: [[2603.06981]] Diffusion Controller: Framework, Algorithms and Parameterization(https://arxiv.org/abs/2603.06981)
Keywords: generation
Abstract: Controllable diffusion generation often relies on various heuristics that are seemingly disconnected without a unified understanding. We bridge this gap with Diffusion Controller (DiffCon), a unified control-theoretic view that casts reverse diffusion sampling as state-only stochastic control within (generalized) linearly-solvable Markov Decision Processes (LS-MDPs). Under this framework, control acts by reweighting the pretrained reverse-time transition kernels, balancing terminal objectives against an $f$-divergence cost. From the resulting optimality conditions, we derive practical reinforcement learning methods for diffusion fine-tuning: (i) $f$-divergence-regularized policy-gradient updates, including a PPO-style rule, and (ii) a regularizer-determined reward-weighted regression objective with a minimizer-preservation guarantee under the Kullback-Leibler (KL) divergence. The LS-MDP framework further implies a principled model form: the optimal score decomposes into a fixed pretrained baseline plus a lightweight control correction, motivating a side-network parameterization conditioned on exposed intermediate denoising outputs, enabling effective gray-box adaptation with a frozen backbone. Experiments on Stable Diffusion v1.4 across supervised and reward-driven fine-tuning show consistent gains in preference-alignment win rates and improved quality-efficiency trade-offs versus gray-box baselines and even the parameter-efficient white-box adapter LoRA.
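Summary: Controllable diffusion generation often relies on various heuristics that appear disconnected and lack a unified understanding. We bridge this gap with Diffusion Controller (DiffCon), a unified control-theoretic view that casts reverse diffusion sampling as state-only stochastic control within (generalized) linearly-solvable Markov Decision Processes (LS-MDPs). Under this framework, control acts by reweighting the pretrained reverse-time transition kernels, balancing terminal objectives against an $f$-divergence cost. From the resulting optimality conditions, we derive practical reinforcement learning methods for diffusion fine-tuning: (i) $f$-divergence-regularized policy-gradient updates, including a PPO-style rule, and (ii) a regularizer-determined reward-weighted regression objective with a minimizer-preservation guarantee under the Kullback-Leibler (KL) divergence. The LS-MDP framework further implies a principled model form: the optimal score decomposes into a fixed pretrained baseline plus a lightweight control correction, which motivates a side-network parameterization conditioned on exposed intermediate denoising outputs and enables effective gray-box adaptation with a frozen backbone. Experiments on Stable Diffusion v1.4 across supervised and reward-driven fine-tuning show consistent gains in preference-alignment win rates and improved quality-efficiency trade-offs compared with gray-box baselines and even the parameter-efficient white-box adapter LoRA.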
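The idea of controlling by reweighting a pretrained transition kernel against a divergence cost can be illustrated in the simplest case. The sketch below covers only a single step over a small discrete set of next states with a KL cost, where the optimal reweighting is the exponential tilt of the base kernel; the paper's general $f$-divergence and multi-step setting is not reproduced, and the names `values` and `lam` are assumptions for illustration.

```python
import numpy as np

def tilt_kernel(base_probs, values, lam):
    # KL-regularized control: pi*(x') is proportional to
    # p(x') * exp(values(x') / lam), computed in log space for stability.
    logits = np.log(base_probs) + values / lam
    logits -= logits.max()
    w = np.exp(logits)
    return w / w.sum()

def objective(pi, base_probs, values, lam):
    # Expected value minus lam times KL(pi || base).
    kl = np.sum(pi * np.log(pi / base_probs))
    return float(pi @ values - lam * kl)

base = np.array([0.5, 0.3, 0.2])    # pretrained transition probabilities
values = np.array([0.0, 1.0, 2.0])  # desirability of each next state
lam = 0.5                           # strength of the KL penalty
pi_star = tilt_kernel(base, values, lam)
print(pi_star, objective(pi_star, base, values, lam))
```

A small `lam` lets the controlled kernel move far toward high-value states, while a large `lam` keeps it close to the pretrained kernel, which is the balance between terminal objective and divergence cost described in the abstract.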

Title: AdaGen: Learning Adaptive Policy for Image Synthesis

Authors: Zanlin Ni, Yulin Wang, Yeguo Hua, Renping Zhou, Jiayi Guo, Jun Song, Bo Zheng, Gao Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.06993
Pdf URL: https://arxiv.org/pdf/2603.06993
Copy Paste: [[2603.06993]] AdaGen: Learning Adaptive Policy for Image Synthesis(https://arxiv.org/abs/2603.06993)
Keywords: generation, generative
Abstract: Recent advances in image synthesis have been propelled by powerful generative models, such as Masked Generative Transformers (MaskGIT), autoregressive models, diffusion models, and rectified flow models. A common principle behind their success is the decomposition of synthesis into multiple steps. However, this introduces a proliferation of step-specific parameters (e.g., noise level or temperature at each step). Existing approaches typically rely on manually-designed rules to manage this complexity, demanding expert knowledge and trial-and-error. Furthermore, these static schedules lack the flexibility to adapt to the unique characteristics of each sample, yielding sub-optimal performance. To address this issue, we present AdaGen, a general, learnable, and sample-adaptive framework for scheduling the iterative generation process. Specifically, we formulate the scheduling problem as a Markov Decision Process, where a lightweight policy network determines suitable parameters given the current generation state, and can be trained through reinforcement learning. Importantly, we demonstrate that simple reward designs, such as FID or pre-trained reward models, can be easily hacked and may not reliably guarantee the desired quality or diversity of generated samples. Therefore, we propose an adversarial reward design to guide the training of the policy networks. Finally, we introduce an inference-time refinement strategy and a controllable fidelity-diversity trade-off mechanism to further enhance the performance and flexibility of AdaGen. Comprehensive experiments on four generative paradigms validate the superiority of AdaGen. For example, AdaGen achieves better performance on DiT-XL with 3 times lower inference cost and improves the FID of VAR from 1.92 to 1.59 with negligible computational overhead.
摘要：强大的生成模型推动了图像合成的最新进展，例如 Masked Generative Transformers (MaskGIT)、自回归模型、扩散模型和整流流模型。他们成功背后的一个共同原则是将合成分解为多个步骤。然而，这引入了特定步骤参数的激增（例如，每个步骤的噪声水平或温度）。现有的方法通常依赖于手动设计的规则来管理这种复杂性，需要专业知识和反复试验。此外，这些静态时间表缺乏适应每个样本独特特征的灵活性，从而产生次优性能。为了解决这个问题，我们提出了 AdaGen，一个通用的、可学习的、样本自适应的框架，用于调度迭代生成过程。具体来说，我们将调度问题表述为马尔可夫决策过程，其中轻量级策略网络在给定当前生成状态的情况下确定合适的参数，并且可以通过强化学习进行训练。重要的是，我们证明了简单的奖励设计（例如 FID 或预先训练的奖励模型）很容易被黑客攻击，并且可能无法可靠地保证生成样本的所需质量或多样性。因此，我们提出了一种对抗性奖励设计来指导政策网络的训练。最后，我们引入了推理时间细化策略和可控保真度-多样性权衡机制，以进一步增强 AdaGen 的性能和灵活性。对四种生成范式的综合实验验证了 AdaGen 的优越性。例如，AdaGen 在 DiT-XL 上实现了更好的性能，推理成本降低了 3 倍，并将 VAR 的 FID 从 1.92 提高到 1.59，计算开销可以忽略不计。

Title: Resource-Adaptive Federated Text Generation with Differential Privacy

Authors: Jiayi Wang, John Gounley, Heidi Hanson
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.07027
Pdf URL: https://arxiv.org/pdf/2603.07027
Copy Paste: [[2603.07027]] Resource-Adaptive Federated Text Generation with Differential Privacy(https://arxiv.org/abs/2603.07027)
Keywords: generation
Abstract: In cross-silo federated learning (FL), sensitive text datasets remain confined to local organizations due to privacy regulations, making repeated training for each downstream task both communication-intensive and privacy-demanding. A promising alternative is to generate differentially private (DP) synthetic datasets that approximate the global distribution and can be reused across tasks. However, pretrained large language models (LLMs) often fail under domain shift, and federated finetuning is hindered by computational heterogeneity: only resource-rich clients can update the model, while weaker clients are excluded, amplifying data skew and the adverse effects of DP noise. We propose a flexible participation framework that adapts to client capacities. Strong clients perform DP federated finetuning, while weak clients contribute through a lightweight DP voting mechanism that refines synthetic text. To ensure the synthetic data mirrors the global dataset, we apply control codes (e.g., labels, topics, metadata) that represent each client's data proportions and constrain voting to semantically coherent subsets. This two-phase approach requires only a single round of communication for weak clients and integrates contributions from all participants. Experiments show that our framework improves distribution alignment and downstream robustness under DP and heterogeneity.
摘要：在跨孤岛联邦学习（FL）中，由于隐私法规的原因，敏感文本数据集仍然仅限于本地组织，这使得每个下游任务的重复训练既需要通信密集又需要隐私。一个有前途的替代方案是生成近似全局分布并且可以跨任务重用的差分私有（DP）合成数据集。然而，预训练的大型语言模型 (LLM) 在域转移下通常会失败，并且联合微调受到计算异质性的阻碍：只有资源丰富的客户端才能更新模型，而较弱的客户端则被排除在外，从而放大了数据偏差和 DP 噪声的不利影响。我们提出了一个适应客户能力的灵活参与框架。强客户端执行 DP 联合微调，而弱客户端则通过轻量级 DP 投票机制来完善合成文本。为了确保合成数据反映全局数据集，我们应用代表每个客户端数据比例的控制代码（例如标签、主题、元数据）并将投票限制在语义一致的子集上。这种两阶段方法只需要针对弱客户进行一轮沟通，并整合所有参与者的贡献。实验表明，我们的框架提高了 DP 和异质性下的分布对齐和下游鲁棒性。

Title: SODA: Sensitivity-Oriented Dynamic Acceleration for Diffusion Transformer

Authors: Tong Shao, Yusen Fu, Guoying Sun, Jingde Kong, Zhuotao Tian, Jingyong Su
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.07057
Pdf URL: https://arxiv.org/pdf/2603.07057
Copy Paste: [[2603.07057]] SODA: Sensitivity-Oriented Dynamic Acceleration for Diffusion Transformer(https://arxiv.org/abs/2603.07057)
Keywords: generation
Abstract: Diffusion Transformers have become a dominant paradigm in visual generation, yet their low inference efficiency remains a key bottleneck hindering further advancement. Among common training-free techniques, caching offers high acceleration efficiency but often compromises fidelity, whereas pruning shows the opposite trade-off. Integrating caching with pruning achieves a balance between acceleration and generation quality. However, existing methods typically employ fixed and heuristic schemes to configure caching and pruning strategies. While they roughly follow the overall sensitivity trend of generation models to acceleration, they fail to capture fine-grained and complex variations, inevitably skipping highly sensitive computations and leading to quality degradation. Furthermore, such manually designed strategies exhibit poor generalization. To address these issues, we propose SODA, a Sensitivity-Oriented Dynamic Acceleration method that adaptively performs caching and pruning based on fine-grained sensitivity. SODA builds an offline sensitivity error modeling framework across timesteps, layers, and modules to capture the sensitivity to different acceleration operations. The cache intervals are optimized via dynamic programming with sensitivity error as the cost function, minimizing the impact of caching on model sensitivity. During pruning and cache reuse, SODA adaptively determines the pruning timing and rate to preserve computations of highly sensitive tokens, significantly enhancing generation fidelity. Extensive experiments on DiT-XL/2, PixArt-$\alpha$, and OpenSora demonstrate that SODA achieves state-of-the-art generation fidelity under controllable acceleration ratios. Our code is released publicly at: this https URL.
摘要：扩散变压器已成为视觉生成中的主导范例，但其低推理效率仍然是阻碍进一步发展的关键瓶颈。在常见的免训练技术中，缓存提供了较高的加速效率，但往往会损害保真度，而剪枝则显示出相反的权衡。将缓存与剪枝相结合，实现了加速和生成质量之间的平衡。然而，现有方法通常采用固定和启发式方案来配置缓存和修剪策略。虽然它们大致遵循生成模型对加速的整体敏感性趋势，但它们无法捕获细粒度和复杂的变化，不可避免地跳过高度敏感的计算并导致质量下降。此外，这种手动设计的策略泛化性很差。为了解决这些问题，我们提出了 SODA，一种面向灵敏度的动态加速方法，可根据细粒度的灵敏度自适应地执行缓存和修剪。 SODA 构建了跨时间步、层和模块的离线灵敏度误差建模框架，以捕获对不同加速操作的灵敏度。通过以灵敏度误差为代价函数的动态规划来优化缓存间隔，最大限度地减少缓存对模型灵敏度的影响。在修剪和缓存重用期间，SODA 自适应地确定修剪时间和速率，以保留高度敏感令牌的计算，从而显着提高生成保真度。在 DiT-XL/2、PixArt-$\alpha$ 和 OpenSora 上进行的大量实验表明，SODA 在可控加速比下实现了最先进的生成保真度。我们的代码公开发布于：此 https URL。
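
Illustrative sketch (not from the paper): the SODA abstract above says cache intervals are optimized via dynamic programming with sensitivity error as the cost. One way such a planner could look is shown below. The cost table `err[i][j]` (the error of computing at timestep `i` and reusing that cache through timestep `j`), the fixed compute `budget`, and the name `plan_cache_steps` are all assumptions made for illustration. The abstract does not specify the actual error model or objective.

```python
def plan_cache_steps(err, budget):
    """Pick `budget` timesteps to fully compute so total reuse error is minimal.

    err[i][j] (i <= j) is an ASSUMED cost of computing at step i and reusing
    that cache through step j. Requires 1 <= budget <= len(err).
    """
    T = len(err)
    INF = float("inf")
    # best[k][t]: minimal cost to cover steps 0..t-1 with k computed steps
    best = [[INF] * (T + 1) for _ in range(budget + 1)]
    prev = [[-1] * (T + 1) for _ in range(budget + 1)]
    best[0][0] = 0.0
    for k in range(1, budget + 1):
        for t in range(1, T + 1):
            for s in range(k - 1, t):  # last segment is computed at step s
                cand = best[k - 1][s] + err[s][t - 1]
                if cand < best[k][t]:
                    best[k][t] = cand
                    prev[k][t] = s
    # Walk the back-pointers to recover which steps are computed
    steps, t = [], T
    for k in range(budget, 0, -1):
        s = prev[k][t]
        steps.append(s)
        t = s
    return best[budget][T], steps[::-1]
```

With a toy cost that grows quadratically in reuse distance, the planner spreads the computed steps evenly, which is the behaviour one would expect from an interval optimizer.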

Title: MedSteer: Counterfactual Endoscopic Synthesis via Training-Free Activation Steering

Authors: Trong-Thang Pham, Loc Nguyen, Anh Nguyen, Hien Nguyen, Ngan Le
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.07066
Pdf URL: https://arxiv.org/pdf/2603.07066
Copy Paste: [[2603.07066]] MedSteer: Counterfactual Endoscopic Synthesis via Training-Free Activation Steering(https://arxiv.org/abs/2603.07066)
Keywords: generation, generative
Abstract: Generative diffusion models are increasingly used for medical imaging data augmentation, but text prompting cannot produce causal training data. Re-prompting rerolls the entire generation trajectory, altering anatomy, texture, and background. Inversion-based editing methods introduce reconstruction error that causes structural drift. We propose MedSteer, a training-free activation-steering framework for endoscopic synthesis. MedSteer identifies a pathology vector for each contrastive prompt pair in the cross-attention layers of a diffusion transformer. At inference time, it steers image activations along this vector, generating counterfactual pairs from scratch where the only difference is the steered concept. All other structure is preserved by construction. We evaluate MedSteer across three experiments on Kvasir v3 and HyperKvasir. On counterfactual generation across three clinical concept pairs, MedSteer achieves flip rates of 0.800, 0.925, and 0.950, outperforming the best inversion-based baseline in both concept flip rate and structural preservation. On dye disentanglement, MedSteer achieves 75% dye removal against 20% (PnP) and 10% (h-Edit). On downstream polyp detection, augmenting with MedSteer counterfactual pairs achieves ViT AUC of 0.9755 versus 0.9083 for quantity-matched re-prompting, confirming that counterfactual structure drives the gain. Code is available at this https URL.
摘要：生成扩散模型越来越多地用于医学成像数据增强，但文本提示无法生成因果训练数据。重新提示会重新滚动整个生成轨迹，改变解剖结构、纹理和背景。基于反转的编辑方法引入了导致结构漂移的重建误差。我们提出了 MedSteer，一种用于内窥镜合成的免训练激活引导框架。 MedSteer 为扩散变压器的交叉注意力层中的每个对比提示对识别一个病理向量。在推理时，它沿着该向量引导图像激活，从头开始生成反事实对，其中唯一的区别是引导的概念。所有其他结构均通过施工得以保留。我们通过 Kvasir v3 和 HyperKvasir 上的三个实验评估了 MedSteer。在三个临床概念对的反事实生成中，MedSteer 实现了 0.800、0.925 和 0.950 的翻转率，在概念翻转率和结构保留方面均优于基于反转的最佳基线。在染料解缠结方面，MedSteer 的染料去除率达到 75%，而染料去除率为 20% (PnP) 和 10% (h-Edit)。在下游息肉检测中，使用 MedSteer 反事实对进行增强，ViT AUC 为 0.9755，而数量匹配的重新提示则为 0.9083，证实了反事实结构驱动了增益。代码位于链接此 https URL

Title: Physics-Guided VLM Priors for All-Cloud Removal

Authors: Liying Xu, Huifang Li, Huanfeng Shen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.07074
Pdf URL: https://arxiv.org/pdf/2603.07074
Copy Paste: [[2603.07074]] Physics-Guided VLM Priors for All-Cloud Removal(https://arxiv.org/abs/2603.07074)
Keywords: restoration
Abstract: Cloud removal is a fundamental challenge in optical remote sensing due to heterogeneous degradation. Thin clouds distort radiometry via partial transmission, while thick clouds occlude the surface. Existing pipelines separate thin-cloud correction from thick-cloud reconstruction, requiring explicit cloud-type decisions and often leading to error accumulation and discontinuities in mixed-cloud scenes. Therefore, we propose a novel approach named Physical-VLM All-Cloud Removal (PhyVLM-CR) that integrates the semantic capability of a Vision-Language Model (VLM) into a physical restoration model, achieving high-fidelity unified cloud removal. Specifically, the cognitive prior from a VLM (e.g., Qwen) is transformed into physical scattering parameters and a hallucination confidence map. Leveraging this confidence map as a continuous soft gate, our method achieves a unified restoration via adaptive weighting: it prioritizes physical inversion in high-transmission regions to preserve radiometric fidelity, while seamlessly transitioning to temporal reference reconstruction in low-confidence occluded areas. This mechanism eliminates the need for explicit boundary delineation, ensuring a coherent removal across heterogeneous cloud covers. Experiments on real-world Sentinel-2 surface reflectance imagery confirm that our approach achieves a remarkable balance between cloud removal and content preservation, delivering hallucination-free results with substantially improved quantitative accuracy compared to existing methods.
摘要：由于异质退化，去云是光学遥感的一个基本挑战。薄云通过部分透射扭曲辐射测量，而厚云则遮挡表面。现有的管道将薄云校正与厚云重建分开，需要明确的云类型决策，并且通常会导致混合云场景中的错误累积和不连续性。因此，一种名为物理-VLM全云去除（PhyVLM-CR）的新方法将视觉语言模型（VLM）的语义能力集成到物理恢复模型中，实现高保真统一去云。具体来说，来自 VLM（例如 Qwen）的认知先验被转换为物理散射参数和幻觉置信图。利用该置信图作为连续软门，我们的方法通过自适应加权实现统一恢复：它优先考虑高透射区域中的物理反演以保持辐射保真度，同时无缝过渡到低置信度遮挡区域中的时间参考重建。这种机制消除了明确边界划分的需要，确保跨异构云层的一致去除。对现实世界的 Sentinel-2 表面反射图像进行的实验证实，我们的方法在云去除和内容保存之间实现了显着的平衡，与现有方法相比，提供了无幻觉的结果，并且定量精度显着提高。

Title: Entropy-Aware On-Policy Distillation of Language Models

Authors: Woogyeol Jin, Taywon Min, Yongjin Yang, Swanand Ravindra Kadhe, Yi Zhou, Dennis Wei, Nathalie Baracaldo, Kimin Lee
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2603.07079
Pdf URL: https://arxiv.org/pdf/2603.07079
Copy Paste: [[2603.07079]] Entropy-Aware On-Policy Distillation of Language Models(https://arxiv.org/abs/2603.07079)
Keywords: generation
Abstract: On-policy distillation is a promising approach for transferring knowledge between language models, where a student learns from dense token-level signals along its own trajectories. This framework typically uses reverse KL divergence, encouraging the student to match the teacher's high-confidence predictions. However, we show that the mode-seeking property of reverse KL reduces generation diversity and yields unstable learning signals when the teacher distribution has high entropy. To address this, we introduce Entropy-Aware On-Policy Distillation. Our key idea is augmenting the standard reverse KL objective with forward KL when teacher entropy is high, capturing the full range of plausible outputs while retaining precise imitation elsewhere. It balances mode-seeking precision with mode-covering robustness without sacrificing on-policy training efficiency. Experiments show that our method maintains generation diversity (sustained token-level entropy) and improves student-teacher alignment (lower forward KL on high-entropy tokens). Across six math reasoning benchmarks, this yields Pass@8 accuracy gains of +1.37 for Qwen3-0.6B-Base, +2.39 for Qwen3-1.7B-Base, and +5.05 for Qwen3-4B-Base compared to baseline on-policy distillation methods. These results demonstrate that accounting for teacher uncertainty is essential for maintaining diversity and achieving effective knowledge transfer.
摘要：在策略蒸馏是一种在语言模型之间转移知识的有前途的方法，学生可以沿着自己的轨迹从密集的令牌级信号中学习。该框架通常使用反向 KL 散度，鼓励学生匹配老师的高置信度预测。然而，我们表明，当教师分布具有高熵时，反向 KL 的模式搜索特性会降低生成多样性并产生不稳定的学习信号。为了解决这个问题，我们引入了熵感知的策略蒸馏。我们的关键思想是当教师熵较高时，用正向 KL 增强标准反向 KL 目标，捕获全部合理的输出，同时在其他地方保留精确的模仿。它平衡了模式搜索精度和模式覆盖鲁棒性，而不牺牲策略训练效率。实验表明，我们的方法保持了生成多样性（持续的令牌级熵）并改善了学生与教师的对齐（高熵令牌上的前向 KL 较低）。在六个数学推理基准测试中，与基线在策略蒸馏方法相比，Qwen3-0.6B-Base 的 Pass@8 准确率提高了 +1.37，Qwen3-1.7B-Base 提高了 +2.39，Qwen3-4B-Base 提高了 +5.05。这些结果表明，考虑教师的不确定性对于保持多样性和实现有效的知识转移至关重要。
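
Illustrative sketch (not from the paper): the abstract above describes augmenting reverse KL with forward KL when teacher entropy is high. The per-token loss below shows that idea with a hard entropy threshold. The threshold gating, the default value `1.0`, and the function names are assumptions; the paper may weight the two terms differently.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

def kl(p, q, eps=1e-12):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(x * math.log(x / max(y, eps)) for x, y in zip(p, q) if x > 0)

def entropy_aware_loss(student, teacher, threshold=1.0):
    """Reverse KL always; add forward KL where the teacher is uncertain.

    The hard threshold is an illustrative ASSUMPTION, not the paper's rule.
    """
    loss = kl(student, teacher)       # reverse KL: mode-seeking
    if entropy(teacher) > threshold:
        loss += kl(teacher, student)  # forward KL: mode-covering
    return loss
```

For a peaked teacher distribution the loss reduces to plain reverse KL, while for a near-uniform teacher both terms contribute.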

Title: Facial Expression Generation Aligned with Human Preference for Natural Dyadic Interaction

Authors: Xu Chen, Rui Gao, Xinjie Zhang, Haoyu Zhang, Che Sun, Zhi Gao, Yuwei Wu, Yunde Jia
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.07093
Pdf URL: https://arxiv.org/pdf/2603.07093
Copy Paste: [[2603.07093]] Facial Expression Generation Aligned with Human Preference for Natural Dyadic Interaction(https://arxiv.org/abs/2603.07093)
Keywords: generation
Abstract: Achieving natural dyadic interaction requires generating facial expressions that are emotionally appropriate and socially aligned with human preference. Human feedback offers a compelling mechanism to guide such alignment, yet how to effectively incorporate this feedback into facial expression generation remains underexplored. In this paper, we propose a facial expression generation method aligned with human preference by leveraging human feedback to produce contextually and emotionally appropriate expressions for natural dyadic interaction. A key to our method is framing the generation of identity-independent facial expressions as an action learning process, allowing human feedback to assess their validity free from visual or identity bias. We establish a closed feedback loop in which listener expressions dynamically respond to evolving conversational cues of the speaker. Concretely, we train a vision-language-action model via supervised fine-tuning to map the speaker's multimodal signals into controllable low-dimensional expression representations of a 3D morphable model. We further introduce a human-feedback reinforcement learning strategy that integrates the imitation of high-quality expression response with critic-guided optimization. Experiments on two benchmarks demonstrate that our method effectively aligns facial expressions with human preference and achieves superior performance.
摘要：实现自然的二元互动需要生成情感上适当且社交上符合人类偏好的面部表情。人类反馈提供了一种令人信服的机制来指导这种对齐，但如何有效地将这种反馈融入面部表情生成中仍待探索。在本文中，我们提出了一种符合人类偏好的面部表情生成方法，通过利用人类反馈来生成上下文和情感上适当的表情，以实现自然的二元交互。我们方法的关键是将生成与身份无关的面部表情作为一个动作学习过程，允许人类反馈评估其有效性，而不受视觉或身份偏见的影响。我们建立了一个闭合的反馈循环，其中听众的表情动态地响应说话者不断变化的对话线索。具体来说，我们通过监督微调训练视觉-语言-动作模型，将说话者的多模态信号映射到 3D 可变形模型的可控低维表达表示中。我们进一步引入了一种人类反馈强化学习策略，它将高质量表达反应的模仿与评论家引导的优化相结合。两个基准的实验表明，我们的方法有效地将面部表情与人类偏好结合起来，并实现了卓越的性能。

Title: TIQA: Human-Aligned Text Quality Assessment in Generated Images

Authors: Kirill Koltsov, Aleksandr Gushchin, Dmitriy Vatolin, Anastasia Antsiferova
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.07119
Pdf URL: https://arxiv.org/pdf/2603.07119
Copy Paste: [[2603.07119]] TIQA: Human-Aligned Text Quality Assessment in Generated Images(https://arxiv.org/abs/2603.07119)
Keywords: generation, quality assessment
Abstract: Text rendering remains a persistent failure mode of modern text-to-image models (T2I), yet existing evaluations rely on OCR correctness or VLM-based judging procedures that are poorly aligned with perceptual text artifacts. We introduce Text-in-Image Quality Assessment (TIQA), a task that predicts a scalar quality score that matches human judgments of rendered-text fidelity within cropped text regions. We release two MOS-labeled datasets: TIQA-Crops (10k text crops) and TIQA-Images (1,500 images), spanning 20+ T2I models, including proprietary ones. We also propose ANTIQA, a lightweight method with text-specific biases, and show that it improves correlation with human scores over OCR confidence, VLM judges, and generic NR-IQA metrics by at least $\sim0.05$ on TIQA-Crops and $\sim0.08$ on TIQA-Images, as measured by PLCC. Finally, we show that TIQA models are valuable in downstream tasks: for example, selecting the best-of-5 generations with ANTIQA improves human-rated text quality by $+14\%$ on average, demonstrating practical value for filtering and reranking in generation pipelines.
摘要：文本渲染仍然是现代文本到图像模型 (T2I) 的持续失败模式，但现有的评估依赖于 OCR 正确性或基于 VLM 的判断程序，这些程序与感知文本伪影的一致性较差。我们引入了图像中文本质量评估（TIQA），这是一项预测标量质量分数的任务，该分数与人类对裁剪文本区域内渲染文本保真度的判断相匹配。我们发布了两个 MOS 标记的数据集：TIQA-Crops（10k 文本裁剪）和 TIQA-Images（1,500 张图像），涵盖 20 多个 T2I 模型，包括专有模型。我们还提出了 ANTIQA，一种具有文本特定偏差的轻量级方法，并表明它在 OCR 置信度、VLM 判断和通用 NR-IQA 指标方面与人类评分的相关性在 TIQA-Crops 上提高了至少 $\sim0.05$，在 TIQA-Images 上提高了 $\sim0.08$（由 PLCC 测量）。最后，我们证明 TIQA 模型在下游任务中很有价值：例如，使用 ANTIQA 选择 5 代中最好的一代可以将人工评价的文本质量平均提高 $+14\%$，展示了生成管道中过滤和重新排名的实用价值。

Title: CanoVerse: 3D Object Scalable Canonicalization and Dataset for Generation and Pose

Authors: Li Jin, Yuchen Yang, Weikai Chen, Yujie Wang, Dehao Hao, Tanghui Jia, Yingda Yin, Zeyu Hu, Runze Zhang, Keyang Luo, Li Yuan, Long Quan, Xin Wang, Xueying Qin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.07144
Pdf URL: https://arxiv.org/pdf/2603.07144
Copy Paste: [[2603.07144]] CanoVerse: 3D Object Scalable Canonicalization and Dataset for Generation and Pose(https://arxiv.org/abs/2603.07144)
Keywords: generation
Abstract: 3D learning systems implicitly assume that objects occupy a coherent reference frame. Nonetheless, in practice, every asset arrives with an arbitrary global rotation, and models are left to resolve directional ambiguity on their own. This persistent misalignment suppresses pose-consistent generation, and blocks the emergence of stable directional semantics. To address this issue, we construct CanoVerse, a massive canonical 3D dataset of 320K objects over 1,156 categories -- an order-of-magnitude increase over prior work. At this scale, directional semantics become statistically learnable: CanoVerse improves 3D generation stability, enables precise cross-modal 3D shape retrieval, and unlocks zero-shot point-cloud orientation estimation even for out-of-distribution data. This is achieved by a new canonicalization framework that reduces alignment from minutes to seconds per object via compact hypothesis generation and lightweight human discrimination, transforming canonicalization from manual curation into a high-throughput data generation pipeline. The CanoVerse dataset will be publicly released upon acceptance. Project page: this https URL
摘要：3D 学习系统隐含地假设对象占据一个连贯的参考系。尽管如此，在实践中，每项资产都会以任意的全球轮换的方式到达，模型只能自行解决方向模糊性。这种持续的错位抑制了姿势一致的生成，并阻止了稳定方向语义的出现。为了解决这个问题，我们构建了 CanoVerse，这是一个包含 1,156 个类别的 320K 对象的大型规范 3D 数据集，比之前的工作增加了数量级。在这种规模下，方向语义变得可以统计学习：CanoVerse 提高了 3D 生成稳定性，实现了精确的跨模式 3D 形状检索，并解锁了零样本点云方向估计，即使对于分布外的数据也是如此。这是通过新的规范化框架实现的，该框架通过紧凑的假设生成和轻量级的人类辨别，将每个对象的对齐时间从几分钟缩短到几秒，将规范化从手动管理转变为高吞吐量数据生成管道。 CanoVerse 数据集将在接受后公开发布。项目页面：此 https URL

Title: LiveWorld: Simulating Out-of-Sight Dynamics in Generative Video World Models

Authors: Zicheng Duan, Jiatong Xia, Zeyu Zhang, Wenbo Zhang, Gengze Zhou, Chenhui Gou, Yefei He, Feng Chen, Xinyu Zhang, Lingqiao Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.07145
Pdf URL: https://arxiv.org/pdf/2603.07145
Copy Paste: [[2603.07145]] LiveWorld: Simulating Out-of-Sight Dynamics in Generative Video World Models(https://arxiv.org/abs/2603.07145)
Keywords: generative
Abstract: Recent generative video world models aim to simulate visual environment evolution, allowing an observer to interactively explore the scene via camera control. However, they implicitly assume that the world only evolves within the observer's field of view. Once an object leaves the observer's view, its state is "frozen" in memory, and revisiting the same region later often fails to reflect events that should have occurred in the meantime. In this work, we identify and formalize this overlooked limitation as the "out-of-sight dynamics" problem, which impedes video world models from representing a continuously evolving world. To address this issue, we propose LiveWorld, a novel framework that extends video world models to support persistent world evolution. Instead of treating the world as static observational memory, LiveWorld models a persistent global state composed of a static 3D background and dynamic entities that continue evolving even when unobserved. To maintain these unseen dynamics, LiveWorld introduces a monitor-based mechanism that autonomously simulates the temporal progression of active entities and synchronizes their evolved states upon revisiting, ensuring spatially coherent rendering. For evaluation, we further introduce LiveBench, a dedicated benchmark for the task of maintaining out-of-sight dynamics. Extensive experiments show that LiveWorld enables persistent event evolution and long-term scene consistency, bridging the gap between existing 2D observation-based memory and true 4D dynamic world simulation. The baseline and benchmark will be publicly available at this https URL.
摘要：最近的生成视频世界模型旨在模拟视觉环境的演变，允许观察者通过摄像机控制交互式地探索场景。然而，他们隐含地假设世界仅在观察者的视野内演化。一旦对象离开观察者的视野，它的状态就会“冻结”在内存中，并且稍后重新访问同一区域通常无法反映应该同时发生的事件。在这项工作中，我们将这种被忽视的限制识别并形式化为“视线外动态”问题，它阻碍了视频世界模型代表一个不断发展的世界。为了解决这个问题，我们提出了 LiveWorld，这是一种新颖的框架，可以扩展视频世界模型以支持持续的世界演化。 LiveWorld 没有将世界视为静态观察记忆，而是建模了由静态 3D 背景和动态实体组成的持久全局状态，即使在未被观察到的情况下，动态实体也会持续演化。为了维持这些看不见的动态，LiveWorld 引入了一种基于监视器的机制，该机制可以自动模拟活动实体的时间进程，并在重新访问时同步其演化状态，从而确保空间连贯的渲染。为了进行评估，我们进一步引入了LiveBench，这是一个专门用于维持视线外动态任务的基准测试。大量实验表明，LiveWorld 能够实现持久事件演化和长期场景一致性，弥合现有基于 2D 观察的内存与真正的 4D 动态世界模拟之间的差距。基线和基准测试将在此 https URL 上公开提供。

Title: Agentic Planning with Reasoning for Image Styling via Offline RL

Authors: Subhojyoti Mukherjee, Stefano Petrangeli, Branislav Kveton, Trung Bui, Franck Dernoncourt, Arko Mukherjee
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.07148
Pdf URL: https://arxiv.org/pdf/2603.07148
Copy Paste: [[2603.07148]] Agentic Planning with Reasoning for Image Styling via Offline RL(https://arxiv.org/abs/2603.07148)
Keywords: generation
Abstract: Direct prompt-based editing often fails on complex transformations because vague and subjective prompts often require nuanced understanding of what should be changed in the image. Our core intuition is that leveraging compositional image editing tools rather than direct prompting profits from structured agent-level planning with explicit reasoning, leading to better results. This structured planning framework enables efficient offline RL post-training on quality-scored trajectories to improve performance. We present a tool-based agentic RL post-training framework that addresses this through structured planning with chain-of-thought reasoning. Our key contributions include: (1) A tool-based agentic planning methodology that combines a compositional library of orthogonal primitive transformations, structured context representation, and explicit per-step reasoning to decompose complex styling into interpretable tool sequences. (2) A synthetic data generation pipeline producing three large-scale datasets (each $\sim$10K trajectories) with reasoning chains, plans, and quality scores, as no existing datasets provide such supervision. Our datasets and code are publicly available at the HuggingFace repository. (3) Offline RL training methods for learning planners with reasoning as our core algorithmic contributions, which consistently improve over the Edit-Only baseline in visual quality and instruction following. (4) Comprehensive evaluation across 4B and 8B parameter Qwen3-VL models showing that our methods outperform other baselines in the majority of compositional tasks, validated by human evaluations.
摘要：基于直接提示的编辑通常无法完成复杂的转换，因为模糊和主观的提示通常需要对图像中应更改的内容进行细致入微的了解。我们的核心直觉是，利用组合图像编辑工具，而不是通过明确的推理直接从结构化代理级规划中获得利润，从而获得更好的结果。这种结构化的规划框架可以在质量评分轨迹上进行高效的离线强化学习后期训练，从而提高性能。我们提出了一个基于工具的代理强化学习训练后框架，该框架通过思想链推理的结构化规划来解决这个问题。我们的主要贡献包括：（1）基于工具的代理规划方法，结合了正交基元变换、结构化上下文表示和显式每步推理的组合库，将复杂的样式分解为可解释的工具序列。 (2) 合成数据生成管道，生成三个大型数据集（每个数据集 $\sim$10K 轨迹），具有推理链、计划和质量分数，因为现有数据集没有提供此类监督。我们的数据集和代码可在 HuggingFace 存储库中公开获取。 (3) 学习规划者的离线强化学习训练方法，以推理作为我们的核心算法贡献，在视觉质量和指令遵循方面持续改进仅编辑基线。 (4) 对 4B 和 8B 参数 Qwen3-VL 模型的综合评估表明，我们的方法在大多数组合任务中优于其他基线，并经过人工评估验证。

Title: Class Visualizations and Activation Atlases for Enhancing Interpretability in Deep Learning-Based Computational Pathology

Authors: Marco Gustav, Fabian Wolf, Christina Glasner, Nic G. Reitsam, Stefan Schulz, Kira Aschenbroich, Bruno Märkl, Sebastian Foersch, Jakob Nikolas Kather
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.07170
Pdf URL: https://arxiv.org/pdf/2603.07170
Copy Paste: [[2603.07170]] Class Visualizations and Activation Atlases for Enhancing Interpretability in Deep Learning-Based Computational Pathology(https://arxiv.org/abs/2603.07170)
Keywords: generative
Abstract: The rapid adoption of transformer-based models in computational pathology has enabled prediction of molecular and clinical biomarkers from H&E whole-slide images, yet interpretability has not kept pace with model complexity. While attribution- and generative-based methods are common, feature visualization approaches such as class visualizations (CVs) and activation atlases (AAs) have not been systematically evaluated for these models. We developed a visualization framework and assessed CVs and AAs for a transformer-based foundation model across tissue and multi-organ cancer classification tasks with increasing label granularity. Four pathologists annotated real and generated images to quantify inter-observer agreement, complemented by attribution and similarity metrics. CVs preserved recognizability for morphologically distinct tissues but showed reduced separability for overlapping cancer subclasses. In tissue classification, agreement decreased from Fleiss k = 0.75 (scans) to k = 0.31 (CVs), with similar trends in cancer subclass tasks. AAs revealed layer-dependent organization: coarse tissue-level concepts formed coherent regions, whereas finer subclasses exhibited dispersion and overlap. Agreement was moderate for tissue classification (k = 0.58), high for coarse cancer groupings (k = 0.82), and low at subclass level (k = 0.11). Atlas separability closely tracked expert agreement on real images, indicating that representational ambiguity reflects intrinsic pathological complexity. Attribution-based metrics approximated expert variability in low-complexity settings, whereas perceptual and distributional metrics showed limited alignment. Overall, concept-level feature visualization reveals structured morphological manifolds in transformer-based pathology models and provides a framework for expert-centered interrogation of learned representations across label granularities.
摘要：基于 Transformer 的模型在计算病理学中的快速采用使得能够从 H&E 全切片图像预测分子和临床生物标志物，但可解释性并没有跟上模型复杂性的步伐。虽然基于归因和生成的方法很常见，但尚未针对这些模型系统地评估诸如类可视化 (CV) 和激活图集 (AA) 等特征可视化方法。我们开发了一个可视化框架，并评估了跨组织和多器官癌症分类任务的基于变压器的基础模型的 CV 和 AA，并增加了标签粒度。四位病理学家对真实图像和生成图像进行注释，以量化观察者间的一致性，并辅以归因和相似性指标。 CV 保留了对形态上不同的组织的可识别性，但对重叠的癌症亚类的可分离性却降低了。在组织分类中，一致性从 Fleiss k = 0.75（扫描）下降到 k = 0.31（CV），癌症子类任务中的趋势相似。 AA 揭示了层相关的组织：粗略的组织级概念形成了连贯的区域，而更精细的子类则表现出分散和重叠。组织分类的一致性中等（k = 0.58），粗略癌症分组的一致性较高（k = 0.82），亚类水平的一致性较低（k = 0.11）。图谱可分离性密切跟踪专家对真实图像的一致性，表明表征的模糊性反映了内在的病理复杂性。基于归因的指标近似于低复杂性环境中的专家变异性，而感知和分布指标则显示出有限的一致性。总体而言，概念级特征可视化揭示了基于变压器的病理模型中的结构化形态流形，并为以专家为中心的跨标签粒度的学习表示询问提供了框架。
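
Illustrative sketch (not from the paper): the agreement values reported above are Fleiss' kappa scores. The standard formula can be computed as below from a table of per-item category counts. The function name and the input layout are choices made for this example, and the abstract does not say which software the authors used.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a fixed number of raters per item.

    counts[i][j] is the number of raters who assigned item i to category j.
    Every row must sum to the same rater count (e.g. 4 pathologists).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])
    # Observed agreement per item, then averaged
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from overall category proportions
    p_cats = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    p_e = sum(p * p for p in p_cats)
    return (p_bar - p_e) / (1.0 - p_e)
```

A value of 1 means all raters agree on every item; values near 0 mean agreement is no better than chance.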

Title: FastSTAR: Spatiotemporal Token Pruning for Efficient Autoregressive Video Synthesis

Authors: Sungwoong Yune, Suheon Jeong, Joo-Young Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.07192
Pdf URL: https://arxiv.org/pdf/2603.07192
Copy Paste: [[2603.07192]] FastSTAR: Spatiotemporal Token Pruning for Efficient Autoregressive Video Synthesis(https://arxiv.org/abs/2603.07192)
Keywords: generation
Abstract: Visual Autoregressive modeling (VAR) has emerged as a highly efficient alternative to diffusion-based frameworks, achieving comparable synthesis quality. However, as this paradigm extends to Spacetime Autoregressive modeling (STAR) for video generation, scaling resolution and frame counts leads to a "token explosion" that creates a massive computational bottleneck in the final refinement stages. To address this, we propose FastSTAR, a training-free acceleration framework designed for high-quality video generation. Our core method, Spatiotemporal Token Pruning, identifies essential tokens by integrating two specialized terms: (1) Spatial similarity, which evaluates structural convergence across hierarchical scales to skip computations in regions where further refinement becomes redundant, and (2) Temporal similarity, which identifies active motion trajectories by assessing feature-level variations relative to the preceding clip. Combined with a Partial Update mechanism, FastSTAR ensures that only non-converged regions are refined, maintaining fluid motion while bypassing redundant computations. Experimental results on InfinityStar demonstrate that FastSTAR achieves up to a 2.01x speedup with a PSNR of 28.29 and less than 1% performance degradation, proving a superior efficiency-quality trade-off for STAR-based video synthesis.
摘要：视觉自回归建模 (VAR) 已成为基于扩散的框架的高效替代方案，可实现可比的合成质量。然而，随着这种范式扩展到用于视频生成的时空自回归建模（STAR），缩放分辨率和帧计数会导致“令牌爆炸”，从而在最终细化阶段产生巨大的计算瓶颈。为了解决这个问题，我们提出了 FastSTAR，这是一种专为高质量视频生成而设计的免训练加速框架。我们的核心方法，时空令牌修剪，通过集成两个专门术语来识别基本令牌：（1）空间相似性，它评估跨层次尺度的结构收敛性，以跳过进一步细化变得多余的区域中的计算，以及（2）时间相似性，它通过评估相对于前一个剪辑的特征级别变化来识别活动运动轨迹。与部分更新机制相结合，FastSTAR 可确保仅对非收敛区域进行细化，从而在绕过冗余计算的同时保持流体运动。 InfinityStar 上的实验结果表明，FastSTAR 实现了高达 2.01 倍的加速，PSNR 为 28.29，性能下降不到 1%，证明了基于 STAR 的视频合成具有卓越的效率与质量权衡。

Title: Retrieval-Augmented Generation for Predicting Cellular Responses to Gene Perturbation

Authors: Andrea Giuseppe Di Francesco, Andrea Rubbi, Pietro Liò
Subjects: cs.LG, cs.IR
Abstract URL: https://arxiv.org/abs/2603.07233
Pdf URL: https://arxiv.org/pdf/2603.07233
Copy Paste: [[2603.07233]] Retrieval-Augmented Generation for Predicting Cellular Responses to Gene Perturbation(https://arxiv.org/abs/2603.07233)
Keywords: generation
Abstract: Predicting how cells respond to genetic perturbations is fundamental to understanding gene function, disease mechanisms, and therapeutic development. While recent deep learning approaches have shown promise in modeling single-cell perturbation responses, they struggle to generalize across cell types and perturbation contexts due to limited contextual information during generation. We introduce PT-RAG (Perturbation-aware Two-stage Retrieval-Augmented Generation), a novel framework that extends Retrieval-Augmented Generation beyond traditional language-model applications to cellular biology. Unlike standard RAG systems designed for text retrieval with pre-trained LLMs, perturbation retrieval lacks established similarity metrics and requires learning what constitutes relevant context, making differentiable retrieval essential. PT-RAG addresses this through a two-stage pipeline: first, retrieving candidate perturbations $K$ using GenePT embeddings, then adaptively refining the selection through Gumbel-Softmax discrete sampling conditioned on both the cell state and the input perturbation. This cell-type-aware differentiable retrieval enables end-to-end optimization of the retrieval objective jointly with generation. On the Replogle-Nadig single-gene perturbation dataset, we demonstrate that PT-RAG outperforms both STATE and vanilla RAG under identical experimental conditions, with the strongest gains in distributional similarity metrics ($W_1$, $W_2$). Notably, vanilla RAG's dramatic failure is itself a key finding: it demonstrates that differentiable, cell-type-aware retrieval is essential in this domain, and that naive retrieval can actively harm performance. Our results establish retrieval-augmented generation as a promising paradigm for modelling cellular responses to gene perturbation. The code to reproduce our experiments is available at this https URL.
摘要：预测细胞如何应对遗传扰动对于理解基因功能、疾病机制和治疗开发至关重要。虽然最近的深度学习方法在模拟单细胞微扰反应方面显示出了希望，但由于生成过程中的上下文信息有限，它们很难在细胞类型和微扰上下文中进行泛化。我们引入了 PT-RAG（扰动感知两阶段检索增强生成），这是一种新颖的框架，它将检索增强生成从传统语言模型应用扩展到细胞生物学。与专为使用预先训练的 LLM 进行文本检索而设计的标准 RAG 系统不同，扰动检索缺乏既定的相似性度量，并且需要学习相关上下文的构成，因此可微分检索至关重要。 PT-RAG 通过两阶段管道解决了这个问题：首先，使用 GenePT 嵌入检索候选扰动 $K$，然后通过以细胞状态和输入扰动为条件的 Gumbel-Softmax 离散采样自适应地细化选择。这种细胞类型感知的可微检索能够与生成一起对检索目标进行端到端优化。在 Replogle-Nadig 单基因扰动数据集上，我们证明 PT-RAG 在相同的实验条件下优于 STATE 和普通 RAG，在分布相似性指标（$W_1$、$W_2$）方面具有最强的增益。值得注意的是，vanilla RAG 的戏剧性失败本身就是一个重要发现：它表明可微分、细胞类型感知的检索在该领域至关重要，而幼稚的检索会严重损害性能。我们的结果确立了检索增强生成作为模拟细胞对基因扰动的反应的有前景的范例。重现我们实验的代码可在此 https URL 中找到。
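
Illustrative sketch (not from the paper): the second retrieval stage above relies on Gumbel-Softmax sampling over candidate perturbations. The pure-Python function below shows only the sampling step for a list of relevance logits. The logits, the temperature `tau`, and the function name are assumptions. In practice the same computation would run inside an autodiff framework, so that gradients flow through the softmax.

```python
import math
import random

def gumbel_softmax(logits, tau=0.5, rng=random):
    """Draw a relaxed one-hot sample over candidates (sampling step only)."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise: -log(-log(U)), U ~ Uniform(0, 1), clamped for safety
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / tau)
    # Numerically stable softmax over the perturbed, temperature-scaled logits
    m = max(noisy)
    exps = [math.exp(v - m) for v in noisy]
    z = sum(exps)
    return [e / z for e in exps]
```

Lower values of `tau` push the output closer to a one-hot vector; taking the argmax of the result gives a discrete choice.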

Title: Single Image Super-Resolution via Bivariate À Trous Wavelet Diffusion

Authors: Heidari Maryam, Anantrasirichai Nantheera, Achim Alin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.07234
Pdf URL: https://arxiv.org/pdf/2603.07234
Copy Paste: [[2603.07234]] Single Image Super-Resolution via Bivariate À Trous Wavelet Diffusion(https://arxiv.org/abs/2603.07234)
Keywords: super-resolution, generative
Abstract: The effectiveness of super-resolution (SR) models hinges on their ability to recover high-frequency structure without introducing artifacts. Diffusion-based approaches have recently advanced the state of the art in SR. However, most diffusion-based SR pipelines operate purely in the spatial domain, which may yield high-frequency details that are not well supported by the underlying low-resolution evidence. On the other hand, unlike supervised SR models that may inject dataset-specific textures, single-image SR relies primarily on internal image statistics and can therefore be less prone to dataset-driven hallucinations; nevertheless, ambiguity in the low-resolution (LR) observation can still lead to inconsistent high-frequency details. To tackle this problem, we introduce BATDiff, an unsupervised Bivariate À Trous Wavelet Diffusion model designed to provide structured cross-scale guidance during the generative process. BATDiff employs an à trous wavelet transform that constructs an undecimated multiscale representation in which high-frequency components are progressively revealed while the full spatial resolution is preserved. As the core inference mechanism, BATDiff includes a bivariate cross-scale module that models parent-child dependencies between adjacent scales. It improves high-frequency coherence and reduces mismatch artifacts in diffusion-based SR. Experiments on standard benchmarks demonstrate that BATDiff produces sharper and more structurally consistent reconstructions than existing diffusion and non-diffusion baselines, achieving improvements in fidelity and perceptual quality.
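
The undecimated multiscale representation mentioned in the abstract can be sketched in one dimension. This is a generic à trous decomposition under stated assumptions (a B3-spline filter and mirrored boundaries, both common choices), not BATDiff's implementation; the abstract does not specify the filter, and the bivariate cross-scale module is not shown.

```python
def atrous_decompose(signal, levels):
    """Undecimated (a trous) wavelet decomposition of a 1-D signal."""
    kernel = (1 / 16, 4 / 16, 6 / 16, 4 / 16, 1 / 16)  # B3-spline filter (assumed)
    n = len(signal)

    def reflect(i):
        # Mirror indices that fall outside [0, n)
        period = 2 * (n - 1)
        i = abs(i) % period
        return period - i if i >= n else i

    approx = list(signal)
    details = []
    for j in range(levels):
        step = 2 ** j  # filter taps are spread 2**j samples apart (the "holes")
        smoothed = [
            sum(k * approx[reflect(i + (t - 2) * step)] for t, k in enumerate(kernel))
            for i in range(n)
        ]
        # Detail band at scale j: what the smoothing removed
        details.append([a - s for a, s in zip(approx, smoothed)])
        approx = smoothed
    return details, approx

signal = [0.0, 1.0, 4.0, 9.0, 16.0, 9.0, 4.0, 1.0]
details, coarse = atrous_decompose(signal, levels=3)
# Every band keeps the full length; summing all bands restores the input
restored = [coarse[i] + sum(d[i] for d in details) for i in range(len(signal))]
```

Because no band is downsampled, each detail coefficient stays aligned with the same sample position at every scale, which is what makes parent-child comparisons between adjacent scales straightforward.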

Title: FabricGen: Microstructure-Aware Woven Fabric Generation

Authors: Yingjie Tang, Di Luo, Zixiong Wang, Xiaoli Ling, Jian Yang, Beibei Wang
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2603.07240
Pdf URL: https://arxiv.org/pdf/2603.07240
Copy Paste: [[2603.07240]] FabricGen: Microstructure-Aware Woven Fabric Generation(https://arxiv.org/abs/2603.07240)
Keywords: generation, generative
Abstract: Woven fabric materials are widely used in rendering applications, yet designing realistic examples typically involves multiple stages, requiring expertise in weaving principles and texture authoring. Recent advances have explored diffusion models to streamline this process; however, pre-trained diffusion models often struggle to generate intricate yarn-level details that conform to weaving rules. To address this, we present FabricGen, an end-to-end framework for generating high-quality woven fabric materials from textual descriptions. A key insight of our method is the decomposition of macro-scale textures and micro-scale weaving patterns. To generate macro-scale textures free from microstructures, we fine-tune pre-trained diffusion models on a collected dataset of microstructure-free fabrics. As for micro-scale weaving patterns, we develop an enhanced procedural geometric model capable of synthesizing natural yarn-level geometry with yarn sliding and flyaway fibers. The procedural model is driven by a specialized large language model, WeavingLLM, which is fine-tuned on an annotated dataset of formatted weaving drafts, and prompt-tuned with domain-specific fabric expertise. Through fine-tuning and prompt tuning, WeavingLLM learns to design weaving drafts and fabric parameters from textual prompts, enabling the procedural model to produce diverse weaving patterns that stick to weaving principles. The generated macro-scale texture, along with the micro-scale geometry, can be used for fabric rendering. Consequently, our framework produces materials with significantly richer detail and realism compared to prior generative models.

Title: PresentBench: A Fine-Grained Rubric-Based Benchmark for Slide Generation

Authors: Xin-Sheng Chen, Jiayu Zhu, Pei-lin Li, Hanzheng Wang, Shuojin Yang, Meng-Hao Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.07244
Pdf URL: https://arxiv.org/pdf/2603.07244
Copy Paste: [[2603.07244]] PresentBench: A Fine-Grained Rubric-Based Benchmark for Slide Generation(https://arxiv.org/abs/2603.07244)
Keywords: generation, generative
Abstract: Slides serve as a critical medium for conveying information in presentation-oriented scenarios such as academia, education, and business. Despite their importance, creating high-quality slide decks remains time-consuming and cognitively demanding. Recent advances in generative models, such as Nano Banana Pro, have made automated slide generation increasingly feasible. However, existing evaluations of slide generation are often coarse-grained and rely on holistic judgments, making it difficult to accurately assess model capabilities or track meaningful advances in the field. In practice, the lack of fine-grained, verifiable evaluation criteria poses a critical bottleneck for both research and real-world deployment. In this paper, we propose PresentBench, a fine-grained, rubric-based benchmark for evaluating automated real-world slide generation. It contains 238 evaluation instances, each supplemented with background materials required for slide creation. Moreover, we manually design an average of 54.1 checklist items per instance, each formulated as a binary question, to enable fine-grained, instance-specific evaluation of the generated slide decks. Extensive experiments show that PresentBench provides more reliable evaluation results than existing methods, and exhibits significantly stronger alignment with human preferences. Furthermore, our benchmark reveals that NotebookLM significantly outperforms other slide generation methods, highlighting substantial recent progress in this domain.

Title: Variational Flow Maps: Make Some Noise for One-Step Conditional Generation

Authors: Abbas Mammadov, So Takao, Bohan Chen, Ricardo Baptista, Morteza Mardani, Yee Whye Teh, Julius Berner
Subjects: cs.CV, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2603.07276
Pdf URL: https://arxiv.org/pdf/2603.07276
Copy Paste: [[2603.07276]] Variational Flow Maps: Make Some Noise for One-Step Conditional Generation(https://arxiv.org/abs/2603.07276)
Keywords: generation
Abstract: Flow maps enable high-quality image generation in a single forward pass. However, unlike iterative diffusion models, they lack an explicit sampling trajectory, which impedes incorporating external constraints for conditional generation and solving inverse problems. We put forth Variational Flow Maps (VFMs), a framework for conditional sampling that shifts the perspective of conditioning from "guiding a sampling path" to "learning the proper initial noise". Specifically, given an observation, we seek to learn a noise adapter model that outputs a noise distribution, so that after mapping to the data space via the flow map, the samples respect the observation and the data prior. To this end, we develop a principled variational objective that jointly trains the noise adapter and the flow map, improving noise-data alignment, such that sampling from a complex data posterior is achieved with a simple adapter. Experiments on various inverse problems show that VFMs produce well-calibrated conditional samples in a single (or few) steps. For ImageNet, VFM attains competitive fidelity while accelerating the sampling by orders of magnitude compared to alternative iterative diffusion/flow models. Code is available at this https URL.

Title: MAviS: A Multimodal Conversational Assistant For Avian Species

Authors: Yevheniia Kryklyvets, Mohammed Irfan Kurpath, Sahal Shaji Mullappilly, Jinxing Zhou, Fahad Shabzan Khan, Rao Anwer, Salman Khan, Hisham Cholakkal
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.07294
Pdf URL: https://arxiv.org/pdf/2603.07294
Copy Paste: [[2603.07294]] MAviS: A Multimodal Conversational Assistant For Avian Species(https://arxiv.org/abs/2603.07294)
Keywords: generation
Abstract: Fine-grained understanding and species-specific multimodal question answering are vital for advancing biodiversity conservation and ecological monitoring. However, existing multimodal large language models face challenges when it comes to specialized topics like avian species, making it harder to provide accurate and contextually relevant information in these areas. To address this limitation, we introduce the MAviS-Dataset, a large-scale multimodal avian species dataset that integrates image, audio, and text modalities for over 1,000 bird species, comprising both pretraining and instruction-tuning subsets enriched with structured question-answer pairs. Building on the MAviS-Dataset, we introduce MAviS-Chat, a multimodal LLM that supports audio, vision, and text and is designed for fine-grained species understanding, multimodal question answering, and scene-specific description generation. Finally, for quantitative evaluation, we present MAviS-Bench, a benchmark of over 25,000 QA pairs designed to assess avian species-specific perceptual and reasoning abilities across modalities. Experimental results show that MAviS-Chat outperforms the baseline MiniCPM-o-2.6 by a large margin, achieving state-of-the-art open-source results and demonstrating the effectiveness of our instruction-tuned MAviS-Dataset. Our findings highlight the necessity of domain-adaptive multimodal LLMs for ecological applications.

Title: A Lightweight Digital-Twin-Based Framework for Edge-Assisted Vehicle Tracking and Collision Prediction

Authors: Murat Arda Onsu, Poonam Lohan, Burak Kantarci, Aisha Syed, Matthew Andrews, Sean Kennedy
Subjects: cs.CV, cs.NI, cs.RO, eess.SP
Abstract URL: https://arxiv.org/abs/2603.07338
Pdf URL: https://arxiv.org/pdf/2603.07338
Copy Paste: [[2603.07338]] A Lightweight Digital-Twin-Based Framework for Edge-Assisted Vehicle Tracking and Collision Prediction(https://arxiv.org/abs/2603.07338)
Keywords: generation
Abstract: Vehicle tracking, motion estimation, and collision prediction are fundamental components of traffic safety and management in Intelligent Transportation Systems (ITS). Many recent approaches rely on computationally intensive prediction models, which limits their practical deployment on resource-constrained edge devices. This paper presents a lightweight digital-twin-based framework for vehicle tracking and spatiotemporal collision prediction that relies solely on object detection, without requiring complex trajectory prediction networks. The framework is implemented and evaluated in Quanser Interactive Labs (QLabs), a high-fidelity digital twin of an urban traffic environment that enables controlled and repeatable scenario generation. A YOLO-based detector is deployed on simulated edge cameras to localize vehicles and extract frame-level centroid trajectories. Offline path maps are constructed from multiple traversals and indexed using K-D trees to support efficient online association between detected vehicles and road segments. During runtime, consistent vehicle identifiers are maintained, vehicle speed and direction are estimated from the temporal evolution of path indices, and future positions are predicted accordingly. Potential collisions are identified by analyzing both spatial proximity and temporal overlap of predicted future trajectories. Our experimental results across diverse simulated urban scenarios show that the proposed framework predicts approximately 88% of collision events prior to occurrence while maintaining low computational overhead suitable for edge deployment. Rather than introducing a computationally intensive prediction model, this work introduces a lightweight digital-twin-based solution for vehicle tracking and collision prediction, tailored for real-time edge deployment in ITS.
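
The runtime logic described in the abstract (associate detections with path points, estimate speed from path indices, extrapolate, then test spatial proximity at the same future frame) can be sketched as follows. All function names, the two toy paths, and the `radius` threshold are hypothetical. A linear scan stands in for the K-D tree query the abstract mentions, purely to keep the example dependency-free.

```python
import math

def nearest_index(path, point):
    """Index of the path point closest to `point` (linear scan instead of a K-D tree)."""
    return min(range(len(path)), key=lambda i: math.dist(path[i], point))

def predict_indices(path, detections, horizon):
    """Extrapolate future path indices from the last two detected centroids."""
    prev_i = nearest_index(path, detections[-2])
    curr_i = nearest_index(path, detections[-1])
    speed = curr_i - prev_i  # path indices advanced per frame (sign gives direction)
    last = len(path) - 1
    return [min(max(curr_i + speed * k, 0), last) for k in range(1, horizon + 1)]

def first_collision(path_a, det_a, path_b, det_b, horizon=10, radius=1.0):
    """First future frame at which both vehicles are within `radius`, else None."""
    fut_a = predict_indices(path_a, det_a, horizon)
    fut_b = predict_indices(path_b, det_b, horizon)
    for k, (ia, ib) in enumerate(zip(fut_a, fut_b), start=1):
        # Same frame k (temporal overlap) and close positions (spatial proximity)
        if math.dist(path_a[ia], path_b[ib]) <= radius:
            return k
    return None

# Two hypothetical straight paths that cross at (5, 5)
path_a = [(float(x), 5.0) for x in range(11)]
path_b = [(5.0, float(y)) for y in range(11)]
hit = first_collision(path_a, [(0.0, 5.0), (1.0, 5.0)],
                      path_b, [(5.0, 0.1), (5.0, 1.0)])
```

In this toy setup both vehicles advance one path index per frame and reach the crossing together, so `hit` is the frame offset at which they meet.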

Title: Latent Generative Models with Tunable Complexity for Compressed Sensing and other Inverse Problems

Authors: Sean Gunn, Jorio Cocola, Oliver De Candido, Vaggos Chatziafratis, Paul Hand
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.07357
Pdf URL: https://arxiv.org/pdf/2603.07357
Copy Paste: [[2603.07357]] Latent Generative Models with Tunable Complexity for Compressed Sensing and other Inverse Problems(https://arxiv.org/abs/2603.07357)
Keywords: generative
Abstract: Generative models have emerged as powerful priors for solving inverse problems. These models typically represent a class of natural signals using a single fixed complexity or dimensionality. This can be limiting: depending on the problem, a fixed complexity may result in high representation error if too small, or overfitting to noise if too large. We develop tunable-complexity priors for diffusion models, normalizing flows, and variational autoencoders, leveraging nested dropout. Across tasks including compressed sensing, inpainting, denoising, and phase retrieval, we show empirically that tunable priors consistently achieve lower reconstruction errors than fixed-complexity baselines. In the linear denoising setting, we provide a theoretical analysis that explicitly characterizes how the optimal tuning parameter depends on noise and model structure. This work demonstrates the potential of tunable-complexity generative priors and motivates both the development of supporting theory and their application across a wide range of inverse problems.
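
The nested-dropout idea the abstract builds on can be shown with a small masking sketch. It illustrates only the generic mechanism (keep the first k latent coordinates, zero the rest); how the authors apply it to diffusion models, normalizing flows, and variational autoencoders is not stated in the abstract. The uniform sampling of the truncation index and all names here are assumptions.

```python
import random

def nested_dropout_mask(dim, keep):
    """Mask that keeps the first `keep` latent coordinates and zeroes the rest."""
    return [1.0 if i < keep else 0.0 for i in range(dim)]

def truncate_latent(z, keep):
    """Apply the nested mask to a latent vector."""
    return [zi * m for zi, m in zip(z, nested_dropout_mask(len(z), keep))]

rng = random.Random(0)
z = [rng.gauss(0.0, 1.0) for _ in range(8)]

# Training: a truncation index is sampled per example (uniform here, as an assumption),
# so earlier coordinates learn to carry the most important information.
train_keep = rng.randint(1, len(z))
z_train = truncate_latent(z, train_keep)

# Inference: `keep` becomes the tunable complexity of the prior.
z_low = truncate_latent(z, 2)        # low complexity: less risk of fitting noise
z_high = truncate_latent(z, len(z))  # full complexity: lower representation error
```

Choosing `keep` per problem is the knob the abstract refers to: small values suit noisy measurements, large values suit clean ones.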

Title: ConfHit: Conformal Generative Design with Oracle Free Guarantees

Authors: Siddhartha Laghuvarapu, Ying Jin, Jimeng Sun
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.07371
Pdf URL: https://arxiv.org/pdf/2603.07371
Copy Paste: [[2603.07371]] ConfHit: Conformal Generative Design with Oracle Free Guarantees(https://arxiv.org/abs/2603.07371)
Keywords: generation, generative
Abstract: The success of deep generative models in scientific discovery requires not only the ability to generate novel candidates but also reliable guarantees that these candidates indeed satisfy desired properties. Recent conformal-prediction methods offer a path to such guarantees, but their application to generative modeling in drug discovery is limited by budget constraints, lack of oracle access, and distribution shift. To this end, we introduce ConfHit, a distribution-free framework that provides validity guarantees under these conditions. ConfHit formalizes two central questions: (i) Certification: whether a generated batch can be guaranteed to contain at least one hit with a user-specified confidence level, and (ii) Design: whether the generation can be refined to a compact set without weakening this guarantee. ConfHit leverages weighted exchangeability between historical and generated samples to eliminate the need for an experimental oracle, constructs a multiple-sample, density-ratio-weighted conformal p-value to quantify statistical confidence in hits, and proposes a nested testing procedure to certify and refine candidate sets of multiple generated samples while maintaining statistical guarantees. Across representative generative molecule design tasks and a broad range of methods, ConfHit consistently delivers valid coverage guarantees at multiple confidence levels while maintaining compact certified sets, establishing a principled and reliable framework for generative modeling.
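
As a rough illustration of the density-ratio weighting mentioned in the abstract, the sketch below computes the standard single-sample weighted conformal p-value used under covariate shift. It is not ConfHit's multiple-sample statistic, which the abstract does not define. The scores, weights, and score direction (larger means stronger evidence of a hit) are assumptions.

```python
def weighted_conformal_pvalue(cal_scores, cal_weights, test_score, test_weight):
    """Weighted conformal p-value for one test point.

    cal_scores: scores of calibration points for which the null holds.
    cal_weights / test_weight: density-ratio weights correcting for distribution shift.
    A calibration point scoring at least as high as the test point counts against it.
    """
    exceed = sum(w for s, w in zip(cal_scores, cal_weights) if s >= test_score)
    return (exceed + test_weight) / (sum(cal_weights) + test_weight)

cal_scores = [0.10, 0.40, 0.35, 0.80]  # hypothetical scores of historical samples
cal_weights = [1.0, 2.0, 1.0, 1.0]     # hypothetical density-ratio weights
p_weighted = weighted_conformal_pvalue(cal_scores, cal_weights, 0.5, 1.0)
p_uniform = weighted_conformal_pvalue(cal_scores, [1.0] * 4, 0.5, 1.0)
```

With all weights equal, the formula reduces to the usual rank-based conformal p-value; the weights shift mass toward calibration points that resemble the generated samples.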

Title: AQuA: Toward Strategic Response Generation for Ambiguous Visual Questions

Authors: Jihyoung Jang, Hyounghun Kim
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2603.07394
Pdf URL: https://arxiv.org/pdf/2603.07394
Copy Paste: [[2603.07394]] AQuA: Toward Strategic Response Generation for Ambiguous Visual Questions(https://arxiv.org/abs/2603.07394)
Keywords: generation
Abstract: Visual Question Answering (VQA) is a core task for evaluating the capabilities of Vision-Language Models (VLMs). Existing VQA benchmarks primarily feature clear and unambiguous image-question pairs, whereas real-world scenarios often involve varying degrees of ambiguity that require nuanced reasoning and context-appropriate response strategies. Although recent studies have begun to address ambiguity in VQA, they lack (1) a systematic categorization of ambiguity levels and (2) datasets and models that support strategy-aware responses. In this paper, we introduce Ambiguous Visual Question Answering (AQuA), a fine-grained dataset that classifies ambiguous VQA instances into four levels according to the nature and degree of ambiguity, along with the optimal response strategy for each case. Our evaluation of diverse open-source and proprietary VLMs shows that most models fail to adapt their strategy to the ambiguity type, frequently producing overconfident answers rather than seeking clarification or acknowledging uncertainty. To address this challenge, we fine-tune VLMs on AQuA, enabling them to adaptively choose among multiple response strategies, such as directly answering, inferring intent from contextual cues, listing plausible alternatives, or requesting clarification. VLMs trained on AQuA achieve strategic response generation for ambiguous VQA, demonstrating the ability to recognize ambiguity, manage uncertainty, and respond with context-appropriate strategies, while outperforming both open-source and closed-source baselines.

Title: VIVECaption: A Split Approach to Caption Quality Improvement

Authors: Varun Ananth, Baqiao Liu, Haoran Cai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.07401
Pdf URL: https://arxiv.org/pdf/2603.07401
Copy Paste: [[2603.07401]] VIVECaption: A Split Approach to Caption Quality Improvement(https://arxiv.org/abs/2603.07401)
Keywords: generative
Abstract: Caption quality has emerged as a critical bottleneck in training high-quality text-to-image (T2I) and text-to-video (T2V) generative models. While visual language models (VLMs) are commonly deployed to generate captions from visual data, they suffer from hallucinations, poor compositional reasoning, and limited fine-grained understanding, resulting in misaligned image-caption pairs that degrade downstream model performance. This technical report introduces VIVECaption, a systematic two-sided approach to caption quality improvement. We first establish a comprehensive taxonomy of caption evaluation metrics, distinguishing between "universal" and "instance-grounded" metrics, with the ultimate goal of showcasing the use-cases and tradeoffs between different caption quality metrics. We then use this language to describe our two-sided approach to caption quality improvement: (1) a gold-standard dataset creation methodology using stratified sampling and (2) a model alignment strategy encompassing context alignment and parameter-level finetuning using SFT. We demonstrate our methodology on open-source models, focusing on structured caption formats that enable better parsing and downstream utilization. We ultimately show that using a finetuned character detection model in an image captioning pipeline significantly improves holistic image-caption alignment quality. Our work addresses the growing need for high-quality "vegan" training data in enterprise AI development, providing practical solutions for teams seeking to improve caption-image alignment without relying on potentially copyright-protected web-scraped content.

Title: Prompt-Based Caption Generation for Single-Tooth Dental Images Using Vision-Language Models

Authors: Anastasiia Sukhanova, Aiden Taylor, Julian Myers, Zichun Wang, Kartha Veerya Jammuladinne, Satya Sri Rajiteswari Nimmagadda, Aniruddha Maiti, Ananya Jana
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.07403
Pdf URL: https://arxiv.org/pdf/2603.07403
Copy Paste: [[2603.07403]] Prompt-Based Caption Generation for Single-Tooth Dental Images Using Vision-Language Models(https://arxiv.org/abs/2603.07403)
Keywords: generation
Abstract: Digital dentistry has made significant advances with the advent of deep learning. However, the majority of these deep learning-based dental image analysis models focus on very specific tasks such as tooth segmentation, tooth detection, cavity detection, and gingivitis classification. There is a lack of a specialized model that has holistic knowledge of teeth and can perform dental image analysis tasks based on that knowledge. Datasets of dental images with captions can help build such a model. To the best of our knowledge, existing dental image datasets with captions are few in number and limited in scope. In many of these datasets, the captions describe the entire mouth, while the images are limited to the anterior view. As a result, posterior teeth such as molars are not clearly visible, limiting the usefulness of the captions for training vision-language models. Additionally, the captions focus only on a specific disease (gingivitis) and do not provide a holistic assessment of each tooth. Moreover, tooth disease scores are typically assigned to individual teeth, and each tooth is treated as a separate entity in orthodontic procedures. Therefore, it is important to have captions for single-tooth images. As far as we know, no such dataset of single-tooth images with dental captions exists. In this work, we aim to bridge that gap by assessing the possibility of generating captions for dental images using Vision-Language Models (VLMs) and evaluating the extent and quality of those captions. Our findings suggest that guided prompts help VLMs generate meaningful captions. We show that the prompts generated by our framework are better anchored in describing the visual aspects of dental images. We selected RGB images as they have greater potential in consumer scenarios.

Title: UnSCAR: Universal, Scalable, Controllable, and Adaptable Image Restoration

Authors: Debabrata Mandal, Soumitri Chattopadhyay, Yujie Wang, Marc Niethammer, Praneeth Chakravarthula
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.07406
Pdf URL: https://arxiv.org/pdf/2603.07406
Copy Paste: [[2603.07406]] UnSCAR: Universal, Scalable, Controllable, and Adaptable Image Restoration(https://arxiv.org/abs/2603.07406)
Keywords: restoration
Abstract: Universal image restoration aims to recover clean images from arbitrary real-world degradations using a single inference model. Despite significant progress, existing all-in-one restoration networks do not scale to multiple degradations. As the number of degradations increases, training becomes unstable, models grow excessively large, and performance drops across both seen and unseen domains. In this work, we show that scaling universal restoration is fundamentally limited by interference across degradations during joint learning, leading to catastrophic task forgetting. To address this challenge, we introduce a unified inference pipeline with a multi-branch mixture-of-experts architecture that decomposes restoration knowledge across specialized task-adaptable experts. Our approach enables scalable learning (over sixteen degradations), adapts and generalizes robustly to unseen domains, and supports user-controllable restoration across degradations. Beyond achieving superior performance across benchmarks, this work establishes a new design paradigm for scalable and controllable universal image restoration.

Title: Context Channel Capacity: An Information-Theoretic Framework for Understanding Catastrophic Forgetting

Authors: Ran Cheng
Subjects: cs.LG, cs.AI, cs.IT
Abstract URL: https://arxiv.org/abs/2603.07415
Pdf URL: https://arxiv.org/pdf/2603.07415
Copy Paste: [[2603.07415]] Context Channel Capacity: An Information-Theoretic Framework for Understanding Catastrophic Forgetting(https://arxiv.org/abs/2603.07415)
Keywords: generation
Abstract: Catastrophic forgetting remains a central challenge in continual learning (CL), yet the field lacks a unified information-theoretic explanation for why some architectures forget catastrophically while others do not. We introduce Context Channel Capacity ($C_\mathrm{ctx}$), the mutual information between a CL architecture's context signal and its generated parameters, and prove that zero forgetting requires $C_\mathrm{ctx} \geq H(T)$, where $H(T)$ is the task identity entropy. We establish an Impossibility Triangle (zero forgetting, online learning, and finite parameters cannot be simultaneously satisfied by sequential state-based learners) and show that conditional regeneration architectures (HyperNetworks) bypass this triangle by redefining parameters as function values rather than states. We validate this framework across 8 CL methods on Split-MNIST (1,130+ experiments over 86 days, 4 seeds each), showing that $C_\mathrm{ctx}$ perfectly predicts forgetting behavior: methods with $C_\mathrm{ctx} = 0$ (NaiveSGD, EWC, SI, LwF, CFlow) exhibit catastrophic forgetting (6-97%), while methods with $C_\mathrm{ctx} \approx 1$ (HyperNetwork) achieve zero forgetting (98.8% ACC). We further propose Wrong-Context Probing (P5), a practical diagnostic protocol for measuring $C_\mathrm{ctx}$, and extend the framework to CIFAR-10 via a novel Gradient Context Encoder that closes the oracle gap from 23.3pp to 0.7pp. A systematic taxonomy of 15+ closed research directions, including the Hebbian null result (frozen random features outperform learned features), CFlow's $\theta_0$-memorizer phenomenon, and the $S_N$ symmetry barrier to column specialization, provides the community with precisely diagnosed negative results. Our central design principle is architecture over algorithm: the context pathway must be structurally unbypassable.
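
The condition $C_\mathrm{ctx} \geq H(T)$ can be made concrete with a toy, fully discrete sketch. Everything here is an assumption made for illustration: five equiprobable tasks, a context signal equal to the task identity, and parameters represented by string labels. The abstract does not state the units or normalization behind its reported $C_\mathrm{ctx}$ values, so the numbers below are not comparable to them.

```python
import math
from collections import Counter

def entropy(samples):
    """Empirical Shannon entropy in bits."""
    counts = Counter(samples)
    n = len(samples)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def mutual_information(xs, ys):
    """I(X; Y) = H(X) + H(Y) - H(X, Y), in bits, from paired samples."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

tasks = [0, 1, 2, 3, 4] * 20                     # five equiprobable tasks (assumed)
context = tasks                                  # context signal reveals the task
params_regenerated = [f"theta_{c}" for c in context]  # parameters are a function of context
params_shared = ["theta"] * len(context)         # one parameter state, context ignored

h_t = entropy(tasks)
c_regenerated = mutual_information(context, params_regenerated)
c_shared = mutual_information(context, params_shared)
meets_bound = c_regenerated >= h_t - 1e-9
```

In the toy setting, regenerating parameters from the context transmits the full task entropy, while a single shared parameter state transmits nothing, mirroring the two regimes the abstract contrasts.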

Title: Disentangled Textual Priors for Diffusion-based Image Super-Resolution

Authors: Lei Jiang, Xin Liu, Xinze Tong, Zhiliang Li, Jie Liu, Jie Tang, Gangshan Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.07430
Pdf URL: https://arxiv.org/pdf/2603.07430
Copy Paste: [[2603.07430]] Disentangled Textual Priors for Diffusion-based Image Super-Resolution(https://arxiv.org/abs/2603.07430)
Keywords: super-resolution, generation, generative
Abstract: Image Super-Resolution (SR) aims to reconstruct high-resolution images from degraded low-resolution inputs. While diffusion-based SR methods offer powerful generative capabilities, their performance heavily depends on how semantic priors are structured and integrated into the generation process. Existing approaches often rely on entangled or coarse-grained priors that mix global layout with local details, or conflate structural and textural cues, thereby limiting semantic controllability and interpretability. In this work, we propose DTPSR, a novel diffusion-based SR framework that introduces disentangled textual priors along two complementary dimensions: spatial hierarchy (global vs. local) and frequency semantics (low- vs. high-frequency). By explicitly separating these priors, DTPSR enables the model to simultaneously capture scene-level structure and object-specific details with frequency-aware semantic guidance. The corresponding embeddings are injected via specialized cross-attention modules, forming a progressive generation pipeline that reflects the semantic granularity of visual content, from global layout to fine-grained textures. To support this paradigm, we construct DisText-SR, a large-scale dataset containing approximately 95,000 image-text pairs with carefully disentangled global, low-frequency, and high-frequency descriptions. To further enhance controllability and consistency, we adopt a multi-branch classifier-free guidance strategy with frequency-aware negative prompts to suppress hallucinations and semantic drift. Extensive experiments on synthetic and real-world benchmarks show that DTPSR achieves high perceptual quality, competitive fidelity, and strong generalization across diverse degradation scenarios.
摘要：图像超分辨率（SR）旨在从降级的低分辨率输入重建高分辨率图像。虽然基于扩散的 SR 方法提供了强大的生成能力，但其性能在很大程度上取决于语义先验的结构和集成到生成过程中的方式。现有的方法通常依赖于纠缠或粗粒度的先验，将全局布局与局部细节混合在一起，或者将结构和纹理线索混为一谈，从而限制了语义的可控性和可解释性。在这项工作中，我们提出了 DTPSR，一种新颖的基于扩散的 SR 框架，它沿着两个互补的维度引入了解开的文本先验：空间层次结构（全局与局部）和频率语义（低频与高频）。通过明确分离这些先验，DTPSR 使模型能够通过频率感知语义指导同时捕获场景级结构和特定于对象的细节。相应的嵌入通过专门的交叉注意模块注入，形成一个渐进的生成管道，反映视觉内容的语义粒度，从全局布局到细粒度纹理。为了支持这种范式，我们构建了 DisText-SR，这是一个包含大约 95,000 个图像文本对的大型数据集，其中包含仔细解开的全局、低频和高频描述。为了进一步增强可控性和一致性，我们采用多分支无分类器引导策略和频率感知的负面提示来抑制幻觉和语义漂移。对合成和现实世界基准的大量实验表明，DTPSR 在不同的退化场景中实现了高感知质量、竞争保真度和强大的泛化能力。
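
The multi-branch classifier-free guidance mentioned in the DTPSR abstract can be illustrated with a minimal sketch. The first function is the standard classifier-free guidance rule with a negative prompt. The second is only an assumed generalization in which each branch (for example global, low-frequency, high-frequency) contributes its own guided offset. The function names, the branch tuple layout, and the additive combination are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def cfg_combine(neg: Sequence[float], pos: Sequence[float], scale: float) -> Vector:
    """Standard classifier-free guidance: neg + scale * (pos - neg)."""
    return [n + scale * (p - n) for n, p in zip(neg, pos)]


def multi_branch_cfg(
    base: Sequence[float],
    branches: Sequence[Tuple[Sequence[float], Sequence[float], float]],
) -> Vector:
    """Hypothetical multi-branch variant (an assumption, not the paper's rule).

    Each branch is (positive prediction, negative prediction, scale); the
    result is base + sum_i scale_i * (pos_i - neg_i).
    """
    out = list(base)
    for pos, neg, scale in branches:
        out = [o + scale * (p - n) for o, p, n in zip(out, pos, neg)]
    return out


if __name__ == "__main__":
    # Toy two-dimensional "noise predictions" standing in for denoiser outputs.
    print(cfg_combine([0.0, 1.0], [1.0, 3.0], scale=2.0))
    print(multi_branch_cfg(
        [0.0, 0.0],
        [([1.0, 2.0], [0.0, 1.0], 2.0), ([3.0, 3.0], [1.0, 1.0], 0.5)],
    ))
```

In a real diffusion pipeline the vectors would be noise-prediction tensors produced by the denoiser under each prompt. Plain lists are used here only to keep the sketch dependency-free.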

Title: Image Generation Models: A Technical History

Authors: Rouzbeh Shirvani
Subjects: cs.CV, cs.AI, cs.CL, cs.GR
Abstract URL: https://arxiv.org/abs/2603.07455
Pdf URL: https://arxiv.org/pdf/2603.07455
Copy Paste: [[2603.07455]] Image Generation Models: A Technical History(https://arxiv.org/abs/2603.07455)
Keywords: generation, generative
Abstract: Image generation has advanced rapidly over the past decade, yet the literature seems fragmented across different models and application domains. This paper aims to offer a comprehensive survey of breakthrough image generation models, including variational autoencoders (VAEs), generative adversarial networks (GANs), normalizing flows, autoregressive and transformer-based generators, and diffusion-based methods. We provide a detailed technical walkthrough of each model type, including their underlying objectives, architectural building blocks, and algorithmic training steps. For each model type, we present the optimization techniques as well as common failure modes and limitations. We also go over recent developments in video generation and present the research works that made it possible to go from still frames to high quality videos. Lastly, we cover the growing importance of robustness and responsible deployment of these models, including deepfake risks, detection, artifacts, and watermarking.
摘要：图像生成在过去十年中发展迅速，但文献似乎分散在不同的模型和应用领域。本文旨在对突破性图像生成模型进行全面调查，包括变分自动编码器（VAE）、生成对抗网络（GAN）、归一化流、自回归和基于变压器的生成器以及基于扩散的方法。我们提供每种模型类型的详细技术演练，包括其基本目标、架构构建块和算法训练步骤。对于每种模型类型，我们都会介绍优化技术以及常见的故障模式和限制。我们还回顾了视频生成的最新发展，并介绍了使从静止帧到高质量视频成为可能的研究工作。最后，我们介绍了这些模型的稳健性和负责任的部署日益重要，包括深度造假风险、检测、伪影和水印。

Title: Contact-Guided 3D Genome Structure Generation of E. coli via Diffusion Transformers

Authors: Mingxin Zhang, Xiaofeng Dai, Yu Yao, Ziqi Yin
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.07472
Pdf URL: https://arxiv.org/pdf/2603.07472
Copy Paste: [[2603.07472]] Contact-Guided 3D Genome Structure Generation of E. coli via Diffusion Transformers(https://arxiv.org/abs/2603.07472)
Keywords: generation, generative
Abstract: In this study, we present a conditional diffusion-transformer framework for generating ensembles of three-dimensional Escherichia coli genome conformations guided by Hi-C contact maps. Instead of producing a single deterministic structure, we formulate genome reconstruction as a conditional generative modeling problem that samples heterogeneous conformations whose ensemble-averaged contacts are consistent with the input Hi-C data. A synthetic dataset is constructed using coarse-grained molecular dynamics simulations to generate chromatin ensembles and corresponding Hi-C maps under circular topology. Our models operate in a latent diffusion setting with a variational autoencoder that preserves per-bin alignment and supports replication-aware representations. Hi-C information is injected through a transformer-based encoder and cross-attention, enforcing a physically interpretable one-way constraint from Hi-C to structure. The model is trained using a flow-matching objective for stable optimization. On held-out ensembles, generated structures reproduce the input Hi-C distance-decay and structural correlation metrics while maintaining substantial conformational diversity, demonstrating the effectiveness of diffusion-based generative modeling for ensemble-level 3D genome reconstruction.
摘要：在这项研究中，我们提出了一个条件扩散变换器框架，用于生成由 Hi-C 接触图引导的三维大肠杆菌基因组构象的集合。我们不是产生单一的确定性结构，而是将基因组重建表述为一个条件生成建模问题，该问题对异质构象进行采样，其整体平均接触与输入的 Hi-C 数据一致。使用粗粒度分子动力学模拟构建合成数据集，以生成圆形拓扑下的染色质整体和相应的 Hi-C 图。我们的模型在潜在扩散设置中运行，具有变分自动编码器，可保留每个容器的对齐并支持复制感知表示。 Hi-C 信息通过基于 Transformer 的编码器和交叉注意力注入，强制执行从 Hi-C 到结构的物理可解释的单向约束。该模型使用流匹配目标进行训练，以实现稳定优化。在保留的集合上，生成的结构再现输入的 Hi-C 距离衰减和结构相关性指标，同时保持大量的构象多样性，证明了基于扩散的生成模型对于集合级 3D 基因组重建的有效性。
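
As a rough illustration of the flow-matching objective named in the abstract above, the sketch below uses the common linear-interpolation form: a sample x_t is taken on the straight line between a noise sample x0 and a data sample x1, and the model is regressed onto the constant velocity x1 - x0. The abstract does not say which flow-matching variant the authors use, and the Hi-C conditioning is omitted, so the function names and this particular path are assumptions.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def interpolate(x0: Sequence[float], x1: Sequence[float], t: float) -> Vector:
    """Point on the straight path from x0 (noise) to x1 (data) at time t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0: Sequence[float], x1: Sequence[float]) -> Vector:
    """Velocity of the straight path, which is constant in t: x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(
    model: Callable[[Vector, float], Vector],
    x0: Sequence[float],
    x1: Sequence[float],
    t: float,
) -> float:
    """Mean squared error between the predicted and the target velocity at x_t."""
    xt = interpolate(x0, x1, t)
    pred = model(xt, t)
    target = target_velocity(x0, x1)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(target)


if __name__ == "__main__":
    # A model that always predicts zero velocity, as a stand-in for a network.
    zero_model = lambda xt, t: [0.0 for _ in xt]
    print(flow_matching_loss(zero_model, [0.0, 0.0], [1.0, 2.0], t=0.5))
```

In training, x0, x1, and t would be sampled per batch and the model would be a neural network that also receives the conditioning signal.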

Title: EVLF: Early Vision-Language Fusion for Generative Dataset Distillation

Authors: Wenqi Cai, Yawen Zou, Guang Li, Chunzhi Gu, Chao Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.07476
Pdf URL: https://arxiv.org/pdf/2603.07476
Copy Paste: [[2603.07476]] EVLF: Early Vision-Language Fusion for Generative Dataset Distillation(https://arxiv.org/abs/2603.07476)
Keywords: generative
Abstract: Dataset distillation (DD) aims to synthesize compact training sets that enable models to achieve high accuracy with significantly fewer samples. Recent diffusion-based DD methods commonly introduce semantic guidance through late-stage cross-attention, where textual prompts tend to dominate the generative process. Although this strategy enforces label relevance, it diminishes the contribution of visual latents, resulting in over-corrected samples that mirror prompt patterns rather than reflecting intrinsic visual features. To solve this problem, we introduce an Early Vision-Language Fusion (EVLF) method that aligns textual and visual embeddings at the transition between the encoder and the generative backbone. By incorporating a lightweight cross-attention module at this transition, the early representations simultaneously encode local textures and global semantic directions across the denoising process. Importantly, EVLF is plug-and-play and can be easily integrated into any diffusion-based dataset distillation pipeline with an encoder. It works across different denoiser architectures and sampling schedules without any task-specific modifications. Extensive experiments demonstrate that EVLF generates semantically faithful and visually coherent synthetic data, yielding consistent improvements in downstream classification accuracy across varied settings. Source code is available at this https URL.
摘要：数据集蒸馏（DD）旨在合成紧凑的训练集，使模型能够以更少的样本实现高精度。最近基于扩散的 DD 方法通常通过后期交叉注意力引入语义指导，其中文本提示往往主导生成过程。尽管这种策略强制执行标签相关性，但它减少了视觉潜在的贡献，导致过度校正的样本反映提示模式而不是反映内在的视觉特征。为了解决这个问题，我们引入了一种早期视觉语言融合（EVLF）方法，该方法可以在编码器和生成主干之间的过渡处对齐文本和视觉嵌入。通过在此过渡中合并轻量级交叉注意模块，早期表示在整个去噪过程中同时编码局部纹理和全局语义方向。重要的是，EVLF 是即插即用的，可以通过编码器轻松集成到任何基于扩散的数据集蒸馏管道中。它适用于不同的降噪器架构和采样计划，无需任何特定于任务的修改。大量实验表明，EVLF 生成语义忠实且视觉连贯的合成数据，从而在不同设置下持续提高下游分类准确性。源代码可从此 https URL 获取。

Title: RobustSCI: Beyond Reconstruction to Restoration for Snapshot Compressive Imaging under Real-World Degradations

Authors: Hao Wang, Yuanfan Li, Qi Zhou, Zhankuo Xu, Jiong Ni, Xin Yuan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.07489
Pdf URL: https://arxiv.org/pdf/2603.07489
Copy Paste: [[2603.07489]] RobustSCI: Beyond Reconstruction to Restoration for Snapshot Compressive Imaging under Real-World Degradations(https://arxiv.org/abs/2603.07489)
Keywords: restoration
Abstract: Deep learning algorithms for video Snapshot Compressive Imaging (SCI) have achieved great success, yet they predominantly focus on reconstructing from clean measurements. This overlooks a critical real-world challenge: the captured signal itself is often severely degraded by motion blur and low light. Consequently, existing models falter in practical applications. To break this limitation, we pioneer the first study on robust video SCI restoration, shifting the goal from "reconstruction" to "restoration"--recovering the underlying pristine scene from a degraded measurement. To facilitate this new task, we first construct a large-scale benchmark by simulating realistic, continuous degradations on the DAVIS 2017 dataset. Second, we propose RobustSCI, a network that enhances a strong encoder-decoder backbone with a novel RobustCFormer block. This block introduces two parallel branches--a multi-scale deblur branch and a frequency enhancement branch--to explicitly disentangle and remove degradations during the recovery process. Furthermore, we introduce RobustSCI-C (RobustSCI-Cascade), which integrates a pre-trained Lightweight Post-processing Deblurring Network to significantly boost restoration performance with minimal overhead. Extensive experiments demonstrate that our methods outperform all SOTA models on the new degraded testbeds, with additional validation on real-world degraded SCI data confirming their practical effectiveness, elevating SCI from merely reconstructing what is captured to restoring what truly happened.
摘要：用于视频快照压缩成像 (SCI) 的深度学习算法取得了巨大成功，但它们主要侧重于从干净的测量中进行重建。这忽视了现实世界的一个关键挑战：捕获的信号本身通常会因运动模糊和弱光而严重退化。因此，现有模型在实际应用中表现不佳。为了打破这一限制，我们开创了第一个关于稳健视频 SCI 恢复的研究，将目标从“重建”转向“恢复”——从退化的测量中恢复底层的原始场景。为了促进这项新任务，我们首先通过在 DAVIS 2017 数据集上模拟现实的连续退化来构建大规模基准。其次，我们提出了 RobustSCI，这是一个通过新颖的 RobustCFormer 块增强强大的编码器-解码器主干的网络。该块引入了两个并行分支——多尺度去模糊分支和频率增强分支——以明确地解开并消除恢复过程中的退化。此外，我们还引入了 RobustSCI-C (RobustSCI-Cascade)，它集成了预先训练的轻量级后处理去模糊网络，以最小的开销显着提高恢复性能。大量实验表明，我们的方法在新的退化测试平台上优于所有 SOTA 模型，并对现实世界的退化 SCI 数据进行了额外验证，确认了其实际有效性，将 SCI 从仅仅重建捕获的内容提升到恢复真实发生的情况。

Title: High-Fidelity Medical Shape Generation via Skeletal Latent Diffusion

Authors: Guoqing Zhang, Jingyun Yang, Siqi Chen, Anping Zhang, Yang Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.07504
Pdf URL: https://arxiv.org/pdf/2603.07504
Copy Paste: [[2603.07504]] High-Fidelity Medical Shape Generation via Skeletal Latent Diffusion(https://arxiv.org/abs/2603.07504)
Keywords: generation
Abstract: Anatomy shape modeling is a fundamental problem in medical data analysis. However, the geometric complexity and topological variability of anatomical structures pose significant challenges to accurate anatomical shape generation. In this work, we propose a skeletal latent diffusion framework that explicitly incorporates structural priors for efficient and high-fidelity medical shape generation. We introduce a shape auto-encoder in which the encoder captures global geometric information through a differentiable skeletonization module and aggregates local surface features into shape latents, while the decoder predicts the corresponding implicit fields over sparsely sampled coordinates. New shapes are generated via a latent-space diffusion model, followed by neural implicit decoding and mesh extraction. To address the limited availability of medical shape data, we construct a large-scale dataset, \textit{MedSDF}, comprising surface point clouds and corresponding signed distance fields across multiple anatomical categories. Extensive experiments on MedSDF and vessel datasets demonstrate that the proposed method achieves superior reconstruction and generation quality while maintaining a higher computational efficiency compared with existing approaches. Code is available at: this https URL.
摘要：解剖形状建模是医学数据分析中的一个基本问题。然而，解剖结构的几何复杂性和拓扑变异性对准确的解剖形状生成提出了重大挑战。在这项工作中，我们提出了一个骨骼潜在扩散框架，该框架明确地结合了结构先验，以实现高效和高保真的医学形状生成。我们引入了一种形状自动编码器，其中编码器通过可微分骨架化模块捕获全局几何信息，并将局部表面特征聚合成形状潜伏，而解码器则在稀疏采样坐标上预测相应的隐式场。通过潜在空间扩散模型生成新形状，然后进行神经隐式解码和网格提取。为了解决医学形状数据的有限可用性，我们构建了一个大型数据集 \textit{MedSDF}，包括表面点云和跨多个解剖类别的相应带符号距离场。对 MedSDF 和血管数据集的大量实验表明，与现有方法相比，所提出的方法实现了卓越的重建和生成质量，同时保持了更高的计算效率。代码可在以下位置获得：此 https URL。

Title: Reinforcement learning-based dynamic cleaning scheduling framework for solar energy system

Authors: Heungjo An
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.07518
Pdf URL: https://arxiv.org/pdf/2603.07518
Copy Paste: [[2603.07518]] Reinforcement learning-based dynamic cleaning scheduling framework for solar energy system(https://arxiv.org/abs/2603.07518)
Keywords: generation
Abstract: Advancing autonomous green technologies in solar photovoltaic (PV) systems is key to improving sustainability and efficiency in renewable energy production. This study presents a reinforcement learning (RL)-based framework to autonomously optimize the cleaning schedules of PV panels in arid regions, where soiling from dust and other airborne particles significantly reduces energy output. By employing advanced RL algorithms, Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC), the framework dynamically adjusts cleaning intervals based on uncertain environmental conditions. The proposed approach was applied to a case study in Abu Dhabi, UAE, demonstrating that PPO outperformed SAC and traditional simulation optimization (Sim-Opt) methods, achieving up to 13% cost savings by dynamically responding to weather uncertainties. The results highlight the superiority of flexible, autonomous scheduling over fixed-interval methods, particularly in adapting to stochastic environmental dynamics. This aligns with the goals of autonomous green energy production by reducing operational costs and improving the efficiency of solar power generation systems. This work underscores the potential of RL-driven autonomous decision-making to optimize maintenance operations in renewable energy systems. In future research, it is important to enhance the generalization ability of the proposed RL model, while also considering additional factors and constraints to apply it to different regions.
摘要：推进太阳能光伏（PV）系统的自主绿色技术是提高可再生能源生产的可持续性和效率的关键。这项研究提出了一个基于强化学习（RL）的框架，可以自动优化干旱地区光伏电池板的清洁时间表，在这些地区，灰尘和其他空气颗粒物的污染会显着降低能源输出。通过采用先进的 RL 算法、近端策略优化 (PPO) 和 Soft Actor-Critic (SAC)，该框架可以根据不确定的环境条件动态调整清洁间隔。所提出的方法应用于阿联酋阿布扎比的案例研究，证明 PPO 优于 SAC 和传统模拟优化 (Sim-Opt) 方法，通过动态响应天气不确定性，实现高达 13% 的成本节省。结果凸显了灵活、自主调度相对于固定间隔方法的优越性，特别是在适应随机环境动态方面。这与通过降低运营成本和提高太阳能发电系统效率来实现自主绿色能源生产的目标相一致。这项工作强调了强化学习驱动的自主决策在优化可再生能源系统维护操作方面的潜力。在未来的研究中，重要的是要增强所提出的 RL 模型的泛化能力，同时还要考虑其他因素和约束，以将其应用于不同的区域。

Title: One-for-All Model Initialization with Frequency-Domain Knowledge

Authors: Jianlu Shen, Fu Feng, Yucheng Xie, Jiaqi Lv, Xin Geng
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.07523
Pdf URL: https://arxiv.org/pdf/2603.07523
Copy Paste: [[2603.07523]] One-for-All Model Initialization with Frequency-Domain Knowledge(https://arxiv.org/abs/2603.07523)
Keywords: generative
Abstract: Transferring knowledge by fine-tuning large-scale pre-trained networks has become a standard paradigm for downstream tasks, yet the knowledge of a pre-trained model is tightly coupled with its monolithic architecture, which restricts flexible reuse across models of varying scales. In response to this challenge, recent approaches typically resort to either parameter selection, which fails to capture the interdependent structure of this knowledge, or parameter prediction using generative models that depend on impractical access to large network collections. In this paper, we empirically demonstrate that a model's foundational, task-agnostic knowledge, its "learngene", is encoded within the low-frequency components of its weights, and can be efficiently inherited by downstream models. Based on this insight, we propose FRONT (FRequency dOmain kNowledge Transfer), a novel framework that uses the Discrete Cosine Transform (DCT) to isolate the low-frequency "learngene". This learngene can be seamlessly adapted to initialize models of arbitrary size via simple truncation or padding, a process that is entirely training-free. For enhanced performance, we propose an optional low-cost refinement process that introduces a spectral regularizer to further improve the learngene's transferability. Extensive experiments demonstrate that FRONT achieves state-of-the-art performance, accelerates convergence by up to 15 times in vision tasks, and reduces training FLOPs by an average of 40.5% in language tasks.
摘要：通过微调大规模预训练网络来传输知识已成为下游任务的标准范例，但预训练模型的知识与整体架构紧密耦合，这限制了不同规模模型之间的灵活重用。为了应对这一挑战，最近的方法通常采用参数选择（无法捕获这种知识的相互依赖结构）或使用生成模型进行参数预测，而生成模型依赖于对大型网络集合的不切实际的访问。在本文中，我们凭经验证明模型的基础、与任务无关的知识，即它的“学习基因”，被编码在其权重的低频分量内，并且可以被下游模型有效地继承。基于这一见解，我们提出了 FRONT（频率域知识转移），这是一种使用离散余弦变换（DCT）来隔离低频“学习基因”的新颖框架。该学习基因可以无缝地适应通过简单的截断或填充来初始化任意大小的模型，这是一个完全无需训练的过程。为了增强性能，我们提出了一种可选的低成本细化过程，该过程引入了频谱正则化器以进一步提高学习基因的可转移性。大量实验表明，FRONT 实现了最先进的性能，在视觉任务中将收敛速度提高了 15 倍，在语言任务中将训练 FLOP 平均减少了 40.5%。
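
The truncation-or-padding idea in the FRONT abstract can be sketched on a single weight matrix: take a 2-D DCT, keep the top-left (low-frequency) block of coefficients, truncate or zero-pad that block to the target shape, and invert the transform. This is a minimal pure-Python sketch under stated assumptions. The helper names, the orthonormal DCT-II convention, and the absence of any amplitude rescaling are illustrative choices, not the authors' implementation.

```python
import math
from typing import List

Matrix = List[List[float]]


def dct_matrix(n: int) -> Matrix:
    """Orthonormal DCT-II matrix of size n x n (row k is the k-th cosine basis)."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n)])
    return rows


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def dct2(w: Matrix) -> Matrix:
    """2-D DCT of w: C_rows @ w @ C_cols^T."""
    return matmul(matmul(dct_matrix(len(w)), w), transpose(dct_matrix(len(w[0]))))


def idct2(f: Matrix) -> Matrix:
    """Inverse of dct2 for orthonormal bases: C_rows^T @ f @ C_cols."""
    return matmul(matmul(transpose(dct_matrix(len(f))), f), dct_matrix(len(f[0])))


def resize_coeffs(f: Matrix, rows: int, cols: int) -> Matrix:
    """Keep the top-left (low-frequency) block; zero-pad if the target is larger."""
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(min(rows, len(f))):
        for j in range(min(cols, len(f[0]))):
            out[i][j] = f[i][j]
    return out


def init_from_learngene(w: Matrix, rows: int, cols: int, keep: int) -> Matrix:
    """Extract a keep x keep low-frequency block from w and build a rows x cols init."""
    gene = resize_coeffs(dct2(w), keep, keep)
    return idct2(resize_coeffs(gene, rows, cols))


if __name__ == "__main__":
    source = [[float(i + j) for j in range(4)] for i in range(4)]
    init = init_from_learngene(source, rows=6, cols=6, keep=2)
    print(len(init), len(init[0]))
```

With an orthonormal DCT, changing the matrix size also changes the amplitude of the reconstructed weights, so a practical version would likely need a rescaling step. That step is left out of this sketch.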

Title: Generative prediction of laser-induced rocket ignition with dynamic latent space representations

Authors: Tony Zahtila, Ettore Saetta, Murray Cutforth, Davy Brouzet, Diego Rossinelli, Gianluca Iaccarino
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.07525
Pdf URL: https://arxiv.org/pdf/2603.07525
Copy Paste: [[2603.07525]] Generative prediction of laser-induced rocket ignition with dynamic latent space representations(https://arxiv.org/abs/2603.07525)
Keywords: generative
Abstract: Accurate and predictive scale-resolving simulations of laser-ignited rocket engines are highly time-consuming because the problem includes turbulent fuel-oxidizer mixing dynamics, laser-induced energy deposition, and high-speed flame growth. This is compounded by the large design space primarily corresponding to the laser operating conditions and target location. To enable rapid exploration and uncertainty quantification, we propose a data-driven surrogate modeling approach that combines convolutional autoencoders (cAEs) with neural ordinary differential equations (neural ODEs). The present target application of an ML-based surrogate model to leading-edge multi-physics turbulence simulation is part of a paradigm shift in the deployment of surrogate models towards increasing real-world complexity. Sequentially, the cAE spatially compresses high-dimensional flow fields into a low-dimensional latent space, wherein the system's temporal dynamics are learned via neural ODEs. Once trained, the model generates fast spatiotemporal predictions from initial conditions and specified operating inputs. By learning a surrogate to replace the entirety of the time-evolving simulation, the cost of predicting an ignition trial is reduced by several orders of magnitude, allowing efficient exploration of the input parameter space. Further, as the current framework yields a spatiotemporal field prediction, appraisal of the model output's physical grounding is more tractable. This approach marks a significant step toward real-time digital twins for laser-ignited rocket combustors and represents surrogate modeling in a complex system context.
摘要：激光点火火箭发动机的精确和预测性尺度解析模拟非常耗时，因为问题包括湍流燃料氧化剂混合动力学、激光诱导能量沉积和高速火焰生长。这与主要对应于激光操作条件和目标位置的大设计空间相结合。为了实现快速探索和不确定性量化，我们提出了一种数据驱动的代理建模方法，它将卷积自动编码器（cAE）与神经常微分方程（神经 ODE）相结合。目前基于机器学习的替代模型在前沿多物理湍流仿真中的目标应用是替代模型部署向增加现实世界复杂性范式转变的一部分。随后，cAE 在空间上将高维流场压缩到低维潜在空间中，其中系统的时间动态通过神经常微分方程来学习。经过训练后，模型会根据初始条件和指定的操作输入生成快速时空预测。通过学习代理来代替整个时间演化模拟，预测点火试验的成本降低了几个数量级，从而可以有效地探索输入参数空间。此外，由于当前框架产生时空场预测，因此模型输出的物理基础的评估更容易处理。这种方法标志着激光点火火箭燃烧器向实时数字孪生迈出了重要一步，并代表了复杂系统环境中的替代建模。
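
The encode, integrate, decode pipeline described in the abstract above can be sketched as follows: an encoder maps the initial field to a latent vector, a learned right-hand side dz/dt = f(z, t) is integrated in time, and each latent state is decoded back to a field. The sketch uses a fixed-step fourth-order Runge-Kutta integrator. The function names and the toy dynamics are assumptions for illustration, and the real encoder, decoder, and dynamics would be trained networks.

```python
from typing import Callable, List

Vector = List[float]
Dynamics = Callable[[Vector, float], Vector]


def rk4_step(f: Dynamics, z: Vector, t: float, dt: float) -> Vector:
    """One classical fourth-order Runge-Kutta step for dz/dt = f(z, t)."""
    k1 = f(z, t)
    k2 = f([zi + 0.5 * dt * ki for zi, ki in zip(z, k1)], t + 0.5 * dt)
    k3 = f([zi + 0.5 * dt * ki for zi, ki in zip(z, k2)], t + 0.5 * dt)
    k4 = f([zi + dt * ki for zi, ki in zip(z, k3)], t + dt)
    return [
        zi + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
        for zi, a, b, c, d in zip(z, k1, k2, k3, k4)
    ]


def rollout(
    encode: Callable[[Vector], Vector],
    dynamics: Dynamics,
    decode: Callable[[Vector], Vector],
    x0: Vector,
    t_end: float,
    steps: int,
) -> List[Vector]:
    """Encode x0, integrate the latent ODE to t_end, and decode every state."""
    z = encode(x0)
    dt = t_end / steps
    frames = [decode(z)]
    for i in range(steps):
        z = rk4_step(dynamics, z, i * dt, dt)
        frames.append(decode(z))
    return frames


if __name__ == "__main__":
    identity = lambda v: list(v)
    decay = lambda z, t: [-zi for zi in z]  # toy stand-in for a learned dynamics net
    frames = rollout(identity, decay, identity, [1.0], t_end=1.0, steps=100)
    print(frames[-1])
```

Operating inputs such as laser conditions could be passed to the dynamics function as extra arguments. They are omitted here to keep the sketch short.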

Title: How Long Can Unified Multimodal Models Generate Images Reliably? Taming Long-Horizon Interleaved Image Generation via Context Curation

Authors: Haoyu Chen, Qing Liu, Yuqian Zhou, He Zhang, Zhaowen Wang, Mengwei Ren, Jingjing Ren, Xiang Wang, Zhe Lin, Lei Zhu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.07540
Pdf URL: https://arxiv.org/pdf/2603.07540
Copy Paste: [[2603.07540]] How Long Can Unified Multimodal Models Generate Images Reliably? Taming Long-Horizon Interleaved Image Generation via Context Curation(https://arxiv.org/abs/2603.07540)
Keywords: generation
Abstract: Unified multimodal models hold the promise of generating extensive, interleaved narratives, weaving text and imagery into coherent long-form stories. However, current systems suffer from a critical reliability gap: as sequences grow, generation quality rapidly collapses. In this work, we investigate the mechanism behind this failure and argue that it is distinct from standard long-context challenges. We reveal that in generation, accumulated visual history acts as a source of active pollution, a decay governed specifically by the number of image events rather than raw token count. We identify a structural vulnerability where dense visual tokens overwhelm the attention mechanism, creating noise that distorts future synthesis. Guided by these mechanistic insights, we propose UniLongGen, a training-free inference strategy that prioritizes safe conditioning over total recall. Instead of retaining all history, UniLongGen dynamically curates the model's memory, identifying and discarding interfering visual signals based on the model's own internal relevance rankings. Extensive experiments demonstrate that this active forgetting approach is essential for stability: UniLongGen significantly outperforms baselines in long-horizon fidelity and consistency, while simultaneously reducing memory footprint and inference time.
摘要：统一的多模式模型有望生成广泛的、交错的叙事，将文本和图像编织成连贯的长篇故事。然而，当前系统存在严重的可靠性差距：随着序列的增长，生成质量迅速下降。在这项工作中，我们研究了这种失败背后的机制，并认为它与标准的长上下文挑战不同。我们揭示，在一代人中，积累的视觉历史充当了主动污染的来源，这种衰减具体由图像事件的数量而不是原始令牌计数控制。我们发现了一个结构漏洞，即密集的视觉标记压倒了注意力机制，产生了扭曲未来合成的噪音。在这些机制见解的指导下，我们提出了 UniLongGen，这是一种免训练的推理策略，优先考虑安全调节而不是总召回。 UniLongGen 不是保留所有历史记录，而是动态管理模型的内存，根据模型自身的内部相关性排名识别和丢弃干扰视觉信号。大量实验表明，这种主动遗忘方法对于稳定性至关重要：UniLongGen 在长视野保真度和一致性方面显着优于基线，同时减少了内存占用和推理时间。
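
The context-curation idea in the UniLongGen abstract, discarding interfering image events according to a relevance ranking while keeping the rest of the history, might look roughly like the sketch below. The `Event` structure, the `relevance` field, and the fixed `max_images` budget are hypothetical. The abstract only states that the ranking comes from the model's own internal signals and does not specify the selection rule.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Event:
    kind: str                # "text" or "image"
    content: str
    relevance: float = 0.0   # hypothetical score, e.g. derived from attention


def curate_context(history: List[Event], max_images: int) -> List[Event]:
    """Keep all text events and only the max_images most relevant image events.

    The original ordering of the surviving events is preserved.
    """
    images = [e for e in history if e.kind == "image"]
    kept = sorted(images, key=lambda e: e.relevance, reverse=True)[:max_images]
    kept_ids = {id(e) for e in kept}
    return [e for e in history if e.kind != "image" or id(e) in kept_ids]


if __name__ == "__main__":
    history = [
        Event("text", "t1"),
        Event("image", "img1", 0.1),
        Event("image", "img2", 0.9),
        Event("image", "img3", 0.5),
    ]
    print([e.content for e in curate_context(history, max_images=2)])
```

Budgeting by the number of image events rather than by token count follows the abstract's observation that degradation is governed by image events. The specific top-k rule is an assumption.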

Title: CONSTANT: Towards High-Quality One-Shot Handwriting Generation with Patch Contrastive Enhancement and Style-Aware Quantization

Authors: Anh-Duy Le, Van-Linh Pham, Thanh-Nam Vo, Xuan Toan Mai, Tuan-Anh Tran
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2603.07543
Pdf URL: https://arxiv.org/pdf/2603.07543
Copy Paste: [[2603.07543]] CONSTANT: Towards High-Quality One-Shot Handwriting Generation with Patch Contrastive Enhancement and Style-Aware Quantization(https://arxiv.org/abs/2603.07543)
Keywords: generation
Abstract: One-shot styled handwriting image generation, despite impressive results in recent years, remains challenging because the intricate and diverse characteristics of human handwriting are difficult to capture from a single reference image. Existing methods still struggle to generate visually appealing and realistic handwritten images and to adapt to complex, unseen writer styles, as they fail to isolate invariant style features (e.g., slant, stroke width, curvature) while ignoring irrelevant noise. To tackle this problem, we introduce Patch Contrastive Enhancement and Style-Aware Quantization via Denoising Diffusion (CONSTANT), a novel diffusion-based method for one-shot handwriting generation. CONSTANT leverages three key innovations: 1) a Style-Aware Quantization (SAQ) module that models style as discrete visual tokens capturing distinct concepts; 2) a contrastive objective that ensures these tokens are well-separated and meaningful in the style embedding space; 3) a latent patch-based contrastive objective ($L_\mathrm{LatentPCE}$) that helps improve quality and local structure by aligning multiscale spatial patches of generated and real features in latent space. Extensive experiments and analysis on benchmark datasets in multiple languages, including English, Chinese, and our proposed ViHTGen dataset for Vietnamese, demonstrate that our method adapts to new reference styles and produces highly detailed images better than state-of-the-art approaches. Code is available at GitHub.
摘要：尽管近年来取得了令人印象深刻的成果，但一次性风格的手写图像生成仍然具有挑战性，因为仅使用单个参考图像来捕获人类笔迹的复杂多样的特征很困难。现有的方法仍然难以生成视觉上吸引人且逼真的手写图像，并适应复杂的、看不见的书写风格，努力隔离不变的风格特征（例如倾斜、笔画宽度、曲率），同时忽略不相关的噪声。为了解决这个问题，我们引入了通过去噪扩散进行补丁对比增强和风格感知量化（CONSTANT），这是一种通过扩散模型实现的新颖的一次性手写生成。 CONSTANT 利用了三个关键创新：1) 风格感知量化 (SAQ) 模块，将风格建模为捕捉不同概念的离散视觉标记； 2）对比目标，确保这些标记在嵌入样式空间中分离良好且有意义； 3）基于潜在补丁的对比（LLatentPCE）目标通过对齐潜在空间中生成的特征和真实特征的多尺度空间补丁来帮助提高质量和局部结构。对多种语言（包括英语、中文和我们提出的越南语 ViHTGen 数据集）的基准数据集进行了大量实验和分析，证明了我们的方法在适应新的参考样式和生成高度详细的图像方面优于最先进的方法。代码可在 GitHub 上获取

Title: DreamSAC: Learning Hamiltonian World Models via Symmetry Exploration

Authors: Jinzhou Tang, Fan Feng, Minghao Fu, Wenjun Lin, Biwei Huang, Keze Wang
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.07545
Pdf URL: https://arxiv.org/pdf/2603.07545
Copy Paste: [[2603.07545]] DreamSAC: Learning Hamiltonian World Models via Symmetry Exploration(https://arxiv.org/abs/2603.07545)
Keywords: generative
Abstract: Learned world models excel at interpolative generalization but fail at extrapolative generalization to novel physical properties. This limitation arises because they learn statistical correlations rather than the environment's underlying generative rules, such as physical invariances and conservation laws. We argue that learning these invariances is key to robust extrapolation. To achieve this, we first introduce \textbf{Symmetry Exploration}, an unsupervised exploration strategy where an agent is intrinsically motivated by a Hamiltonian-based curiosity bonus to actively probe and challenge its understanding of conservation laws, thereby collecting physically informative data. Second, we design a Hamiltonian-based world model that learns from the collected data, using a novel self-supervised contrastive objective to identify the invariant physical state from raw, view-dependent pixel observations. Our framework, \textbf{DreamSAC}, trained on this actively curated data, significantly outperforms state-of-the-art baselines in 3D physics simulations on tasks requiring extrapolation.
摘要：学习世界模型擅长插值概括，但无法对新的物理属性进行外推概括。出现这种限制是因为它们学习的是统计相关性，而不是环境的潜在生成规则，例如物理不变性和守恒定律。我们认为学习这些不变性是稳健外推的关键。为了实现这一目标，我们首先引入 \textbf{Symmetry Exploration}，这是一种无监督的探索策略，其中代理本质上受到基于哈密顿量的好奇心奖励的激励，主动探索和挑战其对守恒定律的理解，从而收集物理信息数据。其次，我们设计了一个基于哈密顿量的世界模型，该模型从收集的数据中学习，使用新颖的自监督对比目标从原始的、依赖于视图的像素观察中识别不变的物理状态。我们的框架 \textbf{DreamSAC} 在这些主动整理的数据上进行了训练，在需要外推的任务上显着优于 3D 物理模拟中最先进的基线。

Title: ReconDrive: Fast Feed-Forward 4D Gaussian Splatting for Autonomous Driving Scene Reconstruction

Authors: Haibao Yu, Kuntao Xiao, Jiahang Wang, Ruiyang Hao, Yuxin Huang, Guoran Hu, Haifang Qin, Bowen Jing, Yuntian Bo, Ping Luo
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2603.07552
Pdf URL: https://arxiv.org/pdf/2603.07552
Copy Paste: [[2603.07552]] ReconDrive: Fast Feed-Forward 4D Gaussian Splatting for Autonomous Driving Scene Reconstruction(https://arxiv.org/abs/2603.07552)
Keywords: generation
Abstract: High-fidelity visual reconstruction and novel-view synthesis are essential for realistic closed-loop evaluation in autonomous driving. While 4D Gaussian Splatting (4DGS) offers a promising balance of accuracy and efficiency, existing per-scene optimization methods require costly iterative refinement, rendering them unscalable for extensive urban environments. Conversely, current feed-forward approaches often suffer from degraded photometric quality. To address these limitations, we propose ReconDrive, a feed-forward framework that leverages and extends the 3D foundation model VGGT for rapid, high-fidelity 4DGS generation. Our architecture introduces two core adaptations to tailor the foundation model to dynamic driving scenes: (1) Hybrid Gaussian Prediction Heads, which decouple the regression of spatial coordinates and appearance attributes to overcome the photometric deficiencies inherent in generalized foundation features; and (2) a Static-Dynamic 4D Composition strategy that explicitly captures temporal motion via velocity modeling to represent complex dynamic environments. Benchmarked on nuScenes, ReconDrive significantly outperforms existing feed-forward baselines in reconstruction, novel-view synthesis, and 3D perception. It achieves performance competitive with per-scene optimization while being orders of magnitude faster, providing a scalable and practical solution for realistic driving simulation.
摘要：高保真视觉重建和新颖的视图合成对于自动驾驶中的真实闭环评估至关重要。虽然 4D 高斯泼溅 (4DGS) 在准确性和效率之间提供了良好的平衡，但现有的每场景优化方法需要昂贵的迭代细化，导致它们无法扩展到广泛的城市环境。相反，当前的前馈方法常常遭受光度质量下降的困扰。为了解决这些限制，我们提出了 ReconDrive，这是一种前馈框架，它利用和扩展 3D 基础模型 VGGT 来快速、高保真地生成 4DGS。我们的架构引入了两个核心适应性，以根据动态驾驶场景定制基础模型：（1）混合高斯预测头，它将空间坐标和外观属性的回归解耦，以克服广义基础特征固有的光度缺陷； (2) 静态-动态 4D 合成策略，通过速度建模明确捕获时间运动以表示复杂的动态环境。以 nuScenes 为基准，ReconDrive 在重建、新颖视图合成和 3D 感知方面显着优于现有的前馈基线。它实现了与每个场景优化相媲美的性能，同时速度提高了几个数量级，为现实驾驶模拟提供了可扩展且实用的解决方案。

Title: Brain-WM: Brain Glioblastoma World Model

Authors: Chenhui Wang, Boyun Zheng, Liuxin Bao, Zhihao Peng, Peter Y.M. Woo, Hongming Shan, Yixuan Yuan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.07562
Pdf URL: https://arxiv.org/pdf/2603.07562
Copy Paste: [[2603.07562]] Brain-WM: Brain Glioblastoma World Model(https://arxiv.org/abs/2603.07562)
Keywords: generation, generative
Abstract: Precise prognostic modeling of glioblastoma (GBM) under varying treatment interventions is essential for optimizing clinical outcomes. While generative AI has shown promise in simulating GBM evolution, existing methods typically treat interventions as static conditional inputs rather than dynamic decision variables. Consequently, they fail to capture the complex, reciprocal interplay between tumor evolution and treatment response. To bridge this gap, we present Brain-WM, a pioneering brain GBM world model that unifies next-step treatment prediction and future MRI generation, thereby capturing the co-evolutionary dynamics between tumor and treatment. Specifically, Brain-WM encodes spatiotemporal dynamics into a shared latent space for joint autoregressive treatment prediction and flow-based future MRI generation. Then, instead of a conventional monolithic framework, Brain-WM adopts a novel Y-shaped Mixture-of-Transformers (MoT) architecture. This design structurally disentangles heterogeneous objectives, successfully leveraging cross-task synergies while preventing feature collapse. Finally, a synergistic multi-timepoint mask alignment objective explicitly anchors latent representations to anatomically grounded tumor structures and progression-aware semantics. Extensive validation on internal and external multi-institutional cohorts demonstrates the superiority of Brain-WM, achieving 91.5% accuracy in treatment planning and SSIMs of 0.8524, 0.8581, and 0.8404 for FLAIR, T1CE, and T2W sequences, respectively. Ultimately, Brain-WM offers a robust clinical sandbox for optimizing patient healthcare. The source code is made available at this https URL.
摘要：不同治疗干预下胶质母细胞瘤 (GBM) 的精确预后建模对于优化临床结果至关重要。虽然生成式人工智能在模拟 GBM 进化方面表现出了希望，但现有方法通常将干预视为静态条件输入而不是动态决策变量。因此，他们未能捕捉到肿瘤进化和治疗反应之间复杂的相互作用。为了弥补这一差距，我们提出了 Brain-WM，这是一种开创性的大脑 GBM 世界模型，它将下一步治疗预测和未来 MRI 生成结合起来，从而捕获肿瘤和治疗之间的共同进化动态。具体来说，Brain-WM 将时空动力学编码到共享潜在空间中，用于联合自回归治疗预测和基于流的未来 MRI 生成。然后，Brain-WM 采用了一种新颖的 Y 形混合变压器 (MoT) 架构，而不是传统的整体框架。这种设计在结构上消除了异构目标，成功地利用了跨任务协同作用，同时防止了功能崩溃。最后，协同多时间点掩模对齐目标明确地将潜在表示锚定到解剖学基础的肿瘤结构和进展感知语义。对内部和外部多机构队列的广泛验证证明了 Brain-WM 的优越性，治疗计划准确率达到 91.5%，FLAIR、T1CE 和 T2W 序列的 SSIM 分别为 0.8524、0.8581 和 0.8404。最终，Brain-WM 提供了一个强大的临床沙箱，用于优化患者医疗保健。源代码可通过此 https URL 获取。

Title: GRD-Net: Generative-Reconstructive-Discriminative Anomaly Detection with Region of Interest Attention Module

Authors: Niccolò Ferrari, Michele Fraccaroli, Evelina Lamma
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.07566
Pdf URL: https://arxiv.org/pdf/2603.07566
Copy Paste: [[2603.07566]] GRD-Net: Generative-Reconstructive-Discriminative Anomaly Detection with Region of Interest Attention Module(https://arxiv.org/abs/2603.07566)
Keywords: generative
Abstract: Anomaly detection is increasingly used in industrial applications and processes. One of its main fields of application is visual inspection for surface anomaly detection, which aims to spot regions that deviate from regularity and consequently identify abnormal products. Defect localization is a key task that is usually achieved by a basic comparison between the generated image and the original one, followed by blob-analysis or image-editing algorithms in the post-processing step; such post-processing is heavily biased towards the source dataset and fails to generalize. Furthermore, in industrial applications, the whole image is not always of interest; often only one or a few regions of interest (ROIs) contain the anomalies that matter. For these reasons, we propose a new architecture composed of two blocks. The first block is a Generative Adversarial Network (GAN), based on a residual autoencoder (ResAE), that performs reconstruction and denoising, while the second block produces an image segmentation that spots defects. This method learns from a dataset composed of good products and generated synthetic defects. The discriminative network is trained using an ROI for each image in the training dataset, so the network learns in which areas anomalies are relevant. This approach reduces the need for pre-processing algorithms, formerly developed with blob-analysis and image-editing procedures. To test our model we used the challenging MVTec anomaly detection datasets and a large industrial dataset of pharmaceutical BFS strips of vials. The latter constitutes a more realistic use case of the aforementioned network.
摘要：如今，异常检测越来越多地用于工业应用和过程。该设备的主要领域之一是表面异常检测的目视检查，旨在发现偏离规律的区域，从而识别异常产品。缺陷定位是一项关键任务，通常是通过使用生成的图像与原始图像之间的基本比较来实现的，在后处理步骤中实现一些斑点分析或图像编辑算法，这非常偏向源数据集，并且无法泛化。此外，在工业应用中，图像的整体并不总是令人感兴趣的，而是可能是一个或一些感兴趣区域（ROI），其中只有在这些区域中才会发现相关异常。由于这些原因，我们提出了一种由两个块组成的新架构。第一个块是基于残差自动编码器 (ResAE) 的生成对抗网络 (GAN)，用于执行重建和去噪过程，而第二个块则产生图像分割，发现缺陷。该方法从由优质产品和生成的合成缺陷组成的数据集中学习。使用训练数据集中包含的每个图像的 ROI 来训练判别网络。网络将了解哪些区域的异常相关。这种方法保证减少使用预处理算法，这些算法以前是通过斑点分析和图像编辑程序开发的。为了测试我们的模型，我们使用了具有挑战性的 MVTec 异常检测数据集和制药 BFS 小瓶条的工业大型数据集。该集合构成了上述网络的更现实的用例。

Title: Constraints Matrix Diffusion based Generative Neural Solver for Vehicle Routing Problems

Authors: Zhenwei Wang, Tiehua Zhang, Ning Xue, Ender Ozcan, Ling Wang, Ruibin Bai
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.07568
Pdf URL: https://arxiv.org/pdf/2603.07568
Copy Paste: [[2603.07568]] Constraints Matrix Diffusion based Generative Neural Solver for Vehicle Routing Problems(https://arxiv.org/abs/2603.07568)
Keywords: generative
Abstract: Over the past decade, neural network solvers powered by generative artificial intelligence have garnered significant attention in the domain of vehicle routing problems (VRPs), owing to their exceptional computational efficiency and superior reasoning capabilities. In particular, autoregressive solvers integrated with reinforcement learning have emerged as a prominent trend. However, much of the existing work emphasizes large-scale generalization of neural approaches while neglecting the limited robustness of attention-based methods across heterogeneous distributions of problem parameters. Their improvements over heuristic search remain largely restricted to hand-curated, fixed-distribution benchmarks. Furthermore, these architectures tend to degrade significantly when node representations are highly similar or when tasks involve long decision horizons. To address the aforementioned limitations, we propose a novel fusion neural network framework that employs a discrete noise graph diffusion model to learn the underlying constraints of vehicle routing problems and generate a constraint assignment matrix. This matrix is subsequently integrated adaptively into the feature representation learning and decision process of the autoregressive solver, serving as a graph structure mask that facilitates the formation of solutions characterized by both global vision and local feature integration. To the best of our knowledge, this work represents the first comprehensive experimental investigation of neural network model solvers across a 378-combinatorial space spanning four distinct dimensions within the CVRPlib public dataset. Extensive experimental evaluations demonstrate that our proposed fusion model effectively captures and leverages problem constraints, achieving state-of-the-art performance across multiple benchmark datasets.
摘要：在过去的十年中，由生成人工智能驱动的神经网络求解器因其卓越的计算效率和卓越的推理能力而在车辆路径问题（VRP）领域引起了广泛关注。特别是，与强化学习相结合的自回归求解器已经成为一个突出的趋势。然而，现有的许多工作都强调神经方法的大规模泛化，而忽略了基于注意力的方法在问题参数的异构分布上的有限鲁棒性。他们对启发式搜索的改进仍然主要局限于手工策划的固定分布基准。此外，当节点表示高度相似或任务涉及长决策范围时，这些架构往往会显着退化。为了解决上述限制，我们提出了一种新颖的融合神经网络框架，该框架采用离散噪声图扩散模型来学习车辆路径问题的潜在约束并生成约束分配矩阵。该矩阵随后自适应地集成到自回归求解器的特征表示学习和决策过程中，充当图结构掩模，有助于形成以全局视觉和局部特征集成为特征的解决方案。据我们所知，这项工作代表了对 CVRPlib 公共数据集中跨越四个不同维度的 378 个组合空间的神经网络模型求解器的首次全面实验研究。广泛的实验评估表明，我们提出的融合模型有效地捕获和利用问题约束，在多个基准数据集上实现最先进的性能。

Title: Integration of deep generative Anomaly Detection algorithm in high-speed industrial line

Authors: Niccolò Ferrari, Nicola Zanarini, Michele Fraccaroli, Alice Bizzarri, Evelina Lamma
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.07577
Pdf URL: https://arxiv.org/pdf/2603.07577
Copy Paste: [[2603.07577]] Integration of deep generative Anomaly Detection algorithm in high-speed industrial line(https://arxiv.org/abs/2603.07577)
Keywords: generative
Abstract: Industrial visual inspection in pharmaceutical production requires high accuracy under strict constraints on cycle time, hardware footprint, and operational cost. Manual inline inspection is still common, but it is affected by operator variability and limited throughput. Classical rule-based computer vision pipelines are often rigid and difficult to scale to highly variable production scenarios. To address these limitations, we present a semi-supervised anomaly detection framework based on a generative adversarial architecture with a residual autoencoder and a dense bottleneck, specifically designed for online deployment on a high-speed Blow-Fill-Seal (BFS) line. The model is trained only on nominal samples and detects anomalies through reconstruction residuals, providing both classification and spatial localization via heatmaps. The training set contains 2,815,200 grayscale patches. Experiments on a real industrial test kit show high detection performance while satisfying timing constraints compatible with a 500 ms acquisition slot.
摘要：药品生产中的工业视觉检测在周期时间、硬件占用空间和运营成本的严格限制下需要高精度。手动在线检查仍然很常见，但它受到操作员可变性和有限吞吐量的影响。经典的基于规则的计算机视觉管道通常是僵化的，并且难以扩展到高度可变的生产场景。为了解决这些限制，我们提出了一种基于生成对抗架构的半监督异常检测框架，该架构具有残差自动编码器和密集瓶颈，专为高速吹灌封（BFS）生产线上的在线部署而设计。该模型仅在标称样本上进行训练，并通过重建残差检测异常，通过热图提供分类和空间定位。训练集包含 2,815,200 个灰度图块。在真实工业测试套件上进行的实验显示出较高的检测性能，同时满足与 500 ms 采集槽兼容的时序限制。

Title: EmbedTalk: Triplane-Free Talking Head Synthesis using Embedding-Driven Gaussian Deformation

Authors: Arpita Saggar, Jonathan C. Darling, Duygu Sarikaya, David C. Hogg
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.07604
Pdf URL: https://arxiv.org/pdf/2603.07604
Copy Paste: [[2603.07604]] EmbedTalk: Triplane-Free Talking Head Synthesis using Embedding-Driven Gaussian Deformation(https://arxiv.org/abs/2603.07604)
Keywords: generative
Abstract: Real-time talking head synthesis increasingly relies on deformable 3D Gaussian Splatting (3DGS) due to its low latency. Tri-planes are the standard choice for encoding Gaussians prior to deformation, since they provide a continuous domain with explicit spatial relationships. However, tri-plane representations are limited by grid resolution and approximation errors introduced by projecting 3D volumetric fields onto 2D subspaces. Recent work has shown the superiority of learnt embeddings for driving temporal deformations in 4D scene reconstruction. We introduce EmbedTalk, which shows how such embeddings can be leveraged for modelling speech deformations in talking head synthesis. Through comprehensive experiments, we show that EmbedTalk outperforms existing 3DGS-based methods in rendering quality, lip synchronisation, and motion consistency, while remaining competitive with state-of-the-art generative models. Moreover, replacing the tri-plane encoding with learnt embeddings enables significantly more compact models that achieve over 60 FPS on a mobile GPU (RTX 2060 6 GB). Our code will be placed in the public domain on acceptance.
摘要：由于其低延迟，实时头部说话合成越来越依赖于可变形 3D 高斯溅射 (3DGS)。三平面是在变形之前对高斯进行编码的标准选择，因为它们提供了具有明确空间关系的连续域。然而，三平面表示受到网格分辨率和将 3D 体积场投影到 2D 子空间所引入的近似误差的限制。最近的工作表明了学习嵌入在驱动 4D 场景重建中的时间变形方面的优越性。我们引入了 EmbedTalk，它展示了如何利用这种嵌入来建模头部说话合成中的语音变形。通过全面的实验，我们表明 EmbedTalk 在渲染质量、唇形同步和运动一致性方面优于现有的基于 3DGS 的方法，同时与最先进的生成模型保持竞争力。此外，用学习嵌入替换三平面编码可以使模型变得更加紧凑，在移动 GPU (RTX 2060 6 GB) 上实现超过 60 FPS。我们的代码将在接受后置于公共领域。

Title: Looking Into the Water by Unsupervised Learning of the Surface Shape

Authors: Ori Lifschitz, Tali Treibitz, Dan Rosenbaum
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.07614
Pdf URL: https://arxiv.org/pdf/2603.07614
Copy Paste: [[2603.07614]] Looking Into the Water by Unsupervised Learning of the Surface Shape(https://arxiv.org/abs/2603.07614)
Keywords: restoration
Abstract: We address the problem of looking into the water from the air, where we seek to remove image distortions caused by refractions at the water surface. Our approach is based on modeling the different water surface structures at various points in time, assuming the underlying image is constant. To this end, we propose a model that consists of two neural-field networks. The first network predicts the height of the water surface at each spatial position and time, and the second network predicts the image color at each position. Using both networks, we reconstruct the observed sequence of images and can therefore use unsupervised training. We show that using implicit neural representations with periodic activation functions (SIREN) leads to effective modeling of the surface height spatio-temporal signal and its derivative, as required for image reconstruction. Using both simulated and real data we show that our method outperforms the latest unsupervised image restoration approach. In addition, it provides an estimate of the water surface.
摘要：我们解决了从空中观察水的问题，力求消除由水面折射引起的图像失真。我们的方法基于对不同时间点的不同水面结构进行建模，假设底层图像是恒定的。为此，我们提出了一个由两个神经场网络组成的模型。第一个网络预测每个空间位置和时间的水面高度，第二个网络预测每个位置的图像颜色。使用这两个网络，我们重建观察到的图像序列，因此可以使用无监督训练。我们证明，根据图像重建的需要，使用具有周期性激活函数（SIREN）的隐式神经表示可以有效地对表面高度时空信号及其导数进行建模。使用模拟和真实数据，我们表明我们的方法优于最新的无监督图像恢复方法。此外，它还提供了水面的估计。
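
The abstract above relies on periodic (sine) activations because both the surface height and its derivative are needed for image reconstruction. A minimal sketch of why this is convenient, assuming a single scalar sine unit and the frequency scale of 30 commonly used with SIREN (the paper's actual network sizes and settings are not given in the abstract): the derivative of a sine unit is available in closed form and can be checked against a finite difference.

```python
import math

# Frequency scale commonly used with SIREN; assumed here, not taken from the abstract.
OMEGA_0 = 30.0

def sine_unit(x: float, w: float, b: float) -> float:
    """One SIREN-style unit: sin(omega_0 * (w * x + b))."""
    return math.sin(OMEGA_0 * (w * x + b))

def sine_unit_grad(x: float, w: float, b: float) -> float:
    """Closed-form derivative of the unit with respect to its input x."""
    return OMEGA_0 * w * math.cos(OMEGA_0 * (w * x + b))

# Hypothetical weights and input, chosen only for illustration.
x, w, b = 0.2, 0.05, 0.1
h = 1e-6
numeric = (sine_unit(x + h, w, b) - sine_unit(x - h, w, b)) / (2 * h)
analytic = sine_unit_grad(x, w, b)
print(f"analytic={analytic:.6f} numeric={numeric:.6f}")
```

The derivative is itself a shifted sinusoid, so a network built from such units can represent a smooth height field and its slope at the same time.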

Title: Compression as Adaptation: Implicit Visual Representation with Diffusion Foundation Models

Authors: Jiajun He, Zongyu Guo, Zhaoyang Jia, Xiaoyi Zhang, Jiahao Li, Xiao Li, Bin Li, José Miguel Hernández-Lobato, Yan Lu
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2603.07615
Pdf URL: https://arxiv.org/pdf/2603.07615
Copy Paste: [[2603.07615]] Compression as Adaptation: Implicit Visual Representation with Diffusion Foundation Models(https://arxiv.org/abs/2603.07615)
Keywords: generation, generative
Abstract: Modern visual generative models acquire rich visual knowledge through large-scale training, yet existing visual representations (such as pixels, latents, or tokens) remain external to the model and cannot directly exploit this knowledge for compact storage or reuse. In this work, we introduce a new visual representation framework that encodes a signal as a function, which is parametrized by low-rank adaptations attached to a frozen visual generative model. Such implicit representations of visual signals, e.g., an 81-frame video, can further be hashed into a single compact vector, achieving strong perceptual video compression at extremely low bitrates. Beyond basic compression, the functional nature of this representation enables inference-time scaling and control, allowing additional refinement on the compression performance. More broadly, as the implicit representations directly act as a function of the generation process, this suggests a unified framework bridging visual compression and generation.
摘要：现代视觉生成模型通过大规模训练获得丰富的视觉知识，但现有的视觉表示（例如像素、潜在特征或标记）仍然位于模型外部，无法直接利用这些知识进行紧凑存储或重用。在这项工作中，我们引入了一种新的视觉表示框架，它将信号编码为函数，该函数通过附加到冻结视觉生成模型的低阶适应进行参数化。这种视觉信号的隐式表示，例如 81 帧视频，可以进一步散列成单个紧凑向量，以极低的比特率实现强大的感知视频压缩。除了基本压缩之外，这种表示的功能性质还支持推理时间缩放和控制，从而允许对压缩性能进行额外的改进。更广泛地说，由于隐式表示直接充当生成过程的函数，这表明桥接视觉压缩和生成的统一框架。

Title: Evaluating Synthetic Data for Baggage Trolley Detection in Airport Logistics

Authors: Abdeldjalil Taibi, Mohmoud Badlis, Amina Bensalem, Belkacem Zouilekh, Mohammed Brahimi
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.07645
Pdf URL: https://arxiv.org/pdf/2603.07645
Copy Paste: [[2603.07645]] Evaluating Synthetic Data for Baggage Trolley Detection in Airport Logistics(https://arxiv.org/abs/2603.07645)
Keywords: generation
Abstract: Efficient luggage trolley management is critical for reducing congestion and ensuring asset availability in modern airports. Automated detection systems face two main challenges. First, strict security and privacy regulations limit large-scale data collection. Second, existing public datasets lack the diversity, scale, and annotation quality needed to handle dense, overlapping trolley arrangements typical of real-world operations. To address these limitations, we introduce a synthetic data generation pipeline based on a high-fidelity Digital Twin of Algiers International Airport using NVIDIA Omniverse. The pipeline produces richly annotated data with oriented bounding boxes, capturing complex trolley formations, including tightly nested chains. We evaluate YOLO-OBB using five training strategies: real-only, synthetic-only, linear probing, full fine-tuning, and mixed training. This allows us to assess how synthetic data can complement limited real-world annotations. Our results show that mixed training with synthetic data and only 40 percent of real annotations matches or exceeds the full real-data baseline, achieving 0.94 mAP@50 and 0.77 mAP@50-95, while reducing annotation effort by 25 to 35 percent. Multi-seed experiments confirm strong reproducibility with a standard deviation below 0.01 on mAP@50, demonstrating the practical effectiveness of synthetic data for automated trolley detection.
摘要：高效的行李手推车管理对于减少现代机场的拥堵和确保资产可用性至关重要。自动检测系统面临两个主要挑战。首先，严格的安全和隐私法规限制了大规模数据收集。其次，现有的公共数据集缺乏处理现实世界操作中典型的密集、重叠的电车排列所需的多样性、规模和注释质量。为了解决这些限制，我们引入了一种基于使用 NVIDIA Omniverse 的阿尔及尔国际机场高保真数字孪生的合成数据生成管道。该管道生成带有定向边界框的丰富注释数据，捕获复杂的手推车形态，包括紧密嵌套的链条。我们使用五种训练策略来评估 YOLO-OBB：纯真实、纯合成、线性探测、完全微调和混合训练。这使我们能够评估合成数据如何补充有限的现实世界注释。我们的结果表明，使用合成数据和只有 40% 的真实注释进行混合训练可以匹配或超过完整的真实数据基线，达到 0.94 mAP@50 和 0.77 mAP@50-95，同时将注释工作量减少 25% 至 35%。多种子实验证实 mAP@50 上的标准偏差低于 0.01，具有很强的重现性，证明了自动手推车检测的合成数据的实际有效性。

Title: Compressed-Domain-Aware Online Video Super-Resolution

Authors: Yuhang Wang, Hai Li, Shujuan Hou, Zhetao Dong, Xiaoyao Yang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.07694
Pdf URL: https://arxiv.org/pdf/2603.07694
Copy Paste: [[2603.07694]] Compressed-Domain-Aware Online Video Super-Resolution(https://arxiv.org/abs/2603.07694)
Keywords: super-resolution
Abstract: In bandwidth-limited online video streaming, videos are usually downsampled and compressed. Although recent online video super-resolution (online VSR) approaches achieve promising results, they are still compute-intensive and fall short of real-time processing at higher resolutions, due to complex motion estimation for alignment and redundant processing of consecutive frames. To address these issues, we propose a compressed-domain-aware network (CDA-VSR) for online VSR, which utilizes compressed-domain information, including motion vectors, residual maps, and frame types to balance quality and efficiency. Specifically, we propose a motion-vector-guided deformable alignment module that uses motion vectors for coarse warping and learns only local residual offsets for fine-tuned adjustments, thereby maintaining accuracy while reducing computation. Then, we utilize a residual map gated fusion module to derive spatial weights from residual maps, suppressing mismatched regions and emphasizing reliable details. Further, we design a frame-type-aware reconstruction module for adaptive compute allocation across frame types, balancing accuracy and efficiency. On the REDS4 dataset, our CDA-VSR surpasses the state-of-the-art method TMP, with a maximum PSNR improvement of 0.13 dB while delivering more than double the inference speed. The code will be released at this https URL.
摘要：在带宽有限的在线视频流中，视频通常会被下采样和压缩。尽管最近的在线视频超分辨率（在线VSR）方法取得了可喜的结果，但由于对齐的复杂运动估计和连续帧的冗余处理，它们仍然是计算密集型的，并且无法实现更高分辨率的实时处理。为了解决这些问题，我们提出了一种用于在线 VSR 的压缩域感知网络（CDA-VSR），它利用压缩域信息（包括运动向量、残差图和帧类型）来平衡质量和效率。具体来说，我们提出了一种运动矢量引导的可变形对齐模块，该模块使用运动矢量进行粗略变形，并仅学习局部残差偏移以进行微调调整，从而在减少计算量的同时保持精度。然后，我们利用残差图门控融合模块从残差图导出空间权重，抑制不匹配的区域并强调可靠的细节。此外，我们设计了一个帧类型感知重建模块，用于跨帧类型的自适应计算分配，平衡准确性和效率。在 REDS4 数据集上，我们的 CDA-VSR 超越了最先进的方法 TMP，最大 PSNR 提高了 0.13 dB，同时推理速度提高了一倍以上。代码将在此 https URL 发布。

Title: Learning Context-Adaptive Motion Priors for Masked Motion Diffusion Models with Efficient Kinematic Attention Aggregation

Authors: Junkun Jiang, Jie Chen, Ho Yin Au, Jingyu Xiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.07697
Pdf URL: https://arxiv.org/pdf/2603.07697
Copy Paste: [[2603.07697]] Learning Context-Adaptive Motion Priors for Masked Motion Diffusion Models with Efficient Kinematic Attention Aggregation(https://arxiv.org/abs/2603.07697)
Keywords: generative
Abstract: Vision-based motion capture solutions often struggle with occlusions, which result in the loss of critical joint information and hinder accurate 3D motion reconstruction. Other wearable alternatives also suffer from noisy or unstable data, often requiring extensive manual cleaning and correction to achieve reliable results. To address these challenges, we introduce the Masked Motion Diffusion Model (MMDM), a diffusion-based generative reconstruction framework that enhances incomplete or low-confidence motion data using partially available high-quality reconstructions within a Masked Autoencoder architecture. Central to our design is the Kinematic Attention Aggregation (KAA) mechanism, which enables efficient, deep, and iterative encoding of both joint-level and pose-level features, capturing structural and temporal motion patterns essential for task-specific reconstruction. We focus on learning context-adaptive motion priors, specialized structural and temporal features extracted by the same reusable architecture, where each learned prior emphasizes different aspects of motion dynamics and is specifically efficient for its corresponding task. This enables the architecture to adaptively specialize without altering its structure. Such versatility allows MMDM to efficiently learn motion priors tailored to scenarios such as motion refinement, completion, and in-betweening. Extensive evaluations on public benchmarks demonstrate that MMDM achieves strong performance across diverse masking strategies and task settings. The source code is available at this https URL.
摘要：基于视觉的运动捕捉解决方案通常会遇到遮挡问题，从而导致关键关节信息丢失并阻碍准确的 3D 运动重建。其他可穿戴替代品也受到噪声或不稳定数据的影响，通常需要大量的手动清理和校正才能获得可靠的结果。为了应对这些挑战，我们引入了掩蔽运动扩散模型（MMDM），这是一种基于扩散的生成重建框架，它使用掩蔽自动编码器架构中部分可用的高质量重建来增强不完整或低置信度的运动数据。我们设计的核心是运动学注意力聚合（KAA）机制，它能够对关节级和姿势级特征进行高效、深度和迭代的编码，捕获对于特定任务重建至关重要的结构和时间运动模式。我们专注于学习上下文自适应运动先验，即由相同的可重用架构提取的专门结构和时间特征，其中每个学习的先验都强调运动动力学的不同方面，并且对于其相应的任务特别有效。这使得架构能够在不改变其结构的情况下自适应地专业化。这种多功能性使 MMDM 能够有效地学习针对运动细化、完成和中间等场景定制的运动先验。对公共基准的广泛评估表明，MMDM 在不同的掩蔽策略和任务设置中取得了出色的性能。源代码可从此 https URL 获取。

Title: TDM-R1: Reinforcing Few-Step Diffusion Models with Non-Differentiable Reward

Authors: Yihong Luo, Tianyang Hu, Weijian Luo, Jing Tang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.07700
Pdf URL: https://arxiv.org/pdf/2603.07700
Copy Paste: [[2603.07700]] TDM-R1: Reinforcing Few-Step Diffusion Models with Non-Differentiable Reward(https://arxiv.org/abs/2603.07700)
Keywords: generation, generative
Abstract: While few-step generative models have enabled powerful image and video generation at significantly lower cost, generic reinforcement learning (RL) paradigms for few-step models remain an unsolved problem. Existing RL approaches for few-step diffusion models strongly rely on back-propagating through differentiable reward models, thereby excluding the majority of important real-world reward signals, e.g., non-differentiable rewards such as humans' binary likeness, object counts, etc. To properly incorporate non-differentiable rewards to improve few-step generative models, we introduce TDM-R1, a novel reinforcement learning paradigm built upon a leading few-step model, Trajectory Distribution Matching (TDM). TDM-R1 decouples the learning process into surrogate reward learning and generator learning. Furthermore, we developed practical methods to obtain per-step reward signals along the deterministic generation trajectory of TDM, resulting in a unified RL post-training method that significantly improves few-step models' ability with generic rewards. We conduct extensive experiments spanning text rendering, visual quality, and preference alignment. All results demonstrate that TDM-R1 is a powerful reinforcement learning paradigm for few-step text-to-image models, achieving state-of-the-art reinforcement learning performance on both in-domain and out-of-domain metrics. Furthermore, TDM-R1 also scales effectively to the recent strong Z-Image model, consistently outperforming both its 100-NFE and few-step variants with only 4 NFEs. Project page: this https URL
摘要：虽然少步生成模型能够以显着降低的成本生成强大的图像和视频，但少步模型的通用强化学习（RL）范式仍然是一个未解决的问题。现有的用于少步扩散模型的强化学习方法强烈依赖于通过可微奖励模型进行反向传播，从而排除了大多数重要的现实世界奖励信号，例如不可微奖励，例如人类的二元相似性、物体计数等。为了正确合并不可微奖励来改进少步生成模型，我们引入了 TDM-R1，这是一种基于领先的少步模型轨迹分布匹配（TDM）构建的新型强化学习范式。 TDM-R1 将学习过程解耦为代理奖励学习和生成器学习。此外，我们开发了实用方法来沿着 TDM 的确定性生成轨迹获取每步奖励信号，从而形成统一的 RL 后训练方法，显着提高了少步模型具有通用奖励的能力。我们进行了广泛的实验，包括文本渲染、视觉质量和偏好对齐。所有结果都表明，TDM-R1 是一种强大的强化学习范式，适用于少步文本到图像模型，在域内和域外指标上均实现了最先进的强化学习性能。此外，TDM-R1 还可以有效地扩展到最近强大的 Z-Image 模型，始终优于其 100-NFE 和只有 4 个 NFE 的少步变体。项目页面：此 https URL

Title: PARSE: Part-Aware Relational Spatial Modeling

Authors: Yinuo Bai, Peijun Xu, Kuixiang Shao, Yuyang Jiao, Jingxuan Zhang, Kaixin Yao, Jiayuan Gu, Jingyi Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.07704
Pdf URL: https://arxiv.org/pdf/2603.07704
Copy Paste: [[2603.07704]] PARSE: Part-Aware Relational Spatial Modeling(https://arxiv.org/abs/2603.07704)
Keywords: generation
Abstract: Inter-object relations underpin spatial intelligence, yet existing representations -- linguistic prepositions or object-level scene graphs -- are too coarse to specify which regions actually support, contain, or contact one another, leading to ambiguous and physically inconsistent layouts. To address these ambiguities, a part-level formulation is needed; therefore, we introduce PARSE, a framework that explicitly models how object parts interact to determine feasible and spatially grounded scene configurations. PARSE centers on the Part-centric Assembly Graph (PAG), which encodes geometric relations between specific object parts, and a Part-Aware Spatial Configuration Solver that converts these relations into geometric constraints to assemble collision-free, physically valid scenes. Using PARSE, we build PARSE-10K, a dataset of 10,000 3D indoor scenes constructed from real-image layout priors and a curated part-annotated shape database, each with dense contact structures and a part-level contact graph. With this structured, spatially grounded supervision, fine-tuning Qwen3-VL on PARSE-10K yields stronger object-level layout reasoning and more accurate part-level relation understanding; furthermore, leveraging PAGs as structural priors in 3D generation models leads to scenes with substantially improved physical realism and structural complexity. Together, these results show that PARSE significantly advances geometry-grounded spatial reasoning and supports the generation of physically consistent 3D scenes.
摘要：对象间关系支撑着空间智能，但现有的表示（语言介词或对象级场景图）过于粗糙，无法指定哪些区域实际支持、包含或相互接触，从而导致布局模糊且物理不一致。为了解决这些歧义，需要一个零件级的表述；因此，我们引入了 PARSE，一个显式模拟对象部分如何交互以确定可行且基于空间的场景配置的框架。 PARSE 以以零件为中心的装配图 (PAG) 为中心，它对特定对象零件之间的几何关系进行编码，以及零件感知空间配置解算器，将这些关系转换为几何约束，以组装无碰撞、物理有效的场景。使用 PARSE，我们构建了 PARSE-10K，这是一个包含 10,000 个 3D 室内场景的数据集，由真实图像布局先验和精选的部分注释形状数据库构建，每个数据库都具有密集的接触结构和部分级接触图。通过这种结构化的、基于空间的监督，在 PARSE-10K 上微调 Qwen3-VL 可以产生更强的对象级布局推理和更准确的部分级关系理解；此外，利用 PAG 作为 3D 生成模型中的结构先验，可以显着提高场景的物理真实性和结构复杂性。总之，这些结果表明 PARSE 显着推进了基于几何的空间推理，并支持生成物理一致的 3D 场景。

Title: Uncertainty-Gated Generative Modeling

Authors: Xingrui Gu, Haixi Zhang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.07753
Pdf URL: https://arxiv.org/pdf/2603.07753
Copy Paste: [[2603.07753]] Uncertainty-Gated Generative Modeling(https://arxiv.org/abs/2603.07753)
Keywords: generation, generative
Abstract: Financial time-series forecasting is a high-stakes problem where regime shifts and shocks make point-accurate yet overconfident models dangerous. We propose Uncertainty-Gated Generative Modeling (UGGM), which treats uncertainty as an internal control signal that gates (i) representation via gated reparameterization, (ii) propagation via similarity and confidence routing, and (iii) generation via uncertainty-controlled predictive distributions, together with uncertainty-driven regularization and calibration to curb miscalibration. Instantiated on Weak Innovation AutoEncoder (WIAE-GPF), our UG-WIAE-GPF significantly improves risk-sensitive forecasting, delivering a 63.5% MSE reduction on NYISO (0.3508 $\rightarrow$ 0.1281), with improved robustness under shock intervals (mSE: 0.2739 $\rightarrow$ 0.1748).
摘要：金融时间序列预测是一个高风险问题，政权更迭和冲击使得点准确但过度自信的模型变得危险。我们提出了不确定性门控生成模型（UGGM），它将不确定性视为内部控制信号，通过门控重新参数化来控制表示，（ii）通过相似性和置信度路由进行传播，以及（iii）通过不确定性控制的预测分布进行生成，以及不确定性驱动的正则化和校准以遏制错误校准。我们的 UG-WIAE-GPF 在弱创新自动编码器 (WIAE-GPF) 上实例化，显着改进了风险敏感预测，在 NYISO 上实现了 63.5% 的 MSE 降低 (0.3508 $\rightarrow$ 0.1281)，并提高了冲击区间下的鲁棒性 (mSE: 0.2739 $\rightarrow$ 0.1748)。
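
The reported percentages can be reproduced from the before/after error values quoted in the abstract. The short check below is only an arithmetic sketch; the helper name `relative_reduction` is illustrative and not part of the paper.

```python
def relative_reduction(before: float, after: float) -> float:
    """Percentage drop when an error metric goes from `before` to `after`."""
    return 100.0 * (before - after) / before

# Values quoted in the abstract.
overall = relative_reduction(0.3508, 0.1281)  # MSE on NYISO
shock = relative_reduction(0.2739, 0.1748)    # error under shock intervals

print(f"overall reduction: {overall:.1f}%")
print(f"shock-interval reduction: {shock:.1f}%")
```

The first value matches the 63.5% stated in the abstract; the shock-interval figures correspond to a reduction of roughly 36%.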

Title: Geometric Knowledge-Assisted Federated Dual Knowledge Distillation Approach Towards Remote Sensing Satellite Imagery

Authors: Luyao Zou, Fei Pan, Jueying Li, Yan Kyaw Tun, Apurba Adhikary, Zhu Han, Hayoung Oh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.07774
Pdf URL: https://arxiv.org/pdf/2603.07774
Copy Paste: [[2603.07774]] Geometric Knowledge-Assisted Federated Dual Knowledge Distillation Approach Towards Remote Sensing Satellite Imagery(https://arxiv.org/abs/2603.07774)
Keywords: generation
Abstract: Federated learning (FL) has recently become a promising solution for analyzing remote sensing satellite imagery (RSSI). However, the large scale and inherent data heterogeneity of images collected from multiple satellites, where the local data distribution of each satellite differs from the global one, present significant challenges to effective model training. To address this issue, we propose a Geometric Knowledge-Guided Federated Dual Knowledge Distillation (GK-FedDKD) framework for RSSI analysis. In our approach, each local client first distills a teacher encoder (TE) from multiple student encoders (SEs) trained with unlabeled augmented data. The TE is then connected with a shared classifier to form a teacher network (TN) that supervises the training of a new student network (SN). The intermediate representations of the TN are used to compute local covariance matrices, which are aggregated at the server to generate global geometric knowledge (GGK). This GGK is subsequently employed for local embedding augmentation to further guide SN training. We also design a novel loss function and a multi-prototype generation pipeline to stabilize the training process. Evaluation over multiple datasets shows that the proposed GK-FedDKD approach is superior to the considered state-of-the-art baselines, e.g., the proposed approach with the Swin-T backbone surpasses previous SOTA approaches by an average of 68.89% on the EuroSAT dataset.
摘要：联邦学习 (FL) 最近已成为分析遥感卫星图像 (RSSI) 的一种有前景的解决方案。然而，从多颗卫星收集的图像的大规模和固有的数据异构性，其中每颗卫星的本地数据分布与全球数据分布不同，对有效的模型训练提出了重大挑战。为了解决这个问题，我们提出了一种用于 RSSI 分析的几何知识引导联合双知识蒸馏 (GK-FedDKD) 框架。在我们的方法中，每个本地客户端首先从使用未标记的增强数据训练的多个学生编码器（SE）中提取教师编码器（TE）。然后，TE 与共享分类器连接，形成教师网络 (TN)，监督新学生网络 (SN) 的训练。 TN 的中间表示用于计算局部协方差矩阵，这些矩阵在服务器上聚合以生成全局几何知识 (GGK)。该 GGK 随后用于局部嵌入增强，以进一步指导 SN 训练。我们还设计了一种新颖的损失函数和多原型生成管道来稳定训练过程。对多个数据集的评估表明，所提出的 GK-FedDKD 方法优于所考虑的最先进的基线，例如，所提出的采用 Swin-T 主干的方法在 EuroSAT 数据集上平均超过了之前的 SOTA 方法 68.89%。

Title: Breaking Training Bottlenecks: Effective and Stable Reinforcement Learning for Coding Models

Authors: Zongqian Li, Shaohan Huang, Zewen Chi, Yixuan Su, Lexin Zhou, Li Dong, Nigel Collier, Furu Wei
Subjects: cs.LG, cs.CL, cs.GL
Abstract URL: https://arxiv.org/abs/2603.07777
Pdf URL: https://arxiv.org/pdf/2603.07777
Copy Paste: [[2603.07777]] Breaking Training Bottlenecks: Effective and Stable Reinforcement Learning for Coding Models(https://arxiv.org/abs/2603.07777)
Keywords: generation
Abstract: Modern code generation models exhibit longer outputs, accelerated capability growth, and changed training dynamics, rendering traditional training methodologies, algorithms, and datasets ineffective for improving their performance. To address these training bottlenecks, we propose MicroCoder-GRPO, an improved Group Relative Policy Optimization approach with three innovations: conditional truncation masking to improve long output potential while maintaining training stability, diversity-determined temperature selection to maintain and encourage output diversity, and removal of KL loss with high clipping ratios to facilitate solution diversity. MicroCoder-GRPO achieves up to 17.6% relative improvement over strong baselines on LiveCodeBench v6, with more pronounced gains under extended context evaluation. Additionally, we release MicroCoder-Dataset, a more challenging training corpus that achieves 3x larger performance gains than mainstream datasets on LiveCodeBench v6 within 300 training steps, and MicroCoder-Evaluator, a robust framework with approximately 25% improved evaluation accuracy and around 40% faster execution. Through comprehensive analysis across more than thirty controlled experiments, we reveal 34 training insights across seven main aspects, demonstrating that properly trained models can achieve competitive performance with larger counterparts.
摘要：现代代码生成模型表现出更长的输出、加速的能力增长和改变的训练动态，使得传统的训练方法、算法和数据集无法有效提高其性能。为了解决这些训练瓶颈，我们提出了 MicroCoder-GRPO，这是一种改进的组相对策略优化方法，具有三项创新：条件截断掩蔽以提高长期输出潜力，同时保持训练稳定性；多样性确定的温度选择以维持和鼓励输出多样性；以及以高剪裁比消除 KL 损失以促进解决方案多样性。与 LiveCodeBench v6 的强大基线相比，MicroCoder-GRPO 实现了高达 17.6% 的相对改进，并且在扩展上下文评估下的增益更为明显。此外，我们还发布了 MicroCoder-Dataset，这是一个更具挑战性的训练语料库，在 300 个训练步骤内实现了 LiveCodeBench v6 上主流数据集 3 倍的性能提升；以及 MicroCoder-Evaluator，这是一个强大的框架，评估精度提高了约 25%，执行速度提高了约 40%。通过对 30 多个对照实验的综合分析，我们揭示了七个主要方面的 34 个训练见解，证明经过适当训练的模型可以实现与大型模型相比的竞争性能。
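
For orientation, the group-relative advantage at the core of standard Group Relative Policy Optimization can be sketched as below. This is the generic GRPO normalization only; the three MicroCoder-GRPO modifications named in the abstract (conditional truncation masking, diversity-determined temperature selection, KL removal with high clipping ratios) are not specified there and are not reproduced. The function name, the `eps` stabilizer, and the use of the population standard deviation are assumptions made for this example.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each sample's reward against its own group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical pass/fail rewards for four completions of one coding prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Completions that beat the group mean receive positive advantages and the rest negative ones, so no separate value network is needed to form a baseline.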

Title: HybridStitch: Pixel and Timestep Level Model Stitching for Diffusion Acceleration

Authors: Desen Sun, Jason Hon, Jintao Zhang, Sihang Liu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.07815
Pdf URL: https://arxiv.org/pdf/2603.07815
Copy Paste: [[2603.07815]] HybridStitch: Pixel and Timestep Level Model Stitching for Diffusion Acceleration(https://arxiv.org/abs/2603.07815)
Keywords: generation
Abstract: Diffusion models have demonstrated a remarkable ability in Text-to-Image (T2I) generation applications. Despite the advanced generation output, they suffer from heavy computation overhead, especially for large models that contain tens of billions of parameters. Prior work has illustrated that replacing part of the denoising steps with a smaller model still maintains the generation quality. However, these methods only focus on saving computation for some timesteps, ignoring the difference in compute demand within one timestep. In this work, we propose HybridStitch, a new T2I generation paradigm that treats generation like editing. Specifically, we introduce a hybrid stage that jointly incorporates both the large model and the small model. HybridStitch separates the entire image into two regions: one that is relatively easy to render, enabling an early transition to the smaller model, and another that is more complex and therefore requires refinement by the large model. HybridStitch employs the small model to construct a coarse sketch while exploiting the large model to edit and refine the complex regions. According to our evaluation, HybridStitch achieves 1.83$\times$ speedup on Stable Diffusion 3, which is faster than all existing mixture-of-model methods.
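The pixel-level part of HybridStitch can be pictured as a masked merge of two model outputs. The sketch below is only an illustration under stated assumptions: `stitch`, the nested-list image layout, and the boolean `complex_mask` are hypothetical names, and how the paper derives the mask is not described in the abstract.

```python
def stitch(small_out, large_out, complex_mask):
    """Merge two same-sized 2D outputs pixel by pixel.

    Where complex_mask is True the large model's value is kept;
    elsewhere the cheaper small model's value is used.
    """
    return [
        [l if m else s for s, l, m in zip(s_row, l_row, m_row)]
        for s_row, l_row, m_row in zip(small_out, large_out, complex_mask)
    ]


small = [[1, 1], [1, 1]]
large = [[9, 9], [9, 9]]
mask = [[False, True], [False, False]]
stitched = stitch(small, large, mask)
```

In this toy case only the top-right pixel is treated as complex, so the large model would need to refine only that region.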

Title: Guess & Guide: Gradient-Free Zero-Shot Diffusion Guidance

Authors: Abduragim Shtanchaev, Albina Ilina, Yazid Janati, Arip Asadulaev, Martin Takác, Eric Moulines
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.07860
Pdf URL: https://arxiv.org/pdf/2603.07860
Copy Paste: [[2603.07860]] Guess & Guide: Gradient-Free Zero-Shot Diffusion Guidance(https://arxiv.org/abs/2603.07860)
Keywords: generation
Abstract: Pretrained diffusion models serve as effective priors for Bayesian inverse problems. These priors enable zero-shot generation by sampling from the conditional distribution, which avoids the need for task-specific retraining. However, a major limitation of existing methods is their reliance on surrogate likelihoods that require vector-Jacobian products at each denoising step, creating a substantial computational burden. To address this, we introduce a lightweight likelihood surrogate that eliminates the need to calculate gradients through the denoiser network. This enables us to handle diverse inverse problems without backpropagation overhead. Experiments confirm that using our method, the inference cost drops dramatically. At the same time, our approach delivers the highest results in multiple tasks. Broadly speaking, we propose the fastest and Pareto optimal method for Bayesian inverse problems.

Title: LeJOT-AutoML: LLM-Driven Feature Engineering for Job Execution Time Prediction in Databricks Cost Optimization

Authors: Lizhi Ma, Yi-Xiang Hu, Yihui Ren, Feng Wu, Xiang-Yang Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.07897
Pdf URL: https://arxiv.org/pdf/2603.07897
Copy Paste: [[2603.07897]] LeJOT-AutoML: LLM-Driven Feature Engineering for Job Execution Time Prediction in Databricks Cost Optimization(https://arxiv.org/abs/2603.07897)
Keywords: generation
Abstract: Databricks job orchestration systems (e.g., LeJOT) reduce cloud costs by selecting low-priced compute configurations while meeting latency and dependency constraints. Accurate execution-time prediction under heterogeneous instance types and non-stationary runtime conditions is therefore critical. Existing pipelines rely on static, manually engineered features that under-capture runtime effects (e.g., partition pruning, data skew, and shuffle amplification), and predictive signals are scattered across logs, metadata, and job scripts, lengthening update cycles and increasing engineering overhead. We present LeJOT-AutoML, an agent-driven AutoML framework that embeds large language model agents throughout the ML lifecycle. LeJOT-AutoML combines retrieval-augmented generation over a domain knowledge base with a Model Context Protocol toolchain (log parsers, metadata queries, and a read-only SQL sandbox) to analyze job artifacts, synthesize and validate feature-extraction code via safety gates, and train/select predictors. This design materializes runtime-derived features that are difficult to obtain through static analysis alone. On enterprise Databricks workloads, LeJOT-AutoML generates over 200 features and reduces the feature-engineering and evaluation loop from weeks to 20-30 minutes, while maintaining competitive prediction accuracy. Integrated into the LeJOT pipeline, it enables automated continuous model updates and achieves 19.01% cost savings in our deployment setting through improved orchestration.

Title: Enhancing Unregistered Hyperspectral Image Super-Resolution via Unmixing-based Abundance Fusion Learning

Authors: Yingkai Zhang, Tao Zhang, Jing Nie, Ying Fu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.07918
Pdf URL: https://arxiv.org/pdf/2603.07918
Copy Paste: [[2603.07918]] Enhancing Unregistered Hyperspectral Image Super-Resolution via Unmixing-based Abundance Fusion Learning(https://arxiv.org/abs/2603.07918)
Keywords: super-resolution
Abstract: Unregistered hyperspectral image (HSI) super-resolution (SR) typically aims to enhance a low-resolution HSI using an unregistered high-resolution reference image. In this paper, we propose an unmixing-based fusion framework that decouples spatial-spectral information to simultaneously mitigate the impact of unregistered fusion and enhance the learnability of SR models. Specifically, we first utilize singular value decomposition for initial spectral unmixing, preserving the original endmembers while dedicating the subsequent network to enhancing the initial abundance map. To leverage the spatial texture of the unregistered reference, we introduce a coarse-to-fine deformable aggregation module, which first estimates a pixel-level flow and a similarity map using a coarse pyramid predictor. It further performs fine sub-pixel refinement to achieve deformable aggregation of the reference features. The aggregative features are then refined via a series of spatial-channel abundance cross-attention blocks. Furthermore, a spatial-channel modulated fusion module is presented to merge encoder-decoder features using dynamic gating weights, yielding a high-quality, high-resolution HSI. Experimental results on simulated and real datasets confirm that our proposed method achieves state-of-the-art super-resolution performance. The code will be available at this https URL.

Title: Text to Automata Diagrams: Comparing TikZ Code Generation with Direct Image Synthesis

Authors: Ethan Young, Zichun Wang, Aiden Taylor, Chance Jewell, Julian Myers, Satya Sri Rajiteswari Nimmagadda, Anthony White, Aniruddha Maiti, Ananya Jana
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.07936
Pdf URL: https://arxiv.org/pdf/2603.07936
Copy Paste: [[2603.07936]] Text to Automata Diagrams: Comparing TikZ Code Generation with Direct Image Synthesis(https://arxiv.org/abs/2603.07936)
Keywords: generation
Abstract: Diagrams are widely used in teaching computer science courses. They are useful in subjects such as automata and formal languages, data structures, etc. These diagrams, often drawn by students during exams or assignments, vary in structure, layout, and correctness. This study examines whether current vision-language and large language models can process such diagrams and produce accurate textual and digital representations. In this study, scanned student-drawn diagrams are used as input. Then, textual descriptions are generated from these images using a vision-language model. The descriptions are checked and revised by human reviewers to make them accurate. Both the generated and the revised descriptions are then fed to a large language model to generate TikZ code. The resulting diagrams are compiled and then evaluated against the original scanned diagrams. We found that descriptions generated directly from images by vision-language models are often incorrect, and that human correction can substantially improve their quality. This research can help computer science education by paving the way for automated grading and feedback and by creating more accessible instructional materials.

Title: ELLMob: Event-Driven Human Mobility Generation with Self-Aligned LLM Framework

Authors: Yusong Wang, Chuang Yang, Jiawei Wang, Xiaohang Xu, Jiayi Xu, Dongyuan Li, Chuan Xiao, Renhe Jiang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.07946
Pdf URL: https://arxiv.org/pdf/2603.07946
Copy Paste: [[2603.07946]] ELLMob: Event-Driven Human Mobility Generation with Self-Aligned LLM Framework(https://arxiv.org/abs/2603.07946)
Keywords: generation
Abstract: Human mobility generation aims to synthesize plausible trajectory data, which is widely used in urban system research. While Large Language Model-based methods excel at generating routine trajectories, they struggle to capture deviated mobility during large-scale societal events. This limitation stems from two critical gaps: (1) the absence of event-annotated mobility datasets for design and evaluation, and (2) the inability of current frameworks to reconcile the competition between users' habitual patterns and event-imposed constraints when making trajectory decisions. This work addresses these gaps with a twofold contribution. First, we construct the first event-annotated mobility dataset covering three major events: Typhoon Hagibis, COVID-19, and the Tokyo 2021 Olympics. Second, we propose ELLMob, a self-aligned LLM framework that first extracts competing rationales between habitual patterns and event constraints, based on Fuzzy-Trace Theory, and then iteratively aligns them to generate trajectories that are both habitually grounded and event-responsive. Extensive experiments show that ELLMob outperforms state-of-the-art baselines across all events, demonstrating its effectiveness. Our code and datasets are available at this https URL.

Title: SGG-R$^{\rm 3}$: From Next-Token Prediction to End-to-End Unbiased Scene Graph Generation

Authors: Jiaye Feng, Qixiang Yin, Yuankun Liu, Tong Mo, Weiping Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.07961
Pdf URL: https://arxiv.org/pdf/2603.07961
Copy Paste: [[2603.07961]] SGG-R$^{\rm 3}$: From Next-Token Prediction to End-to-End Unbiased Scene Graph Generation(https://arxiv.org/abs/2603.07961)
Keywords: generation
Abstract: Scene Graph Generation (SGG) structures visual scenes as graphs of objects and their relations. While Multimodal Large Language Models (MLLMs) have advanced end-to-end SGG, current methods are hindered by both a lack of task-specific structured reasoning and the challenges of sparse, long-tailed relation distributions, resulting in incomplete scene graphs characterized by low recall and biased predictions. To address these issues, we introduce SGG-R$^{\rm 3}$, a structured reasoning framework that integrates task-specific chain-of-thought (CoT)-guided supervised fine-tuning (SFT) and reinforcement learning (RL) with group sequence policy optimization (GSPO), designed to engage in three sequential stages to achieve end-to-end unbiased scene graph generation. During the SFT phase, we propose a relation augmentation strategy that leverages an MLLM and is refined via embedding similarity filtering to alleviate relation sparsity. Subsequently, a stage-aligned reward scheme optimizes the procedural reasoning during RL. Specifically, we propose a novel dual-granularity reward that integrates fine-grained and coarse-grained relation rewards, simultaneously mitigating the long-tail issue via frequency-based adaptive weighting of predicates and improving relation coverage through semantic clustering. Experiments on two benchmarks show that SGG-R$^{\rm 3}$ achieves superior performance compared to existing methods, demonstrating the effectiveness and generalization of the framework.
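Frequency-based weighting of predicates, as mentioned for the SGG-R$^{\rm 3}$ reward, can be illustrated with a small sketch. The inverse-square-root rule and the normalization to a mean weight of 1 are assumptions chosen for illustration; the abstract does not give the paper's actual weighting formula.

```python
import math
from collections import Counter


def predicate_weights(predicates):
    """Give rare predicates larger weights than frequent ones.

    Assumed rule: weight is proportional to 1 / sqrt(count), rescaled so
    that the average weight over distinct predicates equals 1.
    """
    counts = Counter(predicates)
    raw = {p: 1.0 / math.sqrt(c) for p, c in counts.items()}
    scale = len(raw) / sum(raw.values())
    return {p: w * scale for p, w in raw.items()}


# Toy long-tailed label set: "on" is common, "riding" is rare.
labels = ["on"] * 8 + ["riding"] * 2
weights = predicate_weights(labels)
```

A reward multiplied by such weights would pay more for a correct rare relation than for a correct common one, which is one simple way to counter long-tail bias.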

Title: On the Feasibility and Opportunity of Autoregressive 3D Object Detection

Authors: Zanming Huang, Jinsu Yoo, Sooyoung Jeon, Zhenzhen Liu, Mark Campbell, Kilian Q Weinberger, Bharath Hariharan, Wei-Lun Chao, Katie Z Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.07985
Pdf URL: https://arxiv.org/pdf/2603.07985
Copy Paste: [[2603.07985]] On the Feasibility and Opportunity of Autoregressive 3D Object Detection(https://arxiv.org/abs/2603.07985)
Keywords: generation
Abstract: LiDAR-based 3D object detectors typically rely on proposal heads with hand-crafted components like anchor assignment and non-maximum suppression (NMS), complicating training and limiting extensibility. We present AutoReg3D, an autoregressive 3D detector that casts detection as sequence generation. Given point-cloud features, AutoReg3D emits objects in a range-causal (near-to-far) order and encodes each object as a short, discrete-token sequence consisting of its center, size, orientation, velocity, and class. This near-to-far ordering mirrors LiDAR geometry--near objects occlude far ones but not vice versa--enabling straightforward teacher forcing during training and autoregressive decoding at test time. AutoReg3D is compatible with diverse point-cloud backbones and attains competitive nuScenes performance without anchors or NMS. Beyond parity, the sequential formulation unlocks language-model advances for 3D perception, including GRPO-style reinforcement learning for task-aligned objectives. These results position autoregressive decoding as a viable, flexible alternative for LiDAR-based detection and open a path to importing modern sequence-modeling tools into 3D perception.
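The near-to-far ordering and discrete-token encoding described for AutoReg3D can be sketched as follows. This is a simplified illustration, not the paper's tokenizer: only the box center and class are encoded, and the bin count, coordinate range, and field names are assumptions.

```python
import math


def to_token(value, lo, hi, bins):
    """Quantize a continuous value in [lo, hi] into one of `bins` integer tokens."""
    clipped = min(max(value, lo), hi)
    idx = int((clipped - lo) / (hi - lo) * bins)
    return min(idx, bins - 1)


def encode_scene(objects, bins=256, xy_range=(-50.0, 50.0)):
    """Order objects by distance from the sensor, then tokenize each center."""
    lo, hi = xy_range
    ordered = sorted(objects, key=lambda o: math.hypot(o["x"], o["y"]))
    return [
        (to_token(o["x"], lo, hi, bins), to_token(o["y"], lo, hi, bins), o["cls"])
        for o in ordered
    ]


scene = [
    {"x": 30.0, "y": 0.0, "cls": "car"},
    {"x": -5.0, "y": 5.0, "cls": "pedestrian"},
]
tokens = encode_scene(scene)
```

The pedestrian is closer to the origin, so it is emitted first even though it appears second in the input list.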

Title: AutoTraces: Autoregressive Trajectory Forecasting via Multimodal Large Language Models

Authors: Teng Wang, Yanting Lu, Ruize Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.07989
Pdf URL: https://arxiv.org/pdf/2603.07989
Copy Paste: [[2603.07989]] AutoTraces: Autoregressive Trajectory Forecasting via Multimodal Large Language Models(https://arxiv.org/abs/2603.07989)
Keywords: generation
Abstract: We present AutoTraces, an autoregressive vision-language-trajectory model for robot trajectory forecasting in human-populated environments, which harnesses the inherent reasoning capabilities of large language models (LLMs) to model complex human behaviors. In contrast to prior works that rely solely on textual representations, our key innovation lies in a novel trajectory tokenization scheme, which represents waypoints with point tokens as categorical and positional markers while encoding waypoint numerical values as corresponding point embeddings, seamlessly integrated into the LLM's space through a lightweight encoder-decoder architecture. This design preserves the LLM's native autoregressive generation mechanism while extending it to physical coordinate spaces, facilitating the modeling of long-term interactions in trajectory data. We further introduce an automated chain-of-thought (CoT) generation mechanism that leverages a multimodal LLM to infer spatio-temporal relationships from visual observations and trajectory data, eliminating reliance on manual annotation. Through a two-stage training strategy, AutoTraces achieves SOTA forecasting accuracy, particularly in long-horizon prediction, while exhibiting strong cross-scene generalization and supporting flexible-length forecasting.

Title: Missing No More: Dictionary-Guided Cross-Modal Image Fusion under Missing Infrared

Authors: Yafei Zhang, Meng Ma, Huafeng Li, Yu Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.08018
Pdf URL: https://arxiv.org/pdf/2603.08018
Copy Paste: [[2603.08018]] Missing No More: Dictionary-Guided Cross-Modal Image Fusion under Missing Infrared(https://arxiv.org/abs/2603.08018)
Keywords: generation, generative
Abstract: Infrared-visible (IR-VIS) image fusion is vital for perception and security, yet most methods rely on the availability of both modalities during training and inference. When the infrared modality is absent, pixel-space generative substitutes become hard to control and inherently lack interpretability. We address missing-IR fusion by proposing a dictionary-guided, coefficient-domain framework built upon a shared convolutional dictionary. The pipeline comprises three key components: (1) Joint Shared-dictionary Representation Learning (JSRL) learns a unified and interpretable atom space shared by both IR and VIS modalities; (2) VIS-Guided IR Inference (VGII) transfers VIS coefficients to pseudo-IR coefficients in the coefficient domain and performs a one-step closed-loop refinement guided by a frozen large language model as a weak semantic prior; and (3) Adaptive Fusion via Representation Inference (AFRI) merges VIS structures and inferred IR cues at the atom level through window attention and convolutional mixing, followed by reconstruction with the shared dictionary. This encode-transfer-fuse-reconstruct pipeline avoids uncontrolled pixel-space generation while ensuring prior preservation within interpretable dictionary-coefficient representation. Experiments under missing-IR settings demonstrate consistent improvements in perceptual quality and downstream detection performance. To our knowledge, this represents the first framework that jointly learns a shared dictionary and performs coefficient-domain inference-fusion to tackle missing-IR fusion. The source code is publicly available at this https URL.

Title: VSDiffusion: Taming Ill-Posed Shadow Generation via Visibility-Constrained Diffusion

Authors: Jing Li, Jing Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.08020
Pdf URL: https://arxiv.org/pdf/2603.08020
Copy Paste: [[2603.08020]] VSDiffusion: Taming Ill-Posed Shadow Generation via Visibility-Constrained Diffusion(https://arxiv.org/abs/2603.08020)
Keywords: generation
Abstract: Generating realistic cast shadows for inserted foreground objects is a crucial yet challenging problem in image composition, where maintaining geometric consistency between shadow and object in complex scenes remains difficult due to the ill-posed nature of shadow formation. To address this issue, we propose VSDiffusion, a visibility-constrained two-stage framework designed to narrow the solution space by incorporating visibility priors. In Stage I, we predict a coarse shadow mask to localize plausible shadow regions. In Stage II, conditional diffusion is performed, guided by lighting and depth cues estimated from the composite, to generate accurate shadows. In VSDiffusion, we inject visibility priors through two complementary pathways: first, a visibility control branch with shadow-gated cross attention that provides multi-scale structural guidance; second, a learned soft prior map that reweights the training loss in error-prone regions to enhance geometric correction. Additionally, we introduce a high-frequency guided enhancement module to sharpen boundaries and improve texture interaction with the background. Experiments on the widely used public DESOBAv2 dataset demonstrate that VSDiffusion generates accurate shadows and establishes new SOTA results across most evaluation metrics.

Title: Not Like Transformers: Drop the Beat Representation for Dance Generation with Mamba-Based Diffusion Model

Authors: Sangjune Park, Inhyeok Choi, Donghyeon Soon, Youngwoo Jeon, Kyungdon Joo
Subjects: cs.CV, cs.AI, cs.GR, cs.SD
Abstract URL: https://arxiv.org/abs/2603.08023
Pdf URL: https://arxiv.org/pdf/2603.08023
Copy Paste: [[2603.08023]] Not Like Transformers: Drop the Beat Representation for Dance Generation with Mamba-Based Diffusion Model(https://arxiv.org/abs/2603.08023)
Keywords: generation
Abstract: Dance is a form of human motion characterized by emotional expression and communication, playing a role in various fields such as music, virtual reality, and content creation. Existing methods for dance generation often fail to adequately capture the inherently sequential, rhythmical, and music-synchronized characteristics of dance. In this paper, we propose MambaDance, a new dance generation approach that leverages a Mamba-based diffusion model. Mamba, well-suited to handling long and autoregressive sequences, is integrated into our two-stage diffusion architecture, replacing the off-the-shelf Transformer. Additionally, considering the critical role of musical beats in dance choreography, we propose a Gaussian-based beat representation to explicitly guide the decoding of dance sequences. Experiments on the AIST++ and FineDance datasets for each sequence length show that, compared to previous methods, our proposed method effectively generates plausible dance movements while reflecting essential characteristics, consistently from short to long dances. Additional qualitative results and demo videos are available at this https URL.
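A Gaussian-based beat representation, as named in the MambaDance abstract, can be read as replacing a sparse 0/1 beat indicator with smooth bumps centred on each beat. The sketch below follows that reading; the function name, the `sigma` width, and taking the maximum over overlapping bumps are assumptions, since the abstract gives no formula.

```python
import math


def gaussian_beat_signal(num_frames, beat_frames, sigma=2.0):
    """Return one value per frame: 1.0 on a beat, decaying smoothly away from it."""
    signal = []
    for t in range(num_frames):
        bumps = (math.exp(-((t - b) ** 2) / (2.0 * sigma ** 2)) for b in beat_frames)
        # Overlapping bumps are merged with max (an assumption) so peaks stay at 1.0.
        signal.append(max(bumps, default=0.0))
    return signal


beats = [4, 12]
beat_signal = gaussian_beat_signal(16, beats)
```

Compared with a one-hot beat track, such a signal also tells frames near a beat that a beat is approaching or has just passed.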

Title: Controllable Complex Human Motion Video Generation via Text-to-Skeleton Cascades

Authors: Ashkan Taghipour, Morteza Ghahremani, Zinuo Li, Hamid Laga, Farid Boussaid, Mohammed Bennamoun
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2603.08028
Pdf URL: https://arxiv.org/pdf/2603.08028
Copy Paste: [[2603.08028]] Controllable Complex Human Motion Video Generation via Text-to-Skeleton Cascades(https://arxiv.org/abs/2603.08028)
Keywords: generation
Abstract: Generating videos of complex human motions such as flips, cartwheels, and martial arts remains challenging for current video diffusion models. Text-only conditioning is temporally ambiguous for fine-grained motion control, while explicit pose-based controls, though effective, require users to provide complete skeleton sequences that are costly to produce for long and dynamic actions. We propose a two-stage cascaded framework that addresses both limitations. First, an autoregressive text-to-skeleton model generates 2D pose sequences from natural language descriptions by predicting each joint conditioned on previously generated poses. This design captures long-range temporal dependencies and inter-joint coordination required for complex motions. Second, a pose-conditioned video diffusion model synthesizes videos from a reference image and the generated skeleton sequence. It employs DINO-ALF (Adaptive Layer Fusion), a multi-level reference encoder that preserves appearance and clothing details under large pose changes and self-occlusions. To address the lack of publicly available datasets for complex human motion video generation, we introduce a Blender-based synthetic dataset containing 2,000 videos with diverse characters performing acrobatic and stunt-like motions. The dataset provides full control over appearance, motion, and environment. It fills an important gap because existing benchmarks significantly under-represent acrobatic motions while web-collected datasets raise copyright and privacy concerns. Experiments on our synthetic dataset and the Motion-X Fitness benchmark show that our text-to-skeleton model outperforms prior methods on FID, R-precision, and motion diversity. Our pose-to-video model also achieves the best results among all compared methods on VBench metrics for temporal consistency, motion smoothness, and subject preservation.

Title: QualiTeacher: Quality-Conditioned Pseudo-Labeling for Real-World Image Restoration

Authors: Fengyang Xiao, Jingjia Feng, Peng Hu, Dingming Zhang, Lei Xu, Guanyi Qin, Lu Li, Chunming He, Sina Farsiu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.08030
Pdf URL: https://arxiv.org/pdf/2603.08030
Copy Paste: [[2603.08030]] QualiTeacher: Quality-Conditioned Pseudo-Labeling for Real-World Image Restoration(https://arxiv.org/abs/2603.08030)
Keywords: restoration, quality assessment
Abstract: Real-world image restoration (RWIR) is a highly challenging task due to the absence of clean ground-truth images. Many recent methods resort to pseudo-label (PL) supervision, often within a Mean-Teacher (MT) framework. However, these methods face a critical paradox: unconditionally trusting the often imperfect, low-quality PLs forces the student model to learn undesirable artifacts, while discarding them severely limits data diversity and impairs model generalization. In this paper, we propose QualiTeacher, a novel framework that transforms pseudo-label quality from a noisy liability into a conditional supervisory signal. Instead of filtering, QualiTeacher explicitly conditions the student model on the quality of the PLs, estimated by an ensemble of complementary non-reference image quality assessment (NR-IQA) models spanning low-level distortion and semantic-level assessment. This strategy teaches the student network to learn a quality-graded restoration manifold, enabling it to understand what constitutes different quality levels. Consequently, it can not only avoid mimicking artifacts from low-quality labels but also extrapolate to generate results of higher quality than the teacher itself. To ensure the robustness and accuracy of this quality-driven learning, we further enhance the process with a multi-augmentation scheme to diversify the PL quality spectrum, a score-based preference optimization strategy inspired by Direct Preference Optimization (DPO) to enforce a monotonically ordered quality separation, and a cropped consistency loss to prevent adversarial over-optimization (reward hacking) of the IQA models. Experiments on standard RWIR benchmarks demonstrate that QualiTeacher can serve as a plug-and-play strategy to improve the quality of the existing pseudo-labeling framework, establishing a new paradigm for learning from imperfect supervision. Code will be released.

Title: GCGNet: Graph-Consistent Generative Network for Time Series Forecasting with Exogenous Variables

Authors: Zhengyu Li, Xiangfei Qiu, Yuhan Zhu, Xingjian Wu, Jilin Hu, Chenjuan Guo, Bin Yang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.08032
Pdf URL: https://arxiv.org/pdf/2603.08032
Copy Paste: [[2603.08032]] GCGNet: Graph-Consistent Generative Network for Time Series Forecasting with Exogenous Variables(https://arxiv.org/abs/2603.08032)
Keywords: generation, generative
Abstract: Exogenous variables offer valuable supplementary information for predicting future endogenous variables. Forecasting with exogenous variables needs to consider both past-to-future dependencies (i.e., temporal correlations) and the influence of exogenous variables on endogenous variables (i.e., channel correlations). This is pivotal when future exogenous variables are available, because they may directly affect the future endogenous variables. Many methods have been proposed for time series forecasting with exogenous variables, focusing on modeling temporal and channel correlations. However, most of them use a two-step strategy, modeling temporal and channel correlations separately, which limits their ability to capture joint correlations across time and channels. Furthermore, in real-world scenarios, time series are frequently affected by various forms of noise, underscoring the critical importance of robustness in such correlation modeling. To address these limitations, we propose GCGNet, a Graph-Consistent Generative Network for time series forecasting with exogenous variables. Specifically, GCGNet first employs a Variational Generator to produce coarse predictions. A Graph Structure Aligner then further guides it by evaluating the consistency between the generated and true correlations, where the correlations are represented as graphs and are robust to noise. Finally, a Graph Refiner is proposed to refine the predictions to prevent degeneration and improve accuracy. Extensive experiments on 12 real-world datasets demonstrate that GCGNet outperforms state-of-the-art baselines.
摘要：外生变量为预测未来内生变量提供了有价值的补充信息。使用外生变量进行预测需要考虑过去到未来的依赖性（即时间相关性）和外生变量对内生变量的影响（即通道相关性）。当未来的外生变量可用时，这一点至关重要，因为它们可能直接影响未来的内生变量。人们已经提出了许多用于外生变量时间序列预测的方法，重点是对时间和通道相关性进行建模。然而，它们大多数使用两步策略，分别对时间和通道相关性进行建模，这限制了它们捕获跨时间和通道的联合相关性的能力。此外，在现实场景中，时间序列经常受到各种形式的噪声的影响，这凸显了此类相关性建模中鲁棒性的至关重要性。为了解决这些限制，我们提出了 GCGNet，一种用于外生变量时间序列预测的图一致生成网络。具体来说，GCGNet 首先使用变分生成器来产生粗略预测。然后，图结构对齐器通过评估生成的相关性和真实相关性之间的一致性来进一步指导它，其中相关性以图的形式表示，并且对噪声具有鲁棒性。最后，提出了 Graph Refiner 来细化预测，以防止退化并提高准确性。对 12 个真实世界数据集的大量实验表明，GCGNet 的性能优于最先进的基线。

Title: ImageEdit-R1: Boosting Multi-Agent Image Editing via Reinforcement Learning

Authors: Yiran Zhao, Yaoqi Ye, Xiang Liu, Michael Qizhe Shieh, Trung Bui
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.08059
Pdf URL: https://arxiv.org/pdf/2603.08059
Copy Paste: [[2603.08059]] ImageEdit-R1: Boosting Multi-Agent Image Editing via Reinforcement Learning(https://arxiv.org/abs/2603.08059)
Keywords: generative
Abstract: With the rapid advancement of commercial multi-modal models, image editing has garnered significant attention due to its widespread applicability in daily life. Despite impressive progress, existing image editing systems, particularly closed-source or proprietary models, often struggle with complex, indirect, or multi-step user instructions. These limitations hinder their ability to perform nuanced, context-aware edits that align with human intent. In this work, we propose ImageEdit-R1, a multi-agent framework for intelligent image editing that leverages reinforcement learning to coordinate high-level decision-making across a set of specialized, pretrained vision-language and generative agents. Each agent is responsible for distinct capabilities--such as understanding user intent, identifying regions of interest, selecting appropriate editing actions, and synthesizing visual content--while reinforcement learning governs their collaboration to ensure coherent and goal-directed behavior. Unlike existing approaches that rely on monolithic models or hand-crafted pipelines, our method treats image editing as a sequential decision-making problem, enabling dynamic and context-aware editing strategies. Experimental results demonstrate that ImageEdit-R1 consistently outperforms both individual closed-source diffusion models and alternative multi-agent framework baselines across multiple image editing datasets.
摘要：随着商业多模态模型的快速发展，图像编辑因其在日常生活中的广泛适用性而受到广泛关注。尽管取得了令人瞩目的进步，但现有的图像编辑系统，特别是闭源或专有模型，常常难以应对复杂、间接或多步骤的用户指令。这些限制阻碍了他们执行与人类意图一致的细致的、上下文感知的编辑的能力。在这项工作中，我们提出了 ImageEdit-R1，这是一种用于智能图像编辑的多代理框架，它利用强化学习来协调一组专门的、预先训练的视觉语言和生成代理的高层决策。每个代理负责不同的功能，例如理解用户意图、识别感兴趣区域、选择适当的编辑操作以及合成视觉内容，而强化学习则控制他们的协作，以确保连贯且目标导向的行为。与依赖整体模型或手工制作管道的现有方法不同，我们的方法将图像编辑视为顺序决策问题，从而实现动态和上下文感知的编辑策略。实验结果表明，ImageEdit-R1 在多个图像编辑数据集中始终优于单个闭源扩散模型和替代多智能体框架基线。

Title: Evaluating Generative Models via One-Dimensional Code Distributions

Authors: Zexi Jia, Pengcheng Luo, Yijia Zhong, Jinchao Zhang, Jie Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.08064
Pdf URL: https://arxiv.org/pdf/2603.08064
Copy Paste: [[2603.08064]] Evaluating Generative Models via One-Dimensional Code Distributions(https://arxiv.org/abs/2603.08064)
Keywords: generative
Abstract: Most evaluations of generative models rely on feature-distribution metrics such as FID, which operate on continuous recognition features that are explicitly trained to be invariant to appearance variations, and thus discard cues critical for perceptual quality. We instead evaluate models in the space of discrete visual tokens, where modern 1D image tokenizers compactly encode both semantic and perceptual information and quality manifests as predictable token statistics. We introduce Codebook Histogram Distance (CHD), a training-free distribution metric in token space, and Code Mixture Model Score (CMMS), a no-reference quality metric learned from synthetic degradations of token sequences. To stress-test metrics under broad distribution shifts, we further propose VisForm, a benchmark of 210K images spanning 62 visual forms and 12 generative models with expert annotations. Across AGIQA, HPDv2/3, and VisForm, our token-based metrics achieve state-of-the-art correlation with human judgments, and we will release all code and datasets to facilitate future research.
摘要：大多数生成模型的评估都依赖于 FID 等特征分布度量，这些度量对连续识别特征进行操作，这些特征经过明确训练，不会随外观变化而变化，从而丢弃对感知质量至关重要的线索。相反，我们在 \emph{discrete} 视觉标记空间中评估模型，其中现代一维图像标记器紧凑地编码语义和感知信息，并且质量表现为可预测的标记统计数据。我们引入了 \emph{Codebook Histogram Distance} (CHD)，一种令牌空间中的免训练分布度量，以及 \emph{Code Mixture Model Score} (CMMS)，一种从令牌序列的综合退化中学习的无参考质量度量。为了对广泛分布变化下的指标进行压力测试，我们进一步提出了 \emph{VisForm}，这是一个涵盖 62 种视觉形式和 12 个带有专家注释的生成模型的 210K 图像的基准。在 AGIQA、HPDv2/3 和 VisForm 中，我们基于代币的指标实现了与人类判断最先进的相关性，我们将发布所有代码和数据集以促进未来的研究。
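
Illustrative sketch (not from the paper): the abstract describes CHD only as a training-free distribution metric in token space and does not give its formula. The Python below shows one plausible reading, assuming each image is already encoded as a sequence of codebook indices and using total variation distance between codebook-usage histograms. The function names, the toy token sequences, and the choice of distance are all assumptions made for illustration.

```python
from collections import Counter


def codebook_histogram(token_sequences, codebook_size):
    """Normalized usage frequency of each codebook index over a set of images."""
    counts = Counter(tok for seq in token_sequences for tok in seq)
    total = sum(counts.values())
    return [counts.get(i, 0) / total for i in range(codebook_size)]


def histogram_distance(hist_a, hist_b):
    """Total variation distance (half the L1 distance); an assumed choice."""
    return 0.5 * sum(abs(a - b) for a, b in zip(hist_a, hist_b))


# Toy token sequences standing in for tokenizer output (hypothetical data).
real_tokens = [[0, 1, 1, 2], [0, 1, 2, 2]]
generated_tokens = [[0, 0, 0, 1], [0, 0, 1, 3]]

real_hist = codebook_histogram(real_tokens, codebook_size=4)
gen_hist = codebook_histogram(generated_tokens, codebook_size=4)
distance = histogram_distance(real_hist, gen_hist)
print(distance)
```

A distance of zero means the two sets use the codebook identically; larger values indicate that the generated set over- or under-uses some tokens relative to the reference set.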

Title: Synthetic Defect Image Generation for Power Line Insulator Inspection Using Multimodal Large Language Models

Authors: Xuesong Wang, Caisheng Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.08069
Pdf URL: https://arxiv.org/pdf/2603.08069
Copy Paste: [[2603.08069]] Synthetic Defect Image Generation for Power Line Insulator Inspection Using Multimodal Large Language Models(https://arxiv.org/abs/2603.08069)
Keywords: generation
Abstract: Utility companies increasingly rely on drone imagery for post-event and routine inspection, but training accurate defect-type classifiers remains difficult because defect examples are rare and inspection datasets are often limited or proprietary. We address this data-scarcity setting by using an off-the-shelf multimodal large language model (MLLM) as a training-free image generator to synthesize defect images from visual references and text prompts. Our pipeline increases diversity via dual-reference conditioning, improves label fidelity with lightweight human verification and prompt refinement, and filters the resulting synthetic pool using an embedding-based selection rule based on distances to class centroids computed from the real training split. We evaluate on ceramic insulator defect-type classification (shell vs. glaze) using a public dataset with a realistic low training-data regime (104 real training images; 152 validation; 308 test). Augmenting the 10% real training set with embedding-selected synthetic images improves test F1 score (harmonic mean of precision and recall) from 0.615 to 0.739 (20% relative), corresponding to an estimated 4-5x data-efficiency gain, and the gains persist with stronger backbone models and frozen-feature linear-probe baselines. These results suggest a practical, low-barrier path for improving defect recognition when collecting additional real defects is slow or infeasible.
摘要：公用事业公司越来越依赖无人机图像进行事后和例行检查，但训练准确的缺陷类型分类器仍然很困难，因为缺陷示例很少，而且检查数据集通常是有限的或专有的。我们通过使用现成的多模态大语言模型（MLLM）作为免训练图像生成器来根据视觉参考和文本提示合成缺陷图像来解决这种数据稀缺的问题。我们的管道通过双参考条件增加多样性，通过轻量级人工验证和快速细化提高标签保真度，并使用基于嵌入的选择规则（基于从真实训练分割计算出的类质心距离）过滤生成的合成池。我们使用具有现实低训练数据制度（104 个真实训练图像；152 个验证；308 个测试）的公共数据集来评估陶瓷绝缘子缺陷类型分类（外壳与釉）。使用嵌入选择的合成图像增强 10% 的真实训练集，可将测试 F1 分数（精确度和召回率的调和平均值）从 0.615 提高到 0.739（相对 20%），相当于估计的 4--5 倍数据效率增益，并且通过更强的主干模型和冻结特征线性探针基线，这种增益会持续存在。这些结果表明，当收集额外的实际缺陷缓慢或不可行时，可以采用一种实用的、低障碍的途径来改善缺陷识别。
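
Illustrative sketch (not from the paper): the abstract says synthetic images are filtered by an embedding-based rule using distances to class centroids computed from the real training split, but it does not state the exact rule. The Python below assumes the simplest version, keeping a synthetic image when its embedding lies within a fixed Euclidean distance of its label's centroid. The threshold, the 2-D toy embeddings, and the function names are hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def select_synthetic(real_by_class, synthetic, max_distance):
    """Keep synthetic items close to the real centroid of their own label."""
    centroids = {label: centroid(vecs) for label, vecs in real_by_class.items()}
    kept = []
    for label, emb in synthetic:
        if math.dist(emb, centroids[label]) <= max_distance:
            kept.append((label, emb))
    return kept


# Toy embeddings of real training images, grouped by defect type (hypothetical).
real_by_class = {
    "shell": [[0.0, 0.0], [2.0, 0.0]],
    "glaze": [[10.0, 10.0], [10.0, 12.0]],
}
# Synthetic candidates as (intended label, embedding) pairs (hypothetical).
synthetic = [
    ("shell", [1.0, 1.0]),
    ("shell", [9.0, 9.0]),
    ("glaze", [10.0, 11.5]),
]

kept = select_synthetic(real_by_class, synthetic, max_distance=2.0)
print(kept)
```

In this toy run the second candidate is dropped because it sits far from the "shell" centroid, which is the kind of label-inconsistent sample such a filter is meant to remove.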

Title: DSH-Bench: A Difficulty- and Scenario-Aware Benchmark with Hierarchical Subject Taxonomy for Subject-Driven Text-to-Image Generation

Authors: Zhenyu Hu, Qing Wang, Te Cao, Luo Liao, Longfei Lu, Liqun Liu, Shuang Li, Hang Chen, Mengge Xue, Yuan Chen, Chao Deng, Peng Shu, Huan Yu, Jie Jiang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.08090
Pdf URL: https://arxiv.org/pdf/2603.08090
Copy Paste: [[2603.08090]] DSH-Bench: A Difficulty- and Scenario-Aware Benchmark with Hierarchical Subject Taxonomy for Subject-Driven Text-to-Image Generation(https://arxiv.org/abs/2603.08090)
Keywords: generation
Abstract: Significant progress has been achieved in subject-driven text-to-image (T2I) generation, which aims to synthesize new images depicting target subjects according to user instructions. However, evaluating these models remains a significant challenge. Existing benchmarks exhibit critical limitations: 1) insufficient diversity and comprehensiveness in subject images, 2) inadequate granularity in assessing model performance across different subject difficulty levels and prompt scenarios, and 3) a profound lack of actionable insights and diagnostic guidance for subsequent model refinement. To address these limitations, we propose DSH-Bench, a comprehensive benchmark that enables systematic multi-perspective analysis of subject-driven T2I models through four principal innovations: 1) a hierarchical taxonomy sampling mechanism ensuring comprehensive subject representation across 58 fine-grained categories, 2) an innovative classification scheme categorizing both subject difficulty level and prompt scenario for granular capability assessment, 3) a novel Subject Identity Consistency Score (SICS) metric demonstrating a 9.4% higher correlation with human evaluation compared to existing measures in quantifying subject preservation, and 4) a comprehensive set of diagnostic insights derived from the benchmark, offering critical guidance for optimizing future model training paradigms and data construction strategies. Through an extensive empirical evaluation of 19 leading models, DSH-Bench uncovers previously obscured limitations in current approaches, establishing concrete directions for future research and development.
摘要：主题驱动的文本到图像（T2I）生成已取得重大进展，其目的是根据用户指令合成描绘目标主题的新图像。然而，评估这些模型仍然是一个重大挑战。现有基准表现出严重的局限性：1）主题图像的多样性和全面性不足，2）在不同主题难度级别和提示场景下评估模型性能的粒度不足，3）严重缺乏对后续模型细化的可操作的见解和诊断指导。为了解决这些局限性，我们提出了 DSH-Bench，这是一个综合基准，通过四项主要创新，能够对主题驱动的 T2I 模型进行系统的多角度分析：1）分层分类抽样机制，确保跨 58 个细粒度类别的全面主题表示；2）创新的分类方案，对主题难度级别和细粒度能力评估的提示场景进行分类；3）一种新颖的主题身份一致性评分（SICS）指标，与人类评估相比，其相关性提高了 9.4%量化主题保存的现有措施，以及4）从基准中得出的一套全面的诊断见解，为优化未来模型训练范式和数据构建策略提供关键指导。通过对 19 个领先模型进行广泛的实证评估，DSH-Bench 揭示了当前方法中先前被掩盖的局限性，为未来的研究和开发建立了具体的方向。

Title: Foley-Flow: Coordinated Video-to-Audio Generation with Masked Audio-Visual Alignment and Dynamic Conditional Flows

Authors: Shentong Mo, Yibing Song
Subjects: cs.CV, cs.AI, cs.LG, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2603.08126
Pdf URL: https://arxiv.org/pdf/2603.08126
Copy Paste: [[2603.08126]] Foley-Flow: Coordinated Video-to-Audio Generation with Masked Audio-Visual Alignment and Dynamic Conditional Flows(https://arxiv.org/abs/2603.08126)
Keywords: generation
Abstract: Coordinated audio generation based on video inputs typically requires strict audio-visual (AV) alignment, where both the semantics and the rhythm of the generated audio segments must correspond to those in the video frames. Previous studies leverage a two-stage design where the AV encoders are first aligned via contrastive learning, then the encoded video representations guide the audio generation process. We observe that both contrastive learning and global video guidance are effective in aligning overall AV semantics but limit temporal rhythmic synchronization. In this work, we propose FoleyFlow to first align unimodal AV encoders via masked modeling training, where the masked audio segments are recovered under the guidance of the corresponding video segments. After training, the AV encoders, which are separately pretrained using only unimodal data, are aligned with semantic and rhythmic consistency. Then, we develop a dynamic conditional flow for the final audio generation. Built upon the efficient velocity flow generation framework, our dynamic conditional flow utilizes temporally varying video features as the dynamic condition to guide the generation of the corresponding audio segments. To this end, we extract coherent semantic and rhythmic representations during masked AV alignment, and use this representation of video segments to guide audio generation temporally. Our audio results are evaluated on the standard benchmarks and largely surpass existing results under several metrics. The superior performance indicates that FoleyFlow is effective in generating coordinated audio that is both semantically and rhythmically coherent with various video sequences.
摘要：基于视频输入的协调音频生成通常需要严格的视听（AV）对齐，其中生成的音频片段的语义和节奏都应与视频帧中的语义和节奏相对应。先前的研究利用两阶段设计，其中 AV 编码器首先通过对比学习进行对齐，然后编码的视频表示指导音频生成过程。我们观察到对比学习和全局视频指导都可以有效地调整整体 AV 语义，同时限制时间节奏同步。在这项工作中，我们建议 FoleyFlow 首先通过掩蔽建模训练来对齐单模态 AV 编码器，其中掩蔽音频片段在相应视频片段的指导下恢复。训练后，仅使用单峰数据单独预训练的 AV 编码器与语义和节奏一致性保持一致。然后，我们为最终音频生成开发动态条件流。我们的动态条件流建立在高效的速度流生成框架的基础上，利用随时间变化的视频特征作为动态条件来指导相应的音频片段生成。为此，我们在屏蔽 AV 对齐过程中提取连贯的语义和节奏表示，并使用视频片段的这种表示来指导音频的时间生成。我们的音频结果是根据标准基准进行评估的，并且在多个指标下大大超过了现有结果。卓越的性能表明 FoleyFlow 可以有效生成在语义和节奏上与各种视频序列一致的协调音频。

Title: Fast Low-light Enhancement and Deblurring for 3D Dark Scenes

Authors: Feng Zhang, Jinglong Wang, Ze Li, Yanghong Zhou, Yang Chen, Lei Chen, Xiatian Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.08133
Pdf URL: https://arxiv.org/pdf/2603.08133
Copy Paste: [[2603.08133]] Fast Low-light Enhancement and Deblurring for 3D Dark Scenes(https://arxiv.org/abs/2603.08133)
Keywords: restoration
Abstract: Novel view synthesis from low-light, noisy, and motion-blurred imagery remains a valuable and challenging task. Current volumetric rendering methods struggle with compound degradation, and sequential 2D preprocessing introduces artifacts due to interdependencies. In this work, we introduce FLED-GS, a fast low-light enhancement and deblurring framework that reformulates 3D scene restoration as an alternating cycle of enhancement and reconstruction. Specifically, FLED-GS inserts several intermediate brightness anchors to enable progressive recovery, preventing noise blow-up from harming deblurring or geometry. Each iteration sharpens inputs with an off-the-shelf 2D deblurrer and then performs noise-aware 3DGS reconstruction that estimates and suppresses noise while producing clean priors for the next level. Experiments show FLED-GS outperforms state-of-the-art LuSh-NeRF, achieving 21× faster training and 11× faster rendering.
摘要：从低光、噪声和运动模糊图像中合成新颖的视图仍然是一项有价值且具有挑战性的任务。当前的体积渲染方法与复合退化问题作斗争，并且连续的 2D 预处理由于相互依赖性而引入了伪影。在这项工作中，我们介绍了 FLED-GS，这是一种快速低光增强和去模糊框架，它将 3D 场景恢复重新表述为增强和重建的交替循环。具体来说，FLED-GS 插入了几个中间亮度锚点以实现渐进恢复，防止噪声爆炸损害去模糊或几何形状。每次迭代都会使用现成的 2D 去模糊器锐化输入，然后执行噪声感知 3DGS 重建，估计和抑制噪声，同时为下一个级别生成干净的先验。实验表明，FLED-GS 的性能优于最先进的 LuSh-NeRF，训练速度提高了 21 倍，渲染速度提高了 11 倍。

Title: C$^2$FG: Control Classifier-Free Guidance via Score Discrepancy Analysis

Authors: Jiayang Gao, Tianyi Zheng, Jiayang Zou, Fengxiang Yang, Shice Liu, Luyao Fan, Zheyu Zhang, Hao Zhang, Jinwei Chen, Peng-Tao Jiang, Bo Li, Jia Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.08155
Pdf URL: https://arxiv.org/pdf/2603.08155
Copy Paste: [[2603.08155]] C$^2$FG: Control Classifier-Free Guidance via Score Discrepancy Analysis(https://arxiv.org/abs/2603.08155)
Keywords: generative
Abstract: Classifier-Free Guidance (CFG) is a cornerstone of modern conditional diffusion models, yet its reliance on the fixed or heuristic dynamic guidance weight is predominantly empirical and overlooks the inherent dynamics of the diffusion process. In this paper, we provide a rigorous theoretical analysis of the Classifier-Free Guidance. Specifically, we establish strict upper bounds on the score discrepancy between conditional and unconditional distributions at different timesteps based on the diffusion process. This finding explains the limitations of fixed-weight strategies and establishes a principled foundation for time-dependent guidance. Motivated by this insight, we introduce Control Classifier-Free Guidance (C$^2$FG), a novel, training-free, and plug-in method that aligns the guidance strength with the diffusion dynamics via an exponential decay control function. Extensive experiments demonstrate that C$^2$FG is effective and broadly applicable across diverse generative tasks, while also exhibiting orthogonality to existing strategies.
摘要：无分类器指导（CFG）是现代条件扩散模型的基石，但它对固定或启发式动态指导权重的依赖主要是经验性的，忽视了扩散过程的内在动态。在本文中，我们对无分类器指导进行了严格的理论分析。具体来说，我们根据扩散过程在不同时间步长的条件分布和无条件分布之间的分数差异建立了严格的上限。这一发现解释了固定权重策略的局限性，并为依赖时间的指导奠定了原则基础。受这种洞察力的启发，我们引入了 \textbf{Control Classifier-Free Guidance (C$^2$FG)}，这是一种新颖的、免训练的插件方法，通过指数衰减控制函数使引导强度与扩散动力学保持一致。大量实验表明，C$^2$FG 有效且广泛适用于各种生成任务，同时还表现出与现有策略的正交性。
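
Illustrative sketch (not from the paper): the abstract states that the guidance strength follows an exponential decay control function but gives neither its parameters nor the time direction in which it decays. The Python below pairs the standard CFG combination, uncond + w * (cond - uncond), with an assumed schedule w(t) = w0 * exp(-decay_rate * t) over a normalized time t in [0, 1]. The names `guidance_weight`, `w0`, and `decay_rate`, and the default values, are hypothetical.

```python
import math


def guidance_weight(t, w0=7.5, decay_rate=3.0):
    """Assumed time-dependent weight: w0 * exp(-decay_rate * t), t in [0, 1]."""
    return w0 * math.exp(-decay_rate * t)


def guided_score(cond, uncond, weight):
    """Standard CFG combination: uncond + weight * (cond - uncond)."""
    return [u + weight * (c - u) for c, u in zip(cond, uncond)]


# Toy conditional and unconditional model outputs (hypothetical values).
cond = [1.0, 2.0]
uncond = [0.5, 1.0]

for t in (0.0, 0.5, 1.0):
    w = guidance_weight(t)
    print(t, round(w, 4), guided_score(cond, uncond, w))
```

With a fixed weight the same extrapolation is applied at every step; the time-dependent schedule simply replaces that constant with a value that shrinks along the assumed time axis.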

Title: Learning Hierarchical Knowledge in Text-Rich Networks with Taxonomy-Informed Representation Learning

Authors: Yunhui Liu, Yongchao Liu, Yinfeng Chen, Chuntao Hong, Tao Zheng, Tieke He
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.08159
Pdf URL: https://arxiv.org/pdf/2603.08159
Copy Paste: [[2603.08159]] Learning Hierarchical Knowledge in Text-Rich Networks with Taxonomy-Informed Representation Learning(https://arxiv.org/abs/2603.08159)
Keywords: generation
Abstract: Hierarchical knowledge structures are ubiquitous across real-world domains and play a vital role in organizing information from coarse to fine semantic levels. While such structures have been widely used in taxonomy systems, biomedical ontologies, and retrieval-augmented generation, their potential remains underexplored in the context of Text-Rich Networks (TRNs), where each node contains rich textual content and edges encode semantic relationships. Existing methods for learning on TRNs often focus on flat semantic modeling, overlooking the inherent hierarchical semantics embedded in textual documents. To this end, we propose TIER (Hierarchical Taxonomy-Informed Representation Learning on Text-Rich Networks), which first constructs an implicit hierarchical taxonomy and then integrates it into the learned node representations. Specifically, TIER employs similarity-guided contrastive learning to build a clustering-friendly embedding space, upon which it performs hierarchical K-Means followed by LLM-powered clustering refinement to enable semantically coherent taxonomy construction. Leveraging the resulting taxonomy, TIER introduces a cophenetic correlation coefficient-based regularization loss to align the learned embeddings with the hierarchical structure. By learning representations that respect both fine-grained and coarse-grained semantics, TIER enables more interpretable and structured modeling of real-world TRNs. We demonstrate that our approach significantly outperforms existing methods on multiple datasets across diverse domains, highlighting the importance of hierarchical knowledge learning for TRNs.
摘要：分层知识结构在现实世界领域中普遍存在，并且在从粗略到精细语义级别组织信息方面发挥着至关重要的作用。虽然此类结构已广泛应用于分类系统、生物医学本体和检索增强生成，但它们的潜力在富文本网络（TRN）的背景下仍未得到充分开发，其中每个节点都包含丰富的文本内容，边缘编码语义关系。现有的 TRN 学习方法通常侧重于平面语义建模，忽视了文本文档中嵌入的固有分层语义。为此，我们提出了 TIER（Hierarchical \textbf{T}axonomy-\textbf{I}nformed R\textbf{E}presentation Learning on Text-\textbf{R}ich Networks），它首先构建隐式分层分类法，然后将其集成到学习的节点表示中。具体来说，TIER 采用相似性引导的对比学习来构建聚类友好的嵌入空间，在此基础上执行分层 K-Means，然后执行 LLM 支持的聚类细化，以实现语义一致的分类法构建。利用由此产生的分类法，TIER 引入了基于共表相关系数的正则化损失，以使学习到的嵌入与层次结构保持一致。通过学习尊重细粒度和粗粒度语义的表示，TIER 可以对现实世界的 TRN 进行更可解释和结构化的建模。我们证明，我们的方法在跨不同领域的多个数据集上显着优于现有方法，凸显了 TRN 分层知识学习的重要性。

Title: Video2LoRA: Unified Semantic-Controlled Video Generation via Per-Reference-Video LoRA

Authors: Zexi Wu, Qinghe Wang, Jing Dai, Baolu Li, Yiming Zhang, Yue Ma, Xu Jia, Hongming Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.08210
Pdf URL: https://arxiv.org/pdf/2603.08210
Copy Paste: [[2603.08210]] Video2LoRA: Unified Semantic-Controlled Video Generation via Per-Reference-Video LoRA(https://arxiv.org/abs/2603.08210)
Keywords: generation
Abstract: Achieving semantic alignment across diverse video generation conditions remains a significant challenge. Methods that rely on explicit structural guidance often enforce rigid spatial constraints that limit semantic flexibility, whereas models tailored for individual control types lack interoperability and adaptability. These design bottlenecks hinder progress toward flexible and efficient semantic video generation. To address this, we propose Video2LoRA, a scalable and generalizable framework for semantic-controlled video generation that conditions on a reference video. Video2LoRA employs a lightweight hypernetwork to predict personalized LoRA weights for each semantic input, which are combined with auxiliary matrices to form adaptive LoRA modules integrated into a frozen diffusion backbone. This design enables the model to generate videos consistent with the reference semantics while preserving key style and content variations, eliminating the need for any per-condition training. Notably, the final model weighs less than 150 MB, making it highly efficient for storage and deployment. Video2LoRA achieves coherent, semantically aligned generation across diverse conditions and exhibits strong zero-shot generalization to unseen semantics.
摘要：在不同的视频生成条件下实现语义对齐仍然是一个重大挑战。依赖于显式结构指导的方法通常会强制执行严格的空间约束，从而限制了语义灵活性，而为各个控制类型量身定制的模型缺乏互操作性和适应性。这些设计瓶颈阻碍了灵活高效的语义视频生成的进展。为了解决这个问题，我们提出了 Video2LoRA，这是一种可扩展且可泛化的框架，用于以参考视频为条件的语义控制视频生成。 Video2LoRA 采用轻量级超网络来预测每个语义输入的个性化 LoRA 权重，这些权重与辅助矩阵相结合，形成集成到冻结扩散主干中的自适应 LoRA 模块。这种设计使模型能够生成与参考语义一致的视频，同时保留关键风格和内容变化，从而无需进行任何按条件训练。值得注意的是，最终模型的重量小于 150MB，使其存储和部署非常高效。 Video2LoRA 在不同条件下实现了连贯、语义对齐的生成，并对未见语义表现出强大的零样本泛化能力。

Title: SRNeRV: A Scale-wise Recursive Framework for Neural Video Representation

Authors: Jia Wang, Jun Zhu, Xinfeng Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.08227
Pdf URL: https://arxiv.org/pdf/2603.08227
Copy Paste: [[2603.08227]] SRNeRV: A Scale-wise Recursive Framework for Neural Video Representation(https://arxiv.org/abs/2603.08227)
Keywords: generation
Abstract: Implicit Neural Representations (INRs) have emerged as a promising paradigm for video representation and compression. However, existing multi-scale INR generators often suffer from significant parameter redundancy by stacking independent processing blocks for each scale. Inspired by the principle of scale self-similarity in the generation process, we propose SRNeRV, a novel scale-wise recursive framework that replaces this stacked design with a parameter-efficient shared architecture. The core of our approach is a hybrid sharing scheme derived from decoupling the processing block into a scale-specific spatial mixing module and a scale-invariant channel mixing module. We recursively apply the same shared channel mixing module, which contains the majority of the parameters, across all scales, significantly reducing the model size while preserving the crucial capacity to learn scale-specific spatial patterns. Extensive experiments demonstrate that SRNeRV achieves a significant rate-distortion performance boost, especially in INR-friendly scenarios, validating that our sharing scheme successfully amplifies the core strengths of the INR paradigm.
摘要：隐式神经表示（INR）已成为视频表示和压缩的一种有前途的范例。然而，现有的多尺度 INR 生成器经常因为每个尺度堆叠独立的处理块而遭受显着的参数冗余。受到生成过程中尺度自相似原理的启发，我们提出了 SRNeRV，一种新颖的尺度递归框架，用参数高效的共享架构取代了这种堆叠设计。我们方法的核心是混合共享方案，该方案源自将处理块解耦为特定于尺度的空间混合模块和尺度不变的通道混合模块。我们递归地应用相同的共享通道混合模块，该模块包含所有尺度的大多数参数，显着减小了模型大小，同时保留了学习特定尺度空间模式的关键能力。大量实验表明，SRNeRV 实现了显着的速率失真性能提升，尤其是在 INR 友好的场景中，验证了我们的共享方案成功放大了 INR 范式的核心优势。

Title: GarmentPainter: Efficient 3D Garment Texture Synthesis with Character-Guided Diffusion Model

Authors: Jinbo Wu, Xiaobo Gao, Xing Liu, Chen Zhao, Jialun Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.08228
Pdf URL: https://arxiv.org/pdf/2603.08228
Copy Paste: [[2603.08228]] GarmentPainter: Efficient 3D Garment Texture Synthesis with Character-Guided Diffusion Model(https://arxiv.org/abs/2603.08228)
Keywords: generation
Abstract: Generating high-fidelity, 3D-consistent garment textures remains a challenging problem due to the inherent complexities of garment structures and the stringent requirement for detailed, globally consistent texture synthesis. Existing approaches either rely on 2D-based diffusion models, which inherently struggle with 3D consistency, require expensive multi-step optimization, or depend on strict spatial alignment between 2D reference images and 3D meshes, which limits their flexibility and scalability. In this work, we introduce GarmentPainter, a simple yet efficient framework for synthesizing high-quality, 3D-aware garment textures in UV space. Our method leverages a UV position map as the 3D structural guidance, ensuring texture consistency across the garment surface during texture generation. To enhance control and adaptability, we introduce a type selection module, enabling fine-grained texture generation for specific garment components based on a character reference image, without requiring alignment between the reference image and the 3D mesh. GarmentPainter efficiently integrates all guidance signals into the input of a diffusion model in a spatially aligned manner, without modifying the underlying UNet architecture. Extensive experiments demonstrate that GarmentPainter achieves state-of-the-art performance in terms of visual fidelity, 3D consistency, and computational efficiency, outperforming existing methods in both qualitative and quantitative evaluations.
摘要：由于服装结构固有的复杂性以及对详细、全局一致的纹理合成的严格要求，生成高保真、3D 一致的服装纹理仍然是一个具有挑战性的问题。现有方法要么依赖于基于 2D 的扩散模型，该模型本身就与 3D 一致性作斗争，需要昂贵的多步骤优化，要么依赖于 2D 参考图像和 3D 网格之间的严格空间对齐，这限制了它们的灵活性和可扩展性。在这项工作中，我们介绍了 GarmentPainter，这是一个简单而高效的框架，用于在 UV 空间中合成高质量、3D 感知的服装纹理。我们的方法利用 UV 位置图作为 3D 结构指导，确保纹理生成期间整个服装表面的纹理一致性。为了增强控制和适应性，我们引入了类型选择模块，能够基于角色参考图像为特定服装组件生成细粒度纹理，而无需参考图像和 3D 网格之间的对齐。 GarmentPainter 以空间对齐的方式将所有引导信号有效地集成到扩散模型的输入中，而无需修改底层 UNet 架构。大量实验表明，GarmentPainter 在视觉保真度、3D 一致性和计算效率方面实现了最先进的性能，在定性和定量评估方面均优于现有方法。

Title: Exploring Deep Learning and Ultra-Widefield Imaging for Diabetic Retinopathy and Macular Edema

Authors: Pablo Jimenez-Lizcano, Sergio Romero-Tapiador, Ruben Tolosana, Aythami Morales, Guillermo González de Rivera, Ruben Vera-Rodriguez, Julian Fierrez
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.08235
Pdf URL: https://arxiv.org/pdf/2603.08235
Copy Paste: [[2603.08235]] Exploring Deep Learning and Ultra-Widefield Imaging for Diabetic Retinopathy and Macular Edema(https://arxiv.org/abs/2603.08235)
Keywords: quality assessment
Abstract: Diabetic retinopathy (DR) and diabetic macular edema (DME) are leading causes of preventable blindness among working-age adults. Traditional approaches in the literature focus on standard color fundus photography (CFP) for the detection of these conditions. Nevertheless, recent ultra-widefield imaging (UWF) offers a significantly wider field of view in comparison to CFP. Motivated by this, the present study explores state-of-the-art deep learning (DL) methods and UWF imaging on three clinically relevant tasks: i) image quality assessment for UWF, ii) identification of referable diabetic retinopathy (RDR), and iii) identification of DME. Using the publicly available UWF4DR Challenge dataset, released as part of the MICCAI 2024 conference, we benchmark DL models in the spatial (RGB) and frequency domains, including popular convolutional neural networks (CNNs) as well as recent vision transformers (ViTs) and foundation models. In addition, we explore a final feature-level fusion to increase robustness. Finally, we also analyze the decisions of the DL models using Grad-CAM, increasing the explainability. Our proposal achieves consistently strong performance across all architectures, underscoring the competitiveness of emerging ViTs and foundation models and the promise of feature-level fusion and frequency-domain representations for UWF analysis.
摘要：糖尿病视网膜病变（DR）和糖尿病黄斑水肿（DME）是工作年龄成年人可预防性失明的主要原因。文献中的传统方法侧重于标准彩色眼底摄影（CFP）来检测这些情况。然而，与 CFP 相比，最近的超广角成像 (UWF) 提供了明显更宽的视野。受此启发，本研究探索了最先进的深度学习（DL）方法和 UWF 成像在三个临床相关任务中的应用：i）UWF 的图像质量评估，ii）可转诊的糖尿病视网膜病变（RDR）的识别，以及 iii）DME 的识别。使用作为 MICCAI 2024 会议一部分发布的公开 UWF4DR 挑战数据集，我们在空间 (RGB) 和频域中对 DL 模型进行基准测试，包括流行的卷积神经网络 (CNN) 以及最近的视觉变换器 (ViT) 和基础模型。此外，我们探索了最终的特征级融合以提高鲁棒性。最后，我们还使用 Grad-CAM 分析 DL 模型的决策，提高了可解释性。我们的建议在所有架构中实现了一致的强劲性能，强调了新兴 ViT 和基础模型的竞争力，以及 UWF 分析的特征级融合和频域表示的前景。

Title: WaDi: Weight Direction-aware Distillation for One-step Image Synthesis

Authors: Lei Wang, Yang Cheng, Senmao Li, Ge Wu, Yaxing Wang, Jian Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.08258
Pdf URL: https://arxiv.org/pdf/2603.08258
Copy Paste: [[2603.08258]] WaDi: Weight Direction-aware Distillation for One-step Image Synthesis(https://arxiv.org/abs/2603.08258)
Keywords: generation
Abstract: Despite the impressive performance of diffusion models such as Stable Diffusion (SD) in image generation, their slow inference limits practical deployment. Recent works accelerate inference by distilling multi-step diffusion into one-step generators. To better understand the distillation mechanism, we analyze U-Net/DiT weight changes between one-step students and their multi-step teacher counterparts. Our analysis reveals that changes in weight direction significantly exceed those in weight norm, highlighting it as the key factor during distillation. Motivated by this insight, we propose the Low-rank Rotation of weight Direction (LoRaD), a parameter-efficient adapter tailored to one-step diffusion distillation. LoRaD is designed to model these structured directional changes using learnable low-rank rotation matrices. We further integrate LoRaD into Variational Score Distillation (VSD), resulting in Weight Direction-aware Distillation (WaDi), a novel one-step distillation framework. WaDi achieves state-of-the-art FID scores on COCO 2014 and COCO 2017 while using only approximately 10% of the trainable parameters of the U-Net/DiT. Furthermore, the distilled one-step model demonstrates strong versatility and scalability, generalizing well to various downstream tasks such as controllable generation, relation inversion, and high-resolution synthesis.
摘要：尽管稳定扩散（SD）等扩散模型在图像生成方面具有令人印象深刻的性能，但它们的缓慢推理限制了实际部署。最近的工作通过将多步扩散提炼为一步生成器来加速推理。为了更好地理解蒸馏机制，我们分析了一步学生和多步教师对应者之间的 U-Net/DiT 权重变化。我们的分析表明，重量方向的变化显着超过重量标准的变化，强调它是蒸馏过程中的关键因素。受这一见解的启发，我们提出了低秩权重方向旋转（LoRaD），这是一种针对一步扩散蒸馏量身定制的参数高效适配器。 LoRaD 旨在使用可学习的低秩旋转矩阵对这些结构化方向变化进行建模。我们进一步将 LoRaD 集成到变分分数蒸馏 (VSD) 中，从而产生了重量方向感知蒸馏 (WaDi)——一种新颖的一步蒸馏框架。 WaDi 在 COCO 2014 和 COCO 2017 上取得了最先进的 FID 分数，同时仅使用 U-Net/DiT 约 10% 的可训练参数。此外，蒸馏的一步模型表现出强大的通用性和可扩展性，可以很好地推广到各种下游任务，例如可控生成、关系反转和高分辨率合成。
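
Illustrative sketch (not from the paper): the abstract reports that, between teacher and student weights, changes in direction exceed changes in norm, but it does not say how the two were measured. The Python below shows one simple way to separate the two effects for a single flattened weight tensor, using the angle between the vectors for direction and the relative difference in Euclidean norm for magnitude. The function names and the toy weights are hypothetical.

```python
import math


def l2_norm(w):
    """Euclidean norm of a flattened weight vector."""
    return math.sqrt(sum(x * x for x in w))


def direction_and_norm_change(teacher, student):
    """Return (angle in degrees between the weights, relative norm change)."""
    dot = sum(t * s for t, s in zip(teacher, student))
    cos = dot / (l2_norm(teacher) * l2_norm(student))
    cos = max(-1.0, min(1.0, cos))  # guard against floating-point overshoot
    angle = math.degrees(math.acos(cos))
    rel_norm = abs(l2_norm(student) - l2_norm(teacher)) / l2_norm(teacher)
    return angle, rel_norm


# Toy flattened weights: same norm, different direction (hypothetical values).
teacher = [3.0, 4.0]
student = [4.0, 3.0]

angle, rel_norm = direction_and_norm_change(teacher, student)
print(angle, rel_norm)
```

Here the norm is unchanged while the direction rotates, which is the pattern a rotation-style adapter is suited to capture.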

Title: Prototype-Guided Concept Erasure in Diffusion Models

Authors: Yuze Cai, Jiahao Lu, Hongxiang Shi, Yichao Zhou, Hong Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.08271
Pdf URL: https://arxiv.org/pdf/2603.08271
Copy Paste: [[2603.08271]] Prototype-Guided Concept Erasure in Diffusion Models(https://arxiv.org/abs/2603.08271)
Keywords: generation
Abstract: Concept erasure is extensively utilized in image generation to prevent text-to-image models from generating undesired content. Existing methods can effectively erase narrow concepts that are specific and concrete, such as distinct intellectual properties (e.g., Pikachu) or recognizable characters (e.g., Elon Musk). However, their performance degrades on broad concepts such as "sexual" or "violent", whose wide scope and multi-faceted nature make them difficult to erase reliably. To overcome this limitation, we exploit the model's intrinsic embedding geometry to identify latent embeddings that encode a given concept. By clustering these embeddings, we derive a set of concept prototypes that summarize the model's internal representations of the concept, and employ them as negative conditioning signals during inference to achieve precise and reliable erasure. Extensive experiments across multiple benchmarks show that our approach achieves substantially more reliable removal of broad concepts while preserving overall image quality, marking a step towards safer and more controllable image generation.
摘要：概念擦除广泛应用于图像生成中，以防止文本到图像模型生成不需要的内容。现有的方法可以有效地消除具体的狭隘概念，例如独特的知识产权（例如皮卡丘）或可识别的角色（例如埃隆·马斯克）。然而，它们的表现在“性”或“暴力”等广泛概念上表现不佳，其广泛的范围和多方面的性质使它们难以可靠地消除。为了克服这个限制，我们利用模型的内在嵌入几何来识别编码给定概念的潜在嵌入。通过对这些嵌入进行聚类，我们得出了一组概念原型，总结了模型对概念的内部表示，并在推理过程中将它们用作负条件信号，以实现精确可靠的擦除。跨多个基准的广泛实验表明，我们的方法在保持整体图像质量的同时，实现了更可靠地去除广泛概念，标志着朝着更安全、更可控的图像生成迈出了一步。

Title: Retrieval-Augmented Anatomical Guidance for Text-to-CT Generation

Authors: Daniele Molino, Camillo Maria Caruso, Paolo Soda, Valerio Guarrasi
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.08305
Pdf URL: https://arxiv.org/pdf/2603.08305
Copy Paste: [[2603.08305]] Retrieval-Augmented Anatomical Guidance for Text-to-CT Generation(https://arxiv.org/abs/2603.08305)
Keywords: generation, generative
Abstract: Text-conditioned generative models for volumetric medical imaging provide semantic control but lack explicit anatomical guidance, often resulting in outputs that are spatially ambiguous or anatomically inconsistent. In contrast, structure-driven methods ensure strong anatomical consistency but typically assume access to ground-truth annotations, which are unavailable when the target image is to be synthesized. We propose a retrieval-augmented approach for Text-to-CT generation that integrates semantic and anatomical information under a realistic inference setting. Given a radiology report, our method retrieves a semantically related clinical case using a 3D vision-language encoder and leverages its associated anatomical annotation as a structural proxy. This proxy is injected into a text-conditioned latent diffusion model via a ControlNet branch, providing coarse anatomical guidance while maintaining semantic flexibility. Experiments on the CT-RATE dataset show that retrieval-augmented generation improves image fidelity and clinical consistency compared to text-only baselines, while additionally enabling explicit spatial controllability, a capability inherently absent in such approaches. Further analysis highlights the importance of retrieval quality, with semantically aligned proxies yielding consistent gains across all evaluation axes. This work introduces a principled and scalable mechanism to bridge semantic conditioning and anatomical plausibility in volumetric medical image synthesis. Code will be released.
摘要：用于体积医学成像的文本条件生成模型提供语义控制，但缺乏明确的解剖学指导，通常导致输出空间模糊或解剖学不一致。相比之下，结构驱动的方法确保了很强的解剖一致性，但通常假设可以访问真实注释，而这些注释在合成目标图像时是不可用的。我们提出了一种用于文本到 CT 生成的检索增强方法，该方法在现实的推理设置下集成了语义和解剖信息。给定放射学报告，我们的方法使用 3D 视觉语言编码器检索语义相关的临床病例，并利用其相关的解剖注释作为结构代理。该代理通过 ControlNet 分支注入文本条件潜在扩散模型，提供粗略的解剖指导，同时保持语义灵活性。 CT-RATE 数据集上的实验表明，与纯文本基线相比，检索增强生成提高了图像保真度和临床一致性，同时还实现了明确的空间可控性，这是此类方法本质上不具备的功能。进一步的分析强调了检索质量的重要性，语义对齐的代理在所有评估轴上产生一致的增益。这项工作引入了一种原则性且可扩展的机制，以在体积医学图像合成中连接语义调节和解剖合理性。代码将被发布。

Title: HDR-NSFF: High Dynamic Range Neural Scene Flow Fields

Authors: Shin Dong-Yeon, Kim Jun-Seong, Kwon Byung-Ki, Tae-Hyun Oh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.08313
Pdf URL: https://arxiv.org/pdf/2603.08313
Copy Paste: [[2603.08313]] HDR-NSFF: High Dynamic Range Neural Scene Flow Fields(https://arxiv.org/abs/2603.08313)
Keywords: generative
Abstract: Radiance of real-world scenes typically spans a much wider dynamic range than what standard cameras can capture. While conventional HDR methods merge alternating-exposure frames, these approaches are inherently constrained to 2D pixel-level alignment, often leading to ghosting artifacts and temporal inconsistency in dynamic scenes. To address these limitations, we present HDR-NSFF, a paradigm shift from 2D-based merging to 4D spatio-temporal modeling. Our framework reconstructs dynamic HDR radiance fields from alternating-exposure monocular videos by representing the scene as a continuous function of space and time, and is compatible with both neural radiance field and 4D Gaussian Splatting (4DGS) based dynamic representations. This unified end-to-end pipeline explicitly models HDR radiance, 3D scene flow, geometry, and tone-mapping, ensuring physical plausibility and global coherence. We further enhance robustness by (i) extending semantic-based optical flow with DINO features to achieve exposure-invariant motion estimation, and (ii) incorporating a generative prior as a regularizer to compensate for limited observation in monocular captures and saturation-induced information loss. To evaluate HDR space-time view synthesis, we present the first real-world HDR-GoPro dataset specifically designed for dynamic HDR scenes. Experiments demonstrate that HDR-NSFF recovers fine radiance details and coherent dynamics even under challenging exposure variations, thereby achieving state-of-the-art performance in novel space-time view synthesis. Project page: this https URL
摘要：现实世界场景的亮度通常比标准相机可以捕捉的动态范围宽得多。虽然传统的 HDR 方法合并交替曝光帧，但这些方法本质上仅限于 2D 像素级对齐，通常会导致动态场景中出现重影伪影和时间不一致。为了解决这些限制，我们提出了 HDR-NSFF，这是从基于 2D 的合并到 4D 时空建模的范式转变。我们的框架通过将场景表示为空间和时间的连续函数，从交替曝光的单目视频中重建动态 HDR 辐射场，并且与神经辐射场和基于 4D 高斯泼溅 (4DGS) 的动态表示兼容。这种统一的端到端管道显式地对 HDR 辐射、3D 场景流、几何形状和色调映射进行建模，确保物理合理性和全局一致性。我们通过以下方式进一步增强鲁棒性：(i) 使用 DINO 特征扩展基于语义的光流以实现曝光不变的运动估计，以及 (ii) 将生成先验作为正则化器，以补偿单眼捕获中的有限观察和饱和引起的信息丢失。为了评估 HDR 时空视图合成，我们提出了第一个专门为动态 HDR 场景设计的现实世界 HDR-GoPro 数据集。实验表明，即使在具有挑战性的曝光变化下，HDR-NSFF 也能恢复精细的辐射细节和相干动态，从而在新颖的时空视图合成中实现最先进的性能。项目页面：此 https URL

Title: $Δ$VLA: Prior-Guided Vision-Language-Action Models via World Knowledge Variation

Authors: Yijie Zhu, Jie He, Rui Shao, Kaishen Yuan, Tao Tan, Xiaochen Yuan, Zitong Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.08361
Pdf URL: https://arxiv.org/pdf/2603.08361
Copy Paste: [[2603.08361]] $Δ$VLA: Prior-Guided Vision-Language-Action Models via World Knowledge Variation(https://arxiv.org/abs/2603.08361)
Keywords: generation
Abstract: Recent vision-language-action (VLA) models have significantly advanced robotic manipulation by unifying perception, reasoning, and control. To achieve such integration, recent studies adopt a predictive paradigm that models future visual states or world knowledge to guide action generation. However, these models emphasize forecasting outcomes rather than reasoning about the underlying process of change, which is essential for determining how to act. To address this, we propose $\Delta$VLA, a prior-guided framework that models world-knowledge variations relative to an explicit current-world knowledge prior for action generation, rather than regressing absolute future world states. Specifically, 1) to construct the current world knowledge prior, we propose the Prior-Guided World Knowledge Extractor (PWKE). It extracts manipulable regions, spatial relations, and semantic cues from the visual input, guided by auxiliary heads and prior pseudo labels, thus reducing redundancy. 2) Building upon this, to represent how world knowledge evolves under actions, we introduce the Latent World Variation Quantization (LWVQ). It learns a discrete latent space via a VQ-VAE objective to encode world knowledge variations, shifting prediction from full modalities to compact latents. 3) Moreover, to mitigate interference during variation modeling, we design the Conditional Variation Attention (CV-Atten), which promotes disentangled learning and preserves the independence of knowledge representations. Extensive experiments on both simulated benchmarks and real-world robotic tasks demonstrate that $\Delta$VLA achieves state-of-the-art performance while improving efficiency. Code and real-world execution videos are available at this https URL.
摘要：最近的视觉-语言-动作（VLA）模型通过统一感知、推理和控制，显着改进了机器人操作。为了实现这种整合，最近的研究采用了一种预测范式，对未来的视觉状态或世界知识进行建模，以指导行动的生成。然而，这些模型强调预测结果，而不是推理潜在的变化过程，而这对于确定如何采取行动至关重要。为了解决这个问题，我们提出了$\Delta$VLA，这是一个先验指导框架，它对相对于明确的当前世界知识先验的世界知识变化进行建模，以生成行动，而不是回归绝对的未来世界状态。具体来说，1）为了构建当前的世界知识先验，我们提出了先验引导的世界知识提取器（PWKE）。它在辅助头和先前的伪标签的引导下，从视觉输入中提取可操作的区域、空间关系和语义线索，从而减少冗余。 2）在此基础上，为了表示世界知识在行动下如何演变，我们引入了潜在世界变化量化（LWVQ）。它通过 VQ-VAE 目标学习离散潜在空间，以编码世界知识变化，将预测从完整模态转变为紧凑潜在。 3）此外，为了减轻变异建模过程中的干扰，我们设计了条件变异注意（CV-Atten），它促进解缠结学习并保持知识表示的独立性。对模拟基准和现实世界机器人任务的大量实验表明，$\Delta$VLA 在提高效率的同时实现了最先进的性能。此 https URL 提供了代码和实际执行视频。

Title: Diffusion-Based Data Augmentation for Image Recognition: A Systematic Analysis and Evaluation

Authors: Zekun Li, Yinghuan Shi, Yang Gao, Dong Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.08364
Pdf URL: https://arxiv.org/pdf/2603.08364
Copy Paste: [[2603.08364]] Diffusion-Based Data Augmentation for Image Recognition: A Systematic Analysis and Evaluation(https://arxiv.org/abs/2603.08364)
Keywords: generation
Abstract: Diffusion-based data augmentation (DiffDA) has emerged as a promising approach to improving classification performance under data scarcity. However, existing works vary significantly in task configurations, model choices, and experimental pipelines, making it difficult to fairly compare methods or assess their effectiveness across different scenarios. Moreover, there remains a lack of systematic understanding of the full DiffDA workflow. In this work, we introduce UniDiffDA, a unified analytical framework that decomposes DiffDA methods into three core components: model fine-tuning, sample generation, and sample utilization. This perspective enables us to identify key differences among existing methods and clarify the overall design space. Building on this framework, we develop a comprehensive and fair evaluation protocol, benchmarking representative DiffDA methods across diverse low-data classification tasks. Extensive experiments reveal the relative strengths and limitations of different DiffDA strategies and offer practical insights into method design and deployment. All methods are re-implemented within a unified codebase, with full release of code and configurations to ensure reproducibility and to facilitate future research.
摘要：基于扩散的数据增强（DiffDA）已成为一种在数据稀缺情况下提高分类性能的有前途的方法。然而，现有的工作在任务配置、模型选择和实验流程方面存在很大差异，因此很难公平地比较方法或评估其在不同场景下的有效性。此外，仍然缺乏对完整 DiffDA 工作流程的系统理解。在这项工作中，我们引入了 UniDiffDA，一个统一的分析框架，它将 DiffDA 方法分解为三个核心组件：模型微调、样本生成和样本利用。这种视角使我们能够识别现有方法之间的关键差异并阐明整体设计空间。在此框架的基础上，我们开发了一个全面且公平的评估协议，在不同的低数据分类任务中对代表性的 DiffDA 方法进行基准测试。大量实验揭示了不同 DiffDA 策略的相对优势和局限性，并为方法设计和部署提供了实用见解。所有方法都在统一的代码库中重新实现，并完整发布代码和配置，以确保可重复性并促进未来的研究。

Title: SPIRAL: A Closed-Loop Framework for Self-Improving Action World Models via Reflective Planning Agents

Authors: Yu Yang, Yue Liao, Jianbiao Mei, Baisen Wang, Xuemeng Yang, Licheng Wen, Jiangning Zhang, Xiangtai Li, Hanlin Chen, Botian Shi, Yong Liu, Shuicheng Yan, Gim Hee Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.08403
Pdf URL: https://arxiv.org/pdf/2603.08403
Copy Paste: [[2603.08403]] SPIRAL: A Closed-Loop Framework for Self-Improving Action World Models via Reflective Planning Agents(https://arxiv.org/abs/2603.08403)
Keywords: generation
Abstract: We introduce SPIRAL, a self-improving planning and iterative reflective action world modeling (ActWM) closed-loop framework that enables controllable long-horizon video generation conditioned on high-level semantic actions. Existing one-shot video generation models operate in open-loop, often resulting in incomplete action execution, weak semantic grounding, and temporal drift. SPIRAL formulates ActWM as a closed-loop think-act-reflect process, where generation proceeds step by step under explicit planning and feedback. A PlanAgent decomposes abstract actions into object-centric sub-actions, while a CriticAgent evaluates intermediate results and guides iterative refinement with long-horizon memory. This closed-loop design naturally supports RL evolving optimization, improving semantic alignment and temporal consistency over extended horizons. We further introduce the ActWM-Dataset and ActWM-Bench for training and evaluation. Experiments across multiple TI2V backbones demonstrate consistent gains on ActWM-Bench and mainstream video generation benchmarks, validating SPIRAL's effectiveness.
摘要：我们引入了 SPIRAL，这是一种自我改进的规划和迭代反射动作世界建模闭环框架，可以实现以高级语义动作为条件的可控长视野视频生成。现有的一次性视频生成模型以开环方式运行，通常会导致动作执行不完整、语义基础薄弱和时间漂移。 SPIRAL 将 ActWM 制定为一个闭环的思考-行动-反思过程，在明确的规划和反馈下逐步进行生成。 PlanAgent 将抽象操作分解为以对象为中心的子操作，而 CriticAgent 评估中间结果并通过长视野内存指导迭代细化。这种闭环设计自然支持 RL 不断演进的优化，提高语义对齐和扩展范围内的时间一致性。我们进一步引入 ActWM-Dataset 和 ActWM-Bench 进行训练和评估。跨多个 TI2V 主干网的实验证明了 ActWM-Bench 和主流视频生成基准的一致增益，验证了 SPIRAL 的有效性。

Title: LycheeCluster: Efficient Long-Context Inference with Structure-Aware Chunking and Hierarchical KV Indexing

Authors: Dongfang Li, Zixuan Liu, Gang Lin, Baotian Hu, Min Zhang
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2603.08453
Pdf URL: https://arxiv.org/pdf/2603.08453
Copy Paste: [[2603.08453]] LycheeCluster: Efficient Long-Context Inference with Structure-Aware Chunking and Hierarchical KV Indexing(https://arxiv.org/abs/2603.08453)
Keywords: generation
Abstract: The quadratic complexity of the attention mechanism and the substantial memory footprint of the Key-Value (KV) cache present severe computational and memory challenges for Large Language Models (LLMs) processing long contexts. Existing retrieval-based methods often compromise semantic integrity through fixed-size chunking and suffer from inefficient linear scanning. In this paper, we propose LycheeCluster, a novel method for efficient KV cache management. LycheeCluster preserves local semantic coherence via boundary-aware chunking and constructs a recursive hierarchical index rooted in the triangle inequality. This design transforms cache retrieval from a linear scan into a theoretically bounded, logarithmic-time pruning process, while a lazy update strategy supports efficient streaming generation. Experiments demonstrate that LycheeCluster achieves up to a 3.6x end-to-end inference speedup with negligible degradation in model performance, outperforming state-of-the-art KV cache management methods (e.g., Quest, ClusterKV). We will release our code and kernels after publication.
摘要：注意力机制的二次复杂度和键值（KV）缓存的大量内存占用给处理长上下文的大型语言模型（LLM）带来了严峻的计算和内存挑战。现有的基于检索的方法经常通过固定大小的分块来损害语义完整性，并且遭受低效的线性扫描。在本文中，我们提出了 LycheeCluster，一种高效 KV 缓存管理的新方法。 LycheeCluster 通过边界感知分块保持局部语义一致性，并构建基于三角不等式的递归分层索引。这种设计将缓存检索从线性扫描转变为理论上有界的对数时间修剪过程，同时惰性更新策略支持高效的流生成。实验表明，LycheeCluster 实现了高达 3.6 倍的端到端推理加速，模型性能的下降几乎可以忽略不计，优于最先进的 KV 缓存管理方法（例如 Quest、ClusterKV）。我们将在发布后发布我们的代码和内核。
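
The triangle-inequality pruning idea mentioned in the abstract can be illustrated with a minimal sketch. This is not LycheeCluster's implementation: it assumes Euclidean distance over toy 2D points rather than attention scores over KV chunks, uses a single flat level of clusters instead of a recursive hierarchy, and the names `build_clusters` and `nearest` are hypothetical. For a cluster with centroid c and radius r, every member p satisfies d(q, p) >= d(q, c) - r, so a cluster whose lower bound already exceeds the best distance found so far can be skipped without scanning its members.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_clusters(points, centroids):
    """Assign each point to its nearest centroid and track each cluster's radius."""
    clusters = [{"centroid": c, "members": [], "radius": 0.0} for c in centroids]
    for p in points:
        best = min(clusters, key=lambda cl: dist(p, cl["centroid"]))
        best["members"].append(p)
        best["radius"] = max(best["radius"], dist(p, best["centroid"]))
    return clusters

def nearest(query, clusters):
    """Return (nearest point, number of clusters whose members were scanned)."""
    best_point, best_d, scanned = None, math.inf, 0
    # Visit clusters closest-centroid first so the best distance tightens early.
    for cl in sorted(clusters, key=lambda cl: dist(query, cl["centroid"])):
        # Triangle inequality: no member can be closer than d(q, c) - r.
        lower_bound = dist(query, cl["centroid"]) - cl["radius"]
        if lower_bound > best_d:
            continue  # prune the whole cluster without touching its members
        scanned += 1
        for p in cl["members"]:
            d = dist(query, p)
            if d < best_d:
                best_point, best_d = p, d
    return best_point, scanned

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
clusters = build_clusters(points, centroids=[(0, 0), (10, 10)])
point, scanned = nearest((0.2, 0.1), clusters)
print(point, scanned)
```

Applying the same bound at every level of a tree of clusters is what lets retrieval skip whole subtrees instead of scanning every chunk linearly.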

Title: Reasoning as Compression: Unifying Budget Forcing via the Conditional Information Bottleneck

Authors: Fabio Valerio Massoli, Andrey Kuzmin, Arash Behboodi
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.08462
Pdf URL: https://arxiv.org/pdf/2603.08462
Copy Paste: [[2603.08462]] Reasoning as Compression: Unifying Budget Forcing via the Conditional Information Bottleneck(https://arxiv.org/abs/2603.08462)
Keywords: generation
Abstract: Chain-of-Thought (CoT) prompting improves LLM accuracy on complex tasks but often increases token usage and inference cost. Existing "Budget Forcing" methods, which reduce cost via fine-tuning with heuristic length penalties, suppress both essential reasoning and redundant filler. We recast efficient reasoning as a lossy compression problem under the Information Bottleneck (IB) principle, and identify a key theoretical gap when applying naive IB to transformers: attention violates the Markov property between prompt, reasoning trace, and response. To resolve this issue, we model CoT generation under the Conditional Information Bottleneck (CIB) principle, where the reasoning trace Z acts as a computational bridge that contains only the information about the response Y that is not directly accessible from the prompt X. This yields a general Reinforcement Learning objective: maximize task reward while compressing completions under a prior over reasoning traces, subsuming common heuristics (e.g., length penalties) as special cases (e.g., uniform priors). In contrast to naive token-counting-based approaches, we introduce a semantic prior that measures token cost by surprisal under a language model prior. Empirically, our CIB objective prunes cognitive bloat while preserving fluency and logic, improving accuracy at moderate compression and enabling aggressive compression with minimal accuracy drop.
摘要：思想链 (CoT) 提示提高了 LLM 在复杂任务上的准确性，但通常会增加令牌使用和推理成本。现有的“预算强制”方法通过启发式长度惩罚的微调来降低成本，抑制基本推理和冗余填充。我们将高效推理重新定义为信息瓶颈（IB）原理下的有损压缩问题，并确定了将朴素 IB 应用于变压器时的一个关键理论差距：注意力违反了提示、推理轨迹和响应之间的马尔可夫性质。为了解决这个问题，我们根据条件信息瓶颈（CIB）原理对 CoT 生成进行建模，其中推理轨迹 Z 充当计算桥梁，仅包含有关无法从提示 X 直接访问的响应 Y 的信息。这产生了一般的强化学习目标：最大化任务奖励，同时压缩先验推理轨迹下的完成情况，将常见启发式（例如长度惩罚）纳入特殊情况（例如统一先验）。与基于朴素令牌计数的方法相比，我们引入了一种语义先验，可以在语言模型先验下通过意外来测量令牌成本。根据经验，我们的 CIB 目标可以在保持流畅性和逻辑性的同时修剪认知膨胀，提高适度压缩的准确性，并以最小的准确性下降实现积极的压缩。
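
The claim that length penalties are a special case of a prior-based cost can be checked with a small sketch. It assumes a toy vocabulary and a context-free (unigram) prior, whereas a real language-model prior would condition on preceding tokens. The function `surprisal_cost` and the probabilities in `skewed` are hypothetical illustrations, not the paper's objective. Under a uniform prior over a vocabulary of size V, every token costs log V, so the total cost is just the length times a constant.

```python
import math

def surprisal_cost(tokens, prior):
    """Total surprisal in nats: the sum of -log p(token) under the given prior."""
    return sum(-math.log(prior[t]) for t in tokens)

vocab = ["so", "therefore", "x", "=", "4", "wait"]
trace = ["so", "x", "=", "4"]

# Uniform prior: every token costs log(len(vocab)), i.e. a pure length penalty.
uniform = {t: 1.0 / len(vocab) for t in vocab}
uniform_cost = surprisal_cost(trace, uniform)

# Hypothetical non-uniform prior (made-up values that sum to 1):
# the cost now depends on which tokens appear, not only on how many.
skewed = {"so": 0.3, "therefore": 0.2, "x": 0.1, "=": 0.1, "4": 0.05, "wait": 0.25}
skewed_cost = surprisal_cost(trace, skewed)

print(uniform_cost, len(trace) * math.log(len(vocab)), skewed_cost)
```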

Title: X-AVDT: Audio-Visual Cross-Attention for Robust Deepfake Detection

Authors: Youngseo Kim, Kwan Yun, Seokhyeon Hong, Sihun Cha, Colette Suhjung Koo, Junyong Noh
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.08483
Pdf URL: https://arxiv.org/pdf/2603.08483
Copy Paste: [[2603.08483]] X-AVDT: Audio-Visual Cross-Attention for Robust Deepfake Detection(https://arxiv.org/abs/2603.08483)
Keywords: generation, generative
Abstract: The surge of highly realistic synthetic videos produced by contemporary generative systems has significantly increased the risk of malicious use, challenging both humans and existing detectors. Against this backdrop, we take a generator-side view and observe that internal cross-attention mechanisms in these models encode fine-grained speech-motion alignment, offering useful correspondence cues for forgery detection. Building on this insight, we propose X-AVDT, a robust and generalizable deepfake detector that probes generator-internal audio-visual signals accessed via DDIM inversion to expose these cues. X-AVDT extracts two complementary signals: (i) a video composite capturing inversion-induced discrepancies, and (ii) an audio-visual cross-attention feature reflecting modality alignment enforced during generation. To enable faithful cross-generator evaluation, we further introduce MMDF, a new multimodal deepfake dataset spanning diverse manipulation types and rapidly evolving synthesis paradigms, including GANs, diffusion, and flow-matching. Extensive experiments demonstrate that X-AVDT achieves leading performance on MMDF and generalizes strongly to external benchmarks and unseen generators, outperforming existing methods with accuracy improved by 13.1%. Our findings highlight the importance of leveraging internal audio-visual consistency cues for robustness to future generators in deepfake detection.
摘要：当代生成系统产生的高度逼真的合成视频的激增大大增加了恶意使用的风险，对人类和现有的检测器都提出了挑战。在此背景下，我们从生成器端的角度观察，观察到这些模型中的内部交叉注意机制编码细粒度的语音运动对齐，为伪造检测提供有用的对应线索。基于这一见解，我们提出了 X-AVDT，这是一种强大且可通用的深度伪造检测器，可探测通过 DDIM 反转访问的生成器内部视听信号以暴露这些线索。 X-AVDT 提取两个互补信号：(i) 捕获反转引起的差异的视频复合信号，以及 (ii) 反映生成过程中强制执行的模态对齐的视听交叉注意特征。为了实现忠实的跨生成器评估，我们进一步引入了 MMDF，这是一种新的多模态 Deepfake 数据集，涵盖多种操作类型和快速发展的合成范式，包括 GAN、扩散和流匹配。大量实验表明，X-AVDT 在 MMDF 上实现了领先的性能，并对外部基准和未见过的生成器具有很强的泛化能力，优于现有方法，精度提高了 13.1%。我们的研究结果强调了利用内部视听一致性线索来增强未来深度伪造检测生成器的鲁棒性的重要性。

Title: SWIFT: Sliding Window Reconstruction for Few-Shot Training-Free Generated Video Attribution

Authors: Chao Wang, Zijin Yang, Yaofei Wang, Yuang Qi, Weiming Zhang, Nenghai Yu, Kejiang Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.08536
Pdf URL: https://arxiv.org/pdf/2603.08536
Copy Paste: [[2603.08536]] SWIFT: Sliding Window Reconstruction for Few-Shot Training-Free Generated Video Attribution(https://arxiv.org/abs/2603.08536)
Keywords: generation
Abstract: Recent advancements in video generation technologies have been significant, resulting in their widespread application across multiple domains. However, concerns have been mounting over the potential misuse of generated content. Tracing the origin of generated videos has become crucial to mitigate potential misuse and identify responsible parties. Existing video attribution methods require additional operations or the training of source attribution models, which may degrade video quality or necessitate large amounts of training samples. To address these challenges, we define for the first time the "few-shot training-free generated video attribution" task and propose SWIFT, which is tightly integrated with the temporal characteristics of the video. By leveraging the "Pixel Frames (many) to Latent Frame (one)" temporal mapping within each video chunk, SWIFT applies a fixed-length sliding window to perform two distinct reconstructions: normal and corrupted. The variation in the losses between the two reconstructions is then used as an attribution signal. We conducted an extensive evaluation of five state-of-the-art (SOTA) video generation models. Experimental results show that SWIFT achieves over 90% average attribution accuracy with merely 20 video samples across all models and even enables zero-shot attribution for HunyuanVideo, EasyAnimate, and Wan2.2. Our source code is available at this https URL.
摘要：视频生成技术的最新进展非常显着，导致其在多个领域得到广泛应用。然而，人们越来越担心生成内容可能被滥用。追踪生成视频的来源对于减少潜在的滥用和识别责任方至关重要。现有的视频归因方法需要额外的操作或源归因模型的训练，这可能会降低视频质量或需要大量的训练样本。为了应对这些挑战，我们首次定义了“少镜头免训练生成视频归因”任务，并提出了与视频时间特征紧密结合的 SWIFT。通过利用每个视频块内的“像素帧（许多）到潜在帧（一个）”时间映射，SWIFT 应用固定长度滑动窗口来执行两种不同的重建：正常重建和损坏重建。然后将两次重建之间的损失变化用作归因信号。我们对五种最先进的 (SOTA) 视频生成模型进行了广泛的评估。实验结果表明，SWIFT 在所有模型中仅用 20 个视频样本就实现了超过 90% 的平均归因准确率，甚至实现了 HunyuanVideo、EasyAnimate 和 Wan2.2 的零样本归因。我们的源代码可以通过此 https URL 获取。

Title: BioGait-VLM: A Tri-Modal Vision-Language-Biomechanics Framework for Interpretable Clinical Gait Assessment

Authors: Erdong Chen, Yuyang Ji, Jacob K. Greenberg, Benjamin Steel, Faraz Arkam, Abigail Lewis, Pranay Singh, Feng Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.08564
Pdf URL: https://arxiv.org/pdf/2603.08564
Copy Paste: [[2603.08564]] BioGait-VLM: A Tri-Modal Vision-Language-Biomechanics Framework for Interpretable Clinical Gait Assessment(https://arxiv.org/abs/2603.08564)
Keywords: generative
Abstract: Video-based Clinical Gait Analysis often suffers from poor generalization as models overfit environmental biases instead of capturing pathological motion. To address this, we propose BioGait-VLM, a tri-modal Vision-Language-Biomechanics framework for interpretable clinical gait assessment. Unlike standard video encoders, our architecture incorporates a Temporal Evidence Distillation branch to capture rhythmic dynamics and a Biomechanical Tokenization branch that projects 3D skeleton sequences into language-aligned semantic tokens. This enables the model to explicitly reason about joint mechanics independent of visual shortcuts. To ensure rigorous benchmarking, we augment the public GAVD dataset with a high-fidelity Degenerative Cervical Myelopathy (DCM) cohort to form a unified 8-class taxonomy, establishing a strict subject-disjoint protocol to prevent data leakage. Under this setting, BioGait-VLM achieves state-of-the-art recognition accuracy. Furthermore, a blinded expert study confirms that biomechanical tokens significantly improve clinical plausibility and evidence grounding, offering a path toward transparent, privacy-enhanced gait assessment.
摘要：基于视频的临床步态分析通常存在泛化性差的问题，因为模型过度拟合环境偏差而不是捕捉病理运动。为了解决这个问题，我们提出了 BioGait-VLM，这是一种用于可解释的临床步态评估的三模态视觉-语言-生物力学框架。与标准视频编码器不同，我们的架构结合了时间证据蒸馏分支来捕获节奏动态，以及生物力学标记化分支，将 3D 骨架序列投影为语言对齐的语义标记。这使得模型能够独立于视觉快捷方式明确地推理关节力学。为了确保严格的基准测试，我们用高保真退行性脊髓病 (DCM) 队列扩充了公共 GAVD 数据集，形成统一的 8 类分类法，建立严格的主题不相交协议以防止数据泄漏。在此设置下，BioGait-VLM 实现了最先进的识别精度。此外，一项盲法专家研究证实，生物力学标记显着提高了临床合理性和证据基础，为透明、增强隐私的步态评估提供了一条途径。

Title: PRISM: Streaming Human Motion Generation with Per-Joint Latent Decomposition

Authors: Zeyu Ling, Qing Shuai, Teng Zhang, Shiyang Li, Bo Han, Changqing Zou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.08590
Pdf URL: https://arxiv.org/pdf/2603.08590
Copy Paste: [[2603.08590]] PRISM: Streaming Human Motion Generation with Per-Joint Latent Decomposition(https://arxiv.org/abs/2603.08590)
Keywords: generation
Abstract: Text-to-motion generation has advanced rapidly, yet two challenges persist. First, existing motion autoencoders compress each frame into a single monolithic latent vector, entangling trajectory and per-joint rotations in an unstructured representation that downstream generators struggle to model faithfully. Second, text-to-motion, pose-conditioned generation, and long-horizon sequential synthesis typically require separate models or task-specific mechanisms, with autoregressive approaches suffering from severe error accumulation over extended rollouts. We present PRISM, addressing each challenge with a dedicated contribution. (1) A joint-factorized motion latent space: each body joint occupies its own token, forming a structured 2D grid (time × joints) compressed by a causal VAE with forward-kinematics supervision. This simple change to the latent space -- without modifying the generator -- substantially improves generation quality, revealing that latent space design has been an underestimated bottleneck. (2) Noise-free condition injection: each latent token carries its own timestep embedding, allowing conditioning frames to be injected as clean tokens (timestep 0) while the remaining tokens are denoised. This unifies text-to-motion and pose-conditioned generation in a single model, and directly enables autoregressive segment chaining for streaming synthesis. Self-forcing training further suppresses drift in long rollouts. With these two components, we train a single motion generation foundation model that seamlessly handles text-to-motion, pose-conditioned generation, autoregressive sequential generation, and narrative motion composition, achieving state-of-the-art on HumanML3D, MotionHub, BABEL, and a 50-scenario user study.
摘要：文本转动画的生成技术发展迅速，但仍然存在两个挑战。首先，现有的运动自动编码器将每个帧压缩为单个整体潜在向量，以非结构化表示形式纠缠轨迹和每关节旋转，下游生成器很难忠实地建模。其次，文本到运动、姿势条件生成和长视野顺序合成通常需要单独的模型或特定于任务的机制，而自回归方法在扩展部署过程中会遭受严重的错误累积。我们推出 PRISM，以奉献精神应对每一项挑战。（1）关节分解的运动潜在空间：每个身体关节占据自己的令牌，形成一个由具有正向运动学监督的因果 VAE 压缩的结构化二维网格（时间关节）。这种对潜在空间的简单改变——无需修改生成器——极大地提高了生成质量，揭示了潜在空间设计一直是一个被低估的瓶颈。（2）无噪声条件注入：每个潜在令牌都带有自己的时间步嵌入，允许将条件帧作为干净令牌（timestep0）注入，而其余令牌则被去噪。这将文本到运动和姿势条件生成统一在单个模型中，并直接启用用于流合成的自回归片段链接。自我强迫训练进一步抑制了长期部署中的漂移。借助这两个组件，我们训练了一个单一运动生成基础模型，该模型可以无缝处理文本到运动、姿势条件生成、自回归顺序生成和叙事运动合成，在 HumanML3D、MotionHub、BABEL 和 50 个场景的用户研究上实现最先进的水平。
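
The noise-free condition injection described in the abstract can be pictured as a per-token timestep grid. The sketch below is only an illustration under stated assumptions: the helper `per_token_timesteps`, its arguments, and the choice to mark whole frames as conditioning are hypothetical, and the actual PRISM model may organize this differently. Each (frame, joint) token gets its own timestep. Conditioning frames are assigned timestep 0, meaning clean, while all other tokens share the current denoising timestep.

```python
def per_token_timesteps(num_frames, num_joints, cond_frames, t):
    """Build a frames-by-joints grid of timesteps.

    Tokens in conditioning frames get timestep 0 (treated as clean);
    every other token gets the current denoising timestep t.
    """
    cond = set(cond_frames)
    return [
        [0 if frame in cond else t for _ in range(num_joints)]
        for frame in range(num_frames)
    ]

# Chain a new segment onto two already-generated frames (indices 0 and 1).
grid = per_token_timesteps(num_frames=4, num_joints=3, cond_frames=[0, 1], t=500)
for row in grid:
    print(row)
```

With an empty `cond_frames` list the grid reduces to ordinary denoising of every token, which is how one model can cover both the unconditioned and pose-conditioned settings in this sketch.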

Title: CAST: Modeling Visual State Transitions for Consistent Video Retrieval

Authors: Yanqing Liu, Yingcheng Liu, Fanghong Dong, Budianto Budianto, Cihang Xie, Yan Jiao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.08648
Pdf URL: https://arxiv.org/pdf/2603.08648
Copy Paste: [[2603.08648]] CAST: Modeling Visual State Transitions for Consistent Video Retrieval(https://arxiv.org/abs/2603.08648)
Keywords: generation
Abstract: As video content creation shifts toward long-form narratives, composing short clips into coherent storylines becomes increasingly important. However, prevailing retrieval formulations remain context-agnostic at inference time, prioritizing local semantic alignment while neglecting state and identity consistency. To address this structural limitation, we formalize the task of Consistent Video Retrieval (CVR) and introduce a diagnostic benchmark spanning YouCook2, COIN, and CrossTask. We propose CAST (Context-Aware State Transition), a lightweight, plug-and-play adapter compatible with diverse frozen vision-language embedding spaces. By predicting a state-conditioned residual update ($\Delta$) from visual history, CAST introduces an explicit inductive bias for latent state evolution. Extensive experiments show that CAST improves performance on YouCook2 and CrossTask, remains competitive on COIN, and consistently outperforms zero-shot baselines across diverse foundation backbones. Furthermore, CAST provides a useful reranking signal for black-box video generation candidates (e.g., from Veo), promoting more temporally coherent continuations.
摘要：随着视频内容创作转向长篇叙事，将短片组成连贯的故事情节变得越来越重要。然而，流行的检索公式在推理时仍然与上下文无关，优先考虑局部语义对齐，而忽略状态和身份一致性。为了解决这种结构限制，我们将一致性视频检索 (CVR) 任务形式化，并引入了涵盖 YouCook2、COIN 和 CrossTask 的诊断基准。我们提出了 CAST（上下文感知状态转换），这是一种轻量级、即插即用的适配器，与多种冻结视觉语言嵌入空间兼容。通过根据视觉历史预测状态条件残差更新 ($\Delta$)，CAST 为潜在状态演化引入了显式归纳偏差。大量实验表明，CAST 提高了 YouCook2 和 CrossTask 上的性能，在 COIN 上保持竞争力，并且在不同的基础主干上始终优于零样本基线。此外，CAST 为黑盒视频生成候选（例如，来自 Veo）提供了有用的重新排序信号，促进时间上更加一致的连续性。

Title: HiAR: Efficient Autoregressive Long Video Generation via Hierarchical Denoising

Authors: Kai Zou, Dian Zheng, Hongbo Liu, Tiankai Hang, Bin Liu, Nenghai Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.08703
Pdf URL: https://arxiv.org/pdf/2603.08703
Copy Paste: [[2603.08703]] HiAR: Efficient Autoregressive Long Video Generation via Hierarchical Denoising(https://arxiv.org/abs/2603.08703)
Keywords: generation
Abstract: Autoregressive (AR) diffusion offers a promising framework for generating videos of theoretically infinite length. However, a major challenge is maintaining temporal continuity while preventing the progressive quality degradation caused by error accumulation. To ensure continuity, existing methods typically condition on highly denoised contexts; yet, this practice propagates prediction errors with high certainty, thereby exacerbating degradation. In this paper, we argue that a highly clean context is unnecessary. Drawing inspiration from bidirectional diffusion models, which denoise frames at a shared noise level while maintaining coherence, we propose that conditioning on context at the same noise level as the current block provides sufficient signal for temporal consistency while effectively mitigating error propagation. Building on this insight, we propose HiAR, a hierarchical denoising framework that reverses the conventional generation order: instead of completing each block sequentially, it performs causal generation across all blocks at every denoising step, so that each block is always conditioned on context at the same noise level. This hierarchy naturally admits pipelined parallel inference, yielding a 1.8× wall-clock speedup in our 4-step setting. We further observe that self-rollout distillation under this paradigm amplifies a low-motion shortcut inherent to the mode-seeking reverse-KL objective. To counteract this, we introduce a forward-KL regulariser in bidirectional-attention mode, which preserves motion diversity for causal inference without interfering with the distillation loss. On VBench (20s generation), HiAR achieves the best overall score and the lowest temporal drift among all compared methods.
摘要：自回归（AR）扩散为生成理论上无限长度的视频提供了一个有前途的框架。然而，一个主要挑战是保持时间连续性，同时防止错误累积引起的质量逐渐下降。为了确保连续性，现有方法通常以高度去噪的上下文为条件；然而，这种做法会以高确定性传播预测误差，从而加剧退化。在本文中，我们认为高度干净的上下文是不必要的。受到双向扩散模型的启发，双向扩散模型在共享噪声水平上对帧进行降噪，同时保持一致性，我们建议以与当前块相同的噪声水平对上下文进行调节，为时间一致性提供足够的信号，同时有效地减轻错误传播。基于这一见解，我们提出了 HiAR，一种分层去噪框架，它反转了传统的生成顺序：它不是按顺序完成每个块，而是在每个去噪步骤中对所有块执行因果生成，以便每个块始终以相同噪声水平的上下文为条件。这种层次结构自然允许流水线并行推理，在我们的 4 步设置中产生 1.8 的挂钟加速。我们进一步观察到，这种范式下的自推出蒸馏放大了模式搜索反向 KL 目标固有的低运动捷径。为了解决这个问题，我们在双向注意模式下引入了前向 KL 正则化器，它保留了因果推理的运动多样性，而不干扰蒸馏损失。在 VBench（20 代）上，HiAR 在所有比较方法中取得了最好的总体得分和最低的时间漂移。
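
The reversed generation order and the pipelining it allows can be sketched as plain scheduling logic. This is a toy model, not HiAR's code: the function names are hypothetical, and it assumes that task (block b, step s) depends only on the earlier blocks at the same step s and on block b at step s - 1. Under that assumption, tasks with equal b + s are mutually independent and could run in parallel as one wavefront. The idealized wavefront count ignores real hardware costs and is not meant to reproduce the speedup reported in the abstract.

```python
def sequential_order(num_blocks, num_steps):
    """Conventional AR order: fully denoise one block before starting the next."""
    return [(b, s) for b in range(num_blocks) for s in range(num_steps)]

def hierarchical_order(num_blocks, num_steps):
    """Reversed order: sweep all blocks causally at each denoising step."""
    return [(b, s) for s in range(num_steps) for b in range(num_blocks)]

def wavefronts(num_blocks, num_steps):
    """Group (block, step) tasks by b + s.

    Under the assumed dependencies, every task's prerequisites have a
    strictly smaller b + s, so each group can execute concurrently.
    """
    waves = {}
    for b in range(num_blocks):
        for s in range(num_steps):
            waves.setdefault(b + s, []).append((b, s))
    return [waves[k] for k in sorted(waves)]

print(hierarchical_order(2, 2))
print(len(sequential_order(5, 4)), "serial tasks vs", len(wavefronts(5, 4)), "wavefronts")
```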