2026-03-13

Title: H2LooP Spark Preview: Continual Pretraining of Large Language Models for Low-Level Embedded Systems Code

Authors: Amit Singh, Vedant Nipane, Pulkit Agrawal, Jatin Kishnani
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.11139
Pdf URL: https://arxiv.org/pdf/2603.11139
Copy Paste: [[2603.11139]] H2LooP Spark Preview: Continual Pretraining of Large Language Models for Low-Level Embedded Systems Code(https://arxiv.org/abs/2603.11139)
Keywords: generation, generative
Abstract: Large language models (LLMs) demonstrate strong code generation abilities in general-purpose programming languages but remain limited in specialized domains such as low-level embedded systems programming. This domain involves hardware register manipulation, vendor-specific SDKs, real-time operating system APIs, and hardware abstraction layers that are underrepresented in standard pretraining corpora. We introduce H2LooP Spark Preview, a continual pretraining (CPT) pipeline that adapts OLMo-3-7B, a fully open language model, to the embedded systems domain using BF16 LoRA with rank-stabilized scaling on 8 NVIDIA H100 GPUs. Our training corpus is constructed from repository-datasheet pairs covering 100B tokens of raw embedded systems data across 117 manufacturers, processed using the hierarchical datasheet-to-code mapping approach proposed in SpecMap (Nipane et al., 2026). The resulting curated dataset split contains 23.5B tokens across 13 embedded domains. Continual pretraining with high-rank LoRA (r=512) yields substantial gains, reducing in-domain perplexity by 70.4% and held-out repository perplexity by 66.1%. On generative code completion benchmarks spanning 13 embedded domains, our 7B model outperforms Claude Opus 4.6 and Qwen3-Coder-30B on 8 categories in token accuracy, showing that targeted continual pretraining enables smaller open-weight models to rival frontier systems on specialized technical tasks. We release the production training checkpoint on Hugging Face as an open-source artifact.
摘要：大型语言模型 (LLM) 在通用编程语言中展示了强大的代码生成能力，但在低级嵌入式系统编程等专业领域仍然受到限制。该领域涉及硬件寄存器操作、特定于供应商的 SDK、实时操作系统 API 以及在标准预训练语料库中代表性不足的硬件抽象层。我们推出了 H2LooP Spark Preview，这是一个持续预训练 (CPT) 管道，它使用 BF16 LoRA 在 8 个 NVIDIA H100 GPU 上进行排名稳定扩展，将 OLMo-3-7B（一种完全开放的语言模型）调整到嵌入式系统领域。我们的训练语料库是由存储库-数据表对构建的，涵盖 117 个制造商的 100B 原始嵌入式系统数据，并使用 SpecMap 中提出的分层数据表到代码映射方法进行处理（Nipane 等人，2026）。生成的精选数据集分割包含跨 13 个嵌入域的 23.5B 个令牌。使用高阶 LoRA (r=512) 进行持续预训练可产生巨大收益，将域内困惑度降低 70.4%，将存储库困惑度降低 66.1%。在跨越 13 个嵌入式领域的生成代码完成基准上，我们的 7B 模型在 8 个类别的标记准确度上优于 Claude Opus 4.6 和 Qwen3-Coder-30B，这表明有针对性的持续预训练使较小的开放权重模型能够在专业技术任务上与前沿系统相媲美。我们将 Huggingface 上的生产训练检查点作为开源工件发布。
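
The "rank-stabilized scaling" mentioned in the abstract above can be illustrated with a small sketch. It assumes the commonly used rsLoRA convention, in which the low-rank update is multiplied by alpha / sqrt(r) instead of the standard alpha / r; the abstract does not state the value of alpha, so `ALPHA` below is a hypothetical placeholder, and the function name is illustrative only.

```python
import math


def lora_scale(alpha: float, rank: int, rank_stabilized: bool) -> float:
    """Return the multiplier applied to the low-rank update B @ A."""
    if rank_stabilized:
        # rsLoRA convention: the scale shrinks with sqrt(rank)
        return alpha / math.sqrt(rank)
    # standard LoRA convention: the scale shrinks linearly with rank
    return alpha / rank


ALPHA = 16.0  # hypothetical value, not taken from the abstract

for rank in (8, 64, 512):
    standard = lora_scale(ALPHA, rank, rank_stabilized=False)
    stabilized = lora_scale(ALPHA, rank, rank_stabilized=True)
    print(f"r={rank:4d}  standard={standard:.5f}  rank-stabilized={stabilized:.5f}")
```

At r=512 the standard multiplier becomes very small, whereas the rank-stabilized one decays much more slowly, which is the usual motivation for pairing this scaling with high-rank adapters.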

Title: Frequency-Modulated Visual Restoration for Matryoshka Large Multimodal Models

Authors: Qingtao Pan, Zhihao Dou, Shuo Li
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2603.11220
Pdf URL: https://arxiv.org/pdf/2603.11220
Copy Paste: [[2603.11220]] Frequency-Modulated Visual Restoration for Matryoshka Large Multimodal Models(https://arxiv.org/abs/2603.11220)
Keywords: restoration
Abstract: Large Multimodal Models (LMMs) struggle to adapt to varying computational budgets due to their numerous visual tokens. Previous methods attempted to reduce the number of visual tokens before or within LLMs. However, these strategies inevitably result in the loss of visual semantics. To address these issues, we introduce FMVR, a plug-and-play and extremely simple Frequency-Modulated Visual Restoration strategy to boost the reasoning ability of LMMs under visual token reduction. Specifically, FMVR disentangles the visual representation of fewer visual tokens into low- and high-frequency components through AvgPool and MaxPool. The derived frequencies are subsequently modulated using lightweight learnable parameters. The high-frequency component from AvgPool acts as a saliency filter to enhance salient visual semantics, while the low-frequency component from MaxPool acts as an anti-saliency filter to strengthen weak visual semantics. This enables the preservation of visual semantics dominated by few visual tokens and the restoration of diluted visual semantics. Additionally, we inject FMVR into Matryoshka Representation Learning to learn coarse-to-fine visual token sets, thus enabling elastic adjustment of the number of visual tokens during inference while maintaining comparable performance. Experiments across 10 image-based and 4 video-based benchmarks demonstrate that FMVR-LLaVA reduces the FLOPs of LLaVA-1.5-7B by 89%, while maintaining almost 100% of the original accuracy. The code will be open.
摘要：由于存在大量视觉标记，大型多模态模型 (LMM) 很难适应不同的计算预算。以前的方法试图减少法学硕士之前或之内的视觉标记的数量。然而，这些策略不可避免地导致视觉语义的损失。为了解决这些问题，我们引入了 FMVR，一种即插即用且极其简单的调频视觉恢复策略，以提高 LMM 在视觉标记减少下的推理能力。具体来说，FMVR 通过 AvgPool 和 MaxPool 将较少视觉标记的视觉表示分解为低频和高频分量。随后使用轻量级可学习参数来调制导出的频率。 AvgPool 的高频充当显着性过滤器，增强显着性视觉语义，而 MaxPool 的低频充当反显着性过滤器，增强弱视觉语义。它能够保留由少数视觉标记主导的视觉语义并恢复稀释的视觉语义。此外，我们将 FMVR 注入俄罗斯套娃表示学习中，以学习从粗到细的视觉标记集，从而能够在推理过程中弹性调整视觉标记的数量，同时保持可比较的性能。 10 个基于图像和 4 个基于视频的基准测试表明，FMVR-LLaVA 将 LLaVA-1.5-7B 的 FLOP 降低了 89%，同时保持了几乎 100% 的原始精度。代码将被公开。

Title: On the Robustness of Langevin Dynamics to Score Function Error

Authors: Daniel Yiming Cao, August Y. Chen, Karthik Sridharan, Yuchen Wu
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2603.11319
Pdf URL: https://arxiv.org/pdf/2603.11319
Copy Paste: [[2603.11319]] On the Robustness of Langevin Dynamics to Score Function Error(https://arxiv.org/abs/2603.11319)
Keywords: generative
Abstract: We consider the robustness of score-based generative modeling to errors in the estimate of the score function. In particular, we show that Langevin dynamics is not robust to the L^2 errors (more generally L^p errors) in the estimate of the score function. It is well-established that with small L^2 errors in the estimate of the score function, diffusion models can sample faithfully from the target distribution under fairly mild regularity assumptions in a polynomial time horizon. In contrast, our work shows that even for simple distributions in high dimensions, Langevin dynamics run for any polynomial time horizon will produce a distribution far from the target distribution in Total Variation (TV) distance, even when the L^2 error (more generally L^p) of the estimate of the score function is arbitrarily small. Considering such an error in the estimate of the score function is unavoidable in practice when learning the score function from data, our results provide further justification for diffusion models over Langevin dynamics and serve to caution against the use of Langevin dynamics with estimated scores.
摘要：我们考虑基于分数的生成模型对分数函数估计错误的鲁棒性。特别是，我们表明 Langevin 动力学对于得分函数估计中的 L^2 误差（更普遍的是 L^p 误差）并不稳健。众所周知，在得分函数估计中存在较小的 L^2 误差，扩散模型可以在多项式时间范围内的相当温和的规律性假设下从目标分布中忠实地进行采样。相比之下，我们的工作表明，即使对于高维的简单分布，在任何多项式时间范围内运行的 Langevin 动力学都会产生远离总变分 (TV) 距离目标分布的分布，即使得分函数估计的 L^2 误差（更普遍的是 L^p）任意小。考虑到在从数据学习得分函数时得分函数估计中的这种错误在实践中是不可避免的，我们的结果为朗之万动力学上的扩散模型提供了进一步的理由，并警告不要使用具有估计得分的朗之万动力学。
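
For readers unfamiliar with the sampler discussed in the abstract above, here is a minimal sketch of unadjusted Langevin dynamics driven by a score function. It is not the paper's construction: the target is a one-dimensional standard normal with its exact score, and the step size, chain count, and function names are illustrative assumptions. In practice `score` would be a learned estimate, which is the setting whose robustness the abstract questions.

```python
import math
import random


def langevin_sample(score, x0, step, n_steps, rng):
    """Run one chain of unadjusted Langevin dynamics and return its final state."""
    x = x0
    noise_scale = math.sqrt(2.0 * step)
    for _ in range(n_steps):
        # drift along the score, then add Gaussian noise
        x = x + step * score(x) + noise_scale * rng.gauss(0.0, 1.0)
    return x


def gaussian_score(x):
    """Exact score of a standard normal target: d/dx log p(x) = -x."""
    return -x


rng = random.Random(0)
samples = [langevin_sample(gaussian_score, 0.0, 0.01, 400, rng) for _ in range(1000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(f"mean={mean:.3f} var={var:.3f}")  # expected to be close to 0 and 1 for this toy target
```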

Title: UniCompress: Token Compression for Unified Vision-Language Understanding and Generation

Authors: Ziyao Wang, Chen Chen, Jingtao Li, Weiming Zhuang, Jiabo Huang, Ang Li, Lingjuan Lyu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.11320
Pdf URL: https://arxiv.org/pdf/2603.11320
Copy Paste: [[2603.11320]] UniCompress: Token Compression for Unified Vision-Language Understanding and Generation(https://arxiv.org/abs/2603.11320)
Keywords: generation
Abstract: Unified models aim to support both understanding and generation by encoding images into discrete tokens and processing them alongside text within a single autoregressive framework. This unified design offers architectural simplicity and cross-modal synergy, which facilitates shared parameterization, consistent training objectives, and seamless transfer between modalities. However, the large number of visual tokens required by such models introduces substantial computation and memory overhead, and this inefficiency directly hinders deployment in resource constrained scenarios such as embodied AI systems. In this work, we propose a unified token compression algorithm UniCompress that significantly reduces visual token count while preserving performance on both image understanding and generation tasks. Our method introduces a plug-in compression and decompression mechanism guided with learnable global meta tokens. The framework is lightweight and modular, enabling efficient integration into existing models without full retraining. Experimental results show that our approach reduces image tokens by up to 4 times, achieves substantial gains in inference latency and training cost, and incurs only minimal performance degradation, which demonstrates the promise of token-efficient unified modeling for real world multimodal applications.
摘要：统一模型旨在通过将图像编码为离散标记并在单个自回归框架内将它们与文本一起处理来支持理解和生成。这种统一的设计提供了架构简单性和跨模式协同作用，有利于共享参数化、一致的训练目标以及模式之间的无缝传输。然而，此类模型所需的大量视觉标记会带来大量的计算和内存开销，这种低效率直接阻碍了在实体人工智能系统等资源受限场景中的部署。在这项工作中，我们提出了一种统一的令牌压缩算法 UniCompress，该算法可显着减少视觉令牌计数，同时保留图像理解和生成任务的性能。我们的方法引入了一种由可学习的全局元令牌引导的插件压缩和解压缩机制。该框架是轻量级和模块化的，无需完全重新培训即可有效集成到现有模型中。实验结果表明，我们的方法将图像标记减少了多达 4 倍，在推理延迟和训练成本方面取得了显着的收益，并且仅导致最小的性能下降，这证明了现实世界多模态应用程序的标记高效统一建模的前景。

Title: UNet-AF: An alias-free UNet for image restoration

Authors: Jérémy Scanvic, Quentin Barthélemy, Julián Tachella
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.11323
Pdf URL: https://arxiv.org/pdf/2603.11323
Copy Paste: [[2603.11323]] UNet-AF: An alias-free UNet for image restoration(https://arxiv.org/abs/2603.11323)
Keywords: restoration
Abstract: The simplicity and effectiveness of the UNet architecture make it ubiquitous in image restoration, image segmentation, and diffusion models. UNets are often assumed to be equivariant to translations, yet they traditionally consist of layers that are known to be prone to aliasing, which hinders their equivariance in practice. To overcome this limitation, we propose a new alias-free UNet designed from a careful selection of state-of-the-art translation-equivariant layers. We evaluate the proposed equivariant architecture against non-equivariant baselines on image restoration tasks and observe competitive performance with a significant increase in measured equivariance. Through extensive ablation studies, we also demonstrate that each change is crucial for its empirical equivariance. Our implementation is available at this https URL
摘要：UNet 架构的简单性和有效性使其在图像恢复、图像分割和扩散模型中无处不在。它们通常被认为与翻译是等变的，但它们传统上由已知容易出现混叠的层组成，这阻碍了它们在实践中的等变性。为了克服这一限制，我们提出了一种新的无别名 UNet，它是通过仔细选择最先进的平移等变层而设计的。我们在图像恢复任务上针对非等变基线评估了所提出的等变架构，并观察了测量等方差显着增加的竞争性能。通过广泛的消融研究，我们还证明每次变化对于其经验等方差都是至关重要的。我们的实现可通过此 https URL 获取

Title: Towards Trustworthy Selective Generation: Reliability-Guided Diffusion for Ultra-Low-Field to High-Field MRI Synthesis

Authors: Zhenxuan Zhang, Peiyuan Jing, Ruicheng Yuan, Liwei Hu, Anbang Wang, Fanwen Wang, Yinzhe Wu, Kh Tohidul Islam, Zhaolin Chen, Zi Wang, Peter Lally, Guang Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.11325
Pdf URL: https://arxiv.org/pdf/2603.11325
Copy Paste: [[2603.11325]] Towards Trustworthy Selective Generation: Reliability-Guided Diffusion for Ultra-Low-Field to High-Field MRI Synthesis(https://arxiv.org/abs/2603.11325)
Keywords: generation, generative
Abstract: Low-field to high-field MRI synthesis has emerged as a cost-effective strategy to enhance image quality under hardware and acquisition constraints, particularly in scenarios where access to high-field scanners is limited or impractical. Despite recent progress in diffusion models, diffusion-based approaches often struggle to balance fine-detail recovery and structural fidelity. In particular, the uncontrolled generation of high-resolution details in structurally ambiguous regions may introduce anatomically inconsistent patterns, such as spurious edges or artificial texture variations. These artifacts can bias downstream quantitative analysis. For example, they may cause inaccurate tissue boundary delineation or erroneous volumetric estimation, ultimately reducing clinical trust in synthesized images. These limitations highlight the need for generative models that are not only visually accurate but also spatially reliable and anatomically consistent. To address this issue, we propose a reliability-aware diffusion framework (ReDiff) that improves synthesis robustness at both the sampling and post-generation stages. Specifically, we introduce a reliability-guided sampling strategy to suppress unreliable responses during the denoising process. We further develop an uncertainty-aware multi-candidate selection scheme to enhance the reliability of the final prediction. Experiments on multi-center MRI datasets demonstrate improved structural fidelity and reduced artifacts compared with state-of-the-art methods.
摘要：低场到高场 MRI 合成已成为一种经济有效的策略，可在硬件和采集限制下提高图像质量，特别是在高场扫描仪的访问受到限制或不切实际的情况下。尽管扩散模型最近取得了进展，但基于扩散的方法常常难以平衡细节恢复和结构保真度。特别是，在结构模糊区域中不受控制地生成高分辨率细节可能会引入解剖学上不一致的模式，例如虚假边缘或人工纹理变化。这些伪影可能会使下游定量分析产生偏差。例如，它们可能会导致组织边界描绘不准确或体积估计错误，最终降低临床对合成图像的信任。这些限制凸显了对生成模型的需求，该模型不仅在视觉上准确，而且在空间上可靠且解剖学上一致。为了解决这个问题，我们提出了一种可靠性感知扩散框架（ReDiff），可以提高采样和生成后阶段的综合鲁棒性。具体来说，我们引入了一种可靠性引导采样策略来抑制去噪过程中的不可靠响应。我们进一步开发了一种不确定性感知的多候选选择方案，以提高最终预测的可靠性。多中心 MRI 数据集的实验表明，与最先进的方法相比，结构保真度得到了提高，伪影也减少了。

Title: Jailbreak Scaling Laws for Large Language Models: Polynomial-Exponential Crossover

Authors: Indranil Halder, Annesya Banerjee, Cengiz Pehlevan
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.11331
Pdf URL: https://arxiv.org/pdf/2603.11331
Copy Paste: [[2603.11331]] Jailbreak Scaling Laws for Large Language Models: Polynomial-Exponential Crossover(https://arxiv.org/abs/2603.11331)
Keywords: generation, generative
Abstract: Adversarial attacks can reliably steer safety-aligned large language models toward unsafe behavior. Empirically, we find that adversarial prompt-injection attacks can amplify attack success rate from the slow polynomial growth observed without injection to exponential growth with the number of inference-time samples. To explain this phenomenon, we propose a theoretical generative model of proxy language in terms of a spin-glass system operating in a replica-symmetry-breaking regime, where generations are drawn from the associated Gibbs measure and a subset of low-energy, size-biased clusters is designated unsafe. Within this framework, we analyze prompt injection-based jailbreaking. Short injected prompts correspond to a weak magnetic field aligned towards unsafe cluster centers and yield a power-law scaling of attack success rate with the number of inference-time samples, while long injected prompts, i.e., strong magnetic field, yield exponential scaling. We derive these behaviors analytically and confirm them empirically on large language models. This transition between two regimes is due to the appearance of an ordered phase in the spin chain under a strong magnetic field, which suggests that the injected jailbreak prompt enhances adversarial order in the language model.
摘要：对抗性攻击可以可靠地引导与安全相关的大型语言模型走向不安全行为。根据经验，我们发现对抗性提示注入攻击可以将攻击成功率从没有注入时观察到的缓慢多项式增长放大到推理时间样本数量的指数增长。为了解释这种现象，我们提出了一种代理语言的理论生成模型，该模型以在复制对称破缺状态下运行的自旋玻璃系统为基础，其中世代是从相关的吉布斯测度中抽取的，并且低能量、大小偏向的簇的子集被指定为不安全的。在此框架内，我们分析基于提示注入的越狱。短注入提示对应于与不安全簇中心对齐的弱磁场，并产生攻击成功率与推理时间样本数量的幂律缩放，而长注入提示，即强磁场，产生指数缩放。我们通过分析得出这些行为，并在大型语言模型上凭经验确认它们。两种状态之间的转变是由于强磁场下自旋链中出现有序相，这表明注入的越狱提示增强了语言模型中的对抗顺序。

Title: ShotVerse: Advancing Cinematic Camera Control for Text-Driven Multi-Shot Video Creation

Authors: Songlin Yang, Zhe Wang, Xuyi Yang, Songchun Zhang, Xianghao Kong, Taiyi Wu, Xiaotong Zhao, Ran Zhang, Alan Zhao, Anyi Rao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.11421
Pdf URL: https://arxiv.org/pdf/2603.11421
Copy Paste: [[2603.11421]] ShotVerse: Advancing Cinematic Camera Control for Text-Driven Multi-Shot Video Creation(https://arxiv.org/abs/2603.11421)
Keywords: generation
Abstract: Text-driven video generation has democratized film creation, but camera control in cinematic multi-shot scenarios remains a significant bottleneck. Implicit textual prompts lack precision, while explicit trajectory conditioning imposes prohibitive manual overhead and often triggers execution failures in current models. To overcome this bottleneck, we propose a data-centric paradigm shift, positing that aligned (Caption, Trajectory, Video) triplets form an inherent joint distribution that can connect automated plotting and precise execution. Guided by this insight, we present ShotVerse, a "Plan-then-Control" framework that decouples generation into two collaborative agents: a VLM (Vision-Language Model)-based Planner that leverages spatial priors to obtain cinematic, globally aligned trajectories from text, and a Controller that renders these trajectories into multi-shot video content via a camera adapter. Central to our approach is the construction of a data foundation: we design an automated multi-shot camera calibration pipeline that aligns disjoint single-shot trajectories into a unified global coordinate system. This facilitates the curation of ShotVerse-Bench, a high-fidelity cinematic dataset with a three-track evaluation protocol that serves as the bedrock for our framework. Extensive experiments demonstrate that ShotVerse effectively bridges the gap between unreliable textual control and labor-intensive manual plotting, achieving superior cinematic aesthetics and generating multi-shot videos that are both camera-accurate and cross-shot consistent.
摘要：文本驱动的视频生成使电影创作民主化，但电影多镜头场景中的摄像机控制仍然是一个重要障碍。隐式的文本提示缺乏精确性，而显式的轨迹条件则带来了令人望而却步的手动开销，并且经常会触发当前模型中的执行失败。为了克服这个瓶颈，我们提出了一种以数据为中心的范式转变，假设对齐的（标题、轨迹、视频）三元组形成了一个固有的联合分布，可以连接自动绘图和精确执行。在这一见解的指导下，我们提出了 ShotVerse，一个“先规划后控制”的框架，它将生成解耦为两个协作代理：一个基于 VLM（视觉语言模型）的规划器，利用空间先验从文本中获取电影般的全局对齐轨迹，以及一个通过相机适配器将这些轨迹渲染为多镜头视频内容的控制器。我们方法的核心是数据基础的构建：我们设计了一个自动多镜头相机校准管道，将不相交的单镜头轨迹对齐到统一的全局坐标系中。这有利于 ShotVerse-Bench 的管理，这是一个高保真电影数据集，具有三轨评估协议，可作为我们框架的基石。大量实验表明，ShotVerse 有效地弥补了不可靠的文本控制和劳动密集型手动绘图之间的差距，实现了卓越的电影美学并生成了相机精确且跨镜头一致的多镜头视频。

Title: Follow the Saliency: Supervised Saliency for Retrieval-augmented Dense Video Captioning

Authors: Seung hee Choi, MinJu Jeon, Hyunwoo Oh, Jihwan Lee, Dong-Jin Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.11460
Pdf URL: https://arxiv.org/pdf/2603.11460
Copy Paste: [[2603.11460]] Follow the Saliency: Supervised Saliency for Retrieval-augmented Dense Video Captioning(https://arxiv.org/abs/2603.11460)
Keywords: generation
Abstract: Existing retrieval-augmented approaches for Dense Video Captioning (DVC) often fail to achieve accurate temporal segmentation aligned with true event boundaries, as they rely on heuristic strategies that overlook ground truth event boundaries. The proposed framework, STaRC, overcomes this limitation by supervising frame-level saliency through a highlight detection module. Note that the highlight detection module is trained on binary labels derived directly from DVC ground truth annotations without the need for additional annotation. We also propose to utilize the saliency scores as a unified temporal signal that drives retrieval via saliency-guided segmentation and informs caption generation through explicit Saliency Prompts injected into the decoder. By enforcing saliency-constrained segmentation, our method produces temporally coherent segments that align closely with actual event transitions, leading to more accurate retrieval and contextually grounded caption generation. We conduct comprehensive evaluations on the YouCook2 and ViTT benchmarks, where STaRC achieves state-of-the-art performance across most of the metrics. Our code is available at this https URL
摘要：现有的密集视频字幕 (DVC) 检索增强方法通常无法实现与真实事件边界对齐的准确时间分割，因为它们依赖于忽略真实事件边界的启发式策略。所提出的框架 \textbf{STaRC} 通过高亮检测模块来监督帧级显着性，从而克服了这一限制。请注意，高亮检测模块是在直接从 DVC 真实注释派生的二进制标签上进行训练的，无需额外注释。我们还建议利用显着性分数作为统一的时间信号，通过显着性引导的分割驱动检索，并通过注入解码器的显式显着性提示通知字幕生成。通过强制显着性约束分割，我们的方法产生时间上连贯的片段，与实际事件转换紧密结合，从而实现更准确的检索和基于上下文的字幕生成。我们对 YouCook2 和 ViTT 基准进行了全面评估，其中 STaRC 在大多数指标上都实现了最先进的性能。我们的代码可在此 https URL 获取

Title: OrthoEraser: Coupled-Neuron Orthogonal Projection for Concept Erasure

Authors: Chuancheng Shi, Wenhua Wu, Fei Shen, Xiaogang Zhu, Kun Hu, Zhiyong Wang
Subjects: cs.CV, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2603.11493
Pdf URL: https://arxiv.org/pdf/2603.11493
Copy Paste: [[2603.11493]] OrthoEraser: Coupled-Neuron Orthogonal Projection for Concept Erasure(https://arxiv.org/abs/2603.11493)
Keywords: generative
Abstract: Text-to-image (T2I) models face significant safety risks from adversarial induction, yet current concept erasure methods often cause collateral damage to benign attributes when suppressing selected neurons entirely. This occurs because sensitive and benign semantics exhibit non-orthogonal superposition, sharing activation subspaces where their respective vectors are inherently entangled. To address this issue, we propose OrthoEraser, which leverages sparse autoencoders (SAE) to achieve high-resolution feature disentanglement and subsequently redefines erasure as an analytical orthogonalization projection that preserves the benign manifold's invariance. OrthoEraser first employs SAE to decompose dense activations and segregate sensitive neurons. It then uses coupled neuron detection to identify non-sensitive features vulnerable to intervention. The key novelty lies in an analytical gradient orthogonalization strategy that projects erasure vectors onto the null space of the coupled neurons. This orthogonally decouples the sensitive concepts from the identified critical benign subspace, effectively preserving non-sensitive semantics. Experimental results on safety demonstrate that OrthoEraser achieves high erasure precision, effectively removing harmful content while preserving the integrity of the generative manifold, and significantly outperforming SOTA baselines. This paper contains results of unsafe models.
摘要：文本到图像（T2I）模型面临来自对抗性诱导的重大安全风险，但当前的概念擦除方法在完全抑制选定的神经元时通常会对良性属性造成附带损害。发生这种情况是因为敏感和良性语义表现出非正交叠加，共享激活子空间，其中它们各自的向量本质上是纠缠的。为了解决这个问题，我们提出了 OrthoEraser，它利用稀疏自动编码器（SAE）来实现高分辨率特征解缠，随后将擦除重新定义为保留良性流形不变性的分析正交化投影。 OrthoEraser 首先采用 SAE 分解密集激活并隔离敏感神经元。然后，它使用耦合神经元检测来识别容易受到干预的非敏感特征。关键的新颖之处在于分析梯度正交化策略，该策略将擦除向量投影到耦合神经元的零空间上。这将敏感概念与已识别的关键良性子空间正交解耦，有效地保留了非敏感语义。安全性实验结果表明，OrthoEraser 实现了较高的擦除精度，有效去除有害内容，同时保持生成流形的完整性，并且显着优于 SOTA 基线。本文包含不安全模型的结果。
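
The projection step described in the abstract above can be sketched in a few lines. This is only an illustration under stated assumptions: "null space of the coupled neurons" is read here as the orthogonal complement of the span of some coupled-neuron direction vectors, and the vectors, names, and dimensions are hypothetical rather than taken from the paper.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def orthonormalize(vectors, tol=1e-12):
    """Gram-Schmidt: return an orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:  # skip directions already covered by the basis
            basis.append([wi / norm for wi in w])
    return basis


def project_out(vector, directions):
    """Remove from `vector` every component lying in the span of `directions`."""
    w = list(vector)
    for b in orthonormalize(directions):
        c = dot(w, b)
        w = [wi - c * bi for wi, bi in zip(w, b)]
    return w


# Hypothetical erasure vector and coupled (benign) directions.
erasure = [1.0, 2.0, 3.0]
coupled = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
safe_erasure = project_out(erasure, coupled)
print(safe_erasure)  # only the component orthogonal to both coupled directions remains
```

After projection, applying the erasure vector leaves activations along the coupled directions unchanged to first order, which is the intuition behind preserving benign features.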

Title: KEPo: Knowledge Evolution Poison on Graph-based Retrieval-Augmented Generation

Authors: Qizhi Chen, Chao Qi, Yihong Huang, Muquan Li, Rongzheng Wang, Dongyang Zhang, Ke Qin, Shuang Liang
Subjects: cs.LG, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2603.11501
Pdf URL: https://arxiv.org/pdf/2603.11501
Copy Paste: [[2603.11501]] KEPo: Knowledge Evolution Poison on Graph-based Retrieval-Augmented Generation(https://arxiv.org/abs/2603.11501)
Keywords: generation
Abstract: Graph-based Retrieval-Augmented Generation (GraphRAG) constructs the Knowledge Graph (KG) from external databases to enhance the timeliness and accuracy of Large Language Model (LLM) […], this reliance on external data introduces new attack […] can inject poisoned texts into databases to manipulate LLMs into producing harmful target responses for attacker-chosen […] research primarily focuses on attacking conventional RAG […], such methods are ineffective against […] robustness derives from the KG abstraction of GraphRAG, which reorganizes injected text into a graph before retrieval, thereby enabling the LLM to reason based on the restructured context instead of raw poisoned […] expose latent security vulnerabilities in GraphRAG, we propose Knowledge Evolution Poison (KEPo), a novel poisoning attack method specifically designed for […] each target query, KEPo first generates a toxic event containing poisoned knowledge based on the target […] fabricating event backgrounds and forging knowledge evolution paths from original facts to the toxic event, it then poisons the KG and misleads the LLM into treating the poisoned knowledge as the final […] multi-target attack scenarios, KEPo further connects multiple attack corpora, enabling their poisoned knowledge to mutually reinforce while expanding the scale of poisoned communities, thereby amplifying attack […] results across multiple datasets demonstrate that KEPo achieves state-of-the-art attack success rates for both single-target and multi-target attacks, significantly outperforming previous methods.
摘要：基于图的检索增强生成（GraphRAG）从外部数据库构建知识图（KG），以增强大型语言模型（LLM）的及时性和准确性此http URL，这种对外部数据的依赖引入了新的攻击此http URL可以将有毒文本注入数据库以操纵LLM为攻击者选择产生有害的目标响应此http URL研究主要集中于攻击传统的RAG此http URL，此类方法对此http URL的鲁棒性无效GraphRAG的KG抽象，在检索之前将注入的文本重新组织成图，从而使LLM能够基于重组的上下文进行推理，而不是原始中毒的这个http URL暴露了GraphRAG中潜在的安全漏洞，我们提出了知识进化中毒（KEPo），一种专门针对这个http URL每个目标查询设计的新颖的中毒攻击方法，KEPo首先根据目标这个http URL伪造事件背景并从原始事实伪造知识进化路径，生成一个包含中毒知识的有毒事件KEPo进一步连接多个攻击语料库，使它们的中毒知识相互增强，同时扩大中毒社区的规模，从而在多个数据集上放大攻击此http URL结果，表明KEPo在单目标和多目标攻击方面都实现了state-of-the-art的攻击成功率，显着优于以前的方法。

Title: LongFlow: Efficient KV Cache Compression for Reasoning M

Authors: Yi Su, Zhenxu Tian, Dan Qiao, Yuechi Zhou, Juntao Li, Min Zhang
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2603.11504
Pdf URL: https://arxiv.org/pdf/2603.11504
Copy Paste: [[2603.11504]] LongFlow: Efficient KV Cache Compression for Reasoning M(https://arxiv.org/abs/2603.11504)
Keywords: generation
Abstract: Recent reasoning models such as OpenAI-o1 and DeepSeek-R1 have shown strong performance on complex tasks including mathematical reasoning and code generation. However, this performance gain comes with substantially longer output sequences, leading to significantly increased deployment costs. In particular, long outputs require large KV caches, resulting in high memory consumption and severe bandwidth pressure during attention computation. Most existing KV cache optimization methods are designed for long-input, short-output scenarios and are ineffective for the long-output setting of reasoning models. Moreover, importance estimation in prior work is computationally expensive and becomes prohibitive when continuous re-evaluation is required during long generation. To address these challenges, we propose LongFlow, a KV cache compression method with an efficient importance estimation metric derived from an intermediate result of attention computation using only the current query. This design introduces negligible computational overhead and requires no auxiliary storage. We further develop a custom kernel that fuses FlashAttention, importance estimation, and token eviction into a single optimized operator, improving system-level efficiency. Experiments show that LongFlow achieves up to an 11.8x throughput improvement at 80% KV cache compression, with minimal impact on model accuracy.
摘要：最近的推理模型（例如 OpenAI-o1 和 DeepSeek-R1）在数学推理和代码生成等复杂任务上表现出了强大的性能。然而，这种性能提升伴随着更长的输出序列，导致部署成本显着增加。特别是长输出需要大的KV缓存，导致注意力计算时内存消耗高，带宽压力大。现有的KV缓存优化方法大多是针对长输入、短输出场景而设计的，对于推理模型的长输出设置效果不佳。此外，先前工作中的重要性估计在计算上是昂贵的，并且当在长时间生成过程中需要连续重新评估时变得令人望而却步。为了解决这些挑战，我们提出了 LongFlow，一种 KV 缓存压缩方法，具有从仅使用当前查询的注意力计算的中间结果导出的有效重要性估计指标。这种设计引入的计算开销可以忽略不计，并且不需要辅助存储。我们进一步开发了一个自定义内核，将 FlashAttention、重要性估计和令牌驱逐融合到单个优化运算符中，从而提高系统级效率。实验表明，LongFlow 通过 80% KV 缓存压缩实现了高达 11.8 倍的吞吐量提升，同时对模型精度的影响最小。
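
The idea of scoring cached tokens with the current query only, as outlined in the abstract above, can be sketched as follows. This is a simplified stand-in, not LongFlow's metric or kernel: it assumes the softmax attention weights of the current query serve as the importance score, and all names and the toy vectors are hypothetical.

```python
import math


def attention_weights(query, keys):
    """Softmax of scaled dot products between one query and all cached keys."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def compress_cache(query, keys, values, keep_ratio):
    """Keep the cache entries that the current query attends to most."""
    weights = attention_weights(query, keys)
    n_keep = max(1, round(keep_ratio * len(keys)))
    ranked = sorted(range(len(keys)), key=lambda i: weights[i], reverse=True)
    kept = sorted(ranked[:n_keep])  # restore original token order
    return [keys[i] for i in kept], [values[i] for i in kept], kept


# Toy cache with five entries; keep_ratio=0.2 mirrors 80% compression.
keys = [[0.1, 0.0], [2.0, 0.0], [-1.0, 0.0], [0.5, 1.0], [1.5, 0.0]]
values = [[float(i)] for i in range(len(keys))]
query = [1.0, 0.0]
new_keys, new_values, kept = compress_cache(query, keys, values, keep_ratio=0.2)
print(kept)  # indices of the surviving cache entries
```

Because the weights are already computed during attention, reusing them for eviction adds little extra work, which is the property the abstract emphasizes.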

Title: Gen-Fab: A Variation-Aware Generative Model for Predicting Fabrication Variations in Nanophotonic Devices

Authors: Rambod Azimi, Yuri Grinberg, Dan-Xia Xu, Odile Liboiron-Ladouceur
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.11505
Pdf URL: https://arxiv.org/pdf/2603.11505
Copy Paste: [[2603.11505]] Gen-Fab: A Variation-Aware Generative Model for Predicting Fabrication Variations in Nanophotonic Devices(https://arxiv.org/abs/2603.11505)
Keywords: generative
Abstract: Silicon photonic devices often exhibit fabrication-induced variations such as over-etching, under-etching, and corner rounding, which can significantly alter device performance. These variations are non-uniform and are influenced by feature size and shape. Accurate digital twins are therefore needed to predict the range of possible fabricated outcomes for a given design. In this paper, we introduce Gen-Fab, a conditional generative adversarial network (cGAN) based on Pix2Pix to predict and model uncertainty in photonic fabrication outcomes. The proposed method takes a design layout (in GDS format) as input and produces diverse high-resolution predictions similar to scanning electron microscope (SEM) images of fabricated devices, capturing the range of process variations at the nanometer scale. To enable one-to-many mapping, we inject a latent noise vector at the model bottleneck. We compare Gen-Fab against three baselines: (1) a deterministic U-Net predictor, (2) an inference-time Monte Carlo Dropout U-Net, and (3) an ensemble of varied U-Nets. Evaluations on an out-of-distribution dataset of fabricated photonic test structures demonstrate that Gen-Fab outperforms all baselines in both accuracy and uncertainty modeling. An additional distribution shift analysis further confirms its strong generalization to unseen fabrication geometries. Gen-Fab achieves the highest intersection-over-union (IoU) score of 89.8%, outperforming the deterministic U-Net (85.3%), the MC-Dropout U-Net (83.4%), and the ensemble of varied U-Nets (85.8%). It also better aligns with the distribution of real fabrication outcomes, achieving lower Kullback-Leibler divergence and Wasserstein distance.
摘要：硅光子器件通常会表现出制造引起的变化，例如过度蚀刻、蚀刻不足和圆角，这会显着改变器件性能。这些变化是不均匀的，并且受到特征尺寸和形状的影响。因此，需要准确的数字孪生来预测给定设计的可能制造结果的范围。在本文中，我们介绍了 Gen-Fab，这是一种基于 Pix2Pix 的条件生成对抗网络 (cGAN)，用于预测和模拟光子制造结果的不确定性。所提出的方法以设计布局（GDS 格式）作为输入，并生成类似于制造器件的扫描电子显微镜 (SEM) 图像的各种高分辨率预测，捕获纳米尺度的工艺变化范围。为了实现一对多映射，我们在模型瓶颈处注入潜在噪声向量。我们将 Gen-Fab 与三个基线进行比较：(1) 确定性 U-Net 预测器，(2) 推理时间蒙特卡洛 Dropout U-Net，以及 (3) 各种 U-Net 的集合。对制造的光子测试结构的分布外数据集的评估表明，Gen-Fab 在准确性和不确定性建模方面均优于所有基线。额外的分布变化分析进一步证实了其对未见过的制造几何形状的强大概括。 Gen-Fab 实现了最高的交并集 (IoU) 分数 89.8%，优于确定性 U-Net (85.3%)、MC-Dropout U-Net (83.4%) 和变化的 U-Net (85.8%)。它还可以更好地与实际制造结果的分布保持一致，从而实现更低的 Kullback-Leibler 散度和 Wasserstein 距离。
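
The intersection-over-union (IoU) metric quoted in the abstract above can be computed as in this short sketch. The binary masks are made-up toy data, and treating two empty masks as a perfect match is a convention chosen here, not something stated in the abstract.

```python
def iou(mask_a, mask_b):
    """Intersection-over-union of two binary masks given as nested lists."""
    intersection = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            intersection += 1 if (a and b) else 0
            union += 1 if (a or b) else 0
    # Convention: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


predicted = [[1, 1, 0],
             [0, 1, 0]]
reference = [[1, 0, 0],
             [0, 1, 1]]
print(iou(predicted, reference))  # 2 shared pixels out of 4 covered pixels
```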

Title: MDS-VQA: Model-Informed Data Selection for Video Quality Assessment

Authors: Jian Zou, Xiaoyu Xu, Zhihua Wang, Yilin Wang, Balu Adsumilli, Kede Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.11525
Pdf URL: https://arxiv.org/pdf/2603.11525
Copy Paste: [[2603.11525]] MDS-VQA: Model-Informed Data Selection for Video Quality Assessment(https://arxiv.org/abs/2603.11525)
Keywords: quality assessment
Abstract: Learning-based video quality assessment (VQA) has advanced rapidly, yet progress is increasingly constrained by a disconnect between model design and dataset curation. Model-centric approaches often iterate on fixed benchmarks, while data-centric efforts collect new human labels without systematically targeting the weaknesses of existing VQA models. Here, we describe MDS-VQA, a model-informed data selection mechanism for curating unlabeled videos that are both difficult for the base VQA model and diverse in content. Difficulty is estimated by a failure predictor trained with a ranking objective, and diversity is measured using deep semantic video features, with a greedy procedure balancing the two under a constrained labeling budget. Experiments across multiple VQA datasets and models demonstrate that MDS-VQA identifies diverse, challenging samples that are particularly informative for active fine-tuning. With only a 5% selected subset per target domain, the fine-tuned model improves mean SRCC from 0.651 to 0.722 and achieves the top gMAD rank, indicating strong adaptation and generalization.
摘要：基于学习的视频质量评估 (VQA) 发展迅速，但模型设计和数据集管理之间的脱节日益限制进展。以模型为中心的方法通常会在固定基准上进行迭代，而以数据为中心的方法会收集新的人类标签，而不会系统地针对现有 VQA 模型的弱点。在这里，我们描述了 MDS-VQA，这是一种基于模型的数据选择机制，用于管理未标记的视频，这些视频对于基本 VQA 模型来说既困难又内容多样。难度是通过使用排名目标训练的失败预测器来估计的，多样性是使用深度语义视频特征来测量的，并使用贪婪程序在受限的标签预算下平衡两者。跨多个 VQA 数据集和模型的实验表明，MDS-VQA 可以识别多样化的、具有挑战性的样本，这些样本对于主动微调特别有用。每个目标域仅选择 5% 的子集，微调模型将平均 SRCC 从 0.651 提高到 0.722，并达到最高的 gMAD 排名，表明具有很强的适应性和泛化性。
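
The greedy procedure that balances difficulty and diversity under a labeling budget, as described in the abstract above, might look roughly like the sketch below. The scoring rule (difficulty plus a weighted distance to the nearest already-selected item), the `trade_off` parameter, and the toy data are assumptions made for illustration; the abstract does not specify the exact objective.

```python
import math


def greedy_select(difficulty, features, budget, trade_off=1.0):
    """Pick up to `budget` items, favoring hard items far from earlier picks."""
    selected = []
    remaining = set(range(len(difficulty)))

    def gain(i):
        if not selected:
            return difficulty[i]  # first pick: difficulty only
        nearest = min(math.dist(features[i], features[j]) for j in selected)
        return difficulty[i] + trade_off * nearest

    while remaining and len(selected) < budget:
        # sorted() makes tie-breaking deterministic (lowest index wins)
        best = max(sorted(remaining), key=gain)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical predicted difficulties and 1-D semantic features for four videos.
difficulty = [0.9, 0.8, 0.7, 0.1]
features = [[0.0], [0.1], [5.0], [5.1]]
chosen = greedy_select(difficulty, features, budget=2)
print(chosen)  # the second pick skips the near-duplicate of the first
```

In this toy run the second-hardest video is passed over because it is almost identical to the first pick, which shows how the diversity term steers the budget toward different content.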

Title: CFD-HAR: User-controllable Privacy through Conditional Feature Disentanglement

Authors: Alex Gn, Fan Li, S Kuniyilh, Ada Axan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.11526
Pdf URL: https://arxiv.org/pdf/2603.11526
Copy Paste: [[2603.11526]] CFD-HAR: User-controllable Privacy through Conditional Feature Disentanglement(https://arxiv.org/abs/2603.11526)
Keywords: generation
Abstract: Modern wearable and mobile devices are equipped with inertial measurement units (IMUs). Human Activity Recognition (HAR) applications running on such devices use machine-learning-based, data-driven techniques that leverage such sensor data. However, sensor-data-driven HAR deployments face two critical challenges: protecting sensitive user information embedded in sensor data in accordance with users' privacy preferences and maintaining high recognition performance with limited labeled samples. This paper proposes a technique for user-controllable privacy through feature disentanglement-based representation learning at the granular level for dynamic privacy filtering. We also compare the efficacy of our technique against few-shot HAR using autoencoder-based representation learning. We analyze their architectural designs, learning objectives, privacy guarantees, data efficiency, and suitability for edge Internet of Things (IoT) deployment. Our study shows that CFD-based HAR provides explicit, tunable privacy protection controls by separating activity and sensitive attributes in the latent space, whereas autoencoder-based few-shot HAR offers superior label efficiency and lightweight adaptability but lacks inherent privacy safeguards. We further examine the security implications of both approaches in continual IoT settings, highlighting differences in susceptibility to representation leakage and embedding-level attacks. The analysis reveals that neither paradigm alone fully satisfies the emerging requirements of next-generation IoT HAR systems. We conclude by outlining research directions toward unified frameworks that jointly optimize privacy preservation, few-shot adaptability, and robustness for trustworthy IoT intelligence.

Title: Risk-Controllable Multi-View Diffusion for Driving Scenario Generation

Authors: Hongyi Lin, Wenxiu Shi, Heye Huang, Dingyi Zhuang, Song Zhang, Yang Liu, Xiaobo Qu, Jinhua Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.11534
Pdf URL: https://arxiv.org/pdf/2603.11534
Copy Paste: [[2603.11534]] Risk-Controllable Multi-View Diffusion for Driving Scenario Generation(https://arxiv.org/abs/2603.11534)
Keywords: generation, generative
Abstract: Generating safety-critical driving scenarios is crucial for evaluating and improving autonomous driving systems, but long-tail risky situations are rarely observed in real-world data and difficult to specify through manual scenario design. Existing generative approaches typically treat risk as an after-the-fact label and struggle to maintain geometric consistency in multi-view driving scenes. We present RiskMV-DPO, a general and systematic pipeline for physically-informed, risk-controllable multi-view scenario generation. By integrating target risk levels with physically-grounded risk modeling, we autonomously synthesize diverse and high-stakes dynamic trajectories that serve as explicit geometric anchors for a diffusion-based video generator. To ensure spatial-temporal coherence and geometric fidelity, we introduce a geometry-appearance alignment module and a region-aware direct preference optimization (RA-DPO) strategy with motion-aware masking to focus learning on localized dynamic regions. Experiments on the nuScenes dataset show that RiskMV-DPO can freely generate a wide spectrum of diverse long-tail scenarios while maintaining state-of-the-art visual quality, improving 3D detection mAP from 18.17 to 30.50 and reducing FID to 15.70. Our work shifts the role of world models from passive environment prediction to proactive, risk-controllable synthesis, providing a scalable toolchain for the safety-oriented development of embodied intelligence.

Title: Multi-Task Anti-Causal Learning for Reconstructing Urban Events from Residents' Reports

Authors: Liangkai Zhou, Susu Xu, Shuqi Zhong, Shan Lin
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.11546
Pdf URL: https://arxiv.org/pdf/2603.11546
Copy Paste: [[2603.11546]] Multi-Task Anti-Causal Learning for Reconstructing Urban Events from Residents' Reports(https://arxiv.org/abs/2603.11546)
Keywords: generation
Abstract: Many real-world machine learning tasks are anti-causal: they require inferring latent causes from observed effects. In practice, we often face multiple related tasks where part of the forward causal mechanism is invariant across tasks, while other components are task-specific. We propose Multi-Task Anti-Causal learning (MTAC), a framework for estimating causes from outcomes and confounders by explicitly exploiting such cross-task invariances. MTAC first performs causal discovery to learn a shared causal graph and then instantiates a structured multi-task structural equation model (SEM) that factorizes the outcome-generation process into (i) a task-invariant mechanism and (ii) task-specific mechanisms via a shared backbone with task-specific heads. Building on the learned forward model, MTAC performs maximum a posteriori (MAP)-based inference to reconstruct causes by jointly optimizing latent mechanism variables and cause magnitudes under the learned causal structure. We evaluate MTAC on the application of urban event reconstruction from resident reports, spanning three tasks: parking violations, abandoned properties, and unsanitary conditions. On real-world data collected from Manhattan and the city of Newark, MTAC consistently improves reconstruction accuracy over strong baselines, achieving up to 34.61% MAE reduction and demonstrating the benefit of learning transferable causal mechanisms across tasks.

Title: PCA-Enhanced Probabilistic U-Net for Effective Ambiguous Medical Image Segmentation

Authors: Xiangyu Li, Chenglin Wang, Qiantong Shen, Fanding Li, Wei Wang, Kuanquan Wang, Yi Shen, Baochun Zhao, Gongning Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.11550
Pdf URL: https://arxiv.org/pdf/2603.11550
Copy Paste: [[2603.11550]] PCA-Enhanced Probabilistic U-Net for Effective Ambiguous Medical Image Segmentation(https://arxiv.org/abs/2603.11550)
Keywords: generative
Abstract: Ambiguous Medical Image Segmentation (AMIS) is important for addressing the inherent uncertainties that arise from image ambiguities, noise, and subjective annotations. Existing conditional variational autoencoder (cVAE)-based methods effectively capture uncertainty but face limitations including redundancy in high-dimensional latent spaces and limited expressiveness of single posterior networks. To overcome these issues, we introduce a novel PCA-Enhanced Probabilistic U-Net (PEP U-Net). Our method effectively incorporates Principal Component Analysis (PCA) for dimensionality reduction in the posterior network to mitigate redundancy and improve computational efficiency. Additionally, we employ an inverse PCA operation to reconstruct critical information, enhancing the latent space's representational capacity. Compared to conventional generative models, our method preserves the ability to generate diverse segmentation hypotheses while achieving a superior balance between segmentation accuracy and predictive variability, thereby advancing the performance of generative modeling in medical image segmentation.

Title: MANSION: Multi-floor lANguage-to-3D Scene generatIOn for loNg-horizon tasks

Authors: Lirong Che, Shuo Wen, Shan Huang, Chuang Wang, Yuzhe Yang, Gregory Dudek, Xueqian Wang, Jian Su
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2603.11554
Pdf URL: https://arxiv.org/pdf/2603.11554
Copy Paste: [[2603.11554]] MANSION: Multi-floor lANguage-to-3D Scene generatIOn for loNg-horizon tasks(https://arxiv.org/abs/2603.11554)
Keywords: generation
Abstract: Real-world robotic tasks are long-horizon and often span multiple floors, demanding rich spatial reasoning. However, existing embodied benchmarks are largely confined to single-floor in-house environments, failing to reflect the complexity of real-world tasks. We introduce MANSION, the first language-driven framework for generating building-scale, multi-floor 3D environments. Being aware of vertical structural constraints, MANSION generates realistic, navigable whole-building structures with diverse, human-friendly scenes, enabling the development and evaluation of cross-floor long-horizon tasks. Building on this framework, we release MansionWorld, a dataset of over 1,000 diverse buildings ranging from hospitals to offices, alongside a Task-Semantic Scene Editing Agent that customizes these environments using open-vocabulary commands to meet specific user needs. Benchmarking reveals that state-of-the-art agents degrade sharply in our settings, establishing MANSION as a critical testbed for the next generation of spatial reasoning and planning.

Title: Enhancing Image Aesthetics with Dual-Conditioned Diffusion Models Guided by Multimodal Perception

Authors: Xinyu Nan, Ning Wang, Yuyao Zhai, Mei Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.11556
Pdf URL: https://arxiv.org/pdf/2603.11556
Copy Paste: [[2603.11556]] Enhancing Image Aesthetics with Dual-Conditioned Diffusion Models Guided by Multimodal Perception(https://arxiv.org/abs/2603.11556)
Keywords: generative
Abstract: Image aesthetic enhancement aims to perceive aesthetic deficiencies in images and perform corresponding editing operations, which is highly challenging and requires the model to possess creativity and aesthetic perception capabilities. Although recent advancements in image editing models have significantly enhanced their controllability and flexibility, they struggle with enhancing image aesthetics. The primary challenges are twofold: first, following editing instructions with aesthetic perception is difficult, and second, there is a scarcity of "perfectly-paired" images that have consistent content but distinct aesthetic qualities. In this paper, we propose Dual-supervised Image Aesthetic Enhancement (DIAE), a diffusion-based generative model with multimodal aesthetic perception. First, DIAE incorporates Multimodal Aesthetic Perception (MAP) to convert the ambiguous aesthetic instruction into explicit guidance by (i) employing detailed, standardized aesthetic instructions across multiple aesthetic attributes, and (ii) utilizing multimodal control signals derived from text-image pairs that maintain consistency within the same aesthetic attribute. Second, to mitigate the lack of "perfectly-paired" images, we collect an "imperfectly-paired" dataset called IIAEData, consisting of images with varying aesthetic qualities while sharing identical semantics. To better leverage the weak matching characteristics of IIAEData during training, a dual-branch supervision framework is also introduced for weakly supervised image aesthetic enhancement. Experimental results demonstrate that DIAE outperforms the baselines and obtains superior image aesthetic scores and image content consistency scores.

Title: WeEdit: A Dataset, Benchmark and Glyph-Guided Framework for Text-centric Image Editing

Authors: Hui Zhang, Juntao Liu, Zongkai Liu, Liqiang Niu, Fandong Meng, Zuxuan Wu, Yu-Gang Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.11593
Pdf URL: https://arxiv.org/pdf/2603.11593
Copy Paste: [[2603.11593]] WeEdit: A Dataset, Benchmark and Glyph-Guided Framework for Text-centric Image Editing(https://arxiv.org/abs/2603.11593)
Keywords: generation
Abstract: Instruction-based image editing aims to modify specific content within existing images according to user-provided instructions while preserving non-target regions. Beyond traditional object- and style-centric manipulation, text-centric image editing focuses on modifying, translating, or rearranging textual elements embedded within images. However, existing leading models often struggle to execute complex text editing precisely, frequently producing blurry or hallucinated characters. We attribute these failures primarily to the lack of specialized training paradigms tailored for text-centric editing, as well as the absence of large-scale datasets and standardized benchmarks necessary for a closed-loop training and evaluation system. To address these limitations, we present WeEdit, a systematic solution encompassing a scalable data construction pipeline, two benchmarks, and a tailored two-stage training strategy. Specifically, we propose a novel HTML-based automatic editing pipeline, which generates 330K training pairs covering diverse editing operations and 15 languages, accompanied by standardized bilingual and multilingual benchmarks for comprehensive evaluation. On the algorithmic side, we employ glyph-guided supervised fine-tuning to inject explicit spatial and content priors, followed by a multi-objective reinforcement learning stage to align generation with instruction adherence, text clarity, and background preservation. Extensive experiments demonstrate that WeEdit outperforms previous open-source models by a clear margin across diverse editing operations.

Title: LaMoGen: Language to Motion Generation Through LLM-Guided Symbolic Inference

Authors: Junkun Jiang, Ho Yin Au, Jingyu Xiang, Jie Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.11605
Pdf URL: https://arxiv.org/pdf/2603.11605
Copy Paste: [[2603.11605]] LaMoGen: Language to Motion Generation Through LLM-Guided Symbolic Inference(https://arxiv.org/abs/2603.11605)
Keywords: generation
Abstract: Human motion is highly expressive and naturally aligned with language, yet prevailing methods relying heavily on joint text-motion embeddings struggle to synthesize temporally accurate, detailed motions and often lack explainability. To address these limitations, we introduce LabanLite, a motion representation developed by adapting and extending the Labanotation system. Unlike black-box text-motion embeddings, LabanLite encodes each atomic body-part action (e.g., a single left-foot step) as a discrete Laban symbol paired with a textual template. This abstraction decomposes complex motions into interpretable symbol sequences and body-part instructions, establishing a symbolic link between high-level language and low-level motion trajectories. Building on LabanLite, we present LaMoGen, a Text-to-LabanLite-to-Motion Generation framework that enables large language models (LLMs) to compose motion sequences through symbolic reasoning. The LLM interprets motion patterns, relates them to textual descriptions, and recombines symbols into executable plans, producing motions that are both interpretable and linguistically grounded. To support rigorous evaluation, we introduce a Labanotation-based benchmark with structured description-motion pairs and three metrics that jointly measure text-motion alignment across symbolic, temporal, and harmony dimensions. Experiments demonstrate that LaMoGen establishes a new baseline for both interpretability and controllability, outperforming prior methods on our benchmark and two public datasets. These results highlight the advantages of symbolic reasoning and agent-based design for language-driven motion synthesis.

Title: DyWeight: Dynamic Gradient Weighting for Few-Step Diffusion Sampling

Authors: Tong Zhao, Mingkun Lei, Liangyu Yuan, Yanming Yang, Chenxi Song, Yang Wang, Beier Zhu, Chi Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.11607
Pdf URL: https://arxiv.org/pdf/2603.11607
Copy Paste: [[2603.11607]] DyWeight: Dynamic Gradient Weighting for Few-Step Diffusion Sampling(https://arxiv.org/abs/2603.11607)
Keywords: generative
Abstract: Diffusion Models (DMs) have achieved state-of-the-art generative performance across multiple modalities, yet their sampling process remains prohibitively slow due to the need for hundreds of function evaluations. Recent progress in multi-step ODE solvers has greatly improved efficiency by reusing historical gradients, but existing methods rely on handcrafted coefficients that fail to adapt to the non-stationary dynamics of diffusion sampling. To address this limitation, we propose Dynamic Gradient Weighting (DyWeight), a lightweight, learning-based multi-step solver that introduces a streamlined implicit coupling paradigm. By relaxing classical numerical constraints, DyWeight learns unconstrained time-varying parameters that adaptively aggregate historical gradients while intrinsically scaling the effective step size. This implicit time calibration accurately aligns the solver's numerical trajectory with the model's internal denoising dynamics under large integration steps, avoiding complex decoupled parameterizations and optimizations. Extensive experiments on CIFAR-10, FFHQ, AFHQv2, ImageNet64, LSUN-Bedroom, Stable Diffusion and FLUX.1-dev demonstrate that DyWeight achieves superior visual fidelity and stability with significantly fewer function evaluations, establishing a new state-of-the-art among efficient diffusion solvers. Code is available at this https URL
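
To make the idea of aggregating historical gradients with time-varying weights concrete, the sketch below integrates a toy ODE with a generic multi-step update. It only illustrates the general scheme, not DyWeight itself: the function `multistep_solve`, the hand-set weight tables, and the test equation dx/dt = -x are assumptions, and nothing here is learned. The first table reproduces Euler steps, the second uses the classical two-step Adams-Bashforth coefficients (1.5, -0.5) as an example of handcrafted weights.

```python
import math

def multistep_solve(f, x0, t0, t1, steps, weights):
    """Integrate dx/dt = f(x, t) using per-step weights on the gradient history.

    weights[n] lists the coefficients applied at step n, newest gradient first.
    Their sum need not equal 1, so it also rescales the effective step size.
    """
    h = (t1 - t0) / steps
    x, t, history = x0, t0, []
    for n in range(steps):
        history.insert(0, f(x, t))        # newest gradient goes first
        w = weights[n][:len(history)]     # ignore weights with no matching history
        x = x + h * sum(wk * gk for wk, gk in zip(w, history))
        t = t + h
    return x

decay = lambda x, t: -x                    # toy ODE with exact solution exp(-t)
euler_weights = [[1.0]] * 4
ab2_weights = [[1.0]] + [[1.5, -0.5]] * 3  # Euler bootstrap, then Adams-Bashforth 2

x_euler = multistep_solve(decay, 1.0, 0.0, 1.0, 4, euler_weights)
x_ab2 = multistep_solve(decay, 1.0, 0.0, 1.0, 4, ab2_weights)
exact = math.exp(-1.0)
```

On this toy problem with four steps, reusing one past gradient already lands closer to the exact value than plain Euler; a learned solver would replace the fixed tables with parameters fitted per step.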

Title: Shape-of-You: Fused Gromov-Wasserstein Optimal Transport for Semantic Correspondence in-the-Wild

Authors: Jiin Im, Sisung Liu, Je Hyeong Hong
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2603.11618
Pdf URL: https://arxiv.org/pdf/2603.11618
Copy Paste: [[2603.11618]] Shape-of-You: Fused Gromov-Wasserstein Optimal Transport for Semantic Correspondence in-the-Wild(https://arxiv.org/abs/2603.11618)
Keywords: generation
Abstract: Semantic correspondence is essential for handling diverse in-the-wild images lacking explicit correspondence annotations. While recent 2D foundation models offer powerful features, adapting them for unsupervised learning via nearest-neighbor pseudo-labels has key limitations: it operates locally, ignoring structural relationships, and consequently its reliance on 2D appearance fails to resolve geometric ambiguities arising from symmetries or repetitive features. In this work, we address this by reformulating pseudo-label generation as a Fused Gromov-Wasserstein (FGW) problem, which jointly optimizes inter-feature similarity and intra-structural consistency. Our framework, Shape-of-You (SoY), leverages a 3D foundation model to define this intra-structure in the geometric space, resolving the abovementioned ambiguity. However, since FGW is a computationally prohibitive quadratic problem, we approximate it through anchor-based linearization. The resulting probabilistic transport plan provides a structurally consistent but noisy supervisory signal. Thus, we introduce a soft-target loss dynamically blending guidance from this plan with network predictions to build a learning framework robust to this noise. SoY achieves state-of-the-art performance on SPair-71k and AP-10k datasets, establishing a new benchmark in semantic correspondence without explicit geometric annotations. Code is available at Shape-of-You.

Title: Personalized Federated Learning via Gaussian Generative Modeling

Authors: Peng Hu, Jianwei Ma
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.11620
Pdf URL: https://arxiv.org/pdf/2603.11620
Copy Paste: [[2603.11620]] Personalized Federated Learning via Gaussian Generative Modeling(https://arxiv.org/abs/2603.11620)
Keywords: generative
Abstract: Federated learning has emerged as a paradigm to train models collaboratively on inherently distributed client data while safeguarding privacy. In this context, personalized federated learning tackles the challenge of data heterogeneity by equipping each client with a dedicated model. A prevalent strategy decouples the model into a shared feature extractor and a personalized classifier head, where the latter actively guides the representation learning. However, previous works have focused on classifier head-guided personalization, neglecting the potential personalized characteristics in the representation distribution. Building on this insight, we propose pFedGM, a method based on Gaussian generative modeling. The approach begins by training a Gaussian generator that models client heterogeneity via weighted re-sampling. A balance between global collaboration and personalization is then struck by employing a dual objective: a shared objective that maximizes inter-class distance across clients, and a local objective that minimizes intra-class distance within them. To achieve this, we decouple the conventional Gaussian classifier into a navigator for global optimization, and a statistic extractor for capturing distributional statistics. Inspired by the Kalman gain, the algorithm then employs a dual-scale fusion framework at global and local levels to equip each client with a personalized classifier head. In this framework, we model the global representation distribution as a prior and the client-specific data as the likelihood, enabling Bayesian inference for class probability estimation. The evaluation covers a comprehensive range of scenarios: heterogeneity in class counts, environmental corruption, and multiple benchmark datasets and configurations. pFedGM achieves superior or competitive performance compared to state-of-the-art methods.
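
The prior-times-likelihood fusion mentioned in the abstract can be illustrated with the textbook one-dimensional Gaussian case, where the posterior mean is the prior mean corrected by a Kalman-style gain. This is a sketch under stated assumptions rather than pFedGM's dual-scale fusion: the function `fuse_gaussian`, the scalar setting, the known local variance, and the numbers are all hypothetical.

```python
def fuse_gaussian(prior_mean, prior_var, local_mean, local_var, n_local):
    """Posterior over a Gaussian mean, given a Gaussian prior and n local samples.

    The prior stands in for a global statistic and the local sample mean for
    client-specific data (both are illustrative assumptions).
    """
    obs_var = local_var / n_local               # variance of the local sample mean
    gain = prior_var / (prior_var + obs_var)    # Kalman-style gain, between 0 and 1
    post_mean = prior_mean + gain * (local_mean - prior_mean)
    post_var = (1.0 - gain) * prior_var
    return post_mean, post_var, gain

# A client with few samples stays close to the global prior ...
few_mean, few_var, few_gain = fuse_gaussian(0.0, 1.0, 2.0, 4.0, n_local=4)
# ... while a client with many samples is dominated by its own data.
many_mean, many_var, many_gain = fuse_gaussian(0.0, 1.0, 2.0, 4.0, n_local=400)
```

The gain grows with the amount of local evidence, so the same rule interpolates between relying on shared statistics and relying on the client's own data.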

Title: Developing Foundation Models for Universal Segmentation from 3D Whole-Body Positron Emission Tomography

Authors: Yichi Zhang, Le Xue, Wenbo Zhang, Lanlan Li, Feiyang Xiao, Yuchen Liu, Xiaohui Zhang, Hongwei Zhang, Shuqi Wang, Gang Feng, Liling Peng, Xin Gao, Yuanfan Xu, Yuan Qi, Kuangyu Shi, Hong Zhang, Yuan Cheng, Mei Tian, Zixin Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.11627
Pdf URL: https://arxiv.org/pdf/2603.11627
Copy Paste: [[2603.11627]] Developing Foundation Models for Universal Segmentation from 3D Whole-Body Positron Emission Tomography(https://arxiv.org/abs/2603.11627)
Keywords: generation
Abstract: Positron emission tomography (PET) is a key nuclear medicine imaging modality that visualizes radiotracer distributions to quantify in vivo physiological and metabolic processes, playing an irreplaceable role in disease management. Despite its clinical importance, the development of deep learning models for quantitative PET image analysis remains severely limited, driven by both the inherent segmentation challenge from PET's paucity of anatomical contrast and the high costs of data acquisition and annotation. To bridge this gap, we develop generalist foundational models for universal segmentation from 3D whole-body PET imaging. We first build the largest and most comprehensive PET dataset to date, comprising 11041 3D whole-body PET scans with 59831 segmentation masks for model development. Based on this dataset, we present SegAnyPET, an innovative foundational model with general-purpose applicability to diverse segmentation tasks. Built on a 3D architecture with a prompt engineering strategy for mask generation, SegAnyPET enables universal and scalable organ and lesion segmentation, supports efficient human correction with minimal effort, and enables a clinical human-in-the-loop workflow. Extensive evaluations on multi-center, multi-tracer, multi-disease datasets demonstrate that SegAnyPET achieves strong zero-shot performance across a wide range of segmentation tasks, highlighting its potential to advance the clinical applications of molecular imaging.

Title: MV-SAM3D: Adaptive Multi-View Fusion for Layout-Aware 3D Generation

Authors: Baicheng Li, Dong Wu, Jun Li, Shunkai Zhou, Zecui Zeng, Lusong Li, Hongbin Zha
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.11633
Pdf URL: https://arxiv.org/pdf/2603.11633
Copy Paste: [[2603.11633]] MV-SAM3D: Adaptive Multi-View Fusion for Layout-Aware 3D Generation(https://arxiv.org/abs/2603.11633)
Keywords: generation
Abstract: Recent unified 3D generation models have made remarkable progress in producing high-quality 3D assets from a single image. Notably, layout-aware approaches such as SAM3D can reconstruct multiple objects while preserving their spatial arrangement, opening the door to practical scene-level 3D generation. However, current methods are limited to single-view input and cannot leverage complementary multi-view observations, while independently estimated object poses often lead to physically implausible layouts such as interpenetration and floating artifacts. We present MV-SAM3D, a training-free framework that extends layout-aware 3D generation with multi-view consistency and physical plausibility. We formulate multi-view fusion as a Multi-Diffusion process in 3D latent space and propose two adaptive weighting strategies -- attention-entropy weighting and visibility weighting -- that enable confidence-aware fusion, ensuring each viewpoint contributes according to its local observation reliability. For multi-object composition, we introduce physics-aware optimization that injects collision and contact constraints both during and after generation, yielding physically plausible object arrangements. Experiments on standard benchmarks and real-world multi-object scenes demonstrate significant improvements in reconstruction fidelity and layout plausibility, all without any additional training. Code is available at this https URL.

Title: Tokenization Allows Multimodal Large Language Models to Understand, Generate and Edit Architectural Floor Plans

Authors: Sizhong Qin, Ramon Elias Weber, Xinzheng Lu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.11640
Pdf URL: https://arxiv.org/pdf/2603.11640
Copy Paste: [[2603.11640]] Tokenization Allows Multimodal Large Language Models to Understand, Generate and Edit Architectural Floor Plans(https://arxiv.org/abs/2603.11640)
Keywords: generation
Abstract: Architectural floor plan design demands joint reasoning over geometry, semantics, and spatial hierarchy, which remains a major challenge for current AI systems. Although recent diffusion and language models improve visual fidelity, they still struggle with coherent spatial reasoning and controllable generation. We present HouseMind, a multimodal large language model that unifies floor plan understanding, generation, and editing in one framework. We introduce discrete room-instance tokens to construct a unified vocabulary that bridges layouts and symbolic reasoning. With multimodal alignment and instruction tuning, the model synthesizes coherent, controllable layouts from text instructions. Experiments show how the framework achieves superior geometric validity and controllability while remaining efficient and locally deployable.

Title: BackdoorIDS: Zero-shot Backdoor Detection for Pretrained Vision Encoder

Authors: Siquan Huang, Yijiang Li, Ningzhi Gao, Xingfu Yan, Leyu Shi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.11664
Pdf URL: https://arxiv.org/pdf/2603.11664
Copy Paste: [[2603.11664]] BackdoorIDS: Zero-shot Backdoor Detection for Pretrained Vision Encoder(https://arxiv.org/abs/2603.11664)
Keywords: restoration
Abstract: Self-supervised and multimodal vision encoders learn strong visual representations that are widely adopted in downstream vision tasks and large vision-language models (LVLMs). However, downstream users often rely on third-party pretrained encoders with uncertain provenance, exposing them to backdoor attacks. In this work, we propose BackdoorIDS, a simple yet effective zero-shot, inference-time backdoor sample detection method for pretrained vision encoders. BackdoorIDS is motivated by two observations: Attention Hijacking and Restoration. Under progressive input masking, a backdoored image initially concentrates attention on malicious trigger features. Once the masking ratio exceeds the trigger's robustness threshold, the trigger is deactivated, and attention rapidly shifts to benign content. This transition induces a pronounced change in the image embedding, whereas embeddings of clean images evolve more smoothly as masking progresses. BackdoorIDS operationalizes this signal by extracting an embedding sequence along the masking trajectory and applying density-based clustering such as DBSCAN. An input is flagged as backdoored if its embedding sequence forms more than one cluster. Extensive experiments show that BackdoorIDS consistently outperforms existing defenses across diverse attack types, datasets, and model families. Notably, it is a plug-and-play approach that requires no retraining and operates fully zero-shot at inference time, making it compatible with a wide range of encoder architectures, including CNNs, ViTs, CLIP, and LLaVA-1.5.
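Illustration (not from the paper): the decision rule in the abstract, "flag an input if its embedding sequence forms more than one cluster", can be sketched with a small pure-Python DBSCAN. The embedding sequences below are made-up 2-D toy points standing in for encoder outputs at increasing masking ratios; `eps`, `min_pts`, and the function names are assumptions, and the real method would use a vision encoder and tuned clustering parameters.

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN. Returns one label per point: -1 is noise, 0.. are cluster ids."""
    n = len(points)
    labels = [None] * n

    def neighbors(i):
        # Indices of all points within eps of point i (includes i itself).
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        nb = neighbors(i)
        if len(nb) < min_pts:
            labels[i] = -1  # provisional noise; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in nb if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb_j = neighbors(j)
            if len(nb_j) >= min_pts:
                queue.extend(nb_j)  # j is a core point, keep expanding
    return labels

def is_flagged(embedding_sequence, eps=0.5, min_pts=2):
    """Flag the input when its masking trajectory splits into more than one cluster."""
    labels = dbscan(embedding_sequence, eps, min_pts)
    return len({label for label in labels if label >= 0}) > 1

# Toy trajectories (hypothetical): a clean image drifts smoothly, while a
# backdoored one jumps once the trigger is masked out.
clean = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.1), (0.3, 0.1), (0.4, 0.2)]
backdoored = [(5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (0.0, 0.0), (0.1, 0.1), (0.2, 0.1)]
```

With these toy inputs, `is_flagged(clean)` is false and `is_flagged(backdoored)` is true, because the abrupt embedding shift separates the trajectory into two dense groups.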

Title: PROMO: Promptable Outfitting for Efficient High-Fidelity Virtual Try-On

Authors: Haohua Chen, Tianze Zhou, Wei Zhu, Runqi Wang, Yandong Guan, Dejia Song, Yibo Chen, Xu Tang, Yao Hu, Lu Sheng, Zhiyong Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.11675
Pdf URL: https://arxiv.org/pdf/2603.11675
Copy Paste: [[2603.11675]] PROMO: Promptable Outfitting for Efficient High-Fidelity Virtual Try-On(https://arxiv.org/abs/2603.11675)
Keywords: generation
Abstract: Virtual Try-on (VTON) has become a core capability for online retail, where realistic try-on results provide reliable fit guidance, reduce returns, and benefit both consumers and merchants. Diffusion-based VTON methods achieve photorealistic synthesis, yet often rely on intricate architectures such as auxiliary reference networks and suffer from slow sampling, making the trade-off between fidelity and efficiency a persistent challenge. We approach VTON as a structured image editing problem that demands strong conditional generation under three key requirements: subject preservation, faithful texture transfer, and seamless harmonization. Under this perspective, our training framework is generic and transfers to broader image editing tasks. Moreover, the paired data produced by VTON constitutes a rich supervisory resource for training general-purpose editors. We present PROMO, a promptable virtual try-on framework built upon a Flow Matching DiT backbone with latent multi-modal conditional concatenation. By leveraging conditioning efficiency and self-reference mechanisms, our approach substantially reduces inference overhead. On standard benchmarks, PROMO surpasses both prior VTON methods and general image editing models in visual fidelity while delivering a competitive balance between quality and speed. These results demonstrate that flow-matching transformers, coupled with latent multi-modal conditioning and self-reference acceleration, offer an effective and training-efficient solution for high-quality virtual try-on.

Title: UCAN: Unified Convolutional Attention Network for Expansive Receptive Fields in Lightweight Super-Resolution

Authors: Cao Thien Tan, Phan Thi Thu Trang, Do Nghiem Duc, Ho Ngoc Anh, Hanyang Zhuang, Nguyen Duc Dung
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.11680
Pdf URL: https://arxiv.org/pdf/2603.11680
Copy Paste: [[2603.11680]] UCAN: Unified Convolutional Attention Network for Expansive Receptive Fields in Lightweight Super-Resolution(https://arxiv.org/abs/2603.11680)
Keywords: restoration, super-resolution
Abstract: Hybrid CNN-Transformer architectures achieve strong results in image super-resolution, but scaling attention windows or convolution kernels significantly increases computational cost, limiting deployment on resource-constrained devices. We present UCAN, a lightweight network that unifies convolution and attention to expand the effective receptive field efficiently. UCAN combines window-based spatial attention with a Hedgehog Attention mechanism to model both local texture and long-range dependencies, and introduces a distillation-based large-kernel module to preserve high-frequency structure without heavy computation. In addition, we employ cross-layer parameter sharing to further reduce complexity. On Manga109 ($4\times$), UCAN-L achieves 31.63 dB PSNR with only 48.4G MACs, surpassing recent lightweight models. On BSDS100, UCAN attains 27.79 dB, outperforming methods with significantly larger models. Extensive experiments show that UCAN achieves a superior trade-off between accuracy, efficiency, and scalability, making it well-suited for practical high-resolution image restoration.
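Illustration (not from the paper): the dB figures quoted in the abstract are peak signal-to-noise ratios. As a reminder of how the metric is defined, PSNR = 10 * log10(MAX^2 / MSE), the sketch below computes it for two flat lists of pixel values. The 8-bit peak value of 255 is an assumption, and benchmark protocols may differ in details such as color space or border cropping.

```python
import math

def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(reference) != len(restored):
        raise ValueError("inputs must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Because the scale is logarithmic, a gain of 1 dB corresponds to roughly a 21% reduction in mean squared error.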

Title: PolyCrysDiff: Controllable Generation of Three-Dimensional Computable Polycrystalline Material Structures

Authors: Chi Chen, Tianle Jiang, Xiaodong Wei, Yanming Wang
Subjects: cs.CV, cond-mat.mtrl-sci
Abstract URL: https://arxiv.org/abs/2603.11695
Pdf URL: https://arxiv.org/pdf/2603.11695
Copy Paste: [[2603.11695]] PolyCrysDiff: Controllable Generation of Three-Dimensional Computable Polycrystalline Material Structures(https://arxiv.org/abs/2603.11695)
Keywords: generation, generative
Abstract: The three-dimensional (3D) microstructures of polycrystalline materials exert a critical influence on their mechanical and physical properties. Realistic, controllable construction of these microstructures is a key step toward elucidating structure-property relationships, yet remains a formidable challenge. Herein, we propose PolyCrysDiff, a framework based on conditional latent diffusion that enables the end-to-end generation of computable 3D polycrystalline microstructures. Comprehensive qualitative and quantitative evaluations demonstrate that PolyCrysDiff faithfully reproduces target grain morphologies, orientation distributions, and 3D spatial correlations, while achieving an $R^2$ over 0.972 on grain attributes (e.g., size and sphericity) control, thereby outperforming mainstream approaches such as Markov random field (MRF)- and convolutional neural network (CNN)-based methods. The computability and physical validity of the generated microstructures are verified through a series of crystal plasticity finite element method (CPFEM) simulations. Leveraging PolyCrysDiff's controllable generative capability, we systematically elucidate how grain-level microstructural characteristics affect the mechanical properties of polycrystalline materials. This development is expected to pave a key step toward accelerated, data-driven optimization and design of polycrystalline materials.

Title: OSCBench: Benchmarking Object State Change in Text-to-Video Generation

Authors: Xianjing Han, Bin Zhu, Shiqi Hu, Franklin Mingzhe Li, Patrick Carrington, Roger Zimmermann, Jingjing Chen
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2603.11698
Pdf URL: https://arxiv.org/pdf/2603.11698
Copy Paste: [[2603.11698]] OSCBench: Benchmarking Object State Change in Text-to-Video Generation(https://arxiv.org/abs/2603.11698)
Keywords: generation
Abstract: Text-to-video (T2V) generation models have made rapid progress in producing visually high-quality and temporally coherent videos. However, existing benchmarks primarily focus on perceptual quality, text-video alignment, or physical plausibility, leaving a critical aspect of action understanding largely unexplored: object state change (OSC) explicitly specified in the text prompt. OSC refers to the transformation of an object's state induced by an action, such as peeling a potato or slicing a lemon. In this paper, we introduce OSCBench, a benchmark specifically designed to assess OSC performance in T2V models. OSCBench is constructed from instructional cooking data and systematically organizes action-object interactions into regular, novel, and compositional scenarios to probe both in-distribution performance and generalization. We evaluate six representative open-source and proprietary T2V models using both human user study and multimodal large language model (MLLM)-based automatic evaluation. Our results show that, despite strong performance on semantic and scene alignment, current T2V models consistently struggle with accurate and temporally consistent object state changes, especially in novel and compositional settings. These findings position OSC as a key bottleneck in text-to-video generation and establish OSCBench as a diagnostic benchmark for advancing state-aware video generation models.

Title: SoulX-LiveAct: Towards Hour-Scale Real-Time Human Animation with Neighbor Forcing and ConvKV Memory

Authors: Dingcheng Zhen, Xu Zheng, Ruixin Zhang, Zhiqi Jiang, Yichao Yan, Ming Tao, Shunshun Yin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.11746
Pdf URL: https://arxiv.org/pdf/2603.11746
Copy Paste: [[2603.11746]] SoulX-LiveAct: Towards Hour-Scale Real-Time Human Animation with Neighbor Forcing and ConvKV Memory(https://arxiv.org/abs/2603.11746)
Keywords: generation
Abstract: Autoregressive (AR) diffusion models offer a promising framework for sequential generation tasks such as video synthesis by combining diffusion modeling with causal inference. Although they support streaming generation, existing AR diffusion methods struggle to scale efficiently. In this paper, we identify two key challenges in hour-scale real-time human animation. First, most forcing strategies propagate sample-level representations with mismatched diffusion states, causing inconsistent learning signals and unstable convergence. Second, historical representations grow unbounded and lack structure, preventing effective reuse of cached states and severely limiting inference efficiency. To address these challenges, we propose Neighbor Forcing, a diffusion-step-consistent AR formulation that propagates temporally adjacent frames as latent neighbors under the same noise condition. This design provides a distribution-aligned and stable learning signal while preserving drifting throughout the AR chain. Building upon this, we introduce a structured ConvKV memory mechanism that compresses the keys and values in causal attention into a fixed-length representation, enabling constant-memory inference and truly infinite video generation without relying on short-term motion-frame memory. Extensive experiments demonstrate that our approach significantly improves training convergence, hour-scale generation quality, and inference efficiency compared to existing AR diffusion methods. Numerically, LiveAct enables hour-scale real-time human animation and supports 20 FPS real-time streaming inference on as few as two NVIDIA H100 or H200 GPUs. Quantitative results demonstrate that our method attains state-of-the-art performance in lip-sync accuracy, human animation quality, and emotional expressiveness, with the lowest inference cost.

Title: Controllable Egocentric Video Generation via Occlusion-Aware Sparse 3D Hand Joints

Authors: Chenyangguang Zhang, Botao Ye, Boqi Chen, Alexandros Delitzas, Fangjinhua Wang, Marc Pollefeys, Xi Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.11755
Pdf URL: https://arxiv.org/pdf/2603.11755
Copy Paste: [[2603.11755]] Controllable Egocentric Video Generation via Occlusion-Aware Sparse 3D Hand Joints(https://arxiv.org/abs/2603.11755)
Keywords: generation
Abstract: Motion-controllable video generation is crucial for egocentric applications in virtual reality and embodied AI. However, existing methods often struggle to achieve 3D-consistent fine-grained hand articulation. By relying on 2D trajectories or implicit poses, they collapse 3D geometry into spatially ambiguous signals or over-rely on human-centric priors. Under severe egocentric occlusions, this causes motion inconsistencies and hallucinated artifacts and prevents cross-embodiment generalization to robotic hands. To address these limitations, we propose a novel framework that generates egocentric videos from a single reference frame, leveraging sparse 3D hand joints as embodiment-agnostic control signals with clear semantic and geometric structures. We introduce an efficient control module that resolves occlusion ambiguities while fully preserving 3D information. Specifically, it extracts occlusion-aware features from the source reference frame by penalizing unreliable visual signals from hidden joints, and employs a 3D-based weighting mechanism to robustly handle dynamically occluded target joints during motion propagation. Concurrently, the module directly injects 3D geometric embeddings into the latent space to strictly enforce structural consistency. To facilitate robust training and evaluation, we develop an automated annotation pipeline that yields over one million high-quality egocentric video clips paired with precise hand trajectories. Additionally, we register humanoid kinematic and camera data to construct a cross-embodiment benchmark. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art baselines, generating high-fidelity egocentric videos with realistic interactions and exhibiting exceptional cross-embodiment generalization to robotic hands.

Title: Language Generation with Replay: A Learning-Theoretic View of Model Collapse

Authors: Giorgio Racca, Michal Valko, Amartya Sanyal
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2603.11784
Pdf URL: https://arxiv.org/pdf/2603.11784
Copy Paste: [[2603.11784]] Language Generation with Replay: A Learning-Theoretic View of Model Collapse(https://arxiv.org/abs/2603.11784)
Keywords: generation, generative
Abstract: As scaling laws push the training of frontier large language models (LLMs) toward ever-growing data requirements, training pipelines are approaching a regime where much of the publicly available online text may be consumed. At the same time, widespread LLM usage increases the volume of machine-generated content on the web; together, these trends raise the likelihood of generated text re-entering future training corpora, increasing the associated risk of performance degradation often called model collapse. In practice, model developers address this concern through data cleaning, watermarking, synthetic-data policies, or, in some cases, blissful ignorance. However, the problem of model collapse in generative models has not been examined from a learning-theoretic perspective: we study it through the theoretical lens of the language generation in the limit framework, introducing a replay adversary that augments the example stream with the generator's own past outputs. Our main contribution is a fine-grained learning-theoretic characterization of when replay fundamentally limits generation: while replay is benign for the strongest notion of uniform generation, it provably creates separations for the weaker notions of non-uniform generation and generation in the limit. Interestingly, our positive results mirror heuristics widely used in practice, such as data cleaning, watermarking, and output filtering, while our separations show when these ideas can fail.

Title: Towards High-Fidelity CAD Generation via LLM-Driven Program Generation and Text-Based B-Rep Primitive Grounding

Authors: Jiahao Li, Qingwang Zhang, Qiuyu Chen, Guozhan Qiu, Yunzhong Lou, Xiangdong Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.11831
Pdf URL: https://arxiv.org/pdf/2603.11831
Copy Paste: [[2603.11831]] Towards High-Fidelity CAD Generation via LLM-Driven Program Generation and Text-Based B-Rep Primitive Grounding(https://arxiv.org/abs/2603.11831)
Keywords: generation
Abstract: The field of Computer-Aided Design (CAD) generation has made significant progress in recent years. Existing methods typically fall into two separate categories: parametric CAD modeling and direct boundary representation (B-Rep) synthesis. In modern feature-based CAD systems, parametric modeling and B-Rep are inherently intertwined, as advanced parametric operations (e.g., fillet and chamfer) require explicit selection of B-Rep geometric primitives, and the B-Rep itself is derived from parametric operations. Consequently, this paradigm gap remains a critical factor limiting AI-driven CAD modeling for complex industrial product design. This paper presents FutureCAD, a novel text-to-CAD framework that leverages large language models (LLMs) and a B-Rep grounding transformer (BRepGround) for high-fidelity CAD generation. Our method generates executable CadQuery scripts, and introduces a text-based query mechanism that enables the LLM to specify geometric selections via natural language, which BRepGround then grounds to the target primitives. To train our framework, we construct a new dataset comprising real-world CAD models. For the LLM, we apply supervised fine-tuning (SFT) to establish fundamental CAD generation capabilities, followed by reinforcement learning (RL) to improve generalization. Experiments show that FutureCAD achieves state-of-the-art CAD generation performance.

Title: A Decade of Generative Adversarial Networks for Porous Material Reconstruction

Authors: Ali Sadeghkhani, Brandon Bennett, Masoud Babaei, Arash Rabbani
Subjects: cs.CV, cond-mat.mtrl-sci, physics.geo-ph
Abstract URL: https://arxiv.org/abs/2603.11836
Pdf URL: https://arxiv.org/pdf/2603.11836
Copy Paste: [[2603.11836]] A Decade of Generative Adversarial Networks for Porous Material Reconstruction(https://arxiv.org/abs/2603.11836)
Keywords: generative
Abstract: Digital reconstruction of porous materials has become increasingly critical for applications ranging from geological reservoir characterization to tissue engineering and electrochemical device design. While traditional methods such as micro-computed tomography and statistical reconstruction approaches have established foundations in this field, the emergence of deep learning techniques, particularly Generative Adversarial Networks (GANs), has revolutionized porous media reconstruction capabilities. This review systematically analyzes 96 peer-reviewed articles published from 2017 to early 2026, examining the evolution and applications of GAN-based approaches for porous material image reconstruction. We categorize GAN architectures into six distinct classes, namely Vanilla GANs, Multi-Scale GANs, Conditional GANs, Attention-Enhanced GANs, Style-based GANs, and Hybrid Architecture GANs. Our analysis reveals substantial progress including improvements in porosity accuracy (within 1% of original samples), permeability prediction (up to 79% reduction in mean relative errors), and achievable reconstruction volumes (from initial $64^3$ to current $2{,}200^3$ voxels). Despite these advances, persistent challenges remain in computational efficiency, memory constraints for large-scale reconstruction, and maintaining structural continuity in 2D-to-3D transformations. This systematic analysis provides a comprehensive framework for selecting appropriate GAN architectures based on specific application requirements.
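Illustration (not from the review): "porosity accuracy within 1% of original samples" can be read as a small relative error between the porosity of a generated volume and that of the reference sample, where porosity is the pore fraction of a binarized voxel volume. The sketch below assumes volumes stored as nested lists indexed (z, y, x) with 1 marking a pore voxel; both the encoding and the function names are assumptions for illustration.

```python
def porosity(volume):
    """Fraction of pore voxels in a binary volume (1 = pore, 0 = solid)."""
    total = 0
    pores = 0
    for plane in volume:
        for row in plane:
            total += len(row)
            pores += sum(row)
    return pores / total

def relative_error(generated, reference):
    """Relative deviation of a generated statistic from its reference value."""
    return abs(generated - reference) / abs(reference)

# Toy 2x2x2 volume with two pore voxels out of eight.
toy_volume = [[[1, 0], [0, 0]], [[0, 1], [0, 0]]]
toy_porosity = porosity(toy_volume)
```

A reconstruction would meet the quoted criterion under this reading when `relative_error(porosity(generated), porosity(reference))` is below 0.01.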

Title: Derain-Agent: A Plug-and-Play Agent Framework for Rainy Image Restoration

Authors: Zhaocheng Yu, Xiang Chen, Runzhe Li, Zihan Geng, Guanglu Sun, Haipeng Li, Kui Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.11866
Pdf URL: https://arxiv.org/pdf/2603.11866
Copy Paste: [[2603.11866]] Derain-Agent: A Plug-and-Play Agent Framework for Rainy Image Restoration(https://arxiv.org/abs/2603.11866)
Keywords: restoration
Abstract: While deep learning has advanced single-image deraining, existing models suffer from a fundamental limitation: they employ a static inference paradigm that fails to adapt to the complex, coupled degradations (e.g., noise artifacts, blur, and color deviation) of real-world rain. Consequently, restored images often exhibit residual artifacts and inconsistent perceptual quality. In this work, we present Derain-Agent, a plug-and-play refinement framework that transitions deraining from static processing to dynamic, agent-based restoration. Derain-Agent equips a base deraining model with two core capabilities: 1) a Planning Network that intelligently schedules an optimal sequence of restoration tools for each instance, and 2) a Strength Modulation mechanism that applies these tools with spatially adaptive intensity. This design enables precise, region-specific correction of residual errors without the prohibitive cost of iterative search. Our method demonstrates strong generalization, consistently boosting the performance of state-of-the-art deraining models on both synthetic and real-world benchmarks.

Title: Think While Watching: Online Streaming Segment-Level Memory for Multi-Turn Video Reasoning in Multimodal Large Language Models

Authors: Lu Wang (1), Zhuoran Jin (1), Yupu Hao (1), Yubo Chen (1), Kang Liu (1), Yulong Ao (2), Jun Zhao (1) ((1) The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China, (2) Beijing Academy of Artificial Intelligence (BAAI), Beijing, China)
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2603.11896
Pdf URL: https://arxiv.org/pdf/2603.11896
Copy Paste: [[2603.11896]] Think While Watching: Online Streaming Segment-Level Memory for Multi-Turn Video Reasoning in Multimodal Large Language Models(https://arxiv.org/abs/2603.11896)
Keywords: generation
Abstract: Multimodal large language models (MLLMs) have shown strong performance on offline video understanding, but most are limited to offline inference or have weak online reasoning, making multi-turn interaction over continuously arriving video streams difficult. Existing streaming methods typically use an interleaved perception-generation paradigm, which prevents concurrent perception and generation and leads to early memory decay as streams grow, hurting long-range dependency modeling. We propose Think While Watching, a memory-anchored streaming video reasoning framework that preserves continuous segment-level memory during multi-turn interaction. We build a three-stage, multi-round chain-of-thought dataset and adopt a stage-matched training strategy, while enforcing strict causality through a segment-level streaming causal mask and streaming positional encoding. During inference, we introduce an efficient pipeline that overlaps watching and thinking and adaptively selects the best attention backend. Under both single-round and multi-round streaming input protocols, our method achieves strong results. Built on Qwen3-VL, it improves single-round accuracy by 2.6% on StreamingBench and by 3.79% on OVO-Bench. In the multi-round setting, it maintains performance while reducing output tokens by 56%. Code is available at: this https URL

Title: Causal Representation Learning with Optimal Compression under Complex Treatments

Authors: Wanting Liang, Haoang Chi, Zhiheng Zhang
Subjects: cs.LG, stat.ME
Abstract URL: https://arxiv.org/abs/2603.11907
Pdf URL: https://arxiv.org/pdf/2603.11907
Copy Paste: [[2603.11907]] Causal Representation Learning with Optimal Compression under Complex Treatments(https://arxiv.org/abs/2603.11907)
Keywords: generative
Abstract: Estimating Individual Treatment Effects (ITE) in multi-treatment scenarios faces two critical challenges: the Hyperparameter Selection Dilemma for balancing weights and the Curse of Dimensionality in computational scalability. This paper derives a novel multi-treatment generalization bound and proposes a theoretical estimator for the optimal balancing weight $\alpha$, eliminating expensive heuristic tuning. We investigate three balancing strategies: Pairwise, One-vs-All (OVA), and Treatment Aggregation. While OVA achieves superior precision in low-dimensional settings, our proposed Treatment Aggregation ensures both accuracy and O(1) scalability as the treatment space expands. Furthermore, we extend our framework to a generative architecture, Multi-Treatment CausalEGM, which preserves the Wasserstein geodesic structure of the treatment manifold. Experiments on semi-synthetic and image datasets demonstrate that our approach significantly outperforms traditional models in estimation accuracy and efficiency, particularly in large-scale intervention scenarios.

Title: EnTransformer: A Deep Generative Transformer for Multivariate Probabilistic Forecasting

Authors: Rajdeep Pathak, Rahul Goswami, Madhurima Panja, Palash Ghosh, Tanujit Chakraborty
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2603.11909
Pdf URL: https://arxiv.org/pdf/2603.11909
Copy Paste: [[2603.11909]] EnTransformer: A Deep Generative Transformer for Multivariate Probabilistic Forecasting(https://arxiv.org/abs/2603.11909)
Keywords: generative
Abstract: Reliable uncertainty quantification is critical in multivariate time series forecasting problems arising in domains such as energy systems and transportation networks, among many others. Although Transformer-based architectures have recently achieved strong performance for sequence modeling, most probabilistic forecasting approaches rely on restrictive parametric likelihoods or quantile-based objectives. They can struggle to capture complex joint predictive distributions across multiple correlated time series. This work proposes EnTransformer, a deep generative forecasting framework that integrates engression, a stochastic learning paradigm for modeling conditional distributions, with the expressive sequence modeling capabilities of Transformers. The proposed approach injects stochastic noise into the model representation and optimizes an energy-based scoring objective to directly learn the conditional predictive distribution without imposing parametric assumptions. This design enables EnTransformer to generate coherent multivariate forecast trajectories while preserving Transformers' capacity to effectively model long-range temporal dependencies and cross-series interactions. We evaluate our proposed EnTransformer on several widely used benchmarks for multivariate probabilistic forecasting, including Electricity, Traffic, Solar, Taxi, KDD-cup, and Wikipedia datasets. Experimental results demonstrate that EnTransformer produces well-calibrated probabilistic forecasts and consistently outperforms the benchmark models.

Title: InSpatio-WorldFM: An Open-Source Real-Time Generative Frame Model

Authors: InSpatio Team: Xiaoyu Zhang, Weihong Pan, Zhichao Ye, Jialin Liu, Yipeng Chen, Nan Wang, Xiaojun Xiang, Weijian Xie, Yifu Wang, Haoyu Ji, Siji Pan, Zhewen Le, Jing Guo, Xianbin Liu, Donghui Shen, Ziqiang Zhao, Haomin Liu, Guofeng Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.11911
Pdf URL: https://arxiv.org/pdf/2603.11911
Copy Paste: [[2603.11911]] InSpatio-WorldFM: An Open-Source Real-Time Generative Frame Model(https://arxiv.org/abs/2603.11911)
Keywords: generation, generative
Abstract: We present InSpatio-WorldFM, an open-source real-time frame model for spatial intelligence. Unlike video-based world models that rely on sequential frame generation and incur substantial latency due to window-level processing, InSpatio-WorldFM adopts a frame-based paradigm that generates each frame independently, enabling low-latency real-time spatial inference. By enforcing multi-view spatial consistency through explicit 3D anchors and implicit spatial memory, the model preserves global scene geometry while maintaining fine-grained visual details across viewpoint changes. We further introduce a progressive three-stage training pipeline that transforms a pretrained image diffusion model into a controllable frame model and finally into a real-time generator through few-step distillation. Experimental results show that InSpatio-WorldFM achieves strong multi-view consistency while supporting interactive exploration on consumer-grade GPUs, providing an efficient alternative to traditional video-based world models for real-time world simulation.

Title: MobileKernelBench: Can LLMs Write Efficient Kernels for Mobile Devices?

Authors: Xingze Zou, Jing Wang, Yuhua Zheng, Xueyi Chen, Haolei Bai, Lingcheng Kong, Syed A.R. Abu-Bakar, Zhaode Wang, Chengfei Lv, Haoji Hu, Huan Wang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.11935
Pdf URL: https://arxiv.org/pdf/2603.11935
Copy Paste: [[2603.11935]] MobileKernelBench: Can LLMs Write Efficient Kernels for Mobile Devices?(https://arxiv.org/abs/2603.11935)
Keywords: generation
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities in code generation, yet their potential for generating kernels specifically for mobile devices remains largely unexplored. In this work, we extend the scope of automated kernel generation to the mobile domain to investigate the central question: Can LLMs write efficient kernels for mobile devices? To enable systematic investigation, we introduce MobileKernelBench, a comprehensive evaluation framework comprising a benchmark prioritizing operator diversity and cross-framework interoperability, coupled with an automated pipeline that bridges the host-device gap for on-device verification. Leveraging this framework, we conduct extensive evaluation on the CPU backend of Mobile Neural Network (MNN), revealing that current LLMs struggle with the engineering complexity and data scarcity inherent to mobile frameworks; standard models and even fine-tuned variants exhibit high compilation failure rates (over 54%) and negligible performance gains due to hallucinations and a lack of domain-specific grounding. To overcome these limitations, we propose the Mobile Kernel Agent (MoKA), a multi-agent system equipped with repository-aware reasoning and a plan-and-execute paradigm. On MobileKernelBench, MoKA achieves state-of-the-art performance, boosting compilation success to 93.7% and enabling 27.4% of generated kernels to deliver measurable speedups over native libraries.

Title: Statistical and structural identifiability in representation learning

Authors: Walter Nelson, Marco Fumero, Theofanis Karaletsos, Francesco Locatello
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.11970
Pdf URL: https://arxiv.org/pdf/2603.11970
Copy Paste: [[2603.11970]] Statistical and structural identifiability in representation learning(https://arxiv.org/abs/2603.11970)
Keywords: generative
Abstract: Representation learning models exhibit a surprising stability in their internal representations. Whereas most prior work treats this stability as a single property, we formalize it as two distinct concepts: statistical identifiability (consistency of representations across runs) and structural identifiability (alignment of representations with some unobserved ground truth). Recognizing that perfect pointwise identifiability is generally unrealistic for modern representation learning models, we propose new model-agnostic definitions of statistical and structural near-identifiability of representations up to some error tolerance $\epsilon$. Leveraging these definitions, we prove a statistical $\epsilon$-near-identifiability result for the representations of models with nonlinear decoders, generalizing existing identifiability theory beyond last-layer representations in e.g. generative pre-trained transformers (GPTs) to near-identifiability of the intermediate representations of a broad class of models including (masked) autoencoders (MAEs) and supervised learners. Although these weaker assumptions confer weaker identifiability, we show that independent components analysis (ICA) can resolve much of the remaining linear ambiguity for this class of models, and validate and measure our near-identifiability claims empirically. With additional assumptions on the data-generating process, statistical identifiability extends to structural identifiability, yielding a simple and practical recipe for disentanglement: ICA post-processing of latent representations. On synthetic benchmarks, this approach achieves state-of-the-art disentanglement using a vanilla autoencoder. With a foundation model-scale MAE for cell microscopy, it disentangles biological variation from technical batch effects, substantially improving downstream generalization.

Title: HomeSafe-Bench: Evaluating Vision-Language Models on Unsafe Action Detection for Embodied Agents in Household Scenarios

Authors: Jiayue Pu, Zhongxiang Sun, Zilu Zhang, Xiao Zhang, Jun Xu
Subjects: cs.CV, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2603.11975
Pdf URL: https://arxiv.org/pdf/2603.11975
Copy Paste: [[2603.11975]] HomeSafe-Bench: Evaluating Vision-Language Models on Unsafe Action Detection for Embodied Agents in Household Scenarios(https://arxiv.org/abs/2603.11975)
Keywords: generation
Abstract: The rapid evolution of embodied agents has accelerated the deployment of household robots in real-world environments. However, unlike structured industrial settings, household spaces introduce unpredictable safety risks, where system limitations such as perception latency and lack of common sense knowledge can lead to dangerous errors. Current safety evaluations, often restricted to static images, text, or general hazards, fail to adequately benchmark dynamic unsafe action detection in these specific contexts. To bridge this gap, we introduce HomeSafe-Bench, a challenging benchmark designed to evaluate Vision-Language Models (VLMs) on unsafe action detection in household scenarios. HomeSafe-Bench is constructed via a hybrid pipeline combining physical simulation with advanced video generation and features 438 diverse cases across six functional areas with fine-grained multidimensional annotations. Beyond benchmarking, we propose Hierarchical Dual-Brain Guard for Household Safety (HD-Guard), a hierarchical streaming architecture for real-time safety monitoring. HD-Guard coordinates a lightweight FastBrain for continuous high-frequency screening with an asynchronous large-scale SlowBrain for deep multimodal reasoning, effectively balancing inference efficiency with detection accuracy. Evaluations demonstrate that HD-Guard achieves a superior trade-off between latency and performance, while our analysis identifies critical bottlenecks in current VLM-based safety detection.

Title: Ada3Drift: Adaptive Training-Time Drifting for One-Step 3D Visuomotor Robotic Manipulation

Authors: Chongyang Xu, Yixian Zou, Ziliang Feng, Fanman Meng, Shuaicheng Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.11984
Pdf URL: https://arxiv.org/pdf/2603.11984
Copy Paste: [[2603.11984]] Ada3Drift: Adaptive Training-Time Drifting for One-Step 3D Visuomotor Robotic Manipulation(https://arxiv.org/abs/2603.11984)
Keywords: generation
Abstract: Diffusion-based visuomotor policies effectively capture multimodal action distributions through iterative denoising, but their high inference latency limits real-time robotic control. Recent flow matching and consistency-based methods achieve single-step generation, yet sacrifice the ability to preserve distinct action modes, collapsing multimodal behaviors into averaged, often physically infeasible trajectories. We observe that the compute budget asymmetry in robotics (offline training vs. real-time inference) naturally motivates recovering this multimodal fidelity by shifting iterative refinement from inference time to training time. Building on this insight, we propose Ada3Drift, which learns a training-time drifting field that attracts predicted actions toward expert demonstration modes while repelling them from other generated samples, enabling high-fidelity single-step generation (1 NFE) from 3D point cloud observations. To handle the few-shot robotic regime, Ada3Drift further introduces a sigmoid-scheduled loss transition from coarse distribution learning to mode-sharpening refinement, and multi-scale field aggregation that captures action modes at varying spatial granularities. Experiments on three simulation benchmarks (Adroit, Meta-World, and RoboTwin) and real-world robotic manipulation tasks demonstrate that Ada3Drift achieves state-of-the-art performance while requiring $10\times$ fewer function evaluations than diffusion-based alternatives.

Title: Flowcean - Model Learning for Cyber-Physical Systems

Authors: Maximilian Schmidt, Swantje Plambeck, Markus Knitt, Hendrik Rose, Goerschwin Fey, Jan Christian Wieck, Stephan Balduin
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.12015
Pdf URL: https://arxiv.org/pdf/2603.12015
Copy Paste: [[2603.12015]] Flowcean - Model Learning for Cyber-Physical Systems(https://arxiv.org/abs/2603.12015)
Keywords: generation
Abstract: Effective models of Cyber-Physical Systems (CPS) are crucial for their design and operation. Constructing such models is difficult and time-consuming due to the inherent complexity of CPS. As a result, data-driven model generation using machine learning methods is gaining popularity. In this paper, we present Flowcean, a novel framework designed to automate the generation of models through data-driven learning that focuses on modularity and usability. By offering various learning strategies, data processing methods, and evaluation metrics, our framework provides a comprehensive solution, tailored to CPS scenarios. Flowcean facilitates the integration of diverse learning libraries and tools within a modular and flexible architecture, ensuring adaptability to a wide range of modeling tasks. This streamlines the process of model generation and evaluation, making it more efficient and accessible.

Title: Nyxus: A Next Generation Image Feature Extraction Library for the Big Data and AI Era

Authors: Nicholas Schaub, Andriy Kharchenko, Hamdah Abbasi, Sameeul Samee, Hythem Sidky, Nathan Hotaling
Subjects: cs.CV, q-bio.QM
Abstract URL: https://arxiv.org/abs/2603.12016
Pdf URL: https://arxiv.org/pdf/2603.12016
Copy Paste: [[2603.12016]] Nyxus: A Next Generation Image Feature Extraction Library for the Big Data and AI Era(https://arxiv.org/abs/2603.12016)
Keywords: generation
Abstract: Modern imaging instruments can produce terabytes to petabytes of data for a single experiment. The biggest barrier to processing big image datasets has been computational, where image analysis algorithms often lack the efficiency needed to process such large datasets or make tradeoffs in robustness and accuracy. Deep learning algorithms have vastly improved the accuracy of the first step in an analysis workflow (region segmentation), but the expansion of domain-specific feature extraction libraries across scientific disciplines has made it difficult to compare the performance and accuracy of extracted features. To address these needs, we developed a novel feature extraction library called Nyxus. Nyxus is designed from the ground up for scalable out-of-core feature extraction for 2D and 3D image data and rigorously tested against established standards. The comprehensive feature set of Nyxus covers multiple biomedical domains including radiomics and cellular analysis, and is designed for computational scalability across CPUs and GPUs. Nyxus has been packaged to be accessible to users of various skill sets and needs: as a Python package for code developers, as a command line tool, as a Napari plugin for low to no-code users or users that want to visualize results, and as an Open Container Initiative (OCI) compliant container that can be used in cloud or super-computing workflows aimed at processing large data sets. Further, Nyxus enables a new methodological approach to feature extraction, allowing for programmatic tuning of many feature sets for optimal computational efficiency or coverage for use in novel machine learning and deep learning applications.

Title: Efficient Generative Modeling with Unitary Matrix Product States Using Riemannian Optimization

Authors: Haotong Duan, Zhongming Chen, Ngai Wong
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.12026
Pdf URL: https://arxiv.org/pdf/2603.12026
Copy Paste: [[2603.12026]] Efficient Generative Modeling with Unitary Matrix Product States Using Riemannian Optimization(https://arxiv.org/abs/2603.12026)
Keywords: generative
Abstract: Tensor networks, which were originally developed for characterizing complex quantum many-body systems, have recently emerged as a powerful framework for capturing high-dimensional probability distributions with strong physical interpretability. This paper systematically studies matrix product states (MPS) for generative modeling and shows that unitary MPS, a tensor-network architecture that is both simple and expressive, offers clear benefits for unsupervised learning by reducing ambiguity in parameter updates and improving efficiency. To overcome the inefficiency of standard gradient-based MPS training, we develop a Riemannian optimization approach that casts probabilistic modeling as an optimization problem with manifold constraints, and further derive an efficient space-decoupling algorithm. Experiments on Bars-and-Stripes and EMNIST datasets demonstrate fast adaptation to data structure, stable updates, and strong performance while maintaining the efficiency and expressive power of MPS.

Title: Single Pixel Image Classification using an Ultrafast Digital Light Projector

Authors: Aisha Kanwal, Graeme E. Johnstone, Fahimeh Dehkhoda, Johannes H. Herrnsdorf, Robert K. Henderson, Martin D. Dawson, Xavier Porte, Michael J. Strain
Subjects: cs.CV, physics.optics
Abstract URL: https://arxiv.org/abs/2603.12036
Pdf URL: https://arxiv.org/pdf/2603.12036
Copy Paste: [[2603.12036]] Single Pixel Image Classification using an Ultrafast Digital Light Projector(https://arxiv.org/abs/2603.12036)
Keywords: generation
Abstract: Pattern recognition and image classification are essential tasks in machine vision. Autonomous vehicles, for example, must be able to collect the complex information contained in a changing environment and classify it in real time. Here, we experimentally demonstrate image classification at multi-kHz frame rates, combining the technique of single pixel imaging (SPI) with a low complexity machine learning model. The use of a microLED-on-CMOS digital light projector for SPI enables ultrafast pattern generation for sub-ms image encoding. We investigate the classification accuracy of our experimental system against the broadly accepted benchmarking task of the MNIST digits classification. We compare the classification performance of two machine learning models: an extreme learning machine (ELM) and a backpropagation trained deep neural network. The complexity of both models is kept low so the overhead added to the inference time is comparable to the image generation time. Crucially, our single pixel image classification approach is based on a spatiotemporal transformation of the information, entirely bypassing the need for image reconstruction. By exploring the performance of our SPI based ELM as a binary classifier, we demonstrate its potential for efficient anomaly detection in ultrafast imaging scenarios.

Title: Slow-Fast Inference: Training-Free Inference Acceleration via Within-Sentence Support Stability

Authors: Xingyu Xie, Zhaochen Yu, Yue Liao, Tao Wang, Kim-Chuan Toh, Shuicheng Yan
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.12038
Pdf URL: https://arxiv.org/pdf/2603.12038
Copy Paste: [[2603.12038]] Slow-Fast Inference: Training-Free Inference Acceleration via Within-Sentence Support Stability(https://arxiv.org/abs/2603.12038)
Keywords: generation
Abstract: Long-context autoregressive decoding remains expensive because each decoding step must repeatedly process a growing history. We observe a consistent pattern during decoding: within a sentence, and more generally within a short semantically coherent span, the dominant attention support often remains largely stable. Motivated by this observation, we propose Slow-Fast Inference (SFI), a training-free decoding framework that decouples generation into frequent low-cost fast steps and occasional dense-attention slow steps. Fast steps reuse a compact sparse memory for efficient decoding. Slow steps are triggered near semantic boundaries. At slow steps, the model revisits the broader context and uses the Selector to refresh the selected memory for subsequent fast steps. Across the evaluated context lengths, SFI delivers approximately $1.6\times$--$14.4\times$ higher decoding throughput while generally maintaining quality on par with the full-KV baseline across long-context and long-CoT settings. Because SFI is training-free and applies directly to existing checkpoints, it offers a practical path to reducing inference cost for contemporary autoregressive reasoning models in long-context, long-horizon, and agentic workloads.

Title: Coarse-Guided Visual Generation via Weighted h-Transform Sampling

Authors: Yanghao Wang, Ziqi Jiang, Zhen Wang, Long Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.12057
Pdf URL: https://arxiv.org/pdf/2603.12057
Copy Paste: [[2603.12057]] Coarse-Guided Visual Generation via Weighted h-Transform Sampling(https://arxiv.org/abs/2603.12057)
Keywords: generation
Abstract: Coarse-guided visual generation, which synthesizes fine visual samples from degraded or low-fidelity coarse references, is essential for various real-world applications. While training-based approaches are effective, they are inherently limited by high training costs and restricted generalization due to paired data collection. Accordingly, recent training-free works propose to leverage pretrained diffusion models and incorporate guidance during the sampling process. However, these training-free methods either require knowing the forward (fine-to-coarse) transformation operator, e.g., bicubic downsampling, or struggle to balance guidance against synthesis quality. To address these challenges, we propose a novel guided method using the h-transform, a tool that can constrain stochastic processes (e.g., the sampling process) under desired conditions. Specifically, we modify the transition probability at each sampling timestep by adding a drift function to the original differential equation, which approximately steers the generation toward the ideal fine sample. To address unavoidable approximation errors, we introduce a noise-level-aware schedule that gradually de-weights the drift term as the error increases, ensuring both guidance adherence and high-quality synthesis. Extensive experiments across diverse image and video generation tasks demonstrate the effectiveness and generalization of our method.

Title: Towards Universal Computational Aberration Correction in Photographic Cameras: A Comprehensive Benchmark Analysis

Authors: Xiaolong Qian, Qi Jiang, Yao Gao, Lei Sun, Zhonghua Yi, Kailun Yang, Luc Van Gool, Kaiwei Wang
Subjects: cs.CV, cs.RO, eess.IV, physics.optics
Abstract URL: https://arxiv.org/abs/2603.12083
Pdf URL: https://arxiv.org/pdf/2603.12083
Copy Paste: [[2603.12083]] Towards Universal Computational Aberration Correction in Photographic Cameras: A Comprehensive Benchmark Analysis(https://arxiv.org/abs/2603.12083)
Keywords: restoration
Abstract: Prevalent Computational Aberration Correction (CAC) methods are typically tailored to specific optical systems, leading to poor generalization and labor-intensive re-training for new lenses. Developing CAC paradigms capable of generalizing across diverse photographic lenses offers a promising solution to these challenges. However, efforts to achieve such cross-lens universality within consumer photography are still in their early stages due to the lack of a comprehensive benchmark that encompasses a sufficiently wide range of optical aberrations. Furthermore, it remains unclear which specific factors influence existing CAC methods and how these factors affect their performance. In this paper, we present comprehensive experiments and evaluations involving 24 image restoration and CAC algorithms, utilizing our newly proposed UniCAC, a large-scale benchmark for photographic cameras constructed via automatic optical design. The Optical Degradation Evaluator (ODE) is introduced as a novel framework to objectively assess the difficulty of CAC tasks, offering credible quantification of optical aberrations and enabling reliable evaluation. Drawing on our comparative analysis, we identify three key factors -- prior utilization, network architecture, and training strategy -- that most significantly influence CAC performance, and further investigate their respective effects. We believe that our benchmark, dataset, and observations contribute foundational insights to related areas and lay the groundwork for future investigations. Benchmarks, codes, and Zemax files will be available at this https URL.

Title: EvoTok: A Unified Image Tokenizer via Residual Latent Evolution for Visual Understanding and Generation

Authors: Yan Li, Ning Liao, Xiangyu Zhao, Shaofeng Zhang, Xiaoxing Wang, Yifan Yang, Junchi Yan, Xue Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.12108
Pdf URL: https://arxiv.org/pdf/2603.12108
Copy Paste: [[2603.12108]] EvoTok: A Unified Image Tokenizer via Residual Latent Evolution for Visual Understanding and Generation(https://arxiv.org/abs/2603.12108)
Keywords: generation
Abstract: The development of unified multimodal large language models (MLLMs) is fundamentally challenged by the granularity gap between visual understanding and generation: understanding requires high-level semantic abstractions, while image generation demands fine-grained pixel-level representations. Existing approaches usually either enforce both forms of supervision on the same set of representations or decouple them into separate feature spaces, leading to interference and inconsistency, respectively. In this work, we propose EvoTok, a unified image tokenizer that reconciles these requirements through a residual evolution process within a shared latent space. Instead of maintaining separate token spaces for pixels and semantics, EvoTok encodes an image into a cascaded sequence of residual tokens via residual vector quantization. This residual sequence forms an evolution trajectory where earlier stages capture low-level details and deeper stages progressively transition toward high-level semantic representations. Despite being trained on a relatively modest dataset of 13M images, far smaller than the billion-scale datasets used by many previous unified tokenizers, EvoTok achieves a strong reconstruction quality of 0.43 rFID on ImageNet-1K at 256x256 resolution. When integrated with a large language model, EvoTok shows promising performance across 7 out of 9 visual understanding benchmarks, and remarkable results on image generation benchmarks such as GenEval and GenAI-Bench. These results demonstrate that modeling visual representations as an evolving trajectory provides an effective and principled solution for unifying visual understanding and generation.
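Illustrative note: residual vector quantization, as named in the abstract, quantizes a vector in stages, with each stage encoding what the previous stages left over. The sketch below shows only the generic technique on a toy 2-D vector; the function names, the fixed codebooks, and the input are hypothetical and do not reflect EvoTok's learned codebooks or training procedure.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((c - v) ** 2 for c, v in zip(codebook[k], vec)),
    )


def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage encodes the remaining residual."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        k = nearest(codebook, residual)
        indices.append(k)
        # Subtract the chosen code so the next stage sees only what is left
        residual = [r - c for r, c in zip(residual, codebook[k])]
    return indices, residual


def rvq_decode(indices, codebooks):
    """Reconstruct by summing the selected code from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for k, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[k])]
    return out


# Hypothetical two-stage codebooks: a coarse stage, then a finer correction stage
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.5, -0.5]],
]
indices, residual = rvq_encode([1.4, 0.6], codebooks)
reconstruction = rvq_decode(indices, codebooks)
print(indices, reconstruction)
```

The list of per-stage indices is the cascaded token sequence; truncating it after an early stage gives a coarser reconstruction, which is the property a stage-wise "trajectory" reading relies on.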

Title: Hoi3DGen: Generating High-Quality Human-Object-Interactions in 3D

Authors: Agniv Sharma, Xianghui Xie, Tom Fischer, Eddy Ilg, Gerard Pons-Moll
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2603.12126
Pdf URL: https://arxiv.org/pdf/2603.12126
Copy Paste: [[2603.12126]] Hoi3DGen: Generating High-Quality Human-Object-Interactions in 3D(https://arxiv.org/abs/2603.12126)
Keywords: generation
Abstract: Modeling and generating 3D human-object interactions from text is crucial for applications in AR, XR, and gaming. Existing approaches often rely on score distillation from text-to-image models, but their results suffer from the Janus problem and do not follow text prompts faithfully due to the scarcity of high-quality interaction data. We introduce Hoi3DGen, a framework that generates high-quality textured meshes of human-object interaction that follow the input interaction descriptions precisely. We first curate realistic and high-quality interaction data leveraging multimodal large language models, and then create a full text-to-3D pipeline, which achieves orders-of-magnitude improvements in interaction fidelity. Our method surpasses baselines by 4-15x in text consistency and 3-7x in 3D model quality, exhibiting strong generalization to diverse categories and interaction types, while maintaining high-quality 3D generation.

Title: Automatic Generation of High-Performance RL Environments

Authors: Seth Karten, Rahul Dev Appapogu, Chi Jin
Subjects: cs.LG, cs.AI, cs.SE
Abstract URL: https://arxiv.org/abs/2603.12145
Pdf URL: https://arxiv.org/pdf/2603.12145
Copy Paste: [[2603.12145]] Automatic Generation of High-Performance RL Environments(https://arxiv.org/abs/2603.12145)
Keywords: generation
Abstract: Translating complex reinforcement learning (RL) environments into high-performance implementations has traditionally required months of specialized engineering. We present a reusable recipe - a generic prompt template, hierarchical verification, and iterative agent-assisted repair - that produces semantically equivalent high-performance environments for <$10 in compute cost. We demonstrate three distinct workflows across five environments. Direct translation (no prior performance implementation exists): EmuRust (1.5x PPO speedup via Rust parallelism for a Game Boy emulator) and PokeJAX, the first GPU-parallel Pokemon battle simulator (500M SPS random action, 15.2M SPS PPO; 22,320x over the TypeScript reference). Translation verified against existing performance implementations: throughput parity with MJX (1.04x) and 5x over Brax at matched GPU batch sizes (HalfCheetah JAX); 42x PPO (Puffer Pong). New environment creation: TCGJax, the first deployable JAX Pokemon TCG engine (717K SPS random action, 153K SPS PPO; 6.6x over the Python reference), synthesized from a web-extracted specification. At 200M parameters, the environment overhead drops below 4% of training time. Hierarchical verification (property, interaction, and rollout tests) confirms semantic equivalence for all five environments; cross-backend policy transfer confirms zero sim-to-sim gap for all five environments. TCGJax, synthesized from a private reference absent from public repositories, serves as a contamination control for agent pretraining data concerns. The paper contains sufficient detail - including representative prompts, verification methodology, and complete results - that a coding agent could reproduce the translations directly from the manuscript.
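
The rollout level of the hierarchical verification described above can be sketched as follows. This is only an illustration of the idea of checking a translated environment against its reference transition by transition; the toy counter environment and the names `reference_step`, `translated_step`, and `rollout_equivalent` are hypothetical and do not come from the paper.

```python
import random

def reference_step(state, action):
    """Toy reference environment: a counter clipped to [0, 10]."""
    new_state = max(0, min(10, state + action))
    reward = 1.0 if new_state == 10 else 0.0
    return new_state, reward

def translated_step(state, action):
    """Toy 'translated' environment that should behave identically."""
    new_state = state + action
    if new_state < 0:
        new_state = 0
    elif new_state > 10:
        new_state = 10
    return new_state, float(new_state == 10)

def rollout_equivalent(step_a, step_b, init_state, actions):
    """Return (True, None) if both step functions agree on every transition,
    otherwise (False, index_of_first_mismatch)."""
    state_a = state_b = init_state
    for t, action in enumerate(actions):
        state_a, reward_a = step_a(state_a, action)
        state_b, reward_b = step_b(state_b, action)
        if state_a != state_b or reward_a != reward_b:
            return False, t
    return True, None

rng = random.Random(0)
actions = [rng.choice([-1, 1, 2]) for _ in range(1000)]
ok, first_mismatch = rollout_equivalent(reference_step, translated_step, 0, actions)
```

Returning the index of the first mismatch, rather than a bare boolean, gives a repair loop a concrete transition to inspect.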

Title: FlashMotion: Few-Step Controllable Video Generation with Trajectory Guidance

Authors: Quanhao Li, Zhen Xing, Rui Wang, Haidong Cao, Qi Dai, Daoguo Dong, Zuxuan Wu
Subjects: cs.CV, cs.AI, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2603.12146
Pdf URL: https://arxiv.org/pdf/2603.12146
Copy Paste: [[2603.12146]] FlashMotion: Few-Step Controllable Video Generation with Trajectory Guidance(https://arxiv.org/abs/2603.12146)
Keywords: generation
Abstract: Trajectory-controllable video generation has made remarkable progress in recent years. Previous methods mainly use adapter-based architectures for precise motion control along predefined trajectories. However, all these methods rely on a multi-step denoising process, leading to substantial time redundancy and computational overhead. While existing video distillation methods successfully distill multi-step generators into few-step ones, directly applying these approaches to trajectory-controllable video generation results in noticeable degradation in both video quality and trajectory accuracy. To bridge this gap, we introduce FlashMotion, a novel training framework designed for few-step trajectory-controllable video generation. We first train a trajectory adapter on a multi-step video generator for precise trajectory control. Then, we distill the generator into a few-step version to accelerate video generation. Finally, we finetune the adapter using a hybrid strategy that combines diffusion and adversarial objectives, aligning it with the few-step generator to produce high-quality, trajectory-accurate videos. For evaluation, we introduce FlashBench, a benchmark for long-sequence trajectory-controllable video generation that measures both video quality and trajectory accuracy across varying numbers of foreground objects. Experiments on two adapter architectures show that FlashMotion surpasses existing video distillation methods and previous multi-step models in both visual quality and trajectory consistency.

Title: GlyphBanana: Advancing Precise Text Rendering Through Agentic Workflows

Authors: Zexuan Yan, Jiarui Jin, Yue Ma, Shijian Wang, Jiahui Hu, Wenxiang Jiao, Yuan Lu, Linfeng Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.12155
Pdf URL: https://arxiv.org/pdf/2603.12155
Copy Paste: [[2603.12155]] GlyphBanana: Advancing Precise Text Rendering Through Agentic Workflows(https://arxiv.org/abs/2603.12155)
Keywords: generative
Abstract: Despite recent advances in generative models driving significant progress in text rendering, accurately generating complex text and mathematical formulas remains a formidable challenge. This difficulty primarily stems from the limited instruction-following capabilities of current models when encountering out-of-distribution prompts. To address this, we introduce GlyphBanana, alongside a corresponding benchmark specifically designed for rendering complex characters and formulas. GlyphBanana employs an agentic workflow that integrates auxiliary tools to inject glyph templates into both the latent space and attention maps, facilitating the iterative refinement of generated images. Notably, our training-free approach can be seamlessly applied to various Text-to-Image (T2I) models, achieving superior precision compared to existing baselines. Extensive experiments demonstrate the effectiveness of our proposed workflow. Associated code is publicly available at this https URL.

Title: A Quantitative Characterization of Forgetting in Post-Training

Authors: Krishnakumar Balasubramanian, Shiva Prasad Kasiviswanathan
Subjects: cs.LG, cs.AI, math.ST, stat.ML
Abstract URL: https://arxiv.org/abs/2603.12163
Pdf URL: https://arxiv.org/pdf/2603.12163
Copy Paste: [[2603.12163]] A Quantitative Characterization of Forgetting in Post-Training(https://arxiv.org/abs/2603.12163)
Keywords: generative
Abstract: Continual post-training of generative models is widely used, yet a principled understanding of when and why forgetting occurs remains limited. We develop theoretical results under a two-mode mixture abstraction (representing old and new tasks), proposed by Chen et al. (2025) (arXiv:2510.18874), and formalize forgetting in two forms: (i) mass forgetting, where the old mixture weight collapses to zero, and (ii) old-component drift, where an already-correct old component shifts during training. For equal-covariance Gaussian modes, we prove that forward-KL objectives trained on data from the new distribution drive the old weight to zero, while reverse-KL objectives converge to the true target (thereby avoiding mass forgetting) and perturb the old mean only through overlap-gated misassignment probabilities controlled by the Bhattacharyya coefficient, yielding drift that decays exponentially with mode separation and a locally well-conditioned geometry with exponential convergence. We further quantify how replay interacts with these objectives. For forward-KL, replay must modify the training distribution to change the population optimum; for reverse-KL, replay leaves the population objective unchanged but prevents finite-batch old-mode starvation through bounded importance weighting. Finally, we analyze three recently proposed near-on-policy post-training methods, SDFT (arXiv:2601.19897), TTT-Discover (arXiv:2601.16175), and OAPL (arXiv:2602.19362), via the same lens and derive explicit conditions under which each retains old mass and exhibits overlap-controlled drift. Overall, our results show that forgetting can be precisely quantified based on the interaction between divergence direction, geometric behavioral overlap, sampling regime, and the visibility of past behavior during training.
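
The mass-forgetting claim for forward-KL can be checked numerically in a one-dimensional toy case. The sketch below is not the paper's proof: the means, unit variance, integration grid, and function names are assumptions chosen for illustration. It evaluates KL(p_new || q_w) for a two-mode model q_w = w * N(mu_old) + (1 - w) * N(mu_new) when the data come only from the new mode, and also computes the closed-form Bhattacharyya coefficient for two equal-variance Gaussians.

```python
import numpy as np

def gaussian_pdf(x, mean, sigma=1.0):
    return np.exp(-0.5 * ((x - mean) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def forward_kl_to_mixture(w_old, mu_old=-2.0, mu_new=2.0, sigma=1.0):
    """KL(p_new || q_w), approximated by a Riemann sum on a fixed grid.

    q_w keeps weight w_old on the old mode and 1 - w_old on the new mode.
    """
    x = np.linspace(-12.0, 12.0, 24001)
    dx = x[1] - x[0]
    p_new = gaussian_pdf(x, mu_new, sigma)
    q = w_old * gaussian_pdf(x, mu_old, sigma) + (1.0 - w_old) * p_new
    return float(np.sum(p_new * (np.log(p_new) - np.log(q))) * dx)

def bhattacharyya_coefficient(mu_a, mu_b, sigma=1.0):
    """Closed form for two 1D Gaussians that share the same variance."""
    return float(np.exp(-((mu_a - mu_b) ** 2) / (8.0 * sigma ** 2)))

weights = [0.0, 0.1, 0.3, 0.5]
kl_values = [forward_kl_to_mixture(w) for w in weights]
overlap = bhattacharyya_coefficient(-2.0, 2.0)
```

In this toy setting the objective is zero at `w_old = 0` and grows with `w_old`, so minimizing it on new-mode data alone pushes the old weight toward zero. The overlap term shrinks exponentially in the squared mode separation, which is the quantity the abstract ties to old-component drift.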

Title: ForensicZip: More Tokens are Better but Not Necessary in Forensic Vision-Language Models

Authors: Yingxin Lai, Zitong Yu, Jun Wang, Linlin Shen, Yong Xu, Xiaochun Cao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.12208
Pdf URL: https://arxiv.org/pdf/2603.12208
Copy Paste: [[2603.12208]] ForensicZip: More Tokens are Better but Not Necessary in Forensic Vision-Language Models(https://arxiv.org/abs/2603.12208)
Keywords: generative
Abstract: Multimodal Large Language Models (MLLMs) enable interpretable multimedia forensics by generating textual rationales for forgery detection. However, processing dense visual sequences incurs high computational costs, particularly for high-resolution images and videos. Visual token pruning is a practical acceleration strategy, yet existing methods are largely semantic-driven, retaining salient objects while discarding background regions where manipulation traces such as high-frequency anomalies and temporal jitters often reside. To address this issue, we introduce ForensicZip, a training-free framework that reformulates token compression from a forgery-driven perspective. ForensicZip models temporal token evolution as a Birth-Death Optimal Transport problem with a slack dummy node, quantifying physical discontinuities indicating transient generative artifacts. The forensic scoring further integrates transport-based novelty with high-frequency priors to separate forensic evidence from semantic content under large-ratio compression. Experiments on deepfake and AIGC benchmarks show that at 10% token retention, ForensicZip achieves a 2.97x speedup and over 90% FLOPs reduction while maintaining state-of-the-art detection performance.

Title: Real-World Point Tracking with Verifier-Guided Pseudo-Labeling

Authors: Görkay Aydemir, Fatma Güney, Weidi Xie
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.12217
Pdf URL: https://arxiv.org/pdf/2603.12217
Copy Paste: [[2603.12217]] Real-World Point Tracking with Verifier-Guided Pseudo-Labeling(https://arxiv.org/abs/2603.12217)
Keywords: generation
Abstract: Models for long-term point tracking are typically trained on large synthetic datasets. The performance of these models degrades in real-world videos due to different characteristics and the absence of dense ground-truth annotations. Self-training on unlabeled videos has been explored as a practical solution, but the quality of pseudo-labels strongly depends on the reliability of teacher models, which varies across frames and scenes. In this paper, we address the problem of real-world fine-tuning and introduce a verifier, a meta-model that learns to assess the reliability of tracker predictions and guide pseudo-label generation. Given candidate trajectories from multiple pretrained trackers, the verifier evaluates them per frame and selects the most trustworthy predictions, resulting in high-quality pseudo-label trajectories. When applied for fine-tuning, verifier-guided pseudo-labeling substantially improves the quality of supervision and enables data-efficient adaptation to unlabeled videos. Extensive experiments on four real-world benchmarks demonstrate that our approach achieves state-of-the-art results while requiring less data than prior self-training methods. Project page: this https URL
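
The per-frame selection step can be sketched in a few lines. This assumes the verifier has already produced one reliability score per tracker per frame; how those scores are learned is the paper's contribution and is not modeled here. The array shapes and the function name `select_pseudo_labels` are hypothetical.

```python
import numpy as np

def select_pseudo_labels(candidates, reliability):
    """Pick, for every frame, the prediction of the most reliable tracker.

    candidates:  (K, T, 2) point tracks from K trackers over T frames.
    reliability: (K, T) verifier scores, higher means more trustworthy.
    Returns the (T, 2) pseudo-label track and the (T,) chosen tracker ids.
    """
    best = np.argmax(reliability, axis=0)          # best tracker per frame
    frames = np.arange(candidates.shape[1])
    return candidates[best, frames], best

candidates = np.array([
    [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]],   # tracker 0
    [[0.5, 0.5], [1.5, 1.5], [9.0, 9.0]],   # tracker 1
])
reliability = np.array([
    [0.9, 0.2, 0.8],
    [0.1, 0.7, 0.3],
])
track, chosen = select_pseudo_labels(candidates, reliability)
```

Here the resulting track switches from tracker 0 to tracker 1 on the middle frame and back again, which is the behavior a single fixed teacher cannot provide.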

Title: SceneAssistant: A Visual Feedback Agent for Open-Vocabulary 3D Scene Generation

Authors: Jun Luo, Jiaxiang Tang, Ruijie Lu, Gang Zeng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.12238
Pdf URL: https://arxiv.org/pdf/2603.12238
Copy Paste: [[2603.12238]] SceneAssistant: A Visual Feedback Agent for Open-Vocabulary 3D Scene Generation(https://arxiv.org/abs/2603.12238)
Keywords: generation
Abstract: Text-to-3D scene generation from natural language is highly desirable for digital content creation. However, existing methods are largely domain-restricted or reliant on predefined spatial relationships, limiting their capacity for unconstrained, open-vocabulary 3D scene synthesis. In this paper, we introduce SceneAssistant, a visual-feedback-driven agent designed for open-vocabulary 3D scene generation. Our framework leverages a modern 3D object generation model along with the spatial reasoning and planning capabilities of Vision-Language Models (VLMs). To enable open-vocabulary scene composition, we provide the VLMs with a comprehensive set of atomic operations (e.g., Scale, Rotate, FocusOn). At each interaction step, the VLM receives rendered visual feedback and takes actions accordingly, iteratively refining the scene to achieve more coherent spatial arrangements and better alignment with the input text. Experimental results demonstrate that our method can generate diverse, open-vocabulary, and high-quality 3D scenes. Both qualitative analysis and quantitative human evaluations demonstrate the superiority of our approach over existing methods. Furthermore, our method allows users to instruct the agent to edit existing scenes based on natural language commands. Our code is available at this https URL

Title: BiGain: Unified Token Compression for Joint Generation and Classification

Authors: Jiacheng Liu, Shengkun Tang, Jiacheng Cui, Dongkuan Xu, Zhiqiang Shen
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2603.12240
Pdf URL: https://arxiv.org/pdf/2603.12240
Copy Paste: [[2603.12240]] BiGain: Unified Token Compression for Joint Generation and Classification(https://arxiv.org/abs/2603.12240)
Keywords: generation, generative
Abstract: Acceleration methods for diffusion models (e.g., token merging or downsampling) typically optimize synthesis quality under reduced compute, yet often ignore discriminative capacity. We revisit token compression with a joint objective and present BiGain, a training-free, plug-and-play framework that preserves generation quality while improving classification in accelerated diffusion models. Our key insight is frequency separation: mapping feature-space signals into a frequency-aware representation disentangles fine detail from global semantics, enabling compression that respects both generative fidelity and discriminative utility. BiGain reflects this principle with two frequency-aware operators: (1) Laplacian-gated token merging, which encourages merges among spectrally smooth tokens while discouraging merges of high-contrast tokens, thereby retaining edges and textures; and (2) Interpolate-Extrapolate KV Downsampling, which downsamples keys/values via a controllable interextrapolation between nearest and average pooling while keeping queries intact, thereby conserving attention precision. Across DiT- and U-Net-based backbones and ImageNet-1K, ImageNet-100, Oxford-IIIT Pets, and COCO-2017, our operators consistently improve the speed-accuracy trade-off for diffusion-based classification, while maintaining or enhancing generation quality under comparable acceleration. For instance, on ImageNet-1K, with 70% token merging on Stable Diffusion 2.0, BiGain increases classification accuracy by 7.15% while improving FID by 0.34 (1.85%). Our analyses indicate that balanced spectral retention, preserving high-frequency detail and low/mid-frequency semantics, is a reliable design rule for token compression in diffusion models. To our knowledge, BiGain is the first framework to jointly study and advance both generation and classification under accelerated diffusion, supporting lower-cost deployment.
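
One plausible reading of the key/value downsampling operator is a linear blend between nearest and average pooling, sketched below. The exact formula, the convention that `alpha = 1` means nearest pooling, the choice of the first token in each window as "nearest", and the function name `interp_extrap_downsample` are all assumptions for illustration, not BiGain's definition.

```python
import numpy as np

def interp_extrap_downsample(tokens, stride, alpha):
    """Downsample keys or values by blending two pooling schemes.

    tokens: (N, D) array with N divisible by stride.
    alpha = 1 gives nearest pooling (first token of each window),
    alpha = 0 gives average pooling, and values outside [0, 1] extrapolate.
    """
    n, d = tokens.shape
    windows = tokens.reshape(n // stride, stride, d)
    nearest = windows[:, 0, :]
    average = windows.mean(axis=1)
    return alpha * nearest + (1.0 - alpha) * average

keys = np.arange(8, dtype=np.float64).reshape(4, 2)
blended = interp_extrap_downsample(keys, stride=2, alpha=0.5)
extrapolated = interp_extrap_downsample(keys, stride=2, alpha=1.5)
```

Queries are left at full length in this scheme, so only the key and value sequences shrink by the stride.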

Title: Separable neural architectures as a primitive for unified predictive and generative intelligence

Authors: Reza T. Batley, Apurba Sarker, Rajib Mostakim, Andrew Klichine, Sourav Saha
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.12244
Pdf URL: https://arxiv.org/pdf/2603.12244
Copy Paste: [[2603.12244]] Separable neural architectures as a primitive for unified predictive and generative intelligence(https://arxiv.org/abs/2603.12244)
Keywords: generation, generative
Abstract: Intelligent systems across physics, language and perception often exhibit factorisable structure, yet are typically modelled by monolithic neural architectures that do not explicitly exploit this structure. The separable neural architecture (SNA) addresses this by formalising a representational class that unifies additive, quadratic and tensor-decomposed neural models. By constraining interaction order and tensor rank, SNAs impose a structural inductive bias that factorises high-dimensional mappings into low-arity components. Separability need not be a property of the system itself: it often emerges in the coordinates or representations through which the system is expressed. Crucially, this coordinate-aware formulation reveals a structural analogy between chaotic spatiotemporal dynamics and linguistic autoregression. By treating continuous physical states as smooth, separable embeddings, SNAs enable distributional modelling of chaotic systems. This approach mitigates the nonphysical drift characteristics of deterministic operators whilst remaining applicable to discrete sequences. The compositional versatility of this approach is demonstrated across four domains: autonomous waypoint navigation via reinforcement learning, inverse generation of multifunctional microstructures, distributional modelling of turbulent flow and neural language modelling. These results establish the separable neural architecture as a domain-agnostic primitive for predictive and generative intelligence, capable of unifying both deterministic and distributional representations.

Title: One Model, Many Budgets: Elastic Latent Interfaces for Diffusion Transformers

Authors: Moayed Haji-Ali, Willi Menapace, Ivan Skorokhodov, Dogyun Park, Anil Kag, Michael Vasilkovsky, Sergey Tulyakov, Vicente Ordonez, Aliaksandr Siarohin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.12245
Pdf URL: https://arxiv.org/pdf/2603.12245
Copy Paste: [[2603.12245]] One Model, Many Budgets: Elastic Latent Interfaces for Diffusion Transformers(https://arxiv.org/abs/2603.12245)
Keywords: generative
Abstract: Diffusion transformers (DiTs) achieve high generative quality but lock FLOPs to image resolution, limiting principled latency-quality trade-offs, and allocate computation uniformly across input spatial tokens, wasting resources on unimportant regions. We introduce Elastic Latent Interface Transformer (ELIT), a drop-in, DiT-compatible mechanism that decouples input image size from compute. Our approach inserts a latent interface, a learnable variable-length token sequence on which standard transformer blocks can operate. Lightweight Read and Write cross-attention layers move information between spatial tokens and latents and prioritize important input regions. By training with random dropping of tail latents, ELIT learns to produce importance-ordered representations with earlier latents capturing global structure while later ones contain information to refine details. At inference, the number of latents can be dynamically adjusted to match compute constraints. ELIT is deliberately minimal, adding two cross-attention layers while leaving the rectified flow objective and the DiT stack unchanged. Across datasets and architectures (DiT, U-ViT, HDiT, MM-DiT), ELIT delivers consistent gains. On ImageNet-1K 512px, ELIT delivers an average gain of 35.3% and 39.6% in FID and FDD scores. Project page: this https URL
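
The tail-dropping idea can be illustrated with a minimal sketch: during training a random-length prefix of the latent sequence is kept, and at inference the prefix length is set by the compute budget. The uniform sampling of the prefix length, the array shapes, and the helper names are assumptions for illustration, not ELIT's actual schedule.

```python
import numpy as np

def drop_tail_latents(latents, rng, min_keep=1):
    """Training-time sketch: keep a random-length prefix of the latents.

    latents: (L, D) array; the tail beyond the sampled length is dropped.
    """
    num_keep = int(rng.integers(min_keep, latents.shape[0] + 1))
    return latents[:num_keep]

def truncate_for_budget(latents, budget):
    """Inference-time sketch: keep as many leading latents as the budget allows."""
    num_keep = max(1, min(budget, latents.shape[0]))
    return latents[:num_keep]

rng = np.random.default_rng(0)
latents = rng.normal(size=(64, 16))
train_view = drop_tail_latents(latents, rng)
cheap_view = truncate_for_budget(latents, budget=8)
```

Because the first latents survive every truncation, training pressure falls on them to carry the most important information, which is what makes a shorter prefix usable at inference.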

Title: Trust Your Critic: Robust Reward Modeling and Reinforcement Learning for Faithful Image Editing and Generation

Authors: Xiangyu Zhao, Peiyuan Zhang, Junming Lin, Tianhao Liang, Yuchen Duan, Shengyuan Ding, Changyao Tian, Yuhang Zang, Junchi Yan, Xue Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.12247
Pdf URL: https://arxiv.org/pdf/2603.12247
Copy Paste: [[2603.12247]] Trust Your Critic: Robust Reward Modeling and Reinforcement Learning for Faithful Image Editing and Generation(https://arxiv.org/abs/2603.12247)
Keywords: generation
Abstract: Reinforcement learning (RL) has emerged as a promising paradigm for enhancing image editing and text-to-image (T2I) generation. However, current reward models, which act as critics during RL, often suffer from hallucinations and assign noisy scores, inherently misguiding the optimization process. In this paper, we present FIRM (Faithful Image Reward Modeling), a comprehensive framework that develops robust reward models to provide accurate and reliable guidance for faithful image generation and editing. First, we design tailored data curation pipelines to construct high-quality scoring datasets. Specifically, we evaluate editing using both execution and consistency, while generation is primarily assessed via instruction following. Using these pipelines, we collect the FIRM-Edit-370K and FIRM-Gen-293K datasets, and train specialized reward models (FIRM-Edit-8B and FIRM-Gen-8B) that accurately reflect these criteria. Second, we introduce FIRM-Bench, a comprehensive benchmark specifically designed for editing and generation critics. Evaluations demonstrate that our models achieve superior alignment with human judgment compared to existing metrics. Furthermore, to seamlessly integrate these critics into the RL pipeline, we formulate a novel "Base-and-Bonus" reward strategy that balances competing objectives: Consistency-Modulated Execution (CME) for editing and Quality-Modulated Alignment (QMA) for generation. Empowered by this framework, our resulting models FIRM-Qwen-Edit and FIRM-SD3.5 achieve substantial performance breakthroughs. Comprehensive experiments demonstrate that FIRM mitigates hallucinations, establishing a new standard for fidelity and instruction adherence over existing general models. All of our datasets, models, and code are publicly available at this https URL.

Title: DVD: Deterministic Video Depth Estimation with Generative Priors

Authors: Hongfei Zhang, Harold Haodong Chen, Chenfei Liao, Jing He, Zixin Zhang, Haodong Li, Yihao Liang, Kanghao Chen, Bin Ren, Xu Zheng, Shuai Yang, Kun Zhou, Yinchuan Li, Nicu Sebe, Ying-Cong Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.12250
Pdf URL: https://arxiv.org/pdf/2603.12250
Copy Paste: [[2603.12250]] DVD: Deterministic Video Depth Estimation with Generative Priors(https://arxiv.org/abs/2603.12250)
Keywords: generative
Abstract: Existing video depth estimation faces a fundamental trade-off: generative models suffer from stochastic geometric hallucinations and scale drift, while discriminative models demand massive labeled datasets to resolve semantic ambiguities. To break this impasse, we present DVD, the first framework to deterministically adapt pre-trained video diffusion models into single-pass depth regressors. Specifically, DVD features three core designs: (i) repurposing the diffusion timestep as a structural anchor to balance global stability with high-frequency details; (ii) latent manifold rectification (LMR) to mitigate regression-induced over-smoothing, enforcing differential constraints to restore sharp boundaries and coherent motion; and (iii) global affine coherence, an inherent property bounding inter-window divergence, which enables seamless long-video inference without requiring complex temporal alignment. Extensive experiments demonstrate that DVD achieves state-of-the-art zero-shot performance across benchmarks. Furthermore, DVD successfully unlocks the profound geometric priors implicit in video foundation models using 163x less task-specific data than leading baselines. Notably, we fully release our pipeline, providing the whole training suite for SOTA video depth estimation to benefit the open-source community.

Title: The Latent Color Subspace: Emergent Order in High-Dimensional Chaos

Authors: Mateusz Pach, Jessica Bader, Quentin Bouniot, Serge Belongie, Zeynep Akata
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2603.12261
Pdf URL: https://arxiv.org/pdf/2603.12261
Copy Paste: [[2603.12261]] The Latent Color Subspace: Emergent Order in High-Dimensional Chaos(https://arxiv.org/abs/2603.12261)
Keywords: generation
Abstract: Text-to-image generation models have advanced rapidly, yet achieving fine-grained control over generated images remains difficult, largely due to limited understanding of how semantic information is encoded. We develop an interpretation of the color representation in the Variational Autoencoder latent space of FLUX.1 [Dev], revealing a structure reflecting Hue, Saturation, and Lightness. We verify our Latent Color Subspace (LCS) interpretation by demonstrating that it can both predict and explicitly control color, introducing a fully training-free method in FLUX based solely on closed-form latent-space manipulation. Code is available at this https URL.

Title: GRADE: Benchmarking Discipline-Informed Reasoning in Image Editing

Authors: Mingxin Liu, Ziqian Fan, Zhaokai Wang, Leyao Gu, Zirun Zhu, Yiguo He, Yuchen Yang, Changyao Tian, Xiangyu Zhao, Ning Liao, Shaofeng Zhang, Qibing Ren, Zhihang Zhong, Xuanhe Zhou, Junchi Yan, Xue Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.12264
Pdf URL: https://arxiv.org/pdf/2603.12264
Copy Paste: [[2603.12264]] GRADE: Benchmarking Discipline-Informed Reasoning in Image Editing(https://arxiv.org/abs/2603.12264)
Keywords: generation
Abstract: Unified multimodal models target joint understanding, reasoning, and generation, but current image editing benchmarks are largely confined to natural images and shallow commonsense reasoning, offering limited assessment of this capability under structured, domain-specific constraints. In this work, we introduce GRADE, the first benchmark to assess discipline-informed knowledge and reasoning in image editing. GRADE comprises 520 carefully curated samples across 10 academic domains, spanning from natural science to social science. To support rigorous evaluation, we propose a multi-dimensional evaluation protocol that jointly assesses Discipline Reasoning, Visual Consistency, and Logical Readability. Extensive experiments on 20 state-of-the-art open-source and closed-source models reveal substantial limitations in current models under implicit, knowledge-intensive editing settings, leading to large performance gaps. Beyond quantitative scores, we conduct rigorous analyses and ablations to expose model shortcomings and identify the constraints within disciplinary editing. Together, GRADE pinpoints key directions for the future development of unified multimodal models, advancing the research on discipline-informed image editing and reasoning. Our benchmark and evaluation code are publicly released.

Title: MM-CondChain: A Programmatically Verified Benchmark for Visually Grounded Deep Compositional Reasoning

Authors: Haozhan Shen, Shilin Yan, Hongwei Xue, Shuaiqi Lu, Xiaojun Tang, Guannan Zhang, Tiancheng Zhao, Jianwei Yin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.12266
Pdf URL: https://arxiv.org/pdf/2603.12266
Copy Paste: [[2603.12266]] MM-CondChain: A Programmatically Verified Benchmark for Visually Grounded Deep Compositional Reasoning(https://arxiv.org/abs/2603.12266)
Keywords: generation
Abstract: Multimodal Large Language Models (MLLMs) are increasingly used to carry out visual workflows such as navigating GUIs, where the next step depends on verified visual compositional conditions (e.g., "if a permission dialog appears and the color of the interface is green, click Allow") and the process may branch or terminate early. Yet this capability remains under-evaluated: existing benchmarks focus on shallow compositions or independent constraints rather than deeply chained compositional conditionals. In this paper, we introduce MM-CondChain, a benchmark for visually grounded deep compositional reasoning. Each benchmark instance is organized as a multi-layer reasoning chain, where every layer contains a non-trivial compositional condition grounded in visual evidence and built from multiple objects, attributes, or relations. To answer correctly, an MLLM must perceive the image in detail, reason over multiple visual elements at each step, and follow the resulting execution path to the final outcome. To scalably construct such workflow-style data, we propose an agentic synthesis pipeline: a Planner orchestrates layer-by-layer generation of compositional conditions, while a Verifiable Programmatic Intermediate Representation (VPIR) ensures each layer's condition is mechanically verifiable. A Composer then assembles these verified layers into complete instructions. Using this pipeline, we construct benchmarks across three visual domains: natural images, data charts, and GUI trajectories. Experiments on a range of MLLMs show that even the strongest model attains only 53.33 Path F1, with sharp drops on hard negatives and as depth or predicate complexity grows, confirming that deep compositional reasoning remains a fundamental challenge.
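The multi-layer conditional chain described in the MM-CondChain abstract can be pictured with a small sketch. This is only an illustration of following a chain of conditions with early termination; it is not the paper's pipeline or its VPIR. The `Layer` class, the `follow_chain` function, the scene dictionary, and all outcome labels are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-in for facts a model would have to perceive in an image.
Scene = Dict[str, object]


@dataclass
class Layer:
    """One layer of the chain: a compositional condition and its early-exit outcome."""
    name: str
    condition: Callable[[Scene], bool]
    on_false: str  # outcome if the chain terminates early at this layer


def follow_chain(layers: List[Layer], scene: Scene, final_answer: str) -> List[str]:
    """Return the execution path: the layers visited, followed by the outcome."""
    path: List[str] = []
    for layer in layers:
        path.append(layer.name)
        if not layer.condition(scene):
            path.append(layer.on_false)  # condition failed: terminate early
            return path
    path.append(final_answer)  # every condition held
    return path


# Two illustrative layers, loosely modeled on the abstract's GUI example.
layers = [
    Layer("dialog_and_green",
          lambda s: bool(s["permission_dialog"]) and s["interface_color"] == "green",
          on_false="do_nothing"),
    Layer("allow_button_visible",
          lambda s: "Allow" in s["buttons"],
          on_false="report_missing_button"),
]

scene_a = {"permission_dialog": True, "interface_color": "green", "buttons": ["Allow", "Deny"]}
scene_b = {"permission_dialog": True, "interface_color": "red", "buttons": ["Allow", "Deny"]}

path_a = follow_chain(layers, scene_a, final_answer="click_allow")
path_b = follow_chain(layers, scene_b, final_answer="click_allow")
print(path_a)
print(path_b)
```

In this toy setup `scene_b` differs from `scene_a` in a single attribute, which is enough to change the execution path. The abstract's hard negatives may work along similar lines, but that is an assumption here.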

Title: EVATok: Adaptive Length Video Tokenization for Efficient Visual Autoregressive Generation

Authors: Tianwei Xiong, Jun Hao Liew, Zilong Huang, Zhijie Lin, Jiashi Feng, Xihui Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.12267
Pdf URL: https://arxiv.org/pdf/2603.12267
Copy Paste: [[2603.12267]] EVATok: Adaptive Length Video Tokenization for Efficient Visual Autoregressive Generation(https://arxiv.org/abs/2603.12267)
Keywords: generation, generative
Abstract: Autoregressive (AR) video generative models rely on video tokenizers that compress pixels into discrete token sequences. The length of these token sequences is crucial for balancing reconstruction quality against downstream generation computational cost. Traditional video tokenizers apply a uniform token assignment across temporal blocks of different videos, often wasting tokens on simple, static, or repetitive segments while underserving dynamic or complex ones. To address this inefficiency, we introduce $\textbf{EVATok}$, a framework to produce $\textbf{E}$fficient $\textbf{V}$ideo $\textbf{A}$daptive $\textbf{Tok}$enizers. Our framework estimates optimal token assignments for each video to achieve the best quality-cost trade-off, develops lightweight routers for fast prediction of these optimal assignments, and trains adaptive tokenizers that encode videos based on the assignments predicted by routers. We demonstrate that EVATok delivers substantial improvements in efficiency and overall quality for video reconstruction and downstream AR generation. Enhanced by our advanced training recipe that integrates video semantic encoders, EVATok achieves superior reconstruction and state-of-the-art class-to-video generation on UCF-101, with at least 24.4% savings in average token usage compared to the prior state-of-the-art LARP and our fixed-length baseline.
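The quality-cost trade-off behind adaptive token assignment in the EVATok abstract can be illustrated with a toy example. The abstract does not say how EVATok estimates its optimal assignments, so the scoring rule below (quality minus a weighted token count) is an assumption. The `assign_tokens` function, the `cost_weight` parameter, and the quality numbers are likewise hypothetical.

```python
from typing import Dict, List


def assign_tokens(quality_by_block: List[Dict[int, float]], cost_weight: float) -> List[int]:
    """For each temporal block, pick the token count maximizing quality - cost_weight * tokens.

    Assumed scoring rule for illustration only; ties go to the smaller token count.
    """
    assignment: List[int] = []
    for options in quality_by_block:
        best = max(sorted(options), key=lambda n: options[n] - cost_weight * n)
        assignment.append(best)
    return assignment


# Hypothetical reconstruction quality (higher is better) per candidate token count.
static_block = {64: 0.90, 128: 0.92, 256: 0.93}   # simple segment: extra tokens help little
dynamic_block = {64: 0.55, 128: 0.75, 256: 0.90}  # complex segment: extra tokens help a lot

assignment = assign_tokens([static_block, dynamic_block], cost_weight=0.001)
print(assignment)       # token count chosen for each block
print(sum(assignment))  # total tokens, versus 512 for a uniform 256-token assignment
```

With these made-up numbers the static block gets the smallest budget and the dynamic block the largest. This mirrors the abstract's point that uniform assignment wastes tokens on simple segments while underserving complex ones.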