2026-02-17

Title: Directional Concentration Uncertainty: A representational approach to uncertainty quantification for generative models

Authors: Souradeep Chattopadhyay, Brendan Kennedy, Sai Munikoti, Soumik Sarkar, Karl Pazdernik
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2602.13264
Pdf URL: https://arxiv.org/pdf/2602.13264
Copy Paste: [[2602.13264]] Directional Concentration Uncertainty: A representational approach to uncertainty quantification for generative models(https://arxiv.org/abs/2602.13264)
Keywords: generative
Abstract: In the critical task of making generative models trustworthy and robust, methods for Uncertainty Quantification (UQ) have begun to show encouraging potential. However, many of these methods rely on rigid heuristics that fail to generalize across tasks and modalities. Here, we propose a novel framework for UQ that is highly flexible and approaches or surpasses the performance of prior heuristic methods. We introduce Directional Concentration Uncertainty (DCU), a novel statistical procedure for quantifying the concentration of embeddings based on the von Mises-Fisher (vMF) distribution. Our method captures uncertainty by measuring the geometric dispersion of multiple generated outputs from a language model, using continuous embeddings of the generated outputs without any task-specific heuristics. In our experiments, we show that DCU matches or exceeds calibration levels of prior works like semantic entropy (Kuhn et al., 2023) and also generalizes well to more complex tasks in multi-modal domains. We present a framework for the wider potential of DCU and its implications for integration into UQ for multi-modal and agentic frameworks.
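Illustrative sketch (not taken from the paper): the abstract does not give the DCU formula, so the code below only shows one standard way to measure how concentrated a set of embedding directions is under a vMF model. It uses the well-known closed-form approximation of the concentration parameter from the mean resultant length R in d dimensions, kappa ≈ R(d - R^2)/(1 - R^2) (Banerjee et al., 2005). The function names and the toy embeddings are assumptions made for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit length so it lies on the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def mean_resultant_length(embeddings):
    """Norm of the average of the unit-normalized embeddings (between 0 and 1)."""
    units = [normalize(v) for v in embeddings]
    dim = len(units[0])
    mean = [sum(u[i] for u in units) / len(units) for i in range(dim)]
    return math.sqrt(sum(m * m for m in mean))


def vmf_concentration(embeddings):
    """Approximate vMF concentration: kappa ~ R (d - R^2) / (1 - R^2).

    Assumption: this closed-form estimate stands in for whatever estimator
    the paper actually uses.
    """
    dim = len(embeddings[0])
    r = mean_resultant_length(embeddings)
    if r >= 1.0 - 1e-12:
        return math.inf  # all directions coincide
    return r * (dim - r * r) / (1.0 - r * r)


# Hypothetical embeddings of sampled outputs: one tight cluster, one spread out.
tight = [[1.0, 0.0, 0.0], [0.99, 0.1, 0.0], [0.98, 0.0, 0.1]]
spread = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(vmf_concentration(tight), vmf_concentration(spread))
```

Under this reading, a larger kappa means the sampled outputs point in similar directions (lower uncertainty), and a smaller kappa means they are dispersed (higher uncertainty).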

Title: MFN Decomposition and Related Metrics for High-Resolution Range Profiles Generative Models

Authors: Edwyn Brient (CMM), Santiago Velasco-Forero (CMM), Rami Kassab
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2602.13296
Pdf URL: https://arxiv.org/pdf/2602.13296
Copy Paste: [[2602.13296]] MFN Decomposition and Related Metrics for High-Resolution Range Profiles Generative Models(https://arxiv.org/abs/2602.13296)
Keywords: generation, generative
Abstract: High-resolution range profile (HRRP) data are in vogue in radar automatic target recognition (RATR). With growing interest in classification models using HRRP, filling gaps in datasets using generative models has recently received promising contributions. Evaluating generated data is a challenging topic, even for explicit data like face images. However, the evaluation methods used in the state-of-the-art of HRRP generation rely on classification models. Such models, called "black-box", do not allow either explainability on generated data or multi-level evaluation. This work focuses on decomposing HRRP data into three components: the mask, the features, and the noise. Using this decomposition, we propose two metrics based on the physical interpretation of those data. We take advantage of an expensive dataset to evaluate our metrics on a challenging task and demonstrate their discriminative ability.

Title: Conditional Generative Models for High-Resolution Range Profiles: Capturing Geometry-Driven Trends in a Large-Scale Maritime Dataset

Authors: Edwyn Brient (CMM), Santiago Velasco-Forero (CMM), Rami Kassab
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2602.13297
Pdf URL: https://arxiv.org/pdf/2602.13297
Copy Paste: [[2602.13297]] Conditional Generative Models for High-Resolution Range Profiles: Capturing Geometry-Driven Trends in a Large-Scale Maritime Dataset(https://arxiv.org/abs/2602.13297)
Keywords: generation, generative
Abstract: High-resolution range profiles (HRRPs) enable fast onboard processing for radar automatic target recognition, but their strong sensitivity to acquisition conditions limits robustness across operational scenarios. Conditional HRRP generation can mitigate this issue, yet prior studies are constrained by small, highly specific datasets. We study HRRP synthesis on a large-scale maritime database representative of coastal surveillance variability. Our analysis indicates that the fundamental scenario drivers are geometric: ship dimensions and the desired aspect angle. Conditioning on these variables, we train generative models and show that the synthesized signatures reproduce the expected line-of-sight geometric trend observed in real data. These results highlight the central role of acquisition geometry for robust HRRP generation.

Title: Spectral Collapse in Diffusion Inversion

Authors: Nicolas Bourriez, Alexandre Verine, Auguste Genovesio
Subjects: cs.CV, cs.AI, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2602.13303
Pdf URL: https://arxiv.org/pdf/2602.13303
Copy Paste: [[2602.13303]] Spectral Collapse in Diffusion Inversion(https://arxiv.org/abs/2602.13303)
Keywords: super-resolution, generation
Abstract: Conditional diffusion inversion provides a powerful framework for unpaired image-to-image translation. However, we demonstrate through an extensive analysis that standard deterministic inversion (e.g. DDIM) fails when the source domain is spectrally sparse compared to the target domain (e.g., super-resolution, sketch-to-image). In these contexts, the recovered latent from the input does not follow the expected isotropic Gaussian distribution. Instead it exhibits a signal with lower frequencies, locking target sampling to oversmoothed and texture-poor generations. We term this phenomenon spectral collapse. We observe that stochastic alternatives attempting to restore the noise variance tend to break the semantic link to the input, leading to structural drift. To resolve this structure-texture trade-off, we propose Orthogonal Variance Guidance (OVG), an inference-time method that corrects the ODE dynamics to enforce the theoretical Gaussian noise magnitude within the null-space of the structural gradient. Extensive experiments on microscopy super-resolution (BBBC021) and sketch-to-image (Edges2Shoes) demonstrate that OVG effectively restores photorealistic textures while preserving structural fidelity.

Title: Sim2Radar: Toward Bridging the Radar Sim-to-Real Gap with VLM-Guided Scene Reconstruction

Authors: Emily Bejerano, Federico Tondolo, Aayan Qayyum, Xiaofan Yu, Xiaofan Jiang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2602.13314
Pdf URL: https://arxiv.org/pdf/2602.13314
Copy Paste: [[2602.13314]] Sim2Radar: Toward Bridging the Radar Sim-to-Real Gap with VLM-Guided Scene Reconstruction(https://arxiv.org/abs/2602.13314)
Keywords: generation
Abstract: Millimeter-wave (mmWave) radar provides reliable perception in visually degraded indoor environments (e.g., smoke, dust, and low light), but learning-based radar perception is bottlenecked by the scarcity and cost of collecting and annotating large-scale radar datasets. We present Sim2Radar, an end-to-end framework that synthesizes training radar data directly from single-view RGB images, enabling scalable data generation without manual scene modeling. Sim2Radar reconstructs a material-aware 3D scene by combining monocular depth estimation, segmentation, and vision-language reasoning to infer object materials, then simulates mmWave propagation with a configurable physics-based ray tracer using Fresnel reflection models parameterized by ITU-R electromagnetic properties. Evaluated on real-world indoor scenes, Sim2Radar improves downstream 3D radar perception via transfer learning: pre-training a radar point-cloud object detection model on synthetic data and fine-tuning on real radar yields up to +3.7 3D AP (IoU 0.3), with gains driven primarily by improved spatial localization. These results suggest that physics-based, vision-driven radar simulation can provide effective geometric priors for radar learning and measurably improve performance under limited real-data supervision.

Title: HiST-VLA: A Hierarchical Spatio-Temporal Vision-Language-Action Model for End-to-End Autonomous Driving

Authors: Yiru Wang, Zichong Gu, Yu Gao, Anqing Jiang, Zhigang Sun, Shuo Wang, Yuwen Heng, Hao Sun
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2602.13329
Pdf URL: https://arxiv.org/pdf/2602.13329
Copy Paste: [[2602.13329]] HiST-VLA: A Hierarchical Spatio-Temporal Vision-Language-Action Model for End-to-End Autonomous Driving(https://arxiv.org/abs/2602.13329)
Keywords: generation
Abstract: Vision-Language-Action (VLA) models offer promising capabilities for autonomous driving through multimodal understanding. However, their utilization in safety-critical scenarios is constrained by inherent limitations, including imprecise numerical reasoning, weak 3D spatial awareness, and high sensitivity to context. To address these challenges, we propose HiST-VLA, a novel Hierarchical Spatio-Temporal VLA model designed for reliable trajectory generation. Our framework enhances 3D spatial and temporal reasoning by integrating geometric awareness with fine-grained driving commands and state history prompting. To ensure computational efficiency, we integrate dynamic token sparsification into the VLA architecture. This approach fuses redundant tokens rather than filtering them, effectively reducing redundancy without sacrificing model performance. Furthermore, we employ a hierarchical transformer-based planner to progressively refine coarse VLA waypoints into fine-grained trajectories. Crucially, the planner utilizes dynamic latent regularization to incorporate language commands, ensuring strict spatial grounding and temporal coherence. Extensive evaluation on the NAVSIM v2 benchmark demonstrates state-of-the-art performance on Navtest, achieving an EPDMS of 88.6, and an EPDMS of 50.9 on the pseudo closed-loop Navhard benchmark.

Title: FireRed-Image-Edit-1.0 Technical Report

Authors: Super Intelligence Team: Changhao Qiao, Chao Hui, Chen Li, Cunzheng Wang, Dejia Song, Jiale Zhang, Jing Li, Qiang Xiang, Runqi Wang, Shuang Sun, Wei Zhu, Xu Tang, Yao Hu, Yibo Chen, Yuhao Huang, Yuxuan Duan, Zhiyi Chen, Ziyuan Guo
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2602.13344
Pdf URL: https://arxiv.org/pdf/2602.13344
Copy Paste: [[2602.13344]] FireRed-Image-Edit-1.0 Technical Report(https://arxiv.org/abs/2602.13344)
Keywords: generation
Abstract: We present FireRed-Image-Edit, a diffusion transformer for instruction-based image editing that achieves state-of-the-art performance through systematic optimization of data curation, training methodology, and evaluation design. We construct a 1.6B-sample training corpus, comprising 900M text-to-image and 700M image editing pairs from diverse sources. After rigorous cleaning, stratification, auto-labeling, and two-stage filtering, we retain over 100M high-quality samples balanced between generation and editing, ensuring strong semantic coverage and instruction alignment. Our multi-stage training pipeline progressively builds editing capability via pre-training, supervised fine-tuning, and reinforcement learning. To improve data efficiency, we introduce a Multi-Condition Aware Bucket Sampler for variable-resolution batching and Stochastic Instruction Alignment with dynamic prompt re-indexing. To stabilize optimization and enhance controllability, we propose Asymmetric Gradient Optimization for DPO, DiffusionNFT with layout-aware OCR rewards for text editing, and a differentiable Consistency Loss for identity preservation. We further establish REDEdit-Bench, a comprehensive benchmark spanning 15 editing categories, including newly introduced beautification and low-level enhancement tasks. Extensive experiments on REDEdit-Bench and public benchmarks (ImgEdit and GEdit) demonstrate competitive or superior performance against both open-source and proprietary systems. We release code, models, and the benchmark suite to support future research.

Title: Visual Foresight for Robotic Stow: A Diffusion-Based World Model from Sparse Snapshots

Authors: Lijun Zhang, Nikhil Chacko, Petter Nilsson, Ruinian Xu, Shantanu Thakar, Bai Lou, Harpreet Sawhney, Zhebin Zhang, Mudit Agrawal, Bhavana Chandrashekhar, Aaron Parness
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2602.13347
Pdf URL: https://arxiv.org/pdf/2602.13347
Copy Paste: [[2602.13347]] Visual Foresight for Robotic Stow: A Diffusion-Based World Model from Sparse Snapshots(https://arxiv.org/abs/2602.13347)
Keywords: quality assessment
Abstract: Automated warehouses execute millions of stow operations, where robots place objects into storage bins. For these systems it is valuable to anticipate how a bin will look from the current observations and the planned stow behavior before real execution. We propose FOREST, a stow-intent-conditioned world model that represents bin states as item-aligned instance masks and uses a latent diffusion transformer to predict the post-stow configuration from the observed context. Our evaluation shows that FOREST substantially improves the geometric agreement between predicted and true post-stow layouts compared with heuristic baselines. We further evaluate the predicted post-stow layouts in two downstream tasks, in which replacing the real post-stow masks with FOREST predictions causes only modest performance loss in load-quality assessment and multi-stow reasoning, indicating that our model can provide useful foresight signals for warehouse planning.

Title: AdaCorrection: Adaptive Offset Cache Correction for Accurate Diffusion Transformers

Authors: Dong Liu, Yanxuan Yu, Ben Lengerich, Ying Nian Wu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2602.13357
Pdf URL: https://arxiv.org/pdf/2602.13357
Copy Paste: [[2602.13357]] AdaCorrection: Adaptive Offset Cache Correction for Accurate Diffusion Transformers(https://arxiv.org/abs/2602.13357)
Keywords: generation
Abstract: Diffusion Transformers (DiTs) achieve state-of-the-art performance in high-fidelity image and video generation but suffer from expensive inference due to their iterative denoising structure. While prior methods accelerate sampling by caching intermediate features, they rely on static reuse schedules or coarse-grained heuristics, which often lead to temporal drift and cache misalignment that significantly degrade generation quality. We introduce AdaCorrection, an adaptive offset cache correction framework that maintains high generation fidelity while enabling efficient cache reuse across Transformer layers during diffusion inference. At each timestep, AdaCorrection estimates cache validity with lightweight spatio-temporal signals and adaptively blends cached and fresh activations. This correction is computed on-the-fly without additional supervision or retraining. Our approach achieves strong generation quality with minimal computational overhead, maintaining near-original FID while providing moderate acceleration. Experiments on image and video diffusion benchmarks show that AdaCorrection consistently improves generation performance.

Title: The Diffusion Duet: Harmonizing Dual Channels with Wavelet Suppression for Image Separation

Authors: Jingwei Li, Wei Pu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.13361
Pdf URL: https://arxiv.org/pdf/2602.13361
Copy Paste: [[2602.13361]] The Diffusion Duet: Harmonizing Dual Channels with Wavelet Suppression for Image Separation(https://arxiv.org/abs/2602.13361)
Keywords: restoration, generative
Abstract: Blind image separation (BIS) refers to the inverse problem of simultaneously estimating and restoring multiple independent source images from a single observation image under conditions of unknown mixing mode and without prior knowledge of the source images. Traditional methods relying on statistical independence assumptions or CNN/GAN variants struggle to characterize complex feature distributions in real scenes, leading to estimation bias, texture distortion, and artifact residue under strong noise and nonlinear mixing. This paper innovatively introduces diffusion models into dual-channel BIS, proposing an efficient Dual-Channel Diffusion Separation Model (DCDSM). DCDSM leverages diffusion models' powerful generative capability to learn source image feature distributions and reconstruct feature structures effectively. A novel Wavelet Suppression Module (WSM) is designed within the dual-branch reverse denoising process, forming an interactive separation network that enhances detail separation by exploiting the mutual coupling noise characteristic between source images. Extensive experiments on synthetic datasets containing rain/snow and complex mixtures demonstrate that DCDSM achieves state-of-the-art performance: 1) In image restoration tasks, it obtains PSNR/SSIM values of 35.0023 dB/0.9549 and 29.8108 dB/0.9243 for rain and snow removal respectively, outperforming Histoformer and LDRCNet by 1.2570 dB/0.9272 dB (PSNR) and 0.0262/0.0289 (SSIM) on average; 2) For complex mixture separation, the restored dual-source images achieve average PSNR and SSIM of 25.0049 dB and 0.7997, surpassing comparative methods by 4.1249 dB and 0.0926. Both subjective and objective evaluations confirm DCDSM's superiority in addressing rain/snow residue removal and detail preservation challenges.

Title: An Online Reference-Free Evaluation Framework for Flowchart Image-to-Code Generation

Authors: Giang Son Nguyen, Zi Pong Lim, Sarthak Ketanbhai Modi, Yon Shin Teo, Wenya Wang
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2602.13376
Pdf URL: https://arxiv.org/pdf/2602.13376
Copy Paste: [[2602.13376]] An Online Reference-Free Evaluation Framework for Flowchart Image-to-Code Generation(https://arxiv.org/abs/2602.13376)
Keywords: generation
Abstract: Vision-Language Models (VLMs) are increasingly used in document processing pipelines to convert flowchart images into structured code (e.g., Mermaid). In production, these systems process arbitrary inputs for which no ground-truth code exists, making output quality difficult to assess. We propose a reference-free evaluation framework that monitors flowchart image-to-code generation quality at inference time, using only the input image and the generated output. The framework introduces two automated metrics: $\text{Recall}_{\text{OCR}}$, which estimates content coverage by extracting text from the input image via OCR as a proxy reference, and $\text{Precision}_{\text{VE}}$, which detects hallucinated elements through Visual Entailment against the original image. Their harmonic mean, $\text{F1}_{\text{OCR-VE}}$, provides a unified quality score. Validation on the FlowVQA dataset shows strong agreement with ground-truth metrics (average Pearson's $r = 0.97$, $0.91$, and $0.94$ for Recall, Precision, and F1, respectively), confirming the framework's reliability as a practical, reference-free alternative for continuous quality monitoring in production settings.
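Illustrative sketch (not taken from the paper): the unified score described above is the harmonic mean of the OCR-based recall and the visual-entailment-based precision. The function name and the example scores below are hypothetical; how the two inputs are computed from OCR and visual entailment is not shown here.

```python
def harmonic_mean_f1(recall_ocr, precision_ve):
    """Harmonic mean of two scores in [0, 1]; returns 0.0 when both are zero."""
    if recall_ocr < 0 or precision_ve < 0:
        raise ValueError("scores must be non-negative")
    total = recall_ocr + precision_ve
    if total == 0:
        return 0.0
    return 2.0 * recall_ocr * precision_ve / total


# Hypothetical scores for one generated flowchart.
print(harmonic_mean_f1(0.8, 0.6))
```

Because the harmonic mean is dominated by the smaller input, a generation that covers all the text but hallucinates many elements (or the reverse) still receives a low combined score.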

Title: High-Resolution Climate Projections Using Diffusion-Based Downscaling of a Lightweight Climate Emulator

Authors: Haiwen Guan, Moein Darman, Dibyajyoti Chakraborty, Troy Arcomano, Ashesh Chattopadhyay, Romit Maulik
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2602.13416
Pdf URL: https://arxiv.org/pdf/2602.13416
Copy Paste: [[2602.13416]] High-Resolution Climate Projections Using Diffusion-Based Downscaling of a Lightweight Climate Emulator(https://arxiv.org/abs/2602.13416)
Keywords: generative
Abstract: The proliferation of data-driven models in weather and climate sciences has marked a significant paradigm shift, with advanced models demonstrating exceptional skill in medium-range forecasting. However, these models are often limited by long-term instabilities, climatological drift, and substantial computational costs during training and inference, restricting their broader application for climate studies. Addressing these limitations, Guan et al. (2024) introduced LUCIE, a lightweight, physically consistent climate emulator utilizing a Spherical Fourier Neural Operator (SFNO) architecture. This model is able to reproduce accurate long-term statistics including climatological mean and seasonal variability. However, LUCIE's native resolution (~300 km) is inadequate for detailed regional impact assessments. To overcome this limitation, we introduce a deep learning-based downscaling framework, leveraging probabilistic diffusion-based generative models with conditional and posterior sampling frameworks. These models downscale coarse LUCIE outputs to 25 km resolution. They are trained on approximately 14,000 ERA5 timesteps spanning 2000-2009 and evaluated on LUCIE predictions from 2010 to 2020. Model performance is assessed through diverse metrics, including latitude-averaged RMSE, power spectrum, probability density functions and First Empirical Orthogonal Function of the zonal wind. We observe that the proposed approach is able to preserve the coarse-grained dynamics from LUCIE while generating fine-scaled climatological statistics at ~28 km resolution.

Title: Text Has Curvature

Authors: Karish Grover, Hanqing Zeng, Yinglong Xia, Christos Faloutsos, Geoffrey J. Gordon
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2602.13418
Pdf URL: https://arxiv.org/pdf/2602.13418
Copy Paste: [[2602.13418]] Text Has Curvature(https://arxiv.org/abs/2602.13418)
Keywords: generation
Abstract: Does text have an intrinsic curvature? Language is increasingly modeled in curved geometries - hyperbolic spaces for hierarchy, mixed-curvature manifolds for compositional structure - yet a basic scientific question remains unresolved: what does curvature mean for text itself, in a way that is native to language rather than an artifact of the embedding space we choose? We argue that text does indeed have curvature, and show how to detect it, define it, and use it. To this end, we propose Texture, a text-native, word-level discrete curvature signal, and make three contributions. (a) Existence: We provide empirical and theoretical certificates that semantic inference in natural corpora is non-flat, i.e. language has inherent curvature. (b) Definition: We define Texture by reconciling left- and right-context beliefs around a masked word through a Schrodinger bridge, yielding a curvature field that is positive where context focuses meaning and negative where it fans out into competing continuations. (c) Utility: Texture is actionable: it serves as a general-purpose measurement and control primitive enabling geometry without geometric training; we instantiate it on two representative tasks, improving long-context inference through curvature-guided compression and retrieval-augmented generation through curvature-guided routing. Together, our results establish a text-native curvature paradigm, making curvature measurable and practically useful.

Title: SpargeAttention2: Trainable Sparse Attention via Hybrid Top-k+Top-p Masking and Distillation Fine-Tuning

Authors: Jintao Zhang, Kai Jiang, Chendong Xiang, Weiqi Feng, Yuezhou Hu, Haocheng Xi, Jianfei Chen, Jun Zhu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2602.13515
Pdf URL: https://arxiv.org/pdf/2602.13515
Copy Paste: [[2602.13515]] SpargeAttention2: Trainable Sparse Attention via Hybrid Top-k+Top-p Masking and Distillation Fine-Tuning(https://arxiv.org/abs/2602.13515)
Keywords: generation
Abstract: Many training-free sparse attention methods are effective for accelerating diffusion models. Recently, several works suggest that making sparse attention trainable can further increase sparsity while preserving generation quality. We study three key questions: (1) when do the two common masking rules, i.e., Top-k and Top-p, fail, and how can we avoid these failures? (2) why can trainable sparse attention reach higher sparsity than training-free methods? (3) what are the limitations of fine-tuning sparse attention using the diffusion loss, and how can we address them? Based on this analysis, we propose SpargeAttention2, a trainable sparse attention method that achieves high sparsity without degrading generation quality. SpargeAttention2 includes (i) a hybrid masking rule that combines Top-k and Top-p for more robust masking at high sparsity, (ii) an efficient trainable sparse attention implementation, and (iii) a distillation-inspired fine-tuning objective to better preserve generation quality during fine-tuning using sparse attention. Experiments on video diffusion models show that SpargeAttention2 reaches 95% attention sparsity and a 16.2x attention speedup while maintaining generation quality, consistently outperforming prior sparse attention methods.
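Illustrative sketch (not taken from the paper): the abstract says the hybrid rule combines Top-k and Top-p but does not say how. The code below assumes the simplest combination, keeping a key if either rule would keep it (a union), applied to one row of attention weights. The function names, the union rule, and the example weights are all assumptions.

```python
def top_k_indices(weights, k):
    """Indices of the k largest weights."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return set(order[:k])


def top_p_indices(weights, p):
    """Smallest set of largest weights whose normalized mass reaches p."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have positive total mass")
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    kept, mass = set(), 0.0
    for i in order:
        kept.add(i)
        mass += weights[i] / total
        if mass >= p:
            break
    return kept


def hybrid_mask(weights, k, p):
    """Boolean keep-mask: True where Top-k OR Top-p keeps the entry (assumed union)."""
    keep = top_k_indices(weights, k) | top_p_indices(weights, p)
    return [i in keep for i in range(len(weights))]


# Peaked row: Top-p alone would keep a single entry; Top-k adds a second one.
print(hybrid_mask([0.7, 0.1, 0.1, 0.05, 0.05], k=2, p=0.5))
# Flat row: Top-k alone would keep a single entry; Top-p adds a second one.
print(hybrid_mask([0.25, 0.25, 0.25, 0.25], k=1, p=0.5))
```

The two example rows show why a combination can be more robust than either rule alone under this assumption: each rule covers the case where the other keeps too little.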

Title: Diff-Aid: Inference-time Adaptive Interaction Denoising for Rectified Text-to-Image Generation

Authors: Binglei Li, Mengping Yang, Zhiyu Tan, Junping Zhang, Hao Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.13585
Pdf URL: https://arxiv.org/pdf/2602.13585
Copy Paste: [[2602.13585]] Diff-Aid: Inference-time Adaptive Interaction Denoising for Rectified Text-to-Image Generation(https://arxiv.org/abs/2602.13585)
Keywords: generation
Abstract: Recent text-to-image (T2I) diffusion models have achieved remarkable advancement, yet faithfully following complex textual descriptions remains challenging due to insufficient interactions between textual and visual features. Prior approaches enhance such interactions via architectural design or handcrafted textual condition weighting, but lack flexibility and overlook the dynamic interactions across different blocks and denoising stages. To provide a more flexible and efficient solution to this problem, we propose Diff-Aid, a lightweight inference-time method that adaptively adjusts per-token text and image interactions across transformer blocks and denoising timesteps. Beyond improving generation quality, Diff-Aid yields interpretable modulation patterns that reveal how different blocks, timesteps, and textual tokens contribute to semantic alignment during denoising. As a plug-and-play module, Diff-Aid can be seamlessly integrated into downstream applications for further improvement, including style LoRAs, controllable generation, and zero-shot editing. Experiments on strong baselines (SD 3.5 and FLUX) demonstrate consistent improvements in prompt adherence, visual quality, and human preference across various metrics. Our code and models will be released.

Title: AdaVBoost: Mitigating Hallucinations in LVLMs via Token-Level Adaptive Visual Attention Boosting

Authors: Jiacheng Zhang, Feng Liu, Chao Du, Tianyu Pang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.13600
Pdf URL: https://arxiv.org/pdf/2602.13600
Copy Paste: [[2602.13600]] AdaVBoost: Mitigating Hallucinations in LVLMs via Token-Level Adaptive Visual Attention Boosting(https://arxiv.org/abs/2602.13600)
Keywords: generation
Abstract: Visual attention boosting has emerged as a promising direction for mitigating hallucinations in Large Vision-Language Models (LVLMs), where existing methods primarily focus on where to boost by applying a predefined scaling to the attention of method-specific visual tokens during autoregressive generation. In this paper, we identify a fundamental trade-off in these methods: a predefined scaling factor can be too weak at some generation steps, leaving hallucinations unresolved, yet too strong at others, leading to new hallucinations. Motivated by this finding, we propose AdaVBoost, a token-level visual attention boosting framework that adaptively determines how much attention to boost at each generation step. Specifically, we introduce Visual Grounding Entropy (VGE) to estimate hallucination risk, which leverages visual grounding as a complementary signal to capture evidence mismatches beyond entropy. Guided by VGE, AdaVBoost applies stronger visual attention boosting to high-risk tokens and weaker boosting to low-risk tokens, enabling token-level adaptive intervention at each generation step. Extensive experiments show that AdaVBoost significantly outperforms baseline methods across multiple LVLMs and hallucination benchmarks.

Title: DCDM: Divide-and-Conquer Diffusion Models for Consistency-Preserving Video Generation

Authors: Haoyu Zhao, Yuang Zhang, Junqi Cheng, Jiaxi Gu, Zenghui Lu, Peng Shu, Zuxuan Wu, Yu-Gang Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.13637
Pdf URL: https://arxiv.org/pdf/2602.13637
Copy Paste: [[2602.13637]] DCDM: Divide-and-Conquer Diffusion Models for Consistency-Preserving Video Generation(https://arxiv.org/abs/2602.13637)
Keywords: generation, generative
Abstract: Recent video generative models have demonstrated impressive visual fidelity, yet they often struggle with semantic, geometric, and identity consistency. In this paper, we propose a system-level framework, termed the Divide-and-Conquer Diffusion Model (DCDM), to address three key challenges: (1) intra-clip world knowledge consistency, (2) inter-clip camera consistency, and (3) inter-shot element consistency. DCDM decomposes video consistency modeling under these scenarios into three dedicated components while sharing a unified video generation backbone. For intra-clip consistency, DCDM leverages a large language model to parse input prompts into structured semantic representations, which are subsequently translated into coherent video content by a diffusion transformer. For inter-clip camera consistency, we propose a temporal camera representation in the noise space that enables precise and stable camera motion control, along with a text-to-image initialization mechanism to further enhance controllability. For inter-shot consistency, DCDM adopts a holistic scene generation paradigm with windowed cross-attention and sparse inter-shot self-attention, ensuring long-range narrative coherence while maintaining computational efficiency. We validate our framework on the test set of the CVM Competition at AAAI'26, and the results demonstrate that the proposed strategies effectively address these challenges.

Title: EchoTorrent: Towards Swift, Sustained, and Streaming Multi-Modal Video Generation

Authors: Rang Meng, Weipeng Wu, Yingjie Yin, Yuming Li, Chenguang Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.13669
Pdf URL: https://arxiv.org/pdf/2602.13669
Copy Paste: [[2602.13669]] EchoTorrent: Towards Swift, Sustained, and Streaming Multi-Modal Video Generation(https://arxiv.org/abs/2602.13669)
Keywords: generation
Abstract: Recent multi-modal video generation models have achieved high visual quality, but their prohibitive latency and limited temporal stability hinder real-time deployment. Streaming inference exacerbates these issues, leading to pronounced multimodal degradation, such as spatial blurring, temporal drift, and lip desynchronization, which creates an unresolved efficiency-performance trade-off. To this end, we propose EchoTorrent, a novel schema with a fourfold design: (1) Multi-Teacher Training, which fine-tunes a pre-trained model on distinct preference domains to obtain specialized domain experts that sequentially transfer domain-specific knowledge to a student model; (2) Adaptive CFG Calibration (ACC-DMD), which calibrates the audio CFG augmentation errors in DMD via a phased spatiotemporal schedule, eliminating redundant CFG computations and enabling single-pass inference per step; (3) Hybrid Long Tail Forcing, which enforces alignment exclusively on tail frames during long-horizon self-rollout training via a causal-bidirectional hybrid architecture, effectively mitigating spatiotemporal degradation in streaming mode while enhancing fidelity to reference frames; and (4) VAE Decoder Refiner, which applies pixel-domain optimization to the VAE decoder to recover high-frequency details while circumventing latent-space ambiguities. Extensive experiments and analysis demonstrate that EchoTorrent achieves few-pass autoregressive generation with substantially extended temporal consistency, identity preservation, and audio-lip synchronization.

Title: A WDLoRA-Based Multimodal Generative Framework for Clinically Guided Corneal Confocal Microscopy Image Synthesis in Diabetic Neuropathy

Authors: Xin Zhang, Liangxiu Han, Yue Shi, Yalin Zheng, Uazman Alam, Maryam Ferdousi, Rayaz Malik
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.13693
Pdf URL: https://arxiv.org/pdf/2602.13693
Copy Paste: [[2602.13693]] A WDLoRA-Based Multimodal Generative Framework for Clinically Guided Corneal Confocal Microscopy Image Synthesis in Diabetic Neuropathy(https://arxiv.org/abs/2602.13693)
Keywords: generative
Abstract: Corneal Confocal Microscopy (CCM) is a sensitive tool for assessing small-fiber damage in Diabetic Peripheral Neuropathy (DPN), yet the development of robust, automated deep learning-based diagnostic models is limited by scarce labelled data and fine-grained variability in corneal nerve morphology. Although Artificial Intelligence (AI)-driven foundation generative models excel at natural image synthesis, they often struggle in medical imaging due to limited domain-specific training, compromising the anatomical fidelity required for clinical analysis. To overcome these limitations, we propose a Weight-Decomposed Low-Rank Adaptation (WDLoRA)-based multimodal generative framework for clinically guided CCM image synthesis. WDLoRA is a parameter-efficient fine-tuning (PEFT) mechanism that decouples magnitude and directional weight updates, enabling foundation generative models to independently learn the orientation (nerve topology) and intensity (stromal contrast) required for medical realism. By jointly conditioning on nerve segmentation masks and disease-specific clinical prompts, the model synthesises anatomically coherent images across the DPN spectrum (Control, T1NoDPN, T1DPN). A comprehensive three-pillar evaluation demonstrates that the proposed framework achieves state-of-the-art visual fidelity (Fréchet Inception Distance (FID): 5.18) and structural integrity (Structural Similarity Index Measure (SSIM): 0.630), significantly outperforming GAN and standard diffusion baselines. Crucially, the synthetic images preserve gold-standard clinical biomarkers and are statistically equivalent to real patient data. When used to train automated diagnostic models, the synthetic dataset improves downstream diagnostic accuracy by 2.1% and segmentation performance by 2.2%, validating the framework's potential to alleviate data bottlenecks in medical AI.
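The decoupling of magnitude and direction described in the abstract above can be written out as a short formula. This is a sketch that assumes WDLoRA follows the usual weight-decomposed low-rank adaptation (DoRA-style) parameterization; the symbols below are illustrative and do not come from the abstract.

```latex
% Assumed DoRA-style decomposition (notation is illustrative):
%   W_0 \in \mathbb{R}^{d \times k}   frozen pretrained weight
%   B \in \mathbb{R}^{d \times r},\ A \in \mathbb{R}^{r \times k}   trainable low-rank factors, r \ll \min(d, k)
%   m \in \mathbb{R}^{1 \times k}     trainable magnitude vector, initialised to \lVert W_0 \rVert_c
%   \lVert \cdot \rVert_c             column-wise vector norm
W' = m \odot \frac{W_0 + BA}{\lVert W_0 + BA \rVert_c}
```

Under this reading, the normalized term carries the direction (updated through the low-rank product BA) and m carries the magnitude, which would match the abstract's pairing of orientation with nerve topology and intensity with stromal contrast.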

Title: Attention Head Entropy of LLMs Predicts Answer Correctness

Authors: Sophie Ostmeier, Brian Axelrod, Maya Varma, Asad Aali, Yabin Zhang, Magdalini Paschali, Sanmi Koyejo, Curtis Langlotz, Akshay Chaudhari
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2602.13699
Pdf URL: https://arxiv.org/pdf/2602.13699
Copy Paste: [[2602.13699]] Attention Head Entropy of LLMs Predicts Answer Correctness(https://arxiv.org/abs/2602.13699)
Keywords: generation
Abstract: Large language models (LLMs) often generate plausible yet incorrect answers, posing risks in safety-critical settings such as medicine. Human evaluation is expensive, and LLM-as-judge approaches risk introducing hidden errors. Recent white-box methods detect contextual hallucinations using model internals, focusing on the localization of the attention mass, but two questions remain open: do these approaches extend to predicting answer correctness, and do they generalize out of domain? We introduce Head Entropy, a method that predicts answer correctness from attention entropy patterns, specifically measuring the spread of the attention mass. Using sparse logistic regression on per-head 2-Rényi entropies, Head Entropy matches or exceeds baselines in-distribution and generalizes substantially better out of domain, outperforming the closest baseline by +8.5% AUROC on average. We further show that attention patterns over the question/context alone, before answer generation, already carry predictive signal: there, Head Entropy gains +17.7% AUROC on average over the closest baseline. We evaluate across 5 instruction-tuned LLMs and 3 QA datasets spanning general knowledge, multi-hop reasoning, and medicine.
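The per-head feature named in the abstract above, the order-2 Rényi entropy of an attention distribution, can be sketched as follows. This is an illustrative example only: the function names, the `(layers, heads, queries, keys)` tensor layout, and the choice to average over query positions are assumptions, not details from the paper.

```python
import numpy as np

def renyi2_entropy(attn, eps=1e-12):
    """Order-2 Renyi entropy H_2(p) = -log(sum_i p_i^2) along the last axis."""
    attn = np.asarray(attn, dtype=float)
    attn = attn / attn.sum(axis=-1, keepdims=True)   # renormalise defensively
    return -np.log(np.sum(attn ** 2, axis=-1) + eps)

def head_entropy_features(attn_weights):
    """Map attention of shape (layers, heads, queries, keys) to one feature per head.

    Assumption: entropies are averaged over query positions.
    """
    h = renyi2_entropy(attn_weights)                 # (layers, heads, queries)
    return h.mean(axis=-1).reshape(-1)               # (layers * heads,)
```

A feature vector of this kind could then be passed to an L1-regularized logistic regression, for example scikit-learn's `LogisticRegression(penalty="l1", solver="liblinear")`, to obtain a sparse set of predictive heads.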

Title: HBVLA: Pushing 1-Bit Post-Training Quantization for Vision-Language-Action Models

Authors: Xin Yan, Zhenglin Wan, Feiyang Ye, Xingrui Yu, Hangyu Du, Yang You, Ivor Tsang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2602.13710
Pdf URL: https://arxiv.org/pdf/2602.13710
Copy Paste: [[2602.13710]] HBVLA: Pushing 1-Bit Post-Training Quantization for Vision-Language-Action Models(https://arxiv.org/abs/2602.13710)
Keywords: generation
Abstract: Vision-Language-Action (VLA) models enable instruction-following embodied control, but their large compute and memory footprints hinder deployment on resource-constrained robots and edge platforms. While reducing weights to 1-bit precision through binarization can greatly improve efficiency, existing methods fail to narrow the distribution gap between binarized and full-precision weights, causing quantization errors to accumulate under long-horizon closed-loop execution and severely degrade actions. To fill this gap, we propose HBVLA, a VLA-tailored binarization framework. First, we use a policy-aware enhanced Hessian to identify weights that are truly critical for action generation. Then, we employ a sparse orthogonal transform for non-salient weights to induce a low-entropy intermediate state. Finally, we quantize both salient and non-salient weights in the Haar domain with group-wise 1-bit quantization. We have evaluated our approach on different VLAs: on LIBERO, quantized OpenVLA-OFT retains 92.2% of full-precision performance; on SimplerEnv, quantized CogAct retains 93.6%, significantly outperforming state-of-the-art binarization methods. We further validate our method on a real-world evaluation suite, and the results show that HBVLA incurs only marginal success-rate degradation compared to the full-precision model, demonstrating robust deployability under tight hardware constraints. Our work provides a practical foundation for ultra-low-bit quantization of VLAs, enabling more reliable deployment on hardware-limited robotic platforms.
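The last step in the abstract above, group-wise 1-bit quantization in a Haar-type transform domain, can be illustrated with a toy sketch. It is not the HBVLA pipeline: it omits the Hessian-based saliency and the sparse orthogonal transform, assumes a single-level orthonormal Haar transform on a 1-D weight vector, and uses the common "sign times mean absolute value" binarizer. All function names are hypothetical.

```python
import numpy as np

def haar_forward(x):
    """One-level orthonormal Haar transform of an even-length 1-D vector."""
    x = np.asarray(x, dtype=float)
    a = (x[0::2] + x[1::2]) / np.sqrt(2.0)   # pairwise averages (low-pass)
    d = (x[0::2] - x[1::2]) / np.sqrt(2.0)   # pairwise differences (high-pass)
    return np.concatenate([a, d])

def haar_inverse(c):
    """Inverse of haar_forward."""
    c = np.asarray(c, dtype=float)
    half = c.size // 2
    a, d = c[:half], c[half:]
    x = np.empty(c.size)
    x[0::2] = (a + d) / np.sqrt(2.0)
    x[1::2] = (a - d) / np.sqrt(2.0)
    return x

def binarize_groups(c, group_size):
    """Group-wise 1-bit quantization: each group becomes alpha * sign(c),
    with alpha the mean absolute value of that group.
    The length of c must be divisible by group_size."""
    c = np.asarray(c, dtype=float)
    g = c.reshape(-1, group_size)
    alpha = np.abs(g).mean(axis=1, keepdims=True)
    signs = np.where(g >= 0, 1.0, -1.0)
    return (alpha * signs).reshape(c.shape)
```

A dequantized weight vector would then be `haar_inverse(binarize_groups(haar_forward(w), group_size))`; only one sign bit per coefficient and one scale per group need to be stored.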

Title: Generative Latent Representations of 3D Brain MRI for Multi-Task Downstream Analysis in Down Syndrome

Authors: Jordi Malé, Juan Fortea, Mateus Rozalem-Aranha, Neus Martínez-Abadías, Xavier Sevillano
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.13731
Pdf URL: https://arxiv.org/pdf/2602.13731
Copy Paste: [[2602.13731]] Generative Latent Representations of 3D Brain MRI for Multi-Task Downstream Analysis in Down Syndrome(https://arxiv.org/abs/2602.13731)
Keywords: generation, generative
Abstract: Generative models have emerged as powerful tools in medical imaging, enabling tasks such as segmentation, anomaly detection, and high-quality synthetic data generation. These models typically rely on learning meaningful latent representations, which are particularly valuable given the high-dimensional nature of 3D medical images like brain magnetic resonance imaging (MRI) scans. Despite their potential, latent representations remain underexplored in terms of their structure, information content, and applicability to downstream clinical tasks. Investigating these representations is crucial for advancing the use of generative models in neuroimaging research and clinical decision-making. In this work, we develop multiple variational autoencoders (VAEs) to encode 3D brain MRI scans into compact latent space representations for generative and predictive applications. We systematically evaluate the effectiveness of the learned representations through three key analyses: (i) a quantitative and qualitative assessment of MRI reconstruction quality, (ii) a visualisation of the latent space structure using Principal Component Analysis, and (iii) downstream classification tasks on a proprietary dataset of brain MRI scans from euploid individuals and individuals with Down syndrome. Our results demonstrate that the VAE successfully captures essential brain features while maintaining high reconstruction fidelity. The latent space exhibits clear clustering patterns, particularly in distinguishing individuals with Down syndrome from euploid controls.

Title: Data-driven Bi-level Optimization of Thermal Power Systems with embedded Artificial Neural Networks

Authors: Talha Ansar, Muhammad Mujtaba Abbas, Ramit Debnath, Vivek Dua, Waqar Muhammad Ashraf
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2602.13746
Pdf URL: https://arxiv.org/pdf/2602.13746
Copy Paste: [[2602.13746]] Data-driven Bi-level Optimization of Thermal Power Systems with embedded Artificial Neural Networks(https://arxiv.org/abs/2602.13746)
Keywords: generation
Abstract: Industrial thermal power systems have coupled performance variables with a hierarchical order of importance, making their simultaneous optimization computationally challenging or infeasible. This barrier limits the integrated and computationally scalable operation optimization of industrial thermal power systems. To address this issue for large-scale engineering systems, we present a fully machine learning-powered bi-level optimization framework for data-driven optimization of industrial thermal power systems. The objective functions of the upper and lower levels are approximated by artificial neural network (ANN) models, and the lower-level problem is analytically embedded through Karush-Kuhn-Tucker (KKT) optimality conditions. The reformulated single-level optimization framework integrating ANN models and KKT constraints (ANN-KKT) is validated on benchmark problems and on the real-world power generation operation of a 660 MW coal power plant and a 395 MW gas turbine system. The results show that the solutions obtained from the proposed ANN-KKT framework are comparable to the bi-level solutions of the benchmark problems. Optimal solutions are computed in marginal time (0.22 to 0.88 s) and yield power outputs of 583 MW (coal) and 402 MW (gas turbine) at optimal turbine heat rates of 7337 kJ/kWh and 7542 kJ/kWh, respectively. In addition, the method extends to delineating a feasible and robust operating envelope that accounts for uncertainty in operating variables while maximizing thermal efficiency in various scenarios. These results demonstrate that ANN-KKT offers a scalable and computationally efficient route for hierarchical, data-driven optimization of industrial thermal power systems, achieving energy-efficient operations of large-scale engineering systems and contributing to Industry 5.0.
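The KKT embedding described in the abstract above follows a standard pattern, which can be sketched in generic notation. The symbols F, f, G, g, x, y, and λ are illustrative and not taken from the paper; in the ANN-KKT setting, F and f would be the ANN surrogates of the upper- and lower-level objectives. The reformulation is equivalent to the bi-level problem only under assumptions such as convexity of the lower level in y and a suitable constraint qualification.

```latex
% Generic bi-level problem (illustrative notation)
\min_{x,\,y} \; F(x, y)
\quad \text{s.t.} \quad G(x, y) \le 0, \qquad
y \in \arg\min_{y'} \bigl\{ f(x, y') : g(x, y') \le 0 \bigr\}

% Single-level reformulation: the lower level is replaced by its KKT conditions
\min_{x,\,y,\,\lambda} \; F(x, y)
\quad \text{s.t.} \quad
\begin{aligned}
  & G(x, y) \le 0, \\
  & \nabla_y f(x, y) + \textstyle\sum_i \lambda_i \nabla_y g_i(x, y) = 0
      && \text{(stationarity)} \\
  & g(x, y) \le 0, \quad \lambda \ge 0
      && \text{(primal and dual feasibility)} \\
  & \lambda_i \, g_i(x, y) = 0 \quad \forall i
      && \text{(complementary slackness)}
\end{aligned}
```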

Title: T2MBench: A Benchmark for Out-of-Distribution Text-to-Motion Generation

Authors: Bin Yang, Rong Ou, Weisheng Xu, Jiaqi Xiong, Xintao Li, Taowen Wang, Luyu Zhu, Xu Jiang, Jing Tan, Renjing Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.13751
Pdf URL: https://arxiv.org/pdf/2602.13751
Copy Paste: [[2602.13751]] T2MBench: A Benchmark for Out-of-Distribution Text-to-Motion Generation(https://arxiv.org/abs/2602.13751)
Keywords: generation
Abstract: Most existing evaluations of text-to-motion generation focus on in-distribution textual inputs and a limited set of evaluation criteria, which restricts their ability to systematically assess model generalization and motion generation capabilities under complex out-of-distribution (OOD) textual conditions. To address this limitation, we propose a benchmark specifically designed for OOD text-to-motion evaluation, which includes a comprehensive analysis of 14 representative baseline models and the two datasets derived from evaluation results. Specifically, we construct an OOD prompt dataset consisting of 1,025 textual descriptions. Based on this prompt dataset, we introduce a unified evaluation framework that integrates LLM-based Evaluation, Multi-factor Motion Evaluation, and Fine-grained Accuracy Evaluation. Our experimental results reveal that while different baseline models demonstrate strengths in areas such as text-to-motion semantic alignment, motion generalizability, and physical quality, most models struggle to achieve strong performance on Fine-grained Accuracy Evaluation. These findings highlight the limitations of existing methods in OOD scenarios and offer practical guidance for the design and evaluation of future production-level text-to-motion models.

Title: Skeleton2Stage: Reward-Guided Fine-Tuning for Physically Plausible Dance Generation

Authors: Jidong Jia, Youjian Zhang, Huan Fu, Dacheng Tao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.13778
Pdf URL: https://arxiv.org/pdf/2602.13778
Copy Paste: [[2602.13778]] Skeleton2Stage: Reward-Guided Fine-Tuning for Physically Plausible Dance Generation(https://arxiv.org/abs/2602.13778)
Keywords: generation
Abstract: Despite advances in dance generation, most methods are trained in the skeletal domain and ignore mesh-level physical constraints. As a result, motions that look plausible as joint trajectories often exhibit body self-penetration and Foot-Ground Contact (FGC) anomalies when visualized with a human body mesh, reducing the aesthetic appeal of generated dances and limiting their real-world applications. We address this skeleton-to-mesh gap by deriving physics-based rewards from the body mesh and applying Reinforcement Learning Fine-Tuning (RLFT) to steer the diffusion model toward physically plausible motion synthesis under mesh visualization. Our reward design combines (i) an imitation reward that measures a motion's general plausibility by its imitability in a physical simulator (penalizing penetration and foot skating), and (ii) a Foot-Ground Deviation (FGD) reward with test-time FGD guidance to better capture the dynamic foot-ground interaction in dance. However, we find that the physics-based rewards tend to push the model to generate freezing motions for fewer physical anomalies and better imitability. To mitigate it, we propose an anti-freezing reward to preserve motion dynamics while maintaining physical plausibility. Experiments on multiple dance datasets consistently demonstrate that our method can significantly improve the physical plausibility of generated motions, yielding more realistic and aesthetically pleasing dances. The project page is available at: this https URL

Title: MEMTS: Internalizing Domain Knowledge via Parameterized Memory for Retrieval-Free Domain Adaptation of Time Series Foundation Models

Authors: Xiaoyun Yu, Li Fan, Xiangfei Qiu, Nanqing Dong, Yonggui Huang, Honggang Qi, Geguang Pu, Wanli Ouyang, Xi Chen, Jilin Hu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2602.13783
Pdf URL: https://arxiv.org/pdf/2602.13783
Copy Paste: [[2602.13783]] MEMTS: Internalizing Domain Knowledge via Parameterized Memory for Retrieval-Free Domain Adaptation of Time Series Foundation Models(https://arxiv.org/abs/2602.13783)
Keywords: generation
Abstract: While Time Series Foundation Models (TSFMs) have demonstrated exceptional performance in generalized forecasting, their performance often degrades significantly when deployed in real-world vertical domains characterized by temporal distribution shifts and domain-specific periodic structures. Current solutions are primarily constrained by two paradigms: Domain-Adaptive Pretraining (DAPT), which improves short-term domain fitting but frequently disrupts previously learned global temporal patterns due to catastrophic forgetting; and Retrieval-Augmented Generation (RAG), which incorporates external knowledge but introduces substantial retrieval overhead. This creates a severe scalability bottleneck that fails to meet the high-efficiency requirements of real-time stream processing. To break this impasse, we propose Memory for Time Series (MEMTS), a lightweight and plug-and-play method for retrieval-free domain adaptation in time series forecasting. The key component of MEMTS is a Knowledge Persistence Module (KPM), which internalizes domain-specific temporal dynamics, such as recurring seasonal patterns and trends, into a compact set of learnable latent prototypes. In doing so, it transforms fragmented historical observations into continuous, parameterized knowledge representations. This paradigm shift enables MEMTS to achieve accurate domain adaptation with constant-time inference and near-zero latency, while effectively mitigating catastrophic forgetting of general temporal patterns, all without requiring any architectural modifications to the frozen TSFM backbone. Extensive experiments on multiple datasets demonstrate the state-of-the-art performance of MEMTS.

Title: Mean Flow Policy with Instantaneous Velocity Constraint for One-step Action Generation

Authors: Guojian Zhan, Letian Tao, Pengcheng Wang, Yixiao Wang, Yiheng Li, Yuxin Chen, Masayoshi Tomizuka, Shengbo Eben Li
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2602.13810
Pdf URL: https://arxiv.org/pdf/2602.13810
Copy Paste: [[2602.13810]] Mean Flow Policy with Instantaneous Velocity Constraint for One-step Action Generation(https://arxiv.org/abs/2602.13810)
Keywords: generation, generative
Abstract: Learning expressive and efficient policy functions is a promising direction in reinforcement learning (RL). While flow-based policies have recently proven effective in modeling complex action distributions with a fast deterministic sampling process, they still face a trade-off between expressiveness and computational burden, which is typically controlled by the number of flow steps. In this work, we propose mean velocity policy (MVP), a new generative policy function that models the mean velocity field to achieve the fastest one-step action generation. To ensure its high expressiveness, an instantaneous velocity constraint (IVC) is introduced on the mean velocity field during training. We theoretically prove that this design explicitly serves as a crucial boundary condition, thereby improving learning accuracy and enhancing policy expressiveness. Empirically, our MVP achieves state-of-the-art success rates across several challenging robotic manipulation tasks from Robomimic and OGBench. It also delivers substantial improvements in training and inference speed over existing flow-based policy baselines.
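The relationship between a mean velocity field and its instantaneous counterpart, which the abstract above refers to as a boundary condition, can be written out briefly. This sketch assumes the mean-flow-style definition of average velocity over a time interval; the notation is illustrative, the conditioning on the RL state is omitted, and the paper's exact form of the constraint may differ.

```latex
% Mean velocity over [r, t] along a flow trajectory z_\tau with instantaneous velocity v
u(z_t, r, t) = \frac{1}{t - r} \int_r^t v(z_\tau, \tau)\, d\tau, \qquad r < t

% Boundary condition: as the interval shrinks, the mean velocity equals the instantaneous one
\lim_{r \to t} u(z_t, r, t) = v(z_t, t)

% One-step generation: a single evaluation of u carries z_t back to z_r
z_r = z_t - (t - r)\, u(z_t, r, t)
```

Under this reading, the last line is what makes one-step action generation possible, since no numerical integration over intermediate flow steps is required.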

Title: VAR-3D: View-aware Auto-Regressive Model for Text-to-3D Generation via a 3D Tokenizer

Authors: Zongcheng Han, Dongyan Cao, Haoran Sun, Yu Hong
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2602.13818
Pdf URL: https://arxiv.org/pdf/2602.13818
Copy Paste: [[2602.13818]] VAR-3D: View-aware Auto-Regressive Model for Text-to-3D Generation via a 3D Tokenizer(https://arxiv.org/abs/2602.13818)
Keywords: generation, generative
Abstract: Recent advances in auto-regressive transformers have achieved remarkable success in generative modeling. However, text-to-3D generation remains challenging, primarily due to bottlenecks in learning discrete 3D representations. Specifically, existing approaches often suffer from information loss during encoding, causing representational distortion before the quantization process. This effect is further amplified by vector quantization, ultimately degrading the geometric coherence of text-conditioned 3D shapes. Moreover, the conventional two-stage training paradigm induces an objective mismatch between reconstruction and text-conditioned auto-regressive generation. To address these issues, we propose View-aware Auto-Regressive 3D (VAR-3D), which integrates a view-aware 3D Vector Quantized-Variational AutoEncoder (VQ-VAE) to convert the complex geometric structure of 3D models into discrete tokens. Additionally, we introduce a rendering-supervised training strategy that couples discrete token prediction with visual reconstruction, encouraging the generative process to better preserve visual fidelity and structural consistency relative to the input text. Experiments demonstrate that VAR-3D significantly outperforms existing methods in both generation quality and text-3D alignment.
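The step of converting continuous encodings into discrete tokens, which the abstract above attributes to a VQ-VAE, can be sketched as a nearest-neighbour codebook lookup. This is the generic vector quantization operation, not VAR-3D's view-aware tokenizer; the function name and array shapes are assumptions for illustration.

```python
import numpy as np

def vector_quantize(z, codebook):
    """Map each latent vector to its nearest codebook entry.

    z:        (n, d) continuous encoder outputs
    codebook: (K, d) code vectors
    Returns (token indices of shape (n,), quantized vectors of shape (n, d)).
    """
    z = np.asarray(z, dtype=float)
    codebook = np.asarray(codebook, dtype=float)
    # Squared Euclidean distance between every latent and every code: (n, K)
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d2.argmin(axis=1)
    return idx, codebook[idx]
```

The gap between `z` and the returned quantized vectors is the quantization error that, according to the abstract, amplifies distortion already introduced during encoding.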

Title: Embed-RL: Reinforcement Learning for Reasoning-Driven Multimodal Embeddings

Authors: Haonan Jiang, Yuji Wang, Yongjie Zhu, Xin Lu, Wenyu Qin, Meng Wang, Pengfei Wan, Yansong Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.13823
Pdf URL: https://arxiv.org/pdf/2602.13823
Copy Paste: [[2602.13823]] Embed-RL: Reinforcement Learning for Reasoning-Driven Multimodal Embeddings(https://arxiv.org/abs/2602.13823)
Keywords: generative
Abstract: Leveraging Multimodal Large Language Models (MLLMs) has become pivotal for advancing Universal Multimodal Embeddings (UME) in addressing diverse cross-modal tasks. Recent studies demonstrate that incorporating generative Chain-of-Thought (CoT) reasoning can substantially enhance task-specific representations compared to discriminative methods. However, the generated reasoning CoTs of existing generative embedding methods are limited to the textual analysis of queries and are irrelevant to the retrieval of the targets. To address these limitations, we propose a reasoning-driven UME framework that integrates Embedder-Guided Reinforcement Learning (EG-RL) to optimize the Reasoner to produce evidential Traceability CoT (T-CoT). Our key contributions are threefold: (1) We design an EG-RL framework where the Embedder provides explicit supervision to the Reasoner, ensuring the generated CoT traces are aligned with embedding tasks. (2) We introduce T-CoT, which extracts critical multimodal cues to focus on retrieval-relevant elements and provides multimodal inputs for the Embedder. (3) With limited computational resources, our framework outperforms the pioneering embedding model on both MMEB-V2 and UVRB benchmarks. The integration of multimodal evidence in structured reasoning, paired with retrieval-oriented alignment, effectively strengthens cross-modal semantic consistency and boosts the fine-grained matching capability of the model as well as the generalization across complex scenarios. Our work demonstrates that targeted reasoning optimization can significantly improve multimodal embedding quality, providing a practical and efficient solution for reasoning-driven UME development.

Title: Prior-guided Hierarchical Instance-pixel Contrastive Learning for Ultrasound Speckle Noise Suppression

Authors: Zhenyu Bu, Yuanxin Xie, Guang-Quan Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.13831
Pdf URL: https://arxiv.org/pdf/2602.13831
Copy Paste: [[2602.13831]] Prior-guided Hierarchical Instance-pixel Contrastive Learning for Ultrasound Speckle Noise Suppression(https://arxiv.org/abs/2602.13831)
Keywords: restoration
Abstract: Ultrasound denoising is essential for mitigating speckle-induced degradations, thereby enhancing image quality and improving diagnostic reliability. Nevertheless, because speckle patterns inherently encode both texture and fine anatomical details, effectively suppressing noise while preserving structural fidelity remains a significant challenge. In this study, we propose a prior-guided hierarchical instance-pixel contrastive learning model for ultrasound denoising, designed to promote noise-invariant and structure-aware feature representations by maximizing the separability between noisy and clean samples at both pixel and instance levels. Specifically, a statistics-guided pixel-level contrastive learning strategy is introduced to enhance distributional discrepancies between noisy and clean pixels, thereby improving local structural consistency. Concurrently, a memory bank is employed to facilitate instance-level contrastive learning in the feature space, encouraging representations that more faithfully approximate the underlying data distribution. Furthermore, a hybrid Transformer-CNN architecture is adopted, coupling a Transformer-based encoder for global context modeling with a CNN-based decoder optimized for fine-grained anatomical structure restoration, thus enabling complementary exploitation of long-range dependencies and local texture details. Extensive evaluations on two publicly available ultrasound datasets demonstrate that the proposed model consistently outperforms existing methods, confirming its effectiveness and superiority.

Title: High-Fidelity Causal Video Diffusion Models for Real-Time Ultra-Low-Bitrate Semantic Communication

Authors: Cem Eteke, Batuhan Tosun, Alexander Griessel, Wolfgang Kellerer, Eckehard Steinbach
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.13837
Pdf URL: https://arxiv.org/pdf/2602.13837
Copy Paste: [[2602.13837]] High-Fidelity Causal Video Diffusion Models for Real-Time Ultra-Low-Bitrate Semantic Communication(https://arxiv.org/abs/2602.13837)
Keywords: restoration, generation, generative
Abstract: We introduce a video diffusion model for high-fidelity, causal, and real-time video generation under ultra-low-bitrate semantic communication constraints. Our approach utilizes lossy semantic video coding to transmit the semantic scene structure, complemented by a stream of highly compressed, low-resolution frames that provide sufficient texture information to preserve fidelity. Building on these inputs, we introduce a modular video diffusion model that contains Semantic Control, Restoration Adapter, and Temporal Adapter. We further introduce an efficient temporal distillation procedure that enables extension to real-time and causal synthesis, reducing trainable parameters by 300x and training time by 2x, while adhering to communication constraints. Evaluated across diverse datasets, the framework achieves strong perceptual quality, semantic fidelity, and temporal consistency at ultra-low bitrates (< 0.0003 bpp), outperforming classical, neural, and generative baselines in extensive quantitative, qualitative, and subjective evaluations.
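Note (illustrative arithmetic, not from the paper): the abstract quotes a budget of under 0.0003 bits per pixel (bpp). The sketch below converts that figure into a bitrate. The 512x512 resolution and 30 fps frame rate are assumptions chosen only for illustration; the abstract does not state them.

```python
def bitrate_bps(bpp, width, height, fps):
    """Bits per second implied by a bits-per-pixel budget."""
    return bpp * width * height * fps

# Assumed (hypothetical) video format: 512x512 pixels at 30 frames per second.
rate = bitrate_bps(0.0003, 512, 512, 30)
print(f"{rate:.0f} bits/s, about {rate / 1000:.2f} kbit/s")
```

Under these assumed settings the budget comes to a little over 2 kbit/s, which gives a sense of scale for the "ultra-low-bitrate" regime.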

Title: Synthetic Dataset Generation and Validation for Robotic Surgery Instrument Segmentation

Authors: Giorgio Chiesa, Rossella Borra, Vittorio Lauro, Sabrina De Cillis, Daniele Amparore, Cristian Fiori, Riccardo Renzulli, Marco Grangetto
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.13844
Pdf URL: https://arxiv.org/pdf/2602.13844
Copy Paste: [[2602.13844]] Synthetic Dataset Generation and Validation for Robotic Surgery Instrument Segmentation(https://arxiv.org/abs/2602.13844)
Keywords: generation
Abstract: This paper presents a comprehensive workflow for generating and validating a synthetic dataset designed for robotic surgery instrument segmentation. A 3D reconstruction of the Da Vinci robotic arms was refined and animated in Autodesk Maya through a fully automated Python-based pipeline capable of producing photorealistic, labeled video sequences. Each scene integrates randomized motion patterns, lighting variations, and synthetic blood textures to mimic intraoperative variability while preserving pixel-accurate ground truth masks. To validate the realism and effectiveness of the generated data, several segmentation models were trained under controlled ratios of real and synthetic data. Results demonstrate that a balanced composition of real and synthetic samples significantly improves model generalization compared to training on real data only, while excessive reliance on synthetic data introduces a measurable domain shift. The proposed framework provides a reproducible and scalable tool for surgical computer vision, supporting future research in data augmentation, domain adaptation, and simulation-based pretraining for robotic-assisted surgery. Data and code are available at this https URL.

Title: Low-Pass Filtering Improves Behavioral Alignment of Vision Models

Authors: Max Wolff, Thomas Klein, Evgenia Rusak, Felix Wichmann, Wieland Brendel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.13859
Pdf URL: https://arxiv.org/pdf/2602.13859
Copy Paste: [[2602.13859]] Low-Pass Filtering Improves Behavioral Alignment of Vision Models(https://arxiv.org/abs/2602.13859)
Keywords: generative
Abstract: Despite their impressive performance on computer vision benchmarks, Deep Neural Networks (DNNs) still fall short of adequately modeling human visual behavior, as measured by error consistency and shape bias. Recent work hypothesized that behavioral alignment can be drastically improved through generative -- rather than discriminative -- classifiers, with far-reaching implications for models of human vision. Here, we instead show that the increased alignment of generative models can be largely explained by a seemingly innocuous resizing operation in the generative model which effectively acts as a low-pass filter. In a series of controlled experiments, we show that removing high-frequency spatial information from discriminative models like CLIP drastically increases their behavioral alignment. Simply blurring images at test-time -- rather than training on blurred images -- achieves a new state-of-the-art score on the model-vs-human benchmark, halving the current alignment gap between DNNs and human observers. Furthermore, low-pass filters are likely optimal, which we demonstrate by directly optimizing filters for alignment. To contextualize the performance of optimal filters, we compute the frontier of all possible Pareto-optimal solutions to the benchmark, which was formerly unknown. We explain our findings by observing that the frequency spectrum of optimal Gaussian filters roughly matches the spectrum of band-pass filters implemented by the human visual system. We show that the contrast sensitivity function, describing the inverse of the contrast threshold required for humans to detect a sinusoidal grating as a function of spatiotemporal frequency, is approximated well by Gaussian filters of the specific width that also maximizes error consistency.
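Note (illustrative sketch, not the authors' code): the abstract's central operation is blurring images at test time with a Gaussian low-pass filter. The pure-Python sketch below shows what such a filter does to the highest-frequency pattern an image can hold, a one-pixel checkerboard. The function names, the kernel radius rule (about three sigma), the border clamping, and sigma = 1.5 are all assumptions made for the example; the paper's filter width is not given in the abstract.

```python
import math

def gaussian_kernel(sigma):
    """Normalized 1D Gaussian weights, truncated at roughly 3 sigma."""
    radius = max(1, int(3 * sigma + 0.5))
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur_1d(row, kernel):
    """Convolve one row with the kernel, clamping indices at the borders."""
    radius = len(kernel) // 2
    n = len(row)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)
            acc += w * row[j]
        out.append(acc)
    return out

def gaussian_blur(image, sigma):
    """Separable 2D Gaussian blur: filter the rows, then the columns."""
    kernel = gaussian_kernel(sigma)
    rows = [blur_1d(r, kernel) for r in image]
    cols = [blur_1d(list(c), kernel) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]

def contrast(image):
    """Difference between the brightest and darkest pixel."""
    flat = [v for row in image for v in row]
    return max(flat) - min(flat)

# One-pixel checkerboard: all of its energy sits at the highest spatial frequency.
checker = [[float((x + y) % 2) for x in range(8)] for y in range(8)]
blurred = gaussian_blur(checker, sigma=1.5)
print(contrast(checker), contrast(blurred))
```

The checkerboard starts at full contrast and comes out much flatter, which is the "removing high-frequency spatial information" effect the abstract refers to. A real pipeline would apply the same kind of filter to each input image before passing it to the classifier.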

Title: Parameter-Efficient Fine-Tuning of DINOv2 for Large-Scale Font Classification

Authors: Daniel Chen, Zaria Zinn, Marcus Lowe
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2602.13889
Pdf URL: https://arxiv.org/pdf/2602.13889
Copy Paste: [[2602.13889]] Parameter-Efficient Fine-Tuning of DINOv2 for Large-Scale Font Classification(https://arxiv.org/abs/2602.13889)
Keywords: generation
Abstract: We present a font classification system capable of identifying 394 font families from rendered text images. Our approach fine-tunes a DINOv2 Vision Transformer using Low-Rank Adaptation (LoRA), achieving approximately 86% top-1 accuracy while training fewer than 1% of the model's 87.2M parameters. We introduce a synthetic dataset generation pipeline that renders Google Fonts at scale with diverse augmentations including randomized colors, alignment, line wrapping, and Gaussian noise, producing training images that generalize to real-world typographic samples. The model incorporates built-in preprocessing to ensure consistency between training and inference, and is deployed as a HuggingFace Inference Endpoint. We release the model, dataset, and full training pipeline as open-source resources.
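Note (illustrative arithmetic, not from the paper): LoRA freezes a weight matrix of shape d_out x d_in and trains two low-rank factors with rank r, adding r * (d_in + d_out) trainable parameters per adapted matrix. The sketch below checks that a plausible configuration stays under the abstract's "fewer than 1% of 87.2M parameters" figure. Only the 87.2M total and the 1% bound come from the abstract; the hidden size (768), layer count (12), rank (8), and the choice of adapting two projection matrices per layer are assumptions.

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters added by one LoRA adapter (factors A and B)."""
    return rank * (d_in + d_out)

TOTAL_PARAMS = 87_200_000          # from the abstract

# Hypothetical configuration, chosen only for illustration.
hidden, layers, rank = 768, 12, 8
matrices_per_layer = 2             # e.g. two attention projections per block

adapted = layers * matrices_per_layer * lora_params(hidden, hidden, rank)
fraction = adapted / TOTAL_PARAMS
print(adapted, f"{100 * fraction:.2f}% of the backbone")
```

Under these assumptions the adapters add roughly 0.3M parameters. A classification head for the 394 font families would add to the trainable count and is not included in this sketch.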

Title: Why Code, Why Now: Learnability, Computability, and the Real Limits of Machine Learning

Authors: Zhimin Zhao
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2602.13934
Pdf URL: https://arxiv.org/pdf/2602.13934
Copy Paste: [[2602.13934]] Why Code, Why Now: Learnability, Computability, and the Real Limits of Machine Learning(https://arxiv.org/abs/2602.13934)
Keywords: generation
Abstract: Code generation has progressed more reliably than reinforcement learning, largely because code has an information structure that makes it learnable. Code provides dense, local, verifiable feedback at every token, whereas most reinforcement learning problems do not. This difference in feedback quality is not binary but graded. We propose a five-level hierarchy of learnability based on information structure and argue that the ceiling on ML progress depends less on model size than on whether a task is learnable at all. The hierarchy rests on a formal distinction among three properties of computational problems (expressibility, computability, and learnability). We establish their pairwise relationships, including where implications hold and where they fail, and present a unified template that makes the structural differences explicit. The analysis suggests why supervised learning on code scales predictably while reinforcement learning does not, and why the common assumption that scaling alone will solve remaining ML challenges warrants scrutiny.

Title: A Multi-Agent Framework for Code-Guided, Modular, and Verifiable Automated Machine Learning

Authors: Dat Le, Duc-Cuong Le, Anh-Son Nguyen, Tuan-Dung Bui, Thu-Trang Nguyen, Son Nguyen, Hieu Dinh Vo
Subjects: cs.LG, cs.SE
Abstract URL: https://arxiv.org/abs/2602.13937
Pdf URL: https://arxiv.org/pdf/2602.13937
Copy Paste: [[2602.13937]] A Multi-Agent Framework for Code-Guided, Modular, and Verifiable Automated Machine Learning(https://arxiv.org/abs/2602.13937)
Keywords: generation
Abstract: Automated Machine Learning (AutoML) has revolutionized the development of data-driven solutions; however, traditional frameworks often function as "black boxes", lacking the flexibility and transparency required for complex, real-world engineering tasks. Recent Large Language Model (LLM)-based agents have shifted toward code-driven approaches. However, they frequently suffer from hallucinated logic and logic entanglement, where monolithic code generation leads to unrecoverable runtime failures. In this paper, we present iML, a novel multi-agent framework designed to shift AutoML from black-box prompting to a code-guided, modular, and verifiable architectural paradigm. iML introduces three main ideas: (1) Code-Guided Planning, which synthesizes a strategic blueprint grounded in autonomous empirical profiling to eliminate hallucination; (2) Code-Modular Implementation, which decouples preprocessing and modeling into specialized components governed by strict interface contracts; and (3) Code-Verifiable Integration, which enforces physical feasibility through dynamic contract verification and iterative self-correction. We evaluate iML across MLE-BENCH and the newly introduced iML-BENCH, comprising a diverse range of real-world Kaggle competitions. The experimental results show iML's superiority over state-of-the-art agents, achieving a valid submission rate of 85% and a competitive medal rate of 45% on MLE-BENCH, with an average standardized performance score (APS) of 0.77. On iML-BENCH, iML significantly outperforms the other approaches by 38%-163% in APS. Furthermore, iML maintains a robust 70% success rate even under stripped task descriptions, effectively filling information gaps through empirical profiling. These results highlight iML's potential to bridge the gap between stochastic generation and reliable engineering, marking a meaningful step toward truly automated machine learning.

Title: Chemical Language Models for Natural Products: A State-Space Model Approach

Authors: Ho-Hsuan Wang, Afnan Sultan, Andrea Volkamer, Dietrich Klakow
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2602.13958
Pdf URL: https://arxiv.org/pdf/2602.13958
Copy Paste: [[2602.13958]] Chemical Language Models for Natural Products: A State-Space Model Approach(https://arxiv.org/abs/2602.13958)
Keywords: generation
Abstract: Language models are widely used in chemistry for molecular property prediction and small-molecule generation, yet Natural Products (NPs) remain underexplored despite their importance in drug discovery. To address this gap, we develop NP-specific chemical language models (NPCLMs) by pre-training state-space models (Mamba and Mamba-2) and comparing them with transformer baselines (GPT). Using a dataset of about 1M NPs, we present the first systematic comparison of selective state-space models and transformers for NP-focused tasks, together with eight tokenization strategies including character-level, Atom-in-SMILES (AIS), byte-pair encoding (BPE), and NP-specific BPE. We evaluate molecule generation (validity, uniqueness, novelty) and property prediction (membrane permeability, taste, anti-cancer activity) using MCC and AUC-ROC. Mamba generates 1-2 percent more valid and unique molecules than Mamba-2 and GPT, with fewer long-range dependency errors, while GPT yields slightly more novel structures. For property prediction, Mamba variants outperform GPT by 0.02-0.04 MCC under random splits, while scaffold splits show comparable performance. Results demonstrate that domain-specific pre-training on about 1M NPs can match models trained on datasets over 100 times larger.
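Note (illustrative sketch, not from the paper): the property-prediction results are reported in Matthews correlation coefficient (MCC), so a gap of 0.02-0.04 is measured on a scale from -1 to 1. The sketch below computes MCC from binary confusion-matrix counts using the standard formula. The function name and the example counts are made up for illustration.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom

# Made-up counts for a balanced test set of 200 molecules.
print(round(mcc(tp=80, tn=75, fp=25, fn=20), 3))
```

MCC is 1 for perfect prediction, 0 for chance-level prediction, and -1 for fully inverted prediction, and it stays informative when the two classes are imbalanced.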

Title: MarsRetrieval: Benchmarking Vision-Language Models for Planetary-Scale Geospatial Retrieval on Mars

Authors: Shuoyuan Wang, Yiran Wang, Hongxin Wei
Subjects: cs.CV, astro-ph.IM, cs.CL
Abstract URL: https://arxiv.org/abs/2602.13961
Pdf URL: https://arxiv.org/pdf/2602.13961
Copy Paste: [[2602.13961]] MarsRetrieval: Benchmarking Vision-Language Models for Planetary-Scale Geospatial Retrieval on Mars(https://arxiv.org/abs/2602.13961)
Keywords: generative
Abstract: Data-driven approaches like deep learning are rapidly advancing planetary science, particularly in Mars exploration. Despite recent progress, most existing benchmarks remain confined to closed-set supervised visual tasks and do not support text-guided retrieval for geospatial discovery. We introduce MarsRetrieval, a retrieval benchmark for evaluating vision-language models for Martian geospatial discovery. MarsRetrieval includes three tasks: (1) paired image-text retrieval, (2) landform retrieval, and (3) global geo-localization, covering multiple spatial scales and diverse geomorphic origins. We propose a unified retrieval-centric protocol to benchmark multimodal embedding architectures, including contrastive dual-tower encoders and generative vision-language models. Our evaluation shows MarsRetrieval is challenging: even strong foundation models often fail to capture domain-specific geomorphic distinctions. We further show that domain-specific fine-tuning is critical for generalizable geospatial discovery in planetary settings. Our code is available at this https URL.

Title: Elastic Diffusion Transformer

Authors: Jiangshan Wang, Zeqiang Lai, Jiarui Chen, Jiayi Guo, Hang Guo, Xiu Li, Xiangyu Yue, Chunchao Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.13993
Pdf URL: https://arxiv.org/pdf/2602.13993
Copy Paste: [[2602.13993]] Elastic Diffusion Transformer(https://arxiv.org/abs/2602.13993)
Keywords: generation, generative
Abstract: Diffusion Transformers (DiT) have demonstrated remarkable generative capabilities but remain highly computationally expensive. Previous acceleration methods, such as pruning and distillation, typically rely on a fixed computational capacity, leading to insufficient acceleration and degraded generation quality. To address this limitation, we propose Elastic Diffusion Transformer (E-DiT), an adaptive acceleration framework for DiT that effectively improves efficiency while maintaining generation quality. Specifically, we observe that the generative process of DiT exhibits substantial sparsity (i.e., some computations can be skipped with minimal impact on quality), and this sparsity varies significantly across samples. Motivated by this observation, E-DiT equips each DiT block with a lightweight router that dynamically identifies sample-dependent sparsity from the input latent. Each router adaptively determines whether the corresponding block can be skipped. If the block is not skipped, the router then predicts the optimal MLP width reduction ratio within the block. During inference, we further introduce a block-level feature caching mechanism that leverages router predictions to eliminate redundant computations in a training-free manner. Extensive experiments across 2D image (Qwen-Image and FLUX) and 3D asset (Hunyuan3D-3.0) demonstrate the effectiveness of E-DiT, achieving up to ~2x speedup with negligible loss in generation quality. Code will be available at this https URL.

Title: Inject Where It Matters: Training-Free Spatially-Adaptive Identity Preservation for Text-to-Image Personalization

Authors: Guandong Li, Mengxia Ye
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.13994
Pdf URL: https://arxiv.org/pdf/2602.13994
Copy Paste: [[2602.13994]] Inject Where It Matters: Training-Free Spatially-Adaptive Identity Preservation for Text-to-Image Personalization(https://arxiv.org/abs/2602.13994)
Keywords: generation
Abstract: Personalized text-to-image generation aims to integrate specific identities into arbitrary contexts. However, existing tuning-free methods typically employ Spatially Uniform Visual Injection, causing identity features to contaminate non-facial regions (e.g., backgrounds and lighting) and degrading text adherence. To address this without expensive fine-tuning, we propose SpatialID, a training-free spatially-adaptive identity modulation framework. SpatialID fundamentally decouples identity injection into face-relevant and context-free regions using a Spatial Mask Extractor derived from cross-attention responses. Furthermore, we introduce a Temporal-Spatial Scheduling strategy that dynamically adjusts spatial constraints - transitioning from Gaussian priors to attention-based masks and adaptive relaxation - to align with the diffusion generation dynamics. Extensive experiments on IBench demonstrate that SpatialID achieves state-of-the-art performance in text adherence (CLIP-T: 0.281), visual consistency (CLIP-I: 0.827), and image quality (IQ: 0.523), significantly eliminating background contamination while maintaining robust identity preservation.

Title: Train Short, Inference Long: Training-free Horizon Extension for Autoregressive Video Generation

Authors: Jia Li, Xiaomeng Fu, Xurui Peng, Weifeng Chen, Youwei Zheng, Tianyu Zhao, Jiexi Wang, Fangmin Chen, Xing Wang, Hayden Kwok-Hay So
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.14027
Pdf URL: https://arxiv.org/pdf/2602.14027
Copy Paste: [[2602.14027]] Train Short, Inference Long: Training-free Horizon Extension for Autoregressive Video Generation(https://arxiv.org/abs/2602.14027)
Keywords: generation
Abstract: Autoregressive video diffusion models have emerged as a scalable paradigm for long video generation. However, they often suffer from severe extrapolation failure, where rapid error accumulation leads to significant temporal degradation when extending beyond training horizons. We identify that this failure primarily stems from the spectral bias of 3D positional embeddings and the lack of dynamic priors in noise sampling. To address these issues, we propose FLEX (Frequency-aware Length EXtension), a training-free inference-time framework that bridges the gap between short-term training and long-term inference. FLEX introduces Frequency-aware RoPE Modulation to adaptively interpolate under-trained low-frequency components while extrapolating high-frequency ones to preserve multi-scale temporal discriminability. This is integrated with Antiphase Noise Sampling (ANS) to inject high-frequency dynamic priors and Inference-only Attention Sink to anchor global structure. Extensive evaluations on VBench demonstrate that FLEX significantly outperforms state-of-the-art models at 6x extrapolation (30s duration) and matches the performance of long-video fine-tuned baselines at 12x scale (60s duration). As a plug-and-play augmentation, FLEX seamlessly integrates into existing inference pipelines for horizon extension. It effectively pushes the generation limits of models such as LongLive, supporting consistent and dynamic video synthesis at a 4-minute scale. Project page is available at this https URL.

Title: BitDance: Scaling Autoregressive Generative Models with Binary Tokens

Authors: Yuang Ai, Jiaming Han, Shaobin Zhuang, Weijia Mao, Xuefeng Hu, Ziyan Yang, Zhenheng Yang, Huaibo Huang, Xiangyu Yue, Hao Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2602.14041
Pdf URL: https://arxiv.org/pdf/2602.14041
Copy Paste: [[2602.14041]] BitDance: Scaling Autoregressive Generative Models with Binary Tokens(https://arxiv.org/abs/2602.14041)
Keywords: generation, generative
Abstract: We present BitDance, a scalable autoregressive (AR) image generator that predicts binary visual tokens instead of codebook indices. With high-entropy binary latents, BitDance lets each token represent up to 2^256 states, yielding a compact yet highly expressive discrete representation. Sampling from such a huge token space is difficult with standard classification. To resolve this, BitDance uses a binary diffusion head: instead of predicting an index with softmax, it employs continuous-space diffusion to generate the binary tokens. Furthermore, we propose next-patch diffusion, a new decoding method that predicts multiple tokens in parallel with high accuracy, greatly speeding up inference. On ImageNet 256x256, BitDance achieves an FID of 1.24, the best among AR models. With next-patch diffusion, BitDance beats state-of-the-art parallel AR models that use 1.4B parameters, while using 5.4x fewer parameters (260M) and achieving 8.7x speedup. For text-to-image generation, BitDance trains on large-scale multimodal tokens and generates high-resolution, photorealistic images efficiently, showing strong performance and favorable scaling. When generating 1024x1024 images, BitDance achieves a speedup of over 30x compared to prior AR models. We release code and models to facilitate further research on AR foundation models. Code and models are available at: this https URL.
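Note (illustrative arithmetic, not from the paper): the abstract says each token can represent up to 2^256 states, which corresponds to 256 binary digits per token, and that softmax classification over such a space is impractical. The sketch below makes the scale concrete and checks the quoted parameter ratio. The 16,384-entry codebook used for comparison is a hypothetical size, not a figure from the abstract.

```python
import math

bits_per_token = 256
states = 2 ** bits_per_token            # exact integer; Python ints are unbounded
print(len(str(states)), "decimal digits")

# A hypothetical codebook tokenizer for comparison (size is an assumption).
codebook_size = 16_384
bits_codebook = math.log2(codebook_size)
print(bits_codebook, "bits per codebook index vs", bits_per_token)

# A softmax head needs one logit per state, which is infeasible for 2**256
# states; this is the motivation the abstract gives for the diffusion head.

# Consistency check on the quoted model sizes: 1.4B parameters / 5.4.
smaller = 1.4e9 / 5.4
print(f"{smaller / 1e6:.0f}M parameters")
```

The division gives about 259M, consistent with the abstract's rounded 260M figure.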

Title: Restoration Adaptation for Semantic Segmentation on Low Quality Images

Authors: Kai Guan, Rongyuan Wu, Shuai Li, Wentao Zhu, Wenjun Zeng, Lei Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2602.14042
Pdf URL: https://arxiv.org/pdf/2602.14042
Copy Paste: [[2602.14042]] Restoration Adaptation for Semantic Segmentation on Low Quality Images(https://arxiv.org/abs/2602.14042)
Keywords: restoration
Abstract: In real-world scenarios, the performance of semantic segmentation often deteriorates when processing low-quality (LQ) images, which may lack clear semantic structures and high-frequency details. Although image restoration techniques offer a promising direction for enhancing degraded visual content, conventional real-world image restoration (Real-IR) models primarily focus on pixel-level fidelity and often fail to recover task-relevant semantic cues, limiting their effectiveness when directly applied to downstream vision tasks. Conversely, existing segmentation models trained on high-quality data lack robustness under real-world degradations. In this paper, we propose Restoration Adaptation for Semantic Segmentation (RASS), which effectively integrates semantic image restoration into the segmentation process, enabling high-quality semantic segmentation directly on LQ images. Specifically, we first propose a Semantic-Constrained Restoration (SCR) model, which injects segmentation priors into the restoration model by aligning its cross-attention maps with segmentation masks, encouraging semantically faithful image reconstruction. Then, RASS transfers semantic restoration knowledge into segmentation through LoRA-based module merging and task-specific fine-tuning, thereby enhancing the model's robustness to LQ images. To validate the effectiveness of our framework, we construct a real-world LQ image segmentation dataset with high-quality annotations, and conduct extensive experiments on both synthetic and real-world LQ benchmarks. The results show that SCR and RASS significantly outperform state-of-the-art methods in segmentation and restoration tasks. Code, models, and datasets will be available at this https URL.
摘要：在现实场景中，语义分割的性能在处理低质量（LQ）图像时通常会恶化，这可能缺乏清晰的语义结构和高频细节。尽管图像恢复技术为增强退化的视觉内容提供了一个有前景的方向，但传统的真实世界图像恢复（Real-IR）模型主要关注像素级保真度，并且通常无法恢复与任务相关的语义线索，从而限制了其直接应用于下游视觉任务时的有效性。相反，现有的基于高质量数据训练的分割模型在现实世界的退化情况下缺乏鲁棒性。在本文中，我们提出了语义分割恢复适应（RASS），它将语义图像恢复有效地集成到分割过程中，从而能够直接对LQ图像进行高质量语义分割。具体来说，我们首先提出了一种语义约束恢复（SCR）模型，该模型通过将交叉注意力图与分割掩模对齐，将分割先验注入到恢复模型中，从而鼓励语义上忠实的图像重建。然后，RASS通过基于LoRA的模块合并和特定于任务的微调，将语义恢复知识转移到分割中，从而增强模型对LQ图像的鲁棒性。为了验证我们框架的有效性，我们构建了一个具有高质量注释的真实 LQ 图像分割数据集，并在合成和真实 LQ 基准上进行了广泛的实验。结果表明，SCR 和 RASS 在分割和恢复任务中显着优于最先进的方法。代码、模型和数据集将在此 https URL 中提供。

Title: CoCoEdit: Content-Consistent Image Editing via Region Regularized Reinforcement Learning

Authors: Yuhui Wu, Chenxi Xie, Ruibin Li, Liyi Chen, Qiaosi Yi, Lei Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.14068
Pdf URL: https://arxiv.org/pdf/2602.14068
Copy Paste: [[2602.14068]] CoCoEdit: Content-Consistent Image Editing via Region Regularized Reinforcement Learning(https://arxiv.org/abs/2602.14068)
Keywords: generative
Abstract: Image editing has achieved impressive results with the development of large-scale generative models. However, existing models mainly focus on the editing effects of intended objects and regions, often leading to unwanted changes in unintended regions. We present a post-training framework for Content-Consistent Editing (CoCoEdit) via region regularized reinforcement learning. We first augment existing editing datasets with refined instructions and masks, from which 40K diverse and high-quality samples are curated as the training set. We then introduce a pixel-level similarity reward to complement MLLM-based rewards, enabling models to ensure both editing quality and content consistency during the editing process. To overcome the spatial-agnostic nature of the rewards, we propose a region-based regularizer, aiming to preserve non-edited regions for high-reward samples while encouraging editing effects for low-reward samples. For evaluation, we annotate editing masks for GEdit-Bench and ImgEdit-Bench, introducing pixel-level similarity metrics to measure content consistency and editing quality. Applying CoCoEdit to Qwen-Image-Edit and FLUX-Kontext, we achieve not only editing scores competitive with state-of-the-art models, but also significantly better content consistency, as measured by PSNR/SSIM metrics and human subjective ratings.
摘要：随着大规模生成模型的发展，图像编辑取得了令人印象深刻的成果。然而，现有模型主要关注预期对象和区域的编辑效果，常常导致非预期区域发生不必要的变化。我们通过区域正则化强化学习提出了内容一致编辑（CoCoEdit）的训练后框架。我们首先使用精炼的指令和掩码来增强现有的编辑数据集，其中 40K 个多样化且高质量的样本被挑选为训练集。然后，我们引入像素级相似性奖励来补充基于 MLLM 的奖励，使模型能够在编辑过程中确保编辑质量和内容一致性。为了克服奖励的空间不可知性，我们提出了一种基于区域的正则化器，旨在保留高奖励样本的未编辑区域，同时鼓励低奖励样本的编辑效果。为了进行评估，我们注释了 GEdit-Bench 和 ImgEdit-Bench 的编辑蒙版，引入像素级相似性度量来衡量内容一致性和编辑质量。将 CoCoEdit 应用于 Qwen-Image-Edit 和 FLUX-Kontext，我们不仅通过最先进的模型获得了有竞争力的编辑分数，而且通过 PSNR/SSIM 指标和人类主观评分来衡量，内容一致性也显着提高。
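
Illustration: the abstract measures content consistency with pixel-level metrics such as PSNR, restricted by annotated editing masks. The sketch below shows one plausible way to compute PSNR over only the non-edited pixels. The function name `masked_psnr`, the mask convention (1 = edited, 0 = should stay unchanged), and the tiny grayscale images are assumptions for illustration, not the paper's exact metric.

```python
import math

def masked_psnr(source, edited, edit_mask, max_value=255.0):
    """PSNR over pixels where edit_mask is 0 (regions meant to stay unchanged)."""
    squared_errors = [
        (s - e) ** 2
        for s_row, e_row, m_row in zip(source, edited, edit_mask)
        for s, e, m in zip(s_row, e_row, m_row)
        if not m  # skip pixels inside the intended edit region
    ]
    if not squared_errors:
        raise ValueError("mask leaves no unedited pixels to compare")
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return math.inf  # unedited region is pixel-identical
    return 10.0 * math.log10(max_value ** 2 / mse)

# Hypothetical 2x2 grayscale images; only the bottom-left pixel is an intended edit.
source = [[10, 20], [30, 40]]
edited = [[10, 25], [200, 40]]
edit_mask = [[0, 0], [1, 0]]
psnr = masked_psnr(source, edited, edit_mask)
```

The large change at the masked pixel does not lower the score, while the small drift at the unmasked pixel (20 to 25) does, which is the behavior a content-consistency metric is after.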

Title: EgoSound: Benchmarking Sound Understanding in Egocentric Videos

Authors: Bingwen Zhu, Yuqian Fu, Qiaole Dong, Guolei Sun, Tianwen Qian, Yuzheng Wu, Danda Pani Paudel, Xiangyang Xue, Yanwei Fu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.14122
Pdf URL: https://arxiv.org/pdf/2602.14122
Copy Paste: [[2602.14122]] EgoSound: Benchmarking Sound Understanding in Egocentric Videos(https://arxiv.org/abs/2602.14122)
Keywords: generative
Abstract: Multimodal Large Language Models (MLLMs) have recently achieved remarkable progress in vision-language understanding. Yet, human perception is inherently multisensory, integrating sight, sound, and motion to reason about the world. Among these modalities, sound provides indispensable cues about spatial layout, off-screen events, and causal interactions, particularly in egocentric settings where auditory and visual signals are tightly coupled. To this end, we introduce EgoSound, the first benchmark designed to systematically evaluate egocentric sound understanding in MLLMs. EgoSound unifies data from Ego4D and EgoBlind, encompassing both sighted and sound-dependent experiences. It defines a seven-task taxonomy spanning intrinsic sound perception, spatial localization, causal inference, and cross-modal reasoning. Constructed through a multi-stage auto-generative pipeline, EgoSound contains 7315 validated QA pairs across 900 videos. Comprehensive experiments on nine state-of-the-art MLLMs reveal that current models exhibit emerging auditory reasoning abilities but remain limited in fine-grained spatial and causal understanding. EgoSound establishes a challenging foundation for advancing multisensory egocentric intelligence, bridging the gap between seeing and truly hearing the world.
摘要：多模态大语言模型（MLLM）最近在视觉语言理解方面取得了显着的进展。然而，人类的感知本质上是多感官的，整合视觉、声音和运动来推理世界。在这些模式中，声音提供了关于空间布局、屏幕外事件和因果交互不可或缺的线索，特别是在听觉和视觉信号紧密耦合的以自我为中心的环境中。为此，我们推出了 EgoSound，这是第一个旨在系统评估 MLLM 中以自我为中心的声音理解的基准测试。 EgoSound 统一了来自 Ego4D 和 EgoBlind 的数据，涵盖视觉和声音相关的体验。它定义了一个涵盖内在声音感知、空间定位、因果推理和跨模态推理的七任务分类法。 EgoSound 通过多级自动生成管道构建，包含 900 个视频中的 7315 个经过验证的 QA 对。对九个最先进的 MLLM 的综合实验表明，当前的模型表现出新兴的听觉推理能力，但在细粒度的空间和因果理解方面仍然有限。 EgoSound 为推进多感官自我中心智能奠定了具有挑战性的基础，弥合了看到世界和真正听到世界之间的差距。

Title: LaViDa-R1: Advancing Reasoning for Unified Multimodal Diffusion Language Models

Authors: Shufan Li, Yuchen Zhu, Jiuxiang Gu, Kangning Liu, Zhe Lin, Yongxin Chen, Molei Tao, Aditya Grover, Jason Kuen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.14147
Pdf URL: https://arxiv.org/pdf/2602.14147
Copy Paste: [[2602.14147]] LaViDa-R1: Advancing Reasoning for Unified Multimodal Diffusion Language Models(https://arxiv.org/abs/2602.14147)
Keywords: generation
Abstract: Diffusion language models (dLLMs) have recently emerged as a promising alternative to auto-regressive LLMs. Recent work has further extended them to multimodal understanding and generation tasks. In this work, we propose LaViDa-R1, a multimodal, general-purpose reasoning dLLM. Unlike existing works that build reasoning dLLMs through task-specific reinforcement learning, LaViDa-R1 incorporates diverse multimodal understanding and generation tasks in a unified manner. In particular, LaViDa-R1 is built with a novel unified post-training framework that seamlessly integrates supervised finetuning (SFT) and multi-task reinforcement learning (RL). It employs several novel training techniques, including answer-forcing, tree search, and complementary likelihood estimation, to enhance effectiveness and scalability. Extensive experiments demonstrate LaViDa-R1's strong performance on a wide range of multimodal tasks, including visual math reasoning, reason-intensive grounding, and image editing.
摘要：扩散语言模型 (dLLM) 最近成为自回归 LLM 的有前途的替代方案。最新的工作进一步将其扩展到多模态理解和生成任务。在这项工作中，我们提出了 LaViDa-R1，一种多模式、通用推理 dLLM。与通过特定任务强化学习构建推理 dLLM 的现有作品不同，LaViDa-R1 以统一的方式整合了多种多模态理解和生成任务。特别是，LaViDa-R1 采用新颖的统一后训练框架构建，无缝集成监督微调（SFT）和多任务强化学习（RL）。它采用了多种新颖的训练技术，包括强制答案、树搜索和互补似然估计，以提高有效性和可扩展性。大量实验证明了 LaViDa-R1 在各种多模式任务上的强大性能，包括视觉数学推理、推理密集型基础和图像编辑。

Title: UniWeTok: An Unified Binary Tokenizer with Codebook Size $\mathit{2^{128}}$ for Unified Multimodal Large Language Model

Authors: Shaobin Zhuang, Yuang Ai, Jiaming Han, Weijia Mao, Xiaohui Li, Fangyikang Wang, Xiao Wang, Yan Li, Shanchuan Lin, Kun Xu, Zhenheng Yang, Huaibo Huang, Xiangyu Yue, Hao Chen, Yali Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2602.14178
Pdf URL: https://arxiv.org/pdf/2602.14178
Copy Paste: [[2602.14178]] UniWeTok: An Unified Binary Tokenizer with Codebook Size $\mathit{2^{128}}$ for Unified Multimodal Large Language Model(https://arxiv.org/abs/2602.14178)
Keywords: generation, generative
Abstract: Unified Multimodal Large Language Models (MLLMs) require a visual representation that simultaneously supports high-fidelity reconstruction, complex semantic extraction, and generative suitability. However, existing visual tokenizers typically struggle to satisfy these conflicting objectives within a single framework. In this paper, we introduce UniWeTok, a unified discrete tokenizer designed to bridge this gap using a massive binary codebook ($\mathit{2^{128}}$). For the training framework, we introduce Pre-Post Distillation and a Generative-Aware Prior to enhance the semantic extraction and generative prior of the discrete tokens. In terms of model architecture, we propose a convolution-attention hybrid architecture with the SigLu activation function. SigLu activation not only bounds the encoder output and stabilizes the semantic distillation process but also effectively addresses the optimization conflict between token entropy loss and commitment loss. We further propose a three-stage training framework designed to enhance UniWeTok's adaptability across various image resolutions and perception-sensitive scenarios, such as those involving human faces and textual content. On ImageNet, UniWeTok achieves state-of-the-art image generation performance (FID: UniWeTok 1.38 vs. REPA 1.42) while requiring remarkably low training compute (Training Tokens: UniWeTok 33B vs. REPA 262B). In the general domain, UniWeTok demonstrates highly competitive capabilities across a broad range of tasks, including multimodal understanding, image generation (DPG Score: UniWeTok 86.63 vs. FLUX.1 [Dev] 83.84), and editing (GEdit Overall Score: UniWeTok 5.09 vs. OmniGen 5.06). We release code and models to facilitate community exploration of unified tokenizers and MLLMs.
摘要：统一多模态大语言模型 (MLLM) 需要同时支持高保真重建、复杂语义提取和生成适用性的视觉表示。然而，现有的视觉分词器通常难以在单个框架内满足这些相互冲突的目标。在本文中，我们介绍了 UniWeTok，这是一种统一的离散分词器，旨在使用大量二进制码本 ($\mathit{2^{128}}$) 来弥补这一差距。对于训练框架，我们引入了 Pre-Post Distillation 和 Generative-Aware Prior 来增强离散标记的语义提取和生成先验。在模型架构方面，我们提出了一种具有 SigLu 激活函数的卷积-注意力混合架构。 SigLu激活不仅限制了编码器输出并稳定了语义蒸馏过程，而且有效解决了令牌熵损失和承诺损失之间的优化冲突。我们进一步提出了一个三阶段训练框架，旨在增强 UniWeTok 对各种图像分辨率和感知敏感场景（例如涉及人脸和文本内容的场景）的适应性。在 ImageNet 上，UniWeTok 实现了最先进的图像生成性能（FID：UniWeTok 1.38 与 REPA 1.42），同时需要极低的训练计算（训练令牌：UniWeTok 33B 与 REPA 262B）。在通用领域，UniWeTok 在广泛的任务中展示了极具竞争力的能力，包括多模态理解、图像生成（DPG 分数：UniWeTok 86.63 vs. FLUX.1 [Dev] 83.84）和编辑（GEdit 总体分数：UniWeTok 5.09 vs. OmniGen 5.06）。我们发布代码和模型，以促进社区对统一标记器和 MLLM 的探索。

Title: UniRef-Image-Edit: Towards Scalable and Consistent Multi-Reference Image Editing

Authors: Hongyang Wei, Bin Wen, Yancheng Long, Yankai Yang, Yuhang Hu, Tianke Zhang, Wei Chen, Haonan Fan, Kaiyu Jiang, Jiankang Chen, Changyi Liu, Kaiyu Tang, Haojie Ding, Xiao Yang, Jia Sun, Huaiqing Wang, Zhenyu Yang, Xinyu Wei, Xianglong He, Yangguang Li, Fan Yang, Tingting Gao, Lei Zhang, Guorui Zhou, Han Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.14186
Pdf URL: https://arxiv.org/pdf/2602.14186
Copy Paste: [[2602.14186]] UniRef-Image-Edit: Towards Scalable and Consistent Multi-Reference Image Editing(https://arxiv.org/abs/2602.14186)
Keywords: generation, generative
Abstract: We present UniRef-Image-Edit, a high-performance multi-modal generation system that unifies single-image editing and multi-image composition within a single framework. Existing diffusion-based editing methods often struggle to maintain consistency across multiple conditions due to limited interaction between reference inputs. To address this, we introduce Sequence-Extended Latent Fusion (SELF), a unified input representation that dynamically serializes multiple reference images into a coherent latent sequence. During a dedicated training stage, all reference images are jointly constrained to fit within a fixed-length sequence under a global pixel-budget constraint. Building upon SELF, we propose a two-stage training framework comprising supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we jointly train on single-image editing and multi-image composition tasks to establish a robust generative prior. We adopt a progressive sequence length training strategy, in which all input images are initially resized to a total pixel budget of $1024^2$, and are then gradually increased to $1536^2$ and $2048^2$ to improve visual fidelity and cross-reference consistency. This gradual relaxation of compression enables the model to incrementally capture finer visual details while maintaining stable alignment across references. For the RL stage, we introduce Multi-Source GRPO (MSGRPO), to our knowledge the first reinforcement learning framework tailored for multi-reference image generation. MSGRPO optimizes the model to reconcile conflicting visual constraints, significantly enhancing compositional consistency. We will open-source the code, models, training data, and reward data for community research purposes.
摘要：我们推出了 UniRef-Image-Edit，这是一种高性能多模式生成系统，它将单图像编辑和多图像合成统一在一个框架内。由于参考输入之间的交互有限，现有的基于扩散的编辑方法通常难以在多种条件下保持一致性。为了解决这个问题，我们引入了序列扩展潜在融合（SELF），这是一种统一的输入表示，可将多个参考图像动态序列化为连贯的潜在序列。在专用训练阶段，所有参考图像都被联合约束以适应全局像素预算约束下的固定长度序列。基于 SELF，我们提出了一个两阶段训练框架，包括监督微调（SFT）和强化学习（RL）。在 SFT 阶段，我们联合训练单图像编辑和多图像合成任务，以建立强大的生成先验。我们采用渐进序列长度训练策略，其中所有输入图像最初调整为总像素预算为 $1024^2$，然后逐渐增加到 $1536^2$ 和 $2048^2$，以提高视觉保真度和交叉引用一致性。这种逐渐放松的压缩使模型能够逐渐捕获更精细的视觉细节，同时保持参考之间的稳定对齐。对于 RL 阶段，我们引入了多源 GRPO (MSGRPO)，据我们所知，这是第一个为多参考图像生成量身定制的强化学习框架。 MSGRPO 优化模型以协调冲突的视觉约束，显着增强构图一致性。我们将开源代码、模型、训练数据和奖励数据，用于社区研究目的。
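
Illustration: the abstract describes resizing all reference images to fit a total pixel budget ($1024^2$, later $1536^2$ and $2048^2$). The sketch below shows one simple way such a constraint could be met, by scaling every image with a single shared factor. The shared-factor rule, the helper name `fit_to_pixel_budget`, and the example sizes are assumptions; the abstract does not specify how the budget is divided among references.

```python
import math

def fit_to_pixel_budget(sizes, budget):
    """Scale (width, height) pairs by one shared factor so total pixels <= budget."""
    total = sum(w * h for w, h in sizes)
    if total <= budget:
        return list(sizes)  # already within budget, leave sizes untouched
    # Pixel count scales with the square of the linear factor, hence the square root.
    scale = math.sqrt(budget / total)
    # Flooring keeps the total at or under the budget; max(1, ...) avoids zero-sized images.
    return [(max(1, math.floor(w * scale)), max(1, math.floor(h * scale))) for w, h in sizes]

# Hypothetical reference images and the first-stage budget from the abstract.
reference_sizes = [(2048, 2048), (1024, 1024)]
budget = 1024 ** 2
resized = fit_to_pixel_budget(reference_sizes, budget)
total_pixels = sum(w * h for w, h in resized)
```

A shared factor preserves each image's aspect ratio and the relative sizes of the references; other allocation rules (for example, an equal share per image) would satisfy the same budget.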

Title: MAGE: All-[MASK] Block Already Knows Where to Look in Diffusion LLM

Authors: Omin Kwon, Yeonjae Kim, Doyeon Kim, Minseo Kim, Yeonhong Park, Jae W. Lee
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2602.14209
Pdf URL: https://arxiv.org/pdf/2602.14209
Copy Paste: [[2602.14209]] MAGE: All-[MASK] Block Already Knows Where to Look in Diffusion LLM(https://arxiv.org/abs/2602.14209)
Keywords: generation
Abstract: Block diffusion LLMs are emerging as a promising next paradigm for language generation, but their use of KV caching makes memory access a dominant bottleneck in long-context settings. While dynamic sparse attention has been actively explored, existing methods designed for autoregressive LLMs rely on approximate importance estimation and perform poorly when adapted to block diffusion. This work identifies a key opportunity unique to block diffusion: attention at the first All-[MASK] denoising step reliably predicts important KV entries and budget requirements, enabling MAGE to perform a single exact attention pass per block and reuse it for training-free sparse denoising. Across long-context benchmarks including LongBench and Needle-in-a-Haystack, MAGE achieves near-lossless accuracy with a fraction of the KV budget while delivering up to 3-4x end-to-end speedup, consistently outperforming AR-oriented sparse attention baselines. A lightweight fine-tuning strategy further strengthens [MASK]-guided patterns with minimal cost, requiring only a few hours of training on a single NVIDIA H100 GPU for both 1.5B and 7B models.
摘要：块扩散 LLM 正在成为语言生成的下一个有前途的范式，但它们对 KV 缓存的使用使内存访问成为长上下文设置中的主要瓶颈。虽然动态稀疏注意力已经被积极探索，但为自回归 LLM 设计的现有方法依赖于近似重要性估计，并且在适应块扩散时表现不佳。这项工作确定了块扩散独有的关键机会：第一个 All-[MASK] 去噪步骤的注意力能够可靠地预测重要的 KV 条目和预算要求，使 MAGE 能够对每个块执行一次精确的注意力传递，并将其重新用于免训练的稀疏去噪。在 LongBench 和 Needle-in-a-Haystack 等长上下文基准测试中，MAGE 以 KV 预算的一小部分实现了近乎无损的准确性，同时提供高达 3-4 倍的端到端加速，始终优于面向 AR 的稀疏注意力基线。轻量级微调策略以最小的成本进一步增强了 [MASK] 引导的模式，对于 1.5B 和 7B 模型，只需在单个 NVIDIA H100 GPU 上进行几个小时的训练。

Title: KernelBlaster: Continual Cross-Task CUDA Optimization via Memory-Augmented In-Context Reinforcement Learning

Authors: Kris Shengjun Dong, Sahil Modi, Dima Nikiforov, Sana Damani, Edward Lin, Siva Kumar Sastry Hari, Christos Kozyrakis
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2602.14293
Pdf URL: https://arxiv.org/pdf/2602.14293
Copy Paste: [[2602.14293]] KernelBlaster: Continual Cross-Task CUDA Optimization via Memory-Augmented In-Context Reinforcement Learning(https://arxiv.org/abs/2602.14293)
Keywords: generation
Abstract: Optimizing CUDA code across multiple generations of GPU architectures is challenging, as achieving peak performance requires an extensive exploration of an increasingly complex, hardware-specific optimization space. Traditional compilers are constrained by fixed heuristics, whereas finetuning Large Language Models (LLMs) can be expensive. However, agentic workflows for CUDA code optimization have limited ability to aggregate knowledge from prior exploration, leading to biased sampling and suboptimal solutions. We propose KernelBlaster, a Memory-Augmented In-context Reinforcement Learning (MAIC-RL) framework designed to improve CUDA optimization search capabilities of LLM-based GPU coding agents. KernelBlaster enables agents to learn from experience and make systematically informed decisions on future tasks by accumulating knowledge into a retrievable Persistent CUDA Knowledge Base. We propose a novel profile-guided, textual-gradient-based agentic flow for CUDA generation and optimization to achieve high performance across generations of GPU architectures. KernelBlaster guides LLM agents to systematically explore high-potential optimization strategies beyond naive rewrites. Compared to the PyTorch baseline, our method achieves geometric mean speedups of 1.43x, 2.50x, and 1.50x on KernelBench Levels 1, 2, and 3, respectively. We release KernelBlaster as an open-source agentic framework, accompanied by a test harness, verification components, and a reproducible evaluation pipeline.
摘要：跨多代 GPU 架构优化 CUDA 代码具有挑战性，因为实现峰值性能需要对日益复杂的特定于硬件的优化空间进行广泛探索。传统编译器受到固定启发式的限制，而微调大型语言模型 (LLM) 的成本可能很高。然而，CUDA 代码优化的代理工作流程从先前的探索中聚合知识的能力有限，导致采样有偏差和次优解决方案。我们提出了 KernelBlaster，这是一种内存增强型上下文强化学习 (MAIC-RL) 框架，旨在提高基于 LLM 的 GPU 编码代理的 CUDA 优化搜索能力。 KernelBlaster 使代理能够从经验中学习，并通过将知识积累到可检索的持久 CUDA 知识库中，对未来的任务做出系统的明智决策。我们提出了一种新颖的配置文件引导、基于文本梯度的代理流程，用于 CUDA 生成和优化，以实现跨代 GPU 架构的高性能。 KernelBlaster 指导 LLM 代理系统地探索超越简单重写的高潜力优化策略。与 PyTorch 基线相比，我们的方法在 KernelBench Level 1、2 和 3 上分别实现了 1.43 倍、2.50 倍和 1.50 倍的几何平均加速。我们将 KernelBlaster 作为开源代理框架发布，并附带测试工具、验证组件和可重复的评估管道。
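
Illustration: the abstract reports geometric mean speedups over the PyTorch baseline. As a minimal sketch of how such an aggregate is typically computed, the code below averages per-kernel speedups in log space. The helper name `geometric_mean_speedup` and the timings are made up for illustration and are not results from the paper.

```python
import math

def geometric_mean_speedup(baseline_ms, optimized_ms):
    """Geometric mean of per-kernel speedups (baseline time / optimized time)."""
    if len(baseline_ms) != len(optimized_ms) or not baseline_ms:
        raise ValueError("need two non-empty lists of equal length")
    # Averaging logs is the numerically stable form of the n-th root of a product.
    log_sum = sum(math.log(b / o) for b, o in zip(baseline_ms, optimized_ms))
    return math.exp(log_sum / len(baseline_ms))

# Hypothetical timings in milliseconds for three kernels (illustrative only).
baseline = [4.0, 9.0, 1.0]
optimized = [2.0, 3.0, 1.0]
speedup = geometric_mean_speedup(baseline, optimized)  # per-kernel speedups: 2x, 3x, 1x
```

The geometric mean is the usual choice for ratios because a 2x speedup and a 2x slowdown cancel out exactly, which an arithmetic mean of ratios would not reflect.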

Title: Machine Learning as a Tool (MLAT): A Framework for Integrating Statistical ML Models as Callable Tools within LLM Agent Workflows

Authors: Edwin Chen, Zulekha Bibi
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2602.14295
Pdf URL: https://arxiv.org/pdf/2602.14295
Copy Paste: [[2602.14295]] Machine Learning as a Tool (MLAT): A Framework for Integrating Statistical ML Models as Callable Tools within LLM Agent Workflows(https://arxiv.org/abs/2602.14295)
Keywords: generation
Abstract: We introduce Machine Learning as a Tool (MLAT), a design pattern in which pre-trained statistical machine learning models are exposed as callable tools within large language model (LLM) agent workflows. This allows an orchestrating agent to invoke quantitative predictions when needed and reason about their outputs in context. Unlike conventional pipelines that treat ML inference as a static preprocessing step, MLAT positions the model as a first-class tool alongside web search, database queries, and APIs, enabling the LLM to decide when and how to use it based on conversational context. To validate MLAT, we present PitchCraft, a pilot production system that converts discovery call recordings into professional proposals with ML-predicted pricing. The system uses two agents: a Research Agent that gathers prospect intelligence via parallel tool calls, and a Draft Agent that invokes an XGBoost pricing model as a tool call and generates a complete proposal through structured outputs. The pricing model, trained on 70 examples combining real and human-verified synthetic data, achieves R^2 = 0.807 on held-out data with a mean absolute error of 3688 USD. The system reduces proposal generation time from multiple hours to under 10 minutes. We describe the MLAT framework, structured output architecture, training methodology under extreme data scarcity, and sensitivity analysis demonstrating meaningful learned relationships. MLAT generalizes to domains requiring quantitative estimation combined with contextual reasoning.
摘要：我们引入了机器学习作为工具（MLAT），这是一种设计模式，其中预先训练的统计机器学习模型在大型语言模型（LLM）代理工作流程中作为可调用工具公开。这允许编排代理在需要时调用定量预测并在上下文中推理其输出。与将 ML 推理视为静态预处理步骤的传统流程不同，MLAT 将模型定位为与网络搜索、数据库查询和 API 并列的一流工具，使 LLM 能够根据对话上下文决定何时以及如何使用它。为了验证 MLAT，我们推出了 PitchCraft，这是一个试点生产系统，可将发现通话录音转换为具有 ML 预测定价的专业提案。该系统使用两个代理：一个通过并行工具调用收集潜在客户情报的研究代理，以及一个草案代理，它调用 XGBoost 定价模型作为工具调用并通过结构化输出生成完整的提案。该定价模型经过 70 个结合了真实数据和人工验证合成数据的示例的训练，在保留数据上实现了 R^2 = 0.807，平均绝对误差为 3688 美元。该系统将提案生成时间从几个小时缩短到 10 分钟以下。我们描述了 MLAT 框架、结构化输出架构、极端数据稀缺下的训练方法，以及展示有意义的学习关系的敏感性分析。 MLAT 推广到需要定量估计和上下文推理相结合的领域。
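
Illustration: the MLAT pattern exposes a trained model as a tool the agent can call on demand. The sketch below shows the general shape of such a setup with a small tool registry and a JSON tool-call dispatcher. Everything in it is hypothetical: the registry, the `predict_price` tool, its input features, and its coefficients are stand-ins, and a fixed linear formula replaces the XGBoost model mentioned in the abstract.

```python
import json
from typing import Callable, Dict

# Registry mapping tool names to plain Python callables.
TOOLS: Dict[str, Callable[..., dict]] = {}

def tool(name: str):
    """Register a function as a callable tool under the given name."""
    def decorator(fn):
        TOOLS[name] = fn
        return fn
    return decorator

@tool("predict_price")
def predict_price(num_features: int, team_size: int) -> dict:
    # Stand-in for a trained regressor; the coefficients are made up.
    estimate = 5000.0 + 1200.0 * num_features + 800.0 * team_size
    return {"price_usd": estimate}

def dispatch(tool_call_json: str) -> dict:
    """Execute a tool call of the form {"name": ..., "arguments": {...}}."""
    call = json.loads(tool_call_json)
    fn = TOOLS.get(call["name"])
    if fn is None:
        return {"error": f"unknown tool: {call['name']}"}
    return fn(**call["arguments"])

# In a real agent the LLM would emit this JSON; it is hard-coded here.
result = dispatch('{"name": "predict_price", "arguments": {"num_features": 3, "team_size": 2}}')
```

The point of the pattern is that the model sits behind the same interface as any other tool, so the agent decides at run time whether a quantitative prediction is needed and can reason about the returned value in context.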

Title: DeepFusion: Accelerating MoE Training via Federated Knowledge Distillation from Heterogeneous Edge Devices

Authors: Songyuan Li, Jia Hu, Ahmed M. Abdelmoniem, Geyong Min, Haojun Huang, Jiwei Huang
Subjects: cs.LG, cs.AI, cs.MA
Abstract URL: https://arxiv.org/abs/2602.14301
Pdf URL: https://arxiv.org/pdf/2602.14301
Copy Paste: [[2602.14301]] DeepFusion: Accelerating MoE Training via Federated Knowledge Distillation from Heterogeneous Edge Devices(https://arxiv.org/abs/2602.14301)
Keywords: generative
Abstract: Recent Mixture-of-Experts (MoE)-based large language models (LLMs) such as Qwen-MoE and DeepSeek-MoE are transforming generative AI in natural language processing. However, these models require vast and diverse training data. Federated learning (FL) addresses this challenge by leveraging private data from heterogeneous edge devices for privacy-preserving MoE training. Nonetheless, traditional FL approaches require devices to host local MoE models, which is impractical for resource-constrained devices due to large model sizes. To address this, we propose DeepFusion, the first scalable federated MoE training framework that enables the fusion of heterogeneous on-device LLM knowledge via federated knowledge distillation, yielding a knowledge-abundant global MoE model. Specifically, DeepFusion lets each device independently configure and train an on-device LLM tailored to its own needs and hardware limitations. Furthermore, we propose a novel View-Aligned Attention (VAA) module that integrates multi-stage feature representations from the global MoE model to construct a predictive perspective aligned with on-device LLMs, thereby enabling effective cross-architecture knowledge distillation. By explicitly aligning predictive perspectives, VAA resolves the view-mismatch problem in traditional federated knowledge distillation, which arises from heterogeneity in model architectures and prediction behaviors between on-device LLMs and the global MoE model. Experiments with industry-level MoE models (Qwen-MoE and DeepSeek-MoE) and real-world datasets (medical and finance) demonstrate that DeepFusion achieves performance close to centralized MoE training. Compared with key federated MoE baselines, DeepFusion reduces communication costs by up to 71% and improves token perplexity by up to 5.28%.
摘要：最近基于专家混合 (MoE) 的大型语言模型 (LLM)，例如 Qwen-MoE 和 DeepSeek-MoE，正在改变自然语言处理中的生成式 AI。然而，这些模型需要大量且多样化的训练数据。联邦学习 (FL) 通过利用异构边缘设备的私有数据进行隐私保护的 MoE 培训来解决这一挑战。尽管如此，传统的 FL 方法需要设备托管本地 MoE 模型，由于模型尺寸较大，这对于资源受限的设备来说是不切实际的。为了解决这个问题，我们提出了 DeepFusion，这是第一个可扩展的联合 MoE 训练框架，它能够通过联合知识蒸馏融合异构设备上的 LLM 知识，从而产生知识丰富的全局 MoE 模型。具体来说，DeepFusion 的特点是每台设备都可以独立配置和训练根据自身需求和硬件限制量身定制的设备上 LLM。此外，我们提出了一种新颖的视图对齐注意力（VAA）模块，该模块集成了全局 MoE 模型的多阶段特征表示，以构建与设备上 LLM 一致的预测视角，从而实现有效的跨架构知识蒸馏。通过明确调整预测视角，VAA 解决了传统联合知识蒸馏中的视图不匹配问题，该问题是由于设备上 LLM 与全局 MoE 模型之间的模型架构和预测行为的异质性而引起的。对行业级 MoE 模型（Qwen-MoE 和 DeepSeek-MoE）和现实世界数据集（医疗和金融）的实验表明，DeepFusion 的性能接近集中式 MoE 训练。与关键的联合 MoE 基准相比，DeepFusion 降低了高达 71% 的通信成本，并将令牌复杂度提高了高达 5.28%。

Title: A Generative AI Approach for Reducing Skin Tone Bias in Skin Cancer Classification

Authors: Areez Muhammed Shabu, Mohammad Samar Ansari, Asra Aslam
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.14356
Pdf URL: https://arxiv.org/pdf/2602.14356
Copy Paste: [[2602.14356]] A Generative AI Approach for Reducing Skin Tone Bias in Skin Cancer Classification(https://arxiv.org/abs/2602.14356)
Keywords: generative
Abstract: Skin cancer is one of the most common cancers worldwide, and early detection is critical for effective treatment. However, current AI diagnostic tools are often trained on datasets dominated by lighter skin tones, leading to reduced accuracy and fairness for people with darker skin. The International Skin Imaging Collaboration (ISIC) dataset, one of the most widely used benchmarks, contains over 70% light-skin images, while dark-skin images account for fewer than 8%. This imbalance poses a significant barrier to equitable healthcare delivery and highlights the urgent need for methods that address demographic diversity in medical imaging. This paper addresses the challenge of skin tone imbalance in automated skin cancer detection using dermoscopic images. To overcome it, we present a generative augmentation pipeline that fine-tunes a pre-trained Stable Diffusion model using Low-Rank Adaptation (LoRA) on the dark-skin image subset of the ISIC dataset and generates synthetic dermoscopic images conditioned on lesion type and skin tone. In this study, we investigate the utility of these images on two downstream tasks: lesion segmentation and binary classification. For segmentation, models trained on the augmented dataset and evaluated on held-out real images show consistent improvements in IoU, Dice coefficient, and boundary accuracy. These evaluations serve to validate the generated dataset. For classification, an EfficientNet-B0 model trained on the augmented dataset achieved 92.14% accuracy. This paper demonstrates that synthetic data augmentation with generative AI can substantially reduce bias and increase fairness in conventional dermatological diagnostics, and it raises open challenges for future work.
摘要：皮肤癌是全世界最常见的癌症之一，早期发现对于有效治疗至关重要。然而，当前的人工智能诊断工具通常是在以浅肤色为主的数据集上进行训练，导致深色皮肤人群的准确性和公平性降低。国际皮肤成像协作 (ISIC) 数据集是最广泛使用的基准之一，包含超过 70% 的浅色皮肤图像，而深色皮肤图像不到 8%。这种不平衡对公平的医疗保健服务构成了重大障碍，并凸显了迫切需要解决医学成像中人口多样性问题的方法。本文解决了使用皮肤镜图像自动皮肤癌检测中肤色不平衡的挑战。为了克服这个问题，我们提出了一种生成增强管道，该管道使用低秩适应（LoRA）对 ISIC 数据集的图像深色皮肤子集微调预训练的稳定扩散模型，并生成以病变类型和肤色为条件的合成皮肤镜图像。在本研究中，我们研究了这些图像在两个下游任务中的效用：病变分割和二元分类。对于分割，在增强数据集上训练并在保留的真实图像上进行评估的模型显示出 IoU、Dice 系数和边界精度的持续改进。这些评估提供了生成数据集的验证。对于分类，在增强数据集上训练的 EfficientNet-B0 模型达到了 92.14% 的准确率。本文证明，通过生成人工智能集成增强合成数据可以显着减少偏差，提高传统皮肤病诊断的公平性，并为未来的方向提出挑战。
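
Illustration: the segmentation results are reported in terms of IoU and the Dice coefficient. For reference, the sketch below computes both from binary masks using their standard definitions. The function name `iou_and_dice`, the nested-list mask format, and the toy masks are assumptions for illustration, and the empty-mask convention is one common choice rather than anything stated in the abstract.

```python
def iou_and_dice(pred, target):
    """Return (IoU, Dice) for two binary masks given as nested lists of 0/1."""
    flat_pred = [p for row in pred for p in row]
    flat_target = [t for row in target for t in row]
    if len(flat_pred) != len(flat_target):
        raise ValueError("masks must contain the same number of pixels")
    intersection = sum(1 for p, t in zip(flat_pred, flat_target) if p and t)
    pred_area = sum(1 for p in flat_pred if p)
    target_area = sum(1 for t in flat_target if t)
    union = pred_area + target_area - intersection
    if union == 0:
        return 1.0, 1.0  # convention used here: two empty masks match perfectly
    iou = intersection / union
    dice = 2 * intersection / (pred_area + target_area)
    return iou, dice

# Hypothetical 2x3 predicted and ground-truth lesion masks.
pred = [[1, 1, 0], [0, 1, 0]]
target = [[1, 0, 0], [0, 1, 1]]
iou, dice = iou_and_dice(pred, target)
```

The two metrics are monotonically related (Dice = 2·IoU / (1 + IoU)), so they rank models the same way on a single image, but Dice penalizes partial overlap less harshly.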

Title: Adapting VACE for Real-Time Autoregressive Video Diffusion

Authors: Ryan Fosdick (Daydream)
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2602.14381
Pdf URL: https://arxiv.org/pdf/2602.14381
Copy Paste: [[2602.14381]] Adapting VACE for Real-Time Autoregressive Video Diffusion(https://arxiv.org/abs/2602.14381)
Keywords: generation
Abstract: We describe an adaptation of VACE (Video All-in-one Creation and Editing) for real-time autoregressive video generation. VACE provides unified video control (reference guidance, structural conditioning, inpainting, and temporal extension) but assumes bidirectional attention over full sequences, making it incompatible with streaming pipelines that require fixed chunk sizes and causal attention. The key modification moves reference frames from the diffusion latent space into a parallel conditioning pathway, preserving the fixed chunk sizes and KV caching that autoregressive models require. This adaptation reuses existing pretrained VACE weights without additional training. Across 1.3B and 14B model scales, VACE adds 20-30% latency overhead for structural control and inpainting, with negligible VRAM cost relative to the base model. Reference-to-video fidelity is severely degraded compared to batch VACE due to causal attention constraints. A reference implementation is available at this https URL.
摘要：我们描述了用于实时自回归视频生成的 VACE（视频一体化创建和编辑）的改编。 VACE 提供统一的视频控制（参考指导、结构调节、修复和时间扩展），但假设对整个序列进行双向关注，这使其与需要固定块大小和因果关注的流媒体管道不兼容。关键修改将参考帧从扩散潜在空间移动到并行调节路径，保留自回归模型所需的固定块大小和 KV 缓存。此调整重复使用现有的预训练 VACE 权重，无需额外训练。在 1.3B 和 14B 模型规模中，VACE 为结构控制和修复增加了 20-30% 的延迟开销，而相对于基本模型，VRAM 成本可以忽略不计。由于因果注意力限制，与批量 VACE 相比，视频参考保真度严重下降。此 https URL 提供了参考实现。

Title: Controlling Your Image via Simplified Vector Graphics

Authors: Lanqing Guo, Xi Liu, Yufei Wang, Zhihao Li, Siyu Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.14443
Pdf URL: https://arxiv.org/pdf/2602.14443
Copy Paste: [[2602.14443]] Controlling Your Image via Simplified Vector Graphics(https://arxiv.org/abs/2602.14443)
Keywords: generation
Abstract: Recent advances in image generation have achieved remarkable visual quality, but a fundamental challenge remains: can image generation be controlled at the element level, enabling intuitive modifications such as adjusting shapes, altering colors, or adding and removing objects? In this work, we address this challenge by introducing layer-wise controllable generation through simplified vector graphics (VGs). Our approach first efficiently parses images into hierarchical VG representations that are semantically aligned and structurally coherent. Building on this representation, we design a novel image synthesis framework guided by VGs, allowing users to freely modify elements and seamlessly translate these edits into photorealistic outputs. By leveraging the structural and semantic features of VGs in conjunction with noise prediction, our method provides precise control over geometry, color, and object semantics. Extensive experiments demonstrate the effectiveness of our approach in diverse applications, including image editing, object-level manipulation, and fine-grained content creation, establishing a new paradigm for controllable image generation. Project page: this https URL
摘要：图像生成方面的最新进展已经实现了卓越的视觉质量，但仍然存在一个根本挑战：是否可以在元素级别控制图像生成，从而实现直观的修改，例如调整形状、改变颜色或添加和删除对象？在这项工作中，我们通过简化矢量图形（VG）引入分层可控生成来解决这一挑战。我们的方法首先有效地将图像解析为语义对齐且结构一致的分层 VG 表示。在此表示的基础上，我们设计了一种由 VG 引导的新颖图像合成框架，允许用户自由修改元素并将这些编辑无缝转换为逼真的输出。通过利用 VG 的结构和语义特征以及噪声预测，我们的方法提供了对几何、颜色和对象语义的精确控制。大量的实验证明了我们的方法在各种应用中的有效性，包括图像编辑、对象级操作和细粒度内容创建，为可控图像生成建立了新的范例。项目页面：此 https URL

Title: CoCoDiff: Correspondence-Consistent Diffusion Model for Fine-grained Style Transfer

Authors: Wenbo Nie, Zixiang Li, Renshuai Tao, Bin Wu, Yunchao Wei, Yao Zhao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2602.14464
Pdf URL: https://arxiv.org/pdf/2602.14464
Copy Paste: [[2602.14464]] CoCoDiff: Correspondence-Consistent Diffusion Model for Fine-grained Style Transfer(https://arxiv.org/abs/2602.14464)
Keywords: generative
Abstract: Transferring visual style between images while preserving semantic correspondence between similar objects remains a central challenge in computer vision. While existing methods have made great strides, most of them operate at the global level and overlook region-wise and even pixel-wise semantic correspondence. To address this, we propose CoCoDiff, a novel training-free and low-cost style transfer framework that leverages pretrained latent diffusion models to achieve fine-grained, semantically consistent stylization. We identify that correspondence cues within generative diffusion models are under-explored and that content consistency across semantically matched regions is often neglected. CoCoDiff introduces a pixel-wise semantic correspondence module that mines intermediate diffusion features to construct a dense alignment map between content and style images. Furthermore, a cycle-consistency module enforces structural and perceptual alignment across iterations, yielding object- and region-level stylization that preserves geometry and detail. Despite requiring no additional training or supervision, CoCoDiff delivers state-of-the-art visual quality and strong quantitative results, outperforming methods that rely on extra training or annotations.
摘要：在图像之间转移视觉风格，同时保留相似对象之间的语义对应关系仍然是计算机视觉中的核心挑战。虽然现有方法已经取得了长足的进步，但大多数方法在全局层面上运行，但忽略了区域甚至像素层面的语义对应。为了解决这个问题，我们提出了 CoCoDiff，这是一种新颖的免训练且低成本的风格转移框架，它利用预训练的潜在扩散模型来实现细粒度、语义一致的风格化。我们发现生成扩散模型中的对应线索尚未得到充分探索，并且语义匹配区域之间的内容一致性经常被忽视。 CoCoDiff 引入了像素级语义对应模块，该模块挖掘中间扩散特征以构建内容和风格图像之间的密集对齐图。此外，循环一致性模块随后在迭代中强制执行结构和感知对齐，从而产生保留几何和细节的对象和区域级别的风格化。尽管不需要额外的培训或监督，CoCoDiff 仍提供最先进的视觉质量和强大的定量结果，优于依赖额外培训或注释的方法。

Title: TikArt: Aperture-Guided Observation for Fine-Grained Visual Reasoning via Reinforcement Learning

Authors: Hao Ding, Zhichuan Yang, Weijie Ge, Ziqin Gao, Chaoyi Lu, Lei Zhao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2602.14482
Pdf URL: https://arxiv.org/pdf/2602.14482
Copy Paste: [[2602.14482]] TikArt: Aperture-Guided Observation for Fine-Grained Visual Reasoning via Reinforcement Learning(https://arxiv.org/abs/2602.14482)
Keywords: generation
Abstract: We address fine-grained visual reasoning in multimodal large language models (MLLMs), where key evidence may reside in tiny objects, cluttered regions, or subtle markings that are lost under a single global image encoding. We introduce TikArt (Thinking Aperture), an aperture-guided agent that casts multi-step vision-language reasoning as a decision process over regions of interest. TikArt follows a Think-Aperture-Observe loop, alternating between language generation and two aperture actions: Zoom extracts rectangular crops, while Segment invokes SAM2 to obtain mask-based crops for irregular targets. After every action, the model must produce an explicit observation, turning local visual cues into persistent linguistic memory. Built on Qwen3-VL-8B, TikArt optimizes its reasoning policy with AGRPO, a GRPO-style reinforcement learning algorithm with a two-stage curriculum: it warms up segmentation actions and then jointly optimizes visual math, fine-grained VQA, and segmentation, using rewards that couple task success with purposeful aperture use. Experiments on V*, HR-Bench-4K/8K, MME-RealWorld-Lite, MMStar, RefCOCO, and ReasonSeg show consistent gains over the backbone and yield interpretable aperture trajectories for high-resolution reasoning.
摘要：我们解决多模态大语言模型（MLLM）中的细粒度视觉推理问题，其中关键证据可能驻留在微小物体、杂乱区域或在单个全局图像编码下丢失的微妙标记中。我们引入了 TikArt（Thinking Aperture），这是一种光圈引导代理，可将多步视觉语言推理作为感兴趣区域的决策过程。 TikArt 遵循 Think-Aperture-Observe 循环，在语言生成和两个光圈动作之间交替：Zoom 提取矩形裁剪，而 Segment 调用 SAM2 为不规则目标获取基于掩模的裁剪。在每个动作之后，模型必须产生明确的观察，将局部视觉线索转化为持久的语言记忆。 TikArt 基于 Qwen3-VL-8B 构建，使用 AGRPO 优化其推理策略，AGRPO 是一种具有两阶段课程的 GRPO 式强化学习算法：它预热分割动作，然后联合优化视觉数学、细粒度 VQA 和分割，使用将任务成功与有目的孔径使用相结合的奖励。 V*、HR-Bench-4K/8K、MME-RealWorld-Lite、MMStar、RefCOCO 和 ReasonSeg 上的实验显示了主干网的一致增益，并产生用于高分辨率推理的可解释孔径轨迹。

Title: MedVAR: Towards Scalable and Efficient Medical Image Generation via Next-scale Autoregressive Prediction

Authors: Zhicheng He, Yunpeng Zhao, Junde Wu, Ziwei Niu, Zijun Li, Lanfen Lin, Yueming Jin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.14512
Pdf URL: https://arxiv.org/pdf/2602.14512
Copy Paste: [[2602.14512]] MedVAR: Towards Scalable and Efficient Medical Image Generation via Next-scale Autoregressive Prediction(https://arxiv.org/abs/2602.14512)
Keywords: generation, generative
Abstract: Medical image generation is pivotal in applications like data augmentation for low-resource clinical tasks and privacy-preserving data sharing. However, developing a scalable generative backbone for medical imaging requires architectural efficiency, sufficient multi-organ data, and principled evaluation, yet current approaches leave these aspects unresolved. Therefore, we introduce MedVAR, the first autoregressive-based foundation model that adopts the next-scale prediction paradigm to enable fast and scale-up-friendly medical image synthesis. MedVAR generates images in a coarse-to-fine manner and produces structured multi-scale representations suitable for downstream use. To support hierarchical generation, we curate a harmonized dataset of around 440,000 CT and MRI images spanning six anatomical regions. Comprehensive experiments across fidelity, diversity, and scalability show that MedVAR achieves state-of-the-art generative performance and offers a promising architectural direction for future medical generative foundation models.
摘要：医学图像生成在低资源临床任务的数据增强和隐私保护数据共享等应用中至关重要。然而，为医学成像开发可扩展的生成主干需要架构效率、足够的多器官数据和原则性评估，但目前的方法尚未解决这些问题。因此，我们引入了 MedVAR，这是第一个基于自回归的基础模型，采用下一代规模预测范式，以实现快速且扩展友好的医学图像合成。 MedVAR 以从粗到细的方式生成图像，并生成适合下游使用的结构化多尺度表示。为了支持分层生成，我们整理了一个统一的数据集，其中包含跨越六个解剖区域的约 440,000 张 CT 和 MRI 图像。保真度、多样性和可扩展性方面的综合实验表明，MedVAR 实现了最先进的生成性能，并为未来的医学生成基础模型提供了有前景的架构方向。

Title: Efficient Text-Guided Convolutional Adapter for the Diffusion Model

Authors: Aryan Das, Koushik Biswas, Swalpa Kumar Roy, Badri Narayana Patro, Vinay Kumar Verma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.14514
Pdf URL: https://arxiv.org/pdf/2602.14514
Copy Paste: [[2602.14514]] Efficient Text-Guided Convolutional Adapter for the Diffusion Model(https://arxiv.org/abs/2602.14514)
Keywords: generation
Abstract: We introduce Nexus Adapters, novel text-guided efficient adapters for the diffusion-based framework for Structure Preserving Conditional Generation (SPCG). Recently, structure-preserving methods have achieved promising results in conditional image generation by using a base model for prompt conditioning and an adapter for structure input, such as sketches or depth maps. These approaches are highly inefficient and sometimes require as many parameters in the adapter as in the base architecture. Training such a model is not always feasible, since the diffusion model is itself costly and doubling the parameter count is highly inefficient. In these approaches, the adapter is not aware of the input prompt; it is therefore optimized only for the structural input, not for the input prompt. To overcome these challenges, we propose two efficient adapters, Nexus Prime and Nexus Slim, which are guided by both prompts and structural inputs. Each Nexus Block incorporates cross-attention mechanisms to enable rich multimodal conditioning, so the proposed adapters better understand the input prompt while preserving the structure. We conducted extensive experiments on the proposed models and demonstrated that the Nexus Prime adapter significantly enhances performance while requiring only 8M additional parameters compared to the baseline, T2I-Adapter. Furthermore, we also introduce a lightweight Nexus Slim adapter with 18M fewer parameters than the T2I-Adapter, which still achieves state-of-the-art results. Code: this https URL
摘要：我们引入了 Nexus Adapters，这是一种新颖的文本引导的高效适配器，适用于基于扩散的结构保留条件生成（SPCG）框架。最近，通过使用用于即时调节的基本模型和用于结构输入的适配器（例如草图或深度图），结构保留方法在条件图像生成方面取得了有希望的结果。与基本架构相比，这些方法效率非常低，有时需要适配器中的相同参数。训练模型并不总是可行，因为扩散模型本身成本高昂，而且参数加倍效率极低。在这些方法中，适配器不知道输入提示；因此，它仅对于结构输入是最优的，而对于输入提示则不是最优的。为了克服上述挑战，我们提出了两种高效的适配器，Nexus Prime 和 Slim，它们以提示和结构输入为指导。每个 Nexus Block 都包含交叉注意力机制，以实现丰富的多模式调节。因此，所提出的适配器在保留结构的同时可以更好地理解输入提示。我们对所提出的模型进行了广泛的实验，并证明 Nexus Prime 适配器显着增强了性能，与基准 T2I-Adapter 相比，仅需要 8M 的额外参数。此外，我们还推出了一款轻量级的 Nexus Slim 适配器，其参数比 T2I-Adapter 少了 18M，但仍然取得了最先进的结果。代码：这个https URL

Title: MoRL: Reinforced Reasoning for Unified Motion Understanding and Generation

Authors: Hongpeng Wang, Zeyu Zhang, Wenhao Li, Hao Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.14534
Pdf URL: https://arxiv.org/pdf/2602.14534
Copy Paste: [[2602.14534]] MoRL: Reinforced Reasoning for Unified Motion Understanding and Generation(https://arxiv.org/abs/2602.14534)
Keywords: generation
Abstract: Human motion understanding and generation are crucial for vision and robotics but remain limited in reasoning capability and test-time planning. We propose MoRL, a unified multimodal motion model trained with supervised fine-tuning and reinforcement learning with verifiable rewards. Our task-specific reward design combines semantic alignment and reasoning coherence for understanding with physical plausibility and text-motion consistency for generation, improving both logical reasoning and perceptual realism. To further enhance inference, we introduce Chain-of-Motion (CoM), a test-time reasoning method that enables step-by-step planning and reflection. We also construct two large-scale CoT datasets, MoUnd-CoT-140K and MoGen-CoT-140K, to align motion sequences with reasoning traces and action descriptions. Experiments on HumanML3D and KIT-ML show that MoRL achieves significant gains over state-of-the-art baselines. Code: this https URL. Website: this https URL.
摘要：人体运动理解和生成对于视觉和机器人技术至关重要，但在推理能力和测试时间规划方面仍然有限。我们提出了 MoRL，一种统一的多模态运动模型，通过监督微调和强化学习进行训练，并具有可验证的奖励。我们针对特定任务的奖励设计将用于理解的语义对齐和推理连贯性与用于生成的物理合理性和文本动作一致性相结合，从而提高了逻辑推理和感知现实性。为了进一步增强推理，我们引入了运动链（CoM），这是一种测试时推理方法，可以实现逐步规划和反思。我们还构建了两个大型 CoT 数据集 MoUnd-CoT-140K 和 MoGen-CoT-140K，将运动序列与推理轨迹和动作描述对齐。 HumanML3D 和 KIT-ML 上的实验表明，MoRL 比最先进的基线取得了显着的进步。代码：此 https URL。网站：此 https URL。

Title: DriveFine: Refining-Augmented Masked Diffusion VLA for Precise and Robust Driving

Authors: Chenxu Dang, Sining Ang, Yongkang Li, Haochen Tian, Jie Wang, Guang Li, Hangjun Ye, Jie Ma, Long Chen, Yan Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.14577
Pdf URL: https://arxiv.org/pdf/2602.14577
Copy Paste: [[2602.14577]] DriveFine: Refining-Augmented Masked Diffusion VLA for Precise and Robust Driving(https://arxiv.org/abs/2602.14577)
Keywords: generation, generative
Abstract: Vision-Language-Action (VLA) models for autonomous driving increasingly adopt generative planners trained with imitation learning followed by reinforcement learning. Diffusion-based planners suffer from modality alignment difficulties, low training efficiency, and limited generalization. Token-based planners are plagued by cumulative causal errors and irreversible decoding. In summary, the two dominant paradigms exhibit complementary strengths and weaknesses. In this paper, we propose DriveFine, a masked diffusion VLA model that combines flexible decoding with self-correction capabilities. In particular, we design a novel plug-and-play block-MoE, which seamlessly injects a refinement expert on top of the generation expert. By enabling explicit expert selection during inference and gradient blocking during training, the two experts are fully decoupled, preserving the foundational capabilities and generic patterns of the pretrained weights, which highlights the flexibility and extensibility of the block-MoE design. Furthermore, we design a hybrid reinforcement learning strategy that encourages effective exploration of the refinement expert while maintaining training stability. Extensive experiments on NAVSIM v1, v2, and Navhard benchmarks demonstrate that DriveFine exhibits strong efficacy and robustness. The code will be released at this https URL.
摘要：自动驾驶的视觉-语言-动作（VLA）模型越来越多地采用经过模仿学习和强化学习训练的生成规划器。基于扩散的规划器面临模态对齐困难、训练效率低和泛化能力有限的问题。基于代币的规划者受到累积因果错误和不可逆解码的困扰。总之，两种主导范式表现出互补的优点和缺点。在本文中，我们提出了 DriveFine，一种结合了灵活解码和自校正功能的掩模扩散 VLA 模型。特别是，我们设计了一种新颖的即插即用块 MoE，它可以在生成专家之上无缝地注入细化专家。通过在推理过程中实现显式专家选择，在训练过程中实现梯度分块，两位专家完全解耦，保留了预训练权重的基础能力和通用模式，这凸显了分块MoE设计的灵活性和可扩展性。此外，我们设计了一种混合强化学习策略，鼓励细化专家的有效探索，同时保持训练稳定性。对 NAVSIM v1、v2 和 Navhard 基准的大量实验表明，DriveFine 具有强大的功效和鲁棒性。代码将在此 https URL 发布。

Title: VIGIL: Tackling Hallucination Detection in Image Recontextualization

Authors: Joanna Wojciechowicz, Maria Łubniewska, Jakub Antczak, Justyna Baczyńska, Wojciech Gromski, Wojciech Kozłowski, Maciej Zięba
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.14633
Pdf URL: https://arxiv.org/pdf/2602.14633
Copy Paste: [[2602.14633]] VIGIL: Tackling Hallucination Detection in Image Recontextualization(https://arxiv.org/abs/2602.14633)
Keywords: generative
Abstract: We introduce VIGIL (Visual Inconsistency & Generative In-context Lucidity), the first benchmark dataset and framework providing a fine-grained categorization of hallucinations in the multimodal image recontextualization task for large multimodal models (LMMs). While existing research often treats hallucinations as a uniform issue, our work addresses a significant gap in multimodal evaluation by decomposing these errors into five categories: pasted object hallucinations, background hallucinations, object omission, positional & logical inconsistencies, and physical law violations. To address these complexities, we propose a multi-stage detection pipeline. Our architecture processes recontextualized images through a series of specialized steps targeting object-level fidelity, background consistency, and omission detection, leveraging a coordinated ensemble of open-source models, whose effectiveness is demonstrated through extensive experimental evaluations. Our approach explains where the models fail and thereby enables a deeper understanding of these failures; it fills a gap in the field, as no prior method offers such categorization and decomposition for this task. To promote transparency and further exploration, we openly release VIGIL, along with the detection pipeline and benchmark code, through our GitHub repository: this https URL and Data repository: this https URL.
摘要：我们引入了 VIGIL（视觉不一致和生成上下文清晰度），这是第一个基准数据集和框架，为大型多模态模型 (LMM) 的多模态图像重新上下文化任务中的幻觉提供细粒度的分类。虽然现有的研究通常将幻觉视为一个统一的问题，但我们的工作通过将这些错误分解为五类来解决多模式评估中的重大差距：粘贴物体幻觉、背景幻觉、物体遗漏、位置和逻辑不一致以及违反物理定律。为了解决这些复杂性，我们提出了多级检测管道。我们的架构通过一系列针对对象级保真度、背景一致性和遗漏检测的专门步骤来处理重新上下文化的图像，利用开源模型的协调集合，其有效性通过广泛的实验评估得到证明。我们的方法可以通过解释更深入地理解模型失败的地方；因此，我们填补了该领域的空白，因为之前没有方法为该任务提供此类分类和分解。为了提高透明度和进一步探索，我们通过 GitHub 存储库：此 https URL 和数据存储库：此 https URL 公开发布 VIGIL 以及检测管道和基准代码。

Title: SketchingReality: From Freehand Scene Sketches To Photorealistic Images

Authors: Ahmed Bourouis, Mikhail Bessmeltsev, Yulia Gryaditskaya
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.14648
Pdf URL: https://arxiv.org/pdf/2602.14648
Copy Paste: [[2602.14648]] SketchingReality: From Freehand Scene Sketches To Photorealistic Images(https://arxiv.org/abs/2602.14648)
Keywords: generation, generative
Abstract: Recent years have witnessed remarkable progress in generative AI, with natural language emerging as the most common conditioning input. As underlying models grow more powerful, researchers are exploring increasingly diverse conditioning signals, such as depth maps, edge maps, camera parameters, and reference images, to give users finer control over generation. Among different modalities, sketches are a natural and long-standing form of human communication, enabling rapid expression of visual concepts. Previous literature has largely focused on edge maps, often misnamed 'sketches', yet algorithms that effectively handle true freehand sketches, with their inherent abstraction and distortions, remain underexplored. We pursue the challenging goal of balancing photorealism with sketch adherence when generating images from freehand input. A key obstacle is the absence of ground-truth, pixel-aligned images: by their nature, freehand sketches do not have a single correct alignment. To address this, we propose a modulation-based approach that prioritizes semantic interpretation of the sketch over strict adherence to individual edge positions. We further introduce a novel loss that enables training on freehand sketches without requiring ground-truth pixel-aligned images. We show that our method outperforms existing approaches in both semantic alignment with freehand sketch inputs and in the realism and overall quality of the generated images.
摘要：近年来，生成式人工智能取得了显着进展，自然语言成为最常见的条件输入。随着底层模型变得越来越强大，研究人员正在探索日益多样化的调节信号，例如深度图、边缘图、相机参数和参考图像，以便用户更好地控制生成。在不同的形式中，草图是人类交流的一种自然且长期存在的形式，能够快速表达视觉概念。以前的文献主要集中在边缘图上，经常被误称为“草图”，但有效处理真实手绘草图的算法及其固有的抽象和扭曲仍然没有得到充分探索。我们追求的挑战性目标是在从徒手输入生成图像时平衡照片真实感和草图依从性。一个主要障碍是缺乏真实的、像素对齐的图像：就其本质而言，手绘草图没有单一的正确对齐方式。为了解决这个问题，我们提出了一种基于调制的方法，该方法优先考虑草图的语义解释，而不是严格遵守各个边缘位置。我们进一步引入了一种新颖的损失，可以在不需要真实像素对齐图像的情况下对徒手草图进行训练。我们表明，我们的方法在与手绘草图输入的语义对齐以及生成图像的真实性和整体质量方面都优于现有方法。

Title: Exposing Diversity Bias in Deep Generative Models: Statistical Origins and Correction of Diversity Error

Authors: Farzan Farnia, Mohammad Jalali, Azim Ospanov
Subjects: cs.LG, cs.AI, cs.CV, math.OC
Abstract URL: https://arxiv.org/abs/2602.14682
Pdf URL: https://arxiv.org/pdf/2602.14682
Copy Paste: [[2602.14682]] Exposing Diversity Bias in Deep Generative Models: Statistical Origins and Correction of Diversity Error(https://arxiv.org/abs/2602.14682)
Keywords: generative
Abstract: Deep generative models have achieved great success in producing high-quality samples, making them a central tool across machine learning applications. Beyond sample quality, an important yet less systematically studied question is whether trained generative models faithfully capture the diversity of the underlying data distribution. In this work, we address this question by directly comparing the diversity of samples generated by state-of-the-art models with that of test samples drawn from the target data distribution, using recently proposed reference-free entropy-based diversity scores, Vendi and RKE. Across multiple benchmark datasets, we find that test data consistently attains substantially higher Vendi and RKE diversity scores than the generated samples, suggesting a systematic downward diversity bias in modern generative models. To understand the origin of this bias, we analyze the finite-sample behavior of entropy-based diversity scores and show that their expected values increase with sample size, implying that diversity estimated from finite training sets could inherently underestimate the diversity of the true distribution. As a result, optimizing the generators to minimize divergence to empirical data distributions would induce a loss of diversity. Finally, we discuss potential diversity-aware regularization and guidance strategies based on Vendi and RKE as principled directions for mitigating this bias, and provide empirical evidence suggesting their potential to improve the results.
摘要：深度生成模型在生成高质量样本方面取得了巨大成功，使其成为机器学习应用程序的核心工具。除了样本质量之外，一个重要但较少系统研究的问题是经过训练的生成模型是否忠实地捕获了基础数据分布的多样性。在这项工作中，我们通过使用最近提出的基于无参考熵的多样性评分、Vendi 和 RKE，直接比较最先进模型生成的样本的多样性与从目标数据分布中提取的测试样本的多样性来解决这个问题。在多个基准数据集中，我们发现测试数据始终比生成的样本获得更高的 Vendi 和 RKE 多样性分数，这表明现代生成模型中存在系统性的多样性下降偏差。为了理解这种偏差的根源，我们分析了基于熵的多样性得分的有限样本行为，并表明它们的预期值随着样本大小的增加而增加，这意味着从有限训练集估计的多样性可能本质上低估了真实分布的多样性。因此，优化生成器以最小化与经验数据分布的分歧将导致多样性的损失。最后，我们讨论了基于 Vendi 和 RKE 的潜在多样性意识正则化和指导策略，作为减轻这种偏差的原则方向，并提供了表明其改善结果潜力的经验证据。
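
Note: the finite-sample effect described in the abstract above can be illustrated with a minimal Python sketch. It assumes the commonly used definitions of the two scores: Vendi as the exponential of the Shannon entropy of the eigenvalues of the normalized kernel matrix K/n, and RKE as the inverse of the sum of the squared eigenvalues (the order-2 Rényi counterpart). The Gaussian kernel, its bandwidth, the synthetic Gaussian data, and every function name below are assumptions made for illustration only, not the authors' experimental setup.

```python
import numpy as np


def gaussian_kernel(x, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix for the rows of x; the diagonal is 1."""
    sq = np.sum(x ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * x @ x.T, 0.0)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))


def kernel_eigenvalues(k):
    """Eigenvalues of K/n, clipped at zero to absorb round-off."""
    n = k.shape[0]
    return np.clip(np.linalg.eigvalsh(k / n), 0.0, None)


def vendi_score(k):
    """exp(Shannon entropy) of the eigenvalues of K/n."""
    eig = kernel_eigenvalues(k)
    eig = eig[eig > 1e-12]  # treat 0 * log(0) as 0
    return float(np.exp(-np.sum(eig * np.log(eig))))


def rke_score(k):
    """Inverse of the sum of squared eigenvalues of K/n (order-2 score)."""
    eig = kernel_eigenvalues(k)
    return float(1.0 / np.sum(eig ** 2))


def mean_scores(sample_size, trials=20, dim=8, seed=0):
    """Average both scores over repeated draws from one fixed distribution."""
    rng = np.random.default_rng(seed)
    vendi, rke = [], []
    for _ in range(trials):
        x = rng.normal(size=(sample_size, dim))  # synthetic data (assumption)
        k = gaussian_kernel(x, bandwidth=2.0)
        vendi.append(vendi_score(k))
        rke.append(rke_score(k))
    return float(np.mean(vendi)), float(np.mean(rke))


if __name__ == "__main__":
    for n in (8, 32, 128):
        v, r = mean_scores(n)
        print(f"n={n:4d}  mean Vendi={v:.2f}  mean RKE={r:.2f}")
```

The underlying distribution is the same for every sample size, so any growth of the averaged scores with n comes from the sample size alone. This is the behavior the abstract identifies as the source of the downward bias when diversity is estimated from a finite set.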

Title: D2-LoRA: A Synergistic Approach to Differential and Directional Low-Rank Adaptation

Authors: Nozomu Fujisawa, Masaaki Kondo
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2602.14728
Pdf URL: https://arxiv.org/pdf/2602.14728
Copy Paste: [[2602.14728]] D2-LoRA: A Synergistic Approach to Differential and Directional Low-Rank Adaptation(https://arxiv.org/abs/2602.14728)
Keywords: generative
Abstract: We systematically investigate the parameter-efficient fine-tuning design space under practical data and compute constraints, and propose D2-LoRA. D2-LoRA achieves 76.4 percent average accuracy across eight question answering and reading comprehension benchmarks using only 5k training samples per task and two epochs, while preserving algebraic mergeability at inference with near-exact numerical equivalence. The method combines signed low-rank residual updates with additive and subtractive components, together with a train-time column-wise projection that keeps each column close to its original norm. After training, the adapter is merged into a single weight matrix, adding zero inference latency. Compared with LoRA, D2-LoRA improves average accuracy by 2.2 percentage points; at matched parameter counts (LoRA rank 2r versus D2-LoRA rank r), the improvement is 1.6 points, indicating gains from architectural design rather than increased parameterization. Compared with DoRA, it matches or exceeds performance on most tasks. Beyond QA and reading comprehension, D2-LoRA improves generative tasks (plus 1.2 ROUGE-L and plus 1.1 percent win rate) and shows 36 percent lower training volatility. The merge preserves numerical fidelity (mean gap about 0.03 percentage points) and recovers about 1.91x evaluation throughput. Training overhead is 19 percent, comparable to DoRA, and decreases with longer input sequences. We provide a geometric analysis explaining how the projection stabilizes training, together with ablation studies isolating the contribution of each design component.
摘要：我们系统地研究了实际数据和计算约束下的参数高效微调设计空间，并提出了 D2-LoRA。 D2-LoRA 在每个任务和两个 epoch 中仅使用 5k 个训练样本，在 8 个问答和阅读理解基准测试中实现了 76.4% 的平均准确率，同时在推理时保持了代数可合并性，具有近乎精确的数值等价性。该方法将带符号的低秩残差更新与加法和减法组件相结合，以及训练时间按列投影，使每列接近其原始范数。训练后，适配器被合并到单个权重矩阵中，增加零推理延迟。与LoRA相比，D2-LoRA平均准确率提高了2.2个百分点；在匹配的参数计数（LoRA 等级 2r 与 D2-LoRA 等级 r）下，改进为 1.6 个点，表明架构设计带来的收益而不是参数化的增加。与 DoRA 相比，它在大多数任务上的性能达到或超过了。除了 QA 和阅读理解之外，D2-LoRA 还改善了生成任务（加上 1.2 ROUGE-L 和 1.1% 的获胜率），并且训练波动性降低了 36%。合并保留了数值保真度（平均差距约为 0.03 个百分点）并恢复了约 1.91 倍的评估吞吐量。训练开销为 19%，与 DoRA 相当，并且随着输入序列的延长而减少。我们提供了几何分析来解释投影如何稳定训练，以及消融研究来隔离每个设计组件的贡献。
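
Note: the mergeability claim in the abstract above can be sketched in a few lines of Python. The abstract does not give the exact parameterization, so the two-branch form below (one added low-rank product and one subtracted low-rank product), the scaling factor, the exact column rescaling, and all names are assumptions chosen for illustration; they should not be read as the authors' implementation.

```python
import numpy as np


def project_columns(w_new, w_ref, eps=1e-12):
    """Rescale each column of w_new to the norm of the same column of w_ref.

    Assumption: an exact rescale stands in for the paper's train-time
    projection, which only keeps columns *close* to their original norm.
    """
    ref = np.linalg.norm(w_ref, axis=0, keepdims=True)
    cur = np.linalg.norm(w_new, axis=0, keepdims=True)
    return w_new * (ref / np.maximum(cur, eps))


def adapter_forward(x, w0, b_plus, a_plus, b_minus, a_minus, scale=1.0):
    """Unmerged forward pass: frozen weight plus a signed low-rank residual."""
    base = x @ w0.T
    plus = (x @ a_plus.T) @ b_plus.T      # additive branch
    minus = (x @ a_minus.T) @ b_minus.T   # subtractive branch
    return base + scale * (plus - minus)


def merge(w0, b_plus, a_plus, b_minus, a_minus, scale=1.0):
    """Fold both branches into one weight matrix of the original shape."""
    return w0 + scale * (b_plus @ a_plus - b_minus @ a_minus)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_out, d_in, rank, batch = 6, 5, 2, 4
    w0 = rng.normal(size=(d_out, d_in))
    b_p, b_m = rng.normal(size=(2, d_out, rank))
    a_p, a_m = rng.normal(size=(2, rank, d_in))
    x = rng.normal(size=(batch, d_in))

    merged = merge(w0, b_p, a_p, b_m, a_m, scale=0.5)
    gap = np.abs(adapter_forward(x, w0, b_p, a_p, b_m, a_m, scale=0.5) - x @ merged.T)
    print("max merge gap:", gap.max())
```

Because the residual is linear in the input, the merged matrix reproduces the unmerged forward pass up to floating-point error, which is why a merged adapter adds no inference latency.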

Title: Picking the Right Specialist: Attentive Neural Process-based Selection of Task-Specialized Models as Tools for Agentic Healthcare Systems

Authors: Pramit Saha, Joshua Strong, Mohammad Alsharid, Divyanshu Mishra, J. Alison Noble
Subjects: cs.LG, cs.AI, cs.CV, cs.MA
Abstract URL: https://arxiv.org/abs/2602.14901
Pdf URL: https://arxiv.org/pdf/2602.14901
Copy Paste: [[2602.14901]] Picking the Right Specialist: Attentive Neural Process-based Selection of Task-Specialized Models as Tools for Agentic Healthcare Systems(https://arxiv.org/abs/2602.14901)
Keywords: generation
Abstract: Task-specialized models form the backbone of agentic healthcare systems, enabling the agents to answer clinical queries across tasks such as disease diagnosis, localization, and report generation. Yet, for a given task, a single "best" model rarely exists. In practice, each task is better served by multiple competing specialist models where different models excel on different data samples. As a result, for any given query, agents must reliably select the right specialist model from a heterogeneous pool of tool candidates. To this end, we introduce ToolSelect, which adaptively learns model selection for tools by minimizing a population risk over sampled specialist tool candidates using a consistent surrogate of the task-conditional selection loss. Concretely, we propose an Attentive Neural Process-based selector conditioned on the query and per-model behavioral summaries to choose among the specialist models. Motivated by the absence of any established testbed, we, for the first time, introduce an agentic Chest X-ray environment equipped with a diverse suite of task-specialized models (17 disease detection, 19 report generation, 6 visual grounding, and 13 VQA) and develop ToolSelectBench, a benchmark of 1448 queries. Our results demonstrate that ToolSelect consistently outperforms 10 SOTA methods across four different task families.
摘要：任务专用模型构成了代理医疗保健系统的支柱，使代理能够跨疾病诊断、定位和报告生成等任务回答临床查询。然而，对于给定的任务，很少存在单一的“最佳”模型。在实践中，每个任务都可以通过多个相互竞争的专业模型更好地完成，其中不同的模型在不同的数据样本上表现出色。因此，对于任何给定的查询，代理必须从异构的候选工具池中可靠地选择正确的专家模型。为此，我们引入了 ToolSelect，它通过使用任务条件选择损失的一致替代来最小化采样的专业工具候选者的总体风险，从而自适应地学习工具的模型选择。具体来说，我们提出了一种基于注意力神经过程的选择器，以查询和每个模型的行为摘要为条件，以在专业模型中进行选择。由于缺乏任何已建立的测试平台，我们首次引入了一种代理胸部 X 射线环境，配备了多种任务专用模型（17 个疾病检测、19 个报告生成、6 个视觉基础和 13 个 VQA），并开发了 ToolSelectBench（1448 个查询的基准）。我们的结果表明，ToolSelect 在四个不同的任务系列中始终优于 10 个 SOTA 方法。

Title: AnchorWeave: World-Consistent Video Generation with Retrieved Local Spatial Memories

Authors: Zun Wang, Han Lin, Jaehong Yoon, Jaemin Cho, Yue Zhang, Mohit Bansal
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2602.14941
Pdf URL: https://arxiv.org/pdf/2602.14941
Copy Paste: [[2602.14941]] AnchorWeave: World-Consistent Video Generation with Retrieved Local Spatial Memories(https://arxiv.org/abs/2602.14941)
Keywords: generation
Abstract: Maintaining spatial world consistency over long horizons remains a central challenge for camera-controllable video generation. Existing memory-based approaches often condition generation on globally reconstructed 3D scenes by rendering anchor videos from the reconstructed geometry in the history. However, reconstructing a global 3D scene from multiple views inevitably introduces cross-view misalignment, as pose and depth estimation errors cause the same surfaces to be reconstructed at slightly different 3D locations across views. When fused, these inconsistencies accumulate into noisy geometry that contaminates the conditioning signals and degrades generation quality. We introduce AnchorWeave, a memory-augmented video generation framework that replaces a single misaligned global memory with multiple clean local geometric memories and learns to reconcile their cross-view inconsistencies. To this end, AnchorWeave performs coverage-driven local memory retrieval aligned with the target trajectory and integrates the selected local memories through a multi-anchor weaving controller during generation. Extensive experiments demonstrate that AnchorWeave significantly improves long-term scene consistency while maintaining strong visual quality, with ablation and analysis studies further validating the effectiveness of local geometric conditioning, multi-anchor control, and coverage-driven retrieval.
摘要：在长期范围内保持空间世界的一致性仍然是摄像机可控视频生成的核心挑战。现有的基于内存的方法通常通过从历史中的重建几何体渲染锚视频来调节全局重建 3D 场景的生成。然而，从多个视图重建全局 3D 场景不可避免地会引入跨视图未对准，因为姿态和深度估计错误会导致在跨视图的稍微不同的 3D 位置重建相同的表面。当融合时，这些不一致会累积成噪声几何形状，从而污染调节信号并降低生成质量。我们介绍了 AnchorWeave，这是一种内存增强视频生成框架，它用多个干净的局部几何内存替换单个未对齐的全局内存，并学习协调它们的跨视图不一致。为此，AnchorWeave 执行与目标轨迹对齐的覆盖驱动的本地内存检索，并在生成过程中通过多锚编织控制器集成所选的本地内存。大量实验表明，AnchorWeave 显着提高了长期场景一致性，同时保持了强大的视觉质量，消融和分析研究进一步验证了局部几何条件、多锚点控制和覆盖驱动检索的有效性。

Title: PAct: Part-Decomposed Single-View Articulated Object Generation

Authors: Qingming Liu, Xinyue Yao, Shuyuan Zhang, Yueci Deng, Guiliang Liu, Zhen Liu, Kui Jia
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2602.14965
Pdf URL: https://arxiv.org/pdf/2602.14965
Copy Paste: [[2602.14965]] PAct: Part-Decomposed Single-View Articulated Object Generation(https://arxiv.org/abs/2602.14965)
Keywords: generation, generative
Abstract: Articulated objects are central to interactive 3D applications, including embodied AI, robotics, and VR/AR, where functional part decomposition and kinematic motion are essential. Yet producing high-fidelity articulated assets remains difficult to scale because it requires reliable part decomposition and kinematic rigging. Existing approaches largely fall into two paradigms: optimization-based reconstruction or distillation, which can be accurate but often takes tens of minutes to hours per instance, and inference-time methods that rely on template or part retrieval, producing plausible results that may not match the specific structure and appearance in the input observation. We introduce a part-centric generative framework for articulated object creation that synthesizes part geometry, composition, and articulation under explicit part-aware conditioning. Our representation models an object as a set of movable parts, each encoded by latent tokens augmented with part identity and articulation cues. Conditioned on a single image, the model generates articulated 3D assets that preserve instance-level correspondence while maintaining valid part structure and motion. The resulting approach avoids per-instance optimization, enables fast feed-forward inference, and supports controllable assembly and articulation, which are important for embodied interaction. Experiments on common articulated categories (e.g., drawers and doors) show improved input consistency, part accuracy, and articulation plausibility over optimization-based and retrieval-driven baselines, while substantially reducing inference time.
摘要：铰接物体是交互式 3D 应用的核心，包括实体 AI、机器人和 VR/AR，其中功能部件分解和运动学运动至关重要。然而，生产高保真铰接资产仍然难以扩展，因为它需要可靠的零件分解和运动索具。现有的方法主要分为两种范式：基于优化的重建或蒸馏，它可以是准确的，但每个实例通常需要数十分钟到几个小时，以及依赖于模板或部分检索的推理时间方法，产生可能与输入观察中的特定结构和外观不匹配的合理结果。我们引入了一个以零件为中心的生成框架，用于创建铰接对象，该框架在明确的零件感知条件下综合零件几何形状、组成和铰接。我们的表示将对象建模为一组可移动部件，每个部件都由潜在标记编码，并通过部件身份和关节线索进行增强。该模型以单个图像为条件，生成铰接的 3D 资产，这些资产保留实例级对应关系，同时保持有效的零件结构和运动。由此产生的方法避免了每个实例的优化，实现了快速前馈推理，并支持可控组装和铰接，这对于具体交互非常重要。对常见铰接类别（例如抽屉和门）的实验表明，与基于优化和检索驱动的基线相比，输入一致性、零件准确性和铰接合理性得到了改善，同时大大减少了推理时间。

Title: MacroGuide: Topological Guidance for Macrocycle Generation

Authors: Alicja Maksymiuk, Alexandre Duplessis, Michael Bronstein, Alexander Tong, Fernanda Duarte, İsmail İlkan Ceylan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2602.14977
Pdf URL: https://arxiv.org/pdf/2602.14977
Copy Paste: [[2602.14977]] MacroGuide: Topological Guidance for Macrocycle Generation(https://arxiv.org/abs/2602.14977)
Keywords: generation, generative
Abstract: Macrocycles are ring-shaped molecules that offer a promising alternative to small-molecule drugs due to their enhanced selectivity and binding affinity against difficult targets. Despite their chemical value, they remain underexplored in generative modeling, likely owing to their scarcity in public datasets and the challenges of enforcing topological constraints in standard deep generative models. We introduce MacroGuide: Topological Guidance for Macrocycle Generation, a diffusion guidance mechanism that uses Persistent Homology to steer the sampling of pretrained molecular generative models toward the generation of macrocycles, in both unconditional and conditional (protein pocket) settings. At each denoising step, MacroGuide constructs a Vietoris-Rips complex from atomic positions and promotes ring formation by optimizing persistent homology features. Empirically, applying MacroGuide to pretrained diffusion models increases macrocycle generation rates from 1% to 99%, while matching or exceeding state-of-the-art performance on key quality metrics such as chemical validity, diversity, and PoseBusters checks.
摘要：大环化合物是环状分子，由于其增强的选择性和对困难靶标的结合亲和力，为小分子药物提供了有前途的替代品。尽管它们具有化学价值，但它们在生成模型中仍未得到充分探索，这可能是由于它们在公共数据集中的稀缺以及在标准深度生成模型中强制执行拓扑约束的挑战。我们介绍了 MacroGuide：大环生成的拓扑指导，这是一种扩散引导机制，它使用持久同源性来引导预训练分子生成模型的采样，以在无条件和条件（蛋白质袋）设置中生成大环。在每个去噪步骤中，MacroGuide 从原子位置构建 Vietoris-Rips 复合体，并通过优化持久同源特征促进环形成。根据经验，将 MacroGuide 应用于预训练扩散模型可将大环生成率从 1% 提高到 99%，同时在化学有效性、多样性和 PoseBusters 检查等关键质量指标上达到或超过最先进的性能。

Title: Scaling Beyond Masked Diffusion Language Models

Authors: Subham Sekhar Sahoo, Jean-Marie Lemercier, Zhihan Yang, Justin Deschenaux, Jingyu Liu, John Thickstun, Ante Jukic
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2602.15014
Pdf URL: https://arxiv.org/pdf/2602.15014
Copy Paste: [[2602.15014]] Scaling Beyond Masked Diffusion Language Models(https://arxiv.org/abs/2602.15014)
Keywords: generation
Abstract: Diffusion language models are a promising alternative to autoregressive models due to their potential for faster generation. Among discrete diffusion approaches, Masked diffusion currently dominates, largely driven by strong perplexity on language modeling benchmarks. In this work, we present the first scaling law study of uniform-state and interpolating discrete diffusion methods. We also show that Masked diffusion models can be made approximately 12% more FLOPs-efficient when trained with a simple cross-entropy objective. We find that perplexity is informative within a diffusion family but can be misleading across families, where models with worse likelihood scaling may be preferable due to faster and more practical sampling, as reflected by the speed-quality Pareto frontier. These results challenge the view that Masked diffusion is categorically the future of diffusion language modeling and that perplexity alone suffices for cross-algorithm comparison. Scaling all methods to 1.7B parameters, we show that uniform-state diffusion remains competitive on likelihood-based benchmarks and outperforms autoregressive and Masked diffusion models on GSM8K, despite worse validation perplexity. We provide the code, model checkpoints, and video tutorials on the project page: this http URL
摘要：扩散语言模型由于其具有更快生成速度的潜力，是自回归模型的有前途的替代方案。在离散扩散方法中，掩模扩散目前占据主导地位，这很大程度上是由于语言建模基准的强烈困惑所驱动的。在这项工作中，我们提出了均匀状态和插值离散扩散方法的第一个标度律研究。我们还表明，当使用简单的交叉熵目标进行训练时，掩蔽扩散模型的 FLOP 效率可以提高约 12%。我们发现，困惑度在扩散族内提供了丰富的信息，但在不同族之间可能会产生误导，其中似然标度较差的模型可能由于更快、更实用的采样而更可取，正如速度-质量帕累托前沿所反映的那样。这些结果挑战了这样一种观点，即蒙面扩散无疑是扩散语言建模的未来，并且仅凭困惑度就足以进行跨算法比较。将所有方法扩展到 1.7B 参数，我们表明，尽管验证复杂性更差，但均匀状态扩散在基于似然的基准上仍然具有竞争力，并且优于 GSM8K 上的自回归和掩模扩散模型。我们在项目页面上提供了代码、模型检查点和视频教程：此 http URL
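
Note: the abstract above compares model families on a speed-quality Pareto frontier. As a minimal Python sketch of that idea, the function below keeps only the configurations that no other configuration beats on both axes. The tuple layout, the choice of throughput and perplexity as the two axes, and the placeholder names and numbers in the usage example are assumptions for illustration; none of the values come from the paper.

```python
def pareto_frontier(points):
    """Return the non-dominated entries of `points`.

    Each entry is (name, throughput, perplexity). Higher throughput and
    lower perplexity are both better. An entry is kept only if no entry
    with equal or higher throughput has equal or lower perplexity.
    """
    # Fastest first; among equal speeds, best perplexity first.
    ordered = sorted(points, key=lambda p: (-p[1], p[2]))
    frontier = []
    best_perplexity = float("inf")
    for name, throughput, perplexity in ordered:
        # Every entry seen so far is at least as fast, so this entry
        # survives only if it strictly improves perplexity.
        if perplexity < best_perplexity:
            frontier.append((name, throughput, perplexity))
            best_perplexity = perplexity
    return frontier


if __name__ == "__main__":
    # Placeholder configurations with made-up numbers.
    candidates = [
        ("config_a", 100.0, 30.0),
        ("config_b", 80.0, 25.0),
        ("config_c", 60.0, 27.0),
        ("config_d", 40.0, 20.0),
    ]
    for entry in pareto_frontier(candidates):
        print(entry)
```

A configuration with worse perplexity can still sit on the frontier if it samples faster than every configuration with better perplexity, which is the sense in which the abstract argues that perplexity alone can mislead across families.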

Title: Rethinking Diffusion Models with Symmetries through Canonicalization with Applications to Molecular Graph Generation

Authors: Cai Zhou, Zijie Chen, Zian Li, Jike Wang, Kaiyi Jiang, Pan Li, Rose Yu, Muhan Zhang, Stephen Bates, Tommi Jaakkola
Subjects: cs.LG, cs.AI, math.GR, q-bio.BM
Abstract URL: https://arxiv.org/abs/2602.15022
Pdf URL: https://arxiv.org/pdf/2602.15022
Copy Paste: [[2602.15022]] Rethinking Diffusion Models with Symmetries through Canonicalization with Applications to Molecular Graph Generation(https://arxiv.org/abs/2602.15022)
Keywords: generation, generative
Abstract: Many generative tasks in chemistry and science involve distributions invariant to group symmetries (e.g., permutation and rotation). A common strategy enforces invariance and equivariance through architectural constraints such as equivariant denoisers and invariant priors. In this paper, we challenge this tradition through the alternative canonicalization perspective: first map each sample to an orbit representative with a canonical pose or order, train an unconstrained (non-equivariant) diffusion or flow model on the canonical slice, and finally recover the invariant distribution by sampling a random symmetry transform at generation time. Building on a formal quotient-space perspective, our work provides a comprehensive theory of canonical diffusion by proving: (i) the correctness, universality and superior expressivity of canonical generative models over invariant targets; (ii) canonicalization accelerates training by removing diffusion score complexity induced by group mixtures and reducing conditional variance in flow matching. We then show that aligned priors and optimal transport act complementarily with canonicalization and further improve training efficiency. We instantiate the framework for molecular graph generation under $S_n \times SE(3)$ symmetries. By leveraging geometric spectra-based canonicalization and mild positional encodings, canonical diffusion significantly outperforms equivariant baselines in 3D molecule generation tasks, with similar or even lower computational cost. Moreover, with a novel architecture, Canon, CanonFlow achieves state-of-the-art performance on the challenging GEOM-DRUG dataset, and the advantage remains large in few-step generation.
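
The canonicalize-then-randomize recipe in the abstract above can be illustrated with a toy point cloud. This is only a minimal sketch under stated assumptions, not the paper's method: the abstract mentions geometric spectra-based canonicalization, whereas the code below swaps in a simple PCA alignment plus lexicographic sorting, and every function name here (`canonicalize`, `random_rotation`, `sample_with_symmetry`) is hypothetical. Note that this toy version also identifies mirror images, which a real molecular pipeline would likely need to avoid.

```python
import numpy as np

def canonicalize(points):
    """Map an (n, 3) point cloud to a canonical pose and row order (toy PCA version)."""
    centered = points - points.mean(axis=0)            # remove translation
    # Eigenvectors of the scatter matrix give principal axes (fixed up to sign).
    _, axes = np.linalg.eigh(centered.T @ centered)
    aligned = centered @ axes
    # Resolve the sign ambiguity: make the third moment along each axis non-negative.
    signs = np.sign((aligned ** 3).sum(axis=0))
    signs[signs == 0] = 1.0
    aligned = aligned * signs
    # Canonical order: sort rows lexicographically by (x, y, z).
    order = np.lexsort(aligned.T[::-1])
    return aligned[order]

def random_rotation(rng):
    """Draw a random 3D rotation matrix via QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))                        # fix column signs
    if np.linalg.det(q) < 0:                           # force a proper rotation
        q[:, 0] = -q[:, 0]
    return q

def sample_with_symmetry(canonical_sample, rng):
    """Apply a random rotation and permutation to a canonical sample."""
    rot = random_rotation(rng)
    perm = rng.permutation(len(canonical_sample))
    return (canonical_sample @ rot.T)[perm]

rng = np.random.default_rng(0)
cloud = rng.normal(size=(6, 3))
# A rotated, permuted, and translated copy should share the same canonical form.
transformed = sample_with_symmetry(cloud, rng) + 1.5
same_orbit = np.allclose(canonicalize(cloud), canonicalize(transformed), atol=1e-8)
```

In this sketch an unconstrained generative model would be trained on the outputs of `canonicalize`, and `sample_with_symmetry` would be applied to its samples at generation time.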

Title: Image Generation with a Sphere Encoder

Authors: Kaiyu Yue, Menglin Jia, Ji Hou, Tom Goldstein
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.15030
Pdf URL: https://arxiv.org/pdf/2602.15030
Copy Paste: [[2602.15030]] Image Generation with a Sphere Encoder(https://arxiv.org/abs/2602.15030)
Keywords: generation, generative
Abstract: We introduce the Sphere Encoder, an efficient generative framework capable of producing images in a single forward pass and competing with many-step diffusion models using fewer than five steps. Our approach works by learning an encoder that maps natural images uniformly onto a spherical latent space, and a decoder that maps random latent vectors back to the image space. Trained solely through image reconstruction losses, the model generates an image by simply decoding a random point on the sphere. Our architecture naturally supports conditional generation, and looping the encoder/decoder a few times can further enhance image quality. Across several datasets, the Sphere Encoder approach yields performance competitive with state-of-the-art diffusion models, but at a small fraction of the inference cost. A project page is available.
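
The generation step described above (decode a random point on the sphere) can be sketched as follows. This is an illustrative sketch only: the unit radius, the latent dimension, and `toy_decoder` (a random linear map with a `tanh`) are assumptions standing in for the trained decoder, which the abstract does not specify.

```python
import numpy as np

def sample_on_sphere(num_samples, dim, rng):
    """Draw points uniformly on the unit sphere by normalizing Gaussian vectors."""
    z = rng.normal(size=(num_samples, dim))
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def toy_decoder(latents, weights):
    """Hypothetical stand-in for a trained decoder: linear map followed by tanh."""
    return np.tanh(latents @ weights)

rng = np.random.default_rng(0)
latents = sample_on_sphere(4, 16, rng)       # 4 random latent points, 16-dim sphere
weights = rng.normal(size=(16, 8 * 8))       # placeholder decoder parameters
images = toy_decoder(latents, weights).reshape(4, 8, 8)
```

Normalizing isotropic Gaussian samples is a standard way to obtain a uniform distribution on the sphere; whether the paper samples this way is not stated in the abstract.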

Title: EditCtrl: Disentangled Local and Global Control for Real-Time Generative Video Editing

Authors: Yehonathan Litman, Shikun Liu, Dario Seyb, Nicholas Milef, Yang Zhou, Carl Marshall, Shubham Tulsiani, Caleb Leak
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.15031
Pdf URL: https://arxiv.org/pdf/2602.15031
Copy Paste: [[2602.15031]] EditCtrl: Disentangled Local and Global Control for Real-Time Generative Video Editing(https://arxiv.org/abs/2602.15031)
Keywords: generation, generative
Abstract: High-fidelity generative video editing has seen significant quality improvements by leveraging pre-trained video foundation models. However, their computational cost is a major bottleneck, as they are often designed to inefficiently process the full video context regardless of the inpainting mask's size, even for sparse, localized edits. In this paper, we introduce EditCtrl, an efficient video inpainting control framework that focuses computation only where it is needed. Our approach features a novel local video context module that operates solely on masked tokens, yielding a computational cost proportional to the edit size. This local-first generation is then guided by a lightweight temporal global context embedder that ensures video-wide context consistency with minimal overhead. Not only is EditCtrl 10 times more compute-efficient than state-of-the-art generative editing methods, it even improves editing quality compared to methods designed with full attention. Finally, we showcase how EditCtrl unlocks new capabilities, including multi-region editing with text prompts and autoregressive content propagation.
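
The idea of operating solely on masked tokens, so that cost scales with the edit size, can be illustrated with a small gather-and-scatter sketch. This is not EditCtrl's module: `edit_masked_tokens` and the doubling `local_fn` are hypothetical placeholders, and the global context embedder from the abstract is omitted entirely.

```python
import numpy as np

def edit_masked_tokens(tokens, mask, local_fn):
    """Apply local_fn only to rows where mask is True; other rows are left untouched."""
    out = tokens.copy()
    out[mask] = local_fn(tokens[mask])       # work is proportional to mask.sum()
    return out

rng = np.random.default_rng(0)
tokens = rng.normal(size=(1000, 32))         # 1000 video tokens, 32 features each
mask = np.zeros(1000, dtype=bool)
mask[100:150] = True                         # a sparse, localized edit region
edited = edit_masked_tokens(tokens, mask, lambda t: t * 2.0)
```

Here `local_fn` sees 50 of the 1000 tokens, so its cost depends on the size of the masked region rather than on the full sequence length.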