2026-03-18

Title: Improving Generative Adversarial Network Generalization for Facial Expression Synthesis

Authors: Arbish Akram, Nazar Khan, Arif Mahmood
Subjects: cs.CV, cs.GR, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2603.15648
Pdf URL: https://arxiv.org/pdf/2603.15648
Copy Paste: [[2603.15648]] Improving Generative Adversarial Network Generalization for Facial Expression Synthesis(https://arxiv.org/abs/2603.15648)
Keywords: generative
Abstract: Facial expression synthesis aims to generate realistic facial expressions while preserving identity. Existing conditional generative adversarial networks (GANs) achieve excellent image-to-image translation results, but their performance often degrades when test images differ from the training dataset. We present Regression GAN (RegGAN), a model that learns an intermediate representation to improve generalization beyond the training distribution. RegGAN consists of two components: a regression layer with local receptive fields that learns expression details by minimizing the reconstruction error through a ridge regression loss, and a refinement network trained adversarially to enhance the realism of generated images. We train RegGAN on the CFEE dataset and evaluate its generalization performance both on CFEE and challenging out-of-distribution images, including celebrity photos, portraits, statues, and avatar renderings. For evaluation, we employ four widely used metrics: Expression Classification Score (ECS) for expression quality, Face Similarity Score (FSS) for identity preservation, QualiCLIP for perceptual realism, and Fréchet Inception Distance (FID) for assessing both expression quality and realism. RegGAN outperforms six state-of-the-art models in ECS, FID, and QualiCLIP, while ranking second in FSS. Human evaluations indicate that RegGAN surpasses the best competing model by 25% in expression quality, 26% in identity preservation, and 30% in realism.
Summary: Facial expression synthesis aims to generate realistic expressions while preserving identity, but conditional GANs often degrade when test images differ from the training data. RegGAN learns an intermediate representation to generalize beyond the training distribution. It combines a regression layer with local receptive fields, trained with a ridge regression loss, and an adversarially trained refinement network. Trained on CFEE and evaluated on CFEE and out-of-distribution images (celebrity photos, portraits, statues, avatar renderings), it outperforms six state-of-the-art models on ECS, FID, and QualiCLIP and ranks second on FSS. In human evaluations it beats the best competing model by 25% in expression quality, 26% in identity preservation, and 30% in realism.
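
The ridge regression loss named in the abstract can be illustrated with a minimal sketch. It assumes the simplest possible case, one scalar weight mapping one input pixel to one output pixel; the function name `ridge_weight` and the toy data are hypothetical and not taken from the paper, whose regression layer operates on local receptive fields.

```python
def ridge_weight(xs, ys, lam):
    """Minimizer of sum((y - w*x)^2) + lam * w^2 for a single scalar weight w."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]                       # toy data generated by y = 2x
w_ols = ridge_weight(xs, ys, lam=0.0)      # plain least squares
w_ridge = ridge_weight(xs, ys, lam=14.0)   # the penalty shrinks the weight toward 0
print(w_ols, w_ridge)
```

With a larger `lam` the fitted weight moves toward zero, which is the regularizing effect a ridge penalty adds to the reconstruction error.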

Title: Transition Flow Matching

Authors: Chenrui Ma
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2603.15689
Pdf URL: https://arxiv.org/pdf/2603.15689
Copy Paste: [[2603.15689]] Transition Flow Matching(https://arxiv.org/abs/2603.15689)
Keywords: generation
Abstract: Mainstream flow matching methods typically focus on learning the local velocity field, which inherently requires multiple integration steps during generation. In contrast, Mean Velocity Flow models establish a relationship between the local velocity field and the global mean velocity, enabling the latter to be learned through a mathematically grounded formulation and allowing generation to be transferred to arbitrary future time points. In this work, we propose a new paradigm that directly learns the transition flow. As a global quantity, the transition flow naturally supports generation in a single step or at arbitrary time points. Furthermore, we demonstrate the connection between our approach and Mean Velocity Flow, establishing a unified theoretical perspective. Extensive experiments validate the effectiveness of our method and support our theoretical claims.
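Summary: Mainstream flow matching methods learn the local velocity field, which inherently requires multiple integration steps during generation. Mean Velocity Flow models instead relate the local velocity field to the global mean velocity, so that the latter can be learned and generation can jump to arbitrary future time points. This work proposes a new paradigm that learns the transition flow directly. As a global quantity, the transition flow supports generation in a single step or at arbitrary time points. The paper also connects the approach to Mean Velocity Flow, giving a unified theoretical view, and reports experiments that support the method and its theoretical claims.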
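
The contrast between a local velocity field and a global mean velocity can be checked numerically. The sketch below assumes a hypothetical velocity field v(z, t) = 2t, chosen only because its time average has a closed form; it is not the paper's model, and the function names are illustrative.

```python
def velocity(z, t):
    # Hypothetical local velocity field, chosen so the integral is easy to check.
    return 2.0 * t

def integrate(z, t_start, t_end, steps):
    """Multi-step integration of dz/dt = velocity(z, t) using the midpoint rule."""
    dt = (t_end - t_start) / steps
    t = t_start
    for _ in range(steps):
        z += velocity(z, t + 0.5 * dt) * dt
        t += dt
    return z

def mean_velocity(t_start, t_end):
    # Closed form of (1 / (t_end - t_start)) * integral of 2 * tau over [t_start, t_end].
    return t_start + t_end

z0 = 0.5
z_multi = integrate(z0, 0.0, 1.0, steps=100)             # many local steps
z_single = z0 + (1.0 - 0.0) * mean_velocity(0.0, 1.0)    # one global step
print(z_multi, z_single)
```

Both routes reach the same end point; the single-step route only needs the global quantity, which is the property that makes one-step generation possible.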

Title: Embedding-Aware Feature Discovery: Bridging Latent Representations and Interpretable Features in Event Sequences

Authors: Artem Sakhno, Ivan Sergeev, Alexey Shestov, Omar Zoloev, Elizaveta Kovtun, Gleb Gusev, Andrey Savchenko, Maksim Makarenko
Subjects: cs.LG, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2603.15713
Pdf URL: https://arxiv.org/pdf/2603.15713
Copy Paste: [[2603.15713]] Embedding-Aware Feature Discovery: Bridging Latent Representations and Interpretable Features in Event Sequences(https://arxiv.org/abs/2603.15713)
Keywords: generation
Abstract: Industrial financial systems operate on temporal event sequences such as transactions, user actions, and system logs. While recent research emphasizes representation learning and large language models, production systems continue to rely heavily on handcrafted statistical features due to their interpretability, robustness under limited supervision, and strict latency constraints. This creates a persistent disconnect between learned embeddings and feature-based pipelines. We introduce Embedding-Aware Feature Discovery (EAFD), a unified framework that bridges this gap by coupling pretrained event-sequence embeddings with a self-reflective LLM-driven feature generation agent. EAFD iteratively discovers, evaluates, and refines features directly from raw event sequences using two complementary criteria: alignment, which explains information already encoded in embeddings, and complementarity, which identifies predictive signals missing from them. Across both open-source and industrial transaction benchmarks, EAFD consistently outperforms embedding-only and feature-based baselines, achieving relative gains of up to +5.8% over state-of-the-art pretrained embeddings, resulting in new state-of-the-art performance across event-sequence datasets.
Summary: Industrial financial systems run on temporal event sequences such as transactions, user actions, and system logs. Production systems still rely heavily on handcrafted statistical features because of their interpretability, robustness under limited supervision, and strict latency constraints, which leaves a persistent gap between learned embeddings and feature-based pipelines. EAFD closes this gap by coupling pretrained event-sequence embeddings with a self-reflective, LLM-driven feature generation agent. It iteratively discovers, evaluates, and refines features from raw event sequences using two criteria: alignment (explaining information already encoded in the embeddings) and complementarity (identifying predictive signals missing from them). On open-source and industrial transaction benchmarks it outperforms embedding-only and feature-based baselines, with relative gains of up to +5.8% over state-of-the-art pretrained embeddings.

Title: Meta-TTRL: A Metacognitive Framework for Self-Improving Test-Time Reinforcement Learning in Unified Multimodal Models

Authors: Lit Sin Tan, Junzhe Chen, Xiaolong Fu, Lichen Ma, Junshi Huang, Jianzhong Shi, Yan Li, Lijie Wen
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.15724
Pdf URL: https://arxiv.org/pdf/2603.15724
Copy Paste: [[2603.15724]] Meta-TTRL: A Metacognitive Framework for Self-Improving Test-Time Reinforcement Learning in Unified Multimodal Models(https://arxiv.org/abs/2603.15724)
Keywords: generation
Abstract: Existing test-time scaling (TTS) methods for unified multimodal models (UMMs) in text-to-image (T2I) generation primarily rely on search or sampling strategies that produce only instance-level improvements, limiting the ability to learn from prior inferences and accumulate knowledge across similar prompts. To overcome these limitations, we propose Meta-TTRL, a metacognitive test-time reinforcement learning framework. Meta-TTRL performs test-time parameter optimization guided by model-intrinsic monitoring signals derived from the meta-knowledge of UMMs, achieving self-improvement and capability-level improvement at test time. Extensive experiments demonstrate that Meta-TTRL generalizes well across three representative UMMs, including Janus-Pro-7B, BAGEL, and Qwen-Image, achieving significant gains on compositional reasoning tasks and multiple T2I benchmarks with limited data. We provide the first comprehensive analysis to investigate the potential of test-time reinforcement learning (TTRL) for T2I generation in UMMs. Our analysis further reveals a key insight underlying effective TTRL: metacognitive synergy, where monitoring signals align with the model's optimization regime to enable self-improvement.
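Summary: Existing test-time scaling (TTS) methods for unified multimodal models (UMMs) in text-to-image (T2I) generation rely mainly on search or sampling strategies that yield only instance-level improvements, so they cannot learn from prior inferences or accumulate knowledge across similar prompts. Meta-TTRL is a metacognitive test-time reinforcement learning framework that optimizes parameters at test time, guided by model-intrinsic monitoring signals derived from the meta-knowledge of UMMs. Experiments show that it generalizes across three representative UMMs (Janus-Pro-7B, BAGEL, and Qwen-Image) and achieves significant gains on compositional reasoning tasks and multiple T2I benchmarks with limited data. The accompanying analysis identifies metacognitive synergy, where monitoring signals align with the model's optimization regime, as a key factor behind effective TTRL.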

Title: Time-Aware Prior Fitted Networks for Zero-Shot Forecasting with Exogenous Variables

Authors: Andres Potapczynski, Ravi Kiran Selvam, Tatiana Konstantinova, Shankar Ramasubramanian, Malcolm Wolff, Kin G. Olivares, Ruijun Ma, Mengfei Cao, Michael W. Mahoney, Andrew Gordon Wilson, Boris N. Oreshkin, Dmitry Efimov
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.15802
Pdf URL: https://arxiv.org/pdf/2603.15802
Copy Paste: [[2603.15802]] Time-Aware Prior Fitted Networks for Zero-Shot Forecasting with Exogenous Variables(https://arxiv.org/abs/2603.15802)
Keywords: generation
Abstract: In many time series forecasting settings, the target time series is accompanied by exogenous covariates, such as promotions and prices in retail demand; temperature in energy load; calendar and holiday indicators for traffic or sales; and grid load or fuel costs in electricity pricing. Ignoring these exogenous signals can substantially degrade forecasting accuracy, particularly when they drive spikes, discontinuities, or regime and phase changes in the target series. Most current time series foundation models (e.g., Chronos, Sundial, TimesFM, TimeMoE, TimeLLM, and LagLlama) ignore exogenous covariates and make forecasts solely from the numerical time series history, thereby limiting their performance. In this paper, we develop ApolloPFN, a prior-data fitted network (PFN) that is time-aware (unlike prior PFNs) and that natively incorporates exogenous covariates (unlike prior univariate forecasters). Our design introduces two major advances: (i) a synthetic data generation procedure tailored to resolve the failure modes that arise when tabular (non-temporal) PFNs are applied to time series; and (ii) time-aware architectural modifications that embed inductive biases needed to exploit the time series context. We demonstrate that ApolloPFN achieves state-of-the-art results across benchmarks, such as M5 and electric price forecasting, that contain exogenous information.
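Summary: In many forecasting settings the target series comes with exogenous covariates, such as promotions and prices in retail demand, temperature in energy load, calendar and holiday indicators for traffic or sales, and grid load or fuel costs in electricity pricing. Ignoring these signals can substantially degrade accuracy, especially when they drive spikes, discontinuities, or regime and phase changes. Most current time series foundation models (e.g., Chronos, Sundial, TimesFM, TimeMoE, TimeLLM, and LagLlama) forecast only from the numerical history. ApolloPFN is a prior-data fitted network (PFN) that is time-aware and natively incorporates exogenous covariates. It introduces (i) a synthetic data generation procedure tailored to the failure modes that arise when tabular PFNs are applied to time series, and (ii) time-aware architectural modifications that embed the inductive biases needed to exploit time series context. It achieves state-of-the-art results on benchmarks with exogenous information, such as M5 and electricity price forecasting.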

Title: Mask Is What DLLM Needs: A Masked Data Training Paradigm for Diffusion LLMs

Authors: Linrui Ma, Yufei Cui, Kai Han, Yunhe Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.15803
Pdf URL: https://arxiv.org/pdf/2603.15803
Copy Paste: [[2603.15803]] Mask Is What DLLM Needs: A Masked Data Training Paradigm for Diffusion LLMs(https://arxiv.org/abs/2603.15803)
Keywords: generation
Abstract: Discrete diffusion models offer global context awareness and flexible parallel generation. However, uniform random noise schedulers in standard DLLM training overlook the highly non-uniform information density inherent in real-world sequences. This wastes optimization resources on low-density structural glue while leaving high-density logical pivot points severely under-optimized. To address this, we propose an Information Density Driven Smart Noise Scheduler. By extracting information-dense hubs and applying Complementary Priority Masking, our method decouples a single training instance into mutually reinforcing reasoning and syntax samples, forcing the model to master both logical deduction and foundational sequence structure. Experiments demonstrate that our approach improves average accuracy by ~4% across four Code and Math reasoning benchmarks, significantly outperforming uniform baselines. Mechanistic analyses further reveal that probabilistic priority masking effectively mitigates contextual collapse during block diffusion training. Overall, this density-aware strategy efficiently unlocks the reasoning potential of diffusion language models at minimal annotation cost, emerging as a promising new masked data training paradigm for Diffusion LLMs. Our processed dataset can be found at this https URL.
Summary: Discrete diffusion models offer global context awareness and flexible parallel generation, but the uniform random noise schedulers used in standard DLLM training ignore the highly non-uniform information density of real sequences. The paper proposes an Information Density Driven Smart Noise Scheduler. It extracts information-dense hubs and applies Complementary Priority Masking, splitting a single training instance into mutually reinforcing reasoning and syntax samples. The approach improves average accuracy by about 4% across four code and math reasoning benchmarks, and mechanistic analyses indicate that probabilistic priority masking mitigates contextual collapse during block diffusion training. The processed dataset can be found at this https URL.
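
One way to picture the split of a single training instance into two complementary masked samples is sketched below. It assumes a deterministic top-k rule over hypothetical per-token density scores; the abstract describes the paper's masking as probabilistic, so the function name, the scores, and the `ratio` parameter are illustrative assumptions only.

```python
def complementary_masks(scores, ratio=0.5):
    """Split positions into two masks that together cover every token exactly once.

    scores: hypothetical per-token information-density scores (higher = denser).
    Returns (mask_dense, mask_sparse) as lists of booleans; True means "masked".
    """
    n = len(scores)
    k = max(1, int(n * ratio))
    top = set(sorted(range(n), key=lambda i: scores[i], reverse=True)[:k])
    mask_dense = [i in top for i in range(n)]       # hide high-density tokens
    mask_sparse = [i not in top for i in range(n)]  # hide the remaining tokens
    return mask_dense, mask_sparse

tokens = ["if", "x", ">", "0", ":", "return", "x"]
scores = [0.9, 0.2, 0.8, 0.3, 0.1, 0.7, 0.2]
dense, sparse = complementary_masks(scores, ratio=0.5)
print([t for t, m in zip(tokens, dense) if m])
```

In this reading, the sample with dense tokens hidden asks the model to predict the logical pivots, while the complementary sample asks it to predict the surrounding structure.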

Title: Feed-forward Gaussian Registration for Head Avatar Creation and Editing

Authors: Malte Prinzler, Paulo Gotardo, Siyu Tang, Timo Bolkart
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.15811
Pdf URL: https://arxiv.org/pdf/2603.15811
Copy Paste: [[2603.15811]] Feed-forward Gaussian Registration for Head Avatar Creation and Editing(https://arxiv.org/abs/2603.15811)
Keywords: generation
Abstract: We present MATCH (Multi-view Avatars from Topologically Corresponding Heads), a multi-view Gaussian registration method for high-quality head avatar creation and editing. State-of-the-art multi-view head avatar methods require time-consuming head tracking followed by expensive avatar optimization, often resulting in a total creation time of more than one day. MATCH, in contrast, directly predicts Gaussian splat textures in correspondence from calibrated multi-view images in just 0.5 seconds per frame, without requiring data preprocessing. The learned intra-subject correspondence across frames enables fast creation of personalized head avatars, while correspondence across subjects supports applications such as expression transfer, optimization-free tracking, semantic editing, and identity interpolation. We establish these correspondences end-to-end using a transformer-based model that predicts Gaussian splat textures in the fixed UV layout of a template mesh. To achieve this, we introduce a novel registration-guided attention block, where each UV-map token attends exclusively to image tokens depicting its corresponding mesh region. This design improves efficiency and performance compared to dense cross-view attention. MATCH outperforms existing methods in novel-view synthesis, geometry registration, and head avatar generation, while making avatar creation 10 times faster than the closest competing baseline. The code and model weights are available on the project website.
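Summary: MATCH (Multi-view Avatars from Topologically Corresponding Heads) is a multi-view Gaussian registration method for high-quality head avatar creation and editing. State-of-the-art multi-view methods need time-consuming head tracking followed by expensive avatar optimization, often taking more than a day in total. MATCH instead predicts Gaussian splat textures in correspondence from calibrated multi-view images in 0.5 seconds per frame, with no data preprocessing. Correspondence across frames of one subject enables fast personalized avatars; correspondence across subjects supports expression transfer, optimization-free tracking, semantic editing, and identity interpolation. A transformer-based model predicts the textures in the fixed UV layout of a template mesh, using a registration-guided attention block in which each UV-map token attends only to image tokens depicting its corresponding mesh region. MATCH outperforms existing methods in novel-view synthesis, geometry registration, and head avatar generation, and makes avatar creation 10 times faster than the closest competing baseline. Code and model weights are available on the project website.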
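
The restriction that each UV-map token attends only to image tokens of its own mesh region amounts to a masked softmax. The sketch below assumes hypothetical integer region ids and raw attention scores, and that every UV token has at least one matching image token; it does not reproduce the paper's architecture.

```python
import math

def masked_attention_weights(scores, uv_region, img_region):
    """Softmax over image tokens, restricted to tokens in the same mesh region."""
    weights = []
    for q, row in enumerate(scores):
        allowed = [j for j, r in enumerate(img_region) if r == uv_region[q]]
        m = max(row[j] for j in allowed)                 # for numerical stability
        exps = {j: math.exp(row[j] - m) for j in allowed}
        z = sum(exps.values())
        weights.append([exps.get(j, 0.0) / z for j in range(len(row))])
    return weights

uv_region = [0, 1]            # region id of each UV-map token (hypothetical)
img_region = [0, 0, 1, 1]     # region id of each image token (hypothetical)
scores = [[1.0, 1.0, 5.0, 5.0],
          [2.0, 0.0, 3.0, 3.0]]
w = masked_attention_weights(scores, uv_region, img_region)
print(w)
```

Image tokens outside the query's region receive zero weight regardless of their raw score, which is also why this is cheaper than dense cross-view attention.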

Title: Beyond the Embedding Bottleneck: Adaptive Retrieval-Augmented 3D CT Report Generation

Authors: Renjie Liang, Yiling Ma, Yang Xing, Zhengkang Fan, Jinqian Pan, Chengkun Sun, Li Li, Kuang Gong, Jie Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.15822
Pdf URL: https://arxiv.org/pdf/2603.15822
Copy Paste: [[2603.15822]] Beyond the Embedding Bottleneck: Adaptive Retrieval-Augmented 3D CT Report Generation(https://arxiv.org/abs/2603.15822)
Keywords: generation
Abstract: Automated radiology report generation from 3D CT volumes often suffers from incomplete pathology coverage. We provide empirical evidence that this limitation stems from a representational bottleneck: contrastive 3D CT embeddings encode discriminative pathology signals, yet exhibit severe dimensional concentration, with as few as 2 effective dimensions out of 512. Corroborating this, scaling the language model yields no measurable improvement, suggesting that the bottleneck lies in the visual representation rather than the generator. This bottleneck limits both generation and retrieval; naive static retrieval fails to improve clinical efficacy and can even degrade performance. We propose AdaRAG-CT, an adaptive augmentation framework that compensates for this visual bottleneck by introducing supplementary textual information through controlled retrieval and selectively integrating it during generation. On the CT-RATE benchmark, AdaRAG-CT achieves state-of-the-art clinical efficacy, improving Clinical F1 from 0.420 (CT-Agent) to 0.480 (+6 points); ablation studies confirm that both the retrieval and generation components contribute to the improvement. Code is available at this https URL.
Summary: Automated radiology report generation from 3D CT volumes often suffers from incomplete pathology coverage. The paper gives empirical evidence that this stems from a representational bottleneck: contrastive 3D CT embeddings encode discriminative pathology signals but show severe dimensional concentration, with as few as 2 effective dimensions out of 512. Scaling the language model yields no measurable improvement, which points to the visual representation rather than the generator. Naive static retrieval fails to improve clinical efficacy and can even degrade performance. AdaRAG-CT is an adaptive augmentation framework that introduces supplementary textual information through controlled retrieval and integrates it selectively during generation. On CT-RATE it improves Clinical F1 from 0.420 (CT-Agent) to 0.480 (+6 points), and ablations confirm that both the retrieval and generation components contribute. Code is available at this https URL.
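
A figure such as "2 effective dimensions out of 512" can be computed from the eigenvalue spectrum of the embedding covariance. The sketch below uses the participation ratio, which is one common definition of effective dimension; whether the paper uses this exact measure is an assumption, and the spectra are toy inputs.

```python
def effective_dimension(eigenvalues):
    """Participation ratio: (sum of eigenvalues)^2 / (sum of squared eigenvalues)."""
    s1 = sum(eigenvalues)
    s2 = sum(v * v for v in eigenvalues)
    return s1 * s1 / s2

isotropic = [1.0] * 512                   # variance spread evenly over all axes
concentrated = [1.0, 1.0] + [0.0] * 510   # variance confined to two axes
print(effective_dimension(isotropic), effective_dimension(concentrated))
```

An embedding that nominally has 512 dimensions but concentrates its variance on two directions scores 2 under this measure.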

Title: EvoIQA - Explaining Image Distortions with Evolved White-Box Logic

Authors: Ruchika Gupta, Illya Bakurov, Nathan Haut, Wolfgang Banzhaf
Subjects: cs.CV, cs.NE
Abstract URL: https://arxiv.org/abs/2603.15887
Pdf URL: https://arxiv.org/pdf/2603.15887
Copy Paste: [[2603.15887]] EvoIQA - Explaining Image Distortions with Evolved White-Box Logic(https://arxiv.org/abs/2603.15887)
Keywords: quality assessment
Abstract: Traditional Image Quality Assessment (IQA) metrics typically fall into one of two extremes: rigid, hand-crafted mathematical models or "black-box" deep learning architectures that completely lack interpretability. To bridge this gap, we propose EvoIQA, a fully explainable symbolic regression framework based on Genetic Programming that Evolves explicit, human-readable mathematical formulas for image quality assessment (IQA). Utilizing a rich terminal set from the VSI, VIF, FSIM, and HaarPSI metrics, our framework inherently maps structural, chromatic, and information-theoretic degradations into observable mathematical equations. Our results demonstrate that the evolved GP models consistently achieve strong alignment between the predictions and human visual preferences. Furthermore, they not only outperform traditional hand-crafted metrics but also achieve performance parity with complex, state-of-the-art deep learning models like DB-CNN, proving that we no longer have to sacrifice interpretability for state-of-the-art performance.
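Summary: Traditional Image Quality Assessment (IQA) metrics tend to fall into one of two extremes: rigid, hand-crafted mathematical models or "black-box" deep learning architectures that lack interpretability. EvoIQA is a fully explainable symbolic regression framework based on Genetic Programming that evolves explicit, human-readable formulas for IQA. Using a terminal set drawn from the VSI, VIF, FSIM, and HaarPSI metrics, it maps structural, chromatic, and information-theoretic degradations into observable equations. The evolved GP models align strongly with human visual preferences, outperform traditional hand-crafted metrics, and reach parity with complex deep learning models such as DB-CNN.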

Title: Generative Inverse Design with Abstention via Diagonal Flow Matching

Authors: Miguel de Campos, Werner Krebs, Hanno Gottschalk
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.15925
Pdf URL: https://arxiv.org/pdf/2603.15925
Copy Paste: [[2603.15925]] Generative Inverse Design with Abstention via Diagonal Flow Matching(https://arxiv.org/abs/2603.15925)
Keywords: generation, generative
Abstract: Inverse design aims to find design parameters $x$ achieving target performance $y^*$. Generative approaches learn bidirectional mappings between designs and labels, enabling diverse solution sampling. However, standard conditional flow matching (CFM), when adapted to inverse problems by pairing labels with design parameters, exhibits strong sensitivity to their arbitrary ordering and scaling, leading to unstable training. We introduce Diagonal Flow Matching (Diag-CFM), which resolves this through a zero-anchoring strategy that pairs design coordinates with noise and labels with zero, making the learning problem provably invariant to coordinate permutations. This yields order-of-magnitude improvements in round-trip accuracy over CFM and invertible neural network baselines across design dimensions up to $P = 100$. We develop two architecture-intrinsic uncertainty metrics, Zero-Deviation and Self-Consistency, that enable three practical capabilities: selecting the best candidate among multiple generations, abstaining from unreliable predictions, and detecting out-of-distribution targets. These metrics consistently outperform ensemble and general-purpose alternatives across all tasks. We validate on airfoil, gas turbine combustor, and an analytical benchmark with scalable design dimension.
Summary: Inverse design seeks design parameters $x$ that achieve a target performance $y^*$. Generative approaches learn bidirectional mappings between designs and labels, which allows diverse solutions to be sampled. Standard conditional flow matching (CFM), when adapted to inverse problems by pairing labels with design parameters, is highly sensitive to their arbitrary ordering and scaling, and training becomes unstable. Diagonal Flow Matching (Diag-CFM) addresses this with a zero-anchoring strategy that pairs design coordinates with noise and labels with zero, making the learning problem provably invariant to coordinate permutations. It improves round-trip accuracy by an order of magnitude over CFM and invertible neural network baselines for design dimensions up to $P = 100$. Two architecture-intrinsic uncertainty metrics, Zero-Deviation and Self-Consistency, support selecting the best candidate among multiple generations, abstaining from unreliable predictions, and detecting out-of-distribution targets. Validation covers airfoil, gas turbine combustor, and an analytical benchmark with scalable design dimension.

Title: GASP: Guided Asymmetric Self-Play For Coding LLMs

Authors: Swadesh Jana, Cansu Sancaktar, Tomáš Daniš, Georg Martius, Antonio Orvieto, Pavel Kolev
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.15957
Pdf URL: https://arxiv.org/pdf/2603.15957
Copy Paste: [[2603.15957]] GASP: Guided Asymmetric Self-Play For Coding LLMs(https://arxiv.org/abs/2603.15957)
Keywords: generation
Abstract: Asymmetric self-play has emerged as a promising paradigm for post-training large language models, where a teacher continually generates questions for a student to solve at the edge of the student's learnability. Although these methods promise open-ended data generation bootstrapped from no human data, they suffer from one major problem: not all problems that are hard to solve are interesting or informative to improve the overall capabilities of the model. Current asymmetric self-play methods are goal-agnostic with no real grounding. We propose Guided Asymmetric Self-Play (GASP), where grounding is provided by real-data goalpost questions that are identified to pose a hard exploration challenge to the model. During self-play, the teacher first generates an easier variant of a hard question, and then a harder variant of that easier question, with the goal of gradually closing the gap to the goalpost throughout training. Doing so, we improve pass@20 on LiveCodeBench (LCB) by 2.5% over unguided asymmetric self-play, and through the curriculum constructed by the teacher, we manage to solve hard goalpost questions that remain out of reach for all baselines.
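Summary: Asymmetric self-play is a promising paradigm for post-training large language models: a teacher continually generates questions for a student to solve at the edge of the student's learnability. These methods promise open-ended data generation without human data, but not every hard problem is interesting or informative for improving the model, and current methods are goal-agnostic with no real grounding. Guided Asymmetric Self-Play (GASP) grounds self-play in real-data goalpost questions that pose a hard exploration challenge to the model. The teacher first generates an easier variant of a hard question, then a harder variant of that easier question, gradually closing the gap to the goalpost during training. GASP improves pass@20 on LiveCodeBench (LCB) by 2.5% over unguided asymmetric self-play and solves hard goalpost questions that remain out of reach for all baselines.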
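
For readers unfamiliar with the pass@20 metric, the sketch below implements the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n sampled solutions of which c are correct. Whether the paper computes pass@20 with exactly this estimator is an assumption; the sample counts are made up for illustration.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if n - c < k:
        return 1.0   # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical example: 40 samples drawn for one problem, 2 of them correct.
estimate = pass_at_k(n=40, c=2, k=20)
print(round(estimate, 4))
```

The benchmark score is the mean of this per-problem estimate over all problems.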

Title: UMO: Unified In-Context Learning Unlocks Motion Foundation Model Priors

Authors: Xiaoyan Cong, Zekun Li, Zhiyang Dou, Hongyu Li, Omid Taheri, Chuan Guo, Abhay Mittal, Sizhe An, Taku Komura, Wojciech Matusik, Michael J. Black, Srinath Sridhar
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.15975
Pdf URL: https://arxiv.org/pdf/2603.15975
Copy Paste: [[2603.15975]] UMO: Unified In-Context Learning Unlocks Motion Foundation Model Priors(https://arxiv.org/abs/2603.15975)
Keywords: generation, generative
Abstract: Large-scale foundation models (LFMs) have recently made impressive progress in text-to-motion generation by learning strong generative priors from massive 3D human motion datasets and paired text descriptions. However, how to effectively and efficiently leverage such single-purpose motion LFMs, i.e., text-to-motion synthesis, in more diverse cross-modal and in-context motion generation downstream tasks remains largely unclear. Prior work typically adapts pretrained generative priors to individual downstream tasks in a task-specific manner. In contrast, our goal is to unlock such priors to support a broad spectrum of downstream motion generation tasks within a single unified framework. To bridge this gap, we present UMO, a simple yet general unified formulation that casts diverse downstream tasks into compositions of atomic per-frame operations, enabling in-context adaptation to unlock the generative priors of pretrained DiT-based motion LFMs. Specifically, UMO introduces three learnable frame-level meta-operation embeddings to specify per-frame intent and employs lightweight temporal fusion to inject in-context cues into the pretrained backbone, with negligible runtime overhead compared to the base model. With this design, UMO finetunes the pretrained model, originally limited to text-to-motion generation, to support diverse previously unsupported tasks, including temporal inpainting, text-guided motion editing, text-serialized geometric constraints, and multi-identity reaction generation. Experiments demonstrate that UMO consistently outperforms task-specific and training-free baselines across a wide range of benchmarks, despite using a single unified model. Code and model will be publicly available. Project Page: this https URL
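Summary: Large-scale foundation models (LFMs) have made impressive progress in text-to-motion generation by learning strong generative priors from massive 3D human motion datasets and paired text. How to use such single-purpose motion LFMs effectively and efficiently for more diverse cross-modal and in-context motion generation tasks remains largely unclear, and prior work typically adapts the priors one task at a time. UMO is a simple, general unified formulation that casts diverse downstream tasks as compositions of atomic per-frame operations. It introduces three learnable frame-level meta-operation embeddings to specify per-frame intent and uses lightweight temporal fusion to inject in-context cues into the pretrained DiT-based backbone, with negligible runtime overhead. The finetuned model supports temporal inpainting, text-guided motion editing, text-serialized geometric constraints, and multi-identity reaction generation, and it consistently outperforms task-specific and training-free baselines with a single model. Code and model will be publicly available. Project Page: this https URL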

Title: FlatLands: Generative Floormap Completion From a Single Egocentric View

Authors: Subhransu S. Bhattacharjee, Dylan Campbell, Rahul Shome
Subjects: cs.CV, cs.AI, cs.RO, eess.IV
Abstract URL: https://arxiv.org/abs/2603.16016
Pdf URL: https://arxiv.org/pdf/2603.16016
Copy Paste: [[2603.16016]] FlatLands: Generative Floormap Completion From a Single Egocentric View(https://arxiv.org/abs/2603.16016)
Keywords: generative
Abstract: A single egocentric image typically captures only a small portion of the floor, yet a complete metric traversability map of the surroundings would better serve applications such as indoor navigation. We introduce FlatLands, a dataset and benchmark for single-view bird's-eye view (BEV) floor completion. The dataset contains 270,575 observations from 17,656 real metric indoor scenes drawn from six existing datasets, with aligned observation, visibility, validity, and ground-truth BEV maps, and the benchmark includes both in- and out-of-distribution evaluation protocols. We compare training-free approaches, deterministic models, ensembles, and stochastic generative models. Finally, we instantiate the task as an end-to-end monocular RGB-to-floormaps pipeline. FlatLands provides a rigorous testbed for uncertainty-aware indoor mapping and generative completion for embodied navigation.
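Summary: A single egocentric image usually captures only a small part of the floor, yet a complete metric traversability map of the surroundings would better serve applications such as indoor navigation. FlatLands is a dataset and benchmark for single-view bird's-eye view (BEV) floor completion. It contains 270,575 observations from 17,656 real metric indoor scenes drawn from six existing datasets, with aligned observation, visibility, validity, and ground-truth BEV maps, and it includes both in- and out-of-distribution evaluation protocols. The paper compares training-free approaches, deterministic models, ensembles, and stochastic generative models, and instantiates the task as an end-to-end monocular RGB-to-floormaps pipeline.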

Title: Collaborative Temporal Feature Generation via Critic-Free Reinforcement Learning for Cross-User Sensor-Based Activity Recognition

Authors: Xiaozhou Ye, Feng Jiang, Zihan Wang, Xiulai Wang, Yutao Zhang, Kevin I-Kai Wang
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2603.16043
Pdf URL: https://arxiv.org/pdf/2603.16043
Copy Paste: [[2603.16043]] Collaborative Temporal Feature Generation via Critic-Free Reinforcement Learning for Cross-User Sensor-Based Activity Recognition(https://arxiv.org/abs/2603.16043)
Keywords: generation
Abstract: Human Activity Recognition using wearable inertial sensors is foundational to healthcare monitoring, fitness analytics, and context-aware computing, yet its deployment is hindered by cross-user variability arising from heterogeneous physiological traits, motor habits, and sensor placements. Existing domain generalization approaches either neglect temporal dependencies in sensor streams or depend on impractical target-domain annotations. We propose a different paradigm: modeling generalizable feature extraction as a collaborative sequential generation process governed by reinforcement learning. Our framework, CTFG (Collaborative Temporal Feature Generation), employs a Transformer-based autoregressive generator that incrementally constructs feature token sequences, each conditioned on prior context and the encoded sensor input. The generator is optimized via Group-Relative Policy Optimization, a critic-free algorithm that evaluates each generated sequence against a cohort of alternatives sampled from the same input, deriving advantages through intra-group normalization rather than learned value estimation. This design eliminates the distribution-dependent bias inherent in critic-based methods and provides self-calibrating optimization signals that remain stable across heterogeneous user distributions. A tri-objective reward comprising class discrimination, cross-user invariance, and temporal fidelity jointly shapes the feature space to separate activities, align user distributions, and preserve fine-grained temporal content. Evaluations on the DSADS and PAMAP2 benchmarks demonstrate state-of-the-art cross-user accuracy (88.53% and 75.22%), substantial reduction in inter-task training variance, accelerated convergence, and robust generalization under varying action-space dimensionalities.
Summary: Human Activity Recognition with wearable inertial sensors underpins healthcare monitoring, fitness analytics, and context-aware computing, but deployment is hindered by cross-user variability in physiological traits, motor habits, and sensor placements. Existing domain generalization approaches either neglect temporal dependencies in sensor streams or depend on impractical target-domain annotations. CTFG (Collaborative Temporal Feature Generation) models generalizable feature extraction as a collaborative sequential generation process governed by reinforcement learning. A Transformer-based autoregressive generator builds feature token sequences and is optimized with Group-Relative Policy Optimization, a critic-free algorithm that scores each generated sequence against alternatives sampled from the same input and derives advantages through intra-group normalization. A tri-objective reward (class discrimination, cross-user invariance, temporal fidelity) shapes the feature space. On DSADS and PAMAP2 the method reaches state-of-the-art cross-user accuracy (88.53% and 75.22%), with reduced inter-task training variance, faster convergence, and robust generalization under varying action-space dimensionalities.
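
The "intra-group normalization" that replaces a learned critic can be sketched in a few lines. This assumes the common formulation in which each reward is standardized against the mean and standard deviation of its group; the reward values and the `eps` constant are hypothetical, and the paper's exact variant may differ.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against the other samples drawn for the same input."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for three feature sequences generated from one sensor window.
adv = group_relative_advantages([1.0, 2.0, 3.0])
print(adv)
```

Sequences that score above their group's average receive positive advantages and the rest negative ones, so no separate value network is needed.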

Title: Interact3D: Compositional 3D Generation of Interactive Objects

Authors: Hui Shan, Keyang Luo, Ming Li, Sizhe Zheng, Yanwei Fu, Zhen Chen, Xiangru Huang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.16085
Pdf URL: https://arxiv.org/pdf/2603.16085
Copy Paste: [[2603.16085]] Interact3D: Compositional 3D Generation of Interactive Objects(https://arxiv.org/abs/2603.16085)
Keywords: generation, generative
Abstract: Recent breakthroughs in 3D generation have enabled the synthesis of high-fidelity individual assets. However, generating 3D compositional objects from single images--particularly under occlusions--remains challenging. Existing methods often degrade geometric details in hidden regions and fail to preserve the underlying object-object spatial relationships (OOR). We present a novel framework Interact3D designed to generate physically plausible interacting 3D compositional objects. Our approach first leverages advanced generative priors to curate high-quality individual assets with a unified 3D guidance scene. To physically compose these assets, we then introduce a robust two-stage composition pipeline. Based on the 3D guidance scene, the primary object is anchored through precise global-to-local geometric alignment (registration), while subsequent geometries are integrated using a differentiable Signed Distance Field (SDF)-based optimization that explicitly penalizes geometry intersections. To reduce challenging collisions, we further deploy a closed-loop, agentic refinement strategy. A Vision-Language Model (VLM) autonomously analyzes multi-view renderings of the composed scene, formulates targeted corrective prompts, and guides an image editing module to iteratively self-correct the generation pipeline. Extensive experiments demonstrate that Interact3D successfully produces promising collsion-aware compositions with improved geometric fidelity and consistent spatial relationships.
摘要：3D 生成领域的最新突破使得高保真单个资产的合成成为可能。然而，从单个图像生成 3D 组合对象（尤其是在遮挡情况下）仍然具有挑战性。现有方法通常会降低隐藏区域中的几何细节，并且无法保留底层的对象-对象空间关系（OOR）。我们提出了一个新颖的 Interact3D 框架，旨在生成物理上合理的交互 3D 组合对象。我们的方法首先利用先进的生成先验，通过统一的 3D 指导场景来管理高质量的单个资产。为了以物理方式组合这些资产，我们引入了一个强大的两阶段组合管道。基于 3D 引导场景，主要对象通过精确的全局到局部几何对齐（配准）进行锚定，而后续几何图形则使用基于可微分符号距离场 (SDF) 的优化进行集成，该优化明确惩罚几何图形相交。为了减少具有挑战性的碰撞，我们进一步部署了闭环、代理细化策略。视觉语言模型（VLM）自主分析合成场景的多视图渲染，制定有针对性的纠正提示，并指导图像编辑模块迭代地自我纠正生成管道。大量实验表明，Interact3D 成功地生成了有前途的碰撞感知组合，具有改进的几何保真度和一致的空间关系。
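Note (illustrative sketch, not from the paper): the abstract describes an SDF-based optimization that penalizes geometry intersections. A minimal way to picture such a penalty, assuming the anchored object is a simple sphere and the second object is a handful of sample points, is to sum the squared penetration depths of the points that fall inside. The sphere, the point list, and the function names are all hypothetical. A real pipeline would evaluate a learned or mesh-derived SDF inside an autodiff framework so the penalty can be minimized by gradient descent.

```python
import math

def sphere_sdf(point, center, radius):
    """Signed distance to a sphere: negative inside, positive outside."""
    return math.dist(point, center) - radius

def intersection_penalty(points, center, radius):
    """Sum of squared penetration depths over points that lie inside the sphere."""
    total = 0.0
    for p in points:
        depth = max(0.0, -sphere_sdf(p, center, radius))  # zero when outside
        total += depth * depth
    return total

# Hypothetical toy scene: a unit sphere as the anchored object and two
# sample points from a second object, one inside and one outside.
anchor_center, anchor_radius = (0.0, 0.0, 0.0), 1.0
surface_points = [(0.5, 0.0, 0.0), (2.0, 0.0, 0.0)]
penalty = intersection_penalty(surface_points, anchor_center, anchor_radius)
print(penalty)
```

Only the point inside the sphere contributes, so moving the second object outward drives the penalty toward zero.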

Title: LICA: Layered Image Composition Annotations for Graphic Design Research

Authors: Elad Hirsch, Shubham Yadav, Mohit Garg, Purvanshi Mehta
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.16098
Pdf URL: https://arxiv.org/pdf/2603.16098
Copy Paste: [[2603.16098]] LICA: Layered Image Composition Annotations for Graphic Design Research(https://arxiv.org/abs/2603.16098)
Keywords: generation, generative
Abstract: We introduce LICA (Layered Image Composition Annotations), a large-scale dataset of 1,550,244 multi-layer graphic design compositions designed to advance structured understanding and generation of graphic layouts. In addition to rendered PNG images, LICA represents each design as a hierarchical composition of typed components including text, image, vector, and group elements, each paired with rich per-element metadata such as spatial geometry, typographic attributes, opacity, and visibility. The dataset spans 20 design categories and 971,850 unique templates, providing broad coverage of real-world design structures. We further introduce graphic design video as a new and largely unexplored challenge for current vision-language models through 27,261 animated layouts annotated with per-component keyframes and motion parameters. Beyond scale, LICA establishes a new paradigm of research tasks for graphic design, enabling structured investigations into problems such as layer-aware inpainting, structured layout generation, controlled design editing, and temporally-aware generative modeling. By representing design as a system of compositional layers and relationships, the dataset supports research on models that operate directly on design structure rather than pixels alone.
摘要：我们引入 LICA（分层图像合成注释），这是一个包含 1,550,244 个多层图形设计合成的大型数据集，旨在促进结构化理解和图形布局的生成。除了渲染的 PNG 图像之外，LICA 将每个设计表示为类型组件的分层组合，包括文本、图像、矢量和组元素，每个组件都与丰富的每个元素元数据配对，例如空间几何、印刷属性、不透明度和可见性。该数据集涵盖 20 个设计类别和 971,850 个独特模板，广泛覆盖现实世界的设计结构。我们通过 27,261 个带有每个组件关键帧和运动参数注释的动画布局，进一步引入图形设计视频，作为当前视觉语言模型的一个新的、很大程度上尚未探索的挑战。除了规模之外，LICA 还为图形设计建立了新的研究任务范式，支持对分层感知修复、结构化布局生成、受控设计编辑和时间感知生成建模等问题进行结构化研究。通过将设计表示为组成层和关系的系统，该数据集支持直接对设计结构而不是仅对像素进行操作的模型的研究。

Title: OneWorld: Taming Scene Generation with 3D Unified Representation Autoencoder

Authors: Sensen Gao, Zhaoqing Wang, Qihang Cao, Dongdong Yu, Changhu Wang, Tongliang Liu, Mingming Gong, Jiawang Bian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.16099
Pdf URL: https://arxiv.org/pdf/2603.16099
Copy Paste: [[2603.16099]] OneWorld: Taming Scene Generation with 3D Unified Representation Autoencoder(https://arxiv.org/abs/2603.16099)
Keywords: generation
Abstract: Existing diffusion-based 3D scene generation methods primarily operate in 2D image/video latent spaces, which makes maintaining cross-view appearance and geometric consistency inherently challenging. To bridge this gap, we present OneWorld, a framework that performs diffusion directly within a coherent 3D representation space. Central to our approach is the 3D Unified Representation Autoencoder (3D-URAE); it leverages pretrained 3D foundation models and augments their geometry-centric nature by injecting appearance and distilling semantics into a unified 3D latent space. Furthermore, we introduce token-level Cross-View-Correspondence (CVC) consistency loss to explicitly enforce structural alignment across views, and propose Manifold-Drift Forcing (MDF) to mitigate train-inference exposure bias and shape a robust 3D manifold by mixing drifted and original representations. Comprehensive experiments demonstrate that OneWorld generates high-quality 3D scenes with superior cross-view consistency compared to state-of-the-art 2D-based methods. Our code will be available at this https URL.
摘要：现有的基于扩散的 3D 场景生成方法主要在 2D 图像/视频潜在空间中运行，这使得保持跨视图外观和几何一致性本身具有挑战性。为了弥补这一差距，我们提出了 OneWorld，这是一个直接在连贯的 3D 表示空间内执行扩散的框架。我们方法的核心是 3D 统一表示自动编码器 (3D-URAE)；它利用预训练的 3D 基础模型，并通过将外观注入和提取语义到统一的 3D 潜在空间中来增强其以几何为中心的性质。此外，我们引入了令牌级跨视图对应（CVC）一致性损失，以显式强制跨视图的结构对齐，并提出流形漂移强制（MDF）来减轻训练推理暴露偏差，并通过混合漂移和原始表示来塑造强大的 3D 流形。综合实验表明，与最先进的基于 2D 的方法相比，OneWorld 可以生成具有卓越跨视图一致性的高质量 3D 场景。我们的代码将在此 https URL 中提供。

Title: Out-of-Distribution Object Detection in Street Scenes via Synthetic Outlier Exposure and Transfer Learning

Authors: Sadia Ilyas, Annika Mütze, Klaus Friedrichs, Thomas Kurbiel, Matthias Rottmann
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.16122
Pdf URL: https://arxiv.org/pdf/2603.16122
Copy Paste: [[2603.16122]] Out-of-Distribution Object Detection in Street Scenes via Synthetic Outlier Exposure and Transfer Learning(https://arxiv.org/abs/2603.16122)
Keywords: generative
Abstract: Out-of-distribution (OOD) object detection is an important yet underexplored task. A reliable object detector should be able to handle OOD objects by localizing and correctly classifying them as OOD. However, a critical issue arises when such atypical objects are completely missed by the object detector and incorrectly treated as background. Existing OOD detection approaches in object detection often rely on complex architectures or auxiliary branches and typically do not provide a framework that treats in-distribution (ID) and OOD in a unified way. In this work, we address these limitations by enabling a single detector to detect OOD objects that are otherwise silently overlooked, alongside ID objects. We present SynOE-OD, a Synthetic Outlier-Exposure-based Object Detection framework that leverages strong generative models, like Stable Diffusion, and Open-Vocabulary Object Detectors (OVODs) to generate semantically meaningful, object-level data that serve as outliers during training. The generated data is used for transfer learning to establish strong ID task performance and supplement detection models with OOD object detection robustness. Our approach achieves state-of-the-art average precision on an established OOD object detection benchmark, where OVODs, such as GroundingDINO, show limited zero-shot performance in detecting OOD objects in street scenes.
摘要：分布外（OOD）对象检测是一项重要但尚未得到充分探索的任务。可靠的对象检测器应该能够通过定位 OOD 对象并将其正确分类为 OOD 来处理它们。然而，当物体检测器完全错过此类非典型物体并错误地将其视为背景时，就会出现一个关键问题。现有的物体检测中的 OOD 检测方法通常依赖于复杂的架构或辅助分支，并且通常不提供以统一方式处理分布内（ID）和 OOD 的框架。在这项工作中，我们使单个检测器能够在检测 ID 对象的同时检测原本会被默默忽略的 OOD 对象，从而解决这些限制。我们提出了 SynOE-OD，一个基于合成异常值暴露（Synthetic Outlier Exposure）的对象检测框架，它利用强大的生成模型（如 Stable Diffusion）和开放词汇对象检测器（OVOD）来生成语义上有意义的对象级数据，这些数据在训练期间充当异常值。生成的数据用于迁移学习，以建立强大的 ID 任务性能，并通过 OOD 对象检测鲁棒性补充检测模型。我们的方法在已建立的 OOD 对象检测基准上实现了最先进的平均精度，其中 OVOD（例如 GroundingDINO）在检测街道场景中的 OOD 对象时显示出有限的零样本性能。

Title: When Generative Augmentation Hurts: A Benchmark Study of GAN and Diffusion Models for Bias Correction in AI Classification Systems

Authors: Shesh Narayan Gupta, Nik Bear Brown
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.16134
Pdf URL: https://arxiv.org/pdf/2603.16134
Copy Paste: [[2603.16134]] When Generative Augmentation Hurts: A Benchmark Study of GAN and Diffusion Models for Bias Correction in AI Classification Systems(https://arxiv.org/abs/2603.16134)
Keywords: generative
Abstract: Generative models are widely used to compensate for class imbalance in AI training pipelines, yet their failure modes under low-data conditions are poorly understood. This paper reports a controlled benchmark comparing three augmentation strategies applied to a fine-grained animal classification task: traditional transforms, FastGAN, and Stable Diffusion 1.5 fine-tuned with Low-Rank Adaptation (LoRA). Using the Oxford-IIIT Pet Dataset with eight artificially underrepresented breeds, we find that FastGAN augmentation does not merely underperform at very low training set sizes but actively increases classifier bias, with a statistically significant large effect across three random seeds (bias gap increase: +20.7%, Cohen's d = +5.03, p = 0.013). The effect size here is large enough to give confidence in the direction of the finding despite the small number of seeds. Feature embedding analysis using t-distributed Stochastic Neighbor Embedding reveals that FastGAN images for severe-minority breeds form tight isolated clusters outside the real image distribution, a pattern consistent with mode collapse. Stable Diffusion with Low-Rank Adaptation produced the best results overall, achieving the highest macro F1 (0.9125 plus or minus 0.0047) and a 13.1% reduction in the bias gap relative to the unaugmented baseline. The data suggest a sample-size boundary somewhere between 20 and 50 training images per class below which GAN augmentation becomes harmful in this setting, though further work across additional domains is needed to establish where that boundary sits more precisely. All experiments run on a consumer-grade GPU with 6 to 8 GB of memory, with no cloud compute required.
摘要：生成模型被广泛用于补偿人工智能训练管道中的类别不平衡，但人们对它们在低数据条件下的故障模式知之甚少。本文报告了一个受控基准，比较了应用于细粒度动物分类任务的三种增强策略：传统变换、FastGAN 和通过低阶适应 (LoRA) 进行微调的稳定扩散 1.5。使用具有八个人为代表性不足的品种的 Oxford-IIIT 宠物数据集，我们发现 FastGAN 增强不仅在非常低的训练集大小下表现不佳，而且还主动增加了分类器偏差，对三个随机种子具有统计上显着的巨大影响（偏差差距增加：+20.7%，Cohen's d = +5.03，p = 0.013）。尽管种子数量很少，但这里的效应量足够大，足以让人们对发现的方向充满信心。使用 t 分布随机邻域嵌入的特征嵌入分析表明，严重少数品种的 FastGAN 图像在真实图像分布之外形成紧密的孤立簇，这种模式与模式崩溃一致。具有低阶适应的稳定扩散总体上产生了最佳结果，实现了最高的宏观 F1（0.9125 正负 0.0047），并且相对于未增强的基线，偏差差距减少了 13.1%。数据表明，样本大小边界在每类 20 到 50 个训练图像之间，低于该边界，GAN 增强在这种情况下会变得有害，尽管需要跨其他领域进行进一步的工作来确定该边界更精确的位置。所有实验都在具有 6 至 8 GB 内存的消费级 GPU 上运行，无需云计算。
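Note (illustrative sketch, not from the paper): the abstract reports an effect size as Cohen's d. One common form divides the difference of two group means by their pooled sample standard deviation. The abstract does not say which variant the authors used, so the pooled form below is an assumption, and the sample values are made up purely to exercise the function.

```python
import math
import statistics

def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation (assumed variant)."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd

# Made-up values, not the paper's measurements.
augmented = [2.0, 4.0, 6.0]
baseline = [1.0, 3.0, 5.0]
print(cohens_d(augmented, baseline))
```

With only three values per group, as with the three seeds mentioned in the abstract, the pooled standard deviation is itself a noisy estimate, which is one reason very large d values can appear at small sample sizes.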

Title: Rethinking UMM Visual Generation: Masked Modeling for Efficient Image-Only Pre-training

Authors: Peng Sun, Jun Xie, Tao Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.16139
Pdf URL: https://arxiv.org/pdf/2603.16139
Copy Paste: [[2603.16139]] Rethinking UMM Visual Generation: Masked Modeling for Efficient Image-Only Pre-training(https://arxiv.org/abs/2603.16139)
Keywords: generation, generative
Abstract: Unified Multimodal Models (UMMs) are often constrained by the pre-training of their visual generation components, which typically relies on inefficient paradigms and scarce, high-quality text-image paired data. In this paper, we systematically analyze pre-training recipes for UMM visual generation and identify these two issues as the major bottlenecks. To address them, we propose Image-Only Training for UMMs (IOMM), a data-efficient two-stage training framework. The first stage pre-trains the visual generative component exclusively using abundant unlabeled image-only data, thereby removing the dependency on paired data for this costly phase. The second stage fine-tunes the model using a mixture of unlabeled images and a small curated set of text-image pairs, leading to improved instruction alignment and generative quality. Extensive experiments show that IOMM not only improves training efficiency but also achieves state-of-the-art (SOTA) performance. For example, our IOMM-B (3.6B) model was trained from scratch using only ~1050 H800 GPU hours (with the vast majority, 1000 hours, dedicated to the efficient image-only pre-training stage). It achieves 0.89 on GenEval and 0.55 on WISE--surpassing strong baselines such as BAGEL-7B (0.82 & 0.55) and BLIP3-o-4B (0.84 & 0.50). Code is available at this https URL.
摘要：统一多模态模型（UMM）通常受到其视觉生成组件预训练的限制，这通常依赖于低效的范式和稀缺的高质量文本图像配对数据。在本文中，我们系统地分析了 UMM 视觉生成的预训练方法，并将这两个问题确定为主要瓶颈。为了解决这些问题，我们提出了 UMM 仅图像训练 (IOMM)，一种数据高效的两阶段训练框架。第一阶段专门使用大量未标记的纯图像数据预训练视觉生成组件，从而在这个昂贵的阶段消除对成对数据的依赖。第二阶段使用未标记图像和一小部分精选的文本图像对的混合来微调模型，从而提高指令对齐和生成质量。大量实验表明，IOMM 不仅提高了训练效率，而且实现了最先进（SOTA）的性能。例如，我们的 IOMM-B (3.6B) 模型仅使用约 1050 个 H800 GPU 小时（其中绝大多数，即 1000 小时，专用于高效的仅图像预训练阶段）从头开始训练。它在 GenEval 上达到 0.89，在 WISE 上达到 0.55，超越了 BAGEL-7B（0.82 和 0.55）和 BLIP3-o-4B（0.84 和 0.50）等强基线。代码可在此 https URL 获取。

Title: EFF-Grasp: Energy-Field Flow Matching for Physics-Aware Dexterous Grasp Generation

Authors: Yukun Zhao, Zichen Zhong, Yongshun Gong, Yilong Yin, Haoliang Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.16151
Pdf URL: https://arxiv.org/pdf/2603.16151
Copy Paste: [[2603.16151]] EFF-Grasp: Energy-Field Flow Matching for Physics-Aware Dexterous Grasp Generation(https://arxiv.org/abs/2603.16151)
Keywords: generation, generative
Abstract: Denoising generative models have recently become the dominant paradigm for dexterous grasp generation, owing to their ability to model complex grasp distributions from large-scale data. However, existing diffusion-based methods typically formulate generation as a stochastic differential equation (SDE), which often requires many sequential denoising steps and introduces trajectory instability that can lead to physically infeasible grasps. In this paper, we propose EFF-Grasp, a novel Flow-Matching-based framework for physics-aware dexterous grasp generation. Specifically, we reformulate grasp synthesis as a deterministic ordinary differential equation (ODE) process, which enables efficient and stable generation through smooth probability flows. To further enforce physical feasibility, we introduce a training-free physics-aware energy guidance strategy. Our method defines an energy-guided target distribution using adapted explicit physical energy functions that capture key grasp constraints, and estimates the corresponding guidance term via a local Monte Carlo approximation during inference. In this way, EFF-Grasp dynamically steers the generation trajectory toward physically feasible regions without requiring additional physics-based training or simulation feedback. Extensive experiments on five benchmark datasets show that EFF-Grasp achieves superior performance in grasp quality and physical feasibility, while requiring substantially fewer sampling steps than diffusion-based baselines.
摘要：去噪生成模型最近已成为灵巧抓取生成的主导范例，因为它们能够根据大规模数据对复杂的抓取分布进行建模。然而，现有的基于扩散的方法通常将生成公式化为随机微分方程（SDE），这通常需要许多连续的去噪步骤，并引入轨迹不稳定，可能导致物理上不可行的抓取。在本文中，我们提出了 EFF-Grasp，一种新颖的基于流匹配的框架，用于物理感知的灵巧抓取生成。具体来说，我们将抓取合成重新表述为确定性常微分方程（ODE）过程，它可以通过平滑的概率流实现高效稳定的生成。为了进一步增强物理可行性，我们引入了一种免训练的物理感知能量引导策略。我们的方法使用适应的显式物理能量函数来定义能量引导目标分布，该函数捕获关键的抓取约束，并在推理过程中通过局部蒙特卡罗近似估计相应的引导项。通过这种方式，EFF-Grasp 动态地将生成轨迹引导至物理上可行的区域，而不需要额外的基于物理的训练或模拟反馈。对五个基准数据集的广泛实验表明，EFF-Grasp 在抓取质量和物理可行性方面实现了卓越的性能，同时比基于扩散的基线所需的采样步骤少得多。
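Note (illustrative sketch, not from the paper): the abstract frames generation as integrating a deterministic ODE rather than simulating an SDE. The sketch below shows that idea with a plain Euler integrator over t in [0, 1). The velocity field is a hand-written toy that points along a straight line toward a fixed target, standing in for a learned network. The target values and function names are hypothetical, and the paper's energy guidance term is not modeled here.

```python
def euler_sample(x0, velocity, num_steps):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with Euler steps."""
    x = list(x0)
    h = 1.0 / num_steps
    for k in range(num_steps):
        t = k * h
        v = velocity(x, t)
        x = [xi + h * vi for xi, vi in zip(x, v)]
    return x

# Hypothetical target and toy velocity field for straight-line paths:
# v(x, t) = (target - x) / (1 - t). A trained model would replace this.
target = [1.0, -2.0]

def toy_velocity(x, t):
    return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]

sample = euler_sample([0.0, 0.0], toy_velocity, num_steps=8)
print(sample)
```

Because the toy paths are straight, a few Euler steps already land on the target, which gives some intuition for why ODE-style samplers can use fewer steps than stochastic ones.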

Title: Execution-Grounded Credit Assignment for GRPO in Code Generation

Authors: Abhijit Kumar, Natalya Kumar, Shikhar Gupta
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.16158
Pdf URL: https://arxiv.org/pdf/2603.16158
Copy Paste: [[2603.16158]] Execution-Grounded Credit Assignment for GRPO in Code Generation(https://arxiv.org/abs/2603.16158)
Keywords: generation
Abstract: Critic-free reinforcement learning with verifiable rewards (RLVR) improves code generation by optimizing unit-test pass rates, but GRPO-style updates suffer from coarse credit assignment: a single outcome signal is spread uniformly across long programs even when failure stems from a localized semantic error. We propose Execution-Grounded Credit Assignment (EGCA), which localizes GRPO updates using execution traces. For programs that satisfy algorithmic constraints but fail tests, EGCA executes the candidate and a canonical reference solution (curated once offline; used for analysis, not supervision) under identical instrumentation, identifies the earliest semantic divergence, and assigns advantage only to the corresponding token span while masking downstream tokens. EGCA is a drop-in modification requiring no critic, auxiliary loss, or learned verifier, yielding 82.1% pass@1 on HumanEval (+3.1 over GRPO) and 68.9% on MBPP (+1.5) with 18% wall-clock overhead.
摘要：具有可验证奖励的无批评强化学习（RLVR）通过优化单元测试通过率来改进代码生成，但 GRPO 式更新受到粗略信用分配的影响：即使失败源于局部语义错误，单个结果信号也会均匀地分布在长程序中。我们提出基于执行的信用分配（EGCA），它使用执行跟踪本地化 GRPO 更新。对于满足算法约束但测试失败的程序，EGCA 在相同的仪器下执行候选和规范参考解决方案（离线策划；用于分析，而不是监督），识别最早的语义分歧，并仅将优势分配给相应的令牌跨度，同时屏蔽下游令牌。 EGCA 是一种直接修改，不需要批评者、辅助损失或学习验证者，在 HumanEval 上产生 82.1% pass@1（比 GRPO 增加 3.1），在 MBPP 上产生 68.9%（+1.5），并有 18% 的挂钟开销。
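Note (illustrative sketch, not from the paper): the abstract contrasts GRPO's uniform credit assignment with assigning advantage only to a localized token span. The sketch below computes a group-relative advantage per sampled program (reward minus group mean, divided by group standard deviation) and then applies it only to tokens inside a span. The span indices, function names, and the choice to zero out every token outside the span are assumptions made for illustration. Finding the span from execution traces is the paper's contribution and is not reproduced here.

```python
import statistics

def group_relative_advantages(rewards):
    """Normalize each reward against its own sampling group (no critic)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0] * len(rewards)  # identical rewards carry no signal
    return [(r - mean) / std for r in rewards]

def localize_advantage(advantage, num_tokens, span_start, span_end):
    """Per-token advantages: nonzero only for tokens in [span_start, span_end)."""
    return [advantage if span_start <= i < span_end else 0.0
            for i in range(num_tokens)]

# Hypothetical group of four sampled programs with pass/fail rewards.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# Suppose tokens 2..3 of the second (failing) program hold the divergence.
token_advantages = localize_advantage(advantages[1], num_tokens=6,
                                      span_start=2, span_end=4)
print(advantages, token_advantages)
```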

Title: AI-Generated Figures in Academic Publishing: Policies, Tools, and Practical Guidelines

Authors: Davie Chen
Subjects: cs.CV, cs.CY
Abstract URL: https://arxiv.org/abs/2603.16159
Pdf URL: https://arxiv.org/pdf/2603.16159
Copy Paste: [[2603.16159]] AI-Generated Figures in Academic Publishing: Policies, Tools, and Practical Guidelines(https://arxiv.org/abs/2603.16159)
Keywords: generation, generative
Abstract: The rapid advancement of generative AI has introduced a new class of tools capable of producing publication-quality scientific figures, graphical abstracts, and data visualizations. However, academic publishers have responded with inconsistent and often ambiguous policies regarding AI-generated imagery. This paper surveys the current stance of major journals and publishers -- including Nature, Science, Cell Press, Elsevier, and PLOS -- on the use of AI-generated figures. We identify key concerns raised by publishers, including reproducibility, authorship attribution, and potential for visual misinformation. Drawing on practical examples from tools such as SciDraw, an AI-powered platform designed specifically for scientific illustration, we propose a set of best-practice guidelines for researchers seeking to use AI figure-generation tools in a compliant and transparent manner. Our findings suggest that, with appropriate disclosure and quality control, AI-generated figures can meaningfully accelerate scientific communication without compromising integrity.
摘要：生成式人工智能的快速发展引入了一类新的工具，能够生成出版质量的科学图表、图形摘要和数据可视化。然而，学术出版商对人工智能生成的图像采取了不一致且常常含糊的政策。本文调查了主要期刊和出版商（包括 Nature、Science、Cell Press、Elsevier 和 PLOS）目前对使用人工智能生成的图表的立场。我们确定了出版商提出的主要问题，包括再现性、作者归属以及视觉错误信息的可能性。借鉴 SciDraw（专门为科学插图设计的人工智能平台）等工具的实际示例，我们为寻求以合规和透明的方式使用人工智能图表生成工具的研究人员提出了一套最佳实践指南。我们的研究结果表明，通过适当的披露和质量控制，人工智能生成的图表可以在不损害完整性的情况下有意义地加速科学交流。

Title: 360° Image Perception with MLLMs: A Comprehensive Benchmark and a Training-Free Method

Authors: Huyen T. T. Tran, Van-Quang Nguyen, Farros Alferro, Kang-Jun Liu, Takayuki Okatani
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.16179
Pdf URL: https://arxiv.org/pdf/2603.16179
Copy Paste: [[2603.16179]] 360° Image Perception with MLLMs: A Comprehensive Benchmark and a Training-Free Method(https://arxiv.org/abs/2603.16179)
Keywords: generation
Abstract: Multimodal Large Language Models (MLLMs) have shown impressive abilities in understanding and reasoning over conventional images. However, their perception of 360° images remains largely underexplored. Unlike conventional images, 360° images capture the entire surrounding environment, enabling holistic spatial reasoning but introducing challenges such as geometric distortion and complex spatial relations. To comprehensively assess MLLMs' capabilities to perceive 360° images, we introduce 360Bench, a Visual Question Answering (VQA) benchmark featuring 7K-resolution 360° images, seven representative (sub)tasks with annotations carefully curated by human annotators. Using 360Bench, we systematically evaluate seven MLLMs and six enhancement methods, revealing their shortcomings in 360° image perception. To address these challenges, we propose Free360, a training-free scene-graph-based framework for high-resolution 360° VQA. Free360 decomposes the reasoning process into modular steps, applies adaptive spherical image transformations to 360° images tailored to each step, and seamlessly integrates the resulting information into a unified graph representation for answer generation. Experiments show that Free360 consistently improves its base MLLM and provides a strong training-free solution for 360° VQA tasks. The source code and dataset will be publicly released upon acceptance.
摘要：多模态大语言模型（MLLM）在理解和推理传统图像方面表现出了令人印象深刻的能力。然而，他们对 360° 图像的感知在很大程度上仍未得到充分探索。与传统图像不同，360°图像捕捉整个周围环境，实现整体空间推理，但带来了几何失真和复杂空间关系等挑战。为了全面评估 MLLM 感知 360° 图像的能力，我们引入了 360Bench，这是一种视觉问答 (VQA) 基准，具有 7K 分辨率的 360° 图像、七个代表性（子）任务以及由人类注释者精心策划的注释。使用 360Bench，我们系统地评估了七种 MLLM 和六种增强方法，揭示了它们在 360° 图像感知方面的缺点。为了应对这些挑战，我们提出了 Free360，这是一种无需训练、基于场景图的高分辨率 360° VQA 框架。 Free360 将推理过程分解为模块化步骤，将自适应球形图像变换应用于针对每个步骤定制的 360° 图像，并将结果信息无缝集成到统一的图形表示中以生成答案。实验表明，Free360 不断改进其基础 MLLM，并为 360° VQA 任务提供强大的免训练解决方案。源代码和数据集将在接受后公开发布。

Title: ECHO: Edge-Cloud Humanoid Orchestration for Language-to-Motion Control

Authors: Haozhe Jia, Jianfei Song, Yuan Zhang, Honglei Jin, Youcheng Fan, Wenshuo Chen, Wei Zhang, Yutao Yue
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.16188
Pdf URL: https://arxiv.org/pdf/2603.16188
Copy Paste: [[2603.16188]] ECHO: Edge-Cloud Humanoid Orchestration for Language-to-Motion Control(https://arxiv.org/abs/2603.16188)
Keywords: generation
Abstract: We present ECHO, an edge--cloud framework for language-driven whole-body control of humanoid robots. A cloud-hosted diffusion-based text-to-motion generator synthesizes motion references from natural language instructions, while an edge-deployed reinforcement-learning tracker executes them in closed loop on the robot. The two modules are bridged by a compact, robot-native 38-dimensional motion representation that encodes joint angles, root planar velocity, root height, and a continuous 6D root orientation per frame, eliminating inference-time retargeting from human body models and remaining directly compatible with low-level PD control. The generator adopts a 1D convolutional UNet with cross-attention conditioned on CLIP-encoded text features; at inference, DDIM sampling with 10 denoising steps and classifier-free guidance produces motion sequences in approximately one second on a cloud GPU. The tracker follows a Teacher--Student paradigm: a privileged teacher policy is distilled into a lightweight student equipped with an evidential adaptation module for sim-to-real transfer, further strengthened by morphological symmetry constraints and domain randomization. An autonomous fall recovery mechanism detects falls via onboard IMU readings and retrieves recovery trajectories from a pre-built motion library. We evaluate ECHO on a retargeted HumanML3D benchmark, where it achieves strong generation quality (FID 0.029, R-Precision Top-1 0.686) under a unified robot-domain evaluator, while maintaining high motion safety and trajectory consistency. Real-world experiments on a Unitree G1 humanoid demonstrate stable execution of diverse text commands with zero hardware fine-tuning.
摘要：我们推出了 ECHO，一种用于语言驱动的人形机器人全身控制的边缘云框架。云托管的基于扩散的文本到运动生成器从自然语言指令中合成运动参考，而边缘部署的强化学习跟踪器则在机器人上的闭环中执行它们。这两个模块通过紧凑的机器人原生 38 维运动表示来桥接，该表示对关节角度、根平面速度、根高度和每帧的连续 6D 根方向进行编码，消除了人体模型的推理时间重定向，并保持与低级 PD 控制直接兼容。生成器采用 1D 卷积 UNet，具有以 CLIP 编码文本特征为条件的交叉注意力；在推理时，具有 10 个降噪步骤和无分类器引导的 DDIM 采样可在云 GPU 上大约一秒内生成运动序列。该跟踪器遵循教师-学生范式：特权教师策略被提炼为轻量级学生，配备了用于模拟到真实迁移的证据适应模块，并通过形态对称约束和域随机化进一步加强。自主跌倒恢复机制通过板载 IMU 读数检测跌倒，并从预构建的运动库中检索恢复轨迹。我们在重新定位的 HumanML3D 基准上评估 ECHO，它在统一的机器人域评估器下实现了强大的生成质量（FID 0.029，R-Precision Top-1 0.686），同时保持了较高的运动安全性和轨迹一致性。 Unitree G1 人形机器人上的真实实验证明，可以在零硬件微调的情况下稳定执行各种文本命令。
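Note (illustrative sketch, not from the paper): the abstract mentions classifier-free guidance at inference. In its commonly used form, the guided noise prediction extrapolates from the unconditional prediction toward the text-conditioned one by a guidance scale w: eps = eps_uncond + w * (eps_cond - eps_uncond). The abstract does not give the scale or the exact variant used, so the value 2.0 and the two short vectors below are hypothetical placeholders for model outputs.

```python
def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Combine unconditional and conditional predictions elementwise."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

# Placeholder predictions; a real generator would produce these per denoising step.
eps_uncond = [0.0, 0.0]
eps_cond = [1.0, 2.0]
guided = classifier_free_guidance(eps_uncond, eps_cond, guidance_scale=2.0)
print(guided)
```

A scale of 1.0 recovers the conditional prediction, and larger values push the sample further toward the text condition.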

Title: Reliable Reasoning in SVG-LLMs via Multi-Task Multi-Reward Reinforcement Learning

Authors: Haomin Wang, Qi Wei, Qianli Ma, Shengyuan Ding, Jinhui Yin, Kai Chen, Hongjie Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.16189
Pdf URL: https://arxiv.org/pdf/2603.16189
Copy Paste: [[2603.16189]] Reliable Reasoning in SVG-LLMs via Multi-Task Multi-Reward Reinforcement Learning(https://arxiv.org/abs/2603.16189)
Keywords: generation
Abstract: With the rapid advancement of vision-language models, an increasing number of studies have explored their potential for SVG generation tasks. Although existing approaches improve performance by constructing large-scale SVG datasets and introducing SVG-specific tokens, they still suffer from limited generalization, redundant paths in code outputs, and a lack of explicit reasoning. In this work, we present CTRL-S (Chain-of-Thought Reinforcement Learning for SVG), a unified framework that introduces a chain-of-thought mechanism to explicitly expose the model's reasoning process during SVG generation. To support this structured reasoning, we construct SVG-Sophia, a high-quality dataset containing 145K samples across SVG code refinement, Text-to-SVG, and Image-to-SVG tasks. By training the model to generate group-level structured SVG code, CTRL-S significantly improves structural coherence and visual fidelity. Furthermore, we adopt the GRPO algorithm and design a multi-reward optimization framework, incorporating DINO, image-text similarity, format, and code efficiency rewards. Through joint multi-reward optimization and multi-task training, our approach systematically enhances overall generation capabilities. Extensive experiments show that CTRL-S outperforms existing methods, achieving higher task success rates, superior SVG code quality, and exceptional visual fidelity.
摘要：随着视觉语言模型的快速发展，越来越多的研究探索了它们在 SVG 生成任务中的潜力。尽管现有方法通过构建大规模 SVG 数据集和引入 SVG 特定标记来提高性能，但它们仍然存在泛化能力有限、代码输出中的冗余路径以及缺乏显式推理的问题。在这项工作中，我们提出了 CTRL-S（SVG 思想链强化学习），这是一个统一的框架，引入了思想链机制来显式地暴露 SVG 生成过程中模型的推理过程。为了支持这种结构化推理，我们构建了 SVG-Sophia，这是一个高质量的数据集，包含跨 SVG 代码细化、文本到 SVG 和图像到 SVG 任务的 145K 样本。通过训练模型生成组级结构化 SVG 代码，CTRL-S 显着提高了结构连贯性和视觉保真度。此外，我们采用GRPO算法，设计了一个多奖励优化框架，结合了DINO、图文相似度、格式和代码效率奖励。通过联合多奖励优化和多任务训练，我们的方法系统地增强了整体生成能力。大量实验表明 CTRL-S 优于现有方法，实现了更高的任务成功率、卓越的 SVG 代码质量和出色的视觉保真度。

Title: S-VAM: Shortcut Video-Action Model by Self-Distilling Geometric and Semantic Foresight

Authors: Haodong Yan, Zhide Zhong, Jiaguan Zhu, Junjie He, Weilin Yuan, Wenxuan Song, Xin Gong, Yingjie Cai, Guanyi Zhao, Xu Yan, Bingbing Liu, Ying-Cong Chen, Haoang Li
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2603.16195
Pdf URL: https://arxiv.org/pdf/2603.16195
Copy Paste: [[2603.16195]] S-VAM: Shortcut Video-Action Model by Self-Distilling Geometric and Semantic Foresight(https://arxiv.org/abs/2603.16195)
Keywords: generation, generative
Abstract: Video action models (VAMs) have emerged as a promising paradigm for robot learning, owing to their powerful visual foresight for complex manipulation tasks. However, current VAMs, typically relying on either slow multi-step video generation or noisy one-step feature extraction, cannot simultaneously guarantee real-time inference and high-fidelity foresight. To address this limitation, we propose S-VAM, a shortcut video-action model that foresees coherent geometric and semantic representations via a single forward pass. Serving as a stable blueprint, these foreseen representations significantly simplify the action prediction. To enable this efficient shortcut, we introduce a novel self-distillation strategy that condenses structured generative priors of multi-step denoising into one-step inference. Specifically, vision foundation model (VFM) representations extracted from the diffusion model's own multi-step generated videos provide teacher targets. Lightweight decouplers, as students, learn to directly map noisy one-step features to these targets. Extensive experiments in simulation and the real world demonstrate that our S-VAM outperforms state-of-the-art methods, enabling efficient and precise manipulation in complex environments. Our project page is this https URL
摘要：视频动作模型（VAM）因其对复杂操作任务的强大视觉预见性而成为机器人学习的一个有前途的范例。然而，当前的 VAM 通常依赖于缓慢的多步视频生成或嘈杂的一步特征提取，无法同时保证实时推理和高保真预见。为了解决这个限制，我们提出了 S-VAM，这是一种快捷视频动作模型，可以通过单个前向传递预见连贯的几何和语义表示。作为稳定的蓝图，这些预见的表示显着简化了动作预测。为了实现这种有效的捷径，我们引入了一种新颖的自蒸馏策略，该策略将多步去噪的结构化生成先验压缩为一步推理。具体来说，从扩散模型自己的多步骤生成的视频中提取的视觉基础模型（VFM）表示提供了教师目标。作为学生，轻量级解耦器学习将嘈杂的一步特征直接映射到这些目标。模拟和现实世界中的大量实验表明，我们的 S-VAM 优于最先进的方法，可在复杂环境中实现高效、精确的操作。我们的项目页面是这个https URL

Title: Leveling3D: Leveling Up 3D Reconstruction with Feed-Forward 3D Gaussian Splatting and Geometry-Aware Generation

Authors: Yiming Huang, Baixiang Huang, Beilei Cui, Chi Kit Ng, Long Bai, Hongliang Ren
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.16211
Pdf URL: https://arxiv.org/pdf/2603.16211
Copy Paste: [[2603.16211]] Leveling3D: Leveling Up 3D Reconstruction with Feed-Forward 3D Gaussian Splatting and Geometry-Aware Generation(https://arxiv.org/abs/2603.16211)
Keywords: generation
Abstract: Feed-forward 3D reconstruction has revolutionized 3D vision, providing a powerful baseline for downstream tasks such as novel-view synthesis with 3D Gaussian Splatting. Previous works explore fixing the corrupted rendering results with a diffusion model. However, they lack geometric concern and fail at filling the missing area on the extrapolated view. In this work, we introduce Leveling3D, a novel pipeline that integrates feed-forward 3D reconstruction with geometrical-consistent generation to enable holistic simultaneous reconstruction and generation. We propose a geometry-aware leveling adapter, a lightweight technique that aligns internal knowledge in the diffusion model with the geometry prior from the feed-forward model. The leveling adapter enables generation on the artifact area of the extrapolated novel views caused by underconstrained regions of the 3D representation. Specifically, to learn a more diverse distributed generation, we introduce the palette filtering strategy for training, and a test-time masking refinement to prevent messy boundaries along the fixing regions. More importantly, the enhanced extrapolated novel views from Leveling3D could be used as the inputs for feed-forward 3DGS, leveling up the 3D reconstruction. We achieve SOTA performance on public datasets, including tasks such as novel-view synthesis and depth estimation.
摘要：前馈 3D 重建彻底改变了 3D 视觉，为下游任务（例如使用 3D 高斯分布进行新颖视图合成）提供了强大的基线。以前的作品探索使用扩散模型修复损坏的渲染结果。然而，他们缺乏几何关注，无法填充外推视图上的缺失区域。在这项工作中，我们引入了 Leveling3D，这是一种新颖的管道，它将前馈 3D 重建与几何一致生成相结合，以实现整体同步重建和生成。我们提出了一种几何感知的调平适配器，这是一种轻量级技术，可将扩散模型中的内部知识与前馈模型中的先验几何结构对齐。调平适配器能够在由 3D 表示的欠约束区域引起的外推新颖视图的伪影区域上生成。具体来说，为了学习更加多样化的分布式生成，我们引入了用于训练的调色板过滤策略，以及测试时掩蔽细化以防止固定区域的混乱边界。更重要的是，来自 Leveling3D 的增强型外推新颖视图可以用作前馈 3DGS 的输入，从而提高 3D 重建的水平。我们在公共数据集上实现了 SOTA 性能，包括新颖视图合成和深度估计等任务。

Title: RASLF: Representation-Aware State Space Model for Light Field Super-Resolution

Authors: Zeqiang Wei, Kai Jin, Kuan Song, Xiuzhuang Zhou, Wenlong Chen, Min Xu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.16243
Pdf URL: https://arxiv.org/pdf/2603.16243
Copy Paste: [[2603.16243]] RASLF: Representation-Aware State Space Model for Light Field Super-Resolution(https://arxiv.org/abs/2603.16243)
Keywords: super-resolution
Abstract: Current SSM-based light field super-resolution (LFSR) methods often fail to fully leverage the complementarity among various LF representations, leading to the loss of fine textures and geometric misalignments across views. To address these issues, we propose RASLF, a representation-aware state-space framework that explicitly models structural correlations across multiple LF representations. Specifically, a Progressive Geometric Refinement (PGR) block is created that uses a panoramic epipolar representation to explicitly encode multi-view parallax differences, thereby enabling integration across different LF representations. Furthermore, we introduce a Representation Aware Asymmetric Scanning (RAAS) mechanism that dynamically adjusts scanning paths based on the physical properties of different representation spaces, optimizing the balance between performance and efficiency through path pruning. Additionally, a Dual-Anchor Aggregation (DAA) module improves hierarchical feature flow, reducing redundant deep-layer features and prioritizing important reconstruction information. Experiments on various public benchmarks show that RASLF achieves the highest reconstruction accuracy while remaining highly computationally efficient.

Title: Synergizing Deep Learning and Biological Heuristics for Extreme Long-Tail White Blood Cell Classification

Authors: Trong-Duc Nguyen, Hoang-Long Nguyen, Huy-Hieu Pham
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.16249
Pdf URL: https://arxiv.org/pdf/2603.16249
Copy Paste: [[2603.16249]] Synergizing Deep Learning and Biological Heuristics for Extreme Long-Tail White Blood Cell Classification(https://arxiv.org/abs/2603.16249)
Keywords: restoration, generative
Abstract: Automated white blood cell (WBC) classification is essential for leukemia screening but remains challenged by extreme class imbalance, long-tail distributions, and domain shift, leading deep models to overfit dominant classes and fail on rare subtypes. We propose a hybrid framework for rare-class generalization that integrates a generative Pix2Pix-based restoration module for artifact removal, a Swin Transformer ensemble with MedSigLIP contrastive embeddings for robust representation learning, and a biologically-inspired refinement step using geometric spikiness and Mahalanobis-based morphological constraints to recover out-of-distribution predictions. Evaluated on the WBCBench 2026 challenge, our method achieves a Macro-F1 of 0.77139 on the private leaderboard, demonstrating strong performance under severe imbalance and highlighting the value of incorporating biological priors into deep learning for hematological image analysis.
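
Illustrative sketch (not from the paper): the abstract mentions Mahalanobis-based morphological constraints for recovering out-of-distribution predictions but gives no formula. The snippet below only shows what a generic Mahalanobis plausibility check against a class's feature distribution could look like. The function names, the feature layout, and the threshold of 3.0 are assumptions made for illustration.

```python
import numpy as np

def mahalanobis_distance(x, mean, cov):
    """Distance of feature vector x from a distribution with the given mean and covariance."""
    diff = np.asarray(x, dtype=float) - np.asarray(mean, dtype=float)
    # Solve cov @ y = diff instead of forming the explicit inverse.
    y = np.linalg.solve(np.asarray(cov, dtype=float), diff)
    return float(np.sqrt(diff @ y))

def is_plausible(x, class_features, threshold=3.0):
    """Hypothetical rule: accept a prediction only if x lies close to the class's features."""
    feats = np.asarray(class_features, dtype=float)   # (num_samples, num_features)
    mean = feats.mean(axis=0)
    # A small ridge term keeps the covariance invertible for tiny samples.
    cov = np.cov(feats, rowvar=False) + 1e-6 * np.eye(feats.shape[1])
    return mahalanobis_distance(x, mean, cov) <= threshold
```

A prediction whose morphological features fall far from every training sample of the predicted class would fail such a check and could then be reassigned or flagged, which is one plausible reading of a "refinement step".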

Title: Visual Prompt Discovery via Semantic Exploration

Authors: Jaechang Kim, Yotaro Shimose, Zhao Wang, Kuang-Da Wang, Jungseul Ok, Shingo Takamatsu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.16250
Pdf URL: https://arxiv.org/pdf/2603.16250
Copy Paste: [[2603.16250]] Visual Prompt Discovery via Semantic Exploration(https://arxiv.org/abs/2603.16250)
Keywords: generation
Abstract: LVLMs encounter significant challenges in image understanding and visual reasoning, leading to critical perception failures. Visual prompts, which incorporate image manipulation code, have shown promising potential in mitigating these issues. Although this direction is promising, previous methods for visual prompt generation have focused on tool selection rather than diagnosing and mitigating the root causes of LVLM perception failures. Because of the opacity and unpredictability of LVLMs, optimal visual prompts must be discovered through empirical experiments, which have relied on manual human trial-and-error. We propose an automated semantic exploration framework for discovering task-wise visual prompts. Our approach enables diverse yet efficient exploration through agent-driven experiments, minimizing human intervention and avoiding the inefficiency of per-sample generation. We introduce a semantic exploration algorithm named SEVEX, which addresses two major challenges of visual prompt exploration: (1) the distraction caused by lengthy, low-level code and (2) the vast, unstructured search space of visual prompts. Specifically, our method leverages an abstract idea space as a search space, a novelty-guided selection algorithm, and a semantic feedback-driven ideation process to efficiently explore diverse visual prompts based on empirical results. We evaluate SEVEX on the BlindTest and BLINK benchmarks, which are designed to assess LVLM perception. Experimental results demonstrate that SEVEX significantly outperforms baseline methods in task accuracy, inference efficiency, exploration efficiency, and exploration stability. Notably, our framework discovers sophisticated and counter-intuitive visual strategies that go beyond conventional tool usage, offering a new paradigm for enhancing LVLM perception through automated, task-wise visual prompts.

Title: Point-to-Mask: From Arbitrary Point Annotations to Mask-Level Infrared Small Target Detection

Authors: Weihua Gao, Wenlong Niu, Jie Tang, Man Yang, Jiafeng Zhang, Xiaodong Peng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.16257
Pdf URL: https://arxiv.org/pdf/2603.16257
Copy Paste: [[2603.16257]] Point-to-Mask: From Arbitrary Point Annotations to Mask-Level Infrared Small Target Detection(https://arxiv.org/abs/2603.16257)
Keywords: generation
Abstract: Infrared small target detection (IRSTD) methods predominantly formulate the task as pixel-level segmentation, which requires costly dense annotations and is not well suited to tiny targets with weak texture and ambiguous boundaries. To address this issue, we propose Point-to-Mask, a framework that bridges low-cost point supervision and mask-level detection through two components: a Physics-driven Adaptive Mask Generation (PAMG) module that converts point annotations into compact target masks and geometric cues, and a lightweight Radius-aware Point Regression Network (RPR-Net) that reformulates IRSTD as target center localization and effective radius regression using spatiotemporal motion cues. The two modules form a closed loop: PAMG generates pseudo masks and geometric supervision during training, while the geometric predictions of RPR-Net are fed back to PAMG for pixel-level mask recovery during inference. To facilitate systematic evaluation, we further construct SIRSTD-Pixel, a sequential dataset with refined pixel-level annotations. Experiments show that the proposed framework achieves strong pseudo-label quality, high detection accuracy, and efficient inference, approaching full-supervision performance under point-supervised settings with substantially lower annotation cost. Code and datasets will be available at: this https URL.

Title: VIGOR: VIdeo Geometry-Oriented Reward for Temporal Generative Alignment

Authors: Tengjiao Yin, Jinglei Shi, Heng Guo, Xi Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.16271
Pdf URL: https://arxiv.org/pdf/2603.16271
Copy Paste: [[2603.16271]] VIGOR: VIdeo Geometry-Oriented Reward for Temporal Generative Alignment(https://arxiv.org/abs/2603.16271)
Keywords: generative
Abstract: Video diffusion models lack explicit geometric supervision during training, leading to inconsistency artifacts such as object deformation, spatial drift, and depth violations in generated videos. To address this limitation, we propose a geometry-based reward model that leverages pretrained geometric foundation models to evaluate multi-view consistency through cross-frame reprojection error. Unlike previous geometric metrics that measure inconsistency in pixel space, where pixel intensity may introduce additional noise, our approach conducts error computation in a pointwise fashion, yielding a more physically grounded and robust error metric. Furthermore, we introduce a geometry-aware sampling strategy that filters out low-texture and non-semantic regions, focusing evaluation on geometrically meaningful areas with reliable correspondences to improve robustness. We apply this reward model to align video diffusion models through two complementary pathways: post-training of a bidirectional model via supervised fine-tuning (SFT) or reinforcement learning, and inference-time optimization of a causal video model (e.g., a streaming video generator) via test-time scaling with our reward as a path verifier. Experimental results validate the effectiveness of our design, demonstrating that our geometry-based reward provides superior robustness compared to other variants. By enabling efficient inference-time scaling, our method offers a practical solution for enhancing open-source video models without requiring extensive computational resources for retraining.
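
Illustrative sketch (not from the paper): the abstract says the reward measures cross-frame reprojection error pointwise rather than in pixel space, but does not give the formula. The snippet below assumes a simple setup: matched 3D points predicted for two frames, a known rigid transform from frame A to frame B, and an optional mask standing in for the geometry-aware sampling. All names and the exact error definition are assumptions.

```python
import numpy as np

def pointwise_reprojection_error(points_a, points_b, rotation_ab, translation_ab, mask=None):
    """Mean 3D distance between frame-A points moved into frame B and their matches in frame B."""
    pts_a = np.asarray(points_a, dtype=float)   # (N, 3) points in frame A coordinates
    pts_b = np.asarray(points_b, dtype=float)   # (N, 3) matched points in frame B coordinates
    rot = np.asarray(rotation_ab, dtype=float)  # (3, 3) rotation from A to B
    trans = np.asarray(translation_ab, dtype=float)
    # Rigid transform x_b = R x_a + t, applied to every row.
    moved = pts_a @ rot.T + trans
    dist = np.linalg.norm(moved - pts_b, axis=1)  # per-point error in 3D, not pixel intensity
    if mask is not None:
        dist = dist[np.asarray(mask, dtype=bool)]  # keep only reliable correspondences
    return float(dist.mean()) if dist.size else 0.0
```

A reward could then be defined as the negative of this error, so that geometrically consistent videos score higher. That mapping is also an assumption here.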

Title: Persistent Story World Simulation with Continuous Character Customization

Authors: Jinlu Zhang, Qiyun Wang, Baoxiang Du, Jiayi Ji, Jing He, Rongsheng Zhang, Tangjie Lv, Xiaoshuai Sun, Rongrong Ji
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.16285
Pdf URL: https://arxiv.org/pdf/2603.16285
Copy Paste: [[2603.16285]] Persistent Story World Simulation with Continuous Character Customization(https://arxiv.org/abs/2603.16285)
Keywords: generation
Abstract: Story visualization has gained increasing attention in computer vision. However, current methods often fail to achieve a synergy between accurate character customization, semantic alignment, and continuous integration of new identities. To tackle this challenge, in this paper we present EverTale, a story world simulator for continuous story character customization. We first propose an All-in-One-World Character Integrator to achieve continuous character adaptation within a unified LoRA module, eliminating the per-character optimization modules that previous methods require. Then, we incorporate a Character Quality Gate via MLLM-as-Judge to ensure the fidelity of each character adaptation process through chain-of-thought reasoning, determining whether the model can proceed to the next character or requires additional training on the current one. We also introduce a Character-Aware Region-Focus Sampling strategy to address the identity degradation and layout conflicts in existing multi-character visual storytelling, ensuring natural multi-character generation by harmonizing local character-specific details with global scene context with higher efficiency. Experimental results show that EverTale outperforms a wide range of compared methods on both single- and multi-character story visualization. Code will be made available.

Title: DriveFix: Spatio-Temporally Coherent Driving Scene Restoration

Authors: Heyu Si, Brandon James Denis, Muyang Sun, Dragos Datcu, Yaoru Li, Xin Jin, Ruiju Fu, Yuliia Tatarinova, Federico Landi, Jie Song, Mingli Song, Qi Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.16306
Pdf URL: https://arxiv.org/pdf/2603.16306
Copy Paste: [[2603.16306]] DriveFix: Spatio-Temporally Coherent Driving Scene Restoration(https://arxiv.org/abs/2603.16306)
Keywords: restoration, generation
Abstract: Recent advancements in 4D scene reconstruction, particularly those leveraging diffusion priors, have shown promise for novel view synthesis in autonomous driving. However, these methods often process frames independently or in a view-by-view manner, leading to a critical lack of spatio-temporal synergy. This results in spatial misalignment across cameras and temporal drift in sequences. We propose DriveFix, a novel multi-view restoration framework that ensures spatio-temporal coherence for driving scenes. Our approach employs an interleaved diffusion transformer architecture with specialized blocks to explicitly model both temporal dependencies and cross-camera spatial consistency. By conditioning the generation on historical context and integrating geometry-aware training losses, DriveFix enforces that the restored views adhere to a unified 3D geometry. This enables the consistent propagation of high-fidelity textures and significantly reduces artifacts. Extensive evaluations on the Waymo, nuScenes, and PandaSet datasets demonstrate that DriveFix achieves state-of-the-art performance in both reconstruction and novel view synthesis, marking a substantial step toward robust 4D world modeling for real-world deployment.

Title: Iris: Bringing Real-World Priors into Diffusion Model for Monocular Depth Estimation

Authors: Xinhao Cai, Gensheng Pei, Zeren Sun, Yazhou Yao, Fumin Shen, Wenguan Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.16340
Pdf URL: https://arxiv.org/pdf/2603.16340
Copy Paste: [[2603.16340]] Iris: Bringing Real-World Priors into Diffusion Model for Monocular Depth Estimation(https://arxiv.org/abs/2603.16340)
Keywords: generative
Abstract: In this paper, we propose Iris, a deterministic framework for Monocular Depth Estimation (MDE) that integrates real-world priors into the diffusion model. Conventional feed-forward methods rely on massive training data, yet still miss details. Previous diffusion-based methods leverage rich generative priors yet struggle with synthetic-to-real domain transfer. Iris, in contrast, preserves fine details, generalizes strongly from synthetic to real scenes, and remains efficient with limited training data. To this end, we introduce a two-stage Priors-to-Geometry Deterministic (PGD) schedule: the prior stage uses Spectral-Gated Distillation (SGD) to transfer low-frequency real priors while leaving high-frequency details unconstrained, and the geometry stage applies Spectral-Gated Consistency (SGC) to enforce high-frequency fidelity while refining with synthetic ground truth. The two stages share weights and are executed with a high-to-low timestep schedule. Extensive experimental results confirm that Iris achieves significant improvements in MDE performance with strong in-the-wild generalization.
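
Illustrative sketch (not from the paper): the abstract describes a spectral gate that constrains only low frequencies in the prior stage. The exact gate used by Iris is not given, so the snippet below assumes a hard radial low-pass mask in the 2D Fourier domain and a mean squared error on the gated signals. The function names and the `cutoff` parameter (a fraction of the Nyquist frequency) are hypothetical.

```python
import numpy as np

def low_pass(image, cutoff=0.25):
    """Keep only spatial frequencies whose radius is at most cutoff * Nyquist."""
    img = np.asarray(image, dtype=float)
    h, w = img.shape
    fy = np.fft.fftfreq(h)[:, None]   # vertical frequencies in cycles per pixel
    fx = np.fft.fftfreq(w)[None, :]   # horizontal frequencies in cycles per pixel
    # Nyquist is 0.5 cycles per pixel, so the radial threshold is cutoff * 0.5.
    gate = (np.sqrt(fx ** 2 + fy ** 2) <= cutoff * 0.5).astype(float)
    return np.real(np.fft.ifft2(np.fft.fft2(img) * gate))

def low_frequency_loss(pred, target, cutoff=0.25):
    """Mean squared error on the low-pass components only; high frequencies are ignored."""
    return float(np.mean((low_pass(pred, cutoff) - low_pass(target, cutoff)) ** 2))
```

Under this assumption, a prediction that differs from the target only in fine detail incurs almost no loss, which matches the stated intent of leaving high-frequency details unconstrained.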

Title: $D^3$-RSMDE: 40$\times$ Faster and High-Fidelity Remote Sensing Monocular Depth Estimation

Authors: Ruizhi Wang, Weihan Li, Zunlei Feng, Haofei Zhang, Mingli Song, Jiayu Wang, Jie Song, Li Sun
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.16362
Pdf URL: https://arxiv.org/pdf/2603.16362
Copy Paste: [[2603.16362]] $D^3$-RSMDE: 40$\times$ Faster and High-Fidelity Remote Sensing Monocular Depth Estimation(https://arxiv.org/abs/2603.16362)
Keywords: generation
Abstract: Real-time, high-fidelity monocular depth estimation from remote sensing imagery is crucial for numerous applications, yet existing methods face a stark trade-off between accuracy and efficiency. Although Vision Transformer (ViT) backbones are fast for dense prediction, their outputs often exhibit poor perceptual quality. Conversely, diffusion models offer high fidelity but at a prohibitive computational cost. To overcome these limitations, we propose Depth Detail Diffusion for Remote Sensing Monocular Depth Estimation ($D^3$-RSMDE), an efficient framework designed to achieve an optimal balance between speed and quality. Our framework first leverages a ViT-based module to rapidly generate a high-quality preliminary depth map, which serves as a structural prior, effectively replacing the time-consuming initial structure generation stage of diffusion models. Based on this prior, we propose a Progressive Linear Blending Refinement (PLBR) strategy, which uses a lightweight U-Net to refine the details in only a few iterations. The entire refinement step operates efficiently in a compact latent space supported by a Variational Autoencoder (VAE). Extensive experiments demonstrate that $D^3$-RSMDE achieves a notable 11.85% reduction in the Learned Perceptual Image Patch Similarity (LPIPS) perceptual metric over leading models like Marigold, while also achieving a speedup of more than 40$\times$ in inference and maintaining VRAM usage comparable to lightweight ViT models.

Title: Advancing Visual Reliability: Color-Accurate Underwater Image Enhancement for Real-Time Underwater Missions

Authors: Yiqiang Zhou, Yifan Chen, Zhe Sun, Jijun Lu, Ye Zheng, Xuelong Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.16363
Pdf URL: https://arxiv.org/pdf/2603.16363
Copy Paste: [[2603.16363]] Advancing Visual Reliability: Color-Accurate Underwater Image Enhancement for Real-Time Underwater Missions(https://arxiv.org/abs/2603.16363)
Keywords: restoration
Abstract: Underwater image enhancement plays a crucial role in providing reliable visual information for underwater platforms, since strong absorption and scattering in water-related environments generally lead to image quality degradation. Existing high-performance methods often rely on complex architectures, which hinder deployment on underwater devices. Lightweight methods often sacrifice quality for speed and struggle to handle severely degraded underwater images. To address this limitation, we present a real-time underwater image enhancement framework with accurate color restoration. First, an Adaptive Weighted Channel Compensation module is introduced to achieve dynamic color recovery of the red and blue channels using the green channel as a reference anchor. Second, we design a Multi-branch Re-parameterized Dilated Convolution that employs multi-branch fusion during training and structural re-parameterization during inference, enabling large receptive field representation with low computational overhead. Finally, a Statistical Global Color Adjustment module is employed to optimize overall color performance based on statistical priors. Extensive experiments on eight datasets demonstrate that the proposed method achieves state-of-the-art performance across seven evaluation metrics. The model contains only 3,880 inference parameters and achieves an inference speed of 409 FPS. Our method improves the UCIQE score by 29.7% under diverse environmental conditions, and the deployment on ROV platforms and performance gains in downstream tasks further validate its superiority for real-time underwater missions.
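
Illustrative sketch (not from the paper): the abstract describes recovering the red and blue channels with the green channel as a reference anchor. The paper's module is adaptive and its weighting is not specified here, so the snippet below swaps in a fixed-weight compensation heuristic of a commonly used form: each weak channel is boosted in proportion to its mean deficit relative to green. The function name and the `alpha_red` / `alpha_blue` weights are assumptions.

```python
import numpy as np

def compensate_channels(rgb, alpha_red=1.0, alpha_blue=1.0):
    """Fixed-weight red/blue compensation anchored on green; input is (H, W, 3) in [0, 1]."""
    img = np.clip(np.asarray(rgb, dtype=float), 0.0, 1.0)
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    # The (1 - channel) factor limits the boost where the channel is already strong,
    # and the green factor transfers detail from the best-preserved channel.
    r_new = r + alpha_red * (g.mean() - r.mean()) * (1.0 - r) * g
    b_new = b + alpha_blue * (g.mean() - b.mean()) * (1.0 - b) * g
    return np.clip(np.stack([r_new, g, b_new], axis=-1), 0.0, 1.0)
```

If a channel's mean already exceeds the green mean, the correction term becomes negative and slightly lowers that channel. An adaptive scheme, as the abstract describes, would presumably choose the weights per image rather than fixing them.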

Title: FederatedFactory: Generative One-Shot Learning for Extremely Non-IID Distributed Scenarios

Authors: Andrea Moleri, Christian Internò, Ali Raza, Markus Olhofer, David Klindt, Fabio Stella, Barbara Hammer
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.16370
Pdf URL: https://arxiv.org/pdf/2603.16370
Copy Paste: [[2603.16370]] FederatedFactory: Generative One-Shot Learning for Extremely Non-IID Distributed Scenarios(https://arxiv.org/abs/2603.16370)
Keywords: generative
Abstract: Federated Learning (FL) enables distributed optimization without compromising data sovereignty. Yet, where local label distributions are mutually exclusive, standard weight aggregation fails due to conflicting optimization trajectories. Often, FL methods rely on pretrained foundation models, introducing unrealistic assumptions. We introduce FederatedFactory, a zero-dependency framework that inverts the unit of federation from discriminative parameters to generative priors. By exchanging generative modules in a single communication round, our architecture supports ex nihilo synthesis of universally class-balanced datasets, eliminating gradient conflict and external prior bias entirely. Evaluations across diverse medical imagery benchmarks, including MedMNIST and ISIC2019, demonstrate that our approach recovers centralized upper-bound performance. Under pathological heterogeneity, it lifts baseline accuracy from a collapsed 11.36% to 90.57% on CIFAR-10 and restores ISIC2019 AUROC to 90.57%. Additionally, this framework facilitates exact modular unlearning through the deterministic deletion of specific generative modules.

Title: InViC: Intent-aware Visual Cues for Medical Visual Question Answering

Authors: Zhisong Wang, Ziyang Chen, Zanting Ye, Hongze Zhu, Yefeng Zheng, Yong Xia
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.16372
Pdf URL: https://arxiv.org/pdf/2603.16372
Copy Paste: [[2603.16372]] InViC: Intent-aware Visual Cues for Medical Visual Question Answering(https://arxiv.org/abs/2603.16372)
Keywords: generation
Abstract: Medical visual question answering (Med-VQA) aims to answer clinically relevant questions grounded in medical images. However, existing multimodal large language models (MLLMs) often exhibit shortcut answering, producing plausible responses by exploiting language priors or dataset biases while insufficiently attending to visual evidence. This behavior undermines clinical reliability, especially when subtle imaging findings are decisive. We propose a lightweight plug-in framework, termed Intent-aware Visual Cues (InViC), to explicitly enhance image-based answer generation in medical VQA. InViC introduces a Cue Tokens Extraction (CTE) module that distills dense visual tokens into a compact set of K question-conditioned cue tokens, which serve as structured visual intermediaries injected into the LLM decoder to promote intent-aligned visual evidence. To discourage bypassing of visual information, we further design a two-stage fine-tuning strategy with a cue-bottleneck attention mask. In Stage I, we employ an attention mask to block the LLM's direct view of raw visual features, thereby funneling all visual evidence through the cue pathway. In Stage II, standard causal attention is restored to train the LLM to jointly exploit the visual and cue tokens. We evaluate InViC on three public Med-VQA benchmarks (VQA-RAD, SLAKE, and ImageCLEF VQA-Med 2019) across multiple representative MLLMs. InViC consistently improves over zero-shot inference and standard LoRA fine-tuning, demonstrating that combining intent-aware visual cues with bottlenecked training is a practical and effective strategy for more trustworthy Med-VQA.
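
Illustrative sketch (not from the paper): the two-stage masking described in the abstract can be pictured as a change to the attention mask. The snippet below assumes a token order of raw visual tokens, then cue tokens, then text tokens, which the abstract does not state. In Stage I the text positions are blocked from the raw visual positions, so visual evidence reaches them only through the cue tokens. In Stage II the plain causal mask is used. The function name and layout are hypothetical.

```python
def cue_bottleneck_mask(n_visual, n_cue, n_text, stage=1):
    """Return an n x n list of booleans; mask[q][k] is True if query q may attend to key k."""
    n = n_visual + n_cue + n_text
    text_start = n_visual + n_cue
    # Standard causal mask: each position sees itself and everything before it.
    mask = [[k <= q for k in range(n)] for q in range(n)]
    if stage == 1:
        # Stage I: text queries cannot look at raw visual keys, only at cue and text keys.
        for q in range(text_start, n):
            for k in range(n_visual):
                mask[q][k] = False
    return mask
```

With two visual tokens, one cue token, and two text tokens, the first text row in Stage I can attend to the cue token and itself but not to the two visual tokens, while the cue row still sees both visual tokens.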

Title: Semantic One-Dimensional Tokenizer for Image Reconstruction and Generation

Authors: Yunpeng Qu, Kaidong Zhang, Yukang Ding, Ying Chen, Jian Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.16373
Pdf URL: https://arxiv.org/pdf/2603.16373
Copy Paste: [[2603.16373]] Semantic One-Dimensional Tokenizer for Image Reconstruction and Generation(https://arxiv.org/abs/2603.16373)
Keywords: restoration, generation, generative
Abstract: Visual generative models based on latent space have achieved great success, underscoring the significance of visual tokenization. Mapping images to latents boosts efficiency and enables multimodal alignment for scaling up in downstream tasks. Existing visual tokenizers primarily map images into fixed 2D spatial grids and focus on pixel-level restoration, which hinders the capture of representations with compact global semantics. To address these issues, we propose SemTok, a semantic one-dimensional tokenizer that compresses 2D images into 1D discrete tokens with high-level semantics. SemTok sets a new state of the art in image reconstruction, achieving superior fidelity with a remarkably compact token representation. This is achieved via a synergistic framework with three key innovations: a 2D-to-1D tokenization scheme, a semantic alignment constraint, and a two-stage generative training strategy. Building on SemTok, we construct a masked autoregressive generation framework, which yields notable improvements in downstream image generation tasks. Experiments confirm the effectiveness of our semantic 1D tokenization. Our code will be open-sourced.

Title: DermaFlux: Synthetic Skin Lesion Generation with Rectified Flows for Enhanced Image Classification

Authors: Stathis Galanakis, Alexandros Koliousis, Stefanos Zafeiriou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.16392
Pdf URL: https://arxiv.org/pdf/2603.16392
Copy Paste: [[2603.16392]] DermaFlux: Synthetic Skin Lesion Generation with Rectified Flows for Enhanced Image Classification(https://arxiv.org/abs/2603.16392)
Keywords: generation, generative
Abstract: Despite recent advances in deep generative modeling, skin lesion classification systems remain constrained by the limited availability of large, diverse, and well-annotated clinical datasets, resulting in class imbalance between benign and malignant lesions and consequently reduced generalization performance. We introduce DermaFlux, a rectified flow-based text-to-image generative framework that synthesizes clinically grounded skin lesion images from natural language descriptions of dermatological attributes. Built upon Flux.1, DermaFlux is fine-tuned using parameter-efficient Low-Rank Adaptation (LoRA) on a large curated collection of publicly available clinical image datasets. We construct image-text pairs using synthetic textual captions generated by Llama 3.2, following established dermatological criteria including lesion asymmetry, border irregularity, and color variation. Extensive experiments demonstrate that DermaFlux generates diverse and clinically meaningful dermatology images that improve binary classification performance by up to 6% when augmenting small real-world datasets, and by up to 9% when classifiers are trained on DermaFlux-generated synthetic images rather than diffusion-based synthetic images. Our ImageNet-pretrained ViT fine-tuned with only 2,500 real images and 4,375 DermaFlux-generated samples achieves 78.04% binary classification accuracy and an AUC of 0.859, surpassing the next best dermatology model by 8%.

Title: DISCOVER: A Solver for Distributional Counterfactual Explanations

Authors: Yikai Gu, Lele Cao, Bo Zhao, Lei Lei, Lei You
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.16436
Pdf URL: https://arxiv.org/pdf/2603.16436
Copy Paste: [[2603.16436]] DISCOVER: A Solver for Distributional Counterfactual Explanations(https://arxiv.org/abs/2603.16436)
Keywords: generation
Abstract: Counterfactual explanations (CE) explain model decisions by identifying input modifications that lead to different predictions. Most existing methods operate at the instance level. Distributional Counterfactual Explanations (DCE) extend this setting by optimizing an optimal transport objective that balances proximity to a factual input distribution and alignment to a target output distribution, with statistical certification via chance-constrained bounds. However, DCE relies on gradient-based optimization, while many real-world tabular pipelines are dominated by non-differentiable models. We propose DISCOVER, a model-agnostic solver for distributional counterfactual explanations. DISCOVER preserves the original DCE objective and certification while replacing gradient descent with a sparse propose-and-select search paradigm. It exploits a sample-wise decomposition of the transport objective to compute per-row impact scores and enforce a top-$k$ intervention budget, focusing edits on the most influential samples. To guide candidate generation without predictor gradients, DISCOVER introduces an OT-guided cone sampling primitive driven by input-side transport geometry. Experiments on multiple tabular datasets demonstrate strong joint alignment of input and output distributions, extending distributional counterfactual reasoning to modern black-box learning pipelines. A code repository is available at this https URL.
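
Illustrative sketch (not from the paper): the abstract mentions a sample-wise decomposition of the transport objective and a top-$k$ intervention budget. The snippet below shows the idea in the simplest setting, which is an assumption here: two equal-size one-dimensional samples with squared cost, where the optimal transport plan matches sorted ranks, so the total cost splits into one term per row. The function names are hypothetical and the paper's actual objective may differ.

```python
def per_row_transport_cost(x, y):
    """Squared-cost contribution of each row of x when matched to y by sorted rank."""
    if len(x) != len(y):
        raise ValueError("this sketch assumes samples of equal size")
    order_x = sorted(range(len(x)), key=lambda i: x[i])  # row indices of x in ascending order
    sorted_y = sorted(y)
    cost = [0.0] * len(x)
    for rank, i in enumerate(order_x):
        # In 1-D, the rank-th smallest x is transported to the rank-th smallest y.
        cost[i] = float((x[i] - sorted_y[rank]) ** 2)
    return cost

def top_k_rows(scores, k):
    """Indices of the k rows with the largest scores (ties keep original order)."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]
```

Averaging the per-row costs gives the squared 2-Wasserstein distance between the two samples in this setting, and restricting edits to the rows returned by `top_k_rows` is one way to read "focusing edits on the most influential samples".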

Title: Unified Removal of Raindrops and Reflections: A New Benchmark and A Novel Pipeline

Authors: Xingyu Liu, Zewei He, Yu Chen, Chunyu Zhu, Zixuan Chen, Xing Luo, Zhe-Ming Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.16446
Pdf URL: https://arxiv.org/pdf/2603.16446
Copy Paste: [[2603.16446]] Unified Removal of Raindrops and Reflections: A New Benchmark and A Novel Pipeline(https://arxiv.org/abs/2603.16446)
Keywords: generative
Abstract: When capturing images through glass surfaces or windshields on rainy days, raindrops and reflections frequently co-occur and significantly reduce the visibility of the captured images. This practical problem has received little attention and urgently needs to be addressed. Prior de-raindrop, de-reflection, and all-in-one models have failed to address this composite degradation. To this end, we formally define the unified removal of raindrops and reflections (UR$^3$) task for the first time and construct a real-shot dataset, namely RainDrop and ReFlection (RDRF), which provides a new benchmark with substantial, high-quality, diverse image pairs. Then, we propose a novel diffusion-based framework (i.e., DiffUR$^3$) with several targeted designs to address this challenging task. By leveraging the powerful generative prior, DiffUR$^3$ successfully removes both types of degradations. Extensive experiments demonstrate that our method achieves state-of-the-art performance on our benchmark and on challenging in-the-wild images. The RDRF dataset and the code will be made public upon acceptance.
摘要：在雨天透过玻璃表面或挡风玻璃拍摄图像时，雨滴和反射经常同时出现，从而显着降低拍摄图像的可见度。这一现实问题缺乏重视，亟待解决。先前的去雨滴、去反射和一体化模型未能解决这种复合材料退化问题。为此，我们首先正式定义了统一去除雨滴和反射（UR$^3$）任务，并构建了一个实拍数据集，即RainDrop和Reflection（RDRF），它提供了一个新的基准，具有大量、高质量、多样化的图像对。然后，我们提出了一种新颖的基于扩散的框架（即 DiffUR$^3$），具有多个目标设计来解决这一具有挑战性的任务。通过利用强大的生成先验，DiffUR$^3$ 成功消除了这两种类型的退化。大量的实验表明，我们的方法在我们的基准测试和具有挑战性的野外图像上实现了最先进的性能。 RDRF 数据集和代码将在接受后公开。

Title: Unlearning for One-Step Generative Models via Unbalanced Optimal Transport

Authors: Hyundo Choi, Junhyeong An, Jinseong Park, Jaewoong Choi
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.16489
Pdf URL: https://arxiv.org/pdf/2603.16489
Copy Paste: [[2603.16489]] Unlearning for One-Step Generative Models via Unbalanced Optimal Transport(https://arxiv.org/abs/2603.16489)
Keywords: generation, generative
Abstract: Recent advances in one-step generative frameworks, such as flow map models, have significantly improved the efficiency of image generation by learning direct noise-to-data mappings in a single forward pass. However, machine unlearning for ensuring the safety of these powerful generators remains entirely unexplored. Existing diffusion unlearning methods are inherently incompatible with these one-step models, as they rely on a multi-step iterative denoising process. In this work, we propose UOT-Unlearn, a novel plug-and-play class unlearning framework for one-step generative models based on Unbalanced Optimal Transport (UOT). Our method formulates unlearning as a principled trade-off between a forget cost, which suppresses the target class, and an $f$-divergence penalty, which preserves overall generation fidelity via relaxed marginal constraints. By leveraging UOT, our method enables the probability mass of the forgotten class to be smoothly redistributed to the remaining classes, rather than collapsing into low-quality or noise-like samples. Experimental results on CIFAR-10 and ImageNet-256 demonstrate that our framework achieves superior unlearning success (PUL) and retention quality (u-FID), significantly outperforming baselines.
摘要：一步生成框架（例如流图模型）的最新进展通过在单次前向传递中学习直接噪声到数据的映射，显着提高了图像生成的效率。然而，用于确保这些强大发电机安全的机器学习仍然完全未被探索。现有的扩散忘却方法本质上与这些一步模型不兼容，因为它们依赖于多步迭代去噪过程。在这项工作中，我们提出了 UOT-Unlearn，一种新颖的即插即用类取消学习框架，用于基于不平衡最优传输（UOT）的一步生成模型。我们的方法将遗忘定义为遗忘成本（抑制目标类别）和 $f$ 发散惩罚（通过放松边际约束保持整体生成保真度）之间的原则性权衡。通过利用 UOT，我们的方法使被遗忘类的概率质量能够平滑地重新分配到剩余的类，而不是崩溃成低质量或类似噪声的样本。 CIFAR-10 和 ImageNet-256 上的实验结果表明，我们的框架实现了卓越的遗忘成功率 (PUL) 和保留质量 (u-FID)，显着优于基线。
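
For orientation, the trade-off named in the UOT-Unlearn abstract matches the general shape of the standard unbalanced optimal transport problem, written below. This is the textbook formulation, not necessarily the exact objective used in the paper; the symbols $c$, $\tau_1$, $\tau_2$, $\mu$, and $\nu$ are generic notation introduced here, and how the paper instantiates them is not stated in the abstract.

```latex
\[
\mathrm{UOT}(\mu, \nu)
  = \inf_{\pi \in \mathcal{M}_{+}(\mathcal{X} \times \mathcal{Y})}
    \int_{\mathcal{X} \times \mathcal{Y}} c(x, y) \, d\pi(x, y)
    + \tau_1 \, D_f(\pi_0 \,\|\, \mu)
    + \tau_2 \, D_f(\pi_1 \,\|\, \nu)
\]
% \pi_0, \pi_1: the two marginals of the coupling \pi
% c: transport cost (the role a "forget cost" would play)
% D_f: an f-divergence that penalizes, rather than forbids,
%      deviation of the marginals from \mu and \nu
```

Because the marginal constraints are soft, mass can be moved away from a penalized region and reassigned elsewhere, which is consistent with the abstract's description of redistributing the forgotten class's probability mass to the remaining classes.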

Title: VIEW2SPACE: Studying Multi-View Visual Reasoning from Sparse Observations

Authors: Fucai Ke, Zhixi Cai, Boying Li, Long Chen, Beibei Lin, Weiqing Wang, Pari Delir Haghighi, Gholamreza Haffari, Hamid Rezatofighi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.16506
Pdf URL: https://arxiv.org/pdf/2603.16506
Copy Paste: [[2603.16506]] VIEW2SPACE: Studying Multi-View Visual Reasoning from Sparse Observations(https://arxiv.org/abs/2603.16506)
Keywords: generation
Abstract: Multi-view visual reasoning is essential for intelligent systems that must understand complex environments from sparse and discrete viewpoints, yet existing research has largely focused on single-image or temporally dense video settings. In real-world scenarios, reasoning across views requires integrating partial observations without explicit guidance, while collecting large-scale multi-view data with accurate geometric and semantic annotations remains challenging. To address this gap, we leverage physically grounded simulation to construct diverse, high-fidelity 3D scenes with precise per-view metadata, enabling scalable data generation that remains transferable to real-world settings. Based on this engine, we introduce VIEW2SPACE, a multi-dimensional benchmark for sparse multi-view reasoning, together with a scalable, disjoint training split supporting millions of grounded question-answer pairs. Using this benchmark, a comprehensive evaluation of state-of-the-art vision-language and spatial models reveals that multi-view reasoning remains largely unsolved, with most models performing only marginally above random guessing. We further investigate whether training can bridge this gap. Our proposed Grounded Chain-of-Thought with Visual Evidence substantially improves performance under moderate difficulty, and generalizes to real-world data, outperforming existing approaches in cross-dataset evaluation. We further conduct difficulty-aware scaling analyses across model size, data scale, reasoning depth, and visibility constraints, indicating that while geometric perception can benefit from scaling under sufficient visibility, deep compositional reasoning across sparse views remains a fundamental challenge.
摘要：多视图视觉推理对于必须从稀疏和离散的角度理解复杂环境的智能系统至关重要，但现有的研究主要集中在单图像或时间密集的视频设置上。在现实场景中，跨视图推理需要在没有明确指导的情况下整合部分观察结果，而收集具有准确几何和语义注释的大规模多视图数据仍然具有挑战性。为了解决这一差距，我们利用基于物理的模拟来构建具有精确的每个视图元数据的多样化、高保真 3D 场景，从而实现可扩展的数据生成，并且仍可转移到现实世界的设置。基于该引擎，我们引入了 VIEW2SPACE，这是一种用于稀疏多视图推理的多维基准，以及支持数百万个接地问答对的可扩展、不相交的训练分割。使用这个基准，对最先进的视觉语言和空间模型的综合评估表明，多视图推理在很大程度上仍未得到解决，大多数模型的表现仅略高于随机猜测。我们进一步研究培训是否可以弥补这一差距。我们提出的带有视觉证据的扎根思想链大大提高了中等难度下的性能，并推广到现实世界的数据，在跨数据集评估方面优于现有方法。我们进一步对模型大小、数据规模、推理深度和可见性约束进行困难感知缩放分析，表明虽然几何感知可以从足够可见性下的缩放中受益，但跨稀疏视图的深度组合推理仍然是一个基本挑战。

Title: CompDiff: Hierarchical Compositional Diffusion for Fair and Zero-Shot Intersectional Medical Image Generation

Authors: Mahmoud Ibrahim, Bart Elen, Chang Sun, Gokhan Ertaylan, Michel Dumontier
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.16551
Pdf URL: https://arxiv.org/pdf/2603.16551
Copy Paste: [[2603.16551]] CompDiff: Hierarchical Compositional Diffusion for Fair and Zero-Shot Intersectional Medical Image Generation(https://arxiv.org/abs/2603.16551)
Keywords: generation, generative
Abstract: Generative models are increasingly used to augment medical imaging datasets for fairer AI. Yet a key assumption often goes unexamined: that generators themselves produce equally high-quality images across demographic groups. Models trained on imbalanced data can inherit these imbalances, yielding degraded synthesis quality for rare subgroups and struggling with demographic intersections absent from training. We refer to this as the imbalanced generator problem. Existing remedies such as loss reweighting operate at the optimization level and provide limited benefit when training signal is scarce or absent for certain combinations. We propose CompDiff, a hierarchical compositional diffusion framework that addresses this problem at the representation level. A dedicated Hierarchical Conditioner Network (HCN) decomposes demographic conditioning, producing a demographic token concatenated with CLIP embeddings as cross-attention context. This structured factorization encourages parameter sharing across subgroups and supports compositional generalization to rare or unseen demographic intersections. Experiments on chest X-rays (MIMIC-CXR) and fundus images (FairGenMed) show that CompDiff compares favorably against both standard fine-tuning and FairDiffusion across image quality (FID: 64.3 vs. 75.1), subgroup equity (ES-FID), and zero-shot intersectional generalization (up to 21% FID improvement on held-out intersections). Downstream classifiers trained on CompDiff-generated data also show improved AUROC and reduced demographic bias, suggesting that architectural design of demographic conditioning is an important and underexplored factor in fair medical image generation. Code is available at this https URL.
摘要：生成模型越来越多地用于增强医学成像数据集，以实现更公平的人工智能。然而，一个关键假设常常未经检验：生成器本身可以在不同人口群体中生成同样高质量的图像。在不平衡数据上训练的模型可能会继承这些不平衡，导致稀有亚组的合成质量下降，并与训练中缺少的人口交叉点作斗争。我们将此称为不平衡发电机问题。现有的补救措施（例如损失重新加权）在优化级别上运行，并且当某些组合的训练信号稀缺或不存在时，提供的益处有限。我们提出了 CompDiff，一个分层组合扩散框架，可以在表示级别解决这个问题。专用的分层调节器网络 (HCN) 分解人口统计条件，生成与 CLIP 嵌入连接的人口统计标记作为交叉注意上下文。这种结构化分解鼓励跨子组共享参数，并支持对罕见或看不见的人口统计交叉点的成分泛化。胸部 X 射线 (MIMIC-CXR) 和眼底图像 (FairGenMed) 的实验表明，CompDiff 在图像质量（FID：64.3 与 75.1）、子组公平性 (ES-FID) 和零样本交叉泛化（保留交叉点上的 FID 改进高达 21%）方面优于标准微调和 FairDiffusion。在 CompDiff 生成的数据上训练的下游分类器也显示出 AUROC 的改进和人口统计偏差的减少，这表明人口统计的架构设计是公平医学图像生成中一个重要且尚未充分探索的因素。代码可从此 https URL 获取。

Title: VideoMatGen: PBR Materials through Joint Generative Modeling

Authors: Jon Hasselgren, Zheng Zeng, Milos Hasan, Jacob Munkberg
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2603.16566
Pdf URL: https://arxiv.org/pdf/2603.16566
Copy Paste: [[2603.16566]] VideoMatGen: PBR Materials through Joint Generative Modeling(https://arxiv.org/abs/2603.16566)
Keywords: generation, generative
Abstract: We present a method for generating physically-based materials for 3D shapes based on a video diffusion transformer architecture. Our method is conditioned on input geometry and a text description, and jointly models multiple material properties (base color, roughness, metallicity, height map) to form physically plausible materials. We further introduce a custom variational auto-encoder which encodes multiple material modalities into a compact latent space, which enables joint generation of multiple modalities without increasing the number of tokens. Our pipeline generates high-quality materials for 3D shapes given a text prompt, compatible with common content creation tools.
摘要：我们提出了一种基于视频扩散变压器架构生成 3D 形状的基于物理的材料的方法。我们的方法以输入几何形状和文本描述为条件，并联合建模多种材料属性（基色、粗糙度、金属度、高度图）以形成物理上合理的材料。我们进一步引入了一种定制的变分自动编码器，它将多种材料模态编码到一个紧凑的潜在空间中，从而可以在不增加标记数量的情况下联合生成多种模态。我们的管道根据文本提示生成 3D 形状的高质量材料，与常见的内容创建工具兼容。

Title: Face2Scene: Using Facial Degradation as an Oracle for Diffusion-Based Scene Restoration

Authors: Amirhossein Kazerouni, Maitreya Suin, Tristan Aumentado-Armstrong, Sina Honari, Amanpreet Walia, Iqbal Mohomed, Konstantinos G. Derpanis, Babak Taati, Alex Levinshtein
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.16570
Pdf URL: https://arxiv.org/pdf/2603.16570
Copy Paste: [[2603.16570]] Face2Scene: Using Facial Degradation as an Oracle for Diffusion-Based Scene Restoration(https://arxiv.org/abs/2603.16570)
Keywords: restoration
Abstract: Recent advances in image restoration have enabled high-fidelity recovery of faces from degraded inputs using reference-based face restoration models (Ref-FR). However, such methods focus solely on facial regions, neglecting degradation across the full scene, including body and background, which limits practical usability. Meanwhile, full-scene restorers often ignore degradation cues entirely, leading to underdetermined predictions and visual artifacts. In this work, we propose Face2Scene, a two-stage restoration framework that leverages the face as a perceptual oracle to estimate degradation and guide the restoration of the entire image. Given a degraded image and one or more identity references, we first apply a Ref-FR model to reconstruct high-quality facial details. From the restored-degraded face pair, we extract a face-derived degradation code that captures degradation attributes (e.g., noise, blur, compression), which is then transformed into multi-scale degradation-aware tokens. These tokens condition a diffusion model to restore the full scene in a single step, including the body and background. Extensive experiments demonstrate the superior effectiveness of the proposed method compared to state-of-the-art methods.
摘要：图像恢复方面的最新进展使得使用基于参考的人脸恢复模型（Ref-FR）能够从降级的输入中高保真地恢复人脸。然而，此类方法仅关注面部区域，忽略了整个场景（包括身体和背景）的退化，这限制了实际可用性。与此同时，全场景恢复者通常完全忽略退化线索，导致不确定的预测和视觉伪影。在这项工作中，我们提出了 Face2Scene，这是一个两阶段恢复框架，利用面部作为感知神谕来估计退化并指导整个图像的恢复。给定退化图像和一个或多个身份参考，我们首先应用 Ref-FR 模型来重建高质量的面部细节。从恢复的退化面部对中，我们提取了一个面部衍生的退化代码，该代码捕获退化属性（例如噪声、模糊、压缩），然后将其转换为多尺度退化感知令牌。这些令牌调节扩散模型，以一步恢复整个场景，包括身体和背景。大量的实验证明了所提出的方法与最先进的方法相比具有卓越的有效性。

Title: REFORGE: Multi-modal Attacks Reveal Vulnerable Concept Unlearning in Image Generation Models

Authors: Yong Zou, Haoran Li, Fanxiao Li, Shenyang Wei, Yunyun Dong, Li Tang, Wei Zhou, Renyang Liu
Subjects: cs.CV, cs.AI, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2603.16576
Pdf URL: https://arxiv.org/pdf/2603.16576
Copy Paste: [[2603.16576]] REFORGE: Multi-modal Attacks Reveal Vulnerable Concept Unlearning in Image Generation Models(https://arxiv.org/abs/2603.16576)
Keywords: generation
Abstract: Recent progress in image generation models (IGMs) enables high-fidelity content creation but also amplifies risks, including the reproduction of copyrighted content and the generation of offensive content. Image Generation Model Unlearning (IGMU) mitigates these risks by removing harmful concepts without full retraining. Despite growing attention, the robustness of IGMU under adversarial inputs, particularly image-side threats in black-box settings, remains underexplored. To bridge this gap, we present REFORGE, a black-box red-teaming framework that evaluates IGMU robustness via adversarial image prompts. REFORGE initializes stroke-based images and optimizes perturbations with a cross-attention-guided masking strategy that allocates noise to concept-relevant regions, balancing attack efficacy and visual fidelity. Extensive experiments across representative unlearning tasks and defenses demonstrate that REFORGE significantly improves attack success rate while achieving stronger semantic alignment and higher efficiency than the compared baselines. These results expose persistent vulnerabilities in current IGMU methods and highlight the need for robustness-aware unlearning against multi-modal adversarial attacks. Our code is at: this https URL.
摘要：图像生成模型 (IGM) 的最新进展使得高保真内容创建成为可能，但也放大了风险，包括复制受版权保护的内容和生成攻击性内容。图像生成模型忘却 (IGMU) 通过在不进行全面再训练的情况下消除有害概念来减轻这些风险。尽管受到越来越多的关注，但对抗性输入下的鲁棒性，特别是黑盒设置中的图像侧威胁，仍未得到充分探索。为了弥补这一差距，我们提出了 REFORGE，这是一个黑盒红队框架，可通过对抗性图像提示评估 IGMU 的稳健性。 REFORGE 初始化基于笔划的图像，并通过交叉注意力引导的掩蔽策略优化扰动，该策略将噪声分配给概念相关区域，平衡攻击功效和视觉保真度。跨代表性遗忘任务和防御的广泛实验表明，REFORGE 显着提高了攻击成功率，同时实现了比所涉及的基线更强的语义对齐和更高的效率。这些结果暴露了当前 IGMU 方法中持续存在的漏洞，并强调需要针对多模式对抗攻击进行鲁棒性感知的遗忘。我们的代码位于：此 https URL。

Title: When and Why Does Unsupervised RL Succeed in Mathematical Reasoning? A Manifold Envelopment Perspective

Authors: Zelin Zhang, Fei Cheng, Chenhui Chu
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2603.16578
Pdf URL: https://arxiv.org/pdf/2603.16578
Copy Paste: [[2603.16578]] When and Why Does Unsupervised RL Succeed in Mathematical Reasoning? A Manifold Envelopment Perspective(https://arxiv.org/abs/2603.16578)
Keywords: generation
Abstract: Although outcome-based reinforcement learning (RL) significantly advances the mathematical reasoning capabilities of Large Language Models (LLMs), its reliance on computationally expensive ground-truth annotations imposes a severe scalability bottleneck. Unsupervised RL guided by intrinsic rewards offers a scalable alternative, yet it suffers from opaque training dynamics and catastrophic instability, such as policy collapse and reward hacking. In this paper, we first design and evaluate a suite of intrinsic rewards that explicitly enforce concise and certain generation. Second, to discover the boundaries of this approach, we test base models across a spectrum of intrinsic reasoning capabilities, revealing how a model's foundational logical prior dictates its success or failure. Finally, to demystify why certain configurations stabilize while others collapse, we introduce a novel geometric diagnostic lens, showing that successful cases are enveloped by manifolds. Ultimately, our work goes beyond merely demonstrating that enforcing concise and certain responses successfully boosts mathematical reasoning; we reveal when this unsupervised approach breaks down and geometrically diagnose why.
摘要：尽管基于结果的强化学习 (RL) 显着提高了大型语言模型 (LLM) 的数学推理能力，但其对计算成本昂贵的真实注释的依赖造成了严重的可扩展性瓶颈。由内在奖励引导的无监督强化学习提供了一种可扩展的替代方案，但它面临着不透明的训练动态和灾难性的不稳定，例如政策崩溃和奖励黑客攻击。在本文中，我们首先设计和评估一套明确强制简洁和确定生成的内在奖励。其次，为了发现这种方法的边界，我们跨一系列内在推理能力测试基本模型，揭示模型的基本逻辑先验如何决定其成功或失败。最后，为了揭开为什么某些构型稳定而另一些构型崩溃的神秘面纱，我们引入了一种新颖的几何诊断透镜，表明成功的案例被流形包围。最终，我们的工作不仅仅是证明执行简洁且确定的回答可以成功地促进数学推理；我们揭示了这种无监督方法何时崩溃，并从几何角度诊断原因。

Title: Rationale Matters: Learning Transferable Rubrics via Proxy-Guided Critique for VLM Reward Models

Authors: Weijie Qiu, Dai Guan, Junxin Wang, Zhihang Li, Yongbo Gai, Mengyu Zhou, Erchao Zhao, Xiaoxi Jiang, Guanjun Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.16600
Pdf URL: https://arxiv.org/pdf/2603.16600
Copy Paste: [[2603.16600]] Rationale Matters: Learning Transferable Rubrics via Proxy-Guided Critique for VLM Reward Models(https://arxiv.org/abs/2603.16600)
Keywords: generation, generative
Abstract: Generative reward models (GRMs) for vision-language models (VLMs) often evaluate outputs via a three-stage pipeline: rubric generation, criterion-based scoring, and a final verdict. However, the intermediate rubric is rarely optimized directly. Prior work typically either treats rubrics as incidental or relies on expensive LLM-as-judge checks that provide no differentiable signal and limited training-time guidance. We propose Proxy-GRM, which introduces proxy-guided rubric verification into Reinforcement Learning (RL) to explicitly enhance rubric quality. Concretely, we train lightweight proxy agents (Proxy-SFT and Proxy-RL) that take a candidate rubric together with the original query and preference pair, and then predict the preference ordering using only the rubric as evidence. The proxy's prediction accuracy serves as a rubric-quality reward, incentivizing the model to produce rubrics that are internally consistent and transferable. With ~50k data samples, Proxy-GRM reaches state-of-the-art results on the VL-Reward Bench, Multimodal Reward Bench, and MM-RLHF-Reward Bench, outperforming the methods trained on four times the data. Ablations show Proxy-SFT is a stronger verifier than Proxy-RL, and implicit reward aggregation performs best. Crucially, the learned rubrics transfer to unseen evaluators, improving reward accuracy at test time without additional training. Our code is available at this https URL.
摘要：视觉语言模型 (VLM) 的生成奖励模型 (GRM) 通常通过三阶段管道评估输出：标题生成、基于标准的评分和最终裁决。然而，中间的标题很少被直接优化。之前的工作通常要么将评分标准视为偶然，要么依赖于昂贵的法学硕士作为法官的检查，这些检查无法提供可区分的信号和有限的培训时间指导。我们提出了 Proxy-GRM，它将代理引导的标题验证引入到强化学习（RL）中，以显着提高标题质量。具体来说，我们训练轻量级代理（Proxy-SFT 和 Proxy-RL），将候选标题与原始查询和偏好对一起使用，然后仅使用标题作为证据来预测偏好排序。代理的预测准确性作为评分标准质量奖励，激励模型生成内部一致且可转移的评分标准。通过约 50k 数据样本，Proxy-GRM 在 VL-Reward Bench、Multimodal Reward Bench 和 MM-RLHF-Reward Bench 上达到了最先进的结果，优于在四倍数据上训练的方法。消融显示 Proxy-SFT 是比 Proxy-RL 更强的验证器，并且隐式奖励聚合表现最好。至关重要的是，学习到的评估标准可以转移到看不见的评估者，从而在测试时提高奖励准确性，而无需额外培训。我们的代码可以在这个 https URL 上找到。
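
The idea that a proxy's prediction accuracy can serve as a rubric-quality reward can be sketched in a few lines. This is an illustrative sketch only: the function name `rubric_reward`, the `"A"`/`"B"` label encoding, and the choice to average over several preference pairs are assumptions made here, not details given in the abstract.

```python
def rubric_reward(proxy_predictions, true_preferences):
    """Fraction of preference pairs the proxy orders correctly.

    Assumption: each element is a label such as "A" or "B" naming the
    preferred response, and the proxy saw only the rubric as evidence.
    """
    if len(proxy_predictions) != len(true_preferences):
        raise ValueError("prediction and label lists must have equal length")
    if not true_preferences:
        return 0.0  # no evidence, no reward
    correct = sum(p == t for p, t in zip(proxy_predictions, true_preferences))
    return correct / len(true_preferences)


# Hypothetical example: the proxy gets three of four orderings right.
print(rubric_reward(["A", "B", "A", "A"], ["A", "B", "B", "A"]))  # 0.75
```

A rubric that lets an independent proxy recover the preference ordering scores higher, which is the incentive the abstract describes.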

Title: ACPV-Net: All-Class Polygonal Vectorization for Seamless Vector Map Generation from Aerial Imagery

Authors: Weiqin Jiao, Hao Cheng, George Vosselman, Claudio Persello
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.16616
Pdf URL: https://arxiv.org/pdf/2603.16616
Copy Paste: [[2603.16616]] ACPV-Net: All-Class Polygonal Vectorization for Seamless Vector Map Generation from Aerial Imagery(https://arxiv.org/abs/2603.16616)
Keywords: generation
Abstract: We tackle the problem of generating a complete vector map representation from aerial imagery in a single run: producing polygons for all land-cover classes with shared boundaries and without gaps or overlaps. Existing polygonization methods are typically class-specific; extending them to multiple classes via per-class runs commonly leads to topological inconsistencies, such as duplicated edges, gaps, and overlaps. We formalize this new task as All-Class Polygonal Vectorization (ACPV) and release the first public benchmark, Deventer-512, with standardized metrics jointly evaluating semantic fidelity, geometric accuracy, vertex efficiency, per-class topological fidelity and global topological consistency. To realize ACPV, we propose ACPV-Net, a unified framework introducing a novel Semantically Supervised Conditioning (SSC) mechanism coupling semantic perception with geometric primitive generation, along with a topological reconstruction that enforces shared-edge consistency by design. While enforcing such strict topological constraints, ACPV-Net surpasses all class-specific baselines in polygon quality across classes on Deventer-512. It also applies to single-class polygonal vectorization without any architectural modification, achieving the best-reported results on WHU-Building. Data, code, and models will be released at: this https URL.
摘要：我们解决了在一次运行中从航空图像生成完整矢量图表示的问题：为所有土地覆盖类别生成具有共享边界且没有间隙或重叠的多边形。现有的多边形化方法通常是特定于类的；通过按类运行将它们扩展到多个类通常会导致拓扑不一致，例如重复的边、间隙和重叠。我们将这一新任务正式化为全类多边形矢量化（ACPV），并发布了第一个公共基准 Deventer-512，其标准化指标共同评估语义保真度、几何精度、顶点效率、每类拓扑保真度和全局拓扑一致性。为了实现 ACPV，我们提出了 ACPV-Net，这是一个统一的框架，引入了一种新颖的语义监督调节（SSC）机制，将语义感知与几何图元生成相结合，以及通过设计强制共享边缘一致性的拓扑重建。在执行如此严格的拓扑约束的同时，ACPV-Net 在 Deventer-512 上的跨类多边形质量方面超越了所有类特定基线。它还适用于单类多边形矢量化，无需任何架构修改，在 WHU-Building 上取得了最佳报告结果。数据、代码和模型将在以下位置发布：此 https URL。

Title: Fast-WAM: Do World Action Models Need Test-time Future Imagination?

Authors: Tianyuan Yuan, Zibin Dong, Yicheng Liu, Hang Zhao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.16666
Pdf URL: https://arxiv.org/pdf/2603.16666
Copy Paste: [[2603.16666]] Fast-WAM: Do World Action Models Need Test-time Future Imagination?(https://arxiv.org/abs/2603.16666)
Keywords: generation
Abstract: World Action Models (WAMs) have emerged as a promising alternative to Vision-Language-Action (VLA) models for embodied control because they explicitly model how visual observations may evolve under action. Most existing WAMs follow an imagine-then-execute paradigm, incurring substantial test-time latency from iterative video denoising, yet it remains unclear whether explicit future imagination is actually necessary for strong action performance. In this paper, we ask whether WAMs need explicit future imagination at test time, or whether their benefit comes primarily from video modeling during training. We disentangle the role of video modeling during training from explicit future generation during inference by proposing Fast-WAM, a WAM architecture that retains video co-training during training but skips future prediction at test time. We further instantiate several Fast-WAM variants to enable a controlled comparison of these two factors. Across these variants, we find that Fast-WAM remains competitive with imagine-then-execute variants, while removing video co-training causes a much larger performance drop. Empirically, Fast-WAM achieves competitive results with state-of-the-art methods both on simulation benchmarks (LIBERO and RoboTwin) and real-world tasks, without embodied pretraining. It runs in real time with 190ms latency, over 4$\times$ faster than existing imagine-then-execute WAMs. These results suggest that the main value of video prediction in WAMs may lie in improving world representations during training rather than generating future observations at test time. Project page: this https URL
摘要：世界行动模型（WAM）已成为视觉-语言-行动（VLA）模型的有希望的替代品，用于体现控制，因为它们明确地模拟了视觉观察在行动下如何演变。大多数现有的 WAM 都遵循“想象然后执行”的范例，迭代视频去噪会导致大量的测试时间延迟，但目前尚不清楚明确的未来想象是否真的需要强大的动作性能。在本文中，我们询问 WAM 在测试时是否需要明确的未来想象，或者它们的好处是否主要来自训练期间的视频建模。我们提出 \textbf{Fast-WAM}，将训练期间视频建模的作用与推理期间显式未来生成的作用分开，这是一种 WAM 架构，它在训练期间保留视频协同训练，但在测试时跳过未来预测。我们进一步实例化了几个 Fast-WAM 变体，以实现这两个因素的受控比较。在这些变体中，我们发现 Fast-WAM 与想象然后执行变体相比仍然具有竞争力，而删除视频协同训练会导致更大的性能下降。根据经验，Fast-WAM 在模拟基准（LIBERO 和 RoboTwin）和现实世界任务上都通过最先进的方法取得了有竞争力的结果，而无需具体的预训练。它实时运行，延迟为 190 毫秒，比现有的想象然后执行 WAM 快 4 倍多。这些结果表明，WAM 中视频预测的主要价值可能在于改进训练期间的世界表征，而不是在测试时生成未来的观察结果。项目页面：此 https URL

Title: Search2Motion: Training-Free Object-Level Motion Control via Attention-Consensus Search

Authors: Sainan Liu, Tz-Ying Wu, Hector A Valdez, Subarna Tripathi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.16711
Pdf URL: https://arxiv.org/pdf/2603.16711
Copy Paste: [[2603.16711]] Search2Motion: Training-Free Object-Level Motion Control via Attention-Consensus Search(https://arxiv.org/abs/2603.16711)
Keywords: generation
Abstract: We present Search2Motion, a training-free framework for object-level motion editing in image-to-video generation. Unlike prior methods requiring trajectories, bounding boxes, masks, or motion fields, Search2Motion adopts target-frame-based control, leveraging first-last-frame motion priors to realize object relocation while preserving scene stability without fine-tuning. Reliable target-frame construction is achieved through semantic-guided object insertion and robust background inpainting. We further show that early-step self-attention maps predict object and camera dynamics, offering interpretable user feedback and motivating ACE-Seed (Attention Consensus for Early-step Seed selection), a lightweight search strategy that improves motion fidelity without look-ahead sampling or external evaluators. Noting that existing benchmarks conflate object and camera motion, we introduce S2M-DAVIS and S2M-OMB for stable-camera, object-only evaluation, alongside FLF2V-obj metrics that isolate object artifacts without requiring ground-truth trajectories. Search2Motion consistently outperforms baselines on FLF2V-obj and VBench.
摘要：我们推出了 Search2Motion，这是一个无需训练的框架，用于图像到视频生成中的对象级运动编辑。与之前需要轨迹、边界框、掩模或运动场的方法不同，Search2Motion 采用基于目标帧的控制，利用首末帧运动先验来实现对象重新定位，同时保持场景稳定性，无需微调。通过语义引导的对象插入和强大的背景修复来实现可靠的目标框架构建。我们进一步表明，早期自注意力图可以预测物体和相机动态，提供可解释的用户反馈并激发 ACE-Seed（早期种子选择的注意力共识），这是一种轻量级搜索策略，无需前瞻采样或外部评估器即可提高运动保真度。注意到现有的基准测试将物体和相机运动混为一谈，我们引入了 S2M-DAVIS 和 S2M-OMB 用于稳定相机、仅物体评估，以及无需地面实况轨迹即可隔离物体伪影的 FLF2V-obj 指标。 Search2Motion 在 FLF2V-obj 和 VBench 上的性能始终优于基线。

Title: Emotion-Aware Classroom Quality Assessment Leveraging IoT-Based Real-Time Student Monitoring

Authors: Hai Nguyen, Hieu Dao, Hung Nguyen, Nam Vu, Cong Tran
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.16719
Pdf URL: https://arxiv.org/pdf/2603.16719
Copy Paste: [[2603.16719]] Emotion-Aware Classroom Quality Assessment Leveraging IoT-Based Real-Time Student Monitoring(https://arxiv.org/abs/2603.16719)
Keywords: quality assessment
Abstract: This study presents a high-throughput, real-time multi-agent affective computing framework designed to enhance classroom learning through emotional state monitoring. As large classroom sizes and limited teacher-student interaction increasingly challenge educators, there is a growing need for scalable, data-driven tools capable of capturing students' emotional and engagement patterns in real time. The system was evaluated using the Classroom Emotion Dataset, consisting of 1,500 labeled images and 300 classroom detection videos. Tailored for IoT devices, the system addresses load balancing and latency challenges through efficient real-time processing. Field testing was conducted across three educational institutions in a large metropolitan area: a primary school (hereafter school A), a secondary school (school B), and a high school (school C). The system demonstrated robust performance, detecting up to 50 faces at 25 FPS and achieving 88% overall accuracy in classifying classroom engagement states. Implementation results showed positive outcomes, with favorable feedback from students, teachers, and parents regarding improved classroom interaction and teaching adaptation. Key contributions of this research include establishing a practical, IoT-based framework for emotion-aware learning environments and introducing the 'Classroom Emotion Dataset' to facilitate further validation and research.
摘要：本研究提出了高通量、实时多主体情感计算框架，旨在通过情绪状态监测来增强课堂学习。由于教室规模较大和师生互动有限，教育工作者日益面临挑战，因此越来越需要能够实时捕捉学生情绪和参与模式的可扩展、数据驱动的工具。该系统使用课堂情绪数据集进行评估，该数据集由 1,500 个标记图像和 300 个课堂检测视频组成。该系统专为物联网设备量身定制，通过高效的实时处理解决负载平衡和延迟挑战。现场测试在一个大城市地区的三所教育机构中进行：一所小学（以下简称学校 A）、一所中学（学校 B）和一所高中（学校 C）。该系统表现出强大的性能，能够以 25 FPS 的速度检测多达 50 张面孔，并在对课堂参与状态进行分类时实现 88% 的总体准确率。实施结果显示出积极的成果，学生、教师和家长对改善课堂互动和教学适应给予了积极的反馈。这项研究的主要贡献包括为情绪感知学习环境建立一个实用的、基于物联网的框架，并引入“课堂情绪数据集”以促进进一步的验证和研究。

Title: GeMA: Learning Latent Manifold Frontiers for Benchmarking Complex Systems

Authors: Jia Ming Li, Anupriya, Daniel J. Graham
Subjects: cs.LG, cs.CE, econ.EM, math.OC, stat.ML
Abstract URL: https://arxiv.org/abs/2603.16729
Pdf URL: https://arxiv.org/pdf/2603.16729
Copy Paste: [[2603.16729]] GeMA: Learning Latent Manifold Frontiers for Benchmarking Complex Systems(https://arxiv.org/abs/2603.16729)
Keywords: generation
Abstract: Benchmarking the performance of complex systems such as rail networks, renewable generation assets and national economies is central to transport planning, regulation and macroeconomic analysis. Classical frontier methods, notably Data Envelopment Analysis (DEA) and Stochastic Frontier Analysis (SFA), estimate an efficient frontier in the observed input-output space and define efficiency as distance to this frontier, but rely on restrictive assumptions on the production set and only indirectly address heterogeneity and scale effects. We propose Geometric Manifold Analysis (GeMA), a latent manifold frontier framework implemented via a productivity-manifold variational autoencoder (ProMan-VAE). Instead of specifying a frontier function in the observed space, GeMA represents the production set as the boundary of a low-dimensional manifold embedded in the joint input-output space. A split-head encoder learns latent variables that capture technological structure and operational inefficiency. Efficiency is evaluated with respect to the learned manifold, endogenous peer groups arise as clusters in latent technology space, a quotient construction supports scale-invariant benchmarking, and a local certification radius, derived from the decoder Jacobian and a Lipschitz bound, quantifies the geometric robustness of efficiency scores. We validate GeMA on synthetic data with non-convex frontiers, heterogeneous technologies and scale bias, and on four real-world case studies: global urban rail systems (COMET), British rail operators (ORR), national economies (Penn World Table) and a high-frequency wind-farm dataset. Across these domains GeMA behaves comparably to established methods when classical assumptions hold, and provides additional insight in settings with pronounced heterogeneity, non-convexity or size-related bias.
摘要：对铁路网络、可再生能源发电资产和国民经济等复杂系统的性能进行基准测试是交通规划、监管和宏观经济分析的核心。经典前沿方法，特别是数据包络分析（DEA）和随机前沿分析（SFA），估计观察到的输入输出空间中的有效前沿，并将效率定义为与该前沿的距离，但依赖于对生产集的限制性假设，并且只能间接解决异质性和规模效应。我们提出了几何流形分析（GeMA），这是一种通过生产力流形变分自动编码器（ProMan-VAE）实现的潜在流形前沿框架。 GeMA 不是在观察空间中指定前沿函数，而是将产生集表示为嵌入联合输入输出空间中的低维流形的边界。分头编码器学习捕获技术结构和操作低效率的潜在变量。效率是根据学习的流形进行评估的，内生的同行群体作为潜在技术空间中的集群而出现，商结构支持尺度不变的基准测试，并且从解码器雅可比行列式和利普希茨界限导出的本地认证半径量化了效率分数的几何鲁棒性。我们在具有非凸边界、异构技术和规模偏差的合成数据以及四个现实案例研究上验证了 GeMA：全球城市轨道系统 (COMET)、英国铁路运营商 (ORR)、国民经济（宾夕法尼亚世界表）和高频风电场数据集。在这些领域中，当经典假设成立时，GeMA 的表现与已建立的方法相当，并且在具有明显异质性、非凸性或大小相关偏差的环境中提供了额外的见解。

Title: Bayesian Inference of Psychometric Variables From Brain and Behavior in Implicit Association Tests

Authors: Christian A. Kothe, Sean Mullen, Michael V. Bronstein, Grant Hanada, Marcelo Cicconet, Aaron N. McInnes, Tim Mullen, Marc Aafjes, Scott R. Sponheim, Alik S. Widge
Subjects: cs.LG, q-bio.NC, q-bio.QM, stat.ML
Abstract URL: https://arxiv.org/abs/2603.16741
Pdf URL: https://arxiv.org/pdf/2603.16741
Copy Paste: [[2603.16741]] Bayesian Inference of Psychometric Variables From Brain and Behavior in Implicit Association Tests(https://arxiv.org/abs/2603.16741)
Keywords: generation
Abstract: Objective. We establish a principled method for inferring mental health related psychometric variables from neural and behavioral data using the Implicit Association Test (IAT) as the data generation engine, aiming to overcome the limited predictive performance (typically under 0.7 AUC) of the gold-standard D-score method, which relies solely on reaction times. Approach. We propose a sparse hierarchical Bayesian model that leverages multi-modal data to predict experiences related to mental illness symptoms in new participants. The model is a multivariate generalization of the D-score with trainable parameters, engineered for parameter efficiency in the small-cohort regime typical of IAT studies. Data from two IAT variants were analyzed: a suicidality-related E-IAT ($n=39$) and a psychosis-related PSY-IAT ($n=34$). Main Results. Our approach overcomes a high inter-individual variability and low within-session effect size in the dataset, reaching AUCs of 0.73 (E-IAT) and 0.76 (PSY-IAT) in the best modality configurations, though corrected 95% confidence intervals are wide ($\pm 0.18$) and results are marginally significant after FDR correction ($q=0.10$). Restricting the E-IAT to MDD participants improves AUC to 0.79 $[0.62, 0.97]$ (significant at $q=0.05$). Performance is on par with the best reference methods (shrinkage LDA and EEGNet) for each task, even when the latter were adapted to the task, while the proposed method was not. Accuracy was substantially above near-chance D-scores (0.50-0.53 AUC) in both tasks, with more consistent cross-task performance than any single reference method. Significance. Our framework shows promise for enhancing IAT-based assessment of experiences related to entrapment and psychosis, and potentially other mental health conditions, though further validation on larger and independent cohorts will be needed to establish clinical utility.
摘要：客观的。我们建立了一种原则性方法，使用内隐关联测试（IAT）作为数据生成引擎，从神经和行为数据中推断心理健康相关的心理测量变量，旨在克服仅依赖于反应时间的黄金标准 D 分数方法的有限预测性能（通常低于 0.7 AUC）。方法。我们提出了一种稀疏分层贝叶斯模型，利用多模态数据来预测新参与者与精神疾病症状相关的经历。该模型是具有可训练参数的 D 分数的多变量概括，专为 IAT 研究典型的小队列制度中的参数效率而设计。分析了两种 IAT 变体的数据：与自杀相关的 E-IAT ($n=39$) 和与精神病相关的 PSY-IAT ($n=34$)。主要结果。我们的方法克服了数据集中较高的个体间变异性和较低的会话内效应大小，在最佳模态配置中达到 0.73 (E-IAT) 和 0.76 (PSY-IAT) 的 AUC，尽管校正后的 95% 置信区间较宽 ($\pm 0.18$)，并且在 FDR 校正后结果略有显着 ($q=0.10$)。将 E-IAT 限制为 MDD 参与者可将 AUC 提高至 0.79 $[0.62, 0.97]$（在 $q=0.05$ 时显着）。每个任务的性能与最佳参考方法（收缩 LDA 和 EEGNet）相当，即使后者适应了任务，而所提出的方法却没有。两项任务的准确度均远高于接近机会的 D 分数 (0.50-0.53 AUC)，并且比任何单一参考方法具有更一致的跨任务性能。意义。我们的框架有望增强基于 IAT 的对受困和精神病以及其他潜在心理健康状况相关经历的评估，尽管需要对更大且独立的队列进行进一步验证以确定临床效用。

Title: Semi-supervised Latent Disentangled Diffusion Model for Textile Pattern Generation

Authors: Chenggong Hu, Yi Wang, Mengqi Xue, Haofei Zhang, Jie Song, Li Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.16747
Pdf URL: https://arxiv.org/pdf/2603.16747
Copy Paste: [[2603.16747]] Semi-supervised Latent Disentangled Diffusion Model for Textile Pattern Generation(https://arxiv.org/abs/2603.16747)
Keywords: generation
Abstract: Textile pattern generation (TPG) aims to synthesize fine-grained textile pattern images based on given clothing images. Although previous studies have not explicitly investigated TPG, existing image-to-image models appear to be natural candidates for this task. However, when applied directly, these methods often produce unfaithful results, failing to preserve fine-grained details due to feature confusion between complex textile patterns and the inherent non-rigid texture distortions in clothing images. In this paper, we propose a novel method, SLDDM-TPG, for faithful and high-fidelity TPG. Our method consists of two stages: (1) a latent disentangled network (LDN) that resolves feature confusion in clothing representations and constructs a multi-dimensional, independent clothing feature space; and (2) a semi-supervised latent diffusion model (S-LDM), which receives guidance signals from LDN and generates faithful results through semi-supervised diffusion training, combined with our designed fine-grained alignment strategy. Extensive evaluations show that SLDDM-TPG reduces FID by 4.1 and improves SSIM by up to 0.116 on our CTP-HD dataset, and also demonstrate good generalization on the VITON-HD dataset.
摘要：纺织图案生成（TPG）旨在根据给定的服装图像合成细粒度的纺织图案图像。尽管之前的研究没有明确研究 TPG，但现有的图像到图像模型似乎是这项任务的自然候选者。然而，当直接应用时，这些方法通常会产生不忠实的结果，由于复杂的纺织图案之间的特征混淆和服装图像中固有的非刚性纹理扭曲，无法保留细粒度的细节。在本文中，我们提出了一种新方法 SLDDM-TPG，用于忠实且高保真的 TPG。我们的方法由两个阶段组成：（1）潜在解缠网络（LDN），解决服装表示中的特征混乱并构建多维、独立的服装特征空间；（2）半监督潜在扩散模型（S-LDM），它接收来自LDN的指导信号，并通过半监督扩散训练结合我们设计的细粒度对齐策略生成忠实的结果。广泛的评估表明，SLDDM-TPG 在我们的 CTP-HD 数据集上将 FID 降低了 4.1，将 SSIM 提高了 0.116，并且还在 VITON-HD 数据集上展示了良好的泛化能力。

Title: pADAM: A Plug-and-Play All-in-One Diffusion Architecture for Multi-Physics Learning

Authors: Amirhossein Mollaali, Bongseok Kim, Christian Moya, Guang Lin
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.16757
Pdf URL: https://arxiv.org/pdf/2603.16757
Copy Paste: [[2603.16757]] pADAM: A Plug-and-Play All-in-One Diffusion Architecture for Multi-Physics Learning(https://arxiv.org/abs/2603.16757)
Keywords: generative
Abstract: Generalizing across disparate physical laws remains a fundamental challenge for artificial intelligence in science. Existing deep-learning solvers are largely confined to single-equation settings, limiting transfer across physical regimes and inference tasks. Here we introduce pADAM, a unified generative framework that learns a shared probabilistic prior across heterogeneous partial differential equation families. Through a learned joint distribution of system states and, where applicable, physical parameters, pADAM supports forward prediction and inverse inference within a single architecture without retraining. Across benchmarks ranging from scalar diffusion to nonlinear Navier--Stokes equations, pADAM achieves accurate inference even under sparse observations. Combined with conformal prediction, it also provides reliable uncertainty quantification with coverage guarantees. In addition, pADAM performs probabilistic model selection from only two sparse snapshots, identifying governing laws through its learned generative representation. These results highlight the potential of generative multi-physics modeling for unified and uncertainty-aware scientific inference.
摘要：推广不同的物理定律仍然是科学领域人工智能面临的基本挑战。现有的深度学习求解器很大程度上局限于单方程设置，限制了跨物理机制和推理任务的迁移。在这里，我们介绍 pADAM，一个统一的生成框架，可以学习跨异构偏微分方程族的共享概率先验。通过学习系统状态的联合分布以及适用的物理参数，pADAM 支持单个架构内的前向预测和逆向推理，而无需重新训练。跨越从标量扩散到非线性纳维-斯托克斯方程的基准，pADAM 即使在稀疏观测下也能实现准确的推理。与保形预测相结合，它还提供可靠的不确定性量化和覆盖保证。此外，pADAM 仅从两个稀疏快照中执行概率模型选择，通过其学习的生成表示来识别控制法则。这些结果凸显了生成多物理建模在统一和不确定性感知科学推理方面的潜力。
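Note: the abstract pairs pADAM with conformal prediction to obtain coverage guarantees but does not say which variant is used. As a minimal sketch, assuming plain split conformal prediction with absolute-residual scores on a scalar quantity, the interval half-width can be computed as below. The function names and the toy calibration data are hypothetical and not taken from the paper.

```python
import math
import numpy as np

def conformal_radius(cal_pred, cal_true, alpha=0.1):
    """Split-conformal half-width from a held-out calibration set.

    Assumption: nonconformity score = absolute residual |y - y_hat|.
    """
    scores = np.sort(np.abs(np.asarray(cal_true, dtype=float)
                            - np.asarray(cal_pred, dtype=float)))
    n = scores.size
    # Finite-sample rank ceil((n + 1) * (1 - alpha)); if it exceeds n,
    # no finite interval can carry the guarantee.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")
    return float(scores[k - 1])

def predict_interval(pred, radius):
    """Symmetric interval around a point prediction."""
    return pred - radius, pred + radius

# Toy calibration data (hypothetical): predictions with small Gaussian error.
rng = np.random.default_rng(0)
cal_true = rng.normal(size=200)
cal_pred = cal_true + 0.1 * rng.normal(size=200)

radius = conformal_radius(cal_pred, cal_true, alpha=0.1)
low, high = predict_interval(0.5, radius)
```

The guarantee is marginal: under exchangeability of calibration and test points, the interval covers the true value with probability at least 1 - alpha.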

Title: GDPO-SR: Group Direct Preference Optimization for One-Step Generative Image Super-Resolution

Authors: Qiaosi Yi, Shuai Li, Rongyuan Wu, Lingchen Sun, Zhengqiang Zhang, Lei Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.16769
Pdf URL: https://arxiv.org/pdf/2603.16769
Copy Paste: [[2603.16769]] GDPO-SR: Group Direct Preference Optimization for One-Step Generative Image Super-Resolution(https://arxiv.org/abs/2603.16769)
Keywords: super-resolution, generation, generative
Abstract: Recently, reinforcement learning (RL) has been employed for improving generative image super-resolution (ISR) performance. However, the current efforts are focused on multi-step generative ISR, while one-step generative ISR remains underexplored due to its limited stochasticity. In addition, RL methods such as Direct Preference Optimization (DPO) require the generation of positive and negative sample pairs offline, leading to a limited number of samples, while Group Relative Policy Optimization (GRPO) only calculates the likelihood of the entire image, ignoring local details that are crucial for ISR. In this paper, we propose Group Direct Preference Optimization (GDPO), a novel approach to integrate RL into one-step generative ISR model training. First, we introduce a noise-aware one-step diffusion model that can generate diverse ISR outputs. To prevent performance degradation caused by noise injection, we introduce an unequal-timestep strategy to decouple the timestep of noise addition from that of diffusion. We then present the GDPO strategy, which integrates the principle of GRPO into DPO, to calculate the group-relative advantage of each online generated sample for model optimization. Meanwhile, an attribute-aware reward function is designed to dynamically evaluate the score of each sample based on its statistics of smooth and texture areas. Experiments demonstrate the effectiveness of GDPO in enhancing the performance of one-step generative ISR models. Code: this https URL.
摘要：最近，强化学习（RL）已被用于提高生成图像超分辨率（ISR）性能。然而，目前的工作重点是多步生成 ISR，而单步生成 ISR 由于其有限的随机性而仍未得到充分探索。此外，直接偏好优化（DPO）等强化学习方法需要离线生成正负样本对，导致样本数量有限，而组相对策略优化（GRPO）只计算整个图像的可能性，忽略对ISR至关重要的局部细节。在本文中，我们提出了群体直接偏好优化（GDPO），这是一种将 RL 集成到一步生成 ISR 模型训练中的新方法。首先，我们引入了一种噪声感知的一步扩散模型，可以生成不同的 ISR 输出。为了防止噪声注入引起的性能下降，我们引入了一种不等时间步长策略，将噪声添加的时间步长与扩散的时间步长分离。然后，我们提出了 GDPO 策略，它将 GRPO 的原理融入到 DPO 中，计算每个在线生成样本的群体相对优势，以进行模型优化。同时，设计了一个属性感知奖励函数，根据平滑和纹理区域的统计动态评估每个样本的得分。实验证明了 GDPO 在增强一步生成 ISR 模型性能方面的有效性。代码：此 https URL。
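Note: the group-relative advantage mentioned in the abstract can be illustrated with the usual GRPO-style normalization, in which each sample's reward is standardized against the other samples generated for the same input. This is a sketch under that assumption only. How GDPO feeds these advantages into a DPO-style objective, and how the attribute-aware reward is computed, are not specified in the abstract; the reward values below are made up.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples for the same input."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Hypothetical rewards for four super-resolved outputs of one LR image.
rewards = [0.2, 0.5, 0.9, 0.4]
adv = group_relative_advantages(rewards)

# Samples above the group mean get positive advantage, the rest negative.
best = int(np.argmax(adv))
```

Because the advantages are computed online from the current group, no offline positive/negative pairs are needed, which is the limitation of plain DPO that the abstract points out.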

Title: IOSVLM: A 3D Vision-Language Model for Unified Dental Diagnosis from Intraoral Scans

Authors: Huimin Xiong, Zijie Meng, Tianxiang Hu, Chenyi Zhou, Yang Feng, Zuozhu Liu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.16781
Pdf URL: https://arxiv.org/pdf/2603.16781
Copy Paste: [[2603.16781]] IOSVLM: A 3D Vision-Language Model for Unified Dental Diagnosis from Intraoral Scans(https://arxiv.org/abs/2603.16781)
Keywords: generation, generative
Abstract: 3D intraoral scans (IOS) are increasingly adopted in routine dentistry due to abundant geometric evidence, and unified multi-disease diagnosis is desirable for clinical documentation and communication. While recent works introduce dental vision-language models (VLMs) to enable unified diagnosis and report generation on 2D images or multi-view images rendered from IOS, they do not fully leverage native 3D geometry. Such work is necessary and also challenging, due to: (i) heterogeneous scan forms and the complex IOS topology, (ii) multi-disease co-occurrence with class imbalance and fine-grained morphological ambiguity, (iii) limited paired 3D IOS-text data. Thus, we present IOSVLM, an end-to-end 3D VLM that represents scans as point clouds and follows a 3D encoder-projector-LLM design for unified diagnosis and generative visual question-answering (VQA), together with IOSVQA, a large-scale multi-source IOS diagnosis VQA dataset comprising 19,002 cases and 249,055 VQA pairs over 23 oral diseases and heterogeneous scan types. To address the distribution gap between color-free IOS data and color-dependent 3D pre-training, we propose a geometry-to-chromatic proxy that stabilizes fine-grained geometric perception and cross-modal alignment. A two-stage curriculum training strategy further enhances robustness. IOSVLM consistently outperforms strong baselines, achieving gains of at least +9.58% macro accuracy and +1.46% macro F1, indicating the effectiveness of direct 3D geometry modeling for IOS-based diagnosis.
摘要：由于丰富的几何证据，3D 口内扫描 (IOS) 在常规牙科中越来越多地采用，并且统一的多疾病诊断对于临床记录和交流来说是理想的。虽然最近的工作引入了牙科视觉语言模型 (VLM)，以实现对 2D 图像或多视图图像从 IOS 渲染的统一诊断和报告生成，但它们并没有充分利用原生 3D 几何结构。这样的工作是必要的，也是具有挑战性的，因为：(i) 异构扫描形式和复杂的 IOS 拓扑，(ii) 多种疾病共现，类别不平衡和细粒度形态模糊，(iii) 有限的配对 3D IOS 文本数据。因此，我们提出了 IOSVLM，一种端到端 3D VLM，将扫描表示为点云，并遵循 3D 编码器-投影仪-LLM 设计，用于统一诊断和生成视觉问答 (VQA)，以及 IOSVQA，一种大规模多源 IOS 诊断 VQA 数据集，包含 19,002 个病例和 23 个口腔疾病和异构扫描类型的 249,055 个 VQA 对。为了解决无色 IOS 数据和依赖于颜色的 3D 预训练之间的分布差距，我们提出了一种几何到色彩的代理，可以稳定细粒度的几何感知和跨模式对齐。两阶段课程培训策略进一步增强了稳健性。 IOSVLM 始终优于强大的基线，实现了至少 +9.58% 的宏观精度和 +1.46% 的宏观 F1，表明直接 3D 几何建模对于基于 IOS 的诊断的有效性。

Title: V-Co: A Closer Look at Visual Representation Alignment via Co-Denoising

Authors: Han Lin, Xichen Pan, Zun Wang, Yue Zhang, Chu Wang, Jaemin Cho, Mohit Bansal
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.16792
Pdf URL: https://arxiv.org/pdf/2603.16792
Copy Paste: [[2603.16792]] V-Co: A Closer Look at Visual Representation Alignment via Co-Denoising(https://arxiv.org/abs/2603.16792)
Keywords: generation, generative
Abstract: Pixel-space diffusion has recently re-emerged as a strong alternative to latent diffusion, enabling high-quality generation without pretrained autoencoders. However, standard pixel-space diffusion models receive relatively weak semantic supervision and are not explicitly designed to capture high-level visual structure. Recent representation-alignment methods (e.g., REPA) suggest that pretrained visual features can substantially improve diffusion training, and visual co-denoising has emerged as a promising direction for incorporating such features into the generative process. However, existing co-denoising approaches often entangle multiple design choices, making it unclear which design choices are truly essential. Therefore, we present V-Co, a systematic study of visual co-denoising in a unified JiT-based framework. This controlled setting allows us to isolate the ingredients that make visual co-denoising effective. Our study reveals four key ingredients for effective visual co-denoising. First, preserving feature-specific computation while enabling flexible cross-stream interaction motivates a fully dual-stream architecture. Second, effective classifier-free guidance (CFG) requires a structurally defined unconditional prediction. Third, stronger semantic supervision is best provided by a perceptual-drifting hybrid loss. Fourth, stable co-denoising further requires proper cross-stream calibration, which we realize through RMS-based feature rescaling. Together, these findings yield a simple recipe for visual co-denoising. Experiments on ImageNet-256 show that, at comparable model sizes, V-Co outperforms the underlying pixel-space diffusion baseline and strong prior pixel-diffusion methods while using fewer training epochs, offering practical guidance for future representation-aligned generative models.
摘要：像素空间扩散最近重新出现，成为潜在扩散的强大替代方案，无需预先训练的自动编码器即可实现高质量生成。然而，标准的像素空间扩散模型接受相对较弱的语义监督，并且没有明确设计用于捕获高级视觉结构。最近的表示对齐方法（例如，REPA）表明，预训练的视觉特征可以显着改善扩散训练，而视觉共同去噪已成为将此类特征纳入生成过程的一个有希望的方向。然而，现有的联合去噪方法经常会涉及多种设计选择，从而导致不清楚哪些设计选择是真正重要的。因此，我们提出了 V-Co，这是一种在基于 JiT 的统一框架中对视觉联合去噪的系统研究。这种受控设置使我们能够分离出使视觉协同降噪有效的成分。我们的研究揭示了有效视觉协同降噪的四个关键要素。首先，保留特定功能的计算，同时实现灵活的跨流交互，激发了完全双流架构。其次，有效的无分类器指导（CFG）需要结构上定义的无条件预测。第三，感知漂移混合损失最好能提供更强的语义监督。第四，稳定的共同去噪还需要适当的跨流校准，我们通过基于 RMS 的特征重新缩放来实现这一点。总之，这些发现产生了视觉联合去噪的简单方法。 ImageNet-256 上的实验表明，在相当的模型大小下，V-Co 优于底层像素空间扩散基线和强大的先验像素扩散方法，同时使用更少的训练周期，为未来的表示对齐生成模型提供实用指导。
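Note: the fourth ingredient, RMS-based feature rescaling, can be pictured as bringing one stream's features to the same root-mean-square magnitude as the other stream before they interact. The sketch below assumes per-token rescaling over the channel axis. The actual calibration rule in V-Co may differ, and the names `match_rms`, `pixel_tokens`, and `semantic_tokens` are hypothetical.

```python
import numpy as np

def rms(x, axis=-1, eps=1e-6):
    """Root-mean-square magnitude along the channel axis."""
    return np.sqrt(np.mean(np.square(x), axis=axis, keepdims=True) + eps)

def match_rms(features, reference):
    """Rescale each token in `features` so its RMS matches the reference token."""
    return features * (rms(reference) / rms(features))

# Toy streams (hypothetical): semantic features have a much larger scale.
rng = np.random.default_rng(0)
pixel_tokens = rng.normal(scale=1.0, size=(4, 16))
semantic_tokens = rng.normal(scale=25.0, size=(4, 16))

calibrated = match_rms(semantic_tokens, pixel_tokens)
```

Only the magnitude changes; the direction of each semantic token is preserved, so the information carried by the pretrained features is left intact.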

Title: Adaptive Moments are Surprisingly Effective for Plug-and-Play Diffusion Sampling

Authors: Christian Belardi, Justin Lovelace, Kilian Q. Weinberger, Carla P. Gomes
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2603.16797
Pdf URL: https://arxiv.org/pdf/2603.16797
Copy Paste: [[2603.16797]] Adaptive Moments are Surprisingly Effective for Plug-and-Play Diffusion Sampling(https://arxiv.org/abs/2603.16797)
Keywords: restoration, generation
Abstract: Guided diffusion sampling relies on approximating often intractable likelihood scores, which introduces significant noise into the sampling dynamics. We propose using adaptive moment estimation to stabilize these noisy likelihood scores during sampling. Despite its simplicity, our approach achieves state-of-the-art results on image restoration and class-conditional generation tasks, outperforming more complicated methods, which are often computationally more expensive. We provide empirical analysis of our method on both synthetic and real data, demonstrating that mitigating gradient noise through adaptive moments offers an effective way to improve alignment.
摘要：引导扩散采样依赖于近似通常难以处理的似然分数，这会在采样动态中引入显着的噪声。我们建议使用自适应矩估计来稳定采样期间这些噪声似然分数。尽管它很简单，但我们的方法在图像恢复和类条件生成任务上实现了最先进的结果，优于更复杂的方法，这些方法通常计算成本更高。我们对合成数据和真实数据的方法进行了实证分析，证明通过自适应矩减轻梯度噪声提供了改进对齐的有效方法。
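Note: adaptive moment estimation here refers to the Adam-style running averages of a gradient and its square. The sketch below applies them to a stream of noisy guidance gradients, one per sampling step, and returns the bias-corrected, normalized directions. It is an illustration of the general idea, not the authors' sampler: how the smoothed score enters the diffusion update is not described in the abstract, and the noise model and names are hypothetical.

```python
import numpy as np

def adam_guidance(noisy_grads, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return Adam-style smoothed directions for a sequence of gradients."""
    noisy_grads = np.asarray(noisy_grads, dtype=float)
    m = np.zeros_like(noisy_grads[0])  # first moment (running mean)
    v = np.zeros_like(noisy_grads[0])  # second moment (running mean of squares)
    out = []
    for t, g in enumerate(noisy_grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)   # bias correction
        v_hat = v / (1 - beta2 ** t)
        out.append(m_hat / (np.sqrt(v_hat) + eps))
    return np.stack(out)

# Toy example (hypothetical): a fixed true gradient observed under noise.
rng = np.random.default_rng(0)
true_grad = np.array([1.0, -2.0])
noisy = true_grad + rng.normal(scale=1.0, size=(200, 2))
directions = adam_guidance(noisy)
```

The first moment averages out step-to-step noise, while the second moment normalizes the step size per coordinate, so a single noisy likelihood estimate cannot dominate the sampling trajectory.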

Title: RaDAR: Relation-aware Diffusion-Asymmetric Graph Contrastive Learning for Recommendation

Authors: Yixuan Huang, Jiawei Chen, Shengfan Zhang, Zongsheng Cao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.16800
Pdf URL: https://arxiv.org/pdf/2603.16800
Copy Paste: [[2603.16800]] RaDAR: Relation-aware Diffusion-Asymmetric Graph Contrastive Learning for Recommendation(https://arxiv.org/abs/2603.16800)
Keywords: generation, generative
Abstract: Collaborative filtering (CF) recommendation has been significantly advanced by integrating Graph Neural Networks (GNNs) and Graph Contrastive Learning (GCL). However, (i) random edge perturbations often distort critical structural signals and degrade semantic consistency across augmented views, and (ii) data sparsity hampers the propagation of collaborative signals, limiting generalization. To tackle these challenges, we propose RaDAR (Relation-aware Diffusion-Asymmetric Graph Contrastive Learning Framework for Recommendation Systems), a novel framework that combines two complementary view generation mechanisms: a graph generative model to capture global structure and a relation-aware denoising model to refine noisy edges. RaDAR introduces three key innovations: (1) asymmetric contrastive learning with global negative sampling to maintain semantic alignment while suppressing noise; (2) diffusion-guided augmentation, which employs progressive noise injection and denoising for enhanced robustness; and (3) relation-aware edge refinement, dynamically adjusting edge weights based on latent node semantics. Extensive experiments on three public benchmarks demonstrate that RaDAR consistently outperforms state-of-the-art methods, particularly under noisy and sparse conditions.
摘要：通过集成图神经网络（GNN）和图对比学习（GCL），协同过滤（CF）推荐得到了显着的进步。然而，（i）随机边缘扰动通常会扭曲关键结构信号并降低增强视图之间的语义一致性，（ii）数据稀疏性阻碍协作信号的传播，限制泛化。为了应对这些挑战，我们提出了 RaDAR（用于推荐系统的关系感知扩散非对称图对比学习框架），这是一种结合了两种互补视图生成机制的新颖框架：用于捕获全局结构的图生成模型和用于细化噪声边缘的关系感知去噪模型。 RaDAR 引入了三项关键创新：（1）采用全局负采样的非对称对比学习，以在抑制噪声的同时保持语义对齐； (2) 扩散引导增强，采用渐进式噪声注入和去噪来增强鲁棒性； (3)关系感知边缘细化，基于潜在节点语义动态调整边缘权重。对三个公共基准的大量实验表明，雷达的性能始终优于最先进的方法，特别是在嘈杂和稀疏的条件下。

Title: SparkVSR: Interactive Video Super-Resolution via Sparse Keyframe Propagation

Authors: Jiongze Yu, Xiangbo Gao, Pooja Verlani, Akshay Gadde, Yilin Wang, Balu Adsumilli, Zhengzhong Tu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.16864
Pdf URL: https://arxiv.org/pdf/2603.16864
Copy Paste: [[2603.16864]] SparkVSR: Interactive Video Super-Resolution via Sparse Keyframe Propagation(https://arxiv.org/abs/2603.16864)
Keywords: restoration, super-resolution
Abstract: Video Super-Resolution (VSR) aims to restore high-quality video frames from low-resolution (LR) estimates, yet most existing VSR approaches behave like black boxes at inference time: users cannot reliably correct unexpected artifacts, but instead can only accept whatever the model produces. In this paper, we propose a novel interactive VSR framework dubbed SparkVSR that makes sparse keyframes a simple and expressive control signal. Specifically, users first super-resolve a small set of keyframes using any off-the-shelf image super-resolution (ISR) model; SparkVSR then propagates the keyframe priors to the entire video sequence while remaining grounded by the original LR video motion. Concretely, we introduce a keyframe-conditioned latent-pixel two-stage training pipeline that fuses LR video latents with sparsely encoded HR keyframe latents to learn robust cross-space propagation and refine perceptual details. At inference time, SparkVSR supports flexible keyframe selection (manual specification, codec I-frame extraction, or random sampling) and a reference-free guidance mechanism that continuously balances keyframe adherence and blind restoration, ensuring robust performance even when reference keyframes are absent or imperfect. Experiments on multiple VSR benchmarks demonstrate improved temporal consistency and strong restoration quality, surpassing baselines by up to 24.6%, 21.8%, and 5.6% on CLIP-IQA, DOVER, and MUSIQ, respectively, enabling controllable, keyframe-driven video super-resolution. Moreover, we demonstrate that SparkVSR is a generic interactive, keyframe-conditioned video processing framework, as it can be applied out of the box to unseen tasks such as old-film restoration and video style transfer. Our project page is available at: this https URL
摘要：视频超分辨率 (VSR) 旨在从低分辨率 (LR) 估计中恢复高质量视频帧，但大多数现有的 VSR 方法在推理时的行为就像黑匣子：用户无法可靠地纠正意外的伪影，而只能接受模型产生的任何内容。在本文中，我们提出了一种新颖的交互式 VSR 框架，称为 SparkVSR，它使稀疏关键帧成为简单且富有表现力的控制信号。具体来说，用户可以首先使用任何现成的图像超分辨率 (ISR) 模型进行超分辨率或可选的一小组关键帧，然后 SparkVSR 将先验关键帧传播到整个视频序列，同时保持原始 LR 视频运动的基础。具体来说，我们引入了一种关键帧条件潜在像素两阶段训练管道，它将 LR 视频潜在特征与稀疏编码的 HR 关键帧潜在特征融合起来，以学习鲁棒的跨空间传播并细化感知细节。在推理时，SparkVSR 支持灵活的关键帧选择（手动指定、编解码器 I 帧提取或随机采样）和无参考引导机制，可持续平衡关键帧遵循和盲恢复，即使参考关键帧不存在或不完善，也能确保稳健的性能。在多个 VSR 基准上的实验表明，时间一致性得到了改善，恢复质量也得到了提高，在 CLIP-IQA、DOVER 和 MUSIQ 上分别超出基线高达 24.6%、21.8% 和 5.6%，从而实现了可控、关键帧驱动的视频超分辨率。此外，我们还证明 SparkVSR 是一种通用的交互式关键帧调节视频处理框架，因为它可以开箱即用地应用于看不见的任务，例如老电影修复和视频风格转换。我们的项目页面位于：此 https URL

Title: Efficient Reasoning on the Edge

Authors: Yelysei Bondarenko, Thomas Hehn, Rob Hesselink, Romain Lepert, Fabio Valerio Massoli, Evgeny Mironov, Leyla Mirvakhabova, Tribhuvanesh Orekondy, Spyridon Stasis, Andrey Kuzmin, Anna Kuzina, Markus Nagel, Ankita Nayak, Corrado Rainone, Ork de Rooij, Paul N Whatmough, Arash Behboodi, Babak Ehteshami Bejnordi
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2603.16867
Pdf URL: https://arxiv.org/pdf/2603.16867
Copy Paste: [[2603.16867]] Efficient Reasoning on the Edge(https://arxiv.org/abs/2603.16867)
Keywords: generation
Abstract: Large language models (LLMs) with chain-of-thought reasoning achieve state-of-the-art performance across complex problem-solving tasks, but their verbose reasoning traces and large context requirements make them impractical for edge deployment. These challenges include high token generation costs, large KV-cache footprints, and inefficiencies when distilling reasoning capabilities into smaller models for mobile devices. Existing approaches often rely on distilling reasoning traces from larger models into smaller models, but these traces are verbose and stylistically redundant, which is undesirable for on-device inference. In this work, we propose a lightweight approach to enable reasoning in small LLMs using LoRA adapters combined with supervised fine-tuning. We further introduce budget forcing via reinforcement learning on these adapters, significantly reducing response length with minimal accuracy loss. To address memory-bound decoding, we exploit parallel test-time scaling, improving accuracy at a minor latency increase. Finally, we present a dynamic adapter-switching mechanism that activates reasoning only when needed and a KV-cache sharing strategy during prompt encoding, reducing time-to-first-token for on-device inference. Experiments on Qwen2.5-7B demonstrate that our method achieves efficient, accurate reasoning under strict resource constraints, making LLM reasoning practical for mobile scenarios. Videos demonstrating our solution running on mobile devices are available on our project page.
摘要：具有思想链推理的大型语言模型 (LLM) 在复杂的问题解决任务中实现了最先进的性能，但其冗长的推理轨迹和大量的上下文要求使得它们对于边缘部署来说不切实际。这些挑战包括高令牌生成成本、大的 KV 缓存占用空间以及将推理功能提炼为移动设备的较小模型时的低效率。现有的方法通常依赖于将推理轨迹从较大的模型提取到较小的模型中，这既冗长又风格冗余，不利于设备上的推理。在这项工作中，我们提出了一种轻量级方法，使用 LoRA 适配器与监督微调相结合，在小型 LLM 中进行推理。我们通过对这些适配器的强化学习进一步引入预算强制，显着缩短响应长度，同时将准确性损失降至最低。为了解决内存限制解码问题，我们利用并行测试时间缩放，在轻微延迟增加的情况下提高准确性。最后，我们提出了一种动态适配器切换机制，仅在需要时才激活推理，并在提示编码期间提供 KV 缓存共享策略，从而减少设备上推理的首次令牌时间。在Qwen2.5-7B上的实验表明，我们的方法在严格的资源限制下实现了高效、准确的推理，使得LLM推理在移动场景下变得实用。我们的项目页面上提供了演示我们在移动设备上运行的解决方案的视频。
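Note: a LoRA adapter adds a trainable low-rank update to a frozen weight matrix, so the layer computes W x + (alpha / r) B A x. The sketch below shows that update for a single linear layer, together with an on/off flag as a stand-in for adapter switching. The flag is purely hypothetical: the abstract does not describe how the switching decision is made, and the class and attribute names are not from the paper.

```python
import numpy as np

class LoRALinear:
    """Frozen linear layer with an optional low-rank (LoRA) update."""

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight                       # frozen base weight
        self.A = rng.normal(scale=0.01, size=(rank, d_in))
        self.B = np.zeros((d_out, rank))           # zero init: adapter starts as a no-op
        self.scale = alpha / rank
        self.enabled = True                        # hypothetical switch for "reasoning mode"

    def __call__(self, x):
        y = x @ self.weight.T
        if self.enabled:
            # Low-rank path: project down with A, up with B, then scale.
            y = y + self.scale * (x @ self.A.T) @ self.B.T
        return y

rng = np.random.default_rng(1)
base_weight = rng.normal(size=(8, 16))
layer = LoRALinear(base_weight)
x = rng.normal(size=(2, 16))

base_out = x @ base_weight.T
init_out = layer(x)                 # equals base_out while B is zero

layer.B = rng.normal(size=layer.B.shape)   # pretend the adapter was trained
adapted_out = layer(x)

layer.enabled = False
switched_off_out = layer(x)         # back to the base model's output
```

Only A and B are trained, which amounts to rank * (d_in + d_out) parameters per layer instead of d_in * d_out, and disabling the adapter recovers the base model exactly.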

Title: SegviGen: Repurposing 3D Generative Model for Part Segmentation

Authors: Lin Li, Haoran Feng, Zehuan Huang, Haohua Chen, Wenbo Nie, Shaohua Hou, Keqing Fan, Pan Hu, Sheng Wang, Buyu Li, Lu Sheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.16869
Pdf URL: https://arxiv.org/pdf/2603.16869
Copy Paste: [[2603.16869]] SegviGen: Repurposing 3D Generative Model for Part Segmentation(https://arxiv.org/abs/2603.16869)
Keywords: generative
Abstract: We introduce SegviGen, a framework that repurposes native 3D generative models for 3D part segmentation. Existing pipelines either lift strong 2D priors into 3D via distillation or multi-view mask aggregation, often suffering from cross-view inconsistency and blurred boundaries, or explore native 3D discriminative segmentation, which typically requires large-scale annotated 3D data and substantial training resources. In contrast, SegviGen leverages the structured priors encoded in a pretrained 3D generative model to induce segmentation through distinctive part colorization, establishing a novel and efficient framework for part segmentation. Specifically, SegviGen encodes a 3D asset and predicts part-indicative colors on active voxels of a geometry-aligned reconstruction. It supports interactive part segmentation, full segmentation, and full segmentation with 2D guidance in a unified framework. Extensive experiments show that SegviGen improves over the prior state of the art by 40% on interactive part segmentation and by 15% on full segmentation, while using only 0.32% of the labeled training data. It demonstrates that pretrained 3D generative priors transfer effectively to 3D part segmentation, enabling strong performance with limited supervision. See our project page at this https URL.
摘要：我们引入了 SegviGen，这是一个重新利用原生 3D 生成模型进行 3D 零件分割的框架。现有的流程要么通过蒸馏或多视图掩模聚合将强大的 2D 先验提升为 3D，通常会遇到跨视图不一致和模糊边界的问题，要么探索本机 3D 判别性分割，这通常需要大规模带注释的 3D 数据和大量培训资源。相比之下，SegviGen 利用预训练 3D 生成模型中编码的结构化先验，通过独特的部分着色来诱导分割，为部分分割建立新颖且有效的框架。具体来说，SegviGen 对 3D 资产进行编码，并预测几何对齐重建的活动体素上的部分指示颜色。它支持统一框架中的交互式零件分割、全分割和带 2D 引导的全分割。大量实验表明，SegviGen 在交互部分分割方面比现有技术提高了 40%，在完全分割方面提高了 15%，同时仅使用 0.32% 的标记训练数据。它表明预训练的 3D 生成先验可以有效地转移到 3D 零件分割，从而在有限的监督下实现强大的性能。请通过此 https URL 查看我们的项目页面。

Title: Demystifing Video Reasoning

Authors: Ruisi Wang, Zhongang Cai, Fanyi Pu, Junxiang Xu, Wanqi Yin, Maijunxian Wang, Ran Ji, Chenyang Gu, Bo Li, Ziqi Huang, Hokin Deng, Dahua Lin, Ziwei Liu, Lei Yang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.16870
Pdf URL: https://arxiv.org/pdf/2603.16870
Copy Paste: [[2603.16870]] Demystifing Video Reasoning(https://arxiv.org/abs/2603.16870)
Keywords: generation
Abstract: Recent advances in video generation have revealed an unexpected phenomenon: diffusion-based video models exhibit non-trivial reasoning capabilities. Prior work attributes this to a Chain-of-Frames (CoF) mechanism, where reasoning is assumed to unfold sequentially across video frames. In this work, we challenge this assumption and uncover a fundamentally different mechanism. We show that reasoning in video models instead primarily emerges along the diffusion denoising steps. Through qualitative analysis and targeted probing experiments, we find that models explore multiple candidate solutions in early denoising steps and progressively converge to a final answer, a process we term Chain-of-Steps (CoS). Beyond this core mechanism, we identify several emergent reasoning behaviors critical to model performance: (1) working memory, enabling persistent reference; (2) self-correction and enhancement, allowing recovery from incorrect intermediate solutions; and (3) perception before action, where early steps establish semantic grounding and later steps perform structured manipulation. During a diffusion step, we further uncover self-evolved functional specialization within Diffusion Transformers, where early layers encode dense perceptual structure, middle layers execute reasoning, and later layers consolidate latent representations. Motivated by these insights, we present a simple training-free strategy as a proof-of-concept, demonstrating how reasoning can be improved by ensembling latent trajectories from identical models with different random seeds. Overall, our work provides a systematic understanding of how reasoning emerges in video generation models, offering a foundation to guide future research in better exploiting the inherent reasoning dynamics of video models as a new substrate for intelligence.
摘要：视频生成的最新进展揭示了一个意想不到的现象：基于扩散的视频模型表现出非凡的推理能力。先前的工作将此归因于帧链（CoF）机制，其中假设推理在视频帧中按顺序展开。在这项工作中，我们挑战了这一假设并揭示了一种根本不同的机制。我们表明，视频模型中的推理主要是沿着扩散去噪步骤出现的。通过定性分析和有针对性的探测实验，我们发现模型在早期去噪步骤中探索多个候选解决方案，并逐渐收敛到最终答案，我们将这个过程称为步骤链（CoS）。除了这个核心机制之外，我们还确定了几种对模型性能至关重要的紧急推理行为：（1）工作记忆，实现持久引用； (2)自我修正和增强，允许从错误的中间解决方案中恢复； (3) 行动之前的感知，其中早期步骤建立语义基础，后续步骤执行结构化操作。在扩散步骤中，我们进一步揭示了扩散变压器中自我进化的功能专业化，其中早期层编码密集的感知结构，中间层执行推理，后面的层巩固潜在表示。受这些见解的启发，我们提出了一种简单的免训练策略作为概念验证，展示了如何通过将具有不同随机种子的相同模型的潜在轨迹集成来改进推理。总体而言，我们的工作提供了对视频生成模型中推理如何出现的系统理解，为指导未来研究更好地利用视频模型的固有推理动态作为智能的新基础奠定了基础。

Title: WorldCam: Interactive Autoregressive 3D Gaming Worlds with Camera Pose as a Unifying Geometric Representation

Authors: Jisu Nam, Yicong Hong, Chun-Hao Paul Huang, Feng Liu, JoungBin Lee, Jiyoung Kim, Siyoon Jin, Yunsung Lee, Jaeyoon Jung, Suhwan Choi, Seungryong Kim, Yang Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.16871
Pdf URL: https://arxiv.org/pdf/2603.16871
Copy Paste: [[2603.16871]] WorldCam: Interactive Autoregressive 3D Gaming Worlds with Camera Pose as a Unifying Geometric Representation(https://arxiv.org/abs/2603.16871)
Keywords: generative
Abstract: Recent advances in video diffusion transformers have enabled interactive gaming world models that allow users to explore generated environments over extended horizons. However, existing approaches struggle with precise action control and long-horizon 3D consistency. Most prior works treat user actions as abstract conditioning signals, overlooking the fundamental geometric coupling between actions and the 3D world, whereby actions induce relative camera motions that accumulate into a global camera pose within a 3D world. In this paper, we establish camera pose as a unifying geometric representation to jointly ground immediate action control and long-term 3D consistency. First, we define a physics-based continuous action space and represent user inputs in the Lie algebra to derive precise 6-DoF camera poses, which are injected into the generative model via a camera embedder to ensure accurate action alignment. Second, we use global camera poses as spatial indices to retrieve relevant past observations, enabling geometrically consistent revisiting of locations during long-horizon navigation. To support this research, we introduce a large-scale dataset comprising 3,000 minutes of authentic human gameplay annotated with camera trajectories and textual descriptions. Extensive experiments show that our approach substantially outperforms state-of-the-art interactive gaming world models in action controllability, long-horizon visual quality, and 3D spatial consistency.
摘要：视频扩散变压器的最新进展使得交互式游戏世界模型成为可能，使用户能够在更广阔的视野中探索生成的环境。然而，现有方法难以实现精确的动作控制和长视野 3D 一致性。大多数先前的工作将用户动作视为抽象条件信号，忽略了动作与 3D 世界之间的基本几何耦合，从而动作引起相对相机运动，这些运动累积成 3D 世界中的全局相机姿势。在本文中，我们将相机位姿建立为统一的几何表示，以共同实现即时动作控制和长期 3D 一致性。首先，我们定义一个基于物理的连续动作空间，并在李代数中表示用户输入，以导出精确的 6-DoF 相机姿势，这些姿势通过相机嵌入器注入到生成模型中，以确保准确的动作对齐。其次，我们使用全局相机姿态作为空间索引来检索相关的过去观测结果，从而在长视距导航期间实现几何一致的位置重访。为了支持这项研究，我们引入了一个大规模数据集，其中包含 3,000 分钟的真实人类游戏玩法，并附有摄像机轨迹和文本描述。大量实验表明，我们的方法在动作可控性、长视野视觉质量和 3D 空间一致性方面远远优于最先进的交互式游戏世界模型。
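Note: representing user inputs in the Lie algebra and accumulating them into a global camera pose can be sketched with the standard exponential map from se(3) to SE(3). The code below assumes a twist ordered as (translation, rotation) and composes relative motions by right-multiplication in the camera frame. These conventions, and all function names, are assumptions for illustration rather than details from the paper.

```python
import numpy as np

def hat(w):
    """Skew-symmetric matrix of a 3-vector."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def se3_exp(twist):
    """Exponential map: 6-vector (v, w) in se(3) to a 4x4 pose in SE(3)."""
    v = np.asarray(twist[:3], dtype=float)
    w = np.asarray(twist[3:], dtype=float)
    theta = np.linalg.norm(w)
    W = hat(w)
    if theta < 1e-8:
        # First-order approximation for very small rotations.
        R = np.eye(3) + W
        V = np.eye(3) + 0.5 * W
    else:
        a = np.sin(theta) / theta
        b = (1.0 - np.cos(theta)) / theta**2
        c = (theta - np.sin(theta)) / theta**3
        R = np.eye(3) + a * W + b * (W @ W)   # Rodrigues' formula
        V = np.eye(3) + b * W + c * (W @ W)
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = V @ v
    return T

def accumulate_pose(twists):
    """Compose per-step relative motions into one global camera pose."""
    pose = np.eye(4)
    for xi in twists:
        pose = pose @ se3_exp(xi)
    return pose

# Two quarter turns about the z-axis compose into a half turn.
quarter_turn = [0.0, 0.0, 0.0, 0.0, 0.0, np.pi / 2]
half_turn_pose = accumulate_pose([quarter_turn, quarter_turn])

# A pure translation along x moves the camera by one unit.
step_pose = se3_exp([1.0, 0.0, 0.0, 0.0, 0.0, 0.0])
```

The accumulated pose is what the abstract uses as a spatial index: two frames whose global poses are close can be treated as views of the same location, regardless of how many actions separate them.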