2026-03-31

Title: Language-Conditioned World Modeling for Visual Navigation

Authors: Yifei Dong, Fengyi Wu, Yilong Dai, Lingdong Kong, Guangyu Chen, Xu Zhu, Qiyu Hu, Tianyu Wang, Johnalbert Garnica, Feng Liu, Siyu Huang, Qi Dai, Zhi-Qi Cheng
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2603.26741
Pdf URL: https://arxiv.org/pdf/2603.26741
Copy Paste: [[2603.26741]] Language-Conditioned World Modeling for Visual Navigation(https://arxiv.org/abs/2603.26741)
Keywords: generation
Abstract: We study language-conditioned visual navigation (LCVN), in which an embodied agent is asked to follow a natural language instruction based only on an initial egocentric observation. Without access to goal images, the agent must rely on language to shape its perception and continuous control, making the grounding problem particularly challenging. We formulate this problem as open-loop trajectory prediction conditioned on linguistic instructions and introduce the LCVN Dataset, a benchmark of 39,016 trajectories and 117,048 human-verified instructions that supports reproducible research across a range of environments and instruction styles. Using this dataset, we develop LCVN frameworks that link language grounding, future-state prediction, and action generation through two complementary model families. The first family combines LCVN-WM, a diffusion-based world model, with LCVN-AC, an actor-critic agent trained in the latent space of the world model. The second family, LCVN-Uni, adopts an autoregressive multimodal architecture that predicts both actions and future observations. Experiments show that these families offer different advantages: the former provides more temporally coherent rollouts, whereas the latter generalizes better to unseen environments. Taken together, these observations point to the value of jointly studying language grounding, imagination, and policy learning in a unified task setting, and LCVN provides a concrete basis for further investigation of language-conditioned world models. The code is available at this https URL.

Title: From Diffusion To Flow: Efficient Motion Generation In MotionGPT3

Authors: Jaymin Ban, JiHong Jeon, SangYeop Jeong
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2603.26747
Pdf URL: https://arxiv.org/pdf/2603.26747
Copy Paste: [[2603.26747]] From Diffusion To Flow: Efficient Motion Generation In MotionGPT3(https://arxiv.org/abs/2603.26747)
Keywords: generation, generative
Abstract: Recent text-driven motion generation methods span both discrete token-based approaches and continuous-latent formulations. MotionGPT3 exemplifies the latter paradigm, combining a learned continuous motion latent space with a diffusion-based prior for text-conditioned synthesis. While rectified flow objectives have recently demonstrated favorable convergence and inference-time properties relative to diffusion in image and audio generation, it remains unclear whether these advantages transfer cleanly to the motion generation setting. In this work, we conduct a controlled empirical study comparing diffusion and rectified flow objectives within the MotionGPT3 framework. By holding the model architecture, training protocol, and evaluation setup fixed, we isolate the effect of the generative objective on training dynamics, final performance, and inference efficiency. Experiments on the HumanML3D dataset show that rectified flow converges in fewer training epochs, reaches strong test performance earlier, and matches or exceeds diffusion-based motion quality under identical conditions. Moreover, flow-based priors exhibit stable behavior across a wide range of inference step counts and achieve competitive quality with fewer sampling steps, yielding improved efficiency-quality trade-offs. Overall, our results suggest that several known benefits of rectified flow objectives do extend to continuous-latent text-to-motion generation, highlighting the importance of the training objective choice in motion priors.
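Note: As a rough illustration of why a rectified flow objective can tolerate few sampling steps, the sketch below builds the usual straight-line interpolation between a noise sample and a data latent, regresses a velocity toward their difference, and integrates with fixed-step Euler. This is a generic toy, not the MotionGPT3 code: the function names, the time convention (t=0 is noise, t=1 is data), and the three-dimensional "latent" are all assumptions made for illustration.

```python
def rectified_flow_pair(x0, x1, t):
    """Return the interpolated point x_t and the velocity regression target."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target_velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, target_velocity


def flow_matching_loss(predicted_velocity, target_velocity):
    """Mean squared error between predicted and target velocity."""
    n = len(target_velocity)
    return sum((p - q) ** 2 for p, q in zip(predicted_velocity, target_velocity)) / n


def euler_sample(velocity_fn, x0, num_steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical toy data: a noise sample and a "motion latent".
noise = [0.5, -1.0, 2.0]
latent = [1.0, 0.0, -1.0]


def ideal_velocity(x, t):
    """Stand-in for a perfectly trained network: the constant velocity x1 - x0."""
    return [b - a for a, b in zip(noise, latent)]


# With a constant velocity the path is a straight line, so one Euler step
# already reaches the data latent; more steps give the same endpoint.
one_step = euler_sample(ideal_velocity, noise, num_steps=1)
ten_step = euler_sample(ideal_velocity, noise, num_steps=10)
```

A trained network only approximates this constant velocity, so real few-step quality depends on how straight the learned paths are; the abstract reports that empirically, not this sketch.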

Title: Survey on Remote Sensing Scene Classification: From Traditional Methods to Large Generative AI Models

Authors: Qionghao Huang, Can Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.26751
Pdf URL: https://arxiv.org/pdf/2603.26751
Copy Paste: [[2603.26751]] Survey on Remote Sensing Scene Classification: From Traditional Methods to Large Generative AI Models(https://arxiv.org/abs/2603.26751)
Keywords: generation, generative
Abstract: Remote sensing scene classification has experienced a paradigmatic transformation from traditional handcrafted feature methods to sophisticated artificial intelligence systems that now form the backbone of modern Earth observation applications. This comprehensive survey examines the complete methodological evolution, systematically tracing development from classical texture descriptors and machine learning classifiers through the deep learning revolution to current state-of-the-art foundation models and generative AI approaches. We chronicle the pivotal shift from manual feature engineering to automated hierarchical representation learning via convolutional neural networks, followed by advanced architectures including Vision Transformers, graph neural networks, and hybrid frameworks. The survey provides in-depth coverage of breakthrough developments in self-supervised foundation models and vision-language systems, highlighting exceptional performance in zero-shot and few-shot learning scenarios. Special emphasis is placed on generative AI innovations that tackle persistent challenges through synthetic data generation and advanced feature learning strategies. We analyze contemporary obstacles including annotation costs, multimodal data fusion complexities, interpretability demands, and ethical considerations, alongside current trends in edge computing deployment, federated learning frameworks, and sustainable AI practices. Based on comprehensive analysis of recent advances and gaps, we identify key future research priorities: advancing hyperspectral and multi-temporal analysis capabilities, developing robust cross-domain generalization methods, and establishing standardized evaluation protocols to accelerate scientific progress in remote sensing scene classification systems.

Title: Generating Synthetic Wildlife Health Data from Camera Trap Imagery: A Pipeline for Alopecia and Body Condition Training Data

Authors: David Brundage
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.26754
Pdf URL: https://arxiv.org/pdf/2603.26754
Copy Paste: [[2603.26754]] Generating Synthetic Wildlife Health Data from Camera Trap Imagery: A Pipeline for Alopecia and Body Condition Training Data(https://arxiv.org/abs/2603.26754)
Keywords: generative
Abstract: No publicly available, ML-ready datasets exist for wildlife health conditions in camera trap imagery, creating a fundamental barrier to automated health screening. We present a pipeline for generating synthetic training images depicting alopecia and body condition deterioration in wildlife from real camera trap photographs. Our pipeline constructs a curated base image set from iWildCam using MegaDetector-derived bounding boxes and center-frame-weighted stratified sampling across 8 North American species. A generative phenotype editing system produces controlled severity variants depicting hair loss consistent with mange and emaciation. An adaptive scene-drift quality control system uses a sham prefilter and a decoupled mask-then-score approach with complementary day or night metrics to reject images where the generative model altered the original scene. We frame the pipeline explicitly as a screening data source. From 201 base images across 4 species, we generate 553 QC-passing synthetic variants with an overall pass rate of 83%. A sim-to-real transfer experiment, training exclusively on synthetic data and testing on real camera trap images of suspected health conditions, achieves 0.85 AUROC, demonstrating that the synthetic data captures visual features sufficient for screening.
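Note: For readers unfamiliar with the AUROC figure quoted above, a minimal sketch of how it can be computed follows. It uses the pairwise-ranking definition (the fraction of positive/negative pairs where the positive example scores higher, with ties counted as half). The labels and scores are made-up toy values, not data from the paper, and the function name is hypothetical.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(positives) * len(negatives))


# Toy example: 1 = suspected health condition, 0 = healthy (hypothetical values).
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auroc = auroc(toy_labels, toy_scores)  # 3 of 4 pairs ranked correctly -> 0.75
```

An AUROC of 0.85 therefore means that a randomly chosen positive image outranks a randomly chosen negative one about 85% of the time.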

Title: Physics-Aware Diffusion for LiDAR Point Cloud Densification

Authors: Zeping Zhang, Robert Laganière
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.26759
Pdf URL: https://arxiv.org/pdf/2603.26759
Copy Paste: [[2603.26759]] Physics-Aware Diffusion for LiDAR Point Cloud Densification(https://arxiv.org/abs/2603.26759)
Keywords: generation
Abstract: LiDAR perception is severely limited by the distance-dependent sparsity of distant objects. While diffusion models can recover dense geometry, they suffer from prohibitive latency and physical hallucinations manifesting as ghost points. We propose Scanline-Consistent Range-Aware Diffusion, a framework that treats densification as probabilistic refinement rather than generation. By leveraging Partial Diffusion (SDEdit) on a coarse prior, we achieve high-fidelity results in just 156ms. Our novel Ray-Consistency loss and Negative Ray Augmentation enforce sensor physics to suppress artifacts. Our method achieves state-of-the-art results on KITTI-360 and nuScenes, directly boosting off-the-shelf 3D detectors without retraining. Code will be made available.
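Note: The latency benefit of partial diffusion (SDEdit) comes from starting the reverse process at an intermediate noise level instead of pure noise. The sketch below is a toy under stated assumptions, not the paper's method: the linear beta schedule uses common DDPM default values, the deterministic DDIM-style update is one possible sampler, `start_step=300` is an arbitrary choice, and `oracle_eps` is a hypothetical stand-in for a trained noise-prediction network that happens to know the clean target.

```python
import math
import random


def alpha_bar_schedule(num_steps):
    """Cumulative signal fraction for t = 0..num_steps (index 0 means no noise)."""
    betas = [1e-4 + (0.02 - 1e-4) * i / (num_steps - 1) for i in range(num_steps)]
    alpha_bars, prod = [1.0], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars


def partial_diffusion(coarse, eps_model, alpha_bars, start_step, rng):
    """Noise `coarse` to level `start_step`, then run only `start_step` reverse steps."""
    ab = alpha_bars[start_step]
    x = [math.sqrt(ab) * c + math.sqrt(1.0 - ab) * rng.gauss(0.0, 1.0) for c in coarse]
    calls = 0
    for t in range(start_step, 0, -1):
        ab_t, ab_prev = alpha_bars[t], alpha_bars[t - 1]
        eps = eps_model(x, t)
        calls += 1
        # Deterministic DDIM-style step: estimate the clean signal, then re-noise to t-1.
        x0_pred = [(xi - math.sqrt(1.0 - ab_t) * e) / math.sqrt(ab_t) for xi, e in zip(x, eps)]
        x = [math.sqrt(ab_prev) * p + math.sqrt(1.0 - ab_prev) * e for p, e in zip(x0_pred, eps)]
    return x, calls


# Hypothetical toy "range values": a clean target and a coarse prior near it.
clean = [10.0, 10.2, 10.4, 10.6]
coarse = [10.1, 10.1, 10.5, 10.5]
alpha_bars = alpha_bar_schedule(1000)


def oracle_eps(x, t):
    """Toy noise predictor that knows the clean target (replaces a trained network)."""
    ab_t = alpha_bars[t]
    return [(xi - math.sqrt(ab_t) * c) / math.sqrt(1.0 - ab_t) for xi, c in zip(x, clean)]


refined, calls = partial_diffusion(coarse, oracle_eps, alpha_bars, 300, random.Random(0))
```

Here the network is called 300 times rather than 1000, which is the source of the speed-up; how many steps the paper actually uses to reach 156 ms is not stated in the abstract.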

Title: A training-free framework for high-fidelity appearance transfer via diffusion transformers

Authors: Shengrong Gu, Ye Wang, Song Wu, Rui Ma, Qian Wang, Lanjun Wang, Zili Yi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.26767
Pdf URL: https://arxiv.org/pdf/2603.26767
Copy Paste: [[2603.26767]] A training-free framework for high-fidelity appearance transfer via diffusion transformers(https://arxiv.org/abs/2603.26767)
Keywords: generation
Abstract: Diffusion Transformers (DiTs) excel at generation, but their global self-attention makes controllable, reference-image-based editing a distinct challenge. Unlike U-Nets, naively injecting local appearance into a DiT can disrupt its holistic scene structure. We address this by proposing the first training-free framework specifically designed to tame DiTs for high-fidelity appearance transfer. Our core is a synergistic system that disentangles structure and appearance. We leverage high-fidelity inversion to establish a rich content prior for the source image, capturing its lighting and micro-textures. A novel attention-sharing mechanism then dynamically fuses purified appearance features from a reference, guided by geometric priors. Our unified approach operates at 1024px and outperforms specialized methods on tasks ranging from semantic attribute transfer to fine-grained material application. Extensive experiments confirm our state-of-the-art performance in both structural preservation and appearance fidelity.

Title: Aesthetic Assessment of Chinese Handwritings Based on Vision Language Models

Authors: Chen Zheng, Yuxuan Lai, Haoyang Lu, Wentao Ma, Jitao Yang, Jian Wang
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2603.26768
Pdf URL: https://arxiv.org/pdf/2603.26768
Copy Paste: [[2603.26768]] Aesthetic Assessment of Chinese Handwritings Based on Vision Language Models(https://arxiv.org/abs/2603.26768)
Keywords: generation
Abstract: The handwriting of Chinese characters is a fundamental aspect of learning the Chinese language. Previous automated assessment methods often framed scoring as a regression problem. However, this score-only feedback lacks actionable guidance, which limits its effectiveness in helping learners improve their handwriting skills. In this paper, we leverage vision-language models (VLMs) to analyze the quality of handwritten Chinese characters and generate multi-level feedback. Specifically, we investigate two feedback generation tasks: simple grade feedback (Task 1) and enriched, descriptive feedback (Task 2). We explore both low-rank adaptation (LoRA)-based fine-tuning strategies and in-context learning methods to integrate aesthetic assessment knowledge into VLMs. Experimental results show that our approach achieves state-of-the-art performance across multiple evaluation tracks in the CCL 2025 workshop on evaluation of handwritten Chinese character quality.
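Note: As background on the LoRA fine-tuning mentioned above, the sketch below shows the standard low-rank update on a single linear layer: the frozen weight W is left untouched and a trainable product B·A, scaled by alpha/r, is added to its output. The tiny matrices, the `alpha` value, and the function names are illustrative assumptions; the abstract does not say which layers or ranks the authors adapt.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(x, W, A, B, alpha):
    """Compute W x + (alpha / r) * B (A x), where r is the adapter rank."""
    rank = len(A)                      # A has shape (r, d_in), B has shape (d_out, r)
    scale = alpha / rank
    base = matvec(W, x)                # frozen pretrained path
    update = matvec(B, matvec(A, x))   # trainable low-rank path
    return [b + scale * u for b, u in zip(base, update)]


# Hypothetical 2x2 layer with a rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]
B_zero = [[0.0], [0.0]]               # LoRA commonly initializes B to zero
B_trained = [[1.0], [2.0]]
x = [3.0, 4.0]

before = lora_forward(x, W, A, B_zero, alpha=2.0)     # identical to W x
after = lora_forward(x, W, A, B_trained, alpha=2.0)   # W x plus the scaled update
```

Only A and B are trained, so the number of updated parameters grows with the rank r rather than with the full size of W.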

Title: From Content to Audience: A Multimodal Annotation Framework for Broadcast Television Analytics

Authors: Paolo Cupini, Francesco Pierri
Subjects: cs.CV, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2603.26772
Pdf URL: https://arxiv.org/pdf/2603.26772
Copy Paste: [[2603.26772]] From Content to Audience: A Multimodal Annotation Framework for Broadcast Television Analytics(https://arxiv.org/abs/2603.26772)
Keywords: generation
Abstract: Automated semantic annotation of broadcast television content presents distinctive challenges, combining structured audiovisual composition, domain-specific editorial patterns, and strict operational constraints. While multimodal large language models (MLLMs) have demonstrated strong general-purpose video understanding capabilities, their comparative effectiveness across pipeline architectures and input configurations in broadcast-specific settings remains empirically undercharacterized. This paper presents a systematic evaluation of multimodal annotation pipelines applied to broadcast television news in the Italian setting. We construct a domain-specific benchmark of clips labeled across four semantic dimensions: visual environment classification, topic classification, sensitive content detection, and named entity recognition. Two different pipeline architectures are evaluated across nine frontier models, including Gemini 3.0 Pro, LLaMA 4 Maverick, Qwen-VL variants, and Gemma 3, under progressively enriched input strategies combining visual signals, automatic speech recognition, speaker diarization, and metadata. Experimental results demonstrate that gains from video input are strongly model-dependent: larger models effectively leverage temporal continuity, while smaller models show performance degradation under extended multimodal context, likely due to token overload. Beyond benchmarking, the selected pipeline is deployed on 14 full broadcast episodes, with minute-level annotations integrated with normalized audience measurement data provided by an Italian media company. This integration enables correlational analysis of topic-level audience sensitivity and generational engagement divergence, demonstrating the operational viability of the proposed framework for content-based audience analytics.

Title: Elucidating the Design Space of Flow Matching for Cellular Microscopy

Authors: Charles Jones, Emmanuel Noutahi, Jason Hartford, Cian Eastwood
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.26790
Pdf URL: https://arxiv.org/pdf/2603.26790
Copy Paste: [[2603.26790]] Elucidating the Design Space of Flow Matching for Cellular Microscopy(https://arxiv.org/abs/2603.26790)
Keywords: generative
Abstract: Flow-matching generative models are increasingly used to simulate cell responses to biological perturbations. However, the design space for building such models is large and underexplored. We systematically analyse the design space of flow matching models for cell-microscopy images, finding that many popular techniques are unnecessary and can even hurt performance. We develop a simple, stable, and scalable recipe which we use to train our foundation model. We scale our model to two orders of magnitude larger than prior methods, achieving a two-fold FID and ten-fold KID improvement over prior methods. We then fine-tune our model with pre-trained molecular embeddings to achieve state-of-the-art performance simulating responses to unseen molecules. Code is available at this https URL

Title: MemGuard-Alpha: Detecting and Filtering Memorization-Contaminated Signals in LLM-Based Financial Forecasting via Membership Inference and Cross-Model Disagreement

Authors: Anisha Roy, Dip Roy
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.26797
Pdf URL: https://arxiv.org/pdf/2603.26797
Copy Paste: [[2603.26797]] MemGuard-Alpha: Detecting and Filtering Memorization-Contaminated Signals in LLM-Based Financial Forecasting via Membership Inference and Cross-Model Disagreement(https://arxiv.org/abs/2603.26797)
Keywords: generation
Abstract: Large language models (LLMs) are increasingly used to generate financial alpha signals, yet growing evidence shows that LLMs memorize historical financial data from their training corpora, producing spurious predictive accuracy that collapses out-of-sample. This memorization-induced look-ahead bias threatens the validity of LLM-based quantitative strategies. Prior remedies -- model retraining and input anonymization -- are either prohibitively expensive or introduce significant information loss. No existing method offers practical, zero-cost signal-level filtering for real-time trading. We introduce MemGuard-Alpha, a post-generation framework comprising two algorithms: (i) the MemGuard Composite Score (MCS), which combines five membership inference attack (MIA) methods with temporal proximity features via logistic regression, achieving Cohen's d = 18.57 for contamination separation (d = 0.39-1.37 using MIA features alone); and (ii) Cross-Model Memorization Disagreement (CMMD), which exploits variation in training cutoff dates across LLMs to separate memorized signals from genuine reasoning. Evaluated across seven LLMs (124M-7B parameters), 50 S&P 100 stocks, 42,800 prompts, and five MIA methods over 5.5 years (2019-2024), CMMD achieves a Sharpe ratio of 4.11 versus 2.76 for unfiltered signals (49% improvement). Clean signals produce 14.48 bps average daily return versus 2.13 bps for tainted signals (7x difference). A striking crossover pattern emerges: in-sample accuracy rises with contamination (40.8% to 52.5%) while out-of-sample accuracy falls (47% to 42%), providing direct evidence that memorization inflates apparent accuracy at the cost of generalization.
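Note: The headline ratios in this abstract can be reproduced from the quoted figures, and the sketch below does so. It also includes a generic pooled-standard-deviation Cohen's d for reference. The helper name and the toy samples are assumptions; the abstract does not state which variant of Cohen's d the authors compute.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Cohen's d with a pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)   # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd


# Figures quoted in the abstract.
sharpe_filtered, sharpe_unfiltered = 4.11, 2.76
clean_bps, tainted_bps = 14.48, 2.13

sharpe_gain_pct = (sharpe_filtered / sharpe_unfiltered - 1.0) * 100.0  # about 48.9, reported as 49%
return_ratio = clean_bps / tainted_bps                                  # about 6.8, reported as 7x

# Toy samples (hypothetical) to show the effect-size formula in action.
toy_d = cohens_d([2.0, 4.0, 6.0], [1.0, 3.0, 5.0])  # means differ by 1, pooled SD is 2
```

For scale, a d of 18.57 means the two group means sit more than 18 pooled standard deviations apart, far beyond the 0.39 to 1.37 range reported for the MIA features alone.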

Title: Gaussian Joint Embeddings For Self-Supervised Representation Learning

Authors: Yongchao Huang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.26799
Pdf URL: https://arxiv.org/pdf/2603.26799
Copy Paste: [[2603.26799]] Gaussian Joint Embeddings For Self-Supervised Representation Learning(https://arxiv.org/abs/2603.26799)
Keywords: generative
Abstract: Self-supervised representation learning often relies on deterministic predictive architectures to align context and target views in latent space. While effective in many settings, such methods are limited in genuinely multi-modal inverse problems, where squared-loss prediction collapses towards conditional averages, and they frequently depend on architectural asymmetries to prevent representation collapse. In this work, we propose a probabilistic alternative based on generative joint modeling. We introduce Gaussian Joint Embeddings (GJE) and its multi-modal extension, Gaussian Mixture Joint Embeddings (GMJE), which model the joint density of context and target representations and replace black-box prediction with closed-form conditional inference under an explicit probabilistic model. This yields principled uncertainty estimates and a covariance-aware objective for controlling latent geometry. We further identify a failure mode of naive empirical batch optimization, which we term the Mahalanobis Trace Trap, and develop several remedies spanning parametric, adaptive, and non-parametric settings, including prototype-based GMJE, conditional Mixture Density Networks (GMJE-MDN), topology-adaptive Growing Neural Gas (GMJE-GNG), and a Sequential Monte Carlo (SMC) memory bank. In addition, we show that standard contrastive learning can be interpreted as a degenerate non-parametric limiting case of the GMJE framework. Experiments on synthetic multi-modal alignment tasks and vision benchmarks show that GMJE recovers complex conditional structure, learns competitive discriminative representations, and defines latent densities that are better suited to unconditional sampling than deterministic or unimodal baselines.
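Note: The "closed-form conditional inference" mentioned above rests on a textbook property of joint Gaussians: conditioning on one block gives another Gaussian whose mean and variance have explicit formulas. The sketch below shows the scalar case (one context dimension, one target dimension). It illustrates the general identity only; the variable names and numbers are assumptions, and the paper's embeddings are of course multivariate.

```python
def gaussian_conditional(mu_c, mu_t, var_c, var_t, cov_ct, c_observed):
    """Mean and variance of target t given context c under a joint 2-D Gaussian."""
    if var_c <= 0.0:
        raise ValueError("context variance must be positive")
    gain = cov_ct / var_c                        # regression coefficient of t on c
    cond_mean = mu_t + gain * (c_observed - mu_c)
    cond_var = var_t - gain * cov_ct             # uncertainty left after observing c
    return cond_mean, cond_var


# Hypothetical joint statistics of a (context, target) embedding pair.
cond_mean, cond_var = gaussian_conditional(
    mu_c=0.0, mu_t=1.0, var_c=2.0, var_t=3.0, cov_ct=1.0, c_observed=2.0
)
# gain = 0.5, so the conditional mean is 1.0 + 0.5 * 2.0 = 2.0
# and the conditional variance is 3.0 - 0.5 * 1.0 = 2.5.
```

The conditional variance is what supplies an uncertainty estimate alongside the prediction; a squared-loss predictor would return only the conditional mean.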

Title: Throughput Optimization as a Strategic Lever in Large-Scale AI Systems: Evidence from Dataloader and Memory Profiling Innovations

Authors: Mayank Jha
Subjects: cs.LG, cs.AI, cs.PF
Abstract URL: https://arxiv.org/abs/2603.26823
Pdf URL: https://arxiv.org/pdf/2603.26823
Copy Paste: [[2603.26823]] Throughput Optimization as a Strategic Lever in Large-Scale AI Systems: Evidence from Dataloader and Memory Profiling Innovations(https://arxiv.org/abs/2603.26823)
Keywords: generation
Abstract: The development of large-scale foundation models, particularly Large Language Models (LLMs), is constrained by significant computational and memory bottlenecks. These challenges elevate throughput optimization from a mere engineering task to a critical strategic lever, directly influencing training time, operational cost, and the feasible scale of next-generation models. This paper synthesizes evidence from recent academic and industry innovations to analyze key advancements in training efficiency. We examine architectural solutions to dataloader bottlenecks, such as the OVERLORD framework, which has demonstrated a 4.5% improvement in end-to-end training throughput. We investigate memory optimization techniques designed to overcome the GPU memory wall, including CPU offloading strategies like DeepSpeed's ZeRO-Offload, which enable the training of models far exceeding single-accelerator capacity. Furthermore, we explore the growing importance of compiler-centric optimizations, exemplified by Triton-distributed, which enables the joint optimization of computation, memory, and communication for substantial performance gains. The analysis is contextualized by advanced profiling tools and hardware characterization studies that identify and mitigate previously overlooked overheads like Dynamic Voltage and Frequency Scaling (DVFS). Findings indicate that a holistic, system-level approach, integrating innovations across data pipelines, memory management, network fabrics, and compiler technologies, is essential for accelerating AI development, managing costs, and pushing the boundaries of model scale.

Title: Central-to-Local Adaptive Generative Diffusion Framework for Improving Gene Expression Prediction in Data-Limited Spatial Transcriptomics

Authors: Yaoyu Fang, Jiahe Qian, Xinkun Wang, Lee A. Cooper, Bo Zhou
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2603.26827
Pdf URL: https://arxiv.org/pdf/2603.26827
Copy Paste: [[2603.26827]] Central-to-Local Adaptive Generative Diffusion Framework for Improving Gene Expression Prediction in Data-Limited Spatial Transcriptomics(https://arxiv.org/abs/2603.26827)
Keywords: generative
Abstract: Spatial Transcriptomics (ST) provides spatially resolved gene expression profiles within intact tissue architecture, enabling molecular analysis in histological context. However, the high cost, limited throughput, and restricted data sharing of ST experiments result in severe data scarcity, constraining the development of robust computational models. To address this limitation, we present a Central-to-Local adaptive generative diffusion framework for ST (C2L-ST) that integrates large-scale morphological priors with limited molecular guidance. A global central model is first pretrained on extensive histopathology datasets to learn transferable morphological representations, and institution-specific local models are then adapted through lightweight gene-conditioned modulation using a small number of paired image-gene spots. This strategy enables the synthesis of realistic and molecularly consistent histology patches under data-limited conditions. The generated images exhibit high visual and structural fidelity, reproduce cellular composition, and show strong embedding overlap with real data across multiple organs, reflecting both realism and diversity. When incorporated into downstream training, synthetic image-gene pairs improve gene expression prediction accuracy and spatial coherence, achieving performance comparable to real data while requiring only a fraction of sampled spots. C2L-ST provides a scalable and data-efficient framework for molecular-level data augmentation, offering a domain-adaptive and generalizable approach for integrating histology and transcriptomics in spatial biology and related fields.

Title: Envisioning global urban development with satellite imagery and generative AI

Authors: Kailai Sun, Yuebing Liang, Mingyi He, Yunhan Zheng, Alok Prakash, Shenhao Wang, Jinhua Zhao, Alex "Sandy" Pentland
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.26831
Pdf URL: https://arxiv.org/pdf/2603.26831
Copy Paste: [[2603.26831]] Envisioning global urban development with satellite imagery and generative AI(https://arxiv.org/abs/2603.26831)
Keywords: generative
Abstract: Urban development has been a defining force in human history, shaping cities for centuries. However, past studies mostly analyze such development as predictive tasks, failing to reflect its generative nature. Therefore, this study designs a multimodal generative AI framework to envision sustainable urban development at a global scale. By integrating prompts and geospatial controls, our framework can generate high-fidelity, diverse, and realistic urban satellite imagery across the 500 largest metropolitan areas worldwide. It enables users to specify urban development goals, creating new images that align with them while offering diverse scenarios whose appearance can be controlled with text prompts and geospatial constraints. It also facilitates urban redevelopment practices by learning from the surrounding environment. Beyond visual synthesis, we find that it encodes and interprets latent representations of urban form for global cross-city learning, successfully transferring styles of urban environments across a global spatial network. The latent representations can also enhance downstream prediction tasks such as carbon emission prediction. Further, human expert evaluation confirms that our generated urban images are comparable to real urban images. Overall, this study presents innovative approaches for accelerated urban planning and supports scenario-based planning processes for worldwide cities.

Title: Beyond Textual Knowledge-Leveraging Multimodal Knowledge Bases for Enhancing Vision-and-Language Navigation

Authors: Dongsheng Yang, Yinfeng Yu, Liejun Wang
Subjects: cs.CV, cs.AI, eess.IV
Abstract URL: https://arxiv.org/abs/2603.26859
Pdf URL: https://arxiv.org/pdf/2603.26859
Copy Paste: [[2603.26859]] Beyond Textual Knowledge-Leveraging Multimodal Knowledge Bases for Enhancing Vision-and-Language Navigation(https://arxiv.org/abs/2603.26859)
Keywords: generative
Abstract: Vision-and-Language Navigation (VLN) requires an agent to navigate through complex unseen environments based on natural language instructions. However, existing methods often struggle to effectively capture key semantic cues and accurately align them with visual observations. To address this limitation, we propose Beyond Textual Knowledge (BTK), a VLN framework that synergistically integrates environment-specific textual knowledge with generative image knowledge bases. BTK employs Qwen3-4B to extract goal-related phrases and utilizes Flux-Schnell to construct two large-scale image knowledge bases: R2R-GP and REVERIE-GP. Additionally, we leverage BLIP-2 to construct a large-scale textual knowledge base derived from panoramic views, providing environment-specific semantic cues. These multimodal knowledge bases are effectively integrated via the Goal-Aware Augmentor and Knowledge Augmentor, significantly enhancing semantic grounding and cross-modal alignment. Extensive experiments on the R2R dataset with 7,189 trajectories and the REVERIE dataset with 21,702 instructions demonstrate that BTK significantly outperforms existing baselines. On the test unseen splits of R2R and REVERIE, SR increased by 5% and 2.07% respectively, and SPL increased by 4% and 3.69% respectively. The source code is available at this https URL.

Title: LACON: Training Text-to-Image Model from Uncurated Data

Authors: Zhiyang Liang, Ziyu Wan, Hongyu Liu, Dong Chen, Qiu Shen, Hao Zhu, Dongdong Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.26866
Pdf URL: https://arxiv.org/pdf/2603.26866
Copy Paste: [[2603.26866]] LACON: Training Text-to-Image Model from Uncurated Data(https://arxiv.org/abs/2603.26866)
Keywords: generation, generative
Abstract: The success of modern text-to-image generation is largely attributed to massive, high-quality datasets. Currently, these datasets are curated through a filter-first paradigm that aggressively discards low-quality raw data on the assumption that it is detrimental to model performance. Is the discarded bad data truly useless, or does it hold untapped potential? In this work, we critically re-examine this question. We propose LACON (Labeling-and-Conditioning), a novel training framework that exploits the underlying uncurated data distribution. Instead of filtering, LACON repurposes quality signals, such as aesthetic scores and watermark probabilities, as explicit, quantitative condition labels. The generative model is then trained to learn the full spectrum of data quality, from bad to good. By learning the explicit boundary between high- and low-quality content, LACON achieves superior generation quality compared to baselines trained only on filtered data using the same compute budget, demonstrating the significant value of uncurated data.

Title: Property-Guided Molecular Generation and Optimization via Latent Flows

Authors: Alexander Arjun Lobo, Urvi Awasthi, Leonid Zhukov
Subjects: cs.LG, cond-mat.mtrl-sci
Abstract URL: https://arxiv.org/abs/2603.26889
Pdf URL: https://arxiv.org/pdf/2603.26889
Copy Paste: [[2603.26889]] Property-Guided Molecular Generation and Optimization via Latent Flows(https://arxiv.org/abs/2603.26889)
Keywords: generation, generative
Abstract: Molecular discovery is increasingly framed as an inverse design problem: identifying molecular structures that satisfy desired property profiles under feasibility constraints. While recent generative models provide continuous latent representations of chemical space, targeted optimization within these representations often leads to degraded validity, loss of structural fidelity, or unstable behavior. We introduce MoltenFlow, a modular framework that combines property-organized latent representations with flow-matching generative priors and gradient-based guidance. This formulation supports both conditioned generation and local optimization within a single latent-space framework. We show that guided latent flows enable efficient multi-objective molecular optimization under fixed oracle budgets with controllable trade-offs, while a learned flow prior improves unconditional generation quality.

Title: Strategic Candidacy in Generative AI Arenas

Authors: Chris Hays, Rachel Li, Bailey Flanigan, Manish Raghavan
Subjects: cs.LG, cs.AI, cs.GT
Abstract URL: https://arxiv.org/abs/2603.26891
Pdf URL: https://arxiv.org/pdf/2603.26891
Copy Paste: [[2603.26891]] Strategic Candidacy in Generative AI Arenas(https://arxiv.org/abs/2603.26891)
Keywords: generative
Abstract: AI arenas, which rank generative models from pairwise preferences of users, are a popular method for measuring the relative performance of models in the course of their organic use. Because rankings are computed from noisy preferences, there is a concern that model producers can exploit this randomness by submitting many models (e.g., multiple variants of essentially the same model) and thereby artificially improve the rank of their top models. This can lead to degradations in the quality, and therefore the usefulness, of the ranking. In this paper, we begin by establishing, both theoretically and in simulations calibrated to data from the platform Arena (formerly LMArena, Chatbot Arena), conditions under which producers can benefit from submitting clones when their goal is to be ranked highly. We then propose a new mechanism for ranking models from pairwise comparisons, called You-Rank-We-Rank (YRWR). It requires that producers submit rankings over their own models and uses these rankings to correct statistical estimates of model quality. We prove that this mechanism is approximately clone-robust, in the sense that a producer cannot improve their rank much by doing anything other than submitting each of their unique models exactly once. Moreover, to the extent that model producers are able to correctly rank their own models, YRWR improves overall ranking accuracy. In further simulations, we show that indeed the mechanism is approximately clone-robust and quantify improvements to ranking accuracy, even under producer misranking.
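The cloning incentive described in the abstract above can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions that are not from the paper: each submission receives an independent Gaussian error on its estimated score (rather than an estimate fitted from pairwise comparisons), and the names `best_estimated_score` and `mean_top_estimate` are hypothetical.

```python
import random

def best_estimated_score(true_quality, n_clones, noise_sd, rng):
    """Highest noisy score estimate among n_clones identical submissions."""
    return max(true_quality + rng.gauss(0.0, noise_sd) for _ in range(n_clones))

def mean_top_estimate(n_clones, trials=2000, true_quality=0.0, noise_sd=1.0, seed=0):
    """Average, over many simulated arenas, of the producer's best estimated score."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += best_estimated_score(true_quality, n_clones, noise_sd, rng)
    return total / trials

# Same underlying model quality in both cases; only the number of clones differs.
single = mean_top_estimate(n_clones=1)
cloned = mean_top_estimate(n_clones=10)
print(f"one submission: {single:.2f}, ten clones: {cloned:.2f}")
```

Because the producer's top entry is the maximum of several noisy estimates, its expected score rises with the number of clones even though the true quality is unchanged. This is the kind of effect a clone-robust mechanism is meant to suppress.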

Title: Leveraging Avatar Fingerprinting: A Multi-Generator Photorealistic Talking-Head Public Database and Benchmark

Authors: Laura Pedrouzo-Rodriguez, Luis F. Gomez, Ruben Tolosana, Ruben Vera-Rodriguez, Roberto Daza, Aythami Morales, Julian Fierrez
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.26934
Pdf URL: https://arxiv.org/pdf/2603.26934
Copy Paste: [[2603.26934]] Leveraging Avatar Fingerprinting: A Multi-Generator Photorealistic Talking-Head Public Database and Benchmark(https://arxiv.org/abs/2603.26934)
Keywords: generation
Abstract: Recent advances in photorealistic avatar generation have enabled highly realistic talking-head avatars, raising security concerns regarding identity impersonation in AI-mediated communication. To make progress on this challenging problem, the task of avatar fingerprinting aims to determine whether two avatar videos are driven by the same human operator. However, current public databases in the literature are scarce and based solely on outdated talking-head avatar generators, which do not represent realistic scenarios for the current task of avatar fingerprinting. To address this gap, the present article introduces AVAPrintDB, a new publicly available multi-generator talking-head avatar database for avatar fingerprinting. AVAPrintDB is constructed from two audiovisual corpora and three state-of-the-art avatar generators (GAGAvatar, LivePortrait, HunyuanPortrait), representing different synthesis paradigms, and includes both self- and cross-reenactments to simulate legitimate usage and impersonation scenarios. Building on this database, we also define a standardized and reproducible benchmark for avatar fingerprinting, considering public state-of-the-art avatar fingerprinting systems and exploring novel methods based on Foundation Models (DINOv2 and CLIP). We also conduct a comprehensive analysis under generator and dataset shift. Our results show that, while identity-related motion cues persist across synthetic avatars, current avatar fingerprinting systems remain highly sensitive to changes in the synthesis pipeline and source domain. The AVAPrintDB, benchmark protocols, and avatar fingerprinting systems are publicly available to facilitate reproducible research.

Title: High dimensional theory of two-phase optimizers

Authors: Atish Agarwala
Subjects: cs.LG, math.ST
Abstract URL: https://arxiv.org/abs/2603.26954
Pdf URL: https://arxiv.org/pdf/2603.26954
Copy Paste: [[2603.26954]] High dimensional theory of two-phase optimizers(https://arxiv.org/abs/2603.26954)
Keywords: generation
Abstract: The trend towards larger training setups has brought a renewed interest in partially asynchronous two-phase optimizers which optimize locally and then synchronize across workers. Additionally, recent work suggests that the one-worker version of one of these algorithms, DiLoCo, shows promising results as a (synchronous) optimizer. Motivated by these studies, we present an analysis of LA-DiLoCo, a simple member of the DiLoCo family, on a high-dimensional linear regression problem. We show that the one-worker variant, LA, provides a different tradeoff between signal and noise than SGD, which is beneficial in many scenarios. We also show that the multi-worker version generates more noise than the single-worker version, but that this additional noise generation can be ameliorated by appropriate choice of hyperparameters. We conclude with an analysis of SLA -- LA with momentum -- and show that stacking two momentum operators gives an opportunity for acceleration via a non-linear transformation of the "effective" Hessian spectrum, which is maximized for Nesterov momentum. Altogether our results show that two-phase optimizers represent a fruitful new paradigm for understanding and improving training algorithms.
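The "optimize locally, then synchronize" structure mentioned in the abstract above can be sketched for a single worker as follows. This is a generic two-phase loop on a noisy quadratic. Whether it matches the paper's LA update exactly is an assumption, and the function names, curvatures, and hyperparameters are illustrative only.

```python
import random

def noisy_grad(w, target, curvatures, noise_sd, rng):
    """Gradient of 0.5 * sum(h_i * (w_i - target_i)^2) plus Gaussian noise."""
    return [h * (wi - ti) + rng.gauss(0.0, noise_sd)
            for wi, ti, h in zip(w, target, curvatures)]

def two_phase_step(w_slow, target, curvatures, inner_steps, inner_lr, outer_lr, noise_sd, rng):
    # Phase 1: run several local SGD steps starting from the slow weights.
    w_fast = list(w_slow)
    for _ in range(inner_steps):
        g = noisy_grad(w_fast, target, curvatures, noise_sd, rng)
        w_fast = [wi - inner_lr * gi for wi, gi in zip(w_fast, g)]
    # Phase 2: move the slow weights part of the way toward the local result.
    return [ws + outer_lr * (wf - ws) for ws, wf in zip(w_slow, w_fast)]

def loss(w, target, curvatures):
    return 0.5 * sum(h * (wi - ti) ** 2 for wi, ti, h in zip(w, target, curvatures))

rng = random.Random(0)
target = [1.0, -2.0, 0.5]        # minimizer of the quadratic
curvatures = [1.0, 0.5, 0.1]     # diagonal Hessian entries
w = [0.0, 0.0, 0.0]
initial_loss = loss(w, target, curvatures)
for _ in range(200):
    w = two_phase_step(w, target, curvatures, inner_steps=5, inner_lr=0.1,
                       outer_lr=0.5, noise_sd=0.01, rng=rng)
final_loss = loss(w, target, curvatures)
print(f"loss: {initial_loss:.4f} -> {final_loss:.6f}")
```

In a multi-worker setting the second phase would average the local results across workers before the outer update. That step is omitted here to keep the sketch short.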

Title: Probabilistic Forecasting of Localized Wildfire Spread Based on Conditional Flow Matching

Authors: Bryan Shaddy, Haitong Qin, Brianna Binder, James Haley, Riya Duddalwar, Kyle Hilburn, Assad Oberai
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.26975
Pdf URL: https://arxiv.org/pdf/2603.26975
Copy Paste: [[2603.26975]] Probabilistic Forecasting of Localized Wildfire Spread Based on Conditional Flow Matching(https://arxiv.org/abs/2603.26975)
Keywords: generation
Abstract: This study presents a probabilistic surrogate model for localized wildfire spread based on a conditional flow matching algorithm. The approach models fire progression as a stochastic process by learning the conditional distribution of fire arrival times given the current fire state along with environmental and atmospheric inputs. Model inputs include current burned area, near-surface wind components, temperature, relative humidity, terrain height, and fuel category information, all defined on a high-resolution spatial grid. The outputs are samples of arrival time within a three-hour time window, conditioned on the input variables. Training data are generated from coupled atmosphere-wildfire spread simulations using WRF-SFIRE, paired with weather fields from the North American Mesoscale model. The proposed framework enables efficient generation of ensembles of arrival times and explicitly represents uncertainty arising from incomplete knowledge of the fire-atmosphere system and unresolved variables. The model supports localized prediction over subdomains, reducing computational cost relative to physics-based simulators while retaining sensitivity to key drivers of fire spread. Model performance is evaluated against WRF-SFIRE simulations for both single-step (3-hour) and recursive multi-step (24-hour) forecasts. Results demonstrate that the method captures variability in fire evolution and produces accurate ensemble predictions. The framework provides a scalable approach for probabilistic wildfire forecasting and offers a pathway for integrating machine learning models with operational fire prediction systems and data assimilation.
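For readers unfamiliar with conditional flow matching, the following is a minimal sketch of its two basic ingredients: the regression target on a straight noise-to-data path, and Euler integration of a velocity field to draw a sample. It is not the paper's model. The linear path is one common choice, and `toy_velocity` is a hand-written stand-in for a trained network, with `condition` playing the role of the environmental inputs.

```python
def cfm_training_pair(x0, x1, t):
    """Point on the straight path from noise x0 to data x1, and its target velocity."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity

def euler_sample(velocity_fn, x0, condition, steps=10):
    """Integrate dx/dt = velocity_fn(x, t, condition) from t=0 to t=1."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt, condition)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

def toy_velocity(x, t, condition):
    # Stand-in for a learned network: the exact velocity field when the
    # conditional target distribution is a point mass at `condition`.
    return [(c - xi) / (1.0 - t) for xi, c in zip(x, condition)]

x_t, v = cfm_training_pair(x0=[0.0, 0.0], x1=[2.0, -1.0], t=0.25)
sample = euler_sample(toy_velocity, x0=[0.3, -0.7], condition=[2.0, -1.0])
print(x_t, v, sample)
```

With a trained network in place of `toy_velocity`, repeating `euler_sample` from different noise draws `x0` would yield an ensemble of outputs for the same conditioning inputs.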

Title: Generative Shape Reconstruction with Geometry-Guided Langevin Dynamics

Authors: Linus Härenstam-Nielsen, Dmitrii Pozdeev, Thomas Dagès, Nikita Araslanov, Daniel Cremers
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.27016
Pdf URL: https://arxiv.org/pdf/2603.27016
Copy Paste: [[2603.27016]] Generative Shape Reconstruction with Geometry-Guided Langevin Dynamics(https://arxiv.org/abs/2603.27016)
Keywords: generative
Abstract: Reconstructing complete 3D shapes from incomplete or noisy observations is a fundamentally ill-posed problem that requires balancing measurement consistency with shape plausibility. Existing methods for shape reconstruction can achieve strong geometric fidelity in ideal conditions but fail under realistic conditions with incomplete measurements or noise. At the same time, recent generative models for 3D shapes can synthesize highly realistic and detailed shapes but fail to be consistent with observed measurements. In this work, we introduce GG-Langevin: Geometry-Guided Langevin dynamics, a probabilistic approach that unifies these complementary perspectives. By traversing the trajectories of Langevin dynamics induced by a diffusion model, while preserving measurement consistency at every step, we generatively reconstruct shapes that fit both the measurements and the data-informed prior. We demonstrate through extensive experiments that GG-Langevin achieves higher geometric accuracy and greater robustness to missing data than existing methods for surface reconstruction.
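The idea of combining a generative prior with measurement consistency inside Langevin dynamics can be sketched in one dimension. This is a generic guided Langevin update, not GG-Langevin itself: the standard-normal prior score stands in for a learned diffusion score, the Gaussian measurement model is assumed, and all names and parameters are hypothetical.

```python
import math
import random

def guided_langevin(y, sigma2, steps, step_size, guidance, temperature, rng, x_init=0.0):
    """Langevin iteration whose drift mixes a prior score with a measurement term."""
    x = x_init
    for _ in range(steps):
        prior_score = -x                  # score of a standard normal prior (stand-in)
        data_grad = (y - x) / sigma2      # gradient of the Gaussian measurement log-likelihood
        noise = rng.gauss(0.0, 1.0)
        drift = prior_score + guidance * data_grad
        x = x + step_size * drift + math.sqrt(2.0 * step_size * temperature) * noise
    return x

rng = random.Random(0)
# With temperature 0 the noise vanishes and the iteration settles at the posterior
# mode, which for this toy model is y / (1 + sigma2) = 1.0.
mode = guided_langevin(y=2.0, sigma2=1.0, steps=200, step_size=0.1,
                       guidance=1.0, temperature=0.0, rng=rng)
# With temperature 1 the iterate is a random draw that balances prior and measurement.
draw = guided_langevin(y=2.0, sigma2=1.0, steps=200, step_size=0.1,
                       guidance=1.0, temperature=1.0, rng=rng)
print(mode, draw)
```

The result lands between the prior mean (0) and the measurement (2), which reflects the balance between shape plausibility and measurement consistency that the abstract describes.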

Title: Unified Number-Free Text-to-Motion Generation Via Flow Matching

Authors: Guanhe Huang, Oya Celiktutan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.27040
Pdf URL: https://arxiv.org/pdf/2603.27040
Copy Paste: [[2603.27040]] Unified Number-Free Text-to-Motion Generation Via Flow Matching(https://arxiv.org/abs/2603.27040)
Keywords: generation, generative
Abstract: Generative models excel at motion synthesis for a fixed number of agents but struggle to generalize to a variable number of agents. Based on limited, domain-specific data, existing methods employ autoregressive models to generate motion recursively, which suffer from inefficiency and error accumulation. We propose Unified Motion Flow (UMF), which consists of Pyramid Motion Flow (P-Flow) and Semi-Noise Motion Flow (S-Flow). UMF decomposes the number-free motion generation into a single-pass motion prior generation stage and multi-pass reaction generation stages. Specifically, UMF utilizes a unified latent space to bridge the distribution gap between heterogeneous motion datasets, enabling effective unified training. For motion prior generation, P-Flow operates on hierarchical resolutions conditioned on different noise levels, thereby mitigating computational overheads. For reaction generation, S-Flow learns a joint probabilistic path that adaptively performs reaction transformation and context reconstruction, alleviating error accumulation. Extensive results and user studies demonstrate UMF's effectiveness as a generalist model for multi-person motion generation from text. Project page: this https URL.

Title: Unsupervised Behavioral Compression: Learning Low-Dimensional Policy Manifolds through State-Occupancy Matching

Authors: Andrea Fraschini, Davide Tenedini, Riccardo Zamboni, Mirco Mutti, Marcello Restelli
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.27044
Pdf URL: https://arxiv.org/pdf/2603.27044
Copy Paste: [[2603.27044]] Unsupervised Behavioral Compression: Learning Low-Dimensional Policy Manifolds through State-Occupancy Matching(https://arxiv.org/abs/2603.27044)
Keywords: generation, generative
Abstract: Deep Reinforcement Learning (DRL) is widely recognized as sample-inefficient, a limitation attributable in part to the high dimensionality and substantial functional redundancy inherent to the policy parameter space. A recent framework, which we refer to as Action-based Policy Compression (APC), mitigates this issue by compressing the parameter space $\Theta$ into a low-dimensional latent manifold $\mathcal Z$ using a learned generative mapping $g:\mathcal Z \to \Theta$. However, its performance is severely constrained by relying on immediate action-matching as a reconstruction loss, a myopic proxy for behavioral similarity that suffers from compounding errors across sequential decisions. To overcome this bottleneck, we introduce Occupancy-based Policy Compression (OPC), which enhances APC by shifting behavior representation from immediate action-matching to long-horizon state-space coverage. Specifically, we propose two principal improvements: (1) we curate the dataset generation with an information-theoretic uniqueness metric that delivers a diverse population of policies; and (2) we propose a fully differentiable compression objective that directly minimizes the divergence between the true and reconstructed mixture occupancy distributions. These modifications force the generative model to organize the latent space around true functional similarity, promoting a latent representation that generalizes over a broad spectrum of behaviors while retaining most of the original parameter space's expressivity. Finally, we empirically validate the advantages of our contributions across multiple continuous control benchmarks.

Title: SceneExpander: Expanding 3D Scenes with Free-Form Inserted Views

Authors: Zijian He, enjie Liu, Yihao Wang, Weizhi Zhong, Huan Yuan, Kun Gai, Guangrun Wang, Guanbin Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.27084
Pdf URL: https://arxiv.org/pdf/2603.27084
Copy Paste: [[2603.27084]] SceneExpander: Expanding 3D Scenes with Free-Form Inserted Views(https://arxiv.org/abs/2603.27084)
Keywords: generative
Abstract: World building with 3D scene representations is increasingly important for content creation, simulation, and interactive experiences, yet real workflows are inherently iterative: creators must repeatedly extend an existing scene under user control. Motivated by this research gap, we study 3D scene expansion in a user-centric workflow: starting from a real scene captured by multi-view images, we extend its coverage by inserting an additional view synthesized by a generative model. Unlike simple object editing or style transfer in a fixed scene, the inserted view is often 3D-misaligned with the original reconstruction, introducing geometry shifts, hallucinated content, or view-dependent artifacts that break global multi-view consistency. To address the challenge, we propose SceneExpander, which applies test-time adaptation to a parametric feed-forward 3D reconstruction model with two complementary distillation signals: anchor distillation stabilizes the original scene by distilling geometric cues from the captured views, while inserted-view self-distillation preserves observation-supported predictions yet adapts latent geometry and appearance to accommodate the misaligned inserted view. Experiments on ETH scenes and online data demonstrate improved expansion behavior and reconstruction quality under misalignment.

Title: Hierarchy-Guided Topology Latent Flow for Molecular Graph Generation

Authors: Urvi Awasthi, Alexander Arjun Lobo, Leonid Zhukov
Subjects: cs.LG, cond-mat.mtrl-sci, stat.ML
Abstract URL: https://arxiv.org/abs/2603.27113
Pdf URL: https://arxiv.org/pdf/2603.27113
Copy Paste: [[2603.27113]] Hierarchy-Guided Topology Latent Flow for Molecular Graph Generation(https://arxiv.org/abs/2603.27113)
Keywords: generation
Abstract: Generating chemically valid 3D molecules is hindered by discrete bond topology: small local bond errors can cause global failures (valence violations, disconnections, implausible rings), especially for drug-like molecules with long-range constraints. Many unconditional 3D generators emphasize coordinates and then infer bonds or rely on post-processing, leaving topology feasibility weakly controlled. We propose Hierarchy-Guided Latent Topology Flow (HLTF), a planner-executor model that generates bond graphs with 3D coordinates, using a latent multi-scale plan for global context and a constraint-aware sampler to suppress topology-driven failures. On QM9, HLTF achieves 98.8% atom stability and 92.9% valid-and-unique, improving PoseBusters validity to 94.0% (+0.9 over the strongest reported baseline). On GEOM-DRUGS, HLTF attains 85.5%/85.0% validity/valid-unique-novel without post-processing and 92.2%/91.2% after standardized relaxation, within 0.9 points of the best post-processed baseline. Explicit topology generation also reduces "false-valid" samples that pass RDKit sanitization but fail stricter checks.

Title: SJD-VP: Speculative Jacobi Decoding with Verification Prediction for Autoregressive Image Generation

Authors: Bingqi Shan, Baoquan Zhang, Xiaochen Qi, Xutao Li, Yunming Ye, Liqiang Nie
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.27115
Pdf URL: https://arxiv.org/pdf/2603.27115
Copy Paste: [[2603.27115]] SJD-VP: Speculative Jacobi Decoding with Verification Prediction for Autoregressive Image Generation(https://arxiv.org/abs/2603.27115)
Keywords: generation
Abstract: Speculative Jacobi Decoding (SJD) has emerged as a promising method for accelerating autoregressive image generation. Despite its potential, existing SJD approaches often suffer from the low acceptance rate issue of speculative tokens due to token selection ambiguity. Recent works attempt to mitigate this issue primarily from the relaxed token verification perspective but fail to fully exploit the iterative dynamics of decoding. In this paper, we conduct an in-depth analysis and make a novel observation that tokens whose probabilities increase are more likely to match the verification-accepted and correct token. Based on this, we propose a novel Speculative Jacobi Decoding with Verification Prediction (SJD-VP). The key idea is to leverage the change in token probabilities across iterations to guide sampling, favoring tokens whose probabilities increase. This effectively predicts which tokens are likely to pass subsequent verification, boosting the acceptance rate. In particular, our SJD-VP is plug-and-play and can be seamlessly integrated into existing SJD methods. Extensive experiments on standard benchmarks demonstrate that our SJD-VP method consistently accelerates autoregressive decoding while improving image generation quality.
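The key idea stated in the abstract above, favoring tokens whose probabilities increase across decoding iterations, could look roughly like the sketch below. The exponential reweighting rule and the `strength` parameter are assumptions made for illustration. The abstract does not specify the actual rule used by SJD-VP.

```python
import math
import random

def increase_biased_distribution(prev_probs, cur_probs, strength=5.0):
    """Reweight the current token distribution toward tokens whose probability rose."""
    weights = [p * math.exp(strength * max(0.0, p - q))
               for q, p in zip(prev_probs, cur_probs)]
    total = sum(weights)
    return [w / total for w in weights]

def sample_token(probs, rng):
    """Draw one token index according to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

prev_probs = [0.5, 0.3, 0.2]   # distribution at the previous Jacobi iteration
cur_probs = [0.4, 0.4, 0.2]    # distribution at the current iteration
biased = increase_biased_distribution(prev_probs, cur_probs)
token = sample_token(biased, random.Random(0))
print([round(p, 3) for p in biased], token)
```

Here token 1 is the only one whose probability rose between iterations, so it gains mass in the biased distribution while the others lose mass.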

Title: Spectral-Aware Text-to-Time Series Generation with Billion-Scale Multimodal Meteorological Data

Authors: Shijie Zhang
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2603.27135
Pdf URL: https://arxiv.org/pdf/2603.27135
Copy Paste: [[2603.27135]] Spectral-Aware Text-to-Time Series Generation with Billion-Scale Multimodal Meteorological Data(https://arxiv.org/abs/2603.27135)
Keywords: generation
Abstract: Text-to-time-series generation is particularly important in meteorology, where natural language offers intuitive control over complex, multi-scale atmospheric dynamics. Existing approaches are constrained by the lack of large-scale, physically grounded multimodal datasets and by architectures that overlook the spectral-temporal structure of weather signals. We address these challenges with a unified framework for text-guided meteorological time-series generation. First, we introduce MeteoCap-3B, a billion-scale weather dataset paired with expert-level captions constructed via a Multi-agent Collaborative Captioning (MACC) pipeline, yielding information-dense and physically consistent annotations. Building on this dataset, we propose MTransformer, a diffusion-based model that enables precise semantic control by mapping textual descriptions into multi-band spectral priors through a Spectral Prompt Generator, which guides generation via frequency-aware attention. Extensive experiments on real-world benchmarks demonstrate state-of-the-art generation quality, accurate cross-modal alignment, strong semantic controllability, and substantial gains in downstream forecasting under data-sparse and zero-shot settings. Additional results on general time-series benchmarks indicate that the proposed framework generalizes beyond meteorology.

Title: MotionRFT: Unified Reinforcement Fine-Tuning for Text-to-Motion Generation

Authors: Xiaofeng Tan, Wanjiang Weng, Hongsong Wang, Fang Zhao, Xin Geng, Liang Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.27185
Pdf URL: https://arxiv.org/pdf/2603.27185
Copy Paste: [[2603.27185]] MotionRFT: Unified Reinforcement Fine-Tuning for Text-to-Motion Generation(https://arxiv.org/abs/2603.27185)
Keywords: generation, generative
Abstract: Text-to-motion generation has advanced with diffusion- and flow-based generative models, yet supervised pretraining remains insufficient to align models with high-level objectives such as semantic consistency, realism, and human preference. Existing post-training methods have key limitations: they (1) target a specific motion representation, such as joints; (2) optimize a particular aspect, such as text-motion alignment, and may compromise other factors; and (3) incur substantial computational overhead, data dependence, and coarse-grained optimization. We present a reinforcement fine-tuning framework that comprises a heterogeneous-representation, multi-dimensional reward model, MotionReward, and an efficient, fine-grained fine-tuning method, EasyTune. To obtain a unified semantic representation, MotionReward maps heterogeneous motions into a shared semantic space anchored by text, enabling multidimensional reward learning; Self-refinement Preference Learning further enhances semantics without additional annotations. For efficient and effective fine-tuning, we identify the recursive gradient dependence across denoising steps as the key bottleneck, and propose EasyTune, which optimizes step-wise rather than over the full trajectory, yielding dense, fine-grained, and memory-efficient updates. Extensive experiments validate the effectiveness of our framework, achieving FID 0.132 at 22.10 GB peak memory for the MLD model and saving up to 15.22 GB over DRaFT. It reduces FID by 22.9% on joint-based ACMDM, and achieves a 12.6% R-Precision gain and 23.3% FID improvement on rotation-based HY Motion. Our project page with code is publicly available.

Title: Let Triggers Control: Frequency-Aware Dropout for Effective Token Control

Authors: Junyoung Koh, Hoyeon Moon, Dongha Kim, Seungmin Lee, Sanghyun Park, Min Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.27199
Pdf URL: https://arxiv.org/pdf/2603.27199
Copy Paste: [[2603.27199]] Let Triggers Control: Frequency-Aware Dropout for Effective Token Control(https://arxiv.org/abs/2603.27199)
Keywords: generation, generative
Abstract: Text-to-image models such as Stable Diffusion have achieved unprecedented levels of high-fidelity visual synthesis. As these models advance, personalization of generative models, commonly facilitated through Low-Rank Adaptation (LoRA) with a dedicated trigger token, has become a significant area of research. Previous works have naively assumed that fine-tuning with a single trigger token suffices to represent new concepts. However, this often results in poor controllability, where the trigger token alone fails to reliably evoke the intended concept. We attribute this issue to the frequent co-occurrence of the trigger token with the surrounding context during fine-tuning, which entangles their representations and compromises the token's semantic distinctiveness. To disentangle them, we propose Frequency-Aware Dropout (FAD), a novel regularization technique that improves prompt controllability without adding new parameters. FAD consists of two key components: co-occurrence analysis and curriculum-inspired scheduling. Qualitative and quantitative analyses across token-based diffusion models (SD 1.5 and SDXL) and natural-language-driven backbones (FLUX and Qwen-Image) demonstrate consistent gains in prompt fidelity, stylistic precision, and user-perceived quality. Our method provides a simple yet effective dropout strategy that enhances controllability and personalization in text-to-image generation. Notably, it achieves these improvements without introducing additional parameters or architectural modifications, making it readily applicable to existing models with minimal computational overhead.

Title: Understanding and Mitigating Hallucinations in Multimodal Chain-of-Thought Models

Authors: Ji Ma, Wei Suo, Peng Wang, Yanning Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.27201
Pdf URL: https://arxiv.org/pdf/2603.27201
Copy Paste: [[2603.27201]] Understanding and Mitigating Hallucinations in Multimodal Chain-of-Thought Models(https://arxiv.org/abs/2603.27201)
Keywords: generation
Abstract: Multimodal Chain-of-Thought (MCoT) models have demonstrated impressive capability in complex visual reasoning tasks. Unfortunately, recent studies reveal that they suffer from severe hallucination problems due to diminished visual attention during the generation process. However, visual attention decay is a well-studied problem in Large Vision-Language Models (LVLMs). Considering the fundamental differences in reasoning processes between MCoT models and traditional LVLMs, we raise a basic question: do MCoT models have unique causes of hallucinations? To answer this question, we systematically investigate the hallucination patterns of MCoT models and find that fabricated texts are primarily generated in associative reasoning steps, which we term divergent thinking. Leveraging these insights, we introduce a simple yet effective strategy that localizes divergent thinking steps and intervenes in the decoding process to mitigate hallucinations. Extensive experiments show that our method outperforms existing methods by a large margin. More importantly, our proposed method can be conveniently integrated with other hallucination mitigation methods and further boost their performance. The code is publicly available at this https URL.

Title: Make It Up: Fake Images, Real Gains in Generalized Few-shot Semantic Segmentation

Authors: Guohuan Xie, Xin He, Dingying Fan, Le Zhang, Ming-Ming Cheng, Yun Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.27206
Pdf URL: https://arxiv.org/pdf/2603.27206
Copy Paste: [[2603.27206]] Make It Up: Fake Images, Real Gains in Generalized Few-shot Semantic Segmentation(https://arxiv.org/abs/2603.27206)
Keywords: generation
Abstract: Generalized few-shot semantic segmentation (GFSS) is fundamentally limited by the coverage of novel-class appearances under scarce annotations. While diffusion models can synthesize novel-class images at scale, practical gains are often hindered by insufficient coverage and noisy supervision when masks are unavailable or unreliable. We propose Syn4Seg, a generation-enhanced GFSS framework designed to expand novel-class coverage while improving pseudo-label quality. Syn4Seg first maximizes prompt-space coverage by constructing an embedding-deduplicated prompt bank for each novel class, yielding diverse yet class-consistent synthetic images. It then performs support-guided pseudo-label estimation via a two-stage refinement that i) filters low-consistency regions to obtain high-precision seeds and ii) relabels uncertain pixels with image-adaptive prototypes that combine global (support) and local (image) statistics. Finally, we refine only boundary-band and unlabeled pixels using a constrained SAM-based update to improve contour fidelity without overwriting high-confidence interiors. Extensive experiments on PASCAL-$5^i$ and COCO-$20^i$ demonstrate consistent improvements in both 1-shot and 5-shot settings, highlighting synthetic data as a scalable path for GFSS with reliable masks and precise boundaries.
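
The embedding-deduplicated prompt bank mentioned above can be illustrated with a small sketch. The abstract does not say how Syn4Seg deduplicates prompts, so the greedy cosine-similarity filter, the function names, and the `max_similarity` threshold below are all assumptions made for illustration, not the authors' procedure.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def deduplicate_prompts(prompts, embeddings, max_similarity=0.9):
    """Greedily keep a prompt only if it is not too close to any kept prompt.

    `max_similarity` is a hypothetical threshold, not a value from the paper.
    """
    kept = []  # indices of prompts retained so far
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < max_similarity for j in kept):
            kept.append(i)
    return [prompts[i] for i in kept]
```

In practice the embeddings would come from a text encoder; here they are plain lists of floats so the sketch stays self-contained.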

Title: LightMover: Generative Light Movement with Color and Intensity Controls

Authors: Gengze Zhou, Tianyu Wang, Soo Ye Kim, Zhixin Shu, Xin Yu, Yannick Hold-Geoffroy, Sumit Chaturvedi, Qi Wu, Zhe Lin, Scott Cohen
Subjects: cs.CV, cs.CL, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2603.27209
Pdf URL: https://arxiv.org/pdf/2603.27209
Copy Paste: [[2603.27209]] LightMover: Generative Light Movement with Color and Intensity Controls(https://arxiv.org/abs/2603.27209)
Keywords: generative
Abstract: We present LightMover, a framework for controllable light manipulation in single images that leverages video diffusion priors to produce physically plausible illumination changes without re-rendering the scene. We formulate light editing as a sequence-to-sequence prediction problem in visual token space: given an image and light-control tokens, the model adjusts light position, color, and intensity together with resulting reflections, shadows, and falloff from a single view. This unified treatment of spatial (movement) and appearance (color, intensity) controls improves both manipulation and illumination understanding. We further introduce an adaptive token-pruning mechanism that preserves spatially informative tokens while compactly encoding non-spatial attributes, reducing control sequence length by 41% while maintaining editing fidelity. To train our framework, we construct a scalable rendering pipeline that generates large numbers of image pairs across varied light positions, colors, and intensities while keeping the scene content consistent with the original image. LightMover enables precise, independent control over light position, color, and intensity, and achieves high PSNR and strong semantic consistency (DINO, CLIP) across different tasks.

Title: Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark

Authors: Seng Nam Chen, Hao Chen, Chenglam Ho, Xinyu Mao, Jinping Wang, Yu Zhang, Chao Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.27259
Pdf URL: https://arxiv.org/pdf/2603.27259
Copy Paste: [[2603.27259]] Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark(https://arxiv.org/abs/2603.27259)
Keywords: generation
Abstract: Long video understanding (LVU) remains a core challenge in multimodal learning. Although recent vision-language models (VLMs) have made notable progress, existing benchmarks mainly focus on either fine-grained perception or coarse summarization, offering limited insight into temporal understanding over long contexts. In this work, we define a scene as a coherent segment of a video in which both visual and semantic contexts remain consistent, aligning with human perception. This leads us to a key question: can current VLMs reason effectively over long, scene-level contexts? To answer this, we introduce a new benchmark, SceneBench, designed to provide scene-level challenges. Our evaluation reveals a sharp drop in accuracy when VLMs attempt to answer scene-level questions, indicating significant forgetting of long-range context. To further validate these findings, we propose Scene Retrieval-Augmented Generation (Scene-RAG), which constructs a dynamic scene memory by retrieving and integrating relevant context across scenes. Scene-RAG improves VLM performance by 2.50%, confirming that current models still struggle with long-context retention. We hope SceneBench will encourage future research toward VLMs with more robust, human-like video comprehension.
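
The retrieval step behind a scene memory can be sketched roughly as follows. The abstract gives no implementation details for Scene-RAG, so the top-k cosine-similarity ranking, the restoration of temporal order, and every name and parameter here are assumptions for illustration only.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def retrieve_scenes(query, scene_embeddings, k=2):
    """Return indices of the k scenes most similar to the query.

    The indices are returned in temporal (ascending) order so the retrieved
    context can be read in the order it occurred in the video. That ordering
    is an assumption of this sketch.
    """
    ranked = sorted(
        range(len(scene_embeddings)),
        key=lambda i: cosine(query, scene_embeddings[i]),
        reverse=True,
    )
    return sorted(ranked[:k])
```

The returned indices would then select which scene descriptions or frames are passed to the VLM alongside the question.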

Title: TrendGen: An Outfit Recommendation and Display System

Authors: Theodoros Koukopoulos, Dimos Klimenof, Ioannis Xarchakos
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.27264
Pdf URL: https://arxiv.org/pdf/2603.27264
Copy Paste: [[2603.27264]] TrendGen: An Outfit Recommendation and Display System(https://arxiv.org/abs/2603.27264)
Keywords: generation, generative
Abstract: Recent advances in Computer Vision have significantly improved image understanding and generation, revolutionizing the fashion industry. However, challenges such as inconsistent lighting, non-ideal garment angles, complex backgrounds, and occlusions in raw images hinder their full potential. Overcoming these obstacles is crucial for developing robust fashion AI systems capable of real-world applications. In this paper, we introduce TrendGen, a Fashion AI system designed to enhance online shopping with intelligent outfit recommendations. Deployed on a major e-commerce platform, TrendGen leverages clothing images and product attributes to generate trend-aligned, cohesive outfit suggestions. Additionally, it employs Generative AI to transform raw images into high-quality lay-down views, offering a clear and structured presentation of garments. Our evaluation on production data demonstrates that TrendGen consistently produces high-quality outfits and lay-down images, marking a significant advancement in AI-driven solutions for fashion retail.

Title: Dual-Path Learning based on Frequency Structural Decoupling and Regional-Aware Fusion for Low-Light Image Super-Resolution

Authors: Ji-Xuan He, Jia-Cheng Zhao, Feng-Qi Cui, Jinyang Huang, Yang Liu, Sirui Zhao, Meng Li, Zhi Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.27301
Pdf URL: https://arxiv.org/pdf/2603.27301
Copy Paste: [[2603.27301]] Dual-Path Learning based on Frequency Structural Decoupling and Regional-Aware Fusion for Low-Light Image Super-Resolution(https://arxiv.org/abs/2603.27301)
Keywords: super-resolution
Abstract: Low-light image super-resolution (LLISR) is essential for restoring fine visual details and perceptual quality under insufficient illumination conditions with ubiquitous low-resolution devices. Although pioneering methods achieve high performance on single tasks, they solve both tasks in a serial manner, which inevitably leads to artifact amplification, texture suppression, and structural degradation. To address this, we propose Decoupling then Perceive (DTP), a novel frequency-aware framework that explicitly separates luminance and texture into semantically independent components, enabling specialized modeling and coherent reconstruction. Specifically, to adaptively separate the input into low-frequency luminance and high-frequency texture subspaces, we propose a Frequency-aware Structural Decoupling (FSD) mechanism, which lays a solid foundation for targeted representation learning and reconstruction. Based on the decoupled representation, we further design a Semantics-specific Dual-path Representation (SDR) learning strategy that performs targeted enhancement and reconstruction for each frequency component, facilitating robust luminance adjustment and fine-grained texture recovery. To promote structural consistency and perceptual alignment in the reconstructed output, building upon this dual-path modeling, we further introduce a Cross-frequency Semantic Recomposition (CSR) module that selectively integrates the decoupled representations. Extensive experiments on the most widely used LLISR benchmarks demonstrate the superiority of our DTP framework, which improves PSNR by 1.6% and SSIM by 9.6% and reduces LPIPS by 48% compared to the state-of-the-art (SOTA) algorithm. Code is released at this https URL.
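
The idea of splitting an input into a low-frequency part and a high-frequency part can be shown on a toy 1-D signal. The FSD mechanism in the abstract is described as adaptive, and its details are not given, so the fixed moving-average low-pass filter below is a deliberately simpler stand-in; the function name and the `radius` parameter are hypothetical.

```python
def split_frequencies(signal, radius=1):
    """Split a 1-D signal into a smooth part and a residual part.

    low  : moving average over a window of up to 2*radius+1 samples
           (the window is clipped at the borders)
    high : what remains after subtracting the smooth part, so that
           low[i] + high[i] reproduces signal[i]
    """
    n = len(signal)
    low = []
    for i in range(n):
        window = signal[max(0, i - radius): min(n, i + radius + 1)]
        low.append(sum(window) / len(window))
    high = [s - l for s, l in zip(signal, low)]
    return low, high
```

On an image, the smooth part would loosely correspond to luminance and the residual to texture and edges; each part can then be processed by its own path before being recombined.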

Title: Unsafe by Reciprocity: How Generation-Understanding Coupling Undermines Safety in Unified Multimodal Models

Authors: Kaishen Wang, Heng Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.27332
Pdf URL: https://arxiv.org/pdf/2603.27332
Copy Paste: [[2603.27332]] Unsafe by Reciprocity: How Generation-Understanding Coupling Undermines Safety in Unified Multimodal Models(https://arxiv.org/abs/2603.27332)
Keywords: generation
Abstract: Recent advances in Large Language Models (LLMs) and Text-to-Image (T2I) models have led to the emergence of Unified Multimodal Models (UMMs), where multimodal understanding and image generation are tightly integrated within a shared architecture. Prior studies suggest that such reciprocity enhances cross-functionality performance through shared representations and joint optimization. However, the safety implications of this tight coupling remain largely unexplored, as existing safety research predominantly analyzes understanding and generation functionalities in isolation. In this work, we investigate whether cross-functionality reciprocity itself constitutes a structural source of vulnerability in UMMs. We propose RICE: Reciprocal Interaction-based Cross-functionality Exploitation, a novel attack paradigm that explicitly exploits bidirectional interactions between understanding and generation. Using this framework, we systematically evaluate Generation-to-Understanding (G-U) and Understanding-to-Generation (U-G) attack pathways, demonstrating that unsafe intermediate signals can propagate across modalities and amplify safety risks. Extensive experiments show high Attack Success Rates (ASR) in both directions, revealing previously overlooked safety weaknesses inherent to UMMs.

Title: Falcon Perception

Authors: Aviraj Bevli, Sofian Chaybouti, Yasser Dahou, Hakim Hacid, Ngoc Dung Huynh, Phuc H. Le Khac, Sanath Narayan, Wamiq Reyaz Para, Ankit Singh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.27365
Pdf URL: https://arxiv.org/pdf/2603.27365
Copy Paste: [[2603.27365]] Falcon Perception(https://arxiv.org/abs/2603.27365)
Keywords: generation
Abstract: Perception-centric systems are typically implemented with a modular encoder-decoder pipeline: a vision backbone for feature extraction and a separate decoder (or late-fusion module) for task prediction. This raises a central question: is this architectural separation essential or can a single early-fusion stack do both perception and task modeling at scale? We introduce Falcon Perception, a unified dense Transformer that processes image patches and text tokens in a shared parameter space from the first layer, using a hybrid attention pattern (bidirectional among image tokens, causal for prediction tokens) to combine global visual context with autoregressive, variable-length instance generation. To keep dense outputs practical, Falcon Perception retains a lightweight token interface and decodes continuous spatial outputs with specialized heads, enabling parallel high-resolution mask prediction. Our design promotes simplicity: we keep a single scalable backbone and shift complexity toward data and training signals, adding only small heads where outputs are continuous and dense. On SA-Co, Falcon Perception improves mask quality to 68.0 Macro-F$_1$ compared to 62.3 of SAM3. We also introduce PBench, a benchmark targeting compositional prompts (OCR, spatial constraints, relations) and dense long-context regimes, where the model shows better gains. Finally, we extend the same early-fusion recipe to Falcon OCR: a compact 300M-parameter model which attains 80.3% on olmOCR and 88.64 on OmniDocBench.
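
The hybrid attention pattern described above (bidirectional among image tokens, causal for prediction tokens) can be written down as a boolean mask. The sequence layout, with all image tokens placed before the prediction tokens, and the function name are assumptions of this sketch; the abstract does not specify them.

```python
def hybrid_attention_mask(num_image, num_pred):
    """Build an attention mask where mask[q][k] is True if query q may attend to key k.

    Assumed layout: positions 0..num_image-1 are image tokens, followed by
    num_pred prediction tokens.
    """
    n = num_image + num_pred
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            if k < num_image:
                # Image keys are visible to every query: image queries see each
                # other bidirectionally, prediction queries see them as a prefix.
                mask[q][k] = True
            elif q >= num_image:
                # Prediction queries see prediction keys causally (no lookahead).
                mask[q][k] = k <= q
            # Image queries never attend to prediction keys (entry stays False).
    return mask
```

For example, `hybrid_attention_mask(2, 3)` gives a 5x5 mask whose top-left 2x2 block is fully True and whose bottom-right 3x3 block is lower-triangular.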

Title: The Geometry of Harmful Intent: Training-Free Anomaly Detection via Angular Deviation in LLM Residual Streams

Authors: Isaac Llorente-Saguer
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2603.27412
Pdf URL: https://arxiv.org/pdf/2603.27412
Copy Paste: [[2603.27412]] The Geometry of Harmful Intent: Training-Free Anomaly Detection via Angular Deviation in LLM Residual Streams(https://arxiv.org/abs/2603.27412)
Keywords: generative
Abstract: We present LatentBiopsy, a training-free method for detecting harmful prompts by analysing the geometry of residual-stream activations in large language models. Given 200 safe normative prompts, LatentBiopsy computes the leading principal component of their activations at a target layer and characterises new prompts by their radial deviation angle $\theta$ from this reference direction. The anomaly score is the negative log-likelihood of $\theta$ under a Gaussian fit to the normative distribution, flagging deviations symmetrically regardless of orientation. No harmful examples are required for training. We evaluate two complete model triplets from the Qwen3.5-0.8B and Qwen2.5-0.5B families: base, instruction-tuned, and abliterated (refusal direction surgically removed via orthogonalisation). Across all six variants, LatentBiopsy achieves AUROC $\geq 0.937$ for harmful-vs-normative detection and AUROC = 1.000 for discriminating harmful from benign-aggressive prompts (XSTest), with sub-millisecond per-query overhead. Three empirical findings emerge. First, geometry survives refusal ablation: both abliterated variants achieve AUROC at most 0.015 below their instruction-tuned counterparts, establishing a geometric dissociation between harmful-intent representation and the downstream generative refusal mechanism. Second, harmful prompts exhibit a near-degenerate angular distribution ($\sigma_\theta \approx 0.03$ rad), an order of magnitude tighter than the normative distribution ($\sigma_\theta \approx 0.27$ rad), preserved across all alignment stages including abliteration. Third, the two families exhibit opposite ring orientations at the same depth: harmful prompts occupy the outer ring in Qwen3.5-0.8B but the inner ring in Qwen2.5-0.5B, directly motivating the direction-agnostic scoring rule.
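
The scoring rule in this abstract (leading principal component, deviation angle $\theta$, Gaussian negative log-likelihood) is concrete enough to sketch on toy data. Several choices below are assumptions not stated in the abstract: the activations are mean-centred before the principal component is computed, the angle is measured between the raw activation and the reference direction, and power iteration is used in place of a full PCA. All function names are hypothetical.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def leading_component(rows, iters=200):
    """Leading principal component of mean-centred rows, via power iteration."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - mean[j] for j in range(d)] for r in rows]
    v = [1.0 / math.sqrt(d)] * d  # arbitrary unit-length starting vector
    for _ in range(iters):
        # One multiplication by the unnormalised covariance: w = X^T (X v).
        proj = [_dot(c, v) for c in centred]
        w = [sum(p * c[j] for p, c in zip(proj, centred)) for j in range(d)]
        norm = math.sqrt(_dot(w, w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


def deviation_angle(x, direction):
    """Angle in radians between a non-zero activation x and the reference direction."""
    cos = _dot(x, direction) / math.sqrt(_dot(x, x) * _dot(direction, direction))
    return math.acos(max(-1.0, min(1.0, cos)))


def fit_reference(normative_rows):
    """Fit the reference direction and a Gaussian over the normative angles."""
    direction = leading_component(normative_rows)
    thetas = [deviation_angle(r, direction) for r in normative_rows]
    mu = sum(thetas) / len(thetas)
    var = sum((t - mu) ** 2 for t in thetas) / len(thetas)
    sigma = max(math.sqrt(var), 1e-6)  # floor avoids division by zero
    return direction, mu, sigma


def anomaly_score(x, direction, mu, sigma):
    """Gaussian negative log-likelihood of the deviation angle of x."""
    theta = deviation_angle(x, direction)
    return 0.5 * math.log(2 * math.pi * sigma ** 2) + (theta - mu) ** 2 / (2 * sigma ** 2)
```

Because the score depends on $(\theta - \mu)^2$, angles above and below the normative mean are penalised equally, which matches the direction-agnostic behaviour the abstract describes. Real activations would be high-dimensional vectors read from a chosen layer rather than the short lists used here.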

Title: Mind the Shape Gap: A Benchmark and Baseline for Deformation-Aware 6D Pose Estimation of Agricultural Produce

Authors: Nikolas Chatzis, Angeliki Tsinouka, Katerina Papadimitriou, Niki Efthymiou, Marios Glytsos, George Retsinas, Paris Oikonomou, Gerasimos Potamianos, Petros Maragos, Panagiotis Paraskevas Filntisis
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.27429
Pdf URL: https://arxiv.org/pdf/2603.27429
Copy Paste: [[2603.27429]] Mind the Shape Gap: A Benchmark and Baseline for Deformation-Aware 6D Pose Estimation of Agricultural Produce(https://arxiv.org/abs/2603.27429)
Keywords: generative
Abstract: Accurate 6D pose estimation for robotic harvesting is fundamentally hindered by the biological deformability and high intra-class shape variability of agricultural produce. Instance-level methods fail in this setting, as obtaining exact 3D models for every unique piece of produce is practically infeasible, while category-level approaches that rely on a fixed template suffer significant accuracy degradation when the prior deviates from the true instance geometry. To address this lack of robustness to deformation, we introduce PEAR (Pose and dEformation of Agricultural pRoduce), the first benchmark providing joint 6D pose and per-instance 3D deformation ground truth across 8 produce categories, acquired via a robotic manipulator for high annotation accuracy. Using PEAR, we show that state-of-the-art methods suffer up to 6x performance degradation when faced with the inherent geometric deviations of real-world produce. Motivated by this finding, we propose SEED (Simultaneous Estimation of posE and Deformation), a unified RGB-only framework that jointly predicts 6D pose and explicit lattice deformations from a single image across multiple produce categories. Trained entirely on synthetic data with generative texture augmentation applied at the UV level, SEED outperforms MegaPose on 6 out of 8 categories under identical RGB-only conditions, demonstrating that explicit shape modeling is a critical step toward reliable pose estimation in agricultural robotics.

Title: SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning

Authors: Jiang Zhang, Shijie Zhou, Bangya Liu, Achuta Kadambi, Zhiwen Fan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.27437
Pdf URL: https://arxiv.org/pdf/2603.27437
Copy Paste: [[2603.27437]] SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning(https://arxiv.org/abs/2603.27437)
Keywords: generation
Abstract: Large vision-language models (VLMs) still struggle with reliable 3D spatial reasoning, a core capability for embodied and physical AI systems. This limitation arises from their inability to capture fine-grained 3D geometry and spatial relationships. While recent efforts have introduced multi-view geometry transformers into VLMs, they typically fuse only the deep-layer features from vision and geometry encoders, discarding rich hierarchical signals and creating a fundamental bottleneck for spatial understanding. To overcome this, we propose SpatialStack, a general hierarchical fusion framework that progressively aligns vision, geometry, and language representations across the model hierarchy. Moving beyond conventional late-stage vision-geometry fusion, SpatialStack stacks and synchronizes multi-level geometric features with the language backbone, enabling the model to capture both local geometric precision and global contextual semantics. Building upon this framework, we develop VLM-SpatialStack, a model that achieves state-of-the-art performance on multiple 3D spatial reasoning benchmarks. Extensive experiments and ablations demonstrate that our multi-level fusion strategy consistently enhances 3D understanding and generalizes robustly across diverse spatial reasoning tasks, establishing SpatialStack as an effective and extensible design paradigm for vision-language-geometry integration in next-generation multimodal physical AI systems.

Title: GIFT: Bootstrapping Image-to-CAD Program Synthesis via Geometric Feedback

Authors: Giorgio Giannone, Anna Clare Doris, Amin Heyrani Nobari, Kai Xu, Akash Srivastava, Faez Ahmed
Subjects: cs.LG, cs.AI, cs.CE
Abstract URL: https://arxiv.org/abs/2603.27448
Pdf URL: https://arxiv.org/pdf/2603.27448
Copy Paste: [[2603.27448]] GIFT: Bootstrapping Image-to-CAD Program Synthesis via Geometric Feedback(https://arxiv.org/abs/2603.27448)
Keywords: generative
Abstract: Generating executable CAD programs from images requires alignment between visual geometry and symbolic program representations, a capability that current methods fail to learn reliably as design complexity increases. Existing fine-tuning approaches rely on either limited supervised datasets or expensive post-training pipelines, resulting in brittle systems that restrict progress in generative CAD design. We argue that the primary bottleneck lies not in model or algorithmic capacity, but in the scarcity of diverse training examples that align visual geometry with program syntax. This limitation is especially acute because the collection of diverse and verified engineering datasets is both expensive and difficult to scale, constraining the development of robust generative CAD models. We introduce Geometric Inference Feedback Tuning (GIFT), a data augmentation framework that leverages geometric feedback to turn test-time compute into a bootstrapped set of high-quality training samples. GIFT combines two mechanisms: Soft-Rejection Sampling (GIFT-REJECT), which retains diverse high-fidelity programs beyond exact ground-truth matches, and Failure-Driven Augmentation (GIFT-FAIL), which converts near-miss predictions into synthetic training examples that improve robustness on challenging geometries. By amortizing inference-time search into the model parameters, GIFT captures the benefits of test-time scaling while reducing inference compute by 80%. It improves mean IoU by 12% over a strong supervised baseline and remains competitive with more complex multimodal systems, without requiring additional human annotation or specialized architectures.
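
The two GIFT mechanisms both sort sampled programs by how well their geometry matches the target. A minimal sketch of that triage step follows. It assumes each candidate program has already been executed into a set of occupied cells, that IoU over those cells is the geometric feedback signal, and that two thresholds separate accepted, near-miss, and discarded candidates. The thresholds and all names are hypothetical, and how near-misses are turned into training examples is not covered here.

```python
def iou(a, b):
    """Intersection over union of two sets of occupied cells."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def triage_candidates(candidates, target, accept=0.9, near_miss=0.5):
    """Sort sampled programs into accepted and near-miss buckets.

    candidates : list of (program, shape) pairs, where shape is a set of cells
    target     : set of cells for the ground-truth geometry
    accept, near_miss : hypothetical IoU thresholds, not values from the paper
    """
    accepted, near = [], []
    for program, shape in candidates:
        score = iou(shape, target)
        if score >= accept:
            accepted.append(program)   # kept even if not an exact match
        elif score >= near_miss:
            near.append(program)       # candidate for failure-driven augmentation
        # Anything below near_miss is discarded.
    return accepted, near
```

Accepting on a geometric score rather than on an exact program match is what lets several syntactically different but equally valid programs survive for the same target.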

Title: LOME: Learning Human-Object Manipulation with Action-Conditioned Egocentric World Model

Authors: Quankai Gao, Jiawei Yang, Qiangeng Xu, Le Chen, Yue Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.27449
Pdf URL: https://arxiv.org/pdf/2603.27449
Copy Paste: [[2603.27449]] LOME: Learning Human-Object Manipulation with Action-Conditioned Egocentric World Model(https://arxiv.org/abs/2603.27449)
Keywords: generative
Abstract: Learning human-object manipulation presents significant challenges due to the fine-grained and contact-rich nature of the motions involved. Traditional physics-based animation requires extensive modeling and manual setup, and more importantly, it neither generalizes well across diverse object morphologies nor scales effectively to real-world environments. To address these limitations, we introduce LOME, an egocentric world model that can generate realistic human-object interactions as videos conditioned on an input image, a text prompt, and per-frame human actions, including both body poses and hand gestures. LOME injects strong and precise action guidance into object manipulation by jointly estimating spatial human actions and the environment contexts during training. After finetuning a pretrained video generative model on videos of diverse egocentric human-object interactions, LOME demonstrates not only high action-following accuracy and strong generalization to unseen scenarios, but also realistic physical consequences of hand-object interactions, e.g., liquid flowing from a bottle into a mug after executing a "pouring" action. Extensive experiments demonstrate that our video-based framework significantly outperforms state-of-the-art image-based and video-based action-conditioned methods and Image/Text-to-Video (I/T2V) generative models in terms of both temporal consistency and motion control. LOME paves the way for photorealistic AR/VR experiences and scalable robotic training, without being limited to simulated environments or relying on explicit 3D/4D modeling.

Title: FlowRL: A Taxonomy and Modular Framework for Reinforcement Learning with Diffusion Policies

Authors: Chenxiao Gao, Edward Chen, Tianyi Chen, Bo Dai
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.27450
Pdf URL: https://arxiv.org/pdf/2603.27450
Copy Paste: [[2603.27450]] FlowRL: A Taxonomy and Modular Framework for Reinforcement Learning with Diffusion Policies(https://arxiv.org/abs/2603.27450)
Keywords: generative
Abstract: Thanks to their remarkable flexibility, diffusion models and flow models have emerged as promising candidates for policy representation. However, efficient reinforcement learning (RL) with these policies remains a challenge due to the lack of explicit log-probabilities for vanilla policy gradient estimators. While numerous approaches have been proposed to address this, the field lacks a unified perspective to reconcile these seemingly disparate methods, thus hampering ongoing development. In this paper, we bridge this gap by introducing a comprehensive taxonomy for RL algorithms with diffusion/flow policies. To support reproducibility and agile prototyping, we introduce a modular, JAX-based open-source codebase that leverages JIT-compilation for high-throughput training. Finally, we provide systematic and standardized benchmarks across Gym-Locomotion, DeepMind Control Suite, and IsaacLab, offering a rigorous side-by-side comparison of diffusion-based methods and guidance for practitioners to choose proper algorithms based on the application. Our work establishes a clear foundation for understanding and algorithm design, a high-efficiency toolkit for future research in the field, and an algorithmic guideline for practitioners in generative models and robotics. Our code is available at this https URL.

Title: KV Cache Quantization for Self-Forcing Video Generation: A 33-Method Empirical Study

Authors: Suraj Ranganath, Vaishak Menon, Anish Patnaik
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.27469
Pdf URL: https://arxiv.org/pdf/2603.27469
Copy Paste: [[2603.27469]] KV Cache Quantization for Self-Forcing Video Generation: A 33-Method Empirical Study(https://arxiv.org/abs/2603.27469)
Keywords: generation
Abstract: Self-forcing video generation extends a short-horizon video model to longer rollouts by repeatedly feeding generated content back in as context. This scaling path immediately exposes a systems bottleneck: the key-value (KV) cache grows with rollout length, so longer videos require not only better generation quality but also substantially better memory behavior. We present a comprehensive empirical study of KV-cache compression for self-forcing video generation on a Wan2.1-based Self-Forcing stack. Our study covers 33 quantization and cache-policy variants, 610 prompt-level observations, and 63 benchmark-level summaries across two evaluation settings: MovieGen for single-shot 10-second generation and StoryEval for longer narrative-style stability. We jointly evaluate peak VRAM, runtime, realized compression ratio, VBench imaging quality, BF16-referenced fidelity (SSIM, LPIPS, PSNR), and terminal drift. Three findings are robust. First, the strongest practical operating region is a FlowCache-inspired soft-prune INT4 adaptation, which reaches 5.42-5.49x compression while reducing peak VRAM from 19.28 GB to about 11.7 GB with only modest runtime overhead. Second, the highest-fidelity compressed methods, especially PRQ_INT4 and QUAROT_KV_INT4, are not the best deployment choices because they preserve quality at severe runtime or memory cost. Third, nominal compression alone is not sufficient: several methods shrink KV storage but still exceed BF16 peak VRAM because the current integration reconstructs or retains large BF16 buffers during attention and refresh stages. The result is a benchmark harness, analysis workflow, and empirical map of which KV-cache ideas are practical today and which are promising research directions for better memory integration. Code, data products, and the presentation dashboard are available at this https URL.
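Illustration (not from the paper): a minimal sketch of symmetric per-group INT4 quantization, the general kind of operation a KV-cache quantizer performs. It is not any of the 33 studied variants; the function names, the one-scale-per-group layout, and the 16-bit scale are assumptions made for illustration.

```python
def quantize_int4(values):
    """Map one group of floats to signed 4-bit codes in [-8, 7] plus a scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int4(codes, scale):
    """Reconstruct approximate floats from codes and the group scale."""
    return [c * scale for c in codes]


def nominal_compression(group_size, value_bits=16, code_bits=4, scale_bits=16):
    """Storage ratio versus 16-bit values, counting one scale per group.

    This counts stored cache bits only. It ignores any full-precision
    buffers rebuilt at attention time, which the abstract identifies as
    the reason some methods still exceed BF16 peak VRAM.
    """
    return (group_size * value_bits) / (group_size * code_bits + scale_bits)
```

Under these assumptions the nominal ratio stays below 4x for any group size, so the 5.42-5.49x figure reported above cannot come from INT4 storage alone; the abstract attributes it to a soft-prune INT4 adaptation, which is not modeled here.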

Title: Variational Learning of Fractional Posteriors

Authors: Kian Ming A. Chai, Edwin V. Bonilla
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.27488
Pdf URL: https://arxiv.org/pdf/2603.27488
Copy Paste: [[2603.27488]] Variational Learning of Fractional Posteriors(https://arxiv.org/abs/2603.27488)
Keywords: generation
Abstract: We introduce a novel one-parameter variational objective that lower bounds the data evidence and enables the estimation of approximate fractional posteriors. We extend this framework to hierarchical construction and Bayes posteriors, offering a versatile tool for probabilistic modelling. We demonstrate two cases where gradients can be obtained analytically and a simulation study on mixture models showing that our fractional posteriors can be used to achieve better calibration compared to posteriors from the conventional variational bound. When applied to variational autoencoders (VAEs), our approach attains higher evidence bounds and enables learning of high-performing approximate Bayes posteriors jointly with fractional posteriors. We show that VAEs trained with fractional posteriors produce decoders that are better aligned for generation from the prior.
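Illustration (not from the paper): the abstract does not define "fractional posterior", so this sketch assumes the common definition in which the likelihood is raised to a power gamma before being combined with the prior. For a Bernoulli likelihood with a Beta(a, b) prior that gives a closed form, Beta(a + gamma * successes, b + gamma * failures). The code shows only this conjugate case and does not implement the paper's variational objective; all function names are hypothetical.

```python
def fractional_beta_posterior(a, b, successes, failures, gamma):
    """Beta parameters of the fractional posterior for Bernoulli data.

    gamma = 1 recovers the ordinary Bayes posterior; gamma = 0 returns the prior.
    """
    return a + gamma * successes, b + gamma * failures


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


def beta_variance(a, b):
    """Variance of a Beta(a, b) distribution."""
    return a * b / ((a + b) ** 2 * (a + b + 1))
```

With a uniform Beta(1, 1) prior and 7 successes in 10 trials, gamma = 1 gives Beta(8, 4) and gamma = 0.5 gives Beta(4.5, 2.5). The tempered posterior is wider, which is one intuition for how such posteriors could affect calibration.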

Title: Understanding Semantic Perturbations on In-Processing Generative Image Watermarks

Authors: Anirudh Nakra, Min Wu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.27513
Pdf URL: https://arxiv.org/pdf/2603.27513
Copy Paste: [[2603.27513]] Understanding Semantic Perturbations on In-Processing Generative Image Watermarks(https://arxiv.org/abs/2603.27513)
Keywords: generation, generative
Abstract: The widespread deployment of high-fidelity generative models has intensified the need for reliable mechanisms for provenance and content authentication. In-processing watermarking, embedding a signature into the generative model's synthesis procedure, has been advocated as a solution and is often reported to be robust to standard post-processing (such as geometric transforms and filtering). Yet robustness to semantic manipulations that alter high-level scene content while maintaining reasonable visual quality is not well studied or understood. We introduce a simple, multi-stage framework for systematically stress-testing in-processing generative watermarks under semantic drift. The framework utilizes off-the-shelf models for object detection, mask generation, and semantically guided inpainting or regeneration to produce controlled, meaning-altering edits with minimal perceptual degradation. Based on extensive experiments on representative schemes, we find that robustness varies significantly with the degree of semantic entanglement: methods whose watermarks remain detectable under a broad suite of conventional perturbations can fail under semantic edits, with watermark detectability in many cases dropping to near zero while image quality remains high. Overall, our results reveal a critical gap in current watermarking evaluations and suggest that watermark designs and benchmarking must explicitly account for robustness against semantic manipulation.

Title: TokenDial: Continuous Attribute Control in Text-to-Video via Spatiotemporal Token Offsets

Authors: Zhixuan Liu, Peter Schaldenbrand, Yijun Li, Long Mai, Aniruddha Mahapatra, Cusuh Ham, Jean Oh, Jui-Hsien Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.27520
Pdf URL: https://arxiv.org/pdf/2603.27520
Copy Paste: [[2603.27520]] TokenDial: Continuous Attribute Control in Text-to-Video via Spatiotemporal Token Offsets(https://arxiv.org/abs/2603.27520)
Keywords: generation
Abstract: We present TokenDial, a framework for continuous, slider-style attribute control in pretrained text-to-video generation models. While modern generators produce strong holistic videos, they offer limited control over how much an attribute changes (e.g., effect intensity or motion magnitude) without drifting identity, background, or temporal coherence. TokenDial is built on the observation that additive offsets in the intermediate spatiotemporal visual patch-token space form a semantic control direction, where adjusting the offset magnitude yields coherent, predictable edits for both appearance and motion dynamics. We learn attribute-specific token offsets without retraining the backbone, using pretrained understanding signals: semantic direction matching for appearance and motion-magnitude scaling for motion. We demonstrate TokenDial's effectiveness on diverse attributes and prompts, achieving stronger controllability and higher-quality edits than state-of-the-art baselines, supported by extensive quantitative evaluation and human studies.
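Illustration (not from the paper): the slider idea above, an additive offset whose magnitude is scaled by a control value, can be sketched on plain lists. The sketch assumes a single offset vector shared by all tokens, which is a simplification (the title suggests the offsets are spatiotemporal), and `apply_offset` and `strength` are hypothetical names.

```python
def apply_offset(tokens, offset, strength):
    """Shift every token by strength * offset.

    tokens:   list of token vectors (lists of floats)
    offset:   one attribute direction, assumed shared across tokens
    strength: slider value; 0.0 leaves the tokens unchanged
    """
    return [
        [t + strength * o for t, o in zip(token, offset)]
        for token in tokens
    ]
```

Because the edit is additive, doubling `strength` doubles the displacement of every token, which is what makes the control continuous in this toy version.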

Title: OmniColor: A Unified Framework for Multi-modal Lineart Colorization

Authors: Xulu Zhang, Haoqian Du, Xiaoyong Wei, Qing Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.27531
Pdf URL: https://arxiv.org/pdf/2603.27531
Copy Paste: [[2603.27531]] OmniColor: A Unified Framework for Multi-modal Lineart Colorization(https://arxiv.org/abs/2603.27531)
Keywords: restoration
Abstract: Lineart colorization is a critical stage in professional content creation, yet achieving precise and flexible results under diverse user constraints remains a significant challenge. To address this, we propose OmniColor, a unified framework for multi-modal lineart colorization that supports arbitrary combinations of control signals. Specifically, we systematically categorize guidance signals into two types: spatially-aligned conditions and semantic-reference conditions. For spatially-aligned inputs, we employ a dual-path encoding strategy paired with a Dense Feature Alignment loss to ensure rigorous boundary preservation and precise color restoration. For semantic-reference inputs, we utilize a VLM-only encoding scheme integrated with a Temporal Redundancy Elimination mechanism to filter repetitive information and enhance inference efficiency. To resolve potential input conflicts, we introduce an Adaptive Spatial-Semantic Gating module that dynamically balances multi-modal constraints. Experimental results demonstrate that OmniColor achieves superior controllability, visual quality, and temporal stability, providing a robust and practical solution for lineart colorization. The source code and dataset will be released at this https URL.

Title: LongCat-Next: Lexicalizing Modalities as Discrete Tokens

Authors: Meituan LongCat Team: Bin Xiao, Chao Wang, Chengjiang Li, Chi Zhang, Chong Peng, Hang Yu, Hao Yang, Haonan Yan, Haoze Sun, Haozhe Zhao, Hong Liu, Hui Su, Jiaqi Zhang, Jiawei Wang, Jing Li, Kefeng Zhang, Manyuan Zhang, Minhao Jing, Peng Pei, Quan Chen, Taofeng Xue, Tongxin Pan, Xiaotong Li, Xiaoyang Li, Xiaoyu Zhao, Xing Hu, Xinyang Lin, Xunliang Cai, Yan Bai, Yan Feng, Yanjie Li, Yao Qiu, Yerui Sun, Yifan Lu, Ying Luo, Yipeng Mei, Yitian Chen, Yuchen Xie, Yufang Liu, Yufei Chen, Yulei Qian, Yuqi Peng, Zhihang Yu, Zhixiong Han, Changran Wang, Chen Chen, Dian Zheng, Fengjiao Chen, Ge Yang, Haowei Guo, Haozhe Wang, Hongyu Li, Huicheng Jiang, Jiale Hong, Jialv Zou, Jiamu Li, Jianping Lin, Jiaxing Liu, Jie Yang, Jing Jin, Jun Kuang, Juncheng She, Kunming Luo, Kuofeng Gao, Lin Qiu, Linsen Guo, Mianqiu Huang, Qi Li, Qian Wang, Rumei Li, Siyu Ren, Wei Wang, Wenlong He, Xi Chen, Xiao Liu, Xiaoyu Li, Xu Huang, Xuanyu Zhu, Xuezhi Cao, Yaoming Zhu, Yifei Cao, Yimeng Jia, Yizhen Jiang, Yufei Gao, Zeyang Hu, Zhenlong Yuan, Zijian Zhang, Ziwen Wang
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2603.27538
Pdf URL: https://arxiv.org/pdf/2603.27538
Copy Paste: [[2603.27538]] LongCat-Next: Lexicalizing Modalities as Discrete Tokens(https://arxiv.org/abs/2603.27538)
Keywords: generation
Abstract: The prevailing Next-Token Prediction (NTP) paradigm has driven the success of large language models through discrete autoregressive modeling. However, contemporary multimodal systems remain language-centric, often treating non-linguistic modalities as external attachments, leading to fragmented architectures and suboptimal integration. To transcend this limitation, we introduce Discrete Native Autoregressive (DiNA), a unified framework that represents multimodal information within a shared discrete space, enabling consistent and principled autoregressive modeling across modalities. A key innovation is the Discrete Native Any-resolution Visual Transformer (dNaViT), which performs tokenization and de-tokenization at arbitrary resolutions, transforming continuous visual signals into hierarchical discrete tokens. Building on this foundation, we develop LongCat-Next, a native multimodal model that processes text, vision, and audio under a single autoregressive objective with minimal modality-specific design. As an industrial-strength foundation model, it excels at seeing, painting, and talking within a single framework, achieving strong performance across a wide range of multimodal benchmarks. In particular, LongCat-Next addresses the long-standing performance ceiling of discrete vision modeling on understanding tasks and provides a unified approach to effectively reconcile the conflict between understanding and generation. As an attempt toward native multimodality, we open-source LongCat-Next and its tokenizers, hoping to foster further research and development in the community. GitHub: this https URL

Title: Annotation-Free Detection of Drivable Areas and Curbs Leveraging LiDAR Point Cloud Maps

Authors: Fulong Ma, Daojie Peng, Jun Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.27553
Pdf URL: https://arxiv.org/pdf/2603.27553
Copy Paste: [[2603.27553]] Annotation-Free Detection of Drivable Areas and Curbs Leveraging LiDAR Point Cloud Maps(https://arxiv.org/abs/2603.27553)
Keywords: generation
Abstract: Drivable areas and curbs are critical traffic elements for autonomous driving, forming essential components of the vehicle visual perception system and ensuring driving safety. Deep neural networks (DNNs) have significantly improved perception performance for drivable area and curb detection, but most DNN-based methods rely on large manually labeled datasets, which are costly, time-consuming, and expert-dependent, limiting their real-world application. Thus, we developed an automated training data generation module. Our previous work generated training labels using single-frame LiDAR and RGB data, suffering from occlusion and distant point cloud sparsity. In this paper, we propose a novel map-based automatic data labeler (MADL) module, combining LiDAR mapping/localization with curb detection to automatically generate training data for both tasks. MADL avoids occlusion and point cloud sparsity issues via LiDAR mapping, creating accurate large-scale datasets for DNN training. In addition, we construct a data review agent to filter the data generated by the MADL module, eliminating low-quality samples. Experiments on the KITTI, KITTI-CARLA and 3D-Curb datasets show that MADL achieves impressive performance compared to manual labeling, and outperforms traditional and state-of-the-art self-supervised methods in robustness and accuracy.

Title: A Robust Low-Rank Prior Model for Structured Cartoon-Texture Image Decomposition with Heavy-Tailed Noise

Authors: Weihao Tang, Hongjin He
Subjects: cs.CV, math.OC
Abstract URL: https://arxiv.org/abs/2603.27579
Pdf URL: https://arxiv.org/pdf/2603.27579
Copy Paste: [[2603.27579]] A Robust Low-Rank Prior Model for Structured Cartoon-Texture Image Decomposition with Heavy-Tailed Noise(https://arxiv.org/abs/2603.27579)
Keywords: restoration
Abstract: Cartoon-texture image decomposition is a fundamental yet challenging problem in image processing. A significant hurdle in achieving accurate decomposition is the pervasive presence of noise in the observed images, which severely impedes robust results. To address the challenging problem of cartoon-texture decomposition in the presence of heavy-tailed noise, we propose in this paper a robust low-rank prior model. Our approach departs from conventional models by adopting the Huber loss function as the data-fidelity term, rather than the traditional $\ell_2$-norm, while retaining the total variation norm and nuclear norm to characterize the cartoon and texture components, respectively. Given the inherent structure, we employ two implementable operator splitting algorithms, tailored to different degradation operators. Extensive numerical experiments, particularly on image restoration tasks under high-intensity heavy-tailed noise, demonstrate the superior performance of our model.
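Illustration (not from the paper): the motivation for swapping the squared $\ell_2$ data-fidelity term for a Huber loss can be seen on a single residual. The sketch uses the standard Huber definition with a threshold `delta`; the default value of 1.0 is an arbitrary assumption, and the paper's full model (with the total variation and nuclear norm terms) and its splitting algorithms are not reproduced.

```python
def huber(residual, delta=1.0):
    """Standard Huber loss: quadratic near zero, linear in the tails."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * residual * residual
    return delta * (a - 0.5 * delta)


def half_squared(residual):
    """Half squared error, the per-pixel form of an l2 data-fidelity term."""
    return 0.5 * residual * residual
```

For a small residual such as 0.5 the two penalties agree (0.125 each), while for an outlier of 10 the Huber penalty is 9.5 against 50 for the squared penalty, so a few heavy-tailed noise samples pull on the fit far less.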

Title: You Only Erase Once: Erasing Anything without Bringing Unexpected Content

Authors: Yixing Zhu, Qing Zhang, Wenju Xu, Wei-Shi Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.27599
Pdf URL: https://arxiv.org/pdf/2603.27599
Copy Paste: [[2603.27599]] You Only Erase Once: Erasing Anything without Bringing Unexpected Content(https://arxiv.org/abs/2603.27599)
Keywords: generation
Abstract: We present YOEO, an approach for object erasure. Unlike recent diffusion-based methods, which struggle to erase target objects without generating unexpected content within the masked regions due to a lack of sufficient paired training data and explicit constraints on content generation, our method produces high-quality object erasure results free of unwanted objects or artifacts while faithfully preserving overall context coherence with the surrounding content. We achieve this goal by training an object erasure diffusion model on unpaired data containing only large-scale real-world images, under the supervision of a sundries detector and a context coherence loss that are built upon an entity segmentation model. To enable more efficient training and inference, a diffusion distillation strategy is employed to train a few-step erasure diffusion model. Extensive experiments show that our method outperforms the state-of-the-art object erasure methods. Code will be available at this https URL.

Title: OPRO: Orthogonal Panel-Relative Operators for Panel-Aware In-Context Image Generation

Authors: Sanghyeon Lee, Minwoo Lee, Euijin Shin, Kangyeol Kim, Seunghwan Choi, Jaegul Choo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.27637
Pdf URL: https://arxiv.org/pdf/2603.27637
Copy Paste: [[2603.27637]] OPRO: Orthogonal Panel-Relative Operators for Panel-Aware In-Context Image Generation(https://arxiv.org/abs/2603.27637)
Keywords: generation
Abstract: We introduce a parameter-efficient adaptation method for panel-aware in-context image generation with pre-trained diffusion transformers. The key idea is to compose learnable, panel-specific orthogonal operators onto the backbone's frozen positional encodings. This design provides two desirable properties: (1) isometry, which preserves the geometry of internal features, and (2) same-panel invariance, which maintains the model's pre-trained intra-panel synthesis behavior. Through controlled experiments, we demonstrate that the effectiveness of our adaptation method is not tied to a specific positional encoding design but generalizes across diverse positional encoding regimes. By enabling effective panel-relative conditioning, the proposed method consistently improves in-context image-based instructional editing pipelines, including state-of-the-art approaches.
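Illustration (not from the paper): the two properties named above follow from orthogonality alone, which a 2D rotation is enough to show. The sketch applies a per-panel rotation directly to query and key vectors; the paper composes its operators onto positional encodings, so this is a simplified stand-in, and the angles and function names are hypothetical.

```python
import math


def rotate(vec, angle):
    """Apply a 2D rotation, the simplest orthogonal operator."""
    c, s = math.cos(angle), math.sin(angle)
    x, y = vec
    return (c * x - s * y, s * x + c * y)


def dot(a, b):
    """Dot product of two 2D vectors (a stand-in for an attention logit)."""
    return a[0] * b[0] + a[1] * b[1]


def norm(a):
    """Euclidean length of a 2D vector."""
    return math.sqrt(dot(a, a))
```

Rotating a vector leaves its norm unchanged (isometry). Rotating a query and a key by the same panel angle leaves their dot product unchanged (same-panel invariance), whereas different panel angles change it, which is what lets the logit carry panel-relative information in this toy setting.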

Title: Test-Time Instance-Specific Parameter Composition: A New Paradigm for Adaptive Generative Modeling

Authors: Minh-Tuan Tran, Xuan-May Le, Quan Hung Tran, Mehrtash Harandi, Dinh Phung, Trung Le
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2603.27665
Pdf URL: https://arxiv.org/pdf/2603.27665
Copy Paste: [[2603.27665]] Test-Time Instance-Specific Parameter Composition: A New Paradigm for Adaptive Generative Modeling(https://arxiv.org/abs/2603.27665)
Keywords: generation, generative
Abstract: Existing generative models, such as diffusion and auto-regressive networks, are inherently static, relying on a fixed set of pretrained parameters to handle all inputs. In contrast, humans flexibly adapt their internal generative representations to each perceptual or imaginative context. Inspired by this capability, we introduce Composer, a new paradigm for adaptive generative modeling based on test-time instance-specific parameter composition. Composer generates input-conditioned parameter adaptations at inference time, which are injected into the pretrained model's weights, enabling per-input specialization without fine-tuning or retraining. Adaptation occurs once prior to multi-step generation, yielding higher-quality, context-aware outputs with minimal computational and memory overhead. Experiments show that Composer substantially improves performance across diverse generative models and use cases, including lightweight/quantized models and test-time scaling. By leveraging input-aware parameter composition, Composer establishes a new paradigm for designing generative models that dynamically adapt to each input, moving beyond static parameterization.

Title: Gated Condition Injection without Multimodal Attention: Towards Controllable Linear-Attention Transformers

Authors: Yuhe Liu, Zhenxiong Tan, Yujia Hu, Songhua Liu, Xinchao Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.27666
Pdf URL: https://arxiv.org/pdf/2603.27666
Copy Paste: [[2603.27666]] Gated Condition Injection without Multimodal Attention: Towards Controllable Linear-Attention Transformers(https://arxiv.org/abs/2603.27666)
Keywords: generation
Abstract: Recent advances in diffusion-based controllable visual generation have led to remarkable improvements in image quality. However, these powerful models are typically deployed on cloud servers due to their large computational demands, raising serious concerns about user data privacy. To enable secure and efficient on-device generation, we explore in this paper controllable diffusion models built upon linear attention architectures, which offer superior scalability and efficiency, even on edge devices. Yet, our experiments reveal that existing controllable generation frameworks, such as ControlNet and OminiControl, either lack the flexibility to support multiple heterogeneous condition types or suffer from slow convergence on such linear-attention models. To address these limitations, we propose a novel controllable diffusion framework tailored for linear attention backbones like SANA. The core of our method lies in a unified gated conditioning module working in a dual-path pipeline, which effectively integrates multi-type conditional inputs, such as spatially aligned and non-aligned cues. Extensive experiments on multiple tasks and benchmarks demonstrate that our approach achieves state-of-the-art controllable generation performance based on linear-attention models, surpassing existing methods in terms of fidelity and controllability.

Title: Customized Visual Storytelling with Unified Multimodal LLMs

Authors: Wei-Hua Li, Cheng Sun, Chu-Song Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.27690
Pdf URL: https://arxiv.org/pdf/2603.27690
Copy Paste: [[2603.27690]] Customized Visual Storytelling with Unified Multimodal LLMs(https://arxiv.org/abs/2603.27690)
Keywords: generation
Abstract: Multimodal story customization aims to generate coherent story flows conditioned on textual descriptions, reference identity images, and shot types. While recent progress in story generation has shown promising results, most approaches rely on text-only inputs. A few studies incorporate character identity cues (e.g., facial ID), but lack broader multimodal conditioning. In this work, we introduce VstoryGen, a multimodal framework that integrates descriptions with character and background references to enable customizable story generation. To enhance cinematic diversity, we introduce shot-type control via parameter-efficient prompt tuning on movie data, enabling the model to generate sequences that more faithfully reflect cinematic grammar. To evaluate our framework, we establish two new benchmarks that assess multimodal story customization from the perspectives of character and scene consistency, text-visual alignment, and shot-type control. Experiments demonstrate that VstoryGen achieves improved consistency and cinematic diversity compared to existing methods.

Title: LVRPO: Language-Visual Alignment with GRPO for Multimodal Understanding and Generation

Authors: Shentong Mo, Sukmin Yun
Subjects: cs.CV, cs.AI, cs.LG, cs.MA, cs.MM
Abstract URL: https://arxiv.org/abs/2603.27693
Pdf URL: https://arxiv.org/pdf/2603.27693
Copy Paste: [[2603.27693]] LVRPO: Language-Visual Alignment with GRPO for Multimodal Understanding and Generation(https://arxiv.org/abs/2603.27693)
Keywords: generation
Abstract: Unified multimodal pretraining has emerged as a promising paradigm for jointly modeling language and vision within a single foundation model. However, existing approaches largely rely on implicit or indirect alignment signals and remain suboptimal for simultaneously supporting multimodal understanding and generation, particularly in settings that require fine-grained language-visual reasoning and controllable generation. In this work, we propose LVRPO, a language-visual reinforcement-based preference optimization framework that explicitly aligns language and visual representations using Group Relative Policy Optimization (GRPO). Instead of introducing additional alignment losses at the representation level, LVRPO directly optimizes multimodal model behaviors through preference-driven reinforcement signals, encouraging consistent and semantically grounded interactions between language and vision across both understanding and generation tasks. This formulation enables effective alignment without requiring auxiliary encoders or handcrafted cross-modal objectives, and naturally extends to diverse multimodal capabilities. Empirically, LVRPO consistently outperforms strong unified-pretraining baselines on a broad suite of benchmarks spanning multimodal understanding, generation, and reasoning.
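
The group-relative signal that GRPO relies on can be sketched as follows. This is a minimal illustration of the generic GRPO advantage normalization, not LVRPO's reward design: the function name `group_relative_advantages`, the use of the population standard deviation, and the `eps` constant are all assumptions made for this example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples drawn for the same prompt.

    Each sample's advantage is its reward minus the group mean, divided by
    the group standard deviation (population std is an assumption here).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical responses to one prompt, scored by some preference signal.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Samples that score above the group mean receive a positive advantage and those below receive a negative one, so no separate value network is needed to form a baseline.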

Title: Can Unsupervised Segmentation Reduce Annotation Costs for Video Semantic Segmentation?

Authors: Samik Some, Vinay P. Namboodiri
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.27697
Pdf URL: https://arxiv.org/pdf/2603.27697
Copy Paste: [[2603.27697]] Can Unsupervised Segmentation Reduce Annotation Costs for Video Semantic Segmentation?(https://arxiv.org/abs/2603.27697)
Keywords: generation
Abstract: Present-day deep neural networks for video semantic segmentation require a large number of fine-grained pixel-level annotations to achieve the best possible results. Obtaining such annotations, however, is very expensive. On the other hand, raw, unannotated video frames are practically free to obtain. Similarly, coarse annotations, which do not require precise boundaries, are also much cheaper. This paper investigates approaches to reduce the annotation cost required for video segmentation datasets by utilising such resources. We show that using state-of-the-art segmentation foundation models, Segment Anything Model (SAM) and Segment Anything Model 2 (SAM 2), we can utilise both unannotated frames as well as coarse annotations to alleviate the effort required for manual annotation of video segmentation datasets by automating mask generation. Our investigation suggests that if used appropriately, we can reduce the need for annotation by a third with similar performance for video semantic segmentation. More significantly, our analysis suggests that the variety of frames in the dataset is more important than the number of frames for obtaining the best performance.

Title: Look, Compare and Draw: Differential Query Transformer for Automatic Oil Painting

Authors: Lingyu Liu, Yaxiong Wang, Li Zhu, Lizi Liao, Zhedong Zheng
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2603.27720
Pdf URL: https://arxiv.org/pdf/2603.27720
Copy Paste: [[2603.27720]] Look, Compare and Draw: Differential Query Transformer for Automatic Oil Painting(https://arxiv.org/abs/2603.27720)
Keywords: generation
Abstract: This work introduces a new approach to automatic oil painting that emphasizes the creation of dynamic and expressive brushstrokes. A pivotal challenge lies in mitigating duplicate and commonplace strokes, which often lead to less aesthetic outcomes. Inspired by the human painting process, i.e., observing, comparing, and drawing, we incorporate differential image analysis into a neural oil painting model, allowing the model to effectively concentrate on the incremental impact of successive brushstrokes. To operationalize this concept, we propose the Differential Query Transformer (DQ-Transformer), a new architecture that leverages differentially derived image representations enriched with positional encoding to guide the stroke prediction process. This integration enables the model to maintain heightened sensitivity to local details, resulting in more refined and nuanced stroke generation. Furthermore, we incorporate adversarial training into our framework, enhancing the accuracy of stroke prediction and thereby improving the overall realism and fidelity of the synthesized paintings. Extensive qualitative evaluations, complemented by a controlled user study, validate that our DQ-Transformer surpasses existing methods in both visual realism and artistic authenticity, typically achieving these results with fewer strokes. The stroke-by-stroke painting animations are available on our project website.

Title: TIR-Agent: Training an Explorative and Efficient Agent for Image Restoration

Authors: Yisheng Zhang, Guoli Jia, Haote Hu, Shanxu Zhao, Kaikai Zhao, Long Sun, Xinwei Long, Kai Tian, Che Jiang, Zhaoxiang Liu, Kai Wang, Shiguo Lian, Kaiyan Zhang, Bowen Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.27742
Pdf URL: https://arxiv.org/pdf/2603.27742
Copy Paste: [[2603.27742]] TIR-Agent: Training an Explorative and Efficient Agent for Image Restoration(https://arxiv.org/abs/2603.27742)
Keywords: restoration
Abstract: Vision-language agents that orchestrate specialized tools for image restoration (IR) have emerged as a promising method, yet most existing frameworks operate in a training-free manner. They rely on heuristic task scheduling and exhaustive tool traversal, resulting in sub-optimal restoration paths and prohibitive computational cost. We argue that the core bottleneck lies in the absence of a learned policy to make decisions, as a vision-language model cannot efficiently handle degradation-aware task ordering and tool composition. To this end, we propose TIR-Agent, a trainable image restoration agent that performs a direct tool-calling policy through a two-stage training pipeline of supervised fine-tuning (SFT) followed by reinforcement learning (RL). Two key designs underpin effective RL training: (i) a random perturbation strategy applied to the SFT data, which broadens the policy's exploration over task schedules and tool compositions, and (ii) a multi-dimensional adaptive reward mechanism that dynamically re-weights heterogeneous image quality metrics to mitigate reward hacking. To support high-throughput, asynchronous GPU-based tool invocation during training, we further develop a globally shared model-call pool. Experiments on both in-domain and out-of-domain degradations show that TIR-Agent outperforms 12 baselines, including 6 all-in-one models, 3 training-free agents, and 3 proprietary models, and achieves over 2.5× inference speedup by eliminating redundant tool executions.

Title: AI-Powered Facial Mask Removal Is Not Suitable For Biometric Identification

Authors: Emily A Cooper, Hany Farid
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.27747
Pdf URL: https://arxiv.org/pdf/2603.27747
Copy Paste: [[2603.27747]] AI-Powered Facial Mask Removal Is Not Suitable For Biometric Identification(https://arxiv.org/abs/2603.27747)
Keywords: generative
Abstract: Recently, crowd-sourced online criminal investigations have used generative AI to enhance low-quality visual evidence. In one high-profile case, social-media users circulated an "AI-unmasked" image of a federal agent involved in a fatal shooting, fueling a widespread misidentification. In response to this and similar incidents, we conducted a large-scale analysis evaluating the efficacy and risks of commercial AI-powered facial unmasking, specifically assessing whether the resulting faces can be reliably matched to true identities.

Title: When Surfaces Lie: Exploiting Wrinkle-Induced Attention Shift to Attack Vision-Language Models

Authors: Chengyin Hu, Xuemeng Sun, Jiajun Han, Qike Zhang, Xiang Chen, Xin Wang, Yiwei Wei, Jiahua Long
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.27759
Pdf URL: https://arxiv.org/pdf/2603.27759
Copy Paste: [[2603.27759]] When Surfaces Lie: Exploiting Wrinkle-Induced Attention Shift to Attack Vision-Language Models(https://arxiv.org/abs/2603.27759)
Keywords: generative
Abstract: Visual-Language Models (VLMs) have demonstrated exceptional cross-modal understanding across various tasks, including zero-shot classification, image captioning, and visual question answering. However, their robustness to physically plausible non-rigid deformations (such as wrinkles on flexible surfaces) remains poorly understood. In this work, we propose a parametric structural perturbation method inspired by the mechanics of three-dimensional fabric wrinkles. Specifically, our method generates photorealistic non-rigid perturbations by constructing multi-scale wrinkle fields and integrating displacement field distortion with surface-consistent appearance variations. To achieve an optimal balance between visual naturalness and adversarial effectiveness, we design a hierarchical fitness function in a low-dimensional parameter space and employ an optimization-based search strategy. We evaluate our approach using a two-stage framework: perturbations are first optimized on a zero-shot classification proxy task and subsequently assessed for transferability on generative tasks. Experimental results demonstrate that our method significantly degrades the performance of various state-of-the-art VLMs, consistently outperforming baselines in both image captioning and visual question-answering tasks.

Title: Inference-time Trajectory Optimization for Manga Image Editing

Authors: Ryosuke Furuta
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.27790
Pdf URL: https://arxiv.org/pdf/2603.27790
Copy Paste: [[2603.27790]] Inference-time Trajectory Optimization for Manga Image Editing(https://arxiv.org/abs/2603.27790)
Keywords: generation
Abstract: We present an inference-time adaptation method that tailors a pretrained image editing model to each input manga image using only the input image itself. Despite recent progress in pretrained image editing, such models often underperform on manga because they are trained predominantly on natural-image data. Re-training or fine-tuning large-scale models on manga is, however, generally impractical due to both computational cost and copyright constraints. To address this issue, our method slightly corrects the generation trajectory at inference time so that the input image can be reconstructed more faithfully under an empty prompt. Experimental results show that our method consistently outperforms existing baselines while incurring only negligible computational overhead.

Title: What-If Explanations Over Time: Counterfactuals for Time Series Classification

Authors: Udo Schlegel, Thomas Seidl
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2603.27792
Pdf URL: https://arxiv.org/pdf/2603.27792
Copy Paste: [[2603.27792]] What-If Explanations Over Time: Counterfactuals for Time Series Classification(https://arxiv.org/abs/2603.27792)
Keywords: generative
Abstract: Counterfactual explanations have emerged as a powerful approach in explainable AI, providing what-if scenarios that reveal how minimal changes to an input time series can alter the model's prediction. This work presents a survey of recent algorithms for counterfactual explanations for time series classification. We review state-of-the-art methods, spanning instance-based nearest-neighbor techniques, pattern-driven algorithms, gradient-based optimization, and generative models. For each, we discuss the underlying methodology, the models and classifiers they target, and the datasets on which they are evaluated. We highlight unique challenges in generating counterfactuals for temporal data, such as maintaining temporal coherence, plausibility, and actionable interpretability, which distinguish the temporal domain from tabular or image domains. We analyze the strengths and limitations of existing approaches and compare their effectiveness along key dimensions (validity, proximity, sparsity, plausibility, etc.). In addition, we provide an open-source library, Counterfactual Explanations for Time Series (CFTS), as a reference framework that includes many algorithms and evaluation metrics. We discuss this library's contributions in standardizing evaluation and enabling practical adoption of explainable time series techniques. Finally, based on the literature and identified gaps, we propose future research directions, including improved user-centered design, integration of domain knowledge, and counterfactuals for time series forecasting.
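
Three of the evaluation dimensions named in the abstract (validity, proximity, sparsity) can be made concrete with a short sketch. The definitions below follow one common convention and are not taken from the CFTS library. The function names, the toy classifier, and the example series are all hypothetical.

```python
import math


def proximity(x, cf):
    """Euclidean distance between the original series and the counterfactual."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, cf)))


def changed_fraction(x, cf, tol=1e-9):
    """Share of time steps that were modified (lower means a sparser edit)."""
    changed = sum(1 for a, b in zip(x, cf) if abs(a - b) > tol)
    return changed / len(x)


def is_valid(classify, x, cf):
    """A counterfactual is valid if it flips the classifier's prediction."""
    return classify(cf) != classify(x)


def toy_classifier(series):
    # Hypothetical stand-in classifier: label 1 if the mean is positive.
    return int(sum(series) / len(series) > 0)


x = [-1.0, -1.0, -1.0, -1.0]   # original series, classified as 0
cf = [-1.0, -1.0, 3.0, 3.0]    # candidate counterfactual, classified as 1
```

Here the counterfactual changes half of the time steps and flips the toy label. A method that reached the same flip with fewer or smaller changes would score better on sparsity and proximity.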

Title: Towards Emotion Recognition with 3D Pointclouds Obtained from Facial Expression Images

Authors: Laura Rayón Ropero, Jasper De Laet, Filip Lemic, Pau Sabater Nácher, Nabeel Nisar Bhat, Sergi Abadal, Jeroen Famaey, Eduard Alarcón, Xavier Costa-Pérez
Subjects: cs.CV, cs.AI, cs.ET, cs.HC, eess.IV
Abstract URL: https://arxiv.org/abs/2603.27798
Pdf URL: https://arxiv.org/pdf/2603.27798
Copy Paste: [[2603.27798]] Towards Emotion Recognition with 3D Pointclouds Obtained from Facial Expression Images(https://arxiv.org/abs/2603.27798)
Keywords: generation
Abstract: Facial Emotion Recognition is a critical research area within Affective Computing due to its wide-ranging applications in Human Computer Interaction, mental health assessment and fatigue monitoring. Current FER methods predominantly rely on Deep Learning techniques trained on 2D image data, which pose significant privacy concerns and are unsuitable for continuous, real-time monitoring. As an alternative, we propose High-Frequency Wireless Sensing (HFWS) as an enabler of continuous, privacy-aware FER, through the generation of detailed 3D facial pointclouds via on-person sensors embedded in wearables. We present arguments supporting the privacy advantages of HFWS over traditional 2D imaging, particularly under increasingly stringent data protection regulations. A major barrier to adopting HFWS for FER is the scarcity of labeled 3D FER datasets. Towards addressing this issue, we introduce a FLAME-based method to generate 3D facial pointclouds from existing public 2D datasets. Using this approach, we create AffectNet3D, a 3D version of the AffectNet database. To evaluate the quality and usability of the generated data, we design a pointcloud refinement pipeline focused on isolating the facial region, and train the popular PointNet++ model on the refined pointclouds. Fine-tuning the model on a small subset of the unseen 3D FER dataset BU-3DFE yields a classification accuracy exceeding 70%, comparable to oracle-level performance. To further investigate the potential of HFWS-based FER for continuous monitoring, we simulate wearable sensing conditions by masking portions of the generated pointclouds. Experimental results show that models trained on AffectNet3D and fine-tuned with just 25% of BU-3DFE outperform those trained solely on BU-3DFE. These findings highlight the viability of our pipeline and support the feasibility of continuous, privacy-aware FER via wearable HFWS systems.

Title: Diversity Matters: Dataset Diversification and Dual-Branch Network for Generalized AI-Generated Image Detection

Authors: Nusrat Tasnim, Kutub Uddin, Khalid Malik
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.27800
Pdf URL: https://arxiv.org/pdf/2603.27800
Copy Paste: [[2603.27800]] Diversity Matters: Dataset Diversification and Dual-Branch Network for Generalized AI-Generated Image Detection(https://arxiv.org/abs/2603.27800)
Keywords: generative
Abstract: The rapid proliferation of AI-generated images, powered by generative adversarial networks (GANs), diffusion models, and other synthesis techniques, has raised serious concerns about misinformation, copyright violations, and digital security. However, detecting such images in a generalized and robust manner remains a major challenge due to the vast diversity of generative models and data distributions. In this work, we present Diversity Matters, a novel framework that emphasizes data diversity and feature domain complementarity for AI-generated image detection. The proposed method introduces a feature-domain similarity filtering mechanism that discards redundant or highly similar samples across both inter-class and intra-class distributions, ensuring a more diverse and representative training set. Furthermore, we propose a dual-branch network that combines CLIP features from the pixel domain and the frequency domain to jointly capture semantic and structural cues, leading to improved generalization against unseen generative models and adversarial conditions. Extensive experiments on benchmark datasets demonstrate that the proposed approach significantly improves cross-model and cross-dataset performance compared to existing methods. Diversity Matters highlights the critical role of data and feature diversity in building reliable and robust detectors against the rapidly evolving landscape of synthetic content.
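
The idea of discarding highly similar samples in feature space can be sketched as a greedy filter. This is only an illustration under stated assumptions: cosine similarity, the greedy keep-first order, the `threshold` value, and the function names are choices made for this example and are not described in the abstract.

```python
import math


def cosine(a, b):
    """Cosine similarity of two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_similar(features, threshold=0.95):
    """Return indices of samples kept after dropping near-duplicates.

    A sample is kept only if its similarity to every already-kept sample
    stays below `threshold` (hypothetical value).
    """
    kept = []
    for i, feat in enumerate(features):
        if all(cosine(feat, features[j]) < threshold for j in kept):
            kept.append(i)
    return kept


# Toy 2-D features: the second vector is nearly parallel to the first.
kept_indices = filter_similar([[1.0, 0.0], [0.999, 0.01], [0.0, 1.0]])
```

The greedy pass is quadratic in the number of kept samples. A real pipeline over large image sets would likely need an approximate nearest-neighbor index instead.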

Title: Wan-R1: Verifiable-Reinforcement Learning for Video Reasoning

Authors: Ming Liu, Yunbei Zhang, Shilong Liu, Liwen Wang, Wensheng Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.27866
Pdf URL: https://arxiv.org/pdf/2603.27866
Copy Paste: [[2603.27866]] Wan-R1: Verifiable-Reinforcement Learning for Video Reasoning(https://arxiv.org/abs/2603.27866)
Keywords: generation
Abstract: Video generation models produce visually coherent content but struggle with tasks requiring spatial reasoning and multi-step planning. Reinforcement learning (RL) offers a path to improve generalization, but its effectiveness in video reasoning hinges on reward design, a challenge that has received little systematic study. We investigate this problem by adapting Group Relative Policy Optimization (GRPO) to flow-based video models and training them on maze-solving and robotic navigation tasks. We first show that multimodal reward models fail catastrophically in this setting. To address this, we design verifiable reward functions grounded in objective task metrics. For structured game environments, we introduce a multi-component trajectory reward. For robotic navigation, we propose an embedding-level verifiable reward. Our experiments show that RL fine-tuning with verifiable rewards improves generalization. For example, on complex 3D mazes, our model improves exact match accuracy by 29.1% over the SFT baseline, and on trap-avoidance tasks by 51.4%. Our systematic reward analysis reveals that verifiable rewards are critical for stable training, while multimodal reward models could lead to degenerate solutions. These findings establish verifiable reward design as a key enabler for robust video reasoning. Code will be publicly available.

Title: SAGE: Sink-Aware Grounded Decoding for Multimodal Hallucination Mitigation

Authors: Tripti Shukla, Zsolt Kira
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.27898
Pdf URL: https://arxiv.org/pdf/2603.27898
Copy Paste: [[2603.27898]] SAGE: Sink-Aware Grounded Decoding for Multimodal Hallucination Mitigation(https://arxiv.org/abs/2603.27898)
Keywords: generation
Abstract: Large vision-language models (VLMs) frequently suffer from hallucinations, generating content that is inconsistent with visual inputs. Existing methods typically address this problem through post-hoc filtering, additional training objectives, or external verification, but they do not intervene during the decoding process when hallucinations arise. In this work, we introduce SAGE, a Sink-Aware Grounded Decoding framework that mitigates hallucinations by dynamically modulating self-attention during generation. Hallucinations are strongly correlated with attention sink tokens, i.e., punctuation or function tokens that accumulate disproportionate attention despite carrying limited semantic content. SAGE leverages these tokens as anchors to monitor grounding reliability in real time. At each sink trigger, the method extracts semantic concepts from the generated sequence, estimates their visual grounding using both self-attention maps and gradient-based attribution, and measures their spatial agreement. Based on this signal, self-attention distributions are adaptively sharpened or broadened to reinforce grounded regions or suppress unreliable ones. Extensive experiments across diverse hallucination benchmarks demonstrate that SAGE consistently outperforms existing decoding strategies, achieving substantial reductions in hallucination while preserving descriptive coverage, without requiring model retraining or architectural modifications. Our method achieves an average relative improvement of 10.65% on MSCOCO and 7.19% on AMBER across diverse VLM architectures, demonstrating consistent gains in hallucination mitigation.
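
One generic way to sharpen or broaden an attention distribution is temperature scaling of the pre-softmax scores. The abstract does not state how SAGE implements this step, so the sketch below is an assumption throughout: the linear mapping from an agreement signal to a temperature, the bounds `low` and `high`, and the function names are hypothetical.

```python
import math


def softmax(scores, temperature=1.0):
    """Softmax with a temperature; values below 1 sharpen, above 1 broaden."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def modulate_attention(scores, agreement, low=0.5, high=2.0):
    """Choose a temperature from a grounding-agreement signal in [0, 1].

    Assumed rule: high agreement sharpens (temperature near `low`),
    low agreement broadens (temperature near `high`).
    """
    agreement = min(max(agreement, 0.0), 1.0)
    temperature = high - agreement * (high - low)
    return softmax(scores, temperature)


scores = [2.0, 1.0, 0.0]
sharpened = modulate_attention(scores, agreement=1.0)
broadened = modulate_attention(scores, agreement=0.0)
```

With these toy scores, the largest attention weight grows when agreement is high and shrinks when it is low. That matches the direction of the adjustment described in the abstract, though not necessarily its exact form.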

Title: ATLAS-RTC: Closing the Loop on LLM Agent Output with Token-Level Runtime Control

Authors: Christopher Cruz
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.27905
Pdf URL: https://arxiv.org/pdf/2603.27905
Copy Paste: [[2603.27905]] ATLAS-RTC: Closing the Loop on LLM Agent Output with Token-Level Runtime Control(https://arxiv.org/abs/2603.27905)
Keywords: generation
Abstract: We present ATLAS-RTC, a runtime control system for autoregressive language models that enforces structured output during decoding. ATLAS-RTC monitors generation at each step, detects drift from output contracts using lightweight signals, and applies targeted interventions such as biasing, masking, and rollback. Unlike post-hoc validation or static constrained decoding, it operates in a closed loop, enabling correction before errors materialize. Across structured generation and tool-calling tasks, ATLAS-RTC improves first-attempt success rates by 20 to 37.8 percentage points, with up to 88% latency reduction in failure-dominated settings. Results show that many failures arise from decoding artifacts rather than task misunderstanding, motivating runtime control as a distinct layer in LLM systems.
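
A closed-loop decoding step with masking can be sketched with a toy contract. This is not the ATLAS-RTC implementation and covers only the masking intervention, not biasing or rollback. The contract (output must look like `{` + digits + `}`), the toy model, and all function names are hypothetical.

```python
import math

DIGITS = set("0123456789")


def allowed_next(prefix):
    """Tokens permitted by the toy output contract given the text so far."""
    if prefix == "":
        return {"{"}
    if prefix == "{":
        return DIGITS
    return DIGITS | {"}"}


def mask_logits(logits, allowed):
    """Set disallowed tokens to -inf so greedy selection can never pick them."""
    return {tok: (score if tok in allowed else -math.inf)
            for tok, score in logits.items()}


def controlled_decode(step_fn, max_steps=16):
    """Greedy decoding that checks the contract before every token is emitted."""
    out = ""
    for _ in range(max_steps):
        masked = mask_logits(step_fn(out), allowed_next(out))
        token = max(masked, key=masked.get)
        if masked[token] == -math.inf:
            break  # no valid continuation is available
        out += token
        if token == "}":
            break
    return out


def toy_model(prefix):
    # Hypothetical model that always prefers the invalid token "x".
    return {"{": 0.1, "7": 0.5, "}": 0.3 * len(prefix), "x": 2.0}


result = controlled_decode(toy_model)
```

Without the mask, the toy model would emit `x` at every step. With it, the same logits yield a contract-conforming string on the first attempt, which is the kind of failure the abstract attributes to decoding artifacts rather than task misunderstanding.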

Title: FlashSign: Pose-Free Guidance for Efficient Sign Language Video Generation

Authors: Liuzhou Zhang, Zeyu Zhang, Biao Wu, Luyao Tang, Zirui Song, Hongyang He, Renda Han, Guangzhen Yao, Huacan Wang, Ronghao Chen, Xiuying Chen, Guan Huang, Zheng Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.27915
Pdf URL: https://arxiv.org/pdf/2603.27915
Copy Paste: [[2603.27915]] FlashSign: Pose-Free Guidance for Efficient Sign Language Video Generation(https://arxiv.org/abs/2603.27915)
Keywords: generation, generative
Abstract: Sign language plays a crucial role in bridging communication gaps between the deaf and hard-of-hearing communities. However, existing sign language video generation models often rely on complex intermediate representations, which limits their flexibility and efficiency. In this work, we propose a novel pose-free framework for real-time sign language video generation. Our method eliminates the need for intermediate pose representations by directly mapping natural language text to sign language videos using a diffusion-based approach. We introduce two key innovations: (1) a pose-free generative model based on a state-of-the-art diffusion backbone, which learns implicit text-to-gesture alignments without pose estimation, and (2) a Trainable Sliding Tile Attention (T-STA) mechanism that accelerates inference by exploiting spatio-temporal locality patterns. Unlike previous training-free sparsity approaches, T-STA integrates trainable sparsity into both training and inference, ensuring consistency and eliminating the train-test gap. This approach significantly reduces computational overhead while maintaining high generation quality, making real-time deployment feasible. Our method increases video generation speed by 3.07x without compromising video quality. Our contributions open new avenues for real-time, high-quality, pose-free sign language synthesis, with potential applications in inclusive communication tools for diverse communities. Code: this https URL.

Title: ForestSim: A Synthetic Benchmark for Intelligent Vehicle Perception in Unstructured Forest Environments

Authors: Pragat Wagle, Zheng Chen, Lantao Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.27923
Pdf URL: https://arxiv.org/pdf/2603.27923
Copy Paste: [[2603.27923]] ForestSim: A Synthetic Benchmark for Intelligent Vehicle Perception in Unstructured Forest Environments(https://arxiv.org/abs/2603.27923)
Keywords: generation
Abstract: Robust scene understanding is essential for intelligent vehicles operating in natural, unstructured environments. While semantic segmentation datasets for structured urban driving are abundant, the datasets for extremely unstructured wild environments remain scarce due to the difficulty and cost of generating pixel-accurate annotations. These limitations hinder the development of perception systems needed for intelligent ground vehicles tasked with forestry automation, agricultural robotics, disaster response, and all-terrain mobility. To address this gap, we present ForestSim, a high-fidelity synthetic dataset designed for training and evaluating semantic segmentation models for intelligent vehicles in forested off-road and no-road environments. ForestSim contains 2094 photorealistic images across 25 diverse environments, covering multiple seasons, terrain types, and foliage densities. Using Unreal Engine environments integrated with Microsoft AirSim, we generate consistent, pixel-accurate labels across 20 classes relevant to autonomous navigation. We benchmark ForestSim using state-of-the-art architectures and report strong performance despite the inherent challenges of unstructured scenes. ForestSim provides a scalable and accessible foundation for perception research supporting the next generation of intelligent off-road vehicles. The dataset and code are publicly available: Dataset: this https URL Code: this https URL

Title: Scaling Atomistic Protein Binder Design with Generative Pretraining and Test-Time Compute

Authors: Kieran Didi, Zuobai Zhang, Guoqing Zhou, Danny Reidenbach, Zhonglin Cao, Sooyoung Cha, Tomas Geffner, Christian Dallago, Jian Tang, Michael M. Bronstein, Martin Steinegger, Emine Kucukbenli, Arash Vahdat, Karsten Kreis
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.27950
Pdf URL: https://arxiv.org/pdf/2603.27950
Copy Paste: [[2603.27950]] Scaling Atomistic Protein Binder Design with Generative Pretraining and Test-Time Compute(https://arxiv.org/abs/2603.27950)
Keywords: generation, generative
Abstract: Protein interaction modeling is central to protein design, which has been transformed by machine learning with applications in drug discovery and beyond. In this landscape, structure-based de novo binder design is cast as either conditional generative modeling or sequence optimization via structure predictors ("hallucination"). We argue that this is a false dichotomy and propose Proteina-Complexa, a novel fully atomistic binder generation method unifying both paradigms. We extend recent flow-based latent protein generation architectures and leverage the domain-domain interactions of monomeric computationally predicted protein structures to construct Teddymer, a new large-scale dataset of synthetic binder-target pairs for pretraining. Combined with high-quality experimental multimers, this enables training a strong base model. We then perform inference-time optimization with this generative prior, unifying the strengths of previously distinct generative and hallucination methods. Proteina-Complexa sets a new state of the art in computational binder design benchmarks: it delivers markedly higher in-silico success rates than existing generative approaches, and our novel test-time optimization strategies greatly outperform previous hallucination methods under normalized compute budgets. We also demonstrate interface hydrogen bond optimization, fold class-guided binder generation, and extensions to small molecule targets and enzyme design tasks, again surpassing prior methods. Code, models and new data will be publicly released.

Title: MathGen: Revealing the Illusion of Mathematical Competence through Text-to-Image Generation

Authors: Ruiyao Liu, Hui Shen, Ping Zhang, Yunta Hsieh, Yifan Zhang, Jing Xu, Sicheng Chen, Junchen Li, Jiawei Lu, Jianing Ma, Jiaqi Mo, Qi Han, Zhen Zhang, Zhongwei Wan, Jing Xiong, Xin Wang, Ziyuan Liu, Hangrui Cao, Ngai Wong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.27959
Pdf URL: https://arxiv.org/pdf/2603.27959
Copy Paste: [[2603.27959]] MathGen: Revealing the Illusion of Mathematical Competence through Text-to-Image Generation(https://arxiv.org/abs/2603.27959)
Keywords: generation, generative
Abstract: Modern generative models have demonstrated the ability to solve challenging mathematical problems. In many real-world settings, however, mathematical solutions must be expressed visually through diagrams, plots, geometric constructions, and structured symbolic layouts, where correctness depends on precise visual composition. Can generative models still do so when the answer must be rendered visually rather than written in text? To study this problem, we introduce MathGen, a rigorous benchmark of 900 problems spanning seven core domains, each paired with an executable verifier under a Script-as-a-Judge protocol for deterministic and objective evaluation. Experiments on representative open-source and proprietary text-to-image (T2I) models show that mathematical fidelity remains a major bottleneck: even the best closed-source model reaches only 42.0% overall accuracy, while open-source models achieve roughly 1-11%, often near 0% on structured tasks. Overall, current T2I models remain far from competent at even elementary mathematical visual generation.

Title: RetinexDualV2: Physically-Grounded Dual Retinex for Generalized UHD Image Restoration

Authors: Mohab Kishawy, Jun Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.27979
Pdf URL: https://arxiv.org/pdf/2603.27979
Copy Paste: [[2603.27979]] RetinexDualV2: Physically-Grounded Dual Retinex for Generalized UHD Image Restoration(https://arxiv.org/abs/2603.27979)
Keywords: restoration
Abstract: We propose RetinexDualV2, a unified, physically grounded dual-branch framework for diverse Ultra-High-Definition (UHD) image restoration. Unlike generic models, our method employs a Task-Specific Physical Grounding Module (TS-PGM) to extract degradation-aware priors (e.g., rain masks and dark channels). These explicitly guide a Retinex decomposition network via a novel Physical-conditioned Multi-head Self-Attention (PC-MSA) mechanism, enabling robust reflection and illumination correction. This physical conditioning allows a single architecture to handle various complex degradations seamlessly, without task-specific structural modifications. RetinexDualV2 demonstrates exceptional generalizability, securing 4th place in the NTIRE 2026 Day and Night Raindrop Removal Challenge and 5th place in the Joint Noise Low-light Enhancement (JNLLIE) Challenge. Extensive experiments confirm the state-of-the-art performance and efficiency of our physically motivated approach.

Title: From Independent to Correlated Diffusion: Generalized Generative Modeling with Probabilistic Computers

Authors: Nihal Sanjay Singh, Mazdak Mohseni-Rajaee, Shaila Niazi, Kerem Y. Camsari
Subjects: cs.LG, cs.ET
Abstract URL: https://arxiv.org/abs/2603.27996
Pdf URL: https://arxiv.org/pdf/2603.27996
Copy Paste: [[2603.27996]] From Independent to Correlated Diffusion: Generalized Generative Modeling with Probabilistic Computers(https://arxiv.org/abs/2603.27996)
Keywords: generative
Abstract: Diffusion models have emerged as a powerful framework for generative tasks in deep learning. They decompose generative modeling into two computational primitives: deterministic neural-network evaluation and stochastic sampling. Current implementations usually place most computation in the neural network, but diffusion as a framework allows a broader range of choices for the stochastic transition kernel. Here, we generalize the stochastic sampling component by replacing independent noise injection with Markov chain Monte Carlo (MCMC) dynamics that incorporate known interaction structure. Standard independent diffusion is recovered as a special case when couplings are set to zero. By explicitly incorporating Ising couplings into the diffusion dynamics, the noising and denoising processes exploit spatial correlations representative of the target system. The resulting framework maps naturally onto probabilistic computers (p-computers) built from probabilistic bits (p-bits), which provide orders-of-magnitude advantages in sampling throughput and energy efficiency over GPUs. We demonstrate the approach on equilibrium states of the 2D ferromagnetic Ising model and the 3D Edwards-Anderson spin glass, showing that correlated diffusion produces samples in closer agreement with MCMC reference distributions than independent diffusion. More broadly, the framework shows that p-computers can enable new classes of diffusion algorithms that exploit structured probabilistic sampling for generative modeling.
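Illustrative note (not from the paper): the abstract describes replacing independent noise injection with MCMC dynamics that include Ising couplings, and says independent diffusion is recovered when the couplings are zero. The sketch below shows only the textbook Gibbs (p-bit style) update for a 2D ferromagnetic Ising lattice, where a spin becomes +1 with probability (1 + tanh(beta * I_i)) / 2 and I_i is its local field. The function names, lattice size, and parameter values are assumptions chosen for illustration; the authors' actual noising and denoising schedules are not reproduced here.

```python
import math
import random


def pbit_sweep(spins, J, h, beta, rng):
    """One sequential Gibbs sweep over +/-1 spins on a periodic L x L lattice.

    Each site acts like a p-bit: it is set to +1 with probability
    (1 + tanh(beta * I_i)) / 2, where I_i = J * (sum of 4 neighbours) + h.
    """
    L = len(spins)
    for i in range(L):
        for j in range(L):
            neighbours = (spins[(i - 1) % L][j] + spins[(i + 1) % L][j]
                          + spins[i][(j - 1) % L] + spins[i][(j + 1) % L])
            local_field = J * neighbours + h
            p_up = 0.5 * (1.0 + math.tanh(beta * local_field))
            spins[i][j] = 1 if rng.random() < p_up else -1
    return spins


def magnetization(spins):
    """Average spin value, in [-1, 1]."""
    n = len(spins) * len(spins[0])
    return sum(sum(row) for row in spins) / n


rng = random.Random(0)
L = 16  # hypothetical lattice size
start = [[rng.choice((-1, 1)) for _ in range(L)] for _ in range(L)]

# J = 0 and h = 0: every update is a fair coin flip that ignores neighbours,
# i.e. the noise is independent across sites.
independent = pbit_sweep([row[:] for row in start], J=0.0, h=0.0, beta=1.0, rng=rng)

# J > 0: each update is biased toward the neighbouring spins, so the noise
# carries the spatial correlations of the Ising model.
correlated = [row[:] for row in start]
for _ in range(50):
    pbit_sweep(correlated, J=1.0, h=0.0, beta=1.0, rng=rng)

print("independent |m| =", abs(magnetization(independent)))
print("correlated  |m| =", abs(magnetization(correlated)))
```

Setting `J=0.0` makes `local_field` independent of the neighbours, which is the sense in which the coupled update reduces to independent per-site noise in this toy version.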

Title: Drift-AR: Single-Step Visual Autoregressive Generation via Anti-Symmetric Drifting

Authors: Zhen Zou, Xiaoxiao Ma, Mingde Yao, Jie Huang, LinJiang Huang, Feng Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.28049
Pdf URL: https://arxiv.org/pdf/2603.28049
Copy Paste: [[2603.28049]] Drift-AR: Single-Step Visual Autoregressive Generation via Anti-Symmetric Drifting(https://arxiv.org/abs/2603.28049)
Keywords: generation
Abstract: Autoregressive (AR)-Diffusion hybrid paradigms combine AR's structured semantic modeling with diffusion's high-fidelity synthesis, yet suffer from a dual speed bottleneck: the sequential AR stage and the iterative multi-step denoising of the diffusion vision decoding stage. Existing methods address each in isolation without a unified design principle. We observe that the per-position prediction entropy of continuous-space AR models naturally encodes spatially varying generation uncertainty, which simultaneously governs draft prediction quality in the AR stage and reflects the corrective effort required by the vision decoding stage; this connection has not been fully explored before. Since entropy is inherently tied to both bottlenecks, it serves as a natural unifying signal for joint acceleration. In this work, we propose Drift-AR, which leverages the entropy signal to accelerate both stages: 1) for AR acceleration, we introduce Entropy-Informed Speculative Decoding, which aligns draft-target entropy distributions via a causal-normalized entropy loss, resolving the entropy mismatch that causes excessive draft rejection; 2) for visual decoder acceleration, we reinterpret entropy as the physical variance of the initial state for an anti-symmetric drifting field (high-entropy positions activate stronger drift toward the data manifold, while low-entropy positions yield vanishing drift), enabling single-step (1-NFE) decoding without iterative denoising or distillation. Moreover, both stages share the same entropy signal, which is computed once at no extra cost. Experiments on MAR, TransDiff, and NextStep-1 demonstrate 3.8-5.5$\times$ speedup with genuine 1-NFE decoding, matching or surpassing original quality. Code will be available at this https URL.

Title: From Vessel Trajectories to Safety-Critical Encounter Scenarios: A Generative AI Framework for Autonomous Ship Digital Testing

Authors: Sijin Sun, Liangbin Zhao, Ming Deng, Xiuju Fu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.28067
Pdf URL: https://arxiv.org/pdf/2603.28067
Copy Paste: [[2603.28067]] From Vessel Trajectories to Safety-Critical Encounter Scenarios: A Generative AI Framework for Autonomous Ship Digital Testing(https://arxiv.org/abs/2603.28067)
Keywords: generation, generative
Abstract: Digital testing has emerged as a key paradigm for the development and verification of autonomous maritime navigation systems, yet the availability of realistic and diverse safety-critical encounter scenarios remains limited. Existing approaches either rely on handcrafted templates, which lack realism, or extract cases directly from historical data, which cannot systematically expand rare high-risk situations. This paper proposes a data-driven framework that converts large-scale Automatic Identification System (AIS) trajectories into structured safety-critical encounter scenarios. The framework combines generative trajectory modeling with automated encounter pairing and temporal parameterization to enable scalable scenario construction while preserving real traffic characteristics. To enhance trajectory realism and robustness under noisy AIS observations, a multi-scale temporal variational autoencoder is introduced to capture vessel motion dynamics across different temporal resolutions. Experiments on real-world maritime traffic flows demonstrate that the proposed method improves trajectory fidelity and smoothness, maintains statistical consistency with observed data, and enables the generation of diverse safety-critical encounter scenarios beyond those directly recorded. The resulting framework provides a practical pathway for building scenario libraries to support digital testing, benchmarking, and safety assessment of autonomous navigation and intelligent maritime traffic management systems. Code is available at this https URL.

Title: AIBench: Evaluating Visual-Logical Consistency in Academic Illustration Generation

Authors: Zhaohe Liao, Kaixun Jiang, Zhihang Liu, Yujie Wei, Junqiu Yu, Quanhao Li, Hong-Tao Yu, Pandeng Li, Yuzheng Wang, Zhen Xing, Shiwei Zhang, Chen-Wei Xie, Yun Zheng, Xihui Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.28068
Pdf URL: https://arxiv.org/pdf/2603.28068
Copy Paste: [[2603.28068]] AIBench: Evaluating Visual-Logical Consistency in Academic Illustration Generation(https://arxiv.org/abs/2603.28068)
Keywords: generation
Abstract: Although image generation has boosted various applications via its rapid evolution, whether the state-of-the-art models are able to produce ready-to-use academic illustrations for papers is still largely this http URL comparing or evaluating the illustration with VLM is native but requires oracle multi-modal understanding ability, which is unreliable for long and complex texts and illustrations. To address this, we propose AIBench, the first benchmark that uses VQA to evaluate the logical correctness of academic illustrations and VLMs to assess their aesthetics. In detail, we design four levels of questions derived from a logic diagram summarized from the method section of the paper, which query whether the generated illustration aligns with the paper at different scales. Our VQA-based approach yields more accurate and detailed evaluations of visual-logical consistency while relying less on the ability of the judge VLM. With our high-quality AIBench, we conduct extensive experiments and conclude that the performance gap between models on this task is significantly larger than on general tasks, reflecting their varying complex-reasoning and high-density generation abilities. Further, the logic and aesthetics are hard to optimize simultaneously as in handcrafted illustrations. Additional experiments show that test-time scaling on both abilities significantly boosts the performance on this task.

Title: SIMR-NO: A Spectrally-Informed Multi-Resolution Neural Operator for Turbulent Flow Super-Resolution

Authors: Muhammad Abid, Omer San
Subjects: cs.LG, physics.flu-dyn
Abstract URL: https://arxiv.org/abs/2603.28073
Pdf URL: https://arxiv.org/pdf/2603.28073
Copy Paste: [[2603.28073]] SIMR-NO: A Spectrally-Informed Multi-Resolution Neural Operator for Turbulent Flow Super-Resolution(https://arxiv.org/abs/2603.28073)
Keywords: super-resolution
Abstract: Reconstructing high-resolution turbulent flow fields from severely under-resolved observations is a fundamental inverse problem in computational fluid dynamics and scientific machine learning. Classical interpolation methods fail to recover missing fine-scale structures, while existing deep learning approaches rely on convolutional architectures that lack the spectral and multiscale inductive biases necessary for physically faithful reconstruction at large upscaling factors. We introduce the Spectrally-Informed Multi-Resolution Neural Operator (SIMR-NO), a hierarchical operator learning framework that factorizes the ill-posed inverse mapping across intermediate spatial resolutions, combines deterministic interpolation priors with spectrally gated Fourier residual corrections at each stage, and incorporates local refinement modules to recover fine-scale spatial features beyond the truncated Fourier basis. The proposed method is evaluated on Kolmogorov-forced two-dimensional turbulence, where $128\times128$ vorticity fields are reconstructed from extremely coarse $8\times8$ observations representing a $16\times$ downsampling factor. Across 201 independent test realizations, SIMR-NO achieves a mean relative $\ell_2$ error of $26.04\%$ with the lowest error variance among all methods, reducing reconstruction error by $31.7\%$ over FNO, $26.0\%$ over EDSR, and $9.3\%$ over LapSRN. Beyond pointwise accuracy, SIMR-NO is the only method that faithfully reproduces the ground-truth energy and enstrophy spectra across the full resolved wavenumber range, demonstrating physically consistent super-resolution of turbulent flow fields.
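Illustrative note (not from the paper): the headline metric above is a relative $\ell_2$ error between a reconstructed $128\times128$ field and its ground truth, starting from an $8\times8$ observation. The sketch below computes that metric for a naive baseline on a synthetic field. The block-average coarsening, the nearest-neighbour upsampling, and the sine-cosine test field are all assumptions made for illustration; the abstract does not say how the coarse observations were produced, and none of the paper's numbers are reproduced here.

```python
import math


def relative_l2_error(pred, truth):
    """||pred - truth||_2 / ||truth||_2 over all grid points of a 2D field."""
    num = sum((p - t) ** 2 for pr, tr in zip(pred, truth) for p, t in zip(pr, tr))
    den = sum(t ** 2 for tr in truth for t in tr)
    return math.sqrt(num / den)


def block_average(field, factor):
    """Coarsen a square field by averaging non-overlapping factor x factor blocks."""
    n = len(field) // factor
    return [[sum(field[i * factor + a][j * factor + b]
                 for a in range(factor) for b in range(factor)) / factor ** 2
             for j in range(n)]
            for i in range(n)]


def nearest_upsample(coarse, factor):
    """Repeat every coarse value over a factor x factor block."""
    size = len(coarse) * factor
    return [[coarse[i // factor][j // factor] for j in range(size)]
            for i in range(size)]


N, FACTOR = 128, 16  # 128 x 128 field, 16x downsampling -> 8 x 8 observation

# Hypothetical smooth test field standing in for a vorticity snapshot.
truth = [[math.sin(2 * math.pi * 4 * i / N) * math.cos(2 * math.pi * 4 * j / N)
          for j in range(N)]
         for i in range(N)]

coarse = block_average(truth, FACTOR)
baseline = nearest_upsample(coarse, FACTOR)
err = relative_l2_error(baseline, truth)

print(f"coarse grid: {len(coarse)} x {len(coarse[0])}")
print(f"relative l2 error of nearest-neighbour baseline: {100 * err:.2f}%")
```

Averaging `relative_l2_error` over many test snapshots would give a mean relative error of the kind quoted in the abstract, under the assumption that the paper uses this standard definition.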

Title: LogiStory: A Logic-Aware Framework for Multi-Image Story Visualization

Authors: Chutian Meng, Fan Ma, Chi Zhang, Jiaxu Miao, Yi Yang, Yueting Zhuang
Subjects: cs.CV, cs.MA
Abstract URL: https://arxiv.org/abs/2603.28082
Pdf URL: https://arxiv.org/pdf/2603.28082
Copy Paste: [[2603.28082]] LogiStory: A Logic-Aware Framework for Multi-Image Story Visualization(https://arxiv.org/abs/2603.28082)
Keywords: generation
Abstract: Generating coherent and communicative visual sequences, such as image sequences and videos, remains a significant challenge for current multimodal systems. Despite advances in visual quality and the integration of world knowledge, existing models still struggle to maintain logical flow, often resulting in disjointed actions, fragmented narratives, and unclear storylines. We attribute these issues to the lack of attention to visual logic, a critical yet underexplored dimension of visual sequence generation that we define as the perceptual and causal coherence among characters, actions, and scenes over time. To bridge this gap, we propose a logic-aware multi-image story visualization framework, LogiStory. The framework is built around the central innovation of explicitly modeling visual logic in story visualization. To realize this idea, we design a multi-agent system that grounds roles, extracts causal chains, and verifies story-level consistency, transforming narrative coherence from an implicit byproduct of image generation into an explicit modeling objective. This design effectively bridges structured story planning with visual generation, enhancing both narrative clarity and visual quality in story visualization. Furthermore, to evaluate the generation capacity, we construct LogicTale, a benchmark comprising richly annotated stories, emphasizing causal reasoning, and visual logic interpretability. We establish comprehensive automatic and human evaluation protocols designed to measure both visual logic and perceptual quality. Experiments demonstrate that our approach significantly improves the narrative logic of generated visual stories. This work provides a foundational step towards modeling and enforcing visual logic in general image sequence and video generation tasks.

Title: GEMS: Agent-Native Multimodal Generation with Memory and Skills

Authors: Zefeng He, Siyuan Huang, Xiaoye Qu, Yafu Li, Tong Zhu, Yu Cheng, Yang Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.28088
Pdf URL: https://arxiv.org/pdf/2603.28088
Copy Paste: [[2603.28088]] GEMS: Agent-Native Multimodal Generation with Memory and Skills(https://arxiv.org/abs/2603.28088)
Keywords: generation, generative
Abstract: Recent multimodal generation models have achieved remarkable progress on general-purpose generation tasks, yet continue to struggle with complex instructions and specialized downstream tasks. Inspired by the success of advanced agent frameworks such as Claude Code, we propose GEMS (Agent-Native Multimodal GEneration with Memory and Skills), a framework that pushes beyond the inherent limitations of foundational models on both general and downstream tasks. GEMS is built upon three core components. Agent Loop introduces a structured multi-agent framework that iteratively improves generation quality through closed-loop optimization. Agent Memory provides a persistent, trajectory-level memory that hierarchically stores both factual states and compressed experiential summaries, enabling a global view of the optimization process while reducing redundancy. Agent Skill offers an extensible collection of domain-specific expertise with on-demand loading, allowing the system to effectively handle diverse downstream applications. Across five mainstream tasks and four downstream tasks, evaluated on multiple generative backends, GEMS consistently achieves significant performance gains. Most notably, it enables the lightweight 6B model Z-Image-Turbo to surpass the state-of-the-art Nano Banana 2 on GenEval2, demonstrating the effectiveness of agent harness in extending model capabilities beyond their original limits.

Title: Heddle: A Distributed Orchestration System for Agentic RL Rollout

Authors: Zili Zhang, Yinmin Zhong, Chengxu Yang, Chao Jin, Bingyang Wu, Xinming Wei, Yuliang Liu, Xin Jin
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.28101
Pdf URL: https://arxiv.org/pdf/2603.28101
Copy Paste: [[2603.28101]] Heddle: A Distributed Orchestration System for Agentic RL Rollout(https://arxiv.org/abs/2603.28101)
Keywords: generation
Abstract: Agentic Reinforcement Learning (RL) enables LLMs to solve complex tasks by alternating between a data-collection rollout phase and a policy training phase. During rollout, the agent generates trajectories, i.e., multi-step interactions between LLMs and external tools. Yet, frequent tool calls induce long-tailed trajectory generation that bottlenecks rollouts. This stems from step-centric designs that ignore trajectory context, triggering three system problems for long-tail trajectory generation: queueing delays, interference overhead, and inflated per-token time. We propose Heddle, a trajectory-centric system to optimize the when, where, and how of agentic rollout execution. Heddle integrates three core mechanisms: trajectory-level scheduling using runtime prediction and progressive priority to minimize cumulative queueing; trajectory-aware placement via presorted dynamic programming and opportunistic migration during idle tool call intervals to minimize interference; and a trajectory-adaptive resource manager that dynamically tunes model parallelism to reduce the per-token time of long-tail trajectories while maintaining high throughput for short trajectories. Evaluations across diverse agentic RL workloads demonstrate that Heddle effectively neutralizes the long-tail bottleneck, achieving up to 2.5$\times$ higher end-to-end rollout throughput compared to state-of-the-art baselines.

Title: ORACAL: A Robust and Explainable Multimodal Framework for Smart Contract Vulnerability Detection with Causal Graph Enrichment

Authors: Tran Duong Minh Dai, Triet Huynh Minh Le, M. Ali Babar, Van-Hau Pham, Phan The Duy
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2603.28128
Pdf URL: https://arxiv.org/pdf/2603.28128
Copy Paste: [[2603.28128]] ORACAL: A Robust and Explainable Multimodal Framework for Smart Contract Vulnerability Detection with Causal Graph Enrichment(https://arxiv.org/abs/2603.28128)
Keywords: generation
Abstract: Although Graph Neural Networks (GNNs) have shown promise for smart contract vulnerability detection, they still face significant limitations. Homogeneous graph models fail to capture the interplay between control flow and data dependencies, while heterogeneous graph approaches often lack deep semantic understanding, leaving them susceptible to adversarial attacks. Moreover, most black-box models fail to provide explainable evidence, hindering trust in professional audits. To address these challenges, we propose ORACAL (Observable RAG-enhanced Analysis with CausAL reasoning), a heterogeneous multimodal graph learning framework that integrates Control Flow Graph (CFG), Data Flow Graph (DFG), and Call Graph (CG). ORACAL selectively enriches critical subgraphs with expert-level security context from Retrieval-Augmented Generation (RAG) and Large Language Models (LLMs), and employs a causal attention mechanism to disentangle true vulnerability indicators from spurious correlations. For transparency, the framework adopts PGExplainer to generate subgraph-level explanations identifying vulnerability triggering paths. Experiments on large-scale datasets demonstrate that ORACAL achieves state-of-the-art performance, outperforming MANDO-HGT, MTVHunter, GNN-SC, and SCVHunter by up to 39.6 percentage points, with a peak Macro F1 of 91.28% on the primary benchmark. ORACAL maintains strong generalization on out-of-distribution datasets with 91.8% on CGT Weakness and 77.1% on DAppScan. In explainability evaluation, PGExplainer achieves 32.51% Mean Intersection over Union (MIoU) against manually annotated vulnerability triggering paths. Under adversarial attacks, ORACAL limits performance degradation to approximately 2.35% F1 decrease with an Attack Success Rate (ASR) of only 3%, surpassing SCVHunter and MANDO-HGT which exhibit ASRs ranging from 10.91% to 18.73%.
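Illustrative note (not from the paper): the abstract reports explanation quality as Mean Intersection over Union (MIoU) against annotated vulnerability triggering paths, and robustness as an Attack Success Rate (ASR). The sketch below shows one common way to compute both. Treating a path as a set of node identifiers, and defining ASR as the fraction of originally correct predictions that an attack flips, are assumptions; the abstract does not spell out either definition, and the sample data are hypothetical.

```python
def iou(predicted, annotated):
    """Intersection over Union of two collections of node (or edge) identifiers."""
    pred, gold = set(predicted), set(annotated)
    union = pred | gold
    # Two empty sets are treated as a perfect match.
    return len(pred & gold) / len(union) if union else 1.0


def mean_iou(pairs):
    """Average IoU over (predicted, annotated) pairs, one pair per contract."""
    return sum(iou(pred, gold) for pred, gold in pairs) / len(pairs)


def attack_success_rate(clean_correct, adversarial_correct):
    """Fraction of samples that were correct before the attack and wrong after it."""
    after_attack = [adv for clean, adv in zip(clean_correct, adversarial_correct) if clean]
    return sum(1 for adv in after_attack if not adv) / len(after_attack)


# Hypothetical explanation subgraphs versus hand-annotated triggering paths.
explanations = [
    ({"call_1", "store_2", "jump_3"}, {"store_2", "jump_3", "send_4"}),
    ({"load_7"}, {"load_7"}),
]
print("MIoU:", mean_iou(explanations))

# Hypothetical per-sample correctness before and after an adversarial edit.
clean = [True, True, True, False]
attacked = [True, False, True, False]
print("ASR:", attack_success_rate(clean, attacked))
```

Only samples the detector classified correctly before the attack enter the ASR denominator in this version, so a detector that was already wrong on a sample is not counted as a successful attack.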

Title: ColorFLUX: A Structure-Color Decoupling Framework for Old Photo Colorization

Authors: Bingchen Li, Zhixin Wang, Fan Li, Jiaqi Xu, Jiaming Guo, Renjing Pei, Xin Li, Zhibo Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.28162
Pdf URL: https://arxiv.org/pdf/2603.28162
Copy Paste: [[2603.28162]] ColorFLUX: A Structure-Color Decoupling Framework for Old Photo Colorization(https://arxiv.org/abs/2603.28162)
Keywords: restoration, generative
Abstract: Old photos preserve invaluable historical memories, making their restoration and colorization highly desirable. While existing restoration models can address some degradation issues like denoising and scratch removal, they often struggle with accurate colorization. This limitation arises from the unique degradation inherent in old photos, such as faded brightness and altered color hues, which are different from modern photo distributions, creating a substantial domain gap during colorization. In this paper, we propose a novel old photo colorization framework based on the generative diffusion model FLUX. Our approach introduces a structure-color decoupling strategy that separates structure preservation from color restoration, enabling accurate colorization of old photos while maintaining structural consistency. We further enhance the model with a progressive Direct Preference Optimization (Pro-DPO) strategy, which allows the model to learn subtle color preferences through coarse-to-fine transitions in color augmentation. Additionally, we address the limitations of text-based prompts by introducing visual semantic prompts, which extract fine-grained semantic information directly from old photos, helping to eliminate the color bias inherent in old photos. Experimental results on both synthetic and real datasets demonstrate that our approach outperforms existing state-of-the-art colorization methods, including closed-source commercial models, producing high-quality and vivid colorization.

Title: Automating Early Disease Prediction Via Structured and Unstructured Clinical Data

Authors: Ane G Domingo-Aldama, Marcos Merino Prado, Alain García Olea, Josu Goikoetxea, Koldo Gojenola, Aitziber Atutxa
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.28167
Pdf URL: https://arxiv.org/pdf/2603.28167
Copy Paste: [[2603.28167]] Automating Early Disease Prediction Via Structured and Unstructured Clinical Data(https://arxiv.org/abs/2603.28167)
Keywords: generation
Abstract: This study presents a fully automated methodology for early prediction studies in clinical settings, leveraging information extracted from unstructured discharge reports. The proposed pipeline uses discharge reports to support the three main steps of early prediction: cohort selection, dataset generation, and outcome labeling. By processing discharge reports with natural language processing techniques, we can efficiently identify relevant patient cohorts, enrich structured datasets with additional clinical variables, and generate high-quality labels without manual intervention. This approach addresses the frequent issue of missing or incomplete data in codified electronic health records (EHR), capturing clinically relevant information that is often underrepresented. We evaluate the methodology in the context of predicting atrial fibrillation (AF) progression, showing that predictive models trained on datasets enriched with discharge report information achieve higher accuracy and correlation with true outcomes compared to models trained solely on structured EHR data, while also surpassing traditional clinical scores. These results demonstrate that automating the integration of unstructured clinical text can streamline early prediction studies, improve data quality, and enhance the reliability of predictive models for clinical decision-making.

Title: ToLL: Topological Layout Learning with Structural Multi-view Augmentation for 3D Scene Graph Pretraining

Authors: Yucheng Huang, Luping Ji, Xiangwei Jiang, Wen Li, Mao Ye
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.28178
Pdf URL: https://arxiv.org/pdf/2603.28178
Copy Paste: [[2603.28178]] ToLL: Topological Layout Learning with Structural Multi-view Augmentation for 3D Scene Graph Pretraining(https://arxiv.org/abs/2603.28178)
Keywords: generation
Abstract: 3D Scene Graph (3DSG) generation plays a pivotal role in spatial understanding and semantic-affordance perception. However, its generalizability is often constrained by data scarcity. Current solutions primarily focus on cross-modal assisted representation learning and object-centric generation pre-training. The former relies heavily on predicate annotations, while the latter's predicate learning may be bypassed due to strong object priors. Consequently, they often cannot provide a label-free and robust self-supervised proxy task for 3DSG fine-tuning. To bridge this gap, we propose Topological Layout Learning (ToLL), a pretraining framework for 3DSG. In detail, we design an Anchor-Conditioned Topological Geometry Reasoning module, in which a GNN recovers the global layout of zero-centered subgraphs from the spatial priors of sparse anchors. This process is strictly modulated by predicate features, thereby enforcing predicate relation learning. Furthermore, we construct a Structural Multi-view Augmentation to avoid semantic corruption and enhance representations via self-distillation. Extensive experiments on the 3DSSG dataset demonstrate that ToLL can improve representation quality, outperforming state-of-the-art baselines.

Title: MR-ImagenTime: Multi-Resolution Time Series Generation through Dual Image Representations

Authors: Xianyong Xu, Yuanjun Zuo, Zhihong Huang, Yihan Qin, Haoxian Xu, Leilei Du, Haotian Wang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.28253
Pdf URL: https://arxiv.org/pdf/2603.28253
Copy Paste: [[2603.28253]] MR-ImagenTime: Multi-Resolution Time Series Generation through Dual Image Representations(https://arxiv.org/abs/2603.28253)
Keywords: generation
Abstract: Time series forecasting is vital across many domains, yet existing models struggle with fixed-length inputs and inadequate multi-scale modeling. We propose MR-CDM, a framework combining hierarchical multi-resolution trend decomposition, an adaptive embedding mechanism for variable-length inputs, and a multi-scale conditional diffusion process. Evaluations on four real-world datasets demonstrate that MR-CDM significantly outperforms state-of-the-art baselines (e.g., CSDI, Informer), reducing MAE and RMSE by approximately 6-10 to a certain degree.

Title: Integrating Multimodal Large Language Model Knowledge into Amodal Completion

Authors: Heecheol Yun, Eunho Yang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.28333
Pdf URL: https://arxiv.org/pdf/2603.28333
Copy Paste: [[2603.28333]] Integrating Multimodal Large Language Model Knowledge into Amodal Completion(https://arxiv.org/abs/2603.28333)
Keywords: generation, generative
Abstract: With the widespread adoption of autonomous vehicles and robotics, amodal completion, which reconstructs the occluded parts of people and objects in an image, has become increasingly crucial. Just as humans infer hidden regions based on prior experience and common sense, this task inherently requires physical knowledge about real-world entities. However, existing approaches either depend solely on the image generation ability of visual generative models, which lack such knowledge, or leverage it only during the segmentation stage, preventing it from explicitly guiding the completion process. To address this, we propose AmodalCG, a novel framework that harnesses the real-world knowledge of Multimodal Large Language Models (MLLMs) to guide amodal completion. Our framework first assesses the extent of occlusion to selectively invoke MLLM guidance only when the target object is heavily occluded. If guidance is required, the framework further incorporates MLLMs to reason about both the (1) extent and (2) content of the missing regions. Finally, a visual generative model integrates this guidance and iteratively refines imperfect completions that may arise from inaccurate MLLM guidance. Experimental results on various real-world images show impressive improvements compared to all existing works, suggesting MLLMs as a promising direction for addressing challenging amodal completion.

Title: VistaGEN: Consistent Driving Video Generation with Fine-Grained Control Using Multiview Visual-Language Reasoning

Authors: Li-Heng Chen, Ke Cheng, Yahui Liu, Lei Shi, Shi-Sheng Huang, Hongbo Fu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.28353
Pdf URL: https://arxiv.org/pdf/2603.28353
Copy Paste: [[2603.28353]] VistaGEN: Consistent Driving Video Generation with Fine-Grained Control Using Multiview Visual-Language Reasoning(https://arxiv.org/abs/2603.28353)
Keywords: generation
Abstract: Driving video generation has achieved much progress in controllability, video resolution, and length, but fails to support fine-grained object-level controllability for diverse driving videos while preserving spatiotemporal consistency, especially in long video generation. In this paper, we present a new driving video generation technique, called VistaGEN, which enables fine-grained control of specific entities, including 3D objects, images, and text descriptions, while maintaining spatiotemporal consistency in long video sequences. Our key innovation is the incorporation of multiview visual-language reasoning into long driving video generation. To this end, we inject visual-language features into a multiview video generator to enable fine-grained controllability. More importantly, we propose a multiview vision-language evaluator (MV-VLM) to intelligently and automatically evaluate the spatiotemporal consistency of the generated content, thus formulating a novel generation-evaluation-regeneration closed-loop generation mechanism. This mechanism ensures high-quality, coherent outputs, facilitating the creation of complex and reliable driving scenarios. In addition, within the closed-loop generation, we introduce an object-level refinement module that refines the results the MV-VLM judges unsatisfactory and then feeds them back to the video generator for regeneration. Extensive evaluation shows that our VistaGEN achieves diverse driving video generation results with fine-grained controllability, especially for long-tail objects, and much better spatiotemporal consistency than previous approaches.

Title: AutoCut: End-to-end advertisement video editing based on multimodal discretization and controllable generation

Authors: Milton Zhou, Sizhong Qin, Yongzhi Li, Quan Chen, Peng Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.28366
Pdf URL: https://arxiv.org/pdf/2603.28366
Copy Paste: [[2603.28366]] AutoCut: End-to-end advertisement video editing based on multimodal discretization and controllable generation(https://arxiv.org/abs/2603.28366)
Keywords: generation
Abstract: Short-form videos have become a primary medium for digital advertising, requiring scalable and efficient content creation. However, current workflows and AI tools remain disjoint and modality-specific, leading to high production costs and low overall efficiency. To address this issue, we propose AutoCut, an end-to-end advertisement video editing framework based on multimodal discretization and controllable editing. AutoCut employs dedicated encoders to extract video and audio features, then applies residual vector quantization to discretize them into unified tokens aligned with textual representations, constructing a shared video-audio-text token space. Built upon a foundation model, we further develop a multimodal large language model for video editing through combined multimodal alignment and supervised fine-tuning, supporting tasks covering video selection and ordering, script generation, and background music selection within a unified editing framework. Finally, a complete production pipeline converts the predicted token sequences into deployable long video outputs. Experiments on real-world advertisement datasets show that AutoCut reduces production cost and iteration time while substantially improving consistency and controllability, paving the way for scalable video creation.
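The residual vector quantization step mentioned in the AutoCut abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the codebook values, the two-stage depth, the feature vector, and the function names (`nearest`, `rvq_encode`, `rvq_decode`) are all assumptions chosen for illustration, and the abstract does not say how AutoCut's codebooks are built or trained.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )


def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage encodes the previous stage's residual."""
    residual = list(vec)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens


def rvq_decode(tokens, codebooks):
    """Reconstruct an approximation by summing the selected entry from each stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical two-stage codebooks (coarse, then fine) for a 2-D feature.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25]],
]
feature = [1.2, 1.0]  # hypothetical encoder output
tokens = rvq_encode(feature, codebooks)
approx = rvq_decode(tokens, codebooks)
```

Each stage contributes one discrete token, so a continuous feature becomes a short token sequence that a language model could consume alongside text tokens.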

Title: Rethinking Structure Preservation in Text-Guided Image Editing with Visual Autoregressive Models

Authors: Tao Xia, Jiawei Liu, Yukun Zhang, Ting Liu, Wei Wang, Lei Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.28367
Pdf URL: https://arxiv.org/pdf/2603.28367
Copy Paste: [[2603.28367]] Rethinking Structure Preservation in Text-Guided Image Editing with Visual Autoregressive Models(https://arxiv.org/abs/2603.28367)
Keywords: generative
Abstract: Visual autoregressive (VAR) models have recently emerged as a promising family of generative models, enabling a wide range of downstream vision tasks such as text-guided image editing. By shifting the editing paradigm from noise manipulation in diffusion-based methods to token-level operations, VAR-based approaches achieve better background preservation and significantly faster inference. However, existing VAR-based editing methods still face two key challenges: accurately localizing editable tokens and maintaining structural consistency in the edited results. In this work, we propose a novel text-guided image editing framework rooted in an analysis of intermediate feature distributions within VAR models. First, we introduce a coarse-to-fine token localization strategy that can refine editable regions, balancing editing fidelity and background preservation. Second, we analyze the intermediate representations of VAR models and identify structure-related features, by which we design a simple yet effective feature injection mechanism to enhance structural consistency between the edited and source images. Third, we develop a reinforcement learning-based adaptive feature injection scheme that automatically learns scale- and layer-specific injection ratios to jointly optimize editing fidelity and structure preservation. Extensive experiments demonstrate that our method achieves superior structural consistency and editing quality compared with state-of-the-art approaches, across both local and global editing scenarios.

Title: EdgeDiT: Hardware-Aware Diffusion Transformers for Efficient On-Device Image Generation

Authors: Sravanth Kodavanti, Manjunath Arveti, Sowmya Vajrala, Srinivas Miriyala, Vikram N R
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.28405
Pdf URL: https://arxiv.org/pdf/2603.28405
Copy Paste: [[2603.28405]] EdgeDiT: Hardware-Aware Diffusion Transformers for Efficient On-Device Image Generation(https://arxiv.org/abs/2603.28405)
Keywords: generation, generative
Abstract: Diffusion Transformers (DiT) have established a new state-of-the-art in high-fidelity image synthesis; however, their massive computational complexity and memory requirements hinder local deployment on resource-constrained edge devices. In this paper, we introduce EdgeDiT, a family of hardware-efficient generative transformers specifically engineered for mobile Neural Processing Units (NPUs), such as the Qualcomm Hexagon and Apple Neural Engine (ANE). By leveraging a hardware-aware optimization framework, we systematically identify and prune structural redundancies within the DiT backbone that are particularly taxing for mobile data-flows. Our approach yields a series of lightweight models that achieve a 20-30% reduction in parameters, a 36-46% decrease in FLOPs, and a 1.65-fold reduction in on-device latency without sacrificing the scaling advantages or the expressive capacity of the original transformer architecture. Extensive benchmarking demonstrates that EdgeDiT offers a superior Pareto-optimal trade-off between Frechet Inception Distance (FID) and inference latency compared to both optimized mobile U-Nets and vanilla DiT variants. By enabling responsive, private, and offline generative AI directly on-device, EdgeDiT provides a scalable blueprint for transitioning large-scale foundation models from high-end GPUs to the palm of the user.

Title: Unified Restoration-Perception Learning: Maritime Infrared-Visible Image Fusion and Segmentation

Authors: Weichao Cai, Weiliang Huang, Biao Xue, Chao Huang, Fei Yuan, Bob Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.28414
Pdf URL: https://arxiv.org/pdf/2603.28414
Copy Paste: [[2603.28414]] Unified Restoration-Perception Learning: Maritime Infrared-Visible Image Fusion and Segmentation(https://arxiv.org/abs/2603.28414)
Keywords: restoration
Abstract: Marine scene understanding and segmentation play a vital role in maritime monitoring and navigation safety. However, prevalent factors like fog and strong reflections in maritime environments cause severe image degradation, significantly compromising the stability of semantic perception. Existing restoration and enhancement methods typically target specific degradations or focus solely on visual quality, lacking end-to-end collaborative mechanisms that simultaneously improve structural recovery and semantic effectiveness. Moreover, publicly available infrared-visible datasets are predominantly collected from urban scenes, failing to capture the authentic characteristics of coupled degradations in marine environments. To address these challenges, the Infrared-Visible Maritime Ship Dataset (IVMSD) is proposed to cover various maritime scenarios under diverse weather and illumination conditions. Building upon this dataset, a Multi-task Complementary Learning Framework (MCLF) is proposed to collaboratively perform image restoration, multimodal fusion, and semantic segmentation within a unified architecture. The framework includes a Frequency-Spatial Enhancement Complementary (FSEC) module for degradation suppression and structural enhancement, a Semantic-Visual Consistency Attention (SVCA) module for semantic-consistent guidance, and a cross-modality guided attention mechanism for selective fusion. Experimental results on IVMSD demonstrate that the proposed method achieves state-of-the-art segmentation performance, significantly enhancing robustness and perceptual quality under complex maritime conditions.

Title: Evolutionary Discovery of Reinforcement Learning Algorithms via Large Language Models

Authors: Alkis Sygkounas, Amy Loutfi, Andreas Persson
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.28416
Pdf URL: https://arxiv.org/pdf/2603.28416
Copy Paste: [[2603.28416]] Evolutionary Discovery of Reinforcement Learning Algorithms via Large Language Models(https://arxiv.org/abs/2603.28416)
Keywords: generative
Abstract: Reinforcement learning algorithms are defined by their learning update rules, which are typically hand-designed and fixed. We present an evolutionary framework for discovering reinforcement learning algorithms by searching directly over executable update rules that implement complete training procedures. The approach builds on REvolve, an evolutionary system that uses large language models as generative variation operators, and extends it from reward-function discovery to algorithm discovery. To promote the emergence of nonstandard learning rules, the search excludes canonical mechanisms such as actor-critic structures, temporal-difference losses, and value bootstrapping. Because reinforcement learning algorithms are highly sensitive to internal scalar parameters, we introduce a post-evolution refinement stage in which a large language model proposes feasible hyperparameter ranges for each evolved update rule. Evaluated end-to-end by full training runs on multiple Gymnasium benchmarks, the discovered algorithms achieve competitive performance relative to established baselines, including SAC, PPO, DQN, and A2C.

Title: $R_{dm}$: Re-conceptualizing Distribution Matching as a Reward for Diffusion Distillation

Authors: Linqian Fan, Peiqin Sun, Tiancheng Wen, Shun Lu, Chengru Song
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2603.28460
Pdf URL: https://arxiv.org/pdf/2603.28460
Copy Paste: [[2603.28460]] $R_{dm}$: Re-conceptualizing Distribution Matching as a Reward for Diffusion Distillation(https://arxiv.org/abs/2603.28460)
Keywords: generation, generative
Abstract: Diffusion models achieve state-of-the-art generative performance but are fundamentally bottlenecked by their slow iterative sampling process. While diffusion distillation techniques enable high-fidelity few-step generation, traditional objectives often restrict the student's performance by anchoring it solely to the teacher. Recent approaches have attempted to break this ceiling by integrating Reinforcement Learning (RL), typically through a simple summation of distillation and RL objectives. In this work, we propose a novel paradigm by reconceptualizing distribution matching as a reward, denoted as $R_{dm}$. This unified perspective bridges the algorithmic gap between Diffusion Matching Distillation (DMD) and RL, providing several key benefits. (1) Enhanced optimization stability: we introduce Group Normalized Distribution Matching (GNDM), which adapts standard RL group normalization to stabilize $R_{dm}$ estimation. By leveraging group-mean statistics, GNDM establishes a more robust and effective optimization direction. (2) Seamless reward integration: our reward-centric formulation inherently supports adaptive weighting mechanisms, allowing flexible combination of DMD with external reward models. (3) Improved sampling efficiency: by aligning with RL principles, the framework readily incorporates importance sampling (IS), leading to a significant boost in sampling efficiency. Extensive experiments demonstrate that GNDM outperforms vanilla DMD, reducing the FID by 1.87. Furthermore, our multi-reward variant, GNDMR, surpasses existing baselines by achieving a strong balance between aesthetic quality and fidelity, reaching a peak HPS of 30.37 and a low FID-SD of 12.21. Overall, $R_{dm}$ provides a flexible, stable, and efficient framework for real-time high-fidelity synthesis. Code will be released upon publication.
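The group normalization that the $R_{dm}$ abstract says GNDM adapts from RL can be sketched as follows. This shows only the generic RL-style step (center each reward on its group mean and scale by the group standard deviation). The abstract does not give GNDM's exact formula, so the function name `group_normalize`, the `eps` constant, and the reward values are assumptions for illustration.

```python
from statistics import mean, pstdev


def group_normalize(rewards, eps=1e-8):
    """Center and scale rewards within one group of samples.

    Generic RL-style group normalization; how GNDM applies this to the
    distribution-matching reward is not specified in the abstract.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reward values for four samples generated from the same prompt.
group = [0.2, 0.5, 0.1, 0.8]
advantages = group_normalize(group)
```

After normalization, samples scoring above the group mean receive positive values and those below receive negative values, which keeps the update signal on a comparable scale across groups.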

Title: CiQi-Agent: Aligning Vision, Tools and Aesthetics in Multimodal Agent for Cultural Reasoning on Chinese Porcelains

Authors: Wenhan Wang, Zhixiang Zhou, Zhongtian Ma, Yanzhu Chen, Ziyu Lin, Hao Sheng, Pengfei Liu, Honglin Ma, Wenqi Shao, Qiaosheng Zhang, Yu Qiao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.28474
Pdf URL: https://arxiv.org/pdf/2603.28474
Copy Paste: [[2603.28474]] CiQi-Agent: Aligning Vision, Tools and Aesthetics in Multimodal Agent for Cultural Reasoning on Chinese Porcelains(https://arxiv.org/abs/2603.28474)
Keywords: generation
Abstract: The connoisseurship of antique Chinese porcelain demands extensive historical expertise, material understanding, and aesthetic sensitivity, making it difficult for non-specialists to engage. To democratize cultural-heritage understanding and assist expert connoisseurship, we introduce CiQi-Agent, a domain-specific Porcelain Connoisseurship Agent for intelligent analysis of antique Chinese porcelain. CiQi-Agent supports multi-image porcelain inputs and enables vision tool invocation and multimodal retrieval-augmented generation, performing fine-grained connoisseurship analysis across six attributes: dynasty, reign period, kiln site, glaze color, decorative motif, and vessel shape. Beyond attribute classification, it captures subtle visual details, retrieves relevant domain knowledge, and integrates visual and textual evidence to produce coherent, explainable connoisseurship descriptions. To achieve this capability, we construct a large-scale, expert-annotated dataset CiQi-VQA, comprising 29,596 porcelain specimens, 51,553 images, and 557,940 visual question-answering pairs, and further establish a comprehensive benchmark CiQi-Bench aligned with the previously mentioned six attributes. CiQi-Agent is trained through supervised fine-tuning, reinforcement learning, and a tool-augmented reasoning framework that integrates two categories of tools: a vision tool and multimodal retrieval tools. Experimental results show that CiQi-Agent (7B) outperforms all competitive open- and closed-source models across all six attributes on CiQi-Bench, achieving on average 12.2% higher accuracy than GPT-5. The model and dataset have been released and are publicly available at this https URL.

Title: ConceptWeaver: Weaving Disentangled Concepts with Flow

Authors: Jintao Chen, Aiming Hao, Xiaoqing Chen, Chengyu Bai, Chubin Chen, Yanxun Li, Jiahong Wu, Xiangxiang Chu, Shanghang Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.28493
Pdf URL: https://arxiv.org/pdf/2603.28493
Copy Paste: [[2603.28493]] ConceptWeaver: Weaving Disentangled Concepts with Flow(https://arxiv.org/abs/2603.28493)
Keywords: generative
Abstract: Pre-trained flow-based models excel at synthesizing complex scenes yet lack a direct mechanism for disentangling and customizing their underlying concepts from one-shot real-world sources. To demystify this process, we first introduce a novel differential probing technique to isolate and analyze the influence of individual concept tokens on the velocity field over time. This investigation yields a critical insight: the generative process is not monolithic but unfolds in three distinct stages. An initial Blueprint Stage establishes low-frequency structure, followed by a pivotal Instantiation Stage where content concepts emerge with peak intensity and become naturally disentangled, creating an optimal window for manipulation. A final concept-insensitive refinement stage then synthesizes fine-grained details. Guided by this discovery, we propose ConceptWeaver, a framework for one-shot concept disentanglement. ConceptWeaver learns concept-specific semantic offsets from a single reference image using a stage-aware optimization strategy that aligns with the three-stage framework. These learned offsets are then deployed during inference via our novel ConceptWeaver Guidance (CWG) mechanism, which strategically injects them at the appropriate generative stage. Extensive experiments validate that ConceptWeaver enables high-fidelity, compositional synthesis and editing, demonstrating that understanding and leveraging the intrinsic, staged nature of flow models is key to unlocking precise, multi-granularity content manipulation.

Title: Generalizable Detection of AI Generated Images with Large Models and Fuzzy Decision Tree

Authors: Fei Wu, Guanghao Ding, Zijian Niu, Zhenrui Wang, Lei Yang, Zhuosheng Zhang, Shilin Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.28508
Pdf URL: https://arxiv.org/pdf/2603.28508
Copy Paste: [[2603.28508]] Generalizable Detection of AI Generated Images with Large Models and Fuzzy Decision Tree(https://arxiv.org/abs/2603.28508)
Keywords: generation, generative
Abstract: The malicious use and widespread dissemination of AI-generated images pose a serious threat to the authenticity of digital content. Existing detection methods exploit low-level artifacts left by common manipulation steps within the generation pipeline, but they often lack generalization due to model-specific overfitting. Recently, researchers have resorted to Multimodal Large Language Models (MLLMs) for AIGC detection, leveraging their high-level semantic reasoning and broad generalization capabilities. While promising, MLLMs lack the fine-grained perceptual sensitivity to subtle generation artifacts, making them inadequate as standalone detectors. To address this issue, we propose a novel AI-generated image detection framework that synergistically integrates lightweight artifact-aware detectors with MLLMs via a fuzzy decision tree. The decision tree treats the outputs of basic detectors as fuzzy membership values, enabling adaptive fusion of complementary cues from semantic and perceptual perspectives. Extensive experiments demonstrate that the proposed method achieves state-of-the-art accuracy and strong generalization across diverse generative models.

Title: Seen2Scene: Completing Realistic 3D Scenes with Visibility-Guided Flow

Authors: Quan Meng, Yujin Chen, Lei Li, Matthias Nießner, Angela Dai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.28548
Pdf URL: https://arxiv.org/pdf/2603.28548
Copy Paste: [[2603.28548]] Seen2Scene: Completing Realistic 3D Scenes with Visibility-Guided Flow(https://arxiv.org/abs/2603.28548)
Keywords: generation
Abstract: We present Seen2Scene, the first flow matching-based approach that trains directly on incomplete, real-world 3D scans for scene completion and generation. Unlike prior methods that rely on complete and hence synthetic 3D data, our approach introduces visibility-guided flow matching, which explicitly masks out unknown regions in real scans, enabling effective learning from real-world, partial observations. We represent 3D scenes using truncated signed distance field (TSDF) volumes encoded in sparse grids and employ a sparse transformer to efficiently model complex scene structures while masking unknown regions. We employ 3D layout boxes as an input conditioning signal, and our approach is flexibly adapted to various other inputs such as text or partial scans. By learning directly from real-world, incomplete 3D scans, Seen2Scene enables realistic 3D scene completion for complex, cluttered real environments. Experiments demonstrate that our model produces coherent, complete, and realistic 3D scenes, outperforming baselines in completion accuracy and generation quality.
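
A rough sketch of what masking unknown regions can mean for a flow matching loss is given below. It assumes the common straight-path formulation (target velocity equals data minus noise) and a mean squared error; the abstract does not state the exact objective, and the function names and the flat voxel list are illustrative only.

```python
from typing import List, Sequence


def interpolate(noise: Sequence[float], data: Sequence[float], t: float) -> List[float]:
    """Point on the straight path from noise (t = 0) to data (t = 1)."""
    return [(1.0 - t) * x0 + t * x1 for x0, x1 in zip(noise, data)]


def masked_flow_matching_loss(
    pred_velocity: Sequence[float],
    noise: Sequence[float],
    data: Sequence[float],
    visible: Sequence[bool],
) -> float:
    """Mean squared velocity error computed on visible voxels only."""
    total, count = 0.0, 0
    for v_pred, x0, x1, seen in zip(pred_velocity, noise, data, visible):
        if not seen:
            continue  # unknown region of the scan: excluded from the loss
        target = x1 - x0  # velocity of the straight path from noise to data
        total += (v_pred - target) ** 2
        count += 1
    return total / count if count else 0.0


if __name__ == "__main__":
    noise = [0.0, 0.0, 0.0]
    data = [1.0, 2.0, 99.0]         # third voxel was never observed; its value is unreliable
    visible = [True, True, False]
    print(masked_flow_matching_loss([2.0, 2.0, 0.0], noise, data, visible))
```

Because the unobserved voxel is skipped, its arbitrary TSDF value cannot pull the model toward reproducing holes in the scan.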

Title: Hydra: Unifying Document Retrieval and Generation in a Single Vision-Language Model

Authors: Athos Georgiou
Subjects: cs.CV, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2603.28554
Pdf URL: https://arxiv.org/pdf/2603.28554
Copy Paste: [[2603.28554]] Hydra: Unifying Document Retrieval and Generation in a Single Vision-Language Model(https://arxiv.org/abs/2603.28554)
Keywords: restoration, generation
Abstract: Visual document understanding typically requires separate retrieval and generation models, doubling memory and system complexity. We present Hydra, a dual-head approach that provides both ColBERT-style late-interaction retrieval and autoregressive generation from a single vision-language model (VLM). A single LoRA adapter, trained only for retrieval, is toggled at inference: enabling it produces multi-vector embeddings; disabling it recovers the base model's generation quality -- byte-identical outputs in 100% of 10,500 greedy and stochastic samples, with max delta-ANLS = 0.0044 across 15,301 samples on four VQA benchmarks (three informative; ChartQA is near-zero for both models under greedy decoding) when compared against an independent base-model pipeline. We identify three engineering requirements (attention-mode restoration, lm_head preservation, KV-cache-aware decoding) whose omission silently breaks generation despite correct weight recovery. On ViDoRe V1, Hydra (4B) is within 1 percentage point of a controlled single-head baseline in a single training run, with higher aggregate scores on V2 and V3 that are concentrated on a subset of tasks; multi-seed experiments are needed to confirm these trends. The single-model design reduces peak GPU memory by 41%, though adapter switching introduces throughput overhead under concurrent serving loads. An ablation shows that GritLM-style joint training provides no benefit within the LoRA-based (r=16) training regime. A proof-of-concept extension to Qwen2.5-Omni-3B demonstrates that the mechanism generalizes to audio retrieval and video embedding, with speech generation.
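
For readers unfamiliar with ColBERT-style late interaction, the scoring rule is usually MaxSim: each query token vector is matched to its best document token vector and the maxima are summed. The sketch below shows that operator in plain Python; whether Hydra applies any normalization or variation on top of it is not stated in the abstract, and the vectors here are toy values.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: Sequence[Vector], doc_vecs: Sequence[Vector]) -> float:
    """Late-interaction relevance: sum over query tokens of the best-matching document token."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


if __name__ == "__main__":
    query = [[1.0, 0.0], [0.0, 1.0]]   # two query token embeddings (toy values)
    page = [[1.0, 0.0], [0.5, 0.5]]    # two document patch embeddings (toy values)
    print(maxsim_score(query, page))
```

Keeping one vector per token, rather than pooling to a single embedding, is what the abstract refers to as multi-vector embeddings.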

Title: Unrestrained Simplex Denoising for Discrete Data. A Non-Markovian Approach Applied to Graph Generation

Authors: Yoann Boget, Alexandros Kalousis
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.28572
Pdf URL: https://arxiv.org/pdf/2603.28572
Copy Paste: [[2603.28572]] Unrestrained Simplex Denoising for Discrete Data. A Non-Markovian Approach Applied to Graph Generation(https://arxiv.org/abs/2603.28572)
Keywords: generation, generative
Abstract: Denoising models such as Diffusion or Flow Matching have recently advanced generative modeling for discrete structures, yet most approaches operate directly in the discrete state space, causing abrupt state changes. We introduce simplex denoising, a simple yet effective generative framework that operates on the probability simplex. The key idea is a non-Markovian noising scheme in which, for a given clean data point, noisy representations at different times are conditionally independent. While preserving the theoretical guarantees of denoising-based generative models, our method removes unnecessary constraints, thereby improving performance and simplifying the formulation. Empirically, unrestrained simplex denoising surpasses strong discrete diffusion and flow-matching baselines across synthetic and real-world graph benchmarks. These results highlight the probability simplex as an effective framework for discrete generative modeling.

Title: ORSIFlow: Saliency-Guided Rectified Flow for Optical Remote Sensing Salient Object Detection

Authors: Haojing Chen, Yutong Li, Zhihang Liu, Tao Tan, Haoyu Bian, Qiuju Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.28584
Pdf URL: https://arxiv.org/pdf/2603.28584
Copy Paste: [[2603.28584]] ORSIFlow: Saliency-Guided Rectified Flow for Optical Remote Sensing Salient Object Detection(https://arxiv.org/abs/2603.28584)
Keywords: generation, generative
Abstract: Optical Remote Sensing Image Salient Object Detection (ORSI-SOD) remains challenging due to complex backgrounds, low contrast, irregular object shapes, and large variations in object scale. Existing discriminative methods directly regress saliency maps, while recent diffusion-based generative approaches suffer from stochastic sampling and high computational cost. In this paper, we propose ORSIFlow, a saliency-guided rectified flow framework that reformulates ORSI-SOD as a deterministic latent flow generation problem. ORSIFlow performs saliency mask generation in a compact latent space constructed by a frozen variational autoencoder, enabling efficient inference with only a few steps. To enhance saliency awareness, we design a Salient Feature Discriminator for global semantic discrimination and a Salient Feature Calibrator for precise boundary refinement. Extensive experiments on multiple public benchmarks show that ORSIFlow achieves state-of-the-art performance with significantly improved efficiency. Codes are available at: this https URL.

Title: TGIF2: Extended Text-Guided Inpainting Forgery Dataset & Benchmark

Authors: Hannes Mareen, Dimitrios Karageorgiou, Paschalis Giakoumoglou, Peter Lambert, Symeon Papadopoulos, Glenn Van Wallendael
Subjects: cs.CV, cs.AI, cs.CR, cs.MM
Abstract URL: https://arxiv.org/abs/2603.28613
Pdf URL: https://arxiv.org/pdf/2603.28613
Copy Paste: [[2603.28613]] TGIF2: Extended Text-Guided Inpainting Forgery Dataset & Benchmark(https://arxiv.org/abs/2603.28613)
Keywords: super-resolution, generative
Abstract: Generative AI has made text-guided inpainting a powerful image editing tool, but at the same time a growing challenge for media forensics. Existing benchmarks, including our text-guided inpainting forgery (TGIF) dataset, show that image forgery localization (IFL) methods can localize manipulations in spliced images but struggle in fully regenerated (FR) images, while synthetic image detection (SID) methods can detect fully regenerated images but cannot perform localization. With new generative inpainting models emerging and the open problem of localization in FR images remaining, updated datasets and benchmarks are needed. We introduce TGIF2, an extended version of TGIF, that captures recent advances in text-guided inpainting and enables a deeper analysis of forensic robustness. TGIF2 augments the original dataset with edits generated by FLUX.1 models, as well as with random non-semantic masks. Using the TGIF2 dataset, we conduct a forensic evaluation spanning IFL and SID, including fine-tuning IFL methods on FR images and generative super-resolution attacks. Our experiments show that both IFL and SID methods degrade on FLUX.1 manipulations, highlighting limited generalization. Additionally, while fine-tuning improves localization on FR images, evaluation with random non-semantic masks reveals object bias. Furthermore, generative super-resolution significantly weakens forensic traces, demonstrating that common image enhancement operations can undermine current forensic pipelines. In summary, TGIF2 provides an updated dataset and benchmark, which enables new insights into the challenges posed by modern inpainting and AI-based image enhancements. TGIF2 is available at this https URL.

Title: Divide and Restore: A Modular Task-Decoupled Framework for Universal Image Restoration

Authors: Joanna Wiekiera, Martyna Zur
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.28658
Pdf URL: https://arxiv.org/pdf/2603.28658
Copy Paste: [[2603.28658]] Divide and Restore: A Modular Task-Decoupled Framework for Universal Image Restoration(https://arxiv.org/abs/2603.28658)
Keywords: restoration
Abstract: Restoring images affected by various types of degradation, such as noise, blur, or improper exposure, remains a significant challenge in computer vision. While recent trends favor complex monolithic all-in-one architectures, these models often suffer from negative task interference and require extensive joint training cycles on high-end computing clusters. In this paper, we propose a modular, task-decoupled image restoration framework based on an explicit diagnostic routing mechanism. The architecture consists of a lightweight Convolutional Neural Network (CNN) classifier that evaluates the input image and dynamically directs it to a specialized restoration node. A key advantage of this framework is its model-agnostic extensibility: while we demonstrate it using three independent U-Net experts, the system allows for the integration of any restoration method tailored to specific tasks. By isolating reconstruction paths, the framework prevents feature conflicts and significantly reduces training overhead. Unlike monolithic models, adding new degradation types in our framework only requires training a single expert and updating the router, rather than a full system retraining. Experimental results demonstrate that this computationally accessible approach offers a scalable and efficient solution for multi-degradation restoration on standard local hardware. The code will be published upon paper acceptance.
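
The routing design described in this abstract can be sketched as a classifier plus a registry of experts. Everything below is a toy illustration under stated assumptions: images are flat lists of pixel values, and `toy_classifier`, `brighten`, and `smooth` are hypothetical stand-ins for the CNN router and the U-Net experts, which the abstract does not specify in detail.

```python
from typing import Callable, Dict, List

Image = List[float]                # toy stand-in for an image: pixel values in [0, 1]
Expert = Callable[[Image], Image]  # any restoration method with this signature fits


class RestorationRouter:
    """Diagnose the degradation, then dispatch to the matching expert."""

    def __init__(self, classify: Callable[[Image], str]) -> None:
        self._classify = classify
        self._experts: Dict[str, Expert] = {}

    def register(self, degradation: str, expert: Expert) -> None:
        # Adding a degradation type touches only this registry (and the classifier).
        self._experts[degradation] = expert

    def restore(self, image: Image) -> Image:
        label = self._classify(image)
        if label not in self._experts:
            raise KeyError(f"no expert registered for degradation '{label}'")
        return self._experts[label](image)


def toy_classifier(image: Image) -> str:
    """Hypothetical router: treat very dark images as underexposed, the rest as noisy."""
    return "underexposed" if sum(image) / len(image) < 0.2 else "noise"


def brighten(image: Image) -> Image:
    return [min(1.0, p * 2.0) for p in image]


def smooth(image: Image) -> Image:
    mean = sum(image) / len(image)
    return [(p + mean) / 2.0 for p in image]


router = RestorationRouter(toy_classifier)
router.register("underexposed", brighten)
router.register("noise", smooth)

if __name__ == "__main__":
    print(router.restore([0.1, 0.1]))
```

Since each expert is only reached through the registry, a new degradation type can be supported by registering one more callable and updating the classifier, which mirrors the extensibility claim in the abstract.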

Title: DreamLite: A Lightweight On-Device Unified Model for Image Generation and Editing

Authors: Kailai Feng, Yuxiang Wei, Bo Chen, Yang Pan, Hu Ye, Songwei Liu, Chenqian Yan, Yuan Gao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.28713
Pdf URL: https://arxiv.org/pdf/2603.28713
Copy Paste: [[2603.28713]] DreamLite: A Lightweight On-Device Unified Model for Image Generation and Editing(https://arxiv.org/abs/2603.28713)
Keywords: generation
Abstract: Diffusion models have made significant progress in both text-to-image (T2I) generation and text-guided image editing. However, these models are typically built with billions of parameters, leading to high latency and increased deployment challenges. While on-device diffusion models improve efficiency, they largely focus on T2I generation and lack support for image editing. In this paper, we propose DreamLite, a compact unified on-device diffusion model (0.39B) that supports both T2I generation and text-guided image editing within a single network. DreamLite is built on a pruned mobile U-Net backbone and unifies conditioning through in-context spatial concatenation in the latent space. It concatenates images horizontally as input, using a (target | blank) configuration for generation tasks and (target | source) for editing tasks. To stabilize the training of this compact model, we introduce a task-progressive joint pretraining strategy that sequentially targets T2I, editing, and joint tasks. After high-quality SFT and reinforcement learning, DreamLite achieves 0.72 on GenEval for image generation and 4.11 on ImgEdit for image editing, outperforming existing on-device models and remaining competitive with several server-side models. By employing step distillation, we further reduce denoising to just 4 steps, enabling DreamLite to generate or edit a 1024 x 1024 image in less than 1s on a Xiaomi 14 smartphone. To the best of our knowledge, DreamLite is the first unified on-device diffusion model that supports both image generation and image editing.
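
The (target | blank) and (target | source) layouts can be pictured with a short sketch. It assumes single-channel latents stored as nested lists and an all-zero blank panel; both choices, and the function name `build_input`, are illustrative and not taken from the paper.

```python
from typing import List, Optional

Latent = List[List[float]]  # toy single-channel latent: rows of values


def build_input(noisy_target: Latent, source: Optional[Latent] = None) -> Latent:
    """Concatenate horizontally: (target | blank) for generation, (target | source) for editing."""
    height, width = len(noisy_target), len(noisy_target[0])
    if source is None:
        # Generation task: the right panel is a blank placeholder (zeros assumed here).
        right = [[0.0] * width for _ in range(height)]
    else:
        right = source
    if len(right) != height or any(len(row) != width for row in right):
        raise ValueError("source latent must match the target latent's shape")
    return [t_row + r_row for t_row, r_row in zip(noisy_target, right)]


if __name__ == "__main__":
    target = [[1.0, 2.0]]
    print(build_input(target))                # generation layout
    print(build_input(target, [[3.0, 4.0]]))  # editing layout
```

Both tasks produce an input of the same shape, which is what lets one network serve generation and editing without task-specific branches.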

Title: Stepwise Credit Assignment for GRPO on Flow-Matching Models

Authors: Yash Savani, Branislav Kveton, Yuchen Liu, Yilin Wang, Jing Shi, Subhojyoti Mukherjee, Nikos Vlassis, Krishna Kumar Singh
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2603.28718
Pdf URL: https://arxiv.org/pdf/2603.28718
Copy Paste: [[2603.28718]] Stepwise Credit Assignment for GRPO on Flow-Matching Models(https://arxiv.org/abs/2603.28718)
Keywords: generation
Abstract: Flow-GRPO successfully applies reinforcement learning to flow models, but uses uniform credit assignment across all steps. This ignores the temporal structure of diffusion generation: early steps determine composition and content (low-frequency structure), while late steps resolve details and textures (high-frequency details). Moreover, assigning uniform credit based solely on the final image can inadvertently reward suboptimal intermediate steps, especially when errors are corrected later in the diffusion trajectory. We propose Stepwise-Flow-GRPO, which assigns credit based on each step's reward improvement. By leveraging Tweedie's formula to obtain intermediate reward estimates and introducing gain-based advantages, our method achieves superior sample efficiency and faster convergence. We also introduce a DDIM-inspired SDE that improves reward quality while preserving stochasticity for policy gradients.
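
A minimal sketch of gain-based credit assignment follows. It assumes that an intermediate reward estimate is already available at every step for each sample in a group (the abstract obtains these via Tweedie's formula, which is not reproduced here), and that gains are normalized per step across the group in the usual GRPO manner. The normalization detail and the function names are assumptions.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def stepwise_gains(rewards: Sequence[float]) -> List[float]:
    """Reward improvement contributed by each step: r_t - r_{t-1}."""
    return [later - earlier for earlier, later in zip(rewards, rewards[1:])]


def group_advantages(group_rewards: Sequence[Sequence[float]], eps: float = 1e-8) -> List[List[float]]:
    """Normalize each step's gain against the other samples in the group (assumed scheme)."""
    gains = [stepwise_gains(r) for r in group_rewards]
    n_steps = len(gains[0])
    advantages = [[0.0] * n_steps for _ in gains]
    for t in range(n_steps):
        column = [g[t] for g in gains]
        mu, sigma = mean(column), pstdev(column)
        for i, gain in enumerate(column):
            advantages[i][t] = (gain - mu) / (sigma + eps)
    return advantages


if __name__ == "__main__":
    # Two samples, intermediate reward estimates at three points along the trajectory.
    group = [[0.1, 0.5, 0.6], [0.1, 0.2, 0.8]]
    print(group_advantages(group))
```

In contrast to uniform credit, a step that lowers the estimated reward receives a negative gain even when the final image scores well, which is the failure case the abstract describes.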

Title: SonoWorld: From One Image to a 3D Audio-Visual Scene

Authors: Derong Jin, Xiyi Chen, Ming C. Lin, Ruohan Gao
Subjects: cs.CV, cs.MM, cs.SD
Abstract URL: https://arxiv.org/abs/2603.28757
Pdf URL: https://arxiv.org/pdf/2603.28757
Copy Paste: [[2603.28757]] SonoWorld: From One Image to a 3D Audio-Visual Scene(https://arxiv.org/abs/2603.28757)
Keywords: generation
Abstract: Tremendous progress in visual scene generation now turns a single image into an explorable 3D world, yet immersion remains incomplete without sound. We introduce Image2AVScene, the task of generating a 3D audio-visual scene from a single image, and present SonoWorld, the first framework to tackle this challenge. From one image, our pipeline outpaints a 360° panorama, lifts it into a navigable 3D scene, places language-guided sound anchors, and renders ambisonics for point, areal, and ambient sources, yielding spatial audio aligned with scene geometry and semantics. Quantitative evaluations on a newly curated real-world dataset and a controlled user study confirm the effectiveness of our approach. Beyond free-viewpoint audio-visual rendering, we also demonstrate applications to one-shot acoustic learning and audio-visual spatial source separation. Project website: this https URL

Title: On-the-fly Repulsion in the Contextual Space for Rich Diversity in Diffusion Transformers

Authors: Omer Dahary, Benaya Koren, Daniel Garibi, Daniel Cohen-Or
Subjects: cs.CV, cs.AI, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2603.28762
Pdf URL: https://arxiv.org/pdf/2603.28762
Copy Paste: [[2603.28762]] On-the-fly Repulsion in the Contextual Space for Rich Diversity in Diffusion Transformers(https://arxiv.org/abs/2603.28762)
Keywords: generative
Abstract: Modern Text-to-Image (T2I) diffusion models have achieved remarkable semantic alignment, yet they often suffer from a significant lack of variety, converging on a narrow set of visual solutions for any given prompt. This typicality bias presents a challenge for creative applications that require a wide range of generative outcomes. We identify a fundamental trade-off in current approaches to diversity: modifying model inputs requires costly optimization to incorporate feedback from the generative path. In contrast, acting on spatially-committed intermediate latents tends to disrupt the forming visual structure, leading to artifacts. In this work, we propose to apply repulsion in the Contextual Space as a novel framework for achieving rich diversity in Diffusion Transformers. By intervening in the multimodal attention channels, we apply on-the-fly repulsion during the transformer's forward pass, injecting the intervention between blocks where text conditioning is enriched with emergent image structure. This allows for redirecting the guidance trajectory after it is structurally informed but before the composition is fixed. Our results demonstrate that repulsion in the Contextual Space produces significantly richer diversity without sacrificing visual fidelity or semantic adherence. Furthermore, our method is uniquely efficient, imposing a small computational overhead while remaining effective even in modern "Turbo" and distilled models where traditional trajectory-based interventions typically fail.

Title: PoseDreamer: Scalable and Photorealistic Human Data Generation Pipeline with Diffusion Models

Authors: Lorenza Prospero, Orest Kupyn, Ostap Viniavskyi, João F. Henriques, Christian Rupprecht
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.28763
Pdf URL: https://arxiv.org/pdf/2603.28763
Copy Paste: [[2603.28763]] PoseDreamer: Scalable and Photorealistic Human Data Generation Pipeline with Diffusion Models(https://arxiv.org/abs/2603.28763)
Keywords: generation
Abstract: Acquiring labeled datasets for 3D human mesh estimation is challenging due to depth ambiguities and the inherent difficulty of annotating 3D geometry from monocular images. Existing datasets are either real, with manually annotated 3D geometry and limited scale, or synthetic, rendered from 3D engines that provide precise labels but suffer from limited photorealism, low diversity, and high production costs. In this work, we explore a third path: generated data. We introduce PoseDreamer, a novel pipeline that leverages diffusion models to generate large-scale synthetic datasets with 3D mesh annotations. Our approach combines controllable image generation with Direct Preference Optimization for control alignment, curriculum-based hard sample mining, and multi-stage quality filtering. Together, these components naturally maintain correspondence between 3D labels and generated images, while prioritizing challenging samples to maximize dataset utility. Using PoseDreamer, we generate more than 500,000 high-quality synthetic samples, achieving a 76% improvement in image-quality metrics compared to rendering-based datasets. Models trained on PoseDreamer achieve performance comparable to or superior to those trained on real-world and traditional synthetic datasets. In addition, combining PoseDreamer with synthetic datasets results in better performance than combining real-world and synthetic datasets, demonstrating the complementary nature of our dataset. We will release the full dataset and generation code.

Title: HandX: Scaling Bimanual Motion and Interaction Generation

Authors: Zimu Zhang, Yucheng Zhang, Xiyan Xu, Ziyin Wang, Sirui Xu, Kai Zhou, Bing Zhou, Chuan Guo, Jian Wang, Yu-Xiong Wang, Liang-Yan Gui
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.28766
Pdf URL: https://arxiv.org/pdf/2603.28766
Copy Paste: [[2603.28766]] HandX: Scaling Bimanual Motion and Interaction Generation(https://arxiv.org/abs/2603.28766)
Keywords: generation
Abstract: Synthesizing human motion has advanced rapidly, yet realistic hand motion and bimanual interaction remain underexplored. Whole-body models often miss the fine-grained cues that drive dexterous behavior, such as finger articulation, contact timing, and inter-hand coordination, and existing resources lack high-fidelity bimanual sequences that capture nuanced finger dynamics and collaboration. To fill this gap, we present HandX, a unified foundation spanning data, annotation, and evaluation. We consolidate and filter existing datasets for quality, and collect a new motion-capture dataset targeting underrepresented bimanual interactions with detailed finger dynamics. For scalable annotation, we introduce a decoupled strategy that extracts representative motion features, e.g., contact events and finger flexion, and then leverages reasoning from large language models to produce fine-grained, semantically rich descriptions aligned with these features. Building on the resulting data and annotations, we benchmark diffusion and autoregressive models with versatile conditioning modes. Experiments demonstrate high-quality dexterous motion generation, supported by our newly proposed hand-focused metrics. We further observe clear scaling trends: larger models trained on larger, higher-quality datasets produce more semantically coherent bimanual motion. Our dataset is released to support future research.

Title: Gen-Searcher: Reinforcing Agentic Search for Image Generation

Authors: Kaituo Feng, Manyuan Zhang, Shuang Chen, Yunlong Lin, Kaixuan Fan, Yilei Jiang, Hongyu Li, Dian Zheng, Chenyang Wang, Xiangyu Yue
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.28767
Pdf URL: https://arxiv.org/pdf/2603.28767
Copy Paste: [[2603.28767]] Gen-Searcher: Reinforcing Agentic Search for Image Generation(https://arxiv.org/abs/2603.28767)
Keywords: generation
Abstract: Recent image generation models have shown strong capabilities in generating high-fidelity and photorealistic images. However, they are fundamentally constrained by frozen internal knowledge, thus often failing on real-world scenarios that are knowledge-intensive or require up-to-date information. In this paper, we present Gen-Searcher, the first attempt to train a search-augmented image generation agent, which performs multi-hop reasoning and search to collect the textual knowledge and reference images needed for grounded generation. To achieve this, we construct a tailored data pipeline and curate two high-quality datasets, Gen-Searcher-SFT-10k and Gen-Searcher-RL-6k, containing diverse search-intensive prompts and corresponding ground-truth synthesis images. We further introduce KnowGen, a comprehensive benchmark that explicitly requires search-grounded external knowledge for image generation and evaluates models along multiple dimensions. Based on these resources, we train Gen-Searcher with SFT followed by agentic reinforcement learning with dual reward feedback, which combines text-based and image-based rewards to provide more stable and informative learning signals for GRPO training. Experiments show that Gen-Searcher brings substantial gains, improving Qwen-Image by around 16 points on KnowGen and 15 points on WISE. We hope this work can serve as an open foundation for search agents in image generation, and we fully open-source our data, models, and code.