2025-11-11

Title: Token Is All You Need: Cognitive Planning through Sparse Intent Alignment

Authors: Shiyao Sang
Subjects: cs.CV, cs.AI, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2511.05540
Pdf URL: https://arxiv.org/pdf/2511.05540
Copy Paste: [[2511.05540]] Token Is All You Need: Cognitive Planning through Sparse Intent Alignment(https://arxiv.org/abs/2511.05540)
Keywords: generation
Abstract: We challenge the long-standing assumption that exhaustive scene modeling is required for high-performance end-to-end autonomous driving (E2EAD). Unlike world-model approaches that rely on computationally intensive future scene generation or vision-language-action (VLA) systems constrained by Markov assumptions, we show that a minimal set of semantically rich tokens is sufficient for effective planning. Experiments on the nuPlan benchmark (720 scenarios, over 11,000 samples) using perception-informed BEV representations yield three key findings: (1) even without future prediction, our sparse representation achieves 0.548 m ADE, comparable to or surpassing prior methods reporting around 0.75 m on nuScenes; (2) conditioning trajectory decoding on predicted future tokens reduces ADE to 0.479 m, a 12.6% improvement over current-state baselines; and (3) explicit reconstruction loss offers no benefit and may degrade performance under reliable perception inputs. Notably, we observe the emergence of temporal fuzziness, where the model adaptively attends to task-relevant semantics rather than aligning rigidly to fixed timestamps, providing a cognitive advantage for planning under uncertainty. Our "token is all you need" principle marks a paradigm shift from reconstructing the world to understanding it, laying a foundation for cognitively inspired systems that plan through imagination rather than reaction.
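
The ADE figures quoted in this abstract can be checked with a few lines of arithmetic. The sketch below is a minimal illustration, assuming ADE means the average Euclidean distance between predicted and ground-truth waypoints (the usual definition; the abstract does not spell it out). The function names and the toy trajectory are hypothetical; only the 0.548 m and 0.479 m values come from the abstract.

```python
import math


def average_displacement_error(pred, truth):
    """Mean Euclidean distance (in meters) between matching 2D waypoints."""
    if len(pred) != len(truth) or not pred:
        raise ValueError("trajectories must be non-empty and equally long")
    return sum(math.dist(p, t) for p, t in zip(pred, truth)) / len(pred)


def relative_improvement(baseline, improved):
    """Fractional reduction of an error metric relative to a baseline."""
    return (baseline - improved) / baseline


# Toy trajectory (hypothetical data): every waypoint is off by 0.5 m laterally.
truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
pred = [(0.0, 0.5), (1.0, 0.5), (2.0, 0.5)]
toy_ade = average_displacement_error(pred, truth)

# ADE values reported in the abstract, without and with predicted future tokens.
gain = relative_improvement(0.548, 0.479)
print(f"toy ADE = {toy_ade:.3f} m, reported gain = {gain:.1%}")
```

The relative reduction from 0.548 m to 0.479 m works out to roughly 12.6%, which matches the improvement stated in the abstract.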

Title: AGRAG: Advanced Graph-based Retrieval-Augmented Generation for LLMs

Authors: Yubo Wang, Haoyang Li, Fei Teng, Lei Chen
Subjects: cs.LG, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2511.05549
Pdf URL: https://arxiv.org/pdf/2511.05549
Copy Paste: [[2511.05549]] AGRAG: Advanced Graph-based Retrieval-Augmented Generation for LLMs(https://arxiv.org/abs/2511.05549)
Keywords: generation
Abstract: Graph-based retrieval-augmented generation (Graph-based RAG) has demonstrated significant potential in enhancing Large Language Models (LLMs) with structured knowledge. However, existing methods face three critical challenges: Inaccurate Graph Construction, caused by LLM hallucination; Poor Reasoning Ability, caused by failing to generate explicit reasons that tell the LLM why certain chunks were selected; and Inadequate Answering, where the query is only partially answered because of inadequate LLM reasoning. As a result, their performance lags behind NaiveRAG on certain tasks. To address these issues, we propose AGRAG, an advanced graph-based retrieval-augmented generation framework. When constructing the graph, AGRAG replaces the widely used LLM entity extraction method with a statistics-based method, avoiding hallucination and error propagation. During retrieval, AGRAG formulates the graph reasoning procedure as the Minimum Cost Maximum Influence (MCMI) subgraph generation problem, which seeks to include more nodes with high influence scores while incurring less edge cost, making the generated reasoning paths more comprehensive. We prove this problem to be NP-hard and propose a greedy algorithm to solve it. The generated MCMI subgraph can serve as explicit reasoning paths that tell the LLM why certain chunks were retrieved, helping the LLM focus on the query-related content of the chunks, reducing the impact of noise, and improving AGRAG's reasoning ability. Furthermore, compared with simple tree-structured reasoning paths, our MCMI subgraph allows more complex graph structures, such as cycles, and improves the comprehensiveness of the generated reasoning paths.
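
The abstract does not describe the greedy algorithm itself, so the following is only a hypothetical sketch of what a greedy influence-versus-cost expansion could look like. It assumes per-node influence scores, positive per-edge costs, a total cost budget, and a "best influence per unit cost" selection rule; none of these details are confirmed by the source. This simplified version grows a tree from the seed nodes and does not add cycles, unlike the MCMI subgraphs described above.

```python
def greedy_subgraph(influence, edges, seeds, budget):
    """Greedily grow a connected subgraph from `seeds` (illustrative only).

    influence: dict mapping node -> influence score (higher is better)
    edges: dict mapping (u, v) -> positive cost of an undirected edge
    budget: maximum total edge cost that may be spent
    """
    # Build an undirected adjacency list from the edge-cost table.
    adjacency = {}
    for (u, v), cost in edges.items():
        adjacency.setdefault(u, []).append((v, cost))
        adjacency.setdefault(v, []).append((u, cost))

    chosen_nodes = set(seeds)
    chosen_edges = []
    spent = 0.0
    while True:
        best = None  # (ratio, u, v, cost) of the best affordable frontier edge
        for u in chosen_nodes:
            for v, cost in adjacency.get(u, []):
                if v in chosen_nodes or spent + cost > budget:
                    continue
                ratio = influence[v] / cost  # influence gained per unit cost
                if best is None or ratio > best[0]:
                    best = (ratio, u, v, cost)
        if best is None:  # nothing affordable is left on the frontier
            break
        _, u, v, cost = best
        chosen_nodes.add(v)
        chosen_edges.append((u, v))
        spent += cost
    return chosen_nodes, chosen_edges, spent


# Hypothetical toy graph: "q" stands for the query node, the rest for chunks.
influence = {"q": 0.0, "a": 5.0, "b": 1.0, "c": 4.0, "d": 3.0}
edges = {("q", "a"): 1.0, ("q", "b"): 1.0, ("a", "c"): 2.0,
         ("b", "d"): 1.0, ("c", "d"): 4.0}
nodes, path_edges, spent = greedy_subgraph(influence, edges, ["q"], budget=4.0)
print(sorted(nodes), path_edges, spent)
```

The returned edge list can be read as an explicit reasoning path from the query to each retrieved chunk, which is the role the abstract assigns to the MCMI subgraph.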

Title: In-Context-Learning-Assisted Quality Assessment Vision-Language Models for Metal Additive Manufacturing

Authors: Qiaojie Zheng, Jiucai Zhang, Xiaoli Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.05551
Pdf URL: https://arxiv.org/pdf/2511.05551
Copy Paste: [[2511.05551]] In-Context-Learning-Assisted Quality Assessment Vision-Language Models for Metal Additive Manufacturing(https://arxiv.org/abs/2511.05551)
Keywords: quality assessment
Abstract: Vision-based quality assessment in additive manufacturing often requires dedicated machine learning models and application-specific datasets. However, data collection and model training can be expensive and time-consuming. In this paper, we leverage vision-language models' (VLMs') reasoning capabilities to assess the quality of printed parts and introduce in-context learning (ICL) to provide VLMs with necessary application-specific knowledge and demonstration samples. This method eliminates the requirement for large application-specific datasets for training models. We explored different sampling strategies for ICL to search for the optimal configuration that makes use of limited samples. We evaluated these strategies on two VLMs, Gemini-2.5-flash and Gemma3:27b, with quality assessment tasks in wire-laser direct energy deposition processes. The results show that ICL-assisted VLMs can reach quality classification accuracies similar to those of traditional machine learning models while requiring only a minimal number of samples. In addition, unlike traditional classification models that lack transparency, VLMs can generate human-interpretable rationales to enhance trust. Since there are no metrics to evaluate their interpretability in manufacturing applications, we propose two metrics, knowledge relevance and rationale validity, to evaluate the quality of VLMs' supporting rationales. Our results show that ICL-assisted VLMs can address application-specific tasks with limited data, achieving relatively high accuracy while also providing valid supporting rationales for improved decision transparency.

Title: EVLP:Learning Unified Embodied Vision-Language Planner with Reinforced Supervised Fine-Tuning

Authors: Xinyan Cai, Shiguang Wu, Dafeng Chi, Yuzheng Zhuang, Xingyue Quan, Jianye Hao, Qiang Guan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.05553
Pdf URL: https://arxiv.org/pdf/2511.05553
Copy Paste: [[2511.05553]] EVLP:Learning Unified Embodied Vision-Language Planner with Reinforced Supervised Fine-Tuning(https://arxiv.org/abs/2511.05553)
Keywords: generation
Abstract: In complex embodied long-horizon manipulation tasks, effective task decomposition and execution require synergistic integration of textual logical reasoning and visual-spatial imagination to ensure efficient and accurate operation. Current methods fail to adopt a unified generation framework for multimodal planning, leading to inconsistency in multimodal planning. To address this challenge, we present EVLP (Embodied Vision-Language Planner), an innovative multimodal unified generation framework that jointly models linguistic reasoning and visual generation. Our approach achieves multimodal planning for long-horizon tasks through a novel training pipeline incorporating dynamic pretraining and reinforced alignment. Our core innovations consist of three key components: 1) Unified Multimodal Generation Framework: For understanding, we integrate semantic information with spatial features to provide comprehensive visual perception. For generation, we directly learn the joint distribution of discrete images for one-step visual synthesis, enabling coordinated language-visual modeling through learnable cross-modal attention mechanisms. 2) Dynamic Perception Pretraining: We propose a bidirectional dynamic alignment strategy employing inverse dynamics tasks and forward dynamics tasks, effectively strengthening multimodal correlations within a unified feature space. 3) Reinforced Supervised Fine-Tuning: While conducting instruction-based fine-tuning in the unified generation space, we construct a reinforce loss to align the spatial logic between textual actions and generated images, enabling the model to acquire spatially aware multimodal planning capabilities.

Title: Effective Test-Time Scaling of Discrete Diffusion through Iterative Refinement

Authors: Sanghyun Lee, Sunwoo Kim, Seungryong Kim, Jongho Park, Dongmin Park
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.05562
Pdf URL: https://arxiv.org/pdf/2511.05562
Copy Paste: [[2511.05562]] Effective Test-Time Scaling of Discrete Diffusion through Iterative Refinement(https://arxiv.org/abs/2511.05562)
Keywords: generation
Abstract: Test-time scaling through reward-guided generation remains largely unexplored for discrete diffusion models despite its potential as a promising alternative. In this work, we introduce Iterative Reward-Guided Refinement (IterRef), a novel test-time scaling method tailored to discrete diffusion that leverages reward-guided noising-denoising transitions to progressively refine misaligned intermediate states. We formalize this process within a Multiple-Try Metropolis (MTM) framework, proving convergence to the reward-aligned distribution. Unlike prior methods that assume the current state is already aligned with the reward distribution and only guide the subsequent transition, our approach explicitly refines each state in situ, progressively steering it toward the optimal intermediate distribution. Across both text and image domains, we evaluate IterRef on diverse discrete diffusion models and observe consistent improvements in reward-guided generation quality. In particular, IterRef achieves striking gains under low compute budgets, far surpassing prior state-of-the-art baselines.
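
For readers unfamiliar with Multiple-Try Metropolis, the sketch below shows one generic textbook MTM step on a toy discrete state space. It is not IterRef: the ring of ten states, the reward function, and the ±1 proposal are all hypothetical stand-ins chosen for illustration. It assumes a symmetric proposal and the weight choice w(y) = target(y), under which the acceptance ratio reduces to a ratio of summed weights.

```python
import math
import random


def mtm_step(x, log_target, propose, k, rng):
    """One Multiple-Try Metropolis step with a symmetric proposal.

    With a symmetric proposal and weights w(y) = target(y), the acceptance
    ratio is sum_j w(y_j) / sum_j w(x*_j).
    """
    # 1) Draw k candidate states from the current state.
    trials = [propose(x, rng) for _ in range(k)]
    trial_w = [math.exp(log_target(y)) for y in trials]
    # 2) Pick one candidate with probability proportional to its weight.
    y = rng.choices(trials, weights=trial_w, k=1)[0]
    # 3) Draw k-1 reference states from the candidate; the k-th is x itself.
    refs = [propose(y, rng) for _ in range(k - 1)] + [x]
    ref_w = [math.exp(log_target(z)) for z in refs]
    # 4) Accept or reject using the generalized Metropolis ratio.
    accept_prob = min(1.0, sum(trial_w) / sum(ref_w))
    return y if rng.random() < accept_prob else x


def log_target(state):
    """Hypothetical reward-aligned log-density, peaked at state 7."""
    return -abs(state - 7)


def propose(state, rng):
    """Symmetric proposal: step one position left or right on a ring of 10."""
    return (state + rng.choice((-1, 1))) % 10


rng = random.Random(0)
state, counts = 0, [0] * 10
for _ in range(20000):
    state = mtm_step(state, log_target, propose, k=4, rng=rng)
    counts[state] += 1
print(counts)
```

Drawing several candidates per step lets the chain favor high-reward states while the acceptance test keeps the target distribution invariant, which is the property the abstract relies on for its convergence claim.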

Title: Automatic Extraction of Road Networks by using Teacher-Student Adaptive Structural Deep Belief Network and Its Application to Landslide Disaster

Authors: Shin Kamada, Takumi Ichimura
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2511.05567
Pdf URL: https://arxiv.org/pdf/2511.05567
Copy Paste: [[2511.05567]] Automatic Extraction of Road Networks by using Teacher-Student Adaptive Structural Deep Belief Network and Its Application to Landslide Disaster(https://arxiv.org/abs/2511.05567)
Keywords: generation
Abstract: An adaptive structural learning method for the Restricted Boltzmann Machine (RBM) and the Deep Belief Network (DBN) has been developed as one of the prominent deep learning models. The neuron generation-annihilation algorithm in the RBM and the layer generation algorithm in the DBN build an optimal network structure for the given input during learning. In this paper, our model is applied to RoadTracer, an automatic recognition method for road network systems. RoadTracer can generate a road map of the ground surface from aerial photograph data. Because road maps contain many complicated features, a model with high representation power is required for detection; we therefore propose a novel RoadTracer method that uses a Teacher-Student ensemble learning model of the Adaptive DBN. The experimental results showed that the detection accuracy of the proposed model improved from 40.0% to 89.0% on average across the seven major cities in the test dataset. In addition, we applied our method to detecting available roads when a landslide caused by a natural disaster occurs, in order to rapidly identify a transportation route. For fast inference, a small version of the trained model was implemented on a small embedded edge device as lightweight deep learning. We report the detection results for satellite images taken before and after the rainfall disaster in Japan.

Title: C3-Diff: Super-resolving Spatial Transcriptomics via Cross-modal Cross-content Contrastive Diffusion Modelling

Authors: Xiaofei Wang, Stephen Price, Chao Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.05571
Pdf URL: https://arxiv.org/pdf/2511.05571
Copy Paste: [[2511.05571]] C3-Diff: Super-resolving Spatial Transcriptomics via Cross-modal Cross-content Contrastive Diffusion Modelling(https://arxiv.org/abs/2511.05571)
Keywords: super-resolution
Abstract: The rapid advancement of spatial transcriptomics (ST), i.e., spatial gene expressions, has made it possible to measure gene expression within original tissue, enabling us to discover molecular mechanisms. However, current ST platforms frequently suffer from low resolution, limiting the in-depth understanding of spatial gene expression. Super-resolution approaches promise to enhance ST maps by integrating histology images with gene expressions of profiled tissue spots. However, it remains a challenge to model the interactions between histology images and gene expressions for effective ST enhancement. This study presents a cross-modal cross-content contrastive diffusion framework, called C3-Diff, for ST enhancement with histology images as guidance. In C3-Diff, we first analyze the deficiency of the traditional contrastive learning paradigm, which is then refined to extract both modal-invariant and content-invariant features of ST maps and histology images. Further, to overcome the problem of low sequencing sensitivity in ST maps, we perform noising-based information augmentation on the surface of the feature unit hypersphere. Finally, we propose a dynamic cross-modal imputation-based training strategy to mitigate ST data scarcity. We tested C3-Diff by benchmarking its performance on four public datasets, where it achieves significant improvements over competing methods. Moreover, we evaluate C3-Diff on downstream tasks of cell type localization, gene expression correlation and single-cell-level gene expression prediction, promoting AI-enhanced biotechnology for biomedical research and clinical applications. Code is available at this https URL.

Title: Video Text Preservation with Synthetic Text-Rich Videos

Authors: Ziyang Liu, Kevin Valencia, Justin Cui
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.05573
Pdf URL: https://arxiv.org/pdf/2511.05573
Copy Paste: [[2511.05573]] Video Text Preservation with Synthetic Text-Rich Videos(https://arxiv.org/abs/2511.05573)
Keywords: generation
Abstract: While Text-To-Video (T2V) models have advanced rapidly, they continue to struggle with generating legible and coherent text within videos. In particular, existing models often fail to correctly render even short phrases or words, and previous attempts to address this problem are computationally expensive and not suitable for video generation. In this work, we investigate a lightweight approach to improve T2V diffusion models using synthetic supervision. We first generate text-rich images using a text-to-image (T2I) diffusion model, then animate them into short videos using a text-agnostic image-to-video (I2V) model. These synthetic video-prompt pairs are used to fine-tune Wan2.1, a pre-trained T2V model, without any architectural changes. Our results show improvement in short-text legibility and temporal consistency, with emerging structural priors for longer text. These findings suggest that curated synthetic data and weak supervision offer a practical path toward improving textual fidelity in T2V generation.

Title: DiffSwap++: 3D Latent-Controlled Diffusion for Identity-Preserving Face Swapping

Authors: Weston Bondurant, Arkaprava Sinha, Hieu Le, Srijan Das, Stephanie Schuckers
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.05575
Pdf URL: https://arxiv.org/pdf/2511.05575
Copy Paste: [[2511.05575]] DiffSwap++: 3D Latent-Controlled Diffusion for Identity-Preserving Face Swapping(https://arxiv.org/abs/2511.05575)
Keywords: generation
Abstract: Diffusion-based approaches have recently achieved strong results in face swapping, offering improved visual quality over traditional GAN-based methods. However, even state-of-the-art models often suffer from fine-grained artifacts and poor identity preservation, particularly under challenging poses and expressions. A key limitation of existing approaches is their failure to meaningfully leverage 3D facial structure, which is crucial for disentangling identity from pose and expression. In this work, we propose DiffSwap++, a novel diffusion-based face-swapping pipeline that incorporates 3D facial latent features during training. By guiding the generation process with 3D-aware representations, our method enhances geometric consistency and improves the disentanglement of facial identity from appearance attributes. We further design a diffusion architecture that conditions the denoising process on both identity embeddings and facial landmarks, enabling high-fidelity and identity-preserving face swaps. Extensive experiments on CelebA, FFHQ, and CelebV-Text demonstrate that DiffSwap++ outperforms prior methods in preserving source identity while maintaining target pose and expression. Additionally, we introduce a biometric-style evaluation and conduct a user study to further validate the realism and effectiveness of our approach. Code will be made publicly available at this https URL

Title: Fine-Tuning Vision-Language Models for Multimodal Polymer Property Prediction

Authors: An Vuong, Minh-Hao Van, Prateek Verma, Chen Zhao, Xintao Wu
Subjects: cs.LG, cond-mat.mtrl-sci, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2511.05577
Pdf URL: https://arxiv.org/pdf/2511.05577
Copy Paste: [[2511.05577]] Fine-Tuning Vision-Language Models for Multimodal Polymer Property Prediction(https://arxiv.org/abs/2511.05577)
Keywords: generation
Abstract: Vision-Language Models (VLMs) have shown strong performance in tasks like visual question answering and multimodal text generation, but their effectiveness in scientific domains such as materials science remains limited. While some machine learning methods have addressed specific challenges in this field, there is still a lack of foundation models designed for broad tasks like polymer property prediction using multimodal data. In this work, we present a multimodal polymer dataset to fine-tune VLMs through instruction-tuning pairs and assess the impact of multimodality on prediction performance. Our fine-tuned models, using LoRA, outperform unimodal and baseline approaches, demonstrating the benefits of multimodal learning. Additionally, this approach reduces the need to train separate models for different properties, lowering deployment and maintenance costs.
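
The abstract names LoRA but gives no configuration, so the sketch below only illustrates the general low-rank update, W + (alpha / r) * B A, in plain Python. The matrices, the rank, and the alpha value are hypothetical toy numbers, and nothing here reflects which layers or hyperparameters the authors used.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def lora_effective_weight(w, b, a, alpha):
    """Return W + (alpha / r) * B @ A, where r is the adapter rank."""
    rank = len(a)  # A has shape (r, k); B has shape (d, r)
    scale = alpha / rank
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def trainable_params(d, k, rank):
    """Adapter parameters r * (d + k) versus d * k for full fine-tuning."""
    return rank * (d + k), d * k


# Frozen 2x3 weight with a rank-1 adapter (hypothetical toy numbers).
w = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
b = [[1.0], [2.0]]        # shape (d=2, r=1)
a = [[0.5, 0.0, -1.0]]    # shape (r=1, k=3)
w_eff = lora_effective_weight(w, b, a, alpha=2.0)
adapter, full = trainable_params(4096, 4096, rank=8)
print(w_eff, adapter, full)
```

Because only B and A are trained, a rank-8 adapter on a 4096 x 4096 layer updates 65,536 parameters instead of about 16.8 million, which is consistent with the lower training and maintenance cost the abstract mentions.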

Title: Depth-induced NTK: Bridging Over-parameterized Neural Networks and Deep Neural Kernels

Authors: Yong-Ming Tian, Shuang Liang, Shao-Qun Zhang, Feng-Lei Fan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.05585
Pdf URL: https://arxiv.org/pdf/2511.05585
Copy Paste: [[2511.05585]] Depth-induced NTK: Bridging Over-parameterized Neural Networks and Deep Neural Kernels(https://arxiv.org/abs/2511.05585)
Keywords: generation
Abstract: While deep learning has achieved remarkable success across a wide range of applications, its theoretical understanding of representation learning remains limited. Deep neural kernels provide a principled framework to interpret over-parameterized neural networks by mapping hierarchical feature transformations into kernel spaces, thereby combining the expressive power of deep architectures with the analytical tractability of kernel methods. Recent advances, particularly neural tangent kernels (NTKs) derived by gradient inner products, have established connections between infinitely wide neural networks and nonparametric Bayesian inference. However, the existing NTK paradigm has been predominantly confined to the infinite-width regime, while overlooking the representational role of network depth. To address this gap, we propose a depth-induced NTK kernel based on a shortcut-related architecture, which converges to a Gaussian process as the network depth approaches infinity. We theoretically analyze the training invariance and spectrum properties of the proposed kernel, which stabilizes the kernel dynamics and mitigates degeneration. Experimental results further underscore the effectiveness of our proposed method. Our findings significantly extend the existing landscape of the neural kernel theory and provide an in-depth understanding of deep learning and the scaling law.
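
As background for the phrase "derived by gradient inner products", the standard (width-based) neural tangent kernel can be written as below. This is the usual textbook definition, not the depth-induced kernel proposed in the paper; the symbols f, \theta, x_i, and y_i are notation assumed here for a network, its parameters, and the training pairs.

```latex
% Standard NTK: inner product of parameter gradients at two inputs.
\[
  \Theta(x, x') = \left\langle \nabla_\theta f(x; \theta),\; \nabla_\theta f(x'; \theta) \right\rangle
\]

% Under gradient flow on the squared loss
% L = \tfrac{1}{2} \sum_{i=1}^{n} ( f(x_i) - y_i )^2,
% the network output evolves as a kernel regression driven by \Theta:
\[
  \frac{d f_t(x)}{dt} = - \sum_{i=1}^{n} \Theta_t(x, x_i) \left( f_t(x_i) - y_i \right)
\]
```

In the infinite-width regime the kernel \Theta_t stays fixed during training, which is the setting the abstract contrasts with its depth-based limit.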

Title: GRAVER: Generative Graph Vocabularies for Robust Graph Foundation Models Fine-tuning

Authors: Haonan Yuan, Qingyun Sun, Junhua Shi, Xingcheng Fu, Bryan Hooi, Jianxin Li, Philip S. Yu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.05592
Pdf URL: https://arxiv.org/pdf/2511.05592
Copy Paste: [[2511.05592]] GRAVER: Generative Graph Vocabularies for Robust Graph Foundation Models Fine-tuning(https://arxiv.org/abs/2511.05592)
Keywords: generative
Abstract: Inspired by the remarkable success of foundation models in language and vision, Graph Foundation Models (GFMs) hold significant promise for broad applicability across diverse graph tasks and domains. However, existing GFMs struggle with unstable few-shot fine-tuning, where both performance and adaptation efficiency exhibit significant fluctuations caused by the randomness in the support sample selection and structural discrepancies between the pre-trained and target graphs. How to fine-tune GFMs robustly and efficiently to enable trustworthy knowledge transfer across domains and tasks is the major challenge. In this paper, we propose GRAVER, a novel Generative gRAph VocabulariEs for Robust GFM fine-tuning framework that tackles the aforementioned instability via generative augmentations. Specifically, to identify transferable units, we analyze and extract key class-specific subgraph patterns by ego-graph disentanglement and validate their transferability both theoretically and empirically. To enable effective pre-training across diverse domains, we leverage a universal task template based on ego-graph similarity and construct graph vocabularies via graphon-based generative experts. To facilitate robust and efficient prompt fine-tuning, we grave the support samples with in-context vocabularies, where the lightweight MoE-CoE network attentively routes knowledge from source domains. Extensive experiments demonstrate the superiority of GRAVER in effectiveness, robustness, and efficiency on downstream few-shot node and graph classification tasks compared with 15 state-of-the-art baselines.

Title: AutoHood3D: A Multi-Modal Benchmark for Automotive Hood Design and Fluid-Structure Interaction

Authors: Vansh Sharma, Harish Jai Ganesh, Maryam Akram, Wanjiao Liu, Venkat Raman
Subjects: cs.LG, physics.comp-ph, physics.flu-dyn
Abstract URL: https://arxiv.org/abs/2511.05596
Pdf URL: https://arxiv.org/pdf/2511.05596
Copy Paste: [[2511.05596]] AutoHood3D: A Multi-Modal Benchmark for Automotive Hood Design and Fluid-Structure Interaction(https://arxiv.org/abs/2511.05596)
Keywords: generative
Abstract: This study presents a new high-fidelity multi-modal dataset containing 16000+ geometric variants of automotive hoods useful for machine learning (ML) applications such as engineering component design and process optimization, and multiphysics system surrogates. The dataset is centered on a practical multiphysics problem: hood deformation from fluid entrapment and inertial loading during rotary-dip painting. Each hood is numerically modeled with a coupled Large-Eddy Simulation (LES)-Finite Element Analysis (FEA), using 1.2M cells in total to ensure spatial and temporal accuracy. The dataset provides time-resolved physical fields, along with STL meshes and structured natural language prompts for text-to-geometry synthesis. Existing datasets are either confined to 2D cases, exhibit limited geometric variations, or lack multi-modal annotations and data structures; AutoHood3D addresses these shortcomings. We validate our numerical methodology, establish quantitative baselines across five neural architectures, and demonstrate systematic surrogate errors in displacement and force predictions. These findings motivate the design of novel approaches and multiphysics loss functions that enforce fluid-solid coupling during model training. By providing fully reproducible workflows, AutoHood3D enables physics-aware ML development, accelerates generative-design iteration, and facilitates the creation of new FSI benchmarks. Dataset and code URLs are given in the Appendix.

Title: Walking the Schrödinger Bridge: A Direct Trajectory for Text-to-3D Generation

Authors: Ziying Li, Xuequan Lu, Xinkui Zhao, Guanjie Cheng, Shuiguang Deng, Jianwei Yin
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.05609
Pdf URL: https://arxiv.org/pdf/2511.05609
Copy Paste: [[2511.05609]] Walking the Schrödinger Bridge: A Direct Trajectory for Text-to-3D Generation(https://arxiv.org/abs/2511.05609)
Keywords: generation
Abstract: Recent advancements in optimization-based text-to-3D generation heavily rely on distilling knowledge from pre-trained text-to-image diffusion models using techniques like Score Distillation Sampling (SDS), which often introduce artifacts such as over-saturation and over-smoothing into the generated 3D assets. In this paper, we address this essential problem by formulating the generation process as learning an optimal, direct transport trajectory between the distribution of the current rendering and the desired target distribution, thereby enabling high-quality generation with smaller Classifier-free Guidance (CFG) values. First, we theoretically establish SDS as a simplified instance of the Schrödinger Bridge framework. We prove that SDS employs the reverse process of a Schrödinger Bridge, which, under specific conditions (e.g., Gaussian noise as one end), collapses to SDS's score function of the pre-trained diffusion model. Based upon this, we introduce Trajectory-Centric Distillation (TraCe), a novel text-to-3D generation framework, which reformulates the mathematically tractable framework of the Schrödinger Bridge to explicitly construct a diffusion bridge from the current rendering to its text-conditioned, denoised target, and trains a LoRA-adapted model on this trajectory's score dynamics for robust 3D optimization. Comprehensive experiments demonstrate that TraCe consistently achieves superior quality and fidelity compared with state-of-the-art techniques.
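
For context, the two quantities this abstract refers to, the CFG weight and the SDS update, are commonly written as follows. These are the widely used forms from the score-distillation literature, given here under assumed notation (\epsilon_\phi for the pre-trained noise predictor, y for the text prompt, \omega for the guidance weight, g(\theta) for the differentiable renderer); they are not the TraCe objective, which the abstract does not state.

```latex
% Classifier-free guidance with weight \omega (one common convention):
\[
  \hat{\epsilon}_\phi(x_t; y, t) = \epsilon_\phi(x_t; \varnothing, t)
    + \omega \left( \epsilon_\phi(x_t; y, t) - \epsilon_\phi(x_t; \varnothing, t) \right)
\]

% SDS gradient for a rendering x = g(\theta), with x_t = \alpha_t x + \sigma_t \epsilon:
\[
  \nabla_\theta \mathcal{L}_{\mathrm{SDS}}
    = \mathbb{E}_{t, \epsilon} \left[ w(t) \left( \hat{\epsilon}_\phi(x_t; y, t) - \epsilon \right)
      \frac{\partial x}{\partial \theta} \right]
\]
```

SDS is typically run with large values of \omega, which is one reason a method that works with smaller CFG values is of interest.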
摘要：基于优化的文本到 3D 生成的最新进展很大程度上依赖于使用分数蒸馏采样 (SDS) 等技术从预先训练的文本到图像扩散模型中提取知识，这些技术通常会在生成的 3D 资产中引入过饱和和过平滑等伪影。在本文中，我们通过将生成过程制定为学习当前渲染分布与所需目标分布之间的最佳直接传输轨迹来解决这一基本问题，从而实现具有较小无分类器指导（CFG）值的高质量生成。首先，我们从理论上将 SDS 建立为薛定谔桥框架的简化实例。我们证明 SDS 采用了薛定谔桥的逆过程，在特定条件下（例如，以高斯噪声作为一端），该过程会崩溃为预训练扩散模型的 SDS 得分函数。在此基础上，我们引入了轨迹中心蒸馏（TraCe），这是一种新颖的文本到 3D 生成框架，它重新制定了薛定谔桥的数学可跟踪框架，以显式构建从当前渲染到其文本条件去噪目标的扩散桥，并在此轨迹的分数动态上训练一个适应 LoRA 的模型，以实现稳健的 3D 优化。综合实验表明，TraCe 始终如一地实现卓越的质量和对最先进技术的保真度。

Title: Pose-Aware Multi-Level Motion Parsing for Action Quality Assessment

Authors: Shuaikang Zhu, Yang Yang, Chen Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.05611
Pdf URL: https://arxiv.org/pdf/2511.05611
Copy Paste: [[2511.05611]] Pose-Aware Multi-Level Motion Parsing for Action Quality Assessment(https://arxiv.org/abs/2511.05611)
Keywords: quality assessment
Abstract: Human pose serves as a cornerstone of action quality assessment (AQA), where subtle spatial-temporal variations in pose often distinguish excellence from mediocrity. In high-level competitions, these nuanced differences become decisive factors in scoring. In this paper, we propose a novel multi-level motion parsing framework for AQA based on enhanced spatial-temporal pose features. At the first level, the Action-Unit Parser uses pose extraction to achieve precise action segmentation and comprehensive local-global pose representations. At the second level, the Motion Parser uses spatial-temporal feature learning to capture pose changes and appearance details for each action-unit. Meanwhile, conditions unrelated to the body, such as water splash in diving, also affect action scoring. We therefore design an additional Condition Parser to offer users more flexibility in their choices. Finally, a Weight-Adjust Scoring Module is introduced to better accommodate the diverse requirements of various action types and the multi-scale nature of action-units. Extensive evaluations on large-scale diving sports datasets demonstrate that our multi-level motion parsing framework achieves state-of-the-art performance in both action segmentation and action scoring tasks.
摘要：人体姿势是动作质量评估 (AQA) 的基石，其中姿势的微妙时空变化通常可以区分优秀与平庸。在高水平比赛中，这些细微差别成为得分的决定性因素。在本文中，我们提出了一种基于增强时空姿态特征的新型 AQA 多级运动解析框架。在第一层，动作单元解析器的设计借助姿势提取来实现精确的动作分割和全面的局部-全局姿势表示。在第二层，时空特征学习使用运动解析器来捕获每个动作单元的姿势变化和外观细节。同时，除了身体相关的一些特殊条件也会影响动作得分，例如跳水时的水花。在这项工作中，我们设计了一个额外的条件解析器，为用户的选择提供更大的灵活性。最后，引入权重调整评分模块，以更好地适应各种动作类型的多样化要求和动作单元的多尺度性质。对大规模跳水运动数据集的广泛评估表明，我们的多级运动解析框架在动作分割和动作评分任务中实现了最先进的性能。

Title: KLASS: KL-Guided Fast Inference in Masked Diffusion Models

Authors: Seo Hyun Kim, Sunwoo Hong, Hojung Jung, Youngrok Park, Se-Young Yun
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.05664
Pdf URL: https://arxiv.org/pdf/2511.05664
Copy Paste: [[2511.05664]] KLASS: KL-Guided Fast Inference in Masked Diffusion Models(https://arxiv.org/abs/2511.05664)
Keywords: generation
Abstract: Masked diffusion models have demonstrated competitive results on various tasks including language generation. However, due to their iterative refinement process, inference is often bottlenecked by slow and static sampling speed. To overcome this problem, we introduce "KL-Adaptive Stability Sampling" (KLASS), a fast yet effective sampling method that exploits token-level KL divergence to identify stable, high-confidence predictions. By unmasking multiple tokens in each iteration without any additional model training, our approach speeds up generation significantly while maintaining sample quality. On reasoning benchmarks, KLASS achieves up to 2.78× wall-clock speedups while improving performance over standard greedy decoding, attaining state-of-the-art results among diffusion-based samplers. We further validate KLASS across diverse domains, including text, image, and molecular generation, showing its effectiveness as a broadly applicable sampler across different models.
摘要：掩蔽扩散模型在包括语言生成在内的各种任务上都表现出了有竞争力的结果。然而，由于其迭代细化过程，推理常常受到缓慢且静态的采样速度的瓶颈。为了克服这个问题，我们引入了“KL 自适应稳定性采样”（KLASS），这是一种快速而有效的采样方法，利用令牌级 KL 散度来识别稳定、高置信度的预测。通过在每次迭代中揭开多个标记而无需任何额外的模型训练，我们的方法显着加快了生成速度，同时保持了样本质量。在推理基准测试中，KLASS 实现了高达 $2.78\times$ 的挂钟加速，同时提高了标准贪婪解码的性能，在基于扩散的采样器中获得了最先进的结果。我们进一步在不同领域（包括文本、图像和分子生成）验证 KLASS，显示其作为跨不同模型的广泛适用采样器的有效性。
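
Sketch (illustrative only): the KLASS abstract describes unmasking several tokens per iteration when their predictions are stable and confident. A minimal version of such a selection rule is shown below. The function name, the thresholds `kl_max` and `conf_min`, and the choice to compare only two consecutive steps are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def select_stable_tokens(prev_probs, curr_probs, masked,
                         kl_max=0.01, conf_min=0.9, eps=1e-12):
    """Return indices of masked positions that look safe to unmask.

    prev_probs, curr_probs : (seq_len, vocab) predictive distributions from
                             two consecutive denoising steps
    masked                 : (seq_len,) boolean array, True where still masked
    kl_max, conf_min       : hypothetical thresholds, to be tuned
    """
    prev = np.clip(prev_probs, eps, 1.0)
    curr = np.clip(curr_probs, eps, 1.0)
    # Token-level KL(curr || prev): small when the prediction stopped changing.
    kl = np.sum(curr * (np.log(curr) - np.log(prev)), axis=-1)
    # Confidence is the probability of the most likely token.
    conf = curr.max(axis=-1)
    stable = (kl < kl_max) & (conf > conf_min) & masked
    return np.flatnonzero(stable)
```

Positions that pass both tests can be unmasked together in one step, while the remaining positions stay masked and are re-predicted in the next iteration.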

Title: Long Grounded Thoughts: Distilling Compositional Visual Reasoning Chains at Scale

Authors: David Acuna, Chao-Han Huck Yang, Yuntian Deng, Jaehun Jung, Ximing Lu, Prithviraj Ammanabrolu, Hyunwoo Kim, Yuan-Hong Liao, Yejin Choi
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2511.05705
Pdf URL: https://arxiv.org/pdf/2511.05705
Copy Paste: [[2511.05705]] Long Grounded Thoughts: Distilling Compositional Visual Reasoning Chains at Scale(https://arxiv.org/abs/2511.05705)
Keywords: generation
Abstract: Recent progress in multimodal reasoning has been driven largely by undisclosed datasets and proprietary data synthesis recipes, leaving open questions about how to systematically build large-scale, vision-centric reasoning datasets, particularly for tasks that go beyond visual math. In this work, we introduce a new reasoning data generation framework spanning diverse skills and levels of complexity with over 1M high-quality synthetic vision-centric questions. The dataset also includes preference data and instruction prompts supporting both offline and online RL. Our synthesis framework proceeds in two stages: (1) scale; and (2) complexity. Reasoning traces are then synthesized through a two-stage process that leverages VLMs and reasoning LLMs, producing CoT traces for VLMs that capture the richness and diverse cognitive behaviors found in frontier reasoning models. Remarkably, we show that finetuning Qwen2.5-VL-7B on our data outperforms all open-data baselines across all evaluated vision-centric benchmarks, and even surpasses strong closed-data models such as MiMo-VL-7B-RL on V* Bench, CV-Bench and MMStar-V. Perhaps most surprising, despite being entirely vision-centric, our data transfers positively to text-only reasoning (MMLU-Pro) and audio reasoning (MMAU), demonstrating its effectiveness. Similarly, despite not containing videos or embodied visual data, we observe notable gains when evaluating on a single-evidence embodied QA benchmark (NiEH). Finally, we use our data to analyze the entire VLM post-training pipeline. Our empirical analysis highlights that (i) SFT on high-quality data with non-linear reasoning traces is essential for effective online RL, (ii) staged offline RL matches online RL's performance while reducing compute demands, and (iii) careful SFT on high quality data can substantially improve out-of-domain, cross-modality transfer.
摘要：多模态推理的最新进展主要是由未公开的数据集和专有的数据合成方法推动的，这留下了如何系统地构建大规模、以视觉为中心的推理数据集的悬而未决的问题，特别是对于超越视觉数学的任务。在这项工作中，我们引入了一个新的推理数据生成框架，涵盖不同的技能和复杂程度，并包含超过 100 万个以视觉为中心的高质量合成问题。该数据集还包括支持离线和在线强化学习的偏好数据和指令提示。我们的综合框架分两个阶段进行：（1）规模； (2)复杂性。然后，通过利用 VLM 和推理 LLM 的两阶段过程来合成推理轨迹，为 VLM 生成 CoT 轨迹，捕获前沿推理模型中发现的丰富性和多样化的认知行为。值得注意的是，我们表明，在所有以视觉为中心的评估基准中，对数据进行微调的 Qwen2.5-VL-7B 优于所有开放数据基线，甚至超过了 V* Bench、CV-Bench 和 MMStar-V 上的 MiMo-VL-7B-RL 等强大的封闭数据模型。也许最令人惊讶的是，尽管完全以视觉为中心，但我们的数据积极地转移到纯文本推理 (MMLU-Pro) 和音频推理 (MMAU)，证明了其有效性。同样，尽管不包含视频或具体视觉数据，但我们在评估单证据具体 QA 基准 (NiEH) 时观察到了显着的收益。最后，我们使用数据来分析整个 VLM 训练后流程。我们的实证分析强调，(i) 对具有非线性推理轨迹的高质量数据进行 SFT 对于有效的在线 RL 至关重要，(ii) 分阶段离线 RL 与在线 RL 的性能相匹配，同时减少计算需求，以及 (iii) 对高质量数据进行仔细的 SFT 可以显着改善域外、跨模态传输。

Title: Position-Prior-Guided Network for System Matrix Super-Resolution in Magnetic Particle Imaging

Authors: Xuqing Geng, Lei Su, Zhongwei Bian, Zewen Sun, Jiaxuan Wen, Jie Tian, Yang Du
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.05795
Pdf URL: https://arxiv.org/pdf/2511.05795
Copy Paste: [[2511.05795]] Position-Prior-Guided Network for System Matrix Super-Resolution in Magnetic Particle Imaging(https://arxiv.org/abs/2511.05795)
Keywords: super-resolution
Abstract: Magnetic Particle Imaging (MPI) is a novel medical imaging modality. One of the established methods for MPI reconstruction is based on the System Matrix (SM). However, the calibration of the SM is often time-consuming and requires repeated measurements whenever the system parameters change. Current methodologies utilize deep learning-based super-resolution (SR) techniques to expedite SM calibration; nevertheless, these strategies do not fully exploit physical prior knowledge associated with the SM, such as symmetric positional priors. Consequently, we integrated positional priors into existing frameworks for SM calibration. Underpinned by theoretical justification, we empirically validated the efficacy of incorporating positional priors through experiments involving both 2D and 3D SM SR methods.
摘要：磁粒子成像（MPI）是一种新型的医学成像方式。 MPI 重建的既定方法之一是基于系统矩阵 (SM)。然而，SM 的校准通常非常耗时，并且每当系统参数发生变化时都需要重复测量。当前的方法利用基于深度学习的超分辨率（SR）技术来加速 SM 校准；然而，这些策略并没有充分利用与 SM 相关的物理先验知识，例如对称位置先验。因此，我们将位置先验集成到现有的 SM 校准框架中。在理论论证的支持下，我们通过涉及 2D 和 3D SM SR 方法的实验实证验证了合并位置先验的有效性。

Title: Catching Contamination Before Generation: Spectral Kill Switches for Agents

Authors: Valentin Noël
Subjects: cs.LG, eess.SP, eess.SY, stat.ML
Abstract URL: https://arxiv.org/abs/2511.05804
Pdf URL: https://arxiv.org/pdf/2511.05804
Copy Paste: [[2511.05804]] Catching Contamination Before Generation: Spectral Kill Switches for Agents(https://arxiv.org/abs/2511.05804)
Keywords: generation
Abstract: Agentic language models compose multi-step reasoning chains, yet intermediate steps can be corrupted by inconsistent context, retrieval errors, or adversarial inputs, which makes post hoc evaluation too late because errors propagate before detection. We introduce a diagnostic that requires no additional training and uses only the forward pass to emit a binary accept or reject signal during agent execution. The method analyzes token graphs induced by attention and computes two spectral statistics in early layers, namely the high-frequency energy ratio and spectral entropy. We formalize these signals, establish invariances, and provide finite-sample estimators with uncertainty quantification. Under a two-regime mixture assumption with a monotone likelihood ratio property, we show that a single threshold on the high-frequency energy ratio is optimal in the Bayes sense for detecting context inconsistency. Empirically, the high-frequency energy ratio exhibits robust bimodality during context verification across multiple model families, which enables gating decisions with overhead below one millisecond on our hardware and configurations. We demonstrate integration into retrieval-augmented agent pipelines and discuss deployment as an inline safety monitor. The approach detects contamination while the model is still processing the text, before errors commit to the reasoning chain.
摘要：代理语言模型组成了多步骤推理链，但中间步骤可能会因上下文不一致、检索错误或对抗性输入而被破坏，这使得事后评估为时已晚，因为错误在检测之前就传播了。我们引入了一种不需要额外训练的诊断，并且仅使用前向传递在代理执行期间发出二进制接受或拒绝信号。该方法分析由注意力引起的令牌图，并计算早期层的两个谱统计量，即高频能量比和谱熵。我们将这些信号形式化，建立不变性，并提供具有不确定性量化的有限样本估计器。在具有单调似然比属性的两种机制混合假设下，我们表明高频能量比的单个阈值在贝叶斯意义上对于检测上下文不一致是最佳的。根据经验，高频能量比在跨多个模型系列的上下文验证期间表现出强大的双峰性，这使得我们的硬件和配置上的选通决策开销低于一毫秒。我们演示了与检索增强代理管道的集成，并讨论了作为内联安全监视器的部署。该方法在模型仍在处理文本时、在错误提交到推理链之前检测污染。
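
Sketch (illustrative only): the abstract names two spectral statistics computed on an attention-induced token graph. One plausible way to compute such quantities is shown below, using the graph Laplacian of a symmetrized attention matrix. The paper's exact definitions are not given in the abstract, so the symmetrization, the scalar per-token `signal`, and the `high_frac` cutoff are all assumptions.

```python
import numpy as np

def spectral_stats(attn, signal, high_frac=0.5, eps=1e-12):
    """Return (high_freq_energy_ratio, spectral_entropy) for one attention map.

    attn      : (n, n) attention weights between n tokens
    signal    : (n,) one scalar per token (hypothetical choice of graph signal)
    high_frac : fraction of the spectrum treated as "high frequency" (assumed)
    """
    n = attn.shape[0]
    w = 0.5 * (attn + attn.T)           # symmetrize to get an undirected graph
    np.fill_diagonal(w, 0.0)            # drop self-loops
    lap = np.diag(w.sum(axis=1)) - w    # combinatorial graph Laplacian
    _, evecs = np.linalg.eigh(lap)      # eigenvectors in ascending frequency order
    energy = (evecs.T @ signal) ** 2    # graph Fourier energy per frequency
    total = energy.sum() + eps
    cutoff = int(np.floor(n * (1.0 - high_frac)))
    hfer = energy[cutoff:].sum() / total
    p = energy / total                  # normalized spectral distribution
    entropy = -np.sum(p * np.log(p + eps))
    return hfer, entropy
```

A gate would then compare the first statistic against a single threshold to accept or reject. The threshold value and the direction of the comparison would have to be calibrated on data, since the abstract does not state them.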

Title: AiEDA: An Open-Source AI-Aided Design Library for Design-to-Vector

Authors: Yihang Qiu, Zengrong Huang, Simin Tao, Hongda Zhang, Weiguo Li, Xinhua Lai, Rui Wang, Weiqiang Wang, Xingquan Li
Subjects: cs.LG, cs.AR
Abstract URL: https://arxiv.org/abs/2511.05823
Pdf URL: https://arxiv.org/pdf/2511.05823
Copy Paste: [[2511.05823]] AiEDA: An Open-Source AI-Aided Design Library for Design-to-Vector(https://arxiv.org/abs/2511.05823)
Keywords: generation
Abstract: Recent research has demonstrated that artificial intelligence (AI) can assist electronic design automation (EDA) in improving both the quality and efficiency of chip design. But current AI for EDA (AI-EDA) infrastructures remain fragmented, lacking comprehensive solutions for the entire data pipeline from design execution to AI integration. Key challenges include fragmented flow engines that generate raw data, heterogeneous file formats for data exchange, non-standardized data extraction methods, and poorly organized data storage. This work introduces a unified open-source library for EDA (AiEDA) that addresses these issues. AiEDA integrates multiple design-to-vector data representation techniques that transform diverse chip design data into universal multi-level vector representations, establishing an AI-aided design (AAD) paradigm optimized for AI-EDA workflows. AiEDA provides complete physical design flows with programmatic data extraction and standardized Python interfaces bridging EDA datasets and AI frameworks. Leveraging the AiEDA library, we generate iDATA, a 600GB dataset of structured data derived from 50 real chip designs (28nm), and validate its effectiveness through seven representative AAD tasks spanning prediction, generation, optimization and analysis. The code is publicly available at this https URL, while the full iDATA dataset is being prepared for public release, providing a foundation for future AI-EDA research.
摘要：最近的研究表明，人工智能（AI）可以协助电子设计自动化（EDA）提高芯片设计的质量和效率。但目前的人工智能 EDA（AI-EDA）基础设施仍然分散，缺乏从设计执行到人工智能集成的整个数据管道的全面解决方案。主要挑战包括生成原始数据的碎片化流引擎、用于数据交换的异构文件格式、非标准化的数据提取方法以及组织不良的数据存储。这项工作引入了一个统一的 EDA 开源库 (AiEDA) 来解决这些问题。 AiEDA 集成了多种设计到矢量数据表示技术，可将不同的芯片设计数据转换为通用的多级矢量表示，建立针对 AI-EDA 工作流程优化的人工智能辅助设计 (AAD) 范式。 AiEDA 提供完整的物理设计流程，具有编程数据提取和桥接 EDA 数据集和 AI 框架的标准化 Python 接口。利用 AiEDA 库，我们生成了 iDATA，这是一个源自 50 个真实芯片设计 (28nm) 的 600GB 结构化数据数据集，并通过涵盖预测、生成、优化和分析的七个代表性 AAD 任务验证其有效性。该代码可通过此 https URL 公开获取，同时完整的 iDATA 数据集正在准备公开发布，为未来的 AI-EDA 研究奠定基础。

Title: Understanding Cross Task Generalization in Handwriting-Based Alzheimer's Screening via Vision Language Adaptation

Authors: Changqing Gong, Huafeng Qin, Mounim A. El-Yacoubi
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.05841
Pdf URL: https://arxiv.org/pdf/2511.05841
Copy Paste: [[2511.05841]] Understanding Cross Task Generalization in Handwriting-Based Alzheimer's Screening via Vision Language Adaptation(https://arxiv.org/abs/2511.05841)
Keywords: generative
Abstract: Alzheimer's disease is a prevalent neurodegenerative disorder for which early detection is critical. Handwriting, often disrupted in prodromal AD, provides a non-invasive and cost-effective window into subtle motor and cognitive decline. Existing handwriting-based AD studies, mostly relying on online trajectories and hand-crafted features, have not systematically examined how task type influences diagnostic performance and cross-task generalization. Meanwhile, large-scale vision language models have demonstrated remarkable zero- or few-shot anomaly detection in natural images and strong adaptability across medical modalities such as chest X-ray and brain MRI. However, handwriting-based disease detection remains largely unexplored within this paradigm. To close this gap, we introduce a lightweight Cross-Layer Fusion Adapter (CLFA) framework that repurposes CLIP for handwriting-based AD screening. CLFA implants multi-level fusion adapters within the visual encoder to progressively align representations toward handwriting-specific medical cues, enabling prompt-free and efficient zero-shot inference. Using this framework, we systematically investigate cross-task generalization (training on a specific handwriting task and evaluating on unseen ones) to reveal which task types and writing patterns most effectively discriminate AD. Extensive analyses further highlight characteristic stroke patterns and task-level factors that contribute to early AD identification, offering both diagnostic insights and a benchmark for handwriting-based cognitive assessment.
摘要：阿尔茨海默病是一种常见的神经退行性疾病，早期发现至关重要。手写体（在阿尔茨海默病前驱期通常会受到干扰）为观察微妙的运动和认知能力下降提供了一个非侵入性且具有成本效益的窗口。现有的基于手写的 AD 研究主要依赖于在线轨迹和手工制作的特征，尚未系统地研究任务类型如何影响诊断性能和跨任务泛化。与此同时，大规模视觉语言模型在自然图像中表现出出色的零或少样本异常检测能力，并且在胸部 X 射线和脑 MRI 等医疗模式中具有很强的适应性。然而，在这种范式中，基于手写的疾病检测在很大程度上仍未得到探索。为了弥补这一差距，我们引入了一个轻量级的跨层融合适配器框架，该框架将 CLIP 重新用于基于手写的 AD 筛选。 CLFA 在视觉编码器中植入多级融合适配器，以逐步将表示与手写特定的医疗线索对齐，从而实现无提示且高效的零样本推理。使用这个框架，我们系统地研究了跨任务泛化——对特定手写任务的训练和对未见过的手写任务的评估——以揭示哪些任务类型和书写模式最有效地区分 AD。广泛的分析进一步突出了有助于早期 AD 识别的特征性笔画模式和任务级因素，为基于手写的认知评估提供了诊断见解和基准。

Title: Enhancing Diffusion Model Guidance through Calibration and Regularization

Authors: Seyed Alireza Javid, Amirhossein Bagheri, Nuria González-Prelcic
Subjects: cs.CV, cs.AI, cs.IT, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2511.05844
Pdf URL: https://arxiv.org/pdf/2511.05844
Copy Paste: [[2511.05844]] Enhancing Diffusion Model Guidance through Calibration and Regularization(https://arxiv.org/abs/2511.05844)
Keywords: generation
Abstract: Classifier-guided diffusion models have emerged as a powerful approach for conditional image generation, but they suffer from overconfident predictions during early denoising steps, causing the guidance gradient to vanish. This paper introduces two complementary contributions to address this issue. First, we propose a differentiable calibration objective based on the Smooth Expected Calibration Error (Smooth ECE), which improves classifier calibration with minimal fine-tuning and yields measurable improvements in Fréchet Inception Distance (FID). Second, we develop enhanced sampling guidance methods that operate on off-the-shelf classifiers without requiring retraining. These include tilted sampling with batch-level reweighting, adaptive entropy-regularized sampling to preserve diversity, and a novel f-divergence-based sampling strategy that strengthens class-consistent guidance while maintaining mode coverage. Experiments on ImageNet 128x128 demonstrate that our divergence-regularized guidance achieves an FID of 2.13 using a ResNet-101 classifier, improving upon existing classifier-guided diffusion methods while requiring no diffusion model retraining. The results show that principled calibration and divergence-aware sampling provide practical and effective improvements for classifier-guided diffusion.
摘要：分类器引导的扩散模型已成为条件图像生成的强大方法，但它们在早期去噪步骤中遭受过度自信的预测，导致引导梯度消失。本文介绍了两个互补的贡献来解决这个问题。首先，我们提出了基于平滑预期校准误差（Smooth ECE）的可微校准目标，它以最小的微调改进了分类器校准，并在 Frechet 起始距离（FID）方面产生了可测量的改进。其次，我们开发了增强的采样指导方法，可以在现成的分类器上运行，无需重新训练。其中包括具有批次级重新加权的倾斜采样、用于保留多样性的自适应熵正则化采样，以及一种新颖的基于 f 散度的采样策略，该策略在保持模式覆盖的同时加强类一致的指导。 ImageNet 128x128 上的实验表明，我们的散度正则化指导使用 ResNet-101 分类器实现了 2.13 的 FID，改进了现有的分类器引导扩散方法，同时不需要扩散模型重新训练。结果表明，原则性校准和散度感知采样为分类器引导扩散提供了实用且有效的改进。
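
Sketch (illustrative only): the abstract builds its calibration objective on Smooth ECE. For orientation, the standard binned Expected Calibration Error, which the smooth variant is designed to make differentiable, can be computed as below. This is the common histogram estimator, not the paper's Smooth ECE objective, and the function name and `n_bins` default are assumptions.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard binned ECE: weighted mean |accuracy - confidence| over bins.

    confidences : (N,) predicted probability of the chosen class, in (0, 1]
    correct     : (N,) 1 if the prediction was right, else 0
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap   # weight by fraction of samples in bin
    return ece
```

An overconfident classifier, the failure mode the abstract describes for early denoising steps, shows up here as bins whose mean confidence exceeds their accuracy.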

Title: Point Cloud Segmentation of Integrated Circuits Package Substrates Surface Defects Using Causal Inference: Dataset Construction and Methodology

Authors: Bingyang Guo, Qiang Zuo, Ruiyun Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.05853
Pdf URL: https://arxiv.org/pdf/2511.05853
Copy Paste: [[2511.05853]] Point Cloud Segmentation of Integrated Circuits Package Substrates Surface Defects Using Causal Inference: Dataset Construction and Methodology(https://arxiv.org/abs/2511.05853)
Keywords: quality assessment
Abstract: The effective segmentation of 3D data is crucial for a wide range of industrial applications, especially for detecting subtle defects in the field of integrated circuits (IC). Ceramic package substrates (CPS), as an important electronic material, are essential in IC packaging owing to their superior physical and chemical properties. However, the complex structure and minor defects of CPS, along with the absence of a publicly available dataset, significantly hinder the development of CPS surface defect detection. In this study, we construct CPS3D-Seg, a high-quality point cloud dataset for 3D segmentation of surface defects in CPS, which has the best point resolution and precision compared to existing 3D industrial datasets. CPS3D-Seg consists of 1300 point cloud samples across 20 product categories, and each sample provides accurate point-level annotations. Meanwhile, we conduct a comprehensive benchmark based on SOTA point cloud segmentation algorithms to validate the effectiveness of CPS3D-Seg. Additionally, we propose a novel 3D segmentation method based on causal inference (CINet), which quantifies potential confounders in point clouds through Structural Refine (SR) and Quality Assessment (QA) Modules. Extensive experiments demonstrate that CINet significantly outperforms existing algorithms in both mIoU and accuracy.
摘要：3D 数据的有效分割对于广泛的工业应用至关重要，特别是对于检测集成电路 (IC) 领域的细微缺陷。陶瓷封装基板（CPS）作为重要的电子材料，由于其优越的物理和化学性能，在IC封装中至关重要。然而，CPS 的复杂结构和微小缺陷，以及缺乏公开可用的数据集，极大地阻碍了 CPS 表面缺陷检测的发展。在本研究中，我们构建了用于 CPS 中表面缺陷 3D 分割的高质量点云数据集，即 CPS3D-Seg，与现有 3D 工业数据集相比，它具有最佳的点分辨率和精度。 CPS3D-Seg由20个产品类别下的1300个点云样本组成，每个样本都提供准确的点级注释。同时，我们基于SOTA点云分割算法进行了全面的基准测试，以验证CPS3D-Seg的有效性。此外，我们提出了一种基于因果推理 (CINet) 的新型 3D 分割方法，该方法通过结构细化 (SR) 和质量评估 (QA) 模块量化点云中的潜在混杂因素。大量实验表明 CINet 在 mIoU 和准确率方面均显着优于现有算法。

Title: Predicting the Future by Retrieving the Past

Authors: Dazhao Du, Tao Han, Song Guo
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.05859
Pdf URL: https://arxiv.org/pdf/2511.05859
Copy Paste: [[2511.05859]] Predicting the Future by Retrieving the Past(https://arxiv.org/abs/2511.05859)
Keywords: generation
Abstract: Deep learning models such as MLP, Transformer, and TCN have achieved remarkable success in univariate time series forecasting, typically relying on sliding window samples from historical data for training. However, while these models implicitly compress historical information into their parameters during training, they are unable to explicitly and dynamically access this global knowledge during inference, relying only on the local context within the lookback window. This results in an underutilization of rich patterns from the global history. To bridge this gap, we propose Predicting the Future by Retrieving the Past (PFRP), a novel approach that explicitly integrates global historical data to enhance forecasting accuracy. Specifically, we construct a Global Memory Bank (GMB) to effectively store and manage global historical patterns. A retrieval mechanism is then employed to extract similar patterns from the GMB, enabling the generation of global predictions. By adaptively combining these global predictions with the outputs of any local prediction model, PFRP produces more accurate and interpretable forecasts. Extensive experiments conducted on seven real-world datasets demonstrate that PFRP significantly enhances the average performance of advanced univariate forecasting models by 8.4%. Code can be found at this https URL.
摘要：MLP、Transformer 和 TCN 等深度学习模型在单变量时间序列预测方面取得了显着的成功，通常依赖历史数据的滑动窗口样本进行训练。然而，虽然这些模型在训练期间隐式地将历史信息压缩到其参数中，但它们无法在推理期间显式动态地访问这些全局知识，而仅依赖于回溯窗口内的局部上下文。这导致全球历史中丰富的模式没有得到充分利用。为了弥补这一差距，我们提出通过检索过去预测未来（PFRP），这是一种明确整合全球历史数据以提高预测准确性的新颖方法。具体来说，我们构建了一个全局内存库（GMB）来有效地存储和管理全局历史模式。然后采用检索机制从 GMB 中提取相似的模式，从而生成全局预测。通过自适应地将这些全局预测与任何本地预测模型的输出相结合，PFRP 可以生成更准确且可解释的预测。对七个真实世界数据集进行的大量实验表明，PFRP 将高级单变量预测模型的平均性能显着提高了 8.4%。可以在此 https URL 中找到代码。
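
Sketch (illustrative only): the PFRP abstract describes storing historical patterns in a memory bank, retrieving similar ones, and blending the resulting global prediction with a local model's output. A minimal nearest-neighbour version of that idea is shown below. The function names, the Euclidean distance, the softmax-style weights, and the fixed blend weight `alpha` are assumptions; the paper combines predictions adaptively and its details are not given in the abstract.

```python
import numpy as np

def build_memory_bank(series, lookback, horizon):
    """Slice a 1-D history into (lookback window, following horizon) pairs."""
    keys, values = [], []
    for s in range(len(series) - lookback - horizon + 1):
        keys.append(series[s:s + lookback])
        values.append(series[s + lookback:s + lookback + horizon])
    return np.array(keys), np.array(values)

def retrieve_forecast(query, keys, values, k=3):
    """Average the futures of the k stored windows closest to the query."""
    dists = np.linalg.norm(keys - query, axis=1)
    idx = np.argsort(dists)[:k]
    weights = np.exp(-dists[idx])        # closer windows get larger weight
    weights /= weights.sum()
    return weights @ values[idx]

def combine(local_pred, global_pred, alpha=0.5):
    """Blend local and retrieved forecasts with a fixed weight (simplification)."""
    return alpha * global_pred + (1.0 - alpha) * local_pred
```

Because the retrieved windows are actual past segments, they can be inspected directly, which is one way a retrieval-based forecast can be made interpretable.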

Title: CGCE: Classifier-Guided Concept Erasure in Generative Models

Authors: Viet Nguyen, Vishal M. Patel
Subjects: cs.CV, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2511.05865
Pdf URL: https://arxiv.org/pdf/2511.05865
Copy Paste: [[2511.05865]] CGCE: Classifier-Guided Concept Erasure in Generative Models(https://arxiv.org/abs/2511.05865)
Keywords: generation, generative
Abstract: Recent advancements in large-scale generative models have enabled the creation of high-quality images and videos, but have also raised significant safety concerns regarding the generation of unsafe content. To mitigate this, concept erasure methods have been developed to remove undesirable concepts from pre-trained models. However, existing methods remain vulnerable to adversarial attacks that can regenerate the erased content. Moreover, achieving robust erasure often degrades the model's generative quality for safe, unrelated concepts, creating a difficult trade-off between safety and performance. To address this challenge, we introduce Classifier-Guided Concept Erasure (CGCE), an efficient plug-and-play framework that provides robust concept erasure for diverse generative models without altering their original weights. CGCE uses a lightweight classifier operating on text embeddings to first detect and then refine prompts containing undesired concepts. This approach is highly scalable, allowing for multi-concept erasure by aggregating guidance from several classifiers. By modifying only unsafe embeddings at inference time, our method prevents harmful content generation while preserving the model's original quality on benign prompts. Extensive experiments show that CGCE achieves state-of-the-art robustness against a wide range of red-teaming attacks. Our approach also maintains high generative utility, demonstrating a superior balance between safety and performance. We showcase the versatility of CGCE through its successful application to various modern T2I and T2V models, establishing it as a practical and effective solution for safe generative AI.
摘要：大规模生成模型的最新进展使得能够创建高质量的图像和视频，但也引发了有关生成不安全内容的重大安全问题。为了缓解这种情况，已经开发了概念擦除方法，以从预训练模型中删除不需要的概念。然而，现有的方法仍然容易受到可以重新生成已擦除内容的对抗性攻击。此外，实现稳健的擦除通常会降低模型对于安全、不相关概念的生成质量，从而在安全性和性能之间造成困难的权衡。为了应对这一挑战，我们引入了分类器引导概念擦除（CGCE），这是一种高效的即插即用框架，可为不同的生成模型提供强大的概念擦除，而无需改变其原始权重。 CGCE 使用对文本嵌入进行操作的轻量级分类器来首先检测然后细化包含不需要的概念的提示。这种方法具有高度可扩展性，可以通过聚合多个分类器的指导来消除多概念。通过在推理时仅修改不安全的嵌入，我们的方法可以防止有害内容的生成，同时在良性提示下保留模型的原始质量。大量实验表明，CGCE 针对各种红队攻击实现了最先进的鲁棒性。我们的方法还保持了较高的生成效用，展示了安全性和性能之间的卓越平衡。我们通过将 CGCE 成功应用于各种现代 T2I 和 T2V 模型来展示 CGCE 的多功能性，将其确立为安全生成 AI 的实用且有效的解决方案。

Title: Open-World 3D Scene Graph Generation for Retrieval-Augmented Reasoning

Authors: Fei Yu, Quan Deng, Shengeng Tang, Yuehua Li, Lechao Cheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.05894
Pdf URL: https://arxiv.org/pdf/2511.05894
Copy Paste: [[2511.05894]] Open-World 3D Scene Graph Generation for Retrieval-Augmented Reasoning(https://arxiv.org/abs/2511.05894)
Keywords: generation
Abstract: Understanding 3D scenes in open-world settings poses fundamental challenges for vision and robotics, particularly due to the limitations of closed-vocabulary supervision and static annotations. To address this, we propose a unified framework for Open-World 3D Scene Graph Generation with Retrieval-Augmented Reasoning, which enables generalizable and interactive 3D scene understanding. Our method integrates Vision-Language Models (VLMs) with retrieval-based reasoning to support multimodal exploration and language-guided interaction. The framework comprises two key components: (1) a dynamic scene graph generation module that detects objects and infers semantic relationships without fixed label sets, and (2) a retrieval-augmented reasoning pipeline that encodes scene graphs into a vector database to support text/image-conditioned queries. We evaluate our method on 3DSSG and Replica benchmarks across four tasks (scene question answering, visual grounding, instance retrieval, and task planning), demonstrating robust generalization and superior performance in diverse environments. Our results highlight the effectiveness of combining open-vocabulary perception with retrieval-based reasoning for scalable 3D scene understanding.
摘要：在开放世界环境中理解 3D 场景对视觉和机器人技术提出了根本性挑战，特别是由于封闭词汇监督和静态注释的限制。为了解决这个问题，我们提出了一个使用检索增强推理的开放世界 3D 场景图生成的统一框架，该框架能够实现可泛化和交互式的 3D 场景理解。我们的方法将视觉语言模型（VLM）与基于检索的推理相结合，以支持多模式探索和语言引导的交互。该框架包括两个关键组件：(1) 动态场景图生成模块，无需固定标签集即可检测对象并推断语义关系；(2) 检索增强推理管道，将场景图编码到矢量数据库中以支持文本/图像条件查询。我们在 3DSSG 和 Replica 基准测试上评估了我们的方法，涉及四个任务——场景问答、视觉基础、实例检索和任务规划——在不同的环境中展示了强大的泛化能力和卓越的性能。我们的结果强调了将开放词汇感知与基于检索的推理相结合以实现可扩展的 3D 场景理解的有效性。

Title: AD-DAE: Unsupervised Modeling of Longitudinal Alzheimer's Disease Progression with Diffusion Auto-Encoder

Authors: Ayantika Das, Arunima Sarkar, Keerthi Ram, Mohanasankar Sivaprakasam
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.05934
Pdf URL: https://arxiv.org/pdf/2511.05934
Copy Paste: [[2511.05934]] AD-DAE: Unsupervised Modeling of Longitudinal Alzheimer's Disease Progression with Diffusion Auto-Encoder(https://arxiv.org/abs/2511.05934)
Keywords: generation, generative
Abstract: Generative modeling frameworks have emerged as an effective approach to capture high-dimensional image distributions from large datasets without requiring domain-specific knowledge, a capability essential for longitudinal disease progression modeling. Recent generative modeling approaches have attempted to capture progression by mapping images into a latent representational space and then controlling and guiding the representations to generate follow-up images from a baseline image. However, existing approaches impose constraints on distribution learning, leading to latent spaces with limited controllability to generate follow-up images without explicit supervision from subject-specific longitudinal images. In order to enable controlled movements in the latent representational space and generate progression images from a baseline image in an unsupervised manner, we introduce a conditionable Diffusion Auto-encoder framework. The explicit encoding mechanism of image-diffusion auto-encoders forms a compact latent space capturing high-level semantics, providing means to disentangle information relevant for progression. Our approach leverages this latent space to condition and apply controlled shifts to baseline representations for generating follow-up. Controllability is induced by restricting these shifts to a subspace, thereby isolating progression-related factors from subject identity-preserving components. The shifts are implicitly guided by correlating with progression attributes, without requiring subject-specific longitudinal supervision. We validate the generations through image quality metrics, volumetric progression analysis, and downstream classification in Alzheimer's disease datasets from two different sources and disease categories. This demonstrates the effectiveness of our approach for Alzheimer's progression modeling and longitudinal image generation.
摘要：生成建模框架已成为一种从大型数据集中捕获高维图像分布的有效方法，无需特定领域的知识，这是纵向疾病进展建模所必需的能力。最近的生成建模方法试图通过将图像映射到潜在表征空间中，然后控制和引导表征以从基线图像生成后续图像来捕获进展。然而，现有的方法对分布学习施加了限制，导致潜在空间的可控性有限，无法在没有特定主题纵向图像的明确监督的情况下生成后续图像。为了实现潜在表征空间中的受控运动并以无监督的方式从基线图像生成进展图像，我们引入了一个条件扩散自动编码器框架。图像扩散自动编码器的显式编码机制形成了捕获高级语义的紧凑潜在空间，提供了解开与进展相关的信息的方法。我们的方法利用这个潜在空间来调节和应用受控偏移到基线表示以生成后续结果。通过将这些转变限制在子空间内来诱导可控性，从而将进展相关因素与受试者身份保留组件隔离开来。这些转变是通过与进展属性相关联来隐式引导的，而不需要特定于主题的纵向监督。我们通过图像质量指标、体积进展分析和来自两个不同来源和疾病类别的阿尔茨海默病数据集中的下游分类来验证各代。这证明了我们的阿尔茨海默病进展建模和纵向图像生成方法的有效性。

Title: Interaction-Centric Knowledge Infusion and Transfer for Open-Vocabulary Scene Graph Generation

Authors: Lin Li, Chuhan Zhang, Dong Zhang, Chong Sun, Chen Li, Long Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.05935
Pdf URL: https://arxiv.org/pdf/2511.05935
Copy Paste: [[2511.05935]] Interaction-Centric Knowledge Infusion and Transfer for Open-Vocabulary Scene Graph Generation(https://arxiv.org/abs/2511.05935)
Keywords: generation
Abstract: Open-vocabulary scene graph generation (OVSGG) extends traditional SGG by recognizing novel objects and relationships beyond predefined categories, leveraging the knowledge from pre-trained large-scale models. Existing OVSGG methods always adopt a two-stage pipeline: 1) infusing knowledge into large-scale models via pre-training on large datasets; 2) transferring knowledge from pre-trained models with fully annotated scene graphs during supervised fine-tuning. However, due to a lack of explicit interaction modeling, these methods struggle to distinguish between interacting and non-interacting instances of the same object category. This limitation induces critical issues in both stages of OVSGG: it generates noisy pseudo-supervision from mismatched objects during knowledge infusion, and causes ambiguous query matching during knowledge transfer. To this end, in this paper, we propose an interACtion-Centric end-to-end OVSGG framework (ACC) in an interaction-driven paradigm to minimize these mismatches. For interaction-centric knowledge infusion, ACC employs a bidirectional interaction prompt for robust pseudo-supervision generation to enhance the model's interaction knowledge. For interaction-centric knowledge transfer, ACC first adopts interaction-guided query selection that prioritizes pairing interacting objects to reduce interference from non-interacting ones. Then, it integrates interaction-consistent knowledge distillation to bolster robustness by pushing relational foreground away from the background while retaining general knowledge. Extensive experimental results on three benchmarks show that ACC achieves state-of-the-art performance, demonstrating the potential of interaction-centric paradigms for real-world applications.
摘要：开放词汇场景图生成 (OVSGG) 通过识别预定义类别之外的新对象和关系，利用预先训练的大型模型中的知识，扩展了传统的 SGG。现有的 OVSGG 方法总是采用两阶段流程：1）通过在大型数据集上进行预训练，将知识注入到大型模型中； 2）在监督微调期间，从具有完全注释场景图的预训练模型中传递知识。然而，由于缺乏显式的交互建模，这些方法很难区分同一对象类别的交互实例和非交互实例。这种限制在 OVSGG 的两个阶段都会引发关键问题：它在知识注入期间从不匹配的对象中生成嘈杂的伪监督，并在知识转移期间导致不明确的查询匹配。为此，在本文中，我们在交互驱动的范式中提出了一个 inter\textbf{AC}tion-\textbf{C}entric 端到端 OVSGG 框架（\textbf{ACC}），以最大限度地减少这些不匹配。对于 \textit{以交互为中心的知识注入}，ACC 采用双向交互提示来生成鲁棒的伪监督，以增强模型的交互知识。对于 \textit{以交互为中心的知识转移}，ACC 首先采用交互引导的查询选择，优先考虑配对交互对象，以减少非交互对象的干扰。然后，它集成了交互一致的知识蒸馏，通过将关系前景从背景中推开，同时保留一般知识来增强鲁棒性。三个基准的大量实验结果表明 ACC 实现了最先进的性能，展示了以交互为中心的范例在现实世界应用中的潜力。

Title: A Dual-Mode ViT-Conditioned Diffusion Framework with an Adaptive Conditioning Bridge for Breast Cancer Segmentation

Authors: Prateek Singh, Moumita Dholey, P.K. Vinod
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.05989
Pdf URL: https://arxiv.org/pdf/2511.05989
Copy Paste: [[2511.05989]] A Dual-Mode ViT-Conditioned Diffusion Framework with an Adaptive Conditioning Bridge for Breast Cancer Segmentation(https://arxiv.org/abs/2511.05989)
Keywords: generative
Abstract: In breast ultrasound images, precise lesion segmentation is essential for early diagnosis; however, low contrast, speckle noise, and unclear boundaries make this difficult. Even though deep learning models have demonstrated potential, standard convolutional architectures frequently fall short in capturing enough global context, resulting in segmentations that are anatomically inconsistent. To overcome these drawbacks, we propose a flexible, conditional Denoising Diffusion Model that combines an enhanced UNet-based generative decoder with a Vision Transformer (ViT) encoder for global feature extraction. We introduce three primary innovations: 1) an Adaptive Conditioning Bridge (ACB) for efficient, multi-scale fusion of semantic features; 2) a novel Topological Denoising Consistency (TDC) loss component that regularizes training by penalizing structural inconsistencies during denoising; and 3) a dual-head architecture, pairing a noise prediction head with a lightweight auxiliary head, that leverages the denoising objective as a powerful regularizer and enables the auxiliary head to perform rapid and accurate inference on smaller datasets. Our framework establishes a new state-of-the-art on public breast ultrasound datasets, achieving Dice scores of 0.96 on BUSI, 0.90 on BrEaST and 0.97 on BUS-UCLM. Comprehensive ablation studies empirically validate that the model components are critical for achieving these results and for producing segmentations that are not only accurate but also anatomically plausible.
摘要：在乳腺超声图像中，精确的病灶分割对于早期诊断至关重要；然而，低对比度、散斑噪声和不清晰的边界使这变得困难。尽管深度学习模型已展现出潜力，但标准卷积架构经常无法捕获足够的全局上下文，从而导致分割在解剖学上不一致。为了克服这些缺点，我们提出了一种灵活的条件去噪扩散模型，该模型将增强的基于 UNet 的生成解码器与视觉变换器 (ViT) 编码器相结合，以进行全局特征提取。我们介绍了三个主要创新：1）自适应条件桥（ACB），用于高效、多尺度的语义特征融合； 2）一种新颖的拓扑去噪一致性（TDC）损失组件，通过惩罚去噪期间的结构不一致来规范训练； 3）双头架构，利用去噪目标作为强大的正则器，使轻量级辅助头能够对较小的数据集和噪声预测头执行快速而准确的推理。我们的框架在公共乳腺超声数据集上建立了新的最先进水平，在 BUSI 上获得了 0.96 的 Dice 分数，在 BrEaST 上获得了 0.90 的 Dice 分数，在 BUS-UCLM 上获得了 0.97 的 Dice 分数。综合消融研究凭经验验证模型组件对于实现这些结果以及产生不仅准确而且在解剖学上合理的分割至关重要。

Title: MALeR: Improving Compositional Fidelity in Layout-Guided Generation

Authors: Shivank Saxena, Dhruv Srivastava, Makarand Tapaswi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.06002
Pdf URL: https://arxiv.org/pdf/2511.06002
Copy Paste: [[2511.06002]] MALeR: Improving Compositional Fidelity in Layout-Guided Generation(https://arxiv.org/abs/2511.06002)
Keywords: generation
Abstract: Recent advances in text-to-image models have enabled a new era of creative and controllable image generation. However, generating compositional scenes with multiple subjects and attributes remains a significant challenge. To enhance user control over subject placement, several layout-guided methods have been proposed. However, these methods face numerous challenges, particularly in compositional scenes. Unintended subjects often appear outside the layouts, generated images can be out-of-distribution and contain unnatural artifacts, or attributes bleed across subjects, leading to incorrect visual outputs. In this work, we propose MALeR, a method that addresses each of these challenges. Given a text prompt and corresponding layouts, our method prevents subjects from appearing outside the given layouts while being in-distribution. Additionally, we propose a masked, attribute-aware binding mechanism that prevents attribute leakage, enabling accurate rendering of subjects with multiple attributes, even in complex compositional scenes. Qualitative and quantitative evaluation demonstrates that our method achieves superior performance in compositional accuracy, generation consistency, and attribute binding compared to previous work. MALeR is particularly adept at generating images of scenes with multiple subjects and multiple attributes per subject.
摘要：文本到图像模型的最新进展开启了创造性和可控图像生成的新时代。然而，生成具有多个主题和属性的合成场景仍然是一个重大挑战。为了增强用户对主题放置的控制，已经提出了几种布局引导方法。然而，这些方法面临着许多挑战，特别是在构图场景中。意外的主题经常出现在布局之外，生成的图像可能不符合分布并包含不自然的伪影，或者属性在主题之间渗透，从而导致不正确的视觉输出。在这项工作中，我们提出了 MALeR，一种解决这些挑战的方法。给定文本提示和相应的布局，我们的方法可以防止主题在分发时出现在给定布局之外。此外，我们提出了一种屏蔽的属性感知绑定机制，可以防止属性泄漏，即使在复杂的构图场景中也能准确渲染具有多个属性的主题。定性和定量评估表明，与之前的工作相比，我们的方法在组成准确性、生成一致性和属性绑定方面取得了优异的性能。 MALeR 特别擅长生成具有多个主题和每个主题多个属性的场景图像。

Title: MiVID: Multi-Strategic Self-Supervision for Video Frame Interpolation using Diffusion Model

Authors: Priyansh Srivastava, Romit Chatterjee, Abir Sen, Aradhana Behura, Ratnakar Dash
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.06019
Pdf URL: https://arxiv.org/pdf/2511.06019
Copy Paste: [[2511.06019]] MiVID: Multi-Strategic Self-Supervision for Video Frame Interpolation using Diffusion Model(https://arxiv.org/abs/2511.06019)
Keywords: restoration
Abstract: Video Frame Interpolation (VFI) remains a cornerstone in video enhancement, enabling temporal upscaling for tasks like slow-motion rendering, frame rate conversion, and video restoration. While classical methods rely on optical flow and learning-based models assume access to dense ground-truth, both struggle with occlusions, domain shifts, and ambiguous motion. This article introduces MiVID, a lightweight, self-supervised, diffusion-based framework for video interpolation. Our model eliminates the need for explicit motion estimation by combining a 3D U-Net backbone with transformer-style temporal attention, trained under a hybrid masking regime that simulates occlusions and motion uncertainty. The use of cosine-based progressive masking and adaptive loss scheduling allows our network to learn robust spatiotemporal representations without any high-frame-rate supervision. Our framework is evaluated on UCF101-7 and DAVIS-7 datasets. MiVID is trained entirely on CPU using the datasets and 9-frame video segments, making it a low-resource yet highly effective pipeline. Despite these constraints, our model achieves optimal results at just 50 epochs, competitive with several supervised baselines. This work demonstrates the power of self-supervised diffusion priors for temporally coherent frame synthesis and provides a scalable path toward accessible and generalizable VFI systems.
摘要：视频帧插值 (VFI) 仍然是视频增强的基石，支持慢动作渲染、帧速率转换和视频恢复等任务的时间升级。虽然经典方法依赖于光流，而基于学习的模型假设可以访问密集的地面实况，但两者都在与遮挡、域转移和模糊运动作斗争。本文介绍 MiVID，这是一种轻量级、自监督、基于扩散的视频插值框架。我们的模型通过将 3D U-Net 主干与 Transformer 式时间注意力相结合，消除了对显式运动估计的需要，并在模拟遮挡和运动不确定性的混合掩蔽机制下进行训练。使用基于余弦的渐进掩蔽和自适应损失调度使我们的网络能够在没有任何高帧率监督的情况下学习鲁棒的时空表示。我们的框架在 UCF101-7 和 DAVIS-7 数据集上进行评估。 MiVID 完全在 CPU 上使用数据集和 9 帧视频片段进行训练，使其成为一个低资源但高效的管道。尽管存在这些限制，我们的模型仅用了 50 个 epoch 就实现了最佳结果，与几个受监督的 http URL 工作相竞争，展示了自监督扩散先验对于时间相干帧合成的力量，并为可访问和可推广的 VFI 系统提供了一条可扩展的路径。
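The abstract mentions cosine-based progressive masking but does not give the schedule. As a minimal sketch, assuming the masked fraction ramps from a low to a high value over training along a half-cosine curve, the idea might look like the following. The function name `cosine_mask_ratio` and the bounds `r_min` and `r_max` are hypothetical and not taken from the paper.

```python
import math

def cosine_mask_ratio(epoch, total_epochs, r_min=0.1, r_max=0.6):
    """Hypothetical schedule: ramp the masked fraction from r_min to r_max."""
    # Training progress in [0, 1]; clamped so out-of-range epochs stay defined.
    t = min(max(epoch / max(total_epochs - 1, 1), 0.0), 1.0)
    # Half-cosine ease-in/ease-out: 0 at t = 0, 1 at t = 1.
    ramp = 0.5 * (1.0 - math.cos(math.pi * t))
    return r_min + (r_max - r_min) * ramp

if __name__ == "__main__":
    # Print the masked fraction at a few points of a 50-epoch run.
    for e in (0, 12, 25, 37, 49):
        print(e, round(cosine_mask_ratio(e, 50), 3))
```

A schedule of this shape changes slowly at the start and end of training and fastest in the middle; whether MiVID uses this exact form is not stated in the abstract.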

Title: Lethe: Layer- and Time-Adaptive KV Cache Pruning for Reasoning-Intensive LLM Serving

Authors: Hui Zeng, Daming Zhao, Pengfei Yang, Wenxuan Hou, Tianyang Zheng, Hui Li, Weiye Ji, Jidong Zhai
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.06029
Pdf URL: https://arxiv.org/pdf/2511.06029
Copy Paste: [[2511.06029]] Lethe: Layer- and Time-Adaptive KV Cache Pruning for Reasoning-Intensive LLM Serving(https://arxiv.org/abs/2511.06029)
Keywords: generation, generative
Abstract: Generative reasoning with large language models (LLMs) often involves long decoding sequences, leading to substantial memory and latency overheads from accumulating key-value (KV) caches. While existing KV compression methods primarily focus on reducing prefill memory from long input sequences, they fall short in addressing the dynamic and layer-sensitive nature of long-form generation, which is central to reasoning tasks. We propose Lethe, a dynamic KV cache management framework that introduces adaptivity along both the spatial and temporal dimensions of decoding. Along the spatial dimension, Lethe performs layerwise sparsity-aware allocation, assigning token pruning budgets to each transformer layer based on estimated attention redundancy. Along the temporal dimension, Lethe conducts multi-round token pruning during generation, driven by a Recency-Aware Selective Retention (RASR) mechanism. RASR extends traditional recency-based heuristics by also considering token relevance derived from evolving attention patterns, enabling informed decisions about which tokens to retain or evict. Empirical results demonstrate that Lethe achieves a favorable balance between efficiency and generation quality across diverse models and tasks, and increases throughput by up to 2.56x.
摘要：使用大型语言模型 (LLM) 的生成推理通常涉及较长的解码序列，从而导致累积键值 (KV) 缓存产生大量内存和延迟开销。虽然现有的 KV 压缩方法主要侧重于减少长输入序列的预填充内存，但它们在解决长格式生成的动态和层敏感性质方面存在不足，而长格式生成是推理任务的核心。我们提出了 Lethe，一种动态 KV 缓存管理框架，它在解码的空间和时间维度上引入了自适应性。沿着空间维度，Lethe 执行分层稀疏感知分配，根据估计的注意力冗余将标记修剪预算分配给每个转换器层。沿着时间维度，Lethe 在新近感知选择性保留 (RASR) 机制的驱动下，在生成过程中进行多轮令牌修剪。 RASR 扩展了传统的基于新近度的启发法，还考虑了源自不断变化的注意力模式的代币相关性，从而能够就保留或驱逐哪些代币做出明智的决策。实证结果表明，Lethe 在不同模型和任务中实现了效率和发电质量之间的良好平衡，吞吐量提高了高达 2.56 倍。
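To make the retention idea concrete, here is a small sketch of a recency-plus-relevance eviction rule for one layer's KV cache. It is an illustration under stated assumptions, not Lethe's actual RASR algorithm: the scoring formula, the protected recent window, and the names `rasr_keep`, `recent_window`, and `recency_weight` are all hypothetical.

```python
def rasr_keep(attn_scores, budget, recent_window=2, recency_weight=0.5):
    """Return sorted indices of cached tokens to keep (hypothetical rule).

    attn_scores[i] is an accumulated attention score for cached token i;
    budget is the number of tokens this layer may retain.
    """
    n = len(attn_scores)
    if budget >= n:
        return list(range(n))
    # Always protect the newest tokens (pure recency heuristic).
    window = min(recent_window, budget)
    keep = set(range(n - window, n))

    def score(i):
        # Relevance from attention plus a bonus that grows with recency.
        recency = i / max(n - 1, 1)
        return attn_scores[i] + recency_weight * recency

    # Fill the remaining budget with the highest-scoring older tokens.
    rest = sorted((i for i in range(n) if i not in keep), key=score, reverse=True)
    keep.update(rest[: budget - len(keep)])
    return sorted(keep)

if __name__ == "__main__":
    print(rasr_keep([0.9, 0.05, 0.02, 0.6, 0.1, 0.1], budget=4))
```

Calling the function again after each block of generated tokens would give a multi-round pruning loop; a layerwise scheme would pass a different `budget` per layer.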

Title: Advancing Ocean State Estimation with efficient and scalable AI

Authors: Yanfei Xiang, Yuan Gao, Hao Wu, Quan Zhang, Ruiqi Shu, Xiao Zhou, Xi Wu, Xiaomeng Huang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.06041
Pdf URL: https://arxiv.org/pdf/2511.06041
Copy Paste: [[2511.06041]] Advancing Ocean State Estimation with efficient and scalable AI(https://arxiv.org/abs/2511.06041)
Keywords: super-resolution
Abstract: Accurate and efficient global ocean state estimation remains a grand challenge for Earth system science, hindered by the dual bottlenecks of computational scalability and degraded data fidelity in traditional data assimilation (DA) and deep learning (DL) approaches. Here we present an AI-driven Data Assimilation Framework for Ocean (ADAF-Ocean) that directly assimilates multi-source and multi-scale observations, ranging from sparse in-situ measurements to 4 km satellite swaths, without any interpolation or data thinning. Inspired by Neural Processes, ADAF-Ocean learns a continuous mapping from heterogeneous inputs to ocean states, preserving native data fidelity. Through AI-driven super-resolution, it reconstructs 0.25° mesoscale dynamics from coarse 1° fields, which ensures both efficiency and scalability, with just 3.7% more parameters than the 1° configuration. When coupled with a DL forecasting system, ADAF-Ocean extends global forecast skill by up to 20 days compared to baselines without assimilation. This framework establishes a computationally viable and scientifically rigorous pathway toward real-time, high-resolution Earth system monitoring.
摘要：准确高效的全球海洋状态估计仍然是地球系统科学面临的巨大挑战，受到传统数据同化（DA）和深度学习（DL）方法中计算可扩展性和数据保真度下降的双重瓶颈的阻碍。在这里，我们提出了一种人工智能驱动的海洋数据同化框架（ADAF-Ocean），该框架直接同化多源和多尺度观测，范围从稀疏的原位测量到 4 公里的卫星测绘带，无需任何插值或数据稀疏。受神经过程的启发，ADAF-Ocean 学习从异构输入到海洋状态的连续映射，从而保持本地数据保真度。通过人工智能驱动的超分辨率，它从粗略的 1$^\circ$ 场重建 0.25$^\circ$ 介观动力学，保证了效率和可扩展性，参数仅比 1$^\circ$ 配置多 3.7\%。与 DL 预报系统结合使用时，ADAF-Ocean 与未同化的基线相比，可将全球预报技能延长最多 20 天。该框架建立了一条计算上可行且科学上严格的实时、高分辨率地球系统监测途径。

Title: Neodragon: Mobile Video Generation using Diffusion Transformer

Authors: Animesh Karnewar, Denis Korzhenkov, Ioannis Lelekas, Adil Karjauv, Noor Fathima, Hanwen Xiong, Vancheeswaran Vaidyanathan, Will Zeng, Rafael Esteves, Tushar Singhal, Fatih Porikli, Mohsen Ghafoorian, Amirhossein Habibian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.06055
Pdf URL: https://arxiv.org/pdf/2511.06055
Copy Paste: [[2511.06055]] Neodragon: Mobile Video Generation using Diffusion Transformer(https://arxiv.org/abs/2511.06055)
Keywords: super-resolution, generation, generative
Abstract: We introduce Neodragon, a text-to-video system capable of generating 2s (49 frames @24 fps) videos at the 640x1024 resolution directly on a Qualcomm Hexagon NPU in a record 6.7s (7 FPS). Differing from existing transformer-based offline text-to-video generation models, Neodragon is the first to have been specifically optimised for mobile hardware to achieve efficient and high-fidelity video synthesis. We achieve this through four key technical contributions: (1) Replacing the original large 4.762B T5xxl Text-Encoder with a much smaller 0.2B DT5 (DistilT5) with minimal quality loss, enabled through a novel Text-Encoder Distillation procedure. (2) Proposing an Asymmetric Decoder Distillation approach allowing us to replace the native codec-latent-VAE decoder with a more efficient one, without disturbing the generative latent-space of the generation pipeline. (3) Pruning of MMDiT blocks within the denoiser backbone based on their relative importance, with recovery of original performance through a two-stage distillation process. (4) Reducing the NFE (Neural Functional Evaluation) requirement of the denoiser by performing step distillation using DMD adapted for pyramidal flow-matching, thereby substantially accelerating video generation. When paired with an optimised SSD1B first-frame image generator and QuickSRNet for 2x super-resolution, our end-to-end Neodragon system becomes a highly parameter (4.945B full model), memory (3.5GB peak RAM usage), and runtime (6.7s E2E latency) efficient mobile-friendly model, while achieving a VBench total score of 81.61. By enabling low-cost, private, and on-device text-to-video synthesis, Neodragon democratizes AI-based video content creation, empowering creators to generate high-quality videos without reliance on cloud services. Code and model will be made publicly available at our website: this https URL
摘要：我们推出了 Neodragon，这是一种文本到视频系统，能够直接在 Qualcomm Hexagon NPU 上以创纪录的 6.7 秒（7 FPS）生成分辨率为 640x1024 的 2 秒（49 帧@24 fps）视频。与现有基于Transformer的离线文本视频生成模型不同，Neodragon首次针对移动硬件进行了专门优化，以实现高效、高保真的视频合成。我们通过四项关键技术贡献实现了这一目标：(1) 用更小的 0.2B DT5 (DistilT5) 替换原来的大型 4.762B T5xxl 文本编码器，通过新颖的文本编码器蒸馏过程实现最小的质量损失。 (2) 提出一种非对称解码器蒸馏方法，使我们能够用更高效的解码器替换本机编解码器潜在 VAE 解码器，而不会干扰生成管道的生成潜在空间。 (3) 根据相对重要性对降噪主干内的 MMDiT 块进行修剪，并通过两阶段蒸馏过程恢复原始性能。 (4)通过使用适用于金字塔流匹配的DMD进行逐步蒸馏，降低降噪器的NFE（神经功能评估）要求，从而显着加速视频生成。当与优化的 SSD1B 第一帧图像生成器和 QuickSRNet 实现 2 倍超分辨率配合使用时，我们的端到端 Neodragon 系统成为高参数（4.945B 完整模型）、内存（3.5GB 峰值 RAM 使用）和运行时（6.7 秒 E2E 延迟）高效的移动友好模型，同时获得 81.61 的 VBench 总分。通过实现低成本、私有的、设备上的文本到视频合成，Neodragon 使基于人工智能的视频内容创作民主化，使创作者能够在不依赖云服务的情况下生成高质量的视频。代码和模型将在我们的网站上公开：此 https URL

Title: Hybrid CNN-ViT Framework for Motion-Blurred Scene Text Restoration

Authors: Umar Rashid (1), Muhammad Arslan Arshad (1), Ghulam Ahmad (1), Muhammad Zeeshan Anjum (1), Rizwan Khan (1), Muhammad Akmal (2) ((1) University of Engineering & Technology, New Campus, Lahore, Pakistan, (2) Sheffield Hallam University, Sheffield, UK)
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.06087
Pdf URL: https://arxiv.org/pdf/2511.06087
Copy Paste: [[2511.06087]] Hybrid CNN-ViT Framework for Motion-Blurred Scene Text Restoration(https://arxiv.org/abs/2511.06087)
Keywords: restoration
Abstract: Motion blur in scene text images severely impairs readability and hinders the reliability of computer vision tasks, including autonomous driving, document digitization, and visual information retrieval. Conventional deblurring approaches are often inadequate in handling spatially varying blur and typically fall short in modeling the long-range dependencies necessary for restoring textual clarity. To overcome these limitations, we introduce a hybrid deep learning framework that combines convolutional neural networks (CNNs) with vision transformers (ViTs), thereby leveraging both local feature extraction and global contextual reasoning. The architecture employs a CNN-based encoder-decoder to preserve structural details, while a transformer module enhances global awareness through self-attention. Training is conducted on a curated dataset derived from TextOCR, where sharp scene-text samples are paired with synthetically blurred versions generated using realistic motion-blur kernels of multiple sizes and orientations. Model optimization is guided by a composite loss that incorporates mean absolute error (MAE), squared error (MSE), perceptual similarity, and structural similarity (SSIM). Quantitative evaluations show that the proposed method attains 32.20 dB in PSNR and 0.934 in SSIM, while remaining lightweight with 2.83 million parameters and an average inference time of 61 ms. These results highlight the effectiveness and computational efficiency of the CNN-ViT hybrid design, establishing its practicality for real-world motion-blurred scene-text restoration.
摘要：场景文本图像中的运动模糊严重损害了可读性，并阻碍了计算机视觉任务的可靠性，包括自动驾驶、文档数字化和视觉信息检索。传统的去模糊方法通常不足以处理空间变化的模糊，并且通常无法对恢复文本清晰度所需的远程依赖性进行建模。为了克服这些限制，我们引入了一种混合深度学习框架，它将卷积神经网络（CNN）与视觉变换器（ViT）相结合，从而利用局部特征提取和全局上下文推理。该架构采用基于 CNN 的编码器-解码器来保留结构细节，而转换器模块则通过自注意力增强全局意识。训练是在源自 TextOCR 的精选数据集上进行的，其中清晰的场景文本样本与使用多种尺寸和方向的真实运动模糊内核生成的合成模糊版本配对。模型优化由复合损失指导，其中包含平均绝对误差 (MAE)、平方误差 (MSE)、感知相似性和结构相似性 (SSIM)。定量评估表明，该方法的 PSNR 达到 32.20 dB，SSIM 达到 0.934，同时保持轻量级，具有 283 万个参数，平均推理时间为 61 ms。这些结果突出了 CNN-ViT 混合设计的有效性和计算效率，确立了其在现实世界运动模糊场景文本恢复中的实用性。
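For readers who want to relate the reported PSNR to a pixel-level error, the standard definition PSNR = 10 log10(MAX^2 / MSE) can be inverted. The sketch below assumes images normalized to [0, 1] (so MAX = 1), which the abstract does not state; the helper names are illustrative only.

```python
import math

def psnr(mse, max_val=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_val ** 2 / mse)

def mse_from_psnr(psnr_db, max_val=1.0):
    """Invert the PSNR definition to recover the mean squared error."""
    return max_val ** 2 / (10.0 ** (psnr_db / 10.0))

if __name__ == "__main__":
    mse = mse_from_psnr(32.20)
    # Under the [0, 1] assumption this is an MSE of roughly 6.0e-4.
    print(f"MSE  = {mse:.3e}")
    print(f"RMSE = {math.sqrt(mse):.4f}")
```

Under that normalization assumption, 32.20 dB corresponds to a root-mean-square error of about 2.5% of the intensity range.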

Title: Adapting Web Agents with Synthetic Supervision

Authors: Zhaoyang Wang, Yiming Liang, Xuchao Zhang, Qianhui Wu, Siwei Han, Anson Bastos, Rujia Wang, Chetan Bansal, Baolin Peng, Jianfeng Gao, Saravan Rajmohan, Huaxiu Yao
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2511.06101
Pdf URL: https://arxiv.org/pdf/2511.06101
Copy Paste: [[2511.06101]] Adapting Web Agents with Synthetic Supervision(https://arxiv.org/abs/2511.06101)
Keywords: generation
Abstract: Web agents struggle to adapt to new websites due to the scarcity of environment-specific tasks and demonstrations. Recent works have explored synthetic data generation to address this challenge; however, they suffer from data quality issues where synthesized tasks contain hallucinations that cannot be executed, and collected trajectories are noisy with redundant or misaligned actions. In this paper, we propose SynthAgent, a fully synthetic supervision framework that aims at improving synthetic data quality via dual refinement of both tasks and trajectories. Our approach begins by synthesizing diverse tasks through categorized exploration of web elements, ensuring efficient coverage of the target environment. During trajectory collection, we refine tasks when conflicts with actual observations are detected, mitigating hallucinations while maintaining task consistency. After collection, we conduct trajectory refinement with a global context to mitigate potential noise or misalignments. Finally, we fine-tune open-source web agents on the refined synthetic data to adapt them to the target environment. Experimental results demonstrate that SynthAgent outperforms existing synthetic data methods, validating the importance of high-quality synthetic supervision. The code will be publicly available at this https URL.
摘要：由于缺乏环境特定任务和演示，网络代理很难适应新网站。最近的工作已经探索了合成数据生成来应对这一挑战，然而，它们遇到了数据质量问题，其中合成任务包含无法执行的幻觉，并且收集的轨迹由于冗余或未对齐的动作而充满噪音。在本文中，我们提出了 SynthAgent，这是一个完全综合的监督框架，旨在通过任务和轨迹的双重细化来提高综合数据质量。我们的方法首先通过对网络元素的分类探索来综合不同的任务，确保有效覆盖目标环境。在轨迹收集过程中，当检测到与实际观察发生冲突时，我们会完善任务，从而在保持任务一致性的同时减轻幻觉。收集后，我们在全局背景下进行轨迹细化，以减轻潜在的噪音或错位。最后，我们根据精炼的合成数据对开源网络代理进行微调，以使其适应目标环境。实验结果表明，SynthAgent 优于现有的合成数据方法，验证了高质量合成监督的重要性。该代码将在此 https URL 上公开提供。

Title: Latent Refinement via Flow Matching for Training-free Linear Inverse Problem Solving

Authors: Hossein Askari, Yadan Luo, Hongfu Sun, Fred Roosta
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.06138
Pdf URL: https://arxiv.org/pdf/2511.06138
Copy Paste: [[2511.06138]] Latent Refinement via Flow Matching for Training-free Linear Inverse Problem Solving(https://arxiv.org/abs/2511.06138)
Keywords: generative
Abstract: Recent advances in inverse problem solving have increasingly adopted flow priors over diffusion models due to their ability to construct straight probability paths from noise to data, thereby enhancing efficiency in both training and inference. However, current flow-based inverse solvers face two primary limitations: (i) they operate directly in pixel space, which demands heavy computational resources for training and restricts scalability to high-resolution images, and (ii) they employ guidance strategies with prior-agnostic posterior covariances, which can weaken alignment with the generative trajectory and degrade posterior coverage. In this paper, we propose LFlow (Latent Refinement via Flows), a training-free framework for solving linear inverse problems via pretrained latent flow priors. LFlow leverages the efficiency of flow matching to perform ODE sampling in latent space along an optimal path. This latent formulation further allows us to introduce a theoretically grounded posterior covariance, derived from the optimal vector field, enabling effective flow guidance. Experimental results demonstrate that our proposed method outperforms state-of-the-art latent diffusion solvers in reconstruction quality across most tasks. The code will be publicly available at this https URL.
摘要：逆向问题解决的最新进展越来越多地采用流先验而不是扩散模型，因为它们能够构建从噪声到数据的直接概率路径，从而提高训练和推理的效率。然而，当前基于流的逆解算器面临两个主要限制：（i）它们直接在像素空间中运行，这需要大量的计算资源进行训练，并限制了高分辨率图像的可扩展性；（ii）它们采用具有先验不可知的后验协方差的指导策略，这可能会削弱与生成轨迹的对齐并降低后验覆盖率。在本文中，我们提出了 LFlow（Latent Refinement via Flows），这是一种无需训练的框架，用于通过预训练的潜在流先验来解决线性逆问题。 LFlow 利用流匹配的效率沿着最佳路径在潜在空间中执行 ODE 采样。这种潜在的公式进一步使我们能够引入理论上有根据的后验协方差，该后验协方差是从最佳向量场导出的，从而实现有效的流动引导。实验结果表明，我们提出的方法在大多数任务的重建质量方面优于最先进的潜在扩散求解器。该代码将在此 https URL 公开提供。

Title: MambaOVSR: Multiscale Fusion with Global Motion Modeling for Chinese Opera Video Super-Resolution

Authors: Hua Chang, Xin Xu, Wei Liu, Wei Wang, Xin Yuan, Kui Jiang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.06172
Pdf URL: https://arxiv.org/pdf/2511.06172
Copy Paste: [[2511.06172]] MambaOVSR: Multiscale Fusion with Global Motion Modeling for Chinese Opera Video Super-Resolution(https://arxiv.org/abs/2511.06172)
Keywords: super-resolution
Abstract: Chinese opera is celebrated for preserving classical art. However, early filming equipment limitations have degraded videos of last-century performances by renowned artists (e.g., low frame rates and resolution), hindering archival efforts. Although space-time video super-resolution (STVSR) has advanced significantly, applying it directly to opera videos remains challenging. The scarcity of datasets impedes the recovery of high-frequency details, and existing STVSR methods lack global modeling capabilities, compromising visual quality when handling opera's characteristic large motions. To address these challenges, we pioneer a large-scale Chinese Opera Video Clip (COVC) dataset and propose the Mamba-based multiscale fusion network for space-time Opera Video Super-Resolution (MambaOVSR). Specifically, MambaOVSR involves three novel components: the Global Fusion Module (GFM) for motion modeling through a multiscale alternating scanning mechanism, the Multiscale Synergistic Mamba Module (MSMM) for alignment across different sequence lengths, and the MambaVR block, which resolves feature artifacts and positional information loss during alignment. Experimental results on the COVC dataset show that MambaOVSR significantly outperforms the SOTA STVSR method by an average of 1.86 dB in terms of PSNR. Dataset and Code will be publicly released.
摘要：中国戏曲因保存古典艺术而闻名。然而，早期拍摄设备的限制已经降低了上世纪著名艺术家表演的视频质量（例如低帧速率和分辨率），阻碍了档案工作。尽管时空视频超分辨率（STVSR）已经取得了显着进步，但将其直接应用于歌剧视频仍然具有挑战性。数据集的稀缺阻碍了高频细节的恢复，并且现有的STVSR方法缺乏全局建模能力，在处理歌剧特有的大动作时会影响视觉质量。为了应对这些挑战，我们开创了大规模中国戏曲视频剪辑（COVC）数据集，并提出了基于 Mamba 的时空戏曲视频超分辨率（MambaOVSR）多尺度融合网络。具体来说，MambaOVSR 涉及三个新颖的组件：通过多尺度交替扫描机制进行运动建模的全局融合模块 (GFM)，以及用于跨不同序列长度对齐的多尺度协同 Mamba 模块 (MSMM)。此外，我们的 MambaVR 模块还解决了对齐过程中的特征伪影和位置信息丢失问题。 COVC数据集上的实验结果表明，MambaOVSR在PSNR方面明显优于SOTA STVSR方法，平均1.86 dB。数据集和代码将公开发布。

Title: NURBGen: High-Fidelity Text-to-CAD Generation through LLM-Driven NURBS Modeling

Authors: Muhammad Usama, Mohammad Sadil Khan, Didier Stricker, Muhammad Zeshan Afzal
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.06194
Pdf URL: https://arxiv.org/pdf/2511.06194
Copy Paste: [[2511.06194]] NURBGen: High-Fidelity Text-to-CAD Generation through LLM-Driven NURBS Modeling(https://arxiv.org/abs/2511.06194)
Keywords: generation
Abstract: Generating editable 3D CAD models from natural language remains challenging, as existing text-to-CAD systems either produce meshes or rely on scarce design-history data. We present NURBGen, the first framework to generate high-fidelity 3D CAD models directly from text using Non-Uniform Rational B-Splines (NURBS). To achieve this, we fine-tune a large language model (LLM) to translate free-form texts into JSON representations containing NURBS surface parameters (i.e., control points, knot vectors, degrees, and rational weights) which can be directly converted into BRep format using Python. We further propose a hybrid representation that combines untrimmed NURBS with analytic primitives to handle trimmed surfaces and degenerate regions more robustly, while reducing token complexity. Additionally, we introduce partABC, a curated subset of the ABC dataset consisting of individual CAD components, annotated with detailed captions using an automated annotation pipeline. NURBGen demonstrates strong performance on diverse prompts, surpassing prior methods in geometric fidelity and dimensional accuracy, as confirmed by expert evaluations. Code and dataset will be released publicly.
摘要：从自然语言生成可编辑的 3D CAD 模型仍然具有挑战性，因为现有的文本到 CAD 系统要么生成网格，要么依赖稀缺的设计历史数据。我们推出 NURBGen，这是第一个使用非均匀有理 B 样条 (NURBS) 直接从文本生成高保真 3D CAD 模型的框架。为了实现这一目标，我们微调大型语言模型 (LLM)，将自由格式文本转换为包含 NURBS 曲面参数（\textit{i.e}、控制点、结向量、度数和有理权重）的 JSON 表示形式，这些参数可以使用 Python 直接转换为 BRep 格式。我们进一步提出了一种混合表示，将未修剪的 NURBS 与分析图元相结合，以更鲁棒地处理修剪曲面和退化区域，同时降低标记复杂性。此外，我们还引入了partABC，这是 ABC 数据集的精选子集，由单独的 CAD 组件组成，并使用自动注释管道用详细的标题进行注释。正如专家评估所证实的那样，NURBGen 在各种提示上表现出强大的性能，在几何保真度和尺寸精度方面超越了先前的方法。代码和数据集将公开发布。
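The abstract lists the NURBS surface parameters carried in the JSON output but not the schema. The sketch below assumes a hypothetical layout (field names `control_points`, `weights`, `knots_u`, `knots_v`, `degree_u`, `degree_v` are invented for illustration) and checks the standard consistency rule that a knot vector has length equal to the number of control points plus the degree plus one.

```python
import json

def validate_nurbs_surface(surf):
    """Check basic NURBS consistency for a dict with hypothetical field names."""
    pts, w = surf["control_points"], surf["weights"]
    n_u, n_v = len(pts), len(pts[0])
    # The control net and the weight grid must be rectangular and same-shaped.
    if any(len(row) != n_v for row in pts):
        return False
    if len(w) != n_u or any(len(row) != n_v for row in w):
        return False
    # Rational weights are conventionally positive; enforced here as an assumption.
    if any(x <= 0 for row in w for x in row):
        return False
    for knots, n, deg in ((surf["knots_u"], n_u, surf["degree_u"]),
                          (surf["knots_v"], n_v, surf["degree_v"])):
        # Standard relation: len(knots) == n_control_points + degree + 1.
        if len(knots) != n + deg + 1:
            return False
        # Knot vectors must be non-decreasing.
        if any(a > b for a, b in zip(knots, knots[1:])):
            return False
    return True

# A bilinear patch (degree 1 in both directions) as a toy example.
EXAMPLE = json.loads("""
{
  "degree_u": 1, "degree_v": 1,
  "knots_u": [0, 0, 1, 1], "knots_v": [0, 0, 1, 1],
  "control_points": [[[0, 0, 0], [0, 1, 0]], [[1, 0, 0], [1, 1, 0]]],
  "weights": [[1, 1], [1, 1]]
}
""")

if __name__ == "__main__":
    print(validate_nurbs_surface(EXAMPLE))
```

A check like this could run on model output before attempting a BRep conversion, since malformed knot vectors are an easy failure to catch early.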

Title: Scene-Aware Urban Design: A Human-AI Recommendation Framework Using Co-Occurrence Embeddings and Vision-Language Models

Authors: Rodrigo Gallardo, Oz Fishman, Alexander Htet Kyaw
Subjects: cs.CV, cs.HC
Abstract URL: https://arxiv.org/abs/2511.06201
Pdf URL: https://arxiv.org/pdf/2511.06201
Copy Paste: [[2511.06201]] Scene-Aware Urban Design: A Human-AI Recommendation Framework Using Co-Occurrence Embeddings and Vision-Language Models(https://arxiv.org/abs/2511.06201)
Keywords: generative
Abstract: This paper introduces a human-in-the-loop computer vision framework that uses generative AI to propose micro-scale design interventions in public space and support more continuous, local participation. Using Grounding DINO and a curated subset of the ADE20K dataset as a proxy for the urban built environment, the system detects urban objects and builds co-occurrence embeddings that reveal common spatial configurations. From this analysis, the user receives five statistically likely complements to a chosen anchor object. A vision language model then reasons over the scene image and the selected pair to suggest a third object that completes a more complex urban tactic. The workflow keeps people in control of selection and refinement and aims to move beyond top-down master planning by grounding choices in everyday patterns and lived experience.
摘要：本文介绍了一种人机循环计算机视觉框架，该框架使用生成式人工智能提出公共空间的微观设计干预措施，并支持更持续的本地参与。该系统使用 Grounding DINO 和 ADE20K 数据集的精选子集作为城市建筑环境的代理，检测城市物体并构建揭示常见空间配置的共现嵌入。根据此分析，用户收到了对所选锚定对象的五个统计上可能的补充。然后，视觉语言模型对场景图像和所选对进行推理，以建议完成更复杂的城市策略的第三个对象。该工作流程使人们能够控制选择和细化，并旨在通过将选择建立在日常模式和生活经验的基础上，超越自上而下的总体规划。
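The recommendation step described above (five likely complements to a chosen anchor object) can be illustrated with plain co-occurrence counts. Note that this swaps in raw pair counts for the paper's co-occurrence embeddings, and the scene labels and function names are made up for the example.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(scenes):
    """Count how often each unordered pair of labels appears in the same scene."""
    counts = Counter()
    for labels in scenes:
        # Deduplicate and sort so each pair has one canonical key.
        for a, b in combinations(sorted(set(labels)), 2):
            counts[(a, b)] += 1
    return counts

def top_complements(counts, anchor, k=5):
    """Return up to k labels that most often co-occur with the anchor."""
    scores = Counter()
    for (a, b), c in counts.items():
        if a == anchor:
            scores[b] += c
        elif b == anchor:
            scores[a] += c
    return [label for label, _ in scores.most_common(k)]

if __name__ == "__main__":
    scenes = [
        ["bench", "tree", "lamp"],
        ["bench", "tree"],
        ["bench", "bin"],
        ["tree", "lamp"],
    ]
    print(top_complements(cooccurrence_counts(scenes), "bench"))
```

In the described workflow, the detected labels would come from an object detector run on scene images, and the user would pick one of the suggested complements before the vision-language model proposes a third object.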

Title: Physics-Informed Image Restoration via Progressive PDE Integration

Authors: Shamika Likhite, Santiago López-Tapia, Aggelos K. Katsaggelos
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.06244
Pdf URL: https://arxiv.org/pdf/2511.06244
Copy Paste: [[2511.06244]] Physics-Informed Image Restoration via Progressive PDE Integration(https://arxiv.org/abs/2511.06244)
Keywords: restoration
Abstract: Motion blur, caused by relative movement between camera and scene during exposure, significantly degrades image quality and impairs downstream computer vision tasks such as object detection, tracking, and recognition in dynamic environments. While deep learning-based motion deblurring methods have achieved remarkable progress, existing approaches face fundamental challenges in capturing the long-range spatial dependencies inherent in motion blur patterns. Traditional convolutional methods rely on limited receptive fields and require extremely deep networks to model global spatial relationships. These limitations motivate the need for alternative approaches that incorporate physical priors to guide feature evolution during restoration. In this paper, we propose a progressive training framework that integrates physics-informed PDE dynamics into state-of-the-art restoration architectures. By leveraging advection-diffusion equations to model feature evolution, our approach naturally captures the directional flow characteristics of motion blur while enabling principled global spatial modeling. Our PDE-enhanced deblurring models achieve superior restoration quality with minimal overhead, adding only approximately 1% to inference GMACs while providing consistent improvements in perceptual quality across multiple state-of-the-art architectures. Comprehensive experiments on standard motion deblurring benchmarks demonstrate that our physics-informed approach improves PSNR and SSIM significantly across four diverse architectures, including FFTformer, NAFNet, Restormer, and Stripformer. These results validate that incorporating mathematical physics principles through PDE-based global layers can enhance deep learning-based image restoration, establishing a promising direction for physics-informed neural network design in computer vision applications.
摘要：运动模糊是由曝光期间相机和场景之间的相对运动引起的，会显着降低图像质量，并损害下游计算机视觉任务，例如动态环境中的对象检测、跟踪和识别。虽然基于深度学习的运动去模糊方法取得了显着的进步，但现有方法在捕获运动模糊模式固有的远程空间依赖性方面面临着根本性挑战。传统的卷积方法依赖于有限的感受野，并且需要极深的网络来建模全局空间关系。这些限制激发了对替代方法的需求，这些方法结合了物理先验来指导恢复过程中的特征演化。在本文中，我们提出了一种渐进式训练框架，它将基于物理的 PDE 动力学集成到最先进的恢复架构中。通过利用平流扩散方程来模拟特征演化，我们的方法自然地捕获运动模糊的方向流特征，同时实现有原则的全局空间建模。我们的 PDE 增强去模糊模型以最小的开销实现了卓越的恢复质量，仅增加了大约 1% 的推理 GMAC，同时在多个最先进的架构中提供了感知质量的一致改进。标准运动去模糊基准的综合实验表明，我们的基于物理的方法在四种不同的架构（包括 FFTformer、NAFNet、Restormer 和 Stripformer）中显着提高了 PSNR 和 SSIM。这些结果验证了通过基于偏微分方程的全局层结合数学物理原理可以增强基于深度学习的图像恢复，为计算机视觉应用中基于物理的神经网络设计建立一个有希望的方向。

Title: Gait Recognition via Collaborating Discriminative and Generative Diffusion Models

Authors: Haijun Xiong, Bin Feng, Bang Wang, Xinggang Wang, Wenyu Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.06245
Pdf URL: https://arxiv.org/pdf/2511.06245
Copy Paste: [[2511.06245]] Gait Recognition via Collaborating Discriminative and Generative Diffusion Models(https://arxiv.org/abs/2511.06245)
Keywords: generation, generative
Abstract: Gait recognition offers a non-intrusive biometric solution by identifying individuals through their walking patterns. Although discriminative models have achieved notable success in this domain, the full potential of generative models remains largely underexplored. In this paper, we introduce CoD², a novel framework that combines the data distribution modeling capabilities of diffusion models with the semantic representation learning strengths of discriminative models to extract robust gait features. We propose a Multi-level Conditional Control strategy that incorporates both high-level identity-aware semantic conditions and low-level visual details. Specifically, the high-level condition, extracted by the discriminative extractor, guides the generation of identity-consistent gait sequences, whereas low-level visual details, such as appearance and motion, are preserved to enhance consistency. Furthermore, the generated sequences facilitate the discriminative extractor's learning, enabling it to capture more comprehensive high-level semantic features. Extensive experiments on four datasets (SUSTech1K, CCPG, GREW, and Gait3D) demonstrate that CoD² achieves state-of-the-art performance and can be seamlessly integrated with existing discriminative methods, yielding consistent improvements.

Title: Test-Time Iterative Error Correction for Efficient Diffusion Models

Authors: Yunshan Zhong, Yanwei Qi, Yuxin Zhang
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2511.06250
Pdf URL: https://arxiv.org/pdf/2511.06250
Copy Paste: [[2511.06250]] Test-Time Iterative Error Correction for Efficient Diffusion Models(https://arxiv.org/abs/2511.06250)
Keywords: generation
Abstract: With the growing demand for high-quality image generation on resource-constrained devices, efficient diffusion models have received increasing attention. However, such models suffer from approximation errors introduced by efficiency techniques, which significantly degrade generation quality. Once deployed, these errors are difficult to correct, as modifying the model is typically infeasible in deployment environments. Through an analysis of error propagation across diffusion timesteps, we reveal that these approximation errors can accumulate exponentially, severely impairing output quality. Motivated by this insight, we propose Iterative Error Correction (IEC), a novel test-time method that mitigates inference-time errors by iteratively refining the model's output. IEC is theoretically proven to reduce error propagation from exponential to linear growth, without requiring any retraining or architectural changes. IEC can seamlessly integrate into the inference process of existing diffusion models, enabling a flexible trade-off between performance and efficiency. Extensive experiments show that IEC consistently improves generation quality across various datasets, efficiency techniques, and model architectures, establishing it as a practical and generalizable solution for test-time enhancement of efficient diffusion models.

Title: Breaking the Modality Barrier: Generative Modeling for Accurate Molecule Retrieval from Mass Spectra

Authors: Yiwen Zhang, Keyan Ding, Yihang Wu, Xiang Zhuang, Yi Yang, Qiang Zhang, Huajun Chen
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.06259
Pdf URL: https://arxiv.org/pdf/2511.06259
Copy Paste: [[2511.06259]] Breaking the Modality Barrier: Generative Modeling for Accurate Molecule Retrieval from Mass Spectra(https://arxiv.org/abs/2511.06259)
Keywords: generative
Abstract: Retrieving molecular structures from tandem mass spectra is a crucial step in rapid compound identification. Existing retrieval methods, such as traditional mass spectral library matching, suffer from limited spectral library coverage, while recent cross-modal representation learning frameworks often encounter modality misalignment, resulting in suboptimal retrieval accuracy and generalization. To address these limitations, we propose GLMR, a Generative Language Model-based Retrieval framework that mitigates the cross-modal misalignment through a two-stage process. In the pre-retrieval stage, a contrastive learning-based model identifies top candidate molecules as contextual priors for the input mass spectrum. In the generative retrieval stage, these candidate molecules are integrated with the input mass spectrum to guide a generative model in producing refined molecular structures, which are then used to re-rank the candidates based on molecular similarity. Experiments on both MassSpecGym and the proposed MassRET-20k dataset demonstrate that GLMR significantly outperforms existing methods, achieving over 40% improvement in top-1 accuracy and exhibiting strong generalizability.
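The re-ranking step described in the abstract can be illustrated with a minimal sketch. The abstract only says candidates are re-ranked "based on molecular similarity"; the use of Tanimoto similarity over fingerprint bit sets, the function names, and the toy fingerprints below are all assumptions made for illustration, not details taken from GLMR.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two fingerprints given as sets of on-bits."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rerank(candidates, generated_fp):
    """Order candidate names by similarity to the generated structure, best first."""
    return sorted(
        candidates,
        key=lambda name: tanimoto(candidates[name], generated_fp),
        reverse=True,
    )


# Hypothetical pre-retrieval output: candidate name -> fingerprint bits (toy values).
candidates = {
    "mol_c": {7, 8},
    "mol_b": {1, 2, 9},
    "mol_a": {1, 2, 3, 4},
}
# Hypothetical fingerprint of the structure produced by the generative stage.
generated_fp = {1, 2, 3, 5}

ranking = rerank(candidates, generated_fp)
print(ranking)  # candidates ordered from most to least similar
```

In this toy case the candidate sharing the most bits with the generated structure moves to the front, which is the intended effect of using the generated molecule as a refined query.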

Title: RelightMaster: Precise Video Relighting with Multi-plane Light Images

Authors: Weikang Bian, Xiaoyu Shi, Zhaoyang Huang, Jianhong Bai, Qinghe Wang, Xintao Wang, Pengfei Wan, Kun Gai, Hongsheng Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.06271
Pdf URL: https://arxiv.org/pdf/2511.06271
Copy Paste: [[2511.06271]] RelightMaster: Precise Video Relighting with Multi-plane Light Images(https://arxiv.org/abs/2511.06271)
Keywords: generation, generative
Abstract: Recent advances in diffusion models enable high-quality video generation and editing, but precise relighting with consistent video contents, which is critical for shaping scene atmosphere and viewer attention, remains unexplored. Mainstream text-to-video (T2V) models lack fine-grained lighting control due to text's inherent limitation in describing lighting details and insufficient pre-training on lighting-related prompts. Additionally, constructing high-quality relighting training data is challenging, as real-world controllable lighting data is scarce. To address these issues, we propose RelightMaster, a novel framework for accurate and controllable video relighting. First, we build RelightVideo, the first dataset with identical dynamic content under varying precise lighting conditions based on the Unreal Engine. Second, we introduce Multi-plane Light Image (MPLI), a novel visual prompt inspired by Multi-Plane Image (MPI). MPLI models lighting via K depth-aligned planes, representing 3D light source positions, intensities, and colors while supporting multi-source scenarios and generalizing to unseen light setups. Third, we design a Light Image Adapter that seamlessly injects MPLI into pre-trained Video Diffusion Transformers (DiT): it compresses MPLI via a pre-trained Video VAE and injects latent light features into DiT blocks, leveraging the base model's generative prior without catastrophic forgetting. Experiments show that RelightMaster generates physically plausible lighting and shadows and preserves original scene content. Demos are available at this https URL.

Title: LaneDiffusion: Improving Centerline Graph Learning via Prior Injected BEV Feature Generation

Authors: Zijie Wang, Weiming Zhang, Wei Zhang, Xiao Tan, Hongxing Liu, Yaowei Wang, Guanbin Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.06272
Pdf URL: https://arxiv.org/pdf/2511.06272
Copy Paste: [[2511.06272]] LaneDiffusion: Improving Centerline Graph Learning via Prior Injected BEV Feature Generation(https://arxiv.org/abs/2511.06272)
Keywords: generation, generative
Abstract: Centerline graphs, crucial for path planning in autonomous driving, are traditionally learned using deterministic methods. However, these methods often lack spatial reasoning and struggle with occluded or invisible centerlines. Generative approaches, despite their potential, remain underexplored in this domain. We introduce LaneDiffusion, a novel generative paradigm for centerline graph learning. LaneDiffusion innovatively employs diffusion models to generate lane centerline priors at the Bird's Eye View (BEV) feature level, instead of directly predicting vectorized centerlines. Our method integrates a Lane Prior Injection Module (LPIM) and a Lane Prior Diffusion Module (LPDM) to effectively construct diffusion targets and manage the diffusion process. Furthermore, vectorized centerlines and topologies are then decoded from these prior-injected BEV features. Extensive evaluations on the nuScenes and Argoverse2 datasets demonstrate that LaneDiffusion significantly outperforms existing methods, achieving improvements of 4.2%, 4.6%, 4.7%, 6.4% and 1.8% on fine-grained point-level metrics (GEO F1, TOPO F1, JTOPO F1, APLS and SDA) and 2.3%, 6.4%, 6.8% and 2.1% on segment-level metrics (IoU, mAP_cf, DET_l and TOP_ll). These results establish state-of-the-art performance in centerline graph learning, offering new insights into generative models for this task.

Title: Enhancing Multimodal Misinformation Detection by Replaying the Whole Story from Image Modality Perspective

Authors: Bing Wang, Ximing Li, Yanjun Wang, Changchun Li, Lin Yuanbo Wu, Buyu Wang, Shengsheng Wang
Subjects: cs.CV, cs.CL, cs.MM
Abstract URL: https://arxiv.org/abs/2511.06284
Pdf URL: https://arxiv.org/pdf/2511.06284
Copy Paste: [[2511.06284]] Enhancing Multimodal Misinformation Detection by Replaying the Whole Story from Image Modality Perspective(https://arxiv.org/abs/2511.06284)
Keywords: generation
Abstract: Multimodal Misinformation Detection (MMD) refers to the task of detecting social media posts involving misinformation, where the post often contains text and image modalities. However, from observing MMD posts, we argue that the text modality may be much more informative than the image modality, because the text generally describes the whole event or story of the post whereas the image often presents only partial scenes. Our preliminary empirical results indicate that the image modality indeed contributes less to MMD. Building on this idea, we propose a new MMD method named RETSIMD. Specifically, we assume that each text can be divided into several segments, and that each text segment describes a partial scene that can be presented by an image. Accordingly, we split the text into a sequence of segments and feed these segments into a pre-trained text-to-image generator to produce an augmented sequence of images. We further incorporate two auxiliary objectives concerning text-image and image-label mutual information, and post-train the generator over an auxiliary text-to-image generation benchmark dataset. Additionally, we propose a graph structure by defining three heuristic relationships between images, and use a graph neural network to generate the fused features. Extensive empirical results validate the effectiveness of RETSIMD.

Title: DRIVE: Data Curation Best Practices for Reinforcement Learning with Verifiable Reward in Competitive Code Generation

Authors: Speed Zhu, Jianwei Cai, Guang Chen, Lulu Wu, Saiyong Yang, Wiggin Zhou
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.06307
Pdf URL: https://arxiv.org/pdf/2511.06307
Copy Paste: [[2511.06307]] DRIVE: Data Curation Best Practices for Reinforcement Learning with Verifiable Reward in Competitive Code Generation(https://arxiv.org/abs/2511.06307)
Keywords: generation
Abstract: Recent reasoning-first models (e.g., OpenAI o1, DeepSeek R1) have spurred a resurgence of interest in RLVR. Nevertheless, advances are dominated by mathematics (e.g., AIME), with competitive-programming code generation underexplored and data curation receiving less attention than RL algorithm design. We investigate how to construct RLVR datasets (i.e., RL prompts) and present practical training techniques that yield strong performance on competitive-programming code generation. Our pipeline begins with supervised fine-tuning (SFT) distilled from strong open-source models, augmented with general-purpose and reasoning-intensive data. RL then follows a two-stage process with executable, testcase-driven rewards: first, we train on a large, uniformly distributed set of competitive-programming problems using Group Relative Policy Optimization (GRPO) with 8 rollouts per prompt and a relatively short response-generation window (e.g., 32k during SFT and 24k in this stage) to expand entropy and mitigate repetition and truncation; second, we perform Pre-GRPO: updating on a small, high-quality set of challenging problems with a large rollout budget (64 rollouts per prompt) under a hard-focus curriculum that continuously retains the most difficult instances throughout training. We implement our method on Qwen2.5-32B and evaluate on LeetCode and Codeforces weekly contests to avoid data leakage. The resulting model achieves state-of-the-art performance among models of similar scale and is comparable to leading systems such as DeepSeek v3.1 and Doubao-1.5-Thinking. We also examine scaling trends and observe strong RL scaling on an internal large-scale MoE model. Our study distills concise best practices for data curation, entropy expansion, and curriculum design in RLVR for competitive-programming code generation.
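As a rough illustration of the two ingredients named in the abstract, the sketch below computes group-relative advantages for one prompt's rollouts and applies a simple "keep the hardest prompts" filter. The mean/standard-deviation normalization is the commonly used GRPO formulation. The binary pass/fail rewards, the `keep_hardest` helper, and its `keep_fraction` parameter are assumptions for illustration; the abstract does not specify how the hard-focus curriculum selects instances.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against the other rollouts of the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def keep_hardest(pass_rates, keep_fraction):
    """Hypothetical hard-focus filter: retain the prompts with the lowest pass rate."""
    n_keep = max(1, int(len(pass_rates) * keep_fraction))
    ranked = sorted(pass_rates, key=pass_rates.get)  # lowest pass rate first
    return ranked[:n_keep]


# 8 rollouts for one prompt, with an assumed pass/fail testcase reward.
rewards = [1, 0, 0, 0, 1, 0, 0, 0]
advantages = group_relative_advantages(rewards)

# Toy per-prompt pass rates (made-up values) for the curriculum filter.
pass_rates = {"a": 0.9, "b": 0.1, "c": 0.5, "d": 0.0}
hard_set = keep_hardest(pass_rates, keep_fraction=0.5)
```

Passing rollouts receive positive advantages and failing ones negative, so the update favors whatever distinguished the successful samples within the group. A prompt whose rollouts all receive the same reward yields zero advantages under this normalization.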

Title: Adaptive 3D Reconstruction via Diffusion Priors and Forward Curvature-Matching Likelihood Updates

Authors: Seunghyeok Shin, Dabin Kim, Hongki Lim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.06310
Pdf URL: https://arxiv.org/pdf/2511.06310
Copy Paste: [[2511.06310]] Adaptive 3D Reconstruction via Diffusion Priors and Forward Curvature-Matching Likelihood Updates(https://arxiv.org/abs/2511.06310)
Keywords: generative
Abstract: Reconstructing high-quality point clouds from images remains challenging in computer vision. Existing generative-model-based approaches, particularly diffusion-model approaches that directly learn the posterior, may suffer from inflexibility -- they require conditioning signals during training, support only a fixed number of input views, and need complete retraining for different measurements. Recent diffusion-based methods have attempted to address this by combining prior models with likelihood updates, but they rely on heuristic fixed step sizes for the likelihood update that lead to slow convergence and suboptimal reconstruction quality. We advance this line of approach by integrating our novel Forward Curvature-Matching (FCM) update method with diffusion sampling. Our method dynamically determines optimal step sizes using only forward automatic differentiation and finite-difference curvature estimates, enabling precise optimization of the likelihood update. This formulation enables high-fidelity reconstruction from both single-view and multi-view inputs, and supports various input modalities through simple operator substitution -- all without retraining. Experiments on ShapeNet and CO3D datasets demonstrate that our method achieves superior reconstruction quality at matched or lower NFEs, yielding higher F-score and lower CD and EMD, validating its efficiency and adaptability for practical applications. Code is available at this https URL
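To make the idea of replacing a fixed step size with a curvature-based one concrete, here is a minimal sketch of a finite-difference curvature estimate along the negative gradient. It is not the paper's FCM update: the abstract does not give the formula, the function name `curvature_step` is hypothetical, and an analytic gradient function stands in for the forward-mode automatic differentiation the abstract mentions.

```python
def curvature_step(grad_fn, x, eps=1e-4):
    """Step size along -grad from a finite-difference curvature estimate.

    Returns g.g / (g.H.g), where g.H.g is approximated by differencing the
    gradient at x and at a small probe point x - eps * g.
    """
    g = grad_fn(x)
    g_sq = sum(gi * gi for gi in g)
    if g_sq == 0.0:
        return 0.0  # already at a stationary point
    probe = [xi - eps * gi for xi, gi in zip(x, g)]
    g_probe = grad_fn(probe)
    curvature = sum((gi - gp) * gi for gi, gp in zip(g, g_probe)) / eps
    if curvature <= 0.0:
        raise ValueError("non-positive curvature; a fallback step size is needed")
    return g_sq / curvature


# Toy likelihood term f(x) = 0.5 * 4 * x^2, whose gradient is 4 * x.
def grad_quadratic(x):
    return [4.0 * x[0]]


x = [1.0]
alpha = curvature_step(grad_quadratic, x)
x_next = [xi - alpha * gi for xi, gi in zip(x, grad_quadratic(x))]
```

On a quadratic objective this step size is the exact line minimizer, so the toy example reaches the minimum in one update, whereas a fixed step would need tuning to the curvature.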

Title: BuildingWorld: A Structured 3D Building Dataset for Urban Foundation Models

Authors: Shangfeng Huang, Ruisheng Wang, Xin Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.06337
Pdf URL: https://arxiv.org/pdf/2511.06337
Copy Paste: [[2511.06337]] BuildingWorld: A Structured 3D Building Dataset for Urban Foundation Models(https://arxiv.org/abs/2511.06337)
Keywords: generation
Abstract: As digital twins become central to the transformation of modern cities, accurate and structured 3D building models emerge as a key enabler of high-fidelity, updatable urban representations. These models underpin diverse applications including energy modeling, urban planning, autonomous navigation, and real-time reasoning. Despite recent advances in 3D urban modeling, most learning-based models are trained on building datasets with limited architectural diversity, which significantly undermines their generalizability across heterogeneous urban environments. To address this limitation, we present BuildingWorld, a comprehensive and structured 3D building dataset designed to bridge the gap in stylistic diversity. It encompasses buildings from geographically and architecturally diverse regions -- including North America, Europe, Asia, Africa, and Oceania -- offering a globally representative dataset for urban-scale foundation modeling and analysis. Specifically, BuildingWorld provides about five million LOD2 building models collected from diverse sources, accompanied by real and simulated airborne LiDAR point clouds. This enables comprehensive research on 3D building reconstruction, detection and segmentation. Cyber City, a virtual city model, is introduced to enable the generation of unlimited training data with customized and structurally diverse point cloud distributions. Furthermore, we provide standardized evaluation metrics tailored for building reconstruction, aiming to facilitate the training, evaluation, and comparison of large-scale vision models and foundation models in structured 3D urban environments.

Title: AesTest: Measuring Aesthetic Intelligence from Perception to Production

Authors: Guolong Wang, Heng Huang, Zhiqiang Zhang, Wentian Li, Feilong Ma, Xin Jin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.06360
Pdf URL: https://arxiv.org/pdf/2511.06360
Copy Paste: [[2511.06360]] AesTest: Measuring Aesthetic Intelligence from Perception to Production(https://arxiv.org/abs/2511.06360)
Keywords: generative
Abstract: Perceiving and producing aesthetic judgments is a fundamental yet underexplored capability for multimodal large language models (MLLMs). However, existing benchmarks for image aesthetic assessment (IAA) are narrow in perception scope or lack the diversity needed to evaluate systematic aesthetic production. To address this gap, we introduce AesTest, a comprehensive benchmark for multimodal aesthetic perception and production, distinguished by the following features: 1) It consists of curated multiple-choice questions spanning ten tasks, covering perception, appreciation, creation, and photography. These tasks are grounded in psychological theories of generative learning. 2) It integrates data from diverse sources, including professional editing workflows, photographic composition tutorials, and crowdsourced preferences. It ensures coverage of both expert-level principles and real-world variation. 3) It supports various aesthetic query types, such as attribute-based analysis, emotional resonance, compositional choice, and stylistic reasoning. We evaluate both instruction-tuned IAA MLLMs and general MLLMs on AesTest, revealing significant challenges in building aesthetic intelligence. We will publicly release AesTest to support future research in this area.

Title: Route Experts by Sequence, not by Token

Authors: Tiansheng Wen, Yifei Wang, Aosong Feng, Long Ma, Xinyang Liu, Yifan Wang, Lixuan Guo, Bo Chen, Stefanie Jegelka, Chenyu You
Subjects: cs.LG, cs.AI, cs.IT
Abstract URL: https://arxiv.org/abs/2511.06494
Pdf URL: https://arxiv.org/pdf/2511.06494
Copy Paste: [[2511.06494]] Route Experts by Sequence, not by Token(https://arxiv.org/abs/2511.06494)
Keywords: generation
Abstract: Mixture-of-Experts (MoE) architectures scale large language models (LLMs) by activating only a subset of experts per token, but the standard TopK routing assigns the same fixed number of experts to all tokens, ignoring their varying complexity. Prior adaptive routing methods introduce additional modules and hyperparameters, often requiring costly retraining from scratch. We propose Sequence-level TopK (SeqTopK), a minimal modification that shifts the expert budget from the token level to the sequence level. By selecting the top $T \cdot K$ experts across all $T$ tokens, SeqTopK enables end-to-end learned dynamic allocation -- assigning more experts to difficult tokens and fewer to easy ones -- while preserving the same overall budget. SeqTopK requires only a few lines of code, adds less than 1% overhead, and remains fully compatible with pretrained MoE models. Experiments across math, coding, law, and writing show consistent improvements over TopK and prior parameter-free adaptive methods, with gains that become substantially larger under higher sparsity (up to 16.9%). These results highlight SeqTopK as a simple, efficient, and scalable routing strategy, particularly well-suited for the extreme sparsity regimes of next-generation LLMs. Code is available at this https URL.
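The difference between per-token and sequence-level selection can be sketched in a few lines. This is an illustrative reading of "selecting the top $T \cdot K$ experts across all $T$ tokens", not the authors' implementation; the function names and the toy router scores are assumptions, and details such as how the method handles autoregressive decoding are not covered.

```python
def topk_routing(scores, k):
    """Standard TopK: every token keeps its own k highest-scoring experts."""
    return [sorted(range(len(row)), key=lambda e: -row[e])[:k] for row in scores]


def seq_topk_routing(scores, k):
    """Sequence-level TopK: keep the T*k highest (token, expert) scores overall."""
    num_tokens = len(scores)
    budget = num_tokens * k
    flat = [(s, t, e) for t, row in enumerate(scores) for e, s in enumerate(row)]
    flat.sort(key=lambda item: -item[0])
    selected = [[] for _ in range(num_tokens)]
    for _, t, e in flat[:budget]:
        selected[t].append(e)
    return selected


# Toy router scores: rows are tokens, columns are experts (made-up values).
scores = [
    [0.90, 0.80, 0.70, 0.10],  # "hard" token: several experts score highly
    [0.60, 0.05, 0.04, 0.03],  # "easy" token: one expert dominates
]
per_token = topk_routing(scores, k=2)         # [[0, 1], [0, 1]]
per_sequence = seq_topk_routing(scores, k=2)  # [[0, 1, 2], [0]]
```

Both schemes activate four expert slots in total, but the sequence-level version gives three to the first token and one to the second. Note that in this naive sketch a token could end up with no expert at all; a real implementation would need to decide how to handle that case.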

Title: TriShGAN: Enhancing Sparsity and Robustness in Multivariate Time Series Counterfactuals Explanation

Authors: Hongnan Ma, Yiwei Shi, Guanxiong Sun, Mengyue Yang, Weiru Liu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.06529
Pdf URL: https://arxiv.org/pdf/2511.06529
Copy Paste: [[2511.06529]] TriShGAN: Enhancing Sparsity and Robustness in Multivariate Time Series Counterfactuals Explanation(https://arxiv.org/abs/2511.06529)
Keywords: generative
Abstract: In decision-making processes, stakeholders often rely on counterfactual explanations, which provide suggestions about what should be changed in the queried instance to alter the outcome of an AI system. However, generating these explanations for multivariate time series presents challenges due to their complex, multi-dimensional nature. Traditional Nearest Unlike Neighbor (NUN)-based methods typically substitute subsequences in a queried time series with influential subsequences from a NUN, which is not always realistic in real-world scenarios due to the rigid direct substitution. Counterfactual methods based on Residual Generative Adversarial Networks (CounteRGAN) aim to address this by learning from the distribution of observed data to generate synthetic counterfactual explanations. However, these methods primarily focus on minimizing the cost from the queried time series to the counterfactual explanations and often neglect the importance of distancing the counterfactual explanation from the decision boundary. This oversight can result in explanations that no longer qualify as counterfactual if minor changes occur within the model. To generate more robust counterfactual explanations, we introduce TriShGAN, which extends the CounteRGAN framework with a triplet loss. This unsupervised learning approach uses distance metric learning to encourage the counterfactual explanations not only to remain close to the queried time series but also to capture the feature distribution of the instance with the desired outcome, thereby achieving a better balance between minimal cost and robustness. Additionally, we integrate a Shapelet Extractor that strategically selects the most discriminative parts of the high-dimensional queried time series to enhance the sparsity of counterfactual explanations and the efficiency of the training process.
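For readers unfamiliar with the triplet loss mentioned in the abstract, a minimal sketch of the standard margin-based form follows. The abstract does not say how TriShGAN assigns the anchor, positive, and negative roles, so the assignment in the example (counterfactual as anchor, a desired-outcome instance as positive, an original-outcome instance as negative) and the toy vectors are assumptions for illustration only.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: pull the anchor toward the positive and away from the negative."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Assumed roles (not confirmed by the abstract), with toy 2-D features:
counterfactual = [0.0, 0.0]       # anchor
desired_outcome = [0.0, 1.0]      # positive, distance 1.0 from the anchor
far_original = [3.0, 4.0]         # negative, distance 5.0 from the anchor
near_original = [0.0, 1.5]        # negative, distance 1.5 from the anchor

satisfied = triplet_loss(counterfactual, desired_outcome, far_original)
violated = triplet_loss(counterfactual, desired_outcome, near_original)
```

The loss is zero once the negative is at least `margin` farther from the anchor than the positive, and positive otherwise, which is what pushes the counterfactual away from the original class rather than leaving it near the decision boundary.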

Title: Practical Policy Distillation for Reinforcement Learning in Radio Access Networks

Authors: Sara Khosravi, Burak Demirel, Linghui Zhou, Javier Rasines, Pablo Soldati
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.06563
Pdf URL: https://arxiv.org/pdf/2511.06563
Copy Paste: [[2511.06563]] Practical Policy Distillation for Reinforcement Learning in Radio Access Networks(https://arxiv.org/abs/2511.06563)
Keywords: generation
Abstract: Adopting artificial intelligence (AI) in radio access networks (RANs) presents several challenges, including limited availability of link-level measurements (e.g., CQI reports), stringent real-time processing constraints (e.g., sub-1 ms per TTI), and network heterogeneity (different spectrum bands, cell types, and vendor equipment). A critical yet often overlooked barrier lies in the computational and memory limitations of RAN baseband hardware, particularly in legacy 4th Generation (4G) systems, which typically lack on-chip neural accelerators. As a result, only lightweight AI models (under 1 Mb and sub-100 µs inference time) can be effectively deployed, limiting both their performance and applicability. However, achieving strong generalization across diverse network conditions often requires large-scale models with substantial resource demands. To address this trade-off, this paper investigates policy distillation in the context of a reinforcement learning-based link adaptation task. We explore two strategies: single-policy distillation, where a scenario-agnostic teacher model is compressed into one generalized student model; and multi-policy distillation, where multiple scenario-specific teachers are consolidated into a single generalist student. Experimental evaluations in a high-fidelity, 5th Generation (5G)-compliant simulator demonstrate that both strategies produce compact student models that preserve the teachers' generalization capabilities while complying with the computational and memory limitations of existing RAN hardware.
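The two distillation strategies can be sketched with a common formulation in which the student is trained to match the teacher's action distribution under a KL-divergence loss. Whether the paper uses this particular loss is not stated in the abstract; the function names, the temperature parameter, and the toy logits are assumptions for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over action logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) over the action distribution for one state."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def multi_policy_loss(pairs, temperature=1.0):
    """Average the loss over (teacher_logits, student_logits) pairs from several teachers."""
    return sum(distillation_loss(t, s, temperature) for t, s in pairs) / len(pairs)


# Toy logits over three hypothetical link-adaptation actions.
teacher = [2.0, 0.5, -1.0]
matched = distillation_loss(teacher, [2.0, 0.5, -1.0])
mismatched = distillation_loss(teacher, [0.0, 0.0, 0.0])
combined = multi_policy_loss([(teacher, teacher), (teacher, [0.0, 0.0, 0.0])])
```

Single-policy distillation corresponds to calling `distillation_loss` against one teacher, while the multi-policy case averages over states labeled by different scenario-specific teachers so that one student absorbs all of them.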

Title: Sim4Seg: Boosting Multimodal Multi-disease Medical Diagnosis Segmentation with Region-Aware Vision-Language Similarity Masks

Authors: Lingran Song, Yucheng Zhou, Jianbing Shen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.06665
Pdf URL: https://arxiv.org/pdf/2511.06665
Copy Paste: [[2511.06665]] Sim4Seg: Boosting Multimodal Multi-disease Medical Diagnosis Segmentation with Region-Aware Vision-Language Similarity Masks(https://arxiv.org/abs/2511.06665)
Keywords: generation
Abstract: Despite significant progress in pixel-level medical image analysis, existing medical image segmentation models rarely explore medical segmentation and diagnosis tasks jointly. However, it is crucial for patients that models can provide explainable diagnoses along with medical segmentation results. In this paper, we introduce a medical vision-language task named Medical Diagnosis Segmentation (MDS), which aims to understand clinical queries for medical images and generate the corresponding segmentation masks as well as diagnostic results. To facilitate this task, we first present the Multimodal Multi-disease Medical Diagnosis Segmentation (M3DS) dataset, containing diverse multimodal multi-disease medical images paired with their corresponding segmentation masks and diagnosis chain-of-thought, created via an automated diagnosis chain-of-thought generation pipeline. Moreover, we propose Sim4Seg, a novel framework that improves the performance of diagnosis segmentation by taking advantage of the Region-Aware Vision-Language Similarity to Mask (RVLS2M) module. To improve overall performance, we investigate a test-time scaling strategy for MDS tasks. Experimental results demonstrate that our method outperforms the baselines in both segmentation and diagnosis.

Title: AnoStyler: Text-Driven Localized Anomaly Generation via Lightweight Style Transfer

Authors: Yulim So, Seokho Kang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.06687
Pdf URL: https://arxiv.org/pdf/2511.06687
Copy Paste: [[2511.06687]] AnoStyler: Text-Driven Localized Anomaly Generation via Lightweight Style Transfer(https://arxiv.org/abs/2511.06687)
Keywords: generation
Abstract: Anomaly generation has been widely explored to address the scarcity of anomaly images in real-world data. However, existing methods typically suffer from at least one of the following limitations, hindering their practical deployment: (1) lack of visual realism in generated anomalies; (2) dependence on large amounts of real images; and (3) use of memory-intensive, heavyweight model architectures. To overcome these limitations, we propose AnoStyler, a lightweight yet effective method that frames zero-shot anomaly generation as text-guided style transfer. Given a single normal image along with its category label and expected defect type, an anomaly mask indicating the localized anomaly regions and two-class text prompts representing the normal and anomaly states are generated using generalizable category-agnostic procedures. A lightweight U-Net model trained with CLIP-based loss functions is used to stylize the normal image into a visually realistic anomaly image, where anomalies are localized by the anomaly mask and semantically aligned with the text prompts. Extensive experiments on the MVTec-AD and VisA datasets show that AnoStyler outperforms existing anomaly generation methods in generating high-quality and diverse anomaly images. Furthermore, using these generated anomalies helps enhance anomaly detection performance.
摘要：异常生成已被广泛探索，以解决现实世界数据中异常图像的稀缺问题。然而，现有方法通常至少存在以下局限性之一，阻碍了它们的实际部署：（1）生成的异常缺乏视觉真实感； (2)对大量真实图像的依赖； (3) 使用内存密集型、重量级模型架构。为了克服这些限制，我们提出了 AnoStyler，这是一种轻量级但有效的方法，它将零样本异常生成框架为文本引导的样式迁移。给定单个正常图像及其类别标签和预期缺陷类型，使用可推广的类别不可知程序生成指示局部异常区域的异常掩模和表示正常和异常状态的两类文本提示。使用基于 CLIP 的损失函数训练的轻量级 U-Net 模型将正常图像风格化为视觉上逼真的异常图像，其中异常通过异常掩码进行定位，并在语义上与文本提示对齐。在 MVTec-AD 和 VisA 数据集上进行的大量实验表明，AnoStyler 在生成高质量和多样化的异常图像方面优于现有的异常生成方法。此外，使用这些生成的异常有助于增强异常检测性能。

Title: K-Stain: Keypoint-Driven Correspondence for H&E-to-IHC Virtual Staining

Authors: Sicheng Yang, Zhaohu Xing, Haipeng Zhou, Lei Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.06709
Pdf URL: https://arxiv.org/pdf/2511.06709
Copy Paste: [[2511.06709]] K-Stain: Keypoint-Driven Correspondence for H&E-to-IHC Virtual Staining(https://arxiv.org/abs/2511.06709)
Keywords: generation
Abstract: Virtual staining offers a promising method for converting Hematoxylin and Eosin (H&E) images into Immunohistochemical (IHC) images, eliminating the need for costly chemical processes. However, existing methods often struggle to utilize spatial information effectively due to misalignment in tissue slices. To overcome this challenge, we leverage keypoints as robust indicators of spatial correspondence, enabling more precise alignment and integration of structural details in synthesized IHC images. We introduce K-Stain, a novel framework that employs keypoint-based spatial and semantic relationships to enhance synthesized IHC image fidelity. K-Stain comprises three main components: (1) a Hierarchical Spatial Keypoint Detector (HSKD) for identifying keypoints in stain images, (2) a Keypoint-aware Enhancement Generator (KEG) that integrates these keypoints during image generation, and (3) a Keypoint Guided Discriminator (KGD) that improves the discriminator's sensitivity to spatial details. Our approach leverages contextual information from adjacent slices, resulting in more accurate and visually consistent IHC images. Extensive experiments show that K-Stain outperforms state-of-the-art methods in quantitative metrics and visual quality.
摘要：虚拟染色提供了一种将苏木精和曙红 (H&E) 图像转换为免疫组织化学 (IHC) 图像的有前景的方法，从而无需昂贵的化学过程。然而，由于组织切片的未对准，现有方法常常难以有效利用空间信息。为了克服这一挑战，我们利用关键点作为空间对应的稳健指标，从而实现合成 IHC 图像中结构细节的更精确对齐和集成。我们介绍了 K-Stain，这是一种新颖的框架，它采用基于关键点的空间和语义关系来增强合成的 IHC 图像保真度。 K-Stain 包含三个主要组件：(1) 分层空间关键点检测器 (HSKD)，用于识别污点图像中的关键点；(2) 关键点感知增强生成器 (KEG)，在图像生成过程中集成这些关键点；(3) 关键点引导鉴别器 (KGD)，提高鉴别器对空间细节的敏感度。我们的方法利用相邻切片的上下文信息，从而产生更准确且视觉上一致的 IHC 图像。大量实验表明，K-Stain 在定量指标和视觉质量方面优于最先进的方法。

Title: SinSEMI: A One-Shot Image Generation Model and Data-Efficient Evaluation Framework for Semiconductor Inspection Equipment

Authors: ChunLiang Wu, Xiaochun Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.06740
Pdf URL: https://arxiv.org/pdf/2511.06740
Copy Paste: [[2511.06740]] SinSEMI: A One-Shot Image Generation Model and Data-Efficient Evaluation Framework for Semiconductor Inspection Equipment(https://arxiv.org/abs/2511.06740)
Keywords: generation
Abstract: In the early stages of semiconductor equipment development, obtaining large quantities of raw optical images poses a significant challenge. This data scarcity hinders the advancement of AI-powered solutions in semiconductor manufacturing. To address this challenge, we introduce SinSEMI, a novel one-shot learning approach that generates diverse and highly realistic images from a single optical image. SinSEMI employs a multi-scale flow-based model enhanced with LPIPS (Learned Perceptual Image Patch Similarity) energy guidance during sampling, ensuring both perceptual realism and output variety. We also introduce a comprehensive evaluation framework tailored for this application, which enables a thorough assessment using just two reference images. Through evaluation against multiple one-shot generation techniques, we demonstrate SinSEMI's superior performance in visual quality, quantitative measures, and downstream tasks. Our experimental results demonstrate that SinSEMI-generated images achieve both high fidelity and meaningful diversity, making them suitable as training data for semiconductor AI applications.
摘要：在半导体设备开发的早期阶段，获取大量原始光学图像构成了重大挑战。这种数据稀缺阻碍了半导体制造中人工智能驱动的解决方案的进步。为了应对这一挑战，我们引入了 SinSEMI，这是一种新颖的一次性学习方法，可以从单个光学图像生成多样化且高度逼真的图像。 SinSEMI 采用基于多尺度流的模型，并在采样过程中通过 LPIPS（学习感知图像块相似性）能量引导进行增强，确保感知真实性和输出多样性。我们还引入了专为该应用量身定制的综合评估框架，只需使用两张参考图像即可进行全面评估。通过对多种一次性生成技术的评估，我们展示了 SinSEMI 在视觉质量、定量测量和下游任务方面的卓越性能。我们的实验结果表明，SinSEMI 生成的图像实现了高保真度和有意义的多样性，使其适合作为半导体 AI 应用的训练数据。

Title: Image Restoration via Primal Dual Hybrid Gradient and Flow Generative Model

Authors: Ji Li, Chao Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.06748
Pdf URL: https://arxiv.org/pdf/2511.06748
Copy Paste: [[2511.06748]] Image Restoration via Primal Dual Hybrid Gradient and Flow Generative Model(https://arxiv.org/abs/2511.06748)
Keywords: restoration, super-resolution, generative
Abstract: Regularized optimization has been a classical approach to solving imaging inverse problems, where the regularization term enforces desirable properties of the unknown image. Recently, the integration of flow matching generative models into image restoration has garnered significant attention, owing to their powerful prior modeling capabilities. In this work, we incorporate such generative priors into a Plug-and-Play (PnP) framework based on proximal splitting, where the proximal operator associated with the regularizer is replaced by a time-dependent denoiser derived from the generative model. While existing PnP methods have achieved notable success in inverse problems with smooth squared $\ell_2$ data fidelity--typically associated with Gaussian noise--their applicability to more general data fidelity terms remains underexplored. To address this, we propose a general and efficient PnP algorithm inspired by the primal-dual hybrid gradient (PDHG) method. Our approach is computationally efficient, memory-friendly, and accommodates a wide range of fidelity terms. In particular, it supports both $\ell_1$ and $\ell_2$ norm-based losses, enabling robustness to non-Gaussian noise types such as Poisson and impulse noise. We validate our method on several image restoration tasks, including denoising, super-resolution, deblurring, and inpainting, and demonstrate that $\ell_1$ and $\ell_2$ fidelity terms outperform the conventional squared $\ell_2$ loss in the presence of non-Gaussian noise.
摘要：正则化优化是解决成像逆问题的经典方法，其中正则化项强制执行未知图像的所需属性。最近，由于其强大的先验建模能力，将流匹配生成模型集成到图像恢复中引起了广泛的关注。在这项工作中，我们将此类生成先验合并到基于近端分裂的即插即用（PnP）框架中，其中与正则化器相关的近端算子被从生成模型派生的时间相关降噪器所取代。虽然现有的 PnP 方法在具有平滑平方 $\ell_2$ 数据保真度（通常与高斯噪声相关）的反问题上取得了显着的成功，但它们对更一般数据保真度项的适用性仍未得到充分探索。为了解决这个问题，我们受原始对偶混合梯度（PDHG）方法的启发，提出了一种通用且高效的 PnP 算法。我们的方法计算效率高、内存友好，并且适应广泛的保真度术语。特别是，它支持 $\ell_1$ 和 $\ell_2$ 基于范数的损失，从而能够对非高斯噪声类型（例如泊松噪声和脉冲噪声）具有鲁棒性。我们在几个图像恢复任务上验证了我们的方法，包括去噪、超分辨率、去模糊和修复，并证明 $\ell_1$ 和 $\ell_2$ 保真度项在存在非高斯噪声的情况下优于传统的平方 $\ell_2$ 损失。
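
The primal-dual idea in the abstract above can be illustrated with a toy iteration. This is a minimal sketch of a generic Chambolle-Pock style PDHG loop for an $\ell_1$ data fidelity term, in which the proximal step of the regularizer is swapped for a denoiser. It is not the authors' algorithm: the identity forward operator, the step sizes, the function names, and the 3-tap moving average standing in for a flow-based denoiser are all assumptions made for illustration.

```python
def clip(v, lo, hi):
    return max(lo, min(hi, v))

def pnp_pdhg_l1(y, denoise, tau=0.5, sigma=0.5, lam=1.0, iters=200):
    """Generic PDHG for min_x lam * ||x - y||_1 + g(x), identity forward operator.

    The proximal map of g is replaced by `denoise` (plug-and-play assumption).
    """
    n = len(y)
    x = [0.0] * n
    x_bar = list(x)
    p = [0.0] * n  # dual variable
    for _ in range(iters):
        # Dual step: prox of sigma * f^* for f(z) = lam * ||z - y||_1 is a shifted clip.
        p = [clip(p[i] + sigma * (x_bar[i] - y[i]), -lam, lam) for i in range(n)]
        # Primal step: the regularizer's proximal operator is replaced by a denoiser.
        x_new = denoise([x[i] - tau * p[i] for i in range(n)])
        # Extrapolation with theta = 1.
        x_bar = [2.0 * x_new[i] - x[i] for i in range(n)]
        x = x_new
    return x

def identity(v):
    """No regularization; the iteration then minimizes the l1 fidelity alone."""
    return list(v)

def moving_average(v):
    """Toy 3-tap smoother standing in for a learned denoiser (hypothetical)."""
    n = len(v)
    return [(v[max(i - 1, 0)] + v[i] + v[min(i + 1, n - 1)]) / 3.0 for i in range(n)]

# A flat signal with one impulse-like outlier.
estimate = pnp_pdhg_l1([1.0, 1.0, 9.0, 1.0, 1.0], moving_average)
```

With the identity "denoiser" the loop simply recovers the observation, which is a quick sanity check on the dual update. With a real denoiser, convergence depends on properties of that denoiser which this sketch does not verify.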

Title: TiS-TSL: Image-Label Supervised Surgical Video Stereo Matching via Time-Switchable Teacher-Student Learning

Authors: Rui Wang, Ying Zhou, Hao Wang, Wenwei Zhang, Qiang Li, Zhiwei Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.06817
Pdf URL: https://arxiv.org/pdf/2511.06817
Copy Paste: [[2511.06817]] TiS-TSL: Image-Label Supervised Surgical Video Stereo Matching via Time-Switchable Teacher-Student Learning(https://arxiv.org/abs/2511.06817)
Keywords: generation
Abstract: Stereo matching in minimally invasive surgery (MIS) is essential for next-generation navigation and augmented reality. Yet, dense disparity supervision is nearly impossible due to anatomical constraints, typically limiting annotations to only a few image-level labels acquired before the endoscope enters deep body cavities. Teacher-Student Learning (TSL) offers a promising solution by leveraging a teacher trained on sparse labels to generate pseudo labels and associated confidence maps from abundant unlabeled surgical videos. However, existing TSL methods are confined to image-level supervision, providing only spatial confidence and lacking temporal consistency estimation. This absence of spatio-temporal reliability results in unstable disparity predictions and severe flickering artifacts across video frames. To overcome these challenges, we propose TiS-TSL, a novel time-switchable teacher-student learning framework for video stereo matching under minimal supervision. At its core is a unified model that operates in three distinct modes: Image-Prediction (IP), Forward Video-Prediction (FVP), and Backward Video-Prediction (BVP), enabling flexible temporal modeling within a single architecture. Enabled by this unified model, TiS-TSL adopts a two-stage learning strategy. The Image-to-Video (I2V) stage transfers sparse image-level knowledge to initialize temporal modeling. The subsequent Video-to-Video (V2V) stage refines temporal disparity predictions by comparing forward and backward predictions to calculate bidirectional spatio-temporal consistency. This consistency identifies unreliable regions across frames, filters noisy video-level pseudo labels, and enforces temporal coherence. Experimental results on two public datasets demonstrate that TiS-TSL exceeds other image-based state-of-the-art methods by improving TEPE and EPE by at least 2.11% and 4.54%, respectively.
摘要：微创手术 (MIS) 中的立体匹配对于下一代导航和增强现实至关重要。然而，由于解剖学的限制，密集的视差监督几乎是不可能的，通常将注释限制在内窥镜进入深部体腔之前获取的几个图像级标签。师生学习 (TSL) 提供了一种很有前途的解决方案，它利用接受过稀疏标签训练的教师从大量未标记的手术视频中生成伪标签和相关的置信图。然而，现有的 TSL 方法仅限于图像级监督，仅提供空间置信度，缺乏时间一致性估计。这种时空可靠性的缺乏导致不稳定的视差预测和视频帧上严重的闪烁伪影。为了克服这些挑战，我们提出了 TiS-TSL，这是一种新颖的可时间切换的师生学习框架，用于在最少的监督下进行视频立体匹配。其核心是一个以三种不同模式运行的统一模型：图像预测 (IP)、前向视频预测 (FVP) 和后向视频预测 (BVP)，从而在单一架构中实现灵活的时间建模。在这个统一模型的支持下，TiS-TSL 采用了两阶段学习策略。图像到视频 (I2V) 阶段传输稀疏图像级知识以初始化时间建模。随后的视频到视频 (V2V) 阶段通过比较前向和后向预测来计算双向时空一致性，从而细化时间视差预测。这种一致性可以识别跨帧的不可靠区域，过滤嘈杂的视频级伪标签，并强制时间一致性。两个公共数据集上的实验结果表明，TiS-TSL 分别将 TEPE 和 EPE 提高了至少 2.11% 和 4.54%，超过了其他基于图像的最先进技术。

Title: Integrating Reweighted Least Squares with Plug-and-Play Diffusion Priors for Noisy Image Restoration

Authors: Ji Li, Chao Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.06823
Pdf URL: https://arxiv.org/pdf/2511.06823
Copy Paste: [[2511.06823]] Integrating Reweighted Least Squares with Plug-and-Play Diffusion Priors for Noisy Image Restoration(https://arxiv.org/abs/2511.06823)
Keywords: restoration, generative
Abstract: Existing plug-and-play image restoration methods typically employ off-the-shelf Gaussian denoisers as proximal operators within classical optimization frameworks based on variable splitting. Recently, denoisers induced by generative priors have been successfully integrated into regularized optimization methods for image restoration under Gaussian noise. However, their application to non-Gaussian noise--such as impulse noise--remains largely unexplored. In this paper, we propose a plug-and-play image restoration framework based on generative diffusion priors for robust removal of general noise types, including impulse noise. Within the maximum a posteriori (MAP) estimation framework, the data fidelity term is adapted to the specific noise model. Departing from the conventional least-squares loss used for Gaussian noise, we introduce a generalized Gaussian scale mixture-based loss, which approximates a wide range of noise distributions and leads to an $\ell_q$-norm ($0
摘要：现有的即插即用图像恢复方法通常采用现成的高斯降噪器作为基于变量分裂的经典优化框架内的近端算子。最近，由生成先验引起的降噪器已成功集成到高斯噪声下图像恢复的正则化优化方法中。然而，它们在非高斯噪声（例如脉冲噪声）中的应用在很大程度上仍未得到探索。在本文中，我们提出了一种基于生成扩散先验的即插即用图像恢复框架，用于稳健地去除包括脉冲噪声在内的一般噪声类型。在最大后验（MAP）估计框架内，数据保真度项适应特定的噪声模型。与用于高斯噪声的传统最小二乘损失不同，我们引入了一种广义的基于高斯尺度混合的损失，它近似于广泛的噪声分布，并导致 $\ell_q$-norm ($0

Title: MUGSQA: Novel Multi-Uncertainty-Based Gaussian Splatting Quality Assessment Method, Dataset, and Benchmarks

Authors: Tianang Chen, Jian Jin, Shilv Cai, Zhuangzi Li, Weisi Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.06830
Pdf URL: https://arxiv.org/pdf/2511.06830
Copy Paste: [[2511.06830]] MUGSQA: Novel Multi-Uncertainty-Based Gaussian Splatting Quality Assessment Method, Dataset, and Benchmarks(https://arxiv.org/abs/2511.06830)
Keywords: quality assessment
Abstract: Gaussian Splatting (GS) has recently emerged as a promising technique for 3D object reconstruction, delivering high-quality rendering results with significantly improved reconstruction speed. As variants continue to appear, assessing the perceptual quality of 3D objects reconstructed with different GS-based methods remains an open challenge. To address this issue, we first propose a unified multi-distance subjective quality assessment method that closely mimics human viewing behavior for objects reconstructed with GS-based methods in actual applications, thereby better collecting perceptual experiences. Based on it, we also construct a novel GS quality assessment dataset named MUGSQA, which is constructed considering multiple uncertainties of the input data. These uncertainties include the quantity and resolution of input views, the view distance, and the accuracy of the initial point cloud. Moreover, we construct two benchmarks: one to evaluate the robustness of various GS-based reconstruction methods under multiple uncertainties, and the other to evaluate the performance of existing quality assessment metrics. Our dataset and benchmark code will be released soon.
摘要：高斯泼溅 (GS) 最近成为一种很有前途的 3D 对象重建技术，可提供高质量的渲染结果，并显着提高重建速度。随着变体的不断出现，评估使用不同的基于 GS 的方法重建的 3D 对象的感知质量仍然是一个开放的挑战。为了解决这个问题，我们首先提出了一种统一的多距离主观质量评估方法，该方法在实际应用中密切模仿人类对基于 GS 的方法重建的对象的观看行为，从而更好地收集感知体验。在此基础上，我们还构建了一个名为 MUGSQA 的新型 GS 质量评估数据集，该数据集是考虑到输入数据的多种不确定性而构建的。这些不确定性包括输入视图的数量和分辨率、视图距离以及初始点云的准确性。此外，我们构建了两个基准：一个用于评估各种基于 GS 的重建方法在多种不确定性下的稳健性，另一个用于评估现有质量评估指标的性能。我们的数据集和基准代码即将发布。

Title: ConsistTalk: Intensity Controllable Temporally Consistent Talking Head Generation with Diffusion Noise Search

Authors: Zhenjie Liu, Jianzhang Lu, Renjie Lu, Cong Liang, Shangfei Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.06833
Pdf URL: https://arxiv.org/pdf/2511.06833
Copy Paste: [[2511.06833]] ConsistTalk: Intensity Controllable Temporally Consistent Talking Head Generation with Diffusion Noise Search(https://arxiv.org/abs/2511.06833)
Keywords: generation
Abstract: Recent advancements in video diffusion models have significantly enhanced audio-driven portrait animation. However, current methods still suffer from flickering, identity drift, and poor audio-visual synchronization. These issues primarily stem from entangled appearance-motion representations and unstable inference strategies. In this paper, we introduce ConsistTalk, a novel intensity-controllable and temporally consistent talking head generation framework with diffusion noise search inference. First, we propose an optical flow-guided temporal module (OFT) that decouples motion features from static appearance by leveraging facial optical flow, thereby reducing visual flicker and improving temporal consistency. Second, we present an Audio-to-Intensity (A2I) model obtained through multimodal teacher-student knowledge distillation. By transforming audio and facial velocity features into a frame-wise intensity sequence, the A2I model enables joint modeling of audio and visual motion, resulting in more natural dynamics. This further enables fine-grained, frame-wise control of motion dynamics while maintaining tight audio-visual synchronization. Third, we introduce a diffusion noise initialization strategy (IC-Init). By enforcing explicit constraints on background coherence and motion continuity during inference-time noise search, we achieve better identity preservation and refine motion dynamics compared to the current autoregressive strategy. Extensive experiments demonstrate that ConsistTalk significantly outperforms prior methods in reducing flicker, preserving identity, and delivering temporally stable, high-fidelity talking head videos.
摘要：视频扩散模型的最新进展显着增强了音频驱动的肖像动画。然而，当前的方法仍然存在闪烁、身份漂移和视听同步性差的问题。这些问题主要源于纠缠的外观运动表示和不稳定的推理策略。在本文中，我们介绍了 \textbf{ConsistTalk}，一种新颖的强度可控且时间一致的头部说话生成框架，具有扩散噪声搜索推理。首先，我们提出 \textbf{光流引导时间模块（OFT）}，通过利用面部光流将运动特征与静态外观解耦，从而减少视觉闪烁并提高时间一致性。其次，我们提出了通过多模式师生知识蒸馏获得的 \textbf{音频到强度（A2I）模型}。通过将音频和面部速度特征转换为逐帧强度序列，A2I 模型能够对音频和视觉运动进行联合建模，从而产生更自然的动态效果。这进一步实现了运动动态的细粒度、逐帧控制，同时保持严格的视听同步。第三，我们引入一种\textbf{扩散噪声初始化策略（IC-Init）}。通过在推理时间噪声搜索期间对背景相干性和运动连续性施加明确的约束，与当前的自回归策略相比，我们实现了更好的身份保留并细化了运动动力学。大量实验表明，ConsistTalk 在减少闪烁、保留身份以及提供时间稳定、高保真头部说话视频方面明显优于先前的方法。

Title: Contact Wasserstein Geodesics for Non-Conservative Schrodinger Bridges

Authors: Andrea Testa, Soren Hauberg, Tamim Asfour, Leonel Rozo
Subjects: cs.LG, math.DG
Abstract URL: https://arxiv.org/abs/2511.06856
Pdf URL: https://arxiv.org/pdf/2511.06856
Copy Paste: [[2511.06856]] Contact Wasserstein Geodesics for Non-Conservative Schrodinger Bridges(https://arxiv.org/abs/2511.06856)
Keywords: generation
Abstract: The Schrödinger Bridge provides a principled framework for modeling stochastic processes between distributions; however, existing methods are limited by energy-conservation assumptions, which constrain the bridge's shape and prevent it from modeling varying-energy phenomena. To overcome this, we introduce the non-conservative generalized Schrödinger bridge (NCGSB), a novel, energy-varying reformulation based on contact Hamiltonian mechanics. By allowing energy to change over time, the NCGSB provides a broader class of real-world stochastic processes, capturing richer and more faithful intermediate dynamics. By parameterizing the Wasserstein manifold, we lift the bridge problem to a tractable geodesic computation in a finite-dimensional space. Unlike computationally expensive iterative solutions, our contact Wasserstein geodesic (CWG) is naturally implemented via a ResNet architecture and relies on a non-iterative solver with near-linear complexity. Furthermore, CWG supports guided generation by modulating a task-specific distance metric. We validate our framework on tasks including manifold navigation, molecular dynamics predictions, and image generation, demonstrating its practical benefits and versatility.
摘要：薛定谔桥为分布之间的随机过程建模提供了一个原则框架；然而，现有方法受到节能假设的限制，这限制了桥梁的形状，使其无法模拟变化能量现象。为了克服这个问题，我们引入了非保守广义薛定谔电桥（NCGSB），这是一种基于接触哈密顿力学的新颖的能量变化重构。通过允许能量随时间变化，NCGSB 提供了更广泛的现实世界随机过程，捕获更丰富、更忠实的中间动态。通过参数化 Wasserstein 流形，我们将桥梁问题提升为有限维空间中易于处理的测地线计算。与计算成本高昂的迭代解决方案不同，我们的接触 Wasserstein 测地线 (CWG) 自然地通过 ResNet 架构实现，并依赖于具有近线性复杂度的非迭代求解器。此外，CWG 通过调制特定于任务的距离度量来支持引导生成。我们在流形导航、分子动力学预测和图像生成等任务上验证了我们的框架，展示了其实际好处和多功能性。

Title: VAEVQ: Enhancing Discrete Visual Tokenization through Variational Modeling

Authors: Sicheng Yang, Xing Hu, Qiang Wu, Dawei Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.06863
Pdf URL: https://arxiv.org/pdf/2511.06863
Copy Paste: [[2511.06863]] VAEVQ: Enhancing Discrete Visual Tokenization through Variational Modeling(https://arxiv.org/abs/2511.06863)
Keywords: generation, generative
Abstract: Vector quantization (VQ) transforms continuous image features into discrete representations, providing compressed, tokenized inputs for generative models. However, VQ-based frameworks suffer from several issues, such as non-smooth latent spaces, weak alignment between representations before and after quantization, and poor coherence between the continuous and discrete domains. These issues lead to unstable codeword learning and underutilized codebooks, ultimately degrading the performance of both reconstruction and downstream generation tasks. To this end, we propose VAEVQ, which comprises three key components: (1) Variational Latent Quantization (VLQ), replacing the AE with a VAE for quantization to leverage its structured and smooth latent space, thereby facilitating more effective codeword activation; (2) Representation Coherence Strategy (RCS), adaptively modulating the alignment strength between pre- and post-quantization features to enhance consistency and prevent overfitting to noise; and (3) Distribution Consistency Regularization (DCR), aligning the entire codebook distribution with the continuous latent distribution to improve utilization. Extensive experiments on two benchmark datasets demonstrate that VAEVQ outperforms state-of-the-art methods.
摘要：矢量量化 (VQ) 将连续图像特征转换为离散表示，为生成模型提供压缩、标记化输入。然而，基于 VQ 的框架存在一些问题，例如不平滑的潜在空间、量化前后表示之间的弱对齐以及连续域和离散域之间的一致性差。这些问题导致不稳定的码字学习和未充分利用的码本，最终降低了重建和下游生成任务的性能。为此，我们提出了 VAEVQ，它包含三个关键部分：（1）变分潜在量化（VLQ），用 VAE 代替 AE 进行量化，以利用其结构化和平滑的潜在空间，从而促进更有效的码字激活；（2）表示一致性策略（RCS），自适应调制量化前和量化后特征之间的对齐强度，以增强一致性并防止对噪声的过度拟合； (3)分布一致性正则化(DCR)，将整个码本分布与连续潜在分布对齐以提高利用率。对两个基准数据集的大量实验表明 VAEVQ 的性能优于最先进的方法。
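
The quantization step and the codebook-utilization problem mentioned in the abstract above can be made concrete with a small example. This is a minimal sketch of plain nearest-codeword vector quantization, not of VAEVQ itself. The function names, the toy codebook, and the 2-D features are assumptions chosen for illustration.

```python
def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codeword (squared L2)."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [
        min(range(len(codebook)), key=lambda k: sqdist(f, codebook[k]))
        for f in features
    ]

def codebook_utilization(indices, codebook_size):
    """Fraction of codewords selected at least once."""
    return len(set(indices)) / codebook_size

# Hypothetical 4-entry codebook and four continuous 2-D features.
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (-4.0, 2.0)]
features = [(0.1, -0.2), (0.9, 1.2), (0.4, 0.3), (1.1, 0.8)]

indices = quantize(features, codebook)                      # [0, 1, 0, 1]
utilization = codebook_utilization(indices, len(codebook))  # 0.5
```

Here only two of the four codewords are ever selected, which is the kind of underutilized codebook the abstract describes.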

Title: Generating an Image From 1,000 Words: Enhancing Text-to-Image With Structured Captions

Authors: Eyal Gutflaish, Eliran Kachlon, Hezi Zisman, Tal Hacham, Nimrod Sarid, Alexander Visheratin, Saar Huberman, Gal Davidi, Guy Bukchin, Kfir Goldberg, Ron Mokady
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.06876
Pdf URL: https://arxiv.org/pdf/2511.06876
Copy Paste: [[2511.06876]] Generating an Image From 1,000 Words: Enhancing Text-to-Image With Structured Captions(https://arxiv.org/abs/2511.06876)
Keywords: generation
Abstract: Text-to-image models have rapidly evolved from casual creative tools to professional-grade systems, achieving unprecedented levels of image quality and realism. Yet, most models are trained to map short prompts into detailed images, creating a gap between sparse textual input and rich visual outputs. This mismatch reduces controllability, as models often fill in missing details arbitrarily, biasing toward average user preferences and limiting precision for professional use. We address this limitation by training the first open-source text-to-image model on long structured captions, where every training sample is annotated with the same set of fine-grained attributes. This design maximizes expressive coverage and enables disentangled control over visual factors. To process long captions efficiently, we propose DimFusion, a fusion mechanism that integrates intermediate tokens from a lightweight LLM without increasing token length. We also introduce the Text-as-a-Bottleneck Reconstruction (TaBR) evaluation protocol. By assessing how well real images can be reconstructed through a captioning-generation loop, TaBR directly measures controllability and expressiveness, even for very long captions where existing evaluation methods fail. Finally, we demonstrate our contributions by training the large-scale model FIBO, achieving state-of-the-art prompt alignment among open-source models. Model weights are publicly available at this https URL
摘要：文本到图像模型已从休闲创意工具迅速发展为专业级系统，实现了前所未有的图像质量和真实感水平。然而，大多数模型都经过训练将简短的提示映射到详细的图像中，从而在稀疏的文本输入和丰富的视觉输出之间造成了差距。这种不匹配降低了可控性，因为模型经常任意填充缺失的细节，偏向平均用户偏好并限制专业用途的精度。我们通过在长结构化标题上训练第一个开源文本到图像模型来解决这一限制，其中每个训练样本都用同一组细粒度属性进行注释。这种设计最大限度地提高了表现力的覆盖范围，并实现了对视觉因素的分散控制。为了有效地处理长标题，我们提出了 DimFusion，一种融合机制，它集成了来自轻量级 LLM 的中间令牌，而不增加令牌长度。我们还介绍了文本作为瓶颈重建（TaBR）评估协议。通过评估通过字幕生成循环重建真实图像的效果，TaBR 可以直接测量可控性和表现力，即使对于现有评估方法无法实现的很长的字幕也是如此。最后，我们通过训练大型模型 FIBO，实现开源模型之间最先进的快速对齐来展示我们的贡献。模型权重可在此 https URL 公开获得

Title: A Two-Stage System for Layout-Controlled Image Generation using Large Language Models and Diffusion Models

Authors: Jan-Hendrik Koch, Jonas Krumme, Konrad Gadzicki
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.06888
Pdf URL: https://arxiv.org/pdf/2511.06888
Copy Paste: [[2511.06888]] A Two-Stage System for Layout-Controlled Image Generation using Large Language Models and Diffusion Models(https://arxiv.org/abs/2511.06888)
Keywords: generation, generative
Abstract: Text-to-image diffusion models exhibit remarkable generative capabilities, but lack precise control over object counts and spatial arrangements. This work introduces a two-stage system to address these compositional limitations. The first stage employs a Large Language Model (LLM) to generate a structured layout from a list of objects. The second stage uses a layout-conditioned diffusion model to synthesize a photorealistic image adhering to this layout. We find that task decomposition is critical for LLM-based spatial planning; by simplifying the initial generation to core objects and completing the layout with rule-based insertion, we improve object recall from 57.2% to 99.9% for complex scenes. For image synthesis, we compare two leading conditioning methods: ControlNet and GLIGEN. After domain-specific finetuning on table-setting datasets, we identify a key trade-off: ControlNet preserves text-based stylistic control but suffers from object hallucination, while GLIGEN provides superior layout fidelity at the cost of reduced prompt-based controllability. Our end-to-end system successfully generates images with specified object counts and plausible spatial arrangements, demonstrating the viability of a decoupled approach for compositionally controlled synthesis.
摘要：文本到图像的扩散模型表现出卓越的生成能力，但缺乏对对象数量和空间排列的精确控制。这项工作引入了一个两阶段系统来解决这些成分限制。第一阶段采用大型语言模型 (LLM) 从对象列表生成结构化布局。第二阶段使用布局条件扩散模型来合成符合该布局的逼真图像。我们发现任务分解对于基于 LLM 的空间规划至关重要；通过简化核心对象的初始生成并通过基于规则的插入完成布局，我们将复杂场景的对象召回率从 57.2% 提高到 99.9%。对于图像合成，我们比较了两种领先的调节方法：ControlNet 和 GLIGEN。在对表格设置数据集进行特定领域的微调后，我们确定了一个关键的权衡：ControlNet 保留了基于文本的风格控制，但会受到对象幻觉的影响，而 GLIGEN 提供了卓越的布局保真度，但代价是降低了基于提示的可控性。我们的端到端系统成功地生成了具有指定对象数量和合理空间排列的图像，证明了用于成分控制合成的解耦方法的可行性。

Title: FoCLIP: A Feature-Space Misalignment Framework for CLIP-Based Image Manipulation and Detection

Authors: Yulin Chen, Zeyuan Wang, Tianyuan Yu, Yingmei Wei, Liang Bai
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.06947
Pdf URL: https://arxiv.org/pdf/2511.06947
Copy Paste: [[2511.06947]] FoCLIP: A Feature-Space Misalignment Framework for CLIP-Based Image Manipulation and Detection(https://arxiv.org/abs/2511.06947)
Keywords: quality assessment
Abstract: The well-aligned attribute of CLIP-based models enables effective applications such as CLIPscore, a widely adopted image quality assessment metric. However, such a CLIP-based metric is vulnerable because of its delicate multimodal alignment. In this work, we propose FoCLIP, a feature-space misalignment framework for fooling CLIP-based image quality metrics. Based on the stochastic gradient descent technique, FoCLIP integrates three key components to construct fooling examples: feature alignment as the core module to reduce image-text modality gaps, the score distribution balance module and pixel-guard regularization, which collectively optimize multimodal output equilibrium between CLIPscore performance and image quality. Such a design can be engineered to maximize the CLIPscore predictions across diverse input prompts, despite exhibiting either visual unrecognizability or semantic incongruence with the corresponding adversarial prompts from human perceptual perspectives. Experiments on ten artistic masterpiece prompts and ImageNet subsets demonstrate that optimized images can achieve significant improvement in CLIPscore while preserving high visual fidelity. In addition, we found that grayscale conversion induces significant feature degradation in fooling images, exhibiting noticeable CLIPscore reduction while preserving statistical consistency with original images. Inspired by this phenomenon, we propose a color channel sensitivity-driven tampering detection mechanism that achieves 91% accuracy on standard benchmarks. In conclusion, this work establishes a practical pathway for feature misalignment in CLIP-based multimodal systems and the corresponding defense method.
摘要：基于 CLIP 的模型的良好对齐属性使其能够像 CLIPscore 一样有效应用，作为广泛采用的图像质量评估指标。然而，这种基于 CLIP 的指标因其微妙的多模态对齐而容易受到攻击。在这项工作中，我们提出了 \textbf{FoCLIP}，一种特征空间错位框架，用于欺骗基于 CLIP 的图像质量指标。基于随机梯度下降技术，FoCLIP 集成了三个关键组件来构建愚弄示例：特征对齐作为核心模块，以减少图像文本模态差距，分数分布平衡模块和像素保护正则化，共同优化 CLIPscore 性能和图像质量之间的多模态输出平衡。这样的设计可以被设计为最大化跨不同输入提示的 CLIPscore 预测，尽管从人类感知的角度来看，与相应的对抗性提示表现出视觉上的不可识别性或语义上的不一致。对十个艺术杰作提示和 ImageNet 子集的实验表明，优化的图像可以在保持高视觉保真度的同时实现 CLIPscore 的显着提高。此外，我们发现灰度转换会导致欺骗图像的特征显着退化，在保持与原始图像的统计一致性的同时，表现出明显的 CLIPscore 降低。受这一现象的启发，我们提出了一种颜色通道灵敏度驱动的篡改检测机制，该机制在标准基准测试中实现了 91% 的准确率。总之，这项工作为基于 CLIP 的多模态系统中的特征错位建立了一条实用的途径以及相应的防御方法。

Title: PADM: A Physics-aware Diffusion Model for Attenuation Correction

Authors: Trung Kien Pham, Hoang Minh Vu, Anh Duc Chu, Dac Thai Nguyen, Trung Thanh Nguyen, Thao Nguyen Truong, Mai Hong Son, Thanh Trung Nguyen, Phi Le Nguyen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.06948
Pdf URL: https://arxiv.org/pdf/2511.06948
Copy Paste: [[2511.06948]] PADM: A Physics-aware Diffusion Model for Attenuation Correction(https://arxiv.org/abs/2511.06948)
Keywords: generative
Abstract: Attenuation artifacts remain a significant challenge in cardiac Myocardial Perfusion Imaging (MPI) using Single-Photon Emission Computed Tomography (SPECT), often compromising diagnostic accuracy and reducing clinical interpretability. While hybrid SPECT/CT systems mitigate these artifacts through CT-derived attenuation maps, their high cost, limited accessibility, and added radiation exposure hinder widespread clinical adoption. In this study, we propose a novel CT-free solution to attenuation correction in cardiac SPECT. Specifically, we introduce Physics-aware Attenuation Correction Diffusion Model (PADM), a diffusion-based generative method that incorporates explicit physics priors via a teacher--student distillation mechanism. This approach enables attenuation artifact correction using only Non-Attenuation-Corrected (NAC) input, while still benefiting from physics-informed supervision during training. To support this work, we also introduce CardiAC, a comprehensive dataset comprising 424 patient studies with paired NAC and Attenuation-Corrected (AC) reconstructions, alongside high-resolution CT-based attenuation maps. Extensive experiments demonstrate that PADM outperforms state-of-the-art generative models, delivering superior reconstruction fidelity across both quantitative metrics and visual assessment.
摘要：衰减伪影仍然是使用单光子发射计算机断层扫描 (SPECT) 进行心脏心肌灌注成像 (MPI) 的重大挑战，通常会影响诊断准确性并降低临床可解释性。虽然混合 SPECT/CT 系统通过 CT 衍生的衰减图减轻了这些伪影，但其成本高、可访问性有限以及增加的辐射暴露阻碍了广泛的临床采用。在本研究中，我们提出了一种新颖的无 CT 解决方案，用于心脏 SPECT 的衰减校正。具体来说，我们引入了物理感知衰减校正扩散模型（PADM），这是一种基于扩散的生成方法，通过师生蒸馏机制结合了显式物理先验。这种方法仅使用非衰减校正（NAC）输入即可实现衰减伪影校正，同时仍然受益于训练期间基于物理的监督。为了支持这项工作，我们还引入了 CardiAC，这是一个综合数据集，包含 424 项患者研究，以及配对 NAC 和衰减校正 (AC) 重建，以及基于高分辨率 CT 的衰减图。大量实验表明，PADM 的性能优于最先进的生成模型，在定量指标和视觉评估方面均提供卓越的重建保真度。

Title: Oh That Looks Familiar: A Novel Similarity Measure for Spreadsheet Template Discovery

Authors: Ananad Krishnakumar, Vengadesh Ravikumaran
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2511.06973
Pdf URL: https://arxiv.org/pdf/2511.06973
Copy Paste: [[2511.06973]] Oh That Looks Familiar: A Novel Similarity Measure for Spreadsheet Template Discovery(https://arxiv.org/abs/2511.06973)
Keywords: generation
Abstract: Traditional methods for identifying structurally similar spreadsheets fail to capture the spatial layouts and type patterns defining templates. To quantify spreadsheet similarity, we introduce a hybrid distance metric that combines semantic embeddings, data type information, and spatial positioning. In order to calculate spreadsheet similarity, our method converts spreadsheets into cell-level embeddings and then uses aggregation techniques like Chamfer and Hausdorff distances. Experiments across template families demonstrate superior unsupervised clustering performance compared to the graph-based Mondrian baseline, achieving perfect template reconstruction (Adjusted Rand Index of 1.00 versus 0.90) on the FUSTE dataset. Our approach facilitates large-scale automated template discovery, which in turn enables downstream applications such as retrieval-augmented generation over tabular collections, model training, and bulk data cleaning.
摘要：用于识别结构相似的电子表格的传统方法无法捕获定义模板的空间布局和类型模式。为了量化电子表格的相似性，我们引入了一种结合了语义嵌入、数据类型信息和空间定位的混合距离度量。为了计算电子表格相似度，我们的方法将电子表格转换为单元格级嵌入，然后使用 Chamfer 和 Hausdorff 距离等聚合技术。与基于图的 Mondrian 基线相比，跨模板系列的实验证明了卓越的无监督聚类性能，在 FUSTE 数据集上实现了完美的模板重建（调整兰德指数为 1.00 与 0.90）。我们的方法有助于大规模自动化模板发现，从而支持下游应用程序，例如表格集合上的检索增强生成、模型训练和批量数据清理。
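
The aggregation step named in the abstract above (Chamfer and Hausdorff distances over cell-level embeddings) can be illustrated with a minimal sketch. The three-dimensional "embeddings" here (row, column, a numeric type code) and the helper names are hypothetical stand-ins, not the authors' representation, and Chamfer distance has several conventions (sum vs. mean, squared vs. unsquared); this version uses the mean of unsquared nearest-neighbour distances in both directions.

```python
import numpy as np

def pairwise_dists(a, b):
    """Euclidean distance between every row of a and every row of b."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)

def chamfer(a, b):
    """Symmetric Chamfer distance: mean nearest-neighbour distance, both directions."""
    d = pairwise_dists(a, b)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def hausdorff(a, b):
    """Symmetric Hausdorff distance: worst-case nearest-neighbour distance."""
    d = pairwise_dists(a, b)
    return max(d.min(axis=1).max(), d.min(axis=0).max())

# Hypothetical cell embeddings: [row, col, type_code] (assumption for illustration).
sheet_a = np.array([[0, 0, 1], [0, 1, 2], [1, 0, 1]], dtype=float)
sheet_b = np.array([[0, 0, 1], [0, 1, 2], [1, 0, 1], [5, 5, 3]], dtype=float)

print("chamfer:", chamfer(sheet_a, sheet_b))
print("hausdorff:", hausdorff(sheet_a, sheet_b))
```

With one stray cell in `sheet_b`, the Hausdorff distance is dominated by that single outlier, while the Chamfer distance averages it away, which is one reason a method might consider both aggregations.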

Title: CoLM: Collaborative Large Models via A Client-Server Paradigm

Authors: Siqi Huang, Sida Huang, Hongyuan Zhang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.06991
Pdf URL: https://arxiv.org/pdf/2511.06991
Copy Paste: [[2511.06991]] CoLM: Collaborative Large Models via A Client-Server Paradigm(https://arxiv.org/abs/2511.06991)
Keywords: generation
Abstract: Large models have achieved remarkable performance across a range of reasoning and understanding tasks. Prior work often utilizes model ensembles or multi-agent systems to collaboratively generate responses, effectively operating in a server-to-server paradigm. However, such approaches do not align well with practical deployment settings, where a limited number of server-side models are shared by many clients under modern internet architectures. In this paper, we introduce CoLM (Collaboration in Large-Models), a novel framework for collaborative reasoning that redefines cooperation among large models from a client-server perspective. Unlike traditional ensemble methods that rely on simultaneous inference from multiple models to produce a single output, CoLM allows the outputs of multiple models to be aggregated or shared, enabling each client model to independently refine and update its own generation based on these high-quality outputs. This design enables collaborative benefits by fully leveraging both client-side and shared server-side models. We further extend CoLM to vision-language models (VLMs), demonstrating its applicability beyond language tasks. Experimental results across multiple benchmarks show that CoLM consistently improves model performance on previously failed queries, highlighting the effectiveness of collaborative guidance in enhancing single-model capabilities.
摘要：大型模型在一系列推理和理解任务中取得了卓越的性能。先前的工作经常利用模型集成或多代理系统来协作生成响应，在服务器到服务器范例中有效地运行。然而，这种方法与实际部署设置不太相符，在现代互联网架构下，许多客户端共享有限数量的服务器端模型。在本文中，我们介绍了 CoLM（Collaboration in Large-Models），这是一种新颖的协作推理框架，从客户端-服务器角度重新定义了大型模型之间的协作。与依赖多个模型同时推理来产生单个输出的传统集成方法不同，CoLM 允许聚合或共享多个模型的输出，使每个客户端模型能够根据这些高质量输出独立地完善和更新自己的生成。这种设计通过充分利用客户端和共享服务器端模型来实现协作优势。我们进一步将 CoLM 扩展到视觉语言模型（VLM），证明其在语言任务之外的适用性。多个基准测试的实验结果表明，CoLM 持续提高了先前失败查询的模型性能，凸显了协作指导在增强单一模型功能方面的有效性。

Title: Performance Decay in Deepfake Detection: The Limitations of Training on Outdated Data

Authors: Jack Richings, Margaux Leblanc, Ian Groves, Victoria Nockles
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.07009
Pdf URL: https://arxiv.org/pdf/2511.07009
Copy Paste: [[2511.07009]] Performance Decay in Deepfake Detection: The Limitations of Training on Outdated Data(https://arxiv.org/abs/2511.07009)
Keywords: generation
Abstract: The continually advancing quality of deepfake technology exacerbates the threats of disinformation, fraud, and harassment by making maliciously-generated synthetic content increasingly difficult to distinguish from reality. We introduce a simple yet effective two-stage detection method that achieves an AUROC of over 99.8% on contemporary deepfakes. However, this high performance is short-lived. We show that models trained on this data suffer a recall drop of over 30% when evaluated on deepfakes created with generation techniques from just six months later, demonstrating significant decay as threats evolve. Our analysis reveals two key insights for robust detection. First, continued performance requires the ongoing curation of large, diverse datasets. Second, predictive power comes primarily from static, frame-level artifacts, not temporal inconsistencies. The future of effective deepfake detection therefore depends on rapid data collection and the development of advanced frame-level feature detectors.
摘要：深度造假技术质量的不断提高，使得恶意生成的合成内容越来越难以与现实区分开，从而加剧了虚假信息、欺诈和骚扰的威胁。我们引入了一种简单而有效的两阶段检测方法，可以在当代深度伪造品上实现超过 99.8% 的 AUROC。然而，这种高性能是短暂的。我们表明，在六个月后对使用生成技术创建的深度伪造品进行评估时，基于这些数据训练的模型的召回率下降了 30% 以上，这表明随着威胁的发展，召回率会显着下降。我们的分析揭示了稳健检测的两个关键见解。首先，持续的性能需要持续管理大型、多样化的数据集。其次，预测能力主要来自静态的帧级伪影，而不是时间不一致。因此，有效的深度换脸检测的未来取决于快速数据收集和先进的帧级特征检测器的开发。

Title: Improving Deepfake Detection with Reinforcement Learning-Based Adaptive Data Augmentation

Authors: Yuxuan Zhou, Tao Yu, Wen Huang, Yuheng Zhang, Tao Dai, Shu-Tao Xia
Subjects: cs.CV, cs.CR
Abstract URL: https://arxiv.org/abs/2511.07051
Pdf URL: https://arxiv.org/pdf/2511.07051
Copy Paste: [[2511.07051]] Improving Deepfake Detection with Reinforcement Learning-Based Adaptive Data Augmentation(https://arxiv.org/abs/2511.07051)
Keywords: generation
Abstract: The generalization capability of deepfake detectors is critical for real-world use. Data augmentation via synthetic fake face generation effectively enhances generalization, yet current SOTA methods rely on fixed strategies, raising a key question: is a single static augmentation sufficient, or does the diversity of forgery features demand dynamic approaches? We argue existing methods overlook the evolving complexity of real-world forgeries (e.g., facial warping, expression manipulation), which fixed policies cannot fully simulate. To address this, we propose CRDA (Curriculum Reinforcement-Learning Data Augmentation), a novel framework guiding detectors to progressively master multi-domain forgery features from simple to complex. CRDA synthesizes augmented samples via a configurable pool of forgery operations and dynamically generates adversarial samples tailored to the detector's current learning state. Central to our approach is integrating reinforcement learning (RL) and causal inference. An RL agent dynamically selects augmentation actions based on detector performance to efficiently explore the vast augmentation space, adapting to increasingly challenging forgeries. Simultaneously, the agent introduces action space variations to generate heterogeneous forgery patterns, guided by causal inference to mitigate spurious correlations, suppressing task-irrelevant biases and focusing on causally invariant features. This integration ensures robust generalization by decoupling synthetic augmentation patterns from the model's learned representations. Extensive experiments show our method significantly improves detector generalizability, outperforming SOTA methods across multiple cross-domain datasets.
摘要：Deepfake 检测器的泛化能力对于现实世界的使用至关重要。通过合成假脸生成的数据增强有效地增强了泛化能力，但当前的 SoTA 方法依赖于固定策略，这提出了一个关键问题：单个静态增强是否足够，或者伪造特征的多样性是否需要动态方法？我们认为现有的方法忽视了现实世界中伪造品不断变化的复杂性（例如面部扭曲、表情操纵），而固定策略无法完全模拟这些伪造品。为了解决这个问题，我们提出了 CRDA（课程强化学习数据增强），这是一种新颖的框架，指导检测器逐步掌握从简单到复杂的多域伪造特征。 CRDA 通过可配置的伪造操作池合成增强样本，并动态生成适合检测器当前学习状态的对抗样本。我们方法的核心是整合强化学习（RL）和因果推理。强化学习代理根据检测器性能动态选择增强动作，以有效探索广阔的增强空间，适应日益具有挑战性的伪造。同时，代理引入动作空间变化来生成异构伪造模式，在因果推理的指导下减轻虚假相关性，抑制与任务无关的偏差，并专注于因果不变的特征。这种集成通过将合成增强模式与模型的学习表示解耦来确保稳健的泛化。大量实验表明，我们的方法显着提高了检测器的通用性，在多个跨域数据集上的表现优于 SOTA 方法。

Title: RaLD: Generating High-Resolution 3D Radar Point Clouds with Latent Diffusion

Authors: Ruijie Zhang, Bixin Zeng, Shengpeng Wang, Fuhui Zhou, Wei Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.07067
Pdf URL: https://arxiv.org/pdf/2511.07067
Copy Paste: [[2511.07067]] RaLD: Generating High-Resolution 3D Radar Point Clouds with Latent Diffusion(https://arxiv.org/abs/2511.07067)
Keywords: generation, generative
Abstract: Millimeter-wave radar offers a promising sensing modality for autonomous systems thanks to its robustness in adverse conditions and low cost. However, its utility is significantly limited by the sparsity and low resolution of radar point clouds, which poses challenges for tasks requiring dense and accurate 3D perception. Although recent efforts exploring generative approaches to this issue have shown great potential, they often rely on dense voxel representations that are inefficient and struggle to preserve structural detail. To fill this gap, we make the key observation that latent diffusion models (LDMs), though successful in other modalities, have not been effectively leveraged for radar-based 3D generation due to a lack of compatible representations and conditioning strategies. We introduce RaLD, a framework that bridges this gap by integrating scene-level frustum-based LiDAR autoencoding, order-invariant latent representations, and direct radar spectrum conditioning. These insights lead to a more compact and expressive generation process. Experiments show that RaLD produces dense and accurate 3D point clouds from raw radar spectra, offering a promising solution for robust perception in challenging environments.
摘要：毫米波雷达由于其在恶劣条件下的鲁棒性和低成本，为自主系统提供了一种有前景的传感方式。然而，其实用性受到雷达点云的稀疏性和低分辨率的严重限制，这对需要密集且准确的 3D 感知的任务提出了挑战。尽管最近的努力通过探索生成方法来解决这个问题显示出了巨大的潜力，但它们通常依赖于密集的体素表示，这种表示效率低下，并且难以保留结构细节。为了填补这一空白，我们进行了关键观察，即潜在扩散模型 (LDM) 尽管在其他模式中取得了成功，但由于缺乏兼容的表示和调节策略，尚未有效地用于基于雷达的 3D 生成。我们引入了 RaLD，这是一个框架，它通过集成基于场景级视锥体的 LiDAR 自动编码、阶数不变的潜在表示和直接雷达频谱调节来弥补这一差距。这些见解导致生成过程更加紧凑和富有表现力。实验表明，RaLD 从原始雷达频谱中生成密集且准确的 3D 点云，为在充满挑战的环境中实现稳健感知提供了一种有前途的解决方案。

Title: How Bias Binds: Measuring Hidden Associations for Bias Control in Text-to-Image Compositions

Authors: Jeng-Lin Li, Ming-Ching Chang, Wei-Chao Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.07091
Pdf URL: https://arxiv.org/pdf/2511.07091
Copy Paste: [[2511.07091]] How Bias Binds: Measuring Hidden Associations for Bias Control in Text-to-Image Compositions(https://arxiv.org/abs/2511.07091)
Keywords: generation, generative
Abstract: Text-to-image generative models often exhibit bias related to sensitive attributes. However, current research tends to focus narrowly on single-object prompts with limited contextual diversity. In reality, each object or attribute within a prompt can contribute to bias. For example, the prompt "an assistant wearing a pink hat" may reflect female-inclined biases associated with a pink hat. The neglected joint effects of the semantic binding in the prompts cause significant failures in current debiasing approaches. This work initiates a preliminary investigation on how bias manifests under semantic binding, where contextual associations between objects and attributes influence generative outcomes. We demonstrate that the underlying bias distribution can be amplified based on these associations. Therefore, we introduce a bias adherence score that quantifies how specific object-attribute bindings activate bias. To delve deeper, we develop a training-free context-bias control framework to explore how token decoupling can facilitate the debiasing of semantic bindings. This framework achieves over 10% debiasing improvement in compositional generation tasks. Our analysis of bias scores across various attribute-object bindings and token decorrelation highlights a fundamental challenge: reducing bias without disrupting essential semantic relationships. These findings expose critical limitations in current debiasing approaches when applied to semantically bound contexts, underscoring the need to reassess prevailing bias mitigation strategies.
摘要：文本到图像生成模型通常表现出与敏感属性相关的偏差。然而，当前的研究往往局限于上下文多样性有限的单一对象提示。事实上，提示中的每个对象或属性都可能导致偏见。例如，提示“戴粉红色帽子的助理”可能反映与粉红色帽子相关的女性倾向偏见。提示中语义绑定的被忽视的联合效应导致当前的去偏方法出现重大失败。这项工作启动了一项初步调查，研究偏见如何在语义绑定下表现出来，其中对象和属性之间的上下文关联影响生成结果。我们证明，可以根据这些关联来放大潜在的偏差分布。因此，我们引入了偏差遵守分数，该分数可以量化特定对象属性绑定如何激活偏差。为了更深入地研究，我们开发了一个免训练的上下文偏差控制框架，以探索令牌解耦如何促进语义绑定的消除偏差。该框架在组合生成任务中实现了超过 10% 的去偏改进。我们对各种属性-对象绑定和标记去相关的偏差分数的分析突出了一个基本挑战：在不破坏基本语义关系的情况下减少偏差。这些发现暴露了当前去偏见方法在应用于语义绑定上下文时的关键局限性，强调需要重新评估流行的偏见缓解策略。

Title: GEWDiff: Geometric Enhanced Wavelet-based Diffusion Model for Hyperspectral Image Super-resolution

Authors: Sirui Wang, Jiang He, Natàlia Blasco Andreo, Xiao Xiang Zhu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.07103
Pdf URL: https://arxiv.org/pdf/2511.07103
Copy Paste: [[2511.07103]] GEWDiff: Geometric Enhanced Wavelet-based Diffusion Model for Hyperspectral Image Super-resolution(https://arxiv.org/abs/2511.07103)
Keywords: super-resolution, generation, generative
Abstract: Improving the quality of hyperspectral images (HSIs), such as through super-resolution, is a crucial research area. However, generative modeling for HSIs presents several challenges. Due to their high spectral dimensionality, HSIs are too memory-intensive for direct input into conventional diffusion models. Furthermore, general generative models lack an understanding of the topological and geometric structures of ground objects in remote sensing imagery. In addition, most diffusion models optimize loss functions at the noise level, leading to a non-intuitive convergence behavior and suboptimal generation quality for complex data. To address these challenges, we propose a Geometric Enhanced Wavelet-based Diffusion Model (GEWDiff), a novel framework for reconstructing hyperspectral images at 4-times super-resolution. A wavelet-based encoder-decoder is introduced that efficiently compresses HSIs into a latent space while preserving spectral-spatial information. To avoid distortion during generation, we incorporate a geometry-enhanced diffusion process that preserves the geometric features. Furthermore, a multi-level loss function was designed to guide the diffusion process, promoting stable convergence and improved reconstruction fidelity. Our model demonstrated state-of-the-art results across multiple dimensions, including fidelity, spectral accuracy, visual realism, and clarity.
摘要：通过超分辨率等方式提高高光谱图像 (HSI) 的质量是一个重要的研究领域。然而，HSI 的生成模型面临着一些挑战。由于其高光谱维数，HSI 的内存过于密集，无法直接输入到传统的扩散模型中。此外，一般生成模型缺乏对遥感图像中地面物体的拓扑和几何结构的理解。此外，大多数扩散模型在噪声水平上优化损失函数，导致复杂数据的收敛行为不直观且生成质量不理想。为了应对这些挑战，我们提出了一种基于几何增强小波的扩散模型（GEWDiff），这是一种以 4 倍超分辨率重建高光谱图像的新颖框架。引入了基于小波的编码器-解码器，可以有效地将 HSI 压缩到潜在空间，同时保留频谱空间信息。为了避免生成过程中的失真，我们采用了几何增强扩散过程来保留几何特征。此外，设计了多级损失函数来指导扩散过程，促进稳定收敛并提高重建保真度。我们的模型在多个维度上展示了最先进的结果，包括保真度、光谱精度、视觉真实感和清晰度。

Title: On the Joint Minimization of Regularization Loss Functions in Deep Variational Bayesian Methods for Attribute-Controlled Symbolic Music Generation

Authors: Matteo Pettenó, Alessandro Ilic Mezza, Alberto Bernardini
Subjects: cs.LG, cs.AI, eess.AS
Abstract URL: https://arxiv.org/abs/2511.07118
Pdf URL: https://arxiv.org/pdf/2511.07118
Copy Paste: [[2511.07118]] On the Joint Minimization of Regularization Loss Functions in Deep Variational Bayesian Methods for Attribute-Controlled Symbolic Music Generation(https://arxiv.org/abs/2511.07118)
Keywords: generation, generative
Abstract: Explicit latent variable models provide a flexible yet powerful framework for data synthesis, enabling controlled manipulation of generative factors. With latent variables drawn from a tractable probability density function that can be further constrained, these models enable continuous and semantically rich exploration of the output space by navigating their latent spaces. Structured latent representations are typically obtained through the joint minimization of regularization loss functions. In variational information bottleneck models, reconstruction loss and Kullback-Leibler Divergence (KLD) are often linearly combined with an auxiliary Attribute-Regularization (AR) loss. However, balancing KLD and AR turns out to be a very delicate matter. When KLD dominates over AR, generative models tend to lack controllability; when AR dominates over KLD, the stochastic encoder is encouraged to violate the standard normal prior. We explore this trade-off in the context of symbolic music generation with explicit control over continuous musical attributes. We show that existing approaches struggle to jointly minimize both regularization objectives, whereas suitable attribute transformations can help achieve both controllability and regularization of the target latent dimensions.
摘要：显式潜变量模型为数据合成提供了灵活而强大的框架，从而能够对生成因素进行受控操作。通过从可进一步约束的易于处理的概率密度函数中提取潜在变量，这些模型可以通过导航其潜在空间来对输出空间进行连续且语义丰富的探索。结构化潜在表示通常是通过正则化损失函数的联合最小化获得的。在变分信息瓶颈模型中，重建损失和 Kullback-Leibler 散度 (KLD) 通常与辅助属性正则化 (AR) 损失线性组合。然而，平衡 KLD 和 AR 却是一件非常微妙的事情。当KLD主导AR时，生成模型往往缺乏可控性；当 AR 战胜 KLD 时，随机编码器会被鼓励违反标准正常先验。我们在符号音乐生成的背景下探索这种权衡，并明确控制连续音乐属性。我们表明，现有的方法很难共同最小化两个正则化目标，而合适的属性转换可以帮助实现目标潜在维度的可控性和正则化。
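
The linear combination described in the abstract above can be written out as a short worked formula. The weights $\beta$ and $\gamma$ and the symbols $q_\phi$, $p_\theta$ are assumed notation for illustration; the abstract does not give the authors' exact formulation.

```latex
\mathcal{L}
  = \underbrace{\mathcal{L}_{\mathrm{rec}}}_{\text{reconstruction}}
  + \beta \, \underbrace{D_{\mathrm{KL}}\!\left( q_\phi(z \mid x) \,\|\, \mathcal{N}(0, I) \right)}_{\text{KLD}}
  + \gamma \, \underbrace{\mathcal{L}_{\mathrm{AR}}}_{\text{attribute regularization}}
```

Under this reading, the trade-off in the abstract corresponds to the two regimes of the weights: when $\beta$ is large relative to $\gamma$, the KLD term dominates and controllability suffers; when $\gamma$ is large relative to $\beta$, the AR term dominates and the encoder drifts away from the standard normal prior.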

Title: ProcGen3D: Learning Neural Procedural Graph Representations for Image-to-3D Reconstruction

Authors: Xinyi Zhang, Daoyi Gao, Naiqi Li, Angela Dai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.07142
Pdf URL: https://arxiv.org/pdf/2511.07142
Copy Paste: [[2511.07142]] ProcGen3D: Learning Neural Procedural Graph Representations for Image-to-3D Reconstruction(https://arxiv.org/abs/2511.07142)
Keywords: generation, generative
Abstract: We introduce ProcGen3D, a new approach for 3D content creation by generating procedural graph abstractions of 3D objects, which can then be decoded into rich, complex 3D assets. Inspired by the prevalent use of procedural generators in production 3D applications, we propose a sequentialized, graph-based procedural graph representation for 3D assets. We use this to learn to approximate the landscape of a procedural generator for image-based 3D reconstruction. We employ edge-based tokenization to encode the procedural graphs, and train a transformer prior to predict the next token conditioned on an input RGB image. Crucially, to enable better alignment of our generated outputs to an input image, we incorporate Monte Carlo Tree Search (MCTS) guided sampling into our generation process, steering output procedural graphs towards more image-faithful reconstructions. Our approach is applicable across a variety of objects that can be synthesized with procedural generators. Extensive experiments on cacti, trees, and bridges show that our neural procedural graph generation outperforms both state-of-the-art generative 3D methods and domain-specific modeling techniques. Furthermore, this enables improved generalization on real-world input images, despite training only on synthetic data.
摘要：我们引入了 ProcGen3D，这是一种通过生成 3D 对象的程序图形抽象来创建 3D 内容的新方法，然后可以将其解码为丰富、复杂的 3D 资产。受到生产 3D 应用程序中程序生成器的普遍使用的启发，我们提出了一种用于 3D 资产的顺序化、基于图形的程序图形表示。我们用它来学习近似程序生成器的景观，以进行基于图像的 3D 重建。我们采用基于边缘的标记化来对程序图进行编码，并在根据输入 RGB 图像预测下一个标记之前训练转换器。至关重要的是，为了更好地将生成的输出与输入图像对齐，我们将蒙特卡罗树搜索（MCTS）引导采样纳入我们的生成过程中，将输出程序图转向更忠实于图像的重建。我们的方法适用于可以使用程序生成器合成的各种对象。对仙人掌、树木和桥梁的大量实验表明，我们的神经程序图生成优于最先进的生成 3D 方法和特定领域的建模技术。此外，尽管仅对合成数据进行训练，但这仍可以改进对现实世界输入图像的泛化。

Title: Conditional Diffusion as Latent Constraints for Controllable Symbolic Music Generation

Authors: Matteo Pettenó, Alessandro Ilic Mezza, Alberto Bernardini
Subjects: cs.LG, cs.AI, eess.AS
Abstract URL: https://arxiv.org/abs/2511.07156
Pdf URL: https://arxiv.org/pdf/2511.07156
Copy Paste: [[2511.07156]] Conditional Diffusion as Latent Constraints for Controllable Symbolic Music Generation(https://arxiv.org/abs/2511.07156)
Keywords: generation, generative
Abstract: Recent advances in latent diffusion models have demonstrated state-of-the-art performance in high-dimensional time-series data synthesis while providing flexible control through conditioning and guidance. However, existing methodologies primarily rely on musical context or natural language as the main modality of interacting with the generative process, which may not be ideal for expert users who seek precise fader-like control over specific musical attributes. In this work, we explore the application of denoising diffusion processes as plug-and-play latent constraints for unconditional symbolic music generation models. We focus on a framework that leverages a library of small conditional diffusion models operating as implicit probabilistic priors on the latents of a frozen unconditional backbone. While previous studies have explored domain-specific use cases, this work, to the best of our knowledge, is the first to demonstrate the versatility of such an approach across a diverse array of musical attributes, such as note density, pitch range, contour, and rhythm complexity. Our experiments show that diffusion-driven constraints outperform traditional attribute regularization and other latent constraints architectures, achieving significantly stronger correlations between target and generated attributes while maintaining high perceptual quality and diversity.
摘要：潜在扩散模型的最新进展展示了高维时间序列数据合成的最先进性能，同时通过调节和引导提供灵活的控制。然而，现有的方法主要依赖于音乐背景或自然语言作为与生成过程交互的主要方式，这对于寻求对特定音乐属性进行精确的推子式控制的专家用户来说可能并不理想。在这项工作中，我们探索了去噪扩散过程作为无条件符号音乐生成模型的即插即用潜在约束的应用。我们专注于一个框架，该框架利用小型条件扩散模型库，作为冻结无条件主干的潜在概率的隐式概率先验。虽然之前的研究已经探索了特定领域的用例，但据我们所知，这项工作首次证明了这种方法在各种音乐属性上的多功能性，例如音符密度、音高范围、轮廓和节奏复杂性。我们的实验表明，扩散驱动的约束优于传统的属性正则化和其他潜在约束架构，在保持高感知质量和多样性的同时，实现目标属性和生成属性之间显着更强的相关性。

Title: Guiding Generative Models to Uncover Diverse and Novel Crystals via Reinforcement Learning

Authors: Hyunsoo Park, Aron Walsh
Subjects: cs.LG, physics.comp-ph
Abstract URL: https://arxiv.org/abs/2511.07158
Pdf URL: https://arxiv.org/pdf/2511.07158
Copy Paste: [[2511.07158]] Guiding Generative Models to Uncover Diverse and Novel Crystals via Reinforcement Learning(https://arxiv.org/abs/2511.07158)
Keywords: generation, generative
Abstract: Discovering functional crystalline materials entails navigating an immense combinatorial design space. While recent advances in generative artificial intelligence have enabled the sampling of chemically plausible compositions and structures, a fundamental challenge remains: the objective misalignment between likelihood-based sampling in generative modelling and targeted focus on underexplored regions where novel compounds reside. Here, we introduce a reinforcement learning framework that guides latent denoising diffusion models toward diverse and novel, yet thermodynamically viable crystalline compounds. Our approach integrates group relative policy optimisation with verifiable, multi-objective rewards that jointly balance creativity, stability, and diversity. Beyond de novo generation, we demonstrate enhanced property-guided design that preserves chemical validity, while targeting desired functional properties. This approach establishes a modular foundation for controllable AI-driven inverse design that addresses the novelty-validity trade-off across scientific discovery applications of generative models.
摘要：发现功能性晶体材料需要探索巨大的组合设计空间。尽管生成人工智能的最新进展已经能够对化学上合理的成分和结构进行采样，但仍然存在一个根本挑战：生成模型中基于可能性的采样与对新化合物所在的未充分探索区域的目标关注之间存在客观偏差。在这里，我们介绍了一个强化学习框架，该框架指导潜在的去噪扩散模型走向多样化、新颖但热力学上可行的晶体化合物。我们的方法将群体相关政策优化与可验证的多目标奖励相结合，共同平衡创造力、稳定性和多样性。除了从头生成之外，我们还展示了增强的属性引导设计，可以保留化学有效性，同时瞄准所需的功能属性。这种方法为可控人工智能驱动的逆向设计建立了模块化基础，解决了生成模型的科学发现应用中的新颖性与有效性的权衡。
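
The core idea behind group relative policy optimisation, as mentioned in the abstract above, is to score each sample against the other samples drawn in the same group. The sketch below shows only that normalisation step; the reward values are made-up placeholders, and the function name is hypothetical rather than taken from the authors' code.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalise rewards within one sampled group: (r - mean) / (std + eps)."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Placeholder rewards for four structures sampled from the same prompt/condition.
# In the paper these would combine creativity, stability, and diversity terms;
# the numbers here are assumptions for illustration only.
group_rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(group_rewards)
print(advantages)
```

Samples with above-average reward in their group receive positive advantages and are reinforced; below-average samples are pushed down, without needing a separate learned value function.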

Title: LiteUpdate: A Lightweight Framework for Updating AI-Generated Image Detectors

Authors: Jiajie Lu, Zhenkan Fu, Na Zhao, Long Xing, Kejiang Chen, Weiming Zhang, Nenghai Yu
Subjects: cs.CV, cs.CR
Abstract URL: https://arxiv.org/abs/2511.07192
Pdf URL: https://arxiv.org/pdf/2511.07192
Copy Paste: [[2511.07192]] LiteUpdate: A Lightweight Framework for Updating AI-Generated Image Detectors(https://arxiv.org/abs/2511.07192)
Keywords: generative
Abstract: The rapid progress of generative AI has led to the emergence of new generative models, while existing detection methods struggle to keep pace, resulting in significant degradation in the detection performance. This highlights the urgent need for continuously updating AI-generated image detectors to adapt to new generators. To overcome low efficiency and catastrophic forgetting in detector updates, we propose LiteUpdate, a lightweight framework for updating AI-generated image detectors. LiteUpdate employs a representative sample selection module that leverages image confidence and gradient-based discriminative features to precisely select boundary samples. This approach improves learning and detection accuracy on new distributions with limited generated images, significantly enhancing detector update efficiency. Additionally, LiteUpdate incorporates a model merging module that fuses weights from multiple fine-tuning trajectories, including pre-trained, representative, and random updates. This balances the adaptability to new generators and mitigates the catastrophic forgetting of prior knowledge. Experiments demonstrate that LiteUpdate substantially boosts detection performance in various detectors. Specifically, on AIDE, the average detection accuracy on Midjourney improved from 87.63% to 93.03%, a 6.16% relative increase.
摘要：生成式人工智能的快速进步导致了新的生成模型的出现，而现有的检测方法却难以跟上步伐，导致检测性能显着下降。这凸显了不断更新人工智能生成图像检测器以适应新生成器的迫切需要。为了克服检测器更新中的低效率和灾难性遗忘，我们提出了 LiteUpdate，一个用于更新 AI 生成的图像检测器的轻量级框架。 LiteUpdate 采用代表性样本选择模块，利用图像置信度和基于梯度的判别特征来精确选择边界样本。这种方法提高了生成图像有限的新分布的学习和检测精度，显着提高了检测器更新效率。此外，LiteUpdate 还包含一个模型合并模块，该模块融合来自多个微调轨迹的权重，包括预训练的、代表性的和随机更新。这平衡了对新生成器的适应性，并减轻了对先验知识的灾难性遗忘。实验表明，LiteUpdate 极大地提高了各种检测器的检测性能。具体来说，在AIDE上，Midjourney的平均检测准确率从87.63%提升到93.03%，相对提升了6.16%。
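
Two quantities in the abstract above lend themselves to a quick check: the reported relative increase, and the idea of fusing weights from several fine-tuning trajectories. The sketch below verifies the arithmetic and shows a weighted average of parameter dictionaries. The uniform coefficients and the `merge_weights` helper are assumptions; the abstract does not state the actual merging rule LiteUpdate uses.

```python
import numpy as np

def relative_increase(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0

def merge_weights(state_dicts, coeffs):
    """Weighted average of parameter dicts that share the same keys and shapes."""
    assert abs(sum(coeffs) - 1.0) < 1e-9, "coefficients should sum to 1"
    return {
        name: sum(c * sd[name] for c, sd in zip(coeffs, state_dicts))
        for name in state_dicts[0]
    }

# Accuracy figures quoted in the abstract (Midjourney, AIDE detector).
print(round(relative_increase(87.63, 93.03), 2))

# Toy parameters standing in for the pre-trained, representative-sample,
# and random-sample fine-tuning trajectories (values are placeholders).
pretrained = {"w": np.array([1.0, 2.0])}
representative = {"w": np.array([3.0, 4.0])}
random_update = {"w": np.array([5.0, 6.0])}
merged = merge_weights([pretrained, representative, random_update], [1 / 3, 1 / 3, 1 / 3])
print(merged["w"])
```

Keeping the pre-trained weights in the average is what would, in principle, limit forgetting, while the fine-tuned trajectories contribute adaptation to the new generator.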

Title: Breaking the Stealth-Potency Trade-off in Clean-Image Backdoors with Generative Trigger Optimization

Authors: Binyan Xu, Fan Yang, Di Tang, Xilin Dai, Kehuan Zhang
Subjects: cs.CV, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2511.07210
Pdf URL: https://arxiv.org/pdf/2511.07210
Copy Paste: [[2511.07210]] Breaking the Stealth-Potency Trade-off in Clean-Image Backdoors with Generative Trigger Optimization(https://arxiv.org/abs/2511.07210)
Keywords: generative
Abstract: Clean-image backdoor attacks, which use only label manipulation in training datasets to compromise deep neural networks, pose a significant threat to security-critical applications. A critical flaw in existing methods is that the poison rate required for a successful attack induces a proportional, and thus noticeable, drop in Clean Accuracy (CA), undermining their stealthiness. This paper presents a new paradigm for clean-image attacks that minimizes this accuracy degradation by optimizing the trigger itself. We introduce Generative Clean-Image Backdoors (GCB), a framework that uses a conditional InfoGAN to identify naturally occurring image features that can serve as potent and stealthy triggers. By ensuring these triggers are easily separable from benign task-related features, GCB enables a victim model to learn the backdoor from an extremely small set of poisoned examples, resulting in a CA drop of less than 1%. Our experiments demonstrate GCB's remarkable versatility, successfully adapting to six datasets, five architectures, and four tasks, including the first demonstration of clean-image backdoors in regression and segmentation. GCB also exhibits resilience against most of the existing backdoor defenses.
摘要：干净图像后门攻击仅使用训练数据集中的标签操作来破坏深度神经网络，对安全关键型应用程序构成重大威胁。现有方法的一个关键缺陷是，成功攻击所需的中毒率会导致清洁准确度（CA）成比例且明显下降，从而破坏了其隐秘性。本文提出了一种新的干净图像攻击范例，通过优化触发器本身来最大限度地减少精度下降。我们引入了生成式清洁图像后门 (GCB)，这是一个使用条件 InfoGAN 来识别自然发生的图像特征的框架，这些特征可以作为有效且隐秘的触发器。通过确保这些触发器可以轻松地与良性任务相关功能分离，GCB 使受害者模型能够从极少量的中毒示例中学习后门，从而导致 CA 下降不到 1%。我们的实验证明了 GCB 卓越的多功能性，成功适应了六个数据集、五种架构和四项任务，包括回归和分割中干净图像后门的首次演示。 GCB 还表现出针对大多数现有后门防御的弹性。

Title: Omni-View: Unlocking How Generation Facilitates Understanding in Unified 3D Model based on Multiview images

Authors: JiaKui Hu, Shanshan Zhao, Qing-Guo Chen, Xuerui Qiu, Jialun Liu, Zhao Xu, Weihua Luo, Kaifu Zhang, Yanye Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.07222
Pdf URL: https://arxiv.org/pdf/2511.07222
Copy Paste: [[2511.07222]] Omni-View: Unlocking How Generation Facilitates Understanding in Unified 3D Model based on Multiview images(https://arxiv.org/abs/2511.07222)
Keywords: generation
Abstract: This paper presents Omni-View, which extends unified multimodal understanding and generation to 3D scenes based on multiview images, exploring the principle that "generation facilitates understanding". Consisting of an understanding model, a texture module, and a geometry module, Omni-View jointly models scene understanding, novel view synthesis, and geometry estimation, enabling synergistic interaction between 3D scene understanding and generation tasks. By design, it leverages the spatiotemporal modeling capabilities of its texture module responsible for appearance synthesis, alongside the explicit geometric constraints provided by its dedicated geometry module, thereby enriching the model's holistic understanding of 3D scenes. Trained with a two-stage strategy, Omni-View achieves a state-of-the-art score of 55.4 on the VSI-Bench benchmark, outperforming existing specialized 3D understanding models, while simultaneously delivering strong performance in both novel view synthesis and 3D scene generation.
摘要：本文提出Omni-View，将统一的多模态理解和生成扩展到基于多视图图像的3D场景，探索“生成促进理解”的原则。 Omni-View由理解模型、纹理模块和几何模块组成，联合建模场景理解、新颖视图合成和几何估计，实现3D场景理解和生成任务之间的协同交互。根据设计，它利用负责外观合成的纹理模块的时空建模功能，以及其专用几何模块提供的显式几何约束，从而丰富了模型对 3D 场景的整体理解。采用两阶段策略进行训练后，Omni-View 在 VSI-Bench 基准测试中取得了 55.4 分的最高分数，超越了现有的专业 3D 理解模型，同时在新颖的视图合成和 3D 场景生成方面提供了强大的性能。

Title: 4DSTR: Advancing Generative 4D Gaussians with Spatial-Temporal Rectification for High-Quality and Consistent 4D Generation

Authors: Mengmeng Liu, Jiuming Liu, Yunpeng Zhang, Jiangtao Li, Michael Ying Yang, Francesco Nex, Hao Cheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.07241
Pdf URL: https://arxiv.org/pdf/2511.07241
Copy Paste: [[2511.07241]] 4DSTR: Advancing Generative 4D Gaussians with Spatial-Temporal Rectification for High-Quality and Consistent 4D Generation(https://arxiv.org/abs/2511.07241)
Keywords: generation, generative
Abstract: Remarkable advances in recent 2D image and 3D shape generation have induced a significant focus on dynamic 4D content generation. However, previous 4D generation methods commonly struggle to maintain spatial-temporal consistency and adapt poorly to rapid temporal variations, due to the lack of effective spatial-temporal modeling. To address these problems, we propose a novel 4D generation network called 4DSTR, which modulates generative 4D Gaussian Splatting with spatial-temporal rectification. Specifically, temporal correlation across generated 4D sequences is designed to rectify deformable scales and rotations and guarantee temporal consistency. Furthermore, an adaptive spatial densification and pruning strategy is proposed to address significant temporal variations by dynamically adding or deleting Gaussian points with the awareness of their pre-frame movements. Extensive experiments demonstrate that our 4DSTR achieves state-of-the-art performance in video-to-4D generation, excelling in reconstruction quality, spatial-temporal consistency, and adaptation to rapid temporal movements.
摘要：最近 2D 图像和 3D 形状生成方面的显着进步引起了人们对动态 4D 内容生成的极大关注。然而，由于缺乏有效的时空建模，以前的 4D 生成方法通常难以保持时空一致性，并且难以适应快速的时间变化。为了解决这些问题，我们提出了一种称为 4DSTR 的新型 4D 生成网络，它通过时空校正来调制生成 4D 高斯泼溅。具体来说，生成的 4D 序列之间的时间相关性旨在纠正可变形尺度和旋转并保证时间一致性。此外，提出了一种自适应空间致密化和修剪策略，通过动态添加或删除高斯点并意识到其帧前运动来解决显着的时间变化。大量实验表明，我们的 4DSTR 在视频到 4D 生成方面实现了最先进的性能，在重建质量、时空一致性以及对快速时间运动的适应方面表现出色。

Title: Enabling Off-Policy Imitation Learning with Deep Actor Critic Stabilization

Authors: Sayambhu Sen, Shalabh Bhatnagar
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.07288
Pdf URL: https://arxiv.org/pdf/2511.07288
Copy Paste: [[2511.07288]] Enabling Off-Policy Imitation Learning with Deep Actor Critic Stabilization(https://arxiv.org/abs/2511.07288)
Keywords: generative
Abstract: Learning complex policies with Reinforcement Learning (RL) is often hindered by instability and slow convergence, a problem exacerbated by the difficulty of reward engineering. Imitation Learning (IL) from expert demonstrations bypasses this reliance on rewards. However, state-of-the-art IL methods, exemplified by Generative Adversarial Imitation Learning (GAIL) (Ho et al.), suffer from severe sample inefficiency. This is a direct consequence of their foundational on-policy algorithms, such as TRPO (Schulman et al.). In this work, we introduce an adversarial imitation learning algorithm that incorporates off-policy learning to improve sample efficiency. By combining an off-policy framework with auxiliary techniques (specifically, double-Q-network-based stabilization and value learning without reward function inference), we demonstrate a reduction in the samples required to robustly match expert behavior.
摘要：使用强化学习（RL）学习复杂的策略通常会受到不稳定和收敛缓慢的阻碍，而奖励工程的难度又加剧了这一问题。来自专家演示的模仿学习（IL）绕过了对奖励的依赖。然而，最先进的 IL 方法，以生成对抗性模仿学习 (GAIL)（Ho 等人）为例，遭受严重的样本效率低下之苦。这是其基础同策略算法的直接结果，例如 TRPO（Schulman 等人）。在这项工作中，我们引入了一种对抗性模仿学习算法，该算法结合了离策略学习来提高样本效率。通过将离策略框架与辅助技术相结合，特别是基于双 Q 网络的稳定和价值学习（无需奖励函数推断），我们证明了稳健匹配专家行为所需的样本减少了。

Title: LMM-IQA: Image Quality Assessment for Low-Dose CT Imaging

Authors: Kagan Celik, Mehmet Ozan Unal, Metin Ertas, Isa Yildirim
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.07298
Pdf URL: https://arxiv.org/pdf/2511.07298
Copy Paste: [[2511.07298]] LMM-IQA: Image Quality Assessment for Low-Dose CT Imaging(https://arxiv.org/abs/2511.07298)
Keywords: quality assessment
Abstract: Low-dose computed tomography (CT) represents a significant improvement in patient safety through lower radiation doses, but increased noise, blur, and contrast loss can diminish diagnostic quality. Therefore, consistency and robustness in image quality assessment become essential for clinical applications. In this study, we propose an LLM-based quality assessment system that generates both numerical scores and textual descriptions of degradations such as noise, blur, and contrast loss. Furthermore, various inference strategies - from the zero-shot approach to metadata integration and error feedback - are systematically examined, demonstrating the progressive contribution of each method to overall performance. The resultant assessments yield not only highly correlated scores but also interpretable output, thereby adding value to clinical workflows. The source codes of our study are available at this https URL.
摘要：低剂量计算机断层扫描 (CT) 通过降低辐射剂量显着提高了患者安全性，但噪声、模糊和对比度损失的增加会降低诊断质量。因此，图像质量评估的一致性和稳健性对于临床应用至关重要。在这项研究中，我们提出了一种基于 LLM 的质量评估系统，该系统可以生成噪声、模糊和对比度损失等退化的数字分数和文本描述。此外，还系统地检查了各种推理策略（从零样本方法到元数据集成和错误反馈），展示了每种方法对整体性能的渐进贡献。由此产生的评估不仅产生高度相关的分数，而且产生可解释的输出，从而为临床工作流程增加价值。我们研究的源代码可以在这个 https URL 上找到。

Title: Q-RAG: Long Context Multi-step Retrieval via Value-based Embedder Training

Authors: Artyom Sorokin, Nazar Buzun, Alexander Anokhin, Oleg Inozemcev, Egor Vedernikov, Petr Anokhin, Mikhail Burtsev, Trushkov Alexey, Yin Wenshuai, Evgeny Burnaev
Subjects: cs.LG, cs.IR
Abstract URL: https://arxiv.org/abs/2511.07328
Pdf URL: https://arxiv.org/pdf/2511.07328
Copy Paste: [[2511.07328]] Q-RAG: Long Context Multi-step Retrieval via Value-based Embedder Training(https://arxiv.org/abs/2511.07328)
Keywords: generation
Abstract: Retrieval-Augmented Generation (RAG) methods enhance LLM performance by efficiently filtering relevant context for LLMs, reducing hallucinations and inference cost. However, most existing RAG methods focus on single-step retrieval, which is often insufficient for answering complex questions that require multi-step search. Recently, multi-step retrieval approaches have emerged, typically involving the fine-tuning of small LLMs to perform multi-step retrieval. This type of fine-tuning is highly resource-intensive and does not enable the use of larger LLMs. In this work, we propose Q-RAG, a novel approach that fine-tunes the Embedder model for multi-step retrieval using reinforcement learning (RL). Q-RAG offers a competitive, resource-efficient alternative to existing multi-step retrieval methods for open-domain question answering and achieves state-of-the-art results on the popular long-context benchmarks Babilong and RULER for contexts up to 10M tokens.
摘要：检索增强生成（RAG）方法通过有效过滤 LLM 的相关上下文、减少幻觉和推理成本来增强 LLM 的性能。然而，大多数现有的 RAG 方法侧重于单步检索，这通常不足以回答需要多步搜索的复杂问题。最近，出现了多步检索方法，通常涉及对小型 LLM 进行微调来执行多步检索。这种类型的微调是高度资源密集型的，并且无法使用更大的 LLM。在这项工作中，我们提出了 Q-RAG，这是一种使用强化学习 (RL) 微调 Embedder 模型以进行多步骤检索的新颖方法。 Q-RAG 为开放域问答的现有多步检索方法提供了一种有竞争力的、资源高效的替代方案，并在流行的长上下文基准 Babilong 和 RULER 上针对高达 10M 个令牌的上下文实现了最先进的结果。

Title: Inference-Time Scaling of Diffusion Models for Infrared Data Generation

Authors: Kai A. Horstmann, Maxim Clouser, Kia Khezeli
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2511.07362
Pdf URL: https://arxiv.org/pdf/2511.07362
Copy Paste: [[2511.07362]] Inference-Time Scaling of Diffusion Models for Infrared Data Generation(https://arxiv.org/abs/2511.07362)
Keywords: generation, generative
Abstract: Infrared imagery enables temperature-based scene understanding using passive sensors, particularly under conditions of low visibility where traditional RGB imaging fails. Yet, developing downstream vision models for infrared applications is hindered by the scarcity of high-quality annotated data, due to the specialized expertise required for infrared annotation. While synthetic infrared image generation has the potential to accelerate model development by providing large-scale, diverse training data, training foundation-level generative diffusion models in the infrared domain has remained elusive due to limited datasets. In light of such data constraints, we explore an inference-time scaling approach using a domain-adapted CLIP-based verifier for enhanced infrared image generation quality. We adapt FLUX.1-dev, a state-of-the-art text-to-image diffusion model, to the infrared domain by finetuning it on a small sample of infrared images using parameter-efficient techniques. The trained verifier is then employed during inference to guide the diffusion sampling process toward higher quality infrared generations that better align with input text prompts. Empirically, we find that our approach leads to consistent improvements in generation quality, reducing FID scores on the KAIST Multispectral Pedestrian Detection Benchmark dataset by 10% compared to unguided baseline samples. Our results suggest that inference-time guidance offers a promising direction for bridging the domain gap in low-data infrared settings.
摘要：红外图像可以使用无源传感器实现基于温度的场景理解，特别是在传统 RGB 成像无法实现的低能见度条件下。然而，由于红外注释所需的专业知识，高质量注释数据的稀缺阻碍了为红外应用开发下游视觉模型。虽然合成红外图像生成有潜力通过提供大规模、多样化的训练数据来加速模型开发，但由于数据集有限，在红外领域训练基础级生成扩散模型仍然难以实现。鉴于此类数据限制，我们探索了一种推理时间缩放方法，使用基于域自适应的 CLIP 验证器来增强红外图像生成质量。我们通过使用参数高效技术在一小部分红外图像样本上进行微调，将最先进的文本到图像扩散模型 FLUX.1-dev 应用于红外域。然后，在推理过程中使用训练有素的验证器来指导扩散采样过程，以生成更高质量的红外生成，从而更好地与输入文本提示保持一致。根据经验，我们发现我们的方法可以持续提高生成质量，与无引导的基线样本相比，KAIST 多光谱行人检测基准数据集的 FID 分数降低了 10%。我们的结果表明，推理时间指导为弥合低数据红外设置中的域差距提供了一个有希望的方向。
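
Sketch: the abstract above describes using a verifier at inference time to steer sampling toward higher-quality generations. One simple form of inference-time scaling is best-of-N selection: draw several candidates, score each with the verifier, and keep the best. The code below is a minimal illustration under that assumption; it is not the authors' pipeline, and `toy_sample` and `toy_score` are hypothetical stand-ins for the diffusion sampler and the CLIP-based verifier.

```python
import random
from typing import Any, Callable, Tuple


def best_of_n(
    prompt: str,
    sample: Callable[[str, random.Random], Any],
    score: Callable[[str, Any], float],
    n: int = 8,
    seed: int = 0,
) -> Tuple[Any, float]:
    """Draw n candidates for a prompt and return the one the verifier scores highest."""
    rng = random.Random(seed)
    candidates = [sample(prompt, rng) for _ in range(n)]
    scores = [score(prompt, c) for c in candidates]
    best = max(range(n), key=scores.__getitem__)
    return candidates[best], scores[best]


# Hypothetical stand-ins: a "generation" is a float, and the verifier
# prefers values close to 0.5. A real system would call a diffusion
# sampler and a learned image-text scorer here.
def toy_sample(prompt: str, rng: random.Random) -> float:
    return rng.random()


def toy_score(prompt: str, candidate: float) -> float:
    return -abs(candidate - 0.5)


if __name__ == "__main__":
    image, quality = best_of_n("infrared street scene", toy_sample, toy_score, n=16)
    print(image, quality)
```

With a fixed seed, raising `n` can only keep or improve the selected score, since the larger candidate set contains the smaller one. The cost is n sampler calls plus n verifier calls per prompt.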

Title: Provable Benefit of Curriculum in Transformer Tree-Reasoning Post-Training

Authors: Dake Bu, Wei Huang, Andi Han, Atsushi Nitanda, Hau-San Wong, Qingfu Zhang, Taiji Suzuki
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.07372
Pdf URL: https://arxiv.org/pdf/2511.07372
Copy Paste: [[2511.07372]] Provable Benefit of Curriculum in Transformer Tree-Reasoning Post-Training(https://arxiv.org/abs/2511.07372)
Keywords: generation
Abstract: Recent curriculum techniques in the post-training stage of LLMs have been widely observed to outperform non-curriculum approaches in enhancing reasoning performance, yet a principled understanding of why and to what extent they work remains elusive. To address this gap, we develop a theoretical framework grounded in the intuition that progressively learning through manageable steps is more efficient than directly tackling a hard reasoning task, provided each stage stays within the model's effective competence. Under mild complexity conditions linking consecutive curriculum stages, we show that curriculum post-training avoids the exponential complexity bottleneck. To substantiate this result, drawing insights from the Chain-of-Thoughts (CoTs) solving mathematical problems such as Countdown and parity, we model CoT generation as a states-conditioned autoregressive reasoning tree, define a uniform-branching base model to capture pretrained behavior, and formalize curriculum stages as either depth-increasing (longer reasoning chains) or hint-decreasing (shorter prefixes) subtasks. Our analysis shows that, under outcome-only reward signals, reinforcement learning finetuning achieves high accuracy with polynomial sample complexity, whereas direct learning suffers from an exponential bottleneck. We further establish analogous guarantees for test-time scaling, where curriculum-aware querying reduces both reward oracle calls and sampling cost from exponential to polynomial order.
摘要：人们广泛观察到，LLM 后训练阶段的最新课程学习技术在提高推理能力方面优于非课程方法，但对其为何有效以及在多大程度上发挥作用的原则性理解仍然难以捉摸。为了解决这一差距，我们开发了一个基于直觉的理论框架，即只要每个阶段都保持在模型的有效能力范围内，通过可管理的步骤逐步学习比直接解决困难推理任务更有效。在连接连续课程阶段的温和复杂性条件下，我们表明课程式后训练避免了指数复杂性瓶颈。为了证实这一结果，从解决倒计时和奇偶校验等数学问题的思想链 (CoT) 中汲取见解，我们将 CoT 生成建模为状态条件自回归推理树，定义统一分支基础模型来捕获预训练行为，并将课程阶段形式化为深度增加（较长推理链）或提示减少（较短前缀）子任务。我们的分析表明，在仅结果奖励信号下，强化学习微调可以通过多项式样本复杂性实现高精度，而直接学习则遇到指数瓶颈。我们进一步为测试时间扩展建立了类似的保证，其中课程感知查询将奖励预言机调用和采样成本从指数级降低到多项式级。

Title: Real-Time LiDAR Super-Resolution via Frequency-Aware Multi-Scale Fusion

Authors: June Moh Goo, Zichao Zeng, Jan Boehm
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2511.07377
Pdf URL: https://arxiv.org/pdf/2511.07377
Copy Paste: [[2511.07377]] Real-Time LiDAR Super-Resolution via Frequency-Aware Multi-Scale Fusion(https://arxiv.org/abs/2511.07377)
Keywords: super-resolution
Abstract: LiDAR super-resolution addresses the challenge of achieving high-quality 3D perception from cost-effective, low-resolution sensors. While recent transformer-based approaches like TULIP show promise, they remain limited to spatial-domain processing with restricted receptive fields. We introduce FLASH (Frequency-aware LiDAR Adaptive Super-resolution with Hierarchical fusion), a novel framework that overcomes these limitations through dual-domain processing. FLASH integrates two key innovations: (i) Frequency-Aware Window Attention that combines local spatial attention with global frequency-domain analysis via FFT, capturing both fine-grained geometry and periodic scanning patterns at log-linear complexity. (ii) Adaptive Multi-Scale Fusion that replaces conventional skip connections with learned position-specific feature aggregation, enhanced by CBAM attention for dynamic feature selection. Extensive experiments on KITTI demonstrate that FLASH achieves state-of-the-art performance across all evaluation metrics, surpassing even uncertainty-enhanced baselines that require multiple forward passes. Notably, FLASH outperforms TULIP with Monte Carlo Dropout while maintaining single-pass efficiency, which enables real-time deployment. The consistent superiority across all distance ranges validates that our dual-domain approach effectively handles uncertainty through architectural design rather than computationally expensive stochastic inference, making it practical for autonomous systems.
摘要：LiDAR 超分辨率解决了通过经济高效的低分辨率传感器实现高质量 3D 感知的挑战。虽然最近基于 Transformer 的方法（如 TULIP）显示出前景，但它们仍然仅限于接受域受限的空间域处理。我们引入了 FLASH（具有层次融合的频率感知激光雷达自适应超分辨率），这是一种通过双域处理克服这些限制的新颖框架。 FLASH 集成了两项关键创新：(i) 频率感知窗口注意力，通过 FFT 将局部空间注意力与全局频域分析相结合，以对数线性复杂度捕获细粒度几何和周期性扫描模式。 (ii) 自适应多尺度融合，用学习到的特定位置特征聚合取代传统的跳跃连接，并通过 CBAM 注意力增强动态特征选择。 KITTI 上的大量实验表明，FLASH 在所有评估指标上都实现了最先进的性能，甚至超越了需要多次前向传递的不确定性增强基线。值得注意的是，FLASH 的性能优于带有 Monte Carlo Dropout 的 TULIP，同时保持单遍效率，从而实现实时部署。在所有距离范围内一致的优越性证明，我们的双域方法可以通过架构设计而不是计算成本高昂的随机推理有效地处理不确定性，使其适用于自治系统。

Title: A Diffusion Model to Shrink Proteins While Maintaining Their Function

Authors: Ethan Baron, Alan N. Amin, Ruben Weitzman, Debora Marks, Andrew Gordon Wilson
Subjects: cs.LG, q-bio.QM
Abstract URL: https://arxiv.org/abs/2511.07390
Pdf URL: https://arxiv.org/pdf/2511.07390
Copy Paste: [[2511.07390]] A Diffusion Model to Shrink Proteins While Maintaining Their Function(https://arxiv.org/abs/2511.07390)
Keywords: generative
Abstract: Many proteins useful in modern medicine or bioengineering are challenging to make in the lab, fuse with other proteins in cells, or deliver to tissues in the body, because their sequences are too long. Shortening these sequences typically involves costly, time-consuming experimental campaigns. Ideally, we could instead use modern models of massive databases of sequences from nature to learn how to propose shrunken proteins that resemble sequences found in nature. Unfortunately, these models struggle to efficiently search the combinatorial space of all deletions, and are not trained with inductive biases to learn how to delete. To address this gap, we propose SCISOR, a novel discrete diffusion model that deletes letters from sequences to generate protein samples that resemble those found in nature. To do so, SCISOR trains a de-noiser to reverse a forward noising process that adds random insertions to natural sequences. As a generative model, SCISOR fits evolutionary sequence data competitively with previous large models. In evaluation, SCISOR achieves state-of-the-art predictions of the functional effects of deletions on ProteinGym. Finally, we use the SCISOR de-noiser to shrink long protein sequences, and show that its suggested deletions result in significantly more realistic proteins and more often preserve functional motifs than previous models of evolutionary sequences.
摘要：许多在现代医学或生物工程中有用的蛋白质在实验室中制造、与细胞中的其他蛋白质融合或输送到体内的组织都具有挑战性，因为它们的序列太长。缩短这些序列通常涉及昂贵且耗时的实验活动。理想情况下，我们可以使用来自自然界的海量序列数据库的现代模型来学习如何提出类似于自然界中发现的序列的收缩蛋白质。不幸的是，这些模型很难有效地搜索所有删除的组合空间，并且没有经过归纳偏差的训练来学习如何删除。为了解决这一差距，我们提出了 SCISOR，这是一种新颖的离散扩散模型，可以删除序列中的字母以生成类似于自然界中发现的蛋白质样本。为此，SCISOR 训练降噪器来反转前向噪声过程，将随机插入添加到自然序列中。作为一种生成模型，SCISOR 与之前的大型模型相比，能够更好地拟合进化序列数据。在评估中，SCISOR 对 ProteinGym 上的缺失功能影响实现了最先进的预测。最后，我们使用 SCISOR 降噪器来缩小长蛋白质序列，并表明其建议的删除会产生明显更真实的蛋白质，并且比以前的进化序列模型更经常保留功能基序。

Title: StreamDiffusionV2: A Streaming System for Dynamic and Interactive Video Generation

Authors: Tianrui Feng, Zhi Li, Shuo Yang, Haocheng Xi, Muyang Li, Xiuyu Li, Lvmin Zhang, Keting Yang, Kelly Peng, Song Han, Maneesh Agrawala, Kurt Keutzer, Akio Kodaira, Chenfeng Xu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2511.07399
Pdf URL: https://arxiv.org/pdf/2511.07399
Copy Paste: [[2511.07399]] StreamDiffusionV2: A Streaming System for Dynamic and Interactive Video Generation(https://arxiv.org/abs/2511.07399)
Keywords: generation, generative
Abstract: Generative models are reshaping the live-streaming industry by redefining how content is created, styled, and delivered. Previous image-based streaming diffusion models have powered efficient and creative live streaming products but have hit limits on temporal consistency due to the foundation of image-based designs. Recent advances in video diffusion have markedly improved temporal consistency and sampling efficiency for offline generation. However, offline generation systems primarily optimize throughput by batching large workloads. In contrast, live online streaming operates under strict service-level objectives (SLOs): time-to-first-frame must be minimal, and every frame must meet a per-frame deadline with low jitter. Besides, scalable multi-GPU serving for real-time streams remains largely unresolved so far. To address this, we present StreamDiffusionV2, a training-free pipeline for interactive live streaming with video diffusion models. StreamDiffusionV2 integrates an SLO-aware batching scheduler and a block scheduler, together with a sink-token-guided rolling KV cache, a motion-aware noise controller, and other system-level optimizations. Moreover, we introduce a scalable pipeline orchestration that parallelizes the diffusion process across denoising steps and network layers, achieving near-linear FPS scaling without violating latency guarantees. The system scales seamlessly across heterogeneous GPU environments and supports flexible denoising steps (e.g., 1-4), enabling both ultra-low-latency and higher-quality modes. Without TensorRT or quantization, StreamDiffusionV2 renders the first frame within 0.5s and attains 58.28 FPS with a 14B-parameter model and 64.52 FPS with a 1.3B-parameter model on four H100 GPUs, making state-of-the-art generative live streaming practical and accessible, from individual creators to enterprise-scale platforms.
摘要：生成模型正在通过重新定义内容的创建、样式和交付方式来重塑直播行业。以前基于图像的流媒体扩散模型已经为高效且富有创意的直播产品提供了动力，但由于基于图像的设计基础，在时间一致性上遇到了限制。视频扩散的最新进展显着提高了离线生成的时间一致性和采样效率。然而，离线生成系统主要通过批量处理大型工作负载来优化吞吐量。相比之下，实时在线流媒体在严格的服务级别目标 (SLO) 下运行：首帧时间必须最短，并且每一帧必须满足每帧的截止日期且抖动较低。此外，迄今为止，用于实时流的可扩展多 GPU 服务在很大程度上仍未得到解决。为了解决这个问题，我们提出了 StreamDiffusionV2，这是一种无需训练的管道，用于使用视频扩散模型进行交互式直播。 StreamDiffusionV2 集成了 SLO 感知批处理调度器和块调度器，以及接收器令牌引导的滚动 KV 缓存、运动感知噪声控制器和其他系统级优化。此外，我们引入了一种可扩展的管道编排，可以并行化去噪步骤和网络层之间的扩散过程，从而在不违反延迟保证的情况下实现近线性 FPS 缩放。该系统可跨异构 GPU 环境无缝扩展，并支持灵活的降噪步骤（例如 1--4），从而实现超低延迟和更高质量的模式。在没有 TensorRT 或量化的情况下，StreamDiffusionV2 可在 0.5 秒内渲染第一帧，并在四个 H100 GPU 上使用 14B 参数模型实现 58.28 FPS，使用 1.3B 参数模型实现 64.52 FPS，从而使最先进的生成式直播变得实用且易于使用——从个人创作者到企业级平台。

Title: DIMO: Diverse 3D Motion Generation for Arbitrary Objects

Authors: Linzhan Mou, Jiahui Lei, Chen Wang, Lingjie Liu, Kostas Daniilidis
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.07409
Pdf URL: https://arxiv.org/pdf/2511.07409
Copy Paste: [[2511.07409]] DIMO: Diverse 3D Motion Generation for Arbitrary Objects(https://arxiv.org/abs/2511.07409)
Keywords: generation, generative
Abstract: We present DIMO, a generative approach capable of generating diverse 3D motions for arbitrary objects from a single image. The core idea of our work is to leverage the rich priors in well-trained video models to extract the common motion patterns and then embed them into a shared low-dimensional latent space. Specifically, we first generate multiple videos of the same object with diverse motions. We then embed each motion into a latent vector and train a shared motion decoder to learn the distribution of motions represented by a structured and compact motion representation, i.e., neural key point trajectories. The canonical 3D Gaussians are then driven by these key points and fused to model the geometry and appearance. During inference time with learned latent space, we can instantly sample diverse 3D motions in a single-forward pass and support several interesting applications including 3D motion interpolation and language-guided motion generation. Our project page is available at this https URL.
摘要：我们提出了 DIMO，这是一种生成方法，能够从单个图像中为任意对象生成不同的 3D 运动。我们工作的核心思想是利用训练有素的视频模型中丰富的先验来提取常见的运动模式，然后将它们嵌入到共享的低维潜在空间中。具体来说，我们首先生成同一对象具有不同动作的多个视频。然后，我们将每个运动嵌入到一个潜在向量中，并训练一个共享运动解码器来学习由结构化和紧凑的运动表示（即神经关键点轨迹）表示的运动分布。然后，规范的 3D 高斯函数由这些关键点驱动并融合以对几何形状和外观进行建模。在学习到的潜在空间的推理时间内，我们可以在一次前向传递中立即采样不同的 3D 运动，并支持多种有趣的应用，包括 3D 运动插值和语言引导运动生成。我们的项目页面可通过此 https URL 访问。

Title: Routing Manifold Alignment Improves Generalization of Mixture-of-Experts LLMs

Authors: Zhongyang Li, Ziyue Li, Tianyi Zhou
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.07419
Pdf URL: https://arxiv.org/pdf/2511.07419
Copy Paste: [[2511.07419]] Routing Manifold Alignment Improves Generalization of Mixture-of-Experts LLMs(https://arxiv.org/abs/2511.07419)
Keywords: generation
Abstract: Sparse Mixture-of-Experts (MoE) architectures have been widely adopted in recent large language models since they can efficiently scale up model capability without increasing inference cost. However, evaluations on broad downstream tasks reveal a consistent suboptimality of the routers in existing MoE LLMs, which results in a severe performance gap (e.g., 10-20% in accuracy) to the optimal routing. In this paper, we show that aligning the manifold of routing weights with that of task embedding can effectively reduce the gap and improve MoE LLMs' generalization performance. Our method, "Routing Manifold Alignment (RoMA)", introduces an additional manifold regularization term in the post-training objective and only requires lightweight finetuning of routers (with other parameters frozen). Specifically, the regularization encourages the routing weights of each sample to be close to those of its successful neighbors (whose routing weights lead to correct answers) in a task embedding space. Consequently, samples targeting similar tasks will share similar expert choices across layers. Building such bindings between tasks and experts over different samples is essential to achieve better generalization. Moreover, RoMA demonstrates the advantage of unifying the task understanding (by embedding models) with solution generation (by MoE LLMs). In experiments, we finetune routers in OLMoE, DeepSeekMoE, and Qwen3-MoE using RoMA. Evaluations on diverse benchmarks and extensive comparisons with baselines show the substantial improvement brought by RoMA.
摘要：稀疏专家混合（MoE）在最近的大型语言模型中得到了广泛采用，因为它可以有效地扩展模型能力而不增加推理成本。然而，对广泛下游任务的评估揭示了现有 MoE LLM 中路由器的一致次优性，这导致与最佳路由存在严重的性能差距（例如，准确度为 10-20%）。在本文中，我们表明，将路由权重的流形与任务嵌入的流形对齐可以有效地缩小差距并提高 MoE LLM 的泛化性能。我们的方法“路由流形对齐（RoMA）”在训练后目标中引入了额外的流形正则化项，并且只需要对路由器进行轻量级微调（其他参数冻结）。具体来说，正则化鼓励每个样本的路由权重接近任务嵌入空间中其成功邻居的路由权重（其路由权重导致正确的答案）。因此，针对相似任务的样本将跨层共享相似的专家选择。在不同样本上建立任务和专家之间的这种绑定对于实现更好的泛化至关重要。此外，RoMA 展示了将任务理解（通过嵌入模型）与解决方案生成（通过 MoE LLM）统一起来的优势。在实验中，我们使用 RoMA 微调 OLMoE、DeepSeekMoE 和 Qwen3-MoE 中的路由器。对各种基准的评估以及与基线的广泛比较显示了 RoMA 带来的实质性改进。
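
Sketch: the regularization described above pulls each sample's routing weights toward those of its successful neighbors in a task embedding space. The code below is one plausible reading of that idea, written in plain Python for a single layer. The squared-distance penalty, the k-nearest-neighbor choice, and the unweighted neighbor mean are all assumptions made for illustration; the abstract does not specify the exact form, and `manifold_regularizer` is a hypothetical name.

```python
from typing import List, Sequence


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def manifold_regularizer(
    routing: List[List[float]],   # per-sample routing weights over experts
    task_emb: List[List[float]],  # per-sample task embeddings
    success: List[bool],          # True if the sample's routing led to a correct answer
    k: int = 2,
) -> float:
    """Mean squared gap between each sample's routing weights and the
    average routing weights of its k nearest successful neighbors."""
    n = len(routing)
    total = 0.0
    for i in range(n):
        # Candidate neighbors: other samples whose routing was successful.
        cand = [j for j in range(n) if j != i and success[j]]
        if not cand:
            continue
        # Nearest neighbors are chosen in task embedding space.
        nbrs = sorted(cand, key=lambda j: sq_dist(task_emb[i], task_emb[j]))[:k]
        target = [
            sum(routing[j][e] for j in nbrs) / len(nbrs)
            for e in range(len(routing[i]))
        ]
        total += sq_dist(routing[i], target)
    return total / n


if __name__ == "__main__":
    routing = [[1.0, 0.0], [0.0, 1.0]]
    task_emb = [[0.0], [1.0]]
    print(manifold_regularizer(routing, task_emb, [True, True], k=1))
```

In a training setup this term would be added to the post-training objective with some weight, and gradients would flow only into the router parameters, matching the abstract's note that other parameters stay frozen.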