2025-10-16

Title: An Investigation of Memorization Risk in Healthcare Foundation Models

Authors: Sana Tonekaboni, Lena Stempfle, Adibvafa Fallahpour, Walter Gerych, Marzyeh Ghassemi
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2510.12950
Pdf URL: https://arxiv.org/pdf/2510.12950
Copy Paste: [[2510.12950]] An Investigation of Memorization Risk in Healthcare Foundation Models(https://arxiv.org/abs/2510.12950)
Keywords: generative
Abstract: Foundation models trained on large-scale de-identified electronic health records (EHRs) hold promise for clinical applications. However, their capacity to memorize patient information raises important privacy concerns. In this work, we introduce a suite of black-box evaluation tests to assess privacy-related memorization risks in foundation models trained on structured EHR data. Our framework includes methods for probing memorization at both the embedding and generative levels, and aims to distinguish between model generalization and harmful memorization in clinically relevant settings. We contextualize memorization in terms of its potential to compromise patient privacy, particularly for vulnerable subgroups. We validate our approach on a publicly available EHR foundation model and release an open-source toolkit to facilitate reproducible and collaborative privacy assessments in healthcare AI.
摘要：经过大规模去识别化电子健康记录 (EHR) 训练的基金会模型有望应用于临床。然而，他们记忆患者信息的能力引起了重要的隐私问题。在这项工作中，我们引入了一套黑盒评估测试，以评估在结构化 EHR 数据上训练的基础模型中与隐私相关的记忆风险。我们的框架包括在嵌入和生成水平上探索记忆的方法，旨在区分临床相关环境中的模型泛化和有害记忆。我们将记忆置于上下文中，因为它可能会损害患者的隐私，特别是对于弱势群体。我们在公开的 EHR 基础模型上验证了我们的方法，并发布了一个开源工具包，以促进医疗保健人工智能中的可重复和协作隐私评估。

Title: Epistemic-aware Vision-Language Foundation Model for Fetal Ultrasound Interpretation

Authors: Xiao He, Huangxuan Zhao, Guojia Wan, Wei Zhou, Yanxing Liu, Juhua Liu, Yongchao Xu, Yong Luo, Dacheng Tao, Bo Du
Subjects: cs.CV, cs.AI, cs.IR, cs.MM
Abstract URL: https://arxiv.org/abs/2510.12953
Pdf URL: https://arxiv.org/pdf/2510.12953
Copy Paste: [[2510.12953]] Epistemic-aware Vision-Language Foundation Model for Fetal Ultrasound Interpretation(https://arxiv.org/abs/2510.12953)
Keywords: generation
Abstract: Recent medical vision-language models have shown promise on tasks such as VQA, report generation, and anomaly detection. However, most are adapted to structured adult imaging and underperform in fetal ultrasound, which poses challenges of multi-view image reasoning, numerous diseases, and image diversity. To bridge this gap, we introduce FetalMind, a medical AI system tailored to fetal ultrasound for both report generation and diagnosis. Guided by clinical workflow, we propose Salient Epistemic Disentanglement (SED), which injects an expert-curated bipartite graph into the model to decouple view-disease associations and to steer preference selection along clinically faithful steps via reinforcement learning. This design mitigates variability across diseases and heterogeneity across views, reducing learning bottlenecks while aligning the model's inference with obstetric practice. To train FetalMind at scale, we curate the FetalSigma-1M dataset, the first large-scale fetal ultrasound report corpus, comprising 20K reports from twelve medical centers, addressing the scarcity of domain data. Extensive experiments show that FetalMind outperforms open- and closed-source baselines across all gestational stages, achieving +14% average gains and +61.2% higher accuracy on critical conditions while remaining efficient, stable, and scalable. Project Page: this https URL.
摘要：最近的医学视觉语言模型在 VQA、报告生成和异常检测等任务中显示出了前景。然而，大多数适用于结构化成人成像，而在胎儿超声方面表现不佳，这对多视图图像推理、众多疾病和图像多样性提出了挑战。为了弥补这一差距，我们推出了 FetalMind，这是一种专为胎儿超声检查而定制的医疗人工智能系统，用于报告生成和诊断。在临床工作流程的指导下，我们提出了显着认知解缠（SED），它将专家策划的二分图注入模型中，以解耦视图与疾病的关联，并通过强化学习引导偏好选择沿着临床忠实的步骤进行。这种设计减轻了疾病之间的变异性和观点之间的异质性，减少了学习瓶颈，同时使模型的推理与产科实践保持一致。为了大规模训练 FetalMind，我们整理了 FetalSigma-1M 数据集，这是第一个大规模胎儿超声报告语料库，包含来自 12 个医疗中心的 2 万份报告，解决了领域数据的稀缺问题。大量实验表明，FetalMind 在所有妊娠阶段均优于开源和闭源基线，在关键条件下实现了 +14% 的平均增益和 +61.2% 的准确率，同时保持高效、稳定和可扩展。项目页面：此 https URL。

Title: Reference-Specific Unlearning Metrics Can Hide the Truth: A Reality Check

Authors: Sungjun Cho, Dasol Hwang, Frederic Sala, Sangheum Hwang, Kyunghyun Cho, Sungmin Cha
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2510.12981
Pdf URL: https://arxiv.org/pdf/2510.12981
Copy Paste: [[2510.12981]] Reference-Specific Unlearning Metrics Can Hide the Truth: A Reality Check(https://arxiv.org/abs/2510.12981)
Keywords: generative
Abstract: Current unlearning metrics for generative models evaluate success based on reference responses or classifier outputs rather than assessing the core objective: whether the unlearned model behaves indistinguishably from a model that never saw the unwanted data. This reference-specific approach creates systematic blind spots, allowing models to appear successful while retaining unwanted knowledge accessible through alternative prompts or attacks. We address these limitations by proposing Functional Alignment for Distributional Equivalence (FADE), a novel metric that measures distributional similarity between unlearned and reference models by comparing bidirectional likelihood assignments over generated samples. Unlike existing approaches that rely on predetermined references, FADE captures functional alignment across the entire output distribution, providing a principled assessment of genuine unlearning. Our experiments on the TOFU benchmark for LLM unlearning and the UnlearnCanvas benchmark for text-to-image diffusion model unlearning reveal that methods achieving near-optimal scores on traditional metrics fail to achieve distributional equivalence, with many becoming more distant from the gold standard than before unlearning. These findings expose fundamental gaps in current evaluation practices and demonstrate that FADE provides a more robust foundation for developing and assessing truly effective unlearning methods.
摘要：当前生成模型的遗忘指标根据参考响应或分类器输出来评估成功，而不是评估核心目标：未学习的模型的行为是否与从未见过不需要的数据的模型没有区别。这种特定于参考的方法会创建系统性盲点，使模型看起来很成功，同时保留通过替代提示或攻击可访问的不需要的知识。我们通过提出分布等价函数对齐（FADE）来解决这些局限性，这是一种新颖的指标，通过比较生成样本的双向似然分配来测量未学习模型和参考模型之间的分布相似性。与依赖预定参考的现有方法不同，FADE 捕获整个输出分布中的功能对齐，从而提供对真正遗忘的原则性评估。我们对 LLM 遗忘的 TOFU 基准和文本到图像扩散模型遗忘的 UnlearnCanvas 基准进行的实验表明，在传统指标上获得接近最佳分数的方法无法实现分布等价，许多方法比遗忘之前更远离黄金标准。这些发现揭示了当前评估实践中的根本差距，并表明 FADE 为开发和评估真正有效的忘却方法提供了更坚实的基础。

Title: SVAG-Bench: A Large-Scale Benchmark for Multi-Instance Spatio-temporal Video Action Grounding

Authors: Tanveer Hannan, Shuaicong Wu, Mark Weber, Suprosanna Shit, Jindong Gu, Rajat Koner, Aljoša Ošep, Laura Leal-Taixé, Thomas Seidl
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.13016
Pdf URL: https://arxiv.org/pdf/2510.13016
Copy Paste: [[2510.13016]] SVAG-Bench: A Large-Scale Benchmark for Multi-Instance Spatio-temporal Video Action Grounding(https://arxiv.org/abs/2510.13016)
Keywords: generation
Abstract: Understanding fine-grained actions and accurately localizing their corresponding actors in space and time are fundamental capabilities for advancing next-generation AI systems, including embodied agents, autonomous platforms, and human-AI interaction frameworks. Despite recent progress in video understanding, existing methods predominantly address either coarse-grained action recognition or generic object tracking, thereby overlooking the challenge of jointly detecting and tracking multiple objects according to their actions while grounding them temporally. To address this gap, we introduce Spatio-temporal Video Action Grounding (SVAG), a novel task that requires models to simultaneously detect, track, and temporally localize all referent objects in videos based on natural language descriptions of their actions. To support this task, we construct SVAG-Bench, a large-scale benchmark comprising 688 videos, 19,590 annotated records, and 903 unique verbs, covering a diverse range of objects, actions, and real-world scenes. We further propose SVAGFormer, a baseline framework that adapts state-of-the-art vision-language models for joint spatial and temporal grounding, and introduce SVAGEval, a standardized evaluation toolkit for fair and reproducible benchmarking. Empirical results show that existing models perform poorly on SVAG, particularly in dense or complex scenes, underscoring the need for more advanced reasoning over fine-grained object-action interactions in long videos.
摘要：理解细粒度的动作并在空间和时间上准确定位相应的参与者是推进下一代人工智能系统的基本能力，包括实体代理、自主平台和人机交互框架。尽管最近在视频理解方面取得了进展，但现有方法主要解决粗粒度动作识别或通用对象跟踪问题，从而忽视了根据多个对象的动作联合检测和跟踪多个对象同时将它们暂时接地的挑战。为了解决这一差距，我们引入了时空视频动作基础（SVAG），这是一项新颖的任务，要求模型根据其动作的自然语言描述同时检测、跟踪和临时定位视频中的所有参考对象。为了支持这项任务，我们构建了 SVAG-Bench，这是一个大规模基准测试，包含 688 个视频、19,590 个带注释的记录和 903 个独特的动词，涵盖各种对象、动作和现实世界场景。我们进一步提出了 SVAGFormer，一个基线框架，它采用最先进的视觉语言模型来进行联合空间和时间基础，并引入 SVAGEval，一个用于公平和可重复基准测试的标准化评估工具包。实证结果表明，现有模型在 SVAG 上表现不佳，尤其是在密集或复杂的场景中，这凸显了对长视频中的细粒度对象动作交互进行更高级推理的需要。

Title: SeqBench: Benchmarking Sequential Narrative Generation in Text-to-Video Models

Authors: Zhengxu Tang, Zizheng Wang, Luning Wang, Zitao Shuai, Chenhao Zhang, Siyu Qian, Yirui Wu, Bohao Wang, Haosong Rao, Zhenyu Yang, Chenwei Wu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2510.13042
Pdf URL: https://arxiv.org/pdf/2510.13042
Copy Paste: [[2510.13042]] SeqBench: Benchmarking Sequential Narrative Generation in Text-to-Video Models(https://arxiv.org/abs/2510.13042)
Keywords: generation
Abstract: Text-to-video (T2V) generation models have made significant progress in creating visually appealing videos. However, they struggle with generating coherent sequential narratives that require logical progression through multiple events. Existing T2V benchmarks primarily focus on visual quality metrics but fail to evaluate narrative coherence over extended sequences. To bridge this gap, we present SeqBench, a comprehensive benchmark for evaluating sequential narrative coherence in T2V generation. SeqBench includes a carefully designed dataset of 320 prompts spanning various narrative complexities, with 2,560 human-annotated videos generated from 8 state-of-the-art T2V models. Additionally, we design a Dynamic Temporal Graphs (DTG)-based automatic evaluation metric, which can efficiently capture long-range dependencies and temporal ordering while maintaining computational efficiency. Our DTG-based metric demonstrates a strong correlation with human annotations. Through systematic evaluation using SeqBench, we reveal critical limitations in current T2V models: failure to maintain consistent object states across multi-action sequences, physically implausible results in multi-object scenarios, and difficulties in preserving realistic timing and ordering relationships between sequential actions. SeqBench provides the first systematic framework for evaluating narrative coherence in T2V generation and offers concrete insights for improving sequential reasoning capabilities in future models. Please refer to this https URL for more details.
摘要：文本到视频 (T2V) 生成模型在创建具有视觉吸引力的视频方面取得了重大进展。然而，他们很难生成连贯的连续叙述，这些叙述需要通过多个事件进行逻辑进展。现有的 T2V 基准主要关注视觉质量指标，但无法评估扩展序列的叙事连贯性。为了弥补这一差距，我们提出了 SeqBench，这是一个用于评估 T2V 生成中顺序叙事连贯性的综合基准。 SeqBench 包含精心设计的数据集，其中包含 320 个提示，跨越各种叙事复杂性，以及从 8 个最先进的 T2V 模型生成的 2,560 个人工注释视频。此外，我们设计了一种基于动态时序图（DTG）的自动评估指标，它可以有效地捕获远程依赖性和时间顺序，同时保持计算效率。我们基于 DTG 的指标表明与人工注释有很强的相关性。通过使用 SeqBench 进行系统评估，我们揭示了当前 T2V 模型的关键局限性：无法在多动作序列中保持一致的对象状态，多对象场景中的物理结果不可信，以及难以保留顺序动作之间的真实时序和排序关系。 SeqBench 提供了第一个用于评估 T2V 生成中的叙述连贯性的系统框架，并为提高未来模型的顺序推理能力提供了具体的见解。请参阅此 https URL 了解更多详细信息。

Title: SceneAdapt: Scene-aware Adaptation of Human Motion Diffusion

Authors: Jungbin Cho, Minsu Kim, Jisoo Kim, Ce Zheng, Laszlo A. Jeni, Ming-Hsuan Yang, Youngjae Yu, Seonjoo Kim
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2510.13044
Pdf URL: https://arxiv.org/pdf/2510.13044
Copy Paste: [[2510.13044]] SceneAdapt: Scene-aware Adaptation of Human Motion Diffusion(https://arxiv.org/abs/2510.13044)
Keywords: generation
Abstract: Human motion is inherently diverse and semantically rich, while also shaped by the surrounding scene. However, existing motion generation approaches address either motion semantics or scene-awareness in isolation, since constructing large-scale datasets with both rich text-motion coverage and precise scene interactions is extremely challenging. In this work, we introduce SceneAdapt, a framework that injects scene awareness into text-conditioned motion models by leveraging disjoint scene-motion and text-motion datasets through two adaptation stages: inbetweening and scene-aware inbetweening. The key idea is to use motion inbetweening, learnable without text, as a proxy task to bridge two distinct datasets and thereby inject scene-awareness into text-to-motion models. In the first stage, we introduce keyframing layers that modulate motion latents for inbetweening while preserving the latent manifold. In the second stage, we add a scene-conditioning layer that injects scene geometry by adaptively querying local context through cross-attention. Experimental results show that SceneAdapt effectively injects scene awareness into text-to-motion models, and we further analyze the mechanisms through which this awareness emerges. Code and models will be released.
摘要：人体运动本质上是多样的且语义丰富的，同时也受到周围场景的影响。然而，现有的运动生成方法要么单独解决运动语义，要么单独解决场景感知，因为构建具有丰富文本运动覆盖和精确场景交互的大规模数据集极具挑战性。在这项工作中，我们介绍了 SceneAdapt，这是一个框架，它通过两个适应阶段（中间阶段和场景感知中间阶段）利用不相交的场景运动和文本运动数据集，将场景感知注入到文本条件运动模型中。关键思想是使用不需要文本即可学习的运动作为代理任务来桥接两个不同的数据集，从而将场景感知注入到文本到运动模型中。在第一阶段，我们引入了关键帧层，它可以在保留潜在流形的同时调节中间的运动潜在值。在第二阶段，我们添加一个场景调节层，通过交叉注意力自适应查询局部上下文来注入场景几何形状。实验结果表明，SceneAdapt 有效地将场景感知注入到文本到运动模型中，我们进一步分析了这种感知出现的机制。代码和模型将被发布。

Title: NeuroRVQ: Multi-Scale EEG Tokenization for Generative Large Brainwave Models

Authors: Konstantinos Barmpas, Na Lee, Alexandros Koliousis, Yannis Panagakis, Dimitrios A. Adamos, Nikolaos Laskaris, Stefanos Zafeiriou
Subjects: cs.LG, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2510.13068
Pdf URL: https://arxiv.org/pdf/2510.13068
Copy Paste: [[2510.13068]] NeuroRVQ: Multi-Scale EEG Tokenization for Generative Large Brainwave Models(https://arxiv.org/abs/2510.13068)
Keywords: generative
Abstract: Electroencephalography (EEG) captures neural activity across multiple temporal and spectral scales, yielding signals that are rich but complex for representation learning. Recently, EEG foundation models trained to predict masked signal-tokens have shown promise for learning generalizable representations. However, their performance is hindered by their signal tokenization modules. Existing neural tokenizers fail to preserve high-frequency dynamics, limiting their ability to reconstruct EEG signals with high fidelity. We introduce NeuroRVQ, a scalable Large Brainwave Model (LBM) centered on a codebook-based tokenizer. Our tokenizer integrates: (i) multi-scale feature extraction modules that capture the full frequency neural spectrum; (ii) hierarchical residual vector quantization (RVQ) codebooks for high-resolution encoding; and (iii) an EEG signal phase- and amplitude-aware loss function for efficient training. This design enables efficient EEG compression while supporting accurate reconstruction across all frequency bands, leading to robust generative masked modeling. Our empirical results demonstrate that NeuroRVQ achieves lower reconstruction error and outperforms existing LBMs on a variety of downstream tasks. More broadly, the NeuroRVQ tokenizer establishes a strong prior for codebook-based general-purpose brainwave models, enabling advances in neural decoding, generative modeling and multimodal biosignal integration.
摘要：脑电图（EEG）捕获多个时间和频谱尺度的神经活动，产生丰富但复杂的信号用于表示学习。最近，经过训练来预测屏蔽信号标记的脑电图基础模型已经显示出学习可概括表示的希望。然而，它们的性能受到信号标记化模块的阻碍。现有的神经分词器无法保留高频动态，限制了它们以高保真度重建脑电图信号的能力。我们介绍 NeuroRVQ，一种以基于密码本的分词器为中心的可扩展大型脑电波模型 (LBM)。我们的分词器集成了：（i）捕获全频率神经谱的多尺度特征提取模块； (ii) 用于高分辨率编码的分层残差矢量量化（RVQ）码本； (iii) 用于高效训练的脑电图信号相位和幅度感知损失函数。该设计可实现高效的脑电图压缩，同时支持所有频段的精确重建，从而实现稳健的生成掩模建模。我们的实证结果表明，NeuroRVQ 实现了较低的重建误差，并且在各种下游任务上优于现有的 LBM。更广泛地说，NeuroRVQ 标记器为基于密码本的通用脑电波模型建立了强大的先验，促进了神经解码、生成建模和多模态生物信号集成方面的进步。
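
Illustrative note on the NeuroRVQ entry: the abstract mentions residual vector quantization (RVQ) codebooks. As a rough sketch of the generic RVQ idea only (not the authors' tokenizer), each stage quantizes the residual left by the previous stage, so later codebooks encode progressively finer detail. The toy codebooks and the function names `rvq_encode` and `rvq_decode` are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def nearest(codebook: Sequence[Vector], x: Vector) -> int:
    """Return the index of the codebook entry closest to x (squared L2)."""
    def sq_dist(c: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(c, x))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def rvq_encode(x: Vector, codebooks: Sequence[Sequence[Vector]]) -> Tuple[List[int], Vector]:
    """Quantize x stage by stage; each stage encodes the remaining residual."""
    residual = list(x)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # Subtract the chosen entry so the next stage sees what is left over.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes, residual


def rvq_decode(codes: Sequence[int], codebooks: Sequence[Sequence[Vector]]) -> Vector:
    """Reconstruct by summing the selected entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical toy codebooks: a coarse stage followed by a fine stage.
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [-0.25, -0.25]],
]

if __name__ == "__main__":
    codes, residual = rvq_encode([1.2, 1.0], CODEBOOKS)
    print(codes, rvq_decode(codes, CODEBOOKS), residual)
```

In a learned tokenizer the codebooks would be trained rather than fixed, and the inputs would be feature vectors rather than raw 2-D points; the stage-wise residual loop is the part this sketch is meant to show.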

Title: Counting Hallucinations in Diffusion Models

Authors: Shuai Fu, Jian Zhou, Qi Chen, Huang Jing, Huy Anh Nguyen, Xiaohan Liu, Zhixiong Zeng, Lin Ma, Quanshi Zhang, Qi Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.13080
Pdf URL: https://arxiv.org/pdf/2510.13080
Copy Paste: [[2510.13080]] Counting Hallucinations in Diffusion Models(https://arxiv.org/abs/2510.13080)
Keywords: generation, generative
Abstract: Diffusion probabilistic models (DPMs) have demonstrated remarkable progress in generative tasks, such as image and video synthesis. However, they still often produce hallucinated samples (hallucinations) that conflict with real-world knowledge, such as generating an implausible duplicate cup floating beside another cup. Despite their prevalence, the lack of feasible methodologies for systematically quantifying such hallucinations hinders progress in addressing this challenge and obscures potential pathways for designing next-generation generative models under factual constraints. In this work, we bridge this gap by focusing on a specific form of hallucination, which we term counting hallucination, referring to the generation of an incorrect number of instances or structured objects, such as a hand image with six fingers, despite such patterns being absent from the training data. To this end, we construct a dataset suite CountHalluSet, with well-defined counting criteria, comprising ToyShape, SimObject, and RealHand. Using these datasets, we develop a standardized evaluation protocol for quantifying counting hallucinations, and systematically examine how different sampling conditions in DPMs, including solver type, ODE solver order, sampling steps, and initial noise, affect counting hallucination levels. Furthermore, we analyze their correlation with common evaluation metrics such as FID, revealing that this widely used image quality metric fails to capture counting hallucinations consistently. This work aims to take the first step toward systematically quantifying hallucinations in diffusion models and offer new insights into the investigation of hallucination phenomena in image generation.
摘要：扩散概率模型（DPM）在图像和视频合成等生成任务中表现出了显着的进步。然而，他们仍然经常产生与现实世界知识相冲突的幻觉样本（幻觉），例如生成一个难以置信的复制杯子漂浮在另一个杯子旁边。尽管它们很普遍，但缺乏系统量化此类幻觉的可行方法阻碍了解决这一挑战的进展，并掩盖了在事实限制下设计下一代生成模型的潜在途径。在这项工作中，我们通过关注一种特定形式的幻觉来弥补这一差距，我们将其称为计数幻觉，指的是生成错误数量的实例或结构化对象，例如具有六个手指的手部图像，尽管训练数据中不存在这种模式。为此，我们构建了一个数据集套件 CountHalluSet，具有明确定义的计数标准，包括 ToyShape、SimObject 和 RealHand。使用这些数据集，我们开发了一个标准化的评估协议来量化计数幻觉，并系统地检查 DPM 中的不同采样条件（包括求解器类型、ODE 求解器顺序、采样步骤和初始噪声）如何影响计数幻觉水平。此外，我们分析了它们与 FID 等常见评估指标的相关性，结果表明这种广泛使用的图像质量指标无法一致地捕获计数幻觉。这项工作旨在朝着系统量化扩散模型中的幻觉迈出第一步，并为图像生成中幻觉现象的研究提供新的见解。

Title: VPREG: An Optimal Control Formulation for Diffeomorphic Image Registration Based on the Variational Principle Grid Generation Method

Authors: Zicong Zhou, Baihan Zhao, Andreas Mang, Guojun Liao
Subjects: cs.CV, math.OC
Abstract URL: https://arxiv.org/abs/2510.13109
Pdf URL: https://arxiv.org/pdf/2510.13109
Copy Paste: [[2510.13109]] VPREG: An Optimal Control Formulation for Diffeomorphic Image Registration Based on the Variational Principle Grid Generation Method(https://arxiv.org/abs/2510.13109)
Keywords: generation
Abstract: This paper introduces VPreg, a novel diffeomorphic image registration method. This work provides several improvements to our past work on mesh generation and diffeomorphic image registration. VPreg aims to achieve excellent registration accuracy while controlling the quality of the registration transformations. It ensures a positive Jacobian determinant of the spatial transformation and provides an accurate approximation of the inverse of the registration, a crucial property for many neuroimaging workflows. Unlike conventional methods, VPreg generates this inverse transformation within the group of diffeomorphisms rather than operating on the image space. The core of VPreg is a grid generation approach, referred to as Variational Principle (VP), which constructs non-folding grids with prescribed Jacobian determinant and curl. These VP-generated grids guarantee diffeomorphic spatial transformations essential for computational anatomy and morphometry, and provide a more accurate inverse than existing methods. To assess the potential of the proposed approach, we conduct a performance analysis for 150 registrations of brain scans from the OASIS-1 dataset. Performance evaluation based on Dice scores for 35 regions of interest, along with an empirical analysis of the properties of the computed spatial transformations, demonstrates that VPreg outperforms state-of-the-art methods in terms of Dice scores, regularity properties of the computed transformation, and accuracy and consistency of the provided inverse map. We compare our results to ANTs-SyN, Freesurfer-Easyreg, and FSL-Fnirt.
摘要：本文介绍了一种新颖的微分同胚图像配准方法VPreg。这项工作对我们过去在网格生成和微分同胚图像配准方面的工作提供了一些改进。 VPreg 旨在实现出色的配准精度，同时控制配准转换的质量。它确保了空间变换的正雅可比行列式，并提供了配准倒数的精确近似，这是许多神经成像工作流程的关键属性。与传统方法不同，VPreg 在微分同胚组内生成这种逆变换，而不是在图像空间上进行操作。 VPreg 的核心是网格生成方法，称为\emph{变分原理}（VP），它构造具有规定的雅可比行列式和旋度的非折叠网格。这些 VP 生成的网格保证了计算解剖学和形态测量所必需的微分同胚空间变换，并提供了比现有方法更准确的反演。为了评估所提出方法的潜力，我们对 OASIS-1 数据集中的 150 个脑部扫描注册进行了性能分析。基于 35 个感兴趣区域的 Dice 分数的性能评估，以及对计算空间变换属性的实证分析，表明 VPreg 在 Dice 分数、计算变换的规律性属性以及所提供的逆映射的准确性和一致性方面优于最先进的方法。我们将我们的结果与 ANTs-SyN、Freesurfer-Easyreg 和 FSL-Fnirt 进行比较。
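
Illustrative note on the VPreg entry: the evaluation above relies on Dice scores over regions of interest. A minimal sketch of the standard Dice coefficient for two binary masks follows, assuming the masks are flattened to equal-length sequences of 0/1 values; the function name `dice_score` and the convention of returning 1.0 for two empty masks are choices made here, not details taken from the paper.

```python
from typing import Sequence


def dice_score(mask_a: Sequence[int], mask_b: Sequence[int]) -> float:
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for two flattened binary masks."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same number of voxels")
    overlap = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    total = sum(1 for a in mask_a if a) + sum(1 for b in mask_b if b)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * overlap / total


if __name__ == "__main__":
    warped_label = [1, 1, 0, 0]     # hypothetical region after registration
    reference_label = [1, 0, 1, 0]  # hypothetical reference region
    print(dice_score(warped_label, reference_label))
```

A per-region evaluation like the one described would apply this once per labeled region and then aggregate; how the paper aggregates across its 35 regions is not stated in the abstract.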

Title: On the Reasoning Abilities of Masked Diffusion Language Models

Authors: Anej Svete, Ashish Sabharwal
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2510.13117
Pdf URL: https://arxiv.org/pdf/2510.13117
Copy Paste: [[2510.13117]] On the Reasoning Abilities of Masked Diffusion Language Models(https://arxiv.org/abs/2510.13117)
Keywords: generation
Abstract: Masked diffusion models (MDMs) for text offer a compelling alternative to traditional autoregressive language models. Parallel generation makes them efficient, but their computational capabilities and the limitations inherent to their parallelism remain largely unexplored. To this end, we characterize what types of reasoning problems MDMs can provably solve and how efficiently. We do this by connecting MDMs to the well-understood reasoning frameworks of chain of thought (CoT) and padded looped transformers (PLTs) in the finite-precision log-width setting: We show that MDMs and polynomially-padded PLTs are, in fact, equivalent in this setting, and that MDMs can solve all problems that CoT-augmented transformers can. Moreover, we showcase classes of problems (including regular languages) for which MDMs are inherently more efficient than CoT transformers, where parallel generation allows for substantially faster reasoning.
摘要：文本的掩蔽扩散模型 (MDM) 为传统自回归语言模型提供了一种引人注目的替代方案。并行生成使它们变得高效，但它们的计算能力和并行性固有的局限性在很大程度上仍未得到探索。为此，我们描述了 MDM 可以证明解决哪些类型的推理问题以及解决效率如何。我们通过在有限精度对数宽度设置中将 MDM 连接到易于理解的思想链 (CoT) 推理框架和填充循环变压器 (PLT) 来实现这一点：我们表明，MDM 和多项式填充 PLT 实际上在这种设置中是等效的，并且 MDM 可以解决 CoT 增强变压器可以解决的所有问题。此外，我们还展示了 MDM 本质上比 CoT 转换器更高效的问题类别（包括常规语言），其中并行生成允许更快的推理速度。

Title: MimicParts: Part-aware Style Injection for Speech-Driven 3D Motion Generation

Authors: Lianlian Liu, YongKang He, Zhaojie Chu, Xiaofen Xing, Xiangmin Xu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2510.13208
Pdf URL: https://arxiv.org/pdf/2510.13208
Copy Paste: [[2510.13208]] MimicParts: Part-aware Style Injection for Speech-Driven 3D Motion Generation(https://arxiv.org/abs/2510.13208)
Keywords: generation
Abstract: Generating stylized 3D human motion from speech signals presents substantial challenges, primarily due to the intricate and fine-grained relationships among speech signals, individual styles, and the corresponding body movements. Current style encoding approaches either oversimplify stylistic diversity or ignore regional motion style differences (e.g., upper vs. lower body), limiting motion realism. Additionally, motion style should dynamically adapt to changes in speech rhythm and emotion, but existing methods often overlook this. To address these issues, we propose MimicParts, a novel framework designed to enhance stylized motion generation based on part-aware style injection and a part-aware denoising network. It divides the body into different regions to encode localized motion styles, enabling the model to capture fine-grained regional differences. Furthermore, our part-aware attention block allows rhythm and emotion cues to guide each body region precisely, ensuring that the generated motion aligns with variations in speech rhythm and emotional state. Experimental results show that our method outperforms existing methods, producing natural and expressive 3D human motion sequences.
摘要：从语音信号生成风格化的 3D 人体动作面临着巨大的挑战，这主要是由于语音信号、个人风格和相应的身体动作之间复杂且细粒度的关系。当前的风格编码方法要么过度简化风格多样性，要么忽略区域运动风格差异（例如，上半身与下半身），从而限制了运动真实感。此外，动作风格应该动态适应语音节奏和情感的变化，但现有的方法经常忽视这一点。为了解决这些问题，我们提出了 MimicParts，这是一种新颖的框架，旨在增强基于部分感知样式注入和部分感知去噪网络的风格化运动生成。它将身体划分为不同的区域来编码局部运动风格，使模型能够捕捉细粒度的区域差异。此外，我们的部分感知注意力块允许节奏和情绪线索精确引导每个身体区域，确保生成的运动与语音节奏和情绪状态的变化保持一致。实验结果表明，我们的方法优于现有方法，展示了自然性和富有表现力的 3D 人体运动序列。

Title: Prompt-based Adaptation in Large-scale Vision Models: A Survey

Authors: Xi Xiao, Yunbei Zhang, Lin Zhao, Yiyang Liu, Xiaoying Liao, Zheda Mai, Xingjian Li, Xiao Wang, Hao Xu, Jihun Hamm, Xue Lin, Min Xu, Qifan Wang, Tianyang Wang, Cheng Han
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.13219
Pdf URL: https://arxiv.org/pdf/2510.13219
Copy Paste: [[2510.13219]] Prompt-based Adaptation in Large-scale Vision Models: A Survey(https://arxiv.org/abs/2510.13219)
Keywords: generative
Abstract: In computer vision, Visual Prompting (VP) and Visual Prompt Tuning (VPT) have recently emerged as lightweight and effective alternatives to full fine-tuning for adapting large-scale vision models within the "pretrain-then-finetune" paradigm. However, despite rapid progress, their conceptual boundaries remain blurred, as VP and VPT are frequently used interchangeably in current research, reflecting a lack of systematic distinction between these techniques and their respective applications. In this survey, we revisit the designs of VP and VPT from first principles, and conceptualize them within a unified framework termed Prompt-based Adaptation (PA). We provide a taxonomy that categorizes existing methods into learnable, generative, and non-learnable prompts, and further organizes them by injection granularity (pixel-level and token-level). Beyond the core methodologies, we examine PA's integrations across diverse domains, including medical imaging, 3D point clouds, and vision-language tasks, as well as its role in test-time adaptation and trustworthy AI. We also summarize current benchmarks and identify key challenges and future directions. To the best of our knowledge, this is the first comprehensive survey dedicated to PA's methodologies and applications in light of their distinct characteristics. Our survey aims to provide a clear roadmap for researchers and practitioners in all areas to understand and explore the evolving landscape of PA-related research.
摘要：在计算机视觉领域，视觉提示（VP）和视觉提示调整（VPT）最近已成为全面微调的轻量级且有效的替代方案，用于在“预训练然后微调”范式中适应大规模视觉模型。然而，尽管进展迅速，但它们的概念界限仍然模糊，因为 VP 和 VPT 在当前的研究中经常互换使用，反映出这些技术及其各自的应用之间缺乏系统的区别。 In this survey, we revisit the designs of VP and VPT from first principles, and conceptualize them within a unified framework termed Prompt-based Adaptation (PA).我们提供了一种分类法，将现有方法分为可学习的、生成的和不可学习的提示，并通过注入粒度（像素级和令牌级）进一步组织它们。除了核心方法之外，我们还研究了 PA 跨不同领域的集成，包括医学成像、3D 点云和视觉语言任务，以及它在测试时适应和值得信赖的 AI 中的作用。我们还总结了当前的基准并确定了主要挑战和未来方向。据我们所知，我们是第一个针对 PA 方法和应用的独特特征的综合调查。我们的调查旨在为所有领域的研究人员和从业者提供清晰的路线图，以了解和探索 PA 相关研究的不断发展的前景。

Title: CymbaDiff: Structured Spatial Diffusion for Sketch-based 3D Semantic Urban Scene Generation

Authors: Li Liang, Bo Miao, Xinyu Wang, Naveed Akhtar, Jordan Vice, Ajmal Mian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.13245
Pdf URL: https://arxiv.org/pdf/2510.13245
Copy Paste: [[2510.13245]] CymbaDiff: Structured Spatial Diffusion for Sketch-based 3D Semantic Urban Scene Generation(https://arxiv.org/abs/2510.13245)
Keywords: generation
Abstract: Outdoor 3D semantic scene generation produces realistic and semantically rich environments for applications such as urban simulation and autonomous driving. However, advances in this direction are constrained by the absence of publicly available, well-annotated datasets. We introduce SketchSem3D, the first large-scale benchmark for generating 3D outdoor semantic scenes from abstract freehand sketches and pseudo-labeled annotations of satellite images. SketchSem3D includes two subsets, Sketch-based SemanticKITTI and Sketch-based KITTI-360 (containing LiDAR voxels along with their corresponding sketches and annotated satellite images), to enable standardized, rigorous, and diverse evaluations. We also propose Cylinder Mamba Diffusion (CymbaDiff) that significantly enhances spatial coherence in outdoor 3D scene generation. CymbaDiff imposes structured spatial ordering, explicitly captures cylindrical continuity and vertical hierarchy, and preserves both physical neighborhood relationships and global context within the generated scenes. Extensive experiments on SketchSem3D demonstrate that CymbaDiff achieves superior semantic consistency, spatial realism, and cross-dataset generalization. The code and dataset will be available at this https URL
摘要：户外 3D 语义场景生成可为城市模拟和自动驾驶等应用生成逼真且语义丰富的环境。然而，由于缺乏公开可用、注释良好的数据集，这一方向的进展受到限制。我们介绍 SketchSem3D，这是第一个从抽象手绘草图和卫星图像伪标记注释生成 3D 户外语义场景的大型基准。 SketchSem3D 包括两个子集：基于 Sketch 的 SemanticKITTI 和基于 Sketch 的 KITTI-360（包含 LiDAR 体素及其相应的草图和带注释的卫星图像），以实现标准化、严格和多样化的评估。我们还提出了 Cylinder Mamba Diffusion (CymbaDiff)，它可以显着增强室外 3D 场景生成中的空间连贯性。 CymbaDiff 强加结构化空间排序，明确捕获圆柱连续性和垂直层次结构，并在生成的场景中保留物理邻域关系和全局上下文。 SketchSem3D 上的大量实验表明，CymbaDiff 实现了卓越的语义一致性、空间真实性和跨数据集泛化。代码和数据集将在此 https URL 中提供

Title: End-to-End Multi-Modal Diffusion Mamba

Authors: Chunhao Lu, Qiang Lu, Meichen Dong, Jake Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.13253
Pdf URL: https://arxiv.org/pdf/2510.13253
Copy Paste: [[2510.13253]] End-to-End Multi-Modal Diffusion Mamba(https://arxiv.org/abs/2510.13253)
Keywords: generation
Abstract: Current end-to-end multi-modal models utilize different encoders and decoders to process input and output information. This separation hinders the joint representation learning of various modalities. To unify multi-modal processing, we propose a novel architecture called MDM (Multi-modal Diffusion Mamba). MDM utilizes a Mamba-based multi-step selection diffusion model to progressively generate and refine modality-specific information through a unified variational autoencoder for both encoding and decoding. This innovative approach allows MDM to achieve superior performance when processing high-dimensional data, particularly in generating high-resolution images and extended text sequences simultaneously. Our evaluations in areas such as image generation, image captioning, visual question answering, text comprehension, and reasoning tasks demonstrate that MDM significantly outperforms existing end-to-end models (e.g., MonoFormer, LlamaGen, and Chameleon) and competes effectively with SOTA models like GPT-4V, Gemini Pro, and Mistral. Our results validate MDM's effectiveness in unifying multi-modal processes while maintaining computational efficiency, establishing a new direction for end-to-end multi-modal architectures.
摘要：当前的端到端多模态模型利用不同的编码器和解码器来处理输入和输出信息。这种分离阻碍了各种模式的联合表示学习。为了统一多模态处理，我们提出了一种称为 MDM（多模态扩散曼巴）的新颖架构。 MDM 利用基于 Mamba 的多步选择扩散模型，通过用于编码和解码的统一变分自动编码器逐步生成和细化特定于模态的信息。这种创新方法使 MDM 在处理高维数据时能够实现卓越的性能，特别是在同时生成高分辨率图像和扩展文本序列时。我们在图像生成、图像字幕、视觉问答、文本理解和推理任务等领域的评估表明，MDM 显着优于现有的端到端模型（MonoFormer、LlamaGen 和 Chameleon 等），并与 GPT-4V、Gemini Pro 和 Mistral 等 SOTA 模型有效竞争。我们的结果验证了 MDM 在统一多模态流程的同时保持计算效率的有效性，为端到端多模态架构确立了新方向。

Title: Universal Image Restoration Pre-training via Masked Degradation Classification

Authors: JiaKui Hu, Zhengjian Yao, Lujia Jin, Yinghao Chen, Yanye Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.13282
Pdf URL: https://arxiv.org/pdf/2510.13282
Copy Paste: [[2510.13282]] Universal Image Restoration Pre-training via Masked Degradation Classification(https://arxiv.org/abs/2510.13282)
Keywords: restoration
Abstract: This study introduces a Masked Degradation Classification Pre-Training method (MaskDCPT), designed to facilitate the classification of degradation types in input images, leading to comprehensive image restoration pre-training. Unlike conventional pre-training methods, MaskDCPT uses the degradation type of the image as an extremely weak supervision, while simultaneously leveraging the image reconstruction to enhance performance and robustness. MaskDCPT includes an encoder and two decoders: the encoder extracts features from the masked low-quality input image. The classification decoder uses these features to identify the degradation type, whereas the reconstruction decoder aims to reconstruct a corresponding high-quality image. This design allows the pre-training to benefit from both masked image modeling and contrastive learning, resulting in a generalized representation suited for restoration tasks. Benefiting from the straightforward yet potent MaskDCPT, the pre-trained encoder can be used to address universal image restoration and achieve outstanding performance. Implementing MaskDCPT significantly improves performance for both convolutional neural networks (CNNs) and Transformers, with a minimum increase in PSNR of 3.77 dB in the 5D all-in-one restoration task and a 34.8% reduction in PIQE compared to baseline in real-world degradation scenarios. It also exhibits strong generalization to previously unseen degradation types and levels. In addition, we curate and release the UIR-2.5M dataset, which includes 2.5 million paired restoration samples across 19 degradation types and over 200 degradation levels, incorporating both synthetic and real-world data. The dataset, source code, and models are available at this https URL.
摘要：本研究引入了一种掩模退化分类预训练方法（MaskDCPT），旨在促进输入图像中退化类型的分类，从而实现全面的图像恢复预训练。与传统的预训练方法不同，MaskDCPT使用图像的退化类型作为极弱的监督，同时利用图像重建来增强性能和鲁棒性。 MaskDCPT包括一个编码器和两个解码器：编码器从屏蔽的低质量输入图像中提取特征。分类解码器使用这些特征来识别退化类型，而重建解码器旨在重建相应的高质量图像。这种设计允许预训练受益于掩模图像建模和对比学习，从而产生适合恢复任务的通用表示。受益于简单而强大的 MaskDCPT，预训练编码器可用于解决通用图像恢复问题并实现出色的性能。实施 MaskDCPT 显着提高了卷积神经网络 (CNN) 和 Transformer 的性能，在 5D 一体式恢复任务中 PSNR 最小增加了 3.77 dB，与现实退化场景中的基线相比，PIQE 降低了 34.8%。它还对以前未见过的退化类型和水平产生了强烈的概括。此外，我们还策划并发布了 UIR-2.5M 数据集，其中包括 19 种退化类型和 200 多个退化级别的 250 万个配对恢复样本，包含合成数据和真实数据。数据集、源代码和模型可从此 https URL 获取。
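
For readers less familiar with the PSNR figure quoted above, a minimal sketch of the standard definition follows. It is not taken from the MaskDCPT code; the pixel values are hypothetical and the function name `psnr` is chosen here for illustration. Because PSNR is logarithmic in mean squared error, a gain of 3.77 dB corresponds to roughly a 2.4-fold reduction in MSE.

```python
import math

def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    mse = sum((r - x) ** 2 for r, x in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images have unbounded PSNR
    return 10.0 * math.log10(max_value ** 2 / mse)

# Hypothetical 4-pixel grayscale images (illustrative values only).
clean = [10, 200, 30, 120]
baseline = [18, 190, 40, 110]   # larger restoration error
improved = [14, 195, 35, 115]   # half the per-pixel error of the baseline

gain_db = psnr(clean, improved) - psnr(clean, baseline)
print(f"PSNR gain: {gain_db:.2f} dB")
```

Halving every pixel error divides the MSE by four, so the gain printed here equals 10·log10(4) dB.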

Title: Federated Conditional Conformal Prediction via Generative Models

Authors: Rui Xu, Sihong Xie
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2510.13297
Pdf URL: https://arxiv.org/pdf/2510.13297
Copy Paste: [[2510.13297]] Federated Conditional Conformal Prediction via Generative Models(https://arxiv.org/abs/2510.13297)
Keywords: generative
Abstract: Conformal Prediction (CP) provides distribution-free uncertainty quantification by constructing prediction sets that guarantee coverage of the true labels. This reliability makes CP valuable for high-stakes federated learning scenarios such as multi-center healthcare. However, standard CP assumes i.i.d. data, which is violated in federated settings where client distributions differ substantially. Existing federated CP methods address this by maintaining marginal coverage on each client, but such guarantees often fail to reflect input-conditional uncertainty. In this work, we propose Federated Conditional Conformal Prediction (Fed-CCP) via generative models, which aims for conditional coverage that adapts to local data heterogeneity. Fed-CCP leverages generative models, such as normalizing flows or diffusion models, to approximate conditional data distributions without requiring the sharing of raw data. This enables each client to locally calibrate conformal scores that reflect its unique uncertainty, while preserving global consistency through federated aggregation. Experiments on real datasets demonstrate that Fed-CCP achieves more adaptive prediction sets.
摘要：保形预测 (CP) 通过构建保证覆盖真实标签的预测集来提供无分布的不确定性量化。这种可靠性使得 CP 对于多中心医疗保健等高风险联合学习场景很有价值。然而，标准 CP 假设独立同分布。数据，这在客户端分布差异很大的联合设置中受到侵犯。现有的联合 CP 方法通过维持每个客户端的边际覆盖来解决这个问题，但这种保证通常无法反映输入条件的不确定性。在这项工作中，我们通过生成模型提出了联合条件共形预测（Fed-CCP），其目标是适应本地数据异质性的条件覆盖。 Fed-CCP 利用生成模型（例如标准化流量或扩散模型）来近似条件数据分布，而无需共享原始数据。这使得每个客户端能够在本地校准反映其独特不确定性的共形分数，同时通过联合聚合保持全局一致性。对真实数据集的实验表明，Fed-CCP 实现了更具适应性的预测集。
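
The coverage guarantee mentioned in this abstract can be illustrated with plain split conformal prediction on a single client. This is a minimal sketch of the standard technique only, not of Fed-CCP: the federated aggregation and the generative-model component are omitted, and the calibration values and function names are hypothetical.

```python
import math

def conformal_quantile(scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the order statistic to use
    if k > n:
        return math.inf  # too few calibration points: every label must be kept
    return sorted(scores)[k - 1]

def prediction_set(probs, qhat):
    """Labels whose nonconformity score 1 - p(label) does not exceed qhat."""
    return [label for label, p in enumerate(probs) if 1.0 - p <= qhat]

# Hypothetical calibration data: probability the model gave to the true label.
cal_true_probs = [0.9, 0.8, 0.75, 0.6, 0.95, 0.7, 0.85, 0.5]
cal_scores = [1.0 - p for p in cal_true_probs]

qhat = conformal_quantile(cal_scores, alpha=0.2)
print(prediction_set([0.55, 0.40, 0.05], qhat))
```

Under exchangeability of calibration and test points, sets built this way contain the true label with probability at least 1 - alpha on average (marginal coverage). The abstract's point is that this marginal guarantee need not hold conditionally on the input.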

Title: Km-scale dynamical downscaling through conformalized latent diffusion models

Authors: Alessandro Brusaferri, Andrea Ballarino
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2510.13301
Pdf URL: https://arxiv.org/pdf/2510.13301
Copy Paste: [[2510.13301]] Km-scale dynamical downscaling through conformalized latent diffusion models(https://arxiv.org/abs/2510.13301)
Keywords: generative
Abstract: Dynamical downscaling is crucial for deriving high-resolution meteorological fields from coarse-scale simulations, enabling detailed analysis for critical applications such as weather forecasting and renewable energy modeling. Generative Diffusion models (DMs) have recently emerged as powerful data-driven tools for this task, offering reconstruction fidelity and more scalable sampling supporting uncertainty quantification. However, DMs lack finite-sample guarantees against overconfident predictions, resulting in miscalibrated grid-point-level uncertainty estimates hindering their reliability in operational contexts. In this work, we tackle this issue by augmenting the downscaling pipeline with a conformal prediction framework. Specifically, the DM's samples are post-processed to derive conditional quantile estimates, incorporated into a conformalized quantile regression procedure targeting locally adaptive prediction intervals with finite-sample marginal validity. The proposed approach is evaluated on ERA5 reanalysis data over Italy, downscaled to a 2-km grid. Results demonstrate grid-point-level uncertainty estimates with markedly improved coverage and stable probabilistic scores relative to the DM baseline, highlighting the potential of conformalized generative models for more trustworthy probabilistic downscaling to high-resolution meteorological fields.
摘要：动态降尺度对于从粗尺度模拟中导出高分辨率气象场至关重要，从而能够对天气预报和可再生能源建模等关键应用进行详细分析。生成扩散模型 (DM) 最近已成为执行此任务的强大数据驱动工具，提供重建保真度和支持不确定性量化的更可扩展的采样。然而，DM 缺乏针对过度自信预测的有限样本保证，导致错误校准的网格点级不确定性估计阻碍了其在操作环境中的可靠性。在这项工作中，我们通过使用共形预测框架增强缩小管道来解决这个问题。具体来说，DM 的样本经过后处理以得出条件分位数估计，并将其纳入针对具有有限样本边际有效性的局部自适应预测区间的保形分位数回归程序中。所提出的方法在意大利的 ERA5 再分析数据上进行了评估，缩小到 2 公里网格。结果表明，相对于 DM 基线，网格点级不确定性估计具有显着改善的覆盖范围和稳定的概率分数，凸显了保形生成模型在更可靠的概率降尺度到高分辨率气象场方面的潜力。
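
The conformalized quantile regression step described above can be sketched as follows. This is a generic illustration of the standard CQR procedure for a single grid point, assuming lower and upper quantile estimates are already available (in the paper they are derived from diffusion model samples, which is not reproduced here). All numbers and function names are hypothetical.

```python
import math

def cqr_margin(cal_lo, cal_hi, cal_y, alpha):
    """Conformal correction: how far calibration targets fall outside [lo, hi]."""
    scores = sorted(max(lo - y, y - hi) for lo, hi, y in zip(cal_lo, cal_hi, cal_y))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # finite-sample-corrected rank
    return scores[k - 1] if k <= n else math.inf

def cqr_interval(lo, hi, margin):
    """Widen (or shrink, if margin is negative) the raw quantile interval."""
    return lo - margin, hi + margin

# Hypothetical calibration set: raw quantile estimates and observed values.
cal_lo = [1.0, 2.0, 3.0, 4.0]
cal_hi = [3.0, 4.0, 5.0, 6.0]
cal_y = [2.0, 4.5, 2.5, 5.0]

margin = cqr_margin(cal_lo, cal_hi, cal_y, alpha=0.25)
print(cqr_interval(10.0, 12.0, margin))
```

A positive margin indicates that the raw quantiles were overconfident on the calibration data, which is the failure mode the abstract attributes to the uncalibrated diffusion baseline.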

Title: Self-Augmented Visual Contrastive Decoding

Authors: Eun Woo Im, Muhammad Kashif Ali, Vivek Gupta
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2510.13315
Pdf URL: https://arxiv.org/pdf/2510.13315
Copy Paste: [[2510.13315]] Self-Augmented Visual Contrastive Decoding(https://arxiv.org/abs/2510.13315)
Keywords: generation
Abstract: Large Vision-Language Models (LVLMs) have demonstrated remarkable multimodal capabilities, but they inherit the tendency to hallucinate from their underlying language models. While visual contrastive decoding has been proposed to mitigate this issue, existing methods often apply generic visual augmentations that disregard the specific context provided by the text query, limiting their effectiveness. This study introduces a novel training-free decoding strategy that addresses these limitations, featuring two key contributions. The first is a self-augmentation prompting strategy that leverages the intrinsic knowledge of the model to dynamically align semantics between the query and the visual augmentation. The second is an adaptive thresholding algorithm that adjusts the next-token candidate size based on the output sparsity, utilizing full information from the logit distribution. Extensive experiments across four LVLMs and seven benchmarks demonstrate that the proposed decoding significantly enhances factual consistency compared to state-of-the-art decoding methods. This work highlights the importance of integrating query-dependent augmentation and entropy-aware decoding for improving effective generation of LVLMs.
摘要：大视觉语言模型（LVLM）已经表现出了卓越的多模态能力，但它们继承了底层语言模型的幻觉倾向。虽然已经提出视觉对比解码来缓解这个问题，但现有方法通常应用通用视觉增强，而忽略文本查询提供的特定上下文，从而限制了它们的有效性。本研究引入了一种新颖的免训练解码策略，可以解决这些限制，具有两个关键贡献。首先，自我增强提示策略利用模型的内在知识来动态调整查询和视觉增强之间的语义。其次，自适应阈值算法，利用 Logit 分布的完整信息，根据输出稀疏性自适应调整下一个标记候选大小。跨越四个 LVLM 和七个基准的大量实验表明，与最先进的解码方法相比，所提出的解码显着增强了事实一致性。这项工作强调了集成查询相关增强和熵感知解码对于提高 LVLM 的有效生成的重要性。
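
As background for the decoding strategy summarized above, here is a minimal sketch of generic visual contrastive decoding with a fixed plausibility cutoff. It is an assumption-laden illustration: the abstract does not specify the paper's adaptive thresholding rule or its self-augmentation prompt, so the fixed `beta`, the `alpha` weight, the function names, and the toy logits are all hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def contrastive_next_token(logits_clean, logits_aug, alpha=1.0, beta=0.1):
    """Pick the token that the clean image supports more than the augmented one.

    Only tokens whose clean-image probability is at least beta times the top
    probability are eligible (a fixed cutoff; the paper adapts this instead).
    """
    p_clean = softmax(logits_clean)
    cutoff = beta * max(p_clean)
    best_token, best_score = None, -math.inf
    for tok, (lc, la) in enumerate(zip(logits_clean, logits_aug)):
        if p_clean[tok] < cutoff:
            continue  # implausible under the clean input: never a candidate
        score = (1 + alpha) * lc - alpha * la  # contrast clean vs. augmented
        if score > best_score:
            best_token, best_score = tok, score
    return best_token

# Toy vocabulary of three tokens. Token 0 wins greedily on the clean logits,
# but it is also favoured by the augmented view, so the contrast demotes it.
print(contrastive_next_token([2.0, 1.8, -3.0], [2.5, 0.5, 4.0]))
```

The cutoff matters because the contrastive score alone can promote tokens that were very unlikely to begin with. Setting `beta=0.0` disables that safeguard.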

Title: No-Reference Rendered Video Quality Assessment: Dataset and Metrics

Authors: Sipeng Yang, Jiayu Ji, Qingchuan Zhu, Zhiyao Yang, Xiaogang Jin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.13349
Pdf URL: https://arxiv.org/pdf/2510.13349
Copy Paste: [[2510.13349]] No-Reference Rendered Video Quality Assessment: Dataset and Metrics(https://arxiv.org/abs/2510.13349)
Keywords: generation, quality assessment
Abstract: Quality assessment of videos is crucial for many computer graphics applications, including video games, virtual reality, and augmented reality, where visual performance has a significant impact on user experience. When test videos cannot be perfectly aligned with references or when references are unavailable, the significance of no-reference video quality assessment (NR-VQA) methods is undeniable. However, existing NR-VQA datasets and metrics are primarily focused on camera-captured videos; applying them directly to rendered videos would result in biased predictions, as rendered videos are more prone to temporal artifacts. To address this, we present a large rendering-oriented video dataset with subjective quality annotations, as well as a designed NR-VQA metric specific to rendered videos. The proposed dataset includes a wide range of 3D scenes and rendering settings, with quality scores annotated for various display types to better reflect real-world application scenarios. Building on this dataset, we calibrate our NR-VQA metric to assess rendered video quality by looking at both image quality and temporal stability. We compare our metric to existing NR-VQA metrics, demonstrating its superior performance on rendered videos. Finally, we demonstrate that our metric can be used to benchmark supersampling methods and assess frame generation strategies in real-time rendering.
摘要：视频质量评估对于许多计算机图形应用至关重要，包括视频游戏、虚拟现实和增强现实，其中视觉性能对用户体验有重大影响。当测试视频无法与参考完美对齐或参考不可用时，无参考视频质量评估（NR-VQA）方法的重要性是不可否认的。然而，现有的 NR-VQA 数据集和指标主要集中在摄像机捕获的视频上；将它们直接应用于渲染视频会导致预测有偏差，因为渲染视频更容易出现时间伪影。为了解决这个问题，我们提出了一个带有主观质量注释的大型面向渲染的视频数据集，以及专门针对渲染视频设计的 NR-VQA 指标。所提出的数据集包括广泛的 3D 场景和渲染设置，并针对各种显示类型注释了质量分数，以更好地反映真实世界的应用场景。在此数据集的基础上，我们校准 NR-VQA 指标，通过查看图像质量和时间稳定性来评估渲染视频质量。我们将我们的指标与现有的 NR-VQA 指标进行比较，证明其在渲染视频上的卓越性能。最后，我们证明我们的指标可用于对超级采样方法进行基准测试并评估实时渲染中的帧生成策略。

Title: Assessing the robustness of heterogeneous treatment effects in survival analysis under informative censoring

Authors: Yuxin Wang, Dennis Frauen, Jonas Schweisthal, Maresa Schröder, Stefan Feuerriegel
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2510.13397
Pdf URL: https://arxiv.org/pdf/2510.13397
Copy Paste: [[2510.13397]] Assessing the robustness of heterogeneous treatment effects in survival analysis under informative censoring(https://arxiv.org/abs/2510.13397)
Keywords: generation
Abstract: Dropout is common in clinical studies, with up to half of patients leaving early due to side effects or other reasons. When dropout is informative (i.e., dependent on survival time), it introduces censoring bias, because of which treatment effect estimates are also biased. In this paper, we propose an assumption-lean framework to assess the robustness of conditional average treatment effect (CATE) estimates in survival analysis when facing censoring bias. Unlike existing works that rely on strong assumptions, such as non-informative censoring, to obtain point estimation, we use partial identification to derive informative bounds on the CATE. Thereby, our framework helps to identify patient subgroups where treatment is effective despite informative censoring. We further develop a novel meta-learner that estimates the bounds using arbitrary machine learning models and with favorable theoretical properties, including double robustness and quasi-oracle efficiency. We demonstrate the practical value of our meta-learner through numerical experiments and in an application to a cancer drug trial. Together, our framework offers a practical tool for assessing the robustness of estimated treatment effects in the presence of censoring and thus promotes the reliable use of survival data for evidence generation in medicine and epidemiology.
摘要：中途退出在临床研究中很常见，多达一半的患者由于副作用或其他原因提前退出。当退出具有信息性（即取决于生存时间）时，它会引入审查偏差，因此治疗效果估计也会有偏差。在本文中，我们提出了一个假设倾斜框架来评估生存分析中面临审查偏差时条件平均治疗效果（CATE）估计的稳健性。与依赖强假设（例如非信息审查）来获得点估计的现有工作不同，我们使用部分识别来推导 CATE 的信息范围。因此，我们的框架有助于识别尽管进行信息审查但治疗仍有效的患者亚组。我们进一步开发了一种新颖的元学习器，它使用任意机器学习模型来估计边界，并具有良好的理论特性，包括双重鲁棒性和准预言机效率。我们通过数值实验和癌症药物试验的应用证明了元学习器的实用价值。总之，我们的框架提供了一个实用工具，用于评估审查情况下估计治疗效果的稳健性，从而促进可靠地使用生存数据来生成医学和流行病学的证据。

Title: Reinforcement Learning Meets Masked Generative Models: Mask-GRPO for Text-to-Image Generation

Authors: Yifu Luo, Xinhao Hu, Keyu Fan, Haoyuan Sun, Zeyu Chen, Bo Xia, Tiantian Zhang, Yongzhe Chang, Xueqian Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.13418
Pdf URL: https://arxiv.org/pdf/2510.13418
Copy Paste: [[2510.13418]] Reinforcement Learning Meets Masked Generative Models: Mask-GRPO for Text-to-Image Generation(https://arxiv.org/abs/2510.13418)
Keywords: generation, generative
Abstract: Reinforcement learning (RL) has garnered increasing attention in text-to-image (T2I) generation. However, most existing RL approaches are tailored to either diffusion models or autoregressive models, overlooking an important alternative: masked generative models. In this work, we propose Mask-GRPO, the first method to incorporate Group Relative Policy Optimization (GRPO)-based RL into this overlooked paradigm. Our core insight is to redefine the transition probability, which is different from current approaches, and formulate the unmasking process as a multi-step decision-making problem. To further enhance our method, we explore several useful strategies, including removing the KL constraint, applying the reduction strategy, and filtering out low-quality samples. Using Mask-GRPO, we improve a base model, Show-o, with substantial improvements on standard T2I benchmarks and preference alignment, outperforming existing state-of-the-art approaches. The code is available on this https URL
摘要：强化学习（RL）在文本到图像（T2I）生成领域越来越受到关注。然而，大多数现有的强化学习方法都是针对扩散模型或自回归模型量身定制的，而忽略了一个重要的替代方案：掩码生成模型。在这项工作中，我们提出了 Mask-GRPO，这是第一种将基于组相对策略优化 (GRPO) 的 RL 纳入这个被忽视的范式的方法。我们的核心见解是重新定义与当前方法不同的转移概率，并将揭露过程制定为多步骤决策问题。为了进一步增强我们的方法，我们探索了几种有用的策略，包括删除 KL 约束、应用缩减策略和过滤掉低质量样本。使用 Mask-GRPO，我们改进了基本模型 Show-o，对标准 T2I 基准和偏好对齐进行了实质性改进，优于现有的最先进方法。该代码可在此 https URL 上找到
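
GRPO-style methods score each sample relative to the other samples drawn for the same prompt, rather than using a learned value function. A minimal sketch of that group-normalized advantage is shown below. It illustrates the generic GRPO ingredient only; Mask-GRPO's redefined transition probability and its additional strategies are not shown. The reward values are hypothetical, and using the population standard deviation is an assumption.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four images generated from one text prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Samples that beat the group average receive positive advantages and the rest negative ones. If all rewards in a group are equal, every advantage is zero and the group contributes no gradient signal.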

Title: Neural Sum-of-Squares: Certifying the Nonnegativity of Polynomials with Transformers

Authors: Nico Pelleriti, Christoph Spiegel, Shiwei Liu, David Martínez-Rubio, Max Zimmer, Sebastian Pokutta
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2510.13444
Pdf URL: https://arxiv.org/pdf/2510.13444
Copy Paste: [[2510.13444]] Neural Sum-of-Squares: Certifying the Nonnegativity of Polynomials with Transformers(https://arxiv.org/abs/2510.13444)
Keywords: generation
Abstract: Certifying nonnegativity of polynomials is a well-known NP-hard problem with direct applications spanning non-convex optimization, control, robotics, and beyond. A sufficient condition for nonnegativity is the Sum of Squares (SOS) property, i.e., the polynomial can be written as a sum of squares of other polynomials. In practice, however, certifying the SOS criterion remains computationally expensive and often involves solving a Semidefinite Program (SDP), whose dimensionality grows quadratically in the size of the monomial basis of the SOS expression; hence, various methods to reduce the size of the monomial basis have been proposed. In this work, we introduce the first learning-augmented algorithm to certify the SOS criterion. To this end, we train a Transformer model that predicts an almost-minimal monomial basis for a given polynomial, thereby drastically reducing the size of the corresponding SDP. Our overall methodology comprises three key components: efficient training dataset generation of over 100 million SOS polynomials, design and training of the corresponding Transformer architecture, and a systematic fallback mechanism to ensure correct termination, which we analyze theoretically. We validate our approach on over 200 benchmark datasets, achieving speedups of over $100\times$ compared to state-of-the-art solvers and enabling the solution of instances where competing approaches fail. Our findings provide novel insights towards transforming the practical scalability of SOS programming.
摘要：证明多项式的非负性是一个众所周知的 NP 难题，其直接应用涵盖非凸优化、控制、机器人技术等。非负性的充分条件是平方和（SOS）属性，即它可以写成其他多项式的平方和。然而，在实践中，证明 SOS 准则的计算成本仍然很高，并且通常涉及求解半定规划 (SDP)，其维数随 SOS 表达式的单项式基的大小呈二次方增长；因此，人们提出了各种减小单项式基尺寸的方法。在这项工作中，我们引入了第一个学习增强算法来验证 SOS 标准。为此，我们训练了一个 Transformer 模型，该模型可以预测给定多项式的几乎最小单项式基，从而大大减小相应 SDP 的大小。我们的整体方法包括三个关键组成部分：超过 1 亿个 SOS 多项式的高效训练数据集生成、相应 Transformer 架构的设计和训练，以及确保正确终止的系统回退机制（我们对此进行了理论分析）。我们在 200 多个基准数据集上验证了我们的方法，与最先进的求解器相比，实现了超过 100 倍的加速，并能够解决竞争方法失败的实例。我们的研究结果为改变 SOS 编程的实际可扩展性提供了新颖的见解。
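
To make the role of the monomial basis concrete, the sketch below verifies an SOS certificate for a small hand-picked polynomial. It is an illustration under stated assumptions, not the paper's pipeline: the polynomial p(x) = x^2 - 2x + 2, the basis z = [1, x], and the Gram matrix Q are chosen here by hand, whereas in practice Q is found by an SDP solver whose matrix variable has one row and column per basis monomial. A hand-written Cholesky factorization stands in for the positive-semidefiniteness check; it certifies the stricter positive-definite case, which is sufficient.

```python
import math

def cholesky(Q):
    """Return lower-triangular L with Q = L L^T; raise if Q is not positive definite."""
    n = len(Q)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = Q[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0:
                    raise ValueError("matrix is not positive definite")
                L[i][j] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return L

def p(x):
    return x * x - 2 * x + 2

# Gram matrix for the basis z = [1, x]:  z^T Q z = 2 - 2x + x^2 = p(x).
Q = [[2.0, -1.0], [-1.0, 1.0]]
L = cholesky(Q)

def sos_value(x):
    """Evaluate p as an explicit sum of squares, one square per column of L."""
    z = [1.0, x]
    return sum(sum(L[i][k] * z[i] for i in range(2)) ** 2 for k in range(2))

print(all(abs(p(x) - sos_value(x)) < 1e-9 for x in [-2.0, 0.0, 0.5, 3.0]))
```

A basis with m monomials gives an m-by-m Gram matrix, which is why predicting a smaller basis shrinks the SDP quadratically.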

Title: Near-Infrared Hyperspectral Imaging Applications in Food Analysis -- Improving Algorithms and Methodologies

Authors: Ole-Christian Galbo Engstrøm
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2510.13452
Pdf URL: https://arxiv.org/pdf/2510.13452
Copy Paste: [[2510.13452]] Near-Infrared Hyperspectral Imaging Applications in Food Analysis -- Improving Algorithms and Methodologies(https://arxiv.org/abs/2510.13452)
Keywords: generation
Abstract: This thesis investigates the application of near-infrared hyperspectral imaging (NIR-HSI) for food quality analysis. The investigation is conducted through four studies operating with five research hypotheses. For several analyses, the studies compare models based on convolutional neural networks (CNNs) and partial least squares (PLS). Generally, joint spatio-spectral analysis with CNNs outperforms spatial analysis with CNNs and spectral analysis with PLS when modeling parameters where chemical and physical visual information are relevant. When modeling chemical parameters with a 2-dimensional (2D) CNN, augmenting the CNN with an initial layer dedicated to performing spectral convolution enhances its predictive performance by learning a spectral preprocessing similar to that applied by domain experts. Still, PLS-based spectral modeling performs equally well for analysis of the mean content of chemical parameters in samples and is the recommended approach. Modeling the spatial distribution of chemical parameters with NIR-HSI is limited by the ability to obtain spatially resolved reference values. Therefore, a study used bulk mean references for chemical map generation of fat content in pork bellies. A PLS-based approach gave non-smooth chemical maps and pixel-wise predictions outside the range of 0-100%. Conversely, a 2D CNN augmented with a spectral convolution layer mitigated all issues arising with PLS. The final study attempted to model barley's germinative capacity by analyzing NIR spectra, RGB images, and NIR-HSI images. However, the results were inconclusive due to the dataset's low degree of germination. Additionally, this thesis has led to the development of two open-sourced Python packages. The first facilitates fast PLS-based modeling, while the second facilitates very fast cross-validation of PLS and other classical machine learning models with a new algorithm.
摘要：本论文研究近红外高光谱成像（NIR-HSI）在食品质量分析中的应用。这项调查是通过四项研究进行的，涉及五项研究假设。对于多项分析，这些研究比较了基于卷积神经网络 (CNN) 和偏最小二乘法 (PLS) 的模型。一般来说，在对化学和物理视觉信息相关的参数进行建模时，使用 CNN 进行的联合空间光谱分析优于使用 CNN 进行的空间分析和使用 PLS 进行的光谱分析。当使用二维 (2D) CNN 对化学参数进行建模时，使用专用于执行光谱卷积的初始层来增强 CNN，可以通过学习与领域专家所应用的光谱预处理类似的光谱预处理来增强其预测性能。尽管如此，基于 PLS 的光谱建模对于分析样品中化学参数的平均含量同样表现良好，并且是推荐的方法。使用 NIR-HSI 对化学参数的空间分布进行建模受到获得空间分辨参考值的能力的限制。因此，一项研究使用批量平均参考来生成五花肉中脂肪含量的化学图。基于 PLS 的方法给出了不平滑的化学图和超出 0-100\% 范围的像素预测。相反，使用谱卷积层增强的 2D CNN 缓解了 PLS 产生的所有问题。最终研究试图通过分析 NIR 光谱、RGB 图像和 NIR-HSI 图像来模拟大麦的发芽能力。然而，由于数据集的发芽程度较低，结果尚无定论。此外，这篇论文还导致了两个开源 Python 包的开发。第一个有利于基于 PLS 的快速建模，而第二个有利于使用新算法对 PLS 和其他经典机器学习模型进行非常快速的交叉验证。

Title: VIST3A: Text-to-3D by Stitching a Multi-view Reconstruction Network to a Video Generator

Authors: Hyojun Go, Dominik Narnhofer, Goutam Bhat, Prune Truong, Federico Tombari, Konrad Schindler
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.13454
Pdf URL: https://arxiv.org/pdf/2510.13454
Copy Paste: [[2510.13454]] VIST3A: Text-to-3D by Stitching a Multi-view Reconstruction Network to a Video Generator(https://arxiv.org/abs/2510.13454)
Keywords: generation
Abstract: The rapid progress of large, pretrained models for both visual content generation and 3D reconstruction opens up new possibilities for text-to-3D generation. Intuitively, one could obtain a formidable 3D scene generator if one were able to combine the power of a modern latent text-to-video model as "generator" with the geometric abilities of a recent (feedforward) 3D reconstruction system as "decoder". We introduce VIST3A, a general framework that does just that, addressing two main challenges. First, the two components must be joined in a way that preserves the rich knowledge encoded in their weights. We revisit model stitching, i.e., we identify the layer in the 3D decoder that best matches the latent representation produced by the text-to-video generator and stitch the two parts together. That operation requires only a small dataset and no labels. Second, the text-to-video generator must be aligned with the stitched 3D decoder, to ensure that the generated latents are decodable into consistent, perceptually convincing 3D scene geometry. To that end, we adapt direct reward finetuning, a popular technique for human preference alignment. We evaluate the proposed VIST3A approach with different video generators and 3D reconstruction models. All tested pairings markedly improve over prior text-to-3D models that output Gaussian splats. Moreover, by choosing a suitable 3D base model, VIST3A also enables high-quality text-to-pointmap generation.
摘要：用于视觉内容生成和 3D 重建的大型预训练模型的快速进展为文本到 3D 生成开辟了新的可能性。直观地说，如果能够将现代潜在文本到视频模型作为“生成器”的强大功能与最新（前馈）3D 重建系统作为“解码器”的几何能力结合起来，就可以获得强大的 3D 场景生成器。我们引入了 VIST3A，这是一个通用框架，它可以解决两个主要挑战。首先，这两个组件必须以保留其权重中编码的丰富知识的方式连接。我们重新审视模型拼接，即，我们识别 3D 解码器中与文本到视频生成器生成的潜在表示最匹配的层，并将两个部分拼接在一起。该操作仅需要一个小数据集，并且不需要标签。其次，文本到视频生成器必须与缝合的 3D 解码器对齐，以确保生成的潜在信息可解码为一致的、感知上令人信服的 3D 场景几何形状。为此，我们采用直接奖励微调，这是一种人类偏好调整的流行技术。我们使用不同的视频生成器和 3D 重建模型评估所提出的 VIST3A 方法。所有测试的配对都比之前输出高斯图的文本到 3D 模型有了显着改进。此外，通过选择合适的3D基础模型，VIST3A还可以生成高质量的文本到点图。

Title: Tahakom LLM guidelines and receipts: from pre-training data to an Arabic LLM

Authors: Areej AlOtaibi, Lina Alyahya, Raghad Alshabanah, Shahad Alfawzan, Shuruq Alarefei, Reem Alsabti, Nouf Alsubaie, Abdulaziz Alhuzaymi, Lujain Alkhelb, Majd Alsayari, Waad Alahmed, Omar Talabay, Jalal Alowibdi, Salem Alelyani, Adel Bibi
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2510.13481
Pdf URL: https://arxiv.org/pdf/2510.13481
Copy Paste: [[2510.13481]] Tahakom LLM guidelines and receipts: from pre-training data to an Arabic LLM(https://arxiv.org/abs/2510.13481)
Keywords: generation
Abstract: Large Language Models (LLMs) have significantly advanced the field of natural language processing, enhancing capabilities in both language understanding and generation across diverse domains. However, developing LLMs for Arabic presents unique challenges. This paper explores these challenges by focusing on critical aspects such as data curation, tokenizer design, and evaluation. We detail our approach to the collection and filtration of Arabic pre-training datasets, assess the impact of various tokenizer designs on model performance, and examine the limitations of existing Arabic evaluation frameworks, for which we propose a systematic corrective methodology. To promote transparency and facilitate collaborative development, we share our data and methodologies, contributing to the advancement of language modeling, particularly for the Arabic language.
摘要：大型语言模型 (LLM) 极大地推进了自然语言处理领域的发展，增强了跨不同领域的语言理解和生成能力。然而，开发阿拉伯语法学硕士面临着独特的挑战。本文通过关注数据管理、分词器设计和评估等关键方面来探讨这些挑战。我们详细介绍了阿拉伯语预训练数据集的收集和过滤方法，评估了各种分词器设计对模型性能的影响，并检查了现有阿拉伯语评估框架的局限性，为此我们提出了系统的纠正方法。为了提高透明度并促进协作开发，我们共享我们的数据和方法，为语言建模的进步做出贡献，特别是阿拉伯语。

Title: ProtoTopic: Prototypical Network for Few-Shot Medical Topic Modeling

Authors: Martin Licht, Sara Ketabi, Farzad Khalvati
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2510.13542
Pdf URL: https://arxiv.org/pdf/2510.13542
Copy Paste: [[2510.13542]] ProtoTopic: Prototypical Network for Few-Shot Medical Topic Modeling(https://arxiv.org/abs/2510.13542)
Keywords: generation
Abstract: Topic modeling is a useful tool for analyzing large corpora of written documents, particularly academic papers. Despite a wide variety of proposed topic modeling techniques, these techniques do not perform well when applied to medical texts. This can be due to the low number of documents available for some topics in the healthcare domain. In this paper, we propose ProtoTopic, a prototypical network-based topic model used for topic generation for a set of medical paper abstracts. Prototypical networks are efficient, explainable models that make predictions by computing distances between input datapoints and a set of prototype representations, making them particularly effective in low-data or few-shot learning scenarios. With ProtoTopic, we demonstrate improved topic coherence and diversity compared to two topic modeling baselines used in the literature, demonstrating the ability of our model to generate medically relevant topics even with limited data.
摘要：主题建模是分析大型书面文档（尤其是学术论文）的有用工具。尽管提出了各种各样的主题建模技术，但这些技术在应用于医学文本时表现不佳。这可能是由于医疗保健领域某些主题的可用文档数量较少。在本文中，我们提出了 ProtoTopic，这是一种基于网络的主题模型原型，用于生成一组医学论文摘要的主题。原型网络是高效、可解释的模型，通过计算输入数据点和一组原型表示之间的距离来进行预测，这使得它们在低数据或少样本学习场景中特别有效。与文献中使用的两个主题建模基线相比，通过 ProtoTopic，我们展示了改进的主题连贯性和多样性，证明了我们的模型即使在数据有限的情况下也能生成医学相关主题的能力。
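
The distance-to-prototype prediction rule described above can be sketched in a few lines. This is a generic prototypical-network classifier on precomputed embeddings, not ProtoTopic itself: the embedding network and topic-generation step are omitted, and the two-dimensional vectors, topic labels, and function names are hypothetical.

```python
def build_prototypes(support):
    """Average the support embeddings of each class into one prototype vector."""
    protos = {}
    for label, vecs in support.items():
        dim = len(vecs[0])
        protos[label] = [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]
    return protos

def classify(query, protos):
    """Assign the query to the class with the nearest prototype."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(protos, key=lambda label: sqdist(query, protos[label]))

# Hypothetical 2-D abstract embeddings, two labelled examples per topic.
support = {
    "cardiology": [[1.0, 0.0], [0.8, 0.2]],
    "oncology": [[0.0, 1.0], [0.2, 0.8]],
}
protos = build_prototypes(support)
print(classify([0.7, 0.3], protos))
```

Because each class is summarized by a mean of only a few examples, the rule remains usable when a topic has very few documents, which is the low-data setting the abstract targets.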

Title: Manifold Decoders: A Framework for Generative Modeling from Nonlinear Embeddings

Authors: Riddhish Thakare, Kingdom Mutala Akugri
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2510.13622
Pdf URL: https://arxiv.org/pdf/2510.13622
Copy Paste: [[2510.13622]] Manifold Decoders: A Framework for Generative Modeling from Nonlinear Embeddings(https://arxiv.org/abs/2510.13622)
Keywords: generative
Abstract: Classical nonlinear dimensionality reduction (NLDR) techniques like t-SNE, Isomap, and LLE excel at creating low-dimensional embeddings for data visualization but fundamentally lack the ability to map these embeddings back to the original high-dimensional space. This one-way transformation limits their use in generative applications. This paper addresses this critical gap by introducing a systematic framework for constructing neural decoder architectures for prominent NLDR methods, enabling bidirectional mapping for the first time. We extend this framework by implementing a diffusion-based generative process that operates directly within these learned manifold spaces. Through experiments on the CelebA dataset, we evaluate the reconstruction and generative performance of our approach against autoencoder and standard diffusion model baselines. Our findings reveal a fundamental trade-off: while the decoders successfully reconstruct data, their quality is surpassed by end-to-end optimized autoencoders. Moreover, manifold-constrained diffusion yields poor-quality samples, suggesting that the discrete and sparse nature of classical NLDR embeddings is ill-suited for the continuous interpolation required by generative models. This work highlights the inherent challenges in retrofitting generative capabilities onto NLDR methods designed primarily for visualization and analysis.
摘要：t-SNE、Isomap 和 LLE 等经典非线性降维 (NLDR) 技术擅长为数据可视化创建低维嵌入，但从根本上缺乏将这些嵌入映射回原始高维空间的能力。这种单向转换限制了它们在生成应用中的使用。本文通过引入一个系统框架来为著名的 NLDR 方法构建神经解码器架构，从而解决了这一关键差距，首次实现了双向映射。我们通过实现直接在这些学习的流形空间内运行的基于扩散的生成过程来扩展这个框架。通过在 CelebA 数据集上进行实验，我们根据自动编码器和标准扩散模型基线评估了我们的方法的重建和生成性能。我们的研究结果揭示了一个基本的权衡：虽然解码器成功地重建了数据，但它们的质量被端到端优化的自动编码器超越。此外，流形约束扩散产生的样本质量较差，这表明经典 NLDR 嵌入的离散和稀疏性质不适合生成模型所需的连续插值。这项工作强调了将生成能力改造为主要用于可视化和分析的 NLDR 方法所面临的固有挑战。

Title: Challenges, Advances, and Evaluation Metrics in Medical Image Enhancement: A Systematic Literature Review

Authors: Chun Wai Chin, Haniza Yazid, Hoi Leong Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.13638
Pdf URL: https://arxiv.org/pdf/2510.13638
Copy Paste: [[2510.13638]] Challenges, Advances, and Evaluation Metrics in Medical Image Enhancement: A Systematic Literature Review(https://arxiv.org/abs/2510.13638)
Keywords: quality assessment
Abstract: Medical image enhancement is crucial for improving the quality and interpretability of diagnostic images, ultimately supporting early detection, accurate diagnosis, and effective treatment planning. Despite advancements in imaging technologies such as X-ray, CT, MRI, and ultrasound, medical images often suffer from challenges like noise, artifacts, and low contrast, which limit their diagnostic potential. Addressing these challenges requires robust preprocessing, denoising algorithms, and advanced enhancement methods, with deep learning techniques playing an increasingly significant role. This systematic literature review, following the PRISMA approach, investigates the key challenges, recent advancements, and evaluation metrics in medical image enhancement. By analyzing findings from 39 peer-reviewed studies, this review provides insights into the effectiveness of various enhancement methods across different imaging modalities and the importance of evaluation metrics in assessing their impact. Key issues like low contrast and noise are identified as the most frequent, with MRI and multi-modal imaging receiving the most attention, while specialized modalities such as histopathology, endoscopy, and bone scintigraphy remain underexplored. Out of the 39 studies, 29 utilize conventional mathematical methods, 9 focus on deep learning techniques, and 1 explores a hybrid approach. In terms of image quality assessment, 18 studies employ both reference-based and non-reference-based metrics, 9 rely solely on reference-based metrics, and 12 use only non-reference-based metrics, with a total of 65 IQA metrics introduced, predominantly non-reference-based. This review highlights current limitations, research gaps, and potential future directions for advancing medical image enhancement.
摘要：医学图像增强对于提高诊断图像的质量和可解释性至关重要，最终支持早期检测、准确诊断和有效的治疗计划。尽管 X 射线、CT、MRI 和超声波等成像技术取得了进步，但医学图像经常面临噪声、伪影和低对比度等挑战，这限制了其诊断潜力。应对这些挑战需要强大的预处理、去噪算法和先进的增强方法，其中深度学习技术发挥着越来越重要的作用。这篇系统文献综述遵循 PRISMA 方法，研究了医学图像增强中的关键挑战、最新进展和评估指标。通过分析 39 项同行评审研究的结果，本综述深入了解了不同成像模式下各种增强方法的有效性以及评估指标在评估其影响方面的重要性。低对比度和噪声等关键问题被认为是最常见的问题，其中 MRI 和多模态成像受到最多关注，而组织病理学、内窥镜检查和骨闪烁扫描等特殊模式仍未得到充分探索。在 39 项研究中，29 项利用传统数学方法，9 项重点关注深度学习技术，1 项探索混合方法。在图像质量评估方面，18项研究同时采用基于参考和基于非参考的指标，9项研究仅依赖基于参考的指标，12项仅使用基于非参考的指标，总共引入了65个IQA指标，主要是基于非参考的指标。这篇综述强调了当前医学图像增强的局限性、研究差距和潜在的未来方向。

Title: Local-Global Context-Aware and Structure-Preserving Image Super-Resolution

Authors: Sanchar Palit, Subhasis Chaudhuri, Biplab Banerjee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.13649
Pdf URL: https://arxiv.org/pdf/2510.13649
Copy Paste: [[2510.13649]] Local-Global Context-Aware and Structure-Preserving Image Super-Resolution(https://arxiv.org/abs/2510.13649)
Keywords: restoration, super-resolution, generation
Abstract: Diffusion models have recently achieved significant success in various image manipulation tasks, including image super-resolution and perceptual quality enhancement. Pretrained text-to-image models, such as Stable Diffusion, have exhibited strong capabilities in synthesizing realistic image content, which makes them particularly attractive for addressing super-resolution tasks. While some existing approaches leverage these models to achieve state-of-the-art results, they often struggle when applied to diverse and highly degraded images, leading to noise amplification or incorrect content generation. To address these limitations, we propose a contextually precise image super-resolution framework that effectively maintains both local and global pixel relationships through Local-Global Context-Aware Attention, enabling the generation of high-quality images. Furthermore, we propose a distribution- and perceptual-aligned conditioning mechanism in the pixel space to enhance perceptual fidelity. This mechanism captures fine-grained pixel-level representations while progressively preserving and refining structural information, transitioning from local content details to the global structural composition. During inference, our method generates high-quality images that are structurally consistent with the original content, mitigating artifacts and ensuring realistic detail restoration. Extensive experiments on multiple super-resolution benchmarks demonstrate the effectiveness of our approach in producing high-fidelity, perceptually accurate reconstructions.

Title: EditCast3D: Single-Frame-Guided 3D Editing with Video Propagation and View Selection

Authors: Huaizhi Qu, Ruichen Zhang, Shuqing Luo, Luchao Qi, Zhihao Zhang, Xiaoming Liu, Roni Sengupta, Tianlong Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.13652
Pdf URL: https://arxiv.org/pdf/2510.13652
Copy Paste: [[2510.13652]] EditCast3D: Single-Frame-Guided 3D Editing with Video Propagation and View Selection(https://arxiv.org/abs/2510.13652)
Keywords: generation
Abstract: Recent advances in foundation models have driven remarkable progress in image editing, yet their extension to 3D editing remains underexplored. A natural approach is to replace the image editing modules in existing workflows with foundation models. However, their heavy computational demands and the restrictions and costs of closed-source APIs make plugging these models into existing iterative editing strategies impractical. To address this limitation, we propose EditCast3D, a pipeline that employs video generation foundation models to propagate edits from a single first frame across the entire dataset prior to reconstruction. While editing propagation enables dataset-level editing via video models, its consistency remains suboptimal for 3D reconstruction, where multi-view alignment is essential. To overcome this, EditCast3D introduces a view selection strategy that explicitly identifies consistent and reconstruction-friendly views and adopts feedforward reconstruction without requiring costly refinement. In combination, the pipeline both minimizes reliance on expensive image editing and mitigates prompt ambiguities that arise when applying foundation models independently across images. We evaluate EditCast3D on commonly used 3D editing datasets and compare it against state-of-the-art 3D editing baselines, demonstrating superior editing quality and high efficiency. These results establish EditCast3D as a scalable and general paradigm for integrating foundation models into 3D editing pipelines. The code is available at this https URL

Title: CanvasMAR: Improving Masked Autoregressive Video Generation With Canvas

Authors: Zian Li, Muhan Zhang
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.13669
Pdf URL: https://arxiv.org/pdf/2510.13669
Copy Paste: [[2510.13669]] CanvasMAR: Improving Masked Autoregressive Video Generation With Canvas(https://arxiv.org/abs/2510.13669)
Keywords: generation
Abstract: Masked autoregressive models (MAR) have recently emerged as a powerful paradigm for image and video generation, combining the flexibility of masked modeling with the potential of continuous tokenizers. However, video MAR models suffer from two major limitations: the slow-start problem, caused by the lack of a structured global prior at early sampling stages, and error accumulation across the autoregression in both spatial and temporal dimensions. In this work, we propose CanvasMAR, a novel video MAR model that mitigates these issues by introducing a canvas mechanism: a blurred, global prediction of the next frame that serves as the starting point for masked generation. The canvas provides global structure early in sampling, enabling faster and more coherent frame synthesis. Furthermore, we introduce compositional classifier-free guidance that jointly enlarges spatial (canvas) and temporal conditioning, and employ noise-based canvas augmentation to enhance robustness. Experiments on the BAIR and Kinetics-600 benchmarks demonstrate that CanvasMAR produces high-quality videos with fewer autoregressive steps. Our approach achieves remarkable performance among autoregressive models on the Kinetics-600 dataset and rivals diffusion-based methods.

Title: FlashWorld: High-quality 3D Scene Generation within Seconds

Authors: Xinyang Li, Tengfei Wang, Zixiao Gu, Shengchuan Zhang, Chunchao Guo, Liujuan Cao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.13678
Pdf URL: https://arxiv.org/pdf/2510.13678
Copy Paste: [[2510.13678]] FlashWorld: High-quality 3D Scene Generation within Seconds(https://arxiv.org/abs/2510.13678)
Keywords: generation, generative
Abstract: We propose FlashWorld, a generative model that produces 3D scenes from a single image or text prompt in seconds, 10~100$\times$ faster than previous works while possessing superior rendering quality. Our approach shifts from the conventional multi-view-oriented (MV-oriented) paradigm, which generates multi-view images for subsequent 3D reconstruction, to a 3D-oriented approach where the model directly produces 3D Gaussian representations during multi-view generation. While they ensure 3D consistency, 3D-oriented methods typically suffer from poor visual quality. FlashWorld includes a dual-mode pre-training phase followed by a cross-mode post-training phase, effectively integrating the strengths of both paradigms. Specifically, leveraging the prior from a video diffusion model, we first pre-train a dual-mode multi-view diffusion model, which jointly supports MV-oriented and 3D-oriented generation modes. To bridge the quality gap in 3D-oriented generation, we further propose a cross-mode post-training distillation that matches the distribution of the consistent 3D-oriented mode to that of the high-quality MV-oriented mode. This not only enhances visual quality while maintaining 3D consistency, but also reduces the required denoising steps for inference. We also propose a strategy to leverage massive single-view images and text prompts during this process to enhance the model's generalization to out-of-distribution inputs. Extensive experiments demonstrate the superiority and efficiency of our method.

Title: Generating healthy counterfactuals with denoising diffusion bridge models

Authors: Ana Lawry Aguila, Peirong Liu, Marina Crespo Aguirre, Juan Eugenio Iglesias
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.13684
Pdf URL: https://arxiv.org/pdf/2510.13684
Copy Paste: [[2510.13684]] Generating healthy counterfactuals with denoising diffusion bridge models(https://arxiv.org/abs/2510.13684)
Keywords: generative
Abstract: Generating healthy counterfactuals from pathological images holds significant promise in medical imaging, e.g., in anomaly detection or for application of analysis tools that are designed for healthy scans. These counterfactuals should represent what a patient's scan would plausibly look like in the absence of pathology, preserving individual anatomical characteristics while modifying only the pathological regions. Denoising diffusion probabilistic models (DDPMs) have become popular methods for generating healthy counterfactuals of pathology data. Typically, this involves training on solely healthy data with the assumption that a partial denoising process will be unable to model disease regions and will instead reconstruct a closely matched healthy counterpart. More recent methods have incorporated synthetic pathological images to better guide the diffusion process. However, it remains challenging to guide the generative process in a way that effectively balances the removal of anomalies with the retention of subject-specific features. To solve this problem, we propose a novel application of denoising diffusion bridge models (DDBMs) - which, unlike DDPMs, condition the diffusion process not only on the initial point (i.e., the healthy image), but also on the final point (i.e., a corresponding synthetically generated pathological image). Treating the pathological image as a structurally informative prior enables us to generate counterfactuals that closely match the patient's anatomy while selectively removing pathology. The results show that our DDBM outperforms previously proposed diffusion models and fully supervised approaches at segmentation and anomaly detection tasks.

Title: MVCustom: Multi-View Customized Diffusion via Geometric Latent Rendering and Completion

Authors: Minjung Shin, Hyunin Cho, Sooyeon Go, Jin-Hwa Kim, Youngjung Uh
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2510.13702
Pdf URL: https://arxiv.org/pdf/2510.13702
Copy Paste: [[2510.13702]] MVCustom: Multi-View Customized Diffusion via Geometric Latent Rendering and Completion(https://arxiv.org/abs/2510.13702)
Keywords: generation, generative
Abstract: Multi-view generation with camera pose control and prompt-based customization are both essential elements for achieving controllable generative models. However, existing multi-view generation models do not support customization with geometric consistency, whereas customization models lack explicit viewpoint control, making them challenging to unify. Motivated by these gaps, we introduce a novel task, multi-view customization, which aims to jointly achieve multi-view camera pose control and customization. Due to the scarcity of training data in customization, existing multi-view generation models, which inherently rely on large-scale datasets, struggle to generalize to diverse prompts. To address this, we propose MVCustom, a novel diffusion-based framework explicitly designed to achieve both multi-view consistency and customization fidelity. In the training stage, MVCustom learns the subject's identity and geometry using a feature-field representation, incorporating the text-to-video diffusion backbone enhanced with dense spatio-temporal attention, which leverages temporal coherence for multi-view consistency. In the inference stage, we introduce two novel techniques: depth-aware feature rendering explicitly enforces geometric consistency, and consistent-aware latent completion ensures accurate perspective alignment of the customized subject and surrounding backgrounds. Extensive experiments demonstrate that MVCustom is the only framework that simultaneously achieves faithful multi-view generation and customization.

Title: Assessing the Geographic Generalization and Physical Consistency of Generative Models for Climate Downscaling

Authors: Carlo Saccardi, Maximilian Pierzyna, Haitz Sáez de Ocáriz Borde, Simone Monaco, Cristian Meo, Pietro Liò, Rudolf Saathof, Geethu Joseph, Justin Dauwels
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2510.13722
Pdf URL: https://arxiv.org/pdf/2510.13722
Copy Paste: [[2510.13722]] Assessing the Geographic Generalization and Physical Consistency of Generative Models for Climate Downscaling(https://arxiv.org/abs/2510.13722)
Keywords: generative
Abstract: Kilometer-scale weather data is crucial for real-world applications but remains computationally intensive to produce using traditional weather simulations. An emerging solution is to use deep learning models, which offer a faster alternative for climate downscaling. However, their reliability is still in question, as they are often evaluated using standard machine learning metrics rather than insights from atmospheric and weather physics. This paper benchmarks recent state-of-the-art deep learning models and introduces physics-inspired diagnostics to evaluate their performance and reliability, with a particular focus on geographic generalization and physical consistency. Our experiments show that, despite the seemingly strong performance of models such as CorrDiff, when trained on a limited set of European geographies (e.g., central Europe), they struggle to generalize to other regions such as Iberia, Morocco in the south, or Scandinavia in the north. They also fail to accurately capture second-order variables such as divergence and vorticity derived from predicted velocity fields. These deficiencies appear even in in-distribution geographies, indicating challenges in producing physically consistent predictions. We propose a simple initial solution: introducing a power spectral density loss function that empirically improves geographic generalization by encouraging the reconstruction of small-scale physical structures. The code for reproducing the experimental results can be found at this https URL
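
The two physics-inspired diagnostics named in the abstract above (divergence and vorticity derived from a predicted velocity field) and a spectral loss can be sketched in a few lines. This is a minimal illustration, not the paper's code: the function names, the radial binning, and the log-spectrum squared error are assumptions chosen for clarity, and the authors' actual power spectral density loss may be defined differently.

```python
import numpy as np

def divergence_and_vorticity(u, v, dx=1.0, dy=1.0):
    """Finite-difference divergence and vertical vorticity of a 2D wind field.

    u, v are arrays indexed [y, x]; dx and dy are grid spacings (assumed uniform).
    """
    du_dy, du_dx = np.gradient(u, dy, dx)
    dv_dy, dv_dx = np.gradient(v, dy, dx)
    divergence = du_dx + dv_dy
    vorticity = dv_dx - du_dy
    return divergence, vorticity

def radial_psd(field, n_bins=16):
    """Radially averaged power spectrum of a 2D field (illustrative binning)."""
    power = np.abs(np.fft.fftshift(np.fft.fft2(field))) ** 2
    ny, nx = field.shape
    ky, kx = np.indices((ny, nx))
    radius = np.hypot(ky - ny // 2, kx - nx // 2)
    edges = np.linspace(0.0, radius.max() + 1e-9, n_bins + 1)
    idx = np.digitize(radius.ravel(), edges) - 1
    sums = np.bincount(idx, weights=power.ravel(), minlength=n_bins)[:n_bins]
    counts = np.bincount(idx, minlength=n_bins)[:n_bins]
    return sums / np.maximum(counts, 1)

def psd_loss(pred, target, eps=1e-12):
    """Mean squared difference of log power spectra (an assumed loss form)."""
    log_pred = np.log(radial_psd(pred) + eps)
    log_target = np.log(radial_psd(target) + eps)
    return float(np.mean((log_pred - log_target) ** 2))
```

For a solid-body rotation (u = -y, v = x) the divergence is zero and the vorticity is 2 everywhere, which makes a convenient sanity check for the finite-difference signs.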

Title: Cyclic Self-Supervised Diffusion for Ultra Low-field to High-field MRI Synthesis

Authors: Zhenxuan Zhang, Peiyuan Jing, Zi Wang, Ula Briski, Coraline Beitone, Yue Yang, Yinzhe Wu, Fanwen Wang, Liutao Yang, Jiahao Huang, Zhifan Gao, Zhaolin Chen, Kh Tohidul Islam, Guang Yang, Peter J. Lally
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.13735
Pdf URL: https://arxiv.org/pdf/2510.13735
Copy Paste: [[2510.13735]] Cyclic Self-Supervised Diffusion for Ultra Low-field to High-field MRI Synthesis(https://arxiv.org/abs/2510.13735)
Keywords: restoration, generative
Abstract: Synthesizing high-quality images from low-field MRI holds significant potential. Low-field MRI is cheaper, more accessible, and safer, but suffers from low resolution and poor signal-to-noise ratio. This synthesis process can reduce reliance on costly acquisitions and expand data availability. However, synthesizing high-field MRI still suffers from a clinical fidelity gap. There is a need to preserve anatomical fidelity, enhance fine-grained structural details, and bridge domain gaps in image contrast. To address these issues, we propose a cyclic self-supervised diffusion (CSS-Diff) framework for high-field MRI synthesis from real low-field MRI data. Our core idea is to reformulate diffusion-based synthesis under a cycle-consistent constraint. It enforces anatomical preservation throughout the generative process rather than just relying on paired pixel-level supervision. The CSS-Diff framework further incorporates two novel processes. The slice-wise gap perception network aligns inter-slice inconsistencies via contrastive learning. The local structure correction network enhances local feature restoration through self-reconstruction of masked and perturbed patches. Extensive experiments on cross-field synthesis tasks demonstrate the effectiveness of our method, achieving state-of-the-art performance (e.g., 31.80 $\pm$ 2.70 dB in PSNR, 0.943 $\pm$ 0.102 in SSIM, and 0.0864 $\pm$ 0.0689 in LPIPS). Beyond pixel-wise fidelity, our method also preserves fine-grained anatomical structures compared with the original low-field MRI (e.g., left cerebral white matter error drops from 12.1% to 2.1%, cortex from 4.2% to 3.7%). To conclude, our CSS-Diff can synthesize images that are both quantitatively reliable and anatomically consistent.
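
For readers less familiar with the PSNR figure quoted above, the metric is a log-scaled mean squared error. The sketch below is the standard definition, not the authors' evaluation script; the function name and the `data_range` default (images scaled to [0, 1]) are assumptions.

```python
import numpy as np

def psnr(reference, estimate, data_range=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape.

    data_range is the assumed span of valid intensities (1.0 for [0, 1] images).
    """
    mse = np.mean((np.asarray(reference) - np.asarray(estimate)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(data_range ** 2 / mse))
```

Under the [0, 1] scaling assumed here, a single image at 31.80 dB would have a mean squared error of roughly 6.6e-4.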

Title: UniCalli: A Unified Diffusion Framework for Column-Level Generation and Recognition of Chinese Calligraphy

Authors: Tianshuo Xu, Kai Wang, Zhifei Chen, Leyi Wu, Tianshui Wen, Fei Chao, Ying-Cong Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.13745
Pdf URL: https://arxiv.org/pdf/2510.13745
Copy Paste: [[2510.13745]] UniCalli: A Unified Diffusion Framework for Column-Level Generation and Recognition of Chinese Calligraphy(https://arxiv.org/abs/2510.13745)
Keywords: generation, generative
Abstract: Computational replication of Chinese calligraphy remains challenging. Existing methods falter, either creating high-quality isolated characters while ignoring page-level aesthetics like ligatures and spacing, or attempting page synthesis at the expense of calligraphic correctness. We introduce UniCalli, a unified diffusion framework for column-level recognition and generation. Training both tasks jointly is deliberate: recognition constrains the generator to preserve character structure, while generation provides style and layout priors. This synergy fosters concept-level abstractions that improve both tasks, especially in limited-data regimes. We curated a dataset of over 8,000 digitized pieces, with ~4,000 densely annotated. UniCalli employs asymmetric noising and a rasterized box map for spatial priors, trained on a mix of synthetic, labeled, and unlabeled data. The model achieves state-of-the-art generative quality with superior ligature continuity and layout fidelity, alongside stronger recognition. The framework successfully extends to other ancient scripts, including Oracle bone inscriptions and Egyptian hieroglyphs. Code and data can be viewed at this https URL.

Title: InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn Dialogue

Authors: Wenwen Tong, Hewei Guo, Dongchuan Ran, Jiangnan Chen, Jiefan Lu, Kaibin Wang, Keqiang Li, Xiaoxu Zhu, Jiakui Li, Kehan Li, Xueheng Li, Lumin Li, Chenxu Guo, Jiasheng Zhou, Jiandong Chen, Xianye Wu, Jiahao Wang, Silei Wu, Lei Chen, Hanming Deng, Yuxuan Song, Dinghao Zhou, Guiping Zhong, Ken Zheng, Shiyin Kang, Lewei Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.13747
Pdf URL: https://arxiv.org/pdf/2510.13747
Copy Paste: [[2510.13747]] InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn Dialogue(https://arxiv.org/abs/2510.13747)
Keywords: generation
Abstract: We introduce InteractiveOmni, a unified and open-source omni-modal large language model for audio-visual multi-turn interaction, ranging from 4B to 8B parameters, designed to lead the field of lightweight models by offering comprehensive omni-modal understanding and speech generation capabilities. To achieve this, we integrate the vision encoder, audio encoder, large language model, and speech decoder into a unified model for understanding and generation tasks. We design a multi-stage training strategy to ensure robust cross-modal capabilities, including pre-training for omni-modal understanding, followed by post-training with speech conversation and audio-visual interaction. To enable human-like long-term conversational ability, we meticulously curate a multi-turn training dataset that enhances the model's ability to handle complex and multi-turn interactions. To effectively evaluate the multi-turn memory and speech interaction capabilities, we construct the multi-modal multi-turn memory benchmark and the multi-turn speech interaction benchmark. Experiments demonstrate that InteractiveOmni significantly outperforms leading open-source models and provides a more intelligent multi-turn audio-visual experience, particularly in its long-term memory capabilities. Notably, InteractiveOmni-4B is comparable to much larger models such as Qwen2.5-Omni-7B on general benchmarks, and it retains 97% of the performance of InteractiveOmni-8B while using only 50% of the model size. Achieving state-of-the-art results against similarly sized models across image, audio, video understanding, and speech generation tasks, InteractiveOmni is an accessible, open-source foundation for next-generation intelligent interactive systems.

Title: RECODE: Reasoning Through Code Generation for Visual Question Answering

Authors: Junhong Shen, Mu Cai, Bo Hu, Ameet Talwalkar, David A Ross, Cordelia Schmid, Alireza Fathi
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.13756
Pdf URL: https://arxiv.org/pdf/2510.13756
Copy Paste: [[2510.13756]] RECODE: Reasoning Through Code Generation for Visual Question Answering(https://arxiv.org/abs/2510.13756)
Keywords: generation
Abstract: Multimodal Large Language Models (MLLMs) struggle with precise reasoning for structured visuals like charts and diagrams, as pixel-based perception lacks a mechanism for verification. To address this, we propose to leverage derendering -- the process of reverse-engineering visuals into executable code -- as a new modality for verifiable visual reasoning. Specifically, we propose RECODE, an agentic framework that first generates multiple candidate programs to reproduce the input image. It then uses a critic to select the most faithful reconstruction and iteratively refines the code. This process not only transforms an ambiguous perceptual task into a verifiable, symbolic problem, but also enables precise calculations and logical inferences later on. On various visual reasoning benchmarks such as CharXiv, ChartQA, and Geometry3K, RECODE significantly outperforms methods that do not leverage code or only use code for drawing auxiliary lines or cropping. Our work demonstrates that grounding visual perception in executable code provides a new path toward more accurate and verifiable multimodal reasoning.

Title: Uni-MMMU: A Massive Multi-discipline Multimodal Unified Benchmark

Authors: Kai Zou, Ziqi Huang, Yuhao Dong, Shulin Tian, Dian Zheng, Hongbo Liu, Jingwen He, Bin Liu, Yu Qiao, Ziwei Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.13759
Pdf URL: https://arxiv.org/pdf/2510.13759
Copy Paste: [[2510.13759]] Uni-MMMU: A Massive Multi-discipline Multimodal Unified Benchmark(https://arxiv.org/abs/2510.13759)
Keywords: generation
Abstract: Unified multimodal models aim to jointly enable visual understanding and generation, yet current benchmarks rarely examine their true integration. Existing evaluations either treat the two abilities in isolation or overlook tasks that inherently couple them. To address this gap, we present Uni-MMMU, a comprehensive and discipline-aware benchmark that systematically unfolds the bidirectional synergy between generation and understanding across eight reasoning-centric domains, including science, coding, mathematics, and puzzles. Each task is bidirectionally coupled, demanding models to (i) leverage conceptual understanding to guide precise visual synthesis, or (ii) utilize generation as a cognitive scaffold for analytical reasoning. Uni-MMMU incorporates verifiable intermediate reasoning steps, unique ground truths, and a reproducible scoring protocol for both textual and visual outputs. Through extensive evaluation of state-of-the-art unified, generation-only, and understanding-only models, we reveal substantial performance disparities and cross-modal dependencies, offering new insights into when and how these abilities reinforce one another, and establishing a reliable foundation for advancing unified models.

Title: NoisePrints: Distortion-Free Watermarks for Authorship in Private Diffusion Models

Authors: Nir Goren, Oren Katzir, Abhinav Nakarmi, Eyal Ronen, Mahmood Sharif, Or Patashnik
Subjects: cs.CV, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2510.13793
Pdf URL: https://arxiv.org/pdf/2510.13793
Copy Paste: [[2510.13793]] NoisePrints: Distortion-Free Watermarks for Authorship in Private Diffusion Models(https://arxiv.org/abs/2510.13793)
Keywords: generation
Abstract: With the rapid adoption of diffusion models for visual content generation, proving authorship and protecting copyright have become critical. This challenge is particularly important when model owners keep their models private and may be unwilling or unable to handle authorship issues, making third-party verification essential. A natural solution is to embed watermarks for later verification. However, existing methods require access to model weights and rely on computationally heavy procedures, rendering them impractical and non-scalable. To address these challenges, we propose NoisePrints, a lightweight watermarking scheme that utilizes the random seed used to initialize the diffusion process as a proof of authorship without modifying the generation process. Our key observation is that the initial noise derived from a seed is highly correlated with the generated visual content. By incorporating a hash function into the noise sampling process, we further ensure that recovering a valid seed from the content is infeasible. We also show that sampling an alternative seed that passes verification is infeasible, and demonstrate the robustness of our method under various manipulations. Finally, we show how to use cryptographic zero-knowledge proofs to prove ownership without revealing the seed. By keeping the seed secret, we increase the difficulty of watermark removal. In our experiments, we validate NoisePrints on multiple state-of-the-art diffusion models for images and videos, demonstrating efficient verification using only the seed and output, without requiring access to model weights.
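
The seed-as-watermark idea described above can be illustrated with a toy sketch: hash a secret seed, derive the initial Gaussian noise from the hash, and later check whether a claimed seed reproduces noise that correlates with the content. Everything here is an assumption made for illustration (SHA-256 as the hash, cosine similarity as the test, the 0.5 threshold, and all function names); the abstract does not specify the paper's actual construction, and in practice the correlation would be measured against a representation of a real generated image rather than a raw array.

```python
import hashlib
import numpy as np

def noise_from_seed(seed: bytes, shape):
    """Derive initial Gaussian noise from a hashed seed (hypothetical scheme)."""
    digest = hashlib.sha256(seed).digest()
    rng = np.random.default_rng(int.from_bytes(digest, "big"))
    return rng.standard_normal(shape)

def cosine_similarity(a, b):
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(seed: bytes, content, threshold=0.5):
    """Accept the authorship claim if seed-derived noise correlates with content.

    The threshold is an arbitrary illustrative value, not one from the paper.
    """
    noise = noise_from_seed(seed, content.shape)
    return cosine_similarity(noise, content) > threshold
```

Because the noise depends on the seed only through the hash, someone holding the content cannot invert the mapping to obtain the seed directly; unrelated seeds produce noise whose similarity to the content concentrates near zero.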

Title: Generative Universal Verifier as Multimodal Meta-Reasoner

Authors: Xinchen Zhang, Xiaoying Zhang, Youbin Wu, Yanbin Cao, Renrui Zhang, Ruihang Chu, Ling Yang, Yujiu Yang
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2510.13804
Pdf URL: https://arxiv.org/pdf/2510.13804
Copy Paste: [[2510.13804]] Generative Universal Verifier as Multimodal Meta-Reasoner(https://arxiv.org/abs/2510.13804)
Keywords: generation, generative
Abstract: We introduce Generative Universal Verifier, a novel concept and plugin designed for next-generation multimodal reasoning in vision-language models and unified multimodal models, providing the fundamental capability of reflection and refinement on visual outcomes during the reasoning and generation process. This work makes three main contributions: (1) We build ViVerBench, a comprehensive benchmark spanning 16 categories of critical tasks for evaluating visual outcomes in multimodal reasoning. Results show that existing VLMs consistently underperform across these tasks, underscoring a substantial gap from human-level capability in reliable visual verification. (2) We design two automated pipelines to construct large-scale visual verification data and train OmniVerifier-7B, the first omni-capable generative verifier trained for universal visual verification, which achieves notable gains on ViVerBench (+8.3). Through training, we identify three atomic capabilities in visual verification and demonstrate how they generalize and interact synergistically. (3) We propose OmniVerifier-TTS, a sequential test-time scaling paradigm that leverages the universal verifier to bridge image generation and editing within unified models, enhancing the upper bound of generative ability through iterative fine-grained optimization. Beyond generation, we extend the universal verifier to broader world-modeling interleaved reasoning scenarios. Empirically, OmniVerifier-TTS achieves improvements on T2I-ReasonBench (+3.7) and GenEval++ (+4.3), outperforming existing parallel test-time scaling methods such as Best-of-N. By endowing multimodal reasoning with reliable visual verification, OmniVerifier advances both reliable reflection during generation and scalable test-time refinement, marking a step toward more trustworthy and controllable next-generation reasoning systems.
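Sketch: the abstract contrasts parallel test-time scaling (Best-of-N) with a sequential verify-and-refine loop. The toy below illustrates only that control-flow difference; it is not OmniVerifier-TTS. All names (`best_of_n`, `sequential_refine`, the toy generator, verifier, and editor) are hypothetical, and integers stand in for images.

```python
import random

def best_of_n(generate, score, prompt, n, rng):
    """Parallel scaling: draw n independent candidates, keep the best-scoring one."""
    candidates = [generate(prompt, rng) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))

def sequential_refine(generate, verify, edit, prompt, max_steps, rng):
    """Sequential scaling: one candidate, repaired step by step using verifier feedback."""
    candidate = generate(prompt, rng)
    for _ in range(max_steps):
        ok, feedback = verify(prompt, candidate)
        if ok:
            break
        candidate = edit(candidate, feedback)
    return candidate

# Toy stand-ins (assumptions): the "prompt" is a target integer and a
# "generated image" is a random integer in [0, 20].
def toy_generate(prompt, rng):
    return rng.randint(0, 20)

def toy_score(prompt, candidate):
    return -abs(prompt - candidate)

def toy_verify(prompt, candidate):
    # Feedback is the direction in which the candidate should move.
    if candidate == prompt:
        return True, 0
    return False, 1 if candidate < prompt else -1

def toy_edit(candidate, feedback):
    return candidate + feedback

rng = random.Random(0)
picked = best_of_n(toy_generate, toy_score, 7, 4, rng)
refined = sequential_refine(toy_generate, toy_verify, toy_edit, 7, 20, rng)
```

In this toy, the sequential loop always reaches the target within the step budget because each edit uses the verifier's feedback, whereas Best-of-N can only select among independent samples.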

Title: PhysMaster: Mastering Physical Representation for Video Generation via Reinforcement Learning

Authors: Sihui Ji, Xi Chen, Xin Tao, Pengfei Wan, Hengshuang Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.13809
Pdf URL: https://arxiv.org/pdf/2510.13809
Copy Paste: [[2510.13809]] PhysMaster: Mastering Physical Representation for Video Generation via Reinforcement Learning(https://arxiv.org/abs/2510.13809)
Keywords: generation
Abstract: Video generation models nowadays are capable of generating visually realistic videos, but often fail to adhere to physical laws, limiting their ability to generate physically plausible videos and serve as "world models". To address this issue, we propose PhysMaster, which captures physical knowledge as a representation for guiding video generation models to enhance their physics-awareness. Specifically, PhysMaster is based on the image-to-video task where the model is expected to predict physically plausible dynamics from the input image. Since the input image provides physical priors like relative positions and potential interactions of objects in the scenario, we devise PhysEncoder to encode physical information from it as an extra condition to inject physical knowledge into the video generation process. The lack of proper supervision on the model's physical performance beyond mere appearance motivates PhysEncoder to apply reinforcement learning with human feedback to physical representation learning, which leverages feedback from generation models to optimize physical representations with Direct Preference Optimization (DPO) in an end-to-end manner. PhysMaster provides a feasible solution for improving physics-awareness of PhysEncoder and thus of video generation, proving its ability on a simple proxy task and generalizability to wide-ranging physical scenarios. This implies that our PhysMaster, which unifies solutions for various physical processes via representation learning in the reinforcement learning paradigm, can act as a generic and plug-in solution for physics-aware video generation and broader applications.
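Sketch: the abstract names Direct Preference Optimization (DPO) as the training objective. For reference, the block below computes the standard per-pair DPO loss from sequence log-probabilities. It is a generic illustration, not PhysMaster's implementation; the function name `dpo_loss`, its argument names, and the default `beta` are assumptions made here.

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Each argument is the log-probability of a full sample (preferred or
    rejected) under the trainable policy or the frozen reference model.
    """
    # How much more the policy favors the chosen sample than the reference
    # does, minus the same quantity for the rejected sample.
    margin = ((policy_logp_chosen - ref_logp_chosen)
              - (policy_logp_rejected - ref_logp_rejected))
    z = beta * margin
    # -log(sigmoid(z)) equals softplus(-z); this form is numerically stable.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# A policy that agrees with the preference incurs a lower loss than one
# that prefers the rejected sample.
aligned = dpo_loss(-1.0, -3.0, -2.0, -2.0)
misaligned = dpo_loss(-3.0, -1.0, -2.0, -2.0)
```

When the policy and the reference assign identical log-probabilities, the margin is zero and the loss equals log 2; the loss falls as the policy shifts probability toward the preferred sample relative to the reference.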