2025-10-09

Title: CML-Bench: A Framework for Evaluating and Enhancing LLM-Powered Movie Scripts Generation

Authors: Mingzhe Zheng, Dingjie Song, Guanyu Zhou, Jun You, Jiahao Zhan, Xuran Ma, Xinyuan Song, Ser-Nam Lim, Qifeng Chen, Harry Yang
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2510.06231
Pdf URL: https://arxiv.org/pdf/2510.06231
Copy Paste: [[2510.06231]] CML-Bench: A Framework for Evaluating and Enhancing LLM-Powered Movie Scripts Generation(https://arxiv.org/abs/2510.06231)
Keywords: generation, quality assessment
Abstract: Large Language Models (LLMs) have demonstrated remarkable proficiency in generating highly structured texts. However, while exhibiting a high degree of structural organization, movie scripts demand an additional layer of nuanced storytelling and emotional depth (the 'soul' of compelling cinema) that LLMs often fail to capture. To investigate this deficiency, we first curated CML-Dataset, a dataset comprising (summary, content) pairs for Cinematic Markup Language (CML), where 'content' consists of segments from esteemed, high-quality movie scripts and 'summary' is a concise description of the content. Through an in-depth analysis of the intrinsic multi-shot continuity and narrative structures within these authentic scripts, we identified three pivotal dimensions for quality assessment: Dialogue Coherence (DC), Character Consistency (CC), and Plot Reasonableness (PR). Informed by these findings, we propose CML-Bench, featuring quantitative metrics across these dimensions. CML-Bench effectively assigns high scores to well-crafted, human-written scripts while concurrently pinpointing the weaknesses in screenplays generated by LLMs. To further validate our benchmark, we introduce CML-Instruction, a prompting strategy with detailed instructions on character dialogue and event logic, to guide LLMs to generate more structured and cinematically sound scripts. Extensive experiments validate the effectiveness of our benchmark and demonstrate that LLMs guided by CML-Instruction generate higher-quality screenplays, with results aligned with human preferences.

Title: RareGraph-Synth: Knowledge-Guided Diffusion Models for Generating Privacy-Preserving Synthetic Patient Trajectories in Ultra-Rare Diseases

Authors: Khartik Uppalapati, Shakeel Abdulkareem, Bora Yimenicioglu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2510.06267
Pdf URL: https://arxiv.org/pdf/2510.06267
Copy Paste: [[2510.06267]] RareGraph-Synth: Knowledge-Guided Diffusion Models for Generating Privacy-Preserving Synthetic Patient Trajectories in Ultra-Rare Diseases(https://arxiv.org/abs/2510.06267)
Keywords: generation
Abstract: We propose RareGraph-Synth, a knowledge-guided, continuous-time diffusion framework that generates realistic yet privacy-preserving synthetic electronic-health-record (EHR) trajectories for ultra-rare diseases. RareGraph-Synth unifies five public resources: Orphanet/Orphadata, the Human Phenotype Ontology (HPO), the GARD rare-disease KG, PrimeKG, and the FDA Adverse Event Reporting System (FAERS) into a heterogeneous knowledge graph comprising approximately 8 M typed edges. Meta-path scores extracted from this 8-million-edge KG modulate the per-token noise schedule in the forward stochastic differential equation, steering generation toward biologically plausible lab-medication-adverse-event co-occurrences while retaining score-based diffusion model stability. The reverse denoiser then produces timestamped sequences of lab-code, medication-code, and adverse-event-flag triples that contain no protected health information. On simulated ultra-rare-disease cohorts, RareGraph-Synth lowers categorical Maximum Mean Discrepancy by 40 percent relative to an unguided diffusion baseline and by greater than 60 percent versus GAN counterparts, without sacrificing downstream predictive utility. A black-box membership-inference evaluation using the DOMIAS attacker yields AUROC approximately 0.53, well below the 0.55 safe-release threshold and substantially better than the approximately 0.61 plus or minus 0.03 observed for non-KG baselines, demonstrating strong resistance to re-identification. These results suggest that integrating biomedical knowledge graphs directly into diffusion noise schedules can simultaneously enhance fidelity and privacy, enabling safer data sharing for rare-disease research.
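
The membership-inference figures in the abstract above are AUROC values, where 0.5 means the attacker does no better than chance. As a minimal sketch of how such a number is computed, the following uses the standard pairwise (Mann-Whitney) definition of AUROC. The attacker scores are hypothetical illustration data, not results from the paper, and this is not the DOMIAS attack itself.

```python
def auroc(member_scores, nonmember_scores):
    """Probability that a random training member receives a higher attack
    score than a random non-member; ties count as one half."""
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


# Hypothetical attacker scores, for illustration only.
members = [0.9, 0.4, 0.6, 0.2]
nonmembers = [0.8, 0.5, 0.3, 0.1]
print(auroc(members, nonmembers))
```

The closer the value is to 0.5, the less the attacker's scores separate training records from unseen ones.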

Title: General and Efficient Visual Goal-Conditioned Reinforcement Learning using Object-Agnostic Masks

Authors: Fahim Shahriar, Cheryl Wang, Alireza Azimi, Gautham Vasan, Hany Hamed Elanwar, A. Rupam Mahmood, Colin Bellinger
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2510.06277
Pdf URL: https://arxiv.org/pdf/2510.06277
Copy Paste: [[2510.06277]] General and Efficient Visual Goal-Conditioned Reinforcement Learning using Object-Agnostic Masks(https://arxiv.org/abs/2510.06277)
Keywords: generation
Abstract: Goal-conditioned reinforcement learning (GCRL) allows agents to learn diverse objectives using a unified policy. The success of GCRL, however, is contingent on the choice of goal representation. In this work, we propose a mask-based goal representation system that provides object-agnostic visual cues to the agent, enabling efficient learning and superior generalization. In contrast, existing goal representation methods, such as target state images, 3D coordinates, and one-hot vectors, face issues of poor generalization to unseen objects, slow convergence, and the need for special cameras. Masks can be processed to generate dense rewards without requiring error-prone distance calculations. Learning with ground truth masks in simulation, we achieved 99.9% reaching accuracy on training and unseen test objects. Our proposed method can be utilized to perform pick-up tasks with high accuracy, without using any positional information of the target. Moreover, we demonstrate learning from scratch and sim-to-real transfer applications using two different physical robots, utilizing pretrained open vocabulary object detection models for mask generation.

Title: Traj-Transformer: Diffusion Models with Transformer for GPS Trajectory Generation

Authors: Zhiyang Zhang, Ningcong Chen, Xin Zhang, Yanhua Li, Shen Su, Hui Lu, Jun Luo
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2510.06291
Pdf URL: https://arxiv.org/pdf/2510.06291
Copy Paste: [[2510.06291]] Traj-Transformer: Diffusion Models with Transformer for GPS Trajectory Generation(https://arxiv.org/abs/2510.06291)
Keywords: generation
Abstract: The widespread use of GPS devices has driven advances in spatiotemporal data mining, enabling machine learning models to simulate human decision making and generate realistic trajectories, addressing both data collection costs and privacy concerns. Recent studies have shown the promise of diffusion models for high-quality trajectory generation. However, most existing methods rely on convolution-based architectures (e.g., UNet) to predict noise during the diffusion process, which often results in notable deviations and the loss of fine-grained street-level details due to limited model capacity. In this paper, we propose Trajectory Transformer, a novel model that employs a transformer backbone for both conditional information embedding and noise prediction. We explore two GPS coordinate embedding strategies, location embedding and longitude-latitude embedding, and analyze model performance at different scales. Experiments on two real-world datasets demonstrate that Trajectory Transformer significantly enhances generation quality and effectively alleviates the deviation issues observed in prior approaches.

Title: BlockGPT: Spatio-Temporal Modelling of Rainfall via Frame-Level Autoregression

Authors: Cristian Meo, Varun Sarathchandran, Avijit Majhi, Shao Hung, Carlo Saccardi, Ruben Imhoff, Roberto Deidda, Remko Uijlenhoet, Justin Dauwels
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2510.06293
Pdf URL: https://arxiv.org/pdf/2510.06293
Copy Paste: [[2510.06293]] BlockGPT: Spatio-Temporal Modelling of Rainfall via Frame-Level Autoregression(https://arxiv.org/abs/2510.06293)
Keywords: generative
Abstract: Predicting precipitation maps is a highly complex spatiotemporal modeling task, critical for mitigating the impacts of extreme weather events. Short-term precipitation forecasting, or nowcasting, requires models that are not only accurate but also computationally efficient for real-time applications. Current methods, such as token-based autoregressive models, often suffer from flawed inductive biases and slow inference, while diffusion models can be computationally intensive. To address these limitations, we introduce BlockGPT, a generative autoregressive transformer using a batched tokenization (Block) method that predicts full two-dimensional fields (frames) at each time step. Conceived as a model-agnostic paradigm for video prediction, BlockGPT factorizes space-time by using self-attention within each frame and causal attention across frames; in this work, we instantiate it for precipitation nowcasting. We evaluate BlockGPT on two precipitation datasets, viz. KNMI (Netherlands) and SEVIR (U.S.), comparing it to state-of-the-art baselines including token-based (NowcastingGPT) and diffusion-based (DiffCast+Phydnet) models. The results show that BlockGPT achieves superior accuracy, event localization as measured by categorical metrics, and inference speeds up to 31x faster than comparable baselines.
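
The abstract above describes self-attention within each frame and causal attention across frames. One plausible reading is a block-lower-triangular attention mask, sketched below. The function name and the flat token layout (frames stored one after another) are assumptions for illustration; the abstract does not confirm that BlockGPT is implemented this way.

```python
def block_causal_mask(num_frames, tokens_per_frame):
    """Return mask[q][k] == True when query token q may attend to key token k.

    Tokens are assumed to be laid out frame by frame. A token sees every
    token of its own frame (full self-attention inside the frame) and every
    token of earlier frames (causal across frames), but nothing from later
    frames.
    """
    total = num_frames * tokens_per_frame
    mask = []
    for q in range(total):
        q_frame = q // tokens_per_frame
        row = [(k // tokens_per_frame) <= q_frame for k in range(total)]
        mask.append(row)
    return mask


# Two frames of two tokens each: the first frame cannot see the second.
for row in block_causal_mask(2, 2):
    print(["x" if allowed else "." for allowed in row])
```

Compared with a token-level causal mask, every token of a frame shares the same visibility, which is what allows a whole frame to be predicted in one step.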

Title: RGBD Gaze Tracking Using Transformer for Feature Fusion

Authors: Tobias J. Bauer
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2510.06298
Pdf URL: https://arxiv.org/pdf/2510.06298
Copy Paste: [[2510.06298]] RGBD Gaze Tracking Using Transformer for Feature Fusion(https://arxiv.org/abs/2510.06298)
Keywords: generative
Abstract: The subject of this thesis is the implementation of an AI-based Gaze Tracking system using RGBD images that contain both color (RGB) and depth (D) information. To fuse the features extracted from the images, a module based on the Transformer architecture is used. The combination of RGBD input images and Transformers was chosen because it has not yet been investigated. Furthermore, a new dataset is created for training the AI models as existing datasets either do not contain depth information or only contain labels for Gaze Point Estimation that are not suitable for the task of Gaze Angle Estimation. Various model configurations are trained, validated and evaluated on a total of three different datasets. The trained models are then to be used in a real-time pipeline to estimate the gaze direction and thus the gaze point of a person in front of a computer screen. The AI model architecture used in this thesis is based on an earlier work by Lian et al. It uses a Generative Adversarial Network (GAN) to simultaneously remove depth map artifacts and extract head pose features. Lian et al. achieve a mean Euclidean error of 38.7mm on their own dataset ShanghaiTechGaze+. In this thesis, a model architecture with a Transformer module for feature fusion achieves a mean Euclidean error of 55.3mm on the same dataset, but we show that using no pre-trained GAN module leads to a mean Euclidean error of 30.1mm. Replacing the Transformer module with a Multilayer Perceptron (MLP) improves the error to 26.9mm. These results are consistent with the ones on the other two datasets. On the ETH-XGaze dataset, the model with Transformer module achieves a mean angular error of 3.59° and without Transformer module 3.26°, whereas the fundamentally different model architecture used by the dataset authors Zhang et al. achieves a mean angular error of 2.04°. On the OTH-Gaze-Estimation dataset created for...
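
The abstract above reports mean angular error in degrees. As a minimal sketch of that metric, the following computes the angle between predicted and ground-truth 3D gaze direction vectors and averages it over samples. The function names and the example vectors are illustrative assumptions, not taken from the thesis.

```python
import math


def angular_error_deg(pred, true):
    """Angle in degrees between two direction vectors of any nonzero length."""
    dot = sum(p * t for p, t in zip(pred, true))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(t * t for t in true))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos_angle = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_angle))


def mean_angular_error_deg(preds, trues):
    """Average angular error over paired predictions and ground truths."""
    errors = [angular_error_deg(p, t) for p, t in zip(preds, trues)]
    return sum(errors) / len(errors)


# Illustrative vectors only: one perfect prediction, one that is 90° off.
preds = [(0.0, 0.0, -1.0), (1.0, 0.0, 0.0)]
trues = [(0.0, 0.0, -1.0), (0.0, 1.0, 0.0)]
print(mean_angular_error_deg(preds, trues))
```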

Title: SDAR: A Synergistic Diffusion-AutoRegression Paradigm for Scalable Sequence Generation

Authors: Shuang Cheng, Yihan Bian, Dawei Liu, Yuhua Jiang, Yihao Liu, Linfeng Zhang, Wenhai Wang, Qipeng Guo, Kai Chen, Biqing Qi, Bowen Zhou
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2510.06303
Pdf URL: https://arxiv.org/pdf/2510.06303
Copy Paste: [[2510.06303]] SDAR: A Synergistic Diffusion-AutoRegression Paradigm for Scalable Sequence Generation(https://arxiv.org/abs/2510.06303)
Keywords: generation
Abstract: We propose SDAR, a Synergistic Diffusion-Autoregression paradigm that unifies the training efficiency of autoregressive models with the parallel inference capability of diffusion. Instead of costly end-to-end diffusion training, SDAR performs a lightweight paradigm conversion that transforms a well-trained autoregressive (AR) model into a blockwise diffusion model through brief, data-efficient adaptation. During inference, SDAR generates sequences autoregressively across blocks for global coherence while decoding all tokens within each block in parallel via a discrete diffusion process. Extensive experiments show that AR models remain substantially more compute-efficient than masked diffusion models, providing a strong foundation for adaptation. Building on this insight, SDAR achieves efficient AR-to-diffusion conversion with minimal cost, preserving AR-level performance while enabling parallel generation. Scaling studies across dense and Mixture-of-Experts architectures confirm that SDAR scales without compromise: larger models exhibit stronger robustness to block size and decoding thresholds, yielding greater speedups without accuracy loss. Beyond efficiency, SDAR demonstrates enhanced reasoning and domain adaptability. Our 30B MoE model surpasses its AR counterpart on challenging scientific reasoning benchmarks such as GPQA and ChemBench, and gains further improvements under test-time scaling methods like majority voting and pass@k. Together, these results establish SDAR as a practical paradigm that combines the strengths of autoregression and diffusion for scalable, high-throughput reasoning.
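
The abstract above mentions two test-time scaling methods, majority voting and pass@k. As a minimal sketch, the following implements the commonly used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k) for n samples of which c are correct, together with a simple majority vote. The function names and example answers are illustrative; the abstract does not state which estimator the authors use.

```python
from collections import Counter
from math import comb


def pass_at_k(n, c, k):
    """Probability that at least one of k samples, drawn without replacement
    from n generated samples of which c are correct, is correct."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k hits a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def majority_vote(answers):
    """Return the most frequent final answer among the sampled generations."""
    return Counter(answers).most_common(1)[0][0]


print(pass_at_k(n=10, c=3, k=1))
print(majority_vote(["B", "A", "B", "C", "B"]))
```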

Title: Lumina-DiMOO: An Omni Diffusion Large Language Model for Multi-Modal Generation and Understanding

Authors: Yi Xin, Qi Qin, Siqi Luo, Kaiwen Zhu, Juncheng Yan, Yan Tai, Jiayi Lei, Yuewen Cao, Keqi Wang, Yibin Wang, Jinbin Bai, Qian Yu, Dengyang Jiang, Yuandong Pu, Haoxing Chen, Le Zhuo, Junjun He, Gen Luo, Tianbin Li, Ming Hu, Jin Ye, Shenglong Ye, Bo Zhang, Chang Xu, Wenhai Wang, Hongsheng Li, Guangtao Zhai, Tianfan Xue, Bin Fu, Xiaohong Liu, Yu Qiao, Yihao Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.06308
Pdf URL: https://arxiv.org/pdf/2510.06308
Copy Paste: [[2510.06308]] Lumina-DiMOO: An Omni Diffusion Large Language Model for Multi-Modal Generation and Understanding(https://arxiv.org/abs/2510.06308)
Keywords: generation
Abstract: We introduce Lumina-DiMOO, an open-source foundational model for seamless multi-modal generation and understanding. Lumina-DiMOO sets itself apart from prior unified models by utilizing a fully discrete diffusion modeling to handle inputs and outputs across various modalities. This innovative approach allows Lumina-DiMOO to achieve higher sampling efficiency compared to previous autoregressive (AR) or hybrid AR-Diffusion paradigms and adeptly support a broad spectrum of multi-modal tasks, including text-to-image generation, image-to-image generation (e.g., image editing, subject-driven generation, and image inpainting, etc.), as well as image understanding. Lumina-DiMOO achieves state-of-the-art performance on multiple benchmarks, surpassing existing open-source unified multi-modal models. To foster further advancements in multi-modal and discrete diffusion model research, we release our code and checkpoints to the community. Project Page: this https URL.

Title: TransFIRA: Transfer Learning for Face Image Recognizability Assessment

Authors: Allen Tu, Kartik Narayan, Joshua Gleason, Jennifer Xu, Matthew Meyn, Tom Goldstein, Vishal M. Patel
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.06353
Pdf URL: https://arxiv.org/pdf/2510.06353
Copy Paste: [[2510.06353]] TransFIRA: Transfer Learning for Face Image Recognizability Assessment(https://arxiv.org/abs/2510.06353)
Keywords: generative
Abstract: Face recognition in unconstrained environments such as surveillance, video, and web imagery must contend with extreme variation in pose, blur, illumination, and occlusion, where conventional visual quality metrics fail to predict whether inputs are truly recognizable to the deployed encoder. Existing FIQA methods typically rely on visual heuristics, curated annotations, or computationally intensive generative pipelines, leaving their predictions detached from the encoder's decision geometry. We introduce TransFIRA (Transfer Learning for Face Image Recognizability Assessment), a lightweight and annotation-free framework that grounds recognizability directly in embedding space. TransFIRA delivers three advances: (i) a definition of recognizability via class-center similarity (CCS) and class-center angular separation (CCAS), yielding the first natural, decision-boundary--aligned criterion for filtering and weighting; (ii) a recognizability-informed aggregation strategy that achieves state-of-the-art verification accuracy on BRIAR and IJB-C while nearly doubling correlation with true recognizability, all without external labels, heuristics, or backbone-specific training; and (iii) new extensions beyond faces, including encoder-grounded explainability that reveals how degradations and subject-specific factors affect recognizability, and the first recognizability-aware body recognition assessment. Experiments confirm state-of-the-art results on faces, strong performance on body recognition, and robustness under cross-dataset shifts. Together, these contributions establish TransFIRA as a unified, geometry-driven framework for recognizability assessment -- encoder-specific, accurate, interpretable, and extensible across modalities -- significantly advancing FIQA in accuracy, explainability, and scope.
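
The abstract above names class-center similarity (CCS) and class-center angular separation (CCAS) without giving formulas. The sketch below shows one assumed reading: CCS is the cosine similarity between an embedding and the center of its own class, and CCAS is CCS minus the highest cosine similarity to any other class center, so that a positive value means the embedding sits on the correct side of the decision boundary. Both definitions and all names here are assumptions for illustration and should be checked against the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def class_center(embeddings):
    """Component-wise mean of the embeddings of one identity."""
    return [sum(col) / len(embeddings) for col in zip(*embeddings)]


def ccs(embedding, own_center):
    """Assumed CCS: similarity to the embedding's own class center."""
    return cosine(embedding, own_center)


def ccas(embedding, own_center, other_centers):
    """Assumed CCAS: margin between own-center similarity and the nearest impostor center."""
    nearest_other = max(cosine(embedding, c) for c in other_centers)
    return ccs(embedding, own_center) - nearest_other


# Toy 2D embeddings, for illustration only.
own = class_center([[1.0, 0.0], [0.8, 0.2]])
others = [class_center([[0.0, 1.0], [0.2, 0.8]])]
print(ccs([0.9, 0.1], own), ccas([0.9, 0.1], own, others))
```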

Title: TDiff: Thermal Plug-And-Play Prior with Patch-Based Diffusion

Authors: Piyush Dashpute, Niki Nezakati, Wolfgang Heidrich, Vishwanath Saragadam
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.06460
Pdf URL: https://arxiv.org/pdf/2510.06460
Copy Paste: [[2510.06460]] TDiff: Thermal Plug-And-Play Prior with Patch-Based Diffusion(https://arxiv.org/abs/2510.06460)
Keywords: restoration, super-resolution
Abstract: Thermal images from low-cost cameras often suffer from low resolution, fixed pattern noise, and other localized degradations. Available datasets for thermal imaging are also limited in both size and diversity. To address these challenges, we propose a patch-based diffusion framework (TDiff) that leverages the local nature of these distortions by training on small thermal patches. In this approach, full-resolution images are restored by denoising overlapping patches and blending them using smooth spatial windowing. To our knowledge, this is the first patch-based diffusion framework that models a learned prior for thermal image restoration across multiple tasks. Experiments on denoising, super-resolution, and deblurring demonstrate strong results on both simulated and real thermal data, establishing our method as a unified restoration pipeline.
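
The abstract above restores full-resolution images by denoising overlapping patches and blending them with smooth spatial windowing. The sketch below illustrates that blending step on a 1D signal for brevity. The sine-squared window, the patch placement rule, and the `denoise` callback are assumptions standing in for the paper's diffusion prior and actual window choice.

```python
import math


def smooth_window(n):
    """Sine-squared window sampled at half-integer positions so that every
    weight is strictly positive (avoids dividing by zero at the borders)."""
    return [math.sin(math.pi * (i + 0.5) / n) ** 2 for i in range(n)]


def patch_starts(length, patch, stride):
    """Start indices of overlapping patches; a final patch is aligned to the
    end so the whole signal is covered. Requires length >= patch."""
    starts = list(range(0, length - patch + 1, stride))
    if starts[-1] != length - patch:
        starts.append(length - patch)
    return starts


def restore(signal, patch, stride, denoise):
    """Process each patch with `denoise`, then blend by window-weighted averaging."""
    window = smooth_window(patch)
    acc = [0.0] * len(signal)
    weight = [0.0] * len(signal)
    for s in patch_starts(len(signal), patch, stride):
        out = denoise(signal[s:s + patch])
        for i, (value, w) in enumerate(zip(out, window)):
            acc[s + i] += w * value
            weight[s + i] += w
    return [a / w for a, w in zip(acc, weight)]


# Sanity check with an identity "denoiser": blending must return the input.
signal = [float(i) for i in range(11)]
print(restore(signal, patch=4, stride=2, denoise=lambda p: p))
```

Because each output sample is a weighted average of the overlapping patch outputs, seams between patches are smoothed rather than cut hard at patch borders.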

Title: SIGMA-GEN: Structure and Identity Guided Multi-subject Assembly for Image Generation

Authors: Oindrila Saha, Vojtech Krs, Radomir Mech, Subhransu Maji, Kevin Blackburn-Matzen, Matheus Gadelha
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.06469
Pdf URL: https://arxiv.org/pdf/2510.06469
Copy Paste: [[2510.06469]] SIGMA-GEN: Structure and Identity Guided Multi-subject Assembly for Image Generation(https://arxiv.org/abs/2510.06469)
Keywords: generation
Abstract: We present SIGMA-GEN, a unified framework for multi-identity preserving image generation. Unlike prior approaches, SIGMA-GEN is the first to enable single-pass multi-subject identity-preserved generation guided by both structural and spatial constraints. A key strength of our method is its ability to support user guidance at various levels of precision -- from coarse 2D or 3D boxes to pixel-level segmentations and depth -- with a single model. To enable this, we introduce SIGMA-SET27K, a novel synthetic dataset that provides identity, structure, and spatial information for over 100k unique subjects across 27k images. Through extensive evaluation we demonstrate that SIGMA-GEN achieves state-of-the-art performance in identity preservation, image generation quality, and speed. Code and visualizations at this https URL

Title: Attention Sinks and Compression Valleys in LLMs are Two Sides of the Same Coin

Authors: Enrique Queipo-de-Llano, Álvaro Arroyo, Federico Barbero, Xiaowen Dong, Michael Bronstein, Yann LeCun, Ravid Shwartz-Ziv
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2510.06477
Pdf URL: https://arxiv.org/pdf/2510.06477
Copy Paste: [[2510.06477]] Attention Sinks and Compression Valleys in LLMs are Two Sides of the Same Coin(https://arxiv.org/abs/2510.06477)
Keywords: generation
Abstract: Attention sinks and compression valleys have attracted significant attention as two puzzling phenomena in large language models, but have been studied in isolation. In this work, we present a surprising connection between attention sinks and compression valleys, tracing both to the formation of massive activations in the residual stream. We prove theoretically that massive activations necessarily produce representational compression and establish bounds on the resulting entropy reduction. Through experiments across several models (410M-120B parameters), we confirm that when the beginning-of-sequence token develops extreme activation norms in the middle layers, both compression valleys and attention sinks emerge simultaneously. Targeted ablation studies validate our theoretical predictions. This unified view motivates us to propose the Mix-Compress-Refine theory of information flow, as an attempt to explain how LLMs organize their computation in depth by controlling attention and representational compression via massive activations. Specifically, we posit that Transformer-based LLMs process tokens in three distinct phases: (1) broad mixing in the early layers, (2) compressed computation with limited mixing in the middle layers, and (3) selective refinement in the late layers. Our framework helps explain why embedding tasks perform best at intermediate layers, whereas generation tasks benefit from full-depth processing, clarifying differences in task-dependent representations.

Title: Valid Stopping for LLM Generation via Empirical Dynamic Formal Lift

Authors: Sanjeda Akter, Ibne Farabi Shihab, Anuj Sharma
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2510.06478
Pdf URL: https://arxiv.org/pdf/2510.06478
Copy Paste: [[2510.06478]] Valid Stopping for LLM Generation via Empirical Dynamic Formal Lift(https://arxiv.org/abs/2510.06478)
Keywords: generation
Abstract: We introduce Sequential-EDFL (Empirical Dynamic Formal Lift), applying anytime-valid sequential testing to language model generation stopping. Our approach tracks information lift -- the log-likelihood ratio between full models and deliberately weakened "skeleton" baselines -- using self-normalized empirical-Bernstein e-processes that provide formal delta-level error control regardless of stopping time. We handle unknown centering through online mean estimation, combine multiple parameters via mixture e-processes, and support adaptive resets under distributional drift. On six benchmarks, Sequential-EDFL reduces generation by 22-28% vs. sequential baselines while maintaining delta-level control with 12% computational overhead. We introduce automated skeletons (distilled submodels, randomized logits) and show robustness across skeleton families. Composing EDFL with a lightweight correctness gate (sentence boundaries + verifier) improves end-task correctness while preserving anytime-valid guarantees by only delaying stopping. Our certificates control information sufficiency, not factual correctness -- 10.9% of stopped sequences remain incorrect even with the gate (13.2-22.7% without it). EDFL serves as a first-stage filter reducing verification burden by 83%, not as a standalone solution for safety-critical domains.
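
The abstract above relies on anytime-valid sequential testing with e-processes. The sketch below shows the basic mechanism with a much simpler fixed-bet e-process, not the self-normalized empirical-Bernstein construction the paper uses. It assumes each observation lies in [0, 1] (for example, a hypothetical normalized per-token lift), tests the null hypothesis that the mean is at most `null_mean`, and stops once the wealth reaches 1/delta. By Ville's inequality, under the null this threshold is ever crossed with probability at most delta, whatever the stopping time.

```python
def stopping_step(observations, null_mean, bet, delta):
    """Return the 1-based step at which the e-process first reaches 1/delta,
    or None if it never does.

    Assumes observations in [0, 1] and 0 <= bet <= 1 / null_mean, so each
    factor 1 + bet * (x - null_mean) is non-negative and has expectation at
    most 1 under the null.
    """
    wealth = 1.0
    for step, x in enumerate(observations, start=1):
        wealth *= 1.0 + bet * (x - null_mean)
        if wealth >= 1.0 / delta:
            return step
    return None


# Illustrative streams only: strong evidence stops early, no evidence never stops.
print(stopping_step([1.0] * 20, null_mean=0.5, bet=1.0, delta=0.05))
print(stopping_step([0.5] * 20, null_mean=0.5, bet=1.0, delta=0.05))
```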

Title: Text2Interact: High-Fidelity and Diverse Text-to-Two-Person Interaction Generation

Authors: Qingxuan Wu, Zhiyang Dou, Chuan Guo, Yiming Huang, Qiao Feng, Bing Zhou, Jian Wang, Lingjie Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.06504
Pdf URL: https://arxiv.org/pdf/2510.06504
Copy Paste: [[2510.06504]] Text2Interact: High-Fidelity and Diverse Text-to-Two-Person Interaction Generation(https://arxiv.org/abs/2510.06504)
Keywords: generation
Abstract: Modeling human-human interactions from text remains challenging because it requires not only realistic individual dynamics but also precise, text-consistent spatiotemporal coupling between agents. Currently, progress is hindered by 1) limited two-person training data, inadequate to capture the diverse intricacies of two-person interactions; and 2) insufficiently fine-grained text-to-interaction modeling, where language conditioning collapses rich, structured prompts into a single sentence embedding. To address these limitations, we propose our Text2Interact framework, designed to generate realistic, text-aligned human-human interactions through a scalable high-fidelity interaction data synthesizer and an effective spatiotemporal coordination pipeline. First, we present InterCompose, a scalable synthesis-by-composition pipeline that aligns LLM-generated interaction descriptions with strong single-person motion priors. Given a prompt and a motion for an agent, InterCompose retrieves candidate single-person motions, trains a conditional reaction generator for another agent, and uses a neural motion evaluator to filter weak or misaligned samples, expanding interaction coverage without extra capture. Second, we propose InterActor, a text-to-interaction model with word-level conditioning that preserves token-level cues (initiation, response, contact ordering) and an adaptive interaction loss that emphasizes contextually relevant inter-person joint pairs, improving coupling and physical plausibility for fine-grained interaction modeling. Extensive experiments show consistent gains in motion diversity, fidelity, and generalization, including out-of-distribution scenarios and user studies. We will release code and models to facilitate reproducibility.

Title: Text-to-Image Models Leave Identifiable Signatures: Implications for Leaderboard Security

Authors: Ali Naseh, Anshuman Suri, Yuefeng Peng, Harsh Chaudhari, Alina Oprea, Amir Houmansadr
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2510.06525
Pdf URL: https://arxiv.org/pdf/2510.06525
Copy Paste: [[2510.06525]] Text-to-Image Models Leave Identifiable Signatures: Implications for Leaderboard Security(https://arxiv.org/abs/2510.06525)
Keywords: generative
Abstract: Generative AI leaderboards are central to evaluating model capabilities, but remain vulnerable to manipulation. Among key adversarial objectives is rank manipulation, where an attacker must first deanonymize the models behind displayed outputs -- a threat previously demonstrated and explored for large language models (LLMs). We show that this problem can be even more severe for text-to-image leaderboards, where deanonymization is markedly easier. Using over 150,000 generated images from 280 prompts and 19 diverse models spanning multiple organizations, architectures, and sizes, we demonstrate that simple real-time classification in CLIP embedding space identifies the generating model with high accuracy, even without prompt control or historical data. We further introduce a prompt-level separability metric and identify prompts that enable near-perfect deanonymization. Our results indicate that rank manipulation in text-to-image leaderboards is easier than previously recognized, underscoring the need for stronger defenses.
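To make "simple classification in embedding space" concrete, here is a minimal sketch of a nearest-centroid classifier using cosine similarity. The abstract does not say which classifier the authors use, so this is an assumed stand-in; the 3-dimensional vectors are toy placeholders for CLIP image embeddings, and the labels `model_a` and `model_b` are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def fit_centroids(labeled_embeddings):
    """Average the unit-normalized embeddings of each generating model."""
    sums, counts = {}, {}
    for label, emb in labeled_embeddings:
        unit = normalize(emb)
        acc = sums.setdefault(label, [0.0] * len(unit))
        for i, x in enumerate(unit):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: normalize([x / counts[label] for x in acc])
            for label, acc in sums.items()}

def predict(centroids, emb):
    """Return the label whose centroid is most cosine-similar to emb."""
    unit = normalize(emb)
    return max(centroids,
               key=lambda label: sum(a * b for a, b in zip(centroids[label], unit)))

# Toy stand-ins for embeddings of images produced by two hypothetical models.
train = [("model_a", [1.0, 0.1, 0.0]), ("model_a", [0.9, 0.2, 0.1]),
         ("model_b", [0.0, 1.0, 0.2]), ("model_b", [0.1, 0.9, 0.1])]
centroids = fit_centroids(train)
```

In this toy setup, an unseen embedding is attributed to whichever model's centroid it lies closest to, which is the kind of lightweight attack the abstract describes as sufficient for deanonymization.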

Title: VUGEN: Visual Understanding priors for GENeration

Authors: Xiangyi Chen, Théophane Vallaeys, Maha Elbayad, John Nguyen, Jakob Verbeek
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.06529
Pdf URL: https://arxiv.org/pdf/2510.06529
Copy Paste: [[2510.06529]] VUGEN: Visual Understanding priors for GENeration(https://arxiv.org/abs/2510.06529)
Keywords: generation
Abstract: Recent advances in Vision-Language Models (VLMs) have enabled unified understanding across text and images, yet equipping these models with robust image generation capabilities remains challenging. Existing approaches often rely on reconstruction-oriented autoencoders or complex bridging mechanisms, leading to misalignment between understanding and generation representations, or architectural complexity. In this work, we propose VUGEN, a novel framework that explicitly leverages the VLM's pretrained visual understanding priors for efficient and high-quality image generation. Our approach first transforms the high-dimensional latent space of the VLM's native vision encoder into a lower-dimensional, tractable distribution that maximally preserves visual information. The VLM is then trained to sample within this reduced latent space, ensuring alignment with its visual understanding capabilities. Finally, a dedicated pixel decoder maps these generated latents back to the image space. We find a VAE-free pixel diffusion decoder to be on par with or better than commonly used complex latent diffusion decoders that internally rely on VAE latents. Extensive experiments demonstrate that VUGEN achieves superior image generation performance, improving DPG Bench from 71.17 to 74.32 and FID from 11.86 to 9.06 on COCO, while fully preserving the VLM's original understanding capabilities.

Title: Incoherence in goal-conditioned autoregressive models

Authors: Jacek Karwowski, Raymond Douglas
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2510.06545
Pdf URL: https://arxiv.org/pdf/2510.06545
Copy Paste: [[2510.06545]] Incoherence in goal-conditioned autoregressive models(https://arxiv.org/abs/2510.06545)
Keywords: generative
Abstract: We investigate mathematically the notion of incoherence: a structural issue with reinforcement learning policies derived by naive goal-conditioning of autoregressive models. We focus on the process of re-training models on their own actions, that is, fine-tuning offline-learned policies with online RL. We prove that it decreases incoherence and leads to an improvement in return, and we aim to characterize the resulting trajectory of policies. By re-framing standard notions of control-as-inference and soft Q learning, we establish a three-way correspondence with two other ways of understanding the iterative re-training process: as folding the posterior into the reward and, in the deterministic case, as decreasing the temperature parameter; the correspondence has computational content via the training-inference trade-off. Through soft-conditioning generative models, we discuss the link between incoherence and the effective horizon.

Title: HSNet: Heterogeneous Subgraph Network for Single Image Super-resolution

Authors: Qiongyang Hu, Wenyang Liu, Wenbin Zou, Yuejiao Su, Lap-Pui Chau, Yi Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2510.06564
Pdf URL: https://arxiv.org/pdf/2510.06564
Copy Paste: [[2510.06564]] HSNet: Heterogeneous Subgraph Network for Single Image Super-resolution(https://arxiv.org/abs/2510.06564)
Keywords: super-resolution
Abstract: Existing deep learning approaches for image super-resolution, particularly those based on CNNs and attention mechanisms, often suffer from structural inflexibility. Although graph-based methods offer greater representational adaptability, they are frequently impeded by excessive computational complexity. To overcome these limitations, this paper proposes the Heterogeneous Subgraph Network (HSNet), a novel framework that efficiently leverages graph modeling while maintaining computational feasibility. The core idea of HSNet is to decompose the global graph into manageable sub-components. First, we introduce the Constructive Subgraph Set Block (CSSB), which generates a diverse set of complementary subgraphs. Rather than relying on a single monolithic graph, CSSB captures heterogeneous characteristics of the image by modeling different relational patterns and feature interactions, producing a rich ensemble of both local and global graph structures. Subsequently, the Subgraph Aggregation Block (SAB) integrates the representations embedded across these subgraphs. Through adaptive weighting and fusion of multi-graph features, SAB constructs a comprehensive and discriminative representation that captures intricate interdependencies. Furthermore, a Node Sampling Strategy (NSS) is designed to selectively retain the most salient features, thereby enhancing accuracy while reducing computational overhead. Extensive experiments demonstrate that HSNet achieves state-of-the-art performance, effectively balancing reconstruction quality with computational efficiency. The code will be made publicly available.

Title: Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer

Authors: Ziyuan Huang, DanDan Zheng, Cheng Zou, Rui Liu, Xiaolong Wang, Kaixiang Ji, Weilong Chai, Jianxin Sun, Libin Wang, Yongjie Lv, Taozhi Huang, Jiajia Liu, Qingpei Guo, Ming Yang, Jingdong Chen, Jun Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.06590
Pdf URL: https://arxiv.org/pdf/2510.06590
Copy Paste: [[2510.06590]] Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer(https://arxiv.org/abs/2510.06590)
Keywords: generation
Abstract: Visual tokenization remains a core challenge in unifying visual understanding and generation within the autoregressive paradigm. Existing methods typically employ tokenizers in discrete latent spaces to align with the tokens from large language models, where the quantization errors can limit semantic expressiveness and degrade the capability of vision-language understanding. To address this, we introduce MingTok, a new family of visual tokenizers with a continuous latent space, for unified autoregressive generation and understanding. While understanding tasks favor discriminative high-dimensional features, generation tasks prefer compact low-level codes. Thus, to reconcile these competing demands, MingTok adopts a three-stage sequential architecture involving low-level encoding, semantic expansion, and visual reconstruction. Built on top of it, Ming-UniVision eliminates the need for task-specific visual representations, and unifies diverse vision-language tasks under a single autoregressive prediction paradigm. By formulating both understanding and generation as next-token prediction in a shared continuous space, it seamlessly supports multi-round, in-context tasks such as iterative understanding, generation and editing. Empirically, we find that using a unified continuous visual representation reconciles the competing requirements placed on the tokenizers by the understanding and generation tasks, thereby leading to state-of-the-art performance across both domains. We hope our findings will facilitate unified visual tokenization in the continuous domain. Inference code and model weights are released to benefit the community.

Title: SDQM: Synthetic Data Quality Metric for Object Detection Dataset Evaluation

Authors: Ayush Zenith, Arnold Zumbrun, Neel Raut, Jing Lin
Subjects: cs.CV, cs.AI, cs.IT, cs.LG
Abstract URL: https://arxiv.org/abs/2510.06596
Pdf URL: https://arxiv.org/pdf/2510.06596
Copy Paste: [[2510.06596]] SDQM: Synthetic Data Quality Metric for Object Detection Dataset Evaluation(https://arxiv.org/abs/2510.06596)
Keywords: generation, generative
Abstract: The performance of machine learning models depends heavily on training data. The scarcity of large-scale, well-annotated datasets poses significant challenges in creating robust models. To address this, synthetic data generated through simulations and generative models has emerged as a promising solution, enhancing dataset diversity and improving the performance, reliability, and resilience of models. However, evaluating the quality of this generated data requires an effective metric. This paper introduces the Synthetic Dataset Quality Metric (SDQM) to assess data quality for object detection tasks without requiring model training to converge. This metric enables more efficient generation and selection of synthetic datasets, addressing a key challenge in resource-constrained object detection tasks. In our experiments, SDQM demonstrated a strong correlation with the mean Average Precision (mAP) scores of YOLOv11, a leading object detection model, while previous metrics only exhibited moderate or weak correlations. Additionally, it provides actionable insights for improving dataset quality, minimizing the need for costly iterative training. This scalable and efficient metric sets a new standard for evaluating synthetic data. The code for SDQM is available at this https URL

Title: AIM 2025 Challenge on Real-World RAW Image Denoising

Authors: Feiran Li, Jiacheng Li, Marcos V. Conde, Beril Besbinar, Vlad Hosu, Daisuke Iso, Radu Timofte
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.06601
Pdf URL: https://arxiv.org/pdf/2510.06601
Copy Paste: [[2510.06601]] AIM 2025 Challenge on Real-World RAW Image Denoising(https://arxiv.org/abs/2510.06601)
Keywords: restoration
Abstract: We introduce the AIM 2025 Real-World RAW Image Denoising Challenge, aiming to advance efficient and effective denoising techniques grounded in data synthesis. The competition is built upon a newly established evaluation benchmark featuring challenging low-light noisy images captured in the wild using five different DSLR cameras. Participants are tasked with developing novel noise synthesis pipelines, network architectures, and training methodologies to achieve high performance across different camera models. Winners are determined based on a combination of performance metrics, including full-reference measures (PSNR, SSIM, LPIPS), and non-reference ones (ARNIQA, TOPIQ). By pushing the boundaries of camera-agnostic low-light RAW image denoising trained on synthetic data, the competition promotes the development of robust and practical models aligned with the rapid progress in digital photography. We expect the competition outcomes to influence multiple domains, from image restoration to night-time autonomous driving.

Title: POME: Post Optimization Model Edit via Muon-style Projection

Authors: Yong Liu, Di Fu, Yang Luo, Zirui Zhu, Minhao Cheng, Cho-Jui Hsieh, Yang You
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2510.06627
Pdf URL: https://arxiv.org/pdf/2510.06627
Copy Paste: [[2510.06627]] POME: Post Optimization Model Edit via Muon-style Projection(https://arxiv.org/abs/2510.06627)
Keywords: generation
Abstract: We introduce Post-Optimization Model Edit (POME), a new algorithm that enhances the performance of fine-tuned large language models using only their pretrained and fine-tuned checkpoints, without requiring extra data or further optimization. The core idea is to apply a muon-style projection to $\Delta W$, the difference between the fine-tuned and pretrained weights. This projection uses truncated singular value decomposition (SVD) to equalize the influence of dominant update directions and prune small singular values, which often represent noise. As a simple post-processing step, POME is completely decoupled from the training pipeline. It requires zero modifications and imposes no overhead, making it universally compatible with any optimizer or distributed framework. POME delivers consistent gains, boosting average performance by +2.5% on GSM8K and +1.0% on code generation. Its broad applicability -- from 7B foundation models to 72B RLHF-instructed models -- establishes it as a practical, zero-cost enhancement for any fine-tuning pipeline. Code is available at this https URL.
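The projection described in the abstract can be sketched as follows with NumPy. This is an illustrative reading, not the authors' released code: the function name `pome_like_edit`, the `rank` argument, and the choice to rescale the kept directions by their mean singular value are all assumptions made for this example.

```python
import numpy as np

def pome_like_edit(w_pre, w_ft, rank, scale=None):
    """Project the fine-tuning update onto its top-`rank` singular directions
    and give every kept direction the same singular value."""
    delta = w_ft - w_pre                       # Delta W from the abstract
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    k = min(rank, s.shape[0])                  # truncation prunes small singular values
    if scale is None:
        scale = s[:k].mean()                   # assumed rescaling, not from the source
    delta_proj = scale * (u[:, :k] @ vt[:k, :])  # equalized top-k update
    return w_pre + delta_proj

# Toy weights standing in for one layer of a pretrained and a fine-tuned model.
rng = np.random.default_rng(0)
w_pre = rng.normal(size=(6, 4))
w_ft = w_pre + rng.normal(scale=0.1, size=(6, 4))
w_new = pome_like_edit(w_pre, w_ft, rank=2)
```

After the edit, the update `w_new - w_pre` has exactly `rank` nonzero singular values, all equal, which mirrors the "equalize dominant directions, prune the rest" description.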

Title: Three Forms of Stochastic Injection for Improved Distribution-to-Distribution Generative Modeling

Authors: Shiye Su, Yuhui Zhang, Linqi Zhou, Rajesh Ranganath, Serena Yeung-Levy
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2510.06634
Pdf URL: https://arxiv.org/pdf/2510.06634
Copy Paste: [[2510.06634]] Three Forms of Stochastic Injection for Improved Distribution-to-Distribution Generative Modeling(https://arxiv.org/abs/2510.06634)
Keywords: generation, generative
Abstract: Modeling transformations between arbitrary data distributions is a fundamental scientific challenge, arising in applications like drug discovery and evolutionary simulation. While flow matching offers a natural framework for this task, its use has thus far primarily focused on the noise-to-data setting, while its application in the general distribution-to-distribution setting is underexplored. We find that in the latter case, where the source is also a data distribution to be learned from limited samples, standard flow matching fails due to sparse supervision. To address this, we propose a simple and computationally efficient method that injects stochasticity into the training process by perturbing source samples and flow interpolants. On five diverse imaging tasks spanning biology, radiology, and astronomy, our method significantly improves generation quality, outperforming existing baselines by an average of 9 FID points. Our approach also reduces the transport cost between input and generated samples to better highlight the true effect of the transformation, making flow matching a more practical tool for simulating the diverse distribution transformations that arise in science.
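As a rough illustration of injecting stochasticity into flow matching, the sketch below builds one training pair with Gaussian noise added to the source sample and to the linear interpolant. The abstract does not give the exact noise forms or scales, so `sigma_src`, `sigma_interp`, the linear interpolation path, and the velocity target are assumptions for this example.

```python
import random

def noisy_flow_matching_pair(x0, x1, t, sigma_src=0.1, sigma_interp=0.05, rng=random):
    """Return (x_t, target_velocity) for one source/target pair at time t in [0, 1]."""
    # Perturb the source sample so the model sees a neighborhood around each source point.
    x0_pert = [a + sigma_src * rng.gauss(0.0, 1.0) for a in x0]
    # Linear interpolant between perturbed source and target, with extra noise injected.
    x_t = [(1.0 - t) * a + t * b + sigma_interp * rng.gauss(0.0, 1.0)
           for a, b in zip(x0_pert, x1)]
    # Standard flow-matching velocity target for a straight-line path.
    target = [b - a for a, b in zip(x0_pert, x1)]
    return x_t, target
```

With both noise scales set to zero this reduces to ordinary flow matching on the straight line from `x0` to `x1`, which makes the two injection points easy to ablate separately.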

Title: StaR-KVQA: Structured Reasoning Traces for Implicit-Knowledge Visual Question Answering

Authors: Zhihao Wen, Wenkang Wei, Yuan Fang, Xingtong Yu, Hui Zhang, Weicheng Zhu, Xin Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2510.06638
Pdf URL: https://arxiv.org/pdf/2510.06638
Copy Paste: [[2510.06638]] StaR-KVQA: Structured Reasoning Traces for Implicit-Knowledge Visual Question Answering(https://arxiv.org/abs/2510.06638)
Keywords: generation
Abstract: Knowledge-based Visual Question Answering (KVQA) requires models to ground entities in images and reason over factual knowledge. We study its implicit-knowledge variant, IK-KVQA, where a multimodal large language model (MLLM) is the sole knowledge source, without external retrieval. Yet MLLMs lack explicit reasoning supervision, produce inconsistent justifications, and generalize poorly after standard supervised fine-tuning (SFT). We present StaR-KVQA (Structured Reasoning Traces for IK-KVQA), which supervises structured traces (dual symbolic relation paths plus path-grounded natural-language explanations) so that reasoning becomes transparent and verifiable. With one open-source MLLM, StaR-KVQA constructs and selects path-grounded reasoning traces to form a trace-enriched dataset, then fine-tunes via structured self-distillation to align generation with supervision; no external retrievers, verifiers, or curated knowledge bases (KBs) are used, traces are built offline, and inference is a single autoregressive pass. Across benchmarks, StaR-KVQA improves both accuracy and interpretability, achieving up to 11.3% higher answer accuracy on OK-VQA over the strongest baseline while exhibiting robust cross-domain generalization.

Title: The False Promise of Zero-Shot Super-Resolution in Machine-Learned Operators

Authors: Mansi Sakarvadia, Kareem Hegazy, Amin Totounferoush, Kyle Chard, Yaoqing Yang, Ian Foster, Michael W. Mahoney
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2510.06646
Pdf URL: https://arxiv.org/pdf/2510.06646
Copy Paste: [[2510.06646]] The False Promise of Zero-Shot Super-Resolution in Machine-Learned Operators(https://arxiv.org/abs/2510.06646)
Keywords: super-resolution
Abstract: A core challenge in scientific machine learning, and scientific computing more generally, is modeling continuous phenomena which (in practice) are represented discretely. Machine-learned operators (MLOs) have been introduced as a means to achieve this modeling goal, as this class of architecture can perform inference at arbitrary resolution. In this work, we evaluate whether this architectural innovation is sufficient to perform "zero-shot super-resolution," namely to enable a model to serve inference on higher-resolution data than that on which it was originally trained. We comprehensively evaluate both zero-shot sub-resolution and super-resolution (i.e., multi-resolution) inference in MLOs. We decouple multi-resolution inference into two key behaviors: 1) extrapolation to varying frequency information; and 2) interpolating across varying resolutions. We empirically demonstrate that MLOs fail to do both of these tasks in a zero-shot manner. Consequently, we find MLOs are not able to perform accurate inference at resolutions different from those on which they were trained, and instead they are brittle and susceptible to aliasing. To address these failure modes, we propose a simple, computationally-efficient, and data-driven multi-resolution training protocol that overcomes aliasing and that provides robust multi-resolution generalization.
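The aliasing failure mode mentioned above is a general sampling effect and can be shown with a small, self-contained example that is independent of any particular MLO. On an 8-point grid, a sine with 9 cycles is indistinguishable from one with 1 cycle; on a 64-point grid the two are clearly different. The frequencies and grid sizes are arbitrary choices for illustration.

```python
import math

def sample_sine(freq, n_points):
    """Sample sin(2*pi*freq*x) on n_points equally spaced points in [0, 1)."""
    return [math.sin(2.0 * math.pi * freq * i / n_points) for i in range(n_points)]

# Coarse grid: frequency 9 aliases onto frequency 1 (9 mod 8 == 1).
coarse_high = sample_sine(9, 8)
coarse_low = sample_sine(1, 8)

# Fine grid: the two signals are resolved and differ.
fine_high = sample_sine(9, 64)
fine_low = sample_sine(1, 64)

coarse_gap = max(abs(a - b) for a, b in zip(coarse_high, coarse_low))
fine_gap = max(abs(a - b) for a, b in zip(fine_high, fine_low))
```

A model trained only on the coarse grid never sees evidence that distinguishes the two signals, which is one way to see why inference at a finer resolution cannot be expected to work without multi-resolution training data.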

Title: Heptapod: Language Modeling on Visual Signals

Authors: Yongxin Zhu, Jiawei Chen, Yuanzhe Chen, Zhuo Chen, Dongya Jia, Jian Cong, Xiaobin Zhuang, Yuping Wang, Yuxuan Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2510.06673
Pdf URL: https://arxiv.org/pdf/2510.06673
Copy Paste: [[2510.06673]] Heptapod: Language Modeling on Visual Signals(https://arxiv.org/abs/2510.06673)
Keywords: generation, generative
Abstract: We introduce Heptapod, an image autoregressive model that adheres to the foundational principles of language modeling. Heptapod employs causal attention, eliminates reliance on CFG, and eschews the trend of semantic tokenizers. Our key innovation is next 2D distribution prediction: a causal Transformer with a reconstruction-focused visual tokenizer learns to predict the distribution over the entire 2D spatial grid of images at each timestep. This learning objective unifies the sequential modeling of the autoregressive framework with the holistic self-supervised learning of masked autoencoding, enabling the model to capture comprehensive image semantics via generative training. On the ImageNet generation benchmark, Heptapod achieves an FID of 2.70, significantly outperforming previous causal autoregressive approaches. We hope our work inspires a principled rethinking of language modeling on visual signals and beyond.

Title: DreamOmni2: Multimodal Instruction-based Editing and Generation

Authors: Bin Xia, Bohao Peng, Yuechen Zhang, Junjia Huang, Jiyang Liu, Jingyao Li, Haoru Tan, Sitong Wu, Chengyao Wang, Yitong Wang, Xinglong Wu, Bei Yu, Jiaya Jia
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.06679
Pdf URL: https://arxiv.org/pdf/2510.06679
Copy Paste: [[2510.06679]] DreamOmni2: Multimodal Instruction-based Editing and Generation(https://arxiv.org/abs/2510.06679)
Keywords: generation
Abstract: Recent advancements in instruction-based image editing and subject-driven generation have garnered significant attention, yet both tasks still face limitations in meeting practical user needs. Instruction-based editing relies solely on language instructions, which often fail to capture specific editing details, making reference images necessary. Meanwhile, subject-driven generation is limited to combining concrete objects or people, overlooking broader, abstract concepts. To address these challenges, we propose two novel tasks: multimodal instruction-based editing and generation. These tasks support both text and image instructions and extend the scope to include both concrete and abstract concepts, greatly enhancing their practical applications. We introduce DreamOmni2, tackling two primary challenges: data creation and model framework design. Our data synthesis pipeline consists of three steps: (1) using a feature mixing method to create extraction data for both abstract and concrete concepts, (2) generating multimodal instruction-based editing training data using the editing and extraction models, and (3) further applying the extraction model to create training data for multimodal instruction-based editing. For the framework, to handle multi-image input, we propose an index encoding and position encoding shift scheme, which helps the model distinguish images and avoid pixel confusion. We also introduce joint training with the VLM and our generation/editing model to better process complex instructions, and we propose comprehensive benchmarks for these two new tasks to drive their development. Experiments show that DreamOmni2 achieves impressive results. Models and code will be released.

Title: A Diffusion Model for Regular Time Series Generation from Irregular Data with Completion and Masking

Authors: Gal Fadlon, Idan Arbiv, Nimrod Berman, Omri Azencot
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2510.06699
Pdf URL: https://arxiv.org/pdf/2510.06699
Copy Paste: [[2510.06699]] A Diffusion Model for Regular Time Series Generation from Irregular Data with Completion and Masking(https://arxiv.org/abs/2510.06699)
Keywords: generation, generative
Abstract: Generating realistic time series data is critical for applications in healthcare, finance, and science. However, irregular sampling and missing values present significant challenges. While prior methods address these irregularities, they often yield suboptimal results and incur high computational costs. Recent advances in regular time series generation, such as the diffusion-based ImagenTime model, demonstrate strong, fast, and scalable generative capabilities by transforming time series into image representations, making them a promising solution. However, extending ImagenTime to irregular sequences using simple masking introduces "unnatural" neighborhoods, where missing values replaced by zeros disrupt the learning process. To overcome this, we propose a novel two-step framework: first, a Time Series Transformer completes irregular sequences, creating natural neighborhoods; second, a vision-based diffusion model with masking minimizes dependence on the completed values. This approach leverages the strengths of both completion and masking, enabling robust and efficient generation of realistic time series. Our method achieves state-of-the-art performance, with relative improvements of 70% in discriminative score and 85% in computational cost. Code is at this https URL.
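One plausible way to "minimize dependence on the completed values" is to compute the training loss only over entries that were observed in the original irregular series. The sketch below shows such a masked mean squared error; it is an assumed illustration of the masking idea, not the loss from the paper, and `masked_mse` is a hypothetical helper name.

```python
def masked_mse(pred, target, observed_mask):
    """Mean squared error over entries whose mask is truthy (originally observed).
    Entries filled in by the completion step (mask 0) do not contribute."""
    total, count = 0.0, 0
    for p, y, m in zip(pred, target, observed_mask):
        if m:
            total += (p - y) ** 2
            count += 1
    return total / count if count else 0.0

# The middle value was imputed, so its large error is ignored.
loss = masked_mse([1.0, 5.0, 3.0], [1.0, 0.0, 1.0], [1, 0, 1])
```

The completed values still shape the input neighborhoods the model sees, while the mask keeps imputation errors from being treated as ground truth.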

Title: Evaluating LLMs for Historical Document OCR: A Methodological Framework for Digital Humanities

Authors: Maria Levchenko
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2510.06743
Pdf URL: https://arxiv.org/pdf/2510.06743
Copy Paste: [[2510.06743]] Evaluating LLMs for Historical Document OCR: A Methodological Framework for Digital Humanities(https://arxiv.org/abs/2510.06743)
Keywords: quality assessment
Abstract: Digital humanities scholars increasingly use Large Language Models for historical document digitization, yet lack appropriate evaluation frameworks for LLM-based OCR. Traditional metrics fail to capture temporal biases and period-specific errors crucial for historical corpus creation. We present an evaluation methodology for LLM-based historical OCR, addressing contamination risks and systematic biases in diplomatic transcription. Using 18th-century Russian Civil font texts, we introduce novel metrics including Historical Character Preservation Rate (HCPR) and Archaic Insertion Rate (AIR), alongside protocols for contamination control and stability testing. We evaluate 12 multimodal LLMs, finding that Gemini and Qwen models outperform traditional OCR while exhibiting over-historicization: inserting archaic characters from incorrect historical periods. Post-OCR correction degrades rather than improves performance. Our methodology provides digital humanities practitioners with guidelines for model selection and quality assessment in historical corpus digitization.

Title: Extreme Amodal Face Detection

Authors: Changlin Song, Yunzhong Hou, Michael Randall Barnes, Rahul Shome, Dylan Campbell
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2510.06791
Pdf URL: https://arxiv.org/pdf/2510.06791
Copy Paste: [[2510.06791]] Extreme Amodal Face Detection(https://arxiv.org/abs/2510.06791)
Keywords: generative
Abstract: Extreme amodal detection is the task of inferring the 2D location of objects that are not fully visible in the input image but are visible within an expanded field-of-view. This differs from amodal detection, where the object is partially visible within the input image, but is occluded. In this paper, we consider the sub-problem of face detection, since this class provides motivating applications involving safety and privacy, but do not tailor our method specifically to this class. Existing approaches rely on image sequences so that missing detections may be interpolated from surrounding frames or make use of generative models to sample possible completions. In contrast, we consider the single-image task and propose a more efficient, sample-free approach that makes use of the contextual cues from the image to infer the presence of unseen faces. We design a heatmap-based extreme amodal object detector that addresses the problem of efficiently predicting a lot (the out-of-frame region) from a little (the image) with a selective coarse-to-fine decoder. Our method establishes strong results for this new task, even outperforming less efficient generative approaches.

Title: StyleKeeper: Prevent Content Leakage using Negative Visual Query Guidance

Authors: Jaeseok Jeong, Junho Kim, Gayoung Lee, Yunjey Choi, Youngjung Uh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.06827
Pdf URL: https://arxiv.org/pdf/2510.06827
Copy Paste: [[2510.06827]] StyleKeeper: Prevent Content Leakage using Negative Visual Query Guidance(https://arxiv.org/abs/2510.06827)
Keywords: generation
Abstract: In the domain of text-to-image generation, diffusion models have emerged as powerful tools. Recently, studies on visual prompting, where images are used as prompts, have enabled more precise control over style and content. However, existing methods often suffer from content leakage, where undesired elements of the visual style prompt are transferred along with the intended style. To address this issue, we 1) extend classifier-free guidance (CFG) to utilize swapping self-attention and 2) propose negative visual query guidance (NVQG) to reduce the transfer of unwanted contents. NVQG employs a negative score by intentionally simulating content leakage scenarios that swap queries instead of keys and values of self-attention layers from visual style prompts. This simple yet effective method significantly reduces content leakage. Furthermore, we provide careful solutions for using a real image as a visual style prompt. Through extensive evaluation across various styles and text prompts, our method demonstrates superiority over existing approaches, reflecting the style of the references and ensuring that resulting images match the text prompts. Our code is available at this https URL.
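
For readers unfamiliar with the guidance rule this abstract builds on, the following is a minimal sketch of standard classifier-free guidance only. It does not implement NVQG or swapping self-attention, whose details the abstract does not give. The function name and the use of NumPy arrays in place of a real denoiser's noise predictions are assumptions made for illustration.

```python
import numpy as np

def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Combine unconditional and conditional noise predictions.

    Standard CFG rule: eps = eps_uncond + s * (eps_cond - eps_uncond).
    The inputs stand in for the outputs of a denoising network
    (hypothetical here; no real model is called).
    """
    eps_uncond = np.asarray(eps_uncond, dtype=float)
    eps_cond = np.asarray(eps_cond, dtype=float)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Toy predictions for a 2x2 latent.
uncond = np.zeros((2, 2))
cond = np.ones((2, 2))
guided = classifier_free_guidance(uncond, cond, guidance_scale=7.5)
print(guided)
```

A scale of 1 reproduces the conditional prediction, and a scale of 0 reproduces the unconditional one. Larger scales push the sample further toward the condition.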

Title: Continual Action Quality Assessment via Adaptive Manifold-Aligned Graph Regularization

Authors: Kanglei Zhou, Qingyi Pan, Xingxing Zhang, Hubert P. H. Shum, Frederick W. B. Li, Xiaohui Liang, Liyuan Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.06842
Pdf URL: https://arxiv.org/pdf/2510.06842
Copy Paste: [[2510.06842]] Continual Action Quality Assessment via Adaptive Manifold-Aligned Graph Regularization(https://arxiv.org/abs/2510.06842)
Keywords: quality assessment
Abstract: Action Quality Assessment (AQA) quantifies human actions in videos, supporting applications in sports scoring, rehabilitation, and skill evaluation. A major challenge lies in the non-stationary nature of quality distributions in real-world scenarios, which limits the generalization ability of conventional methods. We introduce Continual AQA (CAQA), which equips AQA with Continual Learning (CL) capabilities to handle evolving distributions while mitigating catastrophic forgetting. Although parameter-efficient fine-tuning of pretrained models has shown promise in CL for image classification, we find it insufficient for CAQA. Our empirical and theoretical analyses reveal two insights: (i) Full-Parameter Fine-Tuning (FPFT) is necessary for effective representation learning; yet (ii) uncontrolled FPFT induces overfitting and feature manifold shift, thereby aggravating forgetting. To address this, we propose Adaptive Manifold-Aligned Graph Regularization (MAGR++), which couples backbone fine-tuning that stabilizes shallow layers while adapting deeper ones with a two-step feature rectification pipeline: a manifold projector to translate deviated historical features into the current representation space, and a graph regularizer to align local and global distributions. We construct four CAQA benchmarks from three datasets with tailored evaluation protocols and strong baselines, enabling systematic cross-dataset comparison. Extensive experiments show that MAGR++ achieves state-of-the-art performance, with average correlation gains of 3.6% offline and 12.2% online over the strongest baseline, confirming its robustness and effectiveness. Our code is available at this https URL.

Title: SaFeR-VLM: Toward Safety-aware Fine-grained Reasoning in Multimodal Models

Authors: Huahui Yi, Kun Wang, Qiankun Li, Miao Yu, Liang Lin, Gongli Xi, Hao Wu, Xuming Hu, Kang Li, Yang Liu
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2510.06871
Pdf URL: https://arxiv.org/pdf/2510.06871
Copy Paste: [[2510.06871]] SaFeR-VLM: Toward Safety-aware Fine-grained Reasoning in Multimodal Models(https://arxiv.org/abs/2510.06871)
Keywords: generation
Abstract: Multimodal Large Reasoning Models (MLRMs) demonstrate impressive cross-modal reasoning but often amplify safety risks under adversarial or unsafe prompts, a phenomenon we call the Reasoning Tax. Existing defenses mainly act at the output level and do not constrain the reasoning process, leaving models exposed to implicit risks. In this paper, we propose SaFeR-VLM, a safety-aligned reinforcement learning framework that embeds safety directly into multimodal reasoning. The framework integrates four components: (I) QI-Safe-10K, a curated dataset emphasizing safety-critical and reasoning-sensitive cases; (II) safety-aware rollout, where unsafe generations undergo reflection and correction instead of being discarded; (III) structured reward modeling with multi-dimensional weighted criteria and explicit penalties for hallucinations and contradictions; and (IV) GRPO optimization, which reinforces both safe and corrected trajectories. This unified design shifts safety from a passive safeguard to an active driver of reasoning, enabling scalable and generalizable safety-aware reasoning. SaFeR-VLM further demonstrates robustness against both explicit and implicit risks, supporting dynamic and interpretable safety decisions beyond surface-level filtering. SaFeR-VLM-3B achieves average scores of 70.13 and 78.97 on safety and helpfulness, respectively, across six benchmarks, surpassing both same-scale models and models more than 10× larger, such as Skywork-R1V3-38B, Qwen2.5VL-72B, and GLM4.5V-106B. Remarkably, SaFeR-VLM-7B benefits from its increased scale to surpass GPT-5-mini and Gemini-2.5-Flash by 6.47 and 16.76 points, respectively, on safety metrics, achieving this improvement without any degradation in helpfulness performance. Our codes are available at this https URL.

Title: Utilizing Large Language Models for Machine Learning Explainability

Authors: Alexandros Vassiliades, Nikolaos Polatidis, Stamatios Samaras, Sotiris Diplaris, Ignacio Cabrera Martin, Yannis Manolopoulos, Stefanos Vrochidis, Ioannis Kompatsiaris
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2510.06912
Pdf URL: https://arxiv.org/pdf/2510.06912
Copy Paste: [[2510.06912]] Utilizing Large Language Models for Machine Learning Explainability(https://arxiv.org/abs/2510.06912)
Keywords: generation
Abstract: This study explores the explainability capabilities of large language models (LLMs) when employed to autonomously generate machine learning (ML) solutions. We examine two classification tasks: (i) a binary classification problem focused on predicting driver alertness states, and (ii) a multilabel classification problem based on the yeast dataset. Three state-of-the-art LLMs (i.e., OpenAI GPT, Anthropic Claude, and DeepSeek) are prompted to design training pipelines for four common classifiers: Random Forest, XGBoost, Multilayer Perceptron, and Long Short-Term Memory networks. The generated models are evaluated in terms of predictive performance (recall, precision, and F1-score) and explainability using SHAP (SHapley Additive exPlanations). Specifically, we measure Average SHAP Fidelity (Mean Squared Error between SHAP approximations and model outputs) and Average SHAP Sparsity (number of features deemed influential). The results show that LLMs can produce effective and interpretable models with high fidelity and consistent sparsity, closely matching manually engineered baselines and highlighting their potential as automated tools for interpretable ML pipeline generation.
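
The two explainability metrics named in this abstract can be sketched as follows. This is one plausible reading under stated assumptions, not the authors' implementation. It assumes the "SHAP approximation" is the usual additive reconstruction (base value plus the sum of per-feature SHAP values). It also assumes a feature counts as "influential" when its mean absolute SHAP value exceeds a threshold; the value 0.01 is hypothetical. SHAP values are passed in as a plain array, so no explainer library is called.

```python
import numpy as np

def shap_fidelity(shap_values, base_value, model_outputs):
    """MSE between the additive SHAP reconstruction and the model outputs.

    shap_values: array of shape (n_samples, n_features).
    base_value: expected model output (scalar), as reported by an explainer.
    model_outputs: array of shape (n_samples,).
    """
    shap_values = np.asarray(shap_values, dtype=float)
    model_outputs = np.asarray(model_outputs, dtype=float)
    approximation = base_value + shap_values.sum(axis=1)
    return float(np.mean((approximation - model_outputs) ** 2))

def shap_sparsity(shap_values, threshold=0.01):
    """Count features whose mean |SHAP| exceeds a (hypothetical) threshold."""
    importance = np.abs(np.asarray(shap_values, dtype=float)).mean(axis=0)
    return int(np.sum(importance > threshold))

# Toy example: two samples, three features, one of which is never used.
shap_vals = np.array([[0.2, 0.0, -0.1],
                      [0.4, 0.0, 0.1]])
outputs = np.array([0.6, 1.0])
print(shap_fidelity(shap_vals, base_value=0.5, model_outputs=outputs))
print(shap_sparsity(shap_vals))
```

Under this reading, lower fidelity values mean the SHAP explanation tracks the model more closely, and lower sparsity counts mean fewer features carry the explanation.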

Title: DecompGAIL: Learning Realistic Traffic Behaviors with Decomposed Multi-Agent Generative Adversarial Imitation Learning

Authors: Ke Guo, Haochen Liu, Xiaojun Wu, Chen Lv
Subjects: cs.LG, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2510.06913
Pdf URL: https://arxiv.org/pdf/2510.06913
Copy Paste: [[2510.06913]] DecompGAIL: Learning Realistic Traffic Behaviors with Decomposed Multi-Agent Generative Adversarial Imitation Learning(https://arxiv.org/abs/2510.06913)
Keywords: generative
Abstract: Realistic traffic simulation is critical for the development of autonomous driving systems and urban mobility planning, yet existing imitation learning approaches often fail to model realistic traffic behaviors. Behavior cloning suffers from covariate shift, while Generative Adversarial Imitation Learning (GAIL) is notoriously unstable in multi-agent settings. We identify a key source of this instability: irrelevant interaction misguidance, where a discriminator penalizes an ego vehicle's realistic behavior due to unrealistic interactions among its neighbors. To address this, we propose Decomposed Multi-agent GAIL (DecompGAIL), which explicitly decomposes realism into ego-map and ego-neighbor components, filtering out misleading neighbor-neighbor and neighbor-map interactions. We further introduce a social PPO objective that augments ego rewards with distance-weighted neighborhood rewards, encouraging overall realism across agents. Integrated into a lightweight SMART-based backbone, DecompGAIL achieves state-of-the-art performance on the WOMD Sim Agents 2025 benchmark.

Title: Label-frugal satellite image change detection with generative virtual exemplar learning

Authors: Hichem Sahbi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.06926
Pdf URL: https://arxiv.org/pdf/2510.06926
Copy Paste: [[2510.06926]] Label-frugal satellite image change detection with generative virtual exemplar learning(https://arxiv.org/abs/2510.06926)
Keywords: generative
Abstract: Change detection is a major task in remote sensing that consists of finding all occurrences of change in multi-temporal satellite or aerial images. The success of existing methods, particularly deep learning ones, depends on the availability of hand-labeled training data that capture the acquisition conditions and the subjectivity of the user (oracle). In this paper, we devise a novel change detection algorithm based on active learning. The main contribution of our work is a new model that measures how important each unlabeled sample is and provides the oracle with only the most critical samples (also referred to as virtual exemplars) for further labeling. These exemplars are generated, using an invertible graph convnet, as the optimum of an adversarial loss that (i) measures the representativity, diversity and ambiguity of the data, and thereby (ii) challenges the current change detection criteria the most, leading to a better re-estimate of these criteria in subsequent iterations of active learning. Extensive experiments show the positive impact of our label-efficient learning model against comparative methods.

Title: IAR2: Improving Autoregressive Visual Generation with Semantic-Detail Associated Token Prediction

Authors: Ran Yi, Teng Hu, Zihan Su, Lizhuang Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.06928
Pdf URL: https://arxiv.org/pdf/2510.06928
Copy Paste: [[2510.06928]] IAR2: Improving Autoregressive Visual Generation with Semantic-Detail Associated Token Prediction(https://arxiv.org/abs/2510.06928)
Keywords: generation
Abstract: Autoregressive models have emerged as a powerful paradigm for visual content creation, but often overlook the intrinsic structural properties of visual data. Our prior work, IAR, initiated a direction to address this by reorganizing the visual codebook based on embedding similarity, thereby improving generation robustness. However, it is constrained by the rigidity of pre-trained codebooks and the inaccuracies of hard, uniform clustering. To overcome these limitations, we propose IAR2, an advanced autoregressive framework that enables a hierarchical semantic-detail synthesis process. At the core of IAR2 is a novel Semantic-Detail Associated Dual Codebook, which decouples image representations into a semantic codebook for global semantic information and a detail codebook for fine-grained refinements. It expands the quantization capacity from a linear to a polynomial scale, significantly enhancing expressiveness. To accommodate this dual representation, we propose a Semantic-Detail Autoregressive Prediction scheme coupled with a Local-Context Enhanced Autoregressive Head, which performs hierarchical prediction (first the semantic token, then the detail token) while leveraging a local context window to enhance spatial coherence. Furthermore, for conditional generation, we introduce a Progressive Attention-Guided Adaptive CFG mechanism that dynamically modulates the guidance scale for each token based on its relevance to the condition and its temporal position in the generation sequence, improving conditional alignment without sacrificing realism. Extensive experiments demonstrate that IAR2 sets a new state of the art for autoregressive image generation, achieving an FID of 1.50 on ImageNet. Our model not only surpasses previous methods in performance but also demonstrates superior computational efficiency, highlighting the effectiveness of our structured, coarse-to-fine generation strategy.

Title: OBJVanish: Physically Realizable Text-to-3D Adv. Generation of LiDAR-Invisible Objects

Authors: Bing Li, Wuqi Wang, Yanan Zhang, Jingzheng Li, Haigen Min, Wei Feng, Xingyu Zhao, Jie Zhang, Qing Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.06952
Pdf URL: https://arxiv.org/pdf/2510.06952
Copy Paste: [[2510.06952]] OBJVanish: Physically Realizable Text-to-3D Adv. Generation of LiDAR-Invisible Objects(https://arxiv.org/abs/2510.06952)
Keywords: generation
Abstract: LiDAR-based 3D object detectors are fundamental to autonomous driving, where failing to detect objects poses severe safety risks. Developing effective 3D adversarial attacks is essential for thoroughly testing these detection systems and exposing their vulnerabilities before real-world deployment. However, existing adversarial attacks that add optimized perturbations to 3D points have two critical limitations: they rarely cause complete object disappearance and prove difficult to implement in physical environments. We introduce the text-to-3D adversarial generation method, a novel approach enabling physically realizable attacks that can generate 3D models of objects truly invisible to LiDAR detectors and be easily realized in the real world. Specifically, we present the first empirical study that systematically investigates the factors influencing detection vulnerability by manipulating the topology, connectivity, and intensity of individual pedestrian 3D models and combining pedestrians with multiple objects within the CARLA simulation environment. Building on the insights, we propose the physically-informed text-to-3D adversarial generation (Phy3DAdvGen) that systematically optimizes text prompts by iteratively refining verbs, objects, and poses to produce LiDAR-invisible pedestrians. To ensure physical realizability, we construct a comprehensive object pool containing 13 3D models of real objects and constrain Phy3DAdvGen to generate 3D objects based on combinations of objects in this set. Extensive experiments demonstrate that our approach can generate 3D pedestrians that evade six state-of-the-art (SOTA) LiDAR 3D detectors in both CARLA simulation and physical environments, thereby highlighting vulnerabilities in safety-critical applications.

Title: Generating Surface for Text-to-3D using 2D Gaussian Splatting

Authors: Huanning Dong, Fan Li, Ping Kuang, Jianwen Min
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2510.06967
Pdf URL: https://arxiv.org/pdf/2510.06967
Copy Paste: [[2510.06967]] Generating Surface for Text-to-3D using 2D Gaussian Splatting(https://arxiv.org/abs/2510.06967)
Keywords: generation
Abstract: Recent advancements in Text-to-3D modeling have shown significant potential for the creation of 3D content. However, due to the complex geometric shapes of objects in the natural world, generating 3D content remains a challenging task. Current methods either leverage 2D diffusion priors to recover 3D geometry, or train the model directly based on specific 3D representations. In this paper, we propose a novel method named DirectGaussian, which focuses on generating the surfaces of 3D objects represented by surfels. In DirectGaussian, we utilize conditional text generation models, and the surface of a 3D object is rendered by 2D Gaussian splatting with multi-view normal and texture priors. To address multi-view geometric consistency problems, DirectGaussian incorporates curvature constraints on the generated surface during the optimization process. Through extensive experiments, we demonstrate that our framework is capable of achieving diverse and high-fidelity 3D content creation.

Title: Addressing the ID-Matching Challenge in Long Video Captioning

Authors: Zhantao Yang, Huangji Wang, Ruili Feng, Han Zhang, Yuting Hu, Shangwen Zhu, Junyan Li, Yu Liu, Fan Cheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.06973
Pdf URL: https://arxiv.org/pdf/2510.06973
Copy Paste: [[2510.06973]] Addressing the ID-Matching Challenge in Long Video Captioning(https://arxiv.org/abs/2510.06973)
Keywords: generation
Abstract: Generating captions for long and complex videos is both critical and challenging, with significant implications for the growing fields of text-to-video generation and multi-modal understanding. One key challenge in long video captioning is accurately recognizing the same individuals who appear in different frames, which we refer to as the ID-Matching problem. Few prior works have focused on this important issue. Those that have usually suffer from limited generalization and depend on point-wise matching, which limits their overall effectiveness. In this paper, unlike previous approaches, we build upon LVLMs to leverage their powerful priors. We aim to unlock the inherent ID-Matching capabilities within LVLMs themselves to enhance the ID-Matching performance of captions. Specifically, we first introduce a new benchmark for assessing the ID-Matching capabilities of video captions. Using this benchmark, we investigate LVLMs including GPT-4o, revealing the key insight that ID-Matching performance can be improved through two methods: 1) enhancing the usage of image information and 2) increasing the quantity of information in individual descriptions. Based on these insights, we propose a novel video captioning method called Recognizing Identities for Captioning Effectively (RICE). Extensive experiments, including assessments of caption quality and ID-Matching performance, demonstrate the superiority of our approach. Notably, when implemented on GPT-4o, RICE improves the precision of ID-Matching from 50% to 90% and the recall of ID-Matching from 15% to 80% compared to the baseline. RICE makes it possible to continuously track different individuals in the captions of long videos.

Title: No MoCap Needed: Post-Training Motion Diffusion Models with Reinforcement Learning using Only Textual Prompts

Authors: Girolamo Macaluso, Lorenzo Mandelli, Mirko Bicchierai, Stefano Berretti, Andrew D. Bagdanov
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.06988
Pdf URL: https://arxiv.org/pdf/2510.06988
Copy Paste: [[2510.06988]] No MoCap Needed: Post-Training Motion Diffusion Models with Reinforcement Learning using Only Textual Prompts(https://arxiv.org/abs/2510.06988)
Keywords: generation, generative
Abstract: Diffusion models have recently advanced human motion generation, producing realistic and diverse animations from textual prompts. However, adapting these models to unseen actions or styles typically requires additional motion capture data and full retraining, which is costly and difficult to scale. We propose a post-training framework based on Reinforcement Learning that fine-tunes pretrained motion diffusion models using only textual prompts, without requiring any motion ground truth. Our approach employs a pretrained text-motion retrieval network as a reward signal and optimizes the diffusion policy with Denoising Diffusion Policy Optimization, effectively shifting the model's generative distribution toward the target domain without relying on paired motion data. We evaluate our method on cross-dataset adaptation and leave-one-out motion experiments using the HumanML3D and KIT-ML datasets across both latent- and joint-space diffusion architectures. Results from quantitative metrics and user studies show that our approach consistently improves the quality and diversity of generated motions, while preserving performance on the original distribution. Our approach is a flexible, data-efficient, and privacy-preserving solution for motion adaptation.

Title: Sharpness-Aware Data Generation for Zero-shot Quantization

Authors: Dung Hoang-Anh, Cuong Pham Trung Le, Jianfei Cai, Thanh-Toan Do
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2510.07018
Pdf URL: https://arxiv.org/pdf/2510.07018
Copy Paste: [[2510.07018]] Sharpness-Aware Data Generation for Zero-shot Quantization(https://arxiv.org/abs/2510.07018)
Keywords: generation
Abstract: Zero-shot quantization aims to learn a quantized model from a pre-trained full-precision model with no access to original real training data. The common idea in zero-shot quantization approaches is to generate synthetic data for quantizing the full-precision model. While it is well-known that deep neural networks with low sharpness have better generalization ability, none of the previous zero-shot quantization works considers the sharpness of the quantized model as a criterion for generating training data. This paper introduces a novel methodology that takes into account quantized model sharpness in synthetic data generation to enhance generalization. Specifically, we first demonstrate that sharpness minimization can be attained by maximizing gradient matching between the reconstruction loss gradients computed on synthetic and real validation data, under certain assumptions. We then circumvent the problem of the gradient matching without real validation set by approximating it with the gradient matching between each generated sample and its neighbors. Experimental evaluations on CIFAR-100 and ImageNet datasets demonstrate the superiority of the proposed method over the state-of-the-art techniques in low-bit quantization settings.

Title: Generative World Modelling for Humanoids: 1X World Model Challenge Technical Report

Authors: Riccardo Mereu, Aidan Scannell, Yuxin Hou, Yi Zhao, Aditya Jitta, Antonio Dominguez, Luigi Acerbi, Amos Storkey, Paul Chang
Subjects: cs.LG, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2510.07092
Pdf URL: https://arxiv.org/pdf/2510.07092
Copy Paste: [[2510.07092]] Generative World Modelling for Humanoids: 1X World Model Challenge Technical Report(https://arxiv.org/abs/2510.07092)
Keywords: generation, generative
Abstract: World models are a powerful paradigm in AI and robotics, enabling agents to reason about the future by predicting visual observations or compact latent states. The 1X World Model Challenge introduces an open-source benchmark of real-world humanoid interaction, with two complementary tracks: sampling, focused on forecasting future image frames, and compression, focused on predicting future discrete latent codes. For the sampling track, we adapt the video generation foundation model Wan-2.2 TI2V-5B to video-state-conditioned future frame prediction. We condition the video generation on robot states using AdaLN-Zero, and further post-train the model using LoRA. For the compression track, we train a Spatio-Temporal Transformer model from scratch. Our models achieve 23.0 dB PSNR in the sampling task and a Top-500 CE of 6.6386 in the compression task, securing 1st place in both challenges.
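
The sampling-track result above is reported as PSNR in dB. The following is a minimal sketch of the standard PSNR definition, 10 * log10(MAX^2 / MSE). It assumes 8-bit frames with a peak value of 255. The challenge's exact evaluation protocol (frame selection, resolution, averaging) is not described in the abstract and is not reproduced here.

```python
import numpy as np

def psnr(reference, prediction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally shaped frames."""
    reference = np.asarray(reference, dtype=np.float64)
    prediction = np.asarray(prediction, dtype=np.float64)
    mse = np.mean((reference - prediction) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return float(10.0 * np.log10(max_value ** 2 / mse))

# Toy frames: a constant error of 25.5 gray levels on a 255-level scale.
target = np.full((4, 4), 100.0)
predicted = target + 25.5
print(psnr(target, predicted))
```

Because the scale is logarithmic, each 10 dB gain corresponds to a tenfold reduction in mean squared error.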

Title: Enhancing Concept Localization in CLIP-based Concept Bottleneck Models

Authors: Rémi Kazmierczak, Steve Azzolin, Eloïse Berthier, Goran Frehse, Gianni Franchi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.07115
Pdf URL: https://arxiv.org/pdf/2510.07115
Copy Paste: [[2510.07115]] Enhancing Concept Localization in CLIP-based Concept Bottleneck Models(https://arxiv.org/abs/2510.07115)
Keywords: generation
Abstract: This paper addresses explainable AI (XAI) through the lens of Concept Bottleneck Models (CBMs) that do not require explicit concept annotations, relying instead on concepts extracted using CLIP in a zero-shot manner. We show that CLIP, which is central in these techniques, is prone to concept hallucination, incorrectly predicting the presence or absence of concepts within an image in scenarios used in numerous CBMs, hence undermining the faithfulness of explanations. To mitigate this issue, we introduce Concept Hallucination Inhibition via Localized Interpretability (CHILI), a technique that disentangles image embeddings and localizes pixels corresponding to target concepts. Furthermore, our approach supports the generation of saliency-based explanations that are more interpretable.

Title: Graph Conditioned Diffusion for Controllable Histopathology Image Generation

Authors: Sarah Cechnicka, Matthew Baugh, Weitong Zhang, Mischa Dombrowski, Zhe Li, Johannes C. Paetzold, Candice Roufosse, Bernhard Kainz
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2510.07129
Pdf URL: https://arxiv.org/pdf/2510.07129
Copy Paste: [[2510.07129]] Graph Conditioned Diffusion for Controllable Histopathology Image Generation(https://arxiv.org/abs/2510.07129)
Keywords: generation
Abstract: Recent advances in Diffusion Probabilistic Models (DPMs) have set new standards in high-quality image synthesis. Yet, controlled generation remains challenging, particularly in sensitive areas such as medical imaging. Medical images feature inherent structure such as consistent spatial arrangement, shape or texture, all of which are critical for diagnosis. However, existing DPMs operate in noisy latent spaces that lack semantic structure and strong priors, making it difficult to ensure meaningful control over generated content. To address this, we propose graph-based object-level representations for Graph-Conditioned-Diffusion. Our approach generates graph nodes corresponding to each major structure in the image, encapsulating their individual features and relationships. These graph representations are processed by a transformer module and integrated into a diffusion model via the text-conditioning mechanism, enabling fine-grained control over generation. We evaluate this approach using a real-world histopathology use case, demonstrating that our generated data can reliably substitute for annotated patient data in downstream segmentation tasks. The code is available here.

Title: A Multi-Agent Framework for Stateful Inference-Time Search

Authors: Arshika Lalan, Rajat Ghosh, Aditya Kolsur, Debojyoti Dutta
Subjects: cs.LG, cs.AI, cs.CL, cs.MA, cs.SE
Abstract URL: https://arxiv.org/abs/2510.07147
Pdf URL: https://arxiv.org/pdf/2510.07147
Copy Paste: [[2510.07147]] A Multi-Agent Framework for Stateful Inference-Time Search(https://arxiv.org/abs/2510.07147)
Keywords: generation
Abstract: Recent work explores agentic inference-time techniques to perform structured, multi-step reasoning. However, stateless inference often struggles on multi-step tasks due to the absence of persistent state. Moreover, task-specific fine-tuning or instruction-tuning often achieves surface-level code generation but remains brittle on tasks requiring deeper reasoning and long-horizon dependencies. To address these limitations, we propose stateful multi-agent evolutionary search, a training-free framework that departs from prior stateless approaches by combining (i) persistent inference-time state, (ii) adversarial mutation, and (iii) evolutionary preservation. We demonstrate its effectiveness in automated unit test generation through the generation of edge cases. We generate robust edge cases using an evolutionary search process, where specialized agents sequentially propose, mutate, and score candidates. A controller maintains persistent state across generations, while evolutionary preservation ensures diversity and exploration across all possible cases. This yields a generalist agent capable of discovering robust, high-coverage edge cases across unseen codebases. Experiments show our stateful multi-agent inference framework achieves substantial gains in coverage over stateless single-step baselines, evaluated on prevalent unit-testing benchmarks such as HumanEval and TestGenEvalMini and using three diverse LLM families: Llama, Gemma, and GPT. These results indicate that combining persistent inference-time state with evolutionary search materially improves unit-test generation.
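The propose / mutate / score loop with persistent state described above can be pictured with a toy, fully deterministic sketch. This is not the paper's system: there are no LLM agents here, and `classify`, `mutate`, and `search` are hypothetical stand-ins. A dictionary plays the role of the controller's persistent state, and keeping one representative input per covered branch is an assumed, simplified form of "preservation".

```python
def classify(x):
    """Hypothetical function under test, with four branches to cover."""
    if x < 0:
        return "negative"
    if x == 0:
        return "zero"
    if x < 10:
        return "small"
    return "large"


def mutate(x):
    """Assumed mutation operators: nudge, negate, and double the input."""
    return [x + 1, x - 1, -x, x * 2]


def search(seed_inputs, generations=5):
    # Persistent state carried across generations: branch -> representative input.
    state = {}
    population = list(seed_inputs)
    for _ in range(generations):
        # Propose: current population plus all of its mutations.
        candidates = population + [m for x in population for m in mutate(x)]
        # Score: a candidate is kept if it reaches a new branch, or is a more
        # extreme example of an already covered branch.
        for c in candidates:
            branch = classify(c)
            if branch not in state or abs(c) > abs(state[branch]):
                state[branch] = c
        # Preserve: one representative per branch seeds the next generation.
        population = list(state.values())
    return state


if __name__ == "__main__":
    print(sorted(search([1])))
```

Starting from the single seed input `1`, the first generation reaches three branches, and the `"large"` branch is only found after several generations of doubling, which is the kind of case a single stateless step would miss.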

Title: MV-Performer: Taming Video Diffusion Model for Faithful and Synchronized Multi-view Performer Synthesis

Authors: Yihao Zhi, Chenghong Li, Hongjie Liao, Xihe Yang, Zhengwentai Sun, Jiahao Chang, Xiaodong Cun, Wensen Feng, Xiaoguang Han
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.07190
Pdf URL: https://arxiv.org/pdf/2510.07190
Copy Paste: [[2510.07190]] MV-Performer: Taming Video Diffusion Model for Faithful and Synchronized Multi-view Performer Synthesis(https://arxiv.org/abs/2510.07190)
Keywords: generation
Abstract: Recent breakthroughs in video generation, powered by large-scale datasets and diffusion techniques, have shown that video diffusion models can function as implicit 4D novel view synthesizers. Nevertheless, current methods primarily concentrate on redirecting the camera trajectory within the front view and struggle to generate 360-degree viewpoint changes. In this paper, we focus on the human-centric subdomain and present MV-Performer, an innovative framework for creating synchronized novel view videos from monocular full-body captures. To achieve 360-degree synthesis, we extensively leverage the MVHumanNet dataset and incorporate an informative condition signal. Specifically, we use camera-dependent normal maps rendered from oriented partial point clouds, which effectively alleviate the ambiguity between seen and unseen observations. To maintain synchronization in the generated videos, we propose a multi-view human-centric video diffusion model that fuses information from the reference video, partial rendering, and different viewpoints. Additionally, we provide a robust inference procedure for in-the-wild video cases, which greatly mitigates the artifacts induced by imperfect monocular depth estimation. Extensive experiments on three datasets demonstrate MV-Performer's state-of-the-art effectiveness and robustness, setting a strong model for human-centric 4D novel view synthesis.

Title: EigenScore: OOD Detection using Covariance in Diffusion Models

Authors: Shirin Shoushtari, Yi Wang, Xiao Shi, M. Salman Asif, Ulugbek S. Kamilov
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.07206
Pdf URL: https://arxiv.org/pdf/2510.07206
Copy Paste: [[2510.07206]] EigenScore: OOD Detection using Covariance in Diffusion Models(https://arxiv.org/abs/2510.07206)
Keywords: generative
Abstract: Out-of-distribution (OOD) detection is critical for the safe deployment of machine learning systems in safety-sensitive domains. Diffusion models have recently emerged as powerful generative models, capable of capturing complex data distributions through iterative denoising. Building on this progress, recent work has explored their potential for OOD detection. We propose EigenScore, a new OOD detection method that leverages the eigenvalue spectrum of the posterior covariance induced by a diffusion model. We argue that posterior covariance provides a consistent signal of distribution shift, leading to larger trace and leading eigenvalues on OOD inputs, yielding a clear spectral signature. We further provide analysis explicitly linking posterior covariance to distribution mismatch, establishing it as a reliable signal for OOD detection. To ensure tractability, we adopt a Jacobian-free subspace iteration method to estimate the leading eigenvalues using only forward evaluations of the denoiser. Empirically, EigenScore achieves SOTA performance, with up to 5% AUROC improvement over the best baseline. Notably, it remains robust in near-OOD settings such as CIFAR-10 vs CIFAR-100, where existing diffusion-based methods often fail.
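The idea of estimating leading eigenvalues from forward evaluations alone can be illustrated with a small sketch. It uses plain power iteration (the one-vector special case of subspace iteration) together with central finite differences in place of an explicit Jacobian. The `toy_denoiser` is an assumed linear stand-in with a known symmetric Jacobian, not a diffusion model, and none of the names come from the paper.

```python
import math


def toy_denoiser(x):
    """Hypothetical stand-in: linear map with symmetric Jacobian [[2, 1], [1, 3]]."""
    return [2.0 * x[0] + 1.0 * x[1], 1.0 * x[0] + 3.0 * x[1]]


def jvp_fd(f, x, v, eps=1e-4):
    """Jacobian-vector product J(x) v using only two forward evaluations of f."""
    plus = f([xi + eps * vi for xi, vi in zip(x, v)])
    minus = f([xi - eps * vi for xi, vi in zip(x, v)])
    return [(a - b) / (2.0 * eps) for a, b in zip(plus, minus)]


def leading_eigenvalue(f, x, iters=100):
    """Power iteration on the Jacobian of f at x, never forming the Jacobian."""
    v = [1.0] + [0.0] * (len(x) - 1)  # deterministic start vector
    for _ in range(iters):
        w = jvp_fd(f, x, v)
        norm = math.sqrt(sum(wi * wi for wi in w))
        v = [wi / norm for wi in w]
    # Rayleigh quotient v^T J v with a unit-norm v.
    w = jvp_fd(f, x, v)
    return sum(vi * wi for vi, wi in zip(v, w))


if __name__ == "__main__":
    print(leading_eigenvalue(toy_denoiser, [0.5, -0.2]))
```

For this toy map the exact leading eigenvalue is (5 + sqrt(5)) / 2, so the estimate can be checked directly. Extending the sketch to several eigenvalues would require iterating on a block of vectors with re-orthogonalization, which is omitted here.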

Title: GenPilot: A Multi-Agent System for Test-Time Prompt Optimization in Image Generation

Authors: Wen Ye, Zhaocheng Liu, Yuwei Gui, Tingyu Yuan, Yunyue Su, Bowen Fang, Chaoyang Zhao, Qiang Liu, Liang Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2510.07217
Pdf URL: https://arxiv.org/pdf/2510.07217
Copy Paste: [[2510.07217]] GenPilot: A Multi-Agent System for Test-Time Prompt Optimization in Image Generation(https://arxiv.org/abs/2510.07217)
Keywords: generation
Abstract: Text-to-image synthesis has made remarkable progress, yet accurately interpreting complex and lengthy prompts remains challenging, often resulting in semantic inconsistencies and missing details. Existing solutions, such as fine-tuning, are model-specific and require training, while prior automatic prompt optimization (APO) approaches typically lack systematic error analysis and refinement strategies, resulting in limited reliability and effectiveness. Meanwhile, test-time scaling methods operate on fixed prompts and on noise or sample numbers, limiting their interpretability and adaptability. To solve these, we introduce a flexible and efficient test-time prompt optimization strategy that operates directly on the input text. We propose a plug-and-play multi-agent system called GenPilot, integrating error analysis, clustering-based adaptive exploration, fine-grained verification, and a memory module for iterative optimization. Our approach is model-agnostic, interpretable, and well-suited for handling long and complex prompts. Simultaneously, we summarize the common patterns of errors and the refinement strategy, offering more experience and encouraging further exploration. Experiments on DPG-bench and Geneval with improvements of up to 16.9% and 5.7% demonstrate the strong capability of our methods in enhancing the text and image consistency and structural coherence of generated images, revealing the effectiveness of our test-time prompt optimization strategy. The code is available at this https URL.

Title: TalkCuts: A Large-Scale Dataset for Multi-Shot Human Speech Video Generation

Authors: Jiaben Chen, Zixin Wang, Ailing Zeng, Yang Fu, Xueyang Yu, Siyuan Cen, Julian Tanke, Yihang Chen, Koichi Saito, Yuki Mitsufuji, Chuang Gan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.07249
Pdf URL: https://arxiv.org/pdf/2510.07249
Copy Paste: [[2510.07249]] TalkCuts: A Large-Scale Dataset for Multi-Shot Human Speech Video Generation(https://arxiv.org/abs/2510.07249)
Keywords: generation
Abstract: In this work, we present TalkCuts, a large-scale dataset designed to facilitate the study of multi-shot human speech video generation. Unlike existing datasets that focus on single-shot, static viewpoints, TalkCuts offers 164k clips totaling over 500 hours of high-quality human speech videos with diverse camera shots, including close-up, half-body, and full-body views. The dataset includes detailed textual descriptions, 2D keypoints and 3D SMPL-X motion annotations, covering over 10k identities, enabling multimodal learning and evaluation. As a first attempt to showcase the value of the dataset, we present Orator, an LLM-guided multi-modal generation framework as a simple baseline, where the language model functions as a multi-faceted director, orchestrating detailed specifications for camera transitions, speaker gesticulations, and vocal modulation. This architecture enables the synthesis of coherent long-form videos through our integrated multi-modal video generation module. Extensive experiments in both pose-guided and audio-driven settings show that training on TalkCuts significantly enhances the cinematographic coherence and visual appeal of generated multi-shot speech videos. We believe TalkCuts provides a strong foundation for future work in controllable, multi-shot speech video generation and broader multimodal learning.

Title: SpecGuard: Spectral Projection-based Advanced Invisible Watermarking

Authors: Inzamamul Alam, Md Tanvir Islam, Khan Muhammad, Simon S. Woo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.07302
Pdf URL: https://arxiv.org/pdf/2510.07302
Copy Paste: [[2510.07302]] SpecGuard: Spectral Projection-based Advanced Invisible Watermarking(https://arxiv.org/abs/2510.07302)
Keywords: generation
Abstract: Watermarking embeds imperceptible patterns into images for authenticity verification. However, existing methods often lack robustness against various transformations primarily including distortions, image regeneration, and adversarial perturbation, creating real-world challenges. In this work, we introduce SpecGuard, a novel watermarking approach for robust and invisible image watermarking. Unlike prior approaches, we embed the message inside hidden convolution layers by converting from the spatial domain to the frequency domain using spectral projection of a higher frequency band that is decomposed by wavelet projection. Spectral projection employs Fast Fourier Transform approximation to transform spatial data into the frequency domain efficiently. In the encoding phase, a strength factor enhances resilience against diverse attacks, including adversarial, geometric, and regeneration-based distortions, ensuring the preservation of copyrighted information. Meanwhile, the decoder leverages Parseval's theorem to effectively learn and extract the watermark pattern, enabling accurate retrieval under challenging transformations. We evaluate the proposed SpecGuard based on the embedded watermark's invisibility, capacity, and robustness. Comprehensive experiments demonstrate the proposed SpecGuard outperforms the state-of-the-art models. To ensure reproducibility, the full code is released on GitHub.
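The abstract above leans on Parseval's theorem, which states that a signal's energy is the same whether it is measured in the spatial domain or in the frequency domain. The following sketch only checks that identity numerically for a short 1D signal with a naive DFT. It says nothing about how SpecGuard's decoder actually uses the theorem, and the helper names `dft` and `energies` are illustrative.

```python
import cmath


def dft(signal):
    """Naive O(n^2) discrete Fourier transform (unnormalized)."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def energies(signal):
    """Return (spatial energy, spectral energy) for a finite signal."""
    spectrum = dft(signal)
    spatial = sum(abs(s) ** 2 for s in signal)
    # With an unnormalized DFT, Parseval's identity carries a 1/n factor.
    spectral = sum(abs(c) ** 2 for c in spectrum) / len(signal)
    return spatial, spectral


if __name__ == "__main__":
    print(energies([1.0, 2.0, 0.0, -1.0]))
```

The two returned values agree up to floating-point rounding, which is why energy-based quantities can be computed in whichever domain is more convenient.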

Title: MATRIX: Mask Track Alignment for Interaction-aware Video Generation

Authors: Siyoon Jin, Seongchan Kim, Dahyun Chung, Jaeho Lee, Hyunwook Choi, Jisu Nam, Jiyoung Kim, Seungryong Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.07310
Pdf URL: https://arxiv.org/pdf/2510.07310
Copy Paste: [[2510.07310]] MATRIX: Mask Track Alignment for Interaction-aware Video Generation(https://arxiv.org/abs/2510.07310)
Keywords: generation
Abstract: Video DiTs have advanced video generation, yet they still struggle to model multi-instance or subject-object interactions. This raises a key question: How do these models internally represent interactions? To answer this, we curate MATRIX-11K, a video dataset with interaction-aware captions and multi-instance mask tracks. Using this dataset, we conduct a systematic analysis that formalizes two perspectives of video DiTs: semantic grounding, via video-to-text attention, which evaluates whether noun and verb tokens capture instances and their relations; and semantic propagation, via video-to-video attention, which assesses whether instance bindings persist across frames. We find both effects concentrate in a small subset of interaction-dominant layers. Motivated by this, we introduce MATRIX, a simple and effective regularization that aligns attention in specific layers of video DiTs with multi-instance mask tracks from the MATRIX-11K dataset, enhancing both grounding and propagation. We further propose InterGenEval, an evaluation protocol for interaction-aware video generation. In experiments, MATRIX improves both interaction fidelity and semantic alignment while reducing drift and hallucination. Extensive ablations validate our design choices. Codes and weights will be released.

Title: WristWorld: Generating Wrist-Views via 4D World Models for Robotic Manipulation

Authors: Zezhong Qian, Xiaowei Chi, Yuming Li, Shizun Wang, Zhiyuan Qin, Xiaozhu Ju, Sirui Han, Shanghang Zhang
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2510.07313
Pdf URL: https://arxiv.org/pdf/2510.07313
Copy Paste: [[2510.07313]] WristWorld: Generating Wrist-Views via 4D World Models for Robotic Manipulation(https://arxiv.org/abs/2510.07313)
Keywords: generation
Abstract: Wrist-view observations are crucial for VLA models as they capture fine-grained hand-object interactions that directly enhance manipulation performance. Yet large-scale datasets rarely include such recordings, resulting in a substantial gap between abundant anchor views and scarce wrist views. Existing world models cannot bridge this gap, as they require a wrist-view first frame and thus fail to generate wrist-view videos from anchor views alone. Amid this gap, recent visual geometry models such as VGGT emerge with geometric and cross-view priors that make it possible to address extreme viewpoint shifts. Inspired by these insights, we propose WristWorld, the first 4D world model that generates wrist-view videos solely from anchor views. WristWorld operates in two stages: (i) Reconstruction, which extends VGGT and incorporates our Spatial Projection Consistency (SPC) Loss to estimate geometrically consistent wrist-view poses and 4D point clouds; (ii) Generation, which employs our video generation model to synthesize temporally coherent wrist-view videos from the reconstructed perspective. Experiments on Droid, Calvin, and Franka Panda demonstrate state-of-the-art video generation with superior spatial consistency, while also improving VLA performance, raising the average task completion length on Calvin by 3.81% and closing 42.4% of the anchor-wrist view gap.

Title: Pixel-Perfect Depth with Semantics-Prompted Diffusion Transformers

Authors: Gangwei Xu, Haotong Lin, Hongcheng Luo, Xianqi Wang, Jingfeng Yao, Lianghui Zhu, Yuechuan Pu, Cheng Chi, Haiyang Sun, Bing Wang, Guang Chen, Hangjun Ye, Sida Peng, Xin Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.07316
Pdf URL: https://arxiv.org/pdf/2510.07316
Copy Paste: [[2510.07316]] Pixel-Perfect Depth with Semantics-Prompted Diffusion Transformers(https://arxiv.org/abs/2510.07316)
Keywords: generation, generative
Abstract: This paper presents Pixel-Perfect Depth, a monocular depth estimation model based on pixel-space diffusion generation that produces high-quality, flying-pixel-free point clouds from estimated depth maps. Current generative depth estimation models fine-tune Stable Diffusion and achieve impressive performance. However, they require a VAE to compress depth maps into latent space, which inevitably introduces flying pixels at edges and details. Our model addresses this challenge by directly performing diffusion generation in the pixel space, avoiding VAE-induced artifacts. To overcome the high complexity associated with pixel-space generation, we introduce two novel designs: 1) Semantics-Prompted Diffusion Transformers (SP-DiT), which incorporate semantic representations from vision foundation models into DiT to prompt the diffusion process, thereby preserving global semantic consistency while enhancing fine-grained visual details; and 2) Cascade DiT Design that progressively increases the number of tokens to further enhance efficiency and accuracy. Our model achieves the best performance among all published generative models across five benchmarks, and significantly outperforms all other models in edge-aware point cloud evaluation.

Title: Temporal Prompting Matters: Rethinking Referring Video Object Segmentation

Authors: Ci-Siang Lin, Min-Hung Chen, I-Jieh Liu, Chien-Yi Wang, Sifei Liu, Yu-Chiang Frank Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.07319
Pdf URL: https://arxiv.org/pdf/2510.07319
Copy Paste: [[2510.07319]] Temporal Prompting Matters: Rethinking Referring Video Object Segmentation(https://arxiv.org/abs/2510.07319)
Keywords: generation
Abstract: Referring Video Object Segmentation (RVOS) aims to segment the object referred to by the query sentence in the video. Most existing methods require end-to-end training with dense mask annotations, which can be computationally expensive and less scalable. In this work, we rethink the RVOS problem and aim to investigate the key to this task. Based on existing foundation segmentation models, we decompose the RVOS task into referring, video, and segmentation factors, and propose a Temporal Prompt Generation and Selection (Tenet) framework to address the referring and video factors while leaving the segmentation problem to foundation models. To efficiently adapt image-based foundation segmentation models to referring video object segmentation, we leverage off-the-shelf object detectors and trackers to produce temporal prompts associated with the referring sentence. While high-quality temporal prompts can be produced, they cannot be easily identified from confidence scores. To tackle this issue, we propose Prompt Preference Learning to evaluate the quality of the produced temporal prompts. By using such prompts to instruct image-based foundation segmentation models, we are able to produce high-quality masks for the referred object, enabling efficient model adaptation to referring video object segmentation. Experiments on RVOS benchmarks demonstrate the effectiveness of the Tenet framework.