2025-11-24

Title: Motion Transfer-Enhanced StyleGAN for Generating Diverse Macaque Facial Expressions

Authors: Takuya Igaue, Catia Correia-Caeiro, Akito Yoshida, Takako Miyabe-Nishiwaki, Ryusuke Hayashi
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2511.16711
Pdf URL: https://arxiv.org/pdf/2511.16711
Copy Paste: [[2511.16711]] Motion Transfer-Enhanced StyleGAN for Generating Diverse Macaque Facial Expressions(https://arxiv.org/abs/2511.16711)
Keywords: generation, generative
Abstract: Generating animal faces using generative AI techniques is challenging because the available training images are limited both in quantity and variation, particularly for facial expressions across individuals. In this study, we focus on macaque monkeys, widely studied in systems neuroscience and evolutionary research, and propose a method to generate their facial expressions using a style-based generative image model (i.e., StyleGAN2). To address data limitations, we implemented: 1) data augmentation by synthesizing new facial expression images using motion transfer to animate still images with computer graphics, 2) sample selection based on the latent representation of macaque faces from an initially trained StyleGAN2 model to ensure variation and uniform sampling in the training dataset, and 3) loss function refinement to ensure the accurate reproduction of subtle movements, such as eye movements. Our results demonstrate that the proposed method enables the generation of diverse facial expressions for multiple macaque individuals, outperforming models trained solely on original still images. Additionally, we show that our model is effective for style-based image editing, where specific style parameters correspond to distinct facial movements. These findings underscore the model's potential for disentangling motion components as style parameters, providing a valuable tool for research on macaque facial expressions.

Title: PairHuman: A High-Fidelity Photographic Dataset for Customized Dual-Person Generation

Authors: Ting Pan, Ye Wang, Peiguang Jing, Rui Ma, Zili Yi, Yu Liu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.16712
Pdf URL: https://arxiv.org/pdf/2511.16712
Copy Paste: [[2511.16712]] PairHuman: A High-Fidelity Photographic Dataset for Customized Dual-Person Generation(https://arxiv.org/abs/2511.16712)
Keywords: generation
Abstract: Personalized dual-person portrait customization has considerable potential applications, such as preserving emotional memories and facilitating wedding photography planning. However, the absence of a benchmark dataset hinders the pursuit of high-quality customization in dual-person portrait generation. In this paper, we propose the PairHuman dataset, which is the first large-scale benchmark dataset specifically designed for generating dual-person portraits that meet high photographic standards. The PairHuman dataset contains more than 100K images that capture a variety of scenes, attire, and dual-person interactions, along with rich metadata, including detailed image descriptions, person localization, human keypoints, and attribute tags. We also introduce DHumanDiff, a baseline specifically crafted for dual-person portrait generation that features enhanced facial consistency while balancing personalized person generation with semantic-driven scene creation. Finally, the experimental results demonstrate that our dataset and method produce highly customized portraits with superior visual quality that are tailored to human preferences. Our dataset is publicly available at this https URL.

Title: SVG360: Multi-View SVG Generation with Geometric and Color Consistency from a Single SVG

Authors: Mengnan Jiang, Zhaolin Sun, Christian Franke, Michele Franco Adesso, Antonio Haas, Grace Li Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.16766
Pdf URL: https://arxiv.org/pdf/2511.16766
Copy Paste: [[2511.16766]] SVG360: Multi-View SVG Generation with Geometric and Color Consistency from a Single SVG(https://arxiv.org/abs/2511.16766)
Keywords: generation, generative
Abstract: Scalable Vector Graphics (SVGs) are central to modern design workflows, offering scaling without distortion and precise editability. However, for single object SVGs, generating multi-view consistent SVGs from a single-view input remains underexplored. We present a three stage framework that produces multi-view SVGs with geometric and color consistency from a single SVG input. First, the rasterized input is lifted to a 3D representation and rendered under target camera poses, producing multi-view images of the object. Next, we extend the temporal memory mechanism of Segment Anything 2 (SAM2) to the spatial domain, constructing a spatial memory bank that establishes part level correspondences across neighboring views, yielding cleaner and more consistent vector paths and color assignments without retraining. Finally, during the raster to vector conversion, we perform path consolidation and structural optimization to reduce redundancy while preserving boundaries and semantics. The resulting SVGs exhibit strong geometric and color consistency across views, significantly reduce redundant paths, and retain fine structural details. This work bridges generative modeling and structured vector representation, providing a scalable route to single input, object level multi-view SVG generation and supporting applications such as asset creation and semantic vector editing.

Title: Mesh RAG: Retrieval Augmentation for Autoregressive Mesh Generation

Authors: Xiatao Sun, Chen Liang, Qian Wang, Daniel Rakita
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.16807
Pdf URL: https://arxiv.org/pdf/2511.16807
Copy Paste: [[2511.16807]] Mesh RAG: Retrieval Augmentation for Autoregressive Mesh Generation(https://arxiv.org/abs/2511.16807)
Keywords: generation
Abstract: 3D meshes are a critical building block for applications ranging from industrial design and gaming to simulation and robotics. Traditionally, meshes are crafted manually by artists, a process that is time-intensive and difficult to scale. To automate and accelerate this asset creation, autoregressive models have emerged as a powerful paradigm for artistic mesh generation. However, current methods to enhance quality typically rely on larger models or longer sequences that result in longer generation time, and their inherent sequential nature imposes a severe quality-speed trade-off. This sequential dependency also significantly complicates incremental editing. To overcome these limitations, we propose Mesh RAG, a novel, training-free, plug-and-play framework for autoregressive mesh generation models. Inspired by RAG for language models, our approach augments the generation process by leveraging point cloud segmentation, spatial transformation, and point cloud registration to retrieve, generate, and integrate mesh components. This retrieval-based approach decouples generation from its strict sequential dependency, facilitating efficient and parallelizable inference. We demonstrate the wide applicability of Mesh RAG across various foundational autoregressive mesh generation models, showing it significantly enhances mesh quality, accelerates generation speed compared to sequential part prediction, and enables incremental editing, all without model retraining.

Title: WorldGen: From Text to Traversable and Interactive 3D Worlds

Authors: Dilin Wang, Hyunyoung Jung, Tom Monnier, Kihyuk Sohn, Chuhang Zou, Xiaoyu Xiang, Yu-Ying Yeh, Di Liu, Zixuan Huang, Thu Nguyen-Phuoc, Yuchen Fan, Sergiu Oprea, Ziyan Wang, Roman Shapovalov, Nikolaos Sarafianos, Thibault Groueix, Antoine Toisoul, Prithviraj Dhar, Xiao Chu, Minghao Chen, Geon Yeong Park, Mahima Gupta, Yassir Azziz, Rakesh Ranjan, Andrea Vedaldi
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.16825
Pdf URL: https://arxiv.org/pdf/2511.16825
Copy Paste: [[2511.16825]] WorldGen: From Text to Traversable and Interactive 3D Worlds(https://arxiv.org/abs/2511.16825)
Keywords: generation, generative
Abstract: We introduce WorldGen, a system that enables the automatic creation of large-scale, interactive 3D worlds directly from text prompts. Our approach transforms natural language descriptions into traversable, fully textured environments that can be immediately explored or edited within standard game engines. By combining LLM-driven scene layout reasoning, procedural generation, diffusion-based 3D generation, and object-aware scene decomposition, WorldGen bridges the gap between creative intent and functional virtual spaces, allowing creators to design coherent, navigable worlds without manual modeling or specialized 3D expertise. The system is fully modular and supports fine-grained control over layout, scale, and style, producing worlds that are geometrically consistent, visually rich, and efficient to render in real time. This work represents a step towards accessible, generative world-building at scale, advancing the frontier of 3D generative AI for applications in gaming, simulation, and immersive social environments.

Title: Towards Unified Vision Language Models for Forest Ecological Analysis in Earth Observation

Authors: Xizhe Xue, Xiao Xiang Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.16853
Pdf URL: https://arxiv.org/pdf/2511.16853
Copy Paste: [[2511.16853]] Towards Unified Vision Language Models for Forest Ecological Analysis in Earth Observation(https://arxiv.org/abs/2511.16853)
Keywords: generation
Abstract: Recent progress in vision language models (VLMs) has enabled remarkable perception and reasoning capabilities, yet their potential for scientific regression in Earth Observation (EO) remains largely unexplored. Existing EO datasets mainly emphasize semantic understanding tasks such as captioning or classification, lacking benchmarks that align multimodal perception with measurable biophysical variables. To fill this gap, we present REO-Instruct, the first unified benchmark designed for both descriptive and regression tasks in EO. REO-Instruct establishes a cognitively interpretable logic chain in forest ecological scenarios (human activity, land-cover classification, ecological patch counting, above-ground biomass (AGB) regression), bridging qualitative understanding and quantitative prediction. The dataset integrates co-registered Sentinel-2 and ALOS-2 imagery with structured textual annotations generated and validated through a hybrid human-AI pipeline. Comprehensive evaluation protocols and baseline results across generic VLMs reveal that current models struggle with numeric reasoning, highlighting an essential challenge for scientific VLMs. REO-Instruct offers a standardized foundation for developing and assessing next-generation geospatial models capable of both description and scientific inference. The project page is publicly available at this https URL.

Title: The use of vocal biomarkers in the detection of Parkinson's disease: a robust statistical performance comparison of classic machine learning models

Authors: Katia Pires Nascimento do Sacramento, Elliot Q. C. Garcia, Nicéias Silva Vilela, Vinicius P. Sacramento, Tiago A. E. Ferreira
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.16856
Pdf URL: https://arxiv.org/pdf/2511.16856
Copy Paste: [[2511.16856]] The use of vocal biomarkers in the detection of Parkinson's disease: a robust statistical performance comparison of classic machine learning models(https://arxiv.org/abs/2511.16856)
Keywords: generative
Abstract: Parkinson's disease (PD) is a progressive neurodegenerative disorder that, in addition to directly impairing functional mobility, is frequently associated with vocal impairments such as hypophonia and dysarthria, which typically manifest in the early stages. The use of vocal biomarkers to support the early diagnosis of PD presents a non-invasive, low-cost, and accessible alternative in clinical settings. Thus, the objective of this cross-sectional study was to consistently evaluate the effectiveness of a Deep Neural Network (DNN) in distinguishing individuals with Parkinson's disease from healthy controls, in comparison with traditional Machine Learning (ML) methods, using vocal biomarkers. Two publicly available voice datasets were used. Mel-frequency cepstral coefficients (MFCCs) were extracted from the samples, and model robustness was assessed using a validation strategy with 1000 independent random executions. Performance was evaluated using classification statistics. Since normality assumptions were not satisfied, non-parametric tests (Kruskal-Wallis and Bonferroni post-hoc tests) were applied to verify whether the tested classification models were similar or different in the classification of PD. With an average accuracy of 98.65% and 92.11% on the Italian Voice dataset and Parkinson's Telemonitoring dataset, respectively, the DNN demonstrated superior performance and efficiency compared to traditional ML models, while also achieving competitive results when benchmarked against relevant studies. Overall, this study confirms the efficiency of DNNs and emphasizes their potential to provide greater accuracy and reliability for the early detection of neurodegenerative diseases using voice-based biomarkers.
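The abstract above compares classifiers with a Kruskal-Wallis test over repeated runs. As a rough illustration of what that statistic measures, here is a minimal pure-Python sketch of the tie-corrected H statistic. The function name `kruskal_wallis_h` and the sample accuracies are hypothetical and not taken from the paper, and the sketch omits the p-value and the post-hoc step.

```python
from collections import defaultdict


def kruskal_wallis_h(groups):
    """Return the tie-corrected Kruskal-Wallis H statistic for a list of samples."""
    pooled = sorted(value for group in groups for value in group)
    n_total = len(pooled)

    # Average rank per distinct value (1-based; tied values share the mean rank).
    positions = defaultdict(list)
    for index, value in enumerate(pooled, start=1):
        positions[value].append(index)
    rank_of = {value: sum(idx) / len(idx) for value, idx in positions.items()}

    # H = 12 / (N (N + 1)) * sum(R_i^2 / n_i) - 3 (N + 1)
    rank_term = sum(sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups)
    h = 12.0 / (n_total * (n_total + 1)) * rank_term - 3.0 * (n_total + 1)

    # Tie correction: divide by 1 - sum(t^3 - t) / (N^3 - N).
    tie_sum = sum(len(idx) ** 3 - len(idx) for idx in positions.values())
    correction = 1.0 - tie_sum / (n_total ** 3 - n_total)
    if correction == 0.0:
        return 0.0  # every observation is identical, so there is nothing to separate
    return h / correction


# Hypothetical per-run accuracies for three models (illustrative values only).
model_accuracies = [
    [0.91, 0.92, 0.90],
    [0.95, 0.94, 0.96],
    [0.98, 0.99, 0.97],
]
print(kruskal_wallis_h(model_accuracies))
```

A larger H indicates that the rank distributions of the groups differ more. In practice the statistic is compared against a chi-squared distribution with (number of groups - 1) degrees of freedom.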

Title: BOP-ASK: Object-Interaction Reasoning for Vision-Language Models

Authors: Vineet Bhat, Sungsu Kim, Valts Blukis, Greg Heinrich, Prashanth Krishnamurthy, Ramesh Karri, Stan Birchfield, Farshad Khorrami, Jonathan Tremblay
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2511.16857
Pdf URL: https://arxiv.org/pdf/2511.16857
Copy Paste: [[2511.16857]] BOP-ASK: Object-Interaction Reasoning for Vision-Language Models(https://arxiv.org/abs/2511.16857)
Keywords: generation
Abstract: Vision Language Models (VLMs) have achieved impressive performance on spatial reasoning benchmarks, yet these evaluations mask critical weaknesses in understanding object interactions. Current benchmarks test high-level relationships ('left of', 'behind', etc.) but ignore the fine-grained spatial understanding needed for real-world applications: precise 3D localization, physical compatibility between objects, object affordances, and multi-step spatial planning. In this work, we present BOP-ASK, a novel large-scale dataset for object interaction reasoning for both training and benchmarking. Our data generation pipeline leverages 6D object poses from the Benchmark for Object Pose Estimation (BOP) datasets, from which we derive fine-grained annotations such as grasp poses, referred object poses, path planning trajectories, relative spatial and depth relationships, and object-to-object relationships. BOP-ASK comprises over 150k images and 33M question-answer pairs spanning six tasks (four novel), providing a rich resource for training and evaluating VLMs. We evaluate proprietary and open-source VLMs, and conduct human evaluations on BOP-ASK-core, a contributed test benchmark. We also release BOP-ASK-lab, an out-of-distribution benchmark with images not sourced from BOP, enabling testing of generalization. Our experiments demonstrate that models trained on BOP-ASK outperform baselines and exhibit emergent capabilities such as precise object and grasp pose estimation, trajectory planning, and fine-grained object-centric spatial reasoning in cluttered environments. We will publicly release our datasets and dataset generation pipeline.

Title: Align & Invert: Solving Inverse Problems with Diffusion and Flow-based Models via Representational Alignment

Authors: Loukas Sfountouris, Giannis Daras, Paris Giampouras
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2511.16870
Pdf URL: https://arxiv.org/pdf/2511.16870
Copy Paste: [[2511.16870]] Align & Invert: Solving Inverse Problems with Diffusion and Flow-based Models via Representational Alignment(https://arxiv.org/abs/2511.16870)
Keywords: super-resolution, generative
Abstract: Enforcing alignment between the internal representations of diffusion or flow-based generative models and those of pretrained self-supervised encoders has recently been shown to provide a powerful inductive bias, improving both convergence and sample quality. In this work, we extend this idea to inverse problems, where pretrained generative models are employed as priors. We propose applying representation alignment (REPA) between diffusion or flow-based models and a pretrained self-supervised visual encoder, such as DINOv2, to guide the reconstruction process at inference time. Although ground-truth signals are unavailable in inverse problems, we show that aligning model representations with approximate target features can substantially enhance reconstruction fidelity and perceptual realism. We provide theoretical results showing (a) the relation between the REPA regularization and a divergence measure in the DINOv2 embedding space, and (b) how REPA updates steer the model's internal representations toward those of the clean image. These results offer insights into the role of REPA in improving perceptual fidelity. Finally, we demonstrate the generality of our approach by integrating it into multiple state-of-the-art inverse problem solvers. Extensive experiments on super-resolution, box inpainting, Gaussian deblurring, and motion deblurring confirm that our method consistently improves reconstruction quality across tasks, while also providing substantial efficiency gains by reducing the number of required discretization steps without compromising the performance of the underlying solver.

Title: Warm Diffusion: Recipe for Blur-Noise Mixture Diffusion Models

Authors: Hao-Chien Hsueh, Chi-En Yen, Wen-Hsiao Peng, Ching-Chun Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.16904
Pdf URL: https://arxiv.org/pdf/2511.16904
Copy Paste: [[2511.16904]] Warm Diffusion: Recipe for Blur-Noise Mixture Diffusion Models(https://arxiv.org/abs/2511.16904)
Keywords: generation, generative
Abstract: Diffusion probabilistic models have achieved remarkable success in generative tasks across diverse data types. While recent studies have explored alternative degradation processes beyond Gaussian noise, this paper bridges two key diffusion paradigms: hot diffusion, which relies entirely on noise, and cold diffusion, which uses only blurring without noise. We argue that hot diffusion fails to exploit the strong correlation between high-frequency image detail and low-frequency structures, leading to random behaviors in the early steps of generation. Conversely, while cold diffusion leverages image correlations for prediction, it neglects the role of noise (randomness) in shaping the data manifold, resulting in out-of-manifold issues and partially explaining its performance drop. To integrate both strengths, we propose Warm Diffusion, a unified Blur-Noise Mixture Diffusion Model (BNMD), to control blurring and noise jointly. Our divide-and-conquer strategy exploits the spectral dependency in images, simplifying score model estimation by disentangling the denoising and deblurring processes. We further analyze the Blur-to-Noise Ratio (BNR) using spectral analysis to investigate the trade-off between model learning dynamics and changes in the data manifold. Extensive experiments across benchmarks validate the effectiveness of our approach for image generation.
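The hot, cold, and warm regimes described in the Warm Diffusion abstract can be pictured with a toy forward degradation that applies blur, noise, or both. The sketch below is an illustration only: the 1D signal, the box-blur kernel, and the names `box_blur` and `degrade` are assumptions and do not reproduce the paper's blur operator, schedule, or Blur-to-Noise Ratio.

```python
import random


def box_blur(signal, radius):
    """Average each sample with its neighbours within `radius` (window clipped at the edges)."""
    if radius <= 0:
        return list(signal)
    n = len(signal)
    blurred = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        blurred.append(sum(signal[lo:hi]) / (hi - lo))
    return blurred


def degrade(signal, blur_radius, noise_std, rng):
    """Blur the signal, then add zero-mean Gaussian noise with the given standard deviation."""
    blurred = box_blur(signal, blur_radius)
    return [x + rng.gauss(0.0, noise_std) for x in blurred]


rng = random.Random(0)
clean = [0.0, 0.0, 3.0, 0.0, 0.0]

hot = degrade(clean, blur_radius=0, noise_std=0.5, rng=rng)   # noise only
cold = degrade(clean, blur_radius=1, noise_std=0.0, rng=rng)  # blur only
warm = degrade(clean, blur_radius=1, noise_std=0.5, rng=rng)  # blur and noise together
print(hot, cold, warm, sep="\n")
```

Setting either knob to zero recovers one of the two pure regimes, which is the sense in which a joint blur-noise process can sit between them.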

Title: Q-REAL: Towards Realism and Plausibility Evaluation for AI-Generated Content

Authors: Shushi Wang, Zicheng Zhang, Chunyi Li, Wei Wang, Liya Ma, Fengjiao Chen, Xiaoyu Li, Xuezhi Cao, Guangtao Zhai, Xiaohong Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.16908
Pdf URL: https://arxiv.org/pdf/2511.16908
Copy Paste: [[2511.16908]] Q-REAL: Towards Realism and Plausibility Evaluation for AI-Generated Content(https://arxiv.org/abs/2511.16908)
Keywords: generation, generative, quality assessment
Abstract: Quality assessment of AI-generated content is crucial for evaluating model capability and guiding model optimization. However, most existing quality assessment datasets and models provide only a single quality score, which is too coarse to offer targeted guidance for improving generative models. In current applications of AI-generated images, realism and plausibility are two critical dimensions, and with the emergence of unified generation-understanding models, fine-grained evaluation along these dimensions becomes especially effective for improving generative performance. Therefore, we introduce Q-Real, a novel dataset for fine-grained evaluation of realism and plausibility in AI-generated images. Q-Real consists of 3,088 images generated by popular text-to-image models. For each image, we annotate the locations of major entities and provide a set of judgment questions and attribution descriptions for these along the dimensions of realism and plausibility. Considering that recent advances in multi-modal large language models (MLLMs) enable fine-grained evaluation of AI-generated images, we construct Q-Real Bench to evaluate them on two tasks: judgment and grounding with reasoning. Finally, to enhance MLLM capabilities, we design a fine-tuning framework and conduct experiments on multiple MLLMs using our dataset. Experimental results demonstrate the high quality and significance of our dataset and the comprehensiveness of the benchmark. Dataset and code will be released upon publication.

Title: PepEVOLVE: Position-Aware Dynamic Peptide Optimization via Group-Relative Advantage

Authors: Trieu Nguyen, Hao-Wei Pang, Shasha Feng
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.16912
Pdf URL: https://arxiv.org/pdf/2511.16912
Copy Paste: [[2511.16912]] PepEVOLVE: Position-Aware Dynamic Peptide Optimization via Group-Relative Advantage(https://arxiv.org/abs/2511.16912)
Keywords: generative
Abstract: Macrocyclic peptides are an emerging modality that combines biologics-like affinity with small-molecule-like developability, but their vast combinatorial space and multi-parameter objectives make lead optimization slow and challenging. Prior generative approaches such as PepINVENT require chemists to pre-specify mutable positions for optimization, choices that are not always known a priori, and rely on static pretraining and optimization algorithms that limit the model's ability to generalize and effectively optimize peptide sequences. We introduce PepEVOLVE, a position-aware, dynamic framework that learns both where to edit and how to dynamically optimize peptides for multi-objective improvement. PepEVOLVE (i) augments pretraining with dynamic masking and CHUCKLES shifting to improve generalization, (ii) uses a context-free multi-armed bandit router that discovers high-reward residues, and (iii) couples a novel evolving optimization algorithm with group-relative advantage to stabilize reinforcement updates. During in silico evaluations, the router policy reliably learns and concentrates probability on chemically meaningful sites that influence the peptide's properties. On a therapeutically motivated Rev-binding macrocycle benchmark, PepEVOLVE outperformed PepINVENT by reaching higher mean scores (approximately 0.8 vs. 0.6), achieving best candidates with a score of 0.95 (vs. 0.87), and converging in fewer steps under the task of optimizing permeability and lipophilicity with structural constraints. Overall, PepEVOLVE offers a practical, reproducible path to peptide lead optimization when optimal edit sites are unknown, enabling more efficient exploration and improving design quality across multiple objectives.
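The PepEVOLVE abstract mentions a group-relative advantage without giving its formula. One common form, used in GRPO-style reinforcement learning, scores each candidate against the mean and standard deviation of its own sampling group. The sketch below assumes that form; the function name and the reward values are hypothetical, and the paper's exact definition may differ.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise each reward by the mean and standard deviation of its group."""
    baseline = mean(rewards)
    spread = pstdev(rewards)
    # eps keeps the division finite when every reward in the group is identical.
    return [(reward - baseline) / (spread + eps) for reward in rewards]


# Hypothetical scores for four peptide candidates sampled in one group.
group_rewards = [0.60, 0.80, 0.95, 0.87]
advantages = group_relative_advantages(group_rewards)
print(advantages)
```

Because the baseline comes from the group itself, no separate value network is needed, and candidates are only rewarded for beating their peers in the same batch.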

Title: UniModel: A Visual-Only Framework for Unified Multimodal Understanding and Generation

Authors: Chi Zhang, Jiepeng Wang, Youming Wang, Yuanzhi Liang, Xiaoyan Yang, Zuoxin Li, Haibin Huang, Xuelong Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.16917
Pdf URL: https://arxiv.org/pdf/2511.16917
Copy Paste: [[2511.16917]] UniModel: A Visual-Only Framework for Unified Multimodal Understanding and Generation(https://arxiv.org/abs/2511.16917)
Keywords: generation, generative
Abstract: We present UniModel, a unified generative model that jointly supports visual understanding and visual generation within a single pixel-to-pixel diffusion framework. Our goal is to achieve unification along three axes: the model, the tasks, and the representations. At the representation level, we eliminate modality discrepancies by mapping both text and images into a shared visual space: textual prompts are rendered as painted text images on a clean canvas, and all inputs and outputs are treated purely as RGB pixels. This yields a fully vision-native formulation of multimodal learning. At the task level, a broad range of vision-language problems are cast as pixel-to-pixel transformations in this visual space. For understanding tasks, the model takes an RGB image and produces a painted text image that visually encodes the semantic prediction. For generation tasks, painted text images serve as visual conditions that guide realistic and semantically aligned image synthesis. Captioning and text-to-image generation thus become different directions of the same underlying visual translation process. At the model level, we instantiate a single Unified Diffusion Transformer trained with rectified flow in pixel space. A shared backbone jointly learns bidirectional mappings between natural images and painted text images, with lightweight task embeddings to specify the desired direction. Experiments on text-to-image synthesis and image-to-text understanding demonstrate strong cross-modal alignment and emergent controllability such as cycle-consistent image-caption-image loops. Our initial exploration suggests that unifying model, tasks, and representations in a single visual space is a promising paradigm for general-purpose multimodal intelligence.

Title: DeltaDeno: Zero-Shot Anomaly Generation via Delta-Denoising Attribution

Authors: Chaoran Xu, Chengkan Lv, Qiyu Chen, Yunkang Cao, Feng Zhang, Zhengtao Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.16920
Pdf URL: https://arxiv.org/pdf/2511.16920
Copy Paste: [[2511.16920]] DeltaDeno: Zero-Shot Anomaly Generation via Delta-Denoising Attribution(https://arxiv.org/abs/2511.16920)
Keywords: generation
Abstract: Anomaly generation is often framed as few-shot fine-tuning with anomalous samples, which contradicts the scarcity that motivates generation and tends to overfit category priors. We tackle the setting where no real anomaly samples or training are available. We propose Delta-Denoising (DeltaDeno), a training-free zero-shot anomaly generation method that localizes and edits defects by contrasting two diffusion branches driven by a minimal prompt pair under a shared schedule. By accumulating per-step denoising deltas into an image-specific localization map, we obtain a mask to guide the latent inpainting during later diffusion steps and preserve the surrounding context while generating realistic local defects. To improve stability and control, DeltaDeno performs token-level prompt refinement that aligns shared content and strengthens anomaly tokens, and applies a spatial attention bias restricted to anomaly tokens in the predicted region. Experiments on public datasets show that DeltaDeno achieves strong generation quality and realism, along with consistent gains in downstream detection performance. Code will be made publicly available.
摘要：异常生成通常被认为是对异常样本进行几次微调，这与激励生成的稀缺性相矛盾，并且往往会过度拟合类别先验。我们解决没有真正的异常样本或训练可用的情况。我们提出了 Delta-Denoising (DeltaDeno)，这是一种免训练的零样本异常生成方法，通过对比共享调度下由最小提示对驱动的两个扩散分支来定位和编辑缺陷。通过将每步去噪增量累积到特定于图像的定位图中，我们获得了一个掩模来指导后续扩散步骤中的潜在修复，并在生成真实的局部缺陷的同时保留周围的上下文。为了提高稳定性和控制力，DeltaDeno 执行标记级别的提示细化，以对齐共享内容并增强异常标记，并应用仅限于预测区域中的异常标记的空间注意偏差。在公共数据集上的实验表明，DeltaDeno 在下游检测性能方面实现了出色的生成、真实性和一致的增益。代码将公开。
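
The delta-accumulation step described in the abstract can be illustrated roughly as follows. This is a minimal sketch, not the authors' implementation: the abstract only says that per-step denoising deltas between two branches are accumulated into a localization map that yields a mask. The channel-averaged absolute difference, the min-max normalization, the fixed threshold, and the names `accumulate_delta_map` and `map_to_mask` are all assumptions, and the noise predictions are synthetic arrays rather than outputs of a diffusion model.

```python
import numpy as np

def accumulate_delta_map(eps_normal, eps_anomaly):
    """Accumulate per-step differences between two branches into one map.

    Both inputs have shape (steps, channels, H, W). The aggregation rule
    (absolute difference, channel mean, sum over steps) is an assumption.
    """
    deltas = np.abs(eps_anomaly - eps_normal).mean(axis=1)  # (steps, H, W)
    acc = deltas.sum(axis=0)                                # (H, W)
    span = acc.max() - acc.min()
    # Min-max normalize to [0, 1]; a flat map carries no localization signal.
    return (acc - acc.min()) / span if span > 0 else np.zeros_like(acc)

def map_to_mask(loc_map, threshold=0.5):
    """Binarize the localization map (the threshold value is hypothetical)."""
    return loc_map >= threshold

# Synthetic stand-ins for the two branches' noise predictions.
rng = np.random.default_rng(0)
steps, channels, h, w = 10, 4, 16, 16
eps_normal = rng.normal(size=(steps, channels, h, w))
eps_anomaly = eps_normal.copy()
eps_anomaly[:, :, 4:8, 4:8] += 1.0  # the branches disagree only in this patch

loc_map = accumulate_delta_map(eps_normal, eps_anomaly)
mask = map_to_mask(loc_map)
print(int(mask.sum()), "pixels selected for inpainting")
```

In a real pipeline the resulting mask would restrict latent inpainting to the selected region during the later diffusion steps, which is the role the abstract assigns to it.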

Title: Rethinking Diffusion Model-Based Video Super-Resolution: Leveraging Dense Guidance from Aligned Features

Authors: Jingyi Xu, Meisong Zheng, Ying Chen, Minglang Qiao, Xin Deng, Mai Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.16928
Pdf URL: https://arxiv.org/pdf/2511.16928
Copy Paste: [[2511.16928]] Rethinking Diffusion Model-Based Video Super-Resolution: Leveraging Dense Guidance from Aligned Features(https://arxiv.org/abs/2511.16928)
Keywords: super-resolution
Abstract: Diffusion model (DM) based Video Super-Resolution (VSR) approaches achieve impressive perceptual quality. However, they suffer from error accumulation, spatial artifacts, and a trade-off between perceptual quality and fidelity, primarily caused by inaccurate alignment and insufficient compensation between video frames. In this paper, within the DM-based VSR pipeline, we revisit the role of alignment and compensation between adjacent video frames and reveal two crucial observations: (a) the feature domain is better suited than the pixel domain for information compensation due to its stronger spatial and temporal correlations, and (b) warping at an upscaled resolution better preserves high-frequency information, but this benefit is not necessarily monotonic. Therefore, we propose a novel Densely Guided diffusion model with Aligned Features for Video Super-Resolution (DGAF-VSR), with an Optical Guided Warping Module (OGWM) to maintain high-frequency details in the aligned features and a Feature-wise Temporal Condition Module (FTCM) to deliver dense guidance in the feature domain. Extensive experiments on synthetic and real-world datasets demonstrate that DGAF-VSR surpasses state-of-the-art methods in key aspects of VSR, including perceptual quality (35.82% DISTS reduction), fidelity (0.20 dB PSNR gain), and temporal consistency (30.37% tLPIPS reduction).
摘要：基于扩散模型 (DM) 的视频超分辨率 (VSR) 方法实现了令人印象深刻的感知质量。然而，它们会遭受错误累积、空间伪影以及感知质量和保真度之间的权衡，这主要是由于视频帧之间的不准确对齐和补偿不足造成的。在本文中，在基于 DM 的 VSR 管道中，我们重新审视了相邻视频帧之间的对齐和补偿的作用，并揭示了两个关键的观察结果：（a）由于其更强的空间和时间相关性，特征域比像素域更适合信息补偿，以及（b）在升级分辨率下的扭曲可以更好地保留高频信息，但这种好处不一定是单调的。因此，我们提出了一种新颖的具有视频超分辨率对齐特征的密集引导扩散模型（DGAF-VSR），其中光学引导扭曲模块（OGWM）用于保持对齐特征中的高频细节，而特征时间条件模块（FTCM）用于在特征域中提供密集引导。对合成数据集和真实世界数据集的大量实验表明，DGAF-VSR 在 VSR 的关键方面超越了最先进的方法，包括感知质量（DISTS 减少 35.82%）、保真度（PSNR 增益 0.20 dB）和时间一致性（tLPIPS 减少 30.37%）。

Title: Neighbor GRPO: Contrastive ODE Policy Optimization Aligns Flow Models

Authors: Dailan He, Guanlin Feng, Xingtong Ge, Yazhe Niu, Yi Zhang, Bingqi Ma, Guanglu Song, Yu Liu, Hongsheng Li
Subjects: cs.CV, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2511.16955
Pdf URL: https://arxiv.org/pdf/2511.16955
Copy Paste: [[2511.16955]] Neighbor GRPO: Contrastive ODE Policy Optimization Aligns Flow Models(https://arxiv.org/abs/2511.16955)
Keywords: generation, generative
Abstract: Group Relative Policy Optimization (GRPO) has shown promise in aligning image and video generative models with human preferences. However, applying it to modern flow matching models is challenging because of their deterministic sampling paradigm. Current methods address this issue by converting Ordinary Differential Equations (ODEs) to Stochastic Differential Equations (SDEs), which introduce stochasticity. However, this SDE-based GRPO suffers from inefficient credit assignment and incompatibility with high-order solvers for fewer-step sampling. In this paper, we first reinterpret existing SDE-based GRPO methods from a distance optimization perspective, revealing their underlying mechanism as a form of contrastive learning. Based on this insight, we propose Neighbor GRPO, a novel alignment algorithm that completely bypasses the need for SDEs. Neighbor GRPO generates a diverse set of candidate trajectories by perturbing the initial noise conditions of the ODE and optimizes the model using a softmax distance-based surrogate leaping policy. We establish a theoretical connection between this distance-based objective and policy gradient optimization, rigorously integrating our approach into the GRPO framework. Our method fully preserves the advantages of deterministic ODE sampling, including efficiency and compatibility with high-order solvers. We further introduce symmetric anchor sampling for computational efficiency and group-wise quasi-norm reweighting to address reward flattening. Extensive experiments demonstrate that Neighbor GRPO significantly outperforms SDE-based counterparts in terms of training cost, convergence speed, and generation quality.
摘要：组相对策略优化 (GRPO) 在使图像和视频生成模型与人类偏好保持一致方面表现出了良好的前景。然而，由于其确定性采样范式，将其应用于现代流量匹配模型具有挑战性。当前的方法通过将常微分方程 (ODE) 转换为随机微分方程 (SDE) 来解决此问题，这会引入随机性。然而，这种基于 SDE 的 GRPO 存在信用分配效率低下以及与少步采样的高阶求解器不兼容的问题。在本文中，我们首先从距离优化的角度重新解释现有的基于 SDE 的 GRPO 方法，揭示其作为对比学习形式的潜在机制。基于这一见解，我们提出了 Neighbor GRPO，这是一种完全绕过 SDE 需求的新颖对齐算法。 Neighbor GRPO 通过扰动 ODE 的初始噪声条件来生成一组不同的候选轨迹，并使用基于 softmax 距离的代理跳跃策略来优化模型。我们在这种基于距离的目标和政策梯度优化之间建立了理论联系，将我们的方法严格整合到 GRPO 框架中。我们的方法完全保留了确定性 ODE 采样的优点，包括效率和与高阶求解器的兼容性。我们进一步引入对称锚采样来提高计算效率，并引入分组准范数重新加权来解决奖励扁平化问题。大量实验表明，Neighbor GRPO 在训练成本、收敛速度和生成质量方面显着优于基于 SDE 的同类方法。
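
One way to picture a "softmax distance-based surrogate policy" is sketched below. This is an illustration under stated assumptions, not the paper's formulation: the squared Euclidean distance, the temperature parameter, the standard GRPO-style reward normalization, and the function names are all hypothetical, and the candidates are random vectors rather than ODE trajectories from a flow model.

```python
import numpy as np

def softmax_distance_policy(current, candidates, temperature=1.0):
    """Turn distances to each candidate into a categorical distribution.

    Closer candidates get higher probability. The squared Euclidean
    distance and the temperature are assumptions made for illustration.
    """
    sq_dist = ((candidates - current) ** 2).sum(axis=1)
    logits = -sq_dist / temperature
    logits = logits - logits.max()  # shift for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within the group, as in standard GRPO."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

rng = np.random.default_rng(0)
current = np.zeros(8)                                    # stand-in for the model's output
candidates = current + 0.5 * rng.normal(size=(4, 8))     # stand-ins for perturbed-noise samples
probs = softmax_distance_policy(current, candidates)
advantages = group_relative_advantages([0.1, 0.9, 0.4, 0.2])

# Surrogate objective: raise the probability of (move toward) high-advantage
# candidates and lower it for low-advantage ones.
surrogate = float((advantages * np.log(probs)).sum())
print(probs.round(3), round(surrogate, 3))
```

Because the distribution depends only on distances to deterministically generated candidates, no SDE noise injection is needed to define it, which matches the motivation given in the abstract.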

Title: MatPedia: A Universal Generative Foundation for High-Fidelity Material Synthesis

Authors: Di Luo, Shuhui Yang, Mingxin Yang, Jiawei Lu, Yixuan Tang, Xintong Han, Zhuo Chen, Beibei Wang, Chunchao Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.16957
Pdf URL: https://arxiv.org/pdf/2511.16957
Copy Paste: [[2511.16957]] MatPedia: A Universal Generative Foundation for High-Fidelity Material Synthesis(https://arxiv.org/abs/2511.16957)
Keywords: generation, generative
Abstract: Physically-based rendering (PBR) materials are fundamental to photorealistic graphics, yet their creation remains labor-intensive and requires specialized expertise. While generative models have advanced material synthesis, existing methods lack a unified representation bridging natural image appearance and PBR properties, leading to fragmented task-specific pipelines and inability to leverage large-scale RGB image data. We present MatPedia, a foundation model built upon a novel joint RGB-PBR representation that compactly encodes materials into two interdependent latents: one for RGB appearance and one for the four PBR maps encoding complementary physical properties. By formulating them as a 5-frame sequence and employing video diffusion architectures, MatPedia naturally captures their correlations while transferring visual priors from RGB generation models. This joint representation enables a unified framework handling multiple material tasks--text-to-material generation, image-to-material generation, and intrinsic decomposition--within a single architecture. Trained on MatHybrid-410K, a mixed corpus combining PBR datasets with large-scale RGB images, MatPedia achieves native $1024\times1024$ synthesis that substantially surpasses existing approaches in both quality and diversity.
摘要：基于物理的渲染 (PBR) 材质是逼真图形的基础，但其创建仍然是劳动密集型的，并且需要专业知识。虽然生成模型具有先进的材料合成，但现有方法缺乏桥接自然图像外观和 PBR 属性的统一表示，导致特定任务管道碎片化并且无法利用大规模 RGB 图像数据。我们提出了 MatPedia，这是一种建立在新颖的联合 RGB-PBR 表示之上的基础模型，该模型将材质紧凑地编码为两个相互依赖的潜在变量：一个用于 RGB 外观，另一个用于编码互补物理属性的四个 PBR 映射。通过将它们制定为 5 帧序列并采用视频扩散架构，MatPedia 自然地捕获它们的相关性，同时从 RGB 生成模型传输视觉先验。这种联合表示使得统一的框架能够在单一架构内处理多种材料任务——文本到材料的生成、图像到材料的生成和内在分解。在 MatHybrid-410K（一个将 PBR 数据集与大规模 RGB 图像相结合的混合语料库）上进行训练后，MatPedia 实现了 $1024\times1024$ 的原生合成，在质量和多样性方面大大超过了现有方法。

Title: Two Heads Better than One: Dual Degradation Representation for Blind Super-Resolution

Authors: Hsuan Yuan, Shao-Yu Weng, I-Hsuan Lo, Wei-Chen Chiu, Yu-Syuan Xu, Hao-Chien Hsueh, Jen-Hui Chuang, Ching-Chun Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.16963
Pdf URL: https://arxiv.org/pdf/2511.16963
Copy Paste: [[2511.16963]] Two Heads Better than One: Dual Degradation Representation for Blind Super-Resolution(https://arxiv.org/abs/2511.16963)
Keywords: super-resolution
Abstract: Previous methods have demonstrated remarkable performance in single image super-resolution (SISR) tasks with known and fixed degradation (e.g., bicubic downsampling). However, when the actual degradation deviates from these assumptions, these methods may experience significant declines in performance. In this paper, we propose a Dual Branch Degradation Extractor Network to address the blind SR problem. While some blind SR methods assume noise-free degradation and others do not explicitly consider the presence of noise in the degradation model, our approach predicts two unsupervised degradation embeddings that represent blurry and noisy information. The SR network can then be adapted to blur embedding and noise embedding in distinct ways. Furthermore, we treat the degradation extractor as a regularizer to capitalize on differences between SR and HR images. Extensive experiments on several benchmarks demonstrate our method achieves SOTA performance in the blind SR problem.
摘要：以前的方法已经在具有已知和固定退化（例如双三次下采样）的单图像超分辨率（SISR）任务中表现出了卓越的性能。然而，当实际的退化偏离这些假设时，这些方法的性能可能会显着下降。在本文中，我们提出了一种双分支退化提取器网络来解决盲 SR 问题。虽然一些盲 SR 方法假设无噪声退化，而其他方法则没有明确考虑退化模型中噪声的存在，但我们的方法预测了两个代表模糊和噪声信息的无监督退化嵌入。然后，SR 网络可以以不同的方式适应模糊嵌入和噪声嵌入。此外，我们将退化提取器视为正则化器，以利用 SR 和 HR 图像之间的差异。对多个基准的大量实验证明我们的方法在盲 SR 问题中实现了 SOTA 性能。

Title: Real-Time Cooked Food Image Synthesis and Visual Cooking Progress Monitoring on Edge Devices

Authors: Jigyasa Gupta, Soumya Goyal, Anil Kumar, Ishan Jindal
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2511.16965
Pdf URL: https://arxiv.org/pdf/2511.16965
Copy Paste: [[2511.16965]] Real-Time Cooked Food Image Synthesis and Visual Cooking Progress Monitoring on Edge Devices(https://arxiv.org/abs/2511.16965)
Keywords: generation, generative
Abstract: Synthesizing realistic cooked food images from raw inputs on edge devices is a challenging generative task, requiring models to capture complex changes in texture, color and structure during cooking. Existing image-to-image generation methods often produce unrealistic results or are too resource-intensive for edge deployment. We introduce the first oven-based cooking-progression dataset with chef-annotated doneness levels and propose an edge-efficient recipe and cooking state guided generator that synthesizes realistic food images conditioned on a raw food image. This formulation enables user-preferred visual targets rather than fixed presets. To ensure temporal consistency and culinary plausibility, we introduce a domain-specific Culinary Image Similarity (CIS) metric, which serves both as a training loss and a progress-monitoring signal. Our model outperforms existing baselines with significant reductions in FID scores (30% improvement on our dataset; 60% on public datasets).
摘要：从边缘设备上的原始输入合成逼真的熟食图像是一项具有挑战性的生成任务，需要模型捕获烹饪过程中纹理、颜色和结构的复杂变化。现有的图像到图像生成方法通常会产生不切实际的结果，或者对于边缘部署来说资源过于密集。我们引入了第一个基于烤箱的烹饪进度数据集，其中包含厨师注释的熟度级别，并提出了一种边缘高效的食谱和烹饪状态引导生成器，该生成器可以根据原始食物图像合成真实的食物图像。这种公式可以实现用户首选的视觉目标，而不是固定的预设。为了确保时间一致性和烹饪的合理性，我们引入了特定领域的烹饪图像相似性（CIS）度量，它既充当训练损失又充当进度监控信号。我们的模型优于现有基线，FID 分数显着降低（我们的数据集提高了 30%；公共数据集提高了 60%）。

Title: VLM-Augmented Degradation Modeling for Image Restoration Under Adverse Weather Conditions

Authors: Qianyi Shao, Yuanfan Zhang, Renxiang Xiao, Liang Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.16998
Pdf URL: https://arxiv.org/pdf/2511.16998
Copy Paste: [[2511.16998]] VLM-Augmented Degradation Modeling for Image Restoration Under Adverse Weather Conditions(https://arxiv.org/abs/2511.16998)
Keywords: restoration
Abstract: Reliable visual perception under adverse weather conditions, such as rain, haze, snow, or a mixture of them, is desirable yet challenging for autonomous driving and outdoor robots. In this paper, we propose a unified Memory-Enhanced Visual-Language Recovery (MVLR) model that restores images from different degradation levels under various weather conditions. MVLR couples a lightweight encoder-decoder backbone with a Visual-Language Model (VLM) and an Implicit Memory Bank (IMB). The VLM performs chain-of-thought inference to encode weather degradation priors and the IMB stores continuous latent representations of degradation patterns. The VLM-generated priors query the IMB to retrieve fine-grained degradation prototypes. These prototypes are then adaptively fused with multi-scale visual features via dynamic cross-attention mechanisms, enhancing restoration accuracy while maintaining computational efficiency. Extensive experiments on four severe-weather benchmarks show that MVLR surpasses single-branch and Mixture-of-Experts baselines in terms of Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM). These results indicate that MVLR offers a practical balance between model compactness and expressiveness for real-time deployment in diverse outdoor conditions.
摘要：对于自动驾驶和户外机器人来说，在雨、雾、雪或它们的混合等恶劣天气条件下获得可靠的视觉感知是理想的，但同时也是一项挑战。在本文中，我们提出了一种统一的记忆增强视觉语言恢复（MVLR）模型，该模型可以在各种天气条件下恢复不同退化级别的图像。 MVLR 将轻量级编码器-解码器主干与视觉语言模型 (VLM) 和隐式内存库 (IMB) 结合起来。 VLM 执行思想链推理来编码天气退化先验，IMB 存储退化模式的连续潜在表示。 VLM 生成的先验查询 IMB 以检索细粒度的退化原型。然后，这些原型通过动态交叉注意机制自适应地与多尺度视觉特征融合，在保持计算效率的同时提高恢复精度。对四个恶劣天气基准的大量实验表明，MVLR 在峰值信噪比 (PSNR) 和结构相似性指数测量 (SSIM) 方面超过了单分支和混合专家基线。这些结果表明，MVLR 在模型紧凑性和表现力之间提供了实际的平衡，适合在不同的户外条件下进行实时部署。

Title: Vision Language Models are Confused Tourists

Authors: Patrick Amadeus Irawan, Ikhlasul Akmal Hanif, Muhammad Dehan Al Kautsar, Genta Indra Winata, Fajri Koto, Alham Fikri Aji
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2511.17004
Pdf URL: https://arxiv.org/pdf/2511.17004
Copy Paste: [[2511.17004]] Vision Language Models are Confused Tourists(https://arxiv.org/abs/2511.17004)
Keywords: generation
Abstract: Although the cultural dimension has been one of the key aspects in evaluating Vision-Language Models (VLMs), their ability to remain stable across diverse cultural inputs remains largely untested, despite being crucial to support diversity and multicultural societies. Existing evaluations often rely on benchmarks featuring only a singular cultural concept per image, overlooking scenarios where multiple, potentially unrelated cultural cues coexist. To address this gap, we introduce ConfusedTourist, a novel cultural adversarial robustness suite designed to assess VLMs' stability against perturbed geographical cues. Our experiments reveal a critical vulnerability, where accuracy drops heavily under simple image-stacking perturbations and even worsens with its image-generation-based variant. Interpretability analyses further show that these failures stem from systematic attention shifts toward distracting cues, diverting the model from its intended focus. These findings highlight a critical challenge: visual cultural concept mixing can substantially impair even state-of-the-art VLMs, underscoring the urgent need for more culturally robust multimodal understanding.
摘要：尽管文化维度一直是评估视觉语言模型（VLM）的关键方面之一，但它们在不同文化输入中保持稳定的能力在很大程度上尚未经过测试，尽管对于支持多样性和多元文化社会至关重要。现有的评估通常依赖于每个图像仅具有单一文化概念的基准，忽略了多种可能不相关的文化线索共存的场景。为了解决这一差距，我们引入了 ConfusedTourist，这是一种新颖的文化对抗鲁棒性套件，旨在评估 VLM 针对受干扰的地理线索的稳定性。我们的实验揭示了一个严重的漏洞，在简单的图像堆叠扰动下，准确度会大幅下降，甚至在基于图像生成的变体中更加恶化。可解释性分析进一步表明，这些失败源于系统性注意力转向分散注意力的线索，从而使模型偏离了预期的焦点。这些发现凸显了一个关键挑战：视觉文化概念混合甚至会严重损害最先进的 VLM，这凸显了对文化上更强大的多模式理解的迫切需要。

Title: Mask the Redundancy: Evolving Masking Representation Learning for Multivariate Time-Series Clustering

Authors: Zexi Tan, Xiaopeng Luo, Yunlin Liu, Yiqun Zhang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.17008
Pdf URL: https://arxiv.org/pdf/2511.17008
Copy Paste: [[2511.17008]] Mask the Redundancy: Evolving Masking Representation Learning for Multivariate Time-Series Clustering(https://arxiv.org/abs/2511.17008)
Keywords: generation
Abstract: Multivariate Time-Series (MTS) clustering discovers intrinsic grouping patterns of temporal data samples. Although time-series provide rich discriminative information, they also contain substantial redundancy, such as steady-state machine operation records and zero-output periods of solar power generation. Such redundancy diminishes the attention given to discriminative timestamps in representation learning, thus leading to performance bottlenecks in MTS clustering. Masking has been widely adopted to enhance the MTS representation, where temporal reconstruction tasks are designed to capture critical information from MTS. However, most existing masking strategies appear to be standalone preprocessing steps, isolated from the learning process, which hinders dynamic adaptation to the importance of clustering-critical timestamps. Accordingly, this paper proposes the Evolving-masked MTS Clustering (EMTC) method, with its model architecture composed of Importance-aware Variate-wise Masking (IVM) and Multi-Endogenous Views (MEV) representation learning modules. IVM adaptively guides the model in learning more discriminative representations for clustering, while the MEV-based reconstruction and contrastive learning pathways enhance the generalization. That is, the MEV reconstruction promotes multi-perspective complementarity to keep the masking from converging prematurely, and the clustering-guided contrastive learning facilitates the joint optimization of representation and clustering. Extensive experiments on 15 real benchmark datasets demonstrate the superiority of EMTC in comparison with eight SOTA methods, with EMTC achieving an average improvement of 4.85% over the strongest baselines.
摘要：多元时间序列 (MTS) 聚类发现时间数据样本的内在分组模式。尽管时间序列提供了丰富的判别信息，但它们也包含大量冗余，例如稳态机器运行记录和太阳能发电的零输出时段。这种冗余减少了对表示学习中判别性时间戳的关注，从而导致 MTS 聚类中的性能瓶颈。掩蔽已被广泛采用来增强 MTS 表示，其中时间重建任务旨在从 MTS 捕获关键信息。然而，大多数现有的屏蔽策略似乎是独立的预处理步骤，与学习过程隔离，这阻碍了对聚类关键时间戳的重要性的动态适应。因此，本文提出了进化掩码 MTS 聚类（EMTC）方法，其模型架构由重要性感知变量掩码（IVM）和多内源视图（MEV）表示学习模块组成。 IVM 自适应地引导模型学习更具区分性的聚类表示，而基于 MEV 的重建和对比学习路径则增强了泛化能力。也就是说，MEV重建有利于多视角互补，防止过早收敛的掩蔽，而聚类引导的对比学习有利于表示和聚类的联合优化。对 15 个真实基准数据集进行的大量实验证明了 EMTC 与八种 SOTA 方法相比的优越性，其中 EMTC 比最强基线平均提高了 4.85%。

Title: Energy Scaling Laws for Diffusion Models: Quantifying Compute and Carbon Emissions in Image Generation

Authors: Aniketh Iyengar, Jiaqi Han, Boris Ruf, Vincent Grari, Marcin Detyniecki, Stefano Ermon
Subjects: cs.LG, cs.CV, cs.CY
Abstract URL: https://arxiv.org/abs/2511.17031
Pdf URL: https://arxiv.org/pdf/2511.17031
Copy Paste: [[2511.17031]] Energy Scaling Laws for Diffusion Models: Quantifying Compute and Carbon Emissions in Image Generation(https://arxiv.org/abs/2511.17031)
Keywords: generation
Abstract: The rapidly growing computational demands of diffusion models for image generation have raised significant concerns about energy consumption and environmental impact. While existing approaches to energy optimization focus on architectural improvements or hardware acceleration, there is a lack of principled methods to predict energy consumption across different model configurations and hardware setups. We propose an adaptation of Kaplan scaling laws to predict GPU energy consumption for diffusion models based on computational complexity (FLOPs). Our approach decomposes diffusion model inference into text encoding, iterative denoising, and decoding components, with the hypothesis that denoising operations dominate energy consumption due to their repeated execution across multiple inference steps. We conduct comprehensive experiments across four state-of-the-art diffusion models (Stable Diffusion 2, Stable Diffusion 3.5, Flux, and Qwen) on three GPU architectures (NVIDIA A100, A4000, A6000), spanning various inference configurations including resolution (256x256 to 1024x1024), precision (fp16/fp32), step counts (10-50), and classifier-free guidance settings. Our energy scaling law achieves high predictive accuracy within individual architectures (R-squared > 0.9) and exhibits strong cross-architecture generalization, maintaining high rank correlations across models and enabling reliable energy estimation for unseen model-hardware combinations. These results validate the compute-bound nature of diffusion inference and provide a foundation for sustainable AI deployment planning and carbon footprint estimation.
摘要：图像生成扩散模型的计算需求迅速增长，引起了人们对能源消耗和环境影响的严重担忧。虽然现有的能源优化方法侧重于架构改进或硬件加速，但缺乏原则性方法来预测不同模型配置和硬件设置的能源消耗。我们提出了一种卡普兰缩放定律的改进方案，用于根据计算复杂度 (FLOP) 来预测扩散模型的 GPU 能耗。我们的方法将扩散模型推理分解为文本编码、迭代去噪和解码组件，并假设去噪操作由于在多个推理步骤中重复执行而在能耗中占主导地位。我们在三种 GPU 架构（NVIDIA A100、A4000、A6000）上对四种最先进的扩散模型（Stable Diffusion 2、Stable Diffusion 3.5、Flux 和 Qwen）进行了全面的实验，涵盖各种推理配置，包括分辨率（256x256 至 1024x1024）、精度（fp16/fp32）、步数(10-50)，以及无分类器的指导设置。我们的能量缩放定律在各个架构（R 平方 > 0.9）内实现了高预测精度，并表现出强大的跨架构泛化能力，保持模型之间的高等级相关性，并为看不见的模型-硬件组合提供可靠的能量估计。这些结果验证了扩散推理的计算限制性质，并为可持续人工智能部署规划和碳足迹估算提供了基础。
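
Fitting an energy-versus-compute law and scoring it with R-squared, as the abstract describes, can be sketched as below. This is a minimal illustration under assumptions: the abstract does not give the functional form, so a single-term power law E = a * FLOPs^b fitted in log-log space is assumed here, the data points are synthetic (not measurements from the paper), and `fit_power_law` and `r_squared` are hypothetical helper names.

```python
import numpy as np

def fit_power_law(flops, energy):
    """Fit energy = a * flops**b by least squares in log-log space."""
    slope, intercept = np.polyfit(np.log(flops), np.log(energy), 1)
    return float(np.exp(intercept)), float(slope)  # (a, b)

def r_squared(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Synthetic measurements: a known power law with multiplicative noise.
rng = np.random.default_rng(0)
flops = np.logspace(12, 15, 20)
energy = 3e-11 * flops ** 0.95 * np.exp(rng.normal(0.0, 0.05, size=flops.size))

a, b = fit_power_law(flops, energy)
r2 = r_squared(np.log(energy), np.log(a) + b * np.log(flops))
print(f"exponent b = {b:.3f}, R^2 (log space) = {r2:.4f}")
```

To apply this to real runs, `flops` would be the estimated compute per generated image (dominated by the denoising steps, per the abstract's hypothesis) and `energy` the measured GPU energy for the same configuration.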

Title: RoomPlanner: Explicit Layout Planner for Easier LLM-Driven 3D Room Generation

Authors: Wenzhuo Sun, Mingjian Liang, Wenxuan Song, Xuelian Cheng, Zongyuan Ge
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.17048
Pdf URL: https://arxiv.org/pdf/2511.17048
Copy Paste: [[2511.17048]] RoomPlanner: Explicit Layout Planner for Easier LLM-Driven 3D Room Generation(https://arxiv.org/abs/2511.17048)
Keywords: generation
Abstract: In this paper, we propose RoomPlanner, the first fully automatic 3D room generation framework for painlessly creating realistic indoor scenes with only short text as input. Without any manual layout design or panoramic image guidance, our framework can generate explicit layout criteria for rational spatial placement. We begin by introducing a hierarchical structure of language-driven agent planners that can automatically parse short and ambiguous prompts into detailed scene descriptions. These descriptions include raw spatial and semantic attributes for each object and the background, which are then used to initialize 3D point clouds. To position objects within bounded environments, we implement two arrangement constraints that iteratively optimize spatial arrangements, ensuring a collision-free and accessible layout solution. In the final rendering stage, we propose a novel AnyReach Sampling strategy for camera trajectory, along with the Interval Timestep Flow Sampling (ITFS) strategy, to efficiently optimize the coarse 3D Gaussian scene representation. These approaches help reduce the total generation time to under 30 minutes. Extensive experiments demonstrate that our method can produce geometrically rational 3D indoor scenes, surpassing prior approaches in both rendering speed and visual quality while preserving editability. The code will be available soon.
摘要：在本文中，我们提出了 RoomPlanner，这是第一个全自动 3D 房间生成框架，只需短文本作为输入即可轻松创建逼真的室内场景。无需任何手动布局设计或全景图像引导，我们的框架就可以生成明确的布局标准以实现合理的空间布局。我们首先引入语言驱动的代理规划器的层次结构，它可以自动将简短且模糊的提示解析为详细的场景描述。这些描述包括每个对象和背景的原始空间和语义属性，然后用于初始化 3D 点云。为了在有界环境中定位对象，我们实现了两个排列约束，迭代地优化空间排列，确保无碰撞且可访问的布局解决方案。在最终渲染阶段，我们提出了一种新颖的摄像机轨迹 AnyReach 采样策略以及间隔时间步长流采样 (ITFS) 策略，以有效优化粗略 3D 高斯场景表示。这些方法有助于将总生成时间缩短至 30 分钟以下。大量实验表明，我们的方法可以生成几何合理的 3D 室内场景，在渲染速度和视觉质量方面都超越了先前的方法，同时保留了可编辑性。该代码即将推出。

Title: ReBrain: Brain MRI Reconstruction from Sparse CT Slice via Retrieval-Augmented Diffusion

Authors: Junming Liu, Yifei Sun, Weihua Cheng, Yujin Kang, Yirong Chen, Ding Wang, Guosun Zeng
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.17068
Pdf URL: https://arxiv.org/pdf/2511.17068
Copy Paste: [[2511.17068]] ReBrain: Brain MRI Reconstruction from Sparse CT Slice via Retrieval-Augmented Diffusion(https://arxiv.org/abs/2511.17068)
Keywords: generation
Abstract: Magnetic Resonance Imaging (MRI) plays a crucial role in brain disease diagnosis, but it is not always feasible for certain patients due to physical or clinical constraints. Recent studies attempt to synthesize MRI from Computed Tomography (CT) scans; however, low-dose protocols often result in highly sparse CT volumes with poor through-plane resolution, making accurate reconstruction of the full brain MRI volume particularly challenging. To address this, we propose ReBrain, a retrieval-augmented diffusion framework for brain MRI reconstruction. Given any 3D CT scan with limited slices, we first employ a Brownian Bridge Diffusion Model (BBDM) to synthesize MRI slices along the 2D dimension. Simultaneously, we retrieve structurally and pathologically similar CT slices from a comprehensive prior database via a fine-tuned retrieval model. These retrieved slices are used as references, incorporated through a ControlNet branch to guide the generation of intermediate MRI slices and ensure structural continuity. We further account for rare retrieval failures when the database lacks suitable references and apply spherical linear interpolation to provide supplementary guidance. Extensive experiments on SynthRAD2023 and BraTS demonstrate that ReBrain achieves state-of-the-art performance in cross-modal reconstruction under sparse conditions.
摘要：磁共振成像（MRI）在脑部疾病诊断中发挥着至关重要的作用，但由于身体或临床限制，它对于某些患者并不总是可行。最近的研究尝试从计算机断层扫描 (CT) 扫描中合成 MRI；然而，低剂量方案通常会导致 CT 体积高度稀疏且平面分辨率较差，这使得全脑 MRI 体积的准确重建特别具有挑战性。为了解决这个问题，我们提出了 ReBrain，一种用于脑 MRI 重建的检索增强扩散框架。考虑到任何具有有限切片的 3D CT 扫描，我们首先采用布朗桥扩散模型 (BBDM) 沿 2D 维度合成 MRI 切片。同时，我们通过微调的检索模型从综合先验数据库中检索结构和病理上相似的 CT 切片。这些检索到的切片用作参考，通过 ControlNet 分支合并，以指导中间 MRI 切片的生成并确保结构连续性。当数据库缺乏合适的参考文献时，我们进一步解释了罕见的检索失败，并应用球面线性插值来提供补充指导。在 SynthRAD2023 和 BraTS 上进行的大量实验表明，ReBrain 在稀疏条件下的跨模态重建方面实现了最先进的性能。
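
Spherical linear interpolation (slerp), which the abstract names as the fallback when retrieval fails, follows a standard formula and can be sketched as below. What exactly ReBrain interpolates is not stated in the abstract; treating the inputs as two latent vectors from neighboring slices is an assumption, and the two-dimensional example vectors are purely illustrative.

```python
import numpy as np

def slerp(v0, v1, t, eps=1e-7):
    """Spherical linear interpolation between v0 and v1 at fraction t in [0, 1]."""
    v0 = np.asarray(v0, dtype=float)
    v1 = np.asarray(v1, dtype=float)
    # Angle between the two directions.
    cos_omega = np.dot(v0 / np.linalg.norm(v0), v1 / np.linalg.norm(v1))
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    sin_omega = np.sin(omega)
    if sin_omega < eps:
        # Nearly parallel or antiparallel: fall back to linear interpolation.
        return (1.0 - t) * v0 + t * v1
    w0 = np.sin((1.0 - t) * omega) / sin_omega
    w1 = np.sin(t * omega) / sin_omega
    return w0 * v0 + w1 * v1

# Halfway between two orthogonal unit vectors stays on the unit circle.
mid = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
print(mid, np.linalg.norm(mid))
```

Unlike plain linear interpolation, slerp keeps the norm of intermediate points between unit vectors at 1, which is why it is commonly preferred for interpolating Gaussian-like latents.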

Title: Diversity Has Always Been There in Your Visual Autoregressive Models

Authors: Tong Wang, Guanyu Yang, Nian Liu, Kai Wang, Yaxing Wang, Abdelrahman M Shaker, Salman Khan, Fahad Shahbaz Khan, Senmao Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.17074
Pdf URL: https://arxiv.org/pdf/2511.17074
Copy Paste: [[2511.17074]] Diversity Has Always Been There in Your Visual Autoregressive Models(https://arxiv.org/abs/2511.17074)
Keywords: generative
Abstract: Visual Autoregressive (VAR) models have recently garnered significant attention for their innovative next-scale prediction paradigm, offering notable advantages in both inference efficiency and image quality compared to traditional multi-step autoregressive (AR) and diffusion models. However, despite their efficiency, VAR models often suffer from diversity collapse, i.e., a reduction in output variability, analogous to that observed in few-step distilled diffusion models. In this paper, we introduce DiverseVAR, a simple yet effective approach that restores the generative diversity of VAR models without requiring any additional training. Our analysis reveals the pivotal component of the feature map as a key factor governing diversity formation at early scales. By suppressing the pivotal component in the model input and amplifying it in the model output, DiverseVAR effectively unlocks the inherent generative potential of VAR models while preserving high-fidelity synthesis. Empirical results demonstrate that our approach substantially enhances generative diversity with only negligible impact on performance. Our code will be publicly released.
摘要：视觉自回归（VAR）模型最近因其创新的下一代预测范式而受到广泛关注，与传统的多步自回归（AR）和扩散模型相比，在推理效率和图像质量方面具有显着的优势。然而，尽管 VAR 模型效率很高，但它经常遭受多样性崩溃的影响，即输出变异性的减少，类似于在少步蒸馏扩散模型中观察到的情况。在本文中，我们介绍了 DiverseVAR，这是一种简单而有效的方法，无需任何额外的训练即可恢复 VAR 模型的生成多样性。我们的分析揭示了特征图的关键组成部分是控制早期尺度多样性形成的关键因素。通过抑制模型输入中的关键成分并在模型输出中放大它，DiverseVAR 有效地释放了 VAR 模型的固有生成潜力，同时保持高保真合成。实证结果表明，我们的方法大大增强了生成多样性，而对性能的影响却可以忽略不计。我们的代码将在此 https URL 公开发布。

Title: Spanning Tree Autoregressive Visual Generation

Authors: Sangkyu Lee, Changho Lee, Janghoon Han, Hosung Song, Tackgeun You, Hwasup Lim, Stanley Jungkyu Choi, Honglak Lee, Youngjae Yu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.17089
Pdf URL: https://arxiv.org/pdf/2511.17089
Copy Paste: [[2511.17089]] Spanning Tree Autoregressive Visual Generation(https://arxiv.org/abs/2511.17089)
Keywords: generation
Abstract: We present Spanning Tree Autoregressive (STAR) modeling, which can incorporate prior knowledge of images, such as center bias and locality, to maintain sampling performance while also providing sufficiently flexible sequence orders to accommodate image editing at inference. Approaches that expose randomly permuted sequence orders to conventional autoregressive (AR) models in visual generation for bidirectional context either suffer from a decline in performance or compromise the flexibility in sequence order choice at inference. Instead, STAR utilizes traversal orders of uniform spanning trees sampled in a lattice defined by the positions of image patches. Traversal orders are obtained through breadth-first search, allowing us to efficiently construct a spanning tree whose traversal order ensures that the connected partial observation of the image appears as a prefix in the sequence through rejection sampling. Through the tailored yet structured randomized strategy compared to random permutation, STAR preserves the capability of postfix completion while maintaining sampling performance without any significant changes to the model architecture widely adopted in the language AR modeling.
摘要：我们提出了生成树自回归（STAR）建模，它可以结合图像的先验知识，例如中心偏差和局部性，以保持采样性能，同时还提供足够灵活的序列顺序以适应推理时的图像编辑。在双向上下文的视觉生成中将随机排列的序列顺序暴露给传统自回归 (AR) 模型的方法要么会导致性能下降，要么会损害推理时序列顺序选择的灵活性。相反，STAR 使用在由图像块的位置定义的格子中采样的均匀生成树的遍历顺序。遍历顺序是通过广度优先搜索获得的，使我们能够有效地构建生成树，其遍历顺序确保图像的连接部分观察通过拒绝采样作为序列中的前缀出现。与随机排列相比，通过定制但结构化的随机策略，STAR 保留了后缀完成的能力，同时保持了采样性能，而无需对 AR 建模语言中广泛采用的模型架构进行任何重大更改。
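
The core idea of ordering image patches by traversing a random spanning tree of the patch lattice can be sketched as below. This is an illustration under assumptions, not the authors' code: Wilson's algorithm is used here as one standard way to sample a uniform spanning tree, the traversal starts from the central patch, and the rejection-sampling step mentioned in the abstract is not implemented. All function names are hypothetical.

```python
import random
from collections import deque

def grid_neighbors(cell, height, width):
    """Yield the 4-connected neighbors of a patch position."""
    r, c = cell
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < height and 0 <= nc < width:
            yield (nr, nc)

def sample_uniform_spanning_tree(height, width, rng):
    """Wilson's algorithm: grow the tree with loop-erased random walks."""
    cells = [(r, c) for r in range(height) for c in range(width)]
    in_tree = {rng.choice(cells)}
    adjacency = {cell: [] for cell in cells}
    for start in cells:
        if start in in_tree:
            continue
        # Random walk until the tree is hit; overwriting next_step on
        # revisits erases loops implicitly.
        next_step, cell = {}, start
        while cell not in in_tree:
            next_step[cell] = rng.choice(list(grid_neighbors(cell, height, width)))
            cell = next_step[cell]
        # Retrace the loop-erased path and attach it to the tree.
        cell = start
        while cell not in in_tree:
            in_tree.add(cell)
            nxt = next_step[cell]
            adjacency[cell].append(nxt)
            adjacency[nxt].append(cell)
            cell = nxt
    return adjacency

def bfs_order(adjacency, start):
    """Breadth-first traversal order of the tree from a start patch."""
    order, seen, queue = [], {start}, deque([start])
    while queue:
        cell = queue.popleft()
        order.append(cell)
        for nb in adjacency[cell]:
            if nb not in seen:
                seen.add(nb)
                queue.append(nb)
    return order

height, width = 4, 4
tree = sample_uniform_spanning_tree(height, width, random.Random(0))
order = bfs_order(tree, (height // 2, width // 2))  # center start is an assumption
print(order)
```

Every prefix of such an order is a connected set of patches, because each patch is emitted only after its tree parent, which is a lattice neighbor. That locality is what distinguishes this scheme from a uniformly random permutation of patches.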

Title: Four decades of circumpolar super-resolved satellite land surface temperature data

Authors: Sonia Dupuis, Nando Metzger, Konrad Schindler, Frank Göttsche, Stefan Wunderle
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.17134
Pdf URL: https://arxiv.org/pdf/2511.17134
Copy Paste: [[2511.17134]] Four decades of circumpolar super-resolved satellite land surface temperature data(https://arxiv.org/abs/2511.17134)
Keywords: super-resolution
Abstract: Land surface temperature (LST) is an essential climate variable (ECV) crucial for understanding land-atmosphere energy exchange and monitoring climate change, especially in the rapidly warming Arctic. Long-term satellite-based LST records, such as those derived from the Advanced Very High Resolution Radiometer (AVHRR), are essential for detecting climate trends. However, the coarse spatial resolution of AVHRR's global area coverage (GAC) data limits its utility for analyzing fine-scale permafrost dynamics and other surface processes in the Arctic. This paper presents a new 42-year pan-Arctic LST dataset, downscaled from AVHRR GAC to 1 km with a super-resolution algorithm based on a deep anisotropic diffusion model. The model is trained on MODIS LST data, using coarsened inputs and native-resolution outputs, guided by high-resolution land cover, digital elevation, and vegetation height maps. The resulting dataset provides twice-daily, 1 km LST observations for the entire pan-Arctic region over four decades. This enhanced dataset enables improved modelling of permafrost, reconstruction of near-surface air temperature, and assessment of surface mass balance of the Greenland Ice Sheet. Additionally, it supports climate monitoring efforts in the pre-MODIS era and offers a framework adaptable to future satellite missions for thermal infrared observation and climate data record continuity.
摘要：陆地表面温度（LST）是一个重要的气候变量（ECV），对于了解陆地-大气能量交换和监测气候变化至关重要，特别是在迅速变暖的北极。基于卫星的长期地表温度记录，例如源自高级甚高分辨率辐射计 (AVHRR) 的记录，对于检测气候趋势至关重要。然而，AVHRR 全球区域覆盖 (GAC) 数据的粗略空间分辨率限制了其分析北极细尺度永久冻土动力学和其他表面过程的效用。本文提出了一个新的 42 年泛北极 LST 数据集，使用基于深度各向异性扩散模型的超分辨率算法将 AVHRR GAC 缩小到 1 公里。该模型在 MODIS LST 数据上进行训练，使用粗化输入和原始分辨率输出，并以高分辨率土地覆盖、数字高程和植被高度图为指导。生成的数据集提供了四十年来整个泛北极地区每日两次、1 公里的 LST 观测结果。这个增强的数据集可以改进永久冻土的建模、近地表气温的重建以及格陵兰冰盖表面质量平衡的评估。此外，它还支持 MODIS 时代之前的气候监测工作，并提供了一个适合未来卫星任务的框架，以实现热红外观测和气候数据记录的连续性。
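
Illustrative sketch (not from the paper): the abstract mentions training on coarsened inputs paired with native-resolution outputs. One simple way to build such a pair is block averaging, shown below. The function name `coarsen`, the averaging scheme, and the factor are assumptions made for illustration; the authors' actual degradation procedure is not described in the abstract.

```python
def coarsen(grid, factor):
    """Downsample a 2D grid by averaging non-overlapping factor x factor blocks.

    Hypothetical helper: block averaging is an assumed stand-in for whatever
    coarsening the authors apply to the native-resolution data.
    """
    height, width = len(grid), len(grid[0])
    if height % factor or width % factor:
        raise ValueError("grid dimensions must be divisible by factor")
    coarse = []
    for r in range(0, height, factor):
        row = []
        for c in range(0, width, factor):
            block = [grid[i][j]
                     for i in range(r, r + factor)
                     for j in range(c, c + factor)]
            row.append(sum(block) / len(block))
        coarse.append(row)
    return coarse


# A training pair: the coarse grid is the model input,
# the native-resolution grid is the target.
native = [[1, 2, 3, 4],
          [5, 6, 7, 8],
          [9, 10, 11, 12],
          [13, 14, 15, 16]]
pair = (coarsen(native, 2), native)
```

Here `pair[0]` is a 2x2 grid of block means and `pair[1]` is the untouched 4x4 target.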

Title: One-Step Diffusion Transformer for Controllable Real-World Image Super-Resolution

Authors: Yushun Fang, Yuxiang Chen, Shibo Yin, Qiang Hu, Jiangchao Yao, Ya Zhang, Xiaoyun Zhang, Yanfeng Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.17138
Pdf URL: https://arxiv.org/pdf/2511.17138
Copy Paste: [[2511.17138]] One-Step Diffusion Transformer for Controllable Real-World Image Super-Resolution(https://arxiv.org/abs/2511.17138)
Keywords: super-resolution, generative
Abstract: Recent advances in diffusion-based real-world image super-resolution (Real-ISR) have demonstrated remarkable perceptual quality, yet the balance between fidelity and controllability remains a problem: multi-step diffusion-based methods suffer from generative diversity and randomness, resulting in low fidelity, while one-step methods lose control flexibility due to fidelity-specific finetuning. In this paper, we present ODTSR, a one-step diffusion transformer based on Qwen-Image that performs Real-ISR considering fidelity and controllability simultaneously: a newly introduced visual stream receives low-quality images (LQ) with adjustable noise (Control Noise), and the original visual stream receives LQs with consistent noise (Prior Noise), forming the Noise-hybrid Visual Stream (NVS) design. ODTSR further employs Fidelity-aware Adversarial Training (FAA) to enhance controllability and achieve one-step inference. Extensive experiments demonstrate that ODTSR not only achieves state-of-the-art (SOTA) performance on generic Real-ISR, but also enables prompt controllability on challenging scenarios such as real-world scene text image super-resolution (STISR) of Chinese characters without training on specific datasets.
摘要：基于扩散的现实世界图像超分辨率（Real-ISR）的最新进展已经表现出卓越的感知质量，但保真度和可控性之间的平衡仍然是一个问题：基于多步扩散的方法受到生成多样性和随机性的影响，导致保真度低，而一步方法由于保真度特定的微调而失去了控制灵活性。在本文中，我们提出了 ODTSR，一种基于 Qwen-Image 的一步扩散变换器，它同时考虑保真度和可控性，执行 Real-ISR：新引入的视觉流接收具有可调节噪声（控制噪声）的低质量图像（LQ），原始视觉流接收具有一致噪声（先验噪声）的 LQ，形成噪声混合视觉流（NVS）设计。 ODTSR进一步采用保真度感知对抗训练（FAA）来增强可控性并实现一步推理。大量实验表明，ODTSR 不仅在通用 Real-ISR 上实现了最先进的 (SOTA) 性能，而且无需在特定数据集上进行训练，就能在具有挑战性的场景中实现快速可控，例如汉字的真实场景文本图像超分辨率 (STISR)。

Title: DiffRefiner: Coarse to Fine Trajectory Planning via Diffusion Refinement with Semantic Interaction for End to End Autonomous Driving

Authors: Liuhan Yin, Runkun Ju, Guodong Guo, Erkang Cheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.17150
Pdf URL: https://arxiv.org/pdf/2511.17150
Copy Paste: [[2511.17150]] DiffRefiner: Coarse to Fine Trajectory Planning via Diffusion Refinement with Semantic Interaction for End to End Autonomous Driving(https://arxiv.org/abs/2511.17150)
Keywords: generative
Abstract: Unlike discriminative approaches in autonomous driving that predict a fixed set of candidate trajectories of the ego vehicle, generative methods, such as diffusion models, learn the underlying distribution of future motion, enabling more flexible trajectory prediction. However, since these methods typically rely on denoising human-crafted trajectory anchors or random noise, there remains significant room for improvement. In this paper, we propose DiffRefiner, a novel two-stage trajectory prediction framework. The first stage uses a transformer-based Proposal Decoder to generate coarse trajectory predictions by regressing from sensor inputs using predefined trajectory anchors. The second stage applies a Diffusion Refiner that iteratively denoises and refines these initial predictions. In this way, we enhance the performance of diffusion-based planning by incorporating a discriminative trajectory proposal module, which provides strong guidance for the generative refinement process. Furthermore, we design a fine-grained denoising decoder to enhance scene compliance, enabling more accurate trajectory prediction through enhanced alignment with the surrounding environment. Experimental results demonstrate that DiffRefiner achieves state-of-the-art performance, attaining 87.4 EPDMS on NAVSIM v2, and 87.1 DS along with 71.4 SR on Bench2Drive, thereby setting new records on both public benchmarks. The effectiveness of each component is validated via ablation studies as well.
摘要：与自动驾驶中预测自我车辆的一组固定候选轨迹的判别方法不同，扩散模型等生成方法可以学习未来运动的基本分布，从而实现更灵活的轨迹预测。然而，由于这些方法通常依赖于人工制作的轨迹锚点或随机噪声的去噪，因此仍然存在很大的改进空间。在本文中，我们提出了 DiffRefiner，一种新颖的两阶段轨迹预测框架。第一阶段使用基于变压器的提案解码器，通过使用预定义的轨迹锚点对传感器输入进行回归来生成粗略的轨迹预测。第二阶段应用扩散细化器，迭代地对这些初始预测进行去噪和细化。通过这种方式，我们通过结合判别性轨迹提议模块来增强基于扩散的规划的性能，该模块为生成细化过程提供了强有力的指导。此外，我们设计了一个细粒度的去噪解码器来增强场景合规性，通过增强与周围环境的对齐来实现更准确的轨迹预测。实验结果表明，DiffRefiner 实现了最先进的性能，在 NAVSIM v2 上获得了 87.4 EPDMS，在 Bench2Drive 上获得了 87.1 DS 和 71.4 SR，从而在两个公共基准测试中创造了新记录。每个组件的有效性也通过消融研究得到验证。

Title: FireScope: Wildfire Risk Prediction with a Chain-of-Thought Oracle

Authors: Mario Markov (1), Stefan Maria Ailuro (1), Luc Van Gool (1), Konrad Schindler (2), Danda Pani Paudel (1 and 2) ((1) INSAIT, Sofia University, (2) ETH Zurich)
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2511.17171
Pdf URL: https://arxiv.org/pdf/2511.17171
Copy Paste: [[2511.17171]] FireScope: Wildfire Risk Prediction with a Chain-of-Thought Oracle(https://arxiv.org/abs/2511.17171)
Keywords: generation
Abstract: Predicting wildfire risk is a reasoning-intensive spatial problem that requires the integration of visual, climatic, and geographic factors to infer continuous risk maps. Existing methods lack the causal reasoning and multimodal understanding required for reliable generalization. We introduce $\textbf{FireScope-Bench}$, a large-scale dataset and benchmark that couples Sentinel-2 imagery and climate data with expert-defined risk rasters across the USA, and real wildfire events in Europe for cross-continental evaluation. Building on this dataset, we propose $\textbf{FireScope}$, a VLM-based reasoning-to-generation framework that learns from both reinforcement learning and visual supervision to predict risk rasters with complementary reasoning traces. When trained in the USA and tested in Europe, $\textbf{FireScope}$ achieves substantial performance gains, while expert feedback and automated analysis confirm that its reasoning traces are faithful and semantically meaningful. Our findings demonstrate that reasoning can ground raster prediction models, improving both generalization and interpretability. To our knowledge, this is the first framework to (1) demonstrate that language-based reasoning can improve generalization in visual generation, (2) propose a high-resolution wildfire risk model that can be applied across continents, and (3) enable systematic studies of robust cross-continental generalization for multimodal fire risk models. We believe that $\textbf{FireScope-Bench}$ has the potential to serve as a foundation for advancing reasoning-driven, interpretable and generalizable spatial modeling. Data and source code will be made publicly available.
摘要：预测野火风险是一个推理密集型空间问题，需要整合视觉、气候和地理因素来推断连续的风险地图。现有方法缺乏可靠泛化所需的因果推理和多模态理解。我们引入 $\textbf{FireScope-Bench}$，这是一个大型数据集和基准，它将 Sentinel-2 图像和气候数据与专家定义的美国各地风险栅格以及欧洲的真实野火事件结合起来，以进行跨大陆评估。在此数据集的基础上，我们提出了 $\textbf{FireScope}$，这是一个基于 VLM 的推理生成框架，它从强化学习和视觉监督中学习，以通过补充推理轨迹来预测风险栅格。当在美国进行训练并在欧洲进行测试时，$\textbf{FireScope}$ 取得了显着的性能提升，而专家反馈和自动分析证实了其推理轨迹是忠实的且具有语义意义。我们的研究结果表明，推理可以基础栅格预测模型，从而提高泛化性和可解释性。据我们所知，这是第一个框架：（1）证明基于语言的推理可以提高视觉生成的泛化能力，（2）提出一种可以跨大陆应用的高分辨率野火风险模型，（3）能够对多模式火灾风险模型的稳健跨大陆泛化进行系统研究。我们相信 $\textbf{FireScope-Bench}$ 有潜力成为推进推理驱动、可解释和可概括的空间建模的基础。数据和源代码将公开。

Title: PostCam: Camera-Controllable Novel-View Video Generation with Query-Shared Cross-Attention

Authors: Yipeng Chen, Zhichao Ye, Zhenzhou Fang, Xinyu Chen, Xiaoyu Zhang, Jialing Liu, Nan Wang, Haomin Liu, Guofeng Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.17185
Pdf URL: https://arxiv.org/pdf/2511.17185
Copy Paste: [[2511.17185]] PostCam: Camera-Controllable Novel-View Video Generation with Query-Shared Cross-Attention(https://arxiv.org/abs/2511.17185)
Keywords: generation
Abstract: We propose PostCam, a framework for novel-view video generation that enables post-capture editing of camera trajectories in dynamic scenes. We find that existing video recapture methods suffer from suboptimal camera motion injection strategies; such suboptimal designs not only limit camera control precision but also result in generated videos that fail to preserve fine visual details from the source video. To achieve more accurate and flexible motion manipulation, PostCam introduces a query-shared cross-attention module. It integrates two distinct forms of control signals: the 6-DoF camera poses and the 2D rendered video frames. By fusing them into a unified representation within a shared feature space, our model can extract underlying motion cues, which enhances both control precision and generation quality. Furthermore, we adopt a two-stage training strategy: the model first learns coarse camera control from pose inputs, and then incorporates visual information to refine motion accuracy and enhance visual fidelity. Experiments on both real-world and synthetic datasets demonstrate that PostCam outperforms state-of-the-art methods by over 20% in camera control precision and view consistency, while achieving the highest video generation quality. Our project webpage is publicly available at: this https URL
摘要：我们提出了 PostCam，这是一种新视角视频生成框架，可以在动态场景中对摄像机轨迹进行捕获后编辑。我们发现现有的视频重新捕获方法存在次优的摄像机运动注入策略；这种次优设计不仅限制了摄像机控制精度，而且还导致生成的视频无法保留源视频的精细视觉细节。为了实现更准确、更灵活的运动操纵，PostCam引入了查询共享交叉注意力模块。它集成了两种不同形式的控制信号：6-DoF 相机姿势和 2D 渲染视频帧。通过将它们融合到共享特征空间内的统一表示中，我们的模型可以提取底层运动线索，从而提高控制精度和生成质量。此外，我们采用两阶段训练策略：模型首先从姿势输入中学习粗略的相机控制，然后结合视觉信息来提高运动准确性并增强视觉保真度。对真实世界和合成数据集的实验表明，PostCam 在摄像机控制精度和视图一致性方面比最先进的方法高出 20% 以上，同时实现了最高的视频生成质量。我们的项目网页可公开访问：此 https URL

Title: Dual-domain Adaptation Networks for Realistic Image Super-resolution

Authors: Chaowei Fang, Bolin Fu, De Cheng, Lechao Cheng, Guanbin Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.17217
Pdf URL: https://arxiv.org/pdf/2511.17217
Copy Paste: [[2511.17217]] Dual-domain Adaptation Networks for Realistic Image Super-resolution(https://arxiv.org/abs/2511.17217)
Keywords: super-resolution
Abstract: Realistic image super-resolution (SR) focuses on transforming real-world low-resolution (LR) images into high-resolution (HR) ones, handling more complex degradation patterns than synthetic SR tasks. This is critical for applications like surveillance, medical imaging, and consumer electronics. However, current methods struggle with limited real-world LR-HR data, impacting the learning of basic image features. Pre-trained SR models from large-scale synthetic datasets offer valuable prior knowledge, which can improve generalization, speed up training, and reduce the need for extensive real-world data in realistic SR tasks. In this paper, we introduce a novel approach, Dual-domain Adaptation Networks, which is able to efficiently adapt pre-trained image SR models from simulated to real-world datasets. To achieve this target, we first set up a spatial-domain adaptation strategy through selectively updating parameters of pre-trained models and employing the low-rank adaptation technique to adjust frozen parameters. Recognizing that image super-resolution involves recovering high-frequency components, we further integrate a frequency domain adaptation branch into the adapted model, which combines the spectral data of the input and the spatial-domain backbone's intermediate features to infer HR frequency maps, enhancing the SR result. Experimental evaluations on public realistic image SR benchmarks, including RealSR, D2CRealSR, and DRealSR, demonstrate the superiority of our proposed method over existing state-of-the-art models. Codes are available at: this https URL.
摘要：真实图像超分辨率 (SR) 专注于将现实世界的低分辨率 (LR) 图像转换为高分辨率 (HR) 图像，处理比合成 SR 任务更复杂的退化模式。这对于监控、医学成像和消费电子产品等应用至关重要。然而，当前的方法难以处理有限的现实世界 LR-HR 数据，影响了基本图像特征的学习。来自大规模合成数据集的预训练 SR 模型提供了宝贵的先验知识，可以提高泛化能力、加快训练速度并减少现实 SR 任务中对大量真实数据的需求。在本文中，我们介绍了一种新颖的方法，即双域适应网络，它能够有效地将预训练的图像 SR 模型从模拟数据集调整到现实世界的数据集。为了实现这一目标，我们首先通过选择性更新预训练模型的参数并采用低秩适应技术来调整冻结参数来建立空间域适应策略。认识到图像超分辨率涉及恢复高频分量，我们进一步将频域自适应分支集成到自适应模型中，该模型结合输入的频谱数据和空间域主干的中间特征来推断 HR 频率图，从而增强 SR 结果。对公共真实图像 SR 基准（包括 RealSR、D2CRealSR 和 DRealSR）的实验评估证明了我们提出的方法相对于现有最先进模型的优越性。代码可在以下位置获得：此 https URL。

Title: FlexiFlow: decomposable flow matching for generation of flexible molecular ensemble

Authors: Riccardo Tedoldi, Ola Engkvist, Patrick Bryant, Hossein Azizpour, Jon Paul Janet, Alessandro Tibo
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.17249
Pdf URL: https://arxiv.org/pdf/2511.17249
Copy Paste: [[2511.17249]] FlexiFlow: decomposable flow matching for generation of flexible molecular ensemble(https://arxiv.org/abs/2511.17249)
Keywords: generation
Abstract: Sampling useful three-dimensional molecular structures along with their most favorable conformations is a key challenge in drug discovery. Current state-of-the-art 3D de-novo design flow matching or diffusion-based models are limited to generating a single conformation. However, the conformational landscape of a molecule determines its observable properties and how tightly it is able to bind to a given protein target. By generating a representative set of low-energy conformers, we can more directly assess these properties and potentially improve the ability to generate molecules with desired thermodynamic observables. Towards this aim, we propose FlexiFlow, a novel architecture that extends flow-matching models, allowing for the joint sampling of molecules along with multiple conformations while preserving both equivariance and permutation invariance. We demonstrate the effectiveness of our approach on the QM9 and GEOM Drugs datasets, achieving state-of-the-art results in molecular generation tasks. Our results show that FlexiFlow can generate valid, unstrained, unique, and novel molecules with high fidelity to the training data distribution, while also capturing the conformational diversity of molecules. Moreover, we show that our model can generate conformational ensembles that provide similar coverage to state-of-the-art physics-based methods at a fraction of the inference time. Finally, FlexiFlow can be successfully transferred to the protein-conditioned ligand generation task, even when the dataset contains only static pockets without accompanying conformations.
摘要：对有用的三维分子结构及其最有利的构象进行采样是药物发现中的一个关键挑战。当前最先进的 3D 从头设计流程匹配或基于扩散的模型仅限于生成单一构象。然而，分子的构象景观决定了其可观察的特性以及它与给定蛋白质靶标结合的紧密程度。通过生成一组具有代表性的低能构象异构体，我们可以更直接地评估这些特性，并有可能提高生成具有所需热力学可观测值的分子的能力。为了实现这一目标，我们提出了 FlexiFlow，这是一种扩展流匹配模型的新颖架构，允许对具有多种构象的分子进行联合采样，同时保持等变性和排列不变性。我们在 QM9 和 GEOM Drugs 数据集上展示了我们的方法的有效性，在分子生成任务中取得了最先进的结果。我们的结果表明，FlexiFlow 可以生成有效、不受约束、独特且新颖的分子，对训练数据分布具有高保真度，同时还捕获分子的构象多样性。此外，我们表明，我们的模型可以生成构象系综，其在推理时间的一小部分内提供与最先进的基于物理的方法相似的覆盖范围。最后，即使数据集仅包含静态口袋而没有伴随构象，FlexiFlow 也可以成功转移到蛋白质条件配体生成任务。

Title: Intervene-All-Paths: Unified Mitigation of LVLM Hallucinations across Alignment Formats

Authors: Jiaye Qian, Ge Zheng, Yuchen Zhu, Sibei Yang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.17254
Pdf URL: https://arxiv.org/pdf/2511.17254
Copy Paste: [[2511.17254]] Intervene-All-Paths: Unified Mitigation of LVLM Hallucinations across Alignment Formats(https://arxiv.org/abs/2511.17254)
Keywords: generative
Abstract: Despite their impressive performance across a wide range of tasks, Large Vision-Language Models (LVLMs) remain prone to hallucination. In this study, we propose a comprehensive intervention framework aligned with the transformer's causal architecture in LVLMs, integrating the effects of different intervention paths on hallucination. We find that hallucinations in LVLMs do not arise from a single causal path, but rather from the interplay among image-to-input-text, image-to-output-text, and text-to-text pathways. For the first time, we also find that LVLMs rely on different pathways depending on the question-answer alignment format. Building on these insights, we propose simple yet effective methods to identify and intervene on critical hallucination heads within each pathway, tailored to discriminative and generative formats. Experiments across multiple benchmarks demonstrate that our approach consistently reduces hallucinations across diverse alignment types.
摘要：尽管大型视觉语言模型（LVLM）在广泛的任务中表现出色，但仍然容易产生幻觉。在这项研究中，我们提出了一个与 LVLM 中 Transformer 因果架构相一致的综合干预框架，整合了不同干预路径对幻觉的影响。我们发现 LVLM 中的幻觉并非源自单一因果路径，而是源自图像到输入文本、图像到输出文本和文本到文本路径之间的相互作用。我们还首次发现 LVLM 依赖于不同的路径，具体取决于问答对齐格式。基于这些见解，我们提出了简单而有效的方法来识别和干预每个途径中的关键幻觉头，这些方法适合歧视性和生成性格式。跨多个基准的实验表明，我们的方法始终如一地减少了不同对齐类型的幻觉。

Title: A Little More Like This: Text-to-Image Retrieval with Vision-Language Models Using Relevance Feedback

Authors: Bulat Khaertdinov, Mirela Popa, Nava Tintarev
Subjects: cs.CV, cs.IR
Abstract URL: https://arxiv.org/abs/2511.17255
Pdf URL: https://arxiv.org/pdf/2511.17255
Copy Paste: [[2511.17255]] A Little More Like This: Text-to-Image Retrieval with Vision-Language Models Using Relevance Feedback(https://arxiv.org/abs/2511.17255)
Keywords: generative
Abstract: Large vision-language models (VLMs) enable intuitive visual search using natural language queries. However, improving their performance often requires fine-tuning and scaling to larger model variants. In this work, we propose a mechanism inspired by traditional text-based search to improve retrieval performance at inference time: relevance feedback. While relevance feedback can serve as an alternative to fine-tuning, its model-agnostic design also enables use with fine-tuned VLMs. Specifically, we introduce and evaluate four feedback strategies for VLM-based retrieval. First, we revise classical pseudo-relevance feedback (PRF), which refines query embeddings based on top-ranked results. To address its limitations, we propose generative relevance feedback (GRF), which uses synthetic captions for query refinement. Furthermore, we introduce an attentive feedback summarizer (AFS), a custom transformer-based model that integrates multimodal fine-grained features from relevant items. Finally, we simulate explicit feedback using ground-truth captions as an upper-bound baseline. Experiments on Flickr30k and COCO with the VLM backbones show that GRF, AFS, and explicit feedback improve retrieval performance by 3-5% in MRR@5 for smaller VLMs, and 1-3% for larger ones, compared to retrieval with no feedback. Moreover, AFS, similarly to explicit feedback, mitigates query drift and is more robust than GRF in iterative, multi-turn retrieval settings. Our findings demonstrate that relevance feedback can consistently enhance retrieval across VLMs and open up opportunities for interactive and adaptive visual search.
摘要：大型视觉语言模型 (VLM) 支持使用自然语言查询进行直观的视觉搜索。然而，提高其性能通常需要微调和扩展到更大的模型变体。在这项工作中，我们提出了一种受传统基于文本的搜索启发的机制，以提高推理时的检索性能：相关性反馈。虽然相关性反馈可以作为微调的替代方案，但其与模型无关的设计也可以与微调的 VLM 一起使用。具体来说，我们介绍并评估了基于 VLM 的检索的四种反馈策略。首先，我们修改了经典的伪相关反馈（PRF），它根据排名最高的结果细化查询嵌入。为了解决其局限性，我们提出了生成相关性反馈（GRF），它使用合成标题来细化查询。此外，我们引入了一个细心的反馈摘要器（AFS），这是一种基于自定义变压器的模型，集成了相关项目的多模式细粒度特征。最后，我们使用真实字幕作为上限基线来模拟显式反馈。在具有 VLM 主干的 Flickr30k 和 COCO 上进行的实验表明，与没有反馈的检索相比，GRF、AFS 和显式反馈在 MRR@5 中将较小 VLM 的检索性能提高了 3-5%，将较大 VLM 的检索性能提高了 1-3%。此外，AFS 与显式反馈类似，可以减轻查询漂移，并且在迭代、多轮检索设置中比 GRF 更稳健。我们的研究结果表明，相关性反馈可以持续增强跨 VLM 的检索，并为交互式和自适应视觉搜索提供机会。
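
Illustrative sketch (not from the paper): the abstract describes pseudo-relevance feedback as refining the query embedding from top-ranked results, and reports MRR@5. The code below shows one common form of that idea, a Rocchio-style update toward the centroid of the top-ranked items, plus a plain MRR@k computation. The Rocchio update, the weights `alpha` and `beta`, and all function names are assumptions; the paper's exact formulation may differ.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, items):
    """Return item indices sorted by decreasing cosine similarity to the query."""
    return sorted(range(len(items)),
                  key=lambda i: cosine(query, items[i]),
                  reverse=True)


def prf_refine(query, items, top_k=2, alpha=1.0, beta=0.5):
    """Rocchio-style update (assumed): move the query toward the top-k centroid."""
    top = rank(query, items)[:top_k]
    centroid = [sum(items[i][d] for i in top) / len(top)
                for d in range(len(query))]
    return [alpha * q + beta * c for q, c in zip(query, centroid)]


def mrr_at_k(rankings, relevant, k=5):
    """Mean reciprocal rank of the single relevant item per query, cut off at k."""
    total = 0.0
    for ranking, rel in zip(rankings, relevant):
        for position, index in enumerate(ranking[:k], start=1):
            if index == rel:
                total += 1.0 / position
                break
    return total / len(rankings)
```

In an iterative setting, `prf_refine` would be applied once per feedback round, and the refined query re-ranked with `rank`.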

Title: Range-Edit: Semantic Mask Guided Outdoor LiDAR Scene Editing

Authors: Suchetan G. Uppur, Hemant Kumar, Vaibhav Kumar
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.17269
Pdf URL: https://arxiv.org/pdf/2511.17269
Copy Paste: [[2511.17269]] Range-Edit: Semantic Mask Guided Outdoor LiDAR Scene Editing(https://arxiv.org/abs/2511.17269)
Keywords: generation
Abstract: Training autonomous driving and navigation systems requires large and diverse point cloud datasets that capture complex edge case scenarios from various dynamic urban settings. Acquiring such diverse scenarios from real-world point cloud data, especially for critical edge cases, is challenging, which restricts system generalization and robustness. Current methods rely on simulating point cloud data within handcrafted 3D virtual environments, which is time-consuming, computationally expensive, and often fails to fully capture the complexity of real-world scenes. To address some of these issues, this research proposes a novel approach that edits real-world LiDAR scans using semantic mask-based guidance to generate novel synthetic LiDAR point clouds. We incorporate range image projection and semantic mask conditioning to achieve diffusion-based generation. Point clouds are transformed to 2D range view images, which are used as an intermediate representation to enable semantic editing using convex hull-based semantic masks. These masks guide the generation process by providing information on the dimensions, orientations, and locations of objects in the real environment, ensuring geometric consistency and realism. This approach demonstrates high-quality LiDAR point cloud generation, capable of producing complex edge cases and dynamic scenes, as validated on the KITTI-360 dataset. This offers a cost-effective and scalable solution for generating diverse LiDAR data, a step toward improving the robustness of autonomous driving systems.
摘要：训练自动驾驶和导航系统需要大量且多样化的点云数据集，这些数据集可以从各种动态城市环境中捕获复杂的边缘情况场景。从现实世界的点云数据中获取如此多样化的场景，特别是对于关键边缘情况，是具有挑战性的，这限制了系统的通用性和鲁棒性。当前的方法依赖于在手工制作的 3D 虚拟环境中模拟点云数据，这不仅耗时、计算成本高，而且通常无法完全捕捉现实世界场景的复杂性。为了解决其中一些问题，本研究提出了一种新颖的方法，通过使用基于语义掩模的指导编辑现实世界的 LiDAR 扫描来生成新颖的合成 LiDAR 点云来解决所讨论的问题。我们结合了范围图像投影和语义掩模调节来实现基于扩散的生成。点云转换为 2D 范围视图图像，用作中间表示，以使用基于凸包的语义掩模进行语义编辑。这些掩模通过提供有关真实环境中对象的尺寸、方向和位置的信息来指导生成过程，确保几何一致性和真实性。该方法展示了高质量的 LiDAR 点云生成，能够生成复杂的边缘情况和动态场景，并在 KITTI-360 数据集上进行了验证。这为生成各种激光雷达数据提供了一种经济高效且可扩展的解决方案，这是提高自动驾驶系统稳健性的一步。
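
Illustrative sketch (not from the paper): the abstract says point clouds are transformed to 2D range view images. A common way to do this is a spherical projection, where each point's azimuth selects a column and its elevation selects a row, and the pixel stores the range. The image size, field of view, and the keep-nearest rule below are assumed values for a toy example, not the settings used by the authors.

```python
import math


def to_range_image(points, height=4, width=8,
                   fov_up_deg=15.0, fov_down_deg=-15.0):
    """Project (x, y, z) points to a height x width range image.

    Pixels hold the distance to the nearest point that falls in them;
    0.0 marks an empty pixel. All parameters are illustrative assumptions.
    """
    fov_up = math.radians(fov_up_deg)
    fov_down = math.radians(fov_down_deg)
    fov = fov_up - fov_down
    image = [[0.0] * width for _ in range(height)]
    for x, y, z in points:
        r = math.sqrt(x * x + y * y + z * z)
        if r == 0.0:
            continue  # the sensor origin has no direction
        yaw = math.atan2(y, x)      # azimuth in [-pi, pi]
        pitch = math.asin(z / r)    # elevation in [-pi/2, pi/2]
        if not (fov_down <= pitch <= fov_up):
            continue  # outside the vertical field of view
        col = int(0.5 * (1.0 - yaw / math.pi) * width) % width
        row = min(height - 1, int((fov_up - pitch) / fov * height))
        if image[row][col] == 0.0 or r < image[row][col]:
            image[row][col] = r     # keep the closest return
    return image
```

Editing in this representation means changing pixel values inside a mask region; the edited image can then be mapped back to 3D by inverting the same angles.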

Title: Where Culture Fades: Revealing the Cultural Gap in Text-to-Image Generation

Authors: Chuancheng Shi, Shangze Li, Shiming Guo, Simiao Xie, Wenhua Wu, Jingtong Dou, Chao Wu, Canran Xiao, Cong Wang, Zifeng Cheng, Fei Shen, Tat-Seng Chua
Subjects: cs.CV, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2511.17282
Pdf URL: https://arxiv.org/pdf/2511.17282
Copy Paste: [[2511.17282]] Where Culture Fades: Revealing the Cultural Gap in Text-to-Image Generation(https://arxiv.org/abs/2511.17282)
Keywords: generation
Abstract: Multilingual text-to-image (T2I) models have advanced rapidly in terms of visual realism and semantic alignment, and are now widely utilized. Yet outputs vary across cultural contexts: because language carries cultural connotations, images synthesized from multilingual prompts should preserve cross-lingual cultural consistency. We conduct a comprehensive analysis showing that current T2I models often produce culturally neutral or English-biased results under multilingual prompts. Analyses of two representative models indicate that the issue stems not from missing cultural knowledge but from insufficient activation of culture-related representations. We propose a probing method that localizes culture-sensitive signals to a small set of neurons in a few fixed layers. Guided by this finding, we introduce two complementary alignment strategies: (1) inference-time cultural activation that amplifies the identified neurons without backbone fine-tuning; and (2) layer-targeted cultural enhancement that updates only culturally relevant layers. Experiments on our CultureBench demonstrate consistent improvements over strong baselines in cultural consistency while preserving fidelity and diversity.
摘要：多语言文本到图像（T2I）模型在视觉真实感和语义对齐方面取得了迅速发展，现已得到广泛应用。然而，输出因文化背景而异：由于语言具有文化内涵，因此多语言提示合成的图像应保持跨语言文化的一致性。我们进行的全面分析表明，当前的 T2I 模型通常在多语言提示下产生文化中立或偏英语的结果。对两个代表性模型的分析表明，问题并非源于文化知识的缺失，而是源于文化相关表征的激活不足。我们提出了一种探测方法，将培养敏感信号定位到几个固定层中的一小组神经元。在这一发现的指导下，我们引入了两种互补的对齐策略：（1）推理时文化激活，无需对主干进行微调即可放大已识别的神经元； (2) 针对层的文化增强，仅更新文化相关层。我们的 CultureBench 上的实验表明，在文化一致性方面，在保持保真度和多样性的同时，在强基线上取得了持续改进。

Title: Refracting Reality: Generating Images with Realistic Transparent Objects

Authors: Yue Yin, Enze Tao, Dylan Campbell
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.17340
Pdf URL: https://arxiv.org/pdf/2511.17340
Copy Paste: [[2511.17340]] Refracting Reality: Generating Images with Realistic Transparent Objects(https://arxiv.org/abs/2511.17340)
Keywords: generation, generative
Abstract: Generative image models can produce convincingly real images, with plausible shapes, textures, layouts and lighting. However, one domain in which they perform notably poorly is in the synthesis of transparent objects, which exhibit refraction, reflection, absorption and scattering. Refraction is a particular challenge, because refracted pixel rays often intersect with surfaces observed in other parts of the image, providing a constraint on the color. It is clear from inspection that generative models have not distilled the laws of optics sufficiently well to accurately render refractive objects. In this work, we consider the problem of generating images with accurate refraction, given a text prompt. We synchronize the pixels within the object's boundary with those outside by warping and merging the pixels using Snell's Law of Refraction, at each step of the generation trajectory. For those surfaces that are not directly observed in the image, but are visible via refraction or reflection, we recover their appearance by synchronizing the image with a second generated image -- a panorama centered at the object -- using the same warping and merging procedure. We demonstrate that our approach generates much more optically-plausible images that respect the physical constraints.
摘要：生成图像模型可以生成令人信服的真实图像，具有合理的形状、纹理、布局和照明。然而，它们表现特别差的一个领域是透明物体的合成，这些物体表现出折射、反射、吸收和散射。折射是一个特殊的挑战，因为折射的像素光线通常与图像其他部分中观察到的表面相交，从而对颜色产生限制。从检查中可以清楚地看出，生成模型没有足够好地提炼光学定律来准确渲染折射物体。在这项工作中，我们考虑在给定文本提示的情况下生成具有准确折射的图像的问题。在生成轨迹的每一步，我们使用斯涅尔折射定律扭曲和合并像素，从而使对象边界内的像素与外部的像素同步。对于那些在图像中未直接观察到但通过折射或反射可见的表面，我们通过使用相同的扭曲和合并程序将图像与第二个生成的图像（以对象为中心的全景图）同步来恢复其外观。我们证明，我们的方法可以生成更加符合物理约束的光学合理图像。
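
Illustrative sketch (not from the paper): the abstract relies on Snell's Law of Refraction to relate pixels inside a transparent object to those outside it. The vector form of the law, which gives the refracted ray direction from the incident direction, the surface normal, and the two refractive indices, can be written as below. The function name and the example indices (1.0 for air, 1.5 for glass) are assumptions for illustration; how the authors apply the law within the generation trajectory is not shown here.

```python
import math


def refract(direction, normal, eta_in, eta_out):
    """Refract a unit ray direction at a surface using Snell's law.

    direction: unit vector of the incoming ray.
    normal: unit surface normal pointing against the incoming ray.
    eta_in, eta_out: refractive indices on the incoming and outgoing side.
    Returns the refracted unit vector, or None on total internal reflection.
    """
    eta = eta_in / eta_out
    cos_i = -sum(d * n for d, n in zip(direction, normal))
    sin2_t = eta * eta * (1.0 - cos_i * cos_i)
    if sin2_t > 1.0:
        return None  # total internal reflection: no transmitted ray
    cos_t = math.sqrt(1.0 - sin2_t)
    k = eta * cos_i - cos_t
    return tuple(eta * d + k * n for d, n in zip(direction, normal))


# A ray hitting a glass surface at 45 degrees from air.
s = math.sin(math.radians(45.0))
bent = refract((s, 0.0, -s), (0.0, 0.0, 1.0), 1.0, 1.5)
```

The tangential component of `bent` equals `s / 1.5`, which is the familiar statement that the sines of the two angles scale with the inverse ratio of the refractive indices.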

Title: Loomis Painter: Reconstructing the Painting Process

Authors: Markus Pobitzer, Chang Liu, Chenyi Zhuang, Teng Long, Bin Ren, Nicu Sebe
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.17344
Pdf URL: https://arxiv.org/pdf/2511.17344
Copy Paste: [[2511.17344]] Loomis Painter: Reconstructing the Painting Process(https://arxiv.org/abs/2511.17344)
Keywords: generation, generative
Abstract: Step-by-step painting tutorials are vital for learning artistic techniques, but existing video resources (e.g., YouTube) lack interactivity and personalization. While recent generative models have advanced artistic image synthesis, they struggle to generalize across media and often show temporal or structural inconsistencies, hindering faithful reproduction of human creative workflows. To address this, we propose a unified framework for multi-media painting process generation with a semantics-driven style control mechanism that embeds multiple media into a diffusion model's conditional space and uses cross-medium style augmentation. This enables consistent texture evolution and process transfer across styles. A reverse-painting training strategy further ensures smooth, human-aligned generation. We also build a large-scale dataset of real painting processes and evaluate cross-media consistency, temporal coherence, and final-image fidelity, achieving strong results on LPIPS, DINO, and CLIP metrics. Finally, our Perceptual Distance Profile (PDP) curve quantitatively models the creative sequence, i.e., composition, color blocking, and detail refinement, mirroring human artistic progression.
摘要：循序渐进的绘画教程对于学习艺术技巧至关重要，但现有的视频资源（例如 YouTube）缺乏交互性和个性化。虽然最近的生成模型具有先进的艺术图像合成功能，但它们难以跨媒体推广，并且经常表现出时间或结构的不一致，阻碍了人类创意工作流程的忠实再现。为了解决这个问题，我们提出了一个统一的多媒体绘画过程生成框架，具有语义驱动的风格控制机制，将多种媒体嵌入到扩散模型条件空间中，并使用跨媒体风格增强。这使得跨风格的纹理演变和工艺转移能够保持一致。反向绘画训练策略进一步确保了平滑、人性化的生成。我们还构建了真实绘画过程的大规模数据集，并评估跨媒体一致性、时间一致性和最终图像保真度，在 LPIPS、DINO 和 CLIP 指标上取得了出色的结果。最后，我们的感知距离分布（PDP）曲线定量地模拟了创意序列，即构图、色块和细节细化，反映了人类艺术的进步。

Title: Designing and Generating Diverse, Equitable Face Image Datasets for Face Verification Tasks

Authors: Georgia Baltsou, Ioannis Sarridis, Christos Koutlis, Symeon Papadopoulos
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.17393
Pdf URL: https://arxiv.org/pdf/2511.17393
Copy Paste: [[2511.17393]] Designing and Generating Diverse, Equitable Face Image Datasets for Face Verification Tasks(https://arxiv.org/abs/2511.17393)
Keywords: generative
Abstract: Face verification is a significant component of identity authentication in various applications including online banking and secure access to personal devices. Most existing face image datasets suffer from notable biases related to race, gender, and other demographic characteristics, limiting the effectiveness and fairness of face verification systems. In response to these challenges, we propose a comprehensive methodology that integrates advanced generative models to create diverse, high-quality synthetic face images. This methodology emphasizes the representation of a diverse range of facial traits, ensuring adherence to characteristics permissible in identity card photographs. Furthermore, we introduce the Diverse and Inclusive Faces for Verification (DIF-V) dataset, comprising 27,780 images of 926 unique identities, designed as a benchmark for future research in face verification. Our analysis reveals that existing verification models exhibit biases toward certain genders and races, and notably, applying identity style modifications negatively impacts model performance. By tackling the inherent inequities in existing datasets, this work not only enriches the discussion on diversity and ethics in artificial intelligence but also lays the foundation for developing more inclusive and reliable face verification technologies.
摘要：人脸验证是各种应用中身份验证的重要组成部分，包括网上银行和个人设备的安全访问。大多数现有的人脸图像数据集往往存在与种族、性别和其他人口特征相关的显着偏差，限制了人脸验证系统的有效性和公平性。为了应对这些挑战，我们提出了一种综合方法，集成先进的生成模型来创建多样化的高质量合成人脸图像。这种方法强调表现各种面部特征，确保遵守身份证照片中允许的特征。此外，我们还引入了多样化和包容性的人脸验证 (DIF-V) 数据集，其中包含 926 个独特身份的 27,780 张图像，旨在作为未来人脸验证研究的基准。我们的分析表明，现有的验证模型对某些性别和种族存在偏见，值得注意的是，应用身份风格修改会对模型性能产生负面影响。通过解决现有数据集中固有的不平等问题，这项工作不仅丰富了人工智能多样性和伦理的讨论，而且为开发更具包容性和可靠的人脸验证技术奠定了基础

Title: MCMoE: Completing Missing Modalities with Mixture of Experts for Incomplete Multimodal Action Quality Assessment

Authors: Huangbiao Xu, Huanqi Wu, Xiao Ke, Junyi Wu, Rui Xu, Jinglin Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.17397
Pdf URL: https://arxiv.org/pdf/2511.17397
Copy Paste: [[2511.17397]] MCMoE: Completing Missing Modalities with Mixture of Experts for Incomplete Multimodal Action Quality Assessment(https://arxiv.org/abs/2511.17397)
Keywords: generation, quality assessment
Abstract: Multimodal Action Quality Assessment (AQA) has recently emerged as a promising paradigm. By leveraging complementary information across shared contextual cues, it enhances the discriminative evaluation of subtle intra-class variations in highly similar action sequences. However, partial modalities are frequently unavailable at the inference stage in reality. The absence of any modality often renders existing multimodal models inoperable. Furthermore, it triggers catastrophic performance degradation due to interruptions in cross-modal interactions. To address this issue, we propose a novel Missing Completion Framework with Mixture of Experts (MCMoE) that unifies unimodal and joint representation learning in single-stage training. Specifically, we propose an adaptive gated modality generator that dynamically fuses available information to reconstruct missing modalities. We then design modality experts to learn unimodal knowledge and dynamically mix the knowledge of all experts to extract cross-modal joint representations. With a mixture of experts, missing modalities are further refined and complemented. Finally, in the training phase, we mine the complete multimodal features and unimodal expert knowledge to guide modality generation and generation-based joint representation extraction. Extensive experiments demonstrate that our MCMoE achieves state-of-the-art results in both complete and incomplete multimodal learning on three public AQA benchmarks. Code is available at this https URL.
摘要：多模式行动质量评估（AQA）最近成为一种有前景的范例。通过利用共享上下文线索中的互补信息，它增强了对高度相似的动作序列中微妙的类内变化的判别性评估。然而，现实中，部分模态在推理阶段常常是不可用的。缺乏任何模态通常会导致现有的多模态模型无法操作。此外，由于跨模式交互的中断，它还会引发灾难性的性能下降。为了解决这个问题，我们提出了一种新颖的混合专家缺失完成框架（MCMoE），它将单阶段训练中的单峰和联合表示学习统一起来。具体来说，我们提出了一种自适应门控模态生成器，它动态融合可用信息以重建丢失的模态。然后，我们设计模态专家来学习单模态知识并动态混合所有专家的知识以提取跨模态联合表示。通过专家的混合，缺失的模式得到进一步完善和补充。最后，在训练阶段，我们挖掘完整的多模态特征和单模态专家知识来指导模态生成和基于生成的联合表示提取。大量实验表明，我们的 MCMoE 在三个公共 AQA 基准上的完整和不完整多模态学习中均取得了最先进的结果。代码可从此 https URL 获取。

Title: Planning with Sketch-Guided Verification for Physics-Aware Video Generation

Authors: Yidong Huang, Zun Wang, Han Lin, Dong-Ki Kim, Shayegan Omidshafiei, Jaehong Yoon, Yue Zhang, Mohit Bansal
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2511.17450
Pdf URL: https://arxiv.org/pdf/2511.17450
Copy Paste: [[2511.17450]] Planning with Sketch-Guided Verification for Physics-Aware Video Generation(https://arxiv.org/abs/2511.17450)
Keywords: generation
Abstract: Recent video generation approaches increasingly rely on planning intermediate control signals such as object trajectories to improve temporal coherence and motion fidelity. However, these methods mostly employ single-shot plans that are typically limited to simple motions, or iterative refinement which requires multiple calls to the video generator, incurring high computational cost. To overcome these limitations, we propose SketchVerify, a training-free, sketch-verification-based planning framework that improves motion planning quality with more dynamically coherent trajectories (i.e., physically plausible and instruction-consistent motions) prior to full video generation by introducing a test-time sampling and verification loop. Given a prompt and a reference image, our method predicts multiple candidate motion plans and ranks them using a vision-language verifier that jointly evaluates semantic alignment with the instruction and physical plausibility. To efficiently score candidate motion plans, we render each trajectory as a lightweight video sketch by compositing objects over a static background, which bypasses the need for expensive, repeated diffusion-based synthesis while achieving comparable performance. We iteratively refine the motion plan until a satisfactory one is identified, which is then passed to the trajectory-conditioned generator for final synthesis. Experiments on WorldModelBench and PhyWorldBench demonstrate that our method significantly improves motion quality, physical realism, and long-term consistency compared to competitive baselines while being substantially more efficient. Our ablation study further shows that scaling up the number of trajectory candidates consistently enhances overall performance.
摘要：最近的视频生成方法越来越依赖于规划中间控制信号（例如对象轨迹）来提高时间相干性和运动保真度。然而，这些方法大多采用单次计划，通常仅限于简单的运动，或者需要多次调用视频生成器的迭代细化，从而导致较高的计算成本。为了克服这些限制，我们提出了 SketchVerify，这是一种免训练、基于草图验证的规划框架，通过引入测试时采样和验证循环，在完整视频生成之前通过更动态一致的轨迹（即物理上合理且指令一致的运动）来提高运动规划质量。给定提示和参考图像，我们的方法可以预测多个候选运动计划，并使用视觉语言验证器对它们进行排名，该验证器联合评估语义与指令的对齐和物理合理性。为了有效地对候选运动计划进行评分，我们通过在静态背景上合成对象来将每个轨迹渲染为轻量级视频草图，这绕过了昂贵的、重复的基于扩散的合成的需要，同时实现了可比较的性能。我们迭代地细化运动计划，直到找到满意的运动计划，然后将其传递给轨迹条件生成器进行最终合成。 WorldModelBench 和 PhyWorldBench 上的实验表明，与竞争基线相比，我们的方法显着提高了运动质量、物理真实感和长期一致性，同时效率也大大提高。我们的消融研究进一步表明，扩大候选轨迹的数量可以持续提高整体性能。
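
Note: the test-time sampling and verification loop described in this abstract can be pictured roughly as below. This is only a minimal sketch and not the authors' implementation: the names `plan_with_verification`, `propose`, and `score`, as well as the default candidate count, threshold, and round limit, are hypothetical. In the paper's setting, `propose` would stand for the motion planner and `score` for the vision-language verifier applied to a rendered video sketch; here both are plain callables supplied by the caller.

```python
from typing import Callable, List, Optional, Tuple

# A trajectory is modeled here as a list of (x, y) points (an assumption).
Trajectory = List[Tuple[float, float]]


def plan_with_verification(
    propose: Callable[[int, Optional[Trajectory]], List[Trajectory]],
    score: Callable[[Trajectory], float],
    num_candidates: int = 4,
    threshold: float = 0.8,
    max_rounds: int = 3,
) -> Tuple[Optional[Trajectory], float]:
    """Sample candidate plans, score each one, and keep the best.

    Repeats until the best score reaches `threshold` or `max_rounds`
    rounds have been used. The current best plan is passed back to
    `propose` so that it can refine it in the next round.
    """
    best_plan: Optional[Trajectory] = None
    best_score = float("-inf")
    for _ in range(max_rounds):
        for plan in propose(num_candidates, best_plan):
            plan_score = score(plan)
            if plan_score > best_score:
                best_plan, best_score = plan, plan_score
        if best_score >= threshold:
            break  # a satisfactory plan was found; stop sampling
    return best_plan, best_score
```

The returned plan is the one that would then be handed to a trajectory-conditioned generator; raising `num_candidates` trades extra verifier calls for a better chance of finding a high-scoring plan.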

Title: Illustrator's Depth: Monocular Layer Index Prediction for Image Decomposition

Authors: Nissim Maruani, Peiying Zhang, Siddhartha Chaudhuri, Matthew Fisher, Nanxuan Zhao, Vladimir G. Kim, Pierre Alliez, Mathieu Desbrun, Wang Yifan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.17454
Pdf URL: https://arxiv.org/pdf/2511.17454
Copy Paste: [[2511.17454]] Illustrator's Depth: Monocular Layer Index Prediction for Image Decomposition(https://arxiv.org/abs/2511.17454)
Keywords: generation
Abstract: We introduce Illustrator's Depth, a novel definition of depth that addresses a key challenge in digital content creation: decomposing flat images into editable, ordered layers. Inspired by an artist's compositional process, illustrator's depth infers a layer index to each pixel, forming an interpretable image decomposition through a discrete, globally consistent ordering of elements optimized for editability. We also propose and train a neural network using a curated dataset of layered vector graphics to predict layering directly from raster inputs. Our layer index inference unlocks a range of powerful downstream applications. In particular, it significantly outperforms state-of-the-art baselines for image vectorization while also enabling high-fidelity text-to-vector-graphics generation, automatic 3D relief generation from 2D images, and intuitive depth-aware editing. By reframing depth from a physical quantity to a creative abstraction, illustrator's depth prediction offers a new foundation for editable image decomposition.
摘要：我们介绍 Illustrator 的深度，这是一种新颖的深度定义，它解决了数字内容创建中的一个关键挑战：将平面图像分解为可编辑的有序图层。受艺术家构图过程的启发，插画家的深度推断出每个像素的图层索引，通过针对可编辑性进行优化的离散、全局一致的元素排序形成可解释的图像分解。我们还提出并使用分层矢量图形的精选数据集来训练神经网络，以直接根据栅格输入预测分层。我们的层索引推断解锁了一系列强大的下游应用程序。特别是，它显着优于最先进的图像矢量化基准，同时还支持高保真文本到矢量图形生成、从 2D 图像自动生成 3D 浮雕以及直观的深度感知编辑。通过将深度从物理量重新定义为创造性抽象，插画家的深度预测为可编辑图像分解提供了新的基础。

Title: PersonaAgent with GraphRAG: Community-Aware Knowledge Graphs for Personalized LLM

Authors: Siqi Liang, Yudi Zhang, Yue Guo
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.17467
Pdf URL: https://arxiv.org/pdf/2511.17467
Copy Paste: [[2511.17467]] PersonaAgent with GraphRAG: Community-Aware Knowledge Graphs for Personalized LLM(https://arxiv.org/abs/2511.17467)
Keywords: generation
Abstract: We propose a novel framework for a persona-based language model system, motivated by the need for personalized AI agents that adapt to individual user preferences. In our approach, the agent embodies the user's "persona" (e.g., user profile or taste) and is powered by a large language model (LLM). To enable the agent to leverage rich contextual information, we introduce a Knowledge-Graph-enhanced Retrieval-Augmented Generation (Graph RAG) mechanism that constructs an LLM-derived graph index of relevant documents and summarizes communities of related information. Our framework generates personalized prompts by combining: (1) a summary of the user's historical behaviors and preferences extracted from the knowledge graph, and (2) relevant global interaction patterns identified through graph-based community detection. This dynamic prompt engineering approach allows the agent to maintain consistent persona-aligned behaviors while benefiting from collective knowledge. On the LaMP benchmark, our method improves news categorization F1 by 11.1%, movie tagging F1 by 56.1%, and reduces product rating MAE by 10.4% over prior methods. Our code is available at this https URL.
摘要：我们提出了一种基于角色的语言模型系统的新颖框架，其动机是适应个人用户偏好的个性化人工智能代理的需求。在我们的方法中，代理体现了用户的“角色”（例如用户配置文件或品味），并由大型语言模型（LLM）提供支持。为了使代理能够利用丰富的上下文信息，我们引入了知识图增强检索增强生成（Graph RAG）机制，该机制构建相关文档的 LLM 派生图索引并总结相关信息的社区。我们的框架通过结合以下内容来生成个性化提示：（1）从知识图谱中提取的用户历史行为和偏好的摘要，以及（2）通过基于图的社区检测识别的相关全局交互模式。这种动态提示工程方法使代理能够保持一致的角色一致行为，同时受益于集体知识。在 LaMP 基准上，与之前的方法相比，我们的方法将新闻分类 F1 提高了 11.1%，电影标签 F1 提高了 56.1%，并将产品评级 MAE 降低了 10.4%。我们的代码可在此 https URL 获取

Title: An Artificial Intelligence Framework for Measuring Human Spine Aging Using MRI

Authors: Roozbeh Bazargani, Saqib Abdullah Basar, Daniel Daly-Grafstein, Rodrigo Solis Pompa, Soojin Lee, Saurabh Garg, Yuntong Ma, John A. Carrino, Siavash Khallaghi, Sam Hashemi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.17485
Pdf URL: https://arxiv.org/pdf/2511.17485
Copy Paste: [[2511.17485]] An Artificial Intelligence Framework for Measuring Human Spine Aging Using MRI(https://arxiv.org/abs/2511.17485)
Keywords: generation, generative
Abstract: The human spine is a complex structure composed of 33 vertebrae. It holds the body and is important for leading a healthy life. The spine is vulnerable to age-related degenerations that can be identified through magnetic resonance imaging (MRI). In this paper we propose a novel computer-vision-based deep learning method to estimate spine age using images from over 18,000 MRI series. Data are restricted to subjects with only age-related spine degeneration. Eligibility criteria are created by identifying common age-based clusters of degenerative spine conditions using uniform manifold approximation and projection (UMAP) and hierarchical density-based spatial clustering of applications with noise (HDBSCAN). Model selection is determined using a detailed ablation study on data size, loss, and the effect of different spine regions. We evaluate the clinical utility of our model by calculating the difference between actual spine age and model-predicted age, the spine age gap (SAG), and examining the association between these differences and spine degenerative conditions and lifestyle factors. We find that SAG is associated with conditions including disc bulges, disc osteophytes, spinal stenosis, and fractures, as well as lifestyle factors like smoking and physically demanding work, and thus may be a useful biomarker for measuring overall spine health.
摘要：人体脊柱是一个复杂的结构，由33块椎骨组成。它支撑着身体，对于健康的生活很重要。脊柱很容易遭受与年龄相关的退化，可以通过磁共振成像（MRI）来识别。在本文中，我们提出了一种新颖的基于计算机视觉的深度学习方法，使用来自 18,000 多个 MRI 系列的图像来估计脊柱年龄。数据仅限于仅患有与年龄相关的脊柱退化的受试者。资格标准是通过使用统一流形近似和投影 (UMAP) 以及基于分层密度的噪声应用空间聚类 (HDBSCAN) 来识别常见的基于年龄的退行性脊柱疾病集群而创建的。模型选择是通过对数据大小、丢失和不同脊柱区域的影响进行详细的消融研究来确定的。我们通过计算实际脊柱年龄和模型预测年龄之间的差异（脊柱年龄差距（SAG））并检查这些差异与脊柱退行性疾病和生活方式因素之间的关联来评估模型的临床实用性。我们发现，SAG 与椎间盘突出、椎间盘骨赘、椎管狭窄和骨折以及吸烟和体力劳动等生活方式因素有关，因此可能是衡量脊柱整体健康状况的有用生物标志物。
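
Note: the spine age gap (SAG) in this abstract is the difference between actual and model-predicted spine age. The sketch below illustrates that arithmetic and a simple per-group comparison. It rests on assumptions: the sign convention (predicted minus actual, so a positive gap means the spine looks older than the subject) is not stated in the abstract, and the function names, record fields, and numbers are hypothetical illustrations, not data from the paper.

```python
from statistics import mean
from typing import Dict, List


def spine_age_gap(predicted_age: float, actual_age: float) -> float:
    """Return predicted minus actual age (assumed sign convention)."""
    return predicted_age - actual_age


def mean_gap_by_group(records: List[dict], key: str) -> Dict[object, float]:
    """Average the gap within each value of a grouping field such as 'smoker'."""
    groups: Dict[object, List[float]] = {}
    for record in records:
        gap = spine_age_gap(record["predicted_age"], record["age"])
        groups.setdefault(record[key], []).append(gap)
    return {group: mean(gaps) for group, gaps in groups.items()}


# Made-up example records, for illustration only.
example_records = [
    {"age": 52.0, "predicted_age": 58.0, "smoker": True},
    {"age": 40.0, "predicted_age": 42.0, "smoker": True},
    {"age": 35.0, "predicted_age": 34.0, "smoker": False},
    {"age": 61.0, "predicted_age": 62.0, "smoker": False},
]
gap_by_smoking = mean_gap_by_group(example_records, "smoker")
```

A difference in group means like this is only descriptive; the association analysis mentioned in the abstract would need a proper statistical model.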

Title: EvDiff: High Quality Video with an Event Camera

Authors: Weilun Li, Lei Sun, Ruixi Gao, Qi Jiang, Yuqin Ma, Kaiwei Wang, Ming-Hsuan Yang, Luc Van Gool, Danda Pani Paudel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.17492
Pdf URL: https://arxiv.org/pdf/2511.17492
Copy Paste: [[2511.17492]] EvDiff: High Quality Video with an Event Camera(https://arxiv.org/abs/2511.17492)
Keywords: generation
Abstract: As neuromorphic sensors, event cameras asynchronously record changes in brightness as streams of sparse events with the advantages of high temporal resolution and high dynamic range. Reconstructing intensity images from events is a highly ill-posed task due to the inherent ambiguity of absolute brightness. Early methods generally follow an end-to-end regression paradigm, directly mapping events to intensity frames in a deterministic manner. While effective to some extent, these approaches often yield perceptually inferior results and struggle to scale up in model capacity and training data. In this work, we propose EvDiff, an event-based diffusion model that follows a surrogate training framework to produce high-quality videos. To reduce the heavy computational cost of high-frame-rate video generation, we design an event-based diffusion model that performs only a single forward diffusion step, equipped with a temporally consistent EvEncoder. Furthermore, our novel Surrogate Training Framework eliminates the dependence on paired event-image datasets, allowing the model to leverage large-scale image datasets for higher capacity. The proposed EvDiff is capable of generating high-quality colorful videos solely from monochromatic event streams. Experiments on real-world datasets demonstrate that our method strikes a sweet spot between fidelity and realism, outperforming existing approaches on both pixel-level and perceptual metrics.
摘要：作为神经形态传感器，事件相机将亮度变化异步记录为稀疏事件流，具有高时间分辨率和高动态范围的优点。由于绝对亮度固有的模糊性，从事件中重建强度图像是一项非常不适定的任务。早期的方法通常遵循端到端回归范例，以确定性方式直接将事件映射到强度帧。虽然在某种程度上有效，但这些方法通常会产生感知上较差的结果，并且难以扩大模型容量和训练数据。在这项工作中，我们提出了 EvDiff，一种基于事件的扩散模型，遵循替代训练框架来生成高质量视频。为了减少高帧率视频生成的繁重计算成本，我们设计了一种基于事件的扩散模型，该模型仅执行单个前向扩散步骤，并配备了时间一致的 EvEncoder。此外，我们新颖的代理训练框架消除了对配对事件图像数据集的依赖，使模型能够利用大规模图像数据集来获得更高的容量。所提出的 EvDiff 能够仅从单色事件流生成高质量的彩色视频。对现实世界数据集的实验表明，我们的方法在保真度和真实感之间达到了最佳平衡点，在像素级和感知指标上都优于现有方法。

Title: Native 3D Editing with Full Attention

Authors: Weiwei Cai, Shuangkang Fang, Weicai Ye, Xin Dong, Yunhan Yang, Xuanyang Zhang, Wei Cheng, Yanpei Cao, Gang Yu, Tao Chen
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2511.17501
Pdf URL: https://arxiv.org/pdf/2511.17501
Copy Paste: [[2511.17501]] Native 3D Editing with Full Attention(https://arxiv.org/abs/2511.17501)
Keywords: generation
Abstract: Instruction-guided 3D editing is a rapidly emerging field with the potential to broaden access to 3D content creation. However, existing methods face critical limitations: optimization-based approaches are prohibitively slow, while feed-forward approaches relying on multi-view 2D editing often suffer from inconsistent geometry and degraded visual quality. To address these issues, we propose a novel native 3D editing framework that directly manipulates 3D representations in a single, efficient feed-forward pass. Specifically, we create a large-scale, multi-modal dataset for instruction-guided 3D editing, covering diverse addition, deletion, and modification tasks. This dataset is meticulously curated to ensure that edited objects faithfully adhere to the instructional changes while preserving the consistency of unedited regions with the source object. Building upon this dataset, we explore two distinct conditioning strategies for our model: a conventional cross-attention mechanism and a novel 3D token concatenation approach. Our results demonstrate that token concatenation is more parameter-efficient and achieves superior performance. Extensive evaluations show that our method outperforms existing 2D-lifting approaches, setting a new benchmark in generation quality, 3D consistency, and instruction fidelity.
摘要：指令引导 3D 编辑是一个快速新兴的领域，有可能扩大 3D 内容创建的范围。然而，现有方法面临着严重的局限性：基于优化的方法速度极其缓慢，而依赖于多视图 2D 编辑的前馈方法通常会遇到几何形状不一致和视觉质量下降的问题。为了解决这些问题，我们提出了一种新颖的原生 3D 编辑框架，该框架可以在单个高效的前馈通道中直接操作 3D 表示。具体来说，我们创建了一个大规模、多模式的数据集，用于指令引导的 3D 编辑，涵盖各种添加、删除和修改任务。该数据集经过精心策划，以确保编辑的对象忠实地遵循教学更改，同时保持未编辑区域与源对象的一致性。在此数据集的基础上，我们为模型探索了两种不同的调节策略：传统的交叉注意力机制和新颖的 3D 令牌串联方法。我们的结果表明，令牌串联的参数效率更高，并且实现了卓越的性能。广泛的评估表明，我们的方法优于现有的 2D 提升方法，在生成质量、3D 一致性和指令保真度方面树立了新的基准。