2025-09-12

Title: Recurrence Meets Transformers for Universal Multimodal Retrieval

Authors: Davide Caffagni, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
Subjects: cs.CV, cs.AI, cs.CL, cs.MM
Abstract URL: https://arxiv.org/abs/2509.08897
Pdf URL: https://arxiv.org/pdf/2509.08897
Copy Paste: [[2509.08897]] Recurrence Meets Transformers for Universal Multimodal Retrieval(https://arxiv.org/abs/2509.08897)
Keywords: generation
Abstract: With the rapid advancement of multimodal retrieval and its application in LLMs and multimodal LLMs, increasingly complex retrieval tasks have emerged. Existing methods predominantly rely on task-specific fine-tuning of vision-language models and are limited to single-modality queries or documents. In this paper, we propose ReT-2, a unified retrieval model that supports multimodal queries, composed of both images and text, and searches across multimodal document collections where text and images coexist. ReT-2 leverages multi-layer representations and a recurrent Transformer architecture with LSTM-inspired gating mechanisms to dynamically integrate information across layers and modalities, capturing fine-grained visual and textual details. We evaluate ReT-2 on the challenging M2KR and M-BEIR benchmarks across different retrieval configurations. Results demonstrate that ReT-2 consistently achieves state-of-the-art performance across diverse settings, while offering faster inference and reduced memory usage compared to prior approaches. When integrated into retrieval-augmented generation pipelines, ReT-2 also improves downstream performance on Encyclopedic-VQA and InfoSeek datasets. Our source code and trained models are publicly available at: this https URL
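Illustrative note (not from the paper): the abstract describes LSTM-inspired gates that integrate information across layers. The sketch below shows only the general idea of folding a stack of per-layer feature vectors into one state with a forget gate and an input gate. The function name `gated_layer_fusion`, the scalar gate parameters, and the element-wise update are assumptions made for illustration; ReT-2's actual recurrent Transformer cell is not specified in the abstract and is likely more elaborate.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gated_layer_fusion(layer_feats, w_forget, w_input):
    """Fold per-layer feature vectors into one state using LSTM-style gates.

    layer_feats: list of equal-length vectors, one per backbone layer
                 (hypothetical inputs, ordered shallow to deep).
    w_forget, w_input: (a, b, c) triples; each gate is sigmoid(a*h + b*x + c).
                       Scalar parameters are a simplification for this sketch.
    """
    dim = len(layer_feats[0])
    state = [0.0] * dim
    for feats in layer_feats:
        new_state = []
        for h, x in zip(state, feats):
            f = sigmoid(w_forget[0] * h + w_forget[1] * x + w_forget[2])  # how much old state to keep
            i = sigmoid(w_input[0] * h + w_input[1] * x + w_input[2])     # how much new layer info to admit
            new_state.append(f * h + i * x)
        state = new_state
    return state


# Three hypothetical layers with 4-dimensional features.
layers = [[0.1, -0.2, 0.3, 0.0], [0.5, 0.4, -0.1, 0.2], [-0.3, 0.1, 0.2, 0.9]]
fused = gated_layer_fusion(layers, w_forget=(0.0, 0.0, 1.0), w_input=(0.0, 1.0, 0.0))
print(fused)
```

The gates let the fused state keep or overwrite information layer by layer instead of relying only on the last layer's output.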

Title: PromptGuard: An Orchestrated Prompting Framework for Principled Synthetic Text Generation for Vulnerable Populations using LLMs with Enhanced Safety, Fairness, and Controllability

Authors: Tung Vu, Lam Nguyen, Quynh Dao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.08910
Pdf URL: https://arxiv.org/pdf/2509.08910
Copy Paste: [[2509.08910]] PromptGuard: An Orchestrated Prompting Framework for Principled Synthetic Text Generation for Vulnerable Populations using LLMs with Enhanced Safety, Fairness, and Controllability(https://arxiv.org/abs/2509.08910)
Keywords: generation
Abstract: The proliferation of Large Language Models (LLMs) in real-world applications poses unprecedented risks of generating harmful, biased, or misleading information to vulnerable populations including LGBTQ+ individuals, single parents, and marginalized communities. While existing safety approaches rely on post-hoc filtering or generic alignment techniques, they fail to proactively prevent harmful outputs at the generation source. This paper introduces PromptGuard, a novel modular prompting framework with our breakthrough contribution: VulnGuard Prompt, a hybrid technique that prevents harmful information generation using real-world data-driven contrastive learning. VulnGuard integrates few-shot examples from curated GitHub repositories, ethical chain-of-thought reasoning, and adaptive role-prompting to create population-specific protective barriers. Our framework employs theoretical multi-objective optimization with formal proofs demonstrating 25-30% analytical harm reduction through entropy bounds and Pareto optimality. PromptGuard orchestrates six core modules: Input Classification, VulnGuard Prompting, Ethical Principles Integration, External Tool Interaction, Output Validation, and User-System Interaction, creating an intelligent expert system for real-time harm prevention. We provide comprehensive mathematical formalization including convergence proofs, vulnerability analysis using information theory, and theoretical validation framework using GitHub-sourced datasets, establishing mathematical foundations for systematic empirical research.

Title: Discovering Divergent Representations between Text-to-Image Models

Authors: Lisa Dunlap, Joseph E. Gonzalez, Trevor Darrell, Fabian Caba Heilbron, Josef Sivic, Bryan Russell
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.08940
Pdf URL: https://arxiv.org/pdf/2509.08940
Copy Paste: [[2509.08940]] Discovering Divergent Representations between Text-to-Image Models(https://arxiv.org/abs/2509.08940)
Keywords: generation, generative
Abstract: In this paper, we investigate when and how visual representations learned by two different generative models diverge. Given two text-to-image models, our goal is to discover visual attributes that appear in images generated by one model but not the other, along with the types of prompts that trigger these attribute differences. For example, "flames" might appear in one model's outputs when given prompts expressing strong emotions, while the other model does not produce this attribute given the same prompts. We introduce CompCon (Comparing Concepts), an evolutionary search algorithm that discovers visual attributes more prevalent in one model's output than the other, and uncovers the prompt concepts linked to these visual differences. To evaluate CompCon's ability to find diverging representations, we create an automated data generation pipeline to produce ID2, a dataset of 60 input-dependent differences, and compare our approach to several LLM- and VLM-powered baselines. Finally, we use CompCon to compare popular text-to-image models, finding divergent representations such as how PixArt depicts prompts mentioning loneliness with wet streets and Stable Diffusion 3.5 depicts African American people in media professions. Code at: this https URL

Title: Integrating Anatomical Priors into a Causal Diffusion Model

Authors: Binxu Li, Wei Peng, Mingjie Li, Ehsan Adeli, Kilian M. Pohl
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.09054
Pdf URL: https://arxiv.org/pdf/2509.09054
Copy Paste: [[2509.09054]] Integrating Anatomical Priors into a Causal Diffusion Model(https://arxiv.org/abs/2509.09054)
Keywords: generation, generative
Abstract: 3D brain MRI studies often examine subtle morphometric differences between cohorts that are hard to detect visually. Given the high cost of MRI acquisition, these studies could greatly benefit from image syntheses, particularly counterfactual image generation, as seen in other domains, such as computer vision. However, counterfactual models struggle to produce anatomically plausible MRIs due to the lack of explicit inductive biases to preserve fine-grained anatomical details. This shortcoming arises from the training of the models aiming to optimize for the overall appearance of the images (e.g., via cross-entropy) rather than preserving subtle, yet medically relevant, local variations across subjects. To preserve subtle variations, we propose to explicitly integrate anatomical constraints on a voxel-level as prior into a generative diffusion framework. Called Probabilistic Causal Graph Model (PCGM), the approach captures anatomical constraints via a probabilistic graph module and translates those constraints into spatial binary masks of regions where subtle variations occur. The masks (encoded by a 3D extension of ControlNet) constrain a novel counterfactual denoising UNet, whose encodings are then transferred into high-quality brain MRIs via our 3D diffusion decoder. Extensive experiments on multiple datasets demonstrate that PCGM generates structural brain MRIs of higher quality than several baseline approaches. Furthermore, we show for the first time that brain measurements extracted from counterfactuals (generated by PCGM) replicate the subtle effects of a disease on cortical brain regions previously reported in the neuroscience literature. This achievement is an important milestone in the use of synthetic MRIs in studies investigating subtle morphological differences.

Title: ALL-PET: A Low-resource and Low-shot PET Foundation Model in the Projection Domain

Authors: Bin Huang, Kang Chen, Bingxuan Li, Huafeng Liu, Qiegen Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.09130
Pdf URL: https://arxiv.org/pdf/2509.09130
Copy Paste: [[2509.09130]] ALL-PET: A Low-resource and Low-shot PET Foundation Model in the Projection Domain(https://arxiv.org/abs/2509.09130)
Keywords: generation
Abstract: Building a large-scale foundation model for PET imaging is hindered by limited access to labeled data and insufficient computational resources. To overcome data scarcity and efficiency limitations, we propose ALL-PET, a low-resource, low-shot PET foundation model operating directly in the projection domain. ALL-PET leverages a latent diffusion model (LDM) with three key innovations. First, we design a Radon mask augmentation strategy (RMAS) that generates over 200,000 structurally diverse training samples by projecting randomized image-domain masks into sinogram space, significantly improving generalization with minimal data. This is extended by a dynamic multi-mask (DMM) mechanism that varies mask quantity and distribution, enhancing data diversity without added model complexity. Second, we implement positive/negative mask constraints to embed strict geometric consistency, reducing parameter burden while preserving generation quality. Third, we introduce transparent medical attention (TMA), a parameter-free, geometry-driven mechanism that enhances lesion-related regions in raw projection data. Lesion-focused attention maps are derived from coarse segmentation, covering both hypermetabolic and hypometabolic areas, and projected into sinogram space for physically consistent guidance. The system supports clinician-defined ROI adjustments, ensuring flexible, interpretable, and task-adaptive emphasis aligned with PET acquisition physics. Experimental results show ALL-PET achieves high-quality sinogram generation using only 500 samples, with performance comparable to models trained on larger datasets. ALL-PET generalizes across tasks including low-dose reconstruction, attenuation correction, delayed-frame prediction, and tracer separation, operating efficiently with memory use under 24GB.
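Illustrative note (not from the paper): the Radon mask augmentation step is described as projecting randomized image-domain masks into sinogram space. A minimal sketch of that idea, assuming a parallel-beam geometry and a simple pixel-driven forward projection, is shown below. The helper names (`random_mask`, `project_mask`, `sinogram_footprint`), the rectangular mask shapes, and the nearest-bin projection are assumptions for illustration, not the authors' RMAS implementation.

```python
import math
import random


def random_mask(size, n_rects, rng):
    """Binary image-domain mask made of a few random rectangles."""
    mask = [[0] * size for _ in range(size)]
    for _ in range(n_rects):
        r0, c0 = rng.randrange(size), rng.randrange(size)
        h, w = rng.randint(1, size // 4), rng.randint(1, size // 4)
        for r in range(r0, min(size, r0 + h)):
            for c in range(c0, min(size, c0 + w)):
                mask[r][c] = 1
    return mask


def project_mask(mask, n_angles):
    """Pixel-driven parallel-beam projection: each mask pixel adds 1 to the
    detector bin nearest to its projected position, for every view angle."""
    size = len(mask)
    center = (size - 1) / 2.0
    n_bins = int(math.ceil(size * math.sqrt(2))) + 1  # covers the image diagonal
    offset = (n_bins - 1) / 2.0
    sino = [[0.0] * n_bins for _ in range(n_angles)]
    for a in range(n_angles):
        theta = math.pi * a / n_angles
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        for r in range(size):
            for c in range(size):
                if mask[r][c]:
                    s = (c - center) * cos_t + (r - center) * sin_t
                    sino[a][int(round(s + offset))] += 1.0
    return sino


def sinogram_footprint(sino):
    """Binary sinogram-domain mask: 1 wherever any mask pixel projects."""
    return [[1 if v > 0 else 0 for v in row] for row in sino]


rng = random.Random(0)
mask = random_mask(16, 3, rng)
sino = project_mask(mask, n_angles=8)
footprint = sinogram_footprint(sino)
print(len(sino), len(sino[0]))
```

Drawing a fresh mask per call yields a different sinogram-domain footprint each time, which is the sense in which one small dataset can be expanded into many structurally distinct training samples.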

Title: Objectness Similarity: Capturing Object-Level Fidelity in 3D Scene Evaluation

Authors: Yuiko Uchida, Ren Togo, Keisuke Maeda, Takahiro Ogawa, Miki Haseyama
Subjects: cs.CV, cs.AI, cs.GR
Abstract URL: https://arxiv.org/abs/2509.09143
Pdf URL: https://arxiv.org/pdf/2509.09143
Copy Paste: [[2509.09143]] Objectness Similarity: Capturing Object-Level Fidelity in 3D Scene Evaluation(https://arxiv.org/abs/2509.09143)
Keywords: generation
Abstract: This paper presents Objectness SIMilarity (OSIM), a novel evaluation metric for 3D scenes that explicitly focuses on "objects," which are fundamental units of human visual perception. Existing metrics assess overall image quality, leading to discrepancies with human perception. Inspired by neuropsychological insights, we hypothesize that human recognition of 3D scenes fundamentally involves attention to individual objects. OSIM enables object-centric evaluations by leveraging an object detection model and its feature representations to quantify the "objectness" of each object in the scene. Our user study demonstrates that OSIM aligns more closely with human perception compared to existing metrics. We also analyze the characteristics of OSIM using various approaches. Moreover, we re-evaluate recent 3D reconstruction and generation models under a standardized experimental setup to clarify advancements in this field. The code is available at this https URL.

Title: HISPASpoof: A New Dataset For Spanish Speech Forensics

Authors: Maria Risques, Kratika Bhagtani, Amit Kumar Singh Yadav, Edward J. Delp
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2509.09155
Pdf URL: https://arxiv.org/pdf/2509.09155
Copy Paste: [[2509.09155]] HISPASpoof: A New Dataset For Spanish Speech Forensics(https://arxiv.org/abs/2509.09155)
Keywords: generation
Abstract: Zero-shot Voice Cloning (VC) and Text-to-Speech (TTS) methods have advanced rapidly, enabling the generation of highly realistic synthetic speech and raising serious concerns about their misuse. While numerous detectors have been developed for English and Chinese, Spanish, spoken by over 600 million people worldwide, remains underrepresented in speech forensics. To address this gap, we introduce HISPASpoof, the first large-scale Spanish dataset designed for synthetic speech detection and attribution. It includes real speech from public corpora across six accents and synthetic speech generated with six zero-shot TTS systems. We evaluate five representative methods, showing that detectors trained on English fail to generalize to Spanish, while training on HISPASpoof substantially improves detection. We also evaluate synthetic speech attribution performance on HISPASpoof, i.e., identifying the generation method of synthetic speech. HISPASpoof thus provides a critical benchmark for advancing reliable and inclusive speech forensics in Spanish.

Title: Bridging the Gap Between Ideal and Real-world Evaluation: Benchmarking AI-Generated Image Detection in Challenging Scenarios

Authors: Chunxiao Li, Xiaoxiao Wang, Meiling Li, Boming Miao, Peng Sun, Yunjian Zhang, Xiangyang Ji, Yao Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.09172
Pdf URL: https://arxiv.org/pdf/2509.09172
Copy Paste: [[2509.09172]] Bridging the Gap Between Ideal and Real-world Evaluation: Benchmarking AI-Generated Image Detection in Challenging Scenarios(https://arxiv.org/abs/2509.09172)
Keywords: generative
Abstract: With the rapid advancement of generative models, highly realistic image synthesis has posed new challenges to digital security and media credibility. Although AI-generated image detection methods have partially addressed these concerns, a substantial research gap remains in evaluating their performance under complex real-world conditions. This paper introduces the Real-World Robustness Dataset (RRDataset) for comprehensive evaluation of detection models across three dimensions: 1) Scenario Generalization: RRDataset encompasses high-quality images from seven major scenarios (War and Conflict, Disasters and Accidents, Political and Social Events, Medical and Public Health, Culture and Religion, Labor and Production, and everyday life), addressing existing dataset gaps from a content perspective. 2) Internet Transmission Robustness: examining detector performance on images that have undergone multiple rounds of sharing across various social media platforms. 3) Re-digitization Robustness: assessing model effectiveness on images altered through four distinct re-digitization methods. We benchmarked 17 detectors and 10 vision-language models (VLMs) on RRDataset and conducted a large-scale human study involving 192 participants to investigate human few-shot learning capabilities in detecting AI-generated images. The benchmarking results reveal the limitations of current AI detection methods under real-world conditions and underscore the importance of drawing on human adaptability to develop more robust detection algorithms.

Title: VQualA 2025 Challenge on Visual Quality Comparison for Large Multimodal Models: Methods and Results

Authors: Hanwei Zhu, Haoning Wu, Zicheng Zhang, Lingyu Zhu, Yixuan Li, Peilin Chen, Shiqi Wang, Chris Wei Zhou, Linhan Cao, Wei Sun, Xiangyang Zhu, Weixia Zhang, Yucheng Zhu, Jing Liu, Dandan Zhu, Guangtao Zhai, Xiongkuo Min, Zhichao Zhang, Xinyue Li, Shubo Xu, Anh Dao, Yifan Li, Hongyuan Yu, Jiaojiao Yi, Yiding Tian, Yupeng Wu, Feiran Sun, Lijuan Liao, Song Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.09190
Pdf URL: https://arxiv.org/pdf/2509.09190
Copy Paste: [[2509.09190]] VQualA 2025 Challenge on Visual Quality Comparison for Large Multimodal Models: Methods and Results(https://arxiv.org/abs/2509.09190)
Keywords: quality assessment
Abstract: This paper presents a summary of the VQualA 2025 Challenge on Visual Quality Comparison for Large Multimodal Models (LMMs), hosted as part of the ICCV 2025 Workshop on Visual Quality Assessment. The challenge aims to evaluate and enhance the ability of state-of-the-art LMMs to perform open-ended and detailed reasoning about visual quality differences across multiple images. To this end, the competition introduces a novel benchmark comprising thousands of coarse-to-fine grained visual quality comparison tasks, spanning single images, pairs, and multi-image groups. Each task requires models to provide accurate quality judgments. The competition emphasizes holistic evaluation protocols, including 2AFC-based binary preference and multi-choice questions (MCQs). Around 100 participants submitted entries, with five models demonstrating the emerging capabilities of instruction-tuned LMMs on quality assessment. This challenge marks a significant step toward open-domain visual quality reasoning and comparison and serves as a catalyst for future research on interpretable and human-aligned quality evaluation systems.
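Illustrative note (not from the paper): the 2AFC binary-preference and multi-choice protocols mentioned above both reduce to comparing a model's chosen option with a reference choice. A minimal sketch of such a scoring function follows; the name `choice_accuracy` and the label format are assumptions, and the challenge's official scoring may differ.

```python
def choice_accuracy(predictions, references):
    """Fraction of items where the predicted option matches the reference.

    Works for 2AFC (labels such as "A"/"B") and for multi-choice questions
    (labels such as "A".."D"); the label strings are hypothetical.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        raise ValueError("at least one item is required")
    correct = sum(1 for p, r in zip(predictions, references) if p == r)
    return correct / len(predictions)


# Four hypothetical image-pair comparisons: which image has better quality?
print(choice_accuracy(["A", "B", "A", "B"], ["A", "B", "B", "B"]))
```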

Title: Towards Better Dental AI: A Multimodal Benchmark and Instruction Dataset for Panoramic X-ray Analysis

Authors: Jing Hao, Yuxuan Fan, Yanpeng Sun, Kaixin Guo, Lizhuo Lin, Jinrong Yang, Qi Yong H. Ai, Lun M. Wong, Hao Tang, Kuo Feng Hung
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2509.09254
Pdf URL: https://arxiv.org/pdf/2509.09254
Copy Paste: [[2509.09254]] Towards Better Dental AI: A Multimodal Benchmark and Instruction Dataset for Panoramic X-ray Analysis(https://arxiv.org/abs/2509.09254)
Keywords: generation
Abstract: Recent advances in large vision-language models (LVLMs) have demonstrated strong performance on general-purpose medical tasks. However, their effectiveness in specialized domains such as dentistry remains underexplored. In particular, panoramic X-rays, a widely used imaging modality in oral radiology, pose interpretative challenges due to dense anatomical structures and subtle pathological cues, which are not captured by existing medical benchmarks or instruction datasets. To this end, we introduce MMOral, the first large-scale multimodal instruction dataset and benchmark tailored for panoramic X-ray interpretation. MMOral consists of 20,563 annotated images paired with 1.3 million instruction-following instances across diverse task types, including attribute extraction, report generation, visual question answering, and image-grounded dialogue. In addition, we present MMOral-Bench, a comprehensive evaluation suite covering five key diagnostic dimensions in dentistry. We evaluate 64 LVLMs on MMOral-Bench and find that even the best-performing model, i.e., GPT-4o, only achieves 41.45% accuracy, revealing significant limitations of current models in this domain. To promote the progress of this specific domain, we also propose OralGPT, which conducts supervised fine-tuning (SFT) upon Qwen2.5-VL-7B with our meticulously curated MMOral instruction dataset. Remarkably, a single epoch of SFT yields substantial performance enhancements for LVLMs, e.g., OralGPT demonstrates a 24.73% improvement. Both MMOral and OralGPT hold significant potential as a critical foundation for intelligent dentistry and enable more clinically impactful multimodal AI systems in the dental field. The dataset, model, benchmark, and evaluation suite are available at this https URL.

Title: Can Multimodal LLMs See Materials Clearly? A Multimodal Benchmark on Materials Characterization

Authors: Zhengzhao Lai, Youbin Zheng, Zhenyang Cai, Haonan Lyu, Jinpu Yang, Hongqing Liang, Yan Hu, Benyou Wang
Subjects: cs.CV, cs.AI, cs.CL, cs.MM
Abstract URL: https://arxiv.org/abs/2509.09307
Pdf URL: https://arxiv.org/pdf/2509.09307
Copy Paste: [[2509.09307]] Can Multimodal LLMs See Materials Clearly? A Multimodal Benchmark on Materials Characterization(https://arxiv.org/abs/2509.09307)
Keywords: generative
Abstract: Materials characterization is fundamental to acquiring materials information, revealing the processing-microstructure-property relationships that guide material design and optimization. While multimodal large language models (MLLMs) have recently shown promise in generative and predictive tasks within materials science, their capacity to understand real-world characterization imaging data remains underexplored. To bridge this gap, we present MatCha, the first benchmark for materials characterization image understanding, comprising 1,500 questions that demand expert-level domain expertise. MatCha encompasses four key stages of materials research comprising 21 distinct tasks, each designed to reflect authentic challenges faced by materials scientists. Our evaluation of state-of-the-art MLLMs on MatCha reveals a significant performance gap compared to human experts. These models exhibit degradation when addressing questions requiring higher-level expertise and sophisticated visual perception. Simple few-shot and chain-of-thought prompting struggle to alleviate these limitations. These findings highlight that existing MLLMs still exhibit limited adaptability to real-world materials characterization scenarios. We hope MatCha will facilitate future research in areas such as new material discovery and autonomous scientific agents. MatCha is available at this https URL.

Title: Fine-Grained Customized Fashion Design with Image-into-Prompt benchmark and dataset from LMM

Authors: Hui Li, Yi You, Qiqi Chen, Bingfeng Zhang, George Q. Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.09324
Pdf URL: https://arxiv.org/pdf/2509.09324
Copy Paste: [[2509.09324]] Fine-Grained Customized Fashion Design with Image-into-Prompt benchmark and dataset from LMM(https://arxiv.org/abs/2509.09324)
Keywords: generation, generative
Abstract: Generative AI evolves the execution of complex workflows in industry, where the large multimodal model empowers fashion design in the garment industry. Current generation AI models magically transform brainstorming into fancy designs easily, but the fine-grained customization still suffers from text uncertainty without professional background knowledge from end-users. Thus, we propose the Better Understanding Generation (BUG) workflow with LMM to automatically create and fine-grain customize the cloth designs from chat with image-into-prompt. Our framework unleashes users' creative potential beyond words and also lowers the barriers of clothing design/editing without further human involvement. To prove the effectiveness of our model, we propose a new FashionEdit dataset that simulates the real-world clothing design workflow, evaluated from generation similarity, user satisfaction, and quality. The code and dataset: this https URL.

Title: FS-Diff: Semantic guidance and clarity-aware simultaneous multimodal image fusion and super-resolution

Authors: Yuchan Jie, Yushen Xu, Xiaosong Li, Fuqiang Zhou, Jianming Lv, Huafeng Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.09427
Pdf URL: https://arxiv.org/pdf/2509.09427
Copy Paste: [[2509.09427]] FS-Diff: Semantic guidance and clarity-aware simultaneous multimodal image fusion and super-resolution(https://arxiv.org/abs/2509.09427)
Keywords: super-resolution, generation
Abstract: As an influential information fusion and low-level vision technique, image fusion integrates complementary information from source images to yield an informative fused image. A few attempts have been made in recent years to jointly realize image fusion and super-resolution. However, in real-world applications such as military reconnaissance and long-range detection missions, the target and background structures in multimodal images are easily corrupted, with low resolution and weak semantic information, which leads to suboptimal results in current fusion techniques. In response, we propose FS-Diff, a semantic guidance and clarity-aware joint image fusion and super-resolution method. FS-Diff unifies image fusion and super-resolution as a conditional generation problem. It leverages semantic guidance from the proposed clarity sensing mechanism for adaptive low-resolution perception and cross-modal feature extraction. Specifically, we initialize the desired fused result as pure Gaussian noise and introduce the bidirectional feature Mamba to extract the global features of the multimodal images. Moreover, utilizing the source images and semantics as conditions, we implement a random iterative denoising process via a modified U-Net network. This network is trained for denoising at multiple noise levels to produce high-resolution fusion results with cross-modal features and abundant semantic information. We also construct a powerful aerial view multiscene (AVMS) benchmark covering 600 pairs of images. Extensive joint image fusion and super-resolution experiments on six public datasets and our AVMS dataset demonstrated that FS-Diff outperforms the state-of-the-art methods at multiple magnifications and can recover richer details and semantics in the fused images. The code is available at this https URL.

Title: Composable Score-based Graph Diffusion Model for Multi-Conditional Molecular Generation

Authors: Anjie Qiao, Zhen Wang, Chuan Chen, DeFu Lian, Enhong Chen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.09451
Pdf URL: https://arxiv.org/pdf/2509.09451
Copy Paste: [[2509.09451]] Composable Score-based Graph Diffusion Model for Multi-Conditional Molecular Generation(https://arxiv.org/abs/2509.09451)
Keywords: generation
Abstract: Controllable molecular graph generation is essential for material and drug discovery, where generated molecules must satisfy diverse property constraints. While recent advances in graph diffusion models have improved generation quality, their effectiveness in multi-conditional settings remains limited due to reliance on joint conditioning or continuous relaxations that compromise fidelity. To address these limitations, we propose Composable Score-based Graph Diffusion model (CSGD), the first model that extends score matching to discrete graphs via concrete scores, enabling flexible and principled manipulation of conditional guidance. Building on this foundation, we introduce two score-based techniques: Composable Guidance (CoG), which allows fine-grained control over arbitrary subsets of conditions during sampling, and Probability Calibration (PC), which adjusts estimated transition probabilities to mitigate train-test mismatches. Empirical results on four molecular datasets show that CSGD achieves state-of-the-art performance, with a 15.3% average improvement in controllability over prior methods, while maintaining high validity and distributional fidelity. Our findings highlight the practical advantages of score-based modeling for discrete graph generation and its capacity for flexible, multi-property molecular design.
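Summary: Controllable molecular graph generation is essential for materials and drug discovery, where generated molecules must satisfy diverse property constraints. Although recent advances in graph diffusion models have improved generation quality, their effectiveness in multi-conditional settings remains limited because they rely on joint conditioning or on continuous relaxations that compromise fidelity. To address these limitations, we propose the Composable Score-based Graph Diffusion model (CSGD), the first model to extend score matching to discrete graphs via concrete scores, enabling flexible and principled manipulation of conditional guidance. On this foundation, we introduce two score-based techniques: Composable Guidance (CoG), which allows fine-grained control over arbitrary subsets of conditions during sampling, and Probability Calibration (PC), which adjusts estimated transition probabilities to mitigate train-test mismatches. Empirical results on four molecular datasets show that CSGD achieves state-of-the-art performance, improving controllability by 15.3% on average over prior methods while maintaining high validity and distributional fidelity. Our findings highlight the practical advantages of score-based modeling for discrete graph generation and its capacity for flexible, multi-property molecular design.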
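
The idea of steering sampling with an arbitrary subset of conditions can be illustrated with a small sketch. This is not CSGD's actual update rule, which the abstract does not spell out; it is a generic classifier-free-guidance-style composition over categorical transition probabilities, and the function name, the per-condition weights, and the toy numbers are all assumptions made for illustration.

```python
import numpy as np

def compose_guidance(logp_uncond, logp_conds, weights, active):
    """Combine per-condition log-probabilities for the active conditions only.

    Assumed rule (illustrative, not taken from the paper):
        log p = log p_uncond + sum_i w_i * (log p_cond_i - log p_uncond)
    followed by renormalisation over the categories.
    """
    combined = np.array(logp_uncond, dtype=float)
    for name in active:
        # Each active condition shifts the unconditional log-probabilities.
        combined += weights[name] * (np.asarray(logp_conds[name]) - logp_uncond)
    combined -= combined.max()          # numerical stability before exp
    probs = np.exp(combined)
    return probs / probs.sum()

# Toy example: three candidate categories for one node or edge.
uncond = np.log(np.full(3, 1.0 / 3.0))
conds = {"solubility": np.log(np.array([0.6, 0.3, 0.1]))}
probs = compose_guidance(uncond, conds, {"solubility": 1.0}, ["solubility"])
```

With an empty `active` list the sketch falls back to the unconditional distribution, which is what makes conditioning on subsets straightforward in this formulation.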

Title: OpenFake: An Open Dataset and Platform Toward Large-Scale Deepfake Detection

Authors: Victor Livernoche, Akshatha Arodi, Andreea Musulan, Zachary Yang, Adam Salvail, Gaétan Marceau Caron, Jean-François Godbout, Reihaneh Rabbany
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.09495
Pdf URL: https://arxiv.org/pdf/2509.09495
Copy Paste: [[2509.09495]] OpenFake: An Open Dataset and Platform Toward Large-Scale Deepfake Detection(https://arxiv.org/abs/2509.09495)
Keywords: generation, generative
Abstract: Deepfakes, synthetic media created using advanced AI techniques, have intensified the spread of misinformation, particularly in politically sensitive contexts. Existing deepfake detection datasets are often limited, relying on outdated generation methods, low realism, or single-face imagery, restricting the effectiveness for general synthetic image detection. By analyzing social media posts, we identify multiple modalities through which deepfakes propagate misinformation. Furthermore, our human perception study demonstrates that recently developed proprietary models produce synthetic images increasingly indistinguishable from real ones, complicating accurate identification by the general public. Consequently, we present a comprehensive, politically-focused dataset specifically crafted for benchmarking detection against modern generative models. This dataset contains three million real images paired with descriptive captions, which are used for generating 963k corresponding high-quality synthetic images from a mix of proprietary and open-source models. Recognizing the continual evolution of generative techniques, we introduce an innovative crowdsourced adversarial platform, where participants are incentivized to generate and submit challenging synthetic images. This ongoing community-driven initiative ensures that deepfake detection methods remain robust and adaptive, proactively safeguarding public discourse from sophisticated misinformation threats.
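Summary: Deepfakes, synthetic media created with advanced AI techniques, have intensified the spread of misinformation, especially in politically sensitive contexts. Existing deepfake detection datasets are often limited: they rely on outdated generation methods, low realism, or single-face imagery, which restricts their effectiveness for general synthetic image detection. By analyzing social media posts, we identify multiple modalities through which deepfakes propagate misinformation. In addition, our human perception study shows that recently developed proprietary models produce synthetic images that are increasingly indistinguishable from real ones, complicating accurate identification by the general public. We therefore present a comprehensive, politically focused dataset designed specifically for benchmarking detection against modern generative models. The dataset contains three million real images paired with descriptive captions, which are used to generate 963k corresponding high-quality synthetic images from a mix of proprietary and open-source models. Recognizing the continual evolution of generative techniques, we introduce an innovative crowdsourced adversarial platform in which participants are incentivized to generate and submit challenging synthetic images. This ongoing community-driven initiative ensures that deepfake detection methods remain robust and adaptive, proactively safeguarding public discourse from sophisticated misinformation threats.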

Title: Region-Wise Correspondence Prediction between Manga Line Art Images

Authors: Yingxuan Li, Jiafeng Mao, Qianru Qiu, Yusuke Matsui
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.09501
Pdf URL: https://arxiv.org/pdf/2509.09501
Copy Paste: [[2509.09501]] Region-Wise Correspondence Prediction between Manga Line Art Images(https://arxiv.org/abs/2509.09501)
Keywords: generation
Abstract: Understanding region-wise correspondence between manga line art images is a fundamental task in manga processing, enabling downstream applications such as automatic line art colorization and in-between frame generation. However, this task remains largely unexplored, especially in realistic scenarios without pre-existing segmentation or annotations. In this paper, we introduce a novel and practical task: predicting region-wise correspondence between raw manga line art images without any pre-existing labels or masks. To tackle this problem, we divide each line art image into a set of patches and propose a Transformer-based framework that learns patch-level similarities within and across images. We then apply edge-aware clustering and a region matching algorithm to convert patch-level predictions into coherent region-level correspondences. To support training and evaluation, we develop an automatic annotation pipeline and manually refine a subset of the data to construct benchmark datasets. Experiments on multiple datasets demonstrate that our method achieves high patch-level accuracy (e.g., 96.34%) and generates consistent region-level correspondences, highlighting its potential for real-world manga applications.
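Summary: Understanding region-wise correspondence between manga line art images is a fundamental task in manga processing, enabling downstream applications such as automatic line art colorization and in-between frame generation. However, the task remains largely unexplored, especially in realistic scenarios without pre-existing segmentation or annotations. In this paper, we introduce a novel and practical task: predicting region-wise correspondence between raw manga line art images without any pre-existing labels or masks. To tackle this problem, we divide each line art image into a set of patches and propose a Transformer-based framework that learns patch-level similarities within and across images. We then apply edge-aware clustering and a region matching algorithm to convert patch-level predictions into coherent region-level correspondences. To support training and evaluation, we develop an automatic annotation pipeline and manually refine a subset of the data to construct benchmark datasets. Experiments on multiple datasets show that our method achieves high patch-level accuracy (e.g., 96.34%) and generates consistent region-level correspondences, highlighting its potential for real-world manga applications.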
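
The step from patch-level similarities to region-level correspondences can be sketched as follows. The abstract does not describe the paper's edge-aware clustering or its matching algorithm, so this example simply assumes that each patch already carries a region label and matches every region in image A to the region in image B with the highest mean patch similarity. The function name and the toy similarity matrix are hypothetical.

```python
import numpy as np

def match_regions(sim, regions_a, regions_b):
    """Match each region of image A to one region of image B.

    sim:       (Na, Nb) patch-to-patch similarity matrix (assumed given).
    regions_a: (Na,) region label of every patch in image A.
    regions_b: (Nb,) region label of every patch in image B.
    Returns a dict {region_id_in_A: best_region_id_in_B}.
    """
    ids_a = np.unique(regions_a)
    ids_b = np.unique(regions_b)
    score = np.zeros((len(ids_a), len(ids_b)))
    for i, ra in enumerate(ids_a):
        for j, rb in enumerate(ids_b):
            # Mean similarity over all patch pairs drawn from the two regions.
            block = sim[np.ix_(regions_a == ra, regions_b == rb)]
            score[i, j] = block.mean()
    return {int(ids_a[i]): int(ids_b[score[i].argmax()]) for i in range(len(ids_a))}

# Toy example: two regions per image, four patches each.
sim = np.array([[0.1, 0.1, 0.9, 0.9],
                [0.1, 0.1, 0.9, 0.9],
                [0.8, 0.8, 0.2, 0.2],
                [0.8, 0.8, 0.2, 0.2]])
matches = match_regions(sim, np.array([0, 0, 1, 1]), np.array([5, 5, 7, 7]))
```

Averaging over patch pairs is only one possible aggregation; a one-to-one assignment would need an additional matching step that this sketch leaves out.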

Title: Generative Diffusion Contrastive Network for Multi-View Clustering

Authors: Jian Zhu, Xin Zou, Xi Wang, Ning Zhang, Bian Wu, Yao Yang, Ying Zhou, Lingfang Zeng, Chang Tang, Cheng Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.09527
Pdf URL: https://arxiv.org/pdf/2509.09527
Copy Paste: [[2509.09527]] Generative Diffusion Contrastive Network for Multi-View Clustering(https://arxiv.org/abs/2509.09527)
Keywords: generative
Abstract: In recent years, Multi-View Clustering (MVC) has been significantly advanced under the influence of deep learning. By integrating heterogeneous data from multiple views, MVC enhances clustering analysis, making multi-view fusion critical to clustering performance. However, there is a problem of low-quality data in multi-view fusion. This problem primarily arises from two reasons: 1) Certain views are contaminated by noisy data. 2) Some views suffer from missing data. This paper proposes a novel Stochastic Generative Diffusion Fusion (SGDF) method to address this problem. SGDF leverages a multiple generative mechanism for the multi-view feature of each sample. It is robust to low-quality data. Building on SGDF, we further present the Generative Diffusion Contrastive Network (GDCN). Extensive experiments show that GDCN achieves state-of-the-art results on deep MVC tasks. The source code is publicly available at this https URL.
Summary: In recent years, Multi-View Clustering (MVC) has advanced significantly under the influence of deep learning. By integrating heterogeneous data from multiple views, MVC enhances clustering analysis, which makes multi-view fusion critical to clustering performance. However, multi-view fusion suffers from low-quality data. This problem has two main causes: 1) certain views are contaminated by noisy data; 2) some views have missing data. This paper proposes a novel Stochastic Generative Diffusion Fusion (SGDF) method to address the problem. SGDF applies a multiple generative mechanism to the multi-view feature of each sample and is robust to low-quality data. Building on SGDF, we further present the Generative Diffusion Contrastive Network (GDCN). Extensive experiments show that GDCN achieves state-of-the-art results on deep MVC tasks. The source code is publicly available at this https URL.

Title: Improving Video Diffusion Transformer Training by Multi-Feature Fusion and Alignment from Self-Supervised Vision Encoders

Authors: Dohun Lee, Hyeonho Jeong, Jiwook Kim, Duygu Ceylan, Jong Chul Ye
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.09547
Pdf URL: https://arxiv.org/pdf/2509.09547
Copy Paste: [[2509.09547]] Improving Video Diffusion Transformer Training by Multi-Feature Fusion and Alignment from Self-Supervised Vision Encoders(https://arxiv.org/abs/2509.09547)
Keywords: generation
Abstract: Video diffusion models have advanced rapidly in recent years as a result of a series of architectural innovations (e.g., diffusion transformers) and the use of novel training objectives (e.g., flow matching). In contrast, less attention has been paid to improving the feature representation power of such models. In this work, we show that training video diffusion models can benefit from aligning the intermediate features of the video generator with feature representations of pre-trained vision encoders. We propose a new metric and conduct an in-depth analysis of various vision encoders to evaluate their discriminability and temporal consistency, thereby assessing their suitability for video feature alignment. Based on the analysis, we present Align4Gen, which provides a novel multi-feature fusion and alignment method integrated into video diffusion model training. We evaluate Align4Gen both for unconditional and class-conditional video generation tasks and show that it results in improved video generation as quantified by various metrics. Full video results are available on our project page: this https URL
Summary: Video diffusion models have advanced rapidly in recent years thanks to a series of architectural innovations (e.g., diffusion transformers) and novel training objectives (e.g., flow matching). By contrast, less attention has been paid to improving the feature representation power of such models. In this work, we show that training video diffusion models benefits from aligning the intermediate features of the video generator with the feature representations of pre-trained vision encoders. We propose a new metric and analyze various vision encoders in depth to evaluate their discriminability and temporal consistency, and thereby their suitability for video feature alignment. Based on this analysis, we present Align4Gen, a novel multi-feature fusion and alignment method integrated into video diffusion model training. We evaluate Align4Gen on unconditional and class-conditional video generation tasks and show that it improves video generation as quantified by various metrics. Full video results are available on our project page: this https URL

Title: InterAct: Advancing Large-Scale Versatile 3D Human-Object Interaction Generation

Authors: Sirui Xu, Dongting Li, Yucheng Zhang, Xiyan Xu, Qi Long, Ziyin Wang, Yunzhi Lu, Shuchang Dong, Hezi Jiang, Akshat Gupta, Yu-Xiong Wang, Liang-Yan Gui
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.09555
Pdf URL: https://arxiv.org/pdf/2509.09555
Copy Paste: [[2509.09555]] InterAct: Advancing Large-Scale Versatile 3D Human-Object Interaction Generation(https://arxiv.org/abs/2509.09555)
Keywords: generation, generative
Abstract: While large-scale human motion capture datasets have advanced human motion generation, modeling and generating dynamic 3D human-object interactions (HOIs) remain challenging due to dataset limitations. Existing datasets often lack extensive, high-quality motion and annotation and exhibit artifacts such as contact penetration, floating, and incorrect hand motions. To address these issues, we introduce InterAct, a large-scale 3D HOI benchmark featuring dataset and methodological advancements. First, we consolidate and standardize 21.81 hours of HOI data from diverse sources, enriching it with detailed textual annotations. Second, we propose a unified optimization framework to enhance data quality by reducing artifacts and correcting hand motions. Leveraging the principle of contact invariance, we maintain human-object relationships while introducing motion variations, expanding the dataset to 30.70 hours. Third, we define six benchmarking tasks and develop a unified HOI generative modeling perspective, achieving state-of-the-art performance. Extensive experiments validate the utility of our dataset as a foundational resource for advancing 3D human-object interaction generation. To support continued research in this area, the dataset is publicly available at this https URL, and will be actively maintained.
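Summary: Although large-scale human motion capture datasets have advanced human motion generation, modeling and generating dynamic 3D human-object interactions (HOIs) remains challenging because of dataset limitations. Existing datasets often lack extensive, high-quality motion and annotation and exhibit artifacts such as contact penetration, floating, and incorrect hand motions. To address these issues, we introduce InterAct, a large-scale 3D HOI benchmark featuring dataset and methodological advancements. First, we consolidate and standardize 21.81 hours of HOI data from diverse sources and enrich it with detailed textual annotations. Second, we propose a unified optimization framework that enhances data quality by reducing artifacts and correcting hand motions. Using the principle of contact invariance, we maintain human-object relationships while introducing motion variations, expanding the dataset to 30.70 hours. Third, we define six benchmarking tasks and develop a unified HOI generative modeling perspective, achieving state-of-the-art performance. Extensive experiments validate the utility of our dataset as a foundational resource for advancing 3D human-object interaction generation. To support continued research in this area, the dataset is publicly available at this https URL and will be actively maintained.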

Title: Kling-Avatar: Grounding Multimodal Instructions for Cascaded Long-Duration Avatar Animation Synthesis

Authors: Yikang Ding, Jiwen Liu, Wenyuan Zhang, Zekun Wang, Wentao Hu, Liyuan Cui, Mingming Lao, Yingchao Shao, Hui Liu, Xiaohan Li, Ming Chen, Xiaoqiang Liu, Yu-Shen Liu, Pengfei Wan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.09595
Pdf URL: https://arxiv.org/pdf/2509.09595
Copy Paste: [[2509.09595]] Kling-Avatar: Grounding Multimodal Instructions for Cascaded Long-Duration Avatar Animation Synthesis(https://arxiv.org/abs/2509.09595)
Keywords: generation
Abstract: Recent advances in audio-driven avatar video generation have significantly enhanced audio-visual realism. However, existing methods treat instruction conditioning merely as low-level tracking driven by acoustic or visual cues, without modeling the communicative purpose conveyed by the instructions. This limitation compromises their narrative coherence and character expressiveness. To bridge this gap, we introduce Kling-Avatar, a novel cascaded framework that unifies multimodal instruction understanding with photorealistic portrait generation. Our approach adopts a two-stage pipeline. In the first stage, we design a multimodal large language model (MLLM) director that produces a blueprint video conditioned on diverse instruction signals, thereby governing high-level semantics such as character motion and emotions. In the second stage, guided by blueprint keyframes, we generate multiple sub-clips in parallel using a first-last frame strategy. This global-to-local framework preserves fine-grained details while faithfully encoding the high-level intent behind multimodal instructions. Our parallel architecture also enables fast and stable generation of long-duration videos, making it suitable for real-world applications such as digital human livestreaming and vlogging. To comprehensively evaluate our method, we construct a benchmark of 375 curated samples covering diverse instructions and challenging scenarios. Extensive experiments demonstrate that Kling-Avatar is capable of generating vivid, fluent, long-duration videos at up to 1080p and 48 fps, achieving superior performance in lip synchronization accuracy, emotion and dynamic expressiveness, instruction controllability, identity preservation, and cross-domain generalization. These results establish Kling-Avatar as a new benchmark for semantically grounded, high-fidelity audio-driven avatar synthesis.
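Summary: Recent advances in audio-driven avatar video generation have significantly enhanced audio-visual realism. However, existing methods treat instruction conditioning merely as low-level tracking driven by acoustic or visual cues, without modeling the communicative purpose conveyed by the instructions. This limitation compromises narrative coherence and character expressiveness. To bridge this gap, we introduce Kling-Avatar, a novel cascaded framework that unifies multimodal instruction understanding with photorealistic portrait generation. Our approach adopts a two-stage pipeline. In the first stage, we design a multimodal large language model (MLLM) director that produces a blueprint video conditioned on diverse instruction signals, thereby governing high-level semantics such as character motion and emotions. In the second stage, guided by blueprint keyframes, we generate multiple sub-clips in parallel using a first-last frame strategy. This global-to-local framework preserves fine-grained details while faithfully encoding the high-level intent behind multimodal instructions. Our parallel architecture also enables fast and stable generation of long-duration videos, making it suitable for real-world applications such as digital human livestreaming and vlogging. To evaluate our method comprehensively, we construct a benchmark of 375 curated samples covering diverse instructions and challenging scenarios. Extensive experiments show that Kling-Avatar can generate vivid, fluent, long-duration videos at up to 1080p and 48 fps, achieving superior performance in lip synchronization accuracy, emotion and dynamic expressiveness, instruction controllability, identity preservation, and cross-domain generalization. These results establish Kling-Avatar as a new benchmark for semantically grounded, high-fidelity audio-driven avatar synthesis.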

Title: Mechanistic Learning with Guided Diffusion Models to Predict Spatio-Temporal Brain Tumor Growth

Authors: Daria Laslo, Efthymios Georgiou, Marius George Linguraru, Andreas Rauschecker, Sabine Muller, Catherine R. Jutzeler, Sarah Bruningk
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.09610
Pdf URL: https://arxiv.org/pdf/2509.09610
Copy Paste: [[2509.09610]] Mechanistic Learning with Guided Diffusion Models to Predict Spatio-Temporal Brain Tumor Growth(https://arxiv.org/abs/2509.09610)
Keywords: generation, generative
Abstract: Predicting the spatio-temporal progression of brain tumors is essential for guiding clinical decisions in neuro-oncology. We propose a hybrid mechanistic learning framework that combines a mathematical tumor growth model with a guided denoising diffusion implicit model (DDIM) to synthesize anatomically feasible future MRIs from preceding scans. The mechanistic model, formulated as a system of ordinary differential equations, captures temporal tumor dynamics including radiotherapy effects and estimates future tumor burden. These estimates condition a gradient-guided DDIM, enabling image synthesis that aligns with both predicted growth and patient anatomy. We train our model on the BraTS adult and pediatric glioma datasets and evaluate on 60 axial slices of in-house longitudinal pediatric diffuse midline glioma (DMG) cases. Our framework generates realistic follow-up scans based on spatial similarity metrics. It also introduces tumor growth probability maps, which capture both clinically relevant extent and directionality of tumor growth as shown by 95th percentile Hausdorff Distance. The method enables biologically informed image generation in data-limited scenarios, offering generative-space-time predictions that account for mechanistic priors.
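Summary: Predicting the spatio-temporal progression of brain tumors is essential for guiding clinical decisions in neuro-oncology. We propose a hybrid mechanistic learning framework that combines a mathematical tumor growth model with a guided denoising diffusion implicit model (DDIM) to synthesize anatomically feasible future MRIs from preceding scans. The mechanistic model, formulated as a system of ordinary differential equations, captures temporal tumor dynamics, including radiotherapy effects, and estimates future tumor burden. These estimates condition a gradient-guided DDIM, enabling image synthesis that aligns with both predicted growth and patient anatomy. We train our model on the BraTS adult and pediatric glioma datasets and evaluate it on 60 axial slices of in-house longitudinal pediatric diffuse midline glioma (DMG) cases. Our framework generates realistic follow-up scans according to spatial similarity metrics. It also introduces tumor growth probability maps, which capture both the clinically relevant extent and the directionality of tumor growth, as shown by the 95th percentile Hausdorff Distance. The method enables biologically informed image generation in data-limited scenarios, offering generative space-time predictions that account for mechanistic priors.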
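
The 95th percentile Hausdorff Distance mentioned above is a standard boundary-agreement metric. The sketch below shows one common way to compute it for two point sets, assuming the symmetric variant that takes the larger of the two directed 95th percentiles; other definitions exist (for example, pooling both directions before taking the percentile), and the abstract does not say which one the authors use. The function name `hd95` is illustrative.

```python
import numpy as np

def hd95(points_a, points_b):
    """Symmetric 95th percentile Hausdorff distance between two point sets.

    points_a: (Na, D) array of boundary coordinates (e.g., predicted tumor edge).
    points_b: (Nb, D) array of boundary coordinates (e.g., reference tumor edge).
    """
    # Pairwise Euclidean distances, shape (Na, Nb).
    diff = points_a[:, None, :] - points_b[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    a_to_b = dist.min(axis=1)   # nearest point in B for every point of A
    b_to_a = dist.min(axis=0)   # nearest point in A for every point of B
    # Taking the 95th percentile instead of the maximum reduces outlier sensitivity.
    return max(np.percentile(a_to_b, 95), np.percentile(b_to_a, 95))

pred = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
ref = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
distance = hd95(pred, ref)
```

In practice the point sets would be extracted from segmentation-mask boundaries and scaled by the voxel spacing, which this sketch does not cover.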

Title: ReBaNO: Reduced Basis Neural Operator Mitigating Generalization Gaps and Achieving Discretization Invariance

Authors: Haolan Zheng, Yanlai Chen, Jiequn Han, Yue Yu
Subjects: cs.LG, math.NA
Abstract URL: https://arxiv.org/abs/2509.09611
Pdf URL: https://arxiv.org/pdf/2509.09611
Copy Paste: [[2509.09611]] ReBaNO: Reduced Basis Neural Operator Mitigating Generalization Gaps and Achieving Discretization Invariance(https://arxiv.org/abs/2509.09611)
Keywords: generative
Abstract: We propose a novel data-lean operator learning algorithm, the Reduced Basis Neural Operator (ReBaNO), to solve a group of PDEs with multiple distinct inputs. Inspired by the Reduced Basis Method and the recently introduced Generative Pre-Trained Physics-Informed Neural Networks, ReBaNO relies on a mathematically rigorous greedy algorithm to build its network structure offline adaptively from the ground up. Knowledge distillation via task-specific activation function allows ReBaNO to have a compact architecture requiring minimal computational cost online while embedding physics. In comparison to state-of-the-art operator learning algorithms such as PCA-Net, DeepONet, FNO, and CNO, numerical results demonstrate that ReBaNO significantly outperforms them in terms of eliminating/shrinking the generalization gap for both in- and out-of-distribution tests and being the only operator learning algorithm achieving strict discretization invariance.
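Summary: We propose a novel data-lean operator learning algorithm, the Reduced Basis Neural Operator (ReBaNO), to solve a group of PDEs with multiple distinct inputs. Inspired by the Reduced Basis Method and the recently introduced Generative Pre-Trained Physics-Informed Neural Networks, ReBaNO relies on a mathematically rigorous greedy algorithm to build its network structure offline, adaptively and from the ground up. Knowledge distillation via a task-specific activation function gives ReBaNO a compact architecture that embeds physics while requiring minimal online computational cost. Compared with state-of-the-art operator learning algorithms such as PCA-Net, DeepONet, FNO, and CNO, numerical results show that ReBaNO significantly outperforms them in eliminating or shrinking the generalization gap for both in- and out-of-distribution tests, and that it is the only operator learning algorithm achieving strict discretization invariance.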

Title: Can Understanding and Generation Truly Benefit Together -- or Just Coexist?

Authors: Zhiyuan Yan, Kaiqing Lin, Zongjian Li, Junyan Ye, Hui Han, Zhendong Wang, Hao Liu, Bin Lin, Hao Li, Xue Xu, Xinyan Xiao, Jingdong Wang, Haifeng Wang, Li Yuan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.09666
Pdf URL: https://arxiv.org/pdf/2509.09666
Copy Paste: [[2509.09666]] Can Understanding and Generation Truly Benefit Together -- or Just Coexist?(https://arxiv.org/abs/2509.09666)
Keywords: generation
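Abstract: In this paper, we introduce an insightful paradigm through the Auto-Encoder lens: understanding as the encoder (I2T) that compresses images into text, and generation as the decoder (T2I) that reconstructs images from that text. Using reconstruction fidelity as the unified training objective, we enforce the coherent bidirectional information flow between the understanding and generation processes, bringing mutual gains. To implement this, we propose UAE, a novel framework for unified multimodal learning. We begin by pre-training the decoder with large-scale long-context image captions to capture fine-grained semantic and complex spatial relationships. We then propose Unified-GRPO via reinforcement learning (RL), which covers three stages: (1) A cold-start phase to gently initialize both encoder and decoder with a semantic reconstruction loss; (2) Generation for Understanding, where the encoder is trained to generate informative captions that maximize the decoder's reconstruction quality, enhancing its visual understanding; (3) Understanding for Generation, where the decoder is refined to reconstruct from these captions, forcing it to leverage every detail and improving its long-context instruction following and generation fidelity. For evaluation, we introduce Unified-Bench, the first benchmark tailored to assess the degree of unification of the UMMs. A surprising "aha moment" arises within the multimodal learning domain: as RL progresses, the encoder autonomously produces more descriptive captions, while the decoder simultaneously demonstrates a profound ability to understand these intricate descriptions, resulting in reconstructions of striking fidelity.
Summary: In this paper, we introduce an insightful paradigm through the Auto-Encoder lens: understanding acts as the encoder (I2T) that compresses images into text, and generation acts as the decoder (T2I) that reconstructs images from that text. With reconstruction fidelity as the unified training objective, we enforce a coherent bidirectional information flow between the understanding and generation processes, bringing mutual gains. To implement this, we propose UAE, a novel framework for unified multimodal learning. We first pre-train the decoder with large-scale long-context image captions to capture fine-grained semantic and complex spatial relationships. We then propose Unified-GRPO via reinforcement learning (RL), which covers three stages: (1) a cold-start phase that gently initializes both encoder and decoder with a semantic reconstruction loss; (2) Generation for Understanding, in which the encoder is trained to generate informative captions that maximize the decoder's reconstruction quality, enhancing its visual understanding; (3) Understanding for Generation, in which the decoder is refined to reconstruct from these captions, forcing it to use every detail and improving its long-context instruction following and generation fidelity. For evaluation, we introduce Unified-Bench, the first benchmark tailored to assessing the degree of unification of UMMs. A surprising "aha moment" arises in the multimodal learning domain: as RL progresses, the encoder autonomously produces more descriptive captions, while the decoder simultaneously shows a profound ability to understand these intricate descriptions, resulting in reconstructions of striking fidelity.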

Title: Geometric Neural Distance Fields for Learning Human Motion Priors

Authors: Zhengdi Yu, Simone Foti, Linguang Zhang, Amy Zhao, Cem Keskin, Stefanos Zafeiriou, Tolga Birdal
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.09667
Pdf URL: https://arxiv.org/pdf/2509.09667
Copy Paste: [[2509.09667]] Geometric Neural Distance Fields for Learning Human Motion Priors(https://arxiv.org/abs/2509.09667)
Keywords: generation, generative
Abstract: We introduce Neural Riemannian Motion Fields (NRMF), a novel 3D generative human motion prior that enables robust, temporally consistent, and physically plausible 3D motion recovery. Unlike existing VAE or diffusion-based methods, our higher-order motion prior explicitly models the human motion in the zero level set of a collection of neural distance fields (NDFs) corresponding to pose, transition (velocity), and acceleration dynamics. Our framework is rigorous in the sense that our NDFs are constructed on the product space of joint rotations, their angular velocities, and angular accelerations, respecting the geometry of the underlying articulations. We further introduce: (i) a novel adaptive-step hybrid algorithm for projecting onto the set of plausible motions, and (ii) a novel geometric integrator to "roll out" realistic motion trajectories during test-time-optimization and generation. Our experiments show significant and consistent gains: trained on the AMASS dataset, NRMF remarkably generalizes across multiple input modalities and to diverse tasks ranging from denoising to motion in-betweening and fitting to partial 2D / 3D observations.
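Summary: We introduce Neural Riemannian Motion Fields (NRMF), a novel 3D generative human motion prior that enables robust, temporally consistent, and physically plausible 3D motion recovery. Unlike existing VAE or diffusion-based methods, our higher-order motion prior explicitly models human motion in the zero level set of a collection of neural distance fields (NDFs) corresponding to pose, transition (velocity), and acceleration dynamics. Our framework is rigorous in the sense that the NDFs are constructed on the product space of joint rotations, their angular velocities, and their angular accelerations, respecting the geometry of the underlying articulations. We further introduce (i) a novel adaptive-step hybrid algorithm for projecting onto the set of plausible motions, and (ii) a novel geometric integrator to "roll out" realistic motion trajectories during test-time optimization and generation. Our experiments show significant and consistent gains: trained on the AMASS dataset, NRMF generalizes remarkably well across multiple input modalities and to diverse tasks ranging from denoising to motion in-betweening and fitting to partial 2D / 3D observations.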

Title: Locality in Image Diffusion Models Emerges from Data Statistics

Authors: Artem Lukoianov, Chenyang Yuan, Justin Solomon, Vincent Sitzmann
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.09672
Pdf URL: https://arxiv.org/pdf/2509.09672
Copy Paste: [[2509.09672]] Locality in Image Diffusion Models Emerges from Data Statistics(https://arxiv.org/abs/2509.09672)
Keywords: generative
Abstract: Among generative models, diffusion models are uniquely intriguing due to the existence of a closed-form optimal minimizer of their training objective, often referred to as the optimal denoiser. However, diffusion using this optimal denoiser merely reproduces images in the training set and hence fails to capture the behavior of deep diffusion models. Recent work has attempted to characterize this gap between the optimal denoiser and deep diffusion models, proposing analytical, training-free models that can generate images that resemble those generated by a trained UNet. The best-performing method hypothesizes that shift equivariance and locality inductive biases of convolutional neural networks are the cause of the performance gap, hence incorporating these assumptions into its analytical model. In this work, we present evidence that the locality in deep diffusion models emerges as a statistical property of the image dataset, not due to the inductive bias of convolutional neural networks. Specifically, we demonstrate that an optimal parametric linear denoiser exhibits similar locality properties to the deep neural denoisers. We further show, both theoretically and experimentally, that this locality arises directly from the pixel correlations present in natural image datasets. Finally, we use these insights to craft an analytical denoiser that better matches scores predicted by a deep diffusion model than the prior expert-crafted alternative.
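Summary: Among generative models, diffusion models are uniquely intriguing because their training objective has a closed-form optimal minimizer, often called the optimal denoiser. However, diffusion with this optimal denoiser merely reproduces images in the training set and therefore fails to capture the behavior of deep diffusion models. Recent work has attempted to characterize this gap between the optimal denoiser and deep diffusion models, proposing analytical, training-free models that can generate images resembling those generated by a trained UNet. The best-performing method hypothesizes that the shift equivariance and locality inductive biases of convolutional neural networks cause the performance gap, and so incorporates these assumptions into its analytical model. In this work, we present evidence that locality in deep diffusion models emerges as a statistical property of the image dataset rather than from the inductive bias of convolutional neural networks. Specifically, we demonstrate that an optimal parametric linear denoiser exhibits locality properties similar to those of deep neural denoisers. We further show, both theoretically and experimentally, that this locality arises directly from the pixel correlations present in natural image datasets. Finally, we use these insights to craft an analytical denoiser that matches the scores predicted by a deep diffusion model better than the prior expert-crafted alternative.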
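
The two denoisers contrasted in the abstract have well-known closed forms, sketched below on toy data. The sketch assumes the noising convention x_noisy = x0 + sigma * noise with standard Gaussian noise, which the abstract does not specify, and the function names are illustrative rather than the authors' code. For an empirical training set, the optimal denoiser is a softmax-weighted average of training points; the optimal linear (affine) denoiser is the Wiener-style estimator built from the dataset mean and covariance.

```python
import numpy as np

def optimal_denoiser(x_noisy, train_set, sigma):
    """Posterior mean E[x0 | x_noisy] when x0 is uniform over train_set."""
    # Squared distance from the noisy input to every training point.
    d2 = np.sum((train_set - x_noisy) ** 2, axis=1)
    # Softmax over -d2 / (2 sigma^2), shifted for numerical stability.
    logits = -d2 / (2.0 * sigma ** 2)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ train_set

def optimal_linear_denoiser(x_noisy, train_set, sigma):
    """Best affine estimator: mu + Sigma (Sigma + sigma^2 I)^-1 (x - mu)."""
    mu = train_set.mean(axis=0)
    cov = np.cov(train_set, rowvar=False, bias=True)
    dim = train_set.shape[1]
    gain = cov @ np.linalg.inv(cov + sigma ** 2 * np.eye(dim))
    return mu + gain @ (x_noisy - mu)

# Toy "dataset" of four 2-D points standing in for images.
data = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
x = np.array([0.1, -0.1])
near_memorised = optimal_denoiser(x, data, sigma=0.1)    # snaps to a training point
linear_estimate = optimal_linear_denoiser(np.array([3.0, 1.0]), data, sigma=1.0)
```

At low noise the first function collapses onto the nearest training point, which is the memorization behavior the abstract describes. The linear estimator depends on the data only through its mean and covariance, so any locality it shows has to come from pixel correlations.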

Title: FLUX-Reason-6M & PRISM-Bench: A Million-Scale Text-to-Image Reasoning Dataset and Comprehensive Benchmark

Authors: Rongyao Fang, Aldrich Yu, Chengqi Duan, Linjiang Huang, Shuai Bai, Yuxuan Cai, Kun Wang, Si Liu, Xihui Liu, Hongsheng Li
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2509.09680
Pdf URL: https://arxiv.org/pdf/2509.09680
Copy Paste: [[2509.09680]] FLUX-Reason-6M & PRISM-Bench: A Million-Scale Text-to-Image Reasoning Dataset and Comprehensive Benchmark(https://arxiv.org/abs/2509.09680)
Keywords: generation
Abstract: The advancement of open-source text-to-image (T2I) models has been hindered by the absence of large-scale, reasoning-focused datasets and comprehensive evaluation benchmarks, resulting in a performance gap compared to leading closed-source systems. To address this challenge, we introduce FLUX-Reason-6M and PRISM-Bench (Precise and Robust Image Synthesis Measurement Benchmark). FLUX-Reason-6M is a massive dataset consisting of 6 million high-quality FLUX-generated images and 20 million bilingual (English and Chinese) descriptions specifically designed to teach complex reasoning. The images are organized according to six key characteristics: Imagination, Entity, Text rendering, Style, Affection, and Composition, and we design an explicit Generation Chain-of-Thought (GCoT) to provide detailed breakdowns of image generation steps. The whole data curation takes 15,000 A100 GPU days, providing the community with a resource previously unattainable outside of large industrial labs. PRISM-Bench offers a novel evaluation standard with seven distinct tracks, including a formidable Long Text challenge using GCoT. Through carefully designed prompts, it utilizes advanced vision-language models for nuanced human-aligned assessment of prompt-image alignment and image aesthetics. Our extensive evaluation of 19 leading models on PRISM-Bench reveals critical performance gaps and highlights specific areas requiring improvement. Our dataset, benchmark, and evaluation code are released to catalyze the next wave of reasoning-oriented T2I generation. Project page: this https URL.
Summary: The advancement of open-source text-to-image (T2I) models has been hindered by the lack of large-scale, reasoning-focused datasets and comprehensive evaluation benchmarks, which has left a performance gap relative to leading closed-source systems. To address this challenge, we introduce FLUX-Reason-6M and PRISM-Bench (Precise and Robust Image Synthesis Measurement Benchmark). FLUX-Reason-6M is a massive dataset of 6 million high-quality FLUX-generated images and 20 million bilingual (English and Chinese) descriptions designed specifically to teach complex reasoning. The images are organized according to six key characteristics (Imagination, Entity, Text rendering, Style, Affection, and Composition), and we design an explicit Generation Chain-of-Thought (GCoT) to provide detailed breakdowns of image generation steps. The entire data curation takes 15,000 A100 GPU days, giving the community a resource previously unattainable outside large industrial labs. PRISM-Bench offers a novel evaluation standard with seven distinct tracks, including a formidable Long Text challenge using GCoT. Through carefully designed prompts, it uses advanced vision-language models for nuanced, human-aligned assessment of prompt-image alignment and image aesthetics. Our extensive evaluation of 19 leading models on PRISM-Bench reveals critical performance gaps and highlights specific areas requiring improvement. Our dataset, benchmark, and evaluation code are released to catalyze the next wave of reasoning-oriented T2I generation. Project page: this https URL.