2025-05-06

Title: Multi-party Collaborative Attention Control for Image Customization

Authors: Han Yang, Chuanguang Yang, Qiuli Wang, Zhulin An, Weilun Feng, Libo Huang, Yongjun Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.01428
Pdf URL: https://arxiv.org/pdf/2505.01428
Copy Paste: [[2505.01428]] Multi-party Collaborative Attention Control for Image Customization(https://arxiv.org/abs/2505.01428)
Keywords: generation
Abstract: The rapid advancement of diffusion models has increased the need for customized image generation. However, current customization methods face several limitations: 1) they typically accept either image or text conditions alone; 2) customization in complex visual scenarios often leads to subject leakage or confusion; 3) image-conditioned outputs tend to suffer from inconsistent backgrounds; and 4) they incur high computational costs. To address these issues, this paper introduces Multi-party Collaborative Attention Control (MCA-Ctrl), a tuning-free method that enables high-quality image customization using both text and complex visual conditions. Specifically, MCA-Ctrl leverages two key operations within the self-attention layer to coordinate multiple parallel diffusion processes and guide the target image generation. This approach allows MCA-Ctrl to capture the content and appearance of specific subjects while maintaining semantic consistency with the conditional input. Additionally, to mitigate the subject leakage and confusion issues common in complex visual scenarios, we introduce a Subject Localization Module that extracts precise subject and editable image layers based on user instructions. Extensive quantitative and human evaluation experiments show that MCA-Ctrl outperforms existing methods in zero-shot image customization, effectively resolving the issues listed above.

Title: Deconstructing Bias: A Multifaceted Framework for Diagnosing Cultural and Compositional Inequities in Text-to-Image Generative Models

Authors: Muna Numan Said, Aarib Zaidi, Rabia Usman, Sonia Okon, Praneeth Medepalli, Kevin Zhu, Vasu Sharma, Sean O'Brien
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.01430
Pdf URL: https://arxiv.org/pdf/2505.01430
Copy Paste: [[2505.01430]] Deconstructing Bias: A Multifaceted Framework for Diagnosing Cultural and Compositional Inequities in Text-to-Image Generative Models(https://arxiv.org/abs/2505.01430)
Keywords: generation, generative
Abstract: The transformative potential of text-to-image (T2I) models hinges on their ability to synthesize culturally diverse, photorealistic images from textual prompts. However, these models often perpetuate cultural biases embedded within their training data, leading to systemic misrepresentations. This paper benchmarks the Component Inclusion Score (CIS), a metric designed to evaluate the fidelity of image generation across cultural contexts. Through extensive analysis involving 2,400 images, we quantify biases in terms of compositional fragility and contextual misalignment, revealing significant performance gaps between Western and non-Western cultural prompts. Our findings underscore the impact of data imbalance, attention entropy, and embedding superposition on model fairness. By benchmarking models like Stable Diffusion with CIS, we provide insights into architectural and data-centric interventions for enhancing cultural inclusivity in AI-generated imagery. This work advances the field by offering a comprehensive tool for diagnosing and mitigating biases in T2I generation, advocating for more equitable AI systems.

Title: Global Stress Generation and Spatiotemporal Super-Resolution Physics-Informed Operator under Dynamic Loading for Two-Phase Random Materials

Authors: Tengfei Xing, Xiaodan Ren, Jie Li
Subjects: cs.LG, cond-mat.mtrl-sci, cs.AI
Abstract URL: https://arxiv.org/abs/2505.01438
Pdf URL: https://arxiv.org/pdf/2505.01438
Copy Paste: [[2505.01438]] Global Stress Generation and Spatiotemporal Super-Resolution Physics-Informed Operator under Dynamic Loading for Two-Phase Random Materials(https://arxiv.org/abs/2505.01438)
Keywords: super-resolution, generation
Abstract: Material stress analysis is a critical aspect of material design and performance optimization. Under dynamic loading, the global stress evolution in materials exhibits complex spatiotemporal characteristics, especially in two-phase random materials (TRMs). Failure of such materials is often associated with stress concentration, and the phase boundaries are key locations where stress concentration occurs. In practical engineering applications, the spatiotemporal resolution of acquired microstructural data and its dynamic stress evolution is often limited. This poses challenges for deep learning methods in generating high-resolution spatiotemporal stress fields, particularly for accurately capturing stress concentration regions. In this study, we propose a framework for global stress generation and spatiotemporal super-resolution in TRMs under dynamic loading. First, we introduce a diffusion model-based approach, named Spatiotemporal Stress Diffusion (STS-diffusion), for generating global spatiotemporal stress data. This framework incorporates Space-Time U-Net (STU-net), and we systematically investigate the impact of different attention positions on model accuracy. Next, we develop a physics-informed network for spatiotemporal super-resolution, termed Spatiotemporal Super-Resolution Physics-Informed Operator (ST-SRPINN). The proposed ST-SRPINN is an unsupervised learning method. The influence of data-driven and physics-informed loss function weights on model accuracy is explored in detail. Benefiting from physics-based constraints, ST-SRPINN requires only low-resolution stress field data during training and can upscale the spatiotemporal resolution of stress fields to arbitrary magnifications.

Title: OpenAVS: Training-Free Open-Vocabulary Audio Visual Segmentation with Foundational Models

Authors: Shengkai Chen, Yifang Yin, Jinming Cao, Shili Xiang, Zhenguang Liu, Roger Zimmermann
Subjects: cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2505.01448
Pdf URL: https://arxiv.org/pdf/2505.01448
Copy Paste: [[2505.01448]] OpenAVS: Training-Free Open-Vocabulary Audio Visual Segmentation with Foundational Models(https://arxiv.org/abs/2505.01448)
Keywords: generation
Abstract: Audio-visual segmentation aims to separate sounding objects from videos by predicting pixel-level masks based on audio signals. Existing methods primarily concentrate on closed-set scenarios and direct audio-visual alignment and fusion, which limits their capability to generalize to new, unseen situations. In this paper, we propose OpenAVS, a novel training-free language-based approach that, for the first time, effectively aligns audio and visual modalities using text as a proxy for open-vocabulary Audio-Visual Segmentation (AVS). Equipped with multimedia foundation models, OpenAVS directly infers masks through 1) audio-to-text prompt generation, 2) LLM-guided prompt translation, and 3) text-to-visual sounding object segmentation. The objective of OpenAVS is to establish a simple yet flexible architecture that relies on the most appropriate foundation models by fully leveraging their capabilities to enable more effective knowledge transfer to the downstream AVS task. Moreover, we present a model-agnostic framework OpenAVS-ST that enables the integration of OpenAVS with any advanced supervised AVS model via pseudo-label based self-training. This approach enhances performance by effectively utilizing large-scale unlabeled data when available. Comprehensive experiments on three benchmark datasets demonstrate the superior performance of OpenAVS. It surpasses existing unsupervised, zero-shot, and few-shot AVS methods by a significant margin, achieving absolute performance gains of approximately 9.4% and 10.9% in mIoU and F-score, respectively, in challenging scenarios.
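The OpenAVS abstract reports gains in mIoU and F-score. As a reading aid, here is a minimal sketch of how these two mask metrics are commonly computed for binary masks. The function names, the nested-list mask representation, and the default `beta2=1.0` are assumptions made for illustration; the abstract does not state the exact evaluation protocol.

```python
def iou(pred, target):
    """Intersection over union of two binary masks (nested lists of 0/1)."""
    inter = union = 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Convention chosen here (an assumption): two empty masks count as a perfect match.
    return inter / union if union else 1.0


def mean_iou(preds, targets):
    """Average IoU over a list of (prediction, ground-truth) mask pairs."""
    return sum(iou(p, t) for p, t in zip(preds, targets)) / len(preds)


def f_score(pred, target, beta2=1.0):
    """F-measure from pixel-level precision and recall; beta2 weights precision."""
    tp = fp = fn = 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t and not p:
                fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = beta2 * precision + recall
    return (1 + beta2) * precision * recall / denom if denom else 0.0


# Toy 2x2 masks, purely illustrative.
pred_mask = [[1, 1], [0, 0]]
true_mask = [[1, 0], [1, 0]]
print(iou(pred_mask, true_mask), f_score(pred_mask, true_mask))
```

An "absolute gain" in the abstract's sense would be a difference between two such scores, for example in percentage points.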

Title: Towards Film-Making Production Dialogue, Narration, Monologue Adaptive Moving Dubbing Benchmarks

Authors: Chaoyi Wang, Junjie Zheng, Zihao Chen, Shiyu Xia, Chaofan Ding, Xiaohao Zhang, Xi Tao, Xiaoming He, Xinhan Di
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.01450
Pdf URL: https://arxiv.org/pdf/2505.01450
Copy Paste: [[2505.01450]] Towards Film-Making Production Dialogue, Narration, Monologue Adaptive Moving Dubbing Benchmarks(https://arxiv.org/abs/2505.01450)
Keywords: generation
Abstract: Movie dubbing has advanced significantly, yet assessing the real-world effectiveness of these models remains challenging. A comprehensive evaluation benchmark is crucial for two key reasons: 1) Existing metrics fail to fully capture the complexities of dialogue, narration, monologue, and actor adaptability in movie dubbing. 2) A practical evaluation system should offer valuable insights to improve movie dubbing quality and advancement in film production. To this end, we introduce Talking Adaptive Dubbing Benchmarks (TA-Dubbing), designed to improve film production by adapting to dialogue, narration, monologue, and actors in movie dubbing. TA-Dubbing offers several key advantages: 1) Comprehensive Dimensions: TA-Dubbing covers a variety of dimensions of movie dubbing, incorporating metric evaluations for both movie understanding and speech generation. 2) Versatile Benchmarking: TA-Dubbing is designed to evaluate state-of-the-art movie dubbing models and advanced multi-modal large language models. 3) Full Open-Sourcing: We fully open-source TA-Dubbing at this https URL 0a/DeepDubber-V1, including all video suites, evaluation methods, and annotations. We also continuously integrate new movie dubbing models into the TA-Dubbing leaderboard at this https URL 0a/DeepDubber-V1 to drive forward the field of movie dubbing.

Title: VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations for Synthetic Videos

Authors: Zongxia Li, Xiyang Wu, Yubin Qin, Guangyao Shi, Hongyang Du, Dinesh Manocha, Tianyi Zhou, Jordan Lee Boyd-Graber
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.01481
Pdf URL: https://arxiv.org/pdf/2505.01481
Copy Paste: [[2505.01481]] VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations for Synthetic Videos(https://arxiv.org/abs/2505.01481)
Keywords: generation
Abstract: Synthetic video generation with foundation models has gained attention for its realism and wide applications. While these models produce high-quality frames, they often fail to respect common sense and physical laws, resulting in abnormal content. Existing metrics like VideoScore emphasize general quality but ignore such violations and lack interpretability. A more insightful approach is using multi-modal large language models (MLLMs) as interpretable evaluators, as seen in FactScore. Yet, MLLMs' ability to detect abnormalities in synthetic videos remains underexplored. To address this, we introduce VideoHallu, a benchmark featuring synthetic videos from models like Veo2, Sora, and Kling, paired with expert-designed QA tasks solvable via human-level reasoning across various categories. We assess several SoTA MLLMs, including GPT-4o, Gemini-2.5-Pro, Qwen-2.5-VL, and newer models like Video-R1 and VideoChat-R1. Despite strong real-world performance on MVBench and MovieChat, these models still hallucinate on basic commonsense and physics tasks in synthetic settings, underscoring the challenge of hallucination. We further fine-tune SoTA MLLMs using Group Relative Policy Optimization (GRPO) on real and synthetic commonsense/physics data. Results show notable accuracy gains, especially with counterexample integration, advancing MLLMs' reasoning capabilities. Our data is available at this https URL.
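The VideoHallu abstract mentions fine-tuning with Group Relative Policy Optimization (GRPO). The core idea behind the name is that each sampled answer is scored relative to the other answers sampled for the same prompt. The sketch below shows only that normalization step. The function name, the use of the population standard deviation, and the `eps` constant are assumptions for illustration, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of answers sampled for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so answers
    that beat their siblings get positive values and the rest get negative ones.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption made for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards: 1.0 for a correct answer to a QA item, 0.0 otherwise.
example_rewards = [1.0, 0.0, 1.0, 0.0]
example_advantages = group_relative_advantages(example_rewards)
print(example_advantages)
```

When every answer in a group receives the same reward, all advantages are zero and that group contributes no learning signal.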

Title: WorldGenBench: A World-Knowledge-Integrated Benchmark for Reasoning-Driven Text-to-Image Generation

Authors: Daoan Zhang, Che Jiang, Ruoshi Xu, Biaoxiang Chen, Zijian Jin, Yutian Lu, Jianguo Zhang, Liang Yong, Jiebo Luo, Shengda Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.01490
Pdf URL: https://arxiv.org/pdf/2505.01490
Copy Paste: [[2505.01490]] WorldGenBench: A World-Knowledge-Integrated Benchmark for Reasoning-Driven Text-to-Image Generation(https://arxiv.org/abs/2505.01490)
Keywords: generation
Abstract: Recent advances in text-to-image (T2I) generation have achieved impressive results, yet existing models still struggle with prompts that require rich world knowledge and implicit reasoning: both of which are critical for producing semantically accurate, coherent, and contextually appropriate images in real-world scenarios. To address this gap, we introduce WorldGenBench, a benchmark designed to systematically evaluate T2I models' world knowledge grounding and implicit inferential capabilities, covering both the humanities and nature domains. We propose the Knowledge Checklist Score, a structured metric that measures how well generated images satisfy key semantic expectations. Experiments across 21 state-of-the-art models reveal that while diffusion models lead among open-source methods, proprietary auto-regressive models like GPT-4o exhibit significantly stronger reasoning and knowledge integration. Our findings highlight the need for deeper understanding and inference capabilities in next-generation T2I systems. Project Page: this https URL
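The abstract describes the Knowledge Checklist Score only as a structured metric over "key semantic expectations" and does not give a formula. One plausible reading, shown here purely as an assumption, is the fraction of checklist items that a judge marks as satisfied for a generated image. The function name, the dictionary format, and the example checklist items are all hypothetical.

```python
def checklist_score(results):
    """Fraction of checklist items marked as satisfied.

    `results` maps a checklist item (a short description of an expected
    element) to True/False. This is a hypothetical formulation, not the
    paper's definition.
    """
    if not results:
        raise ValueError("checklist must contain at least one item")
    return sum(1 for satisfied in results.values() if satisfied) / len(results)


# Hypothetical judgments for one generated image.
example_checklist = {
    "period-appropriate clothing": True,
    "correct landmark present": False,
    "season matches the prompt": True,
    "expected number of people": True,
}
print(checklist_score(example_checklist))
```

Averaging this per-image value over all prompts would give a single benchmark number per model, again under the same assumption.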

Title: A Sensor Agnostic Domain Generalization Framework for Leveraging Geospatial Foundation Models: Enhancing Semantic Segmentation via Synergistic Pseudo-Labeling and Generative Learning

Authors: Anan Yaghmour, Melba M. Crawford, Saurabh Prasad
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.01558
Pdf URL: https://arxiv.org/pdf/2505.01558
Copy Paste: [[2505.01558]] A Sensor Agnostic Domain Generalization Framework for Leveraging Geospatial Foundation Models: Enhancing Semantic Segmentation via Synergistic Pseudo-Labeling and Generative Learning(https://arxiv.org/abs/2505.01558)
Keywords: generative
Abstract: Remote sensing enables a wide range of critical applications such as land cover and land use mapping, crop yield prediction, and environmental monitoring. Advances in satellite technology have expanded remote sensing datasets, yet high-performance segmentation models remain dependent on extensive labeled data, challenged by annotation scarcity and variability across sensors, illumination, and geography. Domain adaptation offers a promising solution to improve model generalization. This paper introduces a domain generalization approach to leveraging emerging geospatial foundation models by combining soft-alignment pseudo-labeling with source-to-target generative pre-training. We further provide new mathematical insights into MAE-based generative learning for domain-invariant feature learning. Experiments with hyperspectral and multispectral remote sensing datasets confirm our method's effectiveness in enhancing adaptability and segmentation.

Title: Automated ARAT Scoring Using Multimodal Video Analysis, Multi-View Fusion, and Hierarchical Bayesian Models: A Clinician Study

Authors: Tamim Ahmed, Thanassis Rikakis
Subjects: cs.CV, cs.AI, cs.HC, math.PR
Abstract URL: https://arxiv.org/abs/2505.01680
Pdf URL: https://arxiv.org/pdf/2505.01680
Copy Paste: [[2505.01680]] Automated ARAT Scoring Using Multimodal Video Analysis, Multi-View Fusion, and Hierarchical Bayesian Models: A Clinician Study(https://arxiv.org/abs/2505.01680)
Keywords: quality assessment
Abstract: Manual scoring of the Action Research Arm Test (ARAT) for upper extremity assessment in stroke rehabilitation is time-intensive and variable. We propose an automated ARAT scoring system integrating multimodal video analysis with SlowFast, I3D, and Transformer-based models using OpenPose keypoints and object locations. Our approach employs multi-view data (ipsilateral, contralateral, and top perspectives), applying early and late fusion to combine features across views and models. Hierarchical Bayesian Models (HBMs) infer movement quality components, enhancing interpretability. A clinician dashboard displays task scores, execution times, and quality assessments. We conducted a study with five clinicians who reviewed 500 video ratings generated by our system, providing feedback on its accuracy and usability. Evaluated on a stroke rehabilitation dataset, our framework achieves 89.0% validation accuracy with late fusion, with HBMs aligning closely with manual assessments. This work advances automated rehabilitation by offering a scalable, interpretable solution with clinical validation.
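The ARAT abstract reports its best validation accuracy with late fusion across camera views. Late fusion, in the generic sense, combines per-view predictions after each model has run, in contrast to early fusion, which combines features first. The sketch below averages class probabilities across views and takes the arg max. The function name, the equal default weights, and the example probabilities are assumptions; the paper's actual fusion rule is not given in the abstract.

```python
def late_fusion(view_probs, weights=None):
    """Fuse per-view class probabilities by a weighted average.

    `view_probs` is a list with one probability vector per view, all over the
    same classes. Returns (predicted class index, fused probability vector).
    """
    n_views = len(view_probs)
    n_classes = len(view_probs[0])
    if weights is None:
        weights = [1.0 / n_views] * n_views  # equal weighting is an assumption
    fused = [
        sum(w * probs[c] for w, probs in zip(weights, view_probs))
        for c in range(n_classes)
    ]
    label = max(range(n_classes), key=fused.__getitem__)
    return label, fused


# Hypothetical probabilities over four score classes (0 to 3) from three views.
ipsilateral = [0.1, 0.6, 0.2, 0.1]
contralateral = [0.2, 0.3, 0.4, 0.1]
top = [0.1, 0.5, 0.3, 0.1]
fused_label, fused_probs = late_fusion([ipsilateral, contralateral, top])
print(fused_label, fused_probs)
```

In this toy case the contralateral view alone would pick class 2, but the fused prediction follows the two views that agree on class 1.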

Title: Knowledge-Augmented Language Models Interpreting Structured Chest X-Ray Findings

Authors: Alexander Davis, Rafael Souza, Jia-Hao Lim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.01711
Pdf URL: https://arxiv.org/pdf/2505.01711
Copy Paste: [[2505.01711]] Knowledge-Augmented Language Models Interpreting Structured Chest X-Ray Findings(https://arxiv.org/abs/2505.01711)
Keywords: generation
Abstract: Automated interpretation of chest X-rays (CXR) is a critical task with the potential to significantly improve clinical workflow and patient care. While recent advances in multimodal foundation models have shown promise, effectively leveraging the full power of large language models (LLMs) for this visual task remains an underexplored area. This paper introduces CXR-TextInter, a novel framework that repurposes powerful text-centric LLMs for CXR interpretation by operating solely on a rich, structured textual representation of the image content, generated by an upstream image analysis pipeline. We augment this LLM-centric approach with an integrated medical knowledge module to enhance clinical reasoning. To facilitate training and evaluation, we developed the MediInstruct-CXR dataset, containing structured image representations paired with diverse, clinically relevant instruction-response examples, and the CXR-ClinEval benchmark for comprehensive assessment across various interpretation tasks. Extensive experiments on CXR-ClinEval demonstrate that CXR-TextInter achieves state-of-the-art quantitative performance across pathology detection, report generation, and visual question answering, surpassing existing multimodal foundation models. Ablation studies confirm the critical contribution of the knowledge integration module. Furthermore, blinded human evaluation by board-certified radiologists shows a significant preference for the clinical quality of outputs generated by CXR-TextInter. Our work validates an alternative paradigm for medical image AI, showcasing the potential of harnessing advanced LLM capabilities when visual information is effectively structured and domain knowledge is integrated.

Title: PosePilot: Steering Camera Pose for Generative World Models with Self-supervised Depth

Authors: Bu Jin, Weize Li, Baihan Yang, Zhenxin Zhu, Junpeng Jiang, Huan-ang Gao, Haiyang Sun, Kun Zhan, Hengtong Hu, Xueyang Zhang, Peng Jia, Hao Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.01729
Pdf URL: https://arxiv.org/pdf/2505.01729
Copy Paste: [[2505.01729]] PosePilot: Steering Camera Pose for Generative World Models with Self-supervised Depth(https://arxiv.org/abs/2505.01729)
Keywords: generation, generative
Abstract: Recent advancements in autonomous driving (AD) systems have highlighted the potential of world models in achieving robust and generalizable performance across both ordinary and challenging driving conditions. However, a key challenge remains: precise and flexible camera pose control, which is crucial for accurate viewpoint transformation and realistic simulation of scene dynamics. In this paper, we introduce PosePilot, a lightweight yet powerful framework that significantly enhances camera pose controllability in generative world models. Drawing inspiration from self-supervised depth estimation, PosePilot leverages structure-from-motion principles to establish a tight coupling between camera pose and video generation. Specifically, we incorporate self-supervised depth and pose readouts, allowing the model to infer depth and relative camera motion directly from video sequences. These outputs drive pose-aware frame warping, guided by a photometric warping loss that enforces geometric consistency across synthesized frames. To further refine camera pose estimation, we introduce a reverse warping step and a pose regression loss, improving viewpoint precision and adaptability. Extensive experiments on autonomous driving and general-domain video datasets demonstrate that PosePilot significantly enhances structural understanding and motion reasoning in both diffusion-based and auto-regressive world models. By steering camera pose with self-supervised depth, PosePilot sets a new benchmark for pose controllability, enabling physically consistent, reliable viewpoint synthesis in generative world models.

Title: Co$^{3}$Gesture: Towards Coherent Concurrent Co-speech 3D Gesture Generation with Interactive Diffusion

Authors: Xingqun Qi, Yatian Wang, Hengyuan Zhang, Jiahao Pan, Wei Xue, Shanghang Zhang, Wenhan Luo, Qifeng Liu, Yike Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.01746
Pdf URL: https://arxiv.org/pdf/2505.01746
Copy Paste: [[2505.01746]] Co$^{3}$Gesture: Towards Coherent Concurrent Co-speech 3D Gesture Generation with Interactive Diffusion(https://arxiv.org/abs/2505.01746)
Keywords: generation
Abstract: Generating gestures from human speech has made tremendous progress in animating virtual avatars. While existing methods can synthesize gestures for an individual speaking alone, they overlook the practicality of concurrent gesture modeling in two-person interactive conversations. Moreover, the lack of high-quality datasets with concurrent co-speech gestures also limits progress on this problem. To this end, we first construct a large-scale concurrent co-speech gesture dataset that contains more than 7M frames for diverse two-person interactive posture sequences, dubbed GES-Inter. Additionally, we propose Co$^3$Gesture, a novel framework that enables coherent concurrent co-speech gesture synthesis including two-person interactive movements. Considering the asymmetric body dynamics of two speakers, our framework is built upon two cooperative generation branches conditioned on separated speaker audio. Specifically, to enhance the coordination of human postures with respect to the corresponding speaker audio while interacting with the conversational partner, we present a Temporal Interaction Module (TIM). TIM can effectively model the temporal association representation between two speakers' gesture sequences as interaction guidance and fuse it into the concurrent gesture generation. Then, we devise a mutual attention mechanism to further holistically boost the learning of dependencies between interacting concurrent motions, thereby enabling us to generate vivid and coherent gestures. Extensive experiments demonstrate that our method outperforms the state-of-the-art models on our newly collected GES-Inter dataset. The dataset and source code are publicly available at this https URL.

Title: Enhancing the Learning Experience: Using Vision-Language Models to Generate Questions for Educational Videos

Authors: Markos Stamatakis, Joshua Berger, Christian Wartena, Ralph Ewerth, Anett Hoppe
Subjects: cs.CV, cs.CL, cs.MM
Abstract URL: https://arxiv.org/abs/2505.01790
Pdf URL: https://arxiv.org/pdf/2505.01790
Copy Paste: [[2505.01790]] Enhancing the Learning Experience: Using Vision-Language Models to Generate Questions for Educational Videos(https://arxiv.org/abs/2505.01790)
Keywords: generation
Abstract: Web-based educational videos offer flexible learning opportunities and are becoming increasingly popular. However, improving user engagement and knowledge retention remains a challenge. Automatically generated questions can activate learners and support their knowledge acquisition. Further, they can help teachers and learners assess their understanding. While large language and vision-language models have been employed in various tasks, their application to question generation for educational videos remains underexplored. In this paper, we investigate the capabilities of current vision-language models for generating learning-oriented questions for educational video content. We assess (1) out-of-the-box models' performance; (2) fine-tuning effects on content-specific question generation; (3) the impact of different video modalities on question quality; and (4) in a qualitative study, question relevance, answerability, and difficulty levels of generated questions. Our findings delineate the capabilities of current vision-language models, highlighting the need for fine-tuning and addressing challenges in question diversity and relevance. We identify requirements for future multimodal datasets and outline promising research directions.

Title: AquaGS: Fast Underwater Scene Reconstruction with SfM-Free Gaussian Splatting

Authors: Junhao Shi, Jisheng Xu, Jianping He, Zhiliang Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.01799
Pdf URL: https://arxiv.org/pdf/2505.01799
Copy Paste: [[2505.01799]] AquaGS: Fast Underwater Scene Reconstruction with SfM-Free Gaussian Splatting(https://arxiv.org/abs/2505.01799)
Keywords: generation
Abstract: Underwater scene reconstruction is a critical technology for underwater operations, enabling the generation of 3D models from images captured by underwater platforms. However, the quality of underwater images is often degraded due to medium interference, which limits the effectiveness of Structure-from-Motion (SfM) pose estimation, leading to subsequent reconstruction failures. Additionally, SfM methods typically operate at slower speeds, further hindering their applicability in real-time scenarios. In this paper, we introduce AquaGS, an SfM-free underwater scene reconstruction model based on the SeaThru algorithm, which facilitates rapid and accurate separation of scene details and medium features. Our approach initializes Gaussians by integrating state-of-the-art multi-view stereo (MVS) technology, employs implicit Neural Radiance Fields (NeRF) for rendering translucent media and utilizes the latest explicit 3D Gaussian Splatting (3DGS) technique to render object surfaces, which effectively addresses the limitations of traditional methods and accurately simulates underwater optical phenomena. Experimental results on the data set and the robot platform show that our model can complete high-precision reconstruction in 30 seconds with only 3 image inputs, significantly enhancing the practical application of the algorithm in robotic platforms.

Title: Efficient 3D Full-Body Motion Generation from Sparse Tracking Inputs with Temporal Windows

Authors: Georgios Fotios Angelis, Savas Ozkan, Sinan Mutlu, Paul Wisbey, Anastasios Drosou, Mete Ozay
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.01802
Pdf URL: https://arxiv.org/pdf/2505.01802
Copy Paste: [[2505.01802]] Efficient 3D Full-Body Motion Generation from Sparse Tracking Inputs with Temporal Windows(https://arxiv.org/abs/2505.01802)
Keywords: generation
Abstract: For a seamless user experience in immersive AR/VR applications, efficient and effective Neural Network (NN) models are essential, since body parts that cannot be captured by limited sensors must be generated by these models to obtain a complete 3D full-body reconstruction in a virtual environment. However, state-of-the-art NN models are typically computationally expensive, and they leverage longer sequences of sparse tracking inputs to generate full-body movements by capturing temporal context. Inevitably, longer sequences increase the computation overhead and introduce noise in longer temporal dependencies that adversely affects the generation performance. In this paper, we propose a novel Multi-Layer Perceptron (MLP)-based method that enhances the overall performance while balancing the computational cost and memory overhead for efficient 3D full-body generation. Specifically, we introduce an NN mechanism that divides the longer sequence of inputs into smaller temporal windows. The current motion is then merged with the information from these windows through latent representations to utilize the past context for the generation. Our experiments demonstrate that the generation accuracy of our method with this NN mechanism is significantly improved compared to state-of-the-art methods while greatly reducing computational costs and memory overhead, making our method suitable for resource-constrained devices.
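The windowing idea in this abstract can be illustrated with a minimal Python sketch. It is not the authors' model: the frame layout, the function names, and the use of a per-dimension mean as a stand-in for the learned latent representation are all assumptions made for illustration.

```python
from typing import List, Sequence

Frame = List[float]  # one sparse-tracking input vector (hypothetical layout)


def split_into_windows(frames: Sequence[Frame], window: int) -> List[List[Frame]]:
    """Split past frames into consecutive, non-overlapping temporal windows."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [list(frames[i:i + window]) for i in range(0, len(frames), window)]


def summarize(window_frames: Sequence[Frame]) -> Frame:
    """Stand-in for a learned latent encoder: the per-dimension mean."""
    n = len(window_frames)
    return [sum(column) / n for column in zip(*window_frames)]


def merge_with_context(current: Frame, past: Sequence[Frame], window: int) -> Frame:
    """Concatenate the current frame with one summary vector per past window."""
    merged = list(current)
    for window_frames in split_into_windows(past, window):
        merged.extend(summarize(window_frames))
    return merged


# Six past frames with two features each, grouped into windows of three frames.
past = [[float(t), float(2 * t)] for t in range(6)]
features = merge_with_context([10.0, 20.0], past, window=3)
print(features)
```

Each window contributes a fixed-size summary, so the merged input grows with the number of windows rather than with the number of past frames.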

Title: Analytic Energy-Guided Policy Optimization for Offline Reinforcement Learning

Authors: Jifeng Hu, Sili Huang, Zhejian Yang, Shengchao Hu, Li Shen, Hechang Chen, Lichao Sun, Yi Chang, Dacheng Tao
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.01822
Pdf URL: https://arxiv.org/pdf/2505.01822
Copy Paste: [[2505.01822]] Analytic Energy-Guided Policy Optimization for Offline Reinforcement Learning(https://arxiv.org/abs/2505.01822)
Keywords: generation
Abstract: Conditional decision generation with diffusion models has shown powerful competitiveness in reinforcement learning (RL). Recent studies reveal the relation between energy-function-guidance diffusion models and constrained RL problems. The main challenge lies in estimating the intermediate energy, which is intractable due to the log-expectation formulation during the generation process. To address this issue, we propose the Analytic Energy-guided Policy Optimization (AEPO). Specifically, we first provide a theoretical analysis and the closed-form solution of the intermediate guidance when the diffusion model obeys the conditional Gaussian transformation. Then, we analyze the posterior Gaussian distribution in the log-expectation formulation and obtain the target estimation of the log-expectation under mild assumptions. Finally, we train an intermediate energy neural network to approach the target estimation of log-expectation formulation. We apply our method in 30+ offline RL tasks to demonstrate the effectiveness of our method. Extensive experiments illustrate that our method surpasses numerous representative baselines in D4RL offline reinforcement learning benchmarks.

Title: PhytoSynth: Leveraging Multi-modal Generative Models for Crop Disease Data Generation with Novel Benchmarking and Prompt Engineering Approach

Authors: Nitin Rai, Arnold W. Schumann, Nathan Boyd
Subjects: cs.CV, cs.AI, cs.ET
Abstract URL: https://arxiv.org/abs/2505.01823
Pdf URL: https://arxiv.org/pdf/2505.01823
Copy Paste: [[2505.01823]] PhytoSynth: Leveraging Multi-modal Generative Models for Crop Disease Data Generation with Novel Benchmarking and Prompt Engineering Approach(https://arxiv.org/abs/2505.01823)
Keywords: generation, generative
Abstract: Collecting large-scale crop disease images in the field is labor-intensive and time-consuming. Generative models (GMs) offer an alternative by creating synthetic samples that resemble real-world images. However, existing research primarily relies on Generative Adversarial Network (GAN)-based image-to-image translation and lacks a comprehensive analysis of computational requirements in agriculture. Therefore, this research explores a multi-modal text-to-image approach for generating synthetic crop disease images and is the first to provide computational benchmarking in this context. We trained three Stable Diffusion (SD) variants, namely SDXL, SD3.5M (medium), and SD3.5L (large), and fine-tuned them using Dreambooth and Low-Rank Adaptation (LoRA) techniques to enhance generalization. SD3.5M outperformed the others, with an average memory usage of 18 GB, power consumption of 180 W, and total energy use of 1.02 kWh/500 images (0.002 kWh per image) during the inference task. Our results demonstrate SD3.5M's ability to generate 500 synthetic images from just 36 in-field samples in 1.5 hours. We recommend SD3.5M for efficient crop disease data generation.
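The per-image figures follow from simple division of the totals reported in the abstract. The short Python check below uses only those reported numbers; the variable names are illustrative, and it assumes the 1.5-hour figure covers the whole 500-image run.

```python
TOTAL_ENERGY_KWH = 1.02  # reported total energy for the 500-image run
NUM_IMAGES = 500         # reported number of generated images
TOTAL_HOURS = 1.5        # reported generation time (assumed to cover all 500 images)

energy_per_image_kwh = TOTAL_ENERGY_KWH / NUM_IMAGES
seconds_per_image = TOTAL_HOURS * 3600 / NUM_IMAGES

print(f"{energy_per_image_kwh:.5f} kWh per image")
print(f"{seconds_per_image:.1f} s per image")
```

The unrounded energy value is 0.00204 kWh per image, which the abstract rounds to 0.002 kWh.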

Title: DualDiff: Dual-branch Diffusion Model for Autonomous Driving with Semantic Fusion

Authors: Haoteng Li, Zhao Yang, Zezhong Qian, Gongpeng Zhao, Yuqi Huang, Jun Yu, Huazheng Zhou, Longjun Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.01857
Pdf URL: https://arxiv.org/pdf/2505.01857
Copy Paste: [[2505.01857]] DualDiff: Dual-branch Diffusion Model for Autonomous Driving with Semantic Fusion(https://arxiv.org/abs/2505.01857)
Keywords: generation
Abstract: Accurate and high-fidelity driving scene reconstruction relies on fully leveraging scene information as conditioning. However, existing approaches, which primarily use 3D bounding boxes and binary maps for foreground and background control, fall short in capturing the complexity of the scene and integrating multi-modal information. In this paper, we propose DualDiff, a dual-branch conditional diffusion model designed to enhance multi-view driving scene generation. We introduce Occupancy Ray Sampling (ORS), a semantic-rich 3D representation, alongside numerical driving scene representation, for comprehensive foreground and background control. To improve cross-modal information integration, we propose a Semantic Fusion Attention (SFA) mechanism that aligns and fuses features across modalities. Furthermore, we design a foreground-aware masked (FGM) loss to enhance the generation of tiny objects. DualDiff achieves state-of-the-art performance in FID score, as well as consistently better results in downstream BEV segmentation and 3D object detection tasks.

Title: Rethinking Score Distilling Sampling for 3D Editing and Generation

Authors: Xingyu Miao, Haoran Duan, Yang Long, Jungong Han
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.01888
Pdf URL: https://arxiv.org/pdf/2505.01888
Copy Paste: [[2505.01888]] Rethinking Score Distilling Sampling for 3D Editing and Generation(https://arxiv.org/abs/2505.01888)
Keywords: generation
Abstract: Score Distillation Sampling (SDS) has emerged as a prominent method for text-to-3D generation by leveraging the strengths of 2D diffusion models. However, SDS is limited to generation tasks and lacks the capability to edit existing 3D assets. Conversely, variants of SDS that introduce editing capabilities often cannot generate new 3D assets effectively. In this work, we observe that the processes of generation and editing within SDS and its variants have unified underlying gradient terms. Building on this insight, we propose Unified Distillation Sampling (UDS), a method that seamlessly integrates both the generation and editing of 3D assets. Essentially, UDS refines the gradient terms used in vanilla SDS methods, unifying them to support both tasks. Extensive experiments demonstrate that UDS not only outperforms baseline methods in generating 3D assets with richer details but also excels in editing tasks, thereby bridging the gap between 3D generation and editing. The code is available on: this https URL.
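For reference, the gradient of vanilla SDS is commonly written as below. This is the standard formulation from the text-to-3D literature, not the refined UDS gradient, which the abstract does not spell out. The notation is an assumption: $x = g(\theta)$ is the image rendered from 3D parameters $\theta$, $x_t = \alpha_t x + \sigma_t \epsilon$ is its noised version, $\hat{\epsilon}_\phi$ is the diffusion model's noise prediction for text prompt $y$, and $w(t)$ is a timestep weighting.

```latex
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\epsilon}\!\left[
      w(t)\,\bigl(\hat{\epsilon}_\phi(x_t;\, y,\, t) - \epsilon\bigr)\,
      \frac{\partial x}{\partial \theta}
    \right],
\qquad x = g(\theta), \quad x_t = \alpha_t x + \sigma_t \epsilon .
```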

Title: OODTE: A Differential Testing Engine for the ONNX Optimizer

Authors: Nikolaos Louloudakis, Ajitha Rajan
Subjects: cs.LG, cs.AI, cs.SE, eess.SY
Abstract URL: https://arxiv.org/abs/2505.01892
Pdf URL: https://arxiv.org/pdf/2505.01892
Copy Paste: [[2505.01892]] OODTE: A Differential Testing Engine for the ONNX Optimizer(https://arxiv.org/abs/2505.01892)
Keywords: generation
Abstract: With 700 stars on GitHub and a place in the official ONNX repository, the ONNX Optimizer is the standard method for applying graph-based optimizations to ONNX models. However, its ability to preserve model accuracy across optimizations has not been rigorously explored. We propose OODTE, a utility to automatically and thoroughly assess the correctness of the ONNX Optimizer. OODTE follows a simple yet effective differential testing and evaluation approach that can be easily adapted to other compiler optimizers. In particular, OODTE takes a number of ONNX models, optimizes them, and executes both the original and the optimized variants across a user-defined set of inputs, while automatically logging any issues with the optimization process. Finally, for successfully optimized models, OODTE compares the results and, if any accuracy deviations are observed, iteratively repeats the process for each pass of the ONNX Optimizer to localize the root cause of the differences observed. Using OODTE, we sourced 130 well-known models from the official ONNX Model Hub, used for a wide variety of tasks (classification, object detection, semantic segmentation, text summarization, question answering, sentiment analysis). We detected 15 issues, 14 of which were previously unknown, associated with optimizer crashes and accuracy deviations. We also observed that 9.2% of all model instances presented issues leading to a crash of the optimizer or the generation of an invalid model while using the primary optimizer strategies. In addition, 30% of the classification models presented accuracy differences between the original and the optimized model variants, while 16.6% of semantic segmentation and object detection models are also affected, at least to a limited extent.
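The per-pass localization step described above can be sketched generically in Python. This is a toy illustration under stated assumptions, not OODTE's code and not the ONNX Optimizer API: a "model" is an ordinary function, each "pass" is a hypothetical function that maps a model to an optimized model, and the pass names are invented.

```python
from typing import Callable, Dict, List, Sequence

Model = Callable[[float], float]
OptPass = Callable[[Model], Model]


def max_deviation(original: Model, optimized: Model, inputs: Sequence[float]) -> float:
    """Largest absolute output difference over the given inputs."""
    return max(abs(original(x) - optimized(x)) for x in inputs)


def localize_faulty_passes(model: Model, passes: Dict[str, OptPass],
                           inputs: Sequence[float], tol: float = 1e-6) -> List[str]:
    """Apply each pass on its own; report passes that crash or change outputs."""
    faulty = []
    for name, opt_pass in passes.items():
        try:
            optimized = opt_pass(model)
            if max_deviation(model, optimized, inputs) > tol:
                faulty.append(name)      # accuracy deviation
        except Exception:
            faulty.append(name)          # optimizer or execution crash
    return faulty


def identity_pass(m: Model) -> Model:
    return m                              # semantics-preserving


def lossy_rounding_pass(m: Model) -> Model:
    return lambda x: float(round(m(x)))   # changes outputs for some inputs


def crashing_pass(m: Model) -> Model:
    raise RuntimeError("optimizer crashed")


model = lambda x: 2.0 * x + 1.0
passes = {"identity": identity_pass,
          "lossy_rounding": lossy_rounding_pass,
          "crashing": crashing_pass}
faulty = localize_faulty_passes(model, passes, inputs=[0.0, 0.25, 1.0])
print(faulty)
```

Running every pass in isolation against the same inputs keeps the comparison simple; interactions between passes would need a separate check on the full pipeline.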

Title: LookAlike: Consistent Distractor Generation in Math MCQs

Authors: Nisarg Parikh, Nigel Fernandez, Alexander Scarlatos, Simon Woodhead, Andrew Lan
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.01903
Pdf URL: https://arxiv.org/pdf/2505.01903
Copy Paste: [[2505.01903]] LookAlike: Consistent Distractor Generation in Math MCQs(https://arxiv.org/abs/2505.01903)
Keywords: generation
Abstract: Large language models (LLMs) are increasingly used to generate distractors for multiple-choice questions (MCQs), especially in domains like math education. However, existing approaches are limited in ensuring that the generated distractors are consistent with common student errors. We propose LookAlike, a method that improves error-distractor consistency via preference optimization. Our two main innovations are: (a) mining synthetic preference pairs from model inconsistencies, and (b) alternating supervised fine-tuning (SFT) with Direct Preference Optimization (DPO) to stabilize training. Unlike prior work that relies on heuristics or manually annotated preference data, LookAlike uses its own generation inconsistencies as dispreferred samples, thus enabling scalable and stable training. Evaluated on a real-world dataset of 1,400+ math MCQs, LookAlike achieves 51.6% accuracy in distractor generation and 57.2% in error generation under LLM-as-a-judge evaluation, outperforming an existing state-of-the-art method (45.6% / 47.7%). These improvements highlight the effectiveness of preference-based regularization and inconsistency mining for generating consistent math MCQ distractors at scale.

Title: BOOM: Benchmarking Out-Of-distribution Molecular Property Predictions of Machine Learning Models

Authors: Evan R. Antoniuk, Shehtab Zaman, Tal Ben-Nun, Peggy Li, James Diffenderfer, Busra Demirci, Obadiah Smolenski, Tim Hsu, Anna M. Hiszpanski, Kenneth Chiu, Bhavya Kailkhura, Brian Van Essen
Subjects: cs.LG, cond-mat.mtrl-sci, cs.AI
Abstract URL: https://arxiv.org/abs/2505.01912
Pdf URL: https://arxiv.org/pdf/2505.01912
Copy Paste: [[2505.01912]] BOOM: Benchmarking Out-Of-distribution Molecular Property Predictions of Machine Learning Models(https://arxiv.org/abs/2505.01912)
Keywords: generation, generative
Abstract: Advances in deep learning and generative modeling have driven interest in data-driven molecule discovery pipelines, whereby machine learning (ML) models are used to filter and design novel molecules without requiring prohibitively expensive first-principles simulations. Although the discovery of novel molecules that extend the boundaries of known chemistry requires accurate out-of-distribution (OOD) predictions, ML models often struggle to generalize OOD. Furthermore, there are currently no systematic benchmarks for molecular OOD prediction tasks. We present BOOM (Benchmarks for Out-Of-distribution Molecular property predictions), a benchmark study of property-based out-of-distribution models for common molecular property prediction models. We evaluate more than 140 combinations of models and property prediction tasks to benchmark deep learning models on their OOD performance. Overall, we do not find any existing models that achieve strong OOD generalization across all tasks: even the top performing model exhibited an average OOD error 3x larger than in-distribution. We find that deep learning models with high inductive bias can perform well on OOD tasks with simple, specific properties. Although chemical foundation models with transfer and in-context learning offer a promising solution for limited training data scenarios, we find that current foundation models do not show strong OOD extrapolation capabilities. We perform extensive ablation experiments to highlight how OOD performance is impacted by data generation, pre-training, hyperparameter optimization, model architecture, and molecular representation. We propose that developing ML models with strong OOD generalization is a new frontier challenge in chemical ML model development. This open-source benchmark will be made available on GitHub.

Title: HybridGS: High-Efficiency Gaussian Splatting Data Compression using Dual-Channel Sparse Representation and Point Cloud Encoder

Authors: Qi Yang, Le Yang, Geert Van Der Auwera, Zhu Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.01938
Pdf URL: https://arxiv.org/pdf/2505.01938
Copy Paste: [[2505.01938]] HybridGS: High-Efficiency Gaussian Splatting Data Compression using Dual-Channel Sparse Representation and Point Cloud Encoder(https://arxiv.org/abs/2505.01938)
Keywords: generation
Abstract: Most existing 3D Gaussian Splatting (3DGS) compression schemes focus on producing compact 3DGS representations via implicit data embedding. They have long coding times and highly customized data formats, which hinders widespread deployment. This paper presents a new 3DGS compression framework called HybridGS, which takes advantage of both compact generation and standardized point cloud data encoding. HybridGS first generates compact and explicit 3DGS data. A dual-channel sparse representation is introduced to supervise the primitive position and feature bit depth. It then utilizes a canonical point cloud encoder to perform further data compression and form standard output bitstreams. A simple and effective rate control scheme is proposed to pivot the interpretable data compression scheme. At the current stage, HybridGS does not include any modules aimed at improving 3DGS quality during generation. However, experimental results show that it still provides reconstruction performance comparable to state-of-the-art methods, with evidently higher encoding and decoding speed. The code is publicly available at this https URL.

Title: Semantic Probabilistic Control of Language Models

Authors: Kareem Ahmed, Catarina G Belem, Padhraic Smyth, Sameer Singh
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.01954
Pdf URL: https://arxiv.org/pdf/2505.01954
Copy Paste: [[2505.01954]] Semantic Probabilistic Control of Language Models(https://arxiv.org/abs/2505.01954)
Keywords: generation
Abstract: Semantic control entails steering LM generations towards satisfying subtle non-lexical constraints, e.g., toxicity, sentiment, or politeness, attributes that can be captured by a sequence-level verifier. It can thus be viewed as sampling from the LM distribution conditioned on the target attribute, a computationally intractable problem due to the non-decomposable nature of the verifier. Existing approaches to LM control either only deal with syntactic constraints which cannot capture the aforementioned attributes, or rely on sampling to explore the conditional LM distribution, an ineffective estimator for low-probability events. In this work, we leverage a verifier's gradient information to efficiently reason over all generations that satisfy the target attribute, enabling precise steering of LM generations by reweighing the next-token distribution. Starting from an initial sample, we create a local LM distribution favoring semantically similar sentences. This approximation enables the tractable computation of an expected sentence embedding. We use this expected embedding, informed by the verifier's evaluation at the initial sample, to estimate the probability of satisfying the constraint, which directly informs the update to the next-token distribution. We evaluated the effectiveness of our approach in controlling the toxicity, sentiment, and topic-adherence of LMs yielding generations satisfying the constraint with high probability (>95%) without degrading their quality.
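The final step, reweighting the next-token distribution by an estimated probability of satisfying the constraint, can be illustrated with a minimal Python sketch. It shows only the generic Bayes-style reweighting p'(token) ∝ p(token) · P(constraint | token). The paper's gradient-based estimator is not reproduced here, and the token names and probability values are made up for illustration.

```python
from typing import Dict


def reweight_next_token(lm_probs: Dict[str, float],
                        satisfy_prob: Dict[str, float]) -> Dict[str, float]:
    """Scale each token's LM probability by the (estimated) probability that
    continuing with it satisfies the constraint, then renormalize."""
    unnormalized = {tok: p * satisfy_prob[tok] for tok, p in lm_probs.items()}
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("no token has non-zero probability of satisfying the constraint")
    return {tok: w / total for tok, w in unnormalized.items()}


# Hypothetical values: the LM is indifferent, the verifier-based estimate is not.
lm_probs = {"great": 0.5, "awful": 0.5}
satisfy_prob = {"great": 0.9, "awful": 0.1}   # e.g. P(non-toxic | next token)
controlled = reweight_next_token(lm_probs, satisfy_prob)
print(controlled)
```

In this toy case the controlled distribution places roughly 0.9 of its mass on "great", since the LM prior was uniform and the reweighting is driven entirely by the constraint estimate.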

Title: Secrets of GFlowNets' Learning Behavior: A Theoretical Study

Authors: Tianshu Yu
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2505.02035
Pdf URL: https://arxiv.org/pdf/2505.02035
Copy Paste: [[2505.02035]] Secrets of GFlowNets' Learning Behavior: A Theoretical Study(https://arxiv.org/abs/2505.02035)
Keywords: generative
Abstract: Generative Flow Networks (GFlowNets) have emerged as a powerful paradigm for generating composite structures, demonstrating considerable promise across diverse applications. While substantial progress has been made in exploring their modeling validity and connections to other generative frameworks, the theoretical understanding of their learning behavior remains largely uncharted. In this work, we present a rigorous theoretical investigation of GFlowNets' learning behavior, focusing on four fundamental dimensions: convergence, sample complexity, implicit regularization, and robustness. By analyzing these aspects, we seek to elucidate the intricate mechanisms underlying GFlowNet's learning dynamics, shedding light on its strengths and limitations. Our findings contribute to a deeper understanding of the factors influencing GFlowNet performance and provide insights into principled guidelines for their effective design and deployment. This study not only bridges a critical gap in the theoretical landscape of GFlowNets but also lays the foundation for their evolution as a reliable and interpretable framework for generative modeling. Through this, we aspire to advance the theoretical frontiers of GFlowNets and catalyze their broader adoption in the AI community.

Title: Regression is all you need for medical image translation

Authors: Sebastian Rassmann, David Kügler, Christian Ewert, Martin Reuter
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.02048
Pdf URL: https://arxiv.org/pdf/2505.02048
Copy Paste: [[2505.02048]] Regression is all you need for medical image translation(https://arxiv.org/abs/2505.02048)
Keywords: generation, generative
Abstract: The acquisition of information-rich images within a limited time budget is crucial in medical imaging. Medical image translation (MIT) can help enhance and supplement existing datasets by generating synthetic images from acquired data. While Generative Adversarial Nets (GANs) and Diffusion Models (DMs) have achieved remarkable success in natural image generation, their benefits (creativity and image realism) do not necessarily transfer to medical applications where highly accurate anatomical information is required. In fact, the imitation of acquisition noise or content hallucination hinders clinical utility. Here, we introduce YODA (You Only Denoise once - or Average), a novel 2.5D diffusion-based framework for volumetric MIT. YODA unites diffusion and regression paradigms to produce realistic or noise-free outputs. Furthermore, we propose Expectation-Approximation (ExpA) DM sampling, which draws inspiration from MRI signal averaging. ExpA-sampling suppresses generated noise and thus prevents noise from biasing the evaluation of image quality. Through extensive experiments on four diverse multi-modal datasets, comprising multi-contrast brain MRI and pelvic MRI-CT, we show that diffusion and regression sampling yield similar results in practice. As such, the computational overhead of diffusion sampling does not provide systematic benefits in medical information translation. Building on these insights, we demonstrate that YODA outperforms several state-of-the-art GAN and DM methods. Notably, YODA-generated images are shown to be interchangeable with, or even superior to, physical acquisitions for several downstream tasks. Our findings challenge the presumed advantages of DMs in MIT and pave the way for the practical application of MIT in medical imaging.

Title: SkillMimic-V2: Learning Robust and Generalizable Interaction Skills from Sparse and Noisy Demonstrations

Authors: Runyi Yu, Yinhuai Wang, Qihan Zhao, Hok Wai Tsui, Jingbo Wang, Ping Tan, Qifeng Chen
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2505.02094
Pdf URL: https://arxiv.org/pdf/2505.02094
Copy Paste: [[2505.02094]] SkillMimic-V2: Learning Robust and Generalizable Interaction Skills from Sparse and Noisy Demonstrations(https://arxiv.org/abs/2505.02094)
Keywords: generation
Abstract: We address a fundamental challenge in Reinforcement Learning from Interaction Demonstration (RLID): demonstration noise and coverage limitations. While existing data collection approaches provide valuable interaction demonstrations, they often yield sparse, disconnected, and noisy trajectories that fail to capture the full spectrum of possible skill variations and transitions. Our key insight is that despite noisy and sparse demonstrations, there exist infinite physically feasible trajectories that naturally bridge between demonstrated skills or emerge from their neighboring states, forming a continuous space of possible skill variations and transitions. Building upon this insight, we present two data augmentation techniques: a Stitched Trajectory Graph (STG) that discovers potential transitions between demonstration skills, and a State Transition Field (STF) that establishes unique connections for arbitrary states within the demonstration neighborhood. To enable effective RLID with augmented data, we develop an Adaptive Trajectory Sampling (ATS) strategy for dynamic curriculum generation and a historical encoding mechanism for memory-dependent skill learning. Our approach enables robust skill acquisition that significantly generalizes beyond the reference demonstrations. Extensive experiments across diverse interaction tasks demonstrate substantial improvements over state-of-the-art methods in terms of convergence stability, generalization capability, and recovery robustness.

Title: Unaligned RGB Guided Hyperspectral Image Super-Resolution with Spatial-Spectral Concordance

Authors: Yingkai Zhang, Zeqiang Lai, Tao Zhang, Ying Fu, Chenghu Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.02109
Pdf URL: https://arxiv.org/pdf/2505.02109
Copy Paste: [[2505.02109]] Unaligned RGB Guided Hyperspectral Image Super-Resolution with Spatial-Spectral Concordance(https://arxiv.org/abs/2505.02109)
Keywords: super-resolution, generation
Abstract: Hyperspectral image (HSI) super-resolution aims to improve spatial resolution, yet its performance is often limited at high resolution ratios. The recent adoption of high-resolution reference images for super-resolution is driven by the poor spatial detail found in low-resolution HSIs, which makes it a favorable method. However, these approaches cannot effectively utilize information from the reference image, due to inaccurate alignment and inadequate interaction between the alignment and fusion modules. In this paper, we introduce a Spatial-Spectral Concordance Hyperspectral Super-Resolution (SSC-HSR) framework for unaligned reference RGB guided HSI SR to address the inaccurate alignment and poor interactivity of previous approaches. Specifically, to ensure spatial concordance, i.e., to align images more accurately across resolutions and refine textures, we construct a Two-Stage Image Alignment with a synthetic generation pipeline in the image alignment module, where a fine-tuned optical flow model produces a more accurate optical flow in the first stage and a warp model refines damaged textures in the second stage. To enhance the interaction between the alignment and fusion modules and ensure spectral concordance during reconstruction, we propose a Feature Aggregation module and an Attention Fusion module. In the feature aggregation module, we introduce an Iterative Deformable Feature Aggregation block to achieve significant feature matching and texture aggregation with the guidance of the multi-scale fusion results, iteratively generating learnable offsets. In addition, we introduce two basic spectral-wise attention blocks in the attention fusion module to model the inter-spectra interactions. Extensive experiments on three natural or remote-sensing datasets show that our method outperforms state-of-the-art approaches in both quantitative and qualitative evaluations.
摘要：高光谱图像超分辨率旨在改善空间分辨率，但其性能通常在高分辨率比率下受到限制。最新采用高分辨率参考图像是由低分辨率HSIS中发现的不良空间细节驱动的，以一种有利的方法提出。但是，由于对齐不准确及其对齐模块和融合模块之间的相互作用不足，这些方法无法有效地利用参考图像中的信息。在本文中，我们引入了一个空间 - 光谱一致性高光谱超分辨率（SSC-HSR）框架，用于未对齐的参考RGB引导的HSI SR，以解决以前方法的不准确比对的问题和不良的交互性。具体来说，为了确保空间和一分平，即更准确地对齐图像，并且在图像对齐模块中构建具有合成生成管道的两阶段图像对齐，在该模块中，微型光流模型可以在第一个阶段和扭曲模型中产生更准确的光流，并且可以在第二阶段进行扭曲模型。为了增强对齐和融合模块之间的相互作用并确保在重建过程中的光谱一致性，我们提出了一个特征聚集模块和注意力融合模块。在功能聚合模块中，我们引入了一个迭代变形特征聚合块，以通过融合多尺度结果指导实现重要的特征匹配和纹理聚合，从而迭代地生成可学习的偏移。此外，我们在注意力融合模块中引入了两个基本的光谱注意块，以模拟光谱相互作用。在三个天然或遥感数据集上进行的广泛实验表明，我们的方法在定量和定性评估上都优于最先进的方法。

Title: HiLLIE: Human-in-the-Loop Training for Low-Light Image Enhancement

Authors: Xiaorui Zhao, Xinyue Zhou, Peibei Cao, Junyu Lou, Shuhang Gu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.02134
Pdf URL: https://arxiv.org/pdf/2505.02134
Copy Paste: [[2505.02134]] HiLLIE: Human-in-the-Loop Training for Low-Light Image Enhancement(https://arxiv.org/abs/2505.02134)
Keywords: quality assessment
Abstract: Developing effective approaches to generate enhanced results that align well with human visual preferences for high-quality well-lit images remains a challenge in low-light image enhancement (LLIE). In this paper, we propose a human-in-the-loop LLIE training framework that improves the visual quality of unsupervised LLIE model outputs through iterative training stages, named HiLLIE. At each stage, we introduce human guidance into the training process through efficient visual quality annotations of enhanced outputs. Subsequently, we employ a tailored image quality assessment (IQA) model to learn human visual preferences encoded in the acquired labels, which is then utilized to guide the training process of an enhancement model. With only a small amount of pairwise ranking annotations required at each stage, our approach continually improves the IQA model's capability to simulate human visual assessment of enhanced outputs, thus leading to visually appealing LLIE results. Extensive experiments demonstrate that our approach significantly improves unsupervised LLIE model performance in terms of both quantitative and qualitative performance. The code and collected ranking dataset will be available at this https URL.
摘要：开发有效的方法来产生增强的结果，使与人类视觉偏好相符的高质量良好光线效果图像的结果仍然是弱光图像增强（LLIE）的挑战。在本文中，我们提出了一个人类的LLIE培训框架，该框架通过迭代训练阶段（名为Hillie）提高了无监督LLIE模型输出的视觉质量。在每个阶段，我们通过有效的增强产出的有效视觉质量注释将人类的指导引入培训过程中。随后，我们采用量身定制的图像质量评估（IQA）模型来学习在获得的标签中编码的人类视觉偏好，然后将其用于指导增强模型的训练过程。在每个阶段只需要少量的成对排名注释，我们的方法不断提高IQA模型模拟人体视觉评估增强输出的能力，从而导致视觉上吸引人的LLIE结果。广泛的实验表明，我们的方法在定量和定性性能方面显着提高了无监督的LLIE模型性能。代码和收集的排名数据集将在此HTTPS URL上可用。

Title: Small Clips, Big Gains: Learning Long-Range Refocused Temporal Information for Video Super-Resolution

Authors: Xingyu Zhou, Wei Long, Jingbo Lu, Shiyin Jiang, Weiyi You, Haifeng Wu, Shuhang Gu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.02159
Pdf URL: https://arxiv.org/pdf/2505.02159
Copy Paste: [[2505.02159]] Small Clips, Big Gains: Learning Long-Range Refocused Temporal Information for Video Super-Resolution(https://arxiv.org/abs/2505.02159)
Keywords: restoration, super-resolution
Abstract: Video super-resolution (VSR) can achieve better performance compared to single image super-resolution by additionally leveraging temporal information. In particular, the recurrent-based VSR model exploits long-range temporal information during inference and achieves superior detail restoration. However, effectively learning these long-term dependencies within long videos remains a key challenge. To address this, we propose LRTI-VSR, a novel training framework for recurrent VSR that efficiently leverages Long-Range Refocused Temporal Information. Our framework includes a generic training strategy that utilizes temporal propagation features from long video clips while training on shorter video clips. Additionally, we introduce a refocused intra&inter-frame transformer block which allows the VSR model to selectively prioritize useful temporal information through its attention module while further improving inter-frame information utilization in the FFN module. We evaluate LRTI-VSR on both CNN and transformer-based VSR architectures, conducting extensive ablation studies to validate the contribution of each component. Experiments on long-video test sets demonstrate that LRTI-VSR achieves state-of-the-art performance while maintaining training and computational efficiency.
摘要：与单个图像超分辨率相比，视频超分辨率（VSR）可以通过利用时间信息来实现更好的性能。特别是，基于复发的VSR模型在推理过程中利用了远程时间信息，并实现了较高的细节恢复。但是，有效地学习这些长期视频中的这些长期依赖性仍然是一个关键挑战。为了解决这个问题，我们提出了LRTI-VSR，这是一个针对复发性VSR的新型培训框架，可有效利用长期重新聚焦的时间信息。我们的框架包括一种通用的培训策略，该策略在较短的视频剪辑上培训时利用了长时间视频剪辑的时间传播功能。此外，我们引入了一个重新聚焦的内部和框架间变压器块，该块允许VSR模型通过其注意模块选择性地优先考虑有用的时间信息，同时进一步改善FFN模块中的框架间信息利用率。我们在CNN和基于变压器的VSR体系结构上评估LRTI-VSR，进行广泛的消融研究以验证每个组件的贡献。长效测试集的实验表明，LRTI-VSR在维持培训和计算效率的同时可以达到最先进的性能。

Title: Robust AI-Generated Face Detection with Imbalanced Data

Authors: Yamini Sri Krubha, Aryana Hou, Braden Vester, Web Walker, Xin Wang, Li Lin, Shu Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.02182
Pdf URL: https://arxiv.org/pdf/2505.02182
Copy Paste: [[2505.02182]] Robust AI-Generated Face Detection with Imbalanced Data(https://arxiv.org/abs/2505.02182)
Keywords: generative
Abstract: Deepfakes, created using advanced AI techniques such as Variational Autoencoder and Generative Adversarial Networks, have evolved from research and entertainment applications into tools for malicious activities, posing significant threats to digital trust. Current deepfake detection techniques have evolved from CNN-based methods focused on local artifacts to more advanced approaches using vision transformers and multimodal models like CLIP, which capture global anomalies and improve cross-domain generalization. Despite recent progress, state-of-the-art deepfake detectors still face major challenges in handling distribution shifts from emerging generative models and addressing severe class imbalance between authentic and fake samples in deepfake datasets, which limits their robustness and detection accuracy. To address these challenges, we propose a framework that combines dynamic loss reweighting and ranking-based optimization, which achieves superior generalization and performance under imbalanced dataset conditions. The code is available at this https URL.
摘要：使用高级AI技术（例如变异自动编码器和生成对抗网络）创建的深泡沫已从研究和娱乐应用程序演变为恶意活动的工具，对数字信任构成了重大威胁。当前的DeepFake检测技术已从基于CNN的方法发展为局部工件的基于CNN的方法，再到使用视觉变压器和多模型（例如Clip）的更先进的方法，这些方法捕获了全局异常并改善了跨域的概括。尽管最近取得了进展，但最新的深层探测器仍然面临着从新兴生成模型的分配转变以及在DeepFake数据集中的真实样品和假样品之间的严重阶级失衡方面面临的主要挑战，这限制了它们的稳健性和检测准确性。为了应对这些挑战，我们提出了一个框架，该框架结合了动态损失重新加权和基于排名的优化，该框架在不平衡的数据集条件下实现了卓越的概括和性能。该代码可在此HTTPS URL上找到。

Title: DualReal: Adaptive Joint Training for Lossless Identity-Motion Fusion in Video Customization

Authors: Wenchuan Wang, Mengqi Huang, Yijing Tu, Zhendong Mao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.02192
Pdf URL: https://arxiv.org/pdf/2505.02192
Copy Paste: [[2505.02192]] DualReal: Adaptive Joint Training for Lossless Identity-Motion Fusion in Video Customization(https://arxiv.org/abs/2505.02192)
Keywords: generation
Abstract: Customized text-to-video generation with pre-trained large-scale models has recently garnered significant attention through focusing on identity and motion consistency. Existing works typically follow the isolated customized paradigm, where the subject identity or motion dynamics are customized exclusively. However, this paradigm completely ignores the intrinsic mutual constraints and synergistic interdependencies between identity and motion, resulting in identity-motion conflicts throughout the generation process that systematically degrade the output. To address this, we introduce DualReal, a novel framework that employs adaptive joint training to collaboratively construct interdependencies between dimensions. Specifically, DualReal is composed of two units: (1) Dual-aware Adaptation dynamically selects a training phase (i.e., identity or motion), learns the current information guided by the frozen dimension prior, and employs a regularization strategy to avoid knowledge leakage; (2) StageBlender Controller leverages the denoising stages and Diffusion Transformer depths to guide different dimensions with adaptive granularity, avoiding conflicts at various stages and ultimately achieving lossless fusion of identity and motion patterns. We constructed a more comprehensive benchmark than existing methods. The experimental results show that DualReal improves CLIP-I and DINO-I metrics by 21.7% and 31.8% on average, and achieves top performance on nearly all motion quality metrics.
摘要：定制的文本到视频生成具有预训练的大规模模型最近通过着重于身份和运动一致性而引起了极大的关注。现有作品通常遵循孤立的自定义范式，其中主题身份或运动动态是专门定制的。但是，该范式完全忽略了身份和运动之间的内在互助约束和协同的相互依赖性，从而在整个生成过程中导致身份 - 动作冲突，从而系统地降级。为了解决这个问题，我们介绍了Dualeal，这是一个新型框架，该框架采用自适应联合培训来协作在维度之间建立相互依存关系。具体而言，Dualeal由两个单元组成：（1）双感知适应性动态选择一个训练阶段（即身份或运动），学习以冷冻维度之前指导的当前信息，并采用正则化策略来避免知识泄漏；（2）舞台brender控制器利用脱索阶段和扩散变压器深度来指导不同的维度，以自适应粒度，避免在各个阶段发生冲突，并最终实现对身份和运动模式的无损融合。我们构建了比现有方法更全面的基准。实验结果表明，Dualeal将夹子I和Dino-I指标提高21.7％和31.8％，并在几乎所有运动质量指标上都达到了最高的性能。

Title: Improving Physical Object State Representation in Text-to-Image Generative Systems

Authors: Tianle Chen, Chaitanya Chakka, Deepti Ghadiyaram
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.02236
Pdf URL: https://arxiv.org/pdf/2505.02236
Copy Paste: [[2505.02236]] Improving Physical Object State Representation in Text-to-Image Generative Systems(https://arxiv.org/abs/2505.02236)
Keywords: generative
Abstract: Current text-to-image generative models struggle to accurately represent object states (e.g., "a table without a bottle," "an empty tumbler"). In this work, we first design a fully-automatic pipeline to generate high-quality synthetic data that accurately captures objects in varied states. Next, we fine-tune several open-source text-to-image models on this synthetic data. We evaluate the performance of the fine-tuned models by quantifying the alignment of the generated images to their prompts using GPT4o-mini, and achieve an average absolute improvement of 8+% across four models on the public GenAI-Bench dataset. We also curate a collection of 200 prompts with a specific focus on common objects in various physical states. We demonstrate a significant improvement of an average of 24+% over the baseline on this dataset. We release all evaluation prompts and code.
摘要：当前的文本到图像生成模型难以准确表示对象状态（例如，“没有瓶子的桌子”，“一个空的玻璃杯”）。在这项工作中，我们首先设计了完全自动的管道，以生成高质量的合成数据，以准确捕获各种状态的对象。接下来，我们在此综合数据上微调了几个开源文本对图像模型。我们通过使用GPT4O-MINI量化生成的图像与提示的对齐方式来评估微型模型的性能，并在公共Genai-Bench数据集中的四个模型中平均实现8+％的平均绝对提高。我们还策划了200个提示的集合，并特别关注各种物理状态的共同对象。我们证明，与该数据集的基线相比，平均24+％的显着提高。我们发布所有评估提示和代码。

Title: Federated Causal Inference in Healthcare: Methods, Challenges, and Applications

Authors: Haoyang Li, Jie Xu, Kyra Gan, Fei Wang, Chengxi Zang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.02238
Pdf URL: https://arxiv.org/pdf/2505.02238
Copy Paste: [[2505.02238]] Federated Causal Inference in Healthcare: Methods, Challenges, and Applications(https://arxiv.org/abs/2505.02238)
Keywords: generation
Abstract: Federated causal inference enables multi-site treatment effect estimation without sharing individual-level data, offering a privacy-preserving solution for real-world evidence generation. However, data heterogeneity across sites, manifested in differences in covariate, treatment, and outcome, poses significant challenges for unbiased and efficient estimation. In this paper, we present a comprehensive review and theoretical analysis of federated causal effect estimation across both binary/continuous and time-to-event outcomes. We classify existing methods into weight-based strategies and optimization-based frameworks and further discuss extensions including personalized models, peer-to-peer communication, and model decomposition. For time-to-event outcomes, we examine federated Cox and Aalen-Johansen models, deriving asymptotic bias and variance under heterogeneity. Our analysis reveals that FedProx-style regularization achieves near-optimal bias-variance trade-offs compared to naive averaging and meta-analysis. We review related software tools and conclude by outlining opportunities, challenges, and future directions for scalable, fair, and trustworthy federated causal inference in distributed healthcare systems.
摘要：联邦因果推断可以实现多站点治疗效果估计，而无需共享个人级别的数据，从而为真实的证据生成提供了隐私的解决方案。但是，跨站点的数据异质性在协变量，治疗和结果的差异中表现出来，对无偏见和有效估计提出了重大挑战。在本文中，我们对二进制/连续和事件时间结果的联合因果效应估计进行了全面综述和理论分析。我们将现有方法分类为基于权重的策略和基于优化的框架，并进一步讨论包括个性化模型，点对点通信和模型分解在内的扩展。为了进行事件时间的结果，我们检查了联邦COX和AALEN-JOHANSEN模型，这些模型在异质性下得出了渐近偏差和方差。我们的分析表明，与天真的平均和荟萃分析相比，FEDPROX风格的正则化实现了近乎最佳的偏见变化权衡。我们审查相关的软件工具，并通过概述分布式医疗保健系统中的可扩展，公平和值得信赖的联邦因果推断的机会，挑战和未来方向。
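Illustrative note (not from the paper): the abstract above contrasts FedProx-style regularization with naive averaging. The sketch below shows the general idea of a proximal term as it is commonly written, a site's local loss plus (mu / 2) times the squared distance between the local and global parameters. The function name `fedprox_objective`, the plain-list parameter vectors, and the value of `mu` are assumptions made for illustration; the estimators analyzed in the paper may differ.

```python
def fedprox_objective(local_loss, w_local, w_global, mu=0.1):
    """Local loss plus a proximal penalty that keeps a site's parameters
    close to the current global parameters (generic FedProx-style form).

    local_loss: scalar loss evaluated at w_local on the site's own data
    w_local, w_global: parameter vectors as equal-length lists of floats
    mu: proximal strength (hypothetical default, not taken from the paper)
    """
    squared_distance = sum((a - b) ** 2 for a, b in zip(w_local, w_global))
    return local_loss + 0.5 * mu * squared_distance


# With mu = 0 the penalty vanishes and each site optimizes its own loss only.
print(fedprox_objective(1.0, [1.0, 2.0], [0.0, 0.0], mu=0.5))
```

A larger `mu` pulls each site toward the shared model, which is the lever behind the bias-variance trade-off the abstract refers to.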

Title: Quantizing Diffusion Models from a Sampling-Aware Perspective

Authors: Qian Zeng, Jie Song, Yuanyu Wan, Huiqiong Wang, Mingli Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.02242
Pdf URL: https://arxiv.org/pdf/2505.02242
Copy Paste: [[2505.02242]] Quantizing Diffusion Models from a Sampling-Aware Perspective(https://arxiv.org/abs/2505.02242)
Keywords: generation
Abstract: Diffusion models have recently emerged as the dominant approach in visual generation tasks. However, the lengthy denoising chains and the computationally intensive noise estimation networks hinder their applicability in low-latency and resource-limited environments. Previous research has endeavored to address these limitations in a decoupled manner, utilizing either advanced samplers or efficient model quantization techniques. In this study, we uncover that quantization-induced noise disrupts directional estimation at each sampling step, further distorting the precise directional estimations of higher-order samplers when solving the sampling equations through discretized numerical methods, thereby altering the optimal sampling trajectory. To attain dual acceleration with high fidelity, we propose a sampling-aware quantization strategy, wherein a Mixed-Order Trajectory Alignment technique is devised to impose a more stringent constraint on the error bounds at each sampling step, facilitating a more linear probability flow. Extensive experiments on sparse-step fast sampling across multiple datasets demonstrate that our approach preserves the rapid convergence characteristics of high-speed samplers while maintaining superior generation quality. Code will be made publicly available soon.
摘要：扩散模型最近已成为视觉生成任务中的主要方法。但是，漫长的deno链和计算密集的噪声估计网络阻碍了其在低延迟和资源限制环境中的适用性。以前的研究已努力利用高级采样器或有效的模型量化技术以脱钩的方式解决这些局限性。在这项研究中，我们发现，量化诱导的噪声会在每个采样步骤中破坏方向估计，从而在求解采样方程时通过离散的数值方法进一步扭曲了高阶采样器的精确定向估计，从而改变了最佳采样轨迹。为了以高保真度达到双重加速度，我们提出了一种抽样的量化策略，其中设计的混合阶段轨迹比对技术被设计为对每个采样步骤的误差界限施加更严格的约束，从而促进了更线性的概率。对跨越多个数据集的稀疏步骤快速采样的广泛实验表明，我们的方法保留了高速采样器的快速收敛特性，同时保持了卓越的发电质量。代码将很快公开提供。

Title: Enhancing AI Face Realism: Cost-Efficient Quality Improvement in Distilled Diffusion Models with a Fully Synthetic Dataset

Authors: Jakub Wąsala, Bartłomiej Wrzalski, Kornelia Noculak, Yuliia Tarasenko, Oliwer Krupa, Jan Kocoń, Grzegorz Chodak
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.02255
Pdf URL: https://arxiv.org/pdf/2505.02255
Copy Paste: [[2505.02255]] Enhancing AI Face Realism: Cost-Efficient Quality Improvement in Distilled Diffusion Models with a Fully Synthetic Dataset(https://arxiv.org/abs/2505.02255)
Keywords: generation, generative
Abstract: This study presents a novel approach to enhance the cost-to-quality ratio of image generation with diffusion models. We hypothesize that differences between distilled (e.g. FLUX.1-schnell) and baseline (e.g. FLUX.1-dev) models are consistent and, therefore, learnable within a specialized domain, like portrait generation. We generate a synthetic paired dataset and train a fast image-to-image translation head. Using two sets of low- and high-quality synthetic images, our model is trained to refine the output of a distilled generator (e.g., FLUX.1-schnell) to a level comparable to a baseline model like FLUX.1-dev, which is more computationally intensive. Our results show that the pipeline, which combines a distilled version of a large generative model with our enhancement layer, delivers similar photorealistic portraits to the baseline version with up to an 82% decrease in computational cost compared to FLUX.1-dev. This study demonstrates the potential for improving the efficiency of AI solutions involving large-scale image generation.
摘要：这项研究提出了一种新的方法，可以通过扩散模型提高图像产生的成本质量比率。我们假设蒸馏（例如Flux.1-Schnell）和基线（例如Flux.1-DEV）模型之间的差异是一致的，因此，在特殊的领域（如肖像生成）中可以学习。我们生成一个合成的配对数据集并训练快速的图像到图像翻译头。使用两组低质量和高质量的合成图像，我们的模型经过训练，以完善蒸馏发电机（例如Flux.1-SCHNELL）的输出，达到与诸如Flux.1-DEV之类的基线模型相当的水平，该级别在计算上是更密集的。我们的结果表明，将大型生成模型的蒸馏版与我们的增强层相结合的管道提供了与基线版本相似的影像肖像，与Flux.1-DEV相比，计算成本下降了82％。这项研究证明了提高涉及大规模图像产生的AI溶液效率的潜力。

Title: Entropy-Guided Sampling of Flat Modes in Discrete Spaces

Authors: Pinaki Mohanty, Riddhiman Bhattacharya, Ruqi Zhang
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2505.02296
Pdf URL: https://arxiv.org/pdf/2505.02296
Copy Paste: [[2505.02296]] Entropy-Guided Sampling of Flat Modes in Discrete Spaces(https://arxiv.org/abs/2505.02296)
Keywords: generative
Abstract: Sampling from flat modes in discrete spaces is a crucial yet underexplored problem. Flat modes represent robust solutions and have broad applications in combinatorial optimization and discrete generative modeling. However, existing sampling algorithms often overlook the mode volume and struggle to capture flat modes effectively. To address this limitation, we propose Entropic Discrete Langevin Proposal (EDLP), which incorporates local entropy into the sampling process through a continuous auxiliary variable under a joint distribution. The local entropy term guides the discrete sampler toward flat modes with a small overhead. We provide non-asymptotic convergence guarantees for EDLP in locally log-concave discrete distributions. Empirically, our method consistently outperforms traditional approaches across tasks that require sampling from flat basins, including Bernoulli distribution, restricted Boltzmann machines, combinatorial optimization, and binary neural networks.
摘要：来自离散空间中平坦模式的采样是一个至关重要但毫无疑问的问题。平面模式代表强大的解决方案，并在组合优化和离散生成建模中具有广泛的应用。但是，现有的采样算法通常会忽略模式量，并难以有效捕获平面模式。为了解决此限制，我们提出\ emph {entropic离散langevin提案}（EDLP），该}（EDLP）将局部熵通过关节分布下的连续辅助变量纳入采样过程。本地熵项将离散的采样器引导到带有小头顶的平坦模式。我们在局部对数符号离散分布中为EDLP提供了非反应收敛保证。从经验上讲，我们的方法始终优于需要从扁平盆地采样的任务，包括伯努利分布，受限的玻尔兹曼机器，组合优化和二进制神经网络。

Title: SuperEdit: Rectifying and Facilitating Supervision for Instruction-Based Image Editing

Authors: Ming Li, Xin Gu, Fan Chen, Xiaoying Xing, Longyin Wen, Chen Chen, Sijie Zhu
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.02370
Pdf URL: https://arxiv.org/pdf/2505.02370
Copy Paste: [[2505.02370]] SuperEdit: Rectifying and Facilitating Supervision for Instruction-Based Image Editing(https://arxiv.org/abs/2505.02370)
Keywords: generation
Abstract: Due to the challenges of manually collecting accurate editing data, existing datasets are typically constructed using various automated methods, leading to noisy supervision signals caused by the mismatch between editing instructions and original-edited image pairs. Recent efforts attempt to improve editing models through generating higher-quality edited images, pre-training on recognition tasks, or introducing vision-language models (VLMs) but fail to resolve this fundamental issue. In this paper, we offer a novel solution by constructing more effective editing instructions for given image pairs. This includes rectifying the editing instructions to better align with the original-edited image pairs and using contrastive editing instructions to further enhance their effectiveness. Specifically, we find that editing models exhibit specific generation attributes at different inference steps, independent of the text. Based on these prior attributes, we define a unified guide for VLMs to rectify editing instructions. However, there are some challenging editing scenarios that cannot be resolved solely with rectified instructions. To this end, we further construct contrastive supervision signals with positive and negative instructions and introduce them into the model training using triplet loss, thereby further facilitating supervision effectiveness. Our method does not require the VLM modules or pre-training tasks used in previous work, offering a more direct and efficient way to provide better supervision signals, and providing a novel, simple, and effective solution for instruction-based image editing. Results on multiple benchmarks demonstrate that our method significantly outperforms existing approaches. Compared with previous SOTA SmartEdit, we achieve 9.19% improvements on the Real-Edit benchmark with 30x less training data and 13x smaller model size.
摘要：由于手动收集准确的编辑数据的挑战，现有数据集通常是使用各种自动化方法构造的，从而导致由编辑说明和原始编辑的图像对之间的不匹配引起的嘈杂监督信号。最近的努力通过产生高质量的编辑图像，预先识别任务或引入视觉语言模型（VLM）来改善编辑模型，但无法解决这个基本问题。在本文中，我们通过为给定图像对构造更有效的编辑说明提供了一种新颖的解决方案。这包括纠正编辑说明，以更好地与原始编辑的图像对保持一致，并使用对比度编辑说明进一步提高其有效性。具体而言，我们发现编辑模型在不同的推理步骤中表现出特定的生成属性，而与文本无关。基于这些先前的属性，我们为VLMS定义了一个统一指南，以纠正编辑说明。但是，有一些具有挑战性的编辑方案，这些方案不能仅通过整流说明来解决。为此，我们进一步构建了具有正面和负面指示的对比度监督信号，并使用三重态损失将其引入模型培训，从而进一步促进了监督效率。我们的方法不需要先前工作中使用的VLM模块或预训练任务，提供了一种更直接，更有效的方法来提供更好的监督信号，并为基于教学的图像编辑提供了新颖，简单且有效的解决方案。多个基准测试的结果表明，我们的方法显着优于现有方法。与以前的Sota SmartEdit相比，我们在房地产基准测试基准方面取得了9.19％的改善，训练数据较小30倍，模型尺寸较小。
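Illustrative note (not from the paper): the abstract above mentions training with positive and negative instructions through a triplet loss. The sketch below is the generic margin-based triplet loss on plain vectors. How SuperEdit builds the anchor, positive, and negative terms is not stated in the abstract, so the vector inputs, the Euclidean distance, and the `margin` value are assumptions.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Generic triplet loss: push the anchor closer to the positive than
    to the negative by at least `margin`; zero once that is satisfied."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# The negative is far away and the positive coincides with the anchor: no penalty.
print(triplet_loss([0.0, 0.0], [0.0, 0.0], [3.0, 4.0]))
# Swapped roles: the positive is 5 units away, so the loss is 5 + margin.
print(triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 0.0]))
```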

Title: T2S: High-resolution Time Series Generation with Text-to-Series Diffusion Models

Authors: Yunfeng Ge, Jiawei Li, Yiji Zhao, Haomin Wen, Zhao Li, Meikang Qiu, Hongyan Li, Ming Jin, Shirui Pan
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.02417
Pdf URL: https://arxiv.org/pdf/2505.02417
Copy Paste: [[2505.02417]] T2S: High-resolution Time Series Generation with Text-to-Series Diffusion Models(https://arxiv.org/abs/2505.02417)
Keywords: generation
Abstract: Text-to-Time Series generation holds significant potential to address challenges such as data sparsity, imbalance, and limited availability of multimodal time series datasets across domains. While diffusion models have achieved remarkable success in Text-to-X (e.g., vision and audio data) generation, their use in time series generation remains in its nascent stages. Existing approaches face two critical limitations: (1) the lack of systematic exploration of general-proposed time series captions, which are often domain-specific and struggle with generalization; and (2) the inability to generate time series of arbitrary lengths, limiting their applicability to real-world scenarios. In this work, we first categorize time series captions into three levels: point-level, fragment-level, and instance-level. Additionally, we introduce a new fragment-level dataset containing over 600,000 high-resolution time series-text pairs. Second, we propose Text-to-Series (T2S), a diffusion-based framework that bridges the gap between natural language and time series in a domain-agnostic manner. T2S employs a length-adaptive variational autoencoder to encode time series of varying lengths into consistent latent embeddings. On top of that, T2S effectively aligns textual representations with latent embeddings by utilizing Flow Matching and employing Diffusion Transformer as the denoiser. We train T2S in an interleaved paradigm across multiple lengths, allowing it to generate sequences of any desired length. Extensive evaluations demonstrate that T2S achieves state-of-the-art performance across 13 datasets spanning 12 domains.
摘要：文本连续系列的生成具有巨大的潜力，可以应对诸如数据稀疏，失衡以及跨域多模式时间序列数据集的有限可用性等挑战。尽管扩散模型在文本到X（例如视觉和音频数据）的一代中取得了显着的成功，但它们在时间序列生成中的使用仍处于新生的阶段。现有方法面临两个关键局限性：（1）缺乏对通用时间序列字幕的系统探索，这些标题通常是特定于领域的且与概括斗争；（2）无法生成任意长度的时间序列，从而将其适用性限制在现实情况下。在这项工作中，我们首先将时间序列字幕分为三个级别：点级，碎片级别和实例级别。此外，我们引入了一个新的片段级数据集，其中包含超过600,000个高分辨率时间序列文本对。其次，我们提出了文本对系列（T2S），这是一个基于扩散的框架，以域 - 不可思议的方式弥合了自然语言和时间序列之间的差距。 T2S采用长度自适应变异自动编码器来编码长度的时间序列，以使一致的潜在嵌入。最重要的是，T2通过利用流量匹配并使用扩散变压器作为DeOISer来有效地使文本表示与潜在的嵌入。我们在多个长度上以交错的范式训练T2s，从而使其能够生成任何所需长度的序列。广泛的评估表明，T2S在跨越12个域的13个数据集中实现了最先进的性能。

Title: FairPO: Robust Preference Optimization for Fair Multi-Label Learning

Authors: Soumen Kumar Mondal, Akshit Varmora, Prateek Chanda, Ganesh Ramakrishnan
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.02433
Pdf URL: https://arxiv.org/pdf/2505.02433
Copy Paste: [[2505.02433]] FairPO: Robust Preference Optimization for Fair Multi-Label Learning(https://arxiv.org/abs/2505.02433)
Keywords: generation
Abstract: We propose FairPO, a novel framework designed to promote fairness in multi-label classification by directly optimizing preference signals with a group robustness perspective. In our framework, the set of labels is partitioned into privileged and non-privileged groups, and a preference-based loss inspired by Direct Preference Optimization (DPO) is employed to more effectively differentiate true positive labels from confusing negatives within the privileged group, while preserving baseline classification performance for non-privileged labels. By framing the learning problem as a robust optimization over groups, our approach dynamically adjusts the training emphasis toward groups with poorer performance, thereby mitigating bias and ensuring a fairer treatment across diverse label categories. In addition, we outline plans to extend this approach by investigating alternative loss formulations such as Simple Preference Optimisation (SimPO) and Contrastive Preference Optimization (CPO) to exploit reference-free reward formulations and contrastive training signals. Furthermore, we plan to extend FairPO with multilabel generation capabilities, enabling the model to dynamically generate diverse and coherent label sets for ambiguous inputs.
摘要：我们提出了Fairpo，这是一个新颖的框架，旨在通过直接优化群体鲁棒性的观点来促进多标签分类中的公平性。在我们的框架中，一组标签被分为特权和非特权群体，并采用了受直接偏好优化（DPO）启发的基于偏好的损失（DPO），以更有效地将真正的积极标签与特权组中的负面标签区分开来，而在非特权群体中保留基线分类绩效，以确保非私人标签的基线绩效。通过将学习问题构建为对小组的强大优化，我们的方法会动态调整训练的重点，以减轻偏见，从而确保各种标签类别的偏见和更公平的治疗方法。此外，我们概述了通过调查诸如简单偏好优化（SIMPO）和对比度优先优化（CPO）等替代性损失配方来扩展这种方法的计划，以利用无参考奖励表述和对比度培训信号。此外，我们计划使用多标签生成能力扩展Fairpo，使该模型能够动态生成模棱两可的输入的多样和相干标签集。
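Illustrative note (not from the paper): the abstract above describes a preference-based loss inspired by Direct Preference Optimization (DPO). The sketch below is the standard DPO loss for a single preferred/rejected pair, written over log-probabilities. It is not FairPO's group-robust objective, and the function name, argument names, and `beta` default are assumptions.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin)), where the
    margin compares policy-vs-reference log-ratios of the chosen and rejected items."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to stay numerically stable.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When the policy matches the reference, the loss is log(2).
print(dpo_loss(0.0, 0.0, 0.0, 0.0))
# Raising the chosen item's log-probability relative to the reference lowers the loss.
print(dpo_loss(-1.0, -3.0, -2.0, -2.0, beta=1.0))
```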

Title: Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction

Authors: Biao Gong, Cheng Zou, Dandan Zheng, Hu Yu, Jingdong Chen, Jianxin Sun, Junbo Zhao, Jun Zhou, Kaixiang Ji, Lixiang Ru, Libin Wang, Qingpei Guo, Rui Liu, Weilong Chai, Xinyu Xiao, Ziyuan Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.02471
Pdf URL: https://arxiv.org/pdf/2505.02471
Copy Paste: [[2505.02471]] Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction(https://arxiv.org/abs/2505.02471)
Keywords: generation
Abstract: We introduce Ming-Lite-Uni, an open-source multimodal framework featuring a newly designed unified visual generator and a native multimodal autoregressive model tailored for unifying vision and language. Specifically, this project provides an open-source implementation of the integrated MetaQueries and M2-omni framework, while introducing the novel multi-scale learnable tokens and multi-scale representation alignment strategy. By leveraging a fixed MLLM and a learnable diffusion model, Ming-Lite-Uni enables native multimodal AR models to perform both text-to-image generation and instruction-based image editing tasks, expanding their capabilities beyond pure visual understanding. Our experimental results demonstrate the strong performance of Ming-Lite-Uni and illustrate the impressive fluid nature of its interactive process. All code and model weights are open-sourced to foster further exploration within the community. Notably, this work aligns with concurrent multimodal AI milestones - such as ChatGPT-4o with native image generation updated on March 25, 2025 - underscoring the broader significance of unified models like Ming-Lite-Uni on the path toward AGI. Ming-Lite-Uni is in alpha stage and will soon be further refined.
摘要：我们介绍了Ming-Lite-Uni，这是一个开源的多模式框架，该框架具有新设计的统一视觉生成器和一款适合统一视觉和语言的本机多模式自动回归模型。具体而言，该项目提供了集成的元震源和M2-OMNI框架的开源实现，同时介绍了新颖的多尺度可学习令牌和多尺度表示策略。通过利用固定的MLLM和可学习的扩散模型，Ming-Lite-Uni使本机多模式AR模型可以同时执行基于文本图像生成和基于教学的图像编辑任务，从而超越了纯粹的视觉理解，扩大了其功能。我们的实验结果证明了明 - 莱特 - Uni的强劲表现，并说明了其互动过程的令人印象深刻的流体性质。所有代码和模型权重均供开源，以促进社区内的进一步探索。值得注意的是，这项工作与并发的多模式AI里程碑保持一致 - 例如2025年3月25日，与天然图像生成更新的Chatgpt-4O - 强调了统一模型（如Ming-Lite-Uni）在通往AGI的道路上的更广泛的意义。 Ming-Lite-Uni处于Alpha阶段，很快将进一步完善。

Title: Corr2Distrib: Making Ambiguous Correspondences an Ally to Predict Reliable 6D Pose Distributions

Authors: Asma Brazi, Boris Meden, Fabrice Mayran de Chamisso, Steve Bourgeois, Vincent Lepetit
Subjects: cs.CV, cs.AI, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2505.02501
Pdf URL: https://arxiv.org/pdf/2505.02501
Copy Paste: [[2505.02501]] Corr2Distrib: Making Ambiguous Correspondences an Ally to Predict Reliable 6D Pose Distributions(https://arxiv.org/abs/2505.02501)
Keywords: generation
Abstract: We introduce Corr2Distrib, the first correspondence-based method which estimates a 6D camera pose distribution from an RGB image, explaining the observations. Indeed, symmetries and occlusions introduce visual ambiguities, leading to multiple valid poses. While a few recent methods tackle this problem, they do not rely on local correspondences which, according to the BOP Challenge, are currently the most effective way to estimate a single 6DoF pose solution. Using correspondences to estimate a pose distribution is not straightforward, since ambiguous correspondences induced by visual ambiguities drastically decrease the performance of PnP. With Corr2Distrib, we turn these ambiguities into an advantage to recover all valid poses. Corr2Distrib first learns a symmetry-aware representation for each 3D point on the object's surface, characterized by a descriptor and a local frame. This representation enables the generation of 3DoF rotation hypotheses from single 2D-3D correspondences. Next, we refine these hypotheses into a 6DoF pose distribution using PnP and pose scoring. Our experimental evaluations on complex non-synthetic scenes show that Corr2Distrib outperforms state-of-the-art solutions for both pose distribution estimation and single pose estimation from an RGB image, demonstrating the potential of correspondences-based approaches.
摘要：我们介绍了Corr2Distrib，这是第一个基于对应的方法，该方法估算了从RGB图像中估算6D相机姿势分布，从而解释了观测值。实际上，对称性和遮挡引入了视觉歧义，导致多个有效的姿势。尽管最近的一些方法解决了这个问题，但它们并不依赖于BOP挑战的本地信件，目前是估计单个6DOF姿势解决方案的最有效方法。使用对估计姿势分布的对应关系并非直接，因为视觉歧义引起的模棱两可的对应关系大大降低了PNP的性能。使用Corr2Distrib，我们将这些歧义变成了回收所有有效姿势的优势。 Corr2Distrib首先学习对象表面上每个3D点的对称性表示表示，其特征是描述符和本地框架。该表示可以从单个2d-3d对应关系产生3DOF旋转假设。接下来，我们使用PNP和姿势评分将这些假设完善成6DOF姿势分布。我们对复杂非合成场景的实验评估表明，Corr2-Distrib对姿势分布估计的最先进解决方案和来自RGB图像的单姿势估计的最先进解决方案，这表明了基于对应的方法的潜力。

Title: Text to Image Generation and Editing: A Survey

Authors: Pengfei Yang, Ngai-Man Cheung, Xinda Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.02527
Pdf URL: https://arxiv.org/pdf/2505.02527
Copy Paste: [[2505.02527]] Text to Image Generation and Editing: A Survey(https://arxiv.org/abs/2505.02527)
Keywords: generation
Abstract: Text-to-image generation (T2I) refers to the text-guided generation of high-quality images. In the past few years, T2I has attracted widespread attention and numerous works have emerged. In this survey, we comprehensively review 141 works conducted from 2021 to 2024. First, we introduce four foundation model architectures of T2I (autoregression, non-autoregression, GAN and diffusion) and the commonly used key technologies (autoencoder, attention and classifier-free guidance). Second, we systematically compare the methods of these studies in two directions, T2I generation and T2I editing, including the encoders and the key technologies they use. We also compare the performance of these studies side by side in terms of datasets, evaluation metrics, training resources, and inference speed. In addition to the four foundation models, we survey other works on T2I, such as energy-based models and recent Mamba and multimodality. We also investigate the potential social impact of T2I and provide some solutions. Finally, we propose insights into improving the performance of T2I models and possible future development directions. In summary, this survey is the first systematic and comprehensive overview of T2I, aiming to provide a valuable guide for future researchers and stimulate continued progress in this field.
摘要：文本到图像生成（T2I）是指高质量图像的文本引导产生。在过去的几年中，T2I引起了广泛的关注，并且出现了许多作品。在这项调查中，我们全面审查了从2021年至2024年进行的141件作品。首先，我们介绍了T2I（自动估计，非自动进程，GAN和扩散）的四个基础模型体系结构以及常用的关键技术（自动编码器，注意力和无分类器和无分类器指导）。其次，我们会系统地将这些研究的方法分为两个方向，即T2i生成和T2I编辑，包括编码器和使用的关键技术。此外，我们还根据数据集，评估指标，培训资源和推理速度并排比较这些研究的表现。除了四个基础模型外，我们还调查了T2I的其他作品，例如基于能量的模型以及最近的MAMBA和多模式。我们还研究了T2I的潜在社会影响并提供一些解决方案。最后，我们提出了提高T2I模型和可能未来开发方向的性能的独特见解。总而言之，这项调查是T2I的第一个系统和全面的概述，旨在为未来的研究人员提供宝贵的指南，并刺激该领域的持续进展。

Title: Bielik v3 Small: Technical Report

Authors: Krzysztof Ociepa, Łukasz Flis, Remigiusz Kinas, Krzysztof Wróbel, Adrian Gwoździej
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2505.02550
Pdf URL: https://arxiv.org/pdf/2505.02550
Copy Paste: [[2505.02550]] Bielik v3 Small: Technical Report(https://arxiv.org/abs/2505.02550)
Keywords: generative
Abstract: We introduce Bielik v3, a series of parameter-efficient generative text models (1.5B and 4.5B) optimized for Polish language processing. These models demonstrate that smaller, well-optimized architectures can achieve performance comparable to much larger counterparts while requiring substantially fewer computational resources. Our approach incorporates several key innovations: a custom Polish tokenizer (APT4) that significantly improves token efficiency, Weighted Instruction Cross-Entropy Loss to balance learning across instruction types, and Adaptive Learning Rate that dynamically adjusts based on training progress. Trained on a meticulously curated corpus of 292 billion tokens spanning 303 million documents, these models excel across multiple benchmarks, including the Open PL LLM Leaderboard, Complex Polish Text Understanding Benchmark, Polish EQ-Bench, and Polish Medical Leaderboard. The 4.5B parameter model achieves results competitive with models 2-3 times its size, while the 1.5B model delivers strong performance despite its extremely compact profile. These advances establish new benchmarks for parameter-efficient language modeling in less-represented languages, making high-quality Polish language AI more accessible for resource-constrained applications.
摘要：我们介绍了Bielik V3，这是针对波兰语言处理优化的一系列参数有效的生成文本模型（1.5B和4.5B）。这些模型表明，较小，优化的体系结构可以实现与较大较大对应物相当的性能，同时需要更少的计算资源。我们的方法结合了几项关键创新：一种自定义的波兰令牌（APT4），可显着提高令牌效率，加权教学跨透明术损失，以跨教学类型的平衡学习以及根据培训进度动态调整的自适应学习率。这些型号经过经过精心策划的策划的语料库的培训，这些语料库跨越了3.03亿个文档，这些型号跨越了多个基准，包括开放式PL LLM排行榜，复杂的波兰文本理解基准测试，波兰EQ-Bench和波兰医疗排行榜。 4.5b参数模型与其大小的模型达到了2-3倍的竞争，而1.5B模型尽管其非常紧凑的轮廓，但表现出色。这些进步为使用较少代表性的语言建立了针对参数有效语言建模的新基准，从而使高质量的波兰语言AI更容易访问资源受限的应用程序。

Title: Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities

Authors: Xinjie Zhang, Jintao Guo, Shanshan Zhao, Minghao Fu, Lunhao Duan, Guo-Hua Wang, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.02567
Pdf URL: https://arxiv.org/pdf/2505.02567
Copy Paste: [[2505.02567]] Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities(https://arxiv.org/abs/2505.02567)
Keywords: generation
Abstract: Recent years have seen remarkable progress in both multimodal understanding models and image generation models. Despite their respective successes, these two domains have evolved independently, leading to distinct architectural paradigms: While autoregressive-based architectures have dominated multimodal understanding, diffusion-based models have become the cornerstone of image generation. Recently, there has been growing interest in developing unified frameworks that integrate these tasks. The emergence of GPT-4o's new capabilities exemplifies this trend, highlighting the potential for unification. However, the architectural differences between the two domains pose significant challenges. To provide a clear overview of current efforts toward unification, we present a comprehensive survey aimed at guiding future research. First, we introduce the foundational concepts and recent advancements in multimodal understanding and text-to-image generation models. Next, we review existing unified models, categorizing them into three main architectural paradigms: diffusion-based, autoregressive-based, and hybrid approaches that fuse autoregressive and diffusion mechanisms. For each category, we analyze the structural designs and innovations introduced by related works. Additionally, we compile datasets and benchmarks tailored for unified models, offering resources for future exploration. Finally, we discuss the key challenges facing this nascent field, including tokenization strategy, cross-modal attention, and data. As this area is still in its early stages, we anticipate rapid advancements and will regularly update this survey. Our goal is to inspire further research and provide a valuable reference for the community. The references associated with this survey will be available on GitHub soon.
摘要：近年来，在多模式理解模型和图像产生模型中都取得了显着的进步。尽管取得了各自的成功，但这两个领域还是独立发展的，导致了独特的建筑范式：尽管基于自动进程的架构占多模式的理解，但基于扩散的模型已成为图像生成的基石。最近，人们对开发整合这些任务的统一框架的兴趣越来越大。 GPT-4O的新功能的出现体现了这一趋势，突出了统一的潜力。但是，两个领域之间的建筑差异提出了重大挑战。为了清楚地概述当前统一的努力，我们提出了一项旨在指导未来研究的综合调查。首先，我们介绍了多模式理解和文本对图像生成模型的基础概念和最新进步。接下来，我们回顾现有的统一模型，将它们分为三个主要的建筑范式：基于扩散的，自回归的基于自动回归和混合方法，以融合自动回调和扩散机制。对于每个类别，我们分析相关工作引入的结构设计和创新。此外，我们编译了针对统一模型量身定制的数据集和基准，为将来的探索提供了资源。最后，我们讨论了这个新生领域面临的关键挑战，包括令牌化策略，跨模式的关注和数据。由于该领域仍处于早期阶段，我们预计会取得迅速的进步，并会定期更新此调查。我们的目标是激发进一步的研究，并为社区提供宝贵的参考。与此调查相关的参考文献将很快在GitHub上找到。

Title: MCCD: Multi-Agent Collaboration-based Compositional Diffusion for Complex Text-to-Image Generation

Authors: Mingcheng Li, Xiaolu Hou, Ziyang Liu, Dingkang Yang, Ziyun Qian, Jiawei Chen, Jinjie Wei, Yue Jiang, Qingyao Xu, Lihua Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.02648
Pdf URL: https://arxiv.org/pdf/2505.02648
Copy Paste: [[2505.02648]] MCCD: Multi-Agent Collaboration-based Compositional Diffusion for Complex Text-to-Image Generation(https://arxiv.org/abs/2505.02648)
Keywords: generation
Abstract: Diffusion models have shown excellent performance in text-to-image generation. Nevertheless, existing methods often suffer from performance bottlenecks when handling complex prompts that involve multiple objects, characteristics, and relations. Therefore, we propose a Multi-agent Collaboration-based Compositional Diffusion (MCCD) for text-to-image generation for complex scenes. Specifically, we design a multi-agent collaboration-based scene parsing module that generates an agent system comprising multiple agents with distinct tasks, utilizing MLLMs to extract various scene elements effectively. In addition, Hierarchical Compositional diffusion utilizes a Gaussian mask and filtering to refine bounding box regions and enhance objects through region enhancement, resulting in the accurate and high-fidelity generation of complex scenes. Comprehensive experiments demonstrate that our MCCD significantly improves the performance of the baseline models in a training-free manner, providing a substantial advantage in complex scene generation.
摘要：扩散模型在文本到图像生成中表现出色。然而，当处理复合物提示涉及多个对象，特征和关系时，现有方法通常会遭受性能瓶颈。因此，我们为复杂场景提出了一个基于多代理协作的组成扩散（MCCD），以实现文本对图像生成。具体而言，我们设计了一个基于多代理协作的场景解析模块，该模块生成一个代理系统，该模块包含多个具有不同任务的代理，利用MLLM有效地提取了各种场景元素。此外，分层组成扩散利用高斯面膜和过滤来精炼边界区域，并通过区域增强来增强对象，从而导致复杂场景的准确和高保真生成。全面的实验表明，我们的MCCD可以以无训练的方式显着提高基线模型的性能，从而在复杂的场景生产中具有很大的优势。

Title: Sim2Real in endoscopy segmentation with a novel structure aware image translation

Authors: Clara Tomasini, Luis Riazuelo, Ana C. Murillo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.02654
Pdf URL: https://arxiv.org/pdf/2505.02654
Copy Paste: [[2505.02654]] Sim2Real in endoscopy segmentation with a novel structure aware image translation(https://arxiv.org/abs/2505.02654)
Keywords: generative
Abstract: Automatic segmentation of anatomical landmarks in endoscopic images can assist doctors and surgeons in diagnosis, treatment, or medical training. However, obtaining the annotations required to train commonly used supervised learning methods is a tedious and difficult task, in particular for real images. While ground truth annotations are easier to obtain for synthetic data, models trained on such data often do not generalize well to real data. Generative approaches can add realistic texture to synthetic data, but struggle to maintain the structure of the original scene. The main contribution of this work is a novel image translation model that adds realistic texture to simulated endoscopic images while keeping the key scene layout information. Our approach produces realistic images in different endoscopy scenarios. We demonstrate that these images can be used to successfully train a model for a challenging end task without any real labeled data. In particular, we demonstrate our approach for the task of fold segmentation in colonoscopy images. Folds are key anatomical landmarks that can occlude parts of the colon mucosa and possible polyps. After image-style translation, our approach generates realistic images that maintain the shape and location of the original folds better than existing methods. We run experiments both on a novel simulated dataset for fold segmentation and on real data from the EndoMapper (EM) dataset. All our newly generated data and new EM metadata are being released to facilitate further research, as no public benchmark is currently available for the task of fold segmentation.
摘要：内窥镜图像中解剖学地标的自动分割可以为医生和外科医生提供诊断，治疗或医学培训的帮助。但是，获得训练常用的监督学习方法所需的注释是一项繁琐而艰巨的任务，尤其是对于真实的图像。尽管综合数据的地面真相注释更容易获得，但是对此类数据训练的模型通常不能很好地推广到真实数据。生成方法可以为其增加逼真的纹理，但是面临着保持原始场景结构的困难。这项工作的主要贡献是一种新颖的图像翻译模型，该模型在模拟内窥镜图像中添加了逼真的纹理，同时保留关键场景布局信息。我们的方法在不同的内窥镜方案中产生逼真的图像。我们证明这些图像可以有效地用于成功训练模型，以实现具有挑战性的最终任务，而无需任何实际标记的数据。特别是，我们证明了在结肠镜检查图像中折叠分割任务的方法。褶皱是可以阻塞结肠粘膜和可能的息肉的关键解剖标志。我们的方法生成了逼真的图像，在图像式翻译之后，保持原始折叠的形状和位置比现有方法更好。我们在一个新的模拟数据集上运行实验，以进行折叠分割，以及来自EndoMapper（EM）数据集的实际数据。我们所有新生成的数据和新的EM元数据都将被发布以促进进一步的研究，因为目前尚无公共基准测试折叠细分的任务。

Title: A Note on Statistically Accurate Tabular Data Generation Using Large Language Models

Authors: Andrey Sidorenko
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.02659
Pdf URL: https://arxiv.org/pdf/2505.02659
Copy Paste: [[2505.02659]] A Note on Statistically Accurate Tabular Data Generation Using Large Language Models(https://arxiv.org/abs/2505.02659)
Keywords: generation
Abstract: Large language models (LLMs) have shown promise in synthetic tabular data generation, yet existing methods struggle to preserve complex feature dependencies, particularly among categorical variables. This work introduces a probability-driven prompting approach that leverages LLMs to estimate conditional distributions, enabling more accurate and scalable data synthesis. The results highlight the potential of prompting probability distributions to enhance the statistical fidelity of LLM-generated tabular data.
摘要：大型语言模型（LLMS）在合成表格数据生成中表现出了希望，但是现有的方法难以保留复杂的特征依赖性，尤其是在分类变量中。这项工作引入了一种概率驱动的提示方法，该方法利用LLMS来估计条件分布，从而实现了更准确，可扩展的数据综合。结果突出了提示可能性分布以增强LLM生成的表格数据的统计保真度的潜力。

Title: Cooperative Bayesian and variance networks disentangle aleatoric and epistemic uncertainties

Authors: Jiaxiang Yi, Miguel A. Bessa
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2505.02743
Pdf URL: https://arxiv.org/pdf/2505.02743
Copy Paste: [[2505.02743]] Cooperative Bayesian and variance networks disentangle aleatoric and epistemic uncertainties(https://arxiv.org/abs/2505.02743)
Keywords: generation
Abstract: Real-world data contains aleatoric uncertainty - irreducible noise arising from imperfect measurements or from incomplete knowledge about the data generation process. Mean variance estimation (MVE) networks can learn this type of uncertainty but require ad-hoc regularization strategies to avoid overfitting and are unable to predict epistemic uncertainty (model uncertainty). Conversely, Bayesian neural networks predict epistemic uncertainty but are notoriously difficult to train due to the approximate nature of Bayesian inference. We propose to cooperatively train a variance network with a Bayesian neural network and demonstrate that the resulting model disentangles aleatoric and epistemic uncertainties while improving the mean estimation. We demonstrate the effectiveness and scalability of this method across a diverse range of datasets, including a time-dependent heteroscedastic regression dataset we created where the aleatoric uncertainty is known. The proposed method is straightforward to implement, robust, and adaptable to various model architectures.
摘要：现实世界中的数据包含不确定性 - 不完美的测量或对数据生成过程不完整的知识引起的不可还原噪声。平均方差估计（MVE）网络可以学习这种类型的不确定性，但需要临时正规化策略以避免过度拟合并且无法预测认知不确定性（模型不确定性）。相反，贝叶斯神经网络预测了认知不确定性，但由于贝叶斯推断的近似性质，众所周知，很难训练。我们建议与贝叶斯神经网络合作训练一个方差网络，并证明所得模型在改善平均估计的同时，消除了质地和认知不确定性。我们证明了该方法在各种数据集中的有效性和可伸缩性，包括我们创建的时间依赖性的异源性回归数据集，在已知的不确定性。所提出的方法直接实现，健壮和适应各种模型体系结构。
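
Note: the distinction between the two uncertainty types in the abstract above can be illustrated with the standard law-of-total-variance decomposition. The sketch below is not the authors' cooperative training method. It only assumes that, for a single input, we already have several (mean, variance) predictions, for example from posterior samples of a Bayesian network paired with a variance network. The function name `decompose_uncertainty` and the sample numbers are hypothetical.

```python
from statistics import fmean, pvariance


def decompose_uncertainty(means, variances):
    """Split predictive variance into aleatoric and epistemic parts.

    means[i] and variances[i] are the mean and noise variance predicted
    for ONE input by the i-th posterior sample (hypothetical inputs).
    """
    if len(means) != len(variances) or not means:
        raise ValueError("need equally sized, non-empty lists")
    aleatoric = fmean(variances)   # E[sigma^2]: average predicted data noise
    epistemic = pvariance(means)   # Var[mu]: disagreement between samples
    return aleatoric, epistemic, aleatoric + epistemic


# Three hypothetical posterior samples evaluated at the same input.
aleatoric, epistemic, total = decompose_uncertainty(
    means=[1.0, 2.0, 3.0],
    variances=[0.5, 0.5, 0.5],
)
print(f"aleatoric={aleatoric:.3f} epistemic={epistemic:.3f} total={total:.3f}")
```

Under this decomposition, collecting more training data would be expected to shrink the epistemic term, while the aleatoric term reflects noise in the data itself.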

Title: Advances in Automated Fetal Brain MRI Segmentation and Biometry: Insights from the FeTA 2024 Challenge

Authors: Vladyslav Zalevskyi, Thomas Sanchez, Misha Kaandorp, Margaux Roulet, Diego Fajardo-Rojas, Liu Li, Jana Hutter, Hongwei Bran Li, Matthew Barkovich, Hui Ji, Luca Wilhelmi, Aline Dändliker, Céline Steger, Mériam Koob, Yvan Gomez, Anton Jakovčić, Melita Klaić, Ana Adžić, Pavel Marković, Gracia Grabarić, Milan Rados, Jordina Aviles Verdera, Gregor Kasprian, Gregor Dovjak, Raphael Gaubert-Rachmühl, Maurice Aschwanden, Qi Zeng, Davood Karimi, Denis Peruzzo, Tommaso Ciceri, Giorgio Longari, Rachika E. Hamadache, Amina Bouzid, Xavier Lladó, Simone Chiarella, Gerard Martí-Juan, Miguel Ángel González Ballester, Marco Castellaro, Marco Pinamonti, Valentina Visani, Robin Cremese, Keïn Sam, Fleur Gaudfernau, Param Ahir, Mehul Parikh, Maximilian Zenk, Michael Baumgartner, Klaus Maier-Hein, Li Tianhong, Yang Hong, Zhao Longfei, Domen Preloznik, Žiga Špiclin, Jae Won Choi, Muyang Li, Jia Fu, Guotai Wang, Jingwen Jiang, Lyuyang Tong, Bo Du, Andrea Gondova, Sungmin You, Kiho Im, Abdul Qayyum, Moona Mazher, Steven A Niederer, Maya Yanko, Bella Specktor-Fadida, Dafna Ben Bashat, Andras Jakab, Roxane Licandro, Kelly Payette, Meritxell Bach Cuadra
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.02784
Pdf URL: https://arxiv.org/pdf/2505.02784
Copy Paste: [[2505.02784]] Advances in Automated Fetal Brain MRI Segmentation and Biometry: Insights from the FeTA 2024 Challenge(https://arxiv.org/abs/2505.02784)
Keywords: super-resolution
Abstract: Accurate fetal brain tissue segmentation and biometric analysis are essential for studying brain development in utero. The FeTA Challenge 2024 advanced automated fetal brain MRI analysis by introducing biometry prediction as a new task alongside tissue segmentation. For the first time, our diverse multi-centric test set included data from a new low-field (0.55T) MRI dataset. Evaluation metrics were also expanded to include the topology-specific Euler characteristic difference (ED). Sixteen teams submitted segmentation methods, most of which performed consistently across both high- and low-field scans. However, longitudinal trends indicate that segmentation accuracy may be reaching a plateau, with results now approaching inter-rater variability. The ED metric uncovered topological differences that were missed by conventional metrics, while the low-field dataset achieved the highest segmentation scores, highlighting the potential of affordable imaging systems when paired with high-quality reconstruction. Seven teams participated in the biometry task, but most methods failed to outperform a simple baseline that predicted measurements based solely on gestational age, underscoring the challenge of extracting reliable biometric estimates from image data alone. Domain shift analysis identified image quality as the most significant factor affecting model generalization, with super-resolution pipelines also playing a substantial role. Other factors, such as gestational age, pathology, and acquisition site, had smaller, though still measurable, effects. Overall, FeTA 2024 offers a comprehensive benchmark for multi-class segmentation and biometry estimation in fetal brain MRI, underscoring the need for data-centric approaches, improved topological evaluation, and greater dataset diversity to enable clinically robust and generalizable AI tools.
摘要：准确的胎儿脑组织分割和生物特征分析对于研究子宫内的脑发育至关重要。 FETA挑战2024年晚期自动化胎儿脑MRI分析，通过将生物特征预测作为组织分割的新任务引入新任务。我们第一次，我们多样化的多中性测试集包括来自新的低场（0.55t）MRI数据集的数据。评估指标还扩展到包括特定于拓扑的欧拉特征差异（ED）。 16个团队提交了细分方法，其中大多数在高场和低场扫描中始终如一地执行。但是，纵向趋势表明分割精度可能达到平稳性，结果现在接近评估者间的变异性。 ED指标发现了传统指标所遗漏的拓扑差异，而低场数据集则达到了最高的细分分数，并在与高质量的重建配对时突出了负担得起的成像系统的潜力。七个团队参加了生物测定任务，但是大多数方法都无法胜过简单的基线，该基线仅根据胎龄预测测量值，从而强调了仅从图像数据中提取可靠的生物识别估计值的挑战。域移位分析将图像质量确定为影响模型概括的最重要因素，超分辨率管道也起着重要作用。其他因素（例如妊娠年龄，病理学和获取部位）具有较小的，尽管仍然可以测量。总体而言，FETA 2024为胎儿脑MRI的多级分割和生物特征估计提供了全面的基准，强调了以数据为中心的方法，改进的拓扑评估以及更大的数据多样性，以实现临床上强大且可延长的AI工具。

Title: Towards Dataset Copyright Evasion Attack against Personalized Text-to-Image Diffusion Models

Authors: Kuofeng Gao, Yufei Zhu, Yiming Li, Jiawang Bai, Yong Yang, Zhifeng Li, Shu-Tao Xia
Subjects: cs.CV, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2505.02824
Pdf URL: https://arxiv.org/pdf/2505.02824
Copy Paste: [[2505.02824]] Towards Dataset Copyright Evasion Attack against Personalized Text-to-Image Diffusion Models(https://arxiv.org/abs/2505.02824)
Keywords: generation
Abstract: Text-to-image (T2I) diffusion models have rapidly advanced, enabling high-quality image generation conditioned on textual prompts. However, the growing trend of fine-tuning pre-trained models for personalization raises serious concerns about unauthorized dataset usage. To combat this, dataset ownership verification (DOV) has emerged as a solution, embedding watermarks into the fine-tuning datasets using backdoor techniques. These watermarks remain inactive under benign samples but produce owner-specified outputs when triggered. Despite the promise of DOV for T2I diffusion models, its robustness against copyright evasion attacks (CEA) remains unexplored. In this paper, we explore how attackers can bypass these mechanisms through CEA, allowing models to circumvent watermarks even when trained on watermarked datasets. We propose the first copyright evasion attack (i.e., CEAT2I) specifically designed to undermine DOV in T2I diffusion models. Concretely, our CEAT2I comprises three stages: watermarked sample detection, trigger identification, and efficient watermark mitigation. A key insight driving our approach is that T2I models exhibit faster convergence on watermarked samples during the fine-tuning, evident through intermediate feature deviation. Leveraging this, CEAT2I can reliably detect the watermarked samples. Then, we iteratively ablate tokens from the prompts of detected watermarked samples and monitor shifts in intermediate features to pinpoint the exact trigger tokens. Finally, we adopt a closed-form concept erasure method to remove the injected watermark. Extensive experiments show that our CEAT2I effectively evades DOV mechanisms while preserving model performance.
摘要：文本对图像（T2I）扩散模型具有迅速高级的，可以在文本提示下进行高质量的图像生成。但是，对个性化的细化预培训模型的日益增长的趋势引起了人们对未经授权的数据集使用的严重关注。为了解决这个问题，数据集所有权验证（DOV）已成为解决方案，使用后门技术将水印嵌入微调数据集中。这些水印在良性样品下仍然不活跃，但在触发时产生所有者指定的输出。尽管DOV对T2I扩散模型有望，但其对版权逃避攻击（CEA）的鲁棒性仍然没有探索。在本文中，我们探讨了攻击者如何通过CEA绕过这些机制，即使在水印数据集中训练时，模型也可以绕过水印。我们提出了针对破坏T2i扩散模型中破坏DOV的第一次版权逃避攻击（即Ceat2i）。具体而言，我们的Ceat2i包括三个阶段：水印样品检测，触发鉴定和有效的水印。推动我们方法的一个关键见解是，T2I模型在微调过程中在水印样品上表现出更快的收敛性，并且通过中间特征偏差可见。利用这一点，Ceat2i可以可靠地检测到水印的样品。然后，我们从检测到的水印样品的提示中迭代烧毁令牌，并在中间特征中监视移动，以查明精确的触发令牌。最后，我们采用封闭形式的概念擦除方法去除注入的水印。广泛的实验表明，我们的Ceat2i在保留模型性能的同时有效地逃避了DOV机制。

Title: AOR: Anatomical Ontology-Guided Reasoning for Medical Large Multimodal Model in Chest X-Ray Interpretation

Authors: Qingqiu Li, Zihang Cui, Seongsu Bae, Jilan Xu, Runtian Yuan, Yuejie Zhang, Rui Feng, Quanli Shen, Xiaobo Zhang, Junjun He, Shujun Wang
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2505.02830
Pdf URL: https://arxiv.org/pdf/2505.02830
Copy Paste: [[2505.02830]] AOR: Anatomical Ontology-Guided Reasoning for Medical Large Multimodal Model in Chest X-Ray Interpretation(https://arxiv.org/abs/2505.02830)
Keywords: generation
Abstract: Chest X-rays (CXRs) are the most frequently performed imaging examinations in clinical settings. Recent advancements in Large Multimodal Models (LMMs) have enabled automated CXR interpretation, enhancing diagnostic accuracy and efficiency. However, despite their strong visual understanding, current Medical LMMs (MLMMs) still face two major challenges: (1) insufficient region-level understanding and interaction, and (2) limited accuracy and interpretability due to single-step reasoning. In this paper, we empower MLMMs with anatomy-centric reasoning capabilities to enhance their interactivity and explainability. Specifically, we first propose an Anatomical Ontology-Guided Reasoning (AOR) framework, which centers on cross-modal region-level information to facilitate multi-step reasoning. Next, under the guidance of expert physicians, we develop AOR-Instruction, a large instruction dataset for MLMM training. Our experiments demonstrate AOR's superior performance in both VQA and report generation tasks.
摘要：胸部X射线（CXR）是临床环境中最常见的成像检查。大型多模型模型（LMM）的最新进展已实现自动CXR解释，从而提高了诊断准确性和效率。然而，尽管视觉上有深刻的了解，但当前的医学LMM（MLMM）仍然面临两个主要挑战：（1）区域级别的理解和相互作用不足，以及（2）由于单步推理而导致的准确性和可解释性有限。在本文中，我们具有以解剖学为中心的推理能力来增强其交互性和解释性的能力。具体而言，我们首先提出了一个解剖本体引导的推理（AOR）框架，该框架集中在跨模式区域级信息上，以促进多步推理。接下来，在专家医师的指导下，我们开发了AOR Instruction，这是用于MLMMS培训的大型指导数据集。我们的实验证明了AOR在VQA和报告生成任务中的出色表现。

Title: No Other Representation Component Is Needed: Diffusion Transformers Can Provide Representation Guidance by Themselves

Authors: Dengyang Jiang, Mengmeng Wang, Liuzhuozheng Li, Lei Zhang, Haoyu Wang, Wei Wei, Guang Dai, Yanning Zhang, Jingdong Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.02831
Pdf URL: https://arxiv.org/pdf/2505.02831
Copy Paste: [[2505.02831]] No Other Representation Component Is Needed: Diffusion Transformers Can Provide Representation Guidance by Themselves(https://arxiv.org/abs/2505.02831)
Keywords: generation, generative
Abstract: Recent studies have demonstrated that learning a meaningful internal representation can both accelerate generative training and enhance the generation quality of diffusion transformers. However, existing approaches must either introduce an additional, complex representation training framework or rely on a large-scale, pre-trained representation foundation model to provide representation guidance during the original generative training process. In this study, we posit that the unique discriminative process inherent to diffusion transformers enables them to offer such guidance without requiring external representation components. We therefore propose Self-Representation Alignment (SRA), a simple and straightforward method that obtains representation guidance through self-distillation. Specifically, SRA aligns the output latent representation of the diffusion transformer in an earlier layer with higher noise to that in a later layer with lower noise, progressively enhancing the overall representation learning using only the generative training process. Experimental results indicate that applying SRA to DiTs and SiTs yields consistent performance improvements. Moreover, SRA not only significantly outperforms approaches relying on auxiliary, complex representation training frameworks but also achieves performance comparable to methods that depend heavily on powerful external representation priors.
摘要：最近的研究表明，学习有意义的内部表示既可以加速生成训练，又可以提高扩散变压器的发电质量。但是，现有方法需要引入其他复杂的表示培训框架，或者依靠大规模的，预先训练的代表基础模型来在原始生成培训过程中提供代表指导。在这项研究中，我们认为扩散变压器固有的独特判别过程使他们能够提供此类指导而无需外部表示组件。因此，我们提出了自我代表A} strignment（SRA），这是一种简单而直接的方法，可以通过自我验证方式获得表示指导。具体而言，SRA将扩散变压器在早期层中的输出潜在表示与后面的噪声更高的噪声与较低的噪声对齐，以逐步增强仅生成训练过程中的整体表示。实验结果表明，将SRA应用于DIT和坐着可以取得一致的性能提高。此外，SRA不仅显着优于依靠辅助，复杂的表示培训框架的方法，而且还可以实现与在很大程度上取决于强大的外部表示先验的方法相媲美的性能。

Title: Scenethesis: A Language and Vision Agentic Framework for 3D Scene Generation

Authors: Lu Ling, Chen-Hsuan Lin, Tsung-Yi Lin, Yifan Ding, Yu Zeng, Yichen Sheng, Yunhao Ge, Ming-Yu Liu, Aniket Bera, Zhaoshuo Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.02836
Pdf URL: https://arxiv.org/pdf/2505.02836
Copy Paste: [[2505.02836]] Scenethesis: A Language and Vision Agentic Framework for 3D Scene Generation(https://arxiv.org/abs/2505.02836)
Keywords: generation
Abstract: Synthesizing interactive 3D scenes from text is essential for gaming, virtual reality, and embodied AI. However, existing methods face several challenges. Learning-based approaches depend on small-scale indoor datasets, limiting the scene diversity and layout complexity. While large language models (LLMs) can leverage diverse text-domain knowledge, they struggle with spatial realism, often producing unnatural object placements that fail to respect common sense. Our key insight is that vision perception can bridge this gap by providing realistic spatial guidance that LLMs lack. To this end, we introduce Scenethesis, a training-free agentic framework that integrates LLM-based scene planning with vision-guided layout refinement. Given a text prompt, Scenethesis first employs an LLM to draft a coarse layout. A vision module then refines it by generating an image guidance and extracting scene structure to capture inter-object relations. Next, an optimization module iteratively enforces accurate pose alignment and physical plausibility, preventing artifacts like object penetration and instability. Finally, a judge module verifies spatial coherence. Comprehensive experiments show that Scenethesis generates diverse, realistic, and physically plausible 3D interactive scenes, making it valuable for virtual content creation, simulation environments, and embodied AI research.
摘要：从文本中综合互动3D场景对于游戏，虚拟现实和体现的AI至关重要。但是，现有方法面临几个挑战。基于学习的方法取决于小规模的室内数据集，从而限制了场景多样性和布局复杂性。尽管大型语言模型（LLM）可以利用多种文本域知识，但它们在空间现实主义中挣扎，通常会产生不尊重常识的不自然对象的位置。我们的主要见解是，视觉感知可以通过提供LLMS缺乏的现实空间指导来弥合这一差距。为此，我们介绍了SpaceTheses，这是一个无训练的代理框架，将基于LLM的场景计划与视觉指导的布局改进整合在一起。给定文本提示，场景首先采用LLM来草拟粗糙的布局。然后，视觉模块通过生成图像指导并提取场景结构以捕获对象间关系来完善它。接下来，优化模块迭代地执行准确的姿势比对和物理上的合理性，以防止物体穿透和不稳定性等工件。最后，法官模块验证了空间连贯性。综合实验表明，场景产生了多样化，现实和物理上合理的3D交互式场景，使其对于虚拟内容创建，仿真环境和体现的AI研究都很有价值。