2025-05-29

Title: SIMCOPILOT: Evaluating Large Language Models for Copilot-Style Code Generation

Authors: Mingchao Jiang, Abhinav Jain, Sophia Zorek, Chris Jermaine
Subjects: cs.LG, cs.PL, cs.SE
Abstract URL: https://arxiv.org/abs/2505.21514
Pdf URL: https://arxiv.org/pdf/2505.21514
Copy Paste: [[2505.21514]] SIMCOPILOT: Evaluating Large Language Models for Copilot-Style Code Generation(https://arxiv.org/abs/2505.21514)
Keywords: generation
Abstract: We introduce SIMCOPILOT, a benchmark that simulates the role of large language models (LLMs) as interactive, "copilot"-style coding assistants. Targeting both completion (finishing incomplete methods or code blocks) and infill tasks (filling missing segments within existing code), SIMCOPILOT provides a comprehensive framework for evaluating LLM coding capabilities. The benchmark comprises dedicated sub-benchmarks for Java (SIMCOPILOTJ) and Python (SIMCOPILOTP), covering diverse codebases varying in size and complexity. Our key contributions include: (a) establishing a realistic, detailed evaluation environment to assess LLM utility in practical coding scenarios, and (b) providing fine-grained analyses that address critical factors frequently overlooked by existing benchmarks, such as task-specific performance nuances, contextual understanding across code segments, and sensitivity to variable scope. Evaluations conducted across domains, including algorithms, databases, computer vision, and neural networks, offer insights into model strengths and highlight persistent challenges in maintaining logical consistency within complex dependency structures. Beyond benchmarking, our study sheds light on the current limitations of LLM-driven code generation and underscores the ongoing transition of LLMs from merely syntax-aware generators toward reliable, intelligent software development partners.
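The distinction between completion and infill tasks described in the abstract can be sketched as follows. This is a minimal illustration, not the benchmark's actual task builder: the `Task` class, the helper names, and the toy `source` snippet are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Task:
    kind: str    # "completion" or "infill"
    prefix: str  # code shown to the model before the gap
    suffix: str  # code shown after the gap (empty for completion)
    target: str  # held-out code the model is asked to produce


def make_infill(lines: list[str], start: int, end: int) -> Task:
    """Hold out lines[start:end]; the model sees code on both sides of the gap."""
    return Task(
        "infill",
        "\n".join(lines[:start]),
        "\n".join(lines[end:]),
        "\n".join(lines[start:end]),
    )


def make_completion(lines: list[str], start: int) -> Task:
    """Hold out everything from `start` onward; the model sees only the prefix."""
    return Task("completion", "\n".join(lines[:start]), "", "\n".join(lines[start:]))


# Hypothetical toy source file, used only to show the two task shapes.
source = [
    "def mean(xs):",
    "    total = 0",
    "    for x in xs:",
    "        total += x",
    "    return total / len(xs)",
]
infill = make_infill(source, 2, 4)
completion = make_completion(source, 2)
```

In the infill case the model must respect the code that follows the gap (here, the `return` statement), whereas in the completion case it only needs to be consistent with the prefix.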

Title: Do DeepFake Attribution Models Generalize?

Authors: Spiros Baxavanakis, Manos Schinas, Symeon Papadopoulos
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.21520
Pdf URL: https://arxiv.org/pdf/2505.21520
Copy Paste: [[2505.21520]] Do DeepFake Attribution Models Generalize?(https://arxiv.org/abs/2505.21520)
Keywords: generation
Abstract: Recent advancements in DeepFake generation, along with the proliferation of open-source tools, have significantly lowered the barrier for creating synthetic media. This trend poses a serious threat to the integrity and authenticity of online information, undermining public trust in institutions and media. State-of-the-art research on DeepFake detection has primarily focused on binary detection models. A key limitation of these models is that they treat all manipulation techniques as equivalent, despite the fact that different methods introduce distinct artifacts and visual cues. Only a limited number of studies explore DeepFake attribution models, although such models are crucial in practical settings. By providing the specific manipulation method employed, these models could enhance both the perceived trustworthiness and explainability for end users. In this work, we leverage five state-of-the-art backbone models and conduct extensive experiments across six DeepFake datasets. First, we compare binary and multi-class models in terms of cross-dataset generalization. Second, we examine the accuracy of attribution models in detecting seen manipulation methods in unknown datasets, hence uncovering data distribution shifts on the same DeepFake manipulations. Last, we assess the effectiveness of contrastive methods in improving cross-dataset generalization performance. Our findings indicate that while binary models demonstrate better generalization abilities, larger models, contrastive methods, and higher data quality can lead to performance improvements in attribution models. The code of this work is available on GitHub.

Title: Learning Shared Representations from Unpaired Data

Authors: Amitai Yacobi, Nir Ben-Ari, Ronen Talmon, Uri Shaham
Subjects: cs.CV, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2505.21524
Pdf URL: https://arxiv.org/pdf/2505.21524
Copy Paste: [[2505.21524]] Learning Shared Representations from Unpaired Data(https://arxiv.org/abs/2505.21524)
Keywords: generation
Abstract: Learning shared representations is a primary area of multimodal representation learning. The current approaches to achieve a shared embedding space rely heavily on paired samples from each modality, which are significantly harder to obtain than unpaired ones. In this work, we demonstrate that shared representations can be learned almost exclusively from unpaired data. Our arguments are grounded in the spectral embeddings of the random walk matrices constructed independently from each unimodal representation. Empirical results in computer vision and natural language processing domains support its potential, revealing the effectiveness of unpaired data in capturing meaningful cross-modal relations, demonstrating high capabilities in retrieval tasks, generation, arithmetics, zero-shot, and cross-domain classification. This work, to the best of our knowledge, is the first to demonstrate these capabilities almost exclusively from unpaired samples, giving rise to a cross-modal embedding that could be viewed as universal, i.e., independent of the specific modalities of the data. Our code is publicly available at this https URL.
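The abstract refers to random walk matrices built independently for each modality. As a rough sketch of that one ingredient only, the snippet below row-normalizes a nonnegative affinity matrix W into P = D^{-1} W. The affinity values are made up for illustration, and how the paper builds its affinities and aligns the resulting spectral embeddings is not shown here.

```python
def random_walk_matrix(affinity):
    """Row-normalize a nonnegative affinity matrix W into P = D^{-1} W."""
    rows = []
    for row in affinity:
        degree = sum(row)
        if degree == 0:
            raise ValueError("isolated node: affinity row sums to zero")
        rows.append([w / degree for w in row])
    return rows


# Hypothetical 3-sample affinity matrix for a single modality.
W = [
    [0.0, 1.0, 0.5],
    [1.0, 0.0, 0.2],
    [0.5, 0.2, 0.0],
]
P = random_walk_matrix(W)
```

Each row of `P` is a probability distribution over neighbors; a spectral embedding would then be read off the leading eigenvectors of `P`.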

Title: Temporal Restoration and Spatial Rewiring for Source-Free Multivariate Time Series Domain Adaptation

Authors: Peiliang Gong, Yucheng Wang, Min Wu, Zhenghua Chen, Xiaoli Li, Daoqiang Zhang
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2505.21525
Pdf URL: https://arxiv.org/pdf/2505.21525
Copy Paste: [[2505.21525]] Temporal Restoration and Spatial Rewiring for Source-Free Multivariate Time Series Domain Adaptation(https://arxiv.org/abs/2505.21525)
Keywords: restoration
Abstract: Source-Free Domain Adaptation (SFDA) aims to adapt a pre-trained model from an annotated source domain to an unlabelled target domain without accessing the source data, thereby preserving data privacy. While existing SFDA methods have proven effective in reducing reliance on source data, they struggle to perform well on multivariate time series (MTS) due to their failure to consider the intrinsic spatial correlations inherent in MTS data. These spatial correlations are crucial for accurately representing MTS data and preserving invariant information across domains. To address this challenge, we propose Temporal Restoration and Spatial Rewiring (TERSE), a novel and concise SFDA method tailored for MTS data. Specifically, TERSE comprises a customized spatial-temporal feature encoder designed to capture the underlying spatial-temporal characteristics, coupled with both temporal restoration and spatial rewiring tasks to reinstate latent representations of the temporally masked time series and the spatially masked correlated structures. During the target adaptation phase, the target encoder is guided to produce spatially and temporally consistent features with the source domain by leveraging the source pre-trained temporal restoration and spatial rewiring networks. Therefore, TERSE can effectively model and transfer spatial-temporal dependencies across domains, facilitating implicit feature alignment. In addition, as the first approach to simultaneously consider spatial-temporal consistency in MTS-SFDA, TERSE can also be integrated as a versatile plug-and-play module into established SFDA methods. Extensive experiments on three real-world time series datasets demonstrate the effectiveness and versatility of our approach.

Title: UniDB++: Fast Sampling of Unified Diffusion Bridge

Authors: Mokai Pan, Kaizhen Zhu, Yuexin Ma, Yanwei Fu, Jingyi Yu, Jingya Wang, Ye Shi
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.21528
Pdf URL: https://arxiv.org/pdf/2505.21528
Copy Paste: [[2505.21528]] UniDB++: Fast Sampling of Unified Diffusion Bridge(https://arxiv.org/abs/2505.21528)
Keywords: restoration, generation
Abstract: Diffusion Bridges enable transitions between arbitrary distributions, with the Unified Diffusion Bridge (UniDB) framework achieving high-fidelity image generation via a Stochastic Optimal Control (SOC) formulation. However, UniDB's reliance on iterative Euler sampling methods results in slow, computationally expensive inference, while existing acceleration techniques for diffusion or diffusion bridge models fail to address its unique challenges: missing terminal mean constraints and SOC-specific penalty coefficients in its SDEs. We present UniDB++, a training-free sampling algorithm that significantly improves upon these limitations. The method's key advancement comes from deriving exact closed-form solutions for UniDB's reverse-time SDEs, effectively reducing the error accumulation inherent in Euler approximations and enabling high-quality generation with up to 20$\times$ fewer sampling steps. This method is further complemented by replacing conventional noise prediction with a more stable data prediction model, along with an SDE-Corrector mechanism that maintains perceptual quality for low-step regimes (5-10 steps). Additionally, we demonstrate that UniDB++ aligns with existing diffusion bridge acceleration methods by evaluating their update rules, and UniDB++ can recover DBIMs as special cases under some theoretical conditions. Experiments demonstrate UniDB++'s state-of-the-art performance in image restoration tasks, outperforming Euler-based methods in fidelity and speed while reducing inference time significantly. This work bridges the gap between theoretical generality and practical efficiency in SOC-driven diffusion bridge models. Our code is available at this https URL.
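To illustrate why a closed-form solution can need far fewer steps than Euler sampling, the sketch below compares the two on a toy linear drift, dx/dt = -theta (x - mu), which is the mean dynamics of an Ornstein-Uhlenbeck process. This is a stand-in chosen for illustration; it is not UniDB's SDE, and the parameter values are arbitrary.

```python
import math


def euler_mean(x0, mu, theta, t_end, n_steps):
    """Integrate dx/dt = -theta * (x - mu) with n_steps explicit Euler steps."""
    x = x0
    dt = t_end / n_steps
    for _ in range(n_steps):
        x += -theta * (x - mu) * dt
    return x


def exact_mean(x0, mu, theta, t_end):
    """Closed-form solution of the same ODE, valid for any step size."""
    return mu + (x0 - mu) * math.exp(-theta * t_end)


# Arbitrary illustrative parameters.
x0, mu, theta, t_end = 1.0, 0.0, 2.0, 1.0
truth = exact_mean(x0, mu, theta, t_end)
err_5 = abs(euler_mean(x0, mu, theta, t_end, 5) - truth)
err_100 = abs(euler_mean(x0, mu, theta, t_end, 100) - truth)
```

With only 5 Euler steps the discretization error is much larger than with 100, while the closed-form update reaches the same endpoint in a single evaluation.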

Title: DiffDecompose: Layer-Wise Decomposition of Alpha-Composited Images via Diffusion Transformers

Authors: Zitong Wang, Hang Zhao, Qianyu Zhou, Xuequan Lu, Xiangtai Li, Yiren Song
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.21541
Pdf URL: https://arxiv.org/pdf/2505.21541
Copy Paste: [[2505.21541]] DiffDecompose: Layer-Wise Decomposition of Alpha-Composited Images via Diffusion Transformers(https://arxiv.org/abs/2505.21541)
Keywords: generation
Abstract: Diffusion models have recently motivated great success in many generation tasks like object removal. Nevertheless, existing image decomposition methods struggle to disentangle semi-transparent or transparent layer occlusions due to mask prior dependencies, static object assumptions, and the lack of datasets. In this paper, we delve into a novel task: Layer-Wise Decomposition of Alpha-Composited Images, aiming to recover constituent layers from single overlapped images under the condition of semi-transparent/transparent alpha layer non-linear occlusion. To address challenges in layer ambiguity, generalization, and data scarcity, we first introduce AlphaBlend, the first large-scale and high-quality dataset for transparent and semi-transparent layer decomposition, supporting six real-world subtasks (e.g., translucent flare removal, semi-transparent cell decomposition, glassware decomposition). Building on this dataset, we present DiffDecompose, a diffusion Transformer-based framework that learns the posterior over possible layer decompositions conditioned on the input image, semantic prompts, and blending type. Rather than regressing alpha mattes directly, DiffDecompose performs In-Context Decomposition, enabling the model to predict one or multiple layers without per-layer supervision, and introduces Layer Position Encoding Cloning to maintain pixel-level correspondence across layers. Extensive experiments on the proposed AlphaBlend dataset and public LOGO dataset verify the effectiveness of DiffDecompose. The code and dataset will be available upon paper acceptance. Our code will be available at: this https URL.

Title: Vision Meets Language: A RAG-Augmented YOLOv8 Framework for Coffee Disease Diagnosis and Farmer Assistance

Authors: Semanto Mondal
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2505.21544
Pdf URL: https://arxiv.org/pdf/2505.21544
Copy Paste: [[2505.21544]] Vision Meets Language: A RAG-Augmented YOLOv8 Framework for Coffee Disease Diagnosis and Farmer Assistance(https://arxiv.org/abs/2505.21544)
Keywords: generation
Abstract: As social beings, we have an intimate bond with the environment. A plethora of things in human life, such as lifestyle, health, and food, are dependent on the environment and agriculture. It is our responsibility to support the environment as well as agriculture. However, traditional farming practices often result in inefficient resource use and environmental challenges. To address these issues, precision agriculture has emerged as a promising approach that leverages advanced technologies to optimise agricultural processes. In this work, a hybrid approach is proposed that combines three different potential fields of model AI: object detection, large language models (LLMs), and Retrieval-Augmented Generation (RAG). In this novel framework, we have tried to combine vision and language models to work together to identify potential diseases in the tree leaf. This study introduces a novel AI-based precision agriculture system that uses Retrieval-Augmented Generation (RAG) to provide context-aware diagnoses and natural language processing (NLP) and YOLOv8 for crop disease detection. The system aims to tackle major issues with LLMs, especially hallucinations, and allows for adaptive treatment plans and real-time disease detection. The system provides an easy-to-use interface that farmers can use to detect the different diseases related to coffee leaves by just submitting an image of the affected leaf: the model detects the diseases and suggests potential remediation methodologies, which aim to lower the use of pesticides, preserve livelihoods, and encourage environmentally friendly methods. With an emphasis on scalability, dependability, and user-friendliness, the project intends to improve RAG-integrated object detection systems for wider agricultural applications in the future.

Title: Corruption-Aware Training of Latent Video Diffusion Models for Robust Text-to-Video Generation

Authors: Chika Maduabuchi, Hao Chen, Yujin Han, Jindong Wang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.21545
Pdf URL: https://arxiv.org/pdf/2505.21545
Copy Paste: [[2505.21545]] Corruption-Aware Training of Latent Video Diffusion Models for Robust Text-to-Video Generation(https://arxiv.org/abs/2505.21545)
Keywords: generation
Abstract: Latent Video Diffusion Models (LVDMs) achieve high-quality generation but are sensitive to imperfect conditioning, which causes semantic drift and temporal incoherence on noisy, web-scale video-text datasets. We introduce CAT-LVDM, the first corruption-aware training framework for LVDMs that improves robustness through structured, data-aligned noise injection. Our method includes Batch-Centered Noise Injection (BCNI), which perturbs embeddings along intra-batch semantic directions to preserve temporal consistency. BCNI is especially effective on caption-rich datasets like WebVid-2M, MSR-VTT, and MSVD. We also propose Spectrum-Aware Contextual Noise (SACN), which injects noise along dominant spectral directions to improve low-frequency smoothness, showing strong results on UCF-101. On average, BCNI reduces FVD by 31.9% across WebVid-2M, MSR-VTT, and MSVD, while SACN yields a 12.3% improvement on UCF-101. Ablation studies confirm the benefit of low-rank, data-aligned noise. Our theoretical analysis further explains how such perturbations tighten entropy, Wasserstein, score-drift, mixing-time, and generalization bounds. CAT-LVDM establishes a principled, scalable training approach for robust video diffusion under multimodal noise. Code and models: this https URL

Title: Image Tokens Matter: Mitigating Hallucination in Discrete Tokenizer-based Large Vision-Language Models via Latent Editing

Authors: Weixing Wang, Zifeng Ding, Jindong Gu, Rui Cao, Christoph Meinel, Gerard de Melo, Haojin Yang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.21547
Pdf URL: https://arxiv.org/pdf/2505.21547
Copy Paste: [[2505.21547]] Image Tokens Matter: Mitigating Hallucination in Discrete Tokenizer-based Large Vision-Language Models via Latent Editing(https://arxiv.org/abs/2505.21547)
Keywords: generation
Abstract: Large Vision-Language Models (LVLMs) with discrete image tokenizers unify multimodal representations by encoding visual inputs into a finite set of tokens. Despite their effectiveness, we find that these models still hallucinate non-existent objects. We hypothesize that this may be due to visual priors induced during training: When certain image tokens frequently co-occur in the same spatial regions and represent shared objects, they become strongly associated with the verbalizations of those objects. As a result, the model may hallucinate by evoking visually absent tokens that often co-occur with present ones. To test this assumption, we construct a co-occurrence graph of image tokens using a segmentation dataset and employ a Graph Neural Network (GNN) with contrastive learning followed by a clustering method to group tokens that frequently co-occur in similar visual contexts. We find that hallucinations predominantly correspond to clusters whose tokens dominate the input, and more specifically, that the visually absent tokens in those clusters show much higher correlation with hallucinated objects compared to tokens present in the image. Based on this observation, we propose a hallucination mitigation method that suppresses the influence of visually absent tokens by modifying latent image embeddings during generation. Experiments show our method reduces hallucinations while preserving expressivity. Code is available at this https URL
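The co-occurrence graph of image tokens mentioned in the abstract could be built along the following lines. This is a minimal sketch under assumptions: each segment is represented as a list of discrete token ids, an edge weight counts the number of segments in which two tokens appear together, and the example ids are invented. The paper's actual graph construction, GNN, and clustering steps are not reproduced.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_graph(segments):
    """Count, for every unordered token pair, how many segments contain both."""
    edges = Counter()
    for tokens in segments:
        # De-duplicate within a segment so repeated tokens count once.
        for a, b in combinations(sorted(set(tokens)), 2):
            edges[(a, b)] += 1
    return edges


# Hypothetical token ids for three segmented regions.
segments = [[3, 7, 7, 9], [3, 7], [9, 12]]
graph = cooccurrence_graph(segments)
```

Here `graph[(3, 7)]` is the edge weight between tokens 3 and 7; pairs that never share a segment have weight 0.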

Title: Do We Need All the Synthetic Data? Towards Targeted Synthetic Image Augmentation via Diffusion Models

Authors: Dang Nguyen, Jiping Li, Jinghao Zheng, Baharan Mirzasoleiman
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.21574
Pdf URL: https://arxiv.org/pdf/2505.21574
Copy Paste: [[2505.21574]] Do We Need All the Synthetic Data? Towards Targeted Synthetic Image Augmentation via Diffusion Models(https://arxiv.org/abs/2505.21574)
Keywords: generation
Abstract: Synthetically augmenting training datasets with diffusion models has been an effective strategy for improving generalization of image classifiers. However, existing techniques struggle to ensure the diversity of generation and increase the size of the data by up to 10-30x to improve the in-distribution performance. In this work, we show that synthetically augmenting part of the data that is not learned early in training outperforms augmenting the entire dataset. By analyzing a two-layer CNN, we prove that this strategy improves generalization by promoting homogeneity in feature learning speed without amplifying noise. Our extensive experiments show that by augmenting only 30%-40% of the data, our method boosts the performance by up to 2.8% in a variety of scenarios, including training ResNet, ViT and DenseNet on CIFAR-10, CIFAR-100, and TinyImageNet, with a range of optimizers including SGD and SAM. Notably, our method applied with SGD outperforms the SOTA optimizer, SAM, on CIFAR-100 and TinyImageNet. It can also easily stack with existing weak and strong augmentation strategies to further boost the performance.
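The idea of augmenting only the data that is not learned early in training can be sketched as a selection step. The abstract does not say how "not learned early" is measured, so the snippet assumes a per-example loss recorded after a few early epochs as a proxy; the function name, the loss values, and the 0.3 fraction are illustrative only.

```python
def select_not_learned_early(early_losses, fraction=0.3):
    """Return indices of the `fraction` of examples with the highest early loss.

    Assumption: a high loss after a few epochs marks an example as not yet
    learned. The paper's actual selection criterion may differ.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, round(fraction * len(early_losses)))
    ranked = sorted(range(len(early_losses)),
                    key=lambda i: early_losses[i], reverse=True)
    return sorted(ranked[:k])


# Hypothetical per-example losses measured early in training.
losses = [0.05, 1.9, 0.1, 2.3, 0.02, 0.4, 1.1, 0.03, 0.2, 0.07]
to_augment = select_not_learned_early(losses, fraction=0.3)
```

Only the examples indexed by `to_augment` would then be passed to the diffusion model for synthetic augmentation, rather than the whole dataset.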

Title: BaryIR: Learning Multi-Source Unified Representation in Continuous Barycenter Space for Generalizable All-in-One Image Restoration

Authors: Xiaole Tang, Xiaoyi He, Xiang Gu, Jian Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.21637
Pdf URL: https://arxiv.org/pdf/2505.21637
Copy Paste: [[2505.21637]] BaryIR: Learning Multi-Source Unified Representation in Continuous Barycenter Space for Generalizable All-in-One Image Restoration(https://arxiv.org/abs/2505.21637)
Keywords: restoration
Abstract: Despite remarkable advances made in all-in-one image restoration (AIR) for handling different types of degradations simultaneously, existing methods remain vulnerable to out-of-distribution degradations and images, limiting their real-world applicability. In this paper, we propose a multi-source representation learning framework BaryIR, which decomposes the latent space of multi-source degraded images into a continuous barycenter space for unified feature encoding and source-specific subspaces for specific semantic encoding. Specifically, we seek the multi-source unified representation by introducing a multi-source latent optimal transport barycenter problem, in which a continuous barycenter map is learned to transport the latent representations to the barycenter space. The transport cost is designed such that the representations from source-specific subspaces are contrasted with each other while maintaining orthogonality to those from the barycenter space. This enables BaryIR to learn compact representations with unified degradation-agnostic information from the barycenter space, as well as degradation-specific semantics from source-specific subspaces, capturing the inherent geometry of multi-source data manifold for generalizable AIR. Extensive experiments demonstrate that BaryIR achieves competitive performance compared to state-of-the-art all-in-one methods. Particularly, BaryIR exhibits superior generalization ability to real-world data and unseen degradations. The code will be publicly available at this https URL.

Title: Efficient Diffusion Models for Symmetric Manifolds

Authors: Oren Mangoubi, Neil He, Nisheeth K. Vishnoi
Subjects: cs.LG, cs.AI, cs.DS, math.PR, stat.ML
Abstract URL: https://arxiv.org/abs/2505.21640
Pdf URL: https://arxiv.org/pdf/2505.21640
Copy Paste: [[2505.21640]] Efficient Diffusion Models for Symmetric Manifolds(https://arxiv.org/abs/2505.21640)
Keywords: generation
Abstract: We introduce a framework for designing efficient diffusion models for $d$-dimensional symmetric-space Riemannian manifolds, including the torus, sphere, special orthogonal group and unitary group. Existing manifold diffusion models often depend on heat kernels, which lack closed-form expressions and require either $d$ gradient evaluations or exponential-in-$d$ arithmetic operations per training step. We introduce a new diffusion model for symmetric manifolds with a spatially-varying covariance, allowing us to leverage a projection of Euclidean Brownian motion to bypass heat kernel computations. Our training algorithm minimizes a novel efficient objective derived via Ito's Lemma, allowing each step to run in $O(1)$ gradient evaluations and nearly-linear-in-$d$ ($O(d^{1.19})$) arithmetic operations, reducing the gap between diffusions on symmetric manifolds and Euclidean space. Manifold symmetries ensure the diffusion satisfies an "average-case" Lipschitz condition, enabling accurate and efficient sample generation. Empirically, our model outperforms prior methods in training speed and improves sample quality on synthetic datasets on the torus, special orthogonal group, and unitary group.
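The abstract describes leveraging a projection of Euclidean Brownian motion to avoid heat kernel computations. As a loose illustration for the unit sphere only, the sketch below takes a Gaussian step in the ambient Euclidean space and projects the result back onto the sphere by normalizing. This per-step projection is a simplifying assumption and may differ from the paper's construction; the step size and dimension are arbitrary.

```python
import math
import random


def projected_step(x, step_std, rng):
    """Take a Gaussian step in ambient space, then project onto the unit sphere."""
    moved = [xi + rng.gauss(0.0, step_std) for xi in x]
    norm = math.sqrt(sum(v * v for v in moved))
    return [v / norm for v in moved]


rng = random.Random(0)  # fixed seed for reproducibility
x = [1.0, 0.0, 0.0]     # start at a point on the 2-sphere in R^3
for _ in range(100):
    x = projected_step(x, step_std=0.1, rng=rng)
```

Each step costs only a Gaussian draw and a normalization, which is the kind of cheap ambient-space computation that sidesteps evaluating a heat kernel on the manifold.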

Title: Geometric Feature Prompting of Image Segmentation Models

Authors: Kenneth Ball, Erin Taylor, Nirav Patel, Andrew Bartels, Gary Koplik, James Polly, Jay Hineman
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.21644
Pdf URL: https://arxiv.org/pdf/2505.21644
Copy Paste: [[2505.21644]] Geometric Feature Prompting of Image Segmentation Models(https://arxiv.org/abs/2505.21644)
Keywords: generation
Abstract: Advances in machine learning, especially the introduction of transformer architectures and vision transformers, have led to the development of highly capable computer vision foundation models. The segment anything model (known colloquially as SAM and more recently SAM 2) is a highly capable foundation model for segmentation of natural images and has been further applied to medical and scientific image segmentation tasks. SAM relies on prompts -- points or regions of interest in an image -- to generate associated segmentations. In this manuscript we propose the use of a geometrically motivated prompt generator to produce prompt points that are colocated with particular features of interest. Focused prompting enables the automatic generation of sensitive and specific segmentations in a scientific image analysis task using SAM with relatively few point prompts. The image analysis task examined is the segmentation of plant roots in rhizotron or minirhizotron images, which has historically been a difficult task to automate. Hand annotation of rhizotron images is laborious and often subjective; SAM, initialized with GeomPrompt local ridge prompts, has the potential to dramatically improve rhizotron image processing. The authors have concurrently released an open source software suite called geomprompt (this https URL) that can produce point prompts in a format that enables direct integration with the segment-anything package.

Title: Think Before You Diffuse: LLMs-Guided Physics-Aware Video Generation

Authors: Ke Zhang, Cihan Xiao, Yiqun Mei, Jiacong Xu, Vishal M. Patel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.21653
Pdf URL: https://arxiv.org/pdf/2505.21653
Copy Paste: [[2505.21653]] Think Before You Diffuse: LLMs-Guided Physics-Aware Video Generation(https://arxiv.org/abs/2505.21653)
Keywords: generation
Abstract: Recent video diffusion models have demonstrated their great capability in generating visually-pleasing results, while synthesizing the correct physical effects in generated videos remains challenging. The complexity of real-world motions, interactions, and dynamics introduces great difficulties when learning physics from data. In this work, we propose DiffPhy, a generic framework that enables physically-correct and photo-realistic video generation by fine-tuning a pre-trained video diffusion model. Our method leverages large language models (LLMs) to explicitly reason about a comprehensive physical context from the text prompt and use it to guide the generation. To incorporate physical context into the diffusion model, we leverage a multimodal large language model (MLLM) as a supervisory signal and introduce a set of novel training objectives that jointly enforce physical correctness and semantic consistency with the input text. We also establish a high-quality physical video dataset containing diverse physical actions and events to facilitate effective finetuning. Extensive experiments on public benchmarks demonstrate that DiffPhy is able to produce state-of-the-art results across diverse physics-related scenarios. Our project page is available at this https URL
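Summary: Recent video diffusion models have shown great capability in producing visually pleasing results, yet synthesizing correct physical effects in generated videos remains challenging. The complexity of real-world motions, interactions, and dynamics makes it very difficult to learn physics from data. In this work, we propose DiffPhy, a generic framework that achieves physically correct and photo-realistic video generation by fine-tuning a pre-trained video diffusion model. Our method uses large language models (LLMs) to explicitly reason out a comprehensive physical context from the text prompt and uses it to guide generation. To incorporate the physical context into the diffusion model, we use a multimodal large language model (MLLM) as a supervisory signal and introduce a set of novel training objectives that jointly enforce physical correctness and semantic consistency with the input text. We also build a high-quality physical video dataset containing diverse physical actions and events to facilitate effective fine-tuning. Extensive experiments on public benchmarks show that DiffPhy produces state-of-the-art results across diverse physics-related scenarios. Our project page is available at this https URL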

Title: PreGenie: An Agentic Framework for High-quality Visual Presentation Generation

Authors: Xiaojie Xu, Xinli Xu, Sirui Chen, Haoyu Chen, Fan Zhang, Ying-Cong Chen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.21660
Pdf URL: https://arxiv.org/pdf/2505.21660
Copy Paste: [[2505.21660]] PreGenie: An Agentic Framework for High-quality Visual Presentation Generation(https://arxiv.org/abs/2505.21660)
Keywords: generation
Abstract: Visual presentations are vital for effective communication. Early attempts to automate their creation using deep learning often faced issues such as poorly organized layouts, inaccurate text summarization, and a lack of image understanding, leading to mismatched visuals and text. These limitations restrict their application in formal contexts like business and scientific research. To address these challenges, we propose PreGenie, an agentic and modular framework powered by multimodal large language models (MLLMs) for generating high-quality visual presentations. PreGenie is built on the Slidev presentation framework, where slides are rendered from Markdown code. It operates in two stages: (1) Analysis and Initial Generation, which summarizes multimodal input and generates initial code, and (2) Review and Re-generation, which iteratively reviews intermediate code and rendered slides to produce final, high-quality presentations. Each stage leverages multiple MLLMs that collaborate and share information. Comprehensive experiments demonstrate that PreGenie excels in multimodal understanding, outperforming existing models in both aesthetics and content consistency, while aligning more closely with human design preferences.
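Summary: Visual presentations are vital for effective communication. Early attempts to automate their creation with deep learning often suffered from poorly organized layouts, inaccurate text summarization, and a lack of image understanding, leading to mismatched visuals and text. These limitations restrict their use in formal contexts such as business and scientific research. To address these challenges, we propose PreGenie, an agentic and modular framework powered by multimodal large language models (MLLMs) for generating high-quality visual presentations. PreGenie is built on the Slidev presentation framework, in which slides are rendered from Markdown code. It operates in two stages: (1) Analysis and Initial Generation, which summarizes the multimodal input and generates initial code, and (2) Review and Re-generation, which iteratively reviews the intermediate code and the rendered slides to produce the final, high-quality presentation. Each stage uses multiple MLLMs that collaborate and share information. Comprehensive experiments show that PreGenie excels at multimodal understanding, outperforms existing models in both aesthetics and content consistency, and aligns more closely with human design preferences.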

Title: Efficient Controllable Diffusion via Optimal Classifier Guidance

Authors: Owen Oertell, Shikun Sun, Yiding Chen, Jin Peng Zhou, Zhiyong Wang, Wen Sun
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.21666
Pdf URL: https://arxiv.org/pdf/2505.21666
Copy Paste: [[2505.21666]] Efficient Controllable Diffusion via Optimal Classifier Guidance(https://arxiv.org/abs/2505.21666)
Keywords: generation
Abstract: The controllable generation of diffusion models aims to steer the model to generate samples that optimize some given objective functions. It is desirable for a variety of applications including image generation, molecule generation, and DNA/sequence generation. Reinforcement Learning (RL) based fine-tuning of the base model is a popular approach but it can overfit the reward function while requiring significant resources. We frame controllable generation as a problem of finding a distribution that optimizes a KL-regularized objective function. We present SLCD -- Supervised Learning based Controllable Diffusion, which iteratively generates online data and trains a small classifier to guide the generation of the diffusion model. Similar to the standard classifier-guided diffusion, SLCD's key computation primitive is classification and does not involve any complex concepts from RL or control. Via a reduction to no-regret online learning analysis, we show that under KL divergence, the output from SLCD provably converges to the optimal solution of the KL-regularized objective. Further, we empirically demonstrate that SLCD can generate high quality samples with nearly the same inference time as the base model in both image generation with continuous diffusion and biological sequence generation with discrete diffusion. Our code is available at this https URL
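Summary: Controllable generation with diffusion models aims to steer the model to generate samples that optimize some given objective functions. This is desirable for a variety of applications, including image generation, molecule generation, and DNA/sequence generation. Reinforcement learning (RL) based fine-tuning of the base model is a popular approach, but it can overfit the reward function while requiring significant resources. We frame controllable generation as the problem of finding a distribution that optimizes a KL-regularized objective function. We present SLCD (Supervised Learning based Controllable Diffusion), which iteratively generates online data and trains a small classifier to guide the generation of the diffusion model. As in standard classifier-guided diffusion, SLCD's key computational primitive is classification, and it does not involve any complex concepts from RL or control. Via a reduction to no-regret online learning analysis, we show that under KL divergence the output of SLCD provably converges to the optimal solution of the KL-regularized objective. Further, we empirically demonstrate that SLCD can generate high-quality samples with nearly the same inference time as the base model, both in image generation with continuous diffusion and in biological sequence generation with discrete diffusion. Our code is available at this https URL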
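
For orientation, a KL-regularized objective of the kind mentioned in this abstract is commonly written as below. This is only a sketch of the standard textbook form, not necessarily the paper's exact formulation: the reward $r$, the regularization weight $\beta > 0$, and the base distribution $p_{\mathrm{ref}}$ are assumed symbols that do not appear in the abstract.

```latex
\[
  p^\star \;=\; \arg\max_{p}\; \mathbb{E}_{x \sim p}\,[\,r(x)\,]
  \;-\; \beta\,\mathrm{KL}\!\left(p \,\|\, p_{\mathrm{ref}}\right),
  \qquad
  p^\star(x) \;=\; \frac{1}{Z}\; p_{\mathrm{ref}}(x)\,
  \exp\!\left(\frac{r(x)}{\beta}\right),
  \qquad
  Z \;=\; \sum_{x} p_{\mathrm{ref}}(x)\,\exp\!\left(\frac{r(x)}{\beta}\right).
\]
```

Under these assumptions the optimum is the base distribution reweighted by the exponentiated reward, so a larger $\beta$ keeps samples closer to the base model.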

Title: What happens when generative AI models train recursively on each others' generated outputs?

Authors: Hung Ahn Vu, Galen Reeves, Emily Wenger
Subjects: cs.LG, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2505.21677
Pdf URL: https://arxiv.org/pdf/2505.21677
Copy Paste: [[2505.21677]] What happens when generative AI models train recursively on each others' generated outputs?(https://arxiv.org/abs/2505.21677)
Keywords: generative
Abstract: The internet is full of AI-generated content while also serving as a common source of training data for generative AI (genAI) models. This duality raises the possibility that future genAI models may be trained on other models' generated outputs. Prior work has studied consequences of models training on their own generated outputs, but limited work has considered what happens if models ingest content produced by other models. Given society's increasing dependence on genAI tools, understanding downstream effects of such data-mediated model interactions is critical. To this end, we provide empirical evidence for how data-mediated interactions might unfold in practice, develop a theoretical model for this interactive training process, and show experimentally possible long-term results of such interactions. We find that data-mediated interactions can benefit models by exposing them to novel concepts perhaps missed in original training data, but also can homogenize their performance on shared tasks.
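Summary: The internet is full of AI-generated content while also serving as a common source of training data for generative AI (genAI) models. This duality raises the possibility that future genAI models may be trained on the generated outputs of other models. Prior work has studied the consequences of models training on their own generated outputs, but only limited work has considered what happens when models ingest content produced by other models. Given society's increasing dependence on genAI tools, understanding the downstream effects of such data-mediated model interactions is critical. To this end, we provide empirical evidence for how data-mediated interactions might unfold in practice, develop a theoretical model of this interactive training process, and show experimentally the possible long-term results of such interactions. We find that data-mediated interactions can benefit models by exposing them to novel concepts that may have been missed in their original training data, but can also homogenize their performance on shared tasks.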

Title: OmniResponse: Online Multimodal Conversational Response Generation in Dyadic Interactions

Authors: Cheng Luo, Jianghui Wang, Bing Li, Siyang Song, Bernard Ghanem
Subjects: cs.CV, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2505.21724
Pdf URL: https://arxiv.org/pdf/2505.21724
Copy Paste: [[2505.21724]] OmniResponse: Online Multimodal Conversational Response Generation in Dyadic Interactions(https://arxiv.org/abs/2505.21724)
Keywords: generation
Abstract: In this paper, we introduce Online Multimodal Conversational Response Generation (OMCRG), a novel task that aims to online generate synchronized verbal and non-verbal listener feedback, conditioned on the speaker's multimodal input. OMCRG reflects natural dyadic interactions and poses new challenges in achieving synchronization between the generated audio and facial responses of the listener. To address these challenges, we innovatively introduce text as an intermediate modality to bridge the audio and facial responses. We hence propose OmniResponse, a Multimodal Large Language Model (MLLM) that autoregressively generates high-quality multi-modal listener responses. OmniResponse leverages a pretrained LLM enhanced with two novel components: Chrono-Text, which temporally anchors generated text tokens, and TempoVoice, a controllable online TTS module that produces speech synchronized with facial reactions. To support further OMCRG research, we present ResponseNet, a new dataset comprising 696 high-quality dyadic interactions featuring synchronized split-screen videos, multichannel audio, transcripts, and facial behavior annotations. Comprehensive evaluations conducted on ResponseNet demonstrate that OmniResponse significantly outperforms baseline models in terms of semantic speech content, audio-visual synchronization, and generation quality.
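Summary: In this paper, we introduce Online Multimodal Conversational Response Generation (OMCRG), a novel task that aims to generate synchronized verbal and non-verbal listener feedback online, conditioned on the speaker's multimodal input. OMCRG reflects natural dyadic interactions and poses new challenges in synchronizing the listener's generated audio and facial responses. To address these challenges, we introduce text as an intermediate modality that bridges the audio and facial responses. We therefore propose OmniResponse, a Multimodal Large Language Model (MLLM) that autoregressively generates high-quality multimodal listener responses. OmniResponse uses a pretrained LLM enhanced with two novel components: Chrono-Text, which temporally anchors the generated text tokens, and TempoVoice, a controllable online TTS module that produces speech synchronized with facial reactions. To support further OMCRG research, we present ResponseNet, a new dataset of 696 high-quality dyadic interactions featuring synchronized split-screen videos, multichannel audio, transcripts, and facial behavior annotations. Comprehensive evaluations on ResponseNet show that OmniResponse significantly outperforms baseline models in semantic speech content, audio-visual synchronization, and generation quality.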

Title: Simulating the Unseen: Crash Prediction Must Learn from What Did Not Happen

Authors: Zihao Li, Xinyuan Cao, Xiangbo Gao, Kexin Tian, Keshu Wu, Mohammad Anis, Hao Zhang, Keke Long, Jiwan Jiang, Xiaopeng Li, Yunlong Zhang, Tianbao Yang, Dominique Lord, Zhengzhong Tu, Yang Zhou
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.21743
Pdf URL: https://arxiv.org/pdf/2505.21743
Copy Paste: [[2505.21743]] Simulating the Unseen: Crash Prediction Must Learn from What Did Not Happen(https://arxiv.org/abs/2505.21743)
Keywords: generative
Abstract: Traffic safety science has long been hindered by a fundamental data paradox: the crashes we most wish to prevent are precisely those events we rarely observe. Existing crash-frequency models and surrogate safety metrics rely heavily on sparse, noisy, and under-reported records, while even sophisticated, high-fidelity simulations undersample the long-tailed situations that trigger catastrophic outcomes such as fatalities. We argue that the path to achieving Vision Zero, i.e., the complete elimination of traffic fatalities and severe injuries, requires a paradigm shift from traditional crash-only learning to a new form of counterfactual safety learning: reasoning not only about what happened, but also about the vast set of plausible yet perilous scenarios that could have happened under slightly different circumstances. To operationalize this shift, our proposed agenda bridges macro to micro. Guided by crash-rate priors, generative scene engines, diverse driver models, and causal learning, near-miss events are synthesized and explained. A crash-focused digital twin testbed links micro scenes to macro patterns, while a multi-objective validator ensures that simulations maintain statistical realism. This pipeline transforms sparse crash data into rich signals for crash prediction, enabling the stress-testing of vehicles, roads, and policies before deployment. By learning from crashes that almost happened, we can shift traffic safety from reactive forensics to proactive prevention, advancing Vision Zero.
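Summary: Traffic safety science has long been hindered by a fundamental data paradox: the crashes we most wish to prevent are precisely the events we rarely observe. Existing crash-frequency models and surrogate safety metrics rely heavily on sparse, noisy, and under-reported records, while even sophisticated, high-fidelity simulations undersample the long-tailed situations that trigger catastrophic outcomes such as fatalities. We argue that the path to Vision Zero, i.e., the complete elimination of traffic fatalities and severe injuries, requires a paradigm shift from traditional crash-only learning to a new form of counterfactual safety learning: reasoning not only about what happened, but also about the vast set of plausible yet perilous scenarios that could have happened under slightly different circumstances. To operationalize this shift, our proposed agenda bridges macro to micro. Guided by crash-rate priors, generative scene engines, diverse driver models, and causal learning, near-miss events are synthesized and explained. A crash-focused digital twin testbed links micro scenes to macro patterns, while a multi-objective validator ensures that simulations maintain statistical realism. This pipeline transforms sparse crash data into rich signals for crash prediction, enabling vehicles, roads, and policies to be stress-tested before deployment. By learning from crashes that almost happened, we can shift traffic safety from reactive forensics to proactive prevention, advancing Vision Zero.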

Title: Learning to See More: UAS-Guided Super-Resolution of Satellite Imagery for Precision Agriculture

Authors: Arif Masrur, Peder A. Olsen, Paul R. Adler, Carlan Jackson, Matthew W. Myers, Nathan Sedghi, Ray R. Weil
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.21746
Pdf URL: https://arxiv.org/pdf/2505.21746
Copy Paste: [[2505.21746]] Learning to See More: UAS-Guided Super-Resolution of Satellite Imagery for Precision Agriculture(https://arxiv.org/abs/2505.21746)
Keywords: super-resolution
Abstract: Unmanned Aircraft Systems (UAS) and satellites are key data sources for precision agriculture, yet each presents trade-offs. Satellite data offer broad spatial, temporal, and spectral coverage but lack the resolution needed for many precision farming applications, while UAS provide high spatial detail but are limited by coverage and cost, especially for hyperspectral data. This study presents a novel framework that fuses satellite and UAS imagery using super-resolution methods. By integrating data across spatial, spectral, and temporal domains, we leverage the strengths of both platforms cost-effectively. We use estimation of cover crop biomass and nitrogen (N) as a case study to evaluate our approach. By spectrally extending UAS RGB data to the vegetation red edge and near-infrared regions, we generate high-resolution Sentinel-2 imagery and improve biomass and N estimation accuracy by 18% and 31%, respectively. Our results show that UAS data need only be collected from a subset of fields and time points. Farmers can then 1) enhance the spectral detail of UAS RGB imagery; 2) increase the spatial resolution by using satellite data; and 3) extend these enhancements spatially and across the growing season at the frequency of the satellite flights. Our SRCNN-based spectral extension model shows considerable promise for model transferability over other cropping systems in the Upper and Lower Chesapeake Bay regions. Additionally, it remains effective even when cloud-free satellite data are unavailable, relying solely on the UAS RGB input. The spatial extension model produces better biomass and N predictions than models built on raw UAS RGB images. Once trained with targeted UAS RGB data, the spatial extension model allows farmers to stop repeated UAS flights. While we introduce super-resolution advances, the core contribution is a lightweight and scalable system for affordable on-farm use.
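Summary: Unmanned Aircraft Systems (UAS) and satellites are key data sources for precision agriculture, yet each involves trade-offs. Satellite data offer broad spatial, temporal, and spectral coverage but lack the resolution needed for many precision farming applications, while UAS provide high spatial detail but are limited by coverage and cost, especially for hyperspectral data. This study presents a novel framework that fuses satellite and UAS imagery using super-resolution methods. By integrating data across the spatial, spectral, and temporal domains, we exploit the strengths of both platforms cost-effectively. We use the estimation of cover crop biomass and nitrogen (N) as a case study to evaluate our approach. By spectrally extending UAS RGB data to the vegetation red edge and near-infrared regions, we generate high-resolution Sentinel-2 imagery and improve biomass and N estimation accuracy by 18% and 31%, respectively. Our results show that UAS data need only be collected from a subset of fields and time points. Farmers can then 1) enhance the spectral detail of UAS RGB imagery; 2) increase the spatial resolution by using satellite data; and 3) extend these enhancements spatially and across the growing season at the frequency of the satellite flights. Our SRCNN-based spectral extension model shows considerable promise for transferability to other cropping systems in the Upper and Lower Chesapeake Bay regions. It also remains effective when cloud-free satellite data are unavailable, relying solely on the UAS RGB input. The spatial extension model produces better biomass and N predictions than models built on raw UAS RGB images. Once trained with targeted UAS RGB data, the spatial extension model allows farmers to stop repeated UAS flights. While we introduce super-resolution advances, the core contribution is a lightweight and scalable system for affordable on-farm use.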

Title: DualSchool: How Reliable are LLMs for Optimization Education?

Authors: Michael Klamkin, Arnaud Deza, Sikai Cheng, Haoruo Zhao, Pascal Van Hentenryck
Subjects: cs.LG, cs.AI, math.OC
Abstract URL: https://arxiv.org/abs/2505.21775
Pdf URL: https://arxiv.org/pdf/2505.21775
Copy Paste: [[2505.21775]] DualSchool: How Reliable are LLMs for Optimization Education?(https://arxiv.org/abs/2505.21775)
Keywords: generative
Abstract: Consider the following task taught in introductory optimization courses which addresses challenges articulated by the community at the intersection of (generative) AI and OR: generate the dual of a linear program. LLMs, being trained at web-scale, have the conversion process and many instances of Primal to Dual Conversion (P2DC) at their disposal. Students may thus reasonably expect that LLMs would perform well on the P2DC task. To assess this expectation, this paper introduces DualSchool, a comprehensive framework for generating and verifying P2DC instances. The verification procedure of DualSchool uses the Canonical Graph Edit Distance, going well beyond existing evaluation methods for optimization models, which exhibit many false positives and negatives when applied to P2DC. Experiments performed by DualSchool reveal interesting findings. Although LLMs can recite the conversion procedure accurately, state-of-the-art open LLMs fail to consistently produce correct duals. This finding holds even for the smallest two-variable instances and for derivative tasks, such as correctness, verification, and error classification. The paper also discusses the implications for educators, students, and the development of large reasoning systems.
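Summary: Consider the following task, taught in introductory optimization courses, which addresses challenges articulated by the community at the intersection of (generative) AI and OR: generate the dual of a linear program. LLMs, being trained at web scale, have the conversion procedure and many instances of Primal to Dual Conversion (P2DC) at their disposal. Students may therefore reasonably expect LLMs to perform well on the P2DC task. To assess this expectation, this paper introduces DualSchool, a comprehensive framework for generating and verifying P2DC instances. DualSchool's verification procedure uses the Canonical Graph Edit Distance, going well beyond existing evaluation methods for optimization models, which exhibit many false positives and false negatives when applied to P2DC. Experiments performed with DualSchool reveal interesting findings. Although LLMs can recite the conversion procedure accurately, state-of-the-art open LLMs fail to consistently produce correct duals. This finding holds even for the smallest two-variable instances and for derivative tasks such as correctness, verification, and error classification. The paper also discusses the implications for educators, students, and the development of large reasoning systems.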
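
To make the P2DC task concrete, here is a minimal Python sketch of the textbook conversion for one canonical form only. It is not DualSchool's generator or verifier; the function name `dual_of_canonical_lp` and the example data are hypothetical, and the primal is assumed to be `max c^T x subject to A x <= b, x >= 0`.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def dual_of_canonical_lp(c: Vector, A: Matrix, b: Vector) -> Tuple[Vector, Matrix, Vector]:
    """Return the dual of a canonical-form linear program.

    Assumed primal:  max c^T x   s.t.  A x <= b,    x >= 0
    Textbook dual:   min b^T y   s.t.  A^T y >= c,  y >= 0

    The result is (dual objective, dual constraint matrix, dual right-hand side).
    """
    if len(A) != len(b) or any(len(row) != len(c) for row in A):
        raise ValueError("dimension mismatch between c, A, and b")
    A_transposed = [list(col) for col in zip(*A)]  # one dual constraint per primal variable
    return list(b), A_transposed, list(c)


# Hypothetical example: two primal variables, three primal constraints.
c = [3.0, 5.0]
A = [[1.0, 0.0], [0.0, 2.0], [3.0, 2.0]]
b = [4.0, 12.0, 18.0]

dual_obj, dual_A, dual_rhs = dual_of_canonical_lp(c, A, b)
```

Other primal forms (equality constraints, free variables, minimization) change the signs and variable domains of the dual, which is exactly where a conversion can go wrong.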

Title: Memorization to Generalization: Emergence of Diffusion Models from Associative Memory

Authors: Bao Pham, Gabriel Raya, Matteo Negri, Mohammed J. Zaki, Luca Ambrogioni, Dmitry Krotov
Subjects: cs.LG, cond-mat.dis-nn
Abstract URL: https://arxiv.org/abs/2505.21777
Pdf URL: https://arxiv.org/pdf/2505.21777
Copy Paste: [[2505.21777]] Memorization to Generalization: Emergence of Diffusion Models from Associative Memory(https://arxiv.org/abs/2505.21777)
Keywords: generation, generative
Abstract: Hopfield networks are associative memory (AM) systems, designed for storing and retrieving patterns as local minima of an energy landscape. In the classical Hopfield model, an interesting phenomenon occurs when the amount of training data reaches its critical memory load: spurious states, or unintended stable points, emerge at the end of the retrieval dynamics, leading to incorrect recall. In this work, we examine diffusion models, commonly used in generative modeling, from the perspective of AMs. The training phase of diffusion model is conceptualized as memory encoding (training data is stored in the memory). The generation phase is viewed as an attempt of memory retrieval. In the small data regime the diffusion model exhibits a strong memorization phase, where the network creates distinct basins of attraction around each sample in the training set, akin to the Hopfield model below the critical memory load. In the large data regime, a different phase appears where an increase in the size of the training set fosters the creation of new attractor states that correspond to manifolds of the generated samples. Spurious states appear at the boundary of this transition and correspond to emergent attractor states, which are absent in the training set, but, at the same time, have distinct basins of attraction around them. Our findings provide: a novel perspective on the memorization-generalization phenomenon in diffusion models via the lens of AMs, theoretical prediction of existence of spurious states, empirical validation of this prediction in commonly-used diffusion models.
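Summary: Hopfield networks are associative memory (AM) systems designed to store and retrieve patterns as local minima of an energy landscape. In the classical Hopfield model, an interesting phenomenon occurs when the amount of training data reaches the critical memory load: spurious states, or unintended stable points, emerge at the end of the retrieval dynamics and lead to incorrect recall. In this work, we examine diffusion models, commonly used in generative modeling, from the perspective of AMs. The training phase of a diffusion model is conceptualized as memory encoding (the training data are stored in the memory), and the generation phase is viewed as an attempt at memory retrieval. In the small data regime, the diffusion model exhibits a strong memorization phase in which the network creates distinct basins of attraction around each sample in the training set, akin to the Hopfield model below the critical memory load. In the large data regime, a different phase appears in which increasing the size of the training set fosters the creation of new attractor states that correspond to manifolds of the generated samples. Spurious states appear at the boundary of this transition and correspond to emergent attractor states that are absent from the training set yet have distinct basins of attraction around them. Our findings provide a novel perspective on the memorization-generalization phenomenon in diffusion models through the lens of AMs, a theoretical prediction of the existence of spurious states, and an empirical validation of this prediction in commonly used diffusion models.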

Title: Compositional Scene Understanding through Inverse Generative Modeling

Authors: Yanbo Wang, Justin Dauwels, Yilun Du
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.21780
Pdf URL: https://arxiv.org/pdf/2505.21780
Copy Paste: [[2505.21780]] Compositional Scene Understanding through Inverse Generative Modeling(https://arxiv.org/abs/2505.21780)
Keywords: generative
Abstract: Generative models have demonstrated remarkable abilities in generating high-fidelity visual content. In this work, we explore how generative models can further be used not only to synthesize visual content but also to understand the properties of a scene given a natural image. We formulate scene understanding as an inverse generative modeling problem, where we seek to find conditional parameters of a visual generative model to best fit a given natural image. To enable this procedure to infer scene structure from images substantially different than those seen during training, we further propose to build this visual generative model compositionally from smaller models over pieces of a scene. We illustrate how this procedure enables us to infer the set of objects in a scene, enabling robust generalization to new test scenes with an increased number of objects of new shapes. We further illustrate how this enables us to infer global scene factors, likewise enabling robust generalization to new scenes. Finally, we illustrate how this approach can be directly applied to existing pretrained text-to-image generative models for zero-shot multi-object perception. Code and visualizations are at this https URL.
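Summary: Generative models have demonstrated remarkable abilities in generating high-fidelity visual content. In this work, we explore how generative models can be used not only to synthesize visual content but also to understand the properties of a scene given a natural image. We formulate scene understanding as an inverse generative modeling problem, in which we seek the conditional parameters of a visual generative model that best fit a given natural image. To enable this procedure to infer scene structure from images that differ substantially from those seen during training, we further propose building the visual generative model compositionally from smaller models over pieces of a scene. We illustrate how this procedure lets us infer the set of objects in a scene, enabling robust generalization to new test scenes with a larger number of objects of new shapes. We further illustrate how it lets us infer global scene factors, likewise enabling robust generalization to new scenes. Finally, we illustrate how the approach can be applied directly to existing pretrained text-to-image generative models for zero-shot multi-object perception. Code and visualizations are at this https URL.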

Title: ALTER: All-in-One Layer Pruning and Temporal Expert Routing for Efficient Diffusion Generation

Authors: Xiaomeng Yang, Lei Lu, Qihui Fan, Changdi Yang, Juyi Lin, Yanzhi Wang, Xuan Zhang, Shangqian Gao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.21817
Pdf URL: https://arxiv.org/pdf/2505.21817
Copy Paste: [[2505.21817]] ALTER: All-in-One Layer Pruning and Temporal Expert Routing for Efficient Diffusion Generation(https://arxiv.org/abs/2505.21817)
Keywords: generation, generative
Abstract: Diffusion models have demonstrated exceptional capabilities in generating high-fidelity images. However, their iterative denoising process results in significant computational overhead during inference, limiting their practical deployment in resource-constrained environments. Existing acceleration methods often adopt uniform strategies that fail to capture the temporal variations during diffusion generation, while the commonly adopted sequential pruning-then-fine-tuning strategy suffers from sub-optimality due to the misalignment between pruning decisions made on pretrained weights and the model's final parameters. To address these limitations, we introduce ALTER: All-in-One Layer Pruning and Temporal Expert Routing, a unified framework that transforms diffusion models into a mixture of efficient temporal experts. ALTER achieves a single-stage optimization that unifies layer pruning, expert routing, and model fine-tuning by employing a trainable hypernetwork, which dynamically generates layer pruning decisions and manages timestep routing to specialized, pruned expert sub-networks throughout the ongoing fine-tuning of the UNet. This unified co-optimization strategy enables significant efficiency gains while preserving high generative quality. Specifically, ALTER achieves same-level visual fidelity to the original 50-step Stable Diffusion v2.1 model while utilizing only 25.9% of its total MACs with just 20 inference steps and delivering a 3.64x speedup through 35% sparsity.
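Summary: Diffusion models have demonstrated exceptional capabilities in generating high-fidelity images. However, their iterative denoising process incurs significant computational overhead during inference, limiting their practical deployment in resource-constrained environments. Existing acceleration methods often adopt uniform strategies that fail to capture the temporal variations during diffusion generation, while the commonly adopted sequential pruning-then-fine-tuning strategy is sub-optimal because of the misalignment between pruning decisions made on the pretrained weights and the model's final parameters. To address these limitations, we introduce ALTER: All-in-One Layer Pruning and Temporal Expert Routing, a unified framework that transforms diffusion models into a mixture of efficient temporal experts. ALTER achieves a single-stage optimization that unifies layer pruning, expert routing, and model fine-tuning by employing a trainable hypernetwork, which dynamically generates layer pruning decisions and manages timestep routing to specialized, pruned expert sub-networks throughout the ongoing fine-tuning of the UNet. This unified co-optimization strategy yields significant efficiency gains while preserving high generative quality. Specifically, ALTER matches the visual fidelity of the original 50-step Stable Diffusion v2.1 model while using only 25.9% of its total MACs with just 20 inference steps, and delivers a 3.64x speedup through 35% sparsity.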

Title: HDRSDR-VQA: A Subjective Video Quality Dataset for HDR and SDR Comparative Evaluation

Authors: Bowen Chen, Cheng-han Lee, Yixu Chen, Zaixi Shang, Hai Wei, Alan C. Bovik
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.21831
Pdf URL: https://arxiv.org/pdf/2505.21831
Copy Paste: [[2505.21831]] HDRSDR-VQA: A Subjective Video Quality Dataset for HDR and SDR Comparative Evaluation(https://arxiv.org/abs/2505.21831)
Keywords: quality assessment
Abstract: We introduce HDRSDR-VQA, a large-scale video quality assessment dataset designed to facilitate comparative analysis between High Dynamic Range (HDR) and Standard Dynamic Range (SDR) content under realistic viewing conditions. The dataset comprises 960 videos generated from 54 diverse source sequences, each presented in both HDR and SDR formats across nine distortion levels. To obtain reliable perceptual quality scores, we conducted a comprehensive subjective study involving 145 participants and six consumer-grade HDR-capable televisions. A total of over 22,000 pairwise comparisons were collected and scaled into Just-Objectionable-Difference (JOD) scores. Unlike prior datasets that focus on a single dynamic range format or use limited evaluation protocols, HDRSDR-VQA enables direct content-level comparison between HDR and SDR versions, supporting detailed investigations into when and why one format is preferred over the other. The open-sourced part of the dataset is publicly available to support further research in video quality assessment, content-adaptive streaming, and perceptual model development.
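Summary: We introduce HDRSDR-VQA, a large-scale video quality assessment dataset designed to facilitate comparative analysis of High Dynamic Range (HDR) and Standard Dynamic Range (SDR) content under realistic viewing conditions. The dataset comprises 960 videos generated from 54 diverse source sequences, each presented in both HDR and SDR formats across nine distortion levels. To obtain reliable perceptual quality scores, we conducted a comprehensive subjective study involving 145 participants and six consumer-grade HDR-capable televisions. In total, more than 22,000 pairwise comparisons were collected and scaled into Just-Objectionable-Difference (JOD) scores. Unlike prior datasets that focus on a single dynamic range format or use limited evaluation protocols, HDRSDR-VQA enables direct content-level comparison between HDR and SDR versions, supporting detailed investigation of when and why one format is preferred over the other. The open-sourced part of the dataset is publicly available to support further research in video quality assessment, content-adaptive streaming, and perceptual model development.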

Title: UniMoGen: Universal Motion Generation

Authors: Aliasghar Khani, Arianna Rampini, Evan Atherton, Bruno Roy
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.21837
Pdf URL: https://arxiv.org/pdf/2505.21837
Copy Paste: [[2505.21837]] UniMoGen: Universal Motion Generation(https://arxiv.org/abs/2505.21837)
Keywords: generation
Abstract: Motion generation is a cornerstone of computer graphics, animation, gaming, and robotics, enabling the creation of realistic and varied character movements. A significant limitation of existing methods is their reliance on specific skeletal structures, which restricts their versatility across different characters. To overcome this, we introduce UniMoGen, a novel UNet-based diffusion model designed for skeleton-agnostic motion generation. UniMoGen can be trained on motion data from diverse characters, such as humans and animals, without the need for a predefined maximum number of joints. By dynamically processing only the necessary joints for each character, our model achieves both skeleton agnosticism and computational efficiency. Key features of UniMoGen include controllability via style and trajectory inputs, and the ability to continue motions from past frames. We demonstrate UniMoGen's effectiveness on the 100style dataset, where it outperforms state-of-the-art methods in diverse character motion generation. Furthermore, when trained on both the 100style and LAFAN1 datasets, which use different skeletons, UniMoGen achieves high performance and improved efficiency across both skeletons. These results highlight UniMoGen's potential to advance motion generation by providing a flexible, efficient, and controllable solution for a wide range of character animations.
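Summary: Motion generation is a cornerstone of computer graphics, animation, gaming, and robotics, enabling the creation of realistic and varied character movements. A significant limitation of existing methods is their reliance on specific skeletal structures, which restricts their versatility across different characters. To overcome this, we introduce UniMoGen, a novel UNet-based diffusion model designed for skeleton-agnostic motion generation. UniMoGen can be trained on motion data from diverse characters, such as humans and animals, without a predefined maximum number of joints. By dynamically processing only the joints each character needs, our model achieves both skeleton agnosticism and computational efficiency. Key features of UniMoGen include controllability via style and trajectory inputs and the ability to continue motions from past frames. We demonstrate UniMoGen's effectiveness on the 100style dataset, where it outperforms state-of-the-art methods in diverse character motion generation. Furthermore, when trained on both the 100style and LAFAN1 datasets, which use different skeletons, UniMoGen achieves high performance and improved efficiency on both skeletons. These results highlight UniMoGen's potential to advance motion generation by providing a flexible, efficient, and controllable solution for a wide range of character animations.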

Title: SDPO: Importance-Sampled Direct Preference Optimization for Stable Diffusion Training

Authors: Xiaomeng Yang, Zhiyu Tan, Junyan Wang, Zhijian Zhou, Hao Li
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.21893
Pdf URL: https://arxiv.org/pdf/2505.21893
Copy Paste: [[2505.21893]] SDPO: Importance-Sampled Direct Preference Optimization for Stable Diffusion Training(https://arxiv.org/abs/2505.21893)
Keywords: generative
Abstract: Preference learning has become a central technique for aligning generative models with human expectations. Recently, it has been extended to diffusion models through methods like Direct Preference Optimization (DPO). However, existing approaches such as Diffusion-DPO suffer from two key challenges: timestep-dependent instability, caused by a mismatch between the reverse and forward diffusion processes and by high gradient variance in early noisy timesteps, and off-policy bias arising from the mismatch between optimization and data collection policies. We begin by analyzing the reverse diffusion trajectory and observe that instability primarily occurs at early timesteps with low importance weights. To address these issues, we first propose DPO-C\&M, a practical strategy that improves stability by clipping and masking uninformative timesteps while partially mitigating off-policy bias. Building on this, we introduce SDPO (Importance-Sampled Direct Preference Optimization), a principled framework that incorporates importance sampling into the objective to fully correct for off-policy bias and emphasize informative updates during the diffusion process. Experiments on CogVideoX-2B, CogVideoX-5B, and Wan2.1-1.3B demonstrate that both methods outperform standard Diffusion-DPO, with SDPO achieving superior VBench scores, human preference alignment, and training robustness. These results highlight the importance of timestep-aware, distribution-corrected optimization in diffusion-based preference learning.
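Summary: Preference learning has become a central technique for aligning generative models with human expectations. Recently it has been extended to diffusion models through methods such as Direct Preference Optimization (DPO). However, existing approaches such as Diffusion-DPO face two key challenges: timestep-dependent instability, caused by a mismatch between the reverse and forward diffusion processes and by high gradient variance at early, noisy timesteps, and off-policy bias arising from the mismatch between the optimization and data collection policies. We begin by analyzing the reverse diffusion trajectory and observe that instability occurs primarily at early timesteps with low importance weights. To address these issues, we first propose DPO-C&M, a practical strategy that improves stability by clipping and masking uninformative timesteps while partially mitigating off-policy bias. Building on this, we introduce SDPO (Importance-Sampled Direct Preference Optimization), a principled framework that incorporates importance sampling into the objective to fully correct for off-policy bias and to emphasize informative updates during the diffusion process. Experiments on CogVideoX-2B, CogVideoX-5B, and Wan2.1-1.3B show that both methods outperform standard Diffusion-DPO, with SDPO achieving superior VBench scores, human preference alignment, and training robustness. These results highlight the importance of timestep-aware, distribution-corrected optimization in diffusion-based preference learning.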

Title: Compressing Sine-Activated Low-Rank Adapters through Post-Training Quantization

Authors: Cameron Gordon, Yiping Ji, Hemanth Saratchandran, Paul Albert, Simon Lucey
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.21895
Pdf URL: https://arxiv.org/pdf/2505.21895
Copy Paste: [[2505.21895]] Compressing Sine-Activated Low-Rank Adapters through Post-Training Quantization(https://arxiv.org/abs/2505.21895)
Keywords: generation
Abstract: Low-Rank Adaptation (LoRA) has become a standard approach for parameter-efficient fine-tuning, offering substantial reductions in trainable parameters by modeling updates as the product of two low-rank matrices. While effective, the low-rank constraint inherently limits representational capacity, often resulting in reduced performance compared to full-rank fine-tuning. Recent work by Ji et al. (2025) has addressed this limitation by applying a fixed-frequency sinusoidal transformation to low-rank adapters, increasing their stable rank without introducing additional parameters. This raises a crucial question: can the same sine-activated technique be successfully applied within the context of Post-Training Quantization to retain benefits even after model compression? In this paper, we investigate this question by extending the sinusoidal transformation framework to quantized LoRA adapters. We develop a theoretical analysis showing that the stable rank of a quantized adapter is tightly linked to that of its full-precision counterpart, motivating the use of such rank-enhancing functions even under quantization. Our results demonstrate that the expressivity gains from a sinusoidal non-linearity persist after quantization, yielding highly compressed adapters with negligible loss in performance. We validate our approach across a range of fine-tuning tasks for language, vision and text-to-image generation achieving significant memory savings while maintaining competitive accuracy.
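The stable rank referred to in the abstract is commonly defined as the squared Frobenius norm divided by the squared spectral norm, and it never exceeds the ordinary rank. The following is a minimal pure-Python sketch of that quantity, not the authors' implementation. It assumes the sinusoidal transformation is an elementwise sin(omega * BA); the frequency `omega`, the matrix sizes, and the helper names are illustrative choices that do not come from the paper.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def stable_rank(a, iters=300):
    """Return ||A||_F^2 / ||A||_2^2, estimating ||A||_2^2 by power iteration on A^T A."""
    rows, cols = len(a), len(a[0])
    fro_sq = sum(x * x for row in a for x in row)
    rng = random.Random(1)
    v = [rng.gauss(0, 1) for _ in range(cols)]
    top_sq = 0.0
    for _ in range(iters):
        norm_v = math.sqrt(sum(x * x for x in v))
        v = [x / norm_v for x in v]
        av = [sum(row[j] * v[j] for j in range(cols)) for row in a]           # A v
        v = [sum(a[i][j] * av[i] for i in range(rows)) for j in range(cols)]  # A^T (A v)
        top_sq = math.sqrt(sum(x * x for x in v))  # approaches the largest squared singular value
    return fro_sq / top_sq


rng = random.Random(0)
n, r, omega = 32, 4, 100.0  # omega is an illustrative frequency, not a value from the paper
B = [[rng.gauss(0, 1) for _ in range(r)] for _ in range(n)]
A = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(r)]
low_rank = matmul(B, A)                                              # rank is at most r
sine_act = [[math.sin(omega * x) for x in row] for row in low_rank]  # same parameter count

sr_low = stable_rank(low_rank)
sr_sine = stable_rank(sine_act)
print(f"stable rank of BA:              {sr_low:.2f}")
print(f"stable rank of sin(omega * BA): {sr_sine:.2f}")
```

Both matrices are built from the same two factors, so any increase in stable rank comes from the nonlinearity alone rather than from extra parameters.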

Title: Concentrate on Weakness: Mining Hard Prototypes for Few-Shot Medical Image Segmentation

Authors: Jianchao Jiang, Haofeng Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.21897
Pdf URL: https://arxiv.org/pdf/2505.21897
Copy Paste: [[2505.21897]] Concentrate on Weakness: Mining Hard Prototypes for Few-Shot Medical Image Segmentation(https://arxiv.org/abs/2505.21897)
Keywords: generation
Abstract: Few-Shot Medical Image Segmentation (FSMIS) has been widely used to train a model that can perform segmentation from only a few annotated images. However, most existing prototype-based FSMIS methods generate multiple prototypes from the support image solely by random sampling or local averaging, which can cause particularly severe boundary blurring because normal features tend to account for the majority of features in a specific category. Consequently, we propose to pay more attention to the weaker features that are crucial for a clear segmentation boundary. Specifically, we design a Support Self-Prediction (SSP) module to identify such weak features by comparing the true support mask with the one predicted by the global support prototype. Then, a Hard Prototypes Generation (HPG) module is employed to generate multiple hard prototypes based on these weak features. Subsequently, a Multiple Similarity Maps Fusion (MSMF) module is devised to generate the final segmentation mask in a dual-path fashion to mitigate the imbalance between foreground and background in medical images. Furthermore, we introduce a boundary loss to further constrain the edges of the segmentation. Extensive experiments on three publicly available medical image datasets demonstrate that our method achieves state-of-the-art performance. Code is available at this https URL.

Title: Reference-Guided Identity Preserving Face Restoration

Authors: Mo Zhou, Keren Ye, Viraj Shah, Kangfu Mei, Mauricio Delbracio, Peyman Milanfar, Vishal M. Patel, Hossein Talebi
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2505.21905
Pdf URL: https://arxiv.org/pdf/2505.21905
Copy Paste: [[2505.21905]] Reference-Guided Identity Preserving Face Restoration(https://arxiv.org/abs/2505.21905)
Keywords: restoration
Abstract: Preserving face identity is a critical yet persistent challenge in diffusion-based image restoration. While reference faces offer a path forward, existing reference-based methods often fail to fully exploit their potential. This paper introduces a novel approach that maximizes reference face utility for improved face restoration and identity preservation. Our method makes three key contributions: 1) Composite Context, a comprehensive representation that fuses multi-level (high- and low-level) information from the reference face, offering richer guidance than prior singular representations. 2) Hard Example Identity Loss, a novel loss function that leverages the reference face to address the identity learning inefficiencies found in the existing identity loss. 3) A training-free method to adapt the model to multi-reference inputs during inference. The proposed method demonstrably restores high-quality faces and achieves state-of-the-art identity preserving restoration on benchmarks such as FFHQ-Ref and CelebA-Ref-Test, consistently outperforming previous work.

Title: AlignGen: Boosting Personalized Image Generation with Cross-Modality Prior Alignment

Authors: Yiheng Lin, Shifang Zhao, Ting Liu, Xiaochao Qu, Luoqi Liu, Yao Zhao, Yunchao Wei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.21911
Pdf URL: https://arxiv.org/pdf/2505.21911
Copy Paste: [[2505.21911]] AlignGen: Boosting Personalized Image Generation with Cross-Modality Prior Alignment(https://arxiv.org/abs/2505.21911)
Keywords: generation
Abstract: Personalized image generation aims to integrate user-provided concepts into text-to-image models, enabling the generation of customized content based on a given prompt. Recent zero-shot approaches, particularly those leveraging diffusion transformers, incorporate reference image information through a multi-modal attention mechanism. This integration allows the generated output to be influenced by both the textual prior from the prompt and the visual prior from the reference image. However, we observe that when the prompt and reference image are misaligned, the generated results exhibit a stronger bias toward the textual prior, leading to a significant loss of reference content. To address this issue, we propose AlignGen, a Cross-Modality Prior Alignment mechanism that enhances personalized image generation by: 1) introducing a learnable token to bridge the gap between the textual and visual priors, 2) incorporating a robust training strategy to ensure proper prior alignment, and 3) employing a selective cross-modal attention mask within the multi-modal attention mechanism to further align the priors. Experimental results demonstrate that AlignGen outperforms existing zero-shot methods and even surpasses popular test-time optimization approaches.

Title: Cross-modal RAG: Sub-dimensional Retrieval-Augmented Text-to-Image Generation

Authors: Mengdan Zhu, Senhao Cheng, Guangji Bai, Yifei Zhang, Liang Zhao
Subjects: cs.CV, cs.AI, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.21956
Pdf URL: https://arxiv.org/pdf/2505.21956
Copy Paste: [[2505.21956]] Cross-modal RAG: Sub-dimensional Retrieval-Augmented Text-to-Image Generation(https://arxiv.org/abs/2505.21956)
Keywords: generation
Abstract: Text-to-image generation increasingly demands access to domain-specific, fine-grained, and rapidly evolving knowledge that pretrained models cannot fully capture. Existing Retrieval-Augmented Generation (RAG) methods attempt to address this by retrieving globally relevant images, but they fail when no single image contains all desired elements from a complex user query. We propose Cross-modal RAG, a novel framework that decomposes both queries and images into sub-dimensional components, enabling subquery-aware retrieval and generation. Our method introduces a hybrid retrieval strategy - combining a sub-dimensional sparse retriever with a dense retriever - to identify a Pareto-optimal set of images, each contributing complementary aspects of the query. During generation, a multimodal large language model is guided to selectively condition on relevant visual features aligned to specific subqueries, ensuring subquery-aware image synthesis. Extensive experiments on MS-COCO, Flickr30K, WikiArt, CUB, and ImageNet-LT demonstrate that Cross-modal RAG significantly outperforms existing baselines in both retrieval and generation quality, while maintaining high efficiency.
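To make the idea of a Pareto-optimal image set concrete: if each candidate image has one relevance score per sub-query, an image belongs to the set when no other image is at least as good on every sub-query and strictly better on one. The sketch below illustrates only that selection rule, not the paper's retriever. The image ids, the example query, and the score values are hypothetical.

```python
from typing import Dict, List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is at least as good as b on every sub-query and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_set(scores: Dict[str, Sequence[float]]) -> List[str]:
    """Return the ids of images whose score vectors no other image dominates."""
    return [
        img
        for img, s in scores.items()
        if not any(dominates(t, s) for other, t in scores.items() if other != img)
    ]


# Hypothetical scores for the sub-queries ("red bird", "snowy", "branch").
scores = {
    "img_a": (0.9, 0.1, 0.2),  # strong on "red bird" only
    "img_b": (0.2, 0.8, 0.7),  # strong on "snowy" and "branch"
    "img_c": (0.1, 0.1, 0.1),  # dominated by img_a, img_b and img_d
    "img_d": (0.5, 0.5, 0.5),  # balanced, not dominated by any single image
}
selected = pareto_set(scores)
print(selected)  # ['img_a', 'img_b', 'img_d']
```

The selected images cover complementary sub-queries, which is the property the abstract attributes to the retrieved set.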

Title: One-Way Ticket:Time-Independent Unified Encoder for Distilling Text-to-Image Diffusion Models

Authors: Senmao Li, Lei Wang, Kai Wang, Tao Liu, Jiehang Xie, Joost van de Weijer, Fahad Shahbaz Khan, Shiqi Yang, Yaxing Wang, Jian Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.21960
Pdf URL: https://arxiv.org/pdf/2505.21960
Copy Paste: [[2505.21960]] One-Way Ticket:Time-Independent Unified Encoder for Distilling Text-to-Image Diffusion Models(https://arxiv.org/abs/2505.21960)
Keywords: generation, generative
Abstract: Text-to-Image (T2I) diffusion models have made remarkable advancements in generative modeling; however, they face a trade-off between inference speed and image quality, posing challenges for efficient deployment. Existing distilled T2I models can generate high-fidelity images with fewer sampling steps, but often struggle with diversity and quality, especially in one-step models. From our analysis, we observe redundant computations in the UNet encoders. Our findings suggest that, for T2I diffusion models, decoders are more adept at capturing richer and more explicit semantic information, while encoders can be effectively shared across decoders from diverse time steps. Based on these observations, we introduce the first Time-independent Unified Encoder (TiUE) for the student model UNet architecture, which is a loop-free image generation approach for distilling T2I diffusion models. Using a one-pass scheme, TiUE shares encoder features across multiple decoder time steps, enabling parallel sampling and significantly reducing inference time complexity. In addition, we incorporate a KL divergence term to regularize noise prediction, which enhances the perceptual realism and diversity of the generated images. Experimental results demonstrate that TiUE outperforms state-of-the-art methods, including LCM, SD-Turbo, and SwiftBrushv2, producing more diverse and realistic results while maintaining computational efficiency.

Title: DvD: Unleashing a Generative Paradigm for Document Dewarping via Coordinates-based Diffusion Model

Authors: Weiguang Zhang, Huangcheng Lu, Maizhen Ning, Xiaowei Huang, Wei Wang, Kaizhu Huang, Qiufeng Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.21975
Pdf URL: https://arxiv.org/pdf/2505.21975
Copy Paste: [[2505.21975]] DvD: Unleashing a Generative Paradigm for Document Dewarping via Coordinates-based Diffusion Model(https://arxiv.org/abs/2505.21975)
Keywords: generative
Abstract: Document dewarping aims to rectify deformations in photographic document images, thus improving text readability, which has attracted much attention and made great progress, but it is still challenging to preserve document structures. Given recent advances in diffusion models, it is natural for us to consider their potential applicability to document dewarping. However, it is far from straightforward to adopt diffusion models in document dewarping due to their unfaithful control on highly complex document images (e.g., 2000×3000 resolution). In this paper, we propose DvD, the first generative model to tackle document Dewarping via a Diffusion framework. To be specific, DvD introduces a coordinate-level denoising instead of typical pixel-level denoising, generating a mapping for deformation rectification. In addition, we further propose a time-variant condition refinement mechanism to enhance the preservation of document structures. In experiments, we find that current document dewarping benchmarks can not evaluate dewarping models comprehensively. To this end, we present AnyPhotoDoc6300, a rigorously designed large-scale document dewarping benchmark comprising 6,300 real image pairs across three distinct domains, enabling fine-grained evaluation of dewarping models. Comprehensive experiments demonstrate that our proposed DvD can achieve state-of-the-art performance with acceptable computational efficiency on multiple metrics across various benchmarks including DocUNet, DIR300, and AnyPhotoDoc6300. The new benchmark and code will be publicly available.

Title: Two-Stage Feature Generation with Transformer and Reinforcement Learning

Authors: Wanfu Gao, Zengyao Man, Zebin He, Yuhao Tang, Jun Gao, Kunpeng Liu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.21978
Pdf URL: https://arxiv.org/pdf/2505.21978
Copy Paste: [[2505.21978]] Two-Stage Feature Generation with Transformer and Reinforcement Learning(https://arxiv.org/abs/2505.21978)
Keywords: generation
Abstract: Feature generation is a critical step in machine learning, aiming to enhance model performance by capturing complex relationships within the data and generating meaningful new features. Traditional feature generation methods heavily rely on domain expertise and manual intervention, making the process labor-intensive and challenging to adapt to different scenarios. Although automated feature generation techniques address these issues to some extent, they often face challenges such as feature redundancy, inefficiency in feature space exploration, and limited adaptability to diverse datasets and tasks. To address these problems, we propose a Two-Stage Feature Generation (TSFG) framework, which integrates a Transformer-based encoder-decoder architecture with Proximal Policy Optimization (PPO). The encoder-decoder model in TSFG leverages the Transformer's self-attention mechanism to efficiently represent and transform features, capturing complex dependencies within the data. PPO further enhances TSFG by dynamically adjusting the feature generation strategy based on task-specific feedback, optimizing the process for improved performance and adaptability. TSFG dynamically generates high-quality feature sets, significantly improving the predictive performance of machine learning models. Experimental results demonstrate that TSFG outperforms existing state-of-the-art methods in terms of feature quality and adaptability.

Title: Learning World Models for Interactive Video Generation

Authors: Taiye Chen, Xun Hu, Zihan Ding, Chi Jin
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.21996
Pdf URL: https://arxiv.org/pdf/2505.21996
Copy Paste: [[2505.21996]] Learning World Models for Interactive Video Generation(https://arxiv.org/abs/2505.21996)
Keywords: generation
Abstract: Foundational world models must be both interactive and preserve spatiotemporal coherence for effective future planning with action choices. However, present models for long video generation have limited inherent world modeling capabilities due to two main challenges: compounding errors and insufficient memory mechanisms. We enhance image-to-video models with interactive capabilities through additional action conditioning and an autoregressive framework, and reveal that compounding error is inherently irreducible in autoregressive video generation, while insufficient memory mechanisms lead to incoherence of world models. We propose video retrieval augmented generation (VRAG) with explicit global state conditioning, which significantly reduces long-term compounding errors and increases spatiotemporal consistency of world models. In contrast, naive autoregressive generation with extended context windows and retrieval-augmented generation prove less effective for video generation, primarily due to the limited in-context learning capabilities of current video models. Our work illuminates the fundamental challenges in video world models and establishes a comprehensive benchmark for improving video generation models with internal world modeling capabilities.

Title: PanoWan: Lifting Diffusion Video Generation Models to 360° with Latitude/Longitude-aware Mechanisms

Authors: Yifei Xia, Shuchen Weng, Siqi Yang, Jingqi Liu, Chengxuan Zhu, Minggui Teng, Zijian Jia, Han Jiang, Boxin Shi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.22016
Pdf URL: https://arxiv.org/pdf/2505.22016
Copy Paste: [[2505.22016]] PanoWan: Lifting Diffusion Video Generation Models to 360° with Latitude/Longitude-aware Mechanisms(https://arxiv.org/abs/2505.22016)
Keywords: generation, generative
Abstract: Panoramic video generation enables immersive 360° content creation, valuable in applications that demand scene-consistent world exploration. However, existing panoramic video generation models struggle to leverage pre-trained generative priors from conventional text-to-video models for high-quality and diverse panoramic video generation, due to limited dataset scale and the gap in spatial feature representations. In this paper, we introduce PanoWan to effectively lift pre-trained text-to-video models to the panoramic domain, equipped with minimal modules. PanoWan employs latitude-aware sampling to avoid latitudinal distortion, while its rotated semantic denoising and padded pixel-wise decoding ensure seamless transitions at longitude boundaries. To provide sufficient panoramic videos for learning these lifted representations, we contribute PanoVid, a high-quality panoramic video dataset with captions and diverse scenarios. Consequently, PanoWan achieves state-of-the-art performance in panoramic video generation and demonstrates robustness for zero-shot downstream tasks.

Title: GL-PGENet: A Parameterized Generation Framework for Robust Document Image Enhancement

Authors: Zhihong Tang, Yang Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.22021
Pdf URL: https://arxiv.org/pdf/2505.22021
Copy Paste: [[2505.22021]] GL-PGENet: A Parameterized Generation Framework for Robust Document Image Enhancement(https://arxiv.org/abs/2505.22021)
Keywords: restoration, generation
Abstract: Document Image Enhancement (DIE) serves as a critical component in Document AI systems, where its performance substantially determines the effectiveness of downstream tasks. To address the limitations of existing methods confined to single-degradation restoration or grayscale image processing, we present Global with Local Parametric Generation Enhancement Network (GL-PGENet), a novel architecture designed for multi-degraded color document images, ensuring both efficiency and robustness in real-world scenarios. Our solution incorporates three key innovations: First, a hierarchical enhancement framework that integrates global appearance correction with local refinement, enabling coarse-to-fine quality improvement. Second, a Dual-Branch Local-Refine Network with parametric generation mechanisms that replaces conventional direct prediction, producing enhanced outputs through learned intermediate parametric representations rather than pixel-wise mapping. This approach enhances local consistency while improving model generalization. Finally, a modified NestUNet architecture incorporating dense block to effectively fuse low-level pixel features and high-level semantic features, specifically adapted for document image characteristics. In addition, to enhance generalization performance, we adopt a two-stage training strategy: large-scale pretraining on a synthetic dataset of 500,000+ samples followed by task-specific fine-tuning. Extensive experiments demonstrate the superiority of GL-PGENet, achieving state-of-the-art SSIM scores of 0.7721 on DocUNet and 0.9480 on RealDAE. The model also exhibits remarkable cross-domain adaptability and maintains computational efficiency for high-resolution images without performance degradation, confirming its practical utility in real-world scenarios.

Title: OmniAD: Detect and Understand Industrial Anomaly via Multimodal Reasoning

Authors: Shifang Zhao, Yiheng Lin, Lu Han, Yao Zhao, Yunchao Wei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.22039
Pdf URL: https://arxiv.org/pdf/2505.22039
Copy Paste: [[2505.22039]] OmniAD: Detect and Understand Industrial Anomaly via Multimodal Reasoning(https://arxiv.org/abs/2505.22039)
Keywords: generation
Abstract: While anomaly detection has made significant progress, generating detailed analyses that incorporate industrial knowledge remains a challenge. To address this gap, we introduce OmniAD, a novel framework that unifies anomaly detection and understanding for fine-grained analysis. OmniAD is a multimodal reasoner that combines visual and textual reasoning processes. The visual reasoning provides detailed inspection by leveraging Text-as-Mask Encoding to perform anomaly detection through text generation without manually selected thresholds. Following this, Visual Guided Textual Reasoning conducts comprehensive analysis by integrating visual perception. To enhance few-shot generalization, we employ an integrated training strategy that combines supervised fine-tuning (SFT) with reinforcement learning (GRPO), incorporating three sophisticated reward functions. Experimental results demonstrate that OmniAD achieves a performance of 79.1 on the MMAD benchmark, surpassing models such as Qwen2.5-VL-7B and GPT-4o. It also shows strong results across multiple anomaly detection benchmarks. These results highlight the importance of enhancing visual perception for effective reasoning in anomaly understanding. All codes and models will be publicly available.

Title: Detecting Undesired Process Behavior by Means of Retrieval Augmented Generation

Authors: Michael Grohs, Adrian Rebmann, Jana-Rebecca Rehse
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.22041
Pdf URL: https://arxiv.org/pdf/2505.22041
Copy Paste: [[2505.22041]] Detecting Undesired Process Behavior by Means of Retrieval Augmented Generation(https://arxiv.org/abs/2505.22041)
Keywords: generation
Abstract: Conformance checking techniques detect undesired process behavior by comparing process executions that are recorded in event logs to desired behavior that is captured in a dedicated process model. If such models are not available, conformance checking techniques are not applicable, but organizations might still be interested in detecting undesired behavior in their processes. To enable this, existing approaches use Large Language Models (LLMs), assuming that they can learn to distinguish desired from undesired behavior through fine-tuning. However, fine-tuning is highly resource-intensive and the fine-tuned LLMs often do not generalize well. To address these limitations, we propose an approach that requires neither a dedicated process model nor resource-intensive fine-tuning to detect undesired process behavior. Instead, we use Retrieval Augmented Generation (RAG) to provide an LLM with direct access to a knowledge base that contains both desired and undesired process behavior from other processes, assuming that the LLM can transfer this knowledge to the process at hand. Our evaluation shows that our approach outperforms fine-tuned LLMs in detecting undesired behavior, demonstrating that RAG is a viable alternative to resource-intensive fine-tuning, particularly when enriched with relevant context from the event log, such as frequent traces and activities.

Title: LatentMove: Towards Complex Human Movement Video Generation

Authors: Ashkan Taghipour, Morteza Ghahremani, Mohammed Bennamoun, Farid Boussaid, Aref Miri Rekavandi, Zinuo Li, Qiuhong Ke, Hamid Laga
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.22046
Pdf URL: https://arxiv.org/pdf/2505.22046
Copy Paste: [[2505.22046]] LatentMove: Towards Complex Human Movement Video Generation(https://arxiv.org/abs/2505.22046)
Keywords: generation
Abstract: Image-to-video (I2V) generation seeks to produce realistic motion sequences from a single reference image. Although recent methods exhibit strong temporal consistency, they often struggle when dealing with complex, non-repetitive human movements, leading to unnatural deformations. To tackle this issue, we present LatentMove, a DiT-based framework specifically tailored for highly dynamic human animation. Our architecture incorporates a conditional control branch and learnable face/body tokens to preserve consistency as well as fine-grained details across frames. We introduce Complex-Human-Videos (CHV), a dataset featuring diverse, challenging human motions designed to benchmark the robustness of I2V systems. We also introduce two metrics to assess the flow and silhouette consistency of generated videos with their ground truth. Experimental results indicate that LatentMove substantially improves human animation quality--particularly when handling rapid, intricate movements--thereby pushing the boundaries of I2V generation. The code, the CHV dataset, and the evaluation metrics will be available at this https URL.

Title: Differentiable Generalized Sliced Wasserstein Plans

Authors: Laetitia Chapel, Romain Tavenard, Samuel Vaiter
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.22049
Pdf URL: https://arxiv.org/pdf/2505.22049
Copy Paste: [[2505.22049]] Differentiable Generalized Sliced Wasserstein Plans(https://arxiv.org/abs/2505.22049)
Keywords: generation
Abstract: Optimal Transport (OT) has attracted significant interest in the machine learning community, not only for its ability to define meaningful distances between probability distributions -- such as the Wasserstein distance -- but also for its formulation of OT plans. Its computational complexity remains a bottleneck, though, and slicing techniques have been developed to scale OT to large datasets. Recently, a novel slicing scheme, dubbed min-SWGG, lifts a single one-dimensional plan back to the original multidimensional space, finally selecting the slice that yields the lowest Wasserstein distance as an approximation of the full OT plan. Despite its computational and theoretical advantages, min-SWGG inherits typical limitations of slicing methods: (i) the number of required slices grows exponentially with the data dimension, and (ii) it is constrained to linear projections. Here, we reformulate min-SWGG as a bilevel optimization problem and propose a differentiable approximation scheme to efficiently identify the optimal slice, even in high-dimensional settings. We furthermore define its generalized extension to accommodate data living on manifolds. Finally, we demonstrate the practical value of our approach in various applications, including gradient flows on manifolds and high-dimensional spaces, as well as a novel sliced OT-based conditional flow matching for image generation -- where fast computation of transport plans is essential.
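The slicing scheme described in the abstract can be illustrated for two point clouds of equal size: project both onto a direction, pair the points by their rank along that direction, score the pairing with squared Euclidean distances in the original space, and keep the cheapest slice. The sketch below searches over random directions, which is a simplification assumed for illustration; the paper instead proposes a differentiable scheme for finding the slice. The function names and the number of slices are hypothetical.

```python
import math
import random


def lifted_cost(xs, ys, theta):
    """Pair points by rank along theta and score the pairing in the full space."""
    def proj(p):
        return sum(a * b for a, b in zip(p, theta))

    order_x = sorted(range(len(xs)), key=lambda i: proj(xs[i]))
    order_y = sorted(range(len(ys)), key=lambda j: proj(ys[j]))
    pairs = list(zip(order_x, order_y))  # the 1D plan, expressed as index pairs
    cost = sum(
        sum((a - b) ** 2 for a, b in zip(xs[i], ys[j])) for i, j in pairs
    ) / len(xs)
    return cost, pairs


def min_slice_plan(xs, ys, n_slices=64, seed=0):
    """Random search over directions; returns the cheapest (cost, pairs) found."""
    rng = random.Random(seed)
    dim = len(xs[0])
    best = None
    for _ in range(n_slices):
        g = [rng.gauss(0, 1) for _ in range(dim)]
        norm = math.sqrt(sum(v * v for v in g))
        theta = [v / norm for v in g]  # unit direction defining the slice
        candidate = lifted_cost(xs, ys, theta)
        if best is None or candidate[0] < best[0]:
            best = candidate
    return best


# Sanity check: if ys is a translated copy of xs, rank pairing recovers the translation.
rng = random.Random(3)
xs = [(rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(6)]
ys = [(x + 1.0, y - 2.0, z + 0.5) for x, y, z in xs]
cost, pairs = min_slice_plan(xs, ys)
print(cost, sorted(pairs))
```

Because every lifted pairing is a valid transport plan, the returned cost is an upper bound on the squared 2-Wasserstein distance between the two empirical measures; in the translation example the bound is tight.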

Title: AquaMonitor: A multimodal multi-view image sequence dataset for real-life aquatic invertebrate biodiversity monitoring

Authors: Mikko Impiö, Philipp M. Rehsen, Tiina Laamanen, Arne J. Beermann, Florian Leese, Jenni Raitoharju
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.22065
Pdf URL: https://arxiv.org/pdf/2505.22065
Copy Paste: [[2505.22065]] AquaMonitor: A multimodal multi-view image sequence dataset for real-life aquatic invertebrate biodiversity monitoring(https://arxiv.org/abs/2505.22065)
Keywords: quality assessment
Abstract: This paper presents the AquaMonitor dataset, the first large computer vision dataset of aquatic invertebrates collected during routine environmental monitoring. While several large species identification datasets exist, they are rarely collected using standardized collection protocols, and none focus on aquatic invertebrates, which are particularly laborious to collect. For AquaMonitor, we imaged all specimens from two years of monitoring whenever imaging was possible given practical limitations. The dataset enables the evaluation of automated identification methods for real-life monitoring purposes using a realistically challenging and unbiased setup. The dataset has 2.7M images from 43,189 specimens, DNA sequences for 1358 specimens, and dry mass and size measurements for 1494 specimens, which also makes it one of the largest biological multi-view and multimodal datasets to date. We define three benchmark tasks and provide strong baselines for these: 1) Monitoring benchmark, reflecting real-life deployment challenges such as open-set recognition, distribution shift, and extreme class imbalance, 2) Classification benchmark, which follows a standard fine-grained visual categorization setup, and 3) Few-shot benchmark, which targets classes with only a few training examples from very fine-grained categories. Advancements on the Monitoring benchmark can directly translate to improvement of aquatic biodiversity monitoring, which is an important component of regular legislative water quality assessment in many countries.
摘要：本文介绍了Aquamonitor数据集，这是在常规环境监测过程中收集的水生无脊椎动物的第一个大型计算机视觉数据集。尽管存在几个大物种识别数据集，但很少使用标准化的收集方案收集它们，并且没有一个集中于水生无脊椎动物，这些无脊椎动物特别费力地收集。对于Aquamonitor，我们在可能的局限性的情况下，在可能的成像时进行了两年的监视。该数据集可以使用现实挑战性和无偏置的设置来评估自动识别方法，以实现现实生活监视。该数据集的图像具有43,189个样品，1358个样品的DNA序列以及1494个样品的干质量和尺寸测量值，也是迄今为止最大的生物多视图和多模式数据集之一。我们定义了三个基准任务，并为这些任务提供了强大的基准：1）监视基准测试，反映了现实生活中的部署挑战，例如开放式识别，分配变化和极端类别的不平衡，2）分类基准，该基准遵循标准的细胞元素良好的视觉分类设置，以及3）几个针对少数培训的训练的训练型的，仅针对少数培训的训练。监测基准的进步可以直接转化为改善水生生物多样性监测，这是许多国家常规立法水质评估的重要组成部分。

Title: From Failures to Fixes: LLM-Driven Scenario Repair for Self-Evolving Autonomous Driving

Authors: Xinyu Xia, Xingjun Ma, Yunfeng Hu, Ting Qu, Hong Chen, Xun Gong
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2505.22067
Pdf URL: https://arxiv.org/pdf/2505.22067
Copy Paste: [[2505.22067]] From Failures to Fixes: LLM-Driven Scenario Repair for Self-Evolving Autonomous Driving(https://arxiv.org/abs/2505.22067)
Keywords: generation
Abstract: Ensuring robust and generalizable autonomous driving requires not only broad scenario coverage but also efficient repair of failure cases, particularly those related to challenging and safety-critical scenarios. However, existing scenario generation and selection methods often lack adaptivity and semantic relevance, limiting their impact on performance improvement. In this paper, we propose SERA, an LLM-powered framework that enables autonomous driving systems to self-evolve by repairing failure cases through targeted scenario recommendation. By analyzing performance logs, SERA identifies failure patterns and dynamically retrieves semantically aligned scenarios from a structured bank. An LLM-based reflection mechanism further refines these recommendations to maximize relevance and diversity. The selected scenarios are used for few-shot fine-tuning, enabling targeted adaptation with minimal data. Experiments on the benchmark show that SERA consistently improves key metrics across multiple autonomous driving baselines, demonstrating its effectiveness and generalizability under safety-critical conditions.
摘要：确保健壮且可推广的自主驾驶不仅需要广泛的场景覆盖范围，还需要对失败案件的有效修复，尤其是与具有挑战性和关键安全情景有关的案例。但是，现有的场景生成和选择方法通常缺乏适应性和语义相关性，从而限制了它们对绩效提高的影响。在本文中，我们提出了SERA，这是一个由LLM驱动的框架，使自主驾驶系统能够通过针对性的方案建议通过修复故障案例来自我发展。通过分析性能日志，SERA可以识别故障模式，并从结构化库中动态地检索语义对齐的方案。基于LLM的反射机制进一步完善了这些建议，以最大程度地提高相关性和多样性。选定的方案用于几次微调，从而实现了具有最小数据的目标适应性。基准上的实验表明，SERA始终改善多个自动驾驶基线的关键指标，表明其在安全至关重要条件下的有效性和概括性。

Title: SridBench: Benchmark of Scientific Research Illustration Drawing of Image Generation Model

Authors: Yifan Chang, Yukang Feng, Jianwen Sun, Jiaxin Ai, Chuanhao Li, S. Kevin Zhou, Kaipeng Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.22126
Pdf URL: https://arxiv.org/pdf/2505.22126
Copy Paste: [[2505.22126]] SridBench: Benchmark of Scientific Research Illustration Drawing of Image Generation Model(https://arxiv.org/abs/2505.22126)
Keywords: generation
Abstract: Recent years have seen rapid advances in AI-driven image generation. Early diffusion models emphasized perceptual quality, while newer multimodal models like GPT-4o-image integrate high-level reasoning, improving semantic understanding and structural composition. Scientific illustration generation exemplifies this evolution: unlike general image synthesis, it demands accurate interpretation of technical content and transformation of abstract ideas into clear, standardized visuals. This task is significantly more knowledge-intensive and laborious, often requiring hours of manual work and specialized tools. Automating it in a controllable, intelligent manner would provide substantial practical value. Yet, no benchmark currently exists to evaluate AI on this front. To fill this gap, we introduce SridBench, the first benchmark for scientific figure generation. It comprises 1,120 instances curated from leading scientific papers across 13 natural and computer science disciplines, collected via human experts and MLLMs. Each sample is evaluated along six dimensions, including semantic fidelity and structural accuracy. Experimental results reveal that even top-tier models like GPT-4o-image lag behind human performance, with common issues in text/visual clarity and scientific correctness. These findings highlight the need for more advanced reasoning-driven visual generation capabilities.
摘要：近年来，AI驱动的图像产生迅速发展。早期扩散模型强调了感知质量，而诸如GPT-4O图像（GPT-4O图像）的新型多模型则整合了高级推理，从而改善了语义理解和结构组成。科学插图的生成例证了这一演变：与一般图像合成不同，它需要对技术内容进行准确的解释以及将抽象思想转化为清晰，标准化的视觉效果。这项任务大大增加了知识密集型和费力，通常需要数小时的手动工作和专业工具。以可控制的，智能的方式自动化它将提供实质性的实际价值。但是，目前尚无基准测试在这方面评估AI。为了填补这一空白，我们介绍了Sridbench，这是科学人物生成的第一个基准。它包括1,120个实例，这些实例是根据通过人类专家和MLLM收集的13个自然和计算机科学学科的领先科学论文策划的。每个样本沿六个维度进行评估，包括语义保真度和结构准确性。实验结果表明，即使是人类绩效之后的顶级模型，也是GPT-4O图像滞后的滞后，在文本/视觉清晰度和科学正确性方面存在常见问题。这些发现凸显了需要更高级推理驱动的视觉生成能力。

Title: Real-Time Blind Defocus Deblurring for Earth Observation: The IMAGIN-e Mission Approach

Authors: Alejandro D. Mousist
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.22128
Pdf URL: https://arxiv.org/pdf/2505.22128
Copy Paste: [[2505.22128]] Real-Time Blind Defocus Deblurring for Earth Observation: The IMAGIN-e Mission Approach(https://arxiv.org/abs/2505.22128)
Keywords: restoration
Abstract: This work addresses mechanical defocus in Earth observation images from the IMAGIN-e mission aboard the ISS, proposing a blind deblurring approach adapted to space-based edge computing constraints. Leveraging Sentinel-2 data, our method estimates the defocus kernel and trains a restoration model within a GAN framework, effectively operating without reference images. On Sentinel-2 images with synthetic degradation, SSIM improved by 72.47% and PSNR by 25.00%, confirming the model's ability to recover lost details when the original clean image is known. On IMAGIN-e, where no reference images exist, perceptual quality metrics indicate a substantial enhancement, with NIQE improving by 60.66% and BRISQUE by 48.38%, validating real-world onboard restoration. The approach is currently deployed aboard the IMAGIN-e mission, demonstrating its practical application in an operational space environment. By efficiently handling high-resolution images under edge computing constraints, the method enables applications such as water body segmentation and contour detection while maintaining processing viability despite resource limitations.
摘要：这项工作解决了ISS上的Imagin-E Mission的地球观察图像中的机械散焦，提出了一种适合于空间边缘计算约束的盲目脱毛方法。利用Sentinel-2数据，我们的方法估算了DeFocus内核并在GAN框架内训练一个恢复模型，实际上在没有参考图像的情况下运行。在带有合成降解的Sentinel-2图像上，SSIM提高了72.47％，PSNR提高了25.00％，这证实了该模型在已知原始清洁图像时恢复丢失的细节的能力。在不存在参考图像的Imagin-E上，感知质量指标表明实质性增强，NIQE提高了60.66％，而Brisque则提高了48.38％，从而验证了现实世界的现实世界恢复。该方法目前已在Imagin-E任务中部署，并在操作空间环境中证明了其实际应用。通过在边缘计算限制下有效处理高分辨率图像，该方法启用了诸如水体分割和轮廓检测等应用，同时尽管资源限制了，但仍保持处理可行性。

Title: What Makes for Text to 360-degree Panorama Generation with Stable Diffusion?

Authors: Jinhong Ni, Chang-Bin Zhang, Qiang Zhang, Jing Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.22129
Pdf URL: https://arxiv.org/pdf/2505.22129
Copy Paste: [[2505.22129]] What Makes for Text to 360-degree Panorama Generation with Stable Diffusion?(https://arxiv.org/abs/2505.22129)
Keywords: generation
Abstract: The recent success of text-to-image diffusion models, e.g., Stable Diffusion, has stimulated research on adapting them to 360-degree panorama generation. Prior work has demonstrated the feasibility of using conventional low-rank adaptation techniques on pre-trained diffusion models to generate panoramic images. However, the substantial domain gap between perspective and panoramic images raises questions about the underlying mechanisms enabling this empirical success. We hypothesize, and then examine, that the trainable counterparts exhibit distinct behaviors when fine-tuned on panoramic data, and that such an adaptation conceals some intrinsic mechanism to leverage the prior knowledge within the pre-trained diffusion models. Our analysis reveals the following: 1) the query and key matrices in the attention modules are responsible for common information that can be shared between the panoramic and perspective domains, and are thus less relevant to panorama generation; and 2) the value and output weight matrices specialize in adapting pre-trained knowledge to the panoramic domain, playing a more critical role during fine-tuning for panorama generation. We empirically verify these insights by introducing a simple framework called UniPano, with the objective of establishing an elegant baseline for future research. UniPano not only outperforms existing methods but also significantly reduces memory usage and training time compared to prior dual-branch approaches, making it scalable for end-to-end panorama generation with higher resolution. The code will be released.
摘要：文本到图像扩散模型的最新繁荣，例如稳定的扩散，激发了研究以使其适应360度全景。先前的工作表明，在预训练的扩散模型上使用常规的低级适应技术来生成全景图像。但是，透视图和全景图像之间的实质领域差距提出了有关使这种经验成功的潜在机制的问题。我们假设并检查可训练的对应物在对全景数据进行微调时表现出不同的行为，并且这种适应性掩盖了某些内在机制，以利用预先训练的扩散模型中的先验知识。我们的分析揭示了以下内容：1）注意模块中的查询和关键矩阵负责可以在全景和透视域之间共享的共同信息，因此与全景的生成不相关； 2）价值和输出权重矩阵专门针对全景域的预训练知识，在全景生成的微调过程中起着更关键的作用。我们通过引入一个名为unipano的简单框架来验证这些见解，目的是为未来的研究建立优雅的基准。与先前的双支分支方法相比，Unipano不仅胜过现有方法，而且显着缩短了记忆使用和训练时间，这使得对具有更高分辨率的端到端全景生成可扩展。代码将发布。

Title: FaceEditTalker: Interactive Talking Head Generation with Facial Attribute Editing

Authors: Guanwen Feng, Zhiyuan Ma, Yunan Li, Junwei Jing, Jiahao Yang, Qiguang Miao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.22141
Pdf URL: https://arxiv.org/pdf/2505.22141
Copy Paste: [[2505.22141]] FaceEditTalker: Interactive Talking Head Generation with Facial Attribute Editing(https://arxiv.org/abs/2505.22141)
Keywords: generation
Abstract: Recent advances in audio-driven talking head generation have achieved impressive results in lip synchronization and emotional expression. However, they largely overlook the crucial task of facial attribute editing. This capability is crucial for achieving deep personalization and expanding the range of practical applications, including user-tailored digital avatars, engaging online education content, and brand-specific digital customer service. In these key domains, the flexible adjustment of visual attributes, such as hairstyle, accessories, and subtle facial features, is essential for aligning with user preferences, reflecting diverse brand identities, and adapting to varying contextual demands. In this paper, we present FaceEditTalker, a unified framework that enables controllable facial attribute manipulation while generating high-quality, audio-synchronized talking head videos. Our method consists of two key components: an image feature space editing module, which extracts semantic and detail features and allows flexible control over attributes like expression, hairstyle, and accessories; and an audio-driven video generation module, which fuses these edited features with audio-guided facial landmarks to drive a diffusion-based generator. This design ensures temporal coherence, visual fidelity, and identity preservation across frames. Extensive experiments on public datasets demonstrate that our method outperforms state-of-the-art approaches in lip-sync accuracy, video quality, and attribute controllability. Project page: this https URL
摘要：在音频驱动的谈话校长的最新进展方面取得了令人印象深刻的唇部同步和情感表达的结果。但是，他们在很大程度上忽略了面部属性编辑的关键任务。此功能对于实现深刻的个性化和扩大实际应用程序的范围至关重要，包括用户销售的数字化身，吸引在线教育内容以及特定品牌的数字客户服务。在这些关键领域中，视觉属性的灵活调整，例如发型，配件和微妙的面部特征对于与用户偏好保持一致，反映不同的品牌标识并适应各种上下文需求。在本文中，我们介绍了面对访问者，这是一个统一的框架，可以在产生高质量的，音频同步的会话视频的同时进行可控的面部属性操作。我们的方法由两个关键组成部分组成：图像特征空间编辑模块，该模块提取语义和详细的特征，并可以灵活地控制表达，发型和配件等属性；以及音频驱动的视频生成模块，该模块将这些编辑的功能与音频引导的面部标志融合在一起，以驱动基于扩散的发电机。这种设计确保了跨帧的时间连贯性，视觉保真度和身份保存。公共数据集上的广泛实验表明，我们的方法在LIP-同步精度，视频质量和属性可控性方面的表现优于最先进的方法。项目页面：此HTTPS URL

Title: Q-VDiT: Towards Accurate Quantization and Distillation of Video-Generation Diffusion Transformers

Authors: Weilun Feng, Chuanguang Yang, Haotong Qin, Xiangqi Li, Yu Wang, Zhulin An, Libo Huang, Boyu Diao, Zixiang Zhao, Yongjun Xu, Michele Magno
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.22167
Pdf URL: https://arxiv.org/pdf/2505.22167
Copy Paste: [[2505.22167]] Q-VDiT: Towards Accurate Quantization and Distillation of Video-Generation Diffusion Transformers(https://arxiv.org/abs/2505.22167)
Keywords: generation
Abstract: Diffusion transformers (DiT) have demonstrated exceptional performance in video generation. However, their large number of parameters and high computational complexity limit their deployment on edge devices. Quantization can reduce storage requirements and accelerate inference by lowering the bit-width of model parameters. Yet, existing quantization methods for image generation models do not generalize well to video generation tasks. We identify two primary challenges: the loss of information during quantization and the misalignment between optimization objectives and the unique requirements of video generation. To address these challenges, we present Q-VDiT, a quantization framework specifically designed for video DiT models. From the quantization perspective, we propose the Token-aware Quantization Estimator (TQE), which compensates for quantization errors in both the token and feature dimensions. From the optimization perspective, we introduce Temporal Maintenance Distillation (TMD), which preserves the spatiotemporal correlations between frames and enables the optimization of each frame with respect to the overall video context. Our W3A6 Q-VDiT achieves a scene consistency of 23.40, setting a new benchmark and outperforming current state-of-the-art quantization methods by 1.9$\times$. Code will be available at this https URL.
摘要：扩散变压器（DIT）在视频生成中表现出了出色的性能。但是，它们的大量参数和高计算复杂性限制了它们在边缘设备上的部署。量化可以通过降低模型参数的位宽度来降低存储需求并加速推断。但是，现有的图像生成模型的量化方法并不能很好地推广到视频生成任务。我们确定了两个主要挑战：量化过程中信息丢失以及优化目标与视频生成的独特要求之间的错位。为了应对这些挑战，我们提出了Q-VDIT，这是一个专门为视频DIT模型设计的量化框架。从量化的角度来看，我们提出了令牌感知的量化估计量（TQE），该估计器可以补偿令牌和特征维度的量化误差。从优化的角度来看，我们引入了时间维护蒸馏（TMD），该蒸馏（TMD）保留了帧之间的时空相关性，并可以优化每个帧相对于整体视频上下文。我们的W3A6 Q-VDIT实现了23.40的场景一致性，将新的基准测试和优于当前的最新量化方法提高了1.9 $ \ times $。代码将在此HTTPS URL上可用。
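The abstract above describes quantization as lowering the bit-width of model parameters, and W3A6 is conventionally read as 3-bit weights with 6-bit activations. A minimal sketch of a generic symmetric uniform quantizer follows to make the storage/precision trade-off concrete. It is an illustrative assumption, not Q-VDiT's TQE or TMD, and the names `quantize_symmetric` and `dequantize` are hypothetical.

```python
def quantize_symmetric(values, bits):
    """Map floats to signed integer codes on a uniform grid.

    With `bits` bits the integer codes lie in [-qmax, qmax], where
    qmax = 2**(bits - 1) - 1 (for 3 bits: -3..3).
    """
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float values from integer codes."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    weights = [0.75, -0.25, 0.0625, -0.5]
    codes, scale = quantize_symmetric(weights, bits=3)
    restored = dequantize(codes, scale)
    errors = [abs(w - r) for w, r in zip(weights, restored)]
    print(codes, scale, restored, max(errors))
```

The rounding error per value is bounded by half the grid step (`scale / 2`), so fewer bits mean a coarser grid and larger error; this is the information loss that the abstract says dedicated estimators are meant to compensate for.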

Title: A Survey on Training-free Open-Vocabulary Semantic Segmentation

Authors: Naomi Kombol, Ivan Martinović, Siniša Šegvić
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.22209
Pdf URL: https://arxiv.org/pdf/2505.22209
Copy Paste: [[2505.22209]] A Survey on Training-free Open-Vocabulary Semantic Segmentation(https://arxiv.org/abs/2505.22209)
Keywords: generative
Abstract: Semantic segmentation is one of the most fundamental tasks in image understanding, with a long history of research and, consequently, a myriad of different approaches. Traditional methods strive to train models from scratch, requiring vast amounts of computational resources and training data. With the move to open-vocabulary semantic segmentation, which asks models to classify beyond learned categories, large quantities of finely annotated data would be prohibitively expensive. Researchers have instead turned to training-free methods, in which they leverage existing models made for tasks where data is more easily acquired. Specifically, this survey will cover the history, nuance, idea development and the state-of-the-art in training-free open-vocabulary semantic segmentation that leverages existing multi-modal classification models. We will first give preliminaries on the task definition, followed by an overview of popular model archetypes, and then spotlight over 30 approaches split into broader research branches: purely CLIP-based, those leveraging auxiliary visual foundation models and ones relying on generative methods. Subsequently, we will discuss the limitations and potential problems of current research, as well as provide some underexplored ideas for future study. We believe this survey will serve as a good onboarding read for new researchers and spark increased interest in the area.
摘要：语义细分是图像理解中最根本的任务之一，随后是许多不同的方法。传统方法努力从头开始训练模型，需要大量的计算资源和培训数据。在转移到开放式语义语义细分的出现时，该分段要求模型对学习类别进行分类，大量精细注释的数据将非常昂贵。相反，研究人员转向了无培训的方法，他们利用为数据更容易获取数据的现有模型。具体而言，这项调查将涵盖用于利用现有多模式分类模型的无培训开放式语义分段的历史，细微差别，思想发展和最新。我们将首先就任务定义进行初步，然后对流行模型原型进行概述，然后将30种方法的聚光灯分为更广泛的研究分支：纯粹基于夹子，那些利用辅助视觉基础模型和依靠生成方法的方法。随后，我们将讨论当前研究的局限性和潜在问题，并为未来的研究提供一些毫无创伤的想法。我们认为，这项调查将是对新研究人员的良好入门阅读，并引发对该地区的兴趣。

Title: Look & Mark: Leveraging Radiologist Eye Fixations and Bounding boxes in Multimodal Large Language Models for Chest X-ray Report Generation

Authors: Yunsoo Kim, Jinge Wu, Su-Hwan Kim, Pardeep Vasudev, Jiashu Shen, Honghan Wu
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2505.22222
Pdf URL: https://arxiv.org/pdf/2505.22222
Copy Paste: [[2505.22222]] Look & Mark: Leveraging Radiologist Eye Fixations and Bounding boxes in Multimodal Large Language Models for Chest X-ray Report Generation(https://arxiv.org/abs/2505.22222)
Keywords: generation
Abstract: Recent advancements in multimodal Large Language Models (LLMs) have significantly enhanced the automation of medical image analysis, particularly in generating radiology reports from chest X-rays (CXR). However, these models still suffer from hallucinations and clinically significant errors, limiting their reliability in real-world applications. In this study, we propose Look & Mark (L&M), a novel grounding fixation strategy that integrates radiologist eye fixations (Look) and bounding box annotations (Mark) into the LLM prompting framework. Unlike conventional fine-tuning, L&M leverages in-context learning to achieve substantial performance gains without retraining. When evaluated across multiple domain-specific and general-purpose models, L&M demonstrates significant gains, including a 1.2% improvement in overall metrics (this http URL) for CXR-LLaVA compared to baseline prompting and a remarkable 9.2% boost for LLaVA-Med. General-purpose models also benefit from L&M combined with in-context learning, with LLaVA-OV achieving an 87.3% clinical average performance (this http URL), the highest among all models, even surpassing those explicitly trained for CXR report generation. Expert evaluations further confirm that L&M reduces clinically significant errors (by 0.43 average errors per report), such as false predictions and omissions, enhancing both accuracy and reliability. These findings highlight L&M's potential as a scalable and efficient solution for AI-assisted radiology, paving the way for improved diagnostic workflows in low-resource clinical settings.
摘要：多模式大语言模型（LLM）的最新进展显着增强了医学图像分析的自动化，尤其是在从胸部X射线（CXR）中生成放射学报告时。但是，这些模型仍然遭受幻觉和临床重大错误的困扰，从而限制了它们在现实世界中的可靠性。在这项研究中，我们提出了Look＆Mark（L＆M），这是一种新型的接地固定策略，将放射科医生眼固定（Look）和边界框注释（Mark）集成到LLM提示框架中。与常规的微调不同，L＆M在不进行重新培训的情况下学习实现大量绩效提高。当对多个域特异性和通用模型进行评估时，L＆M显示出显着增长，包括与基线提示相比，CXR-LALAVA的总体指标（此HTTP URL）提高了1.2％，LLAVA-MED的9.2％提升。通用模型还受益于L＆M与内部文化学习的结合，Llava-ov实现了87.3％的临床平均性能（此HTTP URL）（在所有模型中最高），甚至超过了那些对CXR报告生成的明确训练的人。专家评估进一步证实，L＆M会减少临床上的显着错误（每报告平均错误），例如虚假预测和遗漏，从而提高了准确性和可靠性。这些发现突出了L＆M作为AI辅助放射学的可扩展和高效解决方案的潜力，为改善低资源临床环境中的诊断工作流铺平了道路。

Title: Enjoying Information Dividend: Gaze Track-based Medical Weakly Supervised Segmentation

Authors: Zhisong Wang, Yiwen Ye, Ziyang Chen, Yong Xia
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.22230
Pdf URL: https://arxiv.org/pdf/2505.22230
Copy Paste: [[2505.22230]] Enjoying Information Dividend: Gaze Track-based Medical Weakly Supervised Segmentation(https://arxiv.org/abs/2505.22230)
Keywords: generation
Abstract: Weakly supervised semantic segmentation (WSSS) in medical imaging struggles with effectively using sparse annotations. One promising direction for WSSS leverages gaze annotations, captured via eye trackers that record regions of interest during diagnostic procedures. However, existing gaze-based methods, such as GazeMedSeg, do not fully exploit the rich information embedded in gaze data. In this paper, we propose GradTrack, a framework that utilizes physicians' gaze track, including fixation points, durations, and temporal order, to enhance WSSS performance. GradTrack comprises two key components: Gaze Track Map Generation and Track Attention, which collaboratively enable progressive feature refinement through multi-level gaze supervision during the decoding process. Experiments on the Kvasir-SEG and NCI-ISBI datasets demonstrate that GradTrack consistently outperforms existing gaze-based methods, achieving Dice score improvements of 3.21% and 2.61%, respectively. Moreover, GradTrack significantly narrows the performance gap with fully supervised models such as nnUNet.
摘要：在医学成像中，弱监督的语义细分（WSSS）在有效地使用稀疏注释中有效地挣扎。 WSSS的一个有前途的方向利用注视注释，该注视是通过记录诊断过程中感兴趣区域的注视器捕获的。但是，现有的基于凝视的方法（例如GazemedSeg）并不能完全利用凝视数据中嵌入的丰富信息。在本文中，我们提出了GradTrack，该框架利用医生的凝视轨迹，包括固定点，持续时间和时间顺序，以增强WSSS性能。 GradTrack包括两个关键组成部分：凝视轨道图的生成和跟踪注意力，它们通过在解码过程中的多层次注视进行协作，可以通过多层注视进行渐进的功能改进。 Kvasir-Seg和NCI-ISBI数据集的实验表明，GradTrack始终优于现有的基于目光的方法，分别达到3.21 \％和2.61 \％的骰子分数提高。此外，GradTrack通过完全监督的模型（例如NNUNET）大大缩小了性能差距。
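The results above are reported as Dice score improvements. For readers unfamiliar with the metric, here is a minimal sketch of the standard Dice coefficient on binary masks, Dice = 2|A ∩ B| / (|A| + |B|). The flat 0/1 list representation, the convention of returning 1.0 when both masks are empty, and the name `dice_score` are assumptions made for illustration; the paper's evaluation code may differ.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    size_sum = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if size_sum == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / size_sum


if __name__ == "__main__":
    predicted = [1, 1, 0, 0]
    reference = [1, 0, 1, 0]
    print(dice_score(predicted, reference))
```

A gain of a few Dice points therefore means a measurably larger overlap between predicted and reference masks relative to their combined size.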

Title: YH-MINER: Multimodal Intelligent System for Natural Ecological Reef Metric Extraction

Authors: Mingzhuang Wang, Yvyang Li, Xiyang Zhang, Fei Tan, Qi Shi, Guotao Zhang, Siqi Chen, Yufei Liu, Lei Lei, Ming Zhou, Qiang Lin, Hongqiang Yang
Subjects: cs.CV, q-bio.QM
Abstract URL: https://arxiv.org/abs/2505.22250
Pdf URL: https://arxiv.org/pdf/2505.22250
Copy Paste: [[2505.22250]] YH-MINER: Multimodal Intelligent System for Natural Ecological Reef Metric Extraction(https://arxiv.org/abs/2505.22250)
Keywords: generation
Abstract: Coral reefs, crucial for sustaining marine biodiversity and ecological processes (e.g., nutrient cycling, habitat provision), face escalating threats, underscoring the need for efficient monitoring. Coral reef ecological monitoring faces dual challenges of low efficiency in manual analysis and insufficient segmentation accuracy in complex underwater scenarios. This study develops the YH-OSI system, establishing an intelligent framework centered on the Multimodal Large Model (MLLM) for "object detection-semantic segmentation-prior input". The system uses the object detection module (mAP@0.5=0.78) to generate spatial prior boxes for coral instances, driving the segment module to complete pixel-level segmentation in low-light and densely occluded scenarios. The segmentation masks and finetuned classification instructions are fed into the Qwen2-VL-based multimodal model as prior inputs, achieving a genus-level classification accuracy of 88% and simultaneously extracting core ecological metrics. Meanwhile, the system retains the scalability of the multimodal model through standardized interfaces, laying a foundation for future integration into multimodal agent-based underwater robots and supporting the full-process automation of "image acquisition-prior generation-real-time analysis."
摘要：珊瑚礁，对于维持海洋生物多样性和生态过程（例如，营养循环，栖息地提供）至关重要，面临着不断升级的威胁，强调了有效监测的需求。珊瑚礁生态监测在复杂的水下场景中，在手动分析中面临低效率的双重挑战，分割精度不足。这项研究开发了YH-OSI系统，建立了一个以“对象检测 - 语义分段优先输入”为中心以多模式大型（MLLM）为中心的智能框架。该系统使用对象检测模块（mAP@0.5=0.78）生成用于珊瑚实例的空间先验框，驱动段模块以低光和密集的遮挡场景完成像素级分割。分割面具和固定分类指令被作为先前输入的基于QWEN2-VL的多模式模型，以达到88％的属属分类精度，并同时提取核心生态指标。同时，该系统通过标准化界面保留多模态模型的可伸缩性，为将来集成到基于多模态智能体的水下机器人奠定基础，并支持“图像采集-先验生成-实时分析”的全流程自动化。
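The detection figure quoted above (mAP@0.5=0.78) relies on the intersection-over-union (IoU) between predicted and reference boxes: under the usual convention, a detection counts as correct when its IoU with a reference box is at least 0.5. A minimal sketch of that overlap test follows; the `(x1, y1, x2, y2)` corner format and the names `box_iou` and `is_match` are assumptions for illustration, not details taken from the YH-OSI system.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2) with x1 < x2, y1 < y2."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def is_match(pred_box, ref_box, threshold=0.5):
    """True when the predicted box overlaps the reference box enough to count."""
    return box_iou(pred_box, ref_box) >= threshold


if __name__ == "__main__":
    print(box_iou((0, 0, 2, 2), (1, 0, 3, 2)))
    print(is_match((0, 0, 2, 2), (0, 0, 2, 1.5)))
```

Average precision is then computed per class from the ranked list of matched and unmatched detections, and mAP averages it over classes; that ranking step is omitted here.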

Title: From Controlled Scenarios to Real-World: Cross-Domain Degradation Pattern Matching for All-in-One Image Restoration

Authors: Junyu Fan, Chuanlin Liao, Yi Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.22284
Pdf URL: https://arxiv.org/pdf/2505.22284
Copy Paste: [[2505.22284]] From Controlled Scenarios to Real-World: Cross-Domain Degradation Pattern Matching for All-in-One Image Restoration(https://arxiv.org/abs/2505.22284)
Keywords: restoration
Abstract: As a fundamental imaging task, All-in-One Image Restoration (AiOIR) aims to restore images affected by multiple degradation patterns via a single model with unified parameters. Although existing AiOIR approaches obtain promising performance in closed and controlled scenarios, they still suffer from considerable performance reduction in real-world scenarios, since the gap in data distributions between the training samples (source domain) and real-world test samples (target domain) can lead to inferior degradation awareness. To address this issue, a Unified Domain-Adaptive Image Restoration (UDAIR) framework is proposed to effectively achieve AiOIR by transferring the knowledge learned in the source domain to the target domain. To improve degradation identification, a codebook is designed to learn a group of discrete embeddings that denote the degradation patterns, and a cross-sample contrastive learning mechanism is further proposed to capture shared features from different samples of a given degradation. To bridge the data gap, a domain adaptation strategy is proposed to build the feature projection between the source and target domains by dynamically aligning their codebook embeddings, and a correlation alignment-based test-time adaptation mechanism is designed to fine-tune the alignment discrepancies by tightening the degradation embeddings to the corresponding cluster center in the source domain. Experimental results on 10 open-source datasets demonstrate that UDAIR achieves new state-of-the-art performance for the AiOIR task. Most importantly, the feature clusters validate the degradation identification under unknown conditions, and qualitative comparisons showcase robust generalization to real-world scenarios.
摘要：作为一项基本成像任务，多合一的图像恢复（AIOIR）旨在通过具有统一参数的单个模型来实现由多个退化模式引起的图像恢复。尽管现有的Aioir方法在封闭和受控方案中获得了有希望的性能，但由于训练样本（源域）和现实世界测试样本（目标域）（目标域）之间的数据分布差距可以导致下降低降低质量息息能力，因此它们仍然遭受了大幅度降低的绩效。为了解决这个问题，提出了一个统一的域自适应图像恢复（UDAIR）框架，以通过利用从源域到目标域的学习知识来有效地实现Aioir。为了改善降解识别，一本代码手册旨在学习一组离散的嵌入以表示降解模式，并进一步提出了跨样本对比的学习机制来捕获来自某些降解的不同样本的共同特征。为了弥合数据差距，提出了一种域的适应策略，以动态对齐其代码簿嵌入来构建源和目标域之间的特征投影，并设计了基于相关的测试时间适应机制，旨在通过将脱位嵌入到相应的源源中心中的汇总嵌入来微调对齐差异。 10个开源数据集的实验结果表明，Udair为Aioir任务实现了新的最新性能。最重要的是，该特征群集验证了未知条件下的降解识别，并且定性比较展示了对现实世界情景的强大概括。

Title: Neural Restoration of Greening Defects in Historical Autochrome Photographs Based on Purely Synthetic Data

Authors: Saptarshi Neil Sinha, P. Julius Kuehn, Johannes Koppe, Arjan Kuijper, Michael Weinmann
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.22291
Pdf URL: https://arxiv.org/pdf/2505.22291
Copy Paste: [[2505.22291]] Neural Restoration of Greening Defects in Historical Autochrome Photographs Based on Purely Synthetic Data(https://arxiv.org/abs/2505.22291)
Keywords: restoration, generation, generative
Abstract: The preservation of early visual arts, particularly color photographs, is challenged by deterioration caused by aging and improper storage, leading to issues like blurring, scratches, color bleeding, and fading defects. In this paper, we present the first approach for the automatic removal of greening color defects in digitized autochrome photographs. Our main contributions include a method based on synthetic dataset generation and the use of generative AI with a carefully designed loss function for the restoration of visual arts. To address the lack of suitable training datasets for analyzing greening defects in damaged autochromes, we introduce a novel approach for accurately simulating such defects in synthetic data. We also propose a modified weighted loss function for the ChaIR method to account for color imbalances between defective and non-defective areas. While existing methods struggle with accurately reproducing original colors and may require significant manual effort, our method allows for efficient restoration with reduced time requirements.
摘要：由衰老和存储不当引起的恶化挑战了早期视觉艺术，尤其是彩色照片，导致诸如模糊，划痕，颜色出血和褪色缺陷等问题。在本文中，我们介绍了在数字化的自动斑点照片中自动去除绿色缺陷的第一种方法。我们的主要贡献包括一种基于合成数据集生成的方法以及具有精心设计的损失功能的生成AI的使用，以恢复视觉艺术。为了解决缺乏合适的培训数据集来分析损坏的自变量中的绿色缺陷，我们引入了一种新颖的方法，以准确模拟合成数据中的此类缺陷。我们还为椅子方法提出了修改的加权损失函数，以说明缺陷和未对区域之间的颜色失衡。尽管现有方法难以准确再现原始颜色，并且可能需要大量的手动努力，但我们的方法可以有效地恢复，并减少时间要求。

Title: Versatile Cardiovascular Signal Generation with a Unified Diffusion Transformer

Authors: Zehua Chen, Yuyang Miao, Liyuan Wang, Luyun Fan, Danilo P. Mandic, Jun Zhu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.22306
Pdf URL: https://arxiv.org/pdf/2505.22306
Copy Paste: [[2505.22306]] Versatile Cardiovascular Signal Generation with a Unified Diffusion Transformer(https://arxiv.org/abs/2505.22306)
Keywords: generation, generative
Abstract: Cardiovascular signals such as photoplethysmography (PPG), electrocardiography (ECG), and blood pressure (BP) are inherently correlated and complementary, together reflecting the health of the cardiovascular system. However, their joint utilization in real-time monitoring is severely limited by diverse acquisition challenges, from noisy wearable recordings to burdensome invasive procedures. Here we propose UniCardio, a multi-modal diffusion transformer that reconstructs low-quality signals and synthesizes unrecorded signals in a unified generative framework. Its key innovations include a specialized model architecture to manage the signal modalities involved in generation tasks and a continual learning paradigm to incorporate varying modality combinations. By exploiting the complementary nature of cardiovascular signals, UniCardio clearly outperforms recent task-specific baselines in signal denoising, imputation, and translation. The generated signals match the performance of ground-truth signals in detecting abnormal health conditions and estimating vital signs, even in unseen domains, while ensuring interpretability for human experts. These advantages position UniCardio as a promising avenue for advancing AI-assisted healthcare.
摘要：心血管信号，例如光摄影学（PPG），心电图（ECG）和血压（BP）固有相关和互补，共同反映了心血管系统的健康。但是，它们在实时监控中的联合利用受到从嘈杂的可穿戴录音到负担重大的侵入性程序的各种收购挑战的严重限制。在这里，我们提出了Unicardio，这是一种多模式扩散变压器，可重建低质量信号并在统一的生成框架中综合未记录的信号。它的关键创新包括专门的模型体系结构，以管理发电任务中涉及的信号方式和连续学习范式，以结合各种模态组合。通过利用心血管信号的互补性，Unicardio在信号降解，插补和翻译方面明显优于最近特定于任务的基准。产生的信号与基础真相信号的性能相匹配，以检测异常的健康状况和估计生命体征，即使在看不见的域，也可以确保对人类专家的解释性。这些优势将Unicardio定位为推进AI辅助医疗保健的有前途的途径。

Title: Task-Driven Implicit Representations for Automated Design of LiDAR Systems

Authors: Nikhil Behari, Aaron Young, Akshat Dave, Ramesh Raskar
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2505.22344
Pdf URL: https://arxiv.org/pdf/2505.22344
Copy Paste: [[2505.22344]] Task-Driven Implicit Representations for Automated Design of LiDAR Systems(https://arxiv.org/abs/2505.22344)
Keywords: generative
Abstract: Imaging system design is a complex, time-consuming, and largely manual process; LiDAR design, ubiquitous in mobile devices, autonomous vehicles, and aerial imaging platforms, adds further complexity through unique spatial and temporal sampling requirements. In this work, we propose a framework for automated, task-driven LiDAR system design under arbitrary constraints. To achieve this, we represent LiDAR configurations in a continuous six-dimensional design space and learn task-specific implicit densities in this space via flow-based generative modeling. We then synthesize new LiDAR systems by modeling sensors as parametric distributions in 6D space and fitting these distributions to our learned implicit density using expectation-maximization, enabling efficient, constraint-aware LiDAR system design. We validate our method on diverse tasks in 3D vision, enabling automated LiDAR system design across real-world-inspired applications in face scanning, robotic tracking, and object detection.

Title: Look Within or Look Beyond? A Theoretical Comparison Between Parameter-Efficient and Full Fine-Tuning

Authors: Yongkang Liu, Xingle Xu, Ercong Nie, Zijing Wang, Shi Feng, Daling Wang, Qian Li, Hinrich Schütze
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.22355
Pdf URL: https://arxiv.org/pdf/2505.22355
Copy Paste: [[2505.22355]] Look Within or Look Beyond? A Theoretical Comparison Between Parameter-Efficient and Full Fine-Tuning(https://arxiv.org/abs/2505.22355)
Keywords: generation
Abstract: Parameter-Efficient Fine-Tuning (PEFT) methods achieve performance comparable to Full Fine-Tuning (FFT) while requiring significantly fewer computing resources, making them the go-to choice for researchers. We find that although PEFT can achieve competitive results on some benchmarks, its performance falls short of FFT in complex tasks, such as reasoning and instruction-based fine-tuning. In this paper, we compare the characteristics of PEFT and FFT in terms of representational capacity and robustness based on optimization theory. We theoretically demonstrate that PEFT is a strict subset of FFT. By providing theoretical upper bounds for PEFT, we show that the limited parameter space constrains the model's representational ability, making it more susceptible to perturbations. Experiments on 15 datasets encompassing classification, generation, reasoning, and instruction fine-tuning tasks, together with 11 adversarial test sets, validate our theories. We hope that these results spark further research beyond the realms of well-established PEFT. The source code is in an anonymous GitHub repository (this https URL).

Title: Identity-Preserving Text-to-Image Generation via Dual-Level Feature Decoupling and Expert-Guided Fusion

Authors: Kewen Chen, Xiaobin Hu, Wenqi Ren
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.22360
Pdf URL: https://arxiv.org/pdf/2505.22360
Copy Paste: [[2505.22360]] Identity-Preserving Text-to-Image Generation via Dual-Level Feature Decoupling and Expert-Guided Fusion(https://arxiv.org/abs/2505.22360)
Keywords: generation
Abstract: Recent advances in large-scale text-to-image generation models have led to a surge in subject-driven text-to-image generation, which aims to produce customized images that align with textual descriptions while preserving the identity of specific subjects. Despite significant progress, current methods struggle to disentangle identity-relevant information from identity-irrelevant details in the input images, resulting in overfitting or failure to maintain subject identity. In this work, we propose a novel framework that improves the separation of identity-related and identity-unrelated features and introduces an innovative feature fusion mechanism to improve the quality and text alignment of generated images. Our framework consists of two key components: an Implicit-Explicit foreground-background Decoupling Module (IEDM) and a Feature Fusion Module (FFM) based on a Mixture of Experts (MoE). IEDM combines learnable adapters for implicit decoupling at the feature level with inpainting techniques for explicit foreground-background separation at the image level. FFM dynamically integrates identity-irrelevant features with identity-related features, enabling refined feature representations even in cases of incomplete decoupling. In addition, we introduce three complementary loss functions to guide the decoupling process. Extensive experiments demonstrate the effectiveness of our proposed method in enhancing image generation quality, improving flexibility in scene adaptation, and increasing the diversity of generated outputs across various textual descriptions.

Title: Physics-Informed Distillation of Diffusion Models for PDE-Constrained Generation

Authors: Yi Zhang, Difan Zou
Subjects: cs.LG, cs.AI, cs.CE, math.NA
Abstract URL: https://arxiv.org/abs/2505.22391
Pdf URL: https://arxiv.org/pdf/2505.22391
Copy Paste: [[2505.22391]] Physics-Informed Distillation of Diffusion Models for PDE-Constrained Generation(https://arxiv.org/abs/2505.22391)
Keywords: generation, generative
Abstract: Modeling physical systems in a generative manner offers several advantages, including the ability to handle partial observations, generate diverse solutions, and address both forward and inverse problems. Recently, diffusion models have gained increasing attention in the modeling of physical systems, particularly those governed by partial differential equations (PDEs). However, diffusion models only access noisy data $\boldsymbol{x}_t$ at intermediate steps, making it infeasible to directly enforce constraints on the clean sample $\boldsymbol{x}_0$ at each noise level. As a workaround, constraints are typically applied to the expectation of clean samples $\mathbb{E}[\boldsymbol{x}_0|\boldsymbol{x}_t]$, which is estimated using the learned score network. However, imposing PDE constraints on this expectation is not equivalent to imposing them on the true clean data, a discrepancy known as Jensen's Gap. This gap creates a trade-off: enforcing PDE constraints may come at the cost of reduced accuracy in generative modeling. To address this, we propose a simple yet effective post-hoc distillation approach, where PDE constraints are not injected directly into the diffusion process, but instead enforced during a post-hoc distillation stage. We term our method Physics-Informed Distillation of Diffusion Models (PIDDM). This distillation not only facilitates single-step generation with improved PDE satisfaction, but also supports both forward and inverse problem solving and reconstruction from random partial observations. Extensive experiments across various PDE benchmarks demonstrate that PIDDM significantly improves PDE satisfaction over several recent and competitive baselines, such as PIDM, DiffusionPDE, and ECI-sampling, with less computational overhead. Our approach can shed light on more efficient and effective strategies for incorporating physical constraints into diffusion models.
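The Jensen's Gap mentioned in this abstract can be illustrated with a toy example. The sketch below is not taken from the paper: the constraint `x**2 == 1` and the two-point posterior are hypothetical stand-ins for a PDE residual and for the distribution of clean samples given a noisy one. It shows that a constraint satisfied by every clean sample can still be violated by their mean.

```python
def constraint_residual(x):
    # Hypothetical nonlinear constraint: satisfied exactly when x**2 == 1.
    return x * x - 1.0

# Toy posterior over clean samples x0 given a noisy xt: half are -1, half are +1,
# so every individual clean sample satisfies the constraint.
samples = [-1.0, 1.0] * 500

posterior_mean = sum(samples) / len(samples)

# Residual evaluated at the posterior mean, i.e. what a constraint on E[x0 | xt] sees.
residual_of_mean = constraint_residual(posterior_mean)

# Average residual over the actual clean samples.
mean_of_residuals = sum(constraint_residual(x) for x in samples) / len(samples)

# The difference between the two is the gap for this toy constraint.
jensen_gap = residual_of_mean - mean_of_residuals
print(posterior_mean, residual_of_mean, mean_of_residuals, jensen_gap)
```

In this toy case, forcing the residual at the mean to zero would push the mean away from the true posterior mean, which is the trade-off the abstract describes.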

Title: PacTure: Efficient PBR Texture Generation on Packed Views with Visual Autoregressive Models

Authors: Fan Fei, Jiajun Tang, Fei-Peng Tian, Boxin Shi, Ping Tan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.22394
Pdf URL: https://arxiv.org/pdf/2505.22394
Copy Paste: [[2505.22394]] PacTure: Efficient PBR Texture Generation on Packed Views with Visual Autoregressive Models(https://arxiv.org/abs/2505.22394)
Keywords: generation, generative
Abstract: We present PacTure, a novel framework for generating physically-based rendering (PBR) material textures from an untextured 3D mesh, a text description, and an optional image prompt. Early 2D generation-based texturing approaches generate textures sequentially from different views, resulting in long inference times and globally inconsistent textures. More recent approaches adopt multi-view generation with cross-view attention to enhance global consistency, which, however, limits the resolution for each view. In response to these weaknesses, we first introduce view packing, a novel technique that significantly increases the effective resolution for each view during multi-view generation without imposing additional inference cost, by formulating the arrangement of multi-view maps as a 2D rectangle bin packing problem. In contrast to UV mapping, it preserves the spatial proximity essential for image generation and maintains full compatibility with current 2D generative models. To further reduce the inference cost, we enable fine-grained control and multi-domain generation within the next-scale prediction autoregressive framework to create an efficient multi-view multi-domain generative backbone. Extensive experiments show that PacTure outperforms state-of-the-art methods in both quality of generated PBR textures and efficiency in training and inference.
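The abstract frames view packing as a 2D rectangle bin packing problem but does not say which packing algorithm is used. As a rough illustration only, the sketch below applies a simple shelf heuristic (an assumption, not the authors' method): rectangles are sorted by height and placed left to right, and a new shelf is opened when the current one is full. The function name `shelf_pack` and the example sizes are hypothetical.

```python
def shelf_pack(rects, bin_width):
    """Place (width, height) rectangles on horizontal shelves.

    Returns a list of (x, y) positions in the original order and the total height used.
    """
    # Tallest first, so each shelf's height is set by its first rectangle.
    order = sorted(range(len(rects)), key=lambda i: rects[i][1], reverse=True)
    positions = [None] * len(rects)
    x = y = shelf_height = 0
    for i in order:
        w, h = rects[i]
        if w > bin_width:
            raise ValueError("rectangle wider than the bin")
        if x + w > bin_width:
            # Current shelf is full: open a new shelf above it.
            y += shelf_height
            x = 0
            shelf_height = 0
        positions[i] = (x, y)
        x += w
        shelf_height = max(shelf_height, h)
    return positions, y + shelf_height

# Four hypothetical view bounding boxes packed into a strip of width 6.
views = [(4, 3), (2, 5), (3, 2), (5, 1)]
positions, total_height = shelf_pack(views, bin_width=6)
print(positions, total_height)
```

A tighter packing leaves fewer unused pixels in the atlas, which is how packing can raise the effective resolution per view without a larger image.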

Title: Self-Reflective Reinforcement Learning for Diffusion-based Image Reasoning Generation

Authors: Jiadong Pan, Zhiyuan Ma, Kaiyan Zhang, Ning Ding, Bowen Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.22407
Pdf URL: https://arxiv.org/pdf/2505.22407
Copy Paste: [[2505.22407]] Self-Reflective Reinforcement Learning for Diffusion-based Image Reasoning Generation(https://arxiv.org/abs/2505.22407)
Keywords: generation
Abstract: Diffusion models have recently demonstrated exceptional performance in image generation tasks. However, existing image generation methods still struggle significantly with image reasoning, especially in logic-centered image generation tasks. Inspired by the success of Chain of Thought (CoT) and Reinforcement Learning (RL) in LLMs, we propose SRRL, a self-reflective RL algorithm for diffusion models that achieves reasoning generation of logical images by performing reflection and iteration across generation trajectories. The intermediate samples in the denoising process carry noise, making accurate reward evaluation difficult. To address this challenge, SRRL treats the entire denoising trajectory as a CoT step with a multi-round reflective denoising process and introduces a condition-guided forward process, which allows for reflective iteration between CoT steps. Through SRRL-based iterative diffusion training, we introduce image reasoning through CoT into generation tasks adhering to physical laws and unconventional physical phenomena for the first time. Notably, case-study results show that our SRRL algorithm achieves superior performance, even compared with GPT-4o. The project page is this https URL.

Title: Frugal Incremental Generative Modeling using Variational Autoencoders

Authors: Victor Enescu, Hichem Sahbi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.22408
Pdf URL: https://arxiv.org/pdf/2505.22408
Copy Paste: [[2505.22408]] Frugal Incremental Generative Modeling using Variational Autoencoders(https://arxiv.org/abs/2505.22408)
Keywords: generative
Abstract: Continual or incremental learning holds tremendous potential in deep learning, but faces different challenges, including catastrophic forgetting. The advent of powerful foundation and generative models has propelled this paradigm even further, making it one of the most viable solutions for training these models. However, one of the persisting issues lies in the increasing volume of data, particularly with replay-based methods. This growth introduces challenges with scalability, since continuously expanding data becomes increasingly demanding as the number of tasks grows. In this paper, we attenuate this issue by devising a novel replay-free incremental learning model based on Variational Autoencoders (VAEs). The main contributions of this work include (i) a novel incremental generative modelling approach, built upon a well-designed multi-modal latent space, and (ii) an orthogonality criterion that mitigates catastrophic forgetting of the learned VAEs. The proposed method considers two variants of these VAEs: static and dynamic, with no (or at most a controlled) growth in the number of parameters. Extensive experiments show that our method is (at least) an order of magnitude more "memory-frugal" compared to closely related works while achieving SOTA accuracy scores.

Title: Mitigating Overthinking in Large Reasoning Models via Manifold Steering

Authors: Yao Huang, Huanran Chen, Shouwei Ruan, Yichi Zhang, Xingxing Wei, Yinpeng Dong
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2505.22411
Pdf URL: https://arxiv.org/pdf/2505.22411
Copy Paste: [[2505.22411]] Mitigating Overthinking in Large Reasoning Models via Manifold Steering(https://arxiv.org/abs/2505.22411)
Keywords: generation
Abstract: Recent advances in Large Reasoning Models (LRMs) have demonstrated remarkable capabilities in solving complex tasks such as mathematics and coding. However, these models frequently exhibit a phenomenon known as overthinking during inference, characterized by excessive validation loops and redundant deliberation, leading to substantial computational overheads. In this paper, we aim to mitigate overthinking by investigating the underlying mechanisms from the perspective of mechanistic interpretability. We first show that the tendency to overthink can be effectively captured by a single direction in the model's activation space and that the issue can be eased by intervening on the activations along this direction. However, this efficacy soon reaches a plateau and even deteriorates as the intervention strength increases. We therefore systematically explore the activation space and find that the overthinking phenomenon is actually tied to a low-dimensional manifold, which indicates that the limited effect stems from the noise introduced by the high-dimensional steering direction. Based on this insight, we propose Manifold Steering, a novel approach that elegantly projects the steering direction onto the low-dimensional activation manifold given the theoretical approximation of the interference noise. Extensive experiments on DeepSeek-R1 distilled models validate that our method reduces output tokens by up to 71% while maintaining and even improving the accuracy on several mathematical benchmarks. Our method also exhibits robust cross-domain transferability, delivering consistent token reduction performance in code generation and knowledge-based QA tasks. Code is available at: this https URL.
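The core operation described here, projecting a steering direction onto a low-dimensional subspace before intervening, can be sketched in a few lines. The sketch makes several assumptions that the abstract does not confirm: the manifold is approximated by the span of a given orthonormal basis (how that basis is estimated is not stated), the intervention subtracts the direction from the activation, and all names and numbers are hypothetical.

```python
def project_onto_subspace(direction, basis):
    """Project `direction` onto the span of orthonormal `basis` vectors."""
    projected = [0.0] * len(direction)
    for b in basis:
        # Coefficient of the direction along this basis vector.
        coeff = sum(d * bi for d, bi in zip(direction, b))
        projected = [p + coeff * bi for p, bi in zip(projected, b)]
    return projected

def steer(activation, direction, strength):
    """Shift an activation against `direction` (sign convention is an assumption)."""
    return [a - strength * d for a, d in zip(activation, direction)]

# Hypothetical 4-dimensional activation space with a 2-dimensional subspace.
basis = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
raw_direction = [0.5, -0.25, 0.125, 2.0]   # last two components play the role of noise
clean_direction = project_onto_subspace(raw_direction, basis)
steered = steer([1.0, 1.0, 1.0, 1.0], clean_direction, strength=2.0)
print(clean_direction, steered)
```

Components of the raw direction outside the subspace are dropped, so a larger intervention strength no longer amplifies them.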

Title: Scaling Reasoning without Attention

Authors: Xueliang Zhao, Wei Wu, Lingpeng Kong
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2505.22425
Pdf URL: https://arxiv.org/pdf/2505.22425
Copy Paste: [[2505.22425]] Scaling Reasoning without Attention(https://arxiv.org/abs/2505.22425)
Keywords: generation
Abstract: Large language models (LLMs) have made significant advances in complex reasoning tasks, yet they remain bottlenecked by two core challenges: architectural inefficiency due to reliance on Transformers, and a lack of structured fine-tuning for high-difficulty domains. We introduce \ourmodel, an attention-free language model that addresses both issues through architectural and data-centric innovations. Built on the state space dual (SSD) layers of Mamba-2, our model eliminates the need for self-attention and key-value caching, enabling fixed-memory, constant-time inference. To train it for complex reasoning, we propose a two-phase curriculum fine-tuning strategy based on the PromptCoT synthesis paradigm, which generates pedagogically structured problems via abstract concept selection and rationale-guided generation. On benchmark evaluations, \ourmodel-7B outperforms strong Transformer and hybrid models of comparable scale, and even surpasses the much larger Gemma3-27B by 2.6% on AIME 24, 0.6% on AIME 25, and 3.0% on LiveCodeBench. These results highlight the potential of state space models as efficient and scalable alternatives to attention-based architectures for high-capacity reasoning.

Title: Data-Driven Antenna Miniaturization: A Knowledge-Based System Integrating Quantum PSO and Predictive Machine Learning Models

Authors: Khan Masood Parvez, Sk Md Abidar Rahaman, Ali Shiri Sichani
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.22440
Pdf URL: https://arxiv.org/pdf/2505.22440
Copy Paste: [[2505.22440]] Data-Driven Antenna Miniaturization: A Knowledge-Based System Integrating Quantum PSO and Predictive Machine Learning Models(https://arxiv.org/abs/2505.22440)
Keywords: generation
Abstract: The rapid evolution of wireless technologies necessitates automated design frameworks to address antenna miniaturization and performance optimization within constrained development cycles. This study demonstrates a machine-learning-enhanced workflow integrating Quantum-Behaved Dynamic Particle Swarm Optimization (QDPSO) with ANSYS HFSS simulations to accelerate antenna design. The QDPSO algorithm autonomously optimized loop dimensions in 11.53 seconds, achieving a resonance frequency of 1.4208 GHz, a 12.7 percent reduction compared to conventional 1.60 GHz designs. Machine learning models (SVM, Random Forest, XGBoost, and Stacked ensembles) predicted resonance frequencies in 0.75 seconds using 936 simulation datasets, with stacked models showing superior training accuracy (R² = 0.9825) and SVM demonstrating optimal validation performance (R² = 0.7197). The complete design cycle, encompassing optimization, prediction, and ANSYS validation, required 12.42 minutes on standard desktop hardware (Intel i5-8500, 16GB RAM), contrasting sharply with the 50-hour benchmark of PSADEA-based approaches. This roughly 240-fold acceleration eliminates traditional trial-and-error methods that often extend beyond seven expert-led days. The system enables precise specification of performance targets with automated generation of fabrication-ready parameters, particularly benefiting compact consumer devices requiring rapid frequency tuning. By bridging AI-driven optimization with CAD validation, this framework reduces engineering workloads while ensuring production-ready designs, establishing a scalable paradigm for next-generation RF systems in 6G and IoT applications.
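The acceleration figure can be checked directly from the two timings quoted in the abstract. The short calculation below uses only those reported numbers; the split between the machine-learning steps and the rest of the cycle assumes the cycle consists of the three stages the abstract lists.

```python
baseline_hours = 50.0      # reported PSADEA-based benchmark
cycle_minutes = 12.42      # reported complete design cycle

# Ratio of the baseline duration to the full design cycle, in the same unit.
speedup = baseline_hours * 60.0 / cycle_minutes

# Optimization (11.53 s) plus prediction (0.75 s), as reported.
ml_seconds = 11.53 + 0.75
# Remainder of the cycle, assumed here to be dominated by ANSYS validation.
remaining_seconds = cycle_minutes * 60.0 - ml_seconds

print(round(speedup, 1), round(ml_seconds, 2), round(remaining_seconds, 2))
```

The ratio comes out slightly above 240, and nearly all of the 12.42 minutes falls outside the optimization and prediction steps.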

Title: Position: All Current Generative Fidelity and Diversity Metrics are Flawed

Authors: Ossi Räisä, Boris van Breugel, Mihaela van der Schaar
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.22450
Pdf URL: https://arxiv.org/pdf/2505.22450
Copy Paste: [[2505.22450]] Position: All Current Generative Fidelity and Diversity Metrics are Flawed(https://arxiv.org/abs/2505.22450)
Keywords: generative
Abstract: Any method's development and practical application is limited by our ability to measure its reliability. The popularity of generative modeling emphasizes the importance of good synthetic data metrics. Unfortunately, previous works have found many failure cases in current metrics, for example a lack of outlier robustness and unclear lower and upper bounds. We propose a list of desiderata for synthetic data metrics, and a suite of sanity checks: carefully chosen simple experiments that aim to detect specific and known generative modeling failure modes. Based on these desiderata and the results of our checks, we arrive at our position: all current generative fidelity and diversity metrics are flawed. This significantly hinders practical use of synthetic data. Our aim is to convince the research community to spend more effort on developing metrics, instead of models. Additionally, through analyzing how current metrics fail, we provide practitioners with guidelines on how these metrics should (not) be used.

Title: Understanding Adversarial Training with Energy-based Models

Authors: Mujtaba Hussain Mirza, Maria Rosaria Briglia, Filippo Bartolucci, Senad Beadini, Giuseppe Lisanti, Iacopo Masi
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2505.22486
Pdf URL: https://arxiv.org/pdf/2505.22486
Copy Paste: [[2505.22486]] Understanding Adversarial Training with Energy-based Models(https://arxiv.org/abs/2505.22486)
Keywords: generation, generative
Abstract: We aim to use the Energy-based Model (EBM) framework to better understand adversarial training (AT) in classifiers, and additionally to analyze the intrinsic generative capabilities of robust classifiers. By viewing standard classifiers through an energy lens, we begin by analyzing how the energies of adversarial examples, generated by various attacks, differ from those of the natural samples. The central focus of our work is to understand the critical phenomena of Catastrophic Overfitting (CO) and Robust Overfitting (RO) in AT from an energy perspective. We analyze the impact of existing AT approaches on the energy of samples during training and observe that the behavior of the "delta energy" (the change in energy between an original sample and its adversarial counterpart) diverges significantly when CO or RO occurs. After a thorough analysis of these energy dynamics and their relationship with overfitting, we propose a novel regularizer, the Delta Energy Regularizer (DER), designed to smooth the energy landscape during training. We demonstrate that DER is effective in mitigating both CO and RO across multiple benchmarks. We further show that robust classifiers, when used as generative models, have limits in handling the trade-off between image quality and variability. We propose an improved technique based on a local class-wise principal component analysis (PCA) and energy-based guidance for better class-specific initialization and adaptive stopping, enhancing sample diversity and generation quality. Considering that we do not explicitly train for generative modeling, we achieve a competitive Inception Score (IS) and Fréchet inception distance (FID) compared to hybrid discriminative-generative models.
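The abstract does not spell out how a classifier's energy is computed. A common reading of a classifier as an energy-based model takes the marginal energy of an input to be the negative log-sum-exp of its logits; the sketch below assumes that definition, and also assumes the delta energy is the adversarial energy minus the clean energy. The logits are made-up numbers, not results from the paper.

```python
import math

def energy(logits):
    """Marginal energy E(x) = -log(sum_y exp(f_y(x))), computed stably."""
    m = max(logits)
    return -(m + math.log(sum(math.exp(z - m) for z in logits)))

def delta_energy(clean_logits, adv_logits):
    """Change in energy from a clean sample to its adversarial counterpart (assumed sign)."""
    return energy(adv_logits) - energy(clean_logits)

# Hypothetical logits: a confident clean prediction and a flattened adversarial one.
clean_logits = [2.0, 0.0, 0.0]
adv_logits = [0.0, 0.0, 0.0]
gap = delta_energy(clean_logits, adv_logits)
print(energy(clean_logits), energy(adv_logits), gap)
```

A regularizer in the spirit of DER would penalize large values of this quantity during training; its exact form is not given in the abstract.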

Title: ProCrop: Learning Aesthetic Image Cropping from Professional Compositions

Authors: Ke Zhang, Tianyu Ding, Jiachen Jiang, Tianyi Chen, Ilya Zharkov, Vishal M. Patel, Luming Liang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.22490
Pdf URL: https://arxiv.org/pdf/2505.22490
Copy Paste: [[2505.22490]] ProCrop: Learning Aesthetic Image Cropping from Professional Compositions(https://arxiv.org/abs/2505.22490)
Keywords: generation
Abstract: Image cropping is crucial for enhancing the visual appeal and narrative impact of photographs, yet existing rule-based and data-driven approaches often lack diversity or require annotated training data. We introduce ProCrop, a retrieval-based method that leverages professional photography to guide cropping decisions. By fusing features from professional photographs with those of the query image, ProCrop learns from professional compositions, significantly boosting performance. Additionally, we present a large-scale dataset of 242K weakly-annotated images, generated by out-painting professional images and iteratively refining diverse crop proposals. This composition-aware dataset generation offers diverse high-quality crop proposals guided by aesthetic principles and becomes the largest publicly available dataset for image cropping. Extensive experiments show that ProCrop significantly outperforms existing methods in both supervised and weakly-supervised settings. Notably, when trained on the new dataset, our ProCrop surpasses previous weakly-supervised methods and even matches fully supervised approaches. Both the code and dataset will be made publicly available to advance research in image aesthetics and composition analysis.

Title: ProSpero: Active Learning for Robust Protein Design Beyond Wild-Type Neighborhoods

Authors: Michal Kmicikiewicz, Vincent Fortuin, Ewa Szczurek
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.22494
Pdf URL: https://arxiv.org/pdf/2505.22494
Copy Paste: [[2505.22494]] ProSpero: Active Learning for Robust Protein Design Beyond Wild-Type Neighborhoods(https://arxiv.org/abs/2505.22494)
Keywords: generative
Abstract: Designing protein sequences of both high fitness and novelty is a challenging task in data-efficient protein engineering. Exploration beyond wild-type neighborhoods often leads to biologically implausible sequences or relies on surrogate models that lose fidelity in novel regions. Here, we propose ProSpero, an active learning framework in which a frozen pre-trained generative model is guided by a surrogate updated from oracle feedback. By integrating fitness-relevant residue selection with biologically-constrained Sequential Monte Carlo sampling, our approach enables exploration beyond wild-type neighborhoods while preserving biological plausibility. We show that our framework remains effective even when the surrogate is misspecified. ProSpero consistently outperforms or matches existing methods across diverse protein engineering tasks, retrieving sequences of both high fitness and novelty.

Title: PrismLayers: Open Data for High-Quality Multi-Layer Transparent Image Generative Models

Authors: Junwen Chen, Heyang Jiang, Yanbin Wang, Keming Wu, Ji Li, Chao Zhang, Keiji Yanai, Dong Chen, Yuhui Yuan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.22523
Pdf URL: https://arxiv.org/pdf/2505.22523
Copy Paste: [[2505.22523]] PrismLayers: Open Data for High-Quality Multi-Layer Transparent Image Generative Models(https://arxiv.org/abs/2505.22523)
Keywords: generation, generative
Abstract: Generating high-quality, multi-layer transparent images from text prompts can unlock a new level of creative control, allowing users to edit each layer as effortlessly as editing text outputs from LLMs. However, the development of multi-layer generative models lags behind that of conventional text-to-image models due to the absence of a large, high-quality corpus of multi-layer transparent data. In this paper, we address this fundamental challenge by: (i) releasing the first open, ultra-high-fidelity PrismLayers (PrismLayersPro) dataset of 200K (20K) multilayer transparent images with accurate alpha mattes, (ii) introducing a training-free synthesis pipeline that generates such data on demand using off-the-shelf diffusion models, and (iii) delivering a strong, open-source multi-layer generation model, ART+, which matches the aesthetics of modern text-to-image generation models. The key technical contributions include: LayerFLUX, which excels at generating high-quality single transparent layers with accurate alpha mattes, and MultiLayerFLUX, which composes multiple LayerFLUX outputs into complete images, guided by human-annotated semantic layout. To ensure higher quality, we apply a rigorous filtering stage to remove artifacts and semantic mismatches, followed by human selection. Fine-tuning the state-of-the-art ART model on our synthetic PrismLayersPro yields ART+, which outperforms the original ART in 60% of head-to-head user study comparisons and even matches the visual quality of images generated by the FLUX.1-[dev] model. We anticipate that our work will establish a solid dataset foundation for the multi-layer transparent image generation task, enabling research and applications that require precise, editable, and visually compelling layered imagery.
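The abstract says MultiLayerFLUX composes several transparent layers into a complete image but does not state the blending rule. As an illustrative sketch only, the standard Porter-Duff "over" operator for straight (non-premultiplied) alpha is one common way to flatten RGBA layers. The function names `over` and `flatten` and the pixel-tuple representation are assumptions made here, not the paper's API.

```python
def over(top, bottom):
    """Composite one straight-alpha RGBA pixel over another (Porter-Duff 'over').

    Each pixel is an (r, g, b, a) tuple of floats in [0, 1].
    """
    tr, tg, tb, ta = top
    br, bg, bb, ba = bottom
    out_a = ta + ba * (1.0 - ta)
    if out_a == 0.0:
        return (0.0, 0.0, 0.0, 0.0)  # both pixels fully transparent

    def blend(t, b):
        # Weight each colour by its own alpha, then un-premultiply.
        return (t * ta + b * ba * (1.0 - ta)) / out_a

    return (blend(tr, br), blend(tg, bg), blend(tb, bb), out_a)


def flatten(layers):
    """Flatten layers listed back-to-front; each layer is a list of RGBA pixels."""
    canvas = list(layers[0])
    for layer in layers[1:]:
        canvas = [over(px, under) for px, under in zip(layer, canvas)]
    return canvas
```

For instance, a half-transparent blue pixel over an opaque red one yields an opaque pixel with equal red and blue components, while a fully transparent layer leaves the canvas unchanged.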

Title: Test-Time Alignment of Discrete Diffusion Models with Sequential Monte Carlo

Authors: Chinmay Pani, Zijing Ou, Yingzhen Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.22524
Pdf URL: https://arxiv.org/pdf/2505.22524
Copy Paste: [[2505.22524]] Test-Time Alignment of Discrete Diffusion Models with Sequential Monte Carlo(https://arxiv.org/abs/2505.22524)
Keywords: generative
Abstract: Discrete diffusion models have become highly effective across various domains. However, real-world applications often require the generative process to adhere to certain constraints without task-specific fine-tuning. To this end, we propose a training-free method based on Sequential Monte Carlo (SMC) to sample from the reward-aligned target distribution at test time. Our approach leverages twisted SMC with an approximate locally optimal proposal, obtained via a first-order Taylor expansion of the reward function. To address the challenge of ill-defined gradients in discrete spaces, we incorporate a Gumbel-Softmax relaxation, enabling efficient gradient-based approximation within the discrete generative framework. Empirical results on both synthetic datasets and image modelling validate the effectiveness of our approach.
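The Gumbel-Softmax relaxation mentioned in the abstract replaces a hard categorical sample with a continuous vector on the probability simplex, so that gradients are well defined. The following is a minimal, standard-library sketch of that generic relaxation only; it is not the paper's SMC proposal, and the function name `gumbel_softmax`, the temperature argument `tau`, and the clamping constant are assumptions made for illustration.

```python
import math
import random


def gumbel_softmax(logits, tau, rng):
    """Draw one relaxed (continuous) sample from a categorical distribution.

    logits: unnormalised log-probabilities, one per category.
    tau:    temperature > 0; smaller values give samples closer to one-hot.
    rng:    a random.Random instance, passed in for reproducibility.
    """
    noisy = []
    for logit in logits:
        # Keep u strictly inside (0, 1) so both logarithms are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / tau)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]
```

With the same noise draw, lowering `tau` sharpens the output toward a one-hot vector; taking the argmax of the output corresponds to the Gumbel-max trick for exact categorical sampling.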

Title: Thinking with Generated Images

Authors: Ethan Chern, Zhulin Hu, Steffi Chern, Siqi Kou, Jiadi Su, Yan Ma, Zhijie Deng, Pengfei Liu
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2505.22525
Pdf URL: https://arxiv.org/pdf/2505.22525
Copy Paste: [[2505.22525]] Thinking with Generated Images(https://arxiv.org/abs/2505.22525)
Keywords: generation
Abstract: We present Thinking with Generated Images, a novel paradigm that fundamentally transforms how large multimodal models (LMMs) engage with visual reasoning by enabling them to natively think across text and vision modalities through spontaneous generation of intermediate visual thinking steps. Current visual reasoning with LMMs is constrained to either processing fixed user-provided images or reasoning solely through text-based chain-of-thought (CoT). Thinking with Generated Images unlocks a new dimension of cognitive capability where models can actively construct intermediate visual thoughts, critique their own visual hypotheses, and refine them as integral components of their reasoning process. We demonstrate the effectiveness of our approach through two complementary mechanisms: (1) vision generation with intermediate visual subgoals, where models decompose complex visual tasks into manageable components that are generated and integrated progressively, and (2) vision generation with self-critique, where models generate an initial visual hypothesis, analyze its shortcomings through textual reasoning, and produce refined outputs based on their own critiques. Our experiments on vision generation benchmarks show substantial improvements over baseline approaches, with our models achieving up to 50% (from 38% to 57%) relative improvement in handling complex multi-object scenarios. From biochemists exploring novel protein structures, and architects iterating on spatial designs, to forensic analysts reconstructing crime scenes, and basketball players envisioning strategic plays, our approach enables AI models to engage in the kind of visual imagination and iterative refinement that characterizes human creative, analytical, and strategic thinking. We release our open-source suite at this https URL.
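The "up to 50% (from 38% to 57%)" figure in the abstract is a relative improvement, not a gain in percentage points. A small worked check, with the helper name `relative_improvement` chosen here purely for illustration:

```python
def relative_improvement(baseline, improved):
    """Return the gain as a fraction of the baseline score."""
    return (improved - baseline) / baseline


absolute_gain = 57.0 - 38.0                        # 19 percentage points
relative_gain = relative_improvement(38.0, 57.0)   # 19 / 38 = 0.5, i.e. 50%
```

The absolute gain is 19 points, and 19 divided by the 38-point baseline gives the 50% relative figure.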

Title: TabularQGAN: A Quantum Generative Model for Tabular Data

Authors: Pallavi Bhardwaj, Caitlin Jones, Lasse Dierich, Aleksandar Vučković
Subjects: cs.LG, cs.AI, quant-ph
Abstract URL: https://arxiv.org/abs/2505.22533
Pdf URL: https://arxiv.org/pdf/2505.22533
Copy Paste: [[2505.22533]] TabularQGAN: A Quantum Generative Model for Tabular Data(https://arxiv.org/abs/2505.22533)
Keywords: generative
Abstract: In this paper, we introduce a novel quantum generative model for synthesizing tabular data. Synthetic data is valuable in scenarios where real-world data is scarce or private; it can be used to augment or replace existing datasets. Real-world enterprise data is predominantly tabular and heterogeneous, often comprising a mixture of categorical and numerical features, making it highly relevant across various industries such as healthcare, finance, and software. We propose a quantum generative adversarial network architecture with flexible data encoding and a novel quantum circuit ansatz to effectively model tabular data. The proposed approach is tested on the MIMIC III healthcare and Adult Census datasets, with extensive benchmarking against the leading classical models CTGAN and CopulaGAN. Experimental results demonstrate that our quantum model outperforms classical models by an average of 8.5% with respect to an overall similarity score from SDMetrics, while using only 0.072% of the parameters of the classical models. Additionally, we evaluate the generalization capabilities of the models using two custom-designed metrics that demonstrate the ability of the proposed quantum model to generate useful and novel samples. To our knowledge, this is one of the first demonstrations of a successful quantum generative model for handling tabular data, indicating that this task could be well-suited to quantum computers.

Title: Scaling-up Perceptual Video Quality Assessment

Authors: Ziheng Jia, Zicheng Zhang, Zeyu Zhang, Yingji Liang, Xiaorong Zhu, Chunyi Li, Jinliang Han, Haoning Wu, Bin Wang, Haoran Zhang, Guanyu Zhu, Qiyong Zhao, Xiaohong Liu, Guangtao Zhai, Xiongkuo Min
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.22543
Pdf URL: https://arxiv.org/pdf/2505.22543
Copy Paste: [[2505.22543]] Scaling-up Perceptual Video Quality Assessment(https://arxiv.org/abs/2505.22543)
Keywords: quality assessment
Abstract: The data scaling law has been shown to significantly enhance the performance of large multi-modal models (LMMs) across various downstream tasks. However, in the domain of perceptual video quality assessment (VQA), the potential of the scaling law remains underexplored due to the scarcity of labeled resources and the insufficient scale of datasets. To address this, we propose OmniVQA, a framework designed to efficiently build high-quality, human-in-the-loop VQA multi-modal instruction databases (MIDBs). We then scale up to create OmniVQA-Chat-400K, the largest MIDB in the VQA field to date. Our focus is on the technical and aesthetic quality dimensions, with abundant in-context instruction data to provide fine-grained VQA knowledge. Additionally, we have built the OmniVQA-MOS-20K dataset to enhance the model's quantitative quality rating capabilities. We then introduce a complementary training strategy that effectively leverages the knowledge from datasets for quality understanding and quality rating tasks. Furthermore, we propose the OmniVQA-FG (fine-grain)-Benchmark to evaluate the fine-grained performance of the models. Our results demonstrate that our models achieve state-of-the-art performance in both quality understanding and rating tasks.

Title: Universal Visuo-Tactile Video Understanding for Embodied Interaction

Authors: Yifan Xie, Mingyang Li, Shoujie Li, Xingting Li, Guangyu Chen, Fei Ma, Fei Richard Yu, Wenbo Ding
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.22566
Pdf URL: https://arxiv.org/pdf/2505.22566
Copy Paste: [[2505.22566]] Universal Visuo-Tactile Video Understanding for Embodied Interaction(https://arxiv.org/abs/2505.22566)
Keywords: generation
Abstract: Tactile perception is essential for embodied agents to understand physical attributes of objects that cannot be determined through visual inspection alone. While existing approaches have made progress in visual and language modalities for physical understanding, they fail to effectively incorporate tactile information that provides crucial haptic feedback for real-world interaction. In this paper, we present VTV-LLM, the first multi-modal large language model for universal Visuo-Tactile Video (VTV) understanding that bridges the gap between tactile perception and natural language. To address the challenges of cross-sensor and cross-modal integration, we contribute VTV150K, a comprehensive dataset comprising 150,000 video frames from 100 diverse objects captured across three different tactile sensors (GelSight Mini, DIGIT, and Tac3D), annotated with four fundamental tactile attributes (hardness, protrusion, elasticity, and friction). We develop a novel three-stage training paradigm that includes VTV enhancement for robust visuo-tactile representation, VTV-text alignment for cross-modal correspondence, and text prompt finetuning for natural language generation. Our framework enables sophisticated tactile reasoning capabilities including feature assessment, comparative analysis, scenario-based decision making and so on. Experimental evaluations demonstrate that VTV-LLM achieves superior performance in tactile video understanding tasks, establishing a foundation for more intuitive human-machine interaction in tactile domains.

Title: ImageReFL: Balancing Quality and Diversity in Human-Aligned Diffusion Models

Authors: Dmitrii Sorokin, Maksim Nakhodnov, Andrey Kuznetsov, Aibek Alanov
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.22569
Pdf URL: https://arxiv.org/pdf/2505.22569
Copy Paste: [[2505.22569]] ImageReFL: Balancing Quality and Diversity in Human-Aligned Diffusion Models(https://arxiv.org/abs/2505.22569)
Keywords: generation
Abstract: Recent advances in diffusion models have led to impressive image generation capabilities, but aligning these models with human preferences remains challenging. Reward-based fine-tuning using models trained on human feedback improves alignment but often harms diversity, producing less varied outputs. In this work, we address this trade-off with two contributions. First, we introduce combined generation, a novel sampling strategy that applies a reward-tuned diffusion model only in the later stages of the generation process, while preserving the base model for earlier steps. This approach mitigates early-stage overfitting and helps retain global structure and diversity. Second, we propose ImageReFL, a fine-tuning method that improves image diversity with minimal loss in quality by training on real images and incorporating multiple regularizers, including diffusion and ReFL losses. Our approach outperforms conventional reward tuning methods on standard quality and diversity metrics. A user study further confirms that our method better balances human preference alignment and visual diversity. The source code can be found at this https URL.
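The combined generation strategy described in the abstract (base model for the early steps, reward-tuned model for the later ones) can be pictured as a sampling loop that switches denoisers partway through. The sketch below is a schematic under stated assumptions: `base_step` and `tuned_step` are hypothetical callables standing in for one denoising step of each model, and `switch_step` is a hypothetical parameter; the abstract does not say how the switch point is chosen.

```python
def combined_generation(x, base_step, tuned_step, num_steps, switch_step):
    """Run num_steps denoising steps, handing over to the tuned model late.

    x:           initial state (for a diffusion sampler, the starting noise).
    base_step:   callable (x, t) -> x, one step of the base model (hypothetical).
    tuned_step:  callable (x, t) -> x, one step of the reward-tuned model (hypothetical).
    switch_step: first step index handled by the reward-tuned model.
    """
    for t in range(num_steps):
        # Early steps keep the base model so global structure and diversity
        # come from it; later steps use the reward-tuned model.
        step_fn = base_step if t < switch_step else tuned_step
        x = step_fn(x, t)
    return x
```

Setting `switch_step` to 0 recovers plain sampling from the tuned model, and setting it to `num_steps` recovers plain sampling from the base model.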

Title: Tell me Habibi, is it Real or Fake?

Authors: Kartik Kuckreja, Parul Gupta, Injy Hamed, Thamar Solorio, Muhammad Haris Khan, Abhinav Dhall
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.22581
Pdf URL: https://arxiv.org/pdf/2505.22581
Copy Paste: [[2505.22581]] Tell me Habibi, is it Real or Fake?(https://arxiv.org/abs/2505.22581)
Keywords: generation
Abstract: Deepfake generation methods are evolving fast, making fake media harder to detect and raising serious societal concerns. Most deepfake detection and dataset creation research focuses on monolingual content, often overlooking the challenges of multilingual and code-switched speech, where multiple languages are mixed within the same discourse. Code-switching, especially between Arabic and English, is common in the Arab world and is widely used in digital communication. This linguistic mixing poses extra challenges for deepfake detection, as it can confuse models trained mostly on monolingual data. To address this, we introduce ArEnAV, the first large-scale Arabic-English audio-visual deepfake dataset featuring intra-utterance code-switching, dialectal variation, and monolingual Arabic content. It contains 387k videos and over 765 hours of real and fake videos. Our dataset is generated using a novel pipeline integrating four Text-To-Speech and two lip-sync models, enabling comprehensive analysis of multilingual multimodal deepfake detection. We benchmark our dataset against existing monolingual and multilingual datasets, state-of-the-art deepfake detection models, and a human evaluation, highlighting its potential to advance deepfake research. The dataset can be accessed at this https URL.

Title: RICO: Improving Accuracy and Completeness in Image Recaptioning via Visual Reconstruction

Authors: Yuchi Wang, Yishuo Cai, Shuhuai Ren, Sihan Yang, Linli Yao, Yuanxin Liu, Yuanxing Zhang, Pengfei Wan, Xu Sun
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2505.22613
Pdf URL: https://arxiv.org/pdf/2505.22613
Copy Paste: [[2505.22613]] RICO: Improving Accuracy and Completeness in Image Recaptioning via Visual Reconstruction(https://arxiv.org/abs/2505.22613)
Keywords: generation
Abstract: Image recaptioning is widely used to generate training datasets with enhanced quality for various multimodal tasks. Existing recaptioning methods typically rely on powerful multimodal large language models (MLLMs) to enhance textual descriptions, but often suffer from inaccuracies due to hallucinations and incompleteness caused by missing fine-grained details. To address these limitations, we propose RICO, a novel framework that refines captions through visual reconstruction. Specifically, we leverage a text-to-image model to reconstruct a caption into a reference image, and prompt an MLLM to identify discrepancies between the original and reconstructed images to refine the caption. This process is performed iteratively, progressively promoting the generation of more faithful and comprehensive descriptions. To mitigate the additional computational cost induced by the iterative process, we introduce RICO-Flash, which learns to generate captions like RICO using DPO. Extensive experiments demonstrate that our approach significantly improves caption accuracy and completeness, outperforming most baselines by approximately 10% on both CapsBench and CompreCap. Code released at this https URL.
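The reconstruct-compare-refine cycle described in the abstract can be sketched as a short loop. This is a schematic under stated assumptions, not the released RICO code: `reconstruct` stands in for the text-to-image model, `revise` stands in for the MLLM that compares the original and reconstructed images, and both names, their signatures, and the early-stop rule are hypothetical.

```python
def refine_caption(image, caption, reconstruct, revise, num_rounds):
    """Iteratively refine a caption via visual reconstruction.

    reconstruct: callable caption -> reference image (hypothetical text-to-image stand-in).
    revise:      callable (image, reference, caption) -> caption (hypothetical MLLM stand-in).
    num_rounds:  maximum number of refinement iterations.
    """
    for _ in range(num_rounds):
        # Render the current caption so its omissions and errors become visible.
        reference = reconstruct(caption)
        # Ask the reviser to fix discrepancies between original and reconstruction.
        new_caption = revise(image, reference, caption)
        if new_caption == caption:
            break  # assumed stopping rule: no further change
        caption = new_caption
    return caption
```

Passing the two models in as callables keeps the loop independent of any particular text-to-image or MLLM library, which also makes it easy to exercise with toy stand-ins.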

Title: SPIRAL: Semantic-Aware Progressive LiDAR Scene Generation

Authors: Dekai Zhu, Yixuan Hu, Youquan Liu, Dongyue Lu, Lingdong Kong, Slobodan Ilic
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.22643
Pdf URL: https://arxiv.org/pdf/2505.22643
Copy Paste: [[2505.22643]] SPIRAL: Semantic-Aware Progressive LiDAR Scene Generation(https://arxiv.org/abs/2505.22643)
Keywords: generation, generative
Abstract: Leveraging recent diffusion models, LiDAR-based large-scale 3D scene generation has achieved great success. While recent voxel-based approaches can generate both geometric structures and semantic labels, existing range-view methods are limited to producing unlabeled LiDAR scenes. Relying on pretrained segmentation models to predict the semantic maps often results in suboptimal cross-modal consistency. To address this limitation while preserving the advantages of range-view representations, such as computational efficiency and simplified network design, we propose Spiral, a novel range-view LiDAR diffusion model that simultaneously generates depth, reflectance images, and semantic maps. Furthermore, we introduce novel semantic-aware metrics to evaluate the quality of the generated labeled range-view data. Experiments on the SemanticKITTI and nuScenes datasets demonstrate that Spiral achieves state-of-the-art performance with the smallest parameter size, outperforming two-step methods that combine the generative and segmentation models. Additionally, we validate that range images generated by Spiral can be effectively used for synthetic data augmentation in the downstream segmentation training, significantly reducing the labeling effort on LiDAR data.

Title: Let Them Talk: Audio-Driven Multi-Person Conversational Video Generation

Authors: Zhe Kong, Feng Gao, Yong Zhang, Zhuoliang Kang, Xiaoming Wei, Xunliang Cai, Guanying Chen, Wenhan Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.22647
Pdf URL: https://arxiv.org/pdf/2505.22647
Copy Paste: [[2505.22647]] Let Them Talk: Audio-Driven Multi-Person Conversational Video Generation(https://arxiv.org/abs/2505.22647)
Keywords: generation
Abstract: Audio-driven human animation methods, such as talking head and talking body generation, have made remarkable progress in generating synchronized facial movements and visually appealing videos. However, existing methods primarily focus on single-person animation and struggle with multi-stream audio inputs, facing incorrect binding problems between audio and persons. Additionally, they exhibit limitations in instruction-following capabilities. To solve these problems, we propose a novel task, Multi-Person Conversational Video Generation, and introduce a new framework, MultiTalk, to address the challenges during multi-person generation. Specifically, for audio injection, we investigate several schemes and propose the Label Rotary Position Embedding (L-RoPE) method to resolve the audio and person binding problem. Furthermore, during training, we observe that partial parameter training and multi-task training are crucial for preserving the instruction-following ability of the base model. MultiTalk achieves superior performance compared to other methods on several datasets, including talking head, talking body, and multi-person datasets, demonstrating the powerful generation capabilities of our approach.

Title: Sherlock: Self-Correcting Reasoning in Vision-Language Models

Authors: Yi Ding, Ruqi Zhang
Subjects: cs.CV, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.22651
Pdf URL: https://arxiv.org/pdf/2505.22651
Copy Paste: [[2505.22651]] Sherlock: Self-Correcting Reasoning in Vision-Language Models(https://arxiv.org/abs/2505.22651)
Keywords: generation
Abstract: Reasoning Vision-Language Models (VLMs) have shown promising performance on complex multimodal tasks. However, they still face significant challenges: they are highly sensitive to reasoning errors, require large volumes of annotated data or accurate verifiers, and struggle to generalize beyond specific domains. To address these limitations, we explore self-correction as a strategy to enhance reasoning VLMs. We first conduct an in-depth analysis of reasoning VLMs' self-correction abilities and identify key gaps. Based on our findings, we introduce Sherlock, a self-correction and self-improvement training framework. Sherlock introduces a trajectory-level self-correction objective, a preference data construction method based on visual perturbation, and a dynamic $\beta$ for preference tuning. Once the model acquires self-correction capabilities using only 20k randomly sampled annotated data, it continues to self-improve without external supervision. Built on the Llama3.2-Vision-11B model, Sherlock achieves remarkable results across eight benchmarks, reaching an average accuracy of 64.1 with direct generation and 65.4 after self-correction. It outperforms LLaVA-CoT (63.2), Mulberry (63.9), and LlamaV-o1 (63.4) while using less than 20% of the annotated data.

Title: Training Free Stylized Abstraction

Authors: Aimon Rahman, Kartik Narayan, Vishal M. Patel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.22663
Pdf URL: https://arxiv.org/pdf/2505.22663
Copy Paste: [[2505.22663]] Training Free Stylized Abstraction(https://arxiv.org/abs/2505.22663)
Keywords: restoration, generation
Abstract: Stylized abstraction synthesizes visually exaggerated yet semantically faithful representations of subjects, balancing recognizability with perceptual distortion. Unlike image-to-image translation, which prioritizes structural fidelity, stylized abstraction demands selective retention of identity cues while embracing stylistic divergence, which is especially challenging for out-of-distribution individuals. We propose a training-free framework that generates stylized abstractions from a single image using inference-time scaling in vision-language models (VLLMs) to extract identity-relevant features, and a novel cross-domain rectified flow inversion strategy that reconstructs structure based on style-dependent priors. Our method adapts structural restoration dynamically through style-aware temporal scheduling, enabling high-fidelity reconstructions that honor both subject and style. It supports multi-round abstraction-aware generation without fine-tuning. To evaluate this task, we introduce StyleBench, a GPT-based human-aligned metric suited for abstract styles where pixel-level similarity fails. Experiments across diverse abstraction styles (e.g., LEGO, knitted dolls, South Park) show strong generalization to unseen identities and styles in a fully open-source setup.