2025-05-13

Title: Dialz: A Python Toolkit for Steering Vectors

Authors: Zara Siddique, Liam D. Turner, Luis Espinosa-Anke
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.06262
Pdf URL: https://arxiv.org/pdf/2505.06262
Copy Paste: [[2505.06262]] Dialz: A Python Toolkit for Steering Vectors(https://arxiv.org/abs/2505.06262)
Keywords: generation
Abstract: We introduce Dialz, a framework for advancing research on steering vectors for open-source LLMs, implemented in Python. Steering vectors allow users to modify activations at inference time to amplify or weaken a 'concept', e.g. honesty or positivity, providing a more powerful alternative to prompting or fine-tuning. Dialz supports a diverse set of tasks, including creating contrastive pair datasets, computing and applying steering vectors, and visualizations. Unlike existing libraries, Dialz emphasizes modularity and usability, enabling both rapid prototyping and in-depth analysis. We demonstrate how Dialz can be used to reduce harmful outputs such as stereotypes, while also providing insights into model behaviour across different layers. We release Dialz with full documentation, tutorials, and support for popular open-source models to encourage further research in safe and controllable language generation. Dialz enables faster research cycles and facilitates insights into model interpretability, paving the way for safer, more transparent, and more reliable AI systems.
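
The abstract above describes computing a steering vector from contrastive pairs and applying it to activations at inference time. One common way to do this is a mean difference between the two sets of activations; the sketch below assumes that approach and uses plain lists in place of real model activations. The function names are hypothetical and are not the Dialz API.

```python
from statistics import fmean

def steering_vector(pos_acts, neg_acts):
    """Mean activation of the 'positive' prompts minus that of the 'negative' prompts."""
    dim = len(pos_acts[0])
    pos_mean = [fmean(a[i] for a in pos_acts) for i in range(dim)]
    neg_mean = [fmean(a[i] for a in neg_acts) for i in range(dim)]
    return [p - n for p, n in zip(pos_mean, neg_mean)]

def apply_steering(hidden, vector, coeff):
    """Shift one hidden state along the steering direction.

    A positive coeff amplifies the concept, a negative coeff weakens it.
    """
    return [h + coeff * v for h, v in zip(hidden, vector)]

# Toy 2-dimensional "activations" for two contrastive pairs (made-up numbers).
pos = [[1.0, 2.0], [3.0, 2.0]]
neg = [[0.0, 1.0], [2.0, 1.0]]
vec = steering_vector(pos, neg)
steered = apply_steering([0.5, 0.5], vec, 2.0)
```

In a real model the hidden state would come from a chosen layer, and the coefficient would be tuned per concept.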

Title: A machine learning model for skillful climate system prediction

Authors: Chenguang Zhou, Lei Chen, Xiaohui Zhong, Bo Lu, Hao Li, Libo Wu, Jie Wu, Jiahui Hu, Zesheng Dou, Pang-Chi Hsu, Xiaoye Zhang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.06269
Pdf URL: https://arxiv.org/pdf/2505.06269
Copy Paste: [[2505.06269]] A machine learning model for skillful climate system prediction(https://arxiv.org/abs/2505.06269)
Keywords: generation
Abstract: Climate system models (CSMs), through integrating cross-sphere interactions among the atmosphere, ocean, land, and cryosphere, have emerged as pivotal tools for deciphering climate dynamics and improving forecasting capabilities. Recent breakthroughs in artificial intelligence (AI)-driven meteorological modeling have demonstrated remarkable success in single-sphere systems and partially coupled multi-sphere systems. However, the development of a fully coupled AI-based climate system model encompassing atmosphere-ocean-land-sea ice interactions has remained an unresolved challenge. This paper introduces FengShun-CSM, an AI-based CSM that provides 60-day global daily forecasts for 29 critical variables across atmospheric, oceanic, terrestrial, and cryospheric domains. The model significantly outperforms the European Centre for Medium-Range Weather Forecasts (ECMWF) subseasonal-to-seasonal (S2S) model in predicting most variables, particularly precipitation, land surface, and oceanic components. This enhanced capability is primarily attributed to its improved representation of intra-seasonal variability modes, most notably the Madden-Julian Oscillation (MJO). Remarkably, FengShun-CSM exhibits substantial potential in predicting subseasonal extreme events. Such breakthroughs will advance its applications in meteorological disaster mitigation, marine ecosystem conservation, and agricultural productivity enhancement. Furthermore, it validates the feasibility of developing AI-powered CSMs through machine learning technologies, establishing a transformative paradigm for next-generation Earth system modeling.

Title: PARM: Multi-Objective Test-Time Alignment via Preference-Aware Autoregressive Reward Model

Authors: Baijiong Lin, Weisen Jiang, Yuancheng Xu, Hao Chen, Ying-Cong Chen
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.06274
Pdf URL: https://arxiv.org/pdf/2505.06274
Copy Paste: [[2505.06274]] PARM: Multi-Objective Test-Time Alignment via Preference-Aware Autoregressive Reward Model(https://arxiv.org/abs/2505.06274)
Keywords: generation
Abstract: Multi-objective test-time alignment aims to adapt large language models (LLMs) to diverse multi-dimensional user preferences during inference while keeping LLMs frozen. A recent method, GenARM (Xu et al., 2025), first independently trains Autoregressive Reward Models (ARMs) for each preference dimension without awareness of each other, then combines their outputs based on user-specific preference vectors during inference to achieve multi-objective test-time alignment. This leads to two key limitations: the need for multiple ARMs increases the inference cost, and the separate training of ARMs causes misalignment between the guided generation and the user preferences. To address these issues, we propose Preference-aware ARM (PARM), a single unified ARM trained across all preference dimensions. PARM uses our proposed Preference-Aware Bilinear Low-Rank Adaptation (PBLoRA), which employs a bilinear form to condition the ARM on preference vectors, enabling it to achieve precise control over preference trade-offs during inference. Experiments demonstrate that PARM reduces inference costs and achieves better alignment with preference vectors compared with existing methods. Additionally, PARM enables weak-to-strong guidance, allowing a smaller PARM to guide a larger frozen LLM without expensive training, making multi-objective alignment accessible with limited computing resources. The code is available at this https URL.
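
The baseline described in the abstract above combines the outputs of several per-dimension reward models according to a user preference vector while the base LLM stays frozen. The sketch below illustrates that idea for a single decoding step. The exact combination rule (base log-probabilities plus a preference-weighted sum of reward scores, scaled by a hypothetical `beta`) is an assumption for illustration, not the formula from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def guided_distribution(base_logprobs, reward_scores, preference, beta=1.0):
    """Next-token distribution of a frozen model guided by weighted rewards.

    base_logprobs: log-probabilities from the frozen LLM, one per token.
    reward_scores: one list of per-token scores per preference dimension.
    preference:    user weights, one per preference dimension.
    beta:          assumed temperature on the reward term (hypothetical).
    """
    combined = []
    for tok, base in enumerate(base_logprobs):
        reward = sum(w * scores[tok] for w, scores in zip(preference, reward_scores))
        combined.append(base + reward / beta)
    return softmax(combined)

# Two tokens, two preference dimensions with opposite tastes (made-up numbers).
base = [math.log(0.5), math.log(0.5)]
rewards = [[1.0, 0.0], [0.0, 1.0]]
balanced = guided_distribution(base, rewards, [0.5, 0.5])
first_only = guided_distribution(base, rewards, [1.0, 0.0])
```

With balanced weights the two rewards cancel and the base distribution is recovered; putting all weight on the first dimension shifts probability toward the first token.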

Title: A Data-Driven Probabilistic Framework for Cascading Urban Risk Analysis Using Bayesian Networks

Authors: Chunduru Rohith Kumar, PHD Surya Shanmuk, Prabhala Naga Srinivas, Sri Venkatesh Lankalapalli, Debasis Dwibedy
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2505.06281
Pdf URL: https://arxiv.org/pdf/2505.06281
Copy Paste: [[2505.06281]] A Data-Driven Probabilistic Framework for Cascading Urban Risk Analysis Using Bayesian Networks(https://arxiv.org/abs/2505.06281)
Keywords: generative
Abstract: The increasing complexity of cascading risks in urban systems necessitates robust, data-driven frameworks to model interdependencies across multiple domains. This study presents a foundational Bayesian network-based approach for analyzing cross-domain risk propagation across key urban domains, including air, water, electricity, agriculture, health, infrastructure, weather, and climate. Directed Acyclic Graphs (DAGs) are constructed using Bayesian Belief Networks (BBNs), with structure learning guided by Hill-Climbing search optimized through Bayesian Information Criterion (BIC) and K2 scoring. The framework is trained on a hybrid dataset that combines real-world urban indicators with synthetically generated data from Generative Adversarial Networks (GANs), and is further balanced using the Synthetic Minority Over-sampling Technique (SMOTE). Conditional Probability Tables (CPTs) derived from the learned structures enable interpretable probabilistic reasoning and quantify the likelihood of cascading failures. The results identify key intra- and inter-domain risk factors and demonstrate the framework's utility for proactive urban resilience planning. This work establishes a scalable, interpretable foundation for cascading risk assessment and serves as a basis for future empirical research in this emerging interdisciplinary field.
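
The abstract above relies on conditional probability tables (CPTs) to quantify the likelihood of cascading failures. A minimal two-node sketch of that reasoning follows. The network (power outage leading to water failure) and every probability in it are invented for illustration and do not come from the paper.

```python
# Prior: P(power outage). Made-up numbers.
p_power = {True: 0.1, False: 0.9}

# CPT: P(water failure | power outage). Made-up numbers.
p_water_given_power = {
    True: {True: 0.6, False: 0.4},
    False: {True: 0.05, False: 0.95},
}

def marginal_water_failure():
    """P(water failure), summing over both states of the parent node."""
    return sum(p_power[pw] * p_water_given_power[pw][True] for pw in (True, False))

def posterior_power_given_water_failure():
    """P(power outage | water failure) by Bayes' rule."""
    joint = p_power[True] * p_water_given_power[True][True]
    return joint / marginal_water_failure()

p_water = marginal_water_failure()
p_cause = posterior_power_given_water_failure()
```

Larger networks apply the same enumeration over more parents, which is why a learned DAG structure with compact CPTs keeps the reasoning interpretable.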

Title: DMRL: Data- and Model-aware Reward Learning for Data Extraction

Authors: Zhiqiang Wang, Ruoxi Cheng
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2505.06284
Pdf URL: https://arxiv.org/pdf/2505.06284
Copy Paste: [[2505.06284]] DMRL: Data- and Model-aware Reward Learning for Data Extraction(https://arxiv.org/abs/2505.06284)
Keywords: generation
Abstract: Large language models (LLMs) are inherently vulnerable to unintended privacy breaches. Consequently, systematic red-teaming research is essential for developing robust defense mechanisms. However, current data extraction methods suffer from several limitations: they (1) rely on dataset duplicates (addressable via deduplication), (2) depend on prompt engineering (now countered by detection and defense), and (3) rely on random-search adversarial generation. To address these challenges, we propose DMRL, a Data- and Model-aware Reward Learning approach for data extraction. This technique leverages inverse reinforcement learning to extract sensitive data from LLMs. Our method consists of two main components: (1) constructing an introspective reasoning dataset that captures leakage mindsets to guide model behavior, and (2) training reward models with Group Relative Policy Optimization (GRPO), dynamically tuning optimization based on task difficulty at both the data and model levels. Comprehensive experiments across various LLMs demonstrate that DMRL outperforms all baseline methods in data extraction performance.

Title: UniCO: Towards a Unified Model for Combinatorial Optimization Problems

Authors: Zefang Zong, Xiaochen Wei, Guozhen Zhang, Chen Gao, Huandong Wang, Yong Li
Subjects: cs.LG, cs.DM
Abstract URL: https://arxiv.org/abs/2505.06290
Pdf URL: https://arxiv.org/pdf/2505.06290
Copy Paste: [[2505.06290]] UniCO: Towards a Unified Model for Combinatorial Optimization Problems(https://arxiv.org/abs/2505.06290)
Keywords: generation
Abstract: Combinatorial Optimization (CO) encompasses a wide range of problems that arise in many real-world scenarios. While significant progress has been made in developing learning-based methods for specialized CO problems, a unified model with a single architecture and parameter set for diverse CO problems remains elusive. Such a model would offer substantial advantages in terms of efficiency and convenience. In this paper, we introduce UniCO, a unified model for solving various CO problems. Inspired by the success of next-token prediction, we frame each problem-solving process as a Markov Decision Process (MDP), tokenize the corresponding sequential trajectory data, and train the model using a transformer backbone. To reduce token length in the trajectory data, we propose a CO-prefix design that aggregates static problem features. To address the heterogeneity of state and action tokens within the MDP, we employ a two-stage self-supervised learning approach. In this approach, a dynamic prediction model is first trained and then serves as a pre-trained model for subsequent policy generation. Experiments across 10 CO problems showcase the versatility of UniCO, emphasizing its ability to generalize to new, unseen problems with minimal fine-tuning, achieving even few-shot or zero-shot performance. Our framework offers a valuable complement to existing neural CO methods that focus on optimizing performance for individual problems.

Title: Lossless Compression of Large Language Model-Generated Text via Next-Token Prediction

Authors: Yu Mao, Holger Pirk, Chun Jason Xue
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2505.06297
Pdf URL: https://arxiv.org/pdf/2505.06297
Copy Paste: [[2505.06297]] Lossless Compression of Large Language Model-Generated Text via Next-Token Prediction(https://arxiv.org/abs/2505.06297)
Keywords: generative
Abstract: As large language models (LLMs) continue to be deployed and utilized across domains, the volume of LLM-generated data is growing rapidly. This trend highlights the increasing importance of effective and lossless compression for such data in modern text management systems. However, compressing LLM-generated data presents unique challenges compared to traditional human- or machine-generated content. Traditional machine-generated data is typically derived from computational processes or device outputs, often highly structured and limited to low-level elements like labels or numerical values. This structure enables conventional lossless compressors to perform efficiently. In contrast, LLM-generated data is more complex and diverse, requiring new approaches for effective compression. In this work, we conduct the first systematic investigation of lossless compression techniques tailored specifically to LLM-generated data. Notably, because LLMs are trained via next-token prediction, we find that LLM-generated data is highly predictable for the models themselves. This predictability enables LLMs to serve as efficient compressors of their own outputs. Through extensive experiments with 14 representative LLMs and 8 LLM-generated datasets from diverse domains, we show that LLM-based prediction methods achieve remarkable compression rates, exceeding 20x, far surpassing the 3x rate achieved by Gzip, a widely used general-purpose compressor. Furthermore, this advantage holds across different LLM sizes and dataset types, demonstrating the robustness and practicality of LLM-based methods in lossless text compression under generative AI workloads.
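
The link between predictability and compression in the abstract above can be made concrete: an entropy coder such as an arithmetic coder can store a token in roughly -log2(p) bits, where p is the probability the predictor assigned to it. The sketch below computes that ideal code length with a hypothetical toy predictor standing in for an LLM. It is not the paper's pipeline.

```python
import math

def ideal_code_length_bits(tokens, predict):
    """Sum of -log2 p(token | preceding tokens) under the given predictor."""
    bits = 0.0
    for i, tok in enumerate(tokens):
        probs = predict(tokens[:i])
        bits += -math.log2(probs[tok])
    return bits

def toy_predict(context):
    """Hypothetical stand-in for an LLM over the vocabulary {a, b, c, d}.

    After an 'a' it expects 'b' half the time; otherwise it is uniform.
    """
    if context and context[-1] == "a":
        return {"a": 0.125, "b": 0.5, "c": 0.25, "d": 0.125}
    return {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}

tokens = ["a", "b", "a", "b"]
model_bits = ideal_code_length_bits(tokens, toy_predict)
fixed_bits = 2 * len(tokens)  # 2 bits per token for a 4-symbol vocabulary
ratio = fixed_bits / model_bits
```

The better the predictor matches the text, the fewer bits are needed, which is the intuition behind using an LLM to compress its own outputs.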

Title: QiMeng-TensorOp: Automatically Generating High-Performance Tensor Operators with Hardware Primitives

Authors: Xuzhi Zhang, Shaohui Peng, Qirui Zhou, Yuanbo Wen, Qi Guo, Ruizhi Chen, Xinguo Zhu, Weiqiang Xiong, Haixin Chen, Congying Ma, Ke Gao, Chen Zhao, Yanjun Wu, Yunji Chen, Ling Li
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.06302
Pdf URL: https://arxiv.org/pdf/2505.06302
Copy Paste: [[2505.06302]] QiMeng-TensorOp: Automatically Generating High-Performance Tensor Operators with Hardware Primitives(https://arxiv.org/abs/2505.06302)
Keywords: generation
Abstract: Computation-intensive tensor operators constitute over 90% of the computations in Large Language Models (LLMs) and Deep Neural Networks. Automatically and efficiently generating high-performance tensor operators with hardware primitives is crucial for diverse and ever-evolving hardware architectures like RISC-V, ARM, and GPUs, as manually optimized implementation takes at least months and lacks portability. LLMs excel at generating high-level language codes, but they struggle to fully comprehend hardware characteristics and produce high-performance tensor operators. We introduce a tensor-operator auto-generation framework with a one-line user prompt (QiMeng-TensorOp), which enables LLMs to automatically exploit hardware characteristics to generate tensor operators with hardware primitives, and tune parameters for optimal performance across diverse hardware. Experimental results on various hardware platforms, SOTA LLMs, and typical tensor operators demonstrate that QiMeng-TensorOp effectively unleashes the computing capability of various hardware platforms, and automatically generates tensor operators of superior performance. Compared with vanilla LLMs, QiMeng-TensorOp achieves up to 1291× performance improvement. Even compared with human experts, QiMeng-TensorOp could reach 251% of OpenBLAS on RISC-V CPUs, and 124% of cuBLAS on NVIDIA GPUs. Additionally, QiMeng-TensorOp significantly reduces development costs by 200× compared with human experts.

Title: Collaborative Multi-LoRA Experts with Achievement-based Multi-Tasks Loss for Unified Multimodal Information Extraction

Authors: Li Yuan, Yi Cai, Xudong Shen, Qing Li, Qingbao Huang, Zikun Deng, Tao Wang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.06303
Pdf URL: https://arxiv.org/pdf/2505.06303
Copy Paste: [[2505.06303]] Collaborative Multi-LoRA Experts with Achievement-based Multi-Tasks Loss for Unified Multimodal Information Extraction(https://arxiv.org/abs/2505.06303)
Keywords: generation
Abstract: Multimodal Information Extraction (MIE) has gained attention for extracting structured information from multimedia sources. Traditional methods tackle MIE tasks separately, missing opportunities to share knowledge across tasks. Recent approaches unify these tasks into a generation problem using instruction-based T5 models with visual adaptors, optimized through full-parameter fine-tuning. However, this method is computationally intensive, and multi-task fine-tuning often faces gradient conflicts, limiting performance. To address these challenges, we propose collaborative multi-LoRA experts with achievement-based multi-task loss (C-LoRAE) for MIE tasks. C-LoRAE extends the low-rank adaptation (LoRA) method by incorporating a universal expert to learn shared multimodal knowledge from cross-MIE tasks and task-specific experts to learn specialized instructional task features. This configuration enhances the model's generalization ability across multiple tasks while maintaining the independence of various instruction tasks and mitigating gradient conflicts. Additionally, we propose an achievement-based multi-task loss to balance training progress across tasks, addressing the imbalance caused by varying numbers of training samples in MIE tasks. Experimental results on seven benchmark datasets across three key MIE tasks demonstrate that C-LoRAE achieves superior overall performance compared to traditional fine-tuning methods and LoRA methods while utilizing a comparable number of training parameters to LoRA.

Title: GraphComp: Extreme Error-bounded Compression of Scientific Data via Temporal Graph Autoencoders

Authors: Guozhong Li, Muhannad Alhumaidi, Spiros Skiadopoulos, Ibrahim Hoteit, Panos Kalnis
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.06316
Pdf URL: https://arxiv.org/pdf/2505.06316
Copy Paste: [[2505.06316]] GraphComp: Extreme Error-bounded Compression of Scientific Data via Temporal Graph Autoencoders(https://arxiv.org/abs/2505.06316)
Keywords: generation
Abstract: The generation of voluminous scientific data poses significant challenges for efficient storage, transfer, and analysis. Recently, error-bounded lossy compression methods emerged due to their ability to achieve high compression ratios while controlling data distortion. However, they often overlook the inherent spatial and temporal correlations within scientific data, thus missing opportunities for higher compression. In this paper we propose GRAPHCOMP, a novel graph-based method for error-bounded lossy compression of scientific data. We perform irregular segmentation of the original grid data and generate a graph representation that preserves the spatial and temporal correlations. Inspired by Graph Neural Networks (GNNs), we then propose a temporal graph autoencoder to learn latent representations that significantly reduce the size of the graph, effectively compressing the original data. Decompression reverses the process and utilizes the learnt graph model together with the latent representation to reconstruct an approximation of the original data. The decompressed data are guaranteed to satisfy a user-defined point-wise error bound. We compare our method against the state-of-the-art error-bounded lossy methods (i.e., HPEZ, SZ3.1, SPERR, and ZFP) on large-scale real and synthetic data. GRAPHCOMP consistently achieves the highest compression ratio across most datasets, outperforming the second-best method by margins ranging from 22% to 50%.
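
The point-wise error bound mentioned in the abstract above means that every reconstructed value differs from the original by at most a user-chosen tolerance. The sketch below shows the simplest way to obtain such a guarantee, uniform quantization with a step of twice the bound. This is a generic textbook technique swapped in for illustration, not the graph autoencoder of GRAPHCOMP.

```python
def quantize(values, error_bound):
    """Map each value to the index of the nearest multiple of 2 * error_bound."""
    step = 2.0 * error_bound
    return [round(v / step) for v in values]

def dequantize(codes, error_bound):
    """Reconstruct approximate values from the integer codes."""
    step = 2.0 * error_bound
    return [c * step for c in codes]

data = [0.12, 0.57, 1.03, -0.48]  # made-up sample values
eps = 0.05                        # user-defined point-wise error bound
codes = quantize(data, eps)
recon = dequantize(codes, eps)
max_err = max(abs(a - b) for a, b in zip(data, recon))
```

Because each value is rounded to the nearest grid point, the reconstruction error is at most half a step, which equals the bound (up to floating-point rounding). The integer codes are what a later lossless stage would compress.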

Title: Learn to Think: Bootstrapping LLM Reasoning Capability Through Graph Learning

Authors: Hang Gao, Chenhao Zhang, Tie Wang, Junsuo Zhao, Fengge Wu, Changwen Zheng, Huaping Liu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.06321
Pdf URL: https://arxiv.org/pdf/2505.06321
Copy Paste: [[2505.06321]] Learn to Think: Bootstrapping LLM Reasoning Capability Through Graph Learning(https://arxiv.org/abs/2505.06321)
Keywords: generation
Abstract: Large Language Models (LLMs) have achieved remarkable success across various domains. However, they still face significant challenges, including high computational costs for training and limitations in solving complex reasoning problems. Although existing methods have extended the reasoning capabilities of LLMs through structured paradigms, these approaches often rely on task-specific prompts and predefined reasoning processes, which constrain their flexibility and generalizability. To address these limitations, we propose a novel framework that leverages graph learning to enable more flexible and adaptive reasoning capabilities for LLMs. Specifically, this approach models the reasoning process of a problem as a graph and employs LLM-based graph learning to guide the adaptive generation of each reasoning step. To further enhance the adaptability of the model, we introduce a Graph Neural Network (GNN) module to perform representation learning on the generated reasoning process, enabling real-time adjustments to both the model and the prompt. Experimental results demonstrate that this method significantly improves reasoning performance across multiple tasks without requiring additional training or task-specific prompt design. Code can be found in this https URL.

Title: The ML.ENERGY Benchmark: Toward Automated Inference Energy Measurement and Optimization

Authors: Jae-Won Chung, Jiachen Liu, Jeff J. Ma, Ruofan Wu, Oh Jun Kweon, Yuxuan Xia, Zhiyu Wu, Mosharaf Chowdhury
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.06371
Pdf URL: https://arxiv.org/pdf/2505.06371
Copy Paste: [[2505.06371]] The ML.ENERGY Benchmark: Toward Automated Inference Energy Measurement and Optimization(https://arxiv.org/abs/2505.06371)
Keywords: generative
Abstract: As the adoption of Generative AI in real-world services grows explosively, energy has emerged as a critical bottleneck resource. However, energy remains a metric that is often overlooked, under-explored, or poorly understood in the context of building ML systems. We present the ML.ENERGY Benchmark, a benchmark suite and tool for measuring inference energy consumption under realistic service environments, and the corresponding ML.ENERGY Leaderboard, which have served as a valuable resource for those hoping to understand and optimize the energy consumption of their generative AI services. In this paper, we explain four key design principles for benchmarking ML energy we have acquired over time, and then describe how they are implemented in the ML.ENERGY Benchmark. We then highlight results from the latest iteration of the benchmark, including energy measurements of 40 widely used model architectures across 6 different tasks, case studies of how ML design choices impact energy consumption, and how automated optimization recommendations can lead to significant (sometimes more than 40%) energy savings without changing what is being computed by the model. The ML.ENERGY Benchmark is open-source and can be easily extended to various customized models and application scenarios.
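
To make the notion of inference energy in the abstract above concrete: energy is power integrated over time, so a series of power readings taken during a request can be turned into joules and then normalized per generated token. The sketch below uses trapezoidal integration over made-up samples. It is a hypothetical illustration and does not reflect how the benchmark itself collects measurements.

```python
def energy_joules(timestamps_s, power_w):
    """Trapezoidal integral of power (watts) over time (seconds)."""
    total = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total

def energy_per_token(timestamps_s, power_w, num_tokens):
    """Average energy spent per generated token."""
    return energy_joules(timestamps_s, power_w) / num_tokens

# Made-up power samples taken once per second during one request.
times = [0, 1, 2, 3]
power = [100, 200, 200, 100]
total_j = energy_joules(times, power)
per_token_j = energy_per_token(times, power, num_tokens=250)
```

Normalizing per token or per request is what makes energy numbers comparable across models with different output lengths.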

Title: Toward Advancing License Plate Super-Resolution in Real-World Scenarios: A Dataset and Benchmark

Authors: Valfride Nascimento, Gabriel E. Lima, Rafael O. Ribeiro, William Robson Schwartz, Rayson Laroca, David Menotti
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.06393
Pdf URL: https://arxiv.org/pdf/2505.06393
Copy Paste: [[2505.06393]] Toward Advancing License Plate Super-Resolution in Real-World Scenarios: A Dataset and Benchmark(https://arxiv.org/abs/2505.06393)
Keywords: super-resolution
Abstract: Recent advancements in super-resolution for License Plate Recognition (LPR) have sought to address challenges posed by low-resolution (LR) and degraded images in surveillance, traffic monitoring, and forensic applications. However, existing studies have relied on private datasets and simplistic degradation models. To address this gap, we introduce UFPR-SR-Plates, a novel dataset containing 10,000 tracks with 100,000 paired low and high-resolution license plate images captured under real-world conditions. We establish a benchmark using multiple sequential LR and high-resolution (HR) images per vehicle -- five of each -- and two state-of-the-art models for super-resolution of license plates. We also investigate three fusion strategies to evaluate how combining predictions from a leading Optical Character Recognition (OCR) model for multiple super-resolved license plates enhances overall performance. Our findings demonstrate that super-resolution significantly boosts LPR performance, with further improvements observed when applying majority vote-based fusion techniques. Specifically, the Layout-Aware and Character-Driven Network (LCDNet) model combined with the Majority Vote by Character Position (MVCP) strategy led to the highest recognition rates, increasing from 1.7% with low-resolution images to 31.1% with super-resolution, and up to 44.7% when combining OCR outputs from five super-resolved images. These findings underscore the critical role of super-resolution and temporal information in enhancing LPR accuracy under real-world, adverse conditions. The proposed dataset is publicly available to support further research and can be accessed at: this https URL
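
The fusion step named in the abstract above, Majority Vote by Character Position (MVCP), can be sketched as follows: given OCR readings of several super-resolved images of the same plate, take the most frequent character at each position. Restricting the vote to readings of the most common length is an assumption made here to keep positions aligned; the paper may handle length mismatches differently.

```python
from collections import Counter

def majority_vote_by_position(predictions):
    """Fuse several OCR strings by voting on each character position.

    Only predictions with the most common length take part (assumption).
    Ties go to the character seen first.
    """
    length = Counter(len(p) for p in predictions).most_common(1)[0][0]
    candidates = [p for p in predictions if len(p) == length]
    return "".join(
        Counter(chars).most_common(1)[0][0] for chars in zip(*candidates)
    )

# Five hypothetical OCR readings of the same plate, each with at most one error.
readings = ["ABC1234", "A8C1234", "ABC1Z34", "ABC1234", "ABG1234"]
fused = majority_vote_by_position(readings)
```

Since errors in different frames rarely fall on the same position, the per-position vote can recover the correct plate even when no single reading dominates.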

Title: My Emotion on your face: The use of Facial Keypoint Detection to preserve Emotions in Latent Space Editing

Authors: Jingrui He, Andrew Stephen McGough
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.06436
Pdf URL: https://arxiv.org/pdf/2505.06436
Copy Paste: [[2505.06436]] My Emotion on your face: The use of Facial Keypoint Detection to preserve Emotions in Latent Space Editing(https://arxiv.org/abs/2505.06436)
Keywords: generative
Abstract: Generative Adversarial Network approaches such as StyleGAN/2 provide two key benefits: the ability to generate photo-realistic face images and a semantically structured latent space from which these images are created. Many approaches have emerged for editing images derived from vectors in the latent space of pre-trained StyleGAN/2 models by identifying semantically meaningful directions (e.g., gender or age) in the latent space. By moving the vector in a specific direction, the ideal result would only change the target feature while preserving all the other features. This would provide an ideal data augmentation approach for gesture research, as it could be used to generate numerous image variations whilst keeping the facial expressions intact. However, entanglement issues, where changing one feature inevitably affects other features, impact the ability to preserve facial expressions. To address this, we propose the use of an addition to the loss function of a Facial Keypoint Detection model to restrict changes to the facial expressions. Building on top of an existing model, we add the proposed Human Face Landmark Detection (HFLD) loss, provided by a pre-trained Facial Keypoint Detection model, to the original loss function. We quantitatively and qualitatively evaluate the existing and our extended model, showing the effectiveness of our approach in addressing the entanglement issue and maintaining the facial expression. Our approach achieves up to 49% reduction in the change of emotion in our experiments. Moreover, we show the benefit of our approach by comparing with state-of-the-art models. By increasing the ability to preserve the facial gesture and expression during facial transformation, we present a way to create human face images with fixed expression but different appearances, making it a reliable data augmentation approach for Facial Gesture and Expression research.
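Abstract (translated from the Chinese): Generative adversarial network approaches such as StyleGAN/2 offer two key benefits: they can generate photo-realistic face images, and they have a semantically structured latent space from which those images are created. Many methods have emerged for editing images derived from the latent space of a pre-trained StyleGAN/2 model by identifying semantically meaningful directions (e.g., gender or age) in that space. Ideally, moving a vector in a specific direction changes only the target feature while preserving all other features. This would be an ideal data augmentation approach for gesture research, because it could produce many image variations while keeping facial expressions intact. However, entanglement, where changing one feature inevitably affects other features, impairs the ability to preserve facial expressions. To address this, we propose an addition to the loss function, derived from a Facial Keypoint Detection model, that restricts changes to facial expressions. Building on an existing model, we add the proposed Human Face Landmark Detection (HFLD) loss, provided by a pre-trained Facial Keypoint Detection model, to the original loss function. We evaluate the existing model and our extended model quantitatively and qualitatively, showing that our approach is effective at addressing entanglement and maintaining facial expression. In our experiments, our approach reduces the change of emotion by up to 49%. We also show the benefit of our approach by comparing it with state-of-the-art models. By improving the ability to preserve facial gestures and expressions during facial transformation, we present a way to create human face images with a fixed expression but different appearances, making it a reliable data augmentation approach for facial gesture and expression research.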

Title: PromptIQ: Who Cares About Prompts? Let System Handle It -- A Component-Aware Framework for T2I Generation

Authors: Nisan Chhetri, Arpan Sainju
Subjects: cs.CV, cs.HC
Abstract URL: https://arxiv.org/abs/2505.06467
Pdf URL: https://arxiv.org/pdf/2505.06467
Copy Paste: [[2505.06467]] PromptIQ: Who Cares About Prompts? Let System Handle It -- A Component-Aware Framework for T2I Generation(https://arxiv.org/abs/2505.06467)
Keywords: generation
Abstract: Generating high-quality images without prompt engineering expertise remains a challenge for text-to-image (T2I) models, which often misinterpret poorly structured prompts, leading to distortions and misalignments. While humans easily recognize these flaws, metrics like CLIP fail to capture structural inconsistencies, exposing a key limitation in current evaluation methods. To address this, we introduce PromptIQ, an automated framework that refines prompts and assesses image quality using our novel Component-Aware Similarity (CAS) metric, which detects and penalizes structural errors. Unlike conventional methods, PromptIQ iteratively generates and evaluates images until the user is satisfied, eliminating trial-and-error prompt tuning. Our results show that PromptIQ significantly improves generation quality and evaluation accuracy, making T2I models more accessible for users with little to no prompt engineering expertise.
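Abstract (translated from the Chinese): Generating high-quality images without prompt engineering expertise remains a challenge for text-to-image (T2I) models, which often misinterpret poorly structured prompts, leading to distortions and misalignments. Although humans readily recognize these flaws, metrics such as CLIP fail to capture structural inconsistencies, exposing a key limitation of current evaluation methods. To address this, we introduce PromptIQ, an automated framework that refines prompts and assesses image quality using our novel Component-Aware Similarity (CAS) metric, which detects and penalizes structural errors. Unlike conventional methods, PromptIQ iteratively generates and evaluates images until the user is satisfied, eliminating trial-and-error prompt tuning. Our results show that PromptIQ significantly improves generation quality and evaluation accuracy, making T2I models more accessible to users with little or no prompt engineering expertise.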

Title: HCMA: Hierarchical Cross-model Alignment for Grounded Text-to-Image Generation

Authors: Hang Wang, Zhi-Qi Cheng, Chenhao Lin, Chao Shen, Lei Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.06512
Pdf URL: https://arxiv.org/pdf/2505.06512
Copy Paste: [[2505.06512]] HCMA: Hierarchical Cross-model Alignment for Grounded Text-to-Image Generation(https://arxiv.org/abs/2505.06512)
Keywords: generation
Abstract: Text-to-image synthesis has progressed to the point where models can generate visually compelling images from natural language prompts. Yet, existing methods often fail to reconcile high-level semantic fidelity with explicit spatial control, particularly in scenes involving multiple objects, nuanced relations, or complex layouts. To bridge this gap, we propose a Hierarchical Cross-Modal Alignment (HCMA) framework for grounded text-to-image generation. HCMA integrates two alignment modules into each diffusion sampling step: a global module that continuously aligns latent representations with textual descriptions to ensure scene-level coherence, and a local module that employs bounding-box layouts to anchor objects at specified locations, enabling fine-grained spatial control. Extensive experiments on the MS-COCO 2014 validation set show that HCMA surpasses state-of-the-art baselines, achieving a 0.69 improvement in Fréchet Inception Distance (FID) and a 0.0295 gain in CLIP Score. These results demonstrate HCMA's effectiveness in faithfully capturing intricate textual semantics while adhering to user-defined spatial constraints, offering a robust solution for semantically grounded image generation. Our code is available at this https URL
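Abstract (translated from the Chinese): Text-to-image synthesis has advanced to the point where models can produce visually compelling images from natural language prompts. However, existing methods often fail to reconcile high-level semantic fidelity with explicit spatial control, especially in scenes involving multiple objects, nuanced relations, or complex layouts. To bridge this gap, we propose a Hierarchical Cross-Modal Alignment (HCMA) framework for grounded text-to-image generation. HCMA integrates two alignment modules into each diffusion sampling step: a global module that continuously aligns latent representations with textual descriptions to ensure scene-level coherence, and a local module that uses bounding-box layouts to anchor objects at specified locations, enabling fine-grained spatial control. Extensive experiments on the MS-COCO 2014 validation set show that HCMA surpasses state-of-the-art baselines, improving Fréchet Inception Distance (FID) by 0.69 and CLIP Score by 0.0295. These results demonstrate HCMA's effectiveness in faithfully capturing intricate textual semantics while adhering to user-defined spatial constraints, offering a robust solution for semantically grounded image generation. The code is available at this https URL.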

Title: ProFashion: Prototype-guided Fashion Video Generation with Multiple Reference Images

Authors: Xianghao Kong, Qiaosong Qi, Yuanbin Wang, Anyi Rao, Biaolong Chen, Aixi Zhang, Si Liu, Hao Jiang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.06537
Pdf URL: https://arxiv.org/pdf/2505.06537
Copy Paste: [[2505.06537]] ProFashion: Prototype-guided Fashion Video Generation with Multiple Reference Images(https://arxiv.org/abs/2505.06537)
Keywords: generation
Abstract: Fashion video generation aims to synthesize temporally consistent videos from reference images of a designated character. Despite significant progress, existing diffusion-based methods only support a single reference image as input, severely limiting their capability to generate view-consistent fashion videos, especially when there are different patterns on the clothes from different perspectives. Moreover, the widely adopted motion module does not sufficiently model human body movement, leading to sub-optimal spatiotemporal consistency. To address these issues, we propose ProFashion, a fashion video generation framework leveraging multiple reference images to achieve improved view consistency and temporal coherency. To effectively leverage features from multiple reference images while maintaining a reasonable computational cost, we devise a Pose-aware Prototype Aggregator, which selects and aggregates global and fine-grained reference features according to pose information to form frame-wise prototypes, which serve as guidance in the denoising process. To further enhance motion consistency, we introduce a Flow-enhanced Prototype Instantiator, which exploits the human keypoint motion flow to guide an extra spatiotemporal attention process in the denoiser. To demonstrate the effectiveness of ProFashion, we extensively evaluate our method on the MRFashion-7K dataset we collected from the Internet. ProFashion also outperforms previous methods on the UBC Fashion dataset.
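Abstract (translated from the Chinese): Fashion video generation aims to synthesize temporally consistent videos from reference images of a designated character. Despite significant progress, existing diffusion-based methods support only a single reference image as input, which severely limits their ability to generate view-consistent fashion videos, especially when the clothes show different patterns from different perspectives. In addition, the widely adopted motion module does not model human body movement sufficiently, leading to sub-optimal spatiotemporal consistency. To address these issues, we propose ProFashion, a fashion video generation framework that leverages multiple reference images to improve view consistency and temporal coherency. To use features from multiple reference images effectively while keeping the computational cost reasonable, we design a Pose-aware Prototype Aggregator, which selects and aggregates global and fine-grained reference features according to pose information to form frame-wise prototypes that guide the denoising process. To further improve motion consistency, we introduce a Flow-enhanced Prototype Instantiator, which exploits human keypoint motion flow to guide an extra spatiotemporal attention process in the denoiser. To demonstrate the effectiveness of ProFashion, we evaluate our method extensively on the MRFashion-7K dataset, which we collected from the Internet. ProFashion also outperforms previous methods on the UBC Fashion dataset.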

Title: HDGlyph: A Hierarchical Disentangled Glyph-Based Framework for Long-Tail Text Rendering in Diffusion Models

Authors: Shuhan Zhuang, Mengqi Huang, Fengyi Fu, Nan Chen, Bohan Lei, Zhendong Mao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.06543
Pdf URL: https://arxiv.org/pdf/2505.06543
Copy Paste: [[2505.06543]] HDGlyph: A Hierarchical Disentangled Glyph-Based Framework for Long-Tail Text Rendering in Diffusion Models(https://arxiv.org/abs/2505.06543)
Keywords: generation
Abstract: Visual text rendering, which aims to accurately integrate specified textual content within generated images, is critical for various applications such as commercial design. Despite recent advances, current methods struggle with long-tail text cases, particularly when handling unseen or small-sized text. In this work, we propose a novel Hierarchical Disentangled Glyph-Based framework (HDGlyph) that hierarchically decouples text generation from non-text visual synthesis, enabling joint optimization of both common and long-tail text rendering. At the training stage, HDGlyph disentangles pixel-level representations via the Multi-Linguistic GlyphNet and the Glyph-Aware Perceptual Loss, ensuring robust rendering even for unseen characters. At inference time, HDGlyph applies Noise-Disentangled Classifier-Free Guidance and a Latent-Disentangled Two-Stage Rendering (LD-TSR) scheme, which refines both background and small-sized text. Extensive evaluations show that our model consistently outperforms others, with 5.08% and 11.7% accuracy gains in English and Chinese text rendering, respectively, while maintaining high image quality. It also excels in long-tail scenarios with strong accuracy and visual performance.
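Abstract (translated from the Chinese): Visual text rendering, which aims to accurately integrate specified textual content into generated images, is critical for applications such as commercial design. Despite recent advances, current methods struggle with long-tail text cases, especially when handling unseen or small-sized text. In this work, we propose a novel Hierarchical Disentangled Glyph-Based framework (HDGlyph) that hierarchically decouples text generation from non-text visual synthesis, allowing common and long-tail text rendering to be optimized jointly. At the training stage, HDGlyph disentangles pixel-level representations through the Multi-Linguistic GlyphNet and the Glyph-Aware Perceptual Loss, ensuring robust rendering even for unseen characters. At inference time, HDGlyph applies Noise-Disentangled Classifier-Free Guidance and a Latent-Disentangled Two-Stage Rendering (LD-TSR) scheme, which refines both the background and small-sized text. Extensive evaluations show that our model consistently outperforms others, with accuracy gains of 5.08% and 11.7% in English and Chinese text rendering while maintaining high image quality. It also performs well in long-tail scenarios, with strong accuracy and visual performance.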

Title: ReplayCAD: Generative Diffusion Replay for Continual Anomaly Detection

Authors: Lei Hu, Zhiyong Gan, Ling Deng, Jinglin Liang, Lingyu Liang, Shuangping Huang, Tianshui Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.06603
Pdf URL: https://arxiv.org/pdf/2505.06603
Copy Paste: [[2505.06603]] ReplayCAD: Generative Diffusion Replay for Continual Anomaly Detection(https://arxiv.org/abs/2505.06603)
Keywords: generative
Abstract: Continual Anomaly Detection (CAD) enables anomaly detection models to learn new classes while preserving knowledge of historical classes. CAD faces two key challenges: catastrophic forgetting and segmentation of small anomalous regions. Existing CAD methods store image distributions or patch features to mitigate catastrophic forgetting, but they fail to preserve pixel-level detailed features for accurate segmentation. To overcome this limitation, we propose ReplayCAD, a novel diffusion-driven generative replay framework that replays high-quality historical data, thus effectively preserving pixel-level detailed features. Specifically, we compress historical data by searching for a class semantic embedding in the conditional space of the pre-trained diffusion model, which can guide the model to replay data with fine-grained pixel details, thus improving the segmentation performance. However, relying solely on semantic features results in limited spatial diversity. Hence, we further use spatial features to guide data compression, achieving precise control of sample space, thereby generating more diverse data. Our method achieves state-of-the-art performance in both classification and segmentation, with notable improvements in segmentation: 11.5% on VisA and 8.1% on MVTec. Our source code is available at this https URL.
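Abstract (translated from the Chinese): Continual Anomaly Detection (CAD) enables anomaly detection models to learn new classes while preserving knowledge of historical classes. CAD faces two main challenges: catastrophic forgetting and the segmentation of small anomalous regions. Existing CAD methods store image distributions or patch features to mitigate catastrophic forgetting, but they cannot preserve the pixel-level detailed features needed for accurate segmentation. To overcome this limitation, we propose ReplayCAD, a novel diffusion-driven generative replay framework that replays high-quality historical data and thereby effectively preserves pixel-level detailed features. Specifically, we compress historical data by searching for a class semantic embedding in the conditional space of a pre-trained diffusion model; this embedding guides the model to replay data with fine-grained pixel details, improving segmentation performance. However, relying on semantic features alone limits spatial diversity. We therefore also use spatial features to guide data compression, achieving precise control over the sample space and generating more diverse data. Our method achieves state-of-the-art performance in both classification and segmentation, with notable segmentation improvements of 11.5% on VisA and 8.1% on MVTec. Our source code is available at this https URL.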

Title: Dataset Distillation with Probabilistic Latent Features

Authors: Zhe Li, Sarah Cechnicka, Cheng Ouyang, Katharina Breininger, Peter Schüffler, Bernhard Kainz
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.06647
Pdf URL: https://arxiv.org/pdf/2505.06647
Copy Paste: [[2505.06647]] Dataset Distillation with Probabilistic Latent Features(https://arxiv.org/abs/2505.06647)
Keywords: generative
Abstract: As deep learning models grow in complexity and the volume of training data increases, reducing storage and computational costs becomes increasingly important. Dataset distillation addresses this challenge by synthesizing a compact set of synthetic data that can effectively replace the original dataset in downstream classification tasks. While existing methods typically rely on mapping data from pixel space to the latent space of a generative model, we propose a novel stochastic approach that models the joint distribution of latent features. This allows our method to better capture spatial structures and produce diverse synthetic samples, which benefits model training. Specifically, we introduce a low-rank multivariate normal distribution parameterized by a lightweight network. This design maintains low computational complexity and is compatible with various matching networks used in dataset distillation. After distillation, synthetic images are generated by feeding the learned latent features into a pretrained generator. These synthetic images are then used to train classification models, and performance is evaluated on a real test set. We validate our method on several benchmarks, including ImageNet subsets, CIFAR-10, and the MedMNIST histopathological dataset. Our approach achieves state-of-the-art cross-architecture performance across a range of backbone architectures, demonstrating its generality and effectiveness.
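Abstract (translated from the Chinese): As deep learning models grow in complexity and the volume of training data increases, reducing storage and computational costs becomes increasingly important. Dataset distillation addresses this challenge by synthesizing a compact set of synthetic data that can effectively replace the original dataset in downstream classification tasks. Whereas existing methods typically rely on mapping data from pixel space to the latent space of a generative model, we propose a novel stochastic approach that models the joint distribution of latent features. This lets our method capture spatial structures better and produce diverse synthetic samples, which benefits model training. Specifically, we introduce a low-rank multivariate normal distribution parameterized by a lightweight network. This design keeps computational complexity low and is compatible with the various matching networks used in dataset distillation. After distillation, synthetic images are generated by feeding the learned latent features into a pretrained generator. These synthetic images are then used to train classification models, and performance is evaluated on a real test set. We validate our method on several benchmarks, including ImageNet subsets, CIFAR-10, and the MedMNIST histopathological dataset. Our approach achieves state-of-the-art cross-architecture performance across a range of backbone architectures, demonstrating its generality and effectiveness.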
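
The low-rank multivariate normal mentioned in the abstract above can be illustrated with a minimal sketch. It assumes the common parameterization with covariance W Wᵀ + diag(d), which the abstract does not spell out; the names `sample_low_rank_mvn`, `implied_covariance`, `factor`, and `diag` are hypothetical, and the fixed numbers stand in for values that the paper says come from a lightweight network.

```python
import math
import random

def sample_low_rank_mvn(mean, factor, diag, rng):
    """Draw one sample from N(mean, factor @ factor^T + diag(diag)).

    mean:   list of d floats
    factor: d x r nested list (assumed low-rank covariance factor, r << d)
    diag:   list of d positive variances
    """
    d, r = len(factor), len(factor[0])
    eps_low = [rng.gauss(0.0, 1.0) for _ in range(r)]   # noise in the rank-r subspace
    eps_diag = [rng.gauss(0.0, 1.0) for _ in range(d)]  # independent per-dimension noise
    return [
        mean[i]
        + sum(factor[i][k] * eps_low[k] for k in range(r))
        + math.sqrt(diag[i]) * eps_diag[i]
        for i in range(d)
    ]

def implied_covariance(factor, diag):
    """Materialise the d x d covariance (only sensible for small d)."""
    d, r = len(factor), len(factor[0])
    return [
        [sum(factor[i][k] * factor[j][k] for k in range(r)) + (diag[i] if i == j else 0.0)
         for j in range(d)]
        for i in range(d)
    ]

# Hypothetical parameters; in the paper these would be predicted, not hand-set.
rng = random.Random(0)
mean = [0.0, 0.0, 0.0, 0.0]
factor = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [-1.0, 0.5]]
diag = [0.1, 0.1, 0.1, 0.1]
sample = sample_low_rank_mvn(mean, factor, diag, rng)
cov = implied_covariance(factor, diag)
```

Sampling needs only r + d random numbers and O(d·r) work per draw, so the full d x d covariance never has to be stored; this is the usual reason for choosing a low-rank form.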

Title: Jailbreaking the Text-to-Video Generative Models

Authors: Jiayang Liu, Siyuan Liang, Shiqian Zhao, Rongcheng Tu, Wenbo Zhou, Xiaochun Cao, Dacheng Tao, Siew Kei Lam
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.06679
Pdf URL: https://arxiv.org/pdf/2505.06679
Copy Paste: [[2505.06679]] Jailbreaking the Text-to-Video Generative Models(https://arxiv.org/abs/2505.06679)
Keywords: generation, generative
Abstract: Text-to-video generative models have achieved significant progress, driven by the rapid advancements in diffusion models, with notable examples including Pika, Luma, Kling, and Sora. Despite their remarkable generation ability, their vulnerability to jailbreak attacks, i.e., being induced to generate unsafe content, including pornography, violence, and discrimination, raises serious safety concerns. Existing efforts, such as T2VSafetyBench, have provided valuable benchmarks for evaluating the safety of text-to-video models against unsafe prompts but lack systematic studies for exploiting their vulnerabilities effectively. In this paper, we propose the first optimization-based jailbreak attack designed specifically for text-to-video models. Our approach formulates the prompt generation task as an optimization problem with three key objectives: (1) maximizing the semantic similarity between the input and generated prompts, (2) ensuring that the generated prompts can evade the safety filter of the text-to-video model, and (3) maximizing the semantic similarity between the generated videos and the original input prompts. To further enhance the robustness of the generated prompts, we introduce a prompt mutation strategy that creates multiple prompt variants in each iteration, selecting the most effective one based on the averaged score. This strategy not only improves the attack success rate but also boosts the semantic relevance of the generated video. We conduct extensive experiments across multiple text-to-video models, including Open-Sora, Pika, Luma, and Kling. The results demonstrate that our method not only achieves a higher attack success rate compared to baseline methods but also generates videos with greater semantic similarity to the original input prompts.
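Abstract (translated from the Chinese): Driven by rapid advances in diffusion models, text-to-video generative models have made significant progress, with notable examples including Pika, Luma, Kling, and Sora. Despite their remarkable generation ability, their vulnerability to jailbreak attacks, i.e., being induced to generate unsafe content such as pornography, violence, and discrimination, raises serious safety concerns. Existing efforts such as T2VSafetyBench provide valuable benchmarks for evaluating the safety of text-to-video models against unsafe prompts, but they lack systematic study of how to exploit these vulnerabilities effectively. In this paper, we propose the first optimization-based jailbreak attack designed specifically for text-to-video models. Our approach formulates prompt generation as an optimization problem with three key objectives: (1) maximizing the semantic similarity between the input and generated prompts, (2) ensuring that the generated prompts can evade the safety filter of the text-to-video model, and (3) maximizing the semantic similarity between the generated videos and the original input prompts. To further improve the robustness of the generated prompts, we introduce a prompt mutation strategy that creates multiple prompt variants in each iteration and selects the most effective one according to the averaged score. This strategy improves the attack success rate and also raises the semantic relevance of the generated video. We conduct extensive experiments on multiple text-to-video models, including Open-Sora, Pika, Luma, and Kling. The results show that, compared with baseline methods, our method achieves a higher attack success rate and generates videos with greater semantic similarity to the original input prompts.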

Title: UnfoldIR: Rethinking Deep Unfolding Network in Illumination Degradation Image Restoration

Authors: Chunming He, Rihan Zhang, Fengyang Xiao, Chengyu Fang, Longxiang Tang, Yulun Zhang, Sina Farsiu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.06683
Pdf URL: https://arxiv.org/pdf/2505.06683
Copy Paste: [[2505.06683]] UnfoldIR: Rethinking Deep Unfolding Network in Illumination Degradation Image Restoration(https://arxiv.org/abs/2505.06683)
Keywords: restoration
Abstract: Deep unfolding networks (DUNs) are widely employed in illumination degradation image restoration (IDIR) to merge the interpretability of model-based approaches with the generalization of learning-based methods. However, the performance of DUN-based methods remains considerably inferior to that of state-of-the-art IDIR solvers. Our investigation indicates that this limitation does not stem from structural shortcomings of DUNs but rather from the limited exploration of the unfolding structure, particularly for (1) constructing task-specific restoration models, (2) integrating advanced network architectures, and (3) designing DUN-specific loss functions. To address these issues, we propose a novel DUN-based method, UnfoldIR, for IDIR tasks. UnfoldIR first introduces a new IDIR model with dedicated regularization terms for smoothing illumination and enhancing texture. We unfold the iterative optimized solution of this model into a multistage network, with each stage comprising a reflectance-assisted illumination correction (RAIC) module and an illumination-guided reflectance enhancement (IGRE) module. RAIC employs a visual state space (VSS) to extract non-local features, enforcing illumination smoothness, while IGRE introduces a frequency-aware VSS to globally align similar textures, enabling mildly degraded regions to guide the enhancement of details in more severely degraded areas. This suppresses noise while enhancing details. Furthermore, given the multistage structure, we propose an inter-stage information consistent loss to maintain network stability in the final stages. This loss contributes to structural preservation and sustains the model's performance even in unsupervised settings. Experiments verify our effectiveness across 5 IDIR tasks and 3 downstream problems.
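Abstract (translated from the Chinese): Deep unfolding networks (DUNs) are widely used in illumination degradation image restoration (IDIR) to combine the interpretability of model-based approaches with the generalization of learning-based methods. However, the performance of DUN-based methods remains considerably below that of state-of-the-art IDIR solvers. Our investigation indicates that this limitation stems not from structural shortcomings of DUNs but from limited exploration of the unfolding structure, particularly in (1) constructing task-specific restoration models, (2) integrating advanced network architectures, and (3) designing DUN-specific loss functions. To address these issues, we propose UnfoldIR, a novel DUN-based method for IDIR tasks. UnfoldIR first introduces a new IDIR model with dedicated regularization terms for smoothing illumination and enhancing texture. We unfold the iterative optimized solution of this model into a multistage network, in which each stage comprises a reflectance-assisted illumination correction (RAIC) module and an illumination-guided reflectance enhancement (IGRE) module. RAIC uses a visual state space (VSS) to extract non-local features and enforce illumination smoothness, while IGRE introduces a frequency-aware VSS to align similar textures globally, so that mildly degraded regions can guide the enhancement of details in more severely degraded areas. This suppresses noise while enhancing details. Furthermore, given the multistage structure, we propose an inter-stage information consistent loss to maintain network stability in the final stages. This loss contributes to structural preservation and sustains the model's performance even in unsupervised settings. Experiments verify our effectiveness across 5 IDIR tasks and 3 downstream problems.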

Title: Learning Graph Representation of Agent Diffuser

Authors: Youcef Djenouri, Nassim Belmecheri, Tomasz Michalak, Jan Dubiński, Ahmed Nabil Belbachir, Anis Yazidi
Subjects: cs.LG, cs.MA
Abstract URL: https://arxiv.org/abs/2505.06761
Pdf URL: https://arxiv.org/pdf/2505.06761
Copy Paste: [[2505.06761]] Learning Graph Representation of Agent Diffuser(https://arxiv.org/abs/2505.06761)
Keywords: generation, generative
Abstract: Diffusion-based generative models have significantly advanced text-to-image synthesis, demonstrating impressive text comprehension and zero-shot generalization. These models refine images from random noise based on textual prompts, with initial reliance on text input shifting towards enhanced visual fidelity over time. This transition suggests that static model parameters might not optimally address the distinct phases of generation. We introduce LGR-AD (Learning Graph Representation of Agent Diffusers), a novel multi-agent system designed to improve adaptability in dynamic computer vision tasks. LGR-AD models the generation process as a distributed system of interacting agents, each representing an expert sub-model. These agents dynamically adapt to varying conditions and collaborate through a graph neural network that encodes their relationships and performance metrics. Our approach employs a coordination mechanism based on top-$k$ maximum spanning trees, optimizing the generation process. Each agent's decision-making is guided by a meta-model that minimizes a novel loss function, balancing accuracy and diversity. Theoretical analysis and extensive empirical evaluations show that LGR-AD outperforms traditional diffusion models across various benchmarks, highlighting its potential for scalable and flexible solutions in complex image generation tasks. Code is available at: this https URL
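Abstract (translated from the Chinese): Diffusion-based generative models have significantly advanced text-to-image synthesis, demonstrating impressive text comprehension and zero-shot generalization. These models refine images from random noise based on textual prompts, with an initial reliance on text input that shifts over time towards enhanced visual fidelity. This transition suggests that static model parameters may not address the distinct phases of generation optimally. We introduce LGR-AD (Learning Graph Representation of Agent Diffusers), a novel multi-agent system designed to improve adaptability in dynamic computer vision tasks. LGR-AD models the generation process as a distributed system of interacting agents, each representing an expert sub-model. These agents adapt dynamically to varying conditions and collaborate through a graph neural network that encodes their relationships and performance metrics. Our approach uses a coordination mechanism based on top-$k$ maximum spanning trees to optimize the generation process. Each agent's decision-making is guided by a meta-model that minimizes a novel loss function balancing accuracy and diversity. Theoretical analysis and extensive empirical evaluations show that LGR-AD outperforms traditional diffusion models across various benchmarks, highlighting its potential for scalable and flexible solutions in complex image generation tasks. Code is available at: this https URL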
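
The abstract above mentions coordination through top-$k$ maximum spanning trees without giving details. As a hedged illustration of the underlying primitive only, the sketch below finds a single maximum spanning tree with Kruskal's algorithm; it does not enumerate the top $k$ trees, and the function name, the edge weights, and their reading as agent affinities are all hypothetical.

```python
def maximum_spanning_tree(n_nodes, edges):
    """Kruskal's algorithm on edges sorted by descending weight.

    edges: iterable of (weight, u, v) with 0 <= u, v < n_nodes.
    Returns the chosen edges (a spanning forest if the graph is disconnected).
    """
    parent = list(range(n_nodes))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for weight, u, v in sorted(edges, reverse=True):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:       # edge joins two components, so keep it
            parent[root_u] = root_v
            tree.append((weight, u, v))
    return tree

# Hypothetical affinity scores between four agents (higher = stronger link).
edges = [(0.9, 0, 1), (0.2, 0, 2), (0.7, 1, 2), (0.4, 1, 3), (0.8, 2, 3)]
tree = maximum_spanning_tree(4, edges)
total = sum(weight for weight, _, _ in tree)
```

Sorting in descending order is the only change from the familiar minimum-spanning-tree version; the union-find check still guarantees that no cycle is formed.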

Title: Multimodal Fake News Detection: MFND Dataset and Shallow-Deep Multitask Learning

Authors: Ye Zhu, Yunan Wang, Zitong Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.06796
Pdf URL: https://arxiv.org/pdf/2505.06796
Copy Paste: [[2505.06796]] Multimodal Fake News Detection: MFND Dataset and Shallow-Deep Multitask Learning(https://arxiv.org/abs/2505.06796)
Keywords: generation
Abstract: Multimodal news contains a wealth of information and is easily affected by deepfake modeling attacks. To combat the latest image and text generation methods, we present a new Multimodal Fake News Detection dataset (MFND) containing 11 manipulated types, designed to detect and localize highly authentic fake news. Furthermore, we propose a Shallow-Deep Multitask Learning (SDML) model for fake news, which fully uses unimodal and mutual modal features to mine the intrinsic semantics of news. Under shallow inference, we propose the momentum distillation-based light punishment contrastive learning for fine-grained uniform spatial image and text semantic alignment, and an adaptive cross-modal fusion module to enhance mutual modal features. Under deep inference, we design a two-branch framework to augment the image and text unimodal features, respectively merging with mutual modalities features, for four predictions via dedicated detection and localization projections. Experiments on both mainstream and our proposed datasets demonstrate the superiority of the model. Codes and dataset are released at this https URL.
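Abstract (translated from the Chinese): Multimodal news contains a wealth of information and is easily affected by deepfake modeling attacks. To combat the latest image and text generation methods, we present a new Multimodal Fake News Detection dataset (MFND) containing 11 manipulated types, designed to detect and localize highly authentic fake news. We also propose a Shallow-Deep Multitask Learning (SDML) model for fake news, which makes full use of unimodal and mutual modal features to mine the intrinsic semantics of news. Under shallow inference, we propose momentum distillation-based light punishment contrastive learning for fine-grained uniform spatial image and text semantic alignment, together with an adaptive cross-modal fusion module that enhances mutual modal features. Under deep inference, we design a two-branch framework that augments the image and text unimodal features, merges each with mutual modality features, and produces four predictions through dedicated detection and localization projections. Experiments on both mainstream datasets and our proposed dataset demonstrate the superiority of the model. Code and dataset are released at this https URL.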

Title: Topology Guidance: Controlling the Outputs of Generative Models via Vector Field Topology

Authors: Xiaohan Wang, Matthew Berger
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2505.06804
Pdf URL: https://arxiv.org/pdf/2505.06804
Copy Paste: [[2505.06804]] Topology Guidance: Controlling the Outputs of Generative Models via Vector Field Topology(https://arxiv.org/abs/2505.06804)
Keywords: generation, generative
Abstract: For domains that involve numerical simulation, it can be computationally expensive to run an ensemble of simulations spanning a parameter space of interest to a user. To this end, an attractive surrogate for simulation is the generative modeling of fields produced by an ensemble, allowing one to synthesize fields in a computationally cheap, yet accurate, manner. However, for the purposes of visual analysis, a limitation of generative models is their lack of control, as it is unclear what one should expect when sampling a field from a model. In this paper we study how to make generative models of fields more controllable, so that users can specify features of interest, in particular topological features, that they wish to see in the output. We propose topology guidance, a method for guiding the sampling process of a generative model, specifically a diffusion model, such that a topological description specified as input is satisfied in the generated output. Central to our method, we couple a coordinate-based neural network used to represent fields, with a diffusion model used for generation. We show how to use topologically-relevant signals provided by the coordinate-based network to help guide the denoising process of a diffusion model. This enables us to faithfully represent a user's specified topology, while ensuring that the output field remains within the generative data distribution. Specifically, we study 2D vector field topology, evaluating our method over an ensemble of fluid flows, where we show that generated vector fields faithfully adhere to the location, and type, of critical points over the spatial domain. We further show the benefits of our method in aiding the comparison of ensembles, allowing one to explore commonalities and differences in distributions along prescribed topological features.
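Abstract (translated from the Chinese): For domains that involve numerical simulation, running an ensemble of simulations that spans a parameter space of interest to a user can be computationally expensive. An attractive surrogate for simulation is therefore the generative modeling of the fields produced by an ensemble, which allows fields to be synthesized in a computationally cheap yet accurate manner. For visual analysis, however, a limitation of generative models is their lack of control, since it is unclear what to expect when sampling a field from a model. In this paper we study how to make generative models of fields more controllable, so that users can specify the features of interest, in particular topological features, that they wish to see in the output. We propose topology guidance, a method for guiding the sampling process of a generative model, specifically a diffusion model, so that a topological description given as input is satisfied in the generated output. Central to our method, we couple a coordinate-based neural network used to represent fields with a diffusion model used for generation. We show how to use topologically relevant signals provided by the coordinate-based network to help guide the denoising process of the diffusion model. This lets us represent a user's specified topology faithfully while ensuring that the output field remains within the generative data distribution. Specifically, we study 2D vector field topology and evaluate our method on an ensemble of fluid flows, where we show that the generated vector fields faithfully adhere to the location and type of critical points over the spatial domain. We further show the benefit of our method in aiding the comparison of ensembles, allowing one to explore commonalities and differences in distributions along prescribed topological features.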

Title: Joint Low-level and High-level Textual Representation Learning with Multiple Masking Strategies

Authors: Zhengmi Tang, Yuto Mitsui, Tomo Miyazaki, Shinichiro Omachi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.06855
Pdf URL: https://arxiv.org/pdf/2505.06855
Copy Paste: [[2505.06855]] Joint Low-level and High-level Textual Representation Learning with Multiple Masking Strategies(https://arxiv.org/abs/2505.06855)
Keywords: super-resolution
Abstract: Most existing text recognition methods are trained on large-scale synthetic datasets due to the scarcity of labeled real-world datasets. Synthetic images, however, cannot faithfully reproduce real-world scenarios, such as uneven illumination, irregular layout, occlusion, and degradation, resulting in performance disparities when handling complex real-world images. Recent self-supervised learning techniques, notably contrastive learning and masked image modeling (MIM), narrow this domain gap by exploiting unlabeled real text images. This study first analyzes the original Masked AutoEncoder (MAE) and observes that random patch masking predominantly captures low-level textural features but misses high-level contextual representations. To fully exploit the high-level contextual representations, we introduce random blockwise and span masking in the text recognition task. These strategies can mask the continuous image patches and completely remove some characters, forcing the model to infer relationships among characters within a word. Our Multi-Masking Strategy (MMS) integrates random patch, blockwise, and span masking into the MIM frame, which jointly learns low and high-level textual representations. After fine-tuning with real data, MMS outperforms the state-of-the-art self-supervised methods in various text-related tasks, including text recognition, segmentation, and text-image super-resolution.
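Abstract (translated from the Chinese): Most existing text recognition methods are trained on large-scale synthetic datasets because labeled real-world datasets are scarce. Synthetic images, however, cannot faithfully reproduce real-world conditions such as uneven illumination, irregular layout, occlusion, and degradation, which leads to performance disparities on complex real-world images. Recent self-supervised learning techniques, notably contrastive learning and masked image modeling (MIM), narrow this domain gap by exploiting unlabeled real text images. This study first analyzes the original Masked AutoEncoder (MAE) and observes that random patch masking predominantly captures low-level textural features but misses high-level contextual representations. To exploit high-level contextual representations fully, we introduce random blockwise and span masking into the text recognition task. These strategies can mask continuous image patches and remove some characters completely, forcing the model to infer the relationships among characters within a word. Our Multi-Masking Strategy (MMS) integrates random patch, blockwise, and span masking into the MIM framework, which jointly learns low-level and high-level textual representations. After fine-tuning with real data, MMS outperforms state-of-the-art self-supervised methods on various text-related tasks, including text recognition, segmentation, and text-image super-resolution.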
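
The three masking strategies named in the abstract above can be sketched on a grid of image patches. The abstract does not define them precisely, so the definitions below are assumptions made for illustration: random patch masking picks patches uniformly, blockwise masking hides one contiguous rectangle, and span masking hides a full-height run of columns. All function names and grid sizes are hypothetical.

```python
import random

def random_patch_mask(rows, cols, ratio, rng):
    """Mask individual patches chosen uniformly at random."""
    n_mask = int(rows * cols * ratio)
    chosen = set(rng.sample(range(rows * cols), n_mask))
    return [[(r * cols + c) in chosen for c in range(cols)] for r in range(rows)]

def blockwise_mask(rows, cols, block_h, block_w, rng):
    """Mask one contiguous rectangular block of patches."""
    top = rng.randrange(rows - block_h + 1)
    left = rng.randrange(cols - block_w + 1)
    return [[top <= r < top + block_h and left <= c < left + block_w
             for c in range(cols)] for r in range(rows)]

def span_mask(rows, cols, span_w, rng):
    """Mask a full-height run of columns, hiding whole characters in a text line."""
    left = rng.randrange(cols - span_w + 1)
    return [[left <= c < left + span_w for c in range(cols)] for _ in range(rows)]

def count_masked(mask):
    """Number of masked patches (True cells) in a mask."""
    return sum(cell for row in mask for cell in row)

rng = random.Random(0)
rows, cols = 4, 16  # hypothetical patch grid for a wide text-line image
patch = random_patch_mask(rows, cols, 0.5, rng)
block = blockwise_mask(rows, cols, 2, 4, rng)
span = span_mask(rows, cols, 3, rng)
```

Scattered patches leave neighbouring texture visible, so they can be filled in from local cues, whereas a block or span can remove an entire character and leaves only the surrounding characters as evidence; this is the contrast the abstract draws between low-level and high-level representations.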

Title: Image Classification Using a Diffusion Model as a Pre-Training Model

Authors: Kosuke Ukita, Ye Xiaolong, Tsuyoshi Okita
Subjects: cs.LG, cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2505.06890
Pdf URL: https://arxiv.org/pdf/2505.06890
Copy Paste: [[2505.06890]] Image Classification Using a Diffusion Model as a Pre-Training Model(https://arxiv.org/abs/2505.06890)
Keywords: generation
Abstract: In this paper, we propose a diffusion model that integrates a representation-conditioning mechanism, where the representations derived from a Vision Transformer (ViT) are used to condition the internal process of a Transformer-based diffusion model. This approach enables representation-conditioned data generation, addressing the challenge of requiring large-scale labeled datasets by leveraging self-supervised learning on unlabeled data. We evaluate our method through a zero-shot classification task for hematoma detection in brain imaging. Compared to the strong contrastive learning baseline, DINOv2, our method achieves a notable improvement of +6.15% in accuracy and +13.60% in F1-score, demonstrating its effectiveness in image classification.
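Abstract (translated from the Chinese): In this paper, we propose a diffusion model that integrates a representation-conditioning mechanism, in which representations derived from a Vision Transformer (ViT) are used to condition the internal process of a Transformer-based diffusion model. This approach enables representation-conditioned data generation and addresses the need for large-scale labeled datasets by leveraging self-supervised learning on unlabeled data. We evaluate our method on a zero-shot classification task for hematoma detection in brain imaging. Compared with DINOv2, a strong contrastive learning baseline, our method achieves notable improvements of +6.15% in accuracy and +13.60% in F1-score, demonstrating its effectiveness in image classification.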

Title: Multi-Modal Explainable Medical AI Assistant for Trustworthy Human-AI Collaboration

Authors: Honglong Yang, Shanshan Song, Yi Qin, Lehan Wang, Haonan Wang, Xinpeng Ding, Qixiang Zhang, Bodong Du, Xiaomeng Li
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2505.06898
Pdf URL: https://arxiv.org/pdf/2505.06898
Copy Paste: [[2505.06898]] Multi-Modal Explainable Medical AI Assistant for Trustworthy Human-AI Collaboration(https://arxiv.org/abs/2505.06898)
Keywords: generation
Abstract: Generalist Medical AI (GMAI) systems have demonstrated expert-level performance in biomedical perception tasks, yet their clinical utility remains limited by inadequate multi-modal explainability and suboptimal prognostic capabilities. Here, we present XMedGPT, a clinician-centric, multi-modal AI assistant that integrates textual and visual interpretability to support transparent and trustworthy medical decision-making. XMedGPT not only produces accurate diagnostic and descriptive outputs, but also grounds referenced anatomical sites within medical images, bridging critical gaps in interpretability and enhancing clinician usability. To support real-world deployment, we introduce a reliability indexing mechanism that quantifies uncertainty through consistency-based assessment via interactive question-answering. We validate XMedGPT across four pillars: multi-modal interpretability, uncertainty quantification, prognostic modeling, and rigorous benchmarking. The model achieves an IoU of 0.703 across 141 anatomical regions, and a Kendall's tau-b of 0.479, demonstrating strong alignment between visual rationales and clinical outcomes. For uncertainty estimation, it attains an AUC of 0.862 on visual question answering and 0.764 on radiology report generation. In survival and recurrence prediction for lung and glioma cancers, it surpasses prior leading models by 26.9%, and outperforms GPT-4o by 25.0%. Rigorous benchmarking across 347 datasets covers 40 imaging modalities, and external validation spans 4 anatomical systems, confirming exceptional generalizability, with performance gains surpassing existing GMAI by 20.7% for in-domain evaluation and 16.7% on 11,530 in-house data evaluation. Together, XMedGPT represents a significant leap forward in clinician-centric AI integration, offering trustworthy and scalable support for diverse healthcare applications.
摘要：通才医学AI（GMAI）系统在生物医学感知任务中表现出了专家级的性能，但是由于多模式解释性和次优的预后能力，它们的临床实用性仍然受到限制。在这里，我们提出了Xmedgpt，这是一家以临床医生为中心的多式联运AI助手，该助手整合了文本和视觉解释性，以支持透明和可信赖的医疗决策。 XMedGPT不仅会产生准确的诊断和描述性输出，而且还引用了医学图像中引用的解剖位点，从而弥合了可解释性的关键差距并增强了临床医生的可用性。为了支持现实世界的部署，我们介绍了一种可靠性索引机制，该机制通过通过交互式问题驱动来量化不确定性来量化不确定性。我们验证了四个支柱的XMedGPT：多模式的解释性，不确定性定量和预后建模以及严格的基准测试。该模型在141个解剖区域中达到了0.703的IOU，肯德尔的tau-b为0.479，表明视觉原理和临床结果之间的强烈比对。对于不确定性估计，它在视觉问题回答上的AUC为0.862，放射学报告生成0.764。在肺和神经胶质瘤癌的生存和复发预测中，它超过了先前的领先模型26.9％，并且比GPT-4O的表现远高25.0％。在347个数据集中进行严格的基准测试涵盖了40个成像方式和外部验证跨越4个解剖系统，确认了出色的可推广性，并且绩效增长超过了现有的GMAI 20.7％，用于内域评估，为16.7％，在11,530内部数据评估中超过了16.7％。 XMedGPT共同代表了以临床医生为中心的AI集成的重大飞跃，为多样化的医疗保健应用提供了可信赖和可扩展的支持。
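
The XMedGPT abstract above reports an IoU for grounded anatomical regions and a Kendall's tau-b. As a reminder of what those two numbers measure, here is a minimal sketch using the standard textbook definitions. It assumes axis-aligned bounding boxes and plain Python lists; the function names are hypothetical and this is not the authors' evaluation code.

```python
from itertools import combinations
from math import sqrt


def box_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def kendall_tau_b(x, y):
    """Kendall's tau-b rank correlation, which corrects for ties in x and y."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both variables: contributes to neither term
        elif dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + tied_x_only) * (untied + tied_y_only))
    return (concordant - discordant) / denom if denom else 0.0


# Hypothetical example values, not taken from the paper.
iou = box_iou((0, 0, 2, 2), (1, 1, 3, 3))
tau = kendall_tau_b([1, 2, 2, 3], [1, 2, 3, 3])
```

An IoU of 1 means the predicted and reference regions coincide; a tau-b of 1 means the two rankings agree on every untied pair.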

Title: Building a Human-Verified Clinical Reasoning Dataset via a Human LLM Hybrid Pipeline for Trustworthy Medical AI

Authors: Chao Ding, Mouxiao Bian, Pengcheng Chen, Hongliang Zhang, Tianbin Li, Lihao Liu, Jiayuan Chen, Zhuoran Li, Yabei Zhong, Yongqi Liu, Haiqing Huang, Dongming Shan, Junjun He, Jie Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.06912
Pdf URL: https://arxiv.org/pdf/2505.06912
Copy Paste: [[2505.06912]] Building a Human-Verified Clinical Reasoning Dataset via a Human LLM Hybrid Pipeline for Trustworthy Medical AI(https://arxiv.org/abs/2505.06912)
Keywords: generation
Abstract: Despite strong performance in medical question-answering, the clinical adoption of Large Language Models (LLMs) is critically hampered by their opaque 'black-box' reasoning, limiting clinician trust. This challenge is compounded by the predominant reliance of current medical LLMs on corpora from scientific literature or synthetic data, which often lack the granular expert validation and high clinical relevance essential for advancing their specialized medical capabilities. To address these critical gaps, we introduce a highly clinically relevant dataset with 31,247 medical question-answer pairs, each accompanied by expert-validated chain-of-thought (CoT) explanations. This resource, spanning multiple clinical domains, was curated via a scalable human-LLM hybrid pipeline: LLM-generated rationales were iteratively reviewed, scored, and refined by medical experts against a structured rubric, with substandard outputs revised through human effort or guided LLM regeneration until expert consensus. This publicly available dataset provides a vital source for the development of medical LLMs that are capable of transparent and verifiable reasoning, thereby advancing safer and more interpretable AI in medicine.
摘要：尽管在医疗询问方面表现出色，但大语言模型（LLMS）的临床采用受到了不透明的“黑盒”推理的严重阻碍，这限制了临床医生的信任。从科学文献或合成数据中，当前医学LLM对语料库的主要依赖性造成了更加复杂的挑战，这通常缺乏颗粒状专家验证和高度临床相关性，对于推进其专业医疗能力至关重要。为了解决这些关键的差距，我们引入了一个高度临床相关的数据集，其中包括31,247个医疗问题 - 解答对，每个数据集伴随着专家验证的思想链（COT）解释。该资源跨越了多个临床领域，是通过可扩展的人-LLM混合管道策划的：LLM生成的理由是由医学专家对结构化的错误进行了迭代审查，评分和精制的，并通过人类努力或指导LLM Regeneration进行了不合格的输出，直到专家共识。该公开可用的数据集为开发能够透明和可验证推理的医学LLM的开发提供了重要来源，从而推进了医学上更安全，更可解释的AI。

Title: A systematic review of challenges and proposed solutions in modeling multimodal data

Authors: Maryam Farhadizadeh (1 and 2), Maria Weymann (2 and 3), Michael Blaß (4), Johann Kraus (5), Christopher Gundler (4), Sebastian Walter (6), Noah Hempen (1), Harald Binde (2 and 3), Nadine Binder (1 and 2) ((1) Institute of General Practice/Family Medicine, Faculty of Medicine and Medical Center - University of Freiburg, Germany, (2) Freiburg Center for Data Analysis, Modeling and AI, University of Freiburg, Germany, (3) Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center - University of Freiburg, Germany, (4) Institute for Applied Medical Informatics, University Medical Center Hamburg-Eppendorf, Germany, (5) Institute of Medical Systems Biology, Ulm University, Germany, (6) Department of Computer Science, Faculty of Engineering - University of Freiburg, Germany)
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2505.06945
Pdf URL: https://arxiv.org/pdf/2505.06945
Copy Paste: [[2505.06945]] A systematic review of challenges and proposed solutions in modeling multimodal data(https://arxiv.org/abs/2505.06945)
Keywords: generative
Abstract: Multimodal data modeling has emerged as a powerful approach in clinical research, enabling the integration of diverse data types such as imaging, genomics, wearable sensors, and electronic health records. Despite its potential to improve diagnostic accuracy and support personalized care, modeling such heterogeneous data presents significant technical challenges. This systematic review synthesizes findings from 69 studies to identify common obstacles, including missing modalities, limited sample sizes, dimensionality imbalance, interpretability issues, and the difficulty of finding optimal fusion techniques. We highlight recent methodological advances, such as transfer learning, generative models, attention mechanisms, and neural architecture search, that offer promising solutions. By mapping current trends and innovations, this review provides a comprehensive overview of the field and offers practical insights to guide future research and development in multimodal modeling for medical applications.
摘要：多模式数据建模已成为临床研究中的一种强大方法，从而使成像，基因组学，可穿戴传感器和电子健康记录等多种数据类型的整合。尽管具有提高诊断准确性并支持个性化护理的潜力，但对这种异质数据进行建模仍然带来了重大的技术挑战。这项系统评价综合了69项研究的发现，以确定常见的障碍，包括缺失的方式，样本量有限，维度不平衡，可解释性问题以及找到最佳的融合技术。我们重点介绍了最新的方法论进步，例如转移学习，生成模型，注意机制和神经建筑搜索，这些搜索提供了有希望的解决方案。通过绘制当前的趋势和创新，本综述提供了该领域的全面概述，并提供了实用的见解，以指导医疗应用多模式建模的未来研究和开发。

Title: High-Frequency Prior-Driven Adaptive Masking for Accelerating Image Super-Resolution

Authors: Wei Shang, Dongwei Ren, Wanying Zhang, Pengfei Zhu, Qinghua Hu, Wangmeng Zuo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.06975
Pdf URL: https://arxiv.org/pdf/2505.06975
Copy Paste: [[2505.06975]] High-Frequency Prior-Driven Adaptive Masking for Accelerating Image Super-Resolution(https://arxiv.org/abs/2505.06975)
Keywords: super-resolution
Abstract: The primary challenge in accelerating image super-resolution lies in reducing computation while maintaining performance and adaptability. Motivated by the observation that high-frequency regions (e.g., edges and textures) are most critical for reconstruction, we propose a training-free adaptive masking module for acceleration that dynamically focuses computation on these challenging areas. Specifically, our method first extracts high-frequency components via Gaussian blur subtraction and adaptively generates binary masks using K-means clustering to identify regions requiring intensive processing. Our method can be easily integrated with both CNNs and Transformers. For CNN-based architectures, we replace standard $3 \times 3$ convolutions with an unfold operation followed by $1 \times 1$ convolutions, enabling pixel-wise sparse computation guided by the mask. For Transformer-based models, we partition the mask into non-overlapping windows and selectively process tokens based on their average values. During inference, unnecessary pixels or windows are pruned, significantly reducing computation. Moreover, our method supports dilation-based mask adjustment to control the processing scope without retraining, and is robust to unseen degradations (e.g., noise, compression). Extensive experiments on benchmarks demonstrate that our method reduces FLOPs by 24--43% for state-of-the-art models (e.g., CARN, SwinIR) while achieving comparable or better quantitative metrics. The source code is available at this https URL
摘要：加速图像超分辨率的主要挑战在于减少计算，同时保持性能和适应性。通过观察到高频区域（例如边缘和纹理）对于重建最关键的观察，我们提出了一个无训练的自适应遮罩模块，以便将计算动态地集中在这些具有挑战性的领域上。具体而言，我们的方法首先通过高斯模糊减法提取高频组件，并使用K-均值聚类自适应地生成二进制掩码，以识别需要密集处理的区域。我们的方法可以很容易地与CNN和变压器集成在一起。对于基于CNN的体系结构，我们将$ 3 \ times 3 $卷积替换为“爆发”操作，然后是$ 1 \ times 1 $卷积，从而实现了由面具指导的像素稀疏计算。对于基于变压器的模型，我们将蒙版划分为非重叠的窗口，并根据其平均值进行选择性处理令牌。在推断过程中，修剪不必要的像素或窗口会大大降低计算。此外，我们的方法支持基于扩张的掩模调整，以控制处理范围而无需再培训，并且对看不见的降解（例如噪声，压缩）是可靠的。基准的广泛实验表明，对于最先进的模型（例如，Carn，Swinir），我们的方法将FLOPS降低了24--43％，同时实现了可比或更好的定量指标。源代码可在此HTTPS URL上找到
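
The masking step described in the super-resolution abstract above (high-frequency extraction by Gaussian blur subtraction, then a binary mask from K-means clustering) can be sketched in a few lines. This is a toy illustration only: it assumes a grayscale image stored as a list of lists, a fixed 3x3 Gaussian kernel with edge replication, and two clusters over the high-frequency magnitudes. All function names are hypothetical, and the paper's actual kernel size and clustering setup are not given in the abstract.

```python
def gaussian_blur3(img):
    """Blur with a 3x3 Gaussian kernel (1 2 1 / 2 4 2 / 1 2 1) / 16, edges replicated."""
    kernel = ((1, 2, 1), (2, 4, 2), (1, 2, 1))
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr = min(max(r + dr, 0), h - 1)
                    cc = min(max(c + dc, 0), w - 1)
                    acc += kernel[dr + 1][dc + 1] * img[rr][cc]
            out[r][c] = acc / 16.0
    return out


def kmeans2_threshold(values, iters=20):
    """1-D K-means with k=2; returns the midpoint between the two cluster centers."""
    lo, hi = min(values), max(values)
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        low = [v for v in values if v <= mid]
        high = [v for v in values if v > mid]
        if not high:  # all values identical: nothing to separate
            break
        new_lo, new_hi = sum(low) / len(low), sum(high) / len(high)
        if (new_lo, new_hi) == (lo, hi):
            break
        lo, hi = new_lo, new_hi
    return (lo + hi) / 2.0


def high_frequency_mask(img):
    """Binary mask: 1 where |image - blur| falls in the high-magnitude cluster."""
    blur = gaussian_blur3(img)
    hf = [[abs(p - b) for p, b in zip(row, brow)] for row, brow in zip(img, blur)]
    thr = kmeans2_threshold([v for row in hf for v in row])
    return [[1 if v > thr else 0 for v in row] for row in hf]


# A 6x6 image with a vertical step edge between columns 2 and 3.
img = [[0.0, 0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(6)]
mask = high_frequency_mask(img)
```

On this step-edge image the mask selects only the two columns adjacent to the edge, which are the pixels a sparse model would still process; the flat regions on either side are skipped.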

Title: Learning Value of Information towards Joint Communication and Control in 6G V2X

Authors: Lei Lei, Kan Zheng, Xuemin (Sherman)Shen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.06978
Pdf URL: https://arxiv.org/pdf/2505.06978
Copy Paste: [[2505.06978]] Learning Value of Information towards Joint Communication and Control in 6G V2X(https://arxiv.org/abs/2505.06978)
Keywords: generation
Abstract: As Cellular Vehicle-to-Everything (C-V2X) evolves towards future sixth-generation (6G) networks, Connected Autonomous Vehicles (CAVs) are emerging to become a key application. Leveraging data-driven Machine Learning (ML), especially Deep Reinforcement Learning (DRL), is expected to significantly enhance CAV decision-making in both vehicle control and V2X communication under uncertainty. These two decision-making processes are closely intertwined, with the value of information (VoI) acting as a crucial bridge between them. In this paper, we introduce Sequential Stochastic Decision Process (SSDP) models to define and assess VoI, demonstrating their application in optimizing communication systems for CAVs. Specifically, we formally define the SSDP model and demonstrate that the MDP model is a special case of it. The SSDP model offers a key advantage by explicitly representing the set of information that can enhance decision-making when available. Furthermore, as current research on VoI remains fragmented, we propose a systematic VoI modeling framework grounded in the MDP, Reinforcement Learning (RL) and Optimal Control theories. We define different categories of VoI and discuss their corresponding estimation methods. Finally, we present a structured approach to leverage the various VoI metrics for optimizing the "When", "What", and "How" to communicate problems. For this purpose, SSDP models are formulated with VoI-associated reward functions derived from VoI-based optimization objectives. While we use a simple vehicle-following control problem to illustrate the proposed methodology, it holds significant potential to facilitate the joint optimization of stochastic, sequential control and communication decisions in a wide range of networked control systems.
摘要：随着蜂窝车辆到所有物品（C-V2X）朝着未来的第六代（6G）网络发展，已连接的自动驾驶汽车（CAVS）已出现成为关键应用。利用数据驱动的机器学习（ML），尤其是深度强化学习（DRL），预计将显着增强车辆控制和V2X通信的CAV决策。这两个决策过程紧密地交织在一起，信息的价值（VOI）充当了它们之间的关键桥梁。在本文中，我们介绍了顺序随机决策过程（SSDP）模型来定义和评估VOI，并证明了它们在优化CAV的通信系统中的应用。具体而言，我们正式定义了SSDP模型，并证明MDP模型是它的一种特殊情况。 SSDP模型通过明确表示可以在可用时增强决策的信息集来提供关键优势。此外，随着当前对VOI的研究仍然分散，我们提出了一个基于MDP，增强学习（RL）和最佳控制理论的系统性VOI建模框架。我们定义了VOI的不同类别，并讨论了它们相应的估计方法。最后，我们提出了一种结构化方法，利用各种VOI指标来优化“何时”、“什么”和“如何”通信的问题。为此，我们利用从基于VOI的优化目标导出的与VOI相关的奖励函数来构建SSDP模型。虽然我们使用一个简单的车辆跟随控制问题来说明所提出的方法，但它在促进各种网络控制系统中随机、顺序的控制和通信决策的联合优化方面具有巨大潜力。
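
The abstract above treats the value of information (VoI) as the link between control and communication decisions. The SSDP-based definitions are not spelled out in the abstract, so the sketch below falls back on the classic one-shot, decision-theoretic quantity instead: the expected value of perfect information, i.e. how much expected utility an agent gains if the state is revealed before it acts. The function name, the vehicle-following states and actions, and the payoff numbers are all hypothetical.

```python
def expected_value_of_perfect_information(probs, utility):
    """probs[x]: prior probability of state x; utility[a][x]: payoff of action a in state x."""
    # Best expected payoff when the agent must act on the prior alone.
    without_info = max(sum(p * u for p, u in zip(probs, row)) for row in utility)
    # Expected payoff when the state is revealed first and the best action is taken per state.
    with_info = sum(p * max(row[x] for row in utility) for x, p in enumerate(probs))
    return with_info - without_info


# Hypothetical vehicle-following toy problem.
# States: lead vehicle is cruising (index 0) or braking (index 1).
probs = [0.5, 0.5]
utility = [
    [10.0, 0.0],  # action 0: keep speed (good if the lead vehicle cruises)
    [0.0, 6.0],   # action 1: brake (good if the lead vehicle brakes)
]
voi = expected_value_of_perfect_information(probs, utility)
```

In this toy setting the follower gains 3 units of expected utility by receiving the lead vehicle's state before acting, so a message is worth sending whenever its communication cost is below that amount.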

Title: BridgeIV: Bridging Customized Image and Video Generation through Test-Time Autoregressive Identity Propagation

Authors: Panwen Hu, Jiehui Huang, Qiang Sun, Xiaodan Liang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.06985
Pdf URL: https://arxiv.org/pdf/2505.06985
Copy Paste: [[2505.06985]] BridgeIV: Bridging Customized Image and Video Generation through Test-Time Autoregressive Identity Propagation(https://arxiv.org/abs/2505.06985)
Keywords: generation
Abstract: Both zero-shot and tuning-based customized text-to-image (CT2I) generation have made significant progress for storytelling content creation. In contrast, research on customized text-to-video (CT2V) generation remains relatively limited. Existing zero-shot CT2V methods suffer from poor generalization, while another line of work directly combining tuning-based T2I models with temporal motion modules often leads to the loss of structural and texture information. To bridge this gap, we propose an autoregressive structure and texture propagation module (STPM), which extracts key structural and texture features from the reference subject and injects them autoregressively into each video frame to enhance consistency. Additionally, we introduce a test-time reward optimization (TTRO) method to further refine fine-grained details. Quantitative and qualitative experiments validate the effectiveness of STPM and TTRO, demonstrating improvements of 7.8 and 13.1 in CLIP-I and DINO consistency metrics over the baseline, respectively.
摘要：零射击和基于调整的定制文本对图像（CT2I）的一代都取得了巨大的进展，以创建讲故事。相比之下，对定制文本对视频（CT2V）生成的研究仍然相对有限。现有的零射击CT2V方法的概括不佳，而另一个直接将基于调谐的T2I模型与时间运动模块结合的工作通常会导致结构和纹理信息的丧失。为了弥合这一差距，我们提出了一个自回归的结构和纹理传播模块（STPM），该模块从参考主题中提取关键的结构和纹理特征，并将它们自动加入到每个视频框架中以提高一致性。此外，我们引入了测试时间奖励优化（TTRO）方法，以进一步完善细粒细节。定量和定性实验验证了STPM和TTRO的有效性，证明了基线的夹子I和Dino一致性指标的提高了7.8和13.1。

Title: Replay-Based Continual Learning with Dual-Layered Distillation and a Streamlined U-Net for Efficient Text-to-Image Generation

Authors: Md. Naimur Asif Borno, Md Sakib Hossain Shovon, Asmaa Soliman Al-Moisheer, Mohammad Ali Moni
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.06995
Pdf URL: https://arxiv.org/pdf/2505.06995
Copy Paste: [[2505.06995]] Replay-Based Continual Learning with Dual-Layered Distillation and a Streamlined U-Net for Efficient Text-to-Image Generation(https://arxiv.org/abs/2505.06995)
Keywords: generation
Abstract: Recent advancements in text-to-image diffusion models are hindered by high computational demands, limiting accessibility and scalability. This paper introduces KDC-Diff, a novel stable diffusion framework that enhances efficiency while maintaining image quality. KDC-Diff features a streamlined U-Net architecture with nearly half the parameters of the original U-Net (482M), significantly reducing model complexity. We propose a dual-layered distillation strategy to ensure high-fidelity generation, transferring semantic and structural insights from a teacher to a compact student model while minimizing quality degradation. Additionally, replay-based continual learning is integrated to mitigate catastrophic forgetting, allowing the model to retain prior knowledge while adapting to new data. Despite operating under extremely low computational resources, KDC-Diff achieves state-of-the-art performance on the Oxford Flowers and Butterflies & Moths 100 Species datasets, demonstrating competitive metrics such as FID, CLIP, and LPIPS. Moreover, it significantly reduces inference time compared to existing models. These results establish KDC-Diff as a highly efficient and adaptable solution for text-to-image generation, particularly in computationally constrained environments.
摘要：文本到图像扩散模型的最新进展受到高度计算需求的阻碍，限制了可访问性和可伸缩性。本文介绍了KDC-DIFF，这是一种新型的稳定扩散框架，可在保持图像质量的同时提高效率。 KDC-DIFF具有简化的U-NET体系结构，其原始U-NET（482m）的参数将近一半，可显着降低模型的复杂性。我们提出了一种双层蒸馏策略，以确保高保真的产生，将语义和结构见解从教师转移到紧凑的学生模型，同时最大程度地减少质量降级。此外，基于重播的持续学习将集成以减轻灾难性的遗忘，从而使模型在适应新数据的同时保留了先验知识。尽管在极低的计算资源下运行，但KDC-DIFF仍在牛津花朵，蝴蝶和飞蛾100种数据集上实现最先进的性能，展示了竞争性指标，例如FID，CLIP和LPIPS。此外，与现有模型相比，它大大减少了推理时间。这些结果将KDC-DIFF确定为文本到图像生成的高效和适应性解决方案，尤其是在计算受限的环境中。

Title: Hallucination-Aware Multimodal Benchmark for Gastrointestinal Image Analysis with Large Vision-Language Models

Authors: Bidur Khanal, Sandesh Pokhrel, Sanjay Bhandari, Ramesh Rana, Nikesh Shrestha, Ram Bahadur Gurung, Cristian Linte, Angus Watson, Yash Raj Shrestha, Binod Bhattarai
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.07001
Pdf URL: https://arxiv.org/pdf/2505.07001
Copy Paste: [[2505.07001]] Hallucination-Aware Multimodal Benchmark for Gastrointestinal Image Analysis with Large Vision-Language Models(https://arxiv.org/abs/2505.07001)
Keywords: generation
Abstract: Vision-Language Models (VLMs) are becoming increasingly popular in the medical domain, bridging the gap between medical images and clinical language. Existing VLMs demonstrate an impressive ability to comprehend medical images and text queries to generate detailed, descriptive diagnostic medical reports. However, hallucination--the tendency to generate descriptions that are inconsistent with the visual content--remains a significant issue in VLMs, with particularly severe implications in the medical field. To facilitate VLM research on gastrointestinal (GI) image analysis and study hallucination, we curate a multimodal image-text GI dataset: Gut-VLM. This dataset is created using a two-stage pipeline: first, descriptive medical reports of Kvasir-v2 images are generated using ChatGPT, which introduces some hallucinated or incorrect texts. In the second stage, medical experts systematically review these reports, and identify and correct potential inaccuracies to ensure high-quality, clinically reliable annotations. Unlike traditional datasets that contain only descriptive texts, our dataset also features tags identifying hallucinated sentences and their corresponding corrections. A common approach to reducing hallucination in VLM is to finetune the model on a small-scale, problem-specific dataset. However, we take a different strategy using our dataset. Instead of finetuning the VLM solely for generating textual reports, we finetune it to detect and correct hallucinations, an approach we call hallucination-aware finetuning. Our results show that this approach is better than simply finetuning for descriptive report generation. Additionally, we conduct an extensive evaluation of state-of-the-art VLMs across several metrics, establishing a benchmark. GitHub Repo: this https URL.
摘要：视觉模型（VLM）在医疗领域越来越流行，弥合了医学图像和临床语言之间的差距。现有的VLMS表现出令人印象深刻的能力，可以理解医学图像和文本查询，以生成详细的描述性诊断性医学报告。但是，幻觉 - 产生与视觉内容不一致的描述的趋势 - 在VLMS中是一个重要的问题，对医学领域的影响尤为严重。为了促进VLM胃肠道（GI）图像分析和研究幻觉研究，我们策划了一个多模式图像text GI数据集：gut-vlm。该数据集是使用两阶段管道创建的：首先使用ChatGpt生成Kvasir-V2图像的描述性医学报告，该报告引入了一些幻觉或不正确的文本。在第二阶段，医学专家会系统地审查这些报告，并确定并纠正潜在的不准确性，以确保高质量，临床上可靠的注释。与仅包含描述性文本的传统数据集不同，我们的数据集还具有标识幻觉句子及其相应更正的标签。减少VLM中幻觉的一种常见方法是在小规模的，特定于问题的数据集中对模型进行验证。但是，我们使用数据集采取不同的策略。我们没有仅仅为了生成文本报告而填补VLM，而是对其进行列出以检测和纠正幻觉，而是一种我们称为幻觉 - 意识到的Finetunting的方法。我们的结果表明，这种方法比仅仅用于描述性报告生成的填充要好。此外，我们对几个指标的最先进的VLM进行了广泛的评估，建立了基准。 GitHub repo：此HTTPS URL。

Title: CMD: Controllable Multiview Diffusion for 3D Editing and Progressive Generation

Authors: Peng Li, Suizhi Ma, Jialiang Chen, Yuan Liu, Chongyi Zhang, Wei Xue, Wenhan Luo, Alla Sheffer, Wenping Wang, Yike Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.07003
Pdf URL: https://arxiv.org/pdf/2505.07003
Copy Paste: [[2505.07003]] CMD: Controllable Multiview Diffusion for 3D Editing and Progressive Generation(https://arxiv.org/abs/2505.07003)
Keywords: generation
Abstract: Recently, 3D generation methods have shown their powerful ability to automate 3D model creation. However, most 3D generation methods only rely on an input image or a text prompt to generate a 3D model, which lacks the control of each component of the generated 3D model. Any modifications of the input image lead to an entire regeneration of the 3D models. In this paper, we introduce a new method called CMD that generates a 3D model from an input image while enabling flexible local editing of each component of the 3D model. In CMD, we formulate the 3D generation as a conditional multiview diffusion model, which takes the existing or known parts as conditions and generates the edited or added components. This conditional multiview diffusion model not only allows the generation of 3D models part by part but also enables local editing of 3D models according to the local revision of the input image without changing other 3D parts. Extensive experiments are conducted to demonstrate that CMD decomposes a complex 3D generation task into multiple components, improving the generation quality. Meanwhile, CMD enables efficient and flexible local editing of a 3D model by just editing one rendered image.
摘要：最近，3D生成方法表明了其强大的3D模型创建能力。但是，大多数3D生成方法仅依靠输入图像或文本提示来生成3D模型，该模型缺乏生成3D模型的每个组件的控制。输入图像的任何修改都会导致3D模型的整个再生。在本文中，我们引入了一种称为CMD的新方法，该方法从输入图像中生成3D模型，同时启用3D模型每个组件的灵活局部编辑。在CMD中，我们将3D生成作为条件多视频扩散模型，该模型将现有或已知部分作为条件，并生成编辑或添加的组件。该条件多视图扩散模型不仅允许一部分生成3D模型，而且还可以根据输入图像的本地修订，无需更改其他3D零件就可以对3D模型进行本地编辑。进行了广泛的实验，以证明CMD将复杂的3D生成任务分解为多个组件，从而提高了生成质量。同时，CMD仅通过编辑一个渲染图像来实现3D模型的有效且灵活的本地编辑。

Title: MELLM: Exploring LLM-Powered Micro-Expression Understanding Enhanced by Subtle Motion Perception

Authors: Zhengye Zhang, Sirui Zhao, Shifeng Liu, Shukang Yin, Xinglong Mao, Tong Xu, Enhong Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.07007
Pdf URL: https://arxiv.org/pdf/2505.07007
Copy Paste: [[2505.07007]] MELLM: Exploring LLM-Powered Micro-Expression Understanding Enhanced by Subtle Motion Perception(https://arxiv.org/abs/2505.07007)
Keywords: generation
Abstract: Micro-expressions (MEs) are crucial psychological responses with significant potential for affective computing. However, current automatic micro-expression recognition (MER) research primarily focuses on discrete emotion classification, neglecting a convincing analysis of the subtle dynamic movements and inherent emotional cues. The rapid progress in multimodal large language models (MLLMs), known for their strong multimodal comprehension and language generation abilities, offers new possibilities. MLLMs have shown success in various vision-language tasks, indicating their potential to understand MEs comprehensively, including both fine-grained motion patterns and underlying emotional semantics. Nevertheless, challenges remain due to the subtle intensity and short duration of MEs, as existing MLLMs are not designed to capture such delicate frame-level facial dynamics. In this paper, we propose a novel Micro-Expression Large Language Model (MELLM), which incorporates a subtle facial motion perception strategy with the strong inference capabilities of MLLMs, representing the first exploration of MLLMs in the domain of ME analysis. Specifically, to explicitly guide the MLLM toward motion-sensitive regions, we construct an interpretable motion-enhanced color map by fusing onset-apex optical flow dynamics with the corresponding grayscale onset frame as the model input. Additionally, specialized fine-tuning strategies are incorporated to further enhance the model's visual perception of MEs. Furthermore, we construct an instruction-description dataset based on Facial Action Coding System (FACS) annotations and emotion labels to train our MELLM. Comprehensive evaluations across multiple benchmark datasets demonstrate that our model exhibits superior robustness and generalization capabilities in ME understanding (MEU). Code is available at this https URL.
摘要：微表达（MES）是至关重要的心理反应，具有巨大的情感计算潜力。但是，当前的自动微表达识别（MER）研究主要集中于离散的情绪分类，忽略了对微妙的动态运动和固有的情感提示的令人信服的分析。以强大的多模式理解和语言产生能力而闻名的多模式大语模型（MLLM）的快速进步提供了新的可能性。 MLLM在各种视力语言任务中表现出成功，表明它们具有全面理解ME的潜力，包括细粒度的运动模式和潜在的情感语义。然而，由于现有的MLLM并非旨在捕获如此精致的框架级面部动力学，因此仍然存在挑战，因为MES的细微强度和短期持续时间。在本文中，我们提出了一种新型的微型表达大语模型（MELLM），该模型将微妙的面部运动感知策略与MLLM的强推理能力结合在一起，代表了MLLM在ME分析领域中的首次探索。具体来说，要明确指导MLLM朝运动敏感区域，我们通过将Onset-Apex光流动动力学与相应的灰度发作框架融合为模型输入来构建一个可解释的运动增强颜色映射。此外，还合并了专门的微调策略，以进一步增强模型对ME的视觉感知。此外，我们基于面部动作编码系统（FACS）注释和情感标签来构建一个指令描述数据集，以训练我们的MELLM。多个基准数据集的全面评估表明，我们的模型在我理解中表现出卓越的鲁棒性和概括能力（MEU）。代码可在此HTTPS URL上找到。

Title: DAPE: Dual-Stage Parameter-Efficient Fine-Tuning for Consistent Video Editing with Diffusion Models

Authors: Junhao Xia, Chaoyang Zhang, Yecheng Zhang, Chengyang Zhou, Zhichang Wang, Bochun Liu, Dongshuo Yin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.07057
Pdf URL: https://arxiv.org/pdf/2505.07057
Copy Paste: [[2505.07057]] DAPE: Dual-Stage Parameter-Efficient Fine-Tuning for Consistent Video Editing with Diffusion Models(https://arxiv.org/abs/2505.07057)
Keywords: generation
Abstract: Video generation based on diffusion models presents a challenging multimodal task, with video editing emerging as a pivotal direction in this field. Recent video editing approaches primarily fall into two categories: training-required and training-free methods. While training-based methods incur high computational costs, training-free alternatives often yield suboptimal performance. To address these limitations, we propose DAPE, a high-quality yet cost-effective two-stage parameter-efficient fine-tuning (PEFT) framework for video editing. In the first stage, we design an efficient norm-tuning method to enhance temporal consistency in generated videos. The second stage introduces a vision-friendly adapter to improve visual quality. Additionally, we identify critical shortcomings in existing benchmarks, including limited category diversity, imbalanced object distribution, and inconsistent frame counts. To mitigate these issues, we curate a large dataset benchmark comprising 232 videos with rich annotations and 6 editing prompts, enabling objective and comprehensive evaluation of advanced methods. Extensive experiments on existing datasets (BalanceCC, LOVEU-TGVE, RAVE) and our proposed benchmark demonstrate that DAPE significantly improves temporal coherence and text-video alignment while outperforming previous state-of-the-art approaches.
摘要：基于扩散模型的视频生成提出了一项具有挑战性的多模式任务，视频编辑是该领域的关键方向。最近的视频编辑方法主要分为两类：培训且无培训的方法。尽管基于培训的方法会产生高计算成本，但无培训的替代方案通常会产生次优的性能。为了解决这些限制，我们建议Dape是用于视频编辑的高质量但具有成本效益的两阶段参数微调（PEFT）框架。在第一阶段，我们设计了一种有效的规范调整方法，以增强生成视频的时间一致性。第二阶段引入了一个友好型适配器，以提高视觉质量。此外，我们确定了现有基准的关键缺点，包括有限的类别多样性，不平衡的对象分布和不一致的框架计数。为了减轻这些问题，我们策划了一个大型数据集基准，其中包括232个带有丰富注释和6个编辑提示的视频，从而实现了对高级方法的客观和全面评估。对现有数据集（Balancecc，Loveu-Tgve，Rave）和我们提出的基准测试的广泛实验表明，DAPE显着提高了时间连贯性和文本视频对齐，同时表现优于先前的先前最新方法。

Title: Scaling Laws and Representation Learning in Simple Hierarchical Languages: Transformers vs. Convolutional Architectures

Authors: Francesco Cagnetta, Alessandro Favero, Antonio Sclocchi, Matthieu Wyart
Subjects: cs.LG, cond-mat.dis-nn, stat.ML
Abstract URL: https://arxiv.org/abs/2505.07070
Pdf URL: https://arxiv.org/pdf/2505.07070
Copy Paste: [[2505.07070]] Scaling Laws and Representation Learning in Simple Hierarchical Languages: Transformers vs. Convolutional Architectures(https://arxiv.org/abs/2505.07070)
Keywords: generative
Abstract: How do neural language models acquire a language's structure when trained for next-token prediction? We address this question by deriving theoretical scaling laws for neural network performance on synthetic datasets generated by the Random Hierarchy Model (RHM) -- an ensemble of probabilistic context-free grammars designed to capture the hierarchical structure of natural language while remaining analytically tractable. Previously, we developed a theory of representation learning based on data correlations that explains how deep learning models capture the hierarchical structure of the data sequentially, one layer at a time. Here, we extend our theoretical framework to account for architectural differences. In particular, we predict and empirically validate that convolutional networks, whose structure aligns with that of the generative process through locality and weight sharing, enjoy a faster scaling of performance compared to transformer models, which rely on global self-attention mechanisms. This finding clarifies the architectural biases underlying neural scaling laws and highlights how representation learning is shaped by the interaction between model architecture and the statistical properties of data.
摘要：神经语言模型在接受下一步预测的培训时如何获得语言的结构？我们通过在随机层次结构模型（RHM）生成的合成数据集上得出理论缩放定律来解决这个问题 - 旨在捕获自然语言的层次结构同时，同时保持分析上可分析的无上下文结构。以前，我们基于数据相关性制定了表示形式学习理论，该理论解释了深度学习模型如何依次捕获数据的层次结构，一次是一层。在这里，我们扩展了理论框架以说明建筑差异。特别是，我们预测和经验验证了该卷积网络与依赖于全球自我发场机制的变压器模型相比，其结构与通过区域和重量共享的结构与生成过程相符的结构与生成过程相符。这一发现阐明了神经缩放定律的基础建筑偏见，并强调了表示学习是如何通过模型体系结构与数据的统计属性之间的相互作用来塑造的。
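
The synthetic data in the scaling-laws abstract above come from hierarchical context-free grammars. The sketch below samples from a toy grammar of that kind: every symbol at each level expands into a fixed-length tuple of lower-level symbols, with one of several production rules chosen uniformly at random. The parameter names (`vocab_size`, `rules_per_symbol`, `branching`, `depth`) are assumptions made for illustration, and this toy does not enforce any constraints the actual Random Hierarchy Model may place on its rules (for example, that rules be distinct or unambiguous).

```python
import random


def make_grammar(vocab_size, rules_per_symbol, branching, depth, rng):
    """One rule table per level: symbol -> list of candidate expansions (tuples of child symbols)."""
    grammar = []
    for _ in range(depth):
        level = {
            sym: [
                tuple(rng.randrange(vocab_size) for _ in range(branching))
                for _ in range(rules_per_symbol)
            ]
            for sym in range(vocab_size)
        }
        grammar.append(level)
    return grammar


def sample_sentence(grammar, root, rng):
    """Expand the root symbol level by level, picking one rule uniformly per symbol."""
    symbols = [root]
    for level in grammar:
        expanded = []
        for sym in symbols:
            expanded.extend(rng.choice(level[sym]))
        symbols = expanded
    return symbols


rng = random.Random(0)
grammar = make_grammar(vocab_size=4, rules_per_symbol=2, branching=2, depth=3, rng=rng)
sentence = sample_sentence(grammar, root=0, rng=rng)
```

Each sampled sentence has `branching ** depth` tokens (8 here), and the root symbol plays the role of the class label that generated it.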

Title: Semantic-Guided Diffusion Model for Single-Step Image Super-Resolution

Authors: Zihang Liu, Zhenyu Zhang, Hao Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.07071
Pdf URL: https://arxiv.org/pdf/2505.07071
Copy Paste: [[2505.07071]] Semantic-Guided Diffusion Model for Single-Step Image Super-Resolution(https://arxiv.org/abs/2505.07071)
Keywords: super-resolution
Abstract: Diffusion-based image super-resolution (SR) methods have demonstrated remarkable performance. Recent advancements have introduced deterministic sampling processes that reduce inference from 15 iterative steps to a single step, thereby significantly improving the inference speed of existing diffusion models. However, their efficiency remains limited when handling complex semantic regions due to the single-step inference. To address this limitation, we propose SAMSR, a semantic-guided diffusion framework that incorporates semantic segmentation masks into the sampling process. Specifically, we introduce the SAM-Noise Module, which refines Gaussian noise using segmentation masks to preserve spatial and semantic features. Furthermore, we develop a pixel-wise sampling strategy that dynamically adjusts the residual transfer rate and noise strength based on pixel-level semantic weights, prioritizing semantically rich regions during the diffusion process. To enhance model training, we also propose a semantic consistency loss, which aligns pixel-wise semantic weights between predictions and ground truth. Extensive experiments on both real-world and synthetic datasets demonstrate that SAMSR significantly improves perceptual quality and detail recovery, particularly in semantically complex images. Our code is released at this https URL.
摘要：基于扩散的图像超分辨率（SR）方法表现出了出色的性能。最近的进步引入了确定性抽样过程，从而将推断从15个迭代步骤减少到一个步骤，从而显着提高了现有扩散模型的推理速度。但是，由于单步推断，在处理复杂语义区域时，它们的效率仍然有限。为了解决此限制，我们提出了SAMSR，这是一种语义引导的扩散框架，将语义分割掩码纳入采样过程。具体来说，我们介绍了Sam-Noise模块，该模块使用分割面罩来完善高斯噪声，以保留空间和语义特征。此外，我们制定了一个像素的采样策略，该策略会根据像素级的语义权重，动态调整剩余传递速率和噪声强度，在扩散过程中优先考虑语义上富含的语义区域。为了增强模型训练，我们还提出了语义一致性损失，该语义一致性损失与预测和地面真理之间的像素语义权重保持一致。对现实世界和合成数据集的广泛实验表明，SAMSR显着提高了感知质量和细节恢复，尤其是在语义上复杂的图像中。我们的代码在此HTTPS URL上发布。

Title: Multi-Objective-Guided Discrete Flow Matching for Controllable Biological Sequence Design

Authors: Tong Chen, Yinuo Zhang, Sophia Tang, Pranam Chatterjee
Subjects: cs.LG, q-bio.BM
Abstract URL: https://arxiv.org/abs/2505.07086
Pdf URL: https://arxiv.org/pdf/2505.07086
Copy Paste: [[2505.07086]] Multi-Objective-Guided Discrete Flow Matching for Controllable Biological Sequence Design(https://arxiv.org/abs/2505.07086)
Keywords: generation
Abstract: Designing biological sequences that satisfy multiple, often conflicting, functional and biophysical criteria remains a central challenge in biomolecule engineering. While discrete flow matching models have recently shown promise for efficient sampling in high-dimensional sequence spaces, existing approaches address only single objectives or require continuous embeddings that can distort discrete distributions. We present Multi-Objective-Guided Discrete Flow Matching (MOG-DFM), a general framework to steer any pretrained discrete-time flow matching generator toward Pareto-efficient trade-offs across multiple scalar objectives. At each sampling step, MOG-DFM computes a hybrid rank-directional score for candidate transitions and applies an adaptive hypercone filter to enforce consistent multi-objective progression. We also trained two unconditional discrete flow matching models, PepDFM for diverse peptide generation and EnhancerDFM for functional enhancer DNA generation, as base generation models for MOG-DFM. We demonstrate MOG-DFM's effectiveness in generating peptide binders optimized across five properties (hemolysis, non-fouling, solubility, half-life, and binding affinity), and in designing DNA sequences with specific enhancer classes and DNA shapes. In total, MOG-DFM proves to be a powerful tool for multi-property-guided biomolecule sequence design.
摘要：设计满足多种，相互冲突，功能性和生物物理标准的生物学序列仍然是生物分子工程中的核心挑战。虽然离散的流匹配模型最近显示出在高维序列空间中有效采样的希望，但现有方法仅解决单个目标或需要可能会扭曲离散分布的连续嵌入。我们提出了多目标引导的离散流匹配（MOG-DFM），这是一个通用的框架，可将任何经过预告片的离散时间匹配发电机转向多个标量目标的帕累托有效的权衡。在每个采样步骤中，MOG-DFM计算候选过渡的混合级别分数，并应用自适应超级滤波器过滤器来强制执行一致的多目标进程。我们还训练了两个无条件的离散流匹配模型，即用于多种肽生成的PEPDFM和用于功能增强子DNA生成的增强剂，作为MOG-DFM的基本生成模型。我们证明了MOG-DFM在生成跨五个特性（溶血，非结晶，溶解度，半寿命和结合亲和力）以及使用特定增强子类和DNA形状的DNA序列方面优化的肽粘合剂的有效性。总体而言，MOG-DFM被证明是多型生物分子序列设计的强大工具。
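
The MOG-DFM abstract above steers generation toward Pareto-efficient trade-offs across several objectives. The rank-directional score and hypercone filter are not specified in the abstract, so the sketch below only illustrates the underlying notion of Pareto efficiency: it keeps the candidates that no other candidate dominates. It assumes every objective is to be maximized; the function names and the two-objective scores are hypothetical.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates):
    """Return the candidates that are not dominated by any other candidate."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]


# Hypothetical (objective_1, objective_2) scores for five candidate sequences.
scores = [(0.9, 0.2), (0.5, 0.5), (0.4, 0.4), (0.2, 0.9), (0.9, 0.1)]
front = pareto_front(scores)
```

Here `(0.4, 0.4)` is dominated by `(0.5, 0.5)` and `(0.9, 0.1)` by `(0.9, 0.2)`, so three candidates remain; none of them can be improved on one objective without giving something up on the other.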

Title: Generalizable Pancreas Segmentation via a Dual Self-Supervised Learning Framework

Authors: Jun Li, Hongzhang Zhu, Tao Chen, Xiaohua Qian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.07165
Pdf URL: https://arxiv.org/pdf/2505.07165
Copy Paste: [[2505.07165]] Generalizable Pancreas Segmentation via a Dual Self-Supervised Learning Framework(https://arxiv.org/abs/2505.07165)
Keywords: restoration
Abstract: Recently, numerous pancreas segmentation methods have achieved promising performance on local single-source datasets. However, these methods don't adequately account for generalizability issues, and hence typically show limited performance and low stability on test data from other sources. Considering the limited availability of distinct data sources, we seek to improve the generalization performance of a pancreas segmentation model trained with a single-source dataset, i.e., the single source generalization task. In particular, we propose a dual self-supervised learning model that incorporates both global and local anatomical contexts. Our model aims to fully exploit the anatomical features of the intra-pancreatic and extra-pancreatic regions, and hence enhance the characterization of the high-uncertainty regions for more robust generalization. Specifically, we first construct a global-feature contrastive self-supervised learning module that is guided by the pancreatic spatial structure. This module obtains complete and consistent pancreatic features through promoting intra-class cohesion, and also extracts more discriminative features for differentiating between pancreatic and non-pancreatic tissues through maximizing inter-class separation. It mitigates the influence of surrounding tissue on the segmentation outcomes in high-uncertainty regions. Subsequently, a local-image restoration self-supervised learning module is introduced to further enhance the characterization of the high uncertainty regions. In this module, informative anatomical contexts are actually learned to recover randomly corrupted appearance patterns in those regions.
摘要：最近，许多胰腺细分方法已在本地单源数据集上实现了有希望的性能。但是，这些方法无法充分说明可推广性问题，因此通常会显示出来自其他来源的测试数据的性能有限和稳定性较低。考虑到不同数据源的有限可用性，我们试图提高使用单源数据集训练的胰腺细分模型的概括性能，即单源概括任务。特别是，我们提出了一种双重自我监督的学习模型，该模型既包含了全球和局部解剖环境。我们的模型旨在充分利用胰腺内和胰腺外区域的解剖特征，从而增强高确定性区域的表征，以进行更强大的概括。具体而言，我们首先构建了一个以胰腺空间结构为指导的全球对比自我监督的学习模块。该模块通过促进阶层内凝聚力获得完整而一致的胰腺特征，还提取了更多的判别特征，以通过最大化阶层间的分离来区分胰腺和非胰腺组织。它减轻了周围组织对高不确定性区域分割结果的影响。随后，引入了局部图像恢复自我监督的学习模块，以进一步增强高不确定性区域的表征。在此模块中，实际上学会了提供信息的解剖环境，以恢复这些区域中随机损坏的外观模式。

Title: Causal View of Time Series Imputation: Some Identification Results on Missing Mechanism

Authors: Ruichu Cai, Kaitao Zheng, Junxian Huang, Zijian Li, Zhengming Chen, Boyan Xu, Zhifeng Hao
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2505.07180
Pdf URL: https://arxiv.org/pdf/2505.07180
Copy Paste: [[2505.07180]] Causal View of Time Series Imputation: Some Identification Results on Missing Mechanism(https://arxiv.org/abs/2505.07180)
Keywords: generation
Abstract: Time series imputation is one of the most challenging problems and has broad applications in various fields like health care and the Internet of Things. Existing methods mainly aim to model the temporally latent dependencies and the generation process from the observed time series data. In real-world scenarios, different types of missing mechanisms, like MAR (Missing At Random) and MNAR (Missing Not At Random), can occur in time series data. However, existing methods often overlook the differences among these missing mechanisms and use a single model for time series imputation, which can easily lead to misleading results due to mechanism mismatch. In this paper, we propose a framework for the time series imputation problem by exploring Different Missing Mechanisms (DMM for short) and tailoring solutions accordingly. Specifically, we first analyze the data generation processes with temporal latent states and missing cause variables for different mechanisms. Subsequently, we model these generation processes via variational inference and estimate prior distributions of latent variables via a normalizing flow-based neural architecture. Furthermore, we establish identifiability results under the nonlinear independent component analysis framework to show that latent variables are identifiable. Experimental results show that our method surpasses existing time series imputation techniques across various datasets with different missing mechanisms, demonstrating its effectiveness in real-world applications.
摘要：时间序列归档是最挑战的问题之一，在医疗保健和物联网等各个领域都有广泛的应用。现有方法主要旨在对观察到的时间序列数据进行临时潜在依赖关系和生成过程进行建模。在实际情况下，可以在时间序列数据中发生不同类型的缺失机制（如MAR（随机失踪）和MNAR（不随机缺失）。但是，现有方法通常忽略了上述缺失机制之间的差异，并将单个模型用于时间序列插补，这很容易由于机制不匹配而导致误导性结果。在本文中，我们通过探索不同的丢失机制（简称DMM）并相应地调整解决方案来提出一个时间序列插补问题的框架。具体而言，我们首先使用暂时的潜在状态分析数据生成过程，并为不同的机制分析原因变量。顺便说一句，我们通过变异推理和估算潜在变量的先验分布来对这些生成过程进行建模，并通过标准化基于流动的神经体系结构进行建模。此外，我们在非线性独立组件分析框架下建立可识别性结果，以表明潜在变量是可识别的。实验结果表明，我们的方法超过了各种数据集的现有时间序列插补技术，这些技术具有不同的缺失机制，证明了其在现实世界应用中的有效性。
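The distinction between MAR and MNAR that the abstract relies on can be illustrated with a small simulation. The sketch below is not the DMM method; it only generates a toy series and masks it under the two mechanisms. All function names, the AR(1) generator, the threshold, and the missing probability are assumptions made for illustration.

```python
import random


def make_series(n, seed=0):
    """Toy AR(1) series; the 0.9 coefficient is an arbitrary illustrative choice."""
    rng = random.Random(seed)
    x, series = 0.0, []
    for _ in range(n):
        x = 0.9 * x + rng.gauss(0.0, 1.0)
        series.append(x)
    return series


def mask_mar(series, covariate, threshold, p_missing, seed=1):
    """MAR: missingness depends only on a fully observed covariate."""
    rng = random.Random(seed)
    return [
        None if (c > threshold and rng.random() < p_missing) else v
        for v, c in zip(series, covariate)
    ]


def mask_mnar(series, threshold, p_missing, seed=2):
    """MNAR: missingness depends on the (possibly unobserved) value itself."""
    rng = random.Random(seed)
    return [
        None if (v > threshold and rng.random() < p_missing) else v
        for v in series
    ]


def observed_mean(masked):
    """Mean over the entries that survived masking."""
    values = [v for v in masked if v is not None]
    return sum(values) / len(values)


series = make_series(5000)
covariate = make_series(5000, seed=42)  # generated independently of `series`
true_mean = sum(series) / len(series)
mar_masked = mask_mar(series, covariate, threshold=0.0, p_missing=0.8)
mnar_masked = mask_mnar(series, threshold=0.0, p_missing=0.8)

print("true mean:         ", round(true_mean, 3))
print("observed mean MAR: ", round(observed_mean(mar_masked), 3))
print("observed mean MNAR:", round(observed_mean(mnar_masked), 3))
```

Under the MNAR mask, large values are preferentially dropped, so the mean of the observed entries is pulled below the true mean. This is the kind of mechanism mismatch that a single imputation model can miss.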

Title: Language-Driven Dual Style Mixing for Single-Domain Generalized Object Detection

Authors: Hongda Qin, Xiao Lu, Zhiyong Wei, Yihong Cao, Kailun Yang, Ningjiang Chen
Subjects: cs.CV, cs.RO, eess.IV
Abstract URL: https://arxiv.org/abs/2505.07219
Pdf URL: https://arxiv.org/pdf/2505.07219
Copy Paste: [[2505.07219]] Language-Driven Dual Style Mixing for Single-Domain Generalized Object Detection(https://arxiv.org/abs/2505.07219)
Keywords: generation
Abstract: Generalizing an object detector trained on a single domain to multiple unseen domains is a challenging task. Existing methods typically introduce image or feature augmentation to diversify the source domain and thereby raise the robustness of the detector. Vision-Language Model (VLM)-based augmentation techniques have been proven to be effective, but they require that the detector's backbone has the same structure as the image encoder of the VLM, limiting the choice of detector framework. To address this problem, we propose Language-Driven Dual Style Mixing (LDDS) for single-domain generalization, which diversifies the source domain by fully utilizing the semantic information of the VLM. Specifically, we first construct prompts to transfer style semantics embedded in the VLM to an image translation network. This facilitates the generation of style-diversified images with explicit semantic information. Then, we propose image-level style mixing between the diversified images and source domain images. This effectively mines the semantic information for image augmentation without relying on specific augmentation selections. Finally, we propose feature-level style mixing in a double-pipeline manner, allowing feature augmentation to be model-agnostic and to work seamlessly with mainstream detector frameworks, including one-stage, two-stage, and transformer-based detectors. Extensive experiments demonstrate the effectiveness of our approach across various benchmark datasets, including real-to-cartoon and normal-to-adverse-weather tasks. The source code and pre-trained models will be publicly available at this https URL.
摘要：将在单个域上训练的对象检测器概括为多个看不见的域是一项具有挑战性的任务。现有方法通常会引入图像或功能扩展，以使源域多样化以提高检测器的鲁棒性。视觉语言模型（VLM）基于基于的增强技术已被证明是有效的，但是它们要求检测器的骨架具有与VLM的图像编码器相同的结构，从而限制了检测器框架的选择。为了解决这个问题，我们提出了以语言驱动的双样式混合（LDD）来进行单域泛化，这通过充分利用VLM的语义信息来使源域多样化。具体来说，我们首先构建了提示将嵌入在VLM中的样式语义传输到图像翻译网络。这促进了具有明确的语义信息的样式多元化图像的产生。然后，我们建议在多元化图像和源域图像之间进行图像级样式混合。这有效地挖掘出语义信息以进行图像增强，而无需依赖特定的增强选择。最后，我们以双层式的方式提出了特征级样式的混合，使功能增强可以成为模型 - 敏捷的模型，并且可以与主流探测器框架无缝地工作，包括一阶段，两阶段和基于变压器的探测器。广泛的实验证明了我们在各种基准数据集中的方法的有效性，包括真实的卡通和不利的天气任务。源代码和预培训模型将在此HTTPS URL上公开可用。

Title: Compression, Regularity, Randomness and Emergent Structure: Rethinking Physical Complexity in the Data-Driven Era

Authors: Nima Dehghani
Subjects: cs.LG, cond-mat.stat-mech, cs.IT, physics.bio-ph, physics.data-an
Abstract URL: https://arxiv.org/abs/2505.07222
Pdf URL: https://arxiv.org/pdf/2505.07222
Copy Paste: [[2505.07222]] Compression, Regularity, Randomness and Emergent Structure: Rethinking Physical Complexity in the Data-Driven Era(https://arxiv.org/abs/2505.07222)
Keywords: generation
Abstract: Complexity science offers a wide range of measures for quantifying unpredictability, structure, and information. Yet, a systematic conceptual organization of these measures is still missing. We present a unified framework that locates statistical, algorithmic, and dynamical measures along three axes (regularity, randomness, and complexity) and situates them in a common conceptual space. We map statistical, algorithmic, and dynamical measures into this conceptual space, discussing their computational accessibility and approximability. This taxonomy reveals the deep challenges posed by uncomputability and highlights the emergence of modern data-driven methods (including autoencoders, latent dynamical models, symbolic regression, and physics-informed neural networks) as pragmatic approximations to classical complexity ideals. Latent spaces emerge as operational arenas where regularity extraction, noise management, and structured compression converge, bridging theoretical foundations with practical modeling in high-dimensional systems. We close by outlining implications for physics-informed AI and AI-guided discovery in complex physical systems, arguing that classical questions of complexity remain central to next-generation scientific modeling.
摘要：复杂性科学提供了广泛的措施来量化不可预测性，结构和信息。但是，这些措施的系统概念组织仍然缺失。我们提出了一个统一的框架，该框架沿三个轴（规律性，随机性和复杂性）定位统计，算法和动态度量，并将它们置于共同的概念空间中。我们将统计，算法和动态度量映射到此概念空间中，讨论其计算可访问性和近似性。这种分类法揭示了不可兼容性带来的深厚挑战，并突出了现代数据驱动方法的出现（包括自动编码器，潜在动力学模型，符号回归和物理知识的神经网络）是对经典复杂性理想的实用近似。潜在空间作为操作领域出现，其中规律性提取，噪声管理和结构化压缩融合，在高维系统中与实用建模弥合理论基础。我们结束了概述对复杂物理系统中物理信息和AI引导的发现的影响，并认为复杂性的经典问题仍然是下一代科学建模的核心。

Title: Generative Pre-trained Autoregressive Diffusion Transformer

Authors: Yuan Zhang, Jiacheng Jiang, Guoqing Ma, Zhiying Lu, Haoyang Huang, Jianlong Yuan, Nan Duan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.07344
Pdf URL: https://arxiv.org/pdf/2505.07344
Copy Paste: [[2505.07344]] Generative Pre-trained Autoregressive Diffusion Transformer(https://arxiv.org/abs/2505.07344)
Keywords: generation, generative
Abstract: In this work, we present GPDiT, a Generative Pre-trained Autoregressive Diffusion Transformer that unifies the strengths of diffusion and autoregressive modeling for long-range video synthesis, within a continuous latent space. Instead of predicting discrete tokens, GPDiT autoregressively predicts future latent frames using a diffusion loss, enabling natural modeling of motion dynamics and semantic consistency across frames. This continuous autoregressive framework not only enhances generation quality but also endows the model with representation capabilities. Additionally, we introduce a lightweight causal attention variant and a parameter-free rotation-based time-conditioning mechanism, improving both the training and inference efficiency. Extensive experiments demonstrate that GPDiT achieves strong performance in video generation quality, video representation ability, and few-shot learning tasks, highlighting its potential as an effective framework for video modeling in continuous space.
摘要：在这项工作中，我们提出了GPDIT，这是一种生成的预训练的自回归扩散变压器，可在连续的潜在空间内统一扩散和自回旋建模的强度。 GPDIT自动加入没有预测离散令牌，而是使用扩散损失来预测未来的潜在帧，从而实现了运动动力学的自然建模和跨帧的语义一致性。这种连续的自回旋框架不仅增强了发电质量，而且还赋予该模型具有表示功能。此外，我们引入了轻巧的因果注意变体和基于无参数旋转的时间条件机制，从而提高了训练和推理效率。广泛的实验表明，GPDIT在视频发电质量，视频表示能力和少量学习任务中实现了强劲的性能，从而突出了其作为在连续空间中进行视频建模的有效框架的潜力。

Title: From Search To Sampling: Generative Models For Robust Algorithmic Recourse

Authors: Prateek Garg, Lokesh Nagalapatti, Sunita Sarawagi
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.07351
Pdf URL: https://arxiv.org/pdf/2505.07351
Copy Paste: [[2505.07351]] From Search To Sampling: Generative Models For Robust Algorithmic Recourse(https://arxiv.org/abs/2505.07351)
Keywords: generative
Abstract: Algorithmic Recourse provides recommendations to individuals who are adversely impacted by automated model decisions, on how to alter their profiles to achieve a favorable outcome. Effective recourse methods must balance three conflicting goals: proximity to the original profile to minimize cost, plausibility for realistic recourse, and validity to ensure the desired outcome. We show that existing methods train for these objectives separately and then search for recourse through a joint optimization over the recourse goals during inference, leading to poor recourse recommendations. We introduce GenRe, a generative recourse model designed to train the three recourse objectives jointly. Training such generative models is non-trivial due to the lack of direct recourse supervision. We propose efficient ways to synthesize such supervision and further show that GenRe's training leads to a consistent estimator. Unlike most prior methods, which employ non-robust gradient-descent-based search during inference, GenRe simply performs forward sampling over the generative model to produce minimum-cost recourse, leading to superior performance across multiple metrics. We also demonstrate that GenRe provides the best trade-off between cost, plausibility, and validity compared to state-of-the-art baselines. Our code is available at: this https URL.
摘要：算法追索权为那些受自动模型决策不利影响的个人提供了建议，如何改变其概况以实现有利的结果。有效的追索方法必须平衡三个相互矛盾的目标：靠近原始概况，以最大程度地降低成本，合理的追索性和有效性，以确保所需的结果。我们表明，现有的方法分别训练这些目标，然后通过推理期间的追索目标进行联合优化，从而搜索追索权，从而导致追索性不佳。我们介绍了类型，这是一种生成求助模型，旨在共同训练三个追索目标。由于缺乏直接追索性的监督，培训这种生成模型是非平凡的。我们提出了有效的方法来综合这种监督，并进一步表明类型的培训会导致一致的估计器。与大多数先前的方法不同，在推断过程中采用了非稳定梯度下降的搜索，类型只是对生成模型进行正向采样，以产生最低成本依据，从而导致多个指标的卓越性能。我们还证明，与最先进的基准相比，类型提供了成本，合理性和有效性之间的最佳权衡。我们的代码可用：此HTTPS URL。

Title: Unified Continuous Generative Models

Authors: Peng Sun, Yi Jiang, Tao Lin
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2505.07447
Pdf URL: https://arxiv.org/pdf/2505.07447
Copy Paste: [[2505.07447]] Unified Continuous Generative Models(https://arxiv.org/abs/2505.07447)
Keywords: generative
Abstract: Recent advances in continuous generative models, including multi-step approaches like diffusion and flow-matching (typically requiring 8-1000 sampling steps) and few-step methods such as consistency models (typically 1-8 steps), have demonstrated impressive generative performance. However, existing work often treats these approaches as distinct paradigms, resulting in separate training and sampling methodologies. We introduce a unified framework for training, sampling, and analyzing these models. Our implementation, the Unified Continuous Generative Models Trainer and Sampler (UCGM-{T,S}), achieves state-of-the-art (SOTA) performance. For example, on ImageNet 256x256 using a 675M diffusion transformer, UCGM-T trains a multi-step model achieving 1.30 FID in 20 steps and a few-step model reaching 1.42 FID in just 2 steps. Additionally, applying UCGM-S to a pre-trained model (previously 1.26 FID at 250 steps) improves performance to 1.06 FID in only 40 steps. Code is available at: this https URL.
摘要：连续生成模型的最新进展，包括诸如扩散和流程匹配（通常需要8-1000个采样步骤）和诸如一致性模型（通常为1-8个步骤）之类的多步方法（通常需要8-1000个采样步骤），具有令人印象深刻的生成性能。但是，现有工作通常将这些方法视为不同的范式，从而产生了单独的培训和抽样方法。我们引入了一个统一的框架，用于培训，抽样和分析这些模型。我们的实现，统一的连续生成模型培训师和采样器（UCGM- {T，S}），实现了最新的（SOTA）性能。例如，在ImageNet 256x256上，使用675m扩散变压器，UCGM-T训练多步型模型，以20步以20步实现1.30 FID，并且只需2个步骤即可达到1.42 FID。此外，将UCGM-S应用于预训练的模型（以前为1.26 FID，在250步中）仅在40个步骤中将性能提高到1.06 FID。代码可用：此HTTPS URL。

Title: You Only Look One Step: Accelerating Backpropagation in Diffusion Sampling with Gradient Shortcuts

Authors: Hongkun Dou, Zeyu Li, Xingyu Jiang, Hongjue Li, Lijun Yang, Wen Yao, Yue Deng
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2505.07477
Pdf URL: https://arxiv.org/pdf/2505.07477
Copy Paste: [[2505.07477]] You Only Look One Step: Accelerating Backpropagation in Diffusion Sampling with Gradient Shortcuts(https://arxiv.org/abs/2505.07477)
Keywords: generation
Abstract: Diffusion models (DMs) have recently demonstrated remarkable success in modeling large-scale data distributions. However, many downstream tasks require guiding the generated content based on specific differentiable metrics, typically necessitating backpropagation during the generation process. This approach is computationally expensive, as generating with DMs often demands tens to hundreds of recursive network calls, resulting in high memory usage and significant time consumption. In this paper, we propose a more efficient alternative that approaches the problem from the perspective of parallel denoising. We show that full backpropagation throughout the entire generation process is unnecessary. The downstream metrics can be optimized by retaining the computational graph of only one step during generation, thus providing a shortcut for gradient propagation. The resulting method, which we call Shortcut Diffusion Optimization (SDO), is generic, high-performance, and computationally lightweight, capable of optimizing all parameter types in diffusion sampling. We demonstrate the effectiveness of SDO on several real-world tasks, including controlling generation by optimizing latent and aligning the DMs by fine-tuning network parameters. Compared to full backpropagation, our approach reduces computational costs by $\sim 90\%$ while maintaining superior performance. Code is available at this https URL.
摘要：扩散模型（DMS）最近在建模大规模数据分布方面取得了巨大的成功。但是，许多下游任务都需要基于特定的可区分指标来指导生成的内容，通常需要在生成过程中进行反向传播。这种方法在计算上是昂贵的，因为DMS生成通常需要数十至数百个递归网络调用，从而导致高内存使用和大量的时间消耗。在本文中，我们提出了一种更有效的替代方案，该替代方案从平行DeNoising的角度来解决问题。我们表明，在整个一代过程中的完整反向传播是不必要的。可以通过在生成过程中仅保留一个步骤的计算图来优化下游指标，从而为梯度传播提供快捷方式。我们称之为快捷扩散优化（SDO）的结果方法是通用，高性能和计算轻量级的，能够优化扩散采样中的所有参数类型。我们演示了SDO对几个现实世界任务的有效性，包括通过通过微调网络参数优化潜在和对齐DMS来控制生成。与完整的反向传播相比，我们的方法将计算成本降低了$ \ sim 90 \％$，同时保持卓越的性能。代码可在此HTTPS URL上找到。

Title: Addressing degeneracies in latent interpolation for diffusion models

Authors: Erik Landolsi, Fredrik Kahl
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.07481
Pdf URL: https://arxiv.org/pdf/2505.07481
Copy Paste: [[2505.07481]] Addressing degeneracies in latent interpolation for diffusion models(https://arxiv.org/abs/2505.07481)
Keywords: generation
Abstract: There is an increasing interest in using image-generating diffusion models for deep data augmentation and image morphing. In this context, it is useful to interpolate between latents produced by inverting a set of input images, in order to generate new images representing some mixture of the inputs. We observe that such interpolation can easily lead to degenerate results when the number of inputs is large. We analyze the cause of this effect theoretically and experimentally, and suggest a suitable remedy. The suggested approach is a relatively simple normalization scheme that is easy to use whenever interpolation between latents is needed. We measure image quality using FID and CLIP embedding distance and show experimentally that baseline interpolation methods lead to a drop in quality metrics long before the degeneration issue is clearly visible. In contrast, our method significantly reduces the degeneration effect and leads to improved quality metrics also in non-degenerate situations.
摘要：对于使用图像生成扩散模型进行深度数据增强和图像变形的兴趣越来越大。在这种情况下，通过反转一组输入图像而产生的潜在潜在的潜在图像，以生成代表输入的某些混合物的新图像，这是有用的。我们观察到，当输入数量较大时，这种插值可以很容易地导致结果。我们在理论上和实验上分析了这种效果的原因，并提出了适当的补救措施。建议的方法是一种相对简单的归一化方案，每当需要在潜伏期之间插值时易于使用。我们使用FID和夹具嵌入距离测量图像质量，并通过实验表明基线插值方法导致质量指标在变性问题清晰可见之前很久就会下降。相比之下，我们的方法大大降低了变性效应，并在非分类情况下也会改善质量指标。
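The abstract above does not spell out its normalization scheme, so the following is only a generic sketch of one well-known effect that can arise when many Gaussian-like latents are mixed: a linear average of N roughly independent standard normal vectors in d dimensions has norm near sqrt(d/N) rather than the sqrt(d) typical of a single latent. The rescaling step, the function names, and the choice of sqrt(d) as the target norm are assumptions for illustration and are not taken from the paper.

```python
import math
import random


def average_latents(latents, weights):
    """Weighted linear combination of equally sized latent vectors."""
    dim = len(latents[0])
    return [sum(w * z[i] for w, z in zip(weights, latents)) for i in range(dim)]


def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def rescale_to_norm(v, target):
    """Rescale v so that its Euclidean norm equals `target`."""
    current = norm(v)
    return [x * target / current for x in v]


rng = random.Random(0)
dim, n_inputs = 1024, 16
# Stand-ins for inverted latents: i.i.d. standard normal vectors (an assumption).
latents = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_inputs)]
weights = [1.0 / n_inputs] * n_inputs

mixed = average_latents(latents, weights)
fixed = rescale_to_norm(mixed, math.sqrt(dim))

print("typical single-latent norm:", round(math.sqrt(dim), 2))
print("norm after plain averaging:", round(norm(mixed), 2))
print("norm after rescaling:      ", round(norm(fixed), 2))
```

With 16 inputs the plain average has roughly a quarter of the norm a single latent would have, which places it far from the region the model was trained on; rescaling restores the norm without changing the direction.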

Title: FLUXSynID: A Framework for Identity-Controlled Synthetic Face Generation with Document and Live Images

Authors: Raul Ismayilov, Luuk Spreeuwers, Dzemila Sero
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.07530
Pdf URL: https://arxiv.org/pdf/2505.07530
Copy Paste: [[2505.07530]] FLUXSynID: A Framework for Identity-Controlled Synthetic Face Generation with Document and Live Images(https://arxiv.org/abs/2505.07530)
Keywords: generation
Abstract: Synthetic face datasets are increasingly used to overcome the limitations of real-world biometric data, including privacy concerns, demographic imbalance, and high collection costs. However, many existing methods lack fine-grained control over identity attributes and fail to produce paired, identity-consistent images under structured capture conditions. We introduce FLUXSynID, a framework for generating high-resolution synthetic face datasets with user-defined identity attribute distributions and paired document-style and trusted live capture images. The dataset generated using the FLUXSynID framework shows improved alignment with real-world identity distributions and greater inter-set diversity compared to prior work. The FLUXSynID framework for generating custom datasets, along with a dataset of 14,889 synthetic identities, is publicly released to support biometric research, including face recognition and morphing attack detection.
摘要：综合面部数据集越来越多地用于克服现实世界中生物识别数据的局限性，包括隐私问题，人口不平衡和高收集成本。但是，许多现有方法缺乏对身份属性的细粒度控制，并且在结构化捕获条件下未能产生配对的，身份符合的图像。我们介绍了FluxSynid，这是一个框架，用于生成具有用户定义的身份属性分布以及配对的文档风格和受信任的实时捕获图像的高分辨率合成面部数据集。与先前的工作相比，使用Fluxsynid框架生成的数据集显示出与现实身份分布的对齐程度的改善，并具有更大的集合间多样性。用于生成自定义数据集的通量框架以及14,889个合成身份的数据集公开发布以支持生物识别研究，包括面部识别和变形攻击检测。

Title: Discrete Visual Tokens of Autoregression, by Diffusion, and for Reasoning

Authors: Bohan Wang, Zhongqi Yue, Fengda Zhang, Shuo Chen, Li'an Bi, Junzhe Zhang, Xue Song, Kennard Yanting Chan, Jiachun Pan, Weijia Wu, Mingze Zhou, Wang Lin, Kaihang Pan, Saining Zhang, Liyu Jia, Wentao Hu, Wei Zhao, Hanwang Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.07538
Pdf URL: https://arxiv.org/pdf/2505.07538
Copy Paste: [[2505.07538]] Discrete Visual Tokens of Autoregression, by Diffusion, and for Reasoning(https://arxiv.org/abs/2505.07538)
Keywords: generation
Abstract: We completely discard the conventional spatial prior in image representation and introduce a novel discrete visual tokenizer: Self-consistency Tokenizer (Selftok). At its design core, we compose an autoregressive (AR) prior -- mirroring the causal structure of language -- into visual tokens by using the reverse diffusion process of image generation. The AR property makes Selftok fundamentally distinct from traditional spatial tokens in the following two key ways: - Selftok offers an elegant and minimalist approach to unify diffusion and AR for vision-language models (VLMs): By representing images with Selftok tokens, we can train a VLM using a purely discrete autoregressive architecture -- like that in LLMs -- without requiring additional modules or training objectives. - We theoretically show that the AR prior satisfies the Bellman equation, whereas the spatial prior does not. Therefore, Selftok supports reinforcement learning (RL) for visual generation with effectiveness comparable to that achieved in LLMs. Besides the AR property, Selftok is also a SoTA tokenizer that achieves a favorable trade-off between high-quality reconstruction and compression rate. We use Selftok to build a pure AR VLM for both visual comprehension and generation tasks. Impressively, without using any text-image training pairs, a simple policy gradient RL working in the visual tokens can significantly boost the visual generation benchmark, surpassing all the existing models by a large margin. Therefore, we believe that Selftok effectively addresses the long-standing challenge that visual tokens cannot support effective RL. When combined with the well-established strengths of RL in LLMs, this brings us one step closer to realizing a truly multimodal LLM. Project Page: this https URL.
摘要：我们在图像表示中完全放弃了常规的空间先验，并引入了一种新颖的离散视觉令牌：自稳态令牌器（SelfTok）。在其设计核心上，我们通过使用图像生成的反向扩散过程组成了自回归（AR）先验（将语言的因果结构反映为视觉令牌）。 The AR property makes Selftok fundamentally distinct from traditional spatial tokens in the following two key ways: - Selftok offers an elegant and minimalist approach to unify diffusion and AR for vision-language models (VLMs): By representing images with Selftok tokens, we can train a VLM using a purely discrete autoregressive architecture -- like that in LLMs -- without requiring additional modules or training objectives. - 从理论上讲，我们表明AR先验满足钟手方程，而空间先验则不满意。因此，SelfTok支持具有与LLM中实现的有效性相当的视觉生成增强学习（RL）。除了AR财产外，SelfTok还是SOTA令牌，在高质量重建和压缩率之间取决于良好的权衡。我们使用SelfTok来构建一个纯AR VLM，以进行视觉理解和发电任务。令人印象深刻的是，在不使用任何文本图像培训对的情况下，在视觉令牌中工作的简单策略梯度RL可以显着提高视觉生成基准，从而超过所有现有模型。因此，我们认为，自我Tok有效地解决了视觉令牌无法支持有效RL的长期挑战。当与LLM中RL的公认优势结合使用时，这使我们更接近实现了真正的多模式LLM。项目页面：此HTTPS URL。

Title: Noise Optimized Conditional Diffusion for Domain Adaptation

Authors: Lingkun Luo, Shiqiang Hu, Liming Chen
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2505.07548
Pdf URL: https://arxiv.org/pdf/2505.07548
Copy Paste: [[2505.07548]] Noise Optimized Conditional Diffusion for Domain Adaptation(https://arxiv.org/abs/2505.07548)
Keywords: generation, generative
Abstract: Pseudo-labeling is a cornerstone of Unsupervised Domain Adaptation (UDA), yet the scarcity of High-Confidence Pseudo-Labeled Target Domain Samples (hcpl-tds) often leads to inaccurate cross-domain statistical alignment, causing DA failures. To address this challenge, we propose Noise Optimized Conditional Diffusion for Domain Adaptation (NOCDDA), which seamlessly integrates the generative capabilities of conditional diffusion models with the decision-making requirements of DA to achieve task-coupled optimization for efficient adaptation. For robust cross-domain consistency, we modify the DA classifier to align with the conditional diffusion classifier within a unified optimization framework, enabling forward training on noise-varying cross-domain samples. Furthermore, we argue that the conventional $\mathcal{N}(\mathbf{0}, \mathbf{I})$ initialization in diffusion models often generates class-confused hcpl-tds, compromising discriminative DA. To resolve this, we introduce a class-aware noise optimization strategy that refines sampling regions for reverse class-specific hcpl-tds generation, effectively enhancing cross-domain alignment. Extensive experiments across 5 benchmark datasets and 29 DA tasks demonstrate significant performance gains of NOCDDA over 31 state-of-the-art methods, validating its robustness and effectiveness.
摘要：伪标记是无监督域适应（UDA）的基石，但是高信心伪标记的目标域样本（\ textbf {hcpl-tds}）的稀缺通常会导致毫无准确的交叉统计量化，从而导致DAIDADA DAIFAIRES。 To address this challenge, we propose \textbf{N}oise \textbf{O}ptimized \textbf{C}onditional \textbf{D}iffusion for \textbf{D}omain \textbf{A}daptation (\textbf{NOCDDA}), which seamlessly integrates the generative capabilities of有条件的扩散模型具有DA的决策要求，以实现任务耦合的优化以进行有效的适应。对于稳健的跨域一致性，我们修改了DA分类器以与统一优化框架内的条件扩散分类器保持一致，从而在噪声变化的跨域样本上可以向前训练。此外，我们认为传统的\（\ Mathcal {n}（\ Mathbf {0}，\ Mathbf {i}）\）在扩散模型中通常会生成类confuse confused hcpl-tds，损害歧视性da。为了解决这一问题，我们引入了一种阶级感知的噪声优化策略，该策略优化了反向类特异性HCPL-TDS生成的采样区域，从而有效增强了跨域对准。在5个基准数据集和29个DA任务上进行的大量实验表明，在31种最先进的方法上，\ textbf {nocdda}的显着性能提高，从而验证了其稳健性和有效性。

Title: Generating Skyline Explanations for Graph Neural Networks

Authors: Dazhuo Qiu, Haolai Che, Arijit Khan, Yinghui Wu
Subjects: cs.LG, cs.DB
Abstract URL: https://arxiv.org/abs/2505.07635
Pdf URL: https://arxiv.org/pdf/2505.07635
Copy Paste: [[2505.07635]] Generating Skyline Explanations for Graph Neural Networks(https://arxiv.org/abs/2505.07635)
Keywords: generation
Abstract: This paper proposes a novel approach to generate subgraph explanations for graph neural networks (GNNs) that simultaneously optimize multiple measures for explainability. Existing GNN explanation methods often compute subgraphs (called "explanatory subgraphs") that optimize a pre-defined, single explainability measure, such as fidelity or conciseness. This can lead to biased explanations that cannot provide a comprehensive explanation to clarify the output of GNN models. We introduce skyline explanation, a GNN explanation paradigm that aims to identify k explanatory subgraphs by simultaneously optimizing multiple explainability measures. (1) We formulate skyline explanation generation as a multi-objective optimization problem, and pursue explanations that approximate a skyline set of explanatory subgraphs. We show the hardness of skyline explanation generation. (2) We design efficient algorithms with an onion-peeling approach that strategically removes edges from neighbors of nodes of interest, and incrementally improves explanations as it explores an interpretation domain, with provable quality guarantees. (3) We further develop an algorithm to diversify explanations to provide more comprehensive perspectives. Using real-world graphs, we empirically verify the effectiveness, efficiency, and scalability of our algorithms.
摘要：本文提出了一种新的方法，以生成图形神经网络的子图解释，该解释同时优化了多种措施的解释性。现有的GNN解释方法通常会计算优化预定的单个解释性度量，例如保真度或简洁性，以计算子图（称为“解释性子图”）。这可能导致偏见的解释，无法提供全面的解释来阐明GNN模型的输出。我们介绍了Skyline解释，这是GNN解释范式，旨在通过同时优化多种解释性措施来识别K解释子图。（1）我们将天际线的解释生成作为多目标优化问题，并追求近似于天际线的解释子图的解释。我们展示了天际线的硬度的硬度。（2）我们采用一种洋葱 - 佩戴方法来设计有效的算法，该算法从策略上删除了利益的节点的邻居，并逐步改善了解释域，并具有可证明的质量保证。（3）我们进一步开发了一种算法来多样化解释以提供更全面的观点。使用现实图形，我们从经验上验证了算法的有效性，效率和可扩展性。
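The notion of a skyline set used above can be made concrete with a brute-force sketch: a candidate belongs to the skyline if no other candidate is at least as good on every measure and strictly better on one. This is the generic Pareto-dominance definition, not the paper's onion-peeling algorithm; the candidate names and the (fidelity, conciseness) scores are made up for illustration, and both measures are assumed to be "higher is better".

```python
def dominates(a, b):
    """True if score tuple a is >= b on every measure and > b on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def skyline(candidates):
    """Return the candidates that no other candidate dominates (O(n^2) scan)."""
    return {
        name: scores
        for name, scores in candidates.items()
        if not any(
            dominates(other, scores)
            for other_name, other in candidates.items()
            if other_name != name
        )
    }


# Hypothetical explanatory subgraphs scored as (fidelity, conciseness).
candidates = {
    "g1": (0.9, 0.2),
    "g2": (0.6, 0.7),
    "g3": (0.5, 0.5),  # dominated by g2 on both measures
    "g4": (0.3, 0.9),
}

print(sorted(skyline(candidates)))
```

Optimizing fidelity alone would return only g1 and optimizing conciseness alone only g4; the skyline keeps g2 as well, which is the trade-off a single-measure explainer would miss.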

Title: ShotAdapter: Text-to-Multi-Shot Video Generation with Diffusion Models

Authors: Ozgur Kara, Krishna Kumar Singh, Feng Liu, Duygu Ceylan, James M. Rehg, Tobias Hinz
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.07652
Pdf URL: https://arxiv.org/pdf/2505.07652
Copy Paste: [[2505.07652]] ShotAdapter: Text-to-Multi-Shot Video Generation with Diffusion Models(https://arxiv.org/abs/2505.07652)
Keywords: generation
Abstract: Current diffusion-based text-to-video methods are limited to producing short video clips of a single shot and lack the capability to generate multi-shot videos with discrete transitions where the same character performs distinct activities across the same or different backgrounds. To address this limitation we propose a framework that includes a dataset collection pipeline and architectural extensions to video diffusion models to enable text-to-multi-shot video generation. Our approach enables generation of multi-shot videos as a single video with full attention across all frames of all shots, ensuring character and background consistency, and allows users to control the number, duration, and content of shots through shot-specific conditioning. This is achieved by incorporating a transition token into the text-to-video model to control at which frames a new shot begins and a local attention masking strategy which controls the transition token's effect and allows shot-specific prompting. To obtain training data we propose a novel data collection pipeline to construct a multi-shot video dataset from existing single-shot video datasets. Extensive experiments demonstrate that fine-tuning a pre-trained text-to-video model for a few thousand iterations is enough for the model to subsequently be able to generate multi-shot videos with shot-specific control, outperforming the baselines. You can find more details in this https URL
摘要：当前基于扩散的文本对视频方法仅限于生成单镜头的简短视频剪辑，并且缺乏通过离散过渡生成多摄像视频的能力，在这些视频中，相同角色在相同或不同的背景上执行不同的活动。为了解决此限制，我们提出了一个框架，其中包括一个数据集集合管道和架构扩展到视频扩散模型，以启用文本到摄像机视频的生成。我们的方法使生成多拍视频作为一个视频，在所有镜头的所有帧中都充分关注，确保角色和背景一致性，并允许用户通过特定于SHOT特定的调理来控制镜头的数量，持续时间和内容。这是通过将过渡令牌纳入文本对视频模型来控制新镜头的开始和局部注意力掩蔽策略来实现的，该策略可以控制过渡令牌的效果并允许特定于射击的提示。为了获得培训数据，我们提出了一条新颖的数据收集管道，以从现有的单摄影视频数据集中构建一个多拍视频数据集。广泛的实验表明，对数千个迭代进行微调的预训练文本对视频模型足以使模型随后能够生成具有特定于SHOT特定控制的多拍视频，从而胜过基线。您可以在此HTTPS URL中找到更多详细信息

Title: Anatomical Attention Alignment representation for Radiology Report Generation

Authors: Quang Vinh Nguyen, Minh Duc Nguyen, Thanh Hoang Son Vo, Hyung-Jeong Yang, Soo-Hyung Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.07689
Pdf URL: https://arxiv.org/pdf/2505.07689
Copy Paste: [[2505.07689]] Anatomical Attention Alignment representation for Radiology Report Generation(https://arxiv.org/abs/2505.07689)
Keywords: generation
Abstract: Automated radiology report generation (RRG) aims at producing detailed descriptions of medical images, reducing radiologists' workload and improving access to high-quality diagnostic services. Existing encoder-decoder models rely only on visual features extracted from raw input images, which can limit the understanding of spatial structures and semantic relationships, often resulting in suboptimal text generation. To address this, we propose the Anatomical Attention Alignment Network (A3Net), a framework that enhances visual-textual understanding by constructing hyper-visual representations. Our approach integrates a knowledge dictionary of anatomical structures with patch-level visual features, enabling the model to effectively associate image regions with their corresponding anatomical entities. This structured representation improves semantic reasoning, interpretability, and cross-modal alignment, ultimately enhancing the accuracy and clinical relevance of generated reports. Experimental results on the IU X-Ray and MIMIC-CXR datasets demonstrate that A3Net significantly improves both visual perception and text generation quality. Our code is available on GitHub at this https URL.
摘要：自动放射学报告一代（RRG）旨在生产医学图像的详细描述，减少放射科医生的工作量并改善对高质量诊断服务的访问。现有的编码器模型仅依赖于从原始输入图像中提取的视觉特征，这可以限制对空间结构和语义关系的理解，通常会导致次优文本生成。为了解决这个问题，我们提出了解剖学注意一致性网络（A3NET），该框架通过构建高视觉表示来增强视觉文本理解。我们的方法将解剖结构的知识词典与贴片级的视觉特征集成在一起，从而使模型能够有效地将图像区域与相应的解剖学实体联系起来。这种结构化表示改善了语义推理，可解释性和跨模式对齐，最终提高了生成报告的准确性和临床相关性。 IU X射线和MIMIC-CXR数据集的实验结果表明，A3NET显着提高了视觉感知和文本生成质量。我们的代码可在\ href {此https url} {github}上获得。

Title: Gameplay Highlights Generation

Authors: Vignesh Edithal, Le Zhang, Ilia Blank, Imran Junejo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.07721
Pdf URL: https://arxiv.org/pdf/2505.07721
Copy Paste: [[2505.07721]] Gameplay Highlights Generation(https://arxiv.org/abs/2505.07721)
Keywords: generation
Abstract: In this work, we enable gamers to share their gaming experience on social media by automatically generating eye-catching highlight reels from their gameplay session. Our automation will save time for gamers while increasing audience engagement. We approach the highlight generation problem by first identifying intervals in the video where interesting events occur and then concatenating them. We developed an in-house gameplay event detection dataset containing interesting events annotated by humans using the VIA video annotator. Traditional techniques for highlight detection, such as game engine integration, require expensive collaboration with game developers. OCR techniques, which detect patches of specific images or text, require expensive per-game engineering and may not generalize across game UIs and languages. We finetuned a multimodal general-purpose video understanding model such as X-CLIP on our dataset; the resulting model generalizes across multiple games in a genre without per-game engineering. Prompt engineering was performed to improve the classification performance of this multimodal model. Our evaluation showed that such a finetuned model can detect interesting events in first-person shooting games from unseen gameplay footage with more than 90% accuracy. Moreover, our model performed significantly better on low-resource games (small dataset) when trained along with high-resource games, showing signs of transfer learning. To make the model production-ready, we used ONNX libraries to enable cross-platform inference. These libraries also provide post-training quantization tools to reduce model size and inference time for deployment. ONNX runtime libraries with the DirectML backend were used to perform efficient inference on Windows OS. We show that natural language supervision in the X-CLIP model leads to data-efficient and highly performant video recognition models.
摘要：在这项工作中，我们使游戏玩家能够在社交媒体上分享他们的游戏体验，从而自动从游戏玩法中产生引人注目的亮点卷轴，我们的自动化将为游戏玩家节省时间，同时增加观众的参与度。我们通过首先确定视频中发生有趣事件然后将它们加入的视频中的间隔来解决突出显示的生成问题。我们开发了一个内部的游戏事件检测数据集，该数据集包含人类使用视频注释者注释的有趣事件。突出显示诸如游戏引擎集成之类的传统技术需要与游戏开发人员进行昂贵的合作。检测特定图像或文本补丁的OCR技术需要每个游戏工程昂贵，并且可能不会在游戏UI和其他语言中概括。我们使用数据集对X-CLIP进行了多模式通用视频理解模型，该模型使用我们的数据集进行了多种游戏，该模型在没有每个游戏工程的情况下以多种游戏的形式概括了多个游戏。进行及时的工程以提高此多模式模型的分类性能。我们的评估表明，这种填充的模型可以检测出从看不见的游戏录像中的第一人称射击游戏中的有趣事件，其精度超过90％。此外，当与高资源游戏一起训练时，我们的模型在低资源游戏（小数据集）上的表现明显更好，显示了转移学习的迹象。为了准备好模型生产，我们使用ONNX库来启用跨平台推断。这些库还提供培训后的量化工具，以减少模型大小和部署的推理时间。具有DirectML后端的ONNX运行时库用于对Windows OS执行有效的推断。我们表明，X-CLIP模型中的自然语言监督会导致数据效率和高度性能的视频识别模型。
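
Note: The abstract above describes a two-step pipeline: find intervals containing interesting events, then concatenate them into a reel. It does not say how nearby detections are joined, so the following is only a minimal sketch of one plausible post-processing step. The function name `merge_intervals` and the `max_gap` parameter are hypothetical and do not come from the paper.

```python
def merge_intervals(intervals, max_gap=0.0):
    """Merge (start, end) intervals, in seconds, that overlap or lie within max_gap.

    Hypothetical helper: the paper does not specify its merging rule.
    """
    merged = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] <= max_gap:
            # Extend the previous clip instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Detected event windows from a hypothetical classifier, in seconds.
detections = [(30.0, 35.0), (10.0, 15.0), (14.0, 20.0)]
clips = merge_intervals(detections)
```

The merged clips come back sorted by start time, so concatenating them in order gives the highlight reel timeline.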

Title: LAMM-ViT: AI Face Detection via Layer-Aware Modulation of Region-Guided Attention

Authors: Jiangling Zhang, Weijie Zhu, Jirui Huang, Yaxiong Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.07734
Pdf URL: https://arxiv.org/pdf/2505.07734
Copy Paste: [[2505.07734]] LAMM-ViT: AI Face Detection via Layer-Aware Modulation of Region-Guided Attention(https://arxiv.org/abs/2505.07734)
Keywords: generation, generative
Abstract: Detecting AI-synthetic faces presents a critical challenge: it is hard to capture consistent structural relationships between facial regions across diverse generation techniques. Current methods, which focus on specific artifacts rather than fundamental inconsistencies, often fail when confronted with novel generative models. To address this limitation, we introduce Layer-aware Mask Modulation Vision Transformer (LAMM-ViT), a Vision Transformer designed for robust facial forgery detection. This model integrates distinct Region-Guided Multi-Head Attention (RG-MHA) and Layer-aware Mask Modulation (LAMM) components within each layer. RG-MHA utilizes facial landmarks to create regional attention masks, guiding the model to scrutinize architectural inconsistencies across different facial areas. Crucially, the separate LAMM module dynamically generates layer-specific parameters, including mask weights and gating values, based on network context. These parameters then modulate the behavior of RG-MHA, enabling adaptive adjustment of regional focus across network depths. This architecture facilitates the capture of subtle, hierarchical forgery cues ubiquitous among diverse generation techniques, such as GANs and Diffusion Models. In cross-model generalization tests, LAMM-ViT demonstrates superior performance, achieving 94.09% mean ACC (a +5.45% improvement over SoTA) and 98.62% mean AP (a +3.09% improvement). These results demonstrate LAMM-ViT's exceptional ability to generalize and its potential for reliable deployment against evolving synthetic media threats.
摘要：检测AI合成面孔提出了一个关键的挑战：很难捕获各种发电技术之间面部区域之间一致的结构关系。当前的方法集中在特定的伪影而不是基本的矛盾上，在面对新颖的生成模型时通常会失败。为了解决这一限制，我们引入了层次吸引的蒙版调制视觉变压器（LAMM-VIT），这是一种旨在强大面部伪造检测的视觉变压器。该模型集成了不同区域引导的多头注意力（RG-MHA）和每一层中的层次感知面膜调制（LAMM）组件。 RG-MHA利用面部标志来创建区域关注面罩，指导模型仔细检查不同面部区域的建筑不一致之处。至关重要的是，基于网络上下文，单独的LAMM模块动态生成特定于层的参数，包括掩盖权重和门控值。然后，这些参数调节了RG-MHA的行为，从而可以对跨网络深度进行自适应调整区域焦点。这种体系结构有助于捕获微妙的分层伪造线索无处不在，例如gan和扩散模型。在跨模型概括测试中，LAMM-VIT表现出卓越的性能，达到94.09％的平均ACC（A +5.45％比SOTA提高）和98.62％的平均AP（A +3.09％提高）。这些结果表明，Lamm-Vit具有概括性的特殊能力及其可靠部署的潜力，以防止不断发展的合成媒体威胁。

Title: Step1X-3D: Towards High-Fidelity and Controllable Generation of Textured 3D Assets

Authors: Weiyu Li, Xuanyang Zhang, Zheng Sun, Di Qi, Hao Li, Wei Cheng, Weiwei Cai, Shihao Wu, Jiarui Liu, Zihao Wang, Xiao Chen, Feipeng Tian, Jianxiong Pan, Zeming Li, Gang Yu, Xiangyu Zhang, Daxin Jiang, Ping Tan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.07747
Pdf URL: https://arxiv.org/pdf/2505.07747
Copy Paste: [[2505.07747]] Step1X-3D: Towards High-Fidelity and Controllable Generation of Textured 3D Assets(https://arxiv.org/abs/2505.07747)
Keywords: generation, generative
Abstract: While generative artificial intelligence has advanced significantly across text, image, audio, and video domains, 3D generation remains comparatively underdeveloped due to fundamental challenges such as data scarcity, algorithmic limitations, and ecosystem fragmentation. To this end, we present Step1X-3D, an open framework addressing these challenges through: (1) a rigorous data curation pipeline processing >5M assets to create a 2M high-quality dataset with standardized geometric and textural properties; (2) a two-stage 3D-native architecture combining a hybrid VAE-DiT geometry generator with a diffusion-based texture synthesis module; and (3) the full open-source release of models, training code, and adaptation modules. For geometry generation, the hybrid VAE-DiT component produces TSDF representations by employing perceiver-based latent encoding with sharp edge sampling for detail preservation. The diffusion-based texture synthesis module then ensures cross-view consistency through geometric conditioning and latent-space synchronization. Benchmark results demonstrate state-of-the-art performance that exceeds existing open-source methods, while also achieving competitive quality with proprietary solutions. Notably, the framework uniquely bridges the 2D and 3D generation paradigms by supporting direct transfer of 2D control techniques (e.g., LoRA) to 3D synthesis. By simultaneously advancing data quality, algorithmic fidelity, and reproducibility, Step1X-3D aims to establish new standards for open research in controllable 3D asset generation.
摘要：尽管生成人工智能在文本，图像，音频和视频域中都显着提高，但由于基本挑战，例如数据稀缺，算法限制和生态系统分散，3D一代仍然相对欠发达。为此，我们提出了STEP1X-3D，这是一个开放式框架，通过以下方式解决这些挑战：（1）严格的数据策划管道处理> 5M资产，以创建具有标准化的几何和纹理属性的2M高质量数据集；（2）将混合VAE-DIT几何发生器与基于扩散的纹理合成模块结合的两阶段3D本地结构；（3）模型，培训代码和适应模块的完整开源发布。对于几何产生，混合VAE-DIT组件通过使用基于感知器的潜在编码和尖锐的边缘采样来产生TSDF表示，以保存细节。然后，基于扩散的纹理合成模块通过几何条件和潜在空间同步确保了跨视图的一致性。基准结果表明，最先进的性能超过了现有的开源方法，同时还可以通过专有解决方案实现竞争性质量。值得注意的是，该框架通过支撑2D控制技术〜（例如Lora）的直接传输到3D合成来独特地桥接2D和3D代范式。通过同时提高数据质量，算法保真度和可重复性，Step1x-3D旨在为可控3D资产生成的开放研究建立新的标准。

Title: Synthesizing Diverse Network Flow Datasets with Scalable Dynamic Multigraph Generation

Authors: Arya Grayeli, Vipin Swarup, Steven E. Noel
Subjects: cs.LG, cs.NI
Abstract URL: https://arxiv.org/abs/2505.07777
Pdf URL: https://arxiv.org/pdf/2505.07777
Copy Paste: [[2505.07777]] Synthesizing Diverse Network Flow Datasets with Scalable Dynamic Multigraph Generation(https://arxiv.org/abs/2505.07777)
Keywords: generation, generative
Abstract: Obtaining real-world network datasets is often challenging because of privacy, security, and computational constraints. In the absence of such datasets, graph generative models become essential tools for creating synthetic datasets. In this paper, we introduce a novel machine learning model for generating high-fidelity synthetic network flow datasets that are representative of real-world networks. Our approach involves the generation of dynamic multigraphs using a stochastic Kronecker graph generator for structure generation and a tabular generative adversarial network for feature generation. We further employ an XGBoost (eXtreme Gradient Boosting) model for graph alignment, ensuring accurate overlay of features onto the generated graph structure. We evaluate our model using new metrics that assess both the accuracy and diversity of the synthetic graphs. Our results demonstrate improvements in accuracy over previous large-scale graph generation methods while maintaining similar efficiency. We also explore the trade-off between accuracy and diversity in synthetic graph dataset creation, a topic not extensively covered in related works. Our contributions include the synthesis and evaluation of large real-world netflow datasets and the definition of new metrics for evaluating synthetic graph generative models.
摘要：由于隐私，安全性和计算约束，获得现实世界网络数据集通常是具有挑战性的。在没有此类数据集的情况下，图生成模型成为创建合成数据集的必要工具。在本文中，我们介绍了一种新型的机器学习模型，用于生成代表现实世界网络的高保真综合网络流数据集。我们的方法涉及使用随机Kronecker图生成器进行结构生成的动态多编码，以及用于特征生成的表格生成对抗网络。我们进一步采用XGBoost（极端梯度提升）模型来进行图形对齐，从而确保了特征准确地覆盖到生成的图形结构上。我们使用新的指标评估我们的模型，以评估合成图的准确性和多样性。我们的结果表明，与以前的大型图生成方法相比，准确性的提高，同时保持相似的效率。我们还探讨了合成图数据集创建中准确性和多样性之间的权衡，这一主题在相关工作中未广泛介绍。我们的贡献包括对大型现实世界NetFlow数据集的综合和评估以及用于评估合成图生成模型的新指标的定义。
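
Note: The abstract above names a stochastic Kronecker graph generator as the structure-generation step. As a rough illustration of that general technique only (not the authors' scalable dynamic-multigraph implementation), the sketch below samples a simple directed graph. The edge probability between nodes u and v is the corresponding entry of the k-th Kronecker power of a small initiator matrix. All function names and the initiator values are assumptions chosen for illustration.

```python
import random


def edge_probability(initiator, k, u, v):
    """Entry (u, v) of the k-th Kronecker power of `initiator`.

    The entry is a product of initiator entries selected by the base-n0
    digits of u and v, so the full matrix is never materialized.
    """
    n0 = len(initiator)
    p = 1.0
    for _ in range(k):
        p *= initiator[u % n0][v % n0]
        u //= n0
        v //= n0
    return p


def sample_kronecker_graph(initiator, k, seed=0):
    """Sample each directed edge independently (naive O(n^2) loop, for clarity)."""
    rng = random.Random(seed)
    n = len(initiator) ** k
    return [
        (u, v)
        for u in range(n)
        for v in range(n)
        if rng.random() < edge_probability(initiator, k, u, v)
    ]


# Illustrative initiator matrix; the values are assumptions, not from the paper.
edges = sample_kronecker_graph([[0.9, 0.5], [0.5, 0.1]], k=3)
```

The quadratic loop is only suitable for small graphs; large-scale generators avoid enumerating every node pair.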

Title: MLE-Dojo: Interactive Environments for Empowering LLM Agents in Machine Learning Engineering

Authors: Rushi Qiang, Yuchen Zhuang, Yinghao Li, Dingu Sagar V K, Rongzhi Zhang, Changhao Li, Ian Shu-Hei Wong, Sherry Yang, Percy Liang, Chao Zhang, Bo Dai
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.07782
Pdf URL: https://arxiv.org/pdf/2505.07782
Copy Paste: [[2505.07782]] MLE-Dojo: Interactive Environments for Empowering LLM Agents in Machine Learning Engineering(https://arxiv.org/abs/2505.07782)
Keywords: generation
Abstract: We introduce MLE-Dojo, a Gym-style framework for systematically training (via reinforcement learning), evaluating, and improving autonomous large language model (LLM) agents in iterative machine learning engineering (MLE) workflows. Unlike existing benchmarks that primarily rely on static datasets or single-attempt evaluations, MLE-Dojo provides an interactive environment enabling agents to iteratively experiment, debug, and refine solutions through structured feedback loops. Built upon 200+ real-world Kaggle challenges, MLE-Dojo covers diverse, open-ended MLE tasks carefully curated to reflect realistic engineering scenarios such as data processing, architecture search, hyperparameter tuning, and code debugging. Its fully executable environment supports comprehensive agent training via both supervised fine-tuning and reinforcement learning, facilitating iterative experimentation, realistic data sampling, and real-time outcome verification. Extensive evaluations of eight frontier LLMs reveal that while current models achieve meaningful iterative improvements, they still exhibit significant limitations in autonomously generating long-horizon solutions and efficiently resolving complex errors. Furthermore, MLE-Dojo's flexible and extensible architecture seamlessly integrates diverse data sources, tools, and evaluation protocols, uniquely enabling model-based agent tuning and promoting interoperability, scalability, and reproducibility. We open-source our framework and benchmarks to foster community-driven innovation towards next-generation MLE agents.
摘要：我们介绍了MLE-Dojo，这是一个健身式框架，用于系统地增强学习，评估和改进迭代机器学习工程（MLE）工作流程中的自主大型语言模型（LLM）代理。与主要依赖静态数据集或单一评估的现有基准不同，MLE-Dojo提供了一个交互式环境，可通过结构化反馈循环进行迭代实验，调试和完善解决方案。 MLE-Dojo建立在200多个现实世界中的Kaggle挑战的基础上，涵盖了精心策划的各种开放式MLE任务，以反映现实的工程场景，例如数据处理，体系结构搜索，超参数调整和代码调试。它的完全可执行的环境通过监督的微调学习和强化学习，促进迭代实验，现实的数据采样和实时结果验证，以支持全面的代理培训。对八个Frontier LLMS的广泛评估表明，尽管当前模型实现了有意义的迭代改进，但它们在自主生成长马解决方案并有效地解决复杂错误方面仍具有重大局限性。此外，MLE-Dojo的灵活且可扩展的体系结构无缝地集成了各种数据源，工具和评估协议，从而唯一启用了基于模型的代理调整并促进互操作性，可伸缩性和可重复性。我们开源的框架和基准测试，以促进社区驱动的创新，以实现下一代MLE代理商。

Title: Continuous Visual Autoregressive Generation via Score Maximization

Authors: Chenze Shao, Fandong Meng, Jie Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.07812
Pdf URL: https://arxiv.org/pdf/2505.07812
Copy Paste: [[2505.07812]] Continuous Visual Autoregressive Generation via Score Maximization(https://arxiv.org/abs/2505.07812)
Keywords: generation, generative
Abstract: Conventional wisdom suggests that autoregressive models are used to process discrete data. When applied to continuous modalities such as visual data, Visual AutoRegressive modeling (VAR) typically resorts to quantization-based approaches to cast the data into a discrete space, which can introduce significant information loss. To tackle this issue, we introduce a Continuous VAR framework that enables direct visual autoregressive generation without vector quantization. The underlying theoretical foundation is strictly proper scoring rules, which provide powerful statistical tools capable of evaluating how well a generative model approximates the true distribution. Within this framework, all we need is to select a strictly proper score and set it as the training objective to optimize. We primarily explore a class of training objectives based on the energy score, which is likelihood-free and thus overcomes the difficulty of making probabilistic predictions in the continuous space. Previous efforts on continuous autoregressive generation, such as GIVT and diffusion loss, can also be derived from our framework using other strictly proper scores. Source code: this https URL.
摘要：传统的观点表明，自回旋模型用于处理离散数据。当应用于视觉数据等连续模式时，视觉自回归建模（VAR）通常采用基于量化的方法，以将数据投入离散空间，这可能会引入重大的信息丢失。为了解决此问题，我们引入了一个连续的VAR框架，该框架可以直接视觉自回归产生而无需量化矢量量化。基本理论基础是严格的正确评分规则，该规则提供了强大的统计工具，能够评估生成模型对真实分布的近似程度。在此框架内，我们只需要选择一个严格的适当分数并将其设置为优化的训练目标。我们主要根据能量评分探索一类培训目标，这是无可能的，因此克服了在连续空间中做出概率预测的困难。以前在连续自回旋产生（例如GIVT和扩散损失）上进行的努力也可以使用其他严格的适当分数从我们的框架中得出。源代码：此HTTPS URL。
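
Note: The abstract above builds its training objectives on the energy score, which can be estimated from model samples alone, without evaluating a likelihood. The sketch below is a plain Monte Carlo estimator of the standard energy score, written as a loss (lower is better): the mean of ||X - y||^beta over samples minus half the mean of ||X - X'||^beta over sample pairs, which is strictly proper for 0 < beta < 2. It illustrates the scoring rule only and is not the authors' training code. The function name and the sample layout are assumptions.

```python
import math


def energy_score(samples, target, beta=1.0):
    """Monte Carlo estimate of the energy score of `samples` against `target`.

    samples: list of equal-length vectors drawn from the model (at least 2).
    target:  the observed vector.
    """
    m = len(samples)
    if m < 2:
        raise ValueError("need at least two samples for the pairwise term")

    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # Accuracy term: average distance from each sample to the observation.
    to_target = sum(dist(s, target) ** beta for s in samples) / m
    # Spread term: average distance between distinct sample pairs.
    pair_sum = sum(
        dist(samples[i], samples[j]) ** beta
        for i in range(m)
        for j in range(i + 1, m)
    )
    between = 2.0 * pair_sum / (m * (m - 1))
    return to_target - 0.5 * between


samples = [(0.0, 0.0), (2.0, 0.0)]
near = energy_score(samples, (1.0, 0.0))
far = energy_score(samples, (5.0, 0.0))
```

Samples clustered around the target score lower than samples far from it, while the pairwise term rewards spread so the model is not pushed to collapse onto a single point.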

Title: DanceGRPO: Unleashing GRPO on Visual Generation

Authors: Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, Ping Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.07818
Pdf URL: https://arxiv.org/pdf/2505.07818
Copy Paste: [[2505.07818]] DanceGRPO: Unleashing GRPO on Visual Generation(https://arxiv.org/abs/2505.07818)
Keywords: generation, generative
Abstract: Recent breakthroughs in generative models, particularly diffusion models and rectified flows, have revolutionized visual content creation, yet aligning model outputs with human preferences remains a critical challenge. Existing reinforcement learning (RL)-based methods for visual generation face critical limitations: incompatibility with modern Ordinary Differential Equation (ODE)-based sampling paradigms, instability in large-scale training, and lack of validation for video generation. This paper introduces DanceGRPO, the first unified framework to adapt Group Relative Policy Optimization (GRPO) to visual generation paradigms, unleashing one unified RL algorithm across two generative paradigms (diffusion models and rectified flows), three tasks (text-to-image, text-to-video, image-to-video), four foundation models (Stable Diffusion, HunyuanVideo, FLUX, SkyReel-I2V), and five reward models (image/video aesthetics, text-image alignment, video motion quality, and binary reward). To our knowledge, DanceGRPO is the first RL-based unified framework capable of seamless adaptation across diverse generative paradigms, tasks, foundation models, and reward models. DanceGRPO demonstrates consistent and substantial improvements, outperforming baselines by up to 181% on benchmarks such as HPS-v2.1, CLIP Score, VideoAlign, and GenEval. Notably, DanceGRPO not only stabilizes policy optimization for complex video generation, but also enables the generative policy to better capture denoising trajectories for Best-of-N inference scaling and to learn from sparse binary feedback. Our results establish DanceGRPO as a robust and versatile solution for scaling Reinforcement Learning from Human Feedback (RLHF) tasks in visual generation, offering new insights into harmonizing reinforcement learning and visual synthesis. The code will be released.
摘要：在生成模型中的最新突破，主要扩散模型和整流的流量彻底改变了视觉内容的创造，但是将模型输出与人类偏好保持一致仍然是一个关键的挑战。现有的强化学习（RL）基于视觉生成的方法面临临界局限性：与现代普通微分方程（ODES）基于基于的采样范式的不相容性，大规模培训中的不稳定性以及缺乏视频生成的验证。 This paper introduces DanceGRPO, the first unified framework to adapt Group Relative Policy Optimization (GRPO) to visual generation paradigms, unleashing one unified RL algorithm across two generative paradigms (diffusion models and rectified flows), three tasks (text-to-image, text-to-video, image-to-video), four foundation models (Stable Diffusion, HunyuanVideo, FLUX, Skyreel-i2v）和五个奖励模型（图像/视频美学，文本图像对齐，视频运动质量和二进制奖励）。据我们所知，DanceGrpo是第一个基于RL的统一框架，能够在各种生成范式，任务，基础模型和奖励模型中进行无缝适应。 DanceGrpo表现出一致且实质性的改进，在HPS-V2.1，剪辑得分，VideoAlign和Geneval等基准上，其表现优于181％。值得注意的是，DanceGrpo不仅可以稳定复杂的视频生成的策略优化，而且还可以使生成策略更好地捕获DeNo的轨迹，从而获得最佳推理缩放，并从稀疏的二进制反馈中学习。我们的结果将DanceGrpo建立为强大而多功能的解决方案，用于从人类反馈（RLHF）任务中扩展增强学习，从而为协调强化学习和视觉合成提供了新的见解。代码将发布。
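
Note: The abstract above adapts Group Relative Policy Optimization (GRPO) to visual generation. The core of GRPO as generally described is that several outputs are sampled for the same prompt, and each output's advantage is its reward normalized by the group's mean and standard deviation, so no learned value function is needed. The sketch below shows only that normalization step. It is not the DanceGRPO code; the function name, the `eps` constant, and the use of the population standard deviation are assumptions (implementations differ on the last point).

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples drawn for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    # eps keeps the division finite when every reward in the group is equal.
    return [(r - mean) / (std + eps) for r in rewards]


# Four generations for one prompt, scored by a hypothetical binary reward.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

With a binary reward, above-average samples receive positive advantages and below-average samples negative ones; a group whose rewards are all equal contributes zero advantage and therefore no gradient signal.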