2025-07-29

Title: Beyond 9-to-5: A Generative Model for Augmenting Mobility Data of Underrepresented Shift Workers

Authors: Haoxuan Ma, Xishun Liao, Yifan Liu, Chris Stanford, Jiaqi Ma
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.19510
Pdf URL: https://arxiv.org/pdf/2507.19510
Copy Paste: [[2507.19510]] Beyond 9-to-5: A Generative Model for Augmenting Mobility Data of Underrepresented Shift Workers(https://arxiv.org/abs/2507.19510)
Keywords: generative
Abstract: This paper addresses a critical gap in urban mobility modeling by focusing on shift workers, a population segment comprising 15-20% of the workforce in industrialized societies yet systematically underrepresented in traditional transportation surveys and planning. This underrepresentation is revealed in this study by a comparative analysis of GPS and survey data, highlighting stark differences between the bimodal temporal patterns of shift workers and the conventional 9-to-5 schedules recorded in surveys. To address this bias, we introduce a novel transformer-based approach that leverages fragmented GPS trajectory data to generate complete, behaviorally valid activity patterns for individuals working non-standard hours. Our method employs period-aware temporal embeddings and a transition-focused loss function specifically designed to capture the unique activity rhythms of shift workers and mitigate the inherent biases in conventional transportation datasets. Evaluation shows that the generated data achieves remarkable distributional alignment with GPS data from Los Angeles County (Average JSD < 0.02 for all evaluation metrics). By transforming incomplete GPS traces into complete, representative activity patterns, our approach provides transportation planners with a powerful data augmentation tool to fill critical gaps in understanding the 24/7 mobility needs of urban populations, enabling precise and inclusive transportation planning.
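The abstract reports alignment as an average Jensen-Shannon divergence (JSD) below 0.02. As a rough illustration of what that metric measures, the sketch below computes JSD between two discrete distributions in plain Python. The base-2 logarithm, the function name, and the sample histograms are assumptions made for illustration; the abstract does not say how the paper bins its data or which log base it uses.

```python
import math


def jensen_shannon_divergence(p, q, base=2.0):
    """JSD between two discrete distributions given as equal-length
    sequences of non-negative weights (normalized internally)."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of bins")
    total_p, total_q = sum(p), sum(q)
    if total_p <= 0 or total_q <= 0:
        raise ValueError("each distribution needs positive total mass")
    p = [x / total_p for x in p]
    q = [x / total_q for x in q]
    # Mixture distribution M = (P + Q) / 2.
    m = [(a + b) / 2.0 for a, b in zip(p, q)]

    def kl(a, b):
        # Terms with a_i == 0 contribute 0; b_i > 0 whenever a_i > 0
        # because b is the mixture that contains a.
        return sum(x * math.log(x / y, base) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical counts per time-of-day bin (made-up numbers, not from the paper).
gps_counts = [30, 5, 10, 40, 15]
survey_counts = [5, 45, 35, 10, 5]
print(jensen_shannon_divergence(gps_counts, survey_counts))
```

With a base-2 logarithm the value lies between 0 (identical distributions) and 1 (disjoint support), which gives a scale for reading a threshold such as 0.02.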

Title: Enhancing Spatiotemporal Networks with xLSTM: A Scalar LSTM Approach for Cellular Traffic Forecasting

Authors: Khalid Ali, Zineddine Bettouche, Andreas Kassler, Andreas Fischer
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.19513
Pdf URL: https://arxiv.org/pdf/2507.19513
Copy Paste: [[2507.19513]] Enhancing Spatiotemporal Networks with xLSTM: A Scalar LSTM Approach for Cellular Traffic Forecasting(https://arxiv.org/abs/2507.19513)
Keywords: generation
Abstract: Accurate spatiotemporal traffic forecasting is vital for intelligent resource management in 5G and beyond. However, conventional AI approaches often fail to capture the intricate spatial and temporal patterns that arise from, e.g., the mobility of users. We introduce a lightweight, dual-path Spatiotemporal Network that leverages a Scalar LSTM (sLSTM) for efficient temporal modeling and a three-layer Conv3D module for spatial feature extraction. A fusion layer integrates both streams into a cohesive representation, enabling robust forecasting. Our design improves gradient stability and convergence speed while reducing prediction error. Evaluations on real-world datasets show superior forecast performance over ConvLSTM baselines and strong generalization to unseen regions, making it well-suited for large-scale, next-generation network deployments. Experimental evaluation shows a 23% MAE reduction over ConvLSTM, with a 30% improvement in model generalization.

Title: Language Models for Controllable DNA Sequence Design

Authors: Xingyu Su, Xiner Li, Yuchao Lin, Ziqian Xie, Degui Zhi, Shuiwang Ji
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.19523
Pdf URL: https://arxiv.org/pdf/2507.19523
Copy Paste: [[2507.19523]] Language Models for Controllable DNA Sequence Design(https://arxiv.org/abs/2507.19523)
Keywords: generation
Abstract: We consider controllable DNA sequence design, where sequences are generated by conditioning on specific biological properties. While language models (LMs) such as GPT and BERT have achieved remarkable success in natural language generation, their application to DNA sequence generation remains largely underexplored. In this work, we introduce ATGC-Gen, an Automated Transformer Generator for Controllable Generation, which leverages cross-modal encoding to integrate diverse biological signals. ATGC-Gen is instantiated with both decoder-only and encoder-only transformer architectures, allowing flexible training and generation under either autoregressive or masked recovery objectives. We evaluate ATGC-Gen on representative tasks including promoter and enhancer sequence design, and further introduce a new dataset based on ChIP-Seq experiments for modeling protein binding specificity. Our experiments demonstrate that ATGC-Gen can generate fluent, diverse, and biologically relevant sequences aligned with the desired properties. Compared to prior methods, our model achieves notable improvements in controllability and functional relevance, highlighting the potential of language models in advancing programmable genomic design. The source code is released at (this https URL).

Title: Kolmogorov Arnold Network Autoencoder in Medicine

Authors: Ugo Lomoio, Pierangelo Veltri, Pietro Hiram Guzzi
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.19524
Pdf URL: https://arxiv.org/pdf/2507.19524
Copy Paste: [[2507.19524]] Kolmogorov Arnold Network Autoencoder in Medicine(https://arxiv.org/abs/2507.19524)
Keywords: generation
Abstract: Deep learning neural network architectures such as Multi-Layer Perceptrons (MLP) and convolutional blocks still play a crucial role in current research. From a topological point of view, these architectures may be represented as graphs in which we learn the functions related to the nodes while fixed edges convey the information from the input to the output. A recent work introduced a new architecture called Kolmogorov Arnold Networks (KAN) and reported that putting learnable activation functions on the edges of the neural network leads to better performance in multiple scenarios. Multiple studies are focusing on optimizing the KAN architecture by adding important features such as dropout regularization, Autoencoders (AE), model benchmarking and, last but not least, the KAN Convolutional Network (KCN), which introduced matrix convolution with KAN learning. This study aims to benchmark multiple versions of vanilla AEs (such as Linear, Convolutional and Variational) against their Kolmogorov-Arnold counterparts that have the same or a smaller number of parameters. Using cardiological signals as model input, a total of five different classic AE tasks were studied: reconstruction, generation, denoising, inpainting and anomaly detection. The proposed experiments use a medical dataset, AbnormalHeartbeat, that contains audio signals obtained from the stethoscope.

Title: Efficient and Scalable Agentic AI with Heterogeneous Systems

Authors: Zain Asgar, Michelle Nguyen, Sachin Katti
Subjects: cs.LG, cs.AI, cs.DC
Abstract URL: https://arxiv.org/abs/2507.19635
Pdf URL: https://arxiv.org/pdf/2507.19635
Copy Paste: [[2507.19635]] Efficient and Scalable Agentic AI with Heterogeneous Systems(https://arxiv.org/abs/2507.19635)
Keywords: generation
Abstract: AI agents are emerging as a dominant workload in a wide range of applications, promising to be the vehicle that delivers the promised benefits of AI to enterprises and consumers. Unlike conventional software or static inference, agentic workloads are dynamic and structurally complex. Often these agents are directed graphs of compute and IO operations that span multi-modal data input and conversion, data processing and context gathering (e.g., vector DB lookups), multiple LLM inferences, tool calls, etc. To scale AI agent usage, we need efficient and scalable deployment and agent-serving infrastructure. To tackle this challenge, in this paper, we present a system design for dynamic orchestration of AI agent workloads on heterogeneous compute infrastructure spanning CPUs and accelerators, both from different vendors and across different performance tiers within a single vendor. The system delivers several building blocks: a framework for planning and optimizing agentic AI execution graphs using cost models that account for compute, memory, and bandwidth constraints of different HW; an MLIR-based representation and compilation system that can decompose AI agent execution graphs into granular operators and generate code for different HW options; and a dynamic orchestration system that can place the granular components across a heterogeneous compute infrastructure and stitch them together while meeting an end-to-end SLA. Our design performs a systems-level TCO optimization, and preliminary results show that leveraging a heterogeneous infrastructure can deliver significant TCO benefits. A surprising preliminary finding is that for some workloads a heterogeneous combination of older-generation GPUs with newer accelerators can deliver a TCO similar to that of the latest-generation homogeneous GPU infrastructure design, potentially extending the life of deployed infrastructure.

Title: SynPAIN: A Synthetic Dataset of Pain and Non-Pain Facial Expressions

Authors: Babak Taati, Muhammad Muzammil, Yasamin Zarghami, Abhishek Moturu, Amirhossein Kazerouni, Hailey Reimer, Alex Mihailidis, Thomas Hadjistavropoulos
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.19673
Pdf URL: https://arxiv.org/pdf/2507.19673
Copy Paste: [[2507.19673]] SynPAIN: A Synthetic Dataset of Pain and Non-Pain Facial Expressions(https://arxiv.org/abs/2507.19673)
Keywords: generative
Abstract: Accurate pain assessment in patients with limited ability to communicate, such as older adults with dementia, represents a critical healthcare challenge. Robust automated systems of pain detection may facilitate such assessments. Existing pain detection datasets, however, suffer from limited ethnic/racial diversity, privacy constraints, and underrepresentation of older adults who are the primary target population for clinical deployment. We present SynPAIN, a large-scale synthetic dataset containing 10,710 facial expression images (5,355 neutral/expressive pairs) across five ethnicities/races, two age groups (young: 20-35, old: 75+), and two genders. Using commercial generative AI tools, we created demographically balanced synthetic identities with clinically meaningful pain expressions. Our validation demonstrates that synthetic pain expressions exhibit expected pain patterns, scoring significantly higher than neutral and non-pain expressions using clinically validated pain assessment tools based on facial action unit analysis. We experimentally demonstrate SynPAIN's utility in identifying algorithmic bias in existing pain detection models. Through comprehensive bias evaluation, we reveal substantial performance disparities across demographic characteristics. These performance disparities were previously undetectable with smaller, less diverse datasets. Furthermore, we demonstrate that age-matched synthetic data augmentation improves pain detection performance on real clinical data, achieving a 7.0% improvement in average precision. SynPAIN addresses critical gaps in pain assessment research by providing the first publicly available, demographically diverse synthetic dataset specifically designed for older adult pain detection, while establishing a framework for measuring and mitigating algorithmic bias. The dataset is available at this https URL

Title: Salsa as a Nonverbal Embodied Language -- The CoMPAS3D Dataset and Benchmarks

Authors: Bermet Burkanova, Payam Jome Yazdian, Chuxuan Zhang, Trinity Evans, Paige Tuttösí, Angelica Lim
Subjects: cs.LG, cs.AI, cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2507.19684
Pdf URL: https://arxiv.org/pdf/2507.19684
Copy Paste: [[2507.19684]] Salsa as a Nonverbal Embodied Language -- The CoMPAS3D Dataset and Benchmarks(https://arxiv.org/abs/2507.19684)
Keywords: generation
Abstract: Imagine a humanoid that can safely and creatively dance with a human, adapting to its partner's proficiency, using haptic signaling as a primary form of communication. While today's AI systems excel at text or voice-based interaction with large language models, human communication extends far beyond text; it includes embodied movement, timing, and physical coordination. Modeling coupled interaction between two agents poses a formidable challenge: it is continuous, bidirectionally reactive, and shaped by individual variation. We present CoMPAS3D, the largest and most diverse motion capture dataset of improvised salsa dancing, designed as a challenging testbed for interactive, expressive humanoid AI. The dataset includes 3 hours of leader-follower salsa dances performed by 18 dancers spanning beginner, intermediate, and professional skill levels. For the first time, we provide fine-grained salsa expert annotations, covering over 2,800 move segments, including move types, combinations, execution errors and stylistic elements. We draw analogies between partner dance communication and natural language, evaluating CoMPAS3D on two benchmark tasks for synthetic humans that parallel key problems in spoken language and dialogue processing: leader or follower generation with proficiency levels (speaker or listener synthesis), and duet (conversation) generation. Towards a long-term goal of partner dance with humans, we release the dataset, annotations, and code, along with a multitask SalsaAgent model capable of performing all benchmark tasks, alongside additional baselines to encourage research in socially interactive embodied AI and creative, expressive humanoid motion generation.

Title: Disjoint Generative Models

Authors: Anton Danholt Lautrup, Muhammad Rajabinasab, Tobias Hyrup, Arthur Zimek, Peter Schneider-Kamp
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.19700
Pdf URL: https://arxiv.org/pdf/2507.19700
Copy Paste: [[2507.19700]] Disjoint Generative Models(https://arxiv.org/abs/2507.19700)
Keywords: generative
Abstract: We propose a new framework for generating cross-sectional synthetic datasets via disjoint generative models. In this paradigm, a dataset is partitioned into disjoint subsets that are supplied to separate instances of generative models. The results are then combined post hoc by a joining operation that works in the absence of common variables/identifiers. The success of the framework is demonstrated through several case studies and examples on tabular data that help illuminate some of the design choices that one may make. The principal benefit of disjoint generative models is significantly increased privacy at only a low utility cost. Additional findings include increased effectiveness and feasibility for certain model types and the possibility for mixed-model synthesis.

Title: Bias Analysis for Synthetic Face Detection: A Case Study of the Impact of Facial Attribute

Authors: Asmae Lamsaf, Lucia Cascone, Hugo Proença, João Neves
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.19705
Pdf URL: https://arxiv.org/pdf/2507.19705
Copy Paste: [[2507.19705]] Bias Analysis for Synthetic Face Detection: A Case Study of the Impact of Facial Attribute(https://arxiv.org/abs/2507.19705)
Keywords: generation
Abstract: Bias analysis for synthetic face detection is bound to become a critical topic in the coming years. Although many detection models have been developed and several datasets have been released to reliably identify synthetic content, one crucial aspect has been largely overlooked: these models and training datasets can be biased, leading to failures in detection for certain demographic groups and raising significant social, legal, and ethical issues. In this work, we introduce an evaluation framework to contribute to the analysis of bias of synthetic face detectors with respect to several facial attributes. This framework exploits synthetic data generation, with evenly distributed attribute labels, for mitigating any skew in the data that could otherwise influence the outcomes of bias analysis. We build on the proposed framework to provide an extensive case study of the bias level of five state-of-the-art detectors in synthetic datasets with 25 controlled facial attributes. While the results confirm that, in general, synthetic face detectors are biased towards the presence/absence of specific facial attributes, our study also sheds light on the origins of the observed bias through the analysis of the correlations with the balancing of facial attributes in the training sets of the detectors, and the analysis of the detectors' activation maps in image pairs with controlled attribute modifications.

Title: Beyond Nearest Neighbors: Semantic Compression and Graph-Augmented Retrieval for Enhanced Vector Search

Authors: Rahul Raja, Arpita Vats
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.19715
Pdf URL: https://arxiv.org/pdf/2507.19715
Copy Paste: [[2507.19715]] Beyond Nearest Neighbors: Semantic Compression and Graph-Augmented Retrieval for Enhanced Vector Search(https://arxiv.org/abs/2507.19715)
Keywords: generation
Abstract: Vector databases typically rely on approximate nearest neighbor (ANN) search to retrieve the top-k closest vectors to a query in embedding space. While effective, this approach often yields semantically redundant results, missing the diversity and contextual richness required by applications such as retrieval-augmented generation (RAG), multi-hop QA, and memory-augmented agents. We introduce a new retrieval paradigm: semantic compression, which aims to select a compact, representative set of vectors that captures the broader semantic structure around a query. We formalize this objective using principles from submodular optimization and information geometry, and show that it generalizes traditional top-k retrieval by prioritizing coverage and diversity. To operationalize this idea, we propose graph-augmented vector retrieval, which overlays semantic graphs (e.g., kNN or knowledge-based links) atop vector spaces to enable multi-hop, context-aware search. We theoretically analyze the limitations of proximity-based retrieval under high-dimensional concentration and highlight how graph structures can improve semantic coverage. Our work outlines a foundation for meaning-centric vector search systems, emphasizing hybrid indexing, diversity-aware querying, and structured semantic retrieval. We make our implementation publicly available to foster future research in this area.
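To make the contrast between plain top-k retrieval and diversity-aware selection concrete, here is a minimal sketch using maximal marginal relevance (MMR), a well-known greedy heuristic. It is a stand-in chosen for illustration, not the submodular objective or graph-augmented method the abstract describes; the function names, the `trade_off` parameter, and the toy vectors are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mmr_select(query, vectors, k, trade_off=0.5):
    """Greedily pick up to k indices, balancing relevance to the query
    against redundancy with items already selected.
    trade_off=1.0 reduces to ordinary top-k by cosine similarity."""
    relevance = [cosine(query, v) for v in vectors]
    remaining = list(range(len(vectors)))
    selected = []
    while remaining and len(selected) < k:
        def score(i):
            # Redundancy = highest similarity to anything already chosen.
            redundancy = max(
                (cosine(vectors[i], vectors[j]) for j in selected),
                default=0.0,
            )
            return trade_off * relevance[i] - (1.0 - trade_off) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: vectors 0 and 1 are near-duplicates; vector 2 points elsewhere.
query = [1.0, 0.0]
vectors = [[1.0, 0.1], [1.0, 0.11], [0.6, 0.8]]
print(mmr_select(query, vectors, k=2, trade_off=1.0))  # relevance only
print(mmr_select(query, vectors, k=2, trade_off=0.3))  # favors diversity
```

With relevance only, the two near-duplicate vectors are returned together; lowering `trade_off` swaps the duplicate for the less similar but more complementary vector, which is the kind of redundancy the abstract argues top-k search suffers from.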

Title: MoFRR: Mixture of Diffusion Models for Face Retouching Restoration

Authors: Jiaxin Liu, Qichao Ying, Zhenxing Qian, Sheng Li, Runqi Zhang, Jian Liu, Xinpeng Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.19770
Pdf URL: https://arxiv.org/pdf/2507.19770
Copy Paste: [[2507.19770]] MoFRR: Mixture of Diffusion Models for Face Retouching Restoration(https://arxiv.org/abs/2507.19770)
Keywords: restoration
Abstract: The widespread use of face retouching on social media platforms raises concerns about the authenticity of face images. While existing methods focus on detecting face retouching, how to accurately recover the original faces from the retouched ones has yet to be answered. This paper introduces Face Retouching Restoration (FRR), a novel computer vision task aimed at restoring original faces from their retouched counterparts. FRR differs from traditional image restoration tasks by addressing the complex retouching operations with various types and degrees, which focuses more on the restoration of the low-frequency information of the faces. To tackle this challenge, we propose MoFRR, Mixture of Diffusion Models for FRR. Inspired by DeepSeek's expert isolation strategy, the MoFRR uses sparse activation of specialized experts handling distinct retouching types and the engagement of a shared expert dealing with universal retouching traces. Each specialized expert follows a dual-branch structure with a DDIM-based low-frequency branch guided by an Iterative Distortion Evaluation Module (IDEM) and a Cross-Attention-based High-Frequency branch (HFCAM) for detail refinement. Extensive experiments on a newly constructed face retouching dataset, RetouchingFFHQ++, demonstrate the effectiveness of MoFRR for FRR.

Title: Large Language Model Agent for Structural Drawing Generation Using ReAct Prompt Engineering and Retrieval Augmented Generation

Authors: Xin Zhang, Lissette Iturburu, Juan Nicolas Villamizar, Xiaoyu Liu, Manuel Salmeron, Shirley J. Dyke, Julio Ramirez
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.19771
Pdf URL: https://arxiv.org/pdf/2507.19771
Copy Paste: [[2507.19771]] Large Language Model Agent for Structural Drawing Generation Using ReAct Prompt Engineering and Retrieval Augmented Generation(https://arxiv.org/abs/2507.19771)
Keywords: generation, generative
Abstract: Structural drawings are widely used in many fields, e.g., mechanical engineering, civil engineering, etc. In civil engineering, structural drawings serve as the main communication tool between architects, engineers, and builders to avoid conflicts, act as legal documentation, and provide a reference for future maintenance or evaluation needs. They are often organized using key elements such as title/subtitle blocks, scales, plan views, elevation views, sections, and detailed sections, which are annotated with standardized symbols and line types for interpretation by engineers and contractors. Despite advances in software capabilities, the task of generating a structural drawing remains labor-intensive and time-consuming for structural engineers. Here we introduce a novel generative AI-based method for generating structural drawings employing a large language model (LLM) agent. The method incorporates a retrieval-augmented generation (RAG) technique using externally-sourced facts to enhance the accuracy and reliability of the language model. This method is capable of understanding varied natural language descriptions, processing these to extract necessary information, and generating code to produce the desired structural drawing in AutoCAD. The approach developed, demonstrated, and evaluated herein enables the efficient and direct conversion of a structural drawing's natural language description into an AutoCAD drawing, significantly reducing the workload compared to the current working process associated with manual drawing production and facilitating the typical iterative process by which engineers express design ideas in a simplified way.

Title: JDATT: A Joint Distillation Framework for Atmospheric Turbulence Mitigation and Target Detection

Authors: Zhiming Liu, Paul Hill, Nantheera Anantrasirichai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.19780
Pdf URL: https://arxiv.org/pdf/2507.19780
Copy Paste: [[2507.19780]] JDATT: A Joint Distillation Framework for Atmospheric Turbulence Mitigation and Target Detection(https://arxiv.org/abs/2507.19780)
Keywords: restoration, generative
Abstract: Atmospheric turbulence (AT) introduces severe degradations, such as rippling, blur, and intensity fluctuations, that hinder both image quality and downstream vision tasks like target detection. While recent deep learning-based approaches have advanced AT mitigation using transformer and Mamba architectures, their high complexity and computational cost make them unsuitable for real-time applications, especially in resource-constrained settings such as remote surveillance. Moreover, the common practice of separating turbulence mitigation and object detection leads to inefficiencies and suboptimal performance. To address these challenges, we propose JDATT, a Joint Distillation framework for Atmospheric Turbulence mitigation and Target detection. JDATT integrates state-of-the-art AT mitigation and detection modules and introduces a unified knowledge distillation strategy that compresses both components while minimizing performance loss. We employ a hybrid distillation scheme: feature-level distillation via Channel-Wise Distillation (CWD) and Masked Generative Distillation (MGD), and output-level distillation via Kullback-Leibler divergence. Experiments on synthetic and real-world turbulence datasets demonstrate that JDATT achieves superior visual restoration and detection accuracy while significantly reducing model size and inference time, making it well-suited for real-time deployment.
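The abstract mentions output-level distillation via Kullback-Leibler divergence. As a hedged illustration of that general idea, the sketch below computes a temperature-softened KL loss between teacher and student logits in plain Python. The temperature value, the T² scaling (a common convention in knowledge distillation), and all names are assumptions; the abstract does not specify JDATT's exact loss.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in scaled))
    return [z - log_norm for z in scaled]


def kd_kl_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened outputs,
    scaled by temperature**2 (assumed convention, not from the paper)."""
    if len(teacher_logits) != len(student_logits):
        raise ValueError("logit vectors must have the same length")
    log_p = log_softmax(teacher_logits, temperature)  # teacher
    log_q = log_softmax(student_logits, temperature)  # student
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return temperature ** 2 * kl


# Hypothetical class logits for one detection (made-up numbers).
teacher = [4.0, 1.0, 0.5]
student = [2.5, 1.5, 0.2]
print(kd_kl_loss(teacher, student))
```

The loss is zero when the student reproduces the teacher's softened distribution and grows as the two diverge; a higher temperature spreads probability mass over non-argmax classes so the student also learns from the teacher's relative rankings.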

Title: DepthFlow: Exploiting Depth-Flow Structural Correlations for Unsupervised Video Object Segmentation

Authors: Suhwan Cho, Minhyeok Lee, Jungho Lee, Donghyeong Kim, Sangyoun Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.19790
Pdf URL: https://arxiv.org/pdf/2507.19790
Copy Paste: [[2507.19790]] DepthFlow: Exploiting Depth-Flow Structural Correlations for Unsupervised Video Object Segmentation(https://arxiv.org/abs/2507.19790)
Keywords: generation
Abstract: Unsupervised video object segmentation (VOS) aims to detect the most prominent object in a video. Recently, two-stream approaches that leverage both RGB images and optical flow have gained significant attention, but their performance is fundamentally constrained by the scarcity of training data. To address this, we propose DepthFlow, a novel data generation method that synthesizes optical flow from single images. Our approach is driven by the key insight that VOS models depend more on structural information embedded in flow maps than on their geometric accuracy, and that this structure is highly correlated with depth. We first estimate a depth map from a source image and then convert it into a synthetic flow field that preserves essential structural cues. This process enables the transformation of large-scale image-mask pairs into image-flow-mask training pairs, dramatically expanding the data available for network training. By training a simple encoder-decoder architecture with our synthesized data, we achieve new state-of-the-art performance on all public VOS benchmarks, demonstrating a scalable and effective solution to the data scarcity problem.

Title: ForCenNet: Foreground-Centric Network for Document Image Rectification

Authors: Peng Cai, Qiang Li, Kaicheng Yang, Dong Guo, Jia Li, Nan Zhou, Xiang An, Ninghua Yang, Jiankang Deng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.19804
Pdf URL: https://arxiv.org/pdf/2507.19804
Copy Paste: [[2507.19804]] ForCenNet: Foreground-Centric Network for Document Image Rectification(https://arxiv.org/abs/2507.19804)
Keywords: generation
Abstract: Document image rectification aims to eliminate geometric deformation in photographed documents to facilitate text recognition. However, existing methods often neglect the significance of foreground elements, which provide essential geometric references and layout information for document image correction. In this paper, we introduce Foreground-Centric Network (ForCenNet) to eliminate geometric distortions in document images. Specifically, we first propose a foreground-centric label generation method, which extracts detailed foreground elements from an undistorted image. Then we introduce a foreground-centric mask mechanism to enhance the distinction between readable and background regions. Furthermore, we design a curvature consistency loss to leverage the detailed foreground labels to help the model understand the distorted geometric distribution. Extensive experiments demonstrate that ForCenNet achieves new state-of-the-art results on four real-world benchmarks: DocUNet, DIR300, WarpDoc, and DocReal. Quantitative analysis shows that the proposed method effectively undistorts layout elements, such as text lines and table borders. The resources for further comparison are provided at this https URL.
摘要：文档图像矫正旨在消除照片文档中的几何变形，以促进文本识别。但是，现有方法通常忽略了前景元素的重要性，这些元素为文档图像校正提供了必要的几何参考和布局信息。在本文中，我们介绍了以前景为中心的网络（Forcennet），以消除文档图像中的几何变形。具体而言，我们最初提出了一种以前景为中心的标签生成方法，该方法从未经发生的图像中提取详细的前景元素。然后，我们引入了一种以前景为中心的掩码机制，以增强可读区域和背景区域之间的区别。此外，我们设计了曲率一致性损失，以利用详细的前景标签来帮助模型了解扭曲的几何分布。广泛的实验表明，Forcennet在四个现实世界中实现了新的最先进的基准，例如Docunet，Dir300，Warpdoc和Docreal。定量分析表明，所提出的方法有效地不融合了布局元素，例如文本线和表边框。此HTTPS URL提供了进一步比较的资源。

Title: SeeDiff: Off-the-Shelf Seeded Mask Generation from Diffusion Models

Authors: Joon Hyun Park, Kumju Jo, Sungyong Baik
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.19808
Pdf URL: https://arxiv.org/pdf/2507.19808
Copy Paste: [[2507.19808]] SeeDiff: Off-the-Shelf Seeded Mask Generation from Diffusion Models(https://arxiv.org/abs/2507.19808)
Keywords: generation, generative
Abstract: Entrusted with the goal of pixel-level object classification, semantic segmentation networks entail the laborious preparation of pixel-level annotation masks. To obtain pixel-level annotation masks for a given class without human effort, a few recent works have proposed to generate pairs of images and annotation masks by employing image and text relationships modeled by text-to-image generative models, especially Stable Diffusion. However, these works do not fully exploit the capability of text-guided Diffusion models and thus require a pre-trained segmentation network, careful text prompt tuning, or the training of a segmentation network to generate final annotation masks. In this work, we take a closer look at the attention mechanisms of Stable Diffusion, from which we draw connections with classical seeded segmentation approaches. In particular, we show that cross-attention alone provides very coarse object localization, which can nevertheless provide initial seeds. Then, akin to region expansion in seeded segmentation, we utilize the semantic-correspondence-modeling capability of self-attention to iteratively spread the attention to the whole class from the seeds using multi-scale self-attention maps. We also observe that a simple-text-guided synthetic image often has a uniform background, in which correspondences are easier to find than in complex-structured objects. Thus, we further refine a mask using a more accurate background mask. Our proposed method, dubbed SeeDiff, generates high-quality masks off-the-shelf from Stable Diffusion, without additional training procedure, prompt tuning, or a pre-trained segmentation network.
摘要：语义分割网络以像素级对象分类的目标委托，需要在像素级注释掩码的努力准备。为了在没有人类努力的情况下为给定类别获得像素级的注释掩模，最近很少有著作提出通过采用图像和文本关系来生成图像和注释面具对，尤其是稳定的扩散。但是，这些作品并未完全利用文本引导的扩散模型的能力，因此需要预先训练的分割网络，仔细的文本提示调整或训练分割网络以生成最终的注释掩码。在这项工作中，我们仔细研究了稳定扩散的注意机制，从中我们与经典的种子分割方法建立了联系。特别是，我们表明，仅交叉注意就可以提供非常粗糙的物体定位，但是可以提供初始种子。然后，类似于种子分割中的区域的扩展，我们利用自我注意力的语义对应模型能力迭代地将注意力从种子中从种子上传播到整个阶级，并使用多规模的自我发项图。我们还观察到，与复杂的结构化对象相比，简单引导的合成图像通常具有均匀的背景，更容易找到对应关系。因此，我们使用更准确的背景掩码进一步完善面膜。我们提出的称为Seediff的方法可从稳定扩散中产生高质量的面具，而无需其他训练程序，及时调整或预训练的细分网络。
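
Illustrative sketch (not the authors' implementation): the abstract describes taking coarse seeds from cross-attention and then spreading them with self-attention, as in classical seeded region growing. The snippet below shows that idea on a toy set of four tokens. The function name `expand_seeds`, the two thresholds, and the single-scale attention matrix are all assumptions; the paper uses multi-scale self-attention maps and an extra background-mask refinement that are not modeled here.

```python
def expand_seeds(cross_attn, self_attn, seed_thresh=0.8, grow_thresh=0.5, max_iters=10):
    """Grow a token mask from cross-attention seeds.

    cross_attn: one score per token (how strongly it attends to the class word).
    self_attn:  square matrix, self_attn[i][j] = attention of token i to token j.
    A token joins the mask once the attention it pays to already-masked
    tokens reaches grow_thresh.
    """
    n = len(cross_attn)
    mask = [score >= seed_thresh for score in cross_attn]  # initial seeds
    for _ in range(max_iters):
        grown = list(mask)
        for i in range(n):
            if not mask[i]:
                support = sum(self_attn[i][j] for j in range(n) if mask[j])
                grown[i] = support >= grow_thresh
        if grown == mask:  # no token was added, so the region has converged
            break
        mask = grown
    return mask


# Toy example: token 0 is the only seed; tokens 1 and 2 are reached in two
# growth steps, while token 3 (background) mostly attends to itself.
toy_cross = [0.9, 0.3, 0.1, 0.0]
toy_self = [
    [0.7, 0.3, 0.0, 0.0],
    [0.6, 0.4, 0.0, 0.0],
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 0.1, 0.9],
]
toy_mask = expand_seeds(toy_cross, toy_self)
```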

Title: VAE-GAN Based Price Manipulation in Coordinated Local Energy Markets

Authors: Biswarup Mukherjee, Li Zhou, S. Gokul Krishnan, Milad Kabirifar, Subhash Lakshminarayana, Charalambos Konstantinou
Subjects: cs.LG, cs.AI, cs.MA, eess.SY
Abstract URL: https://arxiv.org/abs/2507.19844
Pdf URL: https://arxiv.org/pdf/2507.19844
Copy Paste: [[2507.19844]] VAE-GAN Based Price Manipulation in Coordinated Local Energy Markets(https://arxiv.org/abs/2507.19844)
Keywords: generation, generative
Abstract: This paper introduces a model for coordinating prosumers with heterogeneous distributed energy resources (DERs), participating in the local energy market (LEM) that interacts with the market-clearing entity. The proposed LEM scheme utilizes a data-driven, model-free reinforcement learning approach based on the multi-agent deep deterministic policy gradient (MADDPG) framework, enabling prosumers to make real-time decisions on whether to buy, sell, or refrain from any action while facilitating efficient coordination for optimal energy trading in a dynamic market. In addition, we investigate a price manipulation strategy using a variational autoencoder-generative adversarial network (VAE-GAN) model, which allows utilities to adjust price signals in a way that induces financial losses for the prosumers. Our results show that under adversarial pricing, heterogeneous prosumer groups, particularly those lacking generation capabilities, incur financial losses. The same outcome holds across LEMs of different sizes. As the market size increases, trading stabilizes and fairness improves through emergent cooperation among agents.
摘要：本文介绍了一个模型，用于与异质分布式能源资源（DERS）协调生产商，该模型参与与市场清除实体相互作用的当地能源市场（LEM）。拟议的LEM计划采用了基于多代理深层确定性策略梯度（MADDPG）框架的数据驱动的，无模型的加固学习方法，使生产商能够实时决策是否可以从任何动作中购买，出售还是从任何动作中进行任何动作，同时促进有效的能源交易，以促进在动态市场中最佳的能源交易。此外，我们还使用各种自动编码器生成对抗网络（VAE-GAN）模型进行了价格操纵策略，该模型允许公用事业公司以诱发Possumers的财务损失的方式调整价格信号。我们的结果表明，在对抗定价，异质性生产者团体下，尤其是那些缺乏发电能力的人，会造成财务损失。相同的结果在不同尺寸的小lem中也保持不变。随着市场规模的增加，通过代理商之间的紧急合作，交易稳定和公平就会提高。

Title: FineMotion: A Dataset and Benchmark with both Spatial and Temporal Annotation for Fine-grained Motion Generation and Editing

Authors: Bizhu Wu, Jinheng Xie, Meidan Ding, Zhe Kong, Jianfeng Ren, Ruibin Bai, Rong Qu, Linlin Shen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.19850
Pdf URL: https://arxiv.org/pdf/2507.19850
Copy Paste: [[2507.19850]] FineMotion: A Dataset and Benchmark with both Spatial and Temporal Annotation for Fine-grained Motion Generation and Editing(https://arxiv.org/abs/2507.19850)
Keywords: generation
Abstract: Generating realistic human motions from textual descriptions has undergone significant advancements. However, existing methods often overlook specific body part movements and their timing. In this paper, we address this issue by enriching the textual description with more details. Specifically, we propose the FineMotion dataset, which contains over 442,000 human motion snippets - short segments of human motion sequences - and their corresponding detailed descriptions of human body part movements. Additionally, the dataset includes about 95k detailed paragraphs describing the movements of human body parts of entire motion sequences. Experimental results demonstrate the significance of our dataset on the text-driven fine-grained human motion generation task, especially with a remarkable +15.3% improvement in Top-3 accuracy for the MDM model. Notably, we further support a zero-shot pipeline of fine-grained motion editing, which focuses on detailed editing in both spatial and temporal dimensions via text. Dataset and code available at: CVI-SZU/FineMotion
摘要：从文本描述中产生现实的人类动作已取得了重大进步。但是，现有方法通常忽略特定的身体部位运动及其时间。在本文中，我们通过提供更多详细信息来丰富文本描述来解决此问题。具体而言，我们提出了犯规数据集，其中包含超过442,000个人类运动片段 - 人类运动序列的简短段 - 及其对人体部分运动的相应详细描述。此外，数据集包括大约95k详细的段落，描述了整个运动序列的人体部分运动。实验结果证明了我们的数据集对文本驱动的细化人类运动生成任务的重要性，尤其是MDM模型的TOP-3准确性提高了15.3％。值得注意的是，我们进一步支持了细粒运动编辑的零射管管道，该管道重点介绍了通过文本在空间和时间维度中的详细编辑。数据集和代码可用：CVI-SZU/FILEMOTION

Title: OW-CLIP: Data-Efficient Visual Supervision for Open-World Object Detection via Human-AI Collaboration

Authors: Junwen Duan, Wei Xue, Ziyao Kang, Shixia Liu, Jiazhi Xia
Subjects: cs.CV, cs.HC
Abstract URL: https://arxiv.org/abs/2507.19870
Pdf URL: https://arxiv.org/pdf/2507.19870
Copy Paste: [[2507.19870]] OW-CLIP: Data-Efficient Visual Supervision for Open-World Object Detection via Human-AI Collaboration(https://arxiv.org/abs/2507.19870)
Keywords: generation
Abstract: Open-world object detection (OWOD) extends traditional object detection to identifying both known and unknown objects, necessitating continuous model adaptation as new annotations emerge. Current approaches face significant limitations: 1) data-hungry training due to reliance on a large number of crowdsourced annotations, 2) susceptibility to "partial feature overfitting," and 3) limited flexibility due to required model architecture modifications. To tackle these issues, we present OW-CLIP, a visual analytics system that provides curated data and enables data-efficient OWOD model incremental training. OW-CLIP implements plug-and-play multimodal prompt tuning tailored for OWOD settings and introduces a novel "Crop-Smoothing" technique to mitigate partial feature overfitting. To meet the data requirements for the training methodology, we propose dual-modal data refinement methods that leverage large language models and cross-modal similarity for data generation and filtering. Simultaneously, we develop a visualization interface that enables users to explore and deliver high-quality annotations, including class-specific visual feature phrases and fine-grained differentiated images. Quantitative evaluation demonstrates that OW-CLIP achieves competitive performance at 89% of state-of-the-art performance while requiring only 3.8% self-generated data, and outperforms the SOTA approach when trained with equivalent data volumes. A case study shows the effectiveness of the developed method and the improved annotation quality of our visualization system.
摘要：开放世界对象检测（OWOD）将传统的对象检测扩展到识别已知和未知对象，因此需要连续模型适应作为新的注释。当前的方法面临重大局限性：1）由于依赖大量众包注释而导致数据渴望培训，2）易感性“部分特征过于拟合”； 3）由于所需的模型体系结构修改而导致的灵活性有限。为了解决这些问题，我们提出了OW-CLIP，这是一种视觉分析系统，可提供策划数据并启用数据有效的OWOD模型增量训练。 OW-CLIP实施了针对OWOD设置量身定制的插件多模式提示调整，并引入了一种新颖的“裁剪平滑”技术，以减轻部分特征过于拟合。为了满足培训方法的数据要求，我们提出了双模式数据改进方法，以利用大型语言模型和跨模式相似性来生成数据和过滤。同时，我们开发了一个可视化接口，该界面使用户能够探索和提供高质量的注释：包括特定类的视觉特征短语和细粒度的差异化图像。定量评估表明，OW-CLIP以最先进的性能的89％实现竞争性能，同时仅需要3.8％的自启发数据，同时在接受等效数据量的培训时表现优于SOTA方法。一项案例研究显示了开发方法的有效性以及我们可视化系统的注释质量的提高。

Title: All-in-One Medical Image Restoration with Latent Diffusion-Enhanced Vector-Quantized Codebook Prior

Authors: Haowei Chen, Zhiwen Yang, Haotian Hou, Hui Zhang, Bingzheng Wei, Gang Zhou, Yan Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.19874
Pdf URL: https://arxiv.org/pdf/2507.19874
Copy Paste: [[2507.19874]] All-in-One Medical Image Restoration with Latent Diffusion-Enhanced Vector-Quantized Codebook Prior(https://arxiv.org/abs/2507.19874)
Keywords: restoration, super-resolution
Abstract: All-in-one medical image restoration (MedIR) aims to address multiple MedIR tasks using a unified model, concurrently recovering various high-quality (HQ) medical images (e.g., MRI, CT, and PET) from low-quality (LQ) counterparts. However, all-in-one MedIR presents significant challenges due to the heterogeneity across different tasks. Each task involves distinct degradations, leading to diverse information losses in LQ images. Existing methods struggle to handle these diverse information losses associated with different tasks. To address these challenges, we propose a latent diffusion-enhanced vector-quantized codebook prior and develop DiffCode, a novel framework leveraging this prior for all-in-one MedIR. Specifically, to compensate for diverse information losses associated with different tasks, DiffCode constructs a task-adaptive codebook bank to integrate task-specific HQ prior features across tasks, capturing a comprehensive prior. Furthermore, to enhance prior retrieval from the codebook bank, DiffCode introduces a latent diffusion strategy that utilizes the diffusion model's powerful mapping capabilities to iteratively refine the latent feature distribution, estimating more accurate HQ prior features during restoration. With the help of the task-adaptive codebook bank and latent diffusion strategy, DiffCode achieves superior performance in both quantitative metrics and visual quality across three MedIR tasks: MRI super-resolution, CT denoising, and PET synthesis.
摘要：多合一的医疗图像恢复（MEDIR）旨在使用统一模型来解决多个Medir任务，并同时恢复低质量（LQ）对应的各种高质量（HQ）医疗图像（例如MRI，CT和PET）。但是，由于不同任务的异质性，多合一的Medir提出了重大挑战。每个任务都涉及不同的降解，从而导致LQ图像中的各种信息损失。现有方法难以处理与不同任务相关的这些不同信息损失。为了应对这些挑战，我们提出了一个潜在的扩散增强矢量定量的代码书，并开发了\ textbf {fiffcode}，这是一个新颖的框架，利用了这一点。具体而言，为了补偿与不同任务相关的各种信息损失，DiffCode构建了任务适应的代码簿银行，以整合任务跨任务的特定任务HQ先验功能，从而捕获了全面的先验。此外，为了增强从CodeBook Bank的先前检索，DiffCode引入了一种潜在扩散策略，该策略利用扩散模型的功能强大的映射功能来迭代地完善潜在特征分布，从而估计恢复过程中更准确的HQ先验功能。借助任务自适应代码簿银行和潜在扩散策略，DiffCode在三个MEDIR任务中均可在定量指标和视觉质量方面达到卓越的性能：MRI超分辨率，CT DeNoisis和PET合成。
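
Illustrative sketch (not from the paper): a vector-quantized codebook prior is retrieved by replacing a feature vector with its nearest codebook entry. The snippet below shows that lookup against a per-task bank, which is one plausible reading of "task-adaptive codebook bank". The names `nearest_code`, `lookup_prior`, the task keys, and the tiny 2-D codebooks are hypothetical, and the latent diffusion refinement described in the abstract is not modeled.

```python
def nearest_code(feature, codebook):
    """Return (index, code) of the codebook entry closest to `feature`
    in squared Euclidean distance (standard vector quantization)."""
    def sq_dist(code):
        return sum((f - c) ** 2 for f, c in zip(feature, code))

    index = min(range(len(codebook)), key=lambda k: sq_dist(codebook[k]))
    return index, codebook[index]


def lookup_prior(bank, task, feature):
    """Pick the codebook for `task` from the bank, then quantize `feature`."""
    return nearest_code(feature, bank[task])


# Hypothetical bank with one small codebook per restoration task.
toy_bank = {
    "mri_sr": [[0.0, 0.0], [1.0, 1.0]],
    "ct_denoise": [[5.0, 5.0], [-1.0, 0.0]],
}
toy_index, toy_code = lookup_prior(toy_bank, "mri_sr", [0.9, 0.8])
```

The same low-quality feature can map to different prior vectors depending on the task key, which is the point of keeping separate codebooks.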

Title: A Survey on Generative Model Unlearning: Fundamentals, Taxonomy, Evaluation, and Future Direction

Authors: Xiaohua Feng, Jiaming Zhang, Fengyuan Yu, Chengye Wang, Li Zhang, Kaixiang Li, Yuyuan Li, Chaochao Chen, Jianwei Yin
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.19894
Pdf URL: https://arxiv.org/pdf/2507.19894
Copy Paste: [[2507.19894]] A Survey on Generative Model Unlearning: Fundamentals, Taxonomy, Evaluation, and Future Direction(https://arxiv.org/abs/2507.19894)
Keywords: generation, generative
Abstract: With the rapid advancement of generative models, associated privacy concerns have attracted growing attention. To address this, researchers have begun adapting machine unlearning techniques from traditional classification models to generative settings. Although notable progress has been made in this area, a unified framework for systematically organizing and integrating existing work is still lacking. The substantial differences among current studies in terms of unlearning objectives and evaluation protocols hinder the objective and fair comparison of various approaches. While some studies focus on specific types of generative models, they often overlook the commonalities and systematic characteristics inherent in Generative Model Unlearning (GenMU). To bridge this gap, we provide a comprehensive review of current research on GenMU and propose a unified analytical framework for categorizing unlearning objectives, methodological strategies, and evaluation metrics. In addition, we explore the connections between GenMU and related techniques, including model editing, reinforcement learning from human feedback, and controllable generation. We further highlight the potential practical value of unlearning techniques in real-world applications. Finally, we identify key challenges and outline future research directions aimed at laying a solid foundation for further advancements in this field. We consistently maintain the related open-source materials at this https URL.
摘要：随着生成模型的快速发展，相关的隐私问题引起了人们日益增长的关注。为了解决这个问题，研究人员已经开始适应从传统分类模型到生成环境的化学技术。尽管在这一领域取得了显着的进展，但仍缺乏系统地组织和整合现有工作的统一框架。目前的研究在学习目标和评估方案方面存在实质性差异，阻碍了各种方法的客观和公平比较。尽管一些研究集中于特定类型的生成模型，但它们通常忽略生成模型固有的共同点和系统特征（GenMU）。为了弥合这一差距，我们对当前对GENMU的研究进行了全面的综述，并提出了一个统一的分析框架，以对未学习目标，方法论策略和评估指标进行分类。此外，我们还探讨了Genmu与相关技术之间的联系，包括模型编辑，从人类反馈中学习的强化学习以及可控的生成。我们进一步强调了在现实世界应用中未学习技术的潜在实用价值。最后，我们确定了关键的挑战，并概述了未来的研究方向，旨在为该领域的进一步发展奠定坚实的基础。我们始终在此HTTPS URL上维护相关的开源材料。

Title: HumanSAM: Classifying Human-centric Forgery Videos in Human Spatial, Appearance, and Motion Anomaly

Authors: Chang Liu, Yunfan Ye, Fan Zhang, Qingyang Zhou, Yuchuan Luo, Zhiping Cai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.19924
Pdf URL: https://arxiv.org/pdf/2507.19924
Copy Paste: [[2507.19924]] HumanSAM: Classifying Human-centric Forgery Videos in Human Spatial, Appearance, and Motion Anomaly(https://arxiv.org/abs/2507.19924)
Keywords: generation, generative
Abstract: Numerous synthesized videos from generative models, especially human-centric ones that simulate realistic human actions, pose significant threats to human information security and authenticity. While progress has been made in binary forgery video detection, the lack of fine-grained understanding of forgery types raises concerns regarding both reliability and interpretability, which are critical for real-world applications. To address this limitation, we propose HumanSAM, a new framework that builds upon the fundamental challenges of video generation models. Specifically, HumanSAM aims to classify human-centric forgeries into three distinct types of artifacts commonly observed in generated content: spatial, appearance, and motion anomaly. To better capture the features of geometry, semantics and spatiotemporal consistency, we propose to generate the human forgery representation by fusing two branches of video understanding and spatial depth. We also adopt a rank-based confidence enhancement strategy during the training process to learn more robust representation by introducing three prior scores. For training and evaluation, we construct the first public benchmark, the Human-centric Forgery Video (HFV) dataset, with all types of forgeries carefully annotated semi-automatically. In our experiments, HumanSAM yields promising results in comparison with state-of-the-art methods, both in binary and multi-class forgery classification.
摘要：来自生成模型的综合视频，尤其是以人类为中心的人类行为，对人类信息安全和真实性构成重大威胁。尽管在二元伪造视频检测中取得了进展，但对伪造类型的缺乏细粒度的了解引起了人们对可靠性和可解释性的担忧，这对于现实世界中的应用至关重要。为了解决这一限制，我们提出了Humansam，这是一个基于视频生成模型的基本挑战的新框架。具体而言，Humansam旨在将以人为中心的伪造为中心分类为通常在产生的内容中观察到的三种不同类型的人工制品：空间，外观和运动，该HTTP URL可以更好地捕获几何，语义，语义和时空的一致性的特征，我们建议通过融合两个视频理解的人类交易来形象，以产生人类的交易代表。在培训过程中，我们还采用了基于等级的置信度增强策略，以通过引入三个先前的分数来学习更强大的表示。为了进行培训和评估，我们构建了第一个公共基准，即以人为中心的伪造视频（HFV）数据集，所有类型的伪造都仔细地注释半自动。在我们的实验中，与二进制和多类伪造分类相比，人类与最先进的方法相比产生了有希望的结果。

Title: MambaVesselNet++: A Hybrid CNN-Mamba Architecture for Medical Image Segmentation

Authors: Qing Xu, Yanming Chen, Yue Li, Ziyu Liu, Zhenye Lou, Yixuan Zhang, Xiangjian He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.19931
Pdf URL: https://arxiv.org/pdf/2507.19931
Copy Paste: [[2507.19931]] MambaVesselNet++: A Hybrid CNN-Mamba Architecture for Medical Image Segmentation(https://arxiv.org/abs/2507.19931)
Keywords: generation
Abstract: Medical image segmentation plays an important role in computer-aided diagnosis. Traditional convolution-based U-shape segmentation architectures are usually limited by the local receptive field. Existing vision transformers have been widely applied to diverse medical segmentation frameworks due to their superior capabilities of capturing global contexts. Despite the advantage, the real-world application of vision transformers is challenged by their non-linear self-attention mechanism, requiring huge computational costs. To address this issue, the selective state space model (SSM) Mamba has gained recognition for its adeptness in modeling long-range dependencies in sequential data, particularly noted for its efficient memory costs. In this paper, we propose MambaVesselNet++, a hybrid CNN-Mamba framework for medical image segmentation. Our MambaVesselNet++ is comprised of a hybrid image encoder (Hi-Encoder) and a bifocal fusion decoder (BF-Decoder). In Hi-Encoder, we first devise the texture-aware layer to capture low-level semantic features by leveraging convolutions. Then, we utilize Mamba to effectively model long-range dependencies with linear complexity. The BF-Decoder adopts skip connections to combine local and global information of the Hi-Encoder for the accurate generation of segmentation masks. Extensive experiments demonstrate that MambaVesselNet++ outperforms current convolution-based, transformer-based, and Mamba-based state-of-the-art methods across diverse medical 2D, 3D, and instance segmentation tasks. The code is available at this https URL.
摘要：医疗图像分割在计算机辅助诊断中起重要作用。传统的基于卷积的U形分割体系结构通常受到当地接受场的限制。由于它们占据全球环境的卓越能力，现有的视觉变压器已被广泛应用于多样化的医疗细分框架。尽管有优势，但视觉变形金刚的现实应用是其非线性自我注意力发项机制的挑战，需要巨大的计算成本。为了解决这个问题，选择性状态空间模型（SSM）Mamba因其在顺序数据中的长期依赖性建模而获得认可，尤其是由于其有效的内存成本而引起的。在本文中，我们提出了Mambavesselnet ++，这是一种用于医学图像分割的混合CNN-MAMBA框架。我们的Mambavesselnet ++由混合图像编码器（HI-编码器）和双焦点融合解码器（BF-Decoder）组成。在Hi-编码器中，我们首先设计了纹理感知的层，以通过利用卷积来捕获低级语义特征。然后，我们利用曼巴（Mamba）有效地对远程依赖性进行线性复杂性建模。 BiDecoder采用跳过连接，以结合HI-编码器的本地和全局信息，以准确地生成分割面罩。广泛的实验表明，Mambavesselnet ++在各种医学2D，3D和实例分段任务中都优于基于卷积的当前基于卷积的，基于变压器的最先进。该代码可在此HTTPS URL上找到。

Title: LLMControl: Grounded Control of Text-to-Image Diffusion-based Synthesis with Multimodal LLMs

Authors: Jiaze Wang, Rui Chen, Haowang Cui
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.19939
Pdf URL: https://arxiv.org/pdf/2507.19939
Copy Paste: [[2507.19939]] LLMControl: Grounded Control of Text-to-Image Diffusion-based Synthesis with Multimodal LLMs(https://arxiv.org/abs/2507.19939)
Keywords: generation
Abstract: Recent spatial control methods for text-to-image (T2I) diffusion models have shown compelling results. However, these methods still fail to precisely follow the control conditions and generate the corresponding images, especially when encountering textual prompts that contain multiple objects or have complex spatial compositions. In this work, we present an LLM-guided framework called LLM_Control to address the challenges of the controllable T2I generation task. By improving grounding capabilities, LLM_Control is introduced to accurately modulate the pre-trained diffusion models, where visual conditions and textual prompts influence the structures and appearance generation in a complementary way. We utilize the multimodal LLM as a global controller to arrange spatial layouts, augment semantic descriptions and bind object attributes. The obtained control signals are injected into the denoising network to refocus and enhance attention maps according to novel sampling constraints. Extensive qualitative and quantitative experiments have demonstrated that LLM_Control achieves competitive synthesis quality compared to other state-of-the-art methods across various pre-trained T2I models. It is noteworthy that LLM_Control allows the challenging input conditions on which most of the existing methods
摘要：文本对图像（T2I）扩散模型的最新空间控制方法已显示出令人信服的结果。但是，这些方法仍然无法精确遵循控制条件并生成相应的图像，尤其是在遇到包含多个对象或具有复杂空间组成的文本提示时。在这项工作中，我们提出了一个名为LLM \ _Control的LLM指导框架，以应对可控T2I生成任务的挑战。通过提高接地功能，引入了LLM \ _Control以准确调节预训练的扩散模型，其中视觉条件和文本提示以互补的方式影响结构和外观生成。我们利用多模式LLM作为全局控制器来安排空间布局，增强语义描述并绑定对象属性。根据新的采样约束，将获得的对照信号注入了去核网络，以重新集中并增强注意力图。广泛的定性和定量实验表明，与各种预训练的T2I模型的其他最新方法相比，LLM \ _Control可实现竞争性合成质量。值得注意的是，llm \ _control允许大多数现有方法的具有挑战性的输入条件

Title: SCALAR: Scale-wise Controllable Visual Autoregressive Learning

Authors: Ryan Xu, Dongyang Jin, Yancheng Bai, Rui Lan, Xu Duan, Lei Sun, Xiangxiang Chu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.19946
Pdf URL: https://arxiv.org/pdf/2507.19946
Copy Paste: [[2507.19946]] SCALAR: Scale-wise Controllable Visual Autoregressive Learning(https://arxiv.org/abs/2507.19946)
Keywords: generation, generative
Abstract: Controllable image synthesis, which enables fine-grained control over generated outputs, has emerged as a key focus in visual generative modeling. However, controllable generation remains challenging for Visual Autoregressive (VAR) models due to their hierarchical, next-scale prediction style. Existing VAR-based methods often suffer from inefficient control encoding and disruptive injection mechanisms that compromise both fidelity and efficiency. In this work, we present SCALAR, a controllable generation method based on VAR, incorporating a novel Scale-wise Conditional Decoding mechanism. SCALAR leverages a
摘要：可控的图像合成，可以对生成的输出进行细粒度的控制，已成为视觉生成建模的关键重点。但是，由于其层次结构的隔壁预测样式，可控的生成对于视觉自回归（VAR）模型仍然具有挑战性。现有的基于VAR的方法通常会遭受损害保真度和效率的损害编码和破坏性注入机制的效率低下。在这项工作中，我们提出了标量，这是一种基于VAR的可控生成方法，结合了新颖的规模条件解码机制。标量利用a

Title: FROSS: Faster-than-Real-Time Online 3D Semantic Scene Graph Generation from RGB-D Images

Authors: Hao-Yu Hou, Chun-Yi Lee, Motoharu Sonogashira, Yasutomo Kawanishi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.19993
Pdf URL: https://arxiv.org/pdf/2507.19993
Copy Paste: [[2507.19993]] FROSS: Faster-than-Real-Time Online 3D Semantic Scene Graph Generation from RGB-D Images(https://arxiv.org/abs/2507.19993)
Keywords: generation
Abstract: The ability to abstract complex 3D environments into simplified and structured representations is crucial across various domains. 3D semantic scene graphs (SSGs) achieve this by representing objects as nodes and their interrelationships as edges, facilitating high-level scene understanding. Existing methods for 3D SSG generation, however, face significant challenges, including high computational demands and non-incremental processing that hinder their suitability for real-time open-world applications. To address this issue, we propose FROSS (Faster-than-Real-Time Online 3D Semantic Scene Graph Generation), an innovative approach for online and faster-than-real-time 3D SSG generation that leverages the direct lifting of 2D scene graphs to 3D space and represents objects as 3D Gaussian distributions. This framework eliminates the dependency on precise and computationally-intensive point cloud processing. Furthermore, we extend the Replica dataset with inter-object relationship annotations, creating the ReplicaSSG dataset for comprehensive evaluation of FROSS. The experimental results from evaluations on ReplicaSSG and 3DSSG datasets show that FROSS can achieve superior performance while operating significantly faster than prior 3D SSG generation methods. Our implementation and dataset are publicly available at this https URL.
摘要：在各个领域中，将复杂3D环境复合到简化和结构化表示的能力至关重要。 3D语义场景图（SSG）通过将对象表示为节点及其相互关系作为边缘来实现这一目标，从而促进了高级场景的理解。但是，3D SSG生成的现有方法面临重大挑战，包括高计算需求和非申请处理，这阻碍了它们对实时开放世界应用的适用性。为了解决这个问题，我们提出了Fross（比实时更快的在线3D语义场景图生成），这是一种创新的方法，用于在线和更快的3D SSG生成，利用2D场景图直接升至3D空间，并将对象表示为3D Gaussian分布。该框架消除了对精确和计算密集的点云处理的依赖性。此外，我们将复制数据集扩展到具有对象间关系注释的复制数据集，从而创建了复制数据集，以全面评估Fross。对ReplicASSG和3DSSG数据集的评估的实验结果表明，Fross可以取得卓越的性能，而工作速度明显快于先前的3D SSG生成方法。我们的实施和数据集可在此HTTPS URL上公开获得。

Title: PERRY: Policy Evaluation with Confidence Intervals using Auxiliary Data

Authors: Aishwarya Mandyam, Jason Meng, Ge Gao, Jiankai Sun, Mac Schwager, Barbara E. Engelhardt, Emma Brunskill
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2507.20068
Pdf URL: https://arxiv.org/pdf/2507.20068
Copy Paste: [[2507.20068]] PERRY: Policy Evaluation with Confidence Intervals using Auxiliary Data(https://arxiv.org/abs/2507.20068)
Keywords: generative
Abstract: Off-policy evaluation (OPE) methods aim to estimate the value of a new reinforcement learning (RL) policy prior to deployment. Recent advances have shown that leveraging auxiliary datasets, such as those synthesized by generative models, can improve the accuracy of these value estimates. Unfortunately, such auxiliary datasets may also be biased, and existing methods for using data augmentation for OPE in RL lack principled uncertainty quantification. In high-stakes settings like healthcare, reliable uncertainty estimates are important for comparing policy value estimates. In this work, we propose two approaches to construct valid confidence intervals for OPE when using data augmentation. The first provides a confidence interval over the policy performance conditioned on a particular initial state, $V^{\pi}(s_0)$; such intervals are particularly important for human-centered applications. To do so we introduce a new conformal prediction method for high-dimensional state MDPs. Second, we consider the more common task of estimating the average policy performance over many initial states; to do so we draw on ideas from doubly robust estimation and prediction-powered inference. Across simulators spanning robotics, healthcare and inventory management, and a real healthcare dataset from MIMIC-IV, we find that our methods can use augmented data and still consistently produce intervals that cover the ground truth values, unlike previously proposed methods.
摘要：非政策评估（OPE）方法旨在在部署前估计新的强化学习（RL）策略的价值。最近的进步表明，利用辅助数据集（例如由生成模型合成的数据集）可以提高这些价值估计的准确性。不幸的是，此类辅助数据集也可能是偏差的，并且现有的使用数据增强的方法在RL缺乏原则上的不确定性量化。在医疗保健等高风险设置中，可靠的不确定性估计对于比较政策价值估计值很重要。在这项工作中，我们提出了两种方法，用于在使用数据增强时为OPE构建有效的置信区间。第一个在特定初始状态$ v^{\ pi}（s_0）$的条件下提供了置信区间 - 此类间隔对于以人为本的应用程序尤为重要。为此，我们为高维状态MDP引入了一种新的保形预测方法。其次，我们考虑了估计许多初始状态的平均政策绩效的更常见任务。为此，我们从双重强大的估计和预测推理中借鉴了思想。在跨越机器人技术，医疗保健和库存管理以及来自MIMIC-IV的实际医疗保健数据集的模拟器中，我们发现我们的方法可以使用增强数据，并且仍然始终如一地产生涵盖地面真实价值的间隔，这与先前建议的方法不同。
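
Illustrative sketch (standard split conformal prediction, not the paper's new method): the abstract builds on conformal prediction to get an interval around a value estimate for a given initial state. The snippet below shows the textbook split-conformal recipe on scalar predictions, assuming a held-out calibration set of (predicted value, observed return) pairs. The function name and the toy numbers are hypothetical, and the handling of biased augmented data that the paper addresses is not modeled.

```python
import math


def split_conformal_interval(cal_preds, cal_targets, new_pred, alpha=0.1):
    """Return a (lower, upper) interval around `new_pred` with nominal
    coverage 1 - alpha, using absolute residuals on a calibration set."""
    scores = sorted(abs(y - p) for p, y in zip(cal_preds, cal_targets))
    n = len(scores)
    # Finite-sample corrected rank of the residual quantile.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: only the trivial interval is valid.
        return (-math.inf, math.inf)
    q = scores[k - 1]
    return (new_pred - q, new_pred + q)


# Toy calibration set: predictions of 0.0 with residuals 1.0 through 9.0.
toy_preds = [0.0] * 9
toy_targets = [float(r) for r in range(1, 10)]
toy_interval = split_conformal_interval(toy_preds, toy_targets, new_pred=10.0, alpha=0.5)
```

The coverage guarantee of this recipe relies on calibration and test points being exchangeable, which is exactly what breaks when auxiliary data is biased.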

Title: The Devil is in the EOS: Sequence Training for Detailed Image Captioning

Authors: Abdelrahman Mohamed, Yova Kementchedjhieva
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2507.20077
Pdf URL: https://arxiv.org/pdf/2507.20077
Copy Paste: [[2507.20077]] The Devil is in the EOS: Sequence Training for Detailed Image Captioning(https://arxiv.org/abs/2507.20077)
Keywords: generation
Abstract: Despite significant advances in vision-language models (VLMs), image captioning often suffers from a lack of detail, with base models producing short, generic captions. This limitation persists even though VLMs are equipped with strong vision and language backbones. While supervised data and complex reward functions have been proposed to improve detailed image captioning, we identify a simpler underlying issue: a bias towards the end-of-sequence (EOS) token, which is introduced during cross-entropy training. We propose an unsupervised method to debias the model's tendency to predict the EOS token prematurely. By reducing this bias, we encourage the generation of longer, more detailed captions without the need for intricate reward functions or supervision. Our approach is straightforward, effective, and easily applicable to any pretrained model. We demonstrate its effectiveness through experiments with three VLMs and on three detailed captioning benchmarks. Our results show a substantial increase in caption length and relevant details, albeit with an expected increase in the rate of hallucinations.
摘要：尽管视觉模型（VLM）取得了重大进展，但图像字幕通常遭受缺乏细节的影响，基本模型会产生简短的通用字幕。即使VLM配备了强大的视力和语言骨干，这种限制仍然存在。尽管已经提出了有监督的数据和复杂的奖励功能来改善详细的图像字幕，但我们确定了一个简单的根本问题：偏向于跨境培训期间引入的序列序列（EOS）令牌。我们提出了一种无监督的方法来使模型过早预测EOS代币的趋势。通过减少这种偏见，我们鼓励产生更长，更详细的标题，而无需复杂的奖励功能或监督。我们的方法直接，有效且易于适用于任何审核的模型。我们通过使用三个VLM和三个详细的字幕基准进行实验来证明其有效性。我们的结果表明，标题长度和相关细节的大幅度增加，尽管幻觉率预计会增加。
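
Illustrative sketch (a swapped-in technique, not the paper's method): the abstract attributes short captions to a bias toward the end-of-sequence token and removes it through training. As a simpler stand-in that shows the effect of such a bias, the snippet below applies a decoding-time penalty to the EOS logit before the softmax. The function names, the penalty value, and the three-token vocabulary are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def penalize_eos(logits, eos_id, penalty=2.5):
    """Return a copy of `logits` with `penalty` subtracted from the EOS entry,
    making early termination less likely at this decoding step."""
    adjusted = list(logits)
    adjusted[eos_id] -= penalty
    return adjusted


# Toy step: token 2 is EOS and would win greedy decoding before the penalty.
toy_logits = [1.0, 1.0, 3.0]
before = softmax(toy_logits)
after = softmax(penalize_eos(toy_logits, eos_id=2))
```

Longer outputs obtained this way come with the same caveat the abstract notes for its own approach: more generated detail can mean more hallucination.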

Title: KB-DMGen: Knowledge-Based Global Guidance and Dynamic Pose Masking for Human Image Generation

Authors: Shibang Liu, Xuemei Xie, Guangming Shi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.20083
Pdf URL: https://arxiv.org/pdf/2507.20083
Copy Paste: [[2507.20083]] KB-DMGen: Knowledge-Based Global Guidance and Dynamic Pose Masking for Human Image Generation(https://arxiv.org/abs/2507.20083)
Keywords: generation
Abstract: Recent methods using diffusion models have made significant progress in human image generation with various control signals such as pose priors. In portrait generation, both the accuracy of human pose and the overall visual quality are crucial for realistic synthesis. Most existing methods focus on controlling the accuracy of generated poses, but ignore the quality assurance of the entire image. In order to ensure the global image quality and pose accuracy, we propose Knowledge-Based Global Guidance and Dynamic pose Masking for human image Generation (KB-DMGen). The Knowledge Base (KB) is designed not only to enhance pose accuracy but also to leverage image feature information to maintain overall image quality. Dynamic Masking (DM) dynamically adjusts the importance of pose-related regions. Experiments demonstrate the effectiveness of our model, achieving new state-of-the-art results in terms of AP and CAP on the HumanArt dataset. The code will be made publicly available.
Summary: KB-DMGen targets both pose accuracy and overall image quality in diffusion-based human image generation. A Knowledge Base (KB) improves pose accuracy while using image feature information to preserve global quality, and Dynamic Masking (DM) adjusts the importance of pose-related regions. The model reports new state-of-the-art AP and CAP on the HumanArt dataset; the code is to be released.

Title: Local Prompt Adaptation for Style-Consistent Multi-Object Generation in Diffusion Models

Authors: Ankit Sanjyal
Subjects: cs.CV, cs.AI, cs.MA
Abstract URL: https://arxiv.org/abs/2507.20094
Pdf URL: https://arxiv.org/pdf/2507.20094
Copy Paste: [[2507.20094]] Local Prompt Adaptation for Style-Consistent Multi-Object Generation in Diffusion Models(https://arxiv.org/abs/2507.20094)
Keywords: generation
Abstract: Diffusion models have become a powerful backbone for text-to-image generation, enabling users to synthesize high-quality visuals from natural language prompts. However, they often struggle with complex prompts involving multiple objects and global or local style specifications. In such cases, the generated scenes tend to lack style uniformity and spatial coherence, limiting their utility in creative and controllable content generation. In this paper, we propose a simple, training-free architectural method called Local Prompt Adaptation (LPA). Our method decomposes the prompt into content and style tokens, and injects them selectively into the U-Net's attention layers at different stages. By conditioning object tokens early and style tokens later in the generation process, LPA enhances both layout control and stylistic consistency. We evaluate our method on a custom benchmark of 50 style-rich prompts across five categories and compare against strong baselines including Composer, MultiDiffusion, Attend-and-Excite, LoRA, and SDXL. Our approach outperforms prior work on both CLIP score and style consistency metrics, offering a new direction for controllable, expressive diffusion-based generation.
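Summary: Text-to-image diffusion models often lose style uniformity and spatial coherence on prompts with multiple objects and style specifications. Local Prompt Adaptation (LPA) is a training-free method that splits the prompt into content and style tokens and injects them into the U-Net's attention layers at different stages, conditioning object tokens early and style tokens later. On a custom benchmark of 50 style-rich prompts in five categories, it outperforms Composer, MultiDiffusion, Attend-and-Excite, LoRA, and SDXL on CLIP score and style consistency.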

Title: Generative molecule evolution using 3D pharmacophore for efficient Structure-Based Drug Design

Authors: Yi He, Ailun Wang, Zhi Wang, Yu Liu, Xingyuan Xu, Wen Yan
Subjects: cs.LG, q-bio.BM
Abstract URL: https://arxiv.org/abs/2507.20130
Pdf URL: https://arxiv.org/pdf/2507.20130
Copy Paste: [[2507.20130]] Generative molecule evolution using 3D pharmacophore for efficient Structure-Based Drug Design(https://arxiv.org/abs/2507.20130)
Keywords: generation, generative
Abstract: Recent advances in generative models, particularly diffusion and auto-regressive models, have revolutionized fields like computer vision and natural language processing. However, their application to structure-based drug design (SBDD) remains limited due to critical data constraints. To address the limitation of training data for models targeting SBDD tasks, we propose an evolutionary framework named MEVO, which bridges the gap between billion-scale small-molecule datasets and scarce protein-ligand complex datasets, and effectively increases the abundance of training data for generative SBDD models. MEVO is composed of three key components: a high-fidelity VQ-VAE for molecule representation in latent space, a diffusion model for pharmacophore-guided molecule generation, and a pocket-aware evolutionary strategy for molecule optimization with a physics-based scoring function. This framework efficiently generates high-affinity binders for various protein targets, validated with predicted binding affinities using free energy perturbation (FEP) methods. In addition, we showcase the capability of MEVO in designing potent inhibitors of KRAS$^{\textrm{G12D}}$, a challenging target in cancer therapeutics, with affinity similar to that of a known highly active inhibitor as evaluated by FEP calculations. With high versatility and generalizability, MEVO offers an effective and data-efficient model for various tasks in structure-based ligand design.
Summary: MEVO is an evolutionary framework that addresses the shortage of protein-ligand training data for structure-based drug design (SBDD) by drawing on billion-scale small-molecule data. It combines a high-fidelity VQ-VAE for latent molecule representation, a diffusion model for pharmacophore-guided generation, and a pocket-aware evolutionary strategy with a physics-based scoring function. Generated binders are validated by free energy perturbation (FEP), including inhibitors of KRAS$^{\textrm{G12D}}$ with affinity similar to a known highly active inhibitor.

Title: AnimeColor: Reference-based Animation Colorization with Diffusion Transformers

Authors: Yuhong Zhang, Liyao Wang, Han Wang, Danni Wu, Zuzeng Lin, Feng Wang, Li Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.20158
Pdf URL: https://arxiv.org/pdf/2507.20158
Copy Paste: [[2507.20158]] AnimeColor: Reference-based Animation Colorization with Diffusion Transformers(https://arxiv.org/abs/2507.20158)
Keywords: generation
Abstract: Animation colorization plays a vital role in animation production, yet existing methods struggle to achieve color accuracy and temporal consistency. To address these challenges, we propose AnimeColor, a novel reference-based animation colorization framework leveraging Diffusion Transformers (DiT). Our approach integrates sketch sequences into a DiT-based video diffusion model, enabling sketch-controlled animation generation. We introduce two key components: a High-level Color Extractor (HCE) to capture semantic color information and a Low-level Color Guider (LCG) to extract fine-grained color details from reference images. These components work synergistically to guide the video diffusion process. Additionally, we employ a multi-stage training strategy to maximize the utilization of reference image color information. Extensive experiments demonstrate that AnimeColor outperforms existing methods in color accuracy, sketch alignment, temporal consistency, and visual quality. Our framework not only advances the state of the art in animation colorization but also provides a practical solution for industrial applications. The code will be made publicly available at this https URL.
Summary: AnimeColor is a reference-based animation colorization framework built on Diffusion Transformers (DiT). Sketch sequences are fed into a DiT-based video diffusion model, while a High-level Color Extractor (HCE) captures semantic color information and a Low-level Color Guider (LCG) extracts fine-grained color details from reference images. With a multi-stage training strategy, it outperforms existing methods in color accuracy, sketch alignment, temporal consistency, and visual quality.

Title: Player-Centric Multimodal Prompt Generation for Large Language Model Based Identity-Aware Basketball Video Captioning

Authors: Zeyu Xi, Haoying Sun, Yaofei Wu, Junchi Yan, Haoran Zhang, Lifang Wu, Liang Wang, Changwen Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.20163
Pdf URL: https://arxiv.org/pdf/2507.20163
Copy Paste: [[2507.20163]] Player-Centric Multimodal Prompt Generation for Large Language Model Based Identity-Aware Basketball Video Captioning(https://arxiv.org/abs/2507.20163)
Keywords: generation
Abstract: Existing sports video captioning methods often focus on the action yet overlook player identities, limiting their applicability. Although some methods integrate extra information to generate identity-aware descriptions, the player identities are sometimes incorrect because the extra information is independent of the video content. This paper proposes a player-centric multimodal prompt generation network for identity-aware sports video captioning (LLM-IAVC), which focuses on recognizing player identities from a visual perspective. Specifically, an identity-related information extraction module (IRIEM) is designed to extract player-related multimodal embeddings. IRIEM includes a player identification network (PIN) for extracting visual features and player names, and a bidirectional semantic interaction module (BSIM) to link player features with video content for mutual enhancement. Additionally, a visual context learning module (VCLM) is designed to capture the key video context information. Finally, by integrating the outputs of the above modules as the multimodal prompt for the large language model (LLM), it facilitates the generation of descriptions with player identities. To support this work, we construct a new benchmark called NBA-Identity, a large identity-aware basketball video captioning dataset with 9,726 videos covering 9 major event types. The experimental results on NBA-Identity and VC-NBA-2022 demonstrate that our proposed model achieves advanced performance. Code and dataset are publicly available at this https URL.
Summary: LLM-IAVC is a player-centric multimodal prompt generation network for identity-aware sports video captioning that recognizes player identities from visual content. Its identity-related information extraction module (IRIEM) combines a player identification network (PIN) with a bidirectional semantic interaction module (BSIM), and a visual context learning module (VCLM) captures key video context; the outputs form the multimodal prompt for an LLM. The authors also release NBA-Identity, a dataset of 9,726 videos covering 9 major event types, and report results on NBA-Identity and VC-NBA-2022.

Title: PUMPS: Skeleton-Agnostic Point-based Universal Motion Pre-Training for Synthesis in Human Motion Tasks

Authors: Clinton Ansun Mo, Kun Hu, Chengjiang Long, Dong Yuan, Wan-Chi Siu, Zhiyong Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.20170
Pdf URL: https://arxiv.org/pdf/2507.20170
Copy Paste: [[2507.20170]] PUMPS: Skeleton-Agnostic Point-based Universal Motion Pre-Training for Synthesis in Human Motion Tasks(https://arxiv.org/abs/2507.20170)
Keywords: generation
Abstract: Motion skeletons drive 3D character animation by transforming bone hierarchies, but differences in proportions or structure make motion data hard to transfer across skeletons, posing challenges for data-driven motion synthesis. Temporal Point Clouds (TPCs) offer an unstructured, cross-compatible motion representation. Though reversible with skeletons, TPCs mainly serve for compatibility, not for direct motion task learning. Doing so would require data synthesis capabilities for the TPC format, which presents unexplored challenges regarding its unique temporal consistency and point identifiability. Therefore, we propose PUMPS, the primordial autoencoder architecture for TPC data. PUMPS independently reduces frame-wise point clouds into sampleable feature vectors, from which a decoder extracts distinct temporal points using latent Gaussian noise vectors as sampling identifiers. We introduce linear assignment-based point pairing to optimise the TPC reconstruction process, and negate the use of expensive point-wise attention mechanisms in the architecture. Using these latent features, we pre-train a motion synthesis model capable of performing motion prediction, transition generation, and keyframe interpolation. For these pre-training tasks, PUMPS performs remarkably well even without native dataset supervision, matching state-of-the-art performance. When fine-tuned for motion denoising or estimation, PUMPS outperforms many respective methods without deviating from its generalist architecture.
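Summary: Temporal Point Clouds (TPCs) are an unstructured, skeleton-independent motion representation, but synthesizing TPC data raises open problems of temporal consistency and point identifiability. PUMPS is an autoencoder for TPC data: it reduces frame-wise point clouds to sampleable feature vectors, and a decoder extracts distinct temporal points using latent Gaussian noise vectors as sampling identifiers. Linear assignment-based point pairing optimises reconstruction without point-wise attention. A model pre-trained on these features handles motion prediction, transition generation, and keyframe interpolation at state-of-the-art level, and fine-tuning extends it to motion denoising and estimation.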
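The linear assignment-based point pairing mentioned in the abstract can be illustrated with a small sketch: match each predicted point to a distinct target point so that the total squared distance is minimal. The brute-force search, the function names, and the toy coordinates are assumptions made for illustration; the abstract does not describe the authors' cost function or solver. For realistic point counts a polynomial-time solver such as `scipy.optimize.linear_sum_assignment` would replace the permutation loop.

```python
from itertools import permutations


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def pair_points(predicted, target):
    """Find the one-to-one pairing that minimises total squared distance.

    Returns (assignment, cost), where assignment[i] is the index of the
    target point paired with predicted[i]. Brute force: small inputs only.
    """
    if len(predicted) != len(target):
        raise ValueError("point sets must have the same size")
    n = len(predicted)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        cost = sum(squared_distance(predicted[i], target[perm[i]]) for i in range(n))
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return list(best_perm), best_cost


# Hypothetical decoded points and ground-truth points for one frame.
predicted = [(0.9, 0.1, 0.0), (0.0, 1.1, 0.0), (0.1, 0.0, 1.0)]
target = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]

assignment, cost = pair_points(predicted, target)
print(assignment, round(cost, 4))
```

Once the pairing is fixed, a reconstruction loss can be computed point by point even though the decoder emits points in no particular order.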

Title: LRR-Bench: Left, Right or Rotate? Vision-Language models Still Struggle With Spatial Understanding Tasks

Authors: Fei Kong, Jinhao Duan, Kaidi Xu, Zhenhua Guo, Xiaofeng Zhu, Xiaoshuang Shi
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.20174
Pdf URL: https://arxiv.org/pdf/2507.20174
Copy Paste: [[2507.20174]] LRR-Bench: Left, Right or Rotate? Vision-Language models Still Struggle With Spatial Understanding Tasks(https://arxiv.org/abs/2507.20174)
Keywords: generation
Abstract: Real-world applications, such as autonomous driving and humanoid robot manipulation, require precise spatial perception. However, it remains underexplored how Vision-Language Models (VLMs) recognize spatial relationships and perceive spatial movement. In this work, we introduce a spatial evaluation pipeline and construct a corresponding benchmark. Specifically, we categorize spatial understanding into two main types: absolute spatial understanding, which involves querying the absolute spatial position (e.g., left, right) of an object within an image, and 3D spatial understanding, which includes movement and rotation. Notably, our dataset is entirely synthetic, enabling the generation of test samples at a low cost while also preventing dataset contamination. We conduct experiments on multiple state-of-the-art VLMs and observe that there is significant room for improvement in their spatial understanding abilities. Explicitly, in our experiments, humans achieve near-perfect performance on all tasks, whereas current VLMs attain human-level performance only on the two simplest tasks. For the remaining tasks, the performance of VLMs is distinctly lower than that of humans. In fact, the best-performing Vision-Language Models even achieve near-zero scores on multiple tasks. The dataset and code are available on this https URL.
Summary: LRR-Bench evaluates how Vision-Language Models (VLMs) recognize spatial relationships and perceive spatial movement. Tasks fall into absolute spatial understanding (e.g., whether an object is on the left or right of an image) and 3D spatial understanding (movement and rotation). The dataset is fully synthetic, which keeps test generation cheap and avoids contamination. Humans score near-perfectly on all tasks, whereas current VLMs reach human level only on the two simplest tasks and score near zero on several others.

Title: Motion-example-controlled Co-speech Gesture Generation Leveraging Large Language Models

Authors: Bohong Chen, Yumeng Li, Youyi Zheng, Yao-Xiang Ding, Kun Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.20220
Pdf URL: https://arxiv.org/pdf/2507.20220
Copy Paste: [[2507.20220]] Motion-example-controlled Co-speech Gesture Generation Leveraging Large Language Models(https://arxiv.org/abs/2507.20220)
Keywords: generation
Abstract: The automatic generation of controllable co-speech gestures has recently gained growing attention. While existing systems typically achieve gesture control through predefined categorical labels or implicit pseudo-labels derived from motion examples, these approaches often compromise the rich details present in the original motion examples. We present MECo, a framework for motion-example-controlled co-speech gesture generation by leveraging large language models (LLMs). Our method capitalizes on LLMs' comprehension capabilities through fine-tuning to simultaneously interpret speech audio and motion examples, enabling the synthesis of gestures that preserve example-specific characteristics while maintaining speech congruence. Departing from conventional pseudo-labeling paradigms, we position motion examples as explicit query contexts within the prompt structure to guide gesture generation. Experimental results demonstrate state-of-the-art performance across three metrics: Fréchet Gesture Distance (FGD), motion diversity, and example-gesture similarity. Furthermore, our framework enables granular control of individual body parts and accommodates diverse input modalities including motion clips, static poses, human video sequences, and textual descriptions. Our code, pre-trained models, and videos are available at this https URL.
Summary: MECo is a framework for motion-example-controlled co-speech gesture generation using large language models (LLMs). Instead of categorical labels or pseudo-labels, motion examples are placed in the prompt as explicit query contexts, and a fine-tuned LLM interprets speech audio and motion examples jointly. It reports state-of-the-art Fréchet Gesture Distance (FGD), motion diversity, and example-gesture similarity, supports control of individual body parts, and accepts motion clips, static poses, human video sequences, and text as inputs.

Title: Protein-SE(3): Benchmarking SE(3)-based Generative Models for Protein Structure Design

Authors: Lang Yu, Zhangyang Gao, Cheng Tan, Qin Chen, Jie Zhou, Liang He
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.20243
Pdf URL: https://arxiv.org/pdf/2507.20243
Copy Paste: [[2507.20243]] Protein-SE(3): Benchmarking SE(3)-based Generative Models for Protein Structure Design(https://arxiv.org/abs/2507.20243)
Keywords: generative
Abstract: SE(3)-based generative models have shown great promise in protein geometry modeling and effective structure design. However, the field currently lacks a modularized benchmark to enable comprehensive investigation and fair comparison of different methods. In this paper, we propose Protein-SE(3), a new benchmark based on a unified training framework, which comprises protein scaffolding tasks, integrated generative models, high-level mathematical abstraction, and diverse evaluation metrics. Recent advanced generative models designed for protein scaffolding, spanning multiple perspectives such as DDPM (Genie1 and Genie2), Score Matching (FrameDiff and RfDiffusion), and Flow Matching (FoldFlow and FrameFlow), are integrated into our framework. All integrated methods are fairly investigated with the same training dataset and evaluation metrics. Furthermore, we provide a high-level abstraction of the mathematical foundations behind the generative models, enabling fast prototyping of future algorithms without reliance on explicit protein structures. Accordingly, we release the first comprehensive benchmark built upon a unified training framework for SE(3)-based protein structure design, which is publicly accessible at this https URL.
Summary: Protein-SE(3) is a modular benchmark for SE(3)-based generative models in protein structure design, built on a unified training framework. It integrates DDPM (Genie1, Genie2), Score Matching (FrameDiff, RfDiffusion), and Flow Matching (FoldFlow, FrameFlow) methods and compares them on the same training dataset and evaluation metrics. A high-level abstraction of the underlying mathematics is provided to support fast prototyping of new algorithms.

Title: Fine-structure Preserved Real-world Image Super-resolution via Transfer VAE Training

Authors: Qiaosi Yi, Shuai Li, Rongyuan Wu, Lingchen Sun, Yuhui Wu, Lei Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.20291
Pdf URL: https://arxiv.org/pdf/2507.20291
Copy Paste: [[2507.20291]] Fine-structure Preserved Real-world Image Super-resolution via Transfer VAE Training(https://arxiv.org/abs/2507.20291)
Keywords: super-resolution
Abstract: Impressive results on real-world image super-resolution (Real-ISR) have been achieved by employing pre-trained stable diffusion (SD) models. However, one critical issue of such methods lies in their poor reconstruction of image fine structures, such as small characters and textures, due to the aggressive resolution reduction of the VAE (e.g., 8$\times$ downsampling) in the SD model. One solution is to employ a VAE with a lower downsampling rate for diffusion; however, adapting its latent features to the pre-trained UNet while mitigating the increased computational cost poses new challenges. To address these issues, we propose a Transfer VAE Training (TVT) strategy to transfer the 8$\times$ downsampled VAE into a 4$\times$ one while adapting to the pre-trained UNet. Specifically, we first train a 4$\times$ decoder based on the output features of the original VAE encoder, then train a 4$\times$ encoder while keeping the newly trained decoder fixed. Such a TVT strategy aligns the new encoder-decoder pair with the original VAE latent space while enhancing image fine details. Additionally, we introduce a compact VAE and compute-efficient UNet by optimizing their network architectures, reducing the computational cost while capturing high-resolution fine-scale features. Experimental results demonstrate that our TVT method significantly improves fine-structure preservation, which is often compromised by other SD-based methods, while requiring fewer FLOPs than state-of-the-art one-step diffusion models. The official code can be found at this https URL.
Summary: SD-based real-world image super-resolution (Real-ISR) reconstructs fine structures such as small characters and textures poorly because the VAE downsamples by 8$\times$. Transfer VAE Training (TVT) converts the 8$\times$ VAE into a 4$\times$ one in two steps: first train a 4$\times$ decoder on the original encoder's output features, then train a 4$\times$ encoder with the new decoder frozen, which keeps the pair aligned with the original latent space and the pre-trained UNet. Together with a compact VAE and a compute-efficient UNet, TVT improves fine-structure preservation while needing fewer FLOPs than state-of-the-art one-step diffusion models.

Title: Generative Pre-training for Subjective Tasks: A Diffusion Transformer-Based Framework for Facial Beauty Prediction

Authors: Djamel Eddine Boukhari, Ali chemsa
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.20363
Pdf URL: https://arxiv.org/pdf/2507.20363
Copy Paste: [[2507.20363]] Generative Pre-training for Subjective Tasks: A Diffusion Transformer-Based Framework for Facial Beauty Prediction(https://arxiv.org/abs/2507.20363)
Keywords: generative
Abstract: Facial Beauty Prediction (FBP) is a challenging computer vision task due to its subjective nature and the subtle, holistic features that influence human perception. Prevailing methods, often based on deep convolutional networks or standard Vision Transformers pre-trained on generic object classification (e.g., ImageNet), struggle to learn feature representations that are truly aligned with high-level aesthetic assessment. In this paper, we propose a novel two-stage framework that leverages the power of generative models to create a superior, domain-specific feature extractor. In the first stage, we pre-train a Diffusion Transformer on a large-scale, unlabeled facial dataset (FFHQ) through a self-supervised denoising task. This process forces the model to learn the fundamental data distribution of human faces, capturing nuanced details and structural priors essential for aesthetic evaluation. In the second stage, the pre-trained and frozen encoder of our Diffusion Transformer is used as a backbone feature extractor, with only a lightweight regression head being fine-tuned on the target FBP dataset (FBP5500). Our method, termed Diff-FBP, sets a new state-of-the-art on the FBP5500 benchmark, achieving a Pearson Correlation Coefficient (PCC) of 0.932, significantly outperforming prior art based on general-purpose pre-training. Extensive ablation studies validate that our generative pre-training strategy is the key contributor to this performance leap, creating feature representations that are more semantically potent for subjective visual tasks.
Summary: Diff-FBP is a two-stage framework for Facial Beauty Prediction (FBP). Stage one pre-trains a Diffusion Transformer on the unlabeled FFHQ face dataset with a self-supervised denoising task. Stage two freezes the pre-trained encoder as a feature extractor and fine-tunes only a lightweight regression head on FBP5500. The method reaches a Pearson Correlation Coefficient (PCC) of 0.932 on FBP5500, and ablations attribute the gain to the generative pre-training.
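For reference, the Pearson Correlation Coefficient reported above measures the linear agreement between predicted and ground-truth scores: the covariance of the two series divided by the product of their standard deviations. The sketch below uses only the standard definition; the score lists are made-up numbers for illustration and are not taken from FBP5500 or from the paper.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical human ratings and model predictions on a 1-5 scale.
human_scores = [2.1, 3.4, 4.0, 1.8, 3.0]
model_scores = [2.3, 3.1, 4.2, 2.0, 2.9]

print(f"PCC = {pearson(human_scores, model_scores):.3f}")
```

A PCC near 1 means the model ranks and spaces faces much as human raters do; it says nothing about a constant offset between the two scales.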

Title: MagicAnime: A Hierarchically Annotated, Multimodal and Multitasking Dataset with Benchmarks for Cartoon Animation Generation

Authors: Shuolin Xu, Bingyuan Wang, Zeyu Cai, Fangteng Fu, Yue Ma, Tongyi Lee, Hongchuan Yu, Zeyu Wang
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2507.20368
Pdf URL: https://arxiv.org/pdf/2507.20368
Copy Paste: [[2507.20368]] MagicAnime: A Hierarchically Annotated, Multimodal and Multitasking Dataset with Benchmarks for Cartoon Animation Generation(https://arxiv.org/abs/2507.20368)
Keywords: generation
Abstract: Generating high-quality cartoon animations with multimodal control is challenging due to the complexity of non-human characters, stylistically diverse motions, and fine-grained emotions. There is a huge domain gap between real-world videos and cartoon animation, as cartoon animation is usually abstract and has exaggerated motion. Meanwhile, public multimodal cartoon data are extremely scarce because large-scale automatic annotation is harder than in real-life scenarios. To bridge this gap, we propose the MagicAnime dataset, a large-scale, hierarchically annotated, multimodal dataset designed to support multiple video generation tasks, along with the benchmarks it includes. It contains 400k video clips for image-to-video generation, 50k pairs of video clips and keypoints for whole-body annotation, 12k pairs of video clips for video-to-video face animation, and 2.9k pairs of video and audio clips for audio-driven face animation. We also build a set of multimodal cartoon animation benchmarks, called MagicAnime-Bench, to support the comparison of different methods on the tasks above. Comprehensive experiments on four tasks, including video-driven face animation, audio-driven face animation, image-to-video animation, and pose-driven character animation, validate its effectiveness in supporting high-fidelity, fine-grained, and controllable generation.
Summary: MagicAnime is a large-scale, hierarchically annotated, multimodal dataset for cartoon animation generation, motivated by the domain gap between real-world video and cartoons and by the scarcity of public multimodal cartoon data. It provides 400k clips for image-to-video generation, 50k clip-keypoint pairs for whole-body annotation, 12k clip pairs for video-to-video face animation, and 2.9k video-audio pairs for audio-driven face animation. The accompanying MagicAnime-Bench covers four tasks: video-driven face animation, audio-driven face animation, image-to-video animation, and pose-driven character animation.

Title: WBHT: A Generative Attention Architecture for Detecting Black Hole Anomalies in Backbone Networks

Authors: Kiymet Kaya, Elif Ak, Sule Gunduz Oguducu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.20373
Pdf URL: https://arxiv.org/pdf/2507.20373
Copy Paste: [[2507.20373]] WBHT: A Generative Attention Architecture for Detecting Black Hole Anomalies in Backbone Networks(https://arxiv.org/abs/2507.20373)
Keywords: generative
Abstract: We propose the Wasserstein Black Hole Transformer (WBHT) framework for detecting black hole (BH) anomalies in communication networks. These anomalies cause packet loss without failure notifications, disrupting connectivity and leading to financial losses. WBHT combines generative modeling, sequential learning, and attention mechanisms to improve BH anomaly detection. It integrates a Wasserstein generative adversarial network with attention mechanisms for stable training and accurate anomaly identification. The model uses long short-term memory (LSTM) layers to capture long-term dependencies and convolutional layers for local temporal patterns. A latent space encoding mechanism helps distinguish abnormal network behavior. Tested on real-world network data, WBHT outperforms existing models, achieving significant improvements in F1 score (ranging from 1.65% to 58.76%). Its efficiency and ability to detect previously undetected anomalies make it a valuable tool for proactive network monitoring and security, especially in mission-critical networks.
Summary: The Wasserstein Black Hole Transformer (WBHT) detects black hole (BH) anomalies, which drop packets without any failure notification, in backbone communication networks. It combines a Wasserstein generative adversarial network with attention mechanisms, LSTM layers for long-term dependencies, convolutional layers for local temporal patterns, and a latent space encoding to separate abnormal behavior. On real-world network data it improves F1 score over existing models by 1.65% to 58.76%.
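As a reminder of what the F1 figures above measure, the sketch below computes F1 as the harmonic mean of precision and recall from raw detection counts. The counts are invented for illustration, and the abstract does not say whether the reported 1.65% to 58.76% gains are absolute or relative, so no attempt is made to reproduce them.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """F1 = harmonic mean of precision and recall; 0.0 when undefined."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical detector output: 8 anomalies caught, 2 false alarms, 4 missed.
example = f1_score(true_positives=8, false_positives=2, false_negatives=4)
print(f"F1 = {example:.4f}")
```

Because anomalies are rare, F1 is more informative than accuracy here: a detector that never raises an alarm would score high accuracy but an F1 of zero.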

Title: VESPA: Towards un(Human)supervised Open-World Pointcloud Labeling for Autonomous Driving

Authors: Levente Tempfli, Esteban Rivera, Markus Lienkamp
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.20397
Pdf URL: https://arxiv.org/pdf/2507.20397
Copy Paste: [[2507.20397]] VESPA: Towards un(Human)supervised Open-World Pointcloud Labeling for Autonomous Driving(https://arxiv.org/abs/2507.20397)
Keywords: generation
Abstract: Data collection for autonomous driving is rapidly accelerating, but manual annotation, especially for 3D labels, remains a major bottleneck due to its high cost and labor intensity. Autolabeling has emerged as a scalable alternative, allowing the generation of labels for point clouds with minimal human intervention. While LiDAR-based autolabeling methods leverage geometric information, they struggle with inherent limitations of LiDAR data, such as sparsity, occlusions, and incomplete object observations. Furthermore, these methods typically operate in a class-agnostic manner, offering limited semantic granularity. To address these challenges, we introduce VESPA, a multimodal autolabeling pipeline that fuses the geometric precision of LiDAR with the semantic richness of camera images. Our approach leverages vision-language models (VLMs) to enable open-vocabulary object labeling and to refine detection quality directly in the point cloud domain. VESPA supports the discovery of novel categories and produces high-quality 3D pseudolabels without requiring ground-truth annotations or HD maps. On the nuScenes dataset, VESPA achieves an AP of 52.95% for object discovery and up to 46.54% for multiclass object detection, demonstrating strong performance in scalable 3D scene understanding. Code will be available upon acceptance.
Summary: VESPA is a multimodal autolabeling pipeline for autonomous driving that fuses LiDAR geometry with camera semantics. It uses vision-language models (VLMs) for open-vocabulary object labeling and for refining detections directly in the point cloud domain, producing 3D pseudolabels without ground-truth annotations or HD maps. On nuScenes it reaches an AP of 52.95% for object discovery and up to 46.54% for multiclass object detection. Code is to be released upon acceptance.

Title: BioNeuralNet: A Graph Neural Network based Multi-Omics Network Data Analysis Tool

Authors: Vicente Ramos (1), Sundous Hussein (1), Mohamed Abdel-Hafiz (1), Arunangshu Sarkar (2), Weixuan Liu (2), Katerina J. Kechris (2), Russell P. Bowler (3), Leslie Lange (4), Farnoush Banaei-Kashani (1) ((1) Department of Computer Science and Engineering, University of Colorado Denver, Denver, USA, (2) Department of Biostatistics and Informatics, University of Colorado Anschutz Medical Campus, Aurora, USA, (3) Genomic Medicine Institute, Cleveland Clinic, Cleveland, USA, (4) Division of Biomedical Informatics and Personalized Medicine, University of Colorado Anschutz Medical Campus, Aurora, USA)
Subjects: cs.LG, q-bio.GN
Abstract URL: https://arxiv.org/abs/2507.20440
Pdf URL: https://arxiv.org/pdf/2507.20440
Copy Paste: [[2507.20440]] BioNeuralNet: A Graph Neural Network based Multi-Omics Network Data Analysis Tool(https://arxiv.org/abs/2507.20440)
Keywords: generation
Abstract: Multi-omics data offer unprecedented insights into complex biological systems, yet their high dimensionality, sparsity, and intricate interactions pose significant analytical challenges. Network-based approaches have advanced multi-omics research by effectively capturing biologically relevant relationships among molecular entities. While these methods are powerful for representing molecular interactions, there remains a need for tools specifically designed to effectively utilize these network representations across diverse downstream analyses. To fulfill this need, we introduce BioNeuralNet, a flexible and modular Python framework tailored for end-to-end network-based multi-omics data analysis. BioNeuralNet leverages Graph Neural Networks (GNNs) to learn biologically meaningful low-dimensional representations from multi-omics networks, converting these complex molecular networks into versatile embeddings. BioNeuralNet supports all major stages of multi-omics network analysis, including several network construction techniques, generation of low-dimensional representations, and a broad range of downstream analytical tasks. Its extensive utilities, including diverse GNN architectures, and compatibility with established Python packages (e.g., scikit-learn, PyTorch, NetworkX), enhance usability and facilitate quick adoption. BioNeuralNet is an open-source, user-friendly, and extensively documented framework designed to support flexible and reproducible multi-omics network analysis in precision medicine.

Title: Frequency-Aware Autoregressive Modeling for Efficient High-Resolution Image Synthesis

Authors: Zhuokun Chen, Jugang Fan, Zhuowei Yu, Bohan Zhuang, Mingkui Tan
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2507.20454
Pdf URL: https://arxiv.org/pdf/2507.20454
Copy Paste: [[2507.20454]] Frequency-Aware Autoregressive Modeling for Efficient High-Resolution Image Synthesis(https://arxiv.org/abs/2507.20454)
Keywords: generation
Abstract: Visual autoregressive modeling, based on the next-scale prediction paradigm, exhibits notable advantages in image quality and model scalability over traditional autoregressive and diffusion models. It generates images by progressively refining resolution across multiple stages. However, the computational overhead in high-resolution stages remains a critical challenge due to the substantial number of tokens involved. In this paper, we introduce SparseVAR, a plug-and-play acceleration framework for next-scale prediction that dynamically excludes low-frequency tokens during inference without requiring additional training. Our approach is motivated by the observation that tokens in low-frequency regions have a negligible impact on image quality in high-resolution stages and exhibit strong similarity with neighboring tokens. Additionally, we observe that different blocks in the next-scale prediction model focus on distinct regions, with some concentrating on high-frequency areas. SparseVAR leverages these insights by employing lightweight MSE-based metrics to identify low-frequency tokens while preserving the fidelity of excluded regions through a small set of uniformly sampled anchor tokens. By significantly reducing the computational cost while maintaining high image generation quality, SparseVAR achieves notable acceleration in both HART and Infinity. Specifically, SparseVAR achieves up to a 2× speedup with minimal quality degradation in Infinity-2B.
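The token-selection idea in the abstract above can be illustrated with a small sketch. This is not the SparseVAR implementation: the abstract only says the metric is "MSE-based" and that anchors are "uniformly sampled", so the neighbor comparison, the 1-D token layout, and the names `low_frequency_mask`, `keep_indices`, `threshold`, and `anchor_stride` are all assumptions made for illustration.

```python
def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def low_frequency_mask(tokens, threshold):
    """Mark a token as low-frequency when it is nearly identical to every
    adjacent token (assumed criterion: MSE to each neighbor < threshold)."""
    mask = []
    for i, tok in enumerate(tokens):
        neighbors = [tokens[j] for j in (i - 1, i + 1) if 0 <= j < len(tokens)]
        mask.append(all(mse(tok, n) < threshold for n in neighbors))
    return mask


def keep_indices(mask, anchor_stride):
    """Keep every high-frequency token, plus every `anchor_stride`-th token
    as a uniformly spaced anchor (hypothetical anchor rule)."""
    return [i for i, low in enumerate(mask) if not low or i % anchor_stride == 0]


# Toy 1-D token sequence: flat everywhere except a spike at index 3.
tokens = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0], [5.0, 5.0], [1.0, 1.0], [1.0, 1.0]]
mask = low_frequency_mask(tokens, threshold=0.1)
kept = keep_indices(mask, anchor_stride=4)
```

In this toy run the tokens around the spike are kept because they differ from a neighbor, while flat-region tokens are dropped unless they fall on an anchor position.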

Title: GaRe: Relightable 3D Gaussian Splatting for Outdoor Scenes from Unconstrained Photo Collections

Authors: Haiyang Bai, Jiaqi Zhu, Songru Jiang, Wei Huang, Tao Lu, Yuanqi Li, Jie Guo, Runze Fu, Yanwen Guo, Lijun Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.20512
Pdf URL: https://arxiv.org/pdf/2507.20512
Copy Paste: [[2507.20512]] GaRe: Relightable 3D Gaussian Splatting for Outdoor Scenes from Unconstrained Photo Collections(https://arxiv.org/abs/2507.20512)
Keywords: generation
Abstract: We propose a 3D Gaussian splatting-based framework for outdoor relighting that leverages intrinsic image decomposition to precisely integrate sunlight, sky radiance, and indirect lighting from unconstrained photo collections. Unlike prior methods that compress the per-image global illumination into a single latent vector, our approach enables simultaneously diverse shading manipulation and the generation of dynamic shadow effects. This is achieved through three key innovations: (1) a residual-based sun visibility extraction method to accurately separate direct sunlight effects, (2) a region-based supervision framework with a structural consistency loss for physically interpretable and coherent illumination decomposition, and (3) a ray-tracing-based technique for realistic shadow simulation. Extensive experiments demonstrate that our framework synthesizes novel views with competitive fidelity against state-of-the-art relighting solutions and produces more natural and multifaceted illumination and shadow effects.

Title: Kernel Learning for Sample Constrained Black-Box Optimization

Authors: Rajalaxmi Rajagopalan, Yu-Lin Wei, Romit Roy Choudhury
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.20533
Pdf URL: https://arxiv.org/pdf/2507.20533
Copy Paste: [[2507.20533]] Kernel Learning for Sample Constrained Black-Box Optimization(https://arxiv.org/abs/2507.20533)
Keywords: generative
Abstract: Black box optimization (BBO) focuses on optimizing unknown functions in high-dimensional spaces. In many applications, sampling the unknown function is expensive, imposing a tight sample budget. Ongoing work is making progress on reducing the sample budget by learning the shape/structure of the function, known as kernel learning. We propose a new method to learn the kernel of a Gaussian Process. Our idea is to create a continuous kernel space in the latent space of a variational autoencoder, and run an auxiliary optimization to identify the best kernel. Results show that the proposed method, Kernel Optimized Blackbox Optimization (KOBO), outperforms the state of the art by estimating the optimum at considerably lower sample budgets. Results hold not only across synthetic benchmark functions but also in real applications. We show that a hearing aid may be personalized with fewer audio queries to the user, or a generative model could converge to desirable images from limited user ratings.

Title: T2I-Copilot: A Training-Free Multi-Agent Text-to-Image System for Enhanced Prompt Interpretation and Interactive Generation

Authors: Chieh-Yun Chen, Min Shi, Gong Zhang, Humphrey Shi
Subjects: cs.CV, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2507.20536
Pdf URL: https://arxiv.org/pdf/2507.20536
Copy Paste: [[2507.20536]] T2I-Copilot: A Training-Free Multi-Agent Text-to-Image System for Enhanced Prompt Interpretation and Interactive Generation(https://arxiv.org/abs/2507.20536)
Keywords: generation, generative
Abstract: Text-to-Image (T2I) generative models have revolutionized content creation but remain highly sensitive to prompt phrasing, often requiring users to refine prompts repeatedly without clear feedback. While techniques such as automatic prompt engineering, controlled text embeddings, denoising, and multi-turn generation mitigate these issues, they offer limited controllability or often necessitate additional training, restricting their generalization abilities. Thus, we introduce T2I-Copilot, a training-free multi-agent system that leverages collaboration between (Multimodal) Large Language Models to automate prompt phrasing, model selection, and iterative refinement. This approach significantly simplifies prompt engineering while enhancing generation quality and text-image alignment compared to direct generation. Specifically, T2I-Copilot consists of three agents: (1) Input Interpreter, which parses the input prompt, resolves ambiguities, and generates a standardized report; (2) Generation Engine, which selects the appropriate model from different types of T2I models and organizes visual and textual prompts to initiate generation; and (3) Quality Evaluator, which assesses aesthetic quality and text-image alignment, providing scores and feedback for potential regeneration. T2I-Copilot can operate fully autonomously while also supporting human-in-the-loop intervention for fine-grained control. On GenAI-Bench, using open-source generation models, T2I-Copilot achieves a VQA score comparable to commercial models RecraftV3 and Imagen 3, surpasses FLUX1.1-pro by 6.17% at only 16.59% of its cost, and outperforms FLUX.1-dev and SD 3.5 Large by 9.11% and 6.36%, respectively. Code will be released at: this https URL.

Title: Annotation-Free Human Sketch Quality Assessment

Authors: Lan Yang, Kaiyue Pang, Honggang Zhang, Yi-Zhe Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.20548
Pdf URL: https://arxiv.org/pdf/2507.20548
Copy Paste: [[2507.20548]] Annotation-Free Human Sketch Quality Assessment(https://arxiv.org/abs/2507.20548)
Keywords: quality assessment
Abstract: As lovely as bunnies are, your sketched version would probably not do them justice. This paper recognises this very problem and studies sketch quality assessment for the first time, letting you find these badly drawn ones. Our key discovery lies in exploiting the magnitude ($L_2$ norm) of a sketch feature as a quantitative quality metric. We propose Geometry-Aware Classification Layer (GACL), a generic method that makes feature-magnitude-as-quality-metric possible and, importantly, does so without the need for specific quality annotations from humans. GACL sees feature magnitude and recognisability learning as a dual task, which can be simultaneously optimised under a neat cross-entropy classification loss with a theoretical guarantee. This gives GACL a nice geometric interpretation (the better the quality, the easier the recognition), and makes it agnostic to both network architecture changes and the underlying sketch representation. Through a large-scale human study of 160,000 trials, we confirm the agreement between our GACL-induced metric and human quality perception. We further demonstrate how such a quality assessment capability can for the first time enable three practical sketch applications. Interestingly, we show GACL not only works on abstract visual representations such as sketches but also extends well to natural images on the problem of image quality assessment (IQA). Last but not least, we spell out the general properties of GACL as a general-purpose data re-weighting strategy and demonstrate its applications in vertical problems such as noisy label cleansing. Code will be made publicly available at this http URL.
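The scoring step described above, using the $L_2$ norm of a feature vector as a quality score, can be sketched as follows. The sketch assumes the feature vectors already come from a network trained so that magnitude tracks quality (the role the abstract assigns to GACL); the training itself is not shown, and the names `l2_norm`, `rank_by_quality`, and the toy feature values are hypothetical.

```python
import math


def l2_norm(feature):
    """Euclidean magnitude of a feature vector."""
    return math.sqrt(sum(x * x for x in feature))


def rank_by_quality(features):
    """Return (name, score) pairs sorted from highest to lowest magnitude.
    `features` maps a sketch name to its feature vector."""
    scored = [(name, l2_norm(vec)) for name, vec in features.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy feature vectors; real ones would come from the trained network.
features = {
    "sketch_a": [3.0, 4.0],
    "sketch_b": [1.0, 0.0],
    "sketch_c": [6.0, 8.0],
}
ranking = rank_by_quality(features)
```

Under the stated assumption, sketches at the bottom of `ranking` would be the candidates for "badly drawn" examples.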

Title: Reminiscence Attack on Residuals: Exploiting Approximate Machine Unlearning for Privacy

Authors: Yaxin Xiao, Qingqing Ye, Li Hu, Huadi Zheng, Haibo Hu, Zi Liang, Haoyang Li, Yijie Jiao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.20573
Pdf URL: https://arxiv.org/pdf/2507.20573
Copy Paste: [[2507.20573]] Reminiscence Attack on Residuals: Exploiting Approximate Machine Unlearning for Privacy(https://arxiv.org/abs/2507.20573)
Keywords: generation
Abstract: Machine unlearning enables the removal of specific data from ML models to uphold the right to be forgotten. While approximate unlearning algorithms offer efficient alternatives to full retraining, this work reveals that they fail to adequately protect the privacy of unlearned data. In particular, these algorithms introduce implicit residuals which facilitate privacy attacks targeting unlearned data. We observe that these residuals persist regardless of model architectures, parameters, and unlearning algorithms, exposing a new attack surface beyond conventional output-based leakage. Based on this insight, we propose the Reminiscence Attack (ReA), which amplifies the correlation between residuals and membership privacy through targeted fine-tuning processes. ReA achieves up to 1.90x and 1.12x higher accuracy than prior attacks when inferring class-wise and sample-wise membership, respectively. To mitigate such residual-induced privacy risk, we develop a dual-phase approximate unlearning framework that first eliminates deep-layer unlearned data traces and then enforces convergence stability to prevent models from "pseudo-convergence", where their outputs are similar to those of retrained models but still preserve unlearned residuals. Our framework works for both classification and generation tasks. Experimental evaluations confirm that our approach maintains high unlearning efficacy while reducing the adaptive privacy attack accuracy to nearly random guessing, at a computational cost of 2-12% of full retraining from scratch.

Title: AV-Deepfake1M++: A Large-Scale Audio-Visual Deepfake Benchmark with Real-World Perturbations

Authors: Zhixi Cai, Kartik Kuckreja, Shreya Ghosh, Akanksha Chuchra, Muhammad Haris Khan, Usman Tariq, Tom Gedeon, Abhinav Dhall
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.20579
Pdf URL: https://arxiv.org/pdf/2507.20579
Copy Paste: [[2507.20579]] AV-Deepfake1M++: A Large-Scale Audio-Visual Deepfake Benchmark with Real-World Perturbations(https://arxiv.org/abs/2507.20579)
Keywords: generation
Abstract: The rapid surge of text-to-speech and face-voice reenactment models makes video fabrication easier and highly realistic. To counter this problem, we require datasets that are rich in the types of generation methods and perturbation strategies commonly seen in online videos. To this end, we propose AV-Deepfake1M++, an extension of AV-Deepfake1M containing 2 million video clips with diversified manipulation strategies and audio-visual perturbations. This paper describes the data generation strategies and benchmarks AV-Deepfake1M++ using state-of-the-art methods. We believe that this dataset will play a pivotal role in facilitating research in the deepfake domain. Based on this dataset, we host the 2025 1M-Deepfakes Detection Challenge. The challenge details, dataset and evaluation scripts are available online under a research-only license at this https URL.

Title: Harnessing Diffusion-Yielded Score Priors for Image Restoration

Authors: Xinqi Lin, Fanghua Yu, Jinfan Hu, Zhiyuan You, Wu Shi, Jimmy S. Ren, Jinjin Gu, Chao Dong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.20590
Pdf URL: https://arxiv.org/pdf/2507.20590
Copy Paste: [[2507.20590]] Harnessing Diffusion-Yielded Score Priors for Image Restoration(https://arxiv.org/abs/2507.20590)
Keywords: restoration
Abstract: Deep image restoration models aim to learn a mapping from degraded image space to natural image space. However, they face several critical challenges: removing degradation, generating realistic details, and ensuring pixel-level consistency. Over time, three major classes of methods have emerged, including MSE-based, GAN-based, and diffusion-based methods. However, they fail to achieve a good balance between restoration quality, fidelity, and speed. We propose a novel method, HYPIR, to address these challenges. Our solution pipeline is straightforward: it involves initializing the image restoration model with a pre-trained diffusion model and then fine-tuning it with adversarial training. This approach does not rely on diffusion loss, iterative sampling, or additional adapters. We theoretically demonstrate that initializing adversarial training from a pre-trained diffusion model positions the initial restoration model very close to the natural image distribution. Consequently, this initialization improves numerical stability, avoids mode collapse, and substantially accelerates the convergence of adversarial training. Moreover, HYPIR inherits the capabilities of diffusion models with rich user control, enabling text-guided restoration and adjustable texture richness. Requiring only a single forward pass, it achieves faster convergence and inference speed than diffusion-based methods. Extensive experiments show that HYPIR outperforms previous state-of-the-art methods, achieving efficient and high-quality image restoration.

Title: PhaseNAS: Language-Model Driven Architecture Search with Dynamic Phase Adaptation

Authors: Fei Kong, Xiaohan Shan, Yanwei Hu, Jianmin Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.20592
Pdf URL: https://arxiv.org/pdf/2507.20592
Copy Paste: [[2507.20592]] PhaseNAS: Language-Model Driven Architecture Search with Dynamic Phase Adaptation(https://arxiv.org/abs/2507.20592)
Keywords: generation
Abstract: Neural Architecture Search (NAS) is challenged by the trade-off between search space exploration and efficiency, especially for complex tasks. While recent LLM-based NAS methods have shown promise, they often suffer from static search strategies and ambiguous architecture representations. We propose PhaseNAS, an LLM-based NAS framework with dynamic phase transitions guided by real-time score thresholds and a structured architecture template language for consistent code generation. On the NAS-Bench-Macro benchmark, PhaseNAS consistently discovers architectures with higher accuracy and better rank. For image classification (CIFAR-10/100), PhaseNAS reduces search time by up to 86% while maintaining or improving accuracy. In object detection, it automatically produces YOLOv8 variants with higher mAP and lower resource cost. These results demonstrate that PhaseNAS enables efficient, adaptive, and generalizable NAS across diverse vision tasks.

Title: Deep Generative Models of Evolution: SNP-level Population Adaptation by Genomic Linkage Incorporation

Authors: Julia Siekiera, Christian Schlötterer, Stefan Kramer
Subjects: cs.LG, q-bio.PE
Abstract URL: https://arxiv.org/abs/2507.20644
Pdf URL: https://arxiv.org/pdf/2507.20644
Copy Paste: [[2507.20644]] Deep Generative Models of Evolution: SNP-level Population Adaptation by Genomic Linkage Incorporation(https://arxiv.org/abs/2507.20644)
Keywords: generative
Abstract: The investigation of allele frequency trajectories in populations evolving under controlled environmental pressures has become a popular approach to study evolutionary processes on the molecular level. Statistical models based on well-defined evolutionary concepts can be used to validate different hypotheses about empirical observations. Despite their popularity, classic statistical models like the Wright-Fisher model suffer from simplified assumptions such as the independence of selected loci along a chromosome and uncertainty about the parameters. Deep generative neural networks offer a powerful alternative known for the integration of multivariate dependencies and noise reduction. Due to their high data demands and challenging interpretability, they have so far not been widely considered in the area of population genomics. To address the challenges in the area of Evolve and Resequencing experiments (E&R) based on pooled sequencing (Pool-Seq) data, we introduce a deep generative neural network that aims to model a concept of evolution based on empirical observations over time. The proposed model estimates the distribution of allele frequency trajectories by embedding the observations from single nucleotide polymorphisms (SNPs) with information from neighboring loci. Evaluation on simulated E&R experiments demonstrates the model's ability to capture the distribution of allele frequency trajectories and illustrates the representational power of deep generative models on the example of linkage disequilibrium (LD) estimation. Inspecting the internally learned representations enables estimating pairwise LD, which is typically inaccessible in Pool-Seq data. Our model provides competitive LD estimation in Pool-Seq data with a high degree of LD when compared to existing methods.

Title: Learning Only with Images: Visual Reinforcement Learning with Reasoning, Rendering, and Visual Feedback

Authors: Yang Chen, Yufan Shen, Wenxuan Huang, Shen Zhou, Qunshu Lin, Xinyu Cai, Zhi Yu, Botian Shi, Yu Qiao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.20766
Pdf URL: https://arxiv.org/pdf/2507.20766
Copy Paste: [[2507.20766]] Learning Only with Images: Visual Reinforcement Learning with Reasoning, Rendering, and Visual Feedback(https://arxiv.org/abs/2507.20766)
Keywords: generation
Abstract: Multimodal Large Language Models (MLLMs) have exhibited impressive performance across various visual tasks. Subsequent investigations into enhancing their visual reasoning abilities have significantly expanded their performance envelope. However, a critical bottleneck in the advancement of MLLMs toward deep visual reasoning is their heavy reliance on curated image-text supervision. To solve this problem, we introduce a novel framework termed "Reasoning-Rendering-Visual-Feedback" (RRVF), which enables MLLMs to learn complex visual reasoning from only raw images. This framework builds on the "Asymmetry of Verification" principle to train MLLMs, i.e., verifying the rendered output against a source image is easier than generating it. We demonstrate that this relative ease provides an ideal reward signal for optimization via Reinforcement Learning (RL) training, reducing the reliance on the image-text supervision. Guided by the above principle, RRVF implements a closed-loop iterative process encompassing reasoning, rendering, and visual feedback components, enabling the model to perform self-correction through multi-turn interactions and tool invocation, while this pipeline can be optimized by the GRPO algorithm in an end-to-end manner. Extensive experiments on image-to-code generation for data charts and web interfaces show that RRVF substantially outperforms existing open-source MLLMs and surpasses supervised fine-tuning baselines. Our findings demonstrate that systems driven by purely visual feedback present a viable path toward more robust and generalizable reasoning models without requiring explicit supervision. Code will be available at this https URL.

Title: FantasyID: A dataset for detecting digital manipulations of ID-documents

Authors: Pavel Korshunov, Amir Mohammadi, Vidit Vidit, Christophe Ecabert, Sébastien Marcel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.20808
Pdf URL: https://arxiv.org/pdf/2507.20808
Copy Paste: [[2507.20808]] FantasyID: A dataset for detecting digital manipulations of ID-documents(https://arxiv.org/abs/2507.20808)
Keywords: generation, generative
Abstract: Advancements in image generation have led to the availability of easy-to-use tools that malicious actors can use to create forged images. These tools pose a serious threat to the widespread Know Your Customer (KYC) applications, requiring robust systems for detection of forged Identity Documents (IDs). To facilitate the development of detection algorithms, in this paper, we propose a novel publicly available (including for commercial use) dataset, FantasyID, which mimics real-world IDs without tampering with legal documents and, unlike previous public datasets, does not contain generated faces or specimen watermarks. FantasyID contains ID cards with diverse design styles, languages, and faces of real people. To simulate a realistic KYC scenario, the cards from FantasyID were printed and captured with three different devices, constituting the bonafide class. We have emulated digital forgery/injection attacks that could be performed by a malicious actor to tamper with the IDs using existing generative tools. The current state-of-the-art forgery detection algorithms, such as TruFor, MMFusion, UniFD, and FatFormer, are challenged by the FantasyID dataset. This is especially evident under near-practical evaluation conditions, with the operational threshold set on the validation set so that the false positive rate is 10%, leading to false negative rates close to 50% across the board on the test set. The evaluation experiments demonstrate that the FantasyID dataset is complex enough to be used as an evaluation benchmark for detection algorithms.
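The evaluation protocol mentioned above (fix the operating threshold on the validation set at a 10% false positive rate, then report the false negative rate on the test set) can be sketched as below. This is a generic illustration, not the authors' evaluation code: the score convention (higher means "more likely forged"), the function names, and the toy scores are all assumptions.

```python
def threshold_at_fpr(bonafide_scores, target_fpr):
    """Pick a threshold so that about `target_fpr` of bona fide validation
    samples score strictly above it (and would be wrongly flagged)."""
    ranked = sorted(bonafide_scores, reverse=True)
    allowed = int(target_fpr * len(ranked))  # bona fide samples allowed above
    return ranked[allowed] if allowed < len(ranked) else float("-inf")


def false_positive_rate(bonafide_scores, threshold):
    """Fraction of bona fide samples flagged as forged."""
    return sum(1 for s in bonafide_scores if s > threshold) / len(bonafide_scores)


def false_negative_rate(forged_scores, threshold):
    """Fraction of forged samples that are not flagged."""
    return sum(1 for s in forged_scores if s <= threshold) / len(forged_scores)


# Toy detector scores (hypothetical values).
val_bonafide = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.6, 0.9]
test_forged = [0.3, 0.5, 0.7, 0.95]

thr = threshold_at_fpr(val_bonafide, target_fpr=0.1)
val_fpr = false_positive_rate(val_bonafide, thr)
test_fnr = false_negative_rate(test_forged, thr)
```

Fixing the threshold on validation data and reusing it unchanged on the test set is what makes the reported false negative rate reflect an operational setting rather than a best-case one.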

Title: First Hallucination Tokens Are Different from Conditional Ones

Authors: Jakob Snel, Seong Joon Oh
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.20836
Pdf URL: https://arxiv.org/pdf/2507.20836
Copy Paste: [[2507.20836]] First Hallucination Tokens Are Different from Conditional Ones(https://arxiv.org/abs/2507.20836)
Keywords: generation
Abstract: Hallucination, the generation of untruthful content, is one of the major concerns regarding foundational models. Detecting hallucinations at the token level is vital for real-time filtering and targeted correction, yet the variation of hallucination signals within token sequences is not fully understood. Leveraging the RAGTruth corpus with token-level annotations and reproduced logits, we analyse how these signals depend on a token's position within hallucinated spans, contributing to an improved understanding of token-level hallucination. Our results show that the first hallucinated token carries a stronger signal and is more detectable than conditional tokens. We release our analysis framework, along with code for logit reproduction and metric computation at this https URL.

Title: Towards Explainable Deep Clustering for Time Series Data

Authors: Udo Schlegel, Gabriel Marques Tavares, Thomas Seidl
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.20840
Pdf URL: https://arxiv.org/pdf/2507.20840
Copy Paste: [[2507.20840]] Towards Explainable Deep Clustering for Time Series Data(https://arxiv.org/abs/2507.20840)
Keywords: generation
Abstract: Deep clustering uncovers hidden patterns and groups in complex time series data, yet its opaque decision-making limits use in safety-critical settings. This survey offers a structured overview of explainable deep clustering for time series, collecting current methods and their real-world applications. We thoroughly discuss and compare peer-reviewed and preprint papers through application domains across healthcare, finance, IoT, and climate science. Our analysis reveals that most work relies on autoencoder and attention architectures, with limited support for streaming, irregularly sampled, or privacy-preserved series, and interpretability is still primarily treated as an add-on. To push the field forward, we outline six research opportunities: (1) combining complex networks with built-in interpretability; (2) setting up clear, faithfulness-focused evaluation metrics for unsupervised explanations; (3) building explainers that adapt to live data streams; (4) crafting explanations tailored to specific domains; (5) adding human-in-the-loop methods that refine clusters and explanations together; and (6) improving our understanding of how time series clustering models work internally. By making interpretability a primary design goal rather than an afterthought, we propose the groundwork for the next generation of trustworthy deep clustering time series analytics.

Title: Compositional Video Synthesis by Temporal Object-Centric Learning

Authors: Adil Kaan Akan, Yucel Yemez
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.20855
Pdf URL: https://arxiv.org/pdf/2507.20855
Copy Paste: [[2507.20855]] Compositional Video Synthesis by Temporal Object-Centric Learning(https://arxiv.org/abs/2507.20855)
Keywords: generation, generative
Abstract: We present a novel framework for compositional video synthesis that leverages temporally consistent object-centric representations, extending our previous work, SlotAdapt, from images to video. While existing object-centric approaches either lack generative capabilities entirely or treat video sequences holistically, thus neglecting explicit object-level structure, our approach explicitly captures temporal dynamics by learning pose invariant object-centric slots and conditioning them on pretrained diffusion models. This design enables high-quality, pixel-level video synthesis with superior temporal coherence, and offers intuitive compositional editing capabilities such as object insertion, deletion, or replacement, maintaining consistent object identities across frames. Extensive experiments demonstrate that our method sets new benchmarks in video generation quality and temporal consistency, outperforming previous object-centric generative methods. Although our segmentation performance closely matches state-of-the-art methods, our approach uniquely integrates this capability with robust generative performance, significantly advancing interactive and controllable video generation and opening new possibilities for advanced content creation, semantic editing, and dynamic scene understanding.

Title: Exploring text-to-image generation for historical document image retrieval

Authors: Melissa Cote, Alexandra Branzan Albu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.20934
Pdf URL: https://arxiv.org/pdf/2507.20934
Copy Paste: [[2507.20934]] Exploring text-to-image generation for historical document image retrieval(https://arxiv.org/abs/2507.20934)
Keywords: generation, generative
Abstract: Attribute-based document image retrieval (ABDIR) was recently proposed as an alternative to query-by-example (QBE) searches, the dominant document image retrieval (DIR) paradigm. One drawback of QBE searches is that they require sample query documents on hand that may not be available. ABDIR aims to offer users a flexible way to retrieve document images based on memorable visual features of document contents, describing document images with combinations of visual attributes determined via convolutional neural network (CNN)-based binary classifiers. We present an exploratory study of the use of generative AI to bridge the gap between QBE and ABDIR, focusing on historical documents as a use case for their diversity and uniqueness in visual features. We hypothesize that text-to-image (T2I) generation can be leveraged to create query document images using text prompts based on ABDIR-like attributes. We propose T2I-QBE, which uses this http URL as the T2I generator with prompts that include a rough description of the desired document type and a list of the desired ABDIR-style attributes. This creates query images that are then used within the traditional QBE paradigm, which compares CNN-extracted query features to those of the document images in the dataset to retrieve the most relevant documents. Experiments on the HisIR19 dataset of historical documents confirm our hypothesis and suggest that T2I-QBE is a viable option for historical document image retrieval. To the authors' knowledge, this is the first attempt at utilizing T2I generation for DIR.

Title: Mask-Free Audio-driven Talking Face Generation for Enhanced Visual Quality and Identity Preservation

Authors: Dogucan Yaman, Fevziye Irem Eyiokur, Leonard Bärmann, Hazım Kemal Ekenel, Alexander Waibel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.20953
Pdf URL: https://arxiv.org/pdf/2507.20953
Copy Paste: [[2507.20953]] Mask-Free Audio-driven Talking Face Generation for Enhanced Visual Quality and Identity Preservation(https://arxiv.org/abs/2507.20953)
Keywords: generation
Abstract: Audio-Driven Talking Face Generation aims at generating realistic videos of talking faces, focusing on accurate audio-lip synchronization without deteriorating any identity-related visual details. Recent state-of-the-art methods are based on inpainting, meaning that the lower half of the input face is masked, and the model fills the masked region by generating lips aligned with the given audio. Hence, to preserve identity-related visual details from the lower half, these approaches additionally require an unmasked identity reference image randomly selected from the same video. However, this common masking strategy suffers from (1) information loss in the input faces, significantly affecting the networks' ability to preserve visual quality and identity details, (2) variation between identity reference and input image degrading reconstruction performance, and (3) the identity reference negatively impacting the model, causing unintended copying of elements unaligned with the audio. To address these issues, we propose a mask-free talking face generation approach while maintaining the 2D-based face editing task. Instead of masking the lower half, we transform the input images to have closed mouths, using a two-step landmark-based approach trained in an unpaired manner. Subsequently, we provide these edited but unmasked faces to a lip adaptation model alongside the audio to generate appropriate lip movements. Thus, our approach needs neither masked input images nor identity reference images. We conduct experiments on the benchmark LRS2 and HDTF datasets and perform various ablation studies to validate our contributions.

Title: PROVCREATOR: Synthesizing Complex Heterogenous Graphs with Node and Edge Attributes

Authors: Tianhao Wang, Simon Klancher, Kunal Mukherjee, Josh Wiedemeier, Feng Chen, Murat Kantarcioglu, Kangkook Jee
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.20967
Pdf URL: https://arxiv.org/pdf/2507.20967
Copy Paste: [[2507.20967]] PROVCREATOR: Synthesizing Complex Heterogenous Graphs with Node and Edge Attributes(https://arxiv.org/abs/2507.20967)
Keywords: generation
Abstract: The rise of graph-structured data has driven interest in graph learning and synthetic data generation. While successful in text and image domains, synthetic graph generation remains challenging, especially for real-world graphs with complex, heterogeneous schemas. Existing research has focused mostly on homogeneous structures with simple attributes, limiting their usefulness and relevance for application domains requiring semantic fidelity. In this research, we introduce ProvCreator, a synthetic graph framework designed for complex heterogeneous graphs with high-dimensional node and edge attributes. ProvCreator formulates graph synthesis as a sequence generation task, enabling the use of transformer-based large language models. It features a versatile graph-to-sequence encoder-decoder that (1) losslessly encodes graph structure and attributes, (2) efficiently compresses large graphs for contextual modeling, and (3) supports end-to-end, learnable graph generation. To validate our research, we evaluate ProvCreator on two challenging domains: system provenance graphs in cybersecurity and knowledge graphs from the IntelliGraph Benchmark Dataset. In both cases, ProvCreator captures intricate dependencies between structure and semantics, enabling the generation of realistic and privacy-aware synthetic datasets.

Title: Model-Agnostic Gender Bias Control for Text-to-Image Generation via Sparse Autoencoder

Authors: Chao Wu, Zhenyi Wang, Kangxian Xie, Naresh Kumar Devulapally, Vishnu Suresh Lokhande, Mingchen Gao
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2507.20973
Pdf URL: https://arxiv.org/pdf/2507.20973
Copy Paste: [[2507.20973]] Model-Agnostic Gender Bias Control for Text-to-Image Generation via Sparse Autoencoder(https://arxiv.org/abs/2507.20973)
Keywords: generation, generative
Abstract: Text-to-image (T2I) diffusion models often exhibit gender bias, particularly by generating stereotypical associations between professions and gendered subjects. This paper presents SAE Debias, a lightweight and model-agnostic framework for mitigating such bias in T2I generation. Unlike prior approaches that rely on CLIP-based filtering or prompt engineering, which often require model-specific adjustments and offer limited control, SAE Debias operates directly within the feature space without retraining or architectural modifications. By leveraging a k-sparse autoencoder pre-trained on a gender bias dataset, the method identifies gender-relevant directions within the sparse latent space, capturing professional stereotypes. Specifically, a biased direction per profession is constructed from sparse latents and suppressed during inference to steer generations toward more gender-balanced outputs. Trained only once, the sparse autoencoder provides a reusable debiasing direction, offering effective control and interpretable insight into biased subspaces. Extensive evaluations across multiple T2I models, including Stable Diffusion 1.4, 1.5, 2.1, and SDXL, demonstrate that SAE Debias substantially reduces gender bias while preserving generation quality. To the best of our knowledge, this is the first work to apply sparse autoencoders for identifying and intervening in gender bias within T2I models. These findings contribute toward building socially responsible generative AI, providing an interpretable and model-agnostic tool to support fairness in text-to-image generation.
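
The two operations named in the SAE Debias abstract, a k-sparse activation and the suppression of a direction in the sparse latent space, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `k_sparse` and `suppress_direction`, the `strength` parameter, and the choice of a plain projection to remove the direction are all assumptions made for illustration.

```python
import numpy as np

def k_sparse(z, k):
    """Keep the k largest activations in each row and zero out the rest."""
    out = np.zeros_like(z)
    # Indices of the k largest entries per row (order within the top-k is arbitrary).
    idx = np.argpartition(z, -k, axis=-1)[..., -k:]
    np.put_along_axis(out, idx, np.take_along_axis(z, idx, axis=-1), axis=-1)
    return out

def suppress_direction(latents, direction, strength=1.0):
    """Subtract a fraction of each row's component along `direction`.

    Assumption: suppression is modeled as a linear projection; the abstract
    does not specify how the biased direction is suppressed.
    """
    d = direction / np.linalg.norm(direction)
    coeff = latents @ d                      # component of each row along d
    return latents - strength * np.outer(coeff, d)

# Toy usage with made-up numbers.
sparse = k_sparse(np.array([[1.0, 5.0, 3.0, 2.0]]), k=2)
steered = suppress_direction(np.array([[3.0, 4.0]]), np.array([1.0, 0.0]))
```

With `strength=1.0` the component along the direction is removed entirely; smaller values would only attenuate it.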

Title: Adapting Vehicle Detectors for Aerial Imagery to Unseen Domains with Weak Supervision

Authors: Xiao Fang, Minhyek Jeon, Zheyang Qin, Stanislav Panev, Celso de Melo, Shuowen Hu, Shayok Chakraborty, Fernando De la Torre
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.20976
Pdf URL: https://arxiv.org/pdf/2507.20976
Copy Paste: [[2507.20976]] Adapting Vehicle Detectors for Aerial Imagery to Unseen Domains with Weak Supervision(https://arxiv.org/abs/2507.20976)
Keywords: generative
Abstract: Detecting vehicles in aerial imagery is a critical task with applications in traffic monitoring, urban planning, and defense intelligence. Deep learning methods have provided state-of-the-art (SOTA) results for this application. However, a significant challenge arises when models trained on data from one geographic region fail to generalize effectively to other areas. Variability in factors such as environmental conditions, urban layouts, road networks, vehicle types, and image acquisition parameters (e.g., resolution, lighting, and angle) leads to domain shifts that degrade model performance. This paper proposes a novel method that uses generative AI to synthesize high-quality aerial images and their labels, improving detector training through data augmentation. Our key contribution is the development of a multi-stage, multi-modal knowledge transfer framework utilizing fine-tuned latent diffusion models (LDMs) to mitigate the distribution gap between the source and target environments. Extensive experiments across diverse aerial imagery domains show consistent performance improvements in AP50 over supervised learning on source domain data, weakly supervised adaptation methods, unsupervised domain adaptation methods, and open-set object detectors by 4-23%, 6-10%, 7-40%, and more than 50%, respectively. Furthermore, we introduce two newly annotated aerial datasets from New Zealand and Utah to support further research in this field. Project page is available at: this https URL

Title: JWB-DH-V1: Benchmark for Joint Whole-Body Talking Avatar and Speech Generation Version 1

Authors: Xinhan Di, Kristin Qi, Pengqian Yu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.20987
Pdf URL: https://arxiv.org/pdf/2507.20987
Copy Paste: [[2507.20987]] JWB-DH-V1: Benchmark for Joint Whole-Body Talking Avatar and Speech Generation Version 1(https://arxiv.org/abs/2507.20987)
Keywords: generation
Abstract: Recent advances in diffusion-based video generation have enabled photo-realistic short clips, but current methods still struggle to achieve multi-modal consistency when jointly generating whole-body motion and natural speech. Current approaches lack comprehensive evaluation frameworks that assess both visual and audio quality, and there are insufficient benchmarks for region-specific performance analysis. To address these gaps, we introduce the Joint Whole-Body Talking Avatar and Speech Generation Version I (JWB-DH-V1), comprising a large-scale multi-modal dataset with 10,000 unique identities across 2 million video samples, and an evaluation protocol for assessing joint audio-video generation of whole-body animatable avatars. Our evaluation of SOTA models reveals consistent performance disparities between face/hand-centric and whole-body performance, which indicates essential areas for future research. The dataset and evaluation tools are publicly available at this https URL.

Title: LoRA-PAR: A Flexible Dual-System LoRA Partitioning Approach to Efficient LLM Fine-Tuning

Authors: Yining Huang, Bin Li, Keke Tang, Meilian Chen
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2507.20999
Pdf URL: https://arxiv.org/pdf/2507.20999
Copy Paste: [[2507.20999]] LoRA-PAR: A Flexible Dual-System LoRA Partitioning Approach to Efficient LLM Fine-Tuning(https://arxiv.org/abs/2507.20999)
Keywords: generative
Abstract: Large-scale generative models like DeepSeek-R1 and OpenAI-O1 benefit substantially from chain-of-thought (CoT) reasoning, yet pushing their performance typically requires vast data, large model sizes, and full-parameter fine-tuning. While parameter-efficient fine-tuning (PEFT) helps reduce cost, most existing approaches primarily address domain adaptation or layer-wise allocation rather than explicitly tailoring data and parameters to different response demands. Inspired by "Thinking, Fast and Slow," which characterizes two distinct modes of thought, System 1 (fast, intuitive, often automatic) and System 2 (slower, more deliberative and analytic), we draw an analogy that different "subregions" of an LLM's parameters might similarly specialize for tasks that demand quick, intuitive responses versus those requiring multi-step logical reasoning. Therefore, we propose LoRA-PAR, a dual-system LoRA framework that partitions both data and parameters by System 1 or System 2 demands, using fewer yet more focused parameters for each task. Specifically, we classify task data via multi-model role-playing and voting, partition parameters based on importance scoring, and then adopt a two-stage fine-tuning strategy: first training System 1 tasks with supervised fine-tuning (SFT) to enhance knowledge and intuition, then refining System 2 tasks with reinforcement learning (RL) to reinforce deeper logical deliberation. Extensive experiments show that the two-stage fine-tuning strategy, SFT and RL, lowers active parameter usage while matching or surpassing SOTA PEFT baselines.

Title: Flow Matching Policy Gradients

Authors: David McAllister, Songwei Ge, Brent Yi, Chung Min Kim, Ethan Weber, Hongsuk Choi, Haiwen Feng, Angjoo Kanazawa
Subjects: cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2507.21053
Pdf URL: https://arxiv.org/pdf/2507.21053
Copy Paste: [[2507.21053]] Flow Matching Policy Gradients(https://arxiv.org/abs/2507.21053)
Keywords: generative
Abstract: Flow-based generative models, including diffusion models, excel at modeling continuous distributions in high-dimensional spaces. In this work, we introduce Flow Policy Optimization (FPO), a simple on-policy reinforcement learning algorithm that brings flow matching into the policy gradient framework. FPO casts policy optimization as maximizing an advantage-weighted ratio computed from the conditional flow matching loss, in a manner compatible with the popular PPO-clip framework. It sidesteps the need for exact likelihood computation while preserving the generative capabilities of flow-based models. Unlike prior approaches for diffusion-based reinforcement learning that bind training to a specific sampling method, FPO is agnostic to the choice of diffusion or flow integration at both training and inference time. We show that FPO can train diffusion-style policies from scratch in a variety of continuous control tasks. We find that flow-based models can capture multimodal action distributions and achieve higher performance than Gaussian policies, particularly in under-conditioned settings.
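
As a rough illustration of the FPO abstract's "advantage-weighted ratio computed from the conditional flow matching loss" inside a PPO-clip objective, the sketch below computes a clipped surrogate from per-sample losses. The abstract does not state the exact form of the ratio; the exponentiated loss difference used here, the function name `fpo_surrogate`, and the default `clip_eps` are assumptions for illustration only.

```python
import numpy as np

def fpo_surrogate(cfm_loss_new, cfm_loss_old, advantages, clip_eps=0.2):
    """Clipped, advantage-weighted surrogate objective (to be maximized).

    Assumption: the ratio is exp(old_loss - new_loss), so a lower flow matching
    loss under the new parameters yields a ratio above 1.
    """
    ratio = np.exp(cfm_loss_old - cfm_loss_new)
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Standard PPO-clip form: take the pessimistic of the two weighted terms.
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))

# Toy usage: identical losses give a ratio of 1, so the objective equals the mean advantage.
objective = fpo_surrogate(
    cfm_loss_new=np.array([0.5, 0.5]),
    cfm_loss_old=np.array([0.5, 0.5]),
    advantages=np.array([2.0, -1.0]),
)
```

In a real training loop the losses would come from a differentiable flow model and the objective would be maximized by gradient ascent; NumPy is used here only to show the arithmetic.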