2025-05-05

Title: Fast2comm: Collaborative perception combined with prior knowledge

Authors: Zhengbin Zhang, Yan Wu, Hongkun Zhang
Subjects: cs.CV, cs.MA
Abstract URL: https://arxiv.org/abs/2505.00740
Pdf URL: https://arxiv.org/pdf/2505.00740
Copy Paste: [[2505.00740]] Fast2comm: Collaborative perception combined with prior knowledge(https://arxiv.org/abs/2505.00740)
Keywords: generation
Abstract: Collaborative perception has the potential to significantly enhance perceptual accuracy through the sharing of complementary information among agents. However, real-world collaborative perception faces persistent challenges, particularly in balancing perception performance and bandwidth limitations, as well as coping with localization errors. To address these challenges, we propose Fast2comm, a prior knowledge-based collaborative perception framework. Specifically, (1) we propose a prior-supervised confidence feature generation method that effectively distinguishes foreground from background by producing highly discriminative confidence features; (2) we propose a ground-truth (GT) bounding box-based spatial prior feature selection strategy to ensure that only the most informative prior-knowledge features are selected and shared, thereby minimizing background noise and optimizing bandwidth efficiency while enhancing adaptability to localization inaccuracies; (3) we decouple the feature fusion strategies between model training and testing phases, enabling dynamic bandwidth adaptation. To comprehensively validate our framework, we conduct extensive experiments on both real-world and simulated datasets. The results demonstrate the superior performance of our model and highlight the necessity of the proposed methods. Our code is available at this https URL.

Title: InstructAttribute: Fine-grained Object Attributes editing with Instruction

Authors: Xingxi Yin, Jingfeng Zhang, Zhi Li, Yicheng Li, Yin Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.00751
Pdf URL: https://arxiv.org/pdf/2505.00751
Copy Paste: [[2505.00751]] InstructAttribute: Fine-grained Object Attributes editing with Instruction(https://arxiv.org/abs/2505.00751)
Keywords: generative
Abstract: Text-to-image (T2I) diffusion models, renowned for their advanced generative abilities, are extensively utilized in image editing applications, demonstrating remarkable effectiveness. However, achieving precise control over fine-grained attributes still presents considerable challenges. Existing image editing techniques either fail to modify the attributes of an object or struggle to preserve its structure and maintain consistency in other areas of the image. To address these challenges, we propose the Structure-Preserving and Attribute Amplification (SPAA), a training-free method which enables precise control over the color and material transformations of objects by editing the self-attention maps and cross-attention values. Furthermore, we constructed the Attribute Dataset, which encompasses nearly all colors and materials associated with various objects, by integrating multimodal large language models (MLLM) to develop an automated pipeline for data filtering and instruction labeling. Training on this dataset, we present our InstructAttribute, an instruction-based model designed to facilitate fine-grained editing of color and material attributes. Extensive experiments demonstrate that our method achieves superior performance in object-level color and material editing, outperforming existing instruction-based image editing approaches.

Title: Multi-Modal Language Models as Text-to-Image Model Evaluators

Authors: Jiahui Chen, Candace Ross, Reyhane Askari-Hemmat, Koustuv Sinha, Melissa Hall, Michal Drozdzal, Adriana Romero-Soriano
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2505.00759
Pdf URL: https://arxiv.org/pdf/2505.00759
Copy Paste: [[2505.00759]] Multi-Modal Language Models as Text-to-Image Model Evaluators(https://arxiv.org/abs/2505.00759)
Keywords: generation, generative
Abstract: The steady improvements of text-to-image (T2I) generative models lead to slow deprecation of automatic evaluation benchmarks that rely on static datasets, motivating researchers to seek alternative ways to evaluate the T2I progress. In this paper, we explore the potential of multi-modal large language models (MLLMs) as evaluator agents that interact with a T2I model, with the objective of assessing prompt-generation consistency and image aesthetics. We present Multimodal Text-to-Image Eval (MT2IE), an evaluation framework that iteratively generates prompts for evaluation, scores generated images and matches T2I evaluation of existing benchmarks with a fraction of the prompts used in existing static benchmarks. Moreover, we show that MT2IE's prompt-generation consistency scores have higher correlation with human judgment than scores previously introduced in the literature. MT2IE generates prompts that are efficient at probing T2I model performance, producing the same relative T2I model rankings as existing benchmarks while using only 1/80th the number of prompts for evaluation.
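The claim that a small prompt set reproduces the model rankings of a full benchmark can be checked with a rank-correlation statistic. The sketch below uses Kendall's tau as one common choice. The abstract does not say which statistic the authors used, and the model names and scores here are made up for illustration.

```python
from itertools import combinations
from typing import Dict


def kendall_tau(scores_a: Dict[str, float], scores_b: Dict[str, float]) -> float:
    """Kendall's tau-a between two score tables keyed by model name.

    Returns 1.0 when both tables order the models identically and
    -1.0 when one ordering is the exact reverse of the other.
    """
    models = sorted(set(scores_a) & set(scores_b))
    if len(models) < 2:
        raise ValueError("need at least two models scored in both tables")
    concordant = discordant = 0
    for m, n in combinations(models, 2):
        # Positive product: both tables agree on which model is better.
        sign = (scores_a[m] - scores_a[n]) * (scores_b[m] - scores_b[n])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores: a full static benchmark vs. a much smaller prompt set.
full_benchmark = {"model_a": 0.71, "model_b": 0.64, "model_c": 0.58}
small_prompt_set = {"model_a": 0.69, "model_b": 0.66, "model_c": 0.51}
print(kendall_tau(full_benchmark, small_prompt_set))
```

A value of 1.0 means the reduced prompt set preserves the relative ranking exactly, which is the property the abstract reports.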

Title: Scalable Unit Harmonization in Medical Informatics Using Bi-directional Transformers and Bayesian-Optimized BM25 and Sentence Embedding Retrieval

Authors: Jordi de la Torre
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.00810
Pdf URL: https://arxiv.org/pdf/2505.00810
Copy Paste: [[2505.00810]] Scalable Unit Harmonization in Medical Informatics Using Bi-directional Transformers and Bayesian-Optimized BM25 and Sentence Embedding Retrieval(https://arxiv.org/abs/2505.00810)
Keywords: generation
Abstract: Objective: To develop and evaluate a scalable methodology for harmonizing inconsistent units in large-scale clinical datasets, addressing a key barrier to data interoperability. Materials and Methods: We designed a novel unit harmonization system combining BM25, sentence embeddings, Bayesian optimization, and a bidirectional transformer-based binary classifier for retrieving and matching laboratory test entries. The system was evaluated using the Optum Clinformatics Datamart dataset (7.5 billion entries). We implemented a multi-stage pipeline: filtering, identification, harmonization proposal generation, automated re-ranking, and manual validation. Performance was assessed using Mean Reciprocal Rank (MRR) and other standard information retrieval metrics. Results: Our hybrid retrieval approach combining BM25 and sentence embeddings (MRR: 0.8833) significantly outperformed both lexical-only (MRR: 0.7985) and embedding-only (MRR: 0.5277) approaches. The transformer-based reranker further improved performance (absolute MRR improvement: 0.10), bringing the final system MRR to 0.9833. The system achieved 83.39% precision at rank 1 and 94.66% recall at rank 5. Discussion: The hybrid architecture effectively leverages the complementary strengths of lexical and semantic approaches. The reranker addresses cases where initial retrieval components make errors due to complex semantic relationships in medical terminology. Conclusion: Our framework provides an efficient, scalable solution for unit harmonization in clinical datasets, reducing manual effort while improving accuracy. Once harmonized, data can be reused seamlessly in different analyses, ensuring consistency across healthcare systems and enabling more reliable multi-institutional studies and meta-analyses.
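For readers unfamiliar with the metrics quoted in this abstract, the following is a minimal sketch of Mean Reciprocal Rank and recall at rank k. It assumes exactly one correct target per query. The unit strings and rankings are invented for illustration and are not taken from the paper's data.

```python
from typing import List, Sequence


def mean_reciprocal_rank(ranked: Sequence[Sequence[str]], targets: Sequence[str]) -> float:
    """Average of 1/rank of the correct item; a query that misses it contributes 0."""
    total = 0.0
    for candidates, target in zip(ranked, targets):
        for rank, candidate in enumerate(candidates, start=1):
            if candidate == target:
                total += 1.0 / rank
                break
    return total / len(targets)


def recall_at_k(ranked: Sequence[Sequence[str]], targets: Sequence[str], k: int) -> float:
    """Fraction of queries whose correct item appears in the top k candidates."""
    hits = sum(1 for candidates, target in zip(ranked, targets) if target in candidates[:k])
    return hits / len(targets)


# Hypothetical ranked harmonization proposals for three unit strings.
ranked_proposals: List[List[str]] = [
    ["mg/dL", "g/L", "mmol/L"],   # correct unit at rank 1
    ["g/L", "mg/dL"],             # correct unit at rank 2
    ["IU", "U/L", "mmol/L"],      # correct unit not retrieved
]
correct_units = ["mg/dL", "mg/dL", "ng/mL"]

print(mean_reciprocal_rank(ranked_proposals, correct_units))  # (1 + 1/2 + 0) / 3
print(recall_at_k(ranked_proposals, correct_units, k=1))
```

With a single correct item per query, recall at rank 1 coincides with precision at rank 1, so the same helper covers both figures in this simplified setting.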

Title: Data-Driven Optical To Thermal Inference in Pool Boiling Using Generative Adversarial Networks

Authors: Qianxi Fu, Youngjoon Suh, Xiaojing Zhang, Yoonjin Won
Subjects: cs.LG, physics.app-ph
Abstract URL: https://arxiv.org/abs/2505.00823
Pdf URL: https://arxiv.org/pdf/2505.00823
Copy Paste: [[2505.00823]] Data-Driven Optical To Thermal Inference in Pool Boiling Using Generative Adversarial Networks(https://arxiv.org/abs/2505.00823)
Keywords: generative
Abstract: Phase change plays a critical role in thermal management systems, yet quantitative characterization of multiphase heat transfer remains limited by the challenges of measuring temperature fields in chaotic, rapidly evolving flow regimes. While computational methods offer spatiotemporal resolution in idealized cases, replicating complex experimental conditions remains prohibitively difficult. Here, we present a data-driven framework that leverages a conditional generative adversarial network (CGAN) to infer temperature fields from geometric phase contours in a canonical pool boiling configuration where advanced data collection techniques are restricted. Using high-speed imaging data and simulation-informed training, our model demonstrates the ability to reconstruct temperature fields with errors below 6%. We further show that standard data augmentation strategies are effective in enhancing both accuracy and physical plausibility of the predicted maps across both simulation and experimental datasets when precise physical constraints are not applicable. Our results highlight the potential of deep generative models to bridge the gap between observable multiphase phenomena and underlying thermal transport, offering a powerful approach to augment and interpret experimental measurements in complex two-phase systems.

Title: The Comparability of Model Fusion to Measured Data in Confuser Rejection

Authors: Conor Flynn, Christopher Ebersole, Edmund Zelnio
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.00836
Pdf URL: https://arxiv.org/pdf/2505.00836
Copy Paste: [[2505.00836]] The Comparability of Model Fusion to Measured Data in Confuser Rejection(https://arxiv.org/abs/2505.00836)
Keywords: generation
Abstract: Data collection has always been a major issue in the modeling and training of large deep learning networks, as no dataset can account for every slight deviation we might see in live usage. Collecting samples can be especially costly for Synthetic Aperture Radar (SAR), limiting the number of unique targets and operating conditions we are able to observe. To counter this lack of data, simulators have been developed utilizing the shooting and bouncing ray method to allow for the generation of synthetic SAR data on 3D models. While effective, the synthetically generated data does not perfectly correlate with the measured data, leading to issues when training models solely on synthetic data. We aim to use computational power as a substitute for this lack of quality measured data by ensembling many models trained on synthetic data. Synthetic data is also not complete, as we do not know what targets might be present in a live environment. Therefore, we need our ensembling techniques to account for these unknown targets by applying confuser rejection, in which our models reject unknown targets they are presented with and classify only those they have been trained on.

Title: NeMo-Inspector: A Visualization Tool for LLM Generation Analysis

Authors: Daria Gitman, Igor Gitman, Evelina Bakhturina
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2505.00903
Pdf URL: https://arxiv.org/pdf/2505.00903
Copy Paste: [[2505.00903]] NeMo-Inspector: A Visualization Tool for LLM Generation Analysis(https://arxiv.org/abs/2505.00903)
Keywords: generation
Abstract: Adapting Large Language Models (LLMs) to novel tasks and enhancing their overall capabilities often requires large, high-quality training datasets. Synthetic data, generated at scale, serves as a valuable alternative when real-world data is scarce or difficult to obtain. However, ensuring the quality of synthetic datasets is challenging, as developers must manually inspect and refine numerous samples to identify errors and areas for improvement. This process is time-consuming and requires specialized tools. We introduce NeMo-Inspector, an open-source tool designed to simplify the analysis of synthetic datasets with integrated inference capabilities. We demonstrate its effectiveness through two real-world cases. Analysis and cleaning of the synthetically generated GSM-Plus dataset with NeMo-Inspector led to a significant decrease in low-quality samples from 46.99% to 19.51%. The tool also helped identify and correct generation errors in OpenMath models, improving accuracy by 1.92% on the MATH dataset and by 4.17% on the GSM8K dataset for a Meta-Llama-3-8B model fine-tuned on synthetic data generated from Nemotron-4-340B.

Title: Tree-Sliced Wasserstein Distance with Nonlinear Projection

Authors: Thanh Tran, Viet-Hoang Tran, Thanh Chu, Trang Pham, Laurent El Ghaoui, Tam Le, Tan M. Nguyen
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.00968
Pdf URL: https://arxiv.org/pdf/2505.00968
Copy Paste: [[2505.00968]] Tree-Sliced Wasserstein Distance with Nonlinear Projection(https://arxiv.org/abs/2505.00968)
Keywords: generative
Abstract: Tree-Sliced methods have recently emerged as an alternative to the traditional Sliced Wasserstein (SW) distance, replacing one-dimensional lines with tree-based metric spaces and incorporating a splitting mechanism for projecting measures. This approach enhances the ability to capture the topological structures of integration domains in Sliced Optimal Transport while maintaining low computational costs. Building on this foundation, we propose a novel nonlinear projectional framework for the Tree-Sliced Wasserstein (TSW) distance, substituting the linear projections in earlier versions with general projections, while ensuring the injectivity of the associated Radon Transform and preserving the well-definedness of the resulting metric. By designing appropriate projections, we construct efficient metrics for measures on both Euclidean spaces and spheres. Finally, we validate our proposed metric through extensive numerical experiments for Euclidean and spherical datasets. Applications include gradient flows, self-supervised learning, and generative models, where our methods demonstrate significant improvements over recent SW and TSW variants.
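As background for the baseline mentioned in this abstract, here is a minimal sketch of the classical Sliced Wasserstein distance with random linear projections. It is not the tree-sliced or nonlinear variant the paper proposes. It assumes two point clouds of equal size with uniform weights, so each one-dimensional transport cost reduces to matching sorted projections. The function names and default parameters are illustrative.

```python
import math
import random
from typing import List, Sequence

Point = Sequence[float]


def random_unit_vector(dim: int, rng: random.Random) -> List[float]:
    """Draw a direction uniformly on the unit sphere via normalized Gaussians."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def sliced_wasserstein(xs: Sequence[Point], ys: Sequence[Point],
                       n_projections: int = 64, p: int = 2, seed: int = 0) -> float:
    """Monte Carlo estimate of the classical SW_p distance between two point clouds."""
    if not xs or len(xs) != len(ys):
        raise ValueError("this sketch assumes two non-empty clouds of equal size")
    dim = len(xs[0])
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_projections):
        theta = random_unit_vector(dim, rng)
        # Project both clouds onto the line spanned by theta and sort:
        # in 1D the optimal coupling pairs points in sorted order.
        px = sorted(sum(a * b for a, b in zip(pt, theta)) for pt in xs)
        py = sorted(sum(a * b for a, b in zip(pt, theta)) for pt in ys)
        total += sum(abs(a - b) ** p for a, b in zip(px, py)) / len(px)
    return (total / n_projections) ** (1.0 / p)


cloud = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
shifted = [(x + 1.0, y + 1.0) for x, y in cloud]
print(sliced_wasserstein(cloud, shifted))
```

Tree-sliced methods, as the abstract describes, replace the one-dimensional lines used here with tree-based metric spaces, and the paper further replaces the linear projections with more general ones.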

Title: Generating Animated Layouts as Structured Text Representations

Authors: Yeonsang Shin, Jihwan Kim, Yumin Song, Kyungseung Lee, Hyunhee Chung, Taeyoung Na
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.00975
Pdf URL: https://arxiv.org/pdf/2505.00975
Copy Paste: [[2505.00975]] Generating Animated Layouts as Structured Text Representations(https://arxiv.org/abs/2505.00975)
Keywords: generation
Abstract: Despite the remarkable progress in text-to-video models, achieving precise control over text elements and animated graphics remains a significant challenge, especially in applications such as video advertisements. To address this limitation, we introduce Animated Layout Generation, a novel approach to extend static graphic layouts with temporal dynamics. We propose a Structured Text Representation for fine-grained video control through hierarchical visual elements. To demonstrate the effectiveness of our approach, we present VAKER (Video Ad maKER), a text-to-video advertisement generation pipeline that combines a three-stage generation process with Unstructured Text Reasoning for seamless integration with LLMs. VAKER fully automates video advertisement generation by incorporating dynamic layout trajectories for objects and graphics across specific video frames. Through extensive evaluations, we demonstrate that VAKER significantly outperforms existing methods in generating video advertisements. Project Page: this https URL

Title: Deterministic-to-Stochastic Diverse Latent Feature Mapping for Human Motion Synthesis

Authors: Yu Hua, Weiming Liu, Gui Xu, Yaqing Hou, Yew-Soon Ong, Qiang Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.00998
Pdf URL: https://arxiv.org/pdf/2505.00998
Copy Paste: [[2505.00998]] Deterministic-to-Stochastic Diverse Latent Feature Mapping for Human Motion Synthesis(https://arxiv.org/abs/2505.00998)
Keywords: generation, generative
Abstract: Human motion synthesis aims to generate plausible human motion sequences, which has raised widespread attention in computer animation. Recent score-based generative models (SGMs) have demonstrated impressive results on this task. However, their training process involves complex curvature trajectories, leading to an unstable training process. In this paper, we propose a Deterministic-to-Stochastic Diverse Latent Feature Mapping (DSDFM) method for human motion synthesis. DSDFM consists of two stages. The first stage, human motion reconstruction, aims to learn the latent space distribution of human motions. The second stage, diverse motion generation, aims to build connections between the Gaussian distribution and the latent space distribution of human motions, thereby enhancing the diversity and accuracy of the generated human motions. This stage is achieved by the designed deterministic feature mapping procedure with DerODE and stochastic diverse output generation procedure with this http URL is easy to train compared to previous SGMs-based methods and can enhance diversity without introducing additional training this http URL qualitative and quantitative experiments, DSDFM achieves state-of-the-art results surpassing the latest methods, validating its superiority in human motion synthesis.

Title: Where's the liability in the Generative Era? Recovery-based Black-Box Detection of AI-Generated Content

Authors: Haoyue Bai, Yiyou Sun, Wei Cheng, Haifeng Chen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.01008
Pdf URL: https://arxiv.org/pdf/2505.01008
Copy Paste: [[2505.01008]] Where's the liability in the Generative Era? Recovery-based Black-Box Detection of AI-Generated Content(https://arxiv.org/abs/2505.01008)
Keywords: generative
Abstract: The recent proliferation of photorealistic images created by generative models has sparked both excitement and concern, as these images are increasingly indistinguishable from real ones to the human eye. While offering new creative and commercial possibilities, the potential for misuse, such as in misinformation and fraud, highlights the need for effective detection methods. Current detection approaches often rely on access to model weights or require extensive collections of real image datasets, limiting their scalability and practical application in real-world scenarios. In this work, we introduce a novel black-box detection framework that requires only API access, sidestepping the need for model weights or large auxiliary datasets. Our approach leverages a corrupt-and-recover strategy: by masking part of an image and assessing the model's ability to reconstruct it, we measure the likelihood that the image was generated by the model itself. For black-box models that do not support masked image inputs, we incorporate a cost-efficient surrogate model trained to align with the target model distribution, enhancing detection capability. Our framework demonstrates strong performance, outperforming baseline methods by 4.31% in mean average precision across eight diffusion model variant datasets.
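The corrupt-and-recover idea in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the `recover` callable is a hypothetical stand-in for a black-box inpainting API, the central-square mask and the negative mean-squared-error score are arbitrary choices, and `mean_fill_recover` is a toy "model" used only to make the example runnable.

```python
from typing import Callable, List, Tuple

Image = List[List[float]]
Region = List[Tuple[int, int]]


def mask_center(image: Image, frac: float = 0.5, fill: float = 0.0) -> Tuple[Image, Region]:
    """Corrupt step: blank out a central rectangle covering `frac` of each side."""
    h, w = len(image), len(image[0])
    mh, mw = max(1, int(h * frac)), max(1, int(w * frac))
    top, left = (h - mh) // 2, (w - mw) // 2
    region = [(r, c) for r in range(top, top + mh) for c in range(left, left + mw)]
    masked = [row[:] for row in image]
    for r, c in region:
        masked[r][c] = fill
    return masked, region


def recovery_score(image: Image, recover: Callable[[Image, Region], Image],
                   frac: float = 0.5) -> float:
    """Recover step: higher score means the model reconstructs the masked part better."""
    masked, region = mask_center(image, frac)
    restored = recover(masked, region)
    mse = sum((restored[r][c] - image[r][c]) ** 2 for r, c in region) / len(region)
    return -mse


def mean_fill_recover(masked: Image, region: Region) -> Image:
    """Toy stand-in for an inpainting API: fill the hole with the mean visible pixel."""
    hole = set(region)
    visible = [v for r, row in enumerate(masked) for c, v in enumerate(row) if (r, c) not in hole]
    mean = sum(visible) / len(visible)
    return [[mean if (r, c) in hole else v for c, v in enumerate(row)]
            for r, row in enumerate(masked)]


flat = [[0.5] * 4 for _ in range(4)]                              # easy for the toy model
checker = [[float((r + c) % 2) for c in range(4)] for r in range(4)]  # hard for the toy model
print(recovery_score(flat, mean_fill_recover), recovery_score(checker, mean_fill_recover))
```

In a detector built along these lines, images whose score exceeds a chosen threshold would be flagged as likely generated by the queried model. How the threshold is set and how the surrogate model is trained are not specified in the abstract.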

Title: Multi-Step Consistency Models: Fast Generation with Theoretical Guarantees

Authors: Nishant Jain, Xunpeng Huang, Yian Ma, Tong Zhang
Subjects: cs.LG, math.AP, math.ST, stat.ML
Abstract URL: https://arxiv.org/abs/2505.01049
Pdf URL: https://arxiv.org/pdf/2505.01049
Copy Paste: [[2505.01049]] Multi-Step Consistency Models: Fast Generation with Theoretical Guarantees(https://arxiv.org/abs/2505.01049)
Keywords: generation
Abstract: Consistency models have recently emerged as a compelling alternative to traditional SDE-based diffusion models, offering a significant acceleration in generation by producing high-quality samples in very few steps. Despite their empirical success, a proper theoretical justification for their speed-up is still lacking. In this work, we provide the analysis which bridges this gap, showing that given a consistency model which can map the input at a given time to arbitrary timestamps along the reverse trajectory, one can achieve KL divergence of order $O(\varepsilon^2)$ using only $O\left(\log\left(\frac{d}{\varepsilon}\right)\right)$ iterations with constant step size, where $d$ is the data dimension. Additionally, under minimal assumptions on the data distribution (an increasingly common setting in recent diffusion model analyses), we show that a similar KL convergence guarantee can be obtained, with the number of steps scaling as $O\left(d \log\left(\frac{d}{\varepsilon}\right)\right)$. Going further, we also provide a theoretical analysis for estimation of such consistency models, concluding that accurate learning is feasible using small discretization steps, both in smooth and non-smooth settings. Notably, our results for the non-smooth case yield best-in-class convergence rates compared to existing SDE- or ODE-based analyses under minimal assumptions.

Title: Efficient Vocabulary-Free Fine-Grained Visual Recognition in the Age of Multimodal LLMs

Authors: Hari Chandana Kuchibhotla, Sai Srinivas Kancheti, Abbavaram Gowtham Reddy, Vineeth N Balasubramanian
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.01064
Pdf URL: https://arxiv.org/pdf/2505.01064
Copy Paste: [[2505.01064]] Efficient Vocabulary-Free Fine-Grained Visual Recognition in the Age of Multimodal LLMs(https://arxiv.org/abs/2505.01064)
Keywords: generation
Abstract: Fine-grained Visual Recognition (FGVR) involves distinguishing between visually similar categories, which is inherently challenging due to subtle inter-class differences and the need for large, expert-annotated datasets. In domains like medical imaging, such curated datasets are unavailable due to issues like privacy concerns and high annotation costs. In such scenarios lacking labeled data, an FGVR model cannot rely on a predefined set of training labels, and hence has an unconstrained output space for predictions. We refer to this task as Vocabulary-Free FGVR (VF-FGVR), where a model must predict labels from an unconstrained output space without prior label information. While recent Multimodal Large Language Models (MLLMs) show potential for VF-FGVR, querying these models for each test input is impractical because of high costs and prohibitive inference times. To address these limitations, we introduce Nearest-Neighbor Label Refinement (NeaR), a novel approach that fine-tunes a downstream CLIP model using labels generated by an MLLM. Our approach constructs a weakly supervised dataset from a small, unlabeled training set, leveraging MLLMs for label generation. NeaR is designed to handle the noise, stochasticity, and open-endedness inherent in labels generated by MLLMs, and establishes a new benchmark for efficient VF-FGVR.

Title: Improving Editability in Image Generation with Layer-wise Memory

Authors: Daneul Kim, Jaeah Lee, Jaesik Park
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2505.01079
Pdf URL: https://arxiv.org/pdf/2505.01079
Copy Paste: [[2505.01079]] Improving Editability in Image Generation with Layer-wise Memory(https://arxiv.org/abs/2505.01079)
Keywords: generation
Abstract: Most real-world image editing tasks require multiple sequential edits to achieve desired results. Current editing approaches, primarily designed for single-object modifications, struggle with sequential editing: especially with maintaining previous edits along with adapting new objects naturally into the existing content. These limitations significantly hinder complex editing scenarios where multiple objects need to be modified while preserving their contextual relationships. We address this fundamental challenge through two key proposals: enabling rough mask inputs that preserve existing content while naturally integrating new elements and supporting consistent editing across multiple modifications. Our framework achieves this through layer-wise memory, which stores latent representations and prompt embeddings from previous edits. We propose Background Consistency Guidance that leverages memorized latents to maintain scene coherence and Multi-Query Disentanglement in cross-attention that ensures natural adaptation to existing content. To evaluate our method, we present a new benchmark dataset incorporating semantic alignment metrics and interactive editing scenarios. Through comprehensive experiments, we demonstrate superior performance in iterative image editing tasks with minimal user effort, requiring only rough masks while maintaining high-quality results throughout multiple editing steps.

Title: Any-to-Any Vision-Language Model for Multimodal X-ray Imaging and Radiological Report Generation

Authors: Daniele Molino, Francesco di Feola, Linlin Shen, Paolo Soda, Valerio Guarrasi
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.01091
Pdf URL: https://arxiv.org/pdf/2505.01091
Copy Paste: [[2505.01091]] Any-to-Any Vision-Language Model for Multimodal X-ray Imaging and Radiological Report Generation(https://arxiv.org/abs/2505.01091)
Keywords: generation, generative
Abstract: Generative models have revolutionized Artificial Intelligence (AI), particularly in multimodal applications. However, adapting these models to the medical domain poses unique challenges due to the complexity of medical data and the stringent need for clinical accuracy. In this work, we introduce a framework specifically designed for multimodal medical data generation. By enabling the generation of multi-view chest X-rays and their associated clinical report, it bridges the gap between general-purpose vision-language models and the specialized requirements of healthcare. Leveraging the MIMIC-CXR dataset, the proposed framework shows superior performance in generating high-fidelity images and semantically coherent reports. Our quantitative evaluation reveals significant results in terms of FID and BLEU scores, showcasing the quality of the generated data. Notably, our framework achieves comparable or even superior performance compared to real data on downstream disease classification tasks, underlining its potential as a tool for medical research and diagnostics. This study highlights the importance of domain-specific adaptations in enhancing the relevance and utility of generative models for clinical applications, paving the way for future advancements in synthetic multimodal medical data generation.

Title: Evaluating Vision Language Model Adaptations for Radiology Report Generation in Low-Resource Languages

Authors: Marco Salmè, Rosa Sicilia, Paolo Soda, Valerio Guarrasi
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2505.01096
Pdf URL: https://arxiv.org/pdf/2505.01096
Copy Paste: [[2505.01096]] Evaluating Vision Language Model Adaptations for Radiology Report Generation in Low-Resource Languages(https://arxiv.org/abs/2505.01096)
Keywords: generation
Abstract: The integration of artificial intelligence in healthcare has opened new horizons for improving medical diagnostics and patient care. However, challenges persist in developing systems capable of generating accurate and contextually relevant radiology reports, particularly in low-resource languages. In this study, we present a comprehensive benchmark to evaluate the performance of instruction-tuned Vision-Language Models (VLMs) in the specialized task of radiology report generation across three low-resource languages: Italian, German, and Spanish. Employing the LLaVA architectural framework, we conducted a systematic evaluation of pre-trained models utilizing general datasets, domain-specific datasets, and low-resource language-specific datasets. In light of the unavailability of models that possess prior knowledge of both the medical domain and low-resource languages, we analyzed various adaptations to determine the most effective approach for these contexts. The results revealed that language-specific models substantially outperformed both general and domain-specific models in generating radiology reports, emphasizing the critical role of linguistic adaptation. Additionally, models fine-tuned with medical terminology exhibited enhanced performance across all languages compared to models with generic knowledge, highlighting the importance of domain-specific training. We also explored the influence of the temperature parameter on the coherence of report generation, providing insights for optimal model settings. Our findings highlight the importance of tailored language and domain-specific training for improving the quality and accuracy of radiological reports in multilingual settings. This research not only advances our understanding of VLMs adaptability in healthcare but also points to significant avenues for future investigations into model tuning and language-specific adaptations.
摘要：人工智能在医疗保健中的整合为改善医学诊断和患者护理开辟了新的前景。然而，在开发能够生成准确且与上下文相关的放射学报告的系统方面，挑战依然存在，尤其是在低资源语言中。在这项研究中，我们提出了一个全面的基准，用于评估指令微调的视觉语言模型（VLM）在三种低资源语言（意大利语、德语和西班牙语）的放射学报告生成这一专业任务中的性能。我们采用LLaVA架构框架，对使用通用数据集、领域特定数据集和低资源语言特定数据集预训练的模型进行了系统评估。鉴于目前没有同时具备医学领域和低资源语言先验知识的模型，我们分析了多种适配方式，以确定适用于这些情境的最有效方法。结果表明，在生成放射学报告时，语言特定模型明显优于通用模型和领域特定模型，凸显了语言适配的关键作用。此外，与仅具备通用知识的模型相比，使用医学术语微调的模型在所有语言上都表现出更好的性能，凸显了领域特定训练的重要性。我们还探讨了温度参数对报告生成连贯性的影响，为最佳模型设置提供了见解。我们的发现强调了定制化的语言和领域特定训练对于提高多语言环境中放射学报告质量和准确性的重要性。这项研究不仅加深了我们对VLM在医疗保健中适应性的理解，还为未来关于模型调优和语言特定适配的研究指出了重要方向。

Title: VSC: Visual Search Compositional Text-to-Image Diffusion Model

Authors: Do Huu Dat, Nam Hyeonu, Po-Yuan Mao, Tae-Hyun Oh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.01104
Pdf URL: https://arxiv.org/pdf/2505.01104
Copy Paste: [[2505.01104]] VSC: Visual Search Compositional Text-to-Image Diffusion Model(https://arxiv.org/abs/2505.01104)
Keywords: generation
Abstract: Text-to-image diffusion models have shown impressive capabilities in generating realistic visuals from natural-language prompts, yet they often struggle with accurately binding attributes to corresponding objects, especially in prompts containing multiple attribute-object pairs. This challenge primarily arises from the limitations of commonly used text encoders, such as CLIP, which can fail to encode complex linguistic relationships and modifiers effectively. Existing approaches have attempted to mitigate these issues through attention map control during inference and the use of layout information or fine-tuning during training, yet they face performance drops with increased prompt complexity. In this work, we introduce a novel compositional generation method that leverages pairwise image embeddings to improve attribute-object binding. Our approach decomposes complex prompts into sub-prompts, generates corresponding images, and computes visual prototypes that fuse with text embeddings to enhance representation. By applying segmentation-based localization training, we address cross-attention misalignment, achieving improved accuracy in binding multiple attributes to objects. Our approach outperforms existing compositional text-to-image diffusion models on the T2I CompBench benchmark, achieving better human-evaluated image quality and emerging robustness as the number of binding pairs in the prompt grows.
摘要：文本到图像扩散模型在根据自然语言提示生成逼真视觉内容方面表现出令人印象深刻的能力，但它们往往难以将属性准确绑定到相应的对象上，尤其是在包含多个属性-对象对的提示中。这一挑战主要源于常用文本编码器（例如CLIP）的局限性，它们可能无法有效编码复杂的语言关系和修饰语。现有方法试图通过推理期间的注意力图控制，以及训练期间使用布局信息或微调来缓解这些问题，但随着提示复杂度的增加，它们的性能会下降。在这项工作中，我们引入了一种新颖的组合生成方法，该方法利用成对图像嵌入来改善属性-对象绑定。我们的方法将复杂提示分解为子提示，生成相应的图像，并计算视觉原型，再将其与文本嵌入融合以增强表示。通过应用基于分割的定位训练，我们解决了交叉注意力未对齐的问题，提高了将多个属性绑定到对象的准确性。我们的方法在基准T2I CompBench上优于现有的组合式文本到图像扩散模型，取得了更好的（由人类评估的）图像质量，并在提示中绑定对数量增加时表现出涌现的鲁棒性。

Title: Incorporating Inductive Biases to Energy-based Generative Models

Authors: Yukun Li, Li-Ping Liu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.01111
Pdf URL: https://arxiv.org/pdf/2505.01111
Copy Paste: [[2505.01111]] Incorporating Inductive Biases to Energy-based Generative Models(https://arxiv.org/abs/2505.01111)
Keywords: generation, generative
Abstract: With the advent of score-matching techniques for model training and Langevin dynamics for sample generation, energy-based models (EBMs) have gained renewed interest as generative models. Recent EBMs usually use neural networks to define their energy functions. In this work, we introduce a novel hybrid approach that combines an EBM with an exponential family model to incorporate inductive bias into data modeling. Specifically, we augment the energy term with a parameter-free statistic function to help the model capture key data statistics. Like an exponential family model, the hybrid model aims to align the distribution statistics with data statistics during model training, even when it only approximately maximizes the data likelihood. This property enables us to impose constraints on the hybrid model. Our empirical study validates the hybrid model's ability to match statistics. Furthermore, experimental results show that data fitting and generation improve when suitable informative statistics are incorporated into the hybrid model.
摘要：随着用于模型训练的分数匹配技术和用于样本生成的Langevin动力学的出现，基于能量的模型（EBM）作为生成模型重新受到关注。近期的EBM通常使用神经网络来定义其能量函数。在这项工作中，我们引入了一种新颖的混合方法，将EBM与指数族模型相结合，从而把归纳偏置纳入数据建模。具体而言，我们用一个无参数的统计量函数来增广能量项，以帮助模型捕获关键的数据统计量。与指数族模型类似，混合模型在训练过程中力求使分布统计量与数据统计量保持一致，即使它只是近似地最大化数据似然。这一性质使我们能够对混合模型施加约束。我们的实证研究验证了混合模型匹配统计量的能力。此外，实验结果表明，当把合适的、有信息量的统计量纳入混合模型时，数据拟合和生成效果都会得到改善。

Title: Aggregation of Dependent Expert Distributions in Multimodal Variational Autoencoders

Authors: Rogelio A Mancisidor, Robert Jenssen, Shujian Yu, Michael Kampffmeyer
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.01134
Pdf URL: https://arxiv.org/pdf/2505.01134
Copy Paste: [[2505.01134]] Aggregation of Dependent Expert Distributions in Multimodal Variational Autoencoders(https://arxiv.org/abs/2505.01134)
Keywords: generative
Abstract: Multimodal learning with variational autoencoders (VAEs) requires estimating joint distributions to evaluate the evidence lower bound (ELBO). Current methods, the product and mixture of experts, aggregate single-modality distributions assuming independence for simplicity, which is an overoptimistic assumption. This research introduces a novel methodology for aggregating single-modality distributions by exploiting the principle of consensus of dependent experts (CoDE), which circumvents the aforementioned assumption. Utilizing the CoDE method, we propose a novel ELBO that approximates the joint likelihood of the multimodal data by learning the contribution of each subset of modalities. The resulting CoDE-VAE model demonstrates better performance in terms of balancing the trade-off between generative coherence and generative quality, as well as generating more precise log-likelihood estimations. CoDE-VAE further minimizes the generative quality gap as the number of modalities increases. In certain cases, it reaches a generative quality similar to that of unimodal VAEs, which is a desirable property that is lacking in most current methods. Finally, the classification accuracy achieved by CoDE-VAE is comparable to that of state-of-the-art multimodal VAE models.
摘要：使用变分自编码器（VAE）进行多模态学习需要估计联合分布以评估证据下界（ELBO）。当前的方法，即专家乘积和专家混合，为简单起见在假设独立性的前提下聚合单模态分布，而这是一个过于乐观的假设。这项研究介绍了一种利用依赖专家共识（CoDE）原则来聚合单模态分布的新方法，它规避了上述假设。利用CoDE方法，我们提出了一种新颖的ELBO，它通过学习每个模态子集的贡献来近似多模态数据的联合似然。由此得到的CoDE-VAE模型在平衡生成一致性与生成质量之间的权衡方面表现更好，并能产生更精确的对数似然估计。随着模态数量的增加，CoDE-VAE进一步缩小了生成质量差距。在某些情况下，它达到了与单模态VAE相近的生成质量，这是大多数现有方法所缺乏的理想特性。最后，CoDE-VAE取得的分类精度与最先进的多模态VAE模型相当。
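
Note: the abstract above contrasts CoDE with product-of-experts aggregation, which assumes the single-modality experts are independent. As a rough illustration of that baseline only (not the CoDE method), the sketch below combines univariate Gaussian experts by summing their precisions. The function name and the (mean, variance) tuple layout are assumptions made for this example.

```python
from typing import List, Tuple


def product_of_gaussian_experts(
    experts: List[Tuple[float, float]]
) -> Tuple[float, float]:
    """Combine univariate Gaussian experts given as (mean, variance).

    Under the independence assumption, the product of Gaussian densities is
    proportional to a Gaussian whose precision is the sum of the expert
    precisions and whose mean is the precision-weighted average of the means.
    """
    if not experts:
        raise ValueError("at least one expert is required")
    total_precision = 0.0
    weighted_mean_sum = 0.0
    for mean, variance in experts:
        if variance <= 0.0:
            raise ValueError("variances must be positive")
        total_precision += 1.0 / variance
        weighted_mean_sum += mean / variance
    return weighted_mean_sum / total_precision, 1.0 / total_precision


# Two equally confident experts that disagree: the product sits halfway
# between them and is more confident than either expert alone.
joint_mean, joint_variance = product_of_gaussian_experts([(0.0, 1.0), (2.0, 1.0)])
```

The joint variance shrinks with every added expert regardless of how correlated the experts are, which is one way to see why the independence assumption can be overoptimistic.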

Title: Harmonizing Intra-coherence and Inter-divergence in Ensemble Attacks for Adversarial Transferability

Authors: Zhaoyang Ma, Zhihao Wu, Wang Lu, Xin Gao, Jinghang Yue, Taolin Zhang, Lipo Wang, Youfang Lin, Jing Wang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.01168
Pdf URL: https://arxiv.org/pdf/2505.01168
Copy Paste: [[2505.01168]] Harmonizing Intra-coherence and Inter-divergence in Ensemble Attacks for Adversarial Transferability(https://arxiv.org/abs/2505.01168)
Keywords: generation
Abstract: The development of model ensemble attacks has significantly improved the transferability of adversarial examples, but this progress also poses severe threats to the security of deep neural networks. Existing methods, however, face two critical challenges: insufficient capture of shared gradient directions across models and a lack of adaptive weight allocation mechanisms. To address these issues, we propose a novel method Harmonized Ensemble for Adversarial Transferability (HEAT), which introduces domain generalization into adversarial example generation for the first time. HEAT consists of two key modules: Consensus Gradient Direction Synthesizer, which uses Singular Value Decomposition to synthesize shared gradient directions; and Dual-Harmony Weight Orchestrator which dynamically balances intra-domain coherence, stabilizing gradients within individual models, and inter-domain diversity, enhancing transferability across models. Experimental results demonstrate that HEAT significantly outperforms existing methods across various datasets and settings, offering a new perspective and direction for adversarial attack research.
摘要：模型集成攻击的发展显著提高了对抗样本的可迁移性，但这一进展也对深度神经网络的安全构成了严重威胁。然而，现有方法面临两个关键挑战：对跨模型共享梯度方向的捕获不足，以及缺乏自适应的权重分配机制。为了解决这些问题，我们提出了一种新方法，即面向对抗可迁移性的协调集成（HEAT），它首次将域泛化引入对抗样本生成。HEAT由两个关键模块组成：共识梯度方向合成器，它使用奇异值分解来合成共享梯度方向；以及双重和谐权重编排器，它动态平衡域内一致性（稳定单个模型内的梯度）与域间多样性（增强跨模型的可迁移性）。实验结果表明，HEAT在各种数据集和设置上显著优于现有方法，为对抗攻击研究提供了新的视角和方向。
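
Note: the abstract above says shared gradient directions are synthesized with Singular Value Decomposition but gives no further detail. One plausible reading, sketched here under that assumption, is to stack the per-model gradients as rows of a matrix and take its top right singular vector. To stay dependency-free the sketch uses power iteration in place of a full SVD; the function name is hypothetical, and this is not the HEAT module itself.

```python
import math
from typing import List


def dominant_gradient_direction(
    grads: List[List[float]], iters: int = 100
) -> List[float]:
    """Approximate the top right singular vector of the gradient matrix G.

    Each row of G is one model's gradient. Power iteration on G^T G converges
    to the direction that captures the most gradient energy across models.
    Starting from the mean gradient fixes the sign of the result; this assumes
    the mean gradient is not orthogonal to the dominant direction.
    """
    dim = len(grads[0])
    v = [sum(g[j] for g in grads) / len(grads) for j in range(dim)]
    for _ in range(iters):
        # Project the current direction onto every model gradient: G v.
        projections = [sum(gj * vj for gj, vj in zip(g, v)) for g in grads]
        # Map back to input space: G^T (G v).
        w = [sum(p * g[j] for p, g in zip(projections, grads)) for j in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return [0.0] * dim  # degenerate case: no shared direction found
        v = [x / norm for x in w]
    return v


# Three models agree on the first coordinate and disagree on the second.
shared = dominant_gradient_direction([[1.0, 0.0], [1.0, 0.1], [1.0, -0.1]])
```

In this toy case the disagreement in the second coordinate cancels, so the recovered direction points along the first axis.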

Title: Distilling Two-Timed Flow Models by Separately Matching Initial and Terminal Velocities

Authors: Pramook Khungurn, Pratch Piyawongwisal, Sira Sriswadi, Supasorn Suwajanakorn
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.01169
Pdf URL: https://arxiv.org/pdf/2505.01169
Copy Paste: [[2505.01169]] Distilling Two-Timed Flow Models by Separately Matching Initial and Terminal Velocities(https://arxiv.org/abs/2505.01169)
Keywords: generation
Abstract: A flow matching model learns a time-dependent vector field $v_t(x)$ that generates a probability path $\{ p_t \}_{0 \leq t \leq 1}$ that interpolates between a well-known noise distribution ($p_0$) and the data distribution ($p_1$). It can be distilled into a two-timed flow model (TTFM) $\phi_{s,x}(t)$ that can transform a sample belonging to the distribution at an initial time $s$ to another belonging to the distribution at a terminal time $t$ in one function evaluation. We present a new loss function for TTFM distillation called the initial/terminal velocity matching (ITVM) loss that extends the Lagrangian Flow Map Distillation (LFMD) loss proposed by Boffi et al. by adding redundant terms to match the initial velocities at time $s$, removing the derivative from the terminal velocity term at time $t$, and using a version of the model under training, stabilized by exponential moving averaging (EMA), to compute the target terminal average velocity. Preliminary experiments show that our loss leads to better few-step generation performance on multiple types of datasets and model architectures over baselines.
摘要：流匹配模型学习一个与时间相关的向量场 $v_t(x)$，它生成一条概率路径 $\{ p_t \}_{0 \leq t \leq 1}$，该路径在一个已知的噪声分布（$p_0$）与数据分布（$p_1$）之间进行插值。它可以被蒸馏为一个双时间流模型（TTFM）$\phi_{s,x}(t)$，该模型只需一次函数求值，即可将属于初始时间 $s$ 处分布的样本变换为属于终止时间 $t$ 处分布的样本。我们为TTFM蒸馏提出了一种新的损失函数，称为初始/终端速度匹配（ITVM）损失，它扩展了Boffi等人提出的拉格朗日流映射蒸馏（LFMD）损失：增加冗余项以匹配时间 $s$ 处的初始速度，去掉时间 $t$ 处终端速度项中的导数，并使用经指数移动平均（EMA）稳定的训练中模型版本来计算目标终端平均速度。初步实验表明，与基线相比，我们的损失在多种类型的数据集和模型架构上带来了更好的少步生成性能。
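
Note: the abstract above computes its target with a copy of the model stabilized by exponential moving averaging (EMA). A minimal sketch of the generic EMA parameter update follows; it is not the authors' training code. The flat list of floats standing in for model parameters, the function name, and the default decay are all assumptions for illustration.

```python
from typing import List


def ema_update(
    ema_params: List[float], params: List[float], decay: float = 0.999
) -> List[float]:
    """Return the EMA parameters after seeing the latest training parameters.

    Each EMA parameter moves a fraction (1 - decay) of the way toward the
    current parameter, so the averaged copy changes more smoothly than the
    model being trained.
    """
    if len(ema_params) != len(params):
        raise ValueError("parameter lists must have the same length")
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must lie in [0, 1]")
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]


# One update step with an exaggerated decay so the effect is visible.
target_params = ema_update([1.0, 2.0], [0.0, 2.0], decay=0.5)
```

In a training loop the update would typically run once per optimizer step, with the averaged copy used only to compute targets.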

Title: FreePCA: Integrating Consistency Information across Long-short Frames in Training-free Long Video Generation via Principal Component Analysis

Authors: Jiangtong Tan, Hu Yu, Jie Huang, Jie Xiao, Feng Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.01172
Pdf URL: https://arxiv.org/pdf/2505.01172
Copy Paste: [[2505.01172]] FreePCA: Integrating Consistency Information across Long-short Frames in Training-free Long Video Generation via Principal Component Analysis(https://arxiv.org/abs/2505.01172)
Keywords: generation
Abstract: Long video generation involves generating extended videos using models trained on short videos, suffering from distribution shifts due to varying frame counts. It necessitates the use of local information from the original short frames to enhance visual and motion quality, and global information from the entire long frames to ensure appearance consistency. Existing training-free methods struggle to effectively integrate the benefits of both, as appearance and motion in videos are closely coupled, leading to motion inconsistency and degraded visual quality. In this paper, we reveal that global and local information can be precisely decoupled into consistent appearance and motion intensity information by applying Principal Component Analysis (PCA), allowing for refined complementary integration of global consistency and local quality. With this insight, we propose FreePCA, a training-free long video generation paradigm based on PCA that simultaneously achieves high consistency and quality. Concretely, we decouple consistent appearance and motion intensity features by measuring cosine similarity in the principal component space. Critically, we progressively integrate these features to preserve original quality and ensure smooth transitions, while further enhancing consistency by reusing the mean statistics of the initial noise. Experiments demonstrate that FreePCA can be applied to various video diffusion models without requiring training, leading to substantial improvements. Code is available at this https URL.
摘要：长视频生成是指使用在短视频上训练的模型来生成更长的视频，由于帧数变化，它会受到分布偏移的影响。它既需要利用原始短帧中的局部信息来提升视觉和运动质量，也需要利用整个长帧中的全局信息来确保外观一致性。由于视频中的外观和运动紧密耦合，现有的免训练方法难以有效整合两者的优势，从而导致运动不一致和视觉质量下降。在本文中，我们揭示了通过应用主成分分析（PCA），可以将全局和局部信息精确解耦为一致的外观信息和运动强度信息，从而实现全局一致性与局部质量的精细互补整合。基于这一见解，我们提出了FreePCA，这是一种基于PCA的免训练长视频生成范式，可同时实现高一致性和高质量。具体而言，我们通过在主成分空间中度量余弦相似度来解耦一致的外观特征和运动强度特征。关键的是，我们逐步整合这些特征，以保持原始质量并确保平滑过渡，同时通过复用初始噪声的均值统计量进一步增强一致性。实验表明，FreePCA无需训练即可应用于各种视频扩散模型，并带来显著改进。代码可在此https URL获取。

Title: TSTMotion: Training-free Scene-aware Text-to-motion Generation

Authors: Ziyan Guo, Haoxuan Qu, Hossein Rahmani, Dewen Soh, Ping Hu, Qiuhong Ke, Jun Liu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.01182
Pdf URL: https://arxiv.org/pdf/2505.01182
Copy Paste: [[2505.01182]] TSTMotion: Training-free Scene-aware Text-to-motion Generation(https://arxiv.org/abs/2505.01182)
Keywords: generation
Abstract: Text-to-motion generation has recently garnered significant research interest, primarily focusing on generating human motion sequences in blank backgrounds. However, human motions commonly occur within diverse 3D scenes, which has prompted exploration into scene-aware text-to-motion generation methods. Yet, existing scene-aware methods often rely on large-scale ground-truth motion sequences in diverse 3D scenes, which poses practical challenges due to the expensive cost. To mitigate this challenge, we are the first to propose a Training-free Scene-aware Text-to-Motion framework, dubbed TSTMotion, that efficiently empowers pre-trained blank-background motion generators with the scene-aware capability. Specifically, conditioned on the given 3D scene and text description, we adopt foundation models together to reason, predict and validate a scene-aware motion guidance. Then, the motion guidance is incorporated into the blank-background motion generators with two modifications, resulting in scene-aware text-driven motion sequences. Extensive experiments demonstrate the efficacy and generalizability of our proposed framework. We release our code on the Project Page (this https URL).
摘要：文本到动作生成最近引起了广泛的研究兴趣，主要集中于在空白背景中生成人体运动序列。然而，人体运动通常发生在各种各样的3D场景中，这促使人们探索场景感知的文本到动作生成方法。然而，现有的场景感知方法通常依赖于多样化3D场景中的大规模真值运动序列，由于成本高昂，这带来了实际的挑战。为了缓解这一挑战，我们首次提出了一个免训练的场景感知文本到动作框架，称为TSTMotion，它能高效地赋予预训练的空白背景运动生成器场景感知能力。具体来说，在给定3D场景和文本描述的条件下，我们联合采用多个基础模型来推理、预测和验证场景感知的运动引导。然后，通过两处修改将该运动引导融入空白背景运动生成器，从而得到场景感知的文本驱动运动序列。大量实验证明了我们所提出框架的有效性和泛化能力。我们在项目页面（此https URL）上发布了代码。

Title: Enhancing Obsolescence Forecasting with Deep Generative Data Augmentation: A Semi-Supervised Framework for Low-Data Industrial Applications

Authors: Elie Saad, Mariem Besbes, Marc Zolghadri, Victor Czmil, Claude Baron, Vincent Bourgeois
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.01261
Pdf URL: https://arxiv.org/pdf/2505.01261
Copy Paste: [[2505.01261]] Enhancing Obsolescence Forecasting with Deep Generative Data Augmentation: A Semi-Supervised Framework for Low-Data Industrial Applications(https://arxiv.org/abs/2505.01261)
Keywords: generative
Abstract: The challenge of electronic component obsolescence is particularly critical in systems with long life cycles. Various obsolescence management methods are employed to mitigate its impact, with obsolescence forecasting being a highly sought-after and prominent approach. As a result, numerous machine learning-based forecasting methods have been proposed. However, machine learning models require a substantial amount of relevant data to achieve high precision, which is lacking in the current obsolescence landscape in some situations. This work introduces a novel framework for obsolescence forecasting based on deep learning. The proposed framework solves the lack of available data through deep generative modeling, where new obsolescence cases are generated and used to augment the training dataset. The augmented dataset is then used to train a classical machine learning-based obsolescence forecasting model. To train classical forecasting models using augmented datasets, existing classical supervised-learning classifiers are adapted for semi-supervised learning within this framework. The proposed framework demonstrates state-of-the-art results on benchmarking datasets.
摘要：电子元器件停产过时的挑战在长生命周期系统中尤为严峻。人们采用各种过时管理方法来减轻其影响，其中过时预测是一种备受青睐的重要方法。因此，已有许多基于机器学习的预测方法被提出。然而，机器学习模型需要大量相关数据才能达到高精度，而在某些情况下，当前的过时领域缺乏这样的数据。这项工作介绍了一个基于深度学习的过时预测新框架。所提出的框架通过深度生成建模来解决可用数据不足的问题，即生成新的过时案例并用于扩充训练数据集。随后，扩充后的数据集被用于训练基于经典机器学习的过时预测模型。为了使用扩充数据集训练经典预测模型，该框架对现有的经典监督学习分类器进行了调整，使其适用于半监督学习。所提出的框架在基准数据集上取得了最先进的结果。

Title: FreeInsert: Disentangled Text-Guided Object Insertion in 3D Gaussian Scene without Spatial Priors

Authors: Chenxi Li, Weijie Wang, Qiang Li, Bruno Lepri, Nicu Sebe, Weizhi Nie
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.01322
Pdf URL: https://arxiv.org/pdf/2505.01322
Copy Paste: [[2505.01322]] FreeInsert: Disentangled Text-Guided Object Insertion in 3D Gaussian Scene without Spatial Priors(https://arxiv.org/abs/2505.01322)
Keywords: generation
Abstract: Text-driven object insertion in 3D scenes is an emerging task that enables intuitive scene editing through natural language. However, existing 2D editing-based methods often rely on spatial priors such as 2D masks or 3D bounding boxes, and they struggle to ensure consistency of the inserted object. These limitations hinder flexibility and scalability in real-world applications. In this paper, we propose FreeInsert, a novel framework that leverages foundation models including MLLMs, LGMs, and diffusion models to disentangle object generation from spatial placement. This enables unsupervised and flexible object insertion in 3D scenes without spatial priors. FreeInsert starts with an MLLM-based parser that extracts structured semantics, including object types, spatial relationships, and attachment regions, from user instructions. These semantics guide both the reconstruction of the inserted object for 3D consistency and the learning of its degrees of freedom. We leverage the spatial reasoning capabilities of MLLMs to initialize object pose and scale. A hierarchical, spatially aware refinement stage further integrates spatial semantics and MLLM-inferred priors to enhance placement. Finally, the appearance of the object is improved using the inserted-object image to enhance visual fidelity. Experimental results demonstrate that FreeInsert achieves semantically coherent, spatially precise, and visually realistic 3D insertions without relying on spatial priors, offering a user-friendly and flexible editing experience.
摘要：3D场景中由文本驱动的对象插入是一项新兴任务，它使人们能够通过自然语言进行直观的场景编辑。然而，现有的基于2D编辑的方法通常依赖于空间先验，例如2D掩码或3D边界框，并且难以确保插入对象的一致性。这些限制阻碍了其在现实应用中的灵活性和可扩展性。在本文中，我们提出了FreeInsert，这是一个新颖的框架，它利用包括MLLM、LGM和扩散模型在内的基础模型，将对象生成与空间放置解耦。这使得无需空间先验即可在3D场景中进行无监督且灵活的对象插入。FreeInsert首先使用基于MLLM的解析器，从用户指令中提取结构化语义，包括对象类型、空间关系和附着区域。这些语义既指导为保证3D一致性而进行的插入对象重建，也指导其自由度的学习。我们利用MLLM的空间推理能力来初始化对象的姿态和尺度。一个分层的、具有空间感知能力的细化阶段进一步整合空间语义和MLLM推断的先验，以改进放置效果。最后，利用插入对象的图像来改善对象的外观，以增强视觉保真度。实验结果表明，FreeInsert无需依赖空间先验即可实现语义连贯、空间精确且视觉逼真的3D插入，提供了用户友好且灵活的编辑体验。

Title: VIDSTAMP: A Temporally-Aware Watermark for Ownership and Integrity in Video Diffusion Models

Authors: Mohammadreza Teymoorianfard, Shiqing Ma, Amir Houmansadr
Subjects: cs.CV, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2505.01406
Pdf URL: https://arxiv.org/pdf/2505.01406
Copy Paste: [[2505.01406]] VIDSTAMP: A Temporally-Aware Watermark for Ownership and Integrity in Video Diffusion Models(https://arxiv.org/abs/2505.01406)
Keywords: generation
Abstract: The rapid rise of video diffusion models has enabled the generation of highly realistic and temporally coherent videos, raising critical concerns about content authenticity, provenance, and misuse. Existing watermarking approaches, whether passive, post-hoc, or adapted from image-based techniques, often struggle to withstand video-specific manipulations such as frame insertion, dropping, or reordering, and typically degrade visual quality. In this work, we introduce VIDSTAMP, a watermarking framework that embeds per-frame or per-segment messages directly into the latent space of temporally-aware video diffusion models. By fine-tuning the model's decoder through a two-stage pipeline, first on static image datasets to promote spatial message separation, and then on synthesized video sequences to restore temporal consistency, VIDSTAMP learns to embed high-capacity, flexible watermarks with minimal perceptual impact. Leveraging architectural components such as 3D convolutions and temporal attention, our method imposes no additional inference cost and offers better perceptual quality than prior methods, while maintaining comparable robustness against common distortions and tampering. VIDSTAMP embeds 768 bits per video (48 bits per frame) with a bit accuracy of 95.0%, achieves a log P-value of -166.65 (lower is better), and maintains a video quality score of 0.836, comparable to unwatermarked outputs (0.838) and surpassing prior methods in capacity-quality tradeoffs. Code: this https URL
摘要：视频扩散模型的迅速兴起使得生成高度逼真且时间连贯的视频成为可能，这引发了人们对内容真实性、来源和滥用的严重关切。现有的水印方法，无论是被动式、事后式，还是改编自基于图像的技术，往往难以抵御帧插入、丢帧或重排序等视频特有的操作，并且通常会降低视觉质量。在这项工作中，我们介绍了VIDSTAMP，这是一个水印框架，它将逐帧或逐片段的消息直接嵌入到具有时间感知能力的视频扩散模型的潜在空间中。通过两阶段流程对模型的解码器进行微调，首先在静态图像数据集上促进空间消息分离，然后在合成视频序列上恢复时间一致性，VIDSTAMP学会了以最小的感知影响嵌入高容量、灵活的水印。借助3D卷积和时间注意力等架构组件，我们的方法不会带来额外的推理成本，并且比先前的方法提供更好的感知质量，同时对常见失真和篡改保持相当的鲁棒性。VIDSTAMP在每个视频中嵌入768位（每帧48位），位准确率为95.0%，对数P值达到-166.65（越低越好），并保持0.836的视频质量得分，与未加水印的输出（0.838）相当，在容量与质量的权衡上超越了先前的方法。代码：此https URL