2025-02-06

Title: MIND: Microstructure INverse Design with Generative Hybrid Neural Representation

Authors: Tianyang Xue, Haochen Li, Longdu Liu, Paul Henderson, Pengbin Tang, Lin Lu, Jikai Liu, Haisen Zhao, Hao Peng, Bernd Bickel
Subjects: cs.CV, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2502.02607
Pdf URL: https://arxiv.org/pdf/2502.02607
Copy Paste: [[2502.02607]] MIND: Microstructure INverse Design with Generative Hybrid Neural Representation(https://arxiv.org/abs/2502.02607)
Keywords: generation, generative
Abstract: The inverse design of microstructures plays a pivotal role in optimizing metamaterials with specific, targeted physical properties. While traditional forward design methods are constrained by their inability to explore the vast combinatorial design space, inverse design offers a compelling alternative by directly generating structures that fulfill predefined performance criteria. However, achieving precise control over both geometry and material properties remains a significant challenge due to their intricate interdependence. Existing approaches, which typically rely on voxel or parametric representations, often limit design flexibility and structural diversity. In this work, we present a novel generative model that integrates latent diffusion with Holoplane, an advanced hybrid neural representation that simultaneously encodes both geometric and physical properties. This combination ensures superior alignment between geometry and properties. Our approach generalizes across multiple microstructure classes, enabling the generation of diverse, tileable microstructures with significantly improved property accuracy and enhanced control over geometric validity, surpassing the performance of existing methods. We introduce a multi-class dataset encompassing a variety of geometric morphologies, including truss, shell, tube, and plate structures, to train and validate our model. Experimental results demonstrate the model's ability to generate microstructures that meet target properties, maintain geometric validity, and integrate seamlessly into complex assemblies. Additionally, we explore the potential of our framework through the generation of new microstructures, cross-class interpolation, and the infilling of heterogeneous microstructures. The dataset and source code will be open-sourced upon publication.

Title: e-SimFT: Alignment of Generative Models with Simulation Feedback for Pareto-Front Design Exploration

Authors: Hyunmin Cheong, Mohammadmehdi Ataei, Amir Hosein Khasahmadi, Pradeep Kumar Jayaraman
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2502.02628
Pdf URL: https://arxiv.org/pdf/2502.02628
Copy Paste: [[2502.02628]] e-SimFT: Alignment of Generative Models with Simulation Feedback for Pareto-Front Design Exploration(https://arxiv.org/abs/2502.02628)
Keywords: generation, generative
Abstract: Deep generative models have recently shown success in solving complex engineering design problems where models predict solutions that address the design requirements specified as input. However, there remains a challenge in aligning such models for effective design exploration. For many design problems, finding a solution that meets all the requirements is infeasible. In such a case, engineers prefer to obtain a set of Pareto optimal solutions with respect to those requirements, but uniform sampling of generative models may not yield a useful Pareto front. To address this gap, we introduce a new framework for Pareto-front design exploration with simulation fine-tuned generative models. First, the framework adopts preference alignment methods developed for Large Language Models (LLMs) and showcases the first application in fine-tuning a generative model for engineering design. The important distinction here is that we use a simulator instead of humans to provide accurate and scalable feedback. Next, we propose epsilon-sampling, inspired by the epsilon-constraint method used for Pareto-front generation with classical optimization algorithms, to construct a high-quality Pareto front with the fine-tuned models. Our framework, named e-SimFT, is shown to produce better-quality Pareto fronts than existing multi-objective alignment methods.
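
The classical epsilon-constraint method that the abstract cites as the inspiration for epsilon-sampling can be illustrated on a finite set of candidate designs: minimize one objective subject to an upper bound epsilon on the other, then sweep epsilon. The sketch below shows only that classical idea, not e-SimFT's sampling procedure; the candidate values, the two-objective setup, and all function names are hypothetical.

```python
from typing import List, Optional, Tuple

Design = Tuple[float, float]  # (objective_1, objective_2), both to be minimized


def epsilon_constraint(candidates: List[Design], eps: float) -> Optional[Design]:
    """Minimize objective_1 subject to objective_2 <= eps (None if infeasible)."""
    feasible = [c for c in candidates if c[1] <= eps]
    if not feasible:
        return None
    # Tie-break on objective_2 so the chosen point is never weakly dominated.
    return min(feasible, key=lambda c: (c[0], c[1]))


def pareto_front_by_sweep(candidates: List[Design], eps_values: List[float]) -> List[Design]:
    """Collect the distinct constrained optima found while sweeping eps."""
    front: List[Design] = []
    for eps in eps_values:
        best = epsilon_constraint(candidates, eps)
        if best is not None and best not in front:
            front.append(best)
    return sorted(front)


# Hypothetical designs, e.g. (mass, deflection) pairs returned by a simulator.
candidates = [(1.0, 9.0), (2.0, 5.0), (3.0, 6.0), (4.0, 2.0), (5.0, 3.0)]
front = pareto_front_by_sweep(candidates, eps_values=[2.0, 5.0, 9.0])
print(front)
```

Each bound on the second objective yields one non-dominated design, so a denser sweep of epsilon fills in the front more evenly than unconstrained uniform sampling would.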

Title: On Teacher Hacking in Language Model Distillation

Authors: Daniil Tiapkin, Daniele Calandriello, Johan Ferret, Sarah Perrin, Nino Vieillard, Alexandre Ramé, Mathieu Blondel
Subjects: cs.LG, cs.AI, cs.CL, stat.ML
Abstract URL: https://arxiv.org/abs/2502.02671
Pdf URL: https://arxiv.org/pdf/2502.02671
Copy Paste: [[2502.02671]] On Teacher Hacking in Language Model Distillation(https://arxiv.org/abs/2502.02671)
Keywords: generation
Abstract: Post-training of language models (LMs) increasingly relies on the following two stages: (i) knowledge distillation, where the LM is trained to imitate a larger teacher LM, and (ii) reinforcement learning from human feedback (RLHF), where the LM is aligned by optimizing a reward model. In the second stage, RLHF, a well-known challenge is reward hacking, where the LM over-optimizes the reward model. Such a phenomenon is in line with Goodhart's law and can lead to degraded performance on the true objective. In this paper, we investigate whether a similar phenomenon, which we call teacher hacking, can occur during knowledge distillation. This could arise because the teacher LM is itself an imperfect approximation of the true distribution. To study this, we propose a controlled experimental setup involving: (i) an oracle LM representing the ground-truth distribution, (ii) a teacher LM distilled from the oracle, and (iii) a student LM distilled from the teacher. Our experiments reveal the following insights. When using a fixed offline dataset for distillation, teacher hacking occurs; moreover, we can detect it by observing when the optimization process deviates from polynomial convergence laws. In contrast, employing online data generation techniques effectively mitigates teacher hacking. More precisely, we identify data diversity as the key factor in preventing hacking. Overall, our findings provide a deeper understanding of the benefits and limitations of distillation for building robust and efficient LMs.

Title: Blind Visible Watermark Removal with Morphological Dilation

Authors: Preston K. Robinette, Taylor T. Johnson
Subjects: cs.CV, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2502.02676
Pdf URL: https://arxiv.org/pdf/2502.02676
Copy Paste: [[2502.02676]] Blind Visible Watermark Removal with Morphological Dilation(https://arxiv.org/abs/2502.02676)
Keywords: restoration
Abstract: Visible watermarks pose significant challenges for image restoration techniques, especially when the target background is unknown. Toward this end, we present MorphoMod, a novel method for automated visible watermark removal that operates in a blind setting -- without requiring target images. Unlike existing methods, MorphoMod effectively removes opaque and transparent watermarks while preserving semantic content, making it well-suited for real-world applications. Evaluations on benchmark datasets, including the Colored Large-scale Watermark Dataset (CLWD), LOGO-series, and the newly introduced Alpha1 datasets, demonstrate that MorphoMod achieves up to a 50.8% improvement in watermark removal effectiveness compared to state-of-the-art methods. Ablation studies highlight the impact of prompts used for inpainting, pre-removal filling strategies, and inpainting model performance on watermark removal. Additionally, a case study on steganographic disorientation reveals broader applications for watermark removal in disrupting high-level hidden messages. MorphoMod offers a robust, adaptable solution for watermark removal and opens avenues for further advancements in image restoration and adversarial manipulation.

Title: Controllable Video Generation with Provable Disentanglement

Authors: Yifan Shen, Peiyuan Zhu, Zijian Li, Shaoan Xie, Zeyu Tang, Namrata Deka, Zongfang Liu, Guangyi Chen, Kun Zhang
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.02690
Pdf URL: https://arxiv.org/pdf/2502.02690
Copy Paste: [[2502.02690]] Controllable Video Generation with Provable Disentanglement(https://arxiv.org/abs/2502.02690)
Keywords: generation, generative
Abstract: Controllable video generation remains a significant challenge, despite recent advances in generating high-quality and consistent videos. Most existing methods for controlling video generation treat the video as a whole, neglecting intricate fine-grained spatiotemporal relationships, which limits both control precision and efficiency. In this paper, we propose Controllable Video Generative Adversarial Networks (CoVoGAN) to disentangle the video concepts, thus facilitating efficient and independent control over individual concepts. Specifically, following the minimal change principle, we first disentangle static and dynamic latent variables. We then leverage the sufficient change property to achieve component-wise identifiability of dynamic latent variables, enabling independent control over motion and identity. To establish the theoretical foundation, we provide a rigorous analysis demonstrating the identifiability of our approach. Building on these theoretical insights, we design a Temporal Transition Module to disentangle latent dynamics. To enforce the minimal change principle and sufficient change property, we minimize the dimensionality of latent dynamic variables and impose temporal conditional independence. To validate our approach, we integrate this module as a plug-in for GANs. Extensive qualitative and quantitative experiments on various video generation benchmarks demonstrate that our method significantly improves generation quality and controllability across diverse real-world scenarios.

Title: A Unified Understanding and Evaluation of Steering Methods

Authors: Shawn Im, Yixuan Li
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2502.02716
Pdf URL: https://arxiv.org/pdf/2502.02716
Copy Paste: [[2502.02716]] A Unified Understanding and Evaluation of Steering Methods(https://arxiv.org/abs/2502.02716)
Keywords: generation
Abstract: Steering methods provide a practical approach to controlling large language models by applying steering vectors to intermediate activations, guiding outputs toward desired behaviors while avoiding retraining. Despite their growing importance, the field lacks a unified understanding and consistent evaluation across tasks and datasets, hindering progress. This paper introduces a unified framework for analyzing and evaluating steering methods, formalizing their core principles and offering theoretical insights into their effectiveness. Through comprehensive empirical evaluations on multiple-choice and open-ended text generation tasks, we validate these insights, identifying key factors that influence performance and demonstrating the superiority of certain methods. Our work bridges theoretical and practical perspectives, offering actionable guidance for advancing the design, optimization, and deployment of steering methods in LLMs.
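
Applying a steering vector to an intermediate activation amounts to adding a scaled direction to a hidden state. The sketch below uses a difference-of-means direction, which is one common construction and is an assumption here, since the abstract does not say which steering methods the paper covers. The toy activations, the scale `alpha`, and the function names are hypothetical.

```python
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(positive: List[Vector], negative: List[Vector]) -> Vector:
    """Difference of mean activations: desired behavior minus undesired behavior."""
    mu_pos, mu_neg = mean_vector(positive), mean_vector(negative)
    return [p - q for p, q in zip(mu_pos, mu_neg)]


def steer(activation: Vector, direction: Vector, alpha: float) -> Vector:
    """Shift a hidden state along the steering direction; no weights are changed."""
    return [h + alpha * d for h, d in zip(activation, direction)]


# Toy 2-dimensional "activations" collected at one layer (made-up numbers).
positive = [[1.0, 2.0], [3.0, 2.0]]
negative = [[0.0, 1.0], [2.0, 1.0]]
v = steering_vector(positive, negative)
steered = steer([0.5, -0.5], v, alpha=2.0)
print(v, steered)
```

In a real model the same addition would be applied inside a forward hook at the chosen layer, and `alpha` trades off steering strength against output quality.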

Title: LLM Bandit: Cost-Efficient LLM Generation via Preference-Conditioned Dynamic Routing

Authors: Yang Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.02743
Pdf URL: https://arxiv.org/pdf/2502.02743
Copy Paste: [[2502.02743]] LLM Bandit: Cost-Efficient LLM Generation via Preference-Conditioned Dynamic Routing(https://arxiv.org/abs/2502.02743)
Keywords: generation
Abstract: The rapid advancement in large language models (LLMs) has brought forth a diverse range of models with varying capabilities that excel in different tasks and domains. However, selecting the optimal LLM for user queries often involves a challenging trade-off between accuracy and cost, a problem exacerbated by the diverse demands of individual queries. In this work, we present a novel framework that formulates the LLM selection process as a multi-armed bandit problem, enabling dynamic and intelligent routing of queries to the most appropriate model. Our approach incorporates a preference-conditioned dynamic routing mechanism, allowing users to specify their preferences at inference time, thereby offering a customizable balance between performance and cost. Additionally, our selection policy is designed to generalize to unseen LLMs, ensuring adaptability to new models as they emerge. Experimental results demonstrate that our method achieves significant improvements in both accuracy and cost-effectiveness across various LLM platforms, showcasing the potential of our framework to adaptively optimize LLM selection in real-world scenarios.
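
To make the bandit framing concrete, the sketch below routes queries with plain UCB1 and folds the user's preference into a scalar reward, quality minus a cost weight times cost. This is a simplified stand-in and not the paper's preference-conditioned policy, which also generalizes to unseen models. The model names, the (quality, cost) numbers, and the `cost_weight` parameter are hypothetical.

```python
import math
from typing import Dict, List, Tuple


class UCB1Router:
    """UCB1 over candidate models with a cost-penalized scalar reward."""

    def __init__(self, models: List[str], cost_weight: float) -> None:
        self.models = models
        self.cost_weight = cost_weight  # user preference: larger means more cost-averse
        self.counts: Dict[str, int] = {m: 0 for m in models}
        self.totals: Dict[str, float] = {m: 0.0 for m in models}

    def select(self) -> str:
        # Try every model once before trusting any estimate.
        for m in self.models:
            if self.counts[m] == 0:
                return m
        t = sum(self.counts.values())

        def ucb(m: str) -> float:
            mean = self.totals[m] / self.counts[m]
            return mean + math.sqrt(2.0 * math.log(t) / self.counts[m])

        return max(self.models, key=ucb)

    def update(self, model: str, quality: float, cost: float) -> None:
        self.counts[model] += 1
        self.totals[model] += quality - self.cost_weight * cost


# Hypothetical (quality, cost) per model, deterministic for illustration.
PROFILE: Dict[str, Tuple[float, float]] = {"small": (0.6, 0.1), "large": (0.9, 1.0)}


def simulate(cost_weight: float, rounds: int = 500) -> Dict[str, int]:
    """Run the router against the fixed profile and return per-model pull counts."""
    router = UCB1Router(list(PROFILE), cost_weight)
    for _ in range(rounds):
        choice = router.select()
        quality, cost = PROFILE[choice]
        router.update(choice, quality, cost)
    return router.counts


print(simulate(cost_weight=0.0))  # quality-only preference
print(simulate(cost_weight=1.0))  # cost-averse preference
```

Changing only the preference weight flips which model receives most of the traffic, which is the trade-off the abstract describes as adjustable at inference time.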

Title: A Survey of Sample-Efficient Deep Learning for Change Detection in Remote Sensing: Tasks, Strategies, and Challenges

Authors: Lei Ding, Danfeng Hong, Maofan Zhao, Hongruixuan Chen, Chenyu Li, Jie Deng, Naoto Yokoya, Lorenzo Bruzzone, Jocelyn Chanussot
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.02835
Pdf URL: https://arxiv.org/pdf/2502.02835
Copy Paste: [[2502.02835]] A Survey of Sample-Efficient Deep Learning for Change Detection in Remote Sensing: Tasks, Strategies, and Challenges(https://arxiv.org/abs/2502.02835)
Keywords: generation
Abstract: In the last decade, the rapid development of deep learning (DL) has made it possible to perform automatic, accurate, and robust Change Detection (CD) on large volumes of Remote Sensing Images (RSIs). However, despite advances in CD methods, their practical application in real-world contexts remains limited by the diversity of input data and application contexts. For example, the collected RSIs can be time-series observations, and more informative results are required to indicate the time of change or the specific change category. Moreover, training a Deep Neural Network (DNN) requires a massive number of training samples, whereas in many cases these samples are difficult to collect. To address these challenges, various specific CD methods have been developed for different application scenarios and training resources. Additionally, recent advancements in image generation, self-supervision, and visual foundation models (VFMs) have opened up new approaches to the 'data-hungry' issue of DL-based CD. The development of these methods in broader application scenarios requires further investigation and discussion. Therefore, this article summarizes methods from the literature for different CD tasks, along with the available strategies and techniques to train and deploy DL-based CD methods in sample-limited scenarios. We expect that this survey can provide new insights and inspiration for researchers in this field to develop more effective CD methods that can be applied in a wider range of contexts.

Title: PH-VAE: A Polynomial Hierarchical Variational Autoencoder Towards Disentangled Representation Learning

Authors: Xi Chen, Shaofan Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.02856
Pdf URL: https://arxiv.org/pdf/2502.02856
Copy Paste: [[2502.02856]] PH-VAE: A Polynomial Hierarchical Variational Autoencoder Towards Disentangled Representation Learning(https://arxiv.org/abs/2502.02856)
Keywords: generation, generative
Abstract: The variational autoencoder (VAE) is a simple and efficient generative artificial intelligence method for modeling complex probability distributions of various types of data, such as images and texts. However, it suffers from several major shortcomings, such as a lack of interpretability in the latent variables, difficulty in tuning hyperparameters during training, blurry or unrealistic downstream outputs and loss of information due to how it calculates loss functions and recovers data distributions, overfitting, and the origin gravity effect for small data sets, among other issues. These and other limitations lead to unsatisfactory generation quality for data with complex distributions. In this work, we propose and develop a polynomial hierarchical variational autoencoder (PH-VAE), in which we use a polynomial hierarchical data format to generate or reconstruct the data distributions. In doing so, we also propose a novel Polynomial Divergence in the loss function to replace or generalize the Kullback-Leibler (KL) divergence. This results in systematic and drastic improvements in both the accuracy and reproducibility of the reconstructed distribution function, as well as in the quality of reconstructed data images, while keeping the dataset size the same but capturing the fine resolution of the data. Moreover, we show that the proposed PH-VAE has some form of disentangled representation learning ability.

Title: Elucidating the Preconditioning in Consistency Distillation

Authors: Kaiwen Zheng, Guande He, Jianfei Chen, Fan Bao, Jun Zhu
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2502.02922
Pdf URL: https://arxiv.org/pdf/2502.02922
Copy Paste: [[2502.02922]] Elucidating the Preconditioning in Consistency Distillation(https://arxiv.org/abs/2502.02922)
Keywords: generation
Abstract: Consistency distillation is a prevalent way of accelerating diffusion models, adopted in consistency (trajectory) models, in which a student model is trained to traverse backward on the probability flow (PF) ordinary differential equation (ODE) trajectory determined by the teacher model. Preconditioning is a vital technique for stabilizing consistency distillation: it linearly combines the input data and the network output with pre-defined coefficients to form the consistency function. It imposes the boundary condition of consistency functions without restricting the form and expressiveness of the neural network. However, previous preconditionings are hand-crafted and may be suboptimal choices. In this work, we offer the first theoretical insights into the preconditioning in consistency distillation, by elucidating its design criteria and the connection to the teacher ODE trajectory. Based on these analyses, we further propose a principled way, dubbed Analytic-Precond, to analytically optimize the preconditioning according to the consistency gap (defined as the gap between the teacher denoiser and the optimal student denoiser) on a generalized teacher ODE. We demonstrate that Analytic-Precond can facilitate the learning of trajectory jumpers, enhance the alignment of the student trajectory with the teacher's, and achieve 2× to 3× training acceleration of consistency trajectory models in multi-step generation across various datasets.
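
The role of preconditioning can be seen in a few lines: the consistency function is c_skip(t) * x + c_out(t) * F(x, t), and choosing c_skip(eps) = 1 and c_out(eps) = 0 enforces the boundary condition f(x, eps) = x for any network F. The coefficients below are the hand-crafted ones from the original consistency models formulation, with sigma_data = 0.5 and eps = 0.002 as commonly used defaults. They are an example of the "previous preconditionings" the abstract mentions, not Analytic-Precond itself, and `toy_network` is a hypothetical stand-in for a neural network.

```python
import math
from typing import Callable

SIGMA_DATA = 0.5  # assumed data standard deviation
EPS = 0.002       # smallest time on the trajectory


def c_skip(t: float) -> float:
    """Weight on the noisy input; equals 1 at t = EPS."""
    return SIGMA_DATA ** 2 / ((t - EPS) ** 2 + SIGMA_DATA ** 2)


def c_out(t: float) -> float:
    """Weight on the network output; equals 0 at t = EPS."""
    return SIGMA_DATA * (t - EPS) / math.sqrt(SIGMA_DATA ** 2 + t ** 2)


def consistency_fn(network: Callable[[float, float], float], x: float, t: float) -> float:
    """Preconditioned consistency function f(x, t) = c_skip(t) * x + c_out(t) * F(x, t)."""
    return c_skip(t) * x + c_out(t) * network(x, t)


def toy_network(x: float, t: float) -> float:
    """Arbitrary stand-in for a neural network; its form does not matter at t = EPS."""
    return math.sin(3.0 * x) + t


x = 1.7
print(consistency_fn(toy_network, x, EPS))   # boundary: returns x itself
print(consistency_fn(toy_network, x, 10.0))  # away from the boundary the network dominates
```

Because the boundary condition is carried by the coefficients alone, the network stays unconstrained, which is why the choice of coefficients is a free design decision that can be optimized.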

Title: Fast T2T: Optimization Consistency Speeds Up Diffusion-Based Training-to-Testing Solving for Combinatorial Optimization

Authors: Yang Li, Jinpei Guo, Runzhong Wang, Hongyuan Zha, Junchi Yan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.02941
Pdf URL: https://arxiv.org/pdf/2502.02941
Copy Paste: [[2502.02941]] Fast T2T: Optimization Consistency Speeds Up Diffusion-Based Training-to-Testing Solving for Combinatorial Optimization(https://arxiv.org/abs/2502.02941)
Keywords: generation, generative
Abstract: Diffusion models have recently advanced Combinatorial Optimization (CO) as a powerful backbone for neural solvers. However, their iterative sampling process requiring denoising across multiple noise levels incurs substantial overhead. We propose to learn direct mappings from different noise levels to the optimal solution for a given instance, facilitating high-quality generation with minimal shots. This is achieved through an optimization consistency training protocol, which, for a given instance, minimizes the difference among samples originating from varying generative trajectories and time steps relative to the optimal solution. The proposed model enables fast single-step solution generation while retaining the option of multi-step sampling to trade for sampling quality, which offers a more effective and efficient alternative backbone for neural solvers. In addition, within the training-to-testing (T2T) framework, to bridge the gap between training on historical instances and solving new instances, we introduce a novel consistency-based gradient search scheme during the test stage, enabling more effective exploration of the solution space learned during training. It is achieved by updating the latent solution probabilities under objective gradient guidance during the alternation of noise injection and denoising steps. We refer to this model as Fast T2T. Extensive experiments on two popular tasks, the Traveling Salesman Problem (TSP) and Maximal Independent Set (MIS), demonstrate the superiority of Fast T2T regarding both solution quality and efficiency, even outperforming LKH given limited time budgets. Notably, Fast T2T with merely one-step generation and one-step gradient search can mostly outperform the SOTA diffusion-based counterparts that require hundreds of steps, while achieving tens of times speedup.

Title: Membership Inference Attack Should Move On to Distributional Statistics for Distilled Generative Models

Authors: Muxing Li, Zesheng Ye, Yixuan Li, Andy Song, Guangquan Zhang, Feng Liu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.02970
Pdf URL: https://arxiv.org/pdf/2502.02970
Copy Paste: [[2502.02970]] Membership Inference Attack Should Move On to Distributional Statistics for Distilled Generative Models(https://arxiv.org/abs/2502.02970)
Keywords: generative
Abstract: Membership inference attacks (MIAs) determine whether certain data instances were used to train a model by exploiting the differences in how the model responds to seen versus unseen instances. This capability makes MIAs important in assessing privacy leakage within modern generative AI systems. However, this paper reveals an oversight in existing MIAs against distilled generative models: attackers can no longer detect a teacher model's training instances individually when targeting the distilled student model, as the student learns from the teacher-generated data rather than its original member data, preventing direct instance-level memorization. Nevertheless, we find that student-generated samples exhibit a significantly stronger distributional alignment with the teacher's member data than with non-member data. This leads us to posit that MIAs on distilled generative models should shift from instance-level to distribution-level statistics. We thereby introduce a set-based MIA framework that measures relative distributional discrepancies between student-generated datasets and potential member/non-member datasets. Empirically, distributional statistics reliably distinguish a teacher's member data from non-member data through the distilled model. Finally, we discuss scenarios in which our setup faces limitations.
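
A set-level comparison of this kind can be sketched with a two-sample statistic. The example below uses the squared maximum mean discrepancy (MMD) with an RBF kernel on one-dimensional toy data. MMD is an assumption chosen for illustration, since the abstract does not name the statistic the paper uses, and the sample values, bandwidth, and function names are hypothetical.

```python
import math
from typing import Sequence


def rbf(x: float, y: float, bandwidth: float) -> float:
    """Gaussian (RBF) kernel between two scalars."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))


def mean_kernel(xs: Sequence[float], ys: Sequence[float], bandwidth: float) -> float:
    """Average kernel value over all pairs (x, y)."""
    total = sum(rbf(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(xs: Sequence[float], ys: Sequence[float], bandwidth: float = 1.0) -> float:
    """Biased estimate of squared MMD between the two sample sets."""
    return (
        mean_kernel(xs, xs, bandwidth)
        + mean_kernel(ys, ys, bandwidth)
        - 2.0 * mean_kernel(xs, ys, bandwidth)
    )


# Made-up samples: the student's outputs and two candidate sets to be judged.
student = [-0.2, 0.0, 0.1, 0.3]
candidate_a = [-0.1, 0.05, 0.2, 0.25]  # distributed like the student's outputs
candidate_b = [2.8, 3.0, 3.1, 3.3]     # distributed differently

gap_a = mmd_squared(student, candidate_a)
gap_b = mmd_squared(student, candidate_b)
# Relative decision: the set with the smaller discrepancy is the likelier member set.
likely_member = "candidate_a" if gap_a < gap_b else "candidate_b"
print(gap_a, gap_b, likely_member)
```

The decision is relative between two sets, which mirrors the abstract's emphasis on comparing discrepancies instead of scoring individual instances.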

Title: Analyze Feature Flow to Enhance Interpretation and Steering in Language Models

Authors: Daniil Laptev, Nikita Balagansky, Yaroslav Aksenov, Daniil Gavrilov
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2502.03032
Pdf URL: https://arxiv.org/pdf/2502.03032
Copy Paste: [[2502.03032]] Analyze Feature Flow to Enhance Interpretation and Steering in Language Models(https://arxiv.org/abs/2502.03032)
Keywords: generation
Abstract: We introduce a new approach to systematically map features discovered by sparse autoencoders across consecutive layers of large language models, extending earlier work that examined inter-layer feature links. By using a data-free cosine similarity technique, we trace how specific features persist, transform, or first appear at each stage. This method yields granular flow graphs of feature evolution, enabling fine-grained interpretability and mechanistic insights into model computations. Crucially, we demonstrate how these cross-layer feature maps facilitate direct steering of model behavior by amplifying or suppressing chosen features, achieving targeted thematic control in text generation. Together, our findings highlight the utility of a causal, cross-layer interpretability framework that not only clarifies how features develop through forward passes but also provides new means for transparent manipulation of large language models.
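
One way to read "data-free cosine similarity" is to compare the feature direction vectors of two sparse autoencoders directly, without running any inputs through the model. That reading is an assumption, since the abstract does not spell out the procedure. The sketch below links each feature in one layer to its most similar feature in the next, or to nothing if no similarity clears a threshold; the toy vectors, the threshold, and the function names are hypothetical.

```python
import math
from typing import Dict, List, Optional

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; defined as 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0


def match_features(
    layer_a: List[Vector], layer_b: List[Vector], threshold: float
) -> Dict[int, Optional[int]]:
    """Map each feature index in layer_a to its best match in layer_b, or None."""
    links: Dict[int, Optional[int]] = {}
    for i, feat in enumerate(layer_a):
        sims = [cosine(feat, other) for other in layer_b]
        best = max(range(len(sims)), key=lambda j: sims[j])
        links[i] = best if sims[best] >= threshold else None
    return links


# Toy feature directions for two consecutive layers (made-up numbers).
layer_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
layer_b = [[0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
links = match_features(layer_a, layer_b, threshold=0.5)
print(links)
```

A feature mapped to `None` has no successor in the next layer, while a feature in the next layer that nothing maps to would count as newly appearing; chaining such links over layers gives a flow graph of the kind the abstract describes.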

Title: Symmetry-Aware Bayesian Flow Networks for Crystal Generation

Authors: Laura Ruple, Luca Torresi, Henrik Schopmans, Pascal Friederich
Subjects: cs.LG, cond-mat.mtrl-sci
Abstract URL: https://arxiv.org/abs/2502.03146
Pdf URL: https://arxiv.org/pdf/2502.03146
Copy Paste: [[2502.03146]] Symmetry-Aware Bayesian Flow Networks for Crystal Generation(https://arxiv.org/abs/2502.03146)
Keywords: generation, generative
Abstract: The discovery of new crystalline materials is essential to scientific and technological progress. However, traditional trial-and-error approaches are inefficient due to the vast search space. Recent advancements in machine learning have enabled generative models to predict new stable materials by incorporating structural symmetries and to condition the generation on desired properties. In this work, we introduce SymmBFN, a novel symmetry-aware Bayesian Flow Network (BFN) for crystalline material generation that accurately reproduces the distribution of space groups found in experimentally observed crystals. SymmBFN substantially improves efficiency, generating stable structures at least 50 times faster than the next-best method. Furthermore, we demonstrate its capability for property-conditioned generation, enabling the design of materials with tailored properties. Our findings establish BFNs as an effective tool for accelerating the discovery of crystalline materials.

Title: PICBench: Benchmarking LLMs for Photonic Integrated Circuits Design

Authors: Yuchao Wu, Xiaofei Yu, Hao Chen, Yang Luo, Yeyu Tong, Yuzhe Ma
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.03159
Pdf URL: https://arxiv.org/pdf/2502.03159
Copy Paste: [[2502.03159]] PICBench: Benchmarking LLMs for Photonic Integrated Circuits Design(https://arxiv.org/abs/2502.03159)
Keywords: generation
Abstract: While large language models (LLMs) have shown remarkable potential in automating various tasks in digital chip design, the field of Photonic Integrated Circuits (PICs), a promising solution to advanced chip designs, remains relatively unexplored in this context. The design of PICs is time-consuming and prone to errors due to the extensive and repetitive nature of code involved in photonic chip design. In this paper, we introduce PICBench, the first benchmarking and evaluation framework specifically designed to automate PIC design generation using LLMs, where the generated output takes the form of a netlist. Our benchmark consists of dozens of meticulously crafted PIC design problems, spanning from fundamental device designs to more complex circuit-level designs. It automatically evaluates both the syntax and functionality of generated PIC designs by comparing simulation outputs with expert-written solutions, leveraging an open-source simulator. We evaluate a range of existing LLMs, while also conducting comparative tests on various prompt engineering techniques to enhance LLM performance in automated PIC design. The results reveal the challenges and potential of LLMs in the PIC design domain, offering insights into the key areas that require further research and development to optimize automation in this field. Our benchmark and evaluation code is available at this https URL.

Title: MotionAgent: Fine-grained Controllable Video Generation via Motion Field Agent

Authors: Xinyao Liao, Xianfang Zeng, Liao Wang, Gang Yu, Guosheng Lin, Chi Zhang
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2502.03207
Pdf URL: https://arxiv.org/pdf/2502.03207
Copy Paste: [[2502.03207]] MotionAgent: Fine-grained Controllable Video Generation via Motion Field Agent(https://arxiv.org/abs/2502.03207)
Keywords: generation
Abstract: We propose MotionAgent, enabling fine-grained motion control for text-guided image-to-video generation. The key technique is the motion field agent that converts motion information in text prompts into explicit motion fields, providing flexible and precise motion guidance. Specifically, the agent extracts the object movement and camera motion described in the text and converts them into object trajectories and camera extrinsics, respectively. An analytical optical flow composition module integrates these motion representations in 3D space and projects them into a unified optical flow. An optical flow adapter takes the flow to control the base image-to-video diffusion model for generating fine-grained controlled videos. The significant improvement in the Video-Text Camera Motion metrics on VBench indicates that our method achieves precise control over camera motion. We construct a subset of VBench to evaluate the alignment of motion information in the text and the generated video, outperforming other advanced models on motion generation accuracy.
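
The abstract describes projecting camera motion in 3D space into an optical flow field. The following is a minimal sketch of that general idea only, not the authors' composition module: it assumes a static scene, a known per-pixel depth map, pinhole intrinsics `K`, and a relative pose `(R, t)` mapping first-frame camera coordinates to second-frame camera coordinates. The function name `camera_motion_flow` and all parameters are hypothetical.

```python
import numpy as np

def camera_motion_flow(depth, K, R, t):
    """Optical flow induced by a rigid camera motion over a static scene.

    depth: (h, w) depth of each pixel in the first camera frame
    K:     (3, 3) pinhole intrinsics (assumed shared by both frames)
    R, t:  rotation (3, 3) and translation (3,) taking first-frame
           camera coordinates to second-frame camera coordinates
    Returns an (h, w, 2) array of per-pixel displacements (du, dv).
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w, dtype=np.float64),
                       np.arange(h, dtype=np.float64))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)   # homogeneous pixels
    rays = pix @ np.linalg.inv(K).T                     # back-project to rays
    pts = rays * depth[..., None]                       # 3D points, frame 1
    pts2 = pts @ R.T + t                                # 3D points, frame 2
    proj = pts2 @ K.T                                   # re-project
    uv2 = proj[..., :2] / proj[..., 2:3]                # perspective divide
    return uv2 - pix[..., :2]

# Example: a sideways shift of the scene by 0.1 units at constant depth 2.
K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 24.0],
              [0.0, 0.0, 1.0]])
depth = np.full((48, 64), 2.0)
flow = camera_motion_flow(depth, K, np.eye(3), np.array([0.1, 0.0, 0.0]))
```

With a pure translation and constant depth, every pixel moves by the same amount (focal length times translation divided by depth). Object trajectories, which the abstract handles separately, are not modeled in this sketch.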

Title: General Time-series Model for Universal Knowledge Representation of Multivariate Time-Series data

Authors: Cheng He, Xu Huang, Gangwei Jiang, Zhaoyi Li, Defu Lian, Hong Xie, Enhong Chen, Xijie Liang, Zengrong Zheng
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.03264
Pdf URL: https://arxiv.org/pdf/2502.03264
Copy Paste: [[2502.03264]] General Time-series Model for Universal Knowledge Representation of Multivariate Time-Series data(https://arxiv.org/abs/2502.03264)
Keywords: generative
Abstract: Universal knowledge representation is a central problem for multivariate time series (MTS) foundation models and yet remains open. This paper investigates this problem from first principles and makes four contributions. First, a new empirical finding is revealed: time series with different time granularities (or corresponding frequency resolutions) exhibit distinct joint distributions in the frequency domain. This implies a crucial aspect of learning universal knowledge, one that has been overlooked by previous studies. Second, a novel Fourier knowledge attention mechanism is proposed to enable learning time-granularity-aware representations from both the temporal and frequency domains. Third, an autoregressive blank infilling pre-training framework is incorporated into time series analysis for the first time, leading to a generative-task-agnostic pre-training strategy. To this end, we develop the General Time-series Model (GTM), a unified MTS foundation model that addresses the limitation of contemporary time series models, which often require token-level, pre-training-level, or model-level customizations for downstream task adaptation. Fourth, extensive experiments show that GTM outperforms state-of-the-art (SOTA) methods across all generative tasks, including long-term forecasting, anomaly detection, and imputation.

Title: RadVLM: A Multitask Conversational Vision-Language Model for Radiology

Authors: Nicolas Deperrois, Hidetoshi Matsuo, Samuel Ruipérez-Campillo, Moritz Vandenhirtz, Sonia Laguna, Alain Ryser, Koji Fujimoto, Mizuho Nishio, Thomas M. Sutter, Julia E. Vogt, Jonas Kluckert, Thomas Frauenfelder, Christian Blüthgen, Farhad Nooralahzadeh, Michael Krauthammer
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2502.03333
Pdf URL: https://arxiv.org/pdf/2502.03333
Copy Paste: [[2502.03333]] RadVLM: A Multitask Conversational Vision-Language Model for Radiology(https://arxiv.org/abs/2502.03333)
Keywords: generation
Abstract: The widespread use of chest X-rays (CXRs), coupled with a shortage of radiologists, has driven growing interest in automated CXR analysis and AI-assisted reporting. While existing vision-language models (VLMs) show promise in specific tasks such as report generation or abnormality detection, they often lack support for interactive diagnostic capabilities. In this work we present RadVLM, a compact, multitask conversational foundation model designed for CXR interpretation. To this end, we curate a large-scale instruction dataset comprising over 1 million image-instruction pairs containing both single-turn tasks -- such as report generation, abnormality classification, and visual grounding -- and multi-turn, multi-task conversational interactions. After fine-tuning RadVLM on this instruction dataset, we evaluate it across different tasks along with re-implemented baseline VLMs. Our results show that RadVLM achieves state-of-the-art performance in conversational capabilities and visual grounding while remaining competitive in other radiology tasks. Ablation studies further highlight the benefit of joint training across multiple tasks, particularly for scenarios with limited annotated data. Together, these findings highlight the potential of RadVLM as a clinically relevant AI assistant, providing structured CXR interpretation and conversational capabilities to support more effective and accessible diagnostic workflows.

Title: Can Text-to-Image Generative Models Accurately Depict Age? A Comparative Study on Synthetic Portrait Generation and Age Estimation

Authors: Alexey A. Novikov, Miroslav Vranka, François David, Artem Voronin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.03420
Pdf URL: https://arxiv.org/pdf/2502.03420
Copy Paste: [[2502.03420]] Can Text-to-Image Generative Models Accurately Depict Age? A Comparative Study on Synthetic Portrait Generation and Age Estimation(https://arxiv.org/abs/2502.03420)
Keywords: generation, generative
Abstract: Text-to-image generative models have shown remarkable progress in producing diverse and photorealistic outputs. In this paper, we present a comprehensive analysis of their effectiveness in creating synthetic portraits that accurately represent various demographic attributes, with a special focus on age, nationality, and gender. Our evaluation employs prompts specifying detailed profiles (e.g., Photorealistic selfie photo of a 32-year-old Canadian male), covering a broad spectrum of 212 nationalities, 30 distinct ages from 10 to 78, and balanced gender representation. We compare the generated images against ground truth age estimates from two established age estimation models to assess how faithfully age is depicted. Our findings reveal that although text-to-image models can consistently generate faces reflecting different identities, the accuracy with which they capture specific ages and do so across diverse demographic backgrounds remains highly variable. These results suggest that current synthetic data may be insufficiently reliable for high-stakes age-related tasks requiring robust precision, unless practitioners are prepared to invest in significant filtering and curation. Nevertheless, they may still be useful in less sensitive or exploratory applications, where absolute age precision is not critical.
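
The evaluation above relies on a grid of prompts built from age, nationality, and gender. As a rough sketch of how such a grid could be assembled and scored, the snippet below reuses the example prompt wording from the abstract. The short attribute lists are placeholders (the abstract does not list the actual 30 ages or 212 nationalities), and using mean absolute error as the score is an assumption, not a metric stated in the abstract.

```python
from itertools import product

# Template follows the example prompt quoted in the abstract.
TEMPLATE = "Photorealistic selfie photo of a {age}-year-old {nationality} {gender}"

def build_prompts(ages, nationalities, genders):
    """Return one prompt per (age, nationality, gender) combination."""
    return [
        TEMPLATE.format(age=a, nationality=n, gender=g)
        for a, n, g in product(ages, nationalities, genders)
    ]

def mean_abs_age_error(prompted_ages, estimated_ages):
    """Mean absolute gap between prompted ages and estimator outputs.

    Hypothetical scoring helper: the estimated ages would come from an
    external age estimation model, which is not included here.
    """
    pairs = list(zip(prompted_ages, estimated_ages))
    return sum(abs(p - e) for p, e in pairs) / len(pairs)

# Placeholder attribute lists, chosen only for illustration.
ages = [10, 32, 78]
nationalities = ["Canadian", "Japanese"]
genders = ["male", "female"]

prompts = build_prompts(ages, nationalities, genders)
error = mean_abs_age_error([10, 32, 78], [14.0, 30.0, 70.0])
```

Pairing every age with every nationality and gender keeps the gender split balanced by construction, which matches the balanced representation the abstract describes.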

Title: TruePose: Human-Parsing-guided Attention Diffusion for Full-ID Preserving Pose Transfer

Authors: Zhihong Xu, Dongxia Wang, Peng Du, Yang Cao, Qing Guo
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2502.03426
Pdf URL: https://arxiv.org/pdf/2502.03426
Copy Paste: [[2502.03426]] TruePose: Human-Parsing-guided Attention Diffusion for Full-ID Preserving Pose Transfer(https://arxiv.org/abs/2502.03426)
Keywords: generation
Abstract: Pose-Guided Person Image Synthesis (PGPIS) generates images that maintain a subject's identity from a source image while adopting a specified target pose (e.g., skeleton). While diffusion-based PGPIS methods effectively preserve facial features during pose transformation, they often struggle to accurately maintain clothing details from the source image throughout the diffusion process. This limitation becomes particularly problematic when there is a substantial difference between the source and target poses, significantly impacting PGPIS applications in the fashion industry where clothing style preservation is crucial for copyright protection. Our analysis reveals that this limitation primarily stems from the conditional diffusion model's attention modules failing to adequately capture and preserve clothing patterns. To address this limitation, we propose human-parsing-guided attention diffusion, a novel approach that effectively preserves both facial and clothing appearance while generating high-quality results. We propose a human-parsing-aware Siamese network that consists of three key components: dual identical UNets (TargetNet for diffusion denoising and SourceNet for source image embedding extraction), a human-parsing-guided fusion attention (HPFA), and a CLIP-guided attention alignment (CAA). The HPFA and CAA modules can embed the face and clothes patterns into the target image generation adaptively and effectively. Extensive experiments on both the in-shop clothes retrieval benchmark and the latest in-the-wild human editing dataset demonstrate our method's significant advantages over 13 baseline approaches for preserving both facial and clothes appearance in the source image.

Title: Masked Autoencoders Are Effective Tokenizers for Diffusion Models

Authors: Hao Chen, Yujin Han, Fangyi Chen, Xiang Li, Yidong Wang, Jindong Wang, Ze Wang, Zicheng Liu, Difan Zou, Bhiksha Raj
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.03444
Pdf URL: https://arxiv.org/pdf/2502.03444
Copy Paste: [[2502.03444]] Masked Autoencoders Are Effective Tokenizers for Diffusion Models(https://arxiv.org/abs/2502.03444)
Keywords: generation
Abstract: Recent advances in latent diffusion models have demonstrated their effectiveness for high-resolution image synthesis. However, the properties of the tokenizer's latent space that support better learning and generation in diffusion models remain under-explored. Theoretically and empirically, we find that improved generation quality is closely tied to latent distributions with better structure, such as those with fewer Gaussian Mixture modes and more discriminative features. Motivated by these insights, we propose MAETok, an autoencoder (AE) leveraging mask modeling to learn a semantically rich latent space while maintaining reconstruction fidelity. Extensive experiments validate our analysis, demonstrating that the variational form of autoencoders is not necessary, and a discriminative latent space from an AE alone enables state-of-the-art performance on ImageNet generation using only 128 tokens. MAETok achieves significant practical improvements, enabling a gFID of 1.69 with 76x faster training and 31x higher inference throughput for 512x512 generation. Our findings show that the structure of the latent space, rather than variational constraints, is crucial for effective diffusion models. Code and trained models are released.
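
For readers unfamiliar with mask modeling, the snippet below sketches the generic random token masking used in masked-autoencoder-style training: a random subset of tokens is hidden and only the rest is passed to the encoder. This illustrates the general technique, not MAETok's actual implementation; the function name, the 0.75 mask ratio, and the token dimensions are assumptions made for the example.

```python
import numpy as np

def random_mask_tokens(tokens, mask_ratio, rng):
    """Randomly hide a fraction of tokens, as in masked autoencoding.

    tokens:     (n, d) array of token embeddings
    mask_ratio: fraction of tokens to hide (assumed value, not from the paper)
    rng:        numpy random Generator
    Returns the visible tokens, a boolean mask (True = hidden), and the
    sorted indices of the visible tokens.
    """
    n = tokens.shape[0]
    n_keep = int(round(n * (1.0 - mask_ratio)))
    perm = rng.permutation(n)                 # random order of token indices
    keep_idx = np.sort(perm[:n_keep])         # visible tokens, original order
    mask = np.ones(n, dtype=bool)
    mask[keep_idx] = False
    return tokens[keep_idx], mask, keep_idx

rng = np.random.default_rng(0)
tokens = rng.normal(size=(128, 16))           # 128 tokens of width 16
visible, mask, keep_idx = random_mask_tokens(tokens, 0.75, rng)
```

An encoder would then see only `visible`, and a decoder would be trained to reconstruct the content at the positions where `mask` is true.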

Title: Dress-1-to-3: Single Image to Simulation-Ready 3D Outfit with Diffusion Prior and Differentiable Physics

Authors: Xuan Li, Chang Yu, Wenxin Du, Ying Jiang, Tianyi Xie, Yunuo Chen, Yin Yang, Chenfanfu Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.03449
Pdf URL: https://arxiv.org/pdf/2502.03449
Copy Paste: [[2502.03449]] Dress-1-to-3: Single Image to Simulation-Ready 3D Outfit with Diffusion Prior and Differentiable Physics(https://arxiv.org/abs/2502.03449)
Keywords: generation
Abstract: Recent advances in large models have significantly advanced image-to-3D reconstruction. However, the generated models are often fused into a single piece, limiting their applicability in downstream tasks. This paper focuses on 3D garment generation, a key area for applications like virtual try-on with dynamic garment animations, which require garments to be separable and simulation-ready. We introduce Dress-1-to-3, a novel pipeline that reconstructs physics-plausible, simulation-ready separated garments with sewing patterns and humans from an in-the-wild image. Starting with the image, our approach combines a pre-trained image-to-sewing pattern generation model for creating coarse sewing patterns with a pre-trained multi-view diffusion model to produce multi-view images. The sewing pattern is further refined using a differentiable garment simulator based on the generated multi-view images. Versatile experiments demonstrate that our optimization approach substantially enhances the geometric alignment of the reconstructed 3D garments and humans with the input image. Furthermore, by integrating a texture generation module and a human motion generation module, we produce customized physics-plausible and realistic dynamic garment demonstrations. Project page: this https URL

Title: A Schema-Guided Reason-while-Retrieve framework for Reasoning on Scene Graphs with Large-Language-Models (LLMs)

Authors: Yiye Chen, Harpreet Sawhney, Nicholas Gydé, Yanan Jian, Jack Saunders, Patricio Vela, Ben Lundell
Subjects: cs.LG, cs.AI, cs.MA, cs.RO
Abstract URL: https://arxiv.org/abs/2502.03450
Pdf URL: https://arxiv.org/pdf/2502.03450
Copy Paste: [[2502.03450]] A Schema-Guided Reason-while-Retrieve framework for Reasoning on Scene Graphs with Large-Language-Models (LLMs)(https://arxiv.org/abs/2502.03450)
Keywords: generation
Abstract: Scene graphs have emerged as a structured and serializable environment representation for grounded spatial reasoning with Large Language Models (LLMs). In this work, we propose SG-RwR, a Schema-Guided Retrieve-while-Reason framework for reasoning and planning with scene graphs. Our approach employs two cooperative, code-writing LLM agents: (1) a Reasoner for task planning and information query generation, and (2) a Retriever for extracting the corresponding graph information following the queries. The two agents collaborate iteratively, enabling sequential reasoning and adaptive attention to graph information. Unlike prior works, both agents are prompted only with the scene graph schema rather than the full graph data, which reduces hallucination by limiting input tokens and drives the Reasoner to generate a reasoning trace. With the trace, the Retriever programmatically queries the scene graph data based on its understanding of the schema, allowing dynamic and global attention on the graph that enhances alignment between reasoning and retrieval. Through experiments in multiple simulation environments, we show that our framework surpasses existing LLM-based approaches in numerical Q&A and planning tasks, and can benefit from task-level few-shot examples, even in the absence of agent-level demonstrations. Project code will be released.

Title: SKI Models: Skeleton Induced Vision-Language Embeddings for Understanding Activities of Daily Living

Authors: Arkaprava Sinha, Dominick Reilly, Francois Bremond, Pu Wang, Srijan Das
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.03459
Pdf URL: https://arxiv.org/pdf/2502.03459
Copy Paste: [[2502.03459]] SKI Models: Skeleton Induced Vision-Language Embeddings for Understanding Activities of Daily Living(https://arxiv.org/abs/2502.03459)
Keywords: generation
Abstract: The introduction of vision-language models like CLIP has enabled the development of foundational video models capable of generalizing to unseen videos and human actions. However, these models are typically trained on web videos, which often fail to capture the challenges present in Activities of Daily Living (ADL) videos. Existing works address ADL-specific challenges, such as similar appearances, subtle motion patterns, and multiple viewpoints, by combining 3D skeletons and RGB videos. However, these approaches are not integrated with language, limiting their ability to generalize to unseen action classes. In this paper, we introduce SKI models, which integrate 3D skeletons into the vision-language embedding space. SKI models leverage a skeleton-language model, SkeletonCLIP, to infuse skeleton information into Vision Language Models (VLMs) and Large Vision Language Models (LVLMs) through collaborative training. Notably, SKI models do not require skeleton data during inference, enhancing their robustness for real-world applications. The effectiveness of SKI models is validated on three popular ADL datasets for zero-shot action recognition and video caption generation tasks.