2025-02-18

Title: Leveraging Constraint Violation Signals For Action-Constrained Reinforcement Learning

Authors: Janaka Chathuranga Brahmanage, Jiajing Ling, Akshat Kumar
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2502.10431
Pdf URL: https://arxiv.org/pdf/2502.10431
Copy Paste: [[2502.10431]] Leveraging Constraint Violation Signals For Action-Constrained Reinforcement Learning(https://arxiv.org/abs/2502.10431)
Keywords: generative
Abstract: In many RL applications, ensuring an agent's actions adhere to constraints is crucial for safety. Most previous methods in Action-Constrained Reinforcement Learning (ACRL) employ a projection layer after the policy network to correct the action. However, projection-based methods suffer from issues like the zero gradient problem and higher runtime due to the usage of optimization solvers. Recently, methods were proposed to train generative models to learn a differentiable mapping between latent variables and feasible actions to address this issue. However, generative models require training using samples from the constrained action space, which itself is challenging. To address such limitations, first, we define a target distribution for feasible actions based on constraint violation signals, and train normalizing flows by minimizing the KL divergence between an approximated distribution over feasible actions and the target. This eliminates the need to generate feasible action samples, greatly simplifying the flow model learning. Second, we integrate the learned flow model with existing deep RL methods, restricting the agent to exploring only the feasible action space. Third, we extend our approach beyond ACRL to handle state-wise constraints by learning the constraint violation signal from the environment. Empirically, our approach has significantly fewer constraint violations while achieving similar or better quality in several control tasks than previous best methods.

Title: A Survey of Representation Learning, Optimization Strategies, and Applications for Omnidirectional Vision

Authors: Hao Ai, Zidong Cao, Lin Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.10444
Pdf URL: https://arxiv.org/pdf/2502.10444
Copy Paste: [[2502.10444]] A Survey of Representation Learning, Optimization Strategies, and Applications for Omnidirectional Vision(https://arxiv.org/abs/2502.10444)
Keywords: super-resolution, generation
Abstract: Omnidirectional image (ODI) data is captured with a field-of-view of 360°×180°, which is much wider than that of pinhole cameras and captures richer details of the surrounding environment than conventional perspective images. In recent years, the availability of consumer-level 360° cameras has made omnidirectional vision more popular, and the advance of deep learning (DL) has significantly sparked its research and applications. This paper presents a systematic and comprehensive review and analysis of the recent progress of DL for omnidirectional vision. It delineates the distinct challenges and complexities encountered in applying DL to omnidirectional images as opposed to traditional perspective imagery. Our work covers five main contents: (i) A thorough introduction to the principles of omnidirectional imaging and commonly explored projections of ODI; (ii) A methodical review of varied representation learning approaches tailored for ODI; (iii) An in-depth investigation of optimization strategies specific to omnidirectional vision; (iv) A structural and hierarchical taxonomy of the DL methods for the representative omnidirectional vision tasks, from visual enhancement (e.g., image generation and super-resolution) to 3D geometry and motion estimation (e.g., depth and optical flow estimation), alongside discussions of emergent research directions; (v) An overview of cutting-edge applications (e.g., autonomous driving and virtual reality), coupled with a critical discussion of prevailing challenges and open questions, to trigger more research in the community.

Title: FlexControl: Computation-Aware ControlNet with Differentiable Router for Text-to-Image Generation

Authors: Zheng Fang, Lichuan Xiang, Xu Cai, Kaicheng Zhou, Hongkai Wen
Subjects: cs.LG, cs.GR
Abstract URL: https://arxiv.org/abs/2502.10451
Pdf URL: https://arxiv.org/pdf/2502.10451
Copy Paste: [[2502.10451]] FlexControl: Computation-Aware ControlNet with Differentiable Router for Text-to-Image Generation(https://arxiv.org/abs/2502.10451)
Keywords: generation, generative
Abstract: ControlNet offers a powerful way to guide diffusion-based generative models, yet most implementations rely on ad-hoc heuristics to choose which network blocks to control, an approach that varies unpredictably with different tasks. To address this gap, we propose FlexControl, a novel framework that copies all diffusion blocks during training and employs a trainable gating mechanism to dynamically select which blocks to activate at each denoising step. By introducing a computation-aware loss, we can encourage control blocks to activate only when doing so benefits the generation quality. By eliminating manual block selection, FlexControl enhances adaptability across diverse tasks and streamlines the design pipeline, with the computation-aware loss trained in an end-to-end manner. Through comprehensive experiments on both UNet (e.g., SD1.5) and DiT (e.g., SD3.0), we show that our method outperforms existing ControlNet variants in certain key aspects of interest. As evidenced by both quantitative and qualitative evaluations, FlexControl preserves or enhances image fidelity while also reducing computational overhead by selectively activating the most relevant blocks. These results underscore the potential of a flexible, data-driven approach for controlled diffusion and open new avenues for efficient generative model design.

Title: Quaternion-Hadamard Network: A Novel Defense Against Adversarial Attacks with a New Dataset

Authors: Vladimir Frants, Sos Agaian
Subjects: cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2502.10452
Pdf URL: https://arxiv.org/pdf/2502.10452
Copy Paste: [[2502.10452]] Quaternion-Hadamard Network: A Novel Defense Against Adversarial Attacks with a New Dataset(https://arxiv.org/abs/2502.10452)
Keywords: super-resolution
Abstract: This paper addresses the vulnerability of deep-learning models designed for rain, snow, and haze removal. Despite enhancing image quality in adverse weather, these models are susceptible to adversarial attacks that compromise their effectiveness. Traditional defenses such as adversarial training and model distillation often require extensive retraining, making them costly and impractical for real-world deployment. While denoising and super-resolution techniques can aid image classification models, they impose high computational demands and introduce visual artifacts that hinder image processing tasks. We propose a model-agnostic defense against first-order white-box adversarial attacks using the Quaternion-Hadamard Network (QHNet) to tackle these challenges. White-box attacks are particularly difficult to defend against since attackers have full access to the model's architecture, weights, and training procedures. Our defense introduces the Quaternion Hadamard Denoising Convolutional Block (QHDCB) and the Quaternion Denoising Residual Block (QDRB), leveraging polynomial thresholding. QHNet incorporates these blocks within an encoder-decoder architecture, enhanced by feature refinement, to effectively neutralize adversarial noise. Additionally, we introduce the Adversarial Weather Conditions Vision Dataset (AWCVD), created by applying first-order gradient attacks on state-of-the-art weather removal techniques in scenarios involving haze, rain streaks, and snow. Using PSNR and SSIM metrics, we demonstrate that QHNet significantly enhances the robustness of low-level computer vision models against adversarial attacks compared with state-of-the-art denoising and super-resolution techniques. The source code and dataset will be released alongside the final version of this paper.

Title: One Example Shown, Many Concepts Known! Counterexample-Driven Conceptual Reasoning in Mathematical LLMs

Authors: Yinghui Li, Jiayi Kuang, Haojing Huang, Zhikun Xu, Xinnian Liang, Yi Yu, Wenlian Lu, Yangning Li, Xiaoyu Tan, Chao Qu, Ying Shen, Hai-Tao Zheng, Philip S. Yu
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2502.10454
Pdf URL: https://arxiv.org/pdf/2502.10454
Copy Paste: [[2502.10454]] One Example Shown, Many Concepts Known! Counterexample-Driven Conceptual Reasoning in Mathematical LLMs(https://arxiv.org/abs/2502.10454)
Keywords: generation
Abstract: Leveraging mathematical Large Language Models (LLMs) for proof generation is a fundamental topic in LLM research. We argue that the ability of current LLMs to prove statements largely depends on whether they have encountered the relevant proof process during training. This reliance limits their deeper understanding of mathematical theorems and related concepts. Inspired by the pedagogical method of "proof by counterexamples" commonly used in human mathematics education, our work aims to enhance LLMs' ability to conduct mathematical reasoning and proof through counterexamples. Specifically, we manually create a high-quality, university-level mathematical benchmark, CounterMATH, which requires LLMs to prove mathematical statements by providing counterexamples, thereby assessing their grasp of mathematical concepts. Additionally, we develop a data engineering framework to automatically obtain training data for further model improvement. Extensive experiments and detailed analyses demonstrate that CounterMATH is challenging, indicating that LLMs, such as OpenAI o1, have insufficient counterexample-driven proof capabilities. Moreover, our exploration into model training reveals that strengthening LLMs' counterexample-driven conceptual reasoning abilities is crucial for improving their overall mathematical capabilities. We believe that our work offers new perspectives to the community of mathematical LLMs.

Title: I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning in Diffusion Models

Authors: Zhenxing Mi, Kuan-Chieh Wang, Guocheng Qian, Hanrong Ye, Runtao Liu, Sergey Tulyakov, Kfir Aberman, Dan Xu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2502.10458
Pdf URL: https://arxiv.org/pdf/2502.10458
Copy Paste: [[2502.10458]] I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning in Diffusion Models(https://arxiv.org/abs/2502.10458)
Keywords: generation
Abstract: This paper presents ThinkDiff, a novel alignment paradigm that empowers text-to-image diffusion models with multimodal in-context understanding and reasoning capabilities by integrating the strengths of vision-language models (VLMs). Existing multimodal diffusion finetuning methods largely focus on pixel-level reconstruction rather than in-context reasoning, and are constrained by the complexity and limited availability of reasoning-based datasets. ThinkDiff addresses these challenges by leveraging vision-language training as a proxy task, aligning VLMs with the decoder of an encoder-decoder large language model (LLM) instead of a diffusion decoder. This proxy task builds on the observation that the LLM decoder shares the same input feature space with diffusion decoders that use the corresponding LLM encoder for prompt embedding. As a result, aligning VLMs with diffusion decoders can be simplified through alignment with the LLM decoder. Without complex training and datasets, ThinkDiff effectively unleashes understanding, reasoning, and composing capabilities in diffusion models. Experiments demonstrate that ThinkDiff significantly improves accuracy from 19.2% to 46.3% on the challenging CoBSAT benchmark for multimodal in-context reasoning generation, with only 5 hours of training on 4 A100 GPUs. Additionally, ThinkDiff demonstrates exceptional performance in composing multiple images and texts into logically coherent images.

Title: LLM4GNAS: A Large Language Model Based Toolkit for Graph Neural Architecture Search

Authors: Yang Gao, Hong Yang, Yizhi Chen, Junxian Wu, Peng Zhang, Haishuai Wang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2502.10459
Pdf URL: https://arxiv.org/pdf/2502.10459
Copy Paste: [[2502.10459]] LLM4GNAS: A Large Language Model Based Toolkit for Graph Neural Architecture Search(https://arxiv.org/abs/2502.10459)
Keywords: generative
Abstract: Graph Neural Architecture Search (GNAS) facilitates the automatic design of Graph Neural Networks (GNNs) tailored to specific downstream graph learning tasks. However, existing GNAS approaches often require manual adaptation to new graph search spaces, necessitating substantial code optimization and domain-specific knowledge. To address this challenge, we present LLM4GNAS, a toolkit for GNAS that leverages the generative capabilities of Large Language Models (LLMs). LLM4GNAS includes an algorithm library for graph neural architecture search algorithms based on LLMs, enabling the adaptation of GNAS methods to new search spaces through the modification of LLM prompts. This approach reduces the need for manual intervention in algorithm adaptation and code modification. The LLM4GNAS toolkit is extensible and robust, incorporating LLM-enhanced graph feature engineering, LLM-enhanced graph neural architecture search, and LLM-enhanced hyperparameter optimization. Experimental results indicate that LLM4GNAS outperforms existing GNAS methods on tasks involving both homogeneous and heterogeneous graphs.

Title: Preference learning made easy: Everything should be understood through win rate

Authors: Lily H. Zhang, Rajesh Ranganath
Subjects: cs.LG, cs.CL, stat.ML
Abstract URL: https://arxiv.org/abs/2502.10505
Pdf URL: https://arxiv.org/pdf/2502.10505
Copy Paste: [[2502.10505]] Preference learning made easy: Everything should be understood through win rate(https://arxiv.org/abs/2502.10505)
Keywords: generative
Abstract: Preference learning, or the task of aligning generative models to preference comparison data, has yet to reach the conceptual maturity of classification, density estimation, etc. To close this gap, this work presents a framework to understand preference learning starting from the sampling distribution of pairwise preference data. First, we prove that the only evaluation of a generative model that respects both preferences and prevalences in the data distribution is a form of win rate, justifying win rate as the focal point to understand preference learning. We then analyze preference learning methods as win rate optimization (WRO) or non-WRO. We present novel instances of WRO beyond existing examples (RLHF, NLHF) and identify two key theoretical benefits of all such methods. We prove that common non-WRO methods like DPO and SFT on preferred samples lack these properties and suggest ways to mitigate such theoretical limitations. We also show that WRO underperforms in practice due to optimization difficulties and that optimization success predicts performance better than choices which affect the objective's solution. Our analysis highlights best practices for existing methods and provides recommendations for future research, guided by the principle that one should either align non-WRO methods more closely with WRO or improve the optimization of WRO objectives.

Title: KernelBench: Can LLMs Write Efficient GPU Kernels?

Authors: Anne Ouyang, Simon Guo, Simran Arora, Alex L. Zhang, William Hu, Christopher Ré, Azalia Mirhoseini
Subjects: cs.LG, cs.AI, cs.PF, cs.SE
Abstract URL: https://arxiv.org/abs/2502.10517
Pdf URL: https://arxiv.org/pdf/2502.10517
Copy Paste: [[2502.10517]] KernelBench: Can LLMs Write Efficient GPU Kernels?(https://arxiv.org/abs/2502.10517)
Keywords: generation
Abstract: Efficient GPU kernels are crucial for building performant machine learning architectures, but writing them is a time-consuming challenge that requires significant expertise; therefore, we explore using language models (LMs) to automate kernel generation. We introduce KernelBench, an open-source framework for evaluating LMs' ability to write fast and correct kernels on a suite of 250 carefully selected PyTorch ML workloads. KernelBench represents a real-world engineering environment, and making progress on the introduced benchmark directly translates to faster practical kernels. We introduce a new evaluation metric, fast_p, which measures the percentage of generated kernels that are functionally correct and offer a speedup greater than an adjustable threshold p over baseline. Our experiments across various state-of-the-art models and test-time methods show that frontier reasoning models perform the best out of the box but still fall short overall, matching the PyTorch baseline in less than 20% of the cases. While we show that results can improve by leveraging execution and profiling feedback during iterative refinement, KernelBench remains a challenging benchmark, with its difficulty increasing as we raise the speedup threshold p.
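
The fast_p metric described in the KernelBench abstract can be sketched as follows. This is a minimal illustration based only on the abstract's wording (fraction of generated kernels that are functionally correct and faster than the baseline by more than a factor p); the `KernelResult` record and its field names are hypothetical and are not taken from the KernelBench code base.

```python
from dataclasses import dataclass


@dataclass
class KernelResult:
    """Hypothetical record for one generated kernel (names are assumptions)."""
    correct: bool          # passed the functional correctness check
    baseline_time: float   # runtime of the PyTorch baseline
    kernel_time: float     # runtime of the generated kernel, same unit


def fast_p(results, p):
    """Fraction of kernels that are correct AND have speedup strictly above p."""
    if not results:
        return 0.0
    hits = sum(
        1
        for r in results
        if r.correct and (r.baseline_time / r.kernel_time) > p
    )
    return hits / len(results)


# Toy data: four generated kernels for four workloads.
results = [
    KernelResult(True, 10.0, 5.0),    # correct, speedup 2.0
    KernelResult(True, 10.0, 12.5),   # correct, speedup 0.8 (slower)
    KernelResult(False, 10.0, 2.0),   # fast but incorrect, never counts
    KernelResult(True, 10.0, 10.0),   # correct, speedup 1.0
]

for p in (0.0, 1.0, 2.0):
    print(f"fast_{p} = {fast_p(results, p):.2f}")
```

Raising p can only keep the score equal or lower it, which matches the abstract's remark that the benchmark gets harder as the speedup threshold increases.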

Title: PolyPath: Adapting a Large Multimodal Model for Multi-slide Pathology Report Generation

Authors: Faruk Ahmed, Lin Yang, Tiam Jaroensri, Andrew Sellergren, Yossi Matias, Avinatan Hassidim, Greg S. Corrado, Dale R. Webster, Shravya Shetty, Shruthi Prabhakara, Yun Liu, Daniel Golden, Ellery Wulczyn, David F. Steiner
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.10536
Pdf URL: https://arxiv.org/pdf/2502.10536
Copy Paste: [[2502.10536]] PolyPath: Adapting a Large Multimodal Model for Multi-slide Pathology Report Generation(https://arxiv.org/abs/2502.10536)
Keywords: generation
Abstract: The interpretation of histopathology cases underlies many important diagnostic and treatment decisions in medicine. Notably, this process typically requires pathologists to integrate and summarize findings across multiple slides per case. Existing vision-language capabilities in computational pathology have so far been largely limited to small regions of interest, larger regions at low magnification, or single whole-slide images (WSIs). This limits interpretation of findings that span multiple high-magnification regions across multiple WSIs. By making use of Gemini 1.5 Flash, a large multimodal model (LMM) with a 1-million token context window, we demonstrate the ability to generate bottom-line diagnoses from up to 40,000 768x768 pixel image patches from multiple WSIs at 10X magnification. This is the equivalent of up to 11 hours of video at 1 fps. Expert pathologist evaluations demonstrate that the generated report text is clinically accurate and equivalent to or preferred over the original reporting for 68% (95% CI: [60%, 76%]) of multi-slide examples with up to 5 slides. While performance decreased for examples with 6 or more slides, this study demonstrates the promise of leveraging the long-context capabilities of modern LMMs for the uniquely challenging task of medical report generation where each case can contain thousands of image patches.

Title: Classifier-free Guidance with Adaptive Scaling

Authors: Dawid Malarz, Artur Kasymov, Maciej Zięba, Jacek Tabor, Przemysław Spurek
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.10574
Pdf URL: https://arxiv.org/pdf/2502.10574
Copy Paste: [[2502.10574]] Classifier-free Guidance with Adaptive Scaling(https://arxiv.org/abs/2502.10574)
Keywords: generation
Abstract: Classifier-free guidance (CFG) is an essential mechanism in contemporary text-driven diffusion models. In practice, controlling the impact of guidance exposes a trade-off between the quality of the generated images and their correspondence to the prompt. When we use strong guidance, generated images fit the conditioning text perfectly but at the cost of their quality. Conversely, we can use weak guidance to generate high-quality results, but the generated images do not suit our prompt. In this paper, we present $\beta$-CFG ($\beta$-adaptive scaling in Classifier-Free Guidance), which controls the impact of guidance during generation to solve the above trade-off. First, $\beta$-CFG stabilizes the effects of guiding by gradient-based adaptive normalization. Second, $\beta$-CFG uses the family of single-modal ($\beta$-distribution), time-dependent curves to dynamically adapt the trade-off between prompt matching and the quality of samples during the diffusion denoising process. Our model obtained better FID scores while maintaining the text-to-image CLIP similarity scores at a level similar to that of the reference CFG.
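
To make the guidance trade-off concrete, the sketch below shows the standard classifier-free guidance combination together with a time-dependent guidance scale. The Beta-shaped schedule `guidance_scale` and its parameters `a`, `b`, and `w_max` are assumptions chosen for illustration; this is not the $\beta$-CFG method itself, whose normalization and curve family are specified only in the paper.

```python
import math


def beta_pdf(t, a, b):
    """Density of the Beta(a, b) distribution at t in [0, 1]."""
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * t ** (a - 1) * (1 - t) ** (b - 1)


def guidance_scale(t, w_max, a=2.0, b=2.0):
    """Hypothetical schedule: a Beta-shaped curve rescaled to peak at w_max.

    Requires a > 1 and b > 1 so that the curve has a single interior mode.
    """
    mode = (a - 1) / (a + b - 2)
    return w_max * beta_pdf(t, a, b) / beta_pdf(mode, a, b)


def cfg_combine(eps_uncond, eps_cond, w):
    """Standard CFG: start from the unconditional prediction and move
    along the (conditional - unconditional) direction with scale w."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy noise predictions for a 2-dimensional sample.
eps_uncond = [0.0, 1.0]
eps_cond = [1.0, 3.0]

for t in (0.0, 0.25, 0.5, 1.0):   # normalized diffusion time
    w = guidance_scale(t, w_max=7.5)
    print(t, round(w, 4), cfg_combine(eps_uncond, eps_cond, w))
```

With this convention, w = 0 returns the unconditional prediction and w = 1 returns the conditional one; larger values push harder toward the prompt, which is where the quality cost discussed in the abstract comes from.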

Title: Data-driven Super-Resolution of Flood Inundation Maps using Synthetic Simulations

Authors: Akshay Aravamudan, Zimeena Rasheed, Xi Zhang, Kira E. Scarpignato, Efthymios I. Nikolopoulos, Witold F. Krajewski, Georgios C. Anagnostopoulos
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2502.10601
Pdf URL: https://arxiv.org/pdf/2502.10601
Copy Paste: [[2502.10601]] Data-driven Super-Resolution of Flood Inundation Maps using Synthetic Simulations(https://arxiv.org/abs/2502.10601)
Keywords: super-resolution
Abstract: The frequency of extreme flood events is increasing throughout the world. Daily, high-resolution (30m) Flood Inundation Maps (FIM) observed from space play a key role in informing mitigation and preparedness efforts to counter these extreme events. However, the temporal frequency of publicly available high-resolution FIMs, e.g., from Landsat, is on the order of two weeks, thus limiting the effective monitoring of flood inundation dynamics. Conversely, global, low-resolution (~300m) Water Fraction Maps (WFM) are publicly available from NOAA VIIRS daily. Motivated by the recent successes of deep learning methods for single image super-resolution, we explore the effectiveness and limitations of similar data-driven approaches to downscaling low-resolution WFMs to high-resolution FIMs. To overcome the scarcity of high-resolution FIMs, we train our models with high-quality synthetic data obtained through physics-based simulations. We evaluate our models on real-world data from flood events in the state of Iowa. The study indicates that data-driven approaches exhibit superior reconstruction accuracy over non-data-driven alternatives and that the use of synthetic data is a viable proxy for training purposes. Additionally, we show that our trained models can exhibit superior zero-shot performance when transferred to regions with hydroclimatological similarity to the U.S. Midwest.

Title: ControllableGPT: A Ground-Up Designed Controllable GPT for Molecule Optimization

Authors: Xuefeng Liu, Songhao Jiang, Bo Li, Rick Stevens
Subjects: cs.LG, cs.AI, q-bio.BM
Abstract URL: https://arxiv.org/abs/2502.10631
Pdf URL: https://arxiv.org/pdf/2502.10631
Copy Paste: [[2502.10631]] ControllableGPT: A Ground-Up Designed Controllable GPT for Molecule Optimization(https://arxiv.org/abs/2502.10631)
Keywords: generation
Abstract: Large Language Models (LLMs) employ three popular training approaches: Masked Language Models (MLM), Causal Language Models (CLM), and Sequence-to-Sequence Models (seq2seq). However, each approach has its strengths and limitations, and faces challenges in addressing specific tasks that require controllable and bidirectional generation, such as drug optimization. To address this challenge, inspired by the biological processes of growth and evolution, which involve the expansion, shrinking, and mutation of sequences, we introduce ControllableGPT. This initiative represents the first effort to combine the advantages of MLM, CLM, and seq2seq into a single unified, controllable GPT framework. It enables the precise management of specific locations and ranges within a sequence, allowing for expansion, reduction, or mutation over chosen or random lengths, while maintaining the integrity of any specified positions or subsequences. In this work, we designed ControllableGPT for drug optimization from the ground up, which included proposing the Causally Masked Seq2seq (CMS) objective, developing the training corpus, introducing a novel pre-training approach, and devising a unique generation process. We demonstrate the effectiveness and controllability of ControllableGPT by conducting experiments on drug optimization tasks for both viral and cancer benchmarks, surpassing competing baselines.

Title: LLM-Lasso: A Robust Framework for Domain-Informed Feature Selection and Regularization

Authors: Erica Zhang, Ryunosuke Goto, Naomi Sagan, Jurik Mutter, Nick Phillips, Ash Alizadeh, Kangwook Lee, Jose Blanchet, Mert Pilanci, Robert Tibshirani
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2502.10648
Pdf URL: https://arxiv.org/pdf/2502.10648
Copy Paste: [[2502.10648]] LLM-Lasso: A Robust Framework for Domain-Informed Feature Selection and Regularization(https://arxiv.org/abs/2502.10648)
Keywords: generation
Abstract: We introduce LLM-Lasso, a novel framework that leverages large language models (LLMs) to guide feature selection in Lasso $\ell_1$ regression. Unlike traditional methods that rely solely on numerical data, LLM-Lasso incorporates domain-specific knowledge extracted from natural language, enhanced through a retrieval-augmented generation (RAG) pipeline, to seamlessly integrate data-driven modeling with contextual insights. Specifically, the LLM generates penalty factors for each feature, which are converted into weights for the Lasso penalty using a simple, tunable model. Features identified as more relevant by the LLM receive lower penalties, increasing their likelihood of being retained in the final model, while less relevant features are assigned higher penalties, reducing their influence. Importantly, LLM-Lasso has an internal validation step that determines how much to trust the contextual knowledge in our prediction pipeline. Hence it addresses key challenges in robustness, making it suitable for mitigating potential inaccuracies or hallucinations from the LLM. In various biomedical case studies, LLM-Lasso outperforms standard Lasso and existing feature selection baselines, all while ensuring the LLM operates without prior access to the datasets. To our knowledge, this is the first approach to effectively integrate conventional feature selection techniques directly with LLM-based domain-specific reasoning.

Title: Hybrid Deepfake Image Detection: A Comprehensive Dataset-Driven Approach Integrating Convolutional and Attention Mechanisms with Frequency Domain Features

Authors: Kafi Anan, Anindya Bhattacharjee, Ashir Intesher, Kaidul Islam, Abrar Assaeem Fuad, Utsab Saha, Hafiz Imtiaz
Subjects: cs.CV, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2502.10682
Pdf URL: https://arxiv.org/pdf/2502.10682
Copy Paste: [[2502.10682]] Hybrid Deepfake Image Detection: A Comprehensive Dataset-Driven Approach Integrating Convolutional and Attention Mechanisms with Frequency Domain Features(https://arxiv.org/abs/2502.10682)
Keywords: generation
Abstract: Effective deepfake detection tools have become increasingly essential over the last few years due to the growing use of deepfakes in unethical practices. There exists a diverse range of deepfake generation techniques, which makes it challenging to develop an accurate universal detection mechanism. The 2025 Signal Processing Cup (DFWild-Cup competition) provided a diverse dataset of deepfake images, generated by multiple deepfake image generators, for training machine learning models with an emphasis on the generalization of deepfake detection. To this end, we propose an ensemble-based approach that employs three different neural network architectures: a ResNet-34-based architecture, a data-efficient image transformer (DeiT), and an XceptionNet with Wavelet Transform, to capture both local and global features of deepfakes. We visualize the specific regions that these models focus on for classification using Grad-CAM, and empirically demonstrate the effectiveness of these models in grouping real and fake images into cohesive clusters using t-SNE plots. Individually, the ResNet-34 architecture achieves 88.9% accuracy, whereas the Xception network and the DeiT architecture achieve 87.76% and 89.32% accuracy, respectively. With these networks, our weighted ensemble model achieves an accuracy of 93.23% on the validation dataset of the SP Cup 2025 competition. Finally, the confusion matrix and an area under the ROC curve of 97.44% further confirm the stability of our proposed method.
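A weighted ensemble of several detectors is often implemented as a weighted average of their per-image probabilities. The sketch below shows that generic form only. The abstract does not say how the weights are chosen or how the three outputs are combined, so the weights, the example probabilities, and the function names are all hypothetical.

```python
def weighted_ensemble(probs_per_model, weights):
    """Combine per-model 'fake' probabilities with a normalized weighted average.

    probs_per_model: one list of probabilities per model, all the same length.
    weights: one non-negative weight per model (hypothetical values).
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n_samples = len(probs_per_model[0])
    combined = []
    for i in range(n_samples):
        score = sum(w * probs[i] for w, probs in zip(weights, probs_per_model))
        combined.append(score / total)
    return combined


def predict_labels(scores, threshold=0.5):
    """Label an image as fake (1) when its ensemble score reaches the threshold."""
    return [1 if s >= threshold else 0 for s in scores]


# Made-up probabilities for three images from three models.
resnet_probs = [0.9, 0.4, 0.2]
deit_probs = [0.8, 0.6, 0.1]
xception_probs = [0.7, 0.3, 0.6]
scores = weighted_ensemble(
    [resnet_probs, deit_probs, xception_probs],
    weights=[0.35, 0.40, 0.25],  # assumed weights, not taken from the paper
)
print(predict_labels(scores))  # prints [1, 0, 0]
```

In practice the weights would typically be tuned on a held-out validation split, for example in proportion to each model's validation accuracy.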

Title: FuncGenFoil: Airfoil Generation and Editing Model in Function Space

Authors: Jinouwen Zhang, Junjie Ren, Aobo Yang, Yan Lu, Lu Chen, Hairun Xie, Jing Wang, Miao Zhang, Wanli Ouyang, Shixiang Tang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2502.10712
Pdf URL: https://arxiv.org/pdf/2502.10712
Copy Paste: [[2502.10712]] FuncGenFoil: Airfoil Generation and Editing Model in Function Space(https://arxiv.org/abs/2502.10712)
Keywords: generation, generative
Abstract: Aircraft manufacturing is the jewel in the crown of industry, and within it, generating high-fidelity airfoil geometries with controllable and editable representations remains a fundamental challenge. While existing deep-learning-based methods rely on predefined parametric function families, e.g., Bézier curves and discrete point-based representations, they suffer from inherent trade-offs between expressiveness and resolution flexibility. To tackle this challenge, we introduce FuncGenFoil, a novel function-space generative model that directly learns functional airfoil geometries. Our method inherits both the advantages of arbitrary resolution sampling and the smoothness of parametric functions, as well as the strong expressiveness of discrete point-based functions. Empirical evaluations on the AFBench dataset demonstrate that FuncGenFoil improves upon state-of-the-art methods in airfoil generation by achieving a relative -74.4 label error reduction and +23.2 diversity increase on the AF-200K dataset. Our results highlight the advantages of function-space modeling for aerodynamic shape optimization, offering a powerful and flexible framework for high-fidelity airfoil design. Our code will be released.

Title: Disentangle Nighttime Lens Flares: Self-supervised Generation-based Lens Flare Removal

Authors: Yuwen He, Wei Wang, Wanyu Wang, Kui Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.10714
Pdf URL: https://arxiv.org/pdf/2502.10714
Copy Paste: [[2502.10714]] Disentangle Nighttime Lens Flares: Self-supervised Generation-based Lens Flare Removal(https://arxiv.org/abs/2502.10714)
Keywords: generation
Abstract: Lens flares arise from light reflection and refraction within sensor arrays, and their diverse types include glow, veiling glare, reflective flare and so on. Existing methods are specialized for one specific type only, and overlook the simultaneous occurrence of multiple types of lens flares, which is common in the real world, e.g., the coexistence of glow and displacement reflections from the same light source. These co-occurring lens flares cannot be effectively resolved by simply combining individual flare removal methods, since the coexisting flares originate from the same light source, are generated simultaneously within the same sensor array, and exhibit a complex interdependence rather than a simple additive relation. To model this interdependent flare relationship, our Nighttime Lens Flare Formation model is the first attempt to learn the intrinsic physical relationship between flares on the imaging plane. Building on this physical model, we introduce a solution to this joint flare removal task named Self-supervised Generation-based Lens Flare Removal Network (SGLFR-Net), which is self-supervised without pre-training. Specifically, the nighttime glow is disentangled in the PSF Rendering Network (PSFR-Net) based on a PSF Rendering Prior, while the reflective flare is modelled in the Texture Prior Based Reflection Flare Removal Network (TPRR-Net). Empirical evaluations demonstrate the effectiveness of the proposed method in both joint and individual glare removal tasks.

Title: VarGes: Improving Variation in Co-Speech 3D Gesture Generation via StyleCLIPS

Authors: Ming Meng, Ke Mu, Yonggui Zhu, Zhe Zhu, Haoyu Sun, Heyang Yan, Zhaoxin Fan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.10729
Pdf URL: https://arxiv.org/pdf/2502.10729
Copy Paste: [[2502.10729]] VarGes: Improving Variation in Co-Speech 3D Gesture Generation via StyleCLIPS(https://arxiv.org/abs/2502.10729)
Keywords: generation
Abstract: Generating expressive and diverse human gestures from audio is crucial in fields like human-computer interaction, virtual reality, and animation. Though existing methods have achieved remarkable performance, they often exhibit limitations due to constrained dataset diversity and the restricted amount of information derived from audio inputs. To address these challenges, we present VarGes, a novel variation-driven framework designed to enhance co-speech gesture generation by integrating visual stylistic cues while maintaining naturalness. Our approach begins with the Variation-Enhanced Feature Extraction (VEFE) module, which seamlessly incorporates style-reference video data into a 3D human pose estimation network to extract StyleCLIPS, thereby enriching the input with stylistic information. Subsequently, we employ the Variation-Compensation Style Encoder (VCSE), a transformer-style encoder equipped with an additive attention mechanism pooling layer, to robustly encode diverse StyleCLIPS representations and effectively manage stylistic variations. Finally, the Variation-Driven Gesture Predictor (VDGP) module fuses MFCC audio features with StyleCLIPS encodings via cross-attention, injecting this fused data into a cross-conditional autoregressive model to modulate 3D human gesture generation based on audio input and stylistic cues. The efficacy of our approach is validated on benchmark datasets, where it outperforms existing methods in terms of gesture diversity and naturalness. The code and video results will be made publicly available upon acceptance: this https URL.

Title: Bone Soups: A Seek-and-Soup Model Merging Approach for Controllable Multi-Objective Generation

Authors: Guofu Xie, Xiao Zhang, Ting Yao, Yunsheng Shi
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2502.10762
Pdf URL: https://arxiv.org/pdf/2502.10762
Copy Paste: [[2502.10762]] Bone Soups: A Seek-and-Soup Model Merging Approach for Controllable Multi-Objective Generation(https://arxiv.org/abs/2502.10762)
Keywords: generation
Abstract: User information needs are often highly diverse and varied. A key challenge in current research is how to achieve controllable multi-objective generation while enabling rapid adaptation to accommodate diverse user demands during test time. Existing solutions, such as Rewarded Soup, focus on merging language models individually tuned on single objectives. While easy to implement and widely used, these approaches face limitations in achieving optimal performance due to their disregard for the impacts of competing objectives on model tuning. To address this issue, we propose Bone Soup, a novel model merging approach that first seeks a series of backbone models by considering the impacts of multiple objectives and then makes the soup (i.e., merges the backbone models). Specifically, Bone Soup begins by training multiple backbone models for different objectives using multi-objective reinforcement learning. Each backbone model is guided by a combination of backbone reward signals. To ensure that these models are optimal for the Pareto front, the backbone rewards are crafted by combining standard reward functions into basis vectors, which can then be modified through a rule-based construction method. Bone Soup leverages a symmetric circulant matrix mapping to generate the merging coefficients, which are used to merge the backbone models according to user preferences. Extensive experimental results demonstrate that Bone Soup exhibits strong controllability and Pareto optimality in controllable multi-objective generation, providing a more effective and efficient approach to addressing diverse user needs at test time.
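The merging step can be pictured as a linear combination of backbone parameters whose coefficients come from a symmetric circulant matrix applied to the user's preference vector. The sketch below is only an illustration of that shape: the matrix entries, the plain matrix-vector product, the tiny parameter dictionaries, and all function names are assumptions, not the paper's actual construction.

```python
def circulant(first_row):
    """Build a circulant matrix: row i is first_row shifted right by i positions."""
    n = len(first_row)
    return [[first_row[(j - i) % n] for j in range(n)] for i in range(n)]


def merging_coefficients(preference, first_row):
    """Map a user preference vector to merging coefficients via C @ preference."""
    C = circulant(first_row)
    n = len(preference)
    return [sum(C[i][j] * preference[j] for j in range(n)) for i in range(n)]


def merge_models(models, coeffs):
    """Linearly combine models given as dicts of parameter name -> list of floats."""
    merged = {}
    for name in models[0]:
        size = len(models[0][name])
        merged[name] = [
            sum(c * m[name][k] for c, m in zip(coeffs, models)) for k in range(size)
        ]
    return merged


# Three toy "backbone models", each with a single two-element parameter.
backbones = [{"w": [1.0, 0.0]}, {"w": [0.0, 1.0]}, {"w": [0.0, 0.0]}]
# For n = 3, a first row of the form [a, b, b] yields a symmetric circulant matrix.
first_row = [0.8, 0.1, 0.1]    # hypothetical entries
preference = [1.0, 0.0, 0.0]   # the user cares only about objective 0
coeffs = merging_coefficients(preference, first_row)
print(merge_models(backbones, coeffs))
```

A preference that puts all its mass on one objective still mixes in a small share of the other backbones, which is one plausible reading of why the mapping is not simply the identity.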

Title: Preconditioned Inexact Stochastic ADMM for Deep Model

Authors: Shenglong Zhou, Ouya Wang, Ziyan Luo, Yongxu Zhu, Geoffrey Ye Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.10784
Pdf URL: https://arxiv.org/pdf/2502.10784
Copy Paste: [[2502.10784]] Preconditioned Inexact Stochastic ADMM for Deep Model(https://arxiv.org/abs/2502.10784)
Keywords: generative
Abstract: The recent advancement of foundation models (FMs) has brought about a paradigm shift, revolutionizing various sectors worldwide. The popular optimizers used to train these models are stochastic gradient descent-based algorithms, which face inherent limitations, such as slow convergence and stringent assumptions for convergence. In particular, data heterogeneity arising from distributed settings poses significant challenges to their theoretical and numerical performance. This paper develops an algorithm, PISA (Preconditioned Inexact Stochastic Alternating Direction Method of Multipliers), which enables scalable parallel computing and supports various second-moment schemes. Grounded in rigorous theoretical guarantees, the algorithm converges under the sole assumption of Lipschitz continuity of the gradient, thereby removing the need for other conditions commonly imposed by stochastic methods. This capability enables PISA to tackle the challenge of data heterogeneity effectively. Comprehensive experimental evaluations for training or fine-tuning diverse FMs, including vision models, large language models, reinforcement learning models, generative adversarial networks, and recurrent neural networks, demonstrate its superior numerical performance compared to various state-of-the-art optimizers.

Title: HybriDNA: A Hybrid Transformer-Mamba2 Long-Range DNA Language Model

Authors: Mingqian Ma, Guoqing Liu, Chuan Cao, Pan Deng, Tri Dao, Albert Gu, Peiran Jin, Zhao Yang, Yingce Xia, Renqian Luo, Pipi Hu, Zun Wang, Yuan-Jyue Chen, Haiguang Liu, Tao Qin
Subjects: cs.LG, cs.AI, q-bio.GN
Abstract URL: https://arxiv.org/abs/2502.10807
Pdf URL: https://arxiv.org/pdf/2502.10807
Copy Paste: [[2502.10807]] HybriDNA: A Hybrid Transformer-Mamba2 Long-Range DNA Language Model(https://arxiv.org/abs/2502.10807)
Keywords: generative
Abstract: Advances in natural language processing and large language models have sparked growing interest in modeling DNA, often referred to as the "language of life". However, DNA modeling poses unique challenges. First, it requires the ability to process ultra-long DNA sequences while preserving single-nucleotide resolution, as individual nucleotides play a critical role in DNA function. Second, success in this domain requires excelling at both generative and understanding tasks: generative tasks hold potential for therapeutic and industrial applications, while understanding tasks provide crucial insights into biological mechanisms and diseases. To address these challenges, we propose HybriDNA, a decoder-only DNA language model that incorporates a hybrid Transformer-Mamba2 architecture, seamlessly integrating the strengths of attention mechanisms with selective state-space models. This hybrid design enables HybriDNA to efficiently process DNA sequences up to 131kb in length with single-nucleotide resolution. HybriDNA achieves state-of-the-art performance across 33 DNA understanding datasets curated from the BEND, GUE, and LRB benchmarks, and demonstrates exceptional capability in generating synthetic cis-regulatory elements (CREs) with desired properties. Furthermore, we show that HybriDNA adheres to expected scaling laws, with performance improving consistently as the model scales from 300M to 3B and 7B parameters. These findings underscore HybriDNA's versatility and its potential to advance DNA research and applications, paving the way for innovations in understanding and engineering the "language of life".

Title: The Vendiscope: An Algorithmic Microscope For Data Collections

Authors: Amey P. Pasarkar, Adji Bousso Dieng
Subjects: cs.LG, cond-mat.mtrl-sci, cs.AI, q-bio.QM
Abstract URL: https://arxiv.org/abs/2502.10828
Pdf URL: https://arxiv.org/pdf/2502.10828
Copy Paste: [[2502.10828]] The Vendiscope: An Algorithmic Microscope For Data Collections(https://arxiv.org/abs/2502.10828)
Keywords: generative
Abstract: The evolution of microscopy, beginning with its invention in the late 16th century, has continuously enhanced our ability to explore and understand the microscopic world, enabling increasingly detailed observations of structures and phenomena. In parallel, the rise of data-driven science has underscored the need for sophisticated methods to explore and understand the composition of complex data collections. This paper introduces the Vendiscope, the first algorithmic microscope designed to extend traditional microscopy to computational analysis. The Vendiscope leverages the Vendi scores -- a family of differentiable diversity metrics rooted in ecology and quantum mechanics -- and assigns weights to data points based on their contribution to the overall diversity of the collection. These weights enable high-resolution data analysis at scale. We demonstrate this across biology, materials science, and machine learning (ML). We analyzed the $250$ million protein sequences in the protein universe, discovering that over $200$ million are near-duplicates and that AlphaFold fails on proteins with Gene Ontology (GO) functions that contribute most to diversity. Applying the Vendiscope to the Materials Project database led to similar findings: more than $85\%$ of the crystals with formation energy data are near-duplicates and ML models perform poorly on materials that enhance diversity. Additionally, the Vendiscope can be used to study phenomena such as memorization in generative models. We used the Vendiscope to identify memorized training samples from $13$ different generative models and found that the best-performing ones often memorize the training samples that contribute least to diversity. Our findings demonstrate that the Vendiscope can serve as a powerful tool for data-driven science.
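To make the notion of a Vendi score concrete, here is a minimal sketch of one member of the family. It assumes the order-q definition VS_q = (sum_i lambda_i^q)^(1/(1-q)), where lambda_i are the eigenvalues of K/n for an n-by-n similarity matrix K with ones on the diagonal. For q = 2 the sum of squared eigenvalues equals the squared Frobenius norm of K/n, so no eigendecomposition is needed. The cosine kernel and the function names are illustrative choices, and the per-point weighting that the Vendiscope adds on top is not shown.

```python
import math


def cosine_kernel(vectors):
    """Pairwise cosine similarities; assumes every vector has non-zero norm."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    return [
        [sum(a * b for a, b in zip(u, v)) / (nu * nv) for v, nv in zip(vectors, norms)]
        for u, nu in zip(vectors, norms)
    ]


def vendi_score_order2(K):
    """Order-2 Vendi score: 1 / sum_i lambda_i^2 for the eigenvalues of K / n.

    For symmetric K, sum_i lambda_i^2 equals sum_{i,j} (K[i][j] / n)^2.
    """
    n = len(K)
    sum_sq = sum((K[i][j] / n) ** 2 for i in range(n) for j in range(n))
    return 1.0 / sum_sq


duplicates = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
distinct = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(vendi_score_order2(cosine_kernel(duplicates)))  # close to 1: one effective item
print(vendi_score_order2(cosine_kernel(distinct)))    # close to 3: three effective items
```

The score behaves like an effective number of distinct items, which is why a collection dominated by near-duplicates scores far below its raw size.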

Title: SkyReels-A1: Expressive Portrait Animation in Video Diffusion Transformers

Authors: Di Qiu, Zhengcong Fei, Rui Wang, Jialin Bai, Changqian Yu, Mingyuan Fan, Guibin Chen, Xiang Wen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.10841
Pdf URL: https://arxiv.org/pdf/2502.10841
Copy Paste: [[2502.10841]] SkyReels-A1: Expressive Portrait Animation in Video Diffusion Transformers(https://arxiv.org/abs/2502.10841)
Keywords: generation, generative
Abstract: We present SkyReels-A1, a simple yet effective framework built upon video diffusion Transformer to facilitate portrait image animation. Existing methodologies still encounter issues, including identity distortion, background instability, and unrealistic facial dynamics, particularly in head-only animation scenarios. Besides, extending to accommodate diverse body proportions usually leads to visual inconsistencies or unnatural articulations. To address these challenges, SkyReels-A1 capitalizes on the strong generative capabilities of video DiT, enhancing facial motion transfer precision, identity retention, and temporal coherence. The system incorporates an expression-aware conditioning module that enables seamless video synthesis driven by expression-guided landmark inputs. Integrating the facial image-text alignment module strengthens the fusion of facial attributes with motion trajectories, reinforcing identity preservation. Additionally, SkyReels-A1 incorporates a multi-stage training paradigm to incrementally refine the correlation between expressions and motion while ensuring stable identity reproduction. Extensive empirical evaluations highlight the model's ability to produce visually coherent and compositionally diverse results, making it highly applicable to domains such as virtual avatars, remote communication, and digital media generation.

Title: Implicit Neural Representations of Molecular Vector-Valued Functions

Authors: Jirka Lhotka, Daniel Probst
Subjects: cs.LG, q-bio.QM
Abstract URL: https://arxiv.org/abs/2502.10848
Pdf URL: https://arxiv.org/pdf/2502.10848
Copy Paste: [[2502.10848]] Implicit Neural Representations of Molecular Vector-Valued Functions(https://arxiv.org/abs/2502.10848)
Keywords: generation
Abstract: Molecules have various computational representations, including numerical descriptors, strings, graphs, point clouds, and surfaces. Each representation method enables the application of various machine learning methodologies from linear regression to graph neural networks paired with large language models. To complement existing representations, we introduce the representation of molecules through vector-valued functions, or $n$-dimensional vector fields, that are parameterized by neural networks, which we denote molecular neural fields. Unlike surface representations, molecular neural fields capture external features and the hydrophobic core of macromolecules such as proteins. Compared to discrete graph or point representations, molecular neural fields are compact, resolution independent and inherently suited for interpolation in spatial and temporal dimensions. These properties inherited by molecular neural fields lend themselves to tasks including the generation of molecules based on their desired shape, structure, and composition, and the resolution-independent interpolation between molecular conformations in space and time. Here, we provide a framework and proofs-of-concept for molecular neural fields, namely, the parametrization and superresolution reconstruction of a protein-ligand complex using an auto-decoder architecture and the embedding of molecular volumes in latent space using an auto-encoder architecture.

Title: Super Resolution image reconstructs via total variation-based image deconvolution: a majorization-minimization approach

Authors: Mouhamad Chehaitly
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.10876
Pdf URL: https://arxiv.org/pdf/2502.10876
Copy Paste: [[2502.10876]] Super Resolution image reconstructs via total variation-based image deconvolution: a majorization-minimization approach(https://arxiv.org/abs/2502.10876)
Keywords: restoration, super-resolution
Abstract: This work aims to reconstruct image sequences with Total Variation regularity in super-resolution. We consider, in particular, images of scenes for which the point-to-point image transformation is a plane projective transformation. We first describe the super-resolution image's imaging observation model, an interpolation and Fusion estimator, and Projection on Convex Sets. We explain motion and compute the optical flow of a sequence of images using the Horn-Schunck algorithm to estimate motion. We then propose a Total Variation regularizer via a Majorization-Minimization approach to obtain a suitable result. Super Resolution restoration from motion measurements is also discussed. Finally, the simulation section demonstrates the power of the proposed methodology. As expected, this model does not give real-time results, as seen in the numerical experiments section, but it is the cornerstone for future approaches.

Title: Automatic Quality Assessment of First Trimester Crown-Rump-Length Ultrasound Images

Authors: Sevim Cengiz, Ibraheem Hamdi, Mohammad Yaqub
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.10908
Pdf URL: https://arxiv.org/pdf/2502.10908
Copy Paste: [[2502.10908]] Automatic Quality Assessment of First Trimester Crown-Rump-Length Ultrasound Images(https://arxiv.org/abs/2502.10908)
Keywords: quality assessment
Abstract: Fetal gestational age (GA) is vital clinical information that is estimated during pregnancy in order to assess fetal growth. This is usually performed by measuring the crown-rump-length (CRL) on an ultrasound image in the dating scan, which is then correlated with fetal age and growth trajectory. A major issue when performing the CRL measurement is ensuring that the image is acquired at the correct view; otherwise, the measurement could be misleading. Although clinical guidelines specify the criteria for the correct CRL view, sonographers may not regularly adhere to such rules. In this paper, we propose a new deep learning-based solution that is able to verify the adherence of a CRL image to clinical guidelines in order to assess image quality and facilitate accurate estimation of GA. We first segment out important fetal structures, then use the localized structures to perform a clinically-guided mapping that verifies adherence to the criteria. The segmentation method combines the benefits of a Convolutional Neural Network (CNN) and the Vision Transformer (ViT) to segment fetal structures in ultrasound images and localize important fetal landmarks. For segmentation purposes, we compare our proposed work with UNet and show that our CNN/ViT-based method outperforms an optimized version of UNet. Furthermore, we compare the output of the mapping with classification CNNs when assessing the clinical criteria and the overall acceptability of CRL images. We show that the proposed mapping is not only explainable but also more accurate than the best performing classification CNNs.

Title: LLM-driven Knowledge Distillation for Dynamic Text-Attributed Graphs

Authors: Amit Roy, Ning Yan, Masood Mortazavi
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.10914
Pdf URL: https://arxiv.org/pdf/2502.10914
Copy Paste: [[2502.10914]] LLM-driven Knowledge Distillation for Dynamic Text-Attributed Graphs(https://arxiv.org/abs/2502.10914)
Keywords: generation
Abstract: Dynamic Text-Attributed Graphs (DyTAGs) have numerous real-world applications, e.g., social, collaboration, citation, communication, and review networks. In these networks, nodes and edges often contain text descriptions, and the graph structure can evolve over time. Future link prediction, edge classification, relation generation, and other downstream tasks on DyTAGs require powerful representations that encode structural, temporal, and textual information. Although graph neural networks (GNNs) excel at handling structured data, encoding temporal information within dynamic graphs remains a significant challenge. In this work, we propose LLM-driven Knowledge Distillation for Dynamic Text Attributed Graph (LKD4DyTAG) with temporal encoding to address these challenges. We use a simple yet effective approach to encode temporal information in edges so that graph convolution can simultaneously capture both temporal and structural information in the hidden representations. To leverage LLMs' text processing capabilities for learning richer representations on DyTAGs, we distill knowledge from LLM-driven edge representations (based on a neighborhood's text attributes) into spatio-temporal representations using a lightweight GNN model that encodes temporal and structural information. The objective of knowledge distillation enables the GNN to learn representations that more effectively encode the available structural, temporal, and textual information in DyTAGs. We conducted extensive experiments on six real-world DyTAG datasets to verify the effectiveness of our approach, LKD4DyTAG, for future link prediction and edge classification tasks. The results show that our approach significantly improves the performance of downstream tasks compared to the baseline models.

Title: Do Deepfake Detectors Work in Reality?

Authors: Simiao Ren, Hengwei Xu, Tsang Ng, Kidus Zewde, Shengkai Jiang, Ramini Desai, Disha Patil, Ning-Yau Cheng, Yining Zhou, Ragavi Muthukrishnan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2502.10920
Pdf URL: https://arxiv.org/pdf/2502.10920
Copy Paste: [[2502.10920]] Do Deepfake Detectors Work in Reality?(https://arxiv.org/abs/2502.10920)
Keywords: super-resolution, generative
Abstract: Deepfakes, particularly those involving faceswap-based manipulations, have sparked significant societal concern due to their increasing realism and potential for misuse. Despite rapid advancements in generative models, detection methods have not kept pace, creating a critical gap in defense strategies. This disparity is further amplified by the disconnect between academic research and real-world applications, which often prioritize different objectives and evaluation criteria. In this study, we take a pivotal step toward bridging this gap by presenting a novel observation: the post-processing step of super-resolution, commonly employed in real-world scenarios, substantially undermines the effectiveness of existing deepfake detection methods. To substantiate this claim, we introduce and publish the first real-world faceswap dataset, collected from popular online faceswap platforms. We then qualitatively evaluate the performance of state-of-the-art deepfake detectors on real-world deepfakes, revealing that their accuracy approaches the level of random guessing. Furthermore, we quantitatively demonstrate the significant performance degradation caused by common post-processing techniques. By addressing this overlooked challenge, our study underscores a critical avenue for enhancing the robustness and practical applicability of deepfake detection methods in real-world settings.
摘要：Deepfakes，尤其是涉及基于面孔的操作的Deepfakes，由于其现实主义的增加和滥用潜力，引发了社会的重大关注。尽管生成模型取得了迅速的进步，但检测方法仍未保持步伐，从而在国防策略上造成了危险的差距。通过学术研究和现实世界应用之间的脱节，这种差异进一步扩大了，这通常优先考虑不同的目标和评估标准。在这项研究中，我们通过提出一种新颖的观察结果来迈出弥合这一差距的关键步骤：超分辨率的后处理步骤，通常在现实世界情景中使用，这大大破坏了现有的深层检测方法的有效性。为了证实这一说法，我们介绍并发布了从流行的在线FaceSwap平台收集的第一个现实世界FaceSWAP数据集。然后，我们定性地评估了现实世界中最先进的深层探测器的性能，从而表明它们的准确性接近了随机猜测的水平。此外，我们定量证明了由常见的后处理技术引起的显着性能降解。通过应对这一被忽视的挑战，我们的研究强调了一个关键的途径，可以增强在现实世界中深层检测方法的鲁棒性和实际适用性。

Title: TPCap: Unlocking Zero-Shot Image Captioning with Trigger-Augmented and Multi-Modal Purification Modules

Authors: Ruoyu Zhang, Lulu Wang, Yi He, Tongling Pan, Zhengtao Yu, Yingna Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.11024
Pdf URL: https://arxiv.org/pdf/2502.11024
Copy Paste: [[2502.11024]] TPCap: Unlocking Zero-Shot Image Captioning with Trigger-Augmented and Multi-Modal Purification Modules(https://arxiv.org/abs/2502.11024)
Keywords: generation
Abstract: Recent advancements in large language models (LLMs) have significantly enhanced the fluency and logical coherence of image captioning. Retrieval-Augmented Generation (RAG) is widely adopted to incorporate external knowledge into LLMs; however, existing RAG-based methods rely on separate retrieval banks, introducing computational overhead and limiting the utilization of LLMs' inherent zero-shot capabilities. To address these limitations, we propose TPCap, a novel trigger-augmented and multi-modal purification framework for zero-shot image captioning without external retrieval libraries. TPCap consists of two key components: trigger-augmented (TA) generation and multi-modal purification (MP). The TA module employs a trigger projector with frozen and learnable projections to activate LLMs' contextual reasoning, enhance visual-textual alignment, and mitigate data bias. The MP module further refines the generated entity-related information by filtering noise and enhancing feature quality, ensuring more precise and factually consistent captions. We evaluate TPCap on COCO, NoCaps, Flickr30k, and WHOOPS datasets. With only 0.82M trainable parameters and training on a single NVIDIA RTX 4090 GPU, TPCap achieves competitive performance comparable to state-of-the-art models.
摘要：大型语言模型（LLM）的最新进展显着提高了图像字幕的流利性和逻辑连贯性。检索授权的一代（RAG）被广泛采用，以将外部知识纳入LLM；但是，现有的基于抹布的方法依赖于单独的检索库，引入计算开销并限制了LLMS固有的零击功能的利用。为了解决这些局限性，我们提出了TPCAP，这是一个新颖的触发器和多模式纯化框架，用于零拍图像字幕，而无需外部检索库。 TPCAP由两个关键组成部分组成：触发器（TA）生成和多模式纯化（MP）。 TA模块采用触发投影仪，具有冷冻和可学习的预测，以激活LLMS的上下文推理，增强视觉文本对齐并减轻数据偏差。 MP模块通过过滤噪声和增强功能质量，进一步完善了生成的实体相关信息，从而确保更精确且实际一致的字幕。我们评估COCO，NOCAPS，FLICKR30K和WHOOPS数据集的TPCAP。 TPCAP仅在单个NVIDIA RTX 4090 GPU上进行训练参数和培训，可实现与最新模型相当的竞争性能。

Title: Simplify RLHF as Reward-Weighted SFT: A Variational Method

Authors: Yuhao Du, Zhuo Li, Pengyu Cheng, Zhihong Chen, Yuejiao Xie, Xiang Wan, Anningzhe Gao
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2502.11026
Pdf URL: https://arxiv.org/pdf/2502.11026
Copy Paste: [[2502.11026]] Simplify RLHF as Reward-Weighted SFT: A Variational Method(https://arxiv.org/abs/2502.11026)
Keywords: generation
Abstract: Reinforcement Learning from Human Feedback (RLHF) is crucial for aligning Large Language Models (LLMs) with human values. However, RLHF has been continuously challenged by its high implementation complexity and computational cost. Even with recent simplifications, such as Direct Preference Optimization (DPO) and Advantage Leftover Lunch (A-LoL), over-fitting and training instability still keep the alignment process from reaching the expected optimal performance. To address these challenges, we propose a novel simplification of RLHF from the perspective of variational inference, called Variational Alignment with Re-weighting (VAR). More specifically, by directly minimizing the distribution gap between the learning LLM policy and the optimal solution of RLHF, we transform the alignment objective into a reward-driven re-weighted supervised fine-tuning (SFT) form, which requires only a minor adjustment to the SFT loss to obtain a noticeable improvement in training stability and effectiveness. On comprehensive alignment and generation benchmarks, our VAR method achieves competitive performance in helpfulness and harmlessness for LLM alignment.
摘要：从人类反馈（RLHF）中学习的强化对将大语言模型（LLMS）与人类价值观保持一致至关重要。但是，RLHF在实施和计算消耗方面的高复杂性不断受到挑战。即使最近的简化，例如直接偏好优化（DPO）和优势剩余午餐（A-LOL），过度拟合和训练不稳定性的问题仍然阻碍了预期的最佳性能的对齐过程。为了应对现有的挑战，我们从变化推理的角度提出了一种新颖的RLHF简化，称为$ \ textbf {v} $ ariational $ \ textbf {a} $ textbf {a} $ lignment，带有$ \ textbf {r} \ textbf {var} $）。更具体地说，通过直接最大程度地减少学习LLM策略与RLHF的最佳解决方案之间的分配差距，我们将对齐目标转换为奖励驱动的重新加权监督的微调（SFT）形式，这仅需要对SFT损失以明显改善训练稳定性和有效性。在全面的一致性和生成基准上，我们的VAR方法在LLM对齐帮助和无害性方面具有数值的竞争性能。
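
A reward-weighted SFT objective of the kind the VAR abstract describes can be sketched in a few lines. The abstract does not give the weighting formula, so the softmax-over-rewards weights, the temperature `beta`, and the function name `reward_weighted_sft_loss` below are illustrative assumptions, not the authors' method.

```python
import math


def reward_weighted_sft_loss(seq_logprobs, rewards, beta=1.0):
    """Weighted negative log-likelihood over a batch of responses.

    seq_logprobs: log-probability of each response under the policy.
    rewards: scalar reward for each response.
    beta: temperature (assumed hyperparameter); a smaller beta puts
          more weight on high-reward responses.
    """
    if len(seq_logprobs) != len(rewards) or not rewards:
        raise ValueError("need equally sized, non-empty inputs")
    if beta <= 0:
        raise ValueError("beta must be positive")

    # Softmax of rewards / beta, stabilized by subtracting the maximum.
    scaled = [r / beta for r in rewards]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Plain SFT would use uniform weights; here each response's
    # log-likelihood term is scaled by its reward-derived weight.
    return -sum(w * lp for w, lp in zip(weights, seq_logprobs))
```

With equal rewards the weights are uniform and the loss reduces to the ordinary mean negative log-likelihood, which is one way to see this as a small adjustment to the SFT loss.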

Title: Diversified Sampling Improves Scaling LLM inference

Authors: Tianchun Wang, Zichuan Liu, Yuanzhou Chen, Jonathan Light, Haifeng Chen, Xiang Zhang, Wei Cheng
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.11027
Pdf URL: https://arxiv.org/pdf/2502.11027
Copy Paste: [[2502.11027]] Diversified Sampling Improves Scaling LLM inference(https://arxiv.org/abs/2502.11027)
Keywords: generation
Abstract: While increasing training compute has significantly improved the performance of large language models (LLMs), similar gains have not been observed when scaling inference compute. We hypothesize that the primary issue lies in the uniformity of LLM outputs, which leads to inefficient sampling as models repeatedly generate similar but inaccurate responses. Motivated by an intriguing relationship between solution accuracy (Pass@10) and response diversity, we propose DivSampling, a novel and versatile sampling technique designed to enhance the diversity of candidate solutions by introducing prompt perturbations. DivSampling incorporates two categories of perturbations: task-agnostic approaches, which are general and not tailored to any specific task, and task-specific approaches, which are customized based on task content. Our theoretical analysis demonstrates that, under mild assumptions, the error rates of responses generated from diverse prompts are significantly lower than those produced by stationary prompts. Comprehensive evaluations across various tasks, including reasoning, mathematics, and code generation, highlight the effectiveness of DivSampling in improving solution accuracy. This scalable and efficient approach offers a new perspective on optimizing test-time inference, addressing limitations in current sampling strategies.
摘要：虽然增加训练计算显着改善了大语言模型（LLM）的性能，但在缩放推理计算时未观察到类似的收益。我们假设主要问题在于LLM输出的均匀性，这导致采样效率低下，因为模型反复产生相似但不准确的响应。通过解决方案准确性（通过@10）和响应多样性之间的有趣关系，我们提出了DivSampling-一种新颖的和多功能的采样技术，旨在通过引入提示提示此HTTP URL包含两个类别的扰动，旨在增强候选解决方案的多样性：方法是一般而不是针对任何特定任务和特定任务的方法量身定制的，这些方法是根据任务内容自定义的。我们的理论分析表明，在轻度假设下，与固定提示产生的响应产生的响应错误率显着降低。包括推理，数学和代码生成在内的各种任务进行的全面评估 - 突出了Divsmpling在提高解决方案准确性方面的有效性。这种可扩展有效的方法为优化测试时间推理提供了新的观点，并解决了当前采样策略中的局限性。
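
The Pass@10 metric mentioned in the DivSampling abstract is commonly estimated from n samples per problem with the standard unbiased pass@k estimator. The sketch below shows that estimator; whether the paper computes Pass@10 exactly this way is an assumption, and the function name is illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: number of sampled solutions.
    c: number of those samples that are correct.
    k: budget of attempts being evaluated (1 <= k <= n).
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # Probability that a random size-k subset of the n samples
    # contains no correct solution, subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For a fixed number of samples, the estimate rises as more of them are correct, which is why near-duplicate wrong answers waste the sampling budget.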

Title: Deep Incomplete Multi-view Learning via Cyclic Permutation of VAEs

Authors: Xin Gao, Jian Pu
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2502.11037
Pdf URL: https://arxiv.org/pdf/2502.11037
Copy Paste: [[2502.11037]] Deep Incomplete Multi-view Learning via Cyclic Permutation of VAEs(https://arxiv.org/abs/2502.11037)
Keywords: generation
Abstract: Multi-View Representation Learning (MVRL) aims to derive a unified representation from multi-view data by leveraging shared and complementary information across views. However, when views are irregularly missing, the incomplete data can lead to representations that lack sufficiency and consistency. To address this, we propose Multi-View Permutation of Variational Auto-Encoders (MVP), which excavates invariant relationships between views in incomplete data. MVP establishes inter-view correspondences in the latent space of Variational Auto-Encoders, enabling the inference of missing views and the aggregation of more sufficient information. To derive a valid Evidence Lower Bound (ELBO) for learning, we apply permutations to randomly reorder variables for cross-view generation and then partition them by views to maintain invariant meanings under permutations. Additionally, we enhance consistency by introducing an informational prior with cyclic permutations of posteriors, which turns the regularization term into a similarity measure across distributions. We demonstrate the effectiveness of our approach on seven diverse datasets with varying missing ratios, achieving superior performance in multi-view clustering and generation tasks.
摘要：多视图表示学习（MVRL）旨在通过利用跨视图的共享和互补信息来从多视图数据中得出统一的表示。但是，当视图不规则地缺乏时，不完整的数据可能导致缺乏充分性和一致性的表示形式。为了解决这个问题，我们提出了变异自动编码器（MVP）的多视图置换，该置换器在不完整的数据中发掘了视图之间不变的关系。 MVP在变异自动编码器的潜在空间中建立了视图对应关系，从而推断了丢失视图的推断，并汇总了更充分的信息。为了得出有效的证据以进行学习，我们将排列应用到随机重新排序变量以进行跨视图生成，然后通过视图通过视图对其进行划分，以维持排列下的不变含义。此外，我们通过引入循环循环置换后提高一致性来提高一致性，这将正则化项变成了跨分布的相似性度量。我们证明了我们的方法对七个不同的数据集的有效性，其缺失比率不同，从而在多视图集群和发电任务中实现了卓越的性能。

Title: Accelerating Anchors via Specialization and Feature Transformation

Authors: Haonan Yu, Junhao Liu, Xin Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2502.11068
Pdf URL: https://arxiv.org/pdf/2502.11068
Copy Paste: [[2502.11068]] Accelerating Anchors via Specialization and Feature Transformation(https://arxiv.org/abs/2502.11068)
Keywords: generation
Abstract: Anchors is a popular local model-agnostic explanation technique whose applicability is limited by its computational inefficiency. To address this limitation, we propose a pre-training-based approach to accelerate Anchors without compromising the explanation quality. Anchors' algorithm is iterative: it gradually refines an explanation until it is precise enough for a given input. Our approach exploits this by supplying a general explanation, obtained through pre-training, as Anchors' initial explanation. Specifically, we develop a two-step rule transformation process: the horizontal transformation adapts a pre-trained explanation to the current input by replacing features, and the vertical transformation refines the general explanation until it is precise enough for the input. We evaluate our method across tabular, text, and image datasets, demonstrating that it significantly reduces explanation generation time while maintaining fidelity and interpretability, thereby enabling the practical adoption of Anchors in time-sensitive applications.
摘要：Anchors是一种流行的本地模型不足的解释技术，其适用性受其计算效率低下的限制。为了解决这一限制，我们提出了一种基于训练的方法来加速锚固，而不会损害解释质量。我们的方法利用了锚算法的迭代性质，该算法逐渐完善了解释，直到它足够精确，可以通过提供通过预训练作为锚定的最初解释获得的一般解释来提供给定输入。具体而言，我们开发了一个两步规则转换过程：水平转换通过更换特征来适应当前输入的预训练的解释，垂直转换完善了通用解释，直到足够精确地进行输入。我们在表格，文本和图像数据集中评估了我们的方法，表明它大大减少了解释时间的生成时间，同时保持了忠诚度和解释性，从而实现了时间敏感应用程序的实际采用。

Title: Phantom: Subject-consistent video generation via cross-modal alignment

Authors: Lijie Liu, Tianxiang Ma, Bingchuan Li, Zhuowei Chen, Jiawei Liu, Qian He, Xinglong Wu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2502.11079
Pdf URL: https://arxiv.org/pdf/2502.11079
Copy Paste: [[2502.11079]] Phantom: Subject-consistent video generation via cross-modal alignment(https://arxiv.org/abs/2502.11079)
Keywords: generation
Abstract: The continuous development of foundational models for video generation is evolving into various applications, with subject-consistent video generation still in the exploratory stage. We refer to this as Subject-to-Video, which extracts subject elements from reference images and generates subject-consistent video through textual instructions. We believe that the essence of subject-to-video lies in balancing the dual-modal prompts of text and image, thereby deeply and simultaneously aligning both text and visual content. To this end, we propose Phantom, a unified video generation framework for both single- and multi-subject references. Building on existing text-to-video and image-to-video architectures, we redesign the joint text-image injection model and drive it to learn cross-modal alignment via text-image-video triplet data. In particular, we emphasize subject consistency in human generation, covering existing ID-preserving video generation while offering enhanced advantages. The project homepage is available at this https URL.
摘要：视频发电的基础模型的持续开发正在发展为各种应用程序，主题一致的视频生成仍处于探索阶段。我们将其称为主题到视频，该主题从参考图像中提取主题元素，并通过文本说明生成主题一致的视频。我们认为，主题到视频的本质在于平衡文本和图像的双模式提示，从而深入并同时使文本和视觉内容对齐。为此，我们提出了Phantom，这是单个和多主题参考的统一视频生成框架。在现有的文本到视频和图像到视频体系结构的基础上，我们重新设计了联合文本图像注入模型，并通过文本图像 - 视频 - Video三重态数据驱动它以学习跨模式对齐。特别是，我们强调了人类一代的主题一致性，涵盖了现有的ID保存视频生成，同时提供了增强的优势。该项目主页在这里，此HTTPS URL。

Title: AnyRefill: A Unified, Data-Efficient Framework for Left-Prompt-Guided Vision Tasks

Authors: Ming Xie, Chenjie Cao, Yunuo Cai, Xiangyang Xue, Yu-Gang Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.11158
Pdf URL: https://arxiv.org/pdf/2502.11158
Copy Paste: [[2502.11158]] AnyRefill: A Unified, Data-Efficient Framework for Left-Prompt-Guided Vision Tasks(https://arxiv.org/abs/2502.11158)
Keywords: generation, generative
Abstract: In this paper, we present a novel Left-Prompt-Guided (LPG) paradigm to address a diverse range of reference-based vision tasks. Inspired by the human creative process, we reformulate these tasks using a left-right stitching formulation to construct contextual input. Building upon this foundation, we propose AnyRefill, an extension of LeftRefill, that effectively adapts Text-to-Image (T2I) models to various vision tasks. AnyRefill leverages the inpainting priors of an advanced T2I model based on the Diffusion Transformer (DiT) architecture, and incorporates flexible components to enhance its capabilities. By combining task-specific LoRAs with the stitching input, AnyRefill unlocks its potential across diverse tasks, including conditional generation, visual perception, and image editing, without requiring additional visual encoders. Meanwhile, AnyRefill exhibits remarkable data efficiency, requiring minimal task-specific fine-tuning while maintaining high generative performance. Through extensive ablation studies, we demonstrate that AnyRefill outperforms other image condition injection methods and achieves competitive results compared to state-of-the-art open-source methods. Notably, AnyRefill delivers results comparable to advanced commercial tools, such as IC-Light and SeedEdit, even in challenging scenarios. Comprehensive experiments and ablation studies across versatile tasks validate the strong generation capability of the proposed simple yet effective LPG formulation, establishing AnyRefill as a unified, highly data-efficient solution for reference-based vision tasks.
摘要：在本文中，我们提出了一种新颖的左推导引导（LPG）范式，以解决各种基于参考的视力任务。受到人类创造过程的启发，我们使用左右缝制配方重新制定了这些任务，以构建上下文输入。在这个基础的基础上，我们提出了AnyRefill（Leftrefill的扩展），从而有效地将文本对图像（T2I）模型适应了各种视觉任务。 AnyRefill利用基于扩散变压器（DIT）体系结构的高级T2I模型的介绍先验，并结合了灵活的组件以增强其功能。通过将特定于任务的洛拉斯与缝合输入相结合，AnyRefill可以在不需要其他视觉编码器的情况下解锁各种任务的潜力，包括有条件的生成，视觉感知和图像编辑。同时，AnyRefill具有出色的数据效率，需要最少的特定于任务的微调，同时保持高生成性能。通过广泛的消融研究，我们证明，与最先进的开源方法相比，AnyRefill的表现优于其他图像条件注入方法，并取得了竞争性的结果。值得注意的是，即使在具有挑战性的情况下，AnyRefill提供的结果与先进的商业工具（例如IC-Light和Seededit）相当。跨多功能任务的全面实验和消融研究验证了提出的简单但有效的LPG公式的强烈产生，并将AnyRefill作为基于参考的视力任务的统一，高度数据效率的解决方案。

Title: SURGE: On the Potential of Large Language Models as General-Purpose Surrogate Code Executors

Authors: Bohan Lyu, Siqiao Huang, Zichen Liang
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2502.11167
Pdf URL: https://arxiv.org/pdf/2502.11167
Copy Paste: [[2502.11167]] SURGE: On the Potential of Large Language Models as General-Purpose Surrogate Code Executors(https://arxiv.org/abs/2502.11167)
Keywords: generation
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities in code-related tasks, such as code understanding and code generation. However, an equally important yet underexplored question is whether LLMs can serve as general-purpose surrogate code executors, to predict the output and behavior of a program without actually running it. To systematically investigate this capability, we introduce SURGE, a comprehensive benchmark covering eight key aspects: multi-language programming tasks, competition-level programming problems, repository-level code analysis, high-cost scientific computing, time-complexity-intensive algorithms, buggy code analysis, programs dependent on specific compilers or execution environments, and formal mathematical proof verification. We evaluate multiple open-source and proprietary LLMs on SURGE and conduct a scaling study to analyze the impact of model size and training data scale on surrogate execution accuracy. Additionally, we categorize model prediction errors and explore potential areas for improvement. Our findings indicate that while LLMs can predict code execution results in certain cases, they exhibit limitations in general-purpose surrogate execution. This study provides empirical insights into the feasibility of using LLMs as surrogate code executors. Code and dataset are released at this https URL.
摘要：大型语言模型（LLMS）在与代码相关的任务（例如代码理解和代码生成）中表现出了显着的功能。但是，同样重要但又没有被忽视的问题是LLM是否可以用作通用替代代码执行者，以预测程序的输出和行为而无需实际运行。为了系统地调查此功能，我们引入了激增，涵盖八个关键方面的全面基准：多语言编程任务，竞争级别的编程问题，存储库级代码分析，高成本的科学计算，时间复杂性密集型算法，货运式算法代码分析，依赖特定编译器或执行环境的程序以及正式的数学证明验证。我们评估了多个开源和专有的LLM在激增的潮流中，并进行了扩展研究，以分析模型大小和训练数据量表对替代执行精度的影响。此外，我们将模型预测错误分类并探索潜在的改进领域。我们的发现表明，尽管LLM可以在某些情况下预测代码执行结果，但它们在通用替代执行中表现出局限性。这项研究提供了对使用LLM作为替代代码执行者的可行性的经验见解。代码和数据集在此HTTPS URL上发布。

Title: MaskFlow: Discrete Flows For Flexible and Efficient Long Video Generation

Authors: Michael Fuest, Vincent Tao Hu, Björn Ommer
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.11234
Pdf URL: https://arxiv.org/pdf/2502.11234
Copy Paste: [[2502.11234]] MaskFlow: Discrete Flows For Flexible and Efficient Long Video Generation(https://arxiv.org/abs/2502.11234)
Keywords: generation, generative
Abstract: Generating long, high-quality videos remains a challenge due to the complex interplay of spatial and temporal dynamics and hardware limitations. In this work, we introduce MaskFlow, a unified video generation framework that combines discrete representations with flow-matching to enable efficient generation of high-quality long videos. By leveraging a frame-level masking strategy during training, MaskFlow conditions on previously generated unmasked frames to generate videos with lengths ten times beyond that of the training sequences. MaskFlow does so very efficiently by enabling the use of fast Masked Generative Model (MGM)-style sampling and can be deployed in both fully autoregressive and full-sequence generation modes. We validate the quality of our method on the FaceForensics (FFS) and DeepMind Lab (DMLab) datasets and report Fréchet Video Distance (FVD) competitive with state-of-the-art approaches. We also provide a detailed analysis of the sampling efficiency of our method and demonstrate that MaskFlow can be applied to both timestep-dependent and timestep-independent models in a training-free manner.
摘要：由于空间和时间动态和硬件限制的复杂相互作用，生成长，高质量的视频仍然是一个挑战。在这项工作中，我们介绍了\ textbf {maskFlow}，这是一个统一的视频生成框架，将离散表示与流匹配结合起来，以有效地生成高质量的长视频。通过利用训练期间的帧级掩蔽策略，先前生成的未掩盖框架的蒙版条件以生成训练序列超出十倍的视频。 MaskFlow通过启用快速掩盖的生成模型（MGM）式采样来进行非常有效的效率，并且可以部署在完全自动化的待机和完整序列的生成模式中。我们验证了我们在FaceForensics（FFS）和DeepMind Lab（DMLAB）数据集上的方法质量，并报告Fréchet视频距离（FVD）与最先进的方法竞争。我们还提供了有关我们方法的采样效率的详细分析，并证明可以以无训练方式将蒙版可应用于TimeStep依赖性和时间段独立的模型。

Title: Span-Agnostic Optimal Sample Complexity and Oracle Inequalities for Average-Reward RL

Authors: Matthew Zurek, Yudong Chen
Subjects: cs.LG, cs.IT, math.OC, stat.ML
Abstract URL: https://arxiv.org/abs/2502.11238
Pdf URL: https://arxiv.org/pdf/2502.11238
Copy Paste: [[2502.11238]] Span-Agnostic Optimal Sample Complexity and Oracle Inequalities for Average-Reward RL(https://arxiv.org/abs/2502.11238)
Keywords: generative
Abstract: We study the sample complexity of finding an $\varepsilon$-optimal policy in average-reward Markov Decision Processes (MDPs) with a generative model. The minimax optimal span-based complexity of $\widetilde{O}(SAH/\varepsilon^2)$, where $H$ is the span of the optimal bias function, has only been achievable with prior knowledge of the value of $H$. Prior-knowledge-free algorithms have been the objective of intensive research, but several natural approaches provably fail to achieve this goal. We resolve this problem, developing the first algorithms matching the optimal span-based complexity without $H$ knowledge, both when the dataset size is fixed and when the suboptimality level $\varepsilon$ is fixed. Our main technique combines the discounted reduction approach with a method for automatically tuning the effective horizon based on empirical confidence intervals or lower bounds on performance, which we term horizon calibration. We also develop an empirical span penalization approach, inspired by sample variance penalization, which satisfies an oracle inequality performance guarantee. In particular this algorithm can outperform the minimax complexity in benign settings such as when there exist near-optimal policies with span much smaller than $H$.
摘要：我们研究了使用生成模型的平均奖励马尔可夫决策过程（MDP）中找到$ \ varepsilon $最佳策略的样本复杂性。 $ \ widetilde {o}（sah/\ varepsilon^2）$的最小值基于最佳跨度的复杂性，其中$ h $是最佳偏置函数的跨度，才可以在先验的情况下实现$ h $。先前的无知识算法一直是深入研究的目标，但是几种自然方法证明无法实现这一目标。我们解决了这个问题，开发了与最佳跨度复杂性相匹配的第一个算法，而无需$ H $知识，既有固定数据集大小，又是固定次级优先级$ \ varepsilon $时。我们的主要技术将折现的减少方法与一种基于经验置信区间或性能下的下限自动调整有效视野的方法，我们将其称为“地平线校准”。我们还开发了一种受样本差异惩罚启发的经验跨度惩罚方法，该方法满足了Oracle不平等绩效保证。特别是，该算法在良性设置中的表现可以胜过最小值的复杂性，例如近乎最佳的策略的范围小于$ h $。
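
For readers unfamiliar with the quantities in the abstract above: the span of a bias function h is max_s h(s) - min_s h(s), and the quoted rate scales as S·A·H/ε². The sketch below computes both, ignoring constants and logarithmic factors; the function names and the example numbers are illustrative assumptions, not values from the paper.

```python
def span(bias_values):
    """Span seminorm of a bias function given as a list of per-state values."""
    if not bias_values:
        raise ValueError("bias_values must be non-empty")
    return max(bias_values) - min(bias_values)


def nominal_sample_budget(num_states, num_actions, bias_span, epsilon):
    """S * A * H / epsilon^2, with constants and log factors dropped."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return num_states * num_actions * bias_span / epsilon ** 2


# Hypothetical 3-state bias function with span 4.0.
H = span([0.0, 2.5, -1.5])
budget = nominal_sample_budget(num_states=3, num_actions=2, bias_span=H, epsilon=0.1)
```

The point of the span-agnostic result is that an algorithm can match this scaling without being told H in advance.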

Title: Exploiting Point-Language Models with Dual-Prompts for 3D Anomaly Detection

Authors: Jiaxiang Wang, Haote Xu, Xiaolu Chen, Haodi Xu, Yue Huang, Xinghao Ding, Xiaotong Tu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2502.11307
Pdf URL: https://arxiv.org/pdf/2502.11307
Copy Paste: [[2502.11307]] Exploiting Point-Language Models with Dual-Prompts for 3D Anomaly Detection(https://arxiv.org/abs/2502.11307)
Keywords: generation
Abstract: Anomaly detection (AD) in 3D point clouds is crucial in a wide range of industrial applications, especially in various forms of precision manufacturing. Considering the industrial demand for reliable 3D AD, several methods have been developed. However, most of these approaches typically require training separate models for each category, which is memory-intensive and lacks flexibility. In this paper, we propose a novel Point-Language model with dual-prompts for 3D ANomaly dEtection (PLANE). The approach leverages multi-modal prompts to extend the strong generalization capabilities of pre-trained Point-Language Models (PLMs) to the domain of 3D point cloud AD, achieving impressive detection performance across multiple categories using a single model. Specifically, we propose a dual-prompt learning method, incorporating both text and point cloud prompts. The method utilizes a dynamic prompt creator module (DPCM) to produce sample-specific dynamic prompts, which are then integrated with class-specific static prompts for each modality, effectively driving the PLMs. Additionally, based on the characteristics of point cloud data, we propose a pseudo 3D anomaly generation method (Ano3D) to improve the model's detection capabilities in an unsupervised setting. Experimental results demonstrate that the proposed method, which is under the multi-class-one-model paradigm, achieves a +8.7%/+17% gain on anomaly detection and localization performance as compared to the state-of-the-art one-class-one-model methods for the Anomaly-ShapeNet dataset, and obtains +4.3%/+4.1% gain for the Real3D-AD dataset. Code will be available upon publication.
摘要：3D点云中的异常检测（AD）在广泛的工业应用中至关重要，尤其是在各种形式的精确制造中。考虑到可靠3D AD的工业需求，已经开发了几种方法。但是，这些方法中的大多数通常都需要为每个类别进行培训单独的模型，这是记忆密集型并且缺乏灵活性。在本文中，我们提出了一个新型的点语言模型，该模型具有3D异常检测（平面）的双prompts。该方法利用多模式提示将预训练点语言模型（PLMS）的强大概括能力扩展到3D点云AD的域，从而使用单个模型在多个类别中实现了令人印象深刻的检测性能。具体来说，我们提出了一种双提出学习方法，同时结合了文本和点云提示。该方法利用动态提示创建器模块（DPCM）产生特定于样本的动态提示，然后将其与每种模式的特定于类的静态提示集成，从而有效地驱动PLM。此外，基于点云数据的特征，我们提出了一种伪3D异常生成方法（ANO3D），以在无监督的设置中提高模型的检测能力。实验结果表明，在多类模型范式下，所提出的方法在异常检测和定位性能方面获得了+8.7％/ +17％的增长Anomaly-Shapenet数据集的一类模型方法，并为REAL3D-AD数据集获得+4.3％/ +4.1％的增益。代码将在出版时提供。

Title: Inverse Flow and Consistency Models

Authors: Yuchen Zhang, Jian Zhou
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2502.11333
Pdf URL: https://arxiv.org/pdf/2502.11333
Copy Paste: [[2502.11333]] Inverse Flow and Consistency Models(https://arxiv.org/abs/2502.11333)
Keywords: generation, generative
Abstract: Inverse generation problems, such as denoising without ground truth observations, are a critical challenge in many scientific inquiries and real-world applications. While recent advances in generative models like diffusion models, conditional flow matching, and consistency models have achieved impressive results by casting generation as a denoising problem, they cannot be directly used for inverse generation without access to clean data. Here we introduce Inverse Flow (IF), a novel framework that enables using these generative models for inverse generation problems, including denoising without ground truth. Inverse Flow can be flexibly applied to nearly any continuous noise distribution and allows complex dependencies. We propose two algorithms for learning Inverse Flows, Inverse Flow Matching (IFM) and Inverse Consistency Model (ICM). Notably, to derive the computationally efficient, simulation-free inverse consistency model objective, we generalize consistency training to any forward diffusion process or conditional flow, which has applications beyond denoising. We demonstrate the effectiveness of IF on synthetic and real datasets, outperforming prior approaches while enabling noise distributions that previous methods cannot support. Finally, we showcase applications of our techniques to fluorescence microscopy and single-cell genomics data, highlighting IF's utility in scientific problems. Overall, this work expands the applications of powerful generative models to inverse generation problems.
摘要：在许多科学询问和现实世界应用中，诸如无基础真理观察之类的逆变问题（例如无基本真理观察）是一个关键的挑战。尽管诸如扩散模型，有条件流量匹配和一致性模型之类的生成模型的最新进展通过将生成作为剥落问题而获得了令人印象深刻的结果，但在不访问清洁数据的情况下，它们不能直接用于逆生成。在这里，我们引入了反流（如果），这是一个新型框架，可以使用这些生成模型来解决逆生成问题，包括无地面真理。逆流可以灵活地应用于几乎任何连续的噪声分布，并允许复杂的依赖性。我们提出了两种用于学习逆流，逆流匹配（IFM）和逆一致性模型（ICM）的算法。值得注意的是，要得出计算上有效的，无模拟的逆一致性模型目标，我们将一致性训练概括为任何正向扩散过程或条件流，这些训练具有超出降级的应用。我们证明了IF在合成和真实数据集上的有效性，在实现先前方法无法支持的噪声分布的同时，表现优于先前的方法。最后，我们展示了我们的技术在荧光显微镜和单细胞基因组学数据中的应用，从而强调了IF IF在科学问题中的实用性。 Overall, this work expands the applications of powerful generative models to inversion generation problems.

Title: Biases in Edge Language Models: Detection, Analysis, and Mitigation

Authors: Vinamra Sharma, Danilo Pietro Pau, José Cano
Subjects: cs.LG, cs.PF, stat.ML
Abstract URL: https://arxiv.org/abs/2502.11349
Pdf URL: https://arxiv.org/pdf/2502.11349
Copy Paste: [[2502.11349]] Biases in Edge Language Models: Detection, Analysis, and Mitigation(https://arxiv.org/abs/2502.11349)
Keywords: generation
Abstract: The integration of large language models (LLMs) on low-power edge devices such as Raspberry Pi, known as edge language models (ELMs), has introduced opportunities for more personalized, secure, and low-latency language intelligence that is accessible to all. However, the resource constraints inherent in edge devices and the lack of robust ethical safeguards in language models raise significant concerns about fairness, accountability, and transparency in model output generation. This paper conducts a comparative analysis of text-based bias across language model deployments on edge, cloud, and desktop environments, aiming to evaluate how deployment settings influence model fairness. Specifically, we examined an optimized Llama-2 model running on a Raspberry Pi 4; GPT 4o-mini, Gemini-1.5-flash, and Grok-beta models running on cloud servers; and Gemma2 and Mistral models running on a macOS desktop machine. Our results demonstrate that Llama-2 running on Raspberry Pi 4 is 43.23% and 21.89% more prone to showing bias over time compared to models running on the desktop and cloud-based environments. We also propose a feedback loop, a mechanism that iteratively adjusts model behavior based on previous outputs: predefined constraint weights are applied layer-by-layer during inference, allowing the model to correct bias patterns and resulting in a 79.28% reduction in model bias.
摘要：大型语言模型（LLM）在低功率边缘设备（例如Raspberry Pi（称为边缘语言）（ELMS））上的集成，为所有人都可以访问了更个性化，安全和低层的语言智能，这引入了机会。但是，在边缘设备中固有的资源限制以及语言模型中缺乏强大的道德保护措施引起了人们对模型输出生成中公平性，问责制和透明度的重大关注。本文对边缘，云和桌面环境的语言模型部署进行基于文本的偏差进行了比较分析，旨在评估部署设置如何影响模型的公平性。具体而言，我们检查了在Raspberry Pi 4上运行的优化的Llama-2模型。 GPT 4o-Mini，Gemini-1.5-Flash和Grok-Beta模型在云服务器上运行； Macos台式机上运行的Gemma2和Mistral模型。我们的结果表明，与在台式机和基于云的环境上运行的模型相比，在Raspberry Pi 4上运行的Llama-2在Raspberry Pi 4上运行的时间为43.23％和21.89％。我们还提出了反馈循环的实现，这种机制是一种基于先前输出的模型行为，在推断过程中，在逐层逐层施加了预定义的约束权重，从而允许模型纠正偏差模式，从而降低了79.28％模型偏差。

Title: MARS: Mesh AutoRegressive Model for 3D Shape Detailization

Authors: Jingnan Gao, Weizhe Liu, Weixuan Sun, Senbo Wang, Xibin Song, Taizhang Shang, Shenzhou Chen, Hongdong Li, Xiaokang Yang, Yichao Yan, Pan Ji
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.11390
Pdf URL: https://arxiv.org/pdf/2502.11390
Copy Paste: [[2502.11390]] MARS: Mesh AutoRegressive Model for 3D Shape Detailization(https://arxiv.org/abs/2502.11390)
Keywords: generative
Abstract: State-of-the-art methods for mesh detailization predominantly utilize Generative Adversarial Networks (GANs) to generate detailed meshes from coarse ones. These methods typically learn a specific style code for each category or similar categories without enforcing geometry supervision across different Levels of Detail (LODs). Consequently, such methods often fail to generalize across a broader range of categories and cannot ensure shape consistency throughout the detailization process. In this paper, we introduce MARS, a novel approach for 3D shape detailization. Our method capitalizes on a novel multi-LOD, multi-category mesh representation to learn shape-consistent mesh representations in latent space across different LODs. We further propose a mesh autoregressive model capable of generating such latent representations through next-LOD token prediction. This approach significantly enhances the realism of the generated shapes. Extensive experiments conducted on the challenging 3D Shape Detailization benchmark demonstrate that our proposed MARS model achieves state-of-the-art performance, surpassing existing methods in both qualitative and quantitative assessments. Notably, the model's capability to generate fine-grained details while preserving the overall shape integrity is particularly commendable.
摘要：用于网格细节的最新方法主要利用生成对抗网络（GAN）来生成粗糙的网格。这些方法通常会为每个类别或类似类别学习特定的样式代码，而无需在不同级别的细节（LOD）上执行几何监督。因此，这种方法通常无法跨越更广泛的类别，并且无法确保在整个详细过程中形成一致性。在本文中，我们介绍了火星，这是一种用于3D形状细节的新型方法。我们的方法利用了一种新型的多型，多类网格表示，以在不同LOD的潜在空间中学习形状符合的网格表示。我们进一步提出了一个网格自回旋模型，能够通过下一型令牌预测产生此类潜在表示。这种方法大大增强了生成形状的现实主义。在具有挑战性的3D形状细节基准上进行的广泛实验表明，我们提出的火星模型实现了最先进的性能，超过了定性和定量评估中的现有方法。值得注意的是，该模型在保留整体形状完整性的同时产生细粒细节的能力特别值得称赞。

Title: DiSCo: Device-Server Collaborative LLM-Based Text Streaming Services

Authors: Ting Sun, Penghan Wang, Fan Lai
Subjects: cs.LG, cs.DC
Abstract URL: https://arxiv.org/abs/2502.11417
Pdf URL: https://arxiv.org/pdf/2502.11417
Copy Paste: [[2502.11417]] DiSCo: Device-Server Collaborative LLM-Based Text Streaming Services(https://arxiv.org/abs/2502.11417)
Keywords: generation
Abstract: The rapid rise of large language models (LLMs) in text streaming services has introduced significant cost and Quality of Experience (QoE) challenges in serving millions of daily requests, especially in meeting Time-To-First-Token (TTFT) and Time-Between-Token (TBT) requirements for real-time interactions. Our real-world measurements show that both server-based and on-device deployments struggle to meet diverse QoE demands: server deployments face high costs and last-hop issues (e.g., Internet latency and dynamics), while on-device LLM inference is constrained by resources. We introduce DiSCo, a device-server cooperative scheduler designed to optimize users' QoE by adaptively routing requests and migrating response generation between endpoints while maintaining cost constraints. DiSCo employs cost-aware scheduling, leveraging the predictable speed of on-device LLM inference with the flexible capacity of server-based inference to dispatch requests on the fly, while introducing a token-level migration mechanism to ensure consistent token delivery during migration. Evaluations on real-world workloads, including commercial services like OpenAI GPT and DeepSeek and open-source deployments such as LLaMA3, show that DiSCo can improve users' QoE by reducing tail TTFT (11-52%) and mean TTFT (6-78%) across different model-device configurations, and that its migration mechanism reduces serving costs by up to 84% while maintaining comparable QoE levels.
摘要：文本流服务中大型语言模型（LLM）的快速增长已引入了巨大的成本和经验质量（QOE）挑战，以满足数百万的日常要求，尤其是在会议时会议时间（TTFT）和中间时间。 -token（TBT）实时互动的要求。我们的实际测量结果表明，基于服务器的和设备的部署都难以满足不同的QoE需求：服务器部署面临高成本和最后跳跃问题（例如，互联网延迟和动态），而eNDEVICE LLM推断受到约束通过资源。我们介绍了Disco，这是一种设备服务器合作调度程序，旨在通过自适应路由请求和在端点之间迁移响应，同时维护成本限制，从而优化用户的QOE。迪斯科采用成本感知的调度，利用基于服务器的推理的可预测速度具有基于服务器的推理的灵活能力，以飞行派遣请求，同时引入令牌级别的迁移机制，以确保在迁移过程中持续的令牌交付。对实际工作量的评估 - 包括OpenAi GPT和DeepSeek等商业服务，以及诸如Llama3之类的开源部署 - 表明迪斯科可以通过减少TTFT（11-52 \％）和平均TTFT来改善用户的QoE（ 6-78 \％）在不同的模型设备配置中，同时通过其迁移机制将服务成本大幅降低了84 \％，同时保持了可比的QOE水平。
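
The cost-aware dispatch idea in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Endpoint` fields, the `route` function, and the rule "pick the fastest endpoint that fits the budget" are hypothetical and are not taken from the DiSCo paper, which also describes token-level migration that this sketch does not cover.

```python
from dataclasses import dataclass

@dataclass
class Endpoint:
    name: str
    expected_ttft_s: float   # assumed estimate of time-to-first-token, in seconds
    cost_per_token: float    # assumed cost unit per generated token

def route(device, server, expected_tokens, remaining_budget):
    """Pick the lower-TTFT endpoint whose estimated cost fits the budget."""
    ranked = sorted([device, server], key=lambda e: e.expected_ttft_s)
    for endpoint in ranked:
        if endpoint.cost_per_token * expected_tokens <= remaining_budget:
            return endpoint
    # Nothing fits the budget: fall back to the cheaper endpoint.
    return min([device, server], key=lambda e: e.cost_per_token)
```

With a free but slower device and a faster but paid server, a generous budget routes to the server and a tight budget routes to the device.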

Title: Training-Free Guidance Beyond Differentiability: Scalable Path Steering with Tree Search in Diffusion and Flow Models

Authors: Yingqing Guo, Yukang Yang, Hui Yuan, Mengdi Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.11420
Pdf URL: https://arxiv.org/pdf/2502.11420
Copy Paste: [[2502.11420]] Training-Free Guidance Beyond Differentiability: Scalable Path Steering with Tree Search in Diffusion and Flow Models(https://arxiv.org/abs/2502.11420)
Keywords: generation
Abstract: Training-free guidance enables controlled generation in diffusion and flow models, but most existing methods assume differentiable objectives and rely on gradients. This work focuses on training-free guidance addressing challenges from non-differentiable objectives and discrete data distributions. We propose an algorithmic framework TreeG: Tree Search-Based Path Steering Guidance, applicable to both continuous and discrete settings in diffusion and flow models. TreeG offers a unified perspective on training-free guidance: proposing candidates for the next step, evaluating candidates, and selecting the best to move forward, enhanced by a tree search mechanism over active paths or parallelizing exploration. We comprehensively investigate the design space of TreeG over the candidate proposal module and the evaluation function, instantiating TreeG into three novel algorithms. Our experiments show that TreeG consistently outperforms the top guidance baselines in symbolic music generation, small molecule generation, and enhancer DNA design, all of which involve non-differentiable challenges. Additionally, we identify an inference-time scaling law showing TreeG's scalability in inference-time computation.
摘要：无培训指导可以在扩散和流模型中受控生成，但是大多数现有方法都采用可区分的目标并依靠梯度。这项工作着重于无培训指导，以应对非不同的目标和离散数据分布的挑战。我们提出了一个算法框架树：基于树搜索的路径转向指南，适用于扩散和流模型中的连续和离散设置。 Treeg提供了无训练指导的统一观点：为下一步提出候选人，评估候选人，并选择最好的前进，并通过有效路径或并行探索的树木搜索机制增强。我们全面研究了候选提案模块和评估功能的树木的设计空间，并将Treeg实现为三种新型算法。我们的实验表明，TreeG始终优于象征性音乐生成，小分子生成和增强剂DNA设计中的最高指导基线，所有这些都涉及非差异性挑战。此外，我们确定了推理时间缩放定律，显示了Treeg在推理时间计算中的可扩展性。
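
The propose, evaluate, and select loop described in the abstract can be sketched on a toy discrete problem. This is a minimal illustration under stated assumptions, not one of the three TreeG algorithms: the `propose` and `score` functions, the 4-symbol vocabulary, and the target-matching objective are all hypothetical stand-ins. In a real diffusion or flow model the proposals would come from the model's own sampling step.

```python
import random

def propose(state, num_candidates, rng):
    """Hypothetical proposal module: resample one position of a discrete sequence."""
    candidates = []
    for _ in range(num_candidates):
        child = list(state)
        idx = rng.randrange(len(child))
        child[idx] = rng.randrange(4)  # 4-symbol toy vocabulary
        candidates.append(child)
    return candidates

def score(state, target):
    """Non-differentiable objective: number of positions matching the target."""
    return sum(1 for a, b in zip(state, target) if a == b)

def tree_search_guidance(init, target, steps=20, active_paths=3, branch=4, seed=0):
    rng = random.Random(seed)
    paths = [list(init)]
    for _ in range(steps):
        # Propose: expand every active path into several candidates.
        pool = [c for p in paths for c in propose(p, branch, rng)]
        pool.extend(paths)  # keep parents so the best score never decreases
        # Evaluate and select: keep the top-scoring candidates as active paths.
        pool.sort(key=lambda s: score(s, target), reverse=True)
        paths = pool[:active_paths]
    return paths[0]
```

No gradient of `score` is ever taken, which is the point of the setting: guidance comes only from ranking candidates. Raising `active_paths` or `branch` spends more inference-time compute on the search.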

Title: Connector-S: A Survey of Connectors in Multi-modal Large Language Models

Authors: Xun Zhu, Zheng Zhang, Xi Chen, Yiming Shi, Miao Li, Ji Wu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2502.11453
Pdf URL: https://arxiv.org/pdf/2502.11453
Copy Paste: [[2502.11453]] Connector-S: A Survey of Connectors in Multi-modal Large Language Models(https://arxiv.org/abs/2502.11453)
Keywords: generation
Abstract: With the rapid advancements in multi-modal large language models (MLLMs), connectors play a pivotal role in bridging diverse modalities and enhancing model performance. However, the design and evolution of connectors have not been comprehensively analyzed, leaving gaps in understanding how these components function and hindering the development of more powerful connectors. In this survey, we systematically review the current progress of connectors in MLLMs and present a structured taxonomy that categorizes connectors into atomic operations (mapping, compression, mixture of experts) and holistic designs (multi-layer, multi-encoder, multi-modal scenarios), highlighting their technical contributions and advancements. Furthermore, we discuss several promising research frontiers and challenges, including high-resolution input, dynamic compression, guide information selection, combination strategy, and interpretability. This survey is intended to serve as a foundational reference and a clear roadmap for researchers, providing valuable insights into the design and optimization of next-generation connectors to enhance the performance and adaptability of MLLMs.
摘要：随着多模式大语言模型（MLLM）的快速发展，连接器在桥接多种方式和增强模型性能方面起着关键作用。但是，连接器的设计和演变尚未进行全面分析，在了解这些组件如何运行并阻碍更强大的连接器的开发方面留下了差距。在这项调查中，我们系统地检查了MLLM中连接器的当前进度，并提出了结构化分类法，将连接器分类为原子操作（映射，压缩，专家的混合物）和整体设计（多层，多层编码器，多模态场景），强调他们的技术贡献和进步。此外，我们讨论了一些有希望的研究前沿和挑战，包括高分辨率输入，动态压缩，指南信息选择，组合策略和解释性。该调查旨在作为研究人员的基础参考和明确的路线图，为下一代连接器的设计和优化提供宝贵的见解，以增强MLLM的性能和适应性。

Title: Towards Efficient Pre-training: Exploring FP4 Precision in Large Language Models

Authors: Jiecheng Zhou, Ding Tang, Rong Fu, Boni Hu, Haoran Xu, Yi Wang, Zhilin Pei, Zhongling Su, Liang Liu, Xingcheng Zhang, Weiming Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2502.11458
Pdf URL: https://arxiv.org/pdf/2502.11458
Copy Paste: [[2502.11458]] Towards Efficient Pre-training: Exploring FP4 Precision in Large Language Models(https://arxiv.org/abs/2502.11458)
Keywords: generation
Abstract: The burgeoning computational demands for training large language models (LLMs) necessitate efficient methods, including quantized training, which leverages low-bit arithmetic operations to reduce costs. While FP8 precision has shown potential, leveraging FP4 remains challenging due to inherent quantization errors and limited representation capability. Based on the Transformer architecture, we present an FP4 training scheme for LLMs, overcoming these obstacles through mixed-precision quantization strategies tailored for different modules and training stages. This allows us to apply the precision level suitable to distinct components within the model, ensuring that multi-head attention and linear layers are handled appropriately. Our pretraining recipe ensures stability in backpropagation by incorporating fine-grained quantization methods with a target precision training schedule. Experimental results demonstrate that our FP4 training scheme achieves accuracy comparable to BF16 and FP8, with smaller theoretical computational cost. With the advent of next-generation hardware supporting FP4, our method sets the foundation for efficient ultra-low precision training.
摘要：培训大语言模型（LLM）的迅速计算需求需要有效的方法，包括量化的培训，该方法利用低位算术操作来降低成本。尽管FP8精度已显示出潜力，但由于固有的量化错误和有限的表示能力，利用FP4仍然具有挑战性。根据变压器体系结构，我们提出了针对LLM的FP4培训方案，通过用于不同模块和培训阶段的混合精确量化策略来克服这些障碍。这使我们能够将适用于模型中不同组件的精确级别应用，以确保对多头的注意力和线性层进行适当处理。我们的预训练配方通过将细粒度的量化方法与目标精度训练计划结合在一起，从而确保了反向传播的稳定性。实验结果表明，我们的FP4训练方案达到的精度与BF16和FP8相当，理论计算成本较小。随着下一代硬件支持FP4的出现，我们的方法为有效的超低精度培训奠定了基础。
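
To make the "limited representation capability" of FP4 concrete, here is a minimal fake-quantization sketch. It assumes the E2M1 layout (one sign bit, two exponent bits, one mantissa bit), whose representable magnitudes are 0, 0.5, 1, 1.5, 2, 3, 4, and 6, and it assumes per-block absmax scaling as the fine-grained scheme. The abstract does not state which FP4 format or granularity the authors use, so both choices and the function names are illustrative assumptions.

```python
# Representable magnitudes of the E2M1 4-bit float layout (assumed format).
FP4_E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block_fp4(block):
    """Fake-quantize a list of floats: scale, round to the FP4 grid, rescale."""
    absmax = max(abs(x) for x in block)
    if absmax == 0.0:
        return list(block)
    scale = absmax / FP4_E2M1_MAGNITUDES[-1]  # map the largest value to 6.0
    out = []
    for x in block:
        mag = abs(x) / scale
        nearest = min(FP4_E2M1_MAGNITUDES, key=lambda g: abs(g - mag))
        out.append((nearest if x >= 0 else -nearest) * scale)
    return out

def quantize_fp4(values, block_size=4):
    """Apply per-block scaling so an outlier only affects its own block."""
    out = []
    for start in range(0, len(values), block_size):
        out.extend(quantize_block_fp4(values[start:start + block_size]))
    return out
```

With only eight magnitudes available, a single shared scale would push most small values to zero whenever a large outlier is present, which is one reason finer-grained scaling matters at this precision.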

Title: GiFT: Gibbs Fine-Tuning for Code Generation

Authors: Haochen Li, Wanjin Feng, Xin Zhou, Zhiqi Shen
Subjects: cs.LG, cs.CL, cs.SE
Abstract URL: https://arxiv.org/abs/2502.11466
Pdf URL: https://arxiv.org/pdf/2502.11466
Copy Paste: [[2502.11466]] GiFT: Gibbs Fine-Tuning for Code Generation(https://arxiv.org/abs/2502.11466)
Keywords: generation
Abstract: Training Large Language Models (LLMs) with synthetic data is a prevalent practice in code generation. A key approach is self-training, where LLMs are iteratively trained on self-generated correct code snippets. In this case, the self-generated codes are drawn from a conditional distribution, conditioned on a specific seed description. However, the seed description is not the only valid representation that aligns with its intended meaning. With all valid descriptions and codes forming a joint space, codes drawn from the conditional distribution would lead to an underrepresentation of the full description-code space. As such, we propose Gibbs Fine-Tuning (GiFT), a novel self-training method inspired by Gibbs sampling. GiFT allows self-generated data to be drawn from the marginal distribution of the joint space, thereby mitigating the biases inherent in conditional sampling. We provide a theoretical analysis demonstrating the potential benefits of fine-tuning LLMs with code derived from the marginal distribution. Furthermore, we propose a perplexity-based code selection method to mitigate the imbalanced long-tail distribution of the self-generated codes. Empirical evaluation of two LLMs across four datasets demonstrates that GiFT achieves superior performance, particularly on more challenging benchmarks.
摘要：使用合成数据培训大语言模型（LLM）是代码生成的普遍做法。一个关键方法是自我训练，在该训练中，LLM是在自生成的正确代码片段上迭代训练的。在这种情况下，自我生成的代码是从有条件的分布中得出的，该分布以特定的种子描述为条件。但是，种子描述并不是唯一与其预期含义保持一致的有效表示。通过所有有效的描述和构成关节空间的代码，从条件分布中绘制的代码将导致整个描述代码空间的代表性不足。因此，我们提出了Gibbs微调（礼物），这是一种受Gibbs采样启发的新型自我训练方法。礼物允许从关节空间的边际分布中获取自构的数据，从而减轻条件采样中固有的偏见。我们提供了理论分析，证明了通过边际分布得出的代码的微调LLM的潜在优势。此外，我们提出了一种基于困惑的代码选择方法，以减轻自我生成代码的不平衡长尾分布。对四个数据集的两个LLM的经验评估表明，礼物可以取得出色的性能，尤其是在更具挑战性的基准上。
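
The difference between conditional and marginal sampling can be seen in a toy Gibbs-style loop. This sketch replaces the LLM with two small lookup tables, so the table contents, the names `CODE_GIVEN_DESC` and `DESC_GIVEN_CODE`, and the function `gibbs_codes` are hypothetical. It only illustrates the alternating-conditional idea behind Gibbs sampling and omits the paper's perplexity-based selection step.

```python
import random

# Toy conditionals standing in for the LLM (hypothetical data).
CODE_GIVEN_DESC = {"d1": ["c1", "c2"], "d2": ["c2", "c3"]}
DESC_GIVEN_CODE = {"c1": ["d1"], "c2": ["d1", "d2"], "c3": ["d2"]}

def gibbs_codes(seed_desc, rounds, rng):
    """Alternate code|description and description|code, collecting codes."""
    desc, codes = seed_desc, []
    for _ in range(rounds):
        code = rng.choice(CODE_GIVEN_DESC[desc])   # sample code given description
        codes.append(code)
        desc = rng.choice(DESC_GIVEN_CODE[code])   # sample description given code
    return codes
```

Sampling only from the seed description `d1` can never produce `c3`, whereas the alternating chain can reach it through the paraphrased description `d2`. That is the underrepresentation the abstract attributes to conditional sampling.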

Title: Learning to Sample Effective and Diverse Prompts for Text-to-Image Generation

Authors: Taeyoung Yun, Dinghuai Zhang, Jinkyoo Park, Ling Pan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.11477
Pdf URL: https://arxiv.org/pdf/2502.11477
Copy Paste: [[2502.11477]] Learning to Sample Effective and Diverse Prompts for Text-to-Image Generation(https://arxiv.org/abs/2502.11477)
Keywords: generation, generative
Abstract: Recent advances in text-to-image diffusion models have achieved impressive image generation capabilities. However, it remains challenging to control the generation process with desired properties (e.g., aesthetic quality, user intention), which can be expressed as black-box reward functions. In this paper, we focus on prompt adaptation, which refines the original prompt into model-preferred prompts to generate desired images. While prior work uses reinforcement learning (RL) to optimize prompts, we observe that applying RL often results in generating similar postfixes and deterministic behaviors. To this end, we introduce Prompt Adaptation with GFlowNets (PAG), a novel approach that frames prompt adaptation as a probabilistic inference problem. Our key insight is that leveraging Generative Flow Networks (GFlowNets) allows us to shift from reward maximization to sampling from an unnormalized density function, enabling both high-quality and diverse prompt generation. However, we identify that a naive application of GFlowNets suffers from mode collapse and uncover a previously overlooked phenomenon: the progressive loss of neural plasticity in the model, which is compounded by inefficient credit assignment in sequential prompt generation. To address this critical challenge, we develop a systematic approach in PAG with flow reactivation, reward-prioritized sampling, and reward decomposition for prompt adaptation. Extensive experiments validate that PAG successfully learns to sample effective and diverse prompts for text-to-image generation. We also show that PAG exhibits strong robustness across various reward functions and transferability to different text-to-image models.
摘要：文本到图像扩散模型的最新进展已获得了令人印象深刻的图像产生能力。但是，以所需的属性（例如，美学质量，用户意图）来控制生成过程仍然具有挑战性，可以将其表示为黑框奖励功能。在本文中，我们专注于提示改编，该改编将原始提示完善到模型偏爱的提示中以生成所需的图像。虽然先前的工作使用增强学习（RL）来优化提示，但我们观察到，应用RL通常会导致产生类似的后缀和确定性行为。为此，我们介绍了\ textbf {p} rompt \ textbf {a}用\ textbf {g} flownets（\ textbf {pag}）Daptation（\ textbf {pag {pag}），这是一种新颖的方法，它促使适应作为概率的推断问题。我们的关键见解是，利用生成流网络（GFLOWNETS）使我们能够从奖励最大化转变为从非正常的密度函数的采样，从而使高质量和多样化的及时生成。但是，我们确定Gflownets的幼稚应用遭受模式崩溃并发现了先前被忽视的现象：模型中神经可塑性的进行性逐渐丧失，该神经可塑性在顺序及时生成中的效率低下的信用分配加剧了。为了应对这一关键挑战，我们在流动重新激活，奖励优先采样和奖励分解以迅速适应的情况下开发了一种系统的方法。广泛的实验验证了PAG成功学会了为有效和多样化的提示以进行文本形象生成。我们还表明，PAG在各种奖励函数上表现出强大的鲁棒性，并将其转移到不同的文本对象模型。

Title: DATA: Decomposed Attention-based Task Adaptation for Rehearsal-Free Continual Learning

Authors: Huanxuan Liao, Shizhu He, Yupu Hao, Jun Zhao, Kang Liu
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2502.11482
Pdf URL: https://arxiv.org/pdf/2502.11482
Copy Paste: [[2502.11482]] DATA: Decomposed Attention-based Task Adaptation for Rehearsal-Free Continual Learning(https://arxiv.org/abs/2502.11482)
Keywords: restoration
Abstract: Continual learning (CL) is essential for Large Language Models (LLMs) to adapt to evolving real-world demands, yet they are susceptible to catastrophic forgetting (CF). While traditional CF solutions rely on expensive data rehearsal, recent rehearsal-free methods employ model-based and regularization-based strategies to address this issue. However, these approaches often neglect the model's plasticity, which is crucial to achieving optimal performance on newly learned tasks. Consequently, a key challenge in CL is striking a balance between preserving plasticity and mitigating CF. To tackle this challenge, we propose the Decomposed Attention-based Task Adaptation (DATA), which explicitly decouples and learns both task-specific and task-shared knowledge using high-rank and low-rank task adapters (e.g., LoRAs). For new tasks, DATA dynamically adjusts the weights of adapters of different ranks based on their relevance and distinction from previous tasks, allowing the model to acquire new task-specific skills while effectively retaining previously learned knowledge. Specifically, we implement a decomposed component weighting strategy comprising learnable components that collectively generate attention-based weights, allowing the model to integrate and utilize diverse knowledge from each DATA. Extensive experiments on three widely used benchmarks demonstrate that our proposed method achieves state-of-the-art performance. Notably, our approach significantly enhances model plasticity and mitigates CF by extending learnable components and employing stochastic restoration during training iterations.
摘要：持续学习（CL）对于大型语言模型（LLM）是必不可少的，以适应不断发展的现实世界需求，但它们容易受到灾难性遗忘（CF）的影响。尽管传统的CF解决方案依靠昂贵的数据排练，但最近的无排练方法采用基于模型和基于正规化的策略来解决此问题。但是，这些方法通常会忽略该模型的可塑性，这对于在新学习的任务上实现最佳性能至关重要。因此，CL的关键挑战是在保持可塑性和缓解CF之间达到平衡。为了应对这一挑战，我们提出$ \ textbf {d} $ composed $ \ textbf {a} $ ttention $ ttention $ \ textbf {t} $ ask ask ask ask ask ask textbf {a} $ daptation（data）使用高级和低级任务适配器（例如Loras）学习特定于任务和任务共享的知识。对于新任务，数据根据其相关性和与先前任务的区别动态调整不同等级的适配器的权重，从而使该模型能够获得特定于任务的新技能，同时有效地保留先前学习的知识。具体而言，我们实施了一个分解的组件加权策略，该策略包括可学习的组件，这些组件共同产生了基于注意力的权重，从而使模型可以从每个数据中整合并利用各种知识。对三个广泛使用的基准测试的广泛实验表明，我们提出的方法可实现最先进的性能。值得注意的是，我们的方法可显着增强模型的可塑性，并通过扩展可学习的组件并在训练迭代期间采用随机恢复来减轻CF。

Title: Accelerated Gradient-based Design Optimization Via Differentiable Physics-Informed Neural Operator: A Composites Autoclave Processing Case Study

Authors: Janak M. Patel, Milad Ramezankhani, Anirudh Deodhar, Dagnachew Birru
Subjects: cs.LG, cs.AI, math.NA
Abstract URL: https://arxiv.org/abs/2502.11504
Pdf URL: https://arxiv.org/pdf/2502.11504
Copy Paste: [[2502.11504]] Accelerated Gradient-based Design Optimization Via Differentiable Physics-Informed Neural Operator: A Composites Autoclave Processing Case Study(https://arxiv.org/abs/2502.11504)
Keywords: super-resolution
Abstract: Simulation and optimization are crucial for advancing the engineering design of complex systems and processes. Traditional optimization methods require substantial computational time and effort due to their reliance on resource-intensive simulations, such as finite element analysis, and the complexity of rigorous optimization algorithms. Data-agnostic AI-based surrogate models, such as Physics-Informed Neural Operators (PINOs), offer a promising alternative to these conventional simulations, providing drastically reduced inference time, unparalleled data efficiency, and zero-shot super-resolution capability. However, the predictive accuracy of these models is often constrained to small, low-dimensional design spaces or systems with relatively simple dynamics. To address this, we introduce a novel Physics-Informed DeepONet (PIDON) architecture, which extends the capabilities of conventional neural operators to effectively model the nonlinear behavior of complex engineering systems across high-dimensional design spaces and a wide range of dynamic design configurations. This new architecture outperforms existing SOTA models, enabling better predictions across broader design spaces. Leveraging PIDON's differentiability, we integrate a gradient-based optimization approach using the Adam optimizer to efficiently determine optimal design variables. This forms an end-to-end gradient-based optimization framework that accelerates the design process while enhancing scalability and efficiency. We demonstrate the effectiveness of this framework in the optimization of aerospace-grade composites curing processes achieving a 3x speedup in obtaining optimal design variables compared to gradient-free methods. Beyond composites processing, the proposed model has the potential to be used as a scalable and efficient optimization tool for broader applications in advanced engineering and digital twin systems.
摘要：仿真和优化对于推进复杂系统和流程的工程设计至关重要。传统优化方法由于依赖资源密集型模拟（例如有限元分析）以及严格优化算法的复杂性，因此需要大量的计算时间和精力。基于数据的AI替代模型，例如物理信息的神经操作员（PINOS），为这些常规模拟提供了有希望的替代方法，可大大减少推理时间，无与伦比的数据效率和零发出的超级分辨率功能。但是，这些模型的预测准确性通常被限制在具有相对简单动力学的小型，低维的设计空间或系统上。为了解决这个问题，我们介绍了一种新颖的物理知识的DeepOnet（Pidon）体系结构，该体系结构扩展了常规神经操作员的能力，以有效地模拟高维设计空间跨高维工程系统和各种动态设计配置的复杂工程系统的非线性行为。这种新的体系结构优于现有的SOTA模型，从而可以在更广泛的设计空间之间进行更好的预测。利用Pidon的不同性，我们使用ADAM Optimizer集成了一种基于梯度的优化方法，以有效地确定最佳设计变量。这形成了基于端梯度的优化框架，该框架可以加速设计过程，同时提高可扩展性和效率。我们证明了该框架在优化航空航天级复合材料固化过程中的有效性，与无梯度方法相比，在获得最佳设计变量方面达到了3倍加速。除了复合材料处理之外，提议的模型还可以用作高级工程和数字双胞胎系统中更广泛应用的可扩展和高效优化工具。

Title: A GNN-based Spectral Filtering Mechanism for Imbalance Classification in Network Digital Twin

Authors: Abubakar Isah, Ibrahim Aliyu, Sulaiman Muhammad Rashid, Jaehyung Park, Minsoo Hahn, Jinsul Kim
Subjects: cs.LG, cs.NI
Abstract URL: https://arxiv.org/abs/2502.11505
Pdf URL: https://arxiv.org/pdf/2502.11505
Copy Paste: [[2502.11505]] A GNN-based Spectral Filtering Mechanism for Imbalance Classification in Network Digital Twin(https://arxiv.org/abs/2502.11505)
Keywords: generation
Abstract: Graph Neural Networks are gaining attention in Fifth-Generation (5G) core network digital twins, which are data-driven complex systems with numerous components. Analyzing these data can be challenging due to rare failure types, leading to imbalanced classification in multiclass settings. Digital twins of 5G networks increasingly employ graph classification as the main method for identifying failure types. However, the skewed distribution of failure occurrences is a major class imbalance issue that prevents effective graph data mining. Previous studies have not sufficiently tackled this complex problem. In this paper, we propose the Class-Fourier Graph Neural Network (CF-GNN), which introduces a class-oriented spectral filtering mechanism that ensures precise classification by estimating a unique spectral filter for each class. We employ eigenvalue and eigenvector spectral filtering to capture and adapt to variations in the minority classes, ensuring accurate class-specific feature discrimination; the model is also adept at graph representation learning for complex local structures among neighbors in an end-to-end setting. Extensive experiments have demonstrated that the proposed CF-GNN could help with both the creation of new techniques for enhancing classifiers and the investigation of the characteristics of the multi-class imbalanced data in a network digital twin system.
摘要：图形神经网络在第五代（5G）核心网络数字双胞胎中引起了人们的关注，这些数字双胞胎是具有许多组件的数据驱动的复杂系统。由于罕见的故障类型，分析这些数据可能具有挑战性，导致多类设置中的分类不平衡。 5G网络的数字双胞胎越来越多地利用图形分类作为识别故障类型的主要方法。但是，失败事件的偏斜分布是防止有效的图形数据挖掘的主要类不平衡问题。以前的研究还没有足够解决这个复杂的问题。在本文中，我们提出了类图形神经网络（CF-GNN）引入了面向类的光谱滤波机制，该机制通过估算每个类别的唯一光谱滤波器来确保精确的分类。我们采用特征值和特征向量光谱过滤来捕获并适应少数群体的变化，确保精确的特定于类的特征特征歧视，并擅长于在端到端设置中邻居之间复杂的局部结构的图表表示学习。广泛的实验表明，所提出的CF-GNN可以帮助创建新技术来增强分类器，并研究网络数字双胞胎系统中多级不平衡数据的特征。
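
Eigenvalue and eigenvector spectral filtering, as mentioned in the abstract, can be sketched with NumPy. The sketch assumes an undirected graph, the combinatorial Laplacian L = D - A, and hand-written per-class frequency responses. The function names and the idea of passing one response function per class are illustrative assumptions; in CF-GNN the filters are estimated by the model rather than fixed by hand.

```python
import numpy as np

def spectral_filter(adjacency, signal, response):
    """Filter a graph signal as U g(Lambda) U^T x on the combinatorial Laplacian."""
    degree = np.diag(adjacency.sum(axis=1))
    laplacian = degree - adjacency
    eigvals, eigvecs = np.linalg.eigh(laplacian)  # Laplacian is symmetric
    return eigvecs @ (response(eigvals) * (eigvecs.T @ signal))

def class_filters(adjacency, signal, responses):
    """Apply one spectral response per class (assumed per-class filter bank)."""
    return {label: spectral_filter(adjacency, signal, g)
            for label, g in responses.items()}
```

A response that keeps only the zero eigenvalue acts as an extreme low-pass filter and returns the mean of the signal on a connected graph, while a response of all ones leaves the signal unchanged.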

Title: DifCluE: Generating Counterfactual Explanations with Diffusion Autoencoders and modal clustering

Authors: Suparshva Jain, Amit Sangroya, Lovekesh Vig
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2502.11509
Pdf URL: https://arxiv.org/pdf/2502.11509
Copy Paste: [[2502.11509]] DifCluE: Generating Counterfactual Explanations with Diffusion Autoencoders and modal clustering(https://arxiv.org/abs/2502.11509)
Keywords: generation
Abstract: Generating multiple counterfactual explanations for different modes within a class presents a significant challenge, as these modes are distinct yet converge under the same classification. Diffusion probabilistic models (DPMs) have demonstrated a strong ability to capture the underlying modes of data distributions. In this paper, we harness the power of a Diffusion Autoencoder to generate multiple distinct counterfactual explanations. By clustering in the latent space, we uncover the directions corresponding to the different modes within a class, enabling the generation of diverse and meaningful counterfactuals. We introduce a novel methodology, DifCluE, which consistently identifies these modes and produces more reliable counterfactual explanations. Our experimental results demonstrate that DifCluE outperforms the current state-of-the-art in generating multiple counterfactual explanations, offering a significant advancement in model interpretability.
摘要：在类中为不同模式生成多个反事实解释带来了重大挑战，因为这些模式在同一分类下却汇聚在一起。扩散概率模型（DPM）表明，捕获数据分布的潜在模式具有很强的能力。在本文中，我们利用扩散自动编码器的力量产生多种不同的反事实解释。通过在潜在空间中进行聚类，我们发现了与一类不同模式相对应的方向，从而能够产生各种和有意义的反事实。我们介绍了一种新颖的方法论，即Difclue，该方法始终如一地识别这些模式并产生更可靠的反事实解释。我们的实验结果表明，分散在产生多种反事实解释方面的最新目前的表现优于当前的最新解释，从而在模型可解释性方面具有重大进步。

Title: SayAnything: Audio-Driven Lip Synchronization with Conditional Video Diffusion

Authors: Junxian Ma, Shiwen Wang, Jian Yang, Junyi Hu, Jian Liang, Guosheng Lin, Jingbo chen, Kai Li, Yu Meng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.11515
Pdf URL: https://arxiv.org/pdf/2502.11515
Copy Paste: [[2502.11515]] SayAnything: Audio-Driven Lip Synchronization with Conditional Video Diffusion(https://arxiv.org/abs/2502.11515)
Keywords: generation
Abstract: Recent advances in diffusion models have led to significant progress in audio-driven lip synchronization. However, existing methods typically rely on constrained audio-visual alignment priors or multi-stage learning of intermediate representations to force lip motion synthesis. This leads to complex training pipelines and limited motion naturalness. In this paper, we present SayAnything, a conditional video diffusion framework that directly synthesizes lip movements from audio input while preserving speaker identity. Specifically, we propose three specialized modules including identity preservation module, audio guidance module, and editing control module. Our novel design effectively balances different condition signals in the latent space, enabling precise control over appearance, motion, and region-specific generation without requiring additional supervision signals or intermediate representations. Extensive experiments demonstrate that SayAnything generates highly realistic videos with improved lip-teeth coherence, enabling unseen characters to say anything, while effectively generalizing to animated characters.
摘要：扩散模型的最新进展导致了音频驱动的唇同步的重大进展。但是，现有方法通常依赖于受约束的视听比对先验或中间表示的多阶段学习来迫使唇部运动合成。这导致复杂的训练管道和有限的运动自然性。在本文中，我们介绍了Sayanything，这是一个有条件的视频扩散框架，该框架直接综合了来自音频输入的唇部动作，同时保留了扬声器的身份。具体来说，我们提出了三个专业模块，包括身份保护模块，音频指导模块和编辑控制模块。我们的新型设计有效地平衡了潜在空间中不同的条件信号，从而可以精确控制外观，运动和特定于区域的生成，而无需额外的监督信号或中间表示。广泛的实验表明，Seaving the theything会产生高度现实的视频，并具有改进的Lip-Teeth连贯性，使看不见的角色能够说出任何话，同时有效地概括了动画角色。

Title: Control-CLIP: Decoupling Category and Style Guidance in CLIP for Specific-Domain Generation

Authors: Zexi Jia, Chuanwei Huang, Hongyan Fei, Yeshuang Zhu, Zhiqiang Yuan, Jinchao Zhang, Jie Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.11532
Pdf URL: https://arxiv.org/pdf/2502.11532
Copy Paste: [[2502.11532]] Control-CLIP: Decoupling Category and Style Guidance in CLIP for Specific-Domain Generation(https://arxiv.org/abs/2502.11532)
Keywords: generation
Abstract: Text-to-image diffusion models have shown remarkable capabilities of generating high-quality images closely aligned with textual inputs. However, the effectiveness of text guidance heavily relies on the CLIP text encoder, which is trained to pay more attention to general content but struggles to capture semantics in specific domains like styles. As a result, generation models tend to fail on prompts like "a photo of a cat in Pokemon style", simply producing images that depict "a photo of a cat". To fill this gap, we propose Control-CLIP, a novel decoupled CLIP fine-tuning framework that enables the CLIP model to learn the meaning of category and style in a complementary manner. With specially designed fine-tuning tasks on minimal data and a modified cross-attention mechanism, Control-CLIP can precisely guide the diffusion model to a specific domain. Moreover, the parameters of the diffusion model remain entirely unchanged, preserving the original generation performance and diversity. Experiments across multiple domains confirm the effectiveness of our approach, particularly highlighting its robust plug-and-play capability in generating content with various specific styles.
摘要：文本对图像扩散模型显示出了与文本输入紧密对齐的高质量图像的显着功能。但是，文本指南的有效性在很大程度上取决于剪辑文本编码器，该剪辑文本编码器旨在更加关注一般内容，但努力捕获在特定领域（例如样式）中的语义。结果，在简单地制作描绘“猫的照片”的图像方面，生成模型往往会在“猫咪风格的猫照片”之类的提示中失败。为了填补这一空白，我们提出了Control-CLIP，这是一种新颖的解耦剪贴微调框架，使剪辑模型能够以补充方式学习类别和样式的含义。借助最小数据的专门设计的微调任务和修改的交叉注意机制，控制夹可以精确地指导扩散模型到特定域。此外，扩散模型的参数根本保持不变，从而保留了原始的生成性能和多样性。跨多个领域的实验证实了我们方法的有效性，尤其是强调了其强大的插件功能，以生成各种特定样式的内容。

Title: Syllables to Scenes: Literary-Guided Free-Viewpoint 3D Scene Synthesis from Japanese Haiku

Authors: Chunan Yu, Yidong Han, Chaotao Ding, Ying Zang, Lanyun Zhu, Xinhao Chen, Zejian Li, Renjun Xu, Tianrun Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.11586
Pdf URL: https://arxiv.org/pdf/2502.11586
Copy Paste: [[2502.11586]] Syllables to Scenes: Literary-Guided Free-Viewpoint 3D Scene Synthesis from Japanese Haiku(https://arxiv.org/abs/2502.11586)
Keywords: generative
Abstract: In the era of the metaverse, where immersive technologies redefine human experiences, translating abstract literary concepts into navigable 3D environments presents a fundamental challenge in preserving semantic and emotional fidelity. This research introduces HaikuVerse, a novel framework for transforming poetic abstraction into spatial representation, with Japanese Haiku serving as an ideal test case due to its sophisticated encapsulation of profound emotions and imagery within minimal text. While existing text-to-3D methods struggle with nuanced interpretations, we present a literary-guided approach that synergizes traditional poetry analysis with advanced generative technologies. Our framework centers on two key innovations: (1) Hierarchical Literary-Criticism Theory Grounded Parsing (H-LCTGP), which captures both explicit imagery and implicit emotional resonance through structured semantic decomposition, and (2) Progressive Dimensional Synthesis (PDS), a multi-stage pipeline that systematically transforms poetic elements into coherent 3D scenes through sequential diffusion processes, geometric optimization, and real-time enhancement. Extensive experiments demonstrate that HaikuVerse significantly outperforms conventional text-to-3D approaches in both literary fidelity and visual quality, establishing a new paradigm for preserving cultural heritage in immersive digital spaces. Project website at: this https URL
摘要：在元评估时代，沉浸式技术重新定义了人类的经验，将抽象的文学概念转化为可通道的3D环境提出了维护语义和情感忠诚度的基本挑战。这项研究介绍了Haikuverse，这是一个新颖的框架，用于将诗意抽象转化为空间表现，由于日本的haiku在最小文本中对深刻的情感和图像的复杂封装，因此作为理想的测试案例。尽管现有的文本到3D方法与细微的解释斗争，但我们提出了一种文学引导的方法，该方法通过先进的生成技术协同传统诗歌分析。我们的框架集中在两项关键创新上：（1）分层文学批评理论扎根解析（H-LCTGP），该理论既捕获明确的图像和隐性情感共鸣，并通过结构化的语义分解，以及（2）渐进式尺寸综合（PDS），A，A，A gractgp。多级管道通过顺序扩散过程，几何优化和实时增强，系统地将诗歌元素系统地转化为相干的3D场景。广泛的实验表明，Haikuverse在文学忠诚度和视觉质量中都显着优于传统的文本到3D方法，从而建立了一个新的范式，以保留沉浸式数字空间中的文化遗产。项目网站网址：此HTTPS URL

Title: GraphThought: Graph Combinatorial Optimization with Thought Generation

Authors: Zixiao Huang, Lifeng Guo, Junjie Sheng, Haosheng Chen, Wenhao Li, Bo Jin, Changhong Lu, Xiangfeng Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.11607
Pdf URL: https://arxiv.org/pdf/2502.11607
Copy Paste: [[2502.11607]] GraphThought: Graph Combinatorial Optimization with Thought Generation(https://arxiv.org/abs/2502.11607)
Keywords: generation, generative
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across various domains, especially in text processing and generative tasks. Recent advancements in the reasoning capabilities of state-of-the-art LLMs, such as OpenAI-o1, have significantly broadened their applicability, particularly in complex problem-solving and logical inference. However, most existing LLMs struggle with notable limitations in handling graph combinatorial optimization (GCO) problems. To bridge this gap, we formally define the Optimal Thoughts Design (OTD) problem, including its state and action thought space. We then introduce a novel framework, GraphThought, designed to generate high-quality thought datasets for GCO problems. Leveraging these datasets, we fine-tune the Llama-3-8B-Instruct model to develop Llama-GT. Notably, despite its compact 8B-parameter architecture, Llama-GT matches the performance of state-of-the-art LLMs on the GraphArena benchmark. Experimental results show that our approach outperforms both proprietary and open-source models, even rivaling specialized models like o1-mini. This work sets a new state-of-the-art benchmark while challenging the prevailing notion that model scale is the primary driver of reasoning capability.
摘要：大型语言模型（LLM）在各个领域都表现出了出色的功能，尤其是在文本处理和生成任务中。诸如OpenAI-O1之类的最先进LLM的推理能力的最新进步已显着扩大了其适用性，尤其是在复杂的解决问题和逻辑推断方面。但是，在处理图组合优化（GCO）问题时，大多数现有的LLM都在遇到明显的局限性。为了弥合这一差距，我们正式定义了最佳思想设计（OTD）问题，包括其状态和行动思想空间。然后，我们介绍了一个新颖的框架，即图形，旨在为GCO问题生成高质量的思想数据集。利用这些数据集，我们将Llama-3-8b-Instruct模型微调以开发Llama-GT。值得注意的是，尽管具有紧凑的8B参数架构，但Llama-GT仍与Grapharena基准中最先进的LLMS的性能相匹配。实验结果表明，我们的方法的表现均优于专有和开源模型，甚至与O1-Mini（例如O1-Mini）相媲美。这项工作为新的最新基准制定了新的基准，同时挑战了普遍的概念，即模型量表是推理能力的主要驱动力。

Title: Maximum Entropy Reinforcement Learning with Diffusion Policy

Authors: Xiaoyi Dong, Jian Cheng, Xi Sheryl Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2502.11612
Pdf URL: https://arxiv.org/pdf/2502.11612
Copy Paste: [[2502.11612]] Maximum Entropy Reinforcement Learning with Diffusion Policy(https://arxiv.org/abs/2502.11612)
Keywords: generative
Abstract: The Soft Actor-Critic (SAC) algorithm with a Gaussian policy has become a mainstream implementation for realizing the Maximum Entropy Reinforcement Learning (MaxEnt RL) objective, which incorporates entropy maximization to encourage exploration and enhance policy robustness. While the Gaussian policy performs well on simpler tasks, its exploration capacity and potential performance in complex multi-goal RL environments are limited by its inherent unimodality. In this paper, we employ the diffusion model, a powerful generative model capable of capturing complex multimodal distributions, as the policy representation to fulfill the MaxEnt RL objective, developing a method named MaxEnt RL with Diffusion Policy (MaxEntDP). Our method enables efficient exploration and brings the policy closer to the optimal MaxEnt policy. Experimental results on Mujoco benchmarks show that MaxEntDP outperforms the Gaussian policy and other generative models within the MaxEnt RL framework, and performs comparably to other state-of-the-art diffusion-based online RL algorithms. Our code is available at this https URL.
摘要：具有高斯政策的软演员 - 批评（SAC）算法已成为实现最大熵增强学习（Maxent RL）目标的主流实施，该目标结合了熵最大化，以鼓励探索并增强政策鲁棒性。尽管高斯政策在更简单的任务上表现良好，但其探索能力和在复杂的多进球RL环境中的潜在性能受到其固有的单模式的限制。在本文中，我们采用了扩散模型，这是一个强大的生成模型，能够捕获复杂的多模式分布，作为实现Maxent RL目标的策略表示形式，开发了一种具有扩散策略（MAXENTDP）的名为Maxent RL的方法。我们的方法实现了有效的探索，并使政策更接近最佳最大政策。 Mujoco基准测试的实验结果表明，MaxEntDP在Maxent RL框架中优于高斯策略和其他生成模型，并且与其他基于其他基于最新扩散的在线RL算法相当地执行。我们的代码可在此HTTPS URL上找到。
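
Note: for reference, the MaxEnt RL objective that the abstract refers to is commonly written as below. This is the standard textbook form from the SAC literature, not an equation taken from this paper; the discount factor \gamma and temperature \alpha are the usual symbols and are assumed here.

```latex
J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^{t}\,\bigl(r(s_t, a_t) + \alpha\,\mathcal{H}(\pi(\cdot \mid s_t))\bigr)\right],
\qquad
\mathcal{H}(\pi(\cdot \mid s)) = -\mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[\log \pi(a \mid s)\right]
```

A unimodal Gaussian \pi(\cdot \mid s) limits how much of this entropy term can be realized in multi-goal settings, which is the motivation the abstract gives for a diffusion policy.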

Title: Membership Inference Attacks for Face Images Against Fine-Tuned Latent Diffusion Models

Authors: Lauritz Christian Holme, Anton Mosquera Storgaard, Siavash Arjomand Bigdeli
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.11619
Pdf URL: https://arxiv.org/pdf/2502.11619
Copy Paste: [[2502.11619]] Membership Inference Attacks for Face Images Against Fine-Tuned Latent Diffusion Models(https://arxiv.org/abs/2502.11619)
Keywords: generative
Abstract: The rise of generative image models leads to privacy concerns when it comes to the huge datasets used to train such models. This paper investigates the possibility of inferring whether a set of face images was used for fine-tuning a Latent Diffusion Model (LDM). A Membership Inference Attack (MIA) method is presented for this task. Using generated auxiliary data for the training of the attack model leads to significantly better performance, and so does the use of watermarks. The guidance scale used for inference was found to have a significant influence. If an LDM is fine-tuned for long enough, the text prompt used for inference has no significant influence. The proposed MIA is found to be viable in a realistic black-box setup against LDMs fine-tuned on face images.
摘要：当涉及用于训练此类模型的庞大数据集时，生成图像模型的兴起会导致隐私问题。本文研究了是否使用一组面部图像来微调潜在扩散模型（LDM）的可能性。为此任务提出了会员推理攻击（MIA）方法。使用生成的辅助数据训练攻击模型会导致性能明显更好，并且使用水印也是如此。发现用于推理的指导量表具有重大影响。如果LDM对足够长的时间进行了微调，则用于推断的文本提示没有显着影响。拟议的MIA被发现在面对面图像对LDMS的现实黑盒设置中可行。

Title: Neural Interpretable Reasoning

Authors: Pietro Barbiero, Giuseppe Marra, Gabriele Ciravegna, David Debot, Francesco De Santis, Michelangelo Diligenti, Mateo Espinosa Zarlenga, Francesco Giannini
Subjects: cs.LG, cs.AI, cs.NE
Abstract URL: https://arxiv.org/abs/2502.11639
Pdf URL: https://arxiv.org/pdf/2502.11639
Copy Paste: [[2502.11639]] Neural Interpretable Reasoning(https://arxiv.org/abs/2502.11639)
Keywords: generation
Abstract: We formalize a novel modeling framework for achieving interpretability in deep learning, anchored in the principle of inference equivariance. While the direct verification of interpretability scales exponentially with the number of variables of the system, we show that this complexity can be mitigated by treating interpretability as a Markovian property and employing neural re-parametrization techniques. Building on these insights, we propose a new modeling paradigm -- neural generation and interpretable execution -- that enables scalable verification of equivariance. This paradigm provides a general approach for designing Neural Interpretable Reasoners that are not only expressive but also transparent.
摘要：我们正式建立了一个新颖的建模框架，以实现深度学习中的可解释性，这是基于推理均值原则的。虽然对系统的数量进行了指数级的直接验证，但我们表明，可以通过将可解释性视为马尔可夫特性并采用神经重新参数化技术来缓解这种复杂性。在这些见解的基础上，我们提出了一种新的建模范式（神经产生和可解释的执行），可实现可扩展的验证。该范式提供了一种通用方法，用于设计不仅表现力，而且透明的神经解释推理者。

Title: GaussianMotion: End-to-End Learning of Animatable Gaussian Avatars with Pose Guidance from Text

Authors: Gyumin Shim, Sangmin Lee, Jaegul Choo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.11642
Pdf URL: https://arxiv.org/pdf/2502.11642
Copy Paste: [[2502.11642]] GaussianMotion: End-to-End Learning of Animatable Gaussian Avatars with Pose Guidance from Text(https://arxiv.org/abs/2502.11642)
Keywords: generation
Abstract: In this paper, we introduce GaussianMotion, a novel human rendering model that generates fully animatable scenes aligned with textual descriptions using Gaussian Splatting. Although existing methods achieve reasonable text-to-3D generation of human bodies using various 3D representations, they often face limitations in fidelity and efficiency, or primarily focus on static models with limited pose control. In contrast, our method generates fully animatable 3D avatars by combining deformable 3D Gaussian Splatting with text-to-3D score distillation, achieving high fidelity and efficient rendering for arbitrary poses. By densely generating diverse random poses during optimization, our deformable 3D human model learns to capture a wide range of natural motions distilled from a pose-conditioned diffusion model in an end-to-end manner. Furthermore, we propose Adaptive Score Distillation that effectively balances realistic detail and smoothness to achieve optimal 3D results. Experimental results demonstrate that our approach outperforms existing baselines by producing high-quality textures in both static and animated results, and by generating diverse 3D human models from various textual inputs.
摘要：在本文中，我们介绍了GaussianMotion，这是一种新型的人类渲染模型，该模型生成了完全可动画的场景，该场景与使用高斯分裂的文本描述一致。尽管现有方法使用各种3D表示实现了合理的文本对3D生成人体，但它们通常面临忠诚度和效率的限制，或者主要集中在姿势控制有限的静态模型上。相比之下，我们的方法通过将可变形的3D高斯脱落与文本到3D得分蒸馏相结合，实现高忠诚度并有效地呈现任意姿势来生成完全动画的3D化身。通过在优化过程中密集产生多种随机姿势，我们可变形的3D人类模型学会以端到端方式从姿势条件的扩散模型中捕获广泛的自然运动。此外，我们提出了自适应评分蒸馏，可以有效地平衡现实的细节和平滑度，以实现最佳的3D结果。实验结果表明，我们的方法通过在静态和动画结果中产生高质量的纹理以及从各种文本输入中产生不同的3D人类模型来优于现有基线。

Title: Hyperspherical Energy Transformer with Recurrent Depth

Authors: Yunzhe Hu, Difan Zou, Dong Xu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.11646
Pdf URL: https://arxiv.org/pdf/2502.11646
Copy Paste: [[2502.11646]] Hyperspherical Energy Transformer with Recurrent Depth(https://arxiv.org/abs/2502.11646)
Keywords: generation
Abstract: Transformer-based foundation models have achieved unprecedented success with an enormous number of parameters and computational resources. Yet the core building blocks of these models, the Transformer layers, and how they are arranged and configured are primarily engineered from the bottom up and driven by heuristics. Advancing next-generation architectures demands exploring a prototypical model that is amenable to high interpretability and of practical competence. To this end, we take a step from the top-down view and design neural networks from an energy minimization perspective. Specifically, to promote isotropic token distribution on the sphere, we formulate a modified Hopfield energy function on the subspace-embedded hypersphere, based on which Transformer layers with symmetric structures are designed as the iterative optimization for the energy function. By integrating layers with the same parameters, we propose Hyper-Spherical Energy Transformer (Hyper-SET), an alternative to the vanilla Transformer with recurrent depth. This design inherently provides greater interpretability and allows for scaling to deeper layers without a significant increase in the number of parameters. We also empirically demonstrate that Hyper-SET achieves comparable or even superior performance on both synthetic and real-world tasks, such as solving Sudoku and masked image modeling, while utilizing fewer parameters.
摘要：基于变压器的基础模型通过大量参数和计算资源取得了前所未有的成功。然而，这些模型的核心构建块，变压器层以及如何布置和配置的方式主要是由启发式方法从底部进行设计的。为了推进下一代体系结构，它需要探索一种原型模型，该模型适合高解释性和实践能力。为此，我们从自上而下的视图和设计神经网络从能量最小化的角度迈出了一步。具体而言，为了促进球体上的各向同性令牌分布，我们在子空间内置的超晶体上制定了修改的Hopfield Energy函数，其基于其具有对称结构的变压器层设计为对能量函数的迭代优化。通过将图层与相同的参数集成在一起，我们提出\ textIt {Hyper-Spherical Energy Transformer}（Hyper-Set），这是具有复发深度的Vanilla Transformer的替代方案。该设计固有地提供了更大的解释性，并允许扩展到更深的层，而不会显着增加参数的数量。我们还从经验上证明，在合成和现实世界中，超级设定的效果可比性甚至卓越的性能，例如求解sudoku和蒙版的图像建模，同时利用较少的参数。

Title: MMXU: A Multi-Modal and Multi-X-ray Understanding Dataset for Disease Progression

Authors: Linjie Mu, Zhongzhen Huang, Shengqian Qin, Yakun Zhu, Shaoting Zhang, Xiaofan Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2502.11651
Pdf URL: https://arxiv.org/pdf/2502.11651
Copy Paste: [[2502.11651]] MMXU: A Multi-Modal and Multi-X-ray Understanding Dataset for Disease Progression(https://arxiv.org/abs/2502.11651)
Keywords: generation
Abstract: Large vision-language models (LVLMs) have shown great promise in medical applications, particularly in visual question answering (MedVQA) and diagnosis from medical images. However, existing datasets and models often fail to consider critical aspects of medical diagnostics, such as the integration of historical records and the analysis of disease progression over time. In this paper, we introduce MMXU (Multi-Modal and Multi-X-ray Understanding), a novel dataset for MedVQA that focuses on identifying changes in specific regions between two patient visits. Unlike previous datasets that primarily address single-image questions, MMXU enables multi-image questions, incorporating both current and historical patient data. We demonstrate the limitations of current LVLMs in identifying disease progression on MMXU-test, even for those that perform well on traditional benchmarks. To address this, we propose a MedRecord-Augmented Generation (MAG) approach, incorporating both global and regional historical records. Our experiments show that integrating historical records significantly enhances diagnostic accuracy by at least 20%, bridging the gap between current LVLMs and human expert performance. Additionally, we fine-tune models with MAG on MMXU-dev, which demonstrates notable improvements. We hope this work could illuminate the avenue of advancing the use of LVLMs in medical diagnostics by emphasizing the importance of historical context in interpreting medical images. Our dataset is released at this https URL.
摘要：大型视觉模型（LVLM）在医疗应用中表现出了巨大的希望，尤其是在视觉问题答案（MEDVQA）和医学图像诊断中。但是，现有的数据集和模型通常无法考虑医学诊断的关键方面，例如历史记录的整合以及随着时间的推移分析疾病进展。在本文中，我们引入了MMXU（多模式和多射线理解），这是一种用于MEDVQA的新型数据集，重点是识别两个患者访问之间特定区域的变化。与以前主要解决单片问题的数据集不同，MMXU启用了多图像问题，并纳入了当前和历史患者数据。我们证明了当前LVLM在MMXU- \ textit {test}上识别疾病进展的局限性，即使是在传统基准上表现良好的疾病。为了解决这个问题，我们提出了一种杂种杰出的一代（MAG）方法，并纳入了全球和区域历史记录。我们的实验表明，整合历史记录至少可以显着提高诊断准确性20 \％，从而弥合了当前LVLM和人类专家绩效之间的差距。此外，我们在mmxu- \ textit {dev}上使用MAG微调了模型，该模型证明了显着的改进。我们希望这项工作可以通过强调历史环境在解释医学图像中的重要性来阐明在医学诊断中推进使用LVLM的途径。我们的数据集以\ href {this HTTPS url} {此https url}发布。

Title: Object-Centric Image to Video Generation with Language Guidance

Authors: Angel Villar-Corrales, Gjergj Plepi, Sven Behnke
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.11655
Pdf URL: https://arxiv.org/pdf/2502.11655
Copy Paste: [[2502.11655]] Object-Centric Image to Video Generation with Language Guidance(https://arxiv.org/abs/2502.11655)
Keywords: generation, generative
Abstract: Accurate and flexible world models are crucial for autonomous systems to understand their environment and predict future events. Object-centric models, with structured latent spaces, have shown promise in modeling object dynamics and interactions, but often face challenges in scaling to complex datasets and incorporating external guidance, limiting their applicability in robotics. To address these limitations, we propose TextOCVP, an object-centric model for image-to-video generation guided by textual descriptions. TextOCVP parses an observed scene into object representations, called slots, and utilizes a text-conditioned transformer predictor to forecast future object states and video frames. Our approach jointly models object dynamics and interactions while incorporating textual guidance, thus leading to accurate and controllable predictions. Our method's structured latent space offers enhanced control over the prediction process, outperforming several image-to-video generative baselines. Additionally, we demonstrate that structured object-centric representations provide superior controllability and interpretability, facilitating the modeling of object dynamics and enabling more precise and understandable predictions. Videos and code are available at this https URL.
摘要：准确，灵活的世界模型对于自主系统了解其环境并预测未来事件至关重要。具有结构性潜在空间的对象中心模型在建模对象动态和交互作用方面已显示出希望，但在扩展到复杂数据集并合并外部指导时，经常面临挑战，从而将其适用性限制在机器人技术中。为了解决这些局限性，我们提出了TextOCVP，这是一个以对象为中心的模型，用于以文本描述为指导的图像到视频生成。 TextOCVP将观察到的场景解析为对象表示，称为插槽，并利用文本条件的变压器预测器预测未来对象状态和视频帧。我们的方法在合并文本指导的同时共同对象动态和相互作用进行建模，从而导致准确和可控的预测。我们方法的结构化潜在空间提供了对预测过程的增强控制，超过了几个图像到视频生成基线。此外，我们证明了结构化对象的表示形式提供了卓越的可控性和解释性，从而促进了对象动态的建模，并实现了更精确，更易于理解的预测。视频和代码可在此HTTPS URL上找到。

Title: MaskGWM: A Generalizable Driving World Model with Video Mask Reconstruction

Authors: Jingcheng Ni, Yuxin Guo, Yichen Liu, Rui Chen, Lewei Lu, Zehuan Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.11663
Pdf URL: https://arxiv.org/pdf/2502.11663
Copy Paste: [[2502.11663]] MaskGWM: A Generalizable Driving World Model with Video Mask Reconstruction(https://arxiv.org/abs/2502.11663)
Keywords: generation, generative
Abstract: World models that forecast environmental changes from actions are vital for autonomous driving models with strong generalization. Prevailing driving world models are mainly built on video prediction models. Although these models can produce high-fidelity video sequences with advanced diffusion-based generators, they are constrained by their predictive duration and overall generalization capabilities. In this paper, we explore solving this problem by combining generation loss with MAE-style feature-level context learning. In particular, we instantiate this target with three key designs: (1) a more scalable Diffusion Transformer (DiT) structure trained with an extra mask construction task; (2) diffusion-related mask tokens that deal with the fuzzy relations between mask reconstruction and the generative diffusion process; and (3) an extension of the mask construction task to the spatial-temporal domain, using a row-wise mask for shifted self-attention rather than the masked self-attention of MAE. We then adopt a row-wise cross-view module to align with this mask design. Based on the above improvements, we propose MaskGWM: a Generalizable driving World Model embodied with Video Mask reconstruction. Our model contains two variants: MaskGWM-long, focusing on long-horizon prediction, and MaskGWM-mview, dedicated to multi-view generation. Comprehensive experiments on standard benchmarks validate the effectiveness of the proposed method, including standard validation on the nuScenes dataset, long-horizon rollout on the OpenDV-2K dataset, and zero-shot validation on the Waymo dataset. Quantitative metrics on these datasets show that our method notably improves on state-of-the-art driving world models.
摘要：预测行动环境变化的世界模型对于具有强烈概括的自主驾驶模型至关重要。盛行的驱动世界模型主要建立在视频预测模型上。尽管这些模型可以用基于高级扩散的发电机产生高保真视频序列，但它们受其预测持续时间和整体泛化功能的限制。在本文中，我们通过将发电损失与MAE风格的功能级上下文学习结合在一起来探讨解决这个问题。特别是，我们通过三个关键设计实例化了该目标：（1）更可扩展的扩散变压器（DIT）结构，该结构训练有额外的蒙版构造任务。（2）我们设计了与扩散相关的面具令牌来处理掩模重建和生成扩散过程之间的模糊关系。（3）我们通过利用行式面具进行自我注意力而不是MAE中的掩盖自我注意，将掩盖构造任务扩展到时空领域。然后，我们采用划分的跨视图模块与此面具设计保持一致。基于上述改进，我们提出了MaskGWM：一种可概括的驾驶世界模型，该模型具有视频面具重建。我们的模型包含两个变体：maskGWM-long，重点关注长马预测，而maskgwm-mview则专用于多视图生成。对标准基准的综合实验验证了所提出方法的有效性，该方法包含Nuscene数据集的正常验证，OPENDV-2K数据集的长途推出以及Waymo数据集的零拍验证。这些数据集上的定量指标表明我们的方法特别改善了最新的驾驶世界模型。

Title: MVTokenFlow: High-quality 4D Content Generation using Multiview Token Flow

Authors: Hanzhuo Huang, Yuan Liu, Ge Zheng, Jiepeng Wang, Zhiyang Dou, Sibei Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.11697
Pdf URL: https://arxiv.org/pdf/2502.11697
Copy Paste: [[2502.11697]] MVTokenFlow: High-quality 4D Content Generation using Multiview Token Flow(https://arxiv.org/abs/2502.11697)
Keywords: generation, generative
Abstract: In this paper, we present MVTokenFlow for high-quality 4D content creation from monocular videos. Recent advancements in generative models such as video diffusion models and multiview diffusion models enable us to create videos or 3D models. However, extending these generative models for dynamic 4D content creation is still a challenging task that requires the generated content to be consistent spatially and temporally. To address this challenge, MVTokenFlow utilizes the multiview diffusion model to generate multiview images at different timesteps, which attains spatial consistency across different viewpoints and allows us to reconstruct a reasonable coarse 4D field. Then, MVTokenFlow further regenerates all the multiview images using the rendered 2D flows as guidance. The 2D flows effectively associate pixels from different timesteps and improve the temporal consistency by reusing tokens in the regeneration process. Finally, the regenerated images are spatiotemporally consistent and are used to refine the coarse 4D field into a high-quality 4D field. Experiments demonstrate the effectiveness of our design and show significantly improved quality over baseline methods.
摘要：在本文中，我们介绍了MvTokenFlow，用于从单眼视频中创建高质量的4D内容。在诸如视频扩散模型和多视频扩散模型之类的生成模型中的最新进展使我们能够创建视频或3D模型。但是，将这些生成模型扩展到动态4D内容创建仍然是一项具有挑战性的任务，需要生成的内容在空间和时间上保持一致。为了应对这一挑战，MVTokenFlow利用多视频扩散模型在不同的时间段上生成多视图像，该图像在不同的观点上达到了空间一致性，并允许我们重建合理的粗糙4D字段。然后，MVTokenFlow使用渲染的2D流作为指导进一步再生所有多视图。 2D流有效地将像素与不同的时间段相关联，并通过在再生过程中重复使用令牌来提高时间一致性。最后，再生图像是时空一致的，并用于完善粗4D场以获得高质量的4D场。实验证明了我们设计的有效性，并且比基线方法显示出明显提高的质量。

Title: The Worse The Better: Content-Aware Viewpoint Generation Network for Projection-related Point Cloud Quality Assessment

Authors: Zhiyong Su, Bingxu Xie, Zheng Li, Jincan Wu, Weiqing Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.11710
Pdf URL: https://arxiv.org/pdf/2502.11710
Copy Paste: [[2502.11710]] The Worse The Better: Content-Aware Viewpoint Generation Network for Projection-related Point Cloud Quality Assessment(https://arxiv.org/abs/2502.11710)
Keywords: generation, quality assessment
Abstract: Through experimental studies, however, we observed the instability of final predicted quality scores, which change significantly over different viewpoint settings. Inspired by the "wooden barrel theory", given the default content-independent viewpoints of existing projection-related PCQA approaches, this paper presents a novel content-aware viewpoint generation network (CAVGN) to learn better viewpoints by taking the distribution of geometric and attribute features of degraded point clouds into consideration. Firstly, the proposed CAVGN extracts multi-scale geometric and texture features of the entire input point cloud, respectively. Then, for each default content-independent viewpoint, the extracted geometric and texture features are refined to focus on its corresponding visible part of the input point cloud. Finally, the refined geometric and texture features are concatenated to generate an optimized viewpoint. To train the proposed CAVGN, we present a self-supervised viewpoint ranking network (SSVRN) to select the viewpoint with the worst quality projected image to construct a default-optimized viewpoint dataset, which consists of thousands of paired default viewpoints and corresponding optimized viewpoints. Experimental results show that the projection-related PCQA methods can achieve higher performance using the viewpoints generated by the proposed CAVGN.
摘要：但是，通过实验研究，我们观察到了最终预测质量得分的不稳定性，这些得分在不同的观点设置中发生了很大变化。鉴于现有与投影相关的PCQA方法的默认内容观点的启发，本文介绍了一种新颖的内容感知的观点生成网络（CAVGN），以通过获取几何和属性的分布来学习更好的观点考虑到降解点云的特征。首先，提出的Cavgn提取了整个输入点云的多尺度几何和纹理特征。然后，对于每个独立于内容的观点，提取的几何和纹理特征都会完善以关注其相应的输入点云的可见部分。最后，将精致的几何和纹理特征串联以生成优化的观点。为了训练拟议的Cavgn，我们提出了一个自我监督的观点排名网络（SSVRN），以使用最差的质量投影图像选择观点，以构建一个默认的优化观点数据集，该数据集由成千上万的配对默认视图点和相应的优化视图组成。实验结果表明，与投影相关的PCQA方法可以使用拟议的Cavgn产生的观点实现更高的性能。

Title: Component-aware Unsupervised Logical Anomaly Generation for Industrial Anomaly Detection

Authors: Xuan Tong, Yang Chang, Qing Zhao, Jiawen Yu, Boyang Wang, Junxiong Lin, Yuxuan Lin, Xinji Mai, Haoran Wang, Zeng Tao, Yan Wang, Wenqiang Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.11712
Pdf URL: https://arxiv.org/pdf/2502.11712
Copy Paste: [[2502.11712]] Component-aware Unsupervised Logical Anomaly Generation for Industrial Anomaly Detection(https://arxiv.org/abs/2502.11712)
Keywords: generation, generative
Abstract: Anomaly detection is critical in industrial manufacturing for ensuring product quality and improving efficiency in automated processes. The scarcity of anomalous samples limits traditional detection methods, making anomaly generation essential for expanding the data repository. However, recent generative models often produce unrealistic anomalies increasing false positives, or require real-world anomaly samples for training. In this work, we treat anomaly generation as a compositional problem and propose ComGEN, a component-aware and unsupervised framework that addresses the gap in logical anomaly generation. Our method comprises a multi-component learning strategy to disentangle visual components, followed by subsequent generation editing procedures. Disentangled text-to-component pairs, revealing intrinsic logical constraints, conduct attention-guided residual mapping and model training with iteratively matched references across multiple scales. Experiments on the MVTecLOCO dataset confirm the efficacy of ComGEN, achieving the best AUROC score of 91.2%. Additional experiments on the real-world scenario of Diesel Engine and widely-used MVTecAD dataset demonstrate significant performance improvements when integrating simulated anomalies generated by ComGEN into automated production workflows.
摘要：异常检测对于确保产品质量和提高自动化过程效率的工业制造至关重要。异常样品的稀缺性限制了传统的检测方法，这使得对扩展数据存储库至关重要。但是，最近的生成模型通常会产生不切实际的异常，以增加假阳性，或者需要现实世界中的异常样品进行训练。在这项工作中，我们将异常产生视为一个组成问题，并提出了COMGEN，Comgen是一个构成的且无监督的框架，该框架解决了逻辑异常产生的差距。我们的方法包括一个多组分学习策略，以消除视觉组件，然后再进行随后的生成编辑程序。分散的文本到组件对，揭示了固有的逻辑约束，进行了注意引导的残留映射和模型训练，并在多个尺度上进行了迭代匹配的参考。 MVTecloco数据集的实验证实了COMGEN的功效，获得了最佳的AUROC分数91.2％。在集成了COMGEN生成的模拟异常中时，在柴油发动机和广泛使用的MVTECAD数据集的现实情况下进行的其他实验表明，性能改善。

Title: Proactive Depot Discovery: A Generative Framework for Flexible Location-Routing

Authors: Site Qu, Guoqiang Hu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2502.11715
Pdf URL: https://arxiv.org/pdf/2502.11715
Copy Paste: [[2502.11715]] Proactive Depot Discovery: A Generative Framework for Flexible Location-Routing(https://arxiv.org/abs/2502.11715)
Keywords: generation, generative
Abstract: The Location-Routing Problem (LRP), which combines the challenges of facility (depot) locating and vehicle route planning, is critically constrained by the reliance on predefined depot candidates, limiting the solution space and potentially leading to suboptimal outcomes. Previous research on LRP without predefined depots is scant and predominantly relies on heuristic algorithms that iteratively attempt depot placements across a planar area. Such approaches lack the ability to proactively generate depot locations that meet specific geographic requirements, revealing a notable gap in the current research landscape. To bridge this gap, we propose a data-driven generative DRL framework, designed to proactively generate depots for LRP without predefined depot candidates, based solely on customer request data that include geographic and demand information. It can operate in two distinct modes: direct generation of exact depot locations, and the creation of a multivariate Gaussian distribution for flexible depot sampling. By extracting the depots' geographic pattern from customer request data, our approach can dynamically respond to logistical needs, identifying high-quality depot locations that further reduce total routing costs compared to traditional methods. Extensive experiments demonstrate that, for the same group of customer requests, compared with depots identified through random attempts, our framework can proactively generate depots that lead to superior solution routes with lower routing cost. The implications of our framework potentially extend into real-world applications, particularly in emergency medical rescue and disaster relief logistics, where rapid establishment and adjustment of depot locations are paramount, showcasing its potential in addressing LRP for dynamic and unpredictable environments.
摘要：路线路线问题（LRP）结合了设施（仓库）定位和车辆路线计划的挑战，受到对预定义仓库候选者的依赖，从而限制了解决方案空间并可能导致次优结果。先前对没有预定仓库的LRP的研究很少，并且主要依赖于启发式算法，即迭代地尝试在平面区域遍布仓库的位置。这种方法缺乏积极生成满足特定地理要求的仓库位置的能力，从而揭示了当前研究景观的显着差距。为了弥合这一差距，我们提出了一个数据驱动的生成DRL框架，该框架旨在主动为LRP生成仓库而无需预定义的仓库候选者，仅基于包括地理和需求信息在内的客户请求数据。它可以以两种不同的模式运行：直接生成精确的仓库位置，以及创建用于灵活仓库采样的多元高斯分布。通过从客户请求数据中提取仓库的地理模式，我们的方法可以动态地响应后勤需求，从而确定高质量的仓库位置，这些位置与传统方法相比进一步降低了总路由成本。广泛的实验表明，对于同一组客户请求，与通过随机尝试识别的仓库相比，我们的框架可以主动生成仓库，从而导致较低的路由成本的卓越解决方案路线。我们的框架的含义可能扩展到现实世界中的应用，特别是在紧急医疗救援和救灾后勤方面，在仓库地点的快速建立和调整至关重要，展示了其在动态和不可预测的环境中的LRP方面的潜力。

Title: No-reference geometry quality assessment for colorless point clouds via list-wise rank learning

Authors: Zheng Li, Bingxu Xie, Chao Chu, Weiqing Li, Zhiyong Su
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.11726
Pdf URL: https://arxiv.org/pdf/2502.11726
Copy Paste: [[2502.11726]] No-reference geometry quality assessment for colorless point clouds via list-wise rank learning(https://arxiv.org/abs/2502.11726)
Keywords: quality assessment
Abstract: Geometry quality assessment (GQA) of colorless point clouds is crucial for evaluating the performance of emerging point cloud-based solutions (e.g., watermarking, compression, and 3-Dimensional (3D) reconstruction). Unfortunately, existing objective GQA approaches are traditional full-reference metrics, whereas state-of-the-art learning-based point cloud quality assessment (PCQA) methods target both color and geometry distortions; neither is suited to the no-reference GQA task. In addition, the lack of large-scale GQA datasets with subjective scores, which are always imprecise, biased, and inconsistent, also hinders the development of learning-based GQA metrics. Driven by these limitations, this paper proposes a no-reference geometry-only quality assessment approach based on list-wise rank learning, termed LRL-GQA, which comprises a geometry quality assessment network (GQANet) and a list-wise rank learning network (LRLNet). The proposed LRL-GQA formulates no-reference GQA as a list-wise rank problem, with the objective of directly optimizing the entire quality ordering. Specifically, a large dataset containing a variety of geometry-only distortions, named the LRL dataset, is constructed first; each sample in it is label-free but coupled with quality ranking information. Then, the GQANet is designed to capture intrinsic multi-scale patch-wise geometric features in order to predict a quality index for each point cloud. After that, the LRLNet leverages the LRL dataset and a likelihood loss to train the GQANet and ranks the input list of degraded point clouds according to their distortion levels. In addition, the pre-trained GQANet can be fine-tuned further to obtain absolute quality scores. Experimental results demonstrate the superior performance of the proposed no-reference LRL-GQA method compared with existing full-reference GQA metrics.
摘要：无色点云的几何质量评估（GQA）对于评估基于新兴点云解决方案的性能（例如水印，压缩和3维（3D）重建）至关重要。不幸的是，现有的目标GQA方法是传统的全参考指标，而基于最先进的学习点云质量评估（PCQA）方法均针对颜色和几何扭曲，它们都不具备No-Reference GQA任务。此外，缺乏具有主观分数的大型GQA数据集，这些数据集总是不精确，偏见和不一致的，也阻碍了基于学习的GQA指标的发展。在这些限制的驱动下，本文提出了一种基于列表等级学习的不引用几何质量评估方法，称为LRL-GQA，该方法包括几何质量评估网络（GQAnet）和列表等级学习网络（lrlnet）。所提出的LRL-GQA将No-Reference GQA提出为列表等级问题，目的是直接优化整个质量排序。具体而言，首先构造包含多种几何扭曲的大型数据集，名为LRL数据集，其中每个样本都不标记，但与质量排名信息相结合。然后，GQAnet旨在捕获固有的多尺度贴片几何特征，以预测每个点云的质量索引。之后，LRLNET利用LRL数据集和可能性损失来训练GQAnet，并根据其失真级别对降级点云的输入列表进行排名。此外，可以进一步调整预训练的GQANET以获得绝对质量得分。实验结果表明，与现有的全参考GQA指标相比，所提出的NO-RRL-GQA方法的出色表现。
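
Note: the abstract mentions training with "a likelihood loss" over ranked lists. A minimal sketch of one common list-wise likelihood loss (the Plackett-Luce / ListMLE negative log-likelihood) is given below. That this is the exact loss used by LRL-GQA is an assumption; the function name `listwise_nll` and its input convention are hypothetical and chosen only for illustration.

```python
import math

def listwise_nll(scores_in_true_order):
    """Plackett-Luce negative log-likelihood of the ground-truth ordering.

    scores_in_true_order: predicted quality scores, listed from the
    best-quality sample to the worst (hypothetical input convention).
    """
    nll = 0.0
    for i in range(len(scores_in_true_order)):
        tail = scores_in_true_order[i:]
        # Numerically stable log-sum-exp over the items not yet placed.
        m = max(tail)
        log_z = m + math.log(sum(math.exp(s - m) for s in tail))
        # -log P(item i is picked first among the remaining items)
        nll += log_z - scores_in_true_order[i]
    return nll
```

The loss is small when the predicted scores decrease along the true ordering and grows when they contradict it, so minimizing it optimizes the whole ordering at once instead of individual pairs.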

Title: Language Models Can See Better: Visual Contrastive Decoding For LLM Multimodal Reasoning

Authors: Yuqi Pang, Bowen Yang, Haoqin Tu, Yun Cao, Zeyu Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2502.11751
Pdf URL: https://arxiv.org/pdf/2502.11751
Copy Paste: [[2502.11751]] Language Models Can See Better: Visual Contrastive Decoding For LLM Multimodal Reasoning(https://arxiv.org/abs/2502.11751)
Keywords: generation
Abstract: Although Large Language Models (LLMs) excel in reasoning and generation for language tasks, they are not specifically designed for multimodal challenges. Training Multimodal Large Language Models (MLLMs), however, is resource-intensive and constrained by various training limitations. In this paper, we propose the Modular-based Visual Contrastive Decoding (MVCD) framework to remove this obstacle. Our framework leverages LLMs' In-Context Learning (ICL) capability and the proposed visual contrastive-example decoding (CED), specifically tailored for this framework, without requiring any additional training. By converting visual signals into text and focusing on contrastive output distributions during decoding, we can highlight the new information introduced by contextual examples, explore their connections, and avoid over-reliance on prior encoded knowledge. MVCD enhances LLMs' visual perception so that they can see and reason over the input visuals. To demonstrate MVCD's effectiveness, we conduct experiments with four LLMs across five question answering datasets. Our results not only show consistent improvement in model accuracy but also explain the effective components of our decoding strategy. Our code will be available at this https URL.
摘要：尽管大型语言模型（LLM）在推理和生成语言任务方面表现出色，但并不是专门为多模式挑战而设计的。但是，培训多模式大型语言模型（MLLM）是资源密集的，受各种培训限制的限制。在本文中，我们提出了基于模块化的视觉对比解码（MVCD）框架来移动这一障碍。我们的框架利用了LLMS的内在学习（ICL）功能和所提出的视觉对比度解码（CED），该解码（CED）是专门针对此框架量身定制的，而无需任何其他培训。通过将视觉信号转换为文本并关注解码过程中的对比输出分布，我们可以强调通过上下文示例引入的新信息，探索其连接，并避免过度依赖对先前编码的知识。 MVCD增强了LLMS的视觉感知，以使其在输入视觉效果上看到并推理。为了证明MVCD的有效性，我们在五个问答数据集中对四个LLM进行实验。我们的结果不仅显示出模型准确性的一致性提高，还可以很好地解释我们解码策略中的有效组件。我们的代码将在此HTTPS URL上可用。

Title: From Selection to Generation: A Survey of LLM-based Active Learning

Authors: Yu Xia, Subhojyoti Mukherjee, Zhouhang Xie, Junda Wu, Xintong Li, Ryan Aponte, Hanjia Lyu, Joe Barrow, Hongjie Chen, Franck Dernoncourt, Branislav Kveton, Tong Yu, Ruiyi Zhang, Jiuxiang Gu, Nesreen K. Ahmed, Yu Wang, Xiang Chen, Hanieh Deilamsalehy, Sungchul Kim, Zhengmian Hu, Yue Zhao, Nedim Lipka, Seunghyun Yoon, Ting-Hao Kenneth Huang, Zichao Wang, Puneet Mathur, Soumyabrata Pal, Koyel Mukherjee, Zhehao Zhang, Namyong Park, Thien Huu Nguyen, Jiebo Luo, Ryan A. Rossi, Julian McAuley
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2502.11767
Pdf URL: https://arxiv.org/pdf/2502.11767
Copy Paste: [[2502.11767]] From Selection to Generation: A Survey of LLM-based Active Learning(https://arxiv.org/abs/2502.11767)
Keywords: generation
Abstract: Active Learning (AL) has been a powerful paradigm for improving model efficiency and performance by selecting the most informative data points for labeling and training. In recent active learning frameworks, Large Language Models (LLMs) have been employed not only for selection but also for generating entirely new data instances and providing more cost-effective annotations. Motivated by the increasing importance of high-quality data and efficient model training in the era of LLMs, we present a comprehensive survey on LLM-based Active Learning. We introduce an intuitive taxonomy that categorizes these techniques and discuss the transformative roles LLMs can play in the active learning loop. We further examine the impact of AL on LLM learning paradigms and its applications across various domains. Finally, we identify open challenges and propose future research directions. This survey aims to serve as an up-to-date resource for researchers and practitioners seeking to gain an intuitive understanding of LLM-based AL techniques and deploy them to new applications.

Title: DLFR-VAE: Dynamic Latent Frame Rate VAE for Video Generation

Authors: Zhihang Yuan, Siyuan Wang, Rui Xie, Hanling Zhang, Tongcheng Fang, Yuzhang Shang, Shengen Yan, Guohao Dai, Yu Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2502.11897
Pdf URL: https://arxiv.org/pdf/2502.11897
Copy Paste: [[2502.11897]] DLFR-VAE: Dynamic Latent Frame Rate VAE for Video Generation(https://arxiv.org/abs/2502.11897)
Keywords: generation, generative
Abstract: In this paper, we propose the Dynamic Latent Frame Rate VAE (DLFR-VAE), a training-free paradigm that can make use of adaptive temporal compression in latent space. While existing video generative models apply fixed compression rates via pretrained VAE, we observe that real-world video content exhibits substantial temporal non-uniformity, with high-motion segments containing more information than static scenes. Based on this insight, DLFR-VAE dynamically adjusts the latent frame rate according to the content complexity. Specifically, DLFR-VAE comprises two core innovations: (1) A Dynamic Latent Frame Rate Scheduler that partitions videos into temporal chunks and adaptively determines optimal frame rates based on information-theoretic content complexity, and (2) A training-free adaptation mechanism that transforms pretrained VAE architectures into a dynamic VAE that can process features with variable frame rates. Our simple but effective DLFR-VAE can function as a plug-and-play module, seamlessly integrating with existing video generation models and accelerating the video generation process.
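A minimal sketch of the scheduling idea described above: split a video into temporal chunks and give each chunk its own temporal compression factor. Everything in it is an illustrative assumption rather than the authors' method. The abstract speaks of an information-theoretic complexity measure; this sketch swaps in mean absolute frame difference as a simple motion proxy, and the names `motion_score` and `schedule_compression`, the thresholds, and the factors 4/2/1 are hypothetical.

```python
from typing import List, Sequence


def motion_score(chunk: Sequence[Sequence[float]]) -> float:
    """Mean absolute difference between consecutive frames (a stand-in proxy,
    not the information-theoretic measure mentioned in the abstract)."""
    if len(chunk) < 2:
        return 0.0
    diffs = []
    for prev, cur in zip(chunk, chunk[1:]):
        diffs.append(sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur))
    return sum(diffs) / len(diffs)


def schedule_compression(
    frames: Sequence[Sequence[float]],
    chunk_size: int = 4,
    low: float = 0.05,   # hypothetical threshold: below this, treat chunk as static
    high: float = 0.15,  # hypothetical threshold: above this, treat chunk as high-motion
) -> List[int]:
    """Return one temporal downsampling factor per chunk.

    A larger factor means fewer latent frames are kept for that chunk.
    """
    factors = []
    for start in range(0, len(frames), chunk_size):
        score = motion_score(frames[start:start + chunk_size])
        if score < low:
            factors.append(4)   # near-static content: compress aggressively
        elif score < high:
            factors.append(2)   # moderate motion
        else:
            factors.append(1)   # high motion: keep the full latent frame rate
    return factors


# Frames are flattened pixel intensities in [0, 1].
static_chunk = [[0.5, 0.5]] * 4
moving_chunk = [[0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0]]
factors = schedule_compression(static_chunk + moving_chunk, chunk_size=4)
```

Here the static chunk receives the largest compression factor and the flickering chunk receives none, which mirrors the abstract's observation that high-motion segments carry more information than static scenes.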

Title: Continual Learning Should Move Beyond Incremental Classification

Authors: Rupert Mitchell, Antonio Alliegro, Raffaello Camoriano, Dustin Carrión-Ojeda, Antonio Carta, Georgia Chalvatzaki, Nikhil Churamani, Carlo D'Eramo, Samin Hamidi, Robin Hesse, Fabian Hinder, Roshni Ramanna Kamath, Vincenzo Lomonaco, Subarnaduti Paul, Francesca Pistilli, Tinne Tuytelaars, Gido M van de Ven, Kristian Kersting, Simone Schaub-Meyer, Martin Mundt
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.11927
Pdf URL: https://arxiv.org/pdf/2502.11927
Copy Paste: [[2502.11927]] Continual Learning Should Move Beyond Incremental Classification(https://arxiv.org/abs/2502.11927)
Keywords: generative
Abstract: Continual learning (CL) is the sub-field of machine learning concerned with accumulating knowledge in dynamic environments. So far, CL research has mainly focused on incremental classification tasks, where models learn to classify new categories while retaining knowledge of previously learned ones. Here, we argue that maintaining such a focus limits both theoretical development and practical applicability of CL methods. Through a detailed analysis of concrete examples - including multi-target classification, robotics with constrained output spaces, learning in continuous task domains, and higher-level concept memorization - we demonstrate how current CL approaches often fail when applied beyond standard classification. We identify three fundamental challenges: (C1) the nature of continuity in learning problems, (C2) the choice of appropriate spaces and metrics for measuring similarity, and (C3) the role of learning objectives beyond classification. For each challenge, we provide specific recommendations to help move the field forward, including formalizing temporal dynamics through distribution processes, developing principled approaches for continuous task spaces, and incorporating density estimation and generative objectives. In so doing, this position paper aims to broaden the scope of CL research while strengthening its theoretical foundations, making it more applicable to real-world problems.

Title: Image Inversion: A Survey from GANs to Diffusion and Beyond

Authors: Yinan Chen, Jiangning Zhang, Yali Bi, Xiaobin Hu, Teng Hu, Zhucun Xue, Ran Yi, Yong Liu, Ying Tai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.11974
Pdf URL: https://arxiv.org/pdf/2502.11974
Copy Paste: [[2502.11974]] Image Inversion: A Survey from GANs to Diffusion and Beyond(https://arxiv.org/abs/2502.11974)
Keywords: restoration, generative
Abstract: Image inversion is a fundamental task in generative models, aiming to map images back to their latent representations to enable downstream applications such as editing, restoration, and style transfer. This paper provides a comprehensive review of the latest advancements in image inversion techniques, focusing on two main paradigms: Generative Adversarial Network (GAN) inversion and diffusion model inversion. We categorize these techniques based on their optimization methods. For GAN inversion, we systematically classify existing methods into encoder-based approaches, latent optimization approaches, and hybrid approaches, analyzing their theoretical foundations, technical innovations, and practical trade-offs. For diffusion model inversion, we explore training-free strategies, fine-tuning methods, and the design of additional trainable modules, highlighting their unique advantages and limitations. Additionally, we discuss several popular downstream applications and emerging applications beyond image tasks, identifying current challenges and future research directions. By synthesizing the latest developments, this paper aims to provide researchers and practitioners with a valuable reference resource, promoting further advancements in the field of image inversion. We keep track of the latest works at this https URL

Title: Unsupervised Structural-Counterfactual Generation under Domain Shift

Authors: Krishn Vishwas Kher, Lokesh Venkata Siva Maruthi Badisa, Kusampudi Venkata Datta Sri Harsha, Chitneedi Geetha Sowmya, SakethaNath Jagarlapudi
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2502.12013
Pdf URL: https://arxiv.org/pdf/2502.12013
Copy Paste: [[2502.12013]] Unsupervised Structural-Counterfactual Generation under Domain Shift(https://arxiv.org/abs/2502.12013)
Keywords: generation, generative
Abstract: Motivated by the burgeoning interest in cross-domain learning, we present a novel generative modeling challenge: generating counterfactual samples in a target domain based on factual observations from a source domain. Our approach operates within an unsupervised paradigm devoid of parallel or joint datasets, relying exclusively on distinct observational samples and causal graphs for each domain. This setting presents challenges that surpass those of conventional counterfactual generation. Central to our methodology is the disambiguation of exogenous causes into effect-intrinsic and domain-intrinsic categories. This differentiation facilitates the integration of domain-specific causal graphs into a unified joint causal graph via shared effect-intrinsic exogenous variables. We propose leveraging Neural Causal models within this joint framework to enable accurate counterfactual generation under standard identifiability assumptions. Furthermore, we introduce a novel loss function that effectively segregates effect-intrinsic from domain-intrinsic variables during model training. Given a factual observation, our framework combines the posterior distribution of effect-intrinsic variables from the source domain with the prior distribution of domain-intrinsic variables from the target domain to synthesize the desired counterfactuals, adhering to Pearl's causal hierarchy. Intriguingly, when domain shifts are restricted to alterations in causal mechanisms without accompanying covariate shifts, our training regimen parallels the resolution of a conditional optimal transport problem. Empirical evaluations on a synthetic dataset show that our framework generates counterfactuals in the target domain that very closely resemble the ground truth.

Title: HumanGif: Single-View Human Diffusion with Generative Prior

Authors: Shoukang Hu, Takuya Narihira, Kazumi Fukuda, Ryosuke Sawata, Takashi Shibuya, Yuki Mitsufuji
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.12080
Pdf URL: https://arxiv.org/pdf/2502.12080
Copy Paste: [[2502.12080]] HumanGif: Single-View Human Diffusion with Generative Prior(https://arxiv.org/abs/2502.12080)
Keywords: generative
Abstract: While previous single-view-based 3D human reconstruction methods made significant progress in novel view synthesis, it remains a challenge to synthesize both view-consistent and pose-consistent results for animatable human avatars from a single image input. Motivated by the success of 2D character animation, we propose HumanGif, a single-view human diffusion model with generative prior. Specifically, we formulate the single-view-based 3D human novel view and pose synthesis as a single-view-conditioned human diffusion process, utilizing generative priors from foundational diffusion models. To ensure fine-grained and consistent novel view and pose synthesis, we introduce a Human NeRF module in HumanGif to learn spatially aligned features from the input image, implicitly capturing the relative camera and human pose transformation. Furthermore, we introduce an image-level loss during optimization to bridge the gap between latent and image spaces in diffusion models. Extensive experiments on RenderPeople and DNA-Rendering datasets demonstrate that HumanGif achieves the best perceptual performance, with better generalizability for novel view and pose synthesis.

Title: Descriminative-Generative Custom Tokens for Vision-Language Models

Authors: Pramuditha Perera, Matthew Trager, Luca Zancato, Alessandro Achille, Stefano Soatto
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.12095
Pdf URL: https://arxiv.org/pdf/2502.12095
Copy Paste: [[2502.12095]] Descriminative-Generative Custom Tokens for Vision-Language Models(https://arxiv.org/abs/2502.12095)
Keywords: generation, generative
Abstract: This paper explores the possibility of learning custom tokens for representing new concepts in Vision-Language Models (VLMs). Our aim is to learn tokens that can be effective for both discriminative and generative tasks while composing well with words to form new input queries. The targeted concept is specified in terms of a small set of images and a parent concept described using text. We operate on CLIP text features and propose to use a combination of a textual inversion loss and a classification loss to ensure that text features of the learned token are aligned with image features of the concept in the CLIP embedding space. We restrict the learned token to a low-dimensional subspace spanned by tokens for attributes that are appropriate for the given super-class. These modifications improve the quality of compositions of the learned token with natural language for generating new scenes. Further, we show that learned custom tokens can be used to form queries for text-to-image retrieval task, and also have the important benefit that composite queries can be visualized to ensure that the desired concept is faithfully encoded. Based on this, we introduce the method of Generation Aided Image Retrieval, where the query is modified at inference time to better suit the search intent. On the DeepFashion2 dataset, our method improves Mean Reciprocal Retrieval (MRR) over relevant baselines by 7%.
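The retrieval metric quoted above can be illustrated with a short sketch. It assumes that the abstract's "MRR" is the standard mean reciprocal rank: for each query, take 1 divided by the rank of the first relevant result (0 if none is retrieved), then average over queries. The function name and the toy data are hypothetical and are not drawn from the paper or from DeepFashion2.

```python
from typing import Hashable, Sequence, Set


def mean_reciprocal_rank(
    rankings: Sequence[Sequence[Hashable]],
    relevant: Sequence[Set[Hashable]],
) -> float:
    """Average of 1/rank of the first relevant item per query (0 if none found)."""
    if not rankings:
        return 0.0
    total = 0.0
    for ranking, rel in zip(rankings, relevant):
        for rank, item in enumerate(ranking, start=1):
            if item in rel:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(rankings)


# Toy example: three queries with ranked image ids.
rankings = [["a", "b", "c"], ["x", "y", "z"], ["p", "q", "r"]]
relevant = [{"a"}, {"z"}, {"missing"}]
mrr = mean_reciprocal_rank(rankings, relevant)  # (1 + 1/3 + 0) / 3
```

Whether the abstract's 7% improvement is absolute or relative is not stated there, so the sketch computes only the raw metric.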

Title: LaM-SLidE: Latent Space Modeling of Spatial Dynamical Systems via Linked Entities

Authors: Florian Sestak, Artur Toshev, Andreas Fürst, Günter Klambauer, Andreas Mayr, Johannes Brandstetter
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2502.12128
Pdf URL: https://arxiv.org/pdf/2502.12128
Copy Paste: [[2502.12128]] LaM-SLidE: Latent Space Modeling of Spatial Dynamical Systems via Linked Entities(https://arxiv.org/abs/2502.12128)
Keywords: generation, generative
Abstract: Generative models are spearheading recent progress in deep learning, showing strong promise for trajectory sampling in dynamical systems as well. However, while latent space modeling paradigms have transformed image and video generation, similar approaches are more difficult for most dynamical systems. Such systems -- from chemical molecule structures to collective human behavior -- are described by interactions of entities, making them inherently linked to connectivity patterns and the traceability of entities over time. Our approach, LaM-SLidE (Latent Space Modeling of Spatial Dynamical Systems via Linked Entities), combines the advantages of graph neural networks, i.e., the traceability of entities across time-steps, with the efficiency and scalability of recent advances in image and video generation, where pre-trained encoder and decoder are frozen to enable generative modeling in the latent space. The core idea of LaM-SLidE is to introduce identifier representations (IDs) to allow for retrieval of entity properties, e.g., entity coordinates, from latent system representations and thus enables traceability. Experimentally, across different domains, we show that LaM-SLidE performs favorably in terms of speed, accuracy, and generalizability. (Code is available at this https URL)

Title: MagicArticulate: Make Your 3D Models Articulation-Ready

Authors: Chaoyue Song, Jianfeng Zhang, Xiu Li, Fan Yang, Yiwen Chen, Zhongcong Xu, Jun Hao Liew, Xiaoyang Guo, Fayao Liu, Jiashi Feng, Guosheng Lin
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2502.12135
Pdf URL: https://arxiv.org/pdf/2502.12135
Copy Paste: [[2502.12135]] MagicArticulate: Make Your 3D Models Articulation-Ready(https://arxiv.org/abs/2502.12135)
Keywords: generation
Abstract: With the explosive growth of 3D content creation, there is an increasing demand for automatically converting static 3D models into articulation-ready versions that support realistic animation. Traditional approaches rely heavily on manual annotation, which is both time-consuming and labor-intensive. Moreover, the lack of large-scale benchmarks has hindered the development of learning-based solutions. In this work, we present MagicArticulate, an effective framework that automatically transforms static 3D models into articulation-ready assets. Our key contributions are threefold. First, we introduce Articulation-XL, a large-scale benchmark containing over 33k 3D models with high-quality articulation annotations, carefully curated from Objaverse-XL. Second, we propose a novel skeleton generation method that formulates the task as a sequence modeling problem, leveraging an auto-regressive transformer to naturally handle varying numbers of bones or joints within skeletons and their inherent dependencies across different 3D models. Third, we predict skinning weights using a functional diffusion process that incorporates volumetric geodesic distance priors between vertices and joints. Extensive experiments demonstrate that MagicArticulate significantly outperforms existing methods across diverse object categories, achieving high-quality articulation that enables realistic animation. Project page: this https URL.

Title: HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation

Authors: Ling Yang, Xinchen Zhang, Ye Tian, Chenming Shang, Minghao Xu, Wentao Zhang, Bin Cui
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.12148
Pdf URL: https://arxiv.org/pdf/2502.12148
Copy Paste: [[2502.12148]] HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation(https://arxiv.org/abs/2502.12148)
Keywords: generation, generative
Abstract: The remarkable success of the autoregressive paradigm has made significant advancement in Multimodal Large Language Models (MLLMs), with powerful models like Show-o, Transfusion and Emu3 achieving notable progress in unified image understanding and generation. For the first time, we uncover a common phenomenon: the understanding capabilities of MLLMs are typically stronger than their generative capabilities, with a significant gap between the two. Building on this insight, we propose HermesFlow, a simple yet general framework designed to seamlessly bridge the gap between understanding and generation in MLLMs. Specifically, we take the homologous data as input to curate homologous preference data of both understanding and generation. Through Pair-DPO and self-play iterative optimization, HermesFlow effectively aligns multimodal understanding and generation using homologous preference data. Extensive experiments demonstrate the significant superiority of our approach over prior methods, particularly in narrowing the gap between multimodal understanding and generation. These findings highlight the potential of HermesFlow as a general alignment framework for next-generation multimodal foundation models. Code: this https URL

Title: VoLUT: Efficient Volumetric streaming enhanced by LUT-based super-resolution

Authors: Chendong Wang, Anlan Zhang, Yifan Yang, Lili Qiu, Yuqing Yang, Xinyang Jiang, Feng Qian, Suman Banerjee
Subjects: cs.CV, eess.SY
Abstract URL: https://arxiv.org/abs/2502.12151
Pdf URL: https://arxiv.org/pdf/2502.12151
Copy Paste: [[2502.12151]] VoLUT: Efficient Volumetric streaming enhanced by LUT-based super-resolution(https://arxiv.org/abs/2502.12151)
Keywords: super-resolution
Abstract: 3D volumetric video provides an immersive experience and is gaining traction in digital media. Despite its rising popularity, the streaming of volumetric video content poses significant challenges due to the high data bandwidth requirement. A natural approach to mitigate the bandwidth issue is to reduce the volumetric video's data rate by downsampling the content prior to transmission. The video can then be upsampled at the receiver's end using a super-resolution (SR) algorithm to reconstruct the high-resolution details. While super-resolution techniques have been extensively explored and advanced for 2D video content, there is limited work on SR algorithms tailored for volumetric videos. To address this gap and the growing need for efficient volumetric video streaming, we have developed VoLUT with a new SR algorithm specifically designed for volumetric content. Our algorithm uniquely harnesses the power of lookup tables (LUTs) to facilitate the efficient and accurate upscaling of low-resolution volumetric data. The use of LUTs enables our algorithm to quickly reference precomputed high-resolution values, thereby significantly reducing the computational complexity and time required for upscaling. We further apply an adaptive video bitrate (ABR) algorithm to dynamically determine the downsampling rate according to the network condition and stream the selected video rate to the receiver. Compared to related work, VoLUT is the first to enable high-quality 3D SR on commodity mobile devices at line-rate. Our evaluation shows VoLUT can reduce bandwidth usage by 70%, boost QoE by 36.7% for volumetric video streaming, and achieve 3D SR speed-up with no quality compromise.