2024-12-30

Title: ZenSVI: An Open-Source Software for the Integrated Acquisition, Processing and Analysis of Street View Imagery Towards Scalable Urban Science

Authors: Koichi Ito, Yihan Zhu, Mahmoud Abdelrahman, Xiucheng Liang, Zicheng Fan, Yujun Hou, Tianhong Zhao, Rui Ma, Kunihiko Fujiwara, Jiani Ouyang, Matias Quintana, Filip Biljecki
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.18641
Pdf URL: https://arxiv.org/pdf/2412.18641
Copy Paste: [[2412.18641]] ZenSVI: An Open-Source Software for the Integrated Acquisition, Processing and Analysis of Street View Imagery Towards Scalable Urban Science(https://arxiv.org/abs/2412.18641)
Keywords: quality assessment
Abstract: Street view imagery (SVI) has been instrumental in many studies in the past decade to understand and characterize street features and the built environment. Researchers across a variety of domains, such as transportation, health, architecture, human perception, and infrastructure have employed different methods to analyze SVI. However, these applications and image-processing procedures have not been standardized, and solutions have been implemented in isolation, often making it difficult for others to reproduce existing work and carry out new research. Using SVI for research requires multiple technical steps: accessing APIs for scalable data collection, preprocessing images to standardize formats, implementing computer vision models for feature extraction, and conducting spatial analysis. These technical requirements create barriers for researchers in urban studies, particularly those without extensive programming experience. We develop ZenSVI, a free and open-source Python package that integrates and implements the entire process of SVI analysis, supporting a wide range of use cases. Its end-to-end pipeline includes downloading SVI from multiple platforms (e.g., Mapillary and KartaView) efficiently, analyzing metadata of SVI, applying computer vision models to extract target features, transforming SVI into different projections (e.g., fish-eye and perspective) and different formats (e.g., depth map and point cloud), visualizing analyses with maps and plots, and exporting outputs to other software tools. We demonstrate its use in Singapore through a case study of data quality assessment and clustering analysis in a streamlined manner. Our software improves the transparency, reproducibility, and scalability of research relying on SVI and supports researchers in conducting urban analyses efficiently. Its modular design facilitates extensions and unlocking new use cases.

Title: Dissecting CLIP: Decomposition with a Schur Complement-based Approach

Authors: Azim Ospanov, Mohammad Jalali, Farzan Farnia
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.18645
Pdf URL: https://arxiv.org/pdf/2412.18645
Copy Paste: [[2412.18645]] Dissecting CLIP: Decomposition with a Schur Complement-based Approach(https://arxiv.org/abs/2412.18645)
Keywords: generative
Abstract: The use of CLIP embeddings to assess the alignment of samples produced by text-to-image generative models has been extensively explored in the literature. While the widely adopted CLIPScore, derived from the cosine similarity of text and image embeddings, effectively measures the relevance of a generated image, it does not quantify the diversity of images generated by a text-to-image model. In this work, we extend the application of CLIP embeddings to quantify and interpret the intrinsic diversity of text-to-image models, which is responsible for generating diverse images from similar text prompts. To achieve this, we propose a decomposition of the CLIP-based kernel covariance matrix of image data into text-based and non-text-based components. Using the Schur complement of the joint image-text kernel covariance matrix, we perform this decomposition and define the matrix-based entropy of the decomposed component as the Schur Complement Entropy (SCE) score, a measure of the intrinsic diversity of a text-to-image model based on data collected with varying text prompts. Additionally, we demonstrate the use of the Schur complement-based decomposition to nullify the influence of a given prompt in the CLIP embedding of an image, enabling focus or defocus of embeddings on specific objects or properties for downstream tasks. We present several numerical results that apply our Schur complement-based approach to evaluate text-to-image models and modify CLIP image embeddings. The codebase is available at this https URL.
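Sketch: a minimal illustration of the cosine-similarity score that the abstract contrasts with diversity measures. The 3-dimensional vectors are made-up stand-ins for CLIP embeddings, and the rescaling `w * max(cos, 0)` with `w = 2.5` follows the commonly cited CLIPScore definition rather than anything stated in this abstract. The score is computed per image-text pair, which is why it cannot see how varied a set of generated images is.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def clip_score(image_emb, text_emb, w=2.5):
    """Rescaled, clipped cosine similarity (assumed CLIPScore convention)."""
    return w * max(cosine_similarity(image_emb, text_emb), 0.0)

# Hypothetical embeddings: one prompt and two generated images.
text = [1.0, 0.0, 0.0]
image_a = [0.8, 0.6, 0.0]
image_b = [0.8, 0.6, 0.0]  # an exact copy of image_a

# Both images receive the same relevance score, even though a set made of
# identical images has no diversity at all.
print(clip_score(image_a, text), clip_score(image_b, text))
```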

Title: 1.58-bit FLUX

Authors: Chenglin Yang, Celong Liu, Xueqing Deng, Dongwon Kim, Xing Mei, Xiaohui Shen, Liang-Chieh Chen
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.18653
Pdf URL: https://arxiv.org/pdf/2412.18653
Copy Paste: [[2412.18653]] 1.58-bit FLUX(https://arxiv.org/abs/2412.18653)
Keywords: generation
Abstract: We present 1.58-bit FLUX, the first successful approach to quantizing the state-of-the-art text-to-image generation model, FLUX.1-dev, using 1.58-bit weights (i.e., values in {-1, 0, +1}) while maintaining comparable performance for generating 1024 x 1024 images. Notably, our quantization method operates without access to image data, relying solely on self-supervision from the FLUX.1-dev model. Additionally, we develop a custom kernel optimized for 1.58-bit operations, achieving a 7.7x reduction in model storage, a 5.1x reduction in inference memory, and improved inference latency. Extensive evaluations on the GenEval and T2I Compbench benchmarks demonstrate the effectiveness of 1.58-bit FLUX in maintaining generation quality while significantly enhancing computational efficiency.
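Sketch: what "1.58-bit weights" means in practice. Three possible values carry log2(3), roughly 1.585, bits of information each. The abstract does not describe the quantizer itself, so the absmean rounding rule below is an assumption borrowed from other ternary-weight work, and the function names are hypothetical.

```python
import math

def ternary_quantize(weights):
    """Map float weights to {-1, 0, +1} plus one shared float scale.

    Assumed scheme: divide by the mean absolute value, round, clip to [-1, 1].
    """
    scale = sum(abs(w) for w in weights) / len(weights)
    if scale == 0.0:
        return [0] * len(weights), 0.0
    q = [max(-1, min(1, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from ternary codes."""
    return [scale * v for v in q]

weights = [0.9, -0.05, -1.2, 0.4]
q, scale = ternary_quantize(weights)
bits_per_weight = math.log2(3)  # information content of a three-valued symbol

print(q, scale, round(bits_per_weight, 3))
```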

Title: Video Is Worth a Thousand Images: Exploring the Latest Trends in Long Video Generation

Authors: Faraz Waseem, Muhammad Shahzad
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.18688
Pdf URL: https://arxiv.org/pdf/2412.18688
Copy Paste: [[2412.18688]] Video Is Worth a Thousand Images: Exploring the Latest Trends in Long Video Generation(https://arxiv.org/abs/2412.18688)
Keywords: generation, generative
Abstract: An image may convey a thousand words, but a video composed of hundreds or thousands of image frames tells a more intricate story. Despite significant progress in multimodal large language models (MLLMs), generating extended videos remains a formidable challenge. As of this writing, OpenAI's Sora, the current state-of-the-art system, is still limited to producing videos that are up to one minute in length. This limitation stems from the complexity of long video generation, which requires more than generative AI techniques for approximating density functions: essential aspects such as planning, story development, and maintaining spatial and temporal consistency present additional hurdles. Integrating generative AI with a divide-and-conquer approach could improve scalability for longer videos while offering greater control. In this survey, we examine the current landscape of long video generation, covering foundational techniques like GANs and diffusion models, video generation strategies, large-scale training datasets, quality metrics for evaluating long videos, and future research areas to address the limitations of existing video generation capabilities. We believe this survey can serve as a comprehensive foundation, offering extensive information to guide future advancements and research in the field of long video generation.

Title: Elucidating Flow Matching ODE Dynamics with respect to Data Geometries

Authors: Gal Mishne, Zhengchao Wan, Qingsong Wang, Yusu Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.18730
Pdf URL: https://arxiv.org/pdf/2412.18730
Copy Paste: [[2412.18730]] Elucidating Flow Matching ODE Dynamics with respect to Data Geometries(https://arxiv.org/abs/2412.18730)
Keywords: generation, generative
Abstract: Diffusion-based generative models have become the standard for image generation. ODE-based samplers and flow matching models improve efficiency, in comparison to diffusion models, by reducing sampling steps through learned vector fields. However, the theoretical foundations of flow matching models remain limited, particularly regarding the convergence of individual sample trajectories at terminal time, a property that impacts sample quality and is a critical assumption for models such as the consistency model. In this paper, we advance the theory of flow matching models through a comprehensive analysis of sample trajectories, centered on the denoiser that drives ODE dynamics. We establish the existence, uniqueness and convergence of ODE trajectories at terminal time, ensuring stable sampling outcomes under minimal assumptions. Our analysis reveals how trajectories evolve from capturing global data features to local structures, providing a geometric characterization of per-sample behavior in flow matching models. We also explain the memorization phenomenon in diffusion-based training through our terminal time analysis. These findings bridge critical gaps in understanding flow matching models, with practical implications for sampling stability and model design.
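Sketch: a one-dimensional toy of a denoiser-driven sampling ODE. It assumes the linear path x_t = (1 - t) * x0 + t * x1 with x0 drawn from a standard Gaussian and x1 drawn uniformly from a small training set; under that assumption the velocity is (D_t(x) - x) / (1 - t), where D_t(x) = E[x1 | x_t = x] is the denoiser. The setup, names, and Euler discretization are illustrative choices, not the paper's construction.

```python
import math

def denoiser(x, t, data):
    """Posterior mean E[x1 | x_t = x] for an empirical data distribution.

    Given x1 = y, x_t is Gaussian with mean t * y and std (1 - t), so the
    posterior over training points is a softmax of negative squared errors.
    """
    var = (1.0 - t) ** 2
    logits = [-((x - t * y) ** 2) / (2.0 * var) for y in data]
    m = max(logits)  # subtract the max for numerical stability
    w = [math.exp(l - m) for l in logits]
    return sum(wi * y for wi, y in zip(w, data)) / sum(w)

def sample(x0, data, steps=100):
    """Euler integration of dx/dt = (D_t(x) - x) / (1 - t) from t = 0."""
    x, dt = x0, 1.0 / steps
    for k in range(steps):
        t = k * dt  # the last step uses t = 1 - dt, so 1 - t never hits zero
        velocity = (denoiser(x, t, data) - x) / (1.0 - t)
        x += dt * velocity
    return x

data = [-1.0, 1.0]  # hypothetical two-point training set
print(sample(0.5, data), sample(-0.3, data))
```

In this toy every trajectory ends on one of the two training points, a small-scale picture of the memorization behavior the abstract mentions. Early in the trajectory the denoiser is close to the data mean (a global feature); near t = 1 it snaps to the nearest training point (local structure).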

Title: Embodied Image Quality Assessment for Robotic Intelligence

Authors: Jianbo Zhang, Chunyi Li, Liang Yuan, Guoquan Zheng, Jie Hao, Guangtao Zhai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.18774
Pdf URL: https://arxiv.org/pdf/2412.18774
Copy Paste: [[2412.18774]] Embodied Image Quality Assessment for Robotic Intelligence(https://arxiv.org/abs/2412.18774)
Keywords: quality assessment
Abstract: Image quality assessment (IQA) of user-generated content (UGC) is a critical technique for human quality of experience (QoE). However, for robot-generated content (RGC), will its image quality be consistent with the Moravec paradox and counter to human common sense? Human subjective scoring is based more on the attractiveness of the image. Embodied agents are required to interact and perceive in the environment, and finally perform specific tasks. Visual images as inputs directly influence downstream tasks. In this paper, we first propose an embodied image quality assessment (EIQA) framework. We establish assessment metrics for input images based on the downstream tasks of robots. In addition, we construct an Embodied Preference Database (EPD) containing 5,000 reference and distorted image annotations. Finally, we verify the performance of mainstream IQA algorithms on the EPD dataset. The experiments demonstrate that quality assessment of embodied images is different from that of humans. We sincerely hope that the EPD can contribute to the development of embodied AI by focusing on image quality assessment. The benchmark is available at this https URL.

Title: ObitoNet: Multimodal High-Resolution Point Cloud Reconstruction

Authors: Apoorv Thapliyal, Vinay Lanka, Swathi Baskaran
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.18775
Pdf URL: https://arxiv.org/pdf/2412.18775
Copy Paste: [[2412.18775]] ObitoNet: Multimodal High-Resolution Point Cloud Reconstruction(https://arxiv.org/abs/2412.18775)
Keywords: generation
Abstract: ObitoNet employs a Cross Attention mechanism to integrate multimodal inputs, where Vision Transformers (ViT) extract semantic features from images and a point cloud tokenizer processes geometric information using Farthest Point Sampling (FPS) and K Nearest Neighbors (KNN) for spatial structure capture. The learned multimodal features are fed into a transformer-based decoder for high-resolution point cloud reconstruction. This approach leverages the complementary strengths of both modalities, rich image features and precise geometric details, ensuring robust point cloud generation even in challenging conditions such as sparse or noisy data.
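Sketch: the two geometric primitives named in the abstract, written in plain Python. Farthest Point Sampling greedily picks well-spread center points, and K Nearest Neighbors gathers a local patch around each center. How ObitoNet wires them into its tokenizer is not stated in the abstract, so the pairing shown in the usage lines is an assumption.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def farthest_point_sampling(points, k, start=0):
    """Return indices of k points chosen greedily to be far from each other."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    selected = [start]
    # Squared distance from every point to its nearest selected point so far.
    min_d = [sq_dist(p, points[start]) for p in points]
    while len(selected) < k:
        nxt = max(range(len(points)), key=min_d.__getitem__)
        selected.append(nxt)
        min_d = [min(d, sq_dist(p, points[nxt])) for d, p in zip(min_d, points)]
    return selected

def knn(points, center, k):
    """Indices of the k points closest to points[center] (center included)."""
    order = sorted(range(len(points)), key=lambda i: sq_dist(points[i], points[center]))
    return order[:k]

points = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (10, 0, 0)]
centers = farthest_point_sampling(points, 3)
patches = [knn(points, c, 2) for c in centers]  # one local patch per center
print(centers, patches)
```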

Title: Protective Perturbations against Unauthorized Data Usage in Diffusion-based Image Generation

Authors: Sen Peng, Jijia Yang, Mingyue Wang, Jianfei He, Xiaohua Jia
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.18791
Pdf URL: https://arxiv.org/pdf/2412.18791
Copy Paste: [[2412.18791]] Protective Perturbations against Unauthorized Data Usage in Diffusion-based Image Generation(https://arxiv.org/abs/2412.18791)
Keywords: generation
Abstract: Diffusion-based text-to-image models have shown immense potential for various image-related tasks. However, despite their prominence and popularity, customizing these models using unauthorized data also brings serious privacy and intellectual property issues. Existing methods introduce protective perturbations based on adversarial attacks, which are applied to the customization samples. In this systematization of knowledge, we present a comprehensive survey of protective perturbation methods designed to prevent unauthorized data usage in diffusion-based image generation. We establish the threat model and categorize the downstream tasks relevant to these methods, providing a detailed analysis of their designs. We also propose a complete evaluation framework for these perturbation techniques, aiming to advance research in this field.

Title: DRDM: A Disentangled Representations Diffusion Model for Synthesizing Realistic Person Images

Authors: Enbo Huang, Yuan Zhang, Faliang Huang, Guangyu Zhang, Yang Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.18797
Pdf URL: https://arxiv.org/pdf/2412.18797
Copy Paste: [[2412.18797]] DRDM: A Disentangled Representations Diffusion Model for Synthesizing Realistic Person Images(https://arxiv.org/abs/2412.18797)
Keywords: generation
Abstract: Person image synthesis with controllable body poses and appearances is an essential task owing to the practical needs in the context of virtual try-on, image editing and video production. However, existing methods face significant challenges such as missing details, limb distortion, and garment style deviation. To address these issues, we propose a Disentangled Representations Diffusion Model (DRDM) to generate photo-realistic images from source portraits in specific desired poses and appearances. First, a pose encoder is responsible for encoding pose features into a high-dimensional space to guide the generation of person images. Second, a body-part subspace decoupling block (BSDB) disentangles features from the different body parts of a source figure and feeds them to the various layers of the noise prediction block, thereby supplying the network with rich disentangled features for generating a realistic target image. Moreover, during inference, we develop a parsing map-based disentangled classifier-free guided sampling method, which amplifies the conditional signals of texture and pose. Extensive experimental results on the Deepfashion dataset demonstrate the effectiveness of our approach in achieving pose transfer and appearance control.
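Sketch: how classifier-free guidance can amplify separate conditional signals. The formula used here, eps = eps_uncond + sum over conditions of scale * (eps_cond - eps_uncond), is a generic multi-condition form and only an assumption about what "disentangled" guidance could look like; the abstract does not give DRDM's actual sampling rule, and the condition names and scales are hypothetical.

```python
def guided_noise(eps_uncond, eps_by_condition, scales):
    """Combine an unconditional noise prediction with per-condition ones.

    Each condition pushes the prediction along (eps_cond - eps_uncond),
    weighted by its own guidance scale.
    """
    out = list(eps_uncond)
    for name, eps_cond in eps_by_condition.items():
        s = scales[name]
        for i, (e_c, e_u) in enumerate(zip(eps_cond, eps_uncond)):
            out[i] += s * (e_c - e_u)
    return out

# Toy 2-dimensional noise predictions (made-up numbers).
eps_uncond = [0.0, 0.0]
eps_cond = {"pose": [1.0, 0.0], "texture": [0.0, 1.0]}

# Separate scales let pose and texture be amplified independently.
print(guided_noise(eps_uncond, eps_cond, {"pose": 2.0, "texture": 3.0}))
```

With a single condition and a scale of 1.0 the function returns the conditional prediction unchanged; scales above 1.0 strengthen that condition's influence.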

Title: DebiasDiff: Debiasing Text-to-image Diffusion Models with Self-discovering Latent Attribute Directions

Authors: Yilei Jiang, Weihong Li, Yiyuan Zhang, Minghong Cai, Xiangyu Yue
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.18810
Pdf URL: https://arxiv.org/pdf/2412.18810
Copy Paste: [[2412.18810]] DebiasDiff: Debiasing Text-to-image Diffusion Models with Self-discovering Latent Attribute Directions(https://arxiv.org/abs/2412.18810)
Keywords: generation, generative
Abstract: While Diffusion Models (DMs) exhibit remarkable performance across various image generative tasks, they nonetheless reflect the inherent bias present in the training set. As DMs are now widely used in real-world applications, these biases could perpetuate a distorted worldview and hinder opportunities for minority groups. Existing methods for debiasing DMs usually require model re-training with a human-crafted reference dataset or additional classifiers, which suffer from two major limitations: (1) collecting reference datasets incurs high annotation costs; (2) the debiasing performance is heavily constrained by the quality of the reference dataset or the additional classifier. To address the above limitations, we propose DebiasDiff, a plug-and-play method that learns attribute latent directions in a self-discovering manner, thus eliminating the reliance on such reference datasets. Specifically, DebiasDiff consists of two parts: a set of attribute adapters and a distribution indicator. Each adapter in the set aims to learn an attribute latent direction, and is optimized via noise composition through a self-discovering process. Then, the distribution indicator is multiplied by the set of adapters to guide the generation process towards the prescribed distribution. Our method enables debiasing multiple attributes in DMs simultaneously, while remaining lightweight and easily integrable with other DMs, eliminating the need for re-training. Extensive experiments on debiasing gender, racial, and their intersectional biases show that our method outperforms previous SOTA by a large margin.

Title: CausalTAD: Causal Implicit Generative Model for Debiased Online Trajectory Anomaly Detection

Authors: Wenbin Li, Di Yao, Chang Gong, Xiaokai Chu, Quanliang Jing, Xiaolei Zhou, Yuxuan Zhang, Yunxia Fan, Jingping Bi
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.18820
Pdf URL: https://arxiv.org/pdf/2412.18820
Copy Paste: [[2412.18820]] CausalTAD: Causal Implicit Generative Model for Debiased Online Trajectory Anomaly Detection(https://arxiv.org/abs/2412.18820)
Keywords: generative
Abstract: Trajectory anomaly detection, aiming to estimate the anomaly risk of trajectories given the Source-Destination (SD) pairs, has become a critical problem for many real-world applications. Existing solutions directly train a generative model for observed trajectories and calculate the conditional generative probability $P({T}|{C})$ as the anomaly risk, where ${T}$ and ${C}$ represent the trajectory and SD pair respectively. However, we argue that the observed trajectories are confounded by road network preference, which is a common cause of both SD distribution and trajectories. Existing methods ignore this issue, limiting their generalization ability on out-of-distribution trajectories. In this paper, we define the debiased trajectory anomaly detection problem and propose a causal implicit generative model, namely CausalTAD, to solve it. CausalTAD adopts do-calculus to eliminate the confounding bias of road network preference and estimates $P({T}|do({C}))$ as the anomaly criterion. Extensive experiments show that CausalTAD can not only achieve superior performance on trained trajectories but also generally improve the performance on out-of-distribution data, with improvements of $2.1\% \sim 5.7\%$ and $10.6\% \sim 32.7\%$ respectively.
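Sketch: why $P(T|C)$ and $P(T|do(C))$ can differ when a confounder drives both the SD pair and the trajectory. The example uses the textbook backdoor adjustment, which assumes the confounder Z (here a made-up "road preference" with two values) is discrete and observed. CausalTAD is described as an implicit generative model, so this is not its estimator, and all probabilities below are invented for illustration.

```python
def observational(p_t_given_cz, p_z_given_c):
    """P(T | C) = sum_z P(T | C, z) * P(z | C): Z is weighted as observed with C."""
    return sum(p_t_given_cz[z] * p_z_given_c[z] for z in p_z_given_c)

def interventional(p_t_given_cz, p_z):
    """P(T | do(C)) = sum_z P(T | C, z) * P(z): Z keeps its marginal weights."""
    return sum(p_t_given_cz[z] * p_z[z] for z in p_z)

# Hypothetical numbers for one fixed trajectory T and one fixed SD pair C.
p_z = {"highway": 0.5, "local": 0.5}              # marginal road preference
p_z_given_c = {"highway": 0.9, "local": 0.1}      # preference among trips with this SD pair
p_t_given_cz = {"highway": 0.05, "local": 0.6}    # likelihood of T under each preference

p_obs = observational(p_t_given_cz, p_z_given_c)
p_do = interventional(p_t_given_cz, p_z)
print(p_obs, p_do)
```

In this toy the trajectory looks unlikely under the observational score only because trips with this SD pair are dominated by highway-preferring drivers; the interventional score removes that dependence.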

Title: DiFiC: Your Diffusion Model Holds the Secret to Fine-Grained Clustering

Authors: Ruohong Yang, Peng Hu, Xi Peng, Xiting Liu, Yunfan Li
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.18838
Pdf URL: https://arxiv.org/pdf/2412.18838
Copy Paste: [[2412.18838]] DiFiC: Your Diffusion Model Holds the Secret to Fine-Grained Clustering(https://arxiv.org/abs/2412.18838)
Keywords: generation, generative
Abstract: Fine-grained clustering is a practical yet challenging task, whose essence lies in capturing the subtle differences between instances of different classes. Such subtle differences can be easily disrupted by data augmentation or be overwhelmed by redundant information in data, leading to significant performance degradation for existing clustering methods. In this work, we introduce DiFiC, a fine-grained clustering method building upon the conditional diffusion model. Distinct from existing works that focus on extracting discriminative features from images, DiFiC resorts to deducing the textual conditions used for image generation. To distill more precise and clustering-favorable object semantics, DiFiC further regularizes the diffusion target and guides the distillation process utilizing neighborhood similarity. Extensive experiments demonstrate that DiFiC outperforms both state-of-the-art discriminative and generative clustering methods on four fine-grained image clustering benchmarks. We hope the success of DiFiC will inspire future research to unlock the potential of diffusion models in tasks beyond generation. The code will be released.

Title: SWAG: Long-term Surgical Workflow Prediction with Generative-based Anticipation

Authors: Maxence Boels, Yang Liu, Prokar Dasgupta, Alejandro Granados, Sebastien Ourselin
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.18849
Pdf URL: https://arxiv.org/pdf/2412.18849
Copy Paste: [[2412.18849]] SWAG: Long-term Surgical Workflow Prediction with Generative-based Anticipation(https://arxiv.org/abs/2412.18849)
Keywords: generation, generative
Abstract: While existing recognition approaches excel at identifying current surgical phases, they provide limited foresight into future procedural steps, restricting their intraoperative utility. Similarly, current anticipation methods are constrained to predicting short-term events or singular future occurrences, neglecting the dynamic and sequential nature of surgical workflows. To address these limitations, we propose SWAG (Surgical Workflow Anticipative Generation), a unified framework for phase recognition and long-term anticipation of surgical workflows. SWAG employs two generative decoding methods -- single-pass (SP) and auto-regressive (AR) -- to predict sequences of future surgical phases. A novel prior knowledge embedding mechanism enhances the accuracy of anticipatory predictions. The framework addresses future phase classification and remaining time regression tasks. Additionally, a regression-to-classification (R2C) method is introduced to map continuous predictions to discrete temporal segments. SWAG's performance was evaluated on the Cholec80 and AutoLaparo21 datasets. The single-pass classification model with prior knowledge embeddings (SWAG-SP*) achieved 53.5% accuracy in 15-minute anticipation on AutoLaparo21, while the R2C model reached 60.8% accuracy on Cholec80. SWAG's single-pass regression approach outperformed existing methods for remaining time prediction, achieving weighted mean absolute errors of 0.32 and 0.48 minutes for 2- and 3-minute horizons, respectively. SWAG demonstrates versatility across classification and regression tasks, offering robust tools for real-time surgical workflow anticipation. By unifying recognition and anticipatory capabilities, SWAG provides actionable predictions to enhance intraoperative decision-making.

Title: Computing Approximate Graph Edit Distance via Optimal Transport

Authors: Qihao Cheng, Da Yan, Tianhao Wu, Zhongyi Huang, Qin Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.18857
Pdf URL: https://arxiv.org/pdf/2412.18857
Copy Paste: [[2412.18857]] Computing Approximate Graph Edit Distance via Optimal Transport(https://arxiv.org/abs/2412.18857)
Keywords: generation
Abstract: Given a graph pair $(G^1, G^2)$, graph edit distance (GED) is defined as the minimum number of edit operations converting $G^1$ to $G^2$. GED is a fundamental operation widely used in many applications, but its exact computation is NP-hard, so the approximation of GED has gained a lot of attention. Data-driven learning-based methods have been found to provide superior results compared to classical approximate algorithms, but they directly fit the coupling relationship between a pair of vertices from their vertex features. We argue that while pairwise vertex features can capture the coupling cost (discrepancy) of a pair of vertices, the vertex coupling matrix should be derived from the vertex-pair cost matrix through a more well-established method that is aware of the global context of the graph pair, such as optimal transport. In this paper, we propose an ensemble approach that integrates a supervised learning-based method and an unsupervised method, both based on optimal transport. Our learning method, GEDIOT, is based on inverse optimal transport that leverages a learnable Sinkhorn algorithm to generate the coupling matrix. Our unsupervised method, GEDGW, models GED computation as a linear combination of optimal transport and its variant, Gromov-Wasserstein discrepancy, for node and edge operations, respectively, which can be solved efficiently without needing the ground truth. Our ensemble method, GEDHOT, combines GEDIOT and GEDGW to further boost the performance. Extensive experiments demonstrate that our methods significantly outperform the existing methods in terms of the performance of GED computation, edit path generation, and model generalizability.
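Sketch: the plain Sinkhorn iteration that turns a vertex-pair cost matrix into a coupling matrix whose rows and columns sum to prescribed marginals. GEDIOT is described as using a learnable Sinkhorn variant, which this is not; the uniform marginals, the regularization strength `reg`, and the 2x2 cost matrix are assumptions chosen to keep the example small.

```python
import math

def sinkhorn(cost, reg=0.1, iters=200):
    """Entropy-regularized optimal transport with uniform marginals.

    Returns a coupling matrix P with P[i][j] = u[i] * K[i][j] * v[j],
    where K = exp(-cost / reg) and u, v are found by alternating rescaling.
    """
    n, m = len(cost), len(cost[0])
    K = [[math.exp(-c / reg) for c in row] for row in cost]
    a, b = [1.0 / n] * n, [1.0 / m] * m  # target row and column sums
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

# Hypothetical cost: vertex i of G^1 is cheap to couple with vertex i of G^2.
cost = [[0.0, 1.0], [1.0, 0.0]]
P = sinkhorn(cost)
for row in P:
    print([round(p, 4) for p in row])
```

Most of the mass lands on the low-cost diagonal, while the marginal constraints make the coupling depend on the whole cost matrix and not on one vertex pair at a time.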

Title: Cross-PCR: A Robust Cross-Source Point Cloud Registration Framework

Authors: Guiyu Zhao, Zhentao Guo, Zewen Du, Hongbin Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.18873
Pdf URL: https://arxiv.org/pdf/2412.18873
Copy Paste: [[2412.18873]] Cross-PCR: A Robust Cross-Source Point Cloud Registration Framework(https://arxiv.org/abs/2412.18873)
Keywords: generation
Abstract: Due to the density inconsistency and distribution difference between cross-source point clouds, previous methods fail in cross-source point cloud registration. We propose a density-robust feature extraction and matching scheme to achieve robust and accurate cross-source registration. To address the density inconsistency between cross-source data, we introduce a density-robust encoder for extracting density-robust features. To tackle the issue of challenging feature matching and few correct correspondences, we adopt a loose-to-strict matching pipeline with a "loose generation, strict selection" idea. Under it, we employ a one-to-many strategy to loosely generate initial correspondences. Subsequently, high-quality correspondences are strictly selected to achieve robust registration through sparse matching and dense matching. On the challenging Kinect-LiDAR scene in the cross-source 3DCSR dataset, our method improves feature matching recall by 63.5 percentage points (pp) and registration recall by 57.6 pp. It also achieves the best performance on 3DMatch, while maintaining robustness under diverse downsampling densities.

Title: Accelerating Diffusion Transformers with Dual Feature Caching

Authors: Chang Zou, Evelyn Zhang, Runlin Guo, Haohang Xu, Conghui He, Xuming Hu, Linfeng Zhang
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2412.18911
Pdf URL: https://arxiv.org/pdf/2412.18911
Copy Paste: [[2412.18911]] Accelerating Diffusion Transformers with Dual Feature Caching(https://arxiv.org/abs/2412.18911)
Keywords: generation
Abstract: Diffusion Transformers (DiT) have become the dominant methods in image and video generation yet still suffer from substantial computational costs. As an effective approach for DiT acceleration, feature caching methods are designed to cache the features of DiT in previous timesteps and reuse them in the next timesteps, allowing us to skip the computation in the next timesteps. However, on the one hand, aggressively reusing all the features cached in previous timesteps leads to a severe drop in generation quality. On the other hand, conservatively caching only the features in the redundant layers or tokens while still computing the important ones preserves the generation quality but reduces the acceleration ratio. Observing such a tradeoff between generation quality and acceleration performance, this paper begins by quantitatively studying the accumulated error from cached features. Surprisingly, we find that aggressive caching does not introduce significantly more caching errors in the caching step, and that conservative feature caching can fix the error introduced by aggressive caching. We therefore propose a dual caching strategy that adopts aggressive and conservative caching iteratively, leading to significant acceleration and high generation quality at the same time. In addition, we introduce a V-caching strategy for token-wise conservative caching, which is compatible with flash attention and requires no training or calibration data. Our code has been released on GitHub: this https URL
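The basic idea of feature caching that this abstract builds on (compute a block's features at some timesteps and reuse them at the following ones) can be sketched as below. This is only a minimal fixed-interval cache under stated assumptions, not the dual caching or V-caching strategy of the paper; the names `run_with_cache`, `block`, and `interval` are hypothetical.

```python
def run_with_cache(block, inputs_per_step, interval=3):
    """Run `block` only every `interval` timesteps and reuse its cached output otherwise.

    block: any callable standing in for an expensive network block (assumption)
    inputs_per_step: one input per denoising timestep
    Returns the per-step outputs and the number of times `block` was actually called.
    """
    cache = None
    outputs = []
    n_computed = 0
    for step, x in enumerate(inputs_per_step):
        if cache is None or step % interval == 0:
            cache = block(x)      # full computation, refreshes the cache
            n_computed += 1
        outputs.append(cache)     # cached steps skip the computation entirely
    return outputs, n_computed

# Toy stand-in for a network block: squaring a slowly changing input
inputs = [1.0, 1.1, 1.2, 1.3, 1.4, 1.5]
outputs, n_computed = run_with_cache(lambda x: x * x, inputs, interval=3)
```

The gap between a reused output and the value a fresh computation would have produced is the caching error; the abstract studies how this error accumulates across timesteps.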

Title: Generative Face Parsing Map Guided 3D Face Reconstruction Under Occluded Scenes

Authors: Dapeng Zhao, Yue Qi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.18920
Pdf URL: https://arxiv.org/pdf/2412.18920
Copy Paste: [[2412.18920]] Generative Face Parsing Map Guided 3D Face Reconstruction Under Occluded Scenes(https://arxiv.org/abs/2412.18920)
Keywords: generation, generative
Abstract: Over the past few years, single-view 3D face reconstruction methods can produce beautiful 3D models. Nevertheless, the input of these works is unobstructed faces. We describe a system designed to reconstruct convincing face texture in the case of occlusion. [...] by parsing facial features, we propose a complete face parsing map generation method guided by [...]. We estimate the 2D face structure of the reasonable position of the occlusion area, which is used for the construction of 3D [...]. An excellent anti-occlusion face reconstruction method should ensure the authenticity of the output, including the topological structure between the eyes, nose, and mouth. We extensively tested our method and its components, qualitatively demonstrating the rationality of our estimated facial structure. We conduct extensive experiments on general 3D face reconstruction tasks as concrete examples to demonstrate the method's superior regulation ability over existing methods, which often break down. We further provide numerous quantitative examples showing that our method advances both the quality and the robustness of 3D face reconstruction under occlusion scenes.

Title: Exemplar-condensed Federated Class-incremental Learning

Authors: Rui Sun, Yumin Zhang, Varun Ojha, Tejal Shah, Haoran Duan, Bo Wei, Rajiv Ranjan
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.18926
Pdf URL: https://arxiv.org/pdf/2412.18926
Copy Paste: [[2412.18926]] Exemplar-condensed Federated Class-incremental Learning(https://arxiv.org/abs/2412.18926)
Keywords: generative
Abstract: We propose Exemplar-Condensed federated class-incremental learning (ECoral) to distil the training characteristics of real images from streaming data into informative rehearsal exemplars. The proposed method eliminates the limitations of exemplar selection in replay-based approaches for mitigating catastrophic forgetting in federated continual learning (FCL). These limitations particularly relate to the heterogeneity of the information density of each summarized data. Our approach maintains the consistency of training gradients and the relationship to past tasks so that the summarized exemplars represent the streaming data effectively compared to the original images. Additionally, our approach reduces the information-level heterogeneity of the summarized data by inter-client sharing of the disentanglement generative model. Extensive experiments show that our ECoral outperforms several state-of-the-art methods and can be seamlessly integrated with many existing approaches to enhance performance.

Title: UNIC-Adapter: Unified Image-instruction Adapter with Multi-modal Transformer for Image Generation

Authors: Lunhao Duan, Shanshan Zhao, Wenjun Yan, Yinglun Li, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, Mingming Gong, Gui-Song Xia
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.18928
Pdf URL: https://arxiv.org/pdf/2412.18928
Copy Paste: [[2412.18928]] UNIC-Adapter: Unified Image-instruction Adapter with Multi-modal Transformer for Image Generation(https://arxiv.org/abs/2412.18928)
Keywords: generation
Abstract: Recently, text-to-image generation models have achieved remarkable advancements, particularly with diffusion models facilitating high-quality image synthesis from textual descriptions. However, these models often struggle with achieving precise control over pixel-level layouts, object appearances, and global styles when using text prompts alone. To mitigate this issue, previous works introduce conditional images as auxiliary inputs for image generation, enhancing control but typically necessitating specialized models tailored to different types of reference inputs. In this paper, we explore a new approach to unify controllable generation within a single framework. Specifically, we propose the unified image-instruction adapter (UNIC-Adapter) built on the Multi-Modal-Diffusion Transformer architecture, to enable flexible and controllable generation across diverse conditions without the need for multiple specialized models. Our UNIC-Adapter effectively extracts multi-modal instruction information by incorporating both conditional images and task instructions, injecting this information into the image generation process through a cross-attention mechanism enhanced by Rotary Position Embedding. Experimental results across a variety of tasks, including pixel-level spatial control, subject-driven image generation, and style-image-based image synthesis, demonstrate the effectiveness of our UNIC-Adapter in unified controllable image generation.

Title: TINQ: Temporal Inconsistency Guided Blind Video Quality Assessment

Authors: Yixiao Li, Xiaoyuan Yang, Weide Liu, Xin Jin, Xu Jia, Yukun Lai, Haotao Liu, Paul L Rosin, Wei Zhou
Subjects: cs.CV, cs.MM, eess.IV
Abstract URL: https://arxiv.org/abs/2412.18933
Pdf URL: https://arxiv.org/pdf/2412.18933
Copy Paste: [[2412.18933]] TINQ: Temporal Inconsistency Guided Blind Video Quality Assessment(https://arxiv.org/abs/2412.18933)
Keywords: super-resolution, quality assessment
Abstract: Blind video quality assessment (BVQA) has been actively researched for user-generated content (UGC) videos. Recently, super-resolution (SR) techniques have been widely applied in UGC. Therefore, an effective BVQA method for both UGC and SR scenarios is essential. Temporal inconsistency, referring to irregularities between consecutive frames, is relevant to video quality. Current BVQA approaches typically model temporal relationships in UGC videos using statistics of motion information, but inconsistencies remain unexplored. Additionally, different from temporal inconsistency in UGC videos, such inconsistency in SR videos is amplified due to upscaling algorithms. In this paper, we introduce the Temporal Inconsistency Guided Blind Video Quality Assessment (TINQ) metric, demonstrating that exploring temporal inconsistency is crucial for effective BVQA. Since temporal inconsistencies vary between UGC and SR videos, they are calculated in different ways. Based on this, a spatial module highlights inconsistent areas across consecutive frames at coarse and fine granularities. In addition, a temporal module aggregates features over time in two stages. The first stage employs a visual memory capacity block to adaptively segment the time dimension based on estimated complexity, while the second stage focuses on selecting key features. The stages work together through Consistency-aware Fusion Units to regress cross-time-scale video quality. Extensive experiments on UGC and SR video quality datasets show that our method outperforms existing state-of-the-art BVQA methods. Code is available at this https URL.

Title: ModelGrow: Continual Text-to-Video Pre-training with Model Expansion and Language Understanding Enhancement

Authors: Zhefan Rao, Liya Ji, Yazhou Xing, Runtao Liu, Zhaoyang Liu, Jiaxin Xie, Ziqiao Peng, Yingqing He, Qifeng Chen
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.18966
Pdf URL: https://arxiv.org/pdf/2412.18966
Copy Paste: [[2412.18966]] ModelGrow: Continual Text-to-Video Pre-training with Model Expansion and Language Understanding Enhancement(https://arxiv.org/abs/2412.18966)
Keywords: generation
Abstract: Text-to-video (T2V) generation has gained significant attention recently. However, the costs of training a T2V model from scratch remain persistently high, and there is considerable room for improving the generation performance, especially under limited computation resources. This work explores the continual general pre-training of text-to-video models, enabling the model to "grow" its abilities based on a pre-trained foundation, analogous to how humans acquire new knowledge based on past experiences. There is a lack of extensive study of the continual pre-training techniques in T2V generation. In this work, we take the initial step toward exploring this task systematically and propose ModelGrow. Specifically, we break this task into two key aspects: increasing model capacity and improving semantic understanding. For model capacity, we introduce several novel techniques to expand the model size, enabling it to store new knowledge and improve generation performance. For semantic understanding, we propose a method that leverages large language models as advanced text encoders, integrating them into T2V models to enhance language comprehension and guide generation results according to detailed prompts. This approach enables the model to achieve better semantic alignment, particularly in response to complex user prompts. Extensive experiments demonstrate the effectiveness of our method across various metrics. The source code and the model of ModelGrow will be publicly available.

Title: MGAN-CRCM: A Novel Multiple Generative Adversarial Network and Coarse-Refinement Based Cognizant Method for Image Inpainting

Authors: Nafiz Al Asad, Md. Appel Mahmud Pranto, Shbiruzzaman Shiam, Musaddeq Mahmud Akand, Mohammad Abu Yousuf, Khondokar Fida Hasan, Mohammad Ali Moni
Subjects: cs.CV, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2412.19000
Pdf URL: https://arxiv.org/pdf/2412.19000
Copy Paste: [[2412.19000]] MGAN-CRCM: A Novel Multiple Generative Adversarial Network and Coarse-Refinement Based Cognizant Method for Image Inpainting(https://arxiv.org/abs/2412.19000)
Keywords: generative
Abstract: Image inpainting is a widely used technique in computer vision for reconstructing missing or damaged pixels in images. Recent advancements with Generative Adversarial Networks (GANs) have demonstrated superior performance over traditional methods due to their deep learning capabilities and adaptability across diverse image domains. Residual Networks (ResNet) have also gained prominence for their ability to enhance feature representation and compatibility with other architectures. This paper introduces a novel architecture combining GAN and ResNet models to improve image inpainting outcomes. Our framework integrates three components: Transpose Convolution-based GAN for guided and blind inpainting, Fast ResNet-Convolutional Neural Network (FR-CNN) for object removal, and Co-Modulation GAN (Co-Mod GAN) for refinement. The model's performance was evaluated on benchmark datasets, achieving accuracies of 96.59% on Image-Net, 96.70% on Places2, and 96.16% on CelebA. Comparative analyses demonstrate that the proposed architecture outperforms existing methods, highlighting its effectiveness in both qualitative and quantitative evaluations.

Title: FACEMUG: A Multimodal Generative and Fusion Framework for Local Facial Editing

Authors: Wanglong Lu, Jikai Wang, Xiaogang Jin, Xianta Jiang, Hanli Zhao
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2412.19009
Pdf URL: https://arxiv.org/pdf/2412.19009
Copy Paste: [[2412.19009]] FACEMUG: A Multimodal Generative and Fusion Framework for Local Facial Editing(https://arxiv.org/abs/2412.19009)
Keywords: generative
Abstract: Existing facial editing methods have achieved remarkable results, yet they often fall short in supporting multimodal conditional local facial editing. One significant piece of evidence is that their output image quality degrades dramatically after several iterations of incremental editing, as they do not support local editing. In this paper, we present a novel multimodal generative and fusion framework for globally-consistent local facial editing (FACEMUG) that can handle a wide range of input modalities and enable fine-grained and semantic manipulation while keeping unedited parts unchanged. Different modalities, including sketches, semantic maps, color maps, exemplar images, text, and attribute labels, are adept at conveying diverse conditioning details, and their combined synergy can provide more explicit guidance for the editing process. We thus integrate all modalities into a unified generative latent space to enable multimodal local facial edits. Specifically, a novel multimodal feature fusion mechanism is proposed by utilizing multimodal aggregation and style fusion blocks to fuse facial priors and multimodalities in both latent and feature spaces. We further introduce a novel self-supervised latent warping algorithm to rectify misaligned facial features, efficiently transferring the pose of the edited image to the given latent codes. We evaluate our FACEMUG through extensive experiments and comparisons to state-of-the-art (SOTA) methods. The results demonstrate the superiority of FACEMUG in terms of editing quality, flexibility, and semantic control, making it a promising solution for a wide range of local facial editing tasks.

Title: Relation-aware Hierarchical Prompt for Open-vocabulary Scene Graph Generation

Authors: Tao Liu, Rongjie Li, Chongyu Wang, Xuming He
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.19021
Pdf URL: https://arxiv.org/pdf/2412.19021
Copy Paste: [[2412.19021]] Relation-aware Hierarchical Prompt for Open-vocabulary Scene Graph Generation(https://arxiv.org/abs/2412.19021)
Keywords: generation
Abstract: Open-vocabulary Scene Graph Generation (OV-SGG) overcomes the limitations of the closed-set assumption by aligning visual relationship representations with open-vocabulary textual representations. This enables the identification of novel visual relationships, making it applicable to real-world scenarios with diverse relationships. However, existing OV-SGG methods are constrained by fixed text representations, limiting diversity and accuracy in image-text alignment. To address these challenges, we propose the Relation-Aware Hierarchical Prompting (RAHP) framework, which enhances text representation by integrating subject-object and region-specific relation information. Our approach utilizes entity clustering to address the complexity of relation triplet categories, enabling the effective integration of subject-object information. Additionally, we utilize a large language model (LLM) to generate detailed region-aware prompts, capturing fine-grained visual interactions and improving alignment between visual and textual modalities. RAHP also introduces a dynamic selection mechanism within Vision-Language Models (VLMs), which adaptively selects relevant text prompts based on the visual content, reducing noise from irrelevant prompts. Extensive experiments on the Visual Genome and Open Images v6 datasets demonstrate that our framework consistently achieves state-of-the-art performance, demonstrating its effectiveness in addressing the challenges of open-vocabulary scene graph generation.

Title: DAPoinTr: Domain Adaptive Point Transformer for Point Cloud Completion

Authors: Yinghui Li, Qianyu Zhou, Jingyu Gong, Ye Zhu, Richard Dazeley, Xinkui Zhao, Xuequan Lu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.19062
Pdf URL: https://arxiv.org/pdf/2412.19062
Copy Paste: [[2412.19062]] DAPoinTr: Domain Adaptive Point Transformer for Point Cloud Completion(https://arxiv.org/abs/2412.19062)
Keywords: generation
Abstract: Point Transformers (PoinTr) have shown great potential in point cloud completion recently. Nevertheless, effective domain adaptation that improves transferability toward target domains remains unexplored. In this paper, we delve into this topic and empirically discover that direct feature alignment on point Transformer's CNN backbone only brings limited improvements since it cannot guarantee sequence-wise domain-invariant features in the Transformer. To this end, we propose a pioneering Domain Adaptive Point Transformer (DAPoinTr) framework for point cloud completion. DAPoinTr consists of three key components: Domain Query-based Feature Alignment (DQFA), Point Token-wise Feature alignment (PTFA), and Voted Prediction Consistency (VPC). In particular, DQFA is presented to narrow the global domain gaps from the sequence via the presented domain proxy and domain query at the Transformer encoder and decoder, respectively. PTFA is proposed to close the local domain shifts by aligning the tokens, i.e., point proxy and dynamic query, at the Transformer encoder and decoder, respectively. VPC is designed to consider different Transformer decoders as multiple of experts (MoE) for ensembled prediction voting and pseudo-label generation. Extensive experiments with visualization on several domain adaptation benchmarks demonstrate the effectiveness and superiority of our DAPoinTr compared with state-of-the-art methods. Code will be publicly available at: this https URL

Title: FFCG: Effective and Fast Family Column Generation for Solving Large-Scale Linear Program

Authors: Yi-Xiang Hu, Feng Wu, Shaoang Li, Yifang Zhao, Xiang-Yang Li
Subjects: cs.LG, math.OC
Abstract URL: https://arxiv.org/abs/2412.19066
Pdf URL: https://arxiv.org/pdf/2412.19066
Copy Paste: [[2412.19066]] FFCG: Effective and Fast Family Column Generation for Solving Large-Scale Linear Program(https://arxiv.org/abs/2412.19066)
Keywords: generation
Abstract: Column Generation (CG) is an effective and iterative algorithm to solve large-scale linear programs (LP). During each CG iteration, new columns are added to improve the solution of the LP. Typically, CG greedily selects one column with the most negative reduced cost, which can be improved by adding more columns at once. However, selecting all columns with negative reduced costs would lead to the addition of redundant columns that do not improve the objective value. Therefore, selecting the appropriate columns to add is still an open problem, and previous machine-learning-based approaches for CG only add a constant quantity of columns per iteration due to the state-space explosion problem. To address this, we propose Fast Family Column Generation (FFCG) -- a novel reinforcement-learning-based CG that selects a variable number of columns as needed in an iteration. Specifically, we formulate the column selection problem in CG as an MDP and design a reward metric that balances both the convergence speed and the number of redundant columns. In our experiments, FFCG converges faster on the common benchmarks: compared to several state-of-the-art baselines, it reduces the number of CG iterations by 77.1% for the Cutting Stock Problem (CSP) and 84.8% for the Vehicle Routing Problem with Time Windows (VRPTW), and reduces computing time by 71.4% for CSP and 84.0% for VRPTW on average.
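The selection rule this abstract starts from, picking the column with the most negative reduced cost, can be sketched as follows. For a candidate column $a_j$ with cost $c_j$ and dual values $y$ of the restricted master problem, the reduced cost is $c_j - y^\top a_j$. The sketch generalizes the greedy rule to the `k` most negative columns; it is not the reinforcement-learning policy of FFCG, and the function names, the dual values, and the cutting patterns below are made-up illustrations.

```python
def reduced_costs(costs, columns, duals):
    """Reduced cost c_j - y^T a_j for each candidate column a_j."""
    return [c - sum(y * a for y, a in zip(duals, col))
            for c, col in zip(costs, columns)]

def select_columns(costs, columns, duals, k=1, tol=1e-9):
    """Indices of up to k columns with the most negative reduced cost.

    k=1 is the classical greedy rule; a larger k adds several columns at once.
    Columns whose reduced cost is not below -tol cannot improve the LP and are skipped.
    """
    rc = reduced_costs(costs, columns, duals)
    candidates = sorted((r, j) for j, r in enumerate(rc) if r < -tol)
    return [j for _, j in candidates[:k]]

# Hypothetical cutting-stock data: each pattern uses one roll (cost 1) and
# yields the listed number of pieces of each of three item types.
costs = [1.0, 1.0, 1.0, 1.0]
columns = [[2, 0, 0], [0, 1, 1], [1, 2, 0], [0, 0, 1]]
duals = [0.6, 0.35, 0.3]   # assumed duals of the restricted master problem

best_one = select_columns(costs, columns, duals, k=1)
best_two = select_columns(costs, columns, duals, k=2)
```

Here only patterns 2 and 0 have negative reduced cost, so asking for more than two columns still returns just those two. A fixed `k` is the "constant quantity of columns per iteration" the abstract attributes to earlier learning-based approaches.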

Title: Mask Factory: Towards High-quality Synthetic Data Generation for Dichotomous Image Segmentation

Authors: Haotian Qian, YD Chen, Shengtao Lou, Fahad Shahbaz Khan, Xiaogang Jin, Deng-Ping Fan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.19080
Pdf URL: https://arxiv.org/pdf/2412.19080
Copy Paste: [[2412.19080]] Mask Factory: Towards High-quality Synthetic Data Generation for Dichotomous Image Segmentation(https://arxiv.org/abs/2412.19080)
Keywords: generation, generative
Abstract: Dichotomous Image Segmentation (DIS) tasks require highly precise annotations, and traditional dataset creation methods are labor intensive, costly, and require extensive domain expertise. Although using synthetic data for DIS is a promising solution to these challenges, current generative models and techniques struggle with the issues of scene deviations, noise-induced errors, and limited training sample variability. To address these issues, we introduce a novel approach, Mask Factory, which provides a scalable solution for generating diverse and precise datasets, markedly reducing preparation time and costs. We first introduce a general mask editing method that combines rigid and non-rigid editing techniques to generate high-quality synthetic masks. Specifically, rigid editing leverages geometric priors from diffusion models to achieve precise viewpoint transformations under zero-shot conditions, while non-rigid editing employs adversarial training and self-attention mechanisms for complex, topologically consistent modifications. Then, we generate pairs of high-resolution images and accurate segmentation masks using a multi-conditional control generation method. Finally, our experiments on the widely-used DIS5K dataset benchmark demonstrate superior performance in quality and efficiency compared to existing methods. The code is available at this https URL.

Title: Improving Generative Pre-Training: An In-depth Study of Masked Image Modeling and Denoising Models

Authors: Hyesong Choi, Daeun Kim, Sungmin Cha, Kwang Moo Yi, Dongbo Min
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.19104
Pdf URL: https://arxiv.org/pdf/2412.19104
Copy Paste: [[2412.19104]] Improving Generative Pre-Training: An In-depth Study of Masked Image Modeling and Denoising Models(https://arxiv.org/abs/2412.19104)
Keywords: restoration, generative
Abstract: In this work, we dive deep into the impact of additive noise in pre-training deep networks. While various methods have attempted to use additive noise inspired by the success of latent denoising diffusion models, when used in combination with masked image modeling, their gains have been marginal when it comes to recognition tasks. We thus investigate why this would be the case, in an attempt to find effective ways to combine the two ideas. Specifically, we find three critical conditions: corruption and restoration must be applied within the encoder, noise must be introduced in the feature space, and an explicit disentanglement between noised and masked tokens is necessary. By implementing these findings, we demonstrate improved pre-training performance for a wide range of recognition tasks, including those that require fine-grained, high-frequency information to solve.

Title: Discrete vs. Continuous Trade-offs for Generative Models

Authors: Jathin Korrapati, Tanish Baranwal, Rahul Shah
Subjects: cs.LG, cs.AI, cs.IT, math.NA
Abstract URL: https://arxiv.org/abs/2412.19114
Pdf URL: https://arxiv.org/pdf/2412.19114
Copy Paste: [[2412.19114]] Discrete vs. Continuous Trade-offs for Generative Models(https://arxiv.org/abs/2412.19114)
Keywords: generation, generative
Abstract: This work explores the theoretical and practical foundations of denoising diffusion probabilistic models (DDPMs) and score-based generative models, which leverage stochastic processes and Brownian motion to model complex data distributions. These models employ forward and reverse diffusion processes defined through stochastic differential equations (SDEs) to iteratively add and remove noise, enabling high-quality data generation. By analyzing the performance bounds of these models, we demonstrate how score estimation errors propagate through the reverse process and bound the total variation distance using discrete Girsanov transformations, Pinsker's inequality, and the data processing inequality (DPI) for an information theoretic lens.
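Two of the tools named in this abstract can be checked numerically on small discrete distributions: Pinsker's inequality, $\mathrm{TV}(P, Q) \le \sqrt{\tfrac{1}{2}\,\mathrm{KL}(P \,\|\, Q)}$ (KL in nats), and the data processing inequality, under which merging outcomes cannot increase the KL divergence. The sketch below is a sanity check on made-up distributions, not the paper's derivation for diffusion models; the function names and the distributions `p` and `q` are assumptions.

```python
import math

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Made-up distributions over three outcomes
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

tv = total_variation(p, q)
kl_fine = kl_divergence(p, q)
pinsker_bound = math.sqrt(0.5 * kl_fine)   # Pinsker: tv <= pinsker_bound

# Data processing: merge the last two outcomes into one (a deterministic channel)
p_coarse = [p[0], p[1] + p[2]]
q_coarse = [q[0], q[1] + q[2]]
kl_coarse = kl_divergence(p_coarse, q_coarse)   # DPI: kl_coarse <= kl_fine
```

In the abstract's setting, the KL term is the one controlled through the Girsanov argument; the toy check only shows how a KL bound turns into a total variation bound.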

Title: Advanced Knowledge Transfer: Refined Feature Distillation for Zero-Shot Quantization in Edge Computing

Authors: Inpyo Hong, Youngwan Jo, Hyojeong Lee, Sunghyun Ahn, Sanghyun Park
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.19125
Pdf URL: https://arxiv.org/pdf/2412.19125
Copy Paste: [[2412.19125]] Advanced Knowledge Transfer: Refined Feature Distillation for Zero-Shot Quantization in Edge Computing(https://arxiv.org/abs/2412.19125)
Keywords: generation, generative
Abstract: We introduce AKT (Advanced Knowledge Transfer), a novel method to enhance the training ability of low-bit quantized (Q) models in the field of zero-shot quantization (ZSQ). Existing research in ZSQ has focused on generating high-quality data from full-precision (FP) models. However, these approaches struggle with reduced learning ability in low-bit quantization due to its limited information capacity. To overcome this limitation, we propose an effective training strategy as an alternative to data generation. In particular, our analysis shows that refining feature maps in the feature distillation process is an effective way to transfer knowledge to the Q model. Based on this analysis, AKT efficiently transfers core information from the FP model to the Q model. AKT is the first approach to utilize both spatial and channel attention information in feature distillation in ZSQ. Our method addresses the fundamental gradient exploding problem in low-bit Q models. Experiments on the CIFAR-10 and CIFAR-100 datasets demonstrate the effectiveness of AKT. Our method leads to significant performance enhancement in existing generative models. Notably, AKT achieves significant accuracy improvements in low-bit Q models, reaching state-of-the-art results in the 3-bit and 5-bit scenarios on CIFAR-10. The code is available at this https URL.

Title: MVS-GS: High-Quality 3D Gaussian Splatting Mapping via Online Multi-View Stereo

Authors: Byeonggwon Lee, Junkyu Park, Khang Truong Giang, Sungho Jo, Soohwan Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.19130
Pdf URL: https://arxiv.org/pdf/2412.19130
Copy Paste: [[2412.19130]] MVS-GS: High-Quality 3D Gaussian Splatting Mapping via Online Multi-View Stereo(https://arxiv.org/abs/2412.19130)
Keywords: generation
Abstract: This study addresses the challenge of online 3D model generation for neural rendering using an RGB image stream. Previous research has tackled this issue by incorporating Neural Radiance Fields (NeRF) or 3D Gaussian Splatting (3DGS) as scene representations within dense SLAM methods. However, most studies focus primarily on estimating coarse 3D scenes rather than achieving detailed reconstructions. Moreover, depth estimation based solely on images is often ambiguous, resulting in low-quality 3D models that lead to inaccurate renderings. To overcome these limitations, we propose a novel framework for high-quality 3DGS modeling that leverages an online multi-view stereo (MVS) approach. Our method estimates MVS depth using sequential frames from a local time window and applies comprehensive depth refinement techniques to filter out outliers, enabling accurate initialization of Gaussians in 3DGS. Furthermore, we introduce a parallelized backend module that optimizes the 3DGS model efficiently, ensuring timely updates with each new keyframe. Experimental results demonstrate that our method outperforms state-of-the-art dense SLAM methods, particularly excelling in challenging outdoor environments.

Title: Generating Editable Head Avatars with 3D Gaussian GANs

Authors: Guohao Li, Hongyu Yang, Yifang Men, Di Huang, Weixin Li, Ruijie Yang, Yunhong Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.19149
Pdf URL: https://arxiv.org/pdf/2412.19149
Copy Paste: [[2412.19149]] Generating Editable Head Avatars with 3D Gaussian GANs(https://arxiv.org/abs/2412.19149)
Keywords: generative
Abstract: Generating animatable and editable 3D head avatars is essential for various applications in computer vision and graphics. Traditional 3D-aware generative adversarial networks (GANs), often using implicit fields like Neural Radiance Fields (NeRF), achieve photorealistic and view-consistent 3D head synthesis. However, these methods face limitations in deformation flexibility and editability, hindering the creation of lifelike and easily modifiable 3D heads. We propose a novel approach that enhances the editability and animation control of 3D head avatars by incorporating 3D Gaussian Splatting (3DGS) as an explicit 3D representation. This method enables easier illumination control and improved editability. Central to our approach is the Editable Gaussian Head (EG-Head) model, which combines a 3D Morphable Model (3DMM) with texture maps, allowing precise expression control and flexible texture editing for accurate animation while preserving identity. To capture complex non-facial geometries like hair, we use an auxiliary set of 3DGS and tri-plane features. Extensive experiments demonstrate that our approach delivers high-quality 3D-aware synthesis with state-of-the-art controllability. Our code and models are available at this https URL.

Title: Referencing Where to Focus: Improving Visual Grounding with Referential Query

Authors: Yabing Wang, Zhuotao Tian, Qingpei Guo, Zheng Qin, Sanping Zhou, Ming Yang, Le Wang
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2412.19155
Pdf URL: https://arxiv.org/pdf/2412.19155
Copy Paste: [[2412.19155]] Referencing Where to Focus: Improving Visual Grounding with Referential Query(https://arxiv.org/abs/2412.19155)
Keywords: generation
Abstract: Visual Grounding aims to localize the referring object in an image given a natural language expression. Recent advancements in DETR-based visual grounding methods have attracted considerable attention, as they directly predict the coordinates of the target object without relying on additional efforts, such as pre-generated proposal candidates or pre-defined anchor boxes. However, existing research primarily focuses on designing a stronger multi-modal decoder, which typically generates learnable queries by random initialization or by using linguistic embeddings. This vanilla query generation approach inevitably increases the learning difficulty for the model, as it does not involve any target-related information at the beginning of decoding. Furthermore, these methods use only the deepest image feature during the query learning process, overlooking the importance of features from other levels. To address these issues, we propose a novel approach called RefFormer. It consists of a query adaption module, which can be seamlessly integrated into CLIP and generates the referential query to provide prior context for the decoder, along with a task-specific decoder. By incorporating the referential query into the decoder, we can effectively mitigate the learning difficulty of the decoder and accurately concentrate on the target object. Additionally, our proposed query adaption module can also act as an adapter, preserving the rich knowledge within CLIP without the need to tune the parameters of the backbone network. Extensive experiments demonstrate the effectiveness and efficiency of our proposed method, which outperforms state-of-the-art approaches on five visual grounding benchmarks.

Title: Learning Cross-Domain Representations for Transferable Drug Perturbations on Single-Cell Transcriptional Responses

Authors: Hui Liu, Shikai Jin
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.19228
Pdf URL: https://arxiv.org/pdf/2412.19228
Copy Paste: [[2412.19228]] Learning Cross-Domain Representations for Transferable Drug Perturbations on Single-Cell Transcriptional Responses(https://arxiv.org/abs/2412.19228)
Keywords: generative
Abstract: Phenotypic drug discovery has attracted widespread attention because of its potential to identify bioactive molecules. Transcriptomic profiling provides a comprehensive reflection of phenotypic changes in cellular responses to external perturbations. In this paper, we propose XTransferCDR, a novel generative framework designed for feature decoupling and transferable representation learning across domains. Given a pair of perturbed expression profiles, our approach decouples the perturbation representations from basal states through domain separation encoders and then cross-transfers them in the latent space. The transferred representations are then used to reconstruct the corresponding perturbed expression profiles via a shared decoder. This cross-transfer constraint effectively promotes the learning of transferable drug perturbation representations. We conducted extensive evaluations of our model on multiple datasets, including single-cell transcriptional responses to drugs and single- and combinatorial genetic perturbations. The experimental results show that XTransferCDR achieved better performance than current state-of-the-art methods, showcasing its potential to advance phenotypic drug discovery.

Title: FineVQ: Fine-Grained User Generated Content Video Quality Assessment

Authors: Huiyu Duan, Qiang Hu, Jiarui Wang, Liu Yang, Zitong Xu, Lu Liu, Xiongkuo Min, Chunlei Cai, Tianxiao Ye, Xiaoyun Zhang, Guangtao Zhai
Subjects: cs.CV, cs.LG, cs.MM, eess.IV
Abstract URL: https://arxiv.org/abs/2412.19238
Pdf URL: https://arxiv.org/pdf/2412.19238
Copy Paste: [[2412.19238]] FineVQ: Fine-Grained User Generated Content Video Quality Assessment(https://arxiv.org/abs/2412.19238)
Keywords: quality assessment
Abstract: The rapid growth of user-generated content (UGC) videos has produced an urgent need for effective video quality assessment (VQA) algorithms to monitor video quality and guide optimization and recommendation procedures. However, current VQA models generally only give an overall rating for a UGC video, which lacks fine-grained labels for serving video processing and recommendation applications. To address the challenges and promote the development of UGC videos, we establish the first large-scale Fine-grained Video quality assessment Database, termed FineVD, which comprises 6104 UGC videos with fine-grained quality scores and descriptions across multiple dimensions. Based on this database, we propose a Fine-grained Video Quality assessment (FineVQ) model to learn the fine-grained quality of UGC videos, with the capabilities of quality rating, quality scoring, and quality attribution. Extensive experimental results demonstrate that our proposed FineVQ can produce fine-grained video-quality results and achieve state-of-the-art performance on FineVD and other commonly used UGC-VQA datasets. Both FineVD and FineVQ will be made publicly available.

Title: PearSAN: A Machine Learning Method for Inverse Design using Pearson Correlated Surrogate Annealing

Authors: Michael Bezick, Blake A. Wilson, Vaishnavi Iyer, Yuheng Chen, Vladimir M. Shalaev, Sabre Kais, Alexander V. Kildishev, Alexandra Boltasseva, Brad Lackey
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.19284
Pdf URL: https://arxiv.org/pdf/2412.19284
Copy Paste: [[2412.19284]] PearSAN: A Machine Learning Method for Inverse Design using Pearson Correlated Surrogate Annealing(https://arxiv.org/abs/2412.19284)
Keywords: generative
Abstract: PearSAN is a machine learning-assisted optimization algorithm applicable to inverse design problems with large design spaces, where traditional optimizers struggle. The algorithm leverages the latent space of a generative model for rapid sampling and employs a Pearson correlated surrogate model to predict the figure of merit of the true design metric. As a showcase example, PearSAN is applied to thermophotovoltaic (TPV) metasurface design by matching the working bands between a thermal radiator and a photovoltaic cell. PearSAN can work with any pretrained generative model with a discretized latent space, making it easy to integrate with VQ-VAEs and binary autoencoders. Its novel Pearson correlational loss can be used as both a latent regularization method, similar to batch and layer normalization, and as a surrogate training loss. We compare both to previous energy matching losses, which are shown to enforce poor regularization and performance, even with upgraded affine parameters. PearSAN achieves a state-of-the-art maximum design efficiency of 97%, and is at least an order of magnitude faster than previous methods, with an improved maximum figure-of-merit gain.
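The abstract names a Pearson correlational loss but does not give its formula. As a rough illustration only, the sketch below uses one common form, one minus the Pearson correlation coefficient between the surrogate's predictions and the true figure of merit; `pearson_loss` and the sample values are assumptions, not PearSAN's implementation.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if std_x == 0.0 or std_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (std_x * std_y)

def pearson_loss(predicted, target):
    """Assumed loss form: 0 when perfectly correlated, 2 when perfectly anti-correlated."""
    return 1.0 - pearson_r(predicted, target)

# Hypothetical surrogate energies versus true figures of merit for four designs.
surrogate_energy = [0.2, 0.5, 0.9, 1.4]
true_merit = [12.0, 18.0, 26.0, 36.0]
print(pearson_loss(surrogate_energy, true_merit))
```

A loss of this kind ignores any affine rescaling of the surrogate's output and only rewards agreement in trend, which is consistent with the abstract's remark that it can act like a normalization step as well as a training loss.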

Title: RAG with Differential Privacy

Authors: Nicolas Grislain
Subjects: cs.LG, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2412.19291
Pdf URL: https://arxiv.org/pdf/2412.19291
Copy Paste: [[2412.19291]] RAG with Differential Privacy(https://arxiv.org/abs/2412.19291)
Keywords: generation
Abstract: Retrieval-Augmented Generation (RAG) has emerged as the dominant technique to provide *Large Language Models* (LLM) with fresh and relevant context, mitigating the risk of hallucinations and improving the overall quality of responses in environments with large and fast moving knowledge bases. However, the integration of external documents into the generation process raises significant privacy concerns. Indeed, when added to a prompt, it is not possible to guarantee a response will not inadvertently expose confidential data, leading to potential breaches of privacy and ethical dilemmas. This paper explores a practical solution to this problem suitable to general knowledge extraction from personal data. It shows *differentially private token generation* is a viable approach to private RAG.

Title: Manga Generation via Layout-controllable Diffusion

Authors: Siyu Chen, Dengjie Li, Zenghao Bao, Yao Zhou, Lingfeng Tan, Yujie Zhong, Zheng Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.19303
Pdf URL: https://arxiv.org/pdf/2412.19303
Copy Paste: [[2412.19303]] Manga Generation via Layout-controllable Diffusion(https://arxiv.org/abs/2412.19303)
Keywords: generation
Abstract: Generating comics through text is widely studied. However, there are few studies on generating multi-panel Manga (Japanese comics) solely based on plain text. Japanese manga contains multiple panels on a single page, with characteristics such as coherence in storytelling, reasonable and diverse page layouts, consistency in characters, and semantic correspondence between panel drawings and panel scripts. Therefore, generating manga poses a significant challenge. This paper presents the manga generation task and constructs the Manga109Story dataset for studying manga generation solely from plain text. Additionally, we propose MangaDiffusion to facilitate the intra-panel and inter-panel information interaction during the manga generation process. The results show that our method is particularly effective at ensuring the intended number of panels and reasonable, diverse page layouts. Based on our approach, there is potential to convert a large number of textual stories into more engaging manga readings, leading to significant application prospects.

Title: MLLM-SUL: Multimodal Large Language Model for Semantic Scene Understanding and Localization in Traffic Scenarios

Authors: Jiaqi Fan, Jianhua Wu, Jincheng Gao, Jianhao Yu, Yafei Wang, Hongqing Chu, Bingzhao Gao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.19406
Pdf URL: https://arxiv.org/pdf/2412.19406
Copy Paste: [[2412.19406]] MLLM-SUL: Multimodal Large Language Model for Semantic Scene Understanding and Localization in Traffic Scenarios(https://arxiv.org/abs/2412.19406)
Keywords: generation
Abstract: Multimodal large language models (MLLMs) have shown satisfactory effects in many autonomous driving tasks. In this paper, MLLMs are utilized to solve joint semantic scene understanding and risk localization tasks, relying only on front-view images. In the proposed MLLM-SUL framework, a dual-branch visual encoder is first designed to extract features from two resolutions, and the resulting rich visual information helps the language model describe risk objects of different sizes accurately. Then, for language generation, the LLaMA model is fine-tuned to predict scene descriptions, which contain the type of driving scenario, the actions of risk objects, and the driving intentions and suggestions of the ego-vehicle. Ultimately, a transformer-based network incorporating a regression token is trained to locate the risk objects. Extensive experiments on the existing DRAMA-ROLISP dataset and the extended DRAMA-SRIS dataset demonstrate that our method is efficient, surpassing many state-of-the-art image-based and video-based methods. Specifically, our method achieves 80.1% BLEU-1 score and 298.5% CIDEr score in the scene understanding task, and 59.6% accuracy in the localization task. Codes and datasets are available at this https URL.
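For readers unfamiliar with the BLEU-1 figure quoted above, the following sketch computes sentence-level BLEU-1 against a single reference: clipped unigram precision multiplied by the brevity penalty. It is a simplified illustration under those assumptions (whitespace tokenization, one reference, no smoothing), not the evaluation script used in the paper; the example sentences are made up.

```python
import math
from collections import Counter

def bleu1(candidate, reference):
    """Sentence-level BLEU-1 with a single reference, returned in the range 0..1."""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    if not cand_tokens:
        return 0.0
    ref_counts = Counter(ref_tokens)
    # Clip each candidate token's count by its count in the reference.
    matches = sum(min(count, ref_counts[token])
                  for token, count in Counter(cand_tokens).items())
    precision = matches / len(cand_tokens)
    # Brevity penalty: candidates no longer than the reference are penalized.
    if len(cand_tokens) > len(ref_tokens):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1.0 - len(ref_tokens) / len(cand_tokens))
    return brevity_penalty * precision

predicted = "the pedestrian is crossing so the ego car should slow down"
ground_truth = "a pedestrian is crossing the road so the ego car should stop"
print(bleu1(predicted, ground_truth))
```

Scores of this kind are commonly reported multiplied by 100, which is why values such as 80.1 appear as percentages.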

Title: MINIMA: Modality Invariant Image Matching

Authors: Xingyu Jiang, Jiangwei Ren, Zizhuo Li, Xin Zhou, Dingkang Liang, Xiang Bai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.19412
Pdf URL: https://arxiv.org/pdf/2412.19412
Copy Paste: [[2412.19412]] MINIMA: Modality Invariant Image Matching(https://arxiv.org/abs/2412.19412)
Keywords: generative
Abstract: Image matching for both cross-view and cross-modality plays a critical role in multimodal perception. In practice, the modality gap caused by different imaging systems/styles poses great challenges to the matching task. Existing works try to extract invariant features for specific modalities and train on limited datasets, showing poor generalization. In this paper, we present MINIMA, a unified image matching framework for multiple cross-modal cases. Without pursuing fancy modules, our MINIMA aims to enhance universal performance from the perspective of data scaling up. For this purpose, we propose a simple yet effective data engine that can freely produce a large dataset containing multiple modalities, rich scenarios, and accurate matching labels. Specifically, we scale up the modalities from cheap but rich RGB-only matching data, by means of generative models. Under this setting, the matching labels and rich diversity of the RGB dataset are well inherited by the generated multimodal data. Benefiting from this, we construct MD-syn, a new comprehensive dataset that fills the data gap for general multimodal image matching. With MD-syn, we can directly train any advanced matching pipeline on randomly selected modality pairs to obtain cross-modal ability. Extensive experiments on in-domain and zero-shot matching tasks, including 19 cross-modal cases, demonstrate that our MINIMA can significantly outperform the baselines and even surpass modality-specific methods. The dataset and code are available at this https URL.

Title: Multi-scale Latent Point Consistency Models for 3D Shape Generation

Authors: Bi'an Du, Wei Hu, Renjie Liao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.19413
Pdf URL: https://arxiv.org/pdf/2412.19413
Copy Paste: [[2412.19413]] Multi-scale Latent Point Consistency Models for 3D Shape Generation(https://arxiv.org/abs/2412.19413)
Keywords: generation
Abstract: Consistency Models (CMs) have significantly accelerated the sampling process in diffusion models, yielding impressive results in synthesizing high-resolution images. To explore and extend these advancements to point-cloud-based 3D shape generation, we propose a novel Multi-scale Latent Point Consistency Model (MLPCM). Our MLPCM follows a latent diffusion framework and introduces hierarchical levels of latent representations, ranging from point-level to super-point levels, each corresponding to a different spatial resolution. We design a multi-scale latent integration module along with 3D spatial attention to effectively denoise the point-level latent representations conditioned on those from multiple super-point levels. Additionally, we propose a latent consistency model, learned through consistency distillation, that compresses the prior into a one-step generator. This significantly improves sampling efficiency while preserving the performance of the original teacher model. Extensive experiments on standard benchmarks ShapeNet and ShapeNet-Vol demonstrate that MLPCM achieves a 100x speedup in the generation process, while surpassing state-of-the-art diffusion models in terms of both shape quality and diversity.

Title: Gx2Mol: De Novo Generation of Hit-like Molecules from Gene Expression Profiles via Deep Learning

Authors: Chen Li, Yuki Matsukiyo, Yoshihiro Yamanishi
Subjects: cs.LG, cs.AI, q-bio.QM
Abstract URL: https://arxiv.org/abs/2412.19422
Pdf URL: https://arxiv.org/pdf/2412.19422
Copy Paste: [[2412.19422]] Gx2Mol: De Novo Generation of Hit-like Molecules from Gene Expression Profiles via Deep Learning(https://arxiv.org/abs/2412.19422)
Keywords: generation, generative
Abstract: De novo generation of hit-like molecules is a challenging task in the drug discovery process. Most methods in previous studies learn the semantics and syntax of molecular structures by analyzing molecular graphs or simplified molecular input line entry system (SMILES) strings; however, they do not take into account the drug responses of the biological systems consisting of genes and proteins. In this study we propose a deep generative model, Gx2Mol, which utilizes gene expression profiles to generate molecular structures with desirable phenotypes for arbitrary target proteins. In the algorithm, a variational autoencoder is employed as a feature extractor to learn the latent feature distribution of the gene expression profiles. Then, a long short-term memory is leveraged as the chemical generator to produce syntactically valid SMILES strings that satisfy the feature conditions of the gene expression profile extracted by the feature extractor. Experimental results and case studies demonstrate that the proposed Gx2Mol model can produce new molecules with potential bioactivities and drug-like properties.

Title: NijiGAN: Transform What You See into Anime with Contrastive Semi-Supervised Learning and Neural Ordinary Differential Equations

Authors: Kevin Putra Santoso, Anny Yuniarti, Dwiyasa Nakula, Dimas Prihady Setyawan, Adam Haidar Azizi, Jeany Aurellia P. Dewati, Farah Dhia Fadhila, Maria T. Elvara Bumbungan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.19455
Pdf URL: https://arxiv.org/pdf/2412.19455
Copy Paste: [[2412.19455]] NijiGAN: Transform What You See into Anime with Contrastive Semi-Supervised Learning and Neural Ordinary Differential Equations(https://arxiv.org/abs/2412.19455)
Keywords: generative
Abstract: Generative AI has transformed the animation industry. Several models have been developed for image-to-image translation, particularly focusing on converting real-world images into anime through unpaired translation. Scenimefy, a notable approach utilizing contrastive learning, achieves high fidelity anime scene translation by addressing limited paired data through semi-supervised training. However, it faces limitations due to its reliance on paired data from a fine-tuned StyleGAN in the anime domain, often producing low-quality datasets. Additionally, Scenimefy's high parameter architecture presents opportunities for computational optimization. This research introduces NijiGAN, a novel model incorporating Neural Ordinary Differential Equations (NeuralODEs), which offer unique advantages in continuous transformation modeling compared to traditional residual networks. NijiGAN successfully transforms real-world scenes into high fidelity anime visuals using half of Scenimefy's parameters. It employs pseudo-paired data generated through Scenimefy for supervised training, eliminating dependence on low-quality paired data and improving the training process. Our comprehensive evaluation includes ablation studies as well as qualitative and quantitative analyses comparing NijiGAN to similar models. The testing results demonstrate that NijiGAN produces higher-quality images than AnimeGAN: its Mean Opinion Score (MOS) of 2.192 surpasses AnimeGAN's MOS of 2.160. Furthermore, our model achieved a Frechet Inception Distance (FID) score of 58.71, outperforming Scenimefy's FID score of 60.32. These results demonstrate that NijiGAN achieves competitive performance against existing state-of-the-art methods, especially Scenimefy as the baseline model.
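The contrast drawn above between NeuralODEs and residual networks rests on a well-known observation: a residual block `h + f(h)` is a single explicit Euler step of the ODE `dh/dt = f(h, t)`, while a NeuralODE integrates the same dynamics over a continuous interval. The sketch below illustrates this with a scalar state and a fixed, hypothetical dynamics function in place of a learned network; it is not NijiGAN's architecture.

```python
import math

def dynamics(h, t):
    """Hypothetical stand-in for a learned layer f(h, t); here simple linear decay."""
    return -0.5 * h

def residual_block(h, t, step=1.0):
    """A residual update h + step * f(h, t): one explicit Euler step."""
    return h + step * dynamics(h, t)

def ode_block_euler(h0, t0, t1, num_steps):
    """Integrate dh/dt = f(h, t) from t0 to t1 with num_steps Euler steps."""
    h, t = h0, t0
    dt = (t1 - t0) / num_steps
    for _ in range(num_steps):
        h = h + dt * dynamics(h, t)
        t += dt
    return h

one_step = ode_block_euler(1.0, 0.0, 1.0, num_steps=1)    # same as one residual block
fine_grid = ode_block_euler(1.0, 0.0, 1.0, num_steps=1000)  # approaches the exact flow
print(one_step, fine_grid, math.exp(-0.5))
```

In this toy setting the same function `dynamics` is reused at every step, which hints at why an ODE-based block can cover the depth of several residual blocks with fewer parameters, though the parameter savings reported in the abstract depend on the actual model design.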

Title: Focusing Image Generation to Mitigate Spurious Correlations

Authors: Xuewei Li, Zhenzhen Nie, Mei Yu, Zijian Zhang, Jie Gao, Tianyi Xu, Zhiqiang Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.19457
Pdf URL: https://arxiv.org/pdf/2412.19457
Copy Paste: [[2412.19457]] Focusing Image Generation to Mitigate Spurious Correlations(https://arxiv.org/abs/2412.19457)
Keywords: generation
Abstract: Instance features in images exhibit spurious correlations with background features, affecting the training process of deep neural classifiers. This leads to insufficient attention to instance features by the classifier, resulting in erroneous classification outcomes. In this paper, we propose a data augmentation method called Spurious Correlations Guided Synthesis (SCGS) that mitigates spurious correlations through an image generation model. This approach does not require expensive spurious attribute (group) labels for the training data and can be widely applied to other debiasing methods. Specifically, SCGS first identifies the incorrect attention regions of a pre-trained classifier on the training images, and then uses an image generation model to generate new training data based on these incorrectly attended regions. SCGS increases the diversity and scale of the dataset to reduce the impact of spurious correlations on classifiers. Changes in the classifier's attention regions and experimental results on three different domain datasets demonstrate that this method is effective in reducing the classifier's reliance on spurious correlations.

Title: Generative Adversarial Network on Motion-Blur Image Restoration

Authors: Zhengdong Li
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2412.19479
Pdf URL: https://arxiv.org/pdf/2412.19479
Copy Paste: [[2412.19479]] Generative Adversarial Network on Motion-Blur Image Restoration(https://arxiv.org/abs/2412.19479)
Keywords: restoration, generative
Abstract: In everyday life, photographs taken with a camera often suffer from motion blur due to hand vibrations or sudden movements. This phenomenon can significantly detract from the quality of the images captured, making it an interesting challenge to develop a deep learning model that utilizes the principles of adversarial networks to restore clarity to these blurred pixels. In this project, we focus on leveraging Generative Adversarial Networks (GANs) to effectively deblur images affected by motion blur. A GAN-based TensorFlow model is defined, trained, and evaluated on the GoPro dataset, which comprises paired street view images featuring both clear and blurred versions. This adversarial training process between the Discriminator and the Generator helps to produce increasingly realistic images over time. Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM) are the two evaluation metrics used to provide quantitative measures of image quality, allowing us to evaluate the effectiveness of the deblurring process. A mean PSNR of 29.1644 and a mean SSIM of 0.7459, with an average deblurring time of 4.6921 seconds, are achieved in this project. The blurry pixels are sharper in the output of the GAN model, which shows a good image restoration effect in real-world applications.
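As a reminder of what the reported PSNR measures, here is a minimal sketch of the standard definition, `10 * log10(MAX^2 / MSE)`, for 8-bit grayscale images stored as nested lists. The tiny sample images are invented for illustration and are unrelated to the GoPro dataset; a real evaluation would operate on full image tensors.

```python
import math

def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized grayscale images."""
    ref_pixels = [p for row in reference for p in row]
    out_pixels = [p for row in restored for p in row]
    if not ref_pixels or len(ref_pixels) != len(out_pixels):
        raise ValueError("images must be non-empty and have the same size")
    mse = sum((a - b) ** 2 for a, b in zip(ref_pixels, out_pixels)) / len(ref_pixels)
    if mse == 0:
        return math.inf  # identical images: no noise at all
    return 10.0 * math.log10(max_value ** 2 / mse)

# Invented 3x3 patches: a sharp ground-truth patch and a deblurred estimate.
sharp_patch = [[52, 55, 61], [59, 79, 61], [76, 62, 59]]
deblurred_patch = [[54, 55, 60], [59, 75, 61], [76, 65, 59]]
print(round(psnr(sharp_patch, deblurred_patch), 2))
```

Higher values mean the restored image is closer to the reference pixel by pixel; SSIM complements this by comparing local structure rather than raw pixel error.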

Title: DrivingWorld: Constructing World Model for Autonomous Driving via Video GPT

Authors: Xiaotao Hu, Wei Yin, Mingkai Jia, Junyuan Deng, Xiaoyang Guo, Qian Zhang, Xiaoxiao Long, Ping Tan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.19505
Pdf URL: https://arxiv.org/pdf/2412.19505
Copy Paste: [[2412.19505]] DrivingWorld: Constructing World Model for Autonomous Driving via Video GPT(https://arxiv.org/abs/2412.19505)
Keywords: generation
Abstract: Recent successes in autoregressive (AR) generation models, such as the GPT series in natural language processing, have motivated efforts to replicate this success in visual tasks. Some works attempt to extend this approach to autonomous driving by building video-based world models capable of generating realistic future video sequences and predicting ego states. However, prior works tend to produce unsatisfactory results, as the classic GPT framework is designed to handle 1D contextual information, such as text, and lacks the inherent ability to model the spatial and temporal dynamics essential for video generation. In this paper, we present DrivingWorld, a GPT-style world model for autonomous driving, featuring several spatial-temporal fusion mechanisms. This design enables effective modeling of both spatial and temporal dynamics, facilitating high-fidelity, long-duration video generation. Specifically, we propose a next-state prediction strategy to model temporal coherence between consecutive frames and apply a next-token prediction strategy to capture spatial information within each frame. To further enhance generalization ability, we propose a novel masking strategy and reweighting strategy for token prediction to mitigate long-term drifting issues and enable precise control. Our work demonstrates the ability to produce high-fidelity and consistent video clips of over 40 seconds in duration, which is over 2 times longer than state-of-the-art driving world models. Experiments show that, in contrast to prior works, our method achieves superior visual quality and significantly more accurate controllable future video generation. Our code is available at this https URL.

Title: Estimation of System Parameters Including Repeated Cross-Sectional Data through Emulator-Informed Deep Generative Model

Authors: Hyunwoo Cho, Sung Woong Cho, Hyeontae Jo, Hyung Ju Hwang
Subjects: cs.LG, cs.AI, math.NA, q-bio.PE, stat.ML
Abstract URL: https://arxiv.org/abs/2412.19517
Pdf URL: https://arxiv.org/pdf/2412.19517
Copy Paste: [[2412.19517]] Estimation of System Parameters Including Repeated Cross-Sectional Data through Emulator-Informed Deep Generative Model(https://arxiv.org/abs/2412.19517)
Keywords: generative
Abstract: Differential equations (DEs) are crucial for modeling the evolution of natural or engineered systems. Traditionally, the parameters in DEs are adjusted to fit data from system observations. However, in fields such as politics, economics, and biology, available data are often independently collected at distinct time points from different subjects (i.e., repeated cross-sectional (RCS) data). Conventional optimization techniques struggle to accurately estimate DE parameters when RCS data exhibit various heterogeneities, leading to a significant loss of information. To address this issue, we propose a new estimation method called the emulator-informed deep-generative model (EIDGM), designed to handle RCS data. Specifically, EIDGM integrates a physics-informed neural network-based emulator that immediately generates DE solutions and a Wasserstein generative adversarial network-based parameter generator that can effectively mimic the RCS data. We evaluated EIDGM on exponential growth, logistic population models, and the Lorenz system, demonstrating its superior ability to accurately capture parameter distributions. Additionally, we applied EIDGM to an experimental dataset of Amyloid beta 40 and beta 42, successfully capturing diverse parameter distribution shapes. This shows that EIDGM can be applied to model a wide range of systems and extended to uncover the operating principles of systems based on limited data.
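To make the notion of repeated cross-sectional (RCS) data concrete, here is a toy sketch for the exponential growth model y(t) = y0 * exp(r * t), one of the systems named in the abstract. It is not EIDGM: every name and value (`simulate_rcs`, `recover_rates`, the Gaussian rate distribution, the known initial value `y0`) is an assumption for illustration. Direct inversion works here only because the toy model has a closed-form solution and a known initial value.

```python
import math
import random

def simulate_rcs(time_points, subjects_per_time, y0=1.0,
                 r_mean=0.5, r_std=0.1, seed=0):
    """Simulate RCS data: each time point is observed on a fresh set of subjects.

    Every subject has its own growth rate r, drawn from an assumed Gaussian
    distribution, and is measured exactly once.
    """
    rng = random.Random(seed)
    data = {}
    for t in time_points:
        data[t] = [y0 * math.exp(rng.gauss(r_mean, r_std) * t)
                   for _ in range(subjects_per_time)]
    return data

def recover_rates(data, y0=1.0):
    """Invert y = y0 * exp(r * t) per observation to get an empirical rate sample."""
    return [math.log(y / y0) / t for t, ys in data.items() for y in ys]

observations = simulate_rcs([1.0, 2.0, 3.0], subjects_per_time=500)
rates = recover_rates(observations)
mean_rate = sum(rates) / len(rates)
print(f"recovered mean growth rate: {mean_rate:.3f}")
```

Because no subject is followed over time, fitting a single trajectory would discard the spread of `r` across subjects. The abstract's point is that the target of estimation is a parameter distribution rather than one parameter value.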

Title: Is Your Text-to-Image Model Robust to Caption Noise?

Authors: Weichen Yu, Ziyan Yang, Shanchuan Lin, Qi Zhao, Jianyi Wang, Liangke Gui, Matt Fredrikson, Lu Jiang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.19531
Pdf URL: https://arxiv.org/pdf/2412.19531
Copy Paste: [[2412.19531]] Is Your Text-to-Image Model Robust to Caption Noise?(https://arxiv.org/abs/2412.19531)
Keywords: generation
Abstract: In text-to-image (T2I) generation, a prevalent training technique involves utilizing Vision Language Models (VLMs) for image re-captioning. Even though VLMs are known to exhibit hallucination, generating descriptive content that deviates from the visual reality, the ramifications of such caption hallucinations on T2I generation performance remain under-explored. Through our empirical investigation, we first establish a comprehensive dataset comprising VLM-generated captions, and then systematically analyze how caption hallucination influences generation outcomes. Our findings reveal that (1) disparities in caption quality persistently impact model outputs during fine-tuning; (2) VLM confidence scores serve as reliable indicators for detecting and characterizing noise-related patterns in the data distribution; and (3) even subtle variations in caption fidelity have significant effects on the quality of learned representations. These findings collectively emphasize the profound impact of caption quality on model performance and highlight the need for more sophisticated, robust training algorithms in T2I. In response to these observations, we propose an approach that leverages VLM confidence scores to mitigate caption noise, thereby enhancing the robustness of T2I models against hallucination in captions.

Title: P3S-Diffusion: A Selective Subject-driven Generation Framework via Point Supervision

Authors: Junjie Hu (1), Shuyong Gao (1), Lingyi Hong (1), Qishan Wang (1), Yuzhou Zhao (1), Yan Wang (1), Wenqiang Zhang (1) ((1) Fudan University)
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.19533
Pdf URL: https://arxiv.org/pdf/2412.19533
Copy Paste: [[2412.19533]] P3S-Diffusion: A Selective Subject-driven Generation Framework via Point Supervision(https://arxiv.org/abs/2412.19533)
Keywords: generation
Abstract: Recent research in subject-driven generation increasingly emphasizes the importance of selective subject features. Nevertheless, accurately selecting the content in a given reference image still poses challenges, especially when selecting among similar subjects in an image (e.g., two different dogs). Some methods attempt to use text prompts or pixel masks to isolate specific elements. However, text prompts often fall short in precisely describing specific content, and pixel masks are often expensive. To address this, we introduce P3S-Diffusion, a novel architecture designed for context-selected subject-driven generation via point supervision. P3S-Diffusion leverages minimal-cost labels (e.g., points) to generate subject-driven images. During fine-tuning, it can generate an expanded base mask from these points, obviating the need for additional segmentation models. The mask is employed for inpainting and for aligning with the subject representation. P3S-Diffusion preserves fine features of the subjects through Multi-layers Condition Injection. Training is enhanced by the Attention Consistency Loss, and extensive experiments demonstrate the framework's excellent feature preservation and image generation capabilities.

Title: Diverse Rare Sample Generation with Pretrained GANs

Authors: Subeen Lee, Jiyeon Han, Soyeon Kim, Jaesik Choi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.19543
Pdf URL: https://arxiv.org/pdf/2412.19543
Copy Paste: [[2412.19543]] Diverse Rare Sample Generation with Pretrained GANs(https://arxiv.org/abs/2412.19543)
Keywords: generation, generative
Abstract: Deep generative models are proficient in generating realistic data but struggle with producing rare samples in low-density regions due to their scarcity in training datasets and the mode collapse problem. While recent methods aim to improve the fidelity of generated samples, they often reduce diversity and coverage by ignoring rare and novel samples. This study proposes a novel approach for generating diverse rare samples from high-resolution image datasets with pretrained GANs. Our method employs gradient-based optimization of latent vectors within a multi-objective framework and utilizes normalizing flows for density estimation on the feature space. This enables the generation of diverse rare images, with controllable parameters for rarity, diversity, and similarity to a reference image. We demonstrate the effectiveness of our approach both qualitatively and quantitatively across various datasets and GANs without retraining or fine-tuning the pretrained GANs.

Title: Structural Similarity in Deep Features: Image Quality Assessment Robust to Geometrically Disparate Reference

Authors: Keke Zhang, Weiling Chen, Tiesong Zhao, Zhou Wang
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2412.19553
Pdf URL: https://arxiv.org/pdf/2412.19553
Copy Paste: [[2412.19553]] Structural Similarity in Deep Features: Image Quality Assessment Robust to Geometrically Disparate Reference(https://arxiv.org/abs/2412.19553)
Keywords: restoration, super-resolution, quality assessment
Abstract: Image Quality Assessment (IQA) with references plays an important role in optimizing and evaluating computer vision tasks. Traditional methods assume that all pixels of the reference and test images are fully aligned. Such Aligned-Reference IQA (AR-IQA) approaches fail to address many real-world problems with various geometric deformations between the two images. Although significant effort has been made to attack the Geometrically-Disparate-Reference IQA (GDR-IQA) problem, it has been addressed in a task-dependent fashion, for example, by dedicated designs for image super-resolution and retargeting, or by assuming the geometric distortions to be small enough to be countered by translation-robust filters or by explicit image registration. Here we rethink this problem and propose a unified, non-training-based Deep Structural Similarity (DeepSSIM) approach to address the above problems in a single framework, which assesses the structural similarity of deep features in a simple but efficient way and uses an attention calibration strategy to alleviate attention deviation. The proposed method, without application-specific design, achieves state-of-the-art performance on AR-IQA datasets and meanwhile shows strong robustness to various GDR-IQA test cases. Interestingly, our test also shows the effectiveness of DeepSSIM as an optimization tool for training image super-resolution, enhancement and restoration, implying an even wider generalizability. Source code will be made public after the review is completed.

Title: ReNeg: Learning Negative Embedding with Reward Guidance

Authors: Xiaomin Li, Yixuan Liu, Takashi Isobe, Xu Jia, Qinpeng Cui, Dong Zhou, Dong Li, You He, Huchuan Lu, Zhongdao Wang, Emad Barsoum
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.19637
Pdf URL: https://arxiv.org/pdf/2412.19637
Copy Paste: [[2412.19637]] ReNeg: Learning Negative Embedding with Reward Guidance(https://arxiv.org/abs/2412.19637)
Keywords: generation
Abstract: In text-to-image (T2I) generation applications, negative embeddings have proven to be a simple yet effective approach for enhancing generation quality. Typically, these negative embeddings are derived from user-defined negative prompts, which, while being functional, are not necessarily optimal. In this paper, we introduce ReNeg, an end-to-end method designed to learn improved Negative embeddings guided by a Reward model. We employ a reward feedback learning framework and integrate classifier-free guidance (CFG) into the training process, which was previously utilized only during inference, thus enabling the effective learning of negative embeddings. We also propose two strategies for learning both global and per-sample negative embeddings. Extensive experiments show that the learned negative embedding significantly outperforms null-text and handcrafted counterparts, achieving substantial improvements in human preference alignment. Additionally, the negative embedding learned within the same text embedding space exhibits strong generalization capabilities. For example, using the same CLIP text encoder, the negative embedding learned on SD1.5 can be seamlessly transferred to text-to-image or even text-to-video models such as ControlNet, ZeroScope, and VideoCrafter2, resulting in consistent performance improvements across the board.
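For orientation, the sketch below shows the standard classifier-free guidance (CFG) combination in which a negative embedding takes the place of the unconditional branch. It illustrates the general mechanism only, not ReNeg's training procedure. The function name `cfg_combine` and the plain-list representation of noise predictions are assumptions for illustration.

```python
def cfg_combine(eps_cond, eps_neg, guidance_scale):
    """Combine two noise predictions with classifier-free guidance.

    eps_cond: prediction conditioned on the user's (positive) prompt.
    eps_neg:  prediction conditioned on the negative embedding, which
              replaces the usual null-text (unconditional) branch.
    The result moves away from eps_neg and toward eps_cond.
    """
    if len(eps_cond) != len(eps_neg):
        raise ValueError("predictions must have the same length")
    return [n + guidance_scale * (c - n) for c, n in zip(eps_cond, eps_neg)]

# Hypothetical two-element "noise predictions".
positive = [1.0, 2.0]
negative = [0.0, 1.0]
print(cfg_combine(positive, negative, guidance_scale=7.5))
```

With a guidance scale of 1 the output equals the positive prediction, and larger scales push further away from whatever the negative embedding encodes. This is why the choice of negative embedding affects output quality.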

Title: VideoMaker: Zero-shot Customized Video Generation with the Inherent Force of Video Diffusion Models

Authors: Tao Wu, Yong Zhang, Xiaodong Cun, Zhongang Qi, Junfu Pu, Huanzhang Dou, Guangcong Zheng, Ying Shan, Xi Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.19645
Pdf URL: https://arxiv.org/pdf/2412.19645
Copy Paste: [[2412.19645]] VideoMaker: Zero-shot Customized Video Generation with the Inherent Force of Video Diffusion Models(https://arxiv.org/abs/2412.19645)
Keywords: generation
Abstract: Zero-shot customized video generation has gained significant attention due to its substantial application potential. Existing methods rely on additional models to extract and inject reference subject features, assuming that the Video Diffusion Model (VDM) alone is insufficient for zero-shot customized video generation. However, these methods often struggle to maintain consistent subject appearance due to suboptimal feature extraction and injection techniques. In this paper, we reveal that VDM inherently possesses the force to extract and inject subject features. Departing from previous heuristic approaches, we introduce a novel framework that leverages VDM's inherent force to enable high-quality zero-shot customized video generation. Specifically, for feature extraction, we directly input reference images into VDM and use its intrinsic feature extraction process, which not only provides fine-grained features but also significantly aligns with VDM's pre-trained knowledge. For feature injection, we devise an innovative bidirectional interaction between subject features and generated content through spatial self-attention within VDM, ensuring that VDM has better subject fidelity while maintaining the diversity of the generated video. Experiments on both customized human and object video generation validate the effectiveness of our framework.

Title: From Elements to Design: A Layered Approach for Automatic Graphic Design Composition

Authors: Jiawei Lin, Shizhao Sun, Danqing Huang, Ting Liu, Ji Li, Jiang Bian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.19712
Pdf URL: https://arxiv.org/pdf/2412.19712
Copy Paste: [[2412.19712]] From Elements to Design: A Layered Approach for Automatic Graphic Design Composition(https://arxiv.org/abs/2412.19712)
Keywords: generation, generative
Abstract: In this work, we investigate automatic design composition from multimodal graphic elements. Although recent studies have developed various generative models for graphic design, they usually face the following limitations: they only focus on certain subtasks and are far from achieving the design composition task; they do not consider the hierarchical information of graphic designs during the generation process. To tackle these issues, we introduce the layered design principle into Large Multimodal Models (LMMs) and propose a novel approach, called LaDeCo, to accomplish this challenging task. Specifically, LaDeCo first performs layer planning for a given element set, dividing the input elements into different semantic layers according to their contents. Based on the planning results, it subsequently predicts element attributes that control the design composition in a layer-wise manner, and includes the rendered image of previously generated layers in the context. With this insightful design, LaDeCo decomposes the difficult task into smaller manageable steps, making the generation process smoother and clearer. The experimental results demonstrate the effectiveness of LaDeCo in design composition. Furthermore, we show that LaDeCo enables some interesting applications in graphic design, such as resolution adjustment, element filling, and design variation. In addition, it even outperforms the specialized models in some design subtasks without any task-specific training.

Title: Generative Pretrained Embedding and Hierarchical Irregular Time Series Representation for Daily Living Activity Recognition

Authors: Damien Bouchabou, Sao Mai Nguyen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.19732
Pdf URL: https://arxiv.org/pdf/2412.19732
Copy Paste: [[2412.19732]] Generative Pretrained Embedding and Hierarchical Irregular Time Series Representation for Daily Living Activity Recognition(https://arxiv.org/abs/2412.19732)
Keywords: generative
Abstract: Within the evolving landscape of smart homes, the precise recognition of daily living activities using ambient sensor data stands paramount. This paper not only aims to bolster existing algorithms by evaluating two distinct pretrained embeddings suited for ambient sensor activations but also introduces a novel hierarchical architecture. We delve into an architecture anchored on Transformer Decoder-based pre-trained embeddings, reminiscent of the GPT design, and contrast it with the previously established state-of-the-art (SOTA) ELMo embeddings for ambient sensors. Our proposed hierarchical structure leverages the strengths of each pre-trained embedding, enabling the discernment of activity dependencies and sequence order, thereby enhancing classification precision. To further refine recognition, we incorporate into our proposed architecture an hour-of-the-day embedding. Empirical evaluations underscore the preeminence of the Transformer Decoder embedding in classification endeavors. Additionally, our innovative hierarchical design significantly bolsters the efficacy of both pre-trained embeddings, notably in capturing inter-activity nuances. The integration of temporal aspects subtly but distinctively augments classification, especially for time-sensitive activities. In conclusion, our GPT-inspired hierarchical approach, infused with temporal insights, outshines the SOTA ELMo benchmark.

Title: Generative Video Propagation

Authors: Shaoteng Liu, Tianyu Wang, Jui-Hsien Wang, Qing Liu, Zhifei Zhang, Joon-Young Lee, Yijun Li, Bei Yu, Zhe Lin, Soo Ye Kim, Jiaya Jia
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.19761
Pdf URL: https://arxiv.org/pdf/2412.19761
Copy Paste: [[2412.19761]] Generative Video Propagation(https://arxiv.org/abs/2412.19761)
Keywords: generation, generative
Abstract: Large-scale video generation models have the inherent ability to realistically model natural scenes. In this paper, we demonstrate that through a careful design of a generative video propagation framework, various video tasks can be addressed in a unified way by leveraging the generative power of such models. Specifically, our framework, GenProp, encodes the original video with a selective content encoder and propagates the changes made to the first frame using an image-to-video generation model. We propose a data generation scheme to cover multiple video tasks based on instance-level video segmentation datasets. Our model is trained by incorporating a mask prediction decoder head and optimizing a region-aware loss to aid the encoder to preserve the original content while the generation model propagates the modified region. This novel design opens up new possibilities: In editing scenarios, GenProp allows substantial changes to an object's shape; for insertion, the inserted objects can exhibit independent motion; for removal, GenProp effectively removes effects like shadows and reflections from the whole video; for tracking, GenProp is capable of tracking objects and their associated effects together. Experiment results demonstrate the leading performance of our model in various video tasks, and we further provide in-depth analyses of the proposed framework.

Title: Tensor Network Estimation of Distribution Algorithms

Authors: John Gardiner, Javier Lopez-Piqueres
Subjects: cs.LG, quant-ph
Abstract URL: https://arxiv.org/abs/2412.19780
Pdf URL: https://arxiv.org/pdf/2412.19780
Copy Paste: [[2412.19780]] Tensor Network Estimation of Distribution Algorithms(https://arxiv.org/abs/2412.19780)
Keywords: generative
Abstract: Tensor networks are a tool first employed in the context of many-body quantum physics that now have a wide range of uses across the computational sciences, from numerical methods to machine learning. Methods integrating tensor networks into evolutionary optimization algorithms have appeared in the recent literature. In essence, these methods can be understood as replacing the traditional crossover operation of a genetic algorithm with a tensor network-based generative model. We investigate these methods from the point of view that they are Estimation of Distribution Algorithms (EDAs). We find that optimization performance of these methods is not related to the power of the generative model in a straightforward way. Generative models that are better (in the sense that they better model the distribution from which their training data is drawn) do not necessarily result in better performance of the optimization algorithm they form a part of. This raises the question of how best to incorporate powerful generative models into optimization routines. In light of this we find that adding an explicit mutation operator to the output of the generative model often improves optimization performance.
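The loop described in this abstract (fit a generative model to the best candidates, sample from it, optionally mutate the samples) can be sketched with a much simpler model. The example below is an assumption-laden stand-in: it uses independent per-bit probabilities in place of a tensor network, OneMax (count of 1 bits) as the objective, and hypothetical parameter names throughout.

```python
import random

def eda_onemax(n_bits=20, pop_size=60, n_select=20, generations=40,
               mutation_rate=0.02, seed=0):
    """Estimation of Distribution Algorithm with an explicit mutation step.

    The "generative model" is a vector of independent bit probabilities,
    a deliberately weak substitute for the tensor network in the paper.
    """
    rng = random.Random(seed)
    probs = [0.5] * n_bits          # initial model: uniform over bit strings
    best = None
    for _ in range(generations):
        # 1. Sample a population from the current model.
        population = [[1 if rng.random() < p else 0 for p in probs]
                      for _ in range(pop_size)]
        # 2. Explicit mutation applied to the model's output.
        for individual in population:
            for i in range(n_bits):
                if rng.random() < mutation_rate:
                    individual[i] ^= 1
        # 3. Select the fittest candidates (fitness = number of 1 bits).
        population.sort(key=sum, reverse=True)
        elite = population[:n_select]
        if best is None or sum(elite[0]) > sum(best):
            best = elite[0][:]
        # 4. Refit the model to the selected candidates.
        probs = [sum(ind[i] for ind in elite) / n_select for i in range(n_bits)]
    return best

solution = eda_onemax()
print(sum(solution), "ones out of", len(solution))
```

In this toy setting the mutation step can reintroduce bit values whose probability has collapsed to 0 or 1 in the fitted model. This is one plausible reading of why an explicit mutation operator helps even when the model fits its training data well.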

Title: InfAlign: Inference-aware language model alignment

Authors: Ananth Balashankar, Ziteng Sun, Jonathan Berant, Jacob Eisenstein, Michael Collins, Adrian Hutter, Jong Lee, Chirag Nagpal, Flavien Prost, Aradhana Sinha, Ananda Theertha Suresh, Ahmad Beirami
Subjects: cs.LG, cs.CL, cs.IT
Abstract URL: https://arxiv.org/abs/2412.19792
Pdf URL: https://arxiv.org/pdf/2412.19792
Copy Paste: [[2412.19792]] InfAlign: Inference-aware language model alignment(https://arxiv.org/abs/2412.19792)
Keywords: generative
Abstract: Language model alignment has become a critical step in training modern generative language models. The goal of alignment is to finetune a reference model such that the win rate of a sample from the aligned model over a sample from the reference model is high, subject to a KL divergence constraint. Today, we are increasingly using inference-time algorithms (e.g., Best-of-N, controlled decoding, tree search) to decode from language models rather than standard sampling. However, the alignment objective does not capture such inference-time decoding procedures. We show that the existing alignment framework is sub-optimal in view of such inference-time methods. We then modify the alignment objective and propose a framework for inference-aware alignment (IAPO). We prove that for any inference-time decoding algorithm, the optimal solution that optimizes the inference-time win rate of the aligned policy against the reference policy is the solution to the typical RLHF problem with a transformation of the reward. This motivates us to provide the KL-regularized calibrate-and-transform RL (CTRL) algorithm to solve this problem, which involves a reward calibration step and a KL-regularized reward maximization step with a transformation of the calibrated reward. We particularize our study to two important inference-time strategies: best-of-N sampling and best-of-N jailbreaking, where N responses are sampled from the model and the one with the highest or lowest reward is selected. We propose specific transformations for these strategies and demonstrate that our framework offers significant improvements over existing state-of-the-art methods for language model alignment. Empirically, we outperform baselines that are designed without taking inference-time decoding into consideration by 8-12% and 4-9% on inference-time win rates over the Anthropic helpfulness and harmlessness dialog benchmark datasets.
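The two inference-time strategies singled out in the abstract share one selection rule: draw N responses and keep the one with the highest reward (best-of-N sampling) or the lowest reward (best-of-N jailbreaking). The sketch below shows only that selection step. The sampler and reward function are hypothetical stand-ins for a language model and a reward model, and nothing here implements the paper's alignment algorithm.

```python
import random

def best_of_n(sample_fn, reward_fn, n, pick_lowest=False):
    """Draw n candidate responses and return the one with extreme reward.

    pick_lowest=False -> best-of-N sampling (highest reward wins).
    pick_lowest=True  -> the adversarial variant (lowest reward wins).
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [sample_fn() for _ in range(n)]
    chooser = min if pick_lowest else max
    return chooser(candidates, key=reward_fn)

# Hypothetical stand-ins: "responses" are numbers, reward peaks at 1.0.
rng = random.Random(0)

def toy_sampler():
    return rng.gauss(0.0, 1.0)

def toy_reward(response):
    return -abs(response - 1.0)

print("highest-reward pick:", best_of_n(toy_sampler, toy_reward, n=8))
print("lowest-reward pick:", best_of_n(toy_sampler, toy_reward, n=8, pick_lowest=True))
```

Because the selected response is the extreme of N draws rather than a single draw, its reward distribution differs from that of ordinary sampling. This is the mismatch between training objective and decoding procedure that the abstract sets out to address.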