2025-01-22

Title: Towards General Purpose Robots at Scale: Lifelong Learning and Learning to Use Memory

Authors: William Yue
Subjects: cs.LG, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2501.10395
Pdf URL: https://arxiv.org/pdf/2501.10395
Copy Paste: [[2501.10395]] Towards General Purpose Robots at Scale: Lifelong Learning and Learning to Use Memory(https://arxiv.org/abs/2501.10395)
Keywords: generative
Abstract: The widespread success of artificial intelligence in fields like natural language processing and computer vision has not yet fully transferred to robotics, where progress is hindered by the lack of large-scale training data and the complexity of real-world tasks. To address this, many robot learning researchers are pushing to get robots deployed at scale in everyday unstructured environments like our homes to initiate a data flywheel. While current robot learning systems are effective for certain short-horizon tasks, they are not designed to autonomously operate over long time horizons in unstructured environments. This thesis focuses on addressing two key challenges for robots operating over long time horizons: memory and lifelong learning. We propose two novel methods to advance these capabilities. First, we introduce t-DGR, a trajectory-based deep generative replay method that achieves state-of-the-art performance on Continual World benchmarks, advancing lifelong learning. Second, we develop a framework that leverages human demonstrations to teach agents effective memory utilization, improving learning efficiency and success rates on Memory Gym tasks. Finally, we discuss future directions for achieving the lifelong learning and memory capabilities necessary for robots to function at scale in real-world settings.

Title: BloomScene: Lightweight Structured 3D Gaussian Splatting for Crossmodal Scene Generation

Authors: Xiaolu Hou, Mingcheng Li, Dingkang Yang, Jiawei Chen, Ziyun Qian, Xiao Zhao, Yue Jiang, Jinjie Wei, Qingyao Xu, Lihua Zhang
Subjects: cs.CV, cs.AI, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2501.10462
Pdf URL: https://arxiv.org/pdf/2501.10462
Copy Paste: [[2501.10462]] BloomScene: Lightweight Structured 3D Gaussian Splatting for Crossmodal Scene Generation(https://arxiv.org/abs/2501.10462)
Keywords: generation
Abstract: With the widespread use of virtual reality applications, 3D scene generation has become a challenging new research frontier. 3D scenes have highly complex structures, and a generation method must ensure that its output is dense, coherent, and contains all necessary structures. Many current 3D scene generation methods rely on pre-trained text-to-image diffusion models and monocular depth estimators. However, the generated scenes occupy large amounts of storage space and often lack effective regularisation methods, leading to geometric distortions. To this end, we propose BloomScene, a lightweight structured 3D Gaussian splatting method for crossmodal scene generation, which creates diverse and high-quality 3D scenes from text or image inputs. Specifically, a crossmodal progressive scene generation framework is proposed to generate coherent scenes utilizing incremental point cloud reconstruction and 3D Gaussian splatting. Additionally, we propose a hierarchical depth prior-based regularization mechanism that utilizes multi-level constraints on depth accuracy and smoothness to enhance the realism and continuity of the generated scenes. Finally, we propose a structured context-guided compression mechanism that exploits structured hash grids to model the context of unorganized anchor attributes, which significantly reduces structural redundancy and storage overhead. Comprehensive experiments across multiple scenes demonstrate the significant potential and advantages of our framework compared with several baselines.

Title: 4bit-Quantization in Vector-Embedding for RAG

Authors: Taehee Jeong
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2501.10534
Pdf URL: https://arxiv.org/pdf/2501.10534
Copy Paste: [[2501.10534]] 4bit-Quantization in Vector-Embedding for RAG(https://arxiv.org/abs/2501.10534)
Keywords: generation
Abstract: Retrieval-augmented generation (RAG) is a promising technique that has shown great potential in addressing some of the limitations of large language models (LLMs). LLMs have two major limitations: they can contain outdated information due to their training data, and they can generate factually inaccurate responses, a phenomenon known as hallucinations. RAG aims to mitigate these issues by leveraging a database of relevant documents, which are stored as embedding vectors in a high-dimensional space. However, one of the challenges of using high-dimensional embeddings is that they require a significant amount of memory to store. This can be a major issue, especially when dealing with large databases of documents. To alleviate this problem, we propose the use of 4-bit quantization to store the embedding vectors. This involves reducing the precision of the vectors from 32-bit floating-point numbers to 4-bit integers, which can significantly reduce the memory requirements. Our approach has several benefits. Firstly, it significantly reduces the memory storage requirements of the high-dimensional vector database, making it more feasible to deploy RAG systems in resource-constrained environments. Secondly, it speeds up the searching process, as the reduced precision of the vectors allows for faster computation. Our code is available at this https URL
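The storage idea in this abstract, mapping 32-bit floats to 4-bit integers, can be illustrated with a minimal sketch. The abstract does not say which quantization scheme the paper uses, so the per-vector min-max uniform scheme below, along with the names quantize_4bit and dequantize_4bit, is an assumption made purely for illustration.

```python
from typing import List, Tuple


def quantize_4bit(vec: List[float]) -> Tuple[bytes, float, float]:
    """Quantize floats to 16 uniform levels and pack two 4-bit codes per byte.

    Assumed scheme (not taken from the paper): per-vector min-max scaling.
    Returns the packed bytes plus the offset and step needed to decode.
    """
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / 15 if hi > lo else 1.0  # 16 levels -> 15 steps
    codes = [min(15, max(0, round((x - lo) / scale))) for x in vec]
    if len(codes) % 2:
        codes.append(0)  # pad so the codes pair up into whole bytes
    packed = bytes(
        (codes[i] << 4) | codes[i + 1] for i in range(0, len(codes), 2)
    )
    return packed, lo, scale


def dequantize_4bit(packed: bytes, lo: float, scale: float, n: int) -> List[float]:
    """Unpack the 4-bit codes and map them back to approximate floats."""
    codes: List[int] = []
    for byte in packed:
        codes.append(byte >> 4)    # high nibble
        codes.append(byte & 0x0F)  # low nibble
    return [lo + c * scale for c in codes[:n]]
```

Under this scheme eight values occupy 4 bytes instead of the 32 bytes needed for float32, plus a small per-vector overhead for the offset and step. The reconstruction error of each component is bounded by half a quantization step.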

Title: Towards Data-Centric AI: A Comprehensive Survey of Traditional, Reinforcement, and Generative Approaches for Tabular Data Transformation

Authors: Dongjie Wang, Yanyong Huang, Wangyang Ying, Haoyue Bai, Nanxu Gong, Xinyuan Wang, Sixun Dong, Tao Zhe, Kunpeng Liu, Meng Xiao, Pengfei Wang, Pengyang Wang, Hui Xiong, Yanjie Fu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2501.10555
Pdf URL: https://arxiv.org/pdf/2501.10555
Copy Paste: [[2501.10555]] Towards Data-Centric AI: A Comprehensive Survey of Traditional, Reinforcement, and Generative Approaches for Tabular Data Transformation(https://arxiv.org/abs/2501.10555)
Keywords: generation, generative
Abstract: Tabular data is one of the most widely used formats across industries, driving critical applications in areas such as finance, healthcare, and marketing. In the era of data-centric AI, improving data quality and representation has become essential for enhancing model performance, particularly in applications centered around tabular data. This survey examines the key aspects of tabular data-centric AI, emphasizing feature selection and feature generation as essential techniques for data space refinement. We provide a systematic review of feature selection methods, which identify and retain the most relevant data attributes, and feature generation approaches, which create new features to simplify the capture of complex data patterns. This survey offers a comprehensive overview of current methodologies through an analysis of recent advancements, practical applications, and the strengths and limitations of these techniques. Finally, we outline open challenges and suggest future perspectives to inspire continued innovation in this field.

Title: Mutual Regression Distance

Authors: Dong Qiao, Jicong Fan
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2501.10617
Pdf URL: https://arxiv.org/pdf/2501.10617
Copy Paste: [[2501.10617]] Mutual Regression Distance(https://arxiv.org/abs/2501.10617)
Keywords: generative
Abstract: The maximum mean discrepancy and Wasserstein distance are popular distance measures between distributions and play important roles in many machine learning problems such as metric learning, generative modeling, domain adaptation, and clustering. However, since they are functions of pair-wise distances between data points in two distributions, they do not exploit the potential manifold properties of data such as smoothness and hence are not effective in measuring the dissimilarity between two distributions that take the form of manifolds. In this paper, unlike existing measures, we propose a novel distance called Mutual Regression Distance (MRD), induced by a constrained mutual regression problem, which can exploit the manifold property of data. We prove that MRD is a pseudometric that satisfies almost all the axioms of a metric. Since the optimization of the original MRD is costly, we provide a tight MRD and a simplified MRD, based on which a heuristic algorithm is established. We also provide kernel variants of MRDs that are more effective in handling nonlinear data. Our MRDs, especially the simplified MRDs, have much lower computational complexity than the Wasserstein distance. We provide theoretical guarantees, such as robustness, for MRDs. Finally, we apply MRDs to distribution clustering, generative models, and domain adaptation. The numerical results demonstrate the effectiveness and superiority of MRDs compared to the baselines.
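The remark that the maximum mean discrepancy depends only on pair-wise relationships between samples can be made concrete with a minimal sketch of the biased empirical estimate of squared MMD. The Gaussian kernel, the bandwidth value, and the function names are assumptions chosen for illustration. This is the baseline measure the abstract contrasts against, not the proposed MRD.

```python
import math
from typing import Sequence

Point = Sequence[float]


def gaussian_kernel(a: Point, b: Point, bandwidth: float = 1.0) -> float:
    """Gaussian kernel value, a function of the squared distance between a and b."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mean_kernel(xs: Sequence[Point], ys: Sequence[Point], bandwidth: float) -> float:
    """Average kernel value over all pairs (x, y)."""
    total = sum(gaussian_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(xs: Sequence[Point], ys: Sequence[Point], bandwidth: float = 1.0) -> float:
    """Biased (V-statistic) estimate of squared MMD between two samples."""
    return (
        mean_kernel(xs, xs, bandwidth)
        + mean_kernel(ys, ys, bandwidth)
        - 2.0 * mean_kernel(xs, ys, bandwidth)
    )
```

Every term is an average of kernel values over pairs of points, so the estimate sees the samples only through their pair-wise distances, which is the limitation the abstract points to.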

Title: EMO2: End-Effector Guided Audio-Driven Avatar Video Generation

Authors: Linrui Tian, Siqi Hu, Qi Wang, Bang Zhang, Liefeng Bo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.10687
Pdf URL: https://arxiv.org/pdf/2501.10687
Copy Paste: [[2501.10687]] EMO2: End-Effector Guided Audio-Driven Avatar Video Generation(https://arxiv.org/abs/2501.10687)
Keywords: generation
Abstract: In this paper, we propose a novel audio-driven talking head method capable of simultaneously generating highly expressive facial expressions and hand gestures. Unlike existing methods that focus on generating full-body or half-body poses, we investigate the challenges of co-speech gesture generation and identify the weak correspondence between audio features and full-body gestures as a key limitation. To address this, we redefine the task as a two-stage process. In the first stage, we generate hand poses directly from audio input, leveraging the strong correlation between audio signals and hand movements. In the second stage, we employ a diffusion model to synthesize video frames, incorporating the hand poses generated in the first stage to produce realistic facial expressions and body movements. Our experimental results demonstrate that the proposed method outperforms state-of-the-art approaches, such as CyberHost and Vlogger, in terms of both visual quality and synchronization accuracy. This work provides a new perspective on audio-driven gesture generation and a robust framework for creating expressive and natural talking head animations.

Title: GAUDA: Generative Adaptive Uncertainty-guided Diffusion-based Augmentation for Surgical Segmentation

Authors: Yannik Frisch, Christina Bornberg, Moritz Fuchs, Anirban Mukhopadhyay
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2501.10819
Pdf URL: https://arxiv.org/pdf/2501.10819
Copy Paste: [[2501.10819]] GAUDA: Generative Adaptive Uncertainty-guided Diffusion-based Augmentation for Surgical Segmentation(https://arxiv.org/abs/2501.10819)
Keywords: generative
Abstract: Augmentation by generative modelling yields a promising alternative to the accumulation of surgical data, where ethical, organisational and regulatory aspects must be considered. Yet, the joint synthesis of (image, mask) pairs for segmentation, a major application in surgery, is rather unexplored. We propose to learn semantically comprehensive yet compact latent representations of the (image, mask) space, which we jointly model with a Latent Diffusion Model. We show that our approach can effectively synthesise unseen high-quality paired segmentation data of remarkable semantic coherence. Generative augmentation is typically applied pre-training by synthesising a fixed number of additional training samples to improve downstream task models. To enhance this approach, we further propose Generative Adaptive Uncertainty-guided Diffusion-based Augmentation (GAUDA), leveraging the epistemic uncertainty of a Bayesian downstream model for targeted online synthesis. We condition the generative model on classes with high estimated uncertainty during training to produce additional unseen samples for these classes. By adaptively utilising the generative model online, we can minimise the number of additional training samples and centre them around the currently most uncertain parts of the data distribution. GAUDA effectively improves downstream segmentation results over comparable methods by an average absolute IoU of 1.6% on CaDISv2 and 1.5% on CholecSeg8k, two prominent surgical datasets for semantic segmentation.

Title: Addressing Multilabel Imbalance with an Efficiency-Focused Approach Using Diffusion Model-Generated Synthetic Samples

Authors: Francisco Charte, Miguel Ángel Dávila, María Dolores Pérez-Godoy, María José del Jesus
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2501.10822
Pdf URL: https://arxiv.org/pdf/2501.10822
Copy Paste: [[2501.10822]] Addressing Multilabel Imbalance with an Efficiency-Focused Approach Using Diffusion Model-Generated Synthetic Samples(https://arxiv.org/abs/2501.10822)
Keywords: generation
Abstract: Predictive models trained on imbalanced data tend to produce biased results. This problem is exacerbated when there is not just one output label, but a set of them. This is the case for multilabel learning (MLL) algorithms used to classify patterns, rank labels, or learn the distribution of outputs. Many solutions have been proposed in the literature. The one that can be applied universally, independent of the algorithm used to build the model, is data resampling. The generation of new instances associated with minority labels, so that empty areas of the feature space are filled, helps to improve the obtained models. The quality of these new instances depends on the algorithm used to generate them. In this paper, a diffusion model tailored to produce new instances for MLL data, called MLDM (MultiLabel Diffusion Model), is proposed. Diffusion models have been mainly used to generate artificial images and videos. Our proposed MLDM is based on this type of model. The experiments conducted compare MLDM with several other MLL resampling algorithms. The results show that MLDM is competitive while improving efficiency.

Title: Diffusion-Based Imitation Learning for Social Pose Generation

Authors: Antonio Lech Martin-Ozimek, Isuru Jayarathne, Su Larb Mon, Jouh Yeong Chew
Subjects: cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2501.10869
Pdf URL: https://arxiv.org/pdf/2501.10869
Copy Paste: [[2501.10869]] Diffusion-Based Imitation Learning for Social Pose Generation(https://arxiv.org/abs/2501.10869)
Keywords: generation
Abstract: Intelligent agents, such as robots and virtual agents, must understand the dynamics of complex social interactions to interact with humans. Effectively representing social dynamics is challenging because we require multi-modal, synchronized observations to understand a scene. We explore how a single modality, the pose behavior of multiple individuals in a social interaction, can be used to generate nonverbal social cues for the facilitator of that interaction. The facilitator acts to make a social interaction proceed smoothly and is an essential role for intelligent agents to replicate in human-robot interactions. In this paper, we adapt an existing diffusion behavior cloning model to learn and replicate facilitator behaviors. Furthermore, we evaluate two representations of pose observations from a scene: one has pre-processing applied and one does not. The first purpose of this paper is to introduce a new use for diffusion behavior cloning: pose generation in social interactions. The second is to understand the relationship between performance and computational load for generating social pose behavior using two different techniques for collecting scene observations. As such, we are essentially testing the effectiveness of two different types of conditioning for a diffusion model. We then evaluate the resulting generated behavior from each technique using quantitative measures such as mean per-joint position error (MPJPE), training time, and inference time. Additionally, we plot training and inference time against MPJPE to examine the trade-offs between efficiency and performance. Our results suggest that the further pre-processed data can successfully condition diffusion models to generate realistic social behavior, with reasonable trade-offs in accuracy and processing time.
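For readers unfamiliar with the metric named in this abstract, mean per-joint position error is commonly computed as the average Euclidean distance between predicted and reference joint positions. The sketch below follows that common definition for a single pose. The function name and the list-of-coordinates input layout are assumptions, and the paper may apply alignment or per-frame averaging that is not shown here.

```python
import math
from typing import Sequence

Joint = Sequence[float]


def mpjpe(pred: Sequence[Joint], target: Sequence[Joint]) -> float:
    """Mean Euclidean distance between corresponding joints of two poses."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("poses must be non-empty and have the same number of joints")
    total = 0.0
    for p, t in zip(pred, target):
        total += math.dist(p, t)  # per-joint position error
    return total / len(pred)
```

For a sequence of frames, one common convention is to average this per-pose value over all frames.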

Title: Learn-by-interact: A Data-Centric Framework for Self-Adaptive Agents in Realistic Environments

Authors: Hongjin Su, Ruoxi Sun, Jinsung Yoon, Pengcheng Yin, Tao Yu, Sercan Ö. Arık
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2501.10893
Pdf URL: https://arxiv.org/pdf/2501.10893
Copy Paste: [[2501.10893]] Learn-by-interact: A Data-Centric Framework for Self-Adaptive Agents in Realistic Environments(https://arxiv.org/abs/2501.10893)
Keywords: generation
Abstract: Autonomous agents powered by large language models (LLMs) have the potential to enhance human capabilities, assisting with digital tasks from sending emails to performing data analysis. The abilities of existing LLMs at such tasks are often hindered by the lack of high-quality agent data from the corresponding environments they interact with. We propose Learn-by-interact, a data-centric framework to adapt LLM agents to any given environment without human annotations. Learn-by-interact synthesizes trajectories of agent-environment interactions based on documentation, and constructs instructions by summarizing or abstracting the interaction histories, a process called backward construction. We assess the quality of our synthetic data by using them in both training-based scenarios and training-free in-context learning (ICL), where we craft innovative retrieval approaches optimized for agents. Extensive experiments on SWE-bench, WebArena, OSWorld and Spider2-V, spanning realistic coding, web, and desktop environments, show the effectiveness of Learn-by-interact in various downstream agentic tasks: baseline results are improved by up to 12.2% for ICL with Claude-3.5 and 19.5% for training with Codestral-22B. We further demonstrate the critical role of backward construction, which provides up to 14.0% improvement for training. Our ablation studies demonstrate the efficiency provided by our synthesized data in ICL and the superiority of our retrieval pipeline over alternative approaches like conventional retrieval-augmented generation (RAG). We expect that Learn-by-interact will serve as a foundation for agent data synthesis as LLMs are increasingly deployed in real-world environments.

Title: Know "No" Better: A Data-Driven Approach for Enhancing Negation Awareness in CLIP

Authors: Junsung Park, Jungbeom Lee, Jongyoon Song, Sangwon Yu, Dahuin Jung, Sungroh Yoon
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2501.10913
Pdf URL: https://arxiv.org/pdf/2501.10913
Copy Paste: [[2501.10913]] Know "No" Better: A Data-Driven Approach for Enhancing Negation Awareness in CLIP(https://arxiv.org/abs/2501.10913)
Keywords: generation
Abstract: While CLIP has significantly advanced multimodal understanding by bridging vision and language, its inability to grasp negation, such as failing to differentiate concepts like "parking" from "no parking", poses substantial challenges. By analyzing the data used in the public CLIP model's pre-training, we posit this limitation stems from a lack of negation-inclusive data. To address this, we introduce data generation pipelines that employ a large language model (LLM) and a multimodal LLM to produce negation-inclusive captions. Fine-tuning CLIP with data generated from our pipelines, we develop NegationCLIP, which enhances negation awareness while preserving the generality. Moreover, to enable a comprehensive evaluation of negation understanding, we propose NegRefCOCOg, a benchmark tailored to test VLMs' ability to interpret negation across diverse expressions and positions within a sentence. Experiments on various CLIP architectures validate the effectiveness of our data generation pipelines in enhancing CLIP's ability to perceive negation accurately. Additionally, NegationCLIP's enhanced negation awareness has practical applications across various multimodal tasks, demonstrated by performance gains in text-to-image generation and referring image segmentation.

Title: Data Enrichment Opportunities for Distribution Grid Cable Networks using Variational Autoencoders

Authors: Konrad Sundsgaard, Kutay Bölat, Guangya Yang
Subjects: cs.LG, eess.SY
Abstract URL: https://arxiv.org/abs/2501.10920
Pdf URL: https://arxiv.org/pdf/2501.10920
Copy Paste: [[2501.10920]] Data Enrichment Opportunities for Distribution Grid Cable Networks using Variational Autoencoders(https://arxiv.org/abs/2501.10920)
Keywords: generation, generative
Abstract: Electricity distribution cable networks suffer from incomplete and unbalanced data, hindering the effectiveness of machine learning models for predictive maintenance and reliability evaluation. Features such as the installation date of the cables are frequently missing. To address data scarcity, this study investigates the application of Variational Autoencoders (VAEs) for data enrichment, synthetic data generation, imbalanced data handling, and outlier detection. Based on a proof-of-concept case study for Denmark, targeting the imputation of missing age information in cable network asset registers, the analysis underlines the potential of generative models to support data-driven maintenance. However, the study also highlights several areas for improvement, including enhanced feature importance analysis, incorporating network characteristics and external features, and handling biases in missing data. Future initiatives should expand the application of VAEs by incorporating semi-supervised learning, advanced sampling techniques, and additional distribution grid elements, including low-voltage networks, into the analysis.

Title: Generative Physical AI in Vision: A Survey

Authors: Daochang Liu, Junyu Zhang, Anh-Dung Dinh, Eunbyung Park, Shichao Zhang, Chang Xu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2501.10928
Pdf URL: https://arxiv.org/pdf/2501.10928
Copy Paste: [[2501.10928]] Generative Physical AI in Vision: A Survey(https://arxiv.org/abs/2501.10928)
Keywords: generation, generative
Abstract: Generative Artificial Intelligence (AI) has rapidly advanced the field of computer vision by enabling machines to create and interpret visual data with unprecedented sophistication. This transformation builds upon a foundation of generative models to produce realistic images, videos, and 3D or 4D content. Traditionally, generative models primarily focus on visual fidelity while often neglecting the physical plausibility of generated content. This gap limits their effectiveness in applications requiring adherence to real-world physical laws, such as robotics, autonomous systems, and scientific simulations. As generative AI evolves to increasingly integrate physical realism and dynamic simulation, its potential to function as a "world simulator" expands, enabling the modeling of interactions governed by physics and bridging the divide between virtual and physical realities. This survey systematically reviews this emerging field of physics-aware generative AI in computer vision, categorizing methods based on how they incorporate physical knowledge, either through explicit simulation or implicit learning. We analyze key paradigms, discuss evaluation protocols, and identify future research directions. By offering a comprehensive overview, this survey aims to help future developments in physically grounded generation for vision. The reviewed papers are summarized at this https URL.

Title: Beyond Any-Shot Adaptation: Predicting Optimization Outcome for Robustness Gains without Extra Pay

Authors: Qi Cheems Wang, Zehao Xiao, Yixiu Mao, Yun Qu, Jiayi Shen, Yiqin Lv, Xiangyang Ji
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2501.11039
Pdf URL: https://arxiv.org/pdf/2501.11039
Copy Paste: [[2501.11039]] Beyond Any-Shot Adaptation: Predicting Optimization Outcome for Robustness Gains without Extra Pay(https://arxiv.org/abs/2501.11039)
Keywords: generative
Abstract: The foundation model enables fast problem-solving without learning from scratch, and such a desirable adaptation property benefits from its adopted cross-task generalization paradigms, e.g., pretraining, meta-training, or finetuning. Recent trends have focused on the curation of task datasets during optimization, which includes task selection as an indispensable consideration for either adaptation robustness or sampling efficiency purposes. Despite some progress, selecting crucial task batches to optimize over iteration mostly exhausts massive task queries and requires intensive evaluation and computations to secure robust adaptation. This work underscores the criticality of both robustness and learning efficiency, especially in scenarios where tasks are risky to collect or costly to evaluate. To this end, we present Model Predictive Task Sampling (MPTS), a novel active task sampling framework that establishes connections between the task space and the adaptation risk landscape to achieve robust adaptation. Technically, MPTS characterizes the task episodic information with a generative model and predicts the optimization outcome after adaptation from posterior inference, i.e., forecasting task-specific adaptation risk values. The resulting risk learner amortizes expensive annotation, evaluation, or computation operations in task robust adaptation learning paradigms. Extensive experimental results show that MPTS can be seamlessly integrated into zero-shot, few-shot, and many-shot learning paradigms, increases adaptation robustness, and retains learning efficiency without affording extra cost. The code will be available at the project site this https URL.

Title: BF-STVSR: B-Splines and Fourier-Best Friends for High Fidelity Spatial-Temporal Video Super-Resolution

Authors: Eunjin Kim, Hyeonjin Kim, Kyong Hwan Jin, Jaejun Yoo
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2501.11043
Pdf URL: https://arxiv.org/pdf/2501.11043
Copy Paste: [[2501.11043]] BF-STVSR: B-Splines and Fourier-Best Friends for High Fidelity Spatial-Temporal Video Super-Resolution(https://arxiv.org/abs/2501.11043)
Keywords: super-resolution
Abstract: Enhancing low-resolution, low-frame-rate videos to high-resolution, high-frame-rate quality is essential for a seamless user experience, motivating advancements in Continuous Spatial-Temporal Video Super Resolution (C-STVSR). While prior methods employ Implicit Neural Representation (INR) for continuous encoding, they often struggle to capture the complexity of video data, relying on simple coordinate concatenation and a pre-trained optical flow network for motion representation. Interestingly, we find that adding position encoding, contrary to common observations, does not improve performance and can even degrade it. This issue becomes particularly pronounced when combined with pre-trained optical flow networks, which can limit the model's flexibility. To address these issues, we propose BF-STVSR, a C-STVSR framework with two key modules tailored to better represent spatial and temporal characteristics of video: 1) B-spline Mapper for smooth temporal interpolation, and 2) Fourier Mapper for capturing dominant spatial frequencies. Our approach achieves state-of-the-art PSNR and SSIM performance, showing enhanced spatial details and natural temporal consistency.
摘要：将低分辨率、低帧率的视频提升为高分辨率、高帧率的视频质量对于实现无缝用户体验至关重要，这推动了连续时空视频超分辨率 (C-STVSR) 的进步。虽然之前的方法采用隐式神经表征 (INR) 进行连续编码，但它们通常难以捕捉视频数据的复杂性，依赖于简单的坐标连接和预训练的光流网络进行运动表示。有趣的是，我们发现与常见的观察结果相反，添加位置编码并不能提高甚至降低性能。当与预训练的光流网络结合使用时，这个问题变得尤为明显，这会限制模型的灵活性。为了解决这些问题，我们提出了 BF-STVSR，这是一个 C-STVSR 框架，具有两个关键模块，可以更好地表示视频的空间和时间特征：1) 用于平滑时间插值的 B 样条映射器，以及 2) 用于捕获主要空间频率的傅里叶映射器。我们的方法实现了最先进的 PSNR 和 SSIM 性能，展现出增强的空间细节和自然的时间一致性。
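
Note: the abstract attributes smooth temporal interpolation to a B-spline Mapper. As a rough illustration of why B-splines give smooth in-between values, here is a minimal sketch of a plain uniform cubic B-spline over scalar control points. It is not the paper's learned module; the function names and the scalar setting are assumptions made only for this example.

```python
def cubic_bspline_weights(u):
    """Uniform cubic B-spline basis weights for a local parameter u in [0, 1]."""
    w0 = (1 - u) ** 3 / 6.0
    w1 = (3 * u**3 - 6 * u**2 + 4) / 6.0
    w2 = (-3 * u**3 + 3 * u**2 + 3 * u + 1) / 6.0
    w3 = u**3 / 6.0
    return (w0, w1, w2, w3)


def bspline_interpolate(control, t):
    """Evaluate a uniform cubic B-spline over scalar control points at t in [0, 1].

    Hypothetical helper: `control` stands in for per-keyframe values.
    """
    n_segments = len(control) - 3
    if n_segments < 1:
        raise ValueError("need at least 4 control points")
    x = min(max(t, 0.0), 1.0) * n_segments   # position measured in segments
    i = min(int(x), n_segments - 1)          # index of the active segment
    u = x - i                                # local parameter inside the segment
    weights = cubic_bspline_weights(u)
    return sum(w * c for w, c in zip(weights, control[i:i + 4]))
```

The four weights always sum to one, so the curve stays inside the range of neighbouring control points and changes smoothly as t moves between frames.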

Title: Enhancing Sample Utilization in Noise-Robust Deep Metric Learning With Subgroup-Based Positive-Pair Selection

Authors: Zhipeng Yu, Qianqian Xu, Yangbangyan Jiang, Yingfei Sun, Qingming Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.11063
Pdf URL: https://arxiv.org/pdf/2501.11063
Copy Paste: [[2501.11063]] Enhancing Sample Utilization in Noise-Robust Deep Metric Learning With Subgroup-Based Positive-Pair Selection(https://arxiv.org/abs/2501.11063)
Keywords: generation
Abstract: The existence of noisy labels in real-world data negatively impacts the performance of deep learning models. Although much research effort has been devoted to improving the robustness towards noisy labels in classification tasks, the problem of noisy labels in deep metric learning (DML) remains under-explored. Existing noisy label learning methods designed for DML mainly discard suspicious noisy samples, resulting in a waste of the training data. To address this issue, we propose a noise-robust DML framework with SubGroup-based Positive-pair Selection (SGPS), which constructs reliable positive pairs for noisy samples to enhance the sample utilization. Specifically, SGPS first effectively identifies clean and noisy samples by a probability-based clean sample selection strategy. To further utilize the remaining noisy samples, we discover their potential similar samples based on the subgroup information given by a subgroup generation module and then aggregate them into informative positive prototypes for each noisy sample via a positive prototype generation module. Afterward, a new contrastive loss is tailored for the noisy samples with their selected positive pairs. SGPS can be easily integrated into the training process of existing pair-wise DML tasks, like image retrieval and face recognition. Extensive experiments on multiple synthetic and real-world large-scale label noise datasets demonstrate the effectiveness of our proposed method. Without any bells and whistles, our SGPS framework outperforms the state-of-the-art noisy label DML methods. Code is available at this https URL.
摘要：现实世界数据中的噪声标签会对深度学习模型的性能产生负面影响。尽管已经投入了大量研究来提高分类任务中对噪声标签的鲁棒性，但深度度量学习 (DML) 中的噪声标签问题仍未得到充分探索。现有的为 DML 设计的噪声标签学习方法主要丢弃可疑的噪声样本，从而浪费训练数据。为了解决这个问题，我们提出了一个具有基于子组的正对选择 (SGPS) 的噪声鲁棒 DML 框架，该框架为噪声样本构建可靠的正对以提高样本利用率。具体而言，SGPS 首先通过基于概率的干净样本选择策略有效地识别干净样本和噪声样本。为了进一步利用剩余的噪声样本，我们根据子组生成模块给出的子组信息发现它们的潜在相似样本，然后通过正原型生成模块将它们聚合为每个噪声样本的信息性正原型。之后，为具有所选正对的噪声样本量身定制新的对比损失。 SGPS 可以轻松集成到现有成对 DML 任务（如图像检索和人脸识别）的训练过程中。在多个合成和真实世界大规模标签噪声数据集上进行的大量实验证明了我们提出的方法的有效性。我们的 SGPS 框架没有任何花哨的花哨功能，但表现优于最先进的噪声标签 DML 方法。代码可在 \url{此 https URL} 处获得。

Title: Unit Region Encoding: A Unified and Compact Geometry-aware Representation for Floorplan Applications

Authors: Huichao Zhang, Pengyu Wang, Manyi Li, Zuojun Li, Yaguang Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.11097
Pdf URL: https://arxiv.org/pdf/2501.11097
Copy Paste: [[2501.11097]] Unit Region Encoding: A Unified and Compact Geometry-aware Representation for Floorplan Applications(https://arxiv.org/abs/2501.11097)
Keywords: generation
Abstract: We present the Unit Region Encoding of floorplans, which is a unified and compact geometry-aware encoding representation for various applications, ranging from interior space planning and floorplan metric learning to floorplan generation tasks. The floorplans are represented as the latent encodings on a set of boundary-adaptive unit region partitions based on the clustering of the proposed geometry-aware density map. The latent encodings are extracted by a trained network (URE-Net) from the input dense density map and other available semantic maps. Compared to the over-segmented rasterized images and the room-level graph structures, our representation can be flexibly adapted to different applications with the sliced unit regions while achieving higher accuracy and better visual quality. We conduct a variety of experiments and compare to the state-of-the-art methods on the aforementioned applications to validate the superiority of our representation, as well as extensive ablation studies to demonstrate the effect of our slicing choices.
摘要：我们提出了平面图的单位区域编码，这是一种统一、紧凑的几何感知编码表示，适用于从室内空间规划、平面图度量学习到平面图生成任务等各种应用。平面图表示为基于所提出的几何感知密度图的聚类的一组边界自适应单位区域分区上的潜在编码。潜在编码由训练有素的网络（URE-Net）从输入的密集密度图和其他可用的语义图中提取。与过度分割的光栅化图像和房间级图结构相比，我们的表示可以灵活地适应具有切片单位区域的不同应用，同时实现更高的准确度性能和更好的视觉质量。我们进行了各种实验，并与上述应用上的最先进方法进行了比较，以验证我们表示的优越性，并进行了广泛的消融研究以证明我们的切片选择的效果。

Title: Rethinking Pseudo-Label Guided Learning for Weakly Supervised Temporal Action Localization from the Perspective of Noise Correction

Authors: Quan Zhang, Yuxin Qi, Xi Tang, Rui Yuan, Xi Lin, Ke Zhang, Chun Yuan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.11124
Pdf URL: https://arxiv.org/pdf/2501.11124
Copy Paste: [[2501.11124]] Rethinking Pseudo-Label Guided Learning for Weakly Supervised Temporal Action Localization from the Perspective of Noise Correction(https://arxiv.org/abs/2501.11124)
Keywords: generation
Abstract: Pseudo-label learning methods have been widely applied in weakly-supervised temporal action localization. Existing works directly utilize a weakly-supervised base model to generate instance-level pseudo-labels for training the fully-supervised detection head. We argue that the noise in pseudo-labels would interfere with the learning of the fully-supervised detection head, leading to significant performance leakage. Issues with noisy labels include: (1) inaccurate boundary localization; (2) undetected short action clips; (3) multiple adjacent segments incorrectly detected as one segment. To target these issues, we introduce a two-stage noisy label learning strategy to harness every potentially useful signal in noisy labels. First, we propose a frame-level pseudo-label generation model with a context-aware denoising algorithm to refine the boundaries. Second, we introduce an online-revised teacher-student framework with a missing instance compensation module and an ambiguous instance correction module to solve the short-action-missing and many-to-one problems. Besides, we apply a high-quality pseudo-label mining loss in our online-revised teacher-student framework to add different weights to the noisy labels to train more effectively. Our model greatly outperforms the previous state-of-the-art method in detection accuracy and inference speed on the THUMOS14 and ActivityNet v1.2 benchmarks.
摘要：伪标签学习方法在弱监督时间动作定位中得到了广泛的应用。现有的研究直接利用弱监督基础模型生成实例级伪标签来训练全监督检测头。我们认为伪标签中的噪声会干扰全监督检测头的学习，导致严重的性能损失。噪声标签的问题包括：（1）边界定位不准确；（2）未检测到的短动作片段；（3）多个相邻片段被错误地检测为一个片段。针对这些问题，我们引入了一种两阶段噪声标签学习策略来利用噪声标签中的每一个潜在有用信号。首先，我们提出了一个帧级伪标签生成模型，并使用上下文感知去噪算法来细化边界。其次，我们引入了一个在线修订的师生框架，该框架具有缺失实例补偿模块和模糊实例校正模块，以解决短动作缺失和多对一问题。此外，我们在在线修订的师生框架中应用了高质量的伪标签挖掘损失，为嘈杂的标签添加不同的权重，从而更有效地进行训练。我们的模型在 THUMOS14 和 ActivityNet v1.2 基准测试中，在检测准确率和推理速度方面远远优于之前最先进的方法。

Title: CLOFAI: A Dataset of Real And Fake Image Classification Tasks for Continual Learning

Authors: William Doherty, Anton Lee, Heitor Murilo Gomes
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2501.11140
Pdf URL: https://arxiv.org/pdf/2501.11140
Copy Paste: [[2501.11140]] CLOFAI: A Dataset of Real And Fake Image Classification Tasks for Continual Learning(https://arxiv.org/abs/2501.11140)
Keywords: generative
Abstract: The rapid advancement of generative AI models capable of creating realistic media has led to a need for classifiers that can accurately distinguish between genuine and artificially-generated images. A significant challenge for these classifiers emerges when they encounter images from generative models that are not represented in their training data, usually resulting in diminished performance. A typical approach is to periodically update the classifier's training data with images from the new generative models and then retrain the classifier on the updated dataset. However, in some real-life scenarios, storage, computational, or privacy constraints render this approach impractical. Additionally, models used in security applications may be required to rapidly adapt. In these circumstances, continual learning provides a promising alternative, as the classifier can be updated without retraining on the entire dataset. In this paper, we introduce a new dataset called CLOFAI (Continual Learning On Fake and Authentic Images), which takes the form of a domain-incremental image classification problem. Moreover, we showcase the applicability of this dataset as a benchmark for evaluating continual learning methodologies. In doing this, we set a baseline on our novel dataset using three foundational continual learning methods (EWC, GEM, and Experience Replay) and find that EWC performs poorly, while GEM and Experience Replay show promise, performing significantly better than a Naive baseline. The dataset and code to run the experiments can be accessed from the following GitHub repository: this https URL.
摘要：能够创建逼真媒体的生成式 AI 模型的快速发展导致需要能够准确区分真实图像和人工生成的图像的分类器。当这些分类器遇到训练数据中未表示的生成模型图像时，它们将面临重大挑战，这通常会导致性能下降。一种典型的方法是定期使用新生成模型中的图像更新分类器的训练数据，然后在更新后的数据集上重新训练分类器。然而，在某些现实场景中，存储、计算或隐私限制使得这种方法不切实际。此外，安全应用中使用的模型可能需要快速适应。在这些情况下，持续学习提供了一种有希望的替代方案，因为分类器可以在不重新训练整个数据集的情况下进行更新。在本文中，我们介绍了一个名为 CLOFAI（持续学习假图像和真图像）的新数据集，它采用领域增量图像分类问题的形式。此外，我们展示了该数据集作为评估持续学习方法的基准的适用性。在此过程中，我们使用三种基础持续学习方法（EWC、GEM 和 Experience Replay）为新数据集设定了基线，发现 EWC 表现不佳，而 GEM 和 Experience Replay 则表现良好，表现明显优于 Naive 基线。可以从以下 GitHub 存储库访问运行实验的数据集和代码：此 https URL。
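
Note: Experience Replay, one of the baselines named in the abstract, keeps a small memory of earlier examples and mixes them into later training batches. The sketch below shows one common variant, a reservoir-sampled buffer. It is an illustration only, not the configuration used in the paper; `ReplayBuffer`, `mixed_batch`, and their parameters are hypothetical names.

```python
import random


class ReplayBuffer:
    """Fixed-size memory filled by reservoir sampling (illustrative only)."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0                      # examples offered to the buffer so far
        self.rng = random.Random(seed)

    def add(self, example):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(example)
        else:
            # Keep each example seen so far with equal probability capacity / seen.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = example

    def sample(self, k):
        k = min(k, len(self.items))
        return self.rng.sample(self.items, k)


def mixed_batch(new_examples, buffer, replay_k):
    """Return new examples plus up to replay_k stored ones, then store the new ones."""
    batch = list(new_examples) + buffer.sample(replay_k)
    for ex in new_examples:
        buffer.add(ex)
    return batch
```

In a domain-incremental setting, each training step on a new generator's images would draw its batch from `mixed_batch`, so that earlier domains keep contributing to the updates.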

Title: Advancing Oyster Phenotype Segmentation with Multi-Network Ensemble and Multi-Scale mechanism

Authors: Wenli Yang, Yanyu Chen, Andrew Trotter, Byeong Kang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.11203
Pdf URL: https://arxiv.org/pdf/2501.11203
Copy Paste: [[2501.11203]] Advancing Oyster Phenotype Segmentation with Multi-Network Ensemble and Multi-Scale mechanism(https://arxiv.org/abs/2501.11203)
Keywords: quality assessment
Abstract: Phenotype segmentation is pivotal in analysing visual features of living organisms, enhancing our understanding of their characteristics. In the context of oysters, meat quality assessment is paramount, focusing on shell, meat, gonad, and muscle components. Traditional manual inspection methods are time-consuming and subjective, prompting the adoption of machine vision technology for efficient and objective evaluation. We explore machine vision's capacity for segmenting oyster components, leading to the development of a multi-network ensemble approach with a global-local hierarchical attention mechanism. This approach integrates predictions from diverse models and addresses challenges posed by varying scales, ensuring robust instance segmentation across components. Finally, we provide a comprehensive evaluation of the proposed method's performance using different real-world datasets, highlighting its efficacy and robustness in enhancing oyster phenotype segmentation.
摘要：表型分割是分析生物体视觉特征的关键，有助于我们了解其特征。对于牡蛎而言，肉质评估至关重要，重点是壳、肉、生殖腺和肌肉成分。传统的人工检查方法既耗时又主观，因此采用机器视觉技术进行高效客观的评估。我们探索了机器视觉分割牡蛎成分的能力，从而开发了一种具有全局-局部分层注意机制的多网络集成方法。该方法整合了来自不同模型的预测，并解决了不同尺度带来的挑战，确保了跨成分的稳健实例分割。最后，我们使用不同的真实世界数据集对所提出方法的性能进行了全面评估，强调了其在增强牡蛎表型分割方面的有效性和稳健性。

Title: Leveraging GANs For Active Appearance Models Optimized Model Fitting

Authors: Anurag Awasthi
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2501.11218
Pdf URL: https://arxiv.org/pdf/2501.11218
Copy Paste: [[2501.11218]] Leveraging GANs For Active Appearance Models Optimized Model Fitting(https://arxiv.org/abs/2501.11218)
Keywords: generative
Abstract: Generative Adversarial Networks (GANs) have gained prominence in refining model fitting tasks in computer vision, particularly in domains involving deformable models like Active Appearance Models (AAMs). This paper explores the integration of GANs to enhance the AAM fitting process, addressing challenges in optimizing nonlinear parameters associated with appearance and shape variations. By leveraging GANs' adversarial training framework, the aim is to minimize fitting errors and improve convergence rates, achieving robust performance even in cases with high appearance variability and occlusions. Our approach demonstrates significant improvements in accuracy and computational efficiency compared to traditional optimization techniques, thus establishing GANs as a potent tool for advanced image model fitting.
摘要：生成对抗网络 (GAN) 在改进计算机视觉中的模型拟合任务方面已取得突出成就，尤其是在涉及可变形模型（如主动外观模型 (AAM)）的领域。本文探讨了 GAN 的集成以增强 AAM 拟合过程，解决了优化与外观和形状变化相关的非线性参数的挑战。通过利用 GAN 的对抗训练框架，目标是最大限度地减少拟合误差并提高收敛速度。即使在外观变化较大和遮挡的情况下也能实现稳健的性能。与传统的优化技术相比，我们的方法在准确性和计算效率方面表现出显着的提高，从而使 GAN 成为高级图像模型拟合的有力工具。

Title: Successive Interference Cancellation-aided Diffusion Models for Joint Channel Estimation and Data Detection in Low Rank Channel Scenarios

Authors: Sagnik Bhattacharya, Muhammad Ahmed Mohsin, Kamyar Rajabalifardi, John M. Cioffi
Subjects: cs.CV, cs.IT, eess.SP
Abstract URL: https://arxiv.org/abs/2501.11229
Pdf URL: https://arxiv.org/pdf/2501.11229
Copy Paste: [[2501.11229]] Successive Interference Cancellation-aided Diffusion Models for Joint Channel Estimation and Data Detection in Low Rank Channel Scenarios(https://arxiv.org/abs/2501.11229)
Keywords: generative
Abstract: This paper proposes a novel joint channel-estimation and source-detection algorithm using successive interference cancellation (SIC)-aided generative score-based diffusion models. Prior work in this area focuses on massive MIMO scenarios, which are typically characterized by full-rank channels, and fails in low-rank channel scenarios. The proposed algorithm outperforms existing methods in joint source-channel estimation, especially in low-rank scenarios where the number of users exceeds the number of antennas at the access point (AP). The proposed score-based iterative diffusion process estimates the gradient of the prior distribution on partial channels, and recursively updates the estimated channel parts as well as the source. Extensive simulation results show that the proposed method outperforms the baseline methods in terms of normalized mean squared error (NMSE) and symbol error rate (SER) in both full-rank and low-rank channel scenarios, while having a more dominant effect in the latter, at various signal-to-noise ratios (SNR).
摘要：本文提出了一种新的联合信道估计和源检测算法，该算法使用连续干扰消除 (SIC) 辅助的基于生成分数的扩散模型。该领域的先前工作主要针对大规模 MIMO 场景，这些场景通常以满秩信道为特征，并且在低秩信道场景中失败。所提出的算法在联合源信道估计方面优于现有方法，尤其是在用户数量超过接入点 (AP) 天线数量的低秩场景中。所提出的基于分数的迭代扩散过程估计部分信道上的先验分布的梯度，并递归更新估计的信道部分以及源。大量模拟结果表明，在各种信噪比 (SNR) 下，所提出的方法在满秩和低秩信道场景中的归一化均方误差 (NMSE) 和符号错误率 (SER) 方面均优于基线方法，而在后者中具有更显著的效果。
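
Note: the abstract reports results in NMSE and SER. A minimal sketch of the textbook definitions of both metrics follows, assuming flat lists of (possibly complex) channel coefficients and discrete symbols. The paper may normalise or average differently, for example over many channel realisations; the function names are illustrative.

```python
def nmse(h_true, h_est):
    """Normalized mean squared error: ||h_true - h_est||^2 / ||h_true||^2."""
    num = sum(abs(a - b) ** 2 for a, b in zip(h_true, h_est))
    den = sum(abs(a) ** 2 for a in h_true)
    return num / den


def symbol_error_rate(sent, detected):
    """Fraction of transmitted symbols that were detected incorrectly."""
    errors = sum(1 for s, d in zip(sent, detected) if s != d)
    return errors / len(sent)
```

Lower is better for both: an NMSE of 0 means the channel estimate is exact, and an SER of 0 means every symbol was recovered.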

Title: A New Formulation of Lipschitz Constrained With Functional Gradient Learning for GANs

Authors: Chang Wan, Ke Fan, Xinwei Sun, Yanwei Fu, Minglu Li, Yunliang Jiang, Zhonglong Zheng
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2501.11236
Pdf URL: https://arxiv.org/pdf/2501.11236
Copy Paste: [[2501.11236]] A New Formulation of Lipschitz Constrained With Functional Gradient Learning for GANs(https://arxiv.org/abs/2501.11236)
Keywords: generation, generative
Abstract: This paper introduces a promising alternative method for training Generative Adversarial Networks (GANs) on large-scale datasets with clear theoretical guarantees. GANs are typically learned through a minimax game between a generator and a discriminator, which is known to be empirically unstable. Previous learning paradigms have encountered mode collapse issues without a theoretical solution. To address these challenges, we propose a novel Lipschitz-constrained Functional Gradient GANs learning (Li-CFG) method to stabilize the training of GAN and provide a theoretical foundation for effectively increasing the diversity of synthetic samples by reducing the neighborhood size of the latent vector. Specifically, we demonstrate that the neighborhood size of the latent vector can be reduced by increasing the norm of the discriminator gradient, resulting in enhanced diversity of synthetic samples. To efficiently enlarge the norm of the discriminator gradient, we introduce a novel ε-centered gradient penalty that amplifies the norm of the discriminator gradient using the hyper-parameter ε. In comparison to other constraints, our method enlarges the discriminator norm, thus obtaining the smallest neighborhood size of the latent vector. Extensive experiments on benchmark datasets for image generation demonstrate the efficacy of the Li-CFG method and the ε-centered gradient penalty. The results showcase improved stability and increased diversity of synthetic samples.
摘要：本文介绍了一种在具有明确理论保证的大规模数据集上训练生成对抗网络 (GAN) 的有前途的替代方法。GAN 通常通过生成器和鉴别器之间的极小极大博弈来学习，已知该博弈在经验上是不稳定的。先前的学习范式遇到了模式崩溃问题，而没有理论解决方案。为了应对这些挑战，我们提出了一种新颖的 Lipschitz 约束函数梯度 GAN 学习 (Li-CFG) 方法来稳定 GAN 的训练，并为通过减小潜在向量的邻域大小有效增加合成样本的多样性提供理论基础。具体而言，我们证明可以通过增加鉴别器梯度的范数来减小潜在向量的邻域大小，从而增强合成样本的多样性。为了有效地扩大鉴别器梯度的范数，我们引入了一种新颖的以 {\epsilon} 为中心的梯度惩罚，它使用超参数 {\epsilon} 放大鉴别器梯度的范数。与其他约束相比，我们的方法扩大了鉴别器范数，从而获得了最小的潜在向量邻域大小。在用于图像生成的基准数据集上进行的大量实验证明了 Li-CFG 方法和以 {\epsilon} 为中心的梯度惩罚的有效性。结果显示合成样本的稳定性得到改善，多样性得到增强。

Title: Nested Annealed Training Scheme for Generative Adversarial Networks

Authors: Chang Wan, Ming-Hsuan Yang, Minglu Li, Yunliang Jiang, Zhonglong Zheng
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2501.11318
Pdf URL: https://arxiv.org/pdf/2501.11318
Copy Paste: [[2501.11318]] Nested Annealed Training Scheme for Generative Adversarial Networks(https://arxiv.org/abs/2501.11318)
Keywords: generation, generative
Abstract: Recently, researchers have proposed many deep generative models, including generative adversarial networks (GANs) and denoising diffusion models. Although significant breakthroughs have been made and empirical success has been achieved with the GAN, its mathematical underpinnings remain relatively unknown. This paper focuses on a rigorous mathematical theoretical framework: the composite-functional-gradient GAN (CFG)[1]. Specifically, we reveal the theoretical connection between the CFG model and score-based models. We find that the training objective of the CFG discriminator is equivalent to finding an optimal D(x). The optimal gradient of D(x) differentiates the integral of the differences between the score functions of real and synthesized samples. Conversely, training the CFG generator involves finding an optimal G(x) that minimizes this difference. In this paper, we aim to derive an annealed weight preceding the weight of the CFG discriminator. This new explicit theoretical explanation model is called the annealed CFG method. To overcome the limitation of the annealed CFG method, as the method is not readily applicable to the SOTA GAN model, we propose a nested annealed training scheme (NATS). This scheme keeps the annealed weight from the CFG method and can be seamlessly adapted to various GAN models, no matter their structural, loss, or regularization differences. We conduct thorough experimental evaluations on various benchmark datasets for image generation. The results show that our annealed CFG and NATS methods significantly improve the quality and diversity of the synthesized samples. This improvement is clear when comparing the CFG method and the SOTA GAN models.
摘要：近年来，研究人员提出了许多深度生成模型，包括生成对抗网络 (GAN) 和去噪扩散模型。尽管 GAN 已经取得了重大突破并获得了实证成功，但其数学基础仍然相对未知。本文重点介绍一个严格的数学理论框架：复合函数梯度 GAN (CFG)[1]。具体而言，我们揭示了 CFG 模型与基于分数的模型之间的理论联系。我们发现 CFG 鉴别器的训练目标等同于寻找最优 D(x)。D(x) 的最优梯度区分了真实样本和合成样本的分数函数之差的积分。相反，训练 CFG 生成器涉及寻找最小化该差异的最优 G(x)。在本文中，我们旨在推导出 CFG 鉴别器权重之前的退火权重。这种新的明确理论解释模型称为退火 CFG 方法。为了克服退火 CFG 方法的局限性，因为该方法不易应用于 SOTA GAN 模型，我们提出了一种嵌套退火训练方案 (NATS)。该方案保留了 CFG 方法的退火权重，可以无缝适应各种 GAN 模型，无论它们的结构、损失或正则化有何不同。我们对各种基准数据集进行了彻底的图像生成实验评估。结果表明，我们的退火 CFG 和 NATS 方法显著提高了合成样本的质量和多样性。将 CFG 方法与 SOTA GAN 模型进行比较时，这种改进是显而易见的。

Title: CatV2TON: Taming Diffusion Transformers for Vision-Based Virtual Try-On with Temporal Concatenation

Authors: Zheng Chong, Wenqing Zhang, Shiyue Zhang, Jun Zheng, Xiao Dong, Haoxiang Li, Yiling Wu, Dongmei Jiang, Xiaodan Liang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2501.11325
Pdf URL: https://arxiv.org/pdf/2501.11325
Copy Paste: [[2501.11325]] CatV2TON: Taming Diffusion Transformers for Vision-Based Virtual Try-On with Temporal Concatenation(https://arxiv.org/abs/2501.11325)
Keywords: generation
Abstract: Virtual try-on (VTON) technology has gained attention due to its potential to transform online retail by enabling realistic clothing visualization of images and videos. However, most existing methods struggle to achieve high-quality results across image and video try-on tasks, especially in long video scenarios. In this work, we introduce CatV2TON, a simple and effective vision-based virtual try-on (V2TON) method that supports both image and video try-on tasks with a single diffusion transformer model. By temporally concatenating garment and person inputs and training on a mix of image and video datasets, CatV2TON achieves robust try-on performance across static and dynamic settings. For efficient long-video generation, we propose an overlapping clip-based inference strategy that uses sequential frame guidance and Adaptive Clip Normalization (AdaCN) to maintain temporal consistency with reduced resource demands. We also present ViViD-S, a refined video try-on dataset, achieved by filtering back-facing frames and applying 3D mask smoothing for enhanced temporal consistency. Comprehensive experiments demonstrate that CatV2TON outperforms existing methods in both image and video try-on tasks, offering a versatile and reliable solution for realistic virtual try-ons across diverse scenarios.
摘要：虚拟试穿 (VTON) 技术因其能够实现图像和视频中逼真的服装可视化，从而改变在线零售而备受关注。然而，大多数现有方法都难以在图像和视频试穿任务中获得高质量的结果，尤其是在长视频场景中。在这项工作中，我们介绍了 CatV2TON，这是一种简单有效的基于视觉的虚拟试穿 (V2TON) 方法，它使用单个扩散变压器模型支持图像和视频试穿任务。通过在时间上连接服装和人员输入并在混合图像和视频数据集上进行训练，CatV2TON 在静态和动态设置中实现了强大的试穿性能。为了高效地生成长视频，我们提出了一种基于重叠剪辑的推理策略，该策略使用顺序帧引导和自适应剪辑规范化 (AdaCN) 来保持时间一致性并减少资源需求。我们还介绍了 ViViD-S，这是一个精炼的视频试穿数据集，通过过滤背面帧并应用 3D 蒙版平滑来增强时间一致性。综合实验表明，CatV2TON 在图像和视频试穿任务中的表现均优于现有方法，为跨不同场景的真实虚拟试穿提供了多功能且可靠的解决方案。
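
Note: the abstract describes an overlapping clip-based inference strategy for long videos. The sketch below shows only the generic windowing step, splitting a frame range into clips that share a few frames with their predecessor. It does not model the paper's sequential frame guidance or AdaCN, and the function name and parameters are assumptions for illustration.

```python
def overlapping_clips(num_frames, clip_len, overlap):
    """Split [0, num_frames) into clips of up to clip_len frames.

    Consecutive clips share `overlap` frames; the last clip may be shorter.
    Returns a list of half-open (start, end) index pairs.
    """
    if not 0 <= overlap < clip_len:
        raise ValueError("overlap must satisfy 0 <= overlap < clip_len")
    if num_frames <= 0:
        return []
    stride = clip_len - overlap
    clips = []
    start = 0
    while True:
        end = min(start + clip_len, num_frames)
        clips.append((start, end))
        if end == num_frames:
            break
        start += stride
    return clips
```

For example, a 10-frame video with 4-frame clips and a 1-frame overlap yields the windows (0, 4), (3, 7), and (6, 10); the shared frames are where a method could carry information from one clip into the next.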

Title: GenVidBench: A Challenging Benchmark for Detecting AI-Generated Video

Authors: Zhenliang Ni, Qiangyu Yan, Mouxiao Huang, Tianning Yuan, Yehui Tang, Hailin Hu, Xinghao Chen, Yunhe Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.11340
Pdf URL: https://arxiv.org/pdf/2501.11340
Copy Paste: [[2501.11340]] GenVidBench: A Challenging Benchmark for Detecting AI-Generated Video(https://arxiv.org/abs/2501.11340)
Keywords: generation, generative
Abstract: The rapid advancement of video generation models has made it increasingly challenging to distinguish AI-generated videos from real ones. This issue underscores the urgent need for effective AI-generated video detectors to prevent the dissemination of false information through such videos. However, the development of high-performance generative video detectors is currently impeded by the lack of large-scale, high-quality datasets specifically designed for generative video detection. To this end, we introduce GenVidBench, a challenging AI-generated video detection dataset with several key advantages: 1) Cross Source and Cross Generator: The cross-generation source mitigates the interference of video content on the detection. The cross-generator ensures diversity in video attributes between the training and test sets, preventing them from being overly similar. 2) State-of-the-Art Video Generators: The dataset includes videos from 8 state-of-the-art AI video generators, ensuring that it covers the latest advancements in the field of video generation. 3) Rich Semantics: The videos in GenVidBench are analyzed from multiple dimensions and classified into various semantic categories based on their content. This classification ensures that the dataset is not only large but also diverse, aiding in the development of more generalized and effective detection models. We conduct a comprehensive evaluation of different advanced video generators and present a challenging setting. Additionally, we present rich experimental results including advanced video classification models as baselines. With the GenVidBench, researchers can efficiently develop and evaluate AI-generated video detection models. Datasets and code are available at this https URL.
摘要：视频生成模型的快速进步使得区分 AI 生成的视频和真实视频变得越来越困难。这个问题凸显了对有效的 AI 生成视频检测器的迫切需求，以防止通过此类视频传播虚假信息。然而，由于缺乏专门为生成视频检测设计的大规模高质量数据集，高性能生成视频检测器的开发目前受到阻碍。为此，我们推出了 GenVidBench，这是一个具有挑战性的 AI 生成视频检测数据集，它具有几个关键优势：1）跨源和跨生成器：跨生成源减轻了视频内容对检测的干扰。交叉生成器确保训练集和测试集之间视频属性的多样性，防止它们过于相似。2）最先进的视频生成器：该数据集包括来自 8 个最先进的 AI 视频生成器的视频，确保它涵盖了视频生成领域的最新进展。 3) 丰富的语义：GenVidBench 中的视频从多个维度进行分析，并根据其内容分为各种语义类别。这种分类确保数据集不仅庞大，而且多样化，有助于开发更通用、更有效的检测模型。我们对不同的高级视频生成器进行了全面评估，并提出了一个具有挑战性的设置。此外，我们还展示了丰富的实验结果，包括高级视频分类模型作为基线。借助 GenVidBench，研究人员可以高效地开发和评估 AI 生成的视频检测模型。数据集和代码可在此 https URL 上找到。

Title: Block Flow: Learning Straight Flow on Data Blocks

Authors: Zibin Wang, Zhiyuan Ouyang, Xiangyun Zhang
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2501.11361
Pdf URL: https://arxiv.org/pdf/2501.11361
Copy Paste: [[2501.11361]] Block Flow: Learning Straight Flow on Data Blocks(https://arxiv.org/abs/2501.11361)
Keywords: generation, generative
Abstract: Flow-matching models provide a powerful framework for various applications, offering efficient sampling and flexible probability path modeling. These models are characterized by flows with low curvature in learned generative trajectories, which results in reduced truncation error at each sampling step. To further reduce curvature, we propose block matching. This novel approach leverages label information to partition the data distribution into blocks and match them with a prior distribution parameterized using the same label information, thereby learning straighter flows. We demonstrate that the variance of the prior distribution can control the curvature upper bound of forward trajectories in flow-matching models. By designing flexible regularization strategies to adjust this variance, we achieve optimal generation performance, effectively balancing the trade-off between maintaining diversity in generated samples and minimizing numerical solver errors. Our results demonstrate competitive performance with models of the same parameter scale. Code is available at this https URL.
摘要：流匹配模型为各种应用提供了强大的框架，提供高效的采样和灵活的概率路径建模。这些模型的特点是学习到的生成轨迹中的流具有低曲率，从而减少了每个采样步骤中的截断误差。为了进一步减少曲率，我们提出了块匹配。这种新方法利用标签信息将数据分布划分为块，并将它们与使用相同标签信息参数化的先验分布进行匹配，从而学习更直的流。我们证明先验分布的方差可以控制流匹配模型中前向轨迹的曲率上限。通过设计灵活的正则化策略来调整这种方差，我们实现了最佳生成性能，有效地平衡了保持生成样本的多样性和最小化数值求解器误差之间的权衡。我们的结果与相同参数的模型相比具有竞争力（此 http URL 可在 \url{此 https URL} 处获得）。

Title: A Survey on Diffusion Models for Anomaly Detection

Authors: Jing Liu, Zhenchao Ma, Zepu Wang, Yang Liu, Zehua Wang, Peng Sun, Liang Song, Bo Hu, Azzedine Boukerche, Victor C.M. Leung
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2501.11430
Pdf URL: https://arxiv.org/pdf/2501.11430
Copy Paste: [[2501.11430]] A Survey on Diffusion Models for Anomaly Detection(https://arxiv.org/abs/2501.11430)
Keywords: generative
Abstract: Diffusion models (DMs) have emerged as a powerful class of generative AI models, showing remarkable potential in anomaly detection (AD) tasks across various domains, such as cybersecurity, fraud detection, healthcare, and manufacturing. The intersection of these two fields, termed diffusion models for anomaly detection (DMAD), offers promising solutions for identifying deviations in increasingly complex and high-dimensional data. In this survey, we systematically review recent advances in DMAD research and investigate their capabilities. We begin by presenting the fundamental concepts of AD and DMs, followed by a comprehensive analysis of classic DM architectures including DDPMs, DDIMs, and Score SDEs. We further categorize existing DMAD methods into reconstruction-based, density-based, and hybrid approaches, providing detailed examinations of their methodological innovations. We also explore the diverse tasks across different data modalities, encompassing image, time series, video, and multimodal data analysis. Furthermore, we discuss critical challenges and emerging research directions, including computational efficiency, model interpretability, robustness enhancement, edge-cloud collaboration, and integration with large language models. The collection of DMAD research papers and resources is available at this https URL.
摘要：扩散模型 (DM) 已成为一类强大的生成式 AI 模型，在网络安全、欺诈检测、医疗保健和制造业等各个领域的异常检测 (AD) 任务中表现出巨大潜力。这两个领域的交集称为异常检测扩散模型 (DMAD)，它为识别日益复杂和高维数据中的偏差提供了有希望的解决方案。在本调查中，我们系统地回顾了 DMAD 研究的最新进展并研究了它们的能力。我们首先介绍 AD 和 DM 的基本概念，然后全面分析经典 DM 架构，包括 DDPM、DDIM 和 Score SDE。我们进一步将现有的 DMAD 方法分为基于重建、基于密度和混合方法，并详细研究了它们的方法创新。我们还探索了不同数据模态中的各种任务，包括图像、时间序列、视频和多模态数据分析。此外，我们还讨论了关键挑战和新兴研究方向，包括计算效率、模型可解释性、鲁棒性增强、边缘云协作以及与大型语言模型的集成。DMAD 研究论文和资源的集合可在此 https URL 上找到。

Title: UltraFusion: Ultra High Dynamic Imaging using Exposure Fusion

Authors: Zixuan Chen, Yujin Wang, Xin Cai, Zhiyuan You, Zheming Lu, Fan Zhang, Shi Guo, Tianfan Xue
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.11515
Pdf URL: https://arxiv.org/pdf/2501.11515
Copy Paste: [[2501.11515]] UltraFusion: Ultra High Dynamic Imaging using Exposure Fusion(https://arxiv.org/abs/2501.11515)
Keywords: generative
Abstract: Capturing high dynamic range (HDR) scenes is one of the most important issues in camera design. The majority of cameras use exposure fusion, which fuses images captured at different exposure levels, to increase dynamic range. However, this approach can only handle images with a limited exposure difference, normally 3-4 stops. When applied to very high dynamic range scenes where a large exposure difference is required, this approach often fails due to incorrect alignment or inconsistent lighting between inputs, or tone mapping artifacts. In this work, we propose UltraFusion, the first exposure fusion technique that can merge inputs with a 9-stop difference. The key idea is that we model exposure fusion as a guided inpainting problem, where the under-exposed image is used as guidance to fill in the missing highlight information in the over-exposed region. Using the under-exposed image as soft guidance, instead of a hard constraint, our model is robust to potential alignment issues or lighting variations. Moreover, utilizing the image prior of the generative model, our model also generates natural tone mapping, even for very high dynamic range scenes. Our approach outperforms HDR-Transformer on the latest HDR benchmarks. Moreover, to test its performance in ultra high dynamic range scenes, we capture a new real-world exposure fusion benchmark, UltraFusion Dataset, with exposure differences up to 9 stops, and experiments show that UltraFusion can generate beautiful and high-quality fusion results under various scenarios. An online demo is provided at this https URL.
摘要：捕捉高动态范围 (HDR) 场景是相机设计中最重要的问题之一。大多数相机使用曝光融合技术，将不同曝光水平拍摄的图像融合在一起，以增加动态范围。然而，这种方法只能处理曝光差异有限的图像，通常为 3-4 档。当应用于需要大曝光差异的非常高动态场景时，这种方法通常会由于输入之间的对齐不正确或照明不一致或色调映射伪影而失败。在这项工作中，我们提出了 UltraFusion，这是第一种可以合并具有 9 档差异的输入的曝光融合技术。关键思想是我们将曝光融合建模为引导修复问题，其中曝光不足的图像用作指导，以填充过度曝光区域中过度曝光高光的缺失信息。使用曝光不足的图像作为软指导，而不是硬约束，我们的模型对潜在的对齐问题或照明变化具有鲁棒性。此外，利用生成模型的图像先验，我们的模型还可以生成自然的色调映射，即使对于非常高动态范围的场景也是如此。我们的方法在最新的 HDR 基准测试中优于 HDR-Transformer。此外，为了测试其在超高动态范围场景中的表现，我们捕获了一个新的真实世界曝光融合基准 UltraFusion 数据集，曝光差异高达 9 档，实验表明 \model~ 可以在各种场景下生成美观且高质量的融合结果。此 https URL 提供了在线演示。
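
To put the "stops" figures in the abstract above into perspective, the short sketch below converts an exposure difference in stops into a linear exposure ratio. It relies only on the photographic convention that one stop is a factor of two in captured light; the function name `exposure_ratio` is made up for this illustration and is not taken from the paper.

```python
def exposure_ratio(stops: float) -> float:
    """Linear exposure ratio for a difference given in stops.

    One stop doubles (or halves) the amount of captured light,
    so a difference of n stops corresponds to a factor of 2**n.
    """
    return 2.0 ** stops


# Compare the usual 3-4 stop range with the 9-stop setting from the abstract.
for stops in (3, 4, 9):
    print(f"{stops} stops -> {exposure_ratio(stops):.0f}x exposure ratio")
```

Under this convention, a 9-stop pair differs in exposure by a factor of 512, against 8 to 16 for the 3-4 stop pairs that conventional exposure fusion handles.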

Title: Explainable Lane Change Prediction for Near-Crash Scenarios Using Knowledge Graph Embeddings and Retrieval Augmented Generation

Authors: M. Manzour, A. Ballardini, R. Izquierdo, M. Á. Sotelo
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2501.11560
Pdf URL: https://arxiv.org/pdf/2501.11560
Copy Paste: [[2501.11560]] Explainable Lane Change Prediction for Near-Crash Scenarios Using Knowledge Graph Embeddings and Retrieval Augmented Generation(https://arxiv.org/abs/2501.11560)
Keywords: generation
Abstract: Lane-changing maneuvers, particularly those executed abruptly or in risky situations, are a significant cause of road traffic accidents. However, current research mainly focuses on predicting safe lane changes. Furthermore, existing accident datasets are often based on images only and lack comprehensive sensory data. In this work, we focus on predicting risky lane changes, using the CRASH dataset (which we collected specifically for risky lane changes), and safe lane changes, using the HighD dataset. We then leverage knowledge graphs (KGs) and Bayesian inference to predict these maneuvers using linguistic contextual information, enhancing the model's interpretability and transparency. The model achieved a 91.5% F1-score with anticipation time extending to four seconds for risky lane changes, and a 90.0% F1-score for predicting safe lane changes with the same anticipation time. We validate our model by integrating it into a vehicle within the CARLA simulator in scenarios that involve risky lane changes. The model managed to anticipate sudden lane changes, thus providing automated vehicles with further time to plan and execute appropriate safe reactions. Finally, to enhance the explainability of our model, we utilize retrieval augmented generation (RAG) to provide clear, natural language explanations for the given prediction.
摘要：车道变换操作，尤其是突然或在危险情况下执行的变换操作，是导致道路交通事故的重要原因。然而，当前的研究主要集中在预测安全车道变换。此外，现有的事故数据集通常仅基于图像，缺乏全面的传感数据。在这项工作中，我们专注于使用 CRASH 数据集（我们自己收集的专门用于危险车道变换的数据集）和安全车道变换（使用 HighD 数据集）来预测危险车道变换。然后，我们利用 KG 和贝叶斯推理，使用语言上下文信息来预测这些操作，从而增强模型的可解释性和透明度。该模型在危险车道变换的预测时间延长至四秒的情况下获得了 91.5% 的 f1 分数，在预测安全车道变换时获得了 90.0% 的 f1 分数，预测时间相同。我们通过将模型集成到 CARLA 模拟器中的车辆中来验证模型，该场景涉及危险车道变换。该模型成功预测了突然的车道变换，从而为自动驾驶汽车提供了更多时间来规划和执行适当的安全反应。最后，为了增强模型的可解释性，我们利用 RAG 为给定的预测提供清晰而自然的语言解释。

Title: Teaching Large Language Models to Regress Accurate Image Quality Scores using Score Distribution

Authors: Zhiyuan You, Xin Cai, Jinjin Gu, Tianfan Xue, Chao Dong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.11561
Pdf URL: https://arxiv.org/pdf/2501.11561
Copy Paste: [[2501.11561]] Teaching Large Language Models to Regress Accurate Image Quality Scores using Score Distribution(https://arxiv.org/abs/2501.11561)
Keywords: quality assessment
Abstract: With the rapid advancement of Multi-modal Large Language Models (MLLMs), MLLM-based Image Quality Assessment (IQA) methods have shown promising performance in linguistic quality description. However, current methods still fall short in accurately scoring image quality. In this work, we aim to leverage MLLMs to regress accurate quality scores. A key challenge is that the quality score is inherently continuous, typically modeled as a Gaussian distribution, whereas MLLMs generate discrete token outputs. This mismatch necessitates score discretization. Previous approaches discretize the mean score into a one-hot label, resulting in information loss and failing to capture inter-image relationships. We propose a distribution-based approach that discretizes the score distribution into a soft label. This method preserves the characteristics of the score distribution, achieving high accuracy and maintaining inter-image relationships. Moreover, to address dataset variation, where different IQA datasets exhibit different distributions, we introduce a fidelity loss based on Thurstone's model. This loss captures intra-dataset relationships, facilitating co-training across multiple IQA datasets. With these designs, we develop the distribution-based Depicted image Quality Assessment model for Score regression (DeQA-Score). Experiments across multiple benchmarks show that DeQA-Score consistently outperforms baselines in score regression. DeQA-Score can also predict a score distribution that closely aligns with human annotations. Code and model weights have been released at this https URL.
摘要：随着多模态大型语言模型 (MLLM) 的快速发展，基于 MLLM 的图像质量评估 (IQA) 方法在语言质量描述方面表现出色。然而，目前的方法在准确评分图像质量方面仍然存在不足。在这项工作中，我们旨在利用 MLLM 来回归准确的质量分数。一个关键的挑战是质量分数本质上是连续的，通常建模为高斯分布，而 MLLM 生成离散的标记输出。这种不匹配需要分数离散化。以前的方法将平均分数离散化为独热标签，导致信息丢失并且无法捕捉图像间关系。我们提出了一种基于分布的方法，将分数分布离散化为软标签。该方法保留了分数分布的特征，实现了高精度并保持了图像间关系。此外，为了解决数据集变化问题，不同的 IQA 数据集表现出不同的分布，我们基于 Thurstone 模型引入了保真度损失。这种损失可以捕捉数据集内的关系，从而促进跨多个 IQA 数据集的联合训练。通过这些设计，我们开发了基于分布的描绘图像质量评估模型，用于分数回归 (DeQA-Score)。跨多个基准的实验表明，DeQA-Score 在分数回归方面的表现稳定优于基线。此外，DeQA-Score 可以预测与人工注释紧密一致的分数分布。代码和模型权重已在此 https URL 中发布。
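
The contrast between a one-hot label and a soft label can be sketched as follows. This is a minimal illustration, not the DeQA-Score implementation: the five levels centred on 1..5, the unit-wide bins, and the renormalisation step are assumptions made for the example, and the names `soft_label`, `one_hot_label`, and `expected_score` are hypothetical.

```python
import math


def normal_cdf(x: float, mean: float, std: float) -> float:
    """Cumulative distribution function of N(mean, std^2)."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def soft_label(mean, std, centers=(1, 2, 3, 4, 5), half_width=0.5):
    """Probability mass of the Gaussian score in each level's bin.

    Assumption: level c owns the interval [c - half_width, c + half_width];
    masses are renormalised so the label sums to 1 despite truncated tails.
    """
    masses = [
        normal_cdf(c + half_width, mean, std) - normal_cdf(c - half_width, mean, std)
        for c in centers
    ]
    total = sum(masses)
    return [m / total for m in masses]


def one_hot_label(mean, centers=(1, 2, 3, 4, 5)):
    """Baseline: put all mass on the level nearest to the mean score."""
    nearest = min(range(len(centers)), key=lambda i: abs(centers[i] - mean))
    return [1.0 if i == nearest else 0.0 for i in range(len(centers))]


def expected_score(label, centers=(1, 2, 3, 4, 5)):
    """Recover a continuous score as the expectation over levels."""
    return sum(p * c for p, c in zip(label, centers))


# Two images with different mean scores collapse to the same one-hot label,
# while their soft labels still order them correctly.
for mean in (2.6, 3.4):
    soft = soft_label(mean, std=0.6)
    print(mean, one_hot_label(mean), [round(p, 3) for p in soft],
          round(expected_score(soft), 3))
```

In this toy setting both images map to level 3 under one-hot discretization, which is the information loss the abstract describes; the soft labels keep the two images distinguishable.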

Title: Recurrent Diffusion for Large-Scale Parameter Generation

Authors: Kai Wang, Dongwen Tang, Wangbo Zhao, Yang You
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2501.11587
Pdf URL: https://arxiv.org/pdf/2501.11587
Copy Paste: [[2501.11587]] Recurrent Diffusion for Large-Scale Parameter Generation(https://arxiv.org/abs/2501.11587)
Keywords: generation
Abstract: Parameter generation has long struggled to scale up, significantly limiting its range of applications. In this study, we introduce Recurrent diffusion for large-scale Parameter Generation, called RPG. We first divide the trained parameters into non-overlapping parts, after which a recurrent model is proposed to learn their relationships. The recurrent model's outputs are then fed, as conditions, into a diffusion model to generate the neural network parameters. Using only a single GPU, recurrent diffusion enables us to generate popular vision and language models such as ConvNeXt-L and LoRA parameters of LLaMA-7B. Meanwhile, across various architectures and tasks, the generated parameters consistently achieve performance comparable to that of trained networks. Notably, our approach also shows the potential to generate models for handling unseen tasks, which largely increases the practicality of parameter generation. Our code is available at this https URL.
摘要：长期以来，参数生成一直难以扩大规模，这极大地限制了其应用范围。在本研究中，我们引入了 \textbf{R} 循环扩散，用于大规模 \textbf{P} 参数 \textbf{G} 生成，称为 \textbf{RPG}。我们首先将训练后的参数分成不重叠的部分，然后提出一个循环模型来学习它们之间的关系。然后将循环模型的输出作为条件输入到扩散模型中以生成神经网络参数。仅使用单个 GPU，循环扩散使我们能够生成流行的视觉和语言模型，例如 LLaMA-7B 的 ConvNeXt-L 和 LoRA 参数。同时，在各种架构和任务中，生成的参数在经过训练的网络上始终表现出可比的结果。值得注意的是，我们的方法还显示出生成处理看不见的任务的模型的潜力，这大大提高了参数生成的实用性。我们的代码可在 \href{此 https URL}{此处} 获得。
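
The first step in the abstract, dividing trained parameters into non-overlapping parts, can be sketched as below. This is only an illustration under stated assumptions: the parameters are treated as one flat list, the part size is fixed, and the last part is zero-padded. The helper names `split_parameters` and `merge_parameters` are hypothetical, and the way RPG actually partitions and normalises parameters may differ.

```python
def split_parameters(flat_params, part_size, pad_value=0.0):
    """Cut a flat parameter list into equal, non-overlapping parts.

    The final part is padded with pad_value so that every part has
    the same length (an assumption made for this sketch).
    """
    if part_size <= 0:
        raise ValueError("part_size must be positive")
    parts = []
    for start in range(0, len(flat_params), part_size):
        part = list(flat_params[start:start + part_size])
        part.extend([pad_value] * (part_size - len(part)))
        parts.append(part)
    return parts


def merge_parameters(parts, original_length):
    """Inverse of split_parameters: concatenate parts and drop the padding."""
    flat = [value for part in parts for value in part]
    return flat[:original_length]


params = [0.1 * i for i in range(10)]      # stand-in for flattened weights
parts = split_parameters(params, part_size=4)
print(len(parts), [len(p) for p in parts])
assert merge_parameters(parts, len(params)) == params
```

Each part would then act as one element of the sequence that a recurrent model consumes; the round-trip check confirms that splitting loses no parameters.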

Title: GL-ICNN: An End-To-End Interpretable Convolutional Neural Network for the Diagnosis and Prediction of Alzheimer's Disease

Authors: Wenjie Kang, Lize Jiskoot, Peter De Deyn, Geert Biessels, Huiberdina Koek, Jurgen Claassen, Huub Middelkoop, Wiesje Flier, Willemijn J. Jansen, Stefan Klein, Esther Bron
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2501.11715
Pdf URL: https://arxiv.org/pdf/2501.11715
Copy Paste: [[2501.11715]] GL-ICNN: An End-To-End Interpretable Convolutional Neural Network for the Diagnosis and Prediction of Alzheimer's Disease(https://arxiv.org/abs/2501.11715)
Keywords: generative
Abstract: Deep learning methods based on Convolutional Neural Networks (CNNs) have shown great potential to improve early and accurate diagnosis of Alzheimer's disease (AD) dementia based on imaging data. However, these methods have yet to be widely adopted in clinical practice, possibly due to the limited interpretability of deep learning models. The Explainable Boosting Machine (EBM) is a glass-box model but cannot learn features directly from input imaging data. In this study, we propose a novel interpretable model that combines CNNs and EBMs for the diagnosis and prediction of AD. We develop an innovative training strategy that alternately trains the CNN component as a feature extractor and the EBM component as the output block to form an end-to-end model. The model takes imaging data as input and provides both predictions and interpretable feature importance measures. We validated the proposed model on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset, with the Health-RI Parelsnoer Neurodegenerative Diseases Biobank (PND) as an external testing set. The proposed model achieved an area-under-the-curve (AUC) of 0.956 for AD and control classification, and 0.694 for the prediction of conversion of mild cognitive impairment (MCI) to AD on the ADNI cohort. The proposed model is a glass-box model that achieves performance comparable to other state-of-the-art black-box models. Our code is publicly available at: this https URL.
摘要：基于卷积神经网络 (CNN) 的深度学习方法已显示出巨大的潜力，可以改善基于影像数据的阿尔茨海默病 (AD) 痴呆的早期和准确诊断。然而，这些方法尚未在临床实践中得到广泛采用，这可能是因为深度学习模型的可解释性有限。可解释增强机 (EBM) 是一个玻璃盒模型，但不能直接从输入的影像数据中学习特征。在本研究中，我们提出了一种结合 CNN 和 EBM 的新型可解释模型，用于 AD 的诊断和预测。我们开发了一种创新的训练策略，交替训练 CNN 组件作为特征提取器，EBM 组件作为输出块，以形成端到端模型。该模型以影像数据作为输入，并提供预测和可解释的特征重要性度量。我们在阿尔茨海默病神经影像计划 (ADNI) 数据集和 Health-RI Parelsnoer 神经退行性疾病生物库 (PND) 作为外部测试集上验证了所提出的模型。所提出的模型在 AD 和对照分类中实现了 0.956 的曲线下面积 (AUC)，在 ADNI 队列中预测轻度认知障碍 (MCI) 转化为 AD 的曲线下面积 (AUC) 为 0.694。所提出的模型是一个玻璃盒模型，其性能可与其他最先进的黑盒模型相媲美。我们的代码可公开获取：此 https URL。

Title: SILO: Solving Inverse Problems with Latent Operators

Authors: Ron Raphaeli, Sean Man, Michael Elad
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2501.11746
Pdf URL: https://arxiv.org/pdf/2501.11746
Copy Paste: [[2501.11746]] SILO: Solving Inverse Problems with Latent Operators(https://arxiv.org/abs/2501.11746)
Keywords: restoration
Abstract: Consistent improvement of image priors over the years has led to the development of better inverse problem solvers. Diffusion models are the newcomers to this arena, posing the strongest known prior to date. Recently, such models operating in a latent space have become increasingly predominant due to their efficiency. In recent works, these models have been applied to solve inverse problems. Working in the latent space typically requires multiple applications of an Autoencoder during the restoration process, which leads to both computational and restoration quality challenges. In this work, we propose a new approach for handling inverse problems with latent diffusion models, where a learned degradation function operates within the latent space, emulating a known image space degradation. Usage of the learned operator reduces the dependency on the Autoencoder to only the initial and final steps of the restoration process, facilitating faster sampling and superior restoration quality. We demonstrate the effectiveness of our method on a variety of image restoration tasks and datasets, achieving significant improvements over prior art.
摘要：多年来，图像先验的持续改进导致了更好的逆问题求解器的发展。扩散模型是这个领域的新手，是迄今为止已知的最强先验。最近，这种在潜在空间中运行的模型由于其效率而变得越来越占主导地位。在最近的研究中，这些模型已被用于解决逆问题。在潜在空间中工作通常需要在恢复过程中多次应用自动编码器，这会带来计算和恢复质量方面的挑战。在这项工作中，我们提出了一种使用潜在扩散模型处理逆问题的新方法，其中学习到的退化函数在潜在空间内运行，模拟已知的图像空间退化。使用学习到的运算符将对自动编码器的依赖减少到仅恢复过程的初始和最终步骤，从而促进更快的采样和卓越的恢复质量。我们证明了我们的方法在各种图像恢复任务和数据集上的有效性，与现有技术相比取得了显着的改进。

Title: Are generative models fair? A study of racial bias in dermatological image generation

Authors: Miguel López-Pérez, Søren Hauberg, Aasa Feragen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.11752
Pdf URL: https://arxiv.org/pdf/2501.11752
Copy Paste: [[2501.11752]] Are generative models fair? A study of racial bias in dermatological image generation(https://arxiv.org/abs/2501.11752)
Keywords: generation, generative
Abstract: Racial bias in medicine, particularly in dermatology, presents significant ethical and clinical challenges. It often results from the underrepresentation of darker skin tones in training datasets for machine learning models. While efforts to address bias in dermatology have focused on improving dataset diversity and mitigating disparities in discriminative models, the impact of racial bias on generative models remains underexplored. Generative models, such as Variational Autoencoders (VAEs), are increasingly used in healthcare applications, yet their fairness across diverse skin tones is currently not well understood. In this study, we evaluate the fairness of generative models in clinical dermatology with respect to racial bias. For this purpose, we first train a VAE with a perceptual loss to generate and reconstruct high-quality skin images across different skin tones. We utilize the Fitzpatrick17k dataset to examine how racial bias influences the representation and performance of these models. Our findings indicate that the VAE is influenced by the diversity of skin tones in the training dataset, with better performance observed for lighter skin tones. Additionally, the uncertainty estimates produced by the VAE are ineffective in assessing the model's fairness. These results highlight the need for improved uncertainty quantification mechanisms to detect and address racial bias in generative models for trustworthy healthcare technologies.
摘要：医学领域（尤其是皮肤病学领域）的种族偏见带来了重大的伦理和临床挑战。它通常是由于机器学习模型的训练数据集中深色肤色的代表性不足造成的。虽然解决皮肤病学偏见的努力主要集中在提高数据集多样性和减轻判别模型中的差异，但种族偏见对生成模型的影响仍未得到充分探索。变分自动编码器 (VAE) 等生成模型在医疗保健应用中的使用越来越多，但它们在不同肤色之间的公平性目前尚不清楚。在本研究中，我们评估了临床皮肤病学中生成模型在种族偏见方面的公平性。为此，我们首先训练一个具有感知损失的 VAE，以生成和重建不同肤色的高质量皮肤图像。我们利用 Fitzpatrick17k 数据集来研究种族偏见如何影响这些模型的表示和性能。我们的研究结果表明，VAE 受到训练数据集中肤色多样性的影响，肤色越浅，效果越好。此外，VAE 产生的不确定性估计无法有效评估模型的公平性。这些结果凸显了改进不确定性量化机制的必要性，以便检测和解决生成模型中值得信赖的医疗技术的种族偏见。

Title: EfficientVITON: An Efficient Virtual Try-On Model using Optimized Diffusion Process

Authors: Mostafa Atef, Mariam Ayman, Ahmed Rashed, Ashrakat Saeed, Abdelrahman Saeed, Ahmed Fares
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.11776
Pdf URL: https://arxiv.org/pdf/2501.11776
Copy Paste: [[2501.11776]] EfficientVITON: An Efficient Virtual Try-On Model using Optimized Diffusion Process(https://arxiv.org/abs/2501.11776)
Keywords: generation
Abstract: Wouldn't it be much more convenient for everybody to try on clothes just by looking into a mirror? The answer to that problem is virtual try-on, which enables users to digitally experiment with outfits. The core challenge lies in realistic image-to-image translation, where clothing must fit diverse human forms, poses, and figures. Early methods, which used 2D transformations, offered speed, but image quality was often disappointing and lacked the nuance of deep learning. Though GAN-based techniques enhanced realism, their dependence on paired data proved limiting. More adaptable methods offered great visuals but demanded significant computing power and time. Recent advances in diffusion models have shown promise for high-fidelity translation, yet the current crop of virtual try-on tools still struggles with detail loss and warping issues. To tackle these challenges, this paper proposes EfficientVITON, a new virtual try-on system leveraging the pre-trained Stable Diffusion model for better images and deployment feasibility. The system includes a spatial encoder to maintain the clothing's finer details and zero cross-attention blocks to capture the subtleties of how clothes fit a human body. Input images are carefully prepared, and the diffusion process has been tweaked to significantly cut generation time without image quality loss. The training process involves two distinct stages of fine-tuning, carefully incorporating a balance of loss functions to ensure both accurate try-on results and high-quality visuals. Rigorous testing on the VITON-HD dataset, supplemented with real-world examples, has demonstrated that EfficientVITON achieves state-of-the-art results.
摘要：如果每个人只需照镜子就能试穿衣服，那不是方便得多吗？这个问题的答案就是虚拟试穿，它使用户能够以数字方式试验服装。核心挑战在于逼真的图像到图像的转换，其中服装必须适合不同的人体形态、姿势和身材。早期的方法使用 2D 变换，速度很快，但图像质量往往令人失望，并且缺乏深度学习的细微差别。虽然基于 GAN 的技术增强了真实感，但它们对配对数据的依赖却受到限制。更具适应性的方法提供了出色的视觉效果，但需要大量的计算能力和时间。扩散模型的最新进展显示出高保真转换的前景，但目前的虚拟试穿工具仍然存在细节丢失和扭曲问题。为了应对这些挑战，本文提出了 EfficientVITON，这是一种新的虚拟试穿系统，利用令人印象深刻的预训练稳定扩散模型来获得更好的图像和部署可行性。该系统包括一个空间编码器，用于保留服装的精细细节，以及零交叉注意块，用于捕捉服装如何贴合人体的细微之处。输入图像经过精心准备，扩散过程经过调整，在不损失图像质量的情况下显著缩短了生成时间。训练过程涉及两个不同的微调阶段，精心平衡了损失函数，以确保准确的试穿结果和高质量的视觉效果。对 VITON-HD 数据集的严格测试，加上现实世界的例子，已经证明 EfficientVITON 取得了最先进的结果。

Title: Glinthawk: A Two-Tiered Architecture for High-Throughput LLM Inference

Authors: Pouya Hamadanian, Sadjad Fouladi
Subjects: cs.LG, cs.DC, cs.PF
Abstract URL: https://arxiv.org/abs/2501.11779
Pdf URL: https://arxiv.org/pdf/2501.11779
Copy Paste: [[2501.11779]] Glinthawk: A Two-Tiered Architecture for High-Throughput LLM Inference(https://arxiv.org/abs/2501.11779)
Keywords: generation
Abstract: Large Language Models (LLMs) have revolutionized natural language processing, but their inference demands substantial resources while under-utilizing high-end accelerators like GPUs. A major bottleneck arises from the attention mechanism, which requires storing large key-value caches, limiting the maximum achievable throughput to far below what the available computing resources allow. Current approaches attempt to mitigate this issue through memory-efficient attention and paging mechanisms, but remain constrained by the assumption that all operations must be performed on high-end accelerators. In this work, we propose Glinthawk, a two-tiered architecture that decouples the attention mechanism from the rest of the Transformer model. This approach allows the memory requirements for attention to scale independently, enabling larger batch sizes and more efficient use of the high-end accelerators. We prototype Glinthawk with NVIDIA T4 GPUs as one tier and standard CPU VMs as the other. Compared to a traditional single-tier setup, it improves throughput by 5.9× and reduces cost of generation by 2.8×. For longer sequence lengths, it achieves a 16.3× throughput improvement at 2.4× lower cost. Our evaluation shows that this architecture can tolerate moderate network latency with minimal performance degradation, making it highly effective for latency-tolerant, throughput-oriented applications such as batch processing. We shared our prototype publicly at this https URL.
摘要：大型语言模型 (LLM) 彻底改变了自然语言处理，但它们的推理需要大量资源，而 GPU 等高端加速器的利用率却不足。注意力机制是主要瓶颈，它需要存储大型键值缓存，从而将最大可实现吞吐量限制在可用计算资源以下。当前的方法试图通过内存高效的注意力和分页机制来缓解此问题，但仍然受到所有操作都必须在高端加速器上执行的假设的限制。在这项工作中，我们提出了 Glinthawk，这是一种两层架构，将注意力机制与 Transformer 模型的其余部分分离。这种方法允许注意力的内存需求独立扩展，从而实现更大的批量大小并更有效地使用高端加速器。我们使用 NVIDIA T4 GPU 作为一层和标准 CPU VM 作为另一层来制作 Glinthawk 原型。与传统的单层设置相比，它将吞吐量提高了 $5.9\times$，并将生成成本降低了 $2.8\times$。对于较长的序列长度，它以降低 $2.4 倍的成本实现了 $16.3\times$ 的吞吐量提升。我们的评估表明，这种架构可以容忍中等网络延迟，同时将性能下降降至最低，因此对于延迟容忍、吞吐量导向的应用程序（如批处理）非常有效。我们在 \url{此 https URL} 上公开分享了我们的原型。
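
To see why key-value caches limit batch size, the sketch below estimates cache size with the standard accounting for decoder-only Transformers: one key and one value vector per layer, per KV head, per token. The model dimensions used in the example (32 layers, 32 KV heads, head dimension 128, 16-bit values) are illustrative values for a 7B-class model and are assumptions, not figures reported by the Glinthawk paper; the function name `kv_cache_bytes` is likewise hypothetical.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_value=2):
    """Size of the key-value cache in bytes.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


GIB = 2 ** 30

# Illustrative 7B-class configuration (assumed, see lead-in).
per_sequence = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128,
                              seq_len=2048, batch_size=1)
print(f"KV cache per 2048-token sequence: {per_sequence / GIB:.2f} GiB")

for batch_size in (1, 8, 64):
    total = kv_cache_bytes(32, 32, 128, 2048, batch_size)
    print(f"batch {batch_size:>2}: {total / GIB:.0f} GiB of KV cache")
```

Because the cache grows linearly with batch size and sequence length while the model weights stay fixed, accelerator memory tends to run out well before compute is saturated, which is the imbalance a two-tier design aims to relieve.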

Title: Generating visual explanations from deep networks using implicit neural representations

Authors: Michal Byra, Henrik Skibbe
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.11784
Pdf URL: https://arxiv.org/pdf/2501.11784
Copy Paste: [[2501.11784]] Generating visual explanations from deep networks using implicit neural representations(https://arxiv.org/abs/2501.11784)
Keywords: generation
Abstract: Explaining deep learning models in a way that humans can easily understand is essential for responsible artificial intelligence applications. Attribution methods constitute an important area of explainable deep learning. The attribution problem involves finding the parts of the network's input that are most responsible for the model's output. In this work, we demonstrate that implicit neural representations (INRs) constitute a good framework for generating visual explanations. Firstly, we utilize coordinate-based implicit networks to reformulate and extend the extremal perturbations technique and generate attribution masks. Experimental results confirm the usefulness of our method. For instance, by proper conditioning of the implicit network, we obtain attribution masks that are well-behaved with respect to the imposed area constraints. Secondly, we present an iterative INR-based method that can be used to generate multiple non-overlapping attribution masks for the same image. We show that a deep learning model may associate the image label both with the appearance of the object of interest and with areas and textures that usually accompany the object. Our study demonstrates that implicit networks are well-suited for the generation of attribution masks and can provide interesting insights about the performance of deep learning models.
摘要：以人类可以轻松理解的方式解释深度学习模型对于负责任的人工智能应用至关重要。归因方法构成了可解释深度学习的一个重要领域。归因问题涉及找到对模型输出最负责的网络输入部分。在这项工作中，我们证明了隐式神经表征 (INR) 构成了生成视觉解释的良好框架。首先，我们利用基于坐标的隐式网络来重新制定和扩展极值扰动技术并生成归因掩码。实验结果证实了我们方法的有效性。例如，通过适当调节隐式网络，我们获得了在施加的区域约束方面表现良好的归因掩码。其次，我们提出了一种基于 INR 的迭代方法，可用于为同一图像生成多个不重叠的归因掩码。我们描述深度学习模型可以将图像标签与感兴趣对象的外观以及通常伴随对象的区域和纹理相关联。我们的研究表明，隐式网络非常适合生成归因掩码，并且可以提供有关深度学习模型性能的有趣见解。

Title: CogMorph: Cognitive Morphing Attacks for Text-to-Image Models

Authors: Zonglei Jing, Zonghao Ying, Le Wang, Siyuan Liang, Aishan Liu, Xianglong Liu, Dacheng Tao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.11815
Pdf URL: https://arxiv.org/pdf/2501.11815
Copy Paste: [[2501.11815]] CogMorph: Cognitive Morphing Attacks for Text-to-Image Models(https://arxiv.org/abs/2501.11815)
Keywords: generation, generative
Abstract: The development of text-to-image (T2I) generative models, which enable the creation of high-quality synthetic images from textual prompts, has opened new frontiers in creative design and content generation. However, this paper reveals a significant and previously unrecognized ethical risk inherent in this technology and introduces a novel method, termed the Cognitive Morphing Attack (CogMorph), which manipulates T2I models to generate images that retain the original core subjects but embed toxic or harmful contextual elements. This nuanced manipulation exploits the cognitive principle that human perception of concepts is shaped by the entire visual scene and its context, producing images that amplify emotional harm far beyond attacks that merely preserve the original semantics. To address this, we first construct an imagery toxicity taxonomy spanning 10 major and 48 sub-categories, aligned with human cognitive-perceptual dimensions, and further build a toxicity risk matrix resulting in 1,176 high-quality T2I toxic prompts. Based on this, our CogMorph first introduces Cognitive Toxicity Augmentation, which develops a cognitive toxicity knowledge base with rich external toxic representations for humans (e.g., fine-grained visual features) that can be utilized to further guide the optimization of adversarial prompts. In addition, we present Contextual Hierarchical Morphing, which hierarchically extracts critical parts of the original prompt (e.g., scenes, subjects, and body parts), and then iteratively retrieves and fuses toxic features to inject harmful contexts. Extensive experiments on multiple open-sourced T2I models and black-box commercial APIs (e.g., DALLE-3) demonstrate the efficacy of CogMorph, which outperforms other baselines by large margins (+20.62% on average).
摘要：文本到图像 (T2I) 生成模型的开发，使得能够根据文本提示创建高质量的合成图像，为创意设计和内容生成开辟了新领域。然而，本文揭示了这项技术固有的重大且以前未被认识到的道德风险，并介绍了一种称为认知变形攻击 (CogMorph) 的新方法，该方法操纵 T2I 模型来生成保留原始核心主题但嵌入有毒或有害背景元素的图像。这种细微的操纵利用了人类对概念的感知受整个视觉场景及其背景影响的认知原理，产生的图像会放大情感伤害，远远超过仅仅保留原始语义的攻击。为了解决这个问题，我们首先构建了一个涵盖 10 个主要类别和 48 个子类别的图像毒性分类法，与人类的认知感知维度保持一致，并进一步构建毒性风险矩阵，从而产生 1,176 个高质量的 T2I 毒性提示。基于此，我们的 CogMorph 首先引入了认知毒性增强，它开发了一个认知毒性知识库，其中包含丰富的人类外部毒性表征（例如细粒度视觉特征），可用于进一步指导对抗性提示的优化。此外，我们提出了上下文分层变形，它分层提取原始提示的关键部分（例如场景、主题和身体部位），然后迭代检索和融合毒性特征以注入有害上下文。在多个开源 T2I 模型和黑盒商业 API（例如 DALLE-3）上进行的大量实验证明了 CogMorph 的有效性，其性能显著优于其他基线（平均 +20.62\%）。

Title: PXGen: A Post-hoc Explainable Method for Generative Models

Authors: Yen-Lung Huang, Ming-Hsi Weng, Hao-Tsung Yang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2501.11827
Pdf URL: https://arxiv.org/pdf/2501.11827
Copy Paste: [[2501.11827]] PXGen: A Post-hoc Explainable Method for Generative Models(https://arxiv.org/abs/2501.11827)
Keywords: generative
Abstract: With the rapid growth of generative AI in numerous applications, explainable AI (XAI) plays a crucial role in ensuring the responsible development and deployment of generative AI technologies. XAI has undergone notable advancements and widespread adoption in recent years, reflecting a concerted push to enhance the transparency, interpretability, and credibility of AI systems. Recent research emphasizes that a proficient XAI method should adhere to a set of criteria, primarily focusing on two key areas. Firstly, it should ensure the quality and fluidity of explanations, encompassing aspects like faithfulness, plausibility, completeness, and tailoring to individual needs. Secondly, the design principle of the XAI system or mechanism should cover factors such as reliability, resilience, the verifiability of its outputs, and the transparency of its algorithm. However, research in XAI for generative models remains relatively scarce, with little exploration into how such methods can effectively meet these criteria in that domain. In this work, we propose PXGen, a post-hoc explainable method for generative models. Given a model that needs to be explained, PXGen prepares two materials for the explanation: the Anchor set and the intrinsic and extrinsic criteria. These materials are customizable by users according to their purpose and requirements. By calculating each criterion, PXGen assigns each anchor a set of feature values. It then provides example-based explanations according to the feature values across all anchors, which are illustrated and visualized for users via tractable algorithms such as k-dispersion or k-center.
摘要：随着生成式人工智能在众多应用中的快速发展，可解释人工智能 (XAI) 在确保生成式人工智能技术的负责任的开发和部署方面发挥着至关重要的作用。近年来，XAI 取得了显著的进步并被广泛采用，反映了人们为提高人工智能系统的透明度、可解释性和可信度而做出的共同努力。最近的研究强调，熟练的 XAI 方法应遵循一套标准，主要侧重于两个关键领域。首先，它应该确保解释的质量和流畅性，包括忠实性、合理性、完整性和针对个人需求的定制等方面。其次，XAI 系统或机制的设计原则应涵盖以下因素，例如可靠性、弹性、其输出的可验证性以及其算法的透明度。然而，生成模型的 XAI 研究仍然相对稀缺，很少有人探索这些方法如何有效地满足该领域的这些标准。在这项工作中，我们提出了 PXGen，一种事后可解释的生成模型方法。对于需要解释的模型，PXGen 会准备两种解释材料，即 Anchor 集和内在和外在标准。这些材料可由用户根据其目的和需求进行自定义。通过计算每个标准，每个锚点都有一组特征值，PXGen 根据所有锚点之间的特征值提供基于示例的解释方法，并通过 k-dispersion 或 k-center 等易处理的算法向用户进行说明和可视化。
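
The abstract names k-center as one of the tractable algorithms used to pick representative anchors. The sketch below shows the classic greedy farthest-first heuristic for k-center on points in a feature space. It is a generic textbook procedure, not PXGen's code: the function name `greedy_k_center`, the choice of Euclidean distance, and the toy two-dimensional anchor features are all assumptions made for illustration.

```python
import math


def greedy_k_center(points, k):
    """Farthest-first traversal, a classic 2-approximation for k-center.

    Returns the indices of the k selected centers. The first center is
    the first point; each later center is the point farthest from all
    centers chosen so far.
    """
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    centers = [0]
    # Distance from every point to its nearest chosen center.
    nearest = [math.dist(p, points[0]) for p in points]
    while len(centers) < k:
        farthest = max(range(len(points)), key=nearest.__getitem__)
        centers.append(farthest)
        nearest = [min(d, math.dist(p, points[farthest]))
                   for d, p in zip(nearest, points)]
    return centers


# Toy anchors described by two hypothetical feature values each.
anchors = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
chosen = greedy_k_center(anchors, k=3)
print(sorted(chosen))
```

On this toy input the three selected anchors come from three different clusters, which is the kind of spread that makes a small set of examples representative of the whole anchor set.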

Title: Survey on Monocular Metric Depth Estimation

Authors: Jiuling Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.11841
Pdf URL: https://arxiv.org/pdf/2501.11841
Copy Paste: [[2501.11841]] Survey on Monocular Metric Depth Estimation(https://arxiv.org/abs/2501.11841)
Keywords: generative
Abstract: Monocular Depth Estimation (MDE) is a fundamental computer vision task underpinning applications such as spatial understanding, 3D reconstruction, and autonomous driving. While deep learning-based MDE methods can predict relative depth from a single image, their lack of metric scale information often results in scale inconsistencies, limiting their utility in downstream tasks like visual SLAM, 3D reconstruction, and novel view synthesis. Monocular Metric Depth Estimation (MMDE) addresses these challenges by enabling precise, scene-scale depth inference. MMDE improves depth consistency, enhances sequential task stability, simplifies integration into downstream applications, and broadens practical use cases. This paper provides a comprehensive review of depth estimation technologies, highlighting the evolution from geometry-based methods to state-of-the-art deep learning approaches. It emphasizes advancements in scale-agnostic methods, which are crucial for enabling zero-shot generalization as the foundational capability for MMDE. Recent progress in zero-shot MMDE research is explored, focusing on challenges such as model generalization and the loss of detail at scene boundaries. Innovative strategies to address these issues include unlabelled data augmentation, image patching, architectural optimization, and generative techniques. These advancements, analyzed in detail, demonstrate significant contributions to overcoming existing limitations. Finally, this paper synthesizes recent developments in zero-shot MMDE, identifies unresolved challenges, and outlines future research directions. By offering a clear roadmap and cutting-edge insights, this work aims to deepen understanding of MMDE, inspire novel applications, and drive technological innovation.
摘要：单目深度估计 (MDE) 是一项基本的计算机视觉任务，是空间理解、3D 重建和自动驾驶等应用的基础。虽然基于深度学习的 MDE 方法可以从单个图像预测相对深度，但它们缺乏度量尺度信息，这往往会导致尺度不一致，从而限制了它们在视觉 SLAM、3D 重建和新视图合成等下游任务中的实用性。单目度量深度估计 (MMDE) 通过实现精确的场景尺度深度推理来解决这些挑战。MMDE 提高了深度一致性，增强了顺序任务稳定性，简化了与下游应用程序的集成，并拓宽了实际用例。本文全面回顾了深度估计技术，重点介绍了从基于几何的方法到最先进的深度学习方法的演变。它强调了与尺度无关的方法的进步，这对于实现零样本泛化作为 MMDE 的基础能力至关重要。本文探讨了零样本 MMDE 研究的最新进展，重点关注模型泛化和场景边界细节丢失等挑战。解决这些问题的创新策略包括未标记数据增强、图像修补、架构优化和生成技术。这些进步经过详细分析，表明它们对克服现有限制做出了重大贡献。最后，本文综合了零样本 MMDE 的最新发展，确定了尚未解决的挑战，并概述了未来的研究方向。通过提供清晰的路线图和前沿见解，这项工作旨在加深对 MMDE 的理解，激发新应用，并推动技术创新。
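
The gap between relative and metric depth described above is often illustrated with a scale-and-shift alignment: a relative prediction is mapped to metric units by fitting two numbers against a few known metric depths. The sketch below is a minimal least-squares version of that idea. It assumes the prediction relates to metric depth affinely in depth space (some models predict inverse depth instead), and the function name `align_scale_shift` is hypothetical.

```python
def align_scale_shift(relative, metric):
    """Least-squares fit of metric ~= scale * relative + shift.

    Closed-form solution of ordinary least squares in one variable.
    """
    n = len(relative)
    if n != len(metric) or n < 2:
        raise ValueError("need at least two paired depth samples")
    mean_r = sum(relative) / n
    mean_m = sum(metric) / n
    var_r = sum((r - mean_r) ** 2 for r in relative)
    if var_r == 0:
        raise ValueError("relative depths must not all be equal")
    cov = sum((r - mean_r) * (m - mean_m) for r, m in zip(relative, metric))
    scale = cov / var_r
    shift = mean_m - scale * mean_r
    return scale, shift


# Relative depths in arbitrary units and the matching metric depths (metres).
relative = [0.1, 0.2, 0.4, 0.8]
metric = [2.0 * r + 1.0 for r in relative]
scale, shift = align_scale_shift(relative, metric)
print(round(scale, 6), round(shift, 6))
```

A metric depth estimator has to produce the correct scale directly from the image, without access to reference depths such as the ones this fit requires.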

Title: Fast Underwater Scene Reconstruction using Multi-View Stereo and Physical Imaging

Authors: Shuyi Hu, Qi Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.11884
Pdf URL: https://arxiv.org/pdf/2501.11884
Copy Paste: [[2501.11884]] Fast Underwater Scene Reconstruction using Multi-View Stereo and Physical Imaging(https://arxiv.org/abs/2501.11884)
Keywords: restoration
Abstract: Underwater scene reconstruction poses a substantial challenge because of the intricate interplay between light and the medium, resulting in scattering and absorption effects that make both depth estimation and rendering more complex. While recent Neural Radiance Fields (NeRF) based methods for underwater scenes achieve high-quality results by modeling and separating the scattering medium, they still suffer from slow training and rendering speeds. To address these limitations, we propose a novel method that integrates Multi-View Stereo (MVS) with a physics-based underwater image formation model. Our approach consists of two branches: one for depth estimation using the traditional cost volume pipeline of MVS, and the other for rendering based on the physics-based image formation model. The depth branch improves scene geometry, while the medium branch determines the scattering parameters to achieve precise scene rendering. Unlike traditional MVSNet methods, our method does not require ground-truth depth, thus allowing for faster training and rendering. By leveraging the medium subnet to estimate the medium parameters and combining this with a color MLP for rendering, we restore the true colors of underwater scenes and achieve higher-fidelity geometric representations. Experimental results show that our method enables high-quality synthesis of novel views in scattering media and clear-view restoration by removing the medium, and that it outperforms existing methods in rendering quality and training efficiency.
摘要：由于光与介质之间复杂的相互作用，水下场景重建带来了巨大的挑战，导致散射和吸收效应，使深度估计和渲染都变得更加复杂。虽然最近基于神经辐射场 (NeRF) 的水下场景方法通过建模和分离散射介质实现了高质量的结果，但它们仍然存在训练和渲染速度慢的问题。为了解决这些限制，我们提出了一种新方法，将多视角立体 (MVS) 与基于物理的水下图像形成模型相结合。我们的方法由两个分支组成：一个用于使用 MVS 的传统成本体积管道进行深度估计，另一个用于基于基于物理的图像形成模型进行渲染。深度分支改进了场景几何，而介质分支确定了散射参数以实现精确的场景渲染。与依赖于地面真实深度的传统 MVSNet 方法不同，我们的方法不需要使用深度真实，从而可以加快训练和渲染过程。利用介质子网络估计介质参数，结合彩色 MLP 进行渲染，恢复水下场景的真实色彩，实现更高保真度的几何表示。实验结果表明，该方法能够在散射介质中高质量合成新视图，通过去除介质实现清晰的视图恢复，在渲染质量和训练效率方面优于现有方法。
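
A physics-based underwater image formation model of the kind the abstract refers to can be sketched with a widely used simplified form: the observed colour is the clear-scene colour attenuated with distance, plus backscatter that grows with distance. The per-channel form used below, I = J * exp(-beta * d) + B * (1 - exp(-beta * d)), is an assumption for illustration, as are the parameter values and function names; the paper's exact model (for example, separate attenuation and backscatter coefficients) may differ.

```python
import math


def render_underwater(clear_rgb, depth_m, beta_rgb, backscatter_rgb):
    """Apply the simplified formation model to one pixel, per channel."""
    observed = []
    for j, beta, b in zip(clear_rgb, beta_rgb, backscatter_rgb):
        t = math.exp(-beta * depth_m)           # transmission along the ray
        observed.append(j * t + b * (1.0 - t))  # attenuated signal + backscatter
    return observed


def restore_clear(observed_rgb, depth_m, beta_rgb, backscatter_rgb):
    """Invert the model: remove backscatter, then undo the attenuation."""
    restored = []
    for i, beta, b in zip(observed_rgb, beta_rgb, backscatter_rgb):
        t = math.exp(-beta * depth_m)
        restored.append((i - b * (1.0 - t)) / t)
    return restored


# Hypothetical medium: red attenuates fastest, backscatter is blue-green.
beta = (0.40, 0.12, 0.08)
backscatter = (0.05, 0.35, 0.45)
clear = (0.80, 0.50, 0.30)

seen = render_underwater(clear, depth_m=3.0, beta_rgb=beta,
                         backscatter_rgb=backscatter)
back = restore_clear(seen, depth_m=3.0, beta_rgb=beta,
                     backscatter_rgb=backscatter)
print([round(v, 3) for v in seen])
print([round(v, 3) for v in back])
```

The inversion needs both the per-pixel distance and the medium parameters, which is why a depth branch and a medium branch are natural companions in such a pipeline.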

Title: ALoFTRAG: Automatic Local Fine Tuning for Retrieval Augmented Generation

Authors: Peter Devine
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2501.11929
Pdf URL: https://arxiv.org/pdf/2501.11929
Copy Paste: [[2501.11929]] ALoFTRAG: Automatic Local Fine Tuning for Retrieval Augmented Generation(https://arxiv.org/abs/2501.11929)
Keywords: generation
Abstract: Retrieval Augmented Generation (RAG) systems have been shown to improve the accuracy of Large Language Model (LLM) outputs. However, these systems often achieve low accuracy when applied to new data domains. We introduce the Automatic Local Fine Tuning of Retrieval Augmented Generation models (ALoFTRAG) framework, designed to improve the accuracy of RAG systems on a given domain by training LLMs without manually labeled data or larger teacher models. By generating and filtering synthetic training data and performing LoRA fine-tuning, ALoFTRAG improves citation and answer accuracy across 20 datasets in 26 languages by, on average, 8.3% and 3.0% respectively. Our results demonstrate that ALoFTRAG offers a practical, cost-effective, and data-secure solution for improving RAG accuracy, making it particularly applicable to sensitive domains such as healthcare and finance.
摘要：事实证明，检索增强生成 (RAG) 系统可以提高大型语言模型 (LLM) 输出的准确性。然而，这些模型在应用于新的数据域时，准确性往往很低。我们引入了检索增强生成模型的自动局部微调 (ALoFTRAG) 框架，旨在通过训练没有手动标记数据的 LLM 或使用更大的教师模型来提高 RAG 系统在特定域上的准确性。通过生成和过滤合成训练数据并执行 LoRA 微调，ALoFTRAG 在 26 种语言的 20 个数据集中分别将引用和答案准确性平均提高了 8.3% 和 3.0%。我们的结果表明，ALoFTRAG 为提高 RAG 准确性提供了一种实用、经济高效且数据安全的解决方案，使其特别适用于医疗保健和金融等敏感领域。
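
Note: the abstract mentions LoRA fine-tuning without detail. The sketch below illustrates the standard low-rank update, in which a frozen weight matrix W is augmented by a scaled product of two small trainable matrices B and A. The helper names are hypothetical, and nothing here reflects ALoFTRAG's actual training configuration.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha, rank):
    """Return W + (alpha / rank) * B @ A.

    w: d_out x d_in (frozen), b: d_out x rank, a: rank x d_in (both trainable).
    """
    delta = matmul(b, a)
    scale = alpha / rank
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def trainable_params(d_out, d_in, rank):
    """Number of trainable values in B and A, versus d_out * d_in for full fine-tuning."""
    return rank * (d_out + d_in)
```

Because only B and A are trained, the number of updated parameters grows with the rank rather than with the full matrix size, which is what makes local fine-tuning on modest hardware plausible.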

Title: MeshONet: A Generalizable and Efficient Operator Learning Method for Structured Mesh Generation

Authors: Jing Xiao, Xinhai Chen, Qingling Wang, Jie Liu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2501.11937
Pdf URL: https://arxiv.org/pdf/2501.11937
Copy Paste: [[2501.11937]] MeshONet: A Generalizable and Efficient Operator Learning Method for Structured Mesh Generation(https://arxiv.org/abs/2501.11937)
Keywords: generation
Abstract: Mesh generation plays a crucial role in scientific computing. Traditional mesh generation methods, such as TFI and PDE-based methods, often struggle to achieve a balance between efficiency and mesh quality. To address this challenge, physics-informed intelligent learning methods have recently emerged, significantly improving generation efficiency while maintaining high mesh quality. However, physics-informed methods fail to generalize when applied to previously unseen geometries, as even small changes in the boundary shape necessitate burdensome retraining to adapt to new geometric variations. In this paper, we introduce MeshONet, the first generalizable intelligent learning method for structured mesh generation. The method transforms the mesh generation task into an operator learning problem with multiple input and solution functions. To effectively overcome the multivariable mapping restriction of operator learning methods, we propose a dual-branch, shared-trunk architecture to approximate the mapping between function spaces based on input-output pairs. Experimental results show that MeshONet achieves a speedup of up to four orders of magnitude in generation efficiency over traditional methods. It also enables generalization to different geometries without retraining, greatly enhancing the practicality of intelligent methods.
摘要：网格生成在科学计算中起着至关重要的作用。传统的网格生成方法，例如基于 TFI 和 PDE 的方法，通常难以在效率和网格质量之间取得平衡。为了应对这一挑战，最近出现了基于物理的智能学习方法，在保持高网格质量的同时显著提高了生成效率。然而，基于物理的方法在应用于以前未见过的几何形状时无法推广，因为即使边界形状发生微小变化也需要进行繁重的重新训练才能适应新的几何变化。在本文中，我们介绍了 MeshONet，这是第一个用于结构化网格生成的可推广的智能学习方法。该方法将网格生成任务转换为具有多个输入和解函数的算子学习问题。为了有效克服算子学习方法的多变量映射限制，我们提出了一种双分支共享主干架构来近似基于输入输出对的函数空间之间的映射。实验结果表明，与传统方法相比，MeshONet 的生成效率提高了四个数量级。它还可以推广到不同的几何形状而无需重新训练，大大增强了智能方法的实用性。

Title: TabularARGN: A Flexible and Efficient Auto-Regressive Framework for Generating High-Fidelity Synthetic Data

Authors: Paul Tiwald, Ivona Krchova, Andrey Sidorenko, Mariana Vargas-Vieyra, Mario Scriminaci, Michael Platzer
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2501.12012
Pdf URL: https://arxiv.org/pdf/2501.12012
Copy Paste: [[2501.12012]] TabularARGN: A Flexible and Efficient Auto-Regressive Framework for Generating High-Fidelity Synthetic Data(https://arxiv.org/abs/2501.12012)
Keywords: generation, generative
Abstract: Synthetic data generation for tabular datasets must balance fidelity, efficiency, and versatility to meet the demands of real-world applications. We introduce the Tabular Auto-Regressive Generative Network (TabularARGN), a flexible framework designed to handle mixed-type, multivariate, and sequential datasets. By training on all possible conditional probabilities, TabularARGN supports advanced features such as fairness-aware generation, imputation, and conditional generation on any subset of columns. The framework achieves state-of-the-art synthetic data quality while significantly reducing training and inference times, making it ideal for large-scale datasets with diverse structures. Evaluated across established benchmarks, including realistic datasets with complex relationships, TabularARGN demonstrates its capability to synthesize high-quality data efficiently. By unifying flexibility and performance, this framework paves the way for practical synthetic data generation across industries.
摘要：表格数据集的合成数据生成必须在保真度、效率和多功能性之间取得平衡，以满足实际应用的需求。我们引入了表格自回归生成网络 (TabularARGN)，这是一个灵活的框架，旨在处理混合类型、多变量和顺序数据集。通过对所有可能的条件概率进行训练，TabularARGN 支持高级功能，例如公平感知生成、归因和对任何列子集进行条件生成。该框架实现了最先进的合成数据质量，同时显著减少了训练和推理时间，使其成为具有多样化结构的大规模数据集的理想选择。通过基于既定基准（包括具有复杂关系的真实数据集）进行评估，TabularARGN 展示了其高效合成高质量数据的能力。通过统一灵活性和性能，该框架为跨行业的实际合成数据生成铺平了道路。
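
Note: as an illustration of column-wise auto-regressive generation, the sketch below factorizes a row as a product of conditionals p(x_i | x_<i) and samples one column at a time, optionally with some columns fixed. It uses simple count tables for a single fixed column order as a stand-in for a learned network; TabularARGN itself trains on all conditional probabilities, and none of the names below come from the paper.

```python
import random
from collections import Counter, defaultdict


def fit_conditionals(rows, order):
    """Count-based estimate of p(x_i | x_<i) for one fixed column order."""
    tables = []
    for i, col in enumerate(order):
        table = defaultdict(Counter)
        for row in rows:
            context = tuple(row[c] for c in order[:i])
            table[context][row[col]] += 1
        tables.append(dict(table))
    return tables


def sample_row(tables, order, rng, fixed=None):
    """Sample a row column by column; columns in `fixed` are conditioned on."""
    fixed = fixed or {}
    row = {}
    for i, col in enumerate(order):
        if col in fixed:
            row[col] = fixed[col]
            continue
        context = tuple(row[c] for c in order[:i])
        counts = tables[i].get(context)
        if counts is None:
            # Unseen context: fall back to the column's marginal counts.
            counts = sum(tables[i].values(), Counter())
        values = list(counts)
        weights = [counts[v] for v in values]
        row[col] = rng.choices(values, weights=weights)[0]
    return row
```

Passing `fixed={"column": value}` corresponds to conditional generation, and the same loop could in principle serve imputation by fixing the observed columns of a partially filled row.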

Title: Foreign object segmentation in chest x-rays through anatomy-guided shape insertion

Authors: Constantin Seibold, Hamza Kalisch, Lukas Heine, Simon Reiß, Jens Kleesiek
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.12022
Pdf URL: https://arxiv.org/pdf/2501.12022
Copy Paste: [[2501.12022]] Foreign object segmentation in chest x-rays through anatomy-guided shape insertion(https://arxiv.org/abs/2501.12022)
Keywords: generation
Abstract: In this paper, we tackle the challenge of instance segmentation for foreign objects in chest radiographs, commonly seen in postoperative follow-ups with stents, pacemakers, or ingested objects in children. The diversity of foreign objects complicates dense annotation, as reflected in the insufficiency of existing datasets. To address this, we propose the simple generation of synthetic data through (1) insertion of arbitrary shapes (lines, polygons, ellipses) with varying contrasts and opacities, and (2) cut-paste augmentations from a small set of semi-automatically extracted labels. These insertions are guided by anatomy labels to ensure realistic placements, such as stents appearing only in relevant vessels. Our approach enables networks to segment complex structures with minimal manually labeled data. Notably, it achieves performance comparable to fully supervised models while using 93% fewer manual annotations.
摘要：在本文中，我们解决了胸片中异物实例分割的难题，这些异物常见于儿童术后随访中，如支架、起搏器或吞食物体。异物的多样性使密集注释变得复杂，现有数据集不足就是明证。为了解决这个问题，我们提出了通过以下方式简单地生成合成数据：(1) 插入具有不同对比度和不透明度的任意形状（线、多边形、椭圆形）；(2) 从一小组半自动提取的标签中进行剪切粘贴增强。这些插入由解剖标签引导，以确保真实的放置，例如支架仅出现在相关血管中。我们的方法使网络能够使用最少的手动标记数据来分割复杂结构。值得注意的是，它实现了与完全监督模型相当的性能，同时使用的手动注释减少了 93\%。
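
Note: the anatomy-guided insertion described above can be pictured as alpha-blending a synthetic shape into the radiograph only where an anatomy mask permits it, while recording the blended pixels as the segmentation label. The sketch below is a minimal illustration on nested lists; the function name, arguments, and blending rule are assumptions, not the authors' implementation.

```python
def insert_shape(image, shape_mask, anatomy_mask, intensity, opacity):
    """Blend a shape into a grayscale image, restricted to the anatomy mask.

    Returns (augmented_image, label_mask). All inputs are lists of rows of
    equal size; masks hold truthy values where the shape or anatomy is present.
    """
    augmented, label = [], []
    for img_row, shp_row, ana_row in zip(image, shape_mask, anatomy_mask):
        out_row, lab_row = [], []
        for pixel, in_shape, in_anatomy in zip(img_row, shp_row, ana_row):
            place = bool(in_shape and in_anatomy)
            if place:
                # Standard alpha blend of the shape intensity over the pixel.
                out_row.append((1.0 - opacity) * pixel + opacity * intensity)
            else:
                out_row.append(pixel)
            lab_row.append(1 if place else 0)
        augmented.append(out_row)
        label.append(lab_row)
    return augmented, label
```

Varying `intensity` and `opacity` per inserted shape would correspond to the "varying contrasts and opacities" mentioned in the abstract.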

Title: A Multi-annotated and Multi-modal Dataset for Wide-angle Video Quality Assessment

Authors: Bo Hu, Wei Wang, Chunyi Li, Lihuo He, Leida Li, Xinbo Gao
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2501.12082
Pdf URL: https://arxiv.org/pdf/2501.12082
Copy Paste: [[2501.12082]] A Multi-annotated and Multi-modal Dataset for Wide-angle Video Quality Assessment(https://arxiv.org/abs/2501.12082)
Keywords: quality assessment
Abstract: Wide-angle video is favored for its wide viewing angle and ability to capture a large area of scenery, making it an ideal choice for sports and adventure recording. However, wide-angle video is prone to deformation, exposure and other distortions, resulting in poor video quality and affecting the perception and experience, which may seriously hinder its application in fields such as competitive sports. To date, few studies have addressed quality assessment for wide-angle video. This deficiency primarily stems from the absence of a specialized dataset for wide-angle videos. To bridge this gap, we construct the first Multi-annotated and multi-modal Wide-angle Video quality assessment (MWV) dataset. Then, the performances of state-of-the-art video quality methods on the MWV dataset are investigated by inter-dataset testing and intra-dataset testing. Experimental results show that these methods exhibit significant limitations in their applicability.
摘要：广角视频因其视角广、能捕捉大面积景物而受到青睐，是运动、探险录制的理想选择。然而，广角视频容易发生形变、曝光等扭曲，导致视频质量不佳，影响感知和体验，严重阻碍其在竞技体育等领域的应用。到目前为止，很少有探索关注广角视频的质量评估问题。这一缺陷主要源于缺乏专门的广角视频数据集。为了弥补这一空白，我们构建了第一个多注释和多模态广角视频质量评估（MWV）数据集。然后，通过数据集间测试和数据集内测试研究了最先进的视频质量方法在 MWV 数据集上的性能。实验结果表明，这些方法对其适用性产生了很大的限制。

Title: Proxies for Distortion and Consistency with Applications for Real-World Image Restoration

Authors: Sean Man, Guy Ohayon, Ron Raphaeli, Michael Elad
Subjects: cs.CV, cs.AI, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2501.12102
Pdf URL: https://arxiv.org/pdf/2501.12102
Copy Paste: [[2501.12102]] Proxies for Distortion and Consistency with Applications for Real-World Image Restoration(https://arxiv.org/abs/2501.12102)
Keywords: restoration
Abstract: Real-world image restoration deals with the recovery of images suffering from an unknown degradation. This task is typically addressed while being given only degraded images, without their corresponding ground-truth versions. In this hard setting, designing and evaluating restoration algorithms becomes highly challenging. This paper offers a suite of tools that can serve both the design and assessment of real-world image restoration algorithms. Our work starts by proposing a trained model that predicts the chain of degradations a given real-world measured input has gone through. We show how this estimator can be used to approximate the consistency -- the match between the measurements and any proposed recovered image. We also use this estimator as a guiding force for the design of a simple and highly-effective plug-and-play real-world image restoration algorithm, leveraging a pre-trained diffusion-based image prior. Furthermore, this work proposes no-reference proxy measures of MSE and LPIPS, which, without access to the ground-truth images, allow ranking of real-world image restoration algorithms according to their (approximate) MSE and LPIPS. The proposed suite provides a versatile, first of its kind framework for evaluating and comparing blind image restoration algorithms in real-world scenarios.
摘要：真实世界图像恢复涉及对遭受未知退化的图像进行恢复。通常在仅提供退化图像（没有其对应的真实版本）的情况下解决此任务。在这种困难的环境中，设计和评估恢复算法变得极具挑战性。本文提供了一套工具，可用于设计和评估真实世界图像恢复算法。我们的工作首先提出一个经过训练的模型，该模型可以预测给定真实世界测量输入所经历的退化链。我们展示了如何使用该估计量来近似一致性——测量值与任何拟议的恢复图像之间的匹配。我们还使用该估计量作为指导力量，设计一种简单且高效的即插即用真实世界图像恢复算法，利用预先训练的基于扩散的图像先验。此外，这项工作提出了 MSE 和 LPIPS 的无参考代理测量，无需访问真实图像，即可根据其（近似）MSE 和 LPIPS 对真实世界图像恢复算法进行排名。该套件提供了一种多功能的、首创的框架，用于评估和比较现实场景中的盲图像恢复算法。

Title: ComposeAnyone: Controllable Layout-to-Human Generation with Decoupled Multimodal Conditions

Authors: Shiyue Zhang, Zheng Chong, Xi Lu, Wenqing Zhang, Haoxiang Li, Xujie Zhang, Jiehui Huang, Xiao Dong, Xiaodan Liang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.12173
Pdf URL: https://arxiv.org/pdf/2501.12173
Copy Paste: [[2501.12173]] ComposeAnyone: Controllable Layout-to-Human Generation with Decoupled Multimodal Conditions(https://arxiv.org/abs/2501.12173)
Keywords: generation
Abstract: Building on the success of diffusion models, significant advancements have been made in multimodal image generation tasks. Among these, human image generation has emerged as a promising technique, offering the potential to revolutionize the fashion design process. However, existing methods often focus solely on text-to-image or image reference-based human generation, which fails to satisfy the increasingly sophisticated demands. To address the limitations of flexibility and precision in human generation, we introduce ComposeAnyone, a controllable layout-to-human generation method with decoupled multimodal conditions. Specifically, our method allows decoupled control of any part in hand-drawn human layouts using text or reference images, seamlessly integrating them during the generation process. The hand-drawn layout, which utilizes color-blocked geometric shapes such as ellipses and rectangles, can be easily drawn, offering a more flexible and accessible way to define spatial layouts. Additionally, we introduce the ComposeHuman dataset, which provides decoupled text and reference image annotations for different components of each human image, enabling broader applications in human image generation tasks. Extensive experiments on multiple datasets demonstrate that ComposeAnyone generates human images with better alignment to given layouts, text descriptions, and reference images, showcasing its multi-task capability and controllability.
摘要：基于扩散模型的成功，多模态图像生成任务取得了重大进展。其中，人体图像生成已成为一种有前途的技术，有可能彻底改变时装设计过程。然而，现有的方法通常只关注文本到图像或基于图像参考的人体生成，无法满足日益复杂的需求。为了解决人体生成的灵活性和精度限制，我们引入了 ComposeAnyone，这是一种具有解耦多模态条件的可控布局到人体生成方法。具体来说，我们的方法允许使用文本或参考图像对手绘人体布局中的任何部分进行解耦控制，并在生成过程中无缝集成它们。手绘布局利用椭圆和矩形等色块几何形状，可以轻松绘制，提供了一种更灵活、更易于定义空间布局的方式。此外，我们引入了 ComposeHuman 数据集，它为每个人体图像的不同组成部分提供解耦的文本和参考图像注释，从而实现人体图像生成任务中的更广泛应用。在多个数据集上进行的大量实验表明，ComposeAnyone 生成的人体图像与给定的布局、文本描述和参考图像能够更好地对齐，展示了其多任务能力和可控性。

Title: Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation

Authors: Zibo Zhao, Zeqiang Lai, Qingxiang Lin, Yunfei Zhao, Haolin Liu, Shuhui Yang, Yifei Feng, Mingxin Yang, Sheng Zhang, Xianghui Yang, Huiwen Shi, Sicong Liu, Junta Wu, Yihang Lian, Fan Yang, Ruining Tang, Zebin He, Xinzhou Wang, Jian Liu, Xuhui Zuo, Zhuo Chen, Biwen Lei, Haohan Weng, Jing Xu, Yiling Zhu, Xinhai Liu, Lixin Xu, Changrong Hu, Tianyu Huang, Lifu Wang, Jihong Zhang, Meng Chen, Liang Dong, Yiwen Jia, Yulin Cai, Jiaao Yu, Yixuan Tang, Hao Zhang, Zheng Ye, Peng He, Runzhou Wu, Chao Zhang, Yonghao Tan, Jie Xiao, Yangyu Tao, Jianchen Zhu, Jinbao Xue, Kai Liu, Chongqing Zhao, Xinming Wu, Zhichao Hu, Lei Qin, Jianbing Peng, Zhan Li, Minghui Chen, Xipeng Zhang, Lin Niu, Paige Wang, Yingkai Wang, Haozhao Kuang, Zhongyi Fan, Xu Zheng, Weihao Zhuang, YingPing He, Tian Liu, Yong Yang, Di Wang, Yuhong Liu, Jie Jiang, Jingwei Huang, Chunchao Guo (refer to the report for detailed contributions)
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.12202
Pdf URL: https://arxiv.org/pdf/2501.12202
Copy Paste: [[2501.12202]] Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation(https://arxiv.org/abs/2501.12202)
Keywords: generation, generative
Abstract: We present Hunyuan3D 2.0, an advanced large-scale 3D synthesis system for generating high-resolution textured 3D assets. This system includes two foundation components: a large-scale shape generation model -- Hunyuan3D-DiT, and a large-scale texture synthesis model -- Hunyuan3D-Paint. The shape generative model, built on a scalable flow-based diffusion transformer, aims to create geometry that properly aligns with a given condition image, laying a solid foundation for downstream applications. The texture synthesis model, benefiting from strong geometric and diffusion priors, produces high-resolution and vibrant texture maps for either generated or hand-crafted meshes. Furthermore, we build Hunyuan3D-Studio -- a versatile, user-friendly production platform that simplifies the re-creation process of 3D assets. It allows both professional and amateur users to manipulate or even animate their meshes efficiently. We systematically evaluate our models, showing that Hunyuan3D 2.0 outperforms previous state-of-the-art models, both open-source and closed-source, in geometry details, condition alignment, texture quality, and more. Hunyuan3D 2.0 is publicly released in order to fill the gaps in the open-source 3D community for large-scale foundation generative models. The code and pre-trained weights of our models are available at: this https URL
摘要：我们推出了 Hunyuan3D 2.0，这是一款用于生成高分辨率纹理 3D 资产的先进大规模 3D 合成系统。该系统包括两个基础组件：大规模形状生成模型 - Hunyuan3D-DiT 和大规模纹理合成模型 - Hunyuan3D-Paint。形状生成模型建立在可扩展的基于流的扩散变换器上，旨在创建与给定条件图像正确对齐的几何图形，为下游应用奠定坚实的基础。纹理合成模型受益于强大的几何和扩散先验，可为生成或手工制作的网格生成高分辨率且生动的纹理图。此外，我们还构建了 Hunyuan3D-Studio - 一个多功能、用户友好的制作平台，可简化 3D 资产的重新创建过程。它允许专业用户和业余用户有效地操作甚至制作动画他们的网格。我们系统地评估了我们的模型，结果表明 Hunyuan3D 2.0 在几何细节、条件对齐、纹理质量等方面优于之前的最先进的模型，包括开源模型和闭源模型。Hunyuan3D 2.0 的公开发布是为了填补开源 3D 社区在大规模基础生成模型方面的空白。我们的模型代码和预训练权重可在以下网址获得：此 https URL

Title: Fixing Imbalanced Attention to Mitigate In-Context Hallucination of Large Vision-Language Model

Authors: Kazi Hasan Ibn Arif, Sajib Acharjee Dip, Khizar Hussain, Lang Zhang, Chris Thomas
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2501.12206
Pdf URL: https://arxiv.org/pdf/2501.12206
Copy Paste: [[2501.12206]] Fixing Imbalanced Attention to Mitigate In-Context Hallucination of Large Vision-Language Model(https://arxiv.org/abs/2501.12206)
Keywords: generation
Abstract: Large Vision Language Models (LVLMs) have demonstrated remarkable capabilities in understanding and describing visual content, achieving state-of-the-art performance across various vision-language tasks. However, these models frequently exhibit hallucination behavior, where they generate descriptions containing objects or details absent in the input image. Our work investigates this phenomenon by analyzing attention patterns across transformer layers and heads, revealing that hallucinations often stem from progressive degradation of visual grounding in deeper layers. We propose a novel attention modification approach that combines selective token emphasis and head-specific modulation to maintain visual grounding throughout the generation process. Our method introduces two key components: (1) a dual-stream token selection mechanism that identifies and prioritizes both locally informative and spatially significant visual tokens, and (2) an attention head-specific modulation strategy that differentially amplifies visual information processing based on measured visual sensitivity of individual attention heads. Through extensive experimentation on the MSCOCO dataset, we demonstrate that our approach reduces hallucination rates by up to 62.3% compared to baseline models while maintaining comparable task performance. Our analysis reveals that selectively modulating tokens across attention heads with varying levels of visual sensitivity can significantly improve visual grounding without requiring model retraining.
摘要：大型视觉语言模型 (LVLM) 在理解和描述视觉内容方面表现出了卓越的能力，在各种视觉语言任务中都取得了最先进的性能。然而，这些模型经常表现出幻觉行为，它们生成的描述包含输入图像中不存在的对象或细节。我们的工作通过分析转换器层和头部的注意力模式来研究这种现象，揭示了幻觉通常源于更深层视觉基础的逐渐退化。我们提出了一种新颖的注意力修改方法，该方法结合了选择性标记强调和头部特定调制，以在整个生成过程中保持视觉基础。我们的方法引入了两个关键组件：(1) 双流标记选择机制，可识别和优先考虑局部信息和空间重要视觉标记，以及 (2) 注意力头特定调制策略，可根据测量的单个注意力头的视觉敏感度差异化放大视觉信息处理。通过对 MSCOCO 数据集的大量实验，我们证明与基线模型相比，我们的方法将幻觉率降低了 62.3\%，同时保持了相当的任务性能。我们的分析表明，选择性地调节具有不同视觉敏感度的注意力头上的标记可以显著改善视觉基础，而无需重新训练模型。

Title: TokenVerse: Versatile Multi-concept Personalization in Token Modulation Space

Authors: Daniel Garibi, Shahar Yadin, Roni Paiss, Omer Tov, Shiran Zada, Ariel Ephrat, Tomer Michaeli, Inbar Mosseri, Tali Dekel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.12224
Pdf URL: https://arxiv.org/pdf/2501.12224
Copy Paste: [[2501.12224]] TokenVerse: Versatile Multi-concept Personalization in Token Modulation Space(https://arxiv.org/abs/2501.12224)
Keywords: generation
Abstract: We present TokenVerse -- a method for multi-concept personalization, leveraging a pre-trained text-to-image diffusion model. Our framework can disentangle complex visual elements and attributes from as little as a single image, while enabling seamless plug-and-play generation of combinations of concepts extracted from multiple images. As opposed to existing works, TokenVerse can handle multiple images with multiple concepts each, and supports a wide range of concepts, including objects, accessories, materials, pose, and lighting. Our work exploits a DiT-based text-to-image model, in which the input text affects the generation through both attention and modulation (shift and scale). We observe that the modulation space is semantic and enables localized control over complex concepts. Building on this insight, we devise an optimization-based framework that takes as input an image and a text description, and finds for each word a distinct direction in the modulation space. These directions can then be used to generate new images that combine the learned concepts in a desired configuration. We demonstrate the effectiveness of TokenVerse in challenging personalization settings, and showcase its advantages over existing methods. The project's webpage is at this https URL
摘要：我们提出了 TokenVerse——一种利用预先训练的文本到图像扩散模型进行多概念个性化的方法。我们的框架可以从单幅图像中分离出复杂的视觉元素和属性，同时实现从多幅图像中提取的概念组合的无缝即插即用生成。与现有作品不同，TokenVerse 可以处理每幅图像包含多个概念的多幅图像，并支持广泛的概念，包括物体、配件、材料、姿势和照明。我们的工作利用了基于 DiT 的文本到图像模型，其中输入文本通过注意力和调制（移位和缩放）影响生成。我们观察到调制空间是语义的，可以对复杂概念进行局部控制。基于这一见解，我们设计了一个基于优化的框架，该框架以图像和文本描述作为输入，并为每个单词在调制空间中找到一个不同的方向。然后可以使用这些方向生成新图像，将学习到的概念组合成所需的配置。我们展示了 TokenVerse 在具有挑战性的个性化设置中的有效性，并展示了它相对于现有方法的优势。此 https URL 中的项目网页

Title: InsTALL: Context-aware Instructional Task Assistance with Multi-modal Large Language Models

Authors: Pha Nguyen, Sailik Sengupta, Girik Malik, Arshit Gupta, Bonan Min
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2501.12231
Pdf URL: https://arxiv.org/pdf/2501.12231
Copy Paste: [[2501.12231]] InsTALL: Context-aware Instructional Task Assistance with Multi-modal Large Language Models(https://arxiv.org/abs/2501.12231)
Keywords: generative
Abstract: The improved competence of generative models can help build multi-modal virtual assistants that leverage modalities beyond language. By observing humans performing multi-step tasks, one can build assistants that have situational awareness of actions and tasks being performed, enabling them to tailor assistance based on this understanding. In this paper, we develop a Context-aware Instructional Task Assistant with Multi-modal Large Language Models (InsTALL) that leverages an online visual stream (e.g. a user's screen share or video recording) and responds in real-time to user queries related to the task at hand. To enable useful assistance, InsTALL 1) trains a multi-modal model on task videos and paired textual data, and 2) automatically extracts a task graph from video data and leverages it at training and inference time. We show InsTALL achieves state-of-the-art performance across proposed sub-tasks considered for multimodal activity understanding -- task recognition (TR), action recognition (AR), next action prediction (AP), and plan prediction (PP) -- and outperforms existing baselines on two novel sub-tasks related to automatic error identification.
摘要：生成模型能力的提高有助于构建利用语言以外模态的多模态虚拟助手。通过观察人类执行多步骤任务，可以构建具有正在执行的操作和任务态势感知的助手，使他们能够根据这种理解提供帮助。在本文中，我们开发了一种具有多模态大型语言模型 (InsTALL) 的情境感知教学任务助手，它利用在线视觉流（例如用户的屏幕共享或视频录制）并实时响应与手头任务相关的用户查询。为了提供有用的帮助，InsTALL 1) 在任务视频和配对的文本数据上训练多模态模型，2) 自动从视频数据中提取任务图并在训练和推理时利用它。我们表明，InsTALL 在为多模式活动理解而提出的子任务（任务识别（TR）、动作识别（AR）、下一步动作预测（AP）和计划预测（PP））中实现了最先进的性能，并且在与自动错误识别相关的两个新子任务上超越现有基线。

Title: HAC++: Towards 100X Compression of 3D Gaussian Splatting

Authors: Yihang Chen, Qianyi Wu, Weiyao Lin, Mehrtash Harandi, Jianfei Cai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.12255
Pdf URL: https://arxiv.org/pdf/2501.12255
Copy Paste: [[2501.12255]] HAC++: Towards 100X Compression of 3D Gaussian Splatting(https://arxiv.org/abs/2501.12255)
Keywords: restoration
Abstract: 3D Gaussian Splatting (3DGS) has emerged as a promising framework for novel view synthesis, boasting rapid rendering speed with high fidelity. However, the large number of Gaussians and their associated attributes necessitate effective compression techniques. Yet the sparse and unorganized nature of the point cloud of Gaussians (or anchors in our paper) presents challenges for compression. To achieve a compact size, we propose HAC++, which leverages the relationships between unorganized anchors and a structured hash grid, utilizing their mutual information for context modeling. Additionally, HAC++ captures intra-anchor contextual relationships to further enhance compression performance. To facilitate entropy coding, we utilize Gaussian distributions to precisely estimate the probability of each quantized attribute, where an adaptive quantization module is proposed to enable high-precision quantization of these attributes for improved fidelity restoration. Moreover, we incorporate an adaptive masking strategy to eliminate invalid Gaussians and anchors. Overall, HAC++ achieves a remarkable size reduction of over 100X compared to vanilla 3DGS when averaged on all datasets, while simultaneously improving fidelity. It also delivers more than 20X size reduction compared to Scaffold-GS. Our code is available at this https URL.
摘要：3D 高斯分布 (3DGS) 已成为一种有前途的新型视图合成框架，具有高保真度和快速渲染速度。然而，大量的高斯及其相关属性需要有效的压缩技术。然而，高斯点云（或我们论文中的锚点）的稀疏和无序性质给压缩带来了挑战。为了实现紧凑的尺寸，我们提出了 HAC++，它利用无组织锚点和结构化哈希网格之间的关系，利用它们的相互信息进行上下文建模。此外，HAC++ 捕获锚点内的上下文关系以进一步增强压缩性能。为了促进熵编码，我们利用高斯分布来精确估计每个量化属性的概率，其中提出了一个自适应量化模块来实现这些属性的高精度量化，从而提高保真度恢复。此外，我们结合了自适应掩码策略来消除无效的高斯和锚点。总体而言，在所有数据集上取平均值时，HAC++ 的尺寸比 vanilla 3DGS 显著缩小了 100 多倍，同时提高了保真度。与 Scaffold-GS 相比，它的尺寸也缩小了 20 多倍。我们的代码可在此 https URL 上找到。
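
Note: one standard way to use a Gaussian distribution for entropy coding of quantized values is to take the probability mass of the quantization bin as a CDF difference and charge -log2 of that mass as the estimated bit cost. The sketch below shows only this generic step; the function names are hypothetical, and how HAC++ predicts the mean, scale, and quantization step is not shown.

```python
import math


def gaussian_cdf(x, mu, sigma):
    """CDF of a normal distribution with mean mu and standard deviation sigma."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def quantized_prob(value, step, mu, sigma):
    """Quantize `value` to a multiple of `step` and return (quantized, bin probability)."""
    q = round(value / step) * step
    p = gaussian_cdf(q + step / 2.0, mu, sigma) - gaussian_cdf(q - step / 2.0, mu, sigma)
    return q, max(p, 1e-12)  # clamp to avoid log(0) for far-out values


def estimated_bits(values, step, mu, sigma):
    """Estimated code length in bits under the Gaussian bin model."""
    return sum(-math.log2(quantized_prob(v, step, mu, sigma)[1]) for v in values)
```

The better the predicted mean and scale match the actual attribute values, the larger the bin probabilities and the fewer bits an entropy coder needs, which is the role the context model plays in this kind of scheme.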

Title: VipDiff: Towards Coherent and Diverse Video Inpainting via Training-free Denoising Diffusion Models

Authors: Chaohao Xie, Kai Han, Kwan-Yee K. Wong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.12267
Pdf URL: https://arxiv.org/pdf/2501.12267
Copy Paste: [[2501.12267]] VipDiff: Towards Coherent and Diverse Video Inpainting via Training-free Denoising Diffusion Models(https://arxiv.org/abs/2501.12267)
Keywords: generation
Abstract: Recent video inpainting methods have achieved encouraging improvements by leveraging optical flow to guide pixel propagation from reference frames either in the image space or feature space. However, they produce severe artifacts in the mask center when the masked area is too large and no pixel correspondences can be found for the center. Recently, diffusion models have demonstrated impressive performance in generating diverse and high-quality images, and have been exploited in a number of works for image inpainting. These methods, however, cannot be applied directly to videos to produce temporally coherent inpainting results. In this paper, we propose a training-free framework, named VipDiff, for conditioning a diffusion model on the reverse diffusion process to produce temporally coherent inpainting results without requiring any training data or fine-tuning of the pre-trained diffusion model. VipDiff takes optical flow as guidance to extract valid pixels from reference frames to serve as constraints in optimizing the randomly sampled Gaussian noise, and uses the generated results for further pixel propagation and conditional generation. VipDiff also allows for generating diverse video inpainting results over different sampled noise. Experiments demonstrate that VipDiff can largely outperform state-of-the-art video inpainting methods in terms of both spatio-temporal coherence and fidelity.
摘要：最近的视频修复方法通过利用光流引导像素从参考帧在图像空间或特征空间中的传播，取得了令人鼓舞的改进。然而，当掩蔽区域太大并且无法找到中心的像素对应关系时，它们会在掩蔽中心产生严重的伪影。最近，扩散模型在生成多样化和高质量图像方面表现出色，并已在许多图像修复工作中得到利用。然而，这些方法不能直接应用于视频以产生时间相干的修复结果。在本文中，我们提出了一个无需训练的框架，名为 VipDiff，用于在逆扩散过程中调节扩散模型以产生时间相干的修复结果，而无需任何训练数据或微调预训练的扩散模型。VipDiff 以光流为指导，从参考帧中提取有效像素，作为优化随机采样的高斯噪声的约束，并使用生成的结果进行进一步的像素传播和条件生成。 VipDiff 还允许针对不同的采样噪声生成不同的视频修复结果。实验表明，VipDiff 在空间时间连贯性和保真度方面可以大大超越最先进的视频修复方法。

Title: Regressor-Guided Image Editing Regulates Emotional Response to Reduce Online Engagement

Authors: Christoph Gebhardt, Robin Willardt, Seyedmorteza Sadat, Chih-Wei Ning, Andreas Brombach, Jie Song, Otmar Hilliges, Christian Holz
Subjects: cs.CV, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2501.12289
Pdf URL: https://arxiv.org/pdf/2501.12289
Copy Paste: [[2501.12289]] Regressor-Guided Image Editing Regulates Emotional Response to Reduce Online Engagement(https://arxiv.org/abs/2501.12289)
Keywords: generative
Abstract: Emotions are known to mediate the relationship between users' content consumption and their online engagement, with heightened emotional intensity leading to increased engagement. Building on this insight, we propose three regressor-guided image editing approaches aimed at diminishing the emotional impact of images. These include (i) a parameter optimization approach based on global image transformations known to influence emotions, (ii) an optimization approach targeting the style latent space of a generative adversarial network, and (iii) a diffusion-based approach employing classifier guidance and classifier-free guidance. Our findings demonstrate that these approaches can effectively alter the emotional properties of images while maintaining high visual quality. Optimization-based methods primarily adjust low-level properties like color hues and brightness, whereas the diffusion-based approach introduces semantic changes, such as altering appearance or facial expressions. Notably, results from a behavioral study reveal that only the diffusion-based approach successfully elicits changes in viewers' emotional responses while preserving high perceived image quality. In future work, we will investigate the impact of these image adaptations on internet user behavior.
摘要：众所周知，情绪可以调节用户的内容消费和在线参与之间的关系，情绪强度的提高会导致参与度的提高。基于这一见解，我们提出了三种回归引导的图像编辑方法，旨在减少图像的情绪影响。这些方法包括 (i) 基于已知会影响情绪的全局图像变换的参数优化方法，(ii) 针对生成对抗网络的风格潜在空间的优化方法，以及 (iii) 采用分类器引导和无分类器引导的基于扩散的方法。我们的研究结果表明，这些方法可以有效地改变图像的情绪属性，同时保持较高的视觉质量。基于优化的方法主要调整色调和亮度等低级属性，而基于扩散的方法则引入了语义变化，例如改变外观或面部表情。值得注意的是，一项行为研究的结果表明，只有基于扩散的方法才能成功引发观众情绪反应的变化，同时保持较高的感知图像质量。在未来的工作中，我们将研究这些图像调整对互联网用户行为的影响。
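
Note: of the two guidance mechanisms named in approach (iii), classifier-free guidance has a simple, widely used form: the unconditional noise prediction is extrapolated toward the conditional one by a guidance scale. The sketch below shows only that combination step on plain lists; the function name is hypothetical, and the regressor-guided part of the authors' method is not reproduced here.

```python
def classifier_free_guidance(eps_uncond, eps_cond, scale):
    """Combine noise predictions: eps = eps_uncond + scale * (eps_cond - eps_uncond).

    scale = 0 ignores the condition, scale = 1 reproduces the conditional
    prediction, and scale > 1 pushes further in the conditional direction.
    """
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]
```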

Title: A Hybrid Supervised and Self-Supervised Graph Neural Network for Edge-Centric Applications

Authors: Eugenio Borzone, Leandro Di Persia, Matias Gerard
Subjects: cs.LG, q-bio.MN
Abstract URL: https://arxiv.org/abs/2501.12309
Pdf URL: https://arxiv.org/pdf/2501.12309
Copy Paste: [[2501.12309]] A Hybrid Supervised and Self-Supervised Graph Neural Network for Edge-Centric Applications(https://arxiv.org/abs/2501.12309)
Keywords: generation
Abstract: This paper presents a novel graph-based deep learning model for tasks involving relations between two nodes (edge-centric tasks), where the focus lies on predicting relationships and interactions between pairs of nodes rather than node properties themselves. This model combines supervised and self-supervised learning: its loss function accounts for the learned embeddings and for patterns both with and without ground truth. Additionally, it incorporates an attention mechanism that leverages both node and edge features. The architecture, trained end-to-end, comprises two primary components: embedding generation and prediction. First, a graph neural network (GNN) transforms raw node features into dense, low-dimensional embeddings, incorporating edge attributes. Then, a feedforward neural model processes the node embeddings to produce the final output. Experiments demonstrate that our model matches or exceeds existing methods for protein-protein interaction prediction and Gene Ontology (GO) term prediction. The model also performs effectively with one-hot encoding for node features, providing a solution for the previously unsolved problem of predicting similarity between compounds with unknown structures.
摘要：本文介绍了一种基于图的新型深度学习模型，用于涉及两个节点之间关系的任务（以边缘为中心的任务），其重点在于预测节点对之间的关系和交互，而不是节点属性本身。该模型结合了监督学习和自监督学习，将学习到的嵌入和有无基本事实的模式考虑在损失函数中。此外，它还结合了一种利用节点和边缘特征的注意力机制。该架构经过端到端训练，包含两个主要组件：嵌入生成和预测。首先，图神经网络 (GNN) 将原始节点特征转换为密集的低维嵌入，并结合边缘属性。然后，前馈神经模型处理节点嵌入以产生最终输出。实验表明，我们的模型与现有的蛋白质-蛋白质相互作用预测和基因本体 (GO) 术语预测方法相当甚至超过现有方法。该模型还有效地对节点特征进行独热编码，为预测结构未知的化合物之间的相似性这一以前未解决的问题提供了解决方案。

Title: VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model

Authors: Xianwei Zhuang, Yuxin Xie, Yufan Deng, Liming Liang, Jinghan Ru, Yuguo Yin, Yuexian Zou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.12327
Pdf URL: https://arxiv.org/pdf/2501.12327
Copy Paste: [[2501.12327]] VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model(https://arxiv.org/abs/2501.12327)
Keywords: generation
Abstract: We present VARGPT, a novel multimodal large language model (MLLM) that unifies visual understanding and generation within a single autoregressive framework. VARGPT employs a next-token prediction paradigm for visual understanding and a next-scale prediction paradigm for visual autoregressive generation. VARGPT extends the LLaVA architecture, achieving efficient scale-wise autoregressive visual generation within MLLMs while accommodating mixed-modal input and output within a single model framework. VARGPT undergoes a three-stage unified training process on specially curated datasets, comprising a pre-training phase and two mixed visual instruction-tuning phases. The three stages are designed to achieve alignment between visual and textual features, enhance instruction following for both understanding and generation, and improve visual generation quality, respectively. Despite its LLaVA-based architecture for multimodal understanding, VARGPT significantly outperforms LLaVA-1.5 across various vision-centric benchmarks, such as visual question-answering and reasoning tasks. Notably, VARGPT naturally supports autoregressive visual generation and instruction-to-image synthesis, showcasing its versatility in both visual understanding and generation tasks. Project page is at: this https URL
摘要：我们提出了 VARGPT，一种新颖的多模态大型语言模型 (MLLM)，它将视觉理解和生成统一在一个自回归框架内。VARGPT 采用下一个标记预测范式进行视觉理解，采用下一个尺度预测范式进行视觉自回归生成。VARGPT 创新地扩展了 LLaVA 架构，在 MLLM 中实现了高效的尺度自回归视觉生成，同时在单个模型框架内无缝地容纳混合模态输入和输出。我们的 VARGPT 在特别策划的数据集上经历了三阶段统一训练过程，包括一个预训练阶段和两个混合视觉指令调整阶段。统一的训练策略旨在实现视觉和文本特征之间的一致性、增强理解和生成的指令遵循性以及提高视觉生成质量。尽管 VARGPT 采用基于 LLAVA 的多模型理解架构，但在各种以视觉为中心的基准测试（例如视觉问答和推理任务）中，其表现明显优于 LLaVA-1.5。值得注意的是，VARGPT 自然支持自回归视觉生成和指令到图像合成功能，展示了其在视觉理解和生成任务中的多功能性。项目页面位于：\url{此 https URL}

Title: Vision-Language Models for Automated Chest X-ray Interpretation: Leveraging ViT and GPT-2

Authors: Md. Rakibul Islam, Md. Zahid Hossain, Mustofa Ahmed, Most. Sharmin Sultana Samu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.12356
Pdf URL: https://arxiv.org/pdf/2501.12356
Copy Paste: [[2501.12356]] Vision-Language Models for Automated Chest X-ray Interpretation: Leveraging ViT and GPT-2(https://arxiv.org/abs/2501.12356)
Keywords: generation
Abstract: Radiology plays a pivotal role in modern medicine due to its non-invasive diagnostic capabilities. However, the manual generation of unstructured medical reports is time-consuming and prone to errors, creating a significant bottleneck in clinical workflows. Despite advancements in AI-generated radiology reports, challenges remain in achieving detailed and accurate report generation. In this study, we evaluated different combinations of multimodal models that integrate Computer Vision and Natural Language Processing to generate comprehensive radiology reports. We employed a pretrained Vision Transformer (ViT-B16) and a SWIN Transformer as the image encoders. The BART and GPT-2 models serve as the textual decoders. We used chest X-ray images and reports from the IU-Xray dataset to evaluate the usability of the SWIN Transformer-BART, SWIN Transformer-GPT-2, ViT-B16-BART and ViT-B16-GPT-2 models for report generation, with the aim of finding the best combination among them. The SWIN-BART model performs best among the four, achieving strong results on almost all evaluation metrics, such as ROUGE, BLEU and BERTScore.
摘要：放射学因其非侵入性诊断能力而在现代医学中发挥着关键作用。然而，手动生成非结构化医疗报告既费时又容易出错。这给临床工作流程造成了重大瓶颈。尽管人工智能生成的放射学报告取得了进步，但在生成详细而准确的报告方面仍然存在挑战。在本研究中，我们评估了集成计算机视觉和自然语言处理以生成综合放射学报告的不同多模态模型组合。我们使用预训练的 Vision Transformer (ViT-B16) 和 SWIN Transformer 作为图像编码器。BART 和 GPT-2 模型用作文本解码器。我们使用了 IU-Xray 数据集中的胸部 X 光图像和报告来评估 SWIN Transformer-BART、SWIN Transformer-GPT-2、ViT-B16-BART 和 ViT-B16-GPT-2 模型在报告生成中的可用性。我们的目标是找到这些模型中的最佳组合。 SWIN-BART 模型是四种模型中表现最好的模型，在 ROUGE、BLEU 和 BERTScore 等几乎所有评估指标中都取得了令人瞩目的成绩。

Title: InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model

Authors: Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Ziyu Liu, Shengyuan Ding, Shenxi Wu, Yubo Ma, Haodong Duan, Wenwei Zhang, Kai Chen, Dahua Lin, Jiaqi Wang
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2501.12368
Pdf URL: https://arxiv.org/pdf/2501.12368
Copy Paste: [[2501.12368]] InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model(https://arxiv.org/abs/2501.12368)
Keywords: generation
Abstract: Despite the promising performance of Large Vision Language Models (LVLMs) in visual understanding, they occasionally generate incorrect outputs. While reward models (RMs) with reinforcement learning or test-time scaling offer the potential for improving generation quality, a critical gap remains: publicly available multi-modal RMs for LVLMs are scarce, and the implementation details of proprietary models are often unclear. We bridge this gap with InternLM-XComposer2.5-Reward (IXC-2.5-Reward), a simple yet effective multi-modal reward model that aligns LVLMs with human preferences. To ensure the robustness and versatility of IXC-2.5-Reward, we set up a high-quality multi-modal preference corpus spanning text, image, and video inputs across diverse domains, such as instruction following, general understanding, text-rich documents, mathematical reasoning, and video understanding. IXC-2.5-Reward achieves excellent results on the latest multi-modal reward model benchmark and shows competitive performance on text-only reward model benchmarks. We further demonstrate three key applications of IXC-2.5-Reward: (1) Providing a supervisory signal for RL training: integrating IXC-2.5-Reward with Proximal Policy Optimization (PPO) yields IXC-2.5-Chat, which shows consistent improvements in instruction following and multi-modal open-ended dialogue; (2) Selecting the best response from candidate responses for test-time scaling; and (3) Filtering outlier or noisy samples from existing image and video instruction tuning training data. To ensure reproducibility and facilitate further research, we have open-sourced all model weights and training recipes at this https URL.
摘要：尽管大型视觉语言模型 (LVLM) 在视觉理解方面表现出色，但它们偶尔会产生错误的输出。虽然具有强化学习或测试时间缩放的奖励模型 (RM) 具有提高生成质量的潜力，但仍然存在一个关键差距：LVLM 的公开多模态 RM 很少，专有模型的实施细节通常不清楚。我们通过 InternLM-XComposer2.5-Reward (IXC-2.5-Reward) 弥补了这一差距，这是一个简单但有效的多模态奖励模型，可将 LVLM 与人类偏好保持一致。为了确保 IXC-2.5-Reward 的稳健性和多功能性，我们建立了一个高质量的多模态偏好语料库，涵盖不同领域的文本、图像和视频输入，例如指令遵循、一般理解、富文本文档、数学推理和视频理解。IXC-2.5-Reward 在最新的多模态奖励模型基准测试中取得了优异的成绩，并在纯文本奖励模型基准测试中表现出色。我们进一步展示了 IXC-2.5-Reward 的三个关键应用：(1) 为 RL 训练提供监督信号。我们将 IXC-2.5-Reward 与近端策略优化 (PPO) 相结合，产生了 IXC-2.5-Chat，它在指令遵循和多模式开放式对话方面表现出持续的改进；(2) 从候选响应中选择最佳响应以进行测试时间扩展；(3) 从现有的图像和视频指令调整训练数据中过滤异常值或噪声样本。为了确保可重复性并促进进一步研究，我们已在此 https URL 上开源了所有模型权重和训练配方
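
Application (2) in the abstract above, selecting the best of several candidate responses with a reward model, can be sketched as follows. This is only a minimal illustration: `best_of_n` and `toy_reward` are hypothetical names, and the word-overlap scorer is a stand-in for a real reward model, not the IXC-2.5-Reward API.

```python
from typing import Callable, Sequence


def best_of_n(
    prompt: str,
    candidates: Sequence[str],
    reward_fn: Callable[[str, str], float],
) -> str:
    """Return the candidate that the reward function scores highest."""
    if not candidates:
        raise ValueError("need at least one candidate response")
    return max(candidates, key=lambda response: reward_fn(prompt, response))


def toy_reward(prompt: str, response: str) -> float:
    """Hypothetical stand-in scorer: counts response words that occur in the prompt."""
    prompt_words = set(prompt.lower().split())
    return float(sum(1 for word in response.lower().split() if word in prompt_words))


if __name__ == "__main__":
    answers = ["a blue sky", "the red car is parked", "car"]
    print(best_of_n("describe the red car", answers, toy_reward))
```

In practice `reward_fn` would wrap a learned reward model; the selection step itself stays this simple.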

Title: Video Depth Anything: Consistent Depth Estimation for Super-Long Videos

Authors: Sili Chen, Hengkai Guo, Shengnan Zhu, Feihu Zhang, Zilong Huang, Jiashi Feng, Bingyi Kang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2501.12375
Pdf URL: https://arxiv.org/pdf/2501.12375
Copy Paste: [[2501.12375]] Video Depth Anything: Consistent Depth Estimation for Super-Long Videos(https://arxiv.org/abs/2501.12375)
Keywords: generation
Abstract: Depth Anything has achieved remarkable success in monocular depth estimation with strong generalization ability. However, it suffers from temporal inconsistency in videos, hindering its practical applications. Various methods have been proposed to alleviate this issue by leveraging video generation models or introducing priors from optical flow and camera poses. Nonetheless, these methods are only applicable to short videos (< 10 seconds) and require a trade-off between quality and computational efficiency. We propose Video Depth Anything for high-quality, consistent depth estimation in super-long videos (over several minutes) without sacrificing efficiency. We base our model on Depth Anything V2 and replace its head with an efficient spatial-temporal head. We design a straightforward yet effective temporal consistency loss by constraining the temporal depth gradient, eliminating the need for additional geometric priors. The model is trained on a joint dataset of video depth and unlabeled images, similar to Depth Anything V2. Moreover, a novel key-frame-based strategy is developed for long video inference. Experiments show that our model can be applied to arbitrarily long videos without compromising quality, consistency, or generalization ability. Comprehensive evaluations on multiple video benchmarks demonstrate that our approach sets a new state-of-the-art in zero-shot video depth estimation. We offer models of different scales to support a range of scenarios, with our smallest model capable of real-time performance at 30 FPS.
摘要：Depth Anything 在单目深度估计方面取得了显著成功，具有很强的泛化能力。然而，它在视频中存在时间不一致性，阻碍了它的实际应用。已经提出了各种方法来缓解这个问题，比如利用视频生成模型或引入光流和相机姿势的先验。尽管如此，这些方法仅适用于短视频（<10 秒），并且需要在质量和计算效率之间进行权衡。我们提出了 Video Depth Anything，用于在超长视频（超过几分钟）中进行高质量、一致的深度估计，而不会牺牲效率。我们的模型以 Depth Anything V2 为基础，并用高效的时空头替换其头部。我们通过限制时间深度梯度设计了一种简单而有效的时间一致性损失，从而无需额外的几何先验。该模型在视频深度和未标记图像的联合数据集上进行训练，类似于 Depth Anything V2。此外，还开发了一种新颖的基于关键帧的策略用于长视频推理。实验表明，我们的模型可以应用于任意长度的视频，而不会影响质量、一致性或泛化能力。对多个视频基准的综合评估表明，我们的方法在零样本视频深度估计方面创下了新高。我们提供不同规模的模型来支持各种场景，我们最小的模型能够以 30 FPS 的速度实时运行。
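
The abstract above describes a temporal consistency loss that constrains the temporal depth gradient. One plausible reading is sketched below: the frame-to-frame change in predicted depth is matched to the frame-to-frame change in reference depth. The function name, the L1 form, and the flat-list frame layout are assumptions made for illustration; the paper's exact formulation may differ.

```python
def temporal_gradient_loss(pred, target):
    """Mean absolute difference between predicted and reference temporal depth gradients.

    pred, target: lists of frames; each frame is a flat list of per-pixel depths.
    (Assumed layout for this sketch, not the paper's data format.)
    """
    if len(pred) != len(target) or len(pred) < 2:
        raise ValueError("need two sequences of equal length with at least 2 frames")
    total = 0.0
    count = 0
    for t in range(len(pred) - 1):
        for p0, p1, g0, g1 in zip(pred[t], pred[t + 1], target[t], target[t + 1]):
            # Compare how depth changes over time, not the depth values themselves.
            total += abs((p1 - p0) - (g1 - g0))
            count += 1
    return total / count


if __name__ == "__main__":
    reference = [[1.0, 2.0], [1.5, 2.5], [2.0, 3.0]]
    flickering = [[1.0, 2.0], [2.5, 3.5], [2.0, 3.0]]
    print(temporal_gradient_loss(reference, reference))
    print(temporal_gradient_loss(flickering, reference))
```

A prediction that is offset from the reference by a constant incurs no penalty under this loss, while a prediction that flickers between frames does, which is the behavior a temporal consistency term is meant to encourage.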

Title: Parallel Sequence Modeling via Generalized Spatial Propagation Network

Authors: Hongjun Wang, Wonmin Byeon, Jiarui Xu, Jinwei Gu, Ka Chun Cheung, Xiaolong Wang, Kai Han, Jan Kautz, Sifei Liu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2501.12381
Pdf URL: https://arxiv.org/pdf/2501.12381
Copy Paste: [[2501.12381]] Parallel Sequence Modeling via Generalized Spatial Propagation Network(https://arxiv.org/abs/2501.12381)
Keywords: generation
Abstract: We present the Generalized Spatial Propagation Network (GSPN), a new attention mechanism optimized for vision tasks that inherently captures 2D spatial structures. Existing attention models, including transformers, linear attention, and state-space models like Mamba, process multi-dimensional data as 1D sequences, compromising spatial coherence and efficiency. GSPN overcomes these limitations by directly operating on spatially coherent image data and forming dense pairwise connections through a line-scan approach. Central to GSPN is the Stability-Context Condition, which ensures stable, context-aware propagation across 2D sequences and reduces the effective sequence length to $\sqrt{N}$ for a square map with N elements, significantly enhancing computational efficiency. With learnable, input-dependent weights and no reliance on positional embeddings, GSPN achieves superior spatial fidelity and state-of-the-art performance in vision tasks, including ImageNet classification, class-guided image generation, and text-to-image generation. Notably, GSPN accelerates SD-XL with softmax-attention by over $84\times$ when generating 16K images.
摘要：我们提出了广义空间传播网络 (GSPN)，这是一种针对视觉任务优化的新型注意力机制，本质上可以捕捉 2D 空间结构。现有的注意力模型，包括 Transformer、线性注意力和 Mamba 等状态空间模型，将多维数据处理为 1D 序列，从而损害了空间连贯性和效率。GSPN 通过直接对空间连贯的图像数据进行操作并通过线扫描方法形成密集的成对连接来克服这些限制。GSPN 的核心是稳定性-上下文条件，它确保在 2D 序列中进行稳定的、上下文感知的传播，并将具有 N 个元素的方形图的有效序列长度减少到 $\sqrt{N}$，从而显著提高计算效率。凭借可学习的、输入相关的权重并且不依赖位置嵌入，GSPN 在视觉任务中实现了卓越的空间保真度和最先进的性能，包括 ImageNet 分类、类引导图像生成和文本到图像生成。值得注意的是，在生成 16K 图像时，GSPN 使用 softmax-attention 将 SD-XL 加速了超过 $84\times$。
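
To make the $\sqrt{N}$ claim in the abstract above concrete, here is a toy line-scan over a square grid: rows are visited one after another, and every pixel in a row is updated from its neighbors in the previous row. The three-neighbor connectivity, the fixed weights, and the renormalization at the borders are simplifying assumptions for this sketch (the abstract states that GSPN uses learnable, input-dependent weights), so this is not the authors' implementation.

```python
def line_scan(x, weights=(0.25, 0.5, 0.25)):
    """Propagate information row by row over an H x W grid.

    Returns (hidden, steps), where `steps` is the number of sequential row updates.
    Weights over the previous row are renormalized to sum to 1 at the borders, an
    assumed simplification that keeps the propagated values bounded.
    """
    height, width = len(x), len(x[0])
    hidden = [list(x[0])]  # the first row has no predecessor
    steps = 1
    for i in range(1, height):
        prev = hidden[-1]
        row = []
        for j in range(width):  # pixels within a row are independent of each other
            acc = 0.0
            norm = 0.0
            for k, w in zip((j - 1, j, j + 1), weights):
                if 0 <= k < width:
                    acc += w * prev[k]
                    norm += w
            row.append(acc / norm + x[i][j])
        hidden.append(row)
        steps += 1
    return hidden, steps


if __name__ == "__main__":
    grid = [[1.0] * 4 for _ in range(4)]  # N = 16 elements
    hidden, steps = line_scan(grid)
    print(steps)        # sequential steps: one per row
    print(hidden[-1])   # last row after propagation
```

For a square map with N = 16 elements the scan needs 4 sequential steps, one per row, rather than 16 steps for a flattened 1D sequence; the work inside each row can run in parallel.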

Title: DiffDoctor: Diagnosing Image Diffusion Models Before Treating

Authors: Yiyang Wang, Xi Chen, Xiaogang Xu, Sihui Ji, Yu Liu, Yujun Shen, Hengshuang Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.12382
Pdf URL: https://arxiv.org/pdf/2501.12382
Copy Paste: [[2501.12382]] DiffDoctor: Diagnosing Image Diffusion Models Before Treating(https://arxiv.org/abs/2501.12382)
Keywords: quality assessment
Abstract: Despite recent progress, image diffusion models still produce artifacts. A common solution is to refine an established model with a quality assessment system, which generally rates an image in its entirety. In this work, we argue that problem-solving starts with identification, which requires the model to be aware not just of the presence of defects in an image, but also of their specific locations. Motivated by this, we propose DiffDoctor, a two-stage pipeline to assist image diffusion models in generating fewer artifacts. Concretely, the first stage develops a robust artifact detector, for which we collect a dataset of over 1M flawed synthesized images and set up an efficient human-in-the-loop annotation process, incorporating a carefully designed class-balance strategy. The learned artifact detector is then used in the second stage to tune the diffusion model by assigning a per-pixel confidence map to each synthesis. Extensive experiments on text-to-image diffusion models demonstrate the effectiveness of our artifact detector as well as the soundness of our diagnose-then-treat design.
摘要：尽管最近取得了进展，但图像扩散模型仍然会产生伪影。一种常见的解决方案是使用质量评估系统来改进已建立的模型，该系统通常对整个图像进行评级。在这项工作中，我们认为解决问题始于识别，因此要求模型不仅要知道图像中是否存在缺陷，还要知道它们的具体位置。受此启发，我们提出了 DiffDoctor，这是一个两阶段管道，可帮助图像扩散模型生成更少的伪影。具体来说，第一阶段的目标是开发一个强大的伪影检测器，为此我们收集了超过 100 万张有缺陷的合成图像的数据集，并建立了一个高效的人机注释流程，并结合了精心设计的类平衡策略。然后，学习到的伪影检测器参与第二阶段，通过为每个合成分配一个每像素置信度图来调整扩散模型。对文本到图像扩散模型的大量实验证明了我们的伪影检测器的有效性以及我们先诊断后治疗设计的合理性。

Title: Taming Teacher Forcing for Masked Autoregressive Video Generation

Authors: Deyu Zhou, Quan Sun, Yuang Peng, Kun Yan, Runpei Dong, Duomin Wang, Zheng Ge, Nan Duan, Xiangyu Zhang, Lionel M. Ni, Heung-Yeung Shum
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.12389
Pdf URL: https://arxiv.org/pdf/2501.12389
Copy Paste: [[2501.12389]] Taming Teacher Forcing for Masked Autoregressive Video Generation(https://arxiv.org/abs/2501.12389)
Keywords: generation
Abstract: We introduce MAGI, a hybrid video generation framework that combines masked modeling for intra-frame generation with causal modeling for next-frame generation. Our key innovation, Complete Teacher Forcing (CTF), conditions masked frames on complete observation frames rather than masked ones (namely Masked Teacher Forcing, MTF), enabling a smooth transition from token-level (patch-level) to frame-level autoregressive generation. CTF significantly outperforms MTF, achieving a +23% improvement in FVD scores on first-frame conditioned video prediction. To address issues like exposure bias, we employ targeted training strategies, setting a new benchmark in autoregressive video generation. Experiments show that MAGI can generate long, coherent video sequences exceeding 100 frames, even when trained on as few as 16 frames, highlighting its potential for scalable, high-quality video generation.
摘要：我们推出了 MAGI，这是一种混合视频生成框架，它将用于帧内生成的掩蔽模型与用于下一帧生成的因果模型相结合。我们的关键创新是完全教师强制 (CTF)，它以完整的观察帧而不是掩蔽帧（即掩蔽教师强制，MTF）为条件来对掩蔽帧进行条件处理，从而实现了从标记级（补丁级）到帧级自回归生成的平稳过渡。CTF 的表现明显优于 MTF，在第一帧条件视频预测中实现了 FVD 得分 +23% 的提升。为了解决曝光偏差等问题，我们采用了有针对性的训练策略，为自回归视频生成树立了新的标杆。实验表明，即使在仅使用 16 帧进行训练的情况下，MAGI 也可以生成超过 100 帧的长而连贯的视频序列，这凸显了其可扩展、高质量视频生成的潜力。
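
The difference between Complete Teacher Forcing (CTF) and Masked Teacher Forcing (MTF) described in the abstract above comes down to what the context frames look like when a masked target frame is predicted. The sketch below shows only that input construction. The token layout, the `MASK` placeholder, the function names, and the reuse of one mask pattern for every frame are assumptions for illustration, not details taken from the paper.

```python
MASK = None  # hypothetical placeholder for a masked token


def mask_frame(frame, masked_positions):
    """Replace the tokens at the given positions with the MASK placeholder."""
    return [MASK if i in masked_positions else tok for i, tok in enumerate(frame)]


def build_training_input(frames, target_index, masked_positions, mode="ctf"):
    """Return (context_frames, masked_target_frame) for one training example.

    mode="ctf": context frames are kept complete (Complete Teacher Forcing).
    mode="mtf": context frames are masked as well (Masked Teacher Forcing).
    """
    if mode not in ("ctf", "mtf"):
        raise ValueError("mode must be 'ctf' or 'mtf'")
    context = frames[:target_index]
    if mode == "mtf":
        context = [mask_frame(f, masked_positions) for f in context]
    else:
        context = [list(f) for f in context]
    target = mask_frame(frames[target_index], masked_positions)
    return context, target


if __name__ == "__main__":
    video = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # three frames of three tokens each
    print(build_training_input(video, 2, {0, 2}, mode="ctf"))
    print(build_training_input(video, 2, {0, 2}, mode="mtf"))
```

Under CTF the model sees complete earlier frames during training, which matches what it will see at generation time once those frames have been fully produced.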

Title: GPS as a Control Signal for Image Generation

Authors: Chao Feng, Ziyang Chen, Aleksander Holynski, Alexei A. Efros, Andrew Owens
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.12390
Pdf URL: https://arxiv.org/pdf/2501.12390
Copy Paste: [[2501.12390]] GPS as a Control Signal for Image Generation(https://arxiv.org/abs/2501.12390)
Keywords: generation
Abstract: We show that the GPS tags contained in photo metadata provide a useful control signal for image generation. We train GPS-to-image models and use them for tasks that require a fine-grained understanding of how images vary within a city. In particular, we train a diffusion model to generate images conditioned on both GPS and text. The learned model generates images that capture the distinctive appearance of different neighborhoods, parks, and landmarks. We also extract 3D models from 2D GPS-to-image models through score distillation sampling, using GPS conditioning to constrain the appearance of the reconstruction from each viewpoint. Our evaluations suggest that our GPS-conditioned models successfully learn to generate images that vary based on location, and that GPS conditioning improves estimated 3D structure.
摘要：我们表明，照片元数据中包含的 GPS 标签为图像生成提供了有用的控制信号。我们训练 GPS 到图像模型，并将其用于需要对城市内图像变化有细致了解的任务。具体来说，我们训练扩散模型来生成以 GPS 和文本为条件的图像。学习到的模型生成的图像可以捕捉不同街区、公园和地标的独特外观。我们还通过分数蒸馏采样从 2D GPS 到图像模型中提取 3D 模型，使用 GPS 条件来约束从每个视点重建的外观。我们的评估表明，我们的 GPS 条件模型成功学会了生成基于位置而变化的图像，并且 GPS 条件改善了估计的 3D 结构。