2025-01-07

Title: INFELM: In-depth Fairness Evaluation of Large Text-To-Image Models

Authors: Di Jin, Xing Liu, Yu Liu, Jia Qing Yap, Andrea Wong, Adriana Crespo, Qi Lin, Zhiyuan Yin, Qiang Yan, Ryan Ye
Subjects: cs.CV, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2501.01973
Pdf URL: https://arxiv.org/pdf/2501.01973
Copy Paste: [[2501.01973]] INFELM: In-depth Fairness Evaluation of Large Text-To-Image Models(https://arxiv.org/abs/2501.01973)
Keywords: generation
Abstract: The rapid development of large language models (LLMs) and large vision models (LVMs) has propelled the evolution of multi-modal AI systems, which have demonstrated remarkable potential for industrial applications by emulating human-like cognition. However, they also pose significant ethical challenges, including amplifying harmful content and reinforcing societal biases. For instance, biases in some industrial image generation models highlighted the urgent need for robust fairness assessments. Most existing evaluation frameworks focus on the comprehensiveness of various aspects of the models, but they exhibit critical limitations, including insufficient attention to content generation alignment and social bias-sensitive domains. More importantly, their reliance on pixel-detection techniques is prone to inaccuracies. To address these issues, this paper presents INFELM, an in-depth fairness evaluation on widely-used text-to-image models. Our key contributions are: (1) an advanced skintone classifier incorporating facial topology and refined skin pixel representation to enhance classification precision by at least 16.04%, (2) a bias-sensitive content alignment measurement for understanding societal impacts, (3) a generalizable representation bias evaluation for diverse demographic groups, and (4) extensive experiments analyzing large-scale text-to-image model outputs across six social-bias-sensitive domains. We find that existing models in the study generally do not meet the empirical fairness criteria, and representation bias is generally more pronounced than alignment errors. INFELM establishes a robust benchmark for fairness assessment, supporting the development of multi-modal AI systems that align with ethical and human-centric principles.

Title: Gender Bias in Text-to-Video Generation Models: A case study of Sora

Authors: Mohammad Nadeem, Shahab Saquib Sohail, Erik Cambria, Björn W. Schuller, Amir Hussain
Subjects: cs.CV, cs.AI, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2501.01987
Pdf URL: https://arxiv.org/pdf/2501.01987
Copy Paste: [[2501.01987]] Gender Bias in Text-to-Video Generation Models: A case study of Sora(https://arxiv.org/abs/2501.01987)
Keywords: generation
Abstract: The advent of text-to-video generation models has revolutionized content creation by producing high-quality videos from textual prompts. However, concerns regarding inherent biases in such models have prompted scrutiny, particularly regarding gender representation. Our study investigates the presence of gender bias in OpenAI's Sora, a state-of-the-art text-to-video generation model. We uncover significant evidence of bias by analyzing the generated videos from a diverse set of gender-neutral and stereotypical prompts. The results indicate that Sora disproportionately associates specific genders with stereotypical behaviors and professions, which reflects societal prejudices embedded in its training data.

Title: CRRG-CLIP: Automatic Generation of Chest Radiology Reports and Classification of Chest Radiographs

Authors: Jianfei Xu, Thanet Markchom, Huizhi Liang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2501.01989
Pdf URL: https://arxiv.org/pdf/2501.01989
Copy Paste: [[2501.01989]] CRRG-CLIP: Automatic Generation of Chest Radiology Reports and Classification of Chest Radiographs(https://arxiv.org/abs/2501.01989)
Keywords: generation
Abstract: The complexity of stacked imaging and the massive number of radiographs make writing radiology reports complex and inefficient. Even highly experienced radiologists struggle to maintain accuracy and consistency in interpreting radiographs under prolonged high-intensity work. To address these issues, this work proposes the CRRG-CLIP Model (Chest Radiology Report Generation and Radiograph Classification Model), an end-to-end model for automated report generation and radiograph classification. The model consists of two modules: the radiology report generation module and the radiograph classification module. The generation module uses Faster R-CNN to identify anatomical regions in radiographs, a binary classifier to select key regions, and GPT-2 to generate semantically coherent reports. The classification module uses the unsupervised Contrastive Language Image Pretraining (CLIP) model, addressing the challenges of high-cost labelled datasets and insufficient features. The results show that the generation module performs comparably to high-performance baseline models on BLEU, METEOR, and ROUGE-L metrics, and outperforms the GPT-4o model on BLEU-2, BLEU-3, BLEU-4, and ROUGE-L metrics. The classification module significantly surpasses the state-of-the-art model in AUC and Accuracy. This demonstrates that the proposed model achieves high accuracy, readability, and fluency in report generation, while multimodal contrastive training with unlabelled radiograph-report pairs enhances classification performance.
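Editor's sketch: the abstract says the classification module builds on CLIP but does not spell out the inference rule. One common way to classify with a CLIP-style model is to compare an image embedding against one text embedding per label and take a softmax over the cosine similarities. The Python below illustrates only that generic idea; the function names, the toy embeddings, and the temperature value are assumptions for illustration, not details from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_probs(image_emb, text_embs, temperature=0.07):
    """Softmax over cosine similarities between one image and each label prompt.

    text_embs maps a label name to its (hypothetical) text embedding.
    """
    logits = [cosine(image_emb, t) / temperature for t in text_embs.values()]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return {label: e / total for label, e in zip(text_embs, exps)}


# Toy, made-up embeddings; a real system would obtain them from the encoders.
probs = zero_shot_probs([1.0, 0.1], {"finding": [0.9, 0.2], "no finding": [0.1, 1.0]})
print(max(probs, key=probs.get))
```

A lower temperature sharpens the distribution toward the best-matching label; the value used here is only a placeholder.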

Title: Towards Sustainable Large Language Model Serving

Authors: Sophia Nguyen, Beihao Zhou, Yi Ding, Sihang Liu
Subjects: cs.LG, cs.DC
Abstract URL: https://arxiv.org/abs/2501.01990
Pdf URL: https://arxiv.org/pdf/2501.01990
Copy Paste: [[2501.01990]] Towards Sustainable Large Language Model Serving(https://arxiv.org/abs/2501.01990)
Keywords: generation
Abstract: In this work, we study LLMs from a carbon emission perspective, addressing both operational and embodied emissions, and paving the way for sustainable LLM serving. We characterize the performance and energy of LLaMA with 1B, 3B, and 7B parameters using two Nvidia GPU types, a latest-generation RTX6000 Ada and an older-generation T4. We analytically model operational carbon emissions based on energy consumption and carbon intensities from three grid regions -- each representing a different energy source mix, and embodied carbon emissions based on chip area and memory size. Our characterization and modeling provide us with an in-depth understanding of the performance, energy, and carbon emissions of LLM serving. Our findings highlight the potential for optimizing sustainable LLM serving systems by considering both operational and embodied carbon emissions simultaneously.
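Editor's sketch: the abstract models operational carbon from energy consumption and grid carbon intensity. The basic relationship (emissions = energy used x carbon intensity of the grid) can be written in a few lines of Python. Everything concrete below is a hypothetical placeholder: the function names, the region labels, the intensity figures, and the linear amortization of embodied carbon over hardware lifetime are assumptions, not values or formulas taken from the paper.

```python
def operational_carbon_g(energy_kwh: float, intensity_g_per_kwh: float) -> float:
    """Operational emissions in gCO2e: energy used times grid carbon intensity."""
    return energy_kwh * intensity_g_per_kwh


def amortized_embodied_carbon_g(total_embodied_g: float,
                                lifetime_hours: float,
                                busy_hours: float) -> float:
    """Share of a device's embodied emissions attributed to a workload.

    Linear amortization over the device lifetime is a simplifying assumption.
    """
    return total_embodied_g * (busy_hours / lifetime_hours)


# Hypothetical grid regions with made-up intensities (gCO2e per kWh).
GRID_INTENSITY = {"low_carbon": 50.0, "mixed": 300.0, "fossil_heavy": 700.0}

energy_kwh = 0.2  # hypothetical energy drawn by a serving workload
for region, intensity in GRID_INTENSITY.items():
    print(region, operational_carbon_g(energy_kwh, intensity))
```

Under this simple model, the same workload emits proportionally more in a carbon-intensive region, while the embodied share depends only on how long the hardware is occupied.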

Title: SmartSpatial: Enhancing the 3D Spatial Arrangement Capabilities of Stable Diffusion Models and Introducing a Novel 3D Spatial Evaluation Framework

Authors: Mao Xun Huang, Hen-Hsen Huang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2501.01998
Pdf URL: https://arxiv.org/pdf/2501.01998
Copy Paste: [[2501.01998]] SmartSpatial: Enhancing the 3D Spatial Arrangement Capabilities of Stable Diffusion Models and Introducing a Novel 3D Spatial Evaluation Framework(https://arxiv.org/abs/2501.01998)
Keywords: generation
Abstract: Stable Diffusion models have made remarkable strides in generating photorealistic images from text prompts but often falter when tasked with accurately representing complex spatial arrangements, particularly involving intricate 3D relationships. To address this limitation, we introduce SmartSpatial, an innovative approach that enhances the spatial arrangement capabilities of Stable Diffusion models through 3D-aware conditioning and attention-guided mechanisms. SmartSpatial incorporates depth information and employs cross-attention control to ensure precise object placement, delivering notable improvements in spatial accuracy metrics. In conjunction with SmartSpatial, we present SmartSpatialEval, a comprehensive evaluation framework designed to assess spatial relationships. This framework utilizes vision-language models and graph-based dependency parsing for performance analysis. Experimental results on the COCO and SpatialPrompts datasets show that SmartSpatial significantly outperforms existing methods, setting new benchmarks for spatial arrangement accuracy in image generation.

Title: On the Utility of Equivariance and Symmetry Breaking in Deep Learning Architectures on Point Clouds

Authors: Sharvaree Vadgama, Mohammad Mohaiminul Islam, Domas Buracus, Christian Shewmake, Erik Bekkers
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2501.01999
Pdf URL: https://arxiv.org/pdf/2501.01999
Copy Paste: [[2501.01999]] On the Utility of Equivariance and Symmetry Breaking in Deep Learning Architectures on Point Clouds(https://arxiv.org/abs/2501.01999)
Keywords: generation
Abstract: This paper explores the key factors that influence the performance of models working with point clouds, across different tasks of varying geometric complexity. In this work, we explore the trade-offs between flexibility and weight-sharing introduced by equivariant layers, assessing when equivariance boosts or detracts from performance. It is often argued that providing more information as input improves a model's performance. However, if this additional information breaks certain properties, such as $\mathrm{SE}(3)$ equivariance, does it remain beneficial? We identify the key aspects of equivariant and non-equivariant architectures that drive success in different tasks by benchmarking them on segmentation, regression, and generation tasks across multiple datasets with increasing complexity. We observe a positive impact of equivariance, which becomes more pronounced with increasing task complexity, even when strict equivariance is not required.

Title: Communication Efficient Cooperative Edge AI via Event-Triggered Computation Offloading

Authors: You Zhou, Changsheng You, Kaibin Huang
Subjects: cs.LG, eess.IV, eess.SP
Abstract URL: https://arxiv.org/abs/2501.02001
Pdf URL: https://arxiv.org/pdf/2501.02001
Copy Paste: [[2501.02001]] Communication Efficient Cooperative Edge AI via Event-Triggered Computation Offloading(https://arxiv.org/abs/2501.02001)
Keywords: generation
Abstract: Rare events, despite their infrequency, often carry critical information and require immediate attention in mission-critical applications such as autonomous driving, healthcare, and industrial automation. The data-intensive nature of these tasks and their need for prompt responses, combined with the design of edge AI (or edge inference), pose significant challenges for systems and techniques. Existing edge inference approaches often suffer from communication bottlenecks due to high-dimensional data transmission and fail to provide timely responses to rare events, limiting their effectiveness for mission-critical applications in the sixth-generation (6G) mobile networks. To overcome these challenges, we propose a channel-adaptive, event-triggered edge-inference framework that prioritizes efficient rare-event processing. Central to this framework is a dual-threshold, multi-exit architecture, which enables early local inference for rare events detected locally while offloading more complex rare events to edge servers for detailed classification. To further enhance the system's performance, we developed a channel-adaptive offloading policy paired with an online algorithm to dynamically determine the optimal confidence thresholds for controlling offloading decisions. The associated optimization problem is solved by reformulating the original non-convex function into an equivalent strongly convex one. Using deep neural network classifiers and real medical datasets, our experiments demonstrate that the proposed framework not only achieves superior rare-event classification accuracy, but also effectively reduces communication overhead, as opposed to existing edge-inference approaches.
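Editor's sketch: the abstract describes a dual-threshold architecture in which confident cases exit locally and harder ones are offloaded. One plausible reading, shown below in Python, is a two-threshold rule on a local rare-event confidence score. The three-way split, the `Action` names, and the threshold semantics are assumptions made for illustration; the abstract does not define them, and the paper's channel-adaptive threshold selection is not modeled here.

```python
from enum import Enum


class Action(Enum):
    DISCARD = "discard"        # treated as a normal (non-rare) event
    LOCAL_EXIT = "local_exit"  # confident enough to decide on the device
    OFFLOAD = "offload"        # ambiguous: send to the edge server


def decide(confidence: float, lower: float, upper: float) -> Action:
    """Map a local rare-event confidence score to an action using two thresholds."""
    if not 0.0 <= lower <= upper <= 1.0:
        raise ValueError("thresholds must satisfy 0 <= lower <= upper <= 1")
    if confidence < lower:
        return Action.DISCARD
    if confidence >= upper:
        return Action.LOCAL_EXIT
    return Action.OFFLOAD


for score in (0.1, 0.5, 0.9):
    print(score, decide(score, lower=0.3, upper=0.8).value)
```

In this reading, widening the gap between the two thresholds sends more inputs to the server, which raises communication cost; a channel-adaptive policy would presumably adjust that gap as link quality changes.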

Title: Information Subtraction: Learning Representations for Conditional Entropy

Authors: Keng Hou Leong, Yuxuan Xiu, Wai Kin (Victor) Chan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2501.02012
Pdf URL: https://arxiv.org/pdf/2501.02012
Copy Paste: [[2501.02012]] Information Subtraction: Learning Representations for Conditional Entropy(https://arxiv.org/abs/2501.02012)
Keywords: generative
Abstract: The representations of conditional entropy and conditional mutual information are significant in explaining the unique effects among variables. While previous studies based on conditional contrastive sampling have effectively removed information regarding discrete sensitive variables, they have not yet extended their scope to continuous cases. This paper introduces Information Subtraction, a framework designed to generate representations that preserve desired information while eliminating the undesired. We implement a generative-based architecture that outputs these representations by simultaneously maximizing an information term and minimizing another. With its flexibility in disentangling information, we can iteratively apply Information Subtraction to represent arbitrary information components between continuous variables, thereby explaining the various relationships that exist between them. Our results highlight the representations' ability to provide semantic features of conditional entropy. By subtracting sensitive and domain-specific information, our framework demonstrates effective performance in fair learning and domain generalization. The code for this paper is available at this https URL
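Editor's sketch: the quantities named in the abstract, conditional entropy and mutual information, are linked by the standard identities H(Y|X) = H(X,Y) - H(X) and I(X;Y) = H(X) + H(Y) - H(X,Y). The Python below evaluates both on a small discrete joint distribution. It is only a discrete illustration of the definitions; the paper targets continuous variables with learned representations, which this sketch does not attempt, and the function names are hypothetical.

```python
import math
from collections import defaultdict


def entropy(dist):
    """Shannon entropy in bits of a dict mapping outcomes to probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def marginal(joint, axis):
    """Marginalize a joint dict {(x, y): p} onto axis 0 (X) or axis 1 (Y)."""
    out = defaultdict(float)
    for pair, p in joint.items():
        out[pair[axis]] += p
    return dict(out)


def conditional_entropy(joint):
    """H(Y|X) = H(X, Y) - H(X)."""
    return entropy(joint) - entropy(marginal(joint, 0))


def mutual_information(joint):
    """I(X; Y) = H(X) + H(Y) - H(X, Y)."""
    return entropy(marginal(joint, 0)) + entropy(marginal(joint, 1)) - entropy(joint)


# Y is an exact copy of a fair bit X: nothing about Y remains once X is known.
copy_joint = {(0, 0): 0.5, (1, 1): 0.5}
print(conditional_entropy(copy_joint), mutual_information(copy_joint))
```

When X and Y are independent the picture flips: H(Y|X) equals H(Y) and the mutual information is zero.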

Title: Machine Learning-Based Differential Diagnosis of Parkinson's Disease Using Kinematic Feature Extraction and Selection

Authors: Masahiro Matsumoto, Abu Saleh Musa Miah, Nobuyoshi Asai, Jungpil Shin
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2501.02014
Pdf URL: https://arxiv.org/pdf/2501.02014
Copy Paste: [[2501.02014]] Machine Learning-Based Differential Diagnosis of Parkinson's Disease Using Kinematic Feature Extraction and Selection(https://arxiv.org/abs/2501.02014)
Keywords: generative
Abstract: Parkinson's disease (PD), the second most common neurodegenerative disorder, is characterized by dopaminergic neuron loss and the accumulation of abnormal synuclein. PD presents both motor and non-motor symptoms that progressively impair daily functioning. The severity of these symptoms is typically assessed using the MDS-UPDRS rating scale, which is subjective and dependent on the physician's experience. Additionally, PD shares symptoms with other neurodegenerative diseases, such as progressive supranuclear palsy (PSP) and multiple system atrophy (MSA), complicating accurate diagnosis. To address these diagnostic challenges, we propose a machine learning-based system for differential diagnosis of PD, PSP, MSA, and healthy controls (HC). This system utilizes a kinematic feature-based hierarchical feature extraction and selection approach. Initially, 18 kinematic features are extracted, including two newly proposed features: Thumb-to-index vector velocity and acceleration, which provide insights into motor control patterns. In addition, 41 statistical features are extracted from each kinematic feature, including some new approaches such as Average Absolute Change, Rhythm, Amplitude, Frequency, Standard Deviation of Frequency, and Slope. Feature selection is performed using One-way ANOVA to rank features, followed by Sequential Forward Floating Selection (SFFS) to identify the most relevant ones, aiming to reduce the computational complexity. The final feature set is used for classification, achieving a classification accuracy of 66.67% for each dataset and 88.89% for each patient, with particularly high performance for the MSA and HC groups using the SVM algorithm. This system shows potential as a rapid and accurate diagnostic tool in clinical practice, though further data collection and refinement are needed to enhance its reliability.
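Editor's sketch: the first selection stage named in the abstract ranks features with one-way ANOVA. The Python below computes the textbook F-statistic (between-group mean square over within-group mean square) per feature and sorts features by it. The function names and the toy data are hypothetical, and the later SFFS and SVM stages are not implemented here.

```python
def anova_f(groups):
    """One-way ANOVA F-statistic for a list of groups (each a list of floats)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    if ss_within == 0:
        # Degenerate case: no within-group variance.
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


def rank_features(X, y):
    """Return feature indices sorted by descending F-score.

    X is a list of samples (each a list of feature values); y holds class labels.
    """
    labels = sorted(set(y))
    scores = []
    for j in range(len(X[0])):
        groups = [[row[j] for row, lab in zip(X, y) if lab == c] for c in labels]
        scores.append(anova_f(groups))
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)


# Toy data: feature 0 separates the two classes, feature 1 does not.
X = [[1, 5], [2, 3], [3, 4], [4, 4], [5, 5], [6, 3]]
y = [0, 0, 0, 1, 1, 1]
print(rank_features(X, y))
```

A ranking like this only scores features one at a time; a wrapper method such as SFFS would then search over subsets of the top-ranked features.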

Title: Spot Risks Before Speaking! Unraveling Safety Attention Heads in Large Vision-Language Models

Authors: Ziwei Zheng, Junyao Zhao, Le Yang, Lijun He, Fan Li
Subjects: cs.LG, cs.AI, cs.CR, cs.CV
Abstract URL: https://arxiv.org/abs/2501.02029
Pdf URL: https://arxiv.org/pdf/2501.02029
Copy Paste: [[2501.02029]] Spot Risks Before Speaking! Unraveling Safety Attention Heads in Large Vision-Language Models(https://arxiv.org/abs/2501.02029)
Keywords: generation
Abstract: With the integration of an additional modality, large vision-language models (LVLMs) exhibit greater vulnerability to safety risks (e.g., jailbreaking) compared to their language-only predecessors. Although recent studies have devoted considerable effort to the post-hoc alignment of LVLMs, the inner safety mechanisms remain largely unexplored. In this paper, we discover that internal activations of LVLMs during the first token generation can effectively identify malicious prompts across different attacks. This inherent safety perception is governed by sparse attention heads, which we term "safety heads." Further analysis reveals that these heads act as specialized shields against malicious prompts; ablating them leads to higher attack success rates, while the model's utility remains unaffected. By locating these safety heads and concatenating their activations, we construct a straightforward but powerful malicious prompt detector that integrates seamlessly into the generation process with minimal extra inference overhead. Despite its simple structure of a logistic regression model, the detector surprisingly exhibits strong zero-shot generalization capabilities. Experiments across various prompt-based attacks confirm the effectiveness of leveraging safety heads to protect LVLMs. Code is available at this https URL.
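Editor's sketch: the detector described in the abstract is a logistic regression over concatenated activations of selected attention heads. The Python below shows that shape on toy data: flatten per-head vectors into one feature vector, fit a logistic regression by gradient descent, and threshold the output. The activation values, function names, learning rate, and epoch count are all made up for illustration; how the paper locates safety heads or extracts activations is not shown.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def concat_heads(head_activations):
    """Flatten a list of per-head activation vectors into one feature vector."""
    return [x for head in head_activations for x in head]


def train_logreg(features, labels, lr=0.5, epochs=500):
    """Fit weights and bias by batch gradient descent on the log loss."""
    dim = len(features[0])
    w, b = [0.0] * dim, 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_malicious(w, b, x, threshold=0.5):
    """Return True when the predicted probability reaches the threshold."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= threshold


# Toy activations for two hypothetical heads; label 1 marks a malicious prompt.
benign = [[[0.1, 0.0], [0.2, 0.1]], [[0.0, 0.1], [0.1, 0.2]]]
malicious = [[[0.9, 1.0], [0.8, 0.9]], [[1.0, 0.9], [0.9, 0.8]]]
features = [concat_heads(h) for h in benign + malicious]
labels = [0, 0, 1, 1]
w, b = train_logreg(features, labels)
print([predict_malicious(w, b, x) for x in features])
```

Because the classifier is a single linear layer over a short feature vector, scoring a prompt costs one dot product, which is consistent with the low inference overhead the abstract mentions.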

Title: MRG: A Multi-Robot Manufacturing Digital Scene Generation Method Using Multi-Instance Point Cloud Registration

Authors: Songjie Han, Yinhua Liu, Yanzheng Li, Hua Chen, Dongmei Yang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2501.02041
Pdf URL: https://arxiv.org/pdf/2501.02041
Copy Paste: [[2501.02041]] MRG: A Multi-Robot Manufacturing Digital Scene Generation Method Using Multi-Instance Point Cloud Registration(https://arxiv.org/abs/2501.02041)
Keywords: generation
Abstract: A high-fidelity digital simulation environment is crucial for accurately replicating physical operational processes. However, inconsistencies between simulation and physical environments result in low confidence in simulation outcomes, limiting their effectiveness in guiding real-world production. Unlike the traditional step-by-step point cloud "segmentation-registration" generation method, this paper introduces, for the first time, a novel Multi-Robot Manufacturing Digital Scene Generation (MRG) method that leverages multi-instance point cloud registration, specifically within manufacturing scenes. Tailored to the characteristics of industrial robots and manufacturing settings, an instance-focused transformer module is developed to delineate instance boundaries and capture correlations between local regions. Additionally, a hypothesis generation module is proposed to extract target instances while preserving key features. Finally, an efficient screening and optimization algorithm is designed to refine the final registration results. Experimental evaluations on the Scan2CAD and Welding-Station datasets demonstrate that: (1) the proposed method outperforms existing multi-instance point cloud registration techniques; (2) on the Scan2CAD dataset, MR and MP improve by 12.15% and 17.79%, respectively, over state-of-the-art methods; and (3) on the Welding-Station dataset, MR and MP are enhanced by 16.95% and 24.15%, respectively. This work marks the first application of multi-instance point cloud registration in manufacturing scenes, significantly advancing the precision and reliability of digital simulation environments for industrial applications.

Title: DreamMask: Boosting Open-vocabulary Panoptic Segmentation with Synthetic Data

Authors: Yuanpeng Tu, Xi Chen, Ser-Nam Lim, Hengshuang Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.02048
Pdf URL: https://arxiv.org/pdf/2501.02048
Copy Paste: [[2501.02048]] DreamMask: Boosting Open-vocabulary Panoptic Segmentation with Synthetic Data(https://arxiv.org/abs/2501.02048)
Keywords: generation
Abstract: Open-vocabulary panoptic segmentation has received significant attention due to its applicability in the real world. Despite claims of robust generalization, we find that the advancements of previous works are attributable mainly to trained categories, exposing a lack of generalization to novel classes. In this paper, we explore boosting existing models from a data-centric perspective. We propose DreamMask, which systematically explores how to generate training data in the open-vocabulary setting, and how to train the model with both real and synthetic data. For the first part, we propose an automatic data generation pipeline with off-the-shelf models. We propose crucial designs for vocabulary expansion, layout arrangement, data filtering, etc. Equipped with these techniques, our generated data could significantly outperform the manually collected web data. To train the model with generated data, a synthetic-real alignment loss is designed to bridge the representation gap, bringing noticeable improvements across multiple benchmarks. In general, DreamMask significantly simplifies the collection of large-scale training data, serving as a plug-and-play enhancement for existing methods. For instance, when trained on COCO and tested on ADE20K, the model equipped with DreamMask outperforms the previous state-of-the-art by a substantial margin of 2.1% mIoU.

Title: Active Learning Enables Extrapolation in Molecular Generative Models

Authors: Evan R. Antoniuk, Peggy Li, Nathan Keilbart, Stephen Weitzner, Bhavya Kailkhura, Anna M. Hiszpanski
Subjects: cs.LG, cond-mat.mtrl-sci, physics.chem-ph
Abstract URL: https://arxiv.org/abs/2501.02059
Pdf URL: https://arxiv.org/pdf/2501.02059
Copy Paste: [[2501.02059]] Active Learning Enables Extrapolation in Molecular Generative Models(https://arxiv.org/abs/2501.02059)
Keywords: generation, generative
Abstract: Although generative models hold promise for discovering molecules with optimized desired properties, they often fail to suggest synthesizable molecules that improve upon the known molecules seen in training. We find that a key limitation is not in the molecule generation process itself, but in the poor generalization capabilities of molecular property predictors. We tackle this challenge by creating an active-learning, closed-loop molecule generation pipeline, whereby molecular generative models are iteratively refined on feedback from quantum chemical simulations to improve generalization to new chemical space. Compared against other generative model approaches, only our active learning approach generates molecules with properties that extrapolate beyond the training data (reaching up to 0.44 standard deviations beyond the training data range) and out-of-distribution molecule classification accuracy is improved by 79%. By conditioning molecular generation on thermodynamic stability data from the active-learning loop, the proportion of stable molecules generated is 3.5x higher than the next-best model.

Title: Plasma-CycleGAN: Plasma Biomarker-Guided MRI to PET Cross-modality Translation Using Conditional CycleGAN

Authors: Yanxi Chen, Yi Su, Celine Dumitrascu, Kewei Chen, David Weidman, Richard J Caselli, Nicholas Ashton, Eric M Reiman, Yalin Wang
Subjects: cs.CV, cs.AI, q-bio.NC
Abstract URL: https://arxiv.org/abs/2501.02146
Pdf URL: https://arxiv.org/pdf/2501.02146
Copy Paste: [[2501.02146]] Plasma-CycleGAN: Plasma Biomarker-Guided MRI to PET Cross-modality Translation Using Conditional CycleGAN(https://arxiv.org/abs/2501.02146)
Keywords: generative
Abstract: Cross-modality translation between MRI and PET imaging is challenging due to the distinct mechanisms underlying these modalities. Blood-based biomarkers (BBBMs) are revolutionizing Alzheimer's disease (AD) detection by identifying patients and quantifying brain amyloid levels. However, the potential of BBBMs to enhance PET image synthesis remains unexplored. In this paper, we performed a thorough study on the effect of incorporating BBBMs into deep generative models. By evaluating three widely used cross-modality translation models, we found that BBBM integration consistently enhances the generative quality across all models. By visual inspection of the generated results, we observed that PET images generated by CycleGAN exhibit the best visual fidelity. Based on these findings, we propose Plasma-CycleGAN, a novel generative model based on CycleGAN, to synthesize PET images from MRI using BBBMs as conditions. This is the first approach to integrate BBBMs in conditional cross-modality translation between MRI and PET.

Title: Generating Multimodal Images with GAN: Integrating Text, Image, and Style

Authors: Chaoyi Tan, Wenqing Zhang, Zhen Qi, Kowei Shih, Xinshi Li, Ao Xiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.02167
Pdf URL: https://arxiv.org/pdf/2501.02167
Copy Paste: [[2501.02167]] Generating Multimodal Images with GAN: Integrating Text, Image, and Style(https://arxiv.org/abs/2501.02167)
Keywords: generation, generative
Abstract: In the field of computer vision, multimodal image generation has become a research hotspot, especially the task of integrating text, image, and style. In this study, we propose a multimodal image generation method based on Generative Adversarial Networks (GAN), capable of effectively combining text descriptions, reference images, and style information to generate images that meet multimodal requirements. This method involves the design of a text encoder, an image feature extractor, and a style integration module, ensuring that the generated images maintain high quality in terms of visual content and style consistency. We also introduce multiple loss functions, including adversarial loss, text-image consistency loss, and style matching loss, to optimize the generation process. Experimental results show that our method produces images with high clarity and consistency across multiple public datasets, demonstrating significant performance improvements compared to existing methods. The outcomes of this study provide new insights into multimodal image generation and present broad application prospects.

Title: Benchmark Evaluations, Applications, and Challenges of Large Vision Language Models: A Survey

Authors: Zongxia Li, Xiyang Wu, Hongyang Du, Huy Nghiem, Guangyao Shi
Subjects: cs.CV, cs.AI, cs.CL, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2501.02189
Pdf URL: https://arxiv.org/pdf/2501.02189
Copy Paste: [[2501.02189]] Benchmark Evaluations, Applications, and Challenges of Large Vision Language Models: A Survey(https://arxiv.org/abs/2501.02189)
Keywords: generation
Abstract: Multimodal Vision Language Models (VLMs) have emerged as a transformative technology at the intersection of computer vision and natural language processing, enabling machines to perceive and reason about the world through both visual and textual modalities. For example, models such as CLIP, Claude, and GPT-4V demonstrate strong reasoning and understanding abilities on visual and textual data and beat classical single modality vision models on zero-shot classification. Despite their rapid advancements in research and growing popularity in applications, a comprehensive survey of existing studies on VLMs is notably lacking, particularly for researchers aiming to leverage VLMs in their specific domains. To this end, we provide a systematic overview of VLMs in the following aspects: model information of the major VLMs developed over the past five years (2019-2024); the main architectures and training methods of these VLMs; summary and categorization of the popular benchmarks and evaluation metrics of VLMs; the applications of VLMs including embodied agents, robotics, and video generation; the challenges and issues faced by current VLMs such as hallucination, fairness, and safety. Detailed collections including papers and model repository links are listed in this https URL.

Title: Unsupervised Class Generation to Expand Semantic Segmentation Datasets

Authors: Javier Montalvo, Álvaro García-Martín, Pablo Carballeira, Juan C. SanMiguel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.02264
Pdf URL: https://arxiv.org/pdf/2501.02264
Copy Paste: [[2501.02264]] Unsupervised Class Generation to Expand Semantic Segmentation Datasets(https://arxiv.org/abs/2501.02264)
Keywords: generation, generative
Abstract: Semantic segmentation is a computer vision task where classification is performed at a pixel level. Due to this, the process of labeling images for semantic segmentation is time-consuming and expensive. To mitigate this cost there has been a surge in the use of synthetically generated data -- usually created using simulators or video games -- which, in combination with domain adaptation methods, can effectively learn how to segment real data. Still, these datasets have a particular limitation: due to their closed-set nature, it is not possible to include novel classes without modifying the tool used to generate them, which is often not public. Concurrently, generative models have made remarkable progress, particularly with the introduction of diffusion models, enabling the creation of high-quality images from text prompts without additional supervision. In this work, we propose an unsupervised pipeline that leverages Stable Diffusion and the Segment Anything Model to generate class examples with an associated segmentation mask, and a method to integrate generated cutouts for novel classes in semantic segmentation datasets, all with minimal user input. Our approach aims to improve the performance of unsupervised domain adaptation methods by introducing novel samples into the training data without modifications to the underlying algorithms. With our methods, we show how models can not only effectively learn how to segment novel classes, with an average performance of 51% IoU, but also reduce errors for other, already existing classes, reaching a higher performance level overall.

Title: TDM: Temporally-Consistent Diffusion Model for All-in-One Real-World Video Restoration

Authors: Yizhou Li, Zihua Liu, Yusuke Monno, Masatoshi Okutomi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.02269
Pdf URL: https://arxiv.org/pdf/2501.02269
Copy Paste: [[2501.02269]] TDM: Temporally-Consistent Diffusion Model for All-in-One Real-World Video Restoration(https://arxiv.org/abs/2501.02269)
Keywords: restoration
Abstract: In this paper, we propose the first diffusion-based all-in-one video restoration method that utilizes the power of a pre-trained Stable Diffusion and a fine-tuned ControlNet. Our method can restore various types of video degradation with a single unified model, overcoming the limitation of standard methods that require specific models for each restoration task. Our contributions include an efficient training strategy with Task Prompt Guidance (TPG) for diverse restoration tasks, an inference strategy that combines Denoising Diffusion Implicit Models (DDIM) inversion with a novel Sliding Window Cross-Frame Attention (SW-CFA) mechanism for enhanced content preservation and temporal consistency, and a scalable pipeline that makes our method all-in-one to adapt to different video restoration tasks. Through extensive experiments on five video restoration tasks, we demonstrate the superiority of our method in generalization capability to real-world videos and temporal consistency preservation over existing state-of-the-art methods. Our method advances the video restoration task by providing a unified solution that enhances video quality across multiple applications.

Title: DiffGraph: Heterogeneous Graph Diffusion Model

Authors: Zongwei Li, Lianghao Xia, Hua Hua, Shijie Zhang, Shuangyang Wang, Chao Huang
Subjects: cs.LG, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2501.02313
Pdf URL: https://arxiv.org/pdf/2501.02313
Copy Paste: [[2501.02313]] DiffGraph: Heterogeneous Graph Diffusion Model(https://arxiv.org/abs/2501.02313)
Keywords: generation
Abstract: Recent advances in Graph Neural Networks (GNNs) have revolutionized graph-structured data modeling, yet traditional GNNs struggle with complex heterogeneous structures prevalent in real-world scenarios. Despite progress in handling heterogeneous interactions, two fundamental challenges persist: noisy data significantly compromising embedding quality and learning performance, and existing methods' inability to capture intricate semantic transitions among heterogeneous relations, which impacts downstream predictions. To address these fundamental issues, we present the Heterogeneous Graph Diffusion Model (DiffGraph), a pioneering framework that introduces an innovative cross-view denoising strategy. This advanced approach transforms auxiliary heterogeneous data into target semantic spaces, enabling precise distillation of task-relevant information. At its core, DiffGraph features a sophisticated latent heterogeneous graph diffusion mechanism, implementing a novel forward and backward diffusion process for superior noise management. This methodology achieves simultaneous heterogeneous graph denoising and cross-type transition, while significantly simplifying graph generation through its latent-space diffusion capabilities. Through rigorous experimental validation on both public and industrial datasets, we demonstrate that DiffGraph consistently surpasses existing methods in link prediction and node classification tasks, establishing new benchmarks for robustness and efficiency in heterogeneous graph processing. The model implementation is publicly available at: this https URL.

Title: Optimizing Small Language Models for In-Vehicle Function-Calling

Authors: Yahya Sowti Khiabani, Farris Atif, Chieh Hsu, Sven Stahlmann, Tobias Michels, Sebastian Kramer, Benedikt Heidrich, M. Saquib Sarfraz, Julian Merten, Faezeh Tafazzoli
Subjects: cs.LG, cs.AI, cs.CL, cs.CV, cs.HC
Abstract URL: https://arxiv.org/abs/2501.02342
Pdf URL: https://arxiv.org/pdf/2501.02342
Copy Paste: [[2501.02342]] Optimizing Small Language Models for In-Vehicle Function-Calling(https://arxiv.org/abs/2501.02342)
Keywords: generation
Abstract: We propose a holistic approach for deploying Small Language Models (SLMs) as function-calling agents within vehicles as edge devices, offering a more flexible and robust alternative to traditional rule-based systems. By leveraging SLMs, we simplify vehicle control mechanisms and enhance the user experience. Given the in-vehicle hardware constraints, we apply state-of-the-art model compression techniques, including structured pruning, healing, and quantization, ensuring that the model fits within the resource limitations while maintaining acceptable performance. Our work focuses on optimizing a representative SLM, Microsoft's Phi-3 mini, and outlines best practices for enabling embedded models, including compression, task-specific fine-tuning, and vehicle integration. We demonstrate that, despite significant reduction in model size which removes up to 2 billion parameters from the original model, our approach preserves the model's ability to handle complex in-vehicle tasks accurately and efficiently. Furthermore, by executing the model in a lightweight runtime environment, we achieve a generation speed of 11 tokens per second, making real-time, on-device inference feasible without hardware acceleration. Our results demonstrate the potential of SLMs to transform vehicle control systems, enabling more intuitive interactions between users and their vehicles for an enhanced driving experience.

Title: Generalizable Origin Identification for Text-Guided Image-to-Image Diffusion Models

Authors: Wenhao Wang, Yifan Sun, Zongxin Yang, Zhentao Tan, Zhengdong Hu, Yi Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.02376
Pdf URL: https://arxiv.org/pdf/2501.02376
Copy Paste: [[2501.02376]] Generalizable Origin Identification for Text-Guided Image-to-Image Diffusion Models(https://arxiv.org/abs/2501.02376)
Keywords: generation
Abstract: Text-guided image-to-image diffusion models excel in translating images based on textual prompts, allowing for precise and creative visual modifications. However, such a powerful technique can be misused for spreading misinformation, infringing on copyrights, and evading content tracing. This motivates us to introduce the task of origin IDentification for text-guided Image-to-image Diffusion models (ID$^2$), aiming to retrieve the original image of a given translated query. A straightforward solution to ID$^2$ involves training a specialized deep embedding model to extract and compare features from both query and reference images. However, due to visual discrepancy across generations produced by different diffusion models, this similarity-based approach fails when training on images from one model and testing on those from another, limiting its effectiveness in real-world applications. To solve this challenge of the proposed ID$^2$ task, we contribute the first dataset and a theoretically guaranteed method, both emphasizing generalizability. The curated dataset, OriPID, contains abundant Origins and guided Prompts, which can be used to train and test potential IDentification models across various diffusion models. In the method section, we first prove the existence of a linear transformation that minimizes the distance between the pre-trained Variational Autoencoder (VAE) embeddings of generated samples and their origins. Subsequently, it is demonstrated that such a simple linear transformation can be generalized across different diffusion models. Experimental results show that the proposed method achieves satisfactory generalization performance, significantly surpassing similarity-based methods (+31.6% mAP), even those with generalization designs.
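The idea of a linear transformation that pulls the embedding of a generated image toward the embedding of its origin can be illustrated with a toy least-squares fit. The sketch below is not the paper's method: the embeddings are small hand-written vectors rather than VAE outputs, the map is assumed to act on the generated embedding only, and the names `fit_linear_map` and `retrieve_origin` are hypothetical.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(W: Matrix, x: Vector) -> Vector:
    """Multiply matrix W by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def fit_linear_map(generated: Matrix, origins: Matrix,
                   lr: float = 0.1, steps: int = 2000) -> Matrix:
    """Fit W minimizing the mean of ||W g_i - o_i||^2 by gradient descent."""
    n, d_in, d_out = len(generated), len(generated[0]), len(origins[0])
    W = [[0.0] * d_in for _ in range(d_out)]
    for _ in range(steps):
        grad = [[0.0] * d_in for _ in range(d_out)]
        for g, o in zip(generated, origins):
            residual = [p - t for p, t in zip(matvec(W, g), o)]
            for i in range(d_out):
                for j in range(d_in):
                    grad[i][j] += 2.0 * residual[i] * g[j] / n
        for i in range(d_out):
            for j in range(d_in):
                W[i][j] -= lr * grad[i][j]
    return W


def retrieve_origin(W: Matrix, query: Vector, gallery: Matrix) -> int:
    """Index of the gallery embedding closest to the transformed query."""
    z = matvec(W, query)
    return min(range(len(gallery)), key=lambda i: math.dist(z, gallery[i]))


# Toy data (assumed): origins are an exact linear image of the generated embeddings.
generated = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
origins = [[2.0, 1.0], [0.0, 1.0], [2.0, 2.0], [4.0, 3.0]]
W = fit_linear_map(generated, origins)
best = retrieve_origin(W, generated[3], origins)
```

On real embeddings a closed-form least-squares solver would normally replace the gradient loop; the loop is used here only to keep the example dependency-free.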

Title: Guiding Medical Vision-Language Models with Explicit Visual Prompts: Framework Design and Comprehensive Exploration of Prompt Variations

Authors: Kangyu Zhu, Ziyuan Qin, Huahui Yi, Zekun Jiang, Qicheng Lao, Shaoting Zhang, Kang Li
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2501.02385
Pdf URL: https://arxiv.org/pdf/2501.02385
Copy Paste: [[2501.02385]] Guiding Medical Vision-Language Models with Explicit Visual Prompts: Framework Design and Comprehensive Exploration of Prompt Variations(https://arxiv.org/abs/2501.02385)
Keywords: generation
Abstract: With the recent advancements in vision-language models (VLMs) driven by large language models (LLMs), many researchers have focused on models comprising an image encoder, an image-to-language projection layer, and a text decoder, leading to the emergence of works like LLaVA-Med. However, these works primarily operate at the whole-image level, aligning general information from 2D medical images without attending to finer details. As a result, these models often provide irrelevant or non-clinically valuable information while missing critical details. Medical vision-language tasks differ significantly from general-domain ones, particularly in their focus on fine-grained details while excluding irrelevant content. General-domain VLMs tend to prioritize global information due to their design, which compresses the entire image into a multi-token representation that is passed into the LLM decoder. Therefore, current VLMs all lack the capability to restrict their attention to particular areas. To address this critical issue in the medical domain, we introduce MedVP, a visual prompt generation and fine-tuning framework, which involves extracting medical entities, generating visual prompts, and adapting datasets for visual-prompt-guided fine-tuning. To the best of our knowledge, this is the first work to explicitly introduce visual prompts into medical VLMs, and we successfully outperform recent state-of-the-art large models across multiple medical VQA datasets. Extensive experiments are conducted to analyze the impact of different visual prompt forms and how they contribute to performance improvement. The results demonstrate both the effectiveness and clinical significance of our approach.

Title: MedSegDiffNCA: Diffusion Models With Neural Cellular Automata for Skin Lesion Segmentation

Authors: Avni Mittal, John Kalkhof, Anirban Mukhopadhyay, Arnav Bhavsar
Subjects: cs.CV, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2501.02447
Pdf URL: https://arxiv.org/pdf/2501.02447
Copy Paste: [[2501.02447]] MedSegDiffNCA: Diffusion Models With Neural Cellular Automata for Skin Lesion Segmentation(https://arxiv.org/abs/2501.02447)
Keywords: generation
Abstract: Denoising Diffusion Models (DDMs) are widely used for high-quality image generation and medical image segmentation but often rely on U-Net-based architectures, leading to high computational overhead, especially with high-resolution images. This work proposes three Neural Cellular Automata (NCA)-based improvements for diffusion-based medical image segmentation. First, Multi-MedSegDiffNCA uses a multilevel NCA framework to refine rough noise estimates generated by lower-level NCA models. Second, CBAM-MedSegDiffNCA incorporates channel and spatial attention for improved segmentation. Third, MultiCBAM-MedSegDiffNCA combines these methods with a new RGB channel loss for semantic guidance. Evaluations on lesion segmentation show that MultiCBAM-MedSegDiffNCA matches U-Net-based model performance with a Dice score of 87.84% while using 60-110 times fewer parameters, offering a more efficient solution for low-resource medical settings.
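For readers unfamiliar with the metric quoted above, the Dice score between a predicted and a reference binary mask is 2|A ∩ B| / (|A| + |B|). The following is a minimal sketch of that standard formula on tiny hand-written masks; the function name `dice_score` and the convention of returning 1.0 for two empty masks are assumptions, not details from the paper.

```python
from typing import List

Mask = List[List[int]]  # binary mask, 1 = lesion pixel, 0 = background


def dice_score(pred: Mask, target: Mask) -> float:
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for two same-shaped binary masks."""
    intersection = sum(p * t
                       for row_p, row_t in zip(pred, target)
                       for p, t in zip(row_p, row_t))
    total = sum(map(sum, pred)) + sum(map(sum, target))
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total


pred = [[1, 1], [0, 0]]
target = [[1, 0], [0, 0]]
score = dice_score(pred, target)  # one shared pixel, three foreground pixels in total
```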

Title: Noise-Tolerant Hybrid Prototypical Learning with Noisy Web Data

Authors: Chao Liang, Linchao Zhu, Zongxin Yang, Wei Chen, Yi Yang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2501.02476
Pdf URL: https://arxiv.org/pdf/2501.02476
Copy Paste: [[2501.02476]] Noise-Tolerant Hybrid Prototypical Learning with Noisy Web Data(https://arxiv.org/abs/2501.02476)
Keywords: generation
Abstract: We focus on the challenging problem of learning an unbiased classifier from a large number of potentially relevant but noisily labeled web images given only a few clean labeled images. This problem is particularly practical because it reduces the expensive annotation costs by utilizing freely accessible web images with noisy labels. Typically, prototypes are representative images or features used to classify or identify other images. However, in the few clean and many noisy scenarios, the class prototype can be severely biased due to the presence of irrelevant noisy images. The resulting prototypes are less compact and discriminative, as previous methods do not take into account the diverse range of images in the noisy web image collections. On the other hand, the relation modeling between noisy and clean images is not learned for the class prototype generation in an end-to-end manner, which results in a suboptimal class prototype. In this article, we introduce a similarity maximization loss named SimNoiPro. Our SimNoiPro first generates noise-tolerant hybrid prototypes composed of clean and noise-tolerant prototypes and then pulls them closer to each other. Our approach considers the diversity of noisy images by explicit division and overcomes the optimization discrepancy issue. This enables better relation modeling between clean and noisy images and helps extract judicious information from the noisy image set. The evaluation results on two extended few-shot classification benchmarks confirm that our SimNoiPro outperforms prior methods in measuring image relations and cleaning noisy data.
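The role of a class prototype, and the way irrelevant noisy samples can bias it, can be seen in a small nearest-prototype sketch. This illustrates the general prototype idea only, not SimNoiPro; the feature vectors are made up and the helper names `prototype` and `nearest_prototype` are hypothetical.

```python
import math
from typing import Dict, List

Vector = List[float]


def prototype(features: List[Vector]) -> Vector:
    """Class prototype taken as the mean of the class's feature vectors."""
    n = len(features)
    return [sum(column) / n for column in zip(*features)]


def nearest_prototype(x: Vector, prototypes: Dict[str, Vector]) -> str:
    """Label whose prototype is closest to x in Euclidean distance."""
    return min(prototypes, key=lambda label: math.dist(x, prototypes[label]))


clean_cat = [[0.0, 0.0], [0.0, 2.0]]
clean_dog = [[4.0, 4.0], [6.0, 4.0]]
protos = {"cat": prototype(clean_cat), "dog": prototype(clean_dog)}
label = nearest_prototype([1.0, 1.0], protos)

# Adding one irrelevant web image labeled "cat" drags the prototype away
# from the clean samples, which is the bias the abstract describes.
noisy_cat_proto = prototype(clean_cat + [[9.0, 9.0]])
```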

Title: ACE++: Instruction-Based Image Creation and Editing via Context-Aware Content Filling

Authors: Chaojie Mao, Jingfeng Zhang, Yulin Pan, Zeyinzi Jiang, Zhen Han, Yu Liu, Jingren Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.02487
Pdf URL: https://arxiv.org/pdf/2501.02487
Copy Paste: [[2501.02487]] ACE++: Instruction-Based Image Creation and Editing via Context-Aware Content Filling(https://arxiv.org/abs/2501.02487)
Keywords: generation, generative
Abstract: We report ACE++, an instruction-based diffusion framework that tackles various image generation and editing tasks. Inspired by the input format for the inpainting task proposed by FLUX.1-Fill-dev, we improve the Long-context Condition Unit (LCU) introduced in ACE and extend this input paradigm to arbitrary editing and generation tasks. To take full advantage of image generative priors, we develop a two-stage training scheme to minimize the effort of finetuning powerful text-to-image diffusion models like FLUX.1-dev. In the first stage, we pre-train the model using task data with the 0-ref tasks from the text-to-image model. There are many models in the community based on the post-training of text-to-image foundational models that meet this training paradigm of the first stage. For example, FLUX.1-Fill-dev deals primarily with inpainting tasks and can be used as an initialization to accelerate the training process. In the second stage, we finetune the above model to support general instructions using all tasks defined in ACE. To promote the widespread application of ACE++ in different scenarios, we provide a comprehensive set of models that cover both full finetuning and lightweight finetuning, while considering general applicability and applicability in vertical scenarios. The qualitative analysis showcases the superiority of ACE++ in terms of generated image quality and prompt-following ability.

Title: Layout2Scene: 3D Semantic Layout Guided Scene Generation via Geometry and Appearance Diffusion Priors

Authors: Minglin Chen, Longguang Wang, Sheng Ao, Ye Zhang, Kai Xu, Yulan Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.02519
Pdf URL: https://arxiv.org/pdf/2501.02519
Copy Paste: [[2501.02519]] Layout2Scene: 3D Semantic Layout Guided Scene Generation via Geometry and Appearance Diffusion Priors(https://arxiv.org/abs/2501.02519)
Keywords: generation
Abstract: 3D scene generation conditioned on text prompts has significantly progressed due to the development of 2D diffusion generation models. However, the textual description of 3D scenes is inherently inaccurate and lacks fine-grained control during training, leading to implausible scene generation. As an intuitive and feasible solution, the 3D layout allows for precise specification of object locations within the scene. To this end, we present a text-to-scene generation method (namely, Layout2Scene) using additional semantic layout as the prompt to inject precise control of 3D object positions. Specifically, we first introduce a scene hybrid representation to decouple objects and backgrounds, which is initialized via a pre-trained text-to-3D model. Then, we propose a two-stage scheme to optimize the geometry and appearance of the initialized scene separately. To fully leverage 2D diffusion priors in geometry and appearance generation, we introduce a semantic-guided geometry diffusion model and a semantic-geometry guided diffusion model which are finetuned on a scene dataset. Extensive experiments demonstrate that our method can generate more plausible and realistic scenes as compared to state-of-the-art approaches. Furthermore, the generated scene allows for flexible yet precise editing, thereby facilitating multiple downstream applications.

Title: Face-MakeUp: Multimodal Facial Prompts for Text-to-Image Generation

Authors: Dawei Dai, Mingming Jia, Yinxiu Zhou, Hang Xing, Chenghang Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2501.02523
Pdf URL: https://arxiv.org/pdf/2501.02523
Copy Paste: [[2501.02523]] Face-MakeUp: Multimodal Facial Prompts for Text-to-Image Generation(https://arxiv.org/abs/2501.02523)
Keywords: generation
Abstract: Facial images have extensive practical applications. Although the current large-scale text-image diffusion models exhibit strong generation capabilities, it is challenging to generate the desired facial images using only text prompts. Image prompts are a logical choice. However, current methods of this type generally focus on the general domain. In this paper, we aim to optimize image makeup techniques to generate the desired facial images. Specifically, (1) we built a dataset of 4 million high-quality face image-text pairs (FaceCaptionHQ-4M) based on LAION-Face to train our Face-MakeUp model; (2) to maintain consistency with the reference facial image, we extract/learn multi-scale content features and pose features for the facial image, integrating these into the diffusion model to enhance the preservation of facial identity features for diffusion models. Validation on two face-related test datasets demonstrates that our Face-MakeUp can achieve the best comprehensive performance. Codes are available at: this https URL

Title: Vision-Driven Prompt Optimization for Large Language Models in Multimodal Generative Tasks

Authors: Leo Franklin, Apiradee Boonmee, Kritsada Wongsuwan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.02527
Pdf URL: https://arxiv.org/pdf/2501.02527
Copy Paste: [[2501.02527]] Vision-Driven Prompt Optimization for Large Language Models in Multimodal Generative Tasks(https://arxiv.org/abs/2501.02527)
Keywords: generation, generative
Abstract: Vision generation remains a challenging frontier in artificial intelligence, requiring seamless integration of visual understanding and generative capabilities. In this paper, we propose a novel framework, Vision-Driven Prompt Optimization (VDPO), that leverages Large Language Models (LLMs) to dynamically generate textual prompts from visual inputs, guiding high-fidelity image synthesis. VDPO combines a visual embedding prompt tuner, a textual instruction generator, and a vision generation module to achieve state-of-the-art performance in diverse vision generation tasks. Extensive experiments on benchmarks such as COCO and Sketchy demonstrate that VDPO consistently outperforms existing methods, achieving significant improvements in FID, LPIPS, and BLEU/CIDEr scores. Additional analyses reveal the scalability, robustness, and generalization capabilities of VDPO, making it a versatile solution for in-domain and out-of-domain tasks. Human evaluations further validate the practical superiority of VDPO in generating visually appealing and semantically coherent outputs.

Title: LeetDecoding: A PyTorch Library for Exponentially Decaying Causal Linear Attention with CUDA Implementations

Authors: Jiaping Wang, Simiao Zhang, Qiao-Chu He, Yifan Chen
Subjects: cs.LG, cs.CL, cs.MS
Abstract URL: https://arxiv.org/abs/2501.02573
Pdf URL: https://arxiv.org/pdf/2501.02573
Copy Paste: [[2501.02573]] LeetDecoding: A PyTorch Library for Exponentially Decaying Causal Linear Attention with CUDA Implementations(https://arxiv.org/abs/2501.02573)
Keywords: generative
Abstract: The machine learning and data science community has made significant yet scattered progress in accelerating transformer-based large language models (LLMs), and one promising approach is to replace the original causal attention in a generative pre-trained transformer (GPT) with exponentially decaying causal linear attention. In this paper, we present LeetDecoding, the first Python package that provides a large set of computation routines for this fundamental operator. The launch of LeetDecoding was motivated by the current lack of (1) a clear understanding of the complexity of this operator, (2) a comprehensive collection of existing computation methods (usually spread across seemingly unrelated fields), and (3) CUDA implementations for fast inference on GPU. LeetDecoding's design is easy to integrate with existing linear-attention LLMs and allows researchers to benchmark and evaluate new computation methods for exponentially decaying causal linear attention. Using LeetDecoding does not require any knowledge of GPU programming or of the underlying complexity analysis, intentionally making LeetDecoding accessible to LLM practitioners. The source code of LeetDecoding is provided at this GitHub repository (this https URL), and users can install LeetDecoding with the command pip install leet-decoding.
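As a rough illustration of the operator in question, the sketch below uses a common formulation from the linear-attention literature, o_t = sum over s <= t of gamma^(t-s) (q_t . k_s) v_s, and compares a direct quadratic evaluation with the equivalent recurrence S_t = gamma S_(t-1) + k_t^T v_t, o_t = q_t S_t. The formulation is an assumption (the abstract does not spell it out), the function names are hypothetical, and nothing here reflects LeetDecoding's actual API.

```python
from typing import List

Matrix = List[List[float]]  # one row per time step


def decayed_attention_quadratic(q: Matrix, k: Matrix, v: Matrix, gamma: float) -> Matrix:
    """Direct O(T^2) evaluation of o_t = sum_{s<=t} gamma^(t-s) (q_t . k_s) v_s."""
    T, d_v = len(q), len(v[0])
    out = [[0.0] * d_v for _ in range(T)]
    for t in range(T):
        for s in range(t + 1):  # causal: only positions s <= t contribute
            score = sum(a * b for a, b in zip(q[t], k[s])) * gamma ** (t - s)
            for j in range(d_v):
                out[t][j] += score * v[s][j]
    return out


def decayed_attention_recurrent(q: Matrix, k: Matrix, v: Matrix, gamma: float) -> Matrix:
    """O(T) recurrence with a running d_k x d_v state that decays by gamma each step."""
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    out = []
    for q_t, k_t, v_t in zip(q, k, v):
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] = gamma * state[i][j] + k_t[i] * v_t[j]
        out.append([sum(q_t[i] * state[i][j] for i in range(d_k))
                    for j in range(d_v)])
    return out


q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
v = [[1.0], [2.0], [3.0]]
slow = decayed_attention_quadratic(q, k, v, gamma=0.5)
fast = decayed_attention_recurrent(q, k, v, gamma=0.5)
```

Both functions compute the same outputs; the recurrent form is the one that makes constant-memory decoding possible, which is presumably why efficient implementations of this operator matter for inference.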

Title: DepthMaster: Taming Diffusion Models for Monocular Depth Estimation

Authors: Ziyang Song, Zerong Wang, Bo Li, Hao Zhang, Ruijie Zhu, Li Liu, Peng-Tao Jiang, Tianzhu Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.02576
Pdf URL: https://arxiv.org/pdf/2501.02576
Copy Paste: [[2501.02576]] DepthMaster: Taming Diffusion Models for Monocular Depth Estimation(https://arxiv.org/abs/2501.02576)
Keywords: generative
Abstract: Monocular depth estimation within the diffusion-denoising paradigm demonstrates impressive generalization ability but suffers from low inference speed. Recent methods adopt a single-step deterministic paradigm to improve inference efficiency while maintaining comparable performance. However, they overlook the gap between generative and discriminative features, leading to suboptimal results. In this work, we propose DepthMaster, a single-step diffusion model designed to adapt generative features for the discriminative depth estimation task. First, to mitigate overfitting to texture details introduced by generative features, we propose a Feature Alignment module, which incorporates high-quality semantic features to enhance the denoising network's representation capability. Second, to address the lack of fine-grained details in the single-step deterministic framework, we propose a Fourier Enhancement module to adaptively balance low-frequency structure and high-frequency details. We adopt a two-stage training strategy to fully leverage the potential of the two modules. In the first stage, we focus on learning the global scene structure with the Feature Alignment module, while in the second stage, we exploit the Fourier Enhancement module to improve the visual quality. Through these efforts, our model achieves state-of-the-art performance in terms of generalization and detail preservation, outperforming other diffusion-based methods across various datasets. Our project page can be found at this https URL.

Title: Multispectral Pedestrian Detection with Sparsely Annotated Label

Authors: Chan Lee, Seungho Shin, Gyeong-Moon Park, Jung Uk Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.02640
Pdf URL: https://arxiv.org/pdf/2501.02640
Copy Paste: [[2501.02640]] Multispectral Pedestrian Detection with Sparsely Annotated Label(https://arxiv.org/abs/2501.02640)
Keywords: generation
Abstract: Although existing Sparsely Annotated Object Detection (SAOD) approaches have made progress in handling sparsely annotated environments in the multispectral domain, where only some pedestrians are annotated, they still have the following limitations: (i) they do not consider how to improve the quality of pseudo-labels for missing annotations, and (ii) they rely on fixed ground-truth annotations, which leads to learning only a limited range of pedestrian visual appearances in the multispectral domain. To address these issues, we propose a novel framework called Sparsely Annotated Multispectral Pedestrian Detection (SAMPD). For limitation (i), we introduce the Multispectral Pedestrian-aware Adaptive Weight (MPAW) and Positive Pseudo-label Enhancement (PPE) modules. Utilizing multispectral knowledge, these modules ensure the generation of high-quality pseudo-labels and enable effective learning by increasing the weights of high-quality pseudo-labels based on modality characteristics. To address limitation (ii), we propose an Adaptive Pedestrian Retrieval Augmentation (APRA) module, which adaptively incorporates pedestrian patches from the ground truth and dynamically integrates high-quality pseudo-labels with the ground truth, facilitating a more diverse learning pool of pedestrians. Extensive experimental results demonstrate that SAMPD significantly enhances performance in sparsely annotated environments within the multispectral domain.

Title: A New Interpretation of the Certainty-Equivalence Approach for PAC Reinforcement Learning with a Generative Model

Authors: Shivaram Kalyanakrishnan, Sheel Shah, Santhosh Kumar Guguloth
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2501.02652
Pdf URL: https://arxiv.org/pdf/2501.02652
Copy Paste: [[2501.02652]] A New Interpretation of the Certainty-Equivalence Approach for PAC Reinforcement Learning with a Generative Model(https://arxiv.org/abs/2501.02652)
Keywords: generative
Abstract: Reinforcement learning (RL) enables an agent interacting with an unknown MDP $M$ to optimise its behaviour by observing transitions sampled from $M$. A natural entity that emerges in the agent's reasoning is $\widehat{M}$, the maximum likelihood estimate of $M$ based on the observed transitions. The well-known \textit{certainty-equivalence} method (CEM) dictates that the agent update its behaviour to $\widehat{\pi}$, which is an optimal policy for $\widehat{M}$. Not only is CEM intuitive, it has been shown to enjoy minimax-optimal sample complexity in some regions of the parameter space for PAC RL with a generative model~\citep{Agarwal2020GenModel}. A seemingly unrelated algorithm is the ``trajectory tree method'' (TTM)~\citep{Kearns+MN:1999}, originally developed for efficient decision-time planning in large POMDPs. This paper presents a theoretical investigation that stems from the surprising finding that CEM may indeed be viewed as an application of TTM. The qualitative benefits of this view are (1) new and simple proofs of sample complexity upper bounds for CEM, in fact under a (2) weaker assumption on the rewards than is prevalent in the current literature. Our analysis applies to both non-stationary and stationary MDPs. Quantitatively, we obtain (3) improvements in the sample-complexity upper bounds for CEM both for non-stationary and stationary MDPs, in the regime that the ``mistake probability'' $\delta$ is small. Additionally, we show (4) a lower bound on the sample complexity for finite-horizon MDPs, which establishes the minimax-optimality of our upper bound for non-stationary MDPs in the small-$\delta$ regime.
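The certainty-equivalence recipe described above (estimate $\widehat{M}$ from sampled transitions, then act optimally for $\widehat{M}$) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's analysis: it uses a discounted stationary MDP with known deterministic rewards, a fixed number of samples per state-action pair, and plain value iteration as the planner. The names `certainty_equivalence`, `sample_next_state`, `n_samples`, and `gamma` are hypothetical.

```python
import random


def certainty_equivalence(sample_next_state, rewards, n_states, n_actions,
                          n_samples, gamma=0.9, iters=200, seed=0):
    """Build the empirical model M_hat from generative-model samples,
    then return a greedy policy that is optimal for M_hat."""
    rng = random.Random(seed)

    # Maximum likelihood estimate of the transitions: empirical frequencies.
    p_hat = [[[0.0] * n_states for _ in range(n_actions)]
             for _ in range(n_states)]
    for s in range(n_states):
        for a in range(n_actions):
            for _ in range(n_samples):
                s_next = sample_next_state(s, a, rng)
                p_hat[s][a][s_next] += 1.0 / n_samples

    value = [0.0] * n_states

    def q(s, a):
        # One-step lookahead in the estimated model M_hat.
        return rewards[s][a] + gamma * sum(
            p * x for p, x in zip(p_hat[s][a], value))

    # Value iteration on M_hat (the planner is an assumption of this sketch).
    for _ in range(iters):
        value = [max(q(s, a) for a in range(n_actions))
                 for s in range(n_states)]

    policy = [max(range(n_actions), key=lambda a: q(s, a))
              for s in range(n_states)]
    return policy, p_hat
```

Here `sample_next_state(s, a, rng)` plays the role of the generative model: it returns one next state drawn from the true, unknown transition distribution for the pair (s, a).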

Title: GS-DiT: Advancing Video Generation with Pseudo 4D Gaussian Fields through Efficient Dense 3D Point Tracking

Authors: Weikang Bian, Zhaoyang Huang, Xiaoyu Shi, Yijin Li, Fu-Yun Wang, Hongsheng Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.02690
Pdf URL: https://arxiv.org/pdf/2501.02690
Copy Paste: [[2501.02690]] GS-DiT: Advancing Video Generation with Pseudo 4D Gaussian Fields through Efficient Dense 3D Point Tracking(https://arxiv.org/abs/2501.02690)
Keywords: generation
Abstract: 4D video control is essential in video generation as it enables the use of sophisticated lens techniques, such as multi-camera shooting and dolly zoom, which are currently unsupported by existing methods. Training a video Diffusion Transformer (DiT) directly to control 4D content requires expensive multi-view videos. Inspired by Monocular Dynamic novel View Synthesis (MDVS), which optimizes a 4D representation and renders videos according to different 4D elements, such as camera pose and object motion editing, we bring pseudo 4D Gaussian fields to video generation. Specifically, we propose a novel framework that constructs a pseudo 4D Gaussian field with dense 3D point tracking and renders the Gaussian field for all video frames. We then fine-tune a pretrained DiT to generate videos following the guidance of the rendered video; we call the resulting model GS-DiT. To boost the training of GS-DiT, we also propose an efficient Dense 3D Point Tracking (D3D-PT) method for the pseudo 4D Gaussian field construction. Our D3D-PT outperforms SpatialTracker, the state-of-the-art sparse 3D point tracking method, in accuracy and accelerates inference by two orders of magnitude. During the inference stage, GS-DiT can generate videos with the same dynamic content while adhering to different camera parameters, addressing a significant limitation of current video generation models. GS-DiT demonstrates strong generalization capabilities and extends the 4D controllability of Gaussian splatting to video generation beyond just camera poses. It supports advanced cinematic effects through the manipulation of the Gaussian field and camera intrinsics, making it a powerful tool for creative video production. Demos are available at this https URL.

Title: Underwater Image Restoration Through a Prior Guided Hybrid Sense Approach and Extensive Benchmark Analysis

Authors: Xiaojiao Guo, Xuhang Chen, Shuqiang Wang, Chi-Man Pun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.02701
Pdf URL: https://arxiv.org/pdf/2501.02701
Copy Paste: [[2501.02701]] Underwater Image Restoration Through a Prior Guided Hybrid Sense Approach and Extensive Benchmark Analysis(https://arxiv.org/abs/2501.02701)
Keywords: restoration
Abstract: Underwater imaging grapples with challenges from light-water interactions, leading to color distortions and reduced clarity. In response to these challenges, we propose a novel Color Balance Prior \textbf{Guided} \textbf{Hyb}rid \textbf{Sens}e \textbf{U}nderwater \textbf{I}mage \textbf{R}estoration framework (\textbf{GuidedHybSensUIR}). This framework operates on multiple scales, employing the proposed \textbf{Detail Restorer} module to restore low-level detailed features at finer scales and the proposed \textbf{Feature Contextualizer} module to capture long-range contextual relations of high-level general features at a broader scale. Hybridizing the sensing results from these different scales effectively addresses color casts and restores blurry details. To steer the model effectively, we propose a novel \textbf{Color Balance Prior} as a strong guide in the feature contextualization step and as a weak guide in the final decoding phase. We construct a comprehensive benchmark using paired training data from three real-world underwater datasets and evaluate on six test sets, three paired and three unpaired, sourced from four real-world underwater datasets. On this benchmark, we tested 14 traditional methods and retrained 23 existing deep learning methods for underwater image restoration, obtaining metric results for each approach. This effort aims to provide a benchmark dataset that can serve as a standard basis for comparison. Extensive experimental results demonstrate that our method outperforms the 37 other state-of-the-art methods overall across various benchmark datasets and metrics, despite not achieving the best results in certain individual cases. The code and dataset are available at this https URL.

Title: Persistence of Backdoor-based Watermarks for Neural Networks: A Comprehensive Evaluation

Authors: Anh Tu Ngo, Chuan Song Heng, Nandish Chattopadhyay, Anupam Chattopadhyay
Subjects: cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2501.02704
Pdf URL: https://arxiv.org/pdf/2501.02704
Copy Paste: [[2501.02704]] Persistence of Backdoor-based Watermarks for Neural Networks: A Comprehensive Evaluation(https://arxiv.org/abs/2501.02704)
Keywords: restoration
Abstract: Deep Neural Networks (DNNs) have gained considerable traction in recent years due to the unparalleled results they have achieved. However, training such sophisticated models is resource intensive, leading many to consider DNNs the intellectual property (IP) of model owners. In this era of cloud computing, high-performance DNNs are often deployed all over the internet so that people can access them publicly. As such, DNN watermarking schemes, especially backdoor-based watermarks, have been actively developed in recent years to preserve proprietary rights. Nonetheless, much uncertainty remains about the robustness of existing backdoor watermark schemes against both adversarial attacks and unintended modifications such as fine-tuning neural network models. One reason for this is that no complete guarantee of robustness can be given for backdoor-based watermarks. In this paper, we extensively evaluate the persistence of recent backdoor-based watermarks within neural networks under fine-tuning, and we propose a novel data-driven idea to restore the watermark after fine-tuning without exposing the trigger set. Our empirical results show that by solely introducing training data after fine-tuning, the watermark can be restored if model parameters do not shift dramatically during fine-tuning. Depending on the types of trigger samples used, trigger accuracy can be reinstated to up to 100%. Our study further explores how the restoration process works using loss landscape visualization, as well as the idea of introducing training data in the fine-tuning stage to alleviate watermark vanishing.

Title: Multilevel Semantic-Aware Model for AI-Generated Video Quality Assessment

Authors: Jiaze Li, Haoran Xu, Shiding Zhu, Junwei He, Haozhao Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.02706
Pdf URL: https://arxiv.org/pdf/2501.02706
Copy Paste: [[2501.02706]] Multilevel Semantic-Aware Model for AI-Generated Video Quality Assessment(https://arxiv.org/abs/2501.02706)
Keywords: quality assessment
Abstract: The rapid development of diffusion models has recently greatly advanced AI-generated videos in terms of length and consistency, yet assessing AI-generated videos remains challenging. Previous approaches have often focused on User-Generated Content (UGC), but few have targeted AI-generated video quality assessment. In this work, we introduce MSA-VQA, a Multilevel Semantic-Aware Model for AI-Generated Video Quality Assessment, which leverages CLIP-based semantic supervision and cross-attention mechanisms. Our hierarchical framework analyzes video content at three levels: frame, segment, and video. We propose a Prompt Semantic Supervision Module that uses the text encoder of CLIP to ensure semantic consistency between videos and conditional prompts. Additionally, we propose the Semantic Mutation-aware Module to capture subtle variations between frames. Extensive experiments demonstrate that our method achieves state-of-the-art results.

Title: Holistic Semantic Representation for Navigational Trajectory Generation

Authors: Ji Cao, Tongya Zheng, Qinghong Guo, Yu Wang, Junshu Dai, Shunyu Liu, Jie Yang, Jie Song, Mingli Song
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2501.02737
Pdf URL: https://arxiv.org/pdf/2501.02737
Copy Paste: [[2501.02737]] Holistic Semantic Representation for Navigational Trajectory Generation(https://arxiv.org/abs/2501.02737)
Keywords: generation
Abstract: Trajectory generation has garnered significant attention from researchers in the field of spatio-temporal analysis, as it can generate substantial synthesized human mobility trajectories that enhance user privacy and alleviate data scarcity. However, existing trajectory generation methods often focus on improving trajectory generation quality from a singular perspective, lacking a comprehensive semantic understanding across various scales. Consequently, we are inspired to develop a HOlistic SEmantic Representation (HOSER) framework for navigational trajectory generation. Given an origin-and-destination (OD) pair and the starting time point of a latent trajectory, we first propose a Road Network Encoder to expand the receptive field of road- and zone-level semantics. Second, we design a Multi-Granularity Trajectory Encoder to integrate the spatio-temporal semantics of the generated trajectory at both the point and trajectory levels. Finally, we employ a Destination-Oriented Navigator to seamlessly integrate destination-oriented guidance. Extensive experiments on three real-world datasets demonstrate that HOSER outperforms state-of-the-art baselines by a significant margin. Moreover, the model's performance in few-shot learning and zero-shot learning scenarios further verifies the effectiveness of our holistic semantic representation.

Title: Brick-Diffusion: Generating Long Videos with Brick-to-Wall Denoising

Authors: Yunlong Yuan, Yuanfan Guo, Chunwei Wang, Hang Xu, Li Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.02741
Pdf URL: https://arxiv.org/pdf/2501.02741
Copy Paste: [[2501.02741]] Brick-Diffusion: Generating Long Videos with Brick-to-Wall Denoising(https://arxiv.org/abs/2501.02741)
Keywords: generation
Abstract: Recent advances in diffusion models have greatly improved text-driven video generation. However, training models for long video generation demands significant computational power and extensive data, leading most video diffusion models to be limited to a small number of frames. Existing training-free methods that attempt to generate long videos using pre-trained short video diffusion models often struggle with issues such as insufficient motion dynamics and degraded video fidelity. In this paper, we present Brick-Diffusion, a novel, training-free approach capable of generating long videos of arbitrary length. Our method introduces a brick-to-wall denoising strategy, where the latent is denoised in segments, with a stride applied in subsequent iterations. This process mimics the construction of a staggered brick wall, where each brick represents a denoised segment, enabling communication between frames and improving overall video quality. Through quantitative and qualitative evaluations, we demonstrate that Brick-Diffusion outperforms existing baseline methods in generating high-fidelity videos.
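One way to picture the staggered, brick-wall layout of segments is the sketch below. It assumes that each denoising iteration splits the frame axis into consecutive segments of a fixed length and that the segment boundaries shift by a fixed stride from one iteration to the next. The abstract does not give the exact schedule, so the function names `brick_segments` and `brick_schedule`, their parameters, and the wrap-around of the offset are all hypothetical.

```python
def brick_segments(num_frames, segment_len, offset):
    """Split frame indices [0, num_frames) into consecutive (start, end)
    segments whose boundaries are shifted by `offset` frames."""
    shift = offset % segment_len
    bounds = [0]
    first_cut = shift if shift > 0 else segment_len
    bounds.extend(range(first_cut, num_frames, segment_len))
    bounds.append(num_frames)
    # Drop empty segments, e.g. when a cut coincides with the last frame.
    return [(a, b) for a, b in zip(bounds, bounds[1:]) if a < b]


def brick_schedule(num_frames, segment_len, stride, num_steps):
    """Segment layout for each denoising step, offset by `stride` per step."""
    return [brick_segments(num_frames, segment_len, step * stride)
            for step in range(num_steps)]
```

With 8 frames, a segment length of 4, and a stride of 2, the first step uses segments (0, 4) and (4, 8), while the next uses (0, 2), (2, 6), and (6, 8). Frames 3 and 4, which sat in different segments at the first step, then share a segment, which is how information can pass across the earlier boundary.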

Title: LDMapNet-U: An End-to-End System for City-Scale Lane-Level Map Updating

Authors: Deguo Xia, Weiming Zhang, Xiyan Liu, Wei Zhang, Chenting Gong, Xiao Tan, Jizhou Huang, Mengmeng Yang, Diange Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.02763
Pdf URL: https://arxiv.org/pdf/2501.02763
Copy Paste: [[2501.02763]] LDMapNet-U: An End-to-End System for City-Scale Lane-Level Map Updating(https://arxiv.org/abs/2501.02763)
Keywords: generation
Abstract: An up-to-date city-scale lane-level map is an indispensable infrastructure and a key enabling technology for ensuring the safety and user experience of autonomous driving systems. In industrial scenarios, reliance on manual annotation for map updates creates a critical bottleneck. Lane-level updates require precise change information and must ensure consistency with adjacent data while adhering to strict standards. Traditional methods use a three-stage approach (construction, change detection, and updating), which often necessitates manual verification due to accuracy limitations. This results in labor-intensive processes and hampers timely updates. To address these challenges, we propose LDMapNet-U, which implements a new end-to-end paradigm for city-scale lane-level map updating. By reconceptualizing the update task as an end-to-end map generation process grounded in historical map data, we introduce a paradigm shift in map updating that simultaneously generates vectorized maps and change information. To achieve this, a Prior-Map Encoding (PME) module is introduced to effectively encode historical maps, serving as a critical reference for detecting changes. Additionally, we incorporate a novel Instance Change Prediction (ICP) module that learns to predict associations with historical maps. Consequently, LDMapNet-U simultaneously achieves vectorized map element generation and change detection. To demonstrate the superiority and effectiveness of LDMapNet-U, extensive experiments are conducted using large-scale real-world datasets. In addition, LDMapNet-U has been successfully deployed in production at Baidu Maps since April 2024, supporting map updating for over 360 cities and significantly shortening the update cycle from quarterly to weekly. The updated maps serve hundreds of millions of users and are integrated into the autonomous driving systems of several leading vehicle companies.

Title: InpDiffusion: Image Inpainting Localization via Conditional Diffusion Models

Authors: Kai Wang, Shaozhang Niu, Qixian Hao, Jiwei Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2501.02816
Pdf URL: https://arxiv.org/pdf/2501.02816
Copy Paste: [[2501.02816]] InpDiffusion: Image Inpainting Localization via Conditional Diffusion Models(https://arxiv.org/abs/2501.02816)
Keywords: generation
Abstract: As artificial intelligence advances rapidly, particularly with the advent of GANs and diffusion models, accurate Image Inpainting Localization (IIL) has become increasingly challenging. Current IIL methods face two main challenges: a tendency towards overconfidence, leading to incorrect predictions; and difficulty in detecting subtle tampering boundaries in inpainted images. In response, we propose a new paradigm that treats IIL as a conditional mask generation task utilizing diffusion models. Our method, InpDiffusion, utilizes the denoising process enhanced by the integration of image semantic conditions to progressively refine predictions. During denoising, we employ edge conditions and introduce a novel edge supervision strategy to enhance the model's perception of edge details in inpainted objects. Balancing the diffusion model's stochastic sampling with edge supervision of tampered image regions mitigates the risk of incorrect predictions from overconfidence and prevents the loss of subtle boundaries that can result from overly stochastic processes. Furthermore, we propose an innovative Dual-stream Multi-scale Feature Extractor (DMFE) for extracting multi-scale features, enhancing feature representation by considering both semantic and edge conditions of the inpainted images. Extensive experiments across challenging datasets demonstrate that InpDiffusion significantly outperforms existing state-of-the-art methods in IIL tasks, while also showing excellent generalization capabilities and robustness.

Title: HOGSA: Bimanual Hand-Object Interaction Understanding with 3D Gaussian Splatting Based Data Augmentation

Authors: Wentian Qu, Jiahe Li, Jian Cheng, Jian Shi, Chenyu Meng, Cuixia Ma, Hongan Wang, Xiaoming Deng, Yinda Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.02845
Pdf URL: https://arxiv.org/pdf/2501.02845
Copy Paste: [[2501.02845]] HOGSA: Bimanual Hand-Object Interaction Understanding with 3D Gaussian Splatting Based Data Augmentation(https://arxiv.org/abs/2501.02845)
Keywords: super-resolution
Abstract: Understanding bimanual hand-object interaction plays an important role in robotics and virtual reality. However, due to significant occlusions between hands and objects as well as high degree-of-freedom motions, it is challenging to collect and annotate a high-quality, large-scale dataset, which prevents further improvement of baselines for bimanual hand-object interaction. In this work, we propose a new 3D Gaussian Splatting based data augmentation framework for bimanual hand-object interaction, which is capable of augmenting existing datasets into large-scale photorealistic data with various hand-object poses and viewpoints. First, we use mesh-based 3DGS to model objects and hands, and we design a super-resolution module to deal with the rendering blur caused by multi-resolution input images. Second, we extend the single-hand grasping pose optimization module to the bimanual hand-object setting to generate various poses of bimanual hand-object interaction, which can significantly expand the pose distribution of the dataset. Third, we analyze the impact of different aspects of the proposed data augmentation on the understanding of bimanual hand-object interaction. We perform our data augmentation on two benchmarks, H2O and Arctic, and verify that our method can improve the performance of the baselines.

Title: Large Language Models for Video Surveillance Applications

Authors: Ulindu De Silva, Leon Fernando, Billy Lau Pik Lik, Zann Koh, Sam Conrad Joyce, Belinda Yuen, Chau Yuen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.02850
Pdf URL: https://arxiv.org/pdf/2501.02850
Copy Paste: [[2501.02850]] Large Language Models for Video Surveillance Applications(https://arxiv.org/abs/2501.02850)
Keywords: generative
Abstract: The rapid increase in video content production has resulted in enormous data volumes, creating significant challenges for efficient analysis and resource management. To address this, robust video analysis tools are essential. This paper presents an innovative proof of concept using Generative Artificial Intelligence (GenAI) in the form of Vision Language Models to enhance the downstream video analysis process. Our tool generates customized textual summaries based on user-defined queries, providing focused insights within extensive video datasets. Unlike traditional methods that offer generic summaries or limited action recognition, our approach utilizes Vision Language Models to extract relevant information, improving analysis precision and efficiency. The proposed method produces textual summaries from extensive CCTV footage, which can then be stored for an indefinite time in a very small storage space compared to videos, allowing users to quickly navigate and verify significant events without exhaustive manual review. Qualitative evaluations result in 80% and 70% accuracy in temporal and spatial quality and consistency of the pipeline respectively.

Title: Synthetic Fungi Datasets: A Time-Aligned Approach

Authors: A. Rani, D. O. Arroyo, P. Durdevic
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.02855
Pdf URL: https://arxiv.org/pdf/2501.02855
Copy Paste: [[2501.02855]] Synthetic Fungi Datasets: A Time-Aligned Approach(https://arxiv.org/abs/2501.02855)
Keywords: generation
Abstract: Fungi undergo dynamic morphological transformations throughout their lifecycle, forming intricate networks as they transition from spores to mature mycelium structures. To support the study of these time-dependent processes, we present a synthetic, time-aligned image dataset that models key stages of fungal growth. This dataset systematically captures phenomena such as spore size reduction, branching dynamics, and the emergence of complex mycelium networks. The controlled generation process ensures temporal consistency, scalability, and structural alignment, addressing the limitations of real-world fungal datasets. Optimized for deep learning (DL) applications, this dataset facilitates the development of models for classifying growth stages, predicting fungal development, and analyzing morphological patterns over time. With applications spanning agriculture, medicine, and industrial mycology, this resource provides a robust foundation for automating fungal analysis, enhancing disease monitoring, and advancing fungal biology research through artificial intelligence.

Title: Conditional Mutual Information Based Diffusion Posterior Sampling for Solving Inverse Problems

Authors: Shayan Mohajer Hamidi, En-Hui Yang
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2501.02880
Pdf URL: https://arxiv.org/pdf/2501.02880
Copy Paste: [[2501.02880]] Conditional Mutual Information Based Diffusion Posterior Sampling for Solving Inverse Problems(https://arxiv.org/abs/2501.02880)
Keywords: super-resolution
Abstract: Inverse problems are prevalent across various disciplines in science and engineering. In the field of computer vision, tasks such as inpainting, deblurring, and super-resolution are commonly formulated as inverse problems. Recently, diffusion models (DMs) have emerged as a promising approach for addressing noisy linear inverse problems, offering effective solutions without requiring additional task-specific training. Specifically, with the prior provided by DMs, one can sample from the posterior by finding the likelihood. Since the likelihood is intractable, it is often approximated in the literature. However, this approximation compromises the quality of the generated images. To overcome this limitation and improve the effectiveness of DMs in solving inverse problems, we propose an information-theoretic approach. Specifically, we maximize the conditional mutual information $\mathrm{I}(\boldsymbol{x}_0; \boldsymbol{y} | \boldsymbol{x}_t)$, where $\boldsymbol{x}_0$ represents the reconstructed signal, $\boldsymbol{y}$ is the measurement, and $\boldsymbol{x}_t$ is the intermediate signal at stage $t$. This ensures that the intermediate signals $\boldsymbol{x}_t$ are generated in a way that the final reconstructed signal $\boldsymbol{x}_0$ retains as much information as possible about the measurement $\boldsymbol{y}$. We demonstrate that this method can be seamlessly integrated with recent approaches and, once incorporated, enhances their performance both qualitatively and quantitatively.
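For reference, the conditional mutual information named in the abstract has the following standard textbook form, written here with the abstract's own symbols. This is only the general definition; how the paper estimates or maximizes the quantity during sampling is not stated in the abstract and is not implied here. $H$ denotes (differential) entropy.

```latex
\mathrm{I}(\boldsymbol{x}_0; \boldsymbol{y} \mid \boldsymbol{x}_t)
  = \mathbb{E}_{p(\boldsymbol{x}_0, \boldsymbol{y}, \boldsymbol{x}_t)}
    \left[ \log \frac{p(\boldsymbol{x}_0, \boldsymbol{y} \mid \boldsymbol{x}_t)}
                     {p(\boldsymbol{x}_0 \mid \boldsymbol{x}_t)\, p(\boldsymbol{y} \mid \boldsymbol{x}_t)} \right]
  = H(\boldsymbol{x}_0 \mid \boldsymbol{x}_t) - H(\boldsymbol{x}_0 \mid \boldsymbol{y}, \boldsymbol{x}_t)
```

The second form reads as the reduction in uncertainty about $\boldsymbol{x}_0$ gained by also observing $\boldsymbol{y}$, given $\boldsymbol{x}_t$. The quantity is zero exactly when $\boldsymbol{x}_0$ and $\boldsymbol{y}$ are conditionally independent given $\boldsymbol{x}_t$.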

Title: Pointmap-Conditioned Diffusion for Consistent Novel View Synthesis

Authors: Thang-Anh-Quan Nguyen, Nathan Piasco, Luis Roldão, Moussab Bennehar, Dzmitry Tsishkou, Laurent Caraffa, Jean-Philippe Tarel, Roland Brémond
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.02913
Pdf URL: https://arxiv.org/pdf/2501.02913
Copy Paste: [[2501.02913]] Pointmap-Conditioned Diffusion for Consistent Novel View Synthesis(https://arxiv.org/abs/2501.02913)
Keywords: generative
Abstract: In this paper, we present PointmapDiffusion, a novel framework for single-image novel view synthesis (NVS) that utilizes pre-trained 2D diffusion models. Our method is the first to leverage pointmaps (i.e., rasterized 3D scene coordinates) as a conditioning signal, capturing a geometric prior from the reference images to guide the diffusion process. By embedding reference attention blocks and a ControlNet for pointmap features, our model balances generative capability and geometric consistency, enabling accurate view synthesis across varying viewpoints. Extensive experiments on diverse real-world datasets demonstrate that PointmapDiffusion achieves high-quality, multi-view consistent results with significantly fewer trainable parameters than other baselines for single-image NVS tasks.

Title: SceneVTG++: Controllable Multilingual Visual Text Generation in the Wild

Authors: Jiawei Liu, Yuanzhi Zhu, Feiyu Gao, Zhibo Yang, Peng Wang, Junyang Lin, Xinggang Wang, Wenyu Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.02962
Pdf URL: https://arxiv.org/pdf/2501.02962
Copy Paste: [[2501.02962]] SceneVTG++: Controllable Multilingual Visual Text Generation in the Wild(https://arxiv.org/abs/2501.02962)
Keywords: generation
Abstract: Generating visual text in natural scene images is a challenging task with many unsolved problems. Unlike text generated on artificially designed images (such as posters, covers, cartoons, etc.), the text in natural scene images needs to meet the following four key criteria: (1) Fidelity: the generated text should appear as realistic as a photograph and be completely accurate, with no errors in any of the strokes. (2) Reasonability: the text should be generated on reasonable carrier areas (such as boards, signs, walls, etc.), and the generated text content should also be relevant to the scene. (3) Utility: the generated text can facilitate the training of natural scene OCR (Optical Character Recognition) tasks. (4) Controllability: the attributes of the text (such as font and color) should be controllable as needed. In this paper, we propose a two-stage method, SceneVTG++, which simultaneously satisfies the four aspects mentioned above. SceneVTG++ consists of a Text Layout and Content Generator (TLCG) and a Controllable Local Text Diffusion (CLTD). The former utilizes the world knowledge of multimodal large language models to find reasonable text areas and recommend text content according to the natural scene background images, while the latter generates controllable multilingual text based on the diffusion model. Through extensive experiments, we verified the effectiveness of TLCG and CLTD separately and demonstrated the state-of-the-art text generation performance of SceneVTG++. In addition, the generated images have superior utility in OCR tasks like text detection and text recognition. Code and datasets will be made available.

Title: Fuzzy Granule Density-Based Outlier Detection with Multi-Scale Granular Balls

Authors: Can Gao, Xiaofeng Tan, Jie Zhou, Weiping Ding, Witold Pedrycz
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2501.02975
Pdf URL: https://arxiv.org/pdf/2501.02975
Copy Paste: [[2501.02975]] Fuzzy Granule Density-Based Outlier Detection with Multi-Scale Granular Balls(https://arxiv.org/abs/2501.02975)
Keywords: generation
Abstract: Outlier detection refers to the identification of anomalous samples that deviate significantly from the distribution of normal data and has been extensively studied and used in a variety of practical tasks. However, most unsupervised outlier detection methods are carefully designed to detect specified outliers, while real-world data may be entangled with different types of outliers. In this study, we propose a fuzzy rough sets-based multi-scale outlier detection method to identify various types of outliers. Specifically, a novel fuzzy rough sets-based method that integrates relative fuzzy granule density is first introduced to improve the capability of detecting local outliers. Then, a multi-scale view generation method based on granular-ball computing is proposed to collaboratively identify group outliers at different levels of granularity. Moreover, reliable outliers and inliers determined by the three-way decision are used to train a weighted support vector machine to further improve the performance of outlier detection. The proposed method innovatively transforms unsupervised outlier detection into a semi-supervised classification problem and for the first time explores the fuzzy rough sets-based outlier detection from the perspective of multi-scale granular balls, allowing for high adaptability to different types of outliers. Extensive experiments carried out on both artificial and UCI datasets demonstrate that the proposed outlier detection method significantly outperforms the state-of-the-art methods, improving the results by at least 8.48% in terms of the Area Under the ROC Curve (AUROC) index. The source code is released at this https URL.
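The AUROC index used to report the results above can be computed directly from outlier scores and ground-truth labels. The following is a minimal sketch of the generic metric only, not of the paper's detector; the function name `auroc` and the convention that label 1 marks an outlier are assumptions made for illustration.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen outlier (label 1)
    receives a higher score than a randomly chosen inlier (label 0),
    counting ties as one half.
    """
    outliers = [s for s, y in zip(scores, labels) if y == 1]
    inliers = [s for s, y in zip(scores, labels) if y == 0]
    if not outliers or not inliers:
        raise ValueError("need at least one outlier and one inlier")

    wins = 0.0
    for o in outliers:
        for i in inliers:
            if o > i:
                wins += 1.0
            elif o == i:
                wins += 0.5
    return wins / (len(outliers) * len(inliers))


if __name__ == "__main__":
    # Three of the four (outlier, inlier) pairs are ranked correctly.
    print(auroc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))
```

The pairwise form is quadratic in the number of samples, which is fine for small checks; a rank-based implementation is preferable for large datasets.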

Title: STAR: Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution

Authors: Rui Xie, Yinhong Liu, Penghao Zhou, Chen Zhao, Jun Zhou, Kai Zhang, Zhenyu Zhang, Jian Yang, Zhenheng Yang, Ying Tai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.02976
Pdf URL: https://arxiv.org/pdf/2501.02976
Copy Paste: [[2501.02976]] STAR: Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution(https://arxiv.org/abs/2501.02976)
Keywords: super-resolution, generative
Abstract: Image diffusion models have been adapted for real-world video super-resolution to tackle over-smoothing issues in GAN-based methods. However, these models struggle to maintain temporal consistency, as they are trained on static images, limiting their ability to capture temporal dynamics effectively. Integrating text-to-video (T2V) models into video super-resolution for improved temporal modeling is straightforward. However, two key challenges remain: artifacts introduced by complex degradations in real-world scenarios, and compromised fidelity due to the strong generative capacity of powerful T2V models (e.g., CogVideoX-5B). To enhance the spatio-temporal quality of restored videos, we introduce STAR (Spatial-Temporal Augmentation with T2V models for Real-world video super-resolution), a novel approach that leverages T2V models for real-world video super-resolution, achieving realistic spatial details and robust temporal consistency. Specifically, we introduce a Local Information Enhancement Module (LIEM) before the global attention block to enrich local details and mitigate degradation artifacts. Moreover, we propose a Dynamic Frequency (DF) Loss to reinforce fidelity, guiding the model to focus on different frequency components across diffusion steps. Extensive experiments demonstrate that STAR outperforms state-of-the-art methods on both synthetic and real-world datasets.

Title: TransPixar: Advancing Text-to-Video Generation with Transparency

Authors: Luozhou Wang, Yijun Li, Zhifei Chen, Jui-Hsien Wang, Zhifei Zhang, He Zhang, Zhe Lin, Yingcong Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.03006
Pdf URL: https://arxiv.org/pdf/2501.03006
Copy Paste: [[2501.03006]] TransPixar: Advancing Text-to-Video Generation with Transparency(https://arxiv.org/abs/2501.03006)
Keywords: generation, generative
Abstract: Text-to-video generative models have made significant strides, enabling diverse applications in entertainment, advertising, and education. However, generating RGBA video, which includes alpha channels for transparency, remains a challenge due to limited datasets and the difficulty of adapting existing models. Alpha channels are crucial for visual effects (VFX), allowing transparent elements like smoke and reflections to blend seamlessly into scenes. We introduce TransPixar, a method to extend pretrained video models for RGBA generation while retaining the original RGB capabilities. TransPixar leverages a diffusion transformer (DiT) architecture, incorporating alpha-specific tokens and using LoRA-based fine-tuning to jointly generate RGB and alpha channels with high consistency. By optimizing attention mechanisms, TransPixar preserves the strengths of the original RGB model and achieves strong alignment between RGB and alpha channels despite limited training data. Our approach effectively generates diverse and consistent RGBA videos, advancing the possibilities for VFX and interactive content creation.
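To illustrate why the alpha channel matters for VFX, the sketch below applies the standard "over" compositing rule to a single pixel. It assumes straight (non-premultiplied) alpha in the range 0 to 1 and is unrelated to how TransPixar itself generates RGBA frames; the function name `composite_over` is hypothetical.

```python
def composite_over(fg_rgba, bg_rgb):
    """Blend one foreground RGBA pixel over an opaque background pixel.

    Uses straight (non-premultiplied) alpha:
        out = alpha * foreground + (1 - alpha) * background
    """
    r, g, b, alpha = fg_rgba
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return tuple(alpha * f + (1.0 - alpha) * k
                 for f, k in zip((r, g, b), bg_rgb))


if __name__ == "__main__":
    # Half-transparent red smoke over a blue background.
    print(composite_over((255, 0, 0, 0.5), (0, 0, 255)))
```

With alpha equal to 1 the foreground fully replaces the background, and with alpha equal to 0 the background is unchanged, which is what lets elements such as smoke blend into a scene.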

Title: Through-The-Mask: Mask-based Motion Trajectories for Image-to-Video Generation

Authors: Guy Yariv, Yuval Kirstain, Amit Zohar, Shelly Sheynin, Yaniv Taigman, Yossi Adi, Sagie Benaim, Adam Polyak
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2501.03059
Pdf URL: https://arxiv.org/pdf/2501.03059
Copy Paste: [[2501.03059]] Through-The-Mask: Mask-based Motion Trajectories for Image-to-Video Generation(https://arxiv.org/abs/2501.03059)
Keywords: generation
Abstract: We consider the task of Image-to-Video (I2V) generation, which involves transforming static images into realistic video sequences based on a textual description. While recent advancements produce photorealistic outputs, they frequently struggle to create videos with accurate and consistent object motion, especially in multi-object scenarios. To address these limitations, we propose a two-stage compositional framework that decomposes I2V generation into: (i) An explicit intermediate representation generation stage, followed by (ii) A video generation stage that is conditioned on this representation. Our key innovation is the introduction of a mask-based motion trajectory as an intermediate representation, that captures both semantic object information and motion, enabling an expressive but compact representation of motion and semantics. To incorporate the learned representation in the second stage, we utilize object-level attention objectives. Specifically, we consider a spatial, per-object, masked-cross attention objective, integrating object-specific prompts into corresponding latent space regions and a masked spatio-temporal self-attention objective, ensuring frame-to-frame consistency for each object. We evaluate our method on challenging benchmarks with multi-object and high-motion scenarios and empirically demonstrate that the proposed method achieves state-of-the-art results in temporal coherence, motion realism, and text-prompt faithfulness. Additionally, we introduce \benchmark, a new challenging benchmark for single-object and multi-object I2V generation, and demonstrate our method's superiority on this benchmark. Project page is available at this https URL.

Title: CAT: Content-Adaptive Image Tokenization

Authors: Junhong Shen, Kushal Tirumala, Michihiro Yasunaga, Ishan Misra, Luke Zettlemoyer, Lili Yu, Chunting Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.03120
Pdf URL: https://arxiv.org/pdf/2501.03120
Copy Paste: [[2501.03120]] CAT: Content-Adaptive Image Tokenization(https://arxiv.org/abs/2501.03120)
Keywords: generation
Abstract: Most existing image tokenizers encode images into a fixed number of tokens or patches, overlooking the inherent variability in image complexity. To address this, we introduce Content-Adaptive Tokenizer (CAT), which dynamically adjusts representation capacity based on the image content and encodes simpler images into fewer tokens. We design a caption-based evaluation system that leverages large language models (LLMs) to predict content complexity and determine the optimal compression ratio for a given image, taking into account factors critical to human perception. Trained on images with diverse compression ratios, CAT demonstrates robust performance in image reconstruction. We also utilize its variable-length latent representations to train Diffusion Transformers (DiTs) for ImageNet generation. By optimizing token allocation, CAT improves the FID score over fixed-ratio baselines trained with the same FLOPs and boosts the inference throughput by 18.5%.
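The link between compression ratio and token count can be made concrete with a little arithmetic. The sketch below is illustrative only: the candidate ratios (8, 16, 32), the complexity score in the range 0 to 1, and the thresholds are all assumptions and are not taken from the abstract, which says only that an LLM-based system predicts complexity and selects a ratio.

```python
def token_count(height, width, ratio):
    """Number of latent tokens when each spatial side is downsampled by `ratio`."""
    if height % ratio or width % ratio:
        raise ValueError("image sides must be divisible by the ratio")
    return (height // ratio) * (width // ratio)


def pick_ratio(complexity, ratios=(8, 16, 32), thresholds=(0.66, 0.33)):
    """Map a complexity score in [0, 1] to a compression ratio.

    Hypothetical rule: more complex images get a smaller ratio and
    therefore more tokens. The thresholds are placeholders.
    """
    if not 0.0 <= complexity <= 1.0:
        raise ValueError("complexity must lie in [0, 1]")
    for ratio, threshold in zip(ratios, thresholds):
        if complexity >= threshold:
            return ratio
    return ratios[-1]


if __name__ == "__main__":
    for score in (0.9, 0.5, 0.1):
        ratio = pick_ratio(score)
        print(score, ratio, token_count(256, 256, ratio))
```

Under these assumptions a 256x256 image costs 1024 tokens at ratio 8 but only 64 at ratio 32, which is the kind of saving that can raise inference throughput when many images are simple.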

Title: Geometry Restoration and Dewarping of Camera-Captured Document Images

Authors: Valery Istomin, Oleg Pereziabov, Ilya Afanasyev
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2501.03145
Pdf URL: https://arxiv.org/pdf/2501.03145
Copy Paste: [[2501.03145]] Geometry Restoration and Dewarping of Camera-Captured Document Images(https://arxiv.org/abs/2501.03145)
Keywords: restoration
Abstract: This research focuses on developing a method for restoring the topology of digital images of paper documents captured by a camera, using algorithms for detection, segmentation, geometry restoration, and dewarping. Our methodology employs deep learning (DL) for document outline detection, followed by computer vision (CV) to create a topological 2D grid using cubic polynomial interpolation and correct nonlinear distortions by remapping the image. Using classical CV methods makes the document topology restoration process more efficient and faster, as it requires significantly fewer computational resources and memory. We developed a new pipeline for automatic document dewarping and reconstruction, along with a framework and annotated dataset to demonstrate its efficiency. Our experiments confirm the promise of our methodology and its superiority over existing benchmarks (including mobile apps and popular DL solutions, such as RectiNet, DocGeoNet, and DocTr++) both visually and in terms of document readability via Optical Character Recognition (OCR) and geometry restoration metrics. This paves the way for creating high-quality digital copies of paper documents and enhancing the efficiency of OCR systems. Project page: this https URL

Title: ProTracker: Probabilistic Integration for Robust and Accurate Point Tracking

Authors: Tingyang Zhang, Chen Wang, Zhiyang Dou, Qingzhe Gao, Jiahui Lei, Baoquan Chen, Lingjie Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.03220
Pdf URL: https://arxiv.org/pdf/2501.03220
Copy Paste: [[2501.03220]] ProTracker: Probabilistic Integration for Robust and Accurate Point Tracking(https://arxiv.org/abs/2501.03220)
Keywords: generation
Abstract: In this paper, we propose ProTracker, a novel framework for robust and accurate long-term dense tracking of arbitrary points in videos. The key idea of our method is incorporating probabilistic integration to refine multiple predictions from both optical flow and semantic features for robust short-term and long-term tracking. Specifically, we integrate optical flow estimations in a probabilistic manner, producing smooth and accurate trajectories by maximizing the likelihood of each prediction. To effectively re-localize challenging points that disappear and reappear due to occlusion, we further incorporate long-term feature correspondence into our flow predictions for continuous trajectory generation. Extensive experiments show that ProTracker achieves state-of-the-art performance among unsupervised and self-supervised approaches, and even outperforms supervised methods on several benchmarks. Our code and model will be made publicly available upon publication.
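One common way to combine several noisy position estimates by maximizing likelihood is inverse-variance weighting, which is the maximum-likelihood fusion of independent Gaussian predictions. The sketch below shows that generic technique for a single coordinate. The abstract does not specify ProTracker's probabilistic model, so the Gaussian assumption and the name `fuse_gaussian` are illustrative only.

```python
def fuse_gaussian(predictions):
    """Fuse independent 1-D Gaussian estimates given as (mean, variance) pairs.

    The maximum-likelihood estimate weights each mean by its precision
    (1 / variance); the fused variance is the inverse of the summed precision.
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    if any(var <= 0 for _, var in predictions):
        raise ValueError("variances must be positive")

    precisions = [1.0 / var for _, var in predictions]
    total = sum(precisions)
    mean = sum(m * p for (m, _), p in zip(predictions, precisions)) / total
    return mean, 1.0 / total


if __name__ == "__main__":
    # Two flow estimates of the same x-coordinate; the second is more certain.
    print(fuse_gaussian([(0.0, 1.0), (3.0, 0.5)]))
```

Confident predictions (small variance) dominate the fused position, and the fused variance is never larger than the smallest input variance.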

Title: Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation

Authors: Yuhui Zhang, Yuchang Su, Yiming Liu, Xiaohan Wang, James Burgess, Elaine Sui, Chenyu Wang, Josiah Aklilu, Alejandro Lozano, Anjiang Wei, Ludwig Schmidt, Serena Yeung-Levy
Subjects: cs.CV, cs.AI, cs.CL, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2501.03225
Pdf URL: https://arxiv.org/pdf/2501.03225
Copy Paste: [[2501.03225]] Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation(https://arxiv.org/abs/2501.03225)
Keywords: generation
Abstract: The rapid development of vision language models (VLMs) demands rigorous and reliable evaluation. However, current visual question answering (VQA) benchmarks often depend on open-ended questions, making accurate evaluation difficult due to the variability in natural language responses. To address this, we introduce AutoConverter, an agentic framework that automatically converts these open-ended questions into multiple-choice format, enabling objective evaluation while reducing the costly question creation process. Our experiments demonstrate that AutoConverter can generate correct and challenging multiple-choice questions, with VLMs demonstrating consistently similar or lower accuracy on these questions compared to human-created ones. Using AutoConverter, we construct VMCBench, a benchmark created by transforming 20 existing VQA datasets into a unified multiple-choice format, totaling 9,018 questions. We comprehensively evaluate 33 state-of-the-art VLMs on VMCBench, setting a new standard for scalable, consistent, and reproducible VLM evaluation.
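The benefit of a multiple-choice format is that scoring reduces to comparing option labels. The sketch below shows only that final step: assembling one question from a known answer plus distractors, and computing accuracy. It does not reproduce AutoConverter's agentic distractor generation; `build_mcq`, `accuracy`, and the dictionary layout are hypothetical names chosen for illustration.

```python
import random


def build_mcq(question, answer, distractors, seed=0):
    """Assemble a multiple-choice item with shuffled, lettered options."""
    if answer in distractors:
        raise ValueError("distractors must differ from the correct answer")
    options = [answer] + list(distractors)
    labels = "ABCDEFGH"
    if len(options) > len(labels):
        raise ValueError("too many options")

    # A seeded generator keeps the option order reproducible.
    random.Random(seed).shuffle(options)
    choices = dict(zip(labels, options))
    correct = next(label for label, text in choices.items() if text == answer)
    return {"question": question, "choices": choices, "correct": correct}


def accuracy(predicted_labels, items):
    """Fraction of items whose predicted label matches the correct label."""
    if len(predicted_labels) != len(items):
        raise ValueError("one prediction per item is required")
    hits = sum(p == item["correct"] for p, item in zip(predicted_labels, items))
    return hits / len(items)


if __name__ == "__main__":
    item = build_mcq("What animal is shown?", "cat", ["dog", "bird", "fish"])
    print(item["choices"], item["correct"])
    print(accuracy([item["correct"]], [item]))
```

Because the comparison is an exact label match, the score does not depend on how a model phrases its answer, which is the source of variability the abstract attributes to open-ended questions.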

Title: Gaussian Masked Autoencoders

Authors: Jathushan Rajasegaran, Xinlei Chen, Rulilong Li, Christoph Feichtenhofer, Jitendra Malik, Shiry Ginosar
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2501.03229
Pdf URL: https://arxiv.org/pdf/2501.03229
Copy Paste: [[2501.03229]] Gaussian Masked Autoencoders(https://arxiv.org/abs/2501.03229)
Keywords: generation
Abstract: This paper explores Masked Autoencoders (MAE) with Gaussian Splatting. While reconstructive self-supervised learning frameworks such as MAE learn good semantic abstractions, they are not trained for explicit spatial awareness. Our approach, named Gaussian Masked Autoencoder, or GMAE, aims to learn semantic abstractions and spatial understanding jointly. Like MAE, it reconstructs the image end-to-end in the pixel space, but beyond MAE, it also introduces an intermediate, 3D Gaussian-based representation and renders images via splatting. We show that GMAE can enable various zero-shot learning capabilities of spatial understanding (e.g., figure-ground segmentation, image layering, edge detection, etc.) while preserving the high-level semantics of self-supervised representation quality from MAE. To our knowledge, we are the first to employ Gaussian primitives in an image representation learning framework beyond optimization-based single-scene reconstructions. We believe GMAE will inspire further research in this direction and contribute to developing next-generation techniques for modeling high-fidelity visual data. More details at this https URL