2025-02-24

Title: KKA: Improving Vision Anomaly Detection through Anomaly-related Knowledge from Large Language Models

Authors: Dong Chen, Zhengqing Hu, Peiguang Fan, Yueting Zhuang, Yafei Li, Qidong Liu, Xiaoheng Jiang, Mingliang Xu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2502.14880
Pdf URL: https://arxiv.org/pdf/2502.14880
Copy Paste: [[2502.14880]] KKA: Improving Vision Anomaly Detection through Anomaly-related Knowledge from Large Language Models(https://arxiv.org/abs/2502.14880)
Keywords: generation
Abstract: Vision anomaly detection, particularly in unsupervised settings, often struggles to distinguish between normal samples and anomalies due to the wide variability in anomalies. Recently, an increasing number of studies have focused on generating anomalies to help detectors learn more effective boundaries between normal samples and anomalies. However, as the generated anomalies are often derived from random factors, they frequently lack realism. Additionally, randomly generated anomalies typically offer limited support in constructing effective boundaries, as most differ substantially from normal samples and lie far from the boundary. To address these challenges, we propose Key Knowledge Augmentation (KKA), a method that extracts anomaly-related knowledge from large language models (LLMs). More specifically, KKA leverages the extensive prior knowledge of LLMs to generate meaningful anomalies based on normal samples. Then, KKA classifies the generated anomalies as easy anomalies and hard anomalies according to their similarity to normal samples. Easy anomalies exhibit significant differences from normal samples, whereas hard anomalies closely resemble normal samples. KKA iteratively updates the generated anomalies, gradually increasing the proportion of hard anomalies so that the detector learns a more effective boundary. Experimental results show that the proposed method significantly improves the performance of various vision anomaly detectors while maintaining low generation costs. The code for CMG can be found at this https URL.
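Note: the easy/hard split and the growing share of hard anomalies described in the KKA abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: anomalies and normal samples are represented as plain feature vectors, similarity is cosine similarity to the nearest normal sample, the threshold is fixed at 0.8, and the hard-anomaly share grows linearly over rounds. The LLM-driven generation and updating of anomalies is not modeled.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def split_easy_hard(anomalies, normals, threshold=0.8):
    """Label an anomaly 'hard' when it is close to some normal sample (assumed rule)."""
    easy, hard = [], []
    for anomaly in anomalies:
        similarity = max(cosine(anomaly, normal) for normal in normals)
        (hard if similarity >= threshold else easy).append(anomaly)
    return easy, hard


def hard_fraction(round_idx, total_rounds, start=0.2, end=0.8):
    """Hypothetical linear schedule: share of hard anomalies grows from start to end."""
    if total_rounds <= 1:
        return end
    progress = round_idx / (total_rounds - 1)
    return start + progress * (end - start)


def sample_round(easy, hard, batch_size, frac_hard, rng):
    """Draw a training batch with roughly frac_hard hard anomalies (hard ones listed first)."""
    n_hard = min(len(hard), round(batch_size * frac_hard))
    n_easy = min(len(easy), batch_size - n_hard)
    return rng.sample(hard, n_hard) + rng.sample(easy, n_easy)


normals = [[1.0, 0.0]]
anomalies = [[0.9, 0.1], [0.0, 1.0], [1.0, 0.2], [-1.0, 0.0]]
easy, hard = split_easy_hard(anomalies, normals)
batch = sample_round(easy, hard, batch_size=2, frac_hard=0.5, rng=random.Random(0))
```

With these toy vectors, the two anomalies that point almost in the same direction as the normal sample end up in `hard`, and the other two in `easy`.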

Title: A Comprehensive Survey on Concept Erasure in Text-to-Image Diffusion Models

Authors: Changhoon Kim, Yanjun Qi
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2502.14896
Pdf URL: https://arxiv.org/pdf/2502.14896
Copy Paste: [[2502.14896]] A Comprehensive Survey on Concept Erasure in Text-to-Image Diffusion Models(https://arxiv.org/abs/2502.14896)
Keywords: generation
Abstract: Text-to-Image (T2I) models have made remarkable progress in generating high-quality, diverse visual content from natural language prompts. However, their ability to reproduce copyrighted styles, sensitive imagery, and harmful content raises significant ethical and legal concerns. Concept erasure offers a proactive alternative to external filtering by modifying T2I models to prevent the generation of undesired content. In this survey, we provide a structured overview of concept erasure, categorizing existing methods based on their optimization strategies and the architectural components they modify. We categorize concept erasure methods into fine-tuning for parameter updates, closed-form solutions for efficient edits, and inference-time interventions for content restriction without weight modification. Additionally, we explore adversarial attacks that bypass erasure techniques and discuss emerging defenses. To support further research, we consolidate key datasets, evaluation metrics, and benchmarks for assessing erasure effectiveness and model robustness. This survey serves as a comprehensive resource, offering insights into the evolving landscape of concept erasure, its challenges, and future directions.

Title: KITAB-Bench: A Comprehensive Multi-Domain Benchmark for Arabic OCR and Document Understanding

Authors: Ahmed Heakl, Abdullah Sohail, Mukul Ranjan, Rania Hossam, Ghazi Ahmed, Mohamed El-Geish, Omar Maher, Zhiqiang Shen, Fahad Khan, Salman Khan
Subjects: cs.CV, cs.AI, cs.CL, cs.HC, cs.LG
Abstract URL: https://arxiv.org/abs/2502.14949
Pdf URL: https://arxiv.org/pdf/2502.14949
Copy Paste: [[2502.14949]] KITAB-Bench: A Comprehensive Multi-Domain Benchmark for Arabic OCR and Document Understanding(https://arxiv.org/abs/2502.14949)
Keywords: generation
Abstract: With the growing adoption of Retrieval-Augmented Generation (RAG) in document processing, robust text recognition has become increasingly critical for knowledge extraction. While OCR (Optical Character Recognition) for English and other languages benefits from large datasets and well-established benchmarks, Arabic OCR faces unique challenges due to its cursive script, right-to-left text flow, and complex typographic and calligraphic features. We present KITAB-Bench, a comprehensive Arabic OCR benchmark that fills the gaps in current evaluation systems. Our benchmark comprises 8,809 samples across 9 major domains and 36 sub-domains, encompassing diverse document types including handwritten text, structured tables, and specialized coverage of 21 chart types for business intelligence. Our findings show that modern vision-language models (such as GPT-4, Gemini, and Qwen) outperform traditional OCR approaches (like EasyOCR, PaddleOCR, and Surya) by an average of 60% in Character Error Rate (CER). Furthermore, we highlight significant limitations of current Arabic OCR models, particularly in PDF-to-Markdown conversion, where the best model Gemini-2.0-Flash achieves only 65% accuracy. This underscores the challenges in accurately recognizing Arabic text, including issues with complex fonts, numeral recognition errors, word elongation, and table structure detection. This work establishes a rigorous evaluation framework that can drive improvements in Arabic document analysis methods and bridge the performance gap with English OCR technologies.
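Note: the KITAB-Bench abstract reports results in Character Error Rate (CER). As a rough illustration of the metric in its usual form, CER is the character-level edit distance between the OCR output and the reference, divided by the reference length. The sketch below assumes comparison on raw Unicode code points; any text normalization the benchmark applies (for example, handling of diacritics) is not known from the abstract and is not modeled.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance where insertion, deletion and substitution each cost 1."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # delete char_a
                            curr[j - 1] + 1,      # insert char_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / number of characters in the reference."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


# The reference has 4 characters and the hypothesis drops one of them.
cer = character_error_rate("كتاب", "كتب")
```

Because the distance is divided by the reference length, CER can exceed 1.0 when the hypothesis is much longer than the reference.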

Title: LAVID: An Agentic LVLM Framework for Diffusion-Generated Video Detection

Authors: Qingyuan Liu, Yun-Yun Tsai, Ruijian Zha, Victoria Li, Pengyuan Shi, Chengzhi Mao, Junfeng Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.14994
Pdf URL: https://arxiv.org/pdf/2502.14994
Copy Paste: [[2502.14994]] LAVID: An Agentic LVLM Framework for Diffusion-Generated Video Detection(https://arxiv.org/abs/2502.14994)
Keywords: generation, generative
Abstract: The impressive achievements of generative models in creating high-quality videos have raised concerns about digital integrity and privacy vulnerabilities. AI-generated content detection has been widely studied in the image domain (e.g., deepfakes), yet the video domain remains unexplored. Large Vision Language Models (LVLMs) have become an emerging tool for AI-generated content detection thanks to their strong reasoning and multimodal capabilities. They overcome limitations of traditional deep-learning-based methods, such as a lack of transparency and an inability to recognize new artifacts. Motivated by this, we propose LAVID, a novel LVLM-based framework for AI-generated video detection with explicit knowledge enhancement. Our insights are as follows: (1) Leading LVLMs can call external tools to extract useful information that facilitates their own video detection task; (2) The structure of the prompt can affect an LVLM's ability to reason about and interpret information in video content. Our proposed pipeline automatically selects a set of explicit knowledge tools for detection, and then adaptively adjusts the structured prompt by self-rewriting. Unlike prior SOTA methods that train additional detectors, our method is fully training-free and only requires LVLM inference for detection. To facilitate our research, we also create a new benchmark \vidfor with high-quality videos generated from multiple video generation tools. Evaluation results show that LAVID improves F1 scores by 6.2 to 30.2% over the top baselines on our datasets across four SOTA LVLMs.

Title: Generative Modeling of Individual Behavior at Scale

Authors: Nabil Omi, Lucas Caccia, Anurag Sarkar, Jordan T. Ash, Siddhartha Sen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.14998
Pdf URL: https://arxiv.org/pdf/2502.14998
Copy Paste: [[2502.14998]] Generative Modeling of Individual Behavior at Scale(https://arxiv.org/abs/2502.14998)
Keywords: generation, generative
Abstract: There has been a growing interest in using AI to model human behavior, particularly in domains where humans interact with this technology. While most existing work models human behavior at an aggregate level, our goal is to model behavior at the individual level. Recent approaches to behavioral stylometry -- or the task of identifying a person from their actions alone -- have shown promise in domains like chess, but these approaches are either not scalable (e.g., fine-tune a separate model for each person) or not generative, in that they cannot generate actions. We address these limitations by framing behavioral stylometry as a multi-task learning problem -- where each task represents a distinct person -- and use parameter-efficient fine-tuning (PEFT) methods to learn an explicit style vector for each person. Style vectors are generative: they selectively activate shared "skill" parameters to generate actions in the style of each person. They also induce a latent space that we can interpret and manipulate algorithmically. In particular, we develop a general technique for style steering that allows us to steer a player's style vector towards a desired property. We apply our approach to two very different games, at unprecedented scales: chess (47,864 players) and Rocket League (2,000 players). We also show generality beyond gaming by applying our method to image generation, where we learn style vectors for 10,177 celebrities and use these vectors to steer their images.

Title: Synth It Like KITTI: Synthetic Data Generation for Object Detection in Driving Scenarios

Authors: Richard Marcus, Christian Vogel, Inga Jatzkowski, Niklas Knoop, Marc Stamminger
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2502.15076
Pdf URL: https://arxiv.org/pdf/2502.15076
Copy Paste: [[2502.15076]] Synth It Like KITTI: Synthetic Data Generation for Object Detection in Driving Scenarios(https://arxiv.org/abs/2502.15076)
Keywords: generation
Abstract: An important factor in advancing autonomous driving systems is simulation. Yet there has been rather little progress on transferability between the virtual and the real world. We revisit this problem for 3D object detection on LiDAR point clouds and propose a dataset generation pipeline based on the CARLA simulator. Utilizing domain randomization strategies and careful modeling, we are able to train an object detector on the synthetic data and demonstrate strong generalization capabilities to the KITTI dataset. Furthermore, we compare different virtual sensor variants to gain insight into which sensor attributes may be responsible for the prevalent domain gap. Finally, fine-tuning with a small portion of real data almost matches the baseline, and fine-tuning with the full training set slightly surpasses it.

Title: Hardware-Friendly Static Quantization Method for Video Diffusion Transformers

Authors: Sanghyun Yi, Qingfeng Liu, Mostafa El-Khamy
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2502.15077
Pdf URL: https://arxiv.org/pdf/2502.15077
Copy Paste: [[2502.15077]] Hardware-Friendly Static Quantization Method for Video Diffusion Transformers(https://arxiv.org/abs/2502.15077)
Keywords: generation, generative
Abstract: Diffusion Transformers for video generation have gained significant research interest since the impressive performance of SORA. Efficient deployment of such generative-AI models on GPUs has been demonstrated with dynamic quantization. However, resource-constrained devices cannot support dynamic quantization and need statically quantized models for efficient deployment on AI processors. In this paper, we propose a novel method for the post-training quantization of OpenSora, a Video Diffusion Transformer, without relying on dynamic quantization techniques. Our approach employs static quantization, achieving video quality comparable to FP16 and dynamically quantized ViDiT-Q methods, as measured by CLIP and VQA metrics. In particular, we utilize per-step calibration data to adequately provide a post-training statically quantized model for each time step, incorporating channel-wise quantization for weights and tensor-wise quantization for activations. By further applying the smooth-quantization technique, we can obtain high-quality video outputs with the statically quantized models. Extensive experimental results demonstrate that static quantization can be a viable alternative to dynamic quantization for video diffusion transformers, offering a more efficient approach without sacrificing performance.
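Note: the quantization abstract combines channel-wise weight quantization, tensor-wise activation quantization, and per-step calibration. The sketch below shows one common way these pieces fit together. It is an illustration under stated assumptions and does not reproduce the paper's procedure: it uses symmetric signed 8-bit quantization with max-abs scales, nested Python lists in place of tensors, and hypothetical helper names. The smooth-quantization step is not included.

```python
def quantize_symmetric(values, scale, bits=8):
    """Map floats to signed integers in [-qmax, qmax] using a fixed scale."""
    qmax = 2 ** (bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


def channel_wise_weight_scales(weight, bits=8):
    """One scale per output channel (one per row of the weight matrix)."""
    qmax = 2 ** (bits - 1) - 1
    scales = []
    for row in weight:
        peak = max(abs(w) for w in row)
        scales.append(peak / qmax if peak > 0 else 1.0)
    return scales


def static_activation_scale(calibration_batches, bits=8):
    """One scale for the whole activation tensor, fixed ahead of time from calibration data."""
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(x) for batch in calibration_batches for x in batch)
    return peak / qmax if peak > 0 else 1.0


# Weights: each row gets its own scale, so a small-magnitude row keeps its resolution.
weight = [[0.5, -1.27], [0.02, -0.005]]
weight_scales = channel_wise_weight_scales(weight)
q_weight = [quantize_symmetric(row, s) for row, s in zip(weight, weight_scales)]

# Activations: hypothetical calibration data, keyed by denoising step.
calibration = {0: [[0.1, -2.54]], 1: [[0.3, 0.635]]}
per_step_scale = {step: static_activation_scale(b) for step, b in calibration.items()}

# At inference the scale for step 0 is looked up, not recomputed from the input.
q_act = quantize_symmetric([1.0, -3.0], per_step_scale[0])
```

The last line shows the trade-off of static quantization: the input value -3.0 lies outside the calibrated range for step 0, so it is clipped to -127 rather than triggering a new scale as dynamic quantization would.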

Title: M3-AGIQA: Multimodal, Multi-Round, Multi-Aspect AI-Generated Image Quality Assessment

Authors: Chuan Cui, Kejiang Chen, Zhihua Wei, Wen Shen, Weiming Zhang, Nenghai Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.15167
Pdf URL: https://arxiv.org/pdf/2502.15167
Copy Paste: [[2502.15167]] M3-AGIQA: Multimodal, Multi-Round, Multi-Aspect AI-Generated Image Quality Assessment(https://arxiv.org/abs/2502.15167)
Keywords: quality assessment
Abstract: The rapid advancement of AI-generated image (AGI) models has introduced significant challenges in evaluating their quality, which requires considering multiple dimensions such as perceptual quality, prompt correspondence, and authenticity. To address these challenges, we propose M3-AGIQA, a comprehensive framework for AGI quality assessment that is Multimodal, Multi-Round, and Multi-Aspect. Our approach leverages the capabilities of Multimodal Large Language Models (MLLMs) as joint text and image encoders and distills advanced captioning capabilities from online MLLMs into a local model via Low-Rank Adaptation (LoRA) fine-tuning. The framework includes a structured multi-round evaluation mechanism, where intermediate image descriptions are generated to provide deeper insights into the quality, correspondence, and authenticity aspects. To align predictions with human perceptual judgments, a predictor constructed by an xLSTM and a regression head is incorporated to process sequential logits and predict Mean Opinion Scores (MOSs). Extensive experiments conducted on multiple benchmark datasets demonstrate that M3-AGIQA achieves state-of-the-art performance, effectively capturing nuanced aspects of AGI quality. Furthermore, cross-dataset validation confirms its strong generalizability. The code is available at this https URL.

Title: Methods and Trends in Detecting Generated Images: A Comprehensive Review

Authors: Arpan Mahara, Naphtali Rishe
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2502.15176
Pdf URL: https://arxiv.org/pdf/2502.15176
Copy Paste: [[2502.15176]] Methods and Trends in Detecting Generated Images: A Comprehensive Review(https://arxiv.org/abs/2502.15176)
Keywords: generative
Abstract: The proliferation of generative models, such as Generative Adversarial Networks (GANs), Diffusion Models, and Variational Autoencoders (VAEs), has enabled the synthesis of high-quality multimedia data. However, these advancements have also raised significant concerns regarding adversarial attacks, unethical usage, and societal harm. Recognizing these challenges, researchers have increasingly focused on developing methodologies to detect synthesized data effectively, aiming to mitigate potential risks. Prior reviews have primarily focused on deepfake detection and often lack coverage of recent advancements in synthetic image detection, particularly methods leveraging multimodal frameworks for improved forensic analysis. To address this gap, the present survey provides a comprehensive review of state-of-the-art methods for detecting and classifying synthetic images generated by advanced generative AI models. This review systematically examines core detection methodologies, identifies commonalities among approaches, and categorizes them into meaningful taxonomies. Furthermore, given the crucial role of large-scale datasets in this field, we present an overview of publicly available datasets that facilitate further research and benchmarking in synthetic data detection.

Title: FlipConcept: Tuning-Free Multi-Concept Personalization for Text-to-Image Generation

Authors: Young Beom Woo, Sun Eung Kim
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2502.15203
Pdf URL: https://arxiv.org/pdf/2502.15203
Copy Paste: [[2502.15203]] FlipConcept: Tuning-Free Multi-Concept Personalization for Text-to-Image Generation(https://arxiv.org/abs/2502.15203)
Keywords: generation
Abstract: Recently, methods that integrate multiple personalized concepts into a single image have garnered significant attention in the field of text-to-image (T2I) generation. However, existing methods experience performance degradation in complex scenes with multiple objects due to distortions in non-personalized regions. To address this issue, we propose FlipConcept, a novel approach that seamlessly integrates multiple personalized concepts into a single image without requiring additional tuning. We introduce guided appearance attention to accurately mimic the appearance of a personalized concept as intended. Additionally, we introduce mask-guided noise mixing to protect non-personalized regions during editing. Lastly, we apply background dilution to minimize attribute leakage, which is the undesired blending of personalized concept attributes with other objects in the image. In our experiments, we demonstrate that the proposed method, despite not requiring tuning, outperforms existing models in both single and multiple personalized concept inference.

Title: Omnidirectional Image Quality Captioning: A Large-scale Database and A New Model

Authors: Jiebin Yan, Ziwen Tan, Yuming Fang, Junjie Chen, Wenhui Jiang, Zhou Wang
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2502.15271
Pdf URL: https://arxiv.org/pdf/2502.15271
Copy Paste: [[2502.15271]] Omnidirectional Image Quality Captioning: A Large-scale Database and A New Model(https://arxiv.org/abs/2502.15271)
Keywords: quality assessment
Abstract: The fast growing application of omnidirectional images calls for effective approaches for omnidirectional image quality assessment (OIQA). Existing OIQA methods have been developed and tested on homogeneously distorted omnidirectional images, but it is hard to transfer their success directly to the heterogeneously distorted omnidirectional images. In this paper, we conduct the largest study so far on OIQA, where we establish a large-scale database called OIQ-10K containing 10,000 omnidirectional images with both homogeneous and heterogeneous distortions. A comprehensive psychophysical study is elaborated to collect human opinions for each omnidirectional image, together with the spatial distributions (within local regions or globally) of distortions, and the head and eye movements of the subjects. Furthermore, we propose a novel multitask-derived adaptive feature-tailoring OIQA model named IQCaption360, which is capable of generating a quality caption for an omnidirectional image in a manner of textual template. Extensive experiments demonstrate the effectiveness of IQCaption360, which outperforms state-of-the-art methods by a significant margin on the proposed OIQ-10K database. The OIQ-10K database and the related source codes are available at this https URL.

Title: Efficiently Solving Discounted MDPs with Predictions on Transition Matrices

Authors: Lixing Lyu, Jiashuo Jiang, Wang Chi Cheung
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.15345
Pdf URL: https://arxiv.org/pdf/2502.15345
Copy Paste: [[2502.15345]] Efficiently Solving Discounted MDPs with Predictions on Transition Matrices(https://arxiv.org/abs/2502.15345)
Keywords: generative
Abstract: We study infinite-horizon Discounted Markov Decision Processes (DMDPs) under a generative model. Motivated by the Algorithm with Advice framework (Mitzenmacher and Vassilvitskii, 2022), we propose a novel framework to investigate how a prediction on the transition matrix can enhance the sample efficiency in solving DMDPs and improve sample complexity bounds. We focus on DMDPs with $N$ state-action pairs and discount factor $\gamma$. First, we provide an impossibility result: without prior knowledge of the prediction accuracy, no sampling policy can compute an $\epsilon$-optimal policy with a sample complexity bound better than $\tilde{O}((1-\gamma)^{-3} N\epsilon^{-2})$, which matches the state-of-the-art minimax sample complexity bound with no prediction. As a complement, we propose an algorithm based on minimax optimization techniques that leverages the prediction on the transition matrix. Our algorithm achieves a sample complexity bound that depends on the prediction error, and the bound is uniformly better than $\tilde{O}((1-\gamma)^{-4} N \epsilon^{-2})$, the previous best result derived from convex optimization methods. These theoretical findings are further supported by our numerical experiments.
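Note: to make the gap between the two rates quoted in the abstract concrete, the short calculation below compares them while ignoring the logarithmic factors hidden by $\tilde{O}$. The value $\gamma = 0.99$ is chosen only for illustration and does not come from the paper.

```latex
\[
\frac{(1-\gamma)^{-4}\, N\, \epsilon^{-2}}{(1-\gamma)^{-3}\, N\, \epsilon^{-2}}
  = \frac{1}{1-\gamma},
\qquad
\gamma = 0.99 \;\Longrightarrow\; \frac{1}{1-\gamma} = 100 .
\]
```

The two rates therefore differ by one factor of the effective horizon $1/(1-\gamma)$, independent of $N$ and $\epsilon$.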

Title: Weakly Supervised Video Scene Graph Generation via Natural Language Supervision

Authors: Kibum Kim, Kanghoon Yoon, Yeonjun In, Jaehyeong Jeon, Jinyoung Moon, Donghyun Kim, Chanyoung Park
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.15370
Pdf URL: https://arxiv.org/pdf/2502.15370
Copy Paste: [[2502.15370]] Weakly Supervised Video Scene Graph Generation via Natural Language Supervision(https://arxiv.org/abs/2502.15370)
Keywords: generation
Abstract: Existing Video Scene Graph Generation (VidSGG) studies are trained in a fully supervised manner, which requires all frames in a video to be annotated, thereby incurring high annotation cost compared to Image Scene Graph Generation (ImgSGG). Although the annotation cost of VidSGG can be alleviated by adopting a weakly supervised approach commonly used for ImgSGG (WS-ImgSGG) that uses image captions, there are two key reasons that hinder such a naive adoption: 1) Temporality within video captions, i.e., unlike image captions, video captions include temporal markers (e.g., before, while, then, after) that indicate time related details, and 2) Variability in action duration, i.e., unlike human actions in image captions, human actions in video captions unfold over varying duration. To address these issues, we propose a Natural Language-based Video Scene Graph Generation (NL-VSGG) framework that only utilizes the readily available video captions for training a VidSGG model. NL-VSGG consists of two key modules: Temporality-aware Caption Segmentation (TCS) module and Action Duration Variability-aware caption-frame alignment (ADV) module. Specifically, TCS segments the video captions into multiple sentences in a temporal order based on a Large Language Model (LLM), and ADV aligns each segmented sentence with appropriate frames considering the variability in action duration. Our approach leads to a significant enhancement in performance compared to simply applying the WS-ImgSGG pipeline to VidSGG on the Action Genome dataset. As a further benefit of utilizing the video captions as weak supervision, we show that the VidSGG model trained by NL-VSGG is able to predict a broader range of action classes that are not included in the training data, which makes our framework practical in reality.

Title: LongCaptioning: Unlocking the Power of Long Caption Generation in Large Multimodal Models

Authors: Hongchen Wei, Zhihong Tan, Yaosi Hu, Changwen Chen, Zhenzhong Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.15393
Pdf URL: https://arxiv.org/pdf/2502.15393
Copy Paste: [[2502.15393]] LongCaptioning: Unlocking the Power of Long Caption Generation in Large Multimodal Models(https://arxiv.org/abs/2502.15393)
Keywords: generation
Abstract: Large multimodal models (LMMs) have shown remarkable performance in video understanding tasks and can even process videos longer than one hour. However, despite their ability to handle long inputs, generating outputs with corresponding levels of richness remains a challenge. In this paper, we explore the issue of long outputs in LMMs using video captioning as a proxy task, and we find that open-source LMMs struggle to consistently generate outputs exceeding about 300 words. Through controlled experiments, we find that the scarcity of paired examples with long-captions during training is the primary factor limiting the model's output length. However, manually annotating long-caption examples is time-consuming and expensive. To address this, we propose the LongCaption-Agent, a framework that synthesizes long caption data by aggregating multi-level descriptions. Using LongCaption-Agent, we curated a new long-caption dataset, LongCaption-10K. We also develop LongCaption-Bench, a benchmark designed to comprehensively evaluate the quality of long captions generated by LMMs. By incorporating LongCaption-10K into training, we enable LMMs to generate captions exceeding 1,000 words, while maintaining high output quality. In LongCaption-Bench, our 8B parameter model achieved state-of-the-art performance, even surpassing larger proprietary models. We will release the dataset and code after publication.

Title: Evaluate with the Inverse: Efficient Approximation of Latent Explanation Quality Distribution

Authors: Carlos Eiras-Franco, Anna Hedström, Marina M.-C. Höhne
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.15403
Pdf URL: https://arxiv.org/pdf/2502.15403
Copy Paste: [[2502.15403]] Evaluate with the Inverse: Efficient Approximation of Latent Explanation Quality Distribution(https://arxiv.org/abs/2502.15403)
Keywords: quality assessment
Abstract: Obtaining high-quality explanations of a model's output enables developers to identify and correct biases, align the system's behavior with human values, and ensure ethical compliance. Explainable Artificial Intelligence (XAI) practitioners rely on specific measures to gauge the quality of such explanations. These measures assess key attributes, such as how closely an explanation aligns with a model's decision process (faithfulness), how accurately it pinpoints the relevant input features (localization), and its consistency across different cases (robustness). Despite providing valuable information, these measures do not fully address a critical practitioner's concern: how does the quality of a given explanation compare to other potential explanations? Traditionally, the quality of an explanation has been assessed by comparing it to a randomly generated counterpart. This paper introduces an alternative: the Quality Gap Estimate (QGE). The QGE method offers a direct comparison to what can be viewed as the `inverse' explanation, one that conceptually represents the antithesis of the original explanation. Our extensive testing across multiple model architectures, datasets, and established quality metrics demonstrates that the QGE method is superior to the traditional approach. Furthermore, we show that QGE enhances the statistical reliability of these quality assessments. This advance represents a significant step toward a more insightful evaluation of explanations that enables a more effective inspection of a model's behavior.
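
Illustration (not from the paper): the abstract describes scoring an explanation against its 'inverse'. The sketch below shows one way such a gap could be computed. It assumes that the inverse is obtained by negating the attributions (which reverses the feature ranking) and that quality is measured by a toy deletion-style faithfulness score on a linear model. The names `inverse_explanation`, `toy_faithfulness`, and `quality_gap` are hypothetical, and the paper's actual definitions may differ.

```python
def inverse_explanation(attributions):
    """Assumed inverse: negate each attribution, reversing the feature ranking."""
    return [-a for a in attributions]


def toy_faithfulness(weights, x, attributions, k=2):
    """Toy metric: drop in a linear model's output after removing the
    k features that the explanation ranks highest."""
    top = sorted(range(len(x)), key=lambda i: attributions[i], reverse=True)[:k]
    full = sum(w * v for w, v in zip(weights, x))
    masked = sum(w * v for i, (w, v) in enumerate(zip(weights, x)) if i not in top)
    return full - masked


def quality_gap(weights, x, attributions, k=2):
    """Quality of the explanation minus quality of its assumed inverse."""
    q = toy_faithfulness(weights, x, attributions, k)
    q_inv = toy_faithfulness(weights, x, inverse_explanation(attributions), k)
    return q - q_inv


weights = [3.0, 0.5, 2.0, 0.1]
x = [1.0, 1.0, 1.0, 1.0]
good = [w * v for w, v in zip(weights, x)]  # attributions equal to true contributions
gap = quality_gap(weights, x, good)
print(gap)
```

Under these assumptions, a large positive gap means the explanation ranks features far better than its opposite would, while a gap near zero suggests the ranking carries little information.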

Title: MVIP -- A Dataset and Methods for Application Oriented Multi-View and Multi-Modal Industrial Part Recognition

Authors: Paul Koch, Marian Schlüter, Jörg Krüger
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2502.15448
Pdf URL: https://arxiv.org/pdf/2502.15448
Copy Paste: [[2502.15448]] MVIP -- A Dataset and Methods for Application Oriented Multi-View and Multi-Modal Industrial Part Recognition(https://arxiv.org/abs/2502.15448)
Keywords: generation
Abstract: We present MVIP, a novel dataset for multi-modal and multi-view application-oriented industrial part recognition. Here we are the first to combine a calibrated RGBD multi-view dataset with additional object context such as physical properties, natural language, and super-classes. The current portfolio of available datasets offers a wide range of representations to design and benchmark related methods. In contrast to existing classification challenges, industrial recognition applications offer controlled multi-modal environments but at the same time pose different problems than traditional 2D/3D classification challenges. Frequently, industrial applications must deal with either a small or an increased amount of training data, visually similar parts, and varying object sizes, while requiring robust, near-100% top-5 accuracy under cost and time constraints. Current methods tackle such challenges individually, but direct adoption of these methods within industrial applications is complex and requires further research. Our main goal with MVIP is to study and advance the transferability of various state-of-the-art methods within related downstream tasks towards an efficient deployment of industrial classifiers. Additionally, with MVIP we intend to advance research on several modality fusion topics, (automated) synthetic data generation, and complex data sampling, all combined in a single application-oriented benchmark.

Title: Mitigating Data Scarcity in Time Series Analysis: A Foundation Model with Series-Symbol Data Generation

Authors: Wenxuan Wang, Kai Wu, Yujian Betterest Li, Dan Wang, Xiaoyu Zhang, Jing Liu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2502.15466
Pdf URL: https://arxiv.org/pdf/2502.15466
Copy Paste: [[2502.15466]] Mitigating Data Scarcity in Time Series Analysis: A Foundation Model with Series-Symbol Data Generation(https://arxiv.org/abs/2502.15466)
Keywords: generation
Abstract: Foundation models for time series analysis (TSA) have attracted significant attention. However, challenges such as data scarcity and data imbalance continue to hinder their development. To address this, we consider modeling complex systems through symbolic expressions that serve as semantic descriptors of time series. Building on this concept, we introduce a series-symbol (S2) dual-modality data generation mechanism, enabling the unrestricted creation of high-quality time series data paired with corresponding symbolic representations. Leveraging the S2 dataset, we develop SymTime, a pre-trained foundation model for TSA. SymTime demonstrates competitive performance across five major TSA tasks when fine-tuned on downstream tasks, rivaling foundation models pre-trained on real-world datasets. This approach underscores the potential of dual-modality data generation and pretraining mechanisms in overcoming data scarcity and enhancing task performance.

Title: Decoding for Punctured Convolutional and Turbo Codes: A Deep Learning Solution for Protocols Compliance

Authors: Yongli Yan, Linglong Dai
Subjects: cs.LG, cs.IT
Abstract URL: https://arxiv.org/abs/2502.15475
Pdf URL: https://arxiv.org/pdf/2502.15475
Copy Paste: [[2502.15475]] Decoding for Punctured Convolutional and Turbo Codes: A Deep Learning Solution for Protocols Compliance(https://arxiv.org/abs/2502.15475)
Keywords: generation
Abstract: Neural network-based decoding methods have shown promise in enhancing error correction performance, but traditional approaches struggle with the challenges posed by punctured codes. In particular, these methods fail to address the complexities of variable code rates and the need for protocol compatibility. This paper presents a unified Long Short-Term Memory (LSTM)-based decoding architecture specifically designed to overcome these challenges. The proposed method unifies punctured convolutional and Turbo codes. A puncture embedding mechanism integrates puncturing patterns directly into the network, enabling seamless adaptation to varying code rates, while balanced bit error rate training ensures robustness across different code lengths, rates, and channels, maintaining protocol flexibility. Extensive simulations in Additive White Gaussian Noise and Rayleigh fading channels demonstrate that the proposed approach outperforms conventional decoding techniques, providing significant improvements in decoding accuracy and robustness. These results underscore the potential of LSTM-based decoding as a promising solution for next-generation artificial intelligence powered communication systems.
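
Illustration (not from the paper): for readers unfamiliar with punctured codes, the sketch below shows classical puncturing and depuncturing of a coded bit stream with a periodic pattern. It does not reproduce the paper's puncture embedding mechanism or its LSTM decoder. The pattern, the example bits, and the function names are assumptions chosen for the example.

```python
def puncture(coded_bits, pattern):
    """Keep coded bit i only where the periodic pattern has a 1."""
    return [b for i, b in enumerate(coded_bits) if pattern[i % len(pattern)]]


def depuncture(received, pattern, total_len, erasure=0.0):
    """Re-insert neutral soft values (erasures) at the punctured positions."""
    out, it = [], iter(received)
    for i in range(total_len):
        out.append(next(it) if pattern[i % len(pattern)] else erasure)
    return out


coded = [1, 0, 1, 1, 0, 0, 1, 0]  # 4 information bits encoded at rate 1/2
pattern = [1, 1, 1, 0]            # transmit 3 of every 4 coded bits -> rate 2/3
sent = puncture(coded, pattern)

# Map bits to soft values (+1.0 for bit 0, -1.0 for bit 1) before depuncturing,
# so that 0.0 can unambiguously mean "no information".
soft = [1.0 - 2.0 * b for b in sent]
restored = depuncture(soft, pattern, len(coded))
print(sent, restored)
```

Changing `pattern` changes the effective code rate without changing the underlying encoder, which is why a decoder that must follow a protocol has to cope with many patterns.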

Title: CondiQuant: Condition Number Based Low-Bit Quantization for Image Super-Resolution

Authors: Kai Liu, Dehui Wang, Zhiteng Li, Zheng Chen, Yong Guo, Wenbo Li, Linghe Kong, Yulun Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.15478
Pdf URL: https://arxiv.org/pdf/2502.15478
Copy Paste: [[2502.15478]] CondiQuant: Condition Number Based Low-Bit Quantization for Image Super-Resolution(https://arxiv.org/abs/2502.15478)
Keywords: super-resolution
Abstract: Low-bit model quantization for image super-resolution (SR) is a longstanding task that is renowned for its surprising compression and acceleration ability. However, accuracy degradation is inevitable when compressing the full-precision (FP) model to ultra-low bit widths (2 to 4 bits). Experimentally, we observe that the degradation of quantization is mainly attributed to the quantization of activation instead of model weights. In numerical analysis, the condition number of weights could measure how much the output value can change for a small change in the input argument, inherently reflecting the quantization error. Therefore, we propose CondiQuant, a condition number based low-bit post-training quantization for image super-resolution. Specifically, we formulate the quantization error as the condition number of weight matrices. By decoupling the representation ability and the quantization sensitivity, we design an efficient proximal gradient descent algorithm to iteratively minimize the condition number while keeping the output unchanged. With comprehensive experiments, we demonstrate that CondiQuant outperforms existing state-of-the-art post-training quantization methods in accuracy without computation overhead and gains the theoretically optimal compression ratio in model parameters. Our code and model are released at this https URL.
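
Illustration (not from the paper): the abstract relies on the condition number of a weight matrix as a measure of sensitivity. The sketch below computes the 2-norm condition number (largest singular value divided by smallest) for a 2x2 matrix in closed form. It only illustrates the quantity itself and is not the CondiQuant algorithm. The helper name `cond_2x2` is hypothetical, and in practice a library routine such as `numpy.linalg.cond` would normally be used.

```python
import math


def cond_2x2(a, b, c, d):
    """2-norm condition number of [[a, b], [c, d]]: sigma_max / sigma_min."""
    # The squared singular values are the eigenvalues of A^T A, which for a
    # 2x2 matrix follow from its trace and determinant.
    trace = a * a + b * b + c * c + d * d   # trace(A^T A)
    det = (a * d - b * c) ** 2              # det(A^T A) = det(A)^2
    disc = math.sqrt(max(trace * trace - 4.0 * det, 0.0))
    s_max = math.sqrt((trace + disc) / 2.0)
    s_min = math.sqrt(max((trace - disc) / 2.0, 0.0))
    return math.inf if s_min == 0.0 else s_max / s_min


well_conditioned = cond_2x2(1.0, 0.0, 0.0, 1.0)   # identity matrix
ill_conditioned = cond_2x2(10.0, 0.0, 0.0, 0.1)   # very uneven scaling
print(well_conditioned, ill_conditioned)
```

A matrix with a large condition number can amplify a small input perturbation, such as a rounding error from quantized activations, much more than a well-conditioned one.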

Title: Network Resource Optimization for ML-Based UAV Condition Monitoring with Vibration Analysis

Authors: Alexandre Gemayel, Dimitrios Michael Manias, Abdallah Shami
Subjects: cs.LG, cs.NI, eess.SP, eess.SY
Abstract URL: https://arxiv.org/abs/2502.15491
Pdf URL: https://arxiv.org/pdf/2502.15491
Copy Paste: [[2502.15491]] Network Resource Optimization for ML-Based UAV Condition Monitoring with Vibration Analysis(https://arxiv.org/abs/2502.15491)
Keywords: generation
Abstract: As smart cities begin to materialize, the role of Unmanned Aerial Vehicles (UAVs) and their reliability becomes increasingly important. One aspect of reliability relates to Condition Monitoring (CM), where Machine Learning (ML) models are leveraged to identify abnormal and adverse conditions. Given the resource-constrained nature of next-generation edge networks, the utilization of precious network resources must be minimized. This work explores the optimization of network resources for ML-based UAV CM frameworks. The developed framework uses experimental data and varies the feature extraction aggregation interval to optimize ML model selection. Additionally, by leveraging dimensionality reduction techniques, there is a 99.9% reduction in network resource consumption.

Title: Activation Steering in Neural Theorem Provers

Authors: Shashank Kirtania
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2502.15507
Pdf URL: https://arxiv.org/pdf/2502.15507
Copy Paste: [[2502.15507]] Activation Steering in Neural Theorem Provers(https://arxiv.org/abs/2502.15507)
Keywords: generation
Abstract: Large Language Models (LLMs) have shown promise in proving formal theorems using proof assistants like Lean. However, current state-of-the-art language models struggle to predict the next step in proofs, leading practitioners to use different sampling techniques to improve LLM capabilities. We observe that the LLM is capable of predicting the correct tactic; however, it faces challenges in ranking it appropriately within the set of candidate tactics, affecting the overall selection process. To overcome this hurdle, we use activation steering to guide LLM responses and improve the generations at inference time. Our results suggest that activation steering offers a promising lightweight alternative to specialized fine-tuning for enhancing theorem proving capabilities in LLMs, particularly valuable in resource-constrained environments.
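
Illustration (not from the paper): activation steering is commonly implemented by adding a fixed direction to a model's hidden activations at inference time. The sketch below uses one common formulation, where the direction is the difference between mean activations collected on two contrasting prompt sets. Whether the paper builds its steering vectors this way is an assumption, and all names and numbers here are hypothetical toy values.

```python
def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(pos_acts, neg_acts):
    """Direction pointing from the 'undesired' mean toward the 'desired' mean."""
    return [p - q for p, q in zip(mean_vector(pos_acts), mean_vector(neg_acts))]


def steer(hidden, vector, alpha=1.0):
    """Shift a hidden state along the steering direction, scaled by alpha."""
    return [h + alpha * v for h, v in zip(hidden, vector)]


# Toy 2-dimensional activations from prompts with and without the desired behavior.
pos = [[1.0, 2.0], [3.0, 2.0]]
neg = [[0.0, 1.0], [2.0, 1.0]]
v = steering_vector(pos, neg)
steered = steer([0.5, -0.5], v, alpha=2.0)
print(v, steered)
```

No weights are updated in this scheme, which is what makes it a lightweight alternative to fine-tuning.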

Title: SALSA-RL: Stability Analysis in the Latent Space of Actions for Reinforcement Learning

Authors: Xuyang Li, Romit Maulik
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.15512
Pdf URL: https://arxiv.org/pdf/2502.15512
Copy Paste: [[2502.15512]] SALSA-RL: Stability Analysis in the Latent Space of Actions for Reinforcement Learning(https://arxiv.org/abs/2502.15512)
Keywords: generation
Abstract: Modern deep reinforcement learning (DRL) methods have made significant advances in handling continuous action spaces. However, real-world control systems--especially those requiring precise and reliable performance--often demand formal stability, and existing DRL approaches typically lack explicit mechanisms to ensure or analyze stability. To address this limitation, we propose SALSA-RL (Stability Analysis in the Latent Space of Actions), a novel RL framework that models control actions as dynamic, time-dependent variables evolving within a latent space. By employing a pre-trained encoder-decoder and a state-dependent linear system, our approach enables both stability analysis and interpretability. We demonstrated that SALSA-RL can be deployed in a non-invasive manner for assessing the local stability of actions from pretrained RL agents without compromising on performance across diverse benchmark environments. By enabling a more interpretable analysis of action generation, SALSA-RL provides a powerful tool for advancing the design, analysis, and theoretical understanding of RL systems.
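
Illustration (not from the paper): the abstract mentions a state-dependent linear system that enables local stability analysis. The sketch below assumes a discrete-time latent update z_{t+1} = A(s) z_t and applies the standard criterion that such a system is locally stable when every eigenvalue of A(s) has magnitude below 1. The discrete-time form, the 2x2 size, and the function names are assumptions; the paper's formulation may be different (for example, continuous-time).

```python
import cmath


def eigenvalues_2x2(m):
    """Eigenvalues of a 2x2 matrix from its trace and determinant."""
    (a, b), (c, d) = m
    trace, det = a + d, a * d - b * c
    root = cmath.sqrt(trace * trace - 4.0 * det)
    return (trace + root) / 2.0, (trace - root) / 2.0


def is_locally_stable(m):
    """Discrete-time criterion: spectral radius strictly below 1."""
    return max(abs(lam) for lam in eigenvalues_2x2(m)) < 1.0


contracting = [[0.5, 0.1], [0.0, 0.8]]   # eigenvalues 0.5 and 0.8
expanding = [[1.2, 0.0], [0.0, 0.3]]     # eigenvalue 1.2 grows over time
rotating = [[0.0, -0.9], [0.9, 0.0]]     # complex pair with magnitude 0.9
print(is_locally_stable(contracting), is_locally_stable(expanding),
      is_locally_stable(rotating))
```

Evaluating such a check at each visited state gives a per-step stability signal without modifying the agent itself.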

Title: Bridging vision language model (VLM) evaluation gaps with a framework for scalable and cost-effective benchmark generation

Authors: Tim Rädsch, Leon Mayer, Simon Pavicic, A. Emre Kavur, Marcel Knopp, Barış Öztürk, Klaus Maier-Hein, Paul F. Jaeger, Fabian Isensee, Annika Reinke, Lena Maier-Hein
Subjects: cs.CV, cs.AI, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2502.15563
Pdf URL: https://arxiv.org/pdf/2502.15563
Copy Paste: [[2502.15563]] Bridging vision language model (VLM) evaluation gaps with a framework for scalable and cost-effective benchmark generation(https://arxiv.org/abs/2502.15563)
Keywords: generation
Abstract: Reliable evaluation of AI models is critical for scientific progress and practical application. While existing VLM benchmarks provide general insights into model capabilities, their heterogeneous designs and limited focus on a few imaging domains pose significant challenges for both cross-domain performance comparison and targeted domain-specific evaluation. To address this, we propose three key contributions: (1) a framework for the resource-efficient creation of domain-specific VLM benchmarks enabled by task augmentation for creating multiple diverse tasks from a single existing task, (2) the release of new VLM benchmarks for seven domains, created according to the same homogeneous protocol and including 162,946 thoroughly human-validated answers, and (3) an extensive benchmarking of 22 state-of-the-art VLMs on a total of 37,171 tasks, revealing performance variances across domains and tasks, thereby supporting the need for tailored VLM benchmarks. Adoption of our methodology will pave the way for the resource-efficient domain-specific selection of models and guide future research efforts toward addressing core open questions.

Title: Improving the Scaling Laws of Synthetic Data with Deliberate Practice

Authors: Reyhane Askari-Hemmat, Mohammad Pezeshki, Elvis Dohmatob, Florian Bordes, Pietro Astolfi, Melissa Hall, Jakob Verbeek, Michal Drozdzal, Adriana Romero-Soriano
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2502.15588
Pdf URL: https://arxiv.org/pdf/2502.15588
Copy Paste: [[2502.15588]] Improving the Scaling Laws of Synthetic Data with Deliberate Practice(https://arxiv.org/abs/2502.15588)
Keywords: generation
Abstract: Inspired by the principle of deliberate practice in human learning, we propose Deliberate Practice for Synthetic Data Generation (DP), a novel framework that improves sample efficiency through dynamic synthetic data generation. Prior work has shown that scaling synthetic data is inherently challenging, as naively adding new data leads to diminishing returns. To address this, pruning has been identified as a key mechanism for improving scaling, enabling models to focus on the most informative synthetic samples. Rather than generating a large dataset and pruning it afterward, DP efficiently approximates the direct generation of informative samples. We theoretically show how training on challenging, informative examples improves scaling laws and empirically validate that DP achieves better scaling performance with significantly fewer training samples and iterations. On ImageNet-100, DP generates 3.4x fewer samples and requires six times fewer iterations, while on ImageNet-1k, it generates 8x fewer samples with a 30 percent reduction in iterations, all while achieving superior performance compared to prior work.

Title: WorldCraft: Photo-Realistic 3D World Creation and Customization via LLM Agents

Authors: Xinhang Liu, Chi-Keung Tang, Yu-Wing Tai
Subjects: cs.CV, cs.AI, cs.GR
Abstract URL: https://arxiv.org/abs/2502.15601
Pdf URL: https://arxiv.org/pdf/2502.15601
Copy Paste: [[2502.15601]] WorldCraft: Photo-Realistic 3D World Creation and Customization via LLM Agents(https://arxiv.org/abs/2502.15601)
Keywords: generation
Abstract: Constructing photorealistic virtual worlds has applications across various fields, but it often requires the extensive labor of highly trained professionals to operate conventional 3D modeling software. To democratize this process, we introduce WorldCraft, a system where large language model (LLM) agents leverage procedural generation to create indoor and outdoor scenes populated with objects, allowing users to control individual object attributes and the scene layout using intuitive natural language commands. In our framework, a coordinator agent manages the overall process and works with two specialized LLM agents to complete the scene creation: ForgeIt, which integrates an ever-growing manual through auto-verification to enable precise customization of individual objects, and ArrangeIt, which formulates hierarchical optimization problems to achieve a layout that balances ergonomic and aesthetic considerations. Additionally, our pipeline incorporates a trajectory control agent, allowing users to animate the scene and operate the camera through natural language interactions. Our system is also compatible with off-the-shelf deep 3D generators to enrich scene assets. Through evaluations and comparisons with state-of-the-art methods, we demonstrate the versatility of WorldCraft, ranging from single-object customization to intricate, large-scale interior and exterior scene designs. This system empowers non-professionals to bring their creative visions to life.

Title: The Relationship Between Reasoning and Performance in Large Language Models -- o3 (mini) Thinks Harder, Not Longer

Authors: Marthe Ballon, Andres Algaba, Vincent Ginis
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2502.15631
Pdf URL: https://arxiv.org/pdf/2502.15631
Copy Paste: [[2502.15631]] The Relationship Between Reasoning and Performance in Large Language Models -- o3 (mini) Thinks Harder, Not Longer(https://arxiv.org/abs/2502.15631)
Keywords: generation
Abstract: Large language models have demonstrated remarkable progress in mathematical reasoning, leveraging chain-of-thought and test-time compute scaling. However, many open questions remain regarding the interplay between reasoning token usage and accuracy gains. In particular, when comparing models across generations, it is unclear whether improved performance results from longer reasoning chains or more efficient reasoning. We systematically analyze chain-of-thought length across o1-mini and o3-mini variants on the Omni-MATH benchmark, finding that o3-mini (m) achieves superior accuracy without requiring longer reasoning chains than o1-mini. Moreover, we show that accuracy generally declines as reasoning chains grow across all models and compute settings, even when controlling for difficulty of the questions. This accuracy drop is significantly smaller in more proficient models, suggesting that new generations of reasoning models use test-time compute more effectively. Finally, we highlight that while o3-mini (h) achieves a marginal accuracy gain over o3-mini (m), it does so by allocating substantially more reasoning tokens across all problems, even the ones that o3-mini (m) can already solve. These findings provide new insights into the relationship between model capability and reasoning length, with implications for efficiency, scaling, and evaluation methodologies.

Title: VaViM and VaVAM: Autonomous Driving through Video Generative Modeling

Authors: Florent Bartoccioni, Elias Ramzi, Victor Besnier, Shashanka Venkataramanan, Tuan-Hung Vu, Yihong Xu, Loick Chambon, Spyros Gidaris, Serkan Odabas, David Hurych, Renaud Marlet, Alexandre Boulch, Mickael Chen, Éloi Zablocki, Andrei Bursuc, Eduardo Valle, Matthieu Cord
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2502.15672
Pdf URL: https://arxiv.org/pdf/2502.15672
Copy Paste: [[2502.15672]] VaViM and VaVAM: Autonomous Driving through Video Generative Modeling(https://arxiv.org/abs/2502.15672)
Keywords: generative
Abstract: We explore the potential of large-scale generative video models for autonomous driving, introducing an open-source auto-regressive video model (VaViM) and its companion video-action model (VaVAM) to investigate how video pre-training transfers to real-world driving. VaViM is a simple auto-regressive video model that predicts frames using spatio-temporal token sequences. We show that it captures the semantics and dynamics of driving scenes. VaVAM, the video-action model, leverages the learned representations of VaViM to generate driving trajectories through imitation learning. Together, the models form a complete perception-to-action pipeline. We evaluate our models in open- and closed-loop driving scenarios, revealing that video-based pre-training holds promise for autonomous driving. Key insights include the semantic richness of the learned representations, the benefits of scaling for video synthesis, and the complex relationship between model size, data, and safety metrics in closed-loop evaluations. We release code and model weights at this https URL.

Title: One-step Diffusion Models with $f$-Divergence Distribution Matching

Authors: Yilun Xu, Weili Nie, Arash Vahdat
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2502.15681
Pdf URL: https://arxiv.org/pdf/2502.15681
Copy Paste: [[2502.15681]] One-step Diffusion Models with $f$-Divergence Distribution Matching(https://arxiv.org/abs/2502.15681)
Keywords: generation
Abstract: Sampling from diffusion models involves a slow iterative process that hinders their practical deployment, especially for interactive applications. To accelerate generation speed, recent approaches distill a multi-step diffusion model into a single-step student generator via variational score distillation, which matches the distribution of samples generated by the student to the teacher's distribution. However, these approaches use the reverse Kullback-Leibler (KL) divergence for distribution matching, which is known to be mode-seeking. In this paper, we generalize the distribution matching approach using a novel $f$-divergence minimization framework, termed $f$-distill, that covers different divergences with different trade-offs in terms of mode coverage and training variance. We derive the gradient of the $f$-divergence between the teacher and student distributions and show that it is expressed as the product of their score differences and a weighting function determined by their density ratio. This weighting function naturally emphasizes samples with higher density in the teacher distribution when using a less mode-seeking divergence. We observe that the popular variational score distillation approach using the reverse-KL divergence is a special case within our framework. Empirically, we demonstrate that alternative $f$-divergences, such as forward-KL and Jensen-Shannon divergences, outperform the current best variational score distillation methods across image generation tasks. In particular, when using Jensen-Shannon divergence, $f$-distill achieves current state-of-the-art one-step generation performance on ImageNet64 and zero-shot text-to-image generation on MS-COCO. Project page: this https URL
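
Illustration (not from the paper): the divergences named in the abstract are all instances of the general $f$-divergence $D_f(p \| q) = \sum_x q(x) f(p(x)/q(x))$. The sketch below evaluates forward KL, reverse KL, and Jensen-Shannon on two small discrete distributions. Treating `p` as the teacher and `q` as the student is an assumed naming convention, the toy distributions are made up, and the code does not implement the $f$-distill training procedure.

```python
import math


def f_divergence(p, q, f):
    """D_f(p || q) = sum_x q(x) * f(p(x) / q(x)) for strictly positive p and q."""
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))


# Generator functions f(t) for three standard f-divergences.
GENERATORS = {
    "forward_kl": lambda t: t * math.log(t),   # gives KL(p || q)
    "reverse_kl": lambda t: -math.log(t),      # gives KL(q || p)
    "jensen_shannon": lambda t: 0.5 * (
        t * math.log(t) - (t + 1.0) * math.log((t + 1.0) / 2.0)
    ),
}

teacher = [0.5, 0.5]   # p: mass spread over two modes
student = [0.9, 0.1]   # q: mass concentrated on one mode

values = {name: f_divergence(teacher, student, f) for name, f in GENERATORS.items()}
for name, value in values.items():
    print(name, round(value, 4))
```

On this toy pair, the forward-KL value is the larger of the two KL directions, reflecting its stronger penalty when the student neglects a teacher mode.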