2025-08-07

Title: Text2VR: Automated instruction Generation in Virtual Reality using Large language Models for Assembly Task

Authors: Subin Raj Peter
Subjects: cs.CV, cs.HC, cs.MM
Abstract URL: https://arxiv.org/abs/2508.03699
Pdf URL: https://arxiv.org/pdf/2508.03699
Copy Paste: [[2508.03699]] Text2VR: Automated instruction Generation in Virtual Reality using Large language Models for Assembly Task(https://arxiv.org/abs/2508.03699)
Keywords: generation
Abstract: Virtual Reality (VR) has emerged as a powerful tool for workforce training, offering immersive, interactive, and risk-free environments that enhance skill acquisition, decision-making, and confidence. Despite its advantages, developing VR applications for training remains a significant challenge due to the time, expertise, and resources required to create accurate and engaging instructional content. To address these limitations, this paper proposes a novel approach that leverages Large Language Models (LLMs) to automate the generation of virtual instructions from textual input. The system comprises two core components: an LLM module that extracts task-relevant information from the text, and an intelligent module that transforms this information into animated demonstrations and visual cues within a VR environment. The intelligent module receives input from the LLM module and interprets the extracted information. Based on this, an instruction generator creates training content using relevant data from a database. The instruction generator generates the instruction by changing the color of virtual objects and creating animations to illustrate tasks. This approach enhances training effectiveness and reduces development overhead, making VR-based training more scalable and adaptable to evolving industrial needs.

Title: CX-Mind: A Pioneering Multimodal Large Language Model for Interleaved Reasoning in Chest X-ray via Curriculum-Guided Reinforcement Learning

Authors: Wenjie Li, Yujie Zhang, Haoran Sun, Yueqi Li, Fanrui Zhang, Mengzhe Xu, Victoria Borja Clausich, Sade Mellin, Renhao Yang, Chenrun Wang, Jethro Zih-Shuo Wang, Shiyi Yao, Gen Li, Yidong Xu, Hanyu Wang, Yilin Huang, Angela Lin Wang, Chen Shi, Yin Zhang, Jianan Guo, Luqi Yang, Renxuan Li, Yang Xu, Jiawei Liu, Yao Zhang, Lei Liu, Carlos Gutiérrez SanRomán, Lei Wang
Subjects: cs.LG, cs.AI, cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2508.03733
Pdf URL: https://arxiv.org/pdf/2508.03733
Copy Paste: [[2508.03733]] CX-Mind: A Pioneering Multimodal Large Language Model for Interleaved Reasoning in Chest X-ray via Curriculum-Guided Reinforcement Learning(https://arxiv.org/abs/2508.03733)
Keywords: generation, generative
Abstract: Chest X-ray (CXR) imaging is one of the most widely used diagnostic modalities in clinical practice, encompassing a broad spectrum of diagnostic tasks. Recent advancements have seen the extensive application of reasoning-based multimodal large language models (MLLMs) in medical imaging to enhance diagnostic efficiency and interpretability. However, existing multimodal models predominantly rely on "one-time" diagnostic approaches, lacking verifiable supervision of the reasoning process. This leads to challenges in multi-task CXR diagnosis, including lengthy reasoning, sparse rewards, and frequent hallucinations. To address these issues, we propose CX-Mind, the first generative model to achieve interleaved "think-answer" reasoning for CXR tasks, driven by curriculum-based reinforcement learning and verifiable process rewards (CuRL-VPR). Specifically, we constructed an instruction-tuning dataset, CX-Set, comprising 708,473 images and 2,619,148 samples, and generated 42,828 high-quality interleaved reasoning data points supervised by clinical reports. Optimization was conducted in two stages under the Group Relative Policy Optimization framework: initially stabilizing basic reasoning with closed-domain tasks, followed by transfer to open-domain diagnostics, incorporating rule-based conditional process rewards to bypass the need for pretrained reward models. Extensive experimental results demonstrate that CX-Mind significantly outperforms existing medical and general-domain MLLMs in visual understanding, text generation, and spatiotemporal alignment, achieving an average performance improvement of 25.1% over comparable CXR-specific models. On a real-world clinical dataset (Rui-CXR), CX-Mind achieves a mean recall@1 across 14 diseases that substantially surpasses the second-best results, with multi-center expert evaluations further confirming its clinical utility across multiple dimensions.
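
The CX-Mind abstract mentions optimization under the Group Relative Policy Optimization framework. As a rough illustration of the group-relative idea only, the sketch below normalizes each rollout's reward against the other rollouts sampled for the same prompt. The function name, the use of the population standard deviation, and the reward values are assumptions made for this example; the abstract does not describe the paper's actual reward design at this level of detail.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own sampling group.

    Assumption: population standard deviation; implementations differ on this.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rule-based rewards for four rollouts of a single prompt
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
```

Rollouts that score above the group mean receive a positive advantage and those below receive a negative one, so no separately trained value or reward model is needed to obtain a baseline.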

Title: StorySync: Training-Free Subject Consistency in Text-to-Image Generation via Region Harmonization

Authors: Gopalji Gaur, Mohammadreza Zolfaghari, Thomas Brox
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.03735
Pdf URL: https://arxiv.org/pdf/2508.03735
Copy Paste: [[2508.03735]] StorySync: Training-Free Subject Consistency in Text-to-Image Generation via Region Harmonization(https://arxiv.org/abs/2508.03735)
Keywords: generation
Abstract: Generating a coherent sequence of images that tells a visual story, using text-to-image diffusion models, often faces the critical challenge of maintaining subject consistency across all story scenes. Existing approaches, which typically rely on fine-tuning or retraining models, are computationally expensive, time-consuming, and often interfere with the model's pre-existing capabilities. In this paper, we follow a training-free approach and propose an efficient consistent-subject-generation method. This approach works seamlessly with pre-trained diffusion models by introducing masked cross-image attention sharing to dynamically align subject features across a batch of images, and Regional Feature Harmonization to refine visually similar details for improved subject consistency. Experimental results demonstrate that our approach successfully generates visually consistent subjects across a variety of scenarios while maintaining the creative abilities of the diffusion model.

Title: Refine-IQA: Multi-Stage Reinforcement Finetuning for Perceptual Image Quality Assessment

Authors: Ziheng Jia, Jiaying Qian, Zicheng Zhang, Zijian Chen, Xiongkuo Min
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.03763
Pdf URL: https://arxiv.org/pdf/2508.03763
Copy Paste: [[2508.03763]] Refine-IQA: Multi-Stage Reinforcement Finetuning for Perceptual Image Quality Assessment(https://arxiv.org/abs/2508.03763)
Keywords: quality assessment
Abstract: Reinforcement fine-tuning (RFT) is a proliferating paradigm for LMM training. Analogous to high-level reasoning tasks, RFT is similarly applicable to low-level vision domains, including image quality assessment (IQA). Existing RFT-based IQA methods typically use rule-based output rewards to verify the model's rollouts but provide no reward supervision for the "think" process, leaving its correctness and efficacy uncontrolled. Furthermore, these methods typically fine-tune directly on downstream IQA tasks without explicitly enhancing the model's native low-level visual quality perception, which may constrain its performance upper bound. In response to these gaps, we propose the multi-stage RFT IQA framework (Refine-IQA). In Stage-1, we build the Refine-Perception-20K dataset (with 12 main distortions, 20,907 locally-distorted images, and over 55K RFT samples) and design multi-task reward functions to strengthen the model's visual quality perception. In Stage-2, targeting the quality scoring task, we introduce a strategy involving a probability-difference reward for "think" process supervision. The resulting Refine-IQA Series Models achieve outstanding performance on both perception and scoring tasks, and, notably, our paradigm activates a robust "think" (quality interpreting) capability that also attains exceptional results on the corresponding quality interpreting benchmark.

Title: LLM-Prior: A Framework for Knowledge-Driven Prior Elicitation and Aggregation

Authors: Yongchao Huang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.03766
Pdf URL: https://arxiv.org/pdf/2508.03766
Copy Paste: [[2508.03766]] LLM-Prior: A Framework for Knowledge-Driven Prior Elicitation and Aggregation(https://arxiv.org/abs/2508.03766)
Keywords: generative
Abstract: The specification of prior distributions is fundamental in Bayesian inference, yet it remains a significant bottleneck. The prior elicitation process is often a manual, subjective, and unscalable task. We propose a novel framework which leverages Large Language Models (LLMs) to automate and scale this process. We introduce LLMPrior, a principled operator that translates rich, unstructured contexts such as natural language descriptions, data or figures into valid, tractable probability distributions. We formalize this operator by architecturally coupling an LLM with an explicit, tractable generative model, such as a Gaussian Mixture Model (forming an LLM-based Mixture Density Network), ensuring the resulting prior satisfies essential mathematical properties. We further extend this framework to multi-agent systems where Logarithmic Opinion Pooling is employed to aggregate prior distributions induced by decentralized knowledge. We present the federated prior aggregation algorithm, Fed-LLMPrior, for aggregating distributed, context-dependent priors in a manner robust to agent heterogeneity. This work provides the foundation for a new class of tools that can potentially lower the barrier to entry for sophisticated Bayesian modeling.
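
The LLM-Prior abstract refers to Logarithmic Opinion Pooling for aggregating priors. A minimal sketch of that pooling rule follows, under the simplifying assumption that each agent's prior is a single univariate Gaussian (the abstract itself mentions Gaussian mixtures, for which no closed form of this kind exists). In that special case the weighted geometric mean of the densities is again Gaussian, with a precision-weighted mean. The function name and the example numbers are illustrative and not taken from the paper.

```python
def log_pool_gaussians(means, variances, weights):
    """Logarithmic opinion pool of univariate Gaussian priors.

    The pooled density is proportional to prod_i p_i(x) ** w_i, which for
    Gaussians is Gaussian with precision sum_i w_i / var_i.
    """
    total = sum(weights)
    weights = [w / total for w in weights]  # normalize weights to sum to 1
    precision = sum(w / v for w, v in zip(weights, variances))
    mean = sum(w * m / v for w, m, v in zip(weights, means, variances)) / precision
    return mean, 1.0 / precision

# Two hypothetical agents with equal weight and unit variance
pooled_mean, pooled_var = log_pool_gaussians([0.0, 2.0], [1.0, 1.0], [0.5, 0.5])
```

With equal weights and equal variances the pooled mean is the midpoint of the two agents' means, and a more confident agent (smaller variance) pulls the pooled mean toward its own.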

Title: Provably Near-Optimal Distributionally Robust Reinforcement Learning in Online Settings

Authors: Debamita Ghosh, George K. Atia, Yue Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.03768
Pdf URL: https://arxiv.org/pdf/2508.03768
Copy Paste: [[2508.03768]] Provably Near-Optimal Distributionally Robust Reinforcement Learning in Online Settings(https://arxiv.org/abs/2508.03768)
Keywords: generative
Abstract: Reinforcement learning (RL) faces significant challenges in real-world deployments due to the sim-to-real gap, where policies trained in simulators often underperform in practice due to mismatches between training and deployment conditions. Distributionally robust RL addresses this issue by optimizing worst-case performance over an uncertainty set of environments and providing an optimized lower bound on deployment performance. However, existing studies typically assume access to either a generative model or offline datasets with broad coverage of the deployment environment -- assumptions that limit their practicality in unknown environments without prior knowledge. In this work, we study the more realistic and challenging setting of online distributionally robust RL, where the agent interacts only with a single unknown training environment while aiming to optimize its worst-case performance. We focus on general $f$-divergence-based uncertainty sets, including Chi-Square and KL divergence balls, and propose a computationally efficient algorithm with sublinear regret guarantees under minimal assumptions. Furthermore, we establish a minimax lower bound on regret of online learning, demonstrating the near-optimality of our approach. Extensive experiments across diverse environments further confirm the robustness and efficiency of our algorithm, validating our theoretical findings.
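
The distributionally robust RL abstract optimizes worst-case performance over divergence balls, including KL balls. As an illustration of what a worst-case expectation over a KL ball looks like, the sketch below evaluates the standard dual form, sup over lambda > 0 of -lambda * log E_P[exp(-V / lambda)] - lambda * rho, using a plain grid search over lambda. The grid search, the function name, and the toy numbers are assumptions for this example; this is not the algorithm proposed in the paper.

```python
import numpy as np

def kl_robust_value(values, probs, rho, lambdas=None):
    """Approximate inf over Q with KL(Q || P) <= rho of E_Q[V] via its dual.

    Assumes all entries of `probs` are strictly positive.
    """
    values = np.asarray(values, dtype=float)
    probs = np.asarray(probs, dtype=float)
    if lambdas is None:
        lambdas = np.logspace(-3, 3, 2000)  # crude search grid (assumption)
    v_min = values.min()
    shifted = values - v_min  # shift so the exponentials cannot overflow
    duals = [
        v_min - lam * np.log(np.sum(probs * np.exp(-shifted / lam))) - lam * rho
        for lam in lambdas
    ]
    return max(duals)

# Toy next-state values under a nominal two-outcome model
nominal = 0.5 * 1.0 + 0.5 * 0.0
robust_small = kl_robust_value([1.0, 0.0], [0.5, 0.5], rho=0.1)
robust_large = kl_robust_value([1.0, 0.0], [0.5, 0.5], rho=0.5)
```

The robust value lies between the smallest outcome and the nominal expectation, and it shrinks as the radius rho grows, which reflects the optimized lower bound on deployment performance described in the abstract.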

Title: 4D-PreNet: A Unified Preprocessing Framework for 4D-STEM Data Analysis

Authors: Mingyu Liu (1), Zian Mao (1 and 2), Zhu Liu (1 and 3), Haoran Zhang (1 and 2), Jintao Guo (1), Xiaoya He (1 and 2), Xi Huang (1), Shufen Chu (1), Chun Cheng (1), Jun Ding (4), Yujun Xie (1) ((1) Global Institute of Future Technology of Shanghai Jiao Tong University, (2) University of Michigan Shanghai Jiao Tong University Joint Institute, (3) School of Chemistry and Chemical Engineering of Shanghai Jiao Tong University, (4) Center for Alloy Innovation and Design State Key Laboratory for Mechanical Behavior of Materials of Xian Jiaotong University)
Subjects: cs.CV, cond-mat.mtrl-sci, cs.AI
Abstract URL: https://arxiv.org/abs/2508.03775
Pdf URL: https://arxiv.org/pdf/2508.03775
Copy Paste: [[2508.03775]] 4D-PreNet: A Unified Preprocessing Framework for 4D-STEM Data Analysis(https://arxiv.org/abs/2508.03775)
Keywords: restoration
Abstract: Automated experimentation with real-time data analysis in scanning transmission electron microscopy (STEM) often requires an end-to-end framework. Four-dimensional scanning transmission electron microscopy (4D-STEM) with high-throughput data acquisition has been constrained by the critical bottleneck resulting from data preprocessing. Pervasive noise, beam center drift, and elliptical distortions during high-throughput acquisition inevitably corrupt diffraction patterns, systematically biasing quantitative measurements. Yet, conventional correction algorithms are often material-specific and fail to provide a robust, generalizable solution. In this work, we present 4D-PreNet, an end-to-end deep-learning pipeline that integrates attention-enhanced U-Net and ResNet architectures to simultaneously perform denoising, center correction, and elliptical distortion calibration. The network is trained on large, simulated datasets encompassing a wide range of noise levels, drift magnitudes, and distortion types, enabling it to generalize effectively to experimental data acquired under varying conditions. Quantitative evaluations demonstrate that our pipeline reduces mean squared error by up to 50% during denoising and achieves sub-pixel center localization in the center detection task, with average errors below 0.04 pixels. The outputs are benchmarked against traditional algorithms, highlighting improvements in both noise suppression and restoration of diffraction patterns, thereby facilitating high-throughput, reliable 4D-STEM real-time analysis for automated characterization.

Title: HPSv3: Towards Wide-Spectrum Human Preference Score

Authors: Yuhang Ma, Xiaoshi Wu, Keqiang Sun, Hongsheng Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.03789
Pdf URL: https://arxiv.org/pdf/2508.03789
Copy Paste: [[2508.03789]] HPSv3: Towards Wide-Spectrum Human Preference Score(https://arxiv.org/abs/2508.03789)
Keywords: generation, generative
Abstract: Evaluating text-to-image generation models requires alignment with human perception, yet existing human-centric metrics are constrained by limited data coverage, suboptimal feature extraction, and inefficient loss functions. To address these challenges, we introduce Human Preference Score v3 (HPSv3). (1) We release HPDv3, the first wide-spectrum human preference dataset integrating 1.08M text-image pairs and 1.17M annotated pairwise comparisons from state-of-the-art generative models and low to high-quality real-world images. (2) We introduce a VLM-based preference model trained using an uncertainty-aware ranking loss for fine-grained ranking. Besides, we propose Chain-of-Human-Preference (CoHP), an iterative image refinement method that enhances quality without extra data, using HPSv3 to select the best image at each step. Extensive experiments demonstrate that HPSv3 serves as a robust metric for wide-spectrum image evaluation, and CoHP offers an efficient and human-aligned approach to improve image generation quality. The code and dataset are available at the HPSv3 Homepage.

Title: Point-Based Shape Representation Generation with a Correspondence-Preserving Diffusion Model

Authors: Shen Zhu, Yinzhu Jin, Ifrah Zawar, P. Thomas Fletcher
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2508.03925
Pdf URL: https://arxiv.org/pdf/2508.03925
Copy Paste: [[2508.03925]] Point-Based Shape Representation Generation with a Correspondence-Preserving Diffusion Model(https://arxiv.org/abs/2508.03925)
Keywords: generation, generative
Abstract: We propose a diffusion model designed to generate point-based shape representations with correspondences. Traditional statistical shape models have considered point correspondences extensively, but current deep learning methods do not take them into account, focusing on unordered point clouds instead. Current deep generative models for point clouds do not address generating shapes with point correspondences between generated shapes. This work aims to formulate a diffusion model that is capable of generating realistic point-based shape representations, which preserve point correspondences that are present in the training data. Using shape representation data with correspondences derived from Open Access Series of Imaging Studies 3 (OASIS-3), we demonstrate that our correspondence-preserving model effectively generates point-based hippocampal shape representations that are highly realistic compared to existing methods. We further demonstrate the applications of our generative model by downstream tasks, such as conditional generation of healthy and AD subjects and predicting morphological changes of disease progression by counterfactual generation.

Title: Next Generation Equation-Free Multiscale Modelling of Crowd Dynamics via Machine Learning

Authors: Hector Vargas Alvarez, Dimitrios G. Patsatzis, Lucia Russo, Ioannis Kevrekidis, Constantinos Siettos
Subjects: cs.LG, math.DS, math.NA
Abstract URL: https://arxiv.org/abs/2508.03926
Pdf URL: https://arxiv.org/pdf/2508.03926
Copy Paste: [[2508.03926]] Next Generation Equation-Free Multiscale Modelling of Crowd Dynamics via Machine Learning(https://arxiv.org/abs/2508.03926)
Keywords: generation
Abstract: Bridging the microscopic and the macroscopic modelling scales in crowd dynamics constitutes an important, open challenge for systematic numerical analysis, optimization, and control. We propose a combined manifold and machine learning approach to learn the discrete evolution operator for the emergent crowd dynamics in latent spaces from high-fidelity agent-based simulations. The proposed framework builds upon our previous works on next-generation Equation-free algorithms on learning surrogate models for high-dimensional and multiscale systems. Our approach is a four-stage one, explicitly conserving the mass of the reconstructed dynamics in the high-dimensional space. In the first step, we derive continuous macroscopic fields (densities) from discrete microscopic data (pedestrians' positions) using KDE. In the second step, based on manifold learning, we construct a map from the macroscopic ambient space into the latent space parametrized by a few coordinates based on POD of the corresponding density distribution. The third step involves learning reduced-order surrogate ROMs in the latent space using machine learning techniques, particularly LSTM networks and MVARs. Finally, we reconstruct the crowd dynamics in the high-dimensional space in terms of macroscopic density profiles. We demonstrate that the POD reconstruction of the density distribution via SVD conserves the mass. With this "embed->learn in latent space->lift back to the ambient space" pipeline, we create an effective solution operator of the unavailable macroscopic PDE for the density evolution. For our illustrations, we use the Social Force Model to generate data in a corridor with an obstacle, imposing periodic boundary conditions. The numerical results demonstrate high accuracy, robustness, and generalizability, thus allowing for fast and accurate modelling/simulation of crowd dynamics from agent-based simulations.
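
The crowd-dynamics abstract describes deriving densities with KDE, embedding them with POD, and lifting back while conserving mass. The sketch below illustrates only the embed and lift steps on synthetic one-dimensional data, skipping the latent-space learning stage. It assumes the snapshot mean is subtracted before the SVD, in which case every retained POD mode integrates to zero and the reconstruction keeps the mass of the mean field. The grid, bandwidth, drift model, and number of modes are all illustrative choices, and whether the paper centers its data in this way is not stated in the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 200)
dx = grid[1] - grid[0]

def kde_density(positions, bandwidth=0.05):
    """Gaussian KDE on the grid, renormalized so the discrete mass is 1."""
    diffs = (grid[:, None] - positions[None, :]) / bandwidth
    density = np.exp(-0.5 * diffs**2).sum(axis=1)
    return density / (density.sum() * dx)

# Synthetic "pedestrian" positions drifting along a 1D corridor
snapshots = np.stack(
    [
        kde_density(np.clip(rng.normal(0.3 + 0.02 * t, 0.1, size=50), 0.0, 1.0))
        for t in range(20)
    ],
    axis=1,
)  # shape: (grid points, time steps)

# POD via SVD of the mean-centered snapshot matrix
mean_field = snapshots.mean(axis=1, keepdims=True)
U, S, Vt = np.linalg.svd(snapshots - mean_field, full_matrices=False)

r = 3                                                # retained POD modes
latent = U[:, :r].T @ (snapshots - mean_field)       # embed
reconstructed = mean_field + U[:, :r] @ latent       # lift back
masses = reconstructed.sum(axis=0) * dx              # discrete mass per snapshot
```

Each entry of `masses` stays at 1 up to floating-point error even though only three modes are kept, because the centered snapshots sum to zero over the grid and so do the modes that span them.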

Title: RAVID: Retrieval-Augmented Visual Detection: A Knowledge-Driven Approach for AI-Generated Image Identification

Authors: Mamadou Keita, Wassim Hamidouche, Hessen Bougueffa Eutamene, Abdelmalik Taleb-Ahmed, Abdenour Hadid
Subjects: cs.CV, cs.CR, cs.IR
Abstract URL: https://arxiv.org/abs/2508.03967
Pdf URL: https://arxiv.org/pdf/2508.03967
Copy Paste: [[2508.03967]] RAVID: Retrieval-Augmented Visual Detection: A Knowledge-Driven Approach for AI-Generated Image Identification(https://arxiv.org/abs/2508.03967)
Keywords: generation, generative
Abstract: In this paper, we introduce RAVID, the first framework for AI-generated image detection that leverages visual retrieval-augmented generation (RAG). While RAG methods have shown promise in mitigating factual inaccuracies in foundation models, they have primarily focused on text, leaving visual knowledge underexplored. Meanwhile, existing detection methods, which struggle with generalization and robustness, often rely on low-level artifacts and model-specific features, limiting their adaptability. To address this, RAVID dynamically retrieves relevant images to enhance detection. Our approach utilizes a fine-tuned CLIP image encoder, RAVID CLIP, enhanced with category-related prompts to improve representation learning. We further integrate a vision-language model (VLM) to fuse retrieved images with the query, enriching the input and improving accuracy. Given a query image, RAVID generates an embedding using RAVID CLIP, retrieves the most relevant images from a database, and combines these with the query image to form an enriched input for a VLM (e.g., Qwen-VL or Openflamingo). Experiments on the UniversalFakeDetect benchmark, which covers 19 generative models, show that RAVID achieves state-of-the-art performance with an average accuracy of 93.85%. RAVID also outperforms traditional methods in terms of robustness, maintaining high accuracy even under image degradations such as Gaussian blur and JPEG compression. Specifically, RAVID achieves an average accuracy of 80.27% under degradation conditions, compared to 63.44% for the state-of-the-art model C2P-CLIP, demonstrating consistent improvements in both Gaussian blur and JPEG compression scenarios. The code will be publicly available upon acceptance.
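
The RAVID abstract describes embedding a query image, retrieving the most relevant images from a database, and passing them on to a VLM. The retrieval step can be sketched as a cosine-similarity top-k lookup. In the sketch below, random vectors stand in for image embeddings, and the function name, embedding size, and database size are hypothetical. The fine-tuned encoder and the VLM fusion step are outside the scope of this example.

```python
import numpy as np

def top_k_by_cosine(query, database, k=3):
    """Return indices and scores of the k database rows most similar to query."""
    q = query / np.linalg.norm(query)
    db = database / np.linalg.norm(database, axis=1, keepdims=True)
    scores = db @ q                      # cosine similarity to every row
    order = np.argsort(-scores)[:k]      # highest similarity first
    return order, scores[order]

rng = np.random.default_rng(0)
database = rng.normal(size=(100, 16))                # stand-in image embeddings
query = database[42] + 0.01 * rng.normal(size=16)    # near-duplicate of row 42
indices, scores = top_k_by_cosine(query, database, k=3)
```

For databases much larger than this toy example, an approximate nearest-neighbor index would typically replace the exhaustive matrix product.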

Title: Dynamic User-controllable Privacy-preserving Few-shot Sensing Framework

Authors: Ajesh Koyatan Chathoth, Shuhao Yu, Stephen Lee
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.03989
Pdf URL: https://arxiv.org/pdf/2508.03989
Copy Paste: [[2508.03989]] Dynamic User-controllable Privacy-preserving Few-shot Sensing Framework(https://arxiv.org/abs/2508.03989)
Keywords: generation
Abstract: User-controllable privacy is important in modern sensing systems, as privacy preferences can vary significantly from person to person and may evolve over time. This is especially relevant in devices equipped with Inertial Measurement Unit (IMU) sensors, such as smartphones and wearables, which continuously collect rich time-series data that can inadvertently expose sensitive user behaviors. While prior work has proposed privacy-preserving methods for sensor data, most rely on static, predefined privacy labels or require large quantities of private training data, limiting their adaptability and user agency. In this work, we introduce PrivCLIP, a dynamic, user-controllable, few-shot privacy-preserving sensing framework. PrivCLIP allows users to specify and modify their privacy preferences by categorizing activities as sensitive (black-listed), non-sensitive (white-listed), or neutral (gray-listed). Leveraging a multimodal contrastive learning approach, PrivCLIP aligns IMU sensor data with natural language activity descriptions in a shared embedding space, enabling few-shot detection of sensitive activities. When a privacy-sensitive activity is identified, the system uses a language-guided activity sanitizer and a motion generation module (IMU-GPT) to transform the original data into a privacy-compliant version that semantically resembles a non-sensitive activity. We evaluate PrivCLIP on multiple human activity recognition datasets and demonstrate that it significantly outperforms baseline methods in terms of both privacy protection and data utility.

Title: CAD-Judge: Toward Efficient Morphological Grading and Verification for Text-to-CAD Generation

Authors: Zheyuan Zhou, Jiayi Han, Liang Du, Naiyu Fang, Lemiao Qiu, Shuyou Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04002
Pdf URL: https://arxiv.org/pdf/2508.04002
Copy Paste: [[2508.04002]] CAD-Judge: Toward Efficient Morphological Grading and Verification for Text-to-CAD Generation(https://arxiv.org/abs/2508.04002)
Keywords: generation, generative
Abstract: Computer-Aided Design (CAD) models are widely used across industrial design, simulation, and manufacturing processes. Text-to-CAD systems aim to generate editable, general-purpose CAD models from textual descriptions, significantly reducing the complexity and entry barrier associated with traditional CAD workflows. However, rendering CAD models can be slow, and deploying VLMs to review CAD models can be expensive and may introduce reward hacking that degrades the systems. To address these challenges, we propose CAD-Judge, a novel, verifiable reward system for efficient and effective CAD preference grading and grammatical validation. We adopt the Compiler-as-a-Judge Module (CJM) as a fast, direct reward signal, optimizing model alignment by maximizing generative utility through prospect theory. To further improve the robustness of Text-to-CAD in the testing phase, we introduce a simple yet effective agentic CAD generation approach and adopt the Compiler-as-a-Review Module (CRM), which efficiently verifies the generated CAD models, enabling the system to refine them accordingly. Extensive experiments on challenging CAD datasets demonstrate that our method achieves state-of-the-art performance while maintaining superior efficiency.

Title: $\text{S}^2$Q-VDiT: Accurate Quantized Video Diffusion Transformer with Salient Data and Sparse Token Distillation

Authors: Weilun Feng, Haotong Qin, Chuanguang Yang, Xiangqi Li, Han Yang, Yuqi Li, Zhulin An, Libo Huang, Michele Magno, Yongjun Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04016
Pdf URL: https://arxiv.org/pdf/2508.04016
Copy Paste: [[2508.04016]] $\text{S}^2$Q-VDiT: Accurate Quantized Video Diffusion Transformer with Salient Data and Sparse Token Distillation(https://arxiv.org/abs/2508.04016)
Keywords: generation
Abstract: Diffusion transformers have emerged as the mainstream paradigm for video generation models. However, the use of up to billions of parameters incurs significant computational costs. Quantization offers a promising solution by reducing memory usage and accelerating inference. Nonetheless, we observe that the joint modeling of spatial and temporal information in video diffusion models (V-DMs) leads to extremely long token sequences, which introduces high calibration variance and learning challenges. To address these issues, we propose $\text{S}^2$Q-VDiT, a post-training quantization framework for V-DMs that leverages Salient data and Sparse token distillation. During the calibration phase, we identify that quantization performance is highly sensitive to the choice of calibration data. To mitigate this, we introduce Hessian-aware Salient Data Selection, which constructs high-quality calibration datasets by considering both diffusion and quantization characteristics unique to V-DMs. To tackle the learning challenges, we further analyze the sparse attention patterns inherent in V-DMs. Based on this observation, we propose Attention-guided Sparse Token Distillation, which exploits token-wise attention distributions to emphasize tokens that are more influential to the model's output. Under W4A6 quantization, $\text{S}^2$Q-VDiT achieves lossless performance while delivering $3.9\times$ model compression and $1.3\times$ inference acceleration. Code will be available at this https URL.
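As a rough illustration of what a W4A6 setting means (4-bit weights, 6-bit activations), the sketch below applies plain symmetric uniform quantization to short lists of values. It is a generic textbook quantizer, not the paper's calibration or distillation procedure; the helper name `fake_quantize` and the sample values are assumptions made for illustration.

```python
def fake_quantize(values, num_bits):
    """Symmetric uniform quantization followed by dequantization.

    Hypothetical helper for illustration only; it is not taken from the paper.
    """
    qmax = 2 ** (num_bits - 1) - 1            # 7 for 4 bits, 31 for 6 bits
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return list(values)                   # nothing to scale
    scale = max_abs / qmax                    # one scale for the whole list
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return [q * scale for q in quantized]


weights = [0.70, -0.35, 0.10, -0.02]          # made-up values
activations = [1.24, -0.40, 0.05, 0.62]       # made-up values

w4 = fake_quantize(weights, num_bits=4)       # "W4": 4-bit weights
a6 = fake_quantize(activations, num_bits=6)   # "A6": 6-bit activations

# The rounding error of each value is at most half a quantization step.
w_err = max(abs(a - b) for a, b in zip(weights, w4))
a_err = max(abs(a - b) for a, b in zip(activations, a6))
```

With fewer bits the step size grows, so `w_err` is bounded by a larger value than `a_err`; this is the accuracy cost that post-training quantization methods try to recover.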

Title: SPJFNet: Self-Mining Prior-Guided Joint Frequency Enhancement for Ultra-Efficient Dark Image Restoration

Authors: Tongshun Zhang, Pingling Liu, Zijian Zhang, Qiuzhan Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04041
Pdf URL: https://arxiv.org/pdf/2508.04041
Copy Paste: [[2508.04041]] SPJFNet: Self-Mining Prior-Guided Joint Frequency Enhancement for Ultra-Efficient Dark Image Restoration(https://arxiv.org/abs/2508.04041)
Keywords: restoration
Abstract: Current dark image restoration methods suffer from severe efficiency bottlenecks, primarily stemming from: (1) computational burden and error correction costs associated with reliance on external priors (manual or cross-modal); (2) redundant operations in complex multi-stage enhancement pipelines; and (3) indiscriminate processing across frequency components in frequency-domain methods, leading to excessive global computational demands. To address these challenges, we propose an Efficient Self-Mining Prior-Guided Joint Frequency Enhancement Network (SPJFNet). Specifically, we first introduce a Self-Mining Guidance Module (SMGM) that generates lightweight endogenous guidance directly from the network, eliminating dependence on external priors and thereby bypassing error correction overhead while improving inference speed. Second, through meticulous analysis of different frequency domain characteristics, we reconstruct and compress multi-level operation chains into a single efficient operation via lossless wavelet decomposition and joint Fourier-based advantageous frequency enhancement, significantly reducing parameters. Building upon this foundation, we propose a Dual-Frequency Guidance Framework (DFGF) that strategically deploys specialized high/low frequency branches (wavelet-domain high-frequency enhancement and Fourier-domain low-frequency restoration), decoupling frequency processing to substantially reduce computational complexity. Rigorous evaluation across multiple benchmarks demonstrates that SPJFNet not only surpasses state-of-the-art performance but also achieves significant efficiency improvements, substantially reducing model complexity and computational overhead. Code is available at this https URL.
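The "lossless wavelet decomposition" mentioned in the abstract can be illustrated with a one-level Haar transform, which splits a signal into a low-frequency (average) band and a high-frequency (difference) band and reconstructs it exactly. The abstract does not say which wavelet SPJFNet uses, so Haar is an assumption here, and the function names and sample row are hypothetical.

```python
def haar_decompose(signal):
    """One level of the unnormalized Haar transform on an even-length list."""
    low = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return low, high


def haar_reconstruct(low, high):
    """Invert haar_decompose: each (average, difference) pair gives two samples."""
    signal = []
    for avg, diff in zip(low, high):
        signal.extend([avg + diff, avg - diff])
    return signal


row = [8.0, 6.0, 2.0, 4.0, 5.0, 5.0, 1.0, 3.0]   # made-up pixel intensities
low, high = haar_decompose(row)                   # low- and high-frequency bands
restored = haar_reconstruct(low, high)            # equals `row` exactly
```

Because the two bands together carry all the information of the input, separate branches can process them independently and the result can still be merged without loss.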

Title: VisualTrans: A Benchmark for Real-World Visual Transformation Reasoning

Authors: Yuheng Ji, Yipu Wang, Yuyang Liu, Xiaoshuai Hao, Yue Liu, Yuting Zhao, Huaihai Lyu, Xiaolong Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04043
Pdf URL: https://arxiv.org/pdf/2508.04043
Copy Paste: [[2508.04043]] VisualTrans: A Benchmark for Real-World Visual Transformation Reasoning(https://arxiv.org/abs/2508.04043)
Keywords: generation
Abstract: Visual transformation reasoning (VTR) is a vital cognitive capability that empowers intelligent agents to understand dynamic scenes, model causal relationships, and predict future states, thereby guiding actions and laying the foundation for advanced intelligent systems. However, existing benchmarks suffer from a sim-to-real gap, limited task complexity, and incomplete reasoning coverage, limiting their practical use in real-world scenarios. To address these limitations, we introduce VisualTrans, the first comprehensive benchmark specifically designed for VTR in real-world human-object interaction scenarios. VisualTrans encompasses 12 semantically diverse manipulation tasks and systematically evaluates three essential reasoning dimensions - spatial, procedural, and quantitative - through 6 well-defined subtask types. The benchmark features 472 high-quality question-answer pairs in various formats, including multiple-choice, open-ended counting, and target enumeration. We introduce a scalable data construction pipeline built upon first-person manipulation videos, which integrates task selection, image pair extraction, automated metadata annotation with large multimodal models, and structured question generation. Human verification ensures the final benchmark is both high-quality and interpretable. Evaluations of various state-of-the-art vision-language models show strong performance in static spatial tasks. However, they reveal notable shortcomings in dynamic, multi-step reasoning scenarios, particularly in areas like intermediate state recognition and transformation sequence planning. These findings highlight fundamental weaknesses in temporal modeling and causal reasoning, providing clear directions for future research aimed at developing more capable and generalizable VTR systems. The dataset and code are available at this https URL.

Title: Motion is the Choreographer: Learning Latent Pose Dynamics for Seamless Sign Language Generation

Authors: Jiayi He, Xu Wang, Shengeng Tang, Yaxiong Wang, Lechao Cheng, Dan Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04049
Pdf URL: https://arxiv.org/pdf/2508.04049
Copy Paste: [[2508.04049]] Motion is the Choreographer: Learning Latent Pose Dynamics for Seamless Sign Language Generation(https://arxiv.org/abs/2508.04049)
Keywords: generation
Abstract: Sign language video generation requires producing natural signing motions with realistic appearances under precise semantic control, yet faces two critical challenges: excessive signer-specific data requirements and poor generalization. We propose a new paradigm for sign language video generation that decouples motion semantics from signer identity through a two-phase synthesis framework. First, we construct a signer-independent multimodal motion lexicon, where each gloss is stored as identity-agnostic pose, gesture, and 3D mesh sequences, requiring only one recording per sign. This compact representation enables our second key innovation: a discrete-to-continuous motion synthesis stage that transforms retrieved gloss sequences into temporally coherent motion trajectories, followed by identity-aware neural rendering to produce photorealistic videos of arbitrary signers. Unlike prior work constrained by signer-specific datasets, our method treats motion as a first-class citizen: the learned latent pose dynamics serve as a portable "choreography layer" that can be visually realized through different human appearances. Extensive experiments demonstrate that disentangling motion from identity is not just viable but advantageous - enabling both high-quality synthesis and unprecedented flexibility in signer personalization.

Title: DOMR: Establishing Cross-View Segmentation via Dense Object Matching

Authors: Jitong Liao, Yulu Gao, Shaofei Huang, Jialin Gao, Jie Lei, Ronghua Liang, Si Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04050
Pdf URL: https://arxiv.org/pdf/2508.04050
Copy Paste: [[2508.04050]] DOMR: Establishing Cross-View Segmentation via Dense Object Matching(https://arxiv.org/abs/2508.04050)
Keywords: generation
Abstract: Cross-view object correspondence involves matching objects between egocentric (first-person) and exocentric (third-person) views. It is a critical yet challenging task for visual understanding. In this work, we propose the Dense Object Matching and Refinement (DOMR) framework to establish dense object correspondences across views. The framework centers around the Dense Object Matcher (DOM) module, which jointly models multiple objects. Unlike methods that directly match individual object masks to image features, DOM leverages both positional and semantic relationships among objects to find correspondences. DOM integrates a proposal generation module with a dense matching module that jointly encodes visual, spatial, and semantic cues, explicitly constructing inter-object relationships to achieve dense matching among objects. Furthermore, we combine DOM with a mask refinement head designed to improve the completeness and accuracy of the predicted masks, forming the complete DOMR framework. Extensive evaluations on the Ego-Exo4D benchmark demonstrate that our approach achieves state-of-the-art performance with a mean IoU of 49.7% on Ego$\to$Exo and 55.2% on Exo$\to$Ego. These results outperform those of previous methods by 5.8% and 4.3%, respectively, validating the effectiveness of our integrated approach for cross-view understanding.
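The mean IoU figures quoted in the abstract average the intersection-over-union between predicted and ground-truth masks. A minimal sketch of that metric follows; the masks are made-up 2x2 examples, the helper names are hypothetical, and the benchmark's actual evaluation protocol may differ in details such as how empty masks are handled.

```python
def mask_iou(pred, target):
    """IoU between two binary masks given as equally sized nested lists of 0/1."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            intersection += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks are treated as a perfect match (a convention chosen here).
    return intersection / union if union else 1.0


def mean_iou(pairs):
    """Average IoU over (prediction, ground truth) mask pairs."""
    return sum(mask_iou(p, t) for p, t in pairs) / len(pairs)


pairs = [
    ([[1, 1], [0, 0]], [[1, 0], [0, 0]]),   # intersection 1, union 2
    ([[0, 1], [0, 1]], [[0, 1], [0, 1]]),   # identical masks
]
score = mean_iou(pairs)
```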

Title: Uni-DocDiff: A Unified Document Restoration Model Based on Diffusion

Authors: Fangmin Zhao, Weichao Zeng, Zhenhang Li, Dongbao Yang, Binbin Li, Xiaojun Bi, Yu Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04055
Pdf URL: https://arxiv.org/pdf/2508.04055
Copy Paste: [[2508.04055]] Uni-DocDiff: A Unified Document Restoration Model Based on Diffusion(https://arxiv.org/abs/2508.04055)
Keywords: restoration
Abstract: Removing various degradations from damaged documents greatly benefits digitization, downstream document analysis, and readability. Previous methods often treat each restoration task independently with dedicated models, leading to a cumbersome and highly complex document processing system. Although recent studies attempt to unify multiple tasks, they often suffer from limited scalability due to handcrafted prompts and heavy preprocessing, and fail to fully exploit inter-task synergy within a shared architecture. To address the aforementioned challenges, we propose Uni-DocDiff, a Unified and highly scalable Document restoration model based on Diffusion. Uni-DocDiff develops a learnable task prompt design, ensuring exceptional scalability across diverse tasks. To further enhance its multi-task capabilities and address potential task interference, we devise a novel Prior Pool, a simple yet comprehensive mechanism that combines both local high-frequency features and global low-frequency features. Additionally, we design the Prior Fusion Module (PFM), which enables the model to adaptively select the most relevant prior information for each specific task. Extensive experiments show that the versatile Uni-DocDiff achieves performance comparable or even superior to that of task-specific expert models, while retaining the task scalability needed for seamless adaptation to new tasks.

Title: Bridging Diffusion Models and 3D Representations: A 3D Consistent Super-Resolution Framework

Authors: Yi-Ting Chen, Ting-Hsuan Liao, Pengsheng Guo, Alexander Schwing, Jia-Bin Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04090
Pdf URL: https://arxiv.org/pdf/2508.04090
Copy Paste: [[2508.04090]] Bridging Diffusion Models and 3D Representations: A 3D Consistent Super-Resolution Framework(https://arxiv.org/abs/2508.04090)
Keywords: super-resolution
Abstract: We propose 3D Super Resolution (3DSR), a novel 3D Gaussian-splatting-based super-resolution framework that leverages off-the-shelf diffusion-based 2D super-resolution models. 3DSR encourages 3D consistency across views via the use of an explicit 3D Gaussian-splatting-based scene representation. This makes the proposed 3DSR different from prior work, such as image upsampling or the use of video super-resolution, which either don't consider 3D consistency or aim to incorporate 3D consistency implicitly. Notably, our method enhances visual quality without additional fine-tuning, ensuring spatial coherence within the reconstructed scene. We evaluate 3DSR on MipNeRF360 and LLFF data, demonstrating that it produces high-resolution results that are visually compelling, while maintaining structural consistency in 3D reconstructions. Code will be released.

Title: Model Inversion Attacks on Vision-Language Models: Do They Leak What They Learn?

Authors: Ngoc-Bao Nguyen, Sy-Tuyen Ho, Koh Jun Hao, Ngai-Man Cheung
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.04097
Pdf URL: https://arxiv.org/pdf/2508.04097
Copy Paste: [[2508.04097]] Model Inversion Attacks on Vision-Language Models: Do They Leak What They Learn?(https://arxiv.org/abs/2508.04097)
Keywords: generative
Abstract: Model inversion (MI) attacks pose significant privacy risks by reconstructing private training data from trained neural networks. While prior works have focused on conventional unimodal DNNs, the vulnerability of vision-language models (VLMs) remains underexplored. In this paper, we conduct the first study to understand VLMs' vulnerability in leaking private visual training data. To account for VLMs' token-based generative nature, we propose a suite of novel token-based and sequence-based model inversion strategies. Particularly, we propose Token-based Model Inversion (TMI), Convergent Token-based Model Inversion (TMI-C), Sequence-based Model Inversion (SMI), and Sequence-based Model Inversion with Adaptive Token Weighting (SMI-AW). Through extensive experiments and a user study on three state-of-the-art VLMs and multiple datasets, we demonstrate, for the first time, that VLMs are susceptible to training data leakage. The experiments show that our proposed sequence-based methods, particularly SMI-AW combined with a logit-maximization loss based on vocabulary representation, can achieve competitive reconstruction and outperform token-based methods in attack accuracy and visual similarity. Importantly, human evaluation of the reconstructed images yields an attack accuracy of 75.31%, underscoring the severity of model inversion threats in VLMs. Notably, we also demonstrate inversion attacks on publicly released VLMs. Our study reveals the privacy vulnerability of VLMs as they become increasingly popular across many applications such as healthcare and finance.

Title: Unlocking the Potential of MLLMs in Referring Expression Segmentation via a Light-weight Mask Decode

Authors: Jingchao Wang, Zhijian Wu, Dingjiang Huang, Yefeng Zheng, Hong Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.04107
Pdf URL: https://arxiv.org/pdf/2508.04107
Copy Paste: [[2508.04107]] Unlocking the Potential of MLLMs in Referring Expression Segmentation via a Light-weight Mask Decode(https://arxiv.org/abs/2508.04107)
Keywords: generation
Abstract: Reference Expression Segmentation (RES) aims to segment image regions specified by referring expressions and has become popular with the rise of multimodal large models (MLLMs). While MLLMs excel in semantic understanding, their token-generation paradigm struggles with pixel-level dense prediction. Existing RES methods either couple MLLMs with the parameter-heavy Segment Anything Model (SAM) with 632M network parameters or adopt SAM-free lightweight pipelines that sacrifice accuracy. To address the trade-off between performance and cost, we specifically propose MLLMSeg, a novel framework that fully exploits the inherent visual detail features encoded in the MLLM vision encoder without introducing an extra visual encoder. Besides, we propose a detail-enhanced and semantic-consistent feature fusion module (DSFF) that fully integrates the detail-related visual feature with the semantic-related feature output by the large language model (LLM) of MLLM. Finally, we establish a light-weight mask decoder with only 34M network parameters that optimally leverages detailed spatial features from the visual encoder and semantic features from the LLM to achieve precise mask prediction. Extensive experiments demonstrate that our method generally surpasses both SAM-based and SAM-free competitors, striking a better balance between performance and cost. Code is available at this https URL.

Title: Conditional Latent Diffusion Models for Zero-Shot Instance Segmentation

Authors: Maximilian Ulmer, Wout Boerdijk, Rudolph Triebel, Maximilian Durner
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04122
Pdf URL: https://arxiv.org/pdf/2508.04122
Copy Paste: [[2508.04122]] Conditional Latent Diffusion Models for Zero-Shot Instance Segmentation(https://arxiv.org/abs/2508.04122)
Keywords: generative
Abstract: This paper presents OC-DiT, a novel class of diffusion models designed for object-centric prediction, and applies it to zero-shot instance segmentation. We propose a conditional latent diffusion framework that generates instance masks by conditioning the generative process on object templates and image features within the diffusion model's latent space. This allows our model to effectively disentangle object instances through the diffusion process, which is guided by visual object descriptors and localized image cues. Specifically, we introduce two model variants: a coarse model for generating initial object instance proposals, and a refinement model that refines all proposals in parallel. We train these models on a newly created, large-scale synthetic dataset comprising thousands of high-quality object meshes. Remarkably, our model achieves state-of-the-art performance on multiple challenging real-world benchmarks, without requiring any retraining on target data. Through comprehensive ablation studies, we demonstrate the potential of diffusion models for instance segmentation tasks.

Title: COPO: Consistency-Aware Policy Optimization

Authors: Jinghang Han, Jiawei Chen, Hang Shao, Hao Ma, Mingcheng Li, Xintian Shen, Lihao Zheng, Wei Chen, Tao Wei, Lihua Zhang
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2508.04138
Pdf URL: https://arxiv.org/pdf/2508.04138
Copy Paste: [[2508.04138]] COPO: Consistency-Aware Policy Optimization(https://arxiv.org/abs/2508.04138)
Keywords: generation
Abstract: Reinforcement learning has significantly enhanced the reasoning capabilities of Large Language Models (LLMs) in complex problem-solving tasks. Recently, the introduction of DeepSeek R1 has inspired a surge of interest in leveraging rule-based rewards as a low-cost alternative for computing advantage functions and guiding policy optimization. However, a common challenge observed across many replication and extension efforts is that when multiple sampled responses under a single prompt converge to identical outcomes, whether correct or incorrect, the group-based advantage degenerates to zero. This leads to vanishing gradients and renders the corresponding samples ineffective for learning, ultimately limiting training efficiency and downstream performance. To address this issue, we propose a consistency-aware policy optimization framework that introduces a structured global reward based on outcome consistency. The global loss built on this reward ensures that, even when model outputs show high intra-group consistency, the training process still receives meaningful learning signals, which encourages the generation of correct and self-consistent reasoning paths from a global perspective. Furthermore, we incorporate an entropy-based soft blending mechanism that adaptively balances local advantage estimation with global optimization, enabling dynamic transitions between exploration and convergence throughout training. Our method introduces several key innovations in both reward design and optimization strategy. We validate its effectiveness through substantial performance gains on multiple mathematical reasoning benchmarks, highlighting the proposed framework's robustness and general applicability. The code for this work has been released at this https URL.
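The degenerate case described in the abstract is easy to reproduce. The sketch below assumes the common group-normalized advantage, (reward - group mean) / group standard deviation; the abstract only says "group-based advantage", so this exact formula, the `eps` stabilizer, and the function name are assumptions. It shows the problem only, not COPO's global reward or blending mechanism.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Normalize each reward by the mean and standard deviation of its group.

    Assumed formulation for illustration; not taken from the paper's code.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one prompt, scored by a rule-based 0/1 reward.
mixed = group_advantages([1.0, 0.0, 1.0, 0.0])        # informative: signs differ
all_correct = group_advantages([1.0, 1.0, 1.0, 1.0])  # every advantage is 0.0
all_wrong = group_advantages([0.0, 0.0, 0.0, 0.0])    # every advantage is 0.0
```

When every response in a group earns the same reward, each advantage is exactly zero, so the policy-gradient term for that prompt vanishes regardless of whether the shared outcome was correct.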

Title: IDCNet: Guided Video Diffusion for Metric-Consistent RGBD Scene Generation with Precise Camera Control

Authors: Lijuan Liu, Wenfa Li, Dongbo Zhang, Shuo Wang, Shaohui Jiao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04147
Pdf URL: https://arxiv.org/pdf/2508.04147
Copy Paste: [[2508.04147]] IDCNet: Guided Video Diffusion for Metric-Consistent RGBD Scene Generation with Precise Camera Control(https://arxiv.org/abs/2508.04147)
Keywords: generation
Abstract: We present IDC-Net (Image-Depth Consistency Network), a novel framework designed to generate RGB-D video sequences under explicit camera trajectory control. Unlike approaches that treat RGB and depth generation separately, IDC-Net jointly synthesizes both RGB images and corresponding depth maps within a unified geometry-aware diffusion model. The joint learning framework strengthens spatial and geometric alignment across frames, enabling more precise camera control in the generated sequences. To support the training of this camera-conditioned model and ensure high geometric fidelity, we construct a camera-image-depth consistent dataset with metric-aligned RGB videos, depth maps, and accurate camera poses, which provides precise geometric supervision with notably improved inter-frame geometric consistency. Moreover, we introduce a geometry-aware transformer block that enables fine-grained camera control, enhancing control over the generated sequences. Extensive experiments show that IDC-Net achieves improvements over state-of-the-art approaches in both visual quality and geometric consistency of generated scene sequences. Notably, the generated RGB-D sequences can be fed directly into downstream 3D scene reconstruction tasks without extra post-processing steps, showcasing the practical benefits of our joint learning framework. See more at this https URL.

Title: ICM-Fusion: In-Context Meta-Optimized LoRA Fusion for Multi-Task Adaptation

Authors: Yihua Shao, Xiaofeng Lin, Xinwei Long, Siyu Chen, Minxi Yan, Yang Liu, Ziyang Yan, Ao Ma, Hao Tang, Jingcai Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04153
Pdf URL: https://arxiv.org/pdf/2508.04153
Copy Paste: [[2508.04153]] ICM-Fusion: In-Context Meta-Optimized LoRA Fusion for Multi-Task Adaptation(https://arxiv.org/abs/2508.04153)
Keywords: generation
Abstract: Enabling multi-task adaptation in pre-trained Low-Rank Adaptation (LoRA) models is crucial for enhancing their generalization capabilities. Most existing pre-trained LoRA fusion methods decompose weight matrices, sharing similar parameters while merging divergent ones. However, this paradigm inevitably induces inter-weight conflicts and leads to catastrophic domain forgetting. While incremental learning enables adaptation to multiple tasks, it struggles to achieve generalization in few-shot scenarios. Consequently, when the weight data follows a long-tailed distribution, it can lead to forgetting in the fused weights. To address this issue, we propose In-Context Meta LoRA Fusion (ICM-Fusion), a novel framework that synergizes meta-learning with in-context adaptation. The key innovation lies in our task vector arithmetic, which dynamically balances conflicting optimization directions across domains through learned manifold projections. ICM-Fusion obtains the optimal task vector orientation for the fused model in the latent space by adjusting the orientation of the task vectors. Subsequently, the fused LoRA is reconstructed by a self-designed Fusion VAE (F-VAE) to realize multi-task LoRA generation. We have conducted extensive experiments on visual and linguistic tasks, and the experimental results demonstrate that ICM-Fusion can be adapted to a wide range of architectural models and applied to various tasks. Compared to the current pre-trained LoRA fusion method, ICM-Fusion fused LoRA can significantly reduce the multi-tasking loss and can even achieve task enhancement in few-shot scenarios.

Title: Audio-Assisted Face Video Restoration with Temporal and Identity Complementary Learning

Authors: Yuqin Cao, Yixuan Gao, Wei Sun, Xiaohong Liu, Yulun Zhang, Xiongkuo Min
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2508.04161
Pdf URL: https://arxiv.org/pdf/2508.04161
Copy Paste: [[2508.04161]] Audio-Assisted Face Video Restoration with Temporal and Identity Complementary Learning(https://arxiv.org/abs/2508.04161)
Keywords: restoration, super-resolution
Abstract: Face videos accompanied by audio have become integral to our daily lives, while they often suffer from complex degradations. Most face video restoration methods neglect the intrinsic correlations between the visual and audio features, especially in mouth regions. A few audio-aided face video restoration methods have been proposed, but they only focus on compression artifact removal. In this paper, we propose a General Audio-assisted face Video restoration Network (GAVN) to address various types of streaming video distortions via identity and temporal complementary learning. Specifically, GAVN first captures inter-frame temporal features in the low-resolution space to restore frames coarsely and save computational cost. Then, GAVN extracts intra-frame identity features in the high-resolution space with the assistance of audio signals and face landmarks to restore more facial details. Finally, the reconstruction module integrates temporal features and identity features to generate high-quality face videos. Experimental results demonstrate that GAVN outperforms the existing state-of-the-art methods on face video compression artifact removal, deblurring, and super-resolution. Codes will be released upon publication.

Title: Semi-Supervised Deep Domain Adaptation for Predicting Solar Power Across Different Locations

Authors: Md Shazid Islam, A S M Jahid Hasan, Md Saydur Rahman, Md Saiful Islam Sajol
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.04165
Pdf URL: https://arxiv.org/pdf/2508.04165
Copy Paste: [[2508.04165]] Semi-Supervised Deep Domain Adaptation for Predicting Solar Power Across Different Locations(https://arxiv.org/abs/2508.04165)
Keywords: generation
Abstract: Accurate solar generation prediction is essential for proper estimation of renewable energy resources across diverse geographic locations. However, geographical and weather features vary from location to location, which introduces domain shift, a major bottleneck in developing location-agnostic prediction models. As a result, a machine-learning model that predicts solar power well in one location may exhibit subpar performance in another. Moreover, the lack of properly labeled data and storage issues make the task even more challenging. To address domain shift due to varying weather conditions across different meteorological regions, this paper presents a semi-supervised deep domain adaptation framework that allows accurate predictions with minimal labeled data from the target location. Our approach involves training a deep convolutional neural network on a source location's data and adapting it to the target location using a source-free, teacher-student model configuration. The teacher-student model leverages consistency and cross-entropy losses for semi-supervised learning, ensuring effective adaptation without requiring any source data for prediction. With only 20% of the target-domain data annotated, our approach improves prediction accuracy over the non-adaptive approach by up to 11.36%, 6.65%, and 4.92% when California, Florida, and New York, respectively, serve as the target domain.
摘要：准确的太阳生成预测对于正确估计各种地理位置的可再生能源资源至关重要。但是，地理和天气的特征因位置而异，引入了域移动，这是一个主要的瓶颈，用于开发位置不足的预测模型。结果，可以很好地预测一个位置的太阳能的机器学习模型可能在另一个位置表现出低于标准的性能。此外，缺乏正确标记的数据和存储问题使任务更具挑战性。为了解决由于不同气象区域的天气条件而导致的域变化，本文介绍了半监督的深层域适应框架，从目标位置的最小标记数据可以进行准确的预测。我们的方法涉及在源位置的数据上培训深度卷积神经网络，并使用无源的，教师学生的模型配置将其调整到目标位置。教师学生模型利用半监督学习的一致性和跨凝性损失，确保没有任何源数据的预测要求，以确保有效的适应。由于目标域中的注释仅为$ 20 \％$数据，因此我们的方法的提高提高了$ 11.36 \％$，$ 6.65 \％$，$ 4.92 \％$，加利福尼亚，佛罗里达和纽约作为目标域，分别是针对非适应性方法的准确性。

Title: ToxicTAGS: Decoding Toxic Memes with Rich Tag Annotations

Authors: Subhankar Swain, Naquee Rizwan, Nayandeep Deb, Vishwajeet Singh Solanki, Vishwa Gangadhar S, Animesh Mukherjee
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2508.04166
Pdf URL: https://arxiv.org/pdf/2508.04166
Copy Paste: [[2508.04166]] ToxicTAGS: Decoding Toxic Memes with Rich Tag Annotations(https://arxiv.org/abs/2508.04166)
Keywords: generation
Abstract: The 2025 Global Risks Report identifies state-based armed conflict and societal polarisation as among the most pressing global threats, with social media playing a central role in amplifying toxic discourse. Memes, as a widely used mode of online communication, often serve as vehicles for spreading harmful content. However, limitations in data accessibility and the high cost of dataset curation hinder the development of robust meme moderation systems. To address this challenge, we introduce a first-of-its-kind dataset of 6,300 real-world meme-based posts annotated in two stages: (i) binary classification into toxic and normal, and (ii) fine-grained labelling of toxic memes as hateful, dangerous, or offensive. A key feature of this dataset is that it is enriched with auxiliary metadata of socially relevant tags, enhancing the context of each meme. In addition, we propose a tag generation module that produces socially grounded tags, because in-the-wild memes often do not come with tags. Experimental results show that incorporating these tags substantially enhances the performance of state-of-the-art VLMs on detection tasks. Our contributions offer a novel and scalable foundation for improved content moderation in multimodal online environments.
摘要：2025年的全球风险报告确定了最紧迫的全球威胁中基于州的武装冲突和社会两极分化，社交媒体在扩大有毒话语方面发挥了核心作用。模因作为广泛使用的在线通信方式，通常是传播有害内容的工具。但是，数据可访问性的限制和数据集策划的高成本阻碍了健壮的模因审核系统的开发。为了应对这一挑战，在这项工作中，我们引入了一个第一个类别数据集，该数据集的6300个基于模因的基于模因的帖子分为两个阶段：（i）二进制分类为有毒和正常的二进制分类，以及（ii）有毒模因的细粒度标记为可恶，危险，危险，危险或攻击性。该数据集的一个关键特征是它具有具有社会相关标签的辅助元数据，从而增强了每个模因的上下文。此外，我们提出了一个产生社交标签的标签生成模块，因为大多数野外模因通常都不带有标签。实验结果表明，合并这些标签大大提高了最先进的VLMS检测任务的性能。我们的贡献为在多模式在线环境中改善内容适度的新颖基础提供了新颖而可扩展的基础。

Title: One Small Step with Fingerprints, One Giant Leap for De Novo Molecule Generation from Mass Spectra

Authors: Neng Kai Nigel Neo, Lim Jing, Ngoui Yong Zhau Preston, Koh Xue Ting Serene, Bingquan Shen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.04180
Pdf URL: https://arxiv.org/pdf/2508.04180
Copy Paste: [[2508.04180]] One Small Step with Fingerprints, One Giant Leap for De Novo Molecule Generation from Mass Spectra(https://arxiv.org/abs/2508.04180)
Keywords: generation
Abstract: A common approach to the de novo molecular generation problem from mass spectra involves a two-stage pipeline: (1) encoding mass spectra into molecular fingerprints, followed by (2) decoding these fingerprints into molecular structures. In our work, we adopt MIST \citep{MISTgoldmanAnnotatingMetaboliteMass2023} as the encoder and MolForge \citep{ucakReconstructionLosslessMolecular2023} as the decoder, leveraging pretraining to enhance performance. Notably, pretraining MolForge proves especially effective, enabling it to serve as a robust fingerprint-to-structure decoder. Additionally, instead of passing the probability of each bit in the fingerprint, thresholding the probabilities with a step function helps focus the decoder on the presence of substructures, improving recovery of accurate molecular structures even when the fingerprints predicted by MIST only moderately resemble the ground truth in terms of Tanimoto similarity. This combination of encoder and decoder results in a tenfold improvement over previous state-of-the-art methods, correctly generating 28% (top-1) and 36% (top-10) of molecular structures from mass spectra. We position this pipeline as a strong baseline for future research in de novo molecule elucidation from mass spectra.
摘要：从质谱中\ emph {de从头}分子产生问题的一种常见方法涉及两阶段的管道：（1）将质谱编码为分子指纹，然后（2）将这些指纹解码为分子结构。在我们的工作中，我们采用\ textsc {mist}〜\ citep {mistgoldManantatingMetaBoliteMass2023}作为编码器和\ textsc {molforge}〜\ citep {ucakreconstructionLosslessMolesslessmolecullecull2023}，作为解码器，以预先培训，以增强表演。值得注意的是，预处理\ textsc {molforge}证明了特别有效，从而使其成为强大的指纹对结构解码器。此外，与其在指纹中传递每个位的概率，而是将概率作为步进函数有助于将解码器集中在存在下的存在上，从而改善了准确的分子结构的恢复，即使\ textsc {mist}预测的指纹也仅适度地与Tanimototo相似性相似。编码器和解码器的这种组合使得与以前的最先进方法相比有了十倍的改善，从而从质谱中正确地生成了TOP-1 28 \％ / TOP-10 36 \％的分子结构。我们将这条管道定位为在质谱中\ emph {de de novo}分子阐明的强大基线。
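Illustrative sketch (not from the paper): the thresholding step described in the abstract above can be pictured as a step function applied to per-bit probabilities before decoding, with Tanimoto similarity measuring how closely the result matches the ground-truth fingerprint. The 0.5 cutoff, the function names, and the six-bit toy fingerprint are all assumptions made for illustration; the abstract does not specify them.

```python
def threshold_fingerprint(probs, cutoff=0.5):
    """Step function: map each per-bit probability to a hard 0/1 bit.

    The cutoff of 0.5 is an assumed value, not one taken from the paper.
    """
    return [1 if p >= cutoff else 0 for p in probs]


def tanimoto(a, b):
    """Tanimoto similarity of two equal-length binary fingerprints:
    (bits set in both) / (bits set in either)."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    # Two all-zero fingerprints are treated as identical.
    return both / either if either else 1.0


# Toy data: hypothetical encoder output and ground-truth bits.
predicted_probs = [0.91, 0.08, 0.55, 0.40, 0.73, 0.02]
true_bits = [1, 0, 1, 1, 1, 0]

predicted_bits = threshold_fingerprint(predicted_probs)
similarity = tanimoto(predicted_bits, true_bits)
print(predicted_bits, similarity)
```

In this toy case one true bit falls below the cutoff, so the hard fingerprint only partly matches the ground truth; that is the regime in which the abstract reports the decoder can still recover accurate structures.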

Title: Deeper Inside Deep ViT

Authors: Sungrae Hong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04181
Pdf URL: https://arxiv.org/pdf/2508.04181
Copy Paste: [[2508.04181]] Deeper Inside Deep ViT(https://arxiv.org/abs/2508.04181)
Keywords: generation
Abstract: There have been attempts to create large-scale structures in vision models similar to those of LLMs, such as ViT-22B. While this research has provided numerous analyses and insights, our understanding of its practical utility remains incomplete. Therefore, we examine how this model structure behaves and trains in a local environment. We also highlight the instability in training and make some model modifications to stabilize it. The ViT-22B model, trained from scratch, outperformed ViT overall at the same parameter count. Additionally, we venture into the task of image generation, which has not been attempted with ViT-22B. We propose an image generation architecture using ViT and investigate whether ViT or ViT-22B is the more suitable structure for image generation.
摘要：已经尝试在类似于LLM的视觉模型中创建大规模结构，例如VIT-22B。尽管这项研究提供了许多分析和见解，但我们对其实际实用程序的理解仍然不完整。因此，我们研究了该模型结构在当地环境中的反应和训练。我们还强调了训练的不稳定，并进行了一些模型修改以稳定它。从头开始训练的VIT-22B模型在相同的参数大小下以性能优于VIT。此外，我们冒险进入图像产生的任务，该任务尚未在VIT-22B中尝试。我们提出了使用VIT的图像生成结构，并研究VIT和VIT-22B之间的哪个是图像生成更合适的结构。

Title: RPCANet++: Deep Interpretable Robust PCA for Sparse Object Segmentation

Authors: Fengyi Wu, Yimian Dai, Tianfang Zhang, Yixuan Ding, Jian Yang, Ming-Ming Cheng, Zhenming Peng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04190
Pdf URL: https://arxiv.org/pdf/2508.04190
Copy Paste: [[2508.04190]] RPCANet++: Deep Interpretable Robust PCA for Sparse Object Segmentation(https://arxiv.org/abs/2508.04190)
Keywords: restoration
Abstract: Robust principal component analysis (RPCA) decomposes an observation matrix into low-rank background and sparse object components. This capability has enabled its application in tasks ranging from image restoration to segmentation. However, traditional RPCA models suffer from computational burdens caused by matrix operations, reliance on finely tuned hyperparameters, and rigid priors that limit adaptability in dynamic scenarios. To address these limitations, we propose RPCANet++, a sparse object segmentation framework that fuses the interpretability of RPCA with efficient deep architectures. Our approach unfolds a relaxed RPCA model into a structured network comprising a Background Approximation Module (BAM), an Object Extraction Module (OEM), and an Image Restoration Module (IRM). To mitigate inter-stage transmission loss in the BAM, we introduce a Memory-Augmented Module (MAM) to enhance background feature preservation, while a Deep Contrast Prior Module (DCPM) leverages saliency cues to expedite object extraction. Extensive experiments on diverse datasets demonstrate that RPCANet++ achieves state-of-the-art performance under various imaging scenarios. We further improve interpretability via visual and numerical low-rankness and sparsity measurements. By combining the theoretical strengths of RPCA with the efficiency of deep networks, our approach sets a new baseline for reliable and interpretable sparse object segmentation. Code is available at our project webpage (this https URL).
摘要：强大的主成分分析（RPCA）将观察矩阵分解为低级别背景和稀疏对象组件。此功能使其在从图像恢复到细分的任务中都可以应用。但是，传统的RPCA模型遭受了由矩阵操作，对精细调整超参数的依赖以及限制动态场景适应性的刚性先验引起的计算负担。为了解决这些局限性，我们提出了RPCanet ++，这是一个稀疏的对象分割框架，可将RPCA的解释性与有效的深层体系结构融合在一起。我们的方法将放松的RPCA模型展开到包括背景近似模块（BAM），对象提取模块（OEM）和图像恢复模块（IRM）的结构化网络中。为了减轻BAM中的阶段间传输损失，我们引入了一个记忆扬声器的模块（MAM），以增强背景特征保存，而深层对比的先验模块（DCPM）利用显着性提示来加快对象提取。对不同数据集的广泛实验表明，在各种成像方案下，RPCanet ++在最新的性能中实现了最新的性能。我们通过视觉和数值低级别和稀疏度测量进一步提高了解释性。通过将RPCA的理论优势与深网的效率相结合，我们的方法为可靠且可解释的稀疏对象细分树立了新的基线。代码可在我们的Project网页上找到此HTTPS URL。
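For orientation, the decomposition mentioned in the abstract above is commonly written in its classical convex form as shown below. This is the textbook RPCA objective, not necessarily the relaxed model that the paper unfolds; the symbols $D$ (observation matrix), $L$ (low-rank background), $S$ (sparse object component), and the trade-off weight $\lambda > 0$ are notation assumed here for illustration.

```latex
\min_{L,\,S}\; \|L\|_{*} + \lambda \|S\|_{1}
\quad \text{subject to} \quad D = L + S
```

Here $\|L\|_{*}$ is the nuclear norm (sum of singular values), which promotes low rank, and $\|S\|_{1}$ is the entrywise $\ell_1$ norm, which promotes sparsity.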

Title: From Learning to Unlearning: Biomedical Security Protection in Multimodal Large Language Models

Authors: Dunyuan Xu, Xikai Yang, Yaoqian Li, Jinpeng Li, Pheng-Ann Heng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04192
Pdf URL: https://arxiv.org/pdf/2508.04192
Copy Paste: [[2508.04192]] From Learning to Unlearning: Biomedical Security Protection in Multimodal Large Language Models(https://arxiv.org/abs/2508.04192)
Keywords: generation
Abstract: The security of biomedical Multimodal Large Language Models (MLLMs) has attracted increasing attention. However, training samples can easily contain private information and incorrect knowledge that are difficult to detect, potentially leading to privacy leakage or erroneous outputs after deployment. An intuitive idea is to reprocess the training set to remove unwanted content and retrain the model from scratch. Yet this is impractical due to significant computational costs, especially for large language models. Machine unlearning has emerged as a solution to this problem: it avoids complete retraining by selectively removing undesired knowledge derived from harmful samples while preserving required capabilities on normal cases. However, no datasets are available to evaluate the unlearning quality for security protection in biomedical MLLMs. To bridge this gap, we propose the first benchmark, Multimodal Large Language Model Unlearning for BioMedicine (MLLMU-Med), built upon our novel data generation pipeline that effectively integrates synthetic private data and factual errors into the training set. Our benchmark targets two key scenarios: 1) Privacy protection, where patient private information is mistakenly included in the training set, causing models to unintentionally respond with private data during inference; and 2) Incorrectness removal, where wrong knowledge derived from unreliable sources is embedded into the dataset, leading to unsafe model responses. Moreover, we propose a novel Unlearning Efficiency Score that directly reflects the overall unlearning performance across different subsets. We evaluate five unlearning approaches on MLLMU-Med and find that these methods show limited effectiveness in removing harmful knowledge from biomedical MLLMs, indicating significant room for improvement. This work establishes a new pathway for further research in this promising field.
摘要：生物医学多模式大型语言模型（MLLM）的安全性吸引了越来越多的关注。但是，培训样品很容易包含私人信息和难以检测的知识，可能导致部署后隐私泄漏或错误输出。一个直观的想法是重新处理培训集，以删除不需要的内容并从头开始重新训练模型。但是，由于巨大的计算成本，这是不切实际的，尤其是对于大型语言模型。机器的学习已经成为解决此问题的解决方案，从而选择性地删除从有害样本中得出的不希望的知识，同时在正常情况下保留所需的功能，从而避免了完全的重新训练。但是，没有可用的数据集来评估生物医学MLLM中安全保护的未学习质量。为了弥合这一差距，我们提出了第一个基于生物医学（MLLMU-MED）的基准多模式大型语言模型，该模型构建在我们新颖的数据生成管道上，该模型有效地将综合私有数据和事实错误整合到培训集中。我们的基准目标针对两个关键方案：1）隐私保护，其中培训集中误入了患者私人信息，导致模型在推断期间无意间用私人数据做出反应； 2）删除不正确的地方，而从不可靠来源得出的错误知识嵌入到数据集中，从而导致不安全的模型响应。此外，我们提出了一种新颖的学习效率评分，该评分直接反映了不同子集的整体学习成绩。我们评估了MLLMU-MED上的五种未学习方法，发现这些方法在从生物医学MLLM中删除有害知识方面显示出有限的有效性，这表明改善了很大的空间。这项工作为在这个有希望的领域中进一步研究建立了新的途径。

Title: LayerT2V: Interactive Multi-Object Trajectory Layering for Video Generation

Authors: Kangrui Cen, Baixuan Zhao, Yi Xin, Siqi Luo, Guangtao Zhai, Xiaohong Liu
Subjects: cs.CV, cs.AI, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2508.04228
Pdf URL: https://arxiv.org/pdf/2508.04228
Copy Paste: [[2508.04228]] LayerT2V: Interactive Multi-Object Trajectory Layering for Video Generation(https://arxiv.org/abs/2508.04228)
Keywords: generation, generative
Abstract: Controlling object motion trajectories in Text-to-Video (T2V) generation is a challenging and relatively under-explored area, particularly in scenarios involving multiple moving objects. Most community models and datasets in the T2V domain are designed for single-object motion, limiting the performance of current generative models in multi-object tasks. Additionally, existing motion control methods in T2V either lack support for multi-object motion scenes or experience severe performance degradation when object trajectories intersect, primarily due to the semantic conflicts in colliding regions. To address these limitations, we introduce LayerT2V, the first approach for generating video by compositing background and foreground objects layer by layer. This layered generation enables flexible integration of multiple independent elements within a video, positioning each element on a distinct "layer" and thus facilitating coherent multi-object synthesis while enhancing control over the generation process. Extensive experiments demonstrate the superiority of LayerT2V in generating complex multi-object scenarios, showcasing 1.4x and 4.5x improvements in mIoU and AP50 metrics over state-of-the-art (SOTA) methods. Project page and code are available at this https URL.
摘要：在文本到视频（T2V）生成中控制对象运动轨迹是一个具有挑战性且相对探索的区域，尤其是在涉及多个移动对象的情况下。 T2V域中的大多数社区模型和数据集都是为单对象运动而设计的，从而限制了当前生成模型在多对象任务中的性能。此外，T2V中的现有运动控制方法要么缺乏对多对象运动场景的支持，要么在对象轨迹相交时经历严重的性能降解，这主要是由于碰撞区域的语义冲突。为了解决这些局限性，我们介绍了Layert2v，这是通过一层合成背景和前景对象来生成视频的第一种方法。这种分层的生成可以在视频中灵活地集成多个独立元素，将每个元素定位在不同的“层”上，从而促进相干的多对象合成，同时增强对生成过程的控制。广泛的实验表明，Layert2V在生成复杂的多对象方案中具有优势，显示了MIOU和AP50指标比最新方法（SOTA）方法的1.4倍和4.5倍改善。项目页面和代码可在此HTTPS URL上找到。

Title: Intention Enhanced Diffusion Model for Multimodal Pedestrian Trajectory Prediction

Authors: Yu Liu, Zhijie Liu, Xiao Ren, You-Fu Li, He Kong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04229
Pdf URL: https://arxiv.org/pdf/2508.04229
Copy Paste: [[2508.04229]] Intention Enhanced Diffusion Model for Multimodal Pedestrian Trajectory Prediction(https://arxiv.org/abs/2508.04229)
Keywords: generation
Abstract: Predicting pedestrian motion trajectories is critical for path planning and motion control of autonomous vehicles. However, accurately forecasting crowd trajectories remains a challenging task due to the inherently multimodal and uncertain nature of human motion. Recent diffusion-based models have shown promising results in capturing the stochasticity of pedestrian behavior for trajectory prediction. However, few diffusion-based approaches explicitly incorporate the underlying motion intentions of pedestrians, which can limit the interpretability and precision of prediction models. In this work, we propose a diffusion-based multimodal trajectory prediction model that incorporates pedestrians' motion intentions into the prediction framework. The motion intentions are decomposed into lateral and longitudinal components, and a pedestrian intention recognition module is introduced to enable the model to effectively capture these intentions. Furthermore, we adopt an efficient guidance mechanism that facilitates the generation of interpretable trajectories. The proposed framework is evaluated on two widely used human trajectory prediction benchmarks, ETH and UCY, on which it is compared against state-of-the-art methods. The experimental results demonstrate that our method achieves competitive performance.
摘要：预测行人运动轨迹对于自动驾驶汽车的路径计划和运动控制至关重要。但是，由于人类运动的固有多模式和不确定的性质，准确地预测人群轨迹仍然是一项艰巨的任务。最近基于扩散的模型在捕获轨迹预测的行人行为的随机性方面显示出令人鼓舞的结果。但是，很少有基于扩散的方法明确结合了行人的基本运动意图，这可以限制预测模型的可解释性和精度。在这项工作中，我们提出了一个基于扩散的多模式轨迹预测模型，该模型将行人的运动意图纳入预测框架中。运动意图分解为侧面和纵向成分，并引入了行人意图识别模块，以使模型能够有效捕获这些意图。此外，我们采用有效的指导机制，促进了可解释的轨迹的产生。提出的框架对两个广泛使用的人类轨迹预测基准ETH和UCY进行了评估，ETH和UCY与最新方法进行了比较。实验结果表明我们的方法实现了竞争性能。

Title: DocVCE: Diffusion-based Visual Counterfactual Explanations for Document Image Classification

Authors: Saifullah Saifullah, Stefan Agne, Andreas Dengel, Sheraz Ahmed
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04233
Pdf URL: https://arxiv.org/pdf/2508.04233
Copy Paste: [[2508.04233]] DocVCE: Diffusion-based Visual Counterfactual Explanations for Document Image Classification(https://arxiv.org/abs/2508.04233)
Keywords: generative
Abstract: As black-box AI-driven decision-making systems become increasingly widespread in modern document processing workflows, improving their transparency and reliability has become critical, especially in high-stakes applications where biases or spurious correlations in decision-making could lead to serious consequences. One vital component often found in such document processing workflows is document image classification, which, despite its widespread use, remains difficult to explain. While some recent works have attempted to explain the decisions of document image classification models through feature-importance maps, these maps are often difficult to interpret and fail to provide insights into the global features learned by the model. In this paper, we aim to bridge this research gap by introducing generative document counterfactuals that provide meaningful insights into the model's decision-making through actionable explanations. In particular, we propose DocVCE, a novel approach that leverages latent diffusion models in combination with classifier guidance to first generate plausible in-distribution visual counterfactual explanations, and then performs hierarchical patch-wise refinement to search for a refined counterfactual that is closest to the target factual image. We demonstrate the effectiveness of our approach through a rigorous qualitative and quantitative assessment on 3 different document classification datasets -- RVL-CDIP, Tobacco3482, and DocLayNet -- and 3 different models -- ResNet, ConvNeXt, and DiT -- using well-established evaluation criteria such as validity, closeness, and realism. To the best of the authors' knowledge, this is the first work to explore generative counterfactual explanations in document image analysis.
摘要：随着Black-Box AI驱动的决策系统在现代文档处理工作流程中变得越来越普遍，提高其透明度和可靠性已经变得至关重要，尤其是在决策中偏见或虚假相关性的高风险应用程序可能会导致严重后果。文档处理工作流程中经常发现的一个重要组成部分是文档图像分类，尽管它广泛使用，但仍然很难解释。尽管最近的一些作品试图通过功能描绘来解释文档图像分类模型的决策，但这些地图通常很难解释，也无法对模型学到的全局特征提供见解。在本文中，我们旨在通过引入生成文档的反事实来弥合这一研究差距，从而通过可行的解释为模型的决策提供有意义的见解。特别是，我们提出了Docvce，这是一种新型的方法，该方法利用潜在扩散模型与分类器指南结合使用，以首先产生合理的分布视觉视觉反事实解释，然后在层次结构贴片上进行细化以搜索最接近目标事实图像的精制反事实。我们通过对3个不同的文档分类数据集进行了严格的定性和定量评估（RVL-CDIP，TOBACCO3482和DOCLAYNET）以及3种不同的模型 - 使用良好的评估标准，例如有效性，近距离和现实主义。据作者所知，这是探索文档图像分析中生成反事实解释的第一项工作。

Title: TempFlow-GRPO: When Timing Matters for GRPO in Flow Models

Authors: Xiaoxuan He, Siming Fu, Yuke Zhao, Wanli Li, Jian Yang, Dacheng Yin, Fengyun Rao, Bo Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04324
Pdf URL: https://arxiv.org/pdf/2508.04324
Copy Paste: [[2508.04324]] TempFlow-GRPO: When Timing Matters for GRPO in Flow Models(https://arxiv.org/abs/2508.04324)
Keywords: generation, generative
Abstract: Recent flow matching models for text-to-image generation have achieved remarkable quality, yet their integration with reinforcement learning for human preference alignment remains suboptimal, hindering fine-grained reward-based optimization. We observe that the key impediment to effective GRPO training of flow models is the temporal uniformity assumption in existing approaches: sparse terminal rewards with uniform credit assignment fail to capture the varying criticality of decisions across generation timesteps, resulting in inefficient exploration and suboptimal convergence. To remedy this shortcoming, we introduce TempFlow-GRPO (Temporal Flow GRPO), a principled GRPO framework that captures and exploits the temporal structure inherent in flow-based generation. TempFlow-GRPO introduces two key innovations: (i) a trajectory branching mechanism that provides process rewards by concentrating stochasticity at designated branching points, enabling precise credit assignment without requiring specialized intermediate reward models; and (ii) a noise-aware weighting scheme that modulates policy optimization according to the intrinsic exploration potential of each timestep, prioritizing learning during high-impact early stages while ensuring stable refinement in later phases. These innovations endow the model with temporally-aware optimization that respects the underlying generative dynamics, leading to state-of-the-art performance in human preference alignment and standard text-to-image benchmarks.
摘要：文本到图像生成的最新流量匹配模型已达到了卓越的质量，但是它们与增强学习的人类偏好一致性的集成仍然是最佳的，阻碍了基于良好的奖励优化。我们观察到，对流动模型有效的GRPO培训的关键障碍是现有方法中的时间均匀性假设：具有统一信用分配的稀疏终端奖励未能捕捉到时间段跨时时间段的决策的各种批判性，从而导致效率不足的勘探和次优融合。为了解决这个缺点，我们介绍了\ textbf {tempflow-grpo}（时间流程grpo），这是一个原则上的GRPO框架，可捕获和利用基于流的生成中固有的时间结构。 Tempflow-Grpo介绍了两个关键的创新：（i）一种轨迹分支机制，该机制通过在指定的分支点上集中随机性来提供过程奖励，从而实现精确的信用分配而无需使用专门的中级奖励模型；（ii）一种噪音感知的加权方案，该方案根据每个时间步的内在勘探潜力调节策略优化，在高影响力的早期阶段优先考虑学习，同时确保在以后的阶段进行稳定的完善。这些创新赋予该模型具有暂时意识的优化，该优化尊重潜在的生成动力学，从而导致人类偏好对齐和标准文本对图像基准测试的最新性能。

Title: TSPO: Temporal Sampling Policy Optimization for Long-form Video Language Understanding

Authors: Canhui Tang, Zifan Han, Hongbo Sun, Sanping Zhou, Xuchong Zhang, Xin Wei, Ye Yuan, Jinglin Xu, Hao Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04369
Pdf URL: https://arxiv.org/pdf/2508.04369
Copy Paste: [[2508.04369]] TSPO: Temporal Sampling Policy Optimization for Long-form Video Language Understanding(https://arxiv.org/abs/2508.04369)
Keywords: generation
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated significant progress in vision-language tasks, yet they still face challenges when processing long-duration video inputs. The limitation arises from MLLMs' context limit and training costs, necessitating sparse frame sampling before feeding videos into MLLMs. Existing video MLLMs adopt training-free uniform sampling or keyframe search, which may miss critical events or be constrained by the pre-trained models' event understanding capabilities. Meanwhile, building a training-based method remains challenging due to the unsupervised and non-differentiable nature of sparse frame sampling. To address these problems, we propose Temporal Sampling Policy Optimization (TSPO), advancing MLLMs' long-form video-language understanding via reinforcement learning. Specifically, we first propose a trainable event-aware temporal agent, which captures event-query correlation for performing probabilistic keyframe selection. Then, we propose the TSPO reinforcement learning paradigm, which models keyframe selection and language generation as a joint decision-making process, enabling end-to-end group relative optimization with efficient rule-based rewards. Furthermore, for TSPO training, we propose a long video training data construction pipeline with comprehensive temporal data and video Needle-in-a-Haystack data. Finally, we incorporate rule-based answering accuracy and temporal locating reward mechanisms to optimize the temporal sampling policy. Comprehensive experiments show that TSPO achieves state-of-the-art performance across multiple long video understanding benchmarks and transfers across different cutting-edge Video-MLLMs.
摘要：多模式的大语言模型（MLLM）在视觉任务上表现出了重大进展，但是在处理长期视频输入时，它们仍然面临挑战。限制是由MLLM的上下文限制和培训成本引起的，需要在将视频喂入MLLM之前进行稀疏的框架采样。现有的视频MLLM采用无培训的统一抽样或密钥帧搜索，可能会错过关键事件或受到预训练模型的事件理解能力的约束。同时，由于稀疏框架抽样的无监督和不可分割的性质，构建基于培训的方法仍然具有挑战性。为了解决这些问题，我们提出了时间抽样策略优化（TSPO），通过强化学习推进了MLLMS长期视频语言的理解。具体而言，我们首先提出了一个可训练的事件感知的时间代理，该代理捕获了用于执行概率关键帧选择的事件问题相关性。然后，我们提出了TSPO增强学习范式，该学习范式将关键帧选择和语言生成建模为联合决策过程，从而通过有效的基于规则的奖励实现端到端的相对优化。此外，对于TSPO的培训，我们提出了一条长时间的视频培训数据构建管道，其中包含全面的时间数据和视频介绍数据。最后，我们结合了基于规则的答案准确性和时间定位奖励机制，以优化时间抽样策略。全面的实验表明，我们的TSPO在多个长期视频理解基准中实现了最先进的性能，并在不同的尖端视频中显示了可转移的能力。

Title: Cloud Model Characteristic Function Auto-Encoder: Integrating Cloud Model Theory with MMD Regularization for Enhanced Generative Modeling

Authors: Biao Hu, Guoyin Wang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.04447
Pdf URL: https://arxiv.org/pdf/2508.04447
Copy Paste: [[2508.04447]] Cloud Model Characteristic Function Auto-Encoder: Integrating Cloud Model Theory with MMD Regularization for Enhanced Generative Modeling(https://arxiv.org/abs/2508.04447)
Keywords: generative
Abstract: We introduce Cloud Model Characteristic Function Auto-Encoder (CMCFAE), a novel generative model that integrates the cloud model into the Wasserstein Auto-Encoder (WAE) framework. By leveraging the characteristic functions of the cloud model to regularize the latent space, our approach enables more accurate modeling of complex data distributions. Unlike conventional methods that rely on a standard Gaussian prior and traditional divergence measures, our method employs a cloud model prior, providing a more flexible and realistic representation of the latent space, thus mitigating the homogenization observed in reconstructed samples. We derive the characteristic function of the cloud model and propose a corresponding regularizer within the WAE framework. Extensive quantitative and qualitative evaluations on MNIST, FashionMNIST, CIFAR-10, and CelebA demonstrate that CMCFAE outperforms existing models in terms of reconstruction quality, latent space structuring, and sample diversity. This work not only establishes a novel integration of cloud model theory with MMD-based regularization but also offers a promising new perspective for enhancing autoencoder-based generative models.
摘要：我们介绍了云模型特征函数自动编码器（CMCFAE），这是一种新颖的生成模型，将云模型集成到Wasserstein自动编码器（WAE）框架中。通过利用云模型的特征函数来正规化潜在空间，我们的方法可以更准确地对复杂的数据分布进行建模。与依靠标准高斯先验和传统差异度量的常规方法不同，我们的方法采用了云模型，从而提供了对潜在空间的更灵活和更现实的表示，从而减轻了在重建样品中观察到的均质化。我们得出了云模型的特征函数，并在WAE框架内提出了相应的正规化程序。对MNIST，FashionMnist，CIFAR-10和Celeba的广泛定量评估和定性评估表明，CMCFAE在重建质量，潜在空间结构和样本多样性方面优于现有模型。这项工作不仅建立了云模型理论与基于MMD的正则化的新颖集成，而且为增强基于自动编码器的生成模型提供了有希望的新观点。

Title: Automatic LLM Red Teaming

Authors: Roman Belaire, Arunesh Sinha, Pradeep Varakantham
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.04451
Pdf URL: https://arxiv.org/pdf/2508.04451
Copy Paste: [[2508.04451]] Automatic LLM Red Teaming(https://arxiv.org/abs/2508.04451)
Keywords: generative
Abstract: Red teaming is critical for identifying vulnerabilities and building trust in current Large Language Models (LLMs). However, current automated methods rely on brittle prompt templates or single-turn attacks, failing to capture the complex, interactive nature of real-world adversarial dialogues. We propose a novel paradigm: training an AI to strategically 'break' another AI. By formalizing red teaming as a Markov Decision Process (MDP) and employing a hierarchical Reinforcement Learning (RL) framework, we effectively address the inherent sparse reward and long-horizon challenges. Our generative agent learns coherent, multi-turn attack strategies through a fine-grained, token-level harm reward, enabling it to uncover subtle vulnerabilities missed by existing baselines. This approach sets a new state-of-the-art, fundamentally reframing LLM red teaming as a dynamic, trajectory-based process (rather than a one-step test) essential for robust AI deployment.
摘要：红色团队对于确定漏洞并建立对当前LLM的信任至关重要。但是，当前的大语言模型（LLMS）的自动化方法依赖于脆弱的提示模板或单转攻击，无法捕获真实世界对抗对话的复杂，互动性。我们提出了一种新颖的范式：训练AI在战略上“打破”另一个AI。通过将红色团队形式化为马尔可夫决策过程（MDP）并采用层次强化学习（RL）框架，我们有效地解决了固有的稀疏奖励和长期胜利的挑战。我们的生成代理通过细粒度，令牌级别的伤害奖励来学习连贯的多转弯攻击策略，从而使其能够发现现有基线所遗漏的微妙脆弱性。这种方法设定了一种新的最先进的，从根本上将LLM红色团队重塑为一个动态的，基于轨迹的过程（而不是一步测试），这对于强大的AI部署至关重要。

Title: Small transformer architectures for task switching

Authors: Claudius Gros
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.04461
Pdf URL: https://arxiv.org/pdf/2508.04461
Copy Paste: [[2508.04461]] Small transformer architectures for task switching(https://arxiv.org/abs/2508.04461)
Keywords: generative
Abstract: The rapid progress seen in large-scale generative AI is largely based on the attention mechanism. It is conversely non-trivial to conceive small-scale applications for which attention-based architectures outperform traditional approaches, such as multi-layer perceptrons or recurrent networks. We examine this problem in the context of 'task switching'. In this framework, models work on ongoing token sequences, with the current task being determined by stochastically interspersed control tokens. We show that standard transformers cannot solve a basic task-switching reference model based on finite domain arithmetics, which contains subtasks dedicated to increment / addition / reverse copy / context (IARC). We show that transformers, long short-term memory recurrent networks (LSTM), and plain multi-layer perceptrons (MLPs) achieve similar but only modest prediction accuracies. We enlarge our comparative study by including an extension of the standard transformer architecture to its non-translational invariant counterpart, the cisformer, and an alternative attention mechanism, extensive attention. A combination of the latter two is found to be the only model able to achieve considerable performance levels, of around 95%. Our results indicate that the workings of attention can be understood better, and even improved, when comparing qualitatively different formulations in task-switching settings.
摘要：从大规模生成的AI角度出现的快速进步主要基于注意机制。相反，构想基于注意力的体系结构的小规模应用是不平凡的，例如多层感知器或经常性网络。我们在“任务切换”的背景下检查了这个问题。在此框架中，模型可用于正在进行的令牌序列，当前任务由随机散布的控制令牌确定。我们表明，标准变压器无法基于有限域算术的基本任务切换参考模型，该算术包含专门用于增量 /添加 /添加 /反复复制 /上下文（IARC）的子任务。我们表明，变形金刚，长期的短期内存复发网络（LSTM）和普通的多层感知器（MLP）实现了相似但仅具有适度的预测精度。我们通过将标准变压器架构扩展到其非翻译不变的对应器，术和替代性注意机制，扩大了比较研究。发现后者的组合是唯一能够达到相当大的性能水平的模型，约为95％。我们的结果表明，在比较任务切换设置的质量不同时，可以更好地理解注意力的工作。

Title: CARD: Cache-Assisted Parallel Speculative Decoding for Efficient Large Language Model Inference

Authors: Enyu Zhou, Kai Sheng, Hao Chen, Xin He
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.04462
Pdf URL: https://arxiv.org/pdf/2508.04462
Copy Paste: [[2508.04462]] CARD: Cache-Assisted Parallel Speculative Decoding for Efficient Large Language Model Inference(https://arxiv.org/abs/2508.04462)
Keywords: generation
Abstract: Speculative decoding (SD), where an extra draft model first provides multiple draft tokens and the original target model then verifies these tokens in parallel, has shown great power for LLM inference acceleration. However, existing SD methods must adhere to the 'draft-then-verify' paradigm, which forces drafting and verification processes to execute sequentially during SD, resulting in inefficient inference performance and limiting the size of the draft model. Furthermore, once a single token in the candidate sequence is rejected during the drafting process, all subsequent candidate tokens must be discarded, leading to inefficient drafting. To address these challenges, we propose a cache-based parallel speculative decoding framework employing a 'query-and-correct' paradigm. Specifically, CARD decouples drafting and verification: the draft model generates candidate tokens to populate a shared cache, while the target model concurrently rectifies the draft model's generation direction. This effectively enables the target model to perform inference at a speed approaching that of the draft model. Our approach achieves up to a 4.83x speedup over vanilla decoding without requiring fine-tuning of either the draft or target models. Our code is available at this https URL.
摘要：投机解码（SD），其中额外的草稿模型首先提供多个草稿令牌，而原始目标模型然后并行验证这些令牌，它显示了LLM推理加速度的强大功能。但是，现有的SD方法必须遵守“草稿 - 验证”范式，该范式迫使起草和验证过程在SD期间依次执行，从而导致推理性能低下，并限制了草案模型的大小。此外，一旦在起草过程中拒绝了候选序列中的一个令牌，则必须丢弃所有随后的候选令牌，从而导致起草效率低下。为了应对这些挑战，我们建议采用“查询和校正”范式的基于缓存的平行投机解码框架。具体而言，卡片将起草和验证解开：草稿模型生成候选令牌以填充共享缓存，而目标模型同时纠正了草案模型的生成方向。这有效地使目标模型能够以速度进行推断，以接近草案模型的推断。我们的方法在不需要对草稿或目标模型进行微调的情况下实现了高达4.83的速度。我们的代码可在此HTTPS URL上找到。
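Illustrative sketch (not CARD itself): the baseline 'draft-then-verify' loop that the abstract above contrasts with can be pictured with two toy greedy models. Everything here is hypothetical: `target_next`, `draft_next`, the integer token vocabulary, and the draft length `k` are stand-ins chosen only so the control flow is visible. A real system would verify all draft tokens in one parallel forward pass of the target model; the loop below checks them one by one for clarity.

```python
def target_next(tokens):
    """Hypothetical target model: greedy next token is (last + 1) mod 10."""
    return (tokens[-1] + 1) % 10


def draft_next(tokens):
    """Hypothetical draft model: agrees with the target except after token 4."""
    return 0 if tokens[-1] == 4 else (tokens[-1] + 1) % 10


def speculative_step(tokens, k=4):
    """One draft-then-verify round: draft k tokens, keep the verified prefix."""
    # Drafting phase: the draft model proposes k tokens autoregressively.
    draft, ctx = [], list(tokens)
    for _ in range(k):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)

    # Verification phase: accept draft tokens until the first mismatch.
    accepted, ctx = [], list(tokens)
    for t in draft:
        expected = target_next(ctx)
        if t != expected:
            # Rejection: substitute the target's token, discard the rest.
            accepted.append(expected)
            break
        accepted.append(t)
        ctx.append(t)
    else:
        # All drafts accepted: the target contributes one extra token.
        accepted.append(target_next(ctx))
    return tokens + accepted


def generate(prompt, n_new, k=4):
    """Run rounds until n_new tokens exist; return (tokens, rounds used)."""
    tokens, rounds = list(prompt), 0
    while len(tokens) - len(prompt) < n_new:
        tokens = speculative_step(tokens, k)
        rounds += 1
    return tokens[: len(prompt) + n_new], rounds


output, rounds = generate([1], n_new=8)
print(output, rounds)
```

The output matches what the target model alone would produce greedily, but in fewer target rounds than tokens generated. Note that each round still waits for drafting to finish before verification starts, which is the sequential dependency the abstract says CARD is designed to remove.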

Title: 4DVD: Cascaded Dense-view Video Diffusion Model for High-quality 4D Content Generation

Authors: Shuzhou Yang, Xiaodong Cun, Xiaoyu Li, Yaowei Li, Jian Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04467
Pdf URL: https://arxiv.org/pdf/2508.04467
Copy Paste: [[2508.04467]] 4DVD: Cascaded Dense-view Video Diffusion Model for High-quality 4D Content Generation(https://arxiv.org/abs/2508.04467)
Keywords: generation
Abstract: Given the high complexity of directly generating high-dimensional data such as 4D, we present 4DVD, a cascaded video diffusion model that generates 4D content in a decoupled manner. Unlike previous multi-view video methods that directly model 3D space and temporal features simultaneously with stacked cross-view/temporal attention modules, 4DVD decouples this into two subtasks: coarse multi-view layout generation and structure-aware conditional generation, and effectively unifies them. Specifically, given a monocular video, 4DVD first predicts the dense view content of its layout with superior cross-view and temporal consistency. Based on the produced layout priors, a structure-aware spatio-temporal generation branch is developed, combining these coarse structural priors with the exquisite appearance content of the input monocular video to generate final high-quality dense-view videos. Benefiting from this, explicit 4D representations (such as 4D Gaussian) can be optimized accurately, enabling wider practical application. To train 4DVD, we collect a dynamic 3D object dataset, called D-Objaverse, from the Objaverse benchmark and render 16 videos with 21 frames for each object. Extensive experiments demonstrate our state-of-the-art performance on both novel view synthesis and 4D generation. Our project page is this https URL
摘要：鉴于直接生成高维数据（例如4D）的高复杂性，我们提出了4DVD，这是一个级联的视频扩散模型，该模型以脱钩的方式生成4D内容。与以前的多视频视频方法与直接建模3D空间和时间特征同时与堆叠的横视/时间注意模块同时建模的视频方法不同，4DVD将其分解为两个子任务：粗糙的多视图布局生成和结构意识到的条件生成，并有效地统一了它们。具体而言，给定一个单眼视频，4DVD首先通过出色的跨视图和时间一致性来预测其布局的密集视图内容。基于生产的布局先验，开发了一个结构感知的时空生成分支，将这些粗糙的结构先验与输入单眼视频的精美外观结合在一起，以生成最终的高质量密集视频。从中受益，可以准确优化显式4D表示（例如4D高斯），从而实现更广泛的实际应用。要训练4DVD，我们从Objaverse基准测试中收集一个动态3D对象数据集，称为D-Objaverse，并为每个对象渲染16个带有21帧的视频。广泛的实验证明了我们在新型视图合成和4D代表上的最新性能。我们的项目页面是此HTTPS URL

Title: Zero-Residual Concept Erasure via Progressive Alignment in Text-to-Image Model

Authors: Hongxu Chen, Zhen Wang, Taoran Mei, Lin Li, Bowei Zhu, Runshi Li, Long Chen
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.04472
Pdf URL: https://arxiv.org/pdf/2508.04472
Copy Paste: [[2508.04472]] Zero-Residual Concept Erasure via Progressive Alignment in Text-to-Image Model(https://arxiv.org/abs/2508.04472)
Keywords: generation, generative
Abstract: Concept Erasure, which aims to prevent pretrained text-to-image models from generating content associated with semantic-harmful concepts (i.e., target concepts), is getting increased attention. State-of-the-art methods formulate this task as an optimization problem: they align all target concepts with semantic-harmless anchor concepts, and apply closed-form solutions to update the model accordingly. While these closed-form methods are efficient, we argue that existing methods have two overlooked limitations: 1) They often result in incomplete erasure due to "non-zero alignment residual", especially when text prompts are relatively complex. 2) They may suffer from generation quality degradation as they always concentrate parameter updates in a few deep layers. To address these issues, we propose a novel closed-form method ErasePro: it is designed for more complete concept erasure and better preserving overall generative quality. Specifically, ErasePro first introduces a strict zero-residual constraint into the optimization objective, ensuring perfect alignment between target and anchor concept features and enabling more complete erasure. Secondly, it employs a progressive, layer-wise update strategy that gradually transfers target concept features to those of the anchor concept from shallow to deep layers. As the depth increases, the required parameter changes diminish, thereby reducing deviations in sensitive deep layers and preserving generative quality. Empirical results across different concept erasure tasks (including instance, art style, and nudity erasure) have demonstrated the effectiveness of our ErasePro.
摘要：旨在防止经过预估计的文本对图像模型产生与语义破坏概念（即目标概念）相关的内容的概念擦除正在增加注意力。最先进的方法将此任务制定为一个优化问题：它们将所有目标概念与语义无障碍锚点概念保持一致，并应用封闭式解决方案以相应地更新模型。尽管这些封闭形式的方法是有效的，但我们认为现有方法具有两个忽略的局限性：1）由于“非零对齐残差”，它们通常会导致不完整的擦除，尤其是当文本提示相对复杂时。 2）他们可能会因发电质量退化而受到损失，因为他们总是将参数更新集中在几层深层。为了解决这些问题，我们提出了一种新颖的封闭形式方法Erasepro：它是为了更完整的概念擦除和更好地保留整体生成质量的设计。具体而言，Erasepro首先将严格的零剩余约束引入到优化目标中，从而确保目标和锚概念特征之间的完美对齐，并实现更完整的擦除。其次，它采用了一种进步的层面更新策略，该策略逐渐将目标概念特征转移到锚概念的特征从浅层到深层。随着深度的增加，所需的参数变化减小，从而减少了敏感的深层偏差并保留生成质量。不同概念擦除任务（包括实例，艺术风格和裸体擦除）的经验结果证明了我们的Erasepro的有效性。
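The idea of a "zero alignment residual" can be illustrated on a single linear layer. The sketch below is not ErasePro's closed form, which the abstract does not state. It assumes one weight matrix `W`, one target-concept feature `k`, and a desired output `v_star` (for example, the layer's output for an anchor concept), and applies the standard minimum-norm rank-one correction `W' = W + (v_star - W k) k^T / (k^T k)`, which makes `W' k` equal `v_star` exactly. All function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(W: Matrix, x: Vector) -> Vector:
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def zero_residual_update(W: Matrix, k: Vector, v_star: Vector) -> Matrix:
    """Return the smallest (Frobenius-norm) change to W such that W' k == v_star."""
    kk = sum(ki * ki for ki in k)
    if kk == 0.0:
        raise ValueError("target feature k must be non-zero")
    # Residual between the desired output and the current output for k.
    residual = [vs - y for vs, y in zip(v_star, matvec(W, k))]
    # Rank-one correction: each row i gets residual[i] * k / (k . k) added.
    return [[w + r * ki / kk for w, ki in zip(row, k)] for row, r in zip(W, residual)]
```

For instance, with `W` the 2x2 identity, `k = [1, 2]`, and `v_star = [3, 0]`, the updated matrix maps `k` to `[3, 0]` while leaving inputs orthogonal to `k`, such as `[2, -1]`, unchanged.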

Title: Emotion Detection Using Conditional Generative Adversarial Networks (cGAN): A Deep Learning Approach

Authors: Anushka Srivastava
Subjects: cs.LG, cs.NE
Abstract URL: https://arxiv.org/abs/2508.04481
Pdf URL: https://arxiv.org/pdf/2508.04481
Copy Paste: [[2508.04481]] Emotion Detection Using Conditional Generative Adversarial Networks (cGAN): A Deep Learning Approach(https://arxiv.org/abs/2508.04481)
Keywords: generative
Abstract: This paper presents a deep learning-based approach to emotion detection using Conditional Generative Adversarial Networks (cGANs). Unlike traditional unimodal techniques that rely on a single data type, we explore a multimodal framework integrating text, audio, and facial expressions. The proposed cGAN architecture is trained to generate synthetic emotion-rich data and improve classification accuracy across multiple modalities. Our experimental results demonstrate significant improvements in emotion recognition performance compared to baseline models. This work highlights the potential of cGANs in enhancing human-computer interaction systems by enabling more nuanced emotional understanding.
摘要：本文使用条件生成对抗网络（CGAN）提出了一种基于学习的情感检测方法。与依靠单个数据类型的传统单峰技术不同，我们探索了一个多模式框架，以整合文本，音频和面部表情。提出的CGAN架构经过训练，以生成综合情绪丰富的数据并提高多种方式的分类精度。我们的实验结果表明，与基线模型相比，情绪识别性能的显着改善。这项工作突出了CGAN通过实现更细微的情感理解来增强人类计算机相互作用系统的潜力。

Title: QuantVSR: Low-Bit Post-Training Quantization for Real-World Video Super-Resolution

Authors: Bowen Chai, Zheng Chen, Libo Zhu, Wenbo Li, Yong Guo, Yulun Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04485
Pdf URL: https://arxiv.org/pdf/2508.04485
Copy Paste: [[2508.04485]] QuantVSR: Low-Bit Post-Training Quantization for Real-World Video Super-Resolution(https://arxiv.org/abs/2508.04485)
Keywords: super-resolution
Abstract: Diffusion models have shown superior performance in real-world video super-resolution (VSR). However, the slow processing speeds and heavy resource consumption of diffusion models hinder their practical application and deployment. Quantization offers a potential solution for compressing the VSR model. Nevertheless, quantizing VSR models is challenging due to their temporal characteristics and high fidelity requirements. To address these issues, we propose QuantVSR, a low-bit quantization model for real-world VSR. We propose a spatio-temporal complexity aware (STCA) mechanism, where we first utilize the calibration dataset to measure both spatial and temporal complexities for each layer. Based on these statistics, we allocate layer-specific ranks to the low-rank full-precision (FP) auxiliary branch. Subsequently, we jointly refine the FP and low-bit branches to achieve simultaneous optimization. In addition, we propose a learnable bias alignment (LBA) module to reduce the biased quantization errors. Extensive experiments on synthetic and real-world datasets demonstrate that our method obtains comparable performance with the FP model and significantly outperforms recent leading low-bit quantization methods. Code is available at: this https URL.
摘要：扩散模型在现实世界视频超分辨率（VSR）中表现出了出色的性能。但是，扩散模型的缓慢处理速度和大量资源消耗阻碍了其实际应用和部署。量化为压缩VSR模型提供了潜在的解决方案。然而，由于其时间特征和高保真要求，量化VSR模型的质疑是具有挑战性的。为了解决这些问题，我们提出了QuantVSR，这是现实世界中VSR的低位量化模型。我们提出了一个时空复杂性意识（STCA）机制，在该机制中，我们首先利用校准数据集测量每一层的时空复杂性。基于这些统计数据，我们将特定层的等级分配给低级别全精确（FP）辅助分支。随后，我们共同完善了FP和低位分支以实现同时优化。此外，我们提出了一个可学习的偏置对准（LBA）模块，以减少偏见的量化误差。关于合成和现实世界数据集的广泛实验表明，我们的方法与FP模型获得了可比的性能，并且明显胜过最近的领先低位量化方法。代码可用：此HTTPS URL。

Title: RAIDX: A Retrieval-Augmented Generation and GRPO Reinforcement Learning Framework for Explainable Deepfake Detection

Authors: Tianxiao Li, Zhenglin Huang, Haiquan Wen, Yiwei He, Shuchang Lyu, Baoyuan Wu, Guangliang Cheng
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.04524
Pdf URL: https://arxiv.org/pdf/2508.04524
Copy Paste: [[2508.04524]] RAIDX: A Retrieval-Augmented Generation and GRPO Reinforcement Learning Framework for Explainable Deepfake Detection(https://arxiv.org/abs/2508.04524)
Keywords: generation
Abstract: The rapid advancement of AI-generation models has enabled the creation of hyperrealistic imagery, posing ethical risks through widespread misinformation. Current deepfake detection methods, categorized as face specific detectors or general AI-generated detectors, lack transparency by framing detection as a classification task without explaining decisions. While several LLM-based approaches offer explainability, they suffer from coarse-grained analyses and dependency on labor-intensive annotations. This paper introduces RAIDX (Retrieval-Augmented Image Deepfake Detection and Explainability), a novel deepfake detection framework integrating Retrieval-Augmented Generation (RAG) and Group Relative Policy Optimization (GRPO) to enhance detection accuracy and decision explainability. Specifically, RAIDX leverages RAG to incorporate external knowledge for improved detection accuracy and employs GRPO to autonomously generate fine-grained textual explanations and saliency maps, eliminating the need for extensive manual annotations. Experiments on multiple benchmarks demonstrate RAIDX's effectiveness in identifying real or fake, and providing interpretable rationales in both textual descriptions and saliency maps, achieving state-of-the-art detection performance while advancing transparency in deepfake identification. RAIDX represents the first unified framework to synergize RAG and GRPO, addressing critical gaps in accuracy and explainability. Our code and models will be publicly available.
摘要：AI生成模型的快速发展使创造了高现实的图像，从而通过广泛的错误信息构成了道德风险。当前的DeepFake检测方法被归类为面部特定检测器或一般AI生成的检测器，通过将检测作为分类任务构架而没有解释决策，缺乏透明度。尽管几种基于LLM的方法提供了解释性，但它们遭受了粗粒分析和对劳动密集型注释的依赖。本文介绍了RAIDX（检索图像深击检测和解释性），这是一种新型的深膜检测框架，该框架整合了检索效果生成（RAG）和组相对策略优化（GRPO），以提高检测准确性并进行决策解释性。具体而言，RAIDX利用抹布纳入了外部知识以提高检测准确性，并采用GRPO自主产生细粒度的文本解释和显着图，从而消除了对大量手动注释的需求。对多个基准测试的实验表明RAIDX在识别真实或虚假方面的有效性，并在文本描述和显着图中提供了可解释的理由，从而实现了最先进的检测性能，同时促进了DeepFake识别中的透明度。 RAIDX代表了协同抹布和GRPO的第一个统一框架，解决了准确性和解释性的关键差距。我们的代码和模型将公开可用。

Title: MSC: A Marine Wildlife Video Dataset with Grounded Segmentation and Clip-Level Captioning

Authors: Quang-Trung Truong, Yuk-Kwan Wong, Vo Hoang Kim Tuyen Dang, Rinaldi Gotama, Duc Thanh Nguyen, Sai-Kit Yeung
Subjects: cs.CV, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2508.04549
Pdf URL: https://arxiv.org/pdf/2508.04549
Copy Paste: [[2508.04549]] MSC: A Marine Wildlife Video Dataset with Grounded Segmentation and Clip-Level Captioning(https://arxiv.org/abs/2508.04549)
Keywords: generation
Abstract: Marine videos present significant challenges for video understanding due to the dynamics of marine objects and the surrounding environment, camera motion, and the complexity of underwater scenes. Existing video captioning datasets, typically focused on generic or human-centric domains, often fail to generalize to the complexities of the marine environment or to yield insights about marine life. To address these limitations, we propose a two-stage marine object-oriented video captioning pipeline. We introduce a comprehensive video understanding benchmark that leverages the triplets of video, text, and segmentation masks to facilitate visual grounding and captioning, leading to improved marine video understanding and analysis, and marine video generation. Additionally, we highlight the effectiveness of video splitting in order to detect salient object transitions in scene changes, which significantly enriches the semantics of captioning content. Our dataset and code have been released at this https URL.
摘要：由于海洋物体的动态以及周围环境，摄像头运动以及水下场景的复杂性，海洋视频对视频理解提出了重大挑战。现有的视频字幕数据集（通常集中在通用或以人为中心的领域）通常无法推广到海洋环境的复杂性并获得有关海洋生物的见解。为了解决这些限制，我们提出了一个两阶段的海洋面向对象的视频字幕管道。我们介绍了一个全面的视频理解基准，该基准利用视频，文本和细分面罩的三联体来促进视觉接地和字幕，从而改善了海洋视频理解和分析以及海洋视频的生成。此外，我们强调了视频分裂的有效性，以检测场景变化中的显着对象过渡，从而显着丰富了字幕内容的语义。我们的数据集和代码已在此HTTPS URL上发布。

Title: Drone Detection with Event Cameras

Authors: Gabriele Magrini, Lorenzo Berlincioni, Luca Cultrera, Federico Becattini, Pietro Pala
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04564
Pdf URL: https://arxiv.org/pdf/2508.04564
Copy Paste: [[2508.04564]] Drone Detection with Event Cameras(https://arxiv.org/abs/2508.04564)
Keywords: generation
Abstract: The diffusion of drones presents significant security and safety challenges. Traditional surveillance systems, particularly conventional frame-based cameras, struggle to reliably detect these targets due to their small size, high agility, and the resulting motion blur and poor performance in challenging lighting conditions. This paper surveys the emerging field of event-based vision as a robust solution to these problems. Event cameras virtually eliminate motion blur and enable consistent detection in extreme lighting. Their sparse, asynchronous output suppresses static backgrounds, enabling low-latency focus on motion cues. We review the state-of-the-art in event-based drone detection, from data representation methods to advanced processing pipelines using spiking neural networks. The discussion extends beyond simple detection to cover more sophisticated tasks such as real-time tracking, trajectory forecasting, and unique identification through propeller signature analysis. By examining current methodologies, available datasets, and the distinct advantages of the technology, this work demonstrates that event-based vision provides a powerful foundation for the next generation of reliable, low-latency, and efficient counter-UAV systems.
摘要：无人机的扩散提出了重大安全性和安全挑战。传统的监视系统，尤其是基于框架的传统相机，由于其尺寸较小，敏捷性高以及在挑战性的照明条件下产生的运动模糊和性能差而难以可靠地检测到这些目标。本文将基于事件的视野作为解决这些问题的强大解决方案来调查新兴领域。事件摄像机实际上消除了运动模糊，并在极端照明中实现一致的检测。它们稀疏的异步输出抑制了静态背景，从而使低延迟的运动提示能够关注运动线索。我们回顾了基于事件的无人机检测中的最新技术，从数据表示方法到使用尖峰神经网络的高级处理管道。讨论扩展了简单检测，以涵盖更复杂的任务，例如实时跟踪，轨迹预测和通过螺旋桨签名分析的唯一识别。通过检查当前的方法论，可用数据集以及该技术的独特优势，这项工作表明，基于事件的视觉为下一代可靠，低延迟和高效的反UAV系统提供了强大的基础。

Title: Analyzing and Mitigating Object Hallucination: A Training Bias Perspective

Authors: Yifan Li, Kun Zhou, Wayne Xin Zhao, Lei Fang, Ji-Rong Wen
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2508.04567
Pdf URL: https://arxiv.org/pdf/2508.04567
Copy Paste: [[2508.04567]] Analyzing and Mitigating Object Hallucination: A Training Bias Perspective(https://arxiv.org/abs/2508.04567)
Keywords: generative
Abstract: Although scaling up training data has significantly improved the general multimodal capabilities of Large Vision-Language Models (LVLMs), they still suffer from the hallucination issue, generating text that is inconsistent with the visual input. This phenomenon motivates us to systematically investigate the role of training data in hallucination. We introduce a new benchmark, POPEv2, which consists of counterfactual images collected from the training data of LVLMs with certain objects masked. Through comprehensive evaluation on POPEv2, we find that current LVLMs suffer from training bias: they fail to fully leverage their training data and hallucinate more frequently on images seen during training. Specifically, they perform poorly on counterfactual images, often incorrectly answering "Yes" to questions about masked objects. To understand this issue, we conduct probing experiments on the models' internal components, revealing that this training bias is primarily located in the language modeling (LM) head. Based on these findings, we propose Obliviate, an efficient and lightweight unlearning method designed to mitigate object hallucination via training bias unlearning. Obliviate identifies the discrepancy between ground-truth labels and model outputs on the training data as a proxy for bias and adopts a parameter- and data-efficient fine-tuning strategy that only updates the LM head. Extensive experiments demonstrate the effectiveness of our approach. While only reusing the training data and updating approximately 2% of the parameters, Obliviate significantly reduces hallucination across both discriminative and generative tasks. Furthermore, it demonstrates strong scalability with respect to both model size (2B to 72B) and training data volume, and exhibits promising generalization to hallucination types beyond object-level hallucination. Our code and data will be publicly released.
摘要：随着扩大训练数据显着提高了大型视觉模型（LVLMS）的一般多模式能力，它们仍然遭受幻觉问题的困扰，从而产生了与视觉输入不一致的文本。这种现象激励我们系统地研究训练数据在幻觉中的作用。我们引入了一个新的基准测试POPEV2，该基准由从LVLM的训练数据中收集的反事实图像和某些对象掩盖的图像组成。通过对POPEV2的全面评估，我们发现当前的LVLM遭受了训练偏见的困扰：他们无法充分利用训练数据，而在训练过程中看到的图像更频繁地幻觉。具体来说，它们在反事实图像上的表现较差，通常会错误地回答``是''的问题。为了了解这个问题，我们对模型的内部组件进行探测实验，表明这种训练偏见主要位于语言建模（LM）头部。根据这些发现，我们提出了一种遗忘，这是一种旨在通过训练偏见来减轻对象幻觉的有效且轻巧的学习方法。遗忘将训练数据上的地面真相标签和模型输出之间的差异是偏见的代理，并采用了仅更新LM头的参数和数据有效的微调策略。广泛的实验证明了我们方法的有效性。虽然仅重用训练数据并更新大约2 \％的参数，但抛弃会大大减少歧视性和生成任务的幻觉。此外，它在模型大小（2B至72B）和训练数据量方面表现出强大的可伸缩性，并且表现出对幻觉幻觉以外的幻觉类型的有希望的概括。我们的代码和数据将公开发布。

Title: DDTracking: A Deep Generative Framework for Diffusion MRI Tractography with Streamline Local-Global Spatiotemporal Modeling

Authors: Yijie Li, Wei Zhang, Xi Zhu, Ye Wu, Yogesh Rathi, Lauren J. O'Donnell, Fan Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04568
Pdf URL: https://arxiv.org/pdf/2508.04568
Copy Paste: [[2508.04568]] DDTracking: A Deep Generative Framework for Diffusion MRI Tractography with Streamline Local-Global Spatiotemporal Modeling(https://arxiv.org/abs/2508.04568)
Keywords: generative
Abstract: This paper presents DDTracking, a novel deep generative framework for diffusion MRI tractography that formulates streamline propagation as a conditional denoising diffusion process. In DDTracking, we introduce a dual-pathway encoding network that jointly models local spatial encoding (capturing fine-scale structural details at each streamline point) and global temporal dependencies (ensuring long-range consistency across the entire streamline). Furthermore, we design a conditional diffusion model module, which leverages the learned local and global embeddings to predict streamline propagation orientations for tractography in an end-to-end trainable manner. We conduct a comprehensive evaluation across diverse, independently acquired dMRI datasets, including both synthetic and clinical data. Experiments on two well-established benchmarks with ground truth (ISMRM Challenge and TractoInferno) demonstrate that DDTracking largely outperforms current state-of-the-art tractography methods. Furthermore, our results highlight DDTracking's strong generalizability across heterogeneous datasets, spanning varying health conditions, age groups, imaging protocols, and scanner types. Collectively, DDTracking offers anatomically plausible and robust tractography, presenting a scalable, adaptable, and end-to-end learnable solution for broad dMRI applications. Code is available at: this https URL
摘要：本文介绍了DDTracking，这是一种扩散MRI拖拉术的新型深层生成框架，该框架将流线传播作为条件降解扩散过程。在DDTRACKING中，我们引入了一个双围路编码网络，该网络共同对本地空间编码进行建模（在每个流线点处捕获精细的结构细节）和全局的时间依赖性（确保整个流线的远距离一致性）。此外，我们设计了一个有条件的扩散模型模块，该模块利用学到的本地和全局嵌入来预测以端到端的可训练方式进行拖拉术的简化传播方向。我们对各种独立获取的DMRI数据集进行了全面的评估，包括合成数据和临床数据。在两个具有基础真理（ISMRM挑战和Tractoinferno）的基准的实验表明，DDTRACK的实验在很大程度上要比当前最新的拖拉术方法胜过。此外，我们的结果突出了DdTracking在异质数据集中的强大推广性，涵盖了不同的健康状况，年龄组，成像协议和扫描仪类型。总体而言，DdTracking提供了解剖学上合理且健壮的片段，为广泛的DMRI应用提供了可扩展，适应性和端到端的可学习解决方案。代码可用：此HTTPS URL

Title: Improved Training Strategies for Physics-Informed Neural Networks using Real Experimental Data in Aluminum Spot Welding

Authors: Jan A. Zak, Christian Weißenfels
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.04595
Pdf URL: https://arxiv.org/pdf/2508.04595
Copy Paste: [[2508.04595]] Improved Training Strategies for Physics-Informed Neural Networks using Real Experimental Data in Aluminum Spot Welding(https://arxiv.org/abs/2508.04595)
Keywords: quality assessment
Abstract: Resistance spot welding is the dominant joining process for the body-in-white in the automotive industry, where the weld nugget diameter is the key quality metric. Its measurement requires destructive testing, limiting the potential for efficient quality control. Physics-informed neural networks were investigated as a promising tool to reconstruct internal process states from experimental data, enabling model-based and non-invasive quality assessment in aluminum spot welding. A major challenge is the integration of real-world data into the network due to competing optimization objectives. To address this, we introduce two novel training strategies. First, experimental losses for dynamic displacement and nugget diameter are progressively included using a fading-in function to prevent excessive optimization conflicts. We also implement a custom learning rate scheduler and early stopping based on a rolling window to counteract premature reduction due to increased loss magnitudes. Second, we introduce a conditional update of temperature-dependent material parameters via a look-up table, activated only after a loss threshold is reached to ensure physically meaningful temperatures. An axially symmetric two-dimensional model was selected to represent the welding process accurately while maintaining computational efficiency. To reduce computational burden, the training strategies and model components were first systematically evaluated in one dimension, enabling controlled analysis of loss design and contact models. The two-dimensional network predicts dynamic displacement and nugget growth within the experimental confidence interval, supports transferring welding stages from steel to aluminum, and demonstrates strong potential for fast, model-based quality control in industrial applications.
摘要：阻力点焊接是汽车行业中白色物体的主要连接过程，在该工业中，焊接掘金直径是关键质量指标。它的测量需要破坏性测试，从而限制了有效质量控制的潜力。研究了物理知识的神经网络作为一种有前途的工具，可以从实验数据中重建内部过程状态，从而在铝斑点焊接中基于模型和非侵入性质量评估。一个主要的挑战是由于竞争优化目标，将现实世界数据集成到网络中。为了解决这个问题，我们介绍了两种新颖的培训策略。首先，动态位移和掘金直径的实验损失逐渐使用逐渐使用功能来防止过度优化冲突。我们还基于滚动窗口来实施自定义学习率调度程序，并根据滚动窗口提早停止，以抵消损失增加而过早减少。其次，我们通过查找表介绍了与温度依赖性材料参数的条件更新，仅在达到损失阈值以确保身体有意义的温度后才激活。选择了轴向对称的二维模型，以准确表示焊接过程，同时保持计算效率。为了减轻计算负担，首先在一个维度上系统地评估培训策略和模型组件，从而可以对损失设计和接触模型进行控制分析。二维网络可以预测实验置信区间内的动态位移和掘金生长，支持将焊接阶段从钢转移到铝，并在工业应用中表现出强大的基于模型的质量控制的强大潜力。
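The first training strategy in the abstract, progressively including experimental losses through a fading-in function, can be sketched as a weight that ramps from 0 to 1 over training. The abstract does not give the actual function, so the cosine ramp below, the parameter names `start` and `duration`, and their default values are all assumptions made for illustration.

```python
import math


def fade_in(epoch: int, start: int, duration: int) -> float:
    """Weight in [0, 1]: 0 before `start`, 1 after `start + duration`, smooth in between."""
    if duration <= 0:
        raise ValueError("duration must be positive")
    # Normalized progress through the ramp, clamped to [0, 1].
    t = min(1.0, max(0.0, (epoch - start) / duration))
    # Cosine ramp (an assumed shape): zero slope at both ends of the ramp.
    return 0.5 * (1.0 - math.cos(math.pi * t))


def total_loss(physics_loss: float, data_loss: float, epoch: int,
               start: int = 1000, duration: int = 2000) -> float:
    """Physics residual loss plus the experimental-data loss, faded in over time."""
    return physics_loss + fade_in(epoch, start, duration) * data_loss
```

Early in training the network is driven by the physics loss alone; the experimental terms then gain influence gradually, which is one way to limit the optimization conflicts the abstract refers to.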

Title: Multitask Learning with Stochastic Interpolants

Authors: Hugo Negrel, Florentin Coeurdoux, Michael S. Albergo, Eric Vanden-Eijnden
Subjects: cs.LG, math.DS
Abstract URL: https://arxiv.org/abs/2508.04605
Pdf URL: https://arxiv.org/pdf/2508.04605
Copy Paste: [[2508.04605]] Multitask Learning with Stochastic Interpolants(https://arxiv.org/abs/2508.04605)
Keywords: generation, generative
Abstract: We propose a framework for learning maps between probability distributions that broadly generalizes the time dynamics of flow and diffusion models. To enable this, we generalize stochastic interpolants by replacing the scalar time variable with vectors, matrices, or linear operators, allowing us to bridge probability distributions across multiple dimensional spaces. This approach enables the construction of versatile generative models capable of fulfilling multiple tasks without task-specific training. Our operator-based interpolants not only provide a unifying theoretical perspective for existing generative models but also extend their capabilities. Through numerical experiments, we demonstrate the zero-shot efficacy of our method on conditional generation and inpainting, fine-tuning and posterior sampling, and multiscale modeling, suggesting its potential as a generic task-agnostic alternative to specialized models.
摘要：我们提出了一个框架，用于在概率分布之间进行学习图，从而广泛地概括了流量和扩散模型的时间动力学。为了实现这一目标，我们通过用向量，矩阵或线性运算符替换标量时间变量来概括随机插值，从而使我们能够在多个维空间上桥接概率分布。这种方法可以构建能够在没有特定任务培训的情况下完成多个任务的多功能生成模型。我们的基于操作员的插值不仅为现有生成模型提供了统一的理论观点，而且还扩展了其功能。通过数值实验，我们证明了我们方法对条件生成和内介质，微调和后验采样以及多尺度建模的零发效率，这表明其作为专用模型的通用任务敏捷替代方案的潜力。
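Replacing the scalar time variable with a vector can be illustrated with the simplest linear interpolant. This is a minimal sketch, assuming a coordinate-wise form `x_alpha = (1 - alpha) * x0 + alpha * x1`; the paper's actual interpolant coefficients and operators are not given in the abstract, and the function names are hypothetical. Setting every entry of `alpha` to the same `t` recovers the usual scalar-time interpolant, while an inpainting-style choice pins known coordinates to the data (`alpha = 1`) and evolves only the unknown ones.

```python
from typing import List, Sequence


def interpolate(x0: Sequence[float], x1: Sequence[float], alpha: Sequence[float]) -> List[float]:
    """Coordinate-wise linear interpolant between base sample x0 and data sample x1."""
    if not (len(x0) == len(x1) == len(alpha)):
        raise ValueError("x0, x1 and alpha must have the same length")
    return [(1.0 - a) * u + a * v for u, v, a in zip(x0, x1, alpha)]


def inpainting_time(known_mask: Sequence[bool], t: float) -> List[float]:
    """Vector 'time': known coordinates sit at the data end (1.0), the rest at t."""
    return [1.0 if known else t for known in known_mask]
```

With `x0 = [0, 0, 0, 0]`, `x1 = [1, 2, 3, 4]`, a mask that marks the first and third coordinates as known, and `t = 0.5`, the interpolant keeps the known coordinates at their data values and places the others halfway between base and data.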

Title: CaPulse: Detecting Anomalies by Tuning in to the Causal Rhythms of Time Series

Authors: Yutong Xia, Yingying Zhang, Yuxuan Liang, Lunting Fan, Qingsong Wen, Roger Zimmermann
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.04630
Pdf URL: https://arxiv.org/pdf/2508.04630
Copy Paste: [[2508.04630]] CaPulse: Detecting Anomalies by Tuning in to the Causal Rhythms of Time Series(https://arxiv.org/abs/2508.04630)
Keywords: generation
Abstract: Time series anomaly detection has garnered considerable attention across diverse domains. However, existing methods often fail to capture the underlying mechanisms behind anomaly generation in time series data. In addition, time series anomaly detection often faces several data-related inherent challenges, i.e., label scarcity, data imbalance, and complex multi-periodicity. In this paper, we leverage causal tools and introduce a new causality-based framework, CaPulse, which tunes in to the underlying causal pulse of time series data to effectively detect anomalies. Concretely, we begin by building a structural causal model to decipher the generation processes behind anomalies. To tackle the challenges posed by the data, we propose Periodical Normalizing Flows with a novel mask mechanism and carefully designed periodical learners, creating a periodicity-aware, density-based anomaly detection approach. Extensive experiments on seven real-world datasets demonstrate that CaPulse consistently outperforms existing methods, achieving AUROC improvements of 3% to 17%, with enhanced interpretability.
摘要：时间序列的异常检测吸引了各个领域的大量关注。尽管现有方法通常无法捕获时间序列数据中异常生成背后的基本机制。此外，时间序列异常检测通常面临几个与数据相关的固有挑战，即标签稀缺，数据失衡和复杂的多周期性。在本文中，我们利用因果工具，并引入了一个新的基于因果关系的框架Capulse，该框架调整了时间序列数据的基本因果脉冲以有效地检测异常情况。具体而言，我们首先建立一个结构性因果模型，以破译异常背后的产生过程。为了应对数据带来的挑战，我们提出了使用新颖的掩码机制和精心设计的定期学习者进行定期归一化流，从而创建了一种定期感知的，基于密度的，基于密度的异常检测方法。在七个现实世界数据集上进行的广泛实验表明，Capulse始终胜过现有方法，实现AUROC改善3％至17％，具有增强的可解释性。