2025-08-04

Title: ECG Latent Feature Extraction with Autoencoders for Downstream Prediction Tasks

Authors: Christopher Harvey, Sumaiya Shomaji, Zijun Yao, Amit Noheria
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.00131
Pdf URL: https://arxiv.org/pdf/2508.00131
Copy Paste: [[2508.00131]] ECG Latent Feature Extraction with Autoencoders for Downstream Prediction Tasks(https://arxiv.org/abs/2508.00131)
Keywords: generation
Abstract: The electrocardiogram (ECG) is an inexpensive and widely available tool for cardiac assessment. Despite its standardized format and small file size, the high complexity and inter-individual variability of ECG signals (typically a 60,000-size vector with 12 leads at 500 Hz) make it challenging to use in deep learning models, especially when only small training datasets are available. This study addresses these challenges by exploring feature generation methods from representative beat ECGs, focusing on Principal Component Analysis (PCA) and Autoencoders to reduce data complexity. We introduce three novel Variational Autoencoder (VAE) variants: Stochastic Autoencoder (SAE), Annealed beta-VAE (A beta-VAE), and Cyclical beta-VAE (C beta-VAE). We compare their effectiveness in maintaining signal fidelity and enhancing downstream prediction tasks using a Light Gradient Boosting Machine (LGBM). The A beta-VAE achieved superior signal reconstruction, reducing the mean absolute error (MAE) to 15.7 ± 3.2 μV, which is at the level of signal noise. Moreover, the SAE encodings, when combined with traditional ECG summary features, improved the prediction of reduced Left Ventricular Ejection Fraction (LVEF), achieving a holdout test set area under the receiver operating characteristic curve (AUROC) of 0.901 with an LGBM classifier. This performance nearly matches the 0.909 AUROC of a state-of-the-art CNN model but requires significantly fewer computational resources. Further, the ECG feature extraction-LGBM pipeline avoids overfitting and retains predictive performance when trained with less data. Our findings demonstrate that these VAE encodings are not only effective in simplifying ECG data but also provide a practical solution for applying deep learning in contexts with limited-scale labeled training data.
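
The annealed and cyclical beta-VAE variants named in the abstract above differ in how the KL weight beta changes during training. The abstract does not give the schedules, so the following is only a minimal sketch assuming the linear warm-up and cyclical restart schedules common in the VAE literature; the function names and the parameters `warmup_steps`, `cycle_steps`, `ramp_fraction`, and `beta_max` are hypothetical.

```python
def annealed_beta(step, warmup_steps, beta_max=1.0):
    """Assumed schedule: ramp beta linearly from 0 to beta_max, then hold."""
    return beta_max * min(1.0, step / warmup_steps)


def cyclical_beta(step, cycle_steps, ramp_fraction=0.5, beta_max=1.0):
    """Assumed schedule: restart the linear ramp at the start of every cycle."""
    position = (step % cycle_steps) / cycle_steps  # position within cycle, in [0, 1)
    return beta_max * min(1.0, position / ramp_fraction)


def vae_loss(reconstruction_error, kl_divergence, beta):
    """Standard beta-VAE objective: reconstruction term plus weighted KL term."""
    return reconstruction_error + beta * kl_divergence
```

With a small beta early in training the encoder can prioritise reconstruction, which is one plausible reason a scheduled variant could reach low reconstruction error; the abstract itself does not state the mechanism.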

Title: World Consistency Score: A Unified Metric for Video Generation Quality

Authors: Akshat Rakheja, Aarsh Ashdhir, Aryan Bhattacharjee, Vanshika Sharma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.00144
Pdf URL: https://arxiv.org/pdf/2508.00144
Copy Paste: [[2508.00144]] World Consistency Score: A Unified Metric for Video Generation Quality(https://arxiv.org/abs/2508.00144)
Keywords: generation, generative
Abstract: We introduce World Consistency Score (WCS), a novel unified evaluation metric for generative video models that emphasizes internal world consistency of the generated videos. WCS integrates four interpretable sub-components - object permanence, relation stability, causal compliance, and flicker penalty - each measuring a distinct aspect of temporal and physical coherence in a video. These submetrics are combined via a learned weighted formula to produce a single consistency score that aligns with human judgments. We detail the motivation for WCS in the context of existing video evaluation metrics, formalize each submetric and how it is computed with open-source tools (trackers, action recognizers, CLIP embeddings, optical flow), and describe how the weights of the WCS combination are trained using human preference data. We also outline an experimental validation blueprint: using benchmarks like VBench-2.0, EvalCrafter, and LOVE to test WCS's correlation with human evaluations, performing sensitivity analyses, and comparing WCS against established metrics (FVD, CLIPScore, VBench, FVMD). The proposed WCS offers a comprehensive and interpretable framework for evaluating video generation models on their ability to maintain a coherent "world" over time, addressing gaps left by prior metrics focused only on visual fidelity or prompt alignment.
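
The abstract above says the four submetrics are combined through a learned weighted formula but does not give its form. As an illustration only, the sketch below assumes a plain linear combination; the weights shown in the test values are made up, and the flicker penalty is assumed to enter with a negative weight.

```python
SUBMETRICS = (
    "object_permanence",
    "relation_stability",
    "causal_compliance",
    "flicker_penalty",
)


def world_consistency_score(submetrics, weights, bias=0.0):
    """Assumed linear form: bias plus the weighted sum of the four submetrics."""
    return bias + sum(weights[name] * submetrics[name] for name in SUBMETRICS)
```

In the paper the weights are trained on human preference data; here they would simply be passed in as a dictionary keyed by submetric name.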

Title: Watch the Weights: Unsupervised monitoring and control of fine-tuned LLMs

Authors: Ziqian Zhong, Aditi Raghunathan
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2508.00161
Pdf URL: https://arxiv.org/pdf/2508.00161
Copy Paste: [[2508.00161]] Watch the Weights: Unsupervised monitoring and control of fine-tuned LLMs(https://arxiv.org/abs/2508.00161)
Keywords: generation
Abstract: The releases of powerful open-weight large language models (LLMs) are often not accompanied by access to their full training data. Existing interpretability methods, particularly those based on activations, often require or assume distributionally similar data. This is a significant limitation when detecting and defending against novel potential threats like backdoors, which are by definition out-of-distribution. In this work, we introduce a new method for understanding, monitoring and controlling fine-tuned LLMs that interprets weights, rather than activations, thereby sidestepping the need for data that is distributionally similar to the unknown training data. We demonstrate that the top singular vectors of the weight difference between a fine-tuned model and its base model correspond to newly acquired behaviors. By monitoring the cosine similarity of activations along these directions, we can detect salient behaviors introduced during fine-tuning with high precision. For backdoored models that bypass safety mechanisms when a secret trigger is present, our method stops up to 100% of attacks with a false positive rate below 1.2%. For models that have undergone unlearning, we detect inference on erased topics with accuracy up to 95.42% and can even steer the model to recover "unlearned" information. Besides monitoring, our method also shows potential for pre-deployment model auditing: by analyzing commercial instruction-tuned models (OLMo, Llama, Qwen), we are able to uncover model-specific fine-tuning focuses, including marketing strategies and Midjourney prompt generation. Our implementation can be found at this https URL.
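
The weight-based monitoring described above can be sketched with NumPy for a single weight matrix. This is an illustration under assumptions, not the authors' implementation: it assumes the weight is stored as (outputs × inputs), uses the right singular vectors as the monitored directions, and the function names are hypothetical.

```python
import numpy as np


def top_singular_directions(w_base, w_tuned, k=1):
    """Return the top-k right singular vectors of the weight difference.

    Each row is a unit direction in the layer's input space (assumed convention).
    """
    delta = w_tuned - w_base
    _, _, vt = np.linalg.svd(delta, full_matrices=False)
    return vt[:k]


def cosine_scores(activation, directions):
    """Cosine similarity between one activation vector and each direction."""
    unit_activation = activation / (np.linalg.norm(activation) + 1e-12)
    unit_directions = directions / (
        np.linalg.norm(directions, axis=1, keepdims=True) + 1e-12
    )
    return unit_directions @ unit_activation
```

A monitor built on this would flag inputs whose absolute score exceeds a threshold calibrated on ordinary traffic; how the threshold is chosen is outside what the abstract states.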

Title: DiSC-Med: Diffusion-based Semantic Communications for Robust Medical Image Transmission

Authors: Fupei Guo, Hao Zheng, Xiang Zhang, Li Chen, Yue Wang, Songyang Zhang
Subjects: cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2508.00172
Pdf URL: https://arxiv.org/pdf/2508.00172
Copy Paste: [[2508.00172]] DiSC-Med: Diffusion-based Semantic Communications for Robust Medical Image Transmission(https://arxiv.org/abs/2508.00172)
Keywords: generation
Abstract: The rapid development of artificial intelligence has driven smart health with next-generation wireless communication technologies, stimulating exciting applications in remote diagnosis and intervention. To enable a timely and effective response for remote healthcare, efficient transmission of medical data through noisy channels with limited bandwidth emerges as a critical challenge. In this work, we propose a novel diffusion-based semantic communication framework, namely DiSC-Med, for medical image transmission, where medical-enhanced compression and denoising blocks are developed for bandwidth efficiency and robustness, respectively. Unlike conventional pixel-wise communication frameworks, our proposed DiSC-Med is able to capture the key semantic information and achieve superior reconstruction performance with ultra-high bandwidth efficiency against noisy channels. Extensive experiments on real-world medical datasets validate the effectiveness of our framework, demonstrating its potential for robust and efficient telehealth applications.

Title: EMA Without the Lag: Bias-Corrected Iterate Averaging Schemes

Authors: Adam Block, Cyril Zhang
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2508.00180
Pdf URL: https://arxiv.org/pdf/2508.00180
Copy Paste: [[2508.00180]] EMA Without the Lag: Bias-Corrected Iterate Averaging Schemes(https://arxiv.org/abs/2508.00180)
Keywords: generation
Abstract: Stochasticity in language model fine-tuning, often caused by the small batch sizes typically used in this regime, can destabilize training by introducing large oscillations in generation quality. A popular approach to mitigating this instability is to take an Exponential Moving Average (EMA) of weights throughout training. While EMA reduces stochasticity, thereby smoothing training, the introduction of bias from old iterates often creates a lag in optimization relative to vanilla training. In this work, we propose the Bias-Corrected Exponential Moving Average (BEMA), a simple and practical augmentation of EMA that retains variance-reduction benefits while eliminating bias. BEMA is motivated by a simple theoretical model wherein we demonstrate provable acceleration of BEMA over both a standard EMA and vanilla training. Through an extensive suite of experiments on language models, we show that BEMA leads to significantly improved convergence rates and final performance over both EMA and vanilla training in a variety of standard LM benchmarks, making BEMA a practical and theoretically motivated intervention for more stable and efficient fine-tuning.
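
The lag that the abstract above attributes to plain EMA can be seen on a toy sequence. The sketch below implements only the standard EMA, not BEMA (whose correction term is not given in the abstract), and applies it to a scalar "weight" that grows by one unit per step. For such a linear drift the EMA settles a constant decay / (1 - decay) behind the current value.

```python
def ema(values, decay):
    """Standard exponential moving average, initialised at the first value."""
    average = values[0]
    averaged = [average]
    for value in values[1:]:
        average = decay * average + (1.0 - decay) * value
        averaged.append(average)
    return averaged


# Toy trajectory: a parameter that moves steadily by +1 per optimisation step.
trajectory = [float(step) for step in range(500)]
lag = trajectory[-1] - ema(trajectory, decay=0.9)[-1]
```

With `decay=0.9` the steady-state lag is 0.9 / 0.1 = 9 steps of progress, which is the kind of bias from old iterates a correction scheme would need to remove.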

Title: Learning Personalised Human Internal Cognition from External Expressive Behaviours for Real Personality Recognition

Authors: Xiangyu Kong, Hengde Zhu, Haoqin Sun, Zhihao Guo, Jiayan Gu, Xinyi Ni, Wei Zhang, Shizhe Liu, Siyang Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.00205
Pdf URL: https://arxiv.org/pdf/2508.00205
Copy Paste: [[2508.00205]] Learning Personalised Human Internal Cognition from External Expressive Behaviours for Real Personality Recognition(https://arxiv.org/abs/2508.00205)
Keywords: generation
Abstract: Automatic real personality recognition (RPR) aims to evaluate human real personality traits from their expressive behaviours. However, most existing solutions generally act as external observers to infer observers' personality impressions based on target individuals' expressive behaviours, which significantly deviate from their real personalities and consistently lead to inferior recognition performance. Inspired by the association between real personality and human internal cognition underlying the generation of expressive behaviours, we propose a novel RPR approach that efficiently simulates personalised internal cognition from easily accessible short external audio-visual behaviours expressed by the target individual. The simulated personalised cognition, represented as a set of network weights that enforce the personalised network to reproduce the individual-specific facial reactions, is further encoded as a novel graph containing two-dimensional node and edge feature matrices, with a novel 2D Graph Neural Network (2D-GNN) proposed for inferring real personality traits from it. To simulate real personality-related cognition, an end-to-end strategy is designed to jointly train our cognition simulation, 2D graph construction, and personality recognition modules.

Title: Guided Depth Map Super-Resolution via Multi-Scale Fusion U-shaped Mamba Network

Authors: Chenggang Guo, Hao Xu, XianMing Wan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.00248
Pdf URL: https://arxiv.org/pdf/2508.00248
Copy Paste: [[2508.00248]] Guided Depth Map Super-Resolution via Multi-Scale Fusion U-shaped Mamba Network(https://arxiv.org/abs/2508.00248)
Keywords: super-resolution
Abstract: Depth map super-resolution technology aims to improve the spatial resolution of low-resolution depth maps and effectively restore high-frequency detail information. Traditional convolutional neural networks have limitations in dealing with long-range dependencies and are unable to fully model the global contextual information in depth maps. Although transformers can model global dependencies, their computational complexity and memory consumption are quadratic, which significantly limits their ability to process high-resolution depth maps. In this paper, we propose a multi-scale fusion U-shaped Mamba (MSF-UM) model, a novel guided depth map super-resolution framework. The core innovation of this model is to integrate Mamba's efficient state-space modeling capabilities into a multi-scale U-shaped fusion structure guided by a color image. We design a structure that combines the residual dense channel attention block with the Mamba state space module, pairing the local feature extraction capability of the convolutional layer with the state space model's advantage in modeling long-distance dependencies. At the same time, the model adopts a multi-scale cross-modal fusion strategy to make full use of the high-frequency texture information from the color image to guide the super-resolution process of the depth map. Compared with existing mainstream methods, the proposed MSF-UM significantly reduces the number of model parameters while achieving better reconstruction accuracy. Extensive experiments on multiple publicly available datasets validate the effectiveness of the model, especially showing excellent generalization ability in the task of large-scale depth map super-resolution.

Title: Instruction-Grounded Visual Projectors for Continual Learning of Generative Vision-Language Models

Authors: Hyundong Jin, Hyung Jin Chang, Eunwoo Kim
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2508.00260
Pdf URL: https://arxiv.org/pdf/2508.00260
Copy Paste: [[2508.00260]] Instruction-Grounded Visual Projectors for Continual Learning of Generative Vision-Language Models(https://arxiv.org/abs/2508.00260)
Keywords: generative
Abstract: Continual learning enables pre-trained generative vision-language models (VLMs) to incorporate knowledge from new tasks without retraining data from previous ones. Recent methods update a visual projector to translate visual information for new tasks, connecting pre-trained vision encoders with large language models. However, such adjustments may cause the models to prioritize visual inputs over language instructions, particularly when learning tasks with repetitive types of textual instructions. To address the neglect of language instructions, we propose a novel framework that grounds the translation of visual information on instructions for language models. We introduce a mixture of visual projectors, each serving as a specialized visual-to-language translation expert based on the given instruction context to adapt to new tasks. To avoid using experts for irrelevant instruction contexts, we propose an expert recommendation strategy that reuses experts for tasks similar to those previously learned. Additionally, we introduce expert pruning to alleviate interference from the use of experts that were cumulatively activated in previous tasks. Extensive experiments on diverse vision-language tasks demonstrate that our method outperforms existing continual learning approaches by generating instruction-following responses.

Title: AniMer+: Unified Pose and Shape Estimation Across Mammalia and Aves via Family-Aware Transformer

Authors: Jin Lyu, Liang An, Li Lin, Pujin Cheng, Yebin Liu, Xiaoying Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.00298
Pdf URL: https://arxiv.org/pdf/2508.00298
Copy Paste: [[2508.00298]] AniMer+: Unified Pose and Shape Estimation Across Mammalia and Aves via Family-Aware Transformer(https://arxiv.org/abs/2508.00298)
Keywords: generation
Abstract: In the era of foundation models, achieving a unified understanding of different dynamic objects through a single network has the potential to empower stronger spatial intelligence. Moreover, accurate estimation of animal pose and shape across diverse species is essential for quantitative analysis in biological research. However, this topic remains underexplored due to the limited network capacity of previous methods and the scarcity of comprehensive multi-species datasets. To address these limitations, we introduce AniMer+, an extended version of our scalable AniMer framework. In this paper, we focus on a unified approach for reconstructing mammals (mammalia) and birds (aves). A key innovation of AniMer+ is its high-capacity, family-aware Vision Transformer (ViT) incorporating a Mixture-of-Experts (MoE) design. Its architecture partitions network layers into taxa-specific components (for mammalia and aves) and taxa-shared components, enabling efficient learning of both distinct and common anatomical features within a single model. To overcome the critical shortage of 3D training data, especially for birds, we introduce a diffusion-based conditional image generation pipeline. This pipeline produces two large-scale synthetic datasets: CtrlAni3D for quadrupeds and CtrlAVES3D for birds. Notably, CtrlAVES3D is the first large-scale, 3D-annotated dataset for birds, which is crucial for resolving single-view depth ambiguities. Trained on an aggregated collection of 41.3k mammalian and 12.4k avian images (combining real and synthetic data), our method demonstrates superior performance over existing approaches across a wide range of benchmarks, including the challenging out-of-domain Animal Kingdom dataset. Ablation studies confirm the effectiveness of both our novel network architecture and the generated synthetic datasets in enhancing real-world application performance.

Title: Controllable Pedestrian Video Editing for Multi-View Driving Scenarios via Motion Sequence

Authors: Danzhen Fu, Jiagao Hu, Daiguo Zhou, Fei Wang, Zepeng Wang, Wenhua Liao
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2508.00299
Pdf URL: https://arxiv.org/pdf/2508.00299
Copy Paste: [[2508.00299]] Controllable Pedestrian Video Editing for Multi-View Driving Scenarios via Motion Sequence(https://arxiv.org/abs/2508.00299)
Keywords: generation
Abstract: Pedestrian detection models in autonomous driving systems often lack robustness due to insufficient representation of dangerous pedestrian scenarios in training datasets. To address this limitation, we present a novel framework for controllable pedestrian video editing in multi-view driving scenarios by integrating video inpainting and human motion control techniques. Our approach begins by identifying pedestrian regions of interest across multiple camera views, expanding detection bounding boxes with a fixed ratio, and resizing and stitching these regions into a unified canvas while preserving cross-view spatial relationships. A binary mask is then applied to designate the editable area, within which pedestrian editing is guided by pose sequence control conditions. This enables flexible editing functionalities, including pedestrian insertion, replacement, and removal. Extensive experiments demonstrate that our framework achieves high-quality pedestrian editing with strong visual realism, spatiotemporal coherence, and cross-view consistency. These results establish the proposed method as a robust and versatile solution for multi-view pedestrian video generation, with broad potential for applications in data augmentation and scenario simulation in autonomous driving.
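
One preprocessing step in the abstract above, expanding detection bounding boxes by a fixed ratio, can be sketched as follows. The abstract does not specify the ratio, the box format, or how image borders are handled, so this assumes `(x_min, y_min, x_max, y_max)` pixel boxes, expansion about the box centre, and clipping to the image; the function name is hypothetical.

```python
def expand_box(box, ratio, image_width, image_height):
    """Scale a box about its centre by `ratio` and clip it to the image bounds."""
    x_min, y_min, x_max, y_max = box
    center_x = (x_min + x_max) / 2.0
    center_y = (y_min + y_max) / 2.0
    half_width = (x_max - x_min) * ratio / 2.0
    half_height = (y_max - y_min) * ratio / 2.0
    return (
        max(0.0, center_x - half_width),
        max(0.0, center_y - half_height),
        min(float(image_width), center_x + half_width),
        min(float(image_height), center_y + half_height),
    )
```

Expanding the box gives the inpainting model surrounding context for the pedestrian region before the per-view crops are resized and stitched into one canvas.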

Title: Exploring Fourier Prior and Event Collaboration for Low-Light Image Enhancement

Authors: Chunyan She, Fujun Han, Chengyu Fang, Shukai Duan, Lidan Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.00308
Pdf URL: https://arxiv.org/pdf/2508.00308
Copy Paste: [[2508.00308]] Exploring Fourier Prior and Event Collaboration for Low-Light Image Enhancement(https://arxiv.org/abs/2508.00308)
Keywords: restoration
Abstract: The event camera, benefiting from its high dynamic range and low latency, provides performance gain for low-light image enhancement. Unlike frame-based cameras, it records intensity changes with extremely high temporal resolution, capturing sufficient structure information. Currently, existing event-based methods feed a frame and events directly into a single model without fully exploiting modality-specific advantages, which limits their performance. Therefore, by analyzing the role of each sensing modality, the enhancement pipeline is decoupled into two stages: visibility restoration and structure refinement. In the first stage, we design a visibility restoration network with amplitude-phase entanglement by rethinking the relationship between amplitude and phase components in Fourier space. In the second stage, a fusion strategy with dynamic alignment is proposed to mitigate the spatial mismatch caused by the temporal resolution discrepancy between two sensing modalities, aiming to refine the structure information of the image enhanced by the visibility restoration network. In addition, we utilize spatial-frequency interpolation to simulate negative samples with diverse illumination, noise and artifact degradations, thereby developing a contrastive loss that encourages the model to learn discriminative representations. Experiments demonstrate that the proposed method outperforms state-of-the-art models.

Title: GV-VAD : Exploring Video Generation for Weakly-Supervised Video Anomaly Detection

Authors: Suhang Cai, Xiaohao Peng, Chong Wang, Xiaojie Cai, Jiangbo Qian
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.00312
Pdf URL: https://arxiv.org/pdf/2508.00312
Copy Paste: [[2508.00312]] GV-VAD : Exploring Video Generation for Weakly-Supervised Video Anomaly Detection(https://arxiv.org/abs/2508.00312)
Keywords: generation, generative
Abstract: Video anomaly detection (VAD) plays a critical role in public safety applications such as intelligent surveillance. However, the rarity, unpredictability, and high annotation cost of real-world anomalies make it difficult to scale VAD datasets, which limits the performance and generalization ability of existing models. To address this challenge, we propose a generative video-enhanced weakly-supervised video anomaly detection (GV-VAD) framework that leverages text-conditioned video generation models to produce semantically controllable and physically plausible synthetic videos. These virtual videos are used to augment training data at low cost. In addition, a synthetic sample loss scaling strategy is utilized to control the influence of generated synthetic samples for efficient training. The experiments show that the proposed framework outperforms state-of-the-art methods on UCF-Crime datasets. The code is available at this https URL.

Title: Steering Guidance for Personalized Text-to-Image Diffusion Models

Authors: Sunghyun Park, Seokeon Choi, Hyoungwoo Park, Sungrack Yun
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2508.00319
Pdf URL: https://arxiv.org/pdf/2508.00319
Copy Paste: [[2508.00319]] Steering Guidance for Personalized Text-to-Image Diffusion Models(https://arxiv.org/abs/2508.00319)
Keywords: generation
Abstract: Personalizing text-to-image diffusion models is crucial for adapting the pre-trained models to specific target concepts, enabling diverse image generation. However, fine-tuning with few images introduces an inherent trade-off between aligning with the target distribution (e.g., subject fidelity) and preserving the broad knowledge of the original model (e.g., text editability). Existing sampling guidance methods, such as classifier-free guidance (CFG) and autoguidance (AG), fail to effectively guide the output toward a well-balanced space: CFG restricts the adaptation to the target distribution, while AG compromises text alignment. To address these limitations, we propose personalization guidance, a simple yet effective method leveraging an unlearned weak model conditioned on a null text prompt. Moreover, our method dynamically controls the extent of unlearning in a weak model through weight interpolation between pre-trained and fine-tuned models during inference. Unlike existing guidance methods, which depend solely on guidance scales, our method explicitly steers the outputs toward a balanced latent space without additional computational overhead. Experimental results demonstrate that our proposed guidance can improve text alignment and target distribution fidelity, integrating seamlessly with various fine-tuning strategies.
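
The two ingredients named in the abstract above, weight interpolation and guidance, can be sketched generically. This is an assumption-laden illustration rather than the paper's formulation: it assumes linear interpolation with a hypothetical coefficient `lam`, and it uses the extrapolation form shared by CFG and autoguidance, where a "weak" prediction is pushed toward a "strong" one. Scalars stand in for tensors.

```python
def interpolate_weights(pretrained, finetuned, lam):
    """Assumed linear interpolation: lam=0 gives pretrained, lam=1 gives fine-tuned."""
    return {
        name: (1.0 - lam) * pretrained[name] + lam * finetuned[name]
        for name in pretrained
    }


def guided_prediction(strong, weak, scale):
    """Generic guidance: extrapolate from the weak prediction toward the strong one."""
    return weak + scale * (strong - weak)
```

In CFG the weak prediction is the unconditional output of the same model, while in autoguidance it comes from a deliberately weaker model. The abstract indicates that the proposed method builds its weak model by interpolating between pre-trained and fine-tuned weights and conditions it on a null prompt.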

Title: PnP-DA: Towards Principled Plug-and-Play Integration of Variational Data Assimilation and Generative Models

Authors: Yongquan Qu, Matthieu Blanke, Sara Shamekh, Pierre Gentine
Subjects: cs.LG, physics.comp-ph
Abstract URL: https://arxiv.org/abs/2508.00325
Pdf URL: https://arxiv.org/pdf/2508.00325
Copy Paste: [[2508.00325]] PnP-DA: Towards Principled Plug-and-Play Integration of Variational Data Assimilation and Generative Models(https://arxiv.org/abs/2508.00325)
Keywords: generative
Abstract: Earth system modeling presents a fundamental challenge in scientific computing: capturing complex, multiscale nonlinear dynamics in computationally efficient models while minimizing forecast errors caused by necessary simplifications. Even the most powerful AI- or physics-based forecast systems suffer from gradual error accumulation. Data assimilation (DA) aims to mitigate these errors by optimally blending (noisy) observations with prior model forecasts, but conventional variational methods often assume Gaussian error statistics that fail to capture the true, non-Gaussian behavior of chaotic dynamical systems. We propose PnP-DA, a Plug-and-Play algorithm that alternates (1) a lightweight, gradient-based analysis update (using a Mahalanobis-distance misfit on new observations) with (2) a single forward pass through a pretrained generative prior conditioned on the background forecast via a conditional Wasserstein coupling. This strategy relaxes restrictive statistical assumptions and leverages rich historical data without requiring an explicit regularization functional, and it also avoids the need to backpropagate gradients through the complex neural network that encodes the prior during assimilation cycles. Experiments on standard chaotic testbeds demonstrate that this strategy consistently reduces forecast errors across a range of observation sparsities and noise levels, outperforming classical variational methods.
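
The two alternating steps in the abstract above can be written out as a hedged sketch. The notation is assumed rather than taken from the paper: $y$ is the observation vector, $H$ a linear observation operator, $R$ the observation-error covariance, $\eta$ a step size, $x_b$ the background forecast, and $G_\theta$ the pretrained generative prior.

```latex
\begin{aligned}
J(x) &= \tfrac{1}{2}\,(y - Hx)^{\top} R^{-1} (y - Hx)
  && \text{(Mahalanobis-distance misfit)} \\
\nabla J(x) &= -H^{\top} R^{-1} (y - Hx) \\
x^{k+1/2} &= x^{k} + \eta\, H^{\top} R^{-1} \bigl(y - H x^{k}\bigr)
  && \text{(gradient-based analysis update)} \\
x^{k+1} &= G_{\theta}\bigl(x^{k+1/2};\, x_b\bigr)
  && \text{(single forward pass through the prior)}
\end{aligned}
```

Because the gradient is taken only of the observation misfit, no backpropagation through $G_\theta$ is needed, which matches the claim in the abstract.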

Title: BOOD: Boundary-based Out-Of-Distribution Data Generation

Authors: Qilin Liao, Shuo Yang, Bo Zhao, Ping Luo, Hengshuang Zhao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.00350
Pdf URL: https://arxiv.org/pdf/2508.00350
Copy Paste: [[2508.00350]] BOOD: Boundary-based Out-Of-Distribution Data Generation(https://arxiv.org/abs/2508.00350)
Keywords: generation
Abstract: Harnessing the power of diffusion models to synthesize auxiliary training data based on latent space features has proven effective in enhancing out-of-distribution (OOD) detection performance. However, extracting effective features outside the in-distribution (ID) boundary in latent space remains challenging due to the difficulty of identifying decision boundaries between classes. This paper proposes a novel framework called Boundary-based Out-Of-Distribution data generation (BOOD), which synthesizes high-quality OOD features and generates human-compatible outlier images using diffusion models. BOOD first learns a text-conditioned latent feature space from the ID dataset, selects ID features closest to the decision boundary, and perturbs them to cross the decision boundary to form OOD features. These synthetic OOD features are then decoded into images in pixel space by a diffusion model. Compared to previous works, BOOD provides a more training-efficient strategy for synthesizing informative OOD features, facilitating clearer distinctions between ID and OOD data. Extensive experimental results on common benchmarks demonstrate that BOOD surpasses the state-of-the-art method significantly, achieving a 29.64-percentage-point decrease in average FPR95 (40.31% vs. 10.67%) and a 7.27-percentage-point improvement in average AUROC (90.15% vs. 97.42%) on the CIFAR-100 dataset.
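The "select features near the boundary, then push them across" idea can be sketched with a toy linear classifier. This is a minimal sketch under stated assumptions, not BOOD itself: the linear `weights`, the logit-margin criterion for "closest to the boundary", the fixed step size, and all function names are hypothetical, and the text-conditioned feature space and diffusion decoding are omitted.

```python
def logits(weights, z):
    return [sum(w_j * z_j for w_j, z_j in zip(w, z)) for w in weights]


def top_two(scores):
    order = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
    return order[0], order[1]


def predict(weights, z):
    return top_two(logits(weights, z))[0]


def margin(weights, z):
    """Gap between the best and second-best logit; small means near the boundary."""
    s = logits(weights, z)
    c1, c2 = top_two(s)
    return s[c1] - s[c2]


def perturb_across_boundary(weights, z, step=0.05, max_steps=1000):
    """Move z along the direction that raises the runner-up logit relative to the
    predicted one, stopping once the predicted class flips."""
    c1, c2 = top_two(logits(weights, z))
    direction = [b - a for a, b in zip(weights[c1], weights[c2])]
    z = list(z)
    for _ in range(max_steps):
        if predict(weights, z) != c1:
            break
        z = [z_j + step * d_j for z_j, d_j in zip(z, direction)]
    return z


# Hypothetical 2-class, 2-dimensional toy setup.
weights = [[1.0, 0.0], [0.0, 1.0]]
features = [[2.0, 0.1], [1.0, 0.8], [0.2, 3.0]]

closest = sorted(features, key=lambda z: margin(weights, z))[:1]
crossed = [perturb_across_boundary(weights, z) for z in closest]
```

Here `closest` holds the single feature with the smallest margin and `crossed` holds its perturbed copy, which the toy classifier now assigns to a different class.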

Title: Analyze-Prompt-Reason: A Collaborative Agent-Based Framework for Multi-Image Vision-Language Reasoning

Authors: Angelos Vlachos, Giorgos Filandrianos, Maria Lymperaiou, Nikolaos Spanos, Ilias Mitsouras, Vasileios Karampinis, Athanasios Voulodimos
Subjects: cs.CV, cs.MA
Abstract URL: https://arxiv.org/abs/2508.00356
Pdf URL: https://arxiv.org/pdf/2508.00356
Copy Paste: [[2508.00356]] Analyze-Prompt-Reason: A Collaborative Agent-Based Framework for Multi-Image Vision-Language Reasoning(https://arxiv.org/abs/2508.00356)
Keywords: generation
Abstract: We present a Collaborative Agent-Based Framework for Multi-Image Reasoning. Our approach tackles the challenge of interleaved multimodal reasoning across diverse datasets and task formats by employing a dual-agent system: a language-based PromptEngineer, which generates context-aware, task-specific prompts, and a VisionReasoner, a large vision-language model (LVLM) responsible for final inference. The framework is fully automated, modular, and training-free, enabling generalization across classification, question answering, and free-form generation tasks involving one or multiple input images. We evaluate our method on 18 diverse datasets from the 2025 MIRAGE Challenge (Track A), covering a broad spectrum of visual reasoning tasks including document QA, visual comparison, dialogue-based understanding, and scene-level inference. Our results demonstrate that LVLMs can effectively reason over multiple images when guided by informative prompts. Notably, Claude 3.7 achieves near-ceiling performance on challenging tasks such as TQA (99.13% accuracy), DocVQA (96.87%), and MMCoQA (75.28 ROUGE-L). We also explore how design choices, such as model selection, shot count, and input length, influence the reasoning performance of different LVLMs.

Title: $MV_{Hybrid}$: Improving Spatial Transcriptomics Prediction with Hybrid State Space-Vision Transformer Backbone in Pathology Vision Foundation Models

Authors: Won June Cho, Hongjun Yoon, Daeky Jeong, Hyeongyeol Lim, Yosep Chong
Subjects: cs.CV, cs.AI, cs.CE, cs.LG
Abstract URL: https://arxiv.org/abs/2508.00383
Pdf URL: https://arxiv.org/pdf/2508.00383
Copy Paste: [[2508.00383]] $MV_{Hybrid}$: Improving Spatial Transcriptomics Prediction with Hybrid State Space-Vision Transformer Backbone in Pathology Vision Foundation Models(https://arxiv.org/abs/2508.00383)
Keywords: generation
Abstract: Spatial transcriptomics reveals gene expression patterns within tissue context, enabling precision oncology applications such as treatment response prediction, but its high cost and technical complexity limit clinical adoption. Predicting spatial gene expression (biomarkers) from routine histopathology images offers a practical alternative, yet current vision foundation models (VFMs) in pathology based on Vision Transformer (ViT) backbones perform below clinical standards. Given that VFMs are already trained on millions of diverse whole slide images, we hypothesize that architectural innovations beyond ViTs may better capture the low-frequency, subtle morphological patterns correlating with molecular phenotypes. By demonstrating that state space models initialized with negative real eigenvalues exhibit strong low-frequency bias, we introduce $MV_{Hybrid}$, a hybrid backbone architecture combining state space models (SSMs) with ViT. We compare it against five other backbone architectures for pathology VFMs, all pretrained on identical colorectal cancer datasets using the DINOv2 self-supervised learning method. We evaluate all pretrained models using both random split and leave-one-study-out (LOSO) settings of the same biomarker dataset. In LOSO evaluation, $MV_{Hybrid}$ achieves 57% higher correlation than the best-performing ViT and shows 43% smaller performance degradation compared to random split in gene expression prediction, demonstrating superior performance and robustness, respectively. Furthermore, $MV_{Hybrid}$ shows equal or better downstream performance in classification, patch retrieval, and survival prediction tasks compared to that of ViT, showing its promise as a next-generation pathology VFM backbone. Our code is publicly available at: this https URL.

Title: Video Forgery Detection with Optical Flow Residuals and Spatial-Temporal Consistency

Authors: Xi Xue, Kunio Suzuki, Nabarun Goswami, Takuya Shintate
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.00397
Pdf URL: https://arxiv.org/pdf/2508.00397
Copy Paste: [[2508.00397]] Video Forgery Detection with Optical Flow Residuals and Spatial-Temporal Consistency(https://arxiv.org/abs/2508.00397)
Keywords: generation, generative
Abstract: The rapid advancement of diffusion-based video generation models has led to increasingly realistic synthetic content, presenting new challenges for video forgery detection. Existing methods often struggle to capture fine-grained temporal inconsistencies, particularly in AI-generated videos with high visual fidelity and coherent motion. In this work, we propose a detection framework that leverages spatial-temporal consistency by combining RGB appearance features with optical flow residuals. The model adopts a dual-branch architecture, where one branch analyzes RGB frames to detect appearance-level artifacts, while the other processes flow residuals to reveal subtle motion anomalies caused by imperfect temporal synthesis. By integrating these complementary features, the proposed method effectively detects a wide range of forged videos. Extensive experiments on text-to-video and image-to-video tasks across ten diverse generative models demonstrate the robustness and strong generalization ability of the proposed approach.

Title: PMR: Physical Model-Driven Multi-Stage Restoration of Turbulent Dynamic Videos

Authors: Tao Wu, Jingyuan Ye, Ying Fu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.00406
Pdf URL: https://arxiv.org/pdf/2508.00406
Copy Paste: [[2508.00406]] PMR: Physical Model-Driven Multi-Stage Restoration of Turbulent Dynamic Videos(https://arxiv.org/abs/2508.00406)
Keywords: restoration
Abstract: Geometric distortions and blurring caused by atmospheric turbulence degrade the quality of long-range dynamic scene videos. Existing methods struggle with restoring edge details and eliminating mixed distortions, especially under conditions of strong turbulence and complex dynamics. To address these challenges, we introduce a Dynamic Efficiency Index ($DEI$), which combines turbulence intensity, optical flow, and proportions of dynamic regions to accurately quantify video dynamic intensity under varying turbulence conditions and provide a high-dynamic turbulence training dataset. Additionally, we propose a Physical Model-Driven Multi-Stage Video Restoration ($PMR$) framework that consists of three stages: de-tilting for geometric stabilization, motion segmentation enhancement for dynamic region refinement, and de-blurring for quality restoration. $PMR$ employs lightweight backbones and stage-wise joint training to ensure both efficiency and high restoration quality. Experimental results demonstrate that the proposed method effectively suppresses motion trailing artifacts, restores edge details and exhibits strong generalization capability, especially in real-world scenarios characterized by high-turbulence and complex dynamics. We will make the code and datasets openly available.

Title: Sortblock: Similarity-Aware Feature Reuse for Diffusion Model

Authors: Hanqi Chen, Xu Zhang, Xiaoliu Guan, Lielin Jiang, Guanzhong Wang, Zeyu Chen, Yi Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.00412
Pdf URL: https://arxiv.org/pdf/2508.00412
Copy Paste: [[2508.00412]] Sortblock: Similarity-Aware Feature Reuse for Diffusion Model(https://arxiv.org/abs/2508.00412)
Keywords: generation, generative
Abstract: Diffusion Transformers (DiTs) have demonstrated remarkable generative capabilities, particularly benefiting from Transformer architectures that enhance visual and artistic fidelity. However, their inherently sequential denoising process results in high inference latency, limiting their deployment in real-time scenarios. Existing training-free acceleration approaches typically reuse intermediate features at fixed timesteps or layers, overlooking the evolving semantic focus across denoising stages and Transformer blocks. To address this, we propose Sortblock, a training-free inference acceleration framework that dynamically caches block-wise features based on their similarity across adjacent timesteps. By ranking the evolution of residuals, Sortblock adaptively determines a recomputation ratio, selectively skipping redundant computations while preserving generation quality. Furthermore, we incorporate a lightweight linear prediction mechanism to reduce accumulated errors in skipped blocks. Extensive experiments across various tasks and DiT architectures demonstrate that Sortblock achieves over 2$\times$ inference speedup with minimal degradation in output quality, offering an effective and generalizable solution for accelerating diffusion-based generative models.
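The block-ranking idea can be sketched as follows. This is a simplified illustration under assumptions, not Sortblock's actual procedure: it uses cosine similarity between block features from two already-computed adjacent timesteps, takes the recomputation ratio as a fixed input rather than determining it adaptively, and omits the linear prediction mechanism. All names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 1.0


def blocks_to_recompute(older_feats, newer_feats, ratio):
    """Rank blocks by how much their features changed between two adjacent
    timesteps and return the indices of the most-changed fraction `ratio`.
    The remaining blocks would reuse their cached features at the next step."""
    change = [1.0 - cosine_similarity(p, c) for p, c in zip(older_feats, newer_feats)]
    ranked = sorted(range(len(change)), key=lambda i: change[i], reverse=True)
    n_recompute = max(1, math.ceil(ratio * len(change)))
    return sorted(ranked[:n_recompute])


# Hypothetical per-block feature vectors at two adjacent timesteps.
older = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 2.0]]
newer = [[1.0, 0.01], [1.0, -1.0], [0.0, 1.0], [2.0, 1.0]]

recompute = blocks_to_recompute(older, newer, ratio=0.5)
```

With these toy vectors, blocks 1 and 3 changed the most and are selected for recomputation, while blocks 0 and 2 are nearly unchanged and would be served from the cache.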

Title: DC-AE 1.5: Accelerating Diffusion Model Convergence with Structured Latent Space

Authors: Junyu Chen, Dongyun Zou, Wenkun He, Junsong Chen, Enze Xie, Song Han, Han Cai
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.00413
Pdf URL: https://arxiv.org/pdf/2508.00413
Copy Paste: [[2508.00413]] DC-AE 1.5: Accelerating Diffusion Model Convergence with Structured Latent Space(https://arxiv.org/abs/2508.00413)
Keywords: generation
Abstract: We present DC-AE 1.5, a new family of deep compression autoencoders for high-resolution diffusion models. Increasing the autoencoder's latent channel number is a highly effective approach for improving its reconstruction quality. However, it results in slow convergence for diffusion models, leading to poorer generation quality despite better reconstruction quality. This issue limits the quality upper bound of latent diffusion models and hinders the employment of autoencoders with higher spatial compression ratios. We introduce two key innovations to address this challenge: i) Structured Latent Space, a training-based approach to impose a desired channel-wise structure on the latent space with front latent channels capturing object structures and latter latent channels capturing image details; ii) Augmented Diffusion Training, an augmented diffusion training strategy with additional diffusion training objectives on object latent channels to accelerate convergence. With these techniques, DC-AE 1.5 delivers faster convergence and better diffusion scaling results than DC-AE. On ImageNet 512x512, DC-AE-1.5-f64c128 delivers better image generation quality than DC-AE-f32c32 while being 4x faster. Code: this https URL.
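To make the model names concrete, the short calculation below works out latent sizes for a 512x512 input. It assumes the naming convention `f<N>c<M>` means a spatial downsampling factor of N and M latent channels; the abstract itself does not spell this out, so treat the interpretation and the helper names as assumptions.

```python
def latent_shape(image_size, downsample_factor, channels):
    """Shape (channels, height, width) of the latent for a square input image."""
    side = image_size // downsample_factor
    return (channels, side, side)


def summarize(name, shape):
    channels, height, width = shape
    positions = height * width       # spatial positions the diffusion model sees
    values = channels * positions    # total numbers stored in the latent
    print(f"{name}: shape={shape}, positions={positions}, values={values}")
    return positions, values


pos_new, val_new = summarize("f64c128", latent_shape(512, 64, 128))
pos_old, val_old = summarize("f32c32", latent_shape(512, 32, 32))
```

Under this reading, both latents hold the same number of values, but f64c128 has a quarter as many spatial positions as f32c32. The abstract does not state where its reported 4x speedup comes from, so this ratio is offered only as context.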

Title: TopoTTA: Topology-Enhanced Test-Time Adaptation for Tubular Structure Segmentation

Authors: Jiale Zhou, Wenhan Wang, Shikun Li, Xiaolei Qu, Xin Guo, Yizhong Liu, Wenzhong Tang, Xun Lin, Yefeng Zheng
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.00442
Pdf URL: https://arxiv.org/pdf/2508.00442
Copy Paste: [[2508.00442]] TopoTTA: Topology-Enhanced Test-Time Adaptation for Tubular Structure Segmentation(https://arxiv.org/abs/2508.00442)
Keywords: generation
Abstract: Tubular structure segmentation (TSS) is important for various applications, such as hemodynamic analysis and route navigation. Despite significant progress in TSS, domain shifts remain a major challenge, leading to performance degradation in unseen target domains. Unlike other segmentation tasks, TSS is more sensitive to domain shifts, as changes in topological structures can compromise segmentation integrity, and variations in local features distinguishing foreground from background (e.g., texture and contrast) may further disrupt topological continuity. To address these challenges, we propose Topology-enhanced Test-Time Adaptation (TopoTTA), the first test-time adaptation framework designed specifically for TSS. TopoTTA consists of two stages: Stage 1 adapts models to cross-domain topological discrepancies using the proposed Topological Meta Difference Convolutions (TopoMDCs), which enhance topological representation without altering pre-trained parameters; Stage 2 improves topological continuity by a novel Topology Hard sample Generation (TopoHG) strategy and prediction alignment on hard samples with pseudo-labels in the generated pseudo-break regions. Extensive experiments across four scenarios and ten datasets demonstrate TopoTTA's effectiveness in handling topological distribution shifts, achieving an average improvement of 31.81% in clDice. TopoTTA also serves as a plug-and-play TTA solution for CNN-based TSS models.

Title: PIF-Net: Ill-Posed Prior Guided Multispectral and Hyperspectral Image Fusion via Invertible Mamba and Fusion-Aware LoRA

Authors: Baisong Li, Xingwang Wang, Haixiao Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.00453
Pdf URL: https://arxiv.org/pdf/2508.00453
Copy Paste: [[2508.00453]] PIF-Net: Ill-Posed Prior Guided Multispectral and Hyperspectral Image Fusion via Invertible Mamba and Fusion-Aware LoRA(https://arxiv.org/abs/2508.00453)
Keywords: restoration
Abstract: The goal of multispectral and hyperspectral image fusion (MHIF) is to generate high-quality images that simultaneously possess rich spectral information and fine spatial details. However, due to the inherent trade-off between spectral and spatial information and the limited availability of observations, this task is fundamentally ill-posed. Previous studies have not effectively addressed the ill-posed nature caused by data misalignment. To tackle this challenge, we propose a fusion framework named PIF-Net, which explicitly incorporates ill-posed priors to effectively fuse multispectral images and hyperspectral images. To balance global spectral modeling with computational efficiency, we design a method based on an invertible Mamba architecture that maintains information consistency during feature transformation and fusion, ensuring stable gradient flow and process reversibility. Furthermore, we introduce a novel fusion module called the Fusion-Aware Low-Rank Adaptation module, which dynamically calibrates spectral and spatial features while keeping the model lightweight. Extensive experiments on multiple benchmark datasets demonstrate that PIF-Net achieves significantly better image restoration performance than current state-of-the-art methods while maintaining model efficiency.

Title: Semantic and Temporal Integration in Latent Diffusion Space for High-Fidelity Video Super-Resolution

Authors: Yiwen Wang, Xinning Chai, Yuhong Zhang, Zhengxue Cheng, Jun Zhao, Rong Xie, Li Song
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2508.00471
Pdf URL: https://arxiv.org/pdf/2508.00471
Copy Paste: [[2508.00471]] Semantic and Temporal Integration in Latent Diffusion Space for High-Fidelity Video Super-Resolution(https://arxiv.org/abs/2508.00471)
Keywords: super-resolution, generation
Abstract: Recent advancements in video super-resolution (VSR) models have demonstrated impressive results in enhancing low-resolution videos. However, due to limitations in adequately controlling the generation process, achieving high-fidelity alignment with the low-resolution input while maintaining temporal consistency across frames remains a significant challenge. In this work, we propose Semantic and Temporal Guided Video Super-Resolution (SeTe-VSR), a novel approach that incorporates both semantic and spatio-temporal guidance in the latent diffusion space to address these challenges. By incorporating high-level semantic information and integrating spatial and temporal information, our approach achieves a seamless balance between recovering intricate details and ensuring temporal coherence. Our method not only preserves highly realistic visual content but also significantly enhances fidelity. Extensive experiments demonstrate that SeTe-VSR outperforms existing methods in terms of detail recovery and perceptual quality, highlighting its effectiveness for complex video super-resolution tasks.

Title: A Conditional GAN for Tabular Data Generation with Probabilistic Sampling of Latent Subspaces

Authors: Leonidas Akritidis, Panayiotis Bozanis
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.00472
Pdf URL: https://arxiv.org/pdf/2508.00472
Copy Paste: [[2508.00472]] A Conditional GAN for Tabular Data Generation with Probabilistic Sampling of Latent Subspaces(https://arxiv.org/abs/2508.00472)
Keywords: generation, generative
Abstract: The tabular form constitutes the standard way of representing data in relational database systems and spreadsheets. But, similarly to other forms, tabular data suffers from class imbalance, a problem that causes serious performance degradation in a wide variety of machine learning tasks. One of the most effective solutions dictates the usage of Generative Adversarial Networks (GANs) in order to synthesize artificial data instances for the under-represented classes. Despite their good performance, none of the proposed GAN models takes into account the vector subspaces of the input samples in the real data space, leading to data generation in arbitrary locations. Moreover, the class labels are treated in the same manner as the other categorical variables during training, so conditional sampling by class is rendered less effective. To overcome these problems, this study presents ctdGAN, a conditional GAN for alleviating class imbalance in tabular datasets. Initially, ctdGAN executes a space partitioning step to assign cluster labels to the input samples. Subsequently, it utilizes these labels to synthesize samples via a novel probabilistic sampling strategy and a new loss function that penalizes both cluster and class mis-predictions. In this way, ctdGAN is trained to generate samples in subspaces that resemble those of the original data distribution. We also introduce several other improvements, including a simple, yet effective cluster-wise scaling technique that captures multiple feature modes without affecting data dimensionality. The exhaustive evaluation of ctdGAN with 14 imbalanced datasets demonstrated its superiority in generating high fidelity samples and improving classification accuracy.

Title: LAMIC: Layout-Aware Multi-Image Composition via Scalability of Multimodal Diffusion Transformer

Authors: Yuzhuo Chen, Zehua Ma, Jianhua Wang, Kai Kang, Shunyu Yao, Weiming Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.00477
Pdf URL: https://arxiv.org/pdf/2508.00477
Copy Paste: [[2508.00477]] LAMIC: Layout-Aware Multi-Image Composition via Scalability of Multimodal Diffusion Transformer(https://arxiv.org/abs/2508.00477)
Keywords: generation
Abstract: In controllable image synthesis, generating coherent and consistent images from multiple references with spatial layout awareness remains an open challenge. We present LAMIC, a Layout-Aware Multi-Image Composition framework that, for the first time, extends single-reference diffusion models to multi-reference scenarios in a training-free manner. Built upon the MMDiT model, LAMIC introduces two plug-and-play attention mechanisms: 1) Group Isolation Attention (GIA) to enhance entity disentanglement; and 2) Region-Modulated Attention (RMA) to enable layout-aware generation. To comprehensively evaluate model capabilities, we further introduce three metrics: 1) Inclusion Ratio (IN-R) and Fill Ratio (FI-R) for assessing layout control; and 2) Background Similarity (BG-S) for measuring background consistency. Extensive experiments show that LAMIC achieves state-of-the-art performance across most major metrics: it consistently outperforms existing multi-reference baselines in ID-S, BG-S, IN-R and AVG scores across all settings, and achieves the best DPG in complex composition tasks. These results demonstrate LAMIC's superior abilities in identity keeping, background preservation, layout control, and prompt-following, all achieved without any training or fine-tuning, showcasing strong zero-shot generalization ability. By inheriting the strengths of advanced single-reference models and enabling seamless extension to multi-image scenarios, LAMIC establishes a new training-free paradigm for controllable multi-image composition. As foundation models continue to evolve, LAMIC's performance is expected to scale accordingly. Our implementation is available at: this https URL.

Title: Court of LLMs: Evidence-Augmented Generation via Multi-LLM Collaboration for Text-Attributed Graph Anomaly Detection

Authors: Yiming Xu, Jiarun Chen, Zhen Peng, Zihan Chen, Qika Lin, Lan Ma, Bin Shi, Bo Dong
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.00507
Pdf URL: https://arxiv.org/pdf/2508.00507
Copy Paste: [[2508.00507]] Court of LLMs: Evidence-Augmented Generation via Multi-LLM Collaboration for Text-Attributed Graph Anomaly Detection(https://arxiv.org/abs/2508.00507)
Keywords: generation
Abstract: The natural combination of intricate topological structures and rich textual information in text-attributed graphs (TAGs) opens up a novel perspective for graph anomaly detection (GAD). However, existing GAD methods primarily focus on designing complex optimization objectives within the graph domain, overlooking the complementary value of the textual modality, whose features are often encoded by shallow embedding techniques, such as bag-of-words or skip-gram, so that semantic context related to anomalies may be missed. To unleash the enormous potential of textual modality, large language models (LLMs) have emerged as promising alternatives due to their strong semantic understanding and reasoning capabilities. Nevertheless, their application to TAG anomaly detection remains nascent, and they struggle to encode high-order structural information inherent in graphs due to input length constraints. For high-quality anomaly detection in TAGs, we propose CoLL, a novel framework that combines LLMs and graph neural networks (GNNs) to leverage their complementary strengths. CoLL employs multi-LLM collaboration for evidence-augmented generation to capture anomaly-relevant contexts while delivering human-readable rationales for detected anomalies. Moreover, CoLL integrates a GNN equipped with a gating mechanism to adaptively fuse textual features with evidence while preserving high-order topological information. Extensive experiments demonstrate the superiority of CoLL, achieving an average improvement of 13.37% in AP. This study opens a new avenue for incorporating LLMs in advancing GAD.

Title: Video Color Grading via Look-Up Table Generation

Authors: Seunghyun Shin, Dongmin Shin, Jisu Shin, Hae-Gon Jeon, Joon-Young Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.00548
Pdf URL: https://arxiv.org/pdf/2508.00548
Copy Paste: [[2508.00548]] Video Color Grading via Look-Up Table Generation(https://arxiv.org/abs/2508.00548)
Keywords: generation
Abstract: Different from color correction and transfer, color grading involves adjusting colors for artistic or storytelling purposes in a video, which is used to establish a specific look or mood. However, due to the complexity of the process and the need for specialized editing skills, video color grading remains primarily the domain of professional colorists. In this paper, we present a reference-based video color grading framework. Our key idea is explicitly generating a look-up table (LUT) for color attribute alignment between reference scenes and input video via a diffusion model. As a training objective, we enforce that high-level features of the reference scenes, like look, mood, and emotion, should be similar to those of the input video. Our LUT-based approach allows for color grading without any loss of structural details in the whole video frames as well as achieving fast inference. We further build a pipeline to incorporate user preferences via text prompts for low-level feature enhancement such as contrast and brightness. Experimental results, including extensive user studies, demonstrate the effectiveness of our approach for video color grading. Codes are publicly available at this https URL.
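For readers unfamiliar with LUT-based grading, the sketch below shows how a 3D look-up table is applied to a frame once it exists. It is a generic illustration, not the paper's pipeline: the diffusion model that generates the LUT is omitted, lookup is nearest-neighbour rather than interpolated, and `make_lut`, `apply_lut`, and `grade_frame` are hypothetical helper names.

```python
def make_lut(size, color_fn):
    """Build a size^3 table: lut[r][g][b] holds the output color for that grid point.
    All channels are floats in [0, 1]."""
    scale = size - 1
    return [[[color_fn(r / scale, g / scale, b / scale) for b in range(size)]
             for g in range(size)] for r in range(size)]


def apply_lut(pixel, lut):
    """Nearest-neighbour lookup of one RGB pixel (channels in [0, 1])."""
    size = len(lut)
    r, g, b = (min(size - 1, max(0, round(c * (size - 1)))) for c in pixel)
    return lut[r][g][b]


def grade_frame(frame, lut):
    """Apply the LUT to every pixel; the frame's spatial layout is untouched."""
    return [[apply_lut(px, lut) for px in row] for row in frame]


identity = make_lut(2, lambda r, g, b: (r, g, b))
# Hypothetical "warm" look: boost red slightly, reduce blue slightly.
warm = make_lut(2, lambda r, g, b: (min(1.0, r * 1.1), g, b * 0.9))

frame = [[(0.0, 1.0, 0.0), (1.0, 1.0, 1.0)],
         [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)]]
graded = grade_frame(frame, warm)
```

Because a LUT maps each input color to an output color independently of pixel position, applying it cannot move or blur image structure, which is consistent with the abstract's point about preserving structural details.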

Title: Guiding Diffusion-Based Articulated Object Generation by Partial Point Cloud Alignment and Physical Plausibility Constraints

Authors: Jens U. Kreber, Joerg Stueckler
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2508.00558
Pdf URL: https://arxiv.org/pdf/2508.00558
Copy Paste: [[2508.00558]] Guiding Diffusion-Based Articulated Object Generation by Partial Point Cloud Alignment and Physical Plausibility Constraints(https://arxiv.org/abs/2508.00558)
Keywords: generation, generative
Abstract: Articulated objects are an important type of interactable object in everyday environments. In this paper, we propose PhysNAP, a novel diffusion model-based approach for generating articulated objects that aligns them with partial point clouds and improves their physical plausibility. The model represents part shapes by signed distance functions (SDFs). We guide the reverse diffusion process using a point cloud alignment loss computed using the predicted SDFs. Additionally, we impose non-penetration and mobility constraints based on the part SDFs to guide the model toward more physically plausible objects. We also make our diffusion approach category-aware to further improve point cloud alignment if category information is available. We evaluate the generative ability and constraint consistency of samples generated with PhysNAP using the PartNet-Mobility dataset. We also compare it with an unguided baseline diffusion model and demonstrate that PhysNAP improves constraint consistency and offers a tradeoff with generative ability.
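The abstract does not give the exact form of the guidance terms, so the following is only an illustrative sketch of how SDF-based losses of this kind are commonly written. It assumes an analytic sphere SDF in place of predicted part SDFs; `alignment_loss` and `penetration_penalty` are hypothetical names, not PhysNAP functions.

```python
import math
from typing import Callable, Sequence, Tuple

Point = Tuple[float, float, float]
Sdf = Callable[[Point], float]


def sphere_sdf(center: Point, radius: float) -> Sdf:
    """Signed distance to a sphere: negative inside, zero on the surface."""
    return lambda p: math.dist(p, center) - radius


def alignment_loss(points: Sequence[Point], sdf: Sdf) -> float:
    """Mean |SDF| over observed points; zero when all points lie on the surface."""
    return sum(abs(sdf(p)) for p in points) / len(points)


def penetration_penalty(samples: Sequence[Point], sdf_a: Sdf, sdf_b: Sdf) -> float:
    """Mean depth by which samples lie inside BOTH parts.

    max(sdf_a, sdf_b) is a standard bound on the SDF of the intersection,
    so it is negative exactly where the two parts overlap.
    """
    return sum(max(0.0, -max(sdf_a(p), sdf_b(p))) for p in samples) / len(samples)


# Partial point cloud sampled from a unit sphere (toy data).
observed = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, -1.0)]
good_fit = alignment_loss(observed, sphere_sdf((0.0, 0.0, 0.0), 1.0))
bad_fit = alignment_loss(observed, sphere_sdf((0.0, 0.0, 0.0), 0.5))

part_a = sphere_sdf((0.0, 0.0, 0.0), 1.0)
part_b = sphere_sdf((1.0, 0.0, 0.0), 1.0)
overlap = penetration_penalty([(0.5, 0.0, 0.0), (3.0, 0.0, 0.0)], part_a, part_b)
```

In a guided diffusion setting, gradients of losses like these with respect to the shape parameters would steer each reverse step; here they are only evaluated, not differentiated.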

Title: Learning Potential Energy Surfaces of Hydrogen Atom Transfer Reactions in Peptides

Authors: Marlen Neubert, Patrick Reiser, Frauke Gräter, Pascal Friederich
Subjects: cs.LG, cond-mat.mtrl-sci, physics.chem-ph, physics.comp-ph, q-bio.BM
Abstract URL: https://arxiv.org/abs/2508.00578
Pdf URL: https://arxiv.org/pdf/2508.00578
Copy Paste: [[2508.00578]] Learning Potential Energy Surfaces of Hydrogen Atom Transfer Reactions in Peptides(https://arxiv.org/abs/2508.00578)
Keywords: generation
Abstract: Hydrogen atom transfer (HAT) reactions are essential in many biological processes, such as radical migration in damaged proteins, but their mechanistic pathways remain incompletely understood. Simulating HAT is challenging due to the need for quantum chemical accuracy at biologically relevant scales; thus, neither classical force fields nor DFT-based molecular dynamics are applicable. Machine-learned potentials offer an alternative, able to learn potential energy surfaces (PESs) with near-quantum accuracy. However, training these models to generalize across diverse HAT configurations, especially at radical positions in proteins, requires tailored data generation and careful model selection. Here, we systematically generate HAT configurations in peptides to build large datasets using semiempirical methods and DFT. We benchmark three graph neural network architectures (SchNet, Allegro, and MACE) on their ability to learn HAT PESs and indirectly predict reaction barriers from energy predictions. MACE consistently outperforms the others in energy, force, and barrier prediction, achieving a mean absolute error of 1.13 kcal/mol on out-of-distribution DFT barrier predictions. This accuracy enables integration of ML potentials into large-scale collagen simulations to compute reaction rates from predicted barriers, advancing mechanistic understanding of HAT and radical migration in peptides. We analyze scaling laws, model transferability, and cost-performance trade-offs, and outline strategies for improvement by combining ML potentials with transition state search algorithms and active learning. Our approach is generalizable to other biomolecular systems, enabling quantum-accurate simulations of chemical reactivity in complex environments.
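As an illustration of what "indirectly predicting barriers from energy predictions" can look like, the sketch below assumes energies have been evaluated along a discretised transfer path and takes the barrier as the highest energy relative to the reactant. The abstract does not state how the authors build their paths, so this is an assumption, and the energy values are made-up toy numbers.

```python
from typing import Sequence


def reaction_barrier(path_energies: Sequence[float]) -> float:
    """Forward barrier: highest energy on the path minus the reactant (first) energy."""
    if len(path_energies) < 2:
        raise ValueError("need at least two points along the path")
    return max(path_energies) - path_energies[0]


def barrier_mae(predicted: Sequence[Sequence[float]],
                reference: Sequence[Sequence[float]]) -> float:
    """Mean absolute error between predicted and reference barriers."""
    errors = [abs(reaction_barrier(p) - reaction_barrier(r))
              for p, r in zip(predicted, reference)]
    return sum(errors) / len(errors)


# Toy energy profiles in kcal/mol (hypothetical, not data from the paper).
ml_paths = [[0.0, 5.0, 12.5, 3.0], [1.0, 9.0, 4.0]]
dft_paths = [[0.0, 4.0, 11.0, 2.5], [1.0, 9.5, 4.5]]
mae = barrier_mae(ml_paths, dft_paths)
```

The model is never trained on barriers directly in this setup; any error in the barrier comes from energy errors at the reactant and at the highest point of the path.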

Title: Wukong Framework for Not Safe For Work Detection in Text-to-Image systems

Authors: Mingrui Liu, Sixiao Zhang, Cheng Long
Subjects: cs.CV, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2508.00591
Pdf URL: https://arxiv.org/pdf/2508.00591
Copy Paste: [[2508.00591]] Wukong Framework for Not Safe For Work Detection in Text-to-Image systems(https://arxiv.org/abs/2508.00591)
Keywords: generation
Abstract: Text-to-Image (T2I) generation is a popular AI-generated content (AIGC) technology enabling diverse and creative image synthesis. However, some outputs may contain Not Safe For Work (NSFW) content (e.g., violence), violating community guidelines. Detecting NSFW content efficiently and accurately, known as external safeguarding, is essential. Existing external safeguards fall into two types: text filters, which analyze user prompts but overlook T2I model-specific variations and are prone to adversarial attacks; and image filters, which analyze final generated images but are computationally costly and introduce latency. Diffusion models, the foundation of modern T2I systems like Stable Diffusion, generate images through iterative denoising using a U-Net architecture with ResNet and Transformer blocks. We observe that: (1) early denoising steps define the semantic layout of the image, and (2) cross-attention layers in U-Net are crucial for aligning text and image regions. Based on these insights, we propose Wukong, a transformer-based NSFW detection framework that leverages intermediate outputs from early denoising steps and reuses U-Net's pre-trained cross-attention parameters. Wukong operates within the diffusion process, enabling early detection without waiting for full image generation. We also introduce a new dataset containing prompts, seeds, and image-specific NSFW labels, and evaluate Wukong on this and two public benchmarks. Results show that Wukong significantly outperforms text-based safeguards and achieves accuracy comparable to that of image filters, while offering much greater efficiency.

Title: Backdoor Attacks on Deep Learning Face Detection

Authors: Quentin Le Roux, Yannick Teglia, Teddy Furon, Philippe Loubet-Moundi
Subjects: cs.CV, cs.AI, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2508.00620
Pdf URL: https://arxiv.org/pdf/2508.00620
Copy Paste: [[2508.00620]] Backdoor Attacks on Deep Learning Face Detection(https://arxiv.org/abs/2508.00620)
Keywords: generation
Abstract: Face Recognition Systems that operate in unconstrained environments capture images under varying conditions, such as inconsistent lighting or diverse face poses. These challenges require including a Face Detection module that regresses bounding boxes and landmark coordinates for proper Face Alignment. This paper shows the effectiveness of Object Generation Attacks on Face Detection, dubbed Face Generation Attacks, and demonstrates for the first time a Landmark Shift Attack that backdoors the coordinate regression task performed by face detectors. We then offer mitigations against these vulnerabilities.

Title: Minimum Data, Maximum Impact: 20 annotated samples for explainable lung nodule classification

Authors: Luisa Gallée, Catharina Silvia Lisson, Christoph Gerhard Lisson, Daniela Drees, Felix Weig, Daniel Vogele, Meinrad Beer, Michael Götz
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.00639
Pdf URL: https://arxiv.org/pdf/2508.00639
Copy Paste: [[2508.00639]] Minimum Data, Maximum Impact: 20 annotated samples for explainable lung nodule classification(https://arxiv.org/abs/2508.00639)
Keywords: generative
Abstract: Classification models that provide human-interpretable explanations enhance clinicians' trust and usability in medical image diagnosis. One research focus is the integration and prediction of pathology-related visual attributes used by radiologists alongside the diagnosis, aligning AI decision-making with clinical reasoning. Radiologists use attributes like shape and texture as established diagnostic criteria, and mirroring these in AI decision-making both enhances transparency and enables explicit validation of model outputs. However, the adoption of such models is limited by the scarcity of large-scale medical image datasets annotated with these attributes. To address this challenge, we propose synthesizing attribute-annotated data using a generative model. We enhance the Diffusion Model with attribute conditioning and train it using only 20 attribute-labeled lung nodule samples from the LIDC-IDRI dataset. Incorporating its generated images into the training of an explainable model boosts performance, increasing attribute prediction accuracy by 13.4% and target prediction accuracy by 1.8% compared to training with only the small real attribute-annotated dataset. This work highlights the potential of synthetic data to overcome dataset limitations, enhancing the applicability of explainable models in medical image analysis.

Title: Wind Power Scenario Generation based on the Generalized Dynamic Factor Model and Generative Adversarial Network

Authors: Young-ho Cho, Hao Zhu, Duehee Lee, Ross Baldick
Subjects: cs.LG, eess.SY
Abstract URL: https://arxiv.org/abs/2508.00692
Pdf URL: https://arxiv.org/pdf/2508.00692
Copy Paste: [[2508.00692]] Wind Power Scenario Generation based on the Generalized Dynamic Factor Model and Generative Adversarial Network(https://arxiv.org/abs/2508.00692)
Keywords: generation, generative
Abstract: To support resource adequacy studies, we synthesize multiple long-term wind power scenarios of distributed wind farms simultaneously by using spatio-temporal features: spatial and temporal correlation, waveforms, marginal and ramp-rate distributions of the waveforms, power spectral densities, and statistical characteristics. Generating the spatial correlation in scenarios requires the design of common factors for neighboring wind farms and antithetical factors for distant wind farms. The generalized dynamic factor model (GDFM) can extract the common factors through cross spectral density analysis, but it cannot closely imitate waveforms. The GAN can synthesize plausible samples representing the temporal correlation by verifying samples through a fake sample discriminator. To combine the advantages of GDFM and GAN, we use the GAN to provide a filter that extracts dynamic factors with temporal information from the observation data, and we then apply this filter in the GDFM to represent both spatial and frequency correlations of plausible waveforms. Numerical tests on the combination of GDFM and GAN have demonstrated performance improvements over competing alternatives in synthesizing wind power scenarios from Australia. The combination better reproduces the statistical characteristics of actual wind power than alternatives such as the GDFM with a filter synthesized from distributions of actual dynamic filters and the GAN with direct synthesis without dynamic factors.

Title: D3: Training-Free AI-Generated Video Detection Using Second-Order Features

Authors: Chende Zheng, Ruiqi suo, Chenhao Lin, Zhengyu Zhao, Le Yang, Shuai Liu, Minghui Yang, Cong Wang, Chao Shen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.00701
Pdf URL: https://arxiv.org/pdf/2508.00701
Copy Paste: [[2508.00701]] D3: Training-Free AI-Generated Video Detection Using Second-Order Features(https://arxiv.org/abs/2508.00701)
Keywords: generation
Abstract: The evolution of video generation techniques, such as Sora, has made it increasingly easy to produce high-fidelity AI-generated videos, raising public concern over the dissemination of synthetic content. However, existing detection methodologies remain limited by their insufficient exploration of temporal artifacts in synthetic videos. To bridge this gap, we establish a theoretical framework through second-order dynamical analysis under Newtonian mechanics, subsequently extending the Second-order Central Difference features tailored for temporal artifact detection. Building on this theoretical foundation, we reveal a fundamental divergence in second-order feature distributions between real and AI-generated videos. Concretely, we propose Detection by Difference of Differences (D3), a novel training-free detection method that leverages the above second-order temporal discrepancies. We validate the superiority of our D3 on 4 open-source datasets (Gen-Video, VideoPhy, EvalCrafter, VidProM), 40 subsets in total. For example, on GenVideo, D3 outperforms the previous best method by 10.39% (absolute) mean Average Precision. Additional experiments on time cost and post-processing operations demonstrate D3's exceptional computational efficiency and strong robustness. Our code is available at this https URL.
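The second-order central difference that the abstract builds on is the standard operator x[t+1] - 2*x[t] + x[t-1], a discrete analogue of acceleration. The sketch below shows only that operator on toy per-frame feature vectors. It does not reproduce D3's feature extractor or decision rule, and `mean_abs` is a hypothetical summary statistic chosen for illustration.

```python
from typing import List, Sequence


def second_order_central_difference(
    frames: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Return x[t+1] - 2*x[t] + x[t-1] for every interior time step t."""
    if len(frames) < 3:
        raise ValueError("need at least three frames")
    return [
        [nxt - 2.0 * cur + prev
         for prev, cur, nxt in zip(frames[t - 1], frames[t], frames[t + 1])]
        for t in range(1, len(frames) - 1)
    ]


def mean_abs(rows: Sequence[Sequence[float]]) -> float:
    """Average magnitude over all entries (illustrative summary only)."""
    flat = [abs(v) for row in rows for v in row]
    return sum(flat) / len(flat)


# Toy feature trajectories: one linear in time, one quadratic in time.
constant_velocity = [[float(t), 2.0 * t] for t in range(6)]
accelerating = [[float(t * t), 0.0] for t in range(6)]

linear_score = mean_abs(second_order_central_difference(constant_velocity))
quadratic_score = mean_abs(second_order_central_difference(accelerating))
```

A trajectory that changes at a constant rate has zero second-order difference, while the quadratic one yields a constant non-zero value, which is the kind of second-order temporal signal a detector can compare between real and generated videos.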

Title: Democratizing Tabular Data Access with an Open-Source Synthetic-Data SDK

Authors: Ivona Krchova, Mariana Vargas Vieyra, Mario Scriminaci, Andrey Sidorenko
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.00718
Pdf URL: https://arxiv.org/pdf/2508.00718
Copy Paste: [[2508.00718]] Democratizing Tabular Data Access with an Open-Source Synthetic-Data SDK(https://arxiv.org/abs/2508.00718)
Keywords: generation
Abstract: Machine learning development critically depends on access to high-quality data. However, increasing restrictions due to privacy, proprietary interests, and ethical concerns have created significant barriers to data accessibility. Synthetic data offers a viable solution by enabling safe, broad data usage without compromising sensitive information. This paper presents the MOSTLY AI Synthetic Data Software Development Kit (SDK), an open-source toolkit designed specifically for synthesizing high-quality tabular data. The SDK integrates robust features such as differential privacy guarantees, fairness-aware data generation, and automated quality assurance into a flexible and accessible Python interface. Leveraging the TabularARGN autoregressive framework, the SDK supports diverse data types and complex multi-table and sequential datasets, delivering competitive performance with notable improvements in speed and usability. Currently deployed both as a cloud service and locally installable software, the SDK has seen rapid adoption, highlighting its practicality in addressing real-world data bottlenecks and promoting widespread data democratization.

Title: YOLO-Count: Differentiable Object Counting for Text-to-Image Generation

Authors: Guanning Zeng, Xiang Zhang, Zirui Wang, Haiyang Xu, Zeyuan Chen, Bingnan Li, Zhuowen Tu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.00728
Pdf URL: https://arxiv.org/pdf/2508.00728
Copy Paste: [[2508.00728]] YOLO-Count: Differentiable Object Counting for Text-to-Image Generation(https://arxiv.org/abs/2508.00728)
Keywords: generation, generative
Abstract: We propose YOLO-Count, a differentiable open-vocabulary object counting model that both tackles general counting challenges and enables precise quantity control for text-to-image (T2I) generation. A core contribution is the 'cardinality' map, a novel regression target that accounts for variations in object size and spatial distribution. Leveraging representation alignment and a hybrid strong-weak supervision scheme, YOLO-Count bridges the gap between open-vocabulary counting and T2I generation control. Its fully differentiable architecture facilitates gradient-based optimization, enabling accurate object count estimation and fine-grained guidance for generative models. Extensive experiments demonstrate that YOLO-Count achieves state-of-the-art counting accuracy while providing robust and effective quantity control for T2I systems.

Title: Is It Really You? Exploring Biometric Verification Scenarios in Photorealistic Talking-Head Avatar Videos

Authors: Laura Pedrouzo-Rodriguez, Pedro Delgado-DeRobles, Luis F. Gomez, Ruben Tolosana, Ruben Vera-Rodriguez, Aythami Morales, Julian Fierrez
Subjects: cs.CV, cs.AI, cs.CR, cs.MM
Abstract URL: https://arxiv.org/abs/2508.00748
Pdf URL: https://arxiv.org/pdf/2508.00748
Copy Paste: [[2508.00748]] Is It Really You? Exploring Biometric Verification Scenarios in Photorealistic Talking-Head Avatar Videos(https://arxiv.org/abs/2508.00748)
Keywords: generation
Abstract: Photorealistic talking-head avatars are becoming increasingly common in virtual meetings, gaming, and social platforms. These avatars allow for more immersive communication, but they also introduce serious security risks. One emerging threat is impersonation: an attacker can steal a user's avatar, preserving their appearance and voice, making it nearly impossible to detect its fraudulent usage by sight or sound alone. In this paper, we explore the challenge of biometric verification in such avatar-mediated scenarios. Our main question is whether an individual's facial motion patterns can serve as reliable behavioral biometrics to verify their identity when the avatar's visual appearance is a facsimile of its owner. To answer this question, we introduce a new dataset of realistic avatar videos created using a state-of-the-art one-shot avatar generation model, GAGAvatar, with genuine and impostor avatar videos. We also propose a lightweight, explainable spatio-temporal Graph Convolutional Network architecture with temporal attention pooling that uses only facial landmarks to model dynamic facial gestures. Experimental results demonstrate that facial motion cues enable meaningful identity verification with AUC values approaching 80%. The proposed benchmark and biometric system are available to the research community in order to bring attention to the urgent need for more advanced behavioral biometric defenses in avatar-based communication systems.

Title: SU-ESRGAN: Semantic and Uncertainty-Aware ESRGAN for Super-Resolution of Satellite and Drone Imagery with Fine-Tuning for Cross Domain Evaluation

Authors: Prerana Ramkumar
Subjects: cs.CV, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2508.00750
Pdf URL: https://arxiv.org/pdf/2508.00750
Copy Paste: [[2508.00750]] SU-ESRGAN: Semantic and Uncertainty-Aware ESRGAN for Super-Resolution of Satellite and Drone Imagery with Fine-Tuning for Cross Domain Evaluation(https://arxiv.org/abs/2508.00750)
Keywords: super-resolution, generative
Abstract: Generative Adversarial Networks (GANs) have achieved realistic super-resolution (SR) of images; however, they lack semantic consistency and per-pixel confidence, limiting their credibility in critical remote sensing applications such as disaster response, urban planning and agriculture. This paper introduces Semantic and Uncertainty-Aware ESRGAN (SU-ESRGAN), the first SR framework designed for satellite imagery to integrate the ESRGAN, segmentation loss via DeepLabv3 for class detail preservation and Monte Carlo dropout to produce pixel-wise uncertainty maps. The SU-ESRGAN produces results (PSNR, SSIM, LPIPS) comparable to the Baseline ESRGAN on aerial imagery. This novel model is valuable in satellite systems or UAVs that use wide field-of-view (FoV) cameras, trading off spatial resolution for coverage. The modular design allows integration in UAV data pipelines for on-board or post-processing SR to enhance imagery degraded by motion blur, compression and sensor limitations. Further, the model is fine-tuned to evaluate its performance on cross domain applications. The tests are conducted on two drone-based datasets which differ in altitude and imaging perspective. Performance evaluation of the fine-tuned models shows a stronger adaptation to the Aerial Maritime Drone Dataset, whose imaging characteristics align with the training data, highlighting the importance of domain-aware training in SR applications.
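Monte Carlo dropout, as mentioned in the abstract, estimates uncertainty by keeping dropout active at inference, running several stochastic forward passes, and taking the per-pixel spread. The sketch below illustrates only that idea: `toy_stochastic_model` is a hypothetical stand-in for the SR network (three fixed-weight "units" per pixel), not SU-ESRGAN, and the pass count and drop probability are arbitrary.

```python
import random
import statistics
from typing import List, Tuple

Image = List[List[float]]


def toy_stochastic_model(image: Image, rng: random.Random, drop_prob: float) -> Image:
    """Hypothetical network: each pixel passes through three units with dropout."""
    keep = 1.0 - drop_prob
    weights = (0.5, 0.3, 0.2)
    return [
        [
            # Inverted dropout: surviving units are rescaled by 1 / keep.
            sum(value * w / keep if rng.random() >= drop_prob else 0.0
                for w in weights)
            for value in row
        ]
        for row in image
    ]


def mc_dropout(image: Image, passes: int, drop_prob: float,
               seed: int = 0) -> Tuple[Image, Image]:
    """Return (mean map, std map) over repeated stochastic forward passes."""
    rng = random.Random(seed)
    samples = [toy_stochastic_model(image, rng, drop_prob) for _ in range(passes)]
    height, width = len(image), len(image[0])
    mean = [[statistics.fmean([s[i][j] for s in samples]) for j in range(width)]
            for i in range(height)]
    std = [[statistics.pstdev([s[i][j] for s in samples]) for j in range(width)]
           for i in range(height)]
    return mean, std


# Toy 1x2 "image": a zero pixel and a bright pixel.
mean_map, uncertainty_map = mc_dropout([[0.0, 1.0]], passes=50, drop_prob=0.5)
_, no_dropout_map = mc_dropout([[0.0, 1.0]], passes=50, drop_prob=0.0)
```

The standard-deviation map plays the role of the pixel-wise uncertainty map: pixels whose output is insensitive to which units are dropped get low values, and disabling dropout collapses the map to zero.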