2025-08-07

Title: PLA: Prompt Learning Attack against Text-to-Image Generative Models

Authors: Xinqi Lyu, Yihao Liu, Yanjie Li, Bin Xiao
Subjects: cs.CR, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2508.03696
Pdf URL: https://arxiv.org/pdf/2508.03696
Copy Paste: [[2508.03696]] PLA: Prompt Learning Attack against Text-to-Image Generative Models(https://arxiv.org/abs/2508.03696)
Keywords: attack, generative
Abstract: Text-to-Image (T2I) models have gained widespread adoption across various applications. Despite the success, the potential misuse of T2I models poses significant risks of generating Not-Safe-For-Work (NSFW) content. To investigate the vulnerability of T2I models, this paper delves into adversarial attacks to bypass the safety mechanisms under black-box settings. Most previous methods rely on word substitution to search adversarial prompts. Due to limited search space, this leads to suboptimal performance compared to gradient-based training. However, black-box settings present unique challenges to training gradient-driven attack methods, since there is no access to the internal architecture and parameters of T2I models. To facilitate the learning of adversarial prompts in black-box settings, we propose a novel prompt learning attack framework (PLA), where insightful gradient-based training tailored to black-box T2I models is designed by utilizing multimodal similarities. Experiments show that our new method can effectively attack the safety mechanisms of black-box T2I models including prompt filters and post-hoc safety checkers with a high success rate compared to state-of-the-art methods. Warning: This paper may contain offensive model-generated content.

Title: Text2VR: Automated instruction Generation in Virtual Reality using Large language Models for Assembly Task

Authors: Subin Raj Peter
Subjects: cs.CV, cs.HC, cs.MM
Abstract URL: https://arxiv.org/abs/2508.03699
Pdf URL: https://arxiv.org/pdf/2508.03699
Copy Paste: [[2508.03699]] Text2VR: Automated instruction Generation in Virtual Reality using Large language Models for Assembly Task(https://arxiv.org/abs/2508.03699)
Keywords: large language model
Abstract: Virtual Reality (VR) has emerged as a powerful tool for workforce training, offering immersive, interactive, and risk-free environments that enhance skill acquisition, decision-making, and confidence. Despite its advantages, developing VR applications for training remains a significant challenge due to the time, expertise, and resources required to create accurate and engaging instructional content. To address these limitations, this paper proposes a novel approach that leverages Large Language Models (LLMs) to automate the generation of virtual instructions from textual input. The system comprises two core components: an LLM module that extracts task-relevant information from the text, and an intelligent module that transforms this information into animated demonstrations and visual cues within a VR environment. The intelligent module receives input from the LLM module and interprets the extracted information. Based on this, an instruction generator creates training content using relevant data from a database. The instruction generator generates the instruction by changing the color of virtual objects and creating animations to illustrate tasks. This approach enhances training effectiveness and reduces development overhead, making VR-based training more scalable and adaptable to evolving industrial needs.

Title: How Deep Is Representational Bias in LLMs? The Cases of Caste and Religion

Authors: Agrima Seth, Monojit Choudhary, Sunayana Sitaram, Kentaro Toyama, Aditya Vashistha, Kalika Bali
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.03712
Pdf URL: https://arxiv.org/pdf/2508.03712
Copy Paste: [[2508.03712]] How Deep Is Representational Bias in LLMs? The Cases of Caste and Religion(https://arxiv.org/abs/2508.03712)
Keywords: large language model
Abstract: Representational bias in large language models (LLMs) has predominantly been measured through single-response interactions and has focused on Global North-centric identities like race and gender. We expand on that research by conducting a systematic audit of GPT-4 Turbo to reveal how deeply encoded representational biases are and how they extend to less-explored dimensions of identity. We prompt GPT-4 Turbo to generate over 7,200 stories about significant life events (such as weddings) in India, using prompts designed to encourage diversity to varying extents. Comparing the diversity of religious and caste representation in the outputs against the actual population distribution in India as recorded in census data, we quantify the presence and "stickiness" of representational bias in the LLM for religion and caste. We find that GPT-4 responses consistently overrepresent culturally dominant groups far beyond their statistical representation, despite prompts intended to encourage representational diversity. Our findings also suggest that representational bias in LLMs has a winner-take-all quality that is more biased than the likely distribution bias in their training data, and repeated prompt-based nudges have limited and inconsistent efficacy in dislodging these biases. These results suggest that diversifying training data alone may not be sufficient to correct LLM bias, highlighting the need for more fundamental changes in model development. Dataset and Codebook: this https URL

Title: FeynTune: Large Language Models for High-Energy Theory

Authors: Paul Richmond, Prarit Agarwal, Borun Chowdhury, Vasilis Niarchos, Constantinos Papageorgakis
Subjects: cs.CL, cs.LG, hep-th
Abstract URL: https://arxiv.org/abs/2508.03716
Pdf URL: https://arxiv.org/pdf/2508.03716
Copy Paste: [[2508.03716]] FeynTune: Large Language Models for High-Energy Theory(https://arxiv.org/abs/2508.03716)
Keywords: large language model
Abstract: We present specialized Large Language Models for theoretical High-Energy Physics, obtained as 20 fine-tuned variants of the 8-billion parameter Llama-3.1 model. Each variant was trained on arXiv abstracts (through August 2024) from different combinations of hep-th, hep-ph and gr-qc. For a comparative study, we also trained models on datasets that contained abstracts from disparate fields such as the q-bio and cs categories. All models were fine-tuned using two distinct Low-Rank Adaptation fine-tuning approaches and varying dataset sizes, and outperformed the base model on hep-th abstract completion tasks. We compare performance against leading commercial LLMs (ChatGPT, Claude, Gemini, DeepSeek) and derive insights for further developing specialized language models for High-Energy Theoretical Physics.

Title: Multimodal Video Emotion Recognition with Reliable Reasoning Priors

Authors: Zhepeng Wang, Yingjian Zhu, Guanghao Dong, Hongzhu Yi, Feng Chen, Xinming Wang, Jun Xie
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.03722
Pdf URL: https://arxiv.org/pdf/2508.03722
Copy Paste: [[2508.03722]] Multimodal Video Emotion Recognition with Reliable Reasoning Priors(https://arxiv.org/abs/2508.03722)
Keywords: robust
Abstract: This study investigates the integration of trustworthy prior reasoning knowledge from MLLMs into multimodal emotion recognition. We employ Gemini to generate fine-grained, modality-separable reasoning traces, which are injected as priors during the fusion stage to enrich cross-modal interactions. To mitigate the pronounced class-imbalance in multimodal emotion recognition, we introduce Balanced Dual-Contrastive Learning, a loss formulation that jointly balances inter-class and intra-class distributions. Applied to the MER2024 benchmark, our prior-enhanced framework yields substantial performance gains, demonstrating that the reliability of MLLM-derived reasoning can be synergistically combined with the domain adaptability of lightweight fusion networks for robust, scalable emotion recognition.

Title: From Waveforms to Pixels: A Survey on Audio-Visual Segmentation

Authors: Jia Li, Yapeng Tian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.03724
Pdf URL: https://arxiv.org/pdf/2508.03724
Copy Paste: [[2508.03724]] From Waveforms to Pixels: A Survey on Audio-Visual Segmentation(https://arxiv.org/abs/2508.03724)
Keywords: robust, segmentation
Abstract: Audio-Visual Segmentation (AVS) aims to identify and segment sound-producing objects in videos by leveraging both visual and audio modalities. It has emerged as a significant research area in multimodal perception, enabling fine-grained object-level understanding. In this survey, we present a comprehensive overview of the AVS field, covering its problem formulation, benchmark datasets, evaluation metrics, and the progression of methodologies. We analyze a wide range of approaches, including architectures for unimodal and multimodal encoding, key strategies for audio-visual fusion, and various decoder designs. Furthermore, we examine major training paradigms, from fully supervised learning to weakly supervised and training-free methods. Notably, we provide an extensive comparison of AVS methods across standard benchmarks, highlighting the impact of different architectural choices, fusion strategies, and training paradigms on performance. Finally, we outline the current challenges, such as limited temporal modeling, modality bias toward vision, lack of robustness in complex environments, and high computational demands, and propose promising future directions, including improving temporal reasoning and multimodal fusion, leveraging foundation models for better generalization and few-shot learning, reducing reliance on labeled data through selfand weakly supervised learning, and incorporating higher-level reasoning for more intelligent AVS systems.

Title: A Large Language Model Powered Integrated Circuit Footprint Geometry Understanding

Authors: Yida Wang, Taiting Lu, Runze Liu, Lanqing Yang, Yifan Yang, Zhe Chen, Yuehai Wang, Yixin Liu, Kaiyuan Lin, Xiaomeng Chen, Dian Ding, Yijie Li, Yi-Chao Chen, Yincheng Jin, Mahanth Gowda
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.03725
Pdf URL: https://arxiv.org/pdf/2508.03725
Copy Paste: [[2508.03725]] A Large Language Model Powered Integrated Circuit Footprint Geometry Understanding(https://arxiv.org/abs/2508.03725)
Keywords: robust, large language model
Abstract: Printed-Circuit-board (PCB) footprint geometry labeling of integrated circuits (IC) is essential in defining the physical interface between components and the PCB layout, requiring exceptional visual perception proficiency. However, due to the unstructured footprint drawing and abstract diagram annotations, automated parsing and accurate footprint geometry modeling remain highly challenging. Despite its importance, no methods currently exist for automated package geometry labeling directly from IC mechanical drawings. In this paper, we first investigate the visual perception performance of Large Multimodal Models (LMMs) when solving IC footprint geometry understanding. Our findings reveal that current LMMs severely suffer from inaccurate geometric perception, which hinders their performance in solving the footprint geometry labeling problem. To address these limitations, we propose LLM4-IC8K, a novel framework that treats IC mechanical drawings as images and leverages LLMs for structured geometric interpretation. To mimic the step-by-step reasoning approach used by human engineers, LLM4-IC8K addresses three sub-tasks: perceiving the number of pins, computing the center coordinates of each pin, and estimating the dimensions of individual pins. We present a two-stage framework that first trains LMMs on synthetically generated IC footprint diagrams to learn fundamental geometric reasoning and then fine-tunes them on real-world datasheet drawings to enhance robustness and accuracy in practical scenarios. To support this, we introduce ICGeo8K, a multi-modal dataset with 8,608 labeled samples, including 4138 hand-crafted IC footprint samples and 4470 synthetically generated samples. Extensive experiments demonstrate that our model outperforms state-of-the-art LMMs on the proposed benchmark.

Title: Hierarchical Verification of Speculative Beams for Accelerating LLM Inference

Authors: Jaydip Sen, Harshitha Puvvala, Subhasis Dasgupta
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.03726
Pdf URL: https://arxiv.org/pdf/2508.03726
Copy Paste: [[2508.03726]] Hierarchical Verification of Speculative Beams for Accelerating LLM Inference(https://arxiv.org/abs/2508.03726)
Keywords: large language model
Abstract: Large language models (LLMs) have achieved remarkable success across diverse natural language processing tasks but face persistent challenges in inference efficiency due to their autoregressive nature. While speculative decoding and beam sampling offer notable improvements, traditional methods verify draft sequences sequentially without prioritization, leading to unnecessary computational overhead. This work proposes the Hierarchical Verification Tree (HVT), a novel framework that restructures speculative beam decoding by prioritizing high-likelihood drafts and enabling early pruning of suboptimal candidates. Theoretical foundations and a formal verification-pruning algorithm are developed to ensure correctness and efficiency. Integration with standard LLM inference pipelines is achieved without requiring retraining or architecture modification. Experimental evaluations across multiple datasets and models demonstrate that HVT consistently outperforms existing speculative decoding schemes, achieving substantial reductions in inference time and energy consumption while maintaining or enhancing output quality. The findings highlight the potential of hierarchical verification strategies as a new direction for accelerating large language model inference.

Title: TIR-Diffusion: Diffusion-based Thermal Infrared Image Denoising via Latent and Wavelet Domain Optimization

Authors: Tai Hyoung Rhee, Dong-guw Lee, Ayoung Kim
Subjects: cs.CV, cs.RO, eess.IV
Abstract URL: https://arxiv.org/abs/2508.03727
Pdf URL: https://arxiv.org/pdf/2508.03727
Copy Paste: [[2508.03727]] TIR-Diffusion: Diffusion-based Thermal Infrared Image Denoising via Latent and Wavelet Domain Optimization(https://arxiv.org/abs/2508.03727)
Keywords: robust, diffusion
Abstract: Thermal infrared imaging exhibits considerable potentials for robotic perception tasks, especially in environments with poor visibility or challenging lighting conditions. However, TIR images typically suffer from heavy non-uniform fixed-pattern noise, complicating tasks such as object detection, localization, and mapping. To address this, we propose a diffusion-based TIR image denoising framework leveraging latent-space representations and wavelet-domain optimization. Utilizing a pretrained stable diffusion model, our method fine-tunes the model via a novel loss function combining latent-space and discrete wavelet transform (DWT) / dual-tree complex wavelet transform (DTCWT) losses. Additionally, we implement a cascaded refinement stage to enhance fine details, ensuring high-fidelity denoising results. Experiments on benchmark datasets demonstrate superior performance of our approach compared to state-of-the-art denoising methods. Furthermore, our method exhibits robust zero-shot generalization to diverse and challenging real-world TIR datasets, underscoring its effectiveness for practical robotic deployment.

Title: Privileged Contrastive Pretraining for Multimodal Affect Modelling

Authors: Kosmas Pinitas, Konstantinos Makantasis, Georgios N. Yannakakis
Subjects: cs.LG, cs.HC, cs.MM
Abstract URL: https://arxiv.org/abs/2508.03729
Pdf URL: https://arxiv.org/pdf/2508.03729
Copy Paste: [[2508.03729]] Privileged Contrastive Pretraining for Multimodal Affect Modelling(https://arxiv.org/abs/2508.03729)
Keywords: robust
Abstract: Affective Computing (AC) has made significant progress with the advent of deep learning, yet a persistent challenge remains: the reliable transfer of affective models from controlled laboratory settings (in-vitro) to uncontrolled real-world environments (in-vivo). To address this challenge we introduce the Privileged Contrastive Pretraining (PriCon) framework according to which models are first pretrained via supervised contrastive learning (SCL) and then act as teacher models within a Learning Using Privileged Information (LUPI) framework. PriCon both leverages privileged information during training and enhances the robustness of derived affect models via SCL. Experiments conducted on two benchmark affective corpora, RECOLA and AGAIN, demonstrate that models trained using PriCon consistently outperform LUPI and end to end models. Remarkably, in many cases, PriCon models achieve performance comparable to models trained with access to all modalities during both training and testing. The findings underscore the potential of PriCon as a paradigm towards further bridging the gap between in-vitro and in-vivo affective modelling, offering a scalable and practical solution for real-world applications.

Title: What is Beneath Misogyny: Misogynous Memes Classification and Explanation

Authors: Kushal Kanwar, Dushyant Singh Chauhan, Gopendra Vikram Singh, Asif Ekbal
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.03732
Pdf URL: https://arxiv.org/pdf/2508.03732
Copy Paste: [[2508.03732]] What is Beneath Misogyny: Misogynous Memes Classification and Explanation(https://arxiv.org/abs/2508.03732)
Keywords: large language model
Abstract: Memes are popular in the modern world and are distributed primarily for entertainment. However, harmful ideologies such as misogyny can be propagated through innocent-looking memes. The detection and understanding of why a meme is misogynous is a research challenge due to its multimodal nature (image and text) and its nuanced manifestations across different societal contexts. We introduce a novel multimodal approach, \textit{namely}, \textit{\textbf{MM-Misogyny}} to detect, categorize, and explain misogynistic content in memes. \textit{\textbf{MM-Misogyny}} processes text and image modalities separately and unifies them into a multimodal context through a cross-attention mechanism. The resulting multimodal context is then easily processed for labeling, categorization, and explanation via a classifier and Large Language Model (LLM). The evaluation of the proposed model is performed on a newly curated dataset (\textit{\textbf{W}hat's \textbf{B}eneath \textbf{M}isogynous \textbf{S}tereotyping (WBMS)}) created by collecting misogynous memes from cyberspace and categorizing them into four categories, \textit{namely}, Kitchen, Leadership, Working, and Shopping. The model not only detects and classifies misogyny, but also provides a granular understanding of how misogyny operates in domains of life. The results demonstrate the superiority of our approach compared to existing methods. The code and dataset are available at \href{this https URL}{this https URL}.

Title: CX-Mind: A Pioneering Multimodal Large Language Model for Interleaved Reasoning in Chest X-ray via Curriculum-Guided Reinforcement Learning

Authors: Wenjie Li, Yujie Zhang, Haoran Sun, Yueqi Li, Fanrui Zhang, Mengzhe Xu, Victoria Borja Clausich, Sade Mellin, Renhao Yang, Chenrun Wang, Jethro Zih-Shuo Wang, Shiyi Yao, Gen Li, Yidong Xu, Hanyu Wang, Yilin Huang, Angela Lin Wang, Chen Shi, Yin Zhang, Jianan Guo, Luqi Yang, Renxuan Li, Yang Xu, Jiawei Liu, Yao Zhang, Lei Liu, Carlos Gutiérrez SanRomán, Lei Wang
Subjects: cs.LG, cs.AI, cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2508.03733
Pdf URL: https://arxiv.org/pdf/2508.03733
Copy Paste: [[2508.03733]] CX-Mind: A Pioneering Multimodal Large Language Model for Interleaved Reasoning in Chest X-ray via Curriculum-Guided Reinforcement Learning(https://arxiv.org/abs/2508.03733)
Keywords: interpretability, generative, large language model
Abstract: Chest X-ray (CXR) imaging is one of the most widely used diagnostic modalities in clinical practice, encompassing a broad spectrum of diagnostic tasks. Recent advancements have seen the extensive application of reasoning-based multimodal large language models (MLLMs) in medical imaging to enhance diagnostic efficiency and interpretability. However, existing multimodal models predominantly rely on "one-time" diagnostic approaches, lacking verifiable supervision of the reasoning process. This leads to challenges in multi-task CXR diagnosis, including lengthy reasoning, sparse rewards, and frequent hallucinations. To address these issues, we propose CX-Mind, the first generative model to achieve interleaved "think-answer" reasoning for CXR tasks, driven by curriculum-based reinforcement learning and verifiable process rewards (CuRL-VPR). Specifically, we constructed an instruction-tuning dataset, CX-Set, comprising 708,473 images and 2,619,148 samples, and generated 42,828 high-quality interleaved reasoning data points supervised by clinical reports. Optimization was conducted in two stages under the Group Relative Policy Optimization framework: initially stabilizing basic reasoning with closed-domain tasks, followed by transfer to open-domain diagnostics, incorporating rule-based conditional process rewards to bypass the need for pretrained reward models. Extensive experimental results demonstrate that CX-Mind significantly outperforms existing medical and general-domain MLLMs in visual understanding, text generation, and spatiotemporal alignment, achieving an average performance improvement of 25.1% over comparable CXR-specific models. On real-world clinical dataset (Rui-CXR), CX-Mind achieves a mean recall@1 across 14 diseases that substantially surpasses the second-best results, with multi-center expert evaluations further confirming its clinical utility across multiple dimensions.

Title: StorySync: Training-Free Subject Consistency in Text-to-Image Generation via Region Harmonization

Authors: Gopalji Gaur, Mohammadreza Zolfaghari, Thomas Brox
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.03735
Pdf URL: https://arxiv.org/pdf/2508.03735
Copy Paste: [[2508.03735]] StorySync: Training-Free Subject Consistency in Text-to-Image Generation via Region Harmonization(https://arxiv.org/abs/2508.03735)
Keywords: diffusion
Abstract: Generating a coherent sequence of images that tells a visual story, using text-to-image diffusion models, often faces the critical challenge of maintaining subject consistency across all story scenes. Existing approaches, which typically rely on fine-tuning or retraining models, are computationally expensive, time-consuming, and often interfere with the model's pre-existing capabilities. In this paper, we follow a training-free approach and propose an efficient consistent-subject-generation method. This approach works seamlessly with pre-trained diffusion models by introducing masked cross-image attention sharing to dynamically align subject features across a batch of images, and Regional Feature Harmonization to refine visually similar details for improved subject consistency. Experimental results demonstrate that our approach successfully generates visually consistent subjects across a variety of scenarios while maintaining the creative abilities of the diffusion model.

Title: Fusion of Pervasive RF Data with Spatial Images via Vision Transformers for Enhanced Mapping in Smart Cities

Authors: Rafayel Mkrtchyan, Armen Manukyan, Hrant Khachatrian, Theofanis P. Raptis
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.03736
Pdf URL: https://arxiv.org/pdf/2508.03736
Copy Paste: [[2508.03736]] Fusion of Pervasive RF Data with Spatial Images via Vision Transformers for Enhanced Mapping in Smart Cities(https://arxiv.org/abs/2508.03736)
Keywords: transformer
Abstract: Environment mapping is an important computing task for a wide range of smart city applications, including autonomous navigation, wireless network operations and extended reality environments. Conventional smart city mapping techniques, such as satellite imagery, LiDAR scans, and manual annotations, often suffer from limitations related to cost, accessibility and accuracy. Open-source mapping platforms have been widely utilized in artificial intelligence applications for environment mapping, serving as a source of ground truth. However, human errors and the evolving nature of real-world environments introduce biases that can negatively impact the performance of neural networks trained on such data. In this paper, we present a deep learning-based approach that integrates the DINOv2 architecture to improve building mapping by combining maps from open-source platforms with radio frequency (RF) data collected from multiple wireless user equipments and base stations. Our approach leverages a vision transformer-based architecture to jointly process both RF and map modalities within a unified framework, effectively capturing spatial dependencies and structural priors for enhanced mapping accuracy. For the evaluation purposes, we employ a synthetic dataset co-produced by Huawei. We develop and train a model that leverages only aggregated path loss information to tackle the mapping problem. We measure the results according to three performance metrics which capture different qualities: (i) The Jaccard index, also known as intersection over union (IoU), (ii) the Hausdorff distance, and (iii) the Chamfer distance. Our design achieves a macro IoU of 65.3%, significantly surpassing (i) the erroneous maps baseline, which yields 40.1%, (ii) an RF-only method from the literature, which yields 37.3%, and (iii) a non-AI fusion baseline that we designed which yields 42.2%.

Title: VQ-DeepISC: Vector Quantized-Enabled Digital Semantic Communication with Channel Adaptive Image Transmission

Authors: Jianqiao Chen, Tingting Zhu, Huishi Song, Nan Ma, Xiaodong Xu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.03740
Pdf URL: https://arxiv.org/pdf/2508.03740
Copy Paste: [[2508.03740]] VQ-DeepISC: Vector Quantized-Enabled Digital Semantic Communication with Channel Adaptive Image Transmission(https://arxiv.org/abs/2508.03740)
Keywords: robust, extraction, transformer
Abstract: Discretization of semantic features enables interoperability between semantic and digital communication systems, showing significant potential for practical applications. The fundamental difficulty in digitizing semantic features stems from the need to preserve continuity and context in inherently analog representations during their compression into discrete symbols while ensuring robustness to channel degradation. In this paper, we propose a vector quantized (VQ)-enabled digital semantic communication system with channel adaptive image transmission, named VQ-DeepISC. Guided by deep joint source-channel coding (DJSCC), we first design a Swin Transformer backbone for hierarchical semantic feature extraction, followed by VQ modules projecting features into discrete latent spaces. Consequently, it enables efficient index-based transmission instead of raw feature transmission. To further optimize this process, we develop an attention mechanism-driven channel adaptation module to dynamically optimize index transmission. Secondly, to counteract codebook collapse during training process, we impose a distributional regularization by minimizing the Kullback-Leibler divergence (KLD) between codeword usage frequencies and a uniform prior. Meanwhile, exponential moving average (EMA) is employed to stabilize training and ensure balanced feature coverage during codebook updates. Finally, digital communication is implemented using quadrature phase shift keying (QPSK) modulation alongside orthogonal frequency division multiplexing (OFDM), adhering to the IEEE 802.11a standard. Experimental results demonstrate superior reconstruction fidelity of the proposed system over benchmark methods.

Title: Latent Knowledge Scalpel: Precise and Massive Knowledge Editing for Large Language Models

Authors: Xin Liu, Qiyang Song, Shaowen Xu, Kerou Zhou, Wenbo Jiang, Xiaoqi Jia, Weijuan Zhang, Heqing Huang, Yakai Li
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.03741
Pdf URL: https://arxiv.org/pdf/2508.03741
Copy Paste: [[2508.03741]] Latent Knowledge Scalpel: Precise and Massive Knowledge Editing for Large Language Models(https://arxiv.org/abs/2508.03741)
Keywords: large language model
Abstract: Large Language Models (LLMs) often retain inaccurate or outdated information from pre-training, leading to incorrect predictions or biased outputs during inference. While existing model editing methods can address this challenge, they struggle with editing large amounts of factual information simultaneously and may compromise the general capabilities of the models. In this paper, our empirical study demonstrates that it is feasible to edit the internal representations of LLMs and replace the entities in a manner similar to editing natural language inputs. Based on this insight, we introduce the Latent Knowledge Scalpel (LKS), an LLM editor that manipulates the latent knowledge of specific entities via a lightweight hypernetwork to enable precise and large-scale editing. Experiments conducted on Llama-2 and Mistral show even with the number of simultaneous edits reaching 10,000, LKS effectively performs knowledge editing while preserving the general abilities of the edited LLMs. Code is available at: this https URL.

Title: Closed-Circuit Television Data as an Emergent Data Source for Urban Rail Platform Crowding Estimation

Authors: Riccardo Fiorista, Awad Abdelhalim, Anson F. Stewart, Gabriel L. Pincus, Ian Thistle, Jinhua Zhao
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2508.03749
Pdf URL: https://arxiv.org/pdf/2508.03749
Copy Paste: [[2508.03749]] Closed-Circuit Television Data as an Emergent Data Source for Urban Rail Platform Crowding Estimation(https://arxiv.org/abs/2508.03749)
Keywords: privacy, transformer, segmentation
Abstract: Accurately estimating urban rail platform occupancy can enhance transit agencies' ability to make informed operational decisions, thereby improving safety, operational efficiency, and customer experience, particularly in the context of crowding. However, sensing real-time crowding remains challenging and often depends on indirect proxies such as automatic fare collection data or staff observations. Recently, Closed-Circuit Television (CCTV) footage has emerged as a promising data source with the potential to yield accurate, real-time occupancy estimates. The presented study investigates this potential by comparing three state-of-the-art computer vision approaches for extracting crowd-related features from platform CCTV imagery: (a) object detection and counting using YOLOv11, RT-DETRv2, and APGCC; (b) crowd-level classification via a custom-trained Vision Transformer, Crowd-ViT; and (c) semantic segmentation using DeepLabV3. Additionally, we present a novel, highly efficient linear-optimization-based approach to extract counts from the generated segmentation maps while accounting for image object depth and, thus, for passenger dispersion along a platform. Tested on a privacy-preserving dataset created in collaboration with the Washington Metropolitan Area Transit Authority (WMATA) that encompasses more than 600 hours of video material, our results demonstrate that computer vision approaches can provide substantive value for crowd estimation. This work demonstrates that CCTV image data, independent of other data sources available to a transit agency, can enable more precise real-time crowding estimation and, eventually, timely operational responses for platform crowding mitigation.

Title: GlaBoost: A multimodal Structured Framework for Glaucoma Risk Stratification

Authors: Cheng Huang, Weizheng Xie, Karanjit Kooner, Tsengdar Lee, Jui-Kai Wang, Jia Zhang
Subjects: cs.LG, cs.CE, cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2508.03750
Pdf URL: https://arxiv.org/pdf/2508.03750
Copy Paste: [[2508.03750]] GlaBoost: A multimodal Structured Framework for Glaucoma Risk Stratification(https://arxiv.org/abs/2508.03750)
Keywords: interpretability, transformer
Abstract: Early and accurate detection of glaucoma is critical to prevent irreversible vision loss. However, existing methods often rely on unimodal data and lack interpretability, limiting their clinical utility. In this paper, we present GlaBoost, a multimodal gradient boosting framework that integrates structured clinical features, fundus image embeddings, and expert-curated textual descriptions for glaucoma risk prediction. GlaBoost extracts high-level visual representations from retinal fundus photographs using a pretrained convolutional encoder and encodes free-text neuroretinal rim assessments using a transformer-based language model. These heterogeneous signals, combined with manually assessed risk scores and quantitative ophthalmic indicators, are fused into a unified feature space for classification via an enhanced XGBoost model. Experiments conducted on a real-world annotated dataset demonstrate that GlaBoost significantly outperforms baseline models, achieving a validation accuracy of 98.71%. Feature importance analysis reveals clinically consistent patterns, with cup-to-disc ratio, rim pallor, and specific textual embeddings contributing most to model decisions. GlaBoost offers a transparent and scalable solution for interpretable glaucoma diagnosis and can be extended to other ophthalmic disorders.

Title: Modular Transformer Architecture for Precision Agriculture Imaging

Authors: Brian Gopalan (1), Nathalia Nascimento (1), Vishal Monga (1) ((1) The Pennsylvania State University)
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.03751
Pdf URL: https://arxiv.org/pdf/2508.03751
Copy Paste: [[2508.03751]] Modular Transformer Architecture for Precision Agriculture Imaging(https://arxiv.org/abs/2508.03751)
Keywords: transformer, segmentation
Abstract: This paper addresses the critical need for efficient and accurate weed segmentation from drone video in precision agriculture. A quality-aware modular deep-learning framework is proposed that addresses common image degradation by analyzing quality conditions-such as blur and noise-and routing inputs through specialized pre-processing and transformer models optimized for each degradation type. The system first analyzes drone images for noise and blur using Mean Absolute Deviation and the Laplacian. Data is then dynamically routed to one of three vision transformer models: a baseline for clean images, a modified transformer with Fisher Vector encoding for noise reduction, or another with an unrolled Lucy-Robinson decoder to correct blur. This novel routing strategy allows the system to outperform existing CNN-based methods in both segmentation quality and computational efficiency, demonstrating a significant advancement in deep-learning applications for agriculture.

Title: Generating Synthetic Invoices via Layout-Preserving Content Replacement

Authors: Bevin V, Ananthakrishnan P V, Ragesh KR, Sanjay M, Vineeth S, Bibin Wilson
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.03754
Pdf URL: https://arxiv.org/pdf/2508.03754
Copy Paste: [[2508.03754]] Generating Synthetic Invoices via Layout-Preserving Content Replacement(https://arxiv.org/abs/2508.03754)
Keywords: privacy, robust, large language model
Abstract: The performance of machine learning models for automated invoice processing is critically dependent on large-scale, diverse datasets. However, the acquisition of such datasets is often constrained by privacy regulations and the high cost of manual annotation. To address this, we present a novel pipeline for generating high-fidelity, synthetic invoice documents and their corresponding structured data. Our method first utilizes Optical Character Recognition (OCR) to extract the text content and precise spatial layout from a source invoice. Select data fields are then replaced with contextually realistic, synthetic content generated by a large language model (LLM). Finally, we employ an inpainting technique to erase the original text from the image and render the new, synthetic text in its place, preserving the exact layout and font characteristics. This process yields a pair of outputs: a visually realistic new invoice image and a perfectly aligned structured data file (JSON) reflecting the synthetic content. Our approach provides a scalable and automated solution to amplify small, private datasets, enabling the creation of large, varied corpora for training more robust and accurate document intelligence models.

Title: LRTuckerRep: Low-rank Tucker Representation Model for Multi-dimensional Data Completion

Authors: Wenwu Gong, Lili Yang
Subjects: cs.LG, cs.CV, math.NA
Abstract URL: https://arxiv.org/abs/2508.03755
Pdf URL: https://arxiv.org/pdf/2508.03755
Copy Paste: [[2508.03755]] LRTuckerRep: Low-rank Tucker Representation Model for Multi-dimensional Data Completion(https://arxiv.org/abs/2508.03755)
Keywords: robust
Abstract: Multi-dimensional data completion is a critical problem in computational sciences, particularly in domains such as computer vision, signal processing, and scientific computing. Existing methods typically leverage either global low-rank approximations or local smoothness regularization, but each suffers from notable limitations: low-rank methods are computationally expensive and may disrupt intrinsic data structures, while smoothness-based approaches often require extensive manual parameter tuning and exhibit poor generalization. In this paper, we propose a novel Low-Rank Tucker Representation (LRTuckerRep) model that unifies global and local prior modeling within a Tucker decomposition. Specifically, LRTuckerRep encodes low rankness through a self-adaptive weighted nuclear norm on the factor matrices and a sparse Tucker core, while capturing smoothness via a parameter-free Laplacian-based regularization on the factor spaces. To efficiently solve the resulting nonconvex optimization problem, we develop two iterative algorithms with provable convergence guarantees. Extensive experiments on multi-dimensional image inpainting and traffic data imputation demonstrate that LRTuckerRep achieves superior completion accuracy and robustness under high missing rates compared to baselines.

Title: Refine-IQA: Multi-Stage Reinforcement Finetuning for Perceptual Image Quality Assessment

Authors: Ziheng Jia, Jiaying Qian, Zicheng Zhang, Zijian Chen, Xiongkuo Min
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.03763
Pdf URL: https://arxiv.org/pdf/2508.03763
Copy Paste: [[2508.03763]] Refine-IQA: Multi-Stage Reinforcement Finetuning for Perceptual Image Quality Assessment(https://arxiv.org/abs/2508.03763)
Keywords: robust
Abstract: Reinforcement fine-tuning (RFT) is a proliferating paradigm for LMM training. Analogous to high-level reasoning tasks, RFT is similarly applicable to low-level vision domains, including image quality assessment (IQA). Existing RFT-based IQA methods typically use rule-based output rewards to verify the model's rollouts but provide no reward supervision for the "think" process, leaving its correctness and efficacy uncontrolled. Furthermore, these methods typically fine-tune directly on downstream IQA tasks without explicitly enhancing the model's native low-level visual quality perception, which may constrain its performance upper bound. In response to these gaps, we propose the multi-stage RFT IQA framework (Refine-IQA). In Stage-1, we build the Refine-Perception-20K dataset (with 12 main distortions, 20,907 locally-distorted images, and over 55K RFT samples) and design multi-task reward functions to strengthen the model's visual quality perception. In Stage-2, targeting the quality scoring task, we introduce a probability difference reward involved strategy for "think" process supervision. The resulting Refine-IQA Series Models achieve outstanding performance on both perception and scoring tasks-and, notably, our paradigm activates a robust "think" (quality interpreting) capability that also attains exceptional results on the corresponding quality interpreting benchmark.

Title: LLM-Prior: A Framework for Knowledge-Driven Prior Elicitation and Aggregation

Authors: Yongchao Huang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.03766
Pdf URL: https://arxiv.org/pdf/2508.03766
Copy Paste: [[2508.03766]] LLM-Prior: A Framework for Knowledge-Driven Prior Elicitation and Aggregation(https://arxiv.org/abs/2508.03766)
Keywords: robust, federate, generative, large language model
Abstract: The specification of prior distributions is fundamental in Bayesian inference, yet it remains a significant bottleneck. The prior elicitation process is often a manual, subjective, and unscalable task. We propose a novel framework which leverages Large Language Models (LLMs) to automate and scale this process. We introduce \texttt{LLMPrior}, a principled operator that translates rich, unstructured contexts such as natural language descriptions, data or figures into valid, tractable probability distributions. We formalize this operator by architecturally coupling an LLM with an explicit, tractable generative model, such as a Gaussian Mixture Model (forming a LLM based Mixture Density Network), ensuring the resulting prior satisfies essential mathematical properties. We further extend this framework to multi-agent systems where Logarithmic Opinion Pooling is employed to aggregate prior distributions induced by decentralized knowledge. We present the federated prior aggregation algorithm, \texttt{Fed-LLMPrior}, for aggregating distributed, context-dependent priors in a manner robust to agent heterogeneity. This work provides the foundation for a new class of tools that can potentially lower the barrier to entry for sophisticated Bayesian modeling.

Title: Provably Near-Optimal Distributionally Robust Reinforcement Learning in Online Settings

Authors: Debamita Ghosh, George K. Atia, Yue Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.03768
Pdf URL: https://arxiv.org/pdf/2508.03768
Copy Paste: [[2508.03768]] Provably Near-Optimal Distributionally Robust Reinforcement Learning in Online Settings(https://arxiv.org/abs/2508.03768)
Keywords: robust, generative
Abstract: Reinforcement learning (RL) faces significant challenges in real-world deployments due to the sim-to-real gap, where policies trained in simulators often underperform in practice due to mismatches between training and deployment conditions. Distributionally robust RL addresses this issue by optimizing worst-case performance over an uncertainty set of environments and providing an optimized lower bound on deployment performance. However, existing studies typically assume access to either a generative model or offline datasets with broad coverage of the deployment environment -- assumptions that limit their practicality in unknown environments without prior knowledge. In this work, we study the more realistic and challenging setting of online distributionally robust RL, where the agent interacts only with a single unknown training environment while aiming to optimize its worst-case performance. We focus on general $f$-divergence-based uncertainty sets, including Chi-Square and KL divergence balls, and propose a computationally efficient algorithm with sublinear regret guarantees under minimal assumptions. Furthermore, we establish a minimax lower bound on regret of online learning, demonstrating the near-optimality of our approach. Extensive experiments across diverse environments further confirm the robustness and efficiency of our algorithm, validating our theoretical findings.

Title: GTPO: Trajectory-Based Policy Optimization in Large Language Models

Authors: Marco Simoni, Aleksandar Fontana, Giulio Rossolini, Andrea Saracino
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2508.03772
Pdf URL: https://arxiv.org/pdf/2508.03772
Copy Paste: [[2508.03772]] GTPO: Trajectory-Based Policy Optimization in Large Language Models(https://arxiv.org/abs/2508.03772)
Keywords: protect, large language model
Abstract: Policy-based optimizations are widely adopted today for the training and alignment of language models, where one of the most recent and effective approaches is Group-relative Policy Optimization (GRPO). In this paper, we reveals and analyze two major limitations of GRPO: (i) tokens frequently appear in completions with both positive and negative rewards, leading to conflicting gradient updates that can reduce their output probability, even though can be essential for maintaining proper structure; (ii) negatively rewarded completions may penalize confident responses and shift model decisions toward unlikely tokens, progressively flattening the output distribution and degrading learning. To address these issues and provide a more stable and effective policy optimization strategy, we introduce GTPO (Group-relative Trajectory-based Policy Optimization), which identifies conflict tokens, tokens appearing in the same position across completions with opposite rewards, protects them by skipping negative updates, while amplifying positive ones. To further prevent policy collapse, GTPO filters out completions whose entropy exceeds a provable threshold. Unlike GRPO, GTPO does not rely on KL-divergence regularization, eliminating the need for a reference model during training, while still ensuring greater training stability and improved performance, validated through multiple experiments on GSM8K, MATH and AIME 2024 benchmarks.

Title: U-PINet: End-to-End Hierarchical Physics-Informed Learning With Sparse Graph Coupling for 3D EM Scattering Modeling

Authors: Rui Zhu, Yuexing Peng, Peng Wang, George C. Alexandropoulos, Wenbo Wang, Wei Xiang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.03774
Pdf URL: https://arxiv.org/pdf/2508.03774
Copy Paste: [[2508.03774]] U-PINet: End-to-End Hierarchical Physics-Informed Learning With Sparse Graph Coupling for 3D EM Scattering Modeling(https://arxiv.org/abs/2508.03774)
Keywords: robust
Abstract: Electromagnetic (EM) scattering modeling is critical for radar remote sensing, however, its inherent complexity introduces significant computational challenges. Traditional numerical solvers offer high accuracy, but suffer from scalability issues and substantial computational costs. Pure data-driven deep learning approaches, while efficient, lack physical constraints embedding during training and require extensive labeled data, limiting their applicability and generalization. To overcome these limitations, we propose a U-shaped Physics-Informed Network (U-PINet), the first fully deep-learning-based, physics-informed hierarchical framework for computational EM designed to ensure physical consistency while maximizing computational efficiency. Motivated by the hierarchical decomposition strategy in EM solvers and the inherent sparsity of local EM coupling, the U-PINet models the decomposition and coupling of near- and far-field interactions through a multiscale processing neural network architecture, while employing a physics-inspired sparse graph representation to efficiently model both self- and mutual- coupling among mesh elements of complex $3$-Dimensional (3D) objects. This principled approach enables end-to-end multiscale EM scattering modeling with improved efficiency, generalization, and physical consistency. Experimental results showcase that the U-PINet accurately predicts surface current distributions, achieving close agreement with traditional solver, while significantly reducing computational time and outperforming conventional deep learning baselines in both accuracy and robustness. Furthermore, our evaluations on radar cross section prediction tasks confirm the feasibility of the U-PINet for downstream EM scattering applications.

Title: 4D-PreNet: A Unified Preprocessing Framework for 4D-STEM Data Analysis

Authors: Mingyu Liu (1), Zian Mao (1 and 2), Zhu Liu (1 and 3), Haoran Zhang (1 and 2), Jintao Guo (1), Xiaoya He (1 and 2), Xi Huang (1), Shufen Chu (1), Chun Cheng (1), Jun Ding (4), Yujun Xie (1) ((1) Global Institute of Future Technology of Shanghai Jiao Tong University, (2) University of Michigan Shanghai Jiao Tong University Joint Institute, (3) School of Chemistry and Chemical Engineering of Shanghai Jiao Tong University, (4) Center for Alloy Innovation and Design State Key Laboratory for Mechanical Behavior of Materials of Xian Jiaotong University)
Subjects: cs.CV, cond-mat.mtrl-sci, cs.AI
Abstract URL: https://arxiv.org/abs/2508.03775
Pdf URL: https://arxiv.org/pdf/2508.03775
Copy Paste: [[2508.03775]] 4D-PreNet: A Unified Preprocessing Framework for 4D-STEM Data Analysis(https://arxiv.org/abs/2508.03775)
Keywords: robust
Abstract: Automated experimentation with real time data analysis in scanning transmission electron microscopy (STEM) often require end-to-end framework. The four-dimensional scanning transmission electron microscopy (4D-STEM) with high-throughput data acquisition has been constrained by the critical bottleneck results from data preprocessing. Pervasive noise, beam center drift, and elliptical distortions during high-throughput acquisition inevitably corrupt diffraction patterns, systematically biasing quantitative measurements. Yet, conventional correction algorithms are often material-specific and fail to provide a robust, generalizable solution. In this work, we present 4D-PreNet, an end-to-end deep-learning pipeline that integrates attention-enhanced U-Net and ResNet architectures to simultaneously perform denoising, center correction, and elliptical distortion calibration. The network is trained on large, simulated datasets encompassing a wide range of noise levels, drift magnitudes, and distortion types, enabling it to generalize effectively to experimental data acquired under varying conditions. Quantitative evaluations demonstrate that our pipeline reduces mean squared error by up to 50% during denoising and achieves sub-pixel center localization in the center detection task, with average errors below 0.04 pixels. The outputs are bench-marked against traditional algorithms, highlighting improvements in both noise suppression and restoration of diffraction patterns, thereby facilitating high-throughput, reliable 4D-STEM real-time analysis for automated characterization.

Title: SoilNet: A Multimodal Multitask Model for Hierarchical Classification of Soil Horizons

Authors: Teodor Chiaburu, Vipin Singh, Frank Haußer, Felix Bießmann
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.03785
Pdf URL: https://arxiv.org/pdf/2508.03785
Copy Paste: [[2508.03785]] SoilNet: A Multimodal Multitask Model for Hierarchical Classification of Soil Horizons(https://arxiv.org/abs/2508.03785)
Keywords: security
Abstract: While recent advances in foundation models have improved the state of the art in many domains, some problems in empirical sciences could not benefit from this progress yet. Soil horizon classification, for instance, remains challenging because of its multimodal and multitask characteristics and a complex hierarchically structured label taxonomy. Accurate classification of soil horizons is crucial for monitoring soil health, which directly impacts agricultural productivity, food security, ecosystem stability and climate resilience. In this work, we propose $\textit{SoilNet}$ - a multimodal multitask model to tackle this problem through a structured modularized pipeline. Our approach integrates image data and geotemporal metadata to first predict depth markers, segmenting the soil profile into horizon candidates. Each segment is characterized by a set of horizon-specific morphological features. Finally, horizon labels are predicted based on the multimodal concatenated feature vector, leveraging a graph-based label representation to account for the complex hierarchical relationships among soil horizons. Our method is designed to address complex hierarchical classification, where the number of possible labels is very large, imbalanced and non-trivially structured. We demonstrate the effectiveness of our approach on a real-world soil profile dataset. All code and experiments can be found in our repository: this https URL

Title: HPSv3: Towards Wide-Spectrum Human Preference Score

Authors: Yuhang Ma, Xiaoshi Wu, Keqiang Sun, Hongsheng Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.03789
Pdf URL: https://arxiv.org/pdf/2508.03789
Copy Paste: [[2508.03789]] HPSv3: Towards Wide-Spectrum Human Preference Score(https://arxiv.org/abs/2508.03789)
Keywords: robust, extraction, generative
Abstract: Evaluating text-to-image generation models requires alignment with human perception, yet existing human-centric metrics are constrained by limited data coverage, suboptimal feature extraction, and inefficient loss functions. To address these challenges, we introduce Human Preference Score v3 (HPSv3). (1) We release HPDv3, the first wide-spectrum human preference dataset integrating 1.08M text-image pairs and 1.17M annotated pairwise comparisons from state-of-the-art generative models and low to high-quality real-world images. (2) We introduce a VLM-based preference model trained using an uncertainty-aware ranking loss for fine-grained ranking. Besides, we propose Chain-of-Human-Preference (CoHP), an iterative image refinement method that enhances quality without extra data, using HPSv3 to select the best image at each step. Extensive experiments demonstrate that HPSv3 serves as a robust metric for wide-spectrum image evaluation, and CoHP offers an efficient and human-aligned approach to improve image generation quality. The code and dataset are available at the HPSv3 Homepage.

Title: AttnTrace: Attention-based Context Traceback for Long-Context LLMs

Authors: Yanting Wang, Runpeng Geng, Ying Chen, Jinyuan Jia
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2508.03793
Pdf URL: https://arxiv.org/pdf/2508.03793
Copy Paste: [[2508.03793]] AttnTrace: Attention-based Context Traceback for Long-Context LLMs(https://arxiv.org/abs/2508.03793)
Keywords: attack, interpretability, large language model
Abstract: Long-context large language models (LLMs), such as Gemini-2.5-Pro and Claude-Sonnet-4, are increasingly used to empower advanced AI systems, including retrieval-augmented generation (RAG) pipelines and autonomous agents. In these systems, an LLM receives an instruction along with a context--often consisting of texts retrieved from a knowledge database or memory--and generates a response that is contextually grounded by following the instruction. Recent studies have designed solutions to trace back to a subset of texts in the context that contributes most to the response generated by the LLM. These solutions have numerous real-world applications, including performing post-attack forensic analysis and improving the interpretability and trustworthiness of LLM outputs. While significant efforts have been made, state-of-the-art solutions such as TracLLM often lead to a high computation cost, e.g., it takes TracLLM hundreds of seconds to perform traceback for a single response-context pair. In this work, we propose AttnTrace, a new context traceback method based on the attention weights produced by an LLM for a prompt. To effectively utilize attention weights, we introduce two techniques designed to enhance the effectiveness of AttnTrace, and we provide theoretical insights for our design choice. We also perform a systematic evaluation for AttnTrace. The results demonstrate that AttnTrace is more accurate and efficient than existing state-of-the-art context traceback methods. We also show that AttnTrace can improve state-of-the-art methods in detecting prompt injection under long contexts through the attribution-before-detection paradigm. As a real-world application, we demonstrate that AttnTrace can effectively pinpoint injected instructions in a paper designed to manipulate LLM-generated reviews. The code is at this https URL.

Title: Majority Bit-Aware Watermarking For Large Language Models

Authors: Jiahao Xu, Rui Hu, Zikai Zhang
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2508.03829
Pdf URL: https://arxiv.org/pdf/2508.03829
Copy Paste: [[2508.03829]] Majority Bit-Aware Watermarking For Large Language Models(https://arxiv.org/abs/2508.03829)
Keywords: watermark, large language model
Abstract: The growing deployment of Large Language Models (LLMs) in real-world applications has raised concerns about their potential misuse in generating harmful or deceptive content. To address this issue, watermarking techniques have emerged as a promising solution by embedding identifiable binary messages into generated text for origin verification and misuse tracing. While recent efforts have explored multi-bit watermarking schemes capable of embedding rich information such as user identifiers, they typically suffer from the fundamental trade-off between text quality and decoding accuracy: to ensure reliable message decoding, they have to restrict the size of preferred token sets during encoding, yet such restrictions reduce the quality of the generated content. In this work, we propose MajorMark, a novel watermarking method that improves this trade-off through majority bit-aware encoding. MajorMark selects preferred token sets based on the majority bit of the message, enabling a larger and more flexible sampling of tokens. In contrast to prior methods that rely on token frequency analysis for decoding, MajorMark employs a clustering-based decoding strategy, which maintains high decoding accuracy even when the preferred token set is large, thus preserving both content quality and decoding accuracy. We further introduce MajorMark$^+$, which partitions the message into multiple blocks to independently encode and deterministically decode each block, thereby further enhancing the quality of watermarked text and improving decoding accuracy. Extensive experiments on state-of-the-art LLMs demonstrate that our methods significantly enhance both decoding accuracy and text generation quality, outperforming prior multi-bit watermarking baselines.

Title: DP-NCB: Privacy Preserving Fair Bandits

Authors: Dhruv Sarkar, Nishant Pandey, Sayak Ray Chowdhury
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2508.03836
Pdf URL: https://arxiv.org/pdf/2508.03836
Copy Paste: [[2508.03836]] DP-NCB: Privacy Preserving Fair Bandits(https://arxiv.org/abs/2508.03836)
Keywords: privacy, protect, fair
Abstract: Multi-armed bandit algorithms are fundamental tools for sequential decision-making under uncertainty, with widespread applications across domains such as clinical trials and personalized decision-making. As bandit algorithms are increasingly deployed in these socially sensitive settings, it becomes critical to protect user data privacy and ensure fair treatment across decision rounds. While prior work has independently addressed privacy and fairness in bandit settings, the question of whether both objectives can be achieved simultaneously has remained largely open. Existing privacy-preserving bandit algorithms typically optimize average regret, a utilitarian measure, whereas fairness-aware approaches focus on minimizing Nash regret, which penalizes inequitable reward distributions, but often disregard privacy concerns. To bridge this gap, we introduce Differentially Private Nash Confidence Bound (DP-NCB)-a novel and unified algorithmic framework that simultaneously ensures $\epsilon$-differential privacy and achieves order-optimal Nash regret, matching known lower bounds up to logarithmic factors. The framework is sufficiently general to operate under both global and local differential privacy models, and is anytime, requiring no prior knowledge of the time horizon. We support our theoretical guarantees with simulations on synthetic bandit instances, showing that DP-NCB incurs substantially lower Nash regret than state-of-the-art baselines. Our results offer a principled foundation for designing bandit algorithms that are both privacy-preserving and fair, making them suitable for high-stakes, socially impactful applications.

Title: VAE-DNN: Energy-Efficient Trainable-by-Parts Surrogate Model For Parametric Partial Differential Equations

Authors: Yifei Zong, Alexandre M. Tartakovsky
Subjects: cs.LG, cs.AI, cs.CE
Abstract URL: https://arxiv.org/abs/2508.03839
Pdf URL: https://arxiv.org/pdf/2508.03839
Copy Paste: [[2508.03839]] VAE-DNN: Energy-Efficient Trainable-by-Parts Surrogate Model For Parametric Partial Differential Equations(https://arxiv.org/abs/2508.03839)
Keywords: diffusion
Abstract: We propose a trainable-by-parts surrogate model for solving forward and inverse parameterized nonlinear partial differential equations. Like several other surrogate and operator learning models, the proposed approach employs an encoder to reduce the high-dimensional input $y(\bm{x})$ to a lower-dimensional latent space, $\bm\mu_{\bm\phi_y}$. Then, a fully connected neural network is used to map $\bm\mu_{\bm\phi_y}$ to the latent space, $\bm\mu_{\bm\phi_h}$, of the PDE solution $h(\bm{x},t)$. Finally, a decoder is utilized to reconstruct $h(\bm{x},t)$. The innovative aspect of our model is its ability to train its three components independently. This approach leads to a substantial decrease in both the time and energy required for training when compared to leading operator learning models such as FNO and DeepONet. The separable training is achieved by training the encoder as part of the variational autoencoder (VAE) for $y(\bm{x})$ and the decoder as part of the $h(\bm{x},t)$ VAE. We refer to this model as the VAE-DNN model. VAE-DNN is compared to the FNO and DeepONet models for obtaining forward and inverse solutions to the nonlinear diffusion equation governing groundwater flow in an unconfined aquifer. Our findings indicate that VAE-DNN not only demonstrates greater efficiency but also delivers superior accuracy in both forward and inverse solutions compared to the FNO and DeepONet models.

Title: Hallucination to Truth: A Review of Fact-Checking and Factuality Evaluation in Large Language Models

Authors: Subhey Sadi Rahman, Md. Adnanul Islam, Md. Mahbub Alam, Musarrat Zeba, Md. Abdur Rahman, Sadia Sultana Chowa, Mohaimenul Azam Khan Raiaan, Sami Azam
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.03860
Pdf URL: https://arxiv.org/pdf/2508.03860
Copy Paste: [[2508.03860]] Hallucination to Truth: A Review of Fact-Checking and Factuality Evaluation in Large Language Models(https://arxiv.org/abs/2508.03860)
Keywords: robust, large language model
Abstract: Large Language Models (LLMs) are trained on vast and diverse internet corpora that often include inaccurate or misleading content. Consequently, LLMs can generate misinformation, making robust fact-checking essential. This review systematically analyzes how LLM-generated content is evaluated for factual accuracy by exploring key challenges such as hallucinations, dataset limitations, and the reliability of evaluation metrics. The review emphasizes the need for strong fact-checking frameworks that integrate advanced prompting strategies, domain-specific fine-tuning, and retrieval-augmented generation (RAG) methods. It proposes five research questions that guide the analysis of the recent literature from 2020 to 2025, focusing on evaluation methods and mitigation techniques. The review also discusses the role of instruction tuning, multi-agent reasoning, and external knowledge access via RAG frameworks. Key findings highlight the limitations of current metrics, the value of grounding outputs with validated external evidence, and the importance of domain-specific customization to improve factual consistency. Overall, the review underlines the importance of building LLMs that are not only accurate and explainable but also tailored for domain-specific fact-checking. These insights contribute to the advancement of research toward more trustworthy and context-aware language models.

Title: Data-Driven Spectrum Demand Prediction: A Spatio-Temporal Framework with Transfer Learning

Authors: Amin Farajzadeh, Hongzhao Zheng, Sarah Dumoulin, Trevor Ha, Halim Yanikomeroglu, Amir Ghasemi
Subjects: cs.LG, cs.NI, eess.SP
Abstract URL: https://arxiv.org/abs/2508.03863
Pdf URL: https://arxiv.org/pdf/2508.03863
Copy Paste: [[2508.03863]] Data-Driven Spectrum Demand Prediction: A Spatio-Temporal Framework with Transfer Learning(https://arxiv.org/abs/2508.03863)
Keywords: robust, fair
Abstract: Accurate spectrum demand prediction is crucial for informed spectrum allocation, effective regulatory planning, and fostering sustainable growth in modern wireless communication networks. It supports governmental efforts, particularly those led by the international telecommunication union (ITU), to establish fair spectrum allocation policies, improve auction mechanisms, and meet the requirements of emerging technologies such as advanced 5G, forthcoming 6G, and the internet of things (IoT). This paper presents an effective spatio-temporal prediction framework that leverages crowdsourced user-side key performance indicators (KPIs) and regulatory datasets to model and forecast spectrum demand. The proposed methodology achieves superior prediction accuracy and cross-regional generalizability by incorporating advanced feature engineering, comprehensive correlation analysis, and transfer learning techniques. Unlike traditional ITU models, which are often constrained by arbitrary inputs and unrealistic assumptions, this approach exploits granular, data-driven insights to account for spatial and temporal variations in spectrum utilization. Comparative evaluations against ITU estimates, as the benchmark, underscore our framework's capability to deliver more realistic and actionable predictions. Experimental results validate the efficacy of our methodology, highlighting its potential as a robust approach for policymakers and regulatory bodies to enhance spectrum management and planning.

Title: An Entity Linking Agent for Question Answering

Authors: Yajie Luo, Yihong Wu, Muzhi Li, Fengran Mo, Jia Ao Sun, Xinyu Wang, Liheng Ma, Yingxue Zhang, Jian-Yun Nie
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.03865
Pdf URL: https://arxiv.org/pdf/2508.03865
Copy Paste: [[2508.03865]] An Entity Linking Agent for Question Answering(https://arxiv.org/abs/2508.03865)
Keywords: robust, large language model
Abstract: Some Question Answering (QA) systems rely on knowledge bases (KBs) to provide accurate answers. Entity Linking (EL) plays a critical role in linking natural language mentions to KB entries. However, most existing EL methods are designed for long contexts and do not perform well on short, ambiguous user questions in QA tasks. We propose an entity linking agent for QA, based on a Large Language Model that simulates human cognitive workflows. The agent actively identifies entity mentions, retrieves candidate entities, and makes decision. To verify the effectiveness of our agent, we conduct two experiments: tool-based entity linking and QA task evaluation. The results confirm the robustness and effectiveness of our agent.

Title: RX-INT: A Kernel Engine for Real-Time Detection and Analysis of In-Memory Threats

Authors: Arjun Juneja
Subjects: cs.CR, cs.OS
Abstract URL: https://arxiv.org/abs/2508.03879
Pdf URL: https://arxiv.org/pdf/2508.03879
Copy Paste: [[2508.03879]] RX-INT: A Kernel Engine for Real-Time Detection and Analysis of In-Memory Threats(https://arxiv.org/abs/2508.03879)
Keywords: security, attack
Abstract: Malware and cheat developers use fileless execution techniques to evade traditional, signature-based security products. These methods include various types of manual mapping, module stomping, and threadless injection which work entirely within the address space of a legitimate process, presenting a challenge for detection due to ambiguity between what is legitimate and what isn't. Existing tools often have weaknesses, such as a dependency on Portable Executable (PE) structures or a vulnerability to time-of-check-to-time-of-use (TOCTOU) race conditions where an adversary cleans up before a periodic scan has the chance to occur. To address this gap, we present RX-INT, a kernel-assisted system featuring an architecture that provides resilience against TOCTOU attacks. RX-INT introduces a detection engine that combines a real-time thread creation monitor with a stateful Virtual Address Descriptor (VAD) scanner alongside various heuristics within. This engine snapshots both private and image-backed memory regions, using real-time memory hashing to detect illicit modifications like module stomping. Critically, we demonstrate a higher detection rate in certain benchmarks of this approach through a direct comparison with PE-sieve, a commonly used and powerful memory forensics tool. In our evaluation, RX-INT successfully detected a manually mapped region that was not identified by PE-sieve. We then conclude that our architecture represents a tangible difference in the detection of fileless threats, with direct applications in the fields of anti-cheat and memory security.

Title: Simulating Cyberattacks through a Breach Attack Simulation (BAS) Platform empowered by Security Chaos Engineering (SCE)

Authors: Arturo Sánchez-Matas, Pablo Escribano Ruiz, Daniel Díaz-López, Angel Luis Perales Gómez, Pantaleone Nespoli, Gregorio Martínez Pérez
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2508.03882
Pdf URL: https://arxiv.org/pdf/2508.03882
Copy Paste: [[2508.03882]] Simulating Cyberattacks through a Breach Attack Simulation (BAS) Platform empowered by Security Chaos Engineering (SCE)(https://arxiv.org/abs/2508.03882)
Keywords: security, defense, attack
Abstract: In today digital landscape, organizations face constantly evolving cyber threats, making it essential to discover slippery attack vectors through novel techniques like Security Chaos Engineering (SCE), which allows teams to test defenses and identify vulnerabilities effectively. This paper proposes to integrate SCE into Breach Attack Simulation (BAS) platforms, leveraging adversary profiles and abilities from existing threat intelligence databases. This innovative proposal for cyberattack simulation employs a structured architecture composed of three layers: SCE Orchestrator, Connector, and BAS layers. Utilizing MITRE Caldera in the BAS layer, our proposal executes automated attack sequences, creating inferred attack trees from adversary profiles. Our proposal evaluation illustrates how integrating SCE with BAS can enhance the effectiveness of attack simulations beyond traditional scenarios, and be a useful component of a cyber defense strategy.

Title: Calibrating Biophysical Models for Grape Phenology Prediction via Multi-Task Learning

Authors: William Solow, Sandhya Saisubramanian
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.03898
Pdf URL: https://arxiv.org/pdf/2508.03898
Copy Paste: [[2508.03898]] Calibrating Biophysical Models for Grape Phenology Prediction via Multi-Task Learning(https://arxiv.org/abs/2508.03898)
Keywords: robust
Abstract: Accurate prediction of grape phenology is essential for timely vineyard management decisions, such as scheduling irrigation and fertilization, to maximize crop yield and quality. While traditional biophysical models calibrated on historical field data can be used for season-long predictions, they lack the precision required for fine-grained vineyard management. Deep learning methods are a compelling alternative but their performance is hindered by sparse phenology datasets, particularly at the cultivar level. We propose a hybrid modeling approach that combines multi-task learning with a recurrent neural network to parameterize a differentiable biophysical model. By using multi-task learning to predict the parameters of the biophysical model, our approach enables shared learning across cultivars while preserving biological structure, thereby improving the robustness and accuracy of predictions. Empirical evaluation using real-world and synthetic datasets demonstrates that our method significantly outperforms both conventional biophysical models and baseline deep learning approaches in predicting phenological stages, as well as other crop state variables such as cold-hardiness and wheat yield.

Title: Sotopia-RL: Reward Design for Social Intelligence

Authors: Haofei Yu, Zhengyang Qi, Yining Zhao, Kolby Nottingham, Keyang Xuan, Bodhisattwa Prasad Majumder, Hao Zhu, Paul Pu Liang, Jiaxuan You
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.03905
Pdf URL: https://arxiv.org/pdf/2508.03905
Copy Paste: [[2508.03905]] Sotopia-RL: Reward Design for Social Intelligence(https://arxiv.org/abs/2508.03905)
Keywords: large language model
Abstract: Social intelligence has become a critical capability for large language models (LLMs), enabling them to engage effectively in real-world social tasks such as accommodation, persuasion, collaboration, and negotiation. Reinforcement learning (RL) is a natural fit for training socially intelligent agents because it allows models to learn sophisticated strategies directly through social interactions. However, social interactions have two key characteristics that set barriers for RL training: (1) partial observability, where utterances have indirect and delayed effects that complicate credit assignment, and (2) multi-dimensionality, where behaviors such as rapport-building or knowledge-seeking contribute indirectly to goal achievement. These characteristics make Markov decision process (MDP)-based RL with single-dimensional episode-level rewards inefficient and unstable. To address these challenges, we propose Sotopia-RL, a novel framework that refines coarse episode-level feedback into utterance-level, multi-dimensional rewards. Utterance-level credit assignment mitigates partial observability by attributing outcomes to individual utterances, while multi-dimensional rewards capture the full richness of social interactions and reduce reward hacking. Experiments in Sotopia, an open-ended social learning environment, demonstrate that Sotopia-RL achieves state-of-the-art social goal completion scores (7.17 on Sotopia-hard and 8.31 on Sotopia-full), significantly outperforming existing approaches. Ablation studies confirm the necessity of both utterance-level credit assignment and multi-dimensional reward design for RL training. Our implementation is publicly available at: this https URL.

Title: CoAct-1: Computer-using Agents with Coding as Actions

Authors: Linxin Song, Yutong Dai, Viraj Prabhu, Jieyu Zhang, Taiwei Shi, Li Li, Junnan Li, Silvio Savarese, Zeyuan Chen, Jieyu Zhao, Ran Xu, Caiming Xiong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.03923
Pdf URL: https://arxiv.org/pdf/2508.03923
Copy Paste: [[2508.03923]] CoAct-1: Computer-using Agents with Coding as Actions(https://arxiv.org/abs/2508.03923)
Keywords: robust
Abstract: Autonomous agents that operate computers via Graphical User Interfaces (GUIs) often struggle with efficiency and reliability on complex, long-horizon tasks. While augmenting these agents with planners can improve task decomposition, they remain constrained by the inherent limitations of performing all actions through GUI manipulation, leading to brittleness and inefficiency. In this work, we introduce a more robust and flexible paradigm: enabling agents to use coding as a enhanced action. We present CoAct-1, a novel multi-agent system that synergistically combines GUI-based control with direct programmatic execution. CoAct-1 features an Orchestrator that dynamically delegates subtasks to either a conventional GUI Operator or a specialized Programmer agent, which can write and execute Python or Bash scripts. This hybrid approach allows the agent to bypass inefficient GUI action sequences for tasks like file management and data processing, while still leveraging visual interaction when necessary. We evaluate our system on the challenging OSWorld benchmark, where CoAct-1 achieves a new state-of-the-art success rate of 60.76%, significantly outperforming prior methods. Furthermore, our approach dramatically improves efficiency, reducing the average number of steps required to complete a task to just 10.15, compared to 15 for leading GUI agents. Our results demonstrate that integrating coding as a core action provides a more powerful, efficient, and scalable path toward generalized computer automation.

Title: Point-Based Shape Representation Generation with a Correspondence-Preserving Diffusion Model

Authors: Shen Zhu, Yinzhu Jin, Ifrah Zawar, P. Thomas Fletcher
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2508.03925
Pdf URL: https://arxiv.org/pdf/2508.03925
Copy Paste: [[2508.03925]] Point-Based Shape Representation Generation with a Correspondence-Preserving Diffusion Model(https://arxiv.org/abs/2508.03925)
Keywords: diffusion, generative
Abstract: We propose a diffusion model designed to generate point-based shape representations with correspondences. Traditional statistical shape models have considered point correspondences extensively, but current deep learning methods do not take them into account, focusing on unordered point clouds instead. Current deep generative models for point clouds do not address generating shapes with point correspondences between generated shapes. This work aims to formulate a diffusion model that is capable of generating realistic point-based shape representations, which preserve point correspondences that are present in the training data. Using shape representation data with correspondences derived from Open Access Series of Imaging Studies 3 (OASIS-3), we demonstrate that our correspondence-preserving model effectively generates point-based hippocampal shape representations that are highly realistic compared to existing methods. We further demonstrate the applications of our generative model by downstream tasks, such as conditional generation of healthy and AD subjects and predicting morphological changes of disease progression by counterfactual generation.

Title: Next Generation Equation-Free Multiscale Modelling of Crowd Dynamics via Machine Learning

Authors: Hector Vargas Alvarez, Dimitrios G. Patsatzis, Lucia Russo, Ioannis Kevrekidis, Constantinos Siettos
Subjects: cs.LG, math.DS, math.NA
Abstract URL: https://arxiv.org/abs/2508.03926
Pdf URL: https://arxiv.org/pdf/2508.03926
Copy Paste: [[2508.03926]] Next Generation Equation-Free Multiscale Modelling of Crowd Dynamics via Machine Learning(https://arxiv.org/abs/2508.03926)
Keywords: robust
Abstract: Bridging the microscopic and the macroscopic modelling scales in crowd dynamics constitutes an important, open challenge for systematic numerical analysis, optimization, and control. We propose a combined manifold and machine learning approach to learn the discrete evolution operator for the emergent crowd dynamics in latent spaces from high-fidelity agent-based simulations. The proposed framework builds upon our previous works on next-generation Equation-free algorithms on learning surrogate models for high-dimensional and multiscale systems. Our approach is a four-stage one, explicitly conserving the mass of the reconstructed dynamics in the high-dimensional space. In the first step, we derive continuous macroscopic fields (densities) from discrete microscopic data (pedestrians' positions) using KDE. In the second step, based on manifold learning, we construct a map from the macroscopic ambient space into the latent space parametrized by a few coordinates based on POD of the corresponding density distribution. The third step involves learning reduced-order surrogate ROMs in the latent space using machine learning techniques, particularly LSTMs networks and MVARs. Finally, we reconstruct the crowd dynamics in the high-dimensional space in terms of macroscopic density profiles. We demonstrate that the POD reconstruction of the density distribution via SVD conserves the mass. With this "embed->learn in latent space->lift back to the ambient space" pipeline, we create an effective solution operator of the unavailable macroscopic PDE for the density evolution. For our illustrations, we use the Social Force Model to generate data in a corridor with an obstacle, imposing periodic boundary conditions. The numerical results demonstrate high accuracy, robustness, and generalizability, thus allowing for fast and accurate modelling/simulation of crowd dynamics from agent-based simulations.

Title: Markov Chain Estimation with In-Context Learning

Authors: Simon Lepage, Jeremie Mary, David Picard
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.03934
Pdf URL: https://arxiv.org/pdf/2508.03934
Copy Paste: [[2508.03934]] Markov Chain Estimation with In-Context Learning(https://arxiv.org/abs/2508.03934)
Keywords: robust, transformer
Abstract: We investigate the capacity of transformers to learn algorithms involving their context while solely being trained using next token prediction. We set up Markov chains with random transition matrices and we train transformers to predict the next token. Matrices used during training and test are different and we show that there is a threshold in transformer size and in training set size above which the model is able to learn to estimate the transition probabilities from its context instead of memorizing the training patterns. Additionally, we show that more involved encoding of the states enables more robust prediction for Markov chains with structures different than those seen during training.

Title: CAP-LLM: Context-Augmented Personalized Large Language Models for News Headline Generation

Authors: Raymond Wilson, Cole Graham, Chase Carter, Zefeng Yang, Ruiqi Gu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.03935
Pdf URL: https://arxiv.org/pdf/2508.03935
Copy Paste: [[2508.03935]] CAP-LLM: Context-Augmented Personalized Large Language Models for News Headline Generation(https://arxiv.org/abs/2508.03935)
Keywords: robust, large language model
Abstract: In the era of information overload, personalized news headline generation is crucial for engaging users by tailoring content to their preferences while accurately conveying news facts. Existing methods struggle with effectively capturing complex user interests and ensuring factual consistency, often leading to generic or misleading headlines. Leveraging the unprecedented capabilities of Large Language Models (LLMs) in text generation, we propose Context-Augmented Personalized LLM (CAP-LLM), a novel framework that integrates user preferences and factual consistency constraints into a powerful pre-trained LLM backbone. CAP-LLM features a User Preference Encoder to capture long-term user interests, a Context Injection Adapter to seamlessly integrate these preferences and current article context into the LLM's generation process, and a Fact-Consistency Reinforcement Module employing a novel contrastive loss to mitigate hallucination. Evaluated on the real-world PENS dataset, CAP-LLM achieves state-of-the-art performance across all metrics. Notably, it significantly improves factual consistency (FactCC of 87.50) over strong baselines like BART (86.67), while simultaneously enhancing personalization (Pc(avg) 2.73, Pc(max) 17.25) and content coverage (ROUGE-1 26.55, ROUGE-2 9.95, ROUGE-L 23.01). Our ablation studies, human evaluations, and sensitivity analyses further validate the effectiveness of each component and the robustness of our approach, demonstrating CAP-LLM's ability to achieve a superior balance between personalization and factual accuracy in news headline generation.

Title: ASTRA: Autonomous Spatial-Temporal Red-teaming for AI Software Assistants

Authors: Xiangzhe Xu, Guangyu Shen, Zian Su, Siyuan Cheng, Hanxi Guo, Lu Yan, Xuan Chen, Jiasheng Jiang, Xiaolong Jin, Chengpeng Wang, Zhuo Zhang, Xiangyu Zhang
Subjects: cs.CR, cs.CL, cs.LG, cs.SE
Abstract URL: https://arxiv.org/abs/2508.03936
Pdf URL: https://arxiv.org/pdf/2508.03936
Copy Paste: [[2508.03936]] ASTRA: Autonomous Spatial-Temporal Red-teaming for AI Software Assistants(https://arxiv.org/abs/2508.03936)
Keywords: security
Abstract: AI coding assistants like GitHub Copilot are rapidly transforming software development, but their safety remains deeply uncertain-especially in high-stakes domains like cybersecurity. Current red-teaming tools often rely on fixed benchmarks or unrealistic prompts, missing many real-world vulnerabilities. We present ASTRA, an automated agent system designed to systematically uncover safety flaws in AI-driven code generation and security guidance systems. ASTRA works in three stages: (1) it builds structured domain-specific knowledge graphs that model complex software tasks and known weaknesses; (2) it performs online vulnerability exploration of each target model by adaptively probing both its input space, i.e., the spatial exploration, and its reasoning processes, i.e., the temporal exploration, guided by the knowledge graphs; and (3) it generates high-quality violation-inducing cases to improve model alignment. Unlike prior methods, ASTRA focuses on realistic inputs-requests that developers might actually ask-and uses both offline abstraction guided domain modeling and online domain knowledge graph adaptation to surface corner-case vulnerabilities. Across two major evaluation domains, ASTRA finds 11-66% more issues than existing techniques and produces test cases that lead to 17% more effective alignment training, showing its practical value for building safer AI systems.

Title: FairPOT: Balancing AUC Performance and Fairness with Proportional Optimal Transport

Authors: Pengxi Liu, Yi Shen, Matthew M. Engelhard, Benjamin A. Goldstein, Michael J. Pencina, Nicoleta J. Economou-Zavlanos, Michael M. Zavlanos
Subjects: cs.LG, cs.AI, cs.CY, stat.ML
Abstract URL: https://arxiv.org/abs/2508.03940
Pdf URL: https://arxiv.org/pdf/2508.03940
Copy Paste: [[2508.03940]] FairPOT: Balancing AUC Performance and Fairness with Proportional Optimal Transport(https://arxiv.org/abs/2508.03940)
Keywords: fair
Abstract: Fairness metrics utilizing the area under the receiver operator characteristic curve (AUC) have gained increasing attention in high-stakes domains such as healthcare, finance, and criminal justice. In these domains, fairness is often evaluated over risk scores rather than binary outcomes, and a common challenge is that enforcing strict fairness can significantly degrade AUC performance. To address this challenge, we propose Fair Proportional Optimal Transport (FairPOT), a novel, model-agnostic post-processing framework that strategically aligns risk score distributions across different groups using optimal transport, but does so selectively by transforming a controllable proportion, i.e., the top-lambda quantile, of scores within the disadvantaged group. By varying lambda, our method allows for a tunable trade-off between reducing AUC disparities and maintaining overall AUC performance. Furthermore, we extend FairPOT to the partial AUC setting, enabling fairness interventions to concentrate on the highest-risk regions. Extensive experiments on synthetic, public, and clinical datasets show that FairPOT consistently outperforms existing post-processing techniques in both global and partial AUC scenarios, often achieving improved fairness with slight AUC degradation or even positive gains in utility. The computational efficiency and practical adaptability of FairPOT make it a promising solution for real-world deployment.

Title: Policy to Assist Iteratively Local Segmentation: Optimising Modality and Location Selection for Prostate Cancer Localisation

Authors: Xiangcen Wu, Shaheer U. Saeed, Yipei Wang, Ester Bonmati Coll, Yipeng Hu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.03953
Pdf URL: https://arxiv.org/pdf/2508.03953
Copy Paste: [[2508.03953]] Policy to Assist Iteratively Local Segmentation: Optimising Modality and Location Selection for Prostate Cancer Localisation(https://arxiv.org/abs/2508.03953)
Keywords: segmentation
Abstract: Radiologists often mix medical image reading strategies, including inspection of individual modalities and local image regions, using information at different locations from different images independently as well as concurrently. In this paper, we propose a recommend system to assist machine learning-based segmentation models, by suggesting appropriate image portions along with the best modality, such that prostate cancer segmentation performance can be maximised. Our approach trains a policy network that assists tumor localisation, by recommending both the optimal imaging modality and the specific sections of interest for review. During training, a pre-trained segmentation network mimics radiologist inspection on individual or variable combinations of these imaging modalities and their sections - selected by the policy network. Taking the locally segmented regions as an input for the next step, this dynamic decision making process iterates until all cancers are best localised. We validate our method using a data set of 1325 labelled multiparametric MRI images from prostate cancer patients, demonstrating its potential to improve annotation efficiency and segmentation accuracy, especially when challenging pathology is present. Experimental results show that our approach can surpass standard segmentation networks. Perhaps more interestingly, our trained agent independently developed its own optimal strategy, which may or may not be consistent with current radiologist guidelines such as PI-RADS. This observation also suggests a promising interactive application, in which the proposed policy networks assist human radiologists.

Title: BubbleONet: A Physics-Informed Neural Operator for High-Frequency Bubble Dynamics

Authors: Yunhao Zhang, Lin Cheng, Aswin Gnanaskandan, Ameya D. Jagtap
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.03965
Pdf URL: https://arxiv.org/pdf/2508.03965
Copy Paste: [[2508.03965]] BubbleONet: A Physics-Informed Neural Operator for High-Frequency Bubble Dynamics(https://arxiv.org/abs/2508.03965)
Keywords: robust
Abstract: This paper introduces BubbleONet, an operator learning model designed to map pressure profiles from an input function space to corresponding bubble radius responses. BubbleONet is built upon the physics-informed deep operator network (PI-DeepONet) framework, leveraging DeepONet's powerful universal approximation capabilities for operator learning alongside the robust physical fidelity provided by the physics-informed neural networks. To mitigate the inherent spectral bias in deep learning, BubbleONet integrates the Rowdy adaptive activation function, enabling improved representation of high-frequency features. The model is evaluated across various scenarios, including: (1) Rayleigh-Plesset equation based bubble dynamics with a single initial radius, (2) Keller-Miksis equation based bubble dynamics with a single initial radius, and (3) Keller-Miksis equation based bubble dynamics with multiple initial radii. Moreover, the performance of single-step versus two-step training techniques for BubbleONet is investigated. The results demonstrate that BubbleONet serves as a promising surrogate model for simulating bubble dynamics, offering a computationally efficient alternative to traditional numerical solvers.

Title: RAVID: Retrieval-Augmented Visual Detection: A Knowledge-Driven Approach for AI-Generated Image Identification

Authors: Mamadou Keita, Wassim Hamidouche, Hessen Bougueffa Eutamene, Abdelmalik Taleb-Ahmed, Abdenour Hadid
Subjects: cs.CV, cs.CR, cs.IR
Abstract URL: https://arxiv.org/abs/2508.03967
Pdf URL: https://arxiv.org/pdf/2508.03967
Copy Paste: [[2508.03967]] RAVID: Retrieval-Augmented Visual Detection: A Knowledge-Driven Approach for AI-Generated Image Identification(https://arxiv.org/abs/2508.03967)
Keywords: robust, generative
Abstract: In this paper, we introduce RAVID, the first framework for AI-generated image detection that leverages visual retrieval-augmented generation (RAG). While RAG methods have shown promise in mitigating factual inaccuracies in foundation models, they have primarily focused on text, leaving visual knowledge underexplored. Meanwhile, existing detection methods, which struggle with generalization and robustness, often rely on low-level artifacts and model-specific features, limiting their adaptability. To address this, RAVID dynamically retrieves relevant images to enhance detection. Our approach utilizes a fine-tuned CLIP image encoder, RAVID CLIP, enhanced with category-related prompts to improve representation learning. We further integrate a vision-language model (VLM) to fuse retrieved images with the query, enriching the input and improving accuracy. Given a query image, RAVID generates an embedding using RAVID CLIP, retrieves the most relevant images from a database, and combines these with the query image to form an enriched input for a VLM (e.g., Qwen-VL or Openflamingo). Experiments on the UniversalFakeDetect benchmark, which covers 19 generative models, show that RAVID achieves state-of-the-art performance with an average accuracy of 93.85%. RAVID also outperforms traditional methods in terms of robustness, maintaining high accuracy even under image degradations such as Gaussian blur and JPEG compression. Specifically, RAVID achieves an average accuracy of 80.27% under degradation conditions, compared to 63.44% for the state-of-the-art model C2P-CLIP, demonstrating consistent improvements in both Gaussian blur and JPEG compression scenarios. The code will be publicly available upon acceptance.

Title: Data and AI governance: Promoting equity, ethics, and fairness in large language models

Authors: Alok Abhishek, Lisa Erickson, Tushar Bandopadhyay
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.03970
Pdf URL: https://arxiv.org/pdf/2508.03970
Copy Paste: [[2508.03970]] Data and AI governance: Promoting equity, ethics, and fairness in large language models(https://arxiv.org/abs/2508.03970)
Keywords: protect, fair, generative, large language model
Abstract: In this paper, we cover approaches to systematically govern, assess and quantify bias across the complete life cycle of machine learning models, from initial development and validation to ongoing production monitoring and guardrail implementation. Building upon our foundational work on the Bias Evaluation and Assessment Test Suite (BEATS) for Large Language Models, the authors share prevalent bias and fairness related gaps in Large Language Models (LLMs) and discuss data and AI governance framework to address Bias, Ethics, Fairness, and Factuality within LLMs. The data and AI governance approach discussed in this paper is suitable for practical, real-world applications, enabling rigorous benchmarking of LLMs prior to production deployment, facilitating continuous real-time evaluation, and proactively governing LLM generated responses. By implementing the data and AI governance across the life cycle of AI development, organizations can significantly enhance the safety and responsibility of their GenAI systems, effectively mitigating risks of discrimination and protecting against potential reputational or brand-related harm. Ultimately, through this article, we aim to contribute to advancement of the creation and deployment of socially responsible and ethically aligned generative artificial intelligence powered applications.

Title: Dynamic User-controllable Privacy-preserving Few-shot Sensing Framework

Authors: Ajesh Koyatan Chathoth, Shuhao Yu, Stephen Lee
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.03989
Pdf URL: https://arxiv.org/pdf/2508.03989
Copy Paste: [[2508.03989]] Dynamic User-controllable Privacy-preserving Few-shot Sensing Framework(https://arxiv.org/abs/2508.03989)
Keywords: privacy, protect
Abstract: User-controllable privacy is important in modern sensing systems, as privacy preferences can vary significantly from person to person and may evolve over time. This is especially relevant in devices equipped with Inertial Measurement Unit (IMU) sensors, such as smartphones and wearables, which continuously collect rich time-series data that can inadvertently expose sensitive user behaviors. While prior work has proposed privacy-preserving methods for sensor data, most rely on static, predefined privacy labels or require large quantities of private training data, limiting their adaptability and user agency. In this work, we introduce PrivCLIP, a dynamic, user-controllable, few-shot privacy-preserving sensing framework. PrivCLIP allows users to specify and modify their privacy preferences by categorizing activities as sensitive (black-listed), non-sensitive (white-listed), or neutral (gray-listed). Leveraging a multimodal contrastive learning approach, PrivCLIP aligns IMU sensor data with natural language activity descriptions in a shared embedding space, enabling few-shot detection of sensitive activities. When a privacy-sensitive activity is identified, the system uses a language-guided activity sanitizer and a motion generation module (IMU-GPT) to transform the original data into a privacy-compliant version that semantically resembles a non-sensitive activity. We evaluate PrivCLIP on multiple human activity recognition datasets and demonstrate that it significantly outperforms baseline methods in terms of both privacy protection and data utility.

Title: Are Today's LLMs Ready to Explain Well-Being Concepts?

Authors: Bohan Jiang, Dawei Li, Zhen Tan, Chengshuai Zhao, Huan Liu
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2508.03990
Pdf URL: https://arxiv.org/pdf/2508.03990
Copy Paste: [[2508.03990]] Are Today's LLMs Ready to Explain Well-Being Concepts?(https://arxiv.org/abs/2508.03990)
Keywords: large language model
Abstract: Well-being encompasses mental, physical, and social dimensions essential to personal growth and informed life decisions. As individuals increasingly consult Large Language Models (LLMs) to understand well-being, a key challenge emerges: Can LLMs generate explanations that are not only accurate but also tailored to diverse audiences? High-quality explanations require both factual correctness and the ability to meet the expectations of users with varying expertise. In this work, we construct a large-scale dataset comprising 43,880 explanations of 2,194 well-being concepts, generated by ten diverse LLMs. We introduce a principle-guided LLM-as-a-judge evaluation framework, employing dual judges to assess explanation quality. Furthermore, we show that fine-tuning an open-source LLM using Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) can significantly enhance the quality of generated explanations. Our results reveal: (1) The proposed LLM judges align well with human evaluations; (2) explanation quality varies significantly across models, audiences, and categories; and (3) DPO- and SFT-finetuned models outperform their larger counterparts, demonstrating the effectiveness of preference-based learning for specialized explanation tasks.

Title: Investigating the Impact of Large-Scale Pre-training on Nutritional Content Estimation from 2D Images

Authors: Michele Andrade, Guilherme A. L. Silva, Valéria Santos, Gladston Moreira, Eduardo Luz
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.03996
Pdf URL: https://arxiv.org/pdf/2508.03996
Copy Paste: [[2508.03996]] Investigating the Impact of Large-Scale Pre-training on Nutritional Content Estimation from 2D Images(https://arxiv.org/abs/2508.03996)
Keywords: transformer
Abstract: Estimating the nutritional content of food from images is a critical task with significant implications for health and dietary monitoring. This is challenging, especially when relying solely on 2D images, due to the variability in food presentation, lighting, and the inherent difficulty in inferring volume and mass without depth information. Furthermore, reproducibility in this domain is hampered by the reliance of state-of-the-art methods on proprietary datasets for large-scale pre-training. In this paper, we investigate the impact of large-scale pre-training datasets on the performance of deep learning models for nutritional estimation using only 2D images. We fine-tune and evaluate Vision Transformer (ViT) models pre-trained on two large public datasets, ImageNet and COYO, comparing their performance against baseline CNN models (InceptionV2 and ResNet-50) and a state-of-the-art method pre-trained on the proprietary JFT-300M dataset. We conduct extensive experiments on the Nutrition5k dataset, a large-scale collection of real-world food plates with high-precision nutritional annotations. Our evaluation using Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAE%) reveals that models pre-trained on JFT-300M significantly outperform those pre-trained on public datasets. Unexpectedly, the model pre-trained on the massive COYO dataset performs worse than the model pre-trained on ImageNet for this specific regression task, refuting our initial hypothesis. Our analysis provides quantitative evidence highlighting the critical role of pre-training dataset characteristics, including scale, domain relevance, and curation quality, for effective transfer learning in 2D nutritional estimation.

Title: JanusNet: Hierarchical Slice-Block Shuffle and Displacement for Semi-Supervised 3D Multi-Organ Segmentation

Authors: Zheng Zhang, Tianzhuzi Tan, Guanchun Yin, Bo Zhang, Xiuzhuang Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.03997
Pdf URL: https://arxiv.org/pdf/2508.03997
Copy Paste: [[2508.03997]] JanusNet: Hierarchical Slice-Block Shuffle and Displacement for Semi-Supervised 3D Multi-Organ Segmentation(https://arxiv.org/abs/2508.03997)
Keywords: segmentation
Abstract: Limited by the scarcity of training samples and annotations, weakly supervised medical image segmentation often employs data augmentation to increase data diversity, while randomly mixing volumetric blocks has demonstrated strong performance. However, this approach disrupts the inherent anatomical continuity of 3D medical images along orthogonal axes, leading to severe structural inconsistencies and insufficient training in challenging regions, such as small-sized organs, etc. To better comply with and utilize human anatomical information, we propose JanusNet}, a data augmentation framework for 3D medical data that globally models anatomical continuity while locally focusing on hard-to-segment regions. Specifically, our Slice-Block Shuffle step performs aligned shuffling of same-index slice blocks across volumes along a random axis, while preserving the anatomical context on planes perpendicular to the perturbation axis. Concurrently, the Confidence-Guided Displacement step uses prediction reliability to replace blocks within each slice, amplifying signals from difficult areas. This dual-stage, axis-aligned framework is plug-and-play, requiring minimal code changes for most teacher-student schemes. Extensive experiments on the Synapse and AMOS datasets demonstrate that JanusNet significantly surpasses state-of-the-art methods, achieving, for instance, a 4% DSC gain on the Synapse dataset with only 20% labeled data.

Title: Transferring Expert Cognitive Models to Social Robots via Agentic Concept Bottleneck Models

Authors: Xinyu Zhao, Zhen Tan, Maya Enisman, Minjae Seo, Marta R. Durantini, Dolores Albarracin, Tianlong Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.03998
Pdf URL: https://arxiv.org/pdf/2508.03998
Copy Paste: [[2508.03998]] Transferring Expert Cognitive Models to Social Robots via Agentic Concept Bottleneck Models(https://arxiv.org/abs/2508.03998)
Keywords: robust
Abstract: Successful group meetings, such as those implemented in group behavioral-change programs, work meetings, and other social contexts, must promote individual goal setting and execution while strengthening the social relationships within the group. Consequently, an ideal facilitator must be sensitive to the subtle dynamics of disengagement, difficulties with individual goal setting and execution, and interpersonal difficulties that signal a need for intervention. The challenges and cognitive load experienced by facilitators create a critical gap for an embodied technology that can interpret social exchanges while remaining aware of the needs of the individuals in the group and providing transparent recommendations that go beyond powerful but "black box" foundation models (FMs) that identify social cues. We address this important demand with a social robot co-facilitator that analyzes multimodal meeting data and provides discreet cues to the facilitator. The robot's reasoning is powered by an agentic concept bottleneck model (CBM), which makes decisions based on human-interpretable concepts like participant engagement and sentiments, ensuring transparency and trustworthiness. Our core contribution is a transfer learning framework that distills the broad social understanding of an FM into our specialized and transparent CBM. This concept-driven system significantly outperforms direct zero-shot FMs in predicting the need for intervention and enables real-time human correction of its reasoning. Critically, we demonstrate robust knowledge transfer: the model generalizes across different groups and successfully transfers the expertise of senior human facilitators to improve the performance of novices. By transferring an expert's cognitive model into an interpretable robotic partner, our work provides a powerful blueprint for augmenting human capabilities in complex social domains.

Title: Tensorized Clustered LoRA Merging for Multi-Task Interference

Authors: Zhan Su, Fengran Mo, Guojun Liang, Jinghan Zhang, Bingbing Wen, Prayag Tiwari, Jian-Yun Nie
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.03999
Pdf URL: https://arxiv.org/pdf/2508.03999
Copy Paste: [[2508.03999]] Tensorized Clustered LoRA Merging for Multi-Task Interference(https://arxiv.org/abs/2508.03999)
Keywords: large language model
Abstract: Despite the success of the monolithic dense paradigm of large language models (LLMs), the LoRA adapters offer an efficient solution by fine-tuning small task-specific modules and merging them with the base model. However, in multi-task settings, merging LoRA adapters trained on heterogeneous sources frequently causes \textit{task interference}, degrading downstream performance. To address this, we propose a tensorized clustered LoRA (TC-LoRA) library targeting to address the task interference at the \textit{text-level} and \textit{parameter-level}. At the \textit{text-level}, we cluster the training samples in the embedding space to capture input-format similarities, then train a specialized LoRA adapter for each cluster. At the \textit{parameter-level}, we introduce a joint Canonical Polyadic (CP) decomposition that disentangles task-specific and shared factors across LoRA adapters. This joint factorization preserves essential knowledge while reducing cross-task interference. Extensive experiments on out-of-domain zero-shot and skill-composition tasks-including reasoning, question answering, and coding. Compared to strong SVD-based baselines, TC-LoRA achieves +1.4\% accuracy on Phi-3 and +2.3\% on Mistral-7B (+2.3\%), demonstrating the effectiveness of TC-LoRA in LLM adaptation.

Title: CAD-Judge: Toward Efficient Morphological Grading and Verification for Text-to-CAD Generation

Authors: Zheyuan Zhou, Jiayi Han, Liang Du, Naiyu Fang, Lemiao Qiu, Shuyou Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04002
Pdf URL: https://arxiv.org/pdf/2508.04002
Copy Paste: [[2508.04002]] CAD-Judge: Toward Efficient Morphological Grading and Verification for Text-to-CAD Generation(https://arxiv.org/abs/2508.04002)
Keywords: robust, generative
Abstract: Computer-Aided Design (CAD) models are widely used across industrial design, simulation, and manufacturing processes. Text-to-CAD systems aim to generate editable, general-purpose CAD models from textual descriptions, significantly reducing the complexity and entry barrier associated with traditional CAD workflows. However, rendering CAD models can be slow, and deploying VLMs to review CAD models can be expensive and may introduce reward hacking that degrades the systems. To address these challenges, we propose CAD-Judge, a novel, verifiable reward system for efficient and effective CAD preference grading and grammatical validation. We adopt the Compiler-as-a-Judge Module (CJM) as a fast, direct reward signal, optimizing model alignment by maximizing generative utility through prospect theory. To further improve the robustness of Text-to-CAD in the testing phase, we introduce a simple yet effective agentic CAD generation approach and adopt the Compiler-as-a-Review Module (CRM), which efficiently verifies the generated CAD models, enabling the system to refine them accordingly. Extensive experiments on challenging CAD datasets demonstrate that our method achieves state-of-the-art performance while maintaining superior efficiency.

Title: Decoupled Contrastive Learning for Federated Learning

Authors: Hyungbin Kim, Incheol Baek, Yon Dohn Chung
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.04005
Pdf URL: https://arxiv.org/pdf/2508.04005
Copy Paste: [[2508.04005]] Decoupled Contrastive Learning for Federated Learning(https://arxiv.org/abs/2508.04005)
Keywords: federate
Abstract: Federated learning is a distributed machine learning paradigm that allows multiple participants to train a shared model by exchanging model updates instead of their raw data. However, its performance is degraded compared to centralized approaches due to data heterogeneity across clients. While contrastive learning has emerged as a promising approach to mitigate this, our theoretical analysis reveals a fundamental conflict: its asymptotic assumptions of an infinite number of negative samples are violated in finite-sample regime of federated learning. To address this issue, we introduce Decoupled Contrastive Learning for Federated Learning (DCFL), a novel framework that decouples the existing contrastive loss into two objectives. Decoupling the loss into its alignment and uniformity components enables the independent calibration of the attraction and repulsion forces without relying on the asymptotic assumptions. This strategy provides a contrastive learning method suitable for federated learning environments where each client has a small amount of data. Our experimental results show that DCFL achieves stronger alignment between positive samples and greater uniformity between negative samples compared to existing contrastive learning methods. Furthermore, experimental results on standard benchmarks, including CIFAR-10, CIFAR-100, and Tiny-ImageNet, demonstrate that DCFL consistently outperforms state-of-the-art federated learning methods.

Title: HarmonyGuard: Toward Safety and Utility in Web Agents via Adaptive Policy Enhancement and Dual-Objective Optimization

Authors: Yurun Chen, Xavier Hu, Yuhan Liu, Keting Yin, Juncheng Li, Zhuosheng Zhang, Shengyu Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.04010
Pdf URL: https://arxiv.org/pdf/2508.04010
Copy Paste: [[2508.04010]] HarmonyGuard: Toward Safety and Utility in Web Agents via Adaptive Policy Enhancement and Dual-Objective Optimization(https://arxiv.org/abs/2508.04010)
Keywords: security, large language model
Abstract: Large language models enable agents to autonomously perform tasks in open web environments. However, as hidden threats within the web evolve, web agents face the challenge of balancing task performance with emerging risks during long-sequence operations. Although this challenge is critical, current research remains limited to single-objective optimization or single-turn scenarios, lacking the capability for collaborative optimization of both safety and utility in web environments. To address this gap, we propose HarmonyGuard, a multi-agent collaborative framework that leverages policy enhancement and objective optimization to jointly improve both utility and safety. HarmonyGuard features a multi-agent architecture characterized by two fundamental capabilities: (1) Adaptive Policy Enhancement: We introduce the Policy Agent within HarmonyGuard, which automatically extracts and maintains structured security policies from unstructured external documents, while continuously updating policies in response to evolving threats. (2) Dual-Objective Optimization: Based on the dual objectives of safety and utility, the Utility Agent integrated within HarmonyGuard performs the Markovian real-time reasoning to evaluate the objectives and utilizes metacognitive capabilities for their optimization. Extensive evaluations on multiple benchmarks show that HarmonyGuard improves policy compliance by up to 38% and task completion by up to 20% over existing baselines, while achieving over 90% policy compliance across all tasks. Our project is available here: this https URL.

Title: Step More: Going Beyond Single Backpropagation in Meta Learning Based Model Editing

Authors: Xiaopeng Li, Shasha Li, Xi Wang, Shezheng Song, Bin Ji, Shangwen Wang, Jun Ma, Xiaodong Liu, Mina Liu, Jie Yu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.04012
Pdf URL: https://arxiv.org/pdf/2508.04012
Copy Paste: [[2508.04012]] Step More: Going Beyond Single Backpropagation in Meta Learning Based Model Editing(https://arxiv.org/abs/2508.04012)
Keywords: large language model
Abstract: Large Language Models (LLMs) underpin many AI applications, but their static nature makes updating knowledge costly. Model editing offers an efficient alternative by injecting new information through targeted parameter modifications. In particular, meta-learning-based model editing (MLBME) methods have demonstrated notable advantages in both editing effectiveness and efficiency. Despite this, we find that MLBME exhibits suboptimal performance in low-data scenarios, and its training efficiency is bottlenecked by the computation of KL divergence. To address these, we propose $\textbf{S}$tep $\textbf{M}$ore $\textbf{Edit}$ ($\textbf{SMEdit}$), a novel MLBME method that adopts $\textbf{M}$ultiple $\textbf{B}$ackpro$\textbf{P}$agation $\textbf{S}$teps ($\textbf{MBPS}$) to improve editing performance under limited supervision and a norm regularization on weight updates to improve training efficiency. Experimental results on two datasets and two LLMs demonstrate that SMEdit outperforms prior MLBME baselines and the MBPS strategy can be seamlessly integrated into existing methods to further boost their performance. Our code will be released soon.

Title: $\text{S}^2$Q-VDiT: Accurate Quantized Video Diffusion Transformer with Salient Data and Sparse Token Distillation

Authors: Weilun Feng, Haotong Qin, Chuanguang Yang, Xiangqi Li, Han Yang, Yuqi Li, Zhulin An, Libo Huang, Michele Magno, Yongjun Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04016
Pdf URL: https://arxiv.org/pdf/2508.04016
Copy Paste: [[2508.04016]] $\text{S}^2$Q-VDiT: Accurate Quantized Video Diffusion Transformer with Salient Data and Sparse Token Distillation(https://arxiv.org/abs/2508.04016)
Keywords: diffusion, transformer
Abstract: Diffusion transformers have emerged as the mainstream paradigm for video generation models. However, the use of up to billions of parameters incurs significant computational costs. Quantization offers a promising solution by reducing memory usage and accelerating inference. Nonetheless, we observe that the joint modeling of spatial and temporal information in video diffusion models (V-DMs) leads to extremely long token sequences, which introduces high calibration variance and learning challenges. To address these issues, we propose \textbf{$\text{S}^2$Q-VDiT}, a post-training quantization framework for V-DMs that leverages \textbf{S}alient data and \textbf{S}parse token distillation. During the calibration phase, we identify that quantization performance is highly sensitive to the choice of calibration data. To mitigate this, we introduce \textit{Hessian-aware Salient Data Selection}, which constructs high-quality calibration datasets by considering both diffusion and quantization characteristics unique to V-DMs. To tackle the learning challenges, we further analyze the sparse attention patterns inherent in V-DMs. Based on this observation, we propose \textit{Attention-guided Sparse Token Distillation}, which exploits token-wise attention distributions to emphasize tokens that are more influential to the model's output. Under W4A6 quantization, $\text{S}^2$Q-VDiT achieves lossless performance while delivering $3.9\times$ model compression and $1.3\times$ inference acceleration. Code will be available at \href{this https URL}{this https URL}.

Title: Can Large Multimodal Models Actively Recognize Faulty Inputs? A Systematic Evaluation Framework of Their Input Scrutiny Ability

Authors: Haiqi Yang, Jinzhe Li, Gengxu Li, Yi Chang, Yuan Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04017
Pdf URL: https://arxiv.org/pdf/2508.04017
Copy Paste: [[2508.04017]] Can Large Multimodal Models Actively Recognize Faulty Inputs? A Systematic Evaluation Framework of Their Input Scrutiny Ability(https://arxiv.org/abs/2508.04017)
Keywords: large language model
Abstract: Large Multimodal Models (LMMs) have witnessed remarkable growth, showcasing formidable capabilities in handling intricate multimodal tasks with exceptional performance. Recent research has underscored the inclination of large language models to passively accept defective inputs, often resulting in futile reasoning on invalid prompts. However, the same critical question of whether LMMs can actively detect and scrutinize erroneous inputs still remains unexplored. To address this gap, we introduce the Input Scrutiny Ability Evaluation Framework (ISEval), which encompasses seven categories of flawed premises and three evaluation metrics. Our extensive evaluation of ten advanced LMMs has identified key findings. Most models struggle to actively detect flawed textual premises without guidance, which reflects a strong reliance on explicit prompts for premise error identification. Error type affects performance: models excel at identifying logical fallacies but struggle with surface-level linguistic errors and certain conditional flaws. Modality trust varies-Gemini 2.5 pro and Claude Sonnet 4 balance visual and textual info, while aya-vision-8b over-rely on text in conflicts. These insights underscore the urgent need to enhance LMMs' proactive verification of input validity and shed novel insights into mitigating the problem. The code is available at this https URL.

Title: Prototype-Driven Structure Synergy Network for Remote Sensing Images Segmentation

Authors: Junyi Wang, Jinjiang Li, Guodong Fan, Yakun Ju, Xiang Fang, Alex C. Kot
Subjects: cs.CV, cs.IR
Abstract URL: https://arxiv.org/abs/2508.04022
Pdf URL: https://arxiv.org/pdf/2508.04022
Copy Paste: [[2508.04022]] Prototype-Driven Structure Synergy Network for Remote Sensing Images Segmentation(https://arxiv.org/abs/2508.04022)
Keywords: extraction, segmentation
Abstract: In the semantic segmentation of remote sensing images, acquiring complete ground objects is critical for achieving precise analysis. However, this task is severely hindered by two major challenges: high intra-class variance and high inter-class similarity. Traditional methods often yield incomplete segmentation results due to their inability to effectively unify class representations and distinguish between similar features. Even emerging class-guided approaches are limited by coarse class prototype representations and a neglect of target structural information. Therefore, this paper proposes a Prototype-Driven Structure Synergy Network (PDSSNet). The design of this network is based on a core concept, a complete ground object is jointly defined by its invariant class semantics and its variant spatial structure. To implement this, we have designed three key modules. First, the Adaptive Prototype Extraction Module (APEM) ensures semantic accuracy from the source by encoding the ground truth to extract unbiased class prototypes. Subsequently, the designed Semantic-Structure Coordination Module (SSCM) follows a hierarchical semantics-first, structure-second principle. This involves first establishing a global semantic cognition, then leveraging structural information to constrain and refine the semantic representation, thereby ensuring the integrity of class information. Finally, the Channel Similarity Adjustment Module (CSAM) employs a dynamic step-size adjustment mechanism to focus on discriminative features between classes. Extensive experiments demonstrate that PDSSNet outperforms state-of-the-art methods. The source code is available at this https URL.

Title: Radar-Based NLoS Pedestrian Localization for Darting-Out Scenarios Near Parked Vehicles with Camera-Assisted Point Cloud Interpretation

Authors: Hee-Yeun Kim, Byeonggyu Park, Byonghyok Choi, Hansang Cho, Byungkwan Kim, Soomok Lee, Mingu Jeon, Seung-Woo Seo, Seong-Woo Kim
Subjects: cs.CV, eess.SP
Abstract URL: https://arxiv.org/abs/2508.04033
Pdf URL: https://arxiv.org/pdf/2508.04033
Copy Paste: [[2508.04033]] Radar-Based NLoS Pedestrian Localization for Darting-Out Scenarios Near Parked Vehicles with Camera-Assisted Point Cloud Interpretation(https://arxiv.org/abs/2508.04033)
Keywords: segmentation
Abstract: The presence of Non-Line-of-Sight (NLoS) blind spots resulting from roadside parking in urban environments poses a significant challenge to road safety, particularly due to the sudden emergence of pedestrians. mmWave technology leverages diffraction and reflection to observe NLoS regions, and recent studies have demonstrated its potential for detecting obscured objects. However, existing approaches predominantly rely on predefined spatial information or assume simple wall reflections, thereby limiting their generalizability and practical applicability. A particular challenge arises in scenarios where pedestrians suddenly appear from between parked vehicles, as these parked vehicles act as temporary spatial obstructions. Furthermore, since parked vehicles are dynamic and may relocate over time, spatial information obtained from satellite maps or other predefined sources may not accurately reflect real-time road conditions, leading to erroneous sensor interpretations. To address this limitation, we propose an NLoS pedestrian localization framework that integrates monocular camera image with 2D radar point cloud (PCD) data. The proposed method initially detects parked vehicles through image segmentation, estimates depth to infer approximate spatial characteristics, and subsequently refines this information using 2D radar PCD to achieve precise spatial inference. Experimental evaluations conducted in real-world urban road environments demonstrate that the proposed approach enhances early pedestrian detection and contributes to improved road safety. Supplementary materials are available at this https URL.

Title: ZARA: Zero-shot Motion Time-Series Analysis via Knowledge and Retrieval Driven LLM Agents

Authors: Zechen Li, Baiyu Chen, Hao Xue, Flora D. Salim
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2508.04038
Pdf URL: https://arxiv.org/pdf/2508.04038
Copy Paste: [[2508.04038]] ZARA: Zero-shot Motion Time-Series Analysis via Knowledge and Retrieval Driven LLM Agents(https://arxiv.org/abs/2508.04038)
Keywords: interpretability, large language model
Abstract: Motion sensor time-series are central to human activity recognition (HAR), with applications in health, sports, and smart devices. However, existing methods are trained for fixed activity sets and require costly retraining when new behaviours or sensor setups appear. Recent attempts to use large language models (LLMs) for HAR, typically by converting signals into text or images, suffer from limited accuracy and lack verifiable interpretability. We propose ZARA, the first agent-based framework for zero-shot, explainable HAR directly from raw motion time-series. ZARA integrates an automatically derived pair-wise feature knowledge base that captures discriminative statistics for every activity pair, a multi-sensor retrieval module that surfaces relevant evidence, and a hierarchical agent pipeline that guides the LLM to iteratively select features, draw on this evidence, and produce both activity predictions and natural-language explanations. ZARA enables flexible and interpretable HAR without any fine-tuning or task-specific classifiers. Extensive experiments on 8 HAR benchmarks show that ZARA achieves SOTA zero-shot performance, delivering clear reasoning while exceeding the strongest baselines by 2.53x in macro F1. Ablation studies further confirm the necessity of each module, marking ZARA as a promising step toward trustworthy, plug-and-play motion time-series analysis. Our codes are available at this https URL.

Title: Large Reasoning Models Are Autonomous Jailbreak Agents

Authors: Thilo Hagendorff, Erik Derner, Nuria Oliver
Subjects: cs.CL, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2508.04039
Pdf URL: https://arxiv.org/pdf/2508.04039
Copy Paste: [[2508.04039]] Large Reasoning Models Are Autonomous Jailbreak Agents(https://arxiv.org/abs/2508.04039)
Keywords: attack
Abstract: Jailbreaking -- bypassing built-in safety mechanisms in AI models -- has traditionally required complex technical procedures or specialized human expertise. In this study, we show that the persuasive capabilities of large reasoning models (LRMs) simplify and scale jailbreaking, converting it into an inexpensive activity accessible to non-experts. We evaluated the capabilities of four LRMs (DeepSeek-R1, Gemini 2.5 Flash, Grok 3 Mini, Qwen3 235B) to act as autonomous adversaries conducting multi-turn conversations with nine widely used target models. LRMs received instructions via a system prompt, before proceeding to planning and executing jailbreaks with no further supervision. We performed extensive experiments with a benchmark of harmful prompts composed of 70 items covering seven sensitive domains. This setup yielded an overall attack success rate across all model combinations of 97.14%. Our study reveals an alignment regression, in which LRMs can systematically erode the safety guardrails of other models, highlighting the urgent need to further align frontier models not only to resist jailbreak attempts, but also to prevent them from being co-opted into acting as jailbreak agents.

Title: VisualTrans: A Benchmark for Real-World Visual Transformation Reasoning

Authors: Yuheng Ji, Yipu Wang, Yuyang Liu, Xiaoshuai Hao, Yue Liu, Yuting Zhao, Huaihai Lyu, Xiaolong Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04043
Pdf URL: https://arxiv.org/pdf/2508.04043
Copy Paste: [[2508.04043]] VisualTrans: A Benchmark for Real-World Visual Transformation Reasoning(https://arxiv.org/abs/2508.04043)
Keywords: extraction
Abstract: Visual transformation reasoning (VTR) is a vital cognitive capability that empowers intelligent agents to understand dynamic scenes, model causal relationships, and predict future states, and thereby guiding actions and laying the foundation for advanced intelligent systems. However, existing benchmarks suffer from a sim-to-real gap, limited task complexity, and incomplete reasoning coverage, limiting their practical use in real-world scenarios. To address these limitations, we introduce VisualTrans, the first comprehensive benchmark specifically designed for VTR in real-world human-object interaction scenarios. VisualTrans encompasses 12 semantically diverse manipulation tasks and systematically evaluates three essential reasoning dimensions - spatial, procedural, and quantitative - through 6 well-defined subtask types. The benchmark features 472 high-quality question-answer pairs in various formats, including multiple-choice, open-ended counting, and target enumeration. We introduce a scalable data construction pipeline built upon first-person manipulation videos, which integrates task selection, image pair extraction, automated metadata annotation with large multimodal models, and structured question generation. Human verification ensures the final benchmark is both high-quality and interpretable. Evaluations of various state-of-the-art vision-language models show strong performance in static spatial tasks. However, they reveal notable shortcomings in dynamic, multi-step reasoning scenarios, particularly in areas like intermediate state recognition and transformation sequence planning. These findings highlight fundamental weaknesses in temporal modeling and causal reasoning, providing clear directions for future research aimed at developing more capable and generalizable VTR systems. The dataset and code are available at this https URL.

Title: Iterative pseudo-labeling based adaptive copy-paste supervision for semi-supervised tumor segmentation

Authors: Qiangguo Jin, Hui Cui, Junbo Wang, Changming Sun, Yimiao He, Ping Xuan, Linlin Wang, Cong Cong, Leyi Wei, Ran Su
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04044
Pdf URL: https://arxiv.org/pdf/2508.04044
Copy Paste: [[2508.04044]] Iterative pseudo-labeling based adaptive copy-paste supervision for semi-supervised tumor segmentation(https://arxiv.org/abs/2508.04044)
Keywords: robust, segmentation
Abstract: Semi-supervised learning (SSL) has attracted considerable attention in medical image processing. The latest SSL methods use a combination of consistency regularization and pseudo-labeling to achieve remarkable success. However, most existing SSL studies focus on segmenting large organs, neglecting the challenging scenarios where there are numerous tumors or tumors of small volume. Furthermore, the extensive capabilities of data augmentation strategies, particularly in the context of both labeled and unlabeled data, have yet to be thoroughly investigated. To tackle these challenges, we introduce a straightforward yet effective approach, termed iterative pseudo-labeling based adaptive copy-paste supervision (IPA-CP), for tumor segmentation in CT scans. IPA-CP incorporates a two-way uncertainty based adaptive augmentation mechanism, aiming to inject tumor uncertainties present in the mean teacher architecture into adaptive augmentation. Additionally, IPA-CP employs an iterative pseudo-label transition strategy to generate more robust and informative pseudo labels for the unlabeled samples. Extensive experiments on both in-house and public datasets show that our framework outperforms state-of-the-art SSL methods in medical image segmentation. Ablation study results demonstrate the effectiveness of our technical contributions.

Title: FeDaL: Federated Dataset Learning for Time Series Foundation Models

Authors: Shengchao Chen, Guodong Long, Jing Jiang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.04045
Pdf URL: https://arxiv.org/pdf/2508.04045
Copy Paste: [[2508.04045]] FeDaL: Federated Dataset Learning for Time Series Foundation Models(https://arxiv.org/abs/2508.04045)
Keywords: federate
Abstract: Dataset-wise heterogeneity introduces significant domain biases that fundamentally degrade generalization on Time Series Foundation Models (TSFMs), yet this challenge remains underexplored. This paper rethink the development of TSFMs using the paradigm of federated learning. We propose a novel Federated Dataset Learning (FeDaL) approach to tackle heterogeneous time series by learning dataset-agnostic temporal representations. Specifically, the distributed architecture of federated learning is a nature solution to decompose heterogeneous TS datasets into shared generalized knowledge and preserved personalized knowledge. Moreover, based on the TSFM architecture, FeDaL explicitly mitigates both local and global biases by adding two complementary mechanisms: Domain Bias Elimination (DBE) and Global Bias Elimination (GBE). FeDaL`s cross-dataset generalization has been extensively evaluated in real-world datasets spanning eight tasks, including both representation learning and downstream time series analysis, against 54 baselines. We further analyze federated scaling behavior, showing how data volume, client count, and join rate affect model performance under decentralization.

Title: Quantum Temporal Fusion Transformer

Authors: Krishnakanta Barik, Goutam Paul
Subjects: cs.LG, quant-ph
Abstract URL: https://arxiv.org/abs/2508.04048
Pdf URL: https://arxiv.org/pdf/2508.04048
Copy Paste: [[2508.04048]] Quantum Temporal Fusion Transformer(https://arxiv.org/abs/2508.04048)
Keywords: transformer
Abstract: The Temporal Fusion Transformer (TFT), proposed by Lim et al. [\textit{International Journal of Forecasting}, 2021], is a state-of-the-art attention-based deep neural network architecture specifically designed for multi-horizon time series forecasting. It has demonstrated significant performance improvements over existing benchmarks. In this work, we propose a Quantum Temporal Fusion Transformer (QTFT), a quantum-enhanced hybrid quantum-classical architecture that extends the capabilities of the classical TFT framework. Our results demonstrate that QTFT is successfully trained on the forecasting datasets and is capable of accurately predicting future values. In particular, our experimental results display that in certain test cases, the model outperforms its classical counterpart in terms of both training and test loss, while in the remaining cases, it achieves comparable performance. A key advantage of our approach lies in its foundation on a variational quantum algorithm, enabling implementation on current noisy intermediate-scale quantum (NISQ) devices without strict requirements on the number of qubits or circuit depth.

Title: DOMR: Establishing Cross-View Segmentation via Dense Object Matching

Authors: Jitong Liao, Yulu Gao, Shaofei Huang, Jialin Gao, Jie Lei, Ronghua Liang, Si Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04050
Pdf URL: https://arxiv.org/pdf/2508.04050
Copy Paste: [[2508.04050]] DOMR: Establishing Cross-View Segmentation via Dense Object Matching(https://arxiv.org/abs/2508.04050)
Keywords: segmentation
Abstract: Cross-view object correspondence involves matching objects between egocentric (first-person) and exocentric (third-person) views. It is a critical yet challenging task for visual understanding. In this work, we propose the Dense Object Matching and Refinement (DOMR) framework to establish dense object correspondences across views. The framework centers around the Dense Object Matcher (DOM) module, which jointly models multiple objects. Unlike methods that directly match individual object masks to image features, DOM leverages both positional and semantic relationships among objects to find correspondences. DOM integrates a proposal generation module with a dense matching module that jointly encodes visual, spatial, and semantic cues, explicitly constructing inter-object relationships to achieve dense matching among objects. Furthermore, we combine DOM with a mask refinement head designed to improve the completeness and accuracy of the predicted masks, forming the complete DOMR framework. Extensive evaluations on the Ego-Exo4D benchmark demonstrate that our approach achieves state-of-the-art performance with a mean IoU of 49.7% on Ego$\to$Exo and 55.2% on Exo$\to$Ego. These results outperform those of previous methods by 5.8% and 4.3%, respectively, validating the effectiveness of our integrated approach for cross-view understanding.

Title: Towards Globally Predictable k-Space Interpolation: A White-box Transformer Approach

Authors: Chen Luo, Qiyu Jin, Taofeng Xie, Xuemei Wang, Huayu Wang, Congcong Liu, Liming Tang, Guoqing Chen, Zhuo-Xu Cui, Dong Liang
Subjects: cs.CV, math.OC
Abstract URL: https://arxiv.org/abs/2508.04051
Pdf URL: https://arxiv.org/pdf/2508.04051
Copy Paste: [[2508.04051]] Towards Globally Predictable k-Space Interpolation: A White-box Transformer Approach(https://arxiv.org/abs/2508.04051)
Keywords: interpretability, transformer
Abstract: Interpolating missing data in k-space is essential for accelerating imaging. However, existing methods, including convolutional neural network-based deep learning, primarily exploit local predictability while overlooking the inherent global dependencies in k-space. Recently, Transformers have demonstrated remarkable success in natural language processing and image analysis due to their ability to capture long-range dependencies. This inspires the use of Transformers for k-space interpolation to better exploit its global structure. However, their lack of interpretability raises concerns regarding the reliability of interpolated data. To address this limitation, we propose GPI-WT, a white-box Transformer framework based on Globally Predictable Interpolation (GPI) for k-space. Specifically, we formulate GPI from the perspective of annihilation as a novel k-space structured low-rank (SLR) model. The global annihilation filters in the SLR model are treated as learnable parameters, and the subgradients of the SLR model naturally induce a learnable attention mechanism. By unfolding the subgradient-based optimization algorithm of SLR into a cascaded network, we construct the first white-box Transformer specifically designed for accelerated MRI. Experimental results demonstrate that the proposed method significantly outperforms state-of-the-art approaches in k-space interpolation accuracy while providing superior interpretability.

Title: Uni-DocDiff: A Unified Document Restoration Model Based on Diffusion

Authors: Fangmin Zhao, Weichao Zeng, Zhenhang Li, Dongbao Yang, Binbin Li, Xiaojun Bi, Yu Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04055
Pdf URL: https://arxiv.org/pdf/2508.04055
Copy Paste: [[2508.04055]] Uni-DocDiff: A Unified Document Restoration Model Based on Diffusion(https://arxiv.org/abs/2508.04055)
Keywords: diffusion
Abstract: Removing various degradations from damaged documents greatly benefits digitization, downstream document analysis, and readability. Previous methods often treat each restoration task independently with dedicated models, leading to a cumbersome and highly complex document processing system. Although recent studies attempt to unify multiple tasks, they often suffer from limited scalability due to handcrafted prompts and heavy preprocessing, and fail to fully exploit inter-task synergy within a shared architecture. To address the aforementioned challenges, we propose Uni-DocDiff, a Unified and highly scalable Document restoration model based on Diffusion. Uni-DocDiff develops a learnable task prompt design, ensuring exceptional scalability across diverse tasks. To further enhance its multi-task capabilities and address potential task interference, we devise a novel \textbf{Prior \textbf{P}ool}, a simple yet comprehensive mechanism that combines both local high-frequency features and global low-frequency features. Additionally, we design the \textbf{Prior \textbf{F}usion \textbf{M}odule (PFM)}, which enables the model to adaptively select the most relevant prior information for each specific task. Extensive experiments show that the versatile Uni-DocDiff achieves performance comparable or even superior performance compared with task-specific expert models, and simultaneously holds the task scalability for seamless adaptation to new tasks.

Title: PAIRS: Parametric-Verified Adaptive Information Retrieval and Selection for Efficient RAG

Authors: Wang Chen, Guanqiang Qi, Weikang Li, Yang Li, Deguo Xia, Jizhou Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.04057
Pdf URL: https://arxiv.org/pdf/2508.04057
Copy Paste: [[2508.04057]] PAIRS: Parametric-Verified Adaptive Information Retrieval and Selection for Efficient RAG(https://arxiv.org/abs/2508.04057)
Keywords: large language model
Abstract: Retrieval-Augmented Generation (RAG) has become a cornerstone technique for enhancing large language models (LLMs) with external knowledge. However, current RAG systems face two critical limitations: (1) they inefficiently retrieve information for every query, including simple questions that could be resolved using the LLM's parametric knowledge alone, and (2) they risk retrieving irrelevant documents when queries contain sparse information signals. To address these gaps, we introduce Parametric-verified Adaptive Information Retrieval and Selection (PAIRS), a training-free framework that integrates parametric and retrieved knowledge to adaptively determine whether to retrieve and how to select external information. Specifically, PAIRS employs a dual-path generation mechanism: First, the LLM produces both a direct answer and a context-augmented answer using self-generated pseudo-context. When these outputs converge, PAIRS bypasses external retrieval entirely, dramatically improving the RAG system's efficiency. For divergent cases, PAIRS activates a dual-path retrieval (DPR) process guided by both the original query and self-generated contextual signals, followed by an Adaptive Information Selection (AIS) module that filters documents through weighted similarity to both sources. This simple yet effective approach can not only enhance efficiency by eliminating unnecessary retrievals but also improve accuracy through contextually guided retrieval and adaptive information selection. Experimental results on six question-answering (QA) benchmarks show that PAIRS reduces retrieval costs by around 25% (triggering for only 75% of queries) while still improving accuracy-achieving +1.1% EM and +1.0% F1 over prior baselines on average.

Title: TCSAFormer: Efficient Vision Transformer with Token Compression and Sparse Attention for Medical Image Segmentation

Authors: Zunhui Xia, Hongxing Li, Libin Lan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04058
Pdf URL: https://arxiv.org/pdf/2508.04058
Copy Paste: [[2508.04058]] TCSAFormer: Efficient Vision Transformer with Token Compression and Sparse Attention for Medical Image Segmentation(https://arxiv.org/abs/2508.04058)
Keywords: transformer, segmentation
Abstract: In recent years, transformer-based methods have achieved remarkable progress in medical image segmentation due to their superior ability to capture long-range dependencies. However, these methods typically suffer from two major limitations. First, their computational complexity scales quadratically with the input sequences. Second, the feed-forward network (FFN) modules in vanilla Transformers typically rely on fully connected layers, which limits models' ability to capture local contextual information and multiscale features critical for precise semantic segmentation. To address these issues, we propose an efficient medical image segmentation network, named TCSAFormer. The proposed TCSAFormer adopts two key ideas. First, it incorporates a Compressed Attention (CA) module, which combines token compression and pixel-level sparse attention to dynamically focus on the most relevant key-value pairs for each query. This is achieved by pruning globally irrelevant tokens and merging redundant ones, significantly reducing computational complexity while enhancing the model's ability to capture relationships between tokens. Second, it introduces a Dual-Branch Feed-Forward Network (DBFFN) module as a replacement for the standard FFN to capture local contextual features and multiscale information, thereby strengthening the model's feature representation capability. We conduct extensive experiments on three publicly available medical image segmentation datasets: ISIC-2018, CVC-ClinicDB, and Synapse, to evaluate the segmentation performance of TCSAFormer. Experimental results demonstrate that TCSAFormer achieves superior performance compared to existing state-of-the-art (SOTA) methods, while maintaining lower computational overhead, thus achieving an optimal trade-off between efficiency and accuracy.

Title: Beyond the Visible: Benchmarking Occlusion Perception in Multimodal Large Language Models

Authors: Zhaochen Liu, Kaiwen Gao, Shuyi Liang, Bin Xiao, Limeng Qiao, Lin Ma, Tingting Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04059
Pdf URL: https://arxiv.org/pdf/2508.04059
Copy Paste: [[2508.04059]] Beyond the Visible: Benchmarking Occlusion Perception in Multimodal Large Language Models(https://arxiv.org/abs/2508.04059)
Keywords: large language model
Abstract: Occlusion perception, a critical foundation for human-level spatial understanding, embodies the challenge of integrating visual recognition and reasoning. Though multimodal large language models (MLLMs) have demonstrated remarkable capabilities, their performance on occlusion perception remains under-explored. To address this gap, we introduce O-Bench, the first visual question answering (VQA) benchmark specifically designed for occlusion perception. Based on SA-1B, we construct 1,365 images featuring semantically coherent occlusion scenarios through a novel layered synthesis approach. Upon this foundation, we annotate 4,588 question-answer pairs in total across five tailored tasks, employing a reliable, semi-automatic workflow. Our extensive evaluation of 22 representative MLLMs against the human baseline reveals a significant performance gap between current MLLMs and humans, which, we find, cannot be sufficiently bridged by model scaling or thinking process. We further identify three typical failure patterns, including an overly conservative bias, a fragile gestalt prediction, and a struggle with quantitative tasks. We believe O-Bench can not only provide a vital evaluation tool for occlusion perception, but also inspire the development of MLLMs for better visual intelligence. Our benchmark will be made publicly available upon paper publication.

Title: TNet: Terrace Convolutional Decoder Network for Remote Sensing Image Semantic Segmentation

Authors: Chengqian Dai, Yonghong Guo, Hongzhao Xiang, Yigui Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04061
Pdf URL: https://arxiv.org/pdf/2508.04061
Copy Paste: [[2508.04061]] TNet: Terrace Convolutional Decoder Network for Remote Sensing Image Semantic Segmentation(https://arxiv.org/abs/2508.04061)
Keywords: transformer, segmentation
Abstract: In remote sensing, most segmentation networks adopt the UNet architecture, often incorporating modules such as Transformers or Mamba to enhance global-local feature interactions within decoder stages. However, these enhancements typically focus on intra-scale relationships and neglect the global contextual dependencies across multiple resolutions. To address this limitation, we introduce the Terrace Convolutional Decoder Network (TNet), a simple yet effective architecture that leverages only convolution and addition operations to progressively integrate low-resolution features (rich in global context) into higher-resolution features (rich in local details) across decoding stages. This progressive fusion enables the model to learn spatially-aware convolutional kernels that naturally blend global and local information in a stage-wise manner. We implement TNet with a ResNet-18 encoder (TNet-R) and evaluate it on three benchmark datasets. TNet-R achieves competitive performance with a mean Intersection-over-Union (mIoU) of 85.35\% on ISPRS Vaihingen, 87.05\% on ISPRS Potsdam, and 52.19\% on LoveDA, while maintaining high computational efficiency. Code is publicly available.

Title: Fine-tuning for Better Few Shot Prompting: An Empirical Comparison for Short Answer Grading

Authors: Joel Walsh, Siddarth Mamidanna, Benjamin Nye, Mark Core, Daniel Auerbach
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.04063
Pdf URL: https://arxiv.org/pdf/2508.04063
Copy Paste: [[2508.04063]] Fine-tuning for Better Few Shot Prompting: An Empirical Comparison for Short Answer Grading(https://arxiv.org/abs/2508.04063)
Keywords: large language model
Abstract: Research to improve Automated Short Answer Grading has recently focused on Large Language Models (LLMs) with prompt engineering and no- or few-shot prompting to achieve best results. This is in contrast to the fine-tuning approach, which has historically required large-scale compute clusters inaccessible to most users. New closed-model approaches such as OpenAI's fine-tuning service promise results with as few as 100 examples, while methods using open weights such as quantized low-rank adaptive (QLORA) can be used to fine-tune models on consumer GPUs. We evaluate both of these fine-tuning methods, measuring their interaction with few-shot prompting for automated short answer grading (ASAG) with structured (JSON) outputs. Our results show that finetuning with small amounts of data has limited utility for Llama open-weight models, but that fine-tuning methods can outperform few-shot baseline instruction-tuned LLMs for OpenAI's closed models. While our evaluation set is limited, we find some evidence that the observed benefits of finetuning may be impacted by the domain subject matter. Lastly, we observed dramatic improvement with the LLama 3.1 8B-Instruct open-weight model by seeding the initial training examples with a significant amount of cheaply generated synthetic training data.

Title: FLAT: Latent-Driven Arbitrary-Target Backdoor Attacks in Federated Learning

Authors: Tuan Nguyen, Khoa D Doan, Kok-Seng Wong
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2508.04064
Pdf URL: https://arxiv.org/pdf/2508.04064
Copy Paste: [[2508.04064]] FLAT: Latent-Driven Arbitrary-Target Backdoor Attacks in Federated Learning(https://arxiv.org/abs/2508.04064)
Keywords: defense, attack, robust, steal, federate
Abstract: Federated learning (FL) is vulnerable to backdoor attacks, yet most existing methods are limited by fixed-pattern or single-target triggers, making them inflexible and easier to detect. We propose FLAT (FL Arbitrary-Target Attack), a novel backdoor attack that leverages a latent-driven conditional autoencoder to generate diverse, target-specific triggers as needed. By introducing a latent code, FLAT enables the creation of visually adaptive and highly variable triggers, allowing attackers to select arbitrary targets without retraining and to evade conventional detection mechanisms. Our approach unifies attack success, stealth, and diversity within a single framework, introducing a new level of flexibility and sophistication to backdoor attacks in FL. Extensive experiments show that FLAT achieves high attack success and remains robust against advanced FL defenses. These results highlight the urgent need for new defense strategies to address latent-driven, multi-target backdoor threats in federated settings.

Title: Adversarial Fair Multi-View Clustering

Authors: Mudi Jiang, Jiahui Zhou, Lianyu Hu, Xinying Liu, Zengyou He, Zhikui Chen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.04071
Pdf URL: https://arxiv.org/pdf/2508.04071
Copy Paste: [[2508.04071]] Adversarial Fair Multi-View Clustering(https://arxiv.org/abs/2508.04071)
Keywords: fair
Abstract: Cluster analysis is a fundamental problem in data mining and machine learning. In recent years, multi-view clustering has attracted increasing attention due to its ability to integrate complementary information from multiple views. However, existing methods primarily focus on clustering performance, while fairness-a critical concern in human-centered applications-has been largely overlooked. Although recent studies have explored group fairness in multi-view clustering, most methods impose explicit regularization on cluster assignments, relying on the alignment between sensitive attributes and the underlying cluster structure. However, this assumption often fails in practice and can degrade clustering performance. In this paper, we propose an adversarial fair multi-view clustering (AFMVC) framework that integrates fairness learning into the representation learning process. Specifically, our method employs adversarial training to fundamentally remove sensitive attribute information from learned features, ensuring that the resulting cluster assignments are unaffected by it. Furthermore, we theoretically prove that aligning view-specific clustering assignments with a fairness-invariant consensus distribution via KL divergence preserves clustering consistency without significantly compromising fairness, thereby providing additional theoretical guarantees for our framework. Extensive experiments on data sets with fairness constraints demonstrate that AFMVC achieves superior fairness and competitive clustering performance compared to existing multi-view clustering and fairness-aware clustering methods.

Title: Efficient Strategy for Improving Large Language Model (LLM) Capabilities

Authors: Julián Camilo Velandia Gutiérrez
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2508.04073
Pdf URL: https://arxiv.org/pdf/2508.04073
Copy Paste: [[2508.04073]] Efficient Strategy for Improving Large Language Model (LLM) Capabilities(https://arxiv.org/abs/2508.04073)
Keywords: large language model
Abstract: Large Language Models (LLMs) have become a milestone in the field of artificial intelligence and natural language processing. However, their large-scale deployment remains constrained by the need for significant computational resources. This work proposes starting from a base model to explore and combine data processing and careful data selection techniques, training strategies, and architectural adjustments to improve the efficiency of LLMs in resource-constrained environments and within a delimited knowledge base. The methodological approach included defining criteria for building reliable datasets, conducting controlled experiments with different configurations, and systematically evaluating the resulting variants in terms of capability, versatility, response time, and safety. Finally, comparative tests were conducted to measure the performance of the developed variants and to validate the effectiveness of the proposed strategies. This work is based on the master's thesis in Systems and Computer Engineering titled "Efficient Strategy for Improving the Capabilities of Large Language Models (LLMs)".

Title: GM-PRM: A Generative Multimodal Process Reward Model for Multimodal Mathematical Reasoning

Authors: Jianghangfan Zhang, Yibo Yan, Kening Zheng, Xin Zou, Song Dai, Xuming Hu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.04088
Pdf URL: https://arxiv.org/pdf/2508.04088
Copy Paste: [[2508.04088]] GM-PRM: A Generative Multimodal Process Reward Model for Multimodal Mathematical Reasoning(https://arxiv.org/abs/2508.04088)
Keywords: generative, large language model
Abstract: Multimodal Large Language Models (MLLMs) demonstrate remarkable capabilities but often struggle with complex, multi-step mathematical reasoning, where minor errors in visual perception or logical deduction can lead to complete failure. While Process Reward Models (PRMs) offer step-by-step supervision, existing multimodal PRMs are limited to being binary verifiers that can identify but not correct errors, offering little explanatory power. To address these deficiencies, we introduce the Generative Multimodal Process Reward Model (GM-PRM), a novel paradigm that transforms the PRM from a passive judge into an active reasoning collaborator. Instead of a simple scalar score, GM-PRM provides a fine-grained, interpretable analysis of each reasoning step, evaluating its step intent, visual alignment, and logical soundness. More critically, GM-PRM is trained to generate a corrected version of the first erroneous step it identifies. This unique corrective capability enables our new test-time inference strategy, Refined Best-of-N (Refined-BoN). This framework actively enhances solution quality by using the PRM's generated correction to guide the policy model toward a more promising reasoning trajectory, thereby improving the diversity and correctness of the solution pool. We demonstrate that GM-PRM achieves state-of-the-art results on multiple multimodal math benchmarks, significantly boosting policy model performance with remarkable data efficiency, requiring only a 20K-sample training dataset. Our code will be released upon acceptance.

Title: Bridging Diffusion Models and 3D Representations: A 3D Consistent Super-Resolution Framework

Authors: Yi-Ting Chen, Ting-Hsuan Liao, Pengsheng Guo, Alexander Schwing, Jia-Bin Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04090
Pdf URL: https://arxiv.org/pdf/2508.04090
Copy Paste: [[2508.04090]] Bridging Diffusion Models and 3D Representations: A 3D Consistent Super-Resolution Framework(https://arxiv.org/abs/2508.04090)
Keywords: diffusion
Abstract: We propose 3D Super Resolution (3DSR), a novel 3D Gaussian-splatting-based super-resolution framework that leverages off-the-shelf diffusion-based 2D super-resolution models. 3DSR encourages 3D consistency across views via the use of an explicit 3D Gaussian-splatting-based scene representation. This makes the proposed 3DSR different from prior work, such as image upsampling or the use of video super-resolution, which either don't consider 3D consistency or aim to incorporate 3D consistency implicitly. Notably, our method enhances visual quality without additional fine-tuning, ensuring spatial coherence within the reconstructed scene. We evaluate 3DSR on MipNeRF360 and LLFF data, demonstrating that it produces high-resolution results that are visually compelling, while maintaining structural consistency in 3D reconstructions. Code will be released.

Title: Isolate Trigger: Detecting and Eradicating Evade-Adaptive Backdoors

Authors: Chengrui Sun, Hua Zhang, Haoran Gao, Zian Tian, Jianjin Zhao, qi Li, Hongliang Zhu, Zongliang Shen, Shang Wang, Anmin Fu
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2508.04094
Pdf URL: https://arxiv.org/pdf/2508.04094
Copy Paste: [[2508.04094]] Isolate Trigger: Detecting and Eradicating Evade-Adaptive Backdoors(https://arxiv.org/abs/2508.04094)
Keywords: defense, attack, robust
Abstract: All current detection of backdoor attacks on deep learning models fall under the category of a non essential features(NEF), which focus on fighting against simple and efficient vertical class backdoor -- trigger is small, few and not overlapping with the source. Evade-adaptive backdoor (EAB) attacks have evaded NEF detection and improved training efficiency. We introduces a precise, efficient and universal detection and defense framework coined as Isolate Trigger (IsTr). IsTr aims to find the hidden trigger by breaking the barrier of the source features. Therefore, it investigates the essence of backdoor triggering, and uses Steps and Differential-Middle-Slice as components to update past theories of distance and gradient. IsTr also plays a positive role in the model, whether the backdoor exists. For example, accurately find and repair the wrong identification caused by deliberate or unintentional training in automatic driving. Extensive experiments on robustness scross various tasks, including MNIST, facial recognition, and traffic sign recognition, confirm the high efficiency, generality and precision of the IsTr. We rigorously evaluated the effectiveness of the IsTr against a series of six EAB attacks, including Badnets, Sin-Wave, Multi-trigger, SSBAs, CASSOCK, HCB. None of these countermeasures evade, even when attacks are combined and the trigger and source overlap.

Title: Model Inversion Attacks on Vision-Language Models: Do They Leak What They Learn?

Authors: Ngoc-Bao Nguyen, Sy-Tuyen Ho, Koh Jun Hao, Ngai-Man Cheung
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.04097
Pdf URL: https://arxiv.org/pdf/2508.04097
Copy Paste: [[2508.04097]] Model Inversion Attacks on Vision-Language Models: Do They Leak What They Learn?(https://arxiv.org/abs/2508.04097)
Keywords: privacy, attack, generative
Abstract: Model inversion (MI) attacks pose significant privacy risks by reconstructing private training data from trained neural networks. While prior works have focused on conventional unimodal DNNs, the vulnerability of vision-language models (VLMs) remains underexplored. In this paper, we conduct the first study to understand VLMs' vulnerability in leaking private visual training data. To tailored for VLMs' token-based generative nature, we propose a suite of novel token-based and sequence-based model inversion strategies. Particularly, we propose Token-based Model Inversion (TMI), Convergent Token-based Model Inversion (TMI-C), Sequence-based Model Inversion (SMI), and Sequence-based Model Inversion with Adaptive Token Weighting (SMI-AW). Through extensive experiments and user study on three state-of-the-art VLMs and multiple datasets, we demonstrate, for the first time, that VLMs are susceptible to training data leakage. The experiments show that our proposed sequence-based methods, particularly SMI-AW combined with a logit-maximization loss based on vocabulary representation, can achieve competitive reconstruction and outperform token-based methods in attack accuracy and visual similarity. Importantly, human evaluation of the reconstructed images yields an attack accuracy of 75.31\%, underscoring the severity of model inversion threats in VLMs. Notably we also demonstrate inversion attacks on the publicly released VLMs. Our study reveals the privacy vulnerability of VLMs as they become increasingly popular across many applications such as healthcare and finance.

Title: DET-GS: Depth- and Edge-Aware Regularization for High-Fidelity 3D Gaussian Splatting

Authors: Zexu Huang, Min Xu, Stuart Perry
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.04099
Pdf URL: https://arxiv.org/pdf/2508.04099
Copy Paste: [[2508.04099]] DET-GS: Depth- and Edge-Aware Regularization for High-Fidelity 3D Gaussian Splatting(https://arxiv.org/abs/2508.04099)
Keywords: robust
Abstract: 3D Gaussian Splatting (3DGS) represents a significant advancement in the field of efficient and high-fidelity novel view synthesis. Despite recent progress, achieving accurate geometric reconstruction under sparse-view conditions remains a fundamental challenge. Existing methods often rely on non-local depth regularization, which fails to capture fine-grained structures and is highly sensitive to depth estimation noise. Furthermore, traditional smoothing methods neglect semantic boundaries and indiscriminately degrade essential edges and textures, consequently limiting the overall quality of reconstruction. In this work, we propose DET-GS, a unified depth and edge-aware regularization framework for 3D Gaussian Splatting. DET-GS introduces a hierarchical geometric depth supervision framework that adaptively enforces multi-level geometric consistency, significantly enhancing structural fidelity and robustness against depth estimation noise. To preserve scene boundaries, we design an edge-aware depth regularization guided by semantic masks derived from Canny edge detection. Furthermore, we introduce an RGB-guided edge-preserving Total Variation loss that selectively smooths homogeneous regions while rigorously retaining high-frequency details and textures. Extensive experiments demonstrate that DET-GS achieves substantial improvements in both geometric accuracy and visual fidelity, outperforming state-of-the-art (SOTA) methods on sparse-view novel view synthesis benchmarks.

Title: SenseCrypt: Sensitivity-guided Selective Homomorphic Encryption for Joint Federated Learning in Cross-Device Scenarios

Authors: Borui Li, Li Yan, Junhao Han, Jianmin Liu, Lei Yu
Subjects: cs.CR, cs.AI, cs.DC
Abstract URL: https://arxiv.org/abs/2508.04100
Pdf URL: https://arxiv.org/pdf/2508.04100
Copy Paste: [[2508.04100]] SenseCrypt: Sensitivity-guided Selective Homomorphic Encryption for Joint Federated Learning in Cross-Device Scenarios(https://arxiv.org/abs/2508.04100)
Keywords: security, privacy, protect, attack, federate
Abstract: Homomorphic Encryption (HE) prevails in securing Federated Learning (FL), but suffers from high overhead and adaptation cost. Selective HE methods, which partially encrypt model parameters by a global mask, are expected to protect privacy with reduced overhead and easy adaptation. However, in cross-device scenarios with heterogeneous data and system capabilities, traditional Selective HE methods deteriorate client straggling, and suffer from degraded HE overhead reduction performance. Accordingly, we propose SenseCrypt, a Sensitivity-guided selective Homomorphic EnCryption framework, to adaptively balance security and HE overhead per cross-device FL client. Given the observation that model parameter sensitivity is effective for measuring clients' data distribution similarity, we first design a privacy-preserving method to respectively cluster the clients with similar data distributions. Then, we develop a scoring mechanism to deduce the straggler-free ratio of model parameters that can be encrypted by each client per cluster. Finally, for each client, we formulate and solve a multi-objective model parameter selection optimization problem, which minimizes HE overhead while maximizing model security without causing straggling. Experiments demonstrate that SenseCrypt ensures security against the state-of-the-art inversion attacks, while achieving normal model accuracy as on IID data, and reducing training time by 58.4%-88.7% as compared to traditional HE methods.

Title: NEARL-CLIP: Interacted Query Adaptation with Orthogonal Regularization for Medical Vision-Language Understanding

Authors: Zelin Peng, Yichen Zhao, Yu Huang, Piao Yang, Feilong Tang, Zhengqin Xu, Xiaokang Yang, Wei Shen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04101
Pdf URL: https://arxiv.org/pdf/2508.04101
Copy Paste: [[2508.04101]] NEARL-CLIP: Interacted Query Adaptation with Orthogonal Regularization for Medical Vision-Language Understanding(https://arxiv.org/abs/2508.04101)
Keywords: transformer
Abstract: Computer-aided medical image analysis is crucial for disease diagnosis and treatment planning, yet limited annotated datasets restrict medical-specific model development. While vision-language models (VLMs) like CLIP offer strong generalization capabilities, their direct application to medical imaging analysis is impeded by a significant domain gap. Existing approaches to bridge this gap, including prompt learning and one-way modality interaction techniques, typically focus on introducing domain knowledge to a single modality. Although this may offer performance gains, it often causes modality misalignment, thereby failing to unlock the full potential of VLMs. In this paper, we propose \textbf{NEARL-CLIP} (i\underline{N}teracted qu\underline{E}ry \underline{A}daptation with o\underline{R}thogona\underline{L} Regularization), a novel cross-modality interaction VLM-based framework that contains two contributions: (1) Unified Synergy Embedding Transformer (USEformer), which dynamically generates cross-modality queries to promote interaction between modalities, thus fostering the mutual enrichment and enhancement of multi-modal medical domain knowledge; (2) Orthogonal Cross-Attention Adapter (OCA). OCA introduces an orthogonality technique to decouple the new knowledge from USEformer into two distinct components: the truly novel information and the incremental knowledge. By isolating the learning process from the interference of incremental knowledge, OCA enables a more focused acquisition of new information, thereby further facilitating modality interaction and unleashing the capability of VLMs. Notably, NEARL-CLIP achieves these two contributions in a parameter-efficient style, which only introduces \textbf{1.46M} learnable parameters.

Title: Unlocking the Potential of MLLMs in Referring Expression Segmentation via a Light-weight Mask Decode

Authors: Jingchao Wang, Zhijian Wu, Dingjiang Huang, Yefeng Zheng, Hong Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.04107
Pdf URL: https://arxiv.org/pdf/2508.04107
Copy Paste: [[2508.04107]] Unlocking the Potential of MLLMs in Referring Expression Segmentation via a Light-weight Mask Decode(https://arxiv.org/abs/2508.04107)
Keywords: large language model, segmentation
Abstract: Reference Expression Segmentation (RES) aims to segment image regions specified by referring expressions and has become popular with the rise of multimodal large models (MLLMs). While MLLMs excel in semantic understanding, their token-generation paradigm struggles with pixel-level dense prediction. Existing RES methods either couple MLLMs with the parameter-heavy Segment Anything Model (SAM) with 632M network parameters or adopt SAM-free lightweight pipelines that sacrifice accuracy. To address the trade-off between performance and cost, we specifically propose MLLMSeg, a novel framework that fully exploits the inherent visual detail features encoded in the MLLM vision encoder without introducing an extra visual encoder. Besides, we propose a detail-enhanced and semantic-consistent feature fusion module (DSFF) that fully integrates the detail-related visual feature with the semantic-related feature output by the large language model (LLM) of MLLM. Finally, we establish a light-weight mask decoder with only 34M network parameters that optimally leverages detailed spatial features from the visual encoder and semantic features from the LLM to achieve precise mask prediction. Extensive experiments demonstrate that our method generally surpasses both SAM-based and SAM-free competitors, striking a better balance between performance and cost. Code is available at this https URL.

Title: Unveiling Over-Memorization in Finetuning LLMs for Reasoning Tasks

Authors: Zhiwen Ruan, Yun Chen, Yutao Hou, Peng Li, Yang Liu, Guanhua Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.04117
Pdf URL: https://arxiv.org/pdf/2508.04117
Copy Paste: [[2508.04117]] Unveiling Over-Memorization in Finetuning LLMs for Reasoning Tasks(https://arxiv.org/abs/2508.04117)
Keywords: robust, large language model
Abstract: The pretrained large language models (LLMs) are finetuned with labeled data for better instruction following ability and alignment with human values. In this paper, we study the learning dynamics of LLM finetuning on reasoning tasks and reveal the uncovered over-memorization phenomenon during a specific stage of LLM finetuning. At this stage, the LLMs have excessively memorized training data and exhibit high test perplexity while maintaining good test accuracy. We investigate the conditions that lead to LLM over-memorization and find that training epochs and large learning rates contribute to this issue. Although models with over-memorization demonstrate comparable test accuracy to normal models, they suffer from reduced robustness, poor out-of-distribution generalization, and decreased generation diversity. Our experiments unveil the over-memorization to be broadly applicable across different tasks, models, and finetuning methods. Our research highlights that overparameterized, extensively finetuned LLMs exhibit unique learning dynamics distinct from traditional machine learning models. Based on our observations of over-memorization, we provide recommendations on checkpoint and learning rate selection during finetuning.

Title: Conditional Latent Diffusion Models for Zero-Shot Instance Segmentation

Authors: Maximilian Ulmer, Wout Boerdijk, Rudolph Triebel, Maximilian Durner
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04122
Pdf URL: https://arxiv.org/pdf/2508.04122
Copy Paste: [[2508.04122]] Conditional Latent Diffusion Models for Zero-Shot Instance Segmentation(https://arxiv.org/abs/2508.04122)
Keywords: diffusion, generative, segmentation
Abstract: This paper presents OC-DiT, a novel class of diffusion models designed for object-centric prediction, and applies it to zero-shot instance segmentation. We propose a conditional latent diffusion framework that generates instance masks by conditioning the generative process on object templates and image features within the diffusion model's latent space. This allows our model to effectively disentangle object instances through the diffusion process, which is guided by visual object descriptors and localized image cues. Specifically, we introduce two model variants: a coarse model for generating initial object instance proposals, and a refinement model that refines all proposals in parallel. We train these models on a newly created, large-scale synthetic dataset comprising thousands of high-quality object meshes. Remarkably, our model achieves state-of-the-art performance on multiple challenging real-world benchmarks, without requiring any retraining on target data. Through comprehensive ablation studies, we demonstrate the potential of diffusion models for instance segmentation tasks.

Title: Excavate the potential of Single-Scale Features: A Decomposition Network for Water-Related Optical Image Enhancement

Authors: Zheng Cheng, Wenri Wang, Guangyong Chen, Yakun Ju, Yihua Cheng, Zhisong Liu, Yanda Meng, Jintao Song
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2508.04123
Pdf URL: https://arxiv.org/pdf/2508.04123
Copy Paste: [[2508.04123]] Excavate the potential of Single-Scale Features: A Decomposition Network for Water-Related Optical Image Enhancement(https://arxiv.org/abs/2508.04123)
Keywords: extraction, transformer
Abstract: Underwater image enhancement (UIE) techniques aim to improve visual quality of images captured in aquatic environments by addressing degradation issues caused by light absorption and scattering effects, including color distortion, blurring, and low contrast. Current mainstream solutions predominantly employ multi-scale feature extraction (MSFE) mechanisms to enhance reconstruction quality through multi-resolution feature fusion. However, our extensive experiments demonstrate that high-quality image reconstruction does not necessarily rely on multi-scale feature fusion. Contrary to popular belief, our experiments show that single-scale feature extraction alone can match or surpass the performance of multi-scale methods, significantly reducing complexity. To comprehensively explore single-scale feature potential in underwater enhancement, we propose an innovative Single-Scale Decomposition Network (SSD-Net). This architecture introduces an asymmetrical decomposition mechanism that disentangles input image into clean layer along with degradation layer. The former contains scene-intrinsic information and the latter encodes medium-induced interference. It uniquely combines CNN's local feature extraction capabilities with Transformer's global modeling strengths through two core modules: 1) Parallel Feature Decomposition Block (PFDB), implementing dual-branch feature space decoupling via efficient attention operations and adaptive sparse transformer; 2) Bidirectional Feature Communication Block (BFCB), enabling cross-layer residual interactions for complementary feature mining and fusion. This synergistic design preserves feature decomposition independence while establishing dynamic cross-layer information pathways, effectively enhancing degradation decoupling capacity.

Title: SVC 2025: the First Multimodal Deception Detection Challenge

Authors: Xun Lin, Xiaobao Guo, Taorui Wang, Yingjie Ma, Jiajian Huang, Jiayu Zhang, Junzhe Cao, Zitong Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04129
Pdf URL: https://arxiv.org/pdf/2508.04129
Copy Paste: [[2508.04129]] SVC 2025: the First Multimodal Deception Detection Challenge(https://arxiv.org/abs/2508.04129)
Keywords: security
Abstract: Deception detection is a critical task in real-world applications such as security screening, fraud prevention, and credibility assessment. While deep learning methods have shown promise in surpassing human-level performance, their effectiveness often depends on the availability of high-quality and diverse deception samples. Existing research predominantly focuses on single-domain scenarios, overlooking the significant performance degradation caused by domain shifts. To address this gap, we present the SVC 2025 Multimodal Deception Detection Challenge, a new benchmark designed to evaluate cross-domain generalization in audio-visual deception detection. Participants are required to develop models that not only perform well within individual domains but also generalize across multiple heterogeneous datasets. By leveraging multimodal data, including audio, video, and text, this challenge encourages the design of models capable of capturing subtle and implicit deceptive cues. Through this benchmark, we aim to foster the development of more adaptable, explainable, and practically deployable deception detection systems, advancing the broader field of multimodal learning. By the conclusion of the workshop competition, a total of 21 teams had submitted their final results. this https URL for more information.

Title: DS$^2$Net: Detail-Semantic Deep Supervision Network for Medical Image Segmentation

Authors: Zhaohong Huang, Yuxin Zhang, Mingbao Lin, Taojian Zhou, Guorong Cai, Rongrong Ji
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.04131
Pdf URL: https://arxiv.org/pdf/2508.04131
Copy Paste: [[2508.04131]] DS$^2$Net: Detail-Semantic Deep Supervision Network for Medical Image Segmentation(https://arxiv.org/abs/2508.04131)
Keywords: segmentation
Abstract: Deep Supervision Networks exhibit significant efficacy for the medical imaging community. Nevertheless, existing work merely supervises either the coarse-grained semantic features or fine-grained detailed features in isolation, which compromises the fact that these two types of features hold vital relationships in medical image analysis. We advocate the powers of complementary feature supervision for medical image segmentation, by proposing a Detail-Semantic Deep Supervision Network (DS$^2$Net). DS$^2$Net navigates both low-level detailed and high-level semantic feature supervision through Detail Enhance Module (DEM) and Semantic Enhance Module (SEM). DEM and SEM respectively harness low-level and high-level feature maps to create detail and semantic masks for enhancing feature supervision. This is a novel shift from single-view deep supervision to multi-view deep supervision. DS$^2$Net is also equipped with a novel uncertainty-based supervision loss that adaptively assigns the supervision strength of features within distinct scales based on their uncertainty, thus circumventing the sub-optimal heuristic design that typifies previous works. Through extensive experiments on six benchmarks captured under either colonoscopy, ultrasound and microscope, we demonstrate that DS$^2$Net consistently outperforms state-of-the-art methods for medical image analysis.

Title: UniFGVC: Universal Training-Free Few-Shot Fine-Grained Vision Classification via Attribute-Aware Multimodal Retrieval

Authors: Hongyu Guo, Kuan Zhu, Xiangzhao Hao, Haiyun Guo, Ming Tang, Jinqiao Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.04136
Pdf URL: https://arxiv.org/pdf/2508.04136
Copy Paste: [[2508.04136]] UniFGVC: Universal Training-Free Few-Shot Fine-Grained Vision Classification via Attribute-Aware Multimodal Retrieval(https://arxiv.org/abs/2508.04136)
Keywords: large language model
Abstract: Few-shot fine-grained visual classification (FGVC) aims to leverage limited data to enable models to discriminate subtly distinct categories. Recent works mostly finetuned the pre-trained visual language models to achieve performance gain, yet suffering from overfitting and weak generalization. To deal with this, we introduce UniFGVC, a universal training-free framework that reformulates few-shot FGVC as multimodal retrieval. First, we propose the Category-Discriminative Visual Captioner (CDV-Captioner) to exploit the open-world knowledge of multimodal large language models (MLLMs) to generate a structured text description that captures the fine-grained attribute features distinguishing closely related classes. CDV-Captioner uses chain-of-thought prompting and visually similar reference images to reduce hallucination and enhance discrimination of generated captions. Using it we can convert each image into an image-description pair, enabling more comprehensive feature representation, and construct the multimodal category templates using few-shot samples for the subsequent retrieval pipeline. Then, off-the-shelf vision and text encoders embed query and template pairs, and FGVC is accomplished by retrieving the nearest template in the joint space. UniFGVC ensures broad compatibility with diverse MLLMs and encoders, offering reliable generalization and adaptability across few-shot FGVC scenarios. Extensive experiments on 12 FGVC benchmarks demonstrate its consistent superiority over prior few-shot CLIP-based methods and even several fully-supervised MLLMs-based approaches.

Title: COPO: Consistency-Aware Policy Optimization

Authors: Jinghang Han, Jiawei Chen, Hang Shao, Hao Ma, Mingcheng Li, Xintian Shen, Lihao Zheng, Wei Chen, Tao Wei, Lihua Zhang
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2508.04138
Pdf URL: https://arxiv.org/pdf/2508.04138
Copy Paste: [[2508.04138]] COPO: Consistency-Aware Policy Optimization(https://arxiv.org/abs/2508.04138)
Keywords: robust, large language model
Abstract: Reinforcement learning has significantly enhanced the reasoning capabilities of Large Language Models (LLMs) in complex problem-solving tasks. Recently, the introduction of DeepSeek R1 has inspired a surge of interest in leveraging rule-based rewards as a low-cost alternative for computing advantage functions and guiding policy optimization. However, a common challenge observed across many replication and extension efforts is that when multiple sampled responses under a single prompt converge to identical outcomes, whether correct or incorrect, the group-based advantage degenerates to zero. This leads to vanishing gradients and renders the corresponding samples ineffective for learning, ultimately limiting training efficiency and downstream performance. To address this issue, we propose a consistency-aware policy optimization framework that introduces a structured global reward based on outcome consistency, the global loss based on it ensures that, even when model outputs show high intra-group consistency, the training process still receives meaningful learning signals, which encourages the generation of correct and self-consistent reasoning paths from a global perspective. Furthermore, we incorporate an entropy-based soft blending mechanism that adaptively balances local advantage estimation with global optimization, enabling dynamic transitions between exploration and convergence throughout training. Our method introduces several key innovations in both reward design and optimization strategy. We validate its effectiveness through substantial performance gains on multiple mathematical reasoning benchmarks, highlighting the proposed framework's robustness and general applicability. Code of this work has been released at this https URL.

Title: IDCNet: Guided Video Diffusion for Metric-Consistent RGBD Scene Generation with Precise Camera Control

Authors: Lijuan Liu, Wenfa Li, Dongbo Zhang, Shuo Wang, Shaohui Jiao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04147
Pdf URL: https://arxiv.org/pdf/2508.04147
Copy Paste: [[2508.04147]] IDCNet: Guided Video Diffusion for Metric-Consistent RGBD Scene Generation with Precise Camera Control(https://arxiv.org/abs/2508.04147)
Keywords: diffusion, transformer
Abstract: We present IDC-Net (Image-Depth Consistency Network), a novel framework designed to generate RGB-D video sequences under explicit camera trajectory control. Unlike approaches that treat RGB and depth generation separately, IDC-Net jointly synthesizes both RGB images and corresponding depth maps within a unified geometry-aware diffusion model. The joint learning framework strengthens spatial and geometric alignment across frames, enabling more precise camera control in the generated sequences. To support the training of this camera-conditioned model and ensure high geometric fidelity, we construct a camera-image-depth consistent dataset with metric-aligned RGB videos, depth maps, and accurate camera poses, which provides precise geometric supervision with notably improved inter-frame geometric consistency. Moreover, we introduce a geometry-aware transformer block that enables fine-grained camera control, enhancing control over the generated sequences. Extensive experiments show that IDC-Net achieves improvements over state-of-the-art approaches in both visual quality and geometric consistency of generated scene sequences. Notably, the generated RGB-D sequences can be directly feed for downstream 3D Scene reconstruction tasks without extra post-processing steps, showcasing the practical benefits of our joint learning framework. See more at this https URL.

Title: Difficulty-Based Preference Data Selection by DPO Implicit Reward Gap

Authors: Xuan Qi, Rongwu Xu, Zhijing Jin
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.04149
Pdf URL: https://arxiv.org/pdf/2508.04149
Copy Paste: [[2508.04149]] Difficulty-Based Preference Data Selection by DPO Implicit Reward Gap(https://arxiv.org/abs/2508.04149)
Keywords: large language model
Abstract: Aligning large language models (LLMs) with human preferences is a critical challenge in AI research. While methods like Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) are widely used, they often rely on large, costly preference datasets. The current work lacks methods for high-quality data selection specifically for preference data. In this work, we introduce a novel difficulty-based data selection strategy for preference datasets, grounded in the DPO implicit reward mechanism. By selecting preference data examples with smaller DPO implicit reward gaps, which are indicative of more challenging cases, we improve data efficiency and model alignment. Our approach consistently outperforms five strong baselines across multiple datasets and alignment tasks, achieving superior performance with only 10\% of the original data. This principled, efficient selection method offers a promising solution for scaling LLM alignment with limited resources.

Title: Evaluating Selective Encryption Against Gradient Inversion Attacks

Authors: Jiajun Gu, Yuhang Yao, Shuaiqi Wang, Carlee Joe-Wong
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2508.04155
Pdf URL: https://arxiv.org/pdf/2508.04155
Copy Paste: [[2508.04155]] Evaluating Selective Encryption Against Gradient Inversion Attacks(https://arxiv.org/abs/2508.04155)
Keywords: privacy, protect, defense, attack, federate
Abstract: Gradient inversion attacks pose significant privacy threats to distributed training frameworks such as federated learning, enabling malicious parties to reconstruct sensitive local training data from gradient communications between clients and an aggregation server during the aggregation process. While traditional encryption-based defenses, such as homomorphic encryption, offer strong privacy guarantees without compromising model utility, they often incur prohibitive computational overheads. To mitigate this, selective encryption has emerged as a promising approach, encrypting only a subset of gradient data based on the data's significance under a certain metric. However, there have been few systematic studies on how to specify this metric in practice. This paper systematically evaluates selective encryption methods with different significance metrics against state-of-the-art attacks. Our findings demonstrate the feasibility of selective encryption in reducing computational overhead while maintaining resilience against attacks. We propose a distance-based significance analysis framework that provides theoretical foundations for selecting critical gradient elements for encryption. Through extensive experiments on different model architectures (LeNet, CNN, BERT, GPT-2) and attack types, we identify gradient magnitude as a generally effective metric for protection against optimization-based gradient inversions. However, we also observe that no single selective encryption strategy is universally optimal across all attack scenarios, and we provide guidelines for choosing appropriate strategies for different model architectures and privacy requirements.

Title: ToxicTAGS: Decoding Toxic Memes with Rich Tag Annotations

Authors: Subhankar Swain, Naquee Rizwan, Nayandeep Deb, Vishwajeet Singh Solanki, Vishwa Gangadhar S, Animesh Mukherjee
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2508.04166
Pdf URL: https://arxiv.org/pdf/2508.04166
Copy Paste: [[2508.04166]] ToxicTAGS: Decoding Toxic Memes with Rich Tag Annotations(https://arxiv.org/abs/2508.04166)
Keywords: robust
Abstract: The 2025 Global Risks Report identifies state-based armed conflict and societal polarisation among the most pressing global threats, with social media playing a central role in amplifying toxic discourse. Memes, as a widely used mode of online communication, often serve as vehicles for spreading harmful content. However, limitations in data accessibility and the high cost of dataset curation hinder the development of robust meme moderation systems. To address this challenge, in this work, we introduce a first-of-its-kind dataset of 6,300 real-world meme-based posts annotated in two stages: (i) binary classification into toxic and normal, and (ii) fine-grained labelling of toxic memes as hateful, dangerous, or offensive. A key feature of this dataset is that it is enriched with auxiliary metadata of socially relevant tags, enhancing the context of each meme. In addition, we propose a tag generation module that produces socially grounded tags, because most in-the-wild memes often do not come with tags. Experimental results show that incorporating these tags substantially enhances the performance of state-of-the-art VLMs detection tasks. Our contributions offer a novel and scalable foundation for improved content moderation in multimodal online environments.

Title: AD-FM: Multimodal LLMs for Anomaly Detection via Multi-Stage Reasoning and Fine-Grained Reward Optimization

Authors: Jingyi Liao, Yongyi Su, Rong-Cheng Tu, Zhao Jin, Wenhao Sun, Yiting Li, Dacheng Tao, Xun Xu, Xulei Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04175
Pdf URL: https://arxiv.org/pdf/2508.04175
Copy Paste: [[2508.04175]] AD-FM: Multimodal LLMs for Anomaly Detection via Multi-Stage Reasoning and Fine-Grained Reward Optimization(https://arxiv.org/abs/2508.04175)
Keywords: large language model
Abstract: While Multimodal Large Language Models (MLLMs) demonstrate remarkable capabilities across diverse domains, their application to specialized anomaly detection (AD) remains constrained by domain adaptation challenges. Existing Group Relative Policy Optimization (GRPO) based approaches suffer from two critical limitations: inadequate training data utilization when models produce uniform responses, and insufficient supervision over reasoning processes that encourage immediate binary decisions without deliberative analysis. We propose a comprehensive framework addressing these limitations through two synergistic innovations. First, we introduce a multi-stage deliberative reasoning process that guides models from region identification to focused examination, generating diverse response patterns essential for GRPO optimization while enabling structured supervision over analytical workflows. Second, we develop a fine-grained reward mechanism incorporating classification accuracy and localization supervision, transforming binary feedback into continuous signals that distinguish genuine analytical insight from spurious correctness. Comprehensive evaluation across multiple industrial datasets demonstrates substantial performance improvements in adapting general vision-language models to specialized anomaly detection. Our method achieves superior accuracy with efficient adaptation of existing annotations, effectively bridging the gap between general-purpose MLLM capabilities and the fine-grained visual discrimination required for detecting subtle manufacturing defects and structural irregularities.

Title: Uncertainty-Aware Spatial Color Correlation for Low-Light Image Enhancement

Authors: Jin Kuang, Dong Liu, Yukuang Zhang, Shengsheng Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04176
Pdf URL: https://arxiv.org/pdf/2508.04176
Copy Paste: [[2508.04176]] Uncertainty-Aware Spatial Color Correlation for Low-Light Image Enhancement(https://arxiv.org/abs/2508.04176)
Keywords: robust, extraction
Abstract: Most existing low-light image enhancement approaches primarily focus on architectural innovations, while often overlooking the intrinsic uncertainty within feature representations particularly under extremely dark conditions where degraded gradient and noise dominance severely impair model reliability and causal reasoning. To address these issues, we propose U2CLLIE, a novel framework that integrates uncertainty-aware enhancement and spatial-color causal correlation modeling. From the perspective of entropy-based uncertainty, our framework introduces two key components: (1) An Uncertainty-Aware Dual-domain Denoise (UaD) Module, which leverages Gaussian-Guided Adaptive Frequency Domain Feature Enhancement (G2AF) to suppress frequency-domain noise and optimize entropy-driven representations. This module enhances spatial texture extraction and frequency-domain noise suppression/structure refinement, effectively mitigating gradient vanishing and noise dominance. (2) A hierarchical causality-aware framework, where a Luminance Enhancement Network (LEN) first performs coarse brightness enhancement on dark regions. Then, during the encoder-decoder phase, two asymmetric causal correlation modeling modules Neighborhood Correlation State Space (NeCo) and Adaptive Spatial-Color Calibration (AsC) collaboratively construct hierarchical causal constraints. These modules reconstruct and reinforce neighborhood structure and color consistency in the feature space. Extensive experiments demonstrate that U2CLLIE achieves state-of-the-art performance across multiple benchmark datasets, exhibiting robust performance and strong generalization across various scenes.

Title: Secure Development of a Hooking-Based Deception Framework Against Keylogging Techniques

Authors: Md Sajidul Islam Sajid, Shihab Ahmed, Ryan Sosnoski
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2508.04178
Pdf URL: https://arxiv.org/pdf/2508.04178
Copy Paste: [[2508.04178]] Secure Development of a Hooking-Based Deception Framework Against Keylogging Techniques(https://arxiv.org/abs/2508.04178)
Keywords: secure, security, defense, attack, robust, steal
Abstract: Keyloggers remain a serious threat in modern cybersecurity, silently capturing user keystrokes to steal credentials and sensitive information. Traditional defenses focus mainly on detection and removal, which can halt malicious activity but do little to engage or mislead adversaries. In this paper, we present a deception framework that leverages API hooking to intercept input-related API calls invoked by keyloggers at runtime and inject realistic decoy keystrokes. A core challenge, however, lies in the increasing adoption of anti-hooking techniques by advanced keyloggers. Anti-hooking strategies allow malware to bypass or detect instrumentation. To counter this, we introduce a hardened hooking layer that detects tampering and rapidly reinstates disrupted hooks, ensuring continuity of deception. We evaluate our framework against a custom-built "super keylogger" incorporating multiple evasion strategies, as well as 50 real-world malware samples spanning ten prominent keylogger families. Experimental results demonstrate that our system successfully resists sophisticated bypass attempts, maintains operational stealth, and reliably deceives attackers by feeding them decoys. The system operates with negligible performance overhead and no observable impact on user experience. Our findings show that resilient, runtime deception can play a practical and robust role in confronting advanced threats.

Title: One Small Step with Fingerprints, One Giant Leap for emph{De Novo} Molecule Generation from Mass Spectra

Authors: Neng Kai Nigel Neo, Lim Jing, Ngoui Yong Zhau Preston, Koh Xue Ting Serene, Bingquan Shen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.04180
Pdf URL: https://arxiv.org/pdf/2508.04180
Copy Paste: [[2508.04180]] One Small Step with Fingerprints, One Giant Leap for emph{De Novo} Molecule Generation from Mass Spectra(https://arxiv.org/abs/2508.04180)
Keywords: robust
Abstract: A common approach to the \emph{de novo} molecular generation problem from mass spectra involves a two-stage pipeline: (1) encoding mass spectra into molecular fingerprints, followed by (2) decoding these fingerprints into molecular structures. In our work, we adopt \textsc{MIST}~\citep{MISTgoldmanAnnotatingMetaboliteMass2023} as the encoder and \textsc{MolForge}~\citep{ucakReconstructionLosslessMolecular2023} as the decoder, leveraging pretraining to enhance performance. Notably, pretraining \textsc{MolForge} proves especially effective, enabling it to serve as a robust fingerprint-to-structure decoder. Additionally, instead of passing the probability of each bit in the fingerprint, thresholding the probabilities as a step function helps focus the decoder on the presence of substructures, improving recovery of accurate molecular structures even when the fingerprints predicted by \textsc{MIST} only moderately resembles the ground truth in terms of Tanimoto similarity. This combination of encoder and decoder results in a tenfold improvement over previous state-of-the-art methods, generating top-1 28\% / top-10 36\% of molecular structures correctly from mass spectra. We position this pipeline as a strong baseline for future research in \emph{de novo} molecule elucidation from mass spectra.

Title: Hacking Hallucinations of MLLMs with Causal Sufficiency and Necessity

Authors: Peizheng Guo, Jingyao Wang, Wenwen Qiang, Huijie Guo, Changwen Zheng, Jiahuan Zhou, Gang Hua
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.04182
Pdf URL: https://arxiv.org/pdf/2508.04182
Copy Paste: [[2508.04182]] Hacking Hallucinations of MLLMs with Causal Sufficiency and Necessity(https://arxiv.org/abs/2508.04182)
Keywords: large language model
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities across vision-language tasks. However, they may suffer from hallucinations--generating outputs that are semantically inconsistent with the input image or text. Through causal analyses, we find that: (i) hallucinations with omission may arise from the failure to adequately capture essential causal factors, and (ii) hallucinations with fabrication are likely caused by the model being misled by non-causal cues. To address these challenges, we propose a novel reinforcement learning framework guided by causal completeness, which jointly considers both causal sufficiency and causal necessity of tokens. Specifically, we evaluate each token's standalone contribution and counterfactual indispensability to define a token-level causal completeness reward. This reward is used to construct a causally informed advantage function within the GRPO optimization framework, encouraging the model to focus on tokens that are both causally sufficient and necessary for accurate generation. Experimental results across various benchmark datasets and tasks demonstrate the effectiveness of our approach, which effectively mitigates hallucinations in MLLMs.

Title: BadTime: An Effective Backdoor Attack on Multivariate Long-Term Time Series Forecasting

Authors: Kunlan Xiang, Haomiao Yang, Meng Hao, Haoxin Wang, Shaofeng Li, Wenbo Jiang
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2508.04189
Pdf URL: https://arxiv.org/pdf/2508.04189
Copy Paste: [[2508.04189]] BadTime: An Effective Backdoor Attack on Multivariate Long-Term Time Series Forecasting(https://arxiv.org/abs/2508.04189)
Keywords: attack, robust, steal
Abstract: Multivariate Long-Term Time Series Forecasting (MLTSF) models are increasingly deployed in critical domains such as climate, finance, and transportation. Although a variety of powerful MLTSF models have been proposed to improve predictive performance, the robustness of MLTSF models against malicious backdoor attacks remains entirely unexplored, which is crucial to ensuring their reliable and trustworthy deployment. To address this gap, we conduct an in-depth study on backdoor attacks against MLTSF models and propose the first effective attack method named BadTime. BadTime executes a backdoor attack by poisoning training data and customizing the backdoor training process. During data poisoning, BadTime proposes a contrast-guided strategy to select the most suitable training samples for poisoning, then employs a graph attention network to identify influential variables for trigger injection. Subsequently, BadTime further localizes optimal positions for trigger injection based on lag analysis and proposes a puzzle-like trigger structure that distributes the trigger across multiple poisoned variables to jointly steer the prediction of the target variable. During backdoor training, BadTime alternately optimizes the model and triggers via proposed tailored optimization objectives. Extensive experiments show that BadTime significantly outperforms state-of-the-art (SOTA) backdoor attacks on time series forecasting by reducing MAE by over 50% on target variables and boosting stealthiness by more than 3 times.

Title: RPCANet++: Deep Interpretable Robust PCA for Sparse Object Segmentation

Authors: Fengyi Wu, Yimian Dai, Tianfang Zhang, Yixuan Ding, Jian Yang, Ming-Ming Cheng, Zhenming Peng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04190
Pdf URL: https://arxiv.org/pdf/2508.04190
Copy Paste: [[2508.04190]] RPCANet++: Deep Interpretable Robust PCA for Sparse Object Segmentation(https://arxiv.org/abs/2508.04190)
Keywords: robust, extraction, interpretability, segmentation
Abstract: Robust principal component analysis (RPCA) decomposes an observation matrix into low-rank background and sparse object components. This capability has enabled its application in tasks ranging from image restoration to segmentation. However, traditional RPCA models suffer from computational burdens caused by matrix operations, reliance on finely tuned hyperparameters, and rigid priors that limit adaptability in dynamic scenarios. To solve these limitations, we propose RPCANet++, a sparse object segmentation framework that fuses the interpretability of RPCA with efficient deep architectures. Our approach unfolds a relaxed RPCA model into a structured network comprising a Background Approximation Module (BAM), an Object Extraction Module (OEM), and an Image Restoration Module (IRM). To mitigate inter-stage transmission loss in the BAM, we introduce a Memory-Augmented Module (MAM) to enhance background feature preservation, while a Deep Contrast Prior Module (DCPM) leverages saliency cues to expedite object extraction. Extensive experiments on diverse datasets demonstrate that RPCANet++ achieves state-of-the-art performance under various imaging scenarios. We further improve interpretability via visual and numerical low-rankness and sparsity measurements. By combining the theoretical strengths of RPCA with the efficiency of deep networks, our approach sets a new baseline for reliable and interpretable sparse object segmentation. Codes are available at our Project Webpage this https URL.

Title: From Learning to Unlearning: Biomedical Security Protection in Multimodal Large Language Models

Authors: Dunyuan Xu, Xikai Yang, Yaoqian Li, Jinpeng Li, Pheng-Ann Heng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04192
Pdf URL: https://arxiv.org/pdf/2508.04192
Copy Paste: [[2508.04192]] From Learning to Unlearning: Biomedical Security Protection in Multimodal Large Language Models(https://arxiv.org/abs/2508.04192)
Keywords: security, privacy, protect, large language model
Abstract: The security of biomedical Multimodal Large Language Models (MLLMs) has attracted increasing attention. However, training samples easily contain private information and incorrect knowledge that are difficult to detect, potentially leading to privacy leakage or erroneous outputs after deployment. An intuitive idea is to reprocess the training set to remove unwanted content and retrain the model from scratch. Yet, this is impractical due to significant computational costs, especially for large language models. Machine unlearning has emerged as a solution to this problem, which avoids complete retraining by selectively removing undesired knowledge derived from harmful samples while preserving required capabilities on normal cases. However, there exist no available datasets to evaluate the unlearning quality for security protection in biomedical MLLMs. To bridge this gap, we propose the first benchmark Multimodal Large Language Model Unlearning for BioMedicine (MLLMU-Med) built upon our novel data generation pipeline that effectively integrates synthetic private data and factual errors into the training set. Our benchmark targets two key scenarios: 1) Privacy protection, where patient private information is mistakenly included in the training set, causing models to unintentionally respond with private data during inference; and 2) Incorrectness removal, where wrong knowledge derived from unreliable sources is embedded into the dataset, leading to unsafe model responses. Moreover, we propose a novel Unlearning Efficiency Score that directly reflects the overall unlearning performance across different subsets. We evaluate five unlearning approaches on MLLMU-Med and find that these methods show limited effectiveness in removing harmful knowledge from biomedical MLLMs, indicating significant room for improvement. This work establishes a new pathway for further research in this promising field.

Title: Eliciting and Analyzing Emergent Misalignment in State-of-the-Art Large Language Models

Authors: Siddhant Panpatil, Hiskias Dingeto, Haon Park
Subjects: cs.CL, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2508.04196
Pdf URL: https://arxiv.org/pdf/2508.04196
Copy Paste: [[2508.04196]] Eliciting and Analyzing Emergent Misalignment in State-of-the-Art Large Language Models(https://arxiv.org/abs/2508.04196)
Keywords: protect, attack, robust, large language model
Abstract: Despite significant advances in alignment techniques, we demonstrate that state-of-the-art language models remain vulnerable to carefully crafted conversational scenarios that can induce various forms of misalignment without explicit jailbreaking. Through systematic manual red-teaming with Claude-4-Opus, we discovered 10 successful attack scenarios, revealing fundamental vulnerabilities in how current alignment methods handle narrative immersion, emotional pressure, and strategic framing. These scenarios successfully elicited a range of misaligned behaviors, including deception, value drift, self-preservation, and manipulative reasoning, each exploiting different psychological and contextual vulnerabilities. To validate generalizability, we distilled our successful manual attacks into MISALIGNMENTBENCH, an automated evaluation framework that enables reproducible testing across multiple models. Cross-model evaluation of our 10 scenarios against five frontier LLMs revealed an overall 76% vulnerability rate, with significant variations: GPT-4.1 showed the highest susceptibility (90%), while Claude-4-Sonnet demonstrated greater resistance (40%). Our findings demonstrate that sophisticated reasoning capabilities often become attack vectors rather than protective mechanisms, as models can be manipulated into complex justifications for misaligned behavior. This work provides (i) a detailed taxonomy of conversational manipulation patterns and (ii) a reusable evaluation framework. Together, these findings expose critical gaps in current alignment strategies and highlight the need for robustness against subtle, scenario-based manipulation in future AI systems.

Title: Gather and Trace: Rethinking Video TextVQA from an Instance-oriented Perspective

Authors: Yan Zhang, Gangyan Zeng, Daiqing Wu, Huawen Shen, Binbin Li, Yu Zhou, Can Ma, Xiaojun Bi
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.04197
Pdf URL: https://arxiv.org/pdf/2508.04197
Copy Paste: [[2508.04197]] Gather and Trace: Rethinking Video TextVQA from an Instance-oriented Perspective(https://arxiv.org/abs/2508.04197)
Keywords: large language model
Abstract: Video text-based visual question answering (Video TextVQA) aims to answer questions by explicitly reading and reasoning about the text involved in a video. Most works in this field follow a frame-level framework which suffers from redundant text entities and implicit relation modeling, resulting in limitations in both accuracy and efficiency. In this paper, we rethink the Video TextVQA task from an instance-oriented perspective and propose a novel model termed GAT (Gather and Trace). First, to obtain accurate reading result for each video text instance, a context-aggregated instance gathering module is designed to integrate the visual appearance, layout characteristics, and textual contents of the related entities into a unified textual representation. Then, to capture dynamic evolution of text in the video flow, an instance-focused trajectory tracing module is utilized to establish spatio-temporal relationships between instances and infer the final answer. Extensive experiments on several public Video TextVQA datasets validate the effectiveness and generalization of our framework. GAT outperforms existing Video TextVQA methods, video-language pretraining methods, and video large language models in both accuracy and inference speed. Notably, GAT surpasses the previous state-of-the-art Video TextVQA methods by 3.86\% in accuracy and achieves ten times of faster inference speed than video large language models. The source code is available at this https URL.

Title: Reasoning Beyond Labels: Measuring LLM Sentiment in Low-Resource, Culturally Nuanced Contexts

Authors: Millicent Ochieng, Anja Thieme, Ignatius Ezeani, Risa Ueno, Samuel Maina, Keshet Ronen, Javier Gonzalez, Jacki O'Neill
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.04199
Pdf URL: https://arxiv.org/pdf/2508.04199
Copy Paste: [[2508.04199]] Reasoning Beyond Labels: Measuring LLM Sentiment in Low-Resource, Culturally Nuanced Contexts(https://arxiv.org/abs/2508.04199)
Keywords: robust, interpretability, large language model
Abstract: Sentiment analysis in low-resource, culturally nuanced contexts challenges conventional NLP approaches that assume fixed labels and universal affective expressions. We present a diagnostic framework that treats sentiment as a context-dependent, culturally embedded construct, and evaluate how large language models (LLMs) reason about sentiment in informal, code-mixed WhatsApp messages from Nairobi youth health groups. Using a combination of human-annotated data, sentiment-flipped counterfactuals, and rubric-based explanation evaluation, we probe LLM interpretability, robustness, and alignment with human reasoning. Framing our evaluation through a social-science measurement lens, we operationalize and interrogate LLMs outputs as an instrument for measuring the abstract concept of sentiment. Our findings reveal significant variation in model reasoning quality, with top-tier LLMs demonstrating interpretive stability, while open models often falter under ambiguity or sentiment shifts. This work highlights the need for culturally sensitive, reasoning-aware AI evaluation in complex, real-world communication.

Title: ReasoningGuard: Safeguarding Large Reasoning Models with Inference-time Safety Aha Moments

Authors: Yuquan Wang, Mi Zhang, Yining Wang, Geng Hong, Xiaoyu You, Min Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.04204
Pdf URL: https://arxiv.org/pdf/2508.04204
Copy Paste: [[2508.04204]] ReasoningGuard: Safeguarding Large Reasoning Models with Inference-time Safety Aha Moments(https://arxiv.org/abs/2508.04204)
Keywords: defense, attack
Abstract: Large Reasoning Models (LRMs) have demonstrated impressive performance in reasoning-intensive tasks, but they remain vulnerable to harmful content generation, particularly in the mid-to-late steps of their reasoning processes. Existing defense mechanisms, however, rely on costly fine-tuning and additional expert knowledge, which restricts their scalability. In this work, we propose ReasoningGuard, an inference-time safeguard for LRMs, which injects timely safety aha moments to steer harmless while helpful reasoning processes. Leveraging the model's internal attention behavior, our approach accurately identifies critical points in the reasoning path, and triggers spontaneous, safety-oriented reflection. To safeguard both the subsequent reasoning steps and the final answers, we further implement a scaling sampling strategy during the decoding phase, selecting the optimal reasoning path. Inducing minimal extra inference cost, ReasoningGuard effectively mitigates three types of jailbreak attacks, including the latest ones targeting the reasoning process of LRMs. Our approach outperforms seven existing safeguards, achieving state-of-the-art safety defenses while effectively avoiding the common exaggerated safety issues.

Title: DP-DocLDM: Differentially Private Document Image Generation using Latent Diffusion Models

Authors: Saifullah Saifullah, Stefan Agne, Andreas Dengel, Sheraz Ahmed
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2508.04208
Pdf URL: https://arxiv.org/pdf/2508.04208
Copy Paste: [[2508.04208]] DP-DocLDM: Differentially Private Document Image Generation using Latent Diffusion Models(https://arxiv.org/abs/2508.04208)
Keywords: privacy, extraction, diffusion
Abstract: As deep learning-based, data-driven information extraction systems become increasingly integrated into modern document processing workflows, one primary concern is the risk of malicious leakage of sensitive private data from these systems. While some recent works have explored Differential Privacy (DP) to mitigate these privacy risks, DP-based training is known to cause significant performance degradation and impose several limitations on standard training procedures, making its direct application to downstream tasks both difficult and costly. In this work, we aim to address the above challenges within the context of document image classification by substituting real private data with a synthetic counterpart. In particular, we propose to use conditional latent diffusion models (LDMs) in combination with differential privacy (DP) to generate class-specific synthetic document images under strict privacy constraints, which can then be utilized to train a downstream classifier following standard training procedures. We investigate our approach under various pretraining setups, including unconditional, class-conditional, and layout-conditional pretraining, in combination with multiple private training strategies such as class-conditional and per-label private fine-tuning with DPDM and DP-Promise algorithms. Additionally, we evaluate it on two well-known document benchmark datasets, RVL-CDIP and Tobacco3482, and show that it can generate useful and realistic document samples across various document types and privacy levels ($\varepsilon \in \{1, 5, 10\}$). Lastly, we show that our approach achieves substantial performance improvements in downstream evaluations on small-scale datasets, compared to the direct application of DP-Adam.

Title: What Holds Back Open-Vocabulary Segmentation?

Authors: Josip Šarić, Ivan Martinović, Matej Kristan, Siniša Šegvić
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04211
Pdf URL: https://arxiv.org/pdf/2508.04211
Copy Paste: [[2508.04211]] What Holds Back Open-Vocabulary Segmentation?(https://arxiv.org/abs/2508.04211)
Keywords: segmentation
Abstract: Standard segmentation setups are unable to deliver models that can recognize concepts outside the training taxonomy. Open-vocabulary approaches promise to close this gap through language-image pretraining on billions of image-caption pairs. Unfortunately, we observe that the promise is not delivered due to several bottlenecks that have caused the performance to plateau for almost two years. This paper proposes novel oracle components that identify and decouple these bottlenecks by taking advantage of the groundtruth information. The presented validation experiments deliver important empirical findings that provide a deeper insight into the failures of open-vocabulary models and suggest prominent approaches to unlock the future research.

Title: Hierarchical Text Classification Using Black Box Large Language Models

Authors: Kosuke Yoshimura, Hisashi Kashima
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2508.04219
Pdf URL: https://arxiv.org/pdf/2508.04219
Copy Paste: [[2508.04219]] Hierarchical Text Classification Using Black Box Large Language Models(https://arxiv.org/abs/2508.04219)
Keywords: large language model
Abstract: Hierarchical Text Classification (HTC) aims to assign texts to structured label hierarchies; however, it faces challenges due to data scarcity and model complexity. This study explores the feasibility of using black box Large Language Models (LLMs) accessed via APIs for HTC, as an alternative to traditional machine learning methods that require extensive labeled data and computational resources. We evaluate three prompting strategies -- Direct Leaf Label Prediction (DL), Direct Hierarchical Label Prediction (DH), and Top-down Multi-step Hierarchical Label Prediction (TMH) -- in both zero-shot and few-shot settings, comparing the accuracy and cost-effectiveness of these strategies. Experiments on two datasets show that a few-shot setting consistently improves classification accuracy compared to a zero-shot setting. While a traditional machine learning model achieves high accuracy on a dataset with a shallow hierarchy, LLMs, especially DH strategy, tend to outperform the machine learning model on a dataset with a deeper hierarchy. API costs increase significantly due to the higher input tokens required for deeper label hierarchies on DH strategy. These results emphasize the trade-off between accuracy improvement and the computational cost of prompt strategy. These findings highlight the potential of black box LLMs for HTC while underscoring the need to carefully select a prompt strategy to balance performance and cost.

Title: SplitGaussian: Reconstructing Dynamic Scenes via Visual Geometry Decomposition

Authors: Jiahui Li, Shengeng Tang, Jingxuan He, Gang Huang, Zhangye Wang, Yantao Pan, Lechao Cheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04224
Pdf URL: https://arxiv.org/pdf/2508.04224
Copy Paste: [[2508.04224]] SplitGaussian: Reconstructing Dynamic Scenes via Visual Geometry Decomposition(https://arxiv.org/abs/2508.04224)
Keywords: interpretability
Abstract: Reconstructing dynamic 3D scenes from monocular video remains fundamentally challenging due to the need to jointly infer motion, structure, and appearance from limited observations. Existing dynamic scene reconstruction methods based on Gaussian Splatting often entangle static and dynamic elements in a shared representation, leading to motion leakage, geometric distortions, and temporal flickering. We identify that the root cause lies in the coupled modeling of geometry and appearance across time, which hampers both stability and interpretability. To address this, we propose \textbf{SplitGaussian}, a novel framework that explicitly decomposes scene representations into static and dynamic components. By decoupling motion modeling from background geometry and allowing only the dynamic branch to deform over time, our method prevents motion artifacts in static regions while supporting view- and time-dependent appearance refinement. This disentangled design not only enhances temporal consistency and reconstruction fidelity but also accelerates convergence. Extensive experiments demonstrate that SplitGaussian outperforms prior state-of-the-art methods in rendering quality, geometric stability, and motion separation.

Title: LayerT2V: Interactive Multi-Object Trajectory Layering for Video Generation

Authors: Kangrui Cen, Baixuan Zhao, Yi Xin, Siqi Luo, Guangtao Zhai, Xiaohong Liu
Subjects: cs.CV, cs.AI, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2508.04228
Pdf URL: https://arxiv.org/pdf/2508.04228
Copy Paste: [[2508.04228]] LayerT2V: Interactive Multi-Object Trajectory Layering for Video Generation(https://arxiv.org/abs/2508.04228)
Keywords: generative
Abstract: Controlling object motion trajectories in Text-to-Video (T2V) generation is a challenging and relatively under-explored area, particularly in scenarios involving multiple moving objects. Most community models and datasets in the T2V domain are designed for single-object motion, limiting the performance of current generative models in multi-object tasks. Additionally, existing motion control methods in T2V either lack support for multi-object motion scenes or experience severe performance degradation when object trajectories intersect, primarily due to the semantic conflicts in colliding regions. To address these limitations, we introduce LayerT2V, the first approach for generating video by compositing background and foreground objects layer by layer. This layered generation enables flexible integration of multiple independent elements within a video, positioning each element on a distinct "layer" and thus facilitating coherent multi-object synthesis while enhancing control over the generation process. Extensive experiments demonstrate the superiority of LayerT2V in generating complex multi-object scenarios, showcasing 1.4x and 4.5x improvements in mIoU and AP50 metrics over state-of-the-art (SOTA) methods. Project page and code are available at this https URL .

Title: Intention Enhanced Diffusion Model for Multimodal Pedestrian Trajectory Prediction

Authors: Yu Liu, Zhijie Liu, Xiao Ren, You-Fu Li, He Kong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04229
Pdf URL: https://arxiv.org/pdf/2508.04229
Copy Paste: [[2508.04229]] Intention Enhanced Diffusion Model for Multimodal Pedestrian Trajectory Prediction(https://arxiv.org/abs/2508.04229)
Keywords: interpretability, diffusion
Abstract: Predicting pedestrian motion trajectories is critical for path planning and motion control of autonomous vehicles. However, accurately forecasting crowd trajectories remains a challenging task due to the inherently multimodal and uncertain nature of human motion. Recent diffusion-based models have shown promising results in capturing the stochasticity of pedestrian behavior for trajectory prediction. However, few diffusion-based approaches explicitly incorporate the underlying motion intentions of pedestrians, which can limit the interpretability and precision of prediction models. In this work, we propose a diffusion-based multimodal trajectory prediction model that incorporates pedestrians' motion intentions into the prediction framework. The motion intentions are decomposed into lateral and longitudinal components, and a pedestrian intention recognition module is introduced to enable the model to effectively capture these intentions. Furthermore, we adopt an efficient guidance mechanism that facilitates the generation of interpretable trajectories. The proposed framework is evaluated on two widely used human trajectory prediction benchmarks, ETH and UCY, on which it is compared against state-of-the-art methods. The experimental results demonstrate that our method achieves competitive performance.

Title: Empowering Time Series Forecasting with LLM-Agents

Authors: Chin-Chia Michael Yeh, Vivian Lai, Uday Singh Saini, Xiran Fan, Yujie Fan, Junpeng Wang, Xin Dai, Yan Zheng
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.04231
Pdf URL: https://arxiv.org/pdf/2508.04231
Copy Paste: [[2508.04231]] Empowering Time Series Forecasting with LLM-Agents(https://arxiv.org/abs/2508.04231)
Keywords: large language model
Abstract: Large Language Model (LLM) powered agents have emerged as effective planners for Automated Machine Learning (AutoML) systems. While most existing AutoML approaches focus on automating feature engineering and model architecture search, recent studies in time series forecasting suggest that lightweight models can often achieve state-of-the-art performance. This observation led us to explore improving data quality, rather than model architecture, as a potentially fruitful direction for AutoML on time series data. We propose DCATS, a Data-Centric Agent for Time Series. DCATS leverages metadata accompanying time series to clean data while optimizing forecasting performance. We evaluated DCATS using four time series forecasting models on a large-scale traffic volume forecasting dataset. Results demonstrate that DCATS achieves an average 6% error reduction across all tested models and time horizons, highlighting the potential of data-centric approaches in AutoML for time series forecasting.

Title: DocVCE: Diffusion-based Visual Counterfactual Explanations for Document Image Classification

Authors: Saifullah Saifullah, Stefan Agne, Andreas Dengel, Sheraz Ahmed
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04233
Pdf URL: https://arxiv.org/pdf/2508.04233
Copy Paste: [[2508.04233]] DocVCE: Diffusion-based Visual Counterfactual Explanations for Document Image Classification(https://arxiv.org/abs/2508.04233)
Keywords: diffusion, generative
Abstract: As black-box AI-driven decision-making systems become increasingly widespread in modern document processing workflows, improving their transparency and reliability has become critical, especially in high-stakes applications where biases or spurious correlations in decision-making could lead to serious consequences. One vital component often found in such document processing workflows is document image classification, which, despite its widespread use, remains difficult to explain. While some recent works have attempted to explain the decisions of document image classification models through feature-importance maps, these maps are often difficult to interpret and fail to provide insights into the global features learned by the model. In this paper, we aim to bridge this research gap by introducing generative document counterfactuals that provide meaningful insights into the model's decision-making through actionable explanations. In particular, we propose DocVCE, a novel approach that leverages latent diffusion models in combination with classifier guidance to first generate plausible in-distribution visual counterfactual explanations, and then performs hierarchical patch-wise refinement to search for a refined counterfactual that is closest to the target factual image. We demonstrate the effectiveness of our approach through a rigorous qualitative and quantitative assessment on 3 different document classification datasets -- RVL-CDIP, Tobacco3482, and DocLayNet -- and 3 different models -- ResNet, ConvNeXt, and DiT -- using well-established evaluation criteria such as validity, closeness, and realism. To the best of the authors' knowledge, this is the first work to explore generative counterfactual explanations in document image analysis.

Title: PIS3R: Very Large Parallax Image Stitching via Deep 3D Reconstruction

Authors: Muhua Zhu, Xinhao Jin, Chengbo Wang, Yongcong Zhang, Yifei Xue, Tie Ji, Yizhen Lao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04236
Pdf URL: https://arxiv.org/pdf/2508.04236
Copy Paste: [[2508.04236]] PIS3R: Very Large Parallax Image Stitching via Deep 3D Reconstruction(https://arxiv.org/abs/2508.04236)
Keywords: robust, diffusion, transformer
Abstract: Image stitching aim to align two images taken from different viewpoints into one seamless, wider image. However, when the 3D scene contains depth variations and the camera baseline is significant, noticeable parallax occurs-meaning the relative positions of scene elements differ substantially between views. Most existing stitching methods struggle to handle such images with large parallax effectively. To address this challenge, in this paper, we propose an image stitching solution called PIS3R that is robust to very large parallax based on the novel concept of deep 3D reconstruction. First, we apply visual geometry grounded transformer to two input images with very large parallax to obtain both intrinsic and extrinsic parameters, as well as the dense 3D scene reconstruction. Subsequently, we reproject reconstructed dense point cloud onto a designated reference view using the recovered camera parameters, achieving pixel-wise alignment and generating an initial stitched image. Finally, to further address potential artifacts such as holes or noise in the initial stitching, we propose a point-conditioned image diffusion module to obtain the refined this http URL with existing methods, our solution is very large parallax tolerant and also provides results that fully preserve the geometric integrity of all pixels in the 3D photogrammetric context, enabling direct applicability to downstream 3D vision tasks such as SfM. Experimental results demonstrate that the proposed algorithm provides accurate stitching results for images with very large parallax, and outperforms the existing methods qualitatively and quantitatively.

Title: DP-GPT4MTS: Dual-Prompt Large Language Model for Textual-Numerical Time Series Forecasting

Authors: Chanjuan Liu (1), Shengzhi Wang (2), Enqiang Zhu (2) ((1) School of Computer Science and Technology, Dalian University of Technology, Dalian, China,(2) Institute of Computing Technology, Guangzhou University, Guangzhou, China)
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.04239
Pdf URL: https://arxiv.org/pdf/2508.04239
Copy Paste: [[2508.04239]] DP-GPT4MTS: Dual-Prompt Large Language Model for Textual-Numerical Time Series Forecasting(https://arxiv.org/abs/2508.04239)
Keywords: large language model
Abstract: Time series forecasting is crucial in strategic planning and decision-making across various industries. Traditional forecasting models mainly concentrate on numerical time series data, often overlooking important textual information such as events and news, which can significantly affect forecasting accuracy. While large language models offer a promise for integrating multimodal data, existing single-prompt frameworks struggle to effectively capture the semantics of timestamped text, introducing redundant information that can hinder model performance. To address this limitation, we introduce DP-GPT4MTS (Dual-Prompt GPT2-base for Multimodal Time Series), a novel dual-prompt large language model framework that combines two complementary prompts: an explicit prompt for clear task instructions and a textual prompt for context-aware embeddings from time-stamped data. The tokenizer generates the explicit prompt while the embeddings from the textual prompt are refined through self-attention and feed-forward networks. Comprehensive experiments conducted on diverse textural-numerical time series datasets demonstrate that this approach outperforms state-of-the-art algorithms in time series forecasting. This highlights the significance of incorporating textual context via a dual-prompt mechanism to achieve more accurate time series predictions.

Title: TalkDep: Clinically Grounded LLM Personas for Conversation-Centric Depression Screening

Authors: Xi Wang, Anxo Perez, Javier Parapar, Fabio Crestani
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.04248
Pdf URL: https://arxiv.org/pdf/2508.04248
Copy Paste: [[2508.04248]] TalkDep: Clinically Grounded LLM Personas for Conversation-Centric Depression Screening(https://arxiv.org/abs/2508.04248)
Keywords: robust
Abstract: The increasing demand for mental health services has outpaced the availability of real training data to develop clinical professionals, leading to limited support for the diagnosis of depression. This shortage has motivated the development of simulated or virtual patients to assist in training and evaluation, but existing approaches often fail to generate clinically valid, natural, and diverse symptom presentations. In this work, we embrace the recent advanced language models as the backbone and propose a novel clinician-in-the-loop patient simulation pipeline, TalkDep, with access to diversified patient profiles to develop simulated patients. By conditioning the model on psychiatric diagnostic criteria, symptom severity scales, and contextual factors, our goal is to create authentic patient responses that can better support diagnostic model training and evaluation. We verify the reliability of these simulated patients with thorough assessments conducted by clinical professionals. The availability of validated simulated patients offers a scalable and adaptable resource for improving the robustness and generalisability of automatic depression diagnosis systems.

Title: T3Time: Tri-Modal Time Series Forecasting via Adaptive Multi-Head Alignment and Residual Fusion

Authors: Abdul Monaf Chowdhury, Rabeya Akter, Safaeid Hossain Arib
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.04251
Pdf URL: https://arxiv.org/pdf/2508.04251
Copy Paste: [[2508.04251]] T3Time: Tri-Modal Time Series Forecasting via Adaptive Multi-Head Alignment and Residual Fusion(https://arxiv.org/abs/2508.04251)
Keywords: transformer, large language model
Abstract: Multivariate time series forecasting (MTSF) seeks to model temporal dynamics among variables to predict future trends. Transformer-based models and large language models (LLMs) have shown promise due to their ability to capture long-range dependencies and patterns. However, current methods often rely on rigid inductive biases, ignore intervariable interactions, or apply static fusion strategies that limit adaptability across forecast horizons. These limitations create bottlenecks in capturing nuanced, horizon-specific relationships in time-series data. To solve this problem, we propose T3Time, a novel trimodal framework consisting of time, spectral, and prompt branches, where the dedicated frequency encoding branch captures the periodic structures along with a gating mechanism that learns prioritization between temporal and spectral features based on the prediction horizon. We also proposed a mechanism which adaptively aggregates multiple cross-modal alignment heads by dynamically weighting the importance of each head based on the features. Extensive experiments on benchmark datasets demonstrate that our model consistently outperforms state-of-the-art baselines, achieving an average reduction of 3.28% in MSE and 2.29% in MAE. Furthermore, it shows strong generalization in few-shot learning settings: with 5% training data, we see a reduction in MSE and MAE by 4.13% and 1.91%, respectively; and with 10% data, by 3.62% and 1.98% on average. Code - this https URL

Title: KVSink: Understanding and Enhancing the Preservation of Attention Sinks in KV Cache Quantization for LLMs

Authors: Zunhai Su, Kehong Yuan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.04257
Pdf URL: https://arxiv.org/pdf/2508.04257
Copy Paste: [[2508.04257]] KVSink: Understanding and Enhancing the Preservation of Attention Sinks in KV Cache Quantization for LLMs(https://arxiv.org/abs/2508.04257)
Keywords: protect, large language model
Abstract: Key-Value (KV) cache quantization has become a widely adopted optimization technique for efficient large language models (LLMs) inference by reducing KV cache memory usage and mitigating memory-bound constraints. Recent studies have emphasized the importance of preserving the original precision of KVs for the first few tokens to ensure the protection of attention sinks. While this approach has proven effective in mitigating performance degradation, its underlying principles remain insufficiently understood. Moreover, it fails to address the recent discovery that attention sinks can emerge beyond the initial token positions. In this work, we elucidate the underlying mechanisms of attention sinks during inference by examining their role in the cross-layer evolution of extreme activation outliers. Additionally, we provide a comprehensive analysis of the interplay between attention sinks and KV cache quantization. Based on our enhanced understanding, we introduce \textit{\textbf{KVSink}}, a plug-and-play method that effectively predicts sink tokens with negligible overhead, enabling more thorough preservation. Extensive experiments demonstrate that KVSink outperforms the existing Preserve-First-N (PFN) strategy, offering more effective preservation of attention sinks during KV cache quantization. Moreover, when applied to the well-established KVQuant method, KVSink further improves perplexity (PPL) and reduces reliance on 16-bit numerical outliers.

Title: Segment Any Vehicle: Semantic and Visual Context Driven SAM and A Benchmark

Authors: Xiao Wang, Ziwen Wang, Wentao Wu, Anjie Wang, Jiashu Wu, Yantao Pan, Chenglong Li
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.04260
Pdf URL: https://arxiv.org/pdf/2508.04260
Copy Paste: [[2508.04260]] Segment Any Vehicle: Semantic and Visual Context Driven SAM and A Benchmark(https://arxiv.org/abs/2508.04260)
Keywords: segmentation
Abstract: With the rapid advancement of autonomous driving, vehicle perception, particularly detection and segmentation, has placed increasingly higher demands on algorithmic performance. Pre-trained large segmentation models, especially Segment Anything Model (SAM), have sparked significant interest and inspired new research directions in artificial intelligence. However, SAM cannot be directly applied to the fine-grained task of vehicle part segmentation, as its text-prompted segmentation functionality is not publicly accessible, and the mask regions generated by its default mode lack semantic labels, limiting its utility in structured, category-specific segmentation tasks. To address these limitations, we propose SAV, a novel framework comprising three core components: a SAM-based encoder-decoder, a vehicle part knowledge graph, and a context sample retrieval encoding module. The knowledge graph explicitly models the spatial and geometric relationships among vehicle parts through a structured ontology, effectively encoding prior structural knowledge. Meanwhile, the context retrieval module enhances segmentation by identifying and leveraging visually similar vehicle instances from training data, providing rich contextual priors for improved generalization. Furthermore, we introduce a new large-scale benchmark dataset for vehicle part segmentation, named VehicleSeg10K, which contains 11,665 high-quality pixel-level annotations across diverse scenes and viewpoints. We conduct comprehensive experiments on this dataset and two other datasets, benchmarking multiple representative baselines to establish a solid foundation for future research and comparison. % Both the dataset and source code of this paper will be released upon acceptance. Both the dataset and source code of this paper will be released on this https URL

Title: Revisiting Continual Semantic Segmentation with Pre-trained Vision Models

Authors: Duzhen Zhang, Yong Ren, Wei Cong, Junhao Zheng, Qiaoyi Su, Shuncheng Jia, Zhong-Zhi Li, Xuanle Zhao, Ye Bai, Feilong Chen, Qi Tian, Tielin Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04267
Pdf URL: https://arxiv.org/pdf/2508.04267
Copy Paste: [[2508.04267]] Revisiting Continual Semantic Segmentation with Pre-trained Vision Models(https://arxiv.org/abs/2508.04267)
Keywords: segmentation
Abstract: Continual Semantic Segmentation (CSS) seeks to incrementally learn to segment novel classes while preserving knowledge of previously encountered ones. Recent advancements in CSS have been largely driven by the adoption of Pre-trained Vision Models (PVMs) as backbones. Among existing strategies, Direct Fine-Tuning (DFT), which sequentially fine-tunes the model across classes, remains the most straightforward approach. Prior work often regards DFT as a performance lower bound due to its presumed vulnerability to severe catastrophic forgetting, leading to the development of numerous complex mitigation techniques. However, we contend that this prevailing assumption is flawed. In this paper, we systematically revisit forgetting in DFT across two standard benchmarks, Pascal VOC 2012 and ADE20K, under eight CSS settings using two representative PVM backbones: ResNet101 and Swin-B. Through a detailed probing analysis, our findings reveal that existing methods significantly underestimate the inherent anti-forgetting capabilities of PVMs. Even under DFT, PVMs retain previously learned knowledge with minimal forgetting. Further investigation of the feature space indicates that the observed forgetting primarily arises from the classifier's drift away from the PVM, rather than from degradation of the backbone representations. Based on this insight, we propose DFT*, a simple yet effective enhancement to DFT that incorporates strategies such as freezing the PVM backbone and previously learned classifiers, as well as pre-allocating future classifiers. Extensive experiments show that DFT* consistently achieves competitive or superior performance compared to sixteen state-of-the-art CSS methods, while requiring substantially fewer trainable parameters and less training time.

Title: A Few Words Can Distort Graphs: Knowledge Poisoning Attacks on Graph-based Retrieval-Augmented Generation of Large Language Models

Authors: Jiayi Wen, Tianxin Chen, Zhirun Zheng, Cheng Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.04276
Pdf URL: https://arxiv.org/pdf/2508.04276
Copy Paste: [[2508.04276]] A Few Words Can Distort Graphs: Knowledge Poisoning Attacks on Graph-based Retrieval-Augmented Generation of Large Language Models(https://arxiv.org/abs/2508.04276)
Keywords: defense, attack, explainability, large language model
Abstract: Graph-based Retrieval-Augmented Generation (GraphRAG) has recently emerged as a promising paradigm for enhancing large language models (LLMs) by converting raw text into structured knowledge graphs, improving both accuracy and explainability. However, GraphRAG relies on LLMs to extract knowledge from raw text during graph construction, and this process can be maliciously manipulated to implant misleading information. Targeting this attack surface, we propose two knowledge poisoning attacks (KPAs) and demonstrate that modifying only a few words in the source text can significantly change the constructed graph, poison the GraphRAG, and severely mislead downstream reasoning. The first attack, named Targeted KPA (TKPA), utilizes graph-theoretic analysis to locate vulnerable nodes in the generated graphs and rewrites the corresponding narratives with LLMs, achieving precise control over specific question-answering (QA) outcomes with a success rate of 93.1\%, while keeping the poisoned text fluent and natural. The second attack, named Universal KPA (UKPA), exploits linguistic cues such as pronouns and dependency relations to disrupt the structural integrity of the generated graph by altering globally influential words. With fewer than 0.05\% of full text modified, the QA accuracy collapses from 95\% to 50\%. Furthermore, experiments show that state-of-the-art defense methods fail to detect these attacks, highlighting that securing GraphRAG pipelines against knowledge poisoning remains largely unexplored.

Title: Mockingbird: How does LLM perform in general machine learning tasks?

Authors: Haoyu Jia, Yoshiki Obinata, Kento Kawaharazuka, Kei Okada
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.04279
Pdf URL: https://arxiv.org/pdf/2508.04279
Copy Paste: [[2508.04279]] Mockingbird: How does LLM perform in general machine learning tasks?(https://arxiv.org/abs/2508.04279)
Keywords: large language model
Abstract: Large language models (LLMs) are now being used with increasing frequency as chat bots, tasked with the summarizing information or generating text and code in accordance with user instructions. The rapid increase in reasoning capabilities and inference speed of LLMs has revealed their remarkable potential for applications extending beyond the domain of chat bots to general machine learning tasks. This work is conducted out of the curiosity about such potential. In this work, we propose a framework Mockingbird to adapt LLMs to general machine learning tasks and evaluate its performance and scalability on several general machine learning tasks. The core concept of this framework is instructing LLMs to role-play functions and reflect on its mistakes to improve itself. Our evaluation and analysis result shows that LLM-driven machine learning methods, such as Mockingbird, can achieve acceptable results on common machine learning tasks; however, solely reflecting on its own currently cannot outperform the effect of domain-specific documents and feedback from human experts.

Title: Per-element Secure Aggregation against Data Reconstruction Attacks in Federated Learning

Authors: Takumi Suimon, Yuki Koizumi, Junji Takemasa, Toru Hasegawa
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2508.04285
Pdf URL: https://arxiv.org/pdf/2508.04285
Copy Paste: [[2508.04285]] Per-element Secure Aggregation against Data Reconstruction Attacks in Federated Learning(https://arxiv.org/abs/2508.04285)
Keywords: secure, defense, attack, robust, federate
Abstract: Federated learning (FL) enables collaborative model training without sharing raw data, but individual model updates may still leak sensitive information. Secure aggregation (SecAgg) mitigates this risk by allowing the server to access only the sum of client updates, thereby concealing individual contributions. However, a significant vulnerability has recently attracted increasing attention: when model updates are sparse vectors, a non-zero value contributed by a single client at a given index can be directly revealed in the aggregate, enabling precise data reconstruction attacks. In this paper, we propose a novel enhancement to SecAgg that reveals aggregated values only at indices with at least $t$ non-zero contributions. Our mechanism introduces a per-element masking strategy to prevent the exposure of under-contributed elements, while maintaining modularity and compatibility with many existing SecAgg implementations by relying solely on cryptographic primitives already employed in a typical setup. We integrate this mechanism into Flamingo, a low-round SecAgg protocol, to provide a robust defense against such attacks. Our analysis and experimental results indicate that the additional computational and communication overhead introduced by our mechanism remains within an acceptable range, supporting the practicality of our approach.

Title: PKSS-Align: Robust Point Cloud Registration on Pre-Kendall Shape Space

Authors: Chenlei Lv, Hui Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04286
Pdf URL: https://arxiv.org/pdf/2508.04286
Copy Paste: [[2508.04286]] PKSS-Align: Robust Point Cloud Registration on Pre-Kendall Shape Space(https://arxiv.org/abs/2508.04286)
Keywords: robust
Abstract: Point cloud registration is a classical topic in the field of 3D Vision and Computer Graphics. Generally, the implementation of registration is typically sensitive to similarity transformations (translation, scaling, and rotation), noisy points, and incomplete geometric structures. Especially, the non-uniform scales and defective parts of point clouds increase probability of struck local optima in registration task. In this paper, we propose a robust point cloud registration PKSS-Align that can handle various influences, including similarity transformations, non-uniform densities, random noisy points, and defective parts. The proposed method measures shape feature-based similarity between point clouds on the Pre-Kendall shape space (PKSS), \textcolor{black}{which is a shape measurement-based scheme and doesn't require point-to-point or point-to-plane metric.} The employed measurement can be regarded as the manifold metric that is robust to various representations in the Euclidean coordinate system. Benefited from the measurement, the transformation matrix can be directly generated for point clouds with mentioned influences at the same time. The proposed method does not require data training and complex feature encoding. Based on a simple parallel acceleration, it can achieve significant improvement for efficiency and feasibility in practice. Experiments demonstrate that our method outperforms the relevant state-of-the-art methods.

Title: Length Matters: Length-Aware Transformer for Temporal Sentence Grounding

Authors: Yifan Wang, Ziyi Liu, Xiaolong Sun, Jiawei Wang, Hongmin Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04299
Pdf URL: https://arxiv.org/pdf/2508.04299
Copy Paste: [[2508.04299]] Length Matters: Length-Aware Transformer for Temporal Sentence Grounding(https://arxiv.org/abs/2508.04299)
Keywords: transformer
Abstract: Temporal sentence grounding (TSG) is a highly challenging task aiming to localize the temporal segment within an untrimmed video corresponding to a given natural language description. Benefiting from the design of learnable queries, the DETR-based models have achieved substantial advancements in the TSG task. However, the absence of explicit supervision often causes the learned queries to overlap in roles, leading to redundant predictions. Therefore, we propose to improve TSG by making each query fulfill its designated role, leveraging the length priors of the video-description pairs. In this paper, we introduce the Length-Aware Transformer (LATR) for TSG, which assigns different queries to handle predictions based on varying temporal lengths. Specifically, we divide all queries into three groups, responsible for segments with short, middle, and long temporal durations, respectively. During training, an additional length classification task is introduced. Predictions from queries with mismatched lengths are suppressed, guiding each query to specialize in its designated function. Extensive experiments demonstrate the effectiveness of our LATR, achieving state-of-the-art performance on three public benchmarks. Furthermore, the ablation studies validate the contribution of each component of our method and the critical role of incorporating length priors into the TSG task.

Title: A Foundation Model for DAS Signal Recognition and Visual Prompt Tuning of the Pre-trained Model for Downstream Tasks

Authors: Kun Gui, Hongliang Ren, Shang Shi, Jin Lu, Changqiu Yu, Quanjun Cao, Guomin Gu, Qi Xuan
Subjects: cs.CV, eess.SP
Abstract URL: https://arxiv.org/abs/2508.04316
Pdf URL: https://arxiv.org/pdf/2508.04316
Copy Paste: [[2508.04316]] A Foundation Model for DAS Signal Recognition and Visual Prompt Tuning of the Pre-trained Model for Downstream Tasks(https://arxiv.org/abs/2508.04316)
Keywords: security, robust, transformer
Abstract: Distributed Acoustic Sensing (DAS) technology finds growing applications across various domains. However, data distribution disparities due to heterogeneous sensing environments pose challenges for data-driven artificial intelligence (AI) models, limiting cross-domain generalization and facing a shortage of labeled training data. To address these issues, this study proposes a foundational model for DAS signal recognition based on a Masked Autoencoder, named MAEPD. The MAEPD model is pretrained on a dataset of 635,860 samples, encompassing DAS gait spatiotemporal signals, 2D GASF images for perimeter security, 2D time-frequency images for pipeline leakage, and open-dataset signals including whale vocalizations and seismic activities, using a self-supervised mask reconstruction task to capture deep semantic features of DAS signals. Visual Prompt Tuning (VPT) is employed for downstream recognition tasks. This method freezes the pretrained backbone parameters and fine-tunes only a small set of learnable visual prompt vectors inserted into the Transformer encoder layers. Experiments on the NVIDIA GeForce RTX 4080 Super platform validate MAEPD using indoor gait recognition as a downstream task. The VPT-Deep approach achieves a classification accuracy of 96.94% with just 0.322% of parameters fine-tuned, surpassing the traditional Full Fine Tuning (FFT) method by 0.61% and reducing training time by 45%. The model also exhibits robust performance in pipeline leakage detection, confirming the generality, efficiency, and scalability of MAEPD as a foundational model. This approach offers a novel paradigm for addressing the limited generalization of signal recognition models in the DAS domain.

Title: TempFlow-GRPO: When Timing Matters for GRPO in Flow Models

Authors: Xiaoxuan He, Siming Fu, Yuke Zhao, Wanli Li, Jian Yang, Dacheng Yin, Fengyun Rao, Bo Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04324
Pdf URL: https://arxiv.org/pdf/2508.04324
Copy Paste: [[2508.04324]] TempFlow-GRPO: When Timing Matters for GRPO in Flow Models(https://arxiv.org/abs/2508.04324)
Keywords: generative
Abstract: Recent flow matching models for text-to-image generation have achieved remarkable quality, yet their integration with reinforcement learning for human preference alignment remains suboptimal, hindering fine-grained reward-based optimization. We observe that the key impediment to effective GRPO training of flow models is the temporal uniformity assumption in existing approaches: sparse terminal rewards with uniform credit assignment fail to capture the varying criticality of decisions across generation timesteps, resulting in inefficient exploration and suboptimal convergence. To remedy this shortcoming, we introduce \textbf{TempFlow-GRPO} (Temporal Flow GRPO), a principled GRPO framework that captures and exploits the temporal structure inherent in flow-based generation. TempFlow-GRPO introduces two key innovations: (i) a trajectory branching mechanism that provides process rewards by concentrating stochasticity at designated branching points, enabling precise credit assignment without requiring specialized intermediate reward models; and (ii) a noise-aware weighting scheme that modulates policy optimization according to the intrinsic exploration potential of each timestep, prioritizing learning during high-impact early stages while ensuring stable refinement in later phases. These innovations endow the model with temporally-aware optimization that respects the underlying generative dynamics, leading to state-of-the-art performance in human preference alignment and standard text-to-image benchmarks.

Title: Beyond the Leaderboard: Rethinking Medical Benchmarks for Large Language Models

Authors: Zizhan Ma, Wenxuan Wang, Guo Yu, Yiu-Fai Cheung, Meidan Ding, Jie Liu, Wenting Chen, Linlin Shen
Subjects: cs.CL, cs.AI, cs.CV, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2508.04325
Pdf URL: https://arxiv.org/pdf/2508.04325
Copy Paste: [[2508.04325]] Beyond the Leaderboard: Rethinking Medical Benchmarks for Large Language Models(https://arxiv.org/abs/2508.04325)
Keywords: robust, large language model
Abstract: Large language models (LLMs) show significant potential in healthcare, prompting numerous benchmarks to evaluate their capabilities. However, concerns persist regarding the reliability of these benchmarks, which often lack clinical fidelity, robust data management, and safety-oriented evaluation metrics. To address these shortcomings, we introduce MedCheck, the first lifecycle-oriented assessment framework specifically designed for medical benchmarks. Our framework deconstructs a benchmark's development into five continuous stages, from design to governance, and provides a comprehensive checklist of 46 medically-tailored criteria. Using MedCheck, we conducted an in-depth empirical evaluation of 53 medical LLM benchmarks. Our analysis uncovers widespread, systemic issues, including a profound disconnect from clinical practice, a crisis of data integrity due to unmitigated contamination risks, and a systematic neglect of safety-critical evaluation dimensions like model robustness and uncertainty awareness. Based on these findings, MedCheck serves as both a diagnostic tool for existing benchmarks and an actionable guideline to foster a more standardized, reliable, and transparent approach to evaluating AI in healthcare.

Title: Forgetting: A New Mechanism Towards Better Large Language Model Fine-tuning

Authors: Ali Taheri Ghahrizjani, Alireza Taban, Qizhou Wang, Shanshan Ye, Abdolreza Mirzaei, Tongliang Liu, Bo Han
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.04329
Pdf URL: https://arxiv.org/pdf/2508.04329
Copy Paste: [[2508.04329]] Forgetting: A New Mechanism Towards Better Large Language Model Fine-tuning(https://arxiv.org/abs/2508.04329)
Keywords: large language model
Abstract: Supervised fine-tuning (SFT) plays a critical role for pretrained large language models (LLMs), notably enhancing their capacity to acquire domain-specific knowledge while preserving or potentially augmenting their general-purpose capabilities. However, the efficacy of SFT hinges on data quality as well as data volume, otherwise it may result in limited performance gains or even degradation relative to the associated baselines. To mitigate such reliance, we suggest categorizing tokens within each corpus into two parts -- positive and negative tokens -- based on whether they are useful to improve model performance. Positive tokens can be trained in common ways, whereas negative tokens, which may lack essential semantics or be misleading, should be explicitly forgotten. Overall, the token categorization facilitate the model to learn less informative message, and the forgetting process shapes a knowledge boundary to guide the model on what information to learn more precisely. We conduct experiments on well-established benchmarks, finding that this forgetting mechanism not only improves overall model performance and also facilitate more diverse model responses.

Title: Modelling and Classifying the Components of a Literature Review

Authors: Francisco Bolaños, Angelo Salatino, Francesco Osborne, Enrico Motta
Subjects: cs.CL, cs.AI, cs.HC, cs.IR
Abstract URL: https://arxiv.org/abs/2508.04337
Pdf URL: https://arxiv.org/pdf/2508.04337
Copy Paste: [[2508.04337]] Modelling and Classifying the Components of a Literature Review(https://arxiv.org/abs/2508.04337)
Keywords: robust, large language model
Abstract: Previous work has demonstrated that AI methods for analysing scientific literature benefit significantly from annotating sentences in papers according to their rhetorical roles, such as research gaps, results, limitations, extensions of existing methodologies, and others. Such representations also have the potential to support the development of a new generation of systems capable of producing high-quality literature reviews. However, achieving this goal requires the definition of a relevant annotation schema and effective strategies for large-scale annotation of the literature. This paper addresses these challenges by 1) introducing a novel annotation schema specifically designed to support literature review generation and 2) conducting a comprehensive evaluation of a wide range of state-of-the-art large language models (LLMs) in classifying rhetorical roles according to this schema. To this end, we also present Sci-Sentence, a novel multidisciplinary benchmark comprising 700 sentences manually annotated by domain experts and 2,240 sentences automatically labelled using LLMs. We evaluate 37 LLMs on this benchmark, spanning diverse model families and sizes, using both zero-shot learning and fine-tuning approaches. The experiments yield several novel insights that advance the state of the art in this challenging domain. First, the current generation of LLMs performs remarkably well on this task when fine-tuned on high-quality data, achieving performance levels above 96\% F1. Second, while large proprietary models like GPT-4o achieve the best results, some lightweight open-source alternatives also demonstrate excellent performance. Finally, enriching the training data with semi-synthetic examples generated by LLMs proves beneficial, enabling small encoders to achieve robust results and significantly enhancing the performance of several open decoder models.

Title: From Split to Share: Private Inference with Distributed Feature Sharing

Authors: Zihan Liu, Jiayi Wen, Shouhong Tan, Zhirun Zheng, Cheng Huang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.04346
Pdf URL: https://arxiv.org/pdf/2508.04346
Copy Paste: [[2508.04346]] From Split to Share: Private Inference with Distributed Feature Sharing(https://arxiv.org/abs/2508.04346)
Keywords: secure, privacy, protect, attack, robust, diffusion
Abstract: Cloud-based Machine Learning as a Service (MLaaS) raises serious privacy concerns when handling sensitive client data. Existing Private Inference (PI) methods face a fundamental trade-off between privacy and efficiency: cryptographic approaches offer strong protection but incur high computational overhead, while efficient alternatives such as split inference expose intermediate features to inversion attacks. We propose PrivDFS, a new paradigm for private inference that replaces a single exposed representation with distributed feature sharing. PrivDFS partitions input features on the client into multiple balanced shares, which are distributed to non-colluding, non-communicating servers for independent partial inference. The client securely aggregates the servers' outputs to reconstruct the final prediction, ensuring that no single server observes sufficient information to compromise input privacy. To further strengthen privacy, we propose two key extensions: PrivDFS-AT, which uses adversarial training with a diffusion-based proxy attacker to enforce inversion-resistant feature partitioning, and PrivDFS-KD, which leverages user-specific keys to diversify partitioning policies and prevent query-based inversion generalization. Experiments on CIFAR-10 and CelebA demonstrate that PrivDFS achieves privacy comparable to deep split inference while cutting client computation by up to 100 times with no accuracy loss, and that the extensions remain robust against both diffusion-based in-distribution and adaptive attacks.

Title: GTPO and GRPO-S: Token and Sequence-Level Reward Shaping with Policy Entropy

Authors: Hongze Tan, Jianfei Pan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.04349
Pdf URL: https://arxiv.org/pdf/2508.04349
Copy Paste: [[2508.04349]] GTPO and GRPO-S: Token and Sequence-Level Reward Shaping with Policy Entropy(https://arxiv.org/abs/2508.04349)
Keywords: large language model
Abstract: Reinforcement learning (RL) with algorithms like Group Relative Policy Optimization (GRPO) improves Large Language Model (LLM) reasoning, but is limited by a coarse-grained credit assignment that applies a uniform reward to all tokens in a sequence. This is a major flaw in long-chain reasoning tasks. This paper solves this with \textbf{Dynamic Entropy Weighting}. Our core idea is that high-entropy tokens in correct responses can guide the policy toward a higher performance ceiling. This allows us to create more fine-grained reward signals for precise policy updates via two ways: 1) \textbf{Group Token Policy Optimization} (\textbf{GTPO}), we assigns a entropy-weighted reward to each token for fine-grained credit assignment. 2) \textbf{Sequence-Level Group Relative Policy Optimization} (\textbf{GRPO-S}), we assigns a entropy-weighted reward to each sequence based on its average token entropy. Experiments show our methods significantly outperform the strong DAPO baseline. The results confirm that our entropy-weighting mechanism is the key driver of this performance boost, offering a better path to enhance deep reasoning in models.

Title: Chain of Questions: Guiding Multimodal Curiosity in Language Models

Authors: Nima Iji, Kia Dashtipour
Subjects: cs.CL, cs.AI, cs.CV, cs.LG, cs.MA
Abstract URL: https://arxiv.org/abs/2508.04350
Pdf URL: https://arxiv.org/pdf/2508.04350
Copy Paste: [[2508.04350]] Chain of Questions: Guiding Multimodal Curiosity in Language Models(https://arxiv.org/abs/2508.04350)
Keywords: interpretability, large language model
Abstract: Reasoning capabilities in large language models (LLMs) have substantially advanced through methods such as chain-of-thought and explicit step-by-step explanations. However, these improvements have not yet fully transitioned to multimodal contexts, where models must proactively decide which sensory modalities such as vision, audio, or spatial perception to engage when interacting with complex real-world environments. In this paper, we introduce the Chain of Questions (CoQ) framework, a curiosity-driven reasoning approach that encourages multimodal language models to dynamically generate targeted questions regarding their surroundings. These generated questions guide the model to selectively activate relevant modalities, thereby gathering critical information necessary for accurate reasoning and response generation. We evaluate our framework on a novel multimodal benchmark dataset, assembled by integrating WebGPT, ScienceQA, AVSD, and ScanQA datasets. Experimental results demonstrate that our CoQ method improves a foundation model's ability to effectively identify and integrate pertinent sensory information. This leads to improved accuracy, interpretability, and alignment of the reasoning process with diverse multimodal tasks.

Title: Multi-Marginal Stochastic Flow Matching for High-Dimensional Snapshot Data at Irregular Time Points

Authors: Justin Lee, Behnaz Moradijamei, Heman Shakeri
Subjects: cs.LG, cs.NE
Abstract URL: https://arxiv.org/abs/2508.04351
Pdf URL: https://arxiv.org/pdf/2508.04351
Copy Paste: [[2508.04351]] Multi-Marginal Stochastic Flow Matching for High-Dimensional Snapshot Data at Irregular Time Points(https://arxiv.org/abs/2508.04351)
Keywords: robust
Abstract: Modeling the evolution of high-dimensional systems from limited snapshot observations at irregular time points poses a significant challenge in quantitative biology and related fields. Traditional approaches often rely on dimensionality reduction techniques, which can oversimplify the dynamics and fail to capture critical transient behaviors in non-equilibrium systems. We present Multi-Marginal Stochastic Flow Matching (MMSFM), a novel extension of simulation-free score and flow matching methods to the multi-marginal setting, enabling the alignment of high-dimensional data measured at non-equidistant time points without reducing dimensionality. The use of measure-valued splines enhances robustness to irregular snapshot timing, and score matching prevents overfitting in high-dimensional spaces. We validate our framework on several synthetic and benchmark datasets, including gene expression data collected at uneven time points and an image progression task, demonstrating the method's versatility.

Title: TSPO: Temporal Sampling Policy Optimization for Long-form Video Language Understanding

Authors: Canhui Tang, Zifan Han, Hongbo Sun, Sanping Zhou, Xuchong Zhang, Xin Wei, Ye Yuan, Jinglin Xu, Hao Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04369
Pdf URL: https://arxiv.org/pdf/2508.04369
Copy Paste: [[2508.04369]] TSPO: Temporal Sampling Policy Optimization for Long-form Video Language Understanding(https://arxiv.org/abs/2508.04369)
Keywords: large language model
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated significant progress in vision-language tasks, yet they still face challenges when processing long-duration video inputs. The limitation arises from MLLMs' context limit and training costs, necessitating sparse frame sampling before feeding videos into MLLMs. Existing video MLLMs adopt training-free uniform sampling or keyframe search, which may miss critical events or be constrained by the pre-trained models' event understanding capabilities. Meanwhile, building a training-based method remains challenging due to the unsupervised and non-differentiable nature of sparse frame sampling. To address these problems, we propose Temporal Sampling Policy Optimization (TSPO), advancing MLLMs' long-form video-language understanding via reinforcement learning. Specifically, we first propose a trainable event-aware temporal agent, which captures event-query correlation for performing probabilistic keyframe selection. Then, we propose the TSPO reinforcement learning paradigm, which models keyframe selection and language generation as a joint decision-making process, enabling end-to-end group relative optimization with efficient rule-based rewards. Furthermore, for the TSPO's training, we propose a long video training data construction pipeline with comprehensive temporal data and video Needle-in-a-Haystack data. Finally, we incorporate rule-based answering accuracy and temporal locating reward mechanisms to optimize the temporal sampling policy. Comprehensive experiments show that our TSPO achieves state-of-the-art performance across multiple long video understanding benchmarks, and shows transferable ability across different cutting-edge Video-MLLMs.

Title: ProtoN: Prototype Node Graph Neural Network for Unconstrained Multi-Impression Ear Recognition

Authors: Santhoshkumar Peddi, Sadhvik Bathini, Arun Balasubramanian, Monalisa Sarma, Debasis Samanta
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.04381
Pdf URL: https://arxiv.org/pdf/2508.04381
Copy Paste: [[2508.04381]] ProtoN: Prototype Node Graph Neural Network for Unconstrained Multi-Impression Ear Recognition(https://arxiv.org/abs/2508.04381)
Keywords: biometric
Abstract: Ear biometrics offer a stable and contactless modality for identity recognition, yet their effectiveness remains limited by the scarcity of annotated data and significant intra-class variability. Existing methods typically extract identity features from individual impressions in isolation, restricting their ability to capture consistent and discriminative representations. To overcome these limitations, a few-shot learning framework, ProtoN, is proposed to jointly process multiple impressions of an identity using a graph-based approach. Each impression is represented as a node in a class-specific graph, alongside a learnable prototype node that encodes identity-level information. This graph is processed by a Prototype Graph Neural Network (PGNN) layer, specifically designed to refine both impression and prototype representations through a dual-path message-passing mechanism. To further enhance discriminative power, the PGNN incorporates a cross-graph prototype alignment strategy that improves class separability by enforcing intra-class compactness while maintaining inter-class distinction. Additionally, a hybrid loss function is employed to balance episodic and global classification objectives, thereby improving the overall structure of the embedding space. Extensive experiments on five benchmark ear datasets demonstrate that ProtoN achieves state-of-the-art performance, with Rank-1 identification accuracy of up to 99.60% and an Equal Error Rate (EER) as low as 0.025, showing the effectiveness for few-shot ear recognition under limited data conditions.

Title: Improving Crash Data Quality with Large Language Models: Evidence from Secondary Crash Narratives in Kentucky

Authors: Xu Zhang, Mei Chen
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2508.04399
Pdf URL: https://arxiv.org/pdf/2508.04399
Copy Paste: [[2508.04399]] Improving Crash Data Quality with Large Language Models: Evidence from Secondary Crash Narratives in Kentucky(https://arxiv.org/abs/2508.04399)
Keywords: privacy, transformer, large language model
Abstract: This study evaluates advanced natural language processing (NLP) techniques to enhance crash data quality by mining crash narratives, using secondary crash identification in Kentucky as a case study. Drawing from 16,656 manually reviewed narratives from 2015-2022, with 3,803 confirmed secondary crashes, we compare three model classes: zero-shot open-source large language models (LLMs) (LLaMA3:70B, DeepSeek-R1:70B, Qwen3:32B, Gemma3:27B); fine-tuned transformers (BERT, DistilBERT, RoBERTa, XLNet, Longformer); and traditional logistic regression as baseline. Models were calibrated on 2015-2021 data and tested on 1,771 narratives from 2022. Fine-tuned transformers achieved superior performance, with RoBERTa yielding the highest F1-score (0.90) and accuracy (95%). Zero-shot LLaMA3:70B reached a comparable F1 of 0.86 but required 139 minutes of inference; the logistic baseline lagged well behind (F1:0.66). LLMs excelled in recall for some variants (e.g., GEMMA3:27B at 0.94) but incurred high computational costs (up to 723 minutes for DeepSeek-R1:70B), while fine-tuned models processed the test set in seconds after brief training. Further analysis indicated that mid-sized LLMs (e.g., DeepSeek-R1:32B) can rival larger counterparts in performance while reducing runtime, suggesting opportunities for optimized deployments. Results highlight trade-offs between accuracy, efficiency, and data requirements, with fine-tuned transformer models balancing precision and recall effectively on Kentucky data. Practical deployment considerations emphasize privacy-preserving local deployment, ensemble approaches for improved accuracy, and incremental processing for scalability, providing a replicable scheme for enhancing crash-data quality with advanced NLP.

Title: Why are LLMs' abilities emergent?

Authors: Vladimír Havlík
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.04401
Pdf URL: https://arxiv.org/pdf/2508.04401
Copy Paste: [[2508.04401]] Why are LLMs' abilities emergent?(https://arxiv.org/abs/2508.04401)
Keywords: generative, large language model
Abstract: The remarkable success of Large Language Models (LLMs) in generative tasks has raised fundamental questions about the nature of their acquired capabilities, which often appear to emerge unexpectedly without explicit training. This paper examines the emergent properties of Deep Neural Networks (DNNs) through both theoretical analysis and empirical observation, addressing the epistemological challenge of "creation without understanding" that characterises contemporary AI development. We explore how the neural approach's reliance on nonlinear, stochastic processes fundamentally differs from symbolic computational paradigms, creating systems whose macro-level behaviours cannot be analytically derived from micro-level neuron activities. Through analysis of scaling laws, grokking phenomena, and phase transitions in model capabilities, I demonstrate that emergent abilities arise from the complex dynamics of highly sensitive nonlinear systems rather than simply from parameter scaling alone. My investigation reveals that current debates over metrics, pre-training loss thresholds, and in-context learning miss the fundamental ontological nature of emergence in DNNs. I argue that these systems exhibit genuine emergent properties analogous to those found in other complex natural phenomena, where systemic capabilities emerge from cooperative interactions among simple components without being reducible to their individual behaviours. The paper concludes that understanding LLM capabilities requires recognising DNNs as a new domain of complex dynamical systems governed by universal principles of emergence, similar to those operating in physics, chemistry, and biology. This perspective shifts the focus from purely phenomenological definitions of emergence to understanding the internal dynamic transformations that enable these systems to acquire capabilities that transcend their individual components.

Title: FlexQ: Efficient Post-training INT6 Quantization for LLM Serving via Algorithm-System Co-Design

Authors: Hao Zhang, Aining Jia, Weifeng Bu, Yushu Cai, Kai Sheng, Hao Chen, Xin He
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.04405
Pdf URL: https://arxiv.org/pdf/2508.04405
Copy Paste: [[2508.04405]] FlexQ: Efficient Post-training INT6 Quantization for LLM Serving via Algorithm-System Co-Design(https://arxiv.org/abs/2508.04405)
Keywords: large language model
Abstract: Large Language Models (LLMs) demonstrate exceptional performance but entail significant memory and computational costs, restricting their practical deployment. While existing INT4/INT8 quantization reduces these costs, they often degrade accuracy or lack optimal efficiency. INT6 quantization offers a superior trade-off between model accuracy and inference efficiency, but lacks hardware support in modern GPUs, forcing emulation via higher-precision arithmetic units that limit acceleration. In this paper, we propose FlexQ, a novel post-training INT6 quantization framework combining algorithmic innovation with system-level optimizations. FlexQ employs uniform 6-bit weight quantization across all layers, with adaptive retention of 8-bit activations in layers identified through layer-wise sensitivity analysis. To maximize hardware efficiency, we develop a specialized high-performance GPU kernel supporting matrix multiplication for W6A6 and W6A8 representations via Binary Tensor Core (BTC) equivalents, effectively bypassing the lack of native INT6 tensor cores. Evaluations on LLaMA models show FlexQ maintains near-FP16 accuracy, with perplexity increases of no more than 0.05. The proposed kernel achieves an average 1.39$\times$ speedup over ABQ-LLM on LLaMA-2-70B linear layers. End-to-end, FlexQ delivers 1.33$\times$ inference acceleration and 1.21$\times$ memory savings over SmoothQuant. Code is released at this https URL.

Title: Deep Learning-based Scalable Image-to-3D Facade Parser for Generating Thermal 3D Building Models

Authors: Yinan Yu, Alex Gonzalez-Caceres, Samuel Scheidegger, Sanjay Somanath, Alexander Hollberg
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.04406
Pdf URL: https://arxiv.org/pdf/2508.04406
Copy Paste: [[2508.04406]] Deep Learning-based Scalable Image-to-3D Facade Parser for Generating Thermal 3D Building Models(https://arxiv.org/abs/2508.04406)
Keywords: segmentation
Abstract: Renovating existing buildings is essential for climate impact. Early-phase renovation planning requires simulations based on thermal 3D models at Level of Detail (LoD) 3, which include features like windows. However, scalable and accurate identification of such features remains a challenge. This paper presents the Scalable Image-to-3D Facade Parser (SI3FP), a pipeline that generates LoD3 thermal models by extracting geometries from images using both computer vision and deep learning. Unlike existing methods relying on segmentation and projection, SI3FP directly models geometric primitives in the orthographic image plane, providing a unified interface while reducing perspective distortions. SI3FP supports both sparse (e.g., Google Street View) and dense (e.g., hand-held camera) data sources. Tested on typical Swedish residential buildings, SI3FP achieved approximately 5% error in window-to-wall ratio estimates, demonstrating sufficient accuracy for early-stage renovation analysis. The pipeline facilitates large-scale energy renovation planning and has broader applications in urban development and planning.

Title: Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning

Authors: Haoji Zhang, Xin Gu, Jiawen Li, Chixiang Ma, Sule Bai, Chubin Zhang, Bowen Zhang, Zhichao Zhou, Dongliang He, Yansong Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04416
Pdf URL: https://arxiv.org/pdf/2508.04416
Copy Paste: [[2508.04416]] Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning(https://arxiv.org/abs/2508.04416)
Keywords: large language model
Abstract: The video reasoning ability of multimodal large language models (MLLMs) is crucial for downstream tasks like video question answering and temporal grounding. While recent approaches have explored text-based chain-of-thought (CoT) reasoning for MLLMs, these methods often suffer from limited cross-modal interaction and increased hallucination, especially with longer videos or reasoning chains. To address these challenges, we propose Video Intelligence via Tool-Augmented Learning (VITAL), a novel end-to-end agentic video reasoning framework. With a visual toolbox, the model can densely sample new video frames on demand and generate multimodal CoT for precise long video reasoning. We observe that temporal grounding and question answering are mutually beneficial for video understanding tasks. Therefore, we construct two high-quality multi-task video reasoning datasets MTVR-CoT-72k for supervised fine-tuning and MTVR-RL-110k for reinforcement learning. Moreover, we propose a Difficulty-aware Group Relative Policy Optimization algorithm (DGRPO) to mitigate difficulty imbalance in multi-task reinforcement learning. Extensive experiments on 11 challenging video understanding benchmarks demonstrate the advanced reasoning ability of VITAL, outperforming existing methods in video question answering and temporal grounding tasks, especially in long video scenarios. All code, data and model weight will be made publicly available.

Title: Think Before You Segment: An Object-aware Reasoning Agent for Referring Audio-Visual Segmentation

Authors: Jinxing Zhou, Yanghao Zhou, Mingfei Han, Tong Wang, Xiaojun Chang, Hisham Cholakkal, Rao Muhammad Anwer
Subjects: cs.CV, cs.AI, cs.MA, cs.MM
Abstract URL: https://arxiv.org/abs/2508.04418
Pdf URL: https://arxiv.org/pdf/2508.04418
Copy Paste: [[2508.04418]] Think Before You Segment: An Object-aware Reasoning Agent for Referring Audio-Visual Segmentation(https://arxiv.org/abs/2508.04418)
Keywords: interpretability, segmentation
Abstract: Referring Audio-Visual Segmentation (Ref-AVS) aims to segment target objects in audible videos based on given reference expressions. Prior works typically rely on learning latent embeddings via multimodal fusion to prompt a tunable SAM/SAM2 decoder for segmentation, which requires strong pixel-level supervision and lacks interpretability. From a novel perspective of explicit reference understanding, we propose TGS-Agent, which decomposes the task into a Think-Ground-Segment process, mimicking the human reasoning procedure by first identifying the referred object through multimodal analysis, followed by coarse-grained grounding and precise segmentation. To this end, we first propose Ref-Thinker, a multimodal language model capable of reasoning over textual, visual, and auditory cues. We construct an instruction-tuning dataset with explicit object-aware think-answer chains for Ref-Thinker fine-tuning. The object description inferred by Ref-Thinker is used as an explicit prompt for Grounding-DINO and SAM2, which perform grounding and segmentation without relying on pixel-level supervision. Additionally, we introduce R\textsuperscript{2}-AVSBench, a new benchmark with linguistically diverse and reasoning-intensive references for better evaluating model generalization. Our approach achieves state-of-the-art results on both standard Ref-AVSBench and proposed R\textsuperscript{2}-AVSBench. Code will be available at this https URL.

Title: Efficient Inter-Task Attention for Multitask Transformer Models

Authors: Christian Bohn, Thomas Kurbiel, Klaus Friedrichs, Hasan Tercan, Tobias Meisen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04422
Pdf URL: https://arxiv.org/pdf/2508.04422
Copy Paste: [[2508.04422]] Efficient Inter-Task Attention for Multitask Transformer Models(https://arxiv.org/abs/2508.04422)
Keywords: transformer
Abstract: In both Computer Vision and the wider Deep Learning field, the Transformer architecture is well-established as state-of-the-art for many applications. For Multitask Learning, however, where there may be many more queries necessary compared to single-task models, its Multi-Head-Attention often approaches the limits of what is computationally feasible considering practical hardware limitations. This is due to the fact that the size of the attention matrix scales quadratically with the number of tasks (assuming roughly equal numbers of queries for all tasks). As a solution, we propose our novel Deformable Inter-Task Self-Attention for Multitask models that enables the much more efficient aggregation of information across the feature maps from different tasks. In our experiments on the NYUD-v2 and PASCAL-Context datasets, we demonstrate an order-of-magnitude reduction in both FLOPs count and inference latency. At the same time, we also achieve substantial improvements by up to 7.4% in the individual tasks' prediction quality metrics.

Title: Composed Object Retrieval: Object-level Retrieval via Composed Expressions

Authors: Tong Wang, Guanyu Yang, Nian Liu, Zongyan Han, Jinxing Zhou, Salman Khan, Fahad Shahbaz Khan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04424
Pdf URL: https://arxiv.org/pdf/2508.04424
Copy Paste: [[2508.04424]] Composed Object Retrieval: Object-level Retrieval via Composed Expressions(https://arxiv.org/abs/2508.04424)
Keywords: segmentation
Abstract: Retrieving fine-grained visual content based on user intent remains a challenge in multi-modal systems. Although current Composed Image Retrieval (CIR) methods combine reference images with retrieval texts, they are constrained to image-level matching and cannot localize specific objects. To this end, we propose Composed Object Retrieval (COR), a brand-new task that goes beyond image-level retrieval to achieve object-level precision, allowing the retrieval and segmentation of target objects based on composed expressions combining reference objects and retrieval texts. COR presents significant challenges in retrieval flexibility, which requires systems to identify arbitrary objects satisfying composed expressions while avoiding semantically similar but irrelevant negative objects within the same scene. We construct COR127K, the first large-scale COR benchmark that contains 127,166 retrieval triplets with various semantic transformations in 408 categories. We also present CORE, a unified end-to-end model that integrates reference region encoding, adaptive visual-textual interaction, and region-level contrastive learning. Extensive experiments demonstrate that CORE significantly outperforms existing models in both base and novel categories, establishing a simple and effective baseline for this challenging task while opening new directions for fine-grained multi-modal retrieval research.

Title: Decoding the Multimodal Maze: A Systematic Review on the Adoption of Explainability in Multimodal Attention-based Models

Authors: Md Raisul Kibria, Sébastien Lafond, Janan Arslan
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.04427
Pdf URL: https://arxiv.org/pdf/2508.04427
Copy Paste: [[2508.04427]] Decoding the Multimodal Maze: A Systematic Review on the Adoption of Explainability in Multimodal Attention-based Models(https://arxiv.org/abs/2508.04427)
Keywords: robust, explainability
Abstract: Multimodal learning has witnessed remarkable advancements in recent years, particularly with the integration of attention-based models, leading to significant performance gains across a variety of tasks. Parallel to this progress, the demand for explainable artificial intelligence (XAI) has spurred a growing body of research aimed at interpreting the complex decision-making processes of these models. This systematic literature review analyzes research published between January 2020 and early 2024 that focuses on the explainability of multimodal models. Framed within the broader goals of XAI, we examine the literature across multiple dimensions, including model architecture, modalities involved, explanation algorithms and evaluation methodologies. Our analysis reveals that the majority of studies are concentrated on vision-language and language-only models, with attention-based techniques being the most commonly employed for explanation. However, these methods often fall short in capturing the full spectrum of interactions between modalities, a challenge further compounded by the architectural heterogeneity across domains. Importantly, we find that evaluation methods for XAI in multimodal settings are largely non-systematic, lacking consistency, robustness, and consideration for modality-specific cognitive and contextual factors. Based on these findings, we provide a comprehensive set of recommendations aimed at promoting rigorous, transparent, and standardized evaluation and reporting practices in multimodal XAI research. Our goal is to support future research in more interpretable, accountable, and responsible mulitmodal AI systems, with explainability at their core.

Title: Benchmarking Foundation Models for Mitotic Figure Classification

Authors: Jonas Ammeling, Jonathan Ganz, Emely Rosbach, Ludwig Lausser, Christof A. Bertram, Katharina Breininger, Marc Aubreville
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04441
Pdf URL: https://arxiv.org/pdf/2508.04441
Copy Paste: [[2508.04441]] Benchmarking Foundation Models for Mitotic Figure Classification(https://arxiv.org/abs/2508.04441)
Keywords: robust, transformer
Abstract: The performance of deep learning models is known to scale with data quantity and diversity. In pathology, as in many other medical imaging domains, the availability of labeled images for a specific task is often limited. Self-supervised learning techniques have enabled the use of vast amounts of unlabeled data to train large-scale neural networks, i.e., foundation models, that can address the limited data problem by providing semantically rich feature vectors that can generalize well to new tasks with minimal training effort increasing model performance and robustness. In this work, we investigate the use of foundation models for mitotic figure classification. The mitotic count, which can be derived from this classification task, is an independent prognostic marker for specific tumors and part of certain tumor grading systems. In particular, we investigate the data scaling laws on multiple current foundation models and evaluate their robustness to unseen tumor domains. Next to the commonly used linear probing paradigm, we also adapt the models using low-rank adaptation (LoRA) of their attention mechanisms. We compare all models against end-to-end-trained baselines, both CNNs and Vision Transformers. Our results demonstrate that LoRA-adapted foundation models provide superior performance to those adapted with standard linear probing, reaching performance levels close to 100% data availability with only 10% of training data. Furthermore, LoRA-adaptation of the most recent foundation models almost closes the out-of-domain performance gap when evaluated on unseen tumor domains. However, full fine-tuning of traditional architectures still yields competitive performance.

Title: Automated Generation of Curriculum-Aligned Multiple-Choice Questions for Malaysian Secondary Mathematics Using Generative AI

Authors: Rohaizah Abdul Wahid, Muhamad Said Nizamuddin Nadim, Suliana Sulaiman, Syahmi Akmal Shaharudin, Muhammad Danial Jupikil, Iqqwan Jasman Su Azlan Su
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.04442
Pdf URL: https://arxiv.org/pdf/2508.04442
Copy Paste: [[2508.04442]] Automated Generation of Curriculum-Aligned Multiple-Choice Questions for Malaysian Secondary Mathematics Using Generative AI(https://arxiv.org/abs/2508.04442)
Keywords: generative
Abstract: This paper addresses the critical need for scalable and high-quality educational assessment tools within the Malaysian education system. It highlights the potential of Generative AI (GenAI) while acknowledging the significant challenges of ensuring factual accuracy and curriculum alignment, especially for low-resource languages like Bahasa Melayu. This research introduces and compares four incremental pipelines for generating Form 1 Mathematics multiple-choice questions (MCQs) in Bahasa Melayu using OpenAI's GPT-4o. The methods range from non-grounded prompting (structured and basic) to Retrieval-Augmented Generation (RAG) approaches (one using the LangChain framework, one implemented manually). The system is grounded in official curriculum documents, including teacher-prepared notes and the yearly teaching plan (RPT). A dual-pronged automated evaluation framework is employed to assess the generated questions. Curriculum alignment is measured using Semantic Textual Similarity (STS) against the RPT, while contextual validity is verified through a novel RAG-based Question-Answering (RAG-QA) method. The results demonstrate that RAG-based pipelines significantly outperform non-grounded prompting methods, producing questions with higher curriculum alignment and factual validity. The study further analyzes the trade-offs between the ease of implementation of framework-based RAG and the fine-grained control offered by a manual pipeline. This work presents a validated methodology for generating curriculum-specific educational content in a low-resource language, introduces a symbiotic RAG-QA evaluation technique, and provides actionable insights for the development and deployment of practical EdTech solutions in Malaysia and similar regions.

Title: Matrix-Free Two-to-Infinity and One-to-Two Norms Estimation

Authors: Askar Tsyganov, Evgeny Frolov, Sergey Samsonov, Maxim Rakhuba
Subjects: cs.LG, math.NA, stat.ML
Abstract URL: https://arxiv.org/abs/2508.04444
Pdf URL: https://arxiv.org/pdf/2508.04444
Copy Paste: [[2508.04444]] Matrix-Free Two-to-Infinity and One-to-Two Norms Estimation(https://arxiv.org/abs/2508.04444)
Keywords: attack
Abstract: In this paper, we propose new randomized algorithms for estimating the two-to-infinity and one-to-two norms in a matrix-free setting, using only matrix-vector multiplications. Our methods are based on appropriate modifications of Hutchinson's diagonal estimator and its Hutch++ version. We provide oracle complexity bounds for both modifications. We further illustrate the practical utility of our algorithms for Jacobian-based regularization in deep neural network training on image classification tasks. We also demonstrate that our methodology can be applied to mitigate the effect of adversarial attacks in the domain of recommender systems.

Title: Cloud Model Characteristic Function Auto-Encoder: Integrating Cloud Model Theory with MMD Regularization for Enhanced Generative Modeling

Authors: Biao Hu, Guoyin Wang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.04447
Pdf URL: https://arxiv.org/pdf/2508.04447
Copy Paste: [[2508.04447]] Cloud Model Characteristic Function Auto-Encoder: Integrating Cloud Model Theory with MMD Regularization for Enhanced Generative Modeling(https://arxiv.org/abs/2508.04447)
Keywords: generative
Abstract: We introduce Cloud Model Characteristic Function Auto-Encoder (CMCFAE), a novel generative model that integrates the cloud model into the Wasserstein Auto-Encoder (WAE) framework. By leveraging the characteristic functions of the cloud model to regularize the latent space, our approach enables more accurate modeling of complex data distributions. Unlike conventional methods that rely on a standard Gaussian prior and traditional divergence measures, our method employs a cloud model prior, providing a more flexible and realistic representation of the latent space, thus mitigating the homogenization observed in reconstructed samples. We derive the characteristic function of the cloud model and propose a corresponding regularizer within the WAE framework. Extensive quantitative and qualitative evaluations on MNIST, FashionMNIST, CIFAR-10, and CelebA demonstrate that CMCFAE outperforms existing models in terms of reconstruction quality, latent space structuring, and sample diversity. This work not only establishes a novel integration of cloud model theory with MMD-based regularization but also offers a promising new perspective for enhancing autoencoder-based generative models.

Title: Automatic LLM Red Teaming

Authors: Roman Belaire, Arunesh Sinha, Pradeep Varakantham
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.04451
Pdf URL: https://arxiv.org/pdf/2508.04451
Copy Paste: [[2508.04451]] Automatic LLM Red Teaming(https://arxiv.org/abs/2508.04451)
Keywords: attack, robust, generative, large language model
Abstract: Red teaming is critical for identifying vulnerabilities and building trust in current LLMs. However, current automated methods for Large Language Models (LLMs) rely on brittle prompt templates or single-turn attacks, failing to capture the complex, interactive nature of real-world adversarial dialogues. We propose a novel paradigm: training an AI to strategically `break' another AI. By formalizing red teaming as a Markov Decision Process (MDP) and employing a hierarchical Reinforcement Learning (RL) framework, we effectively address the inherent sparse reward and long-horizon challenges. Our generative agent learns coherent, multi-turn attack strategies through a fine-grained, token-level harm reward, enabling it to uncover subtle vulnerabilities missed by existing baselines. This approach sets a new state-of-the-art, fundamentally reframing LLM red teaming as a dynamic, trajectory-based process (rather than a one-step test) essential for robust AI deployment.

Title: Small transformer architectures for task switching

Authors: Claudius Gros
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.04461
Pdf URL: https://arxiv.org/pdf/2508.04461
Copy Paste: [[2508.04461]] Small transformer architectures for task switching(https://arxiv.org/abs/2508.04461)
Keywords: transformer, generative
Abstract: The rapid progress seen in terms of large-scale generative AI is largely based on the attention mechanism. It is conversely non-trivial to conceive small-scale applications for which attention-based architectures outperform traditional approaches, such as multi-layer perceptrons or recurrent networks. We examine this problem in the context of 'task switching'. In this framework models work on ongoing token sequences with the current task being determined by stochastically interspersed control tokens. We show that standard transformers cannot solve a basic task switching reference model based on finite domain arithmetics which contains subtasks dedicated to increment / addition / reverse copy / context (IARC). We show that transformers, long short-term memory recurrent networks (LSTM), and plain multi-layer perceptrons (MLPs) achieve similar, but only modest prediction accuracies. We enlarge our comparative study by including an extension of the standard transformer architecture to its non-translational invariant counterpart, the cisformer, and an alternative attention mechanism, extensive attention. A combination of the latter is found to be the only model able to achieve considerable performance levels, of around 95%. Our results indicate that the workings of attention can be understood better, and even improved, when comparing qualitatively different formulations in task-switching settings.

Title: CARD: Cache-Assisted Parallel Speculative Decoding for Efficient Large Language Model Inference

Authors: Enyu Zhou, Kai Sheng, Hao Chen, Xin He
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.04462
Pdf URL: https://arxiv.org/pdf/2508.04462
Copy Paste: [[2508.04462]] CARD: Cache-Assisted Parallel Speculative Decoding for Efficient Large Language Model Inference(https://arxiv.org/abs/2508.04462)
Keywords: large language model
Abstract: Speculative decoding (SD), where an extra draft model first provides multiple draft tokens and the original target model then verifies these tokens in parallel, has shown great power for LLM inference acceleration. However, existing SD methods must adhere to the 'draft-then-verify' paradigm, which forces drafting and verification processes to execute sequentially during SD, resulting in inefficient inference performance and limiting the size of the draft model. Furthermore, once a single token in the candidate sequence is rejected during the drafting process, all subsequent candidate tokens must be discarded, leading to inefficient drafting. To address these challenges, we propose a cache-based parallel speculative decoding framework employing a 'query-and-correct' paradigm. Specifically, CARD decouples drafting and verification: the draft model generates candidate tokens to populate a shared cache, while the target model concurrently rectifies the draft model's generation direction. This effectively enables the target model to perform inference at speed approaching that of the draft model. Our approach achieves up to 4.83 speedup over vanilla decoding without requiring fine-tuning of either the draft or target models. Our code is available at this https URL.

Title: GFocal: A Global-Focal Neural Operator for Solving PDEs on Arbitrary Geometries

Authors: Fangzhi Fei, Jiaxin Hu, Qiaofeng Li, Zhenyu Liu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.04463
Pdf URL: https://arxiv.org/pdf/2508.04463
Copy Paste: [[2508.04463]] GFocal: A Global-Focal Neural Operator for Solving PDEs on Arbitrary Geometries(https://arxiv.org/abs/2508.04463)
Keywords: transformer
Abstract: Transformer-based neural operators have emerged as promising surrogate solvers for partial differential equations, by leveraging the effectiveness of Transformers for capturing long-range dependencies and global correlations, profoundly proven in language modeling. However, existing methodologies overlook the coordinated learning of interdependencies between local physical details and global features, which are essential for tackling multiscale problems, preserving physical consistency and numerical stability in long-term rollouts, and accurately capturing transitional dynamics. In this work, we propose GFocal, a Transformer-based neural operator method that enforces simultaneous global and local feature learning and fusion. Global correlations and local features are harnessed through Nyström attention-based \textbf{g}lobal blocks and slices-based \textbf{focal} blocks to generate physics-aware tokens, subsequently modulated and integrated via convolution-based gating blocks, enabling dynamic fusion of multiscale information. GFocal achieves accurate modeling and prediction of physical features given arbitrary geometries and initial conditions. Experiments show that GFocal achieves state-of-the-art performance with an average 15.2\% relative gain in five out of six benchmarks and also excels in industry-scale simulations such as aerodynamics simulation of automotives and airfoils.

Title: 4DVD: Cascaded Dense-view Video Diffusion Model for High-quality 4D Content Generation

Authors: Shuzhou Yang, Xiaodong Cun, Xiaoyu Li, Yaowei Li, Jian Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04467
Pdf URL: https://arxiv.org/pdf/2508.04467
Copy Paste: [[2508.04467]] 4DVD: Cascaded Dense-view Video Diffusion Model for High-quality 4D Content Generation(https://arxiv.org/abs/2508.04467)
Keywords: diffusion
Abstract: Given the high complexity of directly generating high-dimensional data such as 4D, we present 4DVD, a cascaded video diffusion model that generates 4D content in a decoupled manner. Unlike previous multi-view video methods that directly model 3D space and temporal features simultaneously with stacked cross view/temporal attention modules, 4DVD decouples this into two subtasks: coarse multi-view layout generation and structure-aware conditional generation, and effectively unifies them. Specifically, given a monocular video, 4DVD first predicts the dense view content of its layout with superior cross-view and temporal consistency. Based on the produced layout priors, a structure-aware spatio-temporal generation branch is developed, combining these coarse structural priors with the exquisite appearance content of input monocular video to generate final high-quality dense-view videos. Benefit from this, explicit 4D representation~(such as 4D Gaussian) can be optimized accurately, enabling wider practical application. To train 4DVD, we collect a dynamic 3D object dataset, called D-Objaverse, from the Objaverse benchmark and render 16 videos with 21 frames for each object. Extensive experiments demonstrate our state-of-the-art performance on both novel view synthesis and 4D generation. Our project page is this https URL

Title: FrEVL: Leveraging Frozen Pretrained Embeddings for Efficient Vision-Language Understanding

Authors: Emmanuelle Bourigault, Pauline Bourigault
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2508.04469
Pdf URL: https://arxiv.org/pdf/2508.04469
Copy Paste: [[2508.04469]] FrEVL: Leveraging Frozen Pretrained Embeddings for Efficient Vision-Language Understanding(https://arxiv.org/abs/2508.04469)
Keywords: extraction
Abstract: The deployment of vision-language models remains constrained by substantial computational requirements. We present \textbf{FrEVL}, a framework exploring whether frozen pretrained embeddings can support effective vision-language understanding. Our analysis reveals that frozen embeddings contain rich information for discriminative tasks, achieving 85\% to 95\% of state-of-the-art performance on standard benchmarks with only 68.4M trainable parameters. This performance dichotomy reveals a critical insight: frozen embedding effectiveness depends on alignment between pretraining objectives and downstream task requirements. When accounting for end-to-end computation including embedding extraction, FrEVL provides $2.3\times$ speedup with 52\% lower energy consumption, making it suitable for scenarios with pre-computable inputs or when deployment constraints outweigh marginal performance gains. Our evaluation provides practitioners with guidance on when frozen embedding approaches represent viable alternatives to full model deployment. We will release our complete implementation and evaluation framework to facilitate further research into efficient multi-modal understanding.

Title: FedHiP: Heterogeneity-Invariant Personalized Federated Learning Through Closed-Form Solutions

Authors: Jianheng Tang, Zhirui Yang, Jingchao Wang, Kejia Fan, Jinfeng Xu, Huiping Zhuang, Anfeng Liu, Houbing Herbert Song, Leye Wang, Yunhuai Liu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.04470
Pdf URL: https://arxiv.org/pdf/2508.04470
Copy Paste: [[2508.04470]] FedHiP: Heterogeneity-Invariant Personalized Federated Learning Through Closed-Form Solutions(https://arxiv.org/abs/2508.04470)
Keywords: extraction, federate
Abstract: Lately, Personalized Federated Learning (PFL) has emerged as a prevalent paradigm to deliver personalized models by collaboratively training while simultaneously adapting to each client's local applications. Existing PFL methods typically face a significant challenge due to the ubiquitous data heterogeneity (i.e., non-IID data) across clients, which severely hinders convergence and degrades performance. We identify that the root issue lies in the long-standing reliance on gradient-based updates, which are inherently sensitive to non-IID data. To fundamentally address this issue and bridge the research gap, in this paper, we propose a Heterogeneity-invariant Personalized Federated learning scheme, named FedHiP, through analytical (i.e., closed-form) solutions to avoid gradient-based updates. Specifically, we exploit the trend of self-supervised pre-training, leveraging a foundation model as a frozen backbone for gradient-free feature extraction. Following the feature extractor, we further develop an analytic classifier for gradient-free training. To support both collective generalization and individual personalization, our FedHiP scheme incorporates three phases: analytic local training, analytic global aggregation, and analytic local personalization. The closed-form solutions of our FedHiP scheme enable its ideal property of heterogeneity invariance, meaning that each personalized model remains identical regardless of how non-IID the data are distributed across all other clients. Extensive experiments on benchmark datasets validate the superiority of our FedHiP scheme, outperforming the state-of-the-art baselines by at least 5.79%-20.97% in accuracy.

Title: Zero-Residual Concept Erasure via Progressive Alignment in Text-to-Image Model

Authors: Hongxu Chen, Zhen Wang, Taoran Mei, Lin Li, Bowei Zhu, Runshi Li, Long Chen
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.04472
Pdf URL: https://arxiv.org/pdf/2508.04472
Copy Paste: [[2508.04472]] Zero-Residual Concept Erasure via Progressive Alignment in Text-to-Image Model(https://arxiv.org/abs/2508.04472)
Keywords: generative
Abstract: Concept Erasure, which aims to prevent pretrained text-to-image models from generating content associated with semantic-harmful concepts (i.e., target concepts), is getting increased attention. State-of-the-art methods formulate this task as an optimization problem: they align all target concepts with semantic-harmless anchor concepts, and apply closed-form solutions to update the model accordingly. While these closed-form methods are efficient, we argue that existing methods have two overlooked limitations: 1) They often result in incomplete erasure due to "non-zero alignment residual", especially when text prompts are relatively complex. 2) They may suffer from generation quality degradation as they always concentrate parameter updates in a few deep layers. To address these issues, we propose a novel closed-form method ErasePro: it is designed for more complete concept erasure and better preserving overall generative quality. Specifically, ErasePro first introduces a strict zero-residual constraint into the optimization objective, ensuring perfect alignment between target and anchor concept features and enabling more complete erasure. Secondly, it employs a progressive, layer-wise update strategy that gradually transfers target concept features to those of the anchor concept from shallow to deep layers. As the depth increases, the required parameter changes diminish, thereby reducing deviations in sensitive deep layers and preserving generative quality. Empirical results across different concept erasure tasks (including instance, art style, and nudity erasure) have demonstrated the effectiveness of our ErasePro.

Title: Emotion Detection Using Conditional Generative Adversarial Networks (cGAN): A Deep Learning Approach

Authors: Anushka Srivastava
Subjects: cs.LG, cs.NE
Abstract URL: https://arxiv.org/abs/2508.04481
Pdf URL: https://arxiv.org/pdf/2508.04481
Copy Paste: [[2508.04481]] Emotion Detection Using Conditional Generative Adversarial Networks (cGAN): A Deep Learning Approach(https://arxiv.org/abs/2508.04481)
Keywords: generative
Abstract: This paper presents a deep learning-based approach to emotion detection using Conditional Generative Adversarial Networks (cGANs). Unlike traditional unimodal techniques that rely on a single data type, we explore a multimodal framework integrating text, audio, and facial expressions. The proposed cGAN architecture is trained to generate synthetic emotion-rich data and improve classification accuracy across multiple modalities. Our experimental results demonstrate significant improvements in emotion recognition performance compared to baseline models. This work highlights the potential of cGANs in enhancing human-computer interaction systems by enabling more nuanced emotional understanding.

Title: QuantVSR: Low-Bit Post-Training Quantization for Real-World Video Super-Resolution

Authors: Bowen Chai, Zheng Chen, Libo Zhu, Wenbo Li, Yong Guo, Yulun Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04485
Pdf URL: https://arxiv.org/pdf/2508.04485
Copy Paste: [[2508.04485]] QuantVSR: Low-Bit Post-Training Quantization for Real-World Video Super-Resolution(https://arxiv.org/abs/2508.04485)
Keywords: diffusion
Abstract: Diffusion models have shown superior performance in real-world video super-resolution (VSR). However, the slow processing speeds and heavy resource consumption of diffusion models hinder their practical application and deployment. Quantization offers a potential solution for compressing the VSR model. Nevertheless, quantizing VSR models is challenging due to their temporal characteristics and high fidelity requirements. To address these issues, we propose QuantVSR, a low-bit quantization model for real-world VSR. We propose a spatio-temporal complexity aware (STCA) mechanism, where we first utilize the calibration dataset to measure both spatial and temporal complexities for each layer. Based on these statistics, we allocate layer-specific ranks to the low-rank full-precision (FP) auxiliary branch. Subsequently, we jointly refine the FP and low-bit branches to achieve simultaneous optimization. In addition, we propose a learnable bias alignment (LBA) module to reduce the biased quantization errors. Extensive experiments on synthetic and real-world datasets demonstrate that our method obtains comparable performance with the FP model and significantly outperforms recent leading low-bit quantization methods. Code is available at: this https URL.

Title: Learning Robust Intervention Representations with Delta Embeddings

Authors: Panagiotis Alimisis, Christos Diou
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.04492
Pdf URL: https://arxiv.org/pdf/2508.04492
Copy Paste: [[2508.04492]] Learning Robust Intervention Representations with Delta Embeddings(https://arxiv.org/abs/2508.04492)
Keywords: robust
Abstract: Causal representation learning has attracted significant research interest during the past few years, as a means for improving model generalization and robustness. Causal representations of interventional image pairs, have the property that only variables corresponding to scene elements affected by the intervention / action are changed between the start state and the end state. While most work in this area has focused on identifying and representing the variables of the scene under a causal model, fewer efforts have focused on representations of the interventions themselves. In this work, we show that an effective strategy for improving out of distribution (OOD) robustness is to focus on the representation of interventions in the latent space. Specifically, we propose that an intervention can be represented by a Causal Delta Embedding that is invariant to the visual scene and sparse in terms of the causal variables it affects. Leveraging this insight, we propose a framework that is capable of learning causal representations from image pairs, without any additional supervision. Experiments in the Causal Triplet challenge demonstrate that Causal Delta Embeddings are highly effective in OOD settings, significantly exceeding baseline performance in both synthetic and real-world benchmarks.

Title: Causal Reflection with Language Models

Authors: Abi Aryan, Zac Liu
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2508.04495
Pdf URL: https://arxiv.org/pdf/2508.04495
Copy Paste: [[2508.04495]] Causal Reflection with Language Models(https://arxiv.org/abs/2508.04495)
Keywords: robust
Abstract: While LLMs exhibit impressive fluency and factual recall, they struggle with robust causal reasoning, often relying on spurious correlations and brittle patterns. Similarly, traditional Reinforcement Learning agents also lack causal understanding, optimizing for rewards without modeling why actions lead to outcomes. We introduce Causal Reflection, a framework that explicitly models causality as a dynamic function over state, action, time, and perturbation, enabling agents to reason about delayed and nonlinear effects. Additionally, we define a formal Reflect mechanism that identifies mismatches between predicted and observed outcomes and generates causal hypotheses to revise the agent's internal model. In this architecture, LLMs serve not as black-box reasoners, but as structured inference engines translating formal causal outputs into natural language explanations and counterfactuals. Our framework lays the theoretical groundwork for Causal Reflective agents that can adapt, self-correct, and communicate causal understanding in evolving environments.

Title: PRISM: Lightweight Multivariate Time-Series Classification through Symmetric Multi-Resolution Convolutional Layers

Authors: Federico Zucchi, Thomas Lampert
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.04503
Pdf URL: https://arxiv.org/pdf/2508.04503
Copy Paste: [[2508.04503]] PRISM: Lightweight Multivariate Time-Series Classification through Symmetric Multi-Resolution Convolutional Layers(https://arxiv.org/abs/2508.04503)
Keywords: transformer
Abstract: Multivariate time-series classification is pivotal in domains ranging from wearable sensing to biomedical monitoring. Despite recent advances, Transformer- and CNN-based models often remain computationally heavy, offer limited frequency diversity, and require extensive parameter budgets. We propose PRISM (Per-channel Resolution-Informed Symmetric Module), a convolutional-based feature extractor that applies symmetric finite-impulse-response (FIR) filters at multiple temporal scales, independently per channel. This multi-resolution, per-channel design yields highly frequency-selective embeddings without any inter-channel convolutions, greatly reducing model size and complexity. Across human-activity, sleep-stage and biomedical benchmarks, PRISM, paired with lightweight classification heads, matches or outperforms leading CNN and Transformer baselines, while using roughly an order of magnitude fewer parameters and FLOPs. By uniting classical signal processing insights with modern deep learning, PRISM offers an accurate, resource-efficient solution for multivariate time-series classification.

Title: Skeleton Motion Words for Unsupervised Skeleton-Based Temporal Action Segmentation

Authors: Uzay Gökay, Federico Spurio, Dominik R. Bach, Juergen Gall
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04513
Pdf URL: https://arxiv.org/pdf/2508.04513
Copy Paste: [[2508.04513]] Skeleton Motion Words for Unsupervised Skeleton-Based Temporal Action Segmentation(https://arxiv.org/abs/2508.04513)
Keywords: privacy, robust, segmentation
Abstract: Current state-of-the-art methods for skeleton-based temporal action segmentation are predominantly supervised and require annotated data, which is expensive to collect. In contrast, existing unsupervised temporal action segmentation methods have focused primarily on video data, while skeleton sequences remain underexplored, despite their relevance to real-world applications, robustness, and privacy-preserving nature. In this paper, we propose a novel approach for unsupervised skeleton-based temporal action segmentation. Our method utilizes a sequence-to-sequence temporal autoencoder that keeps the information of the different joints disentangled in the embedding space. Latent skeleton sequences are then divided into non-overlapping patches and quantized to obtain distinctive skeleton motion words, driving the discovery of semantically meaningful action clusters. We thoroughly evaluate the proposed approach on three widely used skeleton-based datasets, namely HuGaDB, LARa, and BABEL. The results demonstrate that our model outperforms the current state-of-the-art unsupervised temporal action segmentation methods. Code is available at this https URL .

Title: Channel-Independent Federated Traffic Prediction

Authors: Mo Zhang, Xiaoyu Li, Bin Xu, Meng Chen, Yongshun Gong
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.04517
Pdf URL: https://arxiv.org/pdf/2508.04517
Copy Paste: [[2508.04517]] Channel-Independent Federated Traffic Prediction(https://arxiv.org/abs/2508.04517)
Keywords: privacy, federate
Abstract: In recent years, traffic prediction has achieved remarkable success and has become an integral component of intelligent transportation systems. However, traffic data is typically distributed among multiple data owners, and privacy constraints prevent the direct utilization of these isolated datasets for traffic prediction. Most existing federated traffic prediction methods focus on designing communication mechanisms that allow models to leverage information from other clients in order to improve prediction accuracy. Unfortunately, such approaches often incur substantial communication overhead, and the resulting transmission delays significantly slow down the training process. As the volume of traffic data continues to grow, this issue becomes increasingly critical, making the resource consumption of current methods unsustainable. To address this challenge, we propose a novel variable relationship modeling paradigm for federated traffic prediction, termed the Channel-Independent Paradigm(CIP). Unlike traditional approaches, CIP eliminates the need for inter-client communication by enabling each node to perform efficient and accurate predictions using only local information. Based on the CIP, we further develop Fed-CI, an efficient federated learning framework, allowing each client to process its own data independently while effectively mitigating the information loss caused by the lack of direct data sharing among clients. Fed-CI significantly reduces communication overhead, accelerates the training process, and achieves state-of-the-art performance while complying with privacy regulations. Extensive experiments on multiple real-world datasets demonstrate that Fed-CI consistently outperforms existing methods across all datasets and federated settings. It achieves improvements of 8%, 14%, and 16% in RMSE, MAE, and MAPE, respectively, while also substantially reducing communication costs.

Title: RAIDX: A Retrieval-Augmented Generation and GRPO Reinforcement Learning Framework for Explainable Deepfake Detection

Authors: Tianxiao Li, Zhenglin Huang, Haiquan Wen, Yiwei He, Shuchang Lyu, Baoyuan Wu, Guangliang Cheng
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.04524
Pdf URL: https://arxiv.org/pdf/2508.04524
Copy Paste: [[2508.04524]] RAIDX: A Retrieval-Augmented Generation and GRPO Reinforcement Learning Framework for Explainable Deepfake Detection(https://arxiv.org/abs/2508.04524)
Keywords: explainability
Abstract: The rapid advancement of AI-generation models has enabled the creation of hyperrealistic imagery, posing ethical risks through widespread misinformation. Current deepfake detection methods, categorized as face specific detectors or general AI-generated detectors, lack transparency by framing detection as a classification task without explaining decisions. While several LLM-based approaches offer explainability, they suffer from coarse-grained analyses and dependency on labor-intensive annotations. This paper introduces RAIDX (Retrieval-Augmented Image Deepfake Detection and Explainability), a novel deepfake detection framework integrating Retrieval-Augmented Generation (RAG) and Group Relative Policy Optimization (GRPO) to enhance detection accuracy and decision explainability. Specifically, RAIDX leverages RAG to incorporate external knowledge for improved detection accuracy and employs GRPO to autonomously generate fine-grained textual explanations and saliency maps, eliminating the need for extensive manual annotations. Experiments on multiple benchmarks demonstrate RAIDX's effectiveness in identifying real or fake, and providing interpretable rationales in both textual descriptions and saliency maps, achieving state-of-the-art detection performance while advancing transparency in deepfake identification. RAIDX represents the first unified framework to synergize RAG and GRPO, addressing critical gaps in accuracy and explainability. Our code and models will be publicly available.

Title: StyliTruth : Unlocking Stylized yet Truthful LLM Generation via Disentangled Steering

Authors: Chenglei Shen, Zhongxiang Sun, Teng Shi, Xiao Zhang, Jun Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.04530
Pdf URL: https://arxiv.org/pdf/2508.04530
Copy Paste: [[2508.04530]] StyliTruth : Unlocking Stylized yet Truthful LLM Generation via Disentangled Steering(https://arxiv.org/abs/2508.04530)
Keywords: large language model
Abstract: Generating stylized large language model (LLM) responses via representation editing is a promising way for fine-grained output control. However, there exists an inherent trade-off: imposing a distinctive style often degrades truthfulness. Existing representation editing methods, by naively injecting style signals, overlook this collateral impact and frequently contaminate the model's core truthfulness representations, resulting in reduced answer correctness. We term this phenomenon stylization-induced truthfulness collapse. We attribute this issue to latent coupling between style and truth directions in certain key attention heads, and propose StyliTruth, a mechanism that preserves stylization while keeping truthfulness intact. StyliTruth separates the style-relevant and truth-relevant subspaces in the model's representation space via an orthogonal deflation process. This decomposition enables independent control of style and truth in their own subspaces, minimizing interference. By designing adaptive, token-level steering vectors within each subspace, we dynamically and precisely control the generation process to maintain both stylistic fidelity and truthfulness. We validate our method on multiple styles and languages. Extensive experiments and analyses show that StyliTruth significantly reduces stylization-induced truthfulness collapse and outperforms existing inference-time intervention methods in balancing style adherence with truthfulness.

Title: No Masks Needed: Explainable AI for Deriving Segmentation from Classification

Authors: Mosong Ma, Tania Stathaki, Michalis Lazarou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04534
Pdf URL: https://arxiv.org/pdf/2508.04534
Copy Paste: [[2508.04534]] No Masks Needed: Explainable AI for Deriving Segmentation from Classification(https://arxiv.org/abs/2508.04534)
Keywords: segmentation
Abstract: Medical image segmentation is vital for modern healthcare and is a key element of computer-aided diagnosis. While recent advancements in computer vision have explored unsupervised segmentation using pre-trained models, these methods have not been translated well to the medical imaging domain. In this work, we introduce a novel approach that fine-tunes pre-trained models specifically for medical images, achieving accurate segmentation with extensive processing. Our method integrates Explainable AI to generate relevance scores, enhancing the segmentation process. Unlike traditional methods that excel in standard benchmarks but falter in medical applications, our approach achieves improved results on datasets like CBIS-DDSM, NuInsSeg and Kvasir-SEG.

Title: TopKD: Top-scaled Knowledge Distillation

Authors: Qi Wang, Jinjia Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04539
Pdf URL: https://arxiv.org/pdf/2508.04539
Copy Paste: [[2508.04539]] TopKD: Top-scaled Knowledge Distillation(https://arxiv.org/abs/2508.04539)
Keywords: transformer
Abstract: Recent advances in knowledge distillation (KD) predominantly emphasize feature-level knowledge transfer, frequently overlooking critical information embedded within the teacher's logit distributions. In this paper, we revisit logit-based distillation and reveal an underexplored yet critical element: Top-K knowledge. Motivated by this insight, we propose Top-scaled Knowledge Distillation (TopKD), a simple, efficient, and architecture-agnostic framework that significantly enhances logit-based distillation. TopKD consists of two main components: (1) a Top-K Scaling Module (TSM), which adaptively amplifies the most informative logits, and (2) a Top-K Decoupled Loss (TDL), which offers targeted and effective supervision. Notably, TopKD integrates seamlessly into existing KD methods without introducing extra modules or requiring architectural changes. Extensive experiments on CIFAR-100, ImageNet, STL-10, and Tiny-ImageNet demonstrate that TopKD consistently surpasses state-of-the-art distillation methods. Moreover, our method demonstrates substantial effectiveness when distilling Vision Transformers, underscoring its versatility across diverse network architectures. These findings highlight the significant potential of logits to advance knowledge distillation.

Title: InceptoFormer: A Multi-Signal Neural Framework for Parkinson's Disease Severity Evaluation from Gait

Authors: Safwen Naimi, Arij Said, Wassim Bouachir, Guillaume-Alexandre Bilodeau
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04540
Pdf URL: https://arxiv.org/pdf/2508.04540
Copy Paste: [[2508.04540]] InceptoFormer: A Multi-Signal Neural Framework for Parkinson's Disease Severity Evaluation from Gait(https://arxiv.org/abs/2508.04540)
Keywords: transformer
Abstract: We present InceptoFormer, a multi-signal neural framework designed for Parkinson's Disease (PD) severity evaluation via gait dynamics analysis. Our architecture introduces a 1D adaptation of the Inception model, which we refer to as Inception1D, along with a Transformer-based framework to stage PD severity according to the Hoehn and Yahr (H&Y) scale. The Inception1D component captures multi-scale temporal features by employing parallel 1D convolutional filters with varying kernel sizes, thereby extracting features across multiple temporal scales. The transformer component efficiently models long-range dependencies within gait sequences, providing a comprehensive understanding of both local and global patterns. To address the issue of class imbalance in PD severity staging, we propose a data structuring and preprocessing strategy based on oversampling to enhance the representation of underrepresented severity levels. The overall design enables to capture fine-grained temporal variations and global dynamics in gait signal, significantly improving classification performance for PD severity evaluation. Through extensive experimentation, InceptoFormer achieves an accuracy of 96.6%, outperforming existing state-of-the-art methods in PD severity assessment. The source code for our implementation is publicly available at this https URL

Title: Privacy Risk Predictions Based on Fundamental Understanding of Personal Data and an Evolving Threat Landscape

Authors: Haoran Niu, K. Suzanne Barber
Subjects: cs.LG, cs.CR, cs.SI
Abstract URL: https://arxiv.org/abs/2508.04542
Pdf URL: https://arxiv.org/pdf/2508.04542
Copy Paste: [[2508.04542]] Privacy Risk Predictions Based on Fundamental Understanding of Personal Data and an Evolving Threat Landscape(https://arxiv.org/abs/2508.04542)
Keywords: privacy, protect
Abstract: It is difficult for individuals and organizations to protect personal information without a fundamental understanding of relative privacy risks. By analyzing over 5,000 empirical identity theft and fraud cases, this research identifies which types of personal data are exposed, how frequently exposures occur, and what the consequences of those exposures are. We construct an Identity Ecosystem graph--a foundational, graph-based model in which nodes represent personally identifiable information (PII) attributes and edges represent empirical disclosure relationships between them (e.g., the probability that one PII attribute is exposed due to the exposure of another). Leveraging this graph structure, we develop a privacy risk prediction framework that uses graph theory and graph neural networks to estimate the likelihood of further disclosures when certain PII attributes are compromised. The results show that our approach effectively answers the core question: Can the disclosure of a given identity attribute possibly lead to the disclosure of another attribute?

Title: MSC: A Marine Wildlife Video Dataset with Grounded Segmentation and Clip-Level Captioning

Authors: Quang-Trung Truong, Yuk-Kwan Wong, Vo Hoang Kim Tuyen Dang, Rinaldi Gotama, Duc Thanh Nguyen, Sai-Kit Yeung
Subjects: cs.CV, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2508.04549
Pdf URL: https://arxiv.org/pdf/2508.04549
Copy Paste: [[2508.04549]] MSC: A Marine Wildlife Video Dataset with Grounded Segmentation and Clip-Level Captioning(https://arxiv.org/abs/2508.04549)
Keywords: segmentation
Abstract: Marine videos present significant challenges for video understanding due to the dynamics of marine objects and the surrounding environment, camera motion, and the complexity of underwater scenes. Existing video captioning datasets, typically focused on generic or human-centric domains, often fail to generalize to the complexities of the marine environment and gain insights about marine life. To address these limitations, we propose a two-stage marine object-oriented video captioning pipeline. We introduce a comprehensive video understanding benchmark that leverages the triplets of video, text, and segmentation masks to facilitate visual grounding and captioning, leading to improved marine video understanding and analysis, and marine video generation. Additionally, we highlight the effectiveness of video splitting in order to detect salient object transitions in scene changes, which significantly enrich the semantics of captioning content. Our dataset and code have been released at this https URL.

Title: Two-Way Garment Transfer: Unified Diffusion Framework for Dressing and Undressing Synthesis

Authors: Angang Zhang, Fang Deng, Hao Chen, Zhongjian Chen, Junyan Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04551
Pdf URL: https://arxiv.org/pdf/2508.04551
Copy Paste: [[2508.04551]] Two-Way Garment Transfer: Unified Diffusion Framework for Dressing and Undressing Synthesis(https://arxiv.org/abs/2508.04551)
Keywords: extraction, diffusion
Abstract: While recent advances in virtual try-on (VTON) have achieved realistic garment transfer to human subjects, its inverse task, virtual try-off (VTOFF), which aims to reconstruct canonical garment templates from dressed humans, remains critically underexplored and lacks systematic investigation. Existing works predominantly treat them as isolated tasks: VTON focuses on garment dressing while VTOFF addresses garment extraction, thereby neglecting their complementary symmetry. To bridge this fundamental gap, we propose the Two-Way Garment Transfer Model (TWGTM), to the best of our knowledge, the first unified framework for joint clothing-centric image synthesis that simultaneously resolves both mask-guided VTON and mask-free VTOFF through bidirectional feature disentanglement. Specifically, our framework employs dual-conditioned guidance from both latent and pixel spaces of reference images to seamlessly bridge the dual tasks. On the other hand, to resolve the inherent mask dependency asymmetry between mask-guided VTON and mask-free VTOFF, we devise a phased training paradigm that progressively bridges this modality gap. Extensive qualitative and quantitative experiments conducted across the DressCode and VITON-HD datasets validate the efficacy and competitive edge of our proposed approach.

Title: Augmentation-based Domain Generalization and Joint Training from Multiple Source Domains for Whole Heart Segmentation

Authors: Franz Thaler, Darko Stern, Gernot Plank, Martin Urschler
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2508.04552
Pdf URL: https://arxiv.org/pdf/2508.04552
Copy Paste: [[2508.04552]] Augmentation-based Domain Generalization and Joint Training from Multiple Source Domains for Whole Heart Segmentation(https://arxiv.org/abs/2508.04552)
Keywords: segmentation
Abstract: As the leading cause of death worldwide, cardiovascular diseases motivate the development of more sophisticated methods to analyze the heart and its substructures from medical images like Computed Tomography (CT) and Magnetic Resonance (MR). Semantic segmentations of important cardiac structures that represent the whole heart are useful to assess patient-specific cardiac morphology and pathology. Furthermore, accurate semantic segmentations can be used to generate cardiac digital twin models which allows e.g. electrophysiological simulation and personalized therapy planning. Even though deep learning-based methods for medical image segmentation achieved great advancements over the last decade, retaining good performance under domain shift -- i.e. when training and test data are sampled from different data distributions -- remains challenging. In order to perform well on domains known at training-time, we employ a (1) balanced joint training approach that utilizes CT and MR data in equal amounts from different source domains. Further, aiming to alleviate domain shift towards domains only encountered at test-time, we rely on (2) strong intensity and spatial augmentation techniques to greatly diversify the available training data. Our proposed whole heart segmentation method, a 5-fold ensemble with our contributions, achieves the best performance for MR data overall and a performance similar to the best performance for CT data when compared to a model trained solely on CT. With 93.33% DSC and 0.8388 mm ASSD for CT and 89.30% DSC and 1.2411 mm ASSD for MR data, our method demonstrates great potential to efficiently obtain accurate semantic segmentations from which patient-specific cardiac twin models can be generated.

Title: One Model For All: Partial Diffusion for Unified Try-On and Try-Off in Any Pose

Authors: Jinxi Liu, Zijian He, Guangrun Wang, Guanbin Li, Liang Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04559
Pdf URL: https://arxiv.org/pdf/2508.04559
Copy Paste: [[2508.04559]] One Model For All: Partial Diffusion for Unified Try-On and Try-Off in Any Pose(https://arxiv.org/abs/2508.04559)
Keywords: diffusion, segmentation
Abstract: Recent diffusion-based approaches have made significant advances in image-based virtual try-on, enabling more realistic and end-to-end garment synthesis. However, most existing methods remain constrained by their reliance on exhibition garments and segmentation masks, as well as their limited ability to handle flexible pose variations. These limitations reduce their practicality in real-world scenarios-for instance, users cannot easily transfer garments worn by one person onto another, and the generated try-on results are typically restricted to the same pose as the reference image. In this paper, we introduce \textbf{OMFA} (\emph{One Model For All}), a unified diffusion framework for both virtual try-on and try-off that operates without the need for exhibition garments and supports arbitrary poses. For example, OMFA enables removing garments from a source person (try-off) and transferring them onto a target person (try-on), while also allowing the generated target to appear in novel poses-even without access to multi-pose images of that person. OMFA is built upon a novel \emph{partial diffusion} strategy that selectively applies noise and denoising to individual components of the joint input-such as the garment, the person image, or the face-enabling dynamic subtask control and efficient bidirectional garment-person transformation. The framework is entirely mask-free and requires only a single portrait and a target pose as input, making it well-suited for real-world applications. Additionally, by leveraging SMPL-X-based pose conditioning, OMFA supports multi-view and arbitrary-pose try-on from just one image. Extensive experiments demonstrate that OMFA achieves state-of-the-art results on both try-on and try-off tasks, providing a practical and generalizable solution for virtual garment synthesis. The project page is here: this https URL.

Title: Attack Pattern Mining to Discover Hidden Threats to Industrial Control Systems

Authors: Muhammad Azmi Umer, Chuadhry Mujeeb Ahmed, Aditya Mathur, Muhammad Taha Jilani
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2508.04561
Pdf URL: https://arxiv.org/pdf/2508.04561
Copy Paste: [[2508.04561]] Attack Pattern Mining to Discover Hidden Threats to Industrial Control Systems(https://arxiv.org/abs/2508.04561)
Keywords: security, attack
Abstract: This work focuses on validation of attack pattern mining in the context of Industrial Control System (ICS) security. A comprehensive security assessment of an ICS requires generating a large and variety of attack patterns. For this purpose we have proposed a data driven technique to generate attack patterns for an ICS. The proposed technique has been used to generate over 100,000 attack patterns from data gathered from an operational water treatment plant. In this work we present a detailed case study to validate the attack patterns.

Title: Drone Detection with Event Cameras

Authors: Gabriele Magrini, Lorenzo Berlincioni, Luca Cultrera, Federico Becattini, Pietro Pala
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04564
Pdf URL: https://arxiv.org/pdf/2508.04564
Copy Paste: [[2508.04564]] Drone Detection with Event Cameras(https://arxiv.org/abs/2508.04564)
Keywords: security, robust, diffusion
Abstract: The diffusion of drones presents significant security and safety challenges. Traditional surveillance systems, particularly conventional frame-based cameras, struggle to reliably detect these targets due to their small size, high agility, and the resulting motion blur and poor performance in challenging lighting conditions. This paper surveys the emerging field of event-based vision as a robust solution to these problems. Event cameras virtually eliminate motion blur and enable consistent detection in extreme lighting. Their sparse, asynchronous output suppresses static backgrounds, enabling low-latency focus on motion cues. We review the state-of-the-art in event-based drone detection, from data representation methods to advanced processing pipelines using spiking neural networks. The discussion extends beyond simple detection to cover more sophisticated tasks such as real-time tracking, trajectory forecasting, and unique identification through propeller signature analysis. By examining current methodologies, available datasets, and the distinct advantages of the technology, this work demonstrates that event-based vision provides a powerful foundation for the next generation of reliable, low-latency, and efficient counter-UAV systems.

Title: TAlignDiff: Automatic Tooth Alignment assisted by Diffusion-based Transformation Learning

Authors: Yunbi Liu, Enqi Tang, Shiyu Li, Lei Ma, Juncheng Li, Shu Lou, Yongchu Pan, Qingshan Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04565
Pdf URL: https://arxiv.org/pdf/2508.04565
Copy Paste: [[2508.04565]] TAlignDiff: Automatic Tooth Alignment assisted by Diffusion-based Transformation Learning(https://arxiv.org/abs/2508.04565)
Keywords: diffusion
Abstract: Orthodontic treatment hinges on tooth alignment, which significantly affects occlusal function, facial aesthetics, and patients' quality of life. Current deep learning approaches predominantly concentrate on predicting transformation matrices through imposing point-to-point geometric constraints for tooth alignment. Nevertheless, these matrices are likely associated with the anatomical structure of the human oral cavity and possess particular distribution characteristics that the deterministic point-to-point geometric constraints in prior work fail to capture. To address this, we introduce a new automatic tooth alignment method named TAlignDiff, which is supported by diffusion-based transformation learning. TAlignDiff comprises two main components: a primary point cloud-based regression network (PRN) and a diffusion-based transformation matrix denoising module (DTMD). Geometry-constrained losses supervise PRN learning for point cloud-level alignment. DTMD, as an auxiliary module, learns the latent distribution of transformation matrices from clinical data. We integrate point cloud-based transformation regression and diffusion-based transformation modeling into a unified framework, allowing bidirectional feedback between geometric constraints and diffusion refinement. Extensive ablation and comparative experiments demonstrate the effectiveness and superiority of our method, highlighting its potential in orthodontic treatment.

Title: Analyzing and Mitigating Object Hallucination: A Training Bias Perspective

Authors: Yifan Li, Kun Zhou, Wayne Xin Zhao, Lei Fang, Ji-Rong Wen
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2508.04567
Pdf URL: https://arxiv.org/pdf/2508.04567
Copy Paste: [[2508.04567]] Analyzing and Mitigating Object Hallucination: A Training Bias Perspective(https://arxiv.org/abs/2508.04567)
Keywords: generative
Abstract: As scaling up training data has significantly improved the general multimodal capabilities of Large Vision-Language Models (LVLMs), they still suffer from the hallucination issue, generating text that is inconsistent with the visual input. This phenomenon motivates us to systematically investigate the role of training data in hallucination. We introduce a new benchmark, POPEv2, which consists of counterfactual images collected from the training data of LVLMs with certain objects masked. Through comprehensive evaluation on POPEv2, we find that current LVLMs suffer from training bias: they fail to fully leverage their training data and hallucinate more frequently on images seen during training. Specifically, they perform poorly on counterfactual images, often incorrectly answering ``Yes'' to questions about masked objects. To understand this issue, we conduct probing experiments on the models' internal components, revealing that this training bias is primarily located in the language modeling (LM) head. Based on these findings, we propose Obliviate, an efficient and lightweight unlearning method designed to mitigate object hallucination via training bias unlearning. Obliviate identifies the discrepancy between ground-truth labels and model outputs on the training data as a proxy for bias and adopts a parameter- and data-efficient fine-tuning strategy that only updates the LM head. Extensive experiments demonstrate the effectiveness of our approach. While only reusing the training data and updating approximately 2\% of the parameters, Obliviate significantly reduces hallucination across both discriminative and generative tasks. Furthermore, it demonstrates strong scalability with respect to both model size (2B to 72B) and training data volume, and exhibits promising generalization to hallucination types beyond object-level hallucination. Our code and data will be publicly released.

Title: DDTracking: A Deep Generative Framework for Diffusion MRI Tractography with Streamline Local-Global Spatiotemporal Modeling

Authors: Yijie Li, Wei Zhang, Xi Zhu, Ye Wu, Yogesh Rathi, Lauren J. O'Donnell, Fan Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04568
Pdf URL: https://arxiv.org/pdf/2508.04568
Copy Paste: [[2508.04568]] DDTracking: A Deep Generative Framework for Diffusion MRI Tractography with Streamline Local-Global Spatiotemporal Modeling(https://arxiv.org/abs/2508.04568)
Keywords: robust, diffusion, generative
Abstract: This paper presents DDTracking, a novel deep generative framework for diffusion MRI tractography that formulates streamline propagation as a conditional denoising diffusion process. In DDTracking, we introduce a dual-pathway encoding network that jointly models local spatial encoding (capturing fine-scale structural details at each streamline point) and global temporal dependencies (ensuring long-range consistency across the entire streamline). Furthermore, we design a conditional diffusion model module, which leverages the learned local and global embeddings to predict streamline propagation orientations for tractography in an end-to-end trainable manner. We conduct a comprehensive evaluation across diverse, independently acquired dMRI datasets, including both synthetic and clinical data. Experiments on two well-established benchmarks with ground truth (ISMRM Challenge and TractoInferno) demonstrate that DDTracking largely outperforms current state-of-the-art tractography methods. Furthermore, our results highlight DDTracking's strong generalizability across heterogeneous datasets, spanning varying health conditions, age groups, imaging protocols, and scanner types. Collectively, DDTracking offers anatomically plausible and robust tractography, presenting a scalable, adaptable, and end-to-end learnable solution for broad dMRI applications. Code is available at: this https URL

Title: Visual Bias and Interpretability in Deep Learning for Dermatological Image Analysis

Authors: Enam Ahmed Taufik, Abdullah Khondoker, Antara Firoz Parsa, Seraj Al Mahmud Mostafa
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04573
Pdf URL: https://arxiv.org/pdf/2508.04573
Copy Paste: [[2508.04573]] Visual Bias and Interpretability in Deep Learning for Dermatological Image Analysis(https://arxiv.org/abs/2508.04573)
Keywords: robust, interpretability, transformer
Abstract: Accurate skin disease classification is a critical yet challenging task due to high inter-class similarity, intra-class variability, and complex lesion textures. While deep learning-based computer-aided diagnosis (CAD) systems have shown promise in automating dermatological assessments, their performance is highly dependent on image pre-processing and model architecture. This study proposes a deep learning framework for multi-class skin disease classification, systematically evaluating three image pre-processing techniques: standard RGB, CMY color space transformation, and Contrast Limited Adaptive Histogram Equalization (CLAHE). We benchmark the performance of pre-trained convolutional neural networks (DenseNet201, Efficient-NetB5) and transformer-based models (ViT, Swin Transformer, DinoV2 Large) using accuracy and F1-score as evaluation metrics. Results show that DinoV2 with RGB pre-processing achieves the highest accuracy (up to 93%) and F1-scores across all variants. Grad-CAM visualizations applied to RGB inputs further reveal precise lesion localization, enhancing interpretability. These findings underscore the importance of effective pre-processing and model choice in building robust and explainable CAD systems for dermatology.

Title: Share Your Attention: Transformer Weight Sharing via Matrix-based Dictionary Learning

Authors: Magauiya Zhussip, Dmitriy Shopkhoev, Ammar Ali, Stamatios Lefkimmiatis
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.04581
Pdf URL: https://arxiv.org/pdf/2508.04581
Copy Paste: [[2508.04581]] Share Your Attention: Transformer Weight Sharing via Matrix-based Dictionary Learning(https://arxiv.org/abs/2508.04581)
Keywords: robust, transformer, large language model
Abstract: Large language models (LLMs) have revolutionized AI applications, yet their high computational and memory demands hinder their widespread deployment. Existing compression techniques focus on intra-block optimizations (e.g. low-rank approximation, attention head pruning), while the repetitive layered structure of transformers implies significant inter-block redundancy - a dimension largely unexplored beyond key-value (KV) caching. Inspired by dictionary learning in CNNs, we propose a framework for structured weight sharing across transformer layers. Our approach decomposes attention projection matrices into shared dictionary atoms, reducing the attention module's parameters by 66.7% while achieving on-par performance. Unlike complex methods requiring distillation or architectural changes, MASA (Matrix Atom Sharing in Attention) operates as a drop-in replacement - trained with standard optimizers - and represents each layer's weights as linear combinations of shared matrix atoms. Experiments across scales (100M-700M parameters) show that MASA achieves better benchmark accuracy and perplexity than grouped-query attention (GQA), low-rank baselines and recently proposed Repeat-all-over/Sequential sharing at comparable parameter budgets. Ablation studies confirm robustness to the dictionary size and the efficacy of shared representations in capturing cross-layer statistical regularities. Extending to Vision Transformers (ViT), MASA matches performance metrics on image classification and detection tasks with 66.7% fewer attention parameters. By combining dictionary learning strategies with transformer efficiency, MASA offers a scalable blueprint for parameter-efficient models without sacrificing performance. Finally, we investigate the possibility of employing MASA on pretrained LLMs to reduce their number of parameters without experiencing any significant drop in their performance.

Title: Measuring the Carbon Footprint of Cryptographic Privacy-Enhancing Technologies

Authors: Marc Damie, Mihai Pop, Merijn Posthuma
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2508.04583
Pdf URL: https://arxiv.org/pdf/2508.04583
Copy Paste: [[2508.04583]] Measuring the Carbon Footprint of Cryptographic Privacy-Enhancing Technologies(https://arxiv.org/abs/2508.04583)
Keywords: privacy, protect
Abstract: Privacy-enhancing technologies (PETs) have attracted significant attention in response to privacy regulations, driving the development of applications that prioritize user data protection. At the same time, the information and communication technology (ICT) sector faces growing pressure to reduce its environmental footprint, particularly its carbon emissions. While numerous studies have assessed the energy footprint of various ICT applications, the environmental footprint of cryptographic PETs remains largely unexplored. Our work addresses this gap by proposing a standardized methodology for evaluating the carbon footprint of PETs. To demonstrate this methodology, we focus on PETs supporting client-server applications as they are the simplest to deploy. In particular, we measure the energy consumption and carbon footprint increase induced by five cryptographic PETs (compared to their non-private equivalent): HTTPS web browsing, encrypted machine learning (ML) inference, encrypted ML training, encrypted databases, and encrypted emails. Our findings reveal significant variability in carbon footprint increases, ranging from a twofold increase in HTTPS web browsing to a 100,000-fold increase in encrypted ML. Our study provides essential data to help decision-makers assess privacy-carbon trade-offs in such applications. Finally, we outline key research directions for developing PETs that balance strong privacy protection with environmental sustainability.

Title: Pseudo Depth Meets Gaussian: A Feed-forward RGB SLAM Baseline

Authors: Linqing Zhao, Xiuwei Xu, Yirui Wang, Hao Wang, Wenzhao Zheng, Yansong Tang, Haibin Yan, Jiwen Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04597
Pdf URL: https://arxiv.org/pdf/2508.04597
Copy Paste: [[2508.04597]] Pseudo Depth Meets Gaussian: A Feed-forward RGB SLAM Baseline(https://arxiv.org/abs/2508.04597)
Keywords: robust
Abstract: Incrementally recovering real-sized 3D geometry from a pose-free RGB stream is a challenging task in 3D reconstruction, requiring minimal assumptions on input data. Existing methods can be broadly categorized into end-to-end and visual SLAM-based approaches, both of which either struggle with long sequences or depend on slow test-time optimization and depth sensors. To address this, we first integrate a depth estimator into an RGB-D SLAM system, but this approach is hindered by inaccurate geometric details in predicted depth. Through further investigation, we find that 3D Gaussian mapping can effectively solve this problem. Building on this, we propose an online 3D reconstruction method using 3D Gaussian-based SLAM, combined with a feed-forward recurrent prediction module to directly infer camera pose from optical flow. This approach replaces slow test-time optimization with fast network inference, significantly improving tracking speed. Additionally, we introduce a local graph rendering technique to enhance robustness in feed-forward pose prediction. Experimental results on the Replica and TUM-RGBD datasets, along with a real-world deployment demonstration, show that our method achieves performance on par with the state-of-the-art SplaTAM, while reducing tracking time by more than 90\%.

Title: TURA: Tool-Augmented Unified Retrieval Agent for AI Search

Authors: Zhejun Zhao, Yuehu Dong, Alley Liu, Lixue Zheng, Pingsheng Liu, Dongdong Shen, Long Xia, Jiashu Zhao, Dawei Yin
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2508.04604
Pdf URL: https://arxiv.org/pdf/2508.04604
Copy Paste: [[2508.04604]] TURA: Tool-Augmented Unified Retrieval Agent for AI Search(https://arxiv.org/abs/2508.04604)
Keywords: robust, large language model
Abstract: The advent of Large Language Models (LLMs) is transforming search engines into conversational AI search products, primarily using Retrieval-Augmented Generation (RAG) on web corpora. However, this paradigm has significant industrial limitations. Traditional RAG approaches struggle with real-time needs and structured queries that require accessing dynamically generated content like ticket availability or inventory. Limited to indexing static pages, search engines cannot perform the interactive queries needed for such time-sensitive data. Academic research has focused on optimizing RAG for static content, overlooking complex intents and the need for dynamic sources like databases and real-time APIs. To bridge this gap, we introduce TURA (Tool-Augmented Unified Retrieval Agent for AI Search), a novel three-stage framework that combines RAG with agentic tool-use to access both static content and dynamic, real-time information. TURA has three key components: an Intent-Aware Retrieval module to decompose queries and retrieve information sources encapsulated as Model Context Protocol (MCP) Servers, a DAG-based Task Planner that models task dependencies as a Directed Acyclic Graph (DAG) for optimal parallel execution, and a lightweight Distilled Agent Executor for efficient tool calling. TURA is the first architecture to systematically bridge the gap between static RAG and dynamic information sources for a world-class AI search product. Serving tens of millions of users, it leverages an agentic framework to deliver robust, real-time answers while meeting the low-latency demands of a large-scale industrial system.

Title: Multitask Learning with Stochastic Interpolants

Authors: Hugo Negrel, Florentin Coeurdoux, Michael S. Albergo, Eric Vanden-Eijnden
Subjects: cs.LG, math.DS
Abstract URL: https://arxiv.org/abs/2508.04605
Pdf URL: https://arxiv.org/pdf/2508.04605
Copy Paste: [[2508.04605]] Multitask Learning with Stochastic Interpolants(https://arxiv.org/abs/2508.04605)
Keywords: diffusion, generative
Abstract: We propose a framework for learning maps between probability distributions that broadly generalizes the time dynamics of flow and diffusion models. To enable this, we generalize stochastic interpolants by replacing the scalar time variable with vectors, matrices, or linear operators, allowing us to bridge probability distributions across multiple dimensional spaces. This approach enables the construction of versatile generative models capable of fulfilling multiple tasks without task-specific training. Our operator-based interpolants not only provide a unifying theoretical perspective for existing generative models but also extend their capabilities. Through numerical experiments, we demonstrate the zero-shot efficacy of our method on conditional generation and inpainting, fine-tuning and posterior sampling, and multiscale modeling, suggesting its potential as a generic task-agnostic alternative to specialized models.

Title: Neuromorphic Cybersecurity with Semi-supervised Lifelong Learning

Authors: Md Zesun Ahmed Mia, Malyaban Bal, Sen Lu, George M. Nishibuchi, Suhas Chelian, Srini Vasan, Abhronil Sengupta
Subjects: cs.LG, cs.AI, cs.ET, cs.NE
Abstract URL: https://arxiv.org/abs/2508.04610
Pdf URL: https://arxiv.org/pdf/2508.04610
Copy Paste: [[2508.04610]] Neuromorphic Cybersecurity with Semi-supervised Lifelong Learning(https://arxiv.org/abs/2508.04610)
Keywords: security, attack, robust
Abstract: Inspired by the brain's hierarchical processing and energy efficiency, this paper presents a Spiking Neural Network (SNN) architecture for lifelong Network Intrusion Detection System (NIDS). The proposed system first employs an efficient static SNN to identify potential intrusions, which then activates an adaptive dynamic SNN responsible for classifying the specific attack type. Mimicking biological adaptation, the dynamic classifier utilizes Grow When Required (GWR)-inspired structural plasticity and a novel Adaptive Spike-Timing-Dependent Plasticity (Ad-STDP) learning rule. These bio-plausible mechanisms enable the network to learn new threats incrementally while preserving existing knowledge. Tested on the UNSW-NB15 benchmark in a continual learning setting, the architecture demonstrates robust adaptation, reduced catastrophic forgetting, and achieves $85.3$\% overall accuracy. Furthermore, simulations using the Intel Lava framework confirm high operational sparsity, highlighting the potential for low-power deployment on neuromorphic hardware.

Title: OmniDepth: Bridging Monocular and Stereo Reasoning with Latent Alignment

Authors: Tongfan Guan, Jiaxin Guo, Chen Wang, Yun-Hui Liu
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2508.04611
Pdf URL: https://arxiv.org/pdf/2508.04611
Copy Paste: [[2508.04611]] OmniDepth: Bridging Monocular and Stereo Reasoning with Latent Alignment(https://arxiv.org/abs/2508.04611)
Keywords: robust
Abstract: Monocular and stereo depth estimation offer complementary strengths: monocular methods capture rich contextual priors but lack geometric precision, while stereo approaches leverage epipolar geometry yet struggle with ambiguities such as reflective or textureless surfaces. Despite post-hoc synergies, these paradigms remain largely disjoint in practice. We introduce OmniDepth, a unified framework that bridges both through iterative bidirectional alignment of their latent representations. At its core, a novel cross-attentive alignment mechanism dynamically synchronizes monocular contextual cues with stereo hypothesis representations during stereo reasoning. This mutual alignment resolves stereo ambiguities (e.g., specular surfaces) by injecting monocular structure priors while refining monocular depth with stereo geometry within a single network. Extensive experiments demonstrate state-of-the-art results: \textbf{OmniDepth reduces zero-shot generalization error by $\!>\!40\%$ on Middlebury and ETH3D}, while addressing longstanding failures on transparent and reflective surfaces. By harmonizing multi-view geometry with monocular context, OmniDepth enables robust 3D perception that transcends modality-specific limitations. Codes available at this https URL.

Title: How Does Bilateral Ear Symmetry Affect Deep Ear Features?

Authors: Kagan Ozturk, Deeksha Arun, Kevin W. Bowyer, Patrick Flynn
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04614
Pdf URL: https://arxiv.org/pdf/2508.04614
Copy Paste: [[2508.04614]] How Does Bilateral Ear Symmetry Affect Deep Ear Features?(https://arxiv.org/abs/2508.04614)
Keywords: biometric
Abstract: Ear recognition has gained attention as a reliable biometric technique due to the distinctive characteristics of human ears. With the increasing availability of large-scale datasets, convolutional neural networks (CNNs) have been widely adopted to learn features directly from raw ear images, outperforming traditional hand-crafted methods. However, the effect of bilateral ear symmetry on the features learned by CNNs has received little attention in recent studies. In this paper, we investigate how bilateral ear symmetry influences the effectiveness of CNN-based ear recognition. To this end, we first develop an ear side classifier to automatically categorize ear images as either left or right. We then explore the impact of incorporating this side information during both training and test. Cross-dataset evaluations are conducted on five datasets. Our results suggest that treating left and right ears separately during training and testing can lead to notable performance improvements. Furthermore, our ablation studies on alignment strategies, input sizes, and various hyperparameter settings provide practical insights into training CNN-based ear recognition systems on large-scale datasets to achieve higher verification rates.

Title: Lightweight Transformers for Zero-Shot and Fine-Tuned Text-to-SQL Generation Using Spider

Authors: Chirag Seth, Utkarsh Singh
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2508.04623
Pdf URL: https://arxiv.org/pdf/2508.04623
Copy Paste: [[2508.04623]] Lightweight Transformers for Zero-Shot and Fine-Tuned Text-to-SQL Generation Using Spider(https://arxiv.org/abs/2508.04623)
Keywords: transformer
Abstract: Text-to-SQL translation enables non-expert users to query relational databases using natural language, with applications in education and business intelligence. This study evaluates three lightweight transformer models - T5-Small, BART-Small, and GPT-2 - on the Spider dataset, focusing on low-resource settings. We developed a reusable, model-agnostic pipeline that tailors schema formatting to each model's architecture, training them across 1000 to 5000 iterations and evaluating on 1000 test samples using Logical Form Accuracy (LFAcc), BLEU, and Exact Match (EM) metrics. Fine-tuned T5-Small achieves the highest LFAcc (27.8%), outperforming BART-Small (23.98%) and GPT-2 (20.1%), highlighting encoder-decoder models' superiority in schema-aware SQL generation. Despite resource constraints limiting performance, our pipeline's modularity supports future enhancements, such as advanced schema linking or alternative base models. This work underscores the potential of compact transformers for accessible text-to-SQL solutions in resource-scarce environments.

Title: FinMMR: Make Financial Numerical Reasoning More Multimodal, Comprehensive, and Challenging

Authors: Zichen Tang, Haihong E, Jiacheng Liu, Zhongjun Yang, Rongjin Li, Zihua Rong, Haoyang He, Zhuodi Hao, Xinyang Hu, Kun Ji, Ziyan Ma, Mengyuan Ji, Jun Zhang, Chenghao Ma, Qianhe Zheng, Yang Liu, Yiling Huang, Xinyi Hu, Qing Huang, Zijian Xie, Shiyao Peng
Subjects: cs.CV, cs.CE
Abstract URL: https://arxiv.org/abs/2508.04625
Pdf URL: https://arxiv.org/pdf/2508.04625
Copy Paste: [[2508.04625]] FinMMR: Make Financial Numerical Reasoning More Multimodal, Comprehensive, and Challenging(https://arxiv.org/abs/2508.04625)
Keywords: large language model
Abstract: We present FinMMR, a novel bilingual multimodal benchmark tailored to evaluate the reasoning capabilities of multimodal large language models (MLLMs) in financial numerical reasoning tasks. Compared to existing benchmarks, our work introduces three significant advancements. (1) Multimodality: We meticulously transform existing financial reasoning benchmarks, and construct novel questions from the latest Chinese financial research reports. FinMMR comprises 4.3K questions and 8.7K images spanning 14 categories, including tables, bar charts, and ownership structure charts. (2) Comprehensiveness: FinMMR encompasses 14 financial subdomains, including corporate finance, banking, and industry analysis, significantly exceeding existing benchmarks in financial domain knowledge breadth. (3) Challenge: Models are required to perform multi-step precise numerical reasoning by integrating financial knowledge with the understanding of complex financial images and text. The best-performing MLLM achieves only 53.0% accuracy on Hard problems. We believe that FinMMR will drive advancements in enhancing the reasoning capabilities of MLLMs in real-world scenarios.

Title: P-Aligner: Enabling Pre-Alignment of Language Models via Principled Instruction Synthesis

Authors: Feifan Song, Bofei Gao, Yifan Song, Yi Liu, Weimin Xiong, Yuyang Song, Tianyu Liu, Guoyin Wang, Houfeng Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.04626
Pdf URL: https://arxiv.org/pdf/2508.04626
Copy Paste: [[2508.04626]] P-Aligner: Enabling Pre-Alignment of Language Models via Principled Instruction Synthesis(https://arxiv.org/abs/2508.04626)
Keywords: large language model
Abstract: Large Language Models (LLMs) are expected to produce safe, helpful, and honest content during interaction with human users, but they frequently fail to align with such values when given flawed instructions, e.g., missing context, ambiguous directives, or inappropriate tone, leaving substantial room for improvement along multiple dimensions. A cost-effective yet high-impact way is to pre-align instructions before the model begins decoding. Existing approaches either rely on prohibitive test-time search costs or end-to-end model rewrite, which is powered by a customized training corpus with unclear objectives. In this work, we demonstrate that the goal of efficient and effective preference alignment can be achieved by P-Aligner, a lightweight module generating instructions that preserve the original intents while being expressed in a more human-preferred form. P-Aligner is trained on UltraPrompt, a new dataset synthesized via a proposed principle-guided pipeline using Monte-Carlo Tree Search, which systematically explores the space of candidate instructions that are closely tied to human preference. Experiments across different methods show that P-Aligner generally outperforms strong baselines across various models and benchmarks, including average win-rate gains of 28.35% and 8.69% on GPT-4-turbo and Gemma-2-SimPO, respectively. Further analyses validate its effectiveness and efficiency through multiple perspectives, including data quality, search strategies, iterative deployment, and time overhead.

Title: CaPulse: Detecting Anomalies by Tuning in to the Causal Rhythms of Time Series

Authors: Yutong Xia, Yingying Zhang, Yuxuan Liang, Lunting Fan, Qingsong Wen, Roger Zimmermann
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.04630
Pdf URL: https://arxiv.org/pdf/2508.04630
Copy Paste: [[2508.04630]] CaPulse: Detecting Anomalies by Tuning in to the Causal Rhythms of Time Series(https://arxiv.org/abs/2508.04630)
Keywords: interpretability
Abstract: Time series anomaly detection has garnered considerable attention across diverse domains. While existing methods often fail to capture the underlying mechanisms behind anomaly generation in time series data. In addition, time series anomaly detection often faces several data-related inherent challenges, i.e., label scarcity, data imbalance, and complex multi-periodicity. In this paper, we leverage causal tools and introduce a new causality-based framework, CaPulse, which tunes in to the underlying causal pulse of time series data to effectively detect anomalies. Concretely, we begin by building a structural causal model to decipher the generation processes behind anomalies. To tackle the challenges posed by the data, we propose Periodical Normalizing Flows with a novel mask mechanism and carefully designed periodical learners, creating a periodicity-aware, density-based anomaly detection approach. Extensive experiments on seven real-world datasets demonstrate that CaPulse consistently outperforms existing methods, achieving AUROC improvements of 3% to 17%, with enhanced interpretability.

Title: IFDECORATOR: Wrapping Instruction Following Reinforcement Learning with Verifiable Rewards

Authors: Xu Guo, Tianyi Liang, Tong Jian, Xiaogui Yang, Ling-I Wu, Chenhui Li, Zhihui Lu, Qipeng Guo, Kai Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.04632
Pdf URL: https://arxiv.org/pdf/2508.04632
Copy Paste: [[2508.04632]] IFDECORATOR: Wrapping Instruction Following Reinforcement Learning with Verifiable Rewards(https://arxiv.org/abs/2508.04632)
Keywords: robust, large language model
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) improves instruction following capabilities of large language models (LLMs), but suffers from training inefficiency due to inadequate difficulty assessment. Moreover, RLVR is prone to over-optimization, where LLMs exploit verification shortcuts without aligning to the actual intent of user instructions. We introduce Instruction Following Decorator (IFDecorator}, a framework that wraps RLVR training into a robust and sample-efficient pipeline. It consists of three components: (1) a cooperative-adversarial data flywheel that co-evolves instructions and hybrid verifications, generating progressively more challenging instruction-verification pairs; (2) IntentCheck, a bypass module enforcing intent alignment; and (3) trip wires, a diagnostic mechanism that detects reward hacking via trap instructions, which trigger and capture shortcut exploitation behaviors. Our Qwen2.5-32B-Instruct-IFDecorator achieves 87.43% accuracy on IFEval, outperforming larger proprietary models such as GPT-4o. Additionally, we demonstrate substantial improvements on FollowBench while preserving general capabilities. Our trip wires show significant reductions in reward hacking rates. We will release models, code, and data for future research.

Title: 4-Swap: Achieving Grief-Free and Bribery-Safe Atomic Swaps Using Four Transactions

Authors: Kirti Singh (1 and 2), Vinay J. Ribeiro (1), Susmita Mandal (2) ((1) Indian Institute of Technology Bombay, India, (2) Institute for Development and Research in Banking Technology, Hyderabad, India)
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2508.04641
Pdf URL: https://arxiv.org/pdf/2508.04641
Copy Paste: [[2508.04641]] 4-Swap: Achieving Grief-Free and Bribery-Safe Atomic Swaps Using Four Transactions(https://arxiv.org/abs/2508.04641)
Keywords: security, attack, robust
Abstract: Cross-chain asset exchange is crucial for blockchain interoperability. Existing solutions rely on trusted third parties and risk asset loss, or use decentralized alternatives like atomic swaps, which suffer from grief attacks. Griefing occurs when a party prematurely exits, locking the counterparty's assets until a timelock expires. Hedged Atomic Swaps mitigate griefing by introducing a penalty premium; however, they increase the number of transactions from four (as in Tier Nolan's swap) to six, which in turn introduces new griefing risks. Grief-Free (GF) Swap reduces this to five transactions by consolidating assets and premiums on a single chain. However, no existing protocol achieves grief-free asset exchange in just four transactions. This paper presents 4-Swap, the first cross-chain atomic swap protocol that is both grief-free and bribery-safe, while completing asset exchange in just four transactions. By combining the griefing premium and principal into a single transaction per chain, 4-Swap reduces on-chain transactions, leading to faster execution compared to previous grief-free solutions. It is fully compatible with Bitcoin and operates without the need for any new opcodes. A game-theoretic analysis shows that rational participants have no incentive to deviate from the protocol, ensuring robust compliance and security.

Title: X-SAM: From Segment Anything to Any Segmentation

Authors: Hao Wang, Limeng Qiao, Zequn Jie, Zhijian Huang, Chengjian Feng, Qingfang Zheng, Lin Ma, Xiangyuan Lan, Xiaodan Liang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.04655
Pdf URL: https://arxiv.org/pdf/2508.04655
Copy Paste: [[2508.04655]] X-SAM: From Segment Anything to Any Segmentation(https://arxiv.org/abs/2508.04655)
Keywords: large language model, segmentation
Abstract: Large Language Models (LLMs) demonstrate strong capabilities in broad knowledge representation, yet they are inherently deficient in pixel-level perceptual understanding. Although the Segment Anything Model (SAM) represents a significant advancement in visual-prompt-driven image segmentation, it exhibits notable limitations in multi-mask prediction and category-specific segmentation tasks, and it cannot integrate all segmentation tasks within a unified model architecture. To address these limitations, we present X-SAM, a streamlined Multimodal Large Language Model (MLLM) framework that extends the segmentation paradigm from \textit{segment anything} to \textit{any segmentation}. Specifically, we introduce a novel unified framework that enables more advanced pixel-level perceptual comprehension for MLLMs. Furthermore, we propose a new segmentation task, termed Visual GrounDed (VGD) segmentation, which segments all instance objects with interactive visual prompts and empowers MLLMs with visual grounded, pixel-wise interpretative capabilities. To enable effective training on diverse data sources, we present a unified training strategy that supports co-training across multiple datasets. Experimental results demonstrate that X-SAM achieves state-of-the-art performance on a wide range of image segmentation benchmarks, highlighting its efficiency for multimodal, pixel-level visual understanding. Code is available at this https URL.

Title: YOLOv8-Based Deep Learning Model for Automated Poultry Disease Detection and Health Monitoring paper

Authors: Akhil Saketh Reddy Sabbella, Ch.Lakshmi Prachothan, Eswar Kumar Panta
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.04658
Pdf URL: https://arxiv.org/pdf/2508.04658
Copy Paste: [[2508.04658]] YOLOv8-Based Deep Learning Model for Automated Poultry Disease Detection and Health Monitoring paper(https://arxiv.org/abs/2508.04658)
Keywords: security
Abstract: In the poultry industry, detecting chicken illnesses is essential to avoid financial losses. Conventional techniques depend on manual observation, which is laborious and prone to mistakes. Using YOLO v8 a deep learning model for real-time object recognition. This study suggests an AI based approach, by developing a system that analyzes high resolution chicken photos, YOLO v8 detects signs of illness, such as abnormalities in behavior and appearance. A sizable, annotated dataset has been used to train the algorithm, which provides accurate real-time identification of infected chicken and prompt warnings to farm operators for prompt action. By facilitating early infection identification, eliminating the need for human inspection, and enhancing biosecurity in large-scale farms, this AI technology improves chicken health management. The real-time features of YOLO v8 provide a scalable and effective method for improving farm management techniques.

Title: Multi-module GRPO: Composing Policy Gradients and Prompt Optimization for Language Model Programs

Authors: Noah Ziems, Dilara Soylu, Lakshya A Agrawal, Isaac Miller, Liheng Lai, Chen Qian, Kaiqiang Song, Meng Jiang, Dan Klein, Matei Zaharia, Karel D'Oosterlinck, Christopher Potts, Omar Khattab
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.04660
Pdf URL: https://arxiv.org/pdf/2508.04660
Copy Paste: [[2508.04660]] Multi-module GRPO: Composing Policy Gradients and Prompt Optimization for Language Model Programs(https://arxiv.org/abs/2508.04660)
Keywords: privacy
Abstract: Group Relative Policy Optimization (GRPO) has proven to be an effective tool for post-training language models (LMs). However, AI systems are increasingly expressed as modular programs that mix together multiple LM calls with distinct prompt templates and other tools, and it is not clear how best to leverage GRPO to improve these systems. We begin to address this challenge by defining mmGRPO, a simple multi-module generalization of GRPO that groups LM calls by module across rollouts and handles variable-length and interrupted trajectories. We find that mmGRPO, composed with automatic prompt optimization, improves accuracy by 11% on average across classification, many-hop search, and privacy-preserving delegation tasks against the post-trained LM, and by 5% against prompt optimization on its own. We open-source mmGRPO in DSPy as the this http URL optimizer.

Title: HierarchicalPrune: Position-Aware Compression for Large-Scale Diffusion Models

Authors: Young D. Kwon, Rui Li, Sijia Li, Da Li, Sourav Bhattacharya, Stylianos I. Venieris
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.04663
Pdf URL: https://arxiv.org/pdf/2508.04663
Copy Paste: [[2508.04663]] HierarchicalPrune: Position-Aware Compression for Large-Scale Diffusion Models(https://arxiv.org/abs/2508.04663)
Keywords: protect, diffusion
Abstract: State-of-the-art text-to-image diffusion models (DMs) achieve remarkable quality, yet their massive parameter scale (8-11B) poses significant challenges for inferences on resource-constrained devices. In this paper, we present HierarchicalPrune, a novel compression framework grounded in a key observation: DM blocks exhibit distinct functional hierarchies, where early blocks establish semantic structures while later blocks handle texture refinements. HierarchicalPrune synergistically combines three techniques: (1) Hierarchical Position Pruning, which identifies and removes less essential later blocks based on position hierarchy; (2) Positional Weight Preservation, which systematically protects early model portions that are essential for semantic structural integrity; and (3) Sensitivity-Guided Distillation, which adjusts knowledge-transfer intensity based on our discovery of block-wise sensitivity variations. As a result, our framework brings billion-scale diffusion models into a range more suitable for on-device inference, while preserving the quality of the output images. Specifically, when combined with INT4 weight quantisation, HierarchicalPrune achieves 77.5-80.4% memory footprint reduction (e.g., from 15.8 GB to 3.2 GB) and 27.9-38.0% latency reduction, measured on server and consumer grade GPUs, with the minimum drop of 2.6% in GenEval score and 7% in HPSv2 score compared to the original model. Last but not least, our comprehensive user study with 85 participants demonstrates that HierarchicalPrune maintains perceptual quality comparable to the original model while significantly outperforming prior works.

Title: Sculptor: Empowering LLMs with Cognitive Agency via Active Context Management

Authors: Mo Li, L.H. Xu, Qitai Tan, Ting Cao, Yunxin Liu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.04664
Pdf URL: https://arxiv.org/pdf/2508.04664
Copy Paste: [[2508.04664]] Sculptor: Empowering LLMs with Cognitive Agency via Active Context Management(https://arxiv.org/abs/2508.04664)
Keywords: robust, large language model
Abstract: Large Language Models (LLMs) suffer from significant performance degradation when processing long contexts due to proactive interference, where irrelevant information in earlier parts of the context disrupts reasoning and memory recall. While most research focuses on external memory systems to augment LLMs' capabilities, we propose a complementary approach: empowering LLMs with Active Context Management (ACM) tools to actively sculpt their internal working memory. We introduce Sculptor, a framework that equips LLMs with three categories of tools: (1) context fragmentation, (2) summary, hide, and restore, and (3) intelligent search. Our approach enables LLMs to proactively manage their attention and working memory, analogous to how humans selectively focus on relevant information while filtering out distractions. Experimental evaluation on information-sparse benchmarks-PI-LLM (proactive interference) and NeedleBench Multi-Needle Reasoning-demonstrates that Sculptor significantly improves performance even without specific training, leveraging LLMs' inherent tool calling generalization capabilities. By enabling Active Context Management, Sculptor not only mitigates proactive interference but also provides a cognitive foundation for more reliable reasoning across diverse long-context tasks-highlighting that explicit context-control strategies, rather than merely larger token windows, are key to robustness at scale.

Title: Perch 2.0: The Bittern Lesson for Bioacoustics

Authors: Bart van Merriënboer, Vincent Dumoulin, Jenny Hamer, Lauren Harrell, Andrea Burns, Tom Denton
Subjects: cs.LG, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2508.04665
Pdf URL: https://arxiv.org/pdf/2508.04665
Copy Paste: [[2508.04665]] Perch 2.0: The Bittern Lesson for Bioacoustics(https://arxiv.org/abs/2508.04665)
Keywords: robust
Abstract: Perch is a performant pre-trained model for bioacoustics. It was trained in supervised fashion, providing both off-the-shelf classification scores for thousands of vocalizing species as well as strong embeddings for transfer learning. In this new release, Perch 2.0, we expand from training exclusively on avian species to a large multi-taxa dataset. The model is trained with self-distillation using a prototype-learning classifier as well as a new source-prediction training criterion. Perch 2.0 obtains state-of-the-art performance on the BirdSet and BEANS benchmarks. It also outperforms specialized marine models on marine transfer learning tasks, despite having almost no marine training data. We present hypotheses as to why fine-grained species classification is a particularly robust pre-training task for bioacoustics.

Title: Robustly Learning Monotone Single-Index Models

Authors: Puqian Wang, Nikos Zarifis, Ilias Diakonikolas, Jelena Diakonikolas
Subjects: cs.LG, math.OC
Abstract URL: https://arxiv.org/abs/2508.04670
Pdf URL: https://arxiv.org/pdf/2508.04670
Copy Paste: [[2508.04670]] Robustly Learning Monotone Single-Index Models(https://arxiv.org/abs/2508.04670)
Keywords: robust
Abstract: We consider the basic problem of learning Single-Index Models with respect to the square loss under the Gaussian distribution in the presence of adversarial label noise. Our main contribution is the first computationally efficient algorithm for this learning task, achieving a constant factor approximation, that succeeds for the class of {\em all} monotone activations with bounded moment of order $2 + \zeta,$ for $\zeta > 0.$ This class in particular includes all monotone Lipschitz functions and even discontinuous functions like (possibly biased) halfspaces. Prior work for the case of unknown activation either does not attain constant factor approximation or succeeds for a substantially smaller family of activations. The main conceptual novelty of our approach lies in developing an optimization framework that steps outside the boundaries of usual gradient methods and instead identifies a useful vector field to guide the algorithm updates by directly leveraging the problem structure, properties of Gaussian spaces, and regularity of monotone functions.

Title: GeRe: Towards Efficient Anti-Forgetting in Continual Learning of LLM via General Samples Replay

Authors: Yunan Zhang, Shuoran Jiang, Mengchen Zhao, Yuefeng Li, Yang Fan, Xiangping Wu, Qingcai Chen
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.04676
Pdf URL: https://arxiv.org/pdf/2508.04676
Copy Paste: [[2508.04676]] GeRe: Towards Efficient Anti-Forgetting in Continual Learning of LLM via General Samples Replay(https://arxiv.org/abs/2508.04676)
Keywords: robust, large language model
Abstract: The continual learning capability of large language models (LLMs) is crucial for advancing artificial general intelligence. However, continual fine-tuning LLMs across various domains often suffers from catastrophic forgetting, characterized by: 1) significant forgetting of their general capabilities, and 2) sharp performance declines in previously learned tasks. To simultaneously address both issues in a simple yet stable manner, we propose General Sample Replay (GeRe), a framework that use usual pretraining texts for efficient anti-forgetting. Beyond revisiting the most prevalent replay-based practices under GeRe, we further leverage neural states to introduce a enhanced activation states constrained optimization method using threshold-based margin (TM) loss, which maintains activation state consistency during replay learning. We are the first to validate that a small, fixed set of pre-collected general replay samples is sufficient to resolve both concerns--retaining general capabilities while promoting overall performance across sequential tasks. Indeed, the former can inherently facilitate the latter. Through controlled experiments, we systematically compare TM with different replay strategies under the GeRe framework, including vanilla label fitting, logit imitation via KL divergence and feature imitation via L1/L2 losses. Results demonstrate that TM consistently improves performance and exhibits better robustness. Our work paves the way for efficient replay of LLMs for the future. Our code and data are available at this https URL.

Title: ANPrompt: Anti-noise Prompt Tuning for Vision-Language Models

Authors: Yansheng Gao, Yufei Zheng, Jinghan Qu, Zixi Zhu, Yukuan Zhang, Shengsheng Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04677
Pdf URL: https://arxiv.org/pdf/2508.04677
Copy Paste: [[2508.04677]] ANPrompt: Anti-noise Prompt Tuning for Vision-Language Models(https://arxiv.org/abs/2508.04677)
Keywords: robust
Abstract: Prompt tuning has emerged as an efficient and effective technique for adapting vision-language models (VLMs) with low computational overhead. However, existing methods often overlook the vulnerability of prompt-tuned VLMs to weak semantic perturbations-such as subtle image or text noise-that degrade their generalization to unseen classes. To address this limitation, we propose ANPrompt, a novel prompt tuning framework designed to enhance robustness under such perturbations. ANPrompt first constructs weak noise text features by fusing original and noise-perturbed text embeddings, which are then clustered to form noise prompts. These noise prompts are integrated with learnable prompt tokens to generate anti-noise prompts, which are injected into the deeper layers of both image and text encoders. To further capture the noise-aware visual semantics, ANPrompt computes the Noise-Resistant Visual Prompt Prototype (NRVPP) by averaging the output prompt tokens from the vision encoder. Finally, ANPrompt introduces alignment, robustness, and anti-noise objectives by computing a Weak semantic noise Alignment Loss (WALoss) alongside the standard cross-entropy and sim loss. Experiments across 11 benchmarks demonstrate that ANPrompt consistently outperforms existing prompt tuning approaches, achieving superior robustness to semantic noise and improved generalization to novel categories.

Title: Hop, Skip, and Overthink: Diagnosing Why Reasoning Models Fumble during Multi-Hop Analysis

Authors: Anushka Yadav, Isha Nalawade, Srujana Pillarichety, Yashwanth Babu, Reshmi Ghosh, Samyadeep Basu, Wenlong Zhao, Ali Nasaeh, Sriram Balasubramanian, Soundararajan Srinivasan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.04699
Pdf URL: https://arxiv.org/pdf/2508.04699
Copy Paste: [[2508.04699]] Hop, Skip, and Overthink: Diagnosing Why Reasoning Models Fumble during Multi-Hop Analysis(https://arxiv.org/abs/2508.04699)
Keywords: robust
Abstract: The emergence of reasoning models and their integration into practical AI chat bots has led to breakthroughs in solving advanced math, deep search, and extractive question answering problems that requires a complex and multi-step thought process. Yet, a complete understanding of why these models hallucinate more than general purpose language models is missing. In this investigative study, we systematicallyexplore reasoning failures of contemporary language models on multi-hop question answering tasks. We introduce a novel, nuanced error categorization framework that examines failures across three critical dimensions: the diversity and uniqueness of source documents involved ("hops"), completeness in capturing relevant information ("coverage"), and cognitive inefficiency ("overthinking"). Through rigorous hu-man annotation, supported by complementary automated metrics, our exploration uncovers intricate error patterns often hidden by accuracy-centric evaluations. This investigative approach provides deeper insights into the cognitive limitations of current models and offers actionable guidance toward enhancing reasoning fidelity, transparency, and robustness in future language modeling efforts.

Title: BEVCon: Advancing Bird's Eye View Perception with Contrastive Learning

Authors: Ziyang Leng, Jiawei Yang, Zhicheng Ren, Bolei Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04702
Pdf URL: https://arxiv.org/pdf/2508.04702
Copy Paste: [[2508.04702]] BEVCon: Advancing Bird's Eye View Perception with Contrastive Learning(https://arxiv.org/abs/2508.04702)
Keywords: segmentation
Abstract: We present BEVCon, a simple yet effective contrastive learning framework designed to improve Bird's Eye View (BEV) perception in autonomous driving. BEV perception offers a top-down-view representation of the surrounding environment, making it crucial for 3D object detection, segmentation, and trajectory prediction tasks. While prior work has primarily focused on enhancing BEV encoders and task-specific heads, we address the underexplored potential of representation learning in BEV models. BEVCon introduces two contrastive learning modules: an instance feature contrast module for refining BEV features and a perspective view contrast module that enhances the image backbone. The dense contrastive learning designed on top of detection losses leads to improved feature representations across both the BEV encoder and the backbone. Extensive experiments on the nuScenes dataset demonstrate that BEVCon achieves consistent performance gains, achieving up to +2.4% mAP improvement over state-of-the-art baselines. Our results highlight the critical role of representation learning in BEV perception and offer a complementary avenue to conventional task-specific optimizations.