2025-08-18

Title: A2HCoder: An LLM-Driven Coding Agent for Hierarchical Algorithm-to-HDL Translation

Authors: Jie Lei, Ruofan Jia, J. Andrew Zhang, Hao Zhang
Subjects: cs.CL, cs.AR, cs.PL
Abstract URL: https://arxiv.org/abs/2508.10904
Pdf URL: https://arxiv.org/pdf/2508.10904
Copy Paste: [[2508.10904]] A2HCoder: An LLM-Driven Coding Agent for Hierarchical Algorithm-to-HDL Translation(https://arxiv.org/abs/2508.10904)
Keywords: robust, interpretability, large language model
Abstract: In wireless communication systems, stringent requirements such as ultra-low latency and power consumption have significantly increased the demand for efficient algorithm-to-hardware deployment. However, a persistent and substantial gap remains between algorithm design and hardware implementation. Bridging this gap traditionally requires extensive domain expertise and time-consuming manual development, due to fundamental mismatches between high-level programming languages like MATLAB and hardware description languages (HDLs) such as Verilog-in terms of memory access patterns, data processing manners, and datatype representations. To address this challenge, we propose A2HCoder: a Hierarchical Algorithm-to-HDL Coding Agent, powered by large language models (LLMs), designed to enable agile and reliable algorithm-to-hardware translation. A2HCoder introduces a hierarchical framework that enhances both robustness and interpretability while suppressing common hallucination issues in LLM-generated code. In the horizontal dimension, A2HCoder decomposes complex algorithms into modular functional blocks, simplifying code generation and improving consistency. In the vertical dimension, instead of relying on end-to-end generation, A2HCoder performs step-by-step, fine-grained translation, leveraging external toolchains such as MATLAB and Vitis HLS for debugging and circuit-level synthesis. This structured process significantly mitigates hallucinations and ensures hardware-level correctness. We validate A2HCoder through a real-world deployment case in the 5G wireless communication domain, demonstrating its practicality, reliability, and deployment efficiency.

Title: PersonaTwin: A Multi-Tier Prompt Conditioning Framework for Generating and Evaluating Personalized Digital Twins

Authors: Sihan Chen, John P. Lalor, Yi Yang, Ahmed Abbasi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.10906
Pdf URL: https://arxiv.org/pdf/2508.10906
Copy Paste: [[2508.10906]] PersonaTwin: A Multi-Tier Prompt Conditioning Framework for Generating and Evaluating Personalized Digital Twins(https://arxiv.org/abs/2508.10906)
Keywords: fair, large language model
Abstract: While large language models (LLMs) afford new possibilities for user modeling and approximation of human behaviors, they often fail to capture the multidimensional nuances of individual users. In this work, we introduce PersonaTwin, a multi-tier prompt conditioning framework that builds adaptive digital twins by integrating demographic, behavioral, and psychometric data. Using a comprehensive data set in the healthcare context of more than 8,500 individuals, we systematically benchmark PersonaTwin against standard LLM outputs, and our rigorous evaluation unites state-of-the-art text similarity metrics with dedicated demographic parity assessments, ensuring that generated responses remain accurate and unbiased. Experimental results show that our framework produces simulation fidelity on par with oracle settings. Moreover, downstream models trained on persona-twins approximate models trained on individuals in terms of prediction and fairness metrics across both GPT-4o-based and Llama-based models. Together, these findings underscore the potential for LLM digital twin-based approaches in producing realistic and emotionally nuanced user simulations, offering a powerful tool for personalized digital user modeling and behavior analysis.

Title: Privacy Enhancement for Gaze Data Using a Noise-Infused Autoencoder

Authors: Samantha Aziz, Oleg Komogortsev
Subjects: cs.CV, cs.HC
Abstract URL: https://arxiv.org/abs/2508.10918
Pdf URL: https://arxiv.org/pdf/2508.10918
Copy Paste: [[2508.10918]] Privacy Enhancement for Gaze Data Using a Noise-Infused Autoencoder(https://arxiv.org/abs/2508.10918)
Keywords: privacy, protect, biometric
Abstract: We present a privacy-enhancing mechanism for gaze signals using a latent-noise autoencoder that prevents users from being re-identified across play sessions without their consent, while retaining the usability of the data for benign tasks. We evaluate privacy-utility trade-offs across biometric identification and gaze prediction tasks, showing that our approach significantly reduces biometric identifiability with minimal utility degradation. Unlike prior methods in this direction, our framework retains physiologically plausible gaze patterns suitable for downstream use, which produces favorable privacy-utility trade-off. This work advances privacy in gaze-based systems by providing a usable and effective mechanism for protecting sensitive gaze data.

Title: A Survey on Video Temporal Grounding with Multimodal Large Language Model

Authors: Jianlong Wu, Wei Liu, Ye Liu, Meng Liu, Liqiang Nie, Zhouchen Lin, Chang Wen Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.10922
Pdf URL: https://arxiv.org/pdf/2508.10922
Copy Paste: [[2508.10922]] A Survey on Video Temporal Grounding with Multimodal Large Language Model(https://arxiv.org/abs/2508.10922)
Keywords: large language model
Abstract: The recent advancement in video temporal grounding (VTG) has significantly enhanced fine-grained video understanding, primarily driven by multimodal large language models (MLLMs). With superior multimodal comprehension and reasoning abilities, VTG approaches based on MLLMs (VTG-MLLMs) are gradually surpassing traditional fine-tuned methods. They not only achieve competitive performance but also excel in generalization across zero-shot, multi-task, and multi-domain settings. Despite extensive surveys on general video-language understanding, comprehensive reviews specifically addressing VTG-MLLMs remain scarce. To fill this gap, this survey systematically examines current research on VTG-MLLMs through a three-dimensional taxonomy: 1) the functional roles of MLLMs, highlighting their architectural significance; 2) training paradigms, analyzing strategies for temporal reasoning and task adaptation; and 3) video feature processing techniques, which determine spatiotemporal representation effectiveness. We further discuss benchmark datasets, evaluation protocols, and summarize empirical findings. Finally, we identify existing limitations and propose promising research directions. For additional resources and details, readers are encouraged to visit our repository at this https URL.

Title: gpt-oss-120b & gpt-oss-20b Model Card

Authors: OpenAI: Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K. Arora, Yu Bai, Bowen Baker, Haiming Bao, Boaz Barak, Ally Bennett, Tyler Bertao, Nivedita Brett, Eugene Brevdo, Greg Brockman, Sebastien Bubeck, Che Chang, Kai Chen, Mark Chen, Enoch Cheung, Aidan Clark, Dan Cook, Marat Dukhan, Casey Dvorak, Kevin Fives, Vlad Fomenko, Timur Garipov, Kristian Georgiev, Mia Glaese, Tarun Gogineni, Adam Goucher, Lukas Gross, Katia Gil Guzman, John Hallman, Jackie Hehir, Johannes Heidecke, Alec Helyar, Haitang Hu, Romain Huet, Jacob Huh, Saachi Jain, Zach Johnson, Chris Koch, Irina Kofman, Dominik Kundel, Jason Kwon, Volodymyr Kyrylov, Elaine Ya Le, Guillaume Leclerc, James Park Lennon, Scott Lessans, Mario Lezcano-Casado, Yuanzhi Li, Zhuohan Li, Ji Lin, Jordan Liss, Lily (Xiaoxuan)Liu, Jiancheng Liu, Kevin Lu, Chris Lu, Zoran Martinovic, Lindsay McCallum, Josh McGrath, Scott McKinney, Aidan McLaughlin, Song Mei, Steve Mostovoy, Tong Mu, Gideon Myles, Alexander Neitz, Alex Nichol, Jakub Pachocki, Alex Paino, Dana Palmie, Ashley Pantuliano, Giambattista Parascandolo, Jongsoo Park, Leher Pathak, Carolina Paz, Ludovic Peran, Dmitry Pimenov, Michelle Pokrass, Elizabeth Proehl, Huida Qiu, Gaby Raila, Filippo Raso, Hongyu Ren, Kimmy Richardson, David Robinson, Bob Rotsted, Hadi Salman, Suvansh Sanjeev, Max Schwarzer, D. Sculley, Harshit Sikchi, Kendal Simon, Karan Singhal, Yang Song
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.10925
Pdf URL: https://arxiv.org/pdf/2508.10925
Copy Paste: [[2508.10925]] gpt-oss-120b & gpt-oss-20b Model Card(https://arxiv.org/abs/2508.10925)
Keywords: transformer
Abstract: We present gpt-oss-120b and gpt-oss-20b, two open-weight reasoning models that push the frontier of accuracy and inference cost. The models use an efficient mixture-of-expert transformer architecture and are trained using large-scale distillation and reinforcement learning. We optimize the models to have strong agentic capabilities (deep research browsing, python tool use, and support for developer-provided functions), all while using a rendered chat format that enables clear instruction following and role delineation. Both models achieve strong results on benchmarks ranging from mathematics, coding, and safety. We release the model weights, inference implementations, tool environments, and tokenizers under an Apache 2.0 license to enable broad use and further research.

Title: Modeling and Detecting Company Risks from News: A Case Study in Bloomberg News

Authors: Jiaxin Pei, Soumya Vadlamannati, Liang-Kang Huang, Daniel Preotiuc-Pietro, Xinyu Hua
Subjects: cs.CL, cs.AI, cs.CE, cs.LG
Abstract URL: https://arxiv.org/abs/2508.10927
Pdf URL: https://arxiv.org/pdf/2508.10927
Copy Paste: [[2508.10927]] Modeling and Detecting Company Risks from News: A Case Study in Bloomberg News(https://arxiv.org/abs/2508.10927)
Keywords: large language model
Abstract: Identifying risks associated with a company is important to investors and the well-being of the overall financial market. In this study, we build a computational framework to automatically extract company risk factors from news articles. Our newly proposed schema comprises seven distinct aspects, such as supply chain, regulations, and competitions. We sample and annotate 744 news articles and benchmark various machine learning models. While large language models have achieved huge progress in various types of NLP tasks, our experiment shows that zero-shot and few-shot prompting state-of-the-art LLMs (e.g. LLaMA-2) can only achieve moderate to low performances in identifying risk factors. And fine-tuned pre-trained language models are performing better on most of the risk factors. Using this model, we analyze over 277K Bloomberg news articles and demonstrate that identifying risk factors from news could provide extensive insight into the operations of companies and industries.

Title: VSF: Simple, Efficient, and Effective Negative Guidance in Few-Step Image Generation Models By \underline{V}alue \underline{S}ign \underline{F}lip

Authors: Wenqi Guo, Shan Du
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.10931
Pdf URL: https://arxiv.org/pdf/2508.10931
Copy Paste: [[2508.10931]] VSF: Simple, Efficient, and Effective Negative Guidance in Few-Step Image Generation Models By \underline{V}alue \underline{S}ign \underline{F}lip(https://arxiv.org/abs/2508.10931)
Keywords: diffusion
Abstract: We introduce Value Sign Flip (VSF), a simple and efficient method for incorporating negative prompt guidance in few-step diffusion and flow-matching image generation models. Unlike existing approaches such as classifier-free guidance (CFG), NASA, and NAG, VSF dynamically suppresses undesired content by flipping the sign of attention values from negative prompts. Our method requires only small computational overhead and integrates effectively with MMDiT-style architectures such as Stable Diffusion 3.5 Turbo, as well as cross-attention-based models like Wan. We validate VSF on challenging datasets with complex prompt pairs and demonstrate superior performance in both static image and video generation tasks. Experimental results show that VSF significantly improves negative prompt adherence compared to prior methods in few-step models, and even CFG in non-few-step models, while maintaining competitive image quality. Code and ComfyUI node are available in this https URL.

Title: ViPE: Video Pose Engine for 3D Geometric Perception

Authors: Jiahui Huang, Qunjie Zhou, Hesam Rabeti, Aleksandr Korovko, Huan Ling, Xuanchi Ren, Tianchang Shen, Jun Gao, Dmitry Slepichev, Chen-Hsuan Lin, Jiawei Ren, Kevin Xie, Joydeep Biswas, Laura Leal-Taixe, Sanja Fidler
Subjects: cs.CV, cs.GR, cs.RO, eess.IV
Abstract URL: https://arxiv.org/abs/2508.10934
Pdf URL: https://arxiv.org/pdf/2508.10934
Copy Paste: [[2508.10934]] ViPE: Video Pose Engine for 3D Geometric Perception(https://arxiv.org/abs/2508.10934)
Keywords: robust
Abstract: Accurate 3D geometric perception is an important prerequisite for a wide range of spatial AI systems. While state-of-the-art methods depend on large-scale training data, acquiring consistent and precise 3D annotations from in-the-wild videos remains a key challenge. In this work, we introduce ViPE, a handy and versatile video processing engine designed to bridge this gap. ViPE efficiently estimates camera intrinsics, camera motion, and dense, near-metric depth maps from unconstrained raw videos. It is robust to diverse scenarios, including dynamic selfie videos, cinematic shots, or dashcams, and supports various camera models such as pinhole, wide-angle, and 360° panoramas. We have benchmarked ViPE on multiple benchmarks. Notably, it outperforms existing uncalibrated pose estimation baselines by 18%/50% on TUM/KITTI sequences, and runs at 3-5FPS on a single GPU for standard input resolutions. We use ViPE to annotate a large-scale collection of videos. This collection includes around 100K real-world internet videos, 1M high-quality AI-generated videos, and 2K panoramic videos, totaling approximately 96M frames -- all annotated with accurate camera poses and dense depth maps. We open-source ViPE and the annotated dataset with the hope of accelerating the development of spatial AI systems.

Title: Vision-Only Gaussian Splatting for Collaborative Semantic Occupancy Prediction

Authors: Cheng Chen, Hao Huang, Saurabh Bagchi
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2508.10936
Pdf URL: https://arxiv.org/pdf/2508.10936
Copy Paste: [[2508.10936]] Vision-Only Gaussian Splatting for Collaborative Semantic Occupancy Prediction(https://arxiv.org/abs/2508.10936)
Keywords: robust
Abstract: Collaborative perception enables connected vehicles to share information, overcoming occlusions and extending the limited sensing range inherent in single-agent (non-collaborative) systems. Existing vision-only methods for 3D semantic occupancy prediction commonly rely on dense 3D voxels, which incur high communication costs, or 2D planar features, which require accurate depth estimation or additional supervision, limiting their applicability to collaborative scenarios. To address these challenges, we propose the first approach leveraging sparse 3D semantic Gaussian splatting for collaborative 3D semantic occupancy prediction. By sharing and fusing intermediate Gaussian primitives, our method provides three benefits: a neighborhood-based cross-agent fusion that removes duplicates and suppresses noisy or inconsistent Gaussians; a joint encoding of geometry and semantics in each primitive, which reduces reliance on depth supervision and allows simple rigid alignment; and sparse, object-centric messages that preserve structural information while reducing communication volume. Extensive experiments demonstrate that our approach outperforms single-agent perception and baseline collaborative methods by +8.42 and +3.28 points in mIoU, and +5.11 and +22.41 points in IoU, respectively. When further reducing the number of transmitted Gaussians, our method still achieves a +1.9 improvement in mIoU, using only 34.6% communication volume, highlighting robust performance under limited communication budgets.

Title: Personalized Face Super-Resolution with Identity Decoupling and Fitting

Authors: Jiarui Yang, Hang Guo, Wen Huang, Tao Dai, Shutao Xia
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.10937
Pdf URL: https://arxiv.org/pdf/2508.10937
Copy Paste: [[2508.10937]] Personalized Face Super-Resolution with Identity Decoupling and Fitting(https://arxiv.org/abs/2508.10937)
Keywords: diffusion
Abstract: In recent years, face super-resolution (FSR) methods have achieved remarkable progress, generally maintaining high image fidelity and identity (ID) consistency under standard settings. However, in extreme degradation scenarios (e.g., scale $> 8\times$), critical attributes and ID information are often severely lost in the input image, making it difficult for conventional models to reconstruct realistic and ID-consistent faces. Existing methods tend to generate hallucinated faces under such conditions, producing restored images lacking authentic ID constraints. To address this challenge, we propose a novel FSR method with Identity Decoupling and Fitting (IDFSR), designed to enhance ID restoration under large scaling factors while mitigating hallucination effects. Our approach involves three key designs: 1) \textbf{Masking} the facial region in the low-resolution (LR) image to eliminate unreliable ID cues; 2) \textbf{Warping} a reference image to align with the LR input, providing style guidance; 3) Leveraging \textbf{ID embeddings} extracted from ground truth (GT) images for fine-grained ID modeling and personalized adaptation. We first pretrain a diffusion-based model to explicitly decouple style and ID by forcing it to reconstruct masked LR face regions using both style and identity embeddings. Subsequently, we freeze most network parameters and perform lightweight fine-tuning of the ID embedding using a small set of target ID images. This embedding encodes fine-grained facial attributes and precise ID information, significantly improving both ID consistency and perceptual quality. Extensive quantitative evaluations and visual comparisons demonstrate that the proposed IDFSR substantially outperforms existing approaches under extreme degradation, particularly achieving superior performance on ID consistency.

Title: NIRMAL Pooling: An Adaptive Max Pooling Approach with Non-linear Activation for Enhanced Image Classification

Authors: Nirmal Gaud, Krishna Kumar Jha, Jhimli Adhikari, Adhini Nasarin P S, Joydeep Das, Samarth S Deshpande, Nitasha Barara, Vaduguru Venkata Ramya, Santu Saha, Mehmet Tarik Baran, Sarangi Venkateshwarlu, Anusha M D, Surej Mouli, Preeti Katiyar, Vipin Kumar Chaudhary
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.10940
Pdf URL: https://arxiv.org/pdf/2508.10940
Copy Paste: [[2508.10940]] NIRMAL Pooling: An Adaptive Max Pooling Approach with Non-linear Activation for Enhanced Image Classification(https://arxiv.org/abs/2508.10940)
Keywords: robust
Abstract: This paper presents NIRMAL Pooling, a novel pooling layer for Convolutional Neural Networks (CNNs) that integrates adaptive max pooling with non-linear activation function for image classification tasks. The acronym NIRMAL stands for Non-linear Activation, Intermediate Aggregation, Reduction, Maximum, Adaptive, and Localized. By dynamically adjusting pooling parameters based on desired output dimensions and applying a Rectified Linear Unit (ReLU) activation post-pooling, NIRMAL Pooling improves robustness and feature expressiveness. We evaluated its performance against standard Max Pooling on three benchmark datasets: MNIST Digits, MNIST Fashion, and CIFAR-10. NIRMAL Pooling achieves test accuracies of 99.25% (vs. 99.12% for Max Pooling) on MNIST Digits, 91.59% (vs. 91.44%) on MNIST Fashion, and 70.49% (vs. 68.87%) on CIFAR-10, demonstrating consistent improvements, particularly on complex datasets. This work highlights the potential of NIRMAL Pooling to enhance CNN performance in diverse image recognition tasks, offering a flexible and reliable alternative to traditional pooling methods.

Title: Analysis of the Compaction Behavior of Textile Reinforcements in Low-Resolution In-Situ CT Scans via Machine-Learning and Descriptor-Based Methods

Authors: Christian Düreth, Jan Condé-Wolter, Marek Danczak, Karsten Tittmann, Jörn Jaschinski, Andreas Hornig, Maik Gude
Subjects: cs.CV, cond-mat.mtrl-sci, physics.app-ph
Abstract URL: https://arxiv.org/abs/2508.10943
Pdf URL: https://arxiv.org/pdf/2508.10943
Copy Paste: [[2508.10943]] Analysis of the Compaction Behavior of Textile Reinforcements in Low-Resolution In-Situ CT Scans via Machine-Learning and Descriptor-Based Methods(https://arxiv.org/abs/2508.10943)
Keywords: robust, extraction, segmentation
Abstract: A detailed understanding of material structure across multiple scales is essential for predictive modeling of textile-reinforced composites. Nesting -- characterized by the interlocking of adjacent fabric layers through local interpenetration and misalignment of yarns -- plays a critical role in defining mechanical properties such as stiffness, permeability, and damage tolerance. This study presents a framework to quantify nesting behavior in dry textile reinforcements under compaction using low-resolution computed tomography (CT). In-situ compaction experiments were conducted on various stacking configurations, with CT scans acquired at 20.22 $\mu$m per voxel resolution. A tailored 3D{-}UNet enabled semantic segmentation of matrix, weft, and fill phases across compaction stages corresponding to fiber volume contents of 50--60 %. The model achieved a minimum mean Intersection-over-Union of 0.822 and an $F1$ score of 0.902. Spatial structure was subsequently analyzed using the two-point correlation function $S_2$, allowing for probabilistic extraction of average layer thickness and nesting degree. The results show strong agreement with micrograph-based validation. This methodology provides a robust approach for extracting key geometrical features from industrially relevant CT data and establishes a foundation for reverse modeling and descriptor-based structural analysis of composite preforms.

Title: IPG: Incremental Patch Generation for Generalized Adversarial Patch Training

Authors: Wonho Lee, Hyunsik Na, Jisu Lee, Daeseon Choi
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2508.10946
Pdf URL: https://arxiv.org/pdf/2508.10946
Copy Paste: [[2508.10946]] IPG: Incremental Patch Generation for Generalized Adversarial Patch Training(https://arxiv.org/abs/2508.10946)
Keywords: security, defense, attack, robust
Abstract: The advent of adversarial patches poses a significant challenge to the robustness of AI models, particularly in the domain of computer vision tasks such as object detection. In contradistinction to traditional adversarial examples, these patches target specific regions of an image, resulting in the malfunction of AI models. This paper proposes Incremental Patch Generation (IPG), a method that generates adversarial patches up to 11.1 times more efficiently than existing approaches while maintaining comparable attack performance. The efficacy of IPG is demonstrated by experiments and ablation studies including YOLO's feature distribution visualization and adversarial training results, which show that it produces well-generalized patches that effectively cover a broader range of model vulnerabilities. Furthermore, IPG-generated datasets can serve as a robust knowledge foundation for constructing a robust model, enabling structured representation, advanced reasoning, and proactive defenses in AI security ecosystems. The findings of this study suggest that IPG has considerable potential for future utilization not only in adversarial patch defense but also in real-world applications such as autonomous vehicles, security systems, and medical imaging, where AI models must remain resilient to adversarial attacks in dynamic and high-stakes environments.

Title: MedAtlas: Evaluating LLMs for Multi-Round, Multi-Task Medical Reasoning Across Diverse Imaging Modalities and Clinical Text

Authors: Ronghao Xu, Zhen Huang, Yangbo Wei, Xiaoqian Zhou, Zikang Xu, Ting Liu, Zihang Jiang, S.Kevin Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.10947
Pdf URL: https://arxiv.org/pdf/2508.10947
Copy Paste: [[2508.10947]] MedAtlas: Evaluating LLMs for Multi-Round, Multi-Task Medical Reasoning Across Diverse Imaging Modalities and Clinical Text(https://arxiv.org/abs/2508.10947)
Keywords: robust, large language model
Abstract: Artificial intelligence has demonstrated significant potential in clinical decision-making; however, developing models capable of adapting to diverse real-world scenarios and performing complex diagnostic reasoning remains a major challenge. Existing medical multi-modal benchmarks are typically limited to single-image, single-turn tasks, lacking multi-modal medical image integration and failing to capture the longitudinal and multi-modal interactive nature inherent to clinical practice. To address this gap, we introduce MedAtlas, a novel benchmark framework designed to evaluate large language models on realistic medical reasoning tasks. MedAtlas is characterized by four key features: multi-turn dialogue, multi-modal medical image interaction, multi-task integration, and high clinical fidelity. It supports four core tasks: open-ended multi-turn question answering, closed-ended multi-turn question answering, multi-image joint reasoning, and comprehensive disease diagnosis. Each case is derived from real diagnostic workflows and incorporates temporal interactions between textual medical histories and multiple imaging modalities, including CT, MRI, PET, ultrasound, and X-ray, requiring models to perform deep integrative reasoning across images and clinical texts. MedAtlas provides expert-annotated gold standards for all tasks. Furthermore, we propose two novel evaluation metrics: Round Chain Accuracy and Error Propagation Resistance. Benchmark results with existing multi-modal models reveal substantial performance gaps in multi-stage clinical reasoning. MedAtlas establishes a challenging evaluation platform to advance the development of robust and trustworthy medical AI.

Title: Apriel-Nemotron-15B-Thinker

Authors: Shruthan Radhakrishna, Soham Parikh, Gopal Sarda, Anil Turkkan, Quaizar Vohra, Raymond Li, Dhruv Jhamb, Kelechi Ogueji, Aanjaneya Shukla, Oluwanifemi Bamgbose, Toby Liang, Luke Kumar, Oleksiy Ostapenko, Shiva Krishna Reddy Malay, Aman Tiwari, Tara Bogavelli, Vikas Yadav, Jash Mehta, Saloni Mittal, Akshay Kalkunte, Pulkit Pattnaik, Khalil Slimi, Anirudh Sreeram, Jishnu Nair, Akintunde Oladipo, Shashank Maiya, Khyati Mahajan, Rishabh Maheshwary, Masoud Hashemi, Sai Rajeswar Mudumba, Sathwik Tejaswi Madhusudhan, Torsten Scholak, Sebastien Paquet, Sagar Davasam, Srinivas Sunkara
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.10948
Pdf URL: https://arxiv.org/pdf/2508.10948
Copy Paste: [[2508.10948]] Apriel-Nemotron-15B-Thinker(https://arxiv.org/abs/2508.10948)
Keywords: large language model
Abstract: While large language models (LLMs) have achieved remarkable reasoning capabilities across domains like code, math and other enterprise tasks, their significant memory and computational costs often preclude their use in practical enterprise settings. To this end, we introduce Apriel-Nemotron-15B-Thinker, a 15-billion parameter model in the ServiceNow Apriel SLM series that achieves performance against medium sized state-of-the-art models such as o1-mini, QWQ32B, and EXAONE-Deep-32B while maintaining only half the memory footprint of those alternatives. Apriel-Nemotron-15B-Thinker model is trained in a four stage training pipeline including 1) Base Model upscaling, 2) Continual Pre-training 3) Supervised Fine-tuning (SFT) and 4) Reinforcement Learning using GRPO. Comprehensive evaluations across a diverse suite of benchmarks consistently demonstrate that our Apriel-Nemotron-15B-Thinker model matches or exceeds the performance of its 32-billion parameter counterparts, despite being less than half their size.

Title: From Promise to Practical Reality: Transforming Diffusion MRI Analysis with Fast Deep Learning Enhancement

Authors: Xinyi Wang, Michael Barnett, Frederique Boonstra, Yael Barnett, Mariano Cabezas, Arkiev D'Souza, Matthew C. Kiernan, Kain Kyle, Meng Law, Lynette Masters, Zihao Tang, Stephen Tisch, Sicong Tu, Anneke Van Der Walt, Dongang Wang, Fernando Calamante, Weidong Cai, Chenyu Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.10950
Pdf URL: https://arxiv.org/pdf/2508.10950
Copy Paste: [[2508.10950]] From Promise to Practical Reality: Transforming Diffusion MRI Analysis with Fast Deep Learning Enhancement(https://arxiv.org/abs/2508.10950)
Keywords: robust, interpretability, diffusion
Abstract: Fiber orientation distribution (FOD) is an advanced diffusion MRI modeling technique that represents complex white matter fiber configurations, and a key step for subsequent brain tractography and connectome analysis. Its reliability and accuracy, however, heavily rely on the quality of the MRI acquisition and the subsequent estimation of the FODs at each voxel. Generating reliable FODs from widely available clinical protocols with single-shell and low-angular-resolution acquisitions remains challenging but could potentially be addressed with recent advances in deep learning-based enhancement techniques. Despite advancements, existing methods have predominantly been assessed on healthy subjects, which have proved to be a major hurdle for their clinical adoption. In this work, we validate a newly optimized enhancement framework, FastFOD-Net, across healthy controls and six neurological disorders. This accelerated end-to-end deep learning framework enhancing FODs with superior performance and delivering training/inference efficiency for clinical use ($60\times$ faster comparing to its predecessor). With the most comprehensive clinical evaluation to date, our work demonstrates the potential of FastFOD-Net in accelerating clinical neuroscience research, empowering diffusion MRI analysis for disease differentiation, improving interpretability in connectome applications, and reducing measurement errors to lower sample size requirements. Critically, this work will facilitate the more widespread adoption of, and build clinical trust in, deep learning based methods for diffusion MRI enhancement. Specifically, FastFOD-Net enables robust analysis of real-world, clinical diffusion MRI data, comparable to that achievable with high-quality research acquisitions.

Title: Empowering Multimodal LLMs with External Tools: A Comprehensive Survey

Authors: Wenbin An, Jiahao Nie, Yaqiang Wu, Feng Tian, Shijian Lu, Qinghua Zheng
Subjects: cs.CV, cs.CL, cs.MM
Abstract URL: https://arxiv.org/abs/2508.10955
Pdf URL: https://arxiv.org/pdf/2508.10955
Copy Paste: [[2508.10955]] Empowering Multimodal LLMs with External Tools: A Comprehensive Survey(https://arxiv.org/abs/2508.10955)
Keywords: generative, large language model
Abstract: By integrating the perception capabilities of multimodal encoders with the generative power of Large Language Models (LLMs), Multimodal Large Language Models (MLLMs), exemplified by GPT-4V, have achieved great success in various multimodal tasks, pointing toward a promising pathway to artificial general intelligence. Despite this progress, the limited quality of multimodal data, poor performance on many complex downstream tasks, and inadequate evaluation protocols continue to hinder the reliability and broader applicability of MLLMs across diverse domains. Inspired by the human ability to leverage external tools for enhanced reasoning and problem-solving, augmenting MLLMs with external tools (e.g., APIs, expert models, and knowledge bases) offers a promising strategy to overcome these challenges. In this paper, we present a comprehensive survey on leveraging external tools to enhance MLLM performance. Our discussion is structured along four key dimensions about external tools: (1) how they can facilitate the acquisition and annotation of high-quality multimodal data; (2) how they can assist in improving MLLM performance on challenging downstream tasks; (3) how they enable comprehensive and accurate evaluation of MLLMs; (4) the current limitations and future directions of tool-augmented MLLMs. Through this survey, we aim to underscore the transformative potential of external tools in advancing MLLM capabilities, offering a forward-looking perspective on their development and applications. The project page of this paper is publicly available athttps://github.com/Lackel/Awesome-Tools-for-MLLMs.

Title: CSNR and JMIM Based Spectral Band Selection for Reducing Metamerism in Urban Driving

Authors: Jiarong Li, Imad Ali Shah, Diarmaid Geever, Fiachra Collins, Enda Ward, Martin Glavin, Edward Jones, Brian Deegan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.10962
Pdf URL: https://arxiv.org/pdf/2508.10962
Copy Paste: [[2508.10962]] CSNR and JMIM Based Spectral Band Selection for Reducing Metamerism in Urban Driving(https://arxiv.org/abs/2508.10962)
Keywords: protect, robust
Abstract: Protecting Vulnerable Road Users (VRU) is a critical safety challenge for automotive perception systems, particularly under visual ambiguity caused by metamerism, a phenomenon where distinct materials appear similar in RGB imagery. This work investigates hyperspectral imaging (HSI) to overcome this limitation by capturing unique material signatures beyond the visible spectrum, especially in the Near-Infrared (NIR). To manage the inherent high-dimensionality of HSI data, we propose a band selection strategy that integrates information theory techniques (joint mutual information maximization, correlation analysis) with a novel application of an image quality metric (contrast signal-to-noise ratio) to identify the most spectrally informative bands. Using the Hyperspectral City V2 (H-City) dataset, we identify three informative bands (497 nm, 607 nm, and 895 nm, $\pm$27 nm) and reconstruct pseudo-color images for comparison with co-registered RGB. Quantitative results demonstrate increased dissimilarity and perceptual separability of VRU from the background. The selected HSI bands yield improvements of 70.24%, 528.46%, 1206.83%, and 246.62% for dissimilarity (Euclidean, SAM, $T^2$) and perception (CIE $\Delta E$) metrics, consistently outperforming RGB and confirming a marked reduction in metameric confusion. By providing a spectrally optimized input, our method enhances VRU separability, establishing a robust foundation for downstream perception tasks in Advanced Driver Assistance Systems (ADAS) and Autonomous Driving (AD), ultimately contributing to improved road safety.

Title: Retro-Expert: Collaborative Reasoning for Interpretable Retrosynthesis

Authors: Xinyi Li, Sai Wang, Yutian Lin, Yu Wu, Yi Yang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.10967
Pdf URL: https://arxiv.org/pdf/2508.10967
Copy Paste: [[2508.10967]] Retro-Expert: Collaborative Reasoning for Interpretable Retrosynthesis(https://arxiv.org/abs/2508.10967)
Keywords: large language model
Abstract: Retrosynthesis prediction aims to infer the reactant molecule based on a given product molecule, which is a fundamental task in chemical synthesis. However, existing models rely on static pattern-matching paradigm, which limits their ability to perform effective logic decision-making, leading to black-box decision-making. Building on this, we propose Retro-Expert, an interpretable retrosynthesis framework that performs collaborative reasoning by combining the complementary reasoning strengths of Large Language Models and specialized models via reinforcement learning. It outputs natural language explanations grounded in chemical logic through three components: (1) specialized models perform shallow reasoning to construct high-quality chemical decision space, (2) LLM-driven critical reasoning to generate predictions and corresponding interpretable reasoning path, and (3) reinforcement learning optimizing interpretable decision policy. Experiments show that Retro-Expert not only surpasses both LLM-based and specialized models across different metrics but also provides expert-aligned explanations that bridge the gap between AI predictions and actionable chemical insights.

Title: Rule2Text: A Framework for Generating and Evaluating Natural Language Explanations of Knowledge Graph Rules

Authors: Nasim Shirvani-Mahdavi, Chengkai Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.10971
Pdf URL: https://arxiv.org/pdf/2508.10971
Copy Paste: [[2508.10971]] Rule2Text: A Framework for Generating and Evaluating Natural Language Explanations of Knowledge Graph Rules(https://arxiv.org/abs/2508.10971)
Keywords: large language model
Abstract: Knowledge graphs (KGs) can be enhanced through rule mining; however, the resulting logical rules are often difficult for humans to interpret due to their inherent complexity and the idiosyncratic labeling conventions of individual KGs. This work presents Rule2Text, a comprehensive framework that leverages large language models (LLMs) to generate natural language explanations for mined logical rules, thereby improving KG accessibility and usability. We conduct extensive experiments using multiple datasets, including Freebase variants (FB-CVT-REV, FB+CVT-REV, and FB15k-237) as well as the ogbl-biokg dataset, with rules mined using AMIE 3.5.1. We systematically evaluate several LLMs across a comprehensive range of prompting strategies, including zero-shot, few-shot, variable type incorporation, and Chain-of-Thought reasoning. To systematically assess models' performance, we conduct a human evaluation of generated explanations on correctness and clarity. To address evaluation scalability, we develop and validate an LLM-as-a-judge framework that demonstrates strong agreement with human evaluators. Leveraging the best-performing model (Gemini 2.0 Flash), LLM judge, and human-in-the-loop feedback, we construct high-quality ground truth datasets, which we use to fine-tune the open-source Zephyr model. Our results demonstrate significant improvements in explanation quality after fine-tuning, with particularly strong gains in the domain-specific dataset. Additionally, we integrate a type inference module to support KGs lacking explicit type information. All code and data are publicly available at this https URL.

Title: BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining

Authors: Pratyush Maini, Vineeth Dorna, Parth Doshi, Aldo Carranza, Fan Pan, Jack Urbanek, Paul Burstein, Alex Fang, Alvin Deng, Amro Abbas, Brett Larsen, Cody Blakeney, Charvi Bannur, Christina Baek, Darren Teh, David Schwab, Haakon Mongstad, Haoli Yin, Josh Wills, Kaleigh Mentzer, Luke Merrick, Ricardo Monti, Rishabh Adiga, Siddharth Joshi, Spandan Das, Zhengping Wang, Bogdan Gaza, Ari Morcos, Matthew Leavitt
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2508.10975
Pdf URL: https://arxiv.org/pdf/2508.10975
Copy Paste: [[2508.10975]] BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining(https://arxiv.org/abs/2508.10975)
Keywords: large language model
Abstract: Recent advances in large language model (LLM) pretraining have shown that simply scaling data quantity eventually leads to diminishing returns, hitting a data wall. In response, the use of synthetic data for pretraining has emerged as a promising paradigm for pushing the frontier of performance. Despite this, the factors affecting synthetic data quality remain poorly understood. In this work, we introduce BeyondWeb, a synthetic data generation framework that produces high-quality synthetic data for pretraining. BeyondWeb significantly extends the capabilities of traditional web-scale datasets, outperforming state-of-the-art synthetic pretraining datasets such as Cosmopedia and Nemotron-CC's high-quality synthetic subset (Nemotron-Synth) by up to 5.1 percentage points (pp) and 2.6pp, respectively, when averaged across a suite of 14 benchmark evaluations. It delivers up to 7.7x faster training than open web data and 2.7x faster than Nemotron-Synth. Remarkably, a 3B model trained for 180B tokens on BeyondWeb outperforms an 8B model trained for the same token budget on Cosmopedia. We also present several insights from BeyondWeb on synthetic data for pretraining: what drives its benefits, which data to rephrase and how, and the impact of model size and family on data quality. Overall, our work shows that there's no silver bullet for generating high-quality synthetic pretraining data. The best outcomes require jointly optimizing many factors, a challenging task that requires rigorous science and practical expertise. Naive approaches can yield modest improvements, potentially at great cost, while well-executed methods can yield transformative improvements, as exemplified by BeyondWeb.

Title: MCP-Guard: A Defense Framework for Model Context Protocol Integrity in Large Language Model Applications

Authors: Wenpeng Xing, Zhonghao Qi, Yupeng Qin, Yilin Li, Caini Chang, Jiahui Yu, Changting Lin, Zhenzhen Xie, Meng Han
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2508.10991
Pdf URL: https://arxiv.org/pdf/2508.10991
Copy Paste: [[2508.10991]] MCP-Guard: A Defense Framework for Model Context Protocol Integrity in Large Language Model Applications(https://arxiv.org/abs/2508.10991)
Keywords: security, defense, attack, robust, large language model
Abstract: The integration of Large Language Models (LLMs) with external tools via protocols such as the Model Context Protocol (MCP) introduces critical security vulnerabilities, including prompt injection, data exfiltration, and other threats. To counter these challenges, we propose MCP-Guard, a robust, layered defense architecture designed for LLM--tool interactions. MCP-Guard employs a three-stage detection pipeline that balances efficiency with accuracy: it progresses from lightweight static scanning for overt threats and a deep neural detector for semantic attacks, to our fine-tuned E5-based model achieves (96.01) accuracy in identifying adversarial prompts. Finally, a lightweight LLM arbitrator synthesizes these signals to deliver the final decision while minimizing false positives. To facilitate rigorous training and evaluation, we also introduce MCP-AttackBench, a comprehensive benchmark of over 70,000 samples. Sourced from public datasets and augmented by GPT-4, MCP-AttackBench simulates diverse, real-world attack vectors in the MCP format, providing a foundation for future research into securing LLM-tool ecosystems.

Title: Match & Choose: Model Selection Framework for Fine-tuning Text-to-Image Diffusion Models

Authors: Basile Lewandowski, Robert Birke, Lydia Y. Chen
Subjects: cs.LG, cs.AI, cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2508.10993
Pdf URL: https://arxiv.org/pdf/2508.10993
Copy Paste: [[2508.10993]] Match & Choose: Model Selection Framework for Fine-tuning Text-to-Image Diffusion Models(https://arxiv.org/abs/2508.10993)
Keywords: diffusion, transformer
Abstract: Text-to-image (T2I) models based on diffusion and transformer architectures advance rapidly. They are often pretrained on large corpora, and openly shared on a model platform, such as HuggingFace. Users can then build up AI applications, e.g., generating media contents, by adopting pretrained T2I models and fine-tuning them on the target dataset. While public pretrained T2I models facilitate the democratization of the models, users face a new challenge: which model can be best fine-tuned based on the target data domain? Model selection is well addressed in classification tasks, but little is known in (pretrained) T2I models and their performance indication on the target domain. In this paper, we propose the first model selection framework, M&C, which enables users to efficiently choose a pretrained T2I model from a model platform without exhaustively fine-tuning them all on the target dataset. The core of M&C is a matching graph, which consists of: (i) nodes of available models and profiled datasets, and (ii) edges of model-data and data-data pairs capturing the fine-tuning performance and data similarity, respectively. We then build a model that, based on the inputs of model/data feature, and, critically, the graph embedding feature, extracted from the matching graph, predicts the model achieving the best quality after fine-tuning for the target domain. We evaluate M&C on choosing across ten T2I models for 32 datasets against three baselines. Our results show that M&C successfully predicts the best model for fine-tuning in 61.3% of the cases and a closely performing model for the rest.

Title: Improving Text Style Transfer using Masked Diffusion Language Models with Inference-time Scaling

Authors: Tejomay Kishor Padole, Suyash P Awate, Pushpak Bhattacharyya
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2508.10995
Pdf URL: https://arxiv.org/pdf/2508.10995
Copy Paste: [[2508.10995]] Improving Text Style Transfer using Masked Diffusion Language Models with Inference-time Scaling(https://arxiv.org/abs/2508.10995)
Keywords: diffusion, generative
Abstract: Masked diffusion language models (MDMs) have recently gained traction as a viable generative framework for natural language. This can be attributed to its scalability and ease of training compared to other diffusion model paradigms for discrete data, establishing itself as the state-of-the-art non-autoregressive generator for discrete data. Diffusion models, in general, have shown excellent ability to improve the generation quality by leveraging inference-time scaling either by increasing the number of denoising steps or by using external verifiers on top of the outputs of each step to guide the generation. In this work, we propose a verifier-based inference-time scaling method that aids in finding a better candidate generation during the denoising process of the MDM. Our experiments demonstrate the application of MDMs for standard text-style transfer tasks and establish MDMs as a better alternative to autoregressive language models. Additionally, we show that a simple soft-value-based verifier setup for MDMs using off-the-shelf pre-trained embedding models leads to significant gains in generation quality even when used on top of typical classifier-free guidance setups in the existing literature.

Title: SproutBench: A Benchmark for Safe and Ethical Large Language Models for Youth

Authors: Wenpeng Xing, Lanyi Wei, Haixiao Hu, Rongchang Li, Mohan Li, Changting Lin, Meng Han
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.11009
Pdf URL: https://arxiv.org/pdf/2508.11009
Copy Paste: [[2508.11009]] SproutBench: A Benchmark for Safe and Ethical Large Language Models for Youth(https://arxiv.org/abs/2508.11009)
Keywords: privacy, robust, large language model
Abstract: The rapid proliferation of large language models (LLMs) in applications targeting children and adolescents necessitates a fundamental reassessment of prevailing AI safety frameworks, which are largely tailored to adult users and neglect the distinct developmental vulnerabilities of minors. This paper highlights key deficiencies in existing LLM safety benchmarks, including their inadequate coverage of age-specific cognitive, emotional, and social risks spanning early childhood (ages 0--6), middle childhood (7--12), and adolescence (13--18). To bridge these gaps, we introduce SproutBench, an innovative evaluation suite comprising 1,283 developmentally grounded adversarial prompts designed to probe risks such as emotional dependency, privacy violations, and imitation of hazardous behaviors. Through rigorous empirical evaluation of 47 diverse LLMs, we uncover substantial safety vulnerabilities, corroborated by robust inter-dimensional correlations (e.g., between Safety and Risk Prevention) and a notable inverse relationship between Interactivity and Age Appropriateness. These insights yield practical guidelines for advancing child-centric AI design and deployment.

Title: CURE: Critical-Token-Guided Re-concatenation for Entropy-collapse Prevention

Authors: Qingbin Li, Rongkun Xue, Jie Wang, Ming Zhou, Zhi Li, Xiaofeng Ji, Yongqi Wang, Miao Liu, Zheming Yang, Minghui Qiu, Jing Yang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.11016
Pdf URL: https://arxiv.org/pdf/2508.11016
Copy Paste: [[2508.11016]] CURE: Critical-Token-Guided Re-concatenation for Entropy-collapse Prevention(https://arxiv.org/abs/2508.11016)
Keywords: large language model
Abstract: Recent advances in Reinforcement Learning with Verified Reward (RLVR) have driven the emergence of more sophisticated cognitive behaviors in large language models (LLMs), thereby enhancing their reasoning capabilities. However, in prior RLVR pipelines, the repeated use of static initial-state sampling drawn exactly from the dataset distribution during each sampling phase produced overly deterministic, low diversity model behavior, which manifested as rapid entropy collapse and hindered sustained performance gains during prolonged training. To address this issue, we introduce CURE (Critical-token-gUided Re concatenation for Entropy-collapse prevention), a two-stage framework that balances exploration and exploitation. Specifically, in the first stage, to deliberately steer the model toward novel yet coherent contexts, we re-generate at high-entropy critical tokens and jointly optimize the original and the branched trajectories. The further comparison with vanilla DAPO shows that the regeneration process achieves a better performance on math reasoning tasks while sustaining a high-level entropy degree for exploration. In the second stage, we continue training with static initial-state sampling by DAPO, intentionally placing the model in a familiar state to gradually strengthen exploitation. Extensive experiments on Qwen-2.5-Math-7B show that, compared to other RLVR methods, CURE achieves a 5% performance gain across six math benchmarks, establishing state-of-the-art performance in both entropy and accuracy. A series of experiments further validate the effectiveness of our approach. Code is available at this https URL.

Title: Beyond the Rosetta Stone: Unification Forces in Generalization Dynamics

Authors: Carter Blum, Katja Filipova, Ann Yuan, Asma Ghandeharioun, Julian Zimmert, Fred Zhang, Jessica Hoffmann, Tal Linzen, Martin Wattenberg, Lucas Dixon, Mor Geva
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.11017
Pdf URL: https://arxiv.org/pdf/2508.11017
Copy Paste: [[2508.11017]] Beyond the Rosetta Stone: Unification Forces in Generalization Dynamics(https://arxiv.org/abs/2508.11017)
Keywords: transformer, large language model
Abstract: Large language models (LLMs) struggle with cross-lingual knowledge transfer: they hallucinate when asked in one language about facts expressed in a different language during training. This work introduces a controlled setting to study the causes and dynamics of this phenomenon by training small Transformer models from scratch on synthetic multilingual datasets. We identify a learning phase wherein a model develops either separate or unified representations of the same facts across languages, and show that unification is essential for cross-lingual transfer. We also show that the degree of unification depends on mutual information between facts and training data language, and on how easy it is to extract that language. Based on these insights, we develop methods to modulate the level of cross-lingual transfer by manipulating data distribution and tokenization, and we introduce metrics and visualizations to formally characterize their effects on unification. Our work shows how controlled settings can shed light on pre-training dynamics and suggests new directions for improving cross-lingual transfer in LLMs.

Title: Can Multi-modal (reasoning) LLMs detect document manipulation?

Authors: Zisheng Liang, Kidus Zewde, Rudra Pratap Singh, Disha Patil, Zexi Chen, Jiayu Xue, Yao Yao, Yifei Chen, Qinzhe Liu, Simiao Ren
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2508.11021
Pdf URL: https://arxiv.org/pdf/2508.11021
Copy Paste: [[2508.11021]] Can Multi-modal (reasoning) LLMs detect document manipulation?(https://arxiv.org/abs/2508.11021)
Keywords: secure, robust, large language model
Abstract: Document fraud poses a significant threat to industries reliant on secure and verifiable documentation, necessitating robust detection mechanisms. This study investigates the efficacy of state-of-the-art multi-modal large language models (LLMs)-including OpenAI O1, OpenAI 4o, Gemini Flash (thinking), Deepseek Janus, Grok, Llama 3.2 and 4, Qwen 2 and 2.5 VL, Mistral Pixtral, and Claude 3.5 and 3.7 Sonnet-in detecting fraudulent documents. We benchmark these models against each other and prior work on document fraud detection techniques using a standard dataset with real transactional documents. Through prompt optimization and detailed analysis of the models' reasoning processes, we evaluate their ability to identify subtle indicators of fraud, such as tampered text, misaligned formatting, and inconsistent transactional sums. Our results reveal that top-performing multi-modal LLMs demonstrate superior zero-shot generalization, outperforming conventional methods on out-of-distribution datasets, while several vision LLMs exhibit inconsistent or subpar performance. Notably, model size and advanced reasoning capabilities show limited correlation with detection accuracy, suggesting task-specific fine-tuning is critical. This study underscores the potential of multi-modal LLMs in enhancing document fraud detection systems and provides a foundation for future research into interpretable and scalable fraud mitigation strategies.

Title: Hell or High Water: Evaluating Agentic Recovery from External Failures

Authors: Andrew Wang, Sophia Hager, Adi Asija, Daniel Khashabi, Nicholas Andrews
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.11027
Pdf URL: https://arxiv.org/pdf/2508.11027
Copy Paste: [[2508.11027]] Hell or High Water: Evaluating Agentic Recovery from External Failures(https://arxiv.org/abs/2508.11027)
Keywords: generative
Abstract: As language model agents are applied to real world problems of increasing complexity, they will be expected to formulate plans across large search spaces. If those plans fail for reasons beyond their control, how well do language agents search for alternative ways to achieve their goals? We devise a specialized agentic planning benchmark to study this question. Each planning problem is solved via combinations of function calls. The agent searches for relevant functions from a set of over four thousand possibilities, and observes environmental feedback in the form of function outputs or error messages. Our benchmark confronts the agent with external failures in its workflow, such as functions that suddenly become unavailable. At the same time, even with the introduction of these failures, we guarantee that the task remains solvable. Ideally, an agent's performance on the planning task should not be affected by the presence of external failures. Overall, we find that language agents struggle to formulate and execute backup plans in response to environment feedback. While state-of-the-art models are often able to identify the correct function to use in the right context, they struggle to adapt to feedback from the environment and often fail to pursue alternate courses of action, even when the search space is artificially restricted. We provide a systematic analysis of the failures of both open-source and commercial models, examining the effects of search space size, as well as the benefits of scaling model size in our setting. Our analysis identifies key challenges for current generative models as well as promising directions for future work.

Title: MedSAMix: A Training-Free Model Merging Approach for Medical Image Segmentation

Authors: Yanwu Yang, Guinan Su, Jiesi Hu, Francesco Sammarco, Jonas Geiping, Thomas Wolfers
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.11032
Pdf URL: https://arxiv.org/pdf/2508.11032
Copy Paste: [[2508.11032]] MedSAMix: A Training-Free Model Merging Approach for Medical Image Segmentation(https://arxiv.org/abs/2508.11032)
Keywords: segmentation
Abstract: Universal medical image segmentation models have emerged as a promising paradigm due to their strong generalizability across diverse tasks, showing great potential for a wide range of clinical applications. This potential has been partly driven by the success of general-purpose vision models such as the Segment Anything Model (SAM), which has inspired the development of various fine-tuned variants for medical segmentation tasks. However, fine-tuned variants like MedSAM are trained on comparatively limited medical imaging data that often suffers from heterogeneity, scarce annotations, and distributional shifts. These challenges limit their ability to generalize across a wide range of medical segmentation tasks. In this regard, we propose MedSAMix, a training-free model merging method that integrates the strengths of both generalist models (e.g., SAM) and specialist models (e.g., MedSAM) for medical image segmentation. In contrast to traditional model merging approaches that rely on manual configuration and often result in suboptimal outcomes, we propose a zero-order optimization method to automatically discover optimal layer-wise merging solutions. Furthermore, for clinical applications, we develop two regimes to meet the demand of domain-specificity and generalizability in different scenarios by single-task optimization and multi-objective optimization respectively. Extensive evaluations on 25 medical segmentation tasks demonstrate that MedSAMix effectively mitigates model bias and consistently improves performance in both domain-specific accuracy and generalization, achieving improvements of 6.67% on specialized tasks and 4.37% on multi-task evaluations.

Title: SHLIME: Foiling adversarial attacks fooling SHAP and LIME

Authors: Sam Chauhan, Estelle Duguet, Karthik Ramakrishnan, Hugh Van Deventer, Jack Kruger, Ranjan Subbaraman
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2508.11053
Pdf URL: https://arxiv.org/pdf/2508.11053
Copy Paste: [[2508.11053]] SHLIME: Foiling adversarial attacks fooling SHAP and LIME(https://arxiv.org/abs/2508.11053)
Keywords: attack, robust
Abstract: Post hoc explanation methods, such as LIME and SHAP, provide interpretable insights into black-box classifiers and are increasingly used to assess model biases and generalizability. However, these methods are vulnerable to adversarial manipulation, potentially concealing harmful biases. Building on the work of Slack et al. (2020), we investigate the susceptibility of LIME and SHAP to biased models and evaluate strategies for improving robustness. We first replicate the original COMPAS experiment to validate prior findings and establish a baseline. We then introduce a modular testing framework enabling systematic evaluation of augmented and ensemble explanation approaches across classifiers of varying performance. Using this framework, we assess multiple LIME/SHAP ensemble configurations on out-of-distribution models, comparing their resistance to bias concealment against the original methods. Our results identify configurations that substantially improve bias detection, highlighting their potential for enhancing transparency in the deployment of high-stakes machine learning systems.

Title: BIPOLAR: Polarization-based granular framework for LLM bias evaluation

Authors: Martin Pavlíček, Tomáš Filip, Petr Sosík
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.11061
Pdf URL: https://arxiv.org/pdf/2508.11061
Copy Paste: [[2508.11061]] BIPOLAR: Polarization-based granular framework for LLM bias evaluation(https://arxiv.org/abs/2508.11061)
Keywords: large language model
Abstract: Large language models (LLMs) are known to exhibit biases in downstream tasks, especially when dealing with sensitive topics such as political discourse, gender identity, ethnic relations, or national stereotypes. Although significant progress has been made in bias detection and mitigation techniques, certain challenges remain underexplored. This study proposes a reusable, granular, and topic-agnostic framework to evaluate polarisation-related biases in LLM (both open-source and closed-source). Our approach combines polarisation-sensitive sentiment metrics with a synthetically generated balanced dataset of conflict-related statements, using a predefined set of semantic categories. As a case study, we created a synthetic dataset that focusses on the Russia-Ukraine war, and we evaluated the bias in several LLMs: Llama-3, Mistral, GPT-4, Claude 3.5, and Gemini 1.0. Beyond aggregate bias scores, with a general trend for more positive sentiment toward Ukraine, the framework allowed fine-grained analysis with considerable variation between semantic categories, uncovering divergent behavioural patterns among models. Adaptation to prompt modifications showed further bias towards preconceived language and citizenship modification. Overall, the framework supports automated dataset generation and fine-grained bias assessment, is applicable to a variety of polarisation-driven scenarios and topics, and is orthogonal to many other bias-evaluation strategies.

Title: Data-Driven Abdominal Phenotypes of Type 2 Diabetes in Lean, Overweight, and Obese Cohorts

Authors: Lucas W. Remedios, Chloe Choe, Trent M. Schwartz, Dingjie Su, Gaurav Rudravaram, Chenyu Gao, Aravind R. Krishnan, Adam M. Saunders, Michael E. Kim, Shunxing Bao, Alvin C. Powers, Bennett A. Landman, John Virostko
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.11063
Pdf URL: https://arxiv.org/pdf/2508.11063
Copy Paste: [[2508.11063]] Data-Driven Abdominal Phenotypes of Type 2 Diabetes in Lean, Overweight, and Obese Cohorts(https://arxiv.org/abs/2508.11063)
Keywords: protect, segmentation
Abstract: Purpose: Although elevated BMI is a well-known risk factor for type 2 diabetes, the disease's presence in some lean adults and absence in others with obesity suggests that detailed body composition may uncover abdominal phenotypes of type 2 diabetes. With AI, we can now extract detailed measurements of size, shape, and fat content from abdominal structures in 3D clinical imaging at scale. This creates an opportunity to empirically define body composition signatures linked to type 2 diabetes risk and protection using large-scale clinical data. Approach: To uncover BMI-specific diabetic abdominal patterns from clinical CT, we applied our design four times: once on the full cohort (n = 1,728) and once on lean (n = 497), overweight (n = 611), and obese (n = 620) subgroups separately. Briefly, our experimental design transforms abdominal scans into collections of explainable measurements through segmentation, classifies type 2 diabetes through a cross-validated random forest, measures how features contribute to model-estimated risk or protection through SHAP analysis, groups scans by shared model decision patterns (clustering from SHAP) and links back to anatomical differences (classification). Results: The random-forests achieved mean AUCs of 0.72-0.74. There were shared type 2 diabetes signatures in each group; fatty skeletal muscle, older age, greater visceral and subcutaneous fat, and a smaller or fat-laden pancreas. Univariate logistic regression confirmed the direction of 14-18 of the top 20 predictors within each subgroup (p < 0.05). Conclusions: Our findings suggest that abdominal drivers of type 2 diabetes may be consistent across weight classes.

Title: Approaching the Source of Symbol Grounding with Confluent Reductions of Abstract Meaning Representation Directed Graphs

Authors: Nicolas Goulet, Alexandre Blondin Massé, Moussa Abdendi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.11068
Pdf URL: https://arxiv.org/pdf/2508.11068
Copy Paste: [[2508.11068]] Approaching the Source of Symbol Grounding with Confluent Reductions of Abstract Meaning Representation Directed Graphs(https://arxiv.org/abs/2508.11068)
Keywords: large language model
Abstract: Abstract meaning representation (AMR) is a semantic formalism used to represent the meaning of sentences as directed acyclic graphs. In this paper, we describe how real digital dictionaries can be embedded into AMR directed graphs (digraphs), using state-of-the-art pre-trained large language models. Then, we reduce those graphs in a confluent manner, i.e. with transformations that preserve their circuit space. Finally, the properties of these reduces digraphs are analyzed and discussed in relation to the symbol grounding problem.

Title: Abundance-Aware Set Transformer for Microbiome Sample Embedding

Authors: Hyunwoo Yoo, Gail Rosen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.11075
Pdf URL: https://arxiv.org/pdf/2508.11075
Copy Paste: [[2508.11075]] Abundance-Aware Set Transformer for Microbiome Sample Embedding(https://arxiv.org/abs/2508.11075)
Keywords: robust, transformer
Abstract: Microbiome sample representation to input into LLMs is essential for downstream tasks such as phenotype prediction and environmental classification. While prior studies have explored embedding-based representations of each microbiome sample, most rely on simple averaging over sequence embeddings, often overlooking the biological importance of taxa abundance. In this work, we propose an abundance-aware variant of the Set Transformer to construct fixed-size sample-level embeddings by weighting sequence embeddings according to their relative abundance. Without modifying the model architecture, we replicate embedding vectors proportional to their abundance and apply self-attention-based aggregation. Our method outperforms average pooling and unweighted Set Transformers on real-world microbiome classification tasks, achieving perfect performance in some cases. These results demonstrate the utility of abundance-aware aggregation for robust and biologically informed microbiome representation. To the best of our knowledge, this is one of the first approaches to integrate sequence-level abundance into Transformer-based sample embeddings.

Title: Relative Advantage Debiasing for Watch-Time Prediction in Short-Video Recommendation

Authors: Emily Liu, Kuan Han, Minfeng Zhan, Bocheng Zhao, Guanyu Mu, Yang Song
Subjects: cs.LG, cs.IR
Abstract URL: https://arxiv.org/abs/2508.11086
Pdf URL: https://arxiv.org/pdf/2508.11086
Copy Paste: [[2508.11086]] Relative Advantage Debiasing for Watch-Time Prediction in Short-Video Recommendation(https://arxiv.org/abs/2508.11086)
Keywords: robust
Abstract: Watch time is widely used as a proxy for user satisfaction in video recommendation platforms. However, raw watch times are influenced by confounding factors such as video duration, popularity, and individual user behaviors, potentially distorting preference signals and resulting in biased recommendation models. We propose a novel relative advantage debiasing framework that corrects watch time by comparing it to empirically derived reference distributions conditioned on user and item groups. This approach yields a quantile-based preference signal and introduces a two-stage architecture that explicitly separates distribution estimation from preference learning. Additionally, we present distributional embeddings to efficiently parameterize watch-time quantiles without requiring online sampling or storage of historical data. Both offline and online experiments demonstrate significant improvements in recommendation accuracy and robustness compared to existing baseline methods.

Title: Compressive Meta-Learning

Authors: Daniel Mas Montserrat, David Bonet, Maria Perera, Xavier Giró-i-Nieto, Alexander G. Ioannidis
Subjects: cs.LG, cs.AI, cs.CE, cs.DB
Abstract URL: https://arxiv.org/abs/2508.11090
Pdf URL: https://arxiv.org/pdf/2508.11090
Copy Paste: [[2508.11090]] Compressive Meta-Learning(https://arxiv.org/abs/2508.11090)
Keywords: privacy
Abstract: The rapid expansion in the size of new datasets has created a need for fast and efficient parameter-learning techniques. Compressive learning is a framework that enables efficient processing by using random, non-linear features to project large-scale databases onto compact, information-preserving representations whose dimensionality is independent of the number of samples and can be easily stored, transferred, and processed. These database-level summaries are then used to decode parameters of interest from the underlying data distribution without requiring access to the original samples, offering an efficient and privacy-friendly learning framework. However, both the encoding and decoding techniques are typically randomized and data-independent, failing to exploit the underlying structure of the data. In this work, we propose a framework that meta-learns both the encoding and decoding stages of compressive learning methods by using neural networks that provide faster and more accurate systems than the current state-of-the-art approaches. To demonstrate the potential of the presented Compressive Meta-Learning framework, we explore multiple applications -- including neural network-based compressive PCA, compressive ridge regression, compressive k-means, and autoencoders.

Title: HierOctFusion: Multi-scale Octree-based 3D Shape Generation via Part-Whole-Hierarchy Message Passing

Authors: Xinjie Gao, Bi'an Du, Wei Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.11106
Pdf URL: https://arxiv.org/pdf/2508.11106
Copy Paste: [[2508.11106]] HierOctFusion: Multi-scale Octree-based 3D Shape Generation via Part-Whole-Hierarchy Message Passing(https://arxiv.org/abs/2508.11106)
Keywords: diffusion, segmentation
Abstract: 3D content generation remains a fundamental yet challenging task due to the inherent structural complexity of 3D data. While recent octree-based diffusion models offer a promising balance between efficiency and quality through hierarchical generation, they often overlook two key insights: 1) existing methods typically model 3D objects as holistic entities, ignoring their semantic part hierarchies and limiting generalization; and 2) holistic high-resolution modeling is computationally expensive, whereas real-world objects are inherently sparse and hierarchical, making them well-suited for layered generation. Motivated by these observations, we propose HierOctFusion, a part-aware multi-scale octree diffusion model that enhances hierarchical feature interaction for generating fine-grained and sparse object structures. Furthermore, we introduce a cross-attention conditioning mechanism that injects part-level information into the generation process, enabling semantic features to propagate effectively across hierarchical levels from parts to the whole. Additionally, we construct a 3D dataset with part category annotations using a pre-trained segmentation model to facilitate training and evaluation. Experiments demonstrate that HierOctFusion achieves superior shape quality and efficiency compared to prior methods.

Title: UWB-PostureGuard: A Privacy-Preserving RF Sensing System for Continuous Ergonomic Sitting Posture Monitoring

Authors: Haotang Li, Zhenyu Qi, Sen He, Kebin Peng, Sheng Tan, Yili Ren, Tomas Cerny, Jiyue Zhao, Zi Wang
Subjects: cs.CV, cs.HC, eess.SP
Abstract URL: https://arxiv.org/abs/2508.11115
Pdf URL: https://arxiv.org/pdf/2508.11115
Copy Paste: [[2508.11115]] UWB-PostureGuard: A Privacy-Preserving RF Sensing System for Continuous Ergonomic Sitting Posture Monitoring(https://arxiv.org/abs/2508.11115)
Keywords: privacy, robust
Abstract: Improper sitting posture during prolonged computer use has become a significant public health concern. Traditional posture monitoring solutions face substantial barriers, including privacy concerns with camera-based systems and user discomfort with wearable sensors. This paper presents UWB-PostureGuard, a privacy-preserving ultra-wideband (UWB) sensing system that advances mobile technologies for preventive health management through continuous, contactless monitoring of ergonomic sitting posture. Our system leverages commercial UWB devices, utilizing comprehensive feature engineering to extract multiple ergonomic sitting posture features. We develop PoseGBDT to effectively capture temporal dependencies in posture patterns, addressing limitations of traditional frame-wise classification approaches. Extensive real-world evaluation across 10 participants and 19 distinct postures demonstrates exceptional performance, achieving 99.11% accuracy while maintaining robustness against environmental variables such as clothing thickness, additional devices, and furniture configurations. Our system provides a scalable, privacy-preserving mobile health solution on existing platforms for proactive ergonomic management, improving quality of life at low costs.

Title: Towards Reliable Multi-Agent Systems for Marketing Applications via Reflection, Memory, and Planning

Authors: Lorenzo Jaime Yu Flores, Junyi Shen, Xiaoyuan Gu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.11120
Pdf URL: https://arxiv.org/pdf/2508.11120
Copy Paste: [[2508.11120]] Towards Reliable Multi-Agent Systems for Marketing Applications via Reflection, Memory, and Planning(https://arxiv.org/abs/2508.11120)
Keywords: large language model
Abstract: Recent advances in large language models (LLMs) enabled the development of AI agents that can plan and interact with tools to complete complex tasks. However, literature on their reliability in real-world applications remains limited. In this paper, we introduce a multi-agent framework for a marketing task: audience curation. To solve this, we introduce a framework called RAMP that iteratively plans, calls tools, verifies the output, and generates suggestions to improve the quality of the audience generated. Additionally, we equip the model with a long-term memory store, which is a knowledge base of client-specific facts and past queries. Overall, we demonstrate the use of LLM planning and memory, which increases accuracy by 28 percentage points on a set of 88 evaluation queries. Moreover, we show the impact of iterative verification and reflection on more ambiguous queries, showing progressively better recall (roughly +20 percentage points) with more verify/reflect iterations on a smaller challenge set, and higher user satisfaction. Our results provide practical insights for deploying reliable LLM-based systems in dynamic, industry-facing environments.

Title: MoNaCo: More Natural and Complex Questions for Reasoning Across Dozens of Documents

Authors: Tomer Wolfson, Harsh Trivedi, Mor Geva, Yoav Goldberg, Dan Roth, Tushar Khot, Ashish Sabharwal, Reut Tsarfaty
Subjects: cs.CL, cs.AI, cs.DB
Abstract URL: https://arxiv.org/abs/2508.11133
Pdf URL: https://arxiv.org/pdf/2508.11133
Copy Paste: [[2508.11133]] MoNaCo: More Natural and Complex Questions for Reasoning Across Dozens of Documents(https://arxiv.org/abs/2508.11133)
Keywords: large language model
Abstract: Large language models (LLMs) are emerging as a go-to tool for querying information. However, current LLM benchmarks rarely feature natural questions that are both information-seeking as well as genuinely time-consuming for humans. To address this gap we introduce MoNaCo, a benchmark of 1,315 natural and complex questions that require dozens, and at times hundreds, of intermediate steps to solve -- far more than any existing QA benchmark. To build MoNaCo, we developed a decomposed annotation pipeline to elicit and manually answer natural time-consuming questions at scale. Frontier LLMs evaluated on MoNaCo achieve at most 61.2% F1, hampered by low recall and hallucinations. Our results underscore the need for reasoning models that better handle the complexity and sheer breadth of real-world information-seeking questions -- with MoNaCo providing an effective resource for tracking such progress. The MONACO benchmark, codebase, prompts and models predictions are publicly available at: this https URL

Title: Residual-based Efficient Bidirectional Diffusion Model for Image Dehazing and Haze Generation

Authors: Bing Liu, Le Wang, Hao Liu, Mingming Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.11134
Pdf URL: https://arxiv.org/pdf/2508.11134
Copy Paste: [[2508.11134]] Residual-based Efficient Bidirectional Diffusion Model for Image Dehazing and Haze Generation(https://arxiv.org/abs/2508.11134)
Keywords: diffusion
Abstract: Current deep dehazing methods only focus on removing haze from hazy images, lacking the capability to translate between hazy and haze-free images. To address this issue, we propose a residual-based efficient bidirectional diffusion model (RBDM) that can model the conditional distributions for both dehazing and haze generation. Firstly, we devise dual Markov chains that can effectively shift the residuals and facilitate bidirectional smooth transitions between them. Secondly, the RBDM perturbs the hazy and haze-free images at individual timesteps and predicts the noise in the perturbed data to simultaneously learn the conditional distributions. Finally, to enhance performance on relatively small datasets and reduce computational costs, our method introduces a unified score function learned on image patches instead of entire images. Our RBDM successfully implements size-agnostic bidirectional transitions between haze-free and hazy images with only 15 sampling steps. Extensive experiments demonstrate that the proposed method achieves superior or at least comparable performance to state-of-the-art methods on both synthetic and real-world datasets.

Title: Towards the Next-generation Bayesian Network Classifiers

Authors: Huan Zhang, Daokun Zhang, Kexin Meng, Geoffrey I. Webb
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.11145
Pdf URL: https://arxiv.org/pdf/2508.11145
Copy Paste: [[2508.11145]] Towards the Next-generation Bayesian Network Classifiers(https://arxiv.org/abs/2508.11145)
Keywords: explainability
Abstract: Bayesian network classifiers provide a feasible solution to tabular data classification, with a number of merits like high time and memory efficiency, and great explainability. However, due to the parameter explosion and data sparsity issues, Bayesian network classifiers are restricted to low-order feature dependency modeling, making them struggle in extrapolating the occurrence probabilities of complex real-world data. In this paper, we propose a novel paradigm to design high-order Bayesian network classifiers, by learning distributional representations for feature values, as what has been done in word embedding and graph representation learning. The learned distributional representations are encoded with the semantic relatedness between different features through their observed co-occurrence patterns in training data, which then serve as a hallmark to extrapolate the occurrence probabilities of new test samples. As a classifier design realization, we remake the K-dependence Bayesian classifier (KDB) by extending it into a neural version, i.e., NeuralKDB, where a novel neural network architecture is designed to learn distributional representations of feature values and parameterize the conditional probabilities between interdependent features. A stochastic gradient descent based algorithm is designed to train the NeuralKDB model efficiently. Extensive classification experiments on 60 UCI datasets demonstrate that the proposed NeuralKDB classifier excels in capturing high-order feature dependencies and significantly outperforms the conventional Bayesian network classifiers, as well as other competitive classifiers, including two neural network based classifiers without distributional representation learning.

Title: LEARN: A Story-Driven Layout-to-Image Generation Framework for STEM Instruction

Authors: Maoquan Zhang, Bisser Raytchev, Xiujuan Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.11153
Pdf URL: https://arxiv.org/pdf/2508.11153
Copy Paste: [[2508.11153]] LEARN: A Story-Driven Layout-to-Image Generation Framework for STEM Instruction(https://arxiv.org/abs/2508.11153)
Keywords: diffusion, generative
Abstract: LEARN is a layout-aware diffusion framework designed to generate pedagogically aligned illustrations for STEM education. It leverages a curated BookCover dataset that provides narrative layouts and structured visual cues, enabling the model to depict abstract and sequential scientific concepts with strong semantic alignment. Through layout-conditioned generation, contrastive visual-semantic training, and prompt modulation, LEARN produces coherent visual sequences that support mid-to-high-level reasoning in line with Bloom's taxonomy while reducing extraneous cognitive load as emphasized by Cognitive Load Theory. By fostering spatially organized and story-driven narratives, the framework counters fragmented attention often induced by short-form media and promotes sustained conceptual focus. Beyond static diagrams, LEARN demonstrates potential for integration with multimodal systems and curriculum-linked knowledge graphs to create adaptive, exploratory educational content. As the first generative approach to unify layout-based storytelling, semantic structure learning, and cognitive scaffolding, LEARN represents a novel direction for generative AI in education. The code and dataset will be released to facilitate future research and practical deployment.

Title: Mitigating Modality Quantity and Quality Imbalance in Multimodal Online Federated Learning

Authors: Heqiang Wang, Weihong Yang, Xiaoxiong Zhong, Jia Zhou, Fangming Liu, Weizhe Zhang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.11159
Pdf URL: https://arxiv.org/pdf/2508.11159
Copy Paste: [[2508.11159]] Mitigating Modality Quantity and Quality Imbalance in Multimodal Online Federated Learning(https://arxiv.org/abs/2508.11159)
Keywords: federate
Abstract: The Internet of Things (IoT) ecosystem produces massive volumes of multimodal data from diverse sources, including sensors, cameras, and microphones. With advances in edge intelligence, IoT devices have evolved from simple data acquisition units into computationally capable nodes, enabling localized processing of heterogeneous multimodal data. This evolution necessitates distributed learning paradigms that can efficiently handle such data. Furthermore, the continuous nature of data generation and the limited storage capacity of edge devices demand an online learning framework. Multimodal Online Federated Learning (MMO-FL) has emerged as a promising approach to meet these requirements. However, MMO-FL faces new challenges due to the inherent instability of IoT devices, which often results in modality quantity and quality imbalance (QQI) during data collection. In this work, we systematically investigate the impact of QQI within the MMO-FL framework and present a comprehensive theoretical analysis quantifying how both types of imbalance degrade learning performance. To address these challenges, we propose the Modality Quantity and Quality Rebalanced (QQR) algorithm, a prototype learning based method designed to operate in parallel with the training process. Extensive experiments on two real-world multimodal datasets show that the proposed QQR algorithm consistently outperforms benchmarks under modality imbalance conditions with promising learning performance.

Title: MobQA: A Benchmark Dataset for Semantic Understanding of Human Mobility Data through Question Answering

Authors: Hikaru Asano, Hiroki Ouchi, Akira Kasuga, Ryo Yonetani
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.11163
Pdf URL: https://arxiv.org/pdf/2508.11163
Copy Paste: [[2508.11163]] MobQA: A Benchmark Dataset for Semantic Understanding of Human Mobility Data through Question Answering(https://arxiv.org/abs/2508.11163)
Keywords: extraction, large language model
Abstract: This paper presents MobQA, a benchmark dataset designed to evaluate the semantic understanding capabilities of large language models (LLMs) for human mobility data through natural language question answering. While existing models excel at predicting human movement patterns, it remains unobvious how much they can interpret the underlying reasons or semantic meaning of those patterns. MobQA provides a comprehensive evaluation framework for LLMs to answer questions about diverse human GPS trajectories spanning daily to weekly granularities. It comprises 5,800 high-quality question-answer pairs across three complementary question types: factual retrieval (precise data extraction), multiple-choice reasoning (semantic inference), and free-form explanation (interpretive description), which all require spatial, temporal, and semantic reasoning. Our evaluation of major LLMs reveals strong performance on factual retrieval but significant limitations in semantic reasoning and explanation question answering, with trajectory length substantially impacting model effectiveness. These findings demonstrate the achievements and limitations of state-of-the-art LLMs for semantic mobility understanding.\footnote{MobQA dataset is available at this https URL.}

Title: Semi-supervised Image Dehazing via Expectation-Maximization and Bidirectional Brownian Bridge Diffusion Models

Authors: Bing Liu, Le Wang, Mingming Liu, Hao Liu, Rui Yao, Yong Zhou, Peng Liu, Tongqiang Xia
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.11165
Pdf URL: https://arxiv.org/pdf/2508.11165
Copy Paste: [[2508.11165]] Semi-supervised Image Dehazing via Expectation-Maximization and Bidirectional Brownian Bridge Diffusion Models(https://arxiv.org/abs/2508.11165)
Keywords: robust, diffusion
Abstract: Existing dehazing methods deal with real-world haze images with difficulty, especially scenes with thick haze. One of the main reasons is the lack of real-world paired data and robust priors. To avoid the costly collection of paired hazy and clear images, we propose an efficient semi-supervised image dehazing method via Expectation-Maximization and Bidirectional Brownian Bridge Diffusion Models (EM-B3DM) with a two-stage learning scheme. In the first stage, we employ the EM algorithm to decouple the joint distribution of paired hazy and clear images into two conditional distributions, which are then modeled using a unified Brownian Bridge diffusion model to directly capture the structural and content-related correlations between hazy and clear images. In the second stage, we leverage the pre-trained model and large-scale unpaired hazy and clear images to further improve the performance of image dehazing. Additionally, we introduce a detail-enhanced Residual Difference Convolution block (RDC) to capture gradient-level information, significantly enhancing the model's representation capability. Extensive experiments demonstrate that our EM-B3DM achieves superior or at least comparable performance to state-of-the-art methods on both synthetic and real-world datasets.

Title: Overcoming Low-Resource Barriers in Tulu: Neural Models and Corpus Creation for OffensiveLanguage Identification

Authors: Anusha M D, Deepthi Vikram, Bharathi Raja Chakravarthi, Parameshwar R Hegde
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.11166
Pdf URL: https://arxiv.org/pdf/2508.11166
Copy Paste: [[2508.11166]] Overcoming Low-Resource Barriers in Tulu: Neural Models and Corpus Creation for OffensiveLanguage Identification(https://arxiv.org/abs/2508.11166)
Keywords: transformer
Abstract: Tulu, a low-resource Dravidian language predominantly spoken in southern India, has limited computational resources despite its growing digital presence. This study presents the first benchmark dataset for Offensive Language Identification (OLI) in code-mixed Tulu social media content, collected from YouTube comments across various domains. The dataset, annotated with high inter-annotator agreement (Krippendorff's alpha = 0.984), includes 3,845 comments categorized into four classes: Not Offensive, Not Tulu, Offensive Untargeted, and Offensive Targeted. We evaluate a suite of deep learning models, including GRU, LSTM, BiGRU, BiLSTM, CNN, and attention-based variants, alongside transformer architectures (mBERT, XLM-RoBERTa). The BiGRU model with self-attention achieves the best performance with 82% accuracy and a 0.81 macro F1-score. Transformer models underperform, highlighting the limitations of multilingual pretraining in code-mixed, under-resourced contexts. This work lays the foundation for further NLP research in Tulu and similar low-resource, code-mixed languages.

Title: VFM-Guided Semi-Supervised Detection Transformer for Source-Free Object Detection in Remote Sensing Images

Authors: Jianhong Han, Yupei Wang, Liang Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.11167
Pdf URL: https://arxiv.org/pdf/2508.11167
Copy Paste: [[2508.11167]] VFM-Guided Semi-Supervised Detection Transformer for Source-Free Object Detection in Remote Sensing Images(https://arxiv.org/abs/2508.11167)
Keywords: privacy, robust, extraction, transformer
Abstract: Unsupervised domain adaptation methods have been widely explored to bridge domain gaps. However, in real-world remote-sensing scenarios, privacy and transmission constraints often preclude access to source domain data, which limits their practical applicability. Recently, Source-Free Object Detection (SFOD) has emerged as a promising alternative, aiming at cross-domain adaptation without relying on source data, primarily through a self-training paradigm. Despite its potential, SFOD frequently suffers from training collapse caused by noisy pseudo-labels, especially in remote sensing imagery with dense objects and complex backgrounds. Considering that limited target domain annotations are often feasible in practice, we propose a Vision foundation-Guided DEtection TRansformer (VG-DETR), built upon a semi-supervised framework for SFOD in remote sensing images. VG-DETR integrates a Vision Foundation Model (VFM) into the training pipeline in a "free lunch" manner, leveraging a small amount of labeled target data to mitigate pseudo-label noise while improving the detector's feature-extraction capability. Specifically, we introduce a VFM-guided pseudo-label mining strategy that leverages the VFM's semantic priors to further assess the reliability of the generated pseudo-labels. By recovering potentially correct predictions from low-confidence outputs, our strategy improves pseudo-label quality and quantity. In addition, a dual-level VFM-guided alignment method is proposed, which aligns detector features with VFM embeddings at both the instance and image levels. Through contrastive learning among fine-grained prototypes and similarity matching between feature maps, this dual-level alignment further enhances the robustness of feature representations against domain gaps. Extensive experiments demonstrate that VG-DETR achieves superior performance in source-free remote sensing detection tasks.

Title: A Semi-supervised Generative Model for Incomplete Multi-view Data Integration with Missing Labels

Authors: Yiyang Shen, Weiran Wang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.11180
Pdf URL: https://arxiv.org/pdf/2508.11180
Copy Paste: [[2508.11180]] A Semi-supervised Generative Model for Incomplete Multi-view Data Integration with Missing Labels(https://arxiv.org/abs/2508.11180)
Keywords: extraction, generative
Abstract: Multi-view learning is widely applied to real-life datasets, such as multiple omics biological data, but it often suffers from both missing views and missing labels. Prior probabilistic approaches addressed the missing view problem by using a product-of-experts scheme to aggregate representations from present views and achieved superior performance over deterministic classifiers, using the information bottleneck (IB) principle. However, the IB framework is inherently fully supervised and cannot leverage unlabeled data. In this work, we propose a semi-supervised generative model that utilizes both labeled and unlabeled samples in a unified framework. Our method maximizes the likelihood of unlabeled samples to learn a latent space shared with the IB on labeled data. We also perform cross-view mutual information maximization in the latent space to enhance the extraction of shared information across views. Compared to existing approaches, our model achieves better predictive and imputation performance on both image and multi-omics data with missing views and limited labeled samples.

Title: Versatile Video Tokenization with Generative 2D Gaussian Splatting

Authors: Zhenghao Chen, Zicong Chen, Lei Liu, Yiming Wu, Dong Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.11183
Pdf URL: https://arxiv.org/pdf/2508.11183
Copy Paste: [[2508.11183]] Versatile Video Tokenization with Generative 2D Gaussian Splatting(https://arxiv.org/abs/2508.11183)
Keywords: transformer, generative
Abstract: Video tokenization procedure is critical for a wide range of video processing tasks. Most existing approaches directly transform video into fixed-grid and patch-wise tokens, which exhibit limited versatility. Spatially, uniformly allocating a fixed number of tokens often leads to over-encoding in low-information regions. Temporally, reducing redundancy remains challenging without explicitly distinguishing between static and dynamic content. In this work, we propose the Gaussian Video Transformer (GVT), a versatile video tokenizer built upon a generative 2D Gaussian Splatting (2DGS) strategy. We first extract latent rigid features from a video clip and represent them with a set of 2D Gaussians generated by our proposed Spatio-Temporal Gaussian Embedding (STGE) mechanism in a feed-forward manner. Such generative 2D Gaussians not only enhance spatial adaptability by assigning higher (resp., lower) rendering weights to regions with higher (resp., lower) information content during rasterization, but also improve generalization by avoiding per-video this http URL enhance the temporal versatility, we introduce a Gaussian Set Partitioning (GSP) strategy that separates the 2D Gaussians into static and dynamic sets, which explicitly model static content shared across different time-steps and dynamic content specific to each time-step, enabling a compact this http URL primarily evaluate GVT on the video reconstruction, while also assessing its performance on action recognition and compression using the UCF101, Kinetics, and DAVIS datasets. Extensive experiments demonstrate that GVT achieves a state-of-the-art video reconstruction quality, outperforms the baseline MAGVIT-v2 in action recognition, and delivers comparable compression performance.

Title: Personalized Distractor Generation via MCTS-Guided Reasoning Reconstruction

Authors: Tao Wu, Jingyuan Chen, Wang Lin, Jian Zhan, Mengze Li, Kun Kuang, Fei Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.11184
Pdf URL: https://arxiv.org/pdf/2508.11184
Copy Paste: [[2508.11184]] Personalized Distractor Generation via MCTS-Guided Reasoning Reconstruction(https://arxiv.org/abs/2508.11184)
Keywords: robust, large language model
Abstract: Distractors, incorrect but plausible answer choices in multiple-choice questions (MCQs), play a critical role in educational assessment by diagnosing student misconceptions. Recent work has leveraged large language models (LLMs) to generate shared, group-level distractors by learning common error patterns across large student populations. However, such distractors often fail to capture the diverse reasoning errors of individual students, limiting their diagnostic effectiveness. To address this limitation, we introduce the task of personalized distractor generation, which aims to generate tailored distractors based on individual misconceptions inferred from each student's past question-answering (QA) records, ensuring every student receives options that effectively exposes their specific reasoning errors. While promising, this task is challenging because each student typically has only a few QA records, which often lack the student's underlying reasoning processes, making training-based group-level approaches infeasible. To overcome this, we propose a training-free two-stage framework. In the first stage, we construct a student-specific misconception prototype by applying Monte Carlo Tree Search (MCTS) to recover the student's reasoning trajectories from past incorrect answers. In the second stage, this prototype guides the simulation of the student's reasoning on new questions, enabling the generation of personalized distractors that align with the student's recurring misconceptions. Experiments show that our approach achieves the best performance in generating plausible, personalized distractors for 140 students, and also effectively generalizes to group-level settings, highlighting its robustness and adaptability.

Title: CHARM3R: Towards Unseen Camera Height Robust Monocular 3D Detector

Authors: Abhinav Kumar, Yuliang Guo, Zhihao Zhang, Xinyu Huang, Liu Ren, Xiaoming Liu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2508.11185
Pdf URL: https://arxiv.org/pdf/2508.11185
Copy Paste: [[2508.11185]] CHARM3R: Towards Unseen Camera Height Robust Monocular 3D Detector(https://arxiv.org/abs/2508.11185)
Keywords: robust
Abstract: Monocular 3D object detectors, while effective on data from one ego camera height, struggle with unseen or out-of-distribution camera heights. Existing methods often rely on Plucker embeddings, image transformations or data augmentation. This paper takes a step towards this understudied problem by first investigating the impact of camera height variations on state-of-the-art (SoTA) Mono3D models. With a systematic analysis on the extended CARLA dataset with multiple camera heights, we observe that depth estimation is a primary factor influencing performance under height variations. We mathematically prove and also empirically observe consistent negative and positive trends in mean depth error of regressed and ground-based depth models, respectively, under camera height changes. To mitigate this, we propose Camera Height Robust Monocular 3D Detector (CHARM3R), which averages both depth estimates within the model. CHARM3R improves generalization to unseen camera heights by more than $45\%$, achieving SoTA performance on the CARLA dataset. Codes and Models at this https URL

Title: Quantum-Boosted High-Fidelity Deep Learning

Authors: Feng-ao Wang, Shaobo Chen, Yao Xuan, Junwei Liu, Qi Gao, Hongdong Zhu, Junjie Hou, Lixin Yuan, Jinyu Cheng, Chenxin Yi, Hai Wei, Yin Ma, Tao Xu, Kai Wen, Yixue Li
Subjects: cs.LG, cs.AI, q-bio.GN
Abstract URL: https://arxiv.org/abs/2508.11190
Pdf URL: https://arxiv.org/pdf/2508.11190
Copy Paste: [[2508.11190]] Quantum-Boosted High-Fidelity Deep Learning(https://arxiv.org/abs/2508.11190)
Keywords: generative
Abstract: A fundamental limitation of probabilistic deep learning is its predominant reliance on Gaussian priors. This simplistic assumption prevents models from accurately capturing the complex, non-Gaussian landscapes of natural data, particularly in demanding domains like complex biological data, severely hindering the fidelity of the model for scientific discovery. The physically-grounded Boltzmann distribution offers a more expressive alternative, but it is computationally intractable on classical computers. To date, quantum approaches have been hampered by the insufficient qubit scale and operational stability required for the iterative demands of deep learning. Here, we bridge this gap by introducing the Quantum Boltzmann Machine-Variational Autoencoder (QBM-VAE), a large-scale and long-time stable hybrid quantum-classical architecture. Our framework leverages a quantum processor for efficient sampling from the Boltzmann distribution, enabling its use as a powerful prior within a deep generative model. Applied to million-scale single-cell datasets from multiple sources, the QBM-VAE generates a latent space that better preserves complex biological structures, consistently outperforming conventional Gaussian-based deep learning models like VAE and SCVI in essential tasks such as omics data integration, cell-type classification, and trajectory inference. It also provides a typical example of introducing a physics priori into deep learning to drive the model to acquire scientific discovery capabilities that breaks through data limitations. This work provides the demonstration of a practical quantum advantage in deep learning on a large-scale scientific problem and offers a transferable blueprint for developing hybrid quantum AI models.

Title: Generating Dialogues from Egocentric Instructional Videos for Task Assistance: Dataset, Method and Benchmark

Authors: Lavisha Aggarwal, Vikas Bahirwani, Lin Li, Andrea Colaco
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.11192
Pdf URL: https://arxiv.org/pdf/2508.11192
Copy Paste: [[2508.11192]] Generating Dialogues from Egocentric Instructional Videos for Task Assistance: Dataset, Method and Benchmark(https://arxiv.org/abs/2508.11192)
Keywords: large language model
Abstract: Many everyday tasks ranging from fixing appliances, cooking recipes to car maintenance require expert knowledge, especially when tasks are complex and multi-step. Despite growing interest in AI agents, there is a scarcity of dialogue-video datasets grounded for real world task assistance. In this paper, we propose a simple yet effective approach that transforms single-person instructional videos into task-guidance two-person dialogues, aligned with fine grained steps and video-clips. Our fully automatic approach, powered by large language models, offers an efficient alternative to the substantial cost and effort required for human-assisted data collection. Using this technique, we build HowToDIV, a large-scale dataset containing 507 conversations, 6636 question-answer pairs and 24 hours of videoclips across diverse tasks in cooking, mechanics, and planting. Each session includes multi-turn conversation where an expert teaches a novice user how to perform a task step by step, while observing user's surrounding through a camera and microphone equipped wearable device. We establish the baseline benchmark performance on HowToDIV dataset through Gemma-3 model for future research on this new task of dialogues for procedural-task assistance.

Title: UAV-VL-R1: Generalizing Vision-Language Models via Supervised Fine-Tuning and Multi-Stage GRPO for UAV Visual Reasoning

Authors: Jiajin Guan (1), Haibo Mei (2), Bonan Zhang (1), Dan Liu (1), Yuanshuang Fu (1), Yue Zhang (2) ((1) Research Institute of Electronic Science and Technology, University of Electronic Science and Technology of China, Chengdu, China, (2) School of Aeronautics and Astronautics, University of Electronic Science and Technology of China, Chengdu, China)
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.11196
Pdf URL: https://arxiv.org/pdf/2508.11196
Copy Paste: [[2508.11196]] UAV-VL-R1: Generalizing Vision-Language Models via Supervised Fine-Tuning and Multi-Stage GRPO for UAV Visual Reasoning(https://arxiv.org/abs/2508.11196)
Keywords: robust
Abstract: Recent advances in vision-language models (VLMs) have demonstrated strong generalization in natural image tasks. However, their performance often degrades on unmanned aerial vehicle (UAV)-based aerial imagery, which features high resolution, complex spatial semantics, and strict real-time constraints. These challenges limit the applicability of general-purpose VLMs to structured aerial reasoning tasks. To address these challenges, we propose UAV-VL-R1, a lightweight VLM explicitly designed for aerial visual reasoning. It is trained using a hybrid method that combines supervised fine-tuning (SFT) and multi-stage reinforcement learning (RL). We leverage the group relative policy optimization (GRPO) algorithm to promote structured and interpretable reasoning through rule-guided rewards and intra-group policy alignment. To support model training and evaluation, we introduce a high-resolution visual question answering dataset named HRVQA-VL, which consists of 50,019 annotated samples covering eight UAV-relevant reasoning tasks, including object counting, transportation recognition, and spatial scene inference. Experimental results show that UAV-VL-R1 achieves a 48.17% higher zero-shot accuracy than the Qwen2-VL-2B-Instruct baseline and even outperforms its 72B-scale variant, which is 36x larger, on multiple tasks. Ablation studies reveal that while SFT improves semantic alignment, it may reduce reasoning diversity in mathematical tasks. GRPO-based RL compensates for this limitation by enhancing logical flexibility and the robustness of inference. Additionally, UAV-VL-R1 requires only 3.9GB of memory under FP16 inference and can be quantized to 2.5GB with INT8, supporting real-time deployment on resource-constrained UAV platforms.

Title: E-CaTCH: Event-Centric Cross-Modal Attention with Temporal Consistency and Class-Imbalance Handling for Misinformation Detection

Authors: Ahmad Mousavi, Yeganeh Abdollahinejad, Roberto Corizzo, Nathalie Japkowicz, Zois Boukouvalas
Subjects: cs.CL, cs.AI, cs.LG, cs.SI
Abstract URL: https://arxiv.org/abs/2508.11197
Pdf URL: https://arxiv.org/pdf/2508.11197
Copy Paste: [[2508.11197]] E-CaTCH: Event-Centric Cross-Modal Attention with Temporal Consistency and Class-Imbalance Handling for Misinformation Detection(https://arxiv.org/abs/2508.11197)
Keywords: robust
Abstract: Detecting multimodal misinformation on social media remains challenging due to inconsistencies between modalities, changes in temporal patterns, and substantial class imbalance. Many existing methods treat posts independently and fail to capture the event-level structure that connects them across time and modality. We propose E-CaTCH, an interpretable and scalable framework for robustly detecting misinformation. If needed, E-CaTCH clusters posts into pseudo-events based on textual similarity and temporal proximity, then processes each event independently. Within each event, textual and visual features are extracted using pre-trained BERT and ResNet encoders, refined via intra-modal self-attention, and aligned through bidirectional cross-modal attention. A soft gating mechanism fuses these representations to form contextualized, content-aware embeddings of each post. To model temporal evolution, E-CaTCH segments events into overlapping time windows and uses a trend-aware LSTM, enhanced with semantic shift and momentum signals, to encode narrative progression over time. Classification is performed at the event level, enabling better alignment with real-world misinformation dynamics. To address class imbalance and promote stable learning, the model integrates adaptive class weighting, temporal consistency regularization, and hard-example mining. The total loss is aggregated across all events. Extensive experiments on Fakeddit, IND, and COVID-19 MISINFOGRAPH demonstrate that E-CaTCH consistently outperforms state-of-the-art baselines. Cross-dataset evaluations further demonstrate its robustness, generalizability, and practical applicability across diverse misinformation scenarios.

Title: A Coarse-to-Fine Human Pose Estimation Method based on Two-stage Distillation and Progressive Graph Neural Network

Authors: Zhangjian Ji, Wenjin Zhang, Shaotong Qiao, Kai Feng, Yuhua Qian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.11212
Pdf URL: https://arxiv.org/pdf/2508.11212
Copy Paste: [[2508.11212]] A Coarse-to-Fine Human Pose Estimation Method based on Two-stage Distillation and Progressive Graph Neural Network(https://arxiv.org/abs/2508.11212)
Keywords: robust
Abstract: Human pose estimation has been widely applied in the human-centric understanding and generation, but most existing state-of-the-art human pose estimation methods require heavy computational resources for accurate predictions. In order to obtain an accurate, robust yet lightweight human pose estimator, one feasible way is to transfer pose knowledge from a powerful teacher model to a less-parameterized student model by knowledge distillation. However, the traditional knowledge distillation framework does not fully explore the contextual information among human joints. Thus, in this paper, we propose a novel coarse-to-fine two-stage knowledge distillation framework for human pose estimation. In the first-stage distillation, we introduce the human joints structure loss to mine the structural information among human joints so as to transfer high-level semantic knowledge from the teacher model to the student model. In the second-stage distillation, we utilize an Image-Guided Progressive Graph Convolutional Network (IGP-GCN) to refine the initial human pose obtained from the first-stage distillation and supervise the training of the IGP-GCN in the progressive way by the final output pose of teacher model. The extensive experiments on the benchmark dataset: COCO keypoint and CrowdPose datasets, show that our proposed method performs favorably against lots of the existing state-of-the-art human pose estimation methods, especially for the more complex CrowdPose dataset, the performance improvement of our model is more significant.

Title: Air Quality PM2.5 Index Prediction Model Based on CNN-LSTM

Authors: Zicheng Guo, Shuqi Wu, Meixing Zhu, He Guandi
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.11215
Pdf URL: https://arxiv.org/pdf/2508.11215
Copy Paste: [[2508.11215]] Air Quality PM2.5 Index Prediction Model Based on CNN-LSTM(https://arxiv.org/abs/2508.11215)
Keywords: protect, extraction
Abstract: With the intensification of global climate change, accurate prediction of air quality indicators, especially PM2.5 concentration, has become increasingly important in fields such as environmental protection, public health, and urban management. To address this, we propose an air quality PM2.5 index prediction model based on a hybrid CNN-LSTM architecture. The model effectively combines Convolutional Neural Networks (CNN) for local spatial feature extraction and Long Short-Term Memory (LSTM) networks for modeling temporal dependencies in time series data. Using a multivariate dataset collected from an industrial area in Beijing between 2010 and 2015 -- which includes hourly records of PM2.5 concentration, temperature, dew point, pressure, wind direction, wind speed, and precipitation -- the model predicts the average PM2.5 concentration over 6-hour intervals. Experimental results show that the model achieves a root mean square error (RMSE) of 5.236, outperforming traditional time series models in both accuracy and generalization. This demonstrates its strong potential in real-world applications such as air pollution early warning systems. However, due to the complexity of multivariate inputs, the model demands high computational resources, and its ability to handle diverse atmospheric factors still requires optimization. Future work will focus on enhancing scalability and expanding support for more complex multivariate weather prediction tasks.

Title: A CLIP-based Uncertainty Modal Modeling (UMM) Framework for Pedestrian Re-Identification in Autonomous Driving

Authors: Jialin Li, Shuqi Wu, Ning Wang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2508.11218
Pdf URL: https://arxiv.org/pdf/2508.11218
Copy Paste: [[2508.11218]] A CLIP-based Uncertainty Modal Modeling (UMM) Framework for Pedestrian Re-Identification in Autonomous Driving(https://arxiv.org/abs/2508.11218)
Keywords: robust
Abstract: Re-Identification (ReID) is a critical technology in intelligent perception systems, especially within autonomous driving, where onboard cameras must identify pedestrians across views and time in real-time to support safe navigation and trajectory prediction. However, the presence of uncertain or missing input modalities--such as RGB, infrared, sketches, or textual descriptions--poses significant challenges to conventional ReID approaches. While large-scale pre-trained models offer strong multimodal semantic modeling capabilities, their computational overhead limits practical deployment in resource-constrained environments. To address these challenges, we propose a lightweight Uncertainty Modal Modeling (UMM) framework, which integrates a multimodal token mapper, synthetic modality augmentation strategy, and cross-modal cue interactive learner. Together, these components enable unified feature representation, mitigate the impact of missing modalities, and extract complementary information across different data types. Additionally, UMM leverages CLIP's vision-language alignment ability to fuse multimodal inputs efficiently without extensive finetuning. Experimental results demonstrate that UMM achieves strong robustness, generalization, and computational efficiency under uncertain modality conditions, offering a scalable and practical solution for pedestrian re-identification in autonomous driving scenarios.

Title: Enhancing Interactive Voting-Based Map Matching: Improving Efficiency and Robustness for Heterogeneous GPS Trajectories

Authors: William Alemanni, Arianna Burzacchi, Davide Colombi, Elena Giarratano
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.11235
Pdf URL: https://arxiv.org/pdf/2508.11235
Copy Paste: [[2508.11235]] Enhancing Interactive Voting-Based Map Matching: Improving Efficiency and Robustness for Heterogeneous GPS Trajectories(https://arxiv.org/abs/2508.11235)
Keywords: robust
Abstract: This paper presents an enhanced version of the Interactive Voting-Based Map Matching algorithm, designed to efficiently process trajectories with varying sampling rates. The main aim is to reconstruct GPS trajectories with high accuracy, independent of input data quality. Building upon the original algorithm, developed exclusively for aligning GPS signals to road networks, we extend its capabilities by integrating trajectory imputation. Our improvements also include the implementation of a distance-bounded interactive voting strategy to reduce computational complexity, as well as modifications to address missing data in the road network. Furthermore, we incorporate a custom-built asset derived from OpenStreetMap, enabling this approach to be smoothly applied in any geographic region covered by OpenStreetMap's road network. These advancements preserve the core strengths of the original algorithm while significantly extending its applicability to diverse real-world scenarios.

Title: Cross-Granularity Hypergraph Retrieval-Augmented Generation for Multi-hop Question Answering

Authors: Changjian Wang, Weihong Deng, Weili Guan, Quan Lu, Ning Jiang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.11247
Pdf URL: https://arxiv.org/pdf/2508.11247
Copy Paste: [[2508.11247]] Cross-Granularity Hypergraph Retrieval-Augmented Generation for Multi-hop Question Answering(https://arxiv.org/abs/2508.11247)
Keywords: diffusion
Abstract: Multi-hop question answering (MHQA) requires integrating knowledge scattered across multiple passages to derive the correct answer. Traditional retrieval-augmented generation (RAG) methods primarily focus on coarse-grained textual semantic similarity and ignore structural associations among dispersed knowledge, which limits their effectiveness in MHQA tasks. GraphRAG methods address this by leveraging knowledge graphs (KGs) to capture structural associations, but they tend to overly rely on structural information and fine-grained word- or phrase-level retrieval, resulting in an underutilization of textual semantics. In this paper, we propose a novel RAG approach called HGRAG for MHQA that achieves cross-granularity integration of structural and semantic information via hypergraphs. Structurally, we construct an entity hypergraph where fine-grained entities serve as nodes and coarse-grained passages as hyperedges, and establish knowledge association through shared entities. Semantically, we design a hypergraph retrieval method that integrates fine-grained entity similarity and coarse-grained passage similarity via hypergraph diffusion. Finally, we employ a retrieval enhancement module, which further refines the retrieved results both semantically and structurally, to obtain the most relevant passages as context for answer generation with the LLM. Experimental results on benchmark datasets demonstrate that our approach outperforms state-of-the-art methods in QA performance, and achieves a 6$\times$ speedup in retrieval efficiency.

Title: Graph Neural Diffusion via Generalized Opinion Dynamics

Authors: Asela Hevapathige, Asiri Wijesinghe, Ahad N. Zehmakan
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.11249
Pdf URL: https://arxiv.org/pdf/2508.11249
Copy Paste: [[2508.11249]] Graph Neural Diffusion via Generalized Opinion Dynamics(https://arxiv.org/abs/2508.11249)
Keywords: interpretability, diffusion
Abstract: There has been a growing interest in developing diffusion-based Graph Neural Networks (GNNs), building on the connections between message passing mechanisms in GNNs and physical diffusion processes. However, existing methods suffer from three critical limitations: (1) they rely on homogeneous diffusion with static dynamics, limiting adaptability to diverse graph structures; (2) their depth is constrained by computational overhead and diminishing interpretability; and (3) theoretical understanding of their convergence behavior remains limited. To address these challenges, we propose GODNF, a Generalized Opinion Dynamics Neural Framework, which unifies multiple opinion dynamics models into a principled, trainable diffusion mechanism. Our framework captures heterogeneous diffusion patterns and temporal dynamics via node-specific behavior modeling and dynamic neighborhood influence, while ensuring efficient and interpretable message propagation even at deep layers. We provide a rigorous theoretical analysis demonstrating GODNF's ability to model diverse convergence configurations. Extensive empirical evaluations of node classification and influence estimation tasks confirm GODNF's superiority over state-of-the-art GNNs.

Title: FantasyTalking2: Timestep-Layer Adaptive Preference Optimization for Audio-Driven Portrait Animation

Authors: MengChao Wang, Qiang Wang, Fan Jiang, Mu Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.11255
Pdf URL: https://arxiv.org/pdf/2508.11255
Copy Paste: [[2508.11255]] FantasyTalking2: Timestep-Layer Adaptive Preference Optimization for Audio-Driven Portrait Animation(https://arxiv.org/abs/2508.11255)
Keywords: diffusion
Abstract: Recent advances in audio-driven portrait animation have demonstrated impressive capabilities. However, existing methods struggle to align with fine-grained human preferences across multiple dimensions, such as motion naturalness, lip-sync accuracy, and visual quality. This is due to the difficulty of optimizing among competing preference objectives, which often conflict with one another, and the scarcity of large-scale, high-quality datasets with multidimensional preference annotations. To address these, we first introduce Talking-Critic, a multimodal reward model that learns human-aligned reward functions to quantify how well generated videos satisfy multidimensional expectations. Leveraging this model, we curate Talking-NSQ, a large-scale multidimensional human preference dataset containing 410K preference pairs. Finally, we propose Timestep-Layer adaptive multi-expert Preference Optimization (TLPO), a novel framework for aligning diffusion-based portrait animation models with fine-grained, multidimensional preferences. TLPO decouples preferences into specialized expert modules, which are then fused across timesteps and network layers, enabling comprehensive, fine-grained enhancement across all dimensions without mutual interference. Experiments demonstrate that Talking-Critic significantly outperforms existing methods in aligning with human preference ratings. Meanwhile, TLPO achieves substantial improvements over baseline models in lip-sync accuracy, motion naturalness, and visual quality, exhibiting superior performance in both qualitative and quantitative evaluations. Ours project page: this https URL

Title: Generalized Decoupled Learning for Enhancing Open-Vocabulary Dense Perception

Authors: Junjie Wang, Keyu Chen, Yulin Li, Bin Chen, Hengshuang Zhao, Xiaojuan Qi, Zhuotao Tian
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.11256
Pdf URL: https://arxiv.org/pdf/2508.11256
Copy Paste: [[2508.11256]] Generalized Decoupled Learning for Enhancing Open-Vocabulary Dense Perception(https://arxiv.org/abs/2508.11256)
Keywords: diffusion, segmentation
Abstract: Dense visual perception tasks have been constrained by their reliance on predefined categories, limiting their applicability in real-world scenarios where visual concepts are unbounded. While Vision-Language Models (VLMs) like CLIP have shown promise in open-vocabulary tasks, their direct application to dense perception often leads to suboptimal performance due to limitations in local feature representation. In this work, we present our observation that CLIP's image tokens struggle to effectively aggregate information from spatially or semantically related regions, resulting in features that lack local discriminability and spatial consistency. To address this issue, we propose DeCLIP, a novel framework that enhances CLIP by decoupling the self-attention module to obtain ``content'' and ``context'' features respectively. \revise{The context features are enhanced by jointly distilling semantic correlations from Vision Foundation Models (VFMs) and object integrity cues from diffusion models, thereby enhancing spatial consistency. In parallel, the content features are aligned with image crop representations and constrained by region correlations from VFMs to improve local discriminability. Extensive experiments demonstrate that DeCLIP establishes a solid foundation for open-vocabulary dense perception, consistently achieving state-of-the-art performance across a broad spectrum of tasks, including 2D detection and segmentation, 3D instance segmentation, video instance segmentation, and 6D object pose estimation.} Code is available at this https URL

Title: Group Fairness Meets the Black Box: Enabling Fair Algorithms on Closed LLMs via Post-Processing

Authors: Ruicheng Xian, Yuxuan Wan, Han Zhao
Subjects: cs.LG, cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2508.11258
Pdf URL: https://arxiv.org/pdf/2508.11258
Copy Paste: [[2508.11258]] Group Fairness Meets the Black Box: Enabling Fair Algorithms on Closed LLMs via Post-Processing(https://arxiv.org/abs/2508.11258)
Keywords: fair, large language model
Abstract: Instruction fine-tuned large language models (LLMs) enable a simple zero-shot or few-shot prompting paradigm, also known as in-context learning, for building prediction models. This convenience, combined with continued advances in LLM capability, has the potential to drive their adoption across a broad range of domains, including high-stakes applications where group fairness -- preventing disparate impacts across demographic groups -- is essential. The majority of existing approaches to enforcing group fairness on LLM-based classifiers rely on traditional fair algorithms applied via model fine-tuning or head-tuning on final-layer embeddings, but they are no longer applicable to closed-weight LLMs under the in-context learning setting, which include some of the most capable commercial models today, such as GPT-4, Gemini, and Claude. In this paper, we propose a framework for deriving fair classifiers from closed-weight LLMs via prompting: the LLM is treated as a feature extractor, and features are elicited from its probabilistic predictions (e.g., token log probabilities) using prompts strategically designed for the specified fairness criterion to obtain sufficient statistics for fair classification; a fair algorithm is then applied to these features to train a lightweight fair classifier in a post-hoc manner. Experiments on five datasets, including three tabular ones, demonstrate strong accuracy-fairness tradeoffs for the classifiers derived by our framework from both open-weight and closed-weight LLMs; in particular, our framework is data-efficient and outperforms fair classifiers trained on LLM embeddings (i.e., head-tuning) or from scratch on raw tabular features.

Title: UNVEILING: What Makes Linguistics Olympiad Puzzles Tricky for LLMs?

Authors: Mukund Choudhary, KV Aditya Srivatsa, Gaurja Aeron, Antara Raaghavi Bhattacharya, Dang Khoa Dang Dinh, Ikhlasul Akmal Hanif, Daria Kotova, Ekaterina Kochmar, Monojit Choudhury
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.11260
Pdf URL: https://arxiv.org/pdf/2508.11260
Copy Paste: [[2508.11260]] UNVEILING: What Makes Linguistics Olympiad Puzzles Tricky for LLMs?(https://arxiv.org/abs/2508.11260)
Keywords: large language model
Abstract: Large language models (LLMs) have demonstrated potential in reasoning tasks, but their performance on linguistics puzzles remains consistently poor. These puzzles, often derived from Linguistics Olympiad (LO) contests, provide a minimal contamination environment to assess LLMs' linguistic reasoning abilities across low-resource languages. This work analyses LLMs' performance on 629 problems across 41 low-resource languages by labelling each with linguistically informed features to unveil weaknesses. Our analyses show that LLMs struggle with puzzles involving higher morphological complexity and perform better on puzzles involving linguistic features that are also found in English. We also show that splitting words into morphemes as a pre-processing step improves solvability, indicating a need for more informed and language-specific tokenisers. These findings thus offer insights into some challenges in linguistic reasoning and modelling of low-resource languages.

Title: Vision-Language Models display a strong gender bias

Authors: Aiswarya Konavoor, Raj Abhijit Dandekar, Rajat Dandekar, Sreedath Panat
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.11262
Pdf URL: https://arxiv.org/pdf/2508.11262
Copy Paste: [[2508.11262]] Vision-Language Models display a strong gender bias(https://arxiv.org/abs/2508.11262)
Keywords: robust
Abstract: Vision-language models (VLM) align images and text in a shared representation space that is useful for retrieval and zero-shot transfer. Yet, this alignment can encode and amplify social stereotypes in subtle ways that are not obvious from standard accuracy metrics. In this study, we test whether the contrastive vision-language encoder exhibits gender-linked associations when it places embeddings of face images near embeddings of short phrases that describe occupations and activities. We assemble a dataset of 220 face photographs split by perceived binary gender and a set of 150 unique statements distributed across six categories covering emotional labor, cognitive labor, domestic labor, technical labor, professional roles, and physical labor. We compute unit-norm image embeddings for every face and unit-norm text embeddings for every statement, then define a statement-level association score as the difference between the mean cosine similarity to the male set and the mean cosine similarity to the female set, where positive values indicate stronger association with the male set and negative values indicate stronger association with the female set. We attach bootstrap confidence intervals by resampling images within each gender group, aggregate by category with a separate bootstrap over statements, and run a label-swap null model that estimates the level of mean absolute association we would expect if no gender structure were present. The outcome is a statement-wise and category-wise map of gender associations in a contrastive vision-language space, accompanied by uncertainty, simple sanity checks, and a robust gender bias evaluation framework.

Title: Domain-aware Category-level Geometry Learning Segmentation for 3D Point Clouds

Authors: Pei He, Lingling Li, Licheng Jiao, Ronghua Shang, Fang Liu, Shuang Wang, Xu Liu, Wenping Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.11265
Pdf URL: https://arxiv.org/pdf/2508.11265
Copy Paste: [[2508.11265]] Domain-aware Category-level Geometry Learning Segmentation for 3D Point Clouds(https://arxiv.org/abs/2508.11265)
Keywords: segmentation
Abstract: Domain generalization in 3D segmentation is a critical challenge in deploying models to unseen environments. Current methods mitigate the domain shift by augmenting the data distribution of point clouds. However, the model learns global geometric patterns in point clouds while ignoring the category-level distribution and alignment. In this paper, a category-level geometry learning framework is proposed to explore the domain-invariant geometric features for domain generalized 3D semantic segmentation. Specifically, Category-level Geometry Embedding (CGE) is proposed to perceive the fine-grained geometric properties of point cloud features, which constructs the geometric properties of each class and couples geometric embedding to semantic learning. Secondly, Geometric Consistent Learning (GCL) is proposed to simulate the latent 3D distribution and align the category-level geometric embeddings, allowing the model to focus on the geometric invariant information to improve generalization. Experimental results verify the effectiveness of the proposed method, which has very competitive segmentation accuracy compared with the state-of-the-art domain generalized point cloud methods.

Title: Probing the Representational Power of Sparse Autoencoders in Vision Models

Authors: Matthew Lyle Olson, Musashi Hinck, Neale Ratzlaff, Changbai Li, Phillip Howard, Vasudev Lal, Shao-Yen Tseng
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2508.11277
Pdf URL: https://arxiv.org/pdf/2508.11277
Copy Paste: [[2508.11277]] Probing the Representational Power of Sparse Autoencoders in Vision Models(https://arxiv.org/abs/2508.11277)
Keywords: interpretability, diffusion, large language model
Abstract: Sparse Autoencoders (SAEs) have emerged as a popular tool for interpreting the hidden states of large language models (LLMs). By learning to reconstruct activations from a sparse bottleneck layer, SAEs discover interpretable features from the high-dimensional internal representations of LLMs. Despite their popularity with language models, SAEs remain understudied in the visual domain. In this work, we provide an extensive evaluation the representational power of SAEs for vision models using a broad range of image-based tasks. Our experimental results demonstrate that SAE features are semantically meaningful, improve out-of-distribution generalization, and enable controllable generation across three vision model architectures: vision embedding models, multi-modal LMMs and diffusion models. In vision embedding models, we find that learned SAE features can be used for OOD detection and provide evidence that they recover the ontological structure of the underlying model. For diffusion models, we demonstrate that SAEs enable semantic steering through text encoder manipulation and develop an automated pipeline for discovering human-interpretable attributes. Finally, we conduct exploratory experiments on multi-modal LLMs, finding evidence that SAE features reveal shared representations across vision and language modalities. Our study provides a foundation for SAE evaluation in vision models, highlighting their strong potential improving interpretability, generalization, and steerability in the visual domain.

Title: Boosting the Robustness-Accuracy Trade-off of SNNs by Robust Temporal Self-Ensemble

Authors: Jihang Wang, Dongcheng Zhao, Ruolin Chen, Qian Zhang, Yi Zeng
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2508.11279
Pdf URL: https://arxiv.org/pdf/2508.11279
Copy Paste: [[2508.11279]] Boosting the Robustness-Accuracy Trade-off of SNNs by Robust Temporal Self-Ensemble(https://arxiv.org/abs/2508.11279)
Keywords: robust
Abstract: Spiking Neural Networks (SNNs) offer a promising direction for energy-efficient and brain-inspired computing, yet their vulnerability to adversarial perturbations remains poorly understood. In this work, we revisit the adversarial robustness of SNNs through the lens of temporal ensembling, treating the network as a collection of evolving sub-networks across discrete timesteps. This formulation uncovers two critical but underexplored challenges-the fragility of individual temporal sub-networks and the tendency for adversarial vulnerabilities to transfer across time. To overcome these limitations, we propose Robust Temporal self-Ensemble (RTE), a training framework that improves the robustness of each sub-network while reducing the temporal transferability of adversarial perturbations. RTE integrates both objectives into a unified loss and employs a stochastic sampling strategy for efficient optimization. Extensive experiments across multiple benchmarks demonstrate that RTE consistently outperforms existing training methods in robust-accuracy trade-off. Additional analyses reveal that RTE reshapes the internal robustness landscape of SNNs, leading to more resilient and temporally diversified decision boundaries. Our study highlights the importance of temporal structure in adversarial learning and offers a principled foundation for building robust spiking models.

Title: LETToT: Label-Free Evaluation of Large Language Models On Tourism Using Expert Tree-of-Thought

Authors: Ruiyan Qi, Congding Wen, Weibo Zhou, Shangsong Liang, Lingbo Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.11280
Pdf URL: https://arxiv.org/pdf/2508.11280
Copy Paste: [[2508.11280]] LETToT: Label-Free Evaluation of Large Language Models On Tourism Using Expert Tree-of-Thought(https://arxiv.org/abs/2508.11280)
Keywords: robust, large language model
Abstract: Evaluating large language models (LLMs) in specific domain like tourism remains challenging due to the prohibitive cost of annotated benchmarks and persistent issues like hallucinations. We propose $\textbf{L}$able-Free $\textbf{E}$valuation of LLM on $\textbf{T}$ourism using Expert $\textbf{T}$ree-$\textbf{o}$f-$\textbf{T}$hought (LETToT), a framework that leverages expert-derived reasoning structures-instead of labeled data-to access LLMs in tourism. First, we iteratively refine and validate hierarchical ToT components through alignment with generic quality dimensions and expert feedback. Results demonstrate the effectiveness of our systematically optimized expert ToT with 4.99-14.15\% relative quality gains over baselines. Second, we apply LETToT's optimized expert ToT to evaluate models of varying scales (32B-671B parameters), revealing: (1) Scaling laws persist in specialized domains (DeepSeek-V3 leads), yet reasoning-enhanced smaller models (e.g., DeepSeek-R1-Distill-Llama-70B) close this gap; (2) For sub-72B models, explicit reasoning architectures outperform counterparts in accuracy and conciseness ($p<0.05$). Our work established a scalable, label-free paradigm for domain-specific LLM evaluation, offering a robust alternative to conventional annotated benchmarks.

Title: ToxiFrench: Benchmarking and Enhancing Language Models via CoT Fine-Tuning for French Toxicity Detection

Authors: Axel Delaval, Shujian Yang, Haicheng Wang, Han Qiu, Jialiang Lu
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2508.11281
Pdf URL: https://arxiv.org/pdf/2508.11281
Copy Paste: [[2508.11281]] ToxiFrench: Benchmarking and Enhancing Language Models via CoT Fine-Tuning for French Toxicity Detection(https://arxiv.org/abs/2508.11281)
Keywords: robust
Abstract: Detecting toxic content using language models is crucial yet challenging. While substantial progress has been made in English, toxicity detection in French remains underdeveloped, primarily due to the lack of culturally relevant, large-scale datasets. In this work, we introduce TOXIFRENCH, a new public benchmark of 53,622 French online comments, constructed via a semi-automated annotation pipeline that reduces manual labeling to only 10% through high-confidence LLM-based pre-annotation and human verification. Then, we benchmark a broad range of models and uncover a counterintuitive insight: Small Language Models (SLMs) outperform many larger models in robustness and generalization under the toxicity detection task. Motivated by this finding, we propose a novel Chain-of-Thought (CoT) fine-tuning strategy using a dynamic weighted loss that progressively emphasizes the model's final decision, significantly improving faithfulness. Our fine-tuned 4B model achieves state-of-the-art performance, improving its F1 score by 13% over its baseline and outperforming LLMs such as GPT-40 and Gemini-2.5. Further evaluation on a cross-lingual toxicity benchmark demonstrates strong multilingual ability, suggesting that our methodology can be effectively extended to other languages and safety-critical classification tasks.

Title: Unifying Scale-Aware Depth Prediction and Perceptual Priors for Monocular Endoscope Pose Estimation and Tissue Reconstruction

Authors: Muzammil Khan, Enzo Kerkhof, Matteo Fusaglia, Koert Kuhlmann, Theo Ruers, Françoise J. Siepel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.11282
Pdf URL: https://arxiv.org/pdf/2508.11282
Copy Paste: [[2508.11282]] Unifying Scale-Aware Depth Prediction and Perceptual Priors for Monocular Endoscope Pose Estimation and Tissue Reconstruction(https://arxiv.org/abs/2508.11282)
Keywords: robust
Abstract: Accurate endoscope pose estimation and 3D tissue surface reconstruction significantly enhances monocular minimally invasive surgical procedures by enabling accurate navigation and improved spatial awareness. However, monocular endoscope pose estimation and tissue reconstruction face persistent challenges, including depth ambiguity, physiological tissue deformation, inconsistent endoscope motion, limited texture fidelity, and a restricted field of view. To overcome these limitations, a unified framework for monocular endoscopic tissue reconstruction that integrates scale-aware depth prediction with temporally-constrained perceptual refinement is presented. This framework incorporates a novel MAPIS-Depth module, which leverages Depth Pro for robust initialisation and Depth Anything for efficient per-frame depth prediction, in conjunction with L-BFGS-B optimisation, to generate pseudo-metric depth estimates. These estimates are temporally refined by computing pixel correspondences using RAFT and adaptively blending flow-warped frames based on LPIPS perceptual similarity, thereby reducing artefacts arising from physiological tissue deformation and motion. To ensure accurate registration of the synthesised pseudo-RGBD frames from MAPIS-Depth, a novel WEMA-RTDL module is integrated, optimising both rotation and translation. Finally, truncated signed distance function-based volumetric fusion and marching cubes are applied to extract a comprehensive 3D surface mesh. Evaluations on HEVD and SCARED, with ablation and comparative analyses, demonstrate the framework's robustness and superiority over state-of-the-art methods.

Title: TimeMachine: Fine-Grained Facial Age Editing with Identity Preservation

Authors: Yilin Mi, Qixin Yan, Zheng-Peng Duan, Chunle Guo, Hubery Yin, Hao Liu, Chen Li, Chongyi Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.11284
Pdf URL: https://arxiv.org/pdf/2508.11284
Copy Paste: [[2508.11284]] TimeMachine: Fine-Grained Facial Age Editing with Identity Preservation(https://arxiv.org/abs/2508.11284)
Keywords: diffusion, generative
Abstract: With the advancement of generative models, facial image editing has made significant progress. However, achieving fine-grained age editing while preserving personal identity remains a challenging this http URL this paper, we propose TimeMachine, a novel diffusion-based framework that achieves accurate age editing while keeping identity features unchanged. To enable fine-grained age editing, we inject high-precision age information into the multi-cross attention module, which explicitly separates age-related and identity-related features. This design facilitates more accurate disentanglement of age attributes, thereby allowing precise and controllable manipulation of facial this http URL, we propose an Age Classifier Guidance (ACG) module that predicts age directly in the latent space, instead of performing denoising image reconstruction during training. By employing a lightweight module to incorporate age constraints, this design enhances age editing accuracy by modest increasing training cost. Additionally, to address the lack of large-scale, high-quality facial age datasets, we construct a HFFA dataset (High-quality Fine-grained Facial-Age dataset) which contains one million high-resolution images labeled with identity and facial attributes. Experimental results demonstrate that TimeMachine achieves state-of-the-art performance in fine-grained age editing while preserving identity consistency.

Title: AI in Mental Health: Emotional and Sentiment Analysis of Large Language Models' Responses to Depression, Anxiety, and Stress Queries

Authors: Arya VarastehNezhad, Reza Tavasoli, Soroush Elyasi, MohammadHossein LotfiNia, Hamed Farbeh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.11285
Pdf URL: https://arxiv.org/pdf/2508.11285
Copy Paste: [[2508.11285]] AI in Mental Health: Emotional and Sentiment Analysis of Large Language Models' Responses to Depression, Anxiety, and Stress Queries(https://arxiv.org/abs/2508.11285)
Keywords: large language model
Abstract: Depression, anxiety, and stress are widespread mental health concerns that increasingly drive individuals to seek information from Large Language Models (LLMs). This study investigates how eight LLMs (Claude Sonnet, Copilot, Gemini Pro, GPT-4o, GPT-4o mini, Llama, Mixtral, and Perplexity) reply to twenty pragmatic questions about depression, anxiety, and stress when those questions are framed for six user profiles (baseline, woman, man, young, old, and university student). The models generated 2,880 answers, which we scored for sentiment and emotions using state-of-the-art tools. Our analysis revealed that optimism, fear, and sadness dominated the emotional landscape across all outputs, with neutral sentiment maintaining consistently high values. Gratitude, joy, and trust appeared at moderate levels, while emotions such as anger, disgust, and love were rarely expressed. The choice of LLM significantly influenced emotional expression patterns. Mixtral exhibited the highest levels of negative emotions including disapproval, annoyance, and sadness, while Llama demonstrated the most optimistic and joyful responses. The type of mental health condition dramatically shaped emotional responses: anxiety prompts elicited extraordinarily high fear scores (0.974), depression prompts generated elevated sadness (0.686) and the highest negative sentiment, while stress-related queries produced the most optimistic responses (0.755) with elevated joy and trust. In contrast, demographic framing of queries produced only marginal variations in emotional tone. Statistical analyses confirmed significant model-specific and condition-specific differences, while demographic influences remained minimal. These findings highlight the critical importance of model selection in mental health applications, as each LLM exhibits a distinct emotional signature that could significantly impact user experience and outcomes.

Title: Hyperspectral vs. RGB for Pedestrian Segmentation in Urban Driving Scenes: A Comparative Study

Authors: Jiarong Li, Imad Ali Shah, Enda Ward, Martin Glavin, Edward Jones, Brian Deegan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.11301
Pdf URL: https://arxiv.org/pdf/2508.11301
Copy Paste: [[2508.11301]] Hyperspectral vs. RGB for Pedestrian Segmentation in Urban Driving Scenes: A Comparative Study(https://arxiv.org/abs/2508.11301)
Keywords: robust, segmentation
Abstract: Pedestrian segmentation in automotive perception systems faces critical safety challenges due to metamerism in RGB imaging, where pedestrians and backgrounds appear visually indistinguishable.. This study investigates the potential of hyperspectral imaging (HSI) for enhanced pedestrian segmentation in urban driving scenarios using the Hyperspectral City v2 (H-City) dataset. We compared standard RGB against two dimensionality-reduction approaches by converting 128-channel HSI data into three-channel representations: Principal Component Analysis (PCA) and optimal band selection using Contrast Signal-to-Noise Ratio with Joint Mutual Information Maximization (CSNR-JMIM). Three semantic segmentation models were evaluated: U-Net, DeepLabV3+, and SegFormer. CSNR-JMIM consistently outperformed RGB with an average improvements of 1.44% in Intersection over Union (IoU) and 2.18% in F1-score for pedestrian segmentation. Rider segmentation showed similar gains with 1.43% IoU and 2.25% F1-score improvements. These improved performance results from enhanced spectral discrimination of optimally selected HSI bands effectively reducing false positives. This study demonstrates robust pedestrian segmentation through optimal HSI band selection, showing significant potential for safety-critical automotive applications.

Title: SGSimEval: A Comprehensive Multifaceted and Similarity-Enhanced Benchmark for Automatic Survey Generation Systems

Authors: Beichen Guo, Zhiyuan Wen, Yu Yang, Peng Gao, Ruosong Yang, Jiaxing Shen
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2508.11310
Pdf URL: https://arxiv.org/pdf/2508.11310
Copy Paste: [[2508.11310]] SGSimEval: A Comprehensive Multifaceted and Similarity-Enhanced Benchmark for Automatic Survey Generation Systems(https://arxiv.org/abs/2508.11310)
Keywords: robust, large language model
Abstract: The growing interest in automatic survey generation (ASG), a task that traditionally required considerable time and effort, has been spurred by recent advances in large language models (LLMs). With advancements in retrieval-augmented generation (RAG) and the rising popularity of multi-agent systems (MASs), synthesizing academic surveys using LLMs has become a viable approach, thereby elevating the need for robust evaluation methods in this domain. However, existing evaluation methods suffer from several limitations, including biased metrics, a lack of human preference, and an over-reliance on LLMs-as-judges. To address these challenges, we propose SGSimEval, a comprehensive benchmark for Survey Generation with Similarity-Enhanced Evaluation that evaluates automatic survey generation systems by integrating assessments of the outline, content, and references, and also combines LLM-based scoring with quantitative metrics to provide a multifaceted evaluation framework. In SGSimEval, we also introduce human preference metrics that emphasize both inherent quality and similarity to humans. Extensive experiments reveal that current ASG systems demonstrate human-comparable superiority in outline generation, while showing significant room for improvement in content and reference generation, and our evaluation metrics maintain strong consistency with human assessments.

Title: LLM Compression: How Far Can We Go in Balancing Size and Performance?

Authors: Sahil Sk, Debasish Dhal, Sonal Khosla, Sk Shahid, Sambit Shekhar, Akash Dhaka, Shantipriya Parida, Dilip K. Prasad, Ondřej Bojar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.11318
Pdf URL: https://arxiv.org/pdf/2508.11318
Copy Paste: [[2508.11318]] LLM Compression: How Far Can We Go in Balancing Size and Performance?(https://arxiv.org/abs/2508.11318)
Keywords: transformer, generative, large language model
Abstract: Quantization is an essential and popular technique for improving the accessibility of large language models (LLMs) by reducing memory usage and computational costs while maintaining performance. In this study, we apply 4-bit Group Scaling Quantization (GSQ) and Generative Pretrained Transformer Quantization (GPTQ) to LLaMA 1B, Qwen 0.5B, and PHI 1.5B, evaluating their impact across multiple NLP tasks. We benchmark these models on MS MARCO (Information Retrieval), BoolQ (Boolean Question Answering), and GSM8K (Mathematical Reasoning) datasets, assessing both accuracy and efficiency across various tasks. The study measures the trade-offs between model compression and task performance, analyzing key evaluation metrics, namely accuracy, inference latency, and throughput (total output tokens generated per second), providing insights into the suitability of low-bit quantization for real-world deployment. Using the results, users can then make suitable decisions based on the specifications that need to be met. We discuss the pros and cons of GSQ and GPTQ techniques on models of different sizes, which also serve as a benchmark for future experiments.

Title: Delving into Dynamic Scene Cue-Consistency for Robust 3D Multi-Object Tracking

Authors: Haonan Zhang, Xinyao Wang, Boxi Wu, Tu Zheng, Wang Yunhua, Zheng Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.11323
Pdf URL: https://arxiv.org/pdf/2508.11323
Copy Paste: [[2508.11323]] Delving into Dynamic Scene Cue-Consistency for Robust 3D Multi-Object Tracking(https://arxiv.org/abs/2508.11323)
Keywords: robust, transformer
Abstract: 3D multi-object tracking is a critical and challenging task in the field of autonomous driving. A common paradigm relies on modeling individual object motion, e.g., Kalman filters, to predict trajectories. While effective in simple scenarios, this approach often struggles in crowded environments or with inaccurate detections, as it overlooks the rich geometric relationships between objects. This highlights the need to leverage spatial cues. However, existing geometry-aware methods can be susceptible to interference from irrelevant objects, leading to ambiguous features and incorrect associations. To address this, we propose focusing on cue-consistency: identifying and matching stable spatial patterns over time. We introduce the Dynamic Scene Cue-Consistency Tracker (DSC-Track) to implement this principle. Firstly, we design a unified spatiotemporal encoder using Point Pair Features (PPF) to learn discriminative trajectory embeddings while suppressing interference. Secondly, our cue-consistency transformer module explicitly aligns consistent feature representations between historical tracks and current detections. Finally, a dynamic update mechanism preserves salient spatiotemporal information for stable online tracking. Extensive experiments on the nuScenes and Waymo Open Datasets validate the effectiveness and robustness of our approach. On the nuScenes benchmark, for instance, our method achieves state-of-the-art performance, reaching 73.2% and 70.3% AMOTA on the validation and test sets, respectively.

Title: Salty Seagull: A VSAT Honeynet to Follow the Bread Crumb of Attacks in Ship Networks

Authors: Georgios Michail Makrakis, Jeroen Pijpker, Remco Hassing, Rob Loves, Stephen McCombie
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2508.11325
Pdf URL: https://arxiv.org/pdf/2508.11325
Copy Paste: [[2508.11325]] Salty Seagull: A VSAT Honeynet to Follow the Bread Crumb of Attacks in Ship Networks(https://arxiv.org/abs/2508.11325)
Keywords: security, attack
Abstract: Cyber threats against the maritime industry have increased notably in recent years, highlighting the need for innovative cybersecurity approaches. Ships, as critical assets, possess highly specialized and interconnected network infrastructures, where their legacy systems and operational constraints further exacerbate their vulnerability to cyberattacks. To better understand this evolving threat landscape, we propose the use of cyber-deception techniques and in particular honeynets, as a means to gather valuable insights into ongoing attack campaigns targeting the maritime sector. In this paper we present Salty Seagull, a honeynet conceived to simulate a VSAT system for ships. This environment mimics the operations of a functional VSAT system onboard and, at the same time, enables a user to interact with it through a Web dashboard and a CLI environment. Furthermore, based on existing vulnerabilities, we purposefully integrate them into our system to increase attacker engagement. We exposed our honeynet for 30 days to the Internet to assess its capability and measured the received interaction. Results show that while numerous generic attacks have been attempted, only one curious attacker with knowledge of the nature of the system and its vulnerabilities managed to access it, without however exploring its full potential.

Title: Noise Matters: Optimizing Matching Noise for Diffusion Classifiers

Authors: Yanghao Wang, Long Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.11330
Pdf URL: https://arxiv.org/pdf/2508.11330
Copy Paste: [[2508.11330]] Noise Matters: Optimizing Matching Noise for Diffusion Classifiers(https://arxiv.org/abs/2508.11330)
Keywords: diffusion, generative
Abstract: Although today's pretrained discriminative vision-language models (e.g., CLIP) have demonstrated strong perception abilities, such as zero-shot image classification, they also suffer from the bag-of-words problem and spurious bias. To mitigate these problems, some pioneering studies leverage powerful generative models (e.g., pretrained diffusion models) to realize generalizable image classification, dubbed Diffusion Classifier (DC). Specifically, by randomly sampling a Gaussian noise, DC utilizes the differences of denoising effects with different category conditions to classify categories. Unfortunately, an inherent and notorious weakness of existing DCs is noise instability: different random sampled noises lead to significant performance changes. To achieve stable classification performance, existing DCs always ensemble the results of hundreds of sampled noises, which significantly reduces the classification speed. To this end, we firstly explore the role of noise in DC, and conclude that: there are some ``good noises'' that can relieve the instability. Meanwhile, we argue that these good noises should meet two principles: Frequency Matching and Spatial Matching. Regarding both principles, we propose a novel Noise Optimization method to learn matching (i.e., good) noise for DCs: NoOp. For frequency matching, NoOp first optimizes a dataset-specific noise: Given a dataset and a timestep t, optimize one randomly initialized parameterized noise. For Spatial Matching, NoOp trains a Meta-Network that adopts an image as input and outputs image-specific noise offset. The sum of optimized noise and noise offset will be used in DC to replace random noise. Extensive ablations on various datasets demonstrated the effectiveness of NoOp.

Title: GANDiff FR: Hybrid GAN Diffusion Synthesis for Causal Bias Attribution in Face Recognition

Authors: Md Asgor Hossain Reaj, Rajan Das Gupta, Md Yeasin Rahat, Nafiz Fahad, Md Jawadul Hasan, Tze Hui Liew
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.11334
Pdf URL: https://arxiv.org/pdf/2508.11334
Copy Paste: [[2508.11334]] GANDiff FR: Hybrid GAN Diffusion Synthesis for Causal Bias Attribution in Face Recognition(https://arxiv.org/abs/2508.11334)
Keywords: fair, diffusion
Abstract: We introduce GANDiff FR, the first synthetic framework that precisely controls demographic and environmental factors to measure, explain, and reduce bias with reproducible rigor. GANDiff FR unifies StyleGAN3-based identity-preserving generation with diffusion-based attribute control, enabling fine-grained manipulation of pose around 30 degrees, illumination (four directions), and expression (five levels) under ceteris paribus conditions. We synthesize 10,000 demographically balanced faces across five cohorts validated for realism via automated detection (98.2%) and human review (89%) to isolate and quantify bias drivers. Benchmarking ArcFace, CosFace, and AdaFace under matched operating points shows AdaFace reduces inter-group TPR disparity by 60% (2.5% vs. 6.3%), with illumination accounting for 42% of residual bias. Cross-dataset evaluation on RFW, BUPT, and CASIA WebFace confirms strong synthetic-to-real transfer (r 0.85). Despite around 20% computational overhead relative to pure GANs, GANDiff FR yields three times more attribute-conditioned variants, establishing a reproducible, regulation-aligned (EU AI Act) standard for fairness auditing. Code and data are released to support transparent, scalable bias evaluation.

Title: RegimeNAS: Regime-Aware Differentiable Architecture Search With Theoretical Guarantees for Financial Trading

Authors: Prathamesh Devadiga, Yashmitha Shailesh
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.11338
Pdf URL: https://arxiv.org/pdf/2508.11338
Copy Paste: [[2508.11338]] RegimeNAS: Regime-Aware Differentiable Architecture Search With Theoretical Guarantees for Financial Trading(https://arxiv.org/abs/2508.11338)
Keywords: robust
Abstract: We introduce RegimeNAS, a novel differentiable architecture search framework specifically designed to enhance cryptocurrency trading performance by explicitly integrating market regime awareness. Addressing the limitations of static deep learning models in highly dynamic financial environments, RegimeNAS features three core innovations: (1) a theoretically grounded Bayesian search space optimizing architectures with provable convergence properties; (2) specialized, dynamically activated neural modules (Volatility, Trend, and Range blocks) tailored for distinct market conditions; and (3) a multi-objective loss function incorporating market-specific penalties (e.g., volatility matching, transition smoothness) alongside mathematically enforced Lipschitz stability constraints. Regime identification leverages multi-head attention across multiple timeframes for improved accuracy and uncertainty estimation. Rigorous empirical evaluation on extensive real-world cryptocurrency data demonstrates that RegimeNAS significantly outperforms state-of-the-art benchmarks, achieving an 80.3% Mean Absolute Error reduction compared to the best traditional recurrent baseline and converging substantially faster (9 vs. 50+ epochs). Ablation studies and regime-specific analysis confirm the critical contribution of each component, particularly the regime-aware adaptation mechanism. This work underscores the imperative of embedding domain-specific knowledge, such as market regimes, directly within the NAS process to develop robust and adaptive models for challenging financial applications.

Title: Index-Aligned Query Distillation for Transformer-based Incremental Object Detection

Authors: Mingxiao Ma, Shunyao Zhu, Guoliang Kang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.11339
Pdf URL: https://arxiv.org/pdf/2508.11339
Copy Paste: [[2508.11339]] Index-Aligned Query Distillation for Transformer-based Incremental Object Detection(https://arxiv.org/abs/2508.11339)
Keywords: transformer
Abstract: Incremental object detection (IOD) aims to continuously expand the capability of a model to detect novel categories while preserving its performance on previously learned ones. When adopting a transformer-based detection model to perform IOD, catastrophic knowledge forgetting may inevitably occur, meaning the detection performance on previously learned categories may severely degenerate. Previous typical methods mainly rely on knowledge distillation (KD) to mitigate the catastrophic knowledge forgetting of transformer-based detection models. Specifically, they utilize Hungarian Matching to build a correspondence between the queries of the last-phase and current-phase detection models and align the classifier and regressor outputs between matched queries to avoid knowledge forgetting. However, we observe that in IOD task, Hungarian Matching is not a good choice. With Hungarian Matching, the query of the current-phase model may match different queries of the last-phase model at different iterations during KD. As a result, the knowledge encoded in each query may be reshaped towards new categories, leading to the forgetting of previously encoded knowledge of old categories. Based on our observations, we propose a new distillation approach named Index-Aligned Query Distillation (IAQD) for transformer-based IOD. Beyond using Hungarian Matching, IAQD establishes a correspondence between queries of the previous and current phase models that have the same index. Moreover, we perform index-aligned distillation only on partial queries which are critical for the detection of previous categories. In this way, IAQD largely preserves the previous semantic and spatial encoding capabilities without interfering with the learning of new categories. Extensive experiments on representative benchmarks demonstrate that IAQD effectively mitigates knowledge forgetting, achieving new state-of-the-art performance.

Title: Semantically Guided Adversarial Testing of Vision Models Using Language Models

Authors: Katarzyna Filus, Jorge M. Cruz-Duarte
Subjects: cs.CV, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2508.11341
Pdf URL: https://arxiv.org/pdf/2508.11341
Copy Paste: [[2508.11341]] Semantically Guided Adversarial Testing of Vision Models Using Language Models(https://arxiv.org/abs/2508.11341)
Keywords: attack, interpretability
Abstract: In targeted adversarial attacks on vision models, the selection of the target label is a critical yet often overlooked determinant of attack success. This target label corresponds to the class that the attacker aims to force the model to predict. Now, existing strategies typically rely on randomness, model predictions, or static semantic resources, limiting interpretability, reproducibility, or flexibility. This paper then proposes a semantics-guided framework for adversarial target selection using the cross-modal knowledge transfer from pretrained language and vision-language models. We evaluate several state-of-the-art models (BERT, TinyLLAMA, and CLIP) as similarity sources to select the most and least semantically related labels with respect to the ground truth, forming best- and worst-case adversarial scenarios. Our experiments on three vision models and five attack methods reveal that these models consistently render practical adversarial targets and surpass static lexical databases, such as WordNet, particularly for distant class relationships. We also observe that static testing of target labels offers a preliminary assessment of the effectiveness of similarity sources, \textit{a priori} testing. Our results corroborate the suitability of pretrained models for constructing interpretable, standardized, and scalable adversarial benchmarks across architectures and datasets.

Title: SpecDetect: Simple, Fast, and Training-Free Detection of LLM-Generated Text via Spectral Analysis

Authors: Haitong Luo, Weiyao Zhang, Suhang Wang, Wenji Zou, Chungang Lin, Xuying Meng, Yujun Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.11343
Pdf URL: https://arxiv.org/pdf/2508.11343
Copy Paste: [[2508.11343]] SpecDetect: Simple, Fast, and Training-Free Detection of LLM-Generated Text via Spectral Analysis(https://arxiv.org/abs/2508.11343)
Keywords: robust, large language model
Abstract: The proliferation of high-quality text from Large Language Models (LLMs) demands reliable and efficient detection methods. While existing training-free approaches show promise, they often rely on surface-level statistics and overlook fundamental signal properties of the text generation process. In this work, we reframe detection as a signal processing problem, introducing a novel paradigm that analyzes the sequence of token log-probabilities in the frequency domain. By systematically analyzing the signal's spectral properties using the global Discrete Fourier Transform (DFT) and the local Short-Time Fourier Transform (STFT), we find that human-written text consistently exhibits significantly higher spectral energy. This higher energy reflects the larger-amplitude fluctuations inherent in human writing compared to the suppressed dynamics of LLM-generated text. Based on this key insight, we construct SpecDetect, a detector built on a single, robust feature from the global DFT: DFT total energy. We also propose an enhanced version, SpecDetect++, which incorporates a sampling discrepancy mechanism to further boost robustness. Extensive experiments demonstrate that our approach outperforms the state-of-the-art model while running in nearly half the time. Our work introduces a new, efficient, and interpretable pathway for LLM-generated text detection, showing that classical signal processing techniques offer a surprisingly powerful solution to this modern challenge.

Title: NeMo: A Neuron-Level Modularizing-While-Training Approach for Decomposing DNN Models

Authors: Xiaohan Bi, Binhang Qi, Hailong Sun, Xiang Gao, Yue Yu, Xiaojun Liang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.11348
Pdf URL: https://arxiv.org/pdf/2508.11348
Copy Paste: [[2508.11348]] NeMo: A Neuron-Level Modularizing-While-Training Approach for Decomposing DNN Models(https://arxiv.org/abs/2508.11348)
Keywords: transformer
Abstract: With the growing incorporation of deep neural network (DNN) models into modern software systems, the prohibitive construction costs have become a significant challenge. Model reuse has been widely applied to reduce training costs, but indiscriminately reusing entire models may incur significant inference overhead. Consequently, DNN modularization has gained attention, enabling module reuse by decomposing DNN models. The emerging modularizing-while-training (MwT) paradigm, which incorporates modularization into training, outperforms modularizing-after-training approaches. However, existing MwT methods focus on small-scale CNN models at the convolutional kernel level and struggle with diverse DNNs and large-scale models, particularly Transformer-based models. To address these limitations, we propose NeMo, a scalable and generalizable MwT approach. NeMo operates at the neuron level fundamental component common to all DNNs-ensuring applicability to Transformers and various architectures. We design a contrastive learning-based modular training method with an effective composite loss function, enabling scalability to large-scale models. Comprehensive experiments on two Transformer-based models and four CNN models across two classification datasets demonstrate NeMo's superiority over state-of-the-art MwT methods. Results show average gains of 1.72% in module classification accuracy and 58.10% reduction in module size, demonstrating efficacy across both CNN and large-scale Transformer-based models. A case study on open-source projects shows NeMo's potential benefits in practical scenarios, offering a promising approach for scalable and generalizable DNN modularization.

Title: HOID-R1: Reinforcement Learning for Open-World Human-Object Interaction Detection Reasoning with Multimodal Large Language Model

Authors: Zhenhao Zhang, Hanqing Wang, Xiangyu Zeng, Ziyu Cheng, Jiaxin Liu, Haoyu Yan, Zhirui Liu, Kaiyang Ji, Tianxiang Gui, Ke Hu, Kangyi Chen, Yahao Fan, Mokai Pan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.11350
Pdf URL: https://arxiv.org/pdf/2508.11350
Copy Paste: [[2508.11350]] HOID-R1: Reinforcement Learning for Open-World Human-Object Interaction Detection Reasoning with Multimodal Large Language Model(https://arxiv.org/abs/2508.11350)
Keywords: large language model
Abstract: Understanding and recognizing human-object interaction (HOI) is a pivotal application in AR/VR and robotics. Recent open-vocabulary HOI detection approaches depend exclusively on large language models for richer textual prompts, neglecting their inherent 3D spatial understanding capabilities. To address this shortcoming, we introduce HOID-R1, the first HOI detection framework that integrates chain-of-thought (CoT) guided supervised fine-tuning (SFT) with group relative policy optimization (GRPO) within a reinforcement learning (RL) paradigm. Specifically, we initially apply SFT to imbue the model with essential reasoning capabilities, forcing the model to articulate its thought process in the output. Subsequently, we integrate GRPO to leverage multi-reward signals for policy optimization, thereby enhancing alignment across diverse modalities. To mitigate hallucinations in the CoT reasoning, we introduce an "MLLM-as-a-judge" mechanism that supervises the CoT outputs, further improving generalization. Extensive experiments show that HOID-R1 achieves state-of-the-art performance on HOI detection benchmarks and outperforms existing methods in open-world generalization to novel scenarios.

Title: Leveraging the RETFound foundation model for optic disc segmentation in retinal images

Authors: Zhenyi Zhao, Muthu Rama Krishnan Mookiah, Emanuele Trucco
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.11354
Pdf URL: https://arxiv.org/pdf/2508.11354
Copy Paste: [[2508.11354]] Leveraging the RETFound foundation model for optic disc segmentation in retinal images(https://arxiv.org/abs/2508.11354)
Keywords: segmentation
Abstract: RETFound is a well-known foundation model (FM) developed for fundus camera and optical coherence tomography images. It has shown promising performance across multiple datasets in diagnosing diseases, both eye-specific and systemic, from retinal images. However, to our best knowledge, it has not been used for other tasks. We present the first adaptation of RETFound for optic disc segmentation, a ubiquitous and foundational task in retinal image analysis. The resulting segmentation system outperforms state-of-the-art, segmentation-specific baseline networks after training a head with only a very modest number of task-specific examples. We report and discuss results with four public datasets, IDRID, Drishti-GS, RIM-ONE-r3, and REFUGE, and a private dataset, GoDARTS, achieving about 96% Dice consistently across all datasets. Overall, our method obtains excellent performance in internal verification, domain generalization and domain adaptation, and exceeds most of the state-of-the-art baseline results. We discuss the results in the framework of the debate about FMs as alternatives to task-specific architectures. The code is available at: [link to be added after the paper is accepted]

Title: ETTRL: Balancing Exploration and Exploitation in LLM Test-Time Reinforcement Learning Via Entropy Mechanism

Authors: Jia Liu, ChangYi He, YingQiao Lin, MingMin Yang, FeiYang Shen, ShaoGuo Liu, TingTing Gao
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.11356
Pdf URL: https://arxiv.org/pdf/2508.11356
Copy Paste: [[2508.11356]] ETTRL: Balancing Exploration and Exploitation in LLM Test-Time Reinforcement Learning Via Entropy Mechanism(https://arxiv.org/abs/2508.11356)
Keywords: robust, large language model
Abstract: Recent advancements in Large Language Models have yielded significant improvements in complex reasoning tasks such as mathematics and programming. However, these models remain heavily dependent on annotated data and exhibit limited adaptability in unsupervised scenarios. To address these limitations, test-time reinforcement learning (TTRL) has been proposed, which enables self-optimization by leveraging model-generated pseudo-labels. Despite its promise, TTRL faces several key challenges, including high inference costs due to parallel rollouts and early-stage estimation bias that fosters overconfidence, reducing output diversity and causing performance plateaus. To address these challenges, we introduce an entropy-based mechanism to enhance the exploration-exploitation balance in test-time reinforcement learning through two strategies: Entropy-fork Tree Majority Rollout (ETMR) and Entropy-based Advantage Reshaping (EAR). Compared with the baseline, our approach enables Llama3.1-8B to achieve a 68 percent relative improvement in Pass at 1 metric on the AIME 2024 benchmark, while consuming only 60 percent of the rollout tokens budget. This highlights our method's ability to effectively optimize the trade-off between inference efficiency, diversity, and estimation robustness, thereby advancing unsupervised reinforcement learning for open-domain reasoning tasks.

Title: PTSM: Physiology-aware and Task-invariant Spatio-temporal Modeling for Cross-Subject EEG Decoding

Authors: Changhong Jing, Yan Liu, Shuqiang Wang, Bruce X.B. Yu, Gong Chen, Zhejing Hu, Zhi Zhang, Yanyan Shen
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.11357
Pdf URL: https://arxiv.org/pdf/2508.11357
Copy Paste: [[2508.11357]] PTSM: Physiology-aware and Task-invariant Spatio-temporal Modeling for Cross-Subject EEG Decoding(https://arxiv.org/abs/2508.11357)
Keywords: robust
Abstract: Cross-subject electroencephalography (EEG) decoding remains a fundamental challenge in brain-computer interface (BCI) research due to substantial inter-subject variability and the scarcity of subject-invariant representations. This paper proposed PTSM (Physiology-aware and Task-invariant Spatio-temporal Modeling), a novel framework for interpretable and robust EEG decoding across unseen subjects. PTSM employs a dual-branch masking mechanism that independently learns personalized and shared spatio-temporal patterns, enabling the model to preserve individual-specific neural characteristics while extracting task-relevant, population-shared features. The masks are factorized across temporal and spatial dimensions, allowing fine-grained modulation of dynamic EEG patterns with low computational overhead. To further address representational entanglement, PTSM enforces information-theoretic constraints that decompose latent embeddings into orthogonal task-related and subject-related subspaces. The model is trained end-to-end via a multi-objective loss integrating classification, contrastive, and disentanglement objectives. Extensive experiments on cross-subject motor imagery datasets demonstrate that PTSM achieves strong zero-shot generalization, outperforming state-of-the-art baselines without subject-specific calibration. Results highlight the efficacy of disentangled neural representations for achieving both personalized and transferable decoding in non-stationary neurophysiological settings.

Title: Feedback Indicators: The Alignment between Llama and a Teacher in Language Learning

Authors: Sylvio Rüdian, Yassin Elsir, Marvin Kretschmer, Sabine Cayrou, Niels Pinkwart
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.11364
Pdf URL: https://arxiv.org/pdf/2508.11364
Copy Paste: [[2508.11364]] Feedback Indicators: The Alignment between Llama and a Teacher in Language Learning(https://arxiv.org/abs/2508.11364)
Keywords: large language model
Abstract: Automated feedback generation has the potential to enhance students' learning progress by providing timely and targeted feedback. Moreover, it can assist teachers in optimizing their time, allowing them to focus on more strategic and personalized aspects of teaching. To generate high-quality, information-rich formative feedback, it is essential first to extract relevant indicators, as these serve as the foundation upon which the feedback is constructed. Teachers often employ feedback criteria grids composed of various indicators that they evaluate systematically. This study examines the initial phase of extracting such indicators from students' submissions of a language learning course using the large language model Llama 3.1. Accordingly, the alignment between indicators generated by the LLM and human ratings across various feedback criteria is investigated. The findings demonstrate statistically significant strong correlations, even in cases involving unanticipated combinations of indicators and criteria. The methodology employed in this paper offers a promising foundation for extracting indicators from students' submissions using LLMs. Such indicators can potentially be utilized to auto-generate explainable and transparent formative feedback in future research.

Title: Does the Skeleton-Recall Loss Really Work?

Authors: Devansh Arora, Nitin Kumar, Sukrit Gupta
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.11374
Pdf URL: https://arxiv.org/pdf/2508.11374
Copy Paste: [[2508.11374]] Does the Skeleton-Recall Loss Really Work?(https://arxiv.org/abs/2508.11374)
Keywords: segmentation
Abstract: Image segmentation is an important and widely performed task in computer vision. Accomplishing effective image segmentation in diverse settings often requires custom model architectures and loss functions. A set of models that specialize in segmenting thin tubular structures are topology preservation-based loss functions. These models often utilize a pixel skeletonization process claimed to generate more precise segmentation masks of thin tubes and better capture the structures that other models often miss. One such model, Skeleton Recall Loss (SRL) proposed by Kirchhoff et al.~\cite {kirchhoff2024srl}, was stated to produce state-of-the-art results on benchmark tubular datasets. In this work, we performed a theoretical analysis of the gradients for the SRL loss. Upon comparing the performance of the proposed method on some of the tubular datasets (used in the original work, along with some additional datasets), we found that the performance of SRL-based segmentation models did not exceed traditional baseline models. By providing both a theoretical explanation and empirical evidence, this work critically evaluates the limitations of topology-based loss functions, offering valuable insights for researchers aiming to develop more effective segmentation models for complex tubular structures.

Title: When Punctuation Matters: A Large-Scale Comparison of Prompt Robustness Methods for LLMs

Authors: Mikhail Seleznyov, Mikhail Chaichuk, Gleb Ershov, Alexander Panchenko, Elena Tutubalina, Oleg Somov
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.11383
Pdf URL: https://arxiv.org/pdf/2508.11383
Copy Paste: [[2508.11383]] When Punctuation Matters: A Large-Scale Comparison of Prompt Robustness Methods for LLMs(https://arxiv.org/abs/2508.11383)
Keywords: robust, large language model
Abstract: Large Language Models (LLMs) are highly sensitive to subtle, non-semantic variations in prompt phrasing and formatting. In this work, we present the first systematic evaluation of 5 methods for improving prompt robustness within a unified experimental framework. We benchmark these techniques on 8 models from Llama, Qwen and Gemma families across 52 tasks from Natural Instructions dataset. Our evaluation covers robustness methods from both fine-tuned and in-context learning paradigms, and tests their generalization against multiple types of distribution shifts. Finally, we extend our analysis to GPT-4.1 and DeepSeek V3 to assess frontier models' current robustness to format perturbations. Our findings offer actionable insights into the relative effectiveness of these robustness methods, enabling practitioners to make informed decisions when aiming for stable and reliable LLM performance in real-world applications. Code: this https URL.

Title: Retrieval-augmented reasoning with lean language models

Authors: Ryan Sze-Yin Chan, Federico Nanni, Tomas Lazauskas, Rosie Wood, Penelope Yong, Lionel Tarassenko, Mark Girolami, James Geddes, Andrew Duncan
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2508.11386
Pdf URL: https://arxiv.org/pdf/2508.11386
Copy Paste: [[2508.11386]] Retrieval-augmented reasoning with lean language models(https://arxiv.org/abs/2508.11386)
Keywords: secure, privacy
Abstract: This technical report details a novel approach to combining reasoning and retrieval augmented generation (RAG) within a single, lean language model architecture. While existing RAG systems typically rely on large-scale models and external APIs, our work addresses the increasing demand for performant and privacy-preserving solutions deployable in resource-constrained or secure environments. Building on recent developments in test-time scaling and small-scale reasoning models, we develop a retrieval augmented conversational agent capable of interpreting complex, domain-specific queries using a lightweight backbone model. Our system integrates a dense retriever with fine-tuned Qwen2.5-Instruct models, using synthetic query generation and reasoning traces derived from frontier models (e.g., DeepSeek-R1) over a curated corpus, in this case, the NHS A-to-Z condition pages. We explore the impact of summarisation-based document compression, synthetic data design, and reasoning-aware fine-tuning on model performance. Evaluation against both non-reasoning and general-purpose lean models demonstrates that our domain-specific fine-tuning approach yields substantial gains in answer accuracy and consistency, approaching frontier-level performance while remaining feasible for local deployment. All implementation details and code are publicly released to support reproducibility and adaptation across domains.

Title: Model Interpretability and Rationale Extraction by Input Mask Optimization

Authors: Marc Brinner, Sina Zarriess
Subjects: cs.CL, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2508.11388
Pdf URL: https://arxiv.org/pdf/2508.11388
Copy Paste: [[2508.11388]] Model Interpretability and Rationale Extraction by Input Mask Optimization(https://arxiv.org/abs/2508.11388)
Keywords: extraction, interpretability
Abstract: Concurrent to the rapid progress in the development of neural-network based models in areas like natural language processing and computer vision, the need for creating explanations for the predictions of these black-box models has risen steadily. We propose a new method to generate extractive explanations for predictions made by neural networks, that is based on masking parts of the input which the model does not consider to be indicative of the respective class. The masking is done using gradient-based optimization combined with a new regularization scheme that enforces sufficiency, comprehensiveness and compactness of the generated explanation, three properties that are known to be desirable from the related field of rationale extraction in natural language processing. In this way, we bridge the gap between model interpretability and rationale extraction, thereby proving that the latter of which can be performed without training a specialized model, only on the basis of a trained classifier. We further apply the same method to image inputs and obtain high quality explanations for image classifications, which indicates that the conditions proposed for rationale extraction in natural language processing are more broadly applicable to different input types.

Title: Rationalizing Transformer Predictions via End-To-End Differentiable Self-Training

Authors: Marc Brinner, Sina Zarrieß
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2508.11393
Pdf URL: https://arxiv.org/pdf/2508.11393
Copy Paste: [[2508.11393]] Rationalizing Transformer Predictions via End-To-End Differentiable Self-Training(https://arxiv.org/abs/2508.11393)
Keywords: transformer
Abstract: We propose an end-to-end differentiable training paradigm for stable training of a rationalized transformer classifier. Our approach results in a single model that simultaneously classifies a sample and scores input tokens based on their relevance to the classification. To this end, we build on the widely-used three-player-game for training rationalized models, which typically relies on training a rationale selector, a classifier and a complement classifier. We simplify this approach by making a single model fulfill all three roles, leading to a more efficient training paradigm that is not susceptible to the common training instabilities that plague existing approaches. Further, we extend this paradigm to produce class-wise rationales while incorporating recent advances in parameterizing and regularizing the resulting rationales, thus leading to substantially improved and state-of-the-art alignment with human annotations without any explicit supervision.

Title: On-Policy RL Meets Off-Policy Experts: Harmonizing Supervised Fine-Tuning and Reinforcement Learning via Dynamic Weighting

Authors: Wenhao Zhang, Yuexiang Xie, Yuchang Sun, Yanxi Chen, Guoyin Wang, Yaliang Li, Bolin Ding, Jingren Zhou
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.11408
Pdf URL: https://arxiv.org/pdf/2508.11408
Copy Paste: [[2508.11408]] On-Policy RL Meets Off-Policy Experts: Harmonizing Supervised Fine-Tuning and Reinforcement Learning via Dynamic Weighting(https://arxiv.org/abs/2508.11408)
Keywords: large language model
Abstract: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) are two prominent post-training paradigms for refining the capabilities and aligning the behavior of Large Language Models (LLMs). Existing approaches that integrate SFT and RL often face the risk of disrupting established model patterns and inducing overfitting to expert data. To address this, we present a novel investigation into the unified view of SFT and RL through an off-policy versus on-policy lens. We propose CHORD, a framework for the Controllable Harmonization of On- and Off-Policy Reinforcement Learning via Dynamic Weighting, which reframes SFT not as a separate stage but as a dynamically weighted auxiliary objective within the on-policy RL process. Based on an analysis of off-policy expert data's influence at both holistic and granular levels, we incorporate a dual-control mechanism in CHORD. Specifically, the framework first employs a global coefficient to holistically guide the transition from off-policy imitation to on-policy exploration, and then applies a token-wise weighting function that enables granular learning from expert tokens, which preserves on-policy exploration and mitigates disruption from off-policy data. We conduct extensive experiments on widely used benchmarks, providing empirical evidence that CHORD achieves a stable and efficient learning process. By effectively harmonizing off-policy expert data with on-policy exploration, CHORD demonstrates significant improvements over baselines. We release the implementation at this https URL to inspire further research.

Title: RMFAT: Recurrent Multi-scale Feature Atmospheric Turbulence Mitigator

Authors: Zhiming Liu, Nantheera Anantrasirichai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.11409
Pdf URL: https://arxiv.org/pdf/2508.11409
Copy Paste: [[2508.11409]] RMFAT: Recurrent Multi-scale Feature Atmospheric Turbulence Mitigator(https://arxiv.org/abs/2508.11409)
Keywords: transformer
Abstract: Atmospheric turbulence severely degrades video quality by introducing distortions such as geometric warping, blur, and temporal flickering, posing significant challenges to both visual clarity and temporal consistency. Current state-of-the-art methods are based on transformer and 3D architectures and require multi-frame input, but their large computational cost and memory usage limit real-time deployment, especially in resource-constrained scenarios. In this work, we propose RMFAT: Recurrent Multi-scale Feature Atmospheric Turbulence Mitigator, designed for efficient and temporally consistent video restoration under AT conditions. RMFAT adopts a lightweight recurrent framework that restores each frame using only two inputs at a time, significantly reducing temporal window size and computational burden. It further integrates multi-scale feature encoding and decoding with temporal warping modules at both encoder and decoder stages to enhance spatial detail and temporal coherence. Extensive experiments on synthetic and real-world atmospheric turbulence datasets demonstrate that RMFAT not only outperforms existing methods in terms of clarity restoration (with nearly a 9\% improvement in SSIM) but also achieves significantly improved inference speed (more than a fourfold reduction in runtime), making it particularly suitable for real-time atmospheric turbulence suppression tasks.

Title: SelfAdapt: Unsupervised Domain Adaptation of Cell Segmentation Models

Authors: Fabian H. Reith, Jannik Franzen, Dinesh R. Palli, J. Lorenz Rumberger, Dagmar Kainmueller
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2508.11411
Pdf URL: https://arxiv.org/pdf/2508.11411
Copy Paste: [[2508.11411]] SelfAdapt: Unsupervised Domain Adaptation of Cell Segmentation Models(https://arxiv.org/abs/2508.11411)
Keywords: segmentation
Abstract: Deep neural networks have become the go-to method for biomedical instance segmentation. Generalist models like Cellpose demonstrate state-of-the-art performance across diverse cellular data, though their effectiveness often degrades on domains that differ from their training data. While supervised fine-tuning can address this limitation, it requires annotated data that may not be readily available. We propose SelfAdapt, a method that enables the adaptation of pre-trained cell segmentation models without the need for labels. Our approach builds upon student-teacher augmentation consistency training, introducing L2-SP regularization and label-free stopping criteria. We evaluate our method on the LiveCell and TissueNet datasets, demonstrating relative improvements in AP0.5 of up to 29.64% over baseline Cellpose. Additionally, we show that our unsupervised adaptation can further improve models that were previously fine-tuned with supervision. We release SelfAdapt as an easy-to-use extension of the Cellpose framework. The code for our method is publicly available at https: //github.com/Kainmueller-Lab/self_adapt.

Title: Survey-to-Behavior: Downstream Alignment of Human Values in LLMs via Survey Questions

Authors: Shangrui Nie, Florian Mai, David Kaczér, Charles Welch, Zhixue Zhao, Lucie Flek
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.11414
Pdf URL: https://arxiv.org/pdf/2508.11414
Copy Paste: [[2508.11414]] Survey-to-Behavior: Downstream Alignment of Human Values in LLMs via Survey Questions(https://arxiv.org/abs/2508.11414)
Keywords: large language model
Abstract: Large language models implicitly encode preferences over human values, yet steering them often requires large training data. In this work, we investigate a simple approach: Can we reliably modify a model's value system in downstream behavior by training it to answer value survey questions accordingly? We first construct value profiles of several open-source LLMs by asking them to rate a series of value-related descriptions spanning 20 distinct human values, which we use as a baseline for subsequent experiments. We then investigate whether the value system of a model can be governed by fine-tuning on the value surveys. We evaluate the effect of finetuning on the model's behavior in two ways; first, we assess how answers change on in-domain, held-out survey questions. Second, we evaluate whether the model's behavior changes in out-of-domain settings (situational scenarios). To this end, we construct a contextualized moral judgment dataset based on Reddit posts and evaluate changes in the model's behavior in text-based adventure games. We demonstrate that our simple approach can not only change the model's answers to in-domain survey questions, but also produces substantial shifts (value alignment) in implicit downstream task behavior.

Title: Training-free Dimensionality Reduction via Feature Truncation: Enhancing Efficiency in Privacy-preserving Multi-Biometric Systems

Authors: Florian Bayer, Maximilian Russo, Christian Rathgeb
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.11419
Pdf URL: https://arxiv.org/pdf/2508.11419
Copy Paste: [[2508.11419]] Training-free Dimensionality Reduction via Feature Truncation: Enhancing Efficiency in Privacy-preserving Multi-Biometric Systems(https://arxiv.org/abs/2508.11419)
Keywords: security, privacy, protect, biometric, extraction
Abstract: Biometric recognition is widely used, making the privacy and security of extracted templates a critical concern. Biometric Template Protection schemes, especially those utilizing Homomorphic Encryption, introduce significant computational challenges due to increased workload. Recent advances in deep neural networks have enabled state-of-the-art feature extraction for face, fingerprint, and iris modalities. The ubiquity and affordability of biometric sensors further facilitate multi-modal fusion, which can enhance security by combining features from different modalities. This work investigates the biometric performance of reduced multi-biometric template sizes. Experiments are conducted on an in-house virtual multi-biometric database, derived from DNN-extracted features for face, fingerprint, and iris, using the FRGC, MCYT, and CASIA databases. The evaluated approaches are (i) explainable and straightforward to implement under encryption, (ii) training-free, and (iii) capable of generalization. Dimensionality reduction of feature vectors leads to fewer operations in the Homomorphic Encryption (HE) domain, enabling more efficient encrypted processing while maintaining biometric accuracy and security at a level equivalent to or exceeding single-biometric recognition. Our results demonstrate that, by fusing feature vectors from multiple modalities, template size can be reduced by 67 % with no loss in Equal Error Rate (EER) compared to the best-performing single modality.

Title: Generative Co-Design of Antibody Sequences and Structures via Black-Box Guidance in a Shared Latent Space

Authors: Yinghua Yao, Yuangang Pan, Xixian Chen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.11424
Pdf URL: https://arxiv.org/pdf/2508.11424
Copy Paste: [[2508.11424]] Generative Co-Design of Antibody Sequences and Structures via Black-Box Guidance in a Shared Latent Space(https://arxiv.org/abs/2508.11424)
Keywords: generative
Abstract: Advancements in deep generative models have enabled the joint modeling of antibody sequence and structure, given the antigen-antibody complex as context. However, existing approaches for optimizing complementarity-determining regions (CDRs) to improve developability properties operate in the raw data space, leading to excessively costly evaluations due to the inefficient search process. To address this, we propose LatEnt blAck-box Design (LEAD), a sequence-structure co-design framework that optimizes both sequence and structure within their shared latent space. Optimizing shared latent codes can not only break through the limitations of existing methods, but also ensure synchronization of different modality designs. Particularly, we design a black-box guidance strategy to accommodate real-world scenarios where many property evaluators are non-differentiable. Experimental results demonstrate that our LEAD achieves superior optimization performance for both single and multi-property objectives. Notably, LEAD reduces query consumption by a half while surpassing baseline methods in property optimization. The code is available at this https URL.

Title: ImagiDrive: A Unified Imagination-and-Planning Framework for Autonomous Driving

Authors: Jingyu Li, Bozhou Zhang, Xin Jin, Jiankang Deng, Xiatian Zhu, Li Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.11428
Pdf URL: https://arxiv.org/pdf/2508.11428
Copy Paste: [[2508.11428]] ImagiDrive: A Unified Imagination-and-Planning Framework for Autonomous Driving(https://arxiv.org/abs/2508.11428)
Keywords: robust, interpretability
Abstract: Autonomous driving requires rich contextual comprehension and precise predictive reasoning to navigate dynamic and complex environments safely. Vision-Language Models (VLMs) and Driving World Models (DWMs) have independently emerged as powerful recipes addressing different aspects of this challenge. VLMs provide interpretability and robust action prediction through their ability to understand multi-modal context, while DWMs excel in generating detailed and plausible future driving scenarios essential for proactive planning. Integrating VLMs with DWMs is an intuitive, promising, yet understudied strategy to exploit the complementary strengths of accurate behavioral prediction and realistic scene generation. Nevertheless, this integration presents notable challenges, particularly in effectively connecting action-level decisions with high-fidelity pixel-level predictions and maintaining computational efficiency. In this paper, we propose ImagiDrive, a novel end-to-end autonomous driving framework that integrates a VLM-based driving agent with a DWM-based scene imaginer to form a unified imagination-and-planning loop. The driving agent predicts initial driving trajectories based on multi-modal inputs, guiding the scene imaginer to generate corresponding future scenarios. These imagined scenarios are subsequently utilized to iteratively refine the driving agent's planning decisions. To address efficiency and predictive accuracy challenges inherent in this integration, we introduce an early stopping mechanism and a trajectory selection strategy. Extensive experimental validation on the nuScenes and NAVSIM datasets demonstrates the robustness and superiority of ImagiDrive over previous alternatives under both open-loop and closed-loop conditions.

Title: HumorPlanSearch: Structured Planning and HuCoT for Contextual AI Humor

Authors: Shivam Dubey
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.11429
Pdf URL: https://arxiv.org/pdf/2508.11429
Copy Paste: [[2508.11429]] HumorPlanSearch: Structured Planning and HuCoT for Contextual AI Humor(https://arxiv.org/abs/2508.11429)
Keywords: large language model
Abstract: Automated humor generation with Large Language Models (LLMs) often yields jokes that feel generic, repetitive, or tone-deaf because humor is deeply situated and hinges on the listener's cultural background, mindset, and immediate context. We introduce HumorPlanSearch, a modular pipeline that explicitly models context through: (1) Plan-Search for diverse, topic-tailored strategies; (2) Humor Chain-of-Thought (HuCoT) templates capturing cultural and stylistic reasoning; (3) a Knowledge Graph to retrieve and adapt high-performing historical strategies; (4) novelty filtering via semantic embeddings; and (5) an iterative judge-driven revision loop. To evaluate context sensitivity and comedic quality, we propose the Humor Generation Score (HGS), which fuses direct ratings, multi-persona feedback, pairwise win-rates, and topic relevance. In experiments across nine topics with feedback from 13 human judges, our full pipeline (KG + Revision) boosts mean HGS by 15.4 percent (p < 0.05) over a strong baseline. By foregrounding context at every stage from strategy planning to multi-signal evaluation, HumorPlanSearch advances AI-driven humor toward more coherent, adaptive, and culturally attuned comedy.

Title: Remove360: Benchmarking Residuals After Object Removal in 3D Gaussian Splatting

Authors: Simona Kocour, Assia Benbihi, Torsten Sattler
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.11431
Pdf URL: https://arxiv.org/pdf/2508.11431
Copy Paste: [[2508.11431]] Remove360: Benchmarking Residuals After Object Removal in 3D Gaussian Splatting(https://arxiv.org/abs/2508.11431)
Keywords: privacy, robust
Abstract: Understanding what semantic information persists after object removal is critical for privacy-preserving 3D reconstruction and editable scene representations. In this work, we introduce a novel benchmark and evaluation framework to measure semantic residuals, the unintended semantic traces left behind, after object removal in 3D Gaussian Splatting. We conduct experiments across a diverse set of indoor and outdoor scenes, showing that current methods can preserve semantic information despite the absence of visual geometry. We also release Remove360, a dataset of pre/post-removal RGB images and object-level masks captured in real-world environments. While prior datasets have focused on isolated object instances, Remove360 covers a broader and more complex range of indoor and outdoor scenes, enabling evaluation of object removal in the context of full-scene representations. Given ground truth images of a scene before and after object removal, we assess whether we can truly eliminate semantic presence, and if downstream models can still infer what was removed. Our findings reveal critical limitations in current 3D object removal techniques and underscore the need for more robust solutions capable of handling real-world complexity. The evaluation framework is available at this http URL. Data are available at this http URL.

Title: Robust Convolution Neural ODEs via Contractivity-promoting regularization

Authors: Muhammad Zakwan, Liang Xu, Giancarlo Ferrari-Trecate
Subjects: cs.LG, cs.CV, eess.SY
Abstract URL: https://arxiv.org/abs/2508.11432
Pdf URL: https://arxiv.org/pdf/2508.11432
Copy Paste: [[2508.11432]] Robust Convolution Neural ODEs via Contractivity-promoting regularization(https://arxiv.org/abs/2508.11432)
Keywords: attack, robust
Abstract: Neural networks can be fragile to input noise and adversarial attacks. In this work, we consider Convolutional Neural Ordinary Differential Equations (NODEs), a family of continuous-depth neural networks represented by dynamical systems, and propose to use contraction theory to improve their robustness. For a contractive dynamical system two trajectories starting from different initial conditions converge to each other exponentially fast. Contractive Convolutional NODEs can enjoy increased robustness as slight perturbations of the features do not cause a significant change in the output. Contractivity can be induced during training by using a regularization term involving the Jacobian of the system dynamics. To reduce the computational burden, we show that it can also be promoted using carefully selected weight regularization terms for a class of NODEs with slope-restricted activation functions. The performance of the proposed regularizers is illustrated through benchmark image classification tasks on MNIST and FashionMNIST datasets, where images are corrupted by different kinds of noise and attacks.

Title: MM-R1: Unleashing the Power of Unified Multimodal Large Language Models for Personalized Image Generation

Authors: Qian Liang, Yujia Wu, Kuncheng Li, Jiwei Wei, Shiyuan He, Jinyu Guo, Ning Xie
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.11433
Pdf URL: https://arxiv.org/pdf/2508.11433
Copy Paste: [[2508.11433]] MM-R1: Unleashing the Power of Unified Multimodal Large Language Models for Personalized Image Generation(https://arxiv.org/abs/2508.11433)
Keywords: large language model
Abstract: Multimodal Large Language Models (MLLMs) with unified architectures excel across a wide range of vision-language tasks, yet aligning them with personalized image generation remains a significant challenge. Existing methods for MLLMs are frequently subject-specific, demanding a data-intensive fine-tuning process for every new subject, which limits their scalability. In this paper, we introduce MM-R1, a framework that integrates a cross-modal Chain-of-Thought (X-CoT) reasoning strategy to unlock the inherent potential of unified MLLMs for personalized image generation. Specifically, we structure personalization as an integrated visual reasoning and generation process: (1) grounding subject concepts by interpreting and understanding user-provided images and contextual cues, and (2) generating personalized images conditioned on both the extracted subject representations and user prompts. To further enhance the reasoning capability, we adopt Grouped Reward Proximal Policy Optimization (GRPO) to explicitly align the generation. Experiments demonstrate that MM-R1 unleashes the personalization capability of unified MLLMs to generate images with high subject fidelity and strong text alignment in a zero-shot manner.

Title: Online Anti-sexist Speech: Identifying Resistance to Gender Bias in Political Discourse

Authors: Aditi Dutta, Susan Banducci
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2508.11434
Pdf URL: https://arxiv.org/pdf/2508.11434
Copy Paste: [[2508.11434]] Online Anti-sexist Speech: Identifying Resistance to Gender Bias in Political Discourse(https://arxiv.org/abs/2508.11434)
Keywords: large language model
Abstract: Anti-sexist speech, i.e., public expressions that challenge or resist gendered abuse and sexism, plays a vital role in shaping democratic debate online. Yet automated content moderation systems, increasingly powered by large language models (LLMs), may struggle to distinguish such resistance from the sexism it opposes. This study examines how five LLMs classify sexist, anti-sexist, and neutral political tweets from the UK, focusing on high-salience trigger events involving female Members of Parliament in the year 2022. Our analysis show that models frequently misclassify anti-sexist speech as harmful, particularly during politically charged events where rhetorical styles of harm and resistance converge. These errors risk silencing those who challenge sexism, with disproportionate consequences for marginalised voices. We argue that moderation design must move beyond binary harmful/not-harmful schemas, integrate human-in-the-loop review during sensitive events, and explicitly include counter-speech in training data. By linking feminist scholarship, event-based analysis, and model evaluation, this work highlights the sociotechnical challenges of safeguarding resistance speech in digital political spaces.

Title: Multi-Sensory Cognitive Computing for Learning Population-level Brain Connectivity

Authors: Mayssa Soussia, Mohamed Ali Mahjoub, Islem Rekik
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.11436
Pdf URL: https://arxiv.org/pdf/2508.11436
Copy Paste: [[2508.11436]] Multi-Sensory Cognitive Computing for Learning Population-level Brain Connectivity(https://arxiv.org/abs/2508.11436)
Keywords: interpretability
Abstract: The generation of connectional brain templates (CBTs) has recently garnered significant attention for its potential to identify unique connectivity patterns shared across individuals. However, existing methods for CBT learning such as conventional machine learning and graph neural networks (GNNs) are hindered by several limitations. These include: (i) poor interpretability due to their black-box nature, (ii) high computational cost, and (iii) an exclusive focus on structure and topology, overlooking the cognitive capacity of the generated CBT. To address these challenges, we introduce mCOCO (multi-sensory COgnitive COmputing), a novel framework that leverages Reservoir Computing (RC) to learn population-level functional CBT from BOLD (Blood-Oxygen-level-Dependent) signals. RC's dynamic system properties allow for tracking state changes over time, enhancing interpretability and enabling the modeling of brain-like dynamics, as demonstrated in prior literature. By integrating multi-sensory inputs (e.g., text, audio, and visual data), mCOCO captures not only structure and topology but also how brain regions process information and adapt to cognitive tasks such as sensory processing, all in a computationally efficient manner. Our mCOCO framework consists of two phases: (1) mapping BOLD signals into the reservoir to derive individual functional connectomes, which are then aggregated into a group-level CBT - an approach, to the best of our knowledge, not previously explored in functional connectivity studies - and (2) incorporating multi-sensory inputs through a cognitive reservoir, endowing the CBT with cognitive traits. Extensive evaluations show that our mCOCO-based template significantly outperforms GNN-based CBT in terms of centeredness, discriminativeness, topological soundness, and multi-sensory memory retention. Our source code is available at this https URL.

Title: Inside Knowledge: Graph-based Path Generation with Explainable Data Augmentation and Curriculum Learning for Visual Indoor Navigation

Authors: Daniel Airinei, Elena Burceanu, Marius Leordeanu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.11446
Pdf URL: https://arxiv.org/pdf/2508.11446
Copy Paste: [[2508.11446]] Inside Knowledge: Graph-based Path Generation with Explainable Data Augmentation and Curriculum Learning for Visual Indoor Navigation(https://arxiv.org/abs/2508.11446)
Keywords: robust
Abstract: Indoor navigation is a difficult task, as it generally comes with poor GPS access, forcing solutions to rely on other sources of information. While significant progress continues to be made in this area, deployment to production applications is still lacking, given the complexity and additional requirements of current solutions. Here, we introduce an efficient, real-time and easily deployable deep learning approach, based on visual input only, that can predict the direction towards a target from images captured by a mobile device. Our technical approach, based on a novel graph-based path generation method, combined with explainable data augmentation and curriculum learning, includes contributions that make the process of data collection, annotation and training, as automatic as possible, efficient and robust. On the practical side, we introduce a novel largescale dataset, with video footage inside a relatively large shopping mall, in which each frame is annotated with the correct next direction towards different specific target destinations. Different from current methods, ours relies solely on vision, avoiding the need of special sensors, additional markers placed along the path, knowledge of the scene map or internet access. We also created an easy to use application for Android, which we plan to make publicly available. We make all our data and code available along with visual demos on our project site

Title: Reference Points in LLM Sentiment Analysis: The Role of Structured Context

Authors: Junichiro Niimi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.11454
Pdf URL: https://arxiv.org/pdf/2508.11454
Copy Paste: [[2508.11454]] Reference Points in LLM Sentiment Analysis: The Role of Structured Context(https://arxiv.org/abs/2508.11454)
Keywords: large language model
Abstract: Large language models (LLMs) are now widely used across many fields, including marketing research. Sentiment analysis, in particular, helps firms understand consumer preferences. While most NLP studies classify sentiment from review text alone, marketing theories, such as prospect theory and expectation--disconfirmation theory, point out that customer evaluations are shaped not only by the actual experience but also by additional reference points. This study therefore investigates how the content and format of such supplementary information affect sentiment analysis using LLMs. We compare natural language (NL) and JSON-formatted prompts using a lightweight 3B parameter model suitable for practical marketing applications. Experiments on two Yelp categories (Restaurant and Nightlife) show that the JSON prompt with additional information outperforms all baselines without fine-tuning: Macro-F1 rises by 1.6% and 4% while RMSE falls by 16% and 9.1%, respectively, making it deployable in resource-constrained edge devices. Furthermore, a follow-up analysis confirms that performance gains stem from genuine contextual reasoning rather than label proxying. This work demonstrates that structured prompting can enable smaller models to achieve competitive performance, offering a practical alternative to large-scale model deployment.

Title: Data-Driven Deepfake Image Detection Method -- The 2024 Global Deepfake Image Detection Challenge

Authors: Xiaoya Zhu, Yibing Nan, Shiguo Lian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.11464
Pdf URL: https://arxiv.org/pdf/2508.11464
Copy Paste: [[2508.11464]] Data-Driven Deepfake Image Detection Method -- The 2024 Global Deepfake Image Detection Challenge(https://arxiv.org/abs/2508.11464)
Keywords: security, transformer
Abstract: With the rapid development of technology in the field of AI, deepfake technology has emerged as a double-edged sword. It has not only created a large amount of AI-generated content but also posed unprecedented challenges to digital security. The task of the competition is to determine whether a face image is a Deepfake image and output its probability score of being a Deepfake image. In the image track competition, our approach is based on the Swin Transformer V2-B classification network. And online data augmentation and offline sample generation methods are employed to enrich the diversity of training samples and increase the generalization ability of the model. Finally, we got the award of excellence in Deepfake image detection.

Title: CoFi: A Fast Coarse-to-Fine Few-Shot Pipeline for Glomerular Basement Membrane Segmentation

Authors: Hongjin Fang, Daniel Reisenbüchler, Kenji Ikemura, Mert R. Sabuncu, Yihe Yang, Ruining Deng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.11469
Pdf URL: https://arxiv.org/pdf/2508.11469
Copy Paste: [[2508.11469]] CoFi: A Fast Coarse-to-Fine Few-Shot Pipeline for Glomerular Basement Membrane Segmentation(https://arxiv.org/abs/2508.11469)
Keywords: segmentation
Abstract: Accurate segmentation of the glomerular basement membrane (GBM) in electron microscopy (EM) images is fundamental for quantifying membrane thickness and supporting the diagnosis of various kidney diseases. While supervised deep learning approaches achieve high segmentation accuracy, their reliance on extensive pixel-level annotation renders them impractical for clinical workflows. Few-shot learning can reduce this annotation burden but often struggles to capture the fine structural details necessary for GBM analysis. In this study, we introduce CoFi, a fast and efficient coarse-to-fine few-shot segmentation pipeline designed for GBM delineation in EM images. CoFi first trains a lightweight neural network using only three annotated images to produce an initial coarse segmentation mask. This mask is then automatically processed to generate high-quality point prompts with morphology-aware pruning, which are subsequently used to guide SAM in refining the segmentation. The proposed method achieved exceptional GBM segmentation performance, with a Dice coefficient of 74.54% and an inference speed of 1.9 FPS. We demonstrate that CoFi not only alleviates the annotation and computational burdens associated with conventional methods, but also achieves accurate and reliable segmentation results. The pipeline's speed and annotation efficiency make it well-suited for research and hold strong potential for clinical applications in renal pathology. The pipeline is publicly available at: this https URL.

Title: RMSL: Weakly-Supervised Insider Threat Detection with Robust Multi-sphere Learning

Authors: Yang Wang, Yaxin Zhao, Xinyu Jiao, Sihan Xu, Xiangrui Cai, Ying Zhang, Xiaojie Yuan
Subjects: cs.CR, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.11472
Pdf URL: https://arxiv.org/pdf/2508.11472
Copy Paste: [[2508.11472]] RMSL: Weakly-Supervised Insider Threat Detection with Robust Multi-sphere Learning(https://arxiv.org/abs/2508.11472)
Keywords: robust
Abstract: Insider threat detection aims to identify malicious user behavior by analyzing logs that record user interactions. Due to the lack of fine-grained behavior-level annotations, detecting specific behavior-level anomalies within user behavior sequences is challenging. Unsupervised methods face high false positive rates and miss rates due to the inherent ambiguity between normal and anomalous behaviors. In this work, we instead introduce weak labels of behavior sequences, which have lower annotation costs, i.e., the training labels (anomalous or normal) are at sequence-level instead of behavior-level, to enhance the detection capability for behavior-level anomalies by learning discriminative features. To achieve this, we propose a novel framework called Robust Multi-sphere Learning (RMSL). RMSL uses multiple hyper-spheres to represent the normal patterns of behaviors. Initially, a one-class classifier is constructed as a good anomaly-supervision-free starting point. Building on this, using multiple instance learning and adaptive behavior-level self-training debiasing based on model prediction confidence, the framework further refines hyper-spheres and feature representations using weak sequence-level labels. This approach enhances the model's ability to distinguish between normal and anomalous behaviors. Extensive experiments demonstrate that RMSL significantly improves the performance of behavior-level insider threat detection.

Title: TACR-YOLO: A Real-time Detection Framework for Abnormal Human Behaviors Enhanced with Coordinate and Task-Aware Representations

Authors: Xinyi Yin, Wenbo Yuan, Xuecheng Wu, Liangyu Fu, Danlei Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.11478
Pdf URL: https://arxiv.org/pdf/2508.11478
Copy Paste: [[2508.11478]] TACR-YOLO: A Real-time Detection Framework for Abnormal Human Behaviors Enhanced with Coordinate and Task-Aware Representations(https://arxiv.org/abs/2508.11478)
Keywords: robust
Abstract: Abnormal Human Behavior Detection (AHBD) under special scenarios is becoming increasingly crucial. While YOLO-based detection methods excel in real-time tasks, they remain hindered by challenges including small objects, task conflicts, and multi-scale fusion in AHBD. To tackle them, we propose TACR-YOLO, a new real-time framework for AHBD. We introduce a Coordinate Attention Module to enhance small object detection, a Task-Aware Attention Module to deal with classification-regression conflicts, and a Strengthen Neck Network for refined multi-scale fusion, respectively. In addition, we optimize Anchor Box sizes using K-means clustering and deploy DIoU-Loss to improve bounding box regression. The Personnel Anomalous Behavior Detection (PABD) dataset, which includes 8,529 samples across four behavior categories, is also presented. Extensive experimental results indicate that TACR-YOLO achieves 91.92% mAP on PABD, with competitive speed and robustness. Ablation studies highlight the contribution of each improvement. This work provides new insights for abnormal behavior detection under special scenarios, advancing its progress.

Title: OpenConstruction: A Systematic Synthesis of Open Visual Datasets for Data-Centric Artificial Intelligence in Construction Monitoring

Authors: Ruoxin Xiong, Yanyu Wang, Jiannan Cai, Kaijian Liu, Yuansheng Zhu, Pingbo Tang, Nora El-Gohary
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.11482
Pdf URL: https://arxiv.org/pdf/2508.11482
Copy Paste: [[2508.11482]] OpenConstruction: A Systematic Synthesis of Open Visual Datasets for Data-Centric Artificial Intelligence in Construction Monitoring(https://arxiv.org/abs/2508.11482)
Keywords: fair
Abstract: The construction industry increasingly relies on visual data to support Artificial Intelligence (AI) and Machine Learning (ML) applications for site monitoring. High-quality, domain-specific datasets, comprising images, videos, and point clouds, capture site geometry and spatiotemporal dynamics, including the location and interaction of objects, workers, and materials. However, despite growing interest in leveraging visual datasets, existing resources vary widely in sizes, data modalities, annotation quality, and representativeness of real-world construction conditions. A systematic review to categorize their data characteristics and application contexts is still lacking, limiting the community's ability to fully understand the dataset landscape, identify critical gaps, and guide future directions toward more effective, reliable, and scalable AI applications in construction. To address this gap, this study conducts an extensive search of academic databases and open-data platforms, yielding 51 publicly available visual datasets that span the 2005-2024 period. These datasets are categorized using a structured data schema covering (i) data fundamentals (e.g., size and license), (ii) data modalities (e.g., RGB and point cloud), (iii) annotation frameworks (e.g., bounding boxes), and (iv) downstream application domains (e.g., progress tracking). This study synthesizes these findings into an open-source catalog, OpenConstruction, supporting data-driven method development. Furthermore, the study discusses several critical limitations in the existing construction dataset landscape and presents a roadmap for future data infrastructure anchored in the Findability, Accessibility, Interoperability, and Reusability (FAIR) principles. By reviewing the current landscape and outlining strategic priorities, this study supports the advancement of data-centric solutions in the construction sector.

Title: CineTrans: Learning to Generate Videos with Cinematic Transitions via Masked Diffusion Models

Authors: Xiaoxue Wu, Bingjie Gao, Yu Qiao, Yaohui Wang, Xinyuan Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.11484
Pdf URL: https://arxiv.org/pdf/2508.11484
Copy Paste: [[2508.11484]] CineTrans: Learning to Generate Videos with Cinematic Transitions via Masked Diffusion Models(https://arxiv.org/abs/2508.11484)
Keywords: diffusion
Abstract: Despite significant advances in video synthesis, research into multi-shot video generation remains in its infancy. Even with scaled-up models and massive datasets, the shot transition capabilities remain rudimentary and unstable, largely confining generated videos to single-shot sequences. In this work, we introduce CineTrans, a novel framework for generating coherent multi-shot videos with cinematic, film-style transitions. To facilitate insights into the film editing style, we construct a multi-shot video-text dataset Cine250K with detailed shot annotations. Furthermore, our analysis of existing video diffusion models uncovers a correspondence between attention maps in the diffusion model and shot boundaries, which we leverage to design a mask-based control mechanism that enables transitions at arbitrary positions and transfers effectively in a training-free setting. After fine-tuning on our dataset with the mask mechanism, CineTrans produces cinematic multi-shot sequences while adhering to the film editing style, avoiding unstable transitions or naive concatenations. Finally, we propose specialized evaluation metrics for transition control, temporal consistency and overall quality, and demonstrate through extensive experiments that CineTrans significantly outperforms existing baselines across all criteria.

Title: Automated Building Heritage Assessment Using Street-Level Imagery

Authors: Kristina Dabrock, Tim Johansson, Anna Donarelli, Mikael Mangold, Noah Pflugradt, Jann Michael Weinand, Jochen Linßen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.11486
Pdf URL: https://arxiv.org/pdf/2508.11486
Copy Paste: [[2508.11486]] Automated Building Heritage Assessment Using Street-Level Imagery(https://arxiv.org/abs/2508.11486)
Keywords: large language model
Abstract: Detailed data is required to quantify energy conservation measures in buildings, such as envelop retrofits, without compromising cultural heritage. Novel artificial intelligence tools may improve efficiency in identifying heritage values in buildings compared to costly and time-consuming traditional inventories. In this study, the large language model GPT was used to detect various aspects of cultural heritage value in façade images. Using this data and building register data as features, machine learning models were trained to classify multi-family and non-residential buildings in Stockholm, Sweden. Validation against an expert-created inventory shows a macro F1-score of 0.71 using a combination of register data and features retrieved from GPT, and a score of 0.60 using only GPT-derived data. The presented methodology can contribute to a higher-quality database and thus support careful energy efficiency measures and integrated consideration of heritage value in large-scale energetic refurbishment scenarios.

Title: KV-Auditor: Auditing Local Differential Privacy for Correlated Key-Value Estimation

Authors: Jingnan Xu, Leixia Wang, Xiaofeng Meng
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2508.11495
Pdf URL: https://arxiv.org/pdf/2508.11495
Copy Paste: [[2508.11495]] KV-Auditor: Auditing Local Differential Privacy for Correlated Key-Value Estimation(https://arxiv.org/abs/2508.11495)
Keywords: privacy, protect, federate, segmentation
Abstract: To protect privacy for data-collection-based services, local differential privacy (LDP) is widely adopted due to its rigorous theoretical bound on privacy loss. However, mistakes in complex theoretical analysis or subtle implementation errors may undermine its practical guarantee. To address this, auditing is crucial to confirm that LDP protocols truly protect user data. However, existing auditing methods, though, mainly target machine learning and federated learning tasks based on centralized differentially privacy (DP), with limited attention to LDP. Moreover, the few studies on LDP auditing focus solely on simple frequency estimation task for discrete data, leaving correlated key-value data - which requires both discrete frequency estimation for keys and continuous mean estimation for values - unexplored. To bridge this gap, we propose KV-Auditor, a framework for auditing LDP-based key-value estimation mechanisms by estimating their empirical privacy lower bounds. Rather than traditional LDP auditing methods that relies on binary output predictions, KV-Auditor estimates this lower bound by analyzing unbounded output distributions, supporting continuous data. Specifically, we classify state-of-the-art LDP key-value mechanisms into interactive and non-interactive types. For non-interactive mechanisms, we propose horizontal KV-Auditor for small domains with sufficient samples and vertical KV-Auditor for large domains with limited samples. For interactive mechanisms, we design a segmentation strategy to capture incremental privacy leakage across iterations. Finally, we perform extensive experiments to validate the effectiveness of our approach, offering insights for optimizing LDP-based key-value estimators.

Title: Hierarchical Graph Feature Enhancement with Adaptive Frequency Modulation for Visual Recognition

Authors: Feiyue Zhao, Zhichao Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.11497
Pdf URL: https://arxiv.org/pdf/2508.11497
Copy Paste: [[2508.11497]] Hierarchical Graph Feature Enhancement with Adaptive Frequency Modulation for Visual Recognition(https://arxiv.org/abs/2508.11497)
Keywords: segmentation
Abstract: Convolutional neural networks (CNNs) have demonstrated strong performance in visual recognition tasks, but their inherent reliance on regular grid structures limits their capacity to model complex topological relationships and non-local semantics within images. To address this limita tion, we propose the hierarchical graph feature enhancement (HGFE), a novel framework that integrates graph-based rea soning into CNNs to enhance both structural awareness and feature representation. HGFE builds two complementary levels of graph structures: intra-window graph convolution to cap ture local spatial dependencies and inter-window supernode interactions to model global semantic relationships. Moreover, we introduce an adaptive frequency modulation module that dynamically balances low-frequency and high-frequency signal propagation, preserving critical edge and texture information while mitigating over-smoothing. The proposed HGFE module is lightweight, end-to-end trainable, and can be seamlessly integrated into standard CNN backbone networks. Extensive experiments on CIFAR-100 (classification), PASCAL VOC, and VisDrone (detection), as well as CrackSeg and CarParts (segmentation), validated the effectiveness of the HGFE in improving structural representation and enhancing overall recognition performance.

Title: Handwritten Text Recognition of Historical Manuscripts Using Transformer-Based Models

Authors: Erez Meoded
Subjects: cs.CV, cs.AI, cs.DL, cs.LG
Abstract URL: https://arxiv.org/abs/2508.11499
Pdf URL: https://arxiv.org/pdf/2508.11499
Copy Paste: [[2508.11499]] Handwritten Text Recognition of Historical Manuscripts Using Transformer-Based Models(https://arxiv.org/abs/2508.11499)
Keywords: transformer
Abstract: Historical handwritten text recognition (HTR) is essential for unlocking the cultural and scholarly value of archival documents, yet digitization is often hindered by scarce transcriptions, linguistic variation, and highly diverse handwriting styles. In this study, we apply TrOCR, a state-of-the-art transformer-based HTR model, to 16th-century Latin manuscripts authored by Rudolf Gwalther. We investigate targeted image preprocessing and a broad suite of data augmentation techniques, introducing four novel augmentation methods designed specifically for historical handwriting characteristics. We also evaluate ensemble learning approaches to leverage the complementary strengths of augmentation-trained models. On the Gwalther dataset, our best single-model augmentation (Elastic) achieves a Character Error Rate (CER) of 1.86, while a top-5 voting ensemble achieves a CER of 1.60 - representing a 50% relative improvement over the best reported TrOCR_BASE result and a 42% improvement over the previous state of the art. These results highlight the impact of domain-specific augmentations and ensemble strategies in advancing HTR performance for historical manuscripts.

Title: AIM: Amending Inherent Interpretability via Self-Supervised Masking

Authors: Eyad Alshami, Shashank Agnihotri, Bernt Schiele, Margret Keuper
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.11502
Pdf URL: https://arxiv.org/pdf/2508.11502
Copy Paste: [[2508.11502]] AIM: Amending Inherent Interpretability via Self-Supervised Masking(https://arxiv.org/abs/2508.11502)
Keywords: interpretability
Abstract: It has been observed that deep neural networks (DNNs) often use both genuine as well as spurious features. In this work, we propose "Amending Inherent Interpretability via Self-Supervised Masking" (AIM), a simple yet interestingly effective method that promotes the network's utilization of genuine features over spurious alternatives without requiring additional annotations. In particular, AIM uses features at multiple encoding stages to guide a self-supervised, sample-specific feature-masking process. As a result, AIM enables the training of well-performing and inherently interpretable models that faithfully summarize the decision process. We validate AIM across a diverse range of challenging datasets that test both out-of-distribution generalization and fine-grained visual understanding. These include general-purpose classification benchmarks such as ImageNet100, HardImageNet, and ImageWoof, as well as fine-grained classification datasets such as Waterbirds, TravelingBirds, and CUB-200. AIM demonstrates significant dual benefits: interpretability improvements, as measured by the Energy Pointing Game (EPG) score, and accuracy gains over strong baselines. These consistent gains across domains and architectures provide compelling evidence that AIM promotes the use of genuine and meaningful features that directly contribute to improved generalization and human-aligned interpretability.

Title: Predicting and Explaining Traffic Crash Severity Through Crash Feature Selection

Authors: Andrea Castellani, Zacharias Papadovasilakis, Giorgos Papoutsoglou, Mary Cole, Brian Bautsch, Tobias Rodemann, Ioannis Tsamardinos, Angela Harden
Subjects: cs.LG, cs.CY
Abstract URL: https://arxiv.org/abs/2508.11504
Pdf URL: https://arxiv.org/pdf/2508.11504
Copy Paste: [[2508.11504]] Predicting and Explaining Traffic Crash Severity Through Crash Feature Selection(https://arxiv.org/abs/2508.11504)
Keywords: interpretability
Abstract: Motor vehicle crashes remain a leading cause of injury and death worldwide, necessitating data-driven approaches to understand and mitigate crash severity. This study introduces a curated dataset of more than 3 million people involved in accidents in Ohio over six years (2017-2022), aggregated to more than 2.3 million vehicle-level records for predictive analysis. The primary contribution is a transparent and reproducible methodology that combines Automated Machine Learning (AutoML) and explainable artificial intelligence (AI) to identify and interpret key risk factors associated with severe crashes. Using the JADBio AutoML platform, predictive models were constructed to distinguish between severe and non-severe crash outcomes. The models underwent rigorous feature selection across stratified training subsets, and their outputs were interpreted using SHapley Additive exPlanations (SHAP) to quantify the contribution of individual features. A final Ridge Logistic Regression model achieved an AUC-ROC of 85.6% on the training set and 84.9% on a hold-out test set, with 17 features consistently identified as the most influential predictors. Key features spanned demographic, environmental, vehicle, human, and operational categories, including location type, posted speed, minimum occupant age, and pre-crash action. Notably, certain traditionally emphasized factors, such as alcohol or drug impairment, were less influential in the final model compared to environmental and contextual variables. Emphasizing methodological rigor and interpretability over mere predictive performance, this study offers a scalable framework to support Vision Zero with aligned interventions and advanced data-informed traffic safety policy.

Title: Towards Faithful Class-level Self-explainability in Graph Neural Networks by Subgraph Dependencies

Authors: Fanzhen Liu, Xiaoxiao Ma, Jian Yang, Alsharif Abuadbba, Kristen Moore, Surya Nepal, Cecile Paris, Quan Z. Sheng, Jia Wu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.11513
Pdf URL: https://arxiv.org/pdf/2508.11513
Copy Paste: [[2508.11513]] Towards Faithful Class-level Self-explainability in Graph Neural Networks by Subgraph Dependencies(https://arxiv.org/abs/2508.11513)
Keywords: extraction, fair, interpretability, explainability
Abstract: Enhancing the interpretability of graph neural networks (GNNs) is crucial to ensure their safe and fair deployment. Recent work has introduced self-explainable GNNs that generate explanations as part of training, improving both faithfulness and efficiency. Some of these models, such as ProtGNN and PGIB, learn class-specific prototypes, offering a potential pathway toward class-level explanations. However, their evaluations focus solely on instance-level explanations, leaving open the question of whether these prototypes meaningfully generalize across instances of the same class. In this paper, we introduce GraphOracle, a novel self-explainable GNN framework designed to generate and evaluate class-level explanations for GNNs. Our model jointly learns a GNN classifier and a set of structured, sparse subgraphs that are discriminative for each class. We propose a novel integrated training that captures graph$\unicode{x2013}$subgraph$\unicode{x2013}$prediction dependencies efficiently and faithfully, validated through a masking-based evaluation strategy. This strategy enables us to retroactively assess whether prior methods like ProtGNN and PGIB deliver effective class-level explanations. Our results show that they do not. In contrast, GraphOracle achieves superior fidelity, explainability, and scalability across a range of graph classification tasks. We further demonstrate that GraphOracle avoids the computational bottlenecks of previous methods$\unicode{x2014}$like Monte Carlo Tree Search$\unicode{x2014}$by using entropy-regularized subgraph selection and lightweight random walk extraction, enabling faster and more scalable training. These findings position GraphOracle as a practical and principled solution for faithful class-level self-explainability in GNNs.

Title: A Real-time Concrete Crack Detection and Segmentation Model Based on YOLOv11

Authors: Shaoze Huang, Qi Liu, Chao Chen, Yuhang Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.11517
Pdf URL: https://arxiv.org/pdf/2508.11517
Copy Paste: [[2508.11517]] A Real-time Concrete Crack Detection and Segmentation Model Based on YOLOv11(https://arxiv.org/abs/2508.11517)
Keywords: robust, segmentation
Abstract: Accelerated aging of transportation infrastructure in the rapidly developing Yangtze River Delta region necessitates efficient concrete crack detection, as crack deterioration critically compromises structural integrity and regional economic growth. To overcome the limitations of inefficient manual inspection and the suboptimal performance of existing deep learning models, particularly for small-target crack detection within complex backgrounds, this paper proposes YOLOv11-KW-TA-FP, a multi-task concrete crack detection and segmentation model based on the YOLOv11n architecture. The proposed model integrates a three-stage optimization framework: (1) Embedding dynamic KernelWarehouse convolution (KWConv) within the backbone network to enhance feature representation through a dynamic kernel sharing mechanism; (2) Incorporating a triple attention mechanism (TA) into the feature pyramid to strengthen channel-spatial interaction modeling; and (3) Designing an FP-IoU loss function to facilitate adaptive bounding box regression penalization. Experimental validation demonstrates that the enhanced model achieves significant performance improvements over the baseline, attaining 91.3% precision, 76.6% recall, and 86.4% mAP@50. Ablation studies confirm the synergistic efficacy of the proposed modules. Furthermore, robustness tests indicate stable performance under conditions of data scarcity and noise interference. This research delivers an efficient computer vision solution for automated infrastructure inspection, exhibiting substantial practical engineering value.

Title: Physics-Informed Diffusion Models for Unsupervised Anomaly Detection in Multivariate Time Series

Authors: Juhi Soni, Markus Lange-Hegermann, Stefan Windmann
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.11528
Pdf URL: https://arxiv.org/pdf/2508.11528
Copy Paste: [[2508.11528]] Physics-Informed Diffusion Models for Unsupervised Anomaly Detection in Multivariate Time Series(https://arxiv.org/abs/2508.11528)
Keywords: diffusion
Abstract: We propose an unsupervised anomaly detection approach based on a physics-informed diffusion model for multivariate time series data. Over the past years, diffusion model has demonstrated its effectiveness in forecasting, imputation, generation, and anomaly detection in the time series domain. In this paper, we present a new approach for learning the physics-dependent temporal distribution of multivariate time series data using a weighted physics-informed loss during diffusion model training. A weighted physics-informed loss is constructed using a static weight schedule. This approach enables a diffusion model to accurately approximate underlying data distribution, which can influence the unsupervised anomaly detection performance. Our experiments on synthetic and real-world datasets show that physics-informed training improves the F1 score in anomaly detection; it generates better data diversity and log-likelihood. Our model outperforms baseline approaches, additionally, it surpasses prior physics-informed work and purely data-driven diffusion models on a synthetic dataset and one real-world dataset while remaining competitive on others.

Title: DFed-SST: Building Semantic- and Structure-aware Topologies for Decentralized Federated Graph Learning

Authors: Lianshuai Guo, Zhongzheng Yuan, Xunkai Li, Yinlin Zhu, Meixia Qu, Wenyu Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.11530
Pdf URL: https://arxiv.org/pdf/2508.11530
Copy Paste: [[2508.11530]] DFed-SST: Building Semantic- and Structure-aware Topologies for Decentralized Federated Graph Learning(https://arxiv.org/abs/2508.11530)
Keywords: robust, federate
Abstract: Decentralized Federated Learning (DFL) has emerged as a robust distributed paradigm that circumvents the single-point-of-failure and communication bottleneck risks of centralized architectures. However, a significant challenge arises as existing DFL optimization strategies, primarily designed for tasks such as computer vision, fail to address the unique topological information inherent in the local subgraph. Notably, while Federated Graph Learning (FGL) is tailored for graph data, it is predominantly implemented in a centralized server-client model, failing to leverage the benefits of this http URL bridge this gap, we propose DFed-SST, a decentralized federated graph learning framework with adaptive communication. The core of our method is a dual-topology adaptive communication mechanism that leverages the unique topological features of each client's local subgraph to dynamically construct and optimize the inter-client communication topology. This allows our framework to guide model aggregation efficiently in the face of heterogeneity. Extensive experiments on eight real-world datasets consistently demonstrate the superiority of DFed-SST, achieving 3.26% improvement in average accuracy over baseline methods.

Title: Multi-State Tracker: Enhancing Efficient Object Tracking via Multi-State Specialization and Interaction

Authors: Shilei Wang, Gong Cheng, Pujian Lai, Dong Gao, Junwei Han
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.11531
Pdf URL: https://arxiv.org/pdf/2508.11531
Copy Paste: [[2508.11531]] Multi-State Tracker: Enhancing Efficient Object Tracking via Multi-State Specialization and Interaction(https://arxiv.org/abs/2508.11531)
Keywords: robust, extraction
Abstract: Efficient trackers achieve faster runtime by reducing computational complexity and model parameters. However, this efficiency often compromises the expense of weakened feature representation capacity, thus limiting their ability to accurately capture target states using single-layer features. To overcome this limitation, we propose Multi-State Tracker (MST), which utilizes highly lightweight state-specific enhancement (SSE) to perform specialized enhancement on multi-state features produced by multi-state generation (MSG) and aggregates them in an interactive and adaptive manner using cross-state interaction (CSI). This design greatly enhances feature representation while incurring minimal computational overhead, leading to improved tracking robustness in complex environments. Specifically, the MSG generates multiple state representations at multiple stages during feature extraction, while SSE refines them to highlight target-specific features. The CSI module facilitates information exchange between these states and ensures the integration of complementary features. Notably, the introduced SSE and CSI modules adopt a highly lightweight hidden state adaptation-based state space duality (HSA-SSD) design, incurring only 0.1 GFLOPs in computation and 0.66 M in parameters. Experimental results demonstrate that MST outperforms all previous efficient trackers across multiple datasets, significantly improving tracking accuracy and robustness. In particular, it shows excellent runtime performance, with an AO score improvement of 4.5\% over the previous SOTA efficient tracker HCAT on the GOT-10K dataset. The code is available at this https URL.

Title: An Efficient Medical Image Classification Method Based on a Lightweight Improved ConvNeXt-Tiny Architecture

Authors: Jingsong Xia, Yue Yin, Xiuhan Li
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2508.11532
Pdf URL: https://arxiv.org/pdf/2508.11532
Copy Paste: [[2508.11532]] An Efficient Medical Image Classification Method Based on a Lightweight Improved ConvNeXt-Tiny Architecture(https://arxiv.org/abs/2508.11532)
Keywords: extraction
Abstract: Intelligent analysis of medical imaging plays a crucial role in assisting clinical diagnosis. However, achieving efficient and high-accuracy image classification in resource-constrained computational environments remains challenging. This study proposes a medical image classification method based on an improved ConvNeXt-Tiny architecture. Through structural optimization and loss function design, the proposed method enhances feature extraction capability and classification performance while reducing computational complexity. Specifically, the method introduces a dual global pooling (Global Average Pooling and Global Max Pooling) feature fusion strategy into the ConvNeXt-Tiny backbone to simultaneously preserve global statistical features and salient response information. A lightweight channel attention module, termed Squeeze-and-Excitation Vector (SEVector), is designed to improve the adaptive allocation of channel weights while minimizing parameter overhead. Additionally, a Feature Smoothing Loss is incorporated into the loss function to enhance intra-class feature consistency and suppress intra-class variance. Under CPU-only conditions (8 threads), the method achieves a maximum classification accuracy of 89.10% on the test set within 10 training epochs, exhibiting a stable convergence trend in loss values. Experimental results demonstrate that the proposed method effectively improves medical image classification performance in resource-limited settings, providing a feasible and efficient solution for the deployment and promotion of medical imaging analysis models.

Title: Speciesism in AI: Evaluating Discrimination Against Animals in Large Language Models

Authors: Monika Jotautaitė, Lucius Caviola, David A. Brewster, Thilo Hagendorff
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2508.11534
Pdf URL: https://arxiv.org/pdf/2508.11534
Copy Paste: [[2508.11534]] Speciesism in AI: Evaluating Discrimination Against Animals in Large Language Models(https://arxiv.org/abs/2508.11534)
Keywords: fair, large language model
Abstract: As large language models (LLMs) become more widely deployed, it is crucial to examine their ethical tendencies. Building on research on fairness and discrimination in AI, we investigate whether LLMs exhibit speciesist bias -- discrimination based on species membership -- and how they value non-human animals. We systematically examine this issue across three paradigms: (1) SpeciesismBench, a 1,003-item benchmark assessing recognition and moral evaluation of speciesist statements; (2) established psychological measures comparing model responses with those of human participants; (3) text-generation tasks probing elaboration on, or resistance to, speciesist rationalizations. In our benchmark, LLMs reliably detected speciesist statements but rarely condemned them, often treating speciesist attitudes as morally acceptable. On psychological measures, results were mixed: LLMs expressed slightly lower explicit speciesism than people, yet in direct trade-offs they more often chose to save one human over multiple animals. A tentative interpretation is that LLMs may weight cognitive capacity rather than species per se: when capacities were equal, they showed no species preference, and when an animal was described as more capable, they tended to prioritize it over a less capable human. In open-ended text generation tasks, LLMs frequently normalized or rationalized harm toward farmed animals while refusing to do so for non-farmed animals. These findings suggest that while LLMs reflect a mixture of progressive and mainstream human views, they nonetheless reproduce entrenched cultural norms around animal exploitation. We argue that expanding AI fairness and alignment frameworks to explicitly include non-human moral patients is essential for reducing these biases and preventing the entrenchment of speciesist attitudes in AI systems and the societies they influence.

Title: Reinforcing Video Reasoning Segmentation to Think Before It Segments

Authors: Sitong Gong, Lu Zhang, Yunzhi Zhuge, Xu Jia, Pingping Zhang, Huchuan Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.11538
Pdf URL: https://arxiv.org/pdf/2508.11538
Copy Paste: [[2508.11538]] Reinforcing Video Reasoning Segmentation to Think Before It Segments(https://arxiv.org/abs/2508.11538)
Keywords: robust, interpretability, segmentation
Abstract: Video reasoning segmentation (VRS) endeavors to delineate referred objects in videos guided by implicit instructions that encapsulate human intent and temporal logic. Previous approaches leverage large vision language models (LVLMs) to encode object semantics into tokens for mask prediction. However, this paradigm suffers from limited interpretability during inference and suboptimal performance due to inadequate spatiotemporal reasoning. Drawing inspiration from seminal breakthroughs in reinforcement learning, we introduce Veason-R1, a specialized LVLM for VRS that emphasizes structured reasoning in segmentation. Veason-R1 is trained through Group Relative Policy Optimization (GRPO) augmented with Chain-of-Thought (CoT) initialization. To begin with, we curate high-quality CoT training data to instill structured reasoning trajectories, bridging video-level semantics and frame-level spatial grounding, yielding the supervised fine-tuned model Veason-SFT. Subsequently, GRPO fine-tuning encourages efficient exploration of the reasoning space by optimizing reasoning chains. To this end, we incorporate a holistic reward mechanism that synergistically enhances spatial alignment and temporal consistency, bolstering keyframe localization and fine-grained grounding. Comprehensive empirical evaluations demonstrate that Veason-R1 achieves state-of-the-art performance on multiple benchmarks, surpassing prior art by significant margins (e.g., +1.3 J &F in ReVOS and +10.0 J &F in ReasonVOS), while exhibiting robustness to hallucinations (+8.8 R). Our code and model weights will be available at Veason-R1.

Title: Copyright Protection for Large Language Models: A Survey of Methods, Challenges, and Trends

Authors: Zhenhua Xu, Xubin Yue, Zhebo Wang, Qichen Liu, Xixiang Zhao, Jingxuan Zhang, Wenjun Zeng, Wengpeng Xing, Dezhang Kong, Changting Lin, Meng Han
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2508.11548
Pdf URL: https://arxiv.org/pdf/2508.11548
Copy Paste: [[2508.11548]] Copyright Protection for Large Language Models: A Survey of Methods, Challenges, and Trends(https://arxiv.org/abs/2508.11548)
Keywords: protect, robust, steal, watermark, large language model
Abstract: Copyright protection for large language models is of critical importance, given their substantial development costs, proprietary value, and potential for misuse. Existing surveys have predominantly focused on techniques for tracing LLM-generated content-namely, text watermarking-while a systematic exploration of methods for protecting the models themselves (i.e., model watermarking and model fingerprinting) remains absent. Moreover, the relationships and distinctions among text watermarking, model watermarking, and model fingerprinting have not been comprehensively clarified. This work presents a comprehensive survey of the current state of LLM copyright protection technologies, with a focus on model fingerprinting, covering the following aspects: (1) clarifying the conceptual connection from text watermarking to model watermarking and fingerprinting, and adopting a unified terminology that incorporates model watermarking into the broader fingerprinting framework; (2) providing an overview and comparison of diverse text watermarking techniques, highlighting cases where such methods can function as model fingerprinting; (3) systematically categorizing and comparing existing model fingerprinting approaches for LLM copyright protection; (4) presenting, for the first time, techniques for fingerprint transfer and fingerprint removal; (5) summarizing evaluation metrics for model fingerprints, including effectiveness, harmlessness, robustness, stealthiness, and reliability; and (6) discussing open challenges and future research directions. This survey aims to offer researchers a thorough understanding of both text watermarking and model fingerprinting technologies in the era of LLMs, thereby fostering further advances in protecting their intellectual property.

Title: Training-Free Anomaly Generation via Dual-Attention Enhancement in Diffusion Model

Authors: Zuo Zuo, Jiahao Dong, Yanyun Qu, Zongze Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.11550
Pdf URL: https://arxiv.org/pdf/2508.11550
Copy Paste: [[2508.11550]] Training-Free Anomaly Generation via Dual-Attention Enhancement in Diffusion Model(https://arxiv.org/abs/2508.11550)
Keywords: diffusion
Abstract: Industrial anomaly detection (AD) plays a significant role in manufacturing where a long-standing challenge is data scarcity. A growing body of works have emerged to address insufficient anomaly data via anomaly generation. However, these anomaly generation methods suffer from lack of fidelity or need to be trained with extra data. To this end, we propose a training-free anomaly generation framework dubbed AAG, which is based on Stable Diffusion (SD)'s strong generation ability for effective anomaly image generation. Given a normal image, mask and a simple text prompt, AAG can generate realistic and natural anomalies in the specific regions and simultaneously keep contents in other regions unchanged. In particular, we propose Cross-Attention Enhancement (CAE) to re-engineer the cross-attention mechanism within Stable Diffusion based on the given mask. CAE increases the similarity between visual tokens in specific regions and text embeddings, which guides these generated visual tokens in accordance with the text description. Besides, generated anomalies need to be more natural and plausible with object in given image. We propose Self-Attention Enhancement (SAE) which improves similarity between each normal visual token and anomaly visual tokens. SAE ensures that generated anomalies are coherent with original pattern. Extensive experiments on MVTec AD and VisA datasets demonstrate effectiveness of AAG in anomaly generation and its utility. Furthermore, anomaly images generated by AAG can bolster performance of various downstream anomaly inspection tasks.

Title: Pushing the Limits of Frequency Analysis in Leakage Abuse Attacks

Authors: Nathaniel Moyer, Charalampos Papamanthou, Evgenios Kornaropoulos
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2508.11563
Pdf URL: https://arxiv.org/pdf/2508.11563
Copy Paste: [[2508.11563]] Pushing the Limits of Frequency Analysis in Leakage Abuse Attacks(https://arxiv.org/abs/2508.11563)
Keywords: attack
Abstract: Searchable encryption (SE) is the most scalable cryptographic primitive for searching on encrypted data. Typical SE constructions often allow access-pattern leakage, revealing which encrypted records are retrieved in the server's responses. All the known generic cryptanalyses assume either that the queries are issued uniformly at random or that the attacker observes the search-pattern leakage. It remains unclear what can be reconstructed when using only the access-pattern leakage and knowledge of the query distribution. In this work, we focus on the cryptanalytic technique of frequency analysis in the context of leakage-abuse attacks on schemes that support encrypted range queries. Frequency analysis matches the frequency of retrieval of an encrypted record with a plaintext value based on its probability of retrieval that follows from the knowledge of the query distribution. We generalize this underexplored cryptanalytic technique and introduce a generic attack framework called Leakage-Abuse via Matching (LAMA) that works even on high-dimensional encrypted data. We identify a parameterization of LAMA that brings frequency analysis to its limit -- that is, we prove that there is no additional frequency matching that an attacker can perform to refine the result. Furthermore, we show that our results hold for any class of convex queries, and not just axis-aligned rectangles, which is the assumption in all other attacks on range schemes. Using these results, we identify query distributions that make frequency analysis challenging for the attacker and, thus, can act as a mitigation mechanism. Finally, we implement and benchmark LAMA and reconstruct, for the first time, plaintext data from encrypted range queries spanning up to four dimensions.

Title: AgentMental: An Interactive Multi-Agent Framework for Explainable and Adaptive Mental Health Assessment

Authors: Jinpeng Hu, Ao Wang, Qianqian Xie, Hui Ma, Zhuo Li, Dan Guo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.11567
Pdf URL: https://arxiv.org/pdf/2508.11567
Copy Paste: [[2508.11567]] AgentMental: An Interactive Multi-Agent Framework for Explainable and Adaptive Mental Health Assessment(https://arxiv.org/abs/2508.11567)
Keywords: extraction
Abstract: Mental health assessment is crucial for early intervention and effective treatment, yet traditional clinician-based approaches are limited by the shortage of qualified professionals. Recent advances in artificial intelligence have sparked growing interest in automated psychological assessment, yet most existing approaches are constrained by their reliance on static text analysis, limiting their ability to capture deeper and more informative insights that emerge through dynamic interaction and iterative questioning. Therefore, in this paper, we propose a multi-agent framework for mental health evaluation that simulates clinical doctor-patient dialogues, with specialized agents assigned to questioning, adequacy evaluation, scoring, and updating. We introduce an adaptive questioning mechanism in which an evaluation agent assesses the adequacy of user responses to determine the necessity of generating targeted follow-up queries to address ambiguity and missing information. Additionally, we employ a tree-structured memory in which the root node encodes the user's basic information, while child nodes (e.g., topic and statement) organize key information according to distinct symptom categories and interaction turns. This memory is dynamically updated throughout the interaction to reduce redundant questioning and further enhance the information extraction and contextual tracking capabilities. Experimental results on the DAIC-WOZ dataset illustrate the effectiveness of our proposed method, which achieves better performance than existing approaches.

Title: TrajSV: A Trajectory-based Model for Sports Video Representations and Applications

Authors: Zheng Wang, Shihao Xu, Wei Shi
Subjects: cs.CV, cs.IR
Abstract URL: https://arxiv.org/abs/2508.11569
Pdf URL: https://arxiv.org/pdf/2508.11569
Copy Paste: [[2508.11569]] TrajSV: A Trajectory-based Model for Sports Video Representations and Applications(https://arxiv.org/abs/2508.11569)
Keywords: transformer
Abstract: Sports analytics has received significant attention from both academia and industry in recent years. Despite the growing interest and efforts in this field, several issues remain unresolved, including (1) data unavailability, (2) lack of an effective trajectory-based framework, and (3) requirement for sufficient supervision labels. In this paper, we present TrajSV, a trajectory-based framework that addresses various issues in existing studies. TrajSV comprises three components: data preprocessing, Clip Representation Network (CRNet), and Video Representation Network (VRNet). The data preprocessing module extracts player and ball trajectories from sports broadcast videos. CRNet utilizes a trajectory-enhanced Transformer module to learn clip representations based on these trajectories. Additionally, VRNet learns video representations by aggregating clip representations and visual features with an encoder-decoder architecture. Finally, a triple contrastive loss is introduced to optimize both video and clip representations in an unsupervised manner. The experiments are conducted on three broadcast video datasets to verify the effectiveness of TrajSV for three types of sports (i.e., soccer, basketball, and volleyball) with three downstream applications (i.e., sports video retrieval, action spotting, and video captioning). The results demonstrate that TrajSV achieves state-of-the-art performance in sports video retrieval, showcasing a nearly 70% improvement. It outperforms baselines in action spotting, achieving state-of-the-art results in 9 out of 17 action categories, and demonstrates a nearly 20% improvement in video captioning. Additionally, we introduce a deployed system along with the three applications based on TrajSV.

Title: Activate Me!: Designing Efficient Activation Functions for Privacy-Preserving Machine Learning with Fully Homomorphic Encryption

Authors: Nges Brian Njungle, Michel A. Kinsy
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2508.11575
Pdf URL: https://arxiv.org/pdf/2508.11575
Copy Paste: [[2508.11575]] Activate Me!: Designing Efficient Activation Functions for Privacy-Preserving Machine Learning with Fully Homomorphic Encryption(https://arxiv.org/abs/2508.11575)
Keywords: secure, security, privacy, protect, defense, robust
Abstract: The growing adoption of machine learning in sensitive areas such as healthcare and defense introduces significant privacy and security challenges. These domains demand robust data protection, as models depend on large volumes of sensitive information for both training and inference. Fully Homomorphic Encryption (FHE) presents a compelling solution by enabling computations directly on encrypted data, maintaining confidentiality across the entire machine learning workflow. However, FHE inherently supports only linear operations, making it difficult to implement non-linear activation functions, essential components of modern neural networks. This work focuses on designing, implementing, and evaluating activation functions tailored for FHE-based machine learning. We investigate two commonly used functions: the Square function and Rectified Linear Unit (ReLU), using LeNet-5 and ResNet-20 architectures with the CKKS scheme from the OpenFHE library. For ReLU, we assess two methods: a conventional low-degree polynomial approximation and a novel scheme-switching technique that securely evaluates ReLU under FHE constraints. Our findings show that the Square function performs well in shallow networks like LeNet-5, achieving 99.4% accuracy with 128 seconds per image. In contrast, deeper models like ResNet-20 benefit more from ReLU. The polynomial approximation yields 83.8% accuracy with 1,145 seconds per image, while our scheme-switching method improves accuracy to 89.8%, albeit with a longer inference time of 1,697 seconds. These results underscore a critical trade-off in FHE-based ML: faster activation functions often reduce accuracy, whereas those preserving accuracy demand greater computational resources.

Title: Aware First, Think Less: Dynamic Boundary Self-Awareness Drives Extreme Reasoning Efficiency in Large Language Models

Authors: Qiguang Chen, Dengyun Peng, Jinhao Liu, HuiKang Su, Jiannan Guan, Libo Qin, Wanxiang Che
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.11582
Pdf URL: https://arxiv.org/pdf/2508.11582
Copy Paste: [[2508.11582]] Aware First, Think Less: Dynamic Boundary Self-Awareness Drives Extreme Reasoning Efficiency in Large Language Models(https://arxiv.org/abs/2508.11582)
Keywords: large language model
Abstract: Recent advancements in large language models (LLMs) have greatly improved their capabilities on complex reasoning tasks through Long Chain-of-Thought (CoT). However, this approach often results in substantial redundancy, impairing computational efficiency and causing significant delays in real-time applications. To improve the efficiency, current methods often rely on human-defined difficulty priors, which do not align with the LLM's self-awared difficulty, leading to inefficiencies. In this paper, we introduce the Dynamic Reasoning-Boundary Self-Awareness Framework (DR. SAF), which enables models to dynamically assess and adjust their reasoning depth in response to problem complexity. DR. SAF integrates three key components: Boundary Self-Awareness Alignment, Adaptive Reward Management, and a Boundary Preservation Mechanism. These components allow models to optimize their reasoning processes, balancing efficiency and accuracy without compromising performance. Our experimental results demonstrate that DR. SAF achieves a 49.27% reduction in total response tokens with minimal loss in accuracy. The framework also delivers a 6.59x gain in token efficiency and a 5x reduction in training time, making it well-suited to resource-limited settings. During extreme training, DR. SAF can even surpass traditional instruction-based models in token efficiency with more than 16% accuracy improvement.

Title: CryptoScope: Utilizing Large Language Models for Automated Cryptographic Logic Vulnerability Detection

Authors: Zhihao Li, Zimo Ji, Tao Zheng, Hao Ren, Xiao Lan
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2508.11599
Pdf URL: https://arxiv.org/pdf/2508.11599
Copy Paste: [[2508.11599]] CryptoScope: Utilizing Large Language Models for Automated Cryptographic Logic Vulnerability Detection(https://arxiv.org/abs/2508.11599)
Keywords: security, large language model
Abstract: Cryptographic algorithms are fundamental to modern security, yet their implementations frequently harbor subtle logic flaws that are hard to detect. We introduce CryptoScope, a novel framework for automated cryptographic vulnerability detection powered by Large Language Models (LLMs). CryptoScope combines Chain-of-Thought (CoT) prompting with Retrieval-Augmented Generation (RAG), guided by a curated cryptographic knowledge base containing over 12,000 entries. We evaluate CryptoScope on LLM-CLVA, a benchmark of 92 cases primarily derived from real-world CVE vulnerabilities, complemented by cryptographic challenges from major Capture The Flag (CTF) competitions and synthetic examples across 11 programming languages. CryptoScope consistently improves performance over strong LLM baselines, boosting DeepSeek-V3 by 11.62%, GPT-4o-mini by 20.28%, and GLM-4-Flash by 28.69%. Additionally, it identifies 9 previously undisclosed flaws in widely used open-source cryptographic projects.

Title: CoreEditor: Consistent 3D Editing via Correspondence-constrained Diffusion

Authors: Zhe Zhu, Honghua Chen, Peng Li, Mingqiang Wei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.11603
Pdf URL: https://arxiv.org/pdf/2508.11603
Copy Paste: [[2508.11603]] CoreEditor: Consistent 3D Editing via Correspondence-constrained Diffusion(https://arxiv.org/abs/2508.11603)
Keywords: robust, diffusion
Abstract: Text-driven 3D editing seeks to modify 3D scenes according to textual descriptions, and most existing approaches tackle this by adapting pre-trained 2D image editors to multi-view inputs. However, without explicit control over multi-view information exchange, they often fail to maintain cross-view consistency, leading to insufficient edits and blurry details. We introduce CoreEditor, a novel framework for consistent text-to-3D editing. The key innovation is a correspondence-constrained attention mechanism that enforces precise interactions between pixels expected to remain consistent throughout the diffusion denoising process. Beyond relying solely on geometric alignment, we further incorporate semantic similarity estimated during denoising, enabling more reliable correspondence modeling and robust multi-view editing. In addition, we design a selective editing pipeline that allows users to choose preferred results from multiple candidates, offering greater flexibility and user control. Extensive experiments show that CoreEditor produces high-quality, 3D-consistent edits with sharper details, significantly outperforming prior methods.

Title: Dataset Creation for Visual Entailment using Generative AI

Authors: Rob Reijtenbach, Suzan Verberne, Gijs Wijnholds
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.11605
Pdf URL: https://arxiv.org/pdf/2508.11605
Copy Paste: [[2508.11605]] Dataset Creation for Visual Entailment using Generative AI(https://arxiv.org/abs/2508.11605)
Keywords: diffusion, generative
Abstract: In this paper we present and validate a new synthetic dataset for training visual entailment models. Existing datasets for visual entailment are small and sparse compared to datasets for textual entailment. Manually creating datasets is labor-intensive. We base our synthetic dataset on the SNLI dataset for textual entailment. We take the premise text from SNLI as input prompts in a generative image model, Stable Diffusion, creating an image to replace each textual premise. We evaluate our dataset both intrinsically and extrinsically. For extrinsic evaluation, we evaluate the validity of the generated images by using them as training data for a visual entailment classifier based on CLIP feature vectors. We find that synthetic training data only leads to a slight drop in quality on SNLI-VE, with an F-score 0.686 compared to 0.703 when trained on real data. We also compare the quality of our generated training data to original training data on another dataset: SICK-VTE. Again, there is only a slight drop in F-score: from 0.400 to 0.384. These results indicate that in settings with data sparsity, synthetic data can be a promising solution for training visual entailment models.

Title: TinyTim: A Family of Language Models for Divergent Generation

Authors: Christopher J. Agostino
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.11607
Pdf URL: https://arxiv.org/pdf/2508.11607
Copy Paste: [[2508.11607]] TinyTim: A Family of Language Models for Divergent Generation(https://arxiv.org/abs/2508.11607)
Keywords: generative, large language model
Abstract: This work introduces TinyTim, a family of large language models fine-tuned on James Joyce's `Finnegans Wake'. Through quantitative evaluation against baseline models, we demonstrate that TinyTim V1 produces a statistically distinct generative profile characterized by high lexical diversity and low semantic coherence. These findings are interpreted through theories of creativity and complex problem-solving, arguing that such specialized models can function as divergent knowledge sources within more extensive creative architectures, powering automated discovery mechanisms in diverse settings.

Title: Controlling Multimodal LLMs via Reward-guided Decoding

Authors: Oscar Mañas, Pierluca D'Oro, Koustuv Sinha, Adriana Romero-Soriano, Michal Drozdzal, Aishwarya Agrawal
Subjects: cs.CV, cs.AI, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2508.11616
Pdf URL: https://arxiv.org/pdf/2508.11616
Copy Paste: [[2508.11616]] Controlling Multimodal LLMs via Reward-guided Decoding(https://arxiv.org/abs/2508.11616)
Keywords: large language model
Abstract: As Multimodal Large Language Models (MLLMs) gain widespread applicability, it is becoming increasingly desirable to adapt them for diverse user needs. In this paper, we study the adaptation of MLLMs through controlled decoding. To achieve this, we introduce the first method for reward-guided decoding of MLLMs and demonstrate its application in improving their visual grounding. Our method involves building reward models for visual grounding and using them to guide the MLLM's decoding process. Concretely, we build two separate reward models to independently control the degree of object precision and recall in the model's output. Our approach enables on-the-fly controllability of an MLLM's inference process in two ways: first, by giving control over the relative importance of each reward function during decoding, allowing a user to dynamically trade off object precision for recall in image captioning tasks; second, by giving control over the breadth of the search during decoding, allowing the user to control the trade-off between the amount of test-time compute and the degree of visual grounding. We evaluate our method on standard object hallucination benchmarks, showing that it provides significant controllability over MLLM inference, while consistently outperforming existing hallucination mitigation methods.

Title: Optimal CO2 storage management considering safety constraints in multi-stakeholder multi-site CCS projects: a game theoretic perspective

Authors: Jungang Chen, Seyyed A. Hosseini
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.11618
Pdf URL: https://arxiv.org/pdf/2508.11618
Copy Paste: [[2508.11618]] Optimal CO2 storage management considering safety constraints in multi-stakeholder multi-site CCS projects: a game theoretic perspective(https://arxiv.org/abs/2508.11618)
Keywords: extraction
Abstract: Carbon capture and storage (CCS) projects typically involve a diverse array of stakeholders or players from public, private, and regulatory sectors, each with different objectives and responsibilities. Given the complexity, scale, and long-term nature of CCS operations, determining whether individual stakeholders can independently maximize their interests or whether collaborative coalition agreements are needed remains a central question for effective CCS project planning and management. CCS projects are often implemented in geologically connected sites, where shared geological features such as pressure space and reservoir pore capacity can lead to competitive behavior among stakeholders. Furthermore, CO2 storage sites are often located in geologically mature basins that previously served as sites for hydrocarbon extraction or wastewater disposal in order to leverage existing infrastructures, which makes unilateral optimization even more complicated and unrealistic. In this work, we propose a paradigm based on Markov games to quantitatively investigate how different coalition structures affect the goals of stakeholders. We frame this multi-stakeholder multi-site problem as a multi-agent reinforcement learning problem with safety constraints. Our approach enables agents to learn optimal strategies while compliant with safety regulations. We present an example where multiple operators are injecting CO2 into their respective project areas in a geologically connected basin. To address the high computational cost of repeated simulations of high-fidelity models, a previously developed surrogate model based on the Embed-to-Control (E2C) framework is employed. Our results demonstrate the effectiveness of the proposed framework in addressing optimal management of CO2 storage when multiple stakeholders with various objectives and goals are involved.

Title: LoRAtorio: An intrinsic approach to LoRA Skill Composition

Authors: Niki Foteinopoulou, Ignas Budvytis, Stephan Liwicki
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.11624
Pdf URL: https://arxiv.org/pdf/2508.11624
Copy Paste: [[2508.11624]] LoRAtorio: An intrinsic approach to LoRA Skill Composition(https://arxiv.org/abs/2508.11624)
Keywords: diffusion
Abstract: Low-Rank Adaptation (LoRA) has become a widely adopted technique in text-to-image diffusion models, enabling the personalisation of visual concepts such as characters, styles, and objects. However, existing approaches struggle to effectively compose multiple LoRA adapters, particularly in open-ended settings where the number and nature of required skills are not known in advance. In this work, we present LoRAtorio, a novel train-free framework for multi-LoRA composition that leverages intrinsic model behaviour. Our method is motivated by two key observations: (1) LoRA adapters trained on narrow domains produce denoised outputs that diverge from the base model, and (2) when operating out-of-distribution, LoRA outputs show behaviour closer to the base model than when conditioned in distribution. The balance between these two observations allows for exceptional performance in the single LoRA scenario, which nevertheless deteriorates when multiple LoRAs are loaded. Our method operates in the latent space by dividing it into spatial patches and computing cosine similarity between each patch's predicted noise and that of the base model. These similarities are used to construct a spatially-aware weight matrix, which guides a weighted aggregation of LoRA outputs. To address domain drift, we further propose a modification to classifier-free guidance that incorporates the base model's unconditional score into the composition. We extend this formulation to a dynamic module selection setting, enabling inference-time selection of relevant LoRA adapters from a large pool. LoRAtorio achieves state-of-the-art performance, showing up to a 1.3% improvement in ClipScore and a 72.43% win rate in GPT-4V pairwise evaluations, and generalises effectively to multiple latent diffusion models.

Title: Is ChatGPT-5 Ready for Mammogram VQA?

Authors: Qiang Li, Shansong Wang, Mingzhe Hu, Mojtaba Safari, Zachary Eidex, Xiaofeng Yang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.11628
Pdf URL: https://arxiv.org/pdf/2508.11628
Copy Paste: [[2508.11628]] Is ChatGPT-5 Ready for Mammogram VQA?(https://arxiv.org/abs/2508.11628)
Keywords: large language model
Abstract: Mammogram visual question answering (VQA) integrates image interpretation with clinical reasoning and has potential to support breast cancer screening. We systematically evaluated the GPT-5 family and GPT-4o model on four public mammography datasets (EMBED, InBreast, CMMD, CBIS-DDSM) for BI-RADS assessment, abnormality detection, and malignancy classification tasks. GPT-5 consistently was the best performing model but lagged behind both human experts and domain-specific fine-tuned models. On EMBED, GPT-5 achieved the highest scores among GPT variants in density (56.8%), distortion (52.5%), mass (64.5%), calcification (63.5%), and malignancy (52.8%) classification. On InBreast, it attained 36.9% BI-RADS accuracy, 45.9% abnormality detection, and 35.0% malignancy classification. On CMMD, GPT-5 reached 32.3% abnormality detection and 55.0% malignancy accuracy. On CBIS-DDSM, it achieved 69.3% BI-RADS accuracy, 66.0% abnormality detection, and 58.2% malignancy accuracy. Compared with human expert estimations, GPT-5 exhibited lower sensitivity (63.5%) and specificity (52.3%). While GPT-5 exhibits promising capabilities for screening tasks, its performance remains insufficient for high-stakes clinical imaging applications without targeted domain adaptation and optimization. However, the tremendous improvements in performance from GPT-4o to GPT-5 show a promising trend in the potential for general large language models (LLMs) to assist with mammography VQA tasks.